The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

JOHN DONG: All right, so I'm sure everybody's curious about Project 2.2 Beta, so here are the preliminary performance results. There's still a bit more work to do on finalizing these numbers, but here's how they look. But before I show you that, I'd like to yell at you all for a little bit. This is a timeline of the submission deadline and when people submitted things. From this zone to this zone is an hour, and I see something like 50% of the commits made during that period. A little bit of background: the submission checking system automatically clones a repository up to the deadline and not a second after. So if you look from this point to this point, some people took quite a jump back. So just as a warning, please try to submit things on time. 11:59 means 11:59. And to drive that point further, this is an example of a commit that I saw in somebody's repository, whose name I blanked out. Obviously, it's seven seconds past the deadline, so the automatic repository cloner didn't grab it. And the previous commit before that was about 10 days earlier, when Reid pushed out the pentominoes grades. So don't do that either.

Here's a little breakdown of how often people commit. About a quarter of the class made only one commit to their repository. Half of you did 3 to 10 commits, which seems about right. And in the 21-plus bucket, there was somebody who did 100-some commits, which was pretty impressive. Yeah, very smart dude. Committing often is a good idea, so that you don't run into a situation like the one before. And next time, there is definitely going to be next to zero tolerance for people who don't commit things on time for the deadlines. Now, the numbers that people want.
So for rotate, just as an interesting data point, this is the 512 by 512 case. It seems like not everybody remembered to carry their optimizations for the 512 case over to the final submission; otherwise you would expect the numbers to be a bit more similar and not off by a factor of eight. And for rotate overall, that's the distribution. The speedup factor is normalized to some constant that gives everybody a reasonable number. And there were a lot of groups that had code that didn't build. Yes?

AUDIENCE: Does the speedup only include the rotate dot 64?

JOHN DONG: Yes. Performance was only tested on the rotate dot 64, which is what we said in the handout as well. So there were a lot of groups whose code didn't build, which really surprised me, and I think it's probably because half of you pushed things after the deadline, and I presume those pushes contained important commits toward making your code actually work. But in this case, there were no header files involved and there was no cross-testing. All we did was run your makefile on your code, and we replaced your testbed dot c and your ktiming, so I'm not quite sure why people had code that didn't build.

For sort, this was the maximum size input that we allowed you to run. One group did really well. So all of these are correct. The correctness test is built in, and I replaced it with a clean copy that contains a couple of additional checks, by the way. Then I tried the other extreme, a relatively small array.

AUDIENCE: Is the top the same person?

JOHN DONG: I'm not sure whether or not the top is the same person. But distribution-wise, it seems like people didn't quite remember to optimize for the smallest case. And these are the current overall speedup factors for sort, averaging; we did about 10 to 15 cases each for rotate and sort. And that's the overall speedup distribution.
SAMAN AMARASINGHE: OK, so now you're done with individual projects. You did the last project individually, and now we are moving into, again, a group project. So the first thing we have is an automated system set up for you to say who your group members are. We will send you information, and with that, what you have to do is run a script saying who your group members are. Both group members have to do it, and then we will basically set up that account for you.

That said, a lot of you didn't know, in the first project, how to work with a group and what the right mode of operation is. If we gave you 100,000 lines of code to write, it would make sense to say, OK, I'm going to divide the problem in half, one person does one half, the other person does the other half. But the reason for doing the group is to try to get you to do pair programming, because talking to a lot of you and getting a lot of feedback, it looks like most of you spent a huge amount of time debugging. And since you're only writing a small amount of code, it makes a lot more sense to sit with your partner in front of the screen, one person typing, the other person looking over, and then you have a much faster way of getting through the debugging process. So for the next one, don't try to divide the problem in half. Just try to find some time and sit with each other.

Then the other really disturbing thing is that there have been a couple of groups that were completely dysfunctional. We get emails saying, OK, my group member didn't talk to me, or they didn't do any work, or they were very condescending. And that's really sad, because from my experience with MIT students, when you go to a company, you will probably be the best programmers there. There's no question about it; I have seen that, to the point that some people might even resent having this best programmer around. But what I have also seen is that a lot of you cannot work in a group, and if you haven't developed that skill, you will not be the most impactful person.
I have seen that again and again in my experience doing a start-up. Our MIT students' way of making an impact is to pull all-nighters and do the entire project by themselves. That's doable when you're making a small change to a large project, but if you want to make a big change, you can't do that. You have to work with the group, figure out how to have an impact, how to communicate. This is more important learning than, say, figuring out how you can optimize something. So being an individual contributor who can do amazing things is important, but not being able to work with a group is going to make the impact you can have much smaller. So please, please learn how to work with your group members. Some of them might not be as good as you are, and that will probably be true in real life, too, but that doesn't mean you can be condescending toward them and make them feel inferior. That doesn't cut it. You have to learn how to work with these people.

So part of your learning is working with others, and that's a large part of your learning. Don't consider that to be some external thing, even though you might think you can do a better job on your own. Just work with the other person, especially in pair programming, where both of you are sitting together, because it's much easier that way. There are four eyes on your project, so your partner might see something you don't see. See whether you can work together like that. And I don't want to hear any more stories saying, look, my partner was too dumb, or my partner didn't show up, so I couldn't deal with it. Those are not great excuses. Some of them we will pay attention to, because it might be one person's unilateral actions that lead to that. Still, please try to figure out how you can work with your partners. I hope you have a good partner experience. Use pair programming, use a lot of good debugging techniques, and the next project will be fine.

CHARLES LEISERSON: Great.
We're going to talk more about caches. Whoo-whoo! OK. So for those who weren't here last time, we talked about the ideal cache model. As you recall, it has a two-level hierarchy, a cache size of M bytes, and a cache-line length of B bytes. It's fully associative and uses an optimal, omniscient replacement strategy. However, we also learned that LRU is a good substitute, and that any of the asymptotic results you can get with the optimal strategy, you can also get with LRU. The two performance measures we talked about were the work, which is what the processor ends up doing, and the cache misses, which are the transfers between cache and main memory. You only have to count one direction, because what goes in basically goes out, so more or less it's the same number.

OK, so I'd like to start today by talking about some very basic algorithms that you have seen in your algorithms and data structures class, but which may look new when we start taking caches into account. So the first one here is the problem of merging two sorted arrays. As you recall, you can basically do this in linear time. The way the algorithm works is that it looks at the first element of each of the two arrays to be merged, and whichever is smaller, it puts in the output. Then it advances the pointer past that element, and again, whichever is smaller, it puts in the output. At every step it's doing just a constant amount of work, and there are n items, so by the time this process is done, we've spent time proportional to the number of items in the output list. So this should be fairly familiar: the time to merge n elements is order n.
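Here's a minimal C sketch of that linear-time merge; the function name and the int element type are my own choices, not from the lecture:

```c
#include <stddef.h>

/* Merge sorted arrays a[0..na) and b[0..nb) into out[0..na+nb).
 * Each step does O(1) work and advances one input index, so the
 * whole merge takes Theta(n) time for n = na + nb elements. */
void merge(const int *a, size_t na, const int *b, size_t nb, int *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb)
        out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na) out[k++] = a[i++];  /* drain whichever input remains */
    while (j < nb) out[k++] = b[j++];
}
```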
Now, the reason merging is useful is that you can use it in a sorting algorithm, a merge sorting algorithm. The way merge sort works is that it essentially does divide and conquer on the array. It divides the array into two pieces, divides each of those into two pieces, and those into two, until it gets down to something of unit size. And then what it does is merge the pairs of arrays. So for example here, the 19 and 3 got merged together to become 3 and 19. The 12 and 46 were already in order, but you still had to do work to get them there, and so forth. So it puts everything in order in pairs, and then for each of those, it puts them together into fours, and for each of those, it puts them together into the final list. Now of course, the way it does this is not in the order I showed you. It actually goes down and does a walk of this tree. But conceptually, you can see that it essentially comes down to merging pairs, merging quadruples, merging octuples, and so forth, all the way until the program is done.
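And here's merge sort as a short C sketch built on the merge routine above; the caller-supplied scratch buffer is my own convention, not something specified in the lecture:

```c
/* Sort a[0..n), using tmp[0..n) as scratch space.
 * Work recurrence: W(n) = 2 W(n/2) + Theta(n) = Theta(n log n). */
void merge_sort(int *a, size_t n, int *tmp)
{
    if (n <= 1) return;                  /* one element: already sorted */
    size_t half = n / 2;
    merge_sort(a, half, tmp);            /* sort the left half */
    merge_sort(a + half, n - half, tmp); /* sort the right half */
    merge(a, half, a + half, n - half, tmp);
    for (size_t i = 0; i < n; i++)       /* copy the merged result back */
        a[i] = tmp[i];
}
```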
So to calculate the work of merge sort, this is something you've seen before, because it's exactly what you do in your algorithms class. You get a recurrence that says that the work, in this case, is: if you have only one element, a constant amount of work, and otherwise, I solve two problems of half the size plus order n work, which is the time to merge the two halves. So classic divide and conquer. And I'm sure you're familiar with what the solution to this recurrence is. What's the solution? n log n. I want to step through it anyway, just to get everybody warmed up on the way I want to solve recurrences, so that when we do the caching analysis, we have a common framework for understanding how that analysis will work.

So we're going to solve this recurrence, and if the base case is constant, we usually omit it; it's assumed. We start out with W of n, and we replace it by the right-hand side, where we put the constant term on the top and then the two children. Here I've gotten rid of the theta, because conceptually, when I'm done, I can put a big theta around the whole tree, and it just makes the math a little easier and a little clearer.

So then I take each of those and I split those, and this time I've got n over 4. Correct? I checked for that one this time. It's funny, because it was actually still wrong just a few minutes before class as I was going through. And we keep doing that until we get down to something of size one, until the recurrence bottoms out. So when you look at a recursion tree of this nature, the first thing you typically want to do is look at the height of the tree. In this case, we're taking a problem of size n, and we're halving it at every step. And so the number of times we have to halve the argument (which also turns out to be equal to the per-level work here, but that's just coincidence) is log n. So the height is log base 2 of n. Now what we typically do is add things up across the rows, across the levels. On the top level, we have n. On the next level, we have n. The next level, hey, n. To add up the bottom, just to make sure, we have to count how many leaves there are, and the number of leaves, since this is a binary tree, is just 2 to the height. So it's 2 to the log n, which is n. Then I add across all the leaves, and I get the order 1 at each leaf times n leaves, which is order n. And so now I'm in a position to add up the total work, which is basically log n levels of n, for a total of order n log n. So hopefully this is all review. Hopefully this is all review. If you haven't seen this before, it's really neat, isn't it? But you've missed something along the way.

So now with caching. The first thing to observe is that the merge subroutine incurs order n over B cache misses. As you're going through, these arrays are laid out contiguously in memory. You're just going through the data once, order n data, all accessed contiguously. And so every time you bring in data, you get full spatial locality. There are n elements, so it costs n over B.
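Collecting what's on the board (my rendering, with B the cache-line size as before):

```latex
W(n) = 2\,W(n/2) + \Theta(n) = \Theta(n \lg n)
\quad \text{($\lg n$ levels of $\Theta(n)$ each)},
\qquad
Q_{\mathrm{merge}}(n) = \Theta(n/B).
```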
So is that plain? Hopefully that part's plain: each access you bring in gets you the same factor of B. So now merge sort. And this is, once again, where the hard part is coming up with a recurrence, and then the other hard part is solving it. So there are two hard parts to recurrences. OK, so the merge sort algorithm solves two problems of size n over 2, and then does a merge. The second line here is pretty straightforward: I take the cache misses of the two subproblems, and I add a merge. I may have a few other accesses in there, but they're going to be dominated by the merge, so it's still going to be theta n over B.

Now the hard part, generally, of dealing with cache analysis is the base case, because the base case is more complicated than when you just do running time, where you get to run down to a base case of constant size. Here, you don't get to run down to a base case of constant size. So what it says here is that we're going to run down until I have a sorting problem that fits in cache: n is going to be less than some constant times M, for some sufficiently small constant c less than 1. When it finally fits in cache, how many cache misses does it take me to sort it? Well, I only need the cold misses to bring that array into cache, and that's just proportional to n over B, because for all the rest of the levels of merging, you're inside the cache. Does that make sense? So that's where we get this recurrence. It's always a tricky thing to figure out how to write the recurrence for a given problem. Then, as I say, the other tricky thing is how do you solve it?
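In symbols, the cache-miss recurrence for merge sort (my rendering of the board):

```latex
Q(n) =
\begin{cases}
\Theta(n/B) & \text{if } n < cM \text{ for a sufficiently small constant } c \le 1,\\
2\,Q(n/2) + \Theta(n/B) & \text{otherwise.}
\end{cases}
```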
But we're going to solve it essentially the same way as we did before. I'm not going to go through all the steps here, except to elaborate a bit. So we have n over B at the top, and then we divide it into two problems (whoops, there's a c there that doesn't belong; it should just be n over 2B on both), and then n over 4B, and so forth. And we keep going down until we get to our base case.

Now in our base case, what I claim is that when I hit this base case, it's going to be the case that n is, in fact, a constant factor times M, so that n over B is almost the same as M over B. And the reason is that just before I hit the base case, I was at size twice n, and that was bigger than my constant times M. So if twice n is bigger than my constant times M, but n is smaller than it, then n and M are essentially the same size to within a constant factor, within a factor of two, in fact. And so therefore, here I can say that it's order M over B.

And now the question is, how many levels did I have to go down, cutting things in half, before I got to something of size M over B? Well, the way I usually think about this is, you can do it by taking the difference, as I did before. The height of the whole tree is going to be log base 2 of n, and the height of the bottom part is basically log of the size of n when the base case occurs. Well, n at that point is something like cM. So the number of levels is basically log n minus log cM, which is log of n over cM. How about some questions? Yeah, question.

AUDIENCE: What's the reason why you just substituted on the left, [INAUDIBLE] over B, but on the right [INAUDIBLE]?

CHARLES LEISERSON: Here?

AUDIENCE: No.

CHARLES LEISERSON: Or you mean here?

AUDIENCE: [INAUDIBLE].

CHARLES LEISERSON: On the right side.

AUDIENCE: On the right-most leaf, it's n over B. Is that all of the leaves added up, because [INAUDIBLE]?

CHARLES LEISERSON: No, no, no, this is going to be all of the leaves added up here. This is the stack I have on the right-hand side, so we'll get there. So the point is, the number of leaves is 2 to this height, so that's just n over cM, and each leaf costs M over B.
Well, n over cM times M over B: the M's cancel, and I get essentially n over B at the leaves, with whatever that constant is. And so now I have n over B across every level, and when I add those up, I have log of n over cM levels, which is the same as log of n over M to within a constant. So I have n over B times log of n over M.
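Adding the levels and the leaves (my rendering):

```latex
Q(n)
= \underbrace{\Theta\!\left(\frac{n}{B}\right)\cdot \Theta\!\left(\lg\frac{n}{cM}\right)}_{\text{internal levels}}
+ \underbrace{\frac{n}{cM}\cdot\Theta\!\left(\frac{M}{B}\right)}_{\text{leaves}}
= \Theta\!\left(\frac{n}{B}\,\lg\frac{n}{M}\right).
```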
Yeah, question?

AUDIENCE: The initial assumption is that c is some sufficiently small number, so 1 over c would be a rather large factor.

CHARLES LEISERSON: It could potentially be a large factor, but it's a constant. In other words, it can't vary with n. In fact, for something like merge sort, the constant is typically only a few, because the question is, how many other things do you need to fit in cache at the same time? Here, you have to fit both the input and the output into cache in order not to take misses, so it's basically going to be a factor of 2 for merge sort. For the matrix multiplication, it was like a factor of three. So generally a fairly small number. Question?

AUDIENCE: I guess that makes sense, but [INAUDIBLE]. So for the size of the leaves, can you assume that n is more than cM, so you can substitute--

CHARLES LEISERSON: Yeah, because basically, when it hits this condition--

AUDIENCE: Right, I understand that. But then why isn't there more or less just one M over B at the bottom, because there are n over cM leaves, and n is the same as cM. At the bottom level, there should be--

CHARLES LEISERSON: Oh, did I do something wrong here? The number of leaves is--

AUDIENCE: [INAUDIBLE].

CHARLES LEISERSON: Right, right, sorry. This is the n at the top. You always have to be careful here. The n in the leaf count is not the little n at a leaf; it's the n that we had at the top. That's the nature of recurrences: the n keeps recurring, and you have to keep track of which one is which, so it can be confusing. We're analyzing everything in terms of the n that started out at the top. Some people write these things in terms of k and then analyze for n, and for some people that can be helpful, to disambiguate the two. I always find it wastes a variable, and you know, those variables are hard to come by. There's only a finite number of them.

OK, so are we good on this? So here, we ended up with n over B times log of n over M cache misses. So how does that compare? Let's do a little thinking about this. Here's the recurrence, and I solved it out to this, so let's look at what it means. If I have a really big n, much bigger than the size of my cache, then I'm going to have a factor of B log n fewer misses than work. So suppose n is as big as M squared, say. Then n over M would still be huge, and this log of n over M would still be order log n, so I would basically have n over B log n, for a factor of B log n fewer misses than work. If they're about the same size (did I get this right?), if n is approximately M, maybe just a little bit bigger, then the log here disappears completely, and so I basically just have n over B misses.

AUDIENCE: [INAUDIBLE].

CHARLES LEISERSON: Yeah, but if n is like--

AUDIENCE: [INAUDIBLE].

CHARLES LEISERSON: Yeah. In fact, for this, you have to be careful as you get to the base cases. Technically, for some of this, I should be saying 1 plus log of n over M, and in some of the things I do later, I will put in the ones.
But if you're looking at it asymptotically and n gets big, you don't have to worry about those cases. The 1 plus just handles whether you're looking at n getting large or whether you want a formula that holds for all n, even when n is small. Question?

AUDIENCE: [INAUDIBLE]?

CHARLES LEISERSON: The work was n log n, yes. The work was n log n. So here we basically have n over B log n misses, so I'm saving a factor of B in the case where they're about the same. Did I get this right? I'm just looking at this, and now I'm trying to reverse engineer what my argument is. So we're looking at n log n versus n over B times log of n over M.

AUDIENCE: [INAUDIBLE PHRASE]. So that you get a factor of B fewer misses, because you would be getting n over B times log of n; that's the only way you're getting a factor of B fewer misses. So I don't understand how you're saying that for n more or less equal to M. You would want something more like, for n--

CHARLES LEISERSON: Well, if n and M are about the same size, the number of cache misses is just n over B. The number of cache misses is n over B, and the work is n log n, so I've saved a factor of B times log n, OK? What did I say?

AUDIENCE: [INAUDIBLE].

CHARLES LEISERSON: B log M? No, I was saying that's for the case when n is much bigger than M. So let's take a look at that case; let me just do it on the board here. Let's suppose that n is like M squared, just as an example, a big number. So I'm going to look at, essentially, n over B times log of n over M. So what is n over M? It's about M, which is about the square root of n, right? So this basically ends up being approximately n over B times log of the square root of n, which is the same as log n to within a constant factor; I'm going to leave out the constant factors here. Then I want to compare that with n log n. So I get a factor of B fewer misses. So for the first one, yes, OK, I get a factor of B fewer misses; you're right.
Then over here I get a factor of B log n fewer misses. So I think I've got these switched. The case I was just doing is for n much bigger than M. So let's do the other case; I think I've got the two things switched, and I'll fix it in the notes. If n and M are approximately the same, then the log is a constant, right? So this ends up being approximately n over B. And now when I take a look at the difference between the two, I get B log n. So I had the two cases mixed. Yeah?

AUDIENCE: As n approaches M, the log approaches zero, but you were saying it technically should be--

CHARLES LEISERSON: 1 plus the log, yes.

AUDIENCE: So technically, that approaches one as the log approaches zero.

CHARLES LEISERSON: Yeah.

AUDIENCE: These things are really hard for me, because they seem really arbitrary. And then you're like, oh yeah, you can just put a 1 on top of there. I always miss those, because I usually try to do the math as rigorously as I can, and those ones generally do not appear, and you're like, oh, sure, whatever. So how am I supposed to know that the log is actually not going to be zero, so that I don't conclude, yeah, you're not going to take any cache misses?

CHARLES LEISERSON: Because generally, what we're doing is looking at how things scale, so we're generally looking at n being big, in which case it doesn't matter. These things only matter if n is small. For example, notice here that if n gets less than M, we're in real trouble, right? Because now the log is negative. Wait, what does that mean? Well, the answer is that the analysis was assuming that n was sufficiently large compared with M.

AUDIENCE: Why can't you just say, oh, when n over M is less than one, you can assume n is 2M? In that case, you get log of two, which is still something or other.

CHARLES LEISERSON: Yeah, exactly. So what happens in these things is that if you get right on the cusp of fitting in cache, then exactly what the answer of the analysis is, is dicey. But if you assume that it doesn't fit in, what's going to happen? Or that it does fit in, what's going to happen? And then the behavior right on the edge is somewhere in between. Good. So I switched these; I said this the other way around. That's funny. I went through this, and then in my notes I had them switched, and I said, oh my gosh, I did this wrong. And I've just gone through it, and it turns out I was right in my notes.
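For the record, here are the two regimes the right way around, as he settles on above (my summary; work is W(n) = Θ(n lg n) in both cases):

```latex
\begin{aligned}
n \approx M &:\; Q(n) = \Theta(n/B), && \text{a factor } \Theta(B \lg n) \text{ fewer misses than work};\\
n = M^2 &:\; Q(n) = \Theta\!\left(\tfrac{n}{B}\lg\tfrac{n}{M}\right) = \Theta\!\left(\tfrac{n}{B}\lg n\right), && \text{a factor } \Theta(B) \text{ fewer}.
\end{aligned}
```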
Now, one of the things, if you look at what's going on, let's just go back to this picture here. What's going on is that each one of the passes we're doing to do a merge takes n over B misses to do a binary merge. We're going through all the data just to merge two things, traversing all the data. So you can imagine: what would happen if I did, say, a four-way merge? With a four-way merge, I could actually merge four things with only a little bit more than n over B misses. In fact, that's what we're going to analyze in general. So the idea is that we can improve our cache efficiency by doing multi-way merging.

So the idea here is, let's merge R subarrays, where R is, let's say, less than n, with a tournament. So here are R subarrays, and each of them, let's say, is of size n over R. And what we're going to do is merge them with a tournament, where we say, who's the winner of these two, who's the winner of these two, et cetera. And then whoever wins at the top here, we take them and put them in the output, and then we repeat the tournament.
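As a concrete sketch of the tournament, here's one possible C implementation, entirely my own (a winner tree in heap layout; it assumes C99 VLAs and uses INT_MAX as an end-of-run sentinel, so element values must be less than INT_MAX):

```c
#include <limits.h>
#include <stddef.h>

/* One sorted input run. */
typedef struct { const int *data; size_t len, pos; } run_t;

/* Head of run r, or "infinity" once the run is exhausted. */
static int head(const run_t *runs, size_t r)
{
    return runs[r].pos < runs[r].len ? runs[r].data[runs[r].pos] : INT_MAX;
}

/* R-way merge of runs[0..R) into out[0..n), n = total element count.
 * tree[1] always holds the index of the run with the smallest head.
 * Setup plays Theta(R) matches; each extracted element replays only
 * its root-to-leaf path, i.e. Theta(log R) work per element. */
void rway_merge(run_t *runs, size_t R, int *out, size_t n)
{
    size_t tree[2 * R];                     /* C99 VLA; leaves at R..2R-1 */
    for (size_t r = 0; r < R; r++)
        tree[R + r] = r;                    /* leaf holds its run index */
    for (size_t i = R - 1; i >= 1; i--) {   /* play the initial tournament */
        size_t l = tree[2 * i], r = tree[2 * i + 1];
        tree[i] = head(runs, l) <= head(runs, r) ? l : r;
    }
    for (size_t k = 0; k < n; k++) {
        size_t w = tree[1];                 /* overall winner */
        out[k] = runs[w].data[runs[w].pos++];
        /* Replay only the matches on the winner's path to the root. */
        for (size_t i = (R + w) / 2; i >= 1; i /= 2) {
            size_t l = tree[2 * i], r = tree[2 * i + 1];
            tree[i] = head(runs, l) <= head(runs, r) ? l : r;
        }
    }
}
```

The replay loop touches only the log R nodes on the winner's path, which is exactly the property the lecture uses to charge log R work per output element.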
Now let's just look at what happens. It takes order R work to produce the first output. We've got R things here, and to play off this tournament, there are R nodes here; each has to do a constant amount of comparing before I end up with a single value to put in the output. So it costs me R to get this thing warmed up. But once I find the winner, and I remove the winner from whatever chain he might have come along, how quickly can I repopulate the tournament with the next guy? The next guy only has to play the tournament on the path that the winner was on; for all the other matches, we already know who won. So the second guy only costs me log R to produce, and the next guy is log R, and so on. Once we get going, an R-way merge costs only log R work per element. So the total work in merging is R, to get started, plus n log R. Well, R is less than n, so that's just n log R total to do the merging. That's the work.

Now, let's take a look at what happens if I do merge sort with R-way merges. If I have only one element, then it costs order one time, because there's nothing to do; just put it in the output. Otherwise, I've got R problems of size n over R to solve, and my merge takes n log R time.

So if I look at the recursion tree, I have n log R here at the top, then I branch R ways, and then I have n over R log R for each subproblem at the next level, n over R squared log R at the level after that, et cetera.
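In symbols, the work recurrence for R-way merge sort and its solution (my rendering):

```latex
W(n) =
\begin{cases}
\Theta(1) & \text{if } n = 1,\\
R\,W(n/R) + \Theta(n \lg R) & \text{otherwise,}
\end{cases}
\qquad
W(n) = \Theta\!\big(n \lg R \cdot \log_R n\big) = \Theta(n \lg n).
```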
AUDIENCE: You said that the cost of processing is R.

CHARLES LEISERSON: It's n log R.

AUDIENCE: But is--

CHARLES LEISERSON: Up front, there's an order R cost, but the order R cost is dominated by the n log R, so we don't have to count it separately. We just have to worry about this term.

So as I go through here, I basically end up having a tree which is only log base R of n tall, because I'm dividing things into R pieces each time, rather than into two pieces. So I only go log base R of n steps until I get to the base case. But I'm doing an R-way merge, so the number of leaves is still n. Now when I add across here, I get n times log R, and across the next level, I get n times log R again, because I've got R copies of n over R log R. Then I've got R squared copies of n over R squared log R, and so forth. So at every level, I have n log R, and the total is n log R times the number of levels, which is log base R of n, plus the order n work at the bottom, which we can ignore because it's dominated. And what do you notice here? What's log base R of n? That's just log n over log R. So the log R's cancel, and I get n log n plus n, which is just n log n. So after all that work, we still do the same amount of work: whether I do binary merging or R-way merging, the work is the same. But there's a big difference when it comes to caching.

So it's the same work as binary merge sort. Let's take a look at the caching. Let's assume that my tournament fits in cache: I want R to be less than some constant times M over B. So when I consider the R-way merging of contiguous arrays of total size n, the entire tournament plus one block from each array can fit in cache. So the tournament is never going to be responsible for generating cache misses, because I'm going to leave the tournament in cache. If I'm the optimal algorithm, I'm going to say, let's just leave the tournament in cache and bring in all the other things as we do the operation. Question?

AUDIENCE: [INAUDIBLE]

CHARLES LEISERSON: Those circles that I had, the tree.

AUDIENCE: Is that a cumulative list of the elements that you've merged in already?

CHARLES LEISERSON: I'm sorry, is the--

AUDIENCE: Is it a cumulative list of the arrays that you've merged already?

CHARLES LEISERSON: No, no, no, you haven't merged them. Let's just go back and make sure we understand the algorithm. The algorithm says that we compare the heads of each pair and produce a single value here, because these are already sorted to do the merge. These are already sorted.
So I just have the minimum of these two here, and the minimum of these two here, and the minimum of all four of them here, and so on, repeated. When we get to the top, we have the minimum of all of these guys; that's the minimum overall, and we put him in the output array. And now we walk back down the path that he came from. Let's walk down the path, say we get to this guy; let's advance the pointer in here and bring out another element. And now we play off the tournament here, play off the guy here, and he advances, and he advances, whatever. And now some other path may hold the minimum. But it only took me log R work, because I'm only keeping copies of the elements, if you will, or the results of the comparisons, along this path in the tree.

And that tree, we're saying, fits in the cache, plus one block from each array, the current block. Whatever cache block we're at in each of these arrays, no matter how far down we've gone, one block from each of them fits in cache. So the entire tournament plus one block from each array can fit in cache, and therefore the number of cache misses I take when I do the merge is essentially just the faults on that one cache block whenever I move past it in each array, plus the same for the output. So the total number of cache misses is going to be n over B, because I'm just striding straight through memory, and the tournament I don't have to worry about, because it's sitting in cache. And there's enough room in cache that for all the other stuff, I can keep one block from each array resident and still expect to find it there. In fact, you need the tall-cache assumption to ensure that they all fit.

So then for the R-way merge sort: if the problem is sufficiently small, once again we have the case that it fits in cache, so I only have the cold misses to get there, n over B, if n is less than cM.
769 00:42:29,120 --> 00:42:32,730 And otherwise, it's R copies of the number of cache misses 770 00:42:32,730 --> 00:42:42,220 for n over R, plus n over B. Because this is what it took 771 00:42:42,220 --> 00:42:44,050 us here to do the merge. 772 00:42:44,050 --> 00:42:48,400 We get only n over B faults when we merge, as long as the 773 00:42:48,400 --> 00:42:50,390 tournament fits in cache. 774 00:42:50,390 --> 00:42:52,410 If the tournament doesn't fit in cache, it's a more 775 00:42:52,410 --> 00:42:54,626 complicated analysis. 776 00:42:54,626 --> 00:42:57,972 AUDIENCE: --n over B, that's cold misses. 777 00:42:57,972 --> 00:43:00,840 You're getting the stuff-- 778 00:43:00,840 --> 00:43:03,070 CHARLES LEISERSON: Yeah, basically, it's the cold 779 00:43:03,070 --> 00:43:05,910 misses on the data, yes, basically. 780 00:43:11,200 --> 00:43:12,030 Good. 781 00:43:12,030 --> 00:43:16,230 So now, let's do the recursion tree for this. 782 00:43:16,230 --> 00:43:19,090 So we basically have n over B that we're going to pay at 783 00:43:19,090 --> 00:43:24,780 every level, dividing by R, et cetera, down to the point 784 00:43:24,780 --> 00:43:26,215 where things fit in cache. 785 00:43:30,290 --> 00:43:32,960 And by the time it fits in cache, it's going to be m over 786 00:43:32,960 --> 00:43:35,890 B, because n will be approximately m, just as we 787 00:43:35,890 --> 00:43:38,630 had before when we were doing the binary case. 788 00:43:38,630 --> 00:43:41,800 As soon as the subarray completely fits in cache, I 789 00:43:41,800 --> 00:43:43,750 don't take any more misses when I'm doing the sorting. 790 00:43:43,750 --> 00:43:46,390 So this is now analyzing not the merging, this is analyzing 791 00:43:46,390 --> 00:43:49,360 the sorting now. 792 00:43:49,360 --> 00:43:50,920 This is the sorting, not the merging. 793 00:43:54,330 --> 00:43:58,930 So we get down to m over B, and I've gone now log base R, 794 00:43:58,930 --> 00:44:01,180 not log base 2 as we did before, but log 795 00:44:01,180 --> 00:44:05,010 base R of n over cm. 796 00:44:05,010 --> 00:44:08,360 The number of leaves is n over cm, and so when I multiply 797 00:44:08,360 --> 00:44:11,330 this out, I get the same n over B here, and I've got n 798 00:44:11,330 --> 00:44:12,730 over B at every level here. 799 00:44:15,420 --> 00:44:17,060 So where's the win? 800 00:44:17,060 --> 00:44:21,440 The win is that I have only log base R of n over cm levels, 801 00:44:21,440 --> 00:44:26,040 rather than log base 2 of n over cm levels in the tree, because the 802 00:44:26,040 --> 00:44:29,640 amount that every level cost me was the same, 803 00:44:29,640 --> 00:44:30,890 asymptotically. 804 00:44:33,710 --> 00:44:37,520 So when I add it up, I get n over B log base R of n over m, 805 00:44:37,520 --> 00:44:40,600 instead of n over B log base 2 of n over m. 806 00:44:40,600 --> 00:44:42,530 So how do we tune R? 807 00:44:45,100 --> 00:44:49,460 Well if we just look at this formula here, if I want to 808 00:44:49,460 --> 00:44:51,580 tune R, what should I do to R to make this 809 00:44:51,580 --> 00:44:52,830 be as small as possible? 810 00:44:55,100 --> 00:44:57,300 Make it as big as possible. 811 00:44:57,300 --> 00:45:01,360 But I had to assume that R was less than some constant times 812 00:45:01,360 --> 00:45:04,890 m over B so that it fits in cache. 813 00:45:04,890 --> 00:45:12,850 So that's, in fact, what I do, is I say R is m over B.
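(Collecting the arithmetic of that analysis in one place -- a summary in standard notation, with M the cache size and B the line size; nothing here beyond what was just said:)

$$\text{work: }\ \underbrace{n\log R}_{\text{per level}}\cdot\underbrace{\log_R n}_{\text{levels}} \;=\; n\log R\cdot\frac{\log n}{\log R} \;=\; n\log n,$$

$$Q(n) \;=\; \begin{cases} \Theta(n/B), & n < cM,\\ R\,Q(n/R) + \Theta(n/B), & \text{otherwise,} \end{cases} \qquad\Rightarrow\qquad Q(n) \;=\; \Theta\!\left(\frac{n}{B}\,\log_R\frac{n}{M}\right),$$

and choosing $R = \Theta(M/B)$ turns that last log into $\log_{M/B}$.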
So it 814 00:45:12,850 --> 00:45:17,890 fits in cache, we have at least one block for each thing 815 00:45:17,890 --> 00:45:19,020 that we're merging. 816 00:45:19,020 --> 00:45:22,150 And then when we do the analysis now, I take log base 817 00:45:22,150 --> 00:45:27,210 m over B here, and that's compared to the binary one, 818 00:45:27,210 --> 00:45:31,910 which was log base 2. That's a factor of 819 00:45:31,910 --> 00:45:34,110 log of m over B savings in cache misses, 820 00:45:34,110 --> 00:45:38,050 because log base m over B of something is just its log over the log of m over B. 821 00:45:38,050 --> 00:45:39,700 Now, is that a significant number? 822 00:45:39,700 --> 00:45:41,110 Let's take a look. 823 00:45:41,110 --> 00:45:57,830 So if your L1 cache is 32 kilobytes, and we have cache 824 00:45:57,830 --> 00:46:01,410 lines of 64 bytes, then m over B is 2 to the 9th, so that is basically the difference in 825 00:46:01,410 --> 00:46:04,420 the exponents, a 9x savings. 826 00:46:04,420 --> 00:46:07,230 For the L2 cache, we get about a 12x savings. 827 00:46:07,230 --> 00:46:09,430 For L3, we get about a 17x savings. 828 00:46:09,430 --> 00:46:11,910 Now of course, there are some other constants going on in 829 00:46:11,910 --> 00:46:14,360 here, so you can't be absolutely sure that it's 830 00:46:14,360 --> 00:46:17,320 exactly these numbers, but it's going to be proportional 831 00:46:17,320 --> 00:46:20,200 to these numbers. 832 00:46:20,200 --> 00:46:24,750 So that's pretty good savings to do multi-way merging. 833 00:46:24,750 --> 00:46:28,030 So generally when you merge, don't merge pairs. 834 00:46:28,030 --> 00:46:30,270 Not a very good way of doing it if you want to take good 835 00:46:30,270 --> 00:46:33,210 advantage of cache. 836 00:46:33,210 --> 00:46:36,680 May give you some ideas for how to improve some sorts that 837 00:46:36,680 --> 00:46:37,930 you might have looked at. 838 00:46:40,890 --> 00:46:43,830 Now it turns out that there's a cache oblivious sorting 839 00:46:43,830 --> 00:46:47,360 algorithm, where you don't actually have to know the cache size-- 840 00:46:47,360 --> 00:46:50,360 that was a cache aware algorithm that knew the size 841 00:46:50,360 --> 00:46:52,635 of the cache, and we tuned R to get there. 842 00:46:52,635 --> 00:46:57,630 There is an algorithm called funnelsort, which is based on 843 00:46:57,630 --> 00:47:00,510 recursively sorting n to the 1/3 groups of 844 00:47:00,510 --> 00:47:03,200 n to the 2/3 items. 845 00:47:03,200 --> 00:47:06,930 And then you merge the sorted groups with a merging process 846 00:47:06,930 --> 00:47:08,780 called an n to the 1/3 funnel. 847 00:47:11,540 --> 00:47:15,230 So this is more for fun, although the sorting 848 00:47:15,230 --> 00:47:20,710 algorithm, in my experience, from what others have told me 849 00:47:20,710 --> 00:47:24,110 about implementing it and so forth, is probably about 30% 850 00:47:24,110 --> 00:47:27,750 slower than the best hand-tuned algorithm. 851 00:47:27,750 --> 00:47:30,020 Whereas with matrix multiplication, the cache 852 00:47:30,020 --> 00:47:33,450 oblivious algorithms are as good as any cache aware 853 00:47:33,450 --> 00:47:37,350 algorithm as a practical matter, here, they're off by 854 00:47:37,350 --> 00:47:39,660 about 20% or 30%. 855 00:47:39,660 --> 00:47:43,580 So an interesting research topic is to build one of these things 856 00:47:43,580 --> 00:47:45,980 and make it really efficient so that it can compete with 857 00:47:45,980 --> 00:47:48,030 real sorts.
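(Backing up a moment: to make the tournament merge from a few minutes ago concrete, here is a minimal C sketch -- not the course's code. It assumes R is a power of two at least 2, int keys, and INT_MAX as an end-of-run sentinel:)

#include <limits.h>
#include <stdlib.h>

typedef struct {
    const int *a;   /* one sorted input run */
    size_t i, n;    /* cursor and length */
} Run;

/* Head of a run; exhausted runs present an infinite key and so lose. */
static int head(const Run *r) {
    return r->i < r->n ? r->a[r->i] : INT_MAX;
}

/* tree[1] is the root; tree[R..2R-1] are leaves holding run indices.
   After one run advances, replay only the log R games on its path. */
static void replay(int *tree, const Run *runs, int R, int leaf) {
    for (int v = (R + leaf) / 2; v >= 1; v /= 2) {
        int l = tree[2 * v], r = tree[2 * v + 1];
        tree[v] = head(&runs[l]) <= head(&runs[r]) ? l : r;
    }
}

/* Merge R sorted runs totalling `total` elements into `out`. */
void rway_merge(Run *runs, int R, int *out, size_t total) {
    int *tree = malloc(2 * (size_t)R * sizeof *tree);
    for (int k = 0; k < R; k++) tree[R + k] = k;        /* leaves */
    for (int v = R - 1; v >= 1; v--) {                  /* initial playoff */
        int l = tree[2 * v], r = tree[2 * v + 1];
        tree[v] = head(&runs[l]) <= head(&runs[r]) ? l : r;
    }
    for (size_t j = 0; j < total; j++) {
        int w = tree[1];                  /* run holding the overall minimum */
        out[j] = runs[w].a[runs[w].i++];  /* emit its head, advance */
        replay(tree, runs, R, w);
    }
    free(tree);
}

(The point of the analysis above is that the 2R-entry tree stays resident in cache, so the misses come only from striding through the runs and the output.)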
858 00:47:48,030 --> 00:47:51,890 So the k funnel merges k cubed items in k sorted lists, 859 00:47:51,890 --> 00:47:54,070 incurring this many cache misses. 860 00:47:54,070 --> 00:47:58,090 Here, I did put in the one, for people who are concerned 861 00:47:58,090 --> 00:48:00,550 about the ones. 862 00:48:00,550 --> 00:48:04,400 And so then, you get this recurrence for the cache 863 00:48:04,400 --> 00:48:08,330 misses, because you solve n to the 1/3 problems of size n to 864 00:48:08,330 --> 00:48:13,900 the 2/3 recursively, plus this amount for merging. 865 00:48:13,900 --> 00:48:16,550 And that ends up giving you this bound, which turns out to 866 00:48:16,550 --> 00:48:17,800 be asymptotically optimal. 867 00:48:20,740 --> 00:48:25,050 And the way it works is there's basically, a k funnel 868 00:48:25,050 --> 00:48:27,520 is constructed recursively. 869 00:48:27,520 --> 00:48:32,120 And the idea is that what we have is, we have recursive square root of k 870 00:48:32,120 --> 00:48:36,570 funnels, so this is going to be a merging process, that is 871 00:48:36,570 --> 00:48:49,400 going to produce k cubed items by having buffers of size k to the 3/2, 872 00:48:49,400 --> 00:48:53,380 where each buffer is fed by its own square root of k funnel 873 00:48:53,380 --> 00:48:56,050 taking square root of k inputs. 874 00:48:56,050 --> 00:48:58,380 So each of these funnels is going to produce k to the 3/2 at a time, and there are square 875 00:48:58,380 --> 00:49:04,000 root of k of them, and each of 876 00:49:04,000 --> 00:49:06,290 these buffers is going to be length k to the 3/2, so repeating that, we end up 877 00:49:06,290 --> 00:49:09,260 with k cubed. 878 00:49:09,260 --> 00:49:12,870 And they basically feed each other, and then they get 879 00:49:12,870 --> 00:49:17,330 merged by their own square root of k funnel at the output, and each of these then 880 00:49:17,330 --> 00:49:19,290 recursively is constructed the same way. 881 00:49:21,980 --> 00:49:24,930 And the basic idea is that you keep filling the buffers, I 882 00:49:24,930 --> 00:49:28,315 think I say this here, so that all these buffers end up being 883 00:49:28,315 --> 00:49:30,050 in contiguous storage. 884 00:49:30,050 --> 00:49:33,710 And the idea is, rather than going and just getting one 885 00:49:33,710 --> 00:49:37,660 element out as you do in a typical tournament, as long as 886 00:49:37,660 --> 00:49:41,860 you're going to go merge, let's merge a lot of stuff and 887 00:49:41,860 --> 00:49:43,390 put it into our buffer so we don't have to 888 00:49:43,390 --> 00:49:45,350 go back here again. 889 00:49:45,350 --> 00:49:48,880 So you sort of batch your merging in local regions, and 890 00:49:48,880 --> 00:49:51,250 that ends up using the cache efficiently 891 00:49:51,250 --> 00:49:52,500 in the local regions. 892 00:49:54,960 --> 00:49:56,520 Enough of sorting. 893 00:49:56,520 --> 00:50:01,280 Let's go on to physics. 894 00:50:01,280 --> 00:50:04,260 So many of you have probably studied, in your linear 895 00:50:04,260 --> 00:50:07,630 algebra class or elsewhere, the heat equation. 896 00:50:07,630 --> 00:50:11,130 So, people familiar with heat diffusion? 897 00:50:11,130 --> 00:50:14,680 So it's a common one to do, and these were-- 898 00:50:14,680 --> 00:50:18,110 I have a former student, Matteo Frigo, who is a 899 00:50:18,110 --> 00:50:24,750 brilliant coder on anything cache oblivious. 900 00:50:24,750 --> 00:50:26,860 He's got the best code out there.
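(For reference, the funnelsort recurrence he sketched, written out -- this is the standard statement from the cache-oblivious sorting literature under the tall-cache assumption, up to constant factors, not something quoted from the slide:)

$$Q(n) \;=\; n^{1/3}\,Q\!\left(n^{2/3}\right) \;+\; \Theta\!\left(\frac{n}{B}\,\log_{M/B}\frac{n}{B}\right) \qquad\Rightarrow\qquad Q(n) \;=\; \Theta\!\left(\frac{n}{B}\,\log_{M/B}\frac{n}{B}\right),$$

which matches the lower bound for sorting in this model -- that is the sense in which it's asymptotically optimal.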
901 00:50:30,680 --> 00:50:35,660 So the 2D heat equation, what we do is let's let u(t, x, y) 902 00:50:35,660 --> 00:50:39,730 be the temperature at time t at point (x, y). 903 00:50:39,730 --> 00:50:41,910 And now you can go through the physics and come up with an 904 00:50:41,910 --> 00:50:46,210 equation that looks like this, which says that basically the 905 00:50:46,210 --> 00:50:51,010 partial of u with respect to t is proportional to the sum of 906 00:50:51,010 --> 00:50:55,240 the second partials with respect to x and 907 00:50:55,240 --> 00:50:58,510 with respect to y. 908 00:50:58,510 --> 00:51:01,380 So basically what that says is, the bigger the temperature difference 909 00:51:01,380 --> 00:51:04,210 between two things, the quicker things are going to 910 00:51:04,210 --> 00:51:07,890 adjust, the quicker the heat 911 00:51:07,890 --> 00:51:09,170 moves between them. 912 00:51:09,170 --> 00:51:14,220 And alpha is the thermal diffusivity, which has-- 913 00:51:14,220 --> 00:51:18,220 different materials have different thermal 914 00:51:18,220 --> 00:51:20,840 diffusivities. 915 00:51:20,840 --> 00:51:24,660 Say that three times fast. 916 00:51:24,660 --> 00:51:31,520 So if we do a simulation, we can end up with heat, say, 917 00:51:31,520 --> 00:51:35,840 put in like this, and after a while it looks like this. 918 00:51:35,840 --> 00:51:37,170 See if we can get this running here. 919 00:51:37,170 --> 00:51:38,420 So now, let me see. 920 00:51:45,500 --> 00:51:51,940 So I can move my cursor around and make things. 921 00:51:51,940 --> 00:51:54,080 You can just sort of see that it simulates. 922 00:51:54,080 --> 00:51:57,280 You can see the simulation is actually pretty slow. 923 00:51:57,280 --> 00:52:01,940 Now, on my slide, I have a thing here that says-- 924 00:52:01,940 --> 00:52:04,920 let's see if this breaks when we do it again. 925 00:52:04,920 --> 00:52:06,170 There we go. 926 00:52:08,960 --> 00:52:13,220 So we're getting around 100 frames per minute in doing 927 00:52:13,220 --> 00:52:14,470 this simulation. 928 00:52:17,550 --> 00:52:21,120 And so how does this simulation work? 929 00:52:21,120 --> 00:52:22,380 So let's take a look at that. 930 00:52:22,380 --> 00:52:27,040 It's kind of a neat problem. 931 00:52:27,040 --> 00:52:30,310 So this is what happened when I did 6.172 932 00:52:30,310 --> 00:52:31,050 for a little while. 933 00:52:31,050 --> 00:52:34,510 It basically gave me that after a while, because it just 934 00:52:34,510 --> 00:52:37,450 sort of averages things, smears it out. 935 00:52:37,450 --> 00:52:38,290 So what's going on? 936 00:52:38,290 --> 00:52:40,700 Let's look at it in one dimension, because it's easier 937 00:52:40,700 --> 00:52:45,410 to understand than if we take on two dimensions. 938 00:52:45,410 --> 00:52:48,940 So assuming that we have, say, a bar which has no 939 00:52:48,940 --> 00:52:50,190 differential in this direction, 940 00:52:50,190 --> 00:52:52,250 only in this direction. 941 00:52:52,250 --> 00:52:58,420 So then we get to drop the partials with respect to y. 942 00:52:58,420 --> 00:53:01,270 So if I take a look at that, what I can do is what's called 943 00:53:01,270 --> 00:53:02,110 a finite difference 944 00:53:02,110 --> 00:53:03,870 approximation, which you probably-- 945 00:53:03,870 --> 00:53:06,600 who's studied finite differences? 946 00:53:06,600 --> 00:53:07,760 So a few people. 947 00:53:07,760 --> 00:53:10,080 It's OK if you haven't. 948 00:53:10,080 --> 00:53:12,420 That's OK if you haven't, I'll teach it to you now.
949 00:53:12,420 --> 00:53:15,530 And then you're free to forget it, because that's not the 950 00:53:15,530 --> 00:53:17,520 part that I want you to understand, but it is 951 00:53:17,520 --> 00:53:18,780 interesting. 952 00:53:18,780 --> 00:53:22,660 So what I can do is look at the partial, for example, with 953 00:53:22,660 --> 00:53:26,890 respect to t, and just do an approximation that says, well, 954 00:53:26,890 --> 00:53:29,120 let me perturb t a little bit-- 955 00:53:29,120 --> 00:53:30,580 that's what it means. 956 00:53:30,580 --> 00:53:38,040 So I take u of t plus delta t minus u of t, divided by t plus delta t 957 00:53:38,040 --> 00:53:41,300 minus t, which gives me delta t in the denominator. 958 00:53:41,300 --> 00:53:42,270 And I can use that as an 959 00:53:42,270 --> 00:53:44,350 approximation for this partial. 960 00:53:47,410 --> 00:53:50,090 Then on the right hand side-- well first of all, let me get 961 00:53:50,090 --> 00:53:52,930 the first derivative with respect to x. 962 00:53:52,930 --> 00:53:55,470 And basically here what I'll do is I'll do an approximation 963 00:53:55,470 --> 00:54:00,740 where I take u at x plus delta x over 2 minus u at x minus delta x 964 00:54:00,740 --> 00:54:04,780 over 2, and once again, the difference in the terms there 965 00:54:04,780 --> 00:54:07,100 ends up being delta x. 966 00:54:07,100 --> 00:54:11,470 And now I use that to take the next one. 967 00:54:11,470 --> 00:54:14,540 So basically, to take this one, I basically take the 968 00:54:14,540 --> 00:54:19,600 partial at x plus delta x over 2, minus the partial 969 00:54:19,600 --> 00:54:24,440 at x minus delta x over 2, and take the difference of those, and 970 00:54:24,440 --> 00:54:25,850 do the approximation. 971 00:54:25,850 --> 00:54:29,830 And what happens is, if you look at it, when I take a 972 00:54:29,830 --> 00:54:34,920 partial here I'm adding delta x over 2 twice, so I end up 973 00:54:34,920 --> 00:54:38,640 getting just a delta x here, and then the two things on 974 00:54:38,640 --> 00:54:41,830 either side combined give me my original one, 2 times u(t, 975 00:54:41,830 --> 00:54:45,720 x), and then another one here, and now the whole thing over 976 00:54:45,720 --> 00:54:48,900 delta x squared. 977 00:54:48,900 --> 00:54:52,500 And so what I can do is to reduce this heat equation, 978 00:54:52,500 --> 00:54:55,670 which is continuous, to something that we can handle 979 00:54:55,670 --> 00:54:59,740 in a computer, which is discrete, by saying OK, let's 980 00:54:59,740 --> 00:55:03,090 just do this approximation that says that this term must 981 00:55:03,090 --> 00:55:06,350 be equal to that term. 982 00:55:06,350 --> 00:55:08,326 And if you've studied the linear algebra, you know that 983 00:55:08,326 --> 00:55:10,920 there are all kinds of conditions on convergence, and 984 00:55:10,920 --> 00:55:13,690 stability, and stuff like that, that are actually quite 985 00:55:13,690 --> 00:55:15,522 interesting from a numerical point of view, but we're not 986 00:55:15,522 --> 00:55:17,590 going to get into it. 987 00:55:17,590 --> 00:55:20,550 But basically, I've just taken that equation here and said, 988 00:55:20,550 --> 00:55:23,950 OK, that's my approximation for this one. 989 00:55:23,950 --> 00:55:25,210 And now what do I have here? 990 00:55:25,210 --> 00:55:31,640 I've got u of t plus delta t, and u of t, and then 991 00:55:31,640 --> 00:55:36,970 over here, they're all at time t, but now the deltas are in space-- 992 00:55:36,970 --> 00:55:39,880 whoops, that should have been a delta x there.
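(Put together, the discretization he's describing -- with the slide's typo corrected -- reads as follows; this is just the two approximations above set equal, nothing new:)

$$\frac{u(t+\Delta t,\,x) \;-\; u(t,\,x)}{\Delta t} \;=\; \alpha\;\frac{u(t,\,x+\Delta x) \;-\; 2\,u(t,\,x) \;+\; u(t,\,x-\Delta x)}{(\Delta x)^2}$$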
993 00:55:39,880 --> 00:55:40,870 I don't know how that got there. 994 00:55:40,870 --> 00:55:43,370 That should be a delta x there. 995 00:55:43,370 --> 00:55:45,340 They're all spatial over here. 996 00:55:48,520 --> 00:55:56,820 So what I can do is take this, and do an iterative process to 997 00:55:56,820 --> 00:55:58,730 compute this. 998 00:55:58,730 --> 00:56:04,170 And so the idea is, let me take this and throw this term 999 00:56:04,170 --> 00:56:08,590 onto the right hand side, and look at u of t plus delta t as 1000 00:56:08,590 --> 00:56:09,770 if it's t plus 1. 1001 00:56:09,770 --> 00:56:13,880 Let me make my delta t be one, essentially. 1002 00:56:13,880 --> 00:56:19,730 Throw the delta t over here times the alpha over delta x squared, 1003 00:56:19,730 --> 00:56:24,540 and then I get basically u of t plus 1 at x is based on u of t at x 1004 00:56:24,540 --> 00:56:27,640 plus 1, at x, and at x minus 1. 1005 00:56:27,640 --> 00:56:28,840 As I say, there's a typo here. 1006 00:56:28,840 --> 00:56:33,080 That should be a delta t. 1007 00:56:33,080 --> 00:56:36,370 So what that says is that if I look at my one-dimensional 1008 00:56:36,370 --> 00:56:40,410 process proceeding through time, what I'm doing is 1009 00:56:40,410 --> 00:56:44,320 updating every point here based on the three points 1010 00:56:44,320 --> 00:56:48,840 below it: directly below, diagonally to the right, and 1011 00:56:48,840 --> 00:56:51,640 diagonally to the left. 1012 00:56:51,640 --> 00:56:54,640 So this guy can be updated because of those. 1013 00:56:54,640 --> 00:56:58,390 These we're not going to update, because they're the boundary. 1014 00:56:58,390 --> 00:56:59,810 So these can be fixed. 1015 00:56:59,810 --> 00:57:02,455 In a periodic stencil, they may even wrap 1016 00:57:02,455 --> 00:57:05,660 around like a torus. 1017 00:57:05,660 --> 00:57:08,570 So basically, I can go through and update all these with 1018 00:57:08,570 --> 00:57:11,190 whatever that hairy equation is. 1019 00:57:11,190 --> 00:57:13,410 And this is basically what the code that I 1020 00:57:13,410 --> 00:57:14,660 showed you is doing. 1021 00:57:17,130 --> 00:57:21,360 It just keeps updating every point based on the three below it until 1022 00:57:21,360 --> 00:57:25,160 I've gone through a bunch of time steps, and that's how the 1023 00:57:25,160 --> 00:57:26,410 system evolves. 1024 00:57:31,330 --> 00:57:37,080 So any questions about how I got to here? 1025 00:57:37,080 --> 00:57:39,250 So we're now going to look at this purely 1026 00:57:39,250 --> 00:57:41,080 computer sciencey. 1027 00:57:41,080 --> 00:57:42,970 We don't have to understand any of those equations. 1028 00:57:42,970 --> 00:57:44,680 We just have to understand the structure. 1029 00:57:44,680 --> 00:57:47,980 The structure is that we're updating row t plus 1 based on 1030 00:57:47,980 --> 00:57:52,300 three points in row t with some function that some 1031 00:57:52,300 --> 00:57:58,530 physicist oracle gave us out of the blue. 1032 00:57:58,530 --> 00:58:03,180 And so here is a pretty simple algorithm to do it. 1033 00:58:03,180 --> 00:58:05,750 I basically have what's called the kernel, which does this 1034 00:58:05,750 --> 00:58:11,200 updating, basically updating each point based on its neighbors. 1035 00:58:11,200 --> 00:58:13,430 And what I'm going to do for computer science is I don't 1036 00:58:13,430 --> 00:58:17,400 need to keep all the intermediate values. 1037 00:58:17,400 --> 00:58:18,660 And so what I'm going to do is do what's 1038 00:58:18,660 --> 00:58:21,300 called an even-odd trick.
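(Here's the shape of that loop in C -- a sketch, not the course handout's code. The constant c stands for alpha times delta t over delta x squared, and the boundary cells are simply held fixed:)

#include <stddef.h>

/* Straightforward looping stencil with the even-odd trick:
   u holds two rows of length n, and u[(t % 2) * n + x] is u(t, x). */
void stencil_loop(double *u, int n, int T, double c) {
    for (int t = 0; t < T; t++) {
        const double *cur = u + (size_t)(t % 2) * n;       /* row at time t */
        double *next = u + (size_t)((t + 1) % 2) * n;      /* row at t + 1 */
        for (int x = 1; x < n - 1; x++)                    /* the kernel */
            next[x] = cur[x] + c * (cur[x + 1] - 2.0 * cur[x] + cur[x - 1]);
        next[0] = cur[0];                                  /* fixed boundary */
        next[n - 1] = cur[n - 1];
    }
}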
1039 00:58:21,300 --> 00:58:24,870 Basically if I have one row, I compute the next row into 1040 00:58:24,870 --> 00:58:29,160 another array, and then I'll reuse that first array-- it's 1041 00:58:29,160 --> 00:58:30,770 all been used up-- 1042 00:58:30,770 --> 00:58:33,020 and go back to the first one. 1043 00:58:33,020 --> 00:58:36,410 So basically here, I'm just going to update t plus 1 mod 1044 00:58:36,410 --> 00:58:43,550 2, and just allocate two arrays of size n, and just do 1045 00:58:43,550 --> 00:58:46,596 modding all the way up. 1046 00:58:46,596 --> 00:58:48,760 Is that clear? 1047 00:58:48,760 --> 00:58:50,880 And other than that, it's basically doing the same 1048 00:58:50,880 --> 00:58:53,120 thing, and I'm doing a little bit of fancy arithmetic here 1049 00:58:53,120 --> 00:58:54,910 by passing-- 1050 00:58:54,910 --> 00:58:59,450 C stuff, where I'm passing the pointer to where I am in 1051 00:58:59,450 --> 00:59:02,010 the array, so I only have to update it 1052 00:59:02,010 --> 00:59:04,900 locally within the array. 1053 00:59:04,900 --> 00:59:06,870 So I don't have to do double indexing once I'm in the 1054 00:59:06,870 --> 00:59:09,530 array, because I'm already indexed into the part of the 1055 00:59:09,530 --> 00:59:12,470 array that I'm going to use, and then I am doing flipping. 1056 00:59:12,470 --> 00:59:14,270 So this is just a little bit of cleverness. 1057 00:59:14,270 --> 00:59:17,330 You might want to study this later. 1058 00:59:17,330 --> 00:59:20,700 So what's happening then is I have this double nested loop 1059 00:59:20,700 --> 00:59:23,260 where I have a time loop on the outside, and a space loop 1060 00:59:23,260 --> 00:59:25,920 on the inside, and I'm basically going through and 1061 00:59:25,920 --> 00:59:30,270 using a stencil of this shape, this is called a three point 1062 00:59:30,270 --> 00:59:34,190 stencil, because you're basically taking three points 1063 00:59:34,190 --> 00:59:36,570 to update one point. 1064 00:59:36,570 --> 00:59:41,950 And now if I imagine that this dimension is bigger, n here is 1065 00:59:41,950 --> 00:59:46,490 bigger than my cache size, what's going to happen? 1066 00:59:46,490 --> 00:59:48,840 I'm going to take a cache fault here, these are all cold 1067 00:59:48,840 --> 00:59:49,850 misses, et cetera. 1068 00:59:49,850 --> 00:59:55,760 But when I get back to the beginning here, if I use LRU, 1069 00:59:55,760 --> 00:59:58,350 nothing is going to be left in cache that I happened to 1070 00:59:58,350 --> 00:59:59,430 bring in over here. 1071 00:59:59,430 --> 01:00:01,580 So I have to go and I take a cache fault 1072 01:00:01,580 --> 01:00:04,070 on every cache line. 1073 01:00:04,070 --> 01:00:10,810 And so if I'm going t steps into the future from where I 1074 01:00:10,810 --> 01:00:15,620 started, I basically have n times t updates, and I save a 1075 01:00:15,620 --> 01:00:20,935 factor of B, because I get the spatial locality, because u 1076 01:00:20,935 --> 01:00:25,490 of t at x minus 1, at x, and at x plus 1 are all generally on 1077 01:00:25,490 --> 01:00:30,500 the same-- are nearby, and all within one cache line. 1078 01:00:30,500 --> 01:00:32,170 Question? 1079 01:00:32,170 --> 01:00:34,020 AUDIENCE: The x's, what are the x's for? 1080 01:00:36,660 --> 01:00:37,055 CHARLES LEISERSON: Sorry. 1081 01:00:37,055 --> 01:00:38,110 I should have put the legend on here. 1082 01:00:38,110 --> 01:00:40,660 The x's are a miss.
1083 01:00:40,660 --> 01:00:44,030 So I do a miss when I update these, and then these I don't 1084 01:00:44,030 --> 01:00:45,300 miss on, because it was brought in 1085 01:00:45,300 --> 01:00:48,420 when I accessed that. 1086 01:00:48,420 --> 01:00:52,110 And then I do a miss, and I'll do it-- 1087 01:00:52,110 --> 01:00:55,660 so basically I do it, then I shift over the stencil by one, 1088 01:00:55,660 --> 01:00:58,440 and then I won't get a miss. 1089 01:00:58,440 --> 01:01:01,060 So I'm just looking at the misses on the reads, not 1090 01:01:01,060 --> 01:01:03,000 misses on the writes. 1091 01:01:03,000 --> 01:01:05,500 I should have made that clear, too. 1092 01:01:05,500 --> 01:01:07,340 But the point is, the writes don't help you, because it's 1093 01:01:07,340 --> 01:01:11,320 all out of cache by the time I get up here. 1094 01:01:11,320 --> 01:01:15,740 By the second row, if this is longer than my cache size, 1095 01:01:15,740 --> 01:01:17,220 none of that's there if I'm using LRU. 1096 01:01:21,358 --> 01:01:24,214 AUDIENCE: You have also a miss, like you need to get two 1097 01:01:24,214 --> 01:01:27,045 [INAUDIBLE]. 1098 01:01:27,045 --> 01:01:27,830 CHARLES LEISERSON: Yeah, but what I'm saying is I'm only 1099 01:01:27,830 --> 01:01:30,270 looking at the read misses. 1100 01:01:30,270 --> 01:01:33,260 Yes, there are write misses as well, but basically, I'm only 1101 01:01:33,260 --> 01:01:34,060 doing the read misses. 1102 01:01:34,060 --> 01:01:35,900 You can look at the write misses as well. 1103 01:01:35,900 --> 01:01:37,150 It makes the picture messier. 1104 01:01:40,120 --> 01:01:42,670 So we basically have nt over B. 1105 01:01:42,670 --> 01:01:44,950 However this, let me tell you, is the way that 1106 01:01:44,950 --> 01:01:46,200 everybody codes it. 1107 01:01:48,530 --> 01:01:51,410 And if you have a machine where you have any bandwidth 1108 01:01:51,410 --> 01:01:55,010 issues to memory, especially for these large problems, this 1109 01:01:55,010 --> 01:01:58,810 is not a very good way to do it, as it turns out. 1110 01:01:58,810 --> 01:02:02,030 So it turns out that what you want to do is, as we've seen, 1111 01:02:02,030 --> 01:02:04,960 divide and conquer is a really good way to do it. 1112 01:02:04,960 --> 01:02:08,970 But in this case, when we're doing divide and conquer, 1113 01:02:08,970 --> 01:02:13,980 we're actually not going to use rectangles, we're going to 1114 01:02:13,980 --> 01:02:15,230 use trapezoids. 1115 01:02:17,760 --> 01:02:19,920 And the reason is that a trapezoid has the nice 1116 01:02:19,920 --> 01:02:22,010 property that-- 1117 01:02:22,010 --> 01:02:27,100 notice that if I have all these points in cache, then 1118 01:02:27,100 --> 01:02:29,760 notice that I can compute all the guys on the 1119 01:02:29,760 --> 01:02:33,530 next level, and then I can compute all the guys on 1120 01:02:33,530 --> 01:02:35,660 the level after that. 1121 01:02:35,660 --> 01:02:37,910 And so for example, if you imagine that this part here 1122 01:02:37,910 --> 01:02:41,380 fit within cache, I could actually keep going.
1123 01:02:41,380 --> 01:02:44,580 I didn't have to stop here, I could keep going right up to a 1124 01:02:44,580 --> 01:02:48,480 triangle if I wanted to, and compute all the values without 1125 01:02:48,480 --> 01:02:52,190 having any more cache misses than those needed to bring in, 1126 01:02:52,190 --> 01:02:53,480 essentially, one row-- 1127 01:02:53,480 --> 01:02:57,390 two rows, actually, because I'm reusing the 1128 01:02:57,390 --> 01:02:59,660 rows as I go up. 1129 01:02:59,660 --> 01:03:02,490 So what we're going to do is traverse trapezoidal regions 1130 01:03:02,490 --> 01:03:07,720 of space-time points such that the points are between an 1131 01:03:07,720 --> 01:03:11,300 upper limit, T1, and a lower one, T0, and between an x0 and 1132 01:03:11,300 --> 01:03:15,330 an x1, where now I have slopes here that are going to be, in 1133 01:03:15,330 --> 01:03:18,480 general, plus 1 or minus 1. 1134 01:03:18,480 --> 01:03:22,680 And in fact, sometimes it will be straight, in which case 1135 01:03:22,680 --> 01:03:23,600 we'll call it 0. 1136 01:03:23,600 --> 01:03:29,760 It's really the inverse of the slope, but we'll still call it 1137 01:03:29,760 --> 01:03:32,220 zero rather than infinity. 1138 01:03:32,220 --> 01:03:33,380 So it's 1 over the slope. 1139 01:03:33,380 --> 01:03:35,260 There's a name for that, right? 1140 01:03:35,260 --> 01:03:37,730 Is that called the run or something? 1141 01:03:37,730 --> 01:03:39,620 I forget, I don't remember my calculus. 1142 01:03:42,570 --> 01:03:44,840 So that's what we're going to do. 1143 01:03:44,840 --> 01:03:49,230 And we're going to leave the upper and right borders undone, 1144 01:03:49,230 --> 01:03:51,270 so it's going to be a sort of half open 1145 01:03:51,270 --> 01:03:54,960 trapezoid: closed on the left and 1146 01:03:54,960 --> 01:03:58,200 bottom, and open on the top and right. 1147 01:03:58,200 --> 01:04:03,120 So the width is basically measured at the midpoint here, and the height 1148 01:04:03,120 --> 01:04:05,220 is the height, because the top and bottom are always going to be 1149 01:04:05,220 --> 01:04:09,420 parallel here. 1150 01:04:09,420 --> 01:04:13,790 So here's how our recursion is going to work. 1151 01:04:13,790 --> 01:04:18,390 If the height is 1, then we can compute all space-time 1152 01:04:18,390 --> 01:04:22,300 points in any way we want. 1153 01:04:22,300 --> 01:04:25,900 I can just go through them if I want, because they're all 1154 01:04:25,900 --> 01:04:26,570 independent. 1155 01:04:26,570 --> 01:04:28,930 None depends on anybody else. 1156 01:04:28,930 --> 01:04:30,540 So that's going to be our base case. 1157 01:04:33,730 --> 01:04:38,100 If the width is greater than twice the height, however, 1158 01:04:38,100 --> 01:04:39,930 then what we're going to do is we're going to cut the 1159 01:04:39,930 --> 01:04:45,740 trapezoid through the middle with a slope of minus 1. 1160 01:04:48,380 --> 01:04:51,680 And that will produce two new trapezoids, which we then will 1161 01:04:51,680 --> 01:04:58,936 recursively compute all the elements of. 1162 01:04:58,936 --> 01:05:02,000 So I'll start out with a trapezoid. 1163 01:05:02,000 --> 01:05:07,000 Basically, if it ends up that it's a long and wide one, I'm 1164 01:05:07,000 --> 01:05:10,700 going to make what's called a space cut, and cut it this 1165 01:05:10,700 --> 01:05:13,410 way, and then I'm going to recursively do this one and 1166 01:05:13,410 --> 01:05:15,470 then this one.
1167 01:05:15,470 --> 01:05:18,950 And notice that I can do that because-- 1168 01:05:18,950 --> 01:05:21,790 all these guys I can do, but then when I get to the border 1169 01:05:21,790 --> 01:05:24,830 here, this will already have been done by the time I'm 1170 01:05:24,830 --> 01:05:26,080 computing these guys. 1171 01:05:28,930 --> 01:05:32,510 So the requirement is that I've got to do things 1172 01:05:32,510 --> 01:05:36,360 according to that map of triples that I showed you 1173 01:05:36,360 --> 01:05:38,750 before, but I don't have to do them in the same order. 1174 01:05:38,750 --> 01:05:41,300 I don't have to do the whole bottom row first. 1175 01:05:41,300 --> 01:05:44,820 In this case, I can compute the whole trapezoid here, and 1176 01:05:44,820 --> 01:05:48,850 then I can compute this trapezoid here, and then all 1177 01:05:48,850 --> 01:05:50,940 the values that I'll need will have already been computed 1178 01:05:50,940 --> 01:05:54,870 over here, that are on the boundary of this trapezoid. 1179 01:05:54,870 --> 01:05:58,030 The other type of cut I'll do is what happens when a 1180 01:05:58,030 --> 01:06:01,680 trapezoid gets too tall for me. 1181 01:06:01,680 --> 01:06:04,470 So if the trapezoid is too tall, then what we'll do is 1182 01:06:04,470 --> 01:06:06,580 we'll slice it through the middle, but the other way. 1183 01:06:06,580 --> 01:06:08,380 We call that a time cut. 1184 01:06:08,380 --> 01:06:11,570 So we won't take it all the way through time, we'll only 1185 01:06:11,570 --> 01:06:12,990 take it partially through time. 1186 01:06:16,430 --> 01:06:19,000 Now you can show, and I'm not going to show this in detail, 1187 01:06:19,000 --> 01:06:21,910 but you can show that if I do this, my trapezoids are always 1188 01:06:21,910 --> 01:06:23,620 sort of medium sized. 1189 01:06:23,620 --> 01:06:26,520 I never get long, long skinny ones. 1190 01:06:26,520 --> 01:06:29,490 If I start with something that's sort of got a good 1191 01:06:29,490 --> 01:06:33,660 aspect ratio, I maintain a good aspect ratio through the 1192 01:06:33,660 --> 01:06:34,910 entire code. 1193 01:06:38,820 --> 01:06:40,070 So here's the implementation. 1194 01:06:43,700 --> 01:06:48,040 This is what Matteo Frigo wrote, and I've modified it a 1195 01:06:48,040 --> 01:06:48,900 little bit. 1196 01:06:48,900 --> 01:06:53,020 So basically, we pass in the values that let us 1197 01:06:53,020 --> 01:06:59,320 identify the trapezoid: t0, t1, x0 and then the slope on 1198 01:06:59,320 --> 01:07:04,870 the left side, x1 and the slope on the right side, where the 1199 01:07:04,870 --> 01:07:11,290 dx0 and the dx1s are all either 0, 1, or minus 1. 1200 01:07:11,290 --> 01:07:15,980 And then what I do is I look at what the height is that my 1201 01:07:15,980 --> 01:07:17,760 trapezoid is going to operate on. 1202 01:07:17,760 --> 01:07:22,920 And if the height is 1, well, then I just run through all 1203 01:07:22,920 --> 01:07:27,830 the elements, and I just compute the kernel-- 1204 01:07:27,830 --> 01:07:30,050 that program that I showed you before, that kernel-- 1205 01:07:30,050 --> 01:07:31,690 on all the elements. 1206 01:07:31,690 --> 01:07:33,920 Nothing really to be done there, just go through and 1207 01:07:33,920 --> 01:07:37,900 compute them individually with a for loop.
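(The structure he's walking through is essentially Frigo's published trapezoid recursion; here is a C sketch of it. The kernel body, the fixed N, and the 0.25 coefficient are illustrative stand-ins, not the actual handout code, and boundary handling is omitted:)

#define N 1000
static double u[2][N];   /* u[t % 2][x], the even-odd storage from before */

/* The three-point update at space-time point (t, x); sketch only. */
static void kernel(int t, int x) {
    u[(t + 1) % 2][x] = u[t % 2][x]
        + 0.25 * (u[t % 2][x + 1] - 2.0 * u[t % 2][x] + u[t % 2][x - 1]);
}

/* Trapezoid: t0 <= t < t1, and at time t the row spans
   x0 + dx0*(t - t0) <= x < x1 + dx1*(t - t0), with dx0, dx1 in {-1, 0, 1}. */
void walk1(int t0, int t1, int x0, int dx0, int x1, int dx1) {
    int lt = t1 - t0;
    if (lt == 1) {                          /* base case: height 1 */
        for (int x = x0; x < x1; x++)
            kernel(t0, x);
    } else if (lt > 1) {
        if (2 * (x1 - x0) + (dx1 - dx0) * lt >= 4 * lt) {
            /* wide: space cut through the center with slope -1 */
            int xm = (2 * (x0 + x1) + (2 + dx0 + dx1) * lt) / 4;
            walk1(t0, t1, x0, dx0, xm, -1);
            walk1(t0, t1, xm, -1, x1, dx1);
        } else {
            /* tall: time cut through the middle */
            int s = lt / 2;
            walk1(t0, t0 + s, x0, dx0, x1, dx1);
            walk1(t0 + s, t1, x0 + dx0 * s, dx0, x1 + dx1 * s, dx1);
        }
    }
}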
1208 01:07:37,900 --> 01:07:46,910 Otherwise, if I've got the situation where the trapezoid 1209 01:07:46,910 --> 01:07:51,330 is big, then I do this comparison, which-- 1210 01:07:51,330 --> 01:07:53,310 you can work out the math if you wish-- 1211 01:07:53,310 --> 01:07:56,950 I promise you tells you whether or not, 1212 01:07:56,950 --> 01:07:59,570 as I said before, the width is 1213 01:07:59,570 --> 01:08:01,130 more than twice the height. 1214 01:08:01,130 --> 01:08:07,360 And if so, I compute the middle point, and then I 1215 01:08:07,360 --> 01:08:09,120 partition it into two trapezoids, and I 1216 01:08:09,120 --> 01:08:12,790 recursively call the routine on them. 1217 01:08:12,790 --> 01:08:16,770 And otherwise, I simply cut the time in half, and then I 1218 01:08:16,770 --> 01:08:20,550 do the bottom half and then the upper half. 1219 01:08:20,550 --> 01:08:25,979 So getting all those parameters exactly right takes 1220 01:08:25,979 --> 01:08:30,310 a little bit of thinking, makes my brain hurt, but 1221 01:08:30,310 --> 01:08:33,149 Matteo is brilliant at this kind of coding. 1222 01:08:36,490 --> 01:08:38,540 So let's see how well this does. 1223 01:08:38,540 --> 01:08:41,479 So I'm not going to do a detailed analysis like I did 1224 01:08:41,479 --> 01:08:47,319 before, but basically what's going on is at this level, if 1225 01:08:47,319 --> 01:08:49,960 I'm doing divide and conquering, I'm only doing a 1226 01:08:49,960 --> 01:08:54,350 constant amount of work managing this stuff. 1227 01:08:54,350 --> 01:08:58,500 So the cache misses that I'm taking in the internal part of the 1228 01:08:58,500 --> 01:08:59,840 tree are all going to be order one. 1229 01:09:03,140 --> 01:09:10,729 Now each leaf is going to represent a trapezoid, which 1230 01:09:10,729 --> 01:09:15,990 is going to be approximately h times w, where h and w are 1231 01:09:15,990 --> 01:09:19,260 approximately equal, because they're going to be shaped-- 1232 01:09:19,260 --> 01:09:22,140 This is assuming I start out with a number of time steps to 1233 01:09:22,140 --> 01:09:28,720 do that is at least as large as the number of points that I 1234 01:09:28,720 --> 01:09:29,500 have to go over. 1235 01:09:29,500 --> 01:09:32,170 If I start out with something that's really thin and flat, 1236 01:09:32,170 --> 01:09:34,290 then it's not going to be the case. 1237 01:09:34,290 --> 01:09:36,410 But if I start out with something that's deep enough, 1238 01:09:36,410 --> 01:09:39,950 then I'm going to be able to make progress in an 1239 01:09:39,950 --> 01:09:43,620 unconventional order into time by moving the time 1240 01:09:43,620 --> 01:09:46,550 non-uniformly through the space. 1241 01:09:46,550 --> 01:09:53,340 So each leaf represents a fairly balanced trapezoid. 1242 01:09:53,340 --> 01:09:57,350 Each leaf basically is going to-- 1243 01:09:57,350 --> 01:10:03,050 if you look at it, the direction of the trapezoid is in time, 1244 01:10:03,050 --> 01:10:06,960 so this represents the spatial dimension, and if I have 1245 01:10:06,960 --> 01:10:10,430 something of size w, I can access it with 1246 01:10:10,430 --> 01:10:13,720 only w over B misses. 1247 01:10:13,720 --> 01:10:18,810 And when that fits in cache, where w is some constant fraction 1248 01:10:18,810 --> 01:10:20,830 of m, so w is order m. 1249 01:10:20,830 --> 01:10:24,230 So each of these things that's a leaf is only going to incur 1250 01:10:24,230 --> 01:10:25,480 w over B misses.
1251 01:10:30,080 --> 01:10:32,720 Now, the total number of space-time points I have to go 1252 01:10:32,720 --> 01:10:34,490 after is n times t. 1253 01:10:34,490 --> 01:10:37,610 n is going to be the full dimension this way, t is the 1254 01:10:37,610 --> 01:10:39,450 height that way. 1255 01:10:39,450 --> 01:10:42,810 And so since each leaf has hw points, I 1256 01:10:42,810 --> 01:10:44,570 have nt over hw leaves. 1257 01:10:47,130 --> 01:10:49,860 And the number of internal nodes is just the leaves minus 1258 01:10:49,860 --> 01:10:52,640 1, so they can't contribute substantially, because there's 1259 01:10:52,640 --> 01:10:57,840 only order one misses I'm taking here, whereas I've got 1260 01:10:57,840 --> 01:11:02,720 something on the order of w over B misses for this. 1261 01:11:02,720 --> 01:11:05,330 So therefore, now I can do my math. 1262 01:11:05,330 --> 01:11:08,100 The number of cache misses I'm going to take is-- 1263 01:11:08,100 --> 01:11:11,300 well, how many leaves do I have? 1264 01:11:11,300 --> 01:11:13,780 nt over hw. 1265 01:11:13,780 --> 01:11:15,420 And what does each one cost us? 1266 01:11:15,420 --> 01:11:19,220 w over B. 1267 01:11:19,220 --> 01:11:24,830 And so now, when I multiply that out, well, hw is about m 1268 01:11:24,830 --> 01:11:34,540 squared, and w over B is about m over B. And so I get nt over 1269 01:11:34,540 --> 01:11:38,200 MB as being the total number of misses. 1270 01:11:38,200 --> 01:11:42,270 So whereas the original algorithm took nt over B, 1271 01:11:42,270 --> 01:11:49,610 we've got this extra factor of the cache size in the denominator, 1272 01:11:49,610 --> 01:11:53,100 showing us that we have far fewer cache misses. 1273 01:11:53,100 --> 01:11:55,640 So the cache misses end up not being an issue for this. 1274 01:11:58,360 --> 01:11:59,610 Any questions about that? 1275 01:12:06,230 --> 01:12:11,410 So I want to show you a simulation of this three point 1276 01:12:11,410 --> 01:12:14,330 stencil, comparing the two things. 1277 01:12:14,330 --> 01:12:16,650 So this is going to be the looping version, where the red 1278 01:12:16,650 --> 01:12:21,170 dots are where the cache misses are, and this is going 1279 01:12:21,170 --> 01:12:23,590 to be the trapezoidal one. 1280 01:12:23,590 --> 01:12:27,950 And basically, I have an n of 95 and a t of 87, and what I'm 1281 01:12:27,950 --> 01:12:31,440 going to do is assume a fully associative LRU cache that 1282 01:12:31,440 --> 01:12:35,750 fits four points on a cache line, where the cache size is 1283 01:12:35,750 --> 01:12:39,520 32, two to the fifth as opposed to two to the 15th, 1284 01:12:39,520 --> 01:12:42,100 it's really little. 1285 01:12:42,100 --> 01:12:44,910 If I get a cache hit, I'm going to call it one cycle. 1286 01:12:44,910 --> 01:12:48,070 If I get a cache miss, I'm going to call it 10 cycles. 1287 01:12:48,070 --> 01:12:49,070 We're going to race them. 1288 01:12:49,070 --> 01:12:52,840 So on the left is the current world 1289 01:12:52,840 --> 01:12:54,510 champion, the looping algorithm. 1290 01:12:54,510 --> 01:12:59,130 And on the right is the cache oblivious trapezoid algorithm. 1291 01:12:59,130 --> 01:13:00,380 So let's go. 1292 01:13:07,750 --> 01:13:12,480 So you can see that it's basically, it's made a space 1293 01:13:12,480 --> 01:13:18,060 cut there, but it's made a time cut across the top there. 1294 01:13:18,060 --> 01:13:20,730 It said, this is too tall, so let me cut it this way.
1295 01:13:38,460 --> 01:13:40,764 And that guy's, meanwhile, taking all those-- you can see 1296 01:13:40,764 --> 01:13:42,140 how many cache misses he's taking. 1297 01:13:45,850 --> 01:13:47,100 Let's speed him up. 1298 01:13:52,770 --> 01:13:54,460 That's one way you can do it, is make it think. 1299 01:14:09,050 --> 01:14:12,670 So let's see what happens if I have a cache of size eight. 1300 01:14:18,020 --> 01:14:19,270 So here we go. 1301 01:14:31,320 --> 01:14:32,830 I think I'm just doing the same thing. 1302 01:14:32,830 --> 01:14:34,160 I know I can show you the cuts. 1303 01:14:34,160 --> 01:14:37,050 Can I show you the cuts? 1304 01:14:37,050 --> 01:14:37,270 I know. 1305 01:14:37,270 --> 01:14:39,540 I think it's because I'm not-- 1306 01:14:39,540 --> 01:14:41,350 OK, let's try it. 1307 01:14:41,350 --> 01:14:41,600 There we go. 1308 01:14:41,600 --> 01:14:44,700 Now I'm showing the cuts as they go on. 1309 01:14:44,700 --> 01:14:47,678 Let's do that again. 1310 01:14:47,678 --> 01:14:49,516 We'll go fast and do it again. 1311 01:14:58,230 --> 01:15:00,400 So those are the cuts that it's making to begin with as 1312 01:15:00,400 --> 01:15:01,760 it's doing the divide and conquer. 1313 01:15:06,450 --> 01:15:09,570 So I think this is the same size cache. 1314 01:15:14,100 --> 01:15:16,595 So now I think I'm doing a bigger cache. 1315 01:15:25,010 --> 01:15:26,830 I think I did a bigger cache, but I'm not sure I gave the 1316 01:15:26,830 --> 01:15:28,080 other guy a bigger cache. 1317 01:15:46,790 --> 01:15:48,330 Yeah, because it doesn't matter for the guy on the 1318 01:15:48,330 --> 01:15:49,100 left, right? 1319 01:15:49,100 --> 01:15:51,470 As long as the cache line is the same length and as long as 1320 01:15:51,470 --> 01:15:55,910 the cache is not big enough, he's going to do the same thing. 1321 01:15:55,910 --> 01:15:57,600 He didn't get to take advantage of the fact that the 1322 01:15:57,600 --> 01:16:00,350 cache was bigger, because it was still smaller than the 1323 01:16:00,350 --> 01:16:02,050 array that he's striding over out there. 1324 01:16:05,460 --> 01:16:06,880 Anyway, we can play with these all day. 1325 01:16:13,910 --> 01:16:16,240 So if you make the cache lines bigger, then of course it'll 1326 01:16:16,240 --> 01:16:19,790 go faster, because he'll have fewer misses. 1327 01:16:19,790 --> 01:16:21,680 He'll get to bring more in at a time. 1328 01:16:21,680 --> 01:16:22,930 So let's see here. 1329 01:16:25,330 --> 01:16:29,030 So let's now do it for real. 1330 01:16:29,030 --> 01:16:30,820 So this is a two-dimensional problem. 1331 01:16:30,820 --> 01:16:34,720 You can use the same thing to do what end up being 1332 01:16:34,720 --> 01:16:37,710 three-dimensional trapezoids. 1333 01:16:37,710 --> 01:16:40,130 In fact, you can generalize this trapezoid method to 1334 01:16:40,130 --> 01:16:41,790 multiple dimensions. 1335 01:16:41,790 --> 01:16:43,320 So this is the looping one. 1336 01:16:43,320 --> 01:16:45,240 So let's start that one out. 1337 01:16:51,020 --> 01:16:53,820 So it's going about 104 frames a minute. 1338 01:17:08,470 --> 01:17:12,590 I think by resizing it, the calibration is off. 1339 01:17:15,160 --> 01:17:18,000 But in any case, let's switch to the cache oblivious version. 1340 01:17:27,840 --> 01:17:29,090 Anybody notice something? 1341 01:17:31,460 --> 01:17:32,710 Slower. 1342 01:17:35,040 --> 01:17:36,290 Why is that? 1343 01:17:41,460 --> 01:17:43,010 This is the code exactly as I had it up there.
1344 01:17:46,120 --> 01:17:47,640 No, it's not because it's two dimensions. 1345 01:17:47,640 --> 01:17:48,890 AUDIENCE: [INAUDIBLE]. 1346 01:17:53,090 --> 01:17:54,090 CHARLES LEISERSON: I'm sorry? 1347 01:17:54,090 --> 01:17:54,390 [INTERPOSING VOICES] 1348 01:17:54,390 --> 01:17:58,505 CHARLES LEISERSON: Yeah, so now it's the trapezoiding at 1349 01:17:58,505 --> 01:17:59,755 only 86 frames. 1350 01:18:02,080 --> 01:18:04,150 What do you suppose is going on there? 1351 01:18:04,150 --> 01:18:06,750 AUDIENCE: You have [INAUDIBLE]. 1352 01:18:06,750 --> 01:18:08,000 CHARLES LEISERSON: Yeah. 1353 01:18:11,120 --> 01:18:14,870 So this is a case where if you look at the code I wrote, I 1354 01:18:14,870 --> 01:18:24,790 went down to a t, a delta t, of one in my recursion. 1355 01:18:27,510 --> 01:18:28,810 I recursed all the way down. 1356 01:18:28,810 --> 01:18:32,160 Let's see what happens if instead of going all the way 1357 01:18:32,160 --> 01:18:34,830 down, playing the trapezoid game on little tiny 1358 01:18:34,830 --> 01:18:41,070 trapezoids, suppose I go down only to, say, when t is 10, 1359 01:18:41,070 --> 01:18:45,280 and then do essentially the row major ones. 1360 01:18:45,280 --> 01:18:50,920 So I'm basically coarsening the leaves of the thing. 1361 01:18:50,920 --> 01:18:52,460 So to do that, I do this. 1362 01:18:52,460 --> 01:18:53,590 So now we go-- 1363 01:18:53,590 --> 01:18:54,840 ah. 1364 01:19:00,660 --> 01:19:02,670 So I have to coarsen in order to overcome the 1365 01:19:02,670 --> 01:19:04,250 procedure call overhead. 1366 01:19:04,250 --> 01:19:06,050 It has nothing to do with the cache. 1367 01:19:06,050 --> 01:19:08,440 It has to do with the fact that the way that you implement 1368 01:19:08,440 --> 01:19:10,770 recursion, recursion and function calls 1369 01:19:10,770 --> 01:19:11,770 have a cost to them. 1370 01:19:11,770 --> 01:19:15,600 And if all you're going to do is a little tiny update of 1371 01:19:15,600 --> 01:19:18,860 those few floating point operations-- 1372 01:19:18,860 --> 01:19:22,380 let's go back to the looping just to see. 1373 01:19:22,380 --> 01:19:30,540 The looping is going about 107, 108, and 1374 01:19:30,540 --> 01:19:37,310 trapezoiding at 136. 1375 01:19:37,310 --> 01:19:39,520 So unfortunately, you need a voodoo variable, but it's a 1376 01:19:39,520 --> 01:19:42,660 voodoo variable not to overcome the cache, but rather 1377 01:19:42,660 --> 01:19:45,470 to deal with the overhead on the 1378 01:19:45,470 --> 01:19:48,420 processor when you do function calls. 1379 01:19:48,420 --> 01:19:51,150 So let's see. 1380 01:19:51,150 --> 01:19:52,290 How coarse can we make it? 1381 01:19:52,290 --> 01:19:54,110 Let's try five, a coarsening of five? 1382 01:19:57,370 --> 01:19:58,620 That's still pretty good. 1383 01:20:02,040 --> 01:20:02,790 That's still pretty good. 1384 01:20:02,790 --> 01:20:04,040 How about four? 1385 01:20:07,940 --> 01:20:11,750 Still doing 131 frames a minute. 1386 01:20:11,750 --> 01:20:15,080 How about three? 1387 01:20:15,080 --> 01:20:16,620 Oh, we lost something there. 1388 01:20:19,540 --> 01:20:20,790 How about two? 1389 01:20:24,020 --> 01:20:30,070 So at a coarsening of two, I go 138, whereas the looping 1390 01:20:30,070 --> 01:20:32,890 goes at about the same. 1391 01:20:32,890 --> 01:20:33,720 I can't do 20. 1392 01:20:33,720 --> 01:20:34,780 I didn't program that in. 1393 01:20:34,780 --> 01:20:37,820 I just programmed up to 10.
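(What that coarsening looks like in the recursion, sketched in C -- CUTOFF stands in for the voodoo variable being adjusted here, and kernel is the same illustrative update as in the earlier sketch; everything else is the walk from before:)

#define CUTOFF 10   /* the coarsening "voodoo" parameter */

void walk1_coarse(int t0, int t1, int x0, int dx0, int x1, int dx1) {
    int lt = t1 - t0;
    if (lt <= CUTOFF) {
        /* coarsened base case: plain loops over up to CUTOFF rows,
           amortizing the function-call overhead */
        for (int t = t0; t < t1; t++, x0 += dx0, x1 += dx1)
            for (int x = x0; x < x1; x++)
                kernel(t, x);
    } else if (2 * (x1 - x0) + (dx1 - dx0) * lt >= 4 * lt) {
        int xm = (2 * (x0 + x1) + (2 + dx0 + dx1) * lt) / 4;   /* space cut */
        walk1_coarse(t0, t1, x0, dx0, xm, -1);
        walk1_coarse(t0, t1, xm, -1, x1, dx1);
    } else {
        int s = lt / 2;                                        /* time cut */
        walk1_coarse(t0, t0 + s, x0, dx0, x1, dx1);
        walk1_coarse(t0 + s, t1, x0 + dx0 * s, dx0, x1 + dx1 * s, dx1);
    }
}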
1394 01:20:37,820 --> 01:20:42,330 So if I go down to one, however, then you see it's not 1395 01:20:42,330 --> 01:20:42,970 that efficient. 1396 01:20:42,970 --> 01:20:46,780 But if I pick any number that's even slightly larger, 1397 01:20:46,780 --> 01:20:50,490 that gives me just enough that the function call overhead 1398 01:20:50,490 --> 01:20:51,300 ends up not being a 1399 01:20:51,300 --> 01:20:57,600 substantial cost. 1400 01:20:57,600 --> 01:20:59,630 So let me just wrap up now. 1401 01:20:59,630 --> 01:21:00,870 So I just have a couple more things. 1402 01:21:00,870 --> 01:21:05,750 So I'm not going to really talk about these, but there 1403 01:21:05,750 --> 01:21:09,640 are lots of cache oblivious algorithms that have been 1404 01:21:09,640 --> 01:21:18,250 discovered in the last 10 or 15 years for doing things like 1405 01:21:18,250 --> 01:21:23,340 matrix transposition, which is similar to rotating a matrix. 1406 01:21:23,340 --> 01:21:26,720 You can do that in a cache oblivious fashion. 1407 01:21:26,720 --> 01:21:32,270 Strassen's algorithm, which does matrix multiplication 1408 01:21:32,270 --> 01:21:34,720 using fewer than n cubed operations. 1409 01:21:34,720 --> 01:21:40,440 The FFT can be computed in a cache oblivious fashion. 1410 01:21:40,440 --> 01:21:44,770 And LU decomposition is a popular 1411 01:21:44,770 --> 01:21:48,740 way to solve linear systems. 1412 01:21:48,740 --> 01:21:52,690 In addition, there are cache oblivious data structures, and 1413 01:21:52,690 --> 01:21:53,760 here are just a few of them. 1414 01:21:53,760 --> 01:21:57,060 There's cache oblivious B-Trees and priority queues, 1415 01:21:57,060 --> 01:22:00,820 and things called ordered-file maintenance. 1416 01:22:00,820 --> 01:22:01,850 There's a whole raft. 1417 01:22:01,850 --> 01:22:05,740 There's probably now several hundred papers written on 1418 01:22:05,740 --> 01:22:09,240 cache oblivious algorithms, so it's something you should be aware 1419 01:22:09,240 --> 01:22:12,970 of, and you should understand how you go about designing an 1420 01:22:12,970 --> 01:22:14,040 algorithm of this nature. 1421 01:22:14,040 --> 01:22:17,290 Not all of them are straightforward. 1422 01:22:17,290 --> 01:22:22,910 For example, the FFT one does divide and conquer but not by 1423 01:22:22,910 --> 01:22:24,010 dividing it into two. 1424 01:22:24,010 --> 01:22:26,750 It divides it into square root of n pieces of size square 1425 01:22:26,750 --> 01:22:31,460 root of n each in order to get a good cache efficient 1426 01:22:31,460 --> 01:22:34,060 algorithm that doesn't have any tuning parameters. 1427 01:22:34,060 --> 01:22:38,010 But almost all of them, since they're recursive, do have 1428 01:22:38,010 --> 01:22:41,090 this annoying thing that you have to still coarsen the base 1429 01:22:41,090 --> 01:22:45,250 case in order to get really good performance if you're not 1430 01:22:45,250 --> 01:22:50,600 doing a lot of work in the leaves of the recursion. 1431 01:22:50,600 --> 01:22:51,850 So any questions?