The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

CHARLES LEISERSON: So today, more parallel programming, as we will do for the next couple of lectures as well. Today we're going to look at how to analyze multithreaded algorithms, and I'm going to start out with a review of what I hope most of you know from 6.006 or 6.046, which is how to solve divide-and-conquer recurrences. Now, we know that we can solve them with recursion trees, and that gets tedious after a while, so I want to go through the so-called Master Method to begin with, and then we'll get into the content of the course. It will be very helpful, since we're going to do so many divide-and-conquer recurrences. The difference between these divide-and-conquer recurrences and the ones for caching is that with caching, all the trickiness is in the base condition. Here, all the recurrences are going to be nice and clean, just like you learned in your algorithms class. So we'll start by talking about that, and then we'll go through several examples of analysis of algorithms. And it'll also tell us something about what we need to do to make our code go fast.

So the main method we're going to use is called the Master Method. It's for solving recurrences of the form T(n) = aT(n/b) + f(n), where we have some technical conditions: a is greater than or equal to 1, b is greater than 1, and f is asymptotically positive, meaning that f(n) is positive once n gets large enough. When we give a recurrence like this, if the base case is order 1, it's conventional not to give it, to just assume, yeah, we understand that when n is small enough, the result is constant. As I say, that's the place where this differs from the way we solve recurrences for caching, where you have to worry about what the base case of the recurrence is.
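In notation, the recurrence under discussion, with its implicit base case, is the following (a reconstruction of the spoken formula):

```latex
T(n) = a\,T(n/b) + f(n), \qquad a \ge 1,\quad b > 1,\quad f(n)\ \text{asymptotically positive},
```

with T(n) = Θ(1) for all sufficiently small n left implicit.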
So the way to solve this is, in fact, the way we've seen before: it's a recursion tree. We start with T(n), and then we replace T(n) by the right-hand side, just by substitution. What's always going to be in the tree as we develop it is the total amount of work. So we basically replace it by f(n) plus a copies of T(n/b). And then each of those we replace by a copies, so we get T(n/b^2), and so forth, continually replacing until we get down to T(1). And at the point T(1), we can no longer substitute, but we know that T(1) is order 1.

And now what we do is add across the rows. So we get f(n), then a f(n/b), then a^2 f(n/b^2), and we keep going down to the height of the tree. We're dividing the argument by b each time, so to get down to 1 takes log base b of n levels. The number of leaves, since this is a regular a-ary tree, is a to the height, which is a to the log base b of n, and that is just n^(log_b a): just this term, not the sum, just this term. And for each of those leaves, we're paying T(1), which is order 1.

Now, it turns out that if I add up all these terms, there's no closed-form solution. But there are three common situations that occur in practice, and the three cases have to do with comparing the number of leaves, which contributes n^(log_b a) times order 1, with f(n).
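Adding across the rows, the picture the recursion tree gives is this level-by-level sum (reconstructed in notation; the height of the tree is log_b n):

```latex
T(n) \;=\; \sum_{i=0}^{\log_b n \,-\, 1} a^{i}\, f\!\left(\frac{n}{b^{i}}\right) \;+\; \Theta\!\left(a^{\log_b n}\right),
\qquad a^{\log_b n} \;=\; n^{\log_b a}.
```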
So the first case is the case where n^(log_b a) is bigger than f(n). So whenever you're given one of these recurrences, compute n^(log_b a). I hope this is a repeat for most people; if not, that's fine, but hopefully it'll get you caught up. If n^(log_b a) is much bigger than f(n), then these terms are geometrically increasing as you go down the tree, and since the sum is geometrically increasing, all that matters is what's at the leaves. In fact, it has to be not just greater: it's got to be greater by a polynomial amount, by some n^epsilon for some epsilon greater than 0. So it might be n^(1/2), it might be n^(1/3), it could be n^100. But what it can't be is log n, because log n is less than any polynomial amount. So n^(log_b a) has to exceed f(n) by at least a factor of n^epsilon for some epsilon. In that case, the sum is geometrically increasing, and the answer is just what's at the leaves. So that's case one: geometrically increasing.

Case two is when things are actually fairly equal on every level, and the general case we'll look at is when the sum is arithmetically increasing. In particular, this occurs when f(n) is n^(log_b a) times log^k n, for some constant k that's at least 0. If k is equal to 0, it just says that f(n) is exactly the same as the number of leaves. In that case, it turns out that every level has almost exactly the same amount, and since there are log n levels, you tack on an extra log n for the solution: the solution is one more log. It turns out that whenever the sum grows arithmetically with the layer, you basically tack on one extra log. It's like summing a series: if you have a summation that goes from i equals 1 to n of i^2, the result is proportional to n^3, and similarly, if the terms are i^k, the result is going to be proportional to n^(k+1). That's basically what's going on here.

And then the third case is when the sum is geometrically decreasing, when the amount at the root dominates.
So in this case, n^(log_b a) is much less than f(n); specifically, it's at least a factor of n^epsilon smaller than f(n), for some constant epsilon greater than 0. It turns out, in addition, you need f(n) to satisfy a regularity condition, but this regularity condition is satisfied by all the normal functions that we're going to come up against. It's not satisfied by things like n^(sin n), which oscillates like crazy, and it also isn't satisfied by exponentially growing functions. But it is satisfied by anything that's polynomial, or polynomial times a logarithm, or what have you. So generally, we don't really have to check this too carefully. And then the answer there is just order f(n): since the sum is geometrically decreasing, f(n) at the root dominates.
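Putting the three cases side by side (reconstructed in standard notation; epsilon > 0 and k >= 0 are constants, and the regularity condition in case 3 is written in its usual form):

```latex
\begin{aligned}
\textbf{Case 1:}\quad & f(n) = O\!\left(n^{\log_b a - \epsilon}\right)
  &&\Longrightarrow\quad T(n) = \Theta\!\left(n^{\log_b a}\right),\\[2pt]
\textbf{Case 2:}\quad & f(n) = \Theta\!\left(n^{\log_b a}\lg^{k} n\right)
  &&\Longrightarrow\quad T(n) = \Theta\!\left(n^{\log_b a}\lg^{k+1} n\right),\\[2pt]
\textbf{Case 3:}\quad & f(n) = \Omega\!\left(n^{\log_b a + \epsilon}\right)
  \text{ and } a\,f(n/b) \le c\,f(n) \text{ for some } c < 1
  &&\Longrightarrow\quad T(n) = \Theta\!\left(f(n)\right).
\end{aligned}
```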
So is this review for everybody? Pretty much, yeah? You can do this in your head, because we're going to ask you to do this in your head during the lecture. Yeah, we're all good? OK, good.

One of the things that students, when they learn this in an algorithms class, don't recognize is that this also tells you where in your recursive program you should bother to try to eke out constant factors. So if you think about it, for example, in case three here, the sum is geometrically decreasing. Does it make sense to try to optimize the leaves? No, because very little time is spent there. It makes sense to optimize what's going on at the root, and to save anything you can at the root. And sometimes the root in particular has special properties that aren't true of the internal nodes, properties you can take advantage of there which you may not be able to take advantage of in general. But since the time is going to be dominated by the root, trying to save at the root makes sense. Correspondingly, if we're in case one, it's absolutely critical that you coarsen the recursion, because all the work is down at the leaves. And so if you want to get additional performance, you want to basically move the base case up high enough that you can cut off that constant overhead, and get factors of two, three, sometimes more, out of your code. So understanding the structure of the recursion allows you to figure out where it is that you should optimize your code. Of course, with loops, it's much easier. Where do you spend your time with loops to make code go fast? The innermost loop, right, because that's the one that's executing the most; the outer loops are not that important. This is the corresponding thing for recursion: figure out where the recursion is spending its time, and that's where you spend your time eking out extra factors.

Here's the cheat sheet. If f(n) is n^(log_b a - epsilon), the answer is n^(log_b a). If it's n^(log_b a + epsilon), the answer is f(n). And if it's n^(log_b a) times a logarithmic factor, where the log has exponent greater than or equal to 0, you add one to the exponent of the log. This is not all of the situations; there are recurrences it doesn't cover.

OK, quick quiz. T(n) = 4T(n/2) + n. What's the solution? n^2, good. So here n^(log_b a) is n^(log_2 4), which is n^2. That's much bigger than n; it's bigger by a factor of n. Here an epsilon of 1 would do, so would an epsilon of 1/2 or an epsilon of 1/4, but in particular, an epsilon of 1 would do. That's case one. The n^2 dominates, so the answer is n^2. The basic idea is, whichever side dominates, in case one and case three, that's the one that is the answer. Here we go. What about this one? [T(n) = 4T(n/2) + n^2.] n^2 log n, because the two sides are about the same size: it's n^2 times log^0 n, so tack on the extra log. How about this one? [T(n) = 4T(n/2) + n^3.] n^3. How about this one? [T(n) = 4T(n/2) + n^2/log n.]

AUDIENCE: [INAUDIBLE].
The Master Theorem [INAUDIBLE]?

CHARLES LEISERSON: Yeah, the Master Theorem does not apply to this one. It looks like it's case two with an exponent of minus 1, but that's bogus, because the exponent of the log must be greater than or equal to 0. So the Master Theorem does not apply here; this recurrence actually has the solution n^2 log log n, but that's not covered by the Master Theorem. You can have an infinite hierarchy of ever-narrower cases like this. So if you have something that looks like a Master Theorem type of recurrence but the theorem doesn't give you a solution, what's your best strategy for solving it?

AUDIENCE: [INAUDIBLE].

CHARLES LEISERSON: What's that?

AUDIENCE: [INAUDIBLE].

CHARLES LEISERSON: So a recursion tree can be good, but actually, the best is the substitution method, basically proving it by induction. The recursion tree can be very helpful in giving you a good guess for what you think the answer is, but the most reliable way to prove any of these things is the substitution method. Good enough. So that was review for, I hope, most people? Yeah? OK, good.
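For the record, the four quiz recurrences and their solutions, as discussed above (the left-hand sides are reconstructed from the spoken answers):

```latex
\begin{aligned}
T(n) &= 4T(n/2) + n          &&\Longrightarrow\; T(n) = \Theta(n^2)           &&\text{(case 1)},\\
T(n) &= 4T(n/2) + n^2        &&\Longrightarrow\; T(n) = \Theta(n^2 \lg n)     &&\text{(case 2, } k = 0\text{)},\\
T(n) &= 4T(n/2) + n^3        &&\Longrightarrow\; T(n) = \Theta(n^3)           &&\text{(case 3)},\\
T(n) &= 4T(n/2) + n^2/\lg n  &&\Longrightarrow\; T(n) = \Theta(n^2 \lg\lg n)  &&\text{(Master Theorem does not apply)}.
\end{aligned}
```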
OK, let's talk about parallel programming. We're going to start out with loops. So last time, we talked about how the Cilk++ runtime system is based on, essentially, implementing spawns and syncs using the work-stealing algorithm, and we talked about scheduling and so forth. We didn't talk about how loops are implemented, except to mention that they're implemented with divide and conquer. So here I want to go into the subtleties of loops, because probably most parallel programs that occur in the real world these days are programs where people just simply say, make this a parallel loop. That's it. So let's take as an example the in-place matrix transpose, where we're basically trying to flip everything along the main diagonal. I've used this figure before, I think. Let's just do it not cache-efficiently. The cache-efficient divide-and-conquer algorithm actually parallelizes beautifully as well, but let's not look at that version; let's look at a looping version, to understand what's going on. And once again here, as I did before, I'm going to make the indices for my implementation run from 0, not 1. Basically, I have an outer loop that goes from i equals 1 up to n minus 1, and an inner loop that goes from j equals 0 up to i minus 1, and then I do a little swap in there. And in this code, the outer loop is the one I've parallelized; the inner loop is running serially.

So let's analyze this particular piece of code to understand what's going on. The way this actually gets implemented is as follows. Here's the code on the left. What the Cilk++ compiler actually does is convert the loop into recursion, divide-and-conquer recursion. So it has a routine on a range from lo to hi; that's the common case, and we're going to call it on the range from 1 to n minus 1, because those are the indices I've given to the cilk_for loop. And if I have a range of values of i to do divide and conquer on, I basically divide that range in half, then I recursively spawn off the first half, execute the second half, and then cilk_sync, so the two halves are going off in parallel. And if I'm at the base, then I go through the inner loop and do the swaps for those values of i. So the outer loop is the parallel loop; that's the one we're doing divide and conquer on. We basically recursively spawn the first half, execute the second, and each of those recursively does the same thing until all the iterations have been done. Any questions about how that operates? So this is the way all parallel loops are done: basically this strategy.
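Here's a sketch of that transformation in Cilk-style C. The global matrix A, its dimension n, and the GRAIN constant are assumptions for illustration; the actual compiler-generated code differs in its details:

```c
#include <cilk/cilk.h>

#define GRAIN 1        /* coarsening threshold; discussed next */

extern int n;          /* assumed: matrix dimension */
extern double **A;     /* assumed: an n x n matrix, indexed from 0 */

/* The loop as the programmer writes it: outer loop parallel,
   inner loop serial. */
void transpose(void) {
    cilk_for (int i = 1; i < n; ++i) {
        for (int j = 0; j < i; ++j) {
            double t = A[i][j];
            A[i][j] = A[j][i];
            A[j][i] = t;
        }
    }
}

/* Roughly what the compiler turns the cilk_for into:
   divide and conquer over the iteration range [lo, hi). */
static void recur(int lo, int hi) {
    if (hi - lo > GRAIN) {
        int mid = lo + (hi - lo) / 2;
        cilk_spawn recur(lo, mid);   /* first half in parallel...  */
        recur(mid, hi);              /* ...with the second half    */
        cilk_sync;                   /* wait for both halves       */
    } else {
        for (int i = lo; i < hi; ++i)    /* base case: serial iterations */
            for (int j = 0; j < i; ++j) {
                double t = A[i][j];
                A[i][j] = A[j][i];
                A[j][i] = t;
            }
    }
}

void transpose_expanded(void) {
    recur(1, n);   /* iterations i = 1 .. n-1 */
}
```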
Now here, I mentioned that this base-case test is actually coarsened. We don't want to recurse all the way down to one iteration, because then we'd pay the recursion-call overhead on every single iteration. So in fact, what happens is you go down to some grain size of some number of iterations, and at that number of iterations, it just runs through them as an ordinary serial for loop, in order not to pay the function-call overhead all the way down. We're going to look at exactly that issue.

So let's look at this using the DAG model that I introduced last time. Remember that the rectangles here count as activation frames, stack frames on the call stack, and the circles here are strands, which are sequences of serial code. So what's happening here is, essentially, I'm running the code that divides the range into two parts, and I spawn one part. Then this guy spawns the other and waits for the return, and then these guys come back. I keep doing that recursively, and when I get to the bottom, I run through the innermost loop, which starts out with just one element to do, then two, then three. So in this case where I've drawn eight, I go through eight elements at the bottom here, if this were an eight-by-eight matrix that I was transposing. So there's more work in these leaves over here than there is over there. It's not something you can just map onto processors in some naive fashion; it does take some load balancing to parallelize this loop. Any questions about what's going on here? Yeah?

AUDIENCE: Why is it that it's one, two, three, four, up to eight?

CHARLES LEISERSON: Take a look: the inner loop goes from j equals 0 to i. So this guy just does one iteration of the inner loop, this guy does two, this guy does three, all the way up to this guy doing eight iterations, if it were an eight-by-eight matrix. And in general, if it's n by n, the leaves go from one unit of work at this end up to n units at that end, because I'm basically iterating through a triangular iteration space to do the transpose, and each leaf is swapping row by row. Questions? Is that good?
Everybody see what's going on? So now let's analyze this for work and span. So what is the work of this, in terms of n, if I have an n-by-n matrix? What's the work? The work is the ordinary serial running time, right? It's n^2. Good. So basically, it's order n^2, because the leaves are adding up as an arithmetic sequence up to n, and so the total amount in the leaves is order n^2.

What about this part up here? How much does that cost us in work; how much is in the control overhead of doing that outer loop? So asymptotically, how much is in here? The total is going to be n^2, that I guarantee you. But what's going on up here? How do I count that up?

AUDIENCE: I'm assuming that each strand is going to be constant time?

CHARLES LEISERSON: Yeah, in this case, it is constant time for these control nodes up here, because what am I doing? All I'm doing is the recursion code where I divide the range and then spawn off two things. That takes only a constant amount of manipulation. So this control part is all order n total. The reason is that, in some sense, there are n leaves here, and if you have a full binary tree, meaning every node either is a leaf or has two children, then the number of internal nodes of the tree is one less than the number of leaves. That's a basic property of full binary trees: the number of internal nodes here is going to be n minus 1. In particular, we have 7 here. Is that good? So this control part doesn't contribute significantly to the work; just this leaf part contributes to the work. Is that good?

What about the span for this? What's the span?

AUDIENCE: Log n.

CHARLES LEISERSON: It's not log n, but your heads are in the right place.
AUDIENCE: The longest path is going [INAUDIBLE].

CHARLES LEISERSON: So which is the longest path going to be here? Starting here and ending there, which way do we go?

AUDIENCE: Go all the way down.

CHARLES LEISERSON: Which way?

AUDIENCE: To the right.

CHARLES LEISERSON: Down to the right, over, down through this guy. How big is this guy? n. Then back up this way. So how much is in the part going down?

AUDIENCE: Log n.

CHARLES LEISERSON: Going down and up is log n, but this leaf is n. Good. So it's basically order n plus order log n: order n down here in the biggest leaf, plus order log n going up and down. That's order n. So the parallelism is the ratio of those two things, which is order n. That's got good parallelism. And if you imagine doing this on a large number of processors, it's very easy to get your benchmark of, say, 10 times more parallelism than the number of processors that you're running on. Everybody follow this? Good.

So the span of the loop control is order log n. And in general, when you have a for loop with n iterations, the loop control itself is going to add log n to the span; every time you hit a parallel loop, you add log of whatever the number of iterations is. And then we add the maximum span of the body. The worst case for the body here is when it's doing the whole row, because whenever we're looking at spans, we're always looking at the maximum over the things that are operating in parallel. Everybody good? Questions? Great.
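To summarize the analysis of this version, with only the outer loop parallel (notation reconstructed from the discussion):

```latex
\begin{aligned}
\text{Work: } T_1(n) &= \Theta(n^2) && \text{(triangular iteration space, plus } \Theta(n) \text{ loop control)},\\
\text{Span: } T_\infty(n) &= \Theta(\log n) + \Theta(n) \;=\; \Theta(n) && \text{(loop control, plus the longest serial row)},\\
\text{Parallelism: } \frac{T_1(n)}{T_\infty(n)} &= \Theta(n).
\end{aligned}
```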
So now let's do something a little more parallel: let's make both loops be parallel. So here we have a cilk_for loop on the outside, and then another cilk_for loop on the interior. Let's see what we get. So what's the work for this?

AUDIENCE: n squared.

CHARLES LEISERSON: Yeah, n squared. That's not going to change. What's the span?

AUDIENCE: Log n.

CHARLES LEISERSON: Yeah, log n. It's log n because the span of the outer control loop is going to add log n. The max span of the inner control loop: well, the inner loops are going from log of 1 up to log of i, but the maximum of those is going to be proportional to log n, even though the inner loops aren't all the same size. And the span of the body is now order 1. And we add the logs, because those things are in series; we don't multiply them. What we're doing is looking at the worst case: I have to do the control for this, plus the control for this, plus the worst iteration of the body, which in this case is just order 1. So the total is order log n. That can be confusing for people, why it is that we add here rather than multiply or do something else. So let me pause here for questions, if people have questions. Everybody with us? Anybody want clarification, or want to make a point that would lead to clarification? Yes, question.

AUDIENCE: If you were going to draw a tree like the previous slide, what would it look like?

CHARLES LEISERSON: Let's see. I had wanted to do that, and it got out of control. So what it would look like is this: if we go back to the previous slide, it basically would look like this, except that each one of these leaves is replaced by a control tree of its own, with as many leaves as the number here indicates. So once again, this one would be the leaf with the longest span, because it would be log of the largest number. But basically, each one of these would become a tree that came from this. Is that clear? That's a great question. Anybody else have questions as illuminating as that one? Everybody understand that explanation, what the tree would look like? OK, good.
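In notation, the span of the doubly parallel version adds up as follows (a reconstruction of the argument just given):

```latex
T_\infty(n)
\;=\; \underbrace{\Theta(\log n)}_{\text{outer loop control}}
\;+\; \underbrace{\max_{1 \le i < n}\,\Theta(\log i)}_{\text{inner loop control}}
\;+\; \underbrace{\Theta(1)}_{\text{body}}
\;=\; \Theta(\log n).
```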
So the parallelism here is n^2 over log n. Now, it's tempting, when you do parallel programming, to say, therefore this is the better parallel code. And the reason is, well, it does asymptotically have more parallelism. But generally, when you're programming, you're not trying to get the most parallelism. What you're trying to do is get sufficient parallelism. So if n is sufficiently large, it's going to be way more than enough. If n is a million, which is a typical problem size for a big loop, or even if it's a few thousand or whatever, it may be just fine to have parallelism on the order of 1,000, which is what the first version gives you. 1,000 iterations is generally a small number of iterations, so a 1,000-by-1,000 matrix is going to generate parallelism of 1,000 in the first version. In the second version, we're going to get parallelism of about 1 million divided by the log, 10 or 20, so something like 100,000. But if I have a 1,000-by-1,000 matrix, the difference between having parallelism of 1,000 and parallelism of 100,000, when I'm running on, let's say, 100 cores, doesn't matter. Up to 100 cores, it doesn't matter. And in fact, on 100 cores, that's really a tiny problem compared to the amount of memory you're going to have: a 1,000-by-1,000 matrix is tiny compared to the size of memory that you're going to have access to, and so forth. So for big problems, you really want to look at this and say: of the implementations that have ample parallelism, which ones are really going to give me the best bang for the buck on reasonable machine sizes?

That's different from things like work, or serial running time. Usually less running time is better, and in fact it's always better. But here, with parallelism: yes, it's good to reduce your span, but you don't have to minimize it to the extreme. You just have to get it small enough. Whereas the work term is the one you really want to minimize, because that's what you're going to have to do even in a serial implementation. Question.

AUDIENCE: So are you suggesting that the other code was OK?

CHARLES LEISERSON: We're going to look a little bit closer at the issue of overheads.
We're now going to take a look at what the difference between these two codes really is; we'll come back to that question in a minute. The way I want to do it is to take a look at the issue of overheads with a simpler example, where we can see what's really going on. So here, what I've got is a loop that is basically just doing vector addition: for i equals 0 to n minus 1, add b[i] into a[i]. Pretty simple code, and we want to make that be a parallel loop. So I get a recursion tree that looks like this, where I have constant work at each leaf. And of course, the work is order n, because I've got n leaves, and the internal control nodes are all constant-size strands, so this is all just order n work. And the span is basically log n, as we've seen, by going down one of these paths, for example. So the parallelism for this is order n over log n. A very simple problem.

But now let's look more closely at the overheads. The problem is that this work term contains substantial overhead. In other words, if I hadn't coarsened the recursion at all in the implementation of cilk_for, if the developers hadn't done that, then I've got a function call, n function calls here, for doing a single addition of values at each leaf. I've got n minus 1 of these control guys, which is approximately n, and I've got n of these leaves. And which are bigger, the control guys or the leaves? The control guys are way bigger: they've got a function call in there. This guy right here just has what? One floating-point addition. And so if I really did my divide and conquer down to a single element, this would be way slower on one processor than if I just ran it as a for loop. Because if I run a serial for loop, it just goes through, and the only overhead is incrementing i and testing for termination. That's it. And of course, that's a predictable branch, because it almost never terminates until it actually terminates, so that's exactly the sort of thing that makes a really, really tight loop with very few instructions.
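Concretely, the two versions being compared look something like this (a sketch; the function names are illustrative):

```c
#include <cilk/cilk.h>

/* Serial version: the only per-iteration overhead is the increment
   and the (highly predictable) termination test. */
void vadd_serial(double *a, const double *b, int n) {
    for (int i = 0; i < n; ++i)
        a[i] += b[i];
}

/* Parallel version: without coarsening, the divide-and-conquer
   expansion would pay a function call's worth of spawn overhead
   for every single floating-point addition. */
void vadd_parallel(double *a, const double *b, int n) {
    cilk_for (int i = 0; i < n; ++i)
        a[i] += b[i];
}
```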
But in the parallel implementation, there's going to be this function-call overhead everywhere. And so, in principle, this cilk_for loop would not be as efficient. It actually is efficient, but we're going to explain what goes on in the runtime system to understand why.

So here's the idea, and you can control this with a pragma. A pragma is a statement to the compiler that gives it a hint. And here, the pragma says: you can name a grain size and give it a value g. What that says is, rather than doing just one element when you get down to the bottom of the recursion, do g elements in a serial for loop when you get down to the bottom. That way, you halt the recursion earlier and you have fewer of these internal nodes. And if you make the grain size sufficiently large, you won't be able to see the cost of the recursion at the top.
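The grain-size pragma looks roughly like this in Cilk++/Cilk Plus syntax (a sketch; G can be any expression, and the loop body here is just the running example):

```c
#include <cilk/cilk.h>

void vadd_coarsened(double *a, const double *b, int n, int G) {
    /* Hint to the compiler and runtime: stop recursing once a range
       has at most G iterations, and run those serially at the leaf. */
    #pragma cilk grainsize = G
    cilk_for (int i = 0; i < n; ++i)
        a[i] += b[i];
}
```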
So let's analyze what happens when we do this; we can understand it vis-a-vis this equation. The idea here is: imagine that t_iter is the basic time for one iteration of the loop. Then the amount of work I have to do is n times the time for an iteration of the loop. And then, depending upon my grain size, I've got the work having to do with the internal nodes, and there are basically going to be n over g of those, times the time for a spawn, which I'm saying is the time to execute one of these control nodes. So if the leaves are batched into groups of g, then there are n over g such leaves. There's a minus 1 in there, but it doesn't matter; it's basically n over g times the time for the internal nodes. So everybody see where I'm getting this? I'm trying to account for the constants in the implementation. People follow where I'm getting this? Ask questions. I see a couple of people who are sort of going, not sure I understand. Yes?

AUDIENCE: The constants [INAUDIBLE].

CHARLES LEISERSON: Yes. So basically, the constants are these t_iter and t_spawn. t_spawn is the time to execute all that spawning mess; t_iter is the time to execute one iteration within a leaf. I'm doing, in this case, g of them per leaf. So I have n over g leaves, but each one is doing g iterations, so it's n over g times g, which is a total of n iterations, which makes sense: I should be doing n iterations if I'm adding two vectors. So that accounts for all the work in the leaves. Then, in addition, I've got all the work for the spawning, which is n over g times t_spawn. And as I say, you can play with the grain size yourself by just sticking in different grain-size directives. Otherwise, it turns out that the Cilk runtime system will pick what it deems to be a good grain size. And it usually does a good job, except sometimes. And that's why there's a parameter there: if there's a parameter, you can override the runtime's choice. Yes?

AUDIENCE: Is the pragma something that is enforced, or is it something that says, hey--

CHARLES LEISERSON: It's a hint.

AUDIENCE: It's a hint.

CHARLES LEISERSON: Yes, it's a hint. In other words, the compiler could ignore it.

AUDIENCE: The compiler is going to be like, oh, that's the total [INAUDIBLE] constant.

CHARLES LEISERSON: It's supposed to be something that gives a hint for performance reasons but does not affect the correctness of the program. The program is going to do the same thing regardless; the question is just what hint we give to the compiler and the runtime system. Yeah?

AUDIENCE: My question is, there are these cases where you say that the runtime system fails to find an appropriate value for that [INAUDIBLE]--
I mean, basically, it chooses one that's not as good. If you put a pragma on it, will the runtime system choose the one that you give it, or still choose--

CHARLES LEISERSON: If you give it, then in the current implementation, the runtime system always picks whatever you say here. And that can be an expression; you can evaluate something there. It's not just a constant. It could be the maximum of this and that times whatever, et cetera. Is that good?

So this is a description of the work. Now let's get a description, with the constants again, of the span. So what are the constants going to be for the span? Well, I'm now executing each leaf serially. For the span, we're basically going to go down one of these paths and back up; I'm not sure which one, but they're all fairly symmetric. But when I get to the leaf, I'm executing the leaf serially. So I'm going to pay g times the time per iteration, executed serially, plus log of n over g, where n over g is the number of leaves I have here, times the cost of a spawn, basically. Does that make sense?

So the idea is, what do we want to have here if I want good parallel code? We would like the work to be as small as possible. How do I make the work small? How can I set g to make the work small?

AUDIENCE: [INAUDIBLE].

CHARLES LEISERSON: Make g--

AUDIENCE: Square root of n.

CHARLES LEISERSON: Well, make g big or little? If you want this overhead term to be small, you want g to be big. But we also want to have a lot of parallelism, so I want the span term to be what? Small, which means I need to make g what? Well, we've got an n over g here, but it's inside a log, and it enters with a minus sign: log of n over g is log n minus log g, so making g bigger barely helps there, while the g times t_iter term grows. So really, to get the span small, I want g to be small. So I have tension, a trade-off. Let's analyze this a little bit.
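Written out with the constants, the two quantities in tension are these (reconstructed from the discussion):

```latex
\begin{aligned}
\text{Work: } T_1 &= n\,t_{\mathrm{iter}} \;+\; \frac{n}{g}\,t_{\mathrm{spawn}},\\[2pt]
\text{Span: } T_\infty &= g\,t_{\mathrm{iter}} \;+\; \log\!\left(\frac{n}{g}\right) t_{\mathrm{spawn}}.
\end{aligned}
```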
Essentially, if I look at this, from the span I want g to be small. But for the work, what I would like is for the first term to dominate the second term. If the first term here dominates the second term, then the work is going to be the same as if I did an ordinary for loop, to within a few percent. So if I take the ratio of these things, I want g to be bigger than the time to spawn divided by the time to iterate, t_spawn over t_iter. If g is much bigger than that, then this first term will be much bigger than that second term, and I don't have to worry about the spawn term. So I want g to be much bigger than that ratio, but otherwise as small as possible. There's no point in making it much bigger than what it takes to essentially wipe out the spawn overhead. People follow that?

So basically, the idea is that we pick a grain size that's large, but not too large; that's what you generally want to do, so that you have enough parallelism but you don't pay the overhead. The way the runtime system does it is with a somewhat complicated heuristic. It actually looks to see how many processors you're running on, and it uses a heuristic that says: let's make sure there's parallelism at least 10 times the number of processors, but there's no point in a grain size of more than about 500 iterations, because at 500 iterations, you can't see the spawn overhead regardless. So basically, it uses a formula of that nature to pick the grain size automatically. But you're free to pick it yourself. The point is that although cilk_for is doing divide and conquer, there is this issue of coarsening, and you do want to make sure that you have enough work to do in each of the leaves of the computation. And as I say, usually it'll guess right, but if you have trouble with that, you have a parameter you can play with.

Let's take a look at another implementation, just to try to understand this issue. Suppose I'm going to do a vector add.
783 00:43:06,100 --> 00:43:10,110 So here I have a vector add of two arrays, where I'm 784 00:43:10,110 --> 00:43:17,750 basically saying ai gets the value of b added into it. 785 00:43:17,750 --> 00:43:20,260 That's kind of the code we had before. 786 00:43:20,260 --> 00:43:25,440 But now, what I want to do is I'm going to implement a 787 00:43:25,440 --> 00:43:26,950 vector add using cilk spawn. 788 00:43:30,560 --> 00:43:34,160 So rather than using a cilk_for loop, I'm going to 789 00:43:34,160 --> 00:43:37,660 parallelize this loop by hand using cilk spawn. 790 00:43:37,660 --> 00:43:41,240 What I'm going to do is I'm going to say for j equals 0 up 791 00:43:41,240 --> 00:43:42,970 to-- and I'm going to jump by whatever my 792 00:43:42,970 --> 00:43:45,020 grain size is here-- 793 00:43:45,020 --> 00:43:50,610 and spawn off the addition of things of size, essentially, 794 00:43:50,610 --> 00:43:53,180 g, unless I get close to the end of the array. 795 00:43:53,180 --> 00:43:57,440 But basically, I'm always spawning off the next g 796 00:43:57,440 --> 00:44:00,200 iterations to do that in parallel. 797 00:44:00,200 --> 00:44:03,280 And then I sync all these spawns. 798 00:44:03,280 --> 00:44:06,180 So everybody understand the code? 799 00:44:06,180 --> 00:44:07,270 I see nods. 800 00:44:07,270 --> 00:44:09,740 I want to see everybody nod, actually, when I do this. 801 00:44:09,740 --> 00:44:12,690 Otherwise what happens is I see three people nod, and I 802 00:44:12,690 --> 00:44:13,770 assume that people are nodding. 803 00:44:13,770 --> 00:44:15,760 Because if you don't do it, you can shake your head, and I 804 00:44:15,760 --> 00:44:18,410 promise none of your friends will see that you're 805 00:44:18,410 --> 00:44:21,280 shaking your head. 806 00:44:21,280 --> 00:44:23,880 And since the TAs are doing the grading and they're facing 807 00:44:23,880 --> 00:44:26,450 this way, they won't see either. 808 00:44:26,450 --> 00:44:29,900 And so it's perfectly safe to let me know, and that way I 809 00:44:29,900 --> 00:44:31,150 can make sure you understand. 810 00:44:33,590 --> 00:44:37,290 So everybody understand what this does? 811 00:44:37,290 --> 00:44:38,500 OK, so I see a few more. 812 00:44:38,500 --> 00:44:38,820 No. 813 00:44:38,820 --> 00:44:39,910 OK, question? 814 00:44:39,910 --> 00:44:43,540 Do you have a question, or should I just explain again? 815 00:44:43,540 --> 00:44:49,490 So this is basically doing a vector add of b into a, of n 816 00:44:49,490 --> 00:44:51,970 iterations here. 817 00:44:51,970 --> 00:44:54,910 And we're going to call it here, when I do a vector add, 818 00:44:54,910 --> 00:44:57,490 of basically g iterations. 819 00:44:57,490 --> 00:45:00,670 So what it's doing is it's going to take my array of size 820 00:45:00,670 --> 00:45:05,590 n, bust it into chunks of size g, and spawn off the first 821 00:45:05,590 --> 00:45:08,230 one, spawn off the second one, spawn off the third one, each 822 00:45:08,230 --> 00:45:11,310 one to do g iterations. 823 00:45:11,310 --> 00:45:13,340 That make sense? 824 00:45:13,340 --> 00:45:14,700 We'll see it. 825 00:45:14,700 --> 00:45:17,330 So here's sort of the instruction stream 826 00:45:17,330 --> 00:45:18,980 for the code here. 827 00:45:18,980 --> 00:45:22,810 So basically, it says here is one, we spawn off something of 828 00:45:22,810 --> 00:45:27,370 size g, then we go on, we spawn off something else of 829 00:45:27,370 --> 00:45:28,870 size g, et cetera. 
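For reference, the code being described looks something like this. It's my reconstruction in Cilk-style C++, not the slide verbatim; the names vadd and vadd_spawned and the header spelling are assumptions:

    #include <cilk/cilk.h>  // cilk_spawn / cilk_sync (header as in Cilk Plus;
                            // the course's cilk++ spelling may differ slightly)

    // Serial vector add: A[i] += B[i] for n elements.
    void vadd(double *A, double *B, int n) {
        for (int i = 0; i < n; ++i)
            A[i] += B[i];
    }

    // Hand-parallelized version: spawn off one chunk of g iterations
    // at a time, then wait for all of the chunks at the cilk_sync.
    void vadd_spawned(double *A, double *B, int n, int g) {
        for (int j = 0; j < n; j += g) {
            int len = (g < n - j) ? g : n - j;  // last chunk may be short
            cilk_spawn vadd(A + j, B + j, len);
        }
        cilk_sync;
    }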
830 00:45:28,870 --> 00:45:32,740 We keep going up there until we hit the cilk sync. 831 00:45:32,740 --> 00:45:34,480 That make sense? 832 00:45:34,480 --> 00:45:38,610 Each of these is doing a vector add of size g using 833 00:45:38,610 --> 00:45:40,015 this serial routine. 834 00:45:42,610 --> 00:45:46,420 So let's analyze this to understand the efficiency of 835 00:45:46,420 --> 00:45:49,980 this type of looping structure. 836 00:45:49,980 --> 00:45:52,910 So let's assume for this analysis that g equals 1, to 837 00:45:52,910 --> 00:45:54,690 make it easy, so we don't have to worry about it. 838 00:45:54,690 --> 00:45:57,840 So we're simply spawning off one thing here, one thing 839 00:45:57,840 --> 00:46:01,760 here, one iteration here, all the way to the end. 840 00:46:01,760 --> 00:46:05,370 So what is the work for this, if I spawn off things of size 841 00:46:05,370 --> 00:46:09,370 one, asymptotic work? 842 00:46:09,370 --> 00:46:12,850 It's order n, because I've got n leaves, and I've got n guys 843 00:46:12,850 --> 00:46:13,710 that I'm spawning off. 844 00:46:13,710 --> 00:46:15,720 So it's order n. 845 00:46:15,720 --> 00:46:18,019 What's the span? 846 00:46:18,019 --> 00:46:20,800 AUDIENCE: [INAUDIBLE]. 847 00:46:20,800 --> 00:46:27,130 CHARLES LEISERSON: Yeah, it's also order n, because the 848 00:46:27,130 --> 00:46:29,110 critical path goes something like brrrup, brrrup, brrrup. 849 00:46:33,620 --> 00:46:35,850 That's order n length. 850 00:46:35,850 --> 00:46:37,760 It's not this, because that's only order 851 00:46:37,760 --> 00:46:38,820 one length, all those. 852 00:46:38,820 --> 00:46:42,130 The longest path is order n. 853 00:46:42,130 --> 00:46:49,220 So that says the parallelism is order one. 854 00:46:49,220 --> 00:46:53,720 Conclusion, at least with grain size one, this is a 855 00:46:53,720 --> 00:46:57,950 really bad way to implement a parallel loop. 856 00:46:57,950 --> 00:47:01,080 However, I guarantee, it may not be the people in this 857 00:47:01,080 --> 00:47:07,130 room, but some fraction of students in this class will 858 00:47:07,130 --> 00:47:12,080 write this rather than doing a cilk for. 859 00:47:12,080 --> 00:47:15,440 Bad idea. 860 00:47:15,440 --> 00:47:17,270 Bad idea. 861 00:47:17,270 --> 00:47:19,950 Generally, bad idea. 862 00:47:19,950 --> 00:47:20,862 Question? 863 00:47:20,862 --> 00:47:22,308 AUDIENCE: Do you think you could find a constant factor, 864 00:47:22,308 --> 00:47:23,558 not just [INAUDIBLE]? 865 00:47:26,164 --> 00:47:29,360 CHARLES LEISERSON: Well here, actually, with grain size one, 866 00:47:29,360 --> 00:47:31,960 this is really bad, because I've got this overhead of 867 00:47:31,960 --> 00:47:35,450 doing a spawn, and then I'm only doing one iteration. 868 00:47:35,450 --> 00:47:38,250 So the ideal thing would be that I really am only paying 869 00:47:38,250 --> 00:47:41,170 for the leaves, and the internal nodes, I don't have 870 00:47:41,170 --> 00:47:42,405 to pay anything for. 871 00:47:42,405 --> 00:47:44,182 Yeah, Eric? 872 00:47:44,182 --> 00:47:45,820 AUDIENCE: Shouldn't there be a sort of keyword 873 00:47:45,820 --> 00:47:46,560 in the b add too? 874 00:47:46,560 --> 00:47:47,150 CHARLES LEISERSON: In the where? 875 00:47:47,150 --> 00:47:48,175 AUDIENCE: In the b add? 876 00:47:48,175 --> 00:47:49,470 CHARLES LEISERSON: No, that's serial. 877 00:47:49,470 --> 00:47:51,190 That's a serial code. 
878 00:47:51,190 --> 00:47:53,066 AUDIENCE: No, but if you were going to call it with cilk 879 00:47:53,066 --> 00:47:56,140 spawn, don't you have to declare it cilk? 880 00:47:56,140 --> 00:47:58,581 Is that not the case? 881 00:47:58,581 --> 00:47:59,038 CHARLES LEISERSON: No. 882 00:47:59,038 --> 00:48:00,288 AUDIENCE: Never mind. 883 00:48:02,820 --> 00:48:03,900 CHARLES LEISERSON: Yes, question. 884 00:48:03,900 --> 00:48:05,884 AUDIENCE: If g is [INAUDIBLE], isn't that good enough? 885 00:48:08,420 --> 00:48:09,290 CHARLES LEISERSON: Yeah, so let's take a look. 886 00:48:09,290 --> 00:48:10,540 That's actually the next slide. 887 00:48:12,980 --> 00:48:17,036 This is basically what we call puny parallelism. 888 00:48:17,036 --> 00:48:20,580 We don't like puny parallelism. 889 00:48:20,580 --> 00:48:22,470 It doesn't have to be spectacular. 890 00:48:22,470 --> 00:48:25,680 It has to be good enough. 891 00:48:25,680 --> 00:48:28,455 And this is not good enough for most applications. 892 00:48:31,960 --> 00:48:34,250 So here's another implementation. 893 00:48:34,250 --> 00:48:35,620 Here's another way of doing it. 894 00:48:35,620 --> 00:48:40,710 Now let's analyze it where we have control over g. 895 00:48:40,710 --> 00:48:44,430 So we'll analyze it in terms of g, and then see whether 896 00:48:44,430 --> 00:48:47,270 there's a setting for which this makes sense. 897 00:48:47,270 --> 00:48:49,690 So if I analyze it in terms of g, now I have to do a little 898 00:48:49,690 --> 00:48:51,600 bit more careful analysis here. 899 00:48:51,600 --> 00:48:57,798 How much work do I have here in terms of n and g? 900 00:48:57,798 --> 00:48:59,200 AUDIENCE: It's the same. 901 00:48:59,200 --> 00:49:00,120 CHARLES LEISERSON: Yeah, the work is still 902 00:49:00,120 --> 00:49:01,370 asymptotically order n. 903 00:49:05,220 --> 00:49:07,940 Because I always have n work in the leaves, even if I do 904 00:49:07,940 --> 00:49:09,190 more iterations here. 905 00:49:09,190 --> 00:49:14,091 What increasing g does is it shrinks this, right? 906 00:49:14,091 --> 00:49:17,350 It shrinks this. 907 00:49:17,350 --> 00:49:18,850 The span for this is what? 908 00:49:23,240 --> 00:49:25,820 So I heard somebody say it. 909 00:49:25,820 --> 00:49:27,448 n over g plus g. 910 00:49:30,560 --> 00:49:32,240 And it corresponds to this path. 911 00:49:34,750 --> 00:49:37,660 So this is the n over g part up here, and 912 00:49:37,660 --> 00:49:39,126 this is the plus g. 913 00:49:41,720 --> 00:49:47,060 So what we want to do is 914 00:49:47,060 --> 00:49:47,820 minimize this. 915 00:49:47,820 --> 00:49:50,630 This has the smallest value when these two terms are 916 00:49:50,630 --> 00:49:55,370 equal, which you can either know as a basic fact of the 917 00:49:55,370 --> 00:49:58,300 summation of these kinds of things, or you could take 918 00:49:58,300 --> 00:50:02,210 derivatives and so forth. 919 00:50:02,210 --> 00:50:05,050 Or you can just eyeball it and say, gee, if g is bigger than 920 00:50:05,050 --> 00:50:08,710 square root of n, then this is going to be dominant, and 921 00:50:08,710 --> 00:50:11,500 if g is smaller than square root of n, then this is going 922 00:50:11,500 --> 00:50:12,630 to be dominant. 923 00:50:12,630 --> 00:50:15,130 And so when they're equal, that sounds like about when it 924 00:50:15,130 --> 00:50:17,720 should be the smallest, which is true.
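In symbols, the analysis just sketched (my notation):

\[
T_1(n) = \Theta(n),
\qquad
T_\infty(n) = \Theta\!\left(\frac{n}{g} + g\right),
\]

and n/g + g is minimized, up to constant factors, at \(g = \Theta(\sqrt{n})\), giving \(T_\infty = \Theta(\sqrt{n})\) and parallelism \(\Theta(n)/\Theta(\sqrt{n}) = \Theta(\sqrt{n})\).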
925 00:50:17,720 --> 00:50:20,270 So we pick it to be about square root of n, to 926 00:50:20,270 --> 00:50:22,970 minimize the span. 927 00:50:22,970 --> 00:50:25,620 There's nothing else here I have to minimize. 928 00:50:25,620 --> 00:50:30,200 So pick it around square root of n, then the span is around 929 00:50:30,200 --> 00:50:31,520 square root of n. 930 00:50:31,520 --> 00:50:37,680 And so then the parallelism is order square root of n. 931 00:50:37,680 --> 00:50:38,880 So that's pretty good. 932 00:50:38,880 --> 00:50:40,150 So that's not bad. 933 00:50:40,150 --> 00:50:42,950 So for something that's a big array, array of size 1 934 00:50:42,950 --> 00:50:46,310 million, parallelism might be 1,000. 935 00:50:46,310 --> 00:50:50,300 That might be just hunky dory. 936 00:50:50,300 --> 00:50:51,142 Question. 937 00:50:51,142 --> 00:50:51,594 What's that? 938 00:50:51,594 --> 00:50:54,510 AUDIENCE: I don't see where-- 939 00:50:54,510 --> 00:50:56,170 CHARLES LEISERSON: We've picked g to be equal to 940 00:50:56,170 --> 00:50:57,986 square root of n. 941 00:50:57,986 --> 00:51:00,944 AUDIENCE: [INAUDIBLE] plus n over g, plus g. 942 00:51:00,944 --> 00:51:02,430 I don't see where [INAUDIBLE]. 943 00:51:02,430 --> 00:51:05,870 CHARLES LEISERSON: You don't see where this g came from? 944 00:51:05,870 --> 00:51:09,540 This g comes from the fact that I'm doing g iterations here. 945 00:51:09,540 --> 00:51:11,780 So remember that these are now of size g. 946 00:51:11,780 --> 00:51:14,510 I'm doing g iterations in each leaf here, if 947 00:51:14,510 --> 00:51:15,860 I set g to be large. 948 00:51:15,860 --> 00:51:21,960 So I'm doing n over g pieces here, plus g iterations in my 949 00:51:21,960 --> 00:51:22,610 [INAUDIBLE]. 950 00:51:22,610 --> 00:51:24,090 Is that clear? 951 00:51:24,090 --> 00:51:26,120 So the n over g is this part. 952 00:51:26,120 --> 00:51:28,710 This size here, this is not one. 953 00:51:28,710 --> 00:51:30,620 This has g iterations in it. 954 00:51:30,620 --> 00:51:33,252 So the total span is g plus n over g. 955 00:51:37,370 --> 00:51:39,280 Any other questions about this? 956 00:51:39,280 --> 00:51:41,750 So basically, I get order square root of n. 957 00:51:44,270 --> 00:51:49,370 And so this is not necessarily a bad way of doing it, but the 958 00:51:49,370 --> 00:51:51,985 cilk for is a far more reliable way of making sure 959 00:51:51,985 --> 00:51:54,230 that you get the parallelism than spawning 960 00:51:54,230 --> 00:51:55,710 things off one by one. 961 00:51:55,710 --> 00:52:00,000 One of the things, by the way, in this, I've seen people 962 00:52:00,000 --> 00:52:03,370 write code where their first instinct is to write something 963 00:52:03,370 --> 00:52:06,600 like this, where the thing that they're spawning off is only 964 00:52:06,600 --> 00:52:07,660 constant time. 965 00:52:07,660 --> 00:52:11,810 And they say, gee, I spawned off n things. 966 00:52:11,810 --> 00:52:14,340 That's really parallel. 967 00:52:14,340 --> 00:52:18,270 When in fact, their parallelism is order one. 968 00:52:18,270 --> 00:52:22,600 So it's really seductive to think that you can get 969 00:52:22,600 --> 00:52:23,880 parallelism by this. 970 00:52:23,880 --> 00:52:27,160 It's much better to do divide and conquer, and cilk for does 971 00:52:27,160 --> 00:52:29,140 that for you automatically.
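For comparison, here is a minimal sketch of the divide-and-conquer looping style he's recommending--roughly what cilk_for generates for you--written by hand. The routine name and grain-size parameter are my own illustration:

    #include <cilk/cilk.h>

    // Divide-and-conquer parallel loop over [lo, hi): recursively split
    // the range in half, running the halves in parallel, until the range
    // is at most the grain size; then do the iterations serially.
    void vadd_dc(double *A, double *B, int lo, int hi, int grain) {
        if (hi - lo <= grain) {
            for (int i = lo; i < hi; ++i)  // serial leaf of <= grain iterations
                A[i] += B[i];
        } else {
            int mid = lo + (hi - lo) / 2;
            cilk_spawn vadd_dc(A, B, lo, mid, grain);  // left half in parallel
            vadd_dc(A, B, mid, hi, grain);             // right half in this strand
            cilk_sync;
        }
    }

With this structure the critical path passes through only about log of n over the grain size spawns plus one serial leaf, rather than the n over g spawns on the critical path of the one-by-one version.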
972 00:52:29,140 --> 00:52:31,060 If you're going to do it by hand--sometimes you do want to 973 00:52:31,060 --> 00:52:33,160 do it by hand--then you probably want to think more 974 00:52:33,160 --> 00:52:35,900 about divide and conquer to generate parallelism, because 975 00:52:35,900 --> 00:52:38,230 you'll have a smaller span than doing many 976 00:52:38,230 --> 00:52:39,480 things one at a time. 977 00:52:41,800 --> 00:52:47,520 So here's some tips for performance. 978 00:52:47,520 --> 00:52:52,090 So you want to minimize the span, because the parallelism is 979 00:52:52,090 --> 00:52:53,470 the work over the span. 980 00:52:53,470 --> 00:52:57,880 So you want to minimize the span to maximize parallelism. 981 00:52:57,880 --> 00:53:00,770 And in general, you should try to generate something like 10 982 00:53:00,770 --> 00:53:04,290 times more parallelism than processors, if you want to get 983 00:53:04,290 --> 00:53:05,510 near perfect linear speed-up. 984 00:53:05,510 --> 00:53:09,200 In other words, a parallel slackness of 10 or better is 985 00:53:09,200 --> 00:53:10,450 usually adequate. 986 00:53:13,340 --> 00:53:16,190 If you can get more, you can get a little 987 00:53:16,190 --> 00:53:22,720 more performance, but now you're talking performance 988 00:53:22,720 --> 00:53:27,110 increases in the range of 5% to 10%, 989 00:53:27,110 --> 00:53:29,890 something like that. 990 00:53:29,890 --> 00:53:33,520 Second thing is if you have plenty of parallelism, try to 991 00:53:33,520 --> 00:53:36,970 trade some of it off to reduce work overhead. 992 00:53:36,970 --> 00:53:38,060 So this is a general case. 993 00:53:38,060 --> 00:53:42,800 This is what actually goes on underneath in the cilk++ 994 00:53:42,800 --> 00:53:45,790 runtime system--it's trying to do this itself. 995 00:53:45,790 --> 00:53:49,640 But you in your own code can play exactly the same game. 996 00:53:49,640 --> 00:53:52,330 Whenever you have a problem and it says, whoa, look at all 997 00:53:52,330 --> 00:53:55,130 this parallelism, think about ways that you could reduce the 998 00:53:55,130 --> 00:54:00,140 parallelism and get something back in the efficiency of the 999 00:54:00,140 --> 00:54:03,490 work term, because the performance in the end is 1000 00:54:03,490 --> 00:54:08,070 going to be something like t1 over p plus t infinity. 1001 00:54:08,070 --> 00:54:11,400 If t infinity is small, it's like t1 over p, and so 1002 00:54:11,400 --> 00:54:16,060 anything you save in the t1 term is saving you overall. 1003 00:54:16,060 --> 00:54:19,630 It's going to be a savings for you overall. 1004 00:54:19,630 --> 00:54:22,200 Use divide and conquer recursion on parallel loops 1005 00:54:22,200 --> 00:54:24,870 rather than spawning one small thing after another. 1006 00:54:24,870 --> 00:54:28,930 In other words, do this, not this, generally. 1007 00:54:33,620 --> 00:54:36,220 And here's some more tips. 1008 00:54:36,220 --> 00:54:39,520 Another thing that can happen that we looked at here was 1009 00:54:39,520 --> 00:54:42,300 make sure that the amount of work you're doing is 1010 00:54:42,300 --> 00:54:45,390 reasonably large compared to the number of spawns. 1011 00:54:45,390 --> 00:54:47,950 You could also say this is true when you do recursion for 1012 00:54:47,950 --> 00:54:49,280 function calls.
1013 00:54:49,280 --> 00:54:52,040 Even if you're just in serial programming, you always 1014 00:54:52,040 --> 00:54:54,180 want to make sure that the number of function calls you're doing is 1015 00:54:54,180 --> 00:54:57,500 small compared to the amount of work you're doing if 1016 00:54:57,500 --> 00:55:00,120 you can, and that'll make things go faster. 1017 00:55:00,120 --> 00:55:08,050 So same thing here, you want to have a lot of work compared 1018 00:55:08,050 --> 00:55:09,620 to the total number of spawns that you're 1019 00:55:09,620 --> 00:55:11,780 doing in your program. 1020 00:55:11,780 --> 00:55:14,520 So spawns, by the way, in this system, are about three or 1021 00:55:14,520 --> 00:55:19,500 four times the cost of a function call. 1022 00:55:19,500 --> 00:55:22,800 They're sort of the same order of magnitude as a function 1023 00:55:22,800 --> 00:55:26,670 call, a little bit heavier than a function call. 1024 00:55:26,670 --> 00:55:31,960 So you can spawn pretty readily, as long as the total 1025 00:55:31,960 --> 00:55:37,350 number of spawns you're doing isn't dominating your work. 1026 00:55:37,350 --> 00:55:40,210 Generally parallelize outer loops as opposed to inner 1027 00:55:40,210 --> 00:55:43,620 loops if you're forced to make a choice. 1028 00:55:43,620 --> 00:55:47,090 So it's always better to have an outer loop that runs in 1029 00:55:47,090 --> 00:55:51,000 parallel rather than an inner loop that runs in parallel, 1030 00:55:51,000 --> 00:55:54,750 because when you do an inner loop that runs in parallel, 1031 00:55:54,750 --> 00:55:56,590 you've got a lot of overhead to overcome. 1032 00:55:56,590 --> 00:56:00,980 But in an outer loop, you've got all of the inner loop to 1033 00:56:00,980 --> 00:56:03,930 amortize against the cost of the spawns that are being used 1034 00:56:03,930 --> 00:56:06,570 to parallelize the outer loop. 1035 00:56:06,570 --> 00:56:10,195 So you'll do many fewer spawns in the implementation. 1036 00:56:12,810 --> 00:56:14,510 Watch out for scheduling overheads. 1037 00:56:18,620 --> 00:56:21,990 So if you do something like this-- 1038 00:56:21,990 --> 00:56:27,310 so here we're parallelizing an inner loop rather than an 1039 00:56:27,310 --> 00:56:27,640 outer loop. 1040 00:56:27,640 --> 00:56:30,470 Now this turns out, it doesn't matter which order we're going 1041 00:56:30,470 --> 00:56:33,000 in or whatever. 1042 00:56:33,000 --> 00:56:35,650 It's generally not desirable to do this because I'm paying 1043 00:56:35,650 --> 00:56:40,230 scheduling overhead n times through this loop, whereas 1044 00:56:40,230 --> 00:56:43,930 here, I pay for scheduling overhead just twice. 1045 00:56:46,920 --> 00:56:50,010 So it's generally better, if I have n pieces of work to do, 1046 00:56:50,010 --> 00:56:52,400 rather than, in this case, parallelizing-- 1047 00:56:55,200 --> 00:56:57,510 let me slow down here. 1048 00:56:57,510 --> 00:56:58,980 So let's look at what this code does. 1049 00:56:58,980 --> 00:57:00,835 This says, go for two iterations. 1050 00:57:03,740 --> 00:57:05,980 Do something for which it is going to take n 1051 00:57:05,980 --> 00:57:09,840 iterations for j. 1052 00:57:09,840 --> 00:57:12,235 So two iterations for i, n iterations for j. 1053 00:57:15,710 --> 00:57:17,970 If you look at the parallelism of this, what is the 1054 00:57:17,970 --> 00:57:20,560 parallelism of this assuming that f is constant time? 1055 00:57:23,830 --> 00:57:25,220 What's the parallelism of this code? 1056 00:57:33,460 --> 00:57:35,210 Two.
1057 00:57:35,210 --> 00:57:37,570 The parallelism of two, because I've got two things on 1058 00:57:37,570 --> 00:57:39,930 the outer loop here, and then each is n. 1059 00:57:39,930 --> 00:57:43,420 So my span is essentially n. 1060 00:57:43,420 --> 00:57:46,770 My work is like 2n, something like that. 1061 00:57:46,770 --> 00:57:49,250 So it's got a parallelism of two. 1062 00:57:49,250 --> 00:57:50,800 What's the parallelism of this code? 1063 00:58:04,970 --> 00:58:05,790 What's the parallelism? 1064 00:58:05,790 --> 00:58:08,220 It's not n, because I'm basically going through this 1065 00:58:08,220 --> 00:58:10,070 serially, the outer loop serially. 1066 00:58:18,680 --> 00:58:20,490 What's the theoretical parallelism of this? 1067 00:58:24,270 --> 00:58:29,430 So for each iteration here, the parallelism is two. 1068 00:58:29,430 --> 00:58:31,170 No, not n. 1069 00:58:31,170 --> 00:58:34,980 It can't be n, because I'm basically only parallelizing 1070 00:58:34,980 --> 00:58:37,530 two things, and I'm doing them serially. 1071 00:58:40,462 --> 00:58:44,540 The outer loop is going serially through the code and 1072 00:58:44,540 --> 00:58:47,480 it's spawning off two things, two things, two things, two 1073 00:58:47,480 --> 00:58:48,530 things, two things. 1074 00:58:48,530 --> 00:58:50,380 And waiting for them to be done, two things, wait for it 1075 00:58:50,380 --> 00:58:52,540 to be done, two things, wait for it to be done. 1076 00:58:52,540 --> 00:58:54,930 So the parallelism is two. 1077 00:58:54,930 --> 00:58:56,180 These have the same parallelism. 1078 00:58:58,690 --> 00:59:03,190 However if you run this, this one will give you a speedup of 1079 00:59:03,190 --> 00:59:07,530 two on two cores, very close to it. 1080 00:59:07,530 --> 00:59:09,990 Because of the scheduling overhead: here, you've only 1081 00:59:09,990 --> 00:59:12,390 paid once for the scheduling overhead, and then you're 1082 00:59:12,390 --> 00:59:14,640 doing a whole bunch of stuff. 1083 00:59:14,640 --> 00:59:17,140 So remember, to schedule it, it's got to be migrated, it's 1084 00:59:17,140 --> 00:59:19,910 got to be moved to another processor, et cetera. 1085 00:59:19,910 --> 00:59:25,410 This one, it's not even worth it probably to steal each of 1086 00:59:25,410 --> 00:59:26,250 these individual things. 1087 00:59:26,250 --> 00:59:28,770 You're spawning off things that are so small, this may 1088 00:59:28,770 --> 00:59:33,395 even have parallelism that's less than 1 in practice. 1089 00:59:33,395 --> 00:59:35,880 And if you look at the cilkview tool, this will show 1090 00:59:35,880 --> 00:59:38,180 you a high burdened parallelism. 1091 00:59:38,180 --> 00:59:41,450 Because the cilkview tool, the burdened parallelism tells you 1092 00:59:41,450 --> 00:59:46,980 what the overhead is from scheduling, as well as what 1093 00:59:46,980 --> 00:59:48,310 the actual parallelism is. 1094 00:59:48,310 --> 00:59:51,010 And it recognizes that oh, gee whiz. 1095 00:59:51,010 --> 00:59:53,625 This thing really has very small-- 1096 01:00:00,670 --> 01:00:02,200 there's almost no work in here. 1097 01:00:02,200 --> 01:00:04,000 So you're trying to parallelize something where 1098 01:00:04,000 --> 01:00:06,960 the work is so small, it's not even worth migrating it to 1099 01:00:06,960 --> 01:00:10,380 take advantage of it. 1100 01:00:10,380 --> 01:00:12,740 So those are some tips.
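To recap the comparison just made, here is a sketch of the two orderings. Both fragments are my illustration rather than the slide's exact code, and f stands in for the constant-time body:

    #include <cilk/cilk.h>

    extern void f(int i, int j);  // placeholder: constant-time body

    // Outer loop parallel: scheduling overhead is paid once, and each
    // of the two branches amortizes it over n serial inner iterations.
    void outer_parallel(int n) {
        cilk_for (int i = 0; i < 2; ++i)
            for (int j = 0; j < n; ++j)
                f(i, j);
    }

    // Inner loop parallel: the serial outer loop pays scheduling
    // overhead on every one of its n trips, and each cilk_for has only
    // two tiny iterations. Same parallelism of two on paper, but a
    // high burdened parallelism in practice.
    void inner_parallel(int n) {
        for (int j = 0; j < n; ++j)
            cilk_for (int i = 0; i < 2; ++i)
                f(i, j);
    }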
1101 01:00:12,740 --> 01:00:15,140 Now let's go through and analyze a bunch of algorithms 1102 01:00:15,140 --> 01:00:16,730 reasonably quickly. 1103 01:00:16,730 --> 01:00:21,670 We'll start with matrix multiplication. 1104 01:00:21,670 --> 01:00:22,935 People seen this problem before? 1105 01:00:28,400 --> 01:00:31,900 Here's the matrix multiplication problem. 1106 01:00:31,900 --> 01:00:33,780 And let's assume for simplicity that n 1107 01:00:33,780 --> 01:00:35,030 is a power of 2. 1108 01:00:38,470 --> 01:00:44,250 So basically, let's start out with just our looping version. 1109 01:00:44,250 --> 01:00:46,390 In fact, this isn't even a very good looping version, 1110 01:00:46,390 --> 01:00:49,340 because I've got the order of the loops wrong, I think. 1111 01:00:49,340 --> 01:00:52,380 But it is just illustrative. 1112 01:00:52,380 --> 01:00:55,080 Basically let's parallelize the outer two loops. 1113 01:00:55,080 --> 01:00:57,070 I can't parallelize the inner loop. 1114 01:00:57,070 --> 01:00:58,140 Why not? 1115 01:00:58,140 --> 01:01:00,180 What happens if I tried to parallelize the inner loop 1116 01:01:00,180 --> 01:01:03,095 with a cilk_for in this implementation? 1117 01:01:07,580 --> 01:01:11,040 Why can't I just put a cilk_for there? 1118 01:01:11,040 --> 01:01:12,408 Yes, somebody said it. 1119 01:01:12,408 --> 01:01:14,600 AUDIENCE: It does that in cij. 1120 01:01:14,600 --> 01:01:17,220 CHARLES LEISERSON: Yeah, we get a race condition. 1121 01:01:17,220 --> 01:01:19,980 We have more than two things in parallel trying to update 1122 01:01:19,980 --> 01:01:25,070 the same cij, and we'll have a race condition. 1123 01:01:25,070 --> 01:01:29,000 So always run cilkview to tell your performance. 1124 01:01:29,000 --> 01:01:33,695 But always, always, run cilk screen to tell whether or not 1125 01:01:33,695 --> 01:01:35,060 you've got races in your code. 1126 01:01:38,500 --> 01:01:40,650 So yeah, you'll have a race condition if you try to 1127 01:01:40,650 --> 01:01:43,160 naively parallelize the loop here. 1128 01:01:46,570 --> 01:01:47,990 So the work of this is what? 1129 01:01:53,460 --> 01:01:58,090 It's order n cubed, just three nested loops each going to n. 1130 01:01:58,090 --> 01:01:59,340 What's the span of this? 1131 01:02:12,180 --> 01:02:13,430 What's the span of this? 1132 01:02:20,990 --> 01:02:26,900 It's order n, because it's log n for this loop, log n for 1133 01:02:26,900 --> 01:02:30,290 this loop, plus the maximum of this, well, that's n. 1134 01:02:30,290 --> 01:02:34,860 Log n plus log n plus n is order n. 1135 01:02:34,860 --> 01:02:37,130 So order n span, which says 1136 01:02:37,130 --> 01:02:42,340 parallelism is order n squared. 1137 01:02:42,340 --> 01:02:45,080 So for 1,000 by 1,000 matrices, the parallelism is 1138 01:02:45,080 --> 01:02:49,376 on the order of a million. 1139 01:02:49,376 --> 01:02:50,626 Wow. 1140 01:02:52,430 --> 01:02:53,680 That's great. 1141 01:02:56,050 --> 01:03:00,600 However, it's on the order of a million, but as we know, 1142 01:03:00,600 --> 01:03:04,630 this doesn't use cache very effectively. 1143 01:03:04,630 --> 01:03:06,890 So one of the nice things about doing divide and conquer 1144 01:03:06,890 --> 01:03:08,860 is, as you know, that's a really good way to take 1145 01:03:08,860 --> 01:03:11,850 advantage of caching. 1146 01:03:11,850 --> 01:03:15,100 And this works in parallel, too. 
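For reference, the looping version being described looks something like this--a sketch assuming square row-major n-by-n matrices, not the slide's exact code:

    #include <cilk/cilk.h>

    // Parallelize the two outer loops; the inner k loop must stay
    // serial, because a cilk_for over k would race on C[i][j].
    // Work Theta(n^3); span Theta(lg n + lg n + n) = Theta(n);
    // parallelism Theta(n^2).
    void mm_loops(double *C, double *A, double *B, int n) {
        cilk_for (int i = 0; i < n; ++i)
            cilk_for (int j = 0; j < n; ++j)
                for (int k = 0; k < n; ++k)
                    C[i*n + j] += A[i*n + k] * B[k*n + j];
    }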
1147 01:03:15,100 --> 01:03:18,920 In particular because whenever you have sufficient 1148 01:03:18,920 --> 01:03:23,880 parallelism, these processors are executing the code just as 1149 01:03:23,880 --> 01:03:26,160 if they were executing serial code. 1150 01:03:26,160 --> 01:03:28,750 So you get all the same cache locality you would get in the 1151 01:03:28,750 --> 01:03:32,280 serial code in the parallel code, except for the times 1152 01:03:32,280 --> 01:03:34,300 that you're actually migrating work. 1153 01:03:34,300 --> 01:03:35,710 And if you have sufficient parallelism, 1154 01:03:35,710 --> 01:03:38,590 that isn't too often. 1155 01:03:38,590 --> 01:03:40,740 So let's take a look at recursive divide and conquer 1156 01:03:40,740 --> 01:03:41,890 multiplication. 1157 01:03:41,890 --> 01:03:45,770 So we're familiar with this, too. 1158 01:03:45,770 --> 01:03:48,520 So this is eight multiplications of n over 2 by 1159 01:03:48,520 --> 01:03:51,350 n over 2 matrices, and one addition of n by n matrices. 1160 01:03:51,350 --> 01:03:55,970 So here's a code using a little bit of C++ism. 1161 01:03:55,970 --> 01:04:01,560 So I've made the type a template variable, T. 1162 01:04:01,560 --> 01:04:07,710 So we're going to do matrix multiplication of an array, a, 1163 01:04:07,710 --> 01:04:10,220 the result is going to go in c, and we're going to 1164 01:04:10,220 --> 01:04:13,630 basically have a and b, and we're going to add 1165 01:04:13,630 --> 01:04:15,240 the result into c. 1166 01:04:15,240 --> 01:04:19,950 We have n, which is the side of the submatrix that we're 1167 01:04:19,950 --> 01:04:23,080 working on, and we're also going to have a size, which 1168 01:04:23,080 --> 01:04:26,105 is the length of the row in the original matrix. 1169 01:04:26,105 --> 01:04:29,870 So remember when we do matrix things, if I take a submatrix, 1170 01:04:29,870 --> 01:04:31,240 it's not contiguous in memory. 1171 01:04:31,240 --> 01:04:34,470 So I have to know the row size of the matrix that I'm in in 1172 01:04:34,470 --> 01:04:38,710 order to be able to calculate what the elements are. 1173 01:04:38,710 --> 01:04:41,370 So the way it's going to work is I'm going to assign this 1174 01:04:41,370 --> 01:04:46,060 temporary d, by using the new-- 1175 01:04:46,060 --> 01:04:49,096 which is basically memory allocation in C++-- 1176 01:04:49,096 --> 01:04:52,410 array of size n by n. 1177 01:04:52,410 --> 01:04:58,670 And what we're going to do is then do four of the recursive 1178 01:04:58,670 --> 01:05:05,470 multiplications, these guys here, into the elements of c, 1179 01:05:05,470 --> 01:05:12,000 and then four of them also into d using the temporary. 1180 01:05:12,000 --> 01:05:14,690 And then we're going to sync, after we get all that parallel 1181 01:05:14,690 --> 01:05:18,440 work done, and then we're going to add d into c, and 1182 01:05:18,440 --> 01:05:22,880 then we'll delete d, because we allocated it up here. 1183 01:05:22,880 --> 01:05:25,920 Everybody understand the code? 1184 01:05:25,920 --> 01:05:27,080 So we're doing this, it's just we're 1185 01:05:27,080 --> 01:05:30,490 going to do it in parallel. 1186 01:05:30,490 --> 01:05:31,520 Good? 1187 01:05:31,520 --> 01:05:33,620 Questions? 1188 01:05:33,620 --> 01:05:36,260 OK. 1189 01:05:36,260 --> 01:05:38,680 So this is the row length of the matrices so that I can do 1190 01:05:38,680 --> 01:05:41,590 the base cases, and in particular, partition the 1191 01:05:41,590 --> 01:05:43,350 matrices effectively.
1192 01:05:43,350 --> 01:05:45,720 I haven't shown that code. 1193 01:05:45,720 --> 01:05:47,980 And of course, the base case, normally, we would want to 1194 01:05:47,980 --> 01:05:49,560 coarsen for efficiency. 1195 01:05:49,560 --> 01:05:52,390 I would want to go down to something like maybe an eight 1196 01:05:52,390 --> 01:05:57,090 by eight or 16 by 16 matrix, and at that point switch to 1197 01:05:57,090 --> 01:06:01,880 something that's going to use the processor pipeline better. 1198 01:06:01,880 --> 01:06:04,360 The base cases, once again, I want to emphasize this because 1199 01:06:04,360 --> 01:06:07,430 a couple people on the quiz misunderstood this. 1200 01:06:07,430 --> 01:06:11,250 The reason you coarsen has nothing to do with caches. 1201 01:06:11,250 --> 01:06:14,410 The reason you coarsen is to overcome the overhead of the 1202 01:06:14,410 --> 01:06:18,440 function calls, and the coarsening is generally chosen 1203 01:06:18,440 --> 01:06:21,280 independent of what the size of the caches are. 1204 01:06:21,280 --> 01:06:25,590 It's not a parameter that has to be tuned to cache size. 1205 01:06:25,590 --> 01:06:28,190 It's a parameter that has to be tuned to function call, 1206 01:06:28,190 --> 01:06:33,080 versus ALU instructions, and what that balance is. 1207 01:06:33,080 --> 01:06:34,464 Question? 1208 01:06:34,464 --> 01:06:37,446 AUDIENCE: I mean, I understand that's true, but I thought-- 1209 01:06:37,446 --> 01:06:39,434 I mean, maybe I heard that wrong, 1210 01:06:39,434 --> 01:06:42,416 but I thought we wanted, in general, in terms of caching, 1211 01:06:42,416 --> 01:06:45,895 that you would choose it somehow so that all of the 1212 01:06:45,895 --> 01:06:48,380 data that you have would somehow fit-- 1213 01:06:48,380 --> 01:06:50,060 CHARLES LEISERSON: That's what the divide and conquer does 1214 01:06:50,060 --> 01:06:52,210 automatically. 1215 01:06:52,210 --> 01:06:54,550 The divide and conquer keeps halving it until it fits in 1216 01:06:54,550 --> 01:06:56,450 whatever size cache you have. 1217 01:06:56,450 --> 01:06:58,330 And in fact, we have three caches on the 1218 01:06:58,330 --> 01:06:59,770 machines we're using. 1219 01:06:59,770 --> 01:07:03,363 AUDIENCE: Yeah, but I'm saying if your coarsened constant is 1220 01:07:03,363 --> 01:07:05,160 too big, that's not going to happen. 1221 01:07:05,160 --> 01:07:07,030 CHARLES LEISERSON: If the coarsened constant is too big, 1222 01:07:07,030 --> 01:07:08,180 that's not going to happen. 1223 01:07:08,180 --> 01:07:12,120 But generally, the caches are much bigger than what you need 1224 01:07:12,120 --> 01:07:14,410 to do to amortize the cost. 1225 01:07:14,410 --> 01:07:16,660 But you're right, that is an assumption. 1226 01:07:16,660 --> 01:07:19,490 The caches are generally much bigger than the size that you 1227 01:07:19,490 --> 01:07:21,680 need in order to overcome function call overhead. 1228 01:07:21,680 --> 01:07:24,170 Function call overhead is not that high. 1229 01:07:24,170 --> 01:07:24,930 OK? 1230 01:07:24,930 --> 01:07:27,020 Good. 1231 01:07:27,020 --> 01:07:29,280 I'm glad I raised that issue again. 1232 01:07:29,280 --> 01:07:31,690 And so we're going to determine the submatrices by 1233 01:07:31,690 --> 01:07:33,330 index calculation. 1234 01:07:33,330 --> 01:07:36,400 And then we have to implement this parallel add, and that 1235 01:07:36,400 --> 01:07:42,270 I'm going to do just with a doubly nested for loop to add 1236 01:07:42,270 --> 01:07:43,520 the things.
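Putting the pieces just described together, here is a sketch of the recursive multiply-add. This is a reconstruction under my own conventions, not the slide's code: each matrix carries its own row stride so that the contiguous temporary d can be recursed into like the others, n is assumed a power of 2, and the base-case threshold of 16 is illustrative:

    #include <cilk/cilk.h>

    // C += A*B on n-by-n submatrices; cs, as, bs are the row strides
    // of C, A, B inside their original (possibly larger) matrices.
    void mm_dac(double *C, int cs, double *A, int as,
                double *B, int bs, int n) {
        if (n <= 16) {                        // coarsened base case
            for (int i = 0; i < n; ++i)
                for (int k = 0; k < n; ++k)
                    for (int j = 0; j < n; ++j)
                        C[i*cs + j] += A[i*as + k] * B[k*bs + j];
            return;
        }
        int h = n / 2;
        double *D = new double[n * n]();      // zeroed temporary, row stride n
        // Submatrix (r, c) of M with row stride s:
        #define SUB(M, r, c, s) ((M) + (r)*h*(s) + (c)*h)
        // Four products accumulate into C, four into D -- all eight parallel.
        cilk_spawn mm_dac(SUB(C,0,0,cs), cs, SUB(A,0,0,as), as, SUB(B,0,0,bs), bs, h);
        cilk_spawn mm_dac(SUB(C,0,1,cs), cs, SUB(A,0,0,as), as, SUB(B,0,1,bs), bs, h);
        cilk_spawn mm_dac(SUB(C,1,0,cs), cs, SUB(A,1,0,as), as, SUB(B,0,0,bs), bs, h);
        cilk_spawn mm_dac(SUB(C,1,1,cs), cs, SUB(A,1,0,as), as, SUB(B,0,1,bs), bs, h);
        cilk_spawn mm_dac(SUB(D,0,0,n),  n,  SUB(A,0,1,as), as, SUB(B,1,0,bs), bs, h);
        cilk_spawn mm_dac(SUB(D,0,1,n),  n,  SUB(A,0,1,as), as, SUB(B,1,1,bs), bs, h);
        cilk_spawn mm_dac(SUB(D,1,0,n),  n,  SUB(A,1,1,as), as, SUB(B,1,0,bs), bs, h);
        mm_dac(SUB(D,1,1,n), n, SUB(A,1,1,as), as, SUB(B,1,1,bs), bs, h); // no spawn needed
        cilk_sync;                            // wait for all eight products
        #undef SUB
        // The parallel add, C += D: the doubly nested cilk_for loop.
        cilk_for (int i = 0; i < n; ++i)
            cilk_for (int j = 0; j < n; ++j)
                C[i*cs + j] += D[i*n + j];
        delete[] D;                           // free the temporary
    }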
1237 01:07:45,530 --> 01:07:48,650 There's no cache behavior I can really take advantage of 1238 01:07:48,650 --> 01:07:51,750 here except for spatial locality. 1239 01:07:51,750 --> 01:07:53,760 There's no temporal locality because I'm just adding two 1240 01:07:53,760 --> 01:07:59,270 matrices once, so there's no real temporal locality that 1241 01:07:59,270 --> 01:08:00,726 I'll get out of it. 1242 01:08:00,726 --> 01:08:02,440 And here, I've actually done the index 1243 01:08:02,440 --> 01:08:03,690 calculations by hand. 1244 01:08:08,550 --> 01:08:11,050 So let's analyze this. 1245 01:08:11,050 --> 01:08:13,790 So to analyze the multiplication program, I have 1246 01:08:13,790 --> 01:08:16,590 to start by analyzing the addition program. 1247 01:08:16,590 --> 01:08:19,890 So this should be, I think, fairly straightforward. 1248 01:08:19,890 --> 01:08:26,290 What's the work for adding two n by n matrices here? 1249 01:08:26,290 --> 01:08:29,689 n squared, good, just doubly nested loop. 1250 01:08:29,689 --> 01:08:32,612 What's the span? 1251 01:08:32,612 --> 01:08:35,609 AUDIENCE: [INAUDIBLE]. 1252 01:08:35,609 --> 01:08:37,859 CHARLES LEISERSON: Yeah, here it's log n, very good. 1253 01:08:37,859 --> 01:08:40,830 Because I've got log n plus log n plus order one. 1254 01:08:44,600 --> 01:08:46,470 I'm not going to analyze the parallelism, because I really 1255 01:08:46,470 --> 01:08:48,260 don't care about the parallelism of the addition. 1256 01:08:48,260 --> 01:08:51,180 I really care about the parallelism of the matrix 1257 01:08:51,180 --> 01:08:51,859 multiplication. 1258 01:08:51,859 --> 01:08:54,859 But we'll plug those values in now. 1259 01:08:54,859 --> 01:08:57,430 What is the work of the matrix multiplication? 1260 01:08:57,430 --> 01:09:02,359 Well for this, what we want to do is get a recurrence that we 1261 01:09:02,359 --> 01:09:03,140 can then solve. 1262 01:09:03,140 --> 01:09:04,920 So what's the recurrence that we want to get 1263 01:09:04,920 --> 01:09:06,450 for m of 1 of n? 1264 01:09:11,050 --> 01:09:17,439 Yeah, it's going to be 8m sub 1 of n over 2, that 1265 01:09:17,439 --> 01:09:22,399 corresponds to these things, plus some constant stuff, plus 1266 01:09:22,399 --> 01:09:23,649 the work of the addition. 1267 01:09:26,300 --> 01:09:28,040 Does that make sense? 1268 01:09:28,040 --> 01:09:29,390 We analyze the work of the addition. 1269 01:09:29,390 --> 01:09:32,000 What's the work of the addition? 1270 01:09:32,000 --> 01:09:33,630 Order n squared. 1271 01:09:33,630 --> 01:09:36,100 So that's going to dominate that constant 1272 01:09:36,100 --> 01:09:38,569 there, so we get 8. 1273 01:09:38,569 --> 01:09:41,090 And what's the solution to this? 1274 01:09:41,090 --> 01:09:43,319 Back to Master Theorem. 1275 01:09:43,319 --> 01:09:46,210 Now we're going to start pulling out the Master Theorem 1276 01:09:46,210 --> 01:09:49,850 multiple times per slide for the rest of the lecture. 1277 01:09:49,850 --> 01:09:52,700 n cubed, because we have log base 2 of 8. 1278 01:09:52,700 --> 01:09:57,330 That's n cubed compared with n squared. 1279 01:09:57,330 --> 01:10:00,450 So we get a solution which is n cubed-- 1280 01:10:00,450 --> 01:10:02,390 Case 3 of the Master Theorem. 1281 01:10:02,390 --> 01:10:03,670 So that's good. 1282 01:10:03,670 --> 01:10:06,140 The work we're doing is the same asymptotic work we're 1283 01:10:06,140 --> 01:10:09,220 doing for the triply nested loop. 1284 01:10:09,220 --> 01:10:11,380 Now let's take a look at the span. 
1285 01:10:14,170 --> 01:10:15,790 So what's the span for this? 1286 01:10:21,030 --> 01:10:23,420 So once again, we want a recurrence. 1287 01:10:23,420 --> 01:10:24,670 What's the recurrence look like? 1288 01:10:31,930 --> 01:10:35,444 So the span of this is going to be the span of-- 1289 01:10:35,444 --> 01:10:37,345 it's going to be the sum of some things. 1290 01:10:39,990 --> 01:10:43,950 But the key observation is that it's going to be-- we 1291 01:10:43,950 --> 01:10:45,800 want the maximum of these guys. 1292 01:10:52,970 --> 01:10:55,850 So we're going to basically have the allocation as 1293 01:10:55,850 --> 01:11:00,900 constant time, we have the maximum of these, which is m 1294 01:11:00,900 --> 01:11:03,900 of infinity of n over 2, and then we have 1295 01:11:03,900 --> 01:11:07,160 the span of the add. 1296 01:11:07,160 --> 01:11:10,060 So we get this recurrence. 1297 01:11:10,060 --> 01:11:12,930 m infinity sub n over 2, because we have only to worry 1298 01:11:12,930 --> 01:11:15,740 about the worst of these guys. 1299 01:11:15,740 --> 01:11:17,400 The worst of them is-- 1300 01:11:17,400 --> 01:11:19,620 they're all symmetric, so it's basically the same. 1301 01:11:19,620 --> 01:11:21,800 We have a of n, and then there's a constant amount of 1302 01:11:21,800 --> 01:11:24,290 other overhead here. 1303 01:11:24,290 --> 01:11:28,320 Any questions about where I pulled that out of, why that's 1304 01:11:28,320 --> 01:11:29,570 the recurrence? 1305 01:11:32,710 --> 01:11:36,390 So this is the addition, the span of the addition of this 1306 01:11:36,390 --> 01:11:37,570 guy that we analyzed already. 1307 01:11:37,570 --> 01:11:41,000 What is the span of the addition? 1308 01:11:41,000 --> 01:11:42,900 What did we decide that was? 1309 01:11:42,900 --> 01:11:44,530 log n. 1310 01:11:44,530 --> 01:11:46,730 So basically, that dominates the order one. 1311 01:11:46,730 --> 01:11:48,920 So we get this term, and what's the solution of this 1312 01:11:48,920 --> 01:11:49,530 recurrence? 1313 01:11:49,530 --> 01:11:50,780 AUDIENCE: [INAUDIBLE]. 1314 01:11:54,270 --> 01:11:55,727 CHARLES LEISERSON: What case is this? 1315 01:11:55,727 --> 01:11:57,515 AUDIENCE: [INAUDIBLE] 1316 01:11:57,515 --> 01:11:58,860 log n squared. 1317 01:11:58,860 --> 01:12:02,500 CHARLES LEISERSON: Yes, it's log squared n. 1318 01:12:02,500 --> 01:12:04,530 So basically, it's case two. 1319 01:12:04,530 --> 01:12:07,880 So if I do n to the log base b of a, that's n to the log base 1320 01:12:07,880 --> 01:12:12,330 2 of 1, that's just 1. 1321 01:12:12,330 --> 01:12:15,900 And so this is basically a logarithmic factor times the 1322 01:12:15,900 --> 01:12:18,250 1, so we add an extra log. 1323 01:12:18,250 --> 01:12:19,500 We get log squared n. 1324 01:12:22,330 --> 01:12:24,300 That's just Master Theorem plugging in. 1325 01:12:24,300 --> 01:12:26,640 So here, the span is order log squared n. 1326 01:12:29,370 --> 01:12:33,840 And so we have the work of n cubed, the span of log squared 1327 01:12:33,840 --> 01:12:37,150 n, so the parallelism is the ratio, which is n cubed over 1328 01:12:37,150 --> 01:12:38,400 log squared n. 1329 01:12:40,810 --> 01:12:43,150 Not too bad for a 1,000 by 1,000 matrices, the 1330 01:12:43,150 --> 01:12:49,300 parallelism is about 10 million. 1331 01:12:49,300 --> 01:12:50,550 Plenty of parallelism. 
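To summarize the two recurrences just solved, in the lecture's notation:

\[
M_1(n) = 8\,M_1(n/2) + \Theta(n^2) = \Theta(n^3),
\qquad
M_\infty(n) = M_\infty(n/2) + \Theta(\log n) = \Theta(\log^2 n),
\]

so the parallelism is \(M_1(n)/M_\infty(n) = \Theta(n^3/\log^2 n)\).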
1332 01:12:52,940 --> 01:12:55,760 So let's use the fact that we have plenty of parallelism to 1333 01:12:55,760 --> 01:12:58,500 say, let's get rid of some of that parallelism and put it 1334 01:12:58,500 --> 01:13:02,180 back into making our code more efficient. 1335 01:13:02,180 --> 01:13:09,520 So in particular, this code uses an extra temporary d, 1336 01:13:09,520 --> 01:13:14,080 which it allocates here and it deletes here. 1337 01:13:14,080 --> 01:13:16,640 And generally, there's a good rule that says, if you use 1338 01:13:16,640 --> 01:13:18,570 more storage you're going to use more time, because you're 1339 01:13:18,570 --> 01:13:20,770 going to have to look at that storage, it's going to take up 1340 01:13:20,770 --> 01:13:23,220 space in your cache, and it's generally 1341 01:13:23,220 --> 01:13:24,300 going to make you slower. 1342 01:13:24,300 --> 01:13:28,470 So things that use less storage are generally faster. 1343 01:13:28,470 --> 01:13:30,240 Not always the case, sometimes there's a trade off. 1344 01:13:30,240 --> 01:13:34,290 But often it's the case, use more storage, it runs slower. 1345 01:13:34,290 --> 01:13:36,380 So let's get rid of this guy. 1346 01:13:36,380 --> 01:13:37,630 How do we get rid of this guy? 1347 01:13:44,440 --> 01:13:45,754 Yeah? 1348 01:13:45,754 --> 01:13:47,004 AUDIENCE: [INAUDIBLE PHRASE]. 1349 01:13:55,140 --> 01:13:55,530 CHARLES LEISERSON: You're going to do this serially, 1350 01:13:55,530 --> 01:13:55,860 you're saying? 1351 01:13:55,860 --> 01:13:58,510 AUDIENCE: Yeah, you do those serially in add. 1352 01:13:58,510 --> 01:14:00,880 CHARLES LEISERSON: If you do this serially in add, it turns 1353 01:14:00,880 --> 01:14:03,290 out if you do that, you're going to be in trouble because 1354 01:14:03,290 --> 01:14:07,120 you're going to not have very much parallelism, 1355 01:14:07,120 --> 01:14:09,330 unfortunately. 1356 01:14:09,330 --> 01:14:11,920 Actually, analyzing exactly what the parallelism is there 1357 01:14:11,920 --> 01:14:13,330 is actually pretty good. 1358 01:14:13,330 --> 01:14:15,130 It's a good puzzle. 1359 01:14:15,130 --> 01:14:18,740 Maybe we'll do that on the quiz, the take home problem 1360 01:14:18,740 --> 01:14:21,110 set we're calling it now, right? 1361 01:14:21,110 --> 01:14:23,170 We're going to have a take home problem set, maybe that's 1362 01:14:23,170 --> 01:14:26,100 a good one. 1363 01:14:26,100 --> 01:14:29,865 Yeah, so the idea is, you can sync. 1364 01:14:29,865 --> 01:14:36,810 And in particular, why not compute these, then sync, and 1365 01:14:36,810 --> 01:14:39,780 then compute these, adding their results into the places 1366 01:14:39,780 --> 01:14:41,030 where we added these in? 1367 01:14:43,850 --> 01:14:47,370 So it's making the program more serial, because I'm 1368 01:14:47,370 --> 01:14:50,420 putting in a sync. 1369 01:14:50,420 --> 01:14:52,300 That shouldn't have an impact on the work, but it will have 1370 01:14:52,300 --> 01:14:54,845 an impact on the span. 1371 01:14:58,430 --> 01:15:00,880 So we're going to trade it off, and the way we'll do that 1372 01:15:00,880 --> 01:15:04,470 is by putting essentially a sync in the middle. 1373 01:15:04,470 --> 01:15:06,700 And since they're adding it in, I don't even have we call 1374 01:15:06,700 --> 01:15:10,800 the addition routine, because it's just going to 1375 01:15:10,800 --> 01:15:12,800 add it in in place. 
1376 01:15:12,800 --> 01:15:16,190 So I spawn off these four guys, putting their results 1377 01:15:16,190 --> 01:15:20,260 into c, then I spawn off these four guys, and they add their 1378 01:15:20,260 --> 01:15:21,920 results into c. 1379 01:15:21,920 --> 01:15:25,060 Is that clear what the code is? 1380 01:15:25,060 --> 01:15:26,415 So let's analyze this. 1381 01:15:32,050 --> 01:15:41,570 So the work for this is order n cubed. 1382 01:15:41,570 --> 01:15:43,580 It's the same as anything else, we can come up with a 1383 01:15:43,580 --> 01:15:46,380 recurrence, slightly different from before because I only 1384 01:15:46,380 --> 01:15:48,510 have an order one there, but it doesn't really matter. 1385 01:15:48,510 --> 01:15:51,570 The answer is order n cubed. 1386 01:15:51,570 --> 01:15:55,850 The span, now this gets a little trickier. 1387 01:15:55,850 --> 01:15:57,200 What's the recurrence of the span? 1388 01:16:04,430 --> 01:16:06,570 AUDIENCE: [INAUDIBLE]. 1389 01:16:06,570 --> 01:16:07,706 CHARLES LEISERSON: What is that? 1390 01:16:07,706 --> 01:16:09,690 AUDIENCE: Twice the span of m of n over 2. 1391 01:16:09,690 --> 01:16:12,780 CHARLES LEISERSON: Twice the span of m of n 1392 01:16:12,780 --> 01:16:15,510 over 2, that's right. 1393 01:16:15,510 --> 01:16:18,560 So basically, we have the maximum of these guys, the 1394 01:16:18,560 --> 01:16:20,870 maximum of these guys, and then this is making those 1395 01:16:20,870 --> 01:16:23,090 things be in series. 1396 01:16:23,090 --> 01:16:25,650 So things that are in parallel I take the max, if it's in 1397 01:16:25,650 --> 01:16:26,940 series, I have to add them. 1398 01:16:26,940 --> 01:16:31,916 So I end up with 2m infinity of n over 2 plus order one. 1399 01:16:31,916 --> 01:16:35,270 Does that make sense? 1400 01:16:35,270 --> 01:16:36,250 OK, good. 1401 01:16:36,250 --> 01:16:37,550 So let's solve that recurrence. 1402 01:16:37,550 --> 01:16:40,260 What's the answer to that one? 1403 01:16:40,260 --> 01:16:41,120 That's order n. 1404 01:16:41,120 --> 01:16:44,290 Which case is it? 1405 01:16:44,290 --> 01:16:46,540 I never know what the cases are. 1406 01:16:46,540 --> 01:16:49,410 I know two, but one and three, it's like-- 1407 01:16:49,410 --> 01:16:52,030 they're the same thing, it's just which side it's in, so I 1408 01:16:52,030 --> 01:16:53,370 never remember what the number is. 1409 01:16:53,370 --> 01:16:55,250 But anyway, case one, yes. 1410 01:16:55,250 --> 01:16:57,720 Case one. 1411 01:16:57,720 --> 01:17:00,710 It's the one where this thing is bigger, so that's order n. 1412 01:17:04,410 --> 01:17:07,200 Good. 1413 01:17:07,200 --> 01:17:13,010 So then the work is n cubed, the span is order n, the 1414 01:17:13,010 --> 01:17:15,870 parallelism is order n squared. 1415 01:17:15,870 --> 01:17:18,660 So for 1,000 by 1,000 matrices, I get parallelism on 1416 01:17:18,660 --> 01:17:21,600 the order of a million, instead of before, where I had 1417 01:17:21,600 --> 01:17:25,710 parallelism on the order of 10 million. 1418 01:17:25,710 --> 01:17:29,540 So this turns out to be way better code than the previous one 1419 01:17:29,540 --> 01:17:33,840 because it avoids the temporary, and therefore 1420 01:17:33,840 --> 01:17:37,280 you get a constant factor improvement for that, and 1421 01:17:37,280 --> 01:17:42,300 still, on 12 cores, it's going to run pretty fast. 1422 01:17:42,300 --> 01:17:45,530 And in practice, this is a much better way to do it.
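A sketch of this no-temporary variant, in the same conventions as the earlier matrix-multiply sketch (again my reconstruction, with an illustrative base-case threshold):

    #include <cilk/cilk.h>

    // C += A*B with no temporary: four products into C, a sync, then
    // the other four accumulate into the same quadrants of C.
    // Work is still Theta(n^3); the extra sync raises the span to
    // Theta(n), which still leaves parallelism Theta(n^2).
    void mm_dac2(double *C, int cs, double *A, int as,
                 double *B, int bs, int n) {
        if (n <= 16) {                        // coarsened base case
            for (int i = 0; i < n; ++i)
                for (int k = 0; k < n; ++k)
                    for (int j = 0; j < n; ++j)
                        C[i*cs + j] += A[i*as + k] * B[k*bs + j];
            return;
        }
        int h = n / 2;
        #define SUB2(M, r, c, s) ((M) + (r)*h*(s) + (c)*h)
        cilk_spawn mm_dac2(SUB2(C,0,0,cs), cs, SUB2(A,0,0,as), as, SUB2(B,0,0,bs), bs, h);
        cilk_spawn mm_dac2(SUB2(C,0,1,cs), cs, SUB2(A,0,0,as), as, SUB2(B,0,1,bs), bs, h);
        cilk_spawn mm_dac2(SUB2(C,1,0,cs), cs, SUB2(A,1,0,as), as, SUB2(B,0,0,bs), bs, h);
        mm_dac2(SUB2(C,1,1,cs), cs, SUB2(A,1,0,as), as, SUB2(B,0,1,bs), bs, h);
        cilk_sync;   // first four products must finish before C is reused
        cilk_spawn mm_dac2(SUB2(C,0,0,cs), cs, SUB2(A,0,1,as), as, SUB2(B,1,0,bs), bs, h);
        cilk_spawn mm_dac2(SUB2(C,0,1,cs), cs, SUB2(A,0,1,as), as, SUB2(B,1,1,bs), bs, h);
        cilk_spawn mm_dac2(SUB2(C,1,0,cs), cs, SUB2(A,1,1,as), as, SUB2(B,1,0,bs), bs, h);
        mm_dac2(SUB2(C,1,1,cs), cs, SUB2(A,1,1,as), as, SUB2(B,1,1,bs), bs, h);
        cilk_sync;
        #undef SUB2
    }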
1423 01:17:45,530 --> 01:17:49,190 The actual best code that I know for doing this 1424 01:17:49,190 --> 01:17:52,590 essentially does divide and conquer in only one 1425 01:17:52,590 --> 01:17:54,580 dimension at a time. 1426 01:17:54,580 --> 01:17:57,270 So basically, it looks to see what's the long dimension, and 1427 01:17:57,270 --> 01:18:00,880 whatever the long dimension is, it slices it in half and 1428 01:18:00,880 --> 01:18:04,920 then recurses, and just does that as a binary thing. 1429 01:18:04,920 --> 01:18:06,450 And it basically is the same work, et cetera. 1430 01:18:06,450 --> 01:18:07,755 It's a little bit more tricky to analyze. 1431 01:18:14,820 --> 01:18:17,940 Let me quick do merge sort. 1432 01:18:17,940 --> 01:18:18,900 So you know merge sort. 1433 01:18:18,900 --> 01:18:23,470 There's merging two sorted arrays, we saw this before. 1434 01:18:23,470 --> 01:18:26,320 If I spend all this time doing animations, I might as well 1435 01:18:26,320 --> 01:18:29,830 get my mileage out of it. 1436 01:18:29,830 --> 01:18:30,480 There we go. 1437 01:18:30,480 --> 01:18:33,820 So you merge, that's basically what this code does. 1438 01:18:33,820 --> 01:18:37,090 Order n time to merge. 1439 01:18:37,090 --> 01:18:38,470 So here's merge sort. 1440 01:18:38,470 --> 01:18:40,880 So what I'll do in merge sort is the same thing I normally 1441 01:18:40,880 --> 01:18:52,210 do, except that I'll make recursive 1442 01:18:52,210 --> 01:18:53,500 routines go in parallel. 1443 01:18:53,500 --> 01:18:58,500 So when I do that, it basically divide and conquers 1444 01:18:58,500 --> 01:19:03,760 down, and then it sort of does this to merge things together. 1445 01:19:03,760 --> 01:19:08,710 So we saw this before, except now, I've got the fact that I 1446 01:19:08,710 --> 01:19:11,450 can sort two things in parallel rather than sorting 1447 01:19:11,450 --> 01:19:13,020 them serially. 1448 01:19:13,020 --> 01:19:14,375 So let's take a look at the work. 1449 01:19:14,375 --> 01:19:15,900 What's the work of merge sort? 1450 01:19:15,900 --> 01:19:18,770 We know that. 1451 01:19:18,770 --> 01:19:20,370 n log n, right? 1452 01:19:20,370 --> 01:19:26,790 2t of n over 2 plus order n, so that's order n log n. 1453 01:19:26,790 --> 01:19:29,590 The span is what? 1454 01:19:29,590 --> 01:19:30,840 What's the recurrence of the span? 1455 01:19:36,150 --> 01:19:38,890 So we're going to take the maximum of these two guys. 1456 01:19:38,890 --> 01:19:44,050 So we only have one term that involves t infinity, and then 1457 01:19:44,050 --> 01:19:46,570 the merge costs us order n, so we get this recurrence. 1458 01:19:49,440 --> 01:19:54,990 So that says that the solution is order n. 1459 01:19:54,990 --> 01:20:01,590 So therefore, the work is n log n, the span is order n, 1460 01:20:01,590 --> 01:20:03,858 and so the parallelism is order log n. 1461 01:20:06,546 --> 01:20:07,950 Puny. 1462 01:20:07,950 --> 01:20:08,410 Puny parallelism. 1463 01:20:08,410 --> 01:20:11,750 Log n is like, you can run it, and it'll work fine on a few 1464 01:20:11,750 --> 01:20:14,250 cores, but it's not to be something that generally will 1465 01:20:14,250 --> 01:20:17,570 scale and give you a lot of parallelism. 1466 01:20:17,570 --> 01:20:20,630 So it's pretty clear from this that the bottleneck-- 1467 01:20:20,630 --> 01:20:22,330 where's all the span going to? 1468 01:20:22,330 --> 01:20:23,580 It's going to that merge. 
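Here is a sketch of the merge sort just analyzed--my reconstruction, using an assumed scratch array--with the serial merge whose order-n cost is where all the span goes:

    #include <cilk/cilk.h>

    // Serial merge of sorted runs X[0..nx) and Y[0..ny) into out.
    void serial_merge(const int *X, int nx, const int *Y, int ny, int *out) {
        int i = 0, j = 0, k = 0;
        while (i < nx && j < ny) out[k++] = (X[i] <= Y[j]) ? X[i++] : Y[j++];
        while (i < nx) out[k++] = X[i++];
        while (j < ny) out[k++] = Y[j++];
    }

    // tmp is scratch space of length at least n. In real code you would
    // coarsen the base case rather than recursing all the way to n == 1.
    void merge_sort(int *A, int *tmp, int n) {
        if (n <= 1) return;
        int h = n / 2;
        cilk_spawn merge_sort(A, tmp, h);       // sort left half in parallel
        merge_sort(A + h, tmp + h, n - h);      // sort right half
        cilk_sync;
        serial_merge(A, h, A + h, n - h, tmp);  // the order-n serial merge
        for (int i = 0; i < n; ++i) A[i] = tmp[i];  // copy back
    }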
1469 01:20:26,730 --> 01:20:28,960 So when you understand that that's the structure of it, 1470 01:20:28,960 --> 01:20:31,410 now you say if you want to get parallelism, you've got to go 1471 01:20:31,410 --> 01:20:32,230 after the merge. 1472 01:20:32,230 --> 01:20:34,480 So here's how we parallelize the merge. 1473 01:20:34,480 --> 01:20:36,615 So we're going to look at merging of two arrays that are 1474 01:20:36,615 --> 01:20:38,920 of possibly different length. 1475 01:20:38,920 --> 01:20:41,220 So one we'll call A, and one we'll call B, 1476 01:20:41,220 --> 01:20:42,810 with na and nb elements. 1477 01:20:42,810 --> 01:20:46,540 And let me assume without loss of generality that na is 1478 01:20:46,540 --> 01:20:48,830 greater than or equal to nb, because otherwise I can just 1479 01:20:48,830 --> 01:20:51,280 switch the roles of A and B. 1480 01:20:51,280 --> 01:20:53,370 So the way that I'm going to do it is I'm going to find the 1481 01:20:53,370 --> 01:20:56,490 middle element of A. These are sorted arrays that 1482 01:20:56,490 --> 01:20:58,300 I'm going to merge. 1483 01:20:58,300 --> 01:21:01,990 I find the middle element of A, so these guys are less 1484 01:21:01,990 --> 01:21:04,540 than or equal to a of ma, and these are greater 1485 01:21:04,540 --> 01:21:05,930 than or equal to. 1486 01:21:05,930 --> 01:21:09,570 And now I binary search and find out where that middle 1487 01:21:09,570 --> 01:21:14,840 element would fall in the array B. So that costs me log 1488 01:21:14,840 --> 01:21:16,460 n time to binary search. 1489 01:21:16,460 --> 01:21:17,710 Remember binary search? 1490 01:21:22,420 --> 01:21:25,560 Then what I'm going to do is recursively merge these guys, 1491 01:21:25,560 --> 01:21:28,490 because these are sorted and less than or equal to ma, 1492 01:21:28,490 --> 01:21:31,950 recursively merge those and put this guy in the middle. 1493 01:21:35,140 --> 01:21:42,180 So when I do that, the key question when we analyze-- 1494 01:21:42,180 --> 01:21:45,310 it turns out the work is going to basically be the same, but 1495 01:21:45,310 --> 01:21:49,170 the key thing is going to be what happens to the span? 1496 01:21:49,170 --> 01:21:52,080 And the idea here is that the total number of elements in 1497 01:21:52,080 --> 01:21:59,690 the larger of these two things is going to be at most what? 1498 01:21:59,690 --> 01:22:04,015 Another way of looking at it is in the smaller partition, 1499 01:22:04,015 --> 01:22:06,230 if n is the total number of elements, the smaller 1500 01:22:06,230 --> 01:22:09,310 partition has how many elements at 1501 01:22:09,310 --> 01:22:11,010 least relative to n? 1502 01:22:13,750 --> 01:22:17,620 No matter where this binary search finds itself. 1503 01:22:17,620 --> 01:22:21,060 So the worst case is sort of going to come when this guy is 1504 01:22:21,060 --> 01:22:24,900 like at one end or the other. 1505 01:22:24,900 --> 01:22:28,070 And then the point is that because A is the larger array, 1506 01:22:28,070 --> 01:22:30,640 at least a quarter of the elements will still be in the 1507 01:22:30,640 --> 01:22:33,340 smaller partition. 1508 01:22:33,340 --> 01:22:37,280 Of all the elements here, at least a quarter will be in the 1509 01:22:37,280 --> 01:22:39,840 smaller partition, which will occur when B is equal 1510 01:22:39,840 --> 01:22:45,170 in size to A. So the number, in the larger of the recursive 1511 01:22:45,170 --> 01:22:46,790 merges, is at most 3/4 n. 1512 01:22:49,350 --> 01:22:50,750 Sound good?
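In code, the scheme just described might look like the following sketch (my reconstruction; the coarsening threshold of 64 is illustrative):

    #include <cilk/cilk.h>

    // Merge sorted A[0..na) and B[0..nb) into C. Swap so A is the
    // longer run, take A's middle element, binary-search where it
    // lands in B, place it, and merge the two sides in parallel.
    void p_merge(int *C, int *A, int na, int *B, int nb) {
        if (na < nb) {                         // keep A the longer array
            int *tp = A; A = B; B = tp;
            int tn = na; na = nb; nb = tn;
        }
        if (na == 0) return;                   // both runs empty
        if (na + nb <= 64) {                   // coarsened base: serial merge
            int i = 0, j = 0, k = 0;
            while (i < na && j < nb) C[k++] = (A[i] <= B[j]) ? A[i++] : B[j++];
            while (i < na) C[k++] = A[i++];
            while (j < nb) C[k++] = B[j++];
            return;
        }
        int ma = na / 2;
        int lo = 0, hi = nb;                   // binary search: first index
        while (lo < hi) {                      // of B with B[index] >= A[ma]
            int mid = (lo + hi) / 2;
            if (B[mid] < A[ma]) lo = mid + 1; else hi = mid;
        }
        int mb = lo;
        C[ma + mb] = A[ma];                    // middle element's final slot
        cilk_spawn p_merge(C, A, ma, B, mb);   // smaller-elements merge
        p_merge(C + ma + mb + 1, A + ma + 1, na - ma - 1, B + mb, nb - mb);
        cilk_sync;
    }

The larger of the two recursive merges gets at most 3/4 of the elements, which is the fact that bounds the span.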
1513 01:22:50,750 --> 01:22:54,410 That's the main, key idea behind this. 1514 01:22:54,410 --> 01:22:57,420 So here's the parallel merge. 1515 01:22:57,420 --> 01:23:01,290 Basically you do the binary search, and then you spawn the 1516 01:23:01,290 --> 01:23:02,660 two merges. 1517 01:23:02,660 --> 01:23:04,720 Here's one merge, and here's the other merge, 1518 01:23:04,720 --> 01:23:06,140 and then you sync. 1519 01:23:06,140 --> 01:23:09,590 So that's the code for the doing the parallel merge. 1520 01:23:09,590 --> 01:23:11,430 And now you want to incorporate that parallel 1521 01:23:11,430 --> 01:23:14,750 merge into the parallel merge sort. 1522 01:23:14,750 --> 01:23:16,805 Of course, you coarsen the base cases for efficiency. 1523 01:23:21,190 --> 01:23:24,190 So let's analyze the span of this. 1524 01:23:24,190 --> 01:23:29,470 So the span is basically then the span of something of 3/4, 1525 01:23:29,470 --> 01:23:36,380 at most 3/4, the size plus the log n for the binary search. 1526 01:23:36,380 --> 01:23:40,500 So the span of parallel merge is therefore order log squared 1527 01:23:40,500 --> 01:23:44,300 n, because the important thing is, I'm whacking off a 1528 01:23:44,300 --> 01:23:46,415 constant fraction here every time. 1529 01:23:46,415 --> 01:23:52,050 So I get log squared n as the span, and for the work I get this 1530 01:23:52,050 --> 01:23:58,270 hairy recurrence, that it's t of alpha n plus t of 1 minus alpha 1531 01:23:58,270 --> 01:24:03,920 n plus log n, where alpha falls in this range. 1532 01:24:03,920 --> 01:24:07,700 This does not satisfy the Master Theorem. 1533 01:24:07,700 --> 01:24:10,080 You can actually do this pretty easily with a recursion 1534 01:24:10,080 --> 01:24:13,270 tree, but the way to verify is-- 1535 01:24:13,270 --> 01:24:16,370 we call this technically a hairy recurrence. 1536 01:24:16,370 --> 01:24:20,360 That's the technical term for it. 1537 01:24:20,360 --> 01:24:23,830 So it turns out, this has order n, just like ordinary 1538 01:24:23,830 --> 01:24:28,010 merge, order n time. 1539 01:24:28,010 --> 01:24:30,540 You can use the substitution method, and I 1540 01:24:30,540 --> 01:24:32,230 won't drag you through it, but you can look 1541 01:24:32,230 --> 01:24:35,510 at it in the notes. 1542 01:24:35,510 --> 01:24:39,180 And this should be very familiar to you as having all 1543 01:24:39,180 --> 01:24:43,270 aced 6006, right? 1544 01:24:43,270 --> 01:24:44,560 Otherwise you wouldn't be here, right? 1545 01:24:47,930 --> 01:24:51,640 So the parallelism of the parallel merge is something 1546 01:24:51,640 --> 01:24:55,670 like n over log squared n. 1547 01:24:55,670 --> 01:25:00,510 So that's much better than having an order n bound. 1548 01:25:00,510 --> 01:25:03,340 And now, we can plug it into merge sort. 1549 01:25:03,340 --> 01:25:05,920 So the work is going to be the same as before, because I just 1550 01:25:05,920 --> 01:25:08,780 have the work of the merge, which is still order n. 1551 01:25:08,780 --> 01:25:12,190 So the work is order n log n, once again pulling out the 1552 01:25:12,190 --> 01:25:13,550 Master Theorem. 1553 01:25:13,550 --> 01:25:21,660 And then the span is the span of n over 2 plus log squared n, because basically, 1554 01:25:21,660 --> 01:25:27,590 I have the span of a problem of half the size plus the span 1555 01:25:27,590 --> 01:25:28,910 that I need to merge things. 1556 01:25:28,910 --> 01:25:30,660 That's order log squared n. 1557 01:25:30,660 --> 01:25:32,300 This I want to pause on for a moment.
1558 01:25:32,300 --> 01:25:35,520 People get this recurrence? 1559 01:25:35,520 --> 01:25:38,230 Because this is the span of the merge. 1560 01:25:38,230 --> 01:25:45,310 And so what I end up with is I get another log, log cubed n. 1561 01:25:45,310 --> 01:25:50,350 And so the total parallelism is n over log squared n. 1562 01:25:50,350 --> 01:25:55,440 And this is actually quite a practical thing to implement, 1563 01:25:55,440 --> 01:25:59,980 to get the n over log squared n parallelism versus just a 1564 01:25:59,980 --> 01:26:02,770 log n parallelism. 1565 01:26:02,770 --> 01:26:04,200 We're not going to do tableau construction. 1566 01:26:04,200 --> 01:26:07,170 You can read that up, that's on the notes that are online, 1567 01:26:07,170 --> 01:26:11,010 but you should read through that part of it. 1568 01:26:11,010 --> 01:26:13,520 It's got some nice animations which you don't get to see. 1569 01:26:20,630 --> 01:26:23,240 This is like when you do longest common subsequence and 1570 01:26:23,240 --> 01:26:25,460 stuff like that, how you would solve that type 1571 01:26:25,460 --> 01:26:27,110 of problem in parallel. 1572 01:26:27,110 --> 01:26:28,360 OK, great.
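As a recap, the final bounds for merge sort with the parallel merge:

\[
T_1(n) = 2\,T_1(n/2) + \Theta(n) = \Theta(n \log n),
\qquad
T_\infty(n) = T_\infty(n/2) + \Theta(\log^2 n) = \Theta(\log^3 n),
\]

giving parallelism \(\Theta(n \log n / \log^3 n) = \Theta(n/\log^2 n)\).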