The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: So today, we're going to talk a bit more about parallelism and about how you get performance out of parallel codes. And also, we're going to take a little bit of a tour underneath the Cilk++ runtime system, so you can get an idea of what's going on underneath and why it is that when you code stuff, how it gets mapped and scheduled onto the processors.

So when people talk about parallelism, one of the first things that often comes up is what's called Amdahl's Law. Gene Amdahl was the architect of the IBM 360 computers, who then left IBM and formed his own company that made competing machines, and he made the following observation about parallel computing. He said-- and I'm paraphrasing here-- if half your application is parallel and half is serial, you can't get more than a factor of two speedup, no matter how many processors it runs on. So if you think about it, if it's half parallel and you managed to make that parallel part run in zero time, still the serial part will be half of the time, and you only get a factor of two speedup. You can generalize that to say if some fraction alpha can be run in parallel and the rest must be run serially, the speedup is at most 1 over 1 minus alpha.

OK, so this was used in the 1980s in particular to say why it was that parallel computing had no future, because you simply weren't going to be able to get very much speedup from parallel computing. You were going to spend extra hardware on the parallel parts of the system, and yet you might be limited in terms of how much parallelism there is in a particular application, and you wouldn't get very much speedup. You wouldn't get the bang for the buck, if you will. So things have changed today that make that not quite the same story.
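Written out from that serial/parallel split, with alpha the fraction that can run in parallel and T_1 the one-processor time, the observation is the following (a worked version of the statement in the lecture, not a slide reproduced here):

```latex
% Amdahl's Law from the serial/parallel split described above.
\[
  T_P \;\ge\; (1-\alpha)\,T_1 + \frac{\alpha\,T_1}{P}
  \qquad\Longrightarrow\qquad
  \text{speedup} \;=\; \frac{T_1}{T_P}
  \;\le\; \frac{1}{(1-\alpha) + \alpha/P}
  \;\le\; \frac{1}{1-\alpha}.
\]
% With alpha = 1/2 (half parallel, half serial), the bound is 2
% no matter how large P gets -- the factor-of-two ceiling.
```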
The first thing is that with multicore computers, it is pretty much just as inexpensive to produce a p-processor machine right now-- say, a six-processor machine-- as it is a one-processor machine. So it's not like you're actually paying for those extra processing cores. They come for free, because what else are you going to use that silicon for?

And the other thing is that we've had a large growth of understanding of problems for which there's ample parallelism, where that serial fraction of the time is, in fact, quite small. And the main place these things come from, it turns out-- this analysis is kind of a throughput analysis. It says, gee, I only get a 50% speedup for that application. But what most people care about in most interactive applications, at least for client-side programming, is response time. And for any problem that you have that has a response time that's too long and is compute intensive, using parallelism to make it so that the response is much zippier is definitely worthwhile.

And so this is true even for things like game programs. Game programs don't have quite a response time problem; they have what's called a time box problem, where you have a certain amount of time-- 13 milliseconds typically, because you need some slop to make sure that you can go from one frame to another-- about 13 milliseconds to do a rendering of whatever the frame is that the game player is going to see on his computer or her computer. And so in that time, you want to do as much as you possibly can, and so there's a big opportunity there to take advantage of parallelism in order to do more: have higher quality graphics, have better AI, have better physics, and all the other components that make up a game engine.

But one of the issues with Amdahl's Law-- and this analysis is a cogent analysis that Amdahl made-- one of the issues here is that it doesn't really say anything about how fast you can expect your application to run.
In other words, this is a nice sort of thing, but who really can decompose their application into the serial part and the part that can be parallel? Well, fortunately, there's been a lot of work in the theory of parallel systems to answer this question, and we're going to go over some of that really outstanding research that helps us understand what parallelism is.

So we're going to talk a little bit about what parallelism is and come up with a very specific measure of parallelism-- quantify parallelism, OK? We're also going to talk a little bit about scheduling theory and how the Cilk++ runtime system works. And then we're going to have a little chess lesson.

So who here plays chess? Nobody plays chess anymore. Who plays Angry Birds?

[LAUGHTER]

OK. So you don't have to know anything about chess to learn this chess lesson, that's OK.

So we'll start out with: what is parallelism? Let's recall first the basics of Cilk++. So here's the example of the lousy Fibonacci that everybody parallelizes because it's good didactically. We have the cilk_spawn statement, which says that the child can execute in parallel with the parent caller, and the sync, which says don't go past this point until all your spawned children have returned. And that's a local sync-- that's just a sync for that function. It's not a sync across the whole machine. Some of you may have had experience with OpenMP barriers, for example; that's a sync across the whole machine. This is not-- this is just a local sync for this function, saying when I sync, make sure all my children have returned before going past this point. And just remember also that Cilk keywords grant permission for parallel execution. They don't command parallel execution. OK, so we can always execute our code serially if we choose to.

Yes?

AUDIENCE: [UNINTELLIGIBLE] Can't the runtime figure out that spawning an extra child would be more expensive?
Can't it, like, look at this and be like--

PROFESSOR: We'll go into it. I'll show you how it works later in the lecture. I'll show you how it works, and then we can talk about what knobs you have to tune, OK?

So it's helpful to have an execution model for something like this. And so we're going to look at an abstract execution model, which is basically asking: what does the instruction trace look like for this program? So normally when you execute a program, you can imagine one instruction executing after the other. And if it's a serial program, all those instructions essentially form a long chain. Well, there's a similar thing for parallel computers, which is that instead of a chain, as you'll see, it gets bushier, and it's going to be a directed acyclic graph. So let's take a look at how we do this. We'll take the example of fib of 4.

So what we're going to do is start out here with a rectangle that I want you to think about as sort of a function call activation record. So it's a record on a stack. It's got variables associated with it. The only variable I'm going to keep track of is n, so that's what the 4 is there. OK, so we're going to do fib of 4. So in this activation frame, we have the variable 4, and now what I've done is I've color-coded the fib function here into the parts that are all serial. So there's a serial part up to where it spawns, then there's recursively calling the fib, and then there's returning. So there are sort of three parts to this function, each of which is, in fact, a chain of serial instructions. I'm going to collapse those chains into a single circle here that I'm going to call a strand.

OK, now what we do is we execute the strand, which corresponds to executing the instructions and advancing the program counter up until the point where we hit this fib of n minus 1. At that point, I basically call fib of n minus 1. So in this case, it's now going to be fib of 3. So that means I create a child and start executing, in the child, this prefix part of the function.
However, unlike an ordinary function call-- where I would make this call and then this guy would just sit here and wait until this frame was done-- since it's a spawn, what happens is I'm actually going to continue executing in the parent and execute, in fact, the green part. So in this case, evaluating the arguments, etc. Then it's going to spawn here, but this guy, in fact-- what it does when it gets here is it evaluates n minus 2 and does a call of fib of n minus 2. So I've indicated that this was a called frame by showing it in a light color. So these are spawn, spawn, call; meanwhile, this thing is going. So at this point, we now have one, two, three things that are operating in parallel at the same time.

We keep going on, OK? So this guy does a spawn and has a continuation; this one does a call, but while he's doing a call, he's waiting for the return, so he doesn't start executing the successor. He's stalled at the cilk_sync here. And we keep executing, and so as you can see, what's happening is we're actually creating a directed acyclic graph of these strands. So here, basically, this guy was able to execute because both of the children, the one that he had spawned and the one that he had called, have returned. And so this fellow, therefore, is able then to execute the return-- so the addition of x plus y in particular, and then the return to the parent.

And so what we end up with is all these serial chains of instructions, represented by these strands, all these circles, embedded in the call tree like you would have in an ordinary serial execution. You have a call tree that you execute up and down; you walk it like a stack, normally. Now, in fact, what we have embedded in there is the parallel execution, which forms a DAG, a directed acyclic graph. So when you start thinking in parallel, you have to start thinking about the DAG as your execution model, not a chain of instructions.
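For reference, the routine whose execution was just traced looks roughly like the sketch below. The slide's exact code isn't reproduced in the transcript, so this is a Cilk Plus-style sketch (the course's Cilk++ setup may use slightly different headers), not the lecture's verbatim program:

```cpp
#include <cstdint>
#include <cstdio>
#include <cilk/cilk.h>   // cilk_spawn / cilk_sync keywords (assumed header)

// The "lousy" didactic parallel Fibonacci: spawn one recursive call,
// do the other in the parent, then sync before combining.
int64_t fib(int64_t n) {
    if (n < 2) return n;                 // base case: a single strand
    int64_t x = cilk_spawn fib(n - 1);   // child may run in parallel with the continuation
    int64_t y = fib(n - 2);              // ordinary call in the parent (the "green" strand)
    cilk_sync;                           // local sync: wait for this function's spawned children
    return x + y;                        // final strand: add and return to the caller
}

int main() {
    std::printf("fib(4) = %lld\n", (long long)fib(4));
    return 0;
}
```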
And the nice thing about this particular execution model we're going to be looking at is that nowhere did I say how many processors we were running on. This is a processor-oblivious model. It doesn't know how many processors you're running on. In the execution model, we are simply thinking abstractly about what can run in parallel, not what actually does run in parallel in an execution. So any questions about this execution model?

OK. So just so that we have some terminology: the parallel instruction stream is a DAG with vertices and edges. Each vertex is a strand, OK? Which is a sequence of instructions not containing a call, spawn, sync, return, or thrown exception, if you're doing exceptions. We're not going to really talk about exceptions much. They are supported in the software that we'll be using, but for the most part, we're not going to have to worry about them. OK, so there's an initial strand where you start and a final strand where you end. Then each edge is a spawn or a call or a return, or what's called a continue edge or continuation edge, which goes from the parent, when a parent spawns something, to the next instruction after the spawn. So we can classify the edges in that fashion.

And I've only explained this for spawn and sync. As you recall from last time, we also talked about cilk_for. It turns out cilk_for is converted to spawns and syncs using a recursive divide-and-conquer approach. We'll talk about that next time, on Thursday. So we'll talk more about cilk_for and how it's implemented, and the implications for how loop parallelism works. At the fundamental level, the runtime system is only concerned with spawns and syncs.

Now, given that we have a DAG-- so I've taken away the call tree and just left the strands of a computation. It's actually not the same as the computation we saw before. We would like to understand: is this a good parallel program or not, based on the logical parallelism that I've exposed? So how much parallelism do you think is in here?
Give me a number. How many processors does it make sense to run this on? Five? That's as parallel as it gets. Let's take a look. We're going to do an analysis, and at the end of it, we'll know what the answer is.

So for that, let Tp be the execution time on p processors for this particular program. It turns out there are two measures that are really important. The first is called the work. OK, so of course, we know that real machines have caches, etc. Let's forget all of that. Just a very simple algorithmic model where every strand, let's say, costs us unit time, as opposed to in practice, where there may be many instructions and so forth. We can take that into account-- let's take that into account separately.

So T1 is the work. It's the time if I had to execute it on one processor; I've got to do all the work that's in here. So what's the work of this particular computation? I think it's 18, right? Yeah, 18. So T1 is the work. So even though I'm executing in parallel, I could execute it serially, and then T1 is the amount of work it would take.

The other measure is called the span, and it's sometimes called critical-path length or computational depth. And it corresponds to the longest path of dependencies in the DAG. We call it T infinity because even if you had an infinite number of processors, you still can't do this one until you finish that one. You can't do this one until you finish that one, can't do this one till you've finished that one, and so forth. So even with an infinite number of processors, I still wouldn't go faster than the span. That's why we denote it by T infinity.

So these are the two important measures. Now, what we're really interested in is Tp for a given p. As you'll see, we actually can get some bounds on the performance on p processors just by looking at the work, the span, and the number of processors we're executing on.
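These two measures are easy to compute mechanically once you have the strand DAG in hand. The sketch below is not part of the lecture-- just one way to make the definitions concrete, under the assumption that strands are numbered in topological order: work is the sum of the strand costs, and span is the longest weighted path.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// A strand DAG: cost[v] is the cost of strand v (unit cost in lecture),
// succ[v] lists the strands that depend on v. Strands are assumed to be
// numbered in topological order (predecessors before successors).
struct StrandDag {
    std::vector<int> cost;
    std::vector<std::vector<int>> succ;
};

// Work T1 = total cost of all strands.
long long work(const StrandDag& g) {
    long long t1 = 0;
    for (int c : g.cost) t1 += c;
    return t1;
}

// Span T_infinity = cost of the longest path through the DAG.
long long span(const StrandDag& g) {
    int n = (int)g.cost.size();
    std::vector<long long> longest(n, 0);   // longest path ending just before v
    long long t_inf = 0;
    for (int v = 0; v < n; ++v) {           // relies on topological numbering
        longest[v] += g.cost[v];
        t_inf = std::max(t_inf, longest[v]);
        for (int w : g.succ[v])
            longest[w] = std::max(longest[w], longest[v]);
    }
    return t_inf;
}

int main() {
    // Tiny example: one strand forks two unit-cost strands that join again.
    StrandDag g{{1, 1, 1, 1}, {{1, 2}, {3}, {3}, {}}};
    std::printf("work = %lld, span = %lld\n", work(g), span(g));  // work 4, span 3
    return 0;
}
```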
So the first bound is the following; it's called the Work Law. The Work Law says that the time on p processors is at least the time on one processor divided by p. So why does that Work Law make sense? What's it saying?

Sorry?

AUDIENCE: Like, work is conserved, sort of? I mean, you have to do the same amount of work.

PROFESSOR: You have to do the same amount of work, and on every time step, you can get at most p pieces of work done. So if you're running for fewer than T1 over p steps, you've done less than T1 work in time Tp. So you won't have done all the work if you run for less than this. So the time Tp must be at least T1 over p. You only get to do p work on each step. Is that pretty clear?

The second one should be even clearer: the Span Law. On p processors, you're not going to go faster than if you had an infinite number of processors, because the infinite-processor schedule could always just use fewer of its processors. Once again, this is a very simple model. We're not taking into account scheduling, we're not taking into account overheads or whatever-- just a simple conceptual model for understanding parallelism. So any questions about these two laws?

There are going to be a couple of formulas in this lecture today that you should write down and play with. So these two-- they may seem simple, but these are hugely important formulas. You should know that Tp is at least T1 over p, that's the Work Law, and that Tp is at least T infinity, that's the Span Law. Those are bounds on how fast you could execute. Do I have a question in the back there?

OK, so let's see what happens to work and span in terms of how we can understand our programs and decompose them. So suppose that I have a computation A followed by a computation B, and I connect them in series. What happens to the work? How does the work of the whole thing correspond to the work of A and the work of B? What's that?
AUDIENCE: [UNINTELLIGIBLE]

PROFESSOR: Yeah, add them together. You get T1 of A plus T1 of B. Take the work of this and the work of this. OK, that's pretty easy. What about the span? So the span is the longest path of dependencies. What happens to the span when I connect two things in series? Yeah, it just sums as well, because I take whatever the longest path is from here to here and then the longest one from here to here-- it just adds.

But now let's look at parallel composition. So now suppose that I can execute these two things in parallel. What happens to the work? It just adds, just as before. The work always adds. The work is easy because it's additive. What happens to the span? What's that?

AUDIENCE: [UNINTELLIGIBLE]

PROFESSOR: It's the max of the spans. Right, so whichever one of these has the longer span, that's going to be the span of the total. Does that give you some intuition? So when we analyze the spans of things, we're going to see, in fact, maxes occurring all over the place.

So speedup is defined to be T1 over Tp. Speedup is: how much faster am I on p processors than I am on one processor? Pretty easy. So if T1 over Tp is equal to p, we say we have perfect linear speedup, or linear speedup. That's good, right? Because if I use p processors, I'd like to have things go p times faster. OK, that would be the ideal world.

If T1 over Tp, which is the speedup, is greater than p, that says we have superlinear speedup. And in our model, we don't get that, because of the Work Law. The Work Law says Tp is greater than or equal to T1 over p, and if you just do a little algebra here, you get that T1 over Tp must be less than or equal to p. So you can't get superlinear speedup. In practice, there are situations where you can get superlinear speedup due to caching effects and a variety of things.
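Collecting the rules just stated for composing two computations A and B, and the speedup bound:

```latex
% Work and span under series and parallel composition, plus the speedup bound.
\[
  T_1(A\ \text{then}\ B) = T_1(A) + T_1(B), \qquad
  T_\infty(A\ \text{then}\ B) = T_\infty(A) + T_\infty(B),
\]
\[
  T_1(A \parallel B) = T_1(A) + T_1(B), \qquad
  T_\infty(A \parallel B) = \max\bigl(T_\infty(A),\, T_\infty(B)\bigr),
\]
\[
  \text{speedup} = \frac{T_1}{T_p} \le p
  \quad\text{by the Work Law } T_p \ge T_1/p .
\]
```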
We'll talk about some of those caching effects later. But in this simple model, we don't get that kind of behavior.

And of course, the case I left out is the common case, which is that T1 over Tp is less than p. And that's very common: people write code which doesn't give them linear speedup. We're mostly interested in getting linear speedup here. That's our goal, so that we're getting the most bang for the buck out of the processors we're using.

OK, parallelism. So we're finally to the point where I can talk about parallelism and give a quantitative definition of parallelism. So the Span Law says that Tp is at least T infinity, right? The time on p processors is at least the time on an infinite number of processors. So the maximum possible speedup-- that's T1 over Tp-- given T1 and T infinity, is T1 over T infinity. And we call that the parallelism. It's the maximum amount of speedup we could possibly attain.

So we have the speedup, and the Span Law says this is the maximum amount we can get. We could also view it as, if I look along the critical path of the computation, it's sort of the average amount of work at every level: the work, the total amount of stuff here, divided by that length there. That sort of tells us the width, the average amount of stuff that's going on in every step.

So for this example, what is the-- I forgot to put this on my slide-- what is the parallelism of this particular DAG here? Two, right? So the span has length nine-- this is assuming everything was unit time; obviously in reality, when you have more instructions, you would in fact make it be whatever the length of this was in terms of number of instructions, or what have you, of execution time of all these things. So this has length 9, there are 18 things here, and the parallelism is 2. So we can quantify parallelism precisely. We'll see why it's important to quantify it.
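In symbols, with the numbers from this small DAG:

```latex
% Parallelism = maximum possible speedup = work over span.
\[
  \overline{P} \;=\; \frac{T_1}{T_\infty} \;=\; \frac{18}{9} \;=\; 2
  \qquad\text{for the example DAG above.}
\]
```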
So that's the maximum speedup we're going to get when we run this application.

Here's another example we did before: fib of 4. Let's assume again that each strand takes unit time to execute. So what is the work in this particular computation? Assume every strand takes unit time to execute, which of course it doesn't, but-- anybody care to hazard a guess? 17, yeah, because there are four nodes here that have 3, plus 5. So 3 times 4 plus 5 is 17. So the work is 17.

OK, what's the span? This one's tricky. Too bad it's not a little bit more focused. What's the span?

AUDIENCE: 8.

PROFESSOR: 8, that's correct. Who got 7? Yeah, so I got 7 when I did this, and then I looked harder and it was 8. It's 8, so here it is. Here's the span. There it goes. Ooh, that little sidestep there, that's what makes it 8. OK, so basically it comes down here, and I had gone down like that when I did it, but in fact, you've got to go over and back up. So it's actually 8.

So that says that the parallelism is a little bit more than 2-- 2 and 1/8. What that says is that if I use many more than two processors, I can't get linear speedup anymore. I'm only going to get marginal performance gains if I use more than 2, because the maximum speedup I can get is like 2.125 if I had an infinite number of processors. So any questions about this?

So this, by the way, is deceptively simple, and yet, if you don't play around with it a little bit, you can get confused very easily. Deceptively simple, very powerful to be able to do this.
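Putting those fib(4) numbers together the same way:

```latex
% Work, span, and parallelism for the fib(4) strand DAG with unit-cost strands.
\[
  T_1 = 4 \cdot 3 + 5 = 17, \qquad
  T_\infty = 8, \qquad
  \overline{P} = \frac{T_1}{T_\infty} = \frac{17}{8} = 2.125 .
\]
```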
So here, for the analysis of parallelism, one of the things that we have going for us in using the Cilk tool suite is a program called Cilkview, which has a scalability analyzer. And it is like the race detector that I talked to you about last time in that it uses dynamic instrumentation. So you run your program under Cilkview-- it's like running it under Valgrind, for example, or what have you. So basically you run your program under it, and it analyzes your program for scalability. It computes the work and span of your program to derive some upper bounds on parallel performance, and it also estimates scheduling overhead to compute what's called the burdened span, for lower bounds.

So let's take a look. Here, for example, is a quicksort program. This is a C++ program. Here we're using a template so that the type of the items that I'm sorting can be a variable. So typename T-- can we shut the back door there? One of the TAs? Somebody run up to-- thank you.

So we have the type variable T, and we're going to quicksort from the beginning to the end of the array. And what we do is, just as you're familiar with in quicksort: if there's actually something to be sorted-- more than one thing-- then we find the middle by partitioning, and this is a bit of C++ magic to find the middle element. And then the important part, from our point of view, is that after we've done this partition, we quicksort the first part of the array, from the beginning to the middle, and then from the beginning plus 1 or the middle, whichever is greater, to the end. And then we sync.

So what we're doing is quicksort where we're spawning off the two subproblems to be solved in parallel recursively. So they're going to execute in parallel, and they're going to execute in parallel, and so forth. It's a fairly natural thing to do divide-and-conquer on quicksort, because the two subproblems can be operated on independently. We just sort them recursively, but we can sort them in parallel.
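A sketch of a routine along those lines, in Cilk style-- this is not the slide's exact code; the function name, the lambda-based partition, and the pivot choice here are illustrative:

```cpp
#include <algorithm>
#include <utility>
#include <cilk/cilk.h>   // cilk_spawn / cilk_sync (assumed header)

// Parallel quicksort: partition serially, then sort the two halves in parallel.
template <typename T>
void parallel_qsort(T* begin, T* end) {
    if (end - begin > 1) {                       // more than one element to sort
        T pivot = *(end - 1);                    // last element as pivot (one common choice)
        // Serial partition: everything less than the pivot moves to the front;
        // 'middle' is the first element that is not less than the pivot.
        T* middle = std::partition(begin, end - 1,
                                   [&](const T& x) { return x < pivot; });
        std::swap(*(end - 1), *middle);          // put the pivot into its final position
        cilk_spawn parallel_qsort(begin, middle);  // low side, possibly in parallel...
        parallel_qsort(middle + 1, end);           // ...with the high side
        cilk_sync;                               // wait for the spawned half before returning
    }
}
```

The partition itself runs serially, which is exactly what will limit the span in the analysis that follows.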
OK, so suppose that we are sorting 100,000 numbers. How much parallelism do you think is in this code? So remember that we're getting this recursive stuff done. How many people think-- well, it's not going to be more than 100,000, I promise you. So how many people think the parallelism is more than a million? Raise your hand, more than a million? And how many people think more than 100,000? And how many people think more than 10,000? OK, between the two. More than 1,000? OK, how about more than 100? 100 to 1,000? How about 10 to 100? How about between 1 and 10?

So a lot of people think between 1 and 10. Why do you think that there's so little parallelism in this? You don't have to justify yourself, OK. Well, let's see how much there is according to Cilkview.

So here's the type of output that you'll get. You'll get a graphical curve; you'll also get textual output. But this is sort of the graphical output. And this is basically showing what the running time here is. So the first thing it shows-- it will actually run your program, benchmark your program, on, in this case, up to 8 cores. So we ran it up to 8 cores, and it gives you what your measured speedup is.

The second thing is it tells you the parallelism. If you can't read that, it's 11.21. So we get about 11. Why do you think it's not higher? What's that?

AUDIENCE: It's the log.

PROFESSOR: What's the log?

AUDIENCE: [UNINTELLIGIBLE]

PROFESSOR: Yeah, but you're doing the two things in parallel, right? We'll actually analyze this. So it has to do with the fact that the partition routine is a serial piece of code, and it's big. So the initial partitioning takes you 100,000-- sorry, 100 million steps of doing a partition-- before you get to do any parallelism at all. And we'll see that in just a minute.

So it gives you the parallelism. It also plots this. So this is the parallelism-- notice that's the same number, 11.21, plotted as this bound. So it tells you the Span Law and it tells you the Work Law.
This is the linear speedup: if you were getting linear speedup, this is what your program would give you. So it gives you these two bounds, the Work Law and the Span Law, on your speedup. And then it also computes what's called the burdened parallelism, estimating scheduling overheads, to sort of give you a lower bound. Now, that's not to say that your numbers can't fall outside this range. But when they do, it will tell you essentially what the issues are with your program. And we'll discuss how you diagnose some of those issues-- actually, that's in one of the handouts that we've provided. I think that's in one of the handouts; if not, we'll make sure it's among the handouts.

So basically, this gives you a range for what you can expect. The important thing to notice here, for example, is that we're losing performance, but it's not due to the Work Law. In some sense, what's happening is we are losing it because of the Span Law, because we're starting to approach the point where the span is going to be the issue. We'll talk more about this. So the main thing is you have a tool that can tell you the work and span, so that you can analyze your own programs to understand whether you are bounded by parallelism, for example, in the code that you've written.

OK, let's do a theoretical analysis of this to understand why that number is small. So the main thing here is that the expected work, as you recall, of quicksort is order n log n. You tend to do order n log n work: you partition, and then you're solving two problems of the same size. If you actually draw out the recursion tree, it's log height with a linear amount of work on every level, for n log n total work. The expected span, however, is order n, because the partition routine is a serial program that partitions up the thing of size n in order n time. So when you compute the parallelism, you get parallelism of order log n, and log n is kind of puny parallelism-- that's our technical word for it.
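Written as recurrences, with the serial partition dominating the span (treating the split as roughly even, in expectation):

```latex
% Expected work and span of the parallel quicksort above, on n elements.
\begin{align*}
  T_1(n)      &= \Theta(n \lg n), \\
  T_\infty(n) &\approx \Theta(n) + T_\infty(n/2) = \Theta(n)
               \quad\text{(serial partition, then the deeper half)}, \\
  \overline{P}(n) &= \frac{T_1(n)}{T_\infty(n)} = \Theta(\lg n).
\end{align*}
```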
So puny parallelism is what we get out of quicksort.

So it turns out there are lots of things that you can analyze. Here's just a selection of some of the interesting practical algorithms and the kinds of analyses that you can do, showing that, for example, with merge sort you can do it with work n log n. You can get a span of log cubed n, and so then the parallelism is the ratio of the two. In fact, you can theoretically get a log squared n span, but that's not as practical an algorithm as the one that gives you log cubed n. And you can go through, and there are a whole bunch of algorithms for which you can get very good parallelism. So for all of these, if you look at the ratio, the parallelism is quite high.

So let's talk a little bit about what's going on underneath and why parallelism is important. When you describe your program in Cilk, you express the potential parallelism of your application. You don't say exactly how it's going to be scheduled; that's done by the Cilk++ scheduler, which maps the strands dynamically onto the processors at run time. So it's going to do the load balancing and everything necessary to balance your computation across the number of processors. We want to understand how that process works, because that's going to help us understand how it is that we can build codes that will map very effectively onto the number of processors.

Now, it turns out that the theory of distributed schedulers, such as the one in Cilk++, is complicated. I'll wave my hands about it towards the end, but the analysis of it is advanced-- you have to take a graduate course to get that stuff. So instead, we're going to explore the ideas with a centralized, much simpler scheduler, which serves as a surrogate for understanding what's going on. So the basic idea of almost all scheduling theory in this domain is greedy scheduling.
And so this is-- by the way, we're coming to the second thing you have to understand really well in order to be able to generate good code, the second sort of theoretical thing-- so the idea of a greedy scheduler is that you want to do as much work as possible on each step.

So the idea here is, let's take a look. For example, suppose that we've executed this part of the DAG already. Then there are a certain number of strands that are ready to execute, meaning all their predecessors have executed. How many strands are ready to execute on this DAG? Five, right? These guys. So those five strands are ready to execute.

So the idea is-- and let me illustrate for p equals 3-- the idea is to understand the execution in terms of two types of steps. In a greedy schedule, you always do as much as possible. So this is what would be called a complete step, because I can schedule all three processors to have some work to do on that step. So which are the best three guys to execute? Yes, so I'm not sure what the best three are, but for sure, you want to get this guy and this guy, right? Maybe that guy's not, but this guy you definitely want to execute. And these guys, I guess, OK.

So in a greedy scheduler, no, you're not allowed to look to see which ones are the best to execute. You don't know what the future is, and the scheduler isn't going to know what the future is, so it just executes any p of them. You just execute any p of them. In this case, I executed these p strands. In this case, I executed these three guys even though they weren't necessarily the best. And in a greedy scheduler, it doesn't look to see what's the best one to execute; it just executes as many as it can. In this case, it's p.

Now we have what's called an incomplete step. Notice nothing got enabled-- that was sort of too bad. So there are only two guys that are ready to go. What do you think happens if I have an incomplete step, namely fewer than p strands are ready?
I just execute all of them, as many as I can. Run all of them. So that's what a greedy scheduler does: at every step, it executes as many as it can, and we can classify the steps as ones which are complete, meaning we used all our processors, versus incomplete, meaning we only used a subset of our processors in scheduling it. So that's what a greedy scheduler does.

Now, the important thing is the analysis of this scheduler. And this is, by the way, the single most important thing in scheduling theory that you're ever going to learn, this particular theorem. It goes all the way back to 1968, and what it basically says is that any greedy scheduler achieves a bound of T1 over p plus T infinity. So why is that an interesting upper bound? Yeah?

AUDIENCE: That says that it's got the refinement of what you said before-- even if you add as many processors as you can, basically you're bounded by T infinity.

PROFESSOR: Yeah.

AUDIENCE: It's compulsory.

PROFESSOR: So basically, this term here is the term in the Work Law, this is the term in the Span Law, and we're saying you can always achieve the sum of those two lower bounds as an upper bound. So let's see how we do this, and then we'll look at some of the implications. Question-- do you have a question? No?

So here's the proof that you meet this bound. The proof says-- and I'll illustrate for p equals 3-- how many complete steps could we have? I'll argue that the number of complete steps is at most T1 over p. Why is that? Every complete step performs p work. So if I had more complete steps than T1 over p, I'd be doing more than T1 work. But I only have T1 work to do. So the maximum number of complete steps I could have is at most T1 over p. Do people follow that?

So the trickier part of the proof-- which is not all that tricky, but it's a little bit trickier-- is the other side.
770 00:44:08,630 --> 00:44:12,120 How many incomplete steps could I have? 771 00:44:12,120 --> 00:44:14,420 So we execute those. 772 00:44:14,420 --> 00:44:19,000 So I claim that the number of incomplete steps is bounded by 773 00:44:19,000 --> 00:44:22,610 the critical path length, by the span. 774 00:44:22,610 --> 00:44:24,440 Why is that? 775 00:44:24,440 --> 00:44:26,860 Well let's take a look at the part of the DAG that 776 00:44:26,860 --> 00:44:29,290 has yet to be executed. 777 00:44:29,290 --> 00:44:31,230 So that's this gray part here. 778 00:44:31,230 --> 00:44:33,270 There's some span associated with that. 779 00:44:33,270 --> 00:44:37,440 In this case, it's this longest path. 780 00:44:37,440 --> 00:44:46,460 When I execute all of the threads that are ready 781 00:44:46,460 --> 00:44:52,530 to go, I guarantee to reduce the span of that unexecuted 782 00:44:52,530 --> 00:44:54,820 DAG by at least one. 783 00:44:58,300 --> 00:45:02,780 So as I do here, so I reduce it by one when I execute. 784 00:45:02,780 --> 00:45:07,022 So if I have a complete step, I'm not guaranteed to reduce 785 00:45:07,022 --> 00:45:13,200 the span of the unexecuted DAG, because I may execute 786 00:45:13,200 --> 00:45:15,950 things that, as I showed you in this example, don't actually 787 00:45:15,950 --> 00:45:17,240 advance anything. 788 00:45:17,240 --> 00:45:23,770 But I execute all the ready threads on an incomplete step, 789 00:45:23,770 --> 00:45:25,490 and that's going to reduce it by one. 790 00:45:25,490 --> 00:45:28,410 So the number of incomplete steps is at most T infinity. 791 00:45:28,410 --> 00:45:32,650 So the total number of steps is at most the sum. 792 00:45:32,650 --> 00:45:35,710 So as I say, this proof you should understand in your 793 00:45:35,710 --> 00:45:39,380 sleep because it's the most important scheduling theory 794 00:45:39,380 --> 00:45:43,250 proof that you're going to probably see in your lifetime. 795 00:45:43,250 --> 00:45:48,180 It's very old, and really, very, very simple and yet, 796 00:45:48,180 --> 00:45:50,840 there's a huge amount of scheduling theory, if you have 797 00:45:50,840 --> 00:45:54,560 a look at scheduling theory, that comes out of this, just 798 00:45:54,560 --> 00:45:58,160 making this same problem more complicated and more real and 799 00:45:58,160 --> 00:46:00,340 more interesting and so forth. 800 00:46:00,340 --> 00:46:03,590 But this is really the crux of what's going on. 801 00:46:03,590 --> 00:46:07,510 Any questions about this proof? 802 00:46:07,510 --> 00:46:13,370 So one corollary of the greedy scheduling theorem is that 803 00:46:13,370 --> 00:46:16,650 any greedy scheduler achieves within a factor of two of 804 00:46:16,650 --> 00:46:17,900 optimal scheduling. 805 00:46:20,280 --> 00:46:21,400 So let's see why that is. 806 00:46:21,400 --> 00:46:24,070 So it's guaranteed as an upper bound to get within a factor 807 00:46:24,070 --> 00:46:26,220 of two of optimal. 808 00:46:26,220 --> 00:46:27,650 So here's the proof. 809 00:46:27,650 --> 00:46:31,700 So let Tp star be the execution time produced by the 810 00:46:31,700 --> 00:46:32,425 optimal scheduler. 811 00:46:32,425 --> 00:46:35,630 This is the scheduler that knows the whole DAG in advance 812 00:46:35,630 --> 00:46:38,000 and can schedule things exactly where they need to be 813 00:46:38,000 --> 00:46:40,790 scheduled to minimize the total amount of time.
814 00:46:40,790 --> 00:46:44,550 Now even though the optimal scheduler can schedule very 815 00:46:44,550 --> 00:46:47,760 efficiently, it's still bound by the Work Law 816 00:46:47,760 --> 00:46:50,170 and the Span Law. 817 00:46:50,170 --> 00:46:53,260 So therefore, Tp star has still got to be greater than 818 00:46:53,260 --> 00:46:56,730 T1 over p and greater than T infinity by the 819 00:46:56,730 --> 00:46:58,360 Work and Span Laws. 820 00:46:58,360 --> 00:47:01,850 Even though it's optimal, every scheduler must obey the 821 00:47:01,850 --> 00:47:05,190 Work Law and Span Law. 822 00:47:05,190 --> 00:47:08,680 So then we have, by the greedy scheduling theorem, Tp is at 823 00:47:08,680 --> 00:47:11,770 most T1 over p plus T infinity. 824 00:47:11,770 --> 00:47:15,660 Well that's at most twice the maximum of these two values, 825 00:47:15,660 --> 00:47:17,180 whichever is larger. 826 00:47:17,180 --> 00:47:20,880 I've just plugged in to get the maximum of those two and 827 00:47:20,880 --> 00:47:23,590 that's at most, by this equation, 828 00:47:23,590 --> 00:47:25,670 twice the optimal time. 829 00:47:29,060 --> 00:47:33,642 So this very simple corollary says oh, greedy 830 00:47:33,642 --> 00:47:35,110 scheduling is actually pretty good. 831 00:47:35,110 --> 00:47:37,400 It's not optimal, in fact, optimal 832 00:47:37,400 --> 00:47:39,200 scheduling is NP-complete. 833 00:47:39,200 --> 00:47:41,010 Very hard problem to solve. 834 00:47:41,010 --> 00:47:43,630 But to get within a factor of two, you just do greedy 835 00:47:43,630 --> 00:47:44,880 scheduling, it works just fine. 836 00:47:47,460 --> 00:47:52,770 More important is the next corollary, which has to do with 837 00:47:52,770 --> 00:47:54,630 when you get linear speedup. 838 00:47:54,630 --> 00:47:56,660 And this is, I think, the most important thing 839 00:47:56,660 --> 00:47:57,770 to get out of this. 840 00:47:57,770 --> 00:48:01,820 So any greedy scheduler achieves near perfect linear 841 00:48:01,820 --> 00:48:04,590 speedup whenever-- 842 00:48:04,590 --> 00:48:05,970 what's this thing on the left-hand side? 843 00:48:05,970 --> 00:48:08,940 What's the name we call that?-- 844 00:48:08,940 --> 00:48:10,550 the parallelism, right? 845 00:48:10,550 --> 00:48:13,900 That's the parallelism, is much bigger than the number of 846 00:48:13,900 --> 00:48:16,300 processors you're running on. 847 00:48:16,300 --> 00:48:19,440 So if the number of processors you're running on is smaller than 848 00:48:19,440 --> 00:48:23,400 the parallelism of your code, it says you can expect near 849 00:48:23,400 --> 00:48:26,510 perfect linear speedup. 850 00:48:26,510 --> 00:48:29,140 OK, so what does that say you want to do in your program? 851 00:48:29,140 --> 00:48:33,690 You want to make sure you have ample parallelism and then the 852 00:48:33,690 --> 00:48:37,210 scheduler will be able to schedule it so that you get 853 00:48:37,210 --> 00:48:39,170 near perfect linear speedup. 854 00:48:39,170 --> 00:48:42,210 Let's see why that's true. 855 00:48:42,210 --> 00:48:46,470 So T1 over T infinity is much bigger than p is equivalent to 856 00:48:46,470 --> 00:48:50,860 saying that T infinity is much less than T1 over p. 857 00:48:50,860 --> 00:48:53,960 That's just algebra. 858 00:48:53,960 --> 00:48:55,060 Well what does that mean? 859 00:48:55,060 --> 00:48:58,420 The greedy scheduling theorem says Tp is at most T1 over p 860 00:48:58,420 --> 00:48:59,700 plus T infinity.
861 00:48:59,700 --> 00:49:02,780 We just said that if we have this condition, then T 862 00:49:02,780 --> 00:49:08,020 infinity is very small compared to T1 over p. 863 00:49:08,020 --> 00:49:11,830 So if this is negligible, then the whole thing is 864 00:49:11,830 --> 00:49:13,195 about T1 over p. 865 00:49:15,850 --> 00:49:19,617 Well that just says that the speedup is about p. 866 00:49:23,320 --> 00:49:27,920 So the name of the game is to make sure that your span is 867 00:49:27,920 --> 00:49:31,950 relatively short compared to the amount of work per 868 00:49:31,950 --> 00:49:34,082 processor that you're doing. 869 00:49:34,082 --> 00:49:37,510 And in that case, you'll get linear speedup. 870 00:49:37,510 --> 00:49:40,050 And that happens when you've got enough parallelism 871 00:49:40,050 --> 00:49:43,150 compared to the number of processors you're running on. 872 00:49:43,150 --> 00:49:44,460 Any questions about this? 873 00:49:44,460 --> 00:49:50,000 This is like the most important thing you're going 874 00:49:50,000 --> 00:49:51,395 to learn about parallel computing. 875 00:49:57,410 --> 00:49:59,230 Everything else we're going to do is going to be derivatives 876 00:49:59,230 --> 00:50:02,430 of this, so if you don't understand this, you'll have a 877 00:50:02,430 --> 00:50:05,670 hard time with the other stuff. 878 00:50:05,670 --> 00:50:08,360 So in some sense, it's deceptively simple, right? 879 00:50:08,360 --> 00:50:13,730 We just have a few variables, T1, Tp, T infinity, p, there's 880 00:50:13,730 --> 00:50:14,890 not much else going on. 881 00:50:14,890 --> 00:50:19,590 But there are these bounds and these elegant theorems that 882 00:50:19,590 --> 00:50:25,430 tell us something about how, no matter what the shape of the 883 00:50:25,430 --> 00:50:29,200 DAG is or whatever, these two values, the work and the span, 884 00:50:29,200 --> 00:50:33,890 really characterize very closely where it is that you 885 00:50:33,890 --> 00:50:37,440 can expect to get linear speedup. 886 00:50:37,440 --> 00:50:39,660 Any questions? 887 00:50:39,660 --> 00:50:43,630 OK, good. 888 00:50:46,500 --> 00:50:50,220 So the quantity T1 over p T infinity, what is that? 889 00:50:50,220 --> 00:50:56,310 That's just the parallelism divided by p. 890 00:50:56,310 --> 00:50:59,410 That's called the parallel slackness. 891 00:50:59,410 --> 00:51:05,200 So if this parallel slackness is 10, it means you have 10 times 892 00:51:05,200 --> 00:51:08,120 more parallelism than processors. 893 00:51:08,120 --> 00:51:10,330 So if you have high slackness, you can expect 894 00:51:10,330 --> 00:51:12,340 to get linear speedup. 895 00:51:12,340 --> 00:51:14,070 If you have low slackness, don't expect 896 00:51:14,070 --> 00:51:15,320 to get linear speedup. 897 00:51:17,660 --> 00:51:18,540 OK. 898 00:51:18,540 --> 00:51:26,920 Now the scheduler we're using is not a greedy scheduler. 899 00:51:26,920 --> 00:51:33,530 It's better in many ways, because it's a distributed, 900 00:51:33,530 --> 00:51:35,580 what's called work stealing scheduler and I'll show you 901 00:51:35,580 --> 00:51:38,450 how it works in a little bit. 902 00:51:38,450 --> 00:51:41,070 But it's based on the same theory. 903 00:51:41,070 --> 00:51:46,340 Even though it's a more complicated scheduler from an 904 00:51:46,340 --> 00:51:48,900 analytical point of view, it's really based on the same 905 00:51:48,900 --> 00:51:51,080 theory as greedy scheduling.
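Before we look at it, here is the greedy theory from this stretch collected in one place, in the same notation (just a restatement of what was said, nothing new):

    T_P  <=  T_1/P + T_infinity                          (greedy bound)
    T_P  <=  2 * max(T_1/P, T_infinity)  <=  2 * T_P*    (within a factor of 2 of optimal)
    T_1/T_infinity >> P   ==>   T_infinity << T_1/P   ==>   T_P ~ T_1/P   ==>   speedup T_1/T_P ~ P
    parallel slackness  =  (T_1/T_infinity) / P  =  T_1 / (P * T_infinity)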
906 00:51:51,080 --> 00:51:57,110 It guarantees that the time on p processors is at most T1 907 00:51:57,110 --> 00:51:59,300 over p plus order T infinity. 908 00:51:59,300 --> 00:52:02,310 So there's a constant here. 909 00:52:02,310 --> 00:52:05,660 And it's a randomized scheduler, so it actually only 910 00:52:05,660 --> 00:52:08,120 guarantees this in expectation. 911 00:52:08,120 --> 00:52:11,590 It actually guarantees very close to this with high 912 00:52:11,590 --> 00:52:13,060 probability. 913 00:52:13,060 --> 00:52:19,190 OK so the difference is the big O, but if you look at any 914 00:52:19,190 --> 00:52:21,500 of the formulas that we did with the greedy scheduler, the 915 00:52:21,500 --> 00:52:24,480 fact that there's a constant there doesn't really matter. 916 00:52:24,480 --> 00:52:27,700 You get the same effect, it just means that the slackness 917 00:52:27,700 --> 00:52:30,580 that you need to get linear speedup has to not only 918 00:52:30,580 --> 00:52:33,010 overcome the T infinity, it's also got to overcome the 919 00:52:33,010 --> 00:52:36,200 constant there. 920 00:52:36,200 --> 00:52:40,440 And empirically, it actually turns out the greedy bound is not bad as 921 00:52:40,440 --> 00:52:44,040 an estimate. 922 00:52:44,040 --> 00:52:46,690 Not bad as an estimate, so this is sort of a model that 923 00:52:46,690 --> 00:52:49,450 we'll take as if we're doing things 924 00:52:49,450 --> 00:52:51,130 with a greedy scheduler. 925 00:52:51,130 --> 00:52:53,540 And that will be very close to what we're actually going 926 00:52:53,540 --> 00:52:58,790 to see in practice with the Cilk++ scheduler. 927 00:52:58,790 --> 00:53:01,620 So once again, it means near perfect linear speedup as long 928 00:53:01,620 --> 00:53:06,330 as p is much less than T1 over T infinity generally. 929 00:53:06,330 --> 00:53:10,820 And so Cilkview allows us to measure T1 and T infinity. 930 00:53:10,820 --> 00:53:13,320 So that's going to be good, because then we can figure out 931 00:53:13,320 --> 00:53:16,480 what our parallelism is and look to see, when we're running 932 00:53:16,480 --> 00:53:21,910 on typically 12 cores, how much parallelism do we have? 933 00:53:21,910 --> 00:53:25,500 If our parallelism is 12, we don't have a lot of slackness. 934 00:53:25,500 --> 00:53:27,440 We won't get very good speedup. 935 00:53:27,440 --> 00:53:30,550 But if we have a parallelism of say, 10 times more, say 936 00:53:30,550 --> 00:53:36,200 120, we should get very, very 937 00:53:36,200 --> 00:53:38,370 good speedup on 12 cores. 938 00:53:38,370 --> 00:53:40,930 We should get close to perfect speedup. 939 00:53:45,100 --> 00:53:47,490 So let's talk about the runtime system and how this 940 00:53:47,490 --> 00:53:50,740 work stealing scheduler works, because it's different 941 00:53:50,740 --> 00:53:51,890 from the other one. 942 00:53:51,890 --> 00:53:56,160 And this will be helpful also for understanding, when you 943 00:53:56,160 --> 00:53:59,530 program these things, what you can expect. 944 00:53:59,530 --> 00:54:07,730 So the basic idea of the scheduler is there are two 945 00:54:07,730 --> 00:54:11,810 strategies that people have explored for doing scheduling. 946 00:54:11,810 --> 00:54:16,110 One is called work sharing, which is not what Cilk++ does. 947 00:54:16,110 --> 00:54:19,390 But let me explain what work sharing is because it's 948 00:54:19,390 --> 00:54:22,200 helpful to contrast it with work stealing.
949 00:54:22,200 --> 00:54:25,990 So in work sharing, what you do is when you spawn off some 950 00:54:25,990 --> 00:54:32,280 work, you say let me go find some low-utilized processor 951 00:54:32,280 --> 00:54:37,450 and put that work there for it to operate on. 952 00:54:37,450 --> 00:54:41,470 The problem with work sharing is that you have to do some 953 00:54:41,470 --> 00:54:45,600 communication and synchronization every time you 954 00:54:45,600 --> 00:54:47,960 do a spawn. 955 00:54:47,960 --> 00:54:49,830 Every time you do a spawn, you're going to go out. 956 00:54:49,830 --> 00:54:52,290 This is kind of what Pthreads does, when 957 00:54:52,290 --> 00:54:53,580 you do Pthread create. 958 00:54:53,580 --> 00:54:58,120 It goes out and says OK, let me create all of the things it 959 00:54:58,120 --> 00:55:03,070 needs to do and get it scheduled then on a processor. 960 00:55:03,070 --> 00:55:06,410 Work stealing, on the other hand, takes 961 00:55:06,410 --> 00:55:08,310 the opposite approach. 962 00:55:08,310 --> 00:55:11,780 Whenever it spawns work, it's just going to keep that work 963 00:55:11,780 --> 00:55:16,230 local to it, but make it available for stealing. 964 00:55:16,230 --> 00:55:21,220 A processor that runs out of work is going to go looking 965 00:55:21,220 --> 00:55:23,720 for work to steal, to bring back. 966 00:55:23,720 --> 00:55:31,540 The advantage of work stealing is that the processor doesn't 967 00:55:31,540 --> 00:55:33,650 do any synchronization except when it's 968 00:55:33,650 --> 00:55:36,210 actually load balancing. 969 00:55:36,210 --> 00:55:42,850 So if all of the processors have ample work to do, then 970 00:55:42,850 --> 00:55:47,570 what happens is there's no overhead for scheduling 971 00:55:47,570 --> 00:55:48,380 whatsoever. 972 00:55:48,380 --> 00:55:51,600 They all just crank away. 973 00:55:51,600 --> 00:55:56,120 And so you get very, very low overheads when there's ample 974 00:55:56,120 --> 00:55:58,320 work to do on each processor. 975 00:55:58,320 --> 00:56:00,980 So let's see how this works. 976 00:56:00,980 --> 00:56:04,120 So the particular way that it maintains it is that 977 00:56:04,120 --> 00:56:08,180 basically, each processor maintains a work deque. 978 00:56:08,180 --> 00:56:13,750 So a deque is a double-ended queue of the ready strands. 979 00:56:13,750 --> 00:56:17,500 It manipulates the bottom of the deque like a stack. 980 00:56:17,500 --> 00:56:21,020 So what that says is, for example, here, we had a spawn 981 00:56:21,020 --> 00:56:24,310 followed by two calls. 982 00:56:24,310 --> 00:56:26,810 And basically, it's operating just as it would have to 983 00:56:26,810 --> 00:56:36,210 operate in an ordinary stack, an ordinary call stack. 984 00:56:36,210 --> 00:56:40,000 So, for example, this guy says call, well it pushes a frame 985 00:56:40,000 --> 00:56:44,460 on the bottom of the call stack just like normal. 986 00:56:44,460 --> 00:56:47,950 It says spawn, it pushes a spawn frame on the 987 00:56:47,950 --> 00:56:49,200 bottom of the deque. 988 00:56:52,910 --> 00:56:55,450 In fact, of course, it's running in parallel, so you 989 00:56:55,450 --> 00:56:58,420 can have a bunch of guys that are both calling and spawning 990 00:56:58,420 --> 00:57:01,270 and they all push whatever their frames are. 991 00:57:01,270 --> 00:57:05,380 When somebody says return, you just pop it off.
992 00:57:05,380 --> 00:57:10,420 So in the common case, each of these guys is just executing 993 00:57:10,420 --> 00:57:13,420 the code serially the way that it would normally 994 00:57:13,420 --> 00:57:15,295 execute in C or C++. 995 00:57:18,120 --> 00:57:25,370 However, if somebody runs out of work, then it becomes a 996 00:57:25,370 --> 00:57:33,310 thief and it looks for a victim, and the strategy that's 997 00:57:33,310 --> 00:57:36,150 used by Cilk++ is to look at random. 998 00:57:36,150 --> 00:57:43,120 It says let me just go to any other processor 999 00:57:43,120 --> 00:57:45,050 or any other worker-- 1000 00:57:45,050 --> 00:57:46,300 I call these workers-- 1001 00:57:48,880 --> 00:57:52,730 and grab away some of their work. 1002 00:57:52,730 --> 00:57:56,170 But when it grabs it away, what it does is it steals it 1003 00:57:56,170 --> 00:58:02,990 from the opposite end of the deque from where this 1004 00:58:02,990 --> 00:58:06,050 particular victim is actually doing its work. 1005 00:58:06,050 --> 00:58:09,650 So it steals the oldest stuff first. 1006 00:58:09,650 --> 00:58:12,970 So it moves that over, and what it's doing here is it's 1007 00:58:12,970 --> 00:58:14,710 stealing up to the point of a spawn. 1008 00:58:14,710 --> 00:58:17,740 So it steals from the top of the deque down to where there's 1009 00:58:17,740 --> 00:58:18,640 a spawn on top. 1010 00:58:18,640 --> 00:58:19,521 Yes? 1011 00:58:19,521 --> 00:58:21,970 AUDIENCE: Is there always a spawn on the 1012 00:58:21,970 --> 00:58:23,220 top of every deque? 1013 00:58:25,540 --> 00:58:28,140 PROFESSOR: Close, almost always. 1014 00:58:28,140 --> 00:58:31,150 Yes, so I think that you could say that there are. 1015 00:58:31,150 --> 00:58:34,290 So the initial deque does not have a spawn on top of it, but 1016 00:58:34,290 --> 00:58:37,250 you could imagine that it did. 1017 00:58:37,250 --> 00:58:39,240 And then when you steal, you're always stealing from 1018 00:58:39,240 --> 00:58:41,920 the top down to a spawn. 1019 00:58:41,920 --> 00:58:47,990 If there isn't something, if this is just a call here, this 1020 00:58:47,990 --> 00:58:49,640 cannot any longer be stolen. 1021 00:58:49,640 --> 00:58:52,170 There's no work there to be stolen because this is just a 1022 00:58:52,170 --> 00:58:54,990 single execution, there's nothing that's been spawned 1023 00:58:54,990 --> 00:58:57,440 off at this point. 1024 00:58:57,440 --> 00:59:00,200 This is the result of having been spawned as opposed to 1025 00:59:00,200 --> 00:59:01,950 it doing a spawn. 1026 00:59:01,950 --> 00:59:05,230 So yes, basically you're right. 1027 00:59:05,230 --> 00:59:06,890 There's a spawn on the top. 1028 00:59:06,890 --> 00:59:09,670 So it basically steals that off and then it resumes 1029 00:59:09,670 --> 00:59:15,540 execution afterwards and starts then operating just 1030 00:59:15,540 --> 00:59:19,190 like an ordinary deque. 1031 00:59:19,190 --> 00:59:24,550 So the theorem that you can prove for this type of 1032 00:59:24,550 --> 00:59:28,480 scheduler is that if you have sufficient parallelism, so you 1033 00:59:28,480 --> 00:59:31,910 all know what parallelism is at this point, you can prove 1034 00:59:31,910 --> 00:59:35,990 that the workers steal infrequently. 1035 00:59:35,990 --> 00:59:40,880 So in a typical execution, you might have a few hundred 1036 00:59:40,880 --> 00:59:44,200 load balancing operations of this nature for something 1037 00:59:44,200 --> 00:59:48,420 which is doing billions and billions of instructions.
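As a rough illustration of the deque discipline just described, here is a sketch in C++. This is not the actual Cilk++ runtime code; the names are made up for the example, and a mutex stands in for the much more careful synchronization protocol the real system uses. The owning worker pushes and pops at the bottom, while a thief steals from the top:

    #include <deque>
    #include <mutex>
    #include <optional>

    // Hypothetical sketch of one worker's deque of ready strands.
    struct Frame { /* function pointer, saved continuation state, ... */ };

    class WorkerDeque {
        std::deque<Frame> frames_;   // back = bottom (newest), front = top (oldest)
        std::mutex m_;               // stand-in for the runtime's real synchronization
    public:
        // The owning worker treats the bottom of the deque like its call stack.
        void push_bottom(Frame f) {
            std::lock_guard<std::mutex> g(m_);
            frames_.push_back(f);
        }
        std::optional<Frame> pop_bottom() {
            std::lock_guard<std::mutex> g(m_);
            if (frames_.empty()) return std::nullopt;   // out of work: go be a thief
            Frame f = frames_.back();
            frames_.pop_back();
            return f;
        }
        // A thief that picked this worker at random steals the oldest work,
        // from the opposite end of the deque from where the victim is working.
        std::optional<Frame> steal_top() {
            std::lock_guard<std::mutex> g(m_);
            if (frames_.empty()) return std::nullopt;
            Frame f = frames_.front();
            frames_.pop_front();
            return f;
        }
    };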
1038 00:59:48,420 --> 00:59:51,220 So you steal infrequently. 1039 00:59:51,220 --> 00:59:53,860 If you're stealing infrequently and all the rest 1040 00:59:53,860 --> 00:59:59,970 of the time you're just executing like C or C++, 1041 00:59:59,970 --> 01:00:02,330 hey, now you've got linear speedup because you've got all 1042 01:00:02,330 --> 01:00:04,175 of these guys working all the time. 1043 01:00:08,140 --> 01:00:11,230 And so as I say, the main thing to understand is that 1044 01:00:11,230 --> 01:00:14,550 there's this work stealing scheduler running underneath. 1045 01:00:14,550 --> 01:00:17,050 It's more complicated to analyze than the greedy 1046 01:00:17,050 --> 01:00:20,210 scheduler, but it gives you pretty much the same 1047 01:00:20,210 --> 01:00:24,620 qualitative kinds of results. 1048 01:00:24,620 --> 01:00:31,450 And the idea then is that the stealing occurs infrequently 1049 01:00:31,450 --> 01:00:32,500 so you get linear speedup. 1050 01:00:32,500 --> 01:00:35,470 So the idea then is, just as with greedy scheduling, make 1051 01:00:35,470 --> 01:00:37,540 sure you have enough parallelism, because then the 1052 01:00:37,540 --> 01:00:41,550 load balancing is a small fraction of the time these 1053 01:00:41,550 --> 01:00:45,890 processors are spending executing the code. 1054 01:00:45,890 --> 01:00:48,090 Because whenever it's doing things like work stealing, 1055 01:00:48,090 --> 01:00:52,460 it's not executing your code, making it go fast. 1056 01:00:52,460 --> 01:00:57,790 It's doing bookkeeping and overhead and stuff. 1057 01:00:57,790 --> 01:00:59,910 So you want to make sure that stays low. 1058 01:00:59,910 --> 01:01:03,290 So any questions about that? 1059 01:01:03,290 --> 01:01:07,150 So specifically, we have these bounds. 1060 01:01:07,150 --> 01:01:09,750 You achieve this expected running time, which I 1061 01:01:09,750 --> 01:01:10,400 mentioned before. 1062 01:01:10,400 --> 01:01:13,920 Let me give you a pseudo-proof of this. 1063 01:01:13,920 --> 01:01:19,030 So this is not a real proof because it ignores things like 1064 01:01:19,030 --> 01:01:21,540 independence of probabilities. 1065 01:01:21,540 --> 01:01:24,000 So when you do a probability analysis, you're not allowed 1066 01:01:24,000 --> 01:01:27,310 to multiply probabilities unless they're independent. 1067 01:01:27,310 --> 01:01:30,125 So anyway, here I'm multiplying probabilities as if they 1068 01:01:30,125 --> 01:01:32,080 were independent. 1069 01:01:32,080 --> 01:01:34,960 So the idea is you can view a processor as 1070 01:01:34,960 --> 01:01:36,940 either working or stealing. 1071 01:01:36,940 --> 01:01:38,880 So it goes into one of two modes. 1072 01:01:38,880 --> 01:01:42,220 It's going to be stealing if it's run out of work, 1073 01:01:42,220 --> 01:01:43,840 otherwise it's working. 1074 01:01:43,840 --> 01:01:46,630 So the total time all processors spend working is 1075 01:01:46,630 --> 01:01:51,900 T1, hooray, that's at least a bound. 1076 01:01:51,900 --> 01:01:56,060 Now it turns out that every steal has a 1 over p chance of 1077 01:01:56,060 --> 01:01:59,550 reducing the span by one. 1078 01:01:59,550 --> 01:02:04,040 So you can prove that the ready threads at the top 1079 01:02:04,040 --> 01:02:07,720 of all those deques are the ones that are in a 1080 01:02:07,720 --> 01:02:15,890 position to reduce the span of the unexecuted DAG 1081 01:02:15,890 --> 01:02:18,680 if you execute them.
1082 01:02:18,680 --> 01:02:21,780 And so whenever you steal, you have a 1 over p chance of 1083 01:02:21,780 --> 01:02:28,580 hitting the guy that matters for the span 1084 01:02:28,580 --> 01:02:30,360 of the unexecuted DAG. 1085 01:02:30,360 --> 01:02:33,190 So it's the same kind of thing as in the theory. 1086 01:02:33,190 --> 01:02:34,050 You have a 1 over p chance. 1087 01:02:34,050 --> 01:02:39,580 So the expected cost of all steals is order p T infinity. 1088 01:02:39,580 --> 01:02:43,580 So this is true, but not for this reason. 1089 01:02:43,580 --> 01:02:45,960 But the intuition is right. 1090 01:02:48,610 --> 01:02:52,070 So therefore the cost of all steals is order p T infinity and the 1091 01:02:52,070 --> 01:02:55,320 cost of the work is T1, so that's the total amount of 1092 01:02:55,320 --> 01:03:00,480 work and time spent stealing by all the p processors. 1093 01:03:00,480 --> 01:03:05,420 So to get the time spent doing that, we divide by p, because 1094 01:03:05,420 --> 01:03:08,220 there are p processors. 1095 01:03:08,220 --> 01:03:12,140 And when I do that, I get T1 over p plus order T infinity. 1096 01:03:12,140 --> 01:03:15,550 So that's kind of where that bound is coming from. 1097 01:03:15,550 --> 01:03:19,620 So you can see what's important here is that that 1098 01:03:19,620 --> 01:03:22,730 term, the order T infinity term, is the one where all 1099 01:03:22,730 --> 01:03:25,630 the overhead of scheduling and synchronization is. 1100 01:03:25,630 --> 01:03:28,120 There's no overhead for scheduling and synchronization 1101 01:03:28,120 --> 01:03:29,940 in the T1 over p term. 1102 01:03:29,940 --> 01:03:33,850 The only overhead there is to do things like mark the frames 1103 01:03:33,850 --> 01:03:39,140 as being a steal frame or a spawn frame and do the 1104 01:03:39,140 --> 01:03:43,820 bookkeeping of the deque as you're executing, so the spawn 1105 01:03:43,820 --> 01:03:48,070 can be implemented very cheaply. 1106 01:03:48,070 --> 01:03:55,130 Now in addition to the scheduling things, there are 1107 01:03:55,130 --> 01:03:57,020 some other things to understand a little bit about 1108 01:03:57,020 --> 01:04:02,960 the scheduler, and that is that it supports the C, C++ rule 1109 01:04:02,960 --> 01:04:03,720 for pointers. 1110 01:04:03,720 --> 01:04:08,130 So remember, in C and C++, you can pass a pointer to stack 1111 01:04:08,130 --> 01:04:12,110 space down, but you can't pass a pointer to stack space back 1112 01:04:12,110 --> 01:04:13,430 to your parent, right? 1113 01:04:13,430 --> 01:04:14,680 Because it's popped off. 1114 01:04:17,340 --> 01:04:22,990 So if you think about a C or C++ execution, let's say we 1115 01:04:22,990 --> 01:04:25,940 have this call structure here. 1116 01:04:25,940 --> 01:04:30,820 A really cannot see any of the stack space of B, C, D or E. So 1117 01:04:30,820 --> 01:04:33,130 this is what A gets to see. 1118 01:04:33,130 --> 01:04:36,350 And B, meanwhile, can see A's space, because that's down on 1119 01:04:36,350 --> 01:04:39,700 the stack, but it can't see C, D or E. Particularly if you're 1120 01:04:39,700 --> 01:04:42,490 executing this serially, it can't see C because C hasn't 1121 01:04:42,490 --> 01:04:45,860 executed yet when B executes. 1122 01:04:45,860 --> 01:04:49,190 However, C, it turns out, is the same thing. 1123 01:04:49,190 --> 01:04:51,310 It can't see any of the variables that might be 1124 01:04:51,310 --> 01:04:54,270 allocated in the space for B when it's 1125 01:04:54,270 --> 01:04:55,800 executing here on the stack.
1126 01:04:55,800 --> 01:04:58,660 You can see them in the heap, but not on the stack, because 1127 01:04:58,660 --> 01:05:02,250 B has been popped off at that point, and so forth. 1128 01:05:02,250 --> 01:05:05,190 So this is basically the normal rule, the normal views 1129 01:05:05,190 --> 01:05:09,790 of the stack that you get in C or C++. 1130 01:05:09,790 --> 01:05:14,380 In Cilk++, you get exactly the same behavior, except that 1131 01:05:14,380 --> 01:05:20,260 multiple ones of these views may exist at the same time. 1132 01:05:20,260 --> 01:05:23,900 So if, for example, B and C are both executing at the same 1133 01:05:23,900 --> 01:05:26,690 time, they each will see their own stack 1134 01:05:26,690 --> 01:05:30,600 space and A's stack space. 1135 01:05:30,600 --> 01:05:34,220 And so the cactus stack maintains the fiction that 1136 01:05:34,220 --> 01:05:36,280 you can sort of look up and see your 1137 01:05:36,280 --> 01:05:38,280 ancestors' stack space, but now multiple such views are maintained at once. 1138 01:05:38,280 --> 01:05:41,890 It's called a cactus stack because it's kind of like a 1139 01:05:41,890 --> 01:05:47,240 tree structure upside down, like-- what's the name of that 1140 01:05:47,240 --> 01:05:49,310 big cactus out West? 1141 01:05:49,310 --> 01:05:50,105 Yes, saguaro. 1142 01:05:50,105 --> 01:05:52,490 The saguaro cactus, yep. 1143 01:05:52,490 --> 01:05:55,290 This kind of looks like that if you look at the stacks. 1144 01:05:58,110 --> 01:06:04,070 This leads to a very powerful bound on how much space your 1145 01:06:04,070 --> 01:06:05,850 program is using. 1146 01:06:05,850 --> 01:06:08,820 So normally, if you do a greedy scheduler, you could 1147 01:06:08,820 --> 01:06:11,950 end up using gobs more space than you would in a serial 1148 01:06:11,950 --> 01:06:15,250 execution, gobs more stack space. 1149 01:06:15,250 --> 01:06:18,600 In Cilk++ programs, you have a bound. 1150 01:06:18,600 --> 01:06:22,530 p times s1 is the maximum amount of stack space you'll 1151 01:06:22,530 --> 01:06:25,420 ever use, where s1 is the stack space 1152 01:06:25,420 --> 01:06:27,250 used by a serial execution. 1153 01:06:27,250 --> 01:06:29,950 So if you can keep your serial execution to a reasonable 1154 01:06:29,950 --> 01:06:32,270 amount of stack space-- and usually it does-- 1155 01:06:32,270 --> 01:06:34,920 then in parallel, you don't use more than p times that 1156 01:06:34,920 --> 01:06:36,890 amount of stack space. 1157 01:06:36,890 --> 01:06:39,890 And the proof of that is sort of by induction, which 1158 01:06:39,890 --> 01:06:43,240 basically says there's a property called the Busy 1159 01:06:43,240 --> 01:06:50,530 Leaves Property that says that if you have a leaf that's 1160 01:06:50,530 --> 01:06:54,810 being worked on but hasn't been completed-- 1161 01:06:54,810 --> 01:06:57,720 so I've indicated those by the purple and pink ones-- 1162 01:06:57,720 --> 01:07:02,720 then if it's a leaf, it has a worker executing on it. 1163 01:07:02,720 --> 01:07:06,610 And so therefore, if you look at how much stack space you're 1164 01:07:06,610 --> 01:07:09,990 using, each of these guys can trace up and they may double 1165 01:07:09,990 --> 01:07:14,550 count the stack space, but it'll still be bounded by p 1166 01:07:14,550 --> 01:07:18,040 times the depth that they're at, or p times s1, which is 1167 01:07:18,040 --> 01:07:20,620 the maximum amount. 1168 01:07:20,620 --> 01:07:23,160 So it has good space bounds.
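Written as an inequality in the same style as the work and span bounds, that guarantee is:

    S_P  <=  P * S_1

where S_1 is the stack space used by the serial execution and S_P is the stack space used by a P-worker execution; the busy-leaves argument charges each of the at most P busy leaves at most S_1 of ancestor stack space, possibly double counting shared ancestors.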
1169 01:07:23,160 --> 01:07:26,100 That's not so crucial for you folks to know as a practical 1170 01:07:26,100 --> 01:07:29,690 matter, but it would be if this didn't hold. 1171 01:07:29,690 --> 01:07:32,410 If this didn't hold, then you would have more programming 1172 01:07:32,410 --> 01:07:33,660 problems than you'll have. 1173 01:07:37,080 --> 01:07:40,780 The implications of this work stealing scheduler are 1174 01:07:40,780 --> 01:07:45,920 interesting from the linguistic point of view, 1175 01:07:45,920 --> 01:07:49,180 because you can write code like this: for i gets one 1176 01:07:49,180 --> 01:07:54,740 to a billion, spawn some subroutine foo 1177 01:07:54,740 --> 01:07:58,000 of i, and then sync. 1178 01:07:58,000 --> 01:08:01,920 So one way of executing this, the way that the work sharing 1179 01:08:01,920 --> 01:08:05,180 schedulers tend to do this, is they say oh, I've got a 1180 01:08:05,180 --> 01:08:08,600 billion tasks to do. 1181 01:08:08,600 --> 01:08:13,090 So let me create a billion tasks and now schedule them, 1182 01:08:13,090 --> 01:08:16,180 and the space just vrooms to store all those billion tasks, 1183 01:08:16,180 --> 01:08:18,100 it gets to be huge. 1184 01:08:18,100 --> 01:08:20,200 Now of course, they have some strategies they can use to 1185 01:08:20,200 --> 01:08:22,939 reduce it by bunching tasks together and so forth. 1186 01:08:22,939 --> 01:08:26,689 But in principle, you've got a billion pieces of work to do 1187 01:08:26,689 --> 01:08:30,630 even if you execute on one processor. 1188 01:08:30,630 --> 01:08:34,580 Whereas in the work stealing type of execution, what happens 1189 01:08:34,580 --> 01:08:38,180 is you execute this, in effect, depth-first. 1190 01:08:38,180 --> 01:08:42,420 So basically, you're going to execute foo of 1 and then 1191 01:08:42,420 --> 01:08:44,620 you'll return. 1192 01:08:44,620 --> 01:08:48,342 And then you'll increment i and you'll execute foo of 2, 1193 01:08:48,342 --> 01:08:49,670 and you'll return. 1194 01:08:49,670 --> 01:08:52,970 At no time are you using more than, in this case, two stack 1195 01:08:52,970 --> 01:08:58,029 frames, one for this routine here and one for foo, because 1196 01:08:58,029 --> 01:08:59,240 you basically keep going up. 1197 01:08:59,240 --> 01:09:03,430 You're using your stack up on demand, rather than creating 1198 01:09:03,430 --> 01:09:05,469 all the work up front to be scheduled. 1199 01:09:08,090 --> 01:09:09,840 So the work stealing scheduler is very good from 1200 01:09:09,840 --> 01:09:10,720 that point of view. 1201 01:09:10,720 --> 01:09:13,890 The tricky thing for people to understand is that if you're 1202 01:09:13,890 --> 01:09:16,520 executing on multiple processors, when you do a Cilk 1203 01:09:16,520 --> 01:09:21,569 spawn, the processor, the worker that you're running on, 1204 01:09:21,569 --> 01:09:25,760 is going to execute foo of 1. 1205 01:09:25,760 --> 01:09:26,859 The next statement-- 1206 01:09:26,859 --> 01:09:28,270 which would basically be incrementing the 1207 01:09:28,270 --> 01:09:30,720 counter and so forth-- 1208 01:09:30,720 --> 01:09:33,910 is executed by whatever processor comes in and steals 1209 01:09:33,910 --> 01:09:35,719 that continuation. 1210 01:09:40,090 --> 01:09:42,790 So if you had two processors, they're each going to 1211 01:09:42,790 --> 01:09:44,770 basically be executing. 1212 01:09:44,770 --> 01:09:47,620 The first processor isn't the one that executes everything in 1213 01:09:47,620 --> 01:09:48,500 this function.
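For concreteness, the loop on the slide looks roughly like this in Cilk++ (a sketch of the pattern, not code from the lecture; foo and spawn_a_billion are stand-in names, and the cilk_spawn / cilk_sync keywords require the Cilk++ compiler):

    // Sketch of the spawn loop discussed above.
    void foo(long i);             // stand-in for whatever work each iteration does

    void spawn_a_billion() {
        for (long i = 1; i <= 1000000000L; ++i) {
            // The worker that executes the spawn dives into foo(i);
            // the continuation (incrementing i, the next iteration)
            // is what sits on the deque and what a thief may steal.
            cilk_spawn foo(i);
        }
        cilk_sync;                // wait for all spawned children to finish
    }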
1214 01:09:48,500 --> 01:09:51,630 This function has its execution shared: the strands 1215 01:09:51,630 --> 01:09:53,910 are going to be shared where the first part of it would be 1216 01:09:53,910 --> 01:09:56,170 done by processor one and the latter part of it would be 1217 01:09:56,170 --> 01:09:58,270 done by processor two. 1218 01:09:58,270 --> 01:10:01,160 And then when processor one finishes this off, it might go 1219 01:10:01,160 --> 01:10:04,635 back and steal back from processor two. 1220 01:10:07,150 --> 01:10:10,040 So the important thing there is it's generating its stack 1221 01:10:10,040 --> 01:10:16,240 needs sort of on demand, rather than all up front, and that 1222 01:10:16,240 --> 01:10:21,250 keeps the amount of stack space small as it executes. 1223 01:10:21,250 --> 01:10:23,950 So the moral is it's better to steal parents from their 1224 01:10:23,950 --> 01:10:26,780 children than to steal children from their parents. 1225 01:10:30,940 --> 01:10:33,710 So that's the advantage of doing this sort of parent 1226 01:10:33,710 --> 01:10:36,150 stealing, because you're always stealing the frame which 1227 01:10:36,150 --> 01:10:40,310 is an ancestor of where that worker is working, and that 1228 01:10:40,310 --> 01:10:44,050 means resuming a function right in the middle on a 1229 01:10:44,050 --> 01:10:44,920 different processor. 1230 01:10:44,920 --> 01:10:47,110 That's kind of the magic of the technology: how do you 1231 01:10:47,110 --> 01:10:50,710 actually move a stack frame from one place to another and 1232 01:10:50,710 --> 01:10:51,960 resume it in the middle? 1233 01:10:55,560 --> 01:10:57,640 Let's finish up here with a chess lesson. 1234 01:10:57,640 --> 01:10:59,760 I promised a chess lesson, so we might as well have 1235 01:10:59,760 --> 01:11:02,440 some fun and games. 1236 01:11:02,440 --> 01:11:06,160 We have a lot of experience at MIT with chess programs. 1237 01:11:06,160 --> 01:11:14,770 We've had a lot of success, probably our closest one was 1238 01:11:14,770 --> 01:11:19,010 Star Socrates 2.0, which took second place in the world 1239 01:11:19,010 --> 01:11:23,850 computer chess championship running on an 1824 node Intel 1240 01:11:23,850 --> 01:11:28,475 Paragon, so a big supercomputer running with a 1241 01:11:28,475 --> 01:11:28,980 Cilk scheduler. 1242 01:11:28,980 --> 01:11:33,990 We actually almost won that competition, and it's a sad 1243 01:11:33,990 --> 01:11:38,450 story; maybe sometime around dinner or something I 1244 01:11:38,450 --> 01:11:41,540 will tell you the story behind it, but I'm not going 1245 01:11:41,540 --> 01:11:45,940 to tell you now why we didn't take first place. 1246 01:11:45,940 --> 01:11:50,030 And we've had a bunch of other successes over the years. 1247 01:11:50,030 --> 01:11:52,040 Right now our chess programming is dormant, we're 1248 01:11:52,040 --> 01:11:55,680 not doing that in my group anymore, but in the past, we 1249 01:11:55,680 --> 01:11:57,900 had some very strong chess playing programs. 1250 01:12:00,580 --> 01:12:05,990 So what we did with Star Socrates, which is one of our 1251 01:12:05,990 --> 01:12:11,880 programs, was we wanted to understand the Cilk scheduler. 1252 01:12:11,880 --> 01:12:14,300 And so what we did is we ran a whole bunch of different 1253 01:12:14,300 --> 01:12:19,000 positions on different numbers of processors, which ran for 1254 01:12:19,000 --> 01:12:21,540 different amounts of time.
1255 01:12:21,540 --> 01:12:25,640 We wanted to plot them all on the same chart, and here's our 1256 01:12:25,640 --> 01:12:27,340 strategy for doing it. 1257 01:12:27,340 --> 01:12:31,230 What we decided to do was a standard speedup curve. 1258 01:12:31,230 --> 01:12:34,500 So a standard speedup curve says let's plot the number of 1259 01:12:34,500 --> 01:12:40,700 processors along this axis and the speedup along that axis. 1260 01:12:40,700 --> 01:12:44,550 But in order to fit all these things on the same 1261 01:12:44,550 --> 01:12:48,110 curve, what we did was we normalized the speedup. 1262 01:12:48,110 --> 01:12:49,710 So what's the maximum possible? 1263 01:12:49,710 --> 01:12:50,700 So here's the speedup. 1264 01:12:50,700 --> 01:12:52,770 If you look at the numerator here, this is the 1265 01:12:52,770 --> 01:12:54,280 speedup, T1 over Tp. 1266 01:12:54,280 --> 01:12:58,540 What we did is we normalized by the parallelism. 1267 01:12:58,540 --> 01:13:04,360 So we said what fraction of perfect speedup can we get? 1268 01:13:04,360 --> 01:13:12,480 So a one here says that I got exactly the 1269 01:13:12,480 --> 01:13:16,470 maximum possible speedup that I can get, because the maximum 1270 01:13:16,470 --> 01:13:20,190 possible value of T1 over Tp is T1 over T infinity. 1271 01:13:20,190 --> 01:13:23,400 So that's sort of the maximum. 1272 01:13:23,400 --> 01:13:25,560 On this axis, we said how many processors are 1273 01:13:25,560 --> 01:13:26,250 you running on? 1274 01:13:26,250 --> 01:13:27,890 Well, we looked at that relative to 1275 01:13:27,890 --> 01:13:29,760 essentially the slackness. 1276 01:13:29,760 --> 01:13:33,930 So notice by normalizing, we essentially have here the 1277 01:13:33,930 --> 01:13:35,620 inverse of the slackness. 1278 01:13:35,620 --> 01:13:39,130 So 1 here says that I'm running on exactly the same 1279 01:13:39,130 --> 01:13:42,360 number of processors as my parallelism. 1280 01:13:42,360 --> 01:13:46,950 A tenth here says I've got a slackness of 10, I'm running 1281 01:13:46,950 --> 01:13:51,250 on 10 times fewer processors than the parallelism. 1282 01:13:51,250 --> 01:13:55,310 Out here, I'm saying I've got way more processors than I have 1283 01:13:55,310 --> 01:13:56,610 parallelism. 1284 01:13:56,610 --> 01:13:58,200 So I plotted all the points. 1285 01:13:58,200 --> 01:14:01,040 So it doesn't show up very well here, but all those green 1286 01:14:01,040 --> 01:14:03,630 points, there are a lot of green points here, that's our 1287 01:14:03,630 --> 01:14:07,200 performance, measured performance. 1288 01:14:07,200 --> 01:14:10,050 You can sort of see they're green there, not the best 1289 01:14:10,050 --> 01:14:11,300 color for this projector. 1290 01:14:13,910 --> 01:14:20,090 So we plot on this essentially the Work Law and the Span Law. 1291 01:14:20,090 --> 01:14:22,980 So this is the Work Law, it says linear speedup, and this 1292 01:14:22,980 --> 01:14:24,560 is the Span Law. 1293 01:14:24,560 --> 01:14:28,360 And you can see that we're getting very close to perfect 1294 01:14:28,360 --> 01:14:35,600 linear speedup as long as our slackness is 10 or greater. 1295 01:14:35,600 --> 01:14:36,250 See that? 1296 01:14:36,250 --> 01:14:38,500 It's hugging that curve really tightly. 1297 01:14:38,500 --> 01:14:48,570 As we approach a slackness of 1, you can see that it starts 1298 01:14:48,570 --> 01:14:50,740 to go away from the linear speedup curve.
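Spelled out, the two normalized axes of that chart are (in the same notation as before, just restating the normalization described):

    y  =  (T_1/T_P) / (T_1/T_infinity)  =  T_infinity / T_P      (measured speedup as a fraction of the maximum possible speedup)
    x  =  P / (T_1/T_infinity)                                    (processor count as a fraction of the parallelism, i.e. the inverse of the slackness)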
1299 01:14:53,410 --> 01:14:56,060 So for this program, if you look, it says, gee, if we were 1300 01:14:56,060 --> 01:15:00,570 running with a slackness of 10, 10 times more 1301 01:15:00,570 --> 01:15:03,550 parallelism than processors, we're getting almost perfect 1302 01:15:03,550 --> 01:15:06,050 linear speedup in the number of processors we're running on, 1303 01:15:06,050 --> 01:15:09,440 across a wide range of numbers of processors, a wide range of 1304 01:15:09,440 --> 01:15:12,410 benchmarks for this chess program. 1305 01:15:12,410 --> 01:15:15,040 And in fact, this curve is the curve. 1306 01:15:15,040 --> 01:15:17,580 This is not an interpolation here, but rather it is just 1307 01:15:17,580 --> 01:15:19,900 the greedy scheduling curve, and you can see it does a 1308 01:15:19,900 --> 01:15:24,810 pretty good job of going through all the points here. 1309 01:15:24,810 --> 01:15:26,860 Greedy scheduling does a pretty good job of predicting 1310 01:15:26,860 --> 01:15:27,850 the performance. 1311 01:15:27,850 --> 01:15:29,920 The other thing you should notice is that although things 1312 01:15:29,920 --> 01:15:33,250 are very tight down here, as you approach up here, they 1313 01:15:33,250 --> 01:15:35,360 start getting more spread. 1314 01:15:35,360 --> 01:15:38,830 And the reason is that as you start having more of the span 1315 01:15:38,830 --> 01:15:42,480 mattering in the calculation, that's where all the 1316 01:15:42,480 --> 01:15:45,510 synchronization, communication, all the 1317 01:15:45,510 --> 01:15:49,750 overhead of actually doing the mechanics of moving a frame 1318 01:15:49,750 --> 01:15:52,890 from one processor to another comes into play, so you get 1319 01:15:52,890 --> 01:15:56,360 a lot more spread as you go up here. 1320 01:15:56,360 --> 01:15:58,610 So that's just the first part of the lesson. 1321 01:15:58,610 --> 01:16:05,560 The first part was, oh, the theory works out in practice 1322 01:16:05,560 --> 01:16:06,800 for real programs. 1323 01:16:06,800 --> 01:16:12,930 If you have like 10 times more parallelism than processors, 1324 01:16:12,930 --> 01:16:14,600 you're going to do a pretty good job of 1325 01:16:14,600 --> 01:16:16,590 getting linear speedup. 1326 01:16:16,590 --> 01:16:19,620 So that says you guys should be shooting for parallelism 1327 01:16:19,620 --> 01:16:26,660 on the order of 100 for running on 12 cores. 1328 01:16:26,660 --> 01:16:29,610 Somewhere in that vicinity you should be doing pretty well if 1329 01:16:29,610 --> 01:16:31,150 you've got parallelism of 100 when you 1330 01:16:31,150 --> 01:16:33,530 measure it for your codes. 1331 01:16:33,530 --> 01:16:35,740 So we normalized by the parallelism there. 1332 01:16:38,290 --> 01:16:43,840 Now the real lesson, though, was understanding how to use 1333 01:16:43,840 --> 01:16:47,270 things like work and span to make decisions in the design 1334 01:16:47,270 --> 01:16:49,730 of our program. 1335 01:16:49,730 --> 01:16:53,550 So as it turned out, Socrates for this particular 1336 01:16:53,550 --> 01:16:57,750 competition was to run on a 512 processor Connection 1337 01:16:57,750 --> 01:17:02,680 Machine at the University of Illinois. 1338 01:17:02,680 --> 01:17:08,950 So this was in the early 1990s. 1339 01:17:08,950 --> 01:17:12,000 It was one of the most powerful machines in the 1340 01:17:12,000 --> 01:17:17,270 world, and this thing is probably more powerful today. 1341 01:17:17,270 --> 01:17:20,150 But in those days, it was a pretty powerful machine.
1342 01:17:20,150 --> 01:17:21,540 I don't know whether this thing is, but this thing 1343 01:17:21,540 --> 01:17:25,610 probably, I'm pretty sure, is more powerful. 1344 01:17:25,610 --> 01:17:28,820 So this was a big machine. 1345 01:17:28,820 --> 01:17:31,300 However, here at MIT, we didn't have a great big 1346 01:17:31,300 --> 01:17:32,950 machine like that. 1347 01:17:32,950 --> 01:17:35,240 We only had a 32 processor CM5. 1348 01:17:37,800 --> 01:17:41,090 So we were developing on a little machine expecting to 1349 01:17:41,090 --> 01:17:42,340 run on a big machine. 1350 01:17:45,050 --> 01:17:48,040 So one of the developers proposed a change to the program 1351 01:17:48,040 --> 01:17:54,310 that produced a speedup of over 20% on the MIT machine. 1352 01:17:54,310 --> 01:17:57,910 So we said, oh, that's pretty good, a 25% improvement. 1353 01:17:57,910 --> 01:18:00,990 But we did a back of the envelope calculation and 1354 01:18:00,990 --> 01:18:05,030 rejected that improvement, because we were able to use 1355 01:18:05,030 --> 01:18:10,645 work and span to predict the behavior on the big machine. 1356 01:18:13,180 --> 01:18:16,670 So let's see how that worked out, why that worked out. 1357 01:18:16,670 --> 01:18:20,180 So I've fudged these numbers so that they're easy to do the 1358 01:18:20,180 --> 01:18:22,780 math on and easy to understand. 1359 01:18:22,780 --> 01:18:25,830 The real numbers actually did sort out very, very 1360 01:18:25,830 --> 01:18:28,610 similar to what I'm saying, they just weren't round 1361 01:18:28,610 --> 01:18:30,630 numbers like I'm going to give you. 1362 01:18:30,630 --> 01:18:34,610 So the original program ran for, let's say, 65 1363 01:18:34,610 --> 01:18:37,560 seconds on 32 cores. 1364 01:18:37,560 --> 01:18:41,480 The proposed program ran for 40 seconds on 32 cores. 1365 01:18:41,480 --> 01:18:43,830 Sounds like a good improvement to me. 1366 01:18:43,830 --> 01:18:46,790 Let's go for the faster program. 1367 01:18:46,790 --> 01:18:48,940 Well, hold your horses. 1368 01:18:48,940 --> 01:18:52,480 Let's take a look at our performance model based on 1369 01:18:52,480 --> 01:18:54,880 greedy scheduling: 1370 01:18:54,880 --> 01:18:57,500 that Tp is T1 over p plus T infinity. 1371 01:18:57,500 --> 01:19:00,860 To understand how this scales, we really need to know 1372 01:19:00,860 --> 01:19:03,830 what component of each of these things is work 1373 01:19:03,830 --> 01:19:05,040 and which is span. 1374 01:19:05,040 --> 01:19:07,520 Because that's how we're going to be able to predict what's 1375 01:19:07,520 --> 01:19:09,930 going to happen on the big machine. 1376 01:19:09,930 --> 01:19:15,360 So indeed, this original program had a work of 2048 1377 01:19:15,360 --> 01:19:19,760 seconds and a span of one second. 1378 01:19:19,760 --> 01:19:23,820 Now chess, it turns out, is a non-deterministic type of 1379 01:19:23,820 --> 01:19:28,670 program where you use speculative parallelism, and 1380 01:19:28,670 --> 01:19:32,205 so in order to get more parallelism, you can sacrifice 1381 01:19:32,205 --> 01:19:34,395 and do more work versus less work. 1382 01:19:34,395 --> 01:19:39,250 So this one over here that we improved it to had less work 1383 01:19:39,250 --> 01:19:42,405 on the benchmark, but it had a longer span. 1384 01:19:46,280 --> 01:19:48,100 So it had less work but a longer span.
1385 01:19:48,100 --> 01:19:52,730 So when we actually were going to run this, well, first of 1386 01:19:52,730 --> 01:19:57,050 all, we did the calculation and it actually came out 1387 01:19:57,050 --> 01:19:57,700 pretty close. 1388 01:19:57,700 --> 01:20:00,620 I was kind of surprised how closely the theory matched. 1389 01:20:00,620 --> 01:20:03,870 On 32 processors, when you do the work-span 1390 01:20:03,870 --> 01:20:08,920 calculation, you get the 65 seconds on the 32 processor 1391 01:20:08,920 --> 01:20:12,250 machine, and here we had 40 seconds. 1392 01:20:12,250 --> 01:20:20,200 But now what happens when we scale this to the big machine? 1393 01:20:20,200 --> 01:20:22,200 Here we scaled it to 512 cores. 1394 01:20:22,200 --> 01:20:25,100 So now we take the work divided by the number of 1395 01:20:25,100 --> 01:20:29,160 processors, 512, plus 1, and that's 5 seconds for this one. 1396 01:20:29,160 --> 01:20:33,340 Here we have less work, but we now have a much larger span. 1397 01:20:33,340 --> 01:20:36,790 So we have two seconds of work per processor, but now eight 1398 01:20:36,790 --> 01:20:42,130 seconds of span, for a total of 10 seconds. 1399 01:20:42,130 --> 01:20:45,920 So had we made this quote "improvement," our code would 1400 01:20:45,920 --> 01:20:48,035 have been half as fast. 1401 01:20:50,910 --> 01:20:52,160 It would not have scaled. 1402 01:20:55,020 --> 01:21:01,300 And so the point is that work and span typically will beat 1403 01:21:01,300 --> 01:21:05,420 running times for predicting scalability of performance. 1404 01:21:05,420 --> 01:21:07,440 So you can measure a particular thing, but what you 1405 01:21:07,440 --> 01:21:11,160 really want to know is: is this thing going to scale, and 1406 01:21:11,160 --> 01:21:12,860 how is it going to scale into the future? 1407 01:21:12,860 --> 01:21:16,550 So people building multicore applications today want to 1408 01:21:16,550 --> 01:21:17,750 know, when they code it up, that it's going to scale. 1409 01:21:17,750 --> 01:21:20,450 They don't want to be told in two years that they've got to 1410 01:21:20,450 --> 01:21:24,300 recode it all because the number of cores doubled. 1411 01:21:24,300 --> 01:21:27,370 They want to have some future-proof notion that hey, 1412 01:21:27,370 --> 01:21:33,490 there's a lot of parallelism in this program. 1413 01:21:33,490 --> 01:21:37,630 So work and span, work and span, eat it, 1414 01:21:37,630 --> 01:21:39,560 drink it, sleep it. 1415 01:21:39,560 --> 01:21:45,740 Work and span, work and span, work and span, work and span, 1416 01:21:45,740 --> 01:21:47,210 work and span, OK? 1417 01:21:47,210 --> 01:21:48,460 Work and span.
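Collecting the back-of-the-envelope numbers from the chess example in one place (the proposed program's total work of 1024 seconds is inferred from the 2 seconds of work per processor on 512 processors; everything else is as stated in the lecture):

    Original:  T_1 = 2048 s, T_infinity = 1 s
               T_32  ~ 2048/32  + 1 = 65 s        T_512 ~ 2048/512 + 1 =  5 s
    Proposed:  T_1 = 1024 s, T_infinity = 8 s
               T_32  ~ 1024/32  + 8 = 40 s        T_512 ~ 1024/512 + 8 = 10 s

So the "improvement" wins on the 32-core development machine but is twice as slow on the 512-core competition machine.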