We only have four more lectures left, and what Professor Demaine and I have decided to do is give two series of lectures on sort of advanced topics. So, today and Wednesday we're going to talk about parallel algorithms: algorithms where you have more than one processor whacking away on your problem. And this is a very hot topic right now, because all of the chip manufacturers are now producing so-called multicore processors, where you have more than one processor per chip. So, knowing something about that is good.

The second topic we're going to cover is caching, and how you design algorithms for systems with cache. Up to now, we've sort of programmed everything as if there were just a single level of memory, and for some problems that's not an entirely realistic model. You'd like to have some model for how the caching hierarchy works, and how you can take advantage of it. And there's been a lot of research in that area as well. Both of those topics actually turn out to be my own area of research, so this is especially fun for me. Actually, most of it's fun anyway.

So, today we'll talk about parallel algorithms. Now, it turns out that there are lots of models for parallel algorithms, and for parallelism. Whereas for serial algorithms most people share one basic model, the one we've been using all term, sometimes called the random-access machine model, in the parallel space there's just a huge number of models, and there is no general agreement on which is best, because different machines are made with different configurations, and so on; people haven't agreed even on how parallel machines should be organized.

So, we're going to deal with a particular model, which goes under the rubric of dynamic multithreading, and which is appropriate for the multicore machines that are now being built, for shared-memory programming. It's not appropriate for what's called distributed-memory programming, particularly because there the processors aren't all able to access a common memory. For those machines, you need more involved models.
And so, let me start just by giving an example of how one would write something in this model. I'm going to give you a program for calculating the nth Fibonacci number. This is actually a really bad algorithm, because it's the exponential-time algorithm, whereas we know from week one or two that you can calculate the nth Fibonacci number in how much time? Log n time. So, this is two exponentials off what you should be able to get. OK, so here's the code.
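[The code written on the board isn't captured in the transcript. As a rough sketch of what it presumably looked like, here it is in OpenCilk-style C, with the cilk_spawn and cilk_sync keywords standing in for the lecture's spawn and sync:]

    #include <cilk/cilk.h>

    long fib(long n) {
        if (n < 2)
            return n;                    /* base case: fib(0) = 0, fib(1) = 1 */
        long x = cilk_spawn fib(n - 1);  /* child may run in parallel with its parent */
        long y = cilk_spawn fib(n - 2);  /* a second child, also spawned, per the board */
        cilk_sync;                       /* wait until all spawned children are done */
        return x + y;                    /* safe: x and y have both been computed */
    }

[Spawning the second call matches the lecture's picture, which leaves an essentially empty thread between the second spawn and the sync; in practice you could just call fib(n - 2) directly.]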
OK, so this is essentially the pseudocode we would write. And let me just explain a little bit about the couple of keywords here that we haven't seen before: in particular, spawn and sync.

So spawn, which you use as a keyword before a subroutine call, says that the subroutine you're calling can execute at the same time as its parent. So here, when we say x equals spawn fib of n minus one, we immediately go on to the next statement. And now, while we're executing fib of n minus one, we can also be executing this next statement, which itself will spawn something off. OK, and we continue, and then we hit the sync statement. And what sync says is: wait until all children are done. So it says, once you get to this point, you've got to wait until everything you spawned has completed before you execute the x plus y, because otherwise you'd be trying to compute x plus y without having computed x and y yet. OK, so that's the basic structure.

Now, notice that nowhere in here did we say how many processors we're running on. This is just describing logical parallelism, not the actual parallelism when we execute it. And so what we need is a scheduler to determine how to map this dynamically unfolding execution onto whatever processors you have available. Today we're actually going to talk mostly about scheduling, and then next time we're going to talk about specific algorithms and how you analyze them.

OK, so you can view the actual multithreaded computation, if you take a look at the parallel instruction stream, as just a directed acyclic graph. So, let me show you how that works. Normally, when we have a serial instruction stream, I look at each instruction being executed. If I'm in a loop, I'm not looking at it as a loop; I'm just looking at the sequence of instructions that actually executed. So I can draw that as a chain: before I execute one instruction, I have to execute the one before it, and before that, the one before that. At least, that's the abstraction. If you've studied processors, you know there are a lot of tricks in there for finding instruction-level parallelism, for making that serial instruction stream actually execute in parallel. But what we're mostly going to be talking about is the logical parallelism here, and what we can do in that context.

So, in this DAG, the vertices are threads, which are maximal sequences of instructions not containing parallel control. And by parallel control, I just mean spawn, sync, and the return from a spawned procedure. So, let's just mark what the threads are here. When we enter the function, we execute sequentially up to either returning or starting to do the spawn of fib of n minus one. Let's call that thread A; it includes the calculation of n minus one, right up to the point where you actually make the subroutine jump. That's thread A. Thread B is what executes from the continuation of that spawn up through the spawn of fib of n minus two, the one that computes y. Then we'd have essentially an empty thread between that spawn and the sync, which I'll ignore for now. And finally, after the sync, we have a thread that runs up to the point where we return x plus y.
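[In terms of the sketch above, the thread boundaries the lecture is describing fall like this; the A, B, C labels follow the board:]

    long fib(long n) {
        /* Thread A: from entry, through computing n - 1, up to the first spawn. */
        if (n < 2)
            return n;
        long x = cilk_spawn fib(n - 1);
        /* Thread B: the continuation of the first spawn, up to the second spawn. */
        long y = cilk_spawn fib(n - 2);
        /* (An essentially empty thread sits between here and the sync.) */
        cilk_sync;
        /* Thread C: after the sync, the addition and the return of x + y. */
        return x + y;
    }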
So basically, we're just looking at maximal sequences of instructions that are all serial. Every time I hit a parallel-control instruction, a spawn, a sync, or a return from a spawn, that terminates the current thread. So we can look at the computation as a bunch of small threads. For those of you who are familiar with threads from Java threads, or from POSIX threads, the so-called P-threads, those are sort of heavyweight, static threads. This is a much lighter-weight notion of thread that we're using in this model.

OK, so those are the vertices. Now let me map out a little bit how this works, so we can see where the edges come from. So, let's imagine we're executing fib of four. I'm going to draw a horizontal oval; that's going to correspond to a procedure execution. And in this procedure there are essentially three threads. We start out with A; our initial thread is this guy here. Then, when he executes a spawn, we're going to create a new procedure, and he's going to execute a new A recursively within that procedure. But at the same time, we're also now allowed to go on and execute B in the parent: we have parallelism here when I do a spawn. So there's an edge here that we're going to call a spawn edge, and this one is called a continuation edge, because it simply continues the procedure execution.

OK, so now at this point we have two things that can execute at the same time. Once I've executed A, I have two things that can go. This one, fib of three, may spawn another procedure here, that's a fib of two, and simultaneously it can go on and execute its B, with a continuation edge. And B, in fact, can also spawn at this point, and that one is another fib of two. And now, at this point, we can't execute C yet, even though I've spawned things off.
And the reason is that C won't execute until we've executed the sync statement, which can't happen until both of the children we spawned have completed. So C just sort of sits there waiting, and a scheduler shouldn't try to schedule it; or if it does, nothing's going to happen there.

So, we can go on. Here we could call fib of one, and a fib of one is only going to execute an A thread: if we look at the code, when n is less than two it returns immediately, so it never executes a B or a C. And similarly, this guy here does a fib of one, and this guy, I guess, executes the A of a fib of one. And maybe now this guy calls another fib of one, and this guy does a fib of zero. I keep drawing that arrow to the wrong place, OK?

And now, once these guys return, say these two return here, I can now execute this C. But I can't execute it until both of those children are done, and that B is done. So you see that we get a synchronization point here before executing C. And then similarly here: now that we've executed this and this, we can now execute this guy here, and those returns go to there. Likewise here, this guy can now execute his C, and once both of those are done, we can execute this guy up here. And then we're done; this is our final thread.

I should also have labeled these edges coming back: each one of those is a return edge. So, the three types of edges are spawn edges, return edges, and continuation edges. And by describing it in this way, I essentially get a DAG that unfolds. So rather than having just a serial execution trace, I get something where I still have some serial dependencies, where some things have to be done before other things, but there are also things that can be done at the same time.

So, how are we doing? Yeah, question? If every spawn were immediately covered by a sync? Effectively, yeah. There's actually a null thread that gets executed in there, which I hadn't bothered to show.
But yes, basically you would then not have any parallelism, because you would spawn something off but then do nothing in the parent. So it would be pretty much the same as if it had executed serially.

OK, so you can see that what we had here, in some sense, is a DAG embedded in a tree. You have a tree that's sort of the procedure structure, but within it you have a DAG, and that DAG can actually get to be pretty complicated.

OK, now that we understand that we've got an underlying DAG, I want to switch to studying the performance attributes of a particular DAG execution, so, looking at performance measures. The notation we'll use is we'll let T_P be the running time of whatever our computation is on P processors. So, T_P is: how long does it take to execute this on P processors? Now, in general, this is not going to be just one particular number, because different scheduling disciplines would lead to different values of T_P. But when we talk about the running time, we'll still use this notation, and I'll try to be careful as we go through to make sure there's no confusion about what it means in context.

There are a couple of cases, though, which are fairly well defined. One is T_1, the running time on one processor. If I were to execute this on one processor, you can imagine it's just as if I had gotten rid of the spawns and syncs and everything, and just executed it. That gives a particular running time, and we call that running time on one processor the work. It's essentially the serial time. So when we talk about the work of a computation, we just mean essentially the serial running time.

The other measure that ends up being interesting is what we call T_infinity, and this is the critical-path length, which is essentially the longest path in the DAG. So, for example, let's look at fib of four in this example, and let's assume we have unit-time threads.
I know they're not unit time, but just for the purposes of understanding this, imagine that every thread costs me one unit of time to execute. What would be the work of this particular computation? Seventeen, right, because all we do is add up the threads: three, six, nine, 12, 13, 14, 15, 16, 17. So the work is 17 in this case with unit-time threads. In general, you would add up however many instructions were in each thread.

OK, and then T_infinity is the longest path. So this is the longest sequence: even if you had an infinite number of processors, you still couldn't just do everything at once, because some things have to come before other things. But if you had an infinite number of processors, as many processors as you want, what's the fastest you could possibly execute this? A little trickier. Seven? So, where's your seven? One, two, three, four, five, six, seven, eight; yeah, eight is the longest path. So, the work and the critical-path length, as we'll see, are the key attributes of any computation, and these particular counts are exact if the threads are unit time.
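[A quick sanity check on those two numbers. The recurrences below are my own bookkeeping of the picture, assuming unit-time threads: a fib call with n at least 2 contributes its three threads A, B, C plus its two children, and a base-case call is a single thread. They reproduce the 17 and the 8 from the board:]

    #include <stdio.h>

    /* Work: total number of unit-time threads in the DAG for fib(n). */
    long work(long n) {
        if (n < 2)
            return 1;                          /* a base case runs only its A thread */
        return 3 + work(n - 1) + work(n - 2);  /* threads A, B, C plus both children */
    }

    /* Critical path: A runs first, then the first child and B start together;
     * the second child starts one step after B; C runs after everything. */
    long span(long n) {
        if (n < 2)
            return 1;
        long s1 = span(n - 1);                 /* child spawned right after A */
        long s2 = 1 + span(n - 2);             /* child spawned after B, one step later */
        long longer = (s1 > s2) ? s1 : s2;
        return 2 + longer;                     /* plus A before and C after */
    }

    int main(void) {
        printf("work(4) = %ld\n", work(4));    /* prints 17, matching the board */
        printf("span(4) = %ld\n", span(4));    /* prints 8, matching the board */
        return 0;
    }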
OK, so we can use these two measures to derive lower bounds on T_P for any P between one and infinity. So, the first lower bound we can derive is that T_P has got to be at least T_1 over P. Why is that a lower bound? Yeah? OK, you've got the right idea, but can we be a little more articulate about it? That's right, you want to use all the processors; but if I could use all the processors, why couldn't T_P still be less than this? Why does it have to be at least as big as T_1 over P? I'm just asking for a little more precision in the answer, if we're going to persuade the rest of the class that this is the lower bound. Yeah?

Yeah, that's another way of looking at it. If you were to serialize the computation, whatever things you execute on each step, you do at most P of them, so serializing it would take you up to P serial steps per step of a machine with P processors. OK, maybe a little more precise. David? Yeah, good, so let me just state this. What are we relying on? P processors can do at most P work in one step, right? In one step, they do at most P work; they can't do more than P work. And so, if they can do at most P work in one step, and the number of steps were in fact less than T_1 over P, then they would have done less than T_1 work by the end, and there's T_1 work to be done. OK, I just stated that almost as badly as all the responses I got. [LAUGHTER] P processors can do at most P work in one step, right? So if there's T_1 work to be done, the number of steps is going to be at least T_1 over P. There we go. It wasn't that hard. I've got T_1 work to do; I can knock off at most P of it on every step; how many steps? Just divide. So it's going to be at least that many.

OK, good. The other lower bound is that T_P is greater than or equal to T_infinity. Somebody explain to me why that might be true. Yeah? Right: if you could do it in a certain amount of time with P processors, you can certainly do it in that time with an infinite number of processors. Now, this is in a model where, you know, there's lots of stuff this model doesn't capture, like communication costs and interference and all sorts of things. But it's a simple model which actually works out pretty well in practice, and in it you're not going to be able to do better with P processors than you could with an infinite number of processors.
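[Putting the two bounds together in one display: any execution on P processors satisfies]

    \[ T_P \;\ge\; \max\!\left( \frac{T_1}{P},\; T_\infty \right) \]

[This combined form is what the factor-of-two comparison below plays against.]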
OK, so those are helpful bounds to understand. When we're trying to make something go faster, it's nice to know what you could possibly hope to achieve, as opposed to beating your head against a wall wondering why you can't make it go much faster. Maybe it's because one of these lower bounds is operating.

OK, well, we're interested in how fast we can go. The main reason for using multiple processors is that you hope to go faster than you could with one processor. So, we define T_1 over T_P to be the speedup on P processors. That is: how much faster is it on P processors than on one processor? If T_1 over T_P is order P, we say that we have linear speedup. Why? Because that says that if I've thrown P processors at the job, I get a speedup proportional to P; in some sense, each processor contributed, to within a constant factor, its full measure of support. If T_1 over T_P were in fact equal to P, we'd call that perfect linear speedup; here we're giving ourselves, for theoretical purposes, a little bit of a constant-factor buffer. And if T_1 over T_P is greater than P, we call that superlinear speedup.

OK, so can somebody tell me: when can I get superlinear speedup? When can I get superlinear speedup? Never. OK, why never? Yeah: if we buy these lower bounds, the first one says T_P is greater than or equal to T_1 over P, and rearranging that says T_1 over T_P is less than or equal to P. So superlinear speedup is never possible in this model. There are other models where it is possible, due to caching effects and things of that nature. But in this simple model that we're dealing with, it's not possible to get superlinear speedup.
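[The one-line version of that argument, from the first lower bound:]

    \[ \frac{T_1}{T_P} \;\le\; \frac{T_1}{T_1 / P} \;=\; P \]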
Now, the maximum possible speedup, given some amount of work and critical-path length, is what? What's the maximum possible speedup I could get over any number of processors? No, I'm saying: no matter how many processors, what's the most speedup I could get? T_1 over T_infinity, because of the second lower bound. If I threw an infinite number of processors at the problem, that's going to give me my biggest speedup. And we call that the parallelism. So the parallelism of a particular computation is essentially the work divided by the critical-path length. Another way of viewing it is that this is the average amount of work that can be done in parallel along each step of the critical path. And we often denote it by P-bar.

So, do not get confused: P-bar does not have anything to do with P. P is the number of processors you happen to be running on; P-bar is defined purely in terms of the computation you're executing, not in terms of the machine you're running it on. It's just the average amount of work that can be done in parallel along each step of the critical path. OK, questions so far? So far we're mostly just doing definitions.

OK, so it's helpful to know what the parallelism is, because there's no real point in trying to get speedup bigger than the parallelism. So if you're given a particular computation, you'll be able to say: oh, it doesn't go any faster; you're throwing more processors at it; why isn't it going any faster? And the answer could be: no more parallelism.
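[For the fib(4) DAG on the board, with unit-time threads, this number is small:]

    \[ \bar{P} \;=\; \frac{T_1}{T_\infty} \;=\; \frac{17}{8} \;\approx\; 2.1 \]

[so no matter how many processors you throw at that particular computation, you can't hope for more than about a factor-of-two speedup.]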
OK, let's see. Yeah, I think we can erase the example here. We'll talk more about this model; mostly now we're going to just talk about DAGs, and we'll talk about the programming model next time. So, let's talk about scheduling.

The goal of a scheduler is to map the computation onto P processors. And this is typically done by a runtime system, which, if you will, is an algorithm running underneath the language layer that I showed you. So the programmer designs an algorithm using spawns and syncs and so forth, and underneath that there's an algorithm that has to map that executing program onto the processors of the machine as it executes. That's the scheduler, and it's typically part of the language runtime system.

Now, it turns out that online schedulers, let me just say, are complex. They're not necessarily easy things to build. Actually, they're not too bad, but we're not going to go there, because we only have two lectures to do this. Instead, what we'll do is illustrate the ideas using offline scheduling. You'll get an idea from this of what a scheduler does, and it turns out that doing these things online is another level of complexity beyond that. And typically the online schedulers that are good these days are randomized schedulers, and they have very strong proofs of their ability to perform. But we're not going to go there; we'll keep it simple. In particular, we're going to look at a particular type of scheduler called a greedy scheduler.

So, if you have a DAG to execute, the basic rule of the scheduler is that you can't execute a node until all of the nodes that precede it in the DAG have executed. So, you've got to wait until everything before a node has executed. And a greedy scheduler just says: let's try to do as much as possible on every step. In other words, I'm never going to guess that it's worthwhile to delay doing something; if I can do something now, I'm going to do it. And so each step is going to be one of two types. The first type is what we'll call a complete step, and this is a step in which there are at least P threads ready to run, where I'm executing on P processors. So, what's a greedy strategy here? I've got P processors; I've got at least P threads. Run any P.
Yeah, the first P would make sense if you had a notion of ordering; that would be perfectly reasonable. Here, we're just going to execute any P. We might make a mistake there, because there may be a particular thread that, if we executed it now, would enable more parallelism later on, and we might not pick that one; we don't know. So basically we just execute any P, willy-nilly. There's some nondeterminism, if you will, in this step, because which P you execute may or may not be a good choice.

OK, the second type of step we're going to have is an incomplete step, and this is a situation where we have fewer than P threads ready to run. So, what's our strategy there? Execute all of them. If we're greedy, there's no point in not executing something. So: if I have at least P threads ready to run, I execute any P; if I have fewer than P threads ready to run, I execute all of them.

So, it turns out this is a good strategy. It's not a perfect strategy; in fact, the problem of scheduling a DAG optimally on P processors is NP-complete, meaning it's very difficult. So, those of you who are going to take 6.045 or 6.840, and I highly recommend those courses, and we'll talk more about that in the last lecture when we talk about what's coming up in the theory engineering concentration, you can learn there about NP-completeness, and about how you show that for certain problems there are no good algorithms that we're aware of, and what exactly that means. So this type of scheduling problem turns out to be very difficult to solve optimally.
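[To make those two rules concrete: here is a minimal offline greedy scheduler, sketched in C. The DAG encoding, the array sizes, and the fib(2) example at the bottom are my own illustration, not anything from the lecture; the scheduler keeps in-degrees and, on each step, runs min(P, ready) of the ready threads, any of them.]

    #include <stdio.h>

    #define MAXN 64

    /* DAG of unit-time threads: succ[u] lists successors of thread u, and
     * indeg[u] counts not-yet-executed predecessors (zero means ready). */
    int nsucc[MAXN], succ[MAXN][MAXN], indeg[MAXN];

    /* Greedily schedule n threads on P processors; returns the step count. */
    int greedy_schedule(int n, int P) {
        int ready[MAXN], nready = 0, steps = 0, done = 0;
        for (int u = 0; u < n; u++)
            if (indeg[u] == 0)
                ready[nready++] = u;                /* initially ready threads */
        while (done < n) {
            /* Complete step: at least P ready, run any P.
             * Incomplete step: fewer than P ready, run all of them. */
            int run = (nready < P) ? nready : P;
            int batch[MAXN], nbatch = 0;
            while (nbatch < run)
                batch[nbatch++] = ready[--nready];  /* "any P": order arbitrary */
            steps++;
            done += nbatch;
            for (int i = 0; i < nbatch; i++)        /* retire the batch, releasing successors */
                for (int j = 0; j < nsucc[batch[i]]; j++)
                    if (--indeg[succ[batch[i]][j]] == 0)
                        ready[nready++] = succ[batch[i]][j];
        }
        return steps;
    }

    int main(void) {
        /* The five threads of the fib(2) picture: 0 = A, 1 = the child A for
         * fib(1), 2 = B, 3 = the child A for fib(0), 4 = C. */
        int edges[][2] = { {0,1}, {0,2}, {2,3}, {1,4}, {3,4}, {2,4} };
        for (int i = 0; i < 6; i++) {
            int u = edges[i][0], v = edges[i][1];
            succ[u][nsucc[u]++] = v;
            indeg[v]++;
        }
        /* Work is 5 and critical path is 4, so the theorem below promises
         * at most 5/2 + 4 steps on two processors; greedy takes 4. */
        printf("steps on 2 processors: %d\n", greedy_schedule(5, 2));
        return 0;
    }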
But there's a nice theorem, due independently to Graham and Brent. It says, essentially, that a greedy scheduler executes any computation G with work T_1 and critical-path length T_infinity in time T_P less than or equal to T_1 over P plus T_infinity, on a computer with P processors. OK, so it says that I can achieve T_1 over P plus T_infinity. So, what does that say? If we take a look and compare this with our lower bounds on the runtime, how efficient is this? How does this compare with the optimal execution? Yeah, it's 2-competitive. It's within a factor of two of optimal, because this term is a lower bound and this term is a lower bound: the sum T_1 over P plus T_infinity is at most twice the maximum of the two, and that maximum is a lower bound on the runtime of any schedule. So I'm within a factor of two of whichever is the stronger lower bound in any situation. So this says greedy scheduling gets you within a factor of two of the optimal runtime on P processors. OK, does everybody see that?

So, let's prove this theorem. It's quite an elegant theorem, and it's not a hard one. One of the nice things, by the way, about this week is that nothing is very hard; it just requires you to think differently. OK, so the proof has to do with counting up how many complete steps we have, and how many incomplete steps we have.

So, we'll start with the number of complete steps. Can somebody tell me the largest number of complete steps I could possibly have? Yeah, I heard somebody mumble it back there: T_1 over P. Why is that? Yeah: once you've had that many, you've done T_1 work. On every complete step I'm getting P work done, so if I did more than T_1 over P complete steps, there would be no more work left to be done. So the number of complete steps can't be bigger than T_1 over P. OK, so that's the first piece.

OK, now we're going to count up the incomplete steps, and show their number is bounded by T_infinity. So let's consider an incomplete step, and let's see what happens. Let G prime be the subgraph of G that remains to be executed. So we'll draw a picture here; let's draw it on a new board. So here, we're going to have our graph G, and we'll actually use P equals three as our example. So, imagine that this is the graph G. I'm not showing the procedures here, because this actually is a theorem that works for any DAG; the procedure outlines are not necessary. All we care about is the threads.
555 00:50:16,000 --> 00:50:25,000 I missed one. OK, so imagine that's my DAG, 556 00:50:25,000 --> 00:50:38,000 G, and imagine that I have executed up to this point. 557 00:50:38,000 --> 00:50:47,000 Which ones have I executed? Yeah, I've executed these guys. 558 00:50:47,000 --> 00:50:57,000 So, the things that are in G prime are just the things that 559 00:50:57,000 --> 00:51:04,000 have yet to be executed. And these guys are the ones 560 00:51:04,000 --> 00:51:09,000 that are already executed. And, we'll imagine that all of 561 00:51:09,000 --> 00:51:14,000 them are unit time threads without loss of generality. 562 00:51:14,000 --> 00:51:19,000 The theorem would go through, even if each of these had a 563 00:51:19,000 --> 00:51:23,000 particular time associated with it. 564 00:51:23,000 --> 00:51:27,000 The same scheduling algorithm will work just fine. 565 00:51:27,000 --> 00:51:32,000 So, how can I characterize the threads that are ready to be 566 00:51:32,000 --> 00:51:38,000 executed? Which are the threads that are 567 00:51:38,000 --> 00:51:42,000 ready to be executed here? Let's just see. 568 00:51:42,000 --> 00:51:46,000 So, that one? No, that's not ready to be 569 00:51:46,000 --> 00:51:48,000 executed. Why? 570 00:51:48,000 --> 00:51:52,000 Because it's got a predecessor here, this guy. 571 00:51:52,000 --> 00:51:59,000 OK, so this guy is ready to be executed, and this guy is ready 572 00:51:59,000 --> 00:52:04,000 to be executed. OK, so those two threads are 573 00:52:04,000 --> 00:52:08,000 ready to be, how can I characterize this? 574 00:52:08,000 --> 00:52:12,000 What's their property? What's a graph theoretic 575 00:52:12,000 --> 00:52:17,000 property in G prime that tells me whether or not something is 576 00:52:17,000 --> 00:52:21,000 ready to be executed? It has no predecessor, 577 00:52:21,000 --> 00:52:24,000 but what's another way of saying that? 578 00:52:24,000 --> 00:52:29,000 It's got no predecessor in G prime. 579 00:52:29,000 --> 00:52:38,000 What does it mean for a node not to have a predecessor in a 580 00:52:38,000 --> 00:52:43,000 graph? Its in degree is zero, 581 00:52:43,000 --> 00:52:46,000 right? Same thing. 582 00:52:46,000 --> 00:52:56,000 OK, the threads with in degree, zero and G prime are the ones 583 00:52:56,000 --> 00:53:06,000 that are ready to be executed. OK, and if it's incomplete 584 00:53:06,000 --> 00:53:11,000 step, what do I do? I'm going to execute says, 585 00:53:11,000 --> 00:53:17,000 if it's an incomplete step, I execute all of them. 586 00:53:17,000 --> 00:53:24,000 OK, so I execute all of these. OK, now I execute all of the in 587 00:53:24,000 --> 00:53:30,000 degree zero threads, what happens to the critical 588 00:53:30,000 --> 00:53:38,000 path length of the graph that remains to be executed? 589 00:53:38,000 --> 00:53:48,000 It decreases by one. OK, so the critical path length 590 00:53:48,000 --> 00:54:00,000 of what remains to be executed, G prime, is reduced by one. 591 00:54:00,000 --> 00:54:04,000 So, what's left to be executed on every incomplete step, 592 00:54:04,000 --> 00:54:08,000 what's left to be executed always reduces by one. 593 00:54:08,000 --> 00:54:12,000 Notice the next step here is going to be a complete step, 594 00:54:12,000 --> 00:54:16,000 because I've got four things that are ready to go. 595 00:54:16,000 --> 00:54:21,000 And, I can execute them in such a way that the critical path 596 00:54:21,000 --> 00:54:24,000 length doesn't get reduced on that step. 
OK, but when I have to execute all of the ready threads, then it does reduce the critical-path length. Now, of course, both things could happen at the same time; but any time I have an incomplete step, I'm guaranteed to reduce the critical-path length by one. So that implies the number of incomplete steps is at most T_infinity. And therefore T_P is at most the number of complete steps plus the number of incomplete steps, and we get our bound.

This is sort of an amortized argument, if you want to think of it that way: at every step I'm either amortizing the step against the work, or amortizing it against the critical-path length, or possibly both, but I'm doing at least one of those on every step. And so, in the end, I just have to add up the two contributions. Any questions about that?

So this, by the way, is the fundamental theorem of all scheduling. If you ever study anything having to do with scheduling, this basic result is sort of the foundation of a huge number of things. And then what people do is gussy it up: let's do this online, with a real scheduler, etc., where everybody is trying to match these bounds, the bounds an omniscient greedy scheduler would achieve. And there are all kinds of other variations. But this is the basic theorem that pervades the whole area of scheduling.

OK, let's do a quick corollary. I'm not going to erase those boards; those are just too important. Let's not erase those; I don't want to erase that either. We're going to go back to the top. Actually, we'll put the corollary here, because it's just one line. The corollary says you get linear speedup if the number of processors that you run your job on is order the parallelism. So a greedy scheduler gives you linear speedup if you're running on essentially the parallelism or fewer processors. OK, so let's see why that is, and I hope I'll fit this in. So, P-bar is T_1 over T_infinity.
And that implies that if P is order T_1 over T_infinity, then, just bringing it around, T_infinity is order T_1 over P. Everybody with me? It's just algebra: this is the definition of the parallelism, T_1 over T_infinity, so if P is order the parallelism, then P is order T_1 over T_infinity, and turning that around says T_infinity is order T_1 over P.

And so, to continue the proof: T_P is at most T_1 over P plus T_infinity, and if the second term is order T_1 over P, the whole thing is order T_1 over P. OK, so now I have that T_P is order T_1 over P, and what we need is to compute the speedup T_1 over T_P, which is therefore order P. Does everybody see that? So what that says is that if I have a certain amount of parallelism, and I run on essentially fewer processors than that parallelism, I get linear speedup if I use greedy scheduling. If I run on more processors than the parallelism, in some sense I'm being wasteful, because I can't possibly get enough speedup to justify those extra processors. So understanding the parallelism of a job gives you a sort of limit on the number of processors you want to have. And, in fact, I can achieve that.
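[The corollary's chain of reasoning, in one display:]

    \[ P = O\!\left(\frac{T_1}{T_\infty}\right) \;\Rightarrow\; T_\infty = O\!\left(\frac{T_1}{P}\right) \;\Rightarrow\; T_P \le \frac{T_1}{P} + T_\infty = O\!\left(\frac{T_1}{P}\right) \;\Rightarrow\; \frac{T_1}{T_P} = \Omega(P) \]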
Question? Yeah, really, in some sense, this is saying the speedup should be omega of P; yeah, that's fine. It's a question of, so ask again. No, no, that's only if it's bounded above by a constant. T_1 and T_infinity aren't constants; they're variables here. We are doing multivariable asymptotic analysis, so any of these quantities can be a function of the others, and can grow as much as we want. So when we say we're given a particular computation, we're really not given one fixed number; we're given a whole class of DAGs of various sizes. So I can look at growth: here we're talking about the growth of the runtime T_P as a function of T_1 and T_infinity. So I am talking about things that are growing here, OK?

OK, so let's put this to work. In fact, now I'm going to tell you a little bit about my own research, and how we used this in some of the work that we did. So, we've developed a dynamic multithreaded language called Cilk, spelled with a C because it's based on the language C. And it's not an acronym; it's because silk is like nice threads. Although, at one point, my students had a competition for what the acronym Cilk could mean, and the winner, it turns out, was Charles' Idiotic Linguistic Kluge. So anyway, if you want to take a look at it, you can find some material on it.

OK, and what Cilk uses is actually one of these more complicated schedulers: a randomized online scheduler. If you look at its expected runtime on P processors, it achieves T_1 over P plus O of T_infinity, provably. And empirically, if you actually look at the runtimes you get, to find out what's hidden in that big O, it turns out to be T_1 over P plus T_infinity with the constants very close to one. So, no guarantees, but this turns out to be a pretty good bound. Sometimes you see a coefficient on T_infinity that's up maybe close to four or something, but generally you don't see anything much bigger than that, and mostly, if you do a linear-regression curve fit, you get that the constant is close to one.

And so, if you use this formula as a model for your runtime, you get near-perfect linear speedup whenever the number of processors you're running on is much less than your average parallelism, which, of course, is the same thing as saying T_infinity is much less than T_1 over P. So what happens here is that when P is much less than P-bar, that is, when T_infinity is much less than T_1 over P, the T_infinity term ceases to matter very much, and you get very good speedup; in fact, almost perfect speedup.
723 01:05:36,000 --> 01:05:42,357 So, each processor gives you another processor's work as long 724 01:05:42,357 --> 01:05:48,503 as you are in the range where the number of processors is much 725 01:05:48,503 --> 01:05:52,211 less than the parallelism. 726 01:05:52,211 --> 01:05:58,463 Now, with this language, many years ago, which now seems like 727 01:05:58,463 --> 01:06:03,231 a very long time ago, OK, it turned out we competed. 728 01:06:03,231 --> 01:06:08,000 We built a bunch of chess programs. 729 01:06:08,000 --> 01:06:11,962 And, among our programs were Starsocrates 730 01:06:11,962 --> 01:06:16,312 and Cilkchess, and we also had several others. 731 01:06:16,312 --> 01:06:19,501 And these were, I would call them, 732 01:06:19,501 --> 01:06:22,014 world-class. In particular, 733 01:06:22,014 --> 01:06:26,750 we tied for first in the 1995 World Computer Chess 734 01:06:26,750 --> 01:06:32,066 Championship in Hong Kong, and then we had a playoff and 735 01:06:32,066 --> 01:06:35,860 we lost. It was really a shame. 736 01:06:35,860 --> 01:06:39,157 We almost won, running on a big parallel 737 01:06:39,157 --> 01:06:41,778 machine. Incidentally, 738 01:06:41,778 --> 01:06:47,020 some of you may know about the Deep Blue chess playing program. 739 01:06:47,020 --> 01:06:52,008 That was the last time, before they faced then world champion 740 01:06:52,008 --> 01:06:55,728 Kasparov, that they competed against other programs. 741 01:06:55,728 --> 01:06:58,941 They tied for third in that tournament. 742 01:06:58,941 --> 01:07:03,000 OK, so we actually out-placed them. 743 01:07:03,000 --> 01:07:07,159 However, in the head-to-head competition, we lost to them. 744 01:07:07,159 --> 01:07:11,099 So we had one loss in the tournament up to the point of 745 01:07:11,099 --> 01:07:13,872 the finals. They had a loss and a draw. 746 01:07:13,872 --> 01:07:17,375 Most people aren't aware that Deep Blue, in fact, 747 01:07:17,375 --> 01:07:21,608 was not the reigning World Computer Chess Champion when 748 01:07:21,608 --> 01:07:24,964 they faced Kasparov. The reason that they faced 749 01:07:24,964 --> 01:07:30,000 Kasparov was because IBM was willing to put up the money. 750 01:07:30,000 --> 01:07:38,029 OK, so we developed these chess programs, and the way we 751 01:07:38,029 --> 01:07:44,747 developed them, let me in particular talk about 752 01:07:44,747 --> 01:07:51,172 Starsocrates. We had this interesting anomaly 753 01:07:51,172 --> 01:07:55,699 come up. We were running on a 32 754 01:07:55,699 --> 01:08:03,000 processor computer at MIT for development. 755 01:08:03,000 --> 01:08:07,463 And, we had access to a 512 processor computer for the 756 01:08:07,463 --> 01:08:11,505 tournament at NCSA at the University of Illinois. 757 01:08:11,505 --> 01:08:16,389 So, we had this big machine. Of course, they didn't want to 758 01:08:16,389 --> 01:08:20,852 give it to us very much, but we had the same kind of machine, 759 01:08:20,852 --> 01:08:22,872 just a small one, at MIT. 760 01:08:22,872 --> 01:08:27,756 So, we would develop on the small machine, and occasionally we'd be able 761 01:08:27,756 --> 01:08:31,126 to run on the big one, and the big machine was what we were 762 01:08:31,126 --> 01:08:37,719 really developing for. So, let me show you sort of the 763 01:08:37,719 --> 01:08:40,000 anomaly that came up, OK?
764 01:08:48,000 --> 01:08:55,974 So, we had a version of a program that I'll call the 765 01:08:55,974 --> 01:09:02,854 original program, OK, and we had an optimized 766 01:09:02,854 --> 01:09:12,236 program that included some new features that were supposed to 767 01:09:12,236 --> 01:09:20,992 make the program go faster. And so, we timed it on our 32 768 01:09:20,992 --> 01:09:28,341 processor machine. And, it took us 65 seconds to 769 01:09:28,341 --> 01:09:33,839 run it. OK, and then we timed this new 770 01:09:33,839 --> 01:09:37,340 program. So, I'll call that T prime 771 01:09:37,340 --> 01:09:42,261 sub 32 on our 32 processor machine, and it ran in 40 772 01:09:42,261 --> 01:09:45,952 seconds on this particular benchmark. 773 01:09:45,952 --> 01:09:50,399 Now, let me just say, I've lied about the actual 774 01:09:50,399 --> 01:09:54,375 numbers here to make the calculations easy. 775 01:09:54,375 --> 01:10:01,000 But, the same idea happened. Just the numbers were messier. 776 01:10:01,000 --> 01:10:07,275 OK, so this looks like a significant improvement in 777 01:10:07,275 --> 01:10:12,421 runtime, but we rejected the optimization. 778 01:10:12,421 --> 01:10:19,574 OK, and the reason we rejected it is because we understood 779 01:10:19,574 --> 01:10:24,846 the issues of work and critical path. 780 01:10:24,846 --> 01:10:30,368 So, let me show you the analysis that we did, 781 01:10:30,368 --> 01:10:33,813 OK? So the analysis, 782 01:10:33,813 --> 01:10:37,441 it turns out, if we looked at our 783 01:10:37,441 --> 01:10:42,089 instrumentation, the work in this case was 784 01:10:42,089 --> 01:10:46,170 2,048. And, the critical path was one 785 01:10:46,170 --> 01:10:50,931 second, whereas over here with the optimized 786 01:10:50,931 --> 01:10:55,125 program, the work was, in fact, 1,024. 787 01:10:55,125 --> 01:11:00,000 But the critical path was eight. 788 01:11:00,000 --> 01:11:07,375 So, if we plug into our simple model here, the one I have up 789 01:11:07,375 --> 01:11:14,625 there with the approximation there, I have T_32 is equal to 790 01:11:14,625 --> 01:11:20,625 T_1 over 32 plus T infinity, and that's equal to, 791 01:11:20,625 --> 01:11:25,250 well, the work is 2,048 divided by 32. 792 01:11:25,250 --> 01:11:30,125 What's that? 64, good, plus the critical 793 01:11:30,125 --> 01:11:37,625 path, one, that's 65. So, that checks out with what 794 01:11:37,625 --> 01:11:40,000 we saw. OK, in fact, 795 01:11:40,000 --> 01:11:43,875 we did that, and it checked out. 796 01:11:43,875 --> 01:11:48,375 OK, it was very close. OK, over here, 797 01:11:48,375 --> 01:11:54,875 T prime sub 32 is T prime sub 1 over 32 plus T infinity 798 01:11:54,875 --> 01:12:02,750 prime, and that's equal to 1,024 divided by 32, which is 32, plus eight, 799 01:12:02,750 --> 01:12:07,981 the critical path here. That's 40. 800 01:12:07,981 --> 01:12:13,377 So, that checked out too. So, now what we did is we said, 801 01:12:13,377 --> 01:12:17,596 OK, let's extrapolate to our big 802 01:12:17,596 --> 01:12:21,422 machine. How fast are these things going 803 01:12:21,422 --> 01:12:25,445 to run on our big machine? Well, for that, 804 01:12:25,445 --> 01:12:29,958 we want T of 512. And, that's equal to T_1 over 805 01:12:29,958 --> 01:12:36,913 512 plus T infinity. And so, what's 2,048 divided by 806 01:12:36,913 --> 01:12:41,079 512? It's four, plus T infinity is 807 01:12:41,079 --> 01:12:44,235 one. That's equal to five. 808 01:12:44,235 --> 01:12:48,401 So, it would go quite a bit faster on this machine.
809 01:12:48,401 --> 01:12:55,471 But here, T prime of 512 is equal to T prime sub 1 over 512 810 01:12:55,471 --> 01:13:03,172 plus T infinity prime, which is equal to, well, 1,024 divided by 811 01:13:03,172 --> 01:13:11,000 512 is two, plus the critical path of eight, that's ten. 812 01:13:11,000 --> 01:13:15,913 OK, and so, you see that on the big machine, we would have been 813 01:13:15,913 --> 01:13:19,163 running twice as slow had we adopted that, 814 01:13:19,163 --> 01:13:23,205 quote, "optimization", OK, because we had run out of 815 01:13:23,205 --> 01:13:27,009 parallelism, and this was making the critical path longer. 816 01:13:27,009 --> 01:13:31,447 We needed to have a way of doing it where we could reduce 817 01:13:31,447 --> 01:13:34,459 the work. Yeah, it's good to reduce the 818 01:13:34,459 --> 01:13:39,135 work, but not if lengthening the critical path gets rid of the 819 01:13:39,135 --> 01:13:45,000 parallelism that we hoped to be able to use during the run. 820 01:13:45,000 --> 01:13:48,186 So, it's twice as slow, OK, twice as slow. 821 01:13:48,186 --> 01:13:52,927 So the moral is that the work and critical path length predict 822 01:13:52,927 --> 01:13:56,968 the performance better than the execution time alone, 823 01:13:56,968 --> 01:14:00,000 OK, when you look at scalability. 824 01:14:00,000 --> 01:14:03,600 And a big issue on a lot of these machines is scalability; 825 01:14:03,600 --> 01:14:07,263 not always, sometimes you're not worried about scalability. 826 01:14:07,263 --> 01:14:10,421 Sometimes you just care about the machine you have. Had we been running in the 827 01:14:10,421 --> 01:14:14,210 competition on a 32 processor machine, we would have accepted 828 01:14:14,210 --> 01:14:16,926 this optimization. It would have been a good 829 01:14:16,926 --> 01:14:19,515 trade-off. OK, but because we knew that we 830 01:14:19,515 --> 01:14:22,800 were running on a machine with a lot more processors, 831 01:14:22,800 --> 01:14:26,336 and that we were close to running out of the parallelism, 832 01:14:26,336 --> 01:14:29,936 it didn't make sense to be increasing the critical path at 833 01:14:29,936 --> 01:14:33,726 that point, because that was just reducing the parallelism of 834 01:14:33,726 --> 01:14:36,887 our calculation. OK, 835 01:14:36,887 --> 01:14:39,041 any questions about that first? No? 836 01:14:39,041 --> 01:14:40,626 OK. Next time, now that we 837 01:14:40,626 --> 01:14:44,111 understand the model for execution, we're going to start 838 01:14:44,111 --> 01:14:47,786 looking at the performance of particular algorithms when we 839 01:14:47,786 --> 01:14:50,701 code them up in a dynamic, multithreaded style, 840 01:14:50,701 --> 01:14:53,000 OK?
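The whole anomaly can be replayed in a few lines of Python; this sketch simply plugs the lecture's simplified work and critical-path numbers into the T_P = T_1/P + T_inf model:

    def t(work, span, p):
        # The model used above: T_P ~ T_1/P + T_infinity.
        return work / p + span

    orig = (2048, 1)    # original program: work 2,048, critical path 1
    opt = (1024, 8)     # "optimized" program: work 1,024, critical path 8

    for p in (32, 512):
        print(f"P = {p:3d}: original {t(*orig, p):4.0f} s, "
              f"optimized {t(*opt, p):4.0f} s")

    # P =  32: original 65 s, optimized 40 s  -> looks like a clear win
    # P = 512: original  5 s, optimized 10 s  -> twice as slow on the
    #                                            tournament machine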