The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: I guess [OBSCURED]. Let's get going. OK, should I introduce you?

BRADLEY KUSZMAUL: If you want. I can introduce myself.

PROFESSOR: We have Bradley Kuszmaul, who has been working on Cilk. It's a very interesting parallel programming project that has been going for a while, and there are a lot of interesting things he's developed, with multicore becoming very important.

BRADLEY KUSZMAUL: So how many of you people have ever heard of Cilk? Have used it? Those of you who have used it may find this talk old material, or whatever.

So Cilk is a system that runs on a shared-memory multiprocessor. This is not like the system you've been programming for this class. In this kind of machine you have processors, which each have a cache, some sort of a network, and a bunch of memory, and when the processors do memory operations they are all in the same address space. Typically the memory system provides some sort of coherence, like strong consistency or maybe release consistency.

We're interested in the case where the distance from processors to other processors, and from a processor to memory, may be nonuniform, so it's important to use the cache well in this kind of machine, because you can't just ignore the cache.

So the technology that I'm going to talk about for this kind of system is called Cilk. Cilk is a C-based language, it does dynamic multithreading, and it has a provably good runtime system; I'll talk about what all of those mean. Cilk runs on shared-memory machines like Suns and SGIs and, well, you probably can't find AlphaServers anymore. It runs on SMPs like the ones that are in everybody's laptops now.
There have been several interesting applications written in Cilk, including virus shell assembly, graphics rendering, and n-body simulation. We did a bunch of chess programs, because they were sort of the raison d'etre for Cilk.

One of the features of Cilk is that it automatically manages a lot of the low-level issues. You don't have to do load balancing, and you don't have to write protocols. You basically write programs that look a lot more like ordinary serial programs, instead of saying: first I'm going to do this, and then I'm going to set this variable, and then somebody else is going to read that variable. That's a protocol, and those are very difficult to get right.

AUDIENCE: [OBSCURED]

BRADLEY KUSZMAUL: Yeah, I'll mention that a little bit later. We had an award-winning chess player.

So to explain what Cilk's about, I'll talk about Fibonacci. Now, Fibonacci-- this is just to review in case you don't know C. You all know C, right? Fibonacci is the function where each number is the sum of the previous two Fibonacci numbers, and this is an implementation that basically does that computation directly. For Fibonacci of n: if n is less than 2, it's just n, so Fibonacci of zero is zero and Fibonacci of 1 is 1. From 2 on you have to do the recursion: you compute Fibonacci of n minus 1 and Fibonacci of n minus 2 and sum them together, and that's Fibonacci of n.

One observation about this function is that it's a really slow implementation of Fibonacci. You all know how to do this faster? How fast can you do Fibonacci? And how fast is this one?

AUDIENCE: [OBSCURED]

BRADLEY KUSZMAUL: So for those of you who don't know: you certainly know how to compute Fibonacci in linear time, just by keeping track of the most recent two. 1, 1, 2, 3, 5 -- you just do it. This is exponential time, and there's an algorithm that does it in logarithmic time. So this implementation is doubly, exponentially bad. But it's good as a didactic example because it's easy to understand.
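For reference, the serial C implementation being described is essentially this:

    /* Exponential-time Fibonacci: fib(n) = n for n < 2,
       otherwise fib(n-1) + fib(n-2). */
    int fib(int n)
    {
        if (n < 2) {
            return n;
        } else {
            int x = fib(n - 1);
            int y = fib(n - 2);
            return x + y;
        }
    }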
So to turn this into Cilk we just add some keywords, and I'll talk about what the keywords are in a minute. The key thing to understand is that if you delete the keywords you have a C program, and Cilk programs have the property that one of the legal semantics for the Cilk program is the C program that you get by deleting the keywords. Now, there are other possible semantics you could get -- not for this function; this function always produces the same answer because there are no race conditions in it -- but for programs that have races, you may have other semantics that the system could provide. This kind of language extension, where you can delete the extensions and get a correct implementation of the parallel program, is called a faithful extension. A lot of languages like OpenMP have the property that if you add these directives and then delete them, it can change the semantics of your program, so you have to be very careful. Now, if you're careful about programming OpenMP you can make it faithful, so that it has this property, but that's not always the case. Sure.

AUDIENCE: Is it built on the different [OBSCURED]?

BRADLEY KUSZMAUL: C77. No, C89.

AUDIENCE: OK, so there's no presumption about aliasing involved? It's assumed that the [OBSCURED].

BRADLEY KUSZMAUL: So the issue of restricted pointers, for example?

AUDIENCE: Restricted pointers.

BRADLEY KUSZMAUL: So Cilk turns out to work with C99 as well.

AUDIENCE: But is the presumption, though, for a pointer that it could alias?

BRADLEY KUSZMAUL: The Cilk compiler makes no assumptions about that. If you write a program and the back end-- Cilk works, and I'll talk about this in a couple of minutes, by transforming this into a C program; when you run it on one processor, it's just the original C program, in effect.
And so if you have a dialect of C that has restricted pointers and a compiler that--

PROFESSOR: You're making the assumption that if you make a mistake--

BRADLEY KUSZMAUL: If you make a mistake, the language doesn't stop you from making the mistake.

AUDIENCE: Well, but in C89 there's not a mistake. There's no assumption about aliasing, right? It could alias. So if I said--

BRADLEY KUSZMAUL: If because of the aliasing you write a program that has a race condition in it, which is erroneous--

AUDIENCE: It wouldn't be valid?

BRADLEY KUSZMAUL: No, it'd still be valid. It would just have a race in it, and you would have a non-determinate result.

PROFESSOR: It may not do what you want.

BRADLEY KUSZMAUL: It may not do what you want, but one of the legal executions of that parallel program is the original C program.

AUDIENCE: So there's no extra--

BRADLEY KUSZMAUL: At the level of doing analysis, Cilk doesn't do analysis. Cilk is a compiler that compiles this language, and the semantics are what they are -- and I'll talk about the semantics. The spawn means you can run the function in parallel, and if that doesn't give you the same answer every time, it's not the compiler's fault.

AUDIENCE: [OBSCURED]

BRADLEY KUSZMAUL: Pardon?

AUDIENCE: There has to be some guarantee [OBSCURED].

PROFESSOR: How in a race condition you get some [OBSCURED].

BRADLEY KUSZMAUL: One of the legal things the Cilk system could do is just run that program. Now, if you're running it on multiple processors, that's not what happens, because the other thing is there are some performance guarantees we get, so there's actually parallelism. But on one processor, in fact, that's exactly what the execution does.
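For concreteness, here is the Fibonacci example in MIT Cilk-5 syntax -- a reconstruction of the standard example rather than a verbatim copy of the slide. The keywords cilk, spawn, and sync are the ones explained next; deleting them gives back the serial C program:

    #include <stdio.h>
    #include <stdlib.h>

    cilk int fib(int n)
    {
        if (n < 2) {
            return n;
        } else {
            int x, y;
            x = spawn fib(n - 1);   /* child may run in parallel with the parent */
            y = spawn fib(n - 2);
            sync;                   /* wait for both spawned children to finish */
            return x + y;
        }
    }

    cilk int main(int argc, char *argv[])
    {
        int n = atoi(argv[1]);
        int result;
        result = spawn fib(n);
        sync;
        printf("fib(%d) = %d\n", n, result);
        return 0;
    }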
So Cilk does dynamic multithreading, and this is different from pthreads, for example, where you have a very heavyweight thread that costs tens of thousands of instructions to create. Cilk threads are really small, so in this program there's a Cilk thread that runs basically from when fib starts to here, and then--

I feel like there's a missing slide in here. I didn't tell you about spawn. OK, well, let me tell you about spawn. What the spawn means is that this function can run in parallel. That's very simple. What the sync means is that all the functions that were spawned off in this function have to finish before this function can proceed. So in a normal execution of C, when you call a function, the parent stops. In Cilk the parent can keep running: while that's running, the parent can spawn off this, and then the sync happens and now the parent has to stop. And this keyword, cilk, basically just says that this function can be spawned.

AUDIENCE: Is the sync in that scope or the children's scope?

BRADLEY KUSZMAUL: The sync is scoped within the function. So you could have a for loop that spawned off a whole bunch of stuff.

AUDIENCE: You could call the function instead of moving some spawns, but then [OBSCURED] in the sync.

BRADLEY KUSZMAUL: There's an implicit sync at the end of every function. So Cilk functions are strict.

PROFESSOR: [NOISE]

BRADLEY KUSZMAUL: You know, there are children down inside here, but this function can't return-- well, if I had omitted the sync, even down in some leaf the compiler puts one in before the function returns. There are some languages where somehow the intermediate function can go away and then you can sync directly with your grandparent.

AUDIENCE: Otherwise it would stop.
BRADLEY KUSZMAUL: So this gives you this dag: you have the part of the program that runs up to the first spawn, the part that runs between the spawns, the part that runs after the last spawn up to the sync, and then from there to the return.

I've got this drawing that shows this function sort of running. First the purple code runs and it gets to the spawn; it spawns off this guy, but now the second piece of code can start running. He does a spawn, so these two are running in parallel. Meanwhile, this guy has started. This is a base case, so he's not going to do anything. It just feels like there's something missing in this slide. Oh well. Essentially, going back to here, this part of the code couldn't run until after the sync, so this thing is sitting here waiting. So when these guys finally return, then this can run. This guy is stuck here. He runs, and he runs. These two return and the value comes up here, and now basically the function is done.

One observation here is that there's no mention of the number of processors in this code. You haven't specified how to schedule or how many processors there are. All you've specified is this directed acyclic graph that unfolds dynamically, and it's up to us to schedule it onto the processors. So this code is processor-oblivious: it's oblivious to the number of processors.

PROFESSOR: But because we're using the language, we probably have to create, write as many spawns depending on--

BRADLEY KUSZMAUL: No, what you do is you write as many spawns as you can. You expose all the parallelism in your code. You want this dag to have millions of threads in it concurrently, and then it's up to us to schedule that efficiently. So it's a different mindset: not "I have 4 processors, let me create 4 things to do," but "I have 4 processors, let me create a million things to do."
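As a hypothetical illustration of that mindset (this is not code from the lecture; the names walk and do_work are made up): a loop over N items expressed as recursive spawns, so the dag exposes all N-fold parallelism no matter how many processors you have.

    /* assumed serial leaf computation (placeholder) */
    void do_work(int i) { /* ... real work on item i ... */ }

    /* Recursively split the index range [lo, hi) and spawn both halves,
       instead of creating one task per processor. */
    cilk void walk(int lo, int hi)
    {
        if (hi - lo == 1) {
            do_work(lo);
        } else if (hi - lo > 1) {
            int mid = lo + (hi - lo) / 2;
            spawn walk(lo, mid);
            spawn walk(mid, hi);
            sync;
        }
    }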
And then the Cilk scheduler guarantees to give you-- you have 4 processors, I'll give you 4-fold speedup.

PROFESSOR: I guess what you'd like to avoid is the mindset where the programmer has to change or keep tuning the parameters for performance.

BRADLEY KUSZMAUL: There's some tuning that you do in order to make the leaf code efficient. There's some overhead for doing function calls -- it's a small overhead; it turns out the cost of a spawn is like three function calls. If you were actually trying to make this code run faster, you'd make the base case bigger, trying to speed things up a little bit at the leaves of this call tree. So there's this call tree, and inside the call tree is this dag.

It supports C's rule for pointers, for whatever dialect you have. If you have a pointer into the stack and then you call, you're allowed to use that pointer in C, and in Cilk you are as well. If you have a parallel thing going on -- where normally in C, A would call B, B returns, and then A calls C and D -- then C and D can refer to anything on A's frame, but C can't legally refer to something on B's, and the same rule applies in Cilk. We have a data structure that implements this; it's called a cactus stack, after the saguaro cactus -- that's the imagery -- and it lets you support that rule.

There are some advanced features in Cilk that have to do with speculative execution, and I'm going to skip over those today, because it turns out that 99% of the time you don't need that stuff.

We have some debugger support, so if you've written code that relied on some semantics that maybe you didn't like when you went to the parallel world, you'd like to find out. This is a tool that basically takes a Cilk program and an input data set, runs it, and tells you: is there any schedule that I could have chosen -- so it's that directed acyclic graph, and there's a whole bunch of possible schedules I could have chosen.
Is there any schedule that changes the order of two concurrent memory operations, where one of them is a write? We call this tool the Nondeterminator, because it finds all the determinacy races in your program. And the Cilk race detector is guaranteed to find those. There are a lot of race detectors where, if the race doesn't actually occur -- you have two things that are logically in parallel, but they don't actually run on different processors -- a lot of race detectors out there in the world won't report the race. So you get false negatives, and there are a bunch of false positives that show up. This basically only gives you the real ones.

AUDIENCE: Might that be an indicator that there might still be data races?

BRADLEY KUSZMAUL: So this doesn't analyze the program; it analyzes the execution. It's not trying to solve some NP-complete problem or Turing-complete problem. And so this reduces the problem of finding data races to a situation just like when you're trying to do code release and quality control for serial programs: you write tests. If you don't test your program, you don't know what it does, and that's the same property here. If you do find some race someday later, then you can write a test for it and know that you're testing to make sure that race didn't creep back into your code. That's what you want out of a software release strategy.

AUDIENCE: [NOISE]

BRADLEY KUSZMAUL: If you start putting in syncs, then maybe the race goes away because of that. But if you just put in instrumentation to try to figure out what's going on, it's still there. And the race detector sort of says: this variable in this function, and this variable in that function. You look at it and say, how could that happen? Finally you figure it out and you fix it, and then, if you're trying to do a software release, you build a regression test that will verify it with that input.

AUDIENCE: What if you have a situation where the spawn graph falls into a terminal--
--so it's not a race, but the spawn is there and it spawns a graph a little bit deeper?

BRADLEY KUSZMAUL: Yes. For example, our race detector understands locks. Part of the rule is that it doesn't report a race between two memory accesses if there was a lock that they both held in common. Now, you can still write buggy programs, because you can essentially lock, read the memory, unlock, then lock and write the memory; now the interleaving happens and there's a race. So the assumption of this race detector is that if you put locks in there, you've thought about it. This is for finding races that you forgot about, rather than races that you ostensibly thought about.

There are some races that are actually correct. For example, in the chess programs there's this big table that remembers all the chess positions that have been seen. If you don't get the right answer out of the table, it doesn't matter, because you search it again anyway. Not getting the right answer means you don't get any answer: you look something up, it's not there, so you search again. If you'd just waited a little longer, maybe somebody else would have put the value in and you could have saved a little work. In that case -- it turns out there's no parallel way to do that -- I'm willing to tolerate the race because it gives me performance. So you have what we call fake locks, which are basically things that look like lock calls but don't do anything, except tell the race detector: pretend there was a lock held in common. Yeah?

AUDIENCE: [UNINTELLIGIBLE PHRASE]

BRADLEY KUSZMAUL: If it says there's no race, it means that for every possible scheduling that--

AUDIENCE: [UNINTELLIGIBLE PHRASE]

BRADLEY KUSZMAUL: Well, you have that dag. Imagine running it on one processor. There are a lot of possible orders in which to run the dag.
And the rule is: was there a load and a store, or a store and a store, that switched orders in some possible schedule? That's the definition.

AUDIENCE: So in practice, sorry, one of the [INAUDIBLE] techniques is lock-free updates. Assuming, depending on the processor, that you have atomic writes, we want to deal with that data [UNINTELLIGIBLE] in the background--

BRADLEY KUSZMAUL: Those protocols are really hard to get right, but yes, it's an important trick.

AUDIENCE: Certainly [INAUDIBLE].

BRADLEY KUSZMAUL: So to convince the race detector not to complain, you put fake locks around it. You've programmed a sophisticated algorithm; it's up to you to get the details right.

The other property of this race detector is that it's fast: it runs in almost linear time. A lot of the race detectors that you find out there run in quadratic time -- if you want to run a million instructions, they have to compare every instruction to every other instruction. It turns out we don't have to do that. We run in time n times alpha of n, where alpha is the inverse Ackermann function. Anybody remember that from the union-find algorithm? It grows so slowly that it's essentially linear time. We actually now have a truly linear-time one that has performance advantages.

So let me do a little theory and practice. In Cilk we have some fundamental complexity measures that we worry about. We're interested in knowing, and being able to predict, the runtime of a Cilk program on P processors. So we want to know T sub P, the execution time on P processors; that's the goal. What we've got to work with is some directed acyclic graph for a particular input set -- if the program is deterministic and everything else, it's a well-defined graph -- and we can come up with some basic measures of this graph. T sub 1 is the work of the graph, which is the total time it would take to run that graph on one processor; or, if you assume these things all cost unit time, just the number of nodes.
So for this graph, what's the work? I heard "-teen" something-- 18? And the critical path is the longest path; if these nodes weren't unit time you'd have to weight them according to how much time they actually run. So what's the critical path here? 9. So I think those are right.

The lower bounds that you know, then: you don't expect the runtime on P processors to be faster than linear speedup. In this model that doesn't happen. In practice it turns out the cache does things -- you're adding more than just processors, you're adding more cache too, so all sorts of things can happen -- or maybe it means there's a better algorithm you should have used. So there are some funny things that happen if you have bad algorithms and so forth, but in this model you can't have more than linear speedup. You also can't get things done faster than the critical path. This model basically assumes that the costs of running these nodes are fixed, whereas with a real cache, changing the order of execution changes the actual costs of the nodes in the graph.

So those are the lower bounds, and the things that we want to know are speedups: that's T sub 1 over T sub P. And the parallelism of the graph is T sub 1 over T sub infinity, the work over the critical path. We've been calling this the span sometimes lately; some people call it depth. Span is easier to say than critical path, and depth has too many other meanings, so I kind of like span.

So what's the parallelism for this program? 18 over 9. We said T sub 1 was what? 18. And T sub infinity is 9. So if you had an infinite number of processors and you scheduled this as greedily as you could, it would take you 9 steps to run and you would be doing 18 things' worth of work. So on average there are two things to do: 1 plus 1 plus 1 plus 3 plus 4 plus 4 plus 1 plus 1 plus 1, divided by 9, turns out to be 2.
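In symbols, the measures and lower bounds just described, for this example dag:

    T_1 (work)                 = 18
    T_infinity (span)          = 9
    speedup on P processors    = T_1 / T_P
    parallelism                = T_1 / T_infinity = 18 / 9 = 2

    Lower bounds for any scheduler:  T_P >= T_1 / P   and   T_P >= T_infinity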
So the average parallelism, or just the parallelism, of the program is T sub 1 over T sub infinity. And this is a property that doesn't depend on the scheduler; it's a property of the program. It doesn't depend on how many processors you have.

AUDIENCE: [OBSCURED] You're saying, you're calling that the span now? Is that the one for us [OBSCURED]

BRADLEY KUSZMAUL: That's too long to say; I might as well say "critical path length." Critical path length, longest trace -- "span" is a mathematical-sounding name.

AUDIENCE: We just like to steal terminology.

BRADLEY KUSZMAUL: Well, yeah. So there's a theorem, due to Graham and Brent, that says there's some schedule that can actually achieve the sum of those two lower bounds. Linear speedup is one lower bound on the runtime and the critical path is the other, and there's some schedule that basically achieves the sum of those. How does that theorem work? Well, at each time step -- suppose we had 3 processors -- either there are at least 3 things ready to run, and in a greedy schedule you grab any 3 of them; or there are fewer than P things ready to run, like here, where these have all run and the green ones are the only 2 that are ready to go. What do you do then in a greedy schedule? You run them all.

And the argument goes: how many time steps could you execute 3 things? At most the work divided by the number of processors times, because after that you've used up all the work. And how many times could you execute fewer than P things? Well, every time you execute fewer than P things you're reducing the length of the remaining critical path, and you can't do that more than the span times. So a greedy scheduler will achieve some runtime which is within the sum of these two. It's actually the sum of these two minus 1; it turns out there has to be at least one node that counts against both the work and the critical path.
And so that means you're guaranteed to be within a factor of 2 of optimal with a greedy schedule. And it turns out that if you have a lot of parallelism compared to the number of processors -- say a graph that has million-fold parallelism and a thousand processors -- then the critical path is really small compared to the work, and with only 1000 processors the linear-speedup term is big. That means the bound is very close to the linear-speedup term alone, so essentially the corollary is that you get perfect linear speedup, asymptotically, if you have fewer processors than you have parallelism in your program. So the game here, at this level of understanding -- I haven't told you how the scheduler actually works -- is to write a program that's got a lot of parallelism, so that you can get linear speedup.

Well, the work-stealing scheduler is what we actually use. The problem is that greedy schedules can be hard to compute, especially if you imagine having a million processors and a program with billion-fold parallelism: on every clock cycle, finding something for each of the million guys to do is conceptually difficult. So instead we have a work-stealing scheduler; I'll talk about that in a second. It achieves bounds which are not quite as good as those. The bound is the same kind of thing -- it's the sum of two terms. One is the linear-speedup term, but instead of the other term being T sub infinity it's big O of T sub infinity, because you actually have to do communication sometimes if the critical path is long. Basically, you can imagine: if you have a lot of tasks and people to do them, it's easy to do that in parallel if there are no interdependencies among the tasks. But as soon as there are dependencies, you end up having to coordinate a lot, and that communication costs you -- there's lots of lore about adding programmers to a task and it slowing you down. Because basically communication gets you.
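Summarizing the scheduling bounds just described:

    Greedy scheduling (Graham and Brent):  T_P <= T_1 / P + T_infinity
        (within a factor of 2 of the optimal max(T_1 / P, T_infinity))
    Cilk's work-stealing scheduler:        T_P <= T_1 / P + O(T_infinity)
    Corollary: near-perfect linear speedup, T_P ~ T_1 / P,
        whenever P << T_1 / T_infinity (the parallelism).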
What we found empirically -- there's a theorem for this -- is that the runtime is actually still very close to the sum of those terms, or maybe those terms plus 2 times T sub infinity, something like that. And again, we basically get near-perfect speedup as long as the number of processors is a lot less than the parallelism; it should really be a "much less than."

The compiler has a mode where you can insert instrumentation, so you can run your program and it'll tell you the critical path length; you can compute these numbers. It's clear how to compute the work: you just sum up the runtimes of all the threads. To compute the critical path length you have to do some maxes and stuff as you go through the graph. And the average cost of a spawn these days is about 3 on something like a dual-core Pentium -- three times the cost of a function call. Most of that cost actually has to do with the memory barrier that we do at the spawn, because that machine doesn't have strong consistency, so you have to put this memory barrier in and that just empties all the pipelines. It does better on something like an SGI machine, which has strong -- well, traditional -- a MIPS machine that has strong consistency actually does better on the cost of that overhead.

Let me talk a little bit about chess. We had a bunch of chess programs. I wrote one in 1994 which placed third at the International Computer Chess Championship, and that was running on a big Connection Machine CM-5. I was one of the architects of that machine, so it was double fun. We wrote another program that placed second in '95, running on an 1,800-node Paragon, and that was a big computer back then. We built another program called Cilkchess, which placed first in '96 running on a relatively smaller machine. Then on a larger SGI Origin we ran some more, and at the World Computer Chess Championship in 1999 we beat Deep Blue and lost to a PC.
And people don't realize this, but at the time that Deep Blue beat Kasparov, it was not the World Computer Chess Champion; a PC was. So what? It's running a program. You know, there's this head and a tape; I don't know what it did. So this was a program called Fritz, which is a commercially available program. And those guys were very good -- the PC guys were very good on sort of the algorithm side. We got our advantage by brute force. We also had some real chess expertise on our team, but those guys were spending full time on things like pruning away sub-searches that they were convinced weren't going to pan out. Computer chess programs spend most of their time looking at situations that any person would look at and say, ah, Black's won, why are you even looking at this? And it keeps searching: well, maybe there's a way to get the queen. So computers are pretty dumb at that.

So basically those guys put a lot more chess intelligence in. And we also lost due to-- in this particular game, we were tied for first place and we decided to play a runoff game to find out who would win, and we lost due to a classic horizon effect. It turns out that we were searching to depth 12 in the tree and Fritz was searching to depth 11; even with all the heuristics and stuff they had in it, they were still not searching as deeply as we were. But there was a move that looked OK at depth 11, looked bad at depth 12, and at depth 13 looked really good again. So they saw the move and made it, for the wrong reason; we saw the move and didn't make it, for the right reason -- but it was wrong, and if we'd been able to search a little deeper, we would have seen that it was really the wrong thing to do. This happens all the time in chess; there's a little randomness in there. This horizon effect shows up, and again, it boils down to the fact that the programs are not intelligent.
A human would look at it and say, eventually that knight's going to fall. But if the computer can't see it within its search, you know?

We plotted the speedup of *Socrates, which was the first one, on this funny graph. It looks sort of like a typical linear-speedup graph: down here, with a small number of processors, you get good linear speedup, and eventually you stop getting linear speedup. That's in broad strokes what this graph looks like. But the axes are kind of funny. The axes aren't the number of processors and the speedup -- it's the number of processors divided by the parallelism of the program, and the speedup divided by the parallelism of the program. The reason we did that is that each of these data points is a different program, with different work and span.

If I'm trying to run one particular problem on a bunch of different numbers of processors, I can just draw that curve and see what happens as I get more processors -- at some point I'm not getting any advantage because I've got too many processors; I've exceeded the parallelism of the program. But if I'm trying to compare two different programs, how do I do that? Well, you can do that by normalizing by the parallelism. So down in this domain the number of processors is small compared to the average parallelism and we get good linear speedups; up in this domain the number of processors is large and it starts asymptoting to the point where the speedup approaches the parallelism, and that's sort of what happened. You get some noise out here, whereas down here it's nice and tight. That's because down there we're in the domain where the communication costs are infrequently paid, because there's lots of work to do and you don't have to communicate very much. Up here there's a lot of communication happening, and so the noise shows up more in the data. This curve here is the T sub 1 over P plus T sub infinity curve.
This is the T sub P equals T sub infinity curve, and that's the linear-speedup curve on this graph. So I think there's an important lesson in this graph besides the data itself, which is that if you're careful about choosing the axes, you can take a whole bunch of data that you couldn't see how to plot together, plot it together, and get something meaningful. In my Ph.D. thesis I had hundreds of little plots, one for each chess position, and I didn't figure out how-- they all look the same, right? But I didn't figure out that if I was careful I could actually make them be the same. That happened after I published my thesis: oh, we could just overlay them -- well, what's the normalization that makes that work?

So there's a speedup paradox that happened. Pardon?

AUDIENCE: [OBSCURED]

BRADLEY KUSZMAUL: Yeah, OK. There was a speedup paradox that happened while we were developing *Socrates. We were developing it for the 512-processor Connection Machine that was at the University of Illinois, but we only had a smaller machine on which to do our development. We had a 128-processor machine at MIT, and most days I could only get 32 processors because the machine was in heavy demand. So we had this program, and it ran on 32 processors in 65 seconds. And one of the developers said: here's a variation on the algorithm; it changes the dag; it's a heuristic; it makes the program run more efficiently. Look, it runs in only 40 seconds on 32 processors. So is that a good idea? It sure seemed like a good idea, but we were worried, because we knew that the transformation increased the critical path length of the program, so we weren't sure it was a good idea. So we did some calculation. We measured the work and the critical path. These numbers have been cooked a little bit to make the math easy -- this really did happen, but not with these exact numbers.
So we had a program whose work was 2048 seconds, with only 1 second of critical path. And the new program had only half as much work to do, but the critical path length was longer -- it was 8 seconds.

If you predict what the runtime is going to be on 32 processors, that formula says, well, 65 seconds. If you predict this one on 32 processors -- well, it's 40 seconds, and that looks good. But we were going to be running the tournament on 512 processors, where this term would start being less important than this term. So this really did happen, and we actually went back and validated that these numbers were right after we did the calculation, and it allowed us to do the engineering to make the right decision and not be misled by something that looked good in the test environment. We were able to predict what was going to happen on the big machine without actually having access to the big machine, and that was very important.

Let me do some algorithms. You guys have probably done some matrix multiplies over the past 3 weeks, right? That's probably the only thing you've been able to do, would be my guess. So matrix multiplication is this operation -- I won't talk about it, but you know what it is. In Cilk, instead of doing the standard triply nested loops, you do divide and conquer. We don't parallelize loops, we parallelize function calls, so you want to express loops as recursion. To multiply two big matrices you do a whole bunch of little matrix multiplications of the sub-blocks, and then those little matrix multiplications themselves go off and recursively do even smaller matrix multiplications. This requires 8 multiplications of matrices with half the number of rows and half the number of columns, and one addition at the end where you add two matrices together. That's the algorithm we use: it's the same total work as the standard one, it's just expressed recursively. So a matrix multiply is: you do these 8 multiplies. I had to create a temporary variable, so the first four multiply the A's and B's into C.
The second four multiply the A's and B's into T, and then I have to add T into C. So I do all those spawns, do all the multiplies, and I do a sync, because I'd better not start using the results of the multiplies and adding them until the multiplies are done.

AUDIENCE: Which four do you add?

BRADLEY KUSZMAUL: What? There's parallelism in the add -- matrix addition.

AUDIENCE: Yeah, but the add doesn't need a spawn, does it?

BRADLEY KUSZMAUL: Well, we spawn off the add. I don't understand--

[INTERPOSING VOICES]

BRADLEY KUSZMAUL: So you have to spawn Cilk functions even if you're only executing one of them at a time. Cilk functions are spawned, C functions are called. It's a decision that's built into the language. It's not really a fundamental decision; it's just the way we did it.

AUDIENCE: Why did you choose to have the keyword then? Is that just documentation on the caller side?

BRADLEY KUSZMAUL: Yeah, we found we were less likely to make a mistake if we built it into the type system in this way. But I'm not convinced that this is the best way to do the type system.

AUDIENCE: Can C functions spawn a Cilk function?

BRADLEY KUSZMAUL: No. You can only spawn, spawn, spawn, spawn, and then you can call C functions at the leaves. It turns out you actually can spawn Cilk functions from C if you're a little clever about it -- there's a mechanism where a Cilk system is running in the background, and from C you can say, OK, do this Cilk function in parallel. So we have that, but it's not didactic.

AUDIENCE: Sorry, I have a question about the spawning and syncing. Does the sync actually have to wait for the whole wave, or -- like, maybe not in the case of the add here, but in plenty of other practical functions, you can see the dependencies of a spawned function by looking at its parameters, right? Based on how those were built from previously spawned functions.
You could actually just start processing, as long as it's guaranteed that the results are available before you actually read them.

BRADLEY KUSZMAUL: So there's this other style of expressing parallelism, which you see in some of the data flow languages, where you say, well, I've computed this first multiply, why can't I get started on the corresponding part of the addition? And it turns out that in those models there are no performance guarantees. The real issue is that you run out of memory. It's a long topic, let's not go into it, but there's a serious technical issue with those programming models. We have very tight memory bounds as well, so we simultaneously get these good scheduling bounds and good memory bounds, whereas if you start things eagerly like that you can end up requiring a really large number of temporaries and run out of memory. The data flow machine used to have this number -- there was a student, Ken Traub, who was working on Monsoon when Greg Papadopoulos was here, and he came up with this term we called Traub's constant, which was how long the machine could be guaranteed to run before it crashed from being out of memory. And that was -- well, he took the amount of memory divided by the rate at which it got consumed, and that was it. And many data flow programs had that property: Monsoon could run for 40 seconds, and after that you never knew -- it might start crashing at any moment. So everybody wrote short data flow programs.

So one of the things you actually do when you're implementing this, when you're trying to engineer it to go fast, is you coarsen the base case, which I didn't describe up there. You don't just do a 1-by-1 matrix multiply down at the leaves of the recursion, because then you're not using the processor pipeline efficiently. You call the Intel Math Kernel Library or something on an 8-by-8 matrix, so the pipeline really gets a chance to chug away.

So, analysis. This matrix addition operation -- well, what's the work for matrix addition?
Well, the work to do a matrix addition on n-by-n matrices is: you have to do 4 additions of size n over 2, plus there's order-1 work for the spawns and the sync. And that recurrence has solution order n squared. Well, that's not surprising -- you have to add up two matrices that are n by n, and that's going to be n squared, so that's a good result. The critical path: well, you do all of these in parallel, so it's whatever the critical path of the longest one is, and they're all the same, so it's just the critical path at size n over 2 plus order 1, and that means the critical path is order log n.

For matrix multiplication -- sort of the reason I do this is because I can. This is a model in which I can do this analysis, so I have to do it. But really, being able to do this analysis is important when you're trying to make things run faster. For matrix multiplication, the work is: I have to do 8 little matrix multiplies plus I have to do the matrix add. That recurrence has solution order n cubed, and everybody knows there are order n cubed multiply-adds in a matrix multiply, so that's not very surprising. The critical path is -- well, I have to do an add, so that takes log n, plus I have to do a multiply on a matrix that's half the size. So the critical path length of the whole thing has solution order log squared n. The total parallelism of matrix multiplication is the work over the span, which is n cubed over log squared n. If you have a 1000-by-1000 matrix, that means your parallelism is close to 10 million. There's a lot of parallelism, and in fact we see perfect linear speedup on matrix multiply because there's so much parallelism in it.

It turns out that this stack temporary that I created, so that I could do these multiplies all in parallel, is actually costing me, because I'm on a machine that has cache and I want to use the cache effectively. I really don't want to create a whole big temporary matrix and blow my cache out if I can avoid it.
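The code being described isn't reproduced in the transcript, so here is a rough Cilk-5-style sketch of the divide-and-conquer multiply with a temporary, in the spirit of what was just analyzed. The helper names, the layout conventions (row-major, n a power of two, explicit leading dimensions), the BASE cutoff, and the use of malloc for the temporary are illustrative assumptions, not the code on the slide:

/* Rough sketch: C = A*B on n-by-n matrices by divide and conquer,
   using a temporary T for the second four products.  Error checking
   is omitted; this is not the slide's actual code. */
#include <stdlib.h>

#define BASE 16                 /* coarsened base case, per the talk   */

cilk void Add(double *C, int ldc, double *T, int ldt, int n)
{                               /* C += T on four quadrants in parallel */
    int i, j, h;
    if (n <= BASE) {
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++)
                C[i*ldc + j] += T[i*ldt + j];
        return;
    }
    h = n / 2;
    spawn Add(C,             ldc, T,             ldt, h);
    spawn Add(C + h,         ldc, T + h,         ldt, h);
    spawn Add(C + h*ldc,     ldc, T + h*ldt,     ldt, h);
    spawn Add(C + h*ldc + h, ldc, T + h*ldt + h, ldt, h);
    sync;                       /* work order n^2, span order log n     */
}

cilk void Mult(double *C, int ldc, double *A, int lda,
               double *B, int ldb, int n)
{                               /* C = A*B, overwriting C               */
    int i, j, k, h;
    double *T;
    if (n <= BASE) {            /* in a tuned version, call a serial
                                   kernel (e.g. MKL) here instead       */
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++) {
                double s = 0.0;
                for (k = 0; k < n; k++)
                    s += A[i*lda + k] * B[k*ldb + j];
                C[i*ldc + j] = s;
            }
        return;
    }
    h = n / 2;
    T = malloc(sizeof(double) * n * n);   /* the stack temporary        */
    /* first four products into C, the other four into T */
    spawn Mult(C,             ldc, A,             lda, B,             ldb, h);
    spawn Mult(C + h,         ldc, A,             lda, B + h,         ldb, h);
    spawn Mult(C + h*ldc,     ldc, A + h*lda,     lda, B,             ldb, h);
    spawn Mult(C + h*ldc + h, ldc, A + h*lda,     lda, B + h,         ldb, h);
    spawn Mult(T,             n,   A + h,         lda, B + h*ldb,     ldb, h);
    spawn Mult(T + h,         n,   A + h,         lda, B + h*ldb + h, ldb, h);
    spawn Mult(T + h*n,       n,   A + h*lda + h, lda, B + h*ldb,     ldb, h);
    spawn Mult(T + h*n + h,   n,   A + h*lda + h, lda, B + h*ldb + h, ldb, h);
    sync;                       /* all eight products must finish       */
    spawn Add(C, ldc, T, n, n); /* C += T */
    sync;
    free(T);
}

The recurrences quoted above fall straight out of this structure: Mult does 8 half-size multiplies plus one Add, giving work order n cubed and critical path order log squared n.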
So I proposed the following matrix multiply: I first do 4 of the matrix multiplies into C, then I do a sync, and then I do the other 4 into C and another sync. And I forgot to do the add -- oh no, those are multiply-adds, so they're multiplying and adding in. This saves space, because it doesn't need a temporary, but it increases the critical path. So is that a good idea or a bad idea? Well, we can answer part of that question with analysis. Saving space we know is going to save something; what does it do to the work and critical path? Well, the work is still the same, it's n cubed, because we didn't change the number of flops we're doing. But the critical path has grown. Instead of one matrix multiply's worth of critical path at each level, we have to do one, then sync, then do another one. So it's 2 matrix multiplies of half the size plus order 1, and that recurrence has solution order n instead of order log squared n. So that sounds bad -- we've made the critical path longer.

AUDIENCE: [OBSCURED]

BRADLEY KUSZMAUL: What? Yeah. So the parallelism is now order n squared instead of n cubed over log squared n, and for a 1000-by-1000 matrix that means you still have million-fold parallelism. So for relatively modest-sized matrices you still have plenty of parallelism to afford this optimization, so it's a good transformation to do. One of the advantages of Cilk is that you can do this kind of reasoning. You could say, let me do an optimization: I can do an optimization in my C code and I get to take advantage of it in the Cilk code. And I can do this kind of optimization, trading away parallelism -- if I have a lot of parallelism to spare, that's sometimes a good idea.

Ordinary matrix multiplication -- the loop version -- is just really bad. Basically you can imagine spawning off the n squared inner dot products here and computing them all in parallel. It has work n cubed and parallelism -- I mean, critical path -- log n, so the parallelism is even better: it's n cubed over log n instead of n squared.
That loop version looks better theoretically, but it's really bad in practice because it has such poor cache behavior. So we don't do that.

I'll just briefly talk about how the runtime works. Cilk does work-stealing. Each processor has a double-ended queue -- a deque. The bottom of the deque is the stack end, where you push and pop things, and the top is the end where somebody can take things off if they want to. So what's running is: all these processors are running, each on its own deque, and they're all running the ordinary serial code. That's the basic situation -- they're pretty much running the serial code most of the time. So some processor runs; it does a spawn, and what does that do? It pushes something onto its stack, because it's basically just a function call. And it does a couple more spawns, so more frames go on. Somebody returns, so he pops his stack. So far everything's going fine: they're not communicating, they're completely independent computations. Then this guy runs out of work. Now he has to do something. What he does is he goes and picks another processor at random and steals the thing at the other end of that processor's deque. He's unlikely to conflict, because that guy is pushing and popping down at the bottom, but there's a little protocol in there -- a non-blocking algorithm, actually, it's not a lock. So he goes and steals something and -- come on, slide over there. Whoa. Yes, that's animation, right? That's the extent of my animation. And then he starts working away.

And the theorem is that a work-stealing scheduler like this gives expected running time of T sub 1 over P plus T sub infinity on P processors -- with high probability, actually. And the pseudoproof is a little bit like the proof of Brent's theorem: at any step you're either working or stealing. If you're working, well, that gets charged against the T sub 1 over P term.
You can't do that very much, or you run out of work. If you're stealing, well, each steal has a chance of stealing the thing that's on the critical path. You may actually steal the wrong thing, but you have a 1-in-P chance of being the one who steals the thing that's on the critical path, in which case -- so each steal has a 1-over-P chance of reducing the critical path length by 1, so after about P times T infinity steals the critical path is all gone, and that's all the steals you can do. The high-probability version comes out of the same argument. And that gives you these bounds.

OK, I'm not going to give you all this stuff. Message passing sucks, you know. You guys know. There's probably nothing else in here.

So basically the pitch here is that you get high-level linguistic support for this very fine-grained parallelism. It's an algorithmic programming model, so that means you can do engineering for performance. There's fairly easy conversion of existing code, especially when you combine it with the race detector. You've got this factorization of the debugging problem: to debug your serial code you run it with all the Cilk stuff turned off -- you elide the keywords and make sure your program works. Then you run it with the race detector to make sure you get the same answer in parallel, and then you're done. Applications in Cilk don't just scale to large numbers of processors, they scale down to small numbers, which is important if you only have two processors, or one. You don't suddenly want to pay a factor of 10 to get off the ground, which happens sometimes on clusters running MPI -- you have to pay a big overhead before you've made any progress. And one of the advantages, for example, is that the number of processors might change dynamically. In this model that's OK, because it's not part of the program. So you may have the operating system reduce the number of actual worker threads doing that work-stealing, and that can work.
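Bradley's picture of the scheduler maps onto surprisingly little code. Below is only a toy sketch in plain C, assuming one mutex-protected deque per worker instead of the non-blocking protocol he mentions; the Task struct, the fixed-size buffers, and the start_workers helper are purely illustrative, and the real Cilk runtime also steals whole stack frames, implements sync, grows its deques, and terminates cleanly:

#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { void (*run)(void *); void *arg; } Task;

typedef struct {
    Task buf[1024];              /* fixed capacity keeps the sketch short     */
    long top, bottom;            /* thieves take from top, owner uses bottom  */
    pthread_mutex_t lock;
} Deque;

static Deque     *deques;        /* one deque per worker */
static pthread_t *threads;
static int        nworkers;

static void push_bottom(Deque *d, Task t)      /* owner: what a spawn does    */
{
    pthread_mutex_lock(&d->lock);
    d->buf[d->bottom++ % 1024] = t;
    pthread_mutex_unlock(&d->lock);
}

static int pop_bottom(Deque *d, Task *out)     /* owner: pop its own stack    */
{
    int ok = 0;
    pthread_mutex_lock(&d->lock);
    if (d->bottom > d->top) { *out = d->buf[--d->bottom % 1024]; ok = 1; }
    pthread_mutex_unlock(&d->lock);
    return ok;
}

static int steal_top(Deque *d, Task *out)      /* thief: take the oldest work */
{
    int ok = 0;
    pthread_mutex_lock(&d->lock);
    if (d->bottom > d->top) { *out = d->buf[d->top++ % 1024]; ok = 1; }
    pthread_mutex_unlock(&d->lock);
    return ok;
}

static void *worker(void *arg)
{
    int me = (int)(intptr_t)arg;
    unsigned seed = (unsigned)me + 1;
    for (;;) {                                 /* a real scheduler also terminates */
        Task t;
        if (pop_bottom(&deques[me], &t)) {
            t.run(t.arg);                      /* the common case: run serial code */
        } else {
            int victim = (int)(rand_r(&seed) % (unsigned)nworkers);
            if (victim != me && steal_top(&deques[victim], &t))
                t.run(t.arg);                  /* stolen work resumes over here    */
        }
    }
    return NULL;
}

void start_workers(int n)
{
    int i;
    nworkers = n;
    deques   = calloc((size_t)n, sizeof(Deque));
    threads  = calloc((size_t)n, sizeof(pthread_t));
    for (i = 0; i < n; i++) {
        pthread_mutex_init(&deques[i].lock, NULL);
        pthread_create(&threads[i], NULL, worker, (void *)(intptr_t)i);
    }
}

Even in the toy, the structural point survives: the owner pushes and pops at one end of its own deque, and a thief takes the oldest work from the other end of a randomly chosen victim, so the two ends rarely collide.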
One of the bad things about Cilk is that it doesn't really support data-parallel kinds of programming models. You really have to think of things in this divide-and-conquer view of the world. And some things are hard to express that way -- situations where you're doing a Jacobi update and you very carefully lay things out so each processor works on its local memory and they only have to communicate at the boundaries. That's difficult to do right in Cilk, because essentially every time you go around the loop of "I have all these things to do," the work-stealing happens randomly, and a given piece of work ends up on a different processor. So it's not very good at that sort of thing -- although it turns out the Jacobi update isn't a very good example, because there are more sophisticated algorithms that use the cache effectively, which you can express in Cilk, and I would have no idea how to say those in some of these data-parallel languages. Using the cache efficiently is really important on modern processors.

PROFESSOR: Thank you. Questions?

BRADLEY KUSZMAUL: You can download Cilk; there's a bunch of contributors -- those are the Cilkworms -- and you can download Cilk off our webpage. Just Google for Cilk and you'll find it. It's a great language, you'll love it. You'll love it much more than what you've been doing.

AUDIENCE: How does Cilk play with processor [OBSCURED]?

BRADLEY KUSZMAUL: Well, you have to have a language, a compiler, that can generate those -- if you have an assembly command, or you have some other compiler that can generate those. So I just won the HPC Challenge, which is this challenge where everybody tries to run parallel programs and argue that they get productivity. For that there were some codes like matrix multiply and LU decomposition with pivoting. Basically at the leaves of the computation I call the Intel Math Kernel Library, which in turn uses the SSE instructions.
You can do anything you can do in C in the C parts of the code, because the Cilk compiler just passes those through. So if you have some really efficient pipelined code for doing something, up to some point it makes sense to use that.

AUDIENCE: [OBSCURED]

BRADLEY KUSZMAUL: So I ran it on NASA's Columbia machine. The benchmark consists of -- well, there are 7 applications, 6 of which are actually well-defined. One of them is this thing that just measures network performance or something, so it doesn't have any real semantics. So there are 6 benchmarks. One of them is LU decomposition, one of them is DGEMM matrix multiplication, and there's FFT and 3 others. I implemented all 6; nobody else implemented all 6. It turns out you had to implement 3 in order to enter. Almost everybody implemented 3 or 4, but I did all 6, which is part of why I won. So I could argue that in a week's work I just implemented--

AUDIENCE: What is [OBSCURED]?

BRADLEY KUSZMAUL: So the prize has two components, performance and productivity -- or elegance, or something -- and it's completely whatever the judges want it to be. So it was up to me as a presenter to make the case that I was elegant, because I had my performance numbers, which were pretty good. And it turned out that the IBM entry for X10 did me more good than I did myself, I think, because they got up there and compared the performance of X10 to their Cilk implementation, and their X10 thing was almost as good as Cilk. So after that I think the judges said they had to give me the prize. Basically, it went down to Supercomputing, and each of us got 5 minutes to present, and there were 5 finalists. We did our presentations and then they gave out the -- so they divided the prize three ways: the people who got the absolute best performance, which were some people running UPC, and the people who had the most elegance, based on the minimal number of lines of code, and that was Cleve at -- what's his name? The MathWorks guy, the MATLAB guy.
Who said, look, matrix LU decomposition -- LU of P. It's very elegant, but I don't think it really explains what you have to do to solve the problem. So he won the prize for most elegant, and I got the prize for the best combination -- which they then changed: in the final citation for the prize they said most productivity. That was the prize. So I actually won the contest, because that's what the contest was supposed to be about, most productivity. But I only won a third of the prize money, because they divided it three ways.

PROFESSOR: Any other questions? Thank you.

BRADLEY KUSZMAUL: Thank you.

PROFESSOR: We'll take a 5-minute break, and since you had a guest lecturer I do have [OBSCURED]