1
00:00:00,030 --> 00:00:02,420
The following content is
provided under a Creative

2
00:00:02,420 --> 00:00:03,860
Commons license.

3
00:00:03,860 --> 00:00:06,860
Your support will help MIT
OpenCourseWare continue to

4
00:00:06,860 --> 00:00:10,540
offer high quality educational
resources for free.

5
00:00:10,540 --> 00:00:13,410
To make a donation or view
additional materials from

6
00:00:13,410 --> 00:00:17,460
hundreds of MIT courses, visit
MIT OpenCourseWare at

7
00:00:17,460 --> 00:00:18,710
ocw.mit.edu.

8
00:00:21,430 --> 00:00:23,530
PROFESSOR: OK.

9
00:00:23,530 --> 00:00:24,920
Let's get started.

10
00:00:24,920 --> 00:00:28,750
So what I'm going to do next
is switch gears to one

11
00:00:28,750 --> 00:00:30,540
interesting compiler, which
is the StreamIt

12
00:00:30,540 --> 00:00:31,790
parallelizing compiler.

13
00:00:33,850 --> 00:00:37,400
The main idea about StreamIt
is the need for a common

14
00:00:37,400 --> 00:00:39,080
machine language.

15
00:00:39,080 --> 00:00:43,670
What we want to do normally is,
in a language, you want to

16
00:00:43,670 --> 00:00:46,465
represent common architecture
properties so you get good

17
00:00:46,465 --> 00:00:46,790
performance.

18
00:00:46,790 --> 00:00:50,470
You don't want a very high
level of abstraction, where you

19
00:00:50,470 --> 00:00:53,300
waste a lot of cycles dealing
with the abstraction.

20
00:00:53,300 --> 00:00:55,960
But you want to abstract out
the differences between

21
00:00:55,960 --> 00:00:58,310
machines to get the portability,
otherwise you are

22
00:00:58,310 --> 00:00:58,950
going to just do

23
00:00:58,950 --> 00:01:01,690
assembly-hacking for one machine.

24
00:01:01,690 --> 00:01:05,010
Also you can't have things too
complex, because a typical

25
00:01:05,010 --> 00:01:07,840
programmer cannot deal with very
complex things that we

26
00:01:07,840 --> 00:01:10,960
ask them to do.

27
00:01:10,960 --> 00:01:15,510
C and Fortran were a really nice
common assembly language

28
00:01:15,510 --> 00:01:17,080
for imperative languages
running

29
00:01:17,080 --> 00:01:20,650
on the unicore machines.

30
00:01:20,650 --> 00:01:24,650
The problem is this type of
language is not a good common

31
00:01:24,650 --> 00:01:26,610
language for multicores, because
it doesn't deal with,

32
00:01:26,610 --> 00:01:29,350
first of all, multiple cores.

33
00:01:29,350 --> 00:01:32,620
And as you keep changing
the number of cores --

34
00:01:32,620 --> 00:01:34,970
for example, automatic
parallelizing compilers are

35
00:01:34,970 --> 00:01:38,480
not good enough to get really
good parallelism out of

36
00:01:38,480 --> 00:01:40,360
that, even though we
talk about that.

37
00:01:40,360 --> 00:01:43,300
Still a lot of work has
to be done in there.

38
00:01:43,300 --> 00:01:45,860
So what's the correct
abstraction if you have

39
00:01:45,860 --> 00:01:48,860
multicore machines?

40
00:01:48,860 --> 00:01:51,200
The current offering, what you
guys are doing, is things like

41
00:01:51,200 --> 00:01:54,470
OpenMP, MPI type stuff.

42
00:01:54,470 --> 00:01:56,850
You are hand-hacking
the parallelism.

43
00:01:56,850 --> 00:01:59,390
Well, there's an issue with that.

44
00:01:59,390 --> 00:02:01,450
It's basically this explicit
parallel construct.

45
00:02:01,450 --> 00:02:04,710
It's kind of added to languages
like C -- that's

46
00:02:04,710 --> 00:02:06,510
what you're working on.

47
00:02:06,510 --> 00:02:09,790
And what this does is all these
nice properties about

48
00:02:09,790 --> 00:02:13,260
composability, malleability,
debuggability, portability --

49
00:02:13,260 --> 00:02:15,660
all those things were kind
of out of the window.

50
00:02:15,660 --> 00:02:18,320
And this is why this
parallelizing is hard, because

51
00:02:18,320 --> 00:02:21,040
all these things make life
very difficult for the

52
00:02:21,040 --> 00:02:22,700
programmer.

53
00:02:22,700 --> 00:02:25,720
And it's a huge additional
programmer burden.

54
00:02:25,720 --> 00:02:27,850
The programmer has to introduce
parallelism,

55
00:02:27,850 --> 00:02:29,580
correctness, optimization --

56
00:02:29,580 --> 00:02:32,920
it's all left to
the programmer.

57
00:02:32,920 --> 00:02:35,400
So what the programmer has to do in
this kind of world -- what

58
00:02:35,400 --> 00:02:37,236
you are doing right now --

59
00:02:37,236 --> 00:02:40,300
you have to make all the
granularity decisions.

60
00:02:40,300 --> 00:02:42,880
If things are too small
you might get too much

61
00:02:42,880 --> 00:02:43,250
communication.

62
00:02:43,250 --> 00:02:46,710
If things are too large you
might not get good load

63
00:02:46,710 --> 00:02:48,340
balancing, and stuff like that.

64
00:02:48,340 --> 00:02:50,680
And then you deal with all the
load balancing decisions.

65
00:02:50,680 --> 00:02:54,570
All those decisions are
left for you guys.

66
00:02:54,570 --> 00:02:56,610
You need to figure out what's
local, what's not.

67
00:02:56,610 --> 00:03:00,530
And if you make a wrong decision
it can cost you.

68
00:03:00,530 --> 00:03:04,460
All the synchronization, and
all the pain and suffering

69
00:03:04,460 --> 00:03:06,230
that comes from making
a wrong decision.

70
00:03:06,230 --> 00:03:10,270
Things like race conditions,
deadlocks,

71
00:03:10,270 --> 00:03:12,620
and stuff like that.

72
00:03:12,620 --> 00:03:16,430
And this, while MIT students can
hack it, you can't go and

73
00:03:16,430 --> 00:03:21,030
convince a dull programmer to
convert from writing nice

74
00:03:21,030 --> 00:03:24,230
simple Java application code
to dealing with all these

75
00:03:24,230 --> 00:03:25,810
complexities.

76
00:03:25,810 --> 00:03:29,210
So this is what kind of led to
our research for the last five

77
00:03:29,210 --> 00:03:32,070
years to do StreamIt.

78
00:03:32,070 --> 00:03:34,900
What you want to do is move a
bunch of these decisions to

79
00:03:34,900 --> 00:03:35,980
the compiler.

80
00:03:35,980 --> 00:03:38,950
Granularity, load balancing,
locality, and

81
00:03:38,950 --> 00:03:39,516
synchronization --

82
00:03:39,516 --> 00:03:40,650
[OBSCURED]

83
00:03:40,650 --> 00:03:43,250
And in today's talk I am going
to talk to you about, after

84
00:03:43,250 --> 00:03:46,060
you write a StreamIt program
-- as Bill pointed out, the

85
00:03:46,060 --> 00:03:48,910
nice parallel properties -- how
do you actually go about

86
00:03:48,910 --> 00:03:50,670
getting this kind
of parallelism.

87
00:03:54,130 --> 00:03:59,210
So in StreamIt, in summary, it
basically has regular and

88
00:03:59,210 --> 00:04:01,630
repeating computation
in these filters.

89
00:04:01,630 --> 00:04:04,290
This is called a synchronous
data flow model, because we

90
00:04:04,290 --> 00:04:07,990
know at compile time exactly how
the data moves, how much

91
00:04:07,990 --> 00:04:10,890
each produces and consumes.

92
00:04:10,890 --> 00:04:15,110
And this has natural
parallelism, and it exposes

93
00:04:15,110 --> 00:04:19,010
exactly what's going to happen
to the compiler.

94
00:04:19,010 --> 00:04:21,970
And the compiler can do a lot
of powerful transformations,

95
00:04:21,970 --> 00:04:24,720
as yesterday I pointed out.

96
00:04:24,720 --> 00:04:28,320
The first thing is, because of
synchronous data flow, we know

97
00:04:28,320 --> 00:04:32,530
at compile time exactly who
needs to do what when.

98
00:04:32,530 --> 00:04:34,580
And that really helps
transform.

99
00:04:34,580 --> 00:04:37,970
It's not like everything happens
run-time dynamically.

100
00:04:37,970 --> 00:04:39,970
So what does that mean?

101
00:04:39,970 --> 00:04:43,780
So what that means is each
filter knows exactly how much

102
00:04:43,780 --> 00:04:45,960
to push and pop --

103
00:04:45,960 --> 00:04:48,040
that's in a repeatable
execution.

104
00:04:48,040 --> 00:04:50,330
And so what we can do is, we
can come up with a static

105
00:04:50,330 --> 00:04:53,710
schedule that can be repeated
multiple times.

106
00:04:53,710 --> 00:04:55,900
So let me tell you a little
bit about what a static

107
00:04:55,900 --> 00:04:56,540
schedule means.

108
00:04:56,540 --> 00:04:59,350
So assume this filter pushes
two, this filter pops three

109
00:04:59,350 --> 00:05:01,960
but pushes one, that
filter pops two.

110
00:05:01,960 --> 00:05:04,180
So these are kind of rate mismatches, it's not everybody
pushes, it's not everybody

111
00:05:04,180 --> 00:05:05,670
producing-consuming at once.

112
00:05:05,670 --> 00:05:07,000
So what's the schedule?

113
00:05:07,000 --> 00:05:07,580
So you can say --

114
00:05:07,580 --> 00:05:12,130
OK, at the beginning it produces
two items, but I

115
00:05:12,130 --> 00:05:15,650
can't consume that because I
need three items. And then I

116
00:05:15,650 --> 00:05:18,910
do two of them, and I can
consume the first three and

117
00:05:18,910 --> 00:05:20,240
then produce one there.

118
00:05:20,240 --> 00:05:21,350
And I have two left behind.

119
00:05:21,350 --> 00:05:24,810
I do one more of that, and
now I got three.

120
00:05:24,810 --> 00:05:27,610
And it consumes that
and produces that.

121
00:05:27,610 --> 00:05:31,450
And then I can fire C. So the
neat thing about this is, when

122
00:05:31,450 --> 00:05:35,290
I started there was nothing
inside any of these buffers.

123
00:05:35,290 --> 00:05:39,710
And if I ran A, A, B, A, B, C,
there's nothing inside the

124
00:05:39,710 --> 00:05:41,440
buffers again.

125
00:05:41,440 --> 00:05:45,690
So what I have is, I'm back to
the starting position.

126
00:05:45,690 --> 00:05:48,890
And if I repeat this millions
of times --

127
00:05:48,890 --> 00:05:51,390
I keep the computation running
nicely without any buffers

128
00:05:51,390 --> 00:05:53,340
accumulating or anything
like that.

129
00:05:53,340 --> 00:05:56,650
So I can come up with this very
nice schedule that says

130
00:05:56,650 --> 00:05:58,390
-- here's what I have to do.

131
00:05:58,390 --> 00:06:02,060
I have to run A actually three
times, B twice, and C once, in

132
00:06:02,060 --> 00:06:03,970
this order, and if I
do that I have a

133
00:06:03,970 --> 00:06:05,730
computation that keeps running.
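
The A, A, B, A, B, C example can be checked mechanically. Below is a minimal Python sketch, assuming a straight pipeline where each filter is described only by how many items it pops and pushes per firing. The names `PIPELINE`, `repetitions`, and `schedule` are made up for illustration and are not part of StreamIt, and the firing rule (always fire the most downstream filter that has enough input) is just one simple choice.

```python
from fractions import Fraction
from math import lcm

# Rates from the lecture's example: A pushes 2; B pops 3 and pushes 1; C pops 2.
# Each entry is (name, pop, push). The tuple layout is an illustrative assumption.
PIPELINE = [("A", 0, 2), ("B", 3, 1), ("C", 2, 0)]


def repetitions(pipeline):
    """Solve the balance equations push[i] * reps[i] == pop[i+1] * reps[i+1]."""
    ratios = [Fraction(1)]
    for (_, _, push), (_, pop, _) in zip(pipeline, pipeline[1:]):
        ratios.append(ratios[-1] * push / pop)
    # Scale the rational firing ratios up to the smallest whole numbers.
    scale = lcm(*(r.denominator for r in ratios))
    return [int(r * scale) for r in ratios]


def schedule(pipeline, reps):
    """Fire the most downstream filter that has enough input until reps are used up."""
    buffers = [0] * (len(pipeline) - 1)  # items waiting on each channel
    remaining = list(reps)
    order = []
    while any(remaining):
        for i in reversed(range(len(pipeline))):
            name, pop, push = pipeline[i]
            has_input = i == 0 or buffers[i - 1] >= pop
            if remaining[i] and has_input:
                if i > 0:
                    buffers[i - 1] -= pop
                if i < len(buffers):
                    buffers[i] += push
                remaining[i] -= 1
                order.append(name)
                break
        else:
            raise RuntimeError("no filter can fire; the rates are inconsistent")
    return order, buffers


reps = repetitions(PIPELINE)
order, buffers = schedule(PIPELINE, reps)
print(reps, "".join(order), buffers)
```

With the lecture's rates this gives three firings of A, two of B, and one of C in the order A, A, B, A, B, C, and both channels are empty again after one pass, which is why the schedule can be repeated indefinitely.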

134
00:06:05,730 --> 00:06:08,940
And that gives me a good global
view on what can I

135
00:06:08,940 --> 00:06:11,230
parallelize, what can I load
balance, all those things.

136
00:06:11,230 --> 00:06:13,660
Because things don't change.

137
00:06:13,660 --> 00:06:17,670
One more additional thing about
StreamIt is we can look

138
00:06:17,670 --> 00:06:20,290
at more elements than
I am consuming.

139
00:06:20,290 --> 00:06:21,100
Question?

140
00:06:21,100 --> 00:06:25,500
AUDIENCE: How common is it in
your typical code that you can

141
00:06:25,500 --> 00:06:28,400
actually produce a static
schedule like that?

142
00:06:28,400 --> 00:06:33,880
PROFESSOR: In a lot of DSP
code this is very common.

143
00:06:33,880 --> 00:06:38,670
A lot of DSP code right now that
goes into hardware and

144
00:06:38,670 --> 00:06:41,230
software, they have very
common properties.

145
00:06:41,230 --> 00:06:44,390
But even where it's not as
common, a very large

146
00:06:44,390 --> 00:06:48,760
chunk of the program has this
static property, and there are

147
00:06:48,760 --> 00:06:51,310
a few places that have
dynamic properties.

148
00:06:51,310 --> 00:06:54,850
So it's like, when you write a
normal program you don't write

149
00:06:54,850 --> 00:06:57,600
a branch instruction after
every instruction.

150
00:06:57,600 --> 00:06:59,690
You have a few hundred
instructions and a branch, a

151
00:06:59,690 --> 00:07:02,200
few tens of instructions and
a branch type thing.

152
00:07:02,200 --> 00:07:05,840
So what you can think about it
is that those instructions

153
00:07:05,840 --> 00:07:08,460
without a branch can get
optimized the hell out of

154
00:07:08,460 --> 00:07:10,980
them, and then you do a
branch dynamically.

155
00:07:10,980 --> 00:07:12,660
So you can think about
it like this.

156
00:07:12,660 --> 00:07:15,560
What's the largest chunks you
can find that you don't have

157
00:07:15,560 --> 00:07:17,480
this uncertainty
until run-time?

158
00:07:17,480 --> 00:07:19,930
Then you can optimize the hell
out of it, and then you can

159
00:07:19,930 --> 00:07:23,330
deal with these run-time
issues --

160
00:07:23,330 --> 00:07:27,110
basically branches, or control
flow changes, or dynamic

161
00:07:27,110 --> 00:07:31,410
rate changes at run-time.

162
00:07:31,410 --> 00:07:34,240
If we have a 10-90 rule, if
90% of the things are in a

163
00:07:34,240 --> 00:07:36,640
nice thing, and if you get good
performance on that --

164
00:07:36,640 --> 00:07:38,880
hey, it has a big impact.

165
00:07:38,880 --> 00:07:41,850
So in our language you can deal
with dynamism, but our

166
00:07:41,850 --> 00:07:43,590
analysis is basically trying
to find the largest static

167
00:07:43,590 --> 00:07:45,350
chunk and analyze.

168
00:07:45,350 --> 00:07:48,710
So most of the time that
basically said we start with

169
00:07:48,710 --> 00:07:50,150
empty and end with empty.

170
00:07:50,150 --> 00:07:52,630
But the trouble is, a lot of
times we actually can look

171
00:07:52,630 --> 00:07:54,630
beyond the number of
items we consume.

172
00:07:54,630 --> 00:07:58,010
So what you have to do is kind
of do an initial schedule where

173
00:07:58,010 --> 00:08:03,260
you don't start with empty,
you basically consume

174
00:08:03,260 --> 00:08:05,230
something -- you start with
something like this.

175
00:08:05,230 --> 00:08:07,860
So the next time
something comes in --

176
00:08:07,860 --> 00:08:11,190
three things come
into this one --

177
00:08:11,190 --> 00:08:13,360
I can actually peek four
and pop three.

178
00:08:13,360 --> 00:08:16,340
So you go through the first
thing kind of priming

179
00:08:16,340 --> 00:08:19,270
everything with the amount of
data needed, and then you go

180
00:08:19,270 --> 00:08:22,190
to the static schedule.
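
The priming step can be sketched in a few lines of Python. This is an illustration only, assuming a single filter that peeks at four items and pops three per firing; the windowed sum stands in for whatever the real filter computes, and `run_peeking_filter` is a hypothetical name.

```python
from collections import deque
from itertools import islice


def run_peeking_filter(stream, peek=4, pop=3):
    """Each firing looks at `peek` items but only discards `pop` of them."""
    source = iter(stream)
    buffer = deque()

    # Initialization schedule: leave (peek - pop) items sitting on the channel.
    buffer.extend(islice(source, peek - pop))
    primed = len(buffer)

    outputs = []
    while True:
        batch = list(islice(source, pop))  # the upstream filter delivers `pop` new items
        if len(batch) < pop:
            break
        buffer.extend(batch)
        outputs.append(sum(islice(buffer, peek)))  # peek at four items
        for _ in range(pop):                       # pop three of them
            buffer.popleft()
        # Steady state: the channel is back to its primed level, not to empty.
        assert len(buffer) == primed
    return outputs, primed


outputs, primed = run_peeking_filter(range(1, 11))
print(outputs, primed)
```

After the one-time priming, every steady-state firing starts and ends with the same number of leftover items on the channel, so the steady-state schedule can still be repeated unchanged.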

181
00:08:22,190 --> 00:08:25,860
This kind of gives you a feel
for what you'll get.

182
00:08:25,860 --> 00:08:27,430
This is a neat thing,
I know exactly

183
00:08:27,430 --> 00:08:28,930
what's going on in here.

184
00:08:28,930 --> 00:08:30,730
So now how do I run
this in parallel?

185
00:08:30,730 --> 00:08:35,560
This is something actually
Rodric pointed out before,

186
00:08:35,560 --> 00:08:38,220
there are three types of
parallelism we can deal with.

187
00:08:38,220 --> 00:08:42,140
So here's my stream program in
here, and I do some filters,

188
00:08:42,140 --> 00:08:44,040
scatter-gather in here.

189
00:08:44,040 --> 00:08:46,710
The first type of parallelism
is task parallelism.

190
00:08:46,710 --> 00:08:49,280
What that means is the
programmer said, there are

191
00:08:49,280 --> 00:08:51,100
three things that can
run in parallel

192
00:08:51,100 --> 00:08:53,250
before I join them together.

193
00:08:53,250 --> 00:08:54,890
So this is a
programmer-specified

194
00:08:54,890 --> 00:08:56,720
parallelism.

195
00:08:56,720 --> 00:09:01,010
And you have a nice data
parallel messenger

196
00:09:01,010 --> 00:09:03,190
presentation.

197
00:09:03,190 --> 00:09:06,140
The second part is
data parallelism.

198
00:09:06,140 --> 00:09:09,680
What that means is, you have
some of these things that

199
00:09:09,680 --> 00:09:13,180
don't depend on the previous
run of that one.

200
00:09:13,180 --> 00:09:17,570
So there's no
dependency across multiple

201
00:09:17,570 --> 00:09:18,340
invocations.

202
00:09:18,340 --> 00:09:20,950
These are called stateless
filters, there's no state that

203
00:09:20,950 --> 00:09:21,920
keeps changing.

204
00:09:21,920 --> 00:09:23,320
If the state kept changing,
you had to wait till the

205
00:09:23,320 --> 00:09:25,860
previous one finishes
to run the next one.

206
00:09:25,860 --> 00:09:27,310
So if you have a stateless
filter --

207
00:09:27,310 --> 00:09:29,840
assume that it's data
parallel --

208
00:09:29,840 --> 00:09:35,270
what you can do is you can
basically take that, replicate

209
00:09:35,270 --> 00:09:38,240
it many, many times, and
when the data comes --

210
00:09:38,240 --> 00:09:39,040
parallel is in it
every data --

211
00:09:39,040 --> 00:09:43,210
and it will compute and
come out in parallel here.

212
00:09:43,210 --> 00:09:46,020
The final thing is pipeline
parallelism.

213
00:09:46,020 --> 00:09:48,440
So you can feed this one into
this one, this one into this

214
00:09:48,440 --> 00:09:51,230
one, and the data goes across
in a pipeline fashion.

215
00:09:51,230 --> 00:09:53,380
And you can get multiple
things executing.

216
00:09:53,380 --> 00:09:55,660
So we have these three types
of parallelism in here, and

217
00:09:55,660 --> 00:09:59,390
the interesting thing is if you
have stateful filters, you

218
00:09:59,390 --> 00:10:01,170
can't run this data parallel.

219
00:10:01,170 --> 00:10:02,900
Actually the only parallelism
you can get is pipeline

220
00:10:02,900 --> 00:10:05,540
parallelism.
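
A small Python sketch can show why state blocks data parallelism. It assumes a round-robin splitter and joiner around the replicas and runs the replicas one after another rather than on real cores; `scale`, `RunningSum`, and `run_replicated` are hypothetical names chosen for the example.

```python
DATA = [1, 2, 3, 4, 5, 6]


def scale(x):
    """Stateless filter: the output depends only on the current item."""
    return 3 * x


class RunningSum:
    """Stateful filter: each output depends on every earlier invocation."""

    def __init__(self):
        self.total = 0

    def __call__(self, x):
        self.total += x
        return self.total


def run_replicated(make_filter, items, copies):
    """Send item i to replica i % copies and join the results in input order."""
    replicas = [make_filter() for _ in range(copies)]
    return [replicas[i % copies](x) for i, x in enumerate(items)]


# Replicating the stateless filter does not change the answer.
print(run_replicated(lambda: scale, DATA, 1) == run_replicated(lambda: scale, DATA, 4))

# Replicating the stateful filter does: each replica only sees part of the stream.
print(run_replicated(RunningSum, DATA, 1))
print(run_replicated(RunningSum, DATA, 2))
```

The stateless filter gives the same output with one copy or four, while the running sum gives different answers once it is split, so a stateful filter has to stay in a single copy and can only overlap with its neighbors in a pipeline.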

221
00:10:05,540 --> 00:10:08,170
So traditionally task
parallelism is fork/join

222
00:10:08,170 --> 00:10:10,800
parallelism, that you guys
are doing right now.

223
00:10:10,800 --> 00:10:13,800
Data parallelism is
loop parallelism.

224
00:10:13,800 --> 00:10:16,760
And pipeline parallelism mainly
was done in hardware.

225
00:10:16,760 --> 00:10:19,520
If you have done something like
Verilog or VHDL you'll do

226
00:10:19,520 --> 00:10:21,070
a lot of pipeline parallelism.

227
00:10:21,070 --> 00:10:23,600
So kind of combining these three
ideas from different

228
00:10:23,600 --> 00:10:26,620
communities all into one,
because I think programs can

229
00:10:26,620 --> 00:10:30,040
have each part in there.

230
00:10:30,040 --> 00:10:32,200
So now, how do you go
and exploit this?

231
00:10:32,200 --> 00:10:35,480
How do you go take advantage
of that?

232
00:10:35,480 --> 00:10:38,660
So I'll talk a little bit about
baseline techniques, and then

233
00:10:38,660 --> 00:10:40,890
talk about what the StreamIt
compiler does today.

234
00:10:40,890 --> 00:10:43,660
So assume I have a program
like this.

235
00:10:43,660 --> 00:10:46,110
The hardest thing is there
are two tasks in here.

236
00:10:46,110 --> 00:10:47,940
The programs are given, you
don't have to worry

237
00:10:47,940 --> 00:10:48,500
about that.

238
00:10:48,500 --> 00:10:52,190
And what you can do is assign
them into different

239
00:10:52,190 --> 00:10:53,320
cores and run it.

240
00:10:53,320 --> 00:10:54,570
Neat.

241
00:10:57,070 --> 00:10:59,030
You can think of it as fork/join
parallelism: you come

242
00:10:59,030 --> 00:11:02,600
here you fork, you do this
thing, and you join in here.

243
00:11:02,600 --> 00:11:05,160
So the interesting thing is
if you have two cores,

244
00:11:05,160 --> 00:11:07,090
you probably get a 2x
speedup in this one.

245
00:11:07,090 --> 00:11:08,740
This is really neat because
there are two things in here.

246
00:11:08,740 --> 00:11:11,210
The problem is, what if you
have a different

247
00:11:11,210 --> 00:11:13,600
number of cores?

248
00:11:13,600 --> 00:11:16,280
Or if the next generation has
double the number of cores,

249
00:11:16,280 --> 00:11:19,075
and I'm stuck with the program
you've written for the current

250
00:11:19,075 --> 00:11:19,340
generation?

251
00:11:19,340 --> 00:11:23,130
So this is not that
interesting.

252
00:11:23,130 --> 00:11:28,290
So we ran it on the Raw
processor we have -- it has 16

253
00:11:28,290 --> 00:11:30,050
cores in there --

254
00:11:30,050 --> 00:11:33,080
that we have been building, and
this is actually running a

255
00:11:33,080 --> 00:11:34,140
simulator of that.

256
00:11:34,140 --> 00:11:38,530
What you find is, for a bunch of
StreamIt programs we have

257
00:11:38,530 --> 00:11:42,590
we get a speedup of
basically close to two,

258
00:11:42,590 --> 00:11:44,100
because that's the kind of

259
00:11:44,100 --> 00:11:45,450
parallelism people have written.

260
00:11:45,450 --> 00:11:49,760
In fact, some programs even
slowed down in there, because

261
00:11:49,760 --> 00:11:54,600
what happens in here is
the parallelism and

262
00:11:54,600 --> 00:11:57,340
synchronization is not matched
with the target --

263
00:11:57,340 --> 00:11:58,550
because it's matched
with the program.

264
00:11:58,550 --> 00:12:00,780
Because you wrote the program
so that the parallelism in

265
00:12:00,780 --> 00:12:04,430
there was what you thought was
right for the algorithm.

266
00:12:04,430 --> 00:12:07,370
We didn't want you to give any
consideration to the machine

267
00:12:07,370 --> 00:12:09,490
you are running on, and it didn't
match the machine, basically,

268
00:12:09,490 --> 00:12:10,620
if you just got the
parallelism.

269
00:12:10,620 --> 00:12:13,280
And you just don't
do that right.

270
00:12:13,280 --> 00:12:17,280
So one thing we have noticed
for a lot of streaming

271
00:12:17,280 --> 00:12:20,210
programs, to answer your
question, is there is a lot

272
00:12:20,210 --> 00:12:22,010
of data parallelism.

273
00:12:22,010 --> 00:12:24,000
In fact, in this filter --

274
00:12:24,000 --> 00:12:26,440
in this program --

275
00:12:26,440 --> 00:12:29,030
what you can do is you can find
data parallel filters,

276
00:12:29,030 --> 00:12:30,660
and parallelize them.

277
00:12:30,660 --> 00:12:32,750
So you can take each
filter, run it on

278
00:12:32,750 --> 00:12:34,370
every core for a while.

279
00:12:34,370 --> 00:12:35,850
Get the data back.

280
00:12:35,850 --> 00:12:37,070
Go to the next filter,
run it on every

281
00:12:37,070 --> 00:12:38,590
core for a while, get that back.

282
00:12:38,590 --> 00:12:42,285
So what you can do is, if you
have four cores in here, you

283
00:12:42,285 --> 00:12:44,950
can replicate each of
these four times.

284
00:12:44,950 --> 00:12:47,410
Run these four for a while,
and then these four, these

285
00:12:47,410 --> 00:12:49,300
four, these four, these four.

286
00:12:49,300 --> 00:12:50,360
OK?

287
00:12:50,360 --> 00:12:51,360
So that's the nice
way to do that.

288
00:12:51,360 --> 00:12:53,390
So the nice thing about doing
that is you get really nice

289
00:12:53,390 --> 00:12:55,480
load balancing, because
each is doing the same amount

290
00:12:55,480 --> 00:12:56,740
of work for a while.

291
00:12:56,740 --> 00:12:59,410
And after it accumulates enough
data you go to the next

292
00:12:59,410 --> 00:13:04,870
one, do that for a while,
and so on.

293
00:13:04,870 --> 00:13:07,220
And each group basically will
occupy the entire machine --

294
00:13:07,220 --> 00:13:09,640
you just go down this
group like that.

295
00:13:09,640 --> 00:13:13,660
And so we ran it, and it was
even slower.

296
00:13:13,660 --> 00:13:14,910
Why?

297
00:13:17,720 --> 00:13:19,470
It should have a lot more
parallelism, because all those

298
00:13:19,470 --> 00:13:20,490
filters were data-parallel.

299
00:13:20,490 --> 00:13:23,280
So instead of getting stuck
with two, now we can easily

300
00:13:23,280 --> 00:13:26,135
run a parallelism of 16, because
with data parallelism you

301
00:13:26,135 --> 00:13:28,150
can just put in any
amount in there.

302
00:13:28,150 --> 00:13:30,700
But we are running slow.

303
00:13:30,700 --> 00:13:32,240
AUDIENCE: Communication
overhead?

304
00:13:32,240 --> 00:13:34,850
PROFESSOR: Yeah, it could
mainly be communication

305
00:13:34,850 --> 00:13:37,780
overhead, because what happens
is you run this for a small

306
00:13:37,780 --> 00:13:38,320
amount of time.

307
00:13:38,320 --> 00:13:41,120
You had to send it all over
the place, collect it back

308
00:13:41,120 --> 00:13:44,740
again, send it all over the
place, collect it back again.

309
00:13:44,740 --> 00:13:46,420
The problem is there's too
much synchronization and

310
00:13:46,420 --> 00:13:47,810
communication.

311
00:13:47,810 --> 00:13:50,400
Because every step at the
end is like this global

312
00:13:50,400 --> 00:13:54,200
barrier, and the data has
to go shuffling around.

313
00:13:54,200 --> 00:13:57,220
And that doesn't help.

314
00:13:57,220 --> 00:14:00,940
So the other part, what you can
do in the baseline is what

315
00:14:00,940 --> 00:14:03,280
we call hardware pipelining.

316
00:14:03,280 --> 00:14:05,125
What that means is you can
actually do pipeline

317
00:14:05,125 --> 00:14:05,560
parallelism.

318
00:14:05,560 --> 00:14:12,490
The way you can do that is you
can look at the amount of work

319
00:14:12,490 --> 00:14:16,490
each filter contains, and you
can combine them together in a

320
00:14:16,490 --> 00:14:20,890
way that the number of filters
is going to be just about the

321
00:14:20,890 --> 00:14:22,140
number of tiles available.

322
00:14:24,380 --> 00:14:25,760
Most programs have
more filters than

323
00:14:25,760 --> 00:14:26,960
the number of cores.

324
00:14:26,960 --> 00:14:29,250
So you combine filters
so that the number

325
00:14:29,250 --> 00:14:32,630
of filters is either the
same as, or one or two less than

326
00:14:32,630 --> 00:14:35,090
the number of cores available.

327
00:14:35,090 --> 00:14:38,840
In a way that you combine them
so each of them will probably

328
00:14:38,840 --> 00:14:41,360
have close to the same
amount of work.

329
00:14:41,360 --> 00:14:44,180
The problem is, when you
combine it's very hard to get

330
00:14:44,180 --> 00:14:46,860
the same amount of work.

331
00:14:46,860 --> 00:14:49,970
And if you assume eight cores,
you can do this combination

332
00:14:49,970 --> 00:14:50,390
and we can say --

333
00:14:50,390 --> 00:14:53,420
aha, if I do this combination,
I have one, two, three, four,

334
00:14:53,420 --> 00:14:55,460
five, six, seven.

335
00:14:55,460 --> 00:14:57,680
Eight cores, I can get
seven of them.

336
00:14:57,680 --> 00:15:00,430
Hopefully each of them has the
same amount of work, and I

337
00:15:00,430 --> 00:15:03,070
can run that.

338
00:15:03,070 --> 00:15:06,610
And then we assign each one to a
core and say -- "You own

339
00:15:06,610 --> 00:15:07,970
this one, you run it forever.

340
00:15:07,970 --> 00:15:11,040
You get the data from the guy
who owns this one, and you

341
00:15:11,040 --> 00:15:16,080
produce at this one." And if
you have more cores you can

342
00:15:16,080 --> 00:15:17,500
actually keep doing
some of that.

343
00:15:17,500 --> 00:15:18,790
If you have enough filters
you can

344
00:15:18,790 --> 00:15:20,800
combine them and do that.

345
00:15:20,800 --> 00:15:24,980
So we ran that, and
we got this.

346
00:15:24,980 --> 00:15:28,020
Not that bad.

347
00:15:28,020 --> 00:15:29,650
So what might be the
problems here?

348
00:15:37,308 --> 00:15:40,100
AUDIENCE: Hardware locality.

349
00:15:40,100 --> 00:15:42,616
You want to make sure that the
communicating filters are

350
00:15:42,616 --> 00:15:43,460
close to each other.

351
00:15:43,460 --> 00:15:44,805
PROFESSOR: Yeah, that
we can deal with.

352
00:15:44,805 --> 00:15:47,400
It's not a big locality
[OBSCURED]

353
00:15:47,400 --> 00:15:49,740
What's the other problem?

354
00:15:49,740 --> 00:15:50,990
The bigger problem.

355
00:15:56,020 --> 00:15:57,000
AUDIENCE: [NOISE]

356
00:15:57,000 --> 00:15:57,490
load balance.

357
00:15:57,490 --> 00:15:59,310
PROFESSOR: Load balance is the
biggest problem, because the

358
00:15:59,310 --> 00:16:01,540
problem is you are combining
different types of things

359
00:16:01,540 --> 00:16:04,080
together, and you are hoping
that each chunk you get

360
00:16:04,080 --> 00:16:05,660
combined together will
have an almost

361
00:16:05,660 --> 00:16:07,430
identical amount of work.

362
00:16:07,430 --> 00:16:09,070
And that's very hard to achieve
most of the time,

363
00:16:09,070 --> 00:16:11,640
because dynamically things
keep changing.

364
00:16:11,640 --> 00:16:13,360
The nice thing about loops is,
most of the time if you have a

365
00:16:13,360 --> 00:16:17,260
loop, if you replicate
it many times, it's the same

366
00:16:17,260 --> 00:16:19,050
amount of code, same
amount of work.

367
00:16:19,050 --> 00:16:20,340
It nicely balances out.

368
00:16:20,340 --> 00:16:21,310
Hardware --

369
00:16:21,310 --> 00:16:23,570
combining different things
becomes actually much harder.
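
The load-balance problem with hardware pipelining can be made concrete with a rough Python sketch. It assumes a straight pipeline, per-filter work estimates that are invented for the example, and a throughput model in which the slowest core sets the pace; `WORK` and `partition` are hypothetical names, and this is not the StreamIt compiler's actual partitioner.

```python
# Hypothetical work estimates (cycles per steady-state pass) for six filters.
WORK = [10, 40, 5, 5, 30, 10]


def partition(work, cores):
    """Split `work` into `cores` contiguous groups, minimizing the heaviest group."""
    n = len(work)
    prefix = [0]
    for w in work:
        prefix.append(prefix[-1] + w)

    inf = float("inf")
    # best[k][i]: smallest possible bottleneck when the first i filters form k groups.
    best = [[inf] * (n + 1) for _ in range(cores + 1)]
    cut = [[0] * (n + 1) for _ in range(cores + 1)]
    best[0][0] = 0
    for k in range(1, cores + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):
                load = max(best[k - 1][j], prefix[i] - prefix[j])
                if load < best[k][i]:
                    best[k][i], cut[k][i] = load, j

    # Walk the recorded cut points backwards to recover the groups.
    groups, i = [], n
    for k in range(cores, 0, -1):
        j = cut[k][i]
        groups.append(work[j:i])
        i = j
    return groups[::-1], best[cores][n]


groups, bottleneck = partition(WORK, 4)
print(groups, bottleneck, sum(WORK) / bottleneck)
```

Even with the best contiguous grouping, the single 40-cycle filter sets the pace, so four cores give a speedup of only 2.5 in this made-up example. Replicating identical work, as in a data-parallel loop, does not have this problem because every copy carries the same load.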

370
00:16:26,280 --> 00:16:28,390
So again, parallelism and
synchronization are not really

371
00:16:28,390 --> 00:16:29,950
matched to the target.

372
00:16:29,950 --> 00:16:35,270
So the StreamIt compiler right
now does two, three things.

373
00:16:35,270 --> 00:16:36,180
I'll go through details.

374
00:16:36,180 --> 00:16:37,550
Coarsen the granularity
of things.

375
00:16:37,550 --> 00:16:41,020
So what happens is if you have
small filters it combines them

376
00:16:41,020 --> 00:16:45,320
together to get the large
stateless areas.

377
00:16:45,320 --> 00:16:47,600
It data parallelizes
when possible.

378
00:16:47,600 --> 00:16:49,960
And it does software pipelining,
that's a pipeline

379
00:16:49,960 --> 00:16:50,640
parallelism.

380
00:16:50,640 --> 00:16:53,710
I'll go through all these
things in detail.

381
00:16:53,710 --> 00:16:58,460
And you can get about
an 11x speedup by

382
00:16:58,460 --> 00:16:59,460
doing all those things.

383
00:16:59,460 --> 00:17:01,150
So coarsen the stream graph.

384
00:17:01,150 --> 00:17:03,290
So you look at this stream
graph and say -- wait a

385
00:17:03,290 --> 00:17:06,950
minute, I have a bunch of
data-parallel parts.

386
00:17:06,950 --> 00:17:09,420
And before, what I did was take
each data-parallel part,

387
00:17:09,420 --> 00:17:12,450
went 16 ways, then came
together, went 16 ways, came

388
00:17:12,450 --> 00:17:13,310
together, went 16 ways.

389
00:17:13,310 --> 00:17:14,400
Why?

390
00:17:14,400 --> 00:17:15,900
I have too much
communication.

391
00:17:15,900 --> 00:17:20,590
Can I combine data-parallel
things into one gigantic unit

392
00:17:20,590 --> 00:17:22,620
when possible?

393
00:17:22,620 --> 00:17:24,810
Of course, you don't want to
combine a data-parallel part

394
00:17:24,810 --> 00:17:26,290
with a non-data-parallel part.

395
00:17:26,290 --> 00:17:27,830
Then the entire thing
becomes sequential,

396
00:17:27,830 --> 00:17:29,170
and that's not helpful.

397
00:17:29,170 --> 00:17:32,400
So in here what we found is
these four cannot be combined,

398
00:17:32,400 --> 00:17:36,350
because if you combine them
the entire thing becomes

399
00:17:36,350 --> 00:17:37,520
sequential.

400
00:17:37,520 --> 00:17:41,240
So what we have to do is, you
can combine this way.

401
00:17:41,240 --> 00:17:43,680
So all those things are
data-parallel, all those

402
00:17:43,680 --> 00:17:44,790
things are data-parallel.

403
00:17:44,790 --> 00:17:47,640
And even though they are
data-parallel if you combine

404
00:17:47,640 --> 00:17:49,160
them they become
non-data-parallel, because

405
00:17:49,160 --> 00:17:51,040
this is actually doing peeking,
it's looking at more

406
00:17:51,040 --> 00:17:53,030
than one, and so it's
looking at somebody

407
00:17:53,030 --> 00:17:53,950
else's iteration work.

408
00:17:53,950 --> 00:17:56,920
So you can't combine them.

409
00:17:56,920 --> 00:18:00,560
So the benefit of doing
this is you reduce global

410
00:18:00,560 --> 00:18:01,810
communication basically.
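
The coarsening rule can be sketched as a simple pass over a pipeline. This Python sketch assumes a straight pipeline and encodes the two restrictions mentioned here as assumptions: a stateful filter is never fused with anything, and a peeking filter is never fused onto the filter upstream of it, since the items buffered between them would become state. `Filter` and `coarsen` are hypothetical names, and the real compiler works on a full stream graph rather than a list.

```python
from dataclasses import dataclass


@dataclass
class Filter:
    name: str
    stateful: bool = False
    peeks: bool = False  # looks at more items than it pops


def coarsen(pipeline):
    """Fuse runs of adjacent filters while the fused result stays data-parallel."""
    groups = []
    for f in pipeline:
        can_fuse = (
            bool(groups)
            and not f.stateful                 # state would serialize the whole group
            and not f.peeks                    # fusing a peeker downstream creates state
            and not groups[-1][-1].stateful    # the previous group must be stateless
        )
        if can_fuse:
            groups[-1].append(f)
        else:
            groups.append([f])
    return groups


PIPE = [
    Filter("A"),
    Filter("B"),
    Filter("C", peeks=True),
    Filter("D"),
    Filter("E", stateful=True),
]

groups = coarsen(PIPE)
print([[f.name for f in g] for g in groups])
```

In this made-up pipeline the four stateless filters collapse into two data-parallel groups, so replicating them needs two scatter and gather rounds per pass instead of four.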

411
00:18:04,460 --> 00:18:07,920
And the next thing
is you want to

412
00:18:07,920 --> 00:18:10,650
data-parallelize to four cores.

413
00:18:10,650 --> 00:18:18,060
And this one is fissed four
ways in there.

414
00:18:18,060 --> 00:18:20,425
But the interesting thing is,
when you go in this one you

415
00:18:20,425 --> 00:18:21,830
realize there's some
task parallelism.

416
00:18:24,680 --> 00:18:28,040
We know there are two tasks that
have the same amount of

417
00:18:28,040 --> 00:18:30,520
work in here.

418
00:18:30,520 --> 00:18:32,880
So fissing this four ways, and
fissing this four ways, and

419
00:18:32,880 --> 00:18:34,525
giving the entire machine to
this one, and giving the

420
00:18:34,525 --> 00:18:36,860
entire machine to this one,
might not be the best idea.

421
00:18:36,860 --> 00:18:40,000
What you want to do is you
want to fiss it two ways.

422
00:18:40,000 --> 00:18:43,930
And then basically give the
entire machine to all of these

423
00:18:43,930 --> 00:18:45,550
running at the same time,
because they're

424
00:18:45,550 --> 00:18:46,130
load balanced --

425
00:18:46,130 --> 00:18:48,340
because they are the same
thing repeated.

426
00:18:48,340 --> 00:18:49,790
And you can do the same
thing in here.

427
00:18:53,660 --> 00:18:54,610
OK.

428
00:18:54,610 --> 00:18:59,390
So that's what the compiler
does automatically, and it

429
00:18:59,390 --> 00:19:00,400
preserves task parallelism.

430
00:19:00,400 --> 00:19:02,320
So if you have task parallelism
you don't need --

431
00:19:02,320 --> 00:19:05,160
the thing about that is the
parallelism you need, you

432
00:19:05,160 --> 00:19:06,270
don't need too much
parallelism.

433
00:19:06,270 --> 00:19:08,590
You need enough parallelism
to make the machine happy.

434
00:19:08,590 --> 00:19:10,280
If you have too much parallelism
you end up in

435
00:19:10,280 --> 00:19:11,600
other problems, like
synchronization.

436
00:19:11,600 --> 00:19:13,380
So this gives enough parallelism
to keep the entire

437
00:19:13,380 --> 00:19:16,640
machine happy, but
not too much.

438
00:19:16,640 --> 00:19:20,420
And by doing that actually we
get pretty good performance.

439
00:19:20,420 --> 00:19:24,770
There are a few cases where this
hardware parallelism wins

440
00:19:24,770 --> 00:19:26,460
out, these two, but
most of them --

441
00:19:26,460 --> 00:19:29,650
actually this last one
we can recover --

442
00:19:29,650 --> 00:19:31,458
do it pretty well.

443
00:19:31,458 --> 00:19:32,820
OK.

444
00:19:32,820 --> 00:19:37,870
So what's left here is -- so
this is good parallelism and

445
00:19:37,870 --> 00:19:39,770
low synchronization.

446
00:19:39,770 --> 00:19:43,320
But there's one thing, when you
are doing data parallelism

447
00:19:43,320 --> 00:19:47,540
there are places where there
are filters that cannot be

448
00:19:47,540 --> 00:19:50,460
parallelized -- they are
stateful filters.

449
00:19:50,460 --> 00:19:53,590
Because you can't run the data
parallelism, and according to

450
00:19:53,590 --> 00:19:55,580
Amdahl's Law that's actually
going to basically kill you,

451
00:19:55,580 --> 00:19:57,210
because that's just waiting
there and you

452
00:19:57,210 --> 00:20:00,090
can't do too much.
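A minimal sketch of the Amdahl's Law point, assuming a hypothetical helper called `amdahl_speedup` and made-up fractions; none of these numbers come from the lecture's benchmarks. It only shows how a stateful filter that cannot be data-parallelized caps the speedup, however many cores you have.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when only `parallel_fraction` of the work scales.

    `parallel_fraction` is the share of the work in data-parallel filters;
    the rest (for example a stateful filter) is assumed to run on one core.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Illustrative values only: 90% of the work is data parallel.
    for cores in (4, 16, 64):
        print(cores, "cores ->", round(amdahl_speedup(0.9, cores), 2), "x")
```

Even with a small sequential share, adding cores stops helping fairly quickly, which is why the stateful filters need some other kind of parallelism.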

453
00:20:00,090 --> 00:20:03,040
I'm going to show that using
this separate program -- so

454
00:20:03,040 --> 00:20:05,580
this number is the amount
of work that each

455
00:20:05,580 --> 00:20:06,300
of them has to do.

456
00:20:06,300 --> 00:20:08,670
So this is actually a lot of
work, a lot of work -- this

457
00:20:08,670 --> 00:20:12,410
does a little work in each
of these filters.

458
00:20:12,410 --> 00:20:14,660
So if you look at that, these
are data parallel but it

459
00:20:14,660 --> 00:20:15,780
doesn't do much work.

460
00:20:15,780 --> 00:20:19,200
Just parallelizing this
doesn't help you.

461
00:20:19,200 --> 00:20:20,710
And these are data parallel.

462
00:20:20,710 --> 00:20:21,830
And these actually
do enough work.

463
00:20:21,830 --> 00:20:23,770
Actually we can go and say I
am replicating this four

464
00:20:23,770 --> 00:20:25,100
times, and I'm OK.

465
00:20:25,100 --> 00:20:27,640
I'm getting actually good
performance in here.

466
00:20:27,640 --> 00:20:30,090
Now what we have is a
program like this.

467
00:20:30,090 --> 00:20:33,320
And so if you are not doing
anything else, then we have

468
00:20:33,320 --> 00:20:34,080
data parallelism in here.

469
00:20:34,080 --> 00:20:37,000
So what happens is, in the first
cycle you run these two.

470
00:20:37,000 --> 00:20:39,460
And then you run data parallel
this one, and then you run

471
00:20:39,460 --> 00:20:45,260
these, and then you run data
parallel this one.

472
00:20:45,260 --> 00:20:47,760
And if you look at that, what
happens is we have a bunch of

473
00:20:47,760 --> 00:20:50,140
holes in here.

474
00:20:50,140 --> 00:20:52,510
Because at that point when you
are running that part of the

475
00:20:52,510 --> 00:20:54,370
program there's not enough
parallelism, and you only have

476
00:20:54,370 --> 00:20:55,330
two things in there.

477
00:20:55,330 --> 00:20:57,430
And when you're running this
you can run this task

478
00:20:57,430 --> 00:20:59,210
parallelism in here, but there's
nothing else you can

479
00:20:59,210 --> 00:21:00,980
do in here.

480
00:21:00,980 --> 00:21:05,860
And so you get basically
21 time steps each --

481
00:21:05,860 --> 00:21:08,310
time units, basically, to
run that program.

482
00:21:08,310 --> 00:21:10,910
But here we can do better.

483
00:21:10,910 --> 00:21:14,960
What we can do is we can take
and try to move that there,

484
00:21:14,960 --> 00:21:18,620
and kind of compress them.

485
00:21:18,620 --> 00:21:20,880
But the interesting thing
is these things

486
00:21:20,880 --> 00:21:24,820
are not data parallel.

487
00:21:24,820 --> 00:21:26,350
So how do I do that?

488
00:21:26,350 --> 00:21:28,640
So the way to do that is taking
advantage of pipeline

489
00:21:28,640 --> 00:21:30,370
parallelism.

490
00:21:30,370 --> 00:21:34,280
So what you can do is you can
take this filter in here.

491
00:21:34,280 --> 00:21:40,120
Since each of the entire graph
can run only sequentially --

492
00:21:40,120 --> 00:21:42,590
this has to run after this --
you can look at the filters

493
00:21:42,590 --> 00:21:47,470
running separately like that,
and kind of say, instead of

494
00:21:47,470 --> 00:21:50,440
running this and this and this,
why don't I run this

495
00:21:50,440 --> 00:21:52,020
iteration of this one.

496
00:21:52,020 --> 00:21:53,940
This iteration of
this invocation.

497
00:21:53,940 --> 00:21:55,680
And this iteration
of this one.

498
00:21:55,680 --> 00:21:58,340
And this iteration
on the next one.

499
00:21:58,340 --> 00:22:00,420
And I'm still maintaining --
because when I'm running this

500
00:22:00,420 --> 00:22:01,030
even though the --

501
00:22:01,030 --> 00:22:02,920
I'm not running anything data
parallel here because these

502
00:22:02,920 --> 00:22:05,100
ones were already done
previously, so I can actually

503
00:22:05,100 --> 00:22:06,760
use that value.

504
00:22:06,760 --> 00:22:10,720
And so I can maintain that
dependency, but I'm running

505
00:22:10,720 --> 00:22:12,430
things from the different
iterations.

506
00:22:12,430 --> 00:22:15,200
And so what I need to do is, I
need to kind of do a prologue

507
00:22:15,200 --> 00:22:17,430
to kind of set everything
up in there.

508
00:22:17,430 --> 00:22:19,810
And then I can do that and
I don't have any kind of

509
00:22:19,810 --> 00:22:21,330
dependence among these things.

510
00:22:21,330 --> 00:22:28,020
So now what I can do is I can
basically take these two and

511
00:22:28,020 --> 00:22:31,520
basically lay out anything
anywhere in those groups,

512
00:22:31,520 --> 00:22:34,610
because they are in different
iterations and since I am

513
00:22:34,610 --> 00:22:36,870
pipelining these I don't have
any dependence in there.

514
00:22:36,870 --> 00:22:39,960
So I end up in this kind of a
thing, and it's basically much

515
00:22:39,960 --> 00:22:41,210
more compressed in here.
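A rough sketch of the software pipelining idea, assuming a hypothetical function `software_pipeline` and placeholder stage names "A", "B", "C"; this is not the StreamIt compiler's actual scheduler. At time step t, stage number s works on iteration t - s, so everything inside one step comes from a different iteration and can be laid out on any core. The first few partial steps play the role of the prologue.

```python
from typing import List, Tuple


def software_pipeline(stages: List[str], iterations: int) -> List[List[Tuple[str, int]]]:
    """Return the schedule as a list of time steps.

    Each step is a list of (stage, iteration) pairs. Stage s runs iteration
    t - s at step t, so its input (stage s - 1 on the same iteration) was
    already produced in the previous step.
    """
    steps = []
    for t in range(iterations + len(stages) - 1):
        step = [
            (stage, t - s)
            for s, stage in enumerate(stages)
            if 0 <= t - s < iterations
        ]
        steps.append(step)
    return steps


if __name__ == "__main__":
    for t, step in enumerate(software_pipeline(["A", "B", "C"], 4)):
        print("step", t, step)
```

The steps before every stage is busy are the prologue; after that, each step holds one piece of work per stage with no dependence between them.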

516
00:22:43,420 --> 00:22:46,380
And by doing that what you
actually get is a really nice

517
00:22:46,380 --> 00:22:47,830
performance.

518
00:22:47,830 --> 00:22:53,500
The only place that this
actually wins -- hardware

519
00:22:53,500 --> 00:22:55,630
pipelining, and this little
bit in there.

520
00:22:55,630 --> 00:22:59,970
But the rest you get a really
good win in here.

521
00:22:59,970 --> 00:23:00,700
OK.

522
00:23:00,700 --> 00:23:06,460
So what this does is basically
now we've got a program where

523
00:23:06,460 --> 00:23:09,140
the programmer never thought
anything about what the

524
00:23:09,140 --> 00:23:10,010
hardware is --

525
00:23:10,010 --> 00:23:12,780
just wrote abstract graph
and data streaming.

526
00:23:12,780 --> 00:23:15,810
And given Raw, we automatically
actually mapped

527
00:23:15,810 --> 00:23:17,880
into it, and figured out what
is the right balance, right

528
00:23:17,880 --> 00:23:20,200
communication, right
synchronization, and got

529
00:23:20,200 --> 00:23:21,370
really good performance.

530
00:23:21,370 --> 00:23:24,050
And you're getting something
like 11x performance.

531
00:23:24,050 --> 00:23:26,540
If you do it by hand, if you work
hard probably you can do

532
00:23:26,540 --> 00:23:27,390
a little bit better.

533
00:23:27,390 --> 00:23:29,650
But this is good, because you
don't hand-do anything.

534
00:23:29,650 --> 00:23:31,990
The killer thing is now I can
probably take this set of

535
00:23:31,990 --> 00:23:34,800
programs -- which we are
actually working on -- is you

536
00:23:34,800 --> 00:23:39,600
can take them to Cell which has,
depending on the day, six

537
00:23:39,600 --> 00:23:44,100
cores, seven cores, eight cores,
and we can basically

538
00:23:44,100 --> 00:23:47,150
get to matching the number
of cores in there.

539
00:23:47,150 --> 00:23:49,390
So this is it because right now
what happens is you have

540
00:23:49,390 --> 00:23:51,810
to basically hand code all those
things, and this can

541
00:23:51,810 --> 00:23:53,220
automate all that process.

542
00:23:53,220 --> 00:23:55,470
So that's the idea, is can you
do this -- which we haven't

543
00:23:55,470 --> 00:23:57,530
really proved and this
is our research --

544
00:23:57,530 --> 00:23:59,970
write once, use anywhere.

545
00:23:59,970 --> 00:24:04,760
So write this program once
in this abstract way.

546
00:24:04,760 --> 00:24:06,760
You really don't have to think
about full parallelism.

547
00:24:06,760 --> 00:24:09,815
You have to think about some
amount of parallelism, how

548
00:24:09,815 --> 00:24:13,140
this can be put into a stream
graph, but you are not dealing

549
00:24:13,140 --> 00:24:15,530
with synchronization, load
balancing, performance.

550
00:24:15,530 --> 00:24:16,990
You don't have to
deal with that.

551
00:24:16,990 --> 00:24:20,140
And then the compiler will
automatically do all these

552
00:24:20,140 --> 00:24:22,800
things behind you, and get
really good performance.

553
00:24:22,800 --> 00:24:26,380
And the reason I showed
this was --

554
00:24:26,380 --> 00:24:27,713
I'll just play one more
slide I think --

555
00:24:30,710 --> 00:24:33,050
showed this was it's
not a simple thing.

556
00:24:33,050 --> 00:24:35,580
The compiler actually has to do
a bunch of work, the work

557
00:24:35,580 --> 00:24:36,930
that you used to do before.

558
00:24:36,930 --> 00:24:40,190
Things like figuring out what's
the right granularity,

559
00:24:40,190 --> 00:24:43,780
what's the right mix of
operations, what type of

560
00:24:43,780 --> 00:24:46,840
transformations you need
to do to get there.

561
00:24:46,840 --> 00:24:51,970
But at some point we did three
things -- coarse-grained, data

562
00:24:51,970 --> 00:24:53,840
parallel, and software
pipelining.

563
00:24:53,840 --> 00:24:56,290
And by doing these three we can
actually get a really good

564
00:24:56,290 --> 00:25:00,620
performance in most of the
programs we have. So what we

565
00:25:00,620 --> 00:25:03,840
are hoping is basically these
kinds of techniques can in fact

566
00:25:03,840 --> 00:25:08,380
help programmers to get
multicore performance without

567
00:25:08,380 --> 00:25:11,490
really going and dealing in the
grunge level of details

568
00:25:11,490 --> 00:25:12,540
you guys do.

569
00:25:12,540 --> 00:25:15,430
You guys will appreciate that,
and hopefully will

570
00:25:15,430 --> 00:25:17,310
think of making --

571
00:25:17,310 --> 00:25:19,870
because now at the end of this
class, you will know all the

572
00:25:19,870 --> 00:25:21,520
pain and suffering the
programmers go

573
00:25:21,520 --> 00:25:24,170
through to get there.

574
00:25:24,170 --> 00:25:26,510
And the interesting thing would
be to in fact look at

575
00:25:26,510 --> 00:25:29,100
the ways to basically reduce
that pain and suffering.

576
00:25:29,100 --> 00:25:32,570
So that's what I have today.

577
00:25:32,570 --> 00:25:36,340
So this was, as I promised,
a short lecture --

578
00:25:36,340 --> 00:25:37,420
the second one.

579
00:25:37,420 --> 00:25:38,670
Any questions?

580
00:25:41,611 --> 00:25:44,330
AUDIENCE: So if we've got enough
data parallelism we'll

581
00:25:44,330 --> 00:25:49,322
have the same software pipeline
jumping on each tile?

582
00:25:49,322 --> 00:25:50,950
Is that right?

583
00:25:50,950 --> 00:25:51,760
PROFESSOR: Yes.

584
00:25:51,760 --> 00:25:52,410
AUDIENCE: OK.

585
00:25:52,410 --> 00:25:56,002
So if you do that how does it
scale up to something that has

586
00:25:56,002 --> 00:25:59,236
higher communication
costs than Raw?

587
00:25:59,236 --> 00:26:01,939
By doing this software
pipelining you have to do all

588
00:26:01,939 --> 00:26:03,650
of your communication
off tile.

589
00:26:03,650 --> 00:26:08,140
PROFESSOR: So the interesting thing in
there right now is we haven't

590
00:26:08,140 --> 00:26:10,850
done any kind of hardware
pipelining.

591
00:26:10,850 --> 00:26:14,040
We are kind of doing --
everybody's getting a lot of

592
00:26:14,040 --> 00:26:15,950
data moving in there.

593
00:26:15,950 --> 00:26:19,950
The neat thing about right now
is, even with the SPEs in Cell

594
00:26:19,950 --> 00:26:24,090
and even Raw, the number of
tiles are still small enough

595
00:26:24,090 --> 00:26:27,240
that a lot of communication
-- unless way too much

596
00:26:27,240 --> 00:26:28,720
communication -- it doesn't
really overwhelm you.

597
00:26:28,720 --> 00:26:32,530
Because everybody's nearby,
you can send things.

598
00:26:32,530 --> 00:26:35,910
They talk a little bit about how in
Cell near-neighborness

599
00:26:35,910 --> 00:26:38,310
helps, but not that much.

600
00:26:38,310 --> 00:26:42,150
But as we go into larger and
larger cores, it's going to

601
00:26:42,150 --> 00:26:44,360
become an issue.

602
00:26:44,360 --> 00:26:47,080
Near neighbors become much easier
to communicate with, and you

603
00:26:47,080 --> 00:26:48,700
can't do global things
in there.

604
00:26:48,700 --> 00:26:50,230
And at that point you will
actually have to do some

605
00:26:50,230 --> 00:26:51,090
hardware pipelining.

606
00:26:51,090 --> 00:26:55,120
You can't just assume that at
some point everybody's going

607
00:26:55,120 --> 00:26:56,940
to get some data and
go to something.

608
00:26:56,940 --> 00:26:59,480
So what you need to do is have
different chunks that the only

609
00:26:59,480 --> 00:27:01,520
communication that would be
between these chunks would be

610
00:27:01,520 --> 00:27:02,980
kind of a pipeline
communication.

611
00:27:02,980 --> 00:27:06,410
So you don't mix data around.

612
00:27:06,410 --> 00:27:10,260
So as we go into larger and
larger cores you need to start

613
00:27:10,260 --> 00:27:12,860
doing techniques like that.

614
00:27:12,860 --> 00:27:16,710
The interesting thing here is
even though what you had to

615
00:27:16,710 --> 00:27:18,950
change was the compiler --
hopefully the program stays

616
00:27:18,950 --> 00:27:20,040
the same --

617
00:27:20,040 --> 00:27:22,180
right now it's not an easy
issue, because our compiler

618
00:27:22,180 --> 00:27:24,733
has 10 times more code than the
program, so it's easier in

619
00:27:24,733 --> 00:27:25,300
the program.

620
00:27:25,300 --> 00:27:30,300
But if you look at something like C,
the code base is millions

621
00:27:30,300 --> 00:27:32,410
of times larger than the
size of the compiler.

622
00:27:32,410 --> 00:27:34,020
So at some point they'll
be switched.

623
00:27:34,020 --> 00:27:35,500
It's easier to change
the compiler to kind

624
00:27:35,500 --> 00:27:36,900
of keep up to date.

625
00:27:36,900 --> 00:27:39,070
That's what happened in C. Every
generation you change

626
00:27:39,070 --> 00:27:41,150
the compiler, you don't ask
programmers to recode the

627
00:27:41,150 --> 00:27:42,350
application.

628
00:27:42,350 --> 00:27:46,500
So can you make these kind of
things as the multicores

629
00:27:46,500 --> 00:27:47,240
become different --

630
00:27:47,240 --> 00:27:49,320
bigger, have different
features.

631
00:27:49,320 --> 00:27:52,270
You change the compiler to get
the performance, but have the

632
00:27:52,270 --> 00:27:54,780
same code base.

633
00:27:54,780 --> 00:27:56,030
That's the goal for
portability.

634
00:27:58,711 --> 00:28:01,350
AUDIENCE: Have you tried
applying StreamIt or the

635
00:28:01,350 --> 00:28:06,550
streaming model in general, to
codes that are not very

636
00:28:06,550 --> 00:28:12,950
clearly stream-based but using
the streaming model to make

637
00:28:12,950 --> 00:28:16,450
communication explicit, such
as scientific codes.

638
00:28:16,450 --> 00:28:19,100
Or, for example, the kinds of
parallelizable loops that you

639
00:28:19,100 --> 00:28:20,870
covered in the first half
of the lecture.

640
00:28:20,870 --> 00:28:23,450
PROFESSOR: Some of those things,
when you have free

641
00:28:23,450 --> 00:28:25,810
form simple communication
can map into streaming.

642
00:28:25,810 --> 00:28:28,420
So for example, one thing
we are doing is things

643
00:28:28,420 --> 00:28:30,490
like right now MPEG.

644
00:28:30,490 --> 00:28:33,430
Some part of the MPEG is nicely
StreamIt, but when you

645
00:28:33,430 --> 00:28:36,090
actually go inside the MPEG and
deal with the frame,

646
00:28:36,090 --> 00:28:39,810
it's basically a big array,
and you're doing that.

647
00:28:39,810 --> 00:28:42,100
So how do you chunkify the
arrays, and basically deal

648
00:28:42,100 --> 00:28:43,350
with it in a streaming order?

649
00:28:43,350 --> 00:28:44,980
There's some interesting
things you can do.

650
00:28:44,980 --> 00:28:48,400
There will be some stuff
that doesn't fit that.

651
00:28:48,400 --> 00:28:55,100
Things like pattern recognition
type stuff, where

652
00:28:55,100 --> 00:28:57,270
what you want to do
is you want to --

653
00:28:57,270 --> 00:28:59,330
assume you're trying to --

654
00:28:59,330 --> 00:29:01,540
good example.

655
00:29:01,540 --> 00:29:04,210
You're trying to do feature
recognition in a video.

656
00:29:04,210 --> 00:29:07,560
And what happens is the number
of features, can you match or

657
00:29:07,560 --> 00:29:08,860
connect two features,
or match and

658
00:29:08,860 --> 00:29:10,090
connect a thousand features.

659
00:29:10,090 --> 00:29:12,480
And then each feature you need
to do some processing.

660
00:29:12,480 --> 00:29:14,880
And that is a very
dynamic thing.

661
00:29:14,880 --> 00:29:16,140
And that doesn't
really fit into

662
00:29:16,140 --> 00:29:17,800
streaming order right now.

663
00:29:17,800 --> 00:29:21,530
And so the interesting thing is,
the problem we have been

664
00:29:21,530 --> 00:29:24,340
doing is we are trying to
fit everything into one.

665
00:29:24,340 --> 00:29:26,780
So right now the object-oriented
model is it

666
00:29:26,780 --> 00:29:28,570
basically -- everything
has to fit in there.

667
00:29:28,570 --> 00:29:30,580
But what you're finding is there
are many things that

668
00:29:30,580 --> 00:29:32,170
don't really fit nicely.

669
00:29:32,170 --> 00:29:35,310
And you'll do these very crazy
looking things just to get

670
00:29:35,310 --> 00:29:38,320
every program to fit into the
object-oriented model.

671
00:29:38,320 --> 00:29:39,150
That doesn't really work.

672
00:29:39,150 --> 00:29:41,090
I think the right way
to work is that there

673
00:29:41,090 --> 00:29:42,350
might be multiple models.

674
00:29:42,350 --> 00:29:44,050
There's a streaming model,
there's some kind of a

675
00:29:44,050 --> 00:29:45,610
threaded model, there might
be different ones --

676
00:29:45,610 --> 00:29:47,130
I don't know what other
models are.

677
00:29:47,130 --> 00:29:49,450
So the key thing is your program
might have a large

678
00:29:49,450 --> 00:29:51,520
chunky model, another
chunky model.

679
00:29:51,520 --> 00:29:53,540
Don't try to come up with --

680
00:29:53,540 --> 00:29:55,220
right now what we have
is we have a kitchen

681
00:29:55,220 --> 00:29:56,370
sink type of language.

682
00:29:56,370 --> 00:29:59,120
It tries to support everything
at the same time.

683
00:29:59,120 --> 00:30:01,030
And that doesn't really work
because then you have to think

684
00:30:01,030 --> 00:30:01,400
about and say --

685
00:30:01,400 --> 00:30:04,110
OK done, can I have
a pointer here?

686
00:30:04,110 --> 00:30:08,890
And I need to think about all
the possible models kind of

687
00:30:08,890 --> 00:30:10,960
colliding in the same space.

688
00:30:10,960 --> 00:30:14,880
AUDIENCE: On the other hand, the
object-oriented model is

689
00:30:14,880 --> 00:30:16,010
much more generalized to me.

690
00:30:16,010 --> 00:30:18,545
It's not the best model for
many things, but it's much

691
00:30:18,545 --> 00:30:20,830
more generalizable
than some models.

692
00:30:20,830 --> 00:30:23,910
And having a single model cuts
down on the number of semantic

693
00:30:23,910 --> 00:30:25,140
barriers you have to cross --

694
00:30:25,140 --> 00:30:26,030
PROFESSOR: I don't know but --

695
00:30:26,030 --> 00:30:29,790
AUDIENCE: Semantic barriers
incur both programmer overhead

696
00:30:29,790 --> 00:30:31,380
and run-time overhead.

697
00:30:31,380 --> 00:30:33,130
PROFESSOR: See, the problem
right now with all the

698
00:30:33,130 --> 00:30:36,770
semantic barriers, is
object-oriented model plus a

699
00:30:36,770 --> 00:30:38,860
huge number of libraries.

700
00:30:38,860 --> 00:30:41,630
If you want to do OpenGL, it's
object-oriented but you have

701
00:30:41,630 --> 00:30:42,250
to know the library.

702
00:30:42,250 --> 00:30:43,395
If you want to do something
else, you have

703
00:30:43,395 --> 00:30:44,230
to learn the library.

704
00:30:44,230 --> 00:30:46,650
What the right thing would be,
instead of trying to learn the

705
00:30:46,650 --> 00:30:49,880
libraries is learn kind
of a subset language.

706
00:30:49,880 --> 00:30:53,580
So you have nice semantics, you
have nice syntax in there,

707
00:30:53,580 --> 00:30:56,380
you have nice error
checking, nice

708
00:30:56,380 --> 00:30:58,270
optimization within that syntax.

709
00:30:58,270 --> 00:31:01,340
Because the trouble is right now
everything is in this just

710
00:31:01,340 --> 00:31:03,620
gigantic language, and you
can't do anything.

711
00:31:03,620 --> 00:31:05,750
And in the program you don't
even know, because you can mix

712
00:31:05,750 --> 00:31:08,290
and match in really bad ways.

713
00:31:08,290 --> 00:31:13,450
The mix and match gives you
a lot of power, but it can

714
00:31:13,450 --> 00:31:14,480
actually really hurt.

715
00:31:14,480 --> 00:31:15,990
And a lot of people
don't need it.

716
00:31:15,990 --> 00:31:19,060
Like for example in C, people
doubt it was really crucial

717
00:31:19,060 --> 00:31:22,230
for you to access any part of
memory anywhere you want.

718
00:31:22,230 --> 00:31:25,270
You just can go and just access
any program, anywhere,

719
00:31:25,270 --> 00:31:26,620
anytime in there.

720
00:31:26,620 --> 00:31:28,300
If you look at it, nobody
takes advantage of that.

721
00:31:28,300 --> 00:31:29,700
How many times do you write the
program and say -- "Hey, I

722
00:31:29,700 --> 00:31:32,980
want to go access the other
guy's stack from this part."

723
00:31:32,980 --> 00:31:34,150
That doesn't work.

724
00:31:34,150 --> 00:31:36,160
You have a variable and
you use a variable.

725
00:31:36,160 --> 00:31:37,610
AUDIENCE: It still [OBSCURED]

726
00:31:37,610 --> 00:31:42,020
PROFESSOR: Yeah, but the thing
is because of that power, it

727
00:31:42,020 --> 00:31:43,950
creates a lot of problems
for a compiler --

728
00:31:43,950 --> 00:31:45,970
because it needs to prove
that you're not doing

729
00:31:45,970 --> 00:31:47,530
that, which is hard.

730
00:31:47,530 --> 00:31:50,160
And also, if you make a mistake
the program is like --

731
00:31:50,160 --> 00:31:51,190
"Yeah, this looks like right.

732
00:31:51,190 --> 00:31:54,650
It still matches my semantics
and syntax, I'll let you do

733
00:31:54,650 --> 00:31:57,155
that." But what you realize is
that's not something people do

734
00:31:57,155 --> 00:31:59,370
-- just stick with
your variable.

735
00:31:59,370 --> 00:32:01,140
And if you don't go
to variables --

736
00:32:01,140 --> 00:32:03,190
that's what type-safe languages
do -- it's probably

737
00:32:03,190 --> 00:32:05,690
more for bugs than a feature.

738
00:32:05,690 --> 00:32:08,790
And the same kind of thing
having efficiency in language,

739
00:32:08,790 --> 00:32:11,710
is you can do everything
at the same time.

740
00:32:11,710 --> 00:32:15,130
Why can't you have a language
that you can go with this kind

741
00:32:15,130 --> 00:32:15,600
of context.

742
00:32:15,600 --> 00:32:18,750
I'm in the streaming
context now.

743
00:32:18,750 --> 00:32:20,220
I say this is my streaming
context.

744
00:32:20,220 --> 00:32:21,780
I am in a threaded context.

745
00:32:21,780 --> 00:32:25,720
Then what that does is, I have
to learn the full set of

746
00:32:25,720 --> 00:32:28,340
features, but I restrict
what I'm using here.

747
00:32:28,340 --> 00:32:30,760
That can probably realistically
improve your

748
00:32:30,760 --> 00:32:33,520
program building, because you
don't have to worry about --

749
00:32:33,520 --> 00:32:35,568
AUDIENCE: It gives the
programmer time to get to know

750
00:32:35,568 --> 00:32:36,080
each language.

751
00:32:36,080 --> 00:32:37,530
PROFESSOR: But right now
you have to do that.

752
00:32:37,530 --> 00:32:40,100
If you look at C# it has
all these features.

753
00:32:40,100 --> 00:32:42,770
It has streaming features, it
has threaded features, it has

754
00:32:42,770 --> 00:32:43,990
every possible object-oriented
feature.

755
00:32:43,990 --> 00:32:49,465
AUDIENCE: Right, but there's a
compact central model which

756
00:32:49,465 --> 00:32:50,900
covers most things.

757
00:32:50,900 --> 00:32:52,880
You can pull in additional
features and fit them

758
00:32:52,880 --> 00:32:54,300
[OBSCURED].

759
00:32:54,300 --> 00:32:58,745
You can do pointer manipulation
in C#, but you

760
00:32:58,745 --> 00:33:00,770
bracket things into
an unsafe block.

761
00:33:00,770 --> 00:33:02,845
And then the compiler knows
in there you're

762
00:33:02,845 --> 00:33:03,800
doing really bad things.

763
00:33:03,800 --> 00:33:06,950
PROFESSOR: That's a nice
thing, because

764
00:33:06,950 --> 00:33:07,850
you can have unsafe.

765
00:33:07,850 --> 00:33:08,970
But can you have something
like -- this is

766
00:33:08,970 --> 00:33:11,060
my streaming part.

767
00:33:11,060 --> 00:33:11,510
OK.

768
00:33:11,510 --> 00:33:13,155
Can I do something like
that, so I don't have

769
00:33:13,155 --> 00:33:13,610
to worry about other?

770
00:33:13,610 --> 00:33:17,210
The key thing is, is there a
way where -- because right

771
00:33:17,210 --> 00:33:20,650
now, my feeling is if you look
at the object-oriented part.

772
00:33:20,650 --> 00:33:23,190
So if you are doing, for
example, Windows programming,

773
00:33:23,190 --> 00:33:25,280
you can spend about a week
and learn all the

774
00:33:25,280 --> 00:33:26,810
object-oriented concepts.

775
00:33:26,810 --> 00:33:28,310
And you have to spend probably
a year to learn all the

776
00:33:28,310 --> 00:33:30,150
libraries on top of that.

777
00:33:30,150 --> 00:33:32,590
That's where all the action
is these days.

778
00:33:32,590 --> 00:33:35,540
It's basically the building
blocks have become too low,

779
00:33:35,540 --> 00:33:39,950
and then everything else is kind
of an unorganized mess on

780
00:33:39,950 --> 00:33:41,050
top of that.

781
00:33:41,050 --> 00:33:43,390
Can you put more abstraction
things that easy?

782
00:33:43,390 --> 00:33:45,100
Hey, I'm talking about
research, this is

783
00:33:45,100 --> 00:33:45,930
one possible angle.

784
00:33:45,930 --> 00:33:48,420
I mean there might -- you can
think, I know there are messes

785
00:33:48,420 --> 00:33:50,460
that I think in there.

786
00:33:50,460 --> 00:33:55,310
My feeling is what we do well
is when things get too

787
00:33:55,310 --> 00:33:58,180
complicated we build
abstraction layers.

788
00:33:58,180 --> 00:34:02,835
And the interesting thing there
is, we build this high

789
00:34:02,835 --> 00:34:05,790
level programming language
abstraction layer.

790
00:34:05,790 --> 00:34:09,030
And then now we have built so
much crud on top of that

791
00:34:09,030 --> 00:34:11,360
without any nice abstraction
layers, I think it's probably

792
00:34:11,360 --> 00:34:13,290
time to think through what
there could be at the

793
00:34:13,290 --> 00:34:13,790
abstraction level.

794
00:34:13,790 --> 00:34:15,590
Things like, it's hitting --

795
00:34:15,590 --> 00:34:17,480
that is where parallelism
is really hitting.

796
00:34:17,480 --> 00:34:19,980
Because that layer, the
object-oriented layer, doesn't

797
00:34:19,980 --> 00:34:22,530
really support that well.

798
00:34:22,530 --> 00:34:26,170
And it's all kind of ad
hoc on top of that.

799
00:34:26,170 --> 00:34:27,160
And so that says something.

800
00:34:27,160 --> 00:34:29,070
Yes, it's usable.

801
00:34:29,070 --> 00:34:30,140
We had this argument --

802
00:34:30,140 --> 00:34:32,760
assembly language programmers
-- for two decades.

803
00:34:32,760 --> 00:34:33,390
There are people who were

804
00:34:33,390 --> 00:34:34,970
swearing by assembly languages.

805
00:34:34,970 --> 00:34:39,010
They could write it two times
smaller, two times faster than

806
00:34:39,010 --> 00:34:40,680
anything you can write in
high level language.

807
00:34:40,680 --> 00:34:42,120
It's probably still
true today.

808
00:34:42,120 --> 00:34:46,550
But at the end there were
things that high level

809
00:34:46,550 --> 00:34:47,560
languages won out.

810
00:34:47,560 --> 00:34:49,350
I think we are probably in
another layer like that.

811
00:34:49,350 --> 00:34:51,050
I don't know, probably will
go with that argument.

812
00:34:51,050 --> 00:34:52,680
You can always point to
something saying this is

813
00:34:52,680 --> 00:34:53,910
something you cannot do.

814
00:34:53,910 --> 00:34:56,580
If there are still things --
like structured programs and

815
00:34:56,580 --> 00:34:58,260
unstructured programs,
we talked about that.

816
00:34:58,260 --> 00:35:00,197
That argument went
for a decade.

817
00:35:00,197 --> 00:35:04,810
AUDIENCE: The question I would
pose is can you formulate a

818
00:35:04,810 --> 00:35:08,100
kitchen sink language at a
parallelizable level of

819
00:35:08,100 --> 00:35:08,590
abstraction?

820
00:35:08,590 --> 00:35:09,080
PROFESSOR: Ah.

821
00:35:09,080 --> 00:35:12,572
That's interesting, because
parallelization is -- one of

822
00:35:12,572 --> 00:35:14,390
the biggest things people
have to figure out is

823
00:35:14,390 --> 00:35:16,120
composability.

824
00:35:16,120 --> 00:35:19,530
You can't have two parallel
regions as a

825
00:35:19,530 --> 00:35:21,220
black box put together.

826
00:35:21,220 --> 00:35:23,640
You start running into deadlocks
and all those other

827
00:35:23,640 --> 00:35:24,720
issues in there.

828
00:35:24,720 --> 00:35:27,760
Most of the things that work,
work because the abstraction works,

829
00:35:27,760 --> 00:35:30,850
because then you can compose at
a higher level abstraction.

830
00:35:30,850 --> 00:35:32,770
You can have an interface and
say -- here is some

831
00:35:32,770 --> 00:35:34,880
interface, I don't know what's
underneath, I compose at the

832
00:35:34,880 --> 00:35:35,630
interface level.

833
00:35:35,630 --> 00:35:38,960
And then the next guy composes
at the higher level, and

834
00:35:38,960 --> 00:35:40,360
everything is hidden.

835
00:35:40,360 --> 00:35:42,750
We don't know how to do that
in parallelism right now.

836
00:35:42,750 --> 00:35:47,170
When we combine two things,
we run into problems. And the

837
00:35:47,170 --> 00:35:49,370
minute you figure that one out
-- if somebody can figure out

838
00:35:49,370 --> 00:35:52,450
what's the right abstraction
that is composable, parallel

839
00:35:52,450 --> 00:35:53,030
abstraction --

840
00:35:53,030 --> 00:35:55,990
I think that will solve a
huge amount of problems.

841
00:35:55,990 --> 00:35:58,831
AUDIENCE: Isn't it Fortress that
attempted to do something

842
00:35:58,831 --> 00:36:01,549
that's parallelizable and the
kitchen sink, but that then

843
00:36:01,549 --> 00:36:03,280
leaves all the parallelizable
--

844
00:36:03,280 --> 00:36:03,640
AUDIENCE: I'm saying how
terrible [OBSCURED]

845
00:36:03,640 --> 00:36:06,020
programmers.

846
00:36:06,020 --> 00:36:08,040
PROFESSOR: But I would say right
now is a very exciting

847
00:36:08,040 --> 00:36:11,240
time, because there's
a big problem and

848
00:36:11,240 --> 00:36:12,920
nobody knows the solution.

849
00:36:12,920 --> 00:36:17,020
And I think for industry they
lose a lot of sleep over that,

850
00:36:17,020 --> 00:36:19,200
but for academia it's
the best time.

851
00:36:19,200 --> 00:36:20,930
Because we don't care, we don't
have to make money out

852
00:36:20,930 --> 00:36:24,170
of these things, we don't have
to get production out of it.

853
00:36:24,170 --> 00:36:26,470
But we actually have a very
open problem that a lot of

854
00:36:26,470 --> 00:36:28,150
people care about.

855
00:36:28,150 --> 00:36:31,120
And I think this is fun partly
because of that.

856
00:36:31,120 --> 00:36:35,040
I think this is a huge open problem:
if you talk to people

857
00:36:35,040 --> 00:36:38,070
like Intel and Microsoft, a
lot of people worry a lot

858
00:36:38,070 --> 00:36:40,475
because they don't know how to
deal with the future in 5-10

859
00:36:40,475 --> 00:36:42,870
years' time.

860
00:36:42,870 --> 00:36:44,990
They don't see what they're
doing scaling.

861
00:36:44,990 --> 00:36:48,650
And from Intel's point of
view, they made money by

862
00:36:48,650 --> 00:36:53,280
making Moore's Law available
for people to use.

863
00:36:53,280 --> 00:36:55,585
They know how to make it
available, but they don't know

864
00:36:55,585 --> 00:36:57,330
how to make people use it.

865
00:36:57,330 --> 00:37:01,150
From Microsoft's point of
view, their current

866
00:37:01,150 --> 00:37:05,340
development methodology is
almost at this breaking point.

867
00:37:05,340 --> 00:37:09,290
And if you look at the last
time this happened -- so

868
00:37:09,290 --> 00:37:12,760
things like Windows 3.0, where
their current development

869
00:37:12,760 --> 00:37:14,670
methodology doesn't really
scale, and they really

870
00:37:14,670 --> 00:37:15,630
revamped it.

871
00:37:15,630 --> 00:37:18,010
They came up with all this
process, and that has

872
00:37:18,010 --> 00:37:18,750
lasted until now.

873
00:37:18,750 --> 00:37:22,620
With the last two, Office and
Vista, they just realized they

874
00:37:22,620 --> 00:37:24,040
can't really scale that up.

875
00:37:24,040 --> 00:37:26,300
So they are already in trouble,
because they can't

876
00:37:26,300 --> 00:37:29,040
write the next big thing, just
two times bigger than

877
00:37:29,040 --> 00:37:31,070
Vista, and hopefully
get it working.

878
00:37:31,070 --> 00:37:34,120
But on top of that, they have
been thrown this multicore

879
00:37:34,120 --> 00:37:37,180
thing, and that really puts
a huge amount of burden on them.

880
00:37:37,180 --> 00:37:38,480
So they are worried
about that.

881
00:37:38,480 --> 00:37:41,840
So from both of their points of
view, everybody's clamoring

882
00:37:41,840 --> 00:37:43,090
for a solution.

883
00:37:45,240 --> 00:37:46,950
And things like last
time around --

884
00:37:46,950 --> 00:37:49,200
I'll talk about this in the
future -- last time around

885
00:37:49,200 --> 00:37:51,240
when that happened, it created
a huge amount of

886
00:37:51,240 --> 00:37:54,720
opportunities, and a bunch of
people who solved it kind of

887
00:37:54,720 --> 00:37:55,800
became famous.

888
00:37:55,800 --> 00:37:58,230
Because they say -- we came up
with a solution, and that

889
00:37:58,230 --> 00:37:59,890
people started using and
stuff like that.

890
00:37:59,890 --> 00:38:02,190
Right now, everybody's kind of
waiting for somebody to come

891
00:38:02,190 --> 00:38:05,500
up and say here's the solution,
here's a solution.

892
00:38:05,500 --> 00:38:07,590
And there are a lot of --

893
00:38:07,590 --> 00:38:10,910
Fortress-type efforts are one,
and what we are doing is one.

894
00:38:10,910 --> 00:38:13,380
And hopefully some of you will
end up doing something

895
00:38:13,380 --> 00:38:15,690
interesting that
might solve it.

896
00:38:15,690 --> 00:38:17,090
This is why it's fun.

897
00:38:17,090 --> 00:38:21,610
I think we haven't had this much
of an interesting time in

898
00:38:21,610 --> 00:38:24,640
programming languages,
parallelism, architecture in

899
00:38:24,640 --> 00:38:27,500
the last two decades.

900
00:38:27,500 --> 00:38:30,680
With that, I'll stop my talk.