The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

ARVIND THIAGARAJAN: So I'm Arvind, and this is Micah. We worked together on a framework for flexible stream processing on the Cell. We were also oriented towards an application, which in this case is a software radio, and we used this application as a case study. Where is the mic?

I'll try and speak as loud as I can. So our project was a flexible stream processing framework for the Cell, and we used a software radio as a case study, essentially, of an application you could build using this framework.

The motivation for our project is essentially what we've been discussing in class; it's been reiterated. The Cell isn't easy to program. There's no shared memory. There's just message passing, which is quite messy. You have to explicitly parallelize your programs if you want to write them as, say, custom C programs, and some of the groups have described the challenges they faced when doing that. Extracting parallelism can be tricky. For example, if you want to do pipelining, then on the SPEs you have to predict what addresses you're going to require and set up a DMA so that those addresses are fetched in advance. And as Bill mentioned in his talk in class, stream programming can help alleviate some of these issues for applications like software radio, which fit naturally into the streaming model.

So what we tried to do for our project was build a streaming framework targeted specifically at signal processing applications: as lightweight as possible, because the code has to fit on the SPEs, but also as expressive as possible, so as to simplify life for developers.
The data model, at least, is based on a research project that I have worked on in the past, the WaveScope streaming database management system. It's a research project with the database group. The data model is essentially an extension of the streaming model to handle larger blocks of data, so as to process them more efficiently. For example, several signal processing operators, like the fast Fourier transform, need to do multiple passes over the data, and therefore trying to treat streams on a sample-by-sample basis leads to high cost, scheduling overhead, and inefficiency. So that's what the data model does. We tried to port this data model over to the Cell processor and see how well we could exploit the features of the Cell, in particular the high on-chip bandwidth between the SPEs, to do the streaming.

So our case study was, as we mentioned, a simple software radio application. It's really simple: it uses incoherent demodulation and simple amplitude-shift keying modulation. As I said, the main goals of our framework were to simplify life for the developer and to extract as much parallelism as possible: pipeline parallelism, data parallelism, and so on. The kind of parallelism we were able to implement so far is only pipeline parallelism, but as future work we'd be interested in the other kinds of parallelism as well. More about that as I go on.

So in the framework we implemented, the programming model is quite simple. The basic execution unit is what we call an operator. It's analogous to what in StreamIt would be called a work function, or what in GNU Radio, which is a framework for building software radios, would be called a block. These operators can be arbitrary C++ classes with state, and they implement an iterate method, which the developer overloads in order to process a block of data.
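To make the operator abstraction concrete, here is a minimal sketch of what an operator might look like in this style of framework. The names and signatures (Operator, SignalBlock, iterate, emit) are illustrative assumptions, not the actual WaveScope or framework API.

    // Hypothetical sketch of an operator; names and signatures are
    // illustrative, not the actual framework API.
    #include <cstddef>
    #include <deque>
    #include <vector>

    struct SignalBlock {                    // one block of samples passed between operators
        std::vector<float> samples;
    };

    class Operator {
    public:
        virtual ~Operator() {}
        // The scheduler calls iterate() with an input block; the operator
        // processes it and pushes any results downstream via emit().
        virtual void iterate(SignalBlock &in) = 0;
        std::deque<SignalBlock> outbox;     // stands in for the downstream queue
    protected:
        void emit(const SignalBlock &out) { outbox.push_back(out); }
    };

    // A stateful operator: an FIR filter that keeps its taps and a delay line
    // across calls to iterate().
    class FIRFilter : public Operator {
    public:
        explicit FIRFilter(const std::vector<float> &taps)
            : taps_(taps), delay_(taps.size(), 0.0f) {}

        void iterate(SignalBlock &in) {
            SignalBlock out;
            for (std::size_t n = 0; n < in.samples.size(); ++n) {
                delay_.pop_back();
                delay_.insert(delay_.begin(), in.samples[n]);
                float y = 0.0f;
                for (std::size_t k = 0; k < taps_.size(); ++k)
                    y += taps_[k] * delay_[k];      // multiply-add over the delay line
                out.samples.push_back(y);
            }
            emit(out);
        }
    private:
        std::vector<float> taps_;
        std::vector<float> delay_;          // most recent sample first
    };

In the actual framework, emit() would hand the block off to the queueing library described below rather than to an in-memory deque.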
The WaveScope data model also provides a library for managing memory and passing blocks of signal data between these operators. Applications in this model are built by chaining operators together. So this is, roughly, a snippet of some of the code we wrote for the software radio. You create a box, let's say a FIRFilter, which processes elements of type float, and it takes in some arguments to initialize the filter's parameters, and so on. You create a white noise generator and hook up the filter to the white noise generator. We used this to simulate a simple channel, [INAUDIBLE] channel.

So I'll just describe the components of our framework. We have a lightweight scheduler on both the PPE and the SPEs. Right now, it uses a static operator mapping, in the sense that you have to specify a static configuration file where you say this operator name will run on this SPE number. We have not implemented dynamically reconfiguring the mapping at runtime, and we haven't yet seen the need for doing that, so it wouldn't be too hard to add if needed. But right now, you can easily shuffle around the operator mapping by tweaking the configuration file.

Signal blocks, as I said, were adapted from WaveScope. They use reference counting and avoid in-memory copies, which can be quite expensive, especially on the Cell. They also provide a convenient API to manipulate signals, so you don't have to do much of the memory management yourself or debug any of the hard problems to do with memory management. We also ported this library to ensure that data is aligned for you automatically and transported via queues.

So one of the major things we had to implement was a queueing library and remote heap management, what amounted essentially to a remote heap management library. In some sense, we faced a choice here. We could either have the PPE control and allocate memory statically and make all the decisions about what memory is allocated where, or we could have the SPEs manage it themselves.
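Going back to the chaining snippet and the static mapping file just described, a rough sketch of what they might look like follows. The createBox and connect calls, the operator names, and the configuration syntax are assumptions for illustration; the actual framework may spell these differently.

    // Hypothetical sketch of chaining operators, in the spirit of the
    // snippet described above; not the exact framework API.
    Box *noise  = createBox< WhiteNoiseSource<float> >(/* noise power */ 0.1f);
    Box *filter = createBox< FIRFilter<float> >(/* taps, cutoff, ... */ 64, 0.25f);
    connect(noise, filter);   // the filter consumes the noise generator's output
                              // (used here to simulate a simple channel)

    // Hypothetical static configuration file read by the scheduler:
    // one operator name per line, mapped by hand to an SPE.
    //
    //   noise_src     spe 0
    //   channel_fir   spe 1
    //   demodulator   spe 2
    //   gain_control  spe 3

Shuffling operators between SPEs then only means editing lines in a file like this, not restructuring the pipeline code.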
We decided to go for the latter, partly because it was more dynamic and also because we weren't sure what the implications of all the control flow passing through the PPE would be. So we chose the second approach, which is autonomous memory management. When an SPE sends a streaming data element to another SPE, it doesn't have to explicitly ask the other SPE to allocate memory. It has a remote heap interface, so it can directly allocate data and write it to the other SPE. This remote heap is currently of fixed size, but it could be improved by having a bunch of queues share one heap. Right now, we have one remote heap for each queue between operators on different SPEs. So our system automatically handles pipelining streaming data from SPE to SPE using the DMA API. Micah will take over from here and describe the software radio implementation briefly.

MICAH BRODSKY: So our software radio implementation is relatively simple. We weren't pushing on that too hard, especially because it took the vast majority of our time just to get the framework to work. Saving the programmer trouble meant that we inherited a lot of trouble ourselves. It breaks down to about 25 boxes, which we map manually to the SPEs in our config file. It's about 3,000 lines of code, most of which is framework.

So I guess, if we were more put together, we'd have a nice diagram to show you. There's enough time; I'll try to draw a quick diagram on the chalkboard. We're simulating the sender, the channel, and the receiver. So the computation in question looks something like this. You have a bitstream, and you need to take it and convert it into some-- can you read anything? So you take a stream of bits in. Basically, you use a lookup mechanism to convert the bits to an analog waveform, and filter that to produce something that has a narrower spectrum. Running out of space here.
Then you multiply that against a sine wave.

ARVIND THIAGARAJAN: That's for modulation.

MICAH BRODSKY: And so you get-- you've probably seen pictures of this. It's very much like AM; it's basically binary AM. [INAUDIBLE]. It's one of the simplest things you can do.

Then, to simulate a channel, there's a random FIR filter, finite impulse response. What that means is it's basically taking copies of the input at different time offsets with random coefficients and summing them up. It's a huge multiply-add computation.

AUDIENCE: [INAUDIBLE]

MICAH BRODSKY: This is 80 taps. Then we add some Gaussian noise. And then, from here, we take over and try to figure out what we put in in the first place. So, a bunch of filtering. Again, more finite impulse response filtering. There's a little closed loop that tries to estimate the signal amplitude and correct for it; that's called automatic gain control, to keep the amplitude roughly constant. And I'm probably getting these things a little out of order. This is the incoherent demodulation part: we square the signal to get rid of the carrier. Automatic gain control. Filtering. Then there's another loop; this is called a phase-locked loop. The idea is to try to match a sine wave to some input signal. I don't know how to explain it very well. It's basically a locking type of detector: the idea is to lock onto the phase of some periodic thing. This is for recovering when to sample. Because you've got this messy waveform, you've got to know when to look and say, OK, is it high or is it low, to get a bit out.

ARVIND THIAGARAJAN: [INAUDIBLE].

MICAH BRODSKY: Yeah. I think that gives the picture. I'm probably boring everybody. Here's a picture generated from the system. The green line is the data in: high, low. [INAUDIBLE]. The red line is the analog signal out, right before it's supposed to decide what the heck the input was.
This is after squaring, and filtering, and automatic gain control, and all that. The little blips are actually because we used a modulation called alternate mark inversion: it basically flips every one. That's why it's blipping instead of being constant, which is to be [INAUDIBLE]. And then the little blue daggers are the results of the phase-locked loop trying to recover when to sample. They're kind of off, but they're kind of right. And so, if you take the little blue blip: if the red line is above 0 there, it's a 1, and if it's below 0, it's a 0. And that's how you get your bits out.

This was hard to get right, mostly because of the framework issues. Implementing distributed objects on a system without real shared memory is hard, because you have to serialize everything into a stream of bits and deserialize it on the other side. So it really makes pie out of any existing object-oriented code. We did quite a bit of work to get decent lock-free, almost zero-copy transfer-- another day or so and we probably would have gotten it fully zero-copy-- streaming of the data from place to place. And we had to keep the code footprint low. C++ is bloated, and we don't have an overlay system yet, so if an SPE is not running a particular box, it still has to have the code for it. So we don't have any infrastructure for--

ARVIND THIAGARAJAN: There are some macros to get around that, right?

MICAH BRODSKY: Yeah. It's pretty messy. So all the code is on all the SPEs, so code bloat is particularly an issue. And XLC has this particular penchant for runtime type information and exception handling. An incredible amount of voodoo was necessary to get those, and that 70K of useless bloat, out of there.

ARVIND THIAGARAJAN: We pretty much have the--

MICAH BRODSKY: It works, but not always.

ARVIND THIAGARAJAN: We did manage to get it running long enough to get some measurements.

MICAH BRODSKY: Yeah, we did get some decent data out.
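Putting the pieces together, here is a compact sketch of the signal path just described: bits are mapped to a waveform, modulated onto a carrier, pushed through a random 80-tap FIR channel with noise, then squared (incoherent demodulation) and thresholded at the recovered sampling instants. This is a simplified single-file version for illustration only; in the actual system each stage is a separate operator, the receiver also has pulse-shaping filters, automatic gain control, and the phase-locked loop, and the transmitter uses alternate mark inversion, all of which are omitted here.

    // Simplified end-to-end sketch of the chain described above (not the
    // project's code): binary ASK modulation, a random FIR channel with
    // noise, square-law demodulation, and a threshold decision.
    #include <cmath>
    #include <cstdlib>
    #include <vector>

    // bits -> waveform via lookup (sps samples per bit), multiplied by a
    // carrier of angular frequency w radians per sample: binary AM / ASK
    std::vector<float> modulate(const std::vector<int> &bits, int sps, float w) {
        std::vector<float> x;
        for (std::size_t i = 0; i < bits.size(); ++i)
            for (int k = 0; k < sps; ++k)
                x.push_back((bits[i] ? 1.0f : 0.0f) * std::sin(w * float(x.size())));
        return x;
    }

    // random FIR channel: sum of delayed copies with random taps, plus a
    // crude noise term standing in for the Gaussian noise box
    std::vector<float> channel(const std::vector<float> &x, int ntaps, float noise) {
        std::vector<float> h(ntaps), y(x.size(), 0.0f);
        for (int k = 0; k < ntaps; ++k)
            h[k] = std::rand() / float(RAND_MAX) - 0.5f;
        for (std::size_t n = 0; n < x.size(); ++n) {
            for (int k = 0; k < ntaps && k <= int(n); ++k)
                y[n] += h[k] * x[n - k];               // the huge multiply-add
            y[n] += noise * (std::rand() / float(RAND_MAX) - 0.5f);
        }
        return y;
    }

    // incoherent demodulation: squaring removes the carrier; the real
    // receiver then low-pass filters and applies automatic gain control
    void square_demodulate(std::vector<float> &y) {
        for (std::size_t n = 0; n < y.size(); ++n)
            y[n] = y[n] * y[n];
    }

    // decision: at each sampling instant (recovered by the PLL in the real
    // system), threshold the demodulated signal to get a bit back out
    std::vector<int> decide(const std::vector<float> &d,
                            const std::vector<std::size_t> &sample_times,
                            float threshold) {
        std::vector<int> bits;
        for (std::size_t i = 0; i < sample_times.size(); ++i)
            if (sample_times[i] < d.size())
                bits.push_back(d[sample_times[i]] > threshold ? 1 : 0);
        return bits;
    }

Each of these functions corresponds to one or more boxes in the real pipeline; keeping them as separate operators is what lets the framework pipeline the stages across SPEs.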
Running on the PPE only, we can push about 170,000 samples per second through. And the scheduling file is kind of a rule of thumb: we just roughly said, OK, that looks about this big, we'll throw it on this SPE, this SPE, this SPE. We got roughly four times that using five SPEs.

ARVIND THIAGARAJAN: The code footprints are really large, but--

[INTERPOSING VOICES]

MICAH BRODSKY: We really had to push down things like our queue sizes. We just didn't have enough memory. It sucked. We basically just said that already.

ARVIND THIAGARAJAN: Some performance bottlenecks, I guess.

[INTERPOSING VOICES]

MICAH BRODSKY: Interesting performance behavior. We found that the SPEs are ridiculously underutilized. Most of the algorithms are quite a bit zippier on the SPEs, and so they may be running about a third of the time and spending the rest of the time just waiting for input. And then the PPE, which is doing only a tiny amount of the computation-- basically just feeding data in and sucking it out-- is spending half of its time busy and the other half of the time stuck in flow control waiting for queue space, which is-- our flow control algorithm sucks. It'll get better with time. So it seems like you should be able to do quite a bit better than we did with a bit more work.

We need to cut down the footprint. And once we have a little bit of breathing room and get rid of the nasty race bugs and such, we can finally investigate what was our original, pie-in-the-sky goal: automatically deciding what goes where, taking measurements of the performance and feeding that back to produce a better placement of operators, and applying data parallelism by instantiating operators on multiple different SPEs and splitting the data stream. And just doing more.

AUDIENCE: Since your PPE is at 40% to 50% utilization, did you put actual work on there?

MICAH BRODSKY: We did put some work on there.
Actually, we put work on there because we couldn't fit all the boxes-- the code for all the boxes-- on the SPEs. So we strategically put a few things at the beginning and a few things at the end, which weren't supposed to be very computationally intensive, and yet they managed to take up half the CPU. The issue with the other half of the CPU is that it's actually blocking deep inside an inner loop, because, basically, our back pressure isn't online yet. So if it tries to emit something to a queue, and that queue is full, it just stops. And it really could be running a whole bunch of other stuff, but it's not smart enough to do that.

AUDIENCE: [INAUDIBLE]

MICAH BRODSKY: Unlike a model like StreamIt, everything here is asynchronous and code driven. The SPUs can decide on the fly how much to emit. The programmer doesn't have to declare anything, but it means everything's asynchronous, and so you basically get race issues galore.

AUDIENCE: But in this application, do you find any dynamic rates?

MICAH BRODSKY: In this application, there's not much. It's a very simple application. If we actually went to packetization, error correction, compression, things like that, we'd probably see a lot more of that. This application definitely underutilizes the asynchronous capabilities of the system.

AUDIENCE: In the case of radio, wouldn't it be OK to drop [INAUDIBLE] frames of audio data? Since you're spending so much time waiting, it'd be better just to relieve pressure on the queues by dropping some frames that are unnecessary.

MICAH BRODSKY: It might well be.

AUDIENCE: And interpolating at the end to try to fix it up a little bit.

MICAH BRODSKY: It might well be. We decided not to do that as part of the framework because [INAUDIBLE] policy decision, and we didn't want to make that for all possible applications. But if we figure out a good way to expose that, it definitely would be an option.
Just drop a few packets. Drop a few samples.

AUDIENCE: So what about buffer sizes? Do you declare the buffer sizes as well?

MICAH BRODSKY: Yeah. The way we have it now, there's actually just a #define in the code that says all buffers are this size.

AUDIENCE: Do you need any double buffering in there?

MICAH BRODSKY: You don't need it. Well, actually, the way it works is that there's this remote heap input queue, and an upstream SPE just DMAs things in as it feels like. And then the downstream SPE looks: is there something here? Grab it, use it. So it just works, however much buffering there is.

ARVIND THIAGARAJAN: The ring buffer and the [INAUDIBLE] should get that block free.

MICAH BRODSKY: That's a benefit of the asynchronous approach. It just works, if you have memory, which we don't. The queues are tiny.

SPEAKER: [INAUDIBLE] is angrily telling me that his connection dropped, but he's not picking up the phone. Yeah, I can see it in there.

AUDIENCE: Any other questions for them? Did you want to do the demo?

MICAH BRODSKY: Nah.

SPEAKER ON PHONE: What did you do to get race conditions?

MICAH BRODSKY: How did we manage to get ourselves race conditions? Well, it's a mix of the fact that all the SPUs can essentially operate as independent threads and are sort of asynchronously DMAing things to each other, and somewhat poor, legacy-driven architectural decisions in the way the PPU code works, because we ported a lot of code from the WaveScope platform.

ARVIND THIAGARAJAN: Which wasn't very well documented.

MICAH BRODSKY: And that was intended to be multithreaded with Pthreads. In our original, naive, before-we-actually-started-the-class impression, we thought we were just going to port the whole thing and use Pthreads on the SPU, which of course is not possible. So there were still some threaded components in there, and we introduced a lot of bugs trying to port the thing.
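To make the queue being described concrete, here is a minimal sketch of the kind of fixed-size, single-producer, single-consumer remote-heap queue this implies: the upstream SPE DMAs blocks into a ring in the downstream SPE's local store and advances a head index, while the downstream SPE polls for a filled slot. The struct layout, names, and sizes are assumptions, the DMA call is a stand-in for the real MFC interface, and the head/tail synchronization that is elided here is exactly where the subtle races discussed next tend to live.

    // Hypothetical sketch of the per-link remote heap queue; not the real
    // library. QUEUE_SLOTS and BLOCK_BYTES play the role of the #define'd
    // buffer size mentioned above.
    #include <cstddef>
    #include <cstdint>

    const unsigned QUEUE_SLOTS = 4;
    const unsigned BLOCK_BYTES = 4096;

    // stand-in for the DMA API (an MFC put); implementation elided
    void dma_put(const void *local_src, uint64_t remote_dst_ea, unsigned bytes);

    struct RemoteQueue {
        uint64_t ring_ea;   // effective address of the ring in the downstream
                            // SPE's local store (its "remote heap")
        unsigned head;      // producer's next slot; how each side learns of the
        unsigned tail;      // other's index updates is elided in this sketch
    };

    // Producer (upstream SPE): allocate a slot in the remote heap and DMA the
    // block straight in, without asking the other SPE to allocate anything.
    bool try_send(RemoteQueue &q, const void *block) {
        if ((q.head + 1) % QUEUE_SLOTS == q.tail)
            return false;                             // full: back pressure, caller waits
        dma_put(block, q.ring_ea + uint64_t(q.head) * BLOCK_BYTES, BLOCK_BYTES);
        q.head = (q.head + 1) % QUEUE_SLOTS;
        return true;
    }

    // Consumer (downstream SPE): poll the local ring; if a slot is filled,
    // grab it and use it in place. A real implementation would advance tail
    // only after the block has been consumed.
    const void *try_receive(RemoteQueue &q, const void *ring_local) {
        if (q.tail == q.head)
            return 0;                                 // nothing here yet
        const void *block =
            static_cast<const char *>(ring_local) + q.tail * BLOCK_BYTES;
        q.tail = (q.tail + 1) % QUEUE_SLOTS;          // frees the slot for the producer
        return block;
    }

Getting the index handling a little bit wrong, so that one pointer laps the other, is exactly the kind of overrun described below.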
See if I can--

ARVIND THIAGARAJAN: The surprising thing was the fact that we had to replace array constructors in order to eliminate RTTI. And it was just a big deal--

[INTERPOSING VOICES]

MICAH BRODSKY: We had to get rid of new, we had to get rid of delete, we had to get rid of array constructors, we had to get rid of--

ARVIND THIAGARAJAN: Virtual destructors.

MICAH BRODSKY: --pure virtual functions and virtual destructors. It was a mess.

I guess one more pithy thing I could say about race conditions is that it's incredible how many subtle race conditions and bugs we found in the remote heap and queueing library. Because it's a lock-free, asynchronous data structure, there are two threads-- two SPUs-- reading and writing from it at the same time. And there are all sorts of little subtleties where, if you get it a little bit wrong, you end up with one of the queue pointers overrunning the other one, and then you have basically dangling pointers and things like that. At least three or four times, we spent a few hours until we discovered that was the cause of mysterious behavior.

AUDIENCE: Anything else? Thank you.

[APPLAUSE]