The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

MICHAEL PERRONE: So my name's Michael Perrone. I'm at the T.J. Watson Research Center, IBM Research. I do all kinds of things in research, but most recently-- that's not what I want. There we go. Most recently I've been working with the Cell processor for the past three years or so. I don't want that. How's that? And because I do have to run out for a flight, I have my e-mail here; if you want to ask me questions, feel free to do that. What I'm going to do in this presentation is, as Saman suggested, talk in depth about the Cell processor, but really it's still going to be just the very surface, because you're going to have a month to go into a lot more detail.
But I want to give you a sense for why it was created, the way it was created, what it's capable of doing, and what programming considerations have to be kept in mind when you program.

Here's the agenda just for this section, Mike, of this class. I'll give you some motivation. This is going to be a bit of a repeat, so I'll go through it fairly quickly. I'll talk about the design concepts, hardware overview, performance characteristics, application affinity-- what good is this device? I'll talk about the software, and this I imagine is one of the areas where you're going to go into a lot of detail in the next month because, as you suggested, the software really is the issue. And I would actually go a little further and say: why do people drive such large cars in the U.S.? Why do they waste so much energy? The answer is very simple. It's because it's cheap. Even at $3 a gallon, it's cheap compared to, say, Europe and other places. The truth is, it's the same thing with programmers. Why did programmers program the way they did in the past 10, 20 years? Because cycles were cheap.
They knew Moore's law was going to keep going, so you could implement some algorithm and not worry about the details, as long as you got the right power law-- if you got your n squared or n cubed or n log n, whatever behavior. The details-- whether the multiplying factor was 10 or 100-- didn't matter. Eventually Moore's law would solve that problem for you, so you didn't have to be efficient. And I think I've spent the better part of three years trying to fight against that, and you're going to learn in this class that, particularly for multicore, you have to think very hard about how you're going to get performance.

This is actually the take-home message that I want to give. I think it's just one or two slides, but we really need to get to these, because that's where I want to get you thinking along the right lines. And then there's a hardware consideration; we can skip that.

All right, so where have all the gigahertz gone, right? We saw Moore's law, things getting faster and faster, and the answer is-- I have a different chart that's basically the same thing.
You have relative device performance on this axis and you've got the year here. And different technologies were growing, growing, growing, but now you see they're thresholding. And you go to conferences now, architecture conferences, and people are saying Moore's law is dead. Now, I don't know if I would go that far, and I know there are true believers out there who say, well, maybe the silicon-on-insulator technology is dead, but there'll be something else. And maybe that's true, and maybe that is multicore, but unless we get the right programming models in place, it's not going to be multicore.

Here's this power density graph. Here we have the nuclear reactor power up here, and you see Pentiums going up now. Of course, it's a log plot, so we're far away, but on this axis we're not far away. This is how much we shrink the technology, the size of those transistors. So if we're going down by 2 every 18 months or so-- maybe it's 2 years now-- we're not so far away from that nuclear reactor output. And that's a problem. And what's really causing that problem?
Here's a picture of one of these gates magnified a lot, and here's the interface magnified even further, and you see here's this dielectric that's insulating between the two sides of the gate-- we're reaching a fundamental limit: a few atomic layers. You see here it's like 11 angstroms. What's that, 10, 11 atoms across? If you go back to basic physics, you know that quantum mechanical objects like electrons tunnel, right? And they tunnel through barriers with kind of an exponential decay. So whenever you shrink this further you get more and more leakage, so the current is leaking through. In this graph, what you see is that as this size gets smaller, the leakage current is getting equivalent to the active power. So even when it's not doing anything, this 65-nanometer technology is leaking as much power as it actually uses. And eventually, as we get smaller and smaller, we're going to be using more power just leaking stuff away, and that's really bad, because as Saman suggested, we have people like Google putting this stuff near the Coulee Dam so that they can get power.
I deal with a lot of customers who have tens of thousands of nodes-- 50,000 processors, 100,000 processors. They're using 20 gigabytes-- sorry, megahertz-- no, megawatts, that's what I want to say. It's too early in the morning. Tens of megawatts to power their installations, and they're choosing sites specifically to get that power, and they're limited. So they come to me, they come to people at IBM, and they say, what can we do about power? Power is a problem. And that's why we're not seeing increasing gigahertz.

Has this ever happened before? Well, I'm going to go through this quickly: yes. Here we see the power output of a steam iron, right there, per unit area. And something's messed up here. You see, as the technology changed from bipolar to CMOS we were able to improve the performance, but the heat flux got higher again, and that raises the question: what's going to happen next? And of course IBM, Intel, AMD, they're all betting on multicore. And so there's an opportunity from a business point of view. So now, that's the intro.
Multicore: how do you deal with it? Here's a picture of the chip, the Cell processor. You can see these 8 little black dots. They're local memory for each one of 8 special-purpose processors, as well as a big chunk over here, which is a ninth processor. So this chip has 9 processors on board, and the trick is to design it so that it addresses the issues that we just discussed.

So let me put this in context. Cell was created for the Sony PlayStation 3. It started in about 2000, and there's a long development here until it was finally announced over here. Where was it first announced? It was announced several years later, and IBM announced a Cell blade about a year back, and we're pushing these blades, and we're very much struggling with the programming model. How do you get performance while making something programmable? If you go to customers and they have 4 million lines of code, you can't tell them to just port it-- it'll be 80 person-years to get it ported, 100 person-years more. And then you have to optimize it. So there are problems, and we'll talk about that.
But it was created in this context, and because of that, this chip in particular is a commodity processor, meaning that it's going to be selling millions and millions. The Sony PlayStation 2 sold an average of 20 million units each year for 5 years, and we expect the same for the PlayStation 3. So the Cell has a big advantage over other multicore processors like the Intel Woodcrest, which has a street price of about $2,000, while the Cell is around $100. So not only do we have big performance improvements, we have price advantages too, because of that commodity market.

All right, let's talk about the design concept. Here's a little bit of a rehash of what we discussed, with some interesting words here. We're talking about a power wall, a memory wall, and a frequency wall. So we've talked about this frequency wall. We're hitting this wall because of the power, really. And the power wall-- people just don't have enough power coming into their buildings to keep these things going. But the memory wall-- Saman didn't actually use that term, but that's the fact that as clock frequencies got higher and higher, memory appeared further and further away:
the more cycles I have to wait as a processor before the data comes in. And so that changes the whole paradigm of how you have to think about it. We have processors with lots of cache, but is cache really what you want? Well, it depends. If you have a very localized process, where you're going to bring something into cache and the data is going to be reused, then that's really a good thing to do. But what if you have random gather and scatter of data? You know, you're doing some transactional processing, or whatever mathematical function you're calculating is very distributed, like an FFT. So you have to do all sorts of accesses through memory, and it doesn't fit in that cache. Well, then you can start thrashing the cache. You bring in one integer, and then you ask the cache for the next thing; it's not there, so it has to go out again, and so you spend all this time wasting time getting stuff into cache. So what we're pushing for multicore, and especially for Cell, is the notion of a shopping list. And this is where programmability comes in and programming models come in.
You really need to think ahead of time about what your shopping list is going to be, and the analogy that people have been using is: you're fixing something in your house, your pipe breaks. So you go and say, oh, I need a new pipe. So you go to the store, you get a pipe. You bring it back and say, oh, I need some putty. So you go to the store, you get some putty. And oh, I need a wrench. Go to the store-- that's what cache is. You figure out what you need when you need it. In the Cell processor you have to think ahead and make a shopping list: if I'm going to do this calculation, I need all these things. I'm going to bring them all in, and I'm going to start calculating. While I'm calculating on that, I'm going to go get my other shopping list, so that I can have some concurrency of the data load with the compute.

I'm going to skip this here. You can read that later; it's not that important. Cell synergy-- now this is kind of, you know, an apple pie, motherhood kind of thing. The Cell processor was specifically designed so that those 9 cores are synergistic, so that they interoperate very efficiently.
Now, I told you we have 8 identical processors; we call those SPEs. And the ninth processor is the PPE. It's been designed so that the PPE is running the OS and doing all the transactions, file systems, and whatnot, so that the SPEs can focus on what they're good at, which is compute. The whole thing is pulled together with an element interconnect bus, and we'll talk about that. It's a very, very efficient, very high bandwidth bus.

Now we're going to talk about the detailed hardware components. And Rodric-- somewhere, there you are-- asked me to actually dig down into more of the hardware. I would love to do that. Honestly, I'm not a hardware person. I'll do the best I can; perhaps at the end of the talk we'll dig down, and show me which slides you want. But I've been dealing with this for so long that I can do a decent job. Here's another picture of the chip. It has lots of transistors. This is the size. We talked about the 9 cores; it has 10 threads, because this power processor, the PPE, has 2 threads, and each of the SPEs is single-threaded. And this is the wow factor.
We have 200 gigaflops-- over 200 gigaflops-- of single precision performance on these chips, and over 20 gigaflops of double precision, and that will be going up to 100 gigaflops by the end of this year. The bandwidth to main memory is 25 gigabytes per second, and there's up to 75 gigabytes per second of I/O bandwidth. Now, this chip really has tremendous bandwidth, but what we've seen so far-- particularly with the Sony PlayStation, and I think you may have lots of them here-- is that the board is not designed to really take advantage of that bandwidth. And even the blades that IBM sells really can't get that type of bandwidth off the blade. And so if you're keeping everything local on the blade or on the PlayStation 3, then you have lots of bandwidth internally. But off blade or off board, you really have to survive with something like a gigabyte-- 2 gigabytes in the future. And this element interconnect bus I mentioned before has a tremendous bandwidth, over 300 gigabytes per second. The top frequency in the lab was over 4 gigabytes-- gigahertz, sorry. And the chips are currently running, when you buy them, at 3.2 gigahertz.
And actually, the PlayStation 3's that you're buying today, I think, only use 7 out of the 8 SPEs. And that was a design consideration from the hardware point of view, because these chips get bigger and bigger-- if you can't ratchet up the gigahertz, you have to spread out. And as they get bigger, flaws in the manufacturing process lead to faulty units. So instead of just throwing chips away, if one of the SPEs is bad we don't use it, and we just use 7. As the manufacturing process gets better, by the end of this year they'll be using 8. The blades that IBM sells are all set up for 8.

OK, so here's a schematic view of what you just saw on the previous slide. You have these 8 SPEs. You have the PPE here with its L1 and L2 cache. You have the element interconnect bus connecting all of these pieces together to a memory interface controller and a bus interface controller. And so this MIC is what has the 25.6 gigabytes per second, and this BIC has potentially 75 going out here. Each of these SPEs has its own local store. Those are those little black dots that you saw, those 8 black dots.
It's not very large-- it's a quarter of a megabyte-- but it's very fast to this SXU, the processing unit. It's only 6 cycles away from that unit. And it's fully pipelined, so that if you feed that pipeline you can get data every cycle. And here, the thing that you can't read because it's probably too dark, is the DMA engine. So one of the interesting things about this is that each one of these is a full-fledged processor. It can access main memory independent of the PPE. So you can have 9 processes-- or 10 if you're running 2 threads here-- all going simultaneously, all independent of one another. And that allows for a tremendous amount of flexibility in the types of algorithms you can implement. And because of this bus here, you can see it's 96 bytes per cycle, and we're at 3.2 gigahertz. I think that's 288 gigabytes per second. These guys can communicate with one another across this bus without ever going out to main memory, and so they can get much faster access to their local memories. So if you're doing lots of computes internally here, you can scream on this processing; really, really go fast.
And you can do the same if you're going out through the memory interface controller to main memory, if you sufficiently hide that memory access. So we'll talk about that.

All right, this is the PPE that I mentioned before. It's based on the IBM POWER family of processors; it's a watered-down version to reduce the power consumption. So it doesn't have the horsepower that you see in, say, a Pentium 4, or even-- actually, I don't have an exact comparison point for this processor, but if you take the code that runs today on your Intel or AMD, whatever your processor, and you recompile it on Cell, it'll run today-- maybe you have to change a library or two, but it'll run today here, no problem. But it'll be about 60% slower, 50% slower, and so people say, oh my god, this Cell processor's terrible. But that's because you're only using that one piece. So let's look at the other-- OK, so now we go into details of the PPE. Half a megabyte of L2 cache here, coherent load/stores. It does have a VMX unit, so you can do some SIMD operations-- single instruction, multiple data instructions. It's two-way hardware multithreaded.
Then there's the EIB that goes around here. It's composed of four 16-byte data rings. And you can have multiple simultaneous transfers per ring, for a total of over 100 outstanding requests simultaneously. But this slide kind of hides it under the rug: there's a certain topology here. These rings are connected to those 8 SPEs, and depending on which way you send things, you'll have better or worse performance. Some of these buses are going around clockwise and some are going counterclockwise. And because of that, you have to know who you're communicating with if you want really high efficiency. I haven't personally seen cases where it made a really big difference, but I do know there are some people who found, if I'm going from here to here, I want to make sure I'm sending things the right way because of that connectivity. Or else I could be sending things all the way around and waiting.

AUDIENCE: Just a quick question.

MICHAEL PERRONE: Yes.
AUDIENCE: Like you said, you could compile anything on the power processor-- it would be slower, but you can. Now, you also said the Cell processor is in itself a [INAUDIBLE] processor. Can I compile C code just for that as well?

MICHAEL PERRONE: C code would compile. There are issues with libraries, because the libraries wouldn't necessarily be ported to the SPE. If they had been, then yes. This is actually a very good question. It opens up lots of things. I don't know if I should take that later.

PROFESSOR: Take it later.

MICHAEL PERRONE: Bottom line is, this chip has two different processors, and therefore you need two different compilers, and it generates two different object codes. In principle, SPEs can run a full OS, but they're not designed to do that and no one's ever actually tried. So you could imagine having 8 or 9 OSes running on this processor if you wanted. A terrible waste from my perspective, but OK. So let's talk about these a little bit. Each of these SPEs has, like I mentioned, this memory flow controller here, an atomic update unit, the local store, and the SPU, which is actually the processing unit.
423 00:20:54,900 --> 00:21:01,370 Each SPU has a register file with 128 registers. 424 00:21:01,370 --> 00:21:04,140 Each register is 128 bits. 425 00:21:04,140 --> 00:21:09,340 So they're native SIMD, there are no scalar registers here 426 00:21:09,340 --> 00:21:12,060 for the user to play with. 427 00:21:12,060 --> 00:21:15,220 If you want to do scalar ops they'll be running in those 428 00:21:15,220 --> 00:21:18,420 full vector registers, but you'll just be wasting some 429 00:21:18,420 --> 00:21:19,670 portion of that register. 430 00:21:22,340 --> 00:21:25,760 It has IEEE double precision floating point, but it doesn't 431 00:21:25,760 --> 00:21:29,100 have IEEE single precision floating point. 432 00:21:29,100 --> 00:21:32,950 It's a curiosity, but that, again, came from the history. 433 00:21:32,950 --> 00:21:36,420 The processor was designed for the gaming industry and the 434 00:21:36,420 --> 00:21:38,850 gamers, they didn't care if it had IEEE. 435 00:21:38,850 --> 00:21:39,910 Who cares about IEEE? 436 00:21:39,910 --> 00:21:42,020 What I want is to have good monsters right on the screen. 437 00:21:45,590 --> 00:21:51,500 And so those SIMD registers can operate bitwise on bytes, 438 00:21:51,500 --> 00:21:57,020 on shorts, on four words at a time or two doubles at a time. 439 00:21:59,950 --> 00:22:06,210 The DMA engines here, each DMA engine can have up to 16 440 00:22:06,210 --> 00:22:09,430 outstanding requests in its queue before it stalls. 441 00:22:09,430 --> 00:22:12,680 So you can imagine you're writing something, some code 442 00:22:12,680 --> 00:22:15,210 and you're sending things out to the DMA and then all of a 443 00:22:15,210 --> 00:22:18,060 sudden you see really bad performance, it could be that 444 00:22:18,060 --> 00:22:20,210 your DMA engine has stalled the entire processor.
445 00:22:20,210 --> 00:22:23,300 If you try to write to that thing and then that queue is 446 00:22:23,300 --> 00:22:27,230 full, it just waits until the next open slot is available. 447 00:22:27,230 --> 00:22:31,040 So those are the kinds of considerations. 448 00:22:31,040 --> 00:22:34,460 AUDIENCE: You mean [UNINTELLIGIBLE PHRASE] 449 00:22:34,460 --> 00:22:35,352 MICHAEL PERRONE: Yes. 450 00:22:35,352 --> 00:22:37,360 AUDIENCE: It's not the global one? 451 00:22:37,360 --> 00:22:37,900 MICHAEL PERRONE: Right. 452 00:22:37,900 --> 00:22:39,590 That's correct. 453 00:22:39,590 --> 00:22:42,000 But there is a global address space. 454 00:22:42,000 --> 00:22:45,070 AUDIENCE: 16 slots each in each SPU. 455 00:22:45,070 --> 00:22:45,910 MICHAEL PERRONE: Right. 456 00:22:45,910 --> 00:22:46,450 Exactly. 457 00:22:46,450 --> 00:22:51,570 Each MFC has its own 16 slots. 458 00:22:51,570 --> 00:22:54,450 And they all address the same memory. 459 00:22:54,450 --> 00:22:57,540 They can have a transparent memory space or they can have 460 00:22:57,540 --> 00:22:59,280 a partitioned memory space depending on 461 00:22:59,280 --> 00:22:59,920 how you set it up. 462 00:22:59,920 --> 00:23:03,809 AUDIENCE: Each SPU doesn't have its own-- the DMA goes 463 00:23:03,809 --> 00:23:05,267 onto the bus, [UNINTELLIGIBLE] 464 00:23:07,985 --> 00:23:10,850 that goes to a connection to the [UNINTELLIGIBLE]. 465 00:23:14,235 --> 00:23:16,570 PROFESSOR: You can add this data in the SPUs too. 466 00:23:16,570 --> 00:23:18,530 You don't have to always go to outside memory. 467 00:23:18,530 --> 00:23:20,690 You can do SPU to SPU communication basically. 468 00:23:20,690 --> 00:23:21,250 MICHAEL PERRONE: Right. 469 00:23:21,250 --> 00:23:23,700 So I can do a DMA that transfers memory from this 470 00:23:23,700 --> 00:23:27,760 local store to this one if I wanted to and vice versa.
471 00:23:27,760 --> 00:23:29,590 And I can pull stuff in through the-- 472 00:23:32,920 --> 00:23:34,350 yeah, I mentioned this stuff. 473 00:23:37,800 --> 00:23:43,710 Now this broadband interface controller, the BIC, this is 474 00:23:43,710 --> 00:23:47,660 how you get off the blade or off the board. 475 00:23:47,660 --> 00:23:51,570 It has 20 gigabytes per second here on I/O IF. 476 00:23:54,790 --> 00:23:56,410 In 10 over here--I'm sorry, 5 over here. 477 00:23:56,410 --> 00:24:00,700 I'm trying to remember how we get up to 70. 478 00:24:00,700 --> 00:24:04,260 This is actually two-way and one is 25 and 479 00:24:04,260 --> 00:24:04,990 the other one's 30. 480 00:24:04,990 --> 00:24:08,100 That gets you to 55. 481 00:24:08,100 --> 00:24:09,920 This should be 10 and now, what's going on here? 482 00:24:14,310 --> 00:24:16,790 It adds up to 75, I'm sure. 483 00:24:16,790 --> 00:24:18,040 I'm sure about that. 484 00:24:20,790 --> 00:24:22,850 I don't know why that says that. 485 00:24:22,850 --> 00:24:25,730 But the interesting thing about this over here, this I/O 486 00:24:25,730 --> 00:24:30,670 IF zero is that you can use it to connect two 487 00:24:30,670 --> 00:24:32,130 cell processors together. 488 00:24:32,130 --> 00:24:35,180 So this is why I know it's really 25.6 because it's 489 00:24:35,180 --> 00:24:38,110 matched to this one. 490 00:24:38,110 --> 00:24:42,690 So you have 25.6 going out to main memory, but this one can 491 00:24:42,690 --> 00:24:45,240 go to another processor, so now you have these two 492 00:24:45,240 --> 00:24:49,140 processors side-by-side connected at 25.6 gigabytes 493 00:24:49,140 --> 00:24:49,880 per second. 494 00:24:49,880 --> 00:24:52,360 And now I can do a memory access through here to the 495 00:24:52,360 --> 00:24:56,270 memory that's on this processor and vice versa. 
496 00:24:56,270 --> 00:24:59,090 However, if I'm going straight out to my memory it's going to 497 00:24:59,090 --> 00:25:01,300 be faster than if I go out to this memory. 498 00:25:01,300 --> 00:25:04,220 So you have a slight NUMA architecture, nonuniform 499 00:25:04,220 --> 00:25:05,320 memory access. 500 00:25:05,320 --> 00:25:09,220 And you can hide that with sufficient multibuffering. 501 00:25:12,090 --> 00:25:14,910 So I know that this is 25 and I know the other one's 30. 502 00:25:14,910 --> 00:25:17,070 I don't know why it's written as 20 there. 503 00:25:17,070 --> 00:25:18,970 AUDIENCE: Can the SPUs write to the 504 00:25:18,970 --> 00:25:21,600 [UNINTELLIGIBLE PHRASE]? 505 00:25:21,600 --> 00:25:24,370 MICHAEL PERRONE: Yes, they can read from it. 506 00:25:24,370 --> 00:25:27,220 I don't know if they can write to it. 507 00:25:27,220 --> 00:25:29,790 In fact, that leads to a bottleneck occurring. 508 00:25:29,790 --> 00:25:34,850 So I happily start a process on my PPE and then I tell all 509 00:25:34,850 --> 00:25:37,340 my SPEs, start doing some number crunching. 510 00:25:37,340 --> 00:25:38,420 So they do that. 511 00:25:38,420 --> 00:25:41,690 They get access to memory, but they find the memory is in L2. 512 00:25:41,690 --> 00:25:44,440 So they start pulling from L2, but now all 8 are pulling from 513 00:25:44,440 --> 00:25:47,820 L2 and it's only 7 gigabytes per second instead of 25 and 514 00:25:47,820 --> 00:25:49,180 so you get a bottleneck. 515 00:25:49,180 --> 00:25:51,660 And so what I tell everybody is if you're going to 516 00:25:51,660 --> 00:25:54,520 initialize data with that PPE make sure you flush your cache 517 00:25:54,520 --> 00:25:59,210 before you start the SPEs.
518 00:25:59,210 --> 00:26:02,010 And then you don't want to be touching that memory because 519 00:26:02,010 --> 00:26:04,380 you really want to keep things-- stuff that the SPEs 520 00:26:04,380 --> 00:26:06,330 are dealing with-- you want to keep it out of L2 cache. 521 00:26:12,380 --> 00:26:14,020 Here there's an interrupt controller. 522 00:26:17,050 --> 00:26:19,540 An I/O bus master translation unit. 523 00:26:19,540 --> 00:26:22,850 And you know, these allow for messaging and message passing 524 00:26:22,850 --> 00:26:24,340 and interrupts and things of that nature. 525 00:26:27,450 --> 00:26:29,130 So that's the hardware overview. 526 00:26:29,130 --> 00:26:30,820 Any questions before I move on? 527 00:26:37,950 --> 00:26:39,900 So why's the cell processor so fast? 528 00:26:39,900 --> 00:26:43,250 Well, 3.2 gigahertz, that's one. 529 00:26:43,250 --> 00:26:45,630 But there's also the fact that we have 8 SPEs. 530 00:26:45,630 --> 00:26:51,140 Each of the 8 SPEs has SIMD units and registers, so 531 00:26:51,140 --> 00:26:56,090 they can do this parallel processing on a chip. 532 00:26:56,090 --> 00:27:01,440 We have 8 SPEs and each one is doing up to 8 ops per 533 00:27:01,440 --> 00:27:03,760 cycle if you're doing a mul-add. 534 00:27:03,760 --> 00:27:07,730 So you have four mul-adds for single precision. 535 00:27:07,730 --> 00:27:15,340 So you've got 8, that's 64 ops per cycle times 3.2. 536 00:27:15,340 --> 00:27:20,040 You get up to 200 gigaflops, 204.8. 537 00:27:20,040 --> 00:27:23,970 So that's really the main reason. 538 00:27:23,970 --> 00:27:25,650 We've talked about this stuff here. 539 00:27:25,650 --> 00:27:29,810 This is an image of why it's faster.
540 00:27:29,810 --> 00:27:32,160 Instead of staging and bringing the data through the 541 00:27:32,160 --> 00:27:34,740 L2, which is kind of what we were just discussing and 542 00:27:34,740 --> 00:27:39,220 having this PU, this processing unit, the PPE 543 00:27:39,220 --> 00:27:42,640 manage the data coming in, each one can do it themselves 544 00:27:42,640 --> 00:27:45,030 and bypass this bottleneck. 545 00:27:45,030 --> 00:27:47,410 So that's something you have to keep in the back of your 546 00:27:47,410 --> 00:27:48,380 mind when you're programming. 547 00:27:48,380 --> 00:27:52,140 You really want to make sure that you get this processor 548 00:27:52,140 --> 00:27:52,720 out of there. 549 00:27:52,720 --> 00:27:54,230 You don't want it in your way. 550 00:27:54,230 --> 00:27:56,540 Let these guys do as much of their own work as they can. 551 00:27:59,780 --> 00:28:03,030 Here's a comparison of theoretical peak performance 552 00:28:03,030 --> 00:28:08,200 of cell versus Freescale, AMD, Intel over here. 553 00:28:08,200 --> 00:28:08,720 Very nice. 554 00:28:08,720 --> 00:28:11,170 That's the wow chart. 555 00:28:11,170 --> 00:28:15,860 That's the theoretical peak; in practice, what did we see? 556 00:28:15,860 --> 00:28:18,410 I don't know if you can read these numbers but what you 557 00:28:18,410 --> 00:28:20,750 really want to focus on is the first and last columns. 558 00:28:20,750 --> 00:28:23,460 This is the type of calculation, high performance 559 00:28:23,460 --> 00:28:26,470 computing like matrix multiplication, 560 00:28:26,470 --> 00:28:28,910 bioinformatics, graphics, security, it was really 561 00:28:28,910 --> 00:28:31,150 designed for graphics. 562 00:28:31,150 --> 00:28:33,850 Security, communication, video processing and over here you 563 00:28:33,850 --> 00:28:40,470 see the advantage against an IA-32, a G5 processor. 564 00:28:40,470 --> 00:28:46,510 And you see 8x, 12x, 15, 10, 18x.
565 00:28:46,510 --> 00:28:48,270 Very considerable improvement in performance. 566 00:28:48,270 --> 00:28:49,557 In the back-- question? 567 00:28:49,557 --> 00:28:51,841 AUDIENCE: [UNINTELLIGIBLE] previous slide, how did it 568 00:28:51,841 --> 00:28:55,140 compare to high [UNINTELLIGIBLE PHRASE]? 569 00:28:55,140 --> 00:28:57,020 MICHAEL PERRONE: All right, so you're thinking like a 570 00:28:57,020 --> 00:28:58,833 PeakStream or something like that? 571 00:28:58,833 --> 00:29:01,400 AUDIENCE: Any particular [UNINTELLIGIBLE PHRASE]. 572 00:29:01,400 --> 00:29:05,506 The design of the SPUs is very reminiscent of 573 00:29:05,506 --> 00:29:06,860 [UNINTELLIGIBLE PHRASE]. 574 00:29:06,860 --> 00:29:11,480 MICHAEL PERRONE: So I believe, and I'm not well versed in all 575 00:29:11,480 --> 00:29:12,560 of the processors that are out there. 576 00:29:12,560 --> 00:29:14,090 I think that we still have a performance 577 00:29:14,090 --> 00:29:17,850 advantage in that space. 578 00:29:17,850 --> 00:29:19,260 You know, I don't know about Xilinx and 579 00:29:19,260 --> 00:29:20,490 those kind of things-- 580 00:29:20,490 --> 00:29:25,850 FPGAs I don't know, but what I tell people is 581 00:29:25,850 --> 00:29:26,890 this: there's a spectrum. 582 00:29:26,890 --> 00:29:29,150 And at one end you have your general purpose processors. 583 00:29:29,150 --> 00:29:32,390 You've got your Intel, you've got your Opteron whatever, 584 00:29:32,390 --> 00:29:33,540 your power processor. 585 00:29:33,540 --> 00:29:37,410 And then at the other end you've got your FPGAs and DSPs 586 00:29:37,410 --> 00:29:39,960 and then maybe over here, somewhere in the middle you've 587 00:29:39,960 --> 00:29:42,230 got graphical processing units. 588 00:29:42,230 --> 00:29:43,970 Like Nvidia kind of things.
589 00:29:43,970 --> 00:29:47,210 And then somewhere between those graphics 590 00:29:47,210 --> 00:29:49,060 processors and the general purpose 591 00:29:49,060 --> 00:29:52,360 processors you've got cell. 592 00:29:52,360 --> 00:29:57,040 You get a significant improvement in performance, 593 00:29:57,040 --> 00:29:59,340 but you have to pay some pain in programming. 594 00:29:59,340 --> 00:30:01,350 But not nearly as much as you have to do with the graphics 595 00:30:01,350 --> 00:30:06,150 processors and nowhere near the FPGAs, where just 596 00:30:06,150 --> 00:30:08,220 every time you write something you have to rewrite 597 00:30:08,220 --> 00:30:10,980 everything. 598 00:30:10,980 --> 00:30:11,520 Question? 599 00:30:11,520 --> 00:30:13,848 AUDIENCE: Somewhat related to the previous question, but 600 00:30:13,848 --> 00:30:16,253 with a different angle. 601 00:30:16,253 --> 00:30:19,540 I always figured anyone could do a [INAUDIBLE], so that's 602 00:30:19,540 --> 00:30:21,010 why I ask about FFTs. 603 00:30:21,010 --> 00:30:25,590 Are they captured on the front or otherwise [UNINTELLIGIBLE] 604 00:30:25,590 --> 00:30:27,640 MICHAEL PERRONE: Yeah, so this is actually one of the things 605 00:30:27,640 --> 00:30:29,660 I spent a lot of time on for FFTs. 606 00:30:29,660 --> 00:30:32,750 I spent a lot of time with the petroleum industry. 607 00:30:32,750 --> 00:30:36,590 They take these enormous boats, they have these arrays 608 00:30:36,590 --> 00:30:39,460 that go 5 kilometers back and 1 kilometer wide, they drag 609 00:30:39,460 --> 00:30:41,800 them over the ocean, and they make these noises and they 610 00:30:41,800 --> 00:30:43,240 record the echo. 611 00:30:43,240 --> 00:30:45,010 And they have to do this enormous FFT and it 612 00:30:45,010 --> 00:30:47,580 takes them 6 months. 613 00:30:47,580 --> 00:30:49,690 Depending on the size of the FFT it can be anywhere from a 614 00:30:49,690 --> 00:30:51,665 week to 6 months, literally.
615 00:30:51,665 --> 00:30:52,270 AUDIENCE: [UNINTELLIGIBLE]. 616 00:30:52,270 --> 00:30:52,860 MICHAEL PERRONE: Sorry? 617 00:30:52,860 --> 00:30:55,740 AUDIENCE: Is this a PD FFT? 618 00:30:55,740 --> 00:31:00,690 MICHAEL PERRONE: Sometimes I do too, but they do both. 619 00:31:00,690 --> 00:31:03,250 I've become somewhat of an expert on these FFTs. 620 00:31:03,250 --> 00:31:06,610 For cell the best performance number I know of is about 90 621 00:31:06,610 --> 00:31:08,390 gigaflops of FFT performance. 622 00:31:11,960 --> 00:31:14,630 You know, that's very good. 623 00:31:14,630 --> 00:31:17,590 Yeah, it's like 50% of peak performance. 624 00:31:17,590 --> 00:31:21,320 You know, it's easy to get 98% with [? lynpacker ?] 625 00:31:21,320 --> 00:31:22,890 or [? djem ?] 626 00:31:22,890 --> 00:31:28,320 on a processor like this, and we have. We get 97% of peak 627 00:31:28,320 --> 00:31:31,845 performance, but it's a lot harder to get FFTs up to that. 628 00:31:31,845 --> 00:31:34,005 AUDIENCE: Well, then I'll [INAUDIBLE] the next question 629 00:31:34,005 --> 00:31:36,529 then, which is somehow or another you get the FFT 630 00:31:36,529 --> 00:31:39,435 performance, you've got to get the data at the right 631 00:31:39,435 --> 00:31:39,535 place at the right time. 632 00:31:39,535 --> 00:31:39,560 [UNINTELLIGIBLE] 633 00:31:39,560 --> 00:31:42,940 So you've personally done that or been involved with that? 634 00:31:42,940 --> 00:31:44,560 MICHAEL PERRONE: Right, so we do a lot of tricks. 635 00:31:44,560 --> 00:31:47,080 I can show you another slide or another presentation that 636 00:31:47,080 --> 00:31:51,880 we talk about this, but typically the FFTs that we 637 00:31:51,880 --> 00:31:58,920 work with are somewhere from a 1024 to 2048, that's square.
638 00:31:58,920 --> 00:32:04,700 And so it's possible to take say, the top 4 rows-- 639 00:32:04,700 --> 00:32:08,540 in the case of 1024, four rows complex, single precision I 640 00:32:08,540 --> 00:32:11,300 think is 16 kilobytes. 641 00:32:11,300 --> 00:32:13,340 That fits into the local store very nicely. 642 00:32:13,340 --> 00:32:14,690 So you can start multibuffering. 643 00:32:14,690 --> 00:32:16,620 You bring in one, you start computing on it. 644 00:32:16,620 --> 00:32:19,530 While you're computing on those 4 in a SIMD fashion 645 00:32:19,530 --> 00:32:21,530 across the SIMD registers you're 646 00:32:21,530 --> 00:32:22,900 bringing in the next one. 647 00:32:22,900 --> 00:32:24,670 And then when that one's finished you're writing that 648 00:32:24,670 --> 00:32:26,840 one out while you're computing on the one that arrived and 649 00:32:26,840 --> 00:32:28,140 while you're getting the next one. 650 00:32:28,140 --> 00:32:33,760 And since you can get the entire 1024 or 2000 into local 651 00:32:33,760 --> 00:32:38,600 store, you're only 6 cycles away from any element in it. 652 00:32:38,600 --> 00:32:41,470 So it's much, much faster. 653 00:32:41,470 --> 00:32:45,610 We also did the 16 million element FFT. 654 00:32:48,120 --> 00:32:52,550 1D, yeah and we did some tricks there to make it 655 00:32:52,550 --> 00:32:53,980 efficient, but it was a lot slower. 656 00:32:56,810 --> 00:32:59,180 AUDIENCE: [UNINTELLIGIBLE PHRASE] 657 00:32:59,180 --> 00:33:01,156 would have to be a lot slower by the need for the problem. 658 00:33:01,156 --> 00:33:03,970 [UNINTELLIGIBLE PHRASE] 659 00:33:03,970 --> 00:33:05,870 MICHAEL PERRONE: What I remember is it was fifteen times 660 00:33:05,870 --> 00:33:08,660 faster than a POWER5. 661 00:33:12,970 --> 00:33:16,160 It might have been a POWER4, I don't remember, sorry. 662 00:33:22,010 --> 00:33:22,716 I might want to skip this one. 663 00:33:22,716 --> 00:33:25,436 I think I'm going to skip this one.
664 00:33:25,436 --> 00:33:27,340 AUDIENCE: [UNINTELLIGIBLE PHRASE] 665 00:33:27,340 --> 00:33:28,590 MICHAEL PERRONE: Right. 666 00:33:32,330 --> 00:33:34,360 Let's talk about what is the cell good for. 667 00:33:34,360 --> 00:33:36,935 You kind of have a sense of the architecture and how it 668 00:33:36,935 --> 00:33:38,510 all fits together. 669 00:33:38,510 --> 00:33:41,690 You may have some sense of the gotchas and the problems that 670 00:33:41,690 --> 00:33:44,300 might be there, but what have we actually applied it to? 671 00:33:44,300 --> 00:33:48,120 I mean you saw some of that here. 672 00:33:48,120 --> 00:33:52,405 Here's a list of things that either we've already proven to 673 00:33:52,405 --> 00:33:56,460 ourselves that it works well or we're very confident that it 674 00:33:56,460 --> 00:33:58,320 works well or we're working to demonstrate 675 00:33:58,320 --> 00:33:59,570 that it works well. 676 00:34:01,700 --> 00:34:04,280 Signal processing, image processing, audio resampling, 677 00:34:04,280 --> 00:34:04,990 noise generation. 678 00:34:04,990 --> 00:34:06,920 I mean, you can read through this list, there's a long 679 00:34:06,920 --> 00:34:11,010 list. And I guess there are a few characteristics that 680 00:34:11,010 --> 00:34:14,030 really make it suitable for cell. 681 00:34:14,030 --> 00:34:16,460 Things that are in single precision because you've got 682 00:34:16,460 --> 00:34:20,210 200 gigaflops single and only 20 of double, but that will 683 00:34:20,210 --> 00:34:23,360 change as I mentioned. 684 00:34:23,360 --> 00:34:26,580 Things that are streaming, streaming through and so 685 00:34:26,580 --> 00:34:29,770 signal processing is ideal, where the data comes through 686 00:34:29,770 --> 00:34:32,350 and you do your compute and then you throw it away or you 687 00:34:32,350 --> 00:34:33,830 give out your results and you throw it away. 688 00:34:33,830 --> 00:34:35,080 Those are good.
689 00:34:39,770 --> 00:34:42,410 And things that are compute intensive, where you bring the 690 00:34:42,410 --> 00:34:44,590 data in and you're going to crunch on it for a long time, 691 00:34:44,590 --> 00:34:48,170 so things like cryptography where you're either generating 692 00:34:48,170 --> 00:34:53,320 something from a key and there's virtually no input. 693 00:34:53,320 --> 00:34:57,120 You're just generating streams of random numbers that's very 694 00:34:57,120 --> 00:34:58,160 well suited for this thing. 695 00:34:58,160 --> 00:34:59,770 You see FFTs listed here. 696 00:35:02,630 --> 00:35:04,210 TCP/IP offload. 697 00:35:04,210 --> 00:35:06,790 I didn't put that there. 698 00:35:06,790 --> 00:35:11,590 There's actually a problem with cell today that we're 699 00:35:11,590 --> 00:35:15,330 working to fix that the TCP/IP performance is not very good. 700 00:35:15,330 --> 00:35:19,970 And so what I tell people to use is Open MPI. 701 00:35:19,970 --> 00:35:23,450 You know, so run that over InfiniBand. 702 00:35:23,450 --> 00:35:26,930 The PPE processor really doesn't have the horsepower 703 00:35:26,930 --> 00:35:29,750 to drive a full TCP/IP stack. 704 00:35:29,750 --> 00:35:33,300 I'm not sure it has the horsepower to do a full MPI stack 705 00:35:33,300 --> 00:35:36,680 either, but at least you have more control in that case. 706 00:35:42,130 --> 00:35:45,170 The game physics, physical simulations-- 707 00:35:45,170 --> 00:35:47,130 I can show you a demo, but I don't know that we'll have 708 00:35:47,130 --> 00:35:50,500 time where a company called RapidMind, which is 709 00:35:50,500 --> 00:35:55,380 developing software to ease programmability for cell. 710 00:35:55,380 --> 00:35:57,980 Basically you take your existing scalar code and you 711 00:35:57,980 --> 00:36:03,010 instrument it with C++ classes that are kind of SPE aware.
712 00:36:03,010 --> 00:36:07,650 And by doing that, just write your scalar code and you get 713 00:36:07,650 --> 00:36:11,010 the SPE performance advantage. 714 00:36:11,010 --> 00:36:12,470 They have this wonderful demo of these chickens. 715 00:36:12,470 --> 00:36:15,770 They've got 16,000 chickens in a chicken yard. 716 00:36:15,770 --> 00:36:18,870 You know, the chicken yard has varying topologies and the 717 00:36:18,870 --> 00:36:22,310 chickens move around and all 16,000 are being processed in 718 00:36:22,310 --> 00:36:24,470 real time with a single cell processor. 719 00:36:24,470 --> 00:36:30,080 In fact, the Nvidia card that was used to render that 720 00:36:30,080 --> 00:36:33,480 couldn't keep up with what was coming out of the SPEs. 721 00:36:33,480 --> 00:36:34,470 We were impressed with that. 722 00:36:34,470 --> 00:36:35,000 We're happy with that. 723 00:36:35,000 --> 00:36:37,710 We showed it around at the game conferences and the 724 00:36:37,710 --> 00:36:40,300 gamers saw all these chickens and were like, 725 00:36:40,300 --> 00:36:40,630 this is really cool. 726 00:36:40,630 --> 00:36:41,880 How do I shoot them? 727 00:36:44,260 --> 00:36:45,740 So we said, you can't. 728 00:36:45,740 --> 00:36:48,250 But maybe in the next version. 729 00:36:48,250 --> 00:36:51,780 But the idea is that we've designed this so that it can 730 00:36:51,780 --> 00:36:55,050 do physical simulations, and this is maybe an entree for 731 00:36:55,050 --> 00:36:56,740 some of you people when you're doing your stuff. 732 00:36:56,740 --> 00:36:58,680 I don't know what kinds of things you want to try to do 733 00:36:58,680 --> 00:37:02,630 on cell, but I've seen people do lots of things that really 734 00:37:02,630 --> 00:37:04,430 have no business doing well on cell and they 735 00:37:04,430 --> 00:37:05,240 did very, very well. 736 00:37:05,240 --> 00:37:08,010 Like pointer chasing. 737 00:37:13,260 --> 00:37:14,100 I'm trying to remember.
738 00:37:14,100 --> 00:37:15,230 There are two pieces of work. 739 00:37:15,230 --> 00:37:22,860 One was done by Fabrizio Petrini at PNNL and he did a graph 740 00:37:22,860 --> 00:37:24,690 traversal algorithm. 741 00:37:24,690 --> 00:37:29,340 It was very much random access and he was able to parallelize 742 00:37:29,340 --> 00:37:31,120 that very nicely on Cell. 743 00:37:31,120 --> 00:37:34,900 And then there was another guy at Georgia Tech who did 744 00:37:34,900 --> 00:37:37,010 something similar for linked lists. 745 00:37:37,010 --> 00:37:41,170 And you know, I expect things to work well on cell if 746 00:37:41,170 --> 00:37:44,310 they're streaming and they have very compute intensive 747 00:37:44,310 --> 00:37:46,870 kernels that are working on things, but those are two 748 00:37:46,870 --> 00:37:50,600 examples where they're not very compute intensive and 749 00:37:50,600 --> 00:37:51,350 not very streaming. 750 00:37:51,350 --> 00:37:54,710 They're kind of random access and they work very well. 751 00:37:54,710 --> 00:37:56,410 Over here, target applications. 752 00:37:56,410 --> 00:37:58,980 There are lots of areas where we're trying 753 00:37:58,980 --> 00:38:02,260 to push cell forward. 754 00:38:02,260 --> 00:38:04,110 Clearly it works in the gaming industry, but 755 00:38:04,110 --> 00:38:04,960 where else can it work? 756 00:38:04,960 --> 00:38:08,360 So medical imaging, there's a lot of success there. 757 00:38:08,360 --> 00:38:11,580 The seismic imaging for petroleum, aerospace and 758 00:38:11,580 --> 00:38:13,080 defense for radar and sonar-- 759 00:38:13,080 --> 00:38:16,190 these are all signal processing apps. 760 00:38:16,190 --> 00:38:18,510 We're also looking at digital content creation 761 00:38:18,510 --> 00:38:20,220 for computer animation. 762 00:38:20,220 --> 00:38:21,470 Very well suited for cell. 763 00:38:24,470 --> 00:38:28,040 This is kind of just what I just said. 764 00:38:28,040 --> 00:38:29,380 Did I leave out anything?
765 00:38:29,380 --> 00:38:33,870 Finance-- once we have double precision we'll be doing 766 00:38:33,870 --> 00:38:35,700 things with finance. 767 00:38:35,700 --> 00:38:37,940 We actually demonstrated that things work very well. 768 00:38:37,940 --> 00:38:41,690 You know, Metropolis algorithms, Monte Carlo, Black- 769 00:38:41,690 --> 00:38:43,620 Scholes algorithms if you're familiar with these kinds of 770 00:38:43,620 --> 00:38:47,140 things from finance. 771 00:38:47,140 --> 00:38:48,960 They tell us they need double precision and we're like, you 772 00:38:48,960 --> 00:38:51,240 don't really need double precision, come on. 773 00:38:51,240 --> 00:38:56,460 I mean, what you have is some mathematical calculation that 774 00:38:56,460 --> 00:38:57,900 you're doing and you're doing it over and over and over. 775 00:38:57,900 --> 00:39:00,190 And in Monte Carlo there's so much noise, we say to these 776 00:39:00,190 --> 00:39:01,150 people, why do you need double precision? 777 00:39:01,150 --> 00:39:06,040 It turns out with decimal notation you can only go up to 778 00:39:06,040 --> 00:39:08,730 like a billion or something in single precision. 779 00:39:08,730 --> 00:39:11,180 So they have more dollars than that, so they need double, for 780 00:39:11,180 --> 00:39:13,060 that reason alone. 781 00:39:13,060 --> 00:39:15,620 But this gets back to the sloppiness of programmers. 782 00:39:15,620 --> 00:39:18,030 And I'm guilty of this myself. 783 00:39:18,030 --> 00:39:18,910 They said, oh we have double. 784 00:39:18,910 --> 00:39:19,930 Let's use double. 785 00:39:19,930 --> 00:39:21,720 They didn't need to, but they did it anyway. 786 00:39:21,720 --> 00:39:24,990 And now their legacy code is stuck with double. 787 00:39:24,990 --> 00:39:28,840 They could convert it all to single, but it's too painful. 788 00:39:28,840 --> 00:39:32,410 Down on Wall Street to build a new data center is like $100 789 00:39:32,410 --> 00:39:34,090 million proposition.
790 00:39:34,090 --> 00:39:37,330 And they do it regularly, all of the banks. 791 00:39:37,330 --> 00:39:40,050 They'll be generating a new data center every year, 792 00:39:40,050 --> 00:39:43,700 sometimes multiple times a year and they just don't have 793 00:39:43,700 --> 00:39:47,010 time or the resources to go through and redo all their 794 00:39:47,010 --> 00:39:49,270 code to make it run on something like cell. 795 00:39:49,270 --> 00:39:54,690 So we're making double precision cell. 796 00:39:54,690 --> 00:39:56,170 That's the short of it. 797 00:39:56,170 --> 00:40:00,210 All right, now software environment. 798 00:40:00,210 --> 00:40:03,640 This is stuff that you can find on the web and actually, 799 00:40:03,640 --> 00:40:06,260 it's changing a lot lately because we just 800 00:40:06,260 --> 00:40:09,430 released the 2.0 SDK. 801 00:40:09,430 --> 00:40:12,950 And so the stuff that's in the slide might not actually be 802 00:40:12,950 --> 00:40:16,480 the latest and greatest, but it's going to be epsilon away, 803 00:40:16,480 --> 00:40:17,970 so don't worry about it too much. 804 00:40:17,970 --> 00:40:20,020 But you really shouldn't trust these slides, you should go to 805 00:40:20,020 --> 00:40:23,300 the website and the website you want to go to is 806 00:40:23,300 --> 00:40:26,981 www.ibm.com/alphaworks. 807 00:40:26,981 --> 00:40:30,100 PROFESSOR: Tomorrow we are going to have a recitation 808 00:40:30,100 --> 00:40:32,310 session talking about the environment 809 00:40:32,310 --> 00:40:33,960 that we have created. 810 00:40:33,960 --> 00:40:36,500 I think we've probably just set up the latest 811 00:40:36,500 --> 00:40:39,815 environment and then we'll increase it through the three 812 00:40:39,815 --> 00:40:41,000 weeks we've got. 813 00:40:41,000 --> 00:40:44,180 This is changing faster than a three week cycle.
814 00:40:44,180 --> 00:40:45,430 So [UNINTELLIGIBLE PHRASE] 815 00:40:47,590 --> 00:40:51,620 So this will give you a preview of what's going to be. 816 00:40:51,620 --> 00:40:52,460 MICHAEL PERRONE: Then you go to alphaworks, you go to 817 00:40:52,460 --> 00:40:55,510 search on alphaworks for cell and you get more information 818 00:40:55,510 --> 00:40:57,810 than you could ever possibly read. 819 00:40:57,810 --> 00:41:01,370 We have a programmer's manual that's 900 pages long, it's 820 00:41:01,370 --> 00:41:04,260 really good reading. 821 00:41:04,260 --> 00:41:07,730 Actually there's one thing in those 800, 900 pages 822 00:41:07,730 --> 00:41:08,460 that you really should read. 823 00:41:08,460 --> 00:41:10,600 It's called the cell programming tips chapter. 824 00:41:10,600 --> 00:41:14,450 It's a really nice chapter. 825 00:41:14,450 --> 00:41:17,140 But there are many, many publications and things like 826 00:41:17,140 --> 00:41:23,110 that, more than just the SDK in the OS and whatnot, so I 827 00:41:23,110 --> 00:41:25,410 encourage you to look at that. 828 00:41:25,410 --> 00:41:28,430 All right, so this is kind of the pyramid, the 829 00:41:28,430 --> 00:41:29,520 cell software pyramid. 830 00:41:29,520 --> 00:41:32,990 We've got the standards under here, the application binary 831 00:41:32,990 --> 00:41:36,710 interface, language extensions. 832 00:41:36,710 --> 00:41:39,380 And over here we have development tools and we'll 833 00:41:39,380 --> 00:41:42,130 talk about each of these pieces briefly. 834 00:41:45,080 --> 00:41:49,350 These specifications define what's actually the reference 835 00:41:49,350 --> 00:41:52,030 implementation for the cell. 836 00:41:52,030 --> 00:41:56,480 C++ and C, they have language extensions in a similar way 837 00:41:56,480 --> 00:42:01,090 to the extensions for VMX, or SSE on Intel.
838 00:42:01,090 --> 00:42:05,000 You have C extensions for cell that allow you to use 839 00:42:05,000 --> 00:42:12,200 intrinsics that actually run as SIMD instructions on cell. 840 00:42:12,200 --> 00:42:15,540 For example, you can say SPU underscore mul-add, and it's 841 00:42:15,540 --> 00:42:17,670 going to do a vector mul-add. 842 00:42:17,670 --> 00:42:24,060 So you can get assembly language level control over 843 00:42:24,060 --> 00:42:28,390 your code without having to use any assembly language. 844 00:42:28,390 --> 00:42:30,890 And then there's that. 845 00:42:30,890 --> 00:42:34,180 There is a full system simulator. 846 00:42:34,180 --> 00:42:40,050 The simulator is very, very accurate for things that do 847 00:42:40,050 --> 00:42:43,040 not run out to main memory. 848 00:42:43,040 --> 00:42:44,910 They've been working to improve this so I don't know 849 00:42:44,910 --> 00:42:47,810 if recently they have made it more accurate, but if you're 850 00:42:47,810 --> 00:42:52,090 doing compute intensive stuff, if you're compute bound the 851 00:42:52,090 --> 00:42:55,000 simulator can give you accuracies within 99%. 852 00:42:55,000 --> 00:42:58,120 You know, within 1% of the real value. 853 00:42:58,120 --> 00:43:02,050 I've only seen one thing on the simulator more than 1% off 854 00:43:02,050 --> 00:43:04,930 and that was 4%, so the simulator is very-- excuse 855 00:43:04,930 --> 00:43:06,220 me-- very reliable. 856 00:43:06,220 --> 00:43:08,260 And I encourage you to use it if you can't 857 00:43:08,260 --> 00:43:09,510 get access to hardware. 858 00:43:12,600 --> 00:43:14,240 What else? 859 00:43:14,240 --> 00:43:16,710 The simulator has all kinds of tools in there. 860 00:43:16,710 --> 00:43:21,820 And I'm not going to go through the software stack in 861 00:43:21,820 --> 00:43:23,070 simulation. 862 00:43:31,280 --> 00:43:33,090 This gives you a sense for-- 863 00:43:33,090 --> 00:43:35,330 you've got your hardware running here. 
864 00:43:35,330 --> 00:43:38,280 You can run this on any one of these platforms. Power PC, 865 00:43:38,280 --> 00:43:42,910 Intel with these OS's. 866 00:43:42,910 --> 00:43:46,560 The whole thing is written in TCL, the simulator. 867 00:43:46,560 --> 00:43:48,930 And it has all these kinds of simulators. 868 00:43:48,930 --> 00:43:54,300 It's simulating the DMAs, it's simulating the caches and then 869 00:43:54,300 --> 00:43:56,300 you get a graphical user interface and a command line 870 00:43:56,300 --> 00:43:58,590 interface to that simulator. 871 00:43:58,590 --> 00:44:01,940 The graphical user interface is convenient, but the command 872 00:44:01,940 --> 00:44:03,160 line gives you much more control. 873 00:44:03,160 --> 00:44:04,860 You can tweak parameters. 874 00:44:09,790 --> 00:44:14,850 This gives you a view of what the graphical 875 00:44:14,850 --> 00:44:17,600 user interface looks like. 876 00:44:17,600 --> 00:44:19,660 It says mambo zebra because that was a different project, 877 00:44:19,660 --> 00:44:21,360 but now it'd probably say system sim or 878 00:44:21,360 --> 00:44:23,780 something like that. 879 00:44:23,780 --> 00:44:26,040 And you'll see the PPC-- 880 00:44:26,040 --> 00:44:28,190 this is the PPE-- I don't know why they changed it. 881 00:44:28,190 --> 00:44:32,090 And then you have SPE 0, SPE 1 going down and it 882 00:44:32,090 --> 00:44:35,240 gives you some access to these parameters. 883 00:44:35,240 --> 00:44:41,310 The model here, it says pipeline and then there's, I 884 00:44:41,310 --> 00:44:43,090 think, functional mode or pipeline mode. 885 00:44:43,090 --> 00:44:45,570 Pipeline mode is where it's really simulating everything 886 00:44:45,570 --> 00:44:47,280 and it's much slower. 887 00:44:47,280 --> 00:44:48,760 But it's accurate. 888 00:44:48,760 --> 00:44:50,590 And then the other is functional mode, just to test 889 00:44:50,590 --> 00:44:51,960 that the code actually works as it's supposed to.
890 00:44:51,960 --> 00:44:55,136 PROFESSOR: I guess at one point in the class, what we'll try 891 00:44:55,136 --> 00:44:58,340 and do is, since each group has access to the hardware, 892 00:44:58,340 --> 00:45:01,930 you can do most of the things in the real hardware and use 893 00:45:01,930 --> 00:45:03,430 the debugger in the hardware that's 894 00:45:03,430 --> 00:45:04,300 probably been talked about. 895 00:45:04,300 --> 00:45:07,950 But if things get really bad and you can't understand what's 896 00:45:07,950 --> 00:45:11,030 going on, use the simulator as a very accurate debugger-- only 897 00:45:11,030 --> 00:45:13,250 when it's needed-- because there you can look at every 898 00:45:13,250 --> 00:45:14,870 little detail inside. 899 00:45:14,870 --> 00:45:17,980 This is kind of a last resort type thing. 900 00:45:17,980 --> 00:45:19,930 MICHAEL PERRONE: Yeah, I agree. 901 00:45:19,930 --> 00:45:21,390 Like I said, I've been doing this for three years. 902 00:45:21,390 --> 00:45:23,590 Three years ago we didn't even have hardware. 903 00:45:23,590 --> 00:45:27,120 So the simulator was all we had, so we relied on it a lot. 904 00:45:27,120 --> 00:45:29,880 But I think that usage of it makes a lot of sense. 905 00:45:33,550 --> 00:45:34,900 This is the graphical interface. 906 00:45:34,900 --> 00:45:36,720 You know, it's just a Tcl interface. 907 00:45:41,240 --> 00:45:42,440 I'm going to skip through these things. 908 00:45:42,440 --> 00:45:47,350 It just shows you how you can look at memory with the 909 00:45:47,350 --> 00:45:48,970 memory access view. 910 00:45:48,970 --> 00:45:49,830 You get some graphical 911 00:45:49,830 --> 00:45:51,630 representation of various pieces. 912 00:45:51,630 --> 00:45:52,660 You know, how many stalls? 913 00:45:52,660 --> 00:45:53,740 How many loads? 914 00:45:53,740 --> 00:45:55,590 How many DMA transactions? 915 00:45:55,590 --> 00:45:57,320 So you can see what's going on at that level.
916 00:46:00,270 --> 00:46:02,090 And all of this can be pulled together into 917 00:46:02,090 --> 00:46:05,240 this UART window here. 918 00:46:05,240 --> 00:46:09,680 OK, so the Linux, it's pretty standard Linux, but it has 919 00:46:09,680 --> 00:46:12,410 some extensions. 920 00:46:12,410 --> 00:46:14,820 Let's see. 921 00:46:14,820 --> 00:46:16,930 Provided as a patch, yeah. 922 00:46:16,930 --> 00:46:17,730 That might be wrong. 923 00:46:17,730 --> 00:46:21,490 I don't know where we are currently. 924 00:46:21,490 --> 00:46:24,980 You have this SPE thread API for creating 925 00:46:24,980 --> 00:46:28,020 threads from the PPEs. 926 00:46:28,020 --> 00:46:30,850 Let's see. 927 00:46:30,850 --> 00:46:32,330 What do I want to tell you here? 928 00:46:32,330 --> 00:46:35,680 There's a better slide for this kind of information. 929 00:46:35,680 --> 00:46:39,220 They share the memory space, we talked about that. 930 00:46:39,220 --> 00:46:41,830 There's error event and signal handling. 931 00:46:41,830 --> 00:46:45,630 So there are multiple ways you communicate. 932 00:46:45,630 --> 00:46:50,030 You can communicate with the interrupts and the event and 933 00:46:50,030 --> 00:46:53,770 signaling that way or you can use these mailboxes. 934 00:46:53,770 --> 00:46:56,640 So each SPE has its own mailbox and inbox and an 935 00:46:56,640 --> 00:46:59,750 outbox so you can post something to your outbox and 936 00:46:59,750 --> 00:47:01,770 then the PPE will read it when it's ready. 937 00:47:01,770 --> 00:47:05,030 Or you can read from your inbox waiting on the PPE to 938 00:47:05,030 --> 00:47:05,790 write something. 939 00:47:05,790 --> 00:47:07,960 You have to be careful because you can stall there. 940 00:47:07,960 --> 00:47:11,970 If the PPE hasn't written you will stall waiting for 941 00:47:11,970 --> 00:47:12,770 something to fill up. 942 00:47:12,770 --> 00:47:14,460 So you can do a check. 
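That check-before-read gotcha can be sketched in plain C. This is a toy model, not the real mailbox API: the struct and function names are invented, and on real hardware you would poll a mailbox status channel before issuing a blocking read.

```c
/* Toy model of an SPE inbox: a value slot plus a count of unread
 * entries that you can poll.  Invented for illustration only. */
typedef struct {
    unsigned int value;
    int          count;   /* number of unread entries */
} toy_mailbox;

/* Non-blocking read: returns 1 and stores the value if an entry is
 * available; returns 0 instead of stalling when the box is empty,
 * which is what a blocking read would do on the real hardware. */
int mbox_try_read(toy_mailbox *mb, unsigned int *out) {
    if (mb->count == 0)
        return 0;          /* empty: a blocking read would stall here */
    *out = mb->value;
    mb->count--;
    return 1;
}
```

The caller can then do useful work and retry later, rather than sitting stalled waiting for the PPE to write.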
943 00:47:14,460 --> 00:47:16,150 There are ways to get around that, but these are kind of 944 00:47:16,150 --> 00:47:18,040 common gotchas that you have to watch out for. 945 00:47:22,410 --> 00:47:25,360 Then you have the mailboxes, you have the interrupts, you 946 00:47:25,360 --> 00:47:26,100 also have DMAs. 947 00:47:26,100 --> 00:47:28,300 You can do communication with DMAs so you have at least 948 00:47:28,300 --> 00:47:29,900 three different ways that you communicate 949 00:47:29,900 --> 00:47:33,580 between the SPEs on cell. 950 00:47:33,580 --> 00:47:37,250 And which one is going to be best really depends on the 951 00:47:37,250 --> 00:47:40,050 algorithm you're running. 952 00:47:40,050 --> 00:47:42,330 So these are the extensions to Linux. 953 00:47:42,330 --> 00:47:43,800 This is going to show you a bunch of things that you 954 00:47:43,800 --> 00:47:46,800 probably won't be able to read, but there's something 955 00:47:46,800 --> 00:47:51,580 called SPUFS, the file system that has a bunch of open, 956 00:47:51,580 --> 00:47:53,900 read, write, and close functionality. 957 00:47:57,450 --> 00:48:01,630 And then we also have this signaling and the mailboxes 958 00:48:01,630 --> 00:48:03,650 that I mentioned to you previously. 959 00:48:03,650 --> 00:48:04,870 And this you can't even read. 960 00:48:04,870 --> 00:48:05,850 I can't even read this one. 961 00:48:05,850 --> 00:48:08,300 What is it? 962 00:48:08,300 --> 00:48:10,060 Ah, this is perhaps the most important one. 963 00:48:10,060 --> 00:48:13,790 It says SPU create thread. 964 00:48:13,790 --> 00:48:19,370 So the SPEs from the Linux point of view are just threads 965 00:48:19,370 --> 00:48:20,440 that are running. 
966 00:48:20,440 --> 00:48:23,290 The Linux doesn't really know that they're special purpose 967 00:48:23,290 --> 00:48:25,890 hardware, it just knows it's a thread and you can do things 968 00:48:25,890 --> 00:48:29,775 like spawn a thread, kill a thread, wait on a thread-- all 969 00:48:29,775 --> 00:48:33,490 the usual things that you can do with threads. 970 00:48:33,490 --> 00:48:34,970 So it's a lot like P threads, but it's 971 00:48:34,970 --> 00:48:36,980 not actually P threads. 972 00:48:36,980 --> 00:48:40,590 So here you could see these things are more useful. 973 00:48:40,590 --> 00:48:42,710 This is SPE create groups. 974 00:48:42,710 --> 00:48:46,370 So you can create a thread and thread group so that threads 975 00:48:46,370 --> 00:48:49,200 that are part of the same group know about one another. 976 00:48:49,200 --> 00:48:51,620 So you can partition your system and have three SPEs 977 00:48:51,620 --> 00:48:53,740 doing one thing and five doing another. 978 00:48:53,740 --> 00:48:56,060 So that you can split it up however you like. 979 00:48:56,060 --> 00:48:58,940 You have get and set affinity so that you can choose which 980 00:48:58,940 --> 00:49:01,750 SPEs are running which tasks, so that you can get more 981 00:49:01,750 --> 00:49:05,800 efficient use of that element interconnect bus. 982 00:49:05,800 --> 00:49:10,260 Kill and waits, open, close, writing signals, the usual. 983 00:49:15,110 --> 00:49:17,490 Let me check my time here. 984 00:49:17,490 --> 00:49:22,410 I really don't have a lot more time, so I'm going to say that 985 00:49:22,410 --> 00:49:24,030 we have this thread management library. 986 00:49:24,030 --> 00:49:26,660 It has the functionality that I just mentioned. 987 00:49:26,660 --> 00:49:28,470 In the next month or so you're going to go through that in a 988 00:49:28,470 --> 00:49:29,990 lot more detail. 989 00:49:35,860 --> 00:49:38,340 The SPE comes with a lot of sample libraries. 
990 00:49:38,340 --> 00:49:41,410 These are not necessarily the very best implementation of 991 00:49:41,410 --> 00:49:43,440 these libraries and they're not even fully functional 992 00:49:43,440 --> 00:49:46,500 libraries, but they're suggestive of first of all, 993 00:49:46,500 --> 00:49:50,900 how things can be written for cell, how to use cell, and in 994 00:49:50,900 --> 00:49:53,000 some cases how to optimize for cell. 995 00:49:53,000 --> 00:49:55,790 Like the basic matrix operations, there's some 996 00:49:55,790 --> 00:49:56,670 optimization. 997 00:49:56,670 --> 00:49:58,970 The FFTs are very tightly optimized, so you can 998 00:49:58,970 --> 00:50:01,470 take a look at that and understand how to do that type 999 00:50:01,470 --> 00:50:04,010 of memory manipulation. 1000 00:50:04,010 --> 00:50:08,940 So there are sample codes out there that can be very useful. 1001 00:50:08,940 --> 00:50:10,240 We'll skip that. 1002 00:50:10,240 --> 00:50:12,400 Oh, this is that FFT 16 million. 1003 00:50:12,400 --> 00:50:15,940 There's an example, it's on the SDK. 1004 00:50:15,940 --> 00:50:18,340 Actually, I don't know if you've got PS3's if all these 1005 00:50:18,340 --> 00:50:20,070 things can run. 1006 00:50:20,070 --> 00:50:20,900 They should run. 1007 00:50:20,900 --> 00:50:23,820 Yeah, they should run. 1008 00:50:23,820 --> 00:50:25,850 There may be some memory issues out to main memory that 1009 00:50:25,850 --> 00:50:29,090 I'm not aware of. 1010 00:50:29,090 --> 00:50:32,040 There are all kinds of demos there that you can play with, 1011 00:50:32,040 --> 00:50:35,620 which are good for learning how to spawn threads and 1012 00:50:35,620 --> 00:50:38,030 things like that. 1013 00:50:38,030 --> 00:50:41,360 You have your basic GNU binutils tools. 1014 00:50:41,360 --> 00:50:43,670 There's GCC out there. 1015 00:50:43,670 --> 00:50:45,150 There's also XLC. 1016 00:50:45,150 --> 00:50:48,530 You can download XLC.
1017 00:50:48,530 --> 00:50:51,420 In some cases, one will be better than the other, but I 1018 00:50:51,420 --> 00:50:53,780 think in most cases XLC's a little better. 1019 00:50:53,780 --> 00:50:57,210 Or in some cases, actually a lot better. 1020 00:50:57,210 --> 00:50:59,240 So you can get that. 1021 00:50:59,240 --> 00:51:00,820 I'd recommend that. 1022 00:51:00,820 --> 00:51:04,110 There's a debugger which provides application source 1023 00:51:04,110 --> 00:51:06,160 level debugging. 1024 00:51:06,160 --> 00:51:08,790 PPE multithreading, SPE multithreading, the 1025 00:51:08,790 --> 00:51:11,310 interaction between these guys. 1026 00:51:11,310 --> 00:51:15,430 There are three modes for the debugger: standalone and then 1027 00:51:15,430 --> 00:51:17,750 attached to SPE threads. 1028 00:51:17,750 --> 00:51:19,000 Sounds like two. 1029 00:51:22,270 --> 00:51:26,120 That's problematic. 1030 00:51:26,120 --> 00:51:28,130 There's this nice static analysis tool. 1031 00:51:28,130 --> 00:51:30,140 This is good when you're really tightly 1032 00:51:30,140 --> 00:51:31,330 optimizing your code. 1033 00:51:31,330 --> 00:51:33,070 You have to be able to read assembly, but it shows you 1034 00:51:33,070 --> 00:51:34,810 graphically-- 1035 00:51:34,810 --> 00:51:36,430 kind of-- 1036 00:51:36,430 --> 00:51:38,800 where the stalls are happening and you can try and 1037 00:51:38,800 --> 00:51:40,890 reorganize your code. 1038 00:51:40,890 --> 00:51:44,720 And then like Saman suggested, the dynamic analysis using the 1039 00:51:44,720 --> 00:51:48,880 simulator is a good way to really get cycle by cycle 1040 00:51:48,880 --> 00:51:51,190 stepping through the code. 1041 00:51:51,190 --> 00:51:54,220 And someone was very excited when they made this chart 1042 00:51:54,220 --> 00:51:55,720 because they put these big explosions here.
1043 00:51:58,500 --> 00:52:02,790 You've got some compiler here that's going to be generating 1044 00:52:02,790 --> 00:52:07,270 two pieces of code, the PPE binary and the SPE binary. 1045 00:52:07,270 --> 00:52:11,210 When you go through the cell tutorials for training on how 1046 00:52:11,210 --> 00:52:14,900 to program cell you'll see that this code is actually 1047 00:52:14,900 --> 00:52:17,900 plugged into-- linked into the PPE code. 1048 00:52:17,900 --> 00:52:21,170 And when the PPE code spawns a thread it's going to take a 1049 00:52:21,170 --> 00:52:25,030 pointer to this code and basically DMA that code into 1050 00:52:25,030 --> 00:52:27,540 the SPE and tell the SPE to start running. 1051 00:52:27,540 --> 00:52:31,180 Once it's done that, that thread is independent. 1052 00:52:31,180 --> 00:52:34,220 The PPE could kill it, but it could just let it run to its 1053 00:52:34,220 --> 00:52:37,060 natural termination or this thing could terminate itself 1054 00:52:37,060 --> 00:52:41,370 or it could be interrupted by some other communication. 1055 00:52:41,370 --> 00:52:42,890 But that's the basic process, you have these 1056 00:52:42,890 --> 00:52:45,900 two pieces of code. 1057 00:52:45,900 --> 00:52:51,070 OK, so now this is really what I wanted to get to. 1058 00:52:51,070 --> 00:52:54,620 So I want lots of questions here. 1059 00:52:54,620 --> 00:52:59,800 There are 4 levels of parallelism in cell. 1060 00:52:59,800 --> 00:53:02,680 On the cell blade, the IBM blade you have two cell 1061 00:53:02,680 --> 00:53:04,270 processors per blade. 1062 00:53:04,270 --> 00:53:06,570 So that's one level of parallelism. 1063 00:53:06,570 --> 00:53:08,160 At chip level we know there are 9 cores and they're all 1064 00:53:08,160 --> 00:53:08,900 running independently. 1065 00:53:08,900 --> 00:53:11,050 That's another level of parallelism. 
1066 00:53:11,050 --> 00:53:14,170 On the instruction level each of the SPEs has two 1067 00:53:14,170 --> 00:53:18,010 instruction pipelines, so it's an odd and an even pipeline. 1068 00:53:18,010 --> 00:53:19,860 One pipeline is doing things-- 1069 00:53:19,860 --> 00:53:23,370 the odd pipeline is doing loads and stores, DMA 1070 00:53:23,370 --> 00:53:30,840 transactions, interrupts, branches and it's doing 1071 00:53:30,840 --> 00:53:33,610 something called shuffle byte or the shuffle operation. 1072 00:53:33,610 --> 00:53:36,270 So shuffle operation's a very, very useful operation that 1073 00:53:36,270 --> 00:53:41,140 allows you to take two registers as data, a third 1074 00:53:41,140 --> 00:53:44,730 register as a pattern register, and the fourth 1075 00:53:44,730 --> 00:53:46,530 register as output. 1076 00:53:46,530 --> 00:53:50,040 It then, from this pattern, will choose arbitrarily the 1077 00:53:50,040 --> 00:53:53,210 bytes that are in these two and reconstitute them into 1078 00:53:53,210 --> 00:53:54,990 this fourth register. 1079 00:53:54,990 --> 00:53:58,350 It's wonderful for doing manipulations and shuffling 1080 00:53:58,350 --> 00:53:59,360 things around. 1081 00:53:59,360 --> 00:54:02,870 Like shuffling a deck of cards, you could take all of 1082 00:54:02,870 --> 00:54:04,820 these and ignore this or you could take the first one here, 1083 00:54:04,820 --> 00:54:07,410 replicate it 16 times or you could take a random sampling 1084 00:54:07,410 --> 00:54:09,120 from these, put into that register. 1085 00:54:09,120 --> 00:54:12,172 AUDIENCE: Do you use that specifically for the 1086 00:54:12,172 --> 00:54:13,630 [UNINTELLIGIBLE]? 1087 00:54:13,630 --> 00:54:14,670 MICHAEL PERRONE: We do use it, yeah. 1088 00:54:14,670 --> 00:54:18,010 Yeah, you take a look, you'll see we use shuffle a lot. 1089 00:54:18,010 --> 00:54:20,540 It's surprising how valuable shuffle can be. 
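The shuffle semantics described above can be modeled in scalar C. This is only a sketch of the instruction's behavior as described in the lecture: two 16-byte data registers, one pattern register, one result. The real SPU shuffle-bytes instruction also has special pattern codes for generating constants, which this model ignores, and the function name is invented.

```c
#include <stdint.h>

/* Scalar model of the SPE shuffle-bytes operation.  For each output
 * byte, the low 5 bits of the pattern byte select a source byte:
 * values 0..15 pick a byte of a, 16..31 pick a byte of b. */
void shuffle_bytes(const uint8_t a[16], const uint8_t b[16],
                   const uint8_t pattern[16], uint8_t out[16]) {
    for (int i = 0; i < 16; i++) {
        uint8_t sel = pattern[i] & 0x1F;   /* 5 bits choose the source byte */
        out[i] = (sel < 16) ? a[sel] : b[sel - 16];
    }
}
```

A pattern of all zeros replicates a's first byte 16 times; a pattern of 16, 17, ..., 31 copies b through unchanged; any permutation of 0..31 gives an arbitrary reshuffle of the 32 input bytes, which is exactly why it is the workhorse of data rearrangement like transposes.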
1090 00:54:20,540 --> 00:54:23,280 However, then you have to worry now, you've got the 1091 00:54:23,280 --> 00:54:28,300 shuffle here, if you're doing like matrix transpose, it's 1092 00:54:28,300 --> 00:54:30,350 all shuffles. 1093 00:54:30,350 --> 00:54:32,090 But what's a matrix transpose? 1094 00:54:32,090 --> 00:54:34,490 It's really bandwidth bound, right? 1095 00:54:34,490 --> 00:54:36,940 Because you're pulling data in, shuffling it around and 1096 00:54:36,940 --> 00:54:37,350 sending it out. 1097 00:54:37,350 --> 00:54:39,640 Well, where are the reads and writes? 1098 00:54:39,640 --> 00:54:40,590 They're on the odd pipeline. 1099 00:54:40,590 --> 00:54:41,360 Where are the shuffles? 1100 00:54:41,360 --> 00:54:42,970 They're on the odd pipeline. 1101 00:54:42,970 --> 00:54:45,390 So now you can have a situation where it's all 1102 00:54:45,390 --> 00:54:50,360 shuffle, shuffle, shuffle, shuffle and then the 1103 00:54:50,360 --> 00:54:53,950 instruction pre-fetch buffer gets starved and so it stalls 1104 00:54:53,950 --> 00:54:56,840 for 15, 17 cycles while it has to load. 1105 00:54:56,840 --> 00:54:59,900 Basically, it's a tiny little loop. 1106 00:54:59,900 --> 00:55:01,710 But you get stalls and you get really bad performance. 1107 00:55:01,710 --> 00:55:04,480 So then you have to tell the compiler-- 1108 00:55:04,480 --> 00:55:05,880 actually, the compiler is getting 1109 00:55:05,880 --> 00:55:07,170 better at these things. 1110 00:55:07,170 --> 00:55:10,550 Much better than it used to be or by hand you can force it to 1111 00:55:10,550 --> 00:55:12,910 leave a slot for the pre-fetch. 1112 00:55:12,910 --> 00:55:14,690 These are gotchas that programmers 1113 00:55:14,690 --> 00:55:17,470 have to be aware of. 1114 00:55:17,470 --> 00:55:20,800 On the other pipeline you have all your normal operations.
1115 00:55:20,800 --> 00:55:25,620 So you have your mul-adds, your bit operations, all the 1116 00:55:25,620 --> 00:55:28,060 shifts and things like that, they're all over there. 1117 00:55:28,060 --> 00:55:30,500 There is one other operation on the odd pipeline and I 1118 00:55:30,500 --> 00:55:32,730 think it's a quad word rotate or 1119 00:55:32,730 --> 00:55:36,560 something, but I don't remember. 1120 00:55:36,560 --> 00:55:40,710 So that's instruction level dual issue parallelism. 1121 00:55:40,710 --> 00:55:43,280 AUDIENCE: [UNINTELLIGIBLE PHRASE] 1122 00:55:43,280 --> 00:55:44,280 MICHAEL PERRONE: Everything is in order on 1123 00:55:44,280 --> 00:55:45,340 this processor, yeah. 1124 00:55:45,340 --> 00:55:47,080 And that was done for power reasons, right? 1125 00:55:47,080 --> 00:55:49,760 Get rid of all the space and all the transistors that are 1126 00:55:49,760 --> 00:55:51,730 doing all this fancy, out of order 1127 00:55:51,730 --> 00:55:53,600 processing to save power. 1128 00:55:53,600 --> 00:55:54,850 AUDIENCE: [UNINTELLIGIBLE PHRASE] 1129 00:56:18,050 --> 00:56:19,270 MICHAEL PERRONE: That's a really good point. 1130 00:56:19,270 --> 00:56:22,810 When you're doing scalar processing you think well, 1131 00:56:22,810 --> 00:56:25,465 you're thinking I'm going to-- kind of conceptually, you want 1132 00:56:25,465 --> 00:56:27,050 to have all the things that are doing the same thing 1133 00:56:27,050 --> 00:56:27,960 together, right. 1134 00:56:27,960 --> 00:56:30,160 That's how I used to program. 1135 00:56:30,160 --> 00:56:32,590 You put all this stuff here then you do maybe all your 1136 00:56:32,590 --> 00:56:35,320 reads or whatever and then you do all your computes and you 1137 00:56:35,320 --> 00:56:36,290 can't do it that way. 1138 00:56:36,290 --> 00:56:38,370 You have to really think about how are you going to interleave 1139 00:56:38,370 --> 00:56:39,600 these things.
1140 00:56:39,600 --> 00:56:43,990 Now the compiler will help you, but to get really high 1141 00:56:43,990 --> 00:56:46,680 performance you have to have better tools and we don't have 1142 00:56:46,680 --> 00:56:47,550 those tools yet. 1143 00:56:47,550 --> 00:56:50,140 And so I'm hoping that you guys are the ones that are 1144 00:56:50,140 --> 00:56:52,380 going to come up with the new tools, the new ideas that are 1145 00:56:52,380 --> 00:56:54,420 going to really help people improve 1146 00:56:54,420 --> 00:56:57,970 programmability in cell. 1147 00:56:57,970 --> 00:57:00,930 Then at the lowest level you have the register level 1148 00:57:00,930 --> 00:57:05,320 parallelism where you can have four single precision float 1149 00:57:05,320 --> 00:57:08,720 ops going simultaneously. 1150 00:57:08,720 --> 00:57:11,250 So when you're programming cell you have to keep all of 1151 00:57:11,250 --> 00:57:13,140 these levels of hierarchy in your head. 1152 00:57:13,140 --> 00:57:15,860 It's not straight scalar programming anymore. 1153 00:57:15,860 --> 00:57:18,070 And if you think of it that way you're just not going to 1154 00:57:18,070 --> 00:57:20,910 get the performance that you're looking for, period. 1155 00:57:24,600 --> 00:57:26,960 Another consideration is this local store. 1156 00:57:26,960 --> 00:57:30,880 Each local store is 256 kilobytes. 1157 00:57:30,880 --> 00:57:32,130 That's not a lot of space. 1158 00:57:35,110 --> 00:57:37,760 You have to think about how are you going to bring the 1159 00:57:37,760 --> 00:57:41,680 data in so that the chunks are big enough, but not too big 1160 00:57:41,680 --> 00:57:43,050 because if they're too big then you won't be able 1161 00:57:43,050 --> 00:57:44,300 to do multibuffering. 1162 00:57:48,120 --> 00:57:49,930 Let's back up a little bit more. 1163 00:57:49,930 --> 00:57:54,640 The local store holds the data, but it also holds the 1164 00:57:54,640 --> 00:57:56,730 code that you're running.
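That chunk-size budgeting is just arithmetic, and it's worth sketching. A minimal helper, assuming the 256 KB local store figure from the lecture; the function name is invented, and it ignores the stack, which (as discussed below) also lives in the local store:

```c
/* Back-of-the-envelope local store budgeting.  The 256 KB figure is
 * the SPE local store size; everything else here is illustrative. */
#define LOCAL_STORE_BYTES (256u * 1024u)

/* Bytes available per DMA buffer once the code is resident and the
 * remaining space is divided among n_buffers (e.g. 2 for double
 * buffering).  Returns 0 if the code alone doesn't fit.  Note this
 * deliberately ignores the stack, which eats into the same space. */
unsigned int bytes_per_buffer(unsigned int code_bytes,
                              unsigned int n_buffers) {
    if (code_bytes >= LOCAL_STORE_BYTES || n_buffers == 0)
        return 0;
    return (LOCAL_STORE_BYTES - code_bytes) / n_buffers;
}
```

So 200 KB of code with double buffering leaves each buffer 28 KB at most, and less once the stack is accounted for, which is roughly the arithmetic walked through next.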
1165 00:57:56,730 --> 00:58:02,350 So if you have 200 kilobytes of code then you only have 56 1166 00:58:02,350 --> 00:58:03,950 kilobytes of data space. 1167 00:58:03,950 --> 00:58:06,080 And if you want to have double buffering that means you only 1168 00:58:06,080 --> 00:58:15,400 have 25 kilobytes and then as Saman correctly points out 1169 00:58:15,400 --> 00:58:17,950 there's a problem with the stack. 1170 00:58:17,950 --> 00:58:20,390 So if you're going to have recursion in your code or 1171 00:58:20,390 --> 00:58:23,550 something nasty like that, you're going to start pushing 1172 00:58:23,550 --> 00:58:25,630 stack variables off the register file. 1173 00:58:25,630 --> 00:58:27,020 So where do they go? 1174 00:58:27,020 --> 00:58:29,130 They go in the local store. 1175 00:58:29,130 --> 00:58:34,200 What prevents the stack from overwriting your data? 1176 00:58:34,200 --> 00:58:35,520 Nothing. 1177 00:58:35,520 --> 00:58:38,160 Nothing at all and that's a big gotcha. 1178 00:58:38,160 --> 00:58:42,620 I've seen over the past three years maybe 30 separate 1179 00:58:42,620 --> 00:58:46,470 algorithms implemented on cell and I know of only one that 1180 00:58:46,470 --> 00:58:48,030 was definitely doing that. 1181 00:58:48,030 --> 00:58:51,080 But you know, if there are 30 in this class maybe you're 1182 00:58:51,080 --> 00:58:52,420 going to be the one that that happens to. 1183 00:58:52,420 --> 00:58:57,970 So you have to be aware of that and you have 1184 00:58:57,970 --> 00:58:58,400 to deal with it. 1185 00:58:58,400 --> 00:59:02,240 So what you can do, is in the local store put some dead beef 1186 00:59:02,240 --> 00:59:07,400 thing in there so that you can look for an overwrite and that 1187 00:59:07,400 --> 00:59:10,240 will let you know that either you have to make your code 1188 00:59:10,240 --> 00:59:14,890 smaller or your data smaller or get rid of recursion. 1189 00:59:14,890 --> 00:59:18,350 On SPEs, recursion is kind of anathema.
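That "dead beef" trick can be sketched in a few lines. This is an illustration of the idea, not SDK code: the layout, names, and the clobber helper are invented. You plant a known sentinel word at the boundary you care about and check it periodically; if the stack has grown down over your data, the sentinel gets clobbered first and tells you about it.

```c
#include <stdint.h>

#define STACK_GUARD 0xDEADBEEFu

/* Sentinel word planted at the top of the data region (in a real SPE
 * program you would place it between your data and the stack). */
static uint32_t guard_word = STACK_GUARD;

/* Returns 1 while the guard is intact, 0 once something overwrote it. */
int stack_guard_ok(void) {
    return guard_word == STACK_GUARD;
}

/* Stand-in for the stack growing down over the guard, so the check
 * can be demonstrated without actually smashing a stack. */
void simulate_clobber(void) {
    guard_word = 0;
}
```

Checking the guard at the top of your main loop is cheap, and a failed check means: shrink the code, shrink the data, or get rid of the recursion.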
1190 00:59:18,350 --> 00:59:19,900 Inlining is good. 1191 00:59:19,900 --> 00:59:25,220 Inlining really can accelerate your code's performance. 1192 00:59:25,220 --> 00:59:28,310 Oh yeah, it says stack right there. 1193 00:59:28,310 --> 00:59:30,330 You're reading ahead on me here. 1194 00:59:30,330 --> 00:59:32,340 Yes, so all three are in there and you have 1195 00:59:32,340 --> 00:59:33,780 to be aware of that. 1196 00:59:33,780 --> 00:59:37,000 Now there is a memory management library, very 1197 00:59:37,000 --> 00:59:39,960 lightweight library on the SPE and it's going to prevent your 1198 00:59:39,960 --> 00:59:42,930 data from overwriting your code because once the code's 1199 00:59:42,930 --> 00:59:45,820 loaded that memory management library knows where it is and 1200 00:59:45,820 --> 00:59:47,320 it will stop 1201 00:59:47,320 --> 00:59:50,830 your data from allocating-- doing a [? malloc ?] 1202 00:59:50,830 --> 00:59:52,150 over this code. 1203 00:59:52,150 --> 00:59:53,850 But the stack's up for grabs. 1204 00:59:53,850 --> 00:59:56,270 And that was again done because of power 1205 00:59:56,270 --> 00:59:58,220 considerations and real estate on the chip. 1206 00:59:58,220 --> 01:00:02,640 If you want to have a chip that's this big you can have 1207 01:00:02,640 --> 01:00:05,950 anything you want, but manufacturing it is impossible. 1208 01:00:05,950 --> 01:00:08,170 So things were removed and that was one of the things 1209 01:00:08,170 --> 01:00:09,440 that was removed and that's one of the things you have to 1210 01:00:09,440 --> 01:00:11,040 watch out for. 1211 01:00:11,040 --> 01:00:14,010 And communication, we've talked about this quite a bit. 1212 01:00:17,380 --> 01:00:20,460 I didn't mention this: the DMA transactions-- oh, 1213 01:00:20,460 --> 01:00:21,685 question in the back? 1214 01:00:21,685 --> 01:00:25,151 AUDIENCE: Is there any reasonable possibility of 1215 01:00:25,151 --> 01:00:26,665 doing things dynamically?
1216 01:00:32,670 --> 01:00:39,000 Is it at all conceivable to have [? bunks ?] that fetch in 1217 01:00:39,000 --> 01:00:42,100 new code or an allocator that shuffles somehow? 1218 01:00:42,100 --> 01:00:45,572 Or is it basically as soon as you get to that point your 1219 01:00:45,572 --> 01:00:46,510 performance is going to go to hell. 1220 01:00:46,510 --> 01:00:48,330 MICHAEL PERRONE: Yes, well if you don't do anything about 1221 01:00:48,330 --> 01:00:50,510 it, yes your performance will go to hell. 1222 01:00:50,510 --> 01:00:52,070 So there are two ways. 1223 01:00:52,070 --> 01:00:57,240 In research we came up with an overlay mechanism. 1224 01:00:57,240 --> 01:00:59,810 So this is what people used to do 20 years ago when 1225 01:00:59,810 --> 01:01:00,820 processors were simple. 1226 01:01:00,820 --> 01:01:03,630 Well, these processors are simple, so going back to the 1227 01:01:03,630 --> 01:01:07,570 old technologies is actually a good thing to do. 1228 01:01:07,570 --> 01:01:13,580 So we had a video processing algorithm where we took video 1229 01:01:13,580 --> 01:01:17,070 images, we had to decode them with one SPE, we had to do 1230 01:01:17,070 --> 01:01:19,630 some background subtraction to the next SPE. 1231 01:01:19,630 --> 01:01:21,300 We had to do some edge detection. 1232 01:01:21,300 --> 01:01:24,300 And so each SPE was doing a different thing, but even then 1233 01:01:24,300 --> 01:01:27,850 the code was very big, the chunks of code were large. 1234 01:01:27,850 --> 01:01:32,080 And we were spending 27% of the time swapping code out and 1235 01:01:32,080 --> 01:01:33,370 bringing in new code. 1236 01:01:33,370 --> 01:01:34,740 Bad, very bad. 1237 01:01:34,740 --> 01:01:36,580 Oh, and I should tell you, spawning SPE 1238 01:01:36,580 --> 01:01:37,830 threads is very painful. 1239 01:01:40,660 --> 01:01:43,790 500,000 cycles, a million cycles-- 1240 01:01:43,790 --> 01:01:44,490 I don't know. 
1241 01:01:44,490 --> 01:01:48,040 It varies depending on how the SPE feels that particular day. 1242 01:01:48,040 --> 01:01:51,080 And it's something to avoid. 1243 01:01:51,080 --> 01:01:53,030 You really want to spawn a thread and keep it running for 1244 01:01:53,030 --> 01:01:54,240 a long time. 1245 01:01:54,240 --> 01:01:58,290 So context switching is painful on cell. 1246 01:01:58,290 --> 01:02:03,420 Using an overlay we got that 27% overhead down to 1%. 1247 01:02:03,420 --> 01:02:04,970 So yes, you can do that. 1248 01:02:04,970 --> 01:02:07,410 That tool is not in the SDK. 1249 01:02:07,410 --> 01:02:09,640 It's on my to-do list to put it in the SDK, but the 1250 01:02:09,640 --> 01:02:11,750 compiler team at IBM tells me that the XLC 1251 01:02:11,750 --> 01:02:14,040 compiler now does overlays. 1252 01:02:14,040 --> 01:02:18,310 But it only does overlays at the function level, so if the 1253 01:02:18,310 --> 01:02:20,800 function still doesn't fit in the SPE 1254 01:02:20,800 --> 01:02:22,070 you're dead in the water. 1255 01:02:22,070 --> 01:02:24,800 And I think when it compiles, the compiler will 1256 01:02:24,800 --> 01:02:28,010 quietly say this doesn't fit, and you'll never see that 1257 01:02:28,010 --> 01:02:29,450 until you run it and it doesn't load and you don't know 1258 01:02:29,450 --> 01:02:30,360 what's going on. 1259 01:02:30,360 --> 01:02:33,570 So read your compiler outputs. 1260 01:02:33,570 --> 01:02:35,530 The DMA granularity is 128 bytes. 1261 01:02:35,530 --> 01:02:38,770 This is the same as the data transactions for Intel and 1262 01:02:38,770 --> 01:02:41,950 AMD-- they're all 128-byte data envelopes. 1263 01:02:41,950 --> 01:02:45,690 So if you're doing a memory access that's 4 bytes you're 1264 01:02:45,690 --> 01:02:48,180 still using 128 bytes of bandwidth. 1265 01:02:48,180 --> 01:02:50,790 So this comes back to this notion of getting a shopping 1266 01:02:50,790 --> 01:02:53,740 list.
You really want to think ahead what you want to get, 1267 01:02:53,740 --> 01:02:56,130 bring it in, then use it so that you don't waste 1268 01:02:56,130 --> 01:02:58,750 bandwidth, if you're bandwidth bound. 1269 01:02:58,750 --> 01:03:01,380 If you're not, then you can be a little more wasteful. 1270 01:03:01,380 --> 01:03:04,100 But there's a guy, Mike Acton-- 1271 01:03:04,100 --> 01:03:07,050 you can find his website, I think he has a website called 1272 01:03:07,050 --> 01:03:11,060 www.cellperformance.org? 1273 01:03:11,060 --> 01:03:11,480 Net? 1274 01:03:11,480 --> 01:03:11,820 Com? 1275 01:03:11,820 --> 01:03:12,100 I don't know. 1276 01:03:12,100 --> 01:03:15,010 AUDIENCE: Just a quick comment [UNINTELLIGIBLE PHRASE]. 1277 01:03:15,010 --> 01:03:16,410 MICHAEL PERRONE: Oh, he's good. 1278 01:03:16,410 --> 01:03:17,410 He's much better than me. 1279 01:03:17,410 --> 01:03:20,470 You're really going to like him. 1280 01:03:20,470 --> 01:03:24,460 His belief, and I believe him wholeheartedly, is it's all 1281 01:03:24,460 --> 01:03:26,030 about the data. 1282 01:03:26,030 --> 01:03:32,930 We're coming to a point in computer science where the 1283 01:03:32,930 --> 01:03:35,150 code doesn't matter as much as getting the data 1284 01:03:35,150 --> 01:03:36,310 where you need it. 1285 01:03:36,310 --> 01:03:40,300 This is because of the latency out to main memory. 1286 01:03:40,300 --> 01:03:43,790 Memory's getting so far away that having all these cycles 1287 01:03:43,790 --> 01:03:46,210 is not that useful anymore if you can't get the data. 1288 01:03:46,210 --> 01:03:47,940 So he always pushes this point, you 1289 01:03:47,940 --> 01:03:48,830 have to get the data. 1290 01:03:48,830 --> 01:03:51,510 You have to think about the data, good code starts with 1291 01:03:51,510 --> 01:03:54,180 the data, good code ends with the data, good data structures 1292 01:03:54,180 --> 01:03:55,000 start with the data. 
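The 128-byte granularity described above means even a 4-byte load costs a full 128-byte transaction. A minimal sketch of that accounting in plain C (the helper names are hypothetical, not part of the Cell SDK):

```c
#include <stddef.h>

/* Round a transfer size up to the 128-byte coherence granularity.
   Any access, however small, consumes a full 128-byte "envelope". */
static size_t round_up_128(size_t bytes)
{
    return (bytes + 127) & ~(size_t)127;
}

/* Bandwidth actually consumed by n scattered accesses of `bytes` each. */
static size_t bandwidth_consumed(size_t n, size_t bytes)
{
    return n * round_up_128(bytes);
}
```

By this accounting, 32 scattered 4-byte reads consume 4096 bytes of bandwidth, while one contiguous 128-byte "shopping list" transfer of the same data consumes 128 -- a 32x difference when you are bandwidth bound.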
1293 01:03:55,000 --> 01:03:58,520 You have to think data, data, data. 1294 01:03:58,520 --> 01:04:00,590 And I can't emphasize that enough because it's really 1295 01:04:00,590 --> 01:04:03,625 very, very true for this processor and I believe, for 1296 01:04:03,625 --> 01:04:05,310 all the multicore processors you're going to be seeing. 1297 01:04:08,730 --> 01:04:15,090 The DMAs that you issue can be 128 bytes or multiples of 128 1298 01:04:15,090 --> 01:04:17,890 bytes, up to 16 kilobytes per single DMA. 1299 01:04:17,890 --> 01:04:20,570 There's also something called a DMA list, which is a list of 1300 01:04:20,570 --> 01:04:26,140 DMAs in local store and you tell the DMA queue OK, here 1301 01:04:26,140 --> 01:04:29,490 are these 100 DMAs, spawn them off. 1302 01:04:29,490 --> 01:04:32,760 That only takes one slot in the DMA queue so it's an 1303 01:04:32,760 --> 01:04:36,210 efficient way of loading the queue without 1304 01:04:36,210 --> 01:04:39,200 overloading the queue. 1305 01:04:39,200 --> 01:04:46,080 Traffic controls, this is perhaps one of the trickier 1306 01:04:46,080 --> 01:04:48,020 things with cell because the simulator doesn't help very 1307 01:04:48,020 --> 01:04:51,560 much and the tools don't help very much. 1308 01:04:51,560 --> 01:04:53,530 Thinking about synchronization, DMA latency 1309 01:04:53,530 --> 01:04:54,860 handling-- all those things are important. 1310 01:04:59,390 --> 01:05:01,690 OK, so this is the last slide that I'm going to do and then 1311 01:05:01,690 --> 01:05:02,940 I have to run off. 
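The size rules above -- each DMA a multiple of 128 bytes, at most 16 kilobytes, with a list of transfers occupying a single queue slot -- can be sketched with a hypothetical helper that splits a large, 128-byte-aligned transfer into list entries. This is only an illustration of the chunking arithmetic, not the actual DMA-list interface in the SDK:

```c
#include <stddef.h>

#define DMA_MAX   16384  /* 16 KB maximum per single DMA */
#define DMA_ALIGN 128    /* transfer size must be a multiple of 128 bytes */

/* One entry of a hypothetical DMA list: offset and size of a chunk. */
struct dma_entry {
    size_t offset;
    size_t size;
};

/* Split `total` bytes (which must be a multiple of 128) into chunks of
   at most 16 KB each. Returns the number of entries written, or 0 if
   the total is misaligned or the output array is too small. */
static size_t build_dma_list(size_t total, struct dma_entry *out,
                             size_t max_entries)
{
    if (total % DMA_ALIGN != 0)
        return 0;
    size_t n = 0, off = 0;
    while (off < total) {
        if (n == max_entries)
            return 0;
        size_t sz = total - off;
        if (sz > DMA_MAX)
            sz = DMA_MAX;
        out[n].offset = off;
        out[n].size = sz;
        off += sz;
        n++;
    }
    return n;
}
```

A 40 KB transfer, for example, becomes three list entries (16 KB + 16 KB + 8 KB) but still takes only one slot in the DMA queue.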
1312 01:05:05,820 --> 01:05:09,780 I want to give you a sense for the process by which people-- 1313 01:05:09,780 --> 01:05:12,320 my group in particular went through, especially when we 1314 01:05:12,320 --> 01:05:15,490 didn't even have hardware and we didn't have compilers that 1315 01:05:15,490 --> 01:05:17,880 worked nearly as well as they do now and it's really very 1316 01:05:17,880 --> 01:05:21,140 ugly knives and stones and sticks. 1317 01:05:21,140 --> 01:05:23,750 You know, just kind of stone knives. 1318 01:05:23,750 --> 01:05:26,580 That's what I'm thinking, very primitive. 1319 01:05:26,580 --> 01:05:30,970 But this way of thinking is still very much true. 1320 01:05:30,970 --> 01:05:32,570 You have to think about your code this way. 1321 01:05:32,570 --> 01:05:34,940 You want to start, you have your application, whatever it 1322 01:05:34,940 --> 01:05:35,900 happens to be; you want to do an 1323 01:05:35,900 --> 01:05:38,080 algorithmic complexity study. 1324 01:05:38,080 --> 01:05:41,140 Is this order n squared, is this log n? 1325 01:05:41,140 --> 01:05:42,260 Where are the bottlenecks? 1326 01:05:42,260 --> 01:05:45,160 What do I expect to be bottlenecks? 1327 01:05:45,160 --> 01:05:48,390 Then I want to do data layout/locality. 1328 01:05:48,390 --> 01:05:50,360 Now this is the data, data, data approach of Mike Acton. 1329 01:05:52,950 --> 01:05:54,430 You want to think about the data. 1330 01:05:54,430 --> 01:05:55,540 Where is it? 1331 01:05:55,540 --> 01:05:57,810 How can you structure your data so that it's going to be 1332 01:05:57,810 --> 01:06:01,550 efficiently positioned for when you need it? 1333 01:06:01,550 --> 01:06:04,400 And then you start with an experimental partitioning of 1334 01:06:04,400 --> 01:06:05,340 the algorithm. 
1335 01:06:05,340 --> 01:06:08,050 You want to break it up between the pieces that you 1336 01:06:08,050 --> 01:06:12,320 believe are scalar and remain scalar, leave those on the PPE, 1337 01:06:12,320 --> 01:06:14,460 and the ones that can be parallelized. 1338 01:06:14,460 --> 01:06:17,810 Those are the ones that are going to go on the SPE. 1339 01:06:17,810 --> 01:06:19,430 You have to think conceptually about 1340 01:06:19,430 --> 01:06:21,730 partitioning that out. 1341 01:06:21,730 --> 01:06:24,980 And then run it on the PPE anyway. 1342 01:06:24,980 --> 01:06:27,390 You want to have a baseline there. 1343 01:06:27,390 --> 01:06:31,370 Then you have this PPE scalar code and PPE control code. 1344 01:06:31,370 --> 01:06:35,230 This PPE scalar code you want to then push down to the SPEs. 1345 01:06:35,230 --> 01:06:39,060 So now you're going to add stuff for communication, 1346 01:06:39,060 --> 01:06:40,440 synchronization, and latency handling. 1347 01:06:40,440 --> 01:06:42,420 So you have to spawn threads. 1348 01:06:42,420 --> 01:06:43,640 The [? RAIDs ?] 1349 01:06:43,640 --> 01:06:47,110 have to be told where the data is, they have to get their 1350 01:06:47,110 --> 01:06:49,320 code, they have to run their code, they have to then start 1351 01:06:49,320 --> 01:06:51,490 pulling in the data, synchronize with the other 1352 01:06:51,490 --> 01:06:55,620 SPEs, and then latency handling with multibuffering of the 1353 01:06:55,620 --> 01:06:59,090 data so that you can be doing computing and reading data 1354 01:06:59,090 --> 01:07:01,020 simultaneously. 1355 01:07:01,020 --> 01:07:06,970 Then you have your first parallel code that's running. 1356 01:07:06,970 --> 01:07:12,400 Now the compiler, the XLC compiler, GCC compiler-- 1357 01:07:12,400 --> 01:07:14,900 well, the XLC compiler I know for certain will do some 1358 01:07:14,900 --> 01:07:16,370 automatic SIMDization 1359 01:07:16,370 --> 01:07:18,080 if you put the auto SIMD flag on. 
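The multibuffering for latency handling described above -- compute on one buffer while the next chunk is being transferred -- can be sketched in plain C. Here memcpy is a synchronous stand-in for the asynchronous DMA get; on a real SPE the transfer into the other buffer would be in flight while the current chunk is processed:

```c
#include <string.h>

#define CHUNK 4  /* elements per buffer; small for illustration */

/* Sum `n_chunks` chunks of input using two local buffers: while chunk
   i is being computed on, chunk i+1 is being transferred into the
   other buffer. memcpy stands in for an async DMA transfer. */
static long process_double_buffered(const int *src, size_t n_chunks)
{
    int buf[2][CHUNK];
    long sum = 0;
    size_t cur = 0;

    if (n_chunks == 0)
        return 0;
    memcpy(buf[cur], src, sizeof buf[cur]);      /* prefetch chunk 0 */
    for (size_t i = 0; i < n_chunks; i++) {
        size_t nxt = cur ^ 1;
        if (i + 1 < n_chunks)                    /* start the next transfer */
            memcpy(buf[nxt], src + (i + 1) * CHUNK, sizeof buf[nxt]);
        for (int j = 0; j < CHUNK; j++)          /* compute on current chunk */
            sum += buf[cur][j];
        cur = nxt;                               /* swap buffers */
    }
    return sum;
}
```

The buffer-swap structure is the point: with real asynchronous transfers, the compute loop on `buf[cur]` hides the latency of the transfer filling `buf[nxt]`.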
1360 01:07:18,080 --> 01:07:19,550 Does GCC compiler do that? 1361 01:07:19,550 --> 01:07:20,800 PROFESSOR: [UNINTELLIGIBLE PHRASE] 1362 01:07:23,300 --> 01:07:24,860 MICHAEL PERRONE: OK, so I don't know if the GCC 1363 01:07:24,860 --> 01:07:27,190 compiler does that. 1364 01:07:27,190 --> 01:07:33,690 So that can be done by hand, but sometimes that works, 1365 01:07:33,690 --> 01:07:34,670 sometimes it doesn't. 1366 01:07:34,670 --> 01:07:36,690 And it really depends on how complex the algorithm is. 1367 01:07:36,690 --> 01:07:39,530 If it's a very regular code, like a matrix-matrix multiply, 1368 01:07:39,530 --> 01:07:43,980 you'll see that the compiler can do fairly well if the 1369 01:07:43,980 --> 01:07:45,590 block sizes are right and all. 1370 01:07:45,590 --> 01:07:50,090 But if you have something that's more irregular then you 1371 01:07:50,090 --> 01:07:53,360 may find that doing it by hand is really required. 1372 01:07:53,360 --> 01:07:56,270 And so this step here could be done with the compiler 1373 01:07:56,270 --> 01:07:58,700 initially to see if you're getting the performance that 1374 01:07:58,700 --> 01:08:00,780 you think you should be getting from that algorithmic 1375 01:08:00,780 --> 01:08:02,380 complexity study. 1376 01:08:02,380 --> 01:08:04,420 You should see that type of scaling. 1377 01:08:04,420 --> 01:08:06,880 You can look at the CPI and see how many cycles per 1378 01:08:06,880 --> 01:08:08,480 instruction you're getting. 1379 01:08:08,480 --> 01:08:11,200 Each SPE should be getting 0.5. 1380 01:08:11,200 --> 01:08:13,590 You should be able to get two instructions per cycle. 1381 01:08:16,310 --> 01:08:19,480 Very few codes actually get exactly-- 1382 01:08:19,480 --> 01:08:27,180 you can get down to 0.58 or something like that, but I 1383 01:08:27,180 --> 01:08:29,830 think if you can get to 1 you're doing well. 
1384 01:08:29,830 --> 01:08:32,390 If you get to 2 there's probably more you can be doing 1385 01:08:32,390 --> 01:08:33,870 and if you're above 2 there's something 1386 01:08:33,870 --> 01:08:36,200 wrong with your code. 1387 01:08:36,200 --> 01:08:37,020 It may be the algorithm. 1388 01:08:37,020 --> 01:08:39,400 It may be just a poorly chosen algorithm. 1389 01:08:42,120 --> 01:08:44,020 But that's where you can talk to me. 1390 01:08:44,020 --> 01:08:46,010 I want to make myself available to everyone in the 1391 01:08:46,010 --> 01:08:48,460 class or in my department as well. 1392 01:08:48,460 --> 01:08:53,170 We're very enthusiastic about working with research groups 1393 01:08:53,170 --> 01:08:59,230 in universities to develop new tools, new methods and if you 1394 01:08:59,230 --> 01:09:00,180 can help me, I can help you. 1395 01:09:00,180 --> 01:09:01,850 I think it works very well. 1396 01:09:04,710 --> 01:09:07,440 Then once you've done this, you may find that what you 1397 01:09:07,440 --> 01:09:11,000 originally thought for the complexity or the layout 1398 01:09:11,000 --> 01:09:13,840 wasn't quite accurate, so you need to then go do some 1399 01:09:13,840 --> 01:09:14,970 additional rebalancing. 1400 01:09:14,970 --> 01:09:17,060 Maybe change your block sizes. 1401 01:09:17,060 --> 01:09:20,960 You know, maybe you had 64 by 64 blocks, now you need 32 by 1402 01:09:20,960 --> 01:09:25,800 64 or 48 by whatever-- some readjustment to match what you 1403 01:09:25,800 --> 01:09:30,610 have. And then you may want to reevaluate the data movement. 1404 01:09:30,610 --> 01:09:33,100 And then you know, in many cases you'll be done, but 1405 01:09:33,100 --> 01:09:35,620 you're looking at your cycles per instruction or your speed 1406 01:09:35,620 --> 01:09:39,960 up and you're not seeing exactly what you expected, so 1407 01:09:39,960 --> 01:09:42,830 you can start looking at other optimization considerations. 
1408 01:09:42,830 --> 01:09:46,210 Like using the vector unit, the VMX unit on the cell 1409 01:09:46,210 --> 01:09:49,840 processor, on the PPE. 1410 01:09:49,840 --> 01:09:53,760 Looking for system bottlenecks-- and this, I have 1411 01:09:53,760 --> 01:09:56,400 found, is actually the biggest problem. 1412 01:09:56,400 --> 01:09:59,730 Trying to identify where the DMA bottlenecks are happening 1413 01:09:59,730 --> 01:10:02,980 is kind of devilishly hard. 1414 01:10:02,980 --> 01:10:05,100 We don't have good tools for that, so you really have to 1415 01:10:05,100 --> 01:10:08,100 think hard and come up with interesting kinds of 1416 01:10:08,100 --> 01:10:11,260 experiments for your code to track down these bottlenecks. 1417 01:10:13,990 --> 01:10:15,160 And then load balancing. 1418 01:10:15,160 --> 01:10:17,850 If you look at these SPEs, I told you they're completely 1419 01:10:17,850 --> 01:10:18,520 independent. 1420 01:10:18,520 --> 01:10:20,850 You can have them all running the same code or they could be 1421 01:10:20,850 --> 01:10:22,170 running all different code. 1422 01:10:22,170 --> 01:10:24,310 They could be daisy chained so that this one feeds that one, 1423 01:10:24,310 --> 01:10:25,940 that one feeds the next, and so on. 1424 01:10:25,940 --> 01:10:28,020 If you do that daisy chaining you may find out there's a 1425 01:10:28,020 --> 01:10:28,400 bottleneck. 1426 01:10:28,400 --> 01:10:31,540 That this SPE takes three times as long 1427 01:10:31,540 --> 01:10:33,110 as any of the others. 1428 01:10:33,110 --> 01:10:38,540 So make that use 3 SPEs and have this SPE feed these 3. 1429 01:10:38,540 --> 01:10:41,430 So you have to do some load balancing and thinking about 1430 01:10:41,430 --> 01:10:43,460 how many SPEs really need to be dedicated 1431 01:10:43,460 --> 01:10:46,510 to each of the tasks. 1432 01:10:46,510 --> 01:10:50,920 Now that's the end of my talk. 
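The load-balancing rule of thumb above -- a pipeline stage that takes three times as long should get three SPEs -- amounts to allocating SPEs in proportion to each stage's measured cost. A minimal sketch of that arithmetic (the helper is hypothetical, not part of any Cell SDK):

```c
/* Allocate `total_spes` among pipeline stages in proportion to their
   measured per-item cost, giving each stage at least one SPE, and
   handing any leftover SPEs to the most expensive stage. */
static void balance_spes(const int *cost, int n_stages,
                         int total_spes, int *out)
{
    int total_cost = 0;
    for (int i = 0; i < n_stages; i++)
        total_cost += cost[i];

    int used = 0;
    for (int i = 0; i < n_stages; i++) {
        out[i] = cost[i] * total_spes / total_cost;  /* proportional share */
        if (out[i] < 1)
            out[i] = 1;                              /* every stage needs one */
        used += out[i];
    }

    int max_i = 0;                                   /* most expensive stage */
    for (int i = 1; i < n_stages; i++)
        if (cost[i] > cost[max_i])
            max_i = i;
    out[max_i] += total_spes - used;                 /* absorb the remainder */
}
```

For a four-stage pipeline where one stage costs 3x the others, six SPEs split as 1-3-1-1, matching the daisy-chain rebalancing described in the talk.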
1433 01:10:50,920 --> 01:10:54,900 I think that gives you a good sense of where we have been, 1434 01:10:54,900 --> 01:10:57,190 where we are now, and where we're going. 1435 01:10:57,190 --> 01:11:01,420 And I hope that it was good, educational, and I'll make 1436 01:11:01,420 --> 01:11:03,260 myself available to you guys in the future. 1437 01:11:03,260 --> 01:11:04,680 And if you have questions-- 1438 01:11:04,680 --> 01:11:05,170 PROFESSOR: Thank you. 1439 01:11:05,170 --> 01:11:10,140 I know you have to catch a flight. 1440 01:11:10,140 --> 01:11:11,810 How much time do you have for questions? 1441 01:11:11,810 --> 01:11:13,210 MICHAEL PERRONE: Not much. 1442 01:11:13,210 --> 01:11:14,890 I leave at 1:10. 1443 01:11:14,890 --> 01:11:16,750 So I should be there by 12:00. 1444 01:11:16,750 --> 01:11:16,970 PROFESSOR: OK. 1445 01:11:16,970 --> 01:11:18,080 So [UNINTELLIGIBLE] 1446 01:11:18,080 --> 01:11:18,810 at some time. 1447 01:11:18,810 --> 01:11:19,450 MICHAEL PERRONE: My car is out-- 1448 01:11:19,450 --> 01:11:22,770 PROFESSOR: OK, so we'll have about 5 minutes of questions. 1449 01:11:22,770 --> 01:11:25,630 OK, so I know this talk is early. 1450 01:11:25,630 --> 01:11:27,750 We haven't gotten to a lot of basics so there might be a lot 1451 01:11:27,750 --> 01:11:30,940 of things kind of going above your head, but we'll slowly 1452 01:11:30,940 --> 01:11:32,030 get back to it. 1453 01:11:32,030 --> 01:11:34,990 So questions? 1454 01:11:34,990 --> 01:11:38,190 AUDIENCE: You mentioned that SPEs would 1455 01:11:38,190 --> 01:11:40,910 be able to run a kernel. 1456 01:11:40,910 --> 01:11:43,517 Is there a microkernel that you could install on them so 1457 01:11:43,517 --> 01:11:45,660 that you could begin experimenting with MPI type 1458 01:11:45,660 --> 01:11:47,240 structures? 1459 01:11:47,240 --> 01:11:49,450 MICHAEL PERRONE: Not that I'm aware of. 
1460 01:11:49,450 --> 01:11:52,240 We did look at something called MicroMPI, where we were 1461 01:11:52,240 --> 01:11:57,290 using kind of a very watered down MPI implementation for 1462 01:11:57,290 --> 01:12:00,030 the SPEs in the transactions. 1463 01:12:00,030 --> 01:12:01,000 I don't recommend it. 1464 01:12:01,000 --> 01:12:07,060 What I recommend is you have a cluster say, a thousand node 1465 01:12:07,060 --> 01:12:10,570 cluster and the code today, the legacy code that's out 1466 01:12:10,570 --> 01:12:14,400 there runs some process on this node. 1467 01:12:14,400 --> 01:12:19,360 Take that process, don't try to push MPI further down, but 1468 01:12:19,360 --> 01:12:24,940 just try to subpartition that process and let the PPE handle 1469 01:12:24,940 --> 01:12:31,190 all the communication off board, off node. 1470 01:12:31,190 --> 01:12:32,130 That's my recommendation. 1471 01:12:32,130 --> 01:12:35,960 AUDIENCE: So MPI is running on [UNINTELLIGIBLE]? 1472 01:12:35,960 --> 01:12:37,890 MICHAEL PERRONE: Yeah, Open MPI. 1473 01:12:37,890 --> 01:12:39,840 It's an open source MPI. 1474 01:12:39,840 --> 01:12:42,310 It's just a recompile and it hasn't 1475 01:12:42,310 --> 01:12:44,960 been tuned or optimized. 1476 01:12:44,960 --> 01:12:48,480 And it doesn't know anything about the SPEs. 1477 01:12:48,480 --> 01:12:50,990 You know, you let the PPE do all the communication or 1478 01:12:50,990 --> 01:12:52,080 handle the communications. 1479 01:12:52,080 --> 01:12:55,180 When it finishes the task at hand then it can 1480 01:12:55,180 --> 01:12:56,967 issue its MPI process. 1481 01:12:56,967 --> 01:12:58,217 AUDIENCE: [UNINTELLIGIBLE PHRASE] 1482 01:13:00,010 --> 01:13:03,850 MICHAEL PERRONE: OpenMP is the methodology where you take 1483 01:13:03,850 --> 01:13:08,260 existing scalar code and you insert compiler pragmas to say 1484 01:13:08,260 --> 01:13:10,490 this for loop can be parallelized. 
1485 01:13:10,490 --> 01:13:13,330 And you know, these data structures are disjoint, so we 1486 01:13:13,330 --> 01:13:17,410 don't have to worry about any kind of interference, side 1487 01:13:17,410 --> 01:13:19,950 effects of the data manipulation. 1488 01:13:19,950 --> 01:13:24,090 The compiler, the XLC compiler implements OpenMP. 1489 01:13:24,090 --> 01:13:27,360 There are several components that are required. 1490 01:13:27,360 --> 01:13:30,980 One was a software cache where they implemented a little 1491 01:13:30,980 --> 01:13:32,460 cache on the local store. 1492 01:13:32,460 --> 01:13:36,250 And if it misses in that local cache it goes and gets it. 1493 01:13:36,250 --> 01:13:41,910 I don't know how well that performs yet, but it exists. 1494 01:13:41,910 --> 01:13:43,150 There's the SIMDization. 1495 01:13:43,150 --> 01:13:45,830 For a while, OpenMP wasn't working with auto SIMDization 1496 01:13:45,830 --> 01:13:48,830 but now it does. 1497 01:13:48,830 --> 01:13:53,310 So it's getting there; for C it's there. 1498 01:13:53,310 --> 01:13:55,205 I don't know what type of performance hit 1499 01:13:55,205 --> 01:13:55,950 you take for that. 1500 01:13:55,950 --> 01:13:59,820 AUDIENCE: Probably runs [UNINTELLIGIBLE PHRASE] 1501 01:13:59,820 --> 01:14:00,450 MICHAEL PERRONE: It's 1502 01:14:00,450 --> 01:14:03,110 the XLC version that does that. 1503 01:14:03,110 --> 01:14:06,700 I don't know if GCC does it. 1504 01:14:06,700 --> 01:14:10,500 But my recommendation is if you want to use OpenMP, go 1505 01:14:10,500 --> 01:14:14,010 ahead, take your scalar code, implement it with those 1506 01:14:14,010 --> 01:14:17,340 pragmas, see what type of improvement you get. 1507 01:14:17,340 --> 01:14:18,330 Play around with it a little. 1508 01:14:18,330 --> 01:14:21,710 If you find something that you expect should be 10x better 1509 01:14:21,710 --> 01:14:25,100 and it's only 3x, take that bottleneck and 1510 01:14:25,100 --> 01:14:26,350 implement it by hand. 
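The OpenMP approach described above -- annotate an existing scalar loop with a pragma when the iterations are independent -- looks like this in C. This is a generic example, not Cell-specific; without an OpenMP-aware compiler the pragma is simply ignored and the loop runs serially with the same result:

```c
/* Scale an array element-wise. The iterations are independent and
   the input and output arrays are assumed disjoint, so the loop is
   safe to parallelize with OpenMP. */
static void scale(const float *in, float *out, int n, float a)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        out[i] = a * in[i];
}
```

The pragma is exactly the kind of annotation Perrone describes: it asserts to the compiler that the loop body has no cross-iteration side effects, so the work can be divided among threads.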
1511 01:14:31,726 --> 01:14:32,976 AUDIENCE: [UNINTELLIGIBLE PHRASE] 1512 01:14:34,945 --> 01:14:39,340 with the memory models and such that the SPEs certainly 1513 01:14:39,340 --> 01:14:41,293 went back a couple of generations to a simpler 1514 01:14:41,293 --> 01:14:41,781 [INAUDIBLE]. 1515 01:14:41,781 --> 01:14:44,512 How come you went so far back rather than to just, say, 1516 01:14:44,512 --> 01:14:45,580 segmentation? 1517 01:14:45,580 --> 01:14:46,760 MICHAEL PERRONE: I don't know the answer. 1518 01:14:46,760 --> 01:14:48,010 I'm sorry. 1519 01:14:50,650 --> 01:14:53,860 I suspect, and most of these answers come down to the same 1520 01:14:53,860 --> 01:14:57,210 thing, it comes back to Sony. 1521 01:14:57,210 --> 01:14:59,990 Sony contracted with IBM, gave us a lot of money 1522 01:14:59,990 --> 01:15:00,700 to make this thing. 1523 01:15:00,700 --> 01:15:02,330 And they said we need a Playstation 3. 1524 01:15:02,330 --> 01:15:03,740 We need this, this, this, this. 1525 01:15:03,740 --> 01:15:06,870 And so IBM was very focused on providing those things. 1526 01:15:06,870 --> 01:15:10,650 Now that that is delivered and the Playstation 3 is being sold, 1527 01:15:10,650 --> 01:15:11,740 we're looking at other options. 1528 01:15:11,740 --> 01:15:17,560 And if that's something that you're interested in pursuing 1529 01:15:17,560 --> 01:15:18,020 you should talk to me. 1530 01:15:18,020 --> 01:15:20,057 AUDIENCE: Among other things it seems to me that the 1531 01:15:20,057 --> 01:15:23,622 lightweight mechanism for keeping the stack from 1532 01:15:23,622 --> 01:15:27,940 stomping on other things -- 1533 01:15:27,940 --> 01:15:33,950 PROFESSOR: I think that this is a very new area. 1534 01:15:33,950 --> 01:15:36,190 Before you put things in hardware, you need to have 1535 01:15:36,190 --> 01:15:39,190 some kind of consensus, what's the right way to do it? 
1536 01:15:39,190 --> 01:15:42,690 This is bare metal that gives you a huge amount of 1537 01:15:42,690 --> 01:15:44,320 opportunity, but it gives you enough rope to hang yourself. 1538 01:15:46,850 --> 01:15:49,250 And the key thing is you can get all this performance, and 1539 01:15:49,250 --> 01:15:52,380 what will happen perhaps, in the next few years, is people 1540 01:15:52,380 --> 01:15:53,730 come to a consensus saying, look, 1541 01:15:53,730 --> 01:15:54,790 everybody has to do this. 1542 01:15:54,790 --> 01:15:57,060 Everybody needs MPI, everybody needs this cache. 1543 01:15:57,060 --> 01:16:00,180 And slowly, some of those features will do a little bit 1544 01:16:00,180 --> 01:16:02,130 of a feature creep, so you're going to have a little bit 1545 01:16:02,130 --> 01:16:04,390 of overhead, be a little bit less power efficient. 1546 01:16:04,390 --> 01:16:05,630 But it will be much easier to program. 1547 01:16:05,630 --> 01:16:08,590 But this is kind of the bare metal thing that you get, and in 1548 01:16:08,590 --> 01:16:12,410 some sense, it's a nice time because I think in 5 years if 1549 01:16:12,410 --> 01:16:17,400 you look at cell you won't have this level of access. 1550 01:16:17,400 --> 01:16:20,940 You'll have all this nice stuff built on top, 1551 01:16:20,940 --> 01:16:22,840 so this is a unique position to be in. 1552 01:16:22,840 --> 01:16:25,840 It's very hard to deal with, but also on the other hand you 1553 01:16:25,840 --> 01:16:27,640 get to see underneath. 1554 01:16:27,640 --> 01:16:30,650 You get to see without any kind of these sort 1555 01:16:30,650 --> 01:16:31,310 of things in there. 1556 01:16:31,310 --> 01:16:33,700 So my feeling is in a few years you'll get all those 1557 01:16:33,700 --> 01:16:34,800 things put back. 
1558 01:16:34,800 --> 01:16:37,210 When and if we figure out how to deal with things like 1559 01:16:37,210 --> 01:16:40,640 segmentation on the multicore with very fine grain 1560 01:16:40,640 --> 01:16:42,640 communication-- there are a lot of issues here that you 1561 01:16:42,640 --> 01:16:43,370 need to figure out. 1562 01:16:43,370 --> 01:16:44,950 But right now all those issues are [INAUDIBLE]. 1563 01:16:44,950 --> 01:16:46,450 It's like OK, we don't know how to do it. 1564 01:16:46,450 --> 01:16:52,460 Well, you go figure it out, OK? 1565 01:16:52,460 --> 01:16:53,070 MICHAEL PERRONE: Thank you very much. 1566 01:16:53,070 --> 01:16:54,320 PROFESSOR: Thank you. 1567 01:16:56,390 --> 01:16:58,070 I don't have that much more material. 1568 01:16:58,070 --> 01:17:00,690 So I have about 10, 15 minutes. 1569 01:17:00,690 --> 01:17:03,160 Do you guys need a break or should we just go 1570 01:17:03,160 --> 01:17:03,740 directly to the end? 1571 01:17:03,740 --> 01:17:06,430 How many people say we want a break?