The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation, or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: So this hour we're going to talk about SIMD programming with Cell. First we'll talk a little bit about what SIMD is, then about the facilities that Cell and the compiler provide for programming with SIMD, and then some design considerations that you have to keep in mind when you're doing things.

All right, so the situation these days is that most compute-bound applications are running through a large piece of data and running the same computations on it over and over again, or rather, running the same computations across all the different pieces of data. And very frequently there will be no dependence between iterations when you're going through this data. So that means there are opportunities for you to data-parallelize.

So as an example, say we're multiplying a0 and b0 to get c0, and suppose we want to perform this operation across all the elements of arrays a, b, and c. Instead of multiplying two integers together, we're going to take two arrays, multiply each of the pairs element-wise, and write the results to a third array. So the picture is going to look something like this, and you would of course represent this using, for example, a for loop.

Now you can think of this as an operation that abstractly operates on these entire arrays. We're not going to go quite that far, but what we are going to do is think of these operations as acting on bundles of elements. So we're going to bundle our array elements into groups of four.
And then each time we're going to take a group, multiply it with another group using this element-wise multiplication, and write the result to a third bundle. OK, does that make sense? Now the thing about this kind of model is that Cell provides very good hardware support for something that looks kind of like this.

AUDIENCE: Is that actual Cell [INAUDIBLE]

PROFESSOR: Yes, I'll get into this. In fact, we'll be talking about the syntax and meaning of this kind of thing. All right?

So for this kind of thing to happen we need the compiler to support two different things. First, we need to be able to address these kinds of bundles of elements, and these are going to be called vectors. And second, we need to be able to perform operations on these vectors.

So Cell and the XLC compiler give us support for this. First, they provide registers which are capable of holding vectors. Normally you think of a register as holding one machine word; on a 32-bit machine that would be a 32-bit int, for example. What we have on the Cell are 128-bit registers, which can hold, for example, four ints right next to each other. So we're going to be able to take this bundle of ints and operate on it as a unit.

The second part is that we have operations that act on these vector registers. The Cell supports special assembly instructions that it interprets as acting on particular vectors. But we also have C++ language extensions called intrinsics, and those give us access to these special assembly instructions without requiring us to be poking around in the assembly.

All right, now the big draw of this is that these vector operations are going to be pretty much as fast as single scalar operations, which means that if we take advantage of them we can make our code run, say, four times as fast.
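[For reference, a minimal sketch of the element-wise multiply described above, first as a plain scalar loop and then stepping through the same arrays one bundle of four floats at a time, using the intrinsics introduced in the following slides. The array names, the length N, and the alignment attribute are illustrative assumptions; float arrays are used here because the SPU's vector multiply is most direct for floats.]

```c
#include <spu_intrinsics.h>

#define N 1024   /* illustrative length, a multiple of 4 */

float a[N] __attribute__((aligned(16)));
float b[N] __attribute__((aligned(16)));
float c[N] __attribute__((aligned(16)));

/* Scalar version: one multiply per iteration. */
void mult_scalar(void)
{
    int i;
    for (i = 0; i < N; i++)
        c[i] = a[i] * b[i];
}

/* SIMD version: each iteration multiplies a bundle of four floats. */
void mult_simd(void)
{
    vector float *va = (vector float *)a;
    vector float *vb = (vector float *)b;
    vector float *vc = (vector float *)c;
    int i;
    for (i = 0; i < N / 4; i++)
        vc[i] = spu_mul(va[i], vb[i]);
}
```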
PROFESSOR: OK, so how do we refer to these vectors when we're coding? XLC provides us with these intrinsics, and we have these vector data types. Each one just specifies how to interpret a consecutive group of 128 bits as some sort of vector, and you can have vectors of varying element sizes and varying numbers of elements.

So when you're programming on the PPU or the SPU you get these four different kinds of vector data types. You can declare things as, for example, vector signed int, which is what I mentioned in the example: four ints next to each other, each 32 bits. You could also have vectors which contain 16-bit integers or 8-bit integers, and you could also have vectors of floating point numbers. I should mention that all of these signed integer types also have unsigned equivalents. Anyway, you can just declare these anywhere in your code and use them as if they were a C++ data type. All right, any questions?

On the SPU you also get some additional vector data types. One is vector signed long long, which is 64-bit ints, and you can fit two of those in 128 bits. And you can also fit two double-precision floating point numbers in 128 bits.

Now the compilers actually support these types pretty nicely. Not only can you declare variables of these types pretty much anywhere in your code, you can also declare pointers to these types and arrays of these types. So they look pretty much like natural C++ types, except that they translate directly into these particular types that the hardware supports.
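[As a rough sketch, declarations of these types look like ordinary declarations; the variable names here are illustrative, and the long long and double forms are the SPU-only ones just mentioned.]

```c
#include <spu_intrinsics.h>

vector signed int       vi;   /* four 32-bit signed ints                */
vector unsigned short   vs;   /* eight 16-bit unsigned ints             */
vector signed char      vb;   /* sixteen 8-bit signed ints              */
vector float            vf;   /* four single-precision floats           */

vector signed long long vll;  /* two 64-bit ints (SPU only)             */
vector double           vd;   /* two double-precision floats (SPU only) */

vector float *pvf;            /* pointers to vector types are fine      */
vector signed int table[32];  /* so are arrays of vector types          */
```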
Now in order to manipulate these vector data, we have compiler extensions called intrinsics, which provide access to the assembly-level features that we want. Remember, there are specific assembly instructions that correspond to, for example, adding two vectors which each contain four 32-bit integers. And instead of going into the assembly and inserting that instruction ourselves, we just use a compiler intrinsic inside our C++ code. What it does is provide a notation that looks a lot like a function call, but the compiler automatically translates it into the correct assembly instruction. And again, you don't have to worry about going into the assembly and messing around with which instruction applies to which registers; you don't have to worry about register allocation at all. The compiler just figures out the right thing for you. And to use these in your SPU program you're going to want to include spu_intrinsics.h.

Now what's a little bit confusing is that you're going to have slightly different intrinsics available on the PPU and the SPU, because those actually have different instruction sets. But anyway, as an example, you can declare two variables of type vector signed int, add them using the intrinsic called spu_add, and assign the result to a third vector signed int. Questions? Yep.

AUDIENCE: In what way are they different if you're on the SPU or the PPU? Is it just that not entirely the same set of operations is available? Or are there actually semantic differences? Could you make a little header file that masks over the differences, mostly?

PROFESSOR: There are going to be some operations that are only available on one and not the other. But in general, if the names correspond, and I'll go into that in a little bit, then they should perform essentially the same function.

AUDIENCE: [INAUDIBLE] mostly was the [INAUDIBLE] also some name differences where there really don't need to be. For instance, if you try to do a shift on the PPU I believe it's [INAUDIBLE] SL, shift logical right, shift logical left, or shift arithmetic right. Sort of things you would remember.
AUDIENCE: On the SPU it's the acronym for rotate and mask for shift. So R-O-T-M-A-R or something like that. So yes, there are some differences that don't need to be there.

PROFESSOR: OK, so to actually create these vectors there are a couple of different notations you can use. The first is this thing that looks like a cast to, for example, vector signed int. So you do vector signed int in parentheses and then a list of the four integers you want to fill in, and that will create an integer vector and assign it to a. I believe you can also use that notation with just one integer, and it will fill in that integer in all four positions. There's also an SPU intrinsic called spu_splats that you can use to basically copy the same integer to all four components.

AUDIENCE: How does it know you're not using a comma operator?

PROFESSOR: Yeah, I don't know. Is that right, David, with the parentheses in the second part? OK.

AUDIENCE: Whatever.

AUDIENCE: Another caveat here from someone who's been in the trenches is that XLC likes this notation. GCC sometimes likes curly brace notation instead.

PROFESSOR: I'd seen both of those and I didn't know which to use.

[INTERPOSING VOICES]

AUDIENCE: [INAUDIBLE]

PROFESSOR: OK, great, thanks.

All right. And after you've assigned some of these variables, in order to get the pieces back out, one way you can do it is to use this union trick, where you allocate something of type vector signed int and then tell C++ that it can find an array of integers in the same place. And that will let you pull out the components. So if you define this union this way, then you get a type called intVec. And any time you have an intVec you can either do .vec to get at the vector signed int, the vector data type, or you can use .vals with an array index to get at the components of the vector. And you could also use the intrinsic spu_extract to pick out the same components.
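[Here is a small sketch of those notations side by side. It assumes SPU-side code and uses the curly brace initializer form that GCC prefers, as just noted; the union and variable names are illustrative.]

```c
#include <spu_intrinsics.h>
#include <stdio.h>

typedef union {
    vector signed int vec;      /* the whole 128-bit vector       */
    int               vals[4];  /* the same bits, seen as 4 ints  */
} intVec;

int main(void)
{
    /* Literal notation; XLC also accepts the cast-style form
       (vector signed int)(1, 2, 3, 4). */
    vector signed int a = {1, 2, 3, 4};

    /* spu_splats copies one scalar into every element. */
    vector signed int b = spu_splats(10);

    /* Element-wise add of the two vectors. */
    vector signed int c = spu_add(a, b);

    /* Pulling elements back out, two ways. */
    intVec u;
    u.vec = c;
    printf("%d %d\n", u.vals[0], spu_extract(c, 3));  /* prints 11 and 14 */
    return 0;
}
```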
XLC provides a bunch of different vector operations that you can use. There are integer operations, floating point operations, permutation and formatting operations which you can use to shuffle data around inside a vector, and there are also load and store instructions. And I believe we have a reference linked off the course website if you want to find out more about these; I'm only going to touch on a few of them.

OK, so the arithmetic and logical operations. Like I said, most of these are the same between the PPU and the SPU; there are some that are named slightly differently and some that are not available on the PPU. These are all things you would expect: add, subtract. Madd is multiply and then add, with three arguments. There's multiply; re is for reciprocal. You can also do bit-wise and, or, and xor, and I believe there are other logical operations there too.

Now the thing is, you do have to keep track of whether you're using a PPU or an SPU intrinsic, but you usually don't have to worry about selecting the right vector type. The compiler should figure out which vector types you're using and substitute the appropriate assembly instruction that produces a result of the same vector type. So all these operations are what we call generic: they stand in for all the specific instructions, each of which only applies to a single vector type. Does that make sense?
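[As a sketch of what these generic operations look like in code; the function and variable names are mine, and the usual spu_* spellings are assumed.]

```c
#include <spu_intrinsics.h>

/* a*b + c on four floats at once; spu_madd is the three-argument
   multiply-and-add mentioned above. */
vector float fused(vector float a, vector float b, vector float c)
{
    return spu_madd(a, b, c);
}

/* The same generic names work on integer vectors where the operation
   makes sense, for example element-wise add followed by a bit-wise and
   with a splatted mask. */
vector signed int combine(vector signed int x, vector signed int y)
{
    return spu_and(spu_add(x, y), spu_splats(0xFF));
}
```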
OK, so one handy thing is the permutation operation, and this allows you to rearrange the bytes of a vector, or of two vectors, arbitrarily. The syntax is spu_shuffle(a, b, pattern), where a and b are your source vectors and pattern tells you how to shuffle them. The pattern is interpreted as a vector of 16 bytes, and each byte tells the compiler how to pick out one byte of the result. The way a pattern byte is interpreted is that the low four bits specify which byte position the source byte comes from, and the next bit specifies whether you're going to pull from a or from b.

So as an example, here's a pattern vc. If you look at its second byte, which is 0x14 in hex, that means the second byte of the destination register is going to contain byte number four of b, counting from zero. So the 4 means select the byte numbered four, and the 1 means select from b. Does that make sense? And this is very versatile: by putting in the right pattern vector you can arrange for all these bytes to be shuffled around however you want.

AUDIENCE: The pattern is a constant [INAUDIBLE].

PROFESSOR: Pardon?

AUDIENCE: Is the pattern a constant or an immediate parameter?

PROFESSOR: You can fill in the pattern at run time, if that's what you're asking.

AUDIENCE: [INAUDIBLE]

AUDIENCE: [INAUDIBLE]
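[A small sketch of that shuffle example; the vector contents are made up, but the pattern byte 0x14 behaves as just described: low nibble 4, select-from-b bit set.]

```c
#include <spu_intrinsics.h>

vector unsigned char shuffle_demo(vector unsigned char a, vector unsigned char b)
{
    /* Pattern bytes 0x00-0x0f select that byte of a; 0x10-0x1f select
       byte n-16 of b.  Byte 1 here is 0x14, so byte 1 of the result is
       byte 4 of b; every other byte is copied straight from a. */
    vector unsigned char vc = {
        0x00, 0x14, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
        0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f };
    return spu_shuffle(a, b, vc);
}
```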
PROFESSOR: OK, also useful are these rotation operations, which will let you shift your vector left or right by some amount.

Now one thing to be aware of is that on the SPU you only have these 128-bit registers. On the PPU you have different registers which are suitable for holding different types; for example, there are word-sized registers for holding ints, and the PPU also has these 128-bit registers. But the SPU has nothing else. So that means whenever you're using scalar types on the SPU they're all going to live in these large registers, no matter what the size of the scalar you're using. And depending on the size of the scalar, it's going to go in a particular position inside this wide register, which is called a quadword register because it's 16 bytes.

Now the thing to watch out for is that whenever you load a scalar from memory into one of these registers, there may have to be a little extra processing done in order to shift the scalar into the right place inside the register. And furthermore, the hardware always wants to grab one of these quadwords at a time, so loading a scalar is not going to be any cheaper than loading a whole quadword register. So if possible you're going to want to load an entire quadword register at a time, and if you just need a part of it, you can figure that out later. But you might as well get the whole thing. Questions?

AUDIENCE: So when you, just a scalar question. So when you load a scalar value that's not aligned with the preferred position, is there overhead associated with that?

PROFESSOR: I'm not sure how much overhead is associated with that. Pardon? Oh, do you know?

AUDIENCE: Well, for a scalar it can only load on 16-byte boundaries. So it's going to load something that includes that, and then it's going to have to shift it into the right position.

PROFESSOR: So when it has to shift the scalar around, does that actually take longer than when it's in the natural position?

AUDIENCE: I don't know. Well, what you can do is set some flags in XLC that say, align all of my scalars correctly, and we'll waste 4x the space. It will even align the elements of an array so that each scalar can be loaded directly, wasting space so that everything in the array is [INAUDIBLE]. So you can have the compiler trade off space versus time for you with a couple of switches.

PROFESSOR: I see.

OK, so we're going to want to look at the sim application from recitation two, and we want to adapt it to make use of SIMD data types and intrinsics. So what we've done is, remember we had these x, y, z coordinates that we were manipulating. What we're going to do is pad each one; it was three words before, and we're going to pad each one so that it fills a quadword.
And so for each quadword, of course, the first three words are going to correspond to the x, y, z components, and we can grab those out using spu_extract or some other intrinsics.

Now when we're doing manipulations with these components, for example, we want to find the displacement between two locations, and that's just subtracting two of these coordinates. So that subtraction, which before required three floating point subtractions, we can replace with a single SIMD instruction. Does that make sense?
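[A minimal sketch of what this looks like; the type name and the idea of keeping the padded coordinate directly in a vector float are assumptions about the handout code, not the code itself.]

```c
#include <spu_intrinsics.h>

/* One padded coordinate: {x, y, z, unused} in a single quadword. */
typedef vector float coord_t;

/* Three scalar subtractions become one vector subtract. */
coord_t displacement(coord_t p, coord_t q)
{
    return spu_sub(p, q);
}

/* Individual components can still be pulled out when needed. */
float get_y(coord_t p)
{
    return spu_extract(p, 1);
}
```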
OK, so most of this has already been done, and we're providing most of the implementation of this SIMD version of sim. What we want you to do is download the tarball for this recitation, go in there, and fill in one of the blanks. All right? There's just one function that's been left unimplemented, and to see if you know what's going on, see if you can fill in the implementation for it. Any questions? This function you want to implement is basically going to take a vector float, and if that vector contains a, b, c and d, you want to return a vector each of whose elements is a plus b plus c plus d. Questions?

AUDIENCE: What directory under the...

AUDIENCE: [INAUDIBLE]

PROFESSOR: So we're going to go into sim a list.

AUDIENCE: But we can stay around afterwards and help you figure out what's going on.

PROFESSOR: OK, so here's one implementation. Basically we're going to just declare another vector float, and that vector float is basically the result of these swaps. So notice that in this first one we're swapping the first and second words, and the third and fourth words. That means down here we're going to want to carry bytes four, five, six, seven first and then bytes 0, 1, 2, 3; and then over here we want bytes 12, 13, 14, 15 and then 8, 9, 10, 11. Everyone see what's going on for the first shuffle? And then we're going to just add that to our original vector to get this. And then we can do it again; this time we just want to swap the two halves, so the shuffle pattern is going to be 8, 9, 10, 11, 12, 13, 14, 15 followed by 0, 1, 2, 3, 4, 5, 6, 7. Make sense?
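[Putting that together, a sketch of the whole function might look like the following; the function name is mine, and the recitation handout may spell things differently.]

```c
#include <spu_intrinsics.h>

/* Given v = {a, b, c, d}, return {a+b+c+d, a+b+c+d, a+b+c+d, a+b+c+d}. */
static inline vector float sum_across(vector float v)
{
    /* Swap word 0 with word 1, and word 2 with word 3:
       bytes 4..7, 0..3, 12..15, 8..11. */
    vector unsigned char swap_words = {
        0x04, 0x05, 0x06, 0x07, 0x00, 0x01, 0x02, 0x03,
        0x0c, 0x0d, 0x0e, 0x0f, 0x08, 0x09, 0x0a, 0x0b };

    /* Swap the two halves: bytes 8..15 then 0..7. */
    vector unsigned char swap_halves = {
        0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f,
        0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07 };

    v = spu_add(v, spu_shuffle(v, v, swap_words));   /* {a+b, a+b, c+d, c+d} */
    v = spu_add(v, spu_shuffle(v, v, swap_halves));  /* all elements a+b+c+d */
    return v;
}
```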
OK, so the way we translated the program we just used into SIMD was with an array of structs. Basically each of the structs that we had from our previous implementation just carried over, and we put all of those into an array, so the structs were right next to each other in memory.

Alternatively, we could have laid out the data in memory in a different way, and this is called a struct of arrays layout. Instead, what we can do is put all the like fields next to each other, so that we have, for example, an array of all the x components, then an array of all the y components, then an array of all the z components. And when you reorder the data this way you get different ways to process it. So for example, now each quadword, instead of containing the data for a single point, is going to contain the same component of four consecutive points. Everyone see that?

And we can actually implement the algorithm from before in this new layout, but we have to be a little bit more clever about how we put the elements together. Before, we were able to just subtract or multiply the quadwords with each other, because those corresponded to, for example, subtracting the coordinates of two points. This time we have to do some additional computation in order to put all the pieces together.

The trick behind this struct of arrays implementation, which I'll just gloss over, is that if we're storing state for eight objects, then for each object we need the position and the velocity, and for each of those we have x, y and z. So that means to store state for eight objects we need 8 times 6, which is 48 words, and we can put those in 12 quadwords if we pack them right.

And when we do SIMD operations on the quadwords that we pull out, we can get four pair interactions at a time. So suppose this quadword contains data corresponding to objects a, b, c and d, and over here we have a quadword containing data corresponding to objects one, two, three and four. With some SIMD operations we can figure out the pairwise interactions between objects a and one, b and two, c and three, and d and four. But of course we have to be able to find the interactions between any pair, not just these pairs that line up. So what we have to do is rotate the quadword over by one word and then do the same thing again. We do that four times in all, and then we add up the results.

So as you can see, this implementation is a little bit more involved, and it maps to the original implementation less directly. On the other hand, it does give us a really dramatic speedup, because we're using more of the vector words. Notice that in the first packing we had x, y and z and then the fourth slot was unused. Anyway, this time the struct of arrays implementation is actually 7 1/2 times faster than the array of structs implementation. So choosing this data layout correctly can actually be one of the really big determinants of how your program performs.
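[For reference, a sketch of the two layouts as C declarations; the field and type names are illustrative, not the recitation's actual code.]

```c
#define NUM_OBJ 1024   /* illustrative object count, a multiple of 4 */

/* Array of structs: each object's x, y, z padded out to one quadword. */
typedef struct {
    float x, y, z, pad;
} pos_aos_t;
pos_aos_t positions_aos[NUM_OBJ] __attribute__((aligned(16)));

/* Struct of arrays: all x's together, all y's, all z's.  Each quadword
   now holds the same component of four consecutive objects. */
typedef struct {
    float x[NUM_OBJ];
    float y[NUM_OBJ];
    float z[NUM_OBJ];
} pos_soa_t;
pos_soa_t positions_soa __attribute__((aligned(16)));
```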
AUDIENCE: The scalar version was like what, 480, something like that? Or is it not comparable?

PROFESSOR: Let's see, David, do you remember?

AUDIENCE: [INAUDIBLE]

PROFESSOR: OK, so...

AUDIENCE: [INAUDIBLE]

AUDIENCE: No, that was just on the PPU.

AUDIENCE: [INAUDIBLE]

[INTERPOSING VOICES]

AUDIENCE: [INAUDIBLE]

PROFESSOR: OK, so something like 400 for the double-buffered one and 300 for the array of structs.

OK, one other thing to worry about when you're dealing with these SIMD instructions is that you want to make sure that all your data are aligned correctly in memory. And like I said before, when you're pulling things in from memory you want to make sure that whatever you're pulling in is aligned on a quadword boundary. You can use the align compiler directive to tell the compiler, I want this piece of data aligned at a particular place. And if you do that on all your arrays, for example, and make sure that the array elements fit neatly into quadwords, then you should be OK.

Again, like I said before, you also want to transfer only multiples of 16 bytes on loads and stores. And so when you're doing processing it may help if you pad the end of your arrays so that they fill out a multiple of 16 bytes, because it's easier to just do that processing with a SIMD instruction than to have one or two elements hanging off the end that you have to worry about. Questions?

AUDIENCE: Question.

PROFESSOR: Yep.

AUDIENCE: Is it a good idea to pass parameters [INAUDIBLE] I mean, which one is preferred? [INAUDIBLE]

AUDIENCE: So you should [INAUDIBLE] for figuring out whether something can scale easily or not. So you might make [INAUDIBLE]. So in cases where you can avoid using pointers, you should do that.

PROFESSOR: OK.

[SIDE CONVERSATION]
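[Before moving on, a small sketch of the two points from a moment ago, quadword alignment and padding to a multiple of 16 bytes. It assumes the __attribute__((aligned)) spelling accepted by XLC and GCC as the alignment directive; N and the array name are illustrative.]

```c
#define N 1000

/* Round the element count up to a multiple of 4 floats (16 bytes) and
   align the array on a quadword boundary, so SIMD loads and stores can
   cover the whole array without a scalar clean-up loop at the end. */
#define N_PADDED ((N + 3) & ~3)

float samples[N_PADDED] __attribute__((aligned(16)));
```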
PROFESSOR: So one last thing that I should mention. I haven't really let on, but compilers can actually generate some of these SIMD instructions by themselves. If you declare your types to be vectors and then use just regular operators, apparently GCC and XLC will substitute the correct intrinsics for you. Of course that doesn't get you all the operations which are available with intrinsics. But anyway, automatically SIMD-izing your code is something that's really worth looking into; as we saw, it can give you a great performance improvement. The thing is that compilers are still not very good at doing this transformation automatically. So unlike instruction scheduling, where if you're passing -O5 your compiler will do a much better job than you would have time to do yourself, this is something that you should probably reserve some time for.

That's all. If you have any questions you can stick around and I'll try and help you.