The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: All right, everyone. Welcome back. Today we're going to talk about the profiling tools available on Cell to help you evaluate how your programs are running. In the next hour I'll talk about SIMD, which is how you can take advantage of the hardware support for data parallelization.

First up, I'd like to follow up on some questions we had last time about gdb, and then I'll discuss some of the various methods for profiling on Cell. I'll talk about the Cell simulator and using the decrementers, and then we're going to talk about static profiling and instruction reordering, which I started to get into last time.

So first, we ran into some problems last time with gdb, various error messages. First, sometimes gdb is going to examine the wrong variable when the names are ambiguous, as we saw. We had problems with ambiguity between variables that have the same name, where one of them was in the SPU thread and one of them was in the PPU thread. To work around this, one thing you can do is use spu-gdb, which attaches to a particular SPU thread, and then you can just look up variable names in that thread. You can also, of course, rename your variables.

The second problem is that sometimes gdb will delete your breakpoints before you want it to. The reason is that it removes breakpoints whenever the program associated with a breakpoint has been unloaded. But sometimes you have more than one thread running the same program, and then you won't be able to break into the second one. So again, you can use spu-gdb to attach to a particular thread if you want to do that.
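A minimal sketch of the delay-loop workaround that comes up next. The variable name keep_alive is made up, and the three-argument main is just the conventional SPU thread signature, so treat this as illustrative rather than the lab's actual code:

    /* Spin at the end of the SPU program so the thread stays loaded and gdb
       keeps its breakpoints. Clear keep_alive from the debugger
       (set var keep_alive = 0) when you want the thread to exit. */
    volatile int keep_alive = 1;

    int main(unsigned long long speid, unsigned long long argp, unsigned long long envp)
    {
        /* ... the real SPU work goes here ... */

        while (keep_alive)
            ;   /* debugging aid only; remove for real runs */

        return 0;
    }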
Another thing you can do is, for example, put a delay loop at the end of your SPU thread in order to keep that thread running for as long as possible. It seems that as long as gdb doesn't unload that thread, it won't delete your breakpoint.

And finally, I saw this kind of cryptic error, "thread event breakpoint: gdb should not stop." Indeed it shouldn't. To get around this one, it seems that the problem doesn't happen with spu-gdb.

So I'll just recap how to use spu-gdb to debug your programs. Last time we were mostly using ppu-gdb, which helps you debug both your PPU program and your SPU program together, and you're able to switch between all the threads that are available. On the other hand, spu-gdb just attaches to a particular SPU thread. So what you do is invoke your original PPU program with SPU_DEBUG_START=1, and then you can background it. Now, every time your program spawns a new SPU thread, it will print the process ID, and that spawned thread will stop and wait for a debugger to attach to it. At that point, you can run spu-gdb -p and then the process number it gave you. And once you've attached, you can continue that thread to resume debugging. Any questions?

OK, so the Cell simulator is one of the tools we have available to help you debug. This is how you invoke it if you have it installed: /opt/ibm/systemsim-cell/bin/systemsim, and you want to use -g to get the graphical interface. A little window will pop up and you'll click Go. The window looks like this. From here, you can have a lot of fine-grained control over how your program is running. In fact, you can advance the state of the simulation cycle by cycle to look at what's going on inside the simulated Cell system.
And when you boot up this simulator, you actually get a complete Linux environment in there, with an xterm. And what you can do in this Linux environment is copy over programs that were compiled for the Cell, and then you can run them.

In order to transfer a program to your simulated system so that you can run it, you can use this program, callthru, which is available on the simulated system. What it does is grab a file that's on the host system and copy it in. It just puts it on standard out, so you can redirect it to whatever file you want. And then when you make it executable, you can run it. All right, questions?

While you're in the simulator, you can view a lot of different things. You can get a lot of information about the state of the machine. In particular, you can look at the registers at any particular time, again clock step by clock step. This might not be as interesting now that you have the debugger, and you can run the debugger on real hardware. But one thing the Cell simulator does get you is these dynamic profiling tools. What it will do is look at the state of the simulation and figure out where things are stalling, where the time is being spent by the processor. And you can get separate stats like these. It tells you, for example, the total cycle count and the stalls for various reasons. You get all these statistics separately for the PPU and each SPU. So this can be very helpful if, for example, you're trying to figure out exactly what's going on on each SPU, to do load balancing, whatever.

These statistics cover whatever length of the program you run. But you can also get more fine-grained statistics using these profiling functions, sketched below.
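Roughly, those checkpoint calls look like this. This is a minimal sketch, and it assumes the header really does name them prof_clear, prof_start, and prof_stop, which is how I read "profile clear, start, and stop" below; check profile.h on your system:

    #include <profile.h>   /* simulator profiling checkpoints; no-ops on real hardware */

    void region_of_interest(void)
    {
        prof_clear();   /* reset the simulator's counters              */
        prof_start();   /* begin collecting statistics for this region */

        /* ... the code you want cycle and stall statistics for ... */

        prof_stop();    /* stop collecting; read the stats in the simulator */
    }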
So if you include profile.h in your program, then you get access to these functions: profile clear, start, and stop. When these get compiled into your program, they're actually no-ops when you run them on a real Cell processor. But the simulator is able to latch onto them, and you can use them to mark the start and stop of regions of interest for profiling.

Now, the simulator comes with the SDK, and it's a little bit cumbersome to install, so we haven't made it available yet on the actual PS3 hardware. You can run it on x86, so if your group would like to get set up with that on one of your computers, or if there's enough interest, we'll try to make it available on our hardware. So please let us know.

OK, the next thing we'll cover is using decrementers to profile your program. This is one way to do dynamic profiling that doesn't require the simulator, so when you get information from the decrementers, you can run these programs on your actual Cell hardware. Basically, the decrementer is just a counter that counts down at a constant rate, so you can use it as a clock to figure out how long different events last. The rate at which the decrementer counts down is not that fast, so you're not going to be able to use it to time things on the order of a few clock cycles. It's best for timing things that are maybe on the order of thousands of instructions in length.

The way you use it is through these SPU write channel and SPU read channel functions, which give you access to some of the internals of the Cell processor. So here's an example setup you can use. Basically, first you call write channel and you pass in this MFC_DECREMENTER_EVENT thing, and what that will do is initialize the decrementer counter for you.
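Here is a rough sketch of the whole timing pattern, including the read-before/read-after and the stop step that are described next. The channel intrinsics (spu_writech, spu_readch) and channel names (SPU_WrEventMask, SPU_WrDec, SPU_RdDec, SPU_WrEventAck) are the ones I believe spu_intrinsics.h and spu_mfcio.h provide, so double-check the exact recipe against your SDK headers:

    #include <stdio.h>
    #include <spu_intrinsics.h>
    #include <spu_mfcio.h>   /* MFC_DECREMENTER_EVENT and the channel names */

    void time_region(void (*work)(void))
    {
        unsigned int start, end;

        /* Enable the decrementer event and load a large starting value. */
        spu_writech(SPU_WrEventMask, MFC_DECREMENTER_EVENT);
        spu_writech(SPU_WrDec, 0x7fffffff);

        start = spu_readch(SPU_RdDec);   /* read before the region of interest */
        work();
        end = spu_readch(SPU_RdDec);     /* read after it                      */

        /* The counter counts down, so subtract the end from the start. */
        printf("elapsed: %u decrementer ticks\n", start - end);

        /* Done timing: per the recipe in the lecture, another channel write
           stops the counter; masking the event off and acknowledging it is
           my guess at the exact incantation. */
        spu_writech(SPU_WrEventMask, 0);
        spu_writech(SPU_WrEventAck, MFC_DECREMENTER_EVENT);
    }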
Then you read the value before and after the function you want to profile, and when you subtract, that gives you the time that elapsed, in some arbitrary units. And of course, it's counting down, so you want to subtract the end value from the start value. And after you're done using it, you can do another write channel to stop the counter from continuing. All right, any questions?

OK, let's continue talking about instruction reordering, which I started last time. I'm going to start from the beginning this time, and maybe it'll be more coherent.

So on the Cell architecture, when you're looking at what instructions are being run, the instructions are mostly going to be of the form: evaluate some function of some of the registers and write the result to some other register. And the assembly file, which is what you get when you run gcc with -S, is just a human-readable representation of these instructions. I'll show you how to read that a little later. You can think of these instructions as executing in series, just in the order in which they appear in the assembly. So if you think of one instruction starting and then finishing, then the next instruction coming in and finishing, that should be consistent with the actual behavior of the hardware.

Now, for real hardware, waiting for one instruction to finish before the next one starts is going to be too slow. So there are various optimizations that the hardware does in order to start instructions sooner than when the previous instruction finishes. The big one is pipelining, where you have multiple stages inside your processor, and you can push each instruction in before the previous one has completed. And with pipelining, you're going to be subject to these dependencies between instructions.
For example, if one instruction reads from a value that the previous instruction wrote, then clearly you can't start the second instruction until after the first instruction has completed.

The Cell processor actually has multiple pipelines into which you can insert instructions, and another optimization it does is branch prediction, in order to reduce stalls from branching.

AUDIENCE: [INAUDIBLE].

PROFESSOR: Pardon?

AUDIENCE: Can you talk about static branch prediction?

PROFESSOR: Yes. Static branch prediction means-- or rather, the way it's done on the Cell is that what's predicted for a branch is not affected by the history of which branches have been taken. Some other processors actually do use this information.

All right. Now, the thing about pipelining is that because we still have to honor these dependencies between instructions, the order in which we evaluate instructions is going to make a difference in how long the program runs. So let's take a simple example where we have three instructions, a, b, and c, and we have this pipelined processor where we can insert instructions at these clock ticks. Suppose that c depends on b, and there are no other dependencies in the system. Then this sequence of instructions is going to take five cycles, because we can push b in after the first cycle, but we can't push c in until two cycles after that. Is that clear? Questions?

All right. Now, because c depends on b, it makes sense to push b as far up as we can in order to ease this dependency. So if we execute b before a instead of after a-- and remember, we can reorder b and a however we want, because there are no dependencies between them-- then we can start b at the first tick, put in a right after that, and then c. And now the sequence only takes four clock cycles. All right?
So observe that when we put c in at this third clock cycle, we couldn't have put the third instruction in any sooner, no matter what the dependencies were. And in general, that's what we want: we want instructions to be waiting on the pipeline rather than on dependencies. So that means in the first picture there's this stall of one clock cycle, and we're going to want to eliminate these kinds of stalls. All right?

So what we're going to do is use these static profiling tools to figure out where stalls happen. The first thing we have to do is generate the assembly and the instruction schedule that goes with it. You can use gcc -S, and the same flag works with xlc, to generate the assembly. And then we have this utility called spu_timing, which runs on the Cell and which will tell you the instruction schedule. Basically, it takes in your assembly file and tells you exactly when each instruction in that assembly gets scheduled. If you call it with -running-count, then it will tell you the running count of clock cycles at each stage, which means that if you look at the last number, that will tell you how long your program took to run, in clock cycles. And it will write the output to a file with the same name as the input, but with .timing at the end. Now, if you're using our Makefile, you can do all of this-- generate the assembly and the timing info-- in one step, by doing SPU_TIMING=1 and then make with the name of the .s file, all right?

Now, if you recall, on Cell we have only in-order execution. The instructions are just going to execute in the order that they're specified in the assembly. And the Cell has dual pipelines, meaning there are two pipelines into which instructions can go, and the pipeline is selected just based on the instruction type.
And of course, two instructions are only going to issue simultaneously when the dependencies allow it.

AUDIENCE: [INAUDIBLE].

PROFESSOR: Right.

So if you're looking at the assembly, you're going to see lots of lines of this form: the op, the destination, and then the source registers. So when you're trying to figure out dependencies, it's going to be helpful to be able to look at each line of the assembly and figure out which registers that line is reading from and which registers that line is writing to. And there's also information in the assembly file-- I went into that last time-- which tells you, for each chunk of assembly, what the corresponding source file line is.

All right. And the actual output of the static profiler, like I said, is a schedule of when each instruction runs. In the schedule, you're going to see one digit for each cycle that the instruction is running. So however many digits there are, that's how many cycles are being used. And in front of some of the instructions, you're going to see these dashes, which represent stalls. So the three dashes in front of this fm instruction mean that, based on the pipeline, you could have scheduled this instruction up here, right after the previous instruction finished. However, because you're waiting on particular dependencies, you're only able to schedule it three cycles later. So that's a three-cycle stall.

The static profiler output also shows some other information. The original assembly appears on the right-hand side, and which pipeline the instructions are going into shows up on the left. Any questions about this format?

AUDIENCE: [INAUDIBLE].

PROFESSOR: Dual-issue just means that these instructions are going to be executed at the same time.

AUDIENCE: So can people figure out, for example, what the latency for the fm instruction is? How many cycles does fm take?
[INAUDIBLE]

AUDIENCE: [INAUDIBLE].

PROFESSOR: Pardon?

AUDIENCE: What are the numbers on the right?

PROFESSOR: Over here? All right, so these are the arguments to the assembly instructions. Remember, the first one is going to be the destination register, and the other ones are going to be the source registers that the instruction is going to use.

All right. So again, we want to be able to reorder these instructions to minimize stalls. What you're going to do is look for instructions that are potentially going to take a long time-- for example, loads and floating-point instructions take a long time-- and if those instructions have dependencies, then those dependencies are going to stall. So not only do you want to minimize stalls, you also want to dual-issue instructions whenever possible, although that's a little bit more difficult to just eyeball. But anyway, if you get up to the ideal case, where you're able to dual-issue every instruction, that means you're running two instructions per cycle, or half a cycle per instruction.

Anyway, going through these assembly files, figuring out where the dependencies are and which instructions you can move around, is kind of tedious, and you're not going to be able to apply these techniques to your entire program. So if you only have limited time, what you're going to have to do is try this on the most critical code first. And that's usually going to be loop bodies, because those get executed many, many times, and any savings that you see there will be multiplied many, many times over.

So I'm going to go through an example of one possible optimization you could do. We'll actually go through these same files in the exercise, about two minutes from now. This is from Lab 1, and this is one of the code snippets-- or assembly snippets-- from the SPU program.
So notice that over here we have these load instructions, each followed by-- I guess this is a shift instruction. And there are two pairs of these. And in each one, the shift instruction depends on the result of the previous load. Does everyone see that from the arguments?

So now the problem is that because the shift happens right after the load, it's going to stall for a long time, because the load takes six cycles. All right, so what can we do over here in order to reduce the stalling? Yep?

AUDIENCE: Do the two loads concurrently?

PROFESSOR: Yeah, we can do the loads concurrently, and in general, we just want to move these loads up as high as we can. So if I move the two loads up to the top of this-- oops-- the top of this snippet, then we have the two loads followed by all the instructions that came after. Now the first shift stalls by only two cycles, and the second shift doesn't stall at all. All right?

AUDIENCE: [INAUDIBLE]?

PROFESSOR: So, pardon?

AUDIENCE: [INAUDIBLE].

PROFESSOR: So the question is, what would this change look like in code? The thing is that the order of these generated instructions is dependent on the compiler, and it's really up to the compiler how it wants to order these instructions. So this change is not going to be reflected in your C code. And that's why you have to do these optimizations after the assembly is generated. Does that make sense?

AUDIENCE: [INAUDIBLE].
PROFESSOR: If you have a smart compiler, then it might do some kind of instruction scheduling in order to minimize stalls, using this methodology. Yup?

AUDIENCE: I'm quite impressed with XLC. I compiled the same program for the SPU under GCC and under XLC, both at optimization level 3, and XLC ran in half the time. So it was doing a much better job of instruction scheduling--

AUDIENCE: Than GCC?

AUDIENCE: Than GCC. So that's something, if you've got a choice, and the XLC compiler is working for you--

AUDIENCE: In general, this kind of low-level assembly hacking for instruction ordering is not something you'll do, just because compilers are getting incredibly good at instruction scheduling. But here the exercise is for you to get familiar with the tools and understand in a little more detail what's going on at the instruction level [INAUDIBLE] pipelines. Another thing to note here-- well, two things you could consider here-- is how high up can you push the loads? And related to that, why can't you completely hide the latency for the first load, so that the first shift still stalls for two cycles? That depends on how much slack you have in your schedule, on the dependencies, and on where the loop starts. So in some cases, you simply can't hide the entire latency. That's part of what you have to consider. But [INAUDIBLE] and the compilers are getting really good at it.

PROFESSOR: And in case you're wondering, I think this code was compiled using XLC with no optimizations, which is why there's this really easy opportunity for optimization here.

It turns out, if we run the timing utility again, we saved eight cycles. So notice that in the previous picture the running count in that column went up to a value ending in seven, and now it only goes up to a value eight cycles lower, ending in nine, all right?

So what we're going to do for the exercise right now is have you try something like this. What we're going to try to do is just improve performance by rescheduling instructions. So if you download the tarball, you'll get the exact same file I was working on before. And what we're going to have you do is practice generating the assembly, modifying it, and then continuing with the rest of the build process to get a new object file that's based on your modified assembly.
All right, any questions?

Now, you just need to find, for example, one opportunity for optimization. And you can do the exact same thing that I did, if you want. The key is that after you've done your reordering, you want to re-run the timing utility to see how many cycles you saved. You also want to, of course, continue the build process to get the final executable, and you want to run that just to make sure that your code is still correct.

AUDIENCE: Are we able to get this?

[SIDE CONVERSATIONS]

PROFESSOR: OK, so this static profiling process does have its limitations. The first, and this is a pretty major limitation, is that the static profiler assumes that none of the branches are taken. So basically it just zips through the entire assembly file in a straight line. That means you're going to have a skewed view of how long code inside conditionals or loop bodies is going to take: if the savings are inside a conditional, that code may not run at all, and if they're inside a loop body, they may get multiplied many, many times. So if, for example, you see that you've saved eight cycles in the static schedule, you're going to need some more context-- that is, whether it's inside a loop body and how many times you expect that to run-- to get a good idea of how these instruction reorderings are actually affecting the runtime of your program.

And also, because this is static profiling, which means you're not using an actual program run to get data, you're not going to capture any behavior that depends on the inputs to your program. So, for example, you don't get counts of how many times each loop runs, or when each branch is taken.

One other thing to worry about, which isn't going to show up in these static profiling predictions, is branch prediction. Remember, the static profiler is just going to ignore all branches.
And the thing is, branch prediction is used to reduce the stall after a branch is taken, by prefetching instructions and possibly doing other things. So what the hardware will do is guess the direction that the branch will take, and then start prefetching instructions from the place where it expects execution to resume after the branch. And if the branch prediction is wrong, then you're going to incur a pretty large penalty-- on the order of 20 cycles-- while it flushes the pipeline and starts fetching instructions from the new place. On the other hand, if the branch prediction is correct, then there's actually no penalty at all, and the next instruction will just resume right after the branch.

Anyway, you can give hints to the compiler to tell it what you think the outcomes of branches will be, and this can help in some circumstances. Basically, there are these macros that you can define. What you're going to use is this __builtin_expect compiler intrinsic. What that means is, if you put it inside an if, it says: evaluate the if as if the condition were exp, but expect the value of that condition to be, for example, true or false. All right, does that make sense?

Anyway, just one note about the exercise that you did. I believe the original run time was 469 clock cycles. Actually, if you run XLC with -O5 optimizations, it will reduce that to 188, so about 280 cycles of savings. All right, so we'll-- pardon?

AUDIENCE: The one that we just [INAUDIBLE].

PROFESSOR: No, so the original code was-- the original assembly was compiled at -O0, with no optimizations. But then if you do an optimization by hand, it saves eight cycles off of -O0. Those are actually pretty easy to find. And the -O5 optimization level will actually shave off about 280 cycles.

So we'll break for about 10 minutes, and then after the break, we'll do SIMD on Cell.
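For reference, branch hints of the kind just described are usually wrapped in macros built on the GCC-style __builtin_expect intrinsic. This is just a sketch: likely and unlikely are conventional names, not something the SDK defines for you, and process is a made-up function.

    #define likely(exp)   __builtin_expect((exp) != 0, 1)   /* expect the test to succeed */
    #define unlikely(exp) __builtin_expect((exp) != 0, 0)   /* expect the test to fail    */

    int process(int err)
    {
        /* The if still tests err, but the compiler lays out the code (and any
           branch hints) assuming the error path is rarely taken. */
        if (unlikely(err))
            return -1;   /* cold path */
        return 0;        /* hot path  */
    }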