The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: So, what I'll talk about here is how to actually understand the performance of your application, and what are some of the things you can do to improve that performance. You're going to hear more about automated optimizations — compiler optimizations — on Monday; there will be two talks on that. You'll get some Cell-specific optimizations, some Cell-specific tricks, on Tuesday in the recitation. So this is meant to be a more general-purpose talk on how you can debug performance anomalies and performance problems, and then what are some ways you can actually improve the performance — where do you look after you've done your parallelization.

So just to review the key concepts in parallelism. Coverage: how much parallelism do you have in your application? All of you know this, because you all had perfect scores on the last quiz. So if you take a look at your program, you find the parallel parts, and that tells you how much parallelism you have. If you don't have more than a certain fraction, there's really nothing else you can do with parallelism. So the rest of the talk will help you address the question of where you go for that last frontier.

Granularity: we talked about how the granularity of your work — how much work you're doing on each processor — affects your load balancing, and how it affects your communication costs. If you have a lot of things colocated on a single processor, then you don't have to do a whole lot of communication across processors; but if you distribute things at a finer level, then you're doing a whole lot of communication. So we'll look at the communication costs again and some tricks that you can apply to optimize that.
Then the last thing that we had talked about in one of the previous lectures is locality — locality of communication versus computation — and both of those are critical. So we'll have some examples of that.

So just to review the communication cost model: I had flashed up on the screen a while ago an equation that captures all the factors that go into figuring out how expensive it is to actually send data from one processor to the other. This could even apply on a single machine where a processor is talking to the memory — you know, loads and stores. The same cost model really applies there. If you look at how a uniprocessor tries to improve communication, and some of the things we mentioned really early on in the course for improving communication costs, what we focused on is overlap. There are things you can do — for example, sending fewer messages, optimizing how you're packing data into your messages, reducing the latency of the network, using architectural support, increasing the bandwidth, and so on. But really the biggest impact you can get is from overlap, because you have direct control over that, especially in parallel programming.

So let's look at a small review — what did it mean to overlap? We had some synchronization point, or some point in the execution, and then we get data. Then once the data has arrived, we compute on that data. This could be a uniprocessor: a CPU issues a load, it goes out to memory, memory sends back the data, and then the CPU can continue operating. But uniprocessors can pipeline — they allow you to have multiple loads going out to memory. So you can get the effect of hiding, or overlapping, a lot of that communication latency. But there are limits to the pipelining effects. If the work that you're doing is really comparable to the amount of data that you're fetching, then you have really good overlap.
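As a concrete picture of what that overlap looks like in code, here is a minimal double-buffering sketch in C. It is only an illustration under stated assumptions: the "asynchronous" transfer is simulated with a plain copy, standing in for whatever non-blocking mechanism the platform really provides (a DMA get on Cell, a prefetch, an asynchronous receive); the structure — start fetching block i+1 before computing on block i — is the point.

```c
#include <string.h>
#include <stddef.h>

#define BLOCK   1024
#define NBLOCKS 64

static float source[NBLOCKS][BLOCK];       /* stands in for remote or main memory */

/* In this sketch the "asynchronous" transfer is a synchronous copy; on real hardware
   it would be a non-blocking DMA get or prefetch, which is what creates the overlap. */
static void fetch_async(float *dst, size_t block) {
    memcpy(dst, source[block], sizeof source[block]);
}
static void wait_fetch(float *dst) {
    (void)dst;                             /* real code would block on the DMA tag here */
}

static float compute(const float *buf) {   /* placeholder for the real work on a block */
    float s = 0.0f;
    for (size_t i = 0; i < BLOCK; i++) s += buf[i];
    return s;
}

float process_all(void) {
    static float buf[2][BLOCK];            /* two buffers: one in flight, one in use */
    float sum = 0.0f;

    fetch_async(buf[0], 0);                /* prime the pipeline with block 0 */
    for (size_t i = 0; i < NBLOCKS; i++) {
        if (i + 1 < NBLOCKS)
            fetch_async(buf[(i + 1) & 1], i + 1);  /* start bringing in the next block */
        wait_fetch(buf[i & 1]);            /* make sure block i has landed */
        sum += compute(buf[i & 1]);        /* compute on block i while i+1 is in flight */
    }
    return sum;
}
```

When the compute time per block roughly matches the transfer time per block, the transfers effectively disappear behind the computation, which is the "nicely matched" case described next.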
So we went over this in the recitation, and we showed you an example where pipelining doesn't have any performance effect, and so you might not want to do it because it doesn't give you the performance bang for the complexity you invest in it. So if things are really nicely matched you get good overlap; here you only get good overlap — sorry, these, for some reason, should be shifted over one.

So where else do you look for performance? There are two kinds of communication. There's inherent communication in your algorithm, and this is a result of how you actually partition your data and how you partitioned your computation. Then there are artifacts that come up because of the way you actually do the implementation and how you map it to the architecture. So if you have a poor distribution of data across memory, then you might unnecessarily end up fetching data that you don't need. You might also have redundant data fetches. So let's talk about that in more detail.

The way I'm going to do this is to draw from wisdom in uniprocessors. In uniprocessors, CPUs communicate with memory, and conceptually I think that's no different than multiple processors talking to multiple processors. It's really all about where the data is flowing and how the memories are structured. So, loads and stores are to the uniprocessor as what and what are to distributed memory? If you think of Cell, what would go in those two blanks? Can you get this? I heard the answer there — you do a get and a put. So, DMA get and DMA put. That's really just a load and a store; instead of loading one particular data element, you're loading a whole chunk of memory.

So, on a uniprocessor, how do you overlap communication? Well, the architecture — the memory system — is designed in a way to exploit two properties that have been observed in computation: spatial locality and temporal locality, and I'll look at each one separately. So in spatial locality, the CPU asks for the data at address 1,000.
What the memory does is send the data at address 1,000, plus a whole bunch of other data that's neighboring it — so 1,000 to 1,064. How much data you actually send — what the granularity of communication is — depends on architectural parameters. In a common architecture it's really the block size. So if you have a cache whose organization says you have a block size of 32 bytes, then that is how much you transfer from main memory to the caches. This works well when the CPU actually uses that data. If I send you 64 bytes of data and only one of them is used, then what have I done? I've wasted bandwidth. Plus, I need to store all that extra data in the cache, so I've wasted cache capacity. So that's bad and you want to avoid it.

Temporal locality is a clustering of references in time. So if you access some particular data element, what the memory assumes is that you're going to reuse that data over and over and over again, so it stores it in the cache. So your memory hierarchy has the main memory at the top level — that's your slowest memory but the biggest capacity. Then as you get closer and closer to the processor, you end up with smaller caches — local, smaller storage, but faster. So if you reuse a data element, it gets cached at the lowest level, and the assumption there is that you're going to reuse it over and over again. If you do that, then what you've done is amortize the cost of bringing in that data over many, many references. So that works out really well. But if you don't reuse that particular data element over and over again, then you've wasted cache capacity. You still need to fetch the data because the CPU asked for it, but had it not been cached, there would have been more space in your cache for something else that might have been more useful.

So in the multiprocessor case, how do you reduce these artifactual costs in communication?
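Before answering that, a small sketch makes the uniprocessor spatial-locality idea concrete (the array size is an arbitrary assumption; C for illustration). Both loops touch the same data, but the row-major traversal walks through every element of each cache line it brings in, while the column-major traversal uses one element per line and wastes the rest of the fetch.

```c
#define N 1024
static float a[N][N];        /* C stores this row-major: a[i][0..N-1] are contiguous */

/* Good spatial locality: consecutive iterations touch consecutive addresses,
   so every byte of each cache line brought in gets used. */
float sum_rowmajor(void) {
    float s = 0.0f;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Poor spatial locality: consecutive iterations stride by N floats, so each
   fetched cache line is used for a single element and may be evicted before reuse. */
float sum_colmajor(void) {
    float s = 0.0f;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}
```

Temporal locality is the analogous idea in time: restructuring the computation — blocking the loops, for example — so that all uses of a given element happen close together, before its cache line is evicted.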
So, with DMA gets and puts on Cell, or, in message passing, with exchanging messages, typically you're communicating over coarse or large blocks of data. What you're usually getting is a contiguous chunk of memory, although you could do some things in software or in hardware to gather data from different memory locations and pack them into contiguous locations. The reason you pack them into contiguous locations, again, is to exploit spatial locality when you store the data locally. So to exploit the spatial locality characteristics, what you want to make sure is that you actually are going to have good spatial locality in your actual computation. You want things that are iterating over loops with well-defined indices — indices that go over very short ranges, or are very sequential, or have fixed stride patterns — where you're not wasting a lot of the data that you have brought in. Otherwise, you essentially just increase your communication, because every fetch is getting you only a small fraction of what you actually need. So, intuitively this should make sense.

Temporal locality just says: I brought in some data, so I want to maximize its utility. So if I have a computation in a parallel system, I might be able to reorder my tasks in a way that I have explicit control over the scheduling — which task executes when. Then you want to make sure that all the computation that needs that particular data happens adjacent in time, or in some short time window, so that you can amortize the cost. Are those two concepts clear? Any questions on this?

So, you've done all of that. You've parallelized your code, you've taken care of your communication costs, you've tried to reduce them as much as possible. Where else can you look for performance when things just don't look like they're performing as well as they could? The last frontier is perhaps single-thread performance, so I'm going to talk about that. So what really is a single thread?
If you think of what you're doing with parallel programming, you're taking a bunch of tasks — this is the work that you have to do — and you group them together into threads, or the equivalent of threads, and each thread will run on an individual core. So essentially you have one thread running on a core, and if that thread goes fast, then your overall execution can also benefit from that. So, that's single-thread performance.

So if you look at a timeline, here you have sequential code going on, then we hit some parallel part of the computation. We have multiple executions going on; each one of these is a thread of execution. Really, my finish line depends on the longest thread — whoever is the slowest one to complete — and that's going to essentially control my speedup. So I can improve this by doing better load balancing. If I distribute the work [? so that ?] everybody's doing an equivalent amount of work, then I can shift that finish line earlier in time. That can work reasonably well. So we talked about load balancing before. We can also make execution on each processor faster. If each one of these threads finishes faster — or I've done the load balancing and now I can squeeze out even more performance by shrinking each one of those lines — then I can get a performance improvement there as well. So that's improving single-thread performance.

But how do we actually understand what's going on? How do I know where to optimize? How do I know how long each thread is taking? How do I know how long my program is taking? Where are the problems? So, there are performance monitoring tools that are designed to help you do that. So what's the most coarse-grained way of figuring out how long your program took? You have some sample piece of code shown over here; you might compile it, and then you might just use time — the standard Unix command — to say run this program and tell me how much time it took to run.
So you get some output back from time that says you took about two seconds of user time — that's your actual code — you took some small amount of time in system code, and this is your overall execution, this is how much of the processor you actually used. So, 95% utilization. Then you might apply some optimization. So here we'll use the compiler: we'll change the optimization level, compile the same code, run it, and we'll see — wow, performance improved. We increased to 99% utilization, and my running time went down by a small chunk. But did we really learn anything about what's going on here? There's some code going on, there's a loop here, there's a loop here, there are some functions with more loops. So where is the actual computation time going? How would I actually go about understanding this? What are some tricks you might have used in trying to figure out how long something took in your computation?

AUDIENCE: [INAUDIBLE PHRASE].

PROFESSOR: Right. So you might have a timer: you record the time here, you compute, and then you stop the timer, and then you might print out or record how long that particular block of code took. Then you might have a histogram of those, and you might analyze the histogram to find out the distribution. You might repeat this over and over again for many different loops or many different parts of your code. If you have a preconceived notion of where the problem is, then you instrument that and see if your hypothesis is correct. That can help you identify the problems. But increasingly you can actually get more accurate measurements. In the previous routine, using time, you were looking at how much time has elapsed in seconds, or in fairly coarse increments. But today you can actually use hardware counters to measure clock cycles — clock ticks. That might be more useful. Actually, it is more useful, because you can measure a lot more events than just clock ticks.
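As a concrete version of the start-timer/stop-timer instrumentation just described, here is a minimal sketch using the POSIX clock_gettime call; the work() function is just a stand-in for whatever block of code you are measuring.

```c
#include <stdio.h>
#include <time.h>

static volatile double sink;           /* keeps the compiler from optimizing work() away */

static void work(void) {               /* placeholder for the block being measured */
    double s = 0.0;
    for (int i = 0; i < 10000000; i++)
        s += i * 0.5;
    sink = s;
}

int main(void) {
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);   /* start the timer */
    work();                                /* the region we care about */
    clock_gettime(CLOCK_MONOTONIC, &t1);   /* stop the timer */

    double elapsed = (t1.tv_sec - t0.tv_sec)
                   + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("work() took %.6f seconds\n", elapsed);
    return 0;
}
```

Recording many such measurements and histogramming them is exactly the manual approach described above; the hardware-counter libraries discussed next do the same start/stop dance but count cycles or other events instead of wall-clock time.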
The counters in modern architectures are really specialized registers that count up events, and you can go in there and probe and ask what the value in this register is, and you can use that as part of your performance tuning. You use them much in the same way as you would start or stop a regular timer. There are specialized libraries for this. Unfortunately, these are very architecture-specific at this point. There's not really a common standard that says grab a timer on each different architecture in a uniform way, although that's getting better with some standards coming out — I'll talk about that in just a few slides.

You can use this, for example, to measure your communication versus computation cost. So you can wrap your DMA get and DMA put with timer calls, and you can wrap your actual work with timer calls, and figure out how much overlap you can get from overlapping communication and computation, and whether it is really worthwhile to do the pipelining. But this really requires manual changes to the code. You have to go in there and start the timers. You have to have maybe an idea of where the problem is, and you have the Heisenberg effect: if you have a loop and you want to measure code within the loop, because you have a nested loop inside of it, then now you're affecting the performance of the outer loop. That can be problematic, because you can't really make an accurate measurement of the thing you're inspecting.

So there's a slightly better approach: dynamic profiling. With dynamic profiling there's event-based profiling and time-based profiling; conceptually they do the same thing. What's going on here is your program is running and you say, I'm interested in events such as cache misses. Whenever n cache misses happen — let's say 1,000 — let me know. So you get an interrupt whenever 1,000 cache misses happen. Then you can update a counter, or use that to trigger some optimizations or analysis.
This works really nicely because you don't have to touch your code. You essentially run your program as you normally do, with just one modification: you run the dynamic profiler alongside your actual computation. It works across multiple languages, because all it does is take your binary, so you can program in any language, any programming model. It's also quite efficient to use these dynamic profiling tools: you can make the sampling frequencies reasonably small and still have it be efficient.

So some counter examples. Clock cycles — you can measure clock ticks. Pipeline stalls — this might be interesting if you want to optimize your instruction schedule; you'll actually see this in the recitation next week. Cache hits and cache misses — you can get an idea of how bad your cache performance is and how much time you're spending in the memory system. Number of instructions, loads, stores, floating point ops, and so on. Then you can derive some useful measures from those. So I can get an idea of processor utilization: divide cycles by time and that gives me utilization. I can derive some other things, and maybe some of the more interesting things, like memory traffic. How much data am I actually sending between the CPU and memory, or how much data am I communicating from one processor to the other? I can just grab the counters for the number of loads and the number of stores, figure out what the cache line size is — usually those are documented, or there are calibration tools you can run to get that value — and from that figure out the memory traffic. Another one would be bandwidth consumed. Bandwidth is memory traffic per second. So how would you measure that? It's just the traffic divided by the wall-clock time. There are some others that you can calculate. So these can be really useful in helping you figure out where the things are that you should go focus in on. I'm going to show you some examples.
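Before those examples, here is a small sketch of the arithmetic behind those derived measures. The raw counter values are assumed to come from whatever counter library or profiler you are using (the struct and field names here are made up for illustration); the derivations themselves are just the ratios described above.

```c
#include <stdio.h>

/* Hypothetical raw counter values, as a counter library or profiler might report them. */
struct counters {
    double cycles;          /* total clock cycles        */
    double instructions;    /* instructions retired      */
    double loads, stores;   /* memory operations counted */
    double seconds;         /* wall-clock time           */
};

void report(struct counters c, double clock_hz, double line_bytes) {
    /* Utilization: fraction of the wall-clock time the processor was actually busy. */
    double utilization = c.cycles / (clock_hz * c.seconds);

    /* Instructions per cycle: a rough measure of how well the pipeline is being fed. */
    double ipc = c.instructions / c.cycles;

    /* Memory traffic, estimated as in the lecture from load/store counts and the cache
       line size (counting misses instead would give a tighter bound); bandwidth is
       simply that traffic divided by wall-clock time. */
    double traffic_bytes = (c.loads + c.stores) * line_bytes;
    double bandwidth = traffic_bytes / c.seconds;

    printf("utilization %.2f, IPC %.2f, traffic %.1f MB, bandwidth %.1f MB/s\n",
           utilization, ipc, traffic_bytes / 1e6, bandwidth / 1e6);
}
```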
The way these tools work is: you have your application source code, and you compile it down to a binary. You take your binary and you run it, and that can generate a profile that gets stored locally on your disk. Then you can take that profile and analyze it with some sort of interpreter — in some cases you can actually analyze the binary as well — and re-annotate your source code. That can be very useful because it'll tell you that this particular line of your code is the one where you're spending most of your time computing.

So some tools — have any of you used these tools? Anybody use gprof, for example? Good. So you might have an idea of how these could be used. There are others. HPCToolkit, which I commonly use, from Rice. PAPI is very common because it has a very nice interface for grabbing all kinds of counters. VTune from Intel. And there are others that work in different ways: there are binary instrumenters that do the same things, do them slightly more efficiently, and actually give you the ability to recompile your code at run time and optimize it, taking advantage of the profiling information you've collected.

So here's a sample of running gprof. Gprof should be available on any Linux system; it's even available on Cygwin, if you use Cygwin. I've compiled some code — this is the MPEG-2 decode code, a reference implementation — and I specify some parameters to run it. Here I add this -r flag, which says use a particular kind of inverse DCT, one that is floating-point precise and uses double precision for the floating point computations in the inverse DCT. So you can see where most of the time is being spent in the computation. Here's the time per function, so each row represents a function: this is the percent of the time, this is the actual time in seconds, how many times this function was actually called, and some other useful things.
The second function here is the MPEG intra-block decoding, where you're doing some spatial decomposition, restoring the spatial pictures, at about 5%. So if you were to optimize this particular code, where would you go look? You would look in the reference DCT. MPEG has two versions of the inverse DCT: one that uses floating point, and another that just uses some numerical tricks to operate over integers, with a loss of precision, but they find that acceptable as part of this application. So if you omit the -r flag, it actually uses a different function for doing the inverse DCT. Now you see that the distribution of where the time is spent in your computation changes. Now there's a new function that's become the bottleneck, and it's called Form Component Prediction. Then the IDCT column function, which is really the main replacement for the previous code, is now about a third of the actual computation.

So this can be useful: you can gprof your application, figure out where the bottlenecks are in terms of performance, and you might go in there and tweak the algorithm completely. You might go in there and look at some problems that might be implementation bugs or performance bugs and be able to fix those. Any questions on that?

You can do more accurate things. Gprof largely uses one mechanism; HPCToolkit uses the performance counters to give you finer-grained measurements if you want them. With HPCToolkit, you run your program in the same way. You have the MPEG-2 decode code; the double dash just says that the parameters to the decoder follow; and you can add some parameters that say these are the counters I'm interested in measuring. The first one is total cycles. The second one is L1 — primary cache — load misses. Then you might want to count the floating point instructions and the total instructions. As you run your program you get a profiling output, and then you can process that file and it'll spit out some summaries for you.
So it'll tell you: this is the total number of cycles — 698 samples at this sampling frequency. If you multiply the two together, you get an idea of how many cycles your computation took. How many load misses? 27 samples at this frequency. Remember, what's going on here is that the counter is counting events, and when the count reaches a particular threshold it lets you know. Here the sampling threshold is 32,000, so whenever 32,000 floating point instructions occur you get a sample. So you're just counting how many interrupts you're getting, or how many samples; multiply the two together and you get the final counts.

It can do things like gprof does: it'll tell you where your time is and where you spent most of your time. It actually breaks it down by module — so MPEG calls some standard libraries, libc. I can break it down by function, break it down by line number. You can even annotate your source code. So here's just the simple example that I used earlier, and each one of these columns represents one of the metrics that we measured, and you can see most of my time is spent here — 36% at this particular statement. So that can be very useful. You can go in there and say, I want to do some [? dization ?], I can maybe reduce this overhead in some way to get better performance. Any questions on that? Yup.

AUDIENCE: [INAUDIBLE PHRASE]?

PROFESSOR: I don't know. Unfortunately, I don't know the answer to that. There are some nice GUIs for some of these tools. VTune has a nice interface. I use HPCViewer — I use HPCToolkit, which provides HPCViewer, so I just grabbed this screenshot from one of the tutorials. You have your source code, and it shows you some of the same information I had on a previous slide, but in a nicer graphical format.

So, now I have all this information — how do I actually improve the performance?
Well, if you look at what the execution time is on a uniprocessor, it's the time spent computing plus the time spent waiting for data or waiting for some other things to complete. You have instruction-level parallelism, which is really critical for uniprocessors, and architects have spent massive amounts of effort providing multiple functional units, deeply pipelining the instruction pipeline, and doing things like speculation and prediction to keep that instruction-level parallelism number high so you can get really good performance. You can do things like looking at the assembly code and reordering instructions to avoid instruction hazards in the pipeline. You might look at register allocation. But that's really not low-hanging fruit — you have to reach really high to grab that kind of fruit. You'll actually, unfortunately, get that experience as part of the next recitation, so apologies in advance. But you'll see that — well, I'm not going to talk about that. Instead I'm going to focus on some things that are perhaps lower-hanging fruit.

So, data-level parallelism. We've used SIMD in some of the recitations, and I'm giving you a short example of that. Here, I'm going to talk about how you actually get data-level parallelism — how you actually find the SIMD in your computation so you can get that added advantage. Some nice things about data-level parallelism in the form of short vector instructions: the hardware really becomes simpler. You issue one instruction and that same instruction operates over multiple data elements, and you get better instruction bandwidth — I just have to fetch one instruction, and if my vector length is 10, then that effectively does 10 instructions for me. The architecture can get simpler; it reduces the complexity. So it has some nice advantages.

The other thing to go after is the memory hierarchy. This is because of that speed gap that we showed earlier in the course between memory speed and processor speed: the instruction-level tweaks usually buy you something like 1% of performance, whereas the cache hierarchy can give you a significant performance improvement in your overall execution. So you want to go after that, because that's the biggest beast in the room.

A brief overview of SIMD, and then some detailed examples of how you actually go about extracting short vector instructions.
So, here we have an example of scalar code. We're iterating in a loop from zero to n, and we're just adding array elements of a and b and storing the results in c. In the scalar mode, we just have one add: each value of a and b is in one register, we add those together, and we write the value to a separate register. In the vector mode, we can pack multiple data elements — here let's assume our vector length is four — so I can pack four of these data values into one vector register, pack four of those data elements into another vector register, and now my single vector instruction has the effect of doing four adds at the same time, and it can store the results into four elements of c. Any questions on that?

AUDIENCE: [UNINTELLIGIBLE]

PROFESSOR: No. We'll get to that.

So, let's look at this at a slightly lower level to give you a better feel for it. Same code — I've just shown the data dependence graph. I've omitted things like the increment of the loop and the branch, just focusing on the main computation. So I have two loads: one brings in a sub i, the other brings in b sub i. I do the add and I get c sub i, and then I can store that. So that might be the generic op-code sequence you have. If you're scheduling that, then in the first slot I can do those two loads in parallel, in the second cycle I can do the add, and in the third cycle I can do the store. I could further improve this performance — if you took 6.035 you might have seen software pipelining; you can actually overlap some of these operations. Not really that important here.
So, what would the cycle-by-cycle schedule look like if this were vectorized? In the scalar case, you have n iterations, right? Each iteration takes three cycles, so that's your overall execution time: n times 3 cycles. In the vector case, each load is bringing in four data elements — a sub i through a sub i plus 3, and similarly for b. Then you add those together. So the schedule looks essentially the same; the op codes are different. And here, what would your overall execution time be? Well, each iteration is now doing four additions for me. If you notice, the loop bounds have changed: instead of going from i to n in increments of 1, now I'm going in increments of 4. So, overall, instead of having n iterations, I can get by with n over 4 iterations. Does that make sense? So, what would my speedup be in this case? Four. So you can get more and more speedup if your vector length is longer, because then I can cut down further on the number of iterations that I need.

Depending on the length of my vector register and the data types that I have, that effectively gives me different vector lengths for different data types. You saw that on Cell you have 128-bit registers, and you can pack those with characters or bytes, shorts, integers, floats, or doubles. So each one of those gives you a different vector length.

SIMD extensions are now increasingly popular; they're available on a lot of ISAs. AltiVec, MMX, and SSE are available on a lot of PowerPC and x86 machines. And, of course, on Cell — in fact, on the SPU, all your instructions are SIMD instructions, and when you're doing a scalar instruction, you're actually using just one chunk of your vector register and your vector pipeline.

So how do you actually use these SIMD instructions? Unfortunately, it's library calls, or using inline assembly, or using intrinsics.
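As a flavor of what the intrinsics style looks like, here is a minimal sketch of the a[i] + b[i] loop from the slides written with x86 SSE intrinsics. This is an illustration only — on the Cell SPU you would use the spu_* intrinsics and its 128-bit vector types instead, but the shape of the code is the same — and it assumes n is a multiple of four.

```c
#include <xmmintrin.h>   /* x86 SSE intrinsics: __m128 holds four packed floats */

/* Scalar version: n iterations, one floating-point add per iteration. */
void add_scalar(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* SIMD version: n/4 iterations; each vector add does four adds at once.
   Assumes n is a multiple of 4 -- a real version would add a cleanup loop. */
void add_simd(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);            /* vector load of a[i..i+3] */
        __m128 vb = _mm_loadu_ps(&b[i]);            /* vector load of b[i..i+3] */
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));   /* one instruction, four adds */
    }
}
```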
You'll get hands-on experience with this with Cell, so you might complain about that when you actually do it. Compiler technology is actually getting better, and you'll see that one of the reasons we're using the XLC compiler is because it has these vector data types — which the latest versions of GCC also have — that allow you to express data as vector data types, and the compiler can more easily, or more naturally, get the SIMD parallelism for you without you having to go in there and do it by hand. But if you were to do it by hand — or, in fact, what the compilers are trying to automate — there are different techniques for looking for where the SIMD parallelism is.

There was some work done here about six years ago by Sam Larsen, who has now graduated, on superword-level parallelism. I'm going to focus the rest of this talk on this concept of SIMDization because I think it's probably the one that's most useful for extracting parallelism in some of the codes that you're doing. This is really ideal for SIMD where you have really short vector lengths, 2 to 8. What you're looking for is SIMDization that exists within a basic block — within a code block, within the body of a loop, or even across some control flow. You can uncover this with simple analysis, and this really has pushed the boundary on what automatic compilers can do. Some of the work that's gone on at IBM — what they call the Octopiler, which has eventually been transferred into the XLC compiler — uses a lot of techniques that build on SLP and extend it in various ways to broaden the scope of what you can automatically parallelize.

So here's an example of how you might actually find opportunities for SIMDization. You have some code — let's say you're doing RGB computations where you're just adding to the r, g, and b elements, that's red, green, and blue. So this might be in a loop, and what you might notice is, well, I can pack the RGB elements into one register, I can pack these into another register, and I can pack these literals into a third register.
So that gives me a way to pack data together into SIMD registers, and now I can replace this scalar code with instructions that pack the vector register, do the computations in parallel, and then unpack the results. We'll talk about that with a little bit more illustration in a second. Any questions on this?

Perhaps the biggest improvement that you can get from SIMDization is by looking at adjacent memory references. Rather than doing one load at a time, you can do a vector load, which really gives you bigger bandwidth to memory. So in this case, I have loads from I1 and I2, and since these memory locations are contiguous, I can replace them with one vector load that brings in all these data elements in one shot. That essentially eliminates three load instructions — which are potentially the most heavyweight — in exchange for one lighter-weight instruction, because it amortizes bandwidth and exploits things like spatial locality.

Another one: vectorizable loops. This is probably one of the most advanced ways of exploiting SIMDization, especially with really long vector lengths, so traditional supercomputers like the Cray, and you'll probably hear Saman talk about this in the next lecture. So I have some loop and I have this particular statement here. How can I get SIMD code out of this? Anybody have any ideas? Anybody know about loop unrolling? So if I unroll this loop — it's essentially that same trick that I had shown earlier, although I didn't quite do it this way — I change the loop increment: rather than stepping through one at a time, I step through four at a time. Now the loop body, rather than doing one addition at a time, is doing four additions at a time. So now this is very natural for vectorization, right? A vector load, a vector load, a vector store, plus the vector add in the middle. Is that intuitive?
746 00:35:41,320 --> 00:35:43,010 Looking at traditional loops, seeing whether you can 747 00:35:43,010 --> 00:35:46,510 actually unroll it in different ways, be able to get 748 00:35:46,510 --> 00:35:48,920 that SIMD parallelization. 749 00:35:48,920 --> 00:35:51,210 The last one I'll talk about is partial 750 00:35:51,210 --> 00:35:53,400 vectorization. 751 00:35:53,400 --> 00:35:57,390 It might be something where you have a mix of 752 00:35:57,390 --> 00:35:58,130 statements. 753 00:35:58,130 --> 00:36:01,890 So here I have a loop where I have some load and then I'm 754 00:36:01,890 --> 00:36:03,210 doing some computation here. 755 00:36:03,210 --> 00:36:04,460 So what could I do here? 756 00:36:09,780 --> 00:36:11,260 It's not as symmetric as the other loop. 757 00:36:11,260 --> 00:36:18,200 AUDIENCE: There's no vector and [INAUDIBLE PHRASE]. 758 00:36:18,200 --> 00:36:18,530 PROFESSOR: Right. 759 00:36:18,530 --> 00:36:19,600 So you might omit that. 760 00:36:19,600 --> 00:36:22,580 But could you do anything about the subtraction? 761 00:36:22,580 --> 00:36:24,870 AUDIENCE: [INAUDIBLE PHRASE]. 762 00:36:24,870 --> 00:36:27,990 PROFESSOR: If I can unroll this again, right? 763 00:36:27,990 --> 00:36:30,680 Now there are no dependencies between this instruction and 764 00:36:30,680 --> 00:36:34,210 this instruction, so I can really move these together, 765 00:36:34,210 --> 00:36:36,580 and once I've moved these together then these loads 766 00:36:36,580 --> 00:36:38,170 become contiguous. 767 00:36:38,170 --> 00:36:40,790 These loads are contiguous so I can replace these by vector 768 00:36:40,790 --> 00:36:44,640 codes, vector equivalents. 769 00:36:44,640 --> 00:36:48,210 So now the vector load brings in L0, L1, I have the 770 00:36:48,210 --> 00:36:52,800 addition, that brings in those two elements in a vector, and 771 00:36:52,800 --> 00:36:55,960 then I can do my scalar additions. 772 00:36:55,960 --> 00:36:59,300 But what do I do about the value getting out of this 773 00:36:59,300 --> 00:37:01,890 vector register into this scalar register that I need 774 00:37:01,890 --> 00:37:03,400 for the absolute values? 775 00:37:03,400 --> 00:37:06,740 So this is where the benefits versus cost of 776 00:37:06,740 --> 00:37:08,210 SIMDization come in. 777 00:37:08,210 --> 00:37:11,330 So the benefits are great because you can replace 778 00:37:11,330 --> 00:37:15,460 multiple instructions by one instruction, or you can just 779 00:37:15,460 --> 00:37:17,250 cut down the number of instructions by a specific 780 00:37:17,250 --> 00:37:19,920 factor, your vector length. 781 00:37:19,920 --> 00:37:23,400 Loads and stores can be replaced by one wide memory operation, and 782 00:37:23,400 --> 00:37:26,210 this is probably the biggest opportunity for performance 783 00:37:26,210 --> 00:37:27,800 improvements. 784 00:37:27,800 --> 00:37:30,300 But the cost is that you have to pack data into the data 785 00:37:30,300 --> 00:37:32,700 registers and you have to unpack it out so that you can 786 00:37:32,700 --> 00:37:37,460 have those kinds of communications between this 787 00:37:37,460 --> 00:37:40,430 vector register here and the value here, this value here 788 00:37:40,430 --> 00:37:41,400 and this value here. 789 00:37:41,400 --> 00:37:45,210 Often you can't simply access vector values without doing 790 00:37:45,210 --> 00:37:46,460 this packing and unpacking. 791 00:37:52,210 --> 00:37:55,450 So how do you actually do the packing, unpacking?
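[Here is a rough sketch of the partial vectorization case just described: the adds get vectorized, but the absolute-value consumers stay scalar, so lanes have to be unpacked. The loop shape, the names, and the use of fabsf() as the stand-in for the scalar part are assumptions for illustration.]

```c
#include <math.h>

typedef float v4sf __attribute__((vector_size(16)));

/* Assumes n % 4 == 0 and 16-byte aligned, non-aliasing arrays. */
void partial(float *out, const float *in, int n) {
    const v4sf one = {1.0f, 1.0f, 1.0f, 1.0f};
    for (int i = 0; i < n; i += 4) {
        /* Vectorizable part: contiguous loads become one vector load,
         * four scalar adds become one vector add. */
        v4sf v = *(const v4sf *)&in[i];
        v = v + one;
        /* Scalar part: each lane is extracted (unpacked) to feed the
         * scalar fabsf() -- this extraction is the cost side of
         * SIMDization. */
        for (int k = 0; k < 4; k++)
            out[i + k] = fabsf(v[k]);
    }
}
```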
792 00:37:55,450 --> 00:37:57,880 This is predominantly where a lot of the complexity goes. 793 00:38:03,910 --> 00:38:06,493 So the value of a here is initialized by some function 794 00:38:06,493 --> 00:38:09,140 and the value of b here is initialized by some function, 795 00:38:09,140 --> 00:38:11,200 and these might not be things that I can 796 00:38:11,200 --> 00:38:12,740 SIMDize very easily. 797 00:38:12,740 --> 00:38:16,480 So what I need to do is move that value into the first 798 00:38:16,480 --> 00:38:19,180 element of the vector register, and move the second value into 799 00:38:19,180 --> 00:38:20,800 the second element of the vector register. 800 00:38:20,800 --> 00:38:23,470 So if I have a four-way vector register, then I have to do 801 00:38:23,470 --> 00:38:27,790 four of these moves, and that essentially is the packing. 802 00:38:27,790 --> 00:38:30,590 Then I could do my vector computation, which is really 803 00:38:30,590 --> 00:38:32,940 these two statements here. 804 00:38:32,940 --> 00:38:35,750 Then eventually I have to do my unpacking because I have to 805 00:38:35,750 --> 00:38:39,960 get the values out to do this operation and this operation. 806 00:38:39,960 --> 00:38:42,180 So there's an extraction that has to happen 807 00:38:42,180 --> 00:38:43,430 out of my SIMD register. 808 00:38:46,310 --> 00:38:48,980 But you can amortize the cost of the packing and unpacking 809 00:38:48,980 --> 00:38:50,560 by just reusing your vector registers. 810 00:38:50,560 --> 00:38:54,490 So these are like register allocation techniques. 811 00:38:54,490 --> 00:38:56,960 So if I pack things into a vector register, I find all 812 00:38:56,960 --> 00:38:59,890 cases where I can actually reuse that vector register and 813 00:38:59,890 --> 00:39:04,490 I try to find opportunities for extra SIMDization. 814 00:39:04,490 --> 00:39:08,120 So in the other case then, once I pack one I can reuse that 815 00:39:08,120 --> 00:39:09,320 same vector register. 816 00:39:09,320 --> 00:39:13,690 So what are some ways I can look for to amortize the cost? 817 00:39:16,600 --> 00:39:18,700 The interesting thing about memory operations is while 818 00:39:18,700 --> 00:39:21,950 there are many different ways you can pack scalar values 819 00:39:21,950 --> 00:39:24,700 into a vector register, there's really only one way 820 00:39:24,700 --> 00:39:28,380 you can pack loads coming in from memory into a vector 821 00:39:28,380 --> 00:39:31,290 register, because you want the loads to be sequential, 822 00:39:31,290 --> 00:39:33,340 you want to exploit the spatial locality. 823 00:39:33,340 --> 00:39:36,130 So one vector load really gives you a specific ordering. 824 00:39:36,130 --> 00:39:40,140 So, that really constrains you in various ways. 825 00:39:40,140 --> 00:39:42,180 So you might bend over backwards in some cases to 826 00:39:42,180 --> 00:39:46,350 actually get your code to be able to reuse the 827 00:39:46,350 --> 00:39:49,090 wide-word load without having to do too much packing or 828 00:39:49,090 --> 00:39:50,870 unpacking because that'll start 829 00:39:50,870 --> 00:39:53,520 eating into your benefits. 830 00:39:53,520 --> 00:40:00,900 So here's a simple example of how you might find the SLP 831 00:40:00,900 --> 00:40:02,740 parallelism. 832 00:40:02,740 --> 00:40:05,850 So the first thing you want to do is start with the 833 00:40:05,850 --> 00:40:08,740 instructions that give you the most benefit, so that's memory 834 00:40:08,740 --> 00:40:09,860 references.
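[A minimal sketch of the packing and unpacking just described, and of amortizing the packing cost by reusing the packed register. f() and g() stand in for the non-SIMDizable scalar initializers; every name and constant here is hypothetical.]

```c
typedef float v4sf __attribute__((vector_size(16)));

extern float f(int i);   /* stand-ins for scalar code that can't */
extern float g(int i);   /* easily be SIMDized                   */

void pack_reuse_unpack(float *outx, float *outy) {
    /* Packing: one scalar move per lane -- this is the overhead. */
    v4sf v = {0.0f, 0.0f, 0.0f, 0.0f};
    v[0] = f(0);
    v[1] = g(0);

    /* Amortize that cost by reusing the packed register in more
     * than one vector operation. */
    const v4sf c1 = {2.0f, 2.0f, 0.0f, 0.0f};
    const v4sf c2 = {0.5f, 0.5f, 0.0f, 0.0f};
    v4sf t1 = v * c1;
    v4sf t2 = v * c2;    /* second use of the same packed register */

    /* Unpacking: extract the lanes for the scalar consumers. */
    outx[0] = t1[0];  outx[1] = t1[1];
    outy[0] = t2[0];  outy[1] = t2[1];
}
```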
835 00:40:09,860 --> 00:40:11,840 So here there are two memory references. 836 00:40:11,840 --> 00:40:15,680 They happen to be adjacent, so I'm accessing contiguous 837 00:40:15,680 --> 00:40:17,930 memory chunks, so I can parallelize that. 838 00:40:17,930 --> 00:40:20,540 That would be my first step. 839 00:40:20,540 --> 00:40:23,580 I can do a vector load and that assignment can 840 00:40:23,580 --> 00:40:25,880 become a and b. 841 00:40:25,880 --> 00:40:29,730 I can look for opportunities where I can propagate these 842 00:40:29,730 --> 00:40:34,250 vector values within the vector register that's 843 00:40:34,250 --> 00:40:35,840 holding a and b. 844 00:40:35,840 --> 00:40:38,980 So the way I do that is I look for uses of a and b. 845 00:40:38,980 --> 00:40:40,960 In this case, there are these two statements. 846 00:40:45,020 --> 00:40:46,440 So I can look for opportunities 847 00:40:46,440 --> 00:40:47,690 to vectorize that. 848 00:40:52,160 --> 00:40:54,030 So in this case, both of these instructions are also 849 00:40:54,030 --> 00:40:54,960 vectorizable. 850 00:40:54,960 --> 00:40:59,990 Now I have a vector subtraction, and I have a 851 00:40:59,990 --> 00:41:02,680 vector register holding new values h and j. 852 00:41:02,680 --> 00:41:08,200 So I follow that chain again of where data's flowing. 853 00:41:08,200 --> 00:41:12,170 I find these operations and I can vectorize that as well. 854 00:41:18,160 --> 00:41:21,860 So, I end up with a vectorizable loop where all my 855 00:41:21,860 --> 00:41:24,520 instructions, all my scalar instructions, are now SIMD 856 00:41:24,520 --> 00:41:25,820 instructions. 857 00:41:25,820 --> 00:41:30,000 I can cut down on loop iterations and the total number of 858 00:41:30,000 --> 00:41:31,220 instructions that I issue. 859 00:41:31,220 --> 00:41:33,650 But I've made some implicit assumption here. 860 00:41:33,650 --> 00:41:35,020 Anybody know what it is? 861 00:41:35,020 --> 00:41:42,538 AUDIENCE: Do you actually need that many 862 00:41:42,538 --> 00:41:44,580 iterations of the loop? 863 00:41:44,580 --> 00:41:47,680 PROFESSOR: Well, so you can factor down the cost. So here 864 00:41:47,680 --> 00:41:49,830 I've vectorized by 2, so I would cut down the number of 865 00:41:49,830 --> 00:41:50,590 iterations by 2. 866 00:41:50,590 --> 00:41:52,830 AUDIENCE: You could have an odd number of iterations? 867 00:41:52,830 --> 00:41:54,190 PROFESSOR: Right, so you could have an odd number of 868 00:41:54,190 --> 00:41:55,110 iterations. 869 00:41:55,110 --> 00:41:57,400 What do you do about the remaining iterations? 870 00:41:57,400 --> 00:41:59,760 You might have to do scalar code for that. 871 00:41:59,760 --> 00:42:02,260 What are some other assumptions? 872 00:42:02,260 --> 00:42:04,660 Maybe it will be clear here. 873 00:42:07,380 --> 00:42:10,160 So in vectorizing this, what have I assumed about 874 00:42:10,160 --> 00:42:13,040 relationships between these statements? 875 00:42:13,040 --> 00:42:15,690 I've essentially reorganized all the statements, so that 876 00:42:15,690 --> 00:42:19,520 assumes I have the liberty to move instructions around. 877 00:42:19,520 --> 00:42:19,810 Yup? 878 00:42:19,810 --> 00:42:22,801 AUDIENCE: [UNINTELLIGIBLE] a and b don't change 879 00:42:22,801 --> 00:42:23,300 [INAUDIBLE PHRASE]. 880 00:42:23,300 --> 00:42:23,790 PROFESSOR: Right. 881 00:42:23,790 --> 00:42:27,830 So there's nothing in here that's changing the values.
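[As a concrete picture of what that reordering freedom buys, here is a small sketch of the chain-following just described, for a hypothetical 2-wide basic block. The statement names mirror the flavor of the slide but are made up; it only works because nothing in the block redefines a or b between the statements being packed.]

```c
typedef float v2sf __attribute__((vector_size(8)));   /* 2 packed floats */

void slp_block(const float *m, float *out) {
    /* Scalar basic block the compiler starts from:
     *   a = m[0];      b = m[1];        adjacent loads seed the pack
     *   h = a - 5.0f;  j = b - 5.0f;    uses of (a, b) -> pack the subtracts
     *   x = h * 2.0f;  y = j * 2.0f;    uses of (h, j) -> pack the multiplies
     *   out[0] = x;    out[1] = y;      adjacent stores close the chain
     */
    const v2sf five = {5.0f, 5.0f};
    const v2sf two  = {2.0f, 2.0f};

    v2sf ab = *(const v2sf *)m;     /* one vector load replaces two loads  */
    v2sf hj = ab - five;            /* follow the def-use chain from (a,b) */
    v2sf xy = hj * two;             /* and again from (h,j)                */
    *(v2sf *)out = xy;              /* one vector store replaces two       */
}
```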
882 00:42:27,830 --> 00:42:29,930 There are no dependencies between these statements -- no 883 00:42:29,930 --> 00:42:33,910 flow dependencies and no other kind of constraints that limit 884 00:42:33,910 --> 00:42:34,930 this kind of movement. 885 00:42:34,930 --> 00:42:36,800 So in real code that's not always the case. 886 00:42:36,800 --> 00:42:40,150 You end up with patterns of computation where, in the 887 00:42:40,150 --> 00:42:43,710 really nice classic cases, you can vectorize things 888 00:42:43,710 --> 00:42:44,840 really nicely. 889 00:42:44,840 --> 00:42:48,020 In a lot of other codes you have a mix of vectorizable 890 00:42:48,020 --> 00:42:50,520 code and scalar code and there's a lot of communication 891 00:42:50,520 --> 00:42:51,380 between the two. 892 00:42:51,380 --> 00:42:54,090 So the cost is really something significant that you 893 00:42:54,090 --> 00:42:55,340 have to consider. 894 00:42:57,510 --> 00:43:00,780 This was, as I mentioned, done in somebody's Master's thesis 895 00:43:00,780 --> 00:43:04,960 and eventually led to some additional work that was his 896 00:43:04,960 --> 00:43:06,770 PhD thesis. 897 00:43:06,770 --> 00:43:09,370 So in some of the early work, what he did was he looked at a 898 00:43:09,370 --> 00:43:12,470 bunch of benchmarks and looked at how much available 899 00:43:12,470 --> 00:43:15,960 parallelism you have in terms of this kind of short vector 900 00:43:15,960 --> 00:43:20,260 parallelism, or rather SLP, where you're looking for 901 00:43:20,260 --> 00:43:23,830 vectorizable code within basic blocks, which really differs 902 00:43:23,830 --> 00:43:26,100 from the classic way people look for vectorization, 903 00:43:26,100 --> 00:43:27,950 where you have well-structured loops and 904 00:43:27,950 --> 00:43:32,120 do the kinds of transformations you'll hear about next week. 905 00:43:32,120 --> 00:43:35,240 So for different kinds of vector registers, so these are 906 00:43:35,240 --> 00:43:36,010 your vector lengths. 907 00:43:36,010 --> 00:43:41,470 So going from 128 bits to 1,024 bits, you can actually 908 00:43:41,470 --> 00:43:43,330 reduce a whole lot of instructions. 909 00:43:43,330 --> 00:43:46,380 So what I'm showing here is the percent dynamic 910 00:43:46,380 --> 00:43:47,420 instruction reduction. 911 00:43:47,420 --> 00:43:50,795 So if I take my baseline application and just compile 912 00:43:50,795 --> 00:43:53,670 it in a normal way and run it to get an instruction count. 913 00:43:53,670 --> 00:43:57,175 I apply this SLP technique that finds the SIMDization and 914 00:43:57,175 --> 00:43:59,500 then run my application again, use the performance counters 915 00:43:59,500 --> 00:44:01,110 to count the number of instructions 916 00:44:01,110 --> 00:44:03,450 and compare the two. 917 00:44:03,450 --> 00:44:07,020 I can get 60%, 50%, 40%. 918 00:44:07,020 --> 00:44:11,700 In some cases I can completely eliminate almost 90% or more 919 00:44:11,700 --> 00:44:12,660 of the instructions. 920 00:44:12,660 --> 00:44:16,410 So there's a lot of opportunity for performance improvements, 921 00:44:16,410 --> 00:44:19,410 as might be apparent. 922 00:44:19,410 --> 00:44:22,130 One, because I'm reducing the instruction bandwidth, I'm 923 00:44:22,130 --> 00:44:25,370 reducing the amount of space I need in my instruction cache; 924 00:44:25,370 --> 00:44:27,590 I have fewer instructions so I can fit more instructions into 925 00:44:27,590 --> 00:44:31,190 my instruction cache; and you reduce the number of branches.
926 00:44:31,190 --> 00:44:34,870 You get better bandwidth to the memory, better use of the 927 00:44:34,870 --> 00:44:37,080 memory bandwidth. 928 00:44:37,080 --> 00:44:41,250 Overall, you're running fewer iterations, so you're getting 929 00:44:41,250 --> 00:44:44,050 lots of potential for performance. 930 00:44:44,050 --> 00:44:46,350 So, I actually ran this on the AltiVec. 931 00:44:46,350 --> 00:44:50,710 This was one of the earliest generations of AltiVec, whose 932 00:44:50,710 --> 00:44:57,100 SIMD instructions didn't have, I believe, double precision 933 00:44:57,100 --> 00:44:59,420 floating point, so not all the benchmarks you see on the 934 00:44:59,420 --> 00:45:02,470 previous slide are here, only the ones that could run 935 00:45:02,470 --> 00:45:04,120 reasonably accurately with single 936 00:45:04,120 --> 00:45:05,590 precision floating point. 937 00:45:05,590 --> 00:45:08,240 What this measures is the actual speed up. 938 00:45:08,240 --> 00:45:10,810 Doing this SIMDization versus not doing SIMDization, how 939 00:45:10,810 --> 00:45:12,470 much performance you can get. 940 00:45:12,470 --> 00:45:17,550 The thing to take away is in some cases where you have 941 00:45:17,550 --> 00:45:20,260 nicely structured loops and some nice patterns, you can 942 00:45:20,260 --> 00:45:24,250 get up to 7x speed up on some benchmarks. 943 00:45:24,250 --> 00:45:27,020 The maximum speed up that you can get 944 00:45:27,020 --> 00:45:31,210 depends on the vector length, so 8, for example, on some 945 00:45:31,210 --> 00:45:35,170 architectures, depending on the data type. 946 00:45:35,170 --> 00:45:38,500 Are there any questions on that? 947 00:45:38,500 --> 00:45:40,530 So as part of the next recitation, you'll actually 948 00:45:40,530 --> 00:45:43,000 get an exercise of going through and SIMDizing for 949 00:45:43,000 --> 00:45:45,300 Cell, and what that actually means is to SIMDize 950 00:45:45,300 --> 00:45:47,990 instructions for Cell you might take statements and sort of 951 00:45:47,990 --> 00:45:50,320 replace them by intrinsic functions, which eventually 952 00:45:50,320 --> 00:45:53,630 map down to the actual assembly op codes that you'll need. 953 00:45:53,630 --> 00:45:55,940 So you don't actually have to program at the assembly level, 954 00:45:55,940 --> 00:46:00,080 although in effect, you're probably doing the same thing. 955 00:46:00,080 --> 00:46:03,040 Last thing we'll talk about today is optimizing for the 956 00:46:03,040 --> 00:46:05,110 memory hierarchy. 957 00:46:05,110 --> 00:46:07,950 In addition to data level parallelism, looking for 958 00:46:07,950 --> 00:46:11,250 performance enhancements in the memory system gives you 959 00:46:11,250 --> 00:46:19,250 the best opportunities because of this big gap in performance 960 00:46:19,250 --> 00:46:23,800 between memory access latencies and what the CPU 961 00:46:23,800 --> 00:46:24,840 efficiency is. 962 00:46:24,840 --> 00:46:27,380 So exploiting locality in the memory system is key. 963 00:46:27,380 --> 00:46:30,940 So these concepts of temporal and spatial locality. 964 00:46:30,940 --> 00:46:32,770 So let's look at an example. 965 00:46:32,770 --> 00:46:37,870 Let's say I have a loop and in this loop I have some code 966 00:46:37,870 --> 00:46:42,440 that's embodied in some function a, some code embodied 967 00:46:42,440 --> 00:46:45,660 in some function b, and some code in some function c.
968 00:46:45,660 --> 00:46:49,720 The values produced by a are consumed by the function b, 969 00:46:49,720 --> 00:46:51,280 and similarly the values produced by b 970 00:46:51,280 --> 00:46:53,880 are consumed by c. 971 00:46:53,880 --> 00:46:57,040 So this is a general data flow graph that you might have for 972 00:46:57,040 --> 00:46:58,590 this function. 973 00:46:58,590 --> 00:47:03,240 Let's say that all the data could go into a small array 974 00:47:03,240 --> 00:47:08,870 that then I can communicate between functions. 975 00:47:08,870 --> 00:47:12,500 So if I look at my actual cache size and how big the working 976 00:47:12,500 --> 00:47:15,400 set of each of these functions is, so let's say this is my 977 00:47:15,400 --> 00:47:18,260 cache size -- this is how many instructions I can 978 00:47:18,260 --> 00:47:20,390 pack into the cache. 979 00:47:20,390 --> 00:47:22,490 Looking at the collective number of instructions in each 980 00:47:22,490 --> 00:47:25,990 one of these functions, I overflow that. 981 00:47:25,990 --> 00:47:27,810 I have more instructions than I can fit into my 982 00:47:27,810 --> 00:47:30,130 cache at any one time. 983 00:47:30,130 --> 00:47:32,890 So what does that mean for my actual cache performance? 984 00:47:32,890 --> 00:47:38,490 So when I run a, what do I expect the cache hit and miss 985 00:47:38,490 --> 00:47:40,670 rate behavior to be like? 986 00:47:40,670 --> 00:47:45,830 So in the first iteration, I need the instructions for a. 987 00:47:45,830 --> 00:47:48,210 I've never seen a before so I have to fetch that data from 988 00:47:48,210 --> 00:47:49,840 memory and put it in the cache. 989 00:47:49,840 --> 00:47:53,320 So that's a miss. 990 00:47:53,320 --> 00:47:54,570 So what about b? 991 00:47:58,330 --> 00:47:59,310 Then c? 992 00:47:59,310 --> 00:48:00,420 Same thing. 993 00:48:00,420 --> 00:48:03,920 So now I'm back at the top of my loop. 994 00:48:03,920 --> 00:48:06,250 So if everything fit in the cache then I would 995 00:48:06,250 --> 00:48:10,230 expect a to be a what? 996 00:48:10,230 --> 00:48:11,320 It'll be a hit. 997 00:48:11,320 --> 00:48:14,900 But since I've constrained this problem such that the 998 00:48:14,900 --> 00:48:17,220 working set doesn't really fit in the cache, what that means 999 00:48:17,220 --> 00:48:19,690 is that I have to fetch some new instructions for a. 1000 00:48:19,690 --> 00:48:20,960 So let's say I have to fetch all the 1001 00:48:20,960 --> 00:48:22,460 instructions for a again. 1002 00:48:22,460 --> 00:48:25,590 That leads me to another miss. 1003 00:48:25,590 --> 00:48:29,610 Now, bringing a again into my cache kicks out some extra 1004 00:48:29,610 --> 00:48:32,100 instructions because I need to make room in a finite memory, 1005 00:48:32,100 --> 00:48:35,090 so I kick out b. 1006 00:48:35,090 --> 00:48:38,120 Bring in b and I end up kicking out c. 1007 00:48:38,120 --> 00:48:41,530 So you end up with a pattern where everything is a miss. 1008 00:48:41,530 --> 00:48:45,740 This is a problem because the way the loop is structured, 1009 00:48:45,740 --> 00:48:48,025 collectively I just can't pack all those instructions into 1010 00:48:48,025 --> 00:48:52,760 the cache, so I end up taking a lot of cache misses and 1011 00:48:52,760 --> 00:48:55,030 that's bad for performance. 1012 00:48:55,030 --> 00:48:56,690 But I can look at an alternative way 1013 00:48:56,690 --> 00:48:58,710 of doing this loop.
1014 00:48:58,710 --> 00:49:02,960 I can split up this loop into three where in one loop I do 1015 00:49:02,960 --> 00:49:07,300 all the a instructions, in the second loop I do all the b's, 1016 00:49:07,300 --> 00:49:09,260 and in the third loop I do all the c's. 1017 00:49:09,260 --> 00:49:12,830 Now my working set is really small. 1018 00:49:12,830 --> 00:49:16,020 So the instructions for a fit in the cache, instructions for 1019 00:49:16,020 --> 00:49:17,460 b fit in the cache, and instructions for 1020 00:49:17,460 --> 00:49:19,670 c fit in the cache. 1021 00:49:19,670 --> 00:49:24,330 So what do I expect for the first time I see a? 1022 00:49:24,330 --> 00:49:25,110 Miss. 1023 00:49:25,110 --> 00:49:26,360 Then the second time? 1024 00:49:31,270 --> 00:49:36,730 It'll be a hit, because I've brought in a, I haven't run b 1025 00:49:36,730 --> 00:49:40,150 or c yet, the number of instructions I need for a is 1026 00:49:40,150 --> 00:49:41,370 smaller than what I can fit into the 1027 00:49:41,370 --> 00:49:42,770 cache, so that's great. 1028 00:49:42,770 --> 00:49:44,170 Nothing gets kicked out. 1029 00:49:44,170 --> 00:49:46,200 So every one of those iterations 1030 00:49:46,200 --> 00:49:48,450 for a becomes a hit. 1031 00:49:48,450 --> 00:49:49,220 So that's good. 1032 00:49:49,220 --> 00:49:51,950 I've improved performance. 1033 00:49:51,950 --> 00:49:54,550 For b I have the same pattern. 1034 00:49:54,550 --> 00:49:56,620 First time I see b it's a miss, every time after that 1035 00:49:56,620 --> 00:49:57,480 it's a hit. 1036 00:49:57,480 --> 00:49:58,560 Similarly for c. 1037 00:49:58,560 --> 00:50:02,490 So my cache miss rate goes from being one, everything's a 1038 00:50:02,490 --> 00:50:07,050 miss, to decreasing to 1 over n, where n is essentially how 1039 00:50:07,050 --> 00:50:09,500 many times I run the loop. 1040 00:50:09,500 --> 00:50:11,964 So we call that full scaling because we've taken the loop, 1041 00:50:11,964 --> 00:50:14,440 distributed it, and we've scaled every one of 1042 00:50:14,440 --> 00:50:16,550 those smaller loops to the maximum that we could get. 1043 00:50:19,070 --> 00:50:21,230 Now what about the data? 1044 00:50:21,230 --> 00:50:22,340 So we have the same example. 1045 00:50:22,340 --> 00:50:26,820 Here we saw that the instruction working set is 1046 00:50:26,820 --> 00:50:29,420 big, but what about the data? 1047 00:50:29,420 --> 00:50:31,330 So let's say in this case I'm sending just a 1048 00:50:31,330 --> 00:50:32,270 small amount of data. 1049 00:50:32,270 --> 00:50:35,640 Then the behavior is really good. 1050 00:50:35,640 --> 00:50:38,070 It's a small amount of data that I need to communicate 1051 00:50:38,070 --> 00:50:38,690 from a to b. 1052 00:50:38,690 --> 00:50:40,310 A small amount of data you need to 1053 00:50:40,310 --> 00:50:42,040 communicate from b to c. 1054 00:50:42,040 --> 00:50:43,270 So it's great. 1055 00:50:43,270 --> 00:50:44,990 No problems with the data cache. 1056 00:50:44,990 --> 00:50:46,530 What happens in the full scaling case? 1057 00:50:46,530 --> 00:50:53,330 AUDIENCE: It's not correct to communicate from a to b. 1058 00:50:53,330 --> 00:50:54,921 PROFESSOR: What do you mean it's not correct? 1059 00:50:54,921 --> 00:50:55,576 AUDIENCE: Oh, it's not 1060 00:50:55,576 --> 00:50:56,826 communicating at the same time. 1061 00:50:58,740 --> 00:51:01,300 PROFESSOR: Yeah, it's not at the same time. 1062 00:51:01,300 --> 00:51:03,890 In fact, just assume this is sequential.
1063 00:51:03,890 --> 00:51:07,800 So I run a, I store some data, and then when I run 1064 00:51:07,800 --> 00:51:10,680 b I grab that data. 1065 00:51:10,680 --> 00:51:12,430 This is sequential. 1066 00:51:12,430 --> 00:51:16,610 AUDIENCE: How do you know that the transmission's valid then? 1067 00:51:16,610 --> 00:51:19,210 We could use some global variable. 1068 00:51:19,210 --> 00:51:21,390 PROFESSOR: Simple case. 1069 00:51:21,390 --> 00:51:22,750 There are no global variables. 1070 00:51:22,750 --> 00:51:26,210 All the data that b needs comes from a. 1071 00:51:26,210 --> 00:51:28,290 So if I run a I produce all the data and 1072 00:51:28,290 --> 00:51:29,540 that's all that b needs. 1073 00:51:32,000 --> 00:51:34,110 So in the full scaling case, what do I expect to 1074 00:51:34,110 --> 00:51:37,140 happen for the data? 1075 00:51:37,140 --> 00:51:39,730 Remember, in the full scaling case, all the working sets for 1076 00:51:39,730 --> 00:51:42,810 the instructions are small so they all fit in the cache. 1077 00:51:42,810 --> 00:51:46,260 But now I'm running a for a lot longer so I have to store 1078 00:51:46,260 --> 00:51:48,180 a lot more data for b. 1079 00:51:48,180 --> 00:51:50,970 Similarly, I'm running b for a lot longer so I have to store 1080 00:51:50,970 --> 00:51:52,690 a lot more data for c. 1081 00:51:52,690 --> 00:51:54,770 So what do I expect to happen with the working set here? 1082 00:51:58,180 --> 00:52:01,430 Instructions are still good, but the data might be bad 1083 00:52:01,430 --> 00:52:06,410 because I've run a for a lot more iterations at one shot. 1084 00:52:06,410 --> 00:52:09,730 So now I have to buffer all this data from a to b. 1085 00:52:09,730 --> 00:52:11,960 Similarly, I've run b for a long time so I have to buffer 1086 00:52:11,960 --> 00:52:13,960 a whole lot of data from b to c. 1087 00:52:13,960 --> 00:52:15,570 Is that clear? 1088 00:52:15,570 --> 00:52:15,930 AUDIENCE: No. 1089 00:52:15,930 --> 00:52:19,040 PROFESSOR: So let's say every time a runs it produces one 1090 00:52:19,040 --> 00:52:20,830 data element. 1091 00:52:20,830 --> 00:52:23,770 So now in this case, every iteration 1092 00:52:23,770 --> 00:52:24,970 produces one data element. 1093 00:52:24,970 --> 00:52:25,880 That's fine. 1094 00:52:25,880 --> 00:52:27,100 That's clear? 1095 00:52:27,100 --> 00:52:31,720 Here I run a n times, so I produce n data elements. 1096 00:52:31,720 --> 00:52:34,950 And b let's say produces one data element. 1097 00:52:34,950 --> 00:52:39,560 So if my cache can only hold let's say n over 2 data 1098 00:52:39,560 --> 00:52:42,140 elements, then there's an overflow. 1099 00:52:42,140 --> 00:52:44,770 So what that means is not everything's in the cache, and 1100 00:52:44,770 --> 00:52:46,620 that's bad because of the same reasons we saw for the 1101 00:52:46,620 --> 00:52:47,550 instructions. 1102 00:52:47,550 --> 00:52:49,780 When I need that data I have to go out to memory and get 1103 00:52:49,780 --> 00:52:52,060 it again, so it's extra communication, extra 1104 00:52:52,060 --> 00:52:52,620 redundancy. 1105 00:52:52,620 --> 00:52:54,530 AUDIENCE: In this case where you don't need to store the a 1106 00:52:54,530 --> 00:52:55,780 variables [UNINTELLIGIBLE PHRASE]. 1107 00:52:59,550 --> 00:53:03,000 PROFESSOR: But notice this was the sequential simple case. 1108 00:53:03,000 --> 00:53:08,210 I need all the data from a to run all the iterations for b. 1109 00:53:08,210 --> 00:53:10,650 Then, yeah, this goes away.
1110 00:53:10,650 --> 00:53:13,570 So let's say this goes away, but still b produces n 1111 00:53:13,570 --> 00:53:15,240 elements and that overflows the cache. 1112 00:53:19,770 --> 00:53:23,230 So there's a third example where I don't fully distribute 1113 00:53:23,230 --> 00:53:26,520 everything, I partially distribute some of the loops. 1114 00:53:26,520 --> 00:53:29,730 I can fully scale a and b because I can fit those 1115 00:53:29,730 --> 00:53:31,600 instructions in the cache. 1116 00:53:31,600 --> 00:53:35,090 That gets me around this problem, because now a and b 1117 00:53:35,090 --> 00:53:37,850 are just communicating one data element. 1118 00:53:37,850 --> 00:53:40,120 But c is still a problem because I still have to run b 1119 00:53:40,120 --> 00:53:43,430 n times in the end before I can run c, so there are n data 1120 00:53:43,430 --> 00:53:46,940 elements in flight. 1121 00:53:46,940 --> 00:53:50,480 So the data for b still becomes a problem in terms of 1122 00:53:50,480 --> 00:53:51,500 its locality. 1123 00:53:51,500 --> 00:53:54,810 Is that clear? 1124 00:53:54,810 --> 00:53:57,640 So, any ideas on how I can improve this? 1125 00:53:57,640 --> 00:53:58,937 AUDIENCE: Assuming you have a long cache line, you 1126 00:53:58,937 --> 00:54:00,187 have to do one or two memory accesses to get the cache line back. 1127 00:54:10,920 --> 00:54:16,690 PROFESSOR: So, programs typically have really good 1128 00:54:16,690 --> 00:54:19,090 instruction locality just because of the nature of the way 1129 00:54:19,090 --> 00:54:19,490 we run them. 1130 00:54:19,490 --> 00:54:23,350 We have small loops and they iterate over and over again. 1131 00:54:23,350 --> 00:54:26,410 Data is actually where you spend most of your time in the 1132 00:54:26,410 --> 00:54:27,090 memory system. 1133 00:54:27,090 --> 00:54:28,320 It's fetching data. 1134 00:54:28,320 --> 00:54:31,650 So I didn't actually understand why you think data 1135 00:54:31,650 --> 00:54:33,340 is less expensive than instructions. 1136 00:54:33,340 --> 00:54:35,879 AUDIENCE: What I'm saying is, say you want to read an array, 1137 00:54:35,879 --> 00:54:39,435 you read the first, say, 8 elements, so 8 words, 1138 00:54:39,435 --> 00:54:39,942 in the cache block. 1139 00:54:39,942 --> 00:54:44,514 Well then you'd get 7 hits, so every 8 iterations you have to 1140 00:54:44,514 --> 00:54:45,530 do another fetch. 1141 00:54:45,530 --> 00:54:46,190 PROFESSOR: Right. 1142 00:54:46,190 --> 00:54:49,800 So that assumes that you have really good spatial locality, 1143 00:54:49,800 --> 00:54:52,170 because you've assumed that I've brought in 8 elements and 1144 00:54:52,170 --> 00:54:53,780 I'm going to use every one of them. 1145 00:54:53,780 --> 00:54:55,910 So if that's the case you have really good spatial locality 1146 00:54:55,910 --> 00:54:57,690 and that's, in fact, what you want. 1147 00:54:57,690 --> 00:55:00,160 It's the same kind of thing that I showed for the 1148 00:55:00,160 --> 00:55:00,820 instruction cache. 1149 00:55:00,820 --> 00:55:04,450 The first thing is a miss, the rest are hits. 1150 00:55:04,450 --> 00:55:07,160 The reason data is more expensive is you simply have a 1151 00:55:07,160 --> 00:55:10,490 lot more data reads than you have instructions. 1152 00:55:10,490 --> 00:55:12,730 Typically you have small loops, hundreds of 1153 00:55:12,730 --> 00:55:15,120 instructions, and they might access really big arrays that 1154 00:55:15,120 --> 00:55:18,160 are millions of data references.
1155 00:55:18,160 --> 00:55:20,230 So that becomes a problem. 1156 00:55:20,230 --> 00:55:22,250 So, any ideas on how to improve this? 1157 00:55:22,250 --> 00:55:23,340 AUDIENCE: That's a loop? 1158 00:55:23,340 --> 00:55:24,190 PROFESSOR: That's a loop. 1159 00:55:24,190 --> 00:55:25,800 So what would you do with the smaller loop? 1160 00:55:25,800 --> 00:55:30,740 AUDIENCE: [INAUDIBLE PHRASE]. 1161 00:55:30,740 --> 00:55:31,330 PROFESSOR: Something like that? 1162 00:55:31,330 --> 00:55:33,230 AUDIENCE: Yeah. 1163 00:55:33,230 --> 00:55:35,010 PROFESSOR: OK. 1164 00:55:35,010 --> 00:55:39,190 So in a nested loop, you have a smaller loop that has a 1165 00:55:39,190 --> 00:55:43,220 small number of iterations, so 64. 1166 00:55:43,220 --> 00:55:47,110 So, 64 might be just as much as I can buffer for the data 1167 00:55:47,110 --> 00:55:48,380 in the cache. 1168 00:55:48,380 --> 00:55:51,610 Then I wrap that loop with one outer loop that completes the 1169 00:55:51,610 --> 00:55:52,860 whole number of iterations. 1170 00:55:52,860 --> 00:55:55,950 So if I had to do n, then I divide n by 64. 1171 00:55:55,950 --> 00:55:58,060 So that can work out really well. 1172 00:55:58,060 --> 00:56:00,190 So there are different kinds of blocking techniques that you 1173 00:56:00,190 --> 00:56:03,800 can use on getting your data to fit into your local store 1174 00:56:03,800 --> 00:56:07,340 or into your cache to exploit these spatial and temporal 1175 00:56:07,340 --> 00:56:08,250 properties. 1176 00:56:08,250 --> 00:56:08,550 Question? 1177 00:56:08,550 --> 00:56:12,135 AUDIENCE: Would it not be better to use a small 1178 00:56:12,135 --> 00:56:14,079 [UNINTELLIGIBLE] size so you could run a, b, c 1179 00:56:14,079 --> 00:56:15,329 sequentially? 1180 00:56:17,210 --> 00:56:18,430 PROFESSOR: You could do that as well. 1181 00:56:18,430 --> 00:56:21,620 But the problem with running a, b, c sequentially is that 1182 00:56:21,620 --> 00:56:23,930 if they're in the same loop, you end up with 1183 00:56:23,930 --> 00:56:26,050 instructions being bad. 1184 00:56:26,050 --> 00:56:28,370 That would really be this case -- so even if you change this 1185 00:56:28,370 --> 00:56:30,790 number you don't get around the instructions. 1186 00:56:34,890 --> 00:56:38,040 So you're going to see more optimizations that do more of 1187 00:56:38,040 --> 00:56:39,990 these loop tricks. 1188 00:56:39,990 --> 00:56:43,230 I talked about unrolling without really defining what unrolling 1189 00:56:43,230 --> 00:56:45,930 is or going into a lot of details. 1190 00:56:45,930 --> 00:56:48,170 Loop distribution, loop fission, some other things 1191 00:56:48,170 --> 00:56:49,960 like loop tiling, loop blocking. 1192 00:56:49,960 --> 00:56:53,620 I think Saman's going to cover some of these next week. 1193 00:56:53,620 --> 00:56:56,890 So this was implemented, this was done by another Master's 1194 00:56:56,890 --> 00:57:02,470 student at MIT who graduated about two years ago, to show 1195 00:57:02,470 --> 00:57:04,910 that if you factor in cache constraints versus ignoring 1196 00:57:04,910 --> 00:57:07,650 cache constraints, how much performance you can get. 1197 00:57:07,650 --> 00:57:09,580 This was done in the context of StreamIt. 1198 00:57:09,580 --> 00:57:12,710 So, in fact, some of you might have recognized a to b to c as 1199 00:57:12,710 --> 00:57:16,950 being interconnected as pipeline filters.
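[Pulling the instruction-cache and data-cache arguments together, here is a minimal sketch of the three loop structures discussed above: the fused loop, the fully scaled (distributed) version, and the cache-aware version that blocks by a small factor. A(), B(), C(), N, and the block size of 64 are all hypothetical, and it assumes iteration i of b needs only the value a produced in iteration i, and likewise for c.]

```c
#define N     4096
#define BLOCK 64          /* chosen so BLOCK intermediates fit in cache */

extern float A(int i);
extern float B(float x);
extern float C(float x);

/* Fused: the data passed between a, b, c is tiny, but the combined
 * instruction working set of A+B+C may overflow the instruction
 * cache, so every iteration can miss. */
void fused(float *out) {
    for (int i = 0; i < N; i++)
        out[i] = C(B(A(i)));
}

/* Fully scaled: each loop's instructions fit in the icache, but now
 * N intermediate values must be buffered, which can overflow the
 * data cache instead. */
void fully_scaled(float *out) {
    static float t1[N], t2[N];
    for (int i = 0; i < N; i++) t1[i] = A(i);
    for (int i = 0; i < N; i++) t2[i] = B(t1[i]);
    for (int i = 0; i < N; i++) out[i] = C(t2[i]);
}

/* Cache-aware scaling: distribute the loops, but only run BLOCK
 * iterations of each before moving on, so both the instruction and
 * the data working sets stay small.  Assumes N % BLOCK == 0. */
void cache_aware(float *out) {
    static float t1[BLOCK], t2[BLOCK];
    for (int j = 0; j < N; j += BLOCK) {
        for (int i = 0; i < BLOCK; i++) t1[i] = A(j + i);
        for (int i = 0; i < BLOCK; i++) t2[i] = B(t1[i]);
        for (int i = 0; i < BLOCK; i++) out[j + i] = C(t2[i]);
    }
}
```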
1200 00:57:16,950 --> 00:57:19,690 We ran it on different processors, so the StrongARM 1201 00:57:19,690 --> 00:57:21,550 processor's really small. 1202 00:57:21,550 --> 00:57:25,780 It's an in-order processor, and it has no L1 cache in this particular model 1203 00:57:25,780 --> 00:57:26,520 that we used. 1204 00:57:26,520 --> 00:57:29,710 But it had a really long latency -- 1205 00:57:29,710 --> 00:57:31,570 sorry, it had no L2 cache. 1206 00:57:31,570 --> 00:57:34,390 It had really long latency to memory. 1207 00:57:34,390 --> 00:57:37,710 Pentium, an x86 processor. 1208 00:57:37,710 --> 00:57:40,300 Reasonably fast. It had a complicated memory system and 1209 00:57:40,300 --> 00:57:43,490 a lot of memory overlap in terms of references. 1210 00:57:43,490 --> 00:57:49,490 Then the Itanium processor, which had a huge L2 cache at 1211 00:57:49,490 --> 00:57:50,590 its disposal. 1212 00:57:50,590 --> 00:57:53,340 So what you can see is that lower bars indicate 1213 00:57:53,340 --> 00:57:56,060 bigger speed ups. 1214 00:57:56,060 --> 00:57:57,680 This is normalized run time. 1215 00:57:57,680 --> 00:58:01,090 So on the processor where you don't actually have caches to 1216 00:58:01,090 --> 00:58:03,870 save you, and the memory communication is really 1217 00:58:03,870 --> 00:58:06,490 expensive, you can get a lot of benefit from doing the 1218 00:58:06,490 --> 00:58:09,800 cache aware scaling, that loop nesting to take advantage of 1219 00:58:09,800 --> 00:58:12,430 packing instructions into the instruction cache, packing 1220 00:58:12,430 --> 00:58:15,550 data into the data cache, and not having to go out to memory if 1221 00:58:15,550 --> 00:58:17,100 you don't have to. 1222 00:58:17,100 --> 00:58:23,470 So you can reduce run time to about 1/3 of what it was with 1223 00:58:23,470 --> 00:58:26,160 this kind of cache optimization. 1224 00:58:26,160 --> 00:58:30,710 On the Pentium3 where you have a cache to help you out, the 1225 00:58:30,710 --> 00:58:34,300 benefits are there, but you don't get as big a benefit 1226 00:58:34,300 --> 00:58:38,550 from ignoring the cache constraints versus being aware 1227 00:58:38,550 --> 00:58:39,670 of the cache constraints. 1228 00:58:39,670 --> 00:58:44,480 So here you're actually doing some of that middle column, 1229 00:58:44,480 --> 00:58:46,360 whereas here we're doing the third column, 1230 00:58:46,360 --> 00:58:48,980 the cache-aware fusion. 1231 00:58:51,890 --> 00:58:56,220 On the Itanium you really get no benefit between the two. 1232 00:58:56,220 --> 00:58:56,580 Yep? 1233 00:58:56,580 --> 00:59:02,140 AUDIENCE: Can you explain what the left columns are? 1234 00:59:02,140 --> 00:59:04,040 PROFESSOR: These? 1235 00:59:04,040 --> 00:59:04,280 AUDIENCE: Yeah. 1236 00:59:04,280 --> 00:59:07,270 PROFESSOR: So this is tricky. 1237 00:59:11,020 --> 00:59:13,000 So the left columns are doing this. 1238 00:59:13,000 --> 00:59:17,990 AUDIENCE: OK, sort of assuming that the icache is there. 1239 00:59:17,990 --> 00:59:20,500 PROFESSOR: Right, and the third column is doing this. 1240 00:59:20,500 --> 00:59:24,680 So you want to do this because the icache locality is the 1241 00:59:24,680 --> 00:59:29,890 best. So you always want to go to full or maximum scaling. 1242 00:59:29,890 --> 00:59:33,310 I'm actually fudging a little just for the sake of clarity. 1243 00:59:33,310 --> 00:59:37,120 Here you're actually doing this nesting to improve both 1244 00:59:37,120 --> 00:59:38,730 the instruction and the data locality.
1245 00:59:41,440 --> 00:59:43,160 So you can get really good performance improvement. 1246 00:59:43,160 --> 00:59:46,330 So what does that mean for your Cell projects, or for 1247 00:59:46,330 --> 00:59:49,020 Cell? We'll talk about that next week at the recitation. 1248 00:59:52,000 --> 00:59:52,150 Yeah? 1249 00:59:52,150 --> 00:59:53,572 AUDIENCE: Is there some big reasons 1250 00:59:53,572 --> 00:59:56,400 [UNINTELLIGIBLE PHRASE]. 1251 00:59:56,400 --> 00:59:57,850 PROFESSOR: Well it just means that if you have caches to 1252 00:59:57,850 --> 01:00:00,390 save you, and they're really big caches and they're really 1253 01:00:00,390 --> 01:00:06,990 efficient, it's the law of diminishing returns. 1254 01:00:06,990 --> 01:00:08,370 That's where profiling comes in. 1255 01:00:08,370 --> 01:00:10,100 So you look at the profiling results, you look at your 1256 01:00:10,100 --> 01:00:12,360 cache misses, how many cache misses are you taking. 1257 01:00:12,360 --> 01:00:14,780 If it's really significant, then you look at ways to 1258 01:00:14,780 --> 01:00:16,020 improve it. 1259 01:00:16,020 --> 01:00:18,110 If your cache misses are really low, your miss rate is 1260 01:00:18,110 --> 01:00:20,540 really low, then it doesn't make sense to spend time and 1261 01:00:20,540 --> 01:00:22,120 energy focusing on that. 1262 01:00:22,120 --> 01:00:24,830 Good question. 1263 01:00:24,830 --> 01:00:28,360 So, any other questions? 1264 01:00:28,360 --> 01:00:33,410 So summarizing the gamut of programming for performance. 1265 01:00:33,410 --> 01:00:35,410 So you tune the parallelism first, because if you can't 1266 01:00:35,410 --> 01:00:38,140 find the concurrency, by Amdahl's law you're not going 1267 01:00:38,140 --> 01:00:40,440 to get a whole lot of speed up. 1268 01:00:40,440 --> 01:00:43,750 But then once you've figured out what the parallelism is, then 1269 01:00:43,750 --> 01:00:45,740 what you want to do is really get the performance on each 1270 01:00:45,740 --> 01:00:48,055 processor, the single-thread performance, to be really good. 1271 01:00:48,055 --> 01:00:49,870 You shouldn't ignore that. 1272 01:00:49,870 --> 01:00:51,630 The modern processors are complex. 1273 01:00:51,630 --> 01:00:53,700 You need instruction level parallelism, you need data 1274 01:00:53,700 --> 01:00:56,190 level parallelism, you need memory hierarchy 1275 01:00:56,190 --> 01:00:59,210 optimizations, and so you should consider those 1276 01:00:59,210 --> 01:01:00,230 optimizations. 1277 01:01:00,230 --> 01:01:02,640 Here, profiling tools could really help you figure out 1278 01:01:02,640 --> 01:01:06,030 where the biggest benefits to performance will come from. 1279 01:01:09,570 --> 01:01:11,490 You may have to, in fact, change everything. 1280 01:01:11,490 --> 01:01:13,320 You may have to change your algorithm, your data 1281 01:01:13,320 --> 01:01:14,970 structures, your program structure. 1282 01:01:14,970 --> 01:01:17,460 So in the MPEG decoder case, for example, I showed you that 1283 01:01:17,460 --> 01:01:20,910 if you change the flag that says don't use double 1284 01:01:20,910 --> 01:01:25,530 precision inverse DCT, use a numerical hack, then you can 1285 01:01:25,530 --> 01:01:27,330 get performance improvements but you're changing your 1286 01:01:27,330 --> 01:01:29,380 algorithm really.
1287 01:01:29,380 --> 01:01:32,120 You really want to focus on just the biggest nuggets -- 1288 01:01:32,120 --> 01:01:34,760 where is most of the performance coming from, or 1289 01:01:34,760 --> 01:01:36,600 where's the biggest performance bottleneck, and 1290 01:01:36,600 --> 01:01:38,060 that's the thing you want to optimize. 1291 01:01:38,060 --> 01:01:40,200 So remember the law of diminishing returns. 1292 01:01:40,200 --> 01:01:42,720 Don't spend your time on doing things that aren't going to 1293 01:01:42,720 --> 01:01:46,010 get you anything significant in return. 1294 01:01:46,010 --> 01:01:46,450 That's it. 1295 01:01:46,450 --> 01:01:47,700 Any questions? 1296 01:01:51,070 --> 01:01:51,830 OK. 1297 01:01:51,830 --> 01:01:53,080 How are you guys doing with the projects? 1298 01:01:56,830 --> 01:02:01,195 So, one of the added benefits of the central CVS repository 1299 01:02:01,195 --> 01:02:04,620 is I get notifications too when you submit things. 1300 01:02:04,620 --> 01:02:07,410 So I know of only two projects that have been submitting 1301 01:02:07,410 --> 01:02:08,580 things regularly. 1302 01:02:08,580 --> 01:02:11,950 So, I hope that'll pick up soon. 1303 01:02:11,950 --> 01:02:14,280 I guess you have a few minutes to finish the quiz and then we'll 1304 01:02:14,280 --> 01:02:14,930 see you next week. 1305 01:02:14,930 --> 01:02:16,180 Have a good weekend.