The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

GUEST SPEAKER 1: All right. So we're going to talk about performance engineering, and pretty much how to do that with profiling tools. First, we're going to talk about what profiling tools are and how you use them in general, what the theory of that is. And then, more importantly, we're going to do an interactive walk-through of two examples that we picked out. The first one is the matrix multiply that you guys saw in Project 0: the cache ratio measurements, how to do those measurements, how to do the cycles-per-instruction measurements, these kinds of things. And then we're going to look at doing some of the bit hacks that Charles talked about in the second lecture. We're going to work on a branchless sorting algorithm, which is pretty interesting.

So with the code we gave you in Project 1, it was very obvious where the hot spot was. But real life doesn't really work that way, and code gets really, really big. It's not always obvious where the part is that you want to optimize. And you really don't want to waste time taking what is otherwise understandable, good code and beating it over the head trying to make it go faster. That's why Donald Knuth said, "premature optimization is the root of all evil." You really want to focus on the hot spots. So that's what profiling's for. It basically gives you visibility into the execution of your code, because it would be too complicated to just sit down and look at it and try to understand where it's spending its time.

And there are many possibilities for where it could be spending its time. In 6.172 we're mostly going to be focusing on things that are CPU-bound and things that are using the memory hierarchy ineffectively.
But in the real world, a couple of other common things are network requests. If your web application is making a whole bunch of AJAX requests, it's going to be wasting a lot of your users' time. If your desktop application is hitting the disk all the time, that's also going to slow you down. Or a SQL database: if anybody's ever done web development, they know that can be a huge bottleneck.

So basically, the tool that you're going to use is going to depend a lot on what the problem you're solving is. But the basic ideas are all the same. There are all kinds of profiling tools for measuring these things and figuring out where it's coming from and where the time's going.

For trying to figure out CPU and memory, there's a couple things you can do. The first and most basic, and sort of most obvious to you guys, is manual instrumentation. That's basically where you go in and you put in prints and try to figure out where it's spending its time, which code got executed more. Or you can be a little bit more clever: you can insert logging calls that fill up a buffer, checking the time and doing that yourself. One of the big advantages of that is that, when you actually go into the code and instrument it yourself, you can cut down the information that's coming back out at you. If you have an intuition for how the program works, sprinkling little calls to timing functions around your code in the places that you think matter is going to give you maybe a better picture. But it requires more work on your part. So that's the drawback.
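To make the manual approach concrete, here is a minimal sketch (not code from the course) of what those timing calls might look like, assuming a POSIX system with clock_gettime:

    #include <stdio.h>
    #include <time.h>

    // Read a monotonic clock and return seconds as a double.
    static double now(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    // Hypothetical stand-in for the region you suspect is hot.
    static void compute(void) {
        volatile double x = 0;  // volatile so the loop isn't optimized away
        for (long i = 0; i < 100000000; i++) x += i;
    }

    int main(void) {
        double start = now();
        compute();
        fprintf(stderr, "compute: %.6f s\n", now() - start);
        return 0;
    }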
The other kind of instrumentation is static instrumentation. And I should say that instrumentation is basically a little snippet of code that is gathering information, gathering data about the code execution. So static instrumentation is code that's inserted by the compiler when you build your program, and gprof is an example of this. Basically, what it does is, at the prologue and the epilogue of every function, it inserts a little piece of code that increments a counter every time the function gets executed. I think it also figures out who called it, that kind of thing. One of the drawbacks to static instrumentation is that you have to have the source, and you have to be able to rebuild your project. If you've ever had a DLL, a binary blob of a library, handed to you, and you want to know where the time's going in that library, you're not going to be able to instrument it. So you're not going to get any data.

So that brings us to the next thing, which is dynamic instrumentation, which solves that problem. Basically, what happens there is you take the binary as it is, just before you're about to run it. And before you start executing, you take a look at the code and translate it: you go in, you insert the instrumentation, and then you have this new snippet of code that you run instead. So that works for binary blobs, and it doesn't require a recompile, which is great. But it introduces a lot of overhead, because you're not actually running the original code, right? You're running this translated thing, which might be much more inefficient. And Valgrind has a couple of tools that are sort of classic examples of that, Callgrind and Cachegrind. I think Callgrind gives you a call graph. And Cachegrind tries to measure the cache effects; it tries to predict, when you do memory accesses, whether these things are in cache or not.
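For reference, here is roughly what the two styles look like on the command line (illustrative invocations, assuming a program built from mergesort.c; these aren't commands from the lecture):

    # Static instrumentation with gprof: rebuild with -pg,
    # run (which writes gmon.out), then read the profile.
    gcc -pg -O2 mergesort.c -o mergesort
    ./mergesort
    gprof ./mergesort gmon.out

    # Dynamic instrumentation with Valgrind: no rebuild needed.
    valgrind --tool=callgrind ./mergesort     # call-graph profile
    valgrind --tool=cachegrind ./mergesort    # simulated cache behavior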
Another thing you might want to look at, and the thing that we're going to focus on in this course, is performance counters. Two tools to get these are OProfile and perf. Basically, the CPU has a bunch of counters, which we've talked about, for branch misses, cache misses, these kinds of things. And these tools give you access to those.

The last tool that I'm going to talk about for profiling is heap profiling. A lot of times, if memory's your problem, you really want to figure out who's doing all the allocation. And so these tools will give you a nice breakdown of who's allocating what, who's calling them, and who I can stop calling so that I'm allocating less memory. And like I said, you can find more tools for profiling different kinds of problems and resources. But today we're mostly going to focus on perf, which is a new tool for getting access to performance counters on Linux.

I should also mention that, as someone talked about in the last lecture, the way that you want to gather data from performance counters is event sampling. There's too much data otherwise. As you execute along, you don't want to record, OK, I executed this instruction, then I executed that instruction, and count all the instruction executions and store that data somewhere. That would be way too much data. So what you want to do is sample every so often: every 1,000th event, or some other frequency, for whatever your interesting event is. And basically, what happens is that when that counter hits the threshold, an interrupt fires, and the kernel catches it. And then it records the data that you want, the context: where in the execution it was, what the call stack looked like, these kinds of things.

I think we've covered most of the performance counters that we mostly care about. But it's worth mentioning that there are a lot of performance counters, and you can go look them up in this part of the Intel manual. They cover a lot of interesting things, not just the ones we've talked about, including sleep states and power consumption. And if you want to look at contention: say you have a parallel program, and you have two cores accessing the same data. If they're writing back to it, they're going to be fighting over the same cache lines. And so there are counters for things like, how many times did this core have to go off-core to get a cache line, and all kinds of interesting things like that. But mostly, we're going to focus on the cache misses and branch misses.
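As a concrete, hypothetical example of that sampling setup, perf lets you set the period directly:

    # Take one sample every 10,000 cache-miss events,
    # recording where the code was each time the counter trips.
    perf record -e cache-misses -c 10000 ./mergesort
    perf report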
I also wanted to mention that the tool we're using, perf, kind of replaces OProfile and PerfMon, which are tools that do similar things. It's different in that it's a unified interface to monitoring two kinds of events. The first are software kernel events, things like context switches. How many times did your application make a system call and then get context-switched out? How many page faults did your application generate? These are the kinds of things that the kernel knows about, but aren't necessarily hardware events. And the other kind is hardware events, and those are the ones we're going to focus on; we've already talked about those. If you want to get this tool and try it out yourself, on any recent Linux distribution, anything with a kernel that's less than a year old, you can install the Linux tools package, and the command is perf.

So now we're going to do a demo of the matrix multiply from Project 0.

GUEST SPEAKER 2: All right. So for the demos, I guess I'll narrate what the code's doing, so Reid can focus on typing. This code should look familiar. All right. Can everybody hear me? All right, cool. So this code should look very familiar: it's Project 0's matrix multiply code, the original version. It's a triply nested for loop for multiplying matrices. So what we're going to do is I'm going to switch to a terminal that Reid is typing on, and we're going to run matrix multiply. It takes about eight seconds to run. And now we're going to run it again, but have perf output some performance statistics for it, basically CPU counters. You use -e to request specific counters; you can use perf list to take a look at all the counters that perf supports. In this case, we're asking for the number of CPU cycles, the number of instructions, the number of L1 cache loads, and the number of L1 cache misses.
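The invocation is something like the following (the binary name here is an assumption; the event list is the one just described):

    perf stat -e cycles,instructions,L1-dcache-loads,L1-dcache-load-misses ./matrix_multiply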
Perf stat. That's better.

So it takes a little bit longer to execute, I think, with perf. But it's not supposed to be a big difference. You can see, 8.31 versus 8.34 seconds. So it's relatively low-overhead sampling, which is good, because you don't want to change the behavior of your program when you sample it.

Important things to see: we're looking at this cache miss rate, which is 18%, almost 19%. That's the L1 cache miss rate, which is pretty bad. And as a result of that, if you look at the number of instructions and the number of cycles, perf calculates the IPC for you. We previously talked about CPI, but perf uses IPC, which is instructions per cycle. So basically, every cycle, you only executed half an instruction on average, because you're waiting for the cache. So that's no good.

So can anybody remember from Project 0 what you did to work around that? Yes.

AUDIENCE: You reordered the loops.

GUEST SPEAKER 2: Yes. So, the before and after: you just swap the inner two loops. And we will run it again. OK, so remember, the previous runtime was 8 seconds. Now it's down to 2.5 seconds, which is a lot better. Let's run the perf stat command again and look at the performance counters. As you can see, the IPC jumped from 0.49 to 1.4, so it nearly tripled. And if you look at the new L1 cache miss rate, it's 1% rather than 20%, which is a lot better. Yes?

AUDIENCE: Can you point to where the cache miss rate percent is? Because I just see zero per second.

GUEST SPEAKER 2: Oh, sorry about that. So L1-dcache-loads is all of the times that the CPU asked the L1 cache for something. And then L1-dcache-load-misses is the number of cache misses. So you divide this over this to get the ratio, which is basically what Reid did over here. Does that make sense?

AUDIENCE: Yes.

GUEST SPEAKER 2: OK.
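For reference, the interchange being described looks like this (a sketch with n-by-n matrices of doubles; the project's actual types and code may differ):

    // Before: the inner loop walks down a column of B, so successive
    // accesses are a whole row apart in memory and miss in L1 constantly.
    void matmul_ijk(int n, double A[n][n], double B[n][n], double C[n][n]) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    C[i][j] += A[i][k] * B[k][j];
    }

    // After: swapping the inner two loops makes the inner loop scan
    // B and C along rows, which is contiguous and cache-friendly.
    void matmul_ikj(int n, double A[n][n], double B[n][n], double C[n][n]) {
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                for (int j = 0; j < n; j++)
                    C[i][j] += A[i][k] * B[k][j];
    }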
So this is a ridiculously simple example of how to use perf, just so that you can see the command set. But that's not really a very interesting example. So the next one that we'll look at is actually something that Charles, Reid, and I did yesterday afternoon for fun. It started when we were testing your Project 2.1, which is-- I don't know how many people have looked at the handout yet, but it deals with a bunch of mystery sorting algorithms that you have to figure out what they're doing. So we took one of them, and we were running sorting. And Charles asked, what happens if you have a branchless sorting algorithm? And we said, we don't know. So we played around with it. And in the process, we optimized our branchless sorting algorithm by using perf to generate insight into what the CPU is doing. And as you'll see, we got quite a good speed-up.

So the first thing we'll do is start with the baseline: quicksort, everybody's favorite sorting algorithm. I'll switch over to the terminal. And Reid is going to run quicksort on 30 million integers one time, which it says takes 4.14 seconds. Now, let's use perf stat to see what it's doing. This time, we'll ask for slightly different events. We'll still ask for the cycles and instructions to get our IPC, but we'll also ask for branches and branch misses, because we kind of intuitively know that quicksort is doing branching.

So, OK. It's doing 0.8 IPC, so 0.8 instructions per cycle. And the branch miss rate, branch misses over the total number of branches, is 11.5%. That's pretty bad in terms of branch misses. So let's take a look at the code and see why that's the case. Here's the part of the code in quicksort where you select a pivot recursively, and then you loop, and then you do your swapping. So these branches are more or less unpredictable.
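To see why, here's a sketch of the kind of partition loop being described (illustrative, not the course's exact code):

    // Hoare-style partition pass. Whether each element falls below or
    // above the pivot is essentially random on random input, so these
    // comparisons are branches the predictor gets wrong about half the time.
    static void partition(long *a, long left, long right, long pivot) {
        while (left <= right) {
            while (a[left] < pivot) left++;
            while (a[right] > pivot) right--;
            if (left <= right) {
                long tmp = a[left];
                a[left] = a[right];
                a[right] = tmp;
                left++;
                right--;
            }
        }
    }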
And looking at them, there's no way that we could think of to get rid of the unpredictable branches. They're unpredictable by nature. So this would not be an interesting example on which to do branchless sorting. So let's switch to a different algorithm, mergesort.

Here's the relevant code in mergesort. It takes a list, C, and two input lists, A and B, and you merge the contents of A and B into C by taking the smallest element from either A or B, putting it in C, and then repeating that process until the lists are empty. So the "if else" branches, as the comment says, place the min of the value at either A or B into C, and then properly increment the pointer and do the housekeeping tasks so you don't crash.

So we're going to try running this and profiling it. Reid is running mergesort. Took 5.04 seconds. Well, the last one took 4.8 seconds. So already, it doesn't look too good. But let's try to see what happened. We'll issue the same perf stat command, asking for branches, branch misses, and cycles. Reid is being lazy. OK, so the IPC is 1.1, which is a little bit better. But if you look at the branch misses over branches, it's roughly the same ratio as before, about 10%. So, OK. Mergesort is currently slower than quicksort. And the reason is still, roughly, that you have all of these branches that you're not predicting correctly.

OK, so why branching, and why not caching? Well, we did some quick back-of-the-envelope calculations, or research, rather. And it turns out that in the Nehalem, the processor that you're using on the cloud machines, one of the design changes that Intel made is that they made the L1 cache, and actually also the L2 cache, faster, so fewer clock cycles to go out to the cache. But at the same time, they also made the pipeline deeper. So an L1 cache miss is maybe three or four cycles. An L2 cache miss is 15 cycles. By the time you miss the L2 cache, you're going out into the L3 cache, which is 12 megabytes.
And if you're sorting integers, most of them are probably going to be in there anyway.

AUDIENCE: John, those are actually hits [INAUDIBLE].

GUEST SPEAKER 2: Oh, they are. OK, so misses might be a little bit bigger than that. But OK.

AUDIENCE: An L1 miss is an L2--

GUEST SPEAKER 2: Right, is an L2.

AUDIENCE: L1 misses.

AUDIENCE: It was just a typo.

GUEST SPEAKER 2: OK, so the next number is probably, like, 30 to 50, somewhere on that order. 50-something. OK. But a branch mispredict will also cost you somewhere between 16 and 24 cycles, depending on how many pipeline stages are actually in the Nehalem, which we don't know. So bad branch predictions might be just as costly as bad memory access patterns. Plus, Charles is going to go into detail about memory access patterns and how to optimize sorting for good memory access, so it wouldn't be interesting if I talked about that right now. So let's optimize branching.

To do that, we'll use a bit hack that Charles introduced in an earlier lecture. Remember, all we want to do is put the minimum of either A or B into C. Right now, we're using branches to do it, but there's no reason why we have to. So the trick that we'll apply is the XOR trick for calculating min without a branch. Does everybody kind of follow what that code is doing? You can refer to the bit hacks lecture later to check. But trust us that it's correct.

GUEST SPEAKER 1: [INAUDIBLE]

GUEST SPEAKER 2: All right. I guess we can go over it. Do I have a mouse pointer? Cool.
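Here is a sketch of the merge step with that trick applied (assuming long elements and the A++/B++ bookkeeping mentioned below; not necessarily the project's exact code):

    // Branchless merge of two sorted long arrays (assumed signature).
    static void merge(long *C, long *A, long *A_end, long *B, long *B_end) {
        while (A < A_end && B < B_end) {
            long a = *A, b = *B;
            long cmp = (a < b);           // 1 if *A is the min, 0 otherwise
            *C++ = b ^ ((b ^ a) & -cmp);  // -cmp: all 1's picks a, all 0's picks b
            A += cmp;                     // advance the list the min came from
            B += !cmp;
        }
        while (A < A_end) *C++ = *A++;    // drain whichever list remains
        while (B < B_end) *C++ = *B++;
    }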
OK, so inside, you're doing a compare, A less than B: the value at A is less than the value at B. And you store either a 0 or a 1 into a flag called CMP. And then you take CMP, and you negate it. So then you end up with a bit mask of either all 0's or all 1's, depending on which one was smaller. And then you mask A XOR B with that. So from this inner expression, you either get A XOR B out of it, or you get all 0's. And then you XOR B into it. So in one case, B XOR this expression cancels out the B, and you'll just get A out. And the other way, B XOR 0 is going to give you B. So that should be convincing motivation that this is actually a min function. And then, once you have that, you can do cute tricks to recycle the flag, to do the A++ and B++ and so on, using CMP.

So next, let's run this code and see what happens. OK, so we went from 5.04 seconds to 4.59 seconds. And we'll do a profiling run just to get some stats on it. OK, so now look at the IPC: we went from 1.1 to 1.7. And branch misses went down by a factor of 10, which is really awesome. But overall, the performance didn't seem to improve by much. Well, if you want to look at why, look at the total number of instructions executed: that number versus that number. We seem to be executing considerably more instructions.

So to see why that's the case, we decided to use another mode of perf called report, or rather record and then report. What this does is log where all of those interrupts actually happen. So it logs which point in your code the CPU was executing when these performance counters tripped. And then it generates a nice report that we can call up using perf annotate. The top section gives you a sorted summary, where it shows you the percent of times that perf caught your program executing that line of code. So 11% of the times that perf triggered its interrupt, your code was executing mergesort.c line 32. So let's go to mergesort.c line 32 and see what it's doing. This annotated view shows you the lines of C code, which are in black, and the line numbers are annotated here. And then it actually goes ahead and shows you the assembly instructions that it's executing. And it puts the timing on the assembly instructions corresponding to every C expression.
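The session looks roughly like this (hypothetical command lines matching the steps just described; the function name merge is an assumption):

    perf record -e cycles ./mergesort    # log samples into perf.data
    perf report                          # sorted summary of where the samples landed
    perf annotate merge                  # interleaved source and assembly for one function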
Yes?

AUDIENCE: Why is this [INAUDIBLE] is the annotated version--

GUEST SPEAKER 2: Yeah, you call perf annotate and then a function name. Project 2.1 will walk you through doing this a couple of times, so you'll be able to see what it's doing.

OK, so the assembly here is actually pretty important. I guess everybody's taken 6.004, or some variant, where you had to code some form of assembly? OK, it's a prereq. Cool. So Saman will talk more about assembly. But in this case-- or Charles, actually, now will talk more about assembly. But anyway, in this case, it was actually kind of useful to look at the assembly. We won't expect you to know how to write assembly for this class or expect you to write high-performing assembly. But sometimes taking a look at what the compiler outputs can really help you.

Well, in this case, what caught our eye was this cltq thing, because none of us in the room knew what the heck cltq was. It wasn't an assembly instruction that you usually see. So that caught our eye, and we wondered what this assembly was doing. So we spent a little bit of time tracing through it, working out what the compiler was doing. So in PowerPoint, I pulled out this code. And OK, so I looked up cltq in my 600-page Intel manual. Yes, your processor does come with a manual. So cltq, it says, sign-extends the EAX register to 64 bits and then places that into RAX. Why is it sign-extending? OK, so let's try to trace through what the assembly's doing.
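Putting the sequence described next together, the compiled comparison looked roughly like this (a reconstruction from the narration; exact registers and mnemonics are approximate):

    mov    (%r14), %rcx     # rcx = *A
    mov    (%r13), %rdx     # rdx = *B
    xor    %esi, %esi       # zero out esi
    cmp    %rdx, %rcx       # compare *A with *B
    setl   %sil             # sil = (*A < *B)
    xor    %rcx, %rdx       # rdx = *B ^ *A
    mov    %esi, %eax       # eax = cmp
    neg    %eax             # 32-bit negate: eax = -cmp
    cltq                    # sign-extend eax into rax -- the wasted step
    and    %rax, %rdx       # (*B ^ *A) & -cmp, the mask from the bit hack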
So here, we wanted to do CMP equals star A less than star B. So it moves the value at the address in register 14 into RCX and does the same thing for register 13. So assume that R14 and R13 are A and B, and so RCX and RDX are star A and star B, right? And then it XORs ESI with ESI, which is 0; so it zeroes out the register ESI. And then it does a compare between RCX and RDX, which is the value at A and the value at B. So it does that comparison. And then, depending on the result of that comparison, it writes that result out into register SIL--

GUEST SPEAKER 1: Which is the low byte--

GUEST SPEAKER 2: SIL, as Reid said, is the low byte of the register ESI, which is used here. But before that, the compiler executes this expression here, B XOR A, which it puts into register RDX. And then, finally, it takes ESI, which contains the same value as SIL, and it moves that into EAX. And then it negates EAX. OK, so that kind of looks like negative CMP, right? And then, after negating it, it sign-extends it to a 64-bit register. And then it does the AND mask that we asked it to.

Well, why is it jumping through so many hoops to do this? Well, it turns out, when you declare a value of type int, you say that it's a 32-bit value. And the compiler took you very literally in saying that you requested a 32-bit value. So it did a 32-bit negation, and then it extended that to 64 bits. So it wasted a couple of instructions doing this, instead of just flat-out doing a 64-bit negation to begin with. Now, the problem with this is, usually you don't care if it generates one extra assembly instruction. But this is a tight loop that's called in the recursion process for every single merge that it's doing. So the time that it wastes here really adds up.

So how do we fix that? Well, instead of declaring it as int, let's just try replacing it with long, then recompile the code and see if things are any different.
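In source, the fix is a one-word change (a sketch of the flag declaration, before and after):

    int  cmp = (*A < *B);   // before: 32-bit flag, so the compiler emits a 32-bit negate plus cltq
    long cmp = (*A < *B);   // after: 64-bit flag, negated directly as a 64-bit operation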
So here we start our process of trial and error. We compiled mergesort with the long, and the runtime is 4.05 seconds. Before, it was, like, 4.5 or so. So the runtime definitely improved. Let's compare it.

GUEST SPEAKER 1: Yeah, just to verify.

GUEST SPEAKER 2: So it went from 4.59 to 4.05, which definitely is an improvement. So let's figure out-- well, let's see if it generated better assembly. We'll run perf record in order to look at perf annotate again. And OK, once again, it's spending 14% of the time somewhere in mergesort.c, which is hardly surprising. We'll go to line 27, and, yeah, it uses the long. OK, so that's roughly the assembly. And though Reid took the time to find it, I have a nicely prepared screenshot of the same section of code, so it looks cleaner.

It loads A and B into different registers this time, actually RSI and RCX. And then it clears out EDX. And then it does the compare between RSI and RCX, which are A and B. And then it sets percent DL. Now, percent DL is the lower byte of EDX; so it used a different choice of registers. And then it did the A XOR B. And then it moved that compare into RAX.

GUEST SPEAKER 1: [INAUDIBLE]

GUEST SPEAKER 2: OK, and Reid actually has a longer snippet of the code. And the compiler, now that we told it that it was a 64-bit flag, realized that there's a couple of optimizations it could do. So it switched around the order of things a little bit. And it's recycling the value of RDX for loading, I think it's the A += CMP; it's recycling the value of RDX that it generated out of the compare. And now, down there, it's negating RAX as a 64-bit number and directly using it, without that cltq instruction. So that was kind of nice. So in just nudging the compiler a little bit, we got it to produce code that ran roughly 10% faster, which was surprising to us, actually: that changing a single int to a long could reduce the runtime by that amount.
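Roughly, the new sequence reads like this (again a reconstruction from the narration, so register choices are approximate):

    mov    (%r14), %rsi     # rsi = *A -- different registers this time
    mov    (%r13), %rcx     # rcx = *B
    xor    %edx, %edx       # zero out edx
    cmp    %rcx, %rsi       # compare *A with *B
    setl   %dl              # dl = (*A < *B), the low byte of edx
    xor    %rsi, %rcx       # the A XOR B
    mov    %rdx, %rax       # rax = cmp
    neg    %rax             # one 64-bit negate -- no cltq needed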
So the next thing that we did was scroll down a little bit more in this window and look at the next thing the compiler was doing. And we kind of caught the compiler again doing something silly. So here's the code that we were at before, the compare code. And now let's look at how the compiler did the B += !CMP. Once again, the reason that we stopped and looked at this code was that we saw an instruction we didn't recognize: SBB. Well, we looked at the manual again. And SBB of R1 and R2 is a subtract with a borrow bit. So it subtracts the two arguments and then subtracts the carry flag from that.

So OK, why's it doing that? Well, let's walk through what the compiler tried to do. Let's see, where's a good place to start? So it's RDX in this compare. So it compares RDX to 1; RDX is where CMP was stored. So it checks whether or not it equals 1. And the result of that check is actually stored in the carry flag, CF. And then, subtract with borrow, RAX, RAX. Before, RAX held the value of B. But now, since it did this, RAX minus RAX is 0, right? And then minus the carry flag is either 0 or negative 1, depending on the value of the compare. So once again, it generates a register that's either all 0's or all 1's, 64 bits.

Now, what did it go and do with that? Well, jump down a little bit later on in the code. It ANDs EAX, which is the 32-bit version of RAX, with the value 8 in hexadecimal. So now it's either all 0's, or it's 8. And then it adds that value, EAX, which is the same as RAX-- RAX is the 64-bit version-- to RBP, which stored the address of B. So it jumped through that many hoops just to increment B by either 8 or 0. Now, this seems a little bit unnecessary. So we thought, how else could we express B += !CMP, and change the code around a little bit so that it does the same thing, but hope that the compiler treats it differently? Well, in this case, not CMP, when CMP is either 1 or 0, is the same thing as 1 minus CMP.

Well, we compiled a version of this code, and then we tried to run it. Refreshing our memory: this is mergesort minus, where we just did the minus-sign change. So we took the runtime down from 4.04 seconds to 3.77 seconds. And once again, this is just from replacing the not CMP with a 1 minus CMP. That's all the code change that we made.
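In source, that rewrite is just the following (a sketch of the one-line change):

    B += !cmp;      // before: compiled to the cmp/sbb/and/add dance above
    B += 1 - cmp;   // after: same value when cmp is 0 or 1, but compiles to a plain subtract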
And we will look at the annotation again, just to make sure that the compiler generated different code. Rather, we were curious what the compiler generated this time. So it's roughly this section of code. Now, one thing you'll notice-- this is already compiling with GCC's maximum optimization level, mind you. So keep that in mind when you assume that the compiler is going to help you out and do smart things. So the instructions are a little bit out of order compared to what you might expect.

All right, so I'll go to my prepared screenshot of the code. So what did it generate this time? All right, so already, this code looks shorter than the previous one, right? A little bit less hairy than that. So what did it do? Well, SIL, once again, is the lower byte of RSI. So it does the compare, RCX with RDX, which is actually this comparison, and then it saves that into SIL. And now, later, it does the XOR. So this was the code from before. And then it does a move of RSI, which contains the same value as SIL, which is CMP, either 1 or 0. It moves that into RAX. And then, later on, it uses RAX for one calculation. And it also uses the value that's already in RSI, and it subtracts that from register 14, which is where it decided to store the value of B. So that's a much more direct way to either subtract 0 or 1: actually get just the 0 or 1, and then use it for a subtraction.

So the end result-- and something else to point out is that the move and subtract commands have a little bit of instruction-level parallelism, because right after the set instruction that writes SIL is done, you can execute both the move and the subtract at the same time, since they use the same source register and write to different destination registers.
And the other thing to notice in the comparison is the number of ALU operations. I'll go back to the original one. You see negate, AND, XOR, add. So move is not an ALU operation; it's not an arithmetic operation. But everything else is either XOR, AND, add, subtract, or compare, which all use the ALU. Now, if you look at the code over here, you have move, move, move, which don't need the ALU, and then the other four or five need the ALU. And having fewer ALU ops is a good thing. If you remember from the previous lecture on CPU architecture, the Nehalem has six execution ports, so it can execute six micro-ops in parallel per CPU clock cycle. But only three of those execution ports have ALUs on them, so that they can actually do XOR, AND, subtract, add, and so on. The rest of them can do register moves and a couple of other things, but not arithmetic operations. So that's a good thing. So that's an explanation of why the code actually sped up.

Now, just as an overview of what we were able to do, we ran perf stat, the same thing that we did for the original two mergesorts, and we made a table of the improvements that we made. So originally, quicksort ran in four seconds, and it had an IPC of 0.8 and an 11% branch miss rate. When we switched from quicksort to standard mergesort, we increased execution time by 20%, which is not good. But the instructions per clock jumped up a little bit. The branch miss rate is still pretty bad. Now, as soon as we switched to branchless, we got an 8% improvement over the previous runtime. But note that we were still worse off than quicksort at that point. But the instructions per clock jumped quite a bit from last time, which is kind of what you would expect from reducing the branch misses by a factor of 10. So the CPU isn't spending extra time stalling the pipeline, trying to backtrack on its branch mispredicts.
And now the more interesting thing to me is, if you look at the two silly little compiler optimizations we made: switching from int to long reduced the runtime by 11% over the previous branchless implementation. And the instructions per clock and the branch miss rate both didn't change at this point; at this point, we were just reducing the number of instructions. And then going from there to changing the not CMP to 1 minus CMP reduced our runtime by another 7%. So overall, branchless mergesorting is 10.8% faster than quicksorting by the time we were done optimizing, which is a lot better than the negative improvement we started with. At the branchless step, we were still behind, and at the end, we ended up ahead of quicksort. And then, overall, switching to branchless, with all the performance tweaks that we made, was 33.6% better than the branching version. So the non-branching version ran a third faster than the branching version, which is pretty cool. And that's something that we learned at the end of the day yesterday.
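Collecting the numbers quoted during the demo in one place (runtimes are from the individual runs above):

    quicksort                        4.14 s   (IPC 0.8, 11.5% branch misses)
    mergesort, branching             5.04 s   (IPC 1.1, ~10% branch misses)
    mergesort, branchless (int)      4.59 s   (IPC 1.7, branch misses down ~10x)
    mergesort, branchless (long)     4.05 s
    mergesort, branchless (1 - cmp)  3.77 s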
781 00:39:01,260 --> 00:39:03,400 So you've got to try things one at a time and see what each 782 00:39:03,400 --> 00:39:08,640 change causes the compiler to do, because you can see in the 783 00:39:08,640 --> 00:39:10,990 previous two examples, by the time we made that one 784 00:39:10,990 --> 00:39:13,740 optimization, it wasn't just that one assembly instruction 785 00:39:13,740 --> 00:39:15,700 that the compiler ended up changing. 786 00:39:15,700 --> 00:39:17,980 The compiler, in fact, realized that it could do 787 00:39:17,980 --> 00:39:21,250 additional optimizations on top of what we hinted to it. 788 00:39:21,250 --> 00:39:23,280 So it did even more optimizations. 789 00:39:23,280 --> 00:39:26,550 But the bottom line is, you should optimize iteratively. 790 00:39:26,550 --> 00:39:29,510 Try something, and see what effect it has before moving on 791 00:39:29,510 --> 00:39:31,130 and trying something else. 792 00:39:31,130 --> 00:39:35,250 And then, as much as I don't like coding assembly, and I 793 00:39:35,250 --> 00:39:38,500 don't expect that anybody here codes assembly for the class, 794 00:39:38,500 --> 00:39:41,160 looking at the assembly is always a good thing. 795 00:39:41,160 --> 00:39:44,530 Just skim through it around your performance bottlenecks. 796 00:39:44,530 --> 00:39:47,070 And try to get an idea of what the compiler is 797 00:39:47,070 --> 00:39:48,520 asking the CPU to do. 798 00:39:48,520 --> 00:39:51,960 And see if there's a way that you can nudge the compiler to 799 00:39:51,960 --> 00:39:53,240 do the right thing. 800 00:39:53,240 --> 00:39:56,110 Sometimes we tell you the compiler will optimize certain 801 00:39:56,110 --> 00:39:58,320 things, will get rid of dead code, and so on. 802 00:39:58,320 --> 00:39:59,820 And most of the time, that's actually true. 803 00:39:59,820 --> 00:40:03,080 But it doesn't hurt to double-check in the assembly 804 00:40:03,080 --> 00:40:04,930 and make sure that it's actually doing that. 805 00:40:04,930 --> 00:40:09,150 And the output from perf is pretty helpful for doing this. 806 00:40:09,150 --> 00:40:11,800 It's not a monumental disassembly task. 807 00:40:11,800 --> 00:40:15,000 perf does all of the annotation and interleaving for you. 808 00:40:15,000 --> 00:40:18,210 And finally, learn through practice. 809 00:40:18,210 --> 00:40:21,830 That's the only way to learn how to do this performance 810 00:40:21,830 --> 00:40:22,470 optimization. 811 00:40:22,470 --> 00:40:24,580 You gain a lot of experience from just going out 812 00:40:24,580 --> 00:40:25,900 and trying the tools. 813 00:40:25,900 --> 00:40:29,450 And Project 2 will have you working with these tools. 814 00:40:29,450 --> 00:40:32,500 Project 2.1 does not have a code submission. 815 00:40:32,500 --> 00:40:35,490 It's just a write-up, where you get to play around 816 00:40:35,490 --> 00:40:38,430 with the tools on code that we give you, so you can see how 817 00:40:38,430 --> 00:40:39,600 to use the tools. 818 00:40:39,600 --> 00:40:43,550 And then Project 2.2, which is going out next week, will 819 00:40:43,550 --> 00:40:46,750 actually give you a chance to try to optimize code that we 820 00:40:46,750 --> 00:40:50,760 give you, using these tools to help you out along the way. 821 00:40:50,760 --> 00:40:54,719 So with that said, any questions about-- 822 00:40:54,719 --> 00:40:56,571 AUDIENCE: [INAUDIBLE] a write-up. 823 00:40:56,571 --> 00:40:59,890 How do you submit write-ups for projects? 824 00:40:59,890 --> 00:41:01,050 GUEST SPEAKER 2: On Stellar.
825 00:41:01,050 --> 00:41:05,000 So the handouts for the write-ups are Stellar homework 826 00:41:05,000 --> 00:41:05,480 assignments. 827 00:41:05,480 --> 00:41:07,568 Just upload a PDF to Stellar. 828 00:41:07,568 --> 00:41:10,320 AUDIENCE: OK. 829 00:41:10,320 --> 00:41:11,752 GUEST SPEAKER 2: Yes? 830 00:41:11,752 --> 00:41:13,002 AUDIENCE: [INAUDIBLE PHRASE] 831 00:41:19,850 --> 00:41:21,690 GUEST SPEAKER 1: Yeah, you just have to build the binary 832 00:41:21,690 --> 00:41:22,610 with debug info. 833 00:41:22,610 --> 00:41:24,340 GUEST SPEAKER 2: Yeah, -g is enough. 834 00:41:24,340 --> 00:41:29,100 As long as GDB's breakpoints can see your code, 835 00:41:29,100 --> 00:41:30,410 then perf can do it. 836 00:41:34,420 --> 00:41:37,230 And it's actually pretty nice in that, even if your code is 837 00:41:37,230 --> 00:41:40,840 inlined, it can identify where the inlined portions came from 838 00:41:40,840 --> 00:41:44,390 and pull in the proper lines of code for that. 839 00:41:44,390 --> 00:41:45,850 SAMAN AMARASINGHE: I think what's really interesting and 840 00:41:45,850 --> 00:41:49,110 important here is not that they found this interesting 841 00:41:49,110 --> 00:41:50,940 thing, but the process they went through. 842 00:41:50,940 --> 00:41:54,110 A lot of times, when you are coding, you are in control. 843 00:41:54,110 --> 00:41:56,920 You have a very planned structure for how you're going 844 00:41:56,920 --> 00:41:57,720 about doing it. 845 00:41:57,720 --> 00:42:00,800 And you follow it through and produce the code you want. 846 00:42:00,800 --> 00:42:02,440 And sometimes, some pesky bugs show up. 847 00:42:02,440 --> 00:42:04,120 You work through that. 848 00:42:04,120 --> 00:42:08,470 Here, basically, you let the system and the 849 00:42:08,470 --> 00:42:10,700 data drive what you are doing. 850 00:42:10,700 --> 00:42:14,670 And when you start, there's no planned path saying, I'm 851 00:42:14,670 --> 00:42:17,930 going to do A, B, C, D. Basically, you look at the 852 00:42:17,930 --> 00:42:20,180 results, and you figure out what might change. 853 00:42:20,180 --> 00:42:21,640 And you can go through that. 854 00:42:21,640 --> 00:42:25,390 And I was actually talking with Charles, saying, in my 855 00:42:25,390 --> 00:42:28,130 first lecture, I talked about Matrix Multiply. 856 00:42:28,130 --> 00:42:29,050 It looks really cool. 857 00:42:29,050 --> 00:42:32,080 Every time I did something, it improved by a factor of 2, 858 00:42:32,080 --> 00:42:33,930 50%, stuff like that. 859 00:42:33,930 --> 00:42:37,670 What you didn't see is all the things I did that didn't have 860 00:42:37,670 --> 00:42:40,130 any impact on the program. 861 00:42:40,130 --> 00:42:43,240 You saw this nice tree going down there, but not all the 862 00:42:43,240 --> 00:42:46,890 branches that ended up in dead ends, where nothing happened. 863 00:42:46,890 --> 00:42:49,700 So if you feel like, when you go in there and do a bunch of 864 00:42:49,700 --> 00:42:51,830 things, nothing happens, that's normal. 865 00:42:51,830 --> 00:42:53,890 Of course, when you tell somebody, you're not going to 866 00:42:53,890 --> 00:42:55,540 say, oh, that [UNINTELLIGIBLE]. 867 00:42:55,540 --> 00:42:56,370 Say, I did this 868 00:42:56,370 --> 00:42:57,620 [UNINTELLIGIBLE], and I did that. 869 00:42:57,620 --> 00:43:00,240 And at the end, it looks like these other people are very 870 00:43:00,240 --> 00:43:01,570 smart, and they're getting through all these 871 00:43:01,570 --> 00:43:02,450 optimizations.
872 00:43:02,450 --> 00:43:04,590 No, they spent a lot of time going through 873 00:43:04,590 --> 00:43:05,930 these dead-end paths. 874 00:43:05,930 --> 00:43:10,560 And the interesting thing here is: don't 875 00:43:10,560 --> 00:43:12,560 trust anything. 876 00:43:12,560 --> 00:43:14,350 [UNINTELLIGIBLE PHRASE] 877 00:43:14,350 --> 00:43:16,150 Intel, they have hundreds of engineers 878 00:43:16,150 --> 00:43:17,350 working on it for years. 879 00:43:17,350 --> 00:43:20,425 And, OK, that should be good. 880 00:43:20,425 --> 00:43:22,210 Not really. 881 00:43:22,210 --> 00:43:25,030 So a lot of times, it's an end-to-end thing. 882 00:43:25,030 --> 00:43:26,670 With performance, it's end to end. 883 00:43:26,670 --> 00:43:29,030 So the problem can be at any level. 884 00:43:29,030 --> 00:43:31,210 So it could be the compiler, it could be [UNINTELLIGIBLE] 885 00:43:31,210 --> 00:43:33,620 network, it can be the network, it can be the operating system, it 886 00:43:33,620 --> 00:43:35,795 can be in the memory system, it can be in the processor, it 887 00:43:35,795 --> 00:43:36,870 can be anything. 888 00:43:36,870 --> 00:43:38,740 So if you say, oh, yeah, that shouldn't be the case, 889 00:43:38,740 --> 00:43:40,320 probably that's where it is. 890 00:43:40,320 --> 00:43:42,350 So you question things all the time, basically. 891 00:43:42,350 --> 00:43:45,070 The tricky thing about these kinds of tools is that 892 00:43:45,070 --> 00:43:48,080 sometimes even the tools might be wrong. 893 00:43:48,080 --> 00:43:49,915 So sometimes I have been in situations where you have 894 00:43:49,915 --> 00:43:51,900 to look at the wall clock: OK, wait a minute, does the tool 895 00:43:51,900 --> 00:43:53,840 say what I actually see happening? 896 00:43:53,840 --> 00:43:57,450 Because you get into situations where you have to be very skeptical. 897 00:43:57,450 --> 00:44:00,960 Because a lot of times, in these kinds of situations, the 898 00:44:00,960 --> 00:44:03,140 problem can be hidden in many different things. 899 00:44:03,140 --> 00:44:06,500 So that's a very good way of going through this. 900 00:44:06,500 --> 00:44:10,550 Of course, for what gets handed to you, we have set it up in a 901 00:44:10,550 --> 00:44:12,660 way that there cannot be too many surprises. 902 00:44:12,660 --> 00:44:15,530 But when you're going through a system and trying to optimize 903 00:44:15,530 --> 00:44:18,590 and look at performance, this is a really good way of doing 904 00:44:18,590 --> 00:44:21,190 it, because I don't think these guys thought, when they 905 00:44:21,190 --> 00:44:22,800 started, they'd be looking at 906 00:44:22,800 --> 00:44:25,750 compiler bugs or compiler problems. 907 00:44:25,750 --> 00:44:28,526 They were thinking, OK, I'm going to do a little bit 908 00:44:28,526 --> 00:44:29,770 of code rewriting and get there. 909 00:44:29,770 --> 00:44:32,080 But at the end of the day, what they discovered was very 910 00:44:32,080 --> 00:44:32,930 surprising to them. 911 00:44:32,930 --> 00:44:34,470 And it even surprised me today. 912 00:44:34,470 --> 00:44:36,500 GUEST SPEAKER 2: Our plan basically ended at row three 913 00:44:36,500 --> 00:44:38,110 of that table when we first set out. 914 00:44:38,110 --> 00:44:40,690 We had no idea what was to follow after that.
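As a concrete version of that wall-clock sanity check, something like the following sketch can be wrapped around whatever region the profiler claims is hot; if the tool attributes most of the runtime there, direct timing should roughly agree. The loop is only a stand-in workload, not course code:

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);

        /* Stand-in for the region the profiler says is hot. */
        volatile long sum = 0;
        for (long i = 0; i < 100000000L; i++)
            sum += i;

        clock_gettime(CLOCK_MONOTONIC, &t1);
        double secs = (t1.tv_sec - t0.tv_sec)
                    + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("region took %.3f s (sum = %ld)\n", secs, sum);
        return 0;
    }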
915 00:44:40,690 --> 00:44:44,600 So one thing that I remember is that some students came to 916 00:44:44,600 --> 00:44:47,150 us and asked questions like, yeah, I tried to implement 917 00:44:47,150 --> 00:44:48,710 this bit hack here, and it actually 918 00:44:48,710 --> 00:44:50,580 slowed the program down. 919 00:44:50,580 --> 00:44:54,590 Or, I changed my pentomino board 920 00:44:54,590 --> 00:44:56,440 representation to this, 921 00:44:56,440 --> 00:44:58,690 and it didn't seem to speed anything up either. 922 00:44:58,690 --> 00:45:01,540 Well, in those cases, yeah, you need to do more detective 923 00:45:01,540 --> 00:45:03,930 work to figure out what happened. 924 00:45:03,930 --> 00:45:05,990 SAMAN AMARASINGHE: So a lot of times, the performance is 925 00:45:05,990 --> 00:45:09,790 something like a max of a bunch of different things. 926 00:45:09,790 --> 00:45:11,930 So you have done something, and that's great. 927 00:45:11,930 --> 00:45:13,880 It should give you a huge performance improvement. 928 00:45:13,880 --> 00:45:15,380 But something else is holding it back. 929 00:45:15,380 --> 00:45:17,590 And a lot of times, you have to do three or four things. 930 00:45:17,590 --> 00:45:20,130 And once you do three or four things, suddenly everything 931 00:45:20,130 --> 00:45:23,430 starts kicking in. 932 00:45:23,430 --> 00:45:24,970 A good example is loop unrolling. 933 00:45:24,970 --> 00:45:26,830 You unroll a loop and say, ha, I should get all this 934 00:45:26,830 --> 00:45:26,990 parallelism. 935 00:45:26,990 --> 00:45:29,290 No, because you have some dependence. 936 00:45:29,290 --> 00:45:30,990 So you have to fix this dependence, then that 937 00:45:30,990 --> 00:45:31,320 dependence. 938 00:45:31,320 --> 00:45:33,470 Suddenly, you've removed them all. 939 00:45:33,470 --> 00:45:34,930 Then you get all the parallelism and it 940 00:45:34,930 --> 00:45:36,050 starts running fast. 941 00:45:36,050 --> 00:45:38,400 So with a lot of these things, it's basically a max. 942 00:45:38,400 --> 00:45:40,200 You have to get all the things right. 943 00:45:40,200 --> 00:45:42,970 And so if you stop halfway through, that means you might 944 00:45:42,970 --> 00:45:45,540 be almost there and not know it. 945 00:45:45,540 --> 00:45:49,760 So that's why it can be a really frustrating process. 946 00:45:49,760 --> 00:45:51,630 But it's kind of fun, too, when you get there.
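Here is a small C sketch of that loop-unrolling point, with assumed names rather than code from the lecture: unrolling alone leaves one serial dependence chain through the single accumulator, and only once that dependence is also broken does the parallelism kick in.

    #include <stddef.h>

    /* Unrolled, but every add still feeds the same accumulator, so
       the adds form one serial dependence chain. (Remainder handling
       omitted for brevity.) */
    double sum_unrolled(const double *x, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i + 4 <= n; i += 4)
            s = s + x[i] + x[i + 1] + x[i + 2] + x[i + 3];
        return s;
    }

    /* Unrolled with the dependence broken: four independent
       accumulators can be in flight at once. */
    double sum_unrolled_split(const double *x, size_t n)
    {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i;
        for (i = 0; i + 4 <= n; i += 4) {
            s0 += x[i];
            s1 += x[i + 1];
            s2 += x[i + 2];
            s3 += x[i + 3];
        }
        double s = s0 + s1 + s2 + s3;
        for (; i < n; i++)   /* remainder */
            s += x[i];
        return s;
    }

Note that for doubles, a compiler generally won't make this transformation on its own under strict floating-point semantics, since it changes the order of the additions.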