The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: And so today we have Ravi, who is, I guess, a performance engineer at VMware, where he basically looks at how to get really good performance in many, many different things in VMware. So Ravi is an MIT undergrad. I guess a long time ago.

RAVI SOUNDARARAJAN: Yeah, '92.

PROFESSOR: '92. There you go. And then he got his PhD from Stanford and ended up at VMware looking at a lot of interesting performance stuff. So today, he's going to give us a kind of a bird's eye view of how these kinds of technologies -- what you guys are learning -- are useful and how they're used in industry. OK, Ravi. Take over.

RAVI SOUNDARARAJAN: First question, can everybody hear me? And is the mic on? OK. Good. I have a bit of a sore throat, so I'm sorry if you can't hear me. The other thing I want to say is I'm actually really glad you're here. Because I remember very distinctly in 1990, when I was an undergrad, I was taking 6.111. And I had a bug in my final project. It just wasn't working. And I was going to skip the lecture that day. But I was like, whatever, I'll go. So I went to the lecture. Turns out Don Troxel pointed out something in that lecture, and I was like, that's my problem. And I went back, fixed it. Everything worked. And I was very happy. I don't think that's going to happen here. But it would be nice if it did. So that's cool.

So instead of just talking about what a performance engineer does, I would-- I'll just-- Power is coming in. But-- Oh, there it is. Perfect.
So instead of just talking about what performance engineering is about, I just wanted to go through 10 bugs that I've actually been involved in. And hopefully, they'll be interesting to you, because each one of them was interesting to me.

So let's start out with-- I just wanted to start this out with a bang. So for dramatic effect, everybody read through this slide. It's basically an email thread from one of my colleagues. And raise your hand when you're done, so I'll know when you're done.

OK. Anyway, the point of this is that his question ended up being, why is the percent lock time in the database increasing despite a lighter load for the thing that uses that database? Now, if you're like me, you read this email, and your first question is, what the hell is he talking about? So that's why I titled this slide like that. So let's actually deconstruct, a, what that email meant, and, b, what it meant to me. In order to do that, I want to first talk about the system that he was talking about and help describe what the actual problem he was seeing was.

So here we have a standard multi-tier setup. What we've got is a server, which is our middleware server. And it talks to a database. And then we have a bunch of hosts on the other end. Those are labeled by agent, HAL, whatever. Anyway, those are hosts running virtual machines. And here's the setup. You've got a client. And that client issues a command to this middleware server. And that middleware server tries to figure out what host or what virtual machine needs to be contacted. So the middleware goes ahead and does that. And it performs an operation on that host. And after performing the operation, it then archives some of the information that it gained to the database. And once it has archived the information to the database, it goes ahead and notifies the client that the operation itself is done. Fairly simple. Send a request to a server.
Server contacts some other host. Host comes back to the server and says it's done. Server persists information to the database and then notifies the client everything is done.

So the actual problem that my colleague was seeing was, he was sending less and less load to this middleware server, which therefore means that less and less goes to the hosts and less and less goes to the database. However, the percent of time spent in the database was higher. So despite a lighter load and less stuff going to the database, the percent of time spent in the database was higher. Probably all of you are smart enough to know what the actual problem is. But we'll go through it step by step.

So the deal is this. The first thing I did is I examined the amount of time that each lock was held for each of these different operations. And what I did was I varied the load, just as my colleague did. And what you see are six different lines. On the x-axis is each of the different locks that we acquire in our system. And on the y-axis is the latency of those locks in milliseconds. The legend for this chart says Locked-4, which means a certain type of load, all the way down to Locked-128. OK. And the deal is that when you have Locked-128, what that indicates is that there are 128 hosts per thread. Which, take my word for it, means that you've got a lighter load. So basically, the lowest line in the legend is the lightest load. And the highest line in the legend is the highest load. And as you can see, although it's a little bit obscured in this chart, the bottom line is that this is about time spent holding a lock. In this chart, Locked-4 indicates a heavy load. Locked-128 indicates a lighter load. And what this chart indicates is that, in fact, you are spending more time in the locks with a heavier load, Locked-4, than you are with a lighter load, Locked-128.
So the lock latency is higher with a higher load, which is totally reasonable. And it turns out that if we translate my colleague's question into reality, what he's asking is, why is the percent time spent in the database higher, even though you have a lighter load? Well, the answer ends up being that when you have a lighter load, overall, you're spending less time in locks, which means that any time you spend in the database, if that's a fixed cost, is going to be a higher percent of the overall time. So it turns out, there really wasn't an issue. There was nothing odd here. It's exactly what you'd expect, because the database imposes essentially a fixed amount of latency per operation. And in order to confirm that, I also looked at the time spent contending for locks -- you really want a lock, but someone else gets it instead, and you have to wait. And the trend is exactly the same. When you have a lighter load, you end up having a much lower overall latency. But that just means that any latency you spend in this database, its percent impact on the entire operation is quite a bit higher.

So to take a look at this and to do a postmortem on this case study: the first thing was some random email, which kind of was nonsense. But it was less nonsensical once you actually tried to understand what the setup was. So step one is, figure out what the experimental setup is and what parameters are being studied. The next thing to understand is, what's being measured? I remember what my colleague said. He said the percent of time in the database was pretty high. Why is that? Well, OK. So we have to understand what this multi-tier setup is, why you would be going to the database, and what it means if the time spent in that database is high. And I guess the next thing you want to do is you want to say, well, OK, he's complaining about lock latency. So I must have some sort of visibility there somewhere. Let me go and look at the appropriate metric to figure out what's going on and why time is being spent there. And finally, it's appropriate to -- once you've got all this data, don't make any assumptions -- just draw the appropriate conclusion based on the data.

The problem is basically that with a lighter load, the impact of that database becomes larger, even if it were the same under a heavier load. Because under a heavier load, other aspects of the system also get stressed that don't get stressed under a lighter load. So, and this is basically what I'm saying here, yeah, the percent lock time in the database overall is higher. But the overall lock time is smaller. So everything is behaving exactly as you would expect.
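To put made-up numbers on it (these are not the actual measurements, just an illustration of the ratio effect): if the database adds a fixed 10 milliseconds to every operation, then under a heavy load where the whole operation takes 100 milliseconds, the database is 10% of the time; under a light load where the whole operation takes only 40 milliseconds, that same 10 milliseconds is 25% of the time. The absolute database cost never changed; only the denominator did.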
So this was one simple study, and hopefully it didn't require any virtualization-specific knowledge. What I'm going to do now is go over a bunch of other different case studies that I've looked at and crystallize what lessons I learned from doing those studies.

So I guess the next case study I want to talk about is something that I call garbage in, garbage out. I think that's something I learned on the first day of 001. So here's the deal. You've got a customer. And he wants to draw this chart. And this chart is basically, on the x-axis is time, and on the y-axis is some metric. It happens to be CPU usage in megahertz. But it doesn't really matter. OK. So he wants to get this chart. And it turns out -- if you remember the diagram I showed before, we have a server and a bunch of hosts at the end. So the client simply has to send a request to that server. And the server will generate the data and send it back to the client. And all of you, I'm sure, are familiar with Perl, Python, and all these different scripting languages. There's another one that's very popular in the Windows community called PowerShell. And so it turns out that our server has a PowerShell interface.
You can write a PowerShell script to get any statistic that you want. And then you can write that to some database or some spreadsheet and get this chart. So here's the deal. We have a user. And what he does, as I've indicated at the bottom, is that he wants to get the CPU usage for the last hour. So he just writes a simple script using PowerShell. It doesn't really matter what the syntax is. But the point is, he does what's called a get-stat call. And he indicates, OK, I want this virtual machine, I want this stat, whatever. And then he gets his information and writes it to a file. No big deal. It's just a simple script to get some statistics.

Now, here's what is actually interesting. I wrote that script using PowerShell. Now, let's compare a PowerShell implementation to a highly tuned Java implementation of the very same script. What I've done here is I've shown the performance of that very same script when getting data for one virtual machine, which is the first row, and then up to a little over 300 virtual machines. Look at the scaling of this script. In the middle column, I have the performance for this PowerCLI script that I just showed you on the previous slide. As you can see, it goes from a healthy 9.2 seconds to over 43 minutes if you increase the size of your inventory. In contrast, if you use a Java-based implementation, it starts a little slower than PowerCLI at 14 seconds. But the scaling is far superior. It ends up at 50 seconds.

So the question is, why? What's different about these scripts at scale versus when I wrote the little toy script? Now, to give you some color, why this matters to someone like me is that every once in a while, I'll get a call from a system admin who wants a very simple script that does blah. And they say, I wrote this simple script in my development environment, and it worked fine.
But then I run it in my production environment, and it completely breaks. What's going on? So often, problems like this actually occur. Now, I'm calling this garbage in, garbage out, because you can see that with PowerCLI and Java, we're getting very different results. The deal is that in PowerCLI, I wrote a very naive script -- and I'll explain how in a moment -- and that very naive script does not perform at all. Whereas in Java, I wrote a very highly tuned script, and it ended up taking about an order of magnitude less time. So garbage in, garbage out. Good stuff in, good stuff out. Let's try to understand why.

OK. Here's the deal. All of you are probably familiar with the difference between scripting languages and high level languages, or low level languages, or whatever you want to call them. The idea is that if you're running something using a shell script, or running Perl, or whatever, these things are usually meant for ease of use. But they're not necessarily meant for performance. And so they hide a lot of details from the user so that it's easier to get to a quick prototype. You don't have to declare variables, do all this kind of junk, right, if you're using a scripting language. In contrast, if you're using Java, C++, whatever, you have type safety and all this nonsense. You have to worry about what the declarations of your variables are, make sure they're the correct types, et cetera. So they tend to be more difficult to get running. But once you get them running, you can use a lot of advanced tools that just aren't necessarily available in toolkits. And we're going to talk about how this difference manifested itself in this particular problem.

I said that when I told the customer to write a simple PowerCLI script, they just have to do one call: get-stat. And here's a bunch of arguments to tell it what statistic it wants to grab. It's a toolkit. It's doing a lot of work for you.
What's it doing for you? Well, it turns out it's calling our internal APIs. Our internal APIs have to work for a variety of different platforms, so they have to be incredibly general. Under the covers of that single get-stat call, there are about seven different steps going on, ranging from asking, OK, I want a stat for a virtual machine -- what kinds of stats does that virtual machine export? How frequently does it update? All kinds of information that you probably don't really care about if all you want is a single stat, but that you might care about if you're writing a production quality script. So the point is, get-stat does these seven calls. And because you're using a toolkit, it hides them all from you. It spares you those details. It's very easy to use. But it turns out, this is obviously not the most optimal way to do this. And I illustrate that by showing you conceptually what's actually going on.

On the right -- I guess on your left -- I've got what the PowerCLI script was doing. You call this get-stat call, or whatever, and what it does is that for every virtual machine that you want this statistic for, it goes through and does all those steps that I described in the previous slide: figuring out what stats are supported, how frequently they're updated, and then actually going and getting the stat. It turns out, as I show on the right, when I do this in Java, I can take advantage of the fact that a lot of those things that I'm asking about in the toolkit never change. You've got a virtual machine. It supports things being refreshed every 20 seconds or whatever. That never changes. You want to find out what statistics are supported. Well, those never change once you've configured it and you don't change anything. So why do that every single time? So, as we show in Java, you can pull a lot of that stuff out of the main body of the loop.
And if you want information for every VM in the system, you only have to make one call, as I show at the very bottom, for every VM. You don't have to make seven different RPC calls. In fact, you could further optimize, if you really understand the API, and collapse a few more calls in Java that you can't do in the toolkit. Because again, the toolkit's trying to talk at a high level, so any random dude can actually write a script.

OK. So the point is, in this PowerCLI world, a very simple toolkit world, we were making five RPC calls for every VM that we wanted stats for. Whereas in Java, we made one call. And we can even optimize that further, and I don't show that here. But anyway, the bottom line is that's why the performance of the Java script was so much faster. So the idea is that with PowerCLI, I wrote a very simple script that anybody could understand and use, but it's not optimized. I didn't utilize multi-threading. I didn't realize that the output format of the data was really, really verbose. Again, that's the kind of thing that you can take advantage of if you really understand the APIs. I also didn't realize -- as a customer, I wouldn't have realized -- that I make this one call, but it expands into a bunch of different network requests that I didn't know about. In contrast, with Java, I was able to take advantage of the power of a language like Java, where I could use thread pools, I could optimize the return format of the data, and I could reduce the number of RPC calls. I guess the analogy I would give is, if you think about assembly code versus compiler-generated code, nobody here wants to sit down and write x86 assembly. So, of course, you write C and you use GCC and whatever. But maybe there are some things that would actually be faster if you wrote hand assembly.
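To make the difference concrete, here is a minimal sketch in Java. The StatsClient interface and its method names are hypothetical stand-ins, not the actual vSphere API; the point is only the shape of the two approaches. The naive version repeats the metadata queries inside the per-VM loop, while the tuned version hoists them out and batches the actual data retrieval into one call.

```java
import java.util.List;
import java.util.Map;

// Hypothetical client interface standing in for the real stats API.
interface StatsClient {
    List<String> listVms();
    List<String> supportedStats(String vm);            // metadata: rarely changes
    int refreshIntervalSeconds(String vm);             // metadata: rarely changes
    double[] queryStat(String vm, String stat, int interval);                           // one RPC per VM
    Map<String, double[]> queryStatBatch(List<String> vms, String stat, int interval);  // one RPC total
}

class StatsExample {
    // Naive approach (roughly what the toolkit does under the covers):
    // per VM, re-fetch metadata that never changes, then fetch the stat.
    static void naive(StatsClient client, String stat) {
        for (String vm : client.listVms()) {
            List<String> supported = client.supportedStats(vm);   // RPC, same answer every time
            int interval = client.refreshIntervalSeconds(vm);     // RPC, same answer every time
            if (supported.contains(stat)) {
                double[] values = client.queryStat(vm, stat, interval); // RPC per VM
                // ... write values to a file ...
            }
        }
    }

    // Tuned approach: ask the invariant questions once, then issue a single
    // batched query for every VM instead of one round trip per VM.
    static void tuned(StatsClient client, String stat) {
        List<String> vms = client.listVms();
        int interval = client.refreshIntervalSeconds(vms.get(0)); // hoisted out of the loop
        Map<String, double[]> all = client.queryStatBatch(vms, stat, interval); // one RPC
        // ... write all values to a file ...
    }
}
```

The thread pool and output-format optimizations from the slide aren't shown here; the reduction in round trips alone is what changes the scaling.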
OK. So, the first example we gave was just understanding what on earth the problem was. The second example was, if I write bad code, I'm usually going to get bad output. In the third case, I'm going to talk a little bit about the design of an API. And I'm going to motivate that by an example.

We have an administrator on the right hand side. And he's talking to the management server. This management server is responsible for telling that administrator what's going on in every host in the system, every virtual machine, whatever. So let's pretend that the user wants -- so this is a virtualized environment -- so the idea is that the user wants to actually access his or her virtual machine. They're running a Linux virtual machine. They want to see its console, so they can type at the console. Here's how that roughly works. They want to see the console of the VM that I've circled. So the way it works is that this user talks to the management server. The management server locates that virtual machine and establishes a handshaking connection. And that connection is directly given back to that user. So now the user can interact directly with that virtual machine and send commands and do whatever. So we've all got the setup. Very simple.

Now here's the deal. The problem was that our poor little user, poor big user, whatever, could not start this console when they were managing a large number of hosts. To take you back to this diagram for just a moment, on the right hand side of the screen, I show a number of hosts, and I show a number of virtual machines. So what I'm basically saying is that this administrator could not access the console of his virtual machine as the number of hosts increased. There should be no link between these two. But there was.

So I got involved in this problem. And first I went to the client folks, and I explained the problem. And their response was, that's obviously a server problem. Next step, I go to the server folks. And they said, yeah, it's obviously a client problem. So then I went back and thought a little bit more about the architecture.
And what I knew is that whenever the client needs to initiate this handshake connection, it has to spawn a certain process. That process is called the VMRC, the VMware remote console. So I talked to the client folks again and said, well, I think this is happening. Oh, yeah, it's a VMRC problem. OK. So that just passes the buck another way. So then I talked to the VMRC folks. And I said, OK, I think I have this problem with starting up. And they said, oh, it's probably authentication. It's probably this. You have to enable verbose logging on your end host, blah, blah, blah. And at some point, I got so sick of this. I thought, well, this is complete garbage. I'm actually going to take a look at this a little more carefully.

And so, once I got involved, I got a little bit more information. And the information was actually pretty useful. Here's the deal. It turns out that when you ask the person that was having this problem a few more questions, you get a few more answers, some of which are actually useful. So I said, well, if you do this with 50 hosts, do you see a problem? So, 50 managed hosts, and he only wants to view one VM's console. No problem. Then I asked him -- and then I actually sat at my desk, and we reproduced this. So then I said, let's try it with 500 hosts. No problem. Just a little bit slower, but no problem. So again, the setup is 500 different hosts, and he just wants to look at one VM. And it's taking longer and longer to view that one VM. Finally, he gets to 1,001 hosts. And that's when the problem occurs. That's when he cannot see the console.

So obviously then, you go back and you think, well, gee, is there some magic number of 1,000 hard coded anywhere, such that we exceeded it, and that was the problem? And frankly, the answer is no. So let's figure out what actually did happen.
So I made a bunch of different observations while I was debugging this. When you have fewer than that magic number of 1,001 hosts, here's what was going on. Remember the sequence of events. I'm a client. I talk to a server. The server talks to the virtual machine. And then that virtual machine and I are directly connected. What I was noticing was that when there were fewer than 1,000 overall hosts in this setup, when I would initiate the command for this console, the CPU usage and memory usage of this middle management server would gradually increase, and increase, and increase until the console would actually show up. So you right click. You want to pull up a console. A little window comes up. But it's totally blank. And then eventually, after the management server consumes a lot of CPU and memory, eventually that console comes up. So now your question is, what's going on that's causing the CPU usage of this management server? And why is that scaling with the number of your virtual machines?

So the first thing I did -- so one of the themes of this talk, I suppose, is that there are a lot of great tools for performance debugging, but sometimes there's no real substitute for common sense and for the things that are easiest to do. So the first thing I did is I looked at the log file of this server. Here's the rationale. If every time I invoke this remote console, I'm doing more work on the server, first, let me just see what that server is doing. Now, it turns out that what I noticed about the server was that whenever the client would contact the server and say, I want the console of this VM, you would see a data retrieval call from the client to the server. And that data retrieval call got more and more expensive with more hosts and more virtual machines in the system. So the question that I asked was, why on earth are we doing some stupid data retrieval call if all I wanted is to say, here's a VM.
Tell me where it's located so I can talk to it. So it turns out, that was problem number one. Problem number two: that's what happened when you had 50 hosts, 500, whatever -- you basically spend more and more time in this data retrieval routine on the server. However, once you hit that magic number of 1,000, you would see this call start, but you would never see it end. Instead, you'd see some exceptions somewhere. So that to me was a clue that I'd better take a look and see what's going on at that exception.

So here's what I did. Unfortunately, the exception gave me no information whatsoever. So I went ahead and I attached a debugger to the server. And when I did, I found out that it was an out of memory exception. So here's the deal. I'm a poor admin. I send a request to look at a console. For some reason, some data handshaking code takes place between the server and the client. It takes longer and longer with more hosts. Eventually, when I have too many hosts, that thing runs out of memory and silently fails. And so my client's sitting around looking dumb with a totally blank screen, because it has no idea that its request basically got denied. So the question is, why is that happening?

And aha, this is where the aha moment occurred. I happen to know that in our setup, one of the causes for this data retrieval is that as a client, I want to know certain attributes of the virtual machine that I wanted to view. For example, if I'm looking at the console for a virtual machine, I might care that that virtual machine has a CD-ROM connected. And I might care if that CD-ROM got disconnected or whatever. That's not something you'd necessarily associate with a virtual machine.
But when I go to the server to connect to it, the server wants to give me that information so that the shell which is showing me this console can appropriately update if the virtual machine gets disconnected from its CD-ROM, whatever. Anyway, the point is, I looked at the code that this magical VMRC routine was running. And here's what it was doing. It was saying, look, I want to start a remote console session. I need certain information, like the CD-ROM connectivity. I'm going to go ahead and get that information for every host and every virtual machine in the system. And then, once that's done, I'll go ahead and display the console for this one virtual machine that I care about. Well, this is clearly just a stupid idea. So I called the guy up and I said, you realize this is a stupid idea. And he said, well, why? It's not that expensive. Is it? And I was like, OK, we need to have a long discussion. And bring a priest in here. But that's a terrible idea.

So what we ended up doing is-- and so now-- OK, so that's the problem. The problem is we do a huge data retrieval, none of which we need. Now, why this 1,001 host thing? Here's the deal. It turns out that in our code, for old style clients -- of which this was an old style client -- we had a restriction. We said, if you need to serialize more than 200 megabytes worth of data, we're not going to handle that. We're just going to fail. It turns out that once we got to 1,001 hosts, we slightly exceeded that 200 megabyte buffer size, we silently erred with an out of memory exception, and we never returned to the client. And everybody was unhappy. So it had nothing to do with 1,000 hosts. It has to do with the amount of data you're actually sending.
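The shape of the bug, and of the eventual fix, looks roughly like this in Java-style pseudocode. The InventoryClient type and its methods are made up for illustration, not the actual VMRC or management server code; the contrast is the point: scope the retrieval to the one VM whose console you want, instead of pulling attributes for the whole inventory.

```java
import java.util.Map;

// Hypothetical inventory interface standing in for the real management API.
interface InventoryClient {
    Map<String, Object> vmAttributes(String vm);         // e.g. CD-ROM connectivity for one VM
    Map<String, Map<String, Object>> allVmAttributes();  // bulk retrieval for the whole inventory
}

class ConsoleStartup {
    // What the remote-console path was effectively doing: fetch attributes for
    // every VM in the inventory before showing a single console. The payload
    // grows with inventory size, and past the server's serialization limit the
    // call dies with an out-of-memory error that never reaches the client.
    static Map<String, Object> beforeFix(InventoryClient inv, String targetVm) {
        Map<String, Map<String, Object>> everything = inv.allVmAttributes();
        return everything.get(targetVm);
    }

    // The fix: ask only about the VM whose console is being opened, so the cost
    // no longer depends on how many hosts and VMs the server is managing.
    static Map<String, Object> afterFix(InventoryClient inv, String targetVm) {
        return inv.vmAttributes(targetVm);
    }
}
```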
So luckily, there were several different fixes. I told the VMRC folks -- the guy that wrote this thing -- I said, please, under no circumstances should you get information for every single host if you only want it for this one guy. That's a terrible idea. And then I went back to the server folks, and I said, first of all, you have to fail correctly. And second, you have to emit sensible error messages, because this should not take the kind of expertise that it took to figure out. And thankfully, this actually made a pretty huge difference. So that was kind of cool.

So the lessons that I took away from this: if you're going to create an API, create an API that's very difficult to abuse and very easy to use. Not the opposite. These poor VMRC folks -- it's like a three line piece of code to get information for every VM. It's like eight or nine lines to get information for just that VM. But those five lines make a boatload of difference. The other thing is that this was an entirely internal customer. In other words, it's just another group within VMware. Imagine you're some external guy that's using the same API. You're screwed. So we have to be better at educating the people that are using our API. Don't just throw it over the fence and say, here's an API, deal with it. There needs to be some interaction between them to make sure that this is the right API for the job.

So this is the next case study I want to talk about. This one was actually found by a colleague of mine, who I'm going to abbreviate as RM. Very interesting example. For those of you that took the class last year, I'm recycling it. But it's a good example, so no worries. So here's the deal. You've got a benchmark. And I've got two builds of a piece of software. When I run the benchmark against that piece of software, in case A, I was getting a throughput of 100 operations per minute.
In case B, I was getting half the throughput. And what was the difference between those builds? Well, the first build, the faster build, was a 32-bit executable running on 64-bit hardware. While in the second case, it was a 64-bit executable running on 64-bit hardware. Now, if you're like me -- which hopefully you're not -- but if you're like me, you think to yourself, well, there shouldn't be much difference between 32-bit and 64-bit. In fact, 64-bit should be a lot faster for a lot of reasons you're going to learn about in 004. So the question is, what's going on? I can distinctly remember when this check-in was done, somebody said, oh, we're making the switch over, but it shouldn't be a big deal. And foolishly, like a moron, I actually believed that person. Very bad idea. So this came back to haunt us for three months.

So what's the deal? Remember, a 32-bit executable running on 64-bit hardware was faster than a 64-bit executable running on 64-bit hardware. So the first thing we do -- first of all, you've got to find the right tool for the right job. In this case, we use a profiling tool called Xperf. Xperf is a very powerful profiler, kind of like VTune, or Valgrind, or Quantify, or a lot of the other tools you've used. The nice thing is it's built into the OS. It runs on Windows 2008. And it's a sampling based profiler that has a lot of pretty cool attributes. If you have Windows 2008, I think you ought to just go and download it. It's free. And play with it. It gives a lot of information like stack traces, caller/callee information, et cetera. It's potentially the perfect tool for this kind of job.

What kind of output does it show? In this case, remember the setup. The setup is, I run build A, and I'm getting twice as much throughput as running build B. So let's take a look at the CPU usage, just as a first cut to see what's going on. In the 64-bit case, which is what I show here,
I show the output of Xperf for the 64-bit build. What you can see: the x-axis is just time, and the y-axis is CPU usage. It's totally saturated. In the 64-bit case, we're completely saturated. Take my word for it, in the 32-bit case, you're not. And this accounts for the difference. So now let's dig deeper and find out why we're seeing this difference.

So first, we look at the sampling profiler. Now, there are a lot of perils of using sampling profilers. In this particular case, here's the deal. I've got a process. And that process spawns a bunch of threads. In this particular view, what you're seeing is where the time is being spent in this process. And as you can see, most of the time is spent at the root, which kind of makes sense, because everything in some sense is a child of the root. But the first thing our poor process does is spawn a bunch of child threads. So clearly, all of the time is essentially going to be spent in the routine that calls all of these threads, because that's the first entry point. But unfortunately, that's remarkably unhelpful. So the deal is, we have to dig deeper and find out where this time is actually being spent.

So, just to reiterate the point that I made on the previous slide: your entry point is spawning all the threads. And if that's kind of the starting point for accounting, you're not going to get any help if you're just looking at that starting point. You have to dig a little bit deeper. And so what I'm basically trying to say is that even though this thing records stack traces, this stack happens to be the most popular stack. But it's not actually the problem. It's the most popular because that's where everybody starts. So all of the time that is charged gets charged to that stack. We have to look a little bit deeper. So in order to look a little bit deeper, let's think about the problem in a slightly different way. Suppose I've got my root.
795 00:27:15,550 --> 00:27:16,760 That's where everything starts. 796 00:27:16,760 --> 00:27:18,240 That's where all the threads are spawned, whatever. 797 00:27:18,240 --> 00:27:21,270 And suppose I have three paths to a tiny little function at 798 00:27:21,270 --> 00:27:23,760 the very end. 799 00:27:23,760 --> 00:27:25,510 Maybe I'm spending all of my time in the 800 00:27:25,510 --> 00:27:27,530 path that I've indicated. 801 00:27:27,530 --> 00:27:29,980 But the problem is that-- 802 00:27:29,980 --> 00:27:31,980 suppose you're spending all of your time in that tiny little 803 00:27:31,980 --> 00:27:35,360 function, and suppose it's equally spent among these 804 00:27:35,360 --> 00:27:36,120 other three paths. 805 00:27:36,120 --> 00:27:39,010 Well, it's not any path that's kind of screwed up. 806 00:27:39,010 --> 00:27:40,700 It's that that tiny function is kind of messed up. 807 00:27:40,700 --> 00:27:42,360 So you got to figure out why are you spending time in that 808 00:27:42,360 --> 00:27:43,350 tiny function? 809 00:27:43,350 --> 00:27:46,090 And is there something that you can do to fix that? 810 00:27:46,090 --> 00:27:48,540 So let's look from the opposite perspective. 811 00:27:48,540 --> 00:27:50,850 Let's look at the caller view. 812 00:27:50,850 --> 00:27:53,350 In this case, what I did was I looked at a bunch of different 813 00:27:53,350 --> 00:27:54,540 routines, a bunch of these different 814 00:27:54,540 --> 00:27:56,080 so-called tiny functions. 815 00:27:56,080 --> 00:27:58,640 And I looked at what was calling those tiny functions. 816 00:27:58,640 --> 00:28:01,290 It turns out that here's a call, ZwQueryVirtualMemory 817 00:28:01,290 --> 00:28:02,280 some tiny function. 818 00:28:02,280 --> 00:28:04,490 It happens to be-- it turns out that we're spending, as it 819 00:28:04,490 --> 00:28:09,480 indicates, 77% of our time in this tiny function. 820 00:28:09,480 --> 00:28:11,560 And just to emphasize that this is not being 821 00:28:11,560 --> 00:28:12,450 called from the root. 822 00:28:12,450 --> 00:28:14,190 So you don't see it from your first glance. 823 00:28:14,190 --> 00:28:16,590 You can see that the number of times this tiny function was 824 00:28:16,590 --> 00:28:19,600 called from the root is 0.34%. 825 00:28:19,600 --> 00:28:21,316 So if you were just looking from the root and trying to 826 00:28:21,316 --> 00:28:22,520 figure out what's being called from the root. 827 00:28:22,520 --> 00:28:23,940 It would be totally useless. 828 00:28:23,940 --> 00:28:24,830 Instead, you have to dig deeper. 829 00:28:24,830 --> 00:28:26,120 And see all the different functions 830 00:28:26,120 --> 00:28:26,640 that are being called. 831 00:28:26,640 --> 00:28:28,440 And find out where they're being called from. 832 00:28:28,440 --> 00:28:30,875 So we know that this function, ZwQueryVirtualMemory, is 833 00:28:30,875 --> 00:28:33,810 taking like 70% of the CPU, which is an enormously large 834 00:28:33,810 --> 00:28:34,740 amount of CPU. 835 00:28:34,740 --> 00:28:38,450 We also know what's calling it are these two 836 00:28:38,450 --> 00:28:39,270 routines right here. 837 00:28:39,270 --> 00:28:40,680 This thing, this magical thing called 838 00:28:40,680 --> 00:28:42,710 RTDynamicCast and RTtypeid. 839 00:28:42,710 --> 00:28:46,580 So it turns out we have two paths that are going to this 840 00:28:46,580 --> 00:28:47,340 tiny function. 
841 00:28:47,340 --> 00:28:50,160 And those two paths comprise most of the calls 842 00:28:50,160 --> 00:28:51,180 to this tiny function. 843 00:28:51,180 --> 00:28:52,800 And it turns out we're spending a boatload of time in 844 00:28:52,800 --> 00:28:54,800 this tiny function. 845 00:28:54,800 --> 00:28:56,980 So let's dig deeper and find out what on earth these two 846 00:28:56,980 --> 00:28:57,360 routines are. 847 00:28:57,360 --> 00:28:58,550 What is RTtypeid? 848 00:28:58,550 --> 00:29:01,070 What is RTDynamicCast, whatever? 849 00:29:01,070 --> 00:29:02,850 So let's first look at RTDynamicCast. 850 00:29:02,850 --> 00:29:03,890 What I show here-- 851 00:29:03,890 --> 00:29:05,420 in contrast to the previous view. 852 00:29:05,420 --> 00:29:07,080 The previous view said, I have a tiny function. 853 00:29:07,080 --> 00:29:08,800 What are the people that call me? 854 00:29:08,800 --> 00:29:09,690 Let's look at the reverse. 855 00:29:09,690 --> 00:29:12,540 Let's go and start at one of these two routines, RTtypeid. 856 00:29:12,540 --> 00:29:15,480 And figure out what it's calling that is spending so 857 00:29:15,480 --> 00:29:17,450 much time, and ultimately leads to us calling that 858 00:29:17,450 --> 00:29:20,090 ZwQueryVirtualMemory or whatever. 859 00:29:20,090 --> 00:29:22,850 And as you can see, this RTtypeid, it calls two 860 00:29:22,850 --> 00:29:23,870 functions primarily. 861 00:29:23,870 --> 00:29:25,735 One is incrementing a reference counter. 862 00:29:25,735 --> 00:29:27,600 And the other is decrementing a reference counter. 863 00:29:27,600 --> 00:29:29,030 Now you're thinking to yourself, incrementing a 864 00:29:29,030 --> 00:29:30,580 reference counter, decrementing a reference 865 00:29:30,580 --> 00:29:32,970 counter should be very simple, very low CPU, and basically 866 00:29:32,970 --> 00:29:33,970 should cost you nothing. 867 00:29:33,970 --> 00:29:35,500 And all of those three assumptions are dead wrong, 868 00:29:35,500 --> 00:29:36,280 unfortunately. 869 00:29:36,280 --> 00:29:37,670 So let's try to understand a little bit as 870 00:29:37,670 --> 00:29:38,960 to why that's happening. 871 00:29:38,960 --> 00:29:42,520 Turns out RTtypeid is used in order to figure out the C++ 872 00:29:42,520 --> 00:29:43,380 type of an object. 873 00:29:43,380 --> 00:29:45,200 So we're trying to figure this out on the fly. 874 00:29:45,200 --> 00:29:47,950 And as I said, it's kind of mystifying that this routine, 875 00:29:47,950 --> 00:29:50,250 which calls our tiny function, that you're spending 39% 876 00:29:50,250 --> 00:29:53,930 overall in this routine in that tiny function. 877 00:29:53,930 --> 00:29:56,010 So let's figure out what's going on by looking at a 878 00:29:56,010 --> 00:29:57,880 simple example, which would be IncRef. 879 00:29:57,880 --> 00:30:00,170 So it turns out if you're looking at this code, the 880 00:30:00,170 --> 00:30:03,240 dreadful line is the fact that we're doing const type info 881 00:30:03,240 --> 00:30:06,880 reference tinfo equals typeid of *this. 882 00:30:06,880 --> 00:30:09,560 This is basically, you have to call RTtypeid to get this kind 883 00:30:09,560 --> 00:30:10,470 of information. 884 00:30:10,470 --> 00:30:13,520 And in order to do that, you need some runtime type 885 00:30:13,520 --> 00:30:14,580 information. 886 00:30:14,580 --> 00:30:17,180 This runtime type information has some pointers 887 00:30:17,180 --> 00:30:18,720 associated with it. 888 00:30:18,720 --> 00:30:20,120 So here's the deal. 
889 00:30:20,120 --> 00:30:21,610 Whether I'm running a 32-bit-- 890 00:30:21,610 --> 00:30:23,300 Remember, we even forgot the original problem, which is, 891 00:30:23,300 --> 00:30:25,870 gee, I'm 32-bit, 64-bit executable-- 892 00:30:25,870 --> 00:30:27,910 sorry 32-bit executable on 64-bit hardware. 893 00:30:27,910 --> 00:30:29,450 We were slower than a 64-bit-- 894 00:30:29,450 --> 00:30:31,900 we were faster than a 64-bit executable on 64-bit hardware. 895 00:30:31,900 --> 00:30:33,330 Let's try to figure out why. 896 00:30:33,330 --> 00:30:36,780 Well, remember, every time I increment or decrement a 897 00:30:36,780 --> 00:30:39,310 reference counter, I get some type information by consulting 898 00:30:39,310 --> 00:30:40,700 this runtime type information. 899 00:30:40,700 --> 00:30:42,270 It's got a bunch of pointers. 900 00:30:42,270 --> 00:30:46,460 When I run a 32-bit executable on 64-bit hardware, those 901 00:30:46,460 --> 00:30:48,640 pointers are just raw 32-bit pointers. 902 00:30:48,640 --> 00:30:50,660 You just look them up, and you're done. 903 00:30:50,660 --> 00:30:53,020 In 64-bit, pretty much the only difference is that 904 00:30:53,020 --> 00:30:55,560 instead of those pointers being actual pointers, they're 905 00:30:55,560 --> 00:30:57,490 32-bit offsets. 906 00:30:57,490 --> 00:31:01,170 So you take the offset, you add it to the base address of 907 00:31:01,170 --> 00:31:05,070 the DLL of the executable or whatever, where that runtime 908 00:31:05,070 --> 00:31:06,400 typeid call is being made. 909 00:31:06,400 --> 00:31:08,090 And then you're done. 910 00:31:08,090 --> 00:31:10,360 Once you've added that offset, you get a 64-bit pointer. 911 00:31:10,360 --> 00:31:13,520 You look up that 64-bit pointer, everybody's happy. 912 00:31:13,520 --> 00:31:14,540 So here's the deal. 913 00:31:14,540 --> 00:31:16,140 Remember, we said 32-bit's faster. 914 00:31:16,140 --> 00:31:17,450 32-bit is just a pointer look up. 915 00:31:17,450 --> 00:31:18,490 And then you're done. 916 00:31:18,490 --> 00:31:19,670 64-bit's slower. 917 00:31:19,670 --> 00:31:22,120 It's a pointer look up-- and it's an addition plus a 918 00:31:22,120 --> 00:31:24,640 pointer look up. 919 00:31:24,640 --> 00:31:26,230 So I pause, because you're thinking to yourself, an 920 00:31:26,230 --> 00:31:28,800 addition plus a pointer look up versus a pointer look up. 921 00:31:28,800 --> 00:31:29,980 Does that mean that addition is slow? 922 00:31:29,980 --> 00:31:31,260 That sounds really stupid. 923 00:31:31,260 --> 00:31:32,670 And by the way, it is really stupid. 924 00:31:32,670 --> 00:31:33,570 It's not true. 925 00:31:33,570 --> 00:31:34,320 So the deal is this. 926 00:31:34,320 --> 00:31:35,970 It's not that addition to create a 927 00:31:35,970 --> 00:31:37,810 64-bit pointer is slow. 928 00:31:37,810 --> 00:31:39,850 The deal is that-- remember this 2-step process. 929 00:31:39,850 --> 00:31:41,810 You find out the base address of a module. 930 00:31:41,810 --> 00:31:44,220 And you add this offset to that base address. 931 00:31:44,220 --> 00:31:48,540 Determining that base address is really, really slow. 932 00:31:48,540 --> 00:31:50,510 It turns out you have to call a bunch of different random 933 00:31:50,510 --> 00:31:53,640 routines that have to walk the list of loaded modules. 934 00:31:53,640 --> 00:31:55,230 And you're doing this every single time 935 00:31:55,230 --> 00:31:56,130 you call this routine. 
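To make that hot path concrete, here is a minimal C++ sketch, with assumed names rather than VMware's actual classes, of a reference-counted object whose IncRef consults RTTI on every call:

```cpp
#include <atomic>
#include <typeinfo>

// Minimal sketch (illustrative names, not VMware's code) of a
// reference-counted base class that looks up RTTI on every IncRef.
class RefCounted {
public:
    virtual ~RefCounted() = default;

    void IncRef() {
        // The costly line: typeid(*this) does a runtime type lookup.
        // On 64-bit Windows the RTTI record holds 32-bit offsets, so the
        // runtime must first find the module's base address (walking the
        // loaded-module list) before it can form the full pointer.
        const std::type_info& tinfo = typeid(*this);
        (void)tinfo;  // assumed to be used for per-type bookkeeping
        ++refCount_;
    }

    void DecRef() {
        if (--refCount_ == 0) {
            delete this;  // object deletes itself when the count hits zero
        }
    }

private:
    std::atomic<int> refCount_{0};
};
```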
936 00:31:56,130 --> 00:31:58,060 And if you increment a reference whenever you do a 937 00:31:58,060 --> 00:31:59,870 memory allocation, you can imagine that this is going to 938 00:31:59,870 --> 00:32:02,630 consume a lot of CPU. 939 00:32:02,630 --> 00:32:04,810 So it turns out, this is what's actually slow. 940 00:32:04,810 --> 00:32:06,900 And remember, that we started this off a little by saying, 941 00:32:06,900 --> 00:32:07,950 oh, we've got this weird function, 942 00:32:07,950 --> 00:32:10,830 ZwQueryVirtualMemory, which seems to be being called a lot 943 00:32:10,830 --> 00:32:12,320 from a bunch of different places. 944 00:32:12,320 --> 00:32:14,270 The reason it's called a lot is because all of these 945 00:32:14,270 --> 00:32:16,020 routines to get typeid information are 946 00:32:16,020 --> 00:32:17,550 calling this routine. 947 00:32:17,550 --> 00:32:19,060 So there's actually two solutions. 948 00:32:19,060 --> 00:32:21,355 And you can look on this blog to actually figure out that 949 00:32:21,355 --> 00:32:24,270 this is a known problem that Microsoft eventually fixed. 950 00:32:24,270 --> 00:32:28,350 And it turns out, that the two solutions end up being that 951 00:32:28,350 --> 00:32:30,740 you can either statically compute that base address so 952 00:32:30,740 --> 00:32:32,580 you're not constantly relooking it up 953 00:32:32,580 --> 00:32:33,530 every single time. 954 00:32:33,530 --> 00:32:35,810 But there's actually a much simpler and stupider solution, 955 00:32:35,810 --> 00:32:37,790 which is, use the latest runtime library where they 956 00:32:37,790 --> 00:32:39,940 have actually fixed this problem. 957 00:32:39,940 --> 00:32:43,610 So just to summarize what was going on. 958 00:32:43,610 --> 00:32:45,670 We were CPU saturated. 959 00:32:45,670 --> 00:32:46,850 We looked from the top down. 960 00:32:46,850 --> 00:32:47,850 We didn't notice anything. 961 00:32:47,850 --> 00:32:49,070 We looked from the bottom up. 962 00:32:49,070 --> 00:32:50,930 And we noticed that some bottom was being called a lot. 963 00:32:50,930 --> 00:32:52,840 We figured out where that bottom was being called by 964 00:32:52,840 --> 00:32:55,240 looking at caller/callee views. 965 00:32:55,240 --> 00:32:56,570 Once we figured out where it was being called, we then 966 00:32:56,570 --> 00:32:58,490 tried to figure out why that thing was being called. 967 00:32:58,490 --> 00:32:59,960 And then we fixed the problem. 968 00:32:59,960 --> 00:33:03,490 So to me, I guess the take-home lesson from this is 969 00:33:03,490 --> 00:33:06,080 that when that developer said to me, three months before 970 00:33:06,080 --> 00:33:07,940 this whole debacle started, he said, oh, yeah, we're 971 00:33:07,940 --> 00:33:09,520 switching from say 32 to 64-bit. 972 00:33:09,520 --> 00:33:10,980 That's not a big deal. 973 00:33:10,980 --> 00:33:12,750 You can't really take that stuff at face value, because a 974 00:33:12,750 --> 00:33:14,180 lot of little things can really add up. 975 00:33:14,180 --> 00:33:16,400 And we could never have really guessed that this would happen 976 00:33:16,400 --> 00:33:19,070 beforehand. 977 00:33:19,070 --> 00:33:20,360 So that's four. 978 00:33:20,360 --> 00:33:21,800 Let me go over another one. 979 00:33:21,800 --> 00:33:23,150 Memory usage. 980 00:33:23,150 --> 00:33:24,690 So excessive-- 981 00:33:24,690 --> 00:33:27,340 I'm sure actually all of you know that memory usage is a 982 00:33:27,340 --> 00:33:28,130 big problem. 
983 00:33:28,130 --> 00:33:30,080 Your application is going to slow down if you're inducing a 984 00:33:30,080 --> 00:33:31,130 lot of paging. 985 00:33:31,130 --> 00:33:33,700 And at least in 32-bit Windows, if you exceed two 986 00:33:33,700 --> 00:33:36,160 gigabytes for the process that you're running, the process is 987 00:33:36,160 --> 00:33:37,090 just going to crash. 988 00:33:37,090 --> 00:33:40,030 So you clearly do care about memory usage. 989 00:33:40,030 --> 00:33:43,660 And I want to differentiate between memory leaks and 990 00:33:43,660 --> 00:33:44,610 memory accumulation. 991 00:33:44,610 --> 00:33:46,090 You're wondering why would a process take up two 992 00:33:46,090 --> 00:33:46,840 gigabytes of memory. 993 00:33:46,840 --> 00:33:48,440 Well, one is that maybe you have a leak. 994 00:33:48,440 --> 00:33:49,090 You allocate something. 995 00:33:49,090 --> 00:33:50,320 You don't remove it. 996 00:33:50,320 --> 00:33:51,630 Or it might just be an accumulation. 997 00:33:51,630 --> 00:33:53,420 You allocate something somewhere in your code, you 998 00:33:53,420 --> 00:33:54,950 actually are freeing it. 999 00:33:54,950 --> 00:33:56,480 But the trouble is that routine never gets called, 1000 00:33:56,480 --> 00:33:59,750 because in your code logic you never take the time to think 1001 00:33:59,750 --> 00:34:03,250 and say, OK, can I actually get rid of this at this point. 1002 00:34:03,250 --> 00:34:06,340 And that's what I'm going to call a memory accumulation. 1003 00:34:06,340 --> 00:34:08,090 There's a lot of different tools out there to analyze 1004 00:34:08,090 --> 00:34:08,980 memory usage. 1005 00:34:08,980 --> 00:34:11,719 And the sad part about this is that almost none of them work 1006 00:34:11,719 --> 00:34:13,400 for the particular example that we have, because they 1007 00:34:13,400 --> 00:34:15,360 just didn't scale very well. 1008 00:34:15,360 --> 00:34:17,389 It's fine if you've got your toy, stupid little application 1009 00:34:17,389 --> 00:34:18,550 that does addition in a loop. 1010 00:34:18,550 --> 00:34:20,120 Not so fine if you've got an enterprise middleware 1011 00:34:20,120 --> 00:34:20,650 application. 1012 00:34:20,650 --> 00:34:22,699 So we ended up, despite all the tools that we have 1013 00:34:22,699 --> 00:34:24,780 available, we ended up having to write our own. 1014 00:34:24,780 --> 00:34:25,860 I didn't do this, a colleague did. 1015 00:34:25,860 --> 00:34:27,900 But anyway, I guess that distinction 1016 00:34:27,900 --> 00:34:28,610 doesn't really matter. 1017 00:34:28,610 --> 00:34:30,360 But anyway, the point is that you do a lot of what these 1018 00:34:30,360 --> 00:34:30,880 other tools do. 1019 00:34:30,880 --> 00:34:32,580 But you do it special to your environment. 1020 00:34:32,580 --> 00:34:35,000 In this case, we just hooked a lot of the calls to malloc. 1021 00:34:35,000 --> 00:34:36,670 And we figure out what pointers are live. 1022 00:34:36,670 --> 00:34:39,179 And then we do stuff from there. 1023 00:34:39,179 --> 00:34:40,409 This is how normal tools work. 1024 00:34:40,409 --> 00:34:43,000 And of course, this can be unusably slow if you do a ton 1025 00:34:43,000 --> 00:34:43,760 of allocations. 
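For concreteness, here is a rough sketch, with assumed names, of the kind of allocation hook such a home-grown tool is built around; the real tool also recorded call stacks and was tailored to the server:

```cpp
#include <cstdlib>
#include <mutex>
#include <unordered_map>

// Sketch of an allocation tracker: remember every live pointer and its
// size, and forget it on free, so "live bytes" can be reported on demand.
static std::unordered_map<void*, std::size_t> g_live;
static std::size_t g_liveBytes = 0;
static std::mutex g_lock;

void* tracked_malloc(std::size_t size) {
    void* p = std::malloc(size);
    if (p != nullptr) {
        std::lock_guard<std::mutex> guard(g_lock);
        g_live[p] = size;       // record this pointer as live
        g_liveBytes += size;
    }
    return p;
}

void tracked_free(void* p) {
    if (p == nullptr) return;
    {
        std::lock_guard<std::mutex> guard(g_lock);
        auto it = g_live.find(p);
        if (it != g_live.end()) {   // forget it when it is freed
            g_liveBytes -= it->second;
            g_live.erase(it);
        }
    }
    std::free(p);
}
```

Every allocation now pays for a lock acquisition and a hash-table update, which is exactly the overhead being described next.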
1026 00:34:43,760 --> 00:34:45,739 For example, if you're doing millions of allocations per 1027 00:34:45,739 --> 00:34:47,820 second, you can imagine you're calling this kind of routine, 1028 00:34:47,820 --> 00:34:48,489 which itself-- 1029 00:34:48,489 --> 00:34:52,460 if it only has a 10x overhead over a call without that, 1030 00:34:52,460 --> 00:34:54,300 you're slowing down your program by 10x for every 1031 00:34:54,300 --> 00:34:55,270 single memory allocation. 1032 00:34:55,270 --> 00:34:57,195 That means performance depends on how many allocations you 1033 00:34:57,195 --> 00:34:58,800 do, which is really bad. 1034 00:34:58,800 --> 00:35:00,440 And just to illustrate what I mean by a leak. 1035 00:35:00,440 --> 00:35:01,750 My definition is basically that you've got 1036 00:35:01,750 --> 00:35:03,210 this routine, foo. 1037 00:35:03,210 --> 00:35:04,600 And it mallocs some data. 1038 00:35:04,600 --> 00:35:06,620 But it returns without freeing that data. 1039 00:35:06,620 --> 00:35:08,170 And nowhere else in the code did you 1040 00:35:08,170 --> 00:35:09,210 actually have a free. 1041 00:35:09,210 --> 00:35:10,830 That's an actual memory leak. 1042 00:35:10,830 --> 00:35:12,640 An accumulation would be if you have some free somewhere. 1043 00:35:12,640 --> 00:35:15,080 But it's never actually called. 1044 00:35:15,080 --> 00:35:18,330 So a lot of the code that we write happened-- at least in 1045 00:35:18,330 --> 00:35:20,060 the past, it's changing-- but a lot of it happened to be 1046 00:35:20,060 --> 00:35:20,960 written in C++. 1047 00:35:20,960 --> 00:35:24,260 And all of you, I guess assume, you know about new and 1048 00:35:24,260 --> 00:35:26,610 delete and all those kinds of things that we do in C++. 1049 00:35:26,610 --> 00:35:28,680 And it turns out, that those things are notoriously bad for 1050 00:35:28,680 --> 00:35:29,360 causing memory leaks. 1051 00:35:29,360 --> 00:35:31,310 So instead, what people typically do is they use 1052 00:35:31,310 --> 00:35:32,380 reference counted objects. 1053 00:35:32,380 --> 00:35:34,590 So every time you use something you increment some 1054 00:35:34,590 --> 00:35:35,210 reference count. 1055 00:35:35,210 --> 00:35:37,350 And every time you delete it, you 1056 00:35:37,350 --> 00:35:38,270 decrement a reference count. 1057 00:35:38,270 --> 00:35:40,860 And when the decremented reference count is zero it 1058 00:35:40,860 --> 00:35:42,310 automatically gets deleted. 1059 00:35:42,310 --> 00:35:44,250 Now of course, this only solves an actual leak. 1060 00:35:44,250 --> 00:35:46,470 It doesn't solve an accumulation. 1061 00:35:46,470 --> 00:35:48,190 So here was the problem that I noticed. 1062 00:35:48,190 --> 00:35:50,110 That was all kind of setup, just saying what's a memory 1063 00:35:50,110 --> 00:35:52,530 leak, what's an accumulation, why is memory a problem. 1064 00:35:52,530 --> 00:35:55,050 Let's get to actually why it was a problem. 1065 00:35:55,050 --> 00:35:56,230 I had this issue. 1066 00:35:56,230 --> 00:35:57,950 I have our little server application. 1067 00:35:57,950 --> 00:36:01,950 And it's running out of memory after several hours. 1068 00:36:01,950 --> 00:36:05,950 Now I used Purify on a much, much smaller setup. 1069 00:36:05,950 --> 00:36:09,870 It was about 1/100 the size of the actual setup, because 1070 00:36:09,870 --> 00:36:11,860 larger than that, Purify would crash, because it would run 1071 00:36:11,860 --> 00:36:12,480 out of memory. 
1072 00:36:12,480 --> 00:36:14,380 Because by the way, to track all the memory references, 1073 00:36:14,380 --> 00:36:16,150 it's got to allocate memory, which ends up being a 1074 00:36:16,150 --> 00:36:18,260 nightmare of unprecedented proportions. 1075 00:36:18,260 --> 00:36:22,680 So the deal was, was that I couldn't really detect this 1076 00:36:22,680 --> 00:36:24,670 leak very easily. 1077 00:36:24,670 --> 00:36:27,350 So what I did was I examined the memory that was in use. 1078 00:36:27,350 --> 00:36:28,970 Essentially, what I'm trying to say is that, there wasn't 1079 00:36:28,970 --> 00:36:31,750 so much a leak as there was an accumulation. 1080 00:36:31,750 --> 00:36:33,630 Accumulation again being a logical problem. 1081 00:36:33,630 --> 00:36:35,120 Well, yeah, all of my code is written correctly. 1082 00:36:35,120 --> 00:36:37,470 But it's just that free call is never called. 1083 00:36:37,470 --> 00:36:40,900 So I examined the memory in use and I localized it to one 1084 00:36:40,900 --> 00:36:43,280 operation that whenever we did it, memory would increase. 1085 00:36:43,280 --> 00:36:45,600 So since doing it only once, there's a lot of noise. 1086 00:36:45,600 --> 00:36:47,610 What I did was I said, well, OK, why don't I just do this 1087 00:36:47,610 --> 00:36:49,530 operation hundreds of times at the same time. 1088 00:36:49,530 --> 00:36:52,080 Because then, if I see some allocation that's 100x from 1089 00:36:52,080 --> 00:36:53,920 before I started this I can know that this 1090 00:36:53,920 --> 00:36:55,680 was actually a problem. 1091 00:36:55,680 --> 00:36:57,270 And that's exactly what I did. 1092 00:36:57,270 --> 00:36:58,820 I had this issue where something was being 1093 00:36:58,820 --> 00:37:00,370 allocated 64 bytes. 1094 00:37:00,370 --> 00:37:02,580 Now 64 bytes is so small that it's pretty much in the noise. 1095 00:37:02,580 --> 00:37:03,990 But if you do it a hundred times, and you're 1096 00:37:03,990 --> 00:37:04,950 seeing 6,400 bytes. 1097 00:37:04,950 --> 00:37:06,480 Then you do it a thousand times and you 1098 00:37:06,480 --> 00:37:07,740 see 64 times 1,000. 1099 00:37:07,740 --> 00:37:09,225 Then it becomes something easily that 1100 00:37:09,225 --> 00:37:10,900 you can then track. 1101 00:37:10,900 --> 00:37:13,340 Once I noticed that this particular piece of data was 1102 00:37:13,340 --> 00:37:15,250 being allocated like this, I went through and I realized 1103 00:37:15,250 --> 00:37:17,220 that this actually was-- 1104 00:37:17,220 --> 00:37:18,530 it was an accumulation and not a leak. 1105 00:37:18,530 --> 00:37:20,490 So someone would allocate something whenever an incoming 1106 00:37:20,490 --> 00:37:21,490 message came in. 1107 00:37:21,490 --> 00:37:23,260 And there was a free somewhere, but it was never 1108 00:37:23,260 --> 00:37:25,060 being called. 1109 00:37:25,060 --> 00:37:28,190 So the lesson that I learned from this is that in a lot of 1110 00:37:28,190 --> 00:37:30,810 situations with performance debugging, it's very helpful 1111 00:37:30,810 --> 00:37:32,100 to try to find a problem-- 1112 00:37:32,100 --> 00:37:33,660 in order to try to find a problem, it's very helpful to 1113 00:37:33,660 --> 00:37:36,390 magnify it and make it a lot worse. 1114 00:37:36,390 --> 00:37:38,440 If you think something is problematic in the network, 1115 00:37:38,440 --> 00:37:40,160 slow down your network if that's an option. 
1116 00:37:40,160 --> 00:37:42,860 If you think something has a problem with memory usage, 1117 00:37:42,860 --> 00:37:44,855 either do something like what I've described here. 1118 00:37:44,855 --> 00:37:46,220 Or maybe reduce the amount of memory. 1119 00:37:46,220 --> 00:37:49,360 When I was in graduate school, we were working on a cache 1120 00:37:49,360 --> 00:37:50,660 coherence protocol. 1121 00:37:50,660 --> 00:37:51,780 And I'm going to spare you a lot of the details. 1122 00:37:51,780 --> 00:37:52,670 But here's the thing. 1123 00:37:52,670 --> 00:37:55,360 We basically stored a linked list for every data in memory. 1124 00:37:55,360 --> 00:37:57,290 And we had to make sure that that linked list was rock 1125 00:37:57,290 --> 00:38:00,290 solid no matter what the size of the inventory was. 1126 00:38:00,290 --> 00:38:01,820 And if you exceeded the linked list, you'd 1127 00:38:01,820 --> 00:38:02,780 store stuff in disk. 1128 00:38:02,780 --> 00:38:05,590 So what we did to magnify any possible bugs was we created 1129 00:38:05,590 --> 00:38:07,190 it so that you could make a linked list 1130 00:38:07,190 --> 00:38:08,980 of at most one item. 1131 00:38:08,980 --> 00:38:10,780 That way no matter what you did it would force you to 1132 00:38:10,780 --> 00:38:11,770 overflow and go into disk. 1133 00:38:11,770 --> 00:38:14,000 And that was a great way to test a lot of overflow 1134 00:38:14,000 --> 00:38:16,140 conditions that in practice would never have occurred. 1135 00:38:16,140 --> 00:38:17,980 Because we do the engineering thing, figure out the 1136 00:38:17,980 --> 00:38:19,990 capacity, double it, and hope that nothing happens. 1137 00:38:19,990 --> 00:38:23,900 So it's good to exaggerate edge conditions. 1138 00:38:23,900 --> 00:38:25,980 So that's one memory analysis problem. 1139 00:38:25,980 --> 00:38:27,250 Here's another one. 1140 00:38:27,250 --> 00:38:27,790 Here's the deal. 1141 00:38:27,790 --> 00:38:29,820 You've got a user and they're using this middleware server. 1142 00:38:29,820 --> 00:38:32,590 And they complain that with time it gets slower, and 1143 00:38:32,590 --> 00:38:34,260 slower, and slower. 1144 00:38:34,260 --> 00:38:35,290 And if you look at all of the-- 1145 00:38:35,290 --> 00:38:37,760 if you look at CPU network and disk, you see that nothing is 1146 00:38:37,760 --> 00:38:38,960 being saturated there. 1147 00:38:38,960 --> 00:38:41,320 But if you look at memory, it's increasing pretty 1148 00:38:41,320 --> 00:38:42,300 dramatically. 1149 00:38:42,300 --> 00:38:44,810 And at some point, it increases so dramatically that 1150 00:38:44,810 --> 00:38:47,300 the server application crashes. 1151 00:38:47,300 --> 00:38:50,820 So first thing, let's actually look at the memory utilization 1152 00:38:50,820 --> 00:38:52,150 and see the timeline. 1153 00:38:52,150 --> 00:38:54,880 What I show here is I show versus time, I show a chart of 1154 00:38:54,880 --> 00:38:57,190 something in Windows that's called private bytes. 1155 00:38:57,190 --> 00:39:00,060 Essentially, think of that as memory. 1156 00:39:00,060 --> 00:39:02,550 What you see is that memory is growing slowly, whatever. 1157 00:39:02,550 --> 00:39:05,000 But at some point, near the very end, near when the server 1158 00:39:05,000 --> 00:39:10,010 crashes, it starts growing at a really, really fast clip. 1159 00:39:10,010 --> 00:39:11,440 Now remember, I call this private bytes. 
1160 00:39:11,440 --> 00:39:14,480 What that essentially is, in this particular case, it's 1161 00:39:14,480 --> 00:39:18,120 Windows giving a process some amount of memory. 1162 00:39:18,120 --> 00:39:20,410 It's not necessarily what the process or what the 1163 00:39:20,410 --> 00:39:21,900 application is requesting. 1164 00:39:21,900 --> 00:39:23,260 The application will request some amount. 1165 00:39:23,260 --> 00:39:24,730 And Windows will give it this much. 1166 00:39:24,730 --> 00:39:27,570 Well, Windows keeps giving it more and more and more. 1167 00:39:27,570 --> 00:39:31,840 And it eventually just runs out of memory to give. 1168 00:39:31,840 --> 00:39:34,390 So the reason I made the distinction between something 1169 00:39:34,390 --> 00:39:36,670 that an application and something Windows is giving it 1170 00:39:36,670 --> 00:39:39,260 is that I, as an application, might be asking for 1171 00:39:39,260 --> 00:39:39,950 four bytes at a time. 1172 00:39:39,950 --> 00:39:40,980 Four bytes, four bytes, whatever. 1173 00:39:40,980 --> 00:39:43,620 But what if Windows has this problem where, whenever you 1174 00:39:43,620 --> 00:39:45,740 allocate-- it doesn't, but suppose it does-- suppose that 1175 00:39:45,740 --> 00:39:47,480 every time you ask for four bytes, it says, well, you know 1176 00:39:47,480 --> 00:39:49,170 what, I'm just going to give you 16 bytes because that's my 1177 00:39:49,170 --> 00:39:50,530 smallest increment of memory. 1178 00:39:50,530 --> 00:39:52,340 You can imagine if I keep asking for 4 and it's dumb 1179 00:39:52,340 --> 00:39:54,240 enough to give me 16 every time, we're going to see a 1180 00:39:54,240 --> 00:39:56,490 huge expansion of memory, even though I, as an application, 1181 00:39:56,490 --> 00:39:57,730 didn't ask for that much. 1182 00:39:57,730 --> 00:40:00,810 So it's important to keep these distinctions in mind. 1183 00:40:00,810 --> 00:40:02,860 So the point is what was happening in this case, was my 1184 00:40:02,860 --> 00:40:05,133 server's going fine, fine, and fine, until near the very end 1185 00:40:05,133 --> 00:40:07,520 where it starts increasing in memory use at such a dramatic 1186 00:40:07,520 --> 00:40:10,090 rate, that it eventually crashes. 1187 00:40:10,090 --> 00:40:11,390 So the first thing I did-- 1188 00:40:11,390 --> 00:40:12,290 We just talked a lot about these 1189 00:40:12,290 --> 00:40:13,870 reference counted objects. 1190 00:40:13,870 --> 00:40:15,670 The fact that I keep track of whenever something is 1191 00:40:15,670 --> 00:40:18,020 allocated and whenever something is not being looked 1192 00:40:18,020 --> 00:40:18,590 at anymore. 1193 00:40:18,590 --> 00:40:19,780 When it's not being looked at anymore, I go 1194 00:40:19,780 --> 00:40:20,660 ahead and remove it. 1195 00:40:20,660 --> 00:40:22,470 If you look at the reference counted objects, they were 1196 00:40:22,470 --> 00:40:25,650 pretty flat until you get to the very end of the run, 1197 00:40:25,650 --> 00:40:28,150 exactly the same as when the memory was increasing. 1198 00:40:28,150 --> 00:40:30,450 And it turns out, I've highlighted a few of these, 1199 00:40:30,450 --> 00:40:31,710 and they ended up being-- 1200 00:40:31,710 --> 00:40:33,240 it doesn't matter what they specifically were-- but in 1201 00:40:33,240 --> 00:40:35,180 this case, they ended up being related to threads, and 1202 00:40:35,180 --> 00:40:38,340 mutexes, and stuff associated with keeping track of what was 1203 00:40:38,340 --> 00:40:40,040 going on in the system. 
1204 00:40:40,040 --> 00:40:41,670 So some thread-related objects were 1205 00:40:41,670 --> 00:40:42,940 increasing for some reason. 1206 00:40:42,940 --> 00:40:45,690 Now turns out thread state in some sense is very cheap. 1207 00:40:45,690 --> 00:40:48,130 So just because these guys were increasing, it doesn't 1208 00:40:48,130 --> 00:40:49,450 really explain why we're dramatically 1209 00:40:49,450 --> 00:40:50,400 increasing in memory. 1210 00:40:50,400 --> 00:40:52,800 But I'll tell you what does explain it. 1211 00:40:52,800 --> 00:40:54,420 If you take a look at a different attribute, which is 1212 00:40:54,420 --> 00:40:56,460 these are the total number of threads in our system, we have 1213 00:40:56,460 --> 00:40:57,800 a static pool of threads. 1214 00:40:57,800 --> 00:41:00,170 And whenever a work item comes in, we allocate a thread. 1215 00:41:00,170 --> 00:41:01,470 Whenever that work item is done, we 1216 00:41:01,470 --> 00:41:02,340 deallocate the thread. 1217 00:41:02,340 --> 00:41:04,210 So we keep it to a minimal number. 1218 00:41:04,210 --> 00:41:05,200 And it turns out, we had a cap. 1219 00:41:05,200 --> 00:41:07,730 We said, OK, I'm going to allow at most 20 operations to 1220 00:41:07,730 --> 00:41:08,600 occur in flight. 1221 00:41:08,600 --> 00:41:11,690 Anything beyond 20, I'm going to queue it up in memory. 1222 00:41:11,690 --> 00:41:13,080 And we'll deal with it later. 1223 00:41:13,080 --> 00:41:13,710 Here's the deal. 1224 00:41:13,710 --> 00:41:15,380 What you see when you look at the number of threads being 1225 00:41:15,380 --> 00:41:16,570 used is that-- 1226 00:41:16,570 --> 00:41:17,950 it's the bottom chart, unfortunately, it's a little 1227 00:41:17,950 --> 00:41:18,280 hard to see. 1228 00:41:18,280 --> 00:41:20,230 But the point is it goes up and down, up and down, up and 1229 00:41:20,230 --> 00:41:21,230 down with time. 1230 00:41:21,230 --> 00:41:22,850 So stuff gets allocated. 1231 00:41:22,850 --> 00:41:23,530 Stuff gets used. 1232 00:41:23,530 --> 00:41:24,420 Stuff gets removed. 1233 00:41:24,420 --> 00:41:26,330 But then when you get to the very end, you see this 1234 00:41:26,330 --> 00:41:28,790 monotonic increase in the number of threads being used. 1235 00:41:28,790 --> 00:41:29,750 It never goes down. 1236 00:41:29,750 --> 00:41:32,360 And at some point, you hit the point where all of the threads 1237 00:41:32,360 --> 00:41:35,460 that you've preallocated are completely taken. 1238 00:41:35,460 --> 00:41:37,130 You obviously can't allocate any more threads. 1239 00:41:37,130 --> 00:41:40,940 So every incoming request, you go ahead and queue. 1240 00:41:40,940 --> 00:41:42,260 This was the root of the problem. 1241 00:41:42,260 --> 00:41:44,660 The root of the problem is that in this particular 1242 00:41:44,660 --> 00:41:48,050 situation, each of the times that it was increasing was 1243 00:41:48,050 --> 00:41:51,070 because there was again an uncaught exception. 1244 00:41:51,070 --> 00:41:54,830 That uncaught exception resulted in a thread not being 1245 00:41:54,830 --> 00:41:55,510 deallocated. 1246 00:41:55,510 --> 00:41:56,950 At some point, we ran out of threads. 1247 00:41:56,950 --> 00:42:00,220 And every single request that came in was being allocated in 1248 00:42:00,220 --> 00:42:02,800 memory and waiting around for a thread to become available 1249 00:42:02,800 --> 00:42:04,730 before it could be finished. 
1250 00:42:04,730 --> 00:42:06,570 Well, if no threads are available, these things can 1251 00:42:06,570 --> 00:42:07,840 never get pulled off the queue. 1252 00:42:07,840 --> 00:42:09,540 You see a monotonic increase in the amount of memory. 1253 00:42:09,540 --> 00:42:11,810 And you eventually die. 1254 00:42:11,810 --> 00:42:15,190 And so, it's actually a correctness issue. 1255 00:42:15,190 --> 00:42:17,180 Why are you not catching exceptions properly? 1256 00:42:17,180 --> 00:42:18,920 And why are you not recovering properly from them? 1257 00:42:18,920 --> 00:42:20,840 But interestingly, it manifested itself in a 1258 00:42:20,840 --> 00:42:22,980 performance issue. 1259 00:42:22,980 --> 00:42:25,300 So in each of these cases, what we've done, is we've 1260 00:42:25,300 --> 00:42:27,860 looked at a customized profiler instead of a tool that we 1261 00:42:27,860 --> 00:42:30,860 might have gotten, Purify, Valgrind, or whatever-- 1262 00:42:30,860 --> 00:42:32,185 it's nice that it's tailored to our application. 1263 00:42:32,185 --> 00:42:34,160 And we can make it arbitrarily fast, because it 1264 00:42:34,160 --> 00:42:36,760 understands what our application is doing. 1265 00:42:36,760 --> 00:42:38,660 And it also-- the other nice thing about this is this can 1266 00:42:38,660 --> 00:42:39,610 be run in production environments. 1267 00:42:39,610 --> 00:42:41,480 If you write your own tool, and a customer calls and 1268 00:42:41,480 --> 00:42:42,916 complains, you can just say, oh, just twiddle this knob and 1269 00:42:42,916 --> 00:42:44,460 you're done. 1270 00:42:44,460 --> 00:42:45,370 That's kind of convenient. 1271 00:42:45,370 --> 00:42:46,860 But of course, the disadvantage is that now you 1272 00:42:46,860 --> 00:42:50,030 have to-- whenever you rewrite your code or do a new rev, or 1273 00:42:50,030 --> 00:42:50,900 whatever, you have to make sure that 1274 00:42:50,900 --> 00:42:51,865 these tools work right. 1275 00:42:51,865 --> 00:42:53,820 And you have to recompile the code. 1276 00:42:53,820 --> 00:42:55,600 If you have something like Purify, Quantify, in a lot 1277 00:42:55,600 --> 00:42:57,490 of situations, you can run with some sort of preexisting 1278 00:42:57,490 --> 00:42:59,010 binary-- sometimes, not always. 1279 00:42:59,010 --> 00:43:00,870 You can run it without having to recompile. 1280 00:43:00,870 --> 00:43:03,320 And that can be very convenient. 1281 00:43:03,320 --> 00:43:05,200 So I don't know. 1282 00:43:05,200 --> 00:43:07,600 I was kind of stuck for what a lesson this was besides common 1283 00:43:07,600 --> 00:43:08,230 sense is useful. 1284 00:43:08,230 --> 00:43:09,980 So I just said memory profiling is pretty critical. 1285 00:43:09,980 --> 00:43:12,090 Don't ignore that when you're writing your applications. 1286 00:43:12,090 --> 00:43:14,760 And I think I also want to point out that-- 1287 00:43:14,760 --> 00:43:15,560 I didn't put this here. 1288 00:43:15,560 --> 00:43:17,990 But you might think that when you're using a garbage 1289 00:43:17,990 --> 00:43:19,650 collected language, like something like Java, that 1290 00:43:19,650 --> 00:43:20,980 these problems go away. 1291 00:43:20,980 --> 00:43:22,120 But they actually don't. 1292 00:43:22,120 --> 00:43:24,110 The problem is actually now memory allocations and 1293 00:43:24,110 --> 00:43:25,530 deallocations are more hidden from you. 1294 00:43:25,530 --> 00:43:27,750 So you have to be a little bit more careful about how you 1295 00:43:27,750 --> 00:43:29,810 analyze them. 
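To make the thread pool failure mode from this case study concrete, here is a toy sketch, with assumed names and structure rather than the real server code, of a fixed pool of worker slots where an exception that escapes a work item never gives its slot back:

```cpp
#include <functional>
#include <mutex>
#include <queue>

// Toy sketch of the failure mode: a fixed pool of worker slots, where an
// exception that escapes the work item skips the release step, so the
// slot is lost forever and later requests pile up in the queue.
class WorkerPool {
public:
    explicit WorkerPool(int slots) : freeSlots_(slots) {}

    void Submit(std::function<void()> work) {
        std::unique_lock<std::mutex> lock(lock_);
        if (freeSlots_ == 0) {
            pending_.push(std::move(work));   // queued in memory; grows forever
            return;                           // once no slot is ever released
        }
        --freeSlots_;
        lock.unlock();

        work();      // BUG: if this throws, we never reach the release below
        Release();

        // Fix: hold the slot in an RAII guard (or use try/catch) so it is
        // returned even when the work item throws.
    }

private:
    void Release() {
        std::lock_guard<std::mutex> guard(lock_);
        ++freeSlots_;
        // (draining pending_ is omitted in this sketch)
    }

    std::mutex lock_;
    int freeSlots_;
    std::queue<std::function<void()>> pending_;
};
```

With a fixed number of slots, every escaped exception permanently shrinks the pool; once it hits zero, every new request just accumulates in the pending queue, which shows up as the monotonic memory growth described above.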
1296 00:43:29,810 --> 00:43:31,080 Case study number seven. 1297 00:43:31,080 --> 00:43:32,940 So we've looked a lot of things related to CPU usage. 1298 00:43:32,940 --> 00:43:35,380 We've looked at a lot of things related to memory. 1299 00:43:35,380 --> 00:43:37,260 Now let's look a little bit at the network. 1300 00:43:37,260 --> 00:43:38,240 Here was the problem. 1301 00:43:38,240 --> 00:43:40,050 I have a user, our Canonical admin. 1302 00:43:40,050 --> 00:43:42,780 The user wants to perform an operation on some virtual 1303 00:43:42,780 --> 00:43:43,810 machine somewhere. 1304 00:43:43,810 --> 00:43:44,920 Remember the flow. 1305 00:43:44,920 --> 00:43:47,430 Issue request, it goes to some server. 1306 00:43:47,430 --> 00:43:49,910 The server goes to the host where that VM is. 1307 00:43:49,910 --> 00:43:51,570 Does some operation, comes back. 1308 00:43:51,570 --> 00:43:52,960 So this user wants to perform an operation 1309 00:43:52,960 --> 00:43:54,000 on a virtual machine. 1310 00:43:54,000 --> 00:43:59,260 I had setup A, In setup A, this operation end to end took 1311 00:43:59,260 --> 00:44:00,710 about eight seconds. 1312 00:44:00,710 --> 00:44:01,280 Fine. 1313 00:44:01,280 --> 00:44:03,890 That doesn't mean much until you compare to setup B, where 1314 00:44:03,890 --> 00:44:05,340 it took twice the amount of time. 1315 00:44:05,340 --> 00:44:07,300 It took 16 seconds, not 8 seconds. 1316 00:44:07,300 --> 00:44:09,560 So now the question you're asking yourself is what's the 1317 00:44:09,560 --> 00:44:11,620 difference between setups A and B. 1318 00:44:11,620 --> 00:44:14,600 In setups A and B, the only difference is that this middle 1319 00:44:14,600 --> 00:44:17,560 server is a different version number. 1320 00:44:17,560 --> 00:44:19,520 But everything else is exactly the same. 1321 00:44:19,520 --> 00:44:22,120 So with this different version you somehow manage to make 1322 00:44:22,120 --> 00:44:25,080 your performance twice as bad, which is a really bad thing. 1323 00:44:25,080 --> 00:44:26,590 We got to get to the bottom of that. 1324 00:44:26,590 --> 00:44:28,610 And in addition, it's not like it was some random thing that 1325 00:44:28,610 --> 00:44:29,120 happened once. 1326 00:44:29,120 --> 00:44:30,480 It happened every single time. 1327 00:44:30,480 --> 00:44:33,540 So now, let's take a look at what was going on. 1328 00:44:33,540 --> 00:44:36,430 So again, a lot of times, I tend to do things in a top 1329 00:44:36,430 --> 00:44:38,440 down manner, where I say what's the easiest information 1330 00:44:38,440 --> 00:44:39,280 to collect. 1331 00:44:39,280 --> 00:44:40,180 And then, OK. 1332 00:44:40,180 --> 00:44:41,380 What's more difficult, what's more difficult? 1333 00:44:41,380 --> 00:44:43,850 And finally, if you have to use really industrial strength 1334 00:44:43,850 --> 00:44:44,670 tools, you use them. 1335 00:44:44,670 --> 00:44:45,700 But it might not be necessary. 1336 00:44:45,700 --> 00:44:47,980 In this case, my first thing was just to use logging and 1337 00:44:47,980 --> 00:44:50,060 try to figure out where time was being spent. 1338 00:44:50,060 --> 00:44:51,790 Remember, we've got a client talking to a server, talking 1339 00:44:51,790 --> 00:44:52,630 to an end host. 1340 00:44:52,630 --> 00:44:55,570 In this case, the latency imposed by that client was 1341 00:44:55,570 --> 00:44:58,010 exactly the same in these two setups. 1342 00:44:58,010 --> 00:45:00,940 So that's not where this eight seconds is being lost. 
1343 00:45:00,940 --> 00:45:03,920 The amount of time being claimed by the server was 1344 00:45:03,920 --> 00:45:05,340 exactly the same. 1345 00:45:05,340 --> 00:45:06,010 I'm a client. 1346 00:45:06,010 --> 00:45:06,630 I go to a server. 1347 00:45:06,630 --> 00:45:09,180 It then goes to a host. 1348 00:45:09,180 --> 00:45:12,710 However, the time spent on the host was different. 1349 00:45:12,710 --> 00:45:14,710 That's where all of the time was being spent. 1350 00:45:14,710 --> 00:45:17,230 Now, you're probably looking at-- well, you may not be. 1351 00:45:17,230 --> 00:45:18,220 But I'll tell you to look now. 1352 00:45:18,220 --> 00:45:21,340 We had setup A. Setup A was a client and the server were 1353 00:45:21,340 --> 00:45:22,470 each at some version number. 1354 00:45:22,470 --> 00:45:24,710 And the host was at the same version number. 1355 00:45:24,710 --> 00:45:27,110 In setup B, the client and the server were at a different 1356 00:45:27,110 --> 00:45:29,160 version number with respect to the host. 1357 00:45:29,160 --> 00:45:30,890 But we just said all of the time is being 1358 00:45:30,890 --> 00:45:31,860 spent in the host. 1359 00:45:31,860 --> 00:45:33,520 So why is that going on? 1360 00:45:33,520 --> 00:45:35,340 And this is where understanding the architecture 1361 00:45:35,340 --> 00:45:35,830 is important. 1362 00:45:35,830 --> 00:45:37,940 It turns out in this case, the first thing that happens when 1363 00:45:37,940 --> 00:45:39,830 you have a difference in version number is that the 1364 00:45:39,830 --> 00:45:42,870 server helpfully talks to the end host and says you know 1365 00:45:42,870 --> 00:45:44,010 what, you're in a different version number. 1366 00:45:44,010 --> 00:45:47,090 So let me give you some code to run, because you're in a 1367 00:45:47,090 --> 00:45:47,810 different version number. 1368 00:45:47,810 --> 00:45:50,030 So in fact, even though the host is at the same version 1369 00:45:50,030 --> 00:45:52,440 number, there's a small shim running on that host that 1370 00:45:52,440 --> 00:45:53,990 makes up for the fact that it's not at the 1371 00:45:53,990 --> 00:45:56,110 same version number. 1372 00:45:56,110 --> 00:45:58,420 So now let's analyze this host and figure out where is this 1373 00:45:58,420 --> 00:46:00,350 time being spent. 1374 00:46:00,350 --> 00:46:02,020 I did a little bit more logging on the host. 1375 00:46:02,020 --> 00:46:03,880 The reason I did this, because a lot of the standard tools 1376 00:46:03,880 --> 00:46:06,360 that you might use, like gprof or whatever, these things 1377 00:46:06,360 --> 00:46:06,920 didn't exist. 1378 00:46:06,920 --> 00:46:08,050 We have our own kernel. 1379 00:46:08,050 --> 00:46:10,590 Which therefore, we have our own tools and our own 1380 00:46:10,590 --> 00:46:12,200 nightmares, and headaches associated with it. 1381 00:46:12,200 --> 00:46:14,440 So we had to use these tools instead of commercially 1382 00:46:14,440 --> 00:46:15,680 available tools. 1383 00:46:15,680 --> 00:46:16,980 When you use these tools, we ended up 1384 00:46:16,980 --> 00:46:17,880 narrowing down the time. 1385 00:46:17,880 --> 00:46:18,850 Here's what goes on. 1386 00:46:18,850 --> 00:46:19,360 I'm a client. 1387 00:46:19,360 --> 00:46:20,480 And I talk to the server. 1388 00:46:20,480 --> 00:46:22,430 The server talks to an agent on the host. 1389 00:46:22,430 --> 00:46:24,680 It says, agent, this is what I want to do. 
1390 00:46:24,680 --> 00:46:26,980 The agent then talks to the hardware abstraction layer and 1391 00:46:26,980 --> 00:46:28,780 says, dude, this is what I want to do. 1392 00:46:28,780 --> 00:46:30,600 The hardware abstraction layer comes back to the agent and 1393 00:46:30,600 --> 00:46:31,390 says, done. 1394 00:46:31,390 --> 00:46:32,670 Agent goes back to the server. 1395 00:46:32,670 --> 00:46:34,790 Server goes back to the client, blah, blah, blah. 1396 00:46:34,790 --> 00:46:37,340 It turns out that this interaction on the host 1397 00:46:37,340 --> 00:46:39,630 between the agent that accepted the call from the 1398 00:46:39,630 --> 00:46:42,170 server and the hardware abstraction layer, that was 1399 00:46:42,170 --> 00:46:43,820 where all of the time was being spent. 1400 00:46:43,820 --> 00:46:45,640 It was 10 milliseconds in one case. 1401 00:46:45,640 --> 00:46:48,610 And 20 times that length in the other case. 1402 00:46:48,610 --> 00:46:51,280 And so this was a pretty big difference not to be ignored. 1403 00:46:51,280 --> 00:46:53,950 So we wanted to look at why that was going on. 1404 00:46:53,950 --> 00:46:55,660 So here, it's kind of a dumb thing. 1405 00:46:55,660 --> 00:46:56,980 But the next thing I did was, well, let's take a look 1406 00:46:56,980 --> 00:46:57,620 at the configuration. 1407 00:46:57,620 --> 00:46:58,910 Maybe I did something stupid that would have 1408 00:46:58,910 --> 00:47:00,680 caused this to happen. 1409 00:47:00,680 --> 00:47:01,370 Here's the deal. 1410 00:47:01,370 --> 00:47:03,220 In setup A, remember everything is at the same 1411 00:47:03,220 --> 00:47:03,990 version number. 1412 00:47:03,990 --> 00:47:06,490 And it turns out this agent-HAL communication occurs over 1413 00:47:06,490 --> 00:47:07,780 a named pipe. 1414 00:47:07,780 --> 00:47:09,220 OK, whatever. 1415 00:47:09,220 --> 00:47:10,310 It turns out in setup B-- 1416 00:47:10,310 --> 00:47:12,280 remember the client and the server were at a different 1417 00:47:12,280 --> 00:47:14,570 version from the host, so the server has to download some 1418 00:47:14,570 --> 00:47:16,400 shim code onto that host so that 1419 00:47:16,400 --> 00:47:18,270 everybody can talk correctly. 1420 00:47:18,270 --> 00:47:21,520 It turns out that shim code was communicating with the 1421 00:47:21,520 --> 00:47:23,810 hardware abstraction layer-- instead of over a named pipe-- 1422 00:47:23,810 --> 00:47:27,030 it was communicating over TCP/IP. 1423 00:47:27,030 --> 00:47:30,590 Now I have to give credit to my colleague on this. 1424 00:47:30,590 --> 00:47:33,680 As soon as I showed him this information, he hit a flash of 1425 00:47:33,680 --> 00:47:35,170 inspiration. 1426 00:47:35,170 --> 00:47:37,960 It turns out that when you're using the named pipe and it 1427 00:47:37,960 --> 00:47:40,810 takes 10 milliseconds, the reason that's different from 1428 00:47:40,810 --> 00:47:43,630 TCP/IP communication is because of something called 1429 00:47:43,630 --> 00:47:44,800 Nagle's algorithm. 1430 00:47:44,800 --> 00:47:47,560 Nagle's algorithm basically says, you want to accumulate 1431 00:47:47,560 --> 00:47:50,450 data before you send it so you reduce the number of round 1432 00:47:50,450 --> 00:47:51,840 trips and the overhead of creating a 1433 00:47:51,840 --> 00:47:52,570 connection, and whatever. 1434 00:47:52,570 --> 00:47:54,910 And the problem was this was biting us. 
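The eventual fix comes down to a single socket option; as a minimal sketch, assuming a POSIX-style TCP socket rather than the actual shim code:

```cpp
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Disable Nagle's algorithm on a connected TCP socket so small writes go
// out immediately instead of being held back while the stack waits for
// more data (or an ACK) to coalesce them.
bool DisableNagle(int sockfd) {
    int flag = 1;
    return setsockopt(sockfd, IPPROTO_TCP, TCP_NODELAY,
                      &flag, sizeof(flag)) == 0;
}
```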
1435 00:47:54,910 --> 00:47:57,070 Essentially, we were waiting within that shim to collect 1436 00:47:57,070 --> 00:47:59,180 enough data to send it back to that HAL. 1437 00:47:59,180 --> 00:48:02,010 And we were waiting fairly needlessly, because what 1438 00:48:02,010 --> 00:48:04,330 happens is that this algorithm waits for a 1439 00:48:04,330 --> 00:48:05,010 certain amount of time. 1440 00:48:05,010 --> 00:48:06,490 And then if it hasn't gotten more data, it goes 1441 00:48:06,490 --> 00:48:07,070 ahead and sends it. 1442 00:48:07,070 --> 00:48:09,820 So we are basically stuck, waiting for this time out to 1443 00:48:09,820 --> 00:48:11,960 occur before the data actually got transferred. 1444 00:48:11,960 --> 00:48:13,640 And there's a very simple way to get rid of that. 1445 00:48:13,640 --> 00:48:14,990 You just tell it, hey, don't use Nagle's algorithm. 1446 00:48:14,990 --> 00:48:15,760 We don't care. 1447 00:48:15,760 --> 00:48:18,300 And that solves the entire problem. 1448 00:48:18,300 --> 00:48:21,110 So the point is once you get rid of that, the performance 1449 00:48:21,110 --> 00:48:21,860 is exactly the same. 1450 00:48:21,860 --> 00:48:23,220 It's a very simple configuration setting. 1451 00:48:23,220 --> 00:48:25,490 But that's again, a situation where somebody-- 1452 00:48:25,490 --> 00:48:27,720 and in fact, I remember asking why this 1453 00:48:27,720 --> 00:48:28,850 particular change was made. 1454 00:48:28,850 --> 00:48:30,190 And again, it was, well, we didn't think it was going to 1455 00:48:30,190 --> 00:48:31,470 be a big deal. 1456 00:48:31,470 --> 00:48:33,470 And that ends up being, again, a very bad way of doing 1457 00:48:33,470 --> 00:48:34,440 performance engineering. 1458 00:48:34,440 --> 00:48:37,720 So definitely, the lesson that I learned here is just like 1459 00:48:37,720 --> 00:48:39,300 the lesson with 32-bit, 64-bit. 1460 00:48:39,300 --> 00:48:41,550 You might think that a little change like this 1461 00:48:41,550 --> 00:48:42,280 doesn't mean anything. 1462 00:48:42,280 --> 00:48:42,920 But it really does. 1463 00:48:42,920 --> 00:48:44,650 And here, it's actually important to understand the 1464 00:48:44,650 --> 00:48:45,850 entire stack end to end. 1465 00:48:45,850 --> 00:48:46,960 Because otherwise, you're never going to be able to 1466 00:48:46,960 --> 00:48:48,790 figure out where the problem actually was occurring. 1467 00:48:51,560 --> 00:48:53,930 I already alluded to a previous situation where we 1468 00:48:53,930 --> 00:48:55,390 had a correctness problem, which masked as 1469 00:48:55,390 --> 00:48:56,150 a performance problem. 1470 00:48:56,150 --> 00:48:58,290 In the previous case, it was because I had a 1471 00:48:58,290 --> 00:48:59,270 limited thread pool. 1472 00:48:59,270 --> 00:49:01,150 And whenever I ran out of threads, I'd queue everything. 1473 00:49:01,150 --> 00:49:02,860 And at some point, I had a correctness problem where I 1474 00:49:02,860 --> 00:49:04,790 kept killing threads and never resurrecting them. 1475 00:49:04,790 --> 00:49:06,060 And so I ended up queuing everything. 1476 00:49:06,060 --> 00:49:07,370 And everything went to pot. 1477 00:49:07,370 --> 00:49:09,360 Now let me talk about a different situation. 1478 00:49:09,360 --> 00:49:11,830 Here I've got a poor little customer, in fact that 1479 00:49:11,830 --> 00:49:12,950 customer was me. 1480 00:49:12,950 --> 00:49:15,300 I was powering on a virtual machine. 
1481 00:49:15,300 --> 00:49:17,015 Remember it goes to the server, goes to host, powers 1482 00:49:17,015 --> 00:49:19,380 on the virtual machine, tells me it's done. 1483 00:49:19,380 --> 00:49:22,110 Every other time I would do this, it took five seconds, 1484 00:49:22,110 --> 00:49:23,970 which is pretty OK. 1485 00:49:23,970 --> 00:49:26,290 In comparison, every other time I did it, it would take 1486 00:49:26,290 --> 00:49:27,780 about five minutes. 1487 00:49:27,780 --> 00:49:29,120 That's a pretty huge expansion. 1488 00:49:29,120 --> 00:49:31,040 And if a customer called me and said this, I would 1489 00:49:31,040 --> 00:49:33,150 basically be in deep doggy do-do. 1490 00:49:33,150 --> 00:49:35,080 So it was important to try to figure out what was 1491 00:49:35,080 --> 00:49:36,020 actually going on here. 1492 00:49:36,020 --> 00:49:38,530 So why does powering on a VM have such a variable 1493 00:49:38,530 --> 00:49:39,950 performance? 1494 00:49:39,950 --> 00:49:42,030 Well, the key here, again, is to understand what's the end 1495 00:49:42,030 --> 00:49:43,340 to end path for things. 1496 00:49:43,340 --> 00:49:45,190 In this case, what I know is that when you power on a 1497 00:49:45,190 --> 00:49:47,400 virtual machine, one of the things you have to do is 1498 00:49:47,400 --> 00:49:49,930 allocate some space on disk to write swap 1499 00:49:49,930 --> 00:49:51,180 information and whatever. 1500 00:49:51,180 --> 00:49:53,410 So one of the first places I looked was I said, well, let 1501 00:49:53,410 --> 00:49:55,410 me understand how the disk is performing in each of these 1502 00:49:55,410 --> 00:49:57,220 cases to try to figure out what's going on. 1503 00:49:57,220 --> 00:49:59,450 It's a mixture of common sense and knowing what's going on, 1504 00:49:59,450 --> 00:50:01,550 and then knowing where to look. 1505 00:50:01,550 --> 00:50:04,190 So let me let you in on a little secret. 1506 00:50:04,190 --> 00:50:07,020 In the real world, if you're using some sort of disk, if 1507 00:50:07,020 --> 00:50:09,820 your disk latency is basically 10 milliseconds or less, 1508 00:50:09,820 --> 00:50:11,080 you're more or less golden. 1509 00:50:11,080 --> 00:50:12,410 That's not a problem. 1510 00:50:12,410 --> 00:50:15,300 If your latency is between 10 and 20 milliseconds, maybe 1511 00:50:15,300 --> 00:50:17,970 that's acceptable because you have a complicated topology. 1512 00:50:17,970 --> 00:50:19,910 You're some multinational corporation with sites 1513 00:50:19,910 --> 00:50:21,140 everywhere, whatever. 1514 00:50:21,140 --> 00:50:23,610 If your latency is getting to the point where they're 50 1515 00:50:23,610 --> 00:50:26,350 milliseconds, then you probably 1516 00:50:26,350 --> 00:50:27,180 have to start worrying. 1517 00:50:27,180 --> 00:50:28,940 That's actually quite a bit high. 1518 00:50:28,940 --> 00:50:31,870 If your latency is greater than, say, one second, that is 1519 00:50:31,870 --> 00:50:34,240 staggeringly bad. 1520 00:50:34,240 --> 00:50:38,290 In this chart, I show you the latency versus time. 1521 00:50:38,290 --> 00:50:40,890 And what this chart shows you is that every single time 1522 00:50:40,890 --> 00:50:43,900 step, it says in the last five minutes, what was the highest 1523 00:50:43,900 --> 00:50:46,920 disk latency that I saw. 1524 00:50:46,920 --> 00:50:49,160 And remember our rules of thumb. 1525 00:50:49,160 --> 00:50:50,440 10 milliseconds, good. 1526 00:50:50,440 --> 00:50:52,150 20 milliseconds, not so good. 
1527 00:50:52,150 --> 00:50:54,110 50 milliseconds, not so good. 1528 00:50:54,110 --> 00:50:57,530 This one is 1,100 milliseconds. 1529 00:50:57,530 --> 00:51:00,330 That's like you could walk to the disk, get the data, and 1530 00:51:00,330 --> 00:51:02,940 bring it back faster than what it's actually doing. 1531 00:51:02,940 --> 00:51:04,070 That's when you've got to call a priest. 1532 00:51:04,070 --> 00:51:04,240 OK. 1533 00:51:04,240 --> 00:51:05,010 That's horrible. 1534 00:51:05,010 --> 00:51:06,930 So this was the reason that five 1535 00:51:06,930 --> 00:51:08,470 seconds versus five minutes. 1536 00:51:08,470 --> 00:51:10,510 So now the question is why is this going on? 1537 00:51:10,510 --> 00:51:14,300 Why am I seeing such unbelievably bad disk latency? 1538 00:51:14,300 --> 00:51:16,390 And remember, the title of this case study was 1539 00:51:16,390 --> 00:51:20,080 "Correctness Impacts Performance." So 1540 00:51:20,080 --> 00:51:21,170 let's take a look. 1541 00:51:21,170 --> 00:51:23,930 What I do here is I take a look at all of the events that 1542 00:51:23,930 --> 00:51:26,710 are happening with respect to that disk. 1543 00:51:26,710 --> 00:51:27,540 You're not going to understand this. 1544 00:51:27,540 --> 00:51:28,890 It's some junk related to our product. 1545 00:51:28,890 --> 00:51:30,880 But I'm going to circle the relevant portion. 1546 00:51:30,880 --> 00:51:32,180 You'll notice a little line that says 1547 00:51:32,180 --> 00:51:34,360 lost access to volume. 1548 00:51:34,360 --> 00:51:35,170 Here's the deal. 1549 00:51:35,170 --> 00:51:37,480 I send a request to the disk. 1550 00:51:37,480 --> 00:51:40,190 It turns out the disk is connected over some channel to 1551 00:51:40,190 --> 00:51:42,530 some channel, a disk controller, which is talking 1552 00:51:42,530 --> 00:51:43,410 to some disk. 1553 00:51:43,410 --> 00:51:47,440 It turns out, I kept losing access to that disk. 1554 00:51:47,440 --> 00:51:49,600 Whenever I would lose access, it would keep retrying. 1555 00:51:49,600 --> 00:51:51,610 And that's why ultimately, yes, it would complete. 1556 00:51:51,610 --> 00:51:54,030 But it would take over a second per disk request. 1557 00:51:57,230 --> 00:51:59,560 In the situations where I was not seeing that loss of 1558 00:51:59,560 --> 00:52:01,720 connectivity, things were occurring at the order of 1559 00:52:01,720 --> 00:52:04,820 milliseconds and my power-on would take five seconds. 1560 00:52:04,820 --> 00:52:07,780 So the point here is that in this case, it was actually a 1561 00:52:07,780 --> 00:52:10,020 correctness problem with my controller. 1562 00:52:10,020 --> 00:52:11,970 And the controller would basically flake out and not 1563 00:52:11,970 --> 00:52:13,280 talk to the disk properly. 1564 00:52:13,280 --> 00:52:15,380 And this manifested itself in a performance problem. 1565 00:52:15,380 --> 00:52:18,390 Once I changed the disk controller, performance 1566 00:52:18,390 --> 00:52:19,490 problems went away. 1567 00:52:19,490 --> 00:52:22,250 So if you're doing this kind of debugging, sometimes you 1568 00:52:22,250 --> 00:52:23,760 have to eliminate-- 1569 00:52:23,760 --> 00:52:25,450 I guess it's a Sherlock Holmesism-- that you eliminate 1570 00:52:25,450 --> 00:52:28,060 the obvious and whatever's left must be it. 1571 00:52:28,060 --> 00:52:31,200 A good example of this is my ex-officemate, he used to ask 1572 00:52:31,200 --> 00:52:33,380 this question whenever he'd interview somebody for VMware. 
1573 00:52:33,380 --> 00:52:36,680 He'd say, OK, you have a networking problem. 1574 00:52:36,680 --> 00:52:38,090 Where is the first place you look? 1575 00:52:38,090 --> 00:52:40,400 And everybody that's coming in with a grad degree or 1576 00:52:40,400 --> 00:52:41,810 whatever, they're like, I would look at the stack. 1577 00:52:41,810 --> 00:52:43,640 I'd look at the CPU. 1578 00:52:43,640 --> 00:52:45,190 I'd look at cache counters, or whatever. 1579 00:52:45,190 --> 00:52:46,720 And my officemate would just look at them and say, why 1580 00:52:46,720 --> 00:52:49,450 don't you just check the cable and see if it's plugged in. 1581 00:52:49,450 --> 00:52:51,360 And surprisingly, it's so dumb. 1582 00:52:51,360 --> 00:52:53,970 But when you're talking to customers, they're like, I 1583 00:52:53,970 --> 00:52:54,310 swear I did that. 1584 00:52:54,310 --> 00:52:55,530 I swear I did that. 1585 00:52:55,530 --> 00:52:57,890 Their swearing doesn't necessarily mean anything, 1586 00:52:57,890 --> 00:52:59,770 unless it's directed at you. 1587 00:52:59,770 --> 00:53:01,220 So that's case study nine-- 1588 00:53:01,220 --> 00:53:03,750 eight. 1589 00:53:03,750 --> 00:53:05,620 The last two case studies are going to be specific to 1590 00:53:05,620 --> 00:53:06,790 virtualization. 1591 00:53:06,790 --> 00:53:08,930 And so what I'm going to do is talk a little bit about how 1592 00:53:08,930 --> 00:53:12,460 the CPU scheduler works in the VMware server class product. 1593 00:53:12,460 --> 00:53:14,830 And then I'll talk about the last two case studies. 1594 00:53:14,830 --> 00:53:17,890 So actually how many of you have used VMware before? 1595 00:53:17,890 --> 00:53:20,310 Just a show of hands. 1596 00:53:20,310 --> 00:53:21,230 Thankfully, Saman has used it. 1597 00:53:21,230 --> 00:53:21,810 That's good. 1598 00:53:21,810 --> 00:53:26,960 So basically, the idea is that you're multiplexing time on 1599 00:53:26,960 --> 00:53:29,250 the CPU between different virtual machines that are 1600 00:53:29,250 --> 00:53:31,310 running on the CPU. 1601 00:53:31,310 --> 00:53:33,280 So what I show here is I show ESX, which is our 1602 00:53:33,280 --> 00:53:34,700 server class product. 1603 00:53:34,700 --> 00:53:36,030 Basically, think of that as the hardware. 1604 00:53:36,030 --> 00:53:38,160 So you're running virtual machines on hardware. 1605 00:53:38,160 --> 00:53:40,860 That hardware has, in this particular case, 1606 00:53:40,860 --> 00:53:43,410 four physical CPUs. 1607 00:53:43,410 --> 00:53:44,070 No problem. 1608 00:53:44,070 --> 00:53:45,080 You've got four physical CPUs. 1609 00:53:45,080 --> 00:53:48,110 And you want to run virtual machines on this hardware. 1610 00:53:48,110 --> 00:53:50,430 So you've got one virtual machine. 1611 00:53:50,430 --> 00:53:51,560 Well, that's fine. 1612 00:53:51,560 --> 00:53:52,530 There's four CPUs. 1613 00:53:52,530 --> 00:53:53,650 One virtual machine. 1614 00:53:53,650 --> 00:53:54,810 No problem. 1615 00:53:54,810 --> 00:53:56,170 Plenty of CPU to go around. 1616 00:53:56,170 --> 00:53:57,170 Now you got another one. 1617 00:53:57,170 --> 00:53:58,100 Still no problem. 1618 00:53:58,100 --> 00:54:00,250 They're not chained to any given CPU. 1619 00:54:00,250 --> 00:54:01,790 And there's plenty of CPU to go around. 1620 00:54:01,790 --> 00:54:02,770 No problem. 1621 00:54:02,770 --> 00:54:04,030 Same with three and four. 1622 00:54:04,030 --> 00:54:06,080 You got all these four virtual machines.
1623 00:54:06,080 --> 00:54:07,080 It's not a one-to-one 1624 00:54:07,080 --> 00:54:08,500 correspondence in terms of CPU. 1625 00:54:08,500 --> 00:54:09,530 We don't work that way. 1626 00:54:09,530 --> 00:54:11,030 But there's plenty-- the point is, there's plenty 1627 00:54:11,030 --> 00:54:13,170 of CPU to go around. 1628 00:54:13,170 --> 00:54:18,060 Now, when VMs are basically available, and 1629 00:54:18,060 --> 00:54:21,210 there's hardware on which they can run, and they're using 1630 00:54:21,210 --> 00:54:23,240 the hardware, that's called the run state. 1631 00:54:23,240 --> 00:54:24,750 So running means I'm golden. 1632 00:54:24,750 --> 00:54:26,930 I'm using the hardware. 1633 00:54:26,930 --> 00:54:30,520 Now, let's add another VM. 1634 00:54:30,520 --> 00:54:33,800 I put it in red, because this poor VM, he wants to run. 1635 00:54:33,800 --> 00:54:36,860 But you'll see that every single CPU is actually taken 1636 00:54:36,860 --> 00:54:39,650 up by another virtual machine. 1637 00:54:39,650 --> 00:54:40,890 So even if this virtual machine 1638 00:54:40,890 --> 00:54:42,700 wants to run, it can't. 1639 00:54:42,700 --> 00:54:44,230 It has to wait. 1640 00:54:44,230 --> 00:54:48,810 The time that it spends ready to use the CPU, but unable to 1641 00:54:48,810 --> 00:54:52,220 use the CPU, because there's no CPU available, that's 1642 00:54:52,220 --> 00:54:53,870 called ready time. 1643 00:54:53,870 --> 00:54:55,950 In some sense, ready time plus run time 1644 00:54:55,950 --> 00:54:57,810 is kind of the demand. 1645 00:54:57,810 --> 00:54:59,160 But anyway, the point is it's sitting around 1646 00:54:59,160 --> 00:55:00,550 waiting to use the CPU. 1647 00:55:00,550 --> 00:55:02,290 But it can't, because other people are using it. 1648 00:55:02,290 --> 00:55:03,360 So it has to wait its turn. 1649 00:55:03,360 --> 00:55:04,910 It waits its turn. 1650 00:55:04,910 --> 00:55:07,950 And eventually, our ESX scheduler is smart enough to 1651 00:55:07,950 --> 00:55:10,050 realize that it has to fix the situation. 1652 00:55:10,050 --> 00:55:12,930 So it deschedules a VM-- in this case, VM 1-- and goes 1653 00:55:12,930 --> 00:55:15,120 ahead and schedules that other virtual machine that was 1654 00:55:15,120 --> 00:55:16,440 previously waiting. 1655 00:55:16,440 --> 00:55:20,400 Now predictably, the VM that's now ready to run but can't is 1656 00:55:20,400 --> 00:55:21,880 in the ready state. 1657 00:55:21,880 --> 00:55:24,530 And the VM that formerly was in the ready state is now 1658 00:55:24,530 --> 00:55:25,790 using the CPU. 1659 00:55:25,790 --> 00:55:27,050 So it's in the run state. 1660 00:55:27,050 --> 00:55:29,260 It's very happy. 1661 00:55:29,260 --> 00:55:31,140 The final thing that I should point out is that VMs don't 1662 00:55:31,140 --> 00:55:32,440 have to be in the ready state. 1663 00:55:32,440 --> 00:55:34,213 They don't have to be either using the CPU or 1664 00:55:34,213 --> 00:55:35,200 ready to use the CPU. 1665 00:55:35,200 --> 00:55:37,520 Some VMs are voluntarily descheduled, because they have 1666 00:55:37,520 --> 00:55:38,670 nothing to do. 1667 00:55:38,670 --> 00:55:41,260 Some VMs are descheduled, because they're waiting-- 1668 00:55:41,260 --> 00:55:43,300 they're basically waiting for an IO to complete. 1669 00:55:43,300 --> 00:55:44,730 And so they don't need to use the CPU. 1670 00:55:44,730 --> 00:55:46,470 And that's what I depict by VM6.
1671 00:55:46,470 --> 00:55:50,160 These are in what's called the wait or idle states. 1672 00:55:50,160 --> 00:55:53,430 So hopefully, you now have a fairly rudimentary 1673 00:55:53,430 --> 00:55:56,080 understanding of what our CPU scheduler is doing. 1674 00:55:56,080 --> 00:55:58,200 Let's talk a little bit about a very simple performance 1675 00:55:58,200 --> 00:56:00,310 problem related to this. 1676 00:56:00,310 --> 00:56:00,930 And I call this-- 1677 00:56:00,930 --> 00:56:02,910 But It's Only a small probe VM. 1678 00:56:02,910 --> 00:56:04,710 I'll explain what this means in a minute. 1679 00:56:04,710 --> 00:56:07,590 Essentially, you've got two of these hosts, two ESX hosts 1680 00:56:07,590 --> 00:56:09,840 that both want to run VMs. 1681 00:56:09,840 --> 00:56:13,250 And I've got a user, depicted by the user icon. 1682 00:56:13,250 --> 00:56:15,080 And he's talking to this one VM. 1683 00:56:15,080 --> 00:56:18,420 Let's call this VM the vSphere VM. 1684 00:56:18,420 --> 00:56:19,800 Please pardon the nomenclature. 1685 00:56:19,800 --> 00:56:20,520 It doesn't really matter. 1686 00:56:20,520 --> 00:56:22,380 But anyway, he's talking to this vSphere VM 1687 00:56:22,380 --> 00:56:23,490 on one of the hosts. 1688 00:56:23,490 --> 00:56:26,070 Now, it turns out that that VM that he's talking to, as I 1689 00:56:26,070 --> 00:56:28,880 depicted in the diagram, is talking to another virtual 1690 00:56:28,880 --> 00:56:33,420 machine, which I have labeled vSphere database. 1691 00:56:33,420 --> 00:56:35,290 So it turns out, this vSphere VM is talking to a database 1692 00:56:35,290 --> 00:56:37,410 over what's called an ODBC connection. 1693 00:56:37,410 --> 00:56:38,170 That doesn't matter. 1694 00:56:38,170 --> 00:56:39,350 Don't worry about that. 1695 00:56:39,350 --> 00:56:42,080 And it turns out that on that other host where the database 1696 00:56:42,080 --> 00:56:44,810 virtual machine is running, we have another virtual machine. 1697 00:56:44,810 --> 00:56:47,630 It's called the probe virtual machine. 1698 00:56:47,630 --> 00:56:50,680 The job of that probe virtual machine is to sniff any 1699 00:56:50,680 --> 00:56:52,020 traffic going into the database. 1700 00:56:55,090 --> 00:56:56,800 So as I've said, the vSphere VM 1701 00:56:56,800 --> 00:56:58,960 communicates with the database. 1702 00:56:58,960 --> 00:57:01,750 This probe VM is monitoring any traffic that's going to 1703 00:57:01,750 --> 00:57:04,160 that database virtual machine. 1704 00:57:04,160 --> 00:57:06,720 And therefore, as you can imagine, the more traffic 1705 00:57:06,720 --> 00:57:09,730 that's being sent between this vSphere VM and the database, 1706 00:57:09,730 --> 00:57:14,900 the more work that that probe VM has to do. 1707 00:57:14,900 --> 00:57:16,260 Here was the complaint. 1708 00:57:16,260 --> 00:57:19,840 My colleague called me up, and he said, hey, I'm having a 1709 00:57:19,840 --> 00:57:21,810 little bit of a problem with my vSphere VM. 1710 00:57:21,810 --> 00:57:25,390 All of a sudden, it's completely unresponsive. 1711 00:57:25,390 --> 00:57:28,060 So then I was like, OK, well, this is kind of silly. 1712 00:57:28,060 --> 00:57:29,770 Let's try to figure out what's going on. 1713 00:57:29,770 --> 00:57:31,580 So here's what I did. 1714 00:57:31,580 --> 00:57:35,720 I graphed various metrics, which I just spoke about, the 1715 00:57:35,720 --> 00:57:38,880 used time and the ready time, on this chart. 1716 00:57:38,880 --> 00:57:40,390 The time is on the x-axis. 
1717 00:57:40,390 --> 00:57:43,210 And the y-axis is in milliseconds. 1718 00:57:43,210 --> 00:57:46,980 In what I'm going to call the land before time, we have the ready 1719 00:57:46,980 --> 00:57:48,730 time of the database. 1720 00:57:48,730 --> 00:57:51,460 So look at the circle on the-- let me think, you're-- 1721 00:57:51,460 --> 00:57:52,970 on the left. 1722 00:57:52,970 --> 00:57:56,440 This is the ready time of that database virtual machine. 1723 00:57:56,440 --> 00:57:57,700 Remember what we said: when you're 1724 00:57:57,700 --> 00:57:58,850 accumulating ready time, 1725 00:57:58,850 --> 00:58:00,460 it means you really want to use the CPU. 1726 00:58:00,460 --> 00:58:03,300 But you can't because all the CPUs are already in use. 1727 00:58:03,300 --> 00:58:07,700 So it turns out that the first thing that happened when my 1728 00:58:07,700 --> 00:58:09,860 colleague called and said I'm having a problem, is I looked at 1729 00:58:09,860 --> 00:58:10,990 this chart, which is plotted versus time. 1730 00:58:10,990 --> 00:58:11,780 And I said, let me guess. 1731 00:58:11,780 --> 00:58:14,040 Are you having a problem at about such and such time, 1732 00:58:14,040 --> 00:58:15,870 which corresponds to the middle of this chart? 1733 00:58:15,870 --> 00:58:17,580 And he said, yeah, that's when I'm having a problem. 1734 00:58:17,580 --> 00:58:20,240 So you can tell that there are two very clearly demarcated 1735 00:58:20,240 --> 00:58:22,510 regions here. 1736 00:58:22,510 --> 00:58:24,760 On the left, you can see that this ready 1737 00:58:24,760 --> 00:58:26,120 time is around 12%. 1738 00:58:26,120 --> 00:58:29,610 Now if you're like me, a number that has no context 1739 00:58:29,610 --> 00:58:30,060 means nothing. 1740 00:58:30,060 --> 00:58:31,790 So 12% means nothing. 1741 00:58:31,790 --> 00:58:34,540 However, what I'd like to point out is that when things 1742 00:58:34,540 --> 00:58:39,130 got really bad, that number spiked up to 20%. 1743 00:58:39,130 --> 00:58:40,120 So here's the deal. 1744 00:58:40,120 --> 00:58:41,710 I'm a database virtual machine. 1745 00:58:41,710 --> 00:58:44,500 I want to use the physical hardware. 1746 00:58:44,500 --> 00:58:46,920 I can use it about 90% of the time, because my ready time's 1747 00:58:46,920 --> 00:58:48,110 about 12%, whatever. 1748 00:58:48,110 --> 00:58:50,960 But at some point, something happens where I now can only 1749 00:58:50,960 --> 00:58:52,600 use it 80% of the time. 1750 00:58:52,600 --> 00:58:55,580 And it turns out that that difference was enough to cause 1751 00:58:55,580 --> 00:58:59,760 this vSphere VM talking to this database to be perturbed. 1752 00:58:59,760 --> 00:59:01,850 And that ultimately was the customer's problem. 1753 00:59:01,850 --> 00:59:03,310 So the question is why. 1754 00:59:03,310 --> 00:59:05,200 And here is the reason. 1755 00:59:05,200 --> 00:59:06,360 Turns out, here's the deal. 1756 00:59:06,360 --> 00:59:08,450 Let's go back to the diagram so I can explain this. 1757 00:59:08,450 --> 00:59:11,810 Remember that the vSphere VM talks to the database VM. 1758 00:59:11,810 --> 00:59:14,760 Remember as well, that the more traffic there is, the 1759 00:59:14,760 --> 00:59:17,080 more work the probe VM has to do. 1760 00:59:17,080 --> 00:59:19,770 It turns out that what this user did at that critical 1761 00:59:19,770 --> 00:59:22,290 juncture was start a workload.
1762 00:59:22,290 --> 00:59:24,370 In starting that workload, he caused a lot of traffic 1763 00:59:24,370 --> 00:59:26,920 between the vSphere VM and the database, which therefore 1764 00:59:26,920 --> 00:59:30,460 caused a lot of work for this probe VM. 1765 00:59:30,460 --> 00:59:31,680 Well, what's happening? 1766 00:59:31,680 --> 00:59:34,110 The probe VM and the database VM are sharing the same 1767 00:59:34,110 --> 00:59:35,270 underlying CPUs. 1768 00:59:35,270 --> 00:59:37,470 So if one of them is running, the other one can't. 1769 00:59:37,470 --> 00:59:39,950 And that's why we saw this gentle increase in the amount 1770 00:59:39,950 --> 00:59:42,030 of ready time for the vSphere database VM. 1771 00:59:42,030 --> 00:59:44,330 Well, this has a cascading effect, because it can't get 1772 00:59:44,330 --> 00:59:46,380 as much time to use the CPU. 1773 00:59:46,380 --> 00:59:48,630 Any request that goes to it gets slowed down. 1774 00:59:48,630 --> 00:59:49,960 Where are the requests coming from? 1775 00:59:49,960 --> 00:59:51,720 They're coming from this vSphere VM. 1776 00:59:51,720 --> 00:59:52,950 Who's initiating those requests? 1777 00:59:52,950 --> 00:59:53,850 It's this poor admin. 1778 00:59:53,850 --> 00:59:55,650 The bottom line is that this admin is seeing poor 1779 00:59:55,650 --> 00:59:57,980 responsiveness because some doofus put his probe VM on 1780 00:59:57,980 --> 00:59:59,740 the same host as the database, because he wanted to see what was going 1781 00:59:59,740 --> 01:00:01,530 on in the database. 1782 01:00:01,530 --> 01:00:02,650 In fairness, I was that doofus. 1783 01:00:02,650 --> 01:00:03,360 But that's fine. 1784 01:00:03,360 --> 01:00:04,490 We needed that, because we had to debug 1785 01:00:04,490 --> 01:00:07,150 other performance problems. 1786 01:00:07,150 --> 01:00:09,970 So the point is that one of the things that's kind of 1787 01:00:09,970 --> 01:00:13,580 important in this particular case is that whenever you 1788 01:00:13,580 --> 01:00:15,340 introduce any sort of monitoring-- it's like a 1789 01:00:15,340 --> 01:00:17,820 Heisenberg thing-- it's going to have a performance impact. 1790 01:00:17,820 --> 01:00:19,010 And understanding that performance 1791 01:00:19,010 --> 01:00:20,180 impact is pretty key. 1792 01:00:20,180 --> 01:00:21,410 It might even shift where the 1793 01:00:21,410 --> 01:00:24,120 bottlenecks are in your system. 1794 01:00:24,120 --> 01:00:26,330 If your monitoring has a high cost in terms of CPU, it might 1795 01:00:26,330 --> 01:00:29,090 shift the bottleneck of whatever's running to the CPU, 1796 01:00:29,090 --> 01:00:30,230 where it wasn't before. 1797 01:00:30,230 --> 01:00:32,150 So it's kind of important to understand what you're 1798 01:00:32,150 --> 01:00:34,660 monitoring and how you're monitoring it. 1799 01:00:34,660 --> 01:00:34,950 OK. 1800 01:00:34,950 --> 01:00:36,230 We're almost done. 1801 01:00:36,230 --> 01:00:38,560 This was a really curious problem. 1802 01:00:38,560 --> 01:00:39,800 We have a customer. 1803 01:00:39,800 --> 01:00:43,860 And that customer was performing a load test on a server. 1804 01:00:43,860 --> 01:00:46,040 And in that load test, what they would do is they would 1805 01:00:46,040 --> 01:00:48,190 just keep attaching clients to that end server 1806 01:00:48,190 --> 01:00:50,540 and run some test.
1807 01:00:50,540 --> 01:00:53,370 At some point, as they attached more clients, their 1808 01:00:53,370 --> 01:00:56,070 workload performance suffered 1809 01:00:56,070 --> 01:00:59,700 even though there was no metric that was saturated. 1810 01:00:59,700 --> 01:01:00,910 And so the question was why. 1811 01:01:00,910 --> 01:01:02,890 Why, when you keep adding clients and nothing is saturated, does 1812 01:01:02,890 --> 01:01:05,140 performance suffer? 1813 01:01:05,140 --> 01:01:07,730 Well, this chart's kind of an eyesore. 1814 01:01:07,730 --> 01:01:09,770 But let me explain it. 1815 01:01:09,770 --> 01:01:11,360 The first thing we do is we've got some profiling 1816 01:01:11,360 --> 01:01:12,020 information. 1817 01:01:12,020 --> 01:01:13,910 On this chart, I show two things. 1818 01:01:13,910 --> 01:01:16,360 In purple, I show CPU usage. 1819 01:01:16,360 --> 01:01:19,720 And in white, I show disk latency. 1820 01:01:19,720 --> 01:01:21,500 As you can see from this chart-- the 1821 01:01:21,500 --> 01:01:22,450 y-axis just goes to 100%, 1822 01:01:22,450 --> 01:01:23,580 and they're both normalized. 1823 01:01:23,580 --> 01:01:26,440 As you can see, at some point, the CPU usage is steadily, 1824 01:01:26,440 --> 01:01:28,190 steadily increasing. 1825 01:01:28,190 --> 01:01:30,990 And at some point, the disk latency takes over and gets to 1826 01:01:30,990 --> 01:01:32,100 be really large. 1827 01:01:32,100 --> 01:01:35,600 You can tell that, because up until about 3/4 of the way in, 1828 01:01:35,600 --> 01:01:37,000 the disk latency hovers around zero. 1829 01:01:37,000 --> 01:01:39,030 And all of a sudden, it jumps much higher. 1830 01:01:39,030 --> 01:01:41,750 So my first instinct was to tell the customer you've got a 1831 01:01:41,750 --> 01:01:43,250 disk problem. 1832 01:01:43,250 --> 01:01:45,150 Your disk latencies are going over a cliff. 1833 01:01:45,150 --> 01:01:47,040 And that's why you're having a performance problem. 1834 01:01:47,040 --> 01:01:50,640 Unfortunately, that's not correct. 1835 01:01:50,640 --> 01:01:52,900 Why is that not correct? 1836 01:01:52,900 --> 01:01:56,140 It's not correct, because, yeah, disk latency gets worse 1837 01:01:56,140 --> 01:01:57,440 at 4:00 PM. 1838 01:01:57,440 --> 01:01:59,120 It's because of swapping and some other nonsense. 1839 01:01:59,120 --> 01:02:00,220 We know that. 1840 01:02:00,220 --> 01:02:02,740 However, the application latency actually gets worse at 1841 01:02:02,740 --> 01:02:04,310 3:30, not at 4:00. 1842 01:02:04,310 --> 01:02:05,810 It's half an hour earlier. 1843 01:02:05,810 --> 01:02:07,790 So now the question is what's going on that causes the 1844 01:02:07,790 --> 01:02:10,140 problem, even before disk kicks in and gets to be 1845 01:02:10,140 --> 01:02:12,820 another problem. 1846 01:02:12,820 --> 01:02:15,930 This is where understanding the difference in metrics 1847 01:02:15,930 --> 01:02:16,610 becomes pretty critical. 1848 01:02:16,610 --> 01:02:18,620 Now, I'm not expecting very many people to 1849 01:02:18,620 --> 01:02:19,800 understand this chart. 1850 01:02:19,800 --> 01:02:21,860 But I will try to explain it. 1851 01:02:21,860 --> 01:02:23,740 What I do here is I show a different chart, which 1852 01:02:23,740 --> 01:02:26,470 essentially shows CPU usage of all the different virtual 1853 01:02:26,470 --> 01:02:29,310 machines and processes on the underlying host. 1854 01:02:29,310 --> 01:02:30,570 Forget about what's in most of this chart.
1855 01:02:30,570 --> 01:02:32,360 And just focus on what I've circled. 1856 01:02:32,360 --> 01:02:34,120 I have two columns shown there. 1857 01:02:34,120 --> 01:02:37,610 One column is called percent used. 1858 01:02:37,610 --> 01:02:41,350 And one column is called percent run. 1859 01:02:41,350 --> 01:02:43,970 You can think about it and think, OK, well, I can imagine 1860 01:02:43,970 --> 01:02:47,480 used time means probably how much time I'm using the CPU. 1861 01:02:47,480 --> 01:02:48,740 But then what is the runtime? 1862 01:02:48,740 --> 01:02:51,080 Isn't runtime the same thing, because I'm using the CPU? 1863 01:02:51,080 --> 01:02:52,990 What's the difference? 1864 01:02:52,990 --> 01:02:54,680 This is actually the critical problem. 1865 01:02:54,680 --> 01:02:58,050 It turns out that when you talk about percent used time, 1866 01:02:58,050 --> 01:03:01,110 you're normalizing that to the base clock frequency. 1867 01:03:01,110 --> 01:03:03,920 If I've got a three gigahertz machine, I've normalized it to 1868 01:03:03,920 --> 01:03:04,650 three gigahertz. 1869 01:03:04,650 --> 01:03:07,490 However, we all live in the era of power management. 1870 01:03:07,490 --> 01:03:10,700 It turns out this percent run, which nobody understands, is 1871 01:03:10,700 --> 01:03:12,880 normalized to the clock frequency at the 1872 01:03:12,880 --> 01:03:14,100 time you were running. 1873 01:03:14,100 --> 01:03:16,840 So if I did power management, and I was only running at two 1874 01:03:16,840 --> 01:03:18,880 gigahertz for a certain amount of time, percent 1875 01:03:18,880 --> 01:03:20,840 run captures that. 1876 01:03:20,840 --> 01:03:23,075 And so in this particular chart what I'm showing is that 1877 01:03:23,075 --> 01:03:26,300 a bunch of virtual machines are seeing used time different 1878 01:03:26,300 --> 01:03:27,480 from runtime. 1879 01:03:27,480 --> 01:03:29,590 And the runtime, which is normalized to the clock 1880 01:03:29,590 --> 01:03:32,130 frequency of the time you're using it, is actually greater 1881 01:03:32,130 --> 01:03:35,030 than used time, which suggests that the user's actually using 1882 01:03:35,030 --> 01:03:36,310 power management. 1883 01:03:36,310 --> 01:03:40,320 And it turns out that power management was kicking in at 1884 01:03:40,320 --> 01:03:42,980 some point and causing their application performance to go down 1885 01:03:42,980 --> 01:03:46,240 even though the CPU was never saturated. 1886 01:03:46,240 --> 01:03:48,250 And in this case, when we went back to the customer-- you 1887 01:03:48,250 --> 01:03:50,310 look at this chart, and you instantly see, OK, this one is 1888 01:03:50,310 --> 01:03:52,820 normalized to the base frequency and that one to the actual frequency. 1889 01:03:52,820 --> 01:03:54,705 This is what's going on; you have to first of all change 1890 01:03:54,705 --> 01:03:55,660 this parameter. 1891 01:03:55,660 --> 01:03:56,460 They changed it. 1892 01:03:56,460 --> 01:03:57,670 And they were unbelievably happy. 1893 01:03:57,670 --> 01:03:58,630 It was pretty cool. 1894 01:03:58,630 --> 01:04:00,287 Now, I have to admit that there's much more to this 1895 01:04:00,287 --> 01:04:01,890 story that I'm not showing you. 1896 01:04:01,890 --> 01:04:03,260 Which I'm not allowed to show you. 1897 01:04:03,260 --> 01:04:04,540 But it becomes a very interesting thing. 1898 01:04:04,540 --> 01:04:09,720 But the point is it's very, very important to understand 1899 01:04:09,720 --> 01:04:10,370 your metrics.
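To make that used-versus-run distinction concrete, here is a minimal sketch of the normalization just described. It assumes a hypothetical 20-second sampling window and made-up clock frequencies; the function names and numbers are purely illustrative and are not taken from any real tool.

    # Minimal sketch of the percent-used versus percent-run distinction.
    # Assumption (not a real counter): run_seconds is how long the VM was
    # actually scheduled on a physical CPU during the sampling window.

    def percent_run(run_seconds, window_seconds):
        # Fraction of the window the VM was scheduled on a physical CPU.
        return 100.0 * run_seconds / window_seconds

    def percent_used(run_seconds, window_seconds, actual_ghz, base_ghz):
        # Same scheduled time, but the work is normalized to the base clock
        # frequency, so running at a reduced frequency shows up as less
        # "used" time.
        return percent_run(run_seconds, window_seconds) * (actual_ghz / base_ghz)

    # VM scheduled for 10 s out of a 20 s window, but power management
    # clocked the core down to 2 GHz on a 3 GHz part.
    print(percent_run(10.0, 20.0))             # 50.0
    print(percent_used(10.0, 20.0, 2.0, 3.0))  # about 33.3

A run percentage that is noticeably higher than the used percentage is exactly the signature described above: frequency scaling, not CPU saturation, is stretching out the work.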
1900 01:04:10,370 --> 01:04:12,760 Here just understanding the difference between runtime and 1901 01:04:12,760 --> 01:04:15,210 used time made all of the difference in diagnosing this 1902 01:04:15,210 --> 01:04:18,080 customer's performance-related issue. 1903 01:04:18,080 --> 01:04:20,880 So I just picked a smattering of ten random bugs. 1904 01:04:20,880 --> 01:04:25,020 And in each of them, I tried to distill one kind of 1905 01:04:25,020 --> 01:04:27,170 conclusion you can get from that particular thing. 1906 01:04:27,170 --> 01:04:29,340 Because all of you can read, I'm not going to bother 1907 01:04:29,340 --> 01:04:31,370 repeating all of these things. 1908 01:04:31,370 --> 01:04:33,140 But anyway, here's the first five. 1909 01:04:33,140 --> 01:04:35,500 And you all get the slides. 1910 01:04:35,500 --> 01:04:37,760 And here's the second five. 1911 01:04:37,760 --> 01:04:37,900 Oops. 1912 01:04:37,900 --> 01:04:40,070 I'll just keep that up for a second. 1913 01:04:40,070 --> 01:04:46,970 And in order to conclude this talk, let me just say that 1914 01:04:46,970 --> 01:04:48,990 it's very important to avoid assumptions. 1915 01:04:48,990 --> 01:04:51,370 It's very tempting to say, well, that's-- 1916 01:04:51,370 --> 01:04:52,520 I'll give you a great example. 1917 01:04:52,520 --> 01:04:53,440 I used to interview people. 1918 01:04:53,440 --> 01:04:54,570 And I'd say, OK, you've got a UI. 1919 01:04:54,570 --> 01:04:55,420 You've got a problem. 1920 01:04:55,420 --> 01:04:56,230 And in that UI, 1921 01:04:56,230 --> 01:04:56,840 you click on something. 1922 01:04:56,840 --> 01:04:57,790 And it takes a long time. 1923 01:04:57,790 --> 01:04:59,060 Why is it taking a long time? 1924 01:04:59,060 --> 01:05:00,590 And the first thing they'd say, well, I'd go and look 1925 01:05:00,590 --> 01:05:02,030 at-- this is literally, I'm not joking. 1926 01:05:02,030 --> 01:05:03,540 They'd say, well, I'd go look at the cache messages on 1927 01:05:03,540 --> 01:05:04,660 the server. 1928 01:05:04,660 --> 01:05:06,230 And I just looked at them and I said, how do you know the 1929 01:05:06,230 --> 01:05:07,100 client's not the problem? 1930 01:05:07,100 --> 01:05:08,590 Oh, the client's not a problem. 1931 01:05:08,590 --> 01:05:10,510 And you base this on what knowledge? 1932 01:05:10,510 --> 01:05:11,560 Don't use assumptions. 1933 01:05:11,560 --> 01:05:13,730 It's a very bad idea. 1934 01:05:13,730 --> 01:05:16,180 It's really important to understand the entire system 1935 01:05:16,180 --> 01:05:16,790 that you're dealing with. 1936 01:05:16,790 --> 01:05:18,590 From the client to the server to the hardware to the network 1937 01:05:18,590 --> 01:05:19,270 to the disk. 1938 01:05:19,270 --> 01:05:21,050 You have to understand all of those different pieces, 1939 01:05:21,050 --> 01:05:24,160 because you have to understand where to look. 1940 01:05:24,160 --> 01:05:27,210 And what's really onerous is that-- it is not so much when 1941 01:05:27,210 --> 01:05:28,970 it's a problem in your code, because you can be blamed for 1942 01:05:28,970 --> 01:05:30,000 making stupid mistakes-- 1943 01:05:30,000 --> 01:05:31,670 it's when there are problems in other people's code and you 1944 01:05:31,670 --> 01:05:33,530 spend a month looking at other people's code, figuring out 1945 01:05:33,530 --> 01:05:35,560 what the problem is. That's when things are really kind of 1946 01:05:35,560 --> 01:05:36,992 irritating. 1947 01:05:36,992 --> 01:05:39,240 And especially when it's a problem with hardware.
1948 01:05:39,240 --> 01:05:42,630 And I think the final bullet I'll say is be persistent and 1949 01:05:42,630 --> 01:05:44,680 be thorough. 1950 01:05:44,680 --> 01:05:46,010 I don't have to tell you how to be good workers. 1951 01:05:46,010 --> 01:05:47,190 So let's just blah, blah, blah. 1952 01:05:47,190 --> 01:05:48,440 We're done.
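As a closing illustration of the disk-latency rule of thumb from the power-on case study, here is a minimal sketch that buckets observed latencies using the thresholds quoted earlier in the talk; the function name and sample values are made up for illustration.

    # Bucket worst-case disk latencies using the rough rule of thumb
    # from the talk; the thresholds are the quoted ones, and everything
    # else here is illustrative.

    def classify_disk_latency(latency_ms):
        if latency_ms <= 10:
            return "golden (10 ms or less)"
        if latency_ms <= 20:
            return "maybe acceptable (10-20 ms, e.g. complicated topology)"
        if latency_ms <= 50:
            return "start worrying (approaching 50 ms)"
        return "staggeringly bad (well beyond 50 ms)"

    # Highest latency seen in each five-minute window, as in the chart.
    for ms in [4, 9, 18, 35, 1100]:
        print(ms, "ms ->", classify_disk_latency(ms))

Run against the 1,100 millisecond sample from the case study, the check lands squarely in the last bucket, which is exactly why the power-on times swung from five seconds to five minutes.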