The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: And so today we have Ravi, who is, I guess, a performance engineer at VMware, where he basically looks at how to get really good performance in many, many different things in VMware. So Ravi is an MIT undergrad. I guess a long time ago.

RAVI SOUNDARARAJAN: Yeah, '92.

PROFESSOR: '92. There you go. And then he got his PhD from Stanford and ended up at VMware looking at a lot of interesting performance stuff. So today, he's going to give us a kind of a bird's eye view of how these kinds of technologies -- what you guys are learning -- are useful and how they're used in industry. OK, Ravi. Take over.

RAVI SOUNDARARAJAN: First question, can everybody hear me? And is the mic on? OK. Good. I have a bit of a sore throat, so I'm sorry if you can't hear me. The other thing I want to say is I'm actually really glad you're here. Because I remember very distinctly in 1990, when I was an undergrad, I was taking 6.111. And I had a bug in my final project. It just wasn't working. And I was going to skip the lecture that day. But I was like, whatever, I'll go. So I went to the lecture. Turns out Don Troxel pointed out something in that lecture, and I was like, that's my problem. And I went back, fixed it. Everything worked. And I was very happy. I don't think that's going to happen here. But it would be nice if it did. So that's cool.

So instead of just talking about what a performance engineer does, I would-- I'll just-- Power is coming in. But-- Oh, there it is. Perfect.
So instead of just talking about what performance engineering is about, I just wanted to go through 10 bugs that I've actually been involved in. And hopefully, they'll be interesting to you, because each one of them was interesting to me.

So let's start out with-- I just wanted to start this out with a bang. So for dramatic effect, everybody read through this slide. It's basically an email thread from one of my colleagues. And raise your hand when you're done, so I'll know when you're done.

OK. Anyway, the point of this is that his question ended up being, why is the percent lock time in the database increasing despite a lighter load for the thing that uses that database? Now, if you're like me, you read this email, and your first question is, what the hell is he talking about? So that's why I titled this slide like that. So let's actually deconstruct, a, what that email meant, and, b, what it meant to me. In order to do that, I want to first talk about the system that he was talking about and help describe what the actual problem he was seeing was.

So here we have a standard multi-tier setup. What we've got is a server, which is our middleware server. And it talks to a database. And then we have a bunch of hosts on the other end. Those are labeled by agent, HAL, whatever. Anyway, those are hosts running virtual machines. And here's the setup. You've got a client. And that client issues a command to this middleware server. And that middleware server tries to figure out what host or what virtual machine needs to be contacted. So the middleware goes ahead and does that. And it performs an operation on that host. And after performing the operation, it then archives some of the information that it gained to the database. And once it has archived the information to the database, it goes ahead and notifies the client that the operation itself is done. Fairly simple. Send a request to a server.
Server contacts some other host. Host comes back to the server and says it's done. Server persists information to the database and then notifies the client everything is done.

So the actual problem that my colleague was seeing was, he was sending less and less load to this middleware server, which therefore means that less and less goes to the hosts and less and less goes to the database. However, the percent of time spent in the database was higher. So despite a lighter load and less stuff going to the database, the percent of time spent in the database was higher. Probably all of you are smart enough to know what the actual problem is. But we'll go through it step by step.

So the deal is this. The first thing I did is I examined the amount of time that each lock was held for each of these different operations. And what I did was I varied the load, just as my colleague did. And what you see are six different lines. On the x-axis is each of the different locks that we acquire in our system. And on the y-axis is the latency of those locks in milliseconds. The legend for this chart says Locked-4, which means a certain type of load, all the way down to Locked-128. OK. And the deal is that when you have Locked-128, what that indicates is that there are 128 hosts per thread. Which, take my word for it, means that you've got a lighter load. So basically, the lowest line in the legend is the lightest load. And the highest line in the legend is the highest load. And as you can see, although it's a little bit obscured in this chart, the bottom line is that this is about time spent holding a lock. In this chart, Locked-4 indicates a heavy load. Locked-128 indicates a lighter load. And what this chart indicates is that, in fact, you are spending more time in the locks with a heavier load, Locked-4, than you are with a lighter load, Locked-128.
So the lock latency is higher with a higher load, which is totally reasonable. And it turns out that if we translate my colleague's question into reality, what he's asking is, why is the percent time spent in the database higher, even though you have a lighter load? Well, the answer ends up being that when you have a lighter load, overall, you're spending less time in locks, which means that any time you spend in the database, if that's a fixed cost, is going to be a higher percent of the overall time. So it turns out, there really wasn't an issue. There was nothing odd here. It's exactly what you'd expect, because the database imposes essentially a fixed amount of latency per operation. And in order to confirm that, I also looked at the time spent contending for locks -- you really want a lock, but someone else gets it instead, and you have to wait. And the trend is exactly the same. When you have a lighter load, you end up having a much lower overall latency. But that just means that any latency you spend in this database, its percent impact on the entire operation is quite a bit higher.

So to take a look at this and to do a postmortem on this case study: the first thing was some random email, which kind of was nonsense. But it was less nonsensical once you actually tried to understand what the setup was. So step one is, figure out what the experimental setup is and what parameters are being studied. The next thing to understand is, what's being measured? I remember what my colleague said. He said the percent of time in the database was pretty high. Why is that? Well, OK. So we have to understand what this multi-tier setup is, why you would be going to the database, and what it means if the time spent in that database is high. And I guess the next thing you want to do is you want to say, well, OK, he's complaining about lock latency. So I must have some sort of visibility there somewhere. Let me go and look at the appropriate metric to figure out what's going on and why time is being spent there. And finally, it's appropriate to -- once you've got all this data, don't make any assumptions -- just draw the appropriate conclusion based on the data.

The problem is basically that with a lighter load, the impact of that database becomes larger, even if it were the same under a heavier load. Because under a heavier load, other aspects of the system also get stressed that don't get stressed under a lighter load. So, and this is basically what I'm saying here, yeah, the percent lock time in the database overall is higher. But the overall lock time is smaller. So everything is behaving exactly as you would expect.
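To put made-up numbers on it (these are not the actual measurements, just an illustration of the ratio effect): if the database adds a fixed 10 milliseconds to every operation, then under a heavy load where the whole operation takes 100 milliseconds, the database is 10% of the time; under a light load where the whole operation takes only 40 milliseconds, that same 10 milliseconds is 25% of the time. The absolute database cost never changed; only the denominator did.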
So this was one simple study, and hopefully it didn't require any virtualization-specific knowledge. What I'm going to do now is go over a bunch of other different case studies that I've looked at and crystallize what lessons I learned from doing those studies.

So I guess the next case study I want to talk about is something that I call garbage in, garbage out. I think that's something I learned on the first day of 001. So here's the deal. You've got a customer. And he wants to draw this chart. And this chart is basically, on the x-axis is time, and on the y-axis is some metric. It happens to be CPU usage in megahertz. But it doesn't really matter. OK. So he wants to get this chart. And it turns out -- if you remember the diagram I showed before, we have a server and a bunch of hosts at the end. So the client simply has to send a request to that server. And the server will generate the data and send it back to the client. And all of you, I'm sure, are familiar with Perl, Python, and all these different scripting languages. There's another one that's very popular in the Windows community called PowerShell. And so it turns out that our server has a PowerShell interface.
You can write a PowerShell script to get any statistic that you want. And then you can write that to some database or some spreadsheet and get this chart. So here's the deal. We have a user. And what he does, as I've indicated at the bottom, is that he wants to get the CPU usage for the last hour. So he just writes a simple script using PowerShell. It doesn't really matter what the syntax is. But the point is, he does what's called a get-stat call. And he indicates, OK, I want this virtual machine, I want this stat, whatever. And then he gets his information and writes it to a file. No big deal. It's just a simple script to get some statistics.

Now, here's what is actually interesting. I wrote that script using PowerShell. Now, let's compare a PowerShell implementation to a highly tuned Java implementation of the very same script. What I've done here is I've shown the performance of that very same script when getting data for one virtual machine, which is the first row, and then up to a little over 300 virtual machines. Look at the scaling of this script. In the middle column, I have the performance for this PowerCLI script that I just showed you on the previous slide. As you can see, it goes from a healthy 9.2 seconds to over 43 minutes if you increase the size of your inventory. In contrast, if you use a Java-based implementation, it starts a little slower than PowerCLI at 14 seconds. But the scaling is far superior. It ends up at 50 seconds.

So the question is, why? What's different about these scripts at scale versus when I wrote the little toy script? Now, to give you some color, why this matters to someone like me is that every once in a while, I'll get a call from a system admin who wants a very simple script that does blah. And they say, I wrote this simple script in my development environment, and it worked fine.
But then I run it in my production environment, and it completely breaks. What's going on? So often, problems like this actually occur. Now, I'm calling this garbage in, garbage out, because you can see that with PowerCLI and Java, we're getting very different results. The deal is that in PowerCLI, I wrote a very naive script -- and I'll explain how in a moment -- and that very naive script does not perform at all. Whereas in Java, I wrote a very highly tuned script, and it ended up taking about an order of magnitude less time. So garbage in, garbage out. Good stuff in, good stuff out. Let's try to understand why.

OK. Here's the deal. All of you are probably familiar with the difference between scripting languages and high level languages, or low level languages, or whatever you want to call them. The idea is that if you're running something using a shell script, or running Perl, or whatever, these things are usually meant for ease of use. But they're not necessarily meant for performance. And so they hide a lot of details from the user so that it's easier to get to a quick prototype. You don't have to declare variables, do all this kind of junk, right, if you're using a scripting language. In contrast, if you're using Java, C++, whatever, you have type safety and all this nonsense. You have to worry about what the declarations of your variables are, make sure they're the correct types, et cetera. So they tend to be more difficult to get running. But once you get them running, you can use a lot of advanced tools that just aren't necessarily available in toolkits. And we're going to talk about how this difference manifested itself in this particular problem.

I said that when I told the customer to write a simple PowerCLI script, they just have to do one call: get-stat. And here's a bunch of arguments to tell it what statistic it wants to grab. It's a toolkit. It's doing a lot of work for you.
What's it doing for you? Well, it turns out it's calling our internal APIs. Our internal APIs have to work for a variety of different platforms, so they have to be incredibly general. Under the covers of that single get-stat call, there are about seven different steps going on, ranging from asking, OK, I want a stat for a virtual machine -- what kinds of stats does that virtual machine export? How frequently does it update? All kinds of information that you probably don't really care about if all you want is a single stat, but that you might care about if you're writing a production quality script. So the point is, get-stat does these seven calls. And because you're using a toolkit, it hides them all from you. It spares you those details. It's very easy to use. But it turns out, this is obviously not the most optimal way to do this. And I illustrate that by showing you conceptually what's actually going on.

On the right -- I guess on your left -- I've got what the PowerCLI script was doing. You call this get-stat call, or whatever, and what it does is that for every virtual machine that you want this statistic for, it goes through and does all those steps that I described in the previous slide: figuring out what stats are supported, how frequently they're updated, and then actually going and getting the stat. It turns out, as I show on the right, when I do this in Java, I can take advantage of the fact that a lot of those things that I'm asking about in the toolkit never change. You've got a virtual machine. It supports things being refreshed every 20 seconds or whatever. That never changes. You want to find out what statistics are supported. Well, those never change once you've configured it and you don't change anything. So why do that every single time? So, as we show in Java, you can pull a lot of that stuff out of the main body of the loop.
And if you want information for every VM in the system, you only have to make one call, as I show at the very bottom, for every VM. You don't have to make seven different RPC calls. In fact, you could further optimize, if you really understand the API, and collapse a few more calls in Java that you can't do in the toolkit. Because again, the toolkit's trying to talk at a high level, so any random dude can actually write a script.

OK. So the point is, in this PowerCLI world, a very simple toolkit world, we were making five RPC calls for every VM that we wanted stats for. Whereas in Java, we made one call. And we can even optimize that further, and I don't show that here. But anyway, the bottom line is that's why the performance of the Java script was so much faster. So the idea is that with PowerCLI, I wrote a very simple script that anybody could understand and use, but it's not optimized. I didn't utilize multi-threading. I didn't realize that the output format of the data was really, really verbose. Again, that's the kind of thing that you can take advantage of if you really understand the APIs. I also didn't realize -- as a customer, I wouldn't have realized -- that I make this one call, but it expands into a bunch of different network requests that I didn't know about. In contrast, with Java, I was able to take advantage of the power of a language like Java, where I could use thread pools, I could optimize the return format of the data, and I could reduce the number of RPC calls. I guess the analogy I would give is, if you think about assembly code versus compiler-generated code, nobody here wants to sit down and write x86 assembly. So, of course, you write C and you use GCC and whatever. But maybe there are some things that would actually be faster if you wrote hand assembly.
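To make the difference concrete, here is a minimal sketch in Java. The StatsClient interface and its method names are hypothetical stand-ins, not the actual vSphere API; the point is only the shape of the two approaches. The naive version repeats the metadata queries inside the per-VM loop, while the tuned version hoists them out and batches the actual data retrieval into one call.

```java
import java.util.List;
import java.util.Map;

// Hypothetical client interface standing in for the real stats API.
interface StatsClient {
    List<String> listVms();
    List<String> supportedStats(String vm);            // metadata: rarely changes
    int refreshIntervalSeconds(String vm);             // metadata: rarely changes
    double[] queryStat(String vm, String stat, int interval);                           // one RPC per VM
    Map<String, double[]> queryStatBatch(List<String> vms, String stat, int interval);  // one RPC total
}

class StatsExample {
    // Naive approach (roughly what the toolkit does under the covers):
    // per VM, re-fetch metadata that never changes, then fetch the stat.
    static void naive(StatsClient client, String stat) {
        for (String vm : client.listVms()) {
            List<String> supported = client.supportedStats(vm);   // RPC, same answer every time
            int interval = client.refreshIntervalSeconds(vm);     // RPC, same answer every time
            if (supported.contains(stat)) {
                double[] values = client.queryStat(vm, stat, interval); // RPC per VM
                // ... write values to a file ...
            }
        }
    }

    // Tuned approach: ask the invariant questions once, then issue a single
    // batched query for every VM instead of one round trip per VM.
    static void tuned(StatsClient client, String stat) {
        List<String> vms = client.listVms();
        int interval = client.refreshIntervalSeconds(vms.get(0)); // hoisted out of the loop
        Map<String, double[]> all = client.queryStatBatch(vms, stat, interval); // one RPC
        // ... write all values to a file ...
    }
}
```

The thread pool and output-format optimizations from the slide aren't shown here; the reduction in round trips alone is what changes the scaling.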
OK. So, the first example we gave was just understanding what on earth the problem was. The second example was, if I write bad code, I'm usually going to get bad output. In the third case, I'm going to talk a little bit about the design of an API. And I'm going to motivate that by an example.

We have an administrator on the right hand side. And he's talking to the management server. This management server is responsible for telling that administrator what's going on in every host in the system, every virtual machine, whatever. So let's pretend that the user wants -- so this is a virtualized environment -- so the idea is that the user wants to actually access his or her virtual machine. They're running a Linux virtual machine. They want to see its console, so they can type at the console. Here's how that roughly works. They want to see the console of the VM that I've circled. So the way it works is that this user talks to the management server. The management server locates that virtual machine and establishes a handshaking connection. And that connection is directly given back to that user. So now the user can interact directly with that virtual machine and send commands and do whatever. So we've all got the setup. Very simple.

Now here's the deal. The problem was that our poor little user, poor big user, whatever, could not start this console when they were managing a large number of hosts. To take you back to this diagram for just a moment, on the right hand side of the screen, I show a number of hosts, and I show a number of virtual machines. So what I'm basically saying is that this administrator could not access the console of his virtual machine as the number of hosts increased. There should be no link between these two. But there was.

So I got involved in this problem. And first I went to the client folks, and I explained the problem. And their response was, that's obviously a server problem. Next step, I go to the server folks. And they said, yeah, it's obviously a client problem. So then I went back and thought a little bit more about the architecture.
And what I knew is that whenever the client needs to initiate this handshake connection, it has to spawn a certain process. That process is called the VMRC, the VMware remote console. So I talked to the client folks again and said, well, I think this is happening. Oh, yeah, it's a VMRC problem. OK. So that just passes the buck another way. So then I talked to the VMRC folks. And I said, OK, I think I have this problem with starting up. And they said, oh, it's probably authentication. It's probably this. You have to enable verbose logging on your end host, blah, blah, blah. And at some point, I got so sick of this. I thought, well, this is complete garbage. I'm actually going to take a look at this a little more carefully.

And so, once I got involved, I got a little bit more information. And the information was actually pretty useful. Here's the deal. It turns out that when you ask the person that was having this problem a few more questions, you get a few more answers, some of which are actually useful. So I said, well, if you do this with 50 hosts, do you see a problem? So, 50 managed hosts, and he only wants to view one VM's console. No problem. Then I asked him -- and then I actually sat at my desk, and we reproduced this. So then I said, let's try it with 500 hosts. No problem. Just a little bit slower, but no problem. So again, the setup is 500 different hosts, and he just wants to look at one VM. And it's taking longer and longer to view that one VM. Finally, he gets to 1,001 hosts. And that's when the problem occurs. That's when he cannot see the console.

So obviously then, you go back and you think, well, gee, is there some magic number of 1,000 hard coded anywhere, such that we exceeded it, and that was the problem? And frankly, the answer is no. So let's figure out what actually did happen.
So I made a bunch of different observations while I was debugging this. When you have fewer than that magic number of 1,001 hosts, here's what was going on. Remember the sequence of events. I'm a client. I talk to a server. The server talks to the virtual machine. And then that virtual machine and I are directly connected. What I was noticing was that when there were fewer than 1,000 overall hosts in this setup, when I would initiate the command for this console, the CPU usage and memory usage of this middle management server would gradually increase, and increase, and increase until the console would actually show up. So you right click. You want to pull up a console. A little window comes up. But it's totally blank. And then eventually, after the management server consumes a lot of CPU and memory, eventually that console comes up. So now your question is, what's going on that's causing the CPU usage of this management server? And why is that scaling with the number of your virtual machines?

So the first thing I did -- so one of the themes of this talk, I suppose, is that there are a lot of great tools for performance debugging, but sometimes there's no real substitute for common sense and for the things that are easiest to do. So the first thing I did is I looked at the log file of this server. Here's the rationale. If every time I invoke this remote console, I'm doing more work on the server, first, let me just see what that server is doing. Now, it turns out that what I noticed about the server was that whenever the client would contact the server and say, I want the console of this VM, you would see a data retrieval call from the client to the server. And that data retrieval call got more and more expensive with more hosts and more virtual machines in the system. So the question that I asked was, why on earth are we doing some stupid data retrieval call if all I wanted is to say, here's a VM.
Tell me where it's located so I can talk to it. So it turns out, that was problem number one. Problem number two: that's what happened when you had 50 hosts, 500, whatever -- you basically spend more and more time in this data retrieval routine on the server. However, once you hit that magic number of 1,000, you would see this call start, but you would never see it end. Instead, you'd see some exceptions somewhere. So that to me was a clue that I'd better take a look and see what's going on at that exception.

So here's what I did. Unfortunately, the exception gave me no information whatsoever. So I went ahead and I attached a debugger to the server. And when I did, I found out that it was an out of memory exception. So here's the deal. I'm a poor admin. I send a request to look at a console. For some reason, some data handshaking code takes place between the server and the client. It takes longer and longer with more hosts. Eventually, when I have too many hosts, that thing runs out of memory and silently fails. And so my client's sitting around looking dumb with a totally blank screen, because it has no idea that its request basically got denied. So the question is, why is that happening?

And aha, this is where the aha moment occurred. I happen to know that in our setup, one of the causes for this data retrieval is that as a client, I want to know certain attributes of the virtual machine that I wanted to view. For example, if I'm looking at the console for a virtual machine, I might care that that virtual machine has a CD-ROM connected. And I might care if that CD-ROM got disconnected or whatever. That's not something you'd necessarily associate with a virtual machine.
But when I go to the server to connect to it, the server wants to give me that information so that the shell which is showing me this console can appropriately update if the virtual machine gets disconnected from its CD-ROM, whatever. Anyway, the point is, I looked at the code that this magical VMRC routine was running. And here's what it was doing. It was saying, look, I want to start a remote console session. I need certain information, like the CD-ROM connectivity. I'm going to go ahead and get that information for every host and every virtual machine in the system. And then, once that's done, I'll go ahead and display the console for this one virtual machine that I care about. Well, this is clearly just a stupid idea. So I called the guy up and I said, you realize this is a stupid idea. And he said, well, why? It's not that expensive. Is it? And I was like, OK, we need to have a long discussion. And bring a priest in here. But that's a terrible idea.

So what we ended up doing is-- and so now-- OK, so that's the problem. The problem is we do a huge data retrieval, none of which we need. Now, why this 1,001 host thing? Here's the deal. It turns out that in our code, for old style clients -- of which this was an old style client -- we had a restriction. We said, if you need to serialize more than 200 megabytes worth of data, we're not going to handle that. We're just going to fail. It turns out that once we got to 1,001 hosts, we slightly exceeded that 200 megabyte buffer size, we silently erred with an out of memory exception, and we never returned to the client. And everybody was unhappy. So it had nothing to do with 1,000 hosts. It has to do with the amount of data you're actually sending.
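The shape of the bug, and of the eventual fix, looks roughly like this in Java-style pseudocode. The InventoryClient type and its methods are made up for illustration, not the actual VMRC or management server code; the contrast is the point: scope the retrieval to the one VM whose console you want, instead of pulling attributes for the whole inventory.

```java
import java.util.Map;

// Hypothetical inventory interface standing in for the real management API.
interface InventoryClient {
    Map<String, Object> vmAttributes(String vm);         // e.g. CD-ROM connectivity for one VM
    Map<String, Map<String, Object>> allVmAttributes();  // bulk retrieval for the whole inventory
}

class ConsoleStartup {
    // What the remote-console path was effectively doing: fetch attributes for
    // every VM in the inventory before showing a single console. The payload
    // grows with inventory size, and past the server's serialization limit the
    // call dies with an out-of-memory error that never reaches the client.
    static Map<String, Object> beforeFix(InventoryClient inv, String targetVm) {
        Map<String, Map<String, Object>> everything = inv.allVmAttributes();
        return everything.get(targetVm);
    }

    // The fix: ask only about the VM whose console is being opened, so the cost
    // no longer depends on how many hosts and VMs the server is managing.
    static Map<String, Object> afterFix(InventoryClient inv, String targetVm) {
        return inv.vmAttributes(targetVm);
    }
}
```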
So luckily, there were several different fixes. I told the VMRC folks -- the guy that wrote this thing -- I said, please, under no circumstances should you get information for every single host if you only want it for this one guy. That's a terrible idea. And then I went back to the server folks, and I said, first of all, you have to fail correctly. And second, you have to emit sensible error messages, because this should not take the kind of expertise that it took to figure out. And thankfully, this actually made a pretty huge difference. So that was kind of cool.

So the lessons that I took away from this: if you're going to create an API, create an API that's very difficult to abuse and very easy to use. Not the opposite. These poor VMRC folks -- it's like a three line piece of code to get information for every VM. It's like eight or nine lines to get information for just that VM. But those five lines make a boatload of difference. The other thing is that this was an entirely internal customer. In other words, it's just another group within VMware. Imagine you're some external guy that's using the same API. You're screwed. So we have to be better at educating the people that are using our API. Don't just throw it over the fence and say, here's an API, deal with it. There needs to be some interaction between them to make sure that this is the right API for the job.

So this is the next case study I want to talk about. This one was actually found by a colleague of mine, who I'm going to abbreviate as RM. Very interesting example. For those of you that took the class last year, I'm recycling it. But it's a good example, so no worries. So here's the deal. You've got a benchmark. And I've got two builds of a piece of software. When I run the benchmark against that piece of software, in case A, I was getting a throughput of 100 operations per minute.
In case B, I was getting half the throughput. And what was the difference between those builds? Well, the first build, the faster build, was a 32-bit executable running on 64-bit hardware. While in the second case, it was a 64-bit executable running on 64-bit hardware. Now, if you're like me -- which hopefully you're not -- but if you're like me, you think to yourself, well, there shouldn't be much difference between 32-bit and 64-bit. In fact, 64-bit should be a lot faster for a lot of reasons you're going to learn about in 004. So the question is, what's going on? I can distinctly remember when this check-in was done, somebody said, oh, we're making the switch over, but it shouldn't be a big deal. And foolishly, like a moron, I actually believed that person. Very bad idea. So this came back to haunt us for three months.

So what's the deal? Remember, a 32-bit executable running on 64-bit hardware was faster than a 64-bit executable running on 64-bit hardware. So the first thing we do -- first of all, you've got to find the right tool for the right job. In this case, we use a profiling tool called Xperf. Xperf is a very powerful profiler, kind of like VTune, or Valgrind, or Quantify, or a lot of the other tools you've used. The nice thing is it's built into the OS. It runs on Windows 2008. And it's a sampling based profiler that has a lot of pretty cool attributes. If you have Windows 2008, I think you ought to just go and download it. It's free. And play with it. It gives a lot of information like stack traces, caller/callee information, et cetera. It's potentially the perfect tool for this kind of job.

What kind of output does it show? In this case, remember the setup. The setup is, I run build A, and I'm getting twice as much throughput as running build B. So let's take a look at the CPU usage, just as a first cut to see what's going on. In the 64-bit case, which is what I show here,
I show the output of Xperf for the 64-bit build. What you can see: the x-axis is just time, and the y-axis is CPU usage. It's totally saturated. In the 64-bit case, we're completely saturated. Take my word for it, in the 32-bit case, you're not. And this accounts for the difference. So now let's dig deeper and find out why we're seeing this difference.

So first, we look at the sampling profiler. Now, there are a lot of perils of using sampling profilers. In this particular case, here's the deal. I've got a process. And that process spawns a bunch of threads. In this particular view, what you're seeing is where the time is being spent in this process. And as you can see, most of the time is spent at the root, which kind of makes sense, because everything in some sense is a child of the root. But the first thing our poor process does is spawn a bunch of child threads. So clearly, all of the time is essentially going to be spent in the routine that calls all of these threads, because that's the first entry point. But unfortunately, that's remarkably unhelpful. So the deal is, we have to dig deeper and find out where this time is actually being spent.

So, just to reiterate the point that I made on the previous slide: your entry point is spawning all the threads. And if that's kind of the starting point for accounting, you're not going to get any help if you're just looking at that starting point. You have to dig a little bit deeper. And so what I'm basically trying to say is that even though this thing records stack traces, this stack happens to be the most popular stack. But it's not actually the problem. It's the most popular because that's where everybody starts. So all of the time that is charged gets charged to that stack. We have to look a little bit deeper. So in order to look a little bit deeper, let's think about the problem in a slightly different way. Suppose I've got my root.
795 00:27:15,550 --> 00:27:16,760 That's where everything starts. 796 00:27:16,760 --> 00:27:18,240 That's where all the threads are spawned, whatever. 797 00:27:18,240 --> 00:27:21,270 And suppose I have three paths to a tiny little function at 798 00:27:21,270 --> 00:27:23,760 the very end. 799 00:27:23,760 --> 00:27:25,510 Maybe I'm spending all of my time in the 800 00:27:25,510 --> 00:27:27,530 path that I've indicated. 801 00:27:27,530 --> 00:27:29,980 But the problem is that-- 802 00:27:29,980 --> 00:27:31,980 suppose you're spending all of your time in that tiny little 803 00:27:31,980 --> 00:27:35,360 function, and suppose it's equally spent among these 804 00:27:35,360 --> 00:27:36,120 other three paths. 805 00:27:36,120 --> 00:27:39,010 Well, it's not any path that's kind of screwed up. 806 00:27:39,010 --> 00:27:40,700 It's that that tiny function is kind of messed up. 807 00:27:40,700 --> 00:27:42,360 So you got to figure out why are you spending time in that 808 00:27:42,360 --> 00:27:43,350 tiny function? 809 00:27:43,350 --> 00:27:46,090 And is there something that you can do to fix that? 810 00:27:46,090 --> 00:27:48,540 So let's look from the opposite perspective. 811 00:27:48,540 --> 00:27:50,850 Let's look at the caller view. 812 00:27:50,850 --> 00:27:53,350 In this case, what I did was I looked at a bunch of different 813 00:27:53,350 --> 00:27:54,540 routines, a bunch of these different 814 00:27:54,540 --> 00:27:56,080 so-called tiny functions. 815 00:27:56,080 --> 00:27:58,640 And I looked at what was calling those tiny functions. 816 00:27:58,640 --> 00:28:01,290 It turns out that here's a call, ZwQueryVirtualMemory 817 00:28:01,290 --> 00:28:02,280 some tiny function. 818 00:28:02,280 --> 00:28:04,490 It happens to be-- it turns out that we're spending, as it 819 00:28:04,490 --> 00:28:09,480 indicates, 77% of our time in this tiny function. 820 00:28:09,480 --> 00:28:11,560 And just to emphasize that this is not being 821 00:28:11,560 --> 00:28:12,450 called from the root. 822 00:28:12,450 --> 00:28:14,190 So you don't see it from your first glance. 823 00:28:14,190 --> 00:28:16,590 You can see that the number of times this tiny function was 824 00:28:16,590 --> 00:28:19,600 called from the root is 0.34%. 825 00:28:19,600 --> 00:28:21,316 So if you were just looking from the root and trying to 826 00:28:21,316 --> 00:28:22,520 figure out what's being called from the root. 827 00:28:22,520 --> 00:28:23,940 It would be totally useless. 828 00:28:23,940 --> 00:28:24,830 Instead, you have to dig deeper. 829 00:28:24,830 --> 00:28:26,120 And see all the different functions 830 00:28:26,120 --> 00:28:26,640 that are being called. 831 00:28:26,640 --> 00:28:28,440 And find out where they're being called from. 832 00:28:28,440 --> 00:28:30,875 So we know that this function, ZwQueryVirtualMemory, is 833 00:28:30,875 --> 00:28:33,810 taking like 70% of the CPU, which is an enormously large 834 00:28:33,810 --> 00:28:34,740 amount of CPU. 835 00:28:34,740 --> 00:28:38,450 We also know what's calling it are these two 836 00:28:38,450 --> 00:28:39,270 routines right here. 837 00:28:39,270 --> 00:28:40,680 This thing, this magical thing called 838 00:28:40,680 --> 00:28:42,710 RTDynamicCast and RTtypeid. 839 00:28:42,710 --> 00:28:46,580 So it turns out we have two paths that are going to this 840 00:28:46,580 --> 00:28:47,340 tiny function. 
841 00:28:47,340 --> 00:28:50,160 And those two paths comprise most of the calls 842 00:28:50,160 --> 00:28:51,180 to this tiny function. 843 00:28:51,180 --> 00:28:52,800 And it turns out we're spending a boatload of time in 844 00:28:52,800 --> 00:28:54,800 this tiny function. 845 00:28:54,800 --> 00:28:56,980 So let's dig deeper and find out what on earth these two 846 00:28:56,980 --> 00:28:57,360 routines are. 847 00:28:57,360 --> 00:28:58,550 What is RTtypeid? 848 00:28:58,550 --> 00:29:01,070 What is RTDynamicCast, whatever? 849 00:29:01,070 --> 00:29:02,850 So let's first look at RTDynamicCast. 850 00:29:02,850 --> 00:29:03,890 What I show here-- 851 00:29:03,890 --> 00:29:05,420 in contrast to the previous view. 852 00:29:05,420 --> 00:29:07,080 The previous view said, I have a tiny function. 853 00:29:07,080 --> 00:29:08,800 What are the people that call me? 854 00:29:08,800 --> 00:29:09,690 Let's look at the reverse. 855 00:29:09,690 --> 00:29:12,540 Let's go and start at one of these two routines, RTtypeid. 856 00:29:12,540 --> 00:29:15,480 And figure out what it's calling that is spending so 857 00:29:15,480 --> 00:29:17,450 much time, and ultimately leads to us calling that 858 00:29:17,450 --> 00:29:20,090 ZwQueryVirtualMemory or whatever. 859 00:29:20,090 --> 00:29:22,850 And as you can see, this RTtypeid, it calls two 860 00:29:22,850 --> 00:29:23,870 functions primarily. 861 00:29:23,870 --> 00:29:25,735 One is incrementing a reference counter. 862 00:29:25,735 --> 00:29:27,600 And the other is decrementing a reference counter. 863 00:29:27,600 --> 00:29:29,030 Now you're thinking to yourself, incrementing a 864 00:29:29,030 --> 00:29:30,580 reference counter, decrementing a reference 865 00:29:30,580 --> 00:29:32,970 counter should be very simple, very low CPU, and basically 866 00:29:32,970 --> 00:29:33,970 should cost you nothing. 867 00:29:33,970 --> 00:29:35,500 And all of those three assumptions are dead wrong, 868 00:29:35,500 --> 00:29:36,280 unfortunately. 869 00:29:36,280 --> 00:29:37,670 So let's try to understand a little bit as 870 00:29:37,670 --> 00:29:38,960 to why that's happening. 871 00:29:38,960 --> 00:29:42,520 Turns out RTtypeid is used in order to figure out the C++ 872 00:29:42,520 --> 00:29:43,380 type of an object. 873 00:29:43,380 --> 00:29:45,200 So we're trying to figure this out on the fly. 874 00:29:45,200 --> 00:29:47,950 And as I said, it's kind of mystifying that this routine, 875 00:29:47,950 --> 00:29:50,250 which calls our tiny function, that you're spending 39% 876 00:29:50,250 --> 00:29:53,930 overall in this routine in that tiny function. 877 00:29:53,930 --> 00:29:56,010 So let's figure out what's going on by looking at a 878 00:29:56,010 --> 00:29:57,880 simple example, which would be IncRef. 879 00:29:57,880 --> 00:30:00,170 So it turns out if you're looking at this code, the 880 00:30:00,170 --> 00:30:03,240 dreadful line is the fact that we're doing const type info 881 00:30:03,240 --> 00:30:06,880 reference tinfo equals typeid of *this. 882 00:30:06,880 --> 00:30:09,560 This is basically, you have to call RTtypeid to get this kind 883 00:30:09,560 --> 00:30:10,470 of information. 884 00:30:10,470 --> 00:30:13,520 And in order to do that, you need some runtime type 885 00:30:13,520 --> 00:30:14,580 information. 886 00:30:14,580 --> 00:30:17,180 This runtime type information has some pointers 887 00:30:17,180 --> 00:30:18,720 associated with it. 888 00:30:18,720 --> 00:30:20,120 So here's the deal. 
889 00:30:20,120 --> 00:30:21,610 Whether I'm running a 32-bit-- 890 00:30:21,610 --> 00:30:23,300 Remember, we even forgot the original problem, which is, 891 00:30:23,300 --> 00:30:25,870 gee, I'm 32-bit, 64-bit executable-- 892 00:30:25,870 --> 00:30:27,910 sorry 32-bit executable on 64-bit hardware. 893 00:30:27,910 --> 00:30:29,450 We were slower than a 64-bit-- 894 00:30:29,450 --> 00:30:31,900 we were faster than a 64-bit executable on 64-bit hardware. 895 00:30:31,900 --> 00:30:33,330 Let's try to figure out why. 896 00:30:33,330 --> 00:30:36,780 Well, remember, every time I increment or decrement a 897 00:30:36,780 --> 00:30:39,310 reference counter, I get some type information by consulting 898 00:30:39,310 --> 00:30:40,700 this runtime type information. 899 00:30:40,700 --> 00:30:42,270 It's got a bunch of pointers. 900 00:30:42,270 --> 00:30:46,460 When I run a 32-bit executable on 64-bit hardware, those 901 00:30:46,460 --> 00:30:48,640 pointers are just raw 32-bit pointers. 902 00:30:48,640 --> 00:30:50,660 You just look them up, and you're done. 903 00:30:50,660 --> 00:30:53,020 In 64-bit, pretty much the only difference is that 904 00:30:53,020 --> 00:30:55,560 instead of those pointers being actual pointers, they're 905 00:30:55,560 --> 00:30:57,490 32-bit offsets. 906 00:30:57,490 --> 00:31:01,170 So you take the offset, you add it to the base address of 907 00:31:01,170 --> 00:31:05,070 the DLL of the executable or whatever, where that runtime 908 00:31:05,070 --> 00:31:06,400 typeid call is being made. 909 00:31:06,400 --> 00:31:08,090 And then you're done. 910 00:31:08,090 --> 00:31:10,360 Once you've added that offset, you get a 64-bit pointer. 911 00:31:10,360 --> 00:31:13,520 You look up that 64-bit pointer, everybody's happy. 912 00:31:13,520 --> 00:31:14,540 So here's the deal. 913 00:31:14,540 --> 00:31:16,140 Remember, we said 32-bit's faster. 914 00:31:16,140 --> 00:31:17,450 32-bit is just a pointer look up. 915 00:31:17,450 --> 00:31:18,490 And then you're done. 916 00:31:18,490 --> 00:31:19,670 64-bit's slower. 917 00:31:19,670 --> 00:31:22,120 It's a pointer look up-- and it's an addition plus a 918 00:31:22,120 --> 00:31:24,640 pointer look up. 919 00:31:24,640 --> 00:31:26,230 So I pause, because you're thinking to yourself, an 920 00:31:26,230 --> 00:31:28,800 addition plus a pointer look up versus a pointer look up. 921 00:31:28,800 --> 00:31:29,980 Does that mean that addition is slow? 922 00:31:29,980 --> 00:31:31,260 That sounds really stupid. 923 00:31:31,260 --> 00:31:32,670 And by the way, it is really stupid. 924 00:31:32,670 --> 00:31:33,570 It's not true. 925 00:31:33,570 --> 00:31:34,320 So the deal is this. 926 00:31:34,320 --> 00:31:35,970 It's not that addition to create a 927 00:31:35,970 --> 00:31:37,810 64-bit pointer is slow. 928 00:31:37,810 --> 00:31:39,850 The deal is that-- remember this 2-step process. 929 00:31:39,850 --> 00:31:41,810 You find out the base address of a module. 930 00:31:41,810 --> 00:31:44,220 And you add this offset to that base address. 931 00:31:44,220 --> 00:31:48,540 Determining that base address is really, really slow. 932 00:31:48,540 --> 00:31:50,510 It turns out you have to call a bunch of different random 933 00:31:50,510 --> 00:31:53,640 routines that have to walk the list of loaded modules. 934 00:31:53,640 --> 00:31:55,230 And you're doing this every single time 935 00:31:55,230 --> 00:31:56,130 you call this routine. 
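To make that hot path concrete, here is a minimal C++ sketch, with assumed names rather than VMware's actual classes, of a reference-counted object whose IncRef consults RTTI on every call:

```cpp
#include <atomic>
#include <typeinfo>

// Minimal sketch (illustrative names, not VMware's code) of a
// reference-counted base class that looks up RTTI on every IncRef.
class RefCounted {
public:
    virtual ~RefCounted() = default;

    void IncRef() {
        // The costly line: typeid(*this) does a runtime type lookup.
        // On 64-bit Windows the RTTI record holds 32-bit offsets, so the
        // runtime must first find the module's base address (walking the
        // loaded-module list) before it can form the full pointer.
        const std::type_info& tinfo = typeid(*this);
        (void)tinfo;  // assumed to be used for per-type bookkeeping
        ++refCount_;
    }

    void DecRef() {
        if (--refCount_ == 0) {
            delete this;  // object deletes itself when the count hits zero
        }
    }

private:
    std::atomic<int> refCount_{0};
};
```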
936 00:31:56,130 --> 00:31:58,060 And if you increment a reference whenever you do a 937 00:31:58,060 --> 00:31:59,870 memory allocation, you can imagine that this is going to 938 00:31:59,870 --> 00:32:02,630 consume a lot of CPU. 939 00:32:02,630 --> 00:32:04,810 So it turns out, this is what's actually slow. 940 00:32:04,810 --> 00:32:06,900 And remember, that we started this off a little by saying, 941 00:32:06,900 --> 00:32:07,950 oh, we've got this weird function, 942 00:32:07,950 --> 00:32:10,830 ZwQueryVirtualMemory, which seems to be being called a lot 943 00:32:10,830 --> 00:32:12,320 from a bunch of different places. 944 00:32:12,320 --> 00:32:14,270 The reason it's called a lot is because all of these 945 00:32:14,270 --> 00:32:16,020 routines to get typeid information are 946 00:32:16,020 --> 00:32:17,550 calling this routine. 947 00:32:17,550 --> 00:32:19,060 So there's actually two solutions. 948 00:32:19,060 --> 00:32:21,355 And you can look on this blog to actually figure out that 949 00:32:21,355 --> 00:32:24,270 this is a known problem that Microsoft eventually fixed. 950 00:32:24,270 --> 00:32:28,350 And it turns out, that the two solutions end up being that 951 00:32:28,350 --> 00:32:30,740 you can either statically compute that base address so 952 00:32:30,740 --> 00:32:32,580 you're not constantly relooking it up 953 00:32:32,580 --> 00:32:33,530 every single time. 954 00:32:33,530 --> 00:32:35,810 But there's actually a much simpler and stupider solution, 955 00:32:35,810 --> 00:32:37,790 which is, use the latest runtime library where they 956 00:32:37,790 --> 00:32:39,940 have actually fixed this problem. 957 00:32:39,940 --> 00:32:43,610 So just to summarize what was going on. 958 00:32:43,610 --> 00:32:45,670 We were CPU saturated. 959 00:32:45,670 --> 00:32:46,850 We looked from the top down. 960 00:32:46,850 --> 00:32:47,850 We didn't notice anything. 961 00:32:47,850 --> 00:32:49,070 We looked from the bottom up. 962 00:32:49,070 --> 00:32:50,930 And we noticed that some bottom was being called a lot. 963 00:32:50,930 --> 00:32:52,840 We figured out where that bottom was being called by 964 00:32:52,840 --> 00:32:55,240 looking at caller/callee views. 965 00:32:55,240 --> 00:32:56,570 Once we figured out where it was being called, we then 966 00:32:56,570 --> 00:32:58,490 tried to figure out why that thing was being called. 967 00:32:58,490 --> 00:32:59,960 And then we fixed the problem. 968 00:32:59,960 --> 00:33:03,490 So to me, I guess the take-home lesson from this is 969 00:33:03,490 --> 00:33:06,080 that when that developer said to me, three months before 970 00:33:06,080 --> 00:33:07,940 this whole debacle started, he said, oh, yeah, we're 971 00:33:07,940 --> 00:33:09,520 switching from say 32 to 64-bit. 972 00:33:09,520 --> 00:33:10,980 That's not a big deal. 973 00:33:10,980 --> 00:33:12,750 You can't really take that stuff at face value, because a 974 00:33:12,750 --> 00:33:14,180 lot of little things can really add up. 975 00:33:14,180 --> 00:33:16,400 And we could never have really guessed that this would happen 976 00:33:16,400 --> 00:33:19,070 beforehand. 977 00:33:19,070 --> 00:33:20,360 So that's four. 978 00:33:20,360 --> 00:33:21,800 Let me go over another one. 979 00:33:21,800 --> 00:33:23,150 Memory usage. 980 00:33:23,150 --> 00:33:24,690 So excessive-- 981 00:33:24,690 --> 00:33:27,340 I'm sure actually all of you know that memory usage is a 982 00:33:27,340 --> 00:33:28,130 big problem. 
983 00:33:28,130 --> 00:33:30,080 Your application is going to slow down if you're inducing a 984 00:33:30,080 --> 00:33:31,130 lot of paging. 985 00:33:31,130 --> 00:33:33,700 And at least in 32-bit Windows, if you exceed two 986 00:33:33,700 --> 00:33:36,160 gigabytes for the process that you're running, the process is 987 00:33:36,160 --> 00:33:37,090 just going to crash. 988 00:33:37,090 --> 00:33:40,030 So you clearly do care about memory usage. 989 00:33:40,030 --> 00:33:43,660 And I want to differentiate between memory leaks and 990 00:33:43,660 --> 00:33:44,610 memory accumulation. 991 00:33:44,610 --> 00:33:46,090 You're wondering why would a process take up two 992 00:33:46,090 --> 00:33:46,840 gigabytes of memory. 993 00:33:46,840 --> 00:33:48,440 Well, one is that maybe you have a leak. 994 00:33:48,440 --> 00:33:49,090 You allocate something. 995 00:33:49,090 --> 00:33:50,320 You don't remove it. 996 00:33:50,320 --> 00:33:51,630 Or it might just be an accumulation. 997 00:33:51,630 --> 00:33:53,420 You allocate something somewhere in your code, you 998 00:33:53,420 --> 00:33:54,950 actually are freeing it. 999 00:33:54,950 --> 00:33:56,480 But the trouble is that routine never gets called, 1000 00:33:56,480 --> 00:33:59,750 because in your code logic you never take the time to think 1001 00:33:59,750 --> 00:34:03,250 and say, OK, can I actually get rid of this at this point. 1002 00:34:03,250 --> 00:34:06,340 And that's what I'm going to call a memory accumulation. 1003 00:34:06,340 --> 00:34:08,090 There's a lot of different tools out there to analyze 1004 00:34:08,090 --> 00:34:08,980 memory usage. 1005 00:34:08,980 --> 00:34:11,719 And the sad part about this is that almost none of them work 1006 00:34:11,719 --> 00:34:13,400 for the particular example that we have, because they 1007 00:34:13,400 --> 00:34:15,360 just didn't scale very well. 1008 00:34:15,360 --> 00:34:17,389 It's fine if you've got your toy, stupid little application 1009 00:34:17,389 --> 00:34:18,550 that does addition in a loop. 1010 00:34:18,550 --> 00:34:20,120 Not so fine if you've got an enterprise middleware 1011 00:34:20,120 --> 00:34:20,650 application. 1012 00:34:20,650 --> 00:34:22,699 So we ended up, despite all the tools that we have 1013 00:34:22,699 --> 00:34:24,780 available, we ended up having to write our own. 1014 00:34:24,780 --> 00:34:25,860 I didn't do this, a colleague did. 1015 00:34:25,860 --> 00:34:27,900 But anyway, I guess that distinction 1016 00:34:27,900 --> 00:34:28,610 doesn't really matter. 1017 00:34:28,610 --> 00:34:30,360 But anyway, the point is that you do a lot of what these 1018 00:34:30,360 --> 00:34:30,880 other tools do. 1019 00:34:30,880 --> 00:34:32,580 But you do it special to your environment. 1020 00:34:32,580 --> 00:34:35,000 In this case, we just hooked a lot of the calls to malloc. 1021 00:34:35,000 --> 00:34:36,670 And we figure out what pointers are live. 1022 00:34:36,670 --> 00:34:39,179 And then we do stuff from there. 1023 00:34:39,179 --> 00:34:40,409 This is how normal tools work. 1024 00:34:40,409 --> 00:34:43,000 And of course, this can be unusably slow if you do a ton 1025 00:34:43,000 --> 00:34:43,760 of allocations. 
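For concreteness, here is a rough sketch, with assumed names, of the kind of allocation hook such a home-grown tool is built around; the real tool also recorded call stacks and was tailored to the server:

```cpp
#include <cstdlib>
#include <mutex>
#include <unordered_map>

// Sketch of an allocation tracker: remember every live pointer and its
// size, and forget it on free, so "live bytes" can be reported on demand.
static std::unordered_map<void*, std::size_t> g_live;
static std::size_t g_liveBytes = 0;
static std::mutex g_lock;

void* tracked_malloc(std::size_t size) {
    void* p = std::malloc(size);
    if (p != nullptr) {
        std::lock_guard<std::mutex> guard(g_lock);
        g_live[p] = size;       // record this pointer as live
        g_liveBytes += size;
    }
    return p;
}

void tracked_free(void* p) {
    if (p == nullptr) return;
    {
        std::lock_guard<std::mutex> guard(g_lock);
        auto it = g_live.find(p);
        if (it != g_live.end()) {   // forget it when it is freed
            g_liveBytes -= it->second;
            g_live.erase(it);
        }
    }
    std::free(p);
}
```

Every allocation now pays for a lock acquisition and a hash-table update, which is exactly the overhead being described next.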
1026 00:34:43,760 --> 00:34:45,739 For example, if you're doing millions of allocations per 1027 00:34:45,739 --> 00:34:47,820 second, you can imagine you're calling this kind of routine, 1028 00:34:47,820 --> 00:34:48,489 which itself-- 1029 00:34:48,489 --> 00:34:52,460 if it only has a 10x overhead over a call without that, 1030 00:34:52,460 --> 00:34:54,300 you're slowing down your program by 10x for every 1031 00:34:54,300 --> 00:34:55,270 single memory allocation. 1032 00:34:55,270 --> 00:34:57,195 That means performance depends on how many allocations you 1033 00:34:57,195 --> 00:34:58,800 do, which is really bad. 1034 00:34:58,800 --> 00:35:00,440 And just to illustrate what I mean by a leak. 1035 00:35:00,440 --> 00:35:01,750 My definition is basically that you've got 1036 00:35:01,750 --> 00:35:03,210 this routine, foo. 1037 00:35:03,210 --> 00:35:04,600 And it mallocs some data. 1038 00:35:04,600 --> 00:35:06,620 But it returns without freeing that data. 1039 00:35:06,620 --> 00:35:08,170 And nowhere else in the code did you 1040 00:35:08,170 --> 00:35:09,210 actually have a free. 1041 00:35:09,210 --> 00:35:10,830 That's an actual memory leak. 1042 00:35:10,830 --> 00:35:12,640 An accumulation would be if you have some free somewhere. 1043 00:35:12,640 --> 00:35:15,080 But it's never actually called. 1044 00:35:15,080 --> 00:35:18,330 So a lot of the code that we write happened-- at least in 1045 00:35:18,330 --> 00:35:20,060 the past, it's changing-- but a lot of it happened to be 1046 00:35:20,060 --> 00:35:20,960 written in C++. 1047 00:35:20,960 --> 00:35:24,260 And all of you, I guess assume, you know about new and 1048 00:35:24,260 --> 00:35:26,610 delete and all those kinds of things that we do in C++. 1049 00:35:26,610 --> 00:35:28,680 And it turns out, that those things are notoriously bad for 1050 00:35:28,680 --> 00:35:29,360 causing memory leaks. 1051 00:35:29,360 --> 00:35:31,310 So instead, what people typically do is they use 1052 00:35:31,310 --> 00:35:32,380 reference counted objects. 1053 00:35:32,380 --> 00:35:34,590 So every time you use something you increment some 1054 00:35:34,590 --> 00:35:35,210 reference count. 1055 00:35:35,210 --> 00:35:37,350 And every time you delete it, you 1056 00:35:37,350 --> 00:35:38,270 decrement a reference count. 1057 00:35:38,270 --> 00:35:40,860 And when the decremented reference count is zero it 1058 00:35:40,860 --> 00:35:42,310 automatically gets deleted. 1059 00:35:42,310 --> 00:35:44,250 Now of course, this only solves an actual leak. 1060 00:35:44,250 --> 00:35:46,470 It doesn't solve an accumulation. 1061 00:35:46,470 --> 00:35:48,190 So here was the problem that I noticed. 1062 00:35:48,190 --> 00:35:50,110 That was all kind of setup, just saying what's a memory 1063 00:35:50,110 --> 00:35:52,530 leak, what's an accumulation, why is memory a problem. 1064 00:35:52,530 --> 00:35:55,050 Let's get to actually why it was a problem. 1065 00:35:55,050 --> 00:35:56,230 I had this issue. 1066 00:35:56,230 --> 00:35:57,950 I have our little server application. 1067 00:35:57,950 --> 00:36:01,950 And it's running out of memory after several hours. 1068 00:36:01,950 --> 00:36:05,950 Now I used Purify on a much, much smaller setup. 1069 00:36:05,950 --> 00:36:09,870 It was about 1/100 the size of the actual setup, because 1070 00:36:09,870 --> 00:36:11,860 larger than that, Purify would crash, because it would run 1071 00:36:11,860 --> 00:36:12,480 out of memory. 
1072 00:36:12,480 --> 00:36:14,380 Because by the way, to track all the memory references, 1073 00:36:14,380 --> 00:36:16,150 it's got to allocate memory, which ends up being a 1074 00:36:16,150 --> 00:36:18,260 nightmare of unprecedented proportions. 1075 00:36:18,260 --> 00:36:22,680 So the deal was, was that I couldn't really detect this 1076 00:36:22,680 --> 00:36:24,670 leak very easily. 1077 00:36:24,670 --> 00:36:27,350 So what I did was I examined the memory that was in use. 1078 00:36:27,350 --> 00:36:28,970 Essentially, what I'm trying to say is that, there wasn't 1079 00:36:28,970 --> 00:36:31,750 so much a leak as there was an accumulation. 1080 00:36:31,750 --> 00:36:33,630 Accumulation again being a logical problem. 1081 00:36:33,630 --> 00:36:35,120 Well, yeah, all of my code is written correctly. 1082 00:36:35,120 --> 00:36:37,470 But it's just that free call is never called. 1083 00:36:37,470 --> 00:36:40,900 So I examined the memory in use and I localized it to one 1084 00:36:40,900 --> 00:36:43,280 operation that whenever we did it, memory would increase. 1085 00:36:43,280 --> 00:36:45,600 So since doing it only once, there's a lot of noise. 1086 00:36:45,600 --> 00:36:47,610 What I did was I said, well, OK, why don't I just do this 1087 00:36:47,610 --> 00:36:49,530 operation hundreds of times at the same time. 1088 00:36:49,530 --> 00:36:52,080 Because then, if I see some allocation that's 100x from 1089 00:36:52,080 --> 00:36:53,920 before I started this I can know that this 1090 00:36:53,920 --> 00:36:55,680 was actually a problem. 1091 00:36:55,680 --> 00:36:57,270 And that's exactly what I did. 1092 00:36:57,270 --> 00:36:58,820 I had this issue where something was being 1093 00:36:58,820 --> 00:37:00,370 allocated 64 bytes. 1094 00:37:00,370 --> 00:37:02,580 Now 64 bytes is so small that it's pretty much in the noise. 1095 00:37:02,580 --> 00:37:03,990 But if you do it a hundred times, and you're 1096 00:37:03,990 --> 00:37:04,950 seeing 6,400 bytes. 1097 00:37:04,950 --> 00:37:06,480 Then you do it a thousand times and you 1098 00:37:06,480 --> 00:37:07,740 see 64 times 1,000. 1099 00:37:07,740 --> 00:37:09,225 Then it becomes something easily that 1100 00:37:09,225 --> 00:37:10,900 you can then track. 1101 00:37:10,900 --> 00:37:13,340 Once I noticed that this particular piece of data was 1102 00:37:13,340 --> 00:37:15,250 being allocated like this, I went through and I realized 1103 00:37:15,250 --> 00:37:17,220 that this actually was-- 1104 00:37:17,220 --> 00:37:18,530 it was an accumulation and not a leak. 1105 00:37:18,530 --> 00:37:20,490 So someone would allocate something whenever an incoming 1106 00:37:20,490 --> 00:37:21,490 message came in. 1107 00:37:21,490 --> 00:37:23,260 And there was a free somewhere, but it was never 1108 00:37:23,260 --> 00:37:25,060 being called. 1109 00:37:25,060 --> 00:37:28,190 So the lesson that I learned from this is that in a lot of 1110 00:37:28,190 --> 00:37:30,810 situations with performance debugging, it's very helpful 1111 00:37:30,810 --> 00:37:32,100 to try to find a problem-- 1112 00:37:32,100 --> 00:37:33,660 in order to try to find a problem, it's very helpful to 1113 00:37:33,660 --> 00:37:36,390 magnify it and make it a lot worse. 1114 00:37:36,390 --> 00:37:38,440 If you think something is problematic in the network, 1115 00:37:38,440 --> 00:37:40,160 slow down your network if that's an option. 
1116 00:37:40,160 --> 00:37:42,860 If you think something has a problem with memory usage, 1117 00:37:42,860 --> 00:37:44,855 either do something like what I've described here. 1118 00:37:44,855 --> 00:37:46,220 Or maybe reduce the amount of memory. 1119 00:37:46,220 --> 00:37:49,360 When I was in graduate school, we were working on a cache 1120 00:37:49,360 --> 00:37:50,660 coherence protocol. 1121 00:37:50,660 --> 00:37:51,780 And I'm going to spare you a lot of the details. 1122 00:37:51,780 --> 00:37:52,670 But here's the thing. 1123 00:37:52,670 --> 00:37:55,360 We basically stored a linked list for every data in memory. 1124 00:37:55,360 --> 00:37:57,290 And we had to make sure that that linked list was rock 1125 00:37:57,290 --> 00:38:00,290 solid no matter what the size of the inventory was. 1126 00:38:00,290 --> 00:38:01,820 And if you exceeded the linked list, you'd 1127 00:38:01,820 --> 00:38:02,780 store stuff in disk. 1128 00:38:02,780 --> 00:38:05,590 So what we did to magnify any possible bugs was we created 1129 00:38:05,590 --> 00:38:07,190 it so that you could make a linked list 1130 00:38:07,190 --> 00:38:08,980 of at most one item. 1131 00:38:08,980 --> 00:38:10,780 That way no matter what you did it would force you to 1132 00:38:10,780 --> 00:38:11,770 overflow and go into disk. 1133 00:38:11,770 --> 00:38:14,000 And that was a great way to test a lot of overflow 1134 00:38:14,000 --> 00:38:16,140 conditions that in practice would never have occurred. 1135 00:38:16,140 --> 00:38:17,980 Because we do the engineering thing, figure out the 1136 00:38:17,980 --> 00:38:19,990 capacity, double it, and hope that nothing happens. 1137 00:38:19,990 --> 00:38:23,900 So it's good to exaggerate edge conditions. 1138 00:38:23,900 --> 00:38:25,980 So that's one memory analysis problem. 1139 00:38:25,980 --> 00:38:27,250 Here's another one. 1140 00:38:27,250 --> 00:38:27,790 Here's the deal. 1141 00:38:27,790 --> 00:38:29,820 You've got a user and they're using this middleware server. 1142 00:38:29,820 --> 00:38:32,590 And they complain that with time it gets slower, and 1143 00:38:32,590 --> 00:38:34,260 slower, and slower. 1144 00:38:34,260 --> 00:38:35,290 And if you look at all of the-- 1145 00:38:35,290 --> 00:38:37,760 if you look at CPU network and disk, you see that nothing is 1146 00:38:37,760 --> 00:38:38,960 being saturated there. 1147 00:38:38,960 --> 00:38:41,320 But if you look at memory, it's increasing pretty 1148 00:38:41,320 --> 00:38:42,300 dramatically. 1149 00:38:42,300 --> 00:38:44,810 And at some point, it increases so dramatically that 1150 00:38:44,810 --> 00:38:47,300 the server application crashes. 1151 00:38:47,300 --> 00:38:50,820 So first thing, let's actually look at the memory utilization 1152 00:38:50,820 --> 00:38:52,150 and see the timeline. 1153 00:38:52,150 --> 00:38:54,880 What I show here is I show versus time, I show a chart of 1154 00:38:54,880 --> 00:38:57,190 something in Windows that's called private bytes. 1155 00:38:57,190 --> 00:39:00,060 Essentially, think of that as memory. 1156 00:39:00,060 --> 00:39:02,550 What you see is that memory is growing slowly, whatever. 1157 00:39:02,550 --> 00:39:05,000 But at some point, near the very end, near when the server 1158 00:39:05,000 --> 00:39:10,010 crashes, it starts growing at a really, really fast clip. 1159 00:39:10,010 --> 00:39:11,440 Now remember, I call this private bytes. 
1160 00:39:11,440 --> 00:39:14,480 What that essentially is, in this particular case, it's 1161 00:39:14,480 --> 00:39:18,120 Windows giving a process some amount of memory. 1162 00:39:18,120 --> 00:39:20,410 It's not necessarily what the process or what the 1163 00:39:20,410 --> 00:39:21,900 application is requesting. 1164 00:39:21,900 --> 00:39:23,260 The application will request some amount. 1165 00:39:23,260 --> 00:39:24,730 And Windows will give it this much. 1166 00:39:24,730 --> 00:39:27,570 Well, Windows keeps giving it more and more and more. 1167 00:39:27,570 --> 00:39:31,840 And it eventually just runs out of memory to give. 1168 00:39:31,840 --> 00:39:34,390 So the reason I made the distinction between something 1169 00:39:34,390 --> 00:39:36,670 that an application and something Windows is giving it 1170 00:39:36,670 --> 00:39:39,260 is that I, as an application, might be asking for 1171 00:39:39,260 --> 00:39:39,950 four bytes at a time. 1172 00:39:39,950 --> 00:39:40,980 Four bytes, four bytes, whatever. 1173 00:39:40,980 --> 00:39:43,620 But what if Windows has this problem where, whenever you 1174 00:39:43,620 --> 00:39:45,740 allocate-- it doesn't, but suppose it does-- suppose that 1175 00:39:45,740 --> 00:39:47,480 every time you ask for four bytes, it says, well, you know 1176 00:39:47,480 --> 00:39:49,170 what, I'm just going to give you 16 bytes because that's my 1177 00:39:49,170 --> 00:39:50,530 smallest increment of memory. 1178 00:39:50,530 --> 00:39:52,340 You can imagine if I keep asking for 4 and it's dumb 1179 00:39:52,340 --> 00:39:54,240 enough to give me 16 every time, we're going to see a 1180 00:39:54,240 --> 00:39:56,490 huge expansion of memory, even though I, as an application, 1181 00:39:56,490 --> 00:39:57,730 didn't ask for that much. 1182 00:39:57,730 --> 00:40:00,810 So it's important to keep these distinctions in mind. 1183 00:40:00,810 --> 00:40:02,860 So the point is what was happening in this case, was my 1184 00:40:02,860 --> 00:40:05,133 server's going fine, fine, and fine, until near the very end 1185 00:40:05,133 --> 00:40:07,520 where it starts increasing in memory use at such a dramatic 1186 00:40:07,520 --> 00:40:10,090 rate, that it eventually crashes. 1187 00:40:10,090 --> 00:40:11,390 So the first thing I did-- 1188 00:40:11,390 --> 00:40:12,290 We just talked a lot about these 1189 00:40:12,290 --> 00:40:13,870 reference counted objects. 1190 00:40:13,870 --> 00:40:15,670 The fact that I keep track of whenever something is 1191 00:40:15,670 --> 00:40:18,020 allocated and whenever something is not being looked 1192 00:40:18,020 --> 00:40:18,590 at anymore. 1193 00:40:18,590 --> 00:40:19,780 When it's not being looked at anymore, I go 1194 00:40:19,780 --> 00:40:20,660 ahead and remove it. 1195 00:40:20,660 --> 00:40:22,470 If you look at the reference counted objects, they were 1196 00:40:22,470 --> 00:40:25,650 pretty flat until you get to the very end of the run, 1197 00:40:25,650 --> 00:40:28,150 exactly the same as when the memory was increasing. 1198 00:40:28,150 --> 00:40:30,450 And it turns out, I've highlighted a few of these, 1199 00:40:30,450 --> 00:40:31,710 and they ended up being-- 1200 00:40:31,710 --> 00:40:33,240 it doesn't matter what they specifically were-- but in 1201 00:40:33,240 --> 00:40:35,180 this case, they ended up being related to threads, and 1202 00:40:35,180 --> 00:40:38,340 mutexes, and stuff associated with keeping track of what was 1203 00:40:38,340 --> 00:40:40,040 going on in the system. 
1204 00:40:40,040 --> 00:40:41,670 So some thread-related objects were 1205 00:40:41,670 --> 00:40:42,940 increasing for some reason. 1206 00:40:42,940 --> 00:40:45,690 Now turns out thread state in some sense is very cheap. 1207 00:40:45,690 --> 00:40:48,130 So just because these guys were increasing, it doesn't 1208 00:40:48,130 --> 00:40:49,450 really explain why we're dramatically 1209 00:40:49,450 --> 00:40:50,400 increasing in memory. 1210 00:40:50,400 --> 00:40:52,800 But I'll tell you what does explain it. 1211 00:40:52,800 --> 00:40:54,420 If you take a look at a different attribute, which is 1212 00:40:54,420 --> 00:40:56,460 these are the total number of threads in our system, we have 1213 00:40:56,460 --> 00:40:57,800 a static pool of threads. 1214 00:40:57,800 --> 00:41:00,170 And whenever a work item comes in, we allocate a thread. 1215 00:41:00,170 --> 00:41:01,470 Whenever that work item is done, we 1216 00:41:01,470 --> 00:41:02,340 deallocate the thread. 1217 00:41:02,340 --> 00:41:04,210 So we keep it to a minimal number. 1218 00:41:04,210 --> 00:41:05,200 And it turns out, we had a cap. 1219 00:41:05,200 --> 00:41:07,730 We said, OK, I'm going to allow at most 20 operations to 1220 00:41:07,730 --> 00:41:08,600 occur in flight. 1221 00:41:08,600 --> 00:41:11,690 Anything beyond 20, I'm going to queue it up in memory. 1222 00:41:11,690 --> 00:41:13,080 And we'll deal with it later. 1223 00:41:13,080 --> 00:41:13,710 Here's the deal. 1224 00:41:13,710 --> 00:41:15,380 What you see when you look at the number of threads being 1225 00:41:15,380 --> 00:41:16,570 used is that-- 1226 00:41:16,570 --> 00:41:17,950 it's the bottom chart, unfortunately, it's a little 1227 00:41:17,950 --> 00:41:18,280 hard to see. 1228 00:41:18,280 --> 00:41:20,230 But the point is it goes up and down, up and down, up and 1229 00:41:20,230 --> 00:41:21,230 down with time. 1230 00:41:21,230 --> 00:41:22,850 So stuff gets allocated. 1231 00:41:22,850 --> 00:41:23,530 Stuff gets used. 1232 00:41:23,530 --> 00:41:24,420 Stuff gets removed. 1233 00:41:24,420 --> 00:41:26,330 But then when you get to the very end, you see this 1234 00:41:26,330 --> 00:41:28,790 monotonic increase in the number of threads being used. 1235 00:41:28,790 --> 00:41:29,750 It never goes down. 1236 00:41:29,750 --> 00:41:32,360 And at some point, you hit the point where all of the threads 1237 00:41:32,360 --> 00:41:35,460 that you've preallocated are completely taken. 1238 00:41:35,460 --> 00:41:37,130 You obviously can't allocate any more threads. 1239 00:41:37,130 --> 00:41:40,940 So every incoming request, you go ahead and queue. 1240 00:41:40,940 --> 00:41:42,260 This was the root of the problem. 1241 00:41:42,260 --> 00:41:44,660 The root of the problem is that in this particular 1242 00:41:44,660 --> 00:41:48,050 situation, each of the times that it was increasing was 1243 00:41:48,050 --> 00:41:51,070 because there was again an uncaught exception. 1244 00:41:51,070 --> 00:41:54,830 That uncaught exception resulted in a thread not being 1245 00:41:54,830 --> 00:41:55,510 deallocated. 1246 00:41:55,510 --> 00:41:56,950 At some point, we ran out of threads. 1247 00:41:56,950 --> 00:42:00,220 And every single request that came in was being allocated in 1248 00:42:00,220 --> 00:42:02,800 memory and waiting around for a thread to become available 1249 00:42:02,800 --> 00:42:04,730 before it could be finished. 
1250 00:42:04,730 --> 00:42:06,570 Well, if no threads are available, these things can 1251 00:42:06,570 --> 00:42:07,840 never get pulled off the queue. 1252 00:42:07,840 --> 00:42:09,540 You see a monotonic increase in the amount of memory. 1253 00:42:09,540 --> 00:42:11,810 And you eventually die. 1254 00:42:11,810 --> 00:42:15,190 And so, it's actually a correctness issue. 1255 00:42:15,190 --> 00:42:17,180 Why are you not catching exceptions properly? 1256 00:42:17,180 --> 00:42:18,920 And why are you not recovering properly from them? 1257 00:42:18,920 --> 00:42:20,840 But interestingly, it manifested itself in a 1258 00:42:20,840 --> 00:42:22,980 performance issue. 1259 00:42:22,980 --> 00:42:25,300 So in each of these cases, what we've done, is we've 1260 00:42:25,300 --> 00:42:27,860 looked at a customized profiler instead of a tool that we 1261 00:42:27,860 --> 00:42:30,860 might have gotten, Purify, Valgrind, or whatever-- 1262 00:42:30,860 --> 00:42:32,185 it's nice that it's tailored to our application. 1263 00:42:32,185 --> 00:42:34,160 And we can make it arbitrarily fast, because it 1264 00:42:34,160 --> 00:42:36,760 understands what our application is doing. 1265 00:42:36,760 --> 00:42:38,660 And it also-- the other nice thing about this is this can 1266 00:42:38,660 --> 00:42:39,610 be run in production environments. 1267 00:42:39,610 --> 00:42:41,480 If you write your own tool, and a customer calls and 1268 00:42:41,480 --> 00:42:42,916 complains, you can just say, oh, just twiddle this knob and 1269 00:42:42,916 --> 00:42:44,460 you're done. 1270 00:42:44,460 --> 00:42:45,370 That's kind of convenient. 1271 00:42:45,370 --> 00:42:46,860 But of course, the disadvantage is that now you 1272 00:42:46,860 --> 00:42:50,030 have to-- whenever you rewrite your code or do a new rev, or 1273 00:42:50,030 --> 00:42:50,900 whatever, you have to make sure that 1274 00:42:50,900 --> 00:42:51,865 these tools work right. 1275 00:42:51,865 --> 00:42:53,820 And you have to recompile the code. 1276 00:42:53,820 --> 00:42:55,600 If you have something like Purify, Quantify, in a lot 1277 00:42:55,600 --> 00:42:57,490 of situations, you can run with some sort of preexisting 1278 00:42:57,490 --> 00:42:59,010 binary-- sometimes, not always. 1279 00:42:59,010 --> 00:43:00,870 You can run it without having to recompile. 1280 00:43:00,870 --> 00:43:03,320 And that can be very convenient. 1281 00:43:03,320 --> 00:43:05,200 So I don't know. 1282 00:43:05,200 --> 00:43:07,600 I was kind of stuck for what a lesson this was besides common 1283 00:43:07,600 --> 00:43:08,230 sense is useful. 1284 00:43:08,230 --> 00:43:09,980 So I just said memory profiling is pretty critical. 1285 00:43:09,980 --> 00:43:12,090 Don't ignore that when you're writing your applications. 1286 00:43:12,090 --> 00:43:14,760 And I think I also want to point out that-- 1287 00:43:14,760 --> 00:43:15,560 I didn't put this here. 1288 00:43:15,560 --> 00:43:17,990 But you might think that when you're using a garbage 1289 00:43:17,990 --> 00:43:19,650 collected language, like something like Java, that 1290 00:43:19,650 --> 00:43:20,980 these problems go away. 1291 00:43:20,980 --> 00:43:22,120 But they actually don't. 1292 00:43:22,120 --> 00:43:24,110 The problem is actually now memory allocations and 1293 00:43:24,110 --> 00:43:25,530 deallocations are more hidden from you. 1294 00:43:25,530 --> 00:43:27,750 So you have to be a little bit more careful about how you 1295 00:43:27,750 --> 00:43:29,810 analyze them. 
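To make the thread pool failure mode from this case study concrete, here is a toy sketch, with assumed names and structure rather than the real server code, of a fixed pool of worker slots where an exception that escapes a work item never gives its slot back:

```cpp
#include <functional>
#include <mutex>
#include <queue>

// Toy sketch of the failure mode: a fixed pool of worker slots, where an
// exception that escapes the work item skips the release step, so the
// slot is lost forever and later requests pile up in the queue.
class WorkerPool {
public:
    explicit WorkerPool(int slots) : freeSlots_(slots) {}

    void Submit(std::function<void()> work) {
        std::unique_lock<std::mutex> lock(lock_);
        if (freeSlots_ == 0) {
            pending_.push(std::move(work));   // queued in memory; grows forever
            return;                           // once no slot is ever released
        }
        --freeSlots_;
        lock.unlock();

        work();      // BUG: if this throws, we never reach the release below
        Release();

        // Fix: hold the slot in an RAII guard (or use try/catch) so it is
        // returned even when the work item throws.
    }

private:
    void Release() {
        std::lock_guard<std::mutex> guard(lock_);
        ++freeSlots_;
        // (draining pending_ is omitted in this sketch)
    }

    std::mutex lock_;
    int freeSlots_;
    std::queue<std::function<void()>> pending_;
};
```

With a fixed number of slots, every escaped exception permanently shrinks the pool; once it hits zero, every new request just accumulates in the pending queue, which shows up as the monotonic memory growth described above.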
1296 00:43:29,810 --> 00:43:31,080 Case study number seven. 1297 00:43:31,080 --> 00:43:32,940 So we've looked a lot of things related to CPU usage. 1298 00:43:32,940 --> 00:43:35,380 We've looked at a lot of things related to memory. 1299 00:43:35,380 --> 00:43:37,260 Now let's look a little bit at the network. 1300 00:43:37,260 --> 00:43:38,240 Here was the problem. 1301 00:43:38,240 --> 00:43:40,050 I have a user, our Canonical admin. 1302 00:43:40,050 --> 00:43:42,780 The user wants to perform an operation on some virtual 1303 00:43:42,780 --> 00:43:43,810 machine somewhere. 1304 00:43:43,810 --> 00:43:44,920 Remember the flow. 1305 00:43:44,920 --> 00:43:47,430 Issue request, it goes to some server. 1306 00:43:47,430 --> 00:43:49,910 The server goes to the host where that VM is. 1307 00:43:49,910 --> 00:43:51,570 Does some operation, comes back. 1308 00:43:51,570 --> 00:43:52,960 So this user wants to perform an operation 1309 00:43:52,960 --> 00:43:54,000 on a virtual machine. 1310 00:43:54,000 --> 00:43:59,260 I had setup A, In setup A, this operation end to end took 1311 00:43:59,260 --> 00:44:00,710 about eight seconds. 1312 00:44:00,710 --> 00:44:01,280 Fine. 1313 00:44:01,280 --> 00:44:03,890 That doesn't mean much until you compare to setup B, where 1314 00:44:03,890 --> 00:44:05,340 it took twice the amount of time. 1315 00:44:05,340 --> 00:44:07,300 It took 16 seconds, not 8 seconds. 1316 00:44:07,300 --> 00:44:09,560 So now the question you're asking yourself is what's the 1317 00:44:09,560 --> 00:44:11,620 difference between setups A and B. 1318 00:44:11,620 --> 00:44:14,600 In setups A and B, the only difference is that this middle 1319 00:44:14,600 --> 00:44:17,560 server is a different version number. 1320 00:44:17,560 --> 00:44:19,520 But everything else is exactly the same. 1321 00:44:19,520 --> 00:44:22,120 So with this different version you somehow manage to make 1322 00:44:22,120 --> 00:44:25,080 your performance twice as bad, which is a really bad thing. 1323 00:44:25,080 --> 00:44:26,590 We got to get to the bottom of that. 1324 00:44:26,590 --> 00:44:28,610 And in addition, it's not like it was some random thing that 1325 00:44:28,610 --> 00:44:29,120 happened once. 1326 00:44:29,120 --> 00:44:30,480 It happened every single time. 1327 00:44:30,480 --> 00:44:33,540 So now, let's take a look at what was going on. 1328 00:44:33,540 --> 00:44:36,430 So again, a lot of times, I tend to do things in a top 1329 00:44:36,430 --> 00:44:38,440 down manner, where I say what's the easiest information 1330 00:44:38,440 --> 00:44:39,280 to collect. 1331 00:44:39,280 --> 00:44:40,180 And then, OK. 1332 00:44:40,180 --> 00:44:41,380 What's more difficult, what's more difficult? 1333 00:44:41,380 --> 00:44:43,850 And finally, if you have to use really industrial strength 1334 00:44:43,850 --> 00:44:44,670 tools, you use them. 1335 00:44:44,670 --> 00:44:45,700 But it might not be necessary. 1336 00:44:45,700 --> 00:44:47,980 In this case, my first thing was just to use logging and 1337 00:44:47,980 --> 00:44:50,060 try to figure out where time was being spent. 1338 00:44:50,060 --> 00:44:51,790 Remember, we've got a client talking to a server, talking 1339 00:44:51,790 --> 00:44:52,630 to an end host. 1340 00:44:52,630 --> 00:44:55,570 In this case, the latency imposed by that client was 1341 00:44:55,570 --> 00:44:58,010 exactly the same in these two setups. 1342 00:44:58,010 --> 00:45:00,940 So that's not where this eight seconds is being lost. 
1343 00:45:00,940 --> 00:45:03,920 The amount of time being claimed by the server was 1344 00:45:03,920 --> 00:45:05,340 exactly the same. 1345 00:45:05,340 --> 00:45:06,010 I'm a client. 1346 00:45:06,010 --> 00:45:06,630 I go to a server. 1347 00:45:06,630 --> 00:45:09,180 It then goes to a host. 1348 00:45:09,180 --> 00:45:12,710 However, the time spent on the host was different. 1349 00:45:12,710 --> 00:45:14,710 That's where all of the time was being spent. 1350 00:45:14,710 --> 00:45:17,230 Now, you're probably looking at-- well, you may not be. 1351 00:45:17,230 --> 00:45:18,220 But I'll tell you to look now. 1352 00:45:18,220 --> 00:45:21,340 We had setup A. Setup A was a client and the server were 1353 00:45:21,340 --> 00:45:22,470 each at some version number. 1354 00:45:22,470 --> 00:45:24,710 And the host was at the same version number. 1355 00:45:24,710 --> 00:45:27,110 In setup B, the client and the server were at a different 1356 00:45:27,110 --> 00:45:29,160 version number with respect to the host. 1357 00:45:29,160 --> 00:45:30,890 But we just said all of the time is being 1358 00:45:30,890 --> 00:45:31,860 spent in the host. 1359 00:45:31,860 --> 00:45:33,520 So why is that going on? 1360 00:45:33,520 --> 00:45:35,340 And this is where understanding the architecture 1361 00:45:35,340 --> 00:45:35,830 is important. 1362 00:45:35,830 --> 00:45:37,940 It turns out in this case, the first thing that happens when 1363 00:45:37,940 --> 00:45:39,830 you have a difference in version number is that the 1364 00:45:39,830 --> 00:45:42,870 server helpfully talks to the end host and says you know 1365 00:45:42,870 --> 00:45:44,010 what, you're in a different version number. 1366 00:45:44,010 --> 00:45:47,090 So let me give you some code to run, because you're in a 1367 00:45:47,090 --> 00:45:47,810 different version number. 1368 00:45:47,810 --> 00:45:50,030 So in fact, even though the host is at the same version 1369 00:45:50,030 --> 00:45:52,440 number, there's a small shim running on that host that 1370 00:45:52,440 --> 00:45:53,990 makes up for the fact that it's not at the 1371 00:45:53,990 --> 00:45:56,110 same version number. 1372 00:45:56,110 --> 00:45:58,420 So now let's analyze this host and figure out where is this 1373 00:45:58,420 --> 00:46:00,350 time being spent. 1374 00:46:00,350 --> 00:46:02,020 I did a little bit more logging on the host. 1375 00:46:02,020 --> 00:46:03,880 The reason I did this, because a lot of the standard tools 1376 00:46:03,880 --> 00:46:06,360 that you might use, like gprof or whatever, these things 1377 00:46:06,360 --> 00:46:06,920 didn't exist. 1378 00:46:06,920 --> 00:46:08,050 We have our own kernel. 1379 00:46:08,050 --> 00:46:10,590 Which therefore, we have our own tools and our own 1380 00:46:10,590 --> 00:46:12,200 nightmares, and headaches associated with it. 1381 00:46:12,200 --> 00:46:14,440 So we had to use these tools instead of commercially 1382 00:46:14,440 --> 00:46:15,680 available tools. 1383 00:46:15,680 --> 00:46:16,980 When you use these tools, we ended up 1384 00:46:16,980 --> 00:46:17,880 narrowing down the time. 1385 00:46:17,880 --> 00:46:18,850 Here's what goes on. 1386 00:46:18,850 --> 00:46:19,360 I'm a client. 1387 00:46:19,360 --> 00:46:20,480 And I talk to the server. 1388 00:46:20,480 --> 00:46:22,430 The server talks to an agent on the host. 1389 00:46:22,430 --> 00:46:24,680 It says, agent, this is what I want to do. 
1390 00:46:24,680 --> 00:46:26,980 The agent then talks to the hardware abstraction layer and 1391 00:46:26,980 --> 00:46:28,780 says, dude, this is what I want to do. 1392 00:46:28,780 --> 00:46:30,600 The hardware abstraction layer comes back to the agent and 1393 00:46:30,600 --> 00:46:31,390 says, done. 1394 00:46:31,390 --> 00:46:32,670 Agent goes back to the server. 1395 00:46:32,670 --> 00:46:34,790 Server goes back to the client, blah, blah, blah. 1396 00:46:34,790 --> 00:46:37,340 It turns out that this interaction on the host 1397 00:46:37,340 --> 00:46:39,630 between the agent that accepted the call from the 1398 00:46:39,630 --> 00:46:42,170 server and the hardware abstraction layer, that was 1399 00:46:42,170 --> 00:46:43,820 where all of the time was being spent. 1400 00:46:43,820 --> 00:46:45,640 It was 10 milliseconds in one case. 1401 00:46:45,640 --> 00:46:48,610 And 20 times that length in the other case. 1402 00:46:48,610 --> 00:46:51,280 And so this was a pretty big difference not to be ignored. 1403 00:46:51,280 --> 00:46:53,950 So we wanted to look at why that was going on. 1404 00:46:53,950 --> 00:46:55,660 So here, it's kind of a dumb thing. 1405 00:46:55,660 --> 00:46:56,980 But the next thing I did was, well, let's take a look 1406 00:46:56,980 --> 00:46:57,620 at the configuration. 1407 00:46:57,620 --> 00:46:58,910 Maybe I did something stupid that would have 1408 00:46:58,910 --> 00:47:00,680 caused this to happen. 1409 00:47:00,680 --> 00:47:01,370 Here's the deal. 1410 00:47:01,370 --> 00:47:03,220 In setup A, remember everything is at the same 1411 00:47:03,220 --> 00:47:03,990 version number. 1412 00:47:03,990 --> 00:47:06,490 And it turns out this agent-HAL communication occurs over 1413 00:47:06,490 --> 00:47:07,780 a named pipe. 1414 00:47:07,780 --> 00:47:09,220 OK, whatever. 1415 00:47:09,220 --> 00:47:10,310 It turns out in setup B-- 1416 00:47:10,310 --> 00:47:12,280 remember the client and the server were at a different 1417 00:47:12,280 --> 00:47:14,570 version from the host, so the server has to download some 1418 00:47:14,570 --> 00:47:16,400 shim code onto that host so that 1419 00:47:16,400 --> 00:47:18,270 everybody can talk correctly. 1420 00:47:18,270 --> 00:47:21,520 It turns out that shim code was communicating with the 1421 00:47:21,520 --> 00:47:23,810 hardware abstraction layer-- instead of over a named pipe-- 1422 00:47:23,810 --> 00:47:27,030 it was communicating over TCP/IP. 1423 00:47:27,030 --> 00:47:30,590 Now I have to give credit to my colleague on this. 1424 00:47:30,590 --> 00:47:33,680 As soon as I showed him this information, he hit a flash of 1425 00:47:33,680 --> 00:47:35,170 inspiration. 1426 00:47:35,170 --> 00:47:37,960 It turns out that when you're using the named pipe and it 1427 00:47:37,960 --> 00:47:40,810 takes 10 milliseconds, the reason that's different from 1428 00:47:40,810 --> 00:47:43,630 TCP/IP communication is because of something called 1429 00:47:43,630 --> 00:47:44,800 Nagle's algorithm. 1430 00:47:44,800 --> 00:47:47,560 Nagle's algorithm basically says, you want to accumulate 1431 00:47:47,560 --> 00:47:50,450 data before you send it so you reduce the number of round 1432 00:47:50,450 --> 00:47:51,840 trips and the overhead of creating a 1433 00:47:51,840 --> 00:47:52,570 connection, and whatever. 1434 00:47:52,570 --> 00:47:54,910 And the problem was this was biting us. 
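The eventual fix comes down to a single socket option; as a minimal sketch, assuming a POSIX-style TCP socket rather than the actual shim code:

```cpp
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Disable Nagle's algorithm on a connected TCP socket so small writes go
// out immediately instead of being held back while the stack waits for
// more data (or an ACK) to coalesce them.
bool DisableNagle(int sockfd) {
    int flag = 1;
    return setsockopt(sockfd, IPPROTO_TCP, TCP_NODELAY,
                      &flag, sizeof(flag)) == 0;
}
```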
1435 00:47:54,910 --> 00:47:57,070 Essentially, we were waiting within that shim to collect 1436 00:47:57,070 --> 00:47:59,180 enough data to send it back to that HAL. 1437 00:47:59,180 --> 00:48:02,010 And we were waiting fairly needlessly, because what 1438 00:48:02,010 --> 00:48:04,330 happens is that this algorithm waits for a 1439 00:48:04,330 --> 00:48:05,010 certain amount of time. 1440 00:48:05,010 --> 00:48:06,490 And then if it hasn't gotten more data, it goes 1441 00:48:06,490 --> 00:48:07,070 ahead and sends it. 1442 00:48:07,070 --> 00:48:09,820 So we are basically stuck, waiting for this time out to 1443 00:48:09,820 --> 00:48:11,960 occur before the data actually got transferred. 1444 00:48:11,960 --> 00:48:13,640 And there's a very simple way to get rid of that. 1445 00:48:13,640 --> 00:48:14,990 You just tell it, hey, don't use Nagle's algorithm. 1446 00:48:14,990 --> 00:48:15,760 We don't care. 1447 00:48:15,760 --> 00:48:18,300 And that solves the entire problem. 1448 00:48:18,300 --> 00:48:21,110 So the point is once you get rid of that, the performance 1449 00:48:21,110 --> 00:48:21,860 is exactly the same. 1450 00:48:21,860 --> 00:48:23,220 It's a very simple configuration setting. 1451 00:48:23,220 --> 00:48:25,490 But that's again, a situation where somebody-- 1452 00:48:25,490 --> 00:48:27,720 and in fact, I remember asking why this 1453 00:48:27,720 --> 00:48:28,850 particular change was made. 1454 00:48:28,850 --> 00:48:30,190 And again, it was, well, we didn't think it was going to 1455 00:48:30,190 --> 00:48:31,470 be a big deal. 1456 00:48:31,470 --> 00:48:33,470 And that ends up being, again, a very bad way of doing 1457 00:48:33,470 --> 00:48:34,440 performance engineering. 1458 00:48:34,440 --> 00:48:37,720 So definitely, the lesson that I learned here is just like 1459 00:48:37,720 --> 00:48:39,300 the lesson with 32-bit, 64-bit. 1460 00:48:39,300 --> 00:48:41,550 You might think that a little change like this 1461 00:48:41,550 --> 00:48:42,280 doesn't mean anything. 1462 00:48:42,280 --> 00:48:42,920 But it really does. 1463 00:48:42,920 --> 00:48:44,650 And here, it's actually important to understand the 1464 00:48:44,650 --> 00:48:45,850 entire stack end to end. 1465 00:48:45,850 --> 00:48:46,960 Because otherwise, you're never going to be able to 1466 00:48:46,960 --> 00:48:48,790 figure out where the problem actually was occurring. 1467 00:48:51,560 --> 00:48:53,930 I already alluded to a previous situation where we 1468 00:48:53,930 --> 00:48:55,390 had a correctness problem, which masked as 1469 00:48:55,390 --> 00:48:56,150 a performance problem. 1470 00:48:56,150 --> 00:48:58,290 In the previous case, it was because I had a 1471 00:48:58,290 --> 00:48:59,270 limited thread pool. 1472 00:48:59,270 --> 00:49:01,150 And whenever I ran out of threads, I'd queue everything. 1473 00:49:01,150 --> 00:49:02,860 And at some point, I had a correctness problem where I 1474 00:49:02,860 --> 00:49:04,790 kept killing threads and never resurrecting them. 1475 00:49:04,790 --> 00:49:06,060 And so I ended up queuing everything. 1476 00:49:06,060 --> 00:49:07,370 And everything went to pot. 1477 00:49:07,370 --> 00:49:09,360 Now let me talk about a different situation. 1478 00:49:09,360 --> 00:49:11,830 Here I've got a poor little customer, in fact that 1479 00:49:11,830 --> 00:49:12,950 customer was me. 1480 00:49:12,950 --> 00:49:15,300 I was powering on a virtual machine. 
1481 00:49:15,300 --> 00:49:17,015 Remember it goes to the server, goes to host, powers 1482 00:49:17,015 --> 00:49:19,380 on the virtual machine, tells me it's done. 1483 00:49:19,380 --> 00:49:22,110 Every other time I would do this, it took five seconds, 1484 00:49:22,110 --> 00:49:23,970 which is pretty OK. 1485 00:49:23,970 --> 00:49:26,290 In comparison, every other time I did it, it would take 1486 00:49:26,290 --> 00:49:27,780 about five minutes. 1487 00:49:27,780 --> 00:49:29,120 That's a pretty huge expansion. 1488 00:49:29,120 --> 00:49:31,040 And if a customer called me and said this, I would 1489 00:49:31,040 --> 00:49:33,150 basically be in deep doggy do-do. 1490 00:49:33,150 --> 00:49:35,080 So it was important to try to figure out what was 1491 00:49:35,080 --> 00:49:36,020 actually going on here. 1492 00:49:36,020 --> 00:49:38,530 So why does powering on a VM have such a variable 1493 00:49:38,530 --> 00:49:39,950 performance? 1494 00:49:39,950 --> 00:49:42,030 Well, the key here, again, is to understand what's the end 1495 00:49:42,030 --> 00:49:43,340 to end path for things. 1496 00:49:43,340 --> 00:49:45,190 In this case, what I know is that when you power on a 1497 00:49:45,190 --> 00:49:47,400 virtual machine, one of the things you have to do is 1498 00:49:47,400 --> 00:49:49,930 allocate some space on disk to write swap 1499 00:49:49,930 --> 00:49:51,180 information and whatever. 1500 00:49:51,180 --> 00:49:53,410 So one of the first places I looked was I said, well, let 1501 00:49:53,410 --> 00:49:55,410 me understand how the disk is performing in each of these 1502 00:49:55,410 --> 00:49:57,220 cases to try to figure out what's going on. 1503 00:49:57,220 --> 00:49:59,450 It's a mixture of common sense and knowing what's going on, 1504 00:49:59,450 --> 00:50:01,550 and then knowing where to look. 1505 00:50:01,550 --> 00:50:04,190 So let me let you in on a little secret. 1506 00:50:04,190 --> 00:50:07,020 In the real world, if you're using some sort of disk, if 1507 00:50:07,020 --> 00:50:09,820 your disk latency is basically 10 milliseconds or less, 1508 00:50:09,820 --> 00:50:11,080 you're more or less golden. 1509 00:50:11,080 --> 00:50:12,410 That's not a problem. 1510 00:50:12,410 --> 00:50:15,300 If your latency is between 10 and 20 milliseconds, maybe 1511 00:50:15,300 --> 00:50:17,970 that's acceptable because you have a complicated topology. 1512 00:50:17,970 --> 00:50:19,910 You're some multinational corporation with sites 1513 00:50:19,910 --> 00:50:21,140 everywhere, whatever. 1514 00:50:21,140 --> 00:50:23,610 If your latency is getting to the point where they're 50 1515 00:50:23,610 --> 00:50:26,350 milliseconds, then you probably 1516 00:50:26,350 --> 00:50:27,180 have to start worrying. 1517 00:50:27,180 --> 00:50:28,940 That's actually quite a bit high. 1518 00:50:28,940 --> 00:50:31,870 If your latency is greater than, say, one second, that is 1519 00:50:31,870 --> 00:50:34,240 staggeringly bad. 1520 00:50:34,240 --> 00:50:38,290 In this chart, I show you the latency versus time. 1521 00:50:38,290 --> 00:50:40,890 And what this chart shows you is that every single time 1522 00:50:40,890 --> 00:50:43,900 step, it says in the last five minutes, what was the highest 1523 00:50:43,900 --> 00:50:46,920 disk latency that I saw. 1524 00:50:46,920 --> 00:50:49,160 And remember our rules of thumb. 1525 00:50:49,160 --> 00:50:50,440 10 milliseconds, good. 1526 00:50:50,440 --> 00:50:52,150 20 milliseconds, not so good. 
1527 00:50:52,150 --> 00:50:54,110 50 milliseconds, not so good. 1528 00:50:54,110 --> 00:50:57,530 This one is 1,100 milliseconds. 1529 00:50:57,530 --> 00:51:00,330 That's like you could walk to the disk, get the data, and 1530 00:51:00,330 --> 00:51:02,940 bring it back faster than what it's actually doing. 1531 00:51:02,940 --> 00:51:04,070 That's when you've got to call a priest. 1532 00:51:04,070 --> 00:51:04,240 OK. 1533 00:51:04,240 --> 00:51:05,010 That's horrible. 1534 00:51:05,010 --> 00:51:06,930 So this was the reason that five 1535 00:51:06,930 --> 00:51:08,470 seconds versus five minutes. 1536 00:51:08,470 --> 00:51:10,510 So now the question is why is this going on? 1537 00:51:10,510 --> 00:51:14,300 Why am I seeing such unbelievably bad disk latency? 1538 00:51:14,300 --> 00:51:16,390 And remember, the title of this case study was 1539 00:51:16,390 --> 00:51:20,080 "Correctness Impacts Performance." So 1540 00:51:20,080 --> 00:51:21,170 let's take a look. 1541 00:51:21,170 --> 00:51:23,930 What I do here is I take a look at all of the events that 1542 00:51:23,930 --> 00:51:26,710 are happening with respect to that disk. 1543 00:51:26,710 --> 00:51:27,540 You're not going to understand this. 1544 00:51:27,540 --> 00:51:28,890 It's some junk related to our product. 1545 00:51:28,890 --> 00:51:30,880 But I'm going to circle the relevant portion. 1546 00:51:30,880 --> 00:51:32,180 You'll notice a little line that says 1547 00:51:32,180 --> 00:51:34,360 lost access to volume. 1548 00:51:34,360 --> 00:51:35,170 Here's the deal. 1549 00:51:35,170 --> 00:51:37,480 I send a request to the disk. 1550 00:51:37,480 --> 00:51:40,190 It turns out the disk is connected over some channel to 1551 00:51:40,190 --> 00:51:42,530 some channel, a disk controller, which is talking 1552 00:51:42,530 --> 00:51:43,410 to some disk. 1553 00:51:43,410 --> 00:51:47,440 It turns out, I kept losing access to that disk. 1554 00:51:47,440 --> 00:51:49,600 Whenever I would lose access, it would keep retrying. 1555 00:51:49,600 --> 00:51:51,610 And that's why ultimately, yes, it would complete. 1556 00:51:51,610 --> 00:51:54,030 But it would take over a second per disk request. 1557 00:51:57,230 --> 00:51:59,560 In the situations where I was not seeing that loss of 1558 00:51:59,560 --> 00:52:01,720 connectivity, things were occurring at the order of 1559 00:52:01,720 --> 00:52:04,820 milliseconds and my power-on would take five seconds. 1560 00:52:04,820 --> 00:52:07,780 So the point here is that in this case, it was actually a 1561 00:52:07,780 --> 00:52:10,020 correctness problem with my controller. 1562 00:52:10,020 --> 00:52:11,970 And the controller would basically flake out and not 1563 00:52:11,970 --> 00:52:13,280 talk to the disk properly. 1564 00:52:13,280 --> 00:52:15,380 And this manifested itself in a performance problem. 1565 00:52:15,380 --> 00:52:18,390 Once I changed the disk controller, performance 1566 00:52:18,390 --> 00:52:19,490 problems went away. 1567 00:52:19,490 --> 00:52:22,250 So if you're doing this kind of debugging, sometimes you 1568 00:52:22,250 --> 00:52:23,760 have to eliminate-- 1569 00:52:23,760 --> 00:52:25,450 I guess it's a Sherlock Holmesism-- that you eliminate 1570 00:52:25,450 --> 00:52:28,060 the obvious and whatever's left must be it. 1571 00:52:28,060 --> 00:52:31,200 A good example of this is my ex-officemate, he used to ask 1572 00:52:31,200 --> 00:52:33,380 this question whenever he'd interview somebody for VMware. 
1573 00:52:33,380 --> 00:52:36,680 He'd say, OK, you have a networking problem. 1574 00:52:36,680 --> 00:52:38,090 Where is the first place you look? 1575 00:52:38,090 --> 00:52:40,400 And everybody that's coming in with a grad degree or 1576 00:52:40,400 --> 00:52:41,810 whatever, they're like, I would look at the stack. 1577 00:52:41,810 --> 00:52:43,640 I'd look at the CPU. 1578 00:52:43,640 --> 00:52:45,190 I'd look at cache counters, or whatever. 1579 00:52:45,190 --> 00:52:46,720 And my officemate would just look at them and say, why 1580 00:52:46,720 --> 00:52:49,450 don't you just check the cable and see if it's plugged in. 1581 00:52:49,450 --> 00:52:51,360 And surprisingly, it's so dumb. 1582 00:52:51,360 --> 00:52:53,970 But when you're talking to customers, they're like, I 1583 00:52:53,970 --> 00:52:54,310 swear I did that. 1584 00:52:54,310 --> 00:52:55,530 I swear I did that. 1585 00:52:55,530 --> 00:52:57,890 Their swearing doesn't necessarily mean anything, 1586 00:52:57,890 --> 00:52:59,770 unless it's directed at you. 1587 00:52:59,770 --> 00:53:01,220 So that's case study nine-- 1588 00:53:01,220 --> 00:53:03,750 eight. 1589 00:53:03,750 --> 00:53:05,620 The last two case studies are going to be specific to 1590 00:53:05,620 --> 00:53:06,790 virtualization. 1591 00:53:06,790 --> 00:53:08,930 And so what I'm going to do is talk a little bit about how 1592 00:53:08,930 --> 00:53:12,460 the CPU scheduler works in the VMware server class product. 1593 00:53:12,460 --> 00:53:14,830 And then I'll talk about the last two case studies. 1594 00:53:14,830 --> 00:53:17,890 So actually how many of you have used VMware before? 1595 00:53:17,890 --> 00:53:20,310 Just a show of hands. 1596 00:53:20,310 --> 00:53:21,230 Thankfully, Saman has used it. 1597 00:53:21,230 --> 00:53:21,810 That's good. 1598 00:53:21,810 --> 00:53:26,960 So basically, the idea is that you're multiplexing time on 1599 00:53:26,960 --> 00:53:29,250 the CPU between different virtual machines that are 1600 00:53:29,250 --> 00:53:31,310 running on the CPU. 1601 00:53:31,310 --> 00:53:33,280 So what I show here is I show ESX, which is our 1602 00:53:33,280 --> 00:53:34,700 server class product. 1603 00:53:34,700 --> 00:53:36,030 Basically, think of that as the hardware. 1604 00:53:36,030 --> 00:53:38,160 So you're running virtual machines on hardware. 1605 00:53:38,160 --> 00:53:40,860 That hardware has, in this particular case, 1606 00:53:40,860 --> 00:53:43,410 four physical CPUs. 1607 00:53:43,410 --> 00:53:44,070 No problem. 1608 00:53:44,070 --> 00:53:45,080 You've got four physical CPUs. 1609 00:53:45,080 --> 00:53:48,110 And you want to run virtual machines on this hardware. 1610 00:53:48,110 --> 00:53:50,430 So you've got one virtual machine. 1611 00:53:50,430 --> 00:53:51,560 Well, that's fine. 1612 00:53:51,560 --> 00:53:52,530 There's four CPUs. 1613 00:53:52,530 --> 00:53:53,650 One virtual machine. 1614 00:53:53,650 --> 00:53:54,810 No problem. 1615 00:53:54,810 --> 00:53:56,170 Plenty of CPU to go around. 1616 00:53:56,170 --> 00:53:57,170 Now you got another one. 1617 00:53:57,170 --> 00:53:58,100 Still no problem. 1618 00:53:58,100 --> 00:54:00,250 They're not chained to any given CPU. 1619 00:54:00,250 --> 00:54:01,790 And there's plenty of CPU to go around. 1620 00:54:01,790 --> 00:54:02,770 No problem. 1621 00:54:02,770 --> 00:54:04,030 Same with three and four. 1622 00:54:04,030 --> 00:54:06,080 You got all these four virtual machines.
1623 00:54:06,080 --> 00:54:07,080 It's not a one-to-one 1624 00:54:07,080 --> 00:54:08,500 correspondence in terms of CPU. 1625 00:54:08,500 --> 00:54:09,530 We don't work that way. 1626 00:54:09,530 --> 00:54:11,030 But there's plenty-- the point is, there's plenty 1627 00:54:11,030 --> 00:54:13,170 of CPU to go around. 1628 00:54:13,170 --> 00:54:18,060 Now, when VMs are basically available, and 1629 00:54:18,060 --> 00:54:21,210 there's hardware on which they can run, and they're using 1630 00:54:21,210 --> 00:54:23,240 the hardware, that's called the run state. 1631 00:54:23,240 --> 00:54:24,750 So running means I'm golden. 1632 00:54:24,750 --> 00:54:26,930 I'm using the hardware. 1633 00:54:26,930 --> 00:54:30,520 Now, let's add another VM. 1634 00:54:30,520 --> 00:54:33,800 I put it in red, because this poor VM, he wants to run. 1635 00:54:33,800 --> 00:54:36,860 But you'll see that every single CPU is actually taken 1636 00:54:36,860 --> 00:54:39,650 up by another virtual machine. 1637 00:54:39,650 --> 00:54:40,890 So even if this virtual machine 1638 00:54:40,890 --> 00:54:42,700 wants to run, it can't. 1639 00:54:42,700 --> 00:54:44,230 It has to wait. 1640 00:54:44,230 --> 00:54:48,810 The time that it spends ready to use the CPU, but unable to 1641 00:54:48,810 --> 00:54:52,220 use the CPU, because there's no CPU available, that's 1642 00:54:52,220 --> 00:54:53,870 called ready time. 1643 00:54:53,870 --> 00:54:55,950 In some sense, ready time plus run time 1644 00:54:55,950 --> 00:54:57,810 is kind of the demand. 1645 00:54:57,810 --> 00:54:59,160 But anyway, the point is it's sitting around 1646 00:54:59,160 --> 00:55:00,550 waiting to use the CPU. 1647 00:55:00,550 --> 00:55:02,290 But it can't, because other people are using it. 1648 00:55:02,290 --> 00:55:03,360 So it has to wait its turn. 1649 00:55:03,360 --> 00:55:04,910 It waits its turn. 1650 00:55:04,910 --> 00:55:07,950 And eventually, our ESX scheduler is smart enough to 1651 00:55:07,950 --> 00:55:10,050 realize that it has to fix the situation. 1652 00:55:10,050 --> 00:55:12,930 So it deschedules a VM-- in this case, VM 1-- and goes 1653 00:55:12,930 --> 00:55:15,120 ahead and schedules that other virtual machine that was 1654 00:55:15,120 --> 00:55:16,440 previously waiting. 1655 00:55:16,440 --> 00:55:20,400 Now predictably, the VM that's now ready to run but can't is 1656 00:55:20,400 --> 00:55:21,880 in the ready state. 1657 00:55:21,880 --> 00:55:24,530 And the VM that formerly was in the ready state is now 1658 00:55:24,530 --> 00:55:25,790 using the CPU. 1659 00:55:25,790 --> 00:55:27,050 So it's in the run state. 1660 00:55:27,050 --> 00:55:29,260 It's very happy. 1661 00:55:29,260 --> 00:55:31,140 The final thing that I should point out is that VMs don't 1662 00:55:31,140 --> 00:55:32,440 have to be in the ready state. 1663 00:55:32,440 --> 00:55:34,213 They don't have to be either using the CPU or 1664 00:55:34,213 --> 00:55:35,200 ready to use the CPU. 1665 00:55:35,200 --> 00:55:37,520 Some VMs are voluntarily descheduled, because they have 1666 00:55:37,520 --> 00:55:38,670 nothing to do. 1667 00:55:38,670 --> 00:55:41,260 Some VMs are descheduled, because they're waiting-- 1668 00:55:41,260 --> 00:55:43,300 they're basically waiting for an IO to complete. 1669 00:55:43,300 --> 00:55:44,730 And so they don't need to use the CPU. 1670 00:55:44,730 --> 00:55:46,470 And that's what I depict by VM6.
1671 00:55:46,470 --> 00:55:50,160 These are in what's called the wait or idle states. 1672 00:55:50,160 --> 00:55:53,430 So hopefully, you now have a fairly rudimentary 1673 00:55:53,430 --> 00:55:56,080 understanding of what our CPU scheduler is doing. 1674 00:55:56,080 --> 00:55:58,200 Let's talk a little bit about a very simple performance 1675 00:55:58,200 --> 00:56:00,310 problem related to this. 1676 00:56:00,310 --> 00:56:00,930 And I call this-- 1677 00:56:00,930 --> 00:56:02,910 But It's Only a small probe VM. 1678 00:56:02,910 --> 00:56:04,710 I'll explain what this means in a minute. 1679 00:56:04,710 --> 00:56:07,590 Essentially, you've got two of these hosts, two ESX hosts 1680 00:56:07,590 --> 00:56:09,840 that both want to run VMs. 1681 00:56:09,840 --> 00:56:13,250 And I've got a user, depicted by the user icon. 1682 00:56:13,250 --> 00:56:15,080 And he's talking to this one VM. 1683 00:56:15,080 --> 00:56:18,420 Let's call this VM the vSphere VM. 1684 00:56:18,420 --> 00:56:19,800 Please pardon the nomenclature. 1685 00:56:19,800 --> 00:56:20,520 It doesn't really matter. 1686 00:56:20,520 --> 00:56:22,380 But anyway, he's talking to this vSphere VM 1687 00:56:22,380 --> 00:56:23,490 on one of the hosts. 1688 00:56:23,490 --> 00:56:26,070 Now, it turns out that that VM that he's talking to, as I 1689 00:56:26,070 --> 00:56:28,880 depicted in the diagram, is talking to another virtual 1690 00:56:28,880 --> 00:56:33,420 machine, which I have labeled vSphere database. 1691 00:56:33,420 --> 00:56:35,290 So it turns out, this vSphere VM is talking to a database 1692 00:56:35,290 --> 00:56:37,410 over what's called an ODBC connection. 1693 00:56:37,410 --> 00:56:38,170 That doesn't matter. 1694 00:56:38,170 --> 00:56:39,350 Don't worry about that. 1695 00:56:39,350 --> 00:56:42,080 And it turns out that on that other host where the database 1696 00:56:42,080 --> 00:56:44,810 virtual machine is running, we have another virtual machine. 1697 00:56:44,810 --> 00:56:47,630 It's called the probe virtual machine. 1698 00:56:47,630 --> 00:56:50,680 The job of that probe virtual machine is to sniff any 1699 00:56:50,680 --> 00:56:52,020 traffic going into the database. 1700 00:56:55,090 --> 00:56:56,800 So as I've said, the vSphere VM 1701 00:56:56,800 --> 00:56:58,960 communicates with the database. 1702 00:56:58,960 --> 00:57:01,750 This probe VM is monitoring any traffic that's going to 1703 00:57:01,750 --> 00:57:04,160 that database virtual machine. 1704 00:57:04,160 --> 00:57:06,720 And therefore, as you can imagine, the more traffic 1705 00:57:06,720 --> 00:57:09,730 that's being sent between this vSphere VM and the database, 1706 00:57:09,730 --> 00:57:14,900 the more work that that probe VM has to do. 1707 00:57:14,900 --> 00:57:16,260 Here was the complaint. 1708 00:57:16,260 --> 00:57:19,840 My colleague called me up, and he said, hey, I'm having a 1709 00:57:19,840 --> 00:57:21,810 little bit of a problem with my vSphere VM. 1710 00:57:21,810 --> 00:57:25,390 All of a sudden, it's completely unresponsive. 1711 00:57:25,390 --> 00:57:28,060 So then I was like, OK, well, this is kind of silly. 1712 00:57:28,060 --> 00:57:29,770 Let's try to figure out what's going on. 1713 00:57:29,770 --> 00:57:31,580 So here's what I did. 1714 00:57:31,580 --> 00:57:35,720 I graphed various metrics, which I just spoke about, the 1715 00:57:35,720 --> 00:57:38,880 used time and the ready time, on this chart. 1716 00:57:38,880 --> 00:57:40,390 The time is on the x-axis. 
1717 00:57:40,390 --> 00:57:43,210 And the y-axis is in milliseconds. 1718 00:57:43,210 --> 00:57:46,980 In what I'm going to call the land before time, we have the ready 1719 00:57:46,980 --> 00:57:48,730 time of the database. 1720 00:57:48,730 --> 00:57:51,460 So look at the circle on the-- let me think, you're-- 1721 00:57:51,460 --> 00:57:52,970 on the left. 1722 00:57:52,970 --> 00:57:56,440 This is the ready time of that database virtual machine. 1723 00:57:56,440 --> 00:57:57,700 Remember what we said: when you're 1724 00:57:57,700 --> 00:57:58,850 accumulating ready time, 1725 00:57:58,850 --> 00:58:00,460 it means you really want to use the CPU. 1726 00:58:00,460 --> 00:58:03,300 But you can't because all the CPUs are already in use. 1727 00:58:03,300 --> 00:58:07,700 So it turns out that the first thing that happened when my 1728 00:58:07,700 --> 00:58:09,860 colleague called and said I'm having a problem, is I looked at 1729 00:58:09,860 --> 00:58:10,990 this chart, which is plotted versus time. 1730 00:58:10,990 --> 00:58:11,780 And I said, let me guess. 1731 00:58:11,780 --> 00:58:14,040 Are you having a problem at about such and such time, 1732 00:58:14,040 --> 00:58:15,870 which corresponds to the middle of this chart? 1733 00:58:15,870 --> 00:58:17,580 And he said, yeah, that's when I'm having a problem. 1734 00:58:17,580 --> 00:58:20,240 So you can tell that there are two very clearly demarcated 1735 00:58:20,240 --> 00:58:22,510 regions here. 1736 00:58:22,510 --> 00:58:24,760 On the left, you can see that this ready 1737 00:58:24,760 --> 00:58:26,120 time is around 12%. 1738 00:58:26,120 --> 00:58:29,610 Now if you're like me, a number that has no context 1739 00:58:29,610 --> 00:58:30,060 means nothing. 1740 00:58:30,060 --> 00:58:31,790 So 12% means nothing. 1741 00:58:31,790 --> 00:58:34,540 However, what I'd like to point out is that when things 1742 00:58:34,540 --> 00:58:39,130 got really bad, that number spiked up to 20%. 1743 00:58:39,130 --> 00:58:40,120 So here's the deal. 1744 00:58:40,120 --> 00:58:41,710 I'm a database virtual machine. 1745 00:58:41,710 --> 00:58:44,500 I want to use the physical hardware. 1746 00:58:44,500 --> 00:58:46,920 I can use it about 90% of the time, because my ready time's 1747 00:58:46,920 --> 00:58:48,110 about 12%, whatever. 1748 00:58:48,110 --> 00:58:50,960 But at some point, something happens where I now can only 1749 00:58:50,960 --> 00:58:52,600 use it 80% of the time. 1750 00:58:52,600 --> 00:58:55,580 And it turns out that that difference was enough to cause 1751 00:58:55,580 --> 00:58:59,760 this vSphere VM talking to this database to be perturbed. 1752 00:58:59,760 --> 00:59:01,850 And that ultimately was the customer's problem. 1753 00:59:01,850 --> 00:59:03,310 So the question is why. 1754 00:59:03,310 --> 00:59:05,200 And here is the reason. 1755 00:59:05,200 --> 00:59:06,360 Turns out, here's the deal. 1756 00:59:06,360 --> 00:59:08,450 Let's go back to the diagram so I can explain this. 1757 00:59:08,450 --> 00:59:11,810 Remember that the vSphere VM talks to the database VM. 1758 00:59:11,810 --> 00:59:14,760 Remember as well, that the more traffic there is, the 1759 00:59:14,760 --> 00:59:17,080 more work the probe VM has to do. 1760 00:59:17,080 --> 00:59:19,770 It turns out that what this user did at that critical 1761 00:59:19,770 --> 00:59:22,290 juncture was start a workload.
1762 00:59:22,290 --> 00:59:24,370 In starting that workload, he caused a lot of traffic 1763 00:59:24,370 --> 00:59:26,920 between the vSphere VM and the database, which therefore 1764 00:59:26,920 --> 00:59:30,460 caused a lot of work for this probe VM. 1765 00:59:30,460 --> 00:59:31,680 Well, what's happening? 1766 00:59:31,680 --> 00:59:34,110 The probe VM and the database VM are sharing the same 1767 00:59:34,110 --> 00:59:35,270 underlying CPUs. 1768 00:59:35,270 --> 00:59:37,470 So if one of them is running, the other one can't. 1769 00:59:37,470 --> 00:59:39,950 And that's why we saw this gentle increase in the amount 1770 00:59:39,950 --> 00:59:42,030 of ready time for the vSphere database VM. 1771 00:59:42,030 --> 00:59:44,330 Well, this has a cascading effect, because it can't get 1772 00:59:44,330 --> 00:59:46,380 as much time to use the CPU. 1773 00:59:46,380 --> 00:59:48,630 Any request that goes to it gets slowed down. 1774 00:59:48,630 --> 00:59:49,960 Where are the requests coming from? 1775 00:59:49,960 --> 00:59:51,720 They're coming from this vSphere VM. 1776 00:59:51,720 --> 00:59:52,950 Who's initiating those requests? 1777 00:59:52,950 --> 00:59:53,850 It's this poor admin. 1778 00:59:53,850 --> 00:59:55,650 The bottom line is that this admin is seeing poor 1779 00:59:55,650 --> 00:59:57,980 responsiveness because some doofus put his probe VM on 1780 00:59:57,980 --> 00:59:59,740 the same host as the database, because he wanted to see what was going 1781 00:59:59,740 --> 01:00:01,530 on in the database. 1782 01:00:01,530 --> 01:00:02,650 In fairness, I was that doofus. 1783 01:00:02,650 --> 01:00:03,360 But that's fine. 1784 01:00:03,360 --> 01:00:04,490 We needed that, because we had to debug 1785 01:00:04,490 --> 01:00:07,150 other performance problems. 1786 01:00:07,150 --> 01:00:09,970 So the point is that one of the things that's kind of 1787 01:00:09,970 --> 01:00:13,580 important in this particular case is that whenever you 1788 01:00:13,580 --> 01:00:15,340 introduce any sort of monitoring-- it's like a 1789 01:00:15,340 --> 01:00:17,820 Heisenberg thing-- it's going to have a performance impact. 1790 01:00:17,820 --> 01:00:19,010 And understanding that performance 1791 01:00:19,010 --> 01:00:20,180 impact is pretty key. 1792 01:00:20,180 --> 01:00:21,410 It might even shift where the 1793 01:00:21,410 --> 01:00:24,120 bottlenecks are in your system. 1794 01:00:24,120 --> 01:00:26,330 If your monitoring has a high cost in terms of CPU, it might 1795 01:00:26,330 --> 01:00:29,090 shift the bottleneck of whatever's running to the CPU, 1796 01:00:29,090 --> 01:00:30,230 where it wasn't before. 1797 01:00:30,230 --> 01:00:32,150 So it's kind of important to understand what you're 1798 01:00:32,150 --> 01:00:34,660 monitoring and how you're monitoring it. 1799 01:00:34,660 --> 01:00:34,950 OK. 1800 01:00:34,950 --> 01:00:36,230 We're almost done. 1801 01:00:36,230 --> 01:00:38,560 This was a really curious problem. 1802 01:00:38,560 --> 01:00:39,800 We have a customer. 1803 01:00:39,800 --> 01:00:43,860 And that customer was performing a load test on a server. 1804 01:00:43,860 --> 01:00:46,040 And in that load test, what they would do is they would 1805 01:00:46,040 --> 01:00:48,190 just keep attaching clients to that end server 1806 01:00:48,190 --> 01:00:50,540 and run some test.
1807 01:00:50,540 --> 01:00:53,370 At some point, as they attached more clients, their 1808 01:00:53,370 --> 01:00:56,070 workload performance suffered 1809 01:00:56,070 --> 01:00:59,700 even though there was no metric that was saturated. 1810 01:00:59,700 --> 01:01:00,910 And so the question was why. 1811 01:01:00,910 --> 01:01:02,890 Why, when you keep adding clients and nothing is saturated, does 1812 01:01:02,890 --> 01:01:05,140 performance suffer? 1813 01:01:05,140 --> 01:01:07,730 Well, this chart's kind of an eyesore. 1814 01:01:07,730 --> 01:01:09,770 But let me explain it. 1815 01:01:09,770 --> 01:01:11,360 The first thing we do is we've got some profiling 1816 01:01:11,360 --> 01:01:12,020 information. 1817 01:01:12,020 --> 01:01:13,910 On this chart, I show two things. 1818 01:01:13,910 --> 01:01:16,360 In purple, I show CPU usage. 1819 01:01:16,360 --> 01:01:19,720 And in white, I show disk latency. 1820 01:01:19,720 --> 01:01:21,500 As you can see from this chart-- the 1821 01:01:21,500 --> 01:01:22,450 y-axis just goes to 100%, 1822 01:01:22,450 --> 01:01:23,580 and they're both normalized. 1823 01:01:23,580 --> 01:01:26,440 As you can see, at some point, the CPU usage is steadily, 1824 01:01:26,440 --> 01:01:28,190 steadily increasing. 1825 01:01:28,190 --> 01:01:30,990 And at some point, the disk latency takes over and gets to 1826 01:01:30,990 --> 01:01:32,100 be really large. 1827 01:01:32,100 --> 01:01:35,600 You can tell that, because up until about 3/4 of the way in, 1828 01:01:35,600 --> 01:01:37,000 the disk latency hovers around zero. 1829 01:01:37,000 --> 01:01:39,030 And all of a sudden, it jumps much higher. 1830 01:01:39,030 --> 01:01:41,750 So my first instinct was to tell the customer you've got a 1831 01:01:41,750 --> 01:01:43,250 disk problem. 1832 01:01:43,250 --> 01:01:45,150 Your disk latencies are going over a cliff. 1833 01:01:45,150 --> 01:01:47,040 And that's why you're having a performance problem. 1834 01:01:47,040 --> 01:01:50,640 Unfortunately, that's not correct. 1835 01:01:50,640 --> 01:01:52,900 Why is that not correct? 1836 01:01:52,900 --> 01:01:56,140 It's not correct, because, yeah, disk latency gets worse 1837 01:01:56,140 --> 01:01:57,440 at 4:00 PM. 1838 01:01:57,440 --> 01:01:59,120 It's because of swapping and some other nonsense. 1839 01:01:59,120 --> 01:02:00,220 We know that. 1840 01:02:00,220 --> 01:02:02,740 However, the application latency actually gets worse at 1841 01:02:02,740 --> 01:02:04,310 3:30, not at 4:00. 1842 01:02:04,310 --> 01:02:05,810 It's half an hour earlier. 1843 01:02:05,810 --> 01:02:07,790 So now the question is what's going on that causes the 1844 01:02:07,790 --> 01:02:10,140 problem, even before disk kicks in and gets to be 1845 01:02:10,140 --> 01:02:12,820 another problem. 1846 01:02:12,820 --> 01:02:15,930 This is where understanding the difference in metrics 1847 01:02:15,930 --> 01:02:16,610 becomes pretty critical. 1848 01:02:16,610 --> 01:02:18,620 Now, I'm not expecting very many people to 1849 01:02:18,620 --> 01:02:19,800 understand this chart. 1850 01:02:19,800 --> 01:02:21,860 But I will try to explain it. 1851 01:02:21,860 --> 01:02:23,740 What I do here is I show a different chart, which 1852 01:02:23,740 --> 01:02:26,470 essentially shows CPU usage of all the different virtual 1853 01:02:26,470 --> 01:02:29,310 machines and processes on the underlying host. 1854 01:02:29,310 --> 01:02:30,570 Forget about what's in most of this chart.
1855 01:02:30,570 --> 01:02:32,360 And just focus on what I've circled. 1856 01:02:32,360 --> 01:02:34,120 I have two columns shown there. 1857 01:02:34,120 --> 01:02:37,610 One column is called percent used. 1858 01:02:37,610 --> 01:02:41,350 And one column is called percent run. 1859 01:02:41,350 --> 01:02:43,970 You can think about it and think, OK, well, I can imagine 1860 01:02:43,970 --> 01:02:47,480 used time means probably how much time I'm using the CPU. 1861 01:02:47,480 --> 01:02:48,740 But then what is the runtime? 1862 01:02:48,740 --> 01:02:51,080 Isn't runtime the same thing, because I'm using the CPU? 1863 01:02:51,080 --> 01:02:52,990 What's the difference? 1864 01:02:52,990 --> 01:02:54,680 This is actually the critical problem. 1865 01:02:54,680 --> 01:02:58,050 It turns out that when you talk about percent used time, 1866 01:02:58,050 --> 01:03:01,110 you're normalizing that to the base clock frequency. 1867 01:03:01,110 --> 01:03:03,920 If I've got a three gigahertz machine, I've normalized it to 1868 01:03:03,920 --> 01:03:04,650 three gigahertz. 1869 01:03:04,650 --> 01:03:07,490 However, we all live in the era of power management. 1870 01:03:07,490 --> 01:03:10,700 It turns out this percent run, which nobody understands, is 1871 01:03:10,700 --> 01:03:12,880 normalized to the clock frequency at the 1872 01:03:12,880 --> 01:03:14,100 time you were running. 1873 01:03:14,100 --> 01:03:16,840 So if I did power management, and I was only running at two 1874 01:03:16,840 --> 01:03:18,880 gigahertz for a certain amount of time, percent 1875 01:03:18,880 --> 01:03:20,840 run captures that. 1876 01:03:20,840 --> 01:03:23,075 And so in this particular chart what I'm showing is that 1877 01:03:23,075 --> 01:03:26,300 a bunch of virtual machines are seeing used time different 1878 01:03:26,300 --> 01:03:27,480 from runtime. 1879 01:03:27,480 --> 01:03:29,590 And the runtime, which is normalized to the clock 1880 01:03:29,590 --> 01:03:32,130 frequency of the time you're using it, is actually greater 1881 01:03:32,130 --> 01:03:35,030 than used time, which suggests that the user's actually using 1882 01:03:35,030 --> 01:03:36,310 power management. 1883 01:03:36,310 --> 01:03:40,320 And it turns out that power management was kicking in at 1884 01:03:40,320 --> 01:03:42,980 some point and causing their application performance to go down 1885 01:03:42,980 --> 01:03:46,240 even though the CPU was never saturated. 1886 01:03:46,240 --> 01:03:48,250 And in this case, when we went back to the customer-- you 1887 01:03:48,250 --> 01:03:50,310 look at this chart, and you instantly see, OK, this one is 1888 01:03:50,310 --> 01:03:52,820 normalized to the base frequency and that one to the actual frequency. 1889 01:03:52,820 --> 01:03:54,705 This is what's going on; you have to first of all change 1890 01:03:54,705 --> 01:03:55,660 this parameter. 1891 01:03:55,660 --> 01:03:56,460 They changed it. 1892 01:03:56,460 --> 01:03:57,670 And they were unbelievably happy. 1893 01:03:57,670 --> 01:03:58,630 It was pretty cool. 1894 01:03:58,630 --> 01:04:00,287 Now, I have to admit that there's much more to this 1895 01:04:00,287 --> 01:04:01,890 story that I'm not showing you. 1896 01:04:01,890 --> 01:04:03,260 Which I'm not allowed to show you. 1897 01:04:03,260 --> 01:04:04,540 But it becomes a very interesting thing. 1898 01:04:04,540 --> 01:04:09,720 But the point is it's very, very important to understand 1899 01:04:09,720 --> 01:04:10,370 your metrics.
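To make that used-versus-run distinction concrete, here is a minimal sketch of the normalization just described. It assumes a hypothetical 20-second sampling window and made-up clock frequencies; the function names and numbers are purely illustrative and are not taken from any real tool.

    # Minimal sketch of the percent-used versus percent-run distinction.
    # Assumption (not a real counter): run_seconds is how long the VM was
    # actually scheduled on a physical CPU during the sampling window.

    def percent_run(run_seconds, window_seconds):
        # Fraction of the window the VM was scheduled on a physical CPU.
        return 100.0 * run_seconds / window_seconds

    def percent_used(run_seconds, window_seconds, actual_ghz, base_ghz):
        # Same scheduled time, but the work is normalized to the base clock
        # frequency, so running at a reduced frequency shows up as less
        # "used" time.
        return percent_run(run_seconds, window_seconds) * (actual_ghz / base_ghz)

    # VM scheduled for 10 s out of a 20 s window, but power management
    # clocked the core down to 2 GHz on a 3 GHz part.
    print(percent_run(10.0, 20.0))             # 50.0
    print(percent_used(10.0, 20.0, 2.0, 3.0))  # about 33.3

A run percentage that is noticeably higher than the used percentage is exactly the signature described above: frequency scaling, not CPU saturation, is stretching out the work.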
1900 01:04:10,370 --> 01:04:12,760 Here just understanding the difference between runtime and 1901 01:04:12,760 --> 01:04:15,210 used time made all of the difference in diagnosing this 1902 01:04:15,210 --> 01:04:18,080 customer's performance-related issue. 1903 01:04:18,080 --> 01:04:20,880 So I just picked a smattering of ten random bugs. 1904 01:04:20,880 --> 01:04:25,020 And in each of them, I tried to distill one kind of 1905 01:04:25,020 --> 01:04:27,170 conclusion you can get from that particular thing. 1906 01:04:27,170 --> 01:04:29,340 Because all of you can read, I'm not going to bother 1907 01:04:29,340 --> 01:04:31,370 repeating all of these things. 1908 01:04:31,370 --> 01:04:33,140 But anyway, here's the first five. 1909 01:04:33,140 --> 01:04:35,500 And you all get the slides. 1910 01:04:35,500 --> 01:04:37,760 And here's the second five. 1911 01:04:37,760 --> 01:04:37,900 Oops. 1912 01:04:37,900 --> 01:04:40,070 I'll just keep that up for a second. 1913 01:04:40,070 --> 01:04:46,970 And in order to conclude this talk, let me just say that 1914 01:04:46,970 --> 01:04:48,990 it's very important to avoid assumptions. 1915 01:04:48,990 --> 01:04:51,370 It's very tempting to say, well, that's-- 1916 01:04:51,370 --> 01:04:52,520 I'll give you a great example. 1917 01:04:52,520 --> 01:04:53,440 I used to interview people. 1918 01:04:53,440 --> 01:04:54,570 And I'd say, OK, you've got a UI. 1919 01:04:54,570 --> 01:04:55,420 You've got a problem. 1920 01:04:55,420 --> 01:04:56,230 And in that UI, 1921 01:04:56,230 --> 01:04:56,840 you click on something. 1922 01:04:56,840 --> 01:04:57,790 And it takes a long time. 1923 01:04:57,790 --> 01:04:59,060 Why is it taking a long time? 1924 01:04:59,060 --> 01:05:00,590 And the first thing they'd say, well, I'd go and look 1925 01:05:00,590 --> 01:05:02,030 at-- this is literally, I'm not joking. 1926 01:05:02,030 --> 01:05:03,540 They'd say, well, I'd go look at the cache messages on 1927 01:05:03,540 --> 01:05:04,660 the server. 1928 01:05:04,660 --> 01:05:06,230 And I just looked at them and I said, how do you know the 1929 01:05:06,230 --> 01:05:07,100 client's not the problem? 1930 01:05:07,100 --> 01:05:08,590 Oh, the client's not a problem. 1931 01:05:08,590 --> 01:05:10,510 And you base this on what knowledge? 1932 01:05:10,510 --> 01:05:11,560 Don't use assumptions. 1933 01:05:11,560 --> 01:05:13,730 It's a very bad idea. 1934 01:05:13,730 --> 01:05:16,180 It's really important to understand the entire system 1935 01:05:16,180 --> 01:05:16,790 that you're dealing with. 1936 01:05:16,790 --> 01:05:18,590 From the client to the server to the hardware to the network 1937 01:05:18,590 --> 01:05:19,270 to the disk. 1938 01:05:19,270 --> 01:05:21,050 You have to understand all of those different pieces, 1939 01:05:21,050 --> 01:05:24,160 because you have to understand where to look. 1940 01:05:24,160 --> 01:05:27,210 And what's really onerous is that-- it is not so much when 1941 01:05:27,210 --> 01:05:28,970 it's a problem in your code, because you can be blamed for 1942 01:05:28,970 --> 01:05:30,000 making stupid mistakes-- 1943 01:05:30,000 --> 01:05:31,670 it's when there are problems in other people's code and you 1944 01:05:31,670 --> 01:05:33,530 spend a month looking at other people's code, figuring out 1945 01:05:33,530 --> 01:05:35,560 what the problem is. That's when things are really kind of 1946 01:05:35,560 --> 01:05:36,992 irritating. 1947 01:05:36,992 --> 01:05:39,240 And especially when it's a problem with hardware.
1948 01:05:39,240 --> 01:05:42,630 And I think the final bullet I'll say is be persistent and 1949 01:05:42,630 --> 01:05:44,680 be thorough. 1950 01:05:44,680 --> 01:05:46,010 I don't have to tell you how to be good workers. 1951 01:05:46,010 --> 01:05:47,190 So let's just blah, blah, blah. 1952 01:05:47,190 --> 01:05:48,440 We're done.
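As a closing illustration of the disk-latency rule of thumb from the power-on case study, here is a minimal sketch that buckets observed latencies using the thresholds quoted earlier in the talk; the function name and sample values are made up for illustration.

    # Bucket worst-case disk latencies using the rough rule of thumb
    # from the talk; the thresholds are the quoted ones, and everything
    # else here is illustrative.

    def classify_disk_latency(latency_ms):
        if latency_ms <= 10:
            return "golden (10 ms or less)"
        if latency_ms <= 20:
            return "maybe acceptable (10-20 ms, e.g. complicated topology)"
        if latency_ms <= 50:
            return "start worrying (approaching 50 ms)"
        return "staggeringly bad (well beyond 50 ms)"

    # Highest latency seen in each five-minute window, as in the chart.
    for ms in [4, 9, 18, 35, 1100]:
        print(ms, "ms ->", classify_disk_latency(ms))

Run against the 1,100 millisecond sample from the case study, the check lands squarely in the last bucket, which is exactly why the power-on times swung from five seconds to five minutes.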