The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

GUEST SPEAKER 1: All right. So we're going to talk about performance engineering, and pretty much how to do that with profiling tools. First, we're going to talk about what profiling tools are and how you use them in general, what the theory of that is. And then, more importantly, we're going to do an interactive walk-through of two examples that we picked out. The first one is the matrix multiply that you guys saw in Project 0: the cache ratio measurements, how to do those measurements, how to do the cycles-per-instruction measurements, these kinds of things. And then we're going to look at doing some of the bit hacks that Charles talked about in the second lecture. We're going to work on a branchless sorting algorithm, which is pretty interesting.

So with the code we gave you in Project 1, it was very obvious where the hot spot was. But real life doesn't really work that way, and code gets really, really big. It's not always obvious where the part is that you want to optimize. And you really don't want to waste time taking what is otherwise understandable, good code and beating it over the head trying to make it go faster. That's why Donald Knuth said, "premature optimization is the root of all evil." You really want to focus on the hot spots. So that's what profiling's for. It basically gives you visibility into the execution of your code, because it would be too complicated to just sit down and look at it and try to understand where it's spending its time.

And there are many possibilities for where it could be spending its time. In 6.172 we're mostly going to be focusing on things that are CPU-bound and things that are using the memory hierarchy ineffectively.
But in the real world, a couple of other common things are network requests. If your web application is making a whole bunch of AJAX requests, it's going to be wasting a lot of your users' time. If your desktop application is hitting the disk all the time, that's also going to slow you down. Or a SQL database: if anybody's ever done web development, they know that can be a huge bottleneck.

So basically, the tool that you're going to use is going to depend a lot on what the problem you're solving is. But the basic ideas are all the same. There are all kinds of profiling tools for measuring these things and figuring out where it's coming from and where the time's going.

For trying to figure out CPU and memory, there's a couple things you can do. The first and most basic, and sort of most obvious to you guys, is manual instrumentation. That's basically where you go in and you put in prints and try to figure out where it's spending its time, which code got executed more. Or you can be a little bit more clever: you can insert logging calls that fill up a buffer, checking the time and doing that yourself. One of the big advantages of that is that, when you actually go into the code and instrument it yourself, you can cut down the information that's coming back out at you. If you have an intuition for how the program works, sprinkling little calls to timing functions around your code in the places that you think matter is going to give you maybe a better picture. But it requires more work on your part. So that's the drawback.
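To make the manual approach concrete, here is a minimal sketch (not code from the course) of what those timing calls might look like, assuming a POSIX system with clock_gettime:

    #include <stdio.h>
    #include <time.h>

    // Read a monotonic clock and return seconds as a double.
    static double now(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    // Hypothetical stand-in for the region you suspect is hot.
    static void compute(void) {
        volatile double x = 0;  // volatile so the loop isn't optimized away
        for (long i = 0; i < 100000000; i++) x += i;
    }

    int main(void) {
        double start = now();
        compute();
        fprintf(stderr, "compute: %.6f s\n", now() - start);
        return 0;
    }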
The other kind of instrumentation is static instrumentation. And I should say that instrumentation is basically a little snippet of code that is gathering information, gathering data about the code execution. So static instrumentation is code that's inserted by the compiler when you build your program, and gprof is an example of this. Basically, what it does is, at the prologue and the epilogue of every function, it inserts a little piece of code that increments a counter every time the function gets executed. I think it also figures out who called it, that kind of thing. One of the drawbacks to static instrumentation is that you have to have the source, and you have to be able to rebuild your project. If you've ever had a DLL, a binary blob of a library, handed to you, and you want to know where the time's going in that library, you're not going to be able to instrument it. So you're not going to get any data.

So that brings us to the next thing, which is dynamic instrumentation, which solves that problem. Basically, what happens there is you take the binary as it is, just before you're about to run it. And before you start executing, you take a look at the code and translate it: you go in, you insert the instrumentation, and then you have this new snippet of code that you run instead. So that works for binary blobs, and it doesn't require a recompile, which is great. But it introduces a lot of overhead, because you're not actually running the original code, right? You're running this translated thing, which might be much more inefficient. And Valgrind has a couple of tools that are sort of classic examples of that, Callgrind and Cachegrind. I think Callgrind gives you a call graph. And Cachegrind tries to measure the cache effects; it tries to predict, when you do memory accesses, whether these things are in cache or not.
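For reference, here is roughly what the two styles look like on the command line (illustrative invocations, assuming a program built from mergesort.c; these aren't commands from the lecture):

    # Static instrumentation with gprof: rebuild with -pg,
    # run (which writes gmon.out), then read the profile.
    gcc -pg -O2 mergesort.c -o mergesort
    ./mergesort
    gprof ./mergesort gmon.out

    # Dynamic instrumentation with Valgrind: no rebuild needed.
    valgrind --tool=callgrind ./mergesort     # call-graph profile
    valgrind --tool=cachegrind ./mergesort    # simulated cache behavior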
Another thing you might want to look at, and the thing that we're going to focus on in this course, is performance counters. Two tools to get these are OProfile and perf. Basically, the CPU has a bunch of counters, which we've talked about, for branch misses, cache misses, these kinds of things. And these tools give you access to those.

The last tool that I'm going to talk about for profiling is heap profiling. A lot of times, if memory's your problem, you really want to figure out who's doing all the allocation. And so these tools will give you a nice breakdown of who's allocating what, who's calling them, and who I can stop calling so that I'm allocating less memory. And like I said, you can find more tools for profiling different kinds of problems and resources. But today we're mostly going to focus on perf, which is a new tool for getting access to performance counters on Linux.

I should also mention that, as someone talked about in the last lecture, the way that you want to gather data from performance counters is event sampling. There's too much data otherwise. As you execute along, you don't want to record, OK, I executed this instruction, then I executed that instruction, and count all the instruction executions and store that data somewhere. That would be way too much data. So what you want to do is sample every so often: every 1,000th event, or some other frequency, for whatever your interesting event is. And basically, what happens is that when that counter hits the threshold, an interrupt fires, and the kernel catches it. And then it records the data that you want, the context: where in the execution it was, what the call stack looked like, these kinds of things.

I think we've covered most of the performance counters that we mostly care about. But it's worth mentioning that there are a lot of performance counters, and you can go look them up in this part of the Intel manual. They cover a lot of interesting things, not just the ones we've talked about, including sleep states and power consumption. And if you want to look at contention: say you have a parallel program, and you have two cores accessing the same data. If they're writing back to it, they're going to be fighting over the same cache lines. And so there are counters for things like, how many times did this core have to go off-core to get a cache line, and all kinds of interesting things like that. But mostly, we're going to focus on the cache misses and branch misses.
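As a concrete, hypothetical example of that sampling setup, perf lets you set the period directly:

    # Take one sample every 10,000 cache-miss events,
    # recording where the code was each time the counter trips.
    perf record -e cache-misses -c 10000 ./mergesort
    perf report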
I also wanted to mention that the tool we're using, perf, kind of replaces OProfile and PerfMon, which are tools that do similar things. It's different in that it's a unified interface to monitoring two kinds of events. The first are software kernel events, things like context switches. How many times did your application make a system call and then get context-switched out? How many page faults did your application generate? These are the kinds of things that the kernel knows about, but aren't necessarily hardware events. And the other kind is hardware events, and those are the ones we're going to focus on; we've already talked about those. If you want to get this tool and try it out yourself, on any recent Linux distribution, anything with a kernel that's less than a year old, you can install the Linux tools package, and the command is perf.

So now we're going to do a demo of the matrix multiply from Project 0.

GUEST SPEAKER 2: All right. So for the demos, I guess I'll narrate what the code's doing, so Reid can focus on typing. This code should look familiar. All right. Can everybody hear me? All right, cool. So this code should look very familiar: it's Project 0's matrix multiply code, the original version. It's a triply nested for loop for multiplying matrices. So what we're going to do is I'm going to switch to a terminal that Reid is typing on, and we're going to run matrix multiply. It takes about eight seconds to run. And now we're going to run it again, but have perf output some performance statistics for it, basically CPU counters. You use -e to request specific counters; you can use perf list to take a look at all the counters that perf supports. In this case, we're asking for the number of CPU cycles, the number of instructions, the number of L1 cache loads, and the number of L1 cache misses.
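The invocation is something like the following (the binary name here is an assumption; the event list is the one just described):

    perf stat -e cycles,instructions,L1-dcache-loads,L1-dcache-load-misses ./matrix_multiply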
Perf stat. That's better.

So it takes a little bit longer to execute, I think, with perf. But it's not supposed to be a big difference. You can see, 8.31 versus 8.34 seconds. So it's relatively low-overhead sampling, which is good, because you don't want to change the behavior of your program when you sample it.

Important things to see: we're looking at this cache miss rate, which is 18%, almost 19%. That's the L1 cache miss rate, which is pretty bad. And as a result of that, if you look at the number of instructions and the number of cycles, perf calculates the IPC for you. We previously talked about CPI, but perf uses IPC, which is instructions per cycle. So basically, every cycle, you only executed half an instruction on average, because you're waiting for the cache. So that's no good.

So can anybody remember from Project 0 what you did to work around that? Yes.

AUDIENCE: You reordered the loops.

GUEST SPEAKER 2: Yes. So, the before and after: you just swap the inner two loops. And we will run it again. OK, so remember, the previous runtime was 8 seconds. Now it's down to 2.5 seconds, which is a lot better. Let's run the perf stat command again and look at the performance counters. As you can see, the IPC jumped from 0.49 to 1.4, so it nearly tripled. And if you look at the new L1 cache miss rate, it's 1% rather than 20%, which is a lot better. Yes?

AUDIENCE: Can you point to where the cache miss rate percent is? Because I just see zero per second.

GUEST SPEAKER 2: Oh, sorry about that. So L1-dcache-loads is all of the times that the CPU asked the L1 cache for something. And then L1-dcache-load-misses is the number of cache misses. So you divide this over this to get the ratio, which is basically what Reid did over here. Does that make sense?

AUDIENCE: Yes.

GUEST SPEAKER 2: OK.
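For reference, the interchange being described looks like this (a sketch with n-by-n matrices of doubles; the project's actual types and code may differ):

    // Before: the inner loop walks down a column of B, so successive
    // accesses are a whole row apart in memory and miss in L1 constantly.
    void matmul_ijk(int n, double A[n][n], double B[n][n], double C[n][n]) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    C[i][j] += A[i][k] * B[k][j];
    }

    // After: swapping the inner two loops makes the inner loop scan
    // B and C along rows, which is contiguous and cache-friendly.
    void matmul_ikj(int n, double A[n][n], double B[n][n], double C[n][n]) {
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                for (int j = 0; j < n; j++)
                    C[i][j] += A[i][k] * B[k][j];
    }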
So this is a ridiculously simple example of how to use perf, just so that you can see the command set. But that's not really a very interesting example. So the next one that we'll look at is actually something that Charles, Reid, and I did yesterday afternoon for fun. It started when we were testing your Project 2.1, which is-- I don't know how many people have looked at the handout yet, but it deals with a bunch of mystery sorting algorithms that you have to figure out what they're doing. So we took one of them, and we were running sorting. And Charles asked, what happens if you have a branchless sorting algorithm? And we said, we don't know. So we played around with it. And in the process, we optimized our branchless sorting algorithm by using perf to generate insight into what the CPU is doing. And as you'll see, we got quite a good speed-up.

So the first thing we'll do is start with the baseline: quicksort, everybody's favorite sorting algorithm. I'll switch over to the terminal. And Reid is going to run quicksort on 30 million integers one time, which it says takes 4.14 seconds. Now, let's use perf stat to see what it's doing. This time, we'll ask for slightly different events. We'll still ask for the cycles and instructions to get our IPC, but we'll also ask for branches and branch misses, because we kind of intuitively know that quicksort is doing branching.

So, OK. It's doing 0.8 IPC, so 0.8 instructions per cycle. And the branch miss rate, branch misses over the total number of branches, is 11.5%. That's pretty bad in terms of branch misses. So let's take a look at the code and see why that's the case. Here's the part of the code in quicksort where you select a pivot recursively, and then you loop, and then you do your swapping. So these branches are more or less unpredictable.
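To see why, here's a sketch of the kind of partition loop being described (illustrative, not the course's exact code):

    // Hoare-style partition pass. Whether each element falls below or
    // above the pivot is essentially random on random input, so these
    // comparisons are branches the predictor gets wrong about half the time.
    static void partition(long *a, long left, long right, long pivot) {
        while (left <= right) {
            while (a[left] < pivot) left++;
            while (a[right] > pivot) right--;
            if (left <= right) {
                long tmp = a[left];
                a[left] = a[right];
                a[right] = tmp;
                left++;
                right--;
            }
        }
    }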
And looking at them, there's no way that we could think of to get rid of the unpredictable branches. They're unpredictable by nature. So this would not be an interesting example on which to do branchless sorting. So let's switch to a different algorithm, mergesort.

Here's the relevant code in mergesort. It takes a list, C, and two input lists, A and B, and you merge the contents of A and B into C by taking the smallest element from either A or B, putting it in C, and then repeating that process until the lists are empty. So the "if else" branches, as the comment says, place the min of the value at either A or B into C, and then properly increment the pointer and do the housekeeping tasks so you don't crash.

So we're going to try running this and profiling it. Reid is running mergesort. Took 5.04 seconds. Well, the last one took 4.8 seconds. So already, it doesn't look too good. But let's try to see what happened. We'll issue the same perf stat command, asking for branches, branch misses, and cycles. Reid is being lazy. OK, so the IPC is 1.1, which is a little bit better. But if you look at the branch misses over branches, it's roughly the same ratio as before, about 10%. So, OK. Mergesort is currently slower than quicksort. And the reason is still, roughly, that you have all of these branches that you're not predicting correctly.

OK, so why branching, and why not caching? Well, we did some quick back-of-the-envelope calculations, or research, rather. And it turns out that in the Nehalem, the processor that you're using on the cloud machines, one of the design changes that Intel made is that they made the L1 cache, and actually also the L2 cache, faster, so fewer clock cycles to go out to the cache. But at the same time, they also made the pipeline deeper. So an L1 cache miss is maybe three or four cycles. An L2 cache miss is 15 cycles. By the time you miss the L2 cache, you're going out into the L3 cache, which is 12 megabytes.
And if you're sorting integers, most of them are probably going to be in there anyway.

AUDIENCE: John, those are actually hits [INAUDIBLE].

GUEST SPEAKER 2: Oh, they are. OK, so misses might be a little bit bigger than that. But OK.

AUDIENCE: An L1 miss is an L2--

GUEST SPEAKER 2: Right, is an L2.

AUDIENCE: L1 misses.

AUDIENCE: It was just a typo.

GUEST SPEAKER 2: OK, so the next number is probably, like, 30 to 50, somewhere on that order. 50-something. OK. But a branch mispredict will also cost you somewhere between 16 and 24 cycles, depending on how many pipeline stages are actually in the Nehalem, which we don't know. So bad branch predictions might be just as costly as bad memory access patterns. Plus, Charles is going to go into detail about memory access patterns and how to optimize sorting for good memory access, so it wouldn't be interesting if I talked about that right now. So let's optimize branching.

To do that, we'll use a bit hack that Charles introduced in an earlier lecture. Remember, all we want to do is put the minimum of either A or B into C. Right now, we're using branches to do it, but there's no reason why we have to. So the trick that we'll apply is the XOR trick for calculating min without a branch. Does everybody kind of follow what that code is doing? You can refer to the bit hacks lecture later to check. But trust us that it's correct.

GUEST SPEAKER 1: [INAUDIBLE]

GUEST SPEAKER 2: All right. I guess we can go over it. Do I have a mouse pointer? Cool.
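Here is a sketch of the merge step with that trick applied (assuming long elements and the A++/B++ bookkeeping mentioned below; not necessarily the project's exact code):

    // Branchless merge of two sorted long arrays (assumed signature).
    static void merge(long *C, long *A, long *A_end, long *B, long *B_end) {
        while (A < A_end && B < B_end) {
            long a = *A, b = *B;
            long cmp = (a < b);           // 1 if *A is the min, 0 otherwise
            *C++ = b ^ ((b ^ a) & -cmp);  // -cmp: all 1's picks a, all 0's picks b
            A += cmp;                     // advance the list the min came from
            B += !cmp;
        }
        while (A < A_end) *C++ = *A++;    // drain whichever list remains
        while (B < B_end) *C++ = *B++;
    }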
OK, so inside, you're doing a compare, A less than B: the value at A is less than the value at B. And you store either a 0 or a 1 into a flag called CMP. And then you take CMP, and you negate it. So then you end up with a bit mask of either all 0's or all 1's, depending on which one was smaller. And then you mask A XOR B with that. So from this inner expression, you either get A XOR B out of it, or you get all 0's. And then you XOR B into it. So in one case, B XOR this expression cancels out the B, and you'll just get A out. And the other way, B XOR 0 is going to give you B. So that should be convincing motivation that this is actually a min function. And then, once you have that, you can do cute tricks to recycle the flag, to do the A++ and B++ and so on, using CMP.

So next, let's run this code and see what happens. OK, so we went from 5.04 seconds to 4.59 seconds. And we'll do a profiling run just to get some stats on it. OK, so now look at the IPC: we went from 1.1 to 1.7. And branch misses went down by a factor of 10, which is really awesome. But overall, the performance didn't seem to improve by much. Well, if you want to look at why, look at the total number of instructions executed: that number versus that number. We seem to be executing considerably more instructions.

So to see why that's the case, we decided to use another mode of perf called report, or rather record and then report. What this does is log where all of those interrupts actually happen. So it logs which point in your code the CPU was executing when these performance counters tripped. And then it generates a nice report that we can call up using perf annotate. The top section gives you a sorted summary, where it shows you the percent of times that perf caught your program executing that line of code. So 11% of the times that perf triggered its interrupt, your code was executing mergesort.c line 32. So let's go to mergesort.c line 32 and see what it's doing. This annotated view shows you the lines of C code, which are in black, and the line numbers are annotated here. And then it actually goes ahead and shows you the assembly instructions that it's executing. And it puts the timing on the assembly instructions corresponding to every C expression.
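The session looks roughly like this (hypothetical command lines matching the steps just described; the function name merge is an assumption):

    perf record -e cycles ./mergesort    # log samples into perf.data
    perf report                          # sorted summary of where the samples landed
    perf annotate merge                  # interleaved source and assembly for one function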
Yes?

AUDIENCE: Why is this [INAUDIBLE] is the annotated version--

GUEST SPEAKER 2: Yeah, you call perf annotate and then a function name. Project 2.1 will walk you through doing this a couple of times, so you'll be able to see what it's doing.

OK, so the assembly here is actually pretty important. I guess everybody's taken 6.004, or some variant, where you had to code some form of assembly? OK, it's a prereq. Cool. So Saman will talk more about assembly. But in this case-- or Charles, actually, now will talk more about assembly. But anyway, in this case, it was actually kind of useful to look at the assembly. We won't expect you to know how to write assembly for this class or expect you to write high-performing assembly. But sometimes taking a look at what the compiler outputs can really help you.

Well, in this case, what caught our eye was this cltq thing, because none of us in the room knew what the heck cltq was. It wasn't an assembly instruction that you usually see. So that caught our eye, and we wondered what this assembly was doing. So we spent a little bit of time tracing through it, working out what the compiler was doing. So in PowerPoint, I pulled out this code. And OK, so I looked up cltq in my 600-page Intel manual. Yes, your processor does come with a manual. So cltq, it says, sign-extends the EAX register to 64 bits and then places that into RAX. Why is it sign-extending? OK, so let's try to trace through what the assembly's doing.
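Putting the sequence described next together, the compiled comparison looked roughly like this (a reconstruction from the narration; exact registers and mnemonics are approximate):

    mov    (%r14), %rcx     # rcx = *A
    mov    (%r13), %rdx     # rdx = *B
    xor    %esi, %esi       # zero out esi
    cmp    %rdx, %rcx       # compare *A with *B
    setl   %sil             # sil = (*A < *B)
    xor    %rcx, %rdx       # rdx = *B ^ *A
    mov    %esi, %eax       # eax = cmp
    neg    %eax             # 32-bit negate: eax = -cmp
    cltq                    # sign-extend eax into rax -- the wasted step
    and    %rax, %rdx       # (*B ^ *A) & -cmp, the mask from the bit hack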
So here, we wanted to do CMP equals star A less than star B. So it moves the value at the address in register 14 into RCX and does the same thing for register 13. So assume that R14 and R13 are A and B, and so RCX and RDX are star A and star B, right? And then it XORs ESI with ESI, which is 0; so it zeroes out the register ESI. And then it does a compare between RCX and RDX, which is the value at A and the value at B. So it does that comparison. And then, depending on the result of that comparison, it writes that result out into register SIL--

GUEST SPEAKER 1: Which is the low byte--

GUEST SPEAKER 2: SIL, as Reid said, is the low byte of the register ESI, which is used here. But before that, the compiler executes this expression here, B XOR A, which it puts into register RDX. And then, finally, it takes ESI, which contains the same value as SIL, and it moves that into EAX. And then it negates EAX. OK, so that kind of looks like negative CMP, right? And then, after negating it, it sign-extends it to a 64-bit register. And then it does the AND mask that we asked it to.

Well, why is it jumping through so many hoops to do this? Well, it turns out, when you declare a value of type int, you say that it's a 32-bit value. And the compiler took you very literally in saying that you requested a 32-bit value. So it did a 32-bit negation, and then it extended that to 64 bits. So it wasted a couple of instructions doing this, instead of just flat-out doing a 64-bit negation to begin with. Now, the problem with this is, usually you don't care if it generates one extra assembly instruction. But this is a tight loop that's called in the recursion process for every single merge that it's doing. So the time that it wastes here really adds up.

So how do we fix that? Well, instead of declaring it as int, let's just try replacing it with long, then recompile the code and see if things are any different.
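In source, the fix is a one-word change (a sketch of the flag declaration, before and after):

    int  cmp = (*A < *B);   // before: 32-bit flag, so the compiler emits a 32-bit negate plus cltq
    long cmp = (*A < *B);   // after: 64-bit flag, negated directly as a 64-bit operation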
So here we start our process of trial and error. We compiled mergesort with the long, and the runtime is 4.05 seconds. Before, it was, like, 4.5 or so. So the runtime definitely improved. Let's compare it.

GUEST SPEAKER 1: Yeah, just to verify.

GUEST SPEAKER 2: So it went from 4.59 to 4.05, which definitely is an improvement. So let's figure out-- well, let's see if it generated better assembly. We'll run perf record in order to look at perf annotate again. And OK, once again, it's spending 14% of the time somewhere in mergesort.c, which is hardly surprising. We'll go to line 27, and, yeah, it uses the long. OK, so that's roughly the assembly. And though Reid took the time to find it, I have a nicely prepared screenshot of the same section of code, so it looks cleaner.

It loads A and B into different registers this time, actually RSI and RCX. And then it clears out EDX. And then it does the compare between RSI and RCX, which are A and B. And then it sets percent DL. Now, percent DL is the lower byte of EDX; so it used a different choice of registers. And then it did the A XOR B. And then it moved that compare into RAX.

GUEST SPEAKER 1: [INAUDIBLE]

GUEST SPEAKER 2: OK, and Reid actually has a longer snippet of the code. And the compiler, now that we told it that it was a 64-bit flag, realized that there's a couple of optimizations it could do. So it switched around the order of things a little bit. And it's recycling the value of RDX for loading, I think it's the A += CMP; it's recycling the value of RDX that it generated out of the compare. And now, down there, it's negating RAX as a 64-bit number and directly using it, without that cltq instruction. So that was kind of nice. So in just nudging the compiler a little bit, we got it to produce code that ran roughly 10% faster, which was surprising to us, actually: that changing a single int to a long could reduce the runtime by that amount.
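Roughly, the new sequence reads like this (again a reconstruction from the narration, so register choices are approximate):

    mov    (%r14), %rsi     # rsi = *A -- different registers this time
    mov    (%r13), %rcx     # rcx = *B
    xor    %edx, %edx       # zero out edx
    cmp    %rcx, %rsi       # compare *A with *B
    setl   %dl              # dl = (*A < *B), the low byte of edx
    xor    %rsi, %rcx       # the A XOR B
    mov    %rdx, %rax       # rax = cmp
    neg    %rax             # one 64-bit negate -- no cltq needed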
So the next thing that we did was scroll down a little bit more in this window and look at the next thing the compiler was doing. And we kind of caught the compiler again doing something silly. So here's the code that we were at before, the compare code. And now let's look at how the compiler did the B += !CMP. Once again, the reason that we stopped and looked at this code was that we saw an instruction we didn't recognize: SBB. Well, we looked at the manual again. And SBB of R1 and R2 is a subtract with a borrow bit. So it subtracts the two arguments and then subtracts the carry flag from that.

So OK, why's it doing that? Well, let's walk through what the compiler tried to do. Let's see, where's a good place to start? So it's RDX in this compare. So it compares RDX to 1; RDX is where CMP was stored. So it checks whether or not it equals 1. And the result of that check is actually stored in the carry flag, CF. And then, subtract with borrow, RAX, RAX. Before, RAX held the value of B. But now, since it did this, RAX minus RAX is 0, right? And then minus the carry flag is either 0 or negative 1, depending on the value of the compare. So once again, it generates a register that's either all 0's or all 1's, 64 bits.

Now, what did it go and do with that? Well, jump down a little bit later on in the code. It ANDs EAX, which is the 32-bit version of RAX, with the value 8 in hexadecimal. So now it's either all 0's, or it's 8. And then it adds that value, EAX, which is the same as RAX-- RAX is the 64-bit version-- to RBP, which stored the address of B. So it jumped through that many hoops just to increment B by either 8 or 0. Now, this seems a little bit unnecessary. So we thought, how else could we express B += !CMP, and change the code around a little bit so that it does the same thing, but hope that the compiler treats it differently? Well, in this case, not CMP, when CMP is either 1 or 0, is the same thing as 1 minus CMP.

Well, we compiled a version of this code, and then we tried to run it. Refreshing our memory: this is mergesort minus, where we just did the minus-sign change. So we took the runtime down from 4.04 seconds to 3.77 seconds. And once again, this is just from replacing the not CMP with a 1 minus CMP. That's all the code change that we made.
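In source, that rewrite is just the following (a sketch of the one-line change):

    B += !cmp;      // before: compiled to the cmp/sbb/and/add dance above
    B += 1 - cmp;   // after: same value when cmp is 0 or 1, but compiles to a plain subtract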
And we will look at the annotation again, just to make sure that the compiler generated different code. Rather, we were curious what the compiler generated this time. So it's roughly this section of code. Now, one thing you'll notice-- this is already compiling with GCC's maximum optimization level, mind you. So keep that in mind when you assume that the compiler is going to help you out and do smart things. So the instructions are a little bit out of order compared to what you might expect.

All right, so I'll go to my prepared screenshot of the code. So what did it generate this time? All right, so already, this code looks shorter than the previous one, right? A little bit less hairy than that. So what did it do? Well, SIL, once again, is the lower byte of RSI. So it does the compare, RCX with RDX, which is actually this comparison, and then it saves that into SIL. And now, later, it does the XOR. So this was the code from before. And then it does a move of RSI, which contains the same value as SIL, which is CMP, either 1 or 0. It moves that into RAX. And then, later on, it uses RAX for one calculation. And it also uses the value that's already in RSI, and it subtracts that from register 14, which is where it decided to store the value of B. So that's a much more direct way to either subtract 0 or 1: actually get just the 0 or 1, and then use it for a subtraction.

So the end result-- and something else to point out is that the move and subtract commands have a little bit of instruction-level parallelism, because right after the set instruction that writes SIL is done, you can execute both the move and the subtract at the same time, since they use the same source register and write to different destination registers.
And the other thing to notice in the comparison is the number of ALU operations. I'll go back to the original one. You see negate, AND, XOR, add. So move is not an ALU operation; it's not an arithmetic operation. But everything else is either XOR, AND, add, subtract, or compare, which all use the ALU. Now, if you look at the code over here, you have move, move, move, which don't need the ALU, and then the other four or five need the ALU. And having fewer ALU ops is a good thing. If you remember from the previous lecture on CPU architecture, the Nehalem has six execution ports, so it can execute six micro-ops in parallel per CPU clock cycle. But only three of those execution ports have ALUs on them, so that they can actually do XOR, AND, subtract, add, and so on. The rest of them can do register moves and a couple of other things, but not arithmetic operations. So that's a good thing. So that's an explanation of why the code actually sped up.

Now, just as an overview of what we were able to do, we ran perf stat, the same thing that we did for the original two mergesorts, and we made a table of the improvements that we made. So originally, quicksort ran in four seconds, and it had an IPC of 0.8 and an 11% branch miss rate. When we switched from quicksort to standard mergesort, we increased execution time by 20%, which is not good. But the instructions per clock jumped up a little bit. The branch miss rate is still pretty bad. Now, as soon as we switched to branchless, we got an 8% improvement over the previous runtime. But note that we were still worse off than quicksort at that point. But the instructions per clock jumped quite a bit from last time, which is kind of what you would expect from reducing the branch misses by a factor of 10. So the CPU isn't spending extra time stalling the pipeline, trying to backtrack on its branch mispredicts.
And now the more interesting thing to me is, if you look at the two silly little compiler optimizations we made: switching from int to long reduced the runtime by 11% over the previous branchless implementation. And the instructions per clock and the branch miss rate both didn't change at this point; at this point, we were just reducing the number of instructions. And then going from there to changing the not CMP to 1 minus CMP reduced our runtime by another 7%. So overall, branchless mergesorting is 10.8% faster than quicksorting by the time we were done optimizing, which is a lot better than the negative improvement we started with. At the branchless step, we were still behind, and at the end, we ended up ahead of quicksort. And then, overall, switching to branchless, with all the performance tweaks that we made, was 33.6% better than the branching version. So the non-branching version ran a third faster than the branching version, which is pretty cool. And that's something that we learned at the end of the day yesterday.
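Collecting the numbers quoted during the demo in one place (runtimes are from the individual runs above):

    quicksort                        4.14 s   (IPC 0.8, 11.5% branch misses)
    mergesort, branching             5.04 s   (IPC 1.1, ~10% branch misses)
    mergesort, branchless (int)      4.59 s   (IPC 1.7, branch misses down ~10x)
    mergesort, branchless (long)     4.05 s
    mergesort, branchless (1 - cmp)  3.77 s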
781 00:39:01,260 --> 00:39:03,400 So you've got to try things one at a time and see what each 782 00:39:03,400 --> 00:39:08,640 change causes the compiler to do, because you can see in the 783 00:39:08,640 --> 00:39:10,990 previous two examples, by the time we made that one 784 00:39:10,990 --> 00:39:13,740 optimization, it wasn't just that one assembly instruction 785 00:39:13,740 --> 00:39:15,700 that the compiler ended up changing. 786 00:39:15,700 --> 00:39:17,980 The compiler, in fact, realized that it could do 787 00:39:17,980 --> 00:39:21,250 additional optimizations on top of what we hinted to it. 788 00:39:21,250 --> 00:39:23,280 So it did even more optimizations. 789 00:39:23,280 --> 00:39:26,550 But the bottom line is, you should optimize iteratively. 790 00:39:26,550 --> 00:39:29,510 Try something, and see what effect it has before moving on 791 00:39:29,510 --> 00:39:31,130 and trying something else. 792 00:39:31,130 --> 00:39:35,250 And then, as much as I don't like coding assembly, and I 793 00:39:35,250 --> 00:39:38,500 don't expect that anybody here codes assembly for the class, 794 00:39:38,500 --> 00:39:41,160 looking at the assembly is always a good thing. 795 00:39:41,160 --> 00:39:44,530 Just skim through it around your performance bottlenecks. 796 00:39:44,530 --> 00:39:47,070 And try to get an idea of what the compiler is 797 00:39:47,070 --> 00:39:48,520 asking the CPU to do. 798 00:39:48,520 --> 00:39:51,960 And see if there's a way that you can nudge the compiler to 799 00:39:51,960 --> 00:39:53,240 do the right thing. 800 00:39:53,240 --> 00:39:56,110 Sometimes we tell you the compiler will optimize certain 801 00:39:56,110 --> 00:39:58,320 things, will get rid of dead code, and so on. 802 00:39:58,320 --> 00:39:59,820 And most of the time, that's actually true. 803 00:39:59,820 --> 00:40:03,080 But it doesn't hurt to double-check in the assembly 804 00:40:03,080 --> 00:40:04,930 and make sure that it's actually doing that. 805 00:40:04,930 --> 00:40:09,150 And the output from perf is pretty helpful for doing this. 806 00:40:09,150 --> 00:40:11,800 It's not a monumental disassembly task. 807 00:40:11,800 --> 00:40:15,000 perf does all of the annotation and interleaving for you. 808 00:40:15,000 --> 00:40:18,210 And finally, learn through practice. 809 00:40:18,210 --> 00:40:21,830 That's the only way to learn how to do this performance 810 00:40:21,830 --> 00:40:22,470 optimization. 811 00:40:22,470 --> 00:40:24,580 You gain a lot of experience from just going out 812 00:40:24,580 --> 00:40:25,900 and trying the tools. 813 00:40:25,900 --> 00:40:29,450 And Project 2 will have you working with these tools. 814 00:40:29,450 --> 00:40:32,500 Project 2.1 does not have a code submission. 815 00:40:32,500 --> 00:40:35,490 It's just a write-up, where you get to play around 816 00:40:35,490 --> 00:40:38,430 with the tools on code that we give you, so you can see how 817 00:40:38,430 --> 00:40:39,600 to use the tools. 818 00:40:39,600 --> 00:40:43,550 And then Project 2.2, which is going out next week, will 819 00:40:43,550 --> 00:40:46,750 actually give you a chance to try to optimize code that we 820 00:40:46,750 --> 00:40:50,760 give you, using these tools to help you out along the way. 821 00:40:50,760 --> 00:40:54,719 So with that said, any questions about-- 822 00:40:54,719 --> 00:40:56,571 AUDIENCE: [INAUDIBLE] a write-up. 823 00:40:56,571 --> 00:40:59,890 How do you submit write-ups for projects? 824 00:40:59,890 --> 00:41:01,050 GUEST SPEAKER 2: On Stellar.
825 00:41:01,050 --> 00:41:05,000 So the handouts for the write-ups are Stellar homework 826 00:41:05,000 --> 00:41:05,480 assignments. 827 00:41:05,480 --> 00:41:07,568 Just upload a PDF to Stellar. 828 00:41:07,568 --> 00:41:10,320 AUDIENCE: OK. 829 00:41:10,320 --> 00:41:11,752 GUEST SPEAKER 2: Yes? 830 00:41:11,752 --> 00:41:13,002 AUDIENCE: [INAUDIBLE PHRASE] 831 00:41:19,850 --> 00:41:21,690 GUEST SPEAKER 1: Yeah, you just have to build the binary 832 00:41:21,690 --> 00:41:22,610 with debug info. 833 00:41:22,610 --> 00:41:24,340 GUEST SPEAKER 2: Yeah, -g is enough. 834 00:41:24,340 --> 00:41:29,100 As long as GDB's breakpoints can see your code, 835 00:41:29,100 --> 00:41:30,410 then perf can do it. 836 00:41:34,420 --> 00:41:37,230 And it's actually pretty nice in that, even if your code is 837 00:41:37,230 --> 00:41:40,840 inlined, it can identify where the inlined portions came from 838 00:41:40,840 --> 00:41:44,390 and pull in the proper lines of code for that. 839 00:41:44,390 --> 00:41:45,850 SAMAN AMARASINGHE: I think what's really interesting and 840 00:41:45,850 --> 00:41:49,110 important here is not that they found this interesting 841 00:41:49,110 --> 00:41:50,940 thing, but the process they went through. 842 00:41:50,940 --> 00:41:54,110 A lot of times, when you are coding, you are in control. 843 00:41:54,110 --> 00:41:56,920 You have a very planned structure for how you're going 844 00:41:56,920 --> 00:41:57,720 about doing it. 845 00:41:57,720 --> 00:42:00,800 And you follow it through and produce the code you want. 846 00:42:00,800 --> 00:42:02,440 And sometimes, some pesky bugs show up. 847 00:42:02,440 --> 00:42:04,120 You work through that. 848 00:42:04,120 --> 00:42:08,470 Here, basically, you let the system and the 849 00:42:08,470 --> 00:42:10,700 data drive what you are doing. 850 00:42:10,700 --> 00:42:14,670 And when you start, there's no planned path saying, I'm 851 00:42:14,670 --> 00:42:17,930 going to do A, B, C, D. Basically, you look at the 852 00:42:17,930 --> 00:42:20,180 results, and you figure out what might change. 853 00:42:20,180 --> 00:42:21,640 And you can go through that. 854 00:42:21,640 --> 00:42:25,390 And I was actually talking with Charles, saying, in my 855 00:42:25,390 --> 00:42:28,130 first lecture, I talked about Matrix Multiply. 856 00:42:28,130 --> 00:42:29,050 It looks really cool. 857 00:42:29,050 --> 00:42:32,080 Every time I did something, it improved by a factor of 2, 858 00:42:32,080 --> 00:42:33,930 50%, stuff like that. 859 00:42:33,930 --> 00:42:37,670 What you didn't see is all the things I did that didn't have 860 00:42:37,670 --> 00:42:40,130 any impact on the program. 861 00:42:40,130 --> 00:42:43,240 You saw this nice tree going down there, but not all the 862 00:42:43,240 --> 00:42:46,890 branches that ended up in dead ends, where nothing happened. 863 00:42:46,890 --> 00:42:49,700 So if you feel like, when you go in there and do a bunch of 864 00:42:49,700 --> 00:42:51,830 things, nothing happens, that's normal. 865 00:42:51,830 --> 00:42:53,890 Of course, when you tell somebody, you're not going to 866 00:42:53,890 --> 00:42:55,540 say, oh, that [UNINTELLIGIBLE]. 867 00:42:55,540 --> 00:42:56,370 Say, I did this 868 00:42:56,370 --> 00:42:57,620 [UNINTELLIGIBLE], and I did that. 869 00:42:57,620 --> 00:43:00,240 And at the end, it looks like these other people are very 870 00:43:00,240 --> 00:43:01,570 smart, and they're getting through all these 871 00:43:01,570 --> 00:43:02,450 optimizations.
872 00:43:02,450 --> 00:43:04,590 No, they spent a lot of time going through 873 00:43:04,590 --> 00:43:05,930 these dead-end paths. 874 00:43:05,930 --> 00:43:10,560 And the interesting thing here is: don't 875 00:43:10,560 --> 00:43:12,560 trust anything. 876 00:43:12,560 --> 00:43:14,350 [UNINTELLIGIBLE PHRASE] 877 00:43:14,350 --> 00:43:16,150 Intel, they have hundreds of engineers 878 00:43:16,150 --> 00:43:17,350 working on it for years. 879 00:43:17,350 --> 00:43:20,425 And, OK, that should be good. 880 00:43:20,425 --> 00:43:22,210 Not really. 881 00:43:22,210 --> 00:43:25,030 So a lot of times, it's an end-to-end thing. 882 00:43:25,030 --> 00:43:26,670 With performance, it's end to end. 883 00:43:26,670 --> 00:43:29,030 So the problem can be at any level. 884 00:43:29,030 --> 00:43:31,210 So it could be the compiler, it could be [UNINTELLIGIBLE] 885 00:43:31,210 --> 00:43:33,620 network, it can be the network, it can be the operating system, it 886 00:43:33,620 --> 00:43:35,795 can be in the memory system, it can be in the processor, it 887 00:43:35,795 --> 00:43:36,870 can be anything. 888 00:43:36,870 --> 00:43:38,740 So if you say, oh, yeah, that shouldn't be the case, 889 00:43:38,740 --> 00:43:40,320 probably that's where it is. 890 00:43:40,320 --> 00:43:42,350 So you question things all the time, basically. 891 00:43:42,350 --> 00:43:45,070 The tricky thing about these kinds of tools is that 892 00:43:45,070 --> 00:43:48,080 sometimes even the tools might be wrong. 893 00:43:48,080 --> 00:43:49,915 So sometimes I have been in situations where you have 894 00:43:49,915 --> 00:43:51,900 to look at the wall clock: OK, wait a minute, does the tool 895 00:43:51,900 --> 00:43:53,840 say what I actually see happening? 896 00:43:53,840 --> 00:43:57,450 Because you get into situations where you have to be very skeptical. 897 00:43:57,450 --> 00:44:00,960 Because a lot of times, in these kinds of situations, the 898 00:44:00,960 --> 00:44:03,140 problem can be hidden in many different things. 899 00:44:03,140 --> 00:44:06,500 So that's a very good way of going through this. 900 00:44:06,500 --> 00:44:10,550 Of course, for what gets handed to you, we have set it up in a 901 00:44:10,550 --> 00:44:12,660 way that there cannot be too many surprises. 902 00:44:12,660 --> 00:44:15,530 But when you're going through a system and trying to optimize 903 00:44:15,530 --> 00:44:18,590 and look at performance, this is a really good way of doing 904 00:44:18,590 --> 00:44:21,190 it, because I don't think these guys thought, when they 905 00:44:21,190 --> 00:44:22,800 started, they'd be looking at 906 00:44:22,800 --> 00:44:25,750 compiler bugs or compiler problems. 907 00:44:25,750 --> 00:44:28,526 They were thinking, OK, I'm going to do a little bit 908 00:44:28,526 --> 00:44:29,770 of code rewriting and get there. 909 00:44:29,770 --> 00:44:32,080 But at the end of the day, what they discovered was very 910 00:44:32,080 --> 00:44:32,930 surprising to them. 911 00:44:32,930 --> 00:44:34,470 And it even surprised me today. 912 00:44:34,470 --> 00:44:36,500 GUEST SPEAKER 2: Our plan basically ended at row three 913 00:44:36,500 --> 00:44:38,110 of that table when we first set out. 914 00:44:38,110 --> 00:44:40,690 We had no idea what was to follow after that.
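As a concrete version of that wall-clock sanity check, something like the following sketch can be wrapped around whatever region the profiler claims is hot; if the tool attributes most of the runtime there, direct timing should roughly agree. The loop is only a stand-in workload, not course code:

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);

        /* Stand-in for the region the profiler says is hot. */
        volatile long sum = 0;
        for (long i = 0; i < 100000000L; i++)
            sum += i;

        clock_gettime(CLOCK_MONOTONIC, &t1);
        double secs = (t1.tv_sec - t0.tv_sec)
                    + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("region took %.3f s (sum = %ld)\n", secs, sum);
        return 0;
    }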
915 00:44:40,690 --> 00:44:44,600 So one thing that I remember is that some students came to 916 00:44:44,600 --> 00:44:47,150 us and asked questions like, yeah, I tried to implement 917 00:44:47,150 --> 00:44:48,710 this bit hack here, and it actually 918 00:44:48,710 --> 00:44:50,580 slowed the program down. 919 00:44:50,580 --> 00:44:54,590 Or, I changed my pentomino board 920 00:44:54,590 --> 00:44:56,440 representation to this, 921 00:44:56,440 --> 00:44:58,690 and it didn't seem to speed anything up either. 922 00:44:58,690 --> 00:45:01,540 Well, in those cases, yeah, you need to do more detective 923 00:45:01,540 --> 00:45:03,930 work to figure out what happened. 924 00:45:03,930 --> 00:45:05,990 SAMAN AMARASINGHE: So a lot of times, the performance is 925 00:45:05,990 --> 00:45:09,790 something like a max of a bunch of different things. 926 00:45:09,790 --> 00:45:11,930 So you have done something, and that's great. 927 00:45:11,930 --> 00:45:13,880 It should give you a huge performance improvement. 928 00:45:13,880 --> 00:45:15,380 But something else is holding it back. 929 00:45:15,380 --> 00:45:17,590 And a lot of times, you have to do three or four things. 930 00:45:17,590 --> 00:45:20,130 And once you do three or four things, suddenly everything 931 00:45:20,130 --> 00:45:23,430 starts kicking in. 932 00:45:23,430 --> 00:45:24,970 A good example is loop unrolling. 933 00:45:24,970 --> 00:45:26,830 You unroll a loop and say, ha, I should get all this 934 00:45:26,830 --> 00:45:26,990 parallelism. 935 00:45:26,990 --> 00:45:29,290 No, because you have some dependence. 936 00:45:29,290 --> 00:45:30,990 So you have to fix this dependence, then that 937 00:45:30,990 --> 00:45:31,320 dependence. 938 00:45:31,320 --> 00:45:33,470 Suddenly, you've removed them all. 939 00:45:33,470 --> 00:45:34,930 Then you get all the parallelism and it 940 00:45:34,930 --> 00:45:36,050 starts running fast. 941 00:45:36,050 --> 00:45:38,400 So with a lot of these things, it's basically a max. 942 00:45:38,400 --> 00:45:40,200 You have to get all the things right. 943 00:45:40,200 --> 00:45:42,970 And so if you stop halfway through, that means you might 944 00:45:42,970 --> 00:45:45,540 be almost there and not know it. 945 00:45:45,540 --> 00:45:49,760 So that's why it can be a really frustrating process. 946 00:45:49,760 --> 00:45:51,630 But it's kind of fun, too, when you get there.
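Here is a small C sketch of that loop-unrolling point, with assumed names rather than code from the lecture: unrolling alone leaves one serial dependence chain through the single accumulator, and only once that dependence is also broken does the parallelism kick in.

    #include <stddef.h>

    /* Unrolled, but every add still feeds the same accumulator, so
       the adds form one serial dependence chain. (Remainder handling
       omitted for brevity.) */
    double sum_unrolled(const double *x, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i + 4 <= n; i += 4)
            s = s + x[i] + x[i + 1] + x[i + 2] + x[i + 3];
        return s;
    }

    /* Unrolled with the dependence broken: four independent
       accumulators can be in flight at once. */
    double sum_unrolled_split(const double *x, size_t n)
    {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i;
        for (i = 0; i + 4 <= n; i += 4) {
            s0 += x[i];
            s1 += x[i + 1];
            s2 += x[i + 2];
            s3 += x[i + 3];
        }
        double s = s0 + s1 + s2 + s3;
        for (; i < n; i++)   /* remainder */
            s += x[i];
        return s;
    }

Note that for doubles, a compiler generally won't make this transformation on its own under strict floating-point semantics, since it changes the order of the additions.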