The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: All right, everyone. Welcome back. Today we're going to talk about the profiling tools available on Cell to help you evaluate how your programs are running. In the next hour I'll talk about SIMD, which is how you can take advantage of the hardware support for data parallelization.

First up, I'd like to follow up on some questions we had last time about gdb, and then I'll discuss some of the various methods for profiling on Cell. I'll talk about the Cell simulator and using the decrementers, and then we're going to talk about static profiling and instruction reordering, which I started to get into last time.

So first, we ran into some problems last time with gdb, various error messages. First, sometimes gdb is going to examine the wrong variable when the names are ambiguous, as we saw. We had problems with ambiguity between variables that have the same name, where one of them was in the SPU thread and one of them was in the PPU thread. To work around this, one thing you can do is use spu-gdb, which attaches to a particular SPU thread, and then you can just look up variable names in that thread. You can also, of course, rename your variables.

The second problem is that sometimes gdb will delete your breakpoints before you want it to. The reason is that it removes breakpoints whenever the program associated with a breakpoint has been unloaded. But sometimes you have more than one thread running the same program, and then you won't be able to break into the second one. So again, you can use spu-gdb to attach to a particular thread if you want to do that.
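A minimal sketch of the delay-loop workaround that comes up next. The variable name keep_alive is made up, and the three-argument main is just the conventional SPU thread signature, so treat this as illustrative rather than the lab's actual code:

    /* Spin at the end of the SPU program so the thread stays loaded and gdb
       keeps its breakpoints. Clear keep_alive from the debugger
       (set var keep_alive = 0) when you want the thread to exit. */
    volatile int keep_alive = 1;

    int main(unsigned long long speid, unsigned long long argp, unsigned long long envp)
    {
        /* ... the real SPU work goes here ... */

        while (keep_alive)
            ;   /* debugging aid only; remove for real runs */

        return 0;
    }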
Another thing you can do is, for example, put a delay loop at the end of your SPU thread in order to keep that thread running for as long as possible. It seems that as long as gdb doesn't unload that thread, it won't delete your breakpoint.

And finally, I saw this kind of cryptic error, "thread event breakpoint: gdb should not stop." Indeed it shouldn't. To get around this one, it seems that the problem doesn't happen with spu-gdb.

So I'll just recap how to use spu-gdb to debug your programs. Last time we were mostly using ppu-gdb, which helps you debug both your PPU program and your SPU program together, and you're able to switch between all the threads that are available. On the other hand, spu-gdb just attaches to a particular SPU thread. So what you do is invoke your original PPU program with SPU_DEBUG_START=1, and then you can background it. Now, every time your program spawns a new SPU thread, it will print the process ID, and that spawned thread will stop and wait for a debugger to attach to it. At that point, you can run spu-gdb -p and then the process number it gave you. And once you've attached, you can continue that thread to resume debugging. Any questions?

OK, so the Cell simulator is one of the tools we have available to help you debug. This is how you invoke it if you have it installed: /opt/ibm/systemsim-cell/bin/systemsim, and you want to use -g to get the graphical interface. A little window will pop up and you'll click Go. The window looks like this. From here, you can have a lot of fine-grained control over how your program is running. In fact, you can advance the state of the simulation cycle by cycle to look at what's going on inside the simulated Cell system.
And when you boot up this simulator, you actually get a complete Linux environment in there, with an xterm. And what you can do in this Linux environment is copy over programs that were compiled for the Cell, and then you can run them.

In order to transfer a program to your simulated system so that you can run it, you can use this program, callthru, which is available on the simulated system. What it does is grab a file that's on the host system and copy it in. It just puts it on standard out, so you can redirect it to whatever file you want. And then when you make it executable, you can run it. All right, questions?

While you're in the simulator, you can view a lot of different things. You can get a lot of information about the state of the machine. In particular, you can look at the registers at any particular time, again clock step by clock step. This might not be as interesting now that you have the debugger, and you can run the debugger on real hardware. But one thing the Cell simulator does get you is these dynamic profiling tools. What it will do is look at the state of the simulation and figure out where things are stalling, where the time is being spent by the processor. And you can get separate stats like these. It tells you, for example, the total cycle count and the stalls for various reasons. You get all these statistics separately for the PPU and each SPU. So this can be very helpful if, for example, you're trying to figure out exactly what's going on on each SPU, to do load balancing, whatever.

These statistics cover whatever length of the program you run. But you can also get more fine-grained statistics using these profiling functions, sketched below.
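Roughly, those checkpoint calls look like this. This is a minimal sketch, and it assumes the header really does name them prof_clear, prof_start, and prof_stop, which is how I read "profile clear, start, and stop" below; check profile.h on your system:

    #include <profile.h>   /* simulator profiling checkpoints; no-ops on real hardware */

    void region_of_interest(void)
    {
        prof_clear();   /* reset the simulator's counters              */
        prof_start();   /* begin collecting statistics for this region */

        /* ... the code you want cycle and stall statistics for ... */

        prof_stop();    /* stop collecting; read the stats in the simulator */
    }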
So if you include profile.h in your program, then you get access to these functions: profile clear, start, and stop. When these get compiled into your program, they're actually no-ops when you run them on a real Cell processor. But the simulator is able to latch onto them, and you can use them to mark the start and stop of regions of interest for profiling.

Now, the simulator comes with the SDK, and it's a little bit cumbersome to install, so we haven't made it available yet on the actual PS3 hardware. You can run it on x86, so if your group would like to get set up with that on one of your computers, or if there's enough interest, we'll try to make it available on our hardware. So please let us know.

OK, the next thing we'll cover is using decrementers to profile your program. This is one way to do dynamic profiling that doesn't require the simulator, so when you get information from the decrementers, you can run these programs on your actual Cell hardware. Basically, the decrementer is just a counter that counts down at a constant rate, so you can use it as a clock to figure out how long different events last. The rate at which the decrementer counts down is not that fast, so you're not going to be able to use it to time things on the order of a few clock cycles. It's best for timing things that are maybe on the order of thousands of instructions in length.

The way you use it is through these SPU write channel and SPU read channel functions, which give you access to some of the internals of the Cell processor. So here's an example setup you can use. Basically, first you call write channel and you pass in this MFC_DECREMENTER_EVENT thing, and what that will do is initialize the decrementer counter for you.
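Here is a rough sketch of the whole timing pattern, including the read-before/read-after and the stop step that are described next. The channel intrinsics (spu_writech, spu_readch) and channel names (SPU_WrEventMask, SPU_WrDec, SPU_RdDec, SPU_WrEventAck) are the ones I believe spu_intrinsics.h and spu_mfcio.h provide, so double-check the exact recipe against your SDK headers:

    #include <stdio.h>
    #include <spu_intrinsics.h>
    #include <spu_mfcio.h>   /* MFC_DECREMENTER_EVENT and the channel names */

    void time_region(void (*work)(void))
    {
        unsigned int start, end;

        /* Enable the decrementer event and load a large starting value. */
        spu_writech(SPU_WrEventMask, MFC_DECREMENTER_EVENT);
        spu_writech(SPU_WrDec, 0x7fffffff);

        start = spu_readch(SPU_RdDec);   /* read before the region of interest */
        work();
        end = spu_readch(SPU_RdDec);     /* read after it                      */

        /* The counter counts down, so subtract the end from the start. */
        printf("elapsed: %u decrementer ticks\n", start - end);

        /* Done timing: per the recipe in the lecture, another channel write
           stops the counter; masking the event off and acknowledging it is
           my guess at the exact incantation. */
        spu_writech(SPU_WrEventMask, 0);
        spu_writech(SPU_WrEventAck, MFC_DECREMENTER_EVENT);
    }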
Then you read the value before and after the function you want to profile, and when you subtract, that gives you the time that elapsed, in some arbitrary units. And of course, it's counting down, so you want to subtract the end value from the start value. And after you're done using it, you can do another write channel to stop the counter from continuing. All right, any questions?

OK, let's continue talking about instruction reordering, which I started last time. I'm going to start from the beginning this time, and maybe it'll be more coherent.

So on the Cell architecture, when you're looking at what instructions are being run, the instructions are mostly going to be of the form: evaluate some function of some of the registers and write the result to some other register. And the assembly file, which is what you get when you run gcc with -S, is just a human-readable representation of these instructions. I'll show you how to read that a little later. You can think of these instructions as executing in series, just in the order in which they appear in the assembly. So if you think of one instruction starting and then finishing, then the next instruction coming in and finishing, that should be consistent with the actual behavior of the hardware.

Now, for real hardware, waiting for one instruction to finish before the next one starts is going to be too slow. So there are various optimizations that the hardware does in order to start instructions sooner than when the previous instruction finishes. The big one is pipelining, where you have multiple stages inside your processor, and you can push each instruction in before the previous one has completed. And with pipelining, you're going to be subject to these dependencies between instructions.
For example, if one instruction reads from a value that the previous instruction wrote, then clearly you can't start the second instruction until after the first instruction has completed.

The Cell processor actually has multiple pipelines into which you can insert instructions, and another optimization it does is branch prediction, in order to reduce stalls from branching.

AUDIENCE: [INAUDIBLE].

PROFESSOR: Pardon?

AUDIENCE: Can you talk about static branch prediction?

PROFESSOR: Yes. Static branch prediction means-- or rather, the way it's done on the Cell is that what's predicted for a branch is not affected by the history of which branches have been taken. Some other processors actually do use this information.

All right. Now, the thing about pipelining is that because we still have to honor these dependencies between instructions, the order in which we evaluate instructions is going to make a difference in how long the program runs. So let's take a simple example where we have three instructions, a, b, and c, and we have this pipelined processor where we can insert instructions at these clock ticks. Suppose that c depends on b, and there are no other dependencies in the system. Then this sequence of instructions is going to take five cycles, because we can push b in after the first cycle, but we can't push c in until two cycles after that. Is that clear? Questions?

All right. Now, because c depends on b, it makes sense to push b as far up as we can in order to ease this dependency. So if we execute b before a instead of after a-- and remember, we can reorder b and a however we want, because there are no dependencies between them-- then we can start b at the first tick, put in a right after that, and then c. And now the sequence only takes four clock cycles. All right?
So observe that when we put c in at this third clock cycle, we couldn't have put the third instruction in any sooner, no matter what the dependencies were. And in general, that's what we want: we want instructions to be waiting on the pipeline rather than on dependencies. So that means in the first picture there's this stall of one clock cycle, and we're going to want to eliminate these kinds of stalls. All right?

So what we're going to do is use these static profiling tools to figure out where stalls happen. The first thing we have to do is generate the assembly and the instruction schedule that goes with it. You can use gcc -S, and the same flag works with xlc, to generate the assembly. And then we have this utility called spu_timing, which runs on the Cell and which will tell you the instruction schedule. Basically, it takes in your assembly file and tells you exactly when each instruction in that assembly gets scheduled. If you call it with -running-count, then it will tell you the running count of clock cycles at each stage, which means that if you look at the last number, that will tell you how long your program took to run, in clock cycles. And it will write the output to a file with the same name as the input, but with .timing at the end. Now, if you're using our Makefile, you can do all of this-- generate the assembly and the timing info-- in one step, by doing SPU_TIMING=1 and then make with the name of the .s file, all right?

Now, if you recall, on Cell we have only in-order execution. The instructions are just going to execute in the order that they're specified in the assembly. And the Cell has dual pipelines, meaning there are two pipelines into which instructions can go, and the pipeline is selected just based on the instruction type.
And of course, two instructions are only going to issue simultaneously when the dependencies allow it.

AUDIENCE: [INAUDIBLE].

PROFESSOR: Right.

So if you're looking at the assembly, you're going to see lots of lines of this form: the op, the destination, and then the source registers. So when you're trying to figure out dependencies, it's going to be helpful to be able to look at each line of the assembly and figure out which registers that line is reading from and which registers that line is writing to. And there's also information in the assembly file-- I went into that last time-- which tells you, for each chunk of assembly, what the corresponding source file line is.

All right. And the actual output of the static profiler, like I said, is a schedule of when each instruction runs. In the schedule, you're going to see one digit for each cycle that the instruction is running. So however many digits there are, that's how many cycles are being used. And in front of some of the instructions, you're going to see these dashes, which represent stalls. So the three dashes in front of this fm instruction mean that, based on the pipeline, you could have scheduled this instruction up here, right after the previous instruction finished. However, because you're waiting on particular dependencies, you're only able to schedule it three cycles later. So that's a three-cycle stall.

The static profiler output also shows some other information. The original assembly appears on the right-hand side, and which pipeline the instructions are going into shows up on the left. Any questions about this format?

AUDIENCE: [INAUDIBLE].

PROFESSOR: Dual-issue just means that these instructions are going to be executed at the same time.

AUDIENCE: So can people figure out, for example, what the latency for the fm instruction is? How many cycles does fm take?
[INAUDIBLE]

AUDIENCE: [INAUDIBLE].

PROFESSOR: Pardon?

AUDIENCE: What are the numbers on the right?

PROFESSOR: Over here? All right, so these are the arguments to the assembly instructions. Remember, the first one is going to be the destination register, and the other ones are going to be the source registers that the instruction is going to use.

All right. So again, we want to be able to reorder these instructions to minimize stalls. What you're going to do is look for instructions that are potentially going to take a long time-- for example, loads and floating-point instructions take a long time-- and if those instructions have dependencies, then those dependencies are going to stall. So not only do you want to minimize stalls, you also want to dual-issue instructions whenever possible, although that's a little bit more difficult to just eyeball. But anyway, if you get up to the ideal case, where you're able to dual-issue every instruction, that means you're running two instructions per cycle, or half a cycle per instruction.

Anyway, going through these assembly files, figuring out where the dependencies are and which instructions you can move around, is kind of tedious, and you're not going to be able to apply these techniques to your entire program. So if you only have limited time, what you're going to have to do is try this on the most critical code first. And that's usually going to be loop bodies, because those get executed many, many times, and any savings that you see there will be multiplied many, many times over.

So I'm going to go through an example of one possible optimization you could do. We'll actually go through these same files in the exercise, about two minutes from now. This is from Lab 1, and this is one of the code snippets-- or assembly snippets-- from the SPU program.
So notice that over here we have these load instructions, each followed by-- I guess this is a shift instruction. And there are two pairs of these. And in each one, the shift instruction depends on the result of the previous load. Does everyone see that from the arguments?

So now the problem is that because the shift happens right after the load, it's going to stall for a long time, because the load takes six cycles. All right, so what can we do over here in order to reduce the stalling? Yep?

AUDIENCE: Do the two loads concurrently?

PROFESSOR: Yeah, we can do the loads concurrently, and in general, we just want to move these loads up as high as we can. So if I move the two loads up to the top of this-- oops-- the top of this snippet, then we have the two loads followed by all the instructions that came after. Now the first shift stalls by only two cycles, and the second shift doesn't stall at all. All right?

AUDIENCE: [INAUDIBLE]?

PROFESSOR: So, pardon?

AUDIENCE: [INAUDIBLE].

PROFESSOR: So the question is, what would this change look like in code? The thing is that the order of these generated instructions is dependent on the compiler, and it's really up to the compiler how it wants to order these instructions. So this change is not going to be reflected in your C code. And that's why you have to do these optimizations after the assembly is generated. Does that make sense?

AUDIENCE: [INAUDIBLE].
PROFESSOR: If you have a smart compiler, then it might do some kind of instruction scheduling in order to minimize stalls, using this methodology. Yup?

AUDIENCE: I'm quite impressed with XLC. I compiled the same program for the SPU under GCC and under XLC, both at optimization level 3, and XLC ran in half the time. So it was doing a much better job of instruction scheduling--

AUDIENCE: Than GCC?

AUDIENCE: Than GCC. So that's something, if you've got a choice, and the XLC compiler is working for you--

AUDIENCE: In general, this kind of low-level assembly hacking for instruction ordering is not something you'll do, just because compilers are getting incredibly good at instruction scheduling. But here the exercise is for you to get familiar with the tools and understand in a little more detail what's going on at the instruction level [INAUDIBLE] pipelines. Another thing to note here-- well, two things you could consider here-- is how high up can you push the loads? And related to that, why can't you completely hide the latency for the first load, so that the first shift still stalls for two cycles? That depends on how much slack you have in your schedule, on the dependencies, and on where the loop starts. So in some cases, you simply can't hide the entire latency. That's part of what you have to consider. But [INAUDIBLE] and the compilers are getting really good at it.

PROFESSOR: And in case you're wondering, I think this code was compiled using XLC with no optimizations, which is why there's this really easy opportunity for optimization here.

It turns out, if we run the timing utility again, we saved eight cycles. So notice that in the previous picture the running count in that column went up to a value ending in seven, and now it only goes up to a value eight cycles lower, ending in nine, all right?

So what we're going to do for the exercise right now is have you try something like this. What we're going to try to do is just improve performance by rescheduling instructions. So if you download the tarball, you'll get the exact same file I was working on before. And what we're going to have you do is practice generating the assembly, modifying it, and then continuing with the rest of the build process to get a new object file that's based on your modified assembly.
All right, any questions?

Now, you just need to find, for example, one opportunity for optimization. And you can do the exact same thing that I did, if you want. The key is that after you've done your reordering, you want to re-run the timing utility to see how many cycles you saved. You also want to, of course, continue the build process to get the final executable, and you want to run that just to make sure that your code is still correct.

AUDIENCE: Are we able to get this?

[SIDE CONVERSATIONS]

PROFESSOR: OK, so this static profiling process does have its limitations. The first, and this is a pretty major limitation, is that the static profiler assumes that none of the branches are taken. So basically it just zips through the entire assembly file in a straight line. That means you're going to have a skewed view of how long code inside conditionals or loop bodies is going to take: if the savings are inside a conditional, that code may not run at all, and if they're inside a loop body, they may get multiplied many, many times. So if, for example, you see that you've saved eight cycles in the static schedule, you're going to need some more context-- that is, whether it's inside a loop body and how many times you expect that to run-- to get a good idea of how these instruction reorderings are actually affecting the runtime of your program.

And also, because this is static profiling, which means you're not using an actual program run to get data, you're not going to capture any behavior that depends on the inputs to your program. So, for example, you don't get counts of how many times each loop runs, or when each branch is taken.

One other thing to worry about, which isn't going to show up in these static profiling predictions, is branch prediction. Remember, the static profiler is just going to ignore all branches.
And the thing is, branch prediction is used to reduce the stall after a branch is taken, by prefetching instructions and possibly doing other things. So what the hardware will do is guess the direction that the branch will take, and then start prefetching instructions from the place where it expects execution to resume after the branch. And if the branch prediction is wrong, then you're going to incur a pretty large penalty-- on the order of 20 cycles-- while it flushes the pipeline and starts fetching instructions from the new place. On the other hand, if the branch prediction is correct, then there's actually no penalty at all, and the next instruction will just resume right after the branch.

Anyway, you can give hints to the compiler to tell it what you think the outcomes of branches will be, and this can help in some circumstances. Basically, there are these macros that you can define. What you're going to use is this __builtin_expect compiler intrinsic. What that means is, if you put it inside an if, it says: evaluate the if as if the condition were exp, but expect the value of that condition to be, for example, true or false. All right, does that make sense?

Anyway, just one note about the exercise that you did. I believe the original run time was 469 clock cycles. Actually, if you run XLC with -O5 optimizations, it will reduce that to 188, so about 280 cycles of savings. All right, so we'll-- pardon?

AUDIENCE: The one that we just [INAUDIBLE].

PROFESSOR: No, so the original code was-- the original assembly was compiled at -O0, with no optimizations. But then if you do an optimization by hand, it saves eight cycles off of -O0. Those are actually pretty easy to find. And the -O5 optimization level will actually shave off about 280 cycles.

So we'll break for about 10 minutes, and then after the break, we'll do SIMD on Cell.
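For reference, branch hints of the kind just described are usually wrapped in macros built on the GCC-style __builtin_expect intrinsic. This is just a sketch: likely and unlikely are conventional names, not something the SDK defines for you, and process is a made-up function.

    #define likely(exp)   __builtin_expect((exp) != 0, 1)   /* expect the test to succeed */
    #define unlikely(exp) __builtin_expect((exp) != 0, 0)   /* expect the test to fail    */

    int process(int err)
    {
        /* The if still tests err, but the compiler lays out the code (and any
           branch hints) assuming the error path is rarely taken. */
        if (unlikely(err))
            return -1;   /* cold path */
        return 0;        /* hot path  */
    }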