The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: So, what I'll talk about here is how to actually understand the performance of your application, and what are some of the things you can do to improve that performance. You're going to hear more about automated optimizations — compiler optimizations — on Monday; there will be two talks on that. You'll get some Cell-specific optimizations, some Cell-specific tricks, on Tuesday in the recitation. So this is meant to be a more general-purpose talk on how you can debug performance anomalies and performance problems, and then what are some ways you can actually improve the performance — where do you look after you've done your parallelization.

So just to review the key concepts in parallelism. Coverage: how much parallelism do you have in your application? All of you know this, because you all had perfect scores on the last quiz. So if you take a look at your program, you find the parallel parts, and that tells you how much parallelism you have. If you don't have more than a certain fraction, there's really nothing else you can do with parallelism. So the rest of the talk will help you address the question of where you go for that last frontier.

Granularity: we talked about how the granularity of your work — how much work you're doing on each processor — affects your load balancing, and how it affects your communication costs. If you have a lot of things colocated on a single processor, then you don't have to do a whole lot of communication across processors; but if you distribute things at a finer level, then you're doing a whole lot of communication. So we'll look at the communication costs again and some tricks that you can apply to optimize that.
Then the last thing that we had talked about in one of the previous lectures is locality — locality of communication versus computation — and both of those are critical. So we'll have some examples of that.

So just to review the communication cost model: I had flashed up on the screen a while ago an equation that captures all the factors that go into figuring out how expensive it is to actually send data from one processor to the other. This could even apply on a single machine where a processor is talking to the memory — you know, loads and stores. The same cost model really applies there. If you look at how a uniprocessor tries to improve communication, and some of the things we mentioned really early on in the course for improving communication costs, what we focused on is overlap. There are things you can do — for example, sending fewer messages, optimizing how you're packing data into your messages, reducing the latency of the network, using architectural support, increasing the bandwidth, and so on. But really the biggest impact you can get is from overlap, because you have direct control over that, especially in parallel programming.

So let's look at a small review — what did it mean to overlap? We had some synchronization point, or some point in the execution, and then we get data. Then once the data has arrived, we compute on that data. This could be a uniprocessor: a CPU issues a load, it goes out to memory, memory sends back the data, and then the CPU can continue operating. But uniprocessors can pipeline — they allow you to have multiple loads going out to memory. So you can get the effect of hiding, or overlapping, a lot of that communication latency. But there are limits to the pipelining effects. If the work that you're doing is really comparable to the amount of data that you're fetching, then you have really good overlap.
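As a concrete picture of what that overlap looks like in code, here is a minimal double-buffering sketch in C. It is only an illustration under stated assumptions: the "asynchronous" transfer is simulated with a plain copy, standing in for whatever non-blocking mechanism the platform really provides (a DMA get on Cell, a prefetch, an asynchronous receive); the structure — start fetching block i+1 before computing on block i — is the point.

```c
#include <string.h>
#include <stddef.h>

#define BLOCK   1024
#define NBLOCKS 64

static float source[NBLOCKS][BLOCK];       /* stands in for remote or main memory */

/* In this sketch the "asynchronous" transfer is a synchronous copy; on real hardware
   it would be a non-blocking DMA get or prefetch, which is what creates the overlap. */
static void fetch_async(float *dst, size_t block) {
    memcpy(dst, source[block], sizeof source[block]);
}
static void wait_fetch(float *dst) {
    (void)dst;                             /* real code would block on the DMA tag here */
}

static float compute(const float *buf) {   /* placeholder for the real work on a block */
    float s = 0.0f;
    for (size_t i = 0; i < BLOCK; i++) s += buf[i];
    return s;
}

float process_all(void) {
    static float buf[2][BLOCK];            /* two buffers: one in flight, one in use */
    float sum = 0.0f;

    fetch_async(buf[0], 0);                /* prime the pipeline with block 0 */
    for (size_t i = 0; i < NBLOCKS; i++) {
        if (i + 1 < NBLOCKS)
            fetch_async(buf[(i + 1) & 1], i + 1);  /* start bringing in the next block */
        wait_fetch(buf[i & 1]);            /* make sure block i has landed */
        sum += compute(buf[i & 1]);        /* compute on block i while i+1 is in flight */
    }
    return sum;
}
```

When the compute time per block roughly matches the transfer time per block, the transfers effectively disappear behind the computation, which is the "nicely matched" case described next.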
So we went over this in the recitation, and we showed you an example where pipelining doesn't have any performance effect, and so you might not want to do it because it doesn't give you the performance bang for the complexity you invest in it. So if things are really nicely matched you get good overlap; here you only get good overlap — sorry, these, for some reason, should be shifted over one.

So where else do you look for performance? There are two kinds of communication. There's inherent communication in your algorithm, and this is a result of how you actually partition your data and how you partitioned your computation. Then there are artifacts that come up because of the way you actually do the implementation and how you map it to the architecture. So if you have a poor distribution of data across memory, then you might unnecessarily end up fetching data that you don't need. You might also have redundant data fetches. So let's talk about that in more detail.

The way I'm going to do this is to draw from wisdom in uniprocessors. In uniprocessors, CPUs communicate with memory, and conceptually I think that's no different than multiple processors talking to multiple processors. It's really all about where the data is flowing and how the memories are structured. So, loads and stores are to the uniprocessor as what and what are to distributed memory? If you think of Cell, what would go in those two blanks? Can you get this? I heard the answer there — you do a get and a put. So, DMA get and DMA put. That's really just a load and a store; instead of loading one particular data element, you're loading a whole chunk of memory.

So, on a uniprocessor, how do you overlap communication? Well, the architecture — the memory system — is designed in a way to exploit two properties that have been observed in computation: spatial locality and temporal locality, and I'll look at each one separately. So in spatial locality, the CPU asks for the data at address 1,000.
What the memory does is send the data at address 1,000, plus a whole bunch of other data that's neighboring it — so 1,000 to 1,064. How much data you actually send — what the granularity of communication is — depends on architectural parameters. In a common architecture it's really the block size. So if you have a cache whose organization says you have a block size of 32 bytes, then that is how much you transfer from main memory to the caches. This works well when the CPU actually uses that data. If I send you 64 bytes of data and only one of them is used, then what have I done? I've wasted bandwidth. Plus, I need to store all that extra data in the cache, so I've wasted cache capacity. So that's bad and you want to avoid it.

Temporal locality is a clustering of references in time. So if you access some particular data element, what the memory assumes is that you're going to reuse that data over and over and over again, so it stores it in the cache. So your memory hierarchy has the main memory at the top level — that's your slowest memory but the biggest capacity. Then as you get closer and closer to the processor, you end up with smaller caches — local, smaller storage, but faster. So if you reuse a data element, it gets cached at the lowest level, and the assumption there is that you're going to reuse it over and over again. If you do that, then what you've done is amortize the cost of bringing in that data over many, many references. So that works out really well. But if you don't reuse that particular data element over and over again, then you've wasted cache capacity. You still need to fetch the data because the CPU asked for it, but had it not been cached, there would have been more space in your cache for something else that might have been more useful.

So in the multiprocessor case, how do you reduce these artifactual costs in communication?
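Before answering that, a small sketch makes the uniprocessor spatial-locality idea concrete (the array size is an arbitrary assumption; C for illustration). Both loops touch the same data, but the row-major traversal walks through every element of each cache line it brings in, while the column-major traversal uses one element per line and wastes the rest of the fetch.

```c
#define N 1024
static float a[N][N];        /* C stores this row-major: a[i][0..N-1] are contiguous */

/* Good spatial locality: consecutive iterations touch consecutive addresses,
   so every byte of each cache line brought in gets used. */
float sum_rowmajor(void) {
    float s = 0.0f;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Poor spatial locality: consecutive iterations stride by N floats, so each
   fetched cache line is used for a single element and may be evicted before reuse. */
float sum_colmajor(void) {
    float s = 0.0f;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}
```

Temporal locality is the analogous idea in time: restructuring the computation — blocking the loops, for example — so that all uses of a given element happen close together, before its cache line is evicted.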
So, with DMA gets and puts on Cell, or, in message passing, with exchanging messages, typically you're communicating over coarse or large blocks of data. What you're usually getting is a contiguous chunk of memory, although you could do some things in software or in hardware to gather data from different memory locations and pack them into contiguous locations. The reason you pack them into contiguous locations, again, is to exploit spatial locality when you store the data locally. So to exploit the spatial locality characteristics, what you want to make sure is that you actually are going to have good spatial locality in your actual computation. You want things that are iterating over loops with well-defined indices — indices that go over very short ranges, or are very sequential, or have fixed stride patterns — where you're not wasting a lot of the data that you have brought in. Otherwise, you essentially just increase your communication, because every fetch is getting you only a small fraction of what you actually need. So, intuitively this should make sense.

Temporal locality just says: I brought in some data, so I want to maximize its utility. So if I have a computation in a parallel system, I might be able to reorder my tasks in a way that I have explicit control over the scheduling — which task executes when. Then you want to make sure that all the computation that needs that particular data happens adjacent in time, or in some short time window, so that you can amortize the cost. Are those two concepts clear? Any questions on this?

So, you've done all of that. You've parallelized your code, you've taken care of your communication costs, you've tried to reduce them as much as possible. Where else can you look for performance when things just don't look like they're performing as well as they could? The last frontier is perhaps single-thread performance, so I'm going to talk about that. So what really is a single thread?
If you think of what you're doing with parallel programming, you're taking a bunch of tasks — this is the work that you have to do — and you group them together into threads, or the equivalent of threads, and each thread will run on an individual core. So essentially you have one thread running on a core, and if that thread goes fast, then your overall execution can also benefit from that. So, that's single-thread performance.

So if you look at a timeline, here you have sequential code going on, then we hit some parallel part of the computation. We have multiple executions going on; each one of these is a thread of execution. Really, my finish line depends on the longest thread — whoever is the slowest one to complete — and that's going to essentially control my speedup. So I can improve this by doing better load balancing. If I distribute the work [? so that ?] everybody's doing an equivalent amount of work, then I can shift that finish line earlier in time. That can work reasonably well. So we talked about load balancing before. We can also make execution on each processor faster. If each one of these threads finishes faster — or I've done the load balancing and now I can squeeze out even more performance by shrinking each one of those lines — then I can get a performance improvement there as well. So that's improving single-thread performance.

But how do we actually understand what's going on? How do I know where to optimize? How do I know how long each thread is taking? How do I know how long my program is taking? Where are the problems? So, there are performance monitoring tools that are designed to help you do that. So what's the most coarse-grained way of figuring out how long your program took? You have some sample piece of code shown over here; you might compile it, and then you might just use time — the standard Unix command — to say run this program and tell me how much time it took to run.
So you get some output back from time that says you took about two seconds of user time — that's your actual code — you took some small amount of time in system code, and this is your overall execution, this is how much of the processor you actually used. So, 95% utilization. Then you might apply some optimization. So here we'll use the compiler: we'll change the optimization level, compile the same code, run it, and we'll see — wow, performance improved. We increased to 99% utilization, and my running time went down by a small chunk. But did we really learn anything about what's going on here? There's some code going on, there's a loop here, there's a loop here, there are some functions with more loops. So where is the actual computation time going? How would I actually go about understanding this? What are some tricks you might have used in trying to figure out how long something took in your computation?

AUDIENCE: [INAUDIBLE PHRASE].

PROFESSOR: Right. So you might have a timer: you record the time here, you compute, and then you stop the timer, and then you might print out or record how long that particular block of code took. Then you might have a histogram of those, and you might analyze the histogram to find out the distribution. You might repeat this over and over again for many different loops or many different parts of your code. If you have a preconceived notion of where the problem is, then you instrument that and see if your hypothesis is correct. That can help you identify the problems. But increasingly you can actually get more accurate measurements. In the previous routine, using time, you were looking at how much time has elapsed in seconds, or in fairly coarse increments. But today you can actually use hardware counters to measure clock cycles — clock ticks. That might be more useful. Actually, it is more useful, because you can measure a lot more events than just clock ticks.
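As a concrete version of the start-timer/stop-timer instrumentation just described, here is a minimal sketch using the POSIX clock_gettime call; the work() function is just a stand-in for whatever block of code you are measuring.

```c
#include <stdio.h>
#include <time.h>

static volatile double sink;           /* keeps the compiler from optimizing work() away */

static void work(void) {               /* placeholder for the block being measured */
    double s = 0.0;
    for (int i = 0; i < 10000000; i++)
        s += i * 0.5;
    sink = s;
}

int main(void) {
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);   /* start the timer */
    work();                                /* the region we care about */
    clock_gettime(CLOCK_MONOTONIC, &t1);   /* stop the timer */

    double elapsed = (t1.tv_sec - t0.tv_sec)
                   + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("work() took %.6f seconds\n", elapsed);
    return 0;
}
```

Recording many such measurements and histogramming them is exactly the manual approach described above; the hardware-counter libraries discussed next do the same start/stop dance but count cycles or other events instead of wall-clock time.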
The counters in modern architectures are really specialized registers that count up events, and you can go in there and probe and ask what the value in this register is, and you can use that as part of your performance tuning. You use them much in the same way as you would start or stop a regular timer. There are specialized libraries for this. Unfortunately, these are very architecture-specific at this point. There's not really a common standard that says grab a timer on each different architecture in a uniform way, although that's getting better with some standards coming out — I'll talk about that in just a few slides.

You can use this, for example, to measure your communication versus computation cost. So you can wrap your DMA get and DMA put with timer calls, and you can wrap your actual work with timer calls, and figure out how much overlap you can get from overlapping communication and computation, and whether it is really worthwhile to do the pipelining. But this really requires manual changes to the code. You have to go in there and start the timers. You have to have maybe an idea of where the problem is, and you have the Heisenberg effect: if you have a loop and you want to measure code within the loop, because you have a nested loop inside of it, then now you're affecting the performance of the outer loop. That can be problematic, because you can't really make an accurate measurement of the thing you're inspecting.

So there's a slightly better approach: dynamic profiling. With dynamic profiling there's event-based profiling and time-based profiling; conceptually they do the same thing. What's going on here is your program is running and you say, I'm interested in events such as cache misses. Whenever n cache misses happen — let's say 1,000 — let me know. So you get an interrupt whenever 1,000 cache misses happen. Then you can update a counter, or use that to trigger some optimizations or analysis.
This works really nicely because you don't have to touch your code. You essentially run your program as you normally do, with just one modification: you run the dynamic profiler alongside your actual computation. It works across multiple languages, because all it does is take your binary, so you can program in any language, any programming model. It's also quite efficient to use these dynamic profiling tools: you can make the sampling frequencies reasonably small and still have it be efficient.

So some counter examples. Clock cycles — you can measure clock ticks. Pipeline stalls — this might be interesting if you want to optimize your instruction schedule; you'll actually see this in the recitation next week. Cache hits and cache misses — you can get an idea of how bad your cache performance is and how much time you're spending in the memory system. Number of instructions, loads, stores, floating point ops, and so on. Then you can derive some useful measures from those. So I can get an idea of processor utilization: divide cycles by time and that gives me utilization. I can derive some other things, and maybe some of the more interesting things, like memory traffic. How much data am I actually sending between the CPU and memory, or how much data am I communicating from one processor to the other? I can just grab the counters for the number of loads and the number of stores, figure out what the cache line size is — usually those are documented, or there are calibration tools you can run to get that value — and from that figure out the memory traffic. Another one would be bandwidth consumed. Bandwidth is memory traffic per second. So how would you measure that? It's just the traffic divided by the wall-clock time. There are some others that you can calculate. So these can be really useful in helping you figure out where the things are that you should go focus in on. I'm going to show you some examples.
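Before those examples, here is a small sketch of the arithmetic behind those derived measures. The raw counter values are assumed to come from whatever counter library or profiler you are using (the struct and field names here are made up for illustration); the derivations themselves are just the ratios described above.

```c
#include <stdio.h>

/* Hypothetical raw counter values, as a counter library or profiler might report them. */
struct counters {
    double cycles;          /* total clock cycles        */
    double instructions;    /* instructions retired      */
    double loads, stores;   /* memory operations counted */
    double seconds;         /* wall-clock time           */
};

void report(struct counters c, double clock_hz, double line_bytes) {
    /* Utilization: fraction of the wall-clock time the processor was actually busy. */
    double utilization = c.cycles / (clock_hz * c.seconds);

    /* Instructions per cycle: a rough measure of how well the pipeline is being fed. */
    double ipc = c.instructions / c.cycles;

    /* Memory traffic, estimated as in the lecture from load/store counts and the cache
       line size (counting misses instead would give a tighter bound); bandwidth is
       simply that traffic divided by wall-clock time. */
    double traffic_bytes = (c.loads + c.stores) * line_bytes;
    double bandwidth = traffic_bytes / c.seconds;

    printf("utilization %.2f, IPC %.2f, traffic %.1f MB, bandwidth %.1f MB/s\n",
           utilization, ipc, traffic_bytes / 1e6, bandwidth / 1e6);
}
```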
The way these tools work is: you have your application source code, and you compile it down to a binary. You take your binary and you run it, and that can generate a profile that gets stored locally on your disk. Then you can take that profile and analyze it with some sort of interpreter — in some cases you can actually analyze the binary as well — and re-annotate your source code. That can be very useful because it'll tell you that this particular line of your code is the one where you're spending most of your time computing.

So some tools — have any of you used these tools? Anybody use gprof, for example? Good. So you might have an idea of how these could be used. There are others. HPCToolkit, which I commonly use, from Rice. PAPI is very common because it has a very nice interface for grabbing all kinds of counters. VTune from Intel. And there are others that work in different ways: there are binary instrumenters that do the same things, do them slightly more efficiently, and actually give you the ability to recompile your code at run time and optimize it, taking advantage of the profiling information you've collected.

So here's a sample of running gprof. Gprof should be available on any Linux system; it's even available on Cygwin, if you use Cygwin. I've compiled some code — this is the MPEG-2 decode code, a reference implementation — and I specify some parameters to run it. Here I add this -r flag, which says use a particular kind of inverse DCT, one that is floating-point precise and uses double precision for the floating point computations in the inverse DCT. So you can see where most of the time is being spent in the computation. Here's the time per function, so each row represents a function: this is the percent of the time, this is the actual time in seconds, how many times this function was actually called, and some other useful things.
The second function here is the MPEG intra-block decoding, where you're doing some spatial decomposition, restoring the spatial pictures, at about 5%. So if you were to optimize this particular code, where would you go look? You would look in the reference DCT. MPEG has two versions of the inverse DCT: one that uses floating point, and another that just uses some numerical tricks to operate over integers, with a loss of precision, but they find that acceptable as part of this application. So if you omit the -r flag, it actually uses a different function for doing the inverse DCT. Now you see that the distribution of where the time is spent in your computation changes. Now there's a new function that's become the bottleneck, and it's called Form Component Prediction. Then the IDCT column function, which is really the main replacement for the previous code, is now about a third of the actual computation.

So this can be useful: you can gprof your application, figure out where the bottlenecks are in terms of performance, and you might go in there and tweak the algorithm completely. You might go in there and look at some problems that might be implementation bugs or performance bugs and be able to fix those. Any questions on that?

You can do more accurate things. Gprof largely uses one mechanism; HPCToolkit uses the performance counters to give you finer-grained measurements if you want them. With HPCToolkit, you run your program in the same way. You have the MPEG-2 decode code; the double dash just says that the parameters to the decoder follow; and you can add some parameters that say these are the counters I'm interested in measuring. The first one is total cycles. The second one is L1 — primary cache — load misses. Then you might want to count the floating point instructions and the total instructions. As you run your program you get a profiling output, and then you can process that file and it'll spit out some summaries for you.
So it'll tell you: this is the total number of cycles — 698 samples at this sampling frequency. If you multiply the two together, you get an idea of how many cycles your computation took. How many load misses? 27 samples at this frequency. Remember, what's going on here is that the counter is counting events, and when the count reaches a particular threshold it lets you know. Here the sampling threshold is 32,000, so whenever 32,000 floating point instructions occur you get a sample. So you're just counting how many interrupts you're getting, or how many samples; multiply the two together and you get the final counts.

It can do things like gprof does: it'll tell you where your time is and where you spent most of your time. It actually breaks it down by module — so MPEG calls some standard libraries, libc. I can break it down by function, break it down by line number. You can even annotate your source code. So here's just the simple example that I used earlier, and each one of these columns represents one of the metrics that we measured, and you can see most of my time is spent here — 36% at this particular statement. So that can be very useful. You can go in there and say, I want to do some [? dization ?], I can maybe reduce this overhead in some way to get better performance. Any questions on that? Yup.

AUDIENCE: [INAUDIBLE PHRASE]?

PROFESSOR: I don't know. Unfortunately, I don't know the answer to that. There are some nice GUIs for some of these tools. VTune has a nice interface. I use HPCViewer — I use HPCToolkit, which provides HPCViewer, so I just grabbed this screenshot from one of the tutorials. You have your source code, and it shows you some of the same information I had on a previous slide, but in a nicer graphical format.

So, now I have all this information — how do I actually improve the performance?
Well, if you look at what the execution time is on a uniprocessor, it's the time spent computing plus the time spent waiting for data or waiting for some other things to complete. You have instruction-level parallelism, which is really critical for uniprocessors, and architects have spent massive amounts of effort providing multiple functional units, deeply pipelining the instruction pipeline, and doing things like speculation and prediction to keep that instruction-level parallelism number high so you can get really good performance. You can do things like looking at the assembly code and reordering instructions to avoid instruction hazards in the pipeline. You might look at register allocation. But that's really not low-hanging fruit — you have to reach really high to grab that kind of fruit. You'll actually, unfortunately, get that experience as part of the next recitation, so apologies in advance. But you'll see that — well, I'm not going to talk about that. Instead I'm going to focus on some things that are perhaps lower-hanging fruit.

So, data-level parallelism. We've used SIMD in some of the recitations, and I'm giving you a short example of that. Here, I'm going to talk about how you actually get data-level parallelism — how you actually find the SIMD in your computation so you can get that added advantage. Some nice things about data-level parallelism in the form of short vector instructions: the hardware really becomes simpler. You issue one instruction and that same instruction operates over multiple data elements, and you get better instruction bandwidth — I just have to fetch one instruction, and if my vector length is 10, then that effectively does 10 instructions for me. The architecture can get simpler; it reduces the complexity. So it has some nice advantages.

The other thing to go after is the memory hierarchy. This is because of that speed gap that we showed earlier in the course between memory speed and processor speed: the instruction-level tweaks usually buy you something like 1% of performance, whereas the cache hierarchy can give you a significant performance improvement in your overall execution. So you want to go after that, because that's the biggest beast in the room.

A brief overview of SIMD, and then some detailed examples of how you actually go about extracting short vector instructions.
So, here we have an example of scalar code. We're iterating in a loop from zero to n, and we're just adding array elements of a and b and storing the results in c. In the scalar mode, we just have one add: each value of a and b is in one register, we add those together, and we write the value to a separate register. In the vector mode, we can pack multiple data elements — here let's assume our vector length is four — so I can pack four of these data values into one vector register, pack four of those data elements into another vector register, and now my single vector instruction has the effect of doing four adds at the same time, and it can store the results into four elements of c. Any questions on that?

AUDIENCE: [UNINTELLIGIBLE]

PROFESSOR: No. We'll get to that.

So, let's look at this at a slightly lower level to give you a better feel for it. Same code — I've just shown the data dependence graph. I've omitted things like the increment of the loop and the branch, just focusing on the main computation. So I have two loads: one brings in a sub i, the other brings in b sub i. I do the add and I get c sub i, and then I can store that. So that might be the generic op-code sequence you have. If you're scheduling that, then in the first slot I can do those two loads in parallel, in the second cycle I can do the add, and in the third cycle I can do the store. I could further improve this performance — if you took 6.035 you might have seen software pipelining; you can actually overlap some of these operations. Not really that important here.
So, what would the cycle-by-cycle schedule look like if this were vectorized? In the scalar case, you have n iterations, right? Each iteration takes three cycles, so that's your overall execution time: n times 3 cycles. In the vector case, each load is bringing in four data elements — a sub i through a sub i plus 3, and similarly for b. Then you add those together. So the schedule looks essentially the same; the op codes are different. And here, what would your overall execution time be? Well, each iteration is now doing four additions for me. If you notice, the loop bounds have changed: instead of going from i to n in increments of 1, now I'm going in increments of 4. So, overall, instead of having n iterations, I can get by with n over 4 iterations. Does that make sense? So, what would my speedup be in this case? Four. So you can get more and more speedup if your vector length is longer, because then I can cut down further on the number of iterations that I need.

Depending on the length of my vector register and the data types that I have, that effectively gives me different vector lengths for different data types. You saw that on Cell you have 128-bit registers, and you can pack those with characters or bytes, shorts, integers, floats, or doubles. So each one of those gives you a different vector length.

SIMD extensions are now increasingly popular; they're available on a lot of ISAs. AltiVec, MMX, and SSE are available on a lot of PowerPC and x86 machines. And, of course, on Cell — in fact, on the SPU, all your instructions are SIMD instructions, and when you're doing a scalar instruction, you're actually using just one chunk of your vector register and your vector pipeline.

So how do you actually use these SIMD instructions? Unfortunately, it's library calls, or using inline assembly, or using intrinsics.
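As a flavor of what the intrinsics style looks like, here is a minimal sketch of the a[i] + b[i] loop from the slides written with x86 SSE intrinsics. This is an illustration only — on the Cell SPU you would use the spu_* intrinsics and its 128-bit vector types instead, but the shape of the code is the same — and it assumes n is a multiple of four.

```c
#include <xmmintrin.h>   /* x86 SSE intrinsics: __m128 holds four packed floats */

/* Scalar version: n iterations, one floating-point add per iteration. */
void add_scalar(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* SIMD version: n/4 iterations; each vector add does four adds at once.
   Assumes n is a multiple of 4 -- a real version would add a cleanup loop. */
void add_simd(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);            /* vector load of a[i..i+3] */
        __m128 vb = _mm_loadu_ps(&b[i]);            /* vector load of b[i..i+3] */
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));   /* one instruction, four adds */
    }
}
```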
You'll get hands-on experience with this with Cell, so you might complain about that when you actually do it. Compiler technology is actually getting better, and you'll see that one of the reasons we're using the XLC compiler is because it has these vector data types — which the latest versions of GCC also have — that allow you to express data as vector data types, and the compiler can more easily, or more naturally, get the SIMD parallelism for you without you having to go in there and do it by hand. But if you were to do it by hand — or, in fact, what the compilers are trying to automate — there are different techniques for looking for where the SIMD parallelism is.

There was some work done here about six years ago by Sam Larsen, who has now graduated, on superword-level parallelism. I'm going to focus the rest of this talk on this concept of SIMDization because I think it's probably the one that's most useful for extracting parallelism in some of the codes that you're doing. This is really ideal for SIMD where you have really short vector lengths, 2 to 8. What you're looking for is SIMDization that exists within a basic block — within a code block, within the body of a loop, or even across some control flow. You can uncover this with simple analysis, and this really has pushed the boundary on what automatic compilers can do. Some of the work that's gone on at IBM — what they call the Octopiler, which has eventually been transferred into the XLC compiler — uses a lot of techniques that build on SLP and extend it in various ways to broaden the scope of what you can automatically parallelize.

So here's an example of how you might actually find opportunities for SIMDization. You have some code — let's say you're doing RGB computations where you're just adding to the r, g, and b elements, that's red, green, and blue. So this might be in a loop, and what you might notice is, well, I can pack the RGB elements into one register, I can pack these into another register, and I can pack these literals into a third register.
So that gives me a way to pack data together into SIMD registers, and now I can replace this scalar code with instructions that pack the vector register, do the computations in parallel, and then unpack the results. We'll talk about that with a little bit more illustration in a second. Any questions on this?

Perhaps the biggest improvement that you can get from SIMDization is by looking at adjacent memory references. Rather than doing one load at a time, you can do a vector load, which really gives you bigger bandwidth to memory. So in this case, I have loads from I1 and I2, and since these memory locations are contiguous, I can replace them with one vector load that brings in all these data elements in one shot. That essentially eliminates three load instructions — which are potentially the most heavyweight — in exchange for one lighter-weight instruction, because it amortizes bandwidth and exploits things like spatial locality.

Another one: vectorizable loops. This is probably one of the most advanced ways of exploiting SIMDization, especially with really long vector lengths, so traditional supercomputers like the Cray, and you'll probably hear Saman talk about this in the next lecture. So I have some loop and I have this particular statement here. How can I get SIMD code out of this? Anybody have any ideas? Anybody know about loop unrolling? So if I unroll this loop — it's essentially that same trick that I had shown earlier, although I didn't quite do it this way — I change the loop increment: rather than stepping through one at a time, I step through four at a time. Now the loop body, rather than doing one addition at a time, is doing four additions at a time. So now this is very natural for vectorization, right? A vector load, a vector load, a vector store, plus the vector add in the middle. Is that intuitive?
746 00:35:41,320 --> 00:35:43,010 Looking at traditional loops, seeing whether you can 747 00:35:43,010 --> 00:35:46,510 actually unroll it in different ways, be able to get 748 00:35:46,510 --> 00:35:48,920 that SIMD parallelization. 749 00:35:48,920 --> 00:35:51,210 The last one I'll talk about is partial 750 00:35:51,210 --> 00:35:53,400 vectorization. 751 00:35:53,400 --> 00:35:57,390 It might be something where you have a mix of 752 00:35:57,390 --> 00:35:58,130 statements. 753 00:35:58,130 --> 00:36:01,890 So here I have a loop where I have some load and then I'm 754 00:36:01,890 --> 00:36:03,210 doing some computation here. 755 00:36:03,210 --> 00:36:04,460 So what could I do here? 756 00:36:09,780 --> 00:36:11,260 It's not as symmetric as the other loop. 757 00:36:11,260 --> 00:36:18,200 AUDIENCE: There's no vector and [INAUDIBLE PHRASE]. 758 00:36:18,200 --> 00:36:18,530 PROFESSOR: Right. 759 00:36:18,530 --> 00:36:19,600 So you might omit that. 760 00:36:19,600 --> 00:36:22,580 But could you do anything about the subtraction? 761 00:36:22,580 --> 00:36:24,870 AUDIENCE: [INAUDIBLE PHRASE]. 762 00:36:24,870 --> 00:36:27,990 PROFESSOR: If I can unroll this again, right? 763 00:36:27,990 --> 00:36:30,680 Now there are no dependencies between this instruction and 764 00:36:30,680 --> 00:36:34,210 this instruction, so I can really move these together, 765 00:36:34,210 --> 00:36:36,580 and once I've moved these together then these loads 766 00:36:36,580 --> 00:36:38,170 become contiguous. 767 00:36:38,170 --> 00:36:40,790 These loads are contiguous so I can replace these by vector 768 00:36:40,790 --> 00:36:44,640 codes, vector equivalents. 769 00:36:44,640 --> 00:36:48,210 So now the vector load brings in L0, L1, I have the 770 00:36:48,210 --> 00:36:52,800 addition, that brings in those two elements in a vector, and 771 00:36:52,800 --> 00:36:55,960 then I can do my scalar additions. 772 00:36:55,960 --> 00:36:59,300 But what do I do about the value getting out of this 773 00:36:59,300 --> 00:37:01,890 vector register into this scalar register that I need 774 00:37:01,890 --> 00:37:03,400 for the absolute values? 775 00:37:03,400 --> 00:37:06,740 So this is where the benefits versus cost of 776 00:37:06,740 --> 00:37:08,210 SIMDization come in. 777 00:37:08,210 --> 00:37:11,330 So the benefits are great because you can replace 778 00:37:11,330 --> 00:37:15,460 multiple instructions by one instruction, or you can just 779 00:37:15,460 --> 00:37:17,250 cut down the number of instructions by a specific 780 00:37:17,250 --> 00:37:19,920 factor, your vector length. 781 00:37:19,920 --> 00:37:23,400 Loads and stores can be replaced by one wide memory operation, and 782 00:37:23,400 --> 00:37:26,210 this is probably the biggest opportunity for performance 783 00:37:26,210 --> 00:37:27,800 improvements. 784 00:37:27,800 --> 00:37:30,300 But the cost is that you have to pack data into the data 785 00:37:30,300 --> 00:37:32,700 registers and you have to unpack it out so that you can 786 00:37:32,700 --> 00:37:37,460 have those kinds of communications between this 787 00:37:37,460 --> 00:37:40,430 vector register here and the value here, this value here 788 00:37:40,430 --> 00:37:41,400 and this value here. 789 00:37:41,400 --> 00:37:45,210 Often you can't simply access vector values without doing 790 00:37:45,210 --> 00:37:46,460 this packing and unpacking. 791 00:37:52,210 --> 00:37:55,450 So how do you actually do the packing, unpacking?
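[Here is a rough sketch of the partial vectorization case just described: the adds get vectorized, but the absolute-value consumers stay scalar, so lanes have to be unpacked. The loop shape, the names, and the use of fabsf() as the stand-in for the scalar part are assumptions for illustration.]

```c
#include <math.h>

typedef float v4sf __attribute__((vector_size(16)));

/* Assumes n % 4 == 0 and 16-byte aligned, non-aliasing arrays. */
void partial(float *out, const float *in, int n) {
    const v4sf one = {1.0f, 1.0f, 1.0f, 1.0f};
    for (int i = 0; i < n; i += 4) {
        /* Vectorizable part: contiguous loads become one vector load,
         * four scalar adds become one vector add. */
        v4sf v = *(const v4sf *)&in[i];
        v = v + one;
        /* Scalar part: each lane is extracted (unpacked) to feed the
         * scalar fabsf() -- this extraction is the cost side of
         * SIMDization. */
        for (int k = 0; k < 4; k++)
            out[i + k] = fabsf(v[k]);
    }
}
```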
792 00:37:55,450 --> 00:37:57,880 This is predominantly where a lot of the complexity goes. 793 00:38:03,910 --> 00:38:06,493 So the value of a here is initialized by some function 794 00:38:06,493 --> 00:38:09,140 and the value of b here is initialized by some function, 795 00:38:09,140 --> 00:38:11,200 and these might not be things that I can 796 00:38:11,200 --> 00:38:12,740 SIMDize very easily. 797 00:38:12,740 --> 00:38:16,480 So what I need to do is move that value into the first 798 00:38:16,480 --> 00:38:19,180 element of the vector register, and move the second value into 799 00:38:19,180 --> 00:38:20,800 the second element of the vector register. 800 00:38:20,800 --> 00:38:23,470 So if I have a four-way vector register, then I have to do 801 00:38:23,470 --> 00:38:27,790 four of these moves, and that essentially is the packing. 802 00:38:27,790 --> 00:38:30,590 Then I could do my vector computation, which is really 803 00:38:30,590 --> 00:38:32,940 these two statements here. 804 00:38:32,940 --> 00:38:35,750 Then eventually I have to do my unpacking because I have to 805 00:38:35,750 --> 00:38:39,960 get the values out to do this operation and this operation. 806 00:38:39,960 --> 00:38:42,180 So there's an extraction that has to happen 807 00:38:42,180 --> 00:38:43,430 out of my SIMD register. 808 00:38:46,310 --> 00:38:48,980 But you can amortize the cost of the packing and unpacking 809 00:38:48,980 --> 00:38:50,560 by just reusing your vector registers. 810 00:38:50,560 --> 00:38:54,490 So these are like register allocation techniques. 811 00:38:54,490 --> 00:38:56,960 So if I pack things into a vector register, I find all 812 00:38:56,960 --> 00:38:59,890 cases where I can actually reuse that vector register and 813 00:38:59,890 --> 00:39:04,490 I try to find opportunities for extra SIMDization. 814 00:39:04,490 --> 00:39:08,120 So in the other case then, once I pack one I can reuse that 815 00:39:08,120 --> 00:39:09,320 same vector register. 816 00:39:09,320 --> 00:39:13,690 So what are some ways I can look for to amortize the cost? 817 00:39:16,600 --> 00:39:18,700 The interesting thing about memory operations is while 818 00:39:18,700 --> 00:39:21,950 there are many different ways you can pack scalar values 819 00:39:21,950 --> 00:39:24,700 into a vector register, there's really only one way 820 00:39:24,700 --> 00:39:28,380 you can pack loads coming in from memory into a vector 821 00:39:28,380 --> 00:39:31,290 register, because you want the loads to be sequential, 822 00:39:31,290 --> 00:39:33,340 you want to exploit the spatial locality. 823 00:39:33,340 --> 00:39:36,130 So one vector load really gives you a specific ordering. 824 00:39:36,130 --> 00:39:40,140 So, that really constrains you in various ways. 825 00:39:40,140 --> 00:39:42,180 So you might bend over backwards in some cases to 826 00:39:42,180 --> 00:39:46,350 actually get your code to be able to reuse the 827 00:39:46,350 --> 00:39:49,090 wide-word load without having to do too much packing or 828 00:39:49,090 --> 00:39:50,870 unpacking because that'll start 829 00:39:50,870 --> 00:39:53,520 eating into your benefits. 830 00:39:53,520 --> 00:40:00,900 So here's a simple example of how you might find the SLP 831 00:40:00,900 --> 00:40:02,740 parallelism. 832 00:40:02,740 --> 00:40:05,850 So the first thing you want to do is start with the 833 00:40:05,850 --> 00:40:08,740 instructions that give you the most benefit, so that's memory 834 00:40:08,740 --> 00:40:09,860 references.
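[A minimal sketch of the packing and unpacking just described, and of amortizing the packing cost by reusing the packed register. f() and g() stand in for the non-SIMDizable scalar initializers; every name and constant here is hypothetical.]

```c
typedef float v4sf __attribute__((vector_size(16)));

extern float f(int i);   /* stand-ins for scalar code that can't */
extern float g(int i);   /* easily be SIMDized                   */

void pack_reuse_unpack(float *outx, float *outy) {
    /* Packing: one scalar move per lane -- this is the overhead. */
    v4sf v = {0.0f, 0.0f, 0.0f, 0.0f};
    v[0] = f(0);
    v[1] = g(0);

    /* Amortize that cost by reusing the packed register in more
     * than one vector operation. */
    const v4sf c1 = {2.0f, 2.0f, 0.0f, 0.0f};
    const v4sf c2 = {0.5f, 0.5f, 0.0f, 0.0f};
    v4sf t1 = v * c1;
    v4sf t2 = v * c2;    /* second use of the same packed register */

    /* Unpacking: extract the lanes for the scalar consumers. */
    outx[0] = t1[0];  outx[1] = t1[1];
    outy[0] = t2[0];  outy[1] = t2[1];
}
```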
835 00:40:09,860 --> 00:40:11,840 So here there are two memory references. 836 00:40:11,840 --> 00:40:15,680 They happen to be adjacent, so I'm accessing contiguous 837 00:40:15,680 --> 00:40:17,930 memory chunks, so I can parallelize that. 838 00:40:17,930 --> 00:40:20,540 That would be my first step. 839 00:40:20,540 --> 00:40:23,580 I can do a vector load and that assignment can 840 00:40:23,580 --> 00:40:25,880 become a and b. 841 00:40:25,880 --> 00:40:29,730 I can look for opportunities where I can propagate these 842 00:40:29,730 --> 00:40:34,250 vector values within the vector register that's 843 00:40:34,250 --> 00:40:35,840 holding a and b. 844 00:40:35,840 --> 00:40:38,980 So the way I do that is I look for uses of a and b. 845 00:40:38,980 --> 00:40:40,960 In this case, there are these two statements. 846 00:40:45,020 --> 00:40:46,440 So I can look for opportunities 847 00:40:46,440 --> 00:40:47,690 to vectorize that. 848 00:40:52,160 --> 00:40:54,030 So in this case, both of these instructions are also 849 00:40:54,030 --> 00:40:54,960 vectorizable. 850 00:40:54,960 --> 00:40:59,990 Now I have a vector subtraction, and I have a 851 00:40:59,990 --> 00:41:02,680 vector register holding new values h and j. 852 00:41:02,680 --> 00:41:08,200 So I follow that chain again of where data's flowing. 853 00:41:08,200 --> 00:41:12,170 I find these operations and I can vectorize that as well. 854 00:41:18,160 --> 00:41:21,860 So, I end up with a vectorizable loop where all my 855 00:41:21,860 --> 00:41:24,520 instructions, all my scalar instructions, are now SIMD 856 00:41:24,520 --> 00:41:25,820 instructions. 857 00:41:25,820 --> 00:41:30,000 I can cut down on loop iterations and the total number of 858 00:41:30,000 --> 00:41:31,220 instructions that I issue. 859 00:41:31,220 --> 00:41:33,650 But I've made some implicit assumption here. 860 00:41:33,650 --> 00:41:35,020 Anybody know what it is? 861 00:41:35,020 --> 00:41:42,538 AUDIENCE: Do you actually need that many 862 00:41:42,538 --> 00:41:44,580 iterations of the loop? 863 00:41:44,580 --> 00:41:47,680 PROFESSOR: Well, so you can factor down the cost. So here 864 00:41:47,680 --> 00:41:49,830 I've vectorized by 2, so I would cut down the number of 865 00:41:49,830 --> 00:41:50,590 iterations by 2. 866 00:41:50,590 --> 00:41:52,830 AUDIENCE: You could have an odd number of iterations? 867 00:41:52,830 --> 00:41:54,190 PROFESSOR: Right, so you could have an odd number of 868 00:41:54,190 --> 00:41:55,110 iterations. 869 00:41:55,110 --> 00:41:57,400 What do you do about the remaining iterations? 870 00:41:57,400 --> 00:41:59,760 You might have to do scalar code for that. 871 00:41:59,760 --> 00:42:02,260 What are some other assumptions? 872 00:42:02,260 --> 00:42:04,660 Maybe it will be clear here. 873 00:42:07,380 --> 00:42:10,160 So in vectorizing this, what have I assumed about 874 00:42:10,160 --> 00:42:13,040 relationships between these statements? 875 00:42:13,040 --> 00:42:15,690 I've essentially reorganized all the statements, so that 876 00:42:15,690 --> 00:42:19,520 assumes I have the liberty to move instructions around. 877 00:42:19,520 --> 00:42:19,810 Yup? 878 00:42:19,810 --> 00:42:22,801 AUDIENCE: [UNINTELLIGIBLE] a and b don't change 879 00:42:22,801 --> 00:42:23,300 [INAUDIBLE PHRASE]. 880 00:42:23,300 --> 00:42:23,790 PROFESSOR: Right. 881 00:42:23,790 --> 00:42:27,830 So there's nothing in here that's changing the values.
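[As a concrete picture of what that reordering freedom buys, here is a small sketch of the chain-following just described, for a hypothetical 2-wide basic block. The statement names mirror the flavor of the slide but are made up; it only works because nothing in the block redefines a or b between the statements being packed.]

```c
typedef float v2sf __attribute__((vector_size(8)));   /* 2 packed floats */

void slp_block(const float *m, float *out) {
    /* Scalar basic block the compiler starts from:
     *   a = m[0];      b = m[1];        adjacent loads seed the pack
     *   h = a - 5.0f;  j = b - 5.0f;    uses of (a, b) -> pack the subtracts
     *   x = h * 2.0f;  y = j * 2.0f;    uses of (h, j) -> pack the multiplies
     *   out[0] = x;    out[1] = y;      adjacent stores close the chain
     */
    const v2sf five = {5.0f, 5.0f};
    const v2sf two  = {2.0f, 2.0f};

    v2sf ab = *(const v2sf *)m;     /* one vector load replaces two loads  */
    v2sf hj = ab - five;            /* follow the def-use chain from (a,b) */
    v2sf xy = hj * two;             /* and again from (h,j)                */
    *(v2sf *)out = xy;              /* one vector store replaces two       */
}
```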
882 00:42:27,830 --> 00:42:29,930 There are no dependencies between these statements -- no 883 00:42:29,930 --> 00:42:33,910 flow dependencies and no other kind of constraints that limit 884 00:42:33,910 --> 00:42:34,930 this kind of movement. 885 00:42:34,930 --> 00:42:36,800 So in real code that's not always the case. 886 00:42:36,800 --> 00:42:40,150 You end up with patterns of computation where, in the 887 00:42:40,150 --> 00:42:43,710 really nice classic cases, you can vectorize things 888 00:42:43,710 --> 00:42:44,840 really nicely. 889 00:42:44,840 --> 00:42:48,020 In a lot of other codes you have a mix of vectorizable 890 00:42:48,020 --> 00:42:50,520 code and scalar code and there's a lot of communication 891 00:42:50,520 --> 00:42:51,380 between the two. 892 00:42:51,380 --> 00:42:54,090 So the cost is really something significant that you 893 00:42:54,090 --> 00:42:55,340 have to consider. 894 00:42:57,510 --> 00:43:00,780 This was, as I mentioned, done in somebody's Master's thesis 895 00:43:00,780 --> 00:43:04,960 and eventually led to some additional work that was his 896 00:43:04,960 --> 00:43:06,770 PhD thesis. 897 00:43:06,770 --> 00:43:09,370 So in some of the early work, what he did was he looked at a 898 00:43:09,370 --> 00:43:12,470 bunch of benchmarks and looked at how much available 899 00:43:12,470 --> 00:43:15,960 parallelism you have in terms of this kind of short vector 900 00:43:15,960 --> 00:43:20,260 parallelism, or rather SLP, where you're looking for 901 00:43:20,260 --> 00:43:23,830 vectorizable code within basic blocks, which really differs 902 00:43:23,830 --> 00:43:26,100 from the classic way people look for vectorization, 903 00:43:26,100 --> 00:43:27,950 where you have well-structured loops and 904 00:43:27,950 --> 00:43:32,120 do the kinds of transformations you'll hear about next week. 905 00:43:32,120 --> 00:43:35,240 So for different kinds of vector registers, so these are 906 00:43:35,240 --> 00:43:36,010 your vector lengths. 907 00:43:36,010 --> 00:43:41,470 So going from 128 bits to 1,024 bits, you can actually 908 00:43:41,470 --> 00:43:43,330 reduce a whole lot of instructions. 909 00:43:43,330 --> 00:43:46,380 So what I'm showing here is the percent dynamic 910 00:43:46,380 --> 00:43:47,420 instruction reduction. 911 00:43:47,420 --> 00:43:50,795 So if I take my baseline application and just compile 912 00:43:50,795 --> 00:43:53,670 it in a normal way and run it to get an instruction count. 913 00:43:53,670 --> 00:43:57,175 I apply this SLP technique that finds the SIMDization and 914 00:43:57,175 --> 00:43:59,500 then run my application again, use the performance counters 915 00:43:59,500 --> 00:44:01,110 to count the number of instructions 916 00:44:01,110 --> 00:44:03,450 and compare the two. 917 00:44:03,450 --> 00:44:07,020 I can get 60%, 50%, 40%. 918 00:44:07,020 --> 00:44:11,700 In some cases I can completely eliminate almost 90% or more 919 00:44:11,700 --> 00:44:12,660 of the instructions. 920 00:44:12,660 --> 00:44:16,410 So there's a lot of opportunity for performance improvements, 921 00:44:16,410 --> 00:44:19,410 as might be apparent. 922 00:44:19,410 --> 00:44:22,130 One, because I'm reducing the instruction bandwidth, I'm 923 00:44:22,130 --> 00:44:25,370 reducing the amount of space I need in my instruction cache; 924 00:44:25,370 --> 00:44:27,590 I have fewer instructions so I can fit more instructions into 925 00:44:27,590 --> 00:44:31,190 my instruction cache; and you reduce the number of branches.
926 00:44:31,190 --> 00:44:34,870 You get better bandwidth to the memory, better use of the 927 00:44:34,870 --> 00:44:37,080 memory bandwidth. 928 00:44:37,080 --> 00:44:41,250 Overall, you're running fewer iterations, so you're getting 929 00:44:41,250 --> 00:44:44,050 lots of potential for performance. 930 00:44:44,050 --> 00:44:46,350 So, I actually ran this on the AltiVec. 931 00:44:46,350 --> 00:44:50,710 This was one of the earliest generations of AltiVec, whose 932 00:44:50,710 --> 00:44:57,100 SIMD instructions didn't have, I believe, double precision 933 00:44:57,100 --> 00:44:59,420 floating point, so not all the benchmarks you see on the 934 00:44:59,420 --> 00:45:02,470 previous slide are here, only the ones that could run 935 00:45:02,470 --> 00:45:04,120 reasonably accurately with single 936 00:45:04,120 --> 00:45:05,590 precision floating point. 937 00:45:05,590 --> 00:45:08,240 What this measures is the actual speed up. 938 00:45:08,240 --> 00:45:10,810 Doing this SIMDization versus not doing SIMDization, how 939 00:45:10,810 --> 00:45:12,470 much performance you can get. 940 00:45:12,470 --> 00:45:17,550 The thing to take away is in some cases where you have 941 00:45:17,550 --> 00:45:20,260 nicely structured loops and some nice patterns, you can 942 00:45:20,260 --> 00:45:24,250 get up to 7x speed up on some benchmarks. 943 00:45:24,250 --> 00:45:27,020 The maximum speed up that you can get 944 00:45:27,020 --> 00:45:31,210 depends on the vector length, so 8, for example, on some 945 00:45:31,210 --> 00:45:35,170 architectures, depending on the data type. 946 00:45:35,170 --> 00:45:38,500 Are there any questions on that? 947 00:45:38,500 --> 00:45:40,530 So as part of the next recitation, you'll actually 948 00:45:40,530 --> 00:45:43,000 get an exercise of going through and SIMDizing for 949 00:45:43,000 --> 00:45:45,300 Cell, and what that actually means is to SIMDize 950 00:45:45,300 --> 00:45:47,990 instructions for Cell you might take statements and sort of 951 00:45:47,990 --> 00:45:50,320 replace them by intrinsic functions, which eventually 952 00:45:50,320 --> 00:45:53,630 map down to the actual assembly op codes that you'll need. 953 00:45:53,630 --> 00:45:55,940 So you don't actually have to program at the assembly level, 954 00:45:55,940 --> 00:46:00,080 although in effect, you're probably doing the same thing. 955 00:46:00,080 --> 00:46:03,040 Last thing we'll talk about today is optimizing for the 956 00:46:03,040 --> 00:46:05,110 memory hierarchy. 957 00:46:05,110 --> 00:46:07,950 In addition to data level parallelism, looking for 958 00:46:07,950 --> 00:46:11,250 performance enhancements in the memory system gives you 959 00:46:11,250 --> 00:46:19,250 the best opportunities because of this big gap in performance 960 00:46:19,250 --> 00:46:23,800 between memory access latencies and what the CPU 961 00:46:23,800 --> 00:46:24,840 efficiency is. 962 00:46:24,840 --> 00:46:27,380 So exploiting locality in the memory system is key. 963 00:46:27,380 --> 00:46:30,940 So these concepts of temporal and spatial locality. 964 00:46:30,940 --> 00:46:32,770 So let's look at an example. 965 00:46:32,770 --> 00:46:37,870 Let's say I have a loop and in this loop I have some code 966 00:46:37,870 --> 00:46:42,440 that's embodied in some function a, some code embodied 967 00:46:42,440 --> 00:46:45,660 in some function b, and some code in some function c.
968 00:46:45,660 --> 00:46:49,720 The values produced by a are consumed by the function b, 969 00:46:49,720 --> 00:46:51,280 and similarly the values produced by b 970 00:46:51,280 --> 00:46:53,880 are consumed by c. 971 00:46:53,880 --> 00:46:57,040 So this is a general data flow graph that you might have for 972 00:46:57,040 --> 00:46:58,590 this function. 973 00:46:58,590 --> 00:47:03,240 Let's say that all the data could go into a small array 974 00:47:03,240 --> 00:47:08,870 that then I can communicate between functions. 975 00:47:08,870 --> 00:47:12,500 So if I look at my actual cache size and how big the working 976 00:47:12,500 --> 00:47:15,400 set of each of these functions is, so let's say this is my 977 00:47:15,400 --> 00:47:18,260 cache size -- this is how many instructions I can 978 00:47:18,260 --> 00:47:20,390 pack into the cache. 979 00:47:20,390 --> 00:47:22,490 Looking at the collective number of instructions in each 980 00:47:22,490 --> 00:47:25,990 one of these functions, I overflow that. 981 00:47:25,990 --> 00:47:27,810 I have more instructions than I can fit into my 982 00:47:27,810 --> 00:47:30,130 cache at any one time. 983 00:47:30,130 --> 00:47:32,890 So what does that mean for my actual cache performance? 984 00:47:32,890 --> 00:47:38,490 So when I run a, what do I expect the cache hit and miss 985 00:47:38,490 --> 00:47:40,670 rate behavior to be like? 986 00:47:40,670 --> 00:47:45,830 So in the first iteration, I need the instructions for a. 987 00:47:45,830 --> 00:47:48,210 I've never seen a before so I have to fetch that data from 988 00:47:48,210 --> 00:47:49,840 memory and put it in the cache. 989 00:47:49,840 --> 00:47:53,320 So that's a miss. 990 00:47:53,320 --> 00:47:54,570 So what about b? 991 00:47:58,330 --> 00:47:59,310 Then c? 992 00:47:59,310 --> 00:48:00,420 Same thing. 993 00:48:00,420 --> 00:48:03,920 So now I'm back at the top of my loop. 994 00:48:03,920 --> 00:48:06,250 So if everything fit in the cache then I would 995 00:48:06,250 --> 00:48:10,230 expect a to be a what? 996 00:48:10,230 --> 00:48:11,320 It'll be a hit. 997 00:48:11,320 --> 00:48:14,900 But since I've constrained this problem such that the 998 00:48:14,900 --> 00:48:17,220 working set doesn't really fit in the cache, what that means 999 00:48:17,220 --> 00:48:19,690 is that I have to fetch some new instructions for a. 1000 00:48:19,690 --> 00:48:20,960 So let's say I have to fetch all the 1001 00:48:20,960 --> 00:48:22,460 instructions for a again. 1002 00:48:22,460 --> 00:48:25,590 That leads me to another miss. 1003 00:48:25,590 --> 00:48:29,610 Now, bringing a again into my cache kicks out some extra 1004 00:48:29,610 --> 00:48:32,100 instructions because I need to make room in a finite memory, 1005 00:48:32,100 --> 00:48:35,090 so I kick out b. 1006 00:48:35,090 --> 00:48:38,120 Bring in b and I end up kicking out c. 1007 00:48:38,120 --> 00:48:41,530 So you end up with a pattern where everything is a miss. 1008 00:48:41,530 --> 00:48:45,740 This is a problem because the way the loop is structured, 1009 00:48:45,740 --> 00:48:48,025 collectively I just can't pack all those instructions into 1010 00:48:48,025 --> 00:48:52,760 the cache, so I end up taking a lot of cache misses and 1011 00:48:52,760 --> 00:48:55,030 that's bad for performance. 1012 00:48:55,030 --> 00:48:56,690 But I can look at an alternative way 1013 00:48:56,690 --> 00:48:58,710 of doing this loop.
1014 00:48:58,710 --> 00:49:02,960 I can split up this loop into three where in one loop I do 1015 00:49:02,960 --> 00:49:07,300 all the a instructions, in the second loop I do all the b's, 1016 00:49:07,300 --> 00:49:09,260 and in the third loop I do all the c's. 1017 00:49:09,260 --> 00:49:12,830 Now my working set is really small. 1018 00:49:12,830 --> 00:49:16,020 So the instructions for a fit in the cache, instructions for 1019 00:49:16,020 --> 00:49:17,460 b fit in the cache, and instructions for 1020 00:49:17,460 --> 00:49:19,670 c fit in the cache. 1021 00:49:19,670 --> 00:49:24,330 So what do I expect for the first time I see a? 1022 00:49:24,330 --> 00:49:25,110 Miss. 1023 00:49:25,110 --> 00:49:26,360 Then the second time? 1024 00:49:31,270 --> 00:49:36,730 It'll be a hit, because I've brought in a, I haven't run b 1025 00:49:36,730 --> 00:49:40,150 or c yet, the number of instructions I need for a is 1026 00:49:40,150 --> 00:49:41,370 smaller than what I can fit into the 1027 00:49:41,370 --> 00:49:42,770 cache, so that's great. 1028 00:49:42,770 --> 00:49:44,170 Nothing gets kicked out. 1029 00:49:44,170 --> 00:49:46,200 So every one of those iterations 1030 00:49:46,200 --> 00:49:48,450 for a becomes a hit. 1031 00:49:48,450 --> 00:49:49,220 So that's good. 1032 00:49:49,220 --> 00:49:51,950 I've improved performance. 1033 00:49:51,950 --> 00:49:54,550 For b I have the same pattern. 1034 00:49:54,550 --> 00:49:56,620 First time I see b it's a miss, every time after that 1035 00:49:56,620 --> 00:49:57,480 it's a hit. 1036 00:49:57,480 --> 00:49:58,560 Similarly for c. 1037 00:49:58,560 --> 00:50:02,490 So my cache miss rate goes from being one, everything's a 1038 00:50:02,490 --> 00:50:07,050 miss, to decreasing to 1 over n, where n is essentially how 1039 00:50:07,050 --> 00:50:09,500 many times I run the loop. 1040 00:50:09,500 --> 00:50:11,964 So we call that full scaling because we've taken the loop, 1041 00:50:11,964 --> 00:50:14,440 distributed it, and we've scaled every one of 1042 00:50:14,440 --> 00:50:16,550 those smaller loops to the maximum that we could get. 1043 00:50:19,070 --> 00:50:21,230 Now what about the data? 1044 00:50:21,230 --> 00:50:22,340 So we have the same example. 1045 00:50:22,340 --> 00:50:26,820 Here we saw that the instruction working set is 1046 00:50:26,820 --> 00:50:29,420 big, but what about the data? 1047 00:50:29,420 --> 00:50:31,330 So let's say in this case I'm sending just a 1048 00:50:31,330 --> 00:50:32,270 small amount of data. 1049 00:50:32,270 --> 00:50:35,640 Then the behavior is really good. 1050 00:50:35,640 --> 00:50:38,070 It's a small amount of data that I need to communicate 1051 00:50:38,070 --> 00:50:38,690 from a to b. 1052 00:50:38,690 --> 00:50:40,310 A small amount of data you need to 1053 00:50:40,310 --> 00:50:42,040 communicate from b to c. 1054 00:50:42,040 --> 00:50:43,270 So it's great. 1055 00:50:43,270 --> 00:50:44,990 No problems with the data cache. 1056 00:50:44,990 --> 00:50:46,530 What happens in the full scaling case? 1057 00:50:46,530 --> 00:50:53,330 AUDIENCE: It's not correct to communicate from a to b. 1058 00:50:53,330 --> 00:50:54,921 PROFESSOR: What do you mean it's not correct? 1059 00:50:54,921 --> 00:50:55,576 AUDIENCE: Oh, it's not 1060 00:50:55,576 --> 00:50:56,826 communicating at the same time. 1061 00:50:58,740 --> 00:51:01,300 PROFESSOR: Yeah, it's not at the same time. 1062 00:51:01,300 --> 00:51:03,890 In fact, just assume this is sequential.
1063 00:51:03,890 --> 00:51:07,800 So I run a, I store some data, and then when I run 1064 00:51:07,800 --> 00:51:10,680 b I grab that data. 1065 00:51:10,680 --> 00:51:12,430 This is sequential. 1066 00:51:12,430 --> 00:51:16,610 AUDIENCE: How do you know that the transmission's valid then? 1067 00:51:16,610 --> 00:51:19,210 We could use some global variable. 1068 00:51:19,210 --> 00:51:21,390 PROFESSOR: Simple case. 1069 00:51:21,390 --> 00:51:22,750 There are no global variables. 1070 00:51:22,750 --> 00:51:26,210 All the data that b needs comes from a. 1071 00:51:26,210 --> 00:51:28,290 So if I run a I produce all the data and 1072 00:51:28,290 --> 00:51:29,540 that's all that b needs. 1073 00:51:32,000 --> 00:51:34,110 So in the full scaling case, what do I expect to 1074 00:51:34,110 --> 00:51:37,140 happen for the data? 1075 00:51:37,140 --> 00:51:39,730 Remember, in the full scaling case, all the working sets for 1076 00:51:39,730 --> 00:51:42,810 the instructions are small so they all fit in the cache. 1077 00:51:42,810 --> 00:51:46,260 But now I'm running a for a lot longer so I have to store 1078 00:51:46,260 --> 00:51:48,180 a lot more data for b. 1079 00:51:48,180 --> 00:51:50,970 Similarly, I'm running b for a lot longer so I have to store 1080 00:51:50,970 --> 00:51:52,690 a lot more data for c. 1081 00:51:52,690 --> 00:51:54,770 So what do I expect to happen with the working set here? 1082 00:51:58,180 --> 00:52:01,430 Instructions are still good, but the data might be bad 1083 00:52:01,430 --> 00:52:06,410 because I've run a for a lot more iterations at one shot. 1084 00:52:06,410 --> 00:52:09,730 So now I have to buffer all this data from a to b. 1085 00:52:09,730 --> 00:52:11,960 Similarly, I've run b for a long time so I have to buffer 1086 00:52:11,960 --> 00:52:13,960 a whole lot of data from b to c. 1087 00:52:13,960 --> 00:52:15,570 Is that clear? 1088 00:52:15,570 --> 00:52:15,930 AUDIENCE: No. 1089 00:52:15,930 --> 00:52:19,040 PROFESSOR: So let's say every time a runs it produces one 1090 00:52:19,040 --> 00:52:20,830 data element. 1091 00:52:20,830 --> 00:52:23,770 So now in this case, every iteration 1092 00:52:23,770 --> 00:52:24,970 produces one data element. 1093 00:52:24,970 --> 00:52:25,880 That's fine. 1094 00:52:25,880 --> 00:52:27,100 That's clear? 1095 00:52:27,100 --> 00:52:31,720 Here I run a n times, so I produce n data elements. 1096 00:52:31,720 --> 00:52:34,950 And b let's say produces one data element. 1097 00:52:34,950 --> 00:52:39,560 So if my cache can only hold let's say n over 2 data 1098 00:52:39,560 --> 00:52:42,140 elements, then there's an overflow. 1099 00:52:42,140 --> 00:52:44,770 So what that means is not everything's in the cache, and 1100 00:52:44,770 --> 00:52:46,620 that's bad because of the same reasons we saw for the 1101 00:52:46,620 --> 00:52:47,550 instructions. 1102 00:52:47,550 --> 00:52:49,780 When I need that data I have to go out to memory and get 1103 00:52:49,780 --> 00:52:52,060 it again, so it's extra communication, extra 1104 00:52:52,060 --> 00:52:52,620 redundancy. 1105 00:52:52,620 --> 00:52:54,530 AUDIENCE: In this case where you don't need to store the a 1106 00:52:54,530 --> 00:52:55,780 variables [UNINTELLIGIBLE PHRASE]. 1107 00:52:59,550 --> 00:53:03,000 PROFESSOR: But notice this was the sequential simple case. 1108 00:53:03,000 --> 00:53:08,210 I need all the data from a to run all the iterations for b. 1109 00:53:08,210 --> 00:53:10,650 Then, yeah, this goes away.
1110 00:53:10,650 --> 00:53:13,570 So let's say this goes away, but still b produces n 1111 00:53:13,570 --> 00:53:15,240 elements and that overflows the cache. 1112 00:53:19,770 --> 00:53:23,230 So there's a third example where I don't fully distribute 1113 00:53:23,230 --> 00:53:26,520 everything, I partially distribute some of the loops. 1114 00:53:26,520 --> 00:53:29,730 I can fully scale a and b because I can fit those 1115 00:53:29,730 --> 00:53:31,600 instructions in the cache. 1116 00:53:31,600 --> 00:53:35,090 That gets me around this problem, because now a and b 1117 00:53:35,090 --> 00:53:37,850 are just communicating one data element. 1118 00:53:37,850 --> 00:53:40,120 But c is still a problem because I still have to run b 1119 00:53:40,120 --> 00:53:43,430 n times in the end before I can run c, so there are n data 1120 00:53:43,430 --> 00:53:46,940 elements in flight. 1121 00:53:46,940 --> 00:53:50,480 So the data for b still becomes a problem in terms of 1122 00:53:50,480 --> 00:53:51,500 its locality. 1123 00:53:51,500 --> 00:53:54,810 Is that clear? 1124 00:53:54,810 --> 00:53:57,640 So, any ideas on how I can improve this? 1125 00:53:57,640 --> 00:53:58,937 AUDIENCE: Assuming you have a long cache line, you 1126 00:53:58,937 --> 00:54:00,187 have to do one or two memory accesses to get the cache line back. 1127 00:54:10,920 --> 00:54:16,690 PROFESSOR: So, programs typically have really good 1128 00:54:16,690 --> 00:54:19,090 instruction locality just because of the nature of the way 1129 00:54:19,090 --> 00:54:19,490 we run them. 1130 00:54:19,490 --> 00:54:23,350 We have small loops and they iterate over and over again. 1131 00:54:23,350 --> 00:54:26,410 Data is actually where you spend most of your time in the 1132 00:54:26,410 --> 00:54:27,090 memory system. 1133 00:54:27,090 --> 00:54:28,320 It's fetching data. 1134 00:54:28,320 --> 00:54:31,650 So I didn't actually understand why you think data 1135 00:54:31,650 --> 00:54:33,340 is less expensive than instructions. 1136 00:54:33,340 --> 00:54:35,879 AUDIENCE: What I'm saying is, say you want to read an array, 1137 00:54:35,879 --> 00:54:39,435 you read the first, say, 8 elements, so 8 words, 1138 00:54:39,435 --> 00:54:39,942 in the cache block. 1139 00:54:39,942 --> 00:54:44,514 Well then you'd get 7 hits, so every 8 iterations you have to 1140 00:54:44,514 --> 00:54:45,530 do another fetch. 1141 00:54:45,530 --> 00:54:46,190 PROFESSOR: Right. 1142 00:54:46,190 --> 00:54:49,800 So that assumes that you have really good spatial locality, 1143 00:54:49,800 --> 00:54:52,170 because you've assumed that I've brought in 8 elements and 1144 00:54:52,170 --> 00:54:53,780 I'm going to use every one of them. 1145 00:54:53,780 --> 00:54:55,910 So if that's the case you have really good spatial locality 1146 00:54:55,910 --> 00:54:57,690 and that's, in fact, what you want. 1147 00:54:57,690 --> 00:55:00,160 It's the same kind of thing that I showed for the 1148 00:55:00,160 --> 00:55:00,820 instruction cache. 1149 00:55:00,820 --> 00:55:04,450 The first thing is a miss, the rest are hits. 1150 00:55:04,450 --> 00:55:07,160 The reason data is more expensive is you simply have a 1151 00:55:07,160 --> 00:55:10,490 lot more data reads than you have instructions. 1152 00:55:10,490 --> 00:55:12,730 Typically you have small loops, hundreds of 1153 00:55:12,730 --> 00:55:15,120 instructions, and they might access really big arrays that 1154 00:55:15,120 --> 00:55:18,160 are millions of data references.
1155 00:55:18,160 --> 00:55:20,230 So that becomes a problem. 1156 00:55:20,230 --> 00:55:22,250 So, any ideas on how to improve this? 1157 00:55:22,250 --> 00:55:23,340 AUDIENCE: That's a loop? 1158 00:55:23,340 --> 00:55:24,190 PROFESSOR: That's a loop. 1159 00:55:24,190 --> 00:55:25,800 So what would you do with the smaller loop? 1160 00:55:25,800 --> 00:55:30,740 AUDIENCE: [INAUDIBLE PHRASE]. 1161 00:55:30,740 --> 00:55:31,330 PROFESSOR: Something like that? 1162 00:55:31,330 --> 00:55:33,230 AUDIENCE: Yeah. 1163 00:55:33,230 --> 00:55:35,010 PROFESSOR: OK. 1164 00:55:35,010 --> 00:55:39,190 So in a nested loop, you have a smaller loop that has a 1165 00:55:39,190 --> 00:55:43,220 small number of iterations, so 64. 1166 00:55:43,220 --> 00:55:47,110 So, 64 might be just as much as I can buffer for the data 1167 00:55:47,110 --> 00:55:48,380 in the cache. 1168 00:55:48,380 --> 00:55:51,610 Then I wrap that loop with one outer loop that completes the 1169 00:55:51,610 --> 00:55:52,860 whole number of iterations. 1170 00:55:52,860 --> 00:55:55,950 So if I had to do n, then I divide n by 64. 1171 00:55:55,950 --> 00:55:58,060 So that can work out really well. 1172 00:55:58,060 --> 00:56:00,190 So there are different kinds of blocking techniques that you 1173 00:56:00,190 --> 00:56:03,800 can use on getting your data to fit into your local store 1174 00:56:03,800 --> 00:56:07,340 or into your cache to exploit these spatial and temporal 1175 00:56:07,340 --> 00:56:08,250 properties. 1176 00:56:08,250 --> 00:56:08,550 Question? 1177 00:56:08,550 --> 00:56:12,135 AUDIENCE: Would it not be better to use a small 1178 00:56:12,135 --> 00:56:14,079 [UNINTELLIGIBLE] size so you could run a, b, c 1179 00:56:14,079 --> 00:56:15,329 sequentially? 1180 00:56:17,210 --> 00:56:18,430 PROFESSOR: You could do that as well. 1181 00:56:18,430 --> 00:56:21,620 But the problem with running a, b, c sequentially is that 1182 00:56:21,620 --> 00:56:23,930 if they're in the same loop, you end up with 1183 00:56:23,930 --> 00:56:26,050 instructions being bad. 1184 00:56:26,050 --> 00:56:28,370 That would really be this case -- so even if you change this 1185 00:56:28,370 --> 00:56:30,790 number you don't get around the instructions. 1186 00:56:34,890 --> 00:56:38,040 So you're going to see more optimizations that do more of 1187 00:56:38,040 --> 00:56:39,990 these loop tricks. 1188 00:56:39,990 --> 00:56:43,230 I talked about unrolling without really defining what unrolling 1189 00:56:43,230 --> 00:56:45,930 is or going into a lot of details. 1190 00:56:45,930 --> 00:56:48,170 Loop distribution, loop fission, some other things 1191 00:56:48,170 --> 00:56:49,960 like loop tiling, loop blocking. 1192 00:56:49,960 --> 00:56:53,620 I think Saman's going to cover some of these next week. 1193 00:56:53,620 --> 00:56:56,890 So this was implemented, this was done by another Master's 1194 00:56:56,890 --> 00:57:02,470 student at MIT who graduated about two years ago, to show 1195 00:57:02,470 --> 00:57:04,910 that if you factor in cache constraints versus ignoring 1196 00:57:04,910 --> 00:57:07,650 cache constraints, how much performance you can get. 1197 00:57:07,650 --> 00:57:09,580 This was done in the context of StreamIt. 1198 00:57:09,580 --> 00:57:12,710 So, in fact, some of you might have recognized a to b to c as 1199 00:57:12,710 --> 00:57:16,950 being interconnected as pipeline filters.
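[Pulling the instruction-cache and data-cache arguments together, here is a minimal sketch of the three loop structures discussed above: the fused loop, the fully scaled (distributed) version, and the cache-aware version that blocks by a small factor. A(), B(), C(), N, and the block size of 64 are all hypothetical, and it assumes iteration i of b needs only the value a produced in iteration i, and likewise for c.]

```c
#define N     4096
#define BLOCK 64          /* chosen so BLOCK intermediates fit in cache */

extern float A(int i);
extern float B(float x);
extern float C(float x);

/* Fused: the data passed between a, b, c is tiny, but the combined
 * instruction working set of A+B+C may overflow the instruction
 * cache, so every iteration can miss. */
void fused(float *out) {
    for (int i = 0; i < N; i++)
        out[i] = C(B(A(i)));
}

/* Fully scaled: each loop's instructions fit in the icache, but now
 * N intermediate values must be buffered, which can overflow the
 * data cache instead. */
void fully_scaled(float *out) {
    static float t1[N], t2[N];
    for (int i = 0; i < N; i++) t1[i] = A(i);
    for (int i = 0; i < N; i++) t2[i] = B(t1[i]);
    for (int i = 0; i < N; i++) out[i] = C(t2[i]);
}

/* Cache-aware scaling: distribute the loops, but only run BLOCK
 * iterations of each before moving on, so both the instruction and
 * the data working sets stay small.  Assumes N % BLOCK == 0. */
void cache_aware(float *out) {
    static float t1[BLOCK], t2[BLOCK];
    for (int j = 0; j < N; j += BLOCK) {
        for (int i = 0; i < BLOCK; i++) t1[i] = A(j + i);
        for (int i = 0; i < BLOCK; i++) t2[i] = B(t1[i]);
        for (int i = 0; i < BLOCK; i++) out[j + i] = C(t2[i]);
    }
}
```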
1200 00:57:16,950 --> 00:57:19,690 We ran it on different processors, so the StrongARM 1201 00:57:19,690 --> 00:57:21,550 processor's really small. 1202 00:57:21,550 --> 00:57:25,780 It's an in-order processor, and it has no L1 cache in this particular model 1203 00:57:25,780 --> 00:57:26,520 that we used. 1204 00:57:26,520 --> 00:57:29,710 But it had a really long latency -- 1205 00:57:29,710 --> 00:57:31,570 sorry, it had no L2 cache. 1206 00:57:31,570 --> 00:57:34,390 It had really long latency to memory. 1207 00:57:34,390 --> 00:57:37,710 Pentium, an x86 processor. 1208 00:57:37,710 --> 00:57:40,300 Reasonably fast. It had a complicated memory system and 1209 00:57:40,300 --> 00:57:43,490 a lot of memory overlap in terms of references. 1210 00:57:43,490 --> 00:57:49,490 Then the Itanium processor, which had a huge L2 cache at 1211 00:57:49,490 --> 00:57:50,590 its disposal. 1212 00:57:50,590 --> 00:57:53,340 So what you can see is that lower bars indicate 1213 00:57:53,340 --> 00:57:56,060 bigger speed ups. 1214 00:57:56,060 --> 00:57:57,680 This is normalized run time. 1215 00:57:57,680 --> 00:58:01,090 So on the processor where you don't actually have caches to 1216 00:58:01,090 --> 00:58:03,870 save you, and the memory communication is really 1217 00:58:03,870 --> 00:58:06,490 expensive, you can get a lot of benefit from doing the 1218 00:58:06,490 --> 00:58:09,800 cache aware scaling, that loop nesting to take advantage of 1219 00:58:09,800 --> 00:58:12,430 packing instructions into the instruction cache, packing 1220 00:58:12,430 --> 00:58:15,550 data into the data cache, and not having to go out to memory if 1221 00:58:15,550 --> 00:58:17,100 you don't have to. 1222 00:58:17,100 --> 00:58:23,470 So you can reduce run time to about 1/3 of what it was with 1223 00:58:23,470 --> 00:58:26,160 this kind of cache optimization. 1224 00:58:26,160 --> 00:58:30,710 On the Pentium3 where you have a cache to help you out, the 1225 00:58:30,710 --> 00:58:34,300 benefits are there, but you don't get as big a benefit 1226 00:58:34,300 --> 00:58:38,550 from ignoring the cache constraints versus being aware 1227 00:58:38,550 --> 00:58:39,670 of the cache constraints. 1228 00:58:39,670 --> 00:58:44,480 So here you're actually doing some of that middle column, 1229 00:58:44,480 --> 00:58:46,360 whereas here we're doing the third column, 1230 00:58:46,360 --> 00:58:48,980 the cache-aware fusion. 1231 00:58:51,890 --> 00:58:56,220 On the Itanium you really get no benefit between the two. 1232 00:58:56,220 --> 00:58:56,580 Yep? 1233 00:58:56,580 --> 00:59:02,140 AUDIENCE: Can you explain what the left columns are? 1234 00:59:02,140 --> 00:59:04,040 PROFESSOR: These? 1235 00:59:04,040 --> 00:59:04,280 AUDIENCE: Yeah. 1236 00:59:04,280 --> 00:59:07,270 PROFESSOR: So this is tricky. 1237 00:59:11,020 --> 00:59:13,000 So the left columns are doing this. 1238 00:59:13,000 --> 00:59:17,990 AUDIENCE: OK, sort of assuming that the icache is there. 1239 00:59:17,990 --> 00:59:20,500 PROFESSOR: Right, and the third column is doing this. 1240 00:59:20,500 --> 00:59:24,680 So you want to do this because the icache locality is the 1241 00:59:24,680 --> 00:59:29,890 best. So you always want to go to full or maximum scaling. 1242 00:59:29,890 --> 00:59:33,310 I'm actually fudging a little just for the sake of clarity. 1243 00:59:33,310 --> 00:59:37,120 Here you're actually doing this nesting to improve both 1244 00:59:37,120 --> 00:59:38,730 the instruction and the data locality.
1245 00:59:41,440 --> 00:59:43,160 So you can get really good performance improvement. 1246 00:59:43,160 --> 00:59:46,330 So what does that mean for your Cell projects, or for 1247 00:59:46,330 --> 00:59:49,020 Cell? We'll talk about that next week at the recitation. 1248 00:59:52,000 --> 00:59:52,150 Yeah? 1249 00:59:52,150 --> 00:59:53,572 AUDIENCE: Is there some big reasons 1250 00:59:53,572 --> 00:59:56,400 [UNINTELLIGIBLE PHRASE]. 1251 00:59:56,400 --> 00:59:57,850 PROFESSOR: Well it just means that if you have caches to 1252 00:59:57,850 --> 01:00:00,390 save you, and they're really big caches and they're really 1253 01:00:00,390 --> 01:00:06,990 efficient, it's the law of diminishing returns. 1254 01:00:06,990 --> 01:00:08,370 That's where profiling comes in. 1255 01:00:08,370 --> 01:00:10,100 So you look at the profiling results, you look at your 1256 01:00:10,100 --> 01:00:12,360 cache misses, how many cache misses are you taking. 1257 01:00:12,360 --> 01:00:14,780 If it's really significant, then you look at ways to 1258 01:00:14,780 --> 01:00:16,020 improve it. 1259 01:00:16,020 --> 01:00:18,110 If your cache misses are really low, your miss rate is 1260 01:00:18,110 --> 01:00:20,540 really low, then it doesn't make sense to spend time and 1261 01:00:20,540 --> 01:00:22,120 energy focusing on that. 1262 01:00:22,120 --> 01:00:24,830 Good question. 1263 01:00:24,830 --> 01:00:28,360 So, any other questions? 1264 01:00:28,360 --> 01:00:33,410 So summarizing the gamut of programming for performance. 1265 01:00:33,410 --> 01:00:35,410 So you tune the parallelism first, because if you can't 1266 01:00:35,410 --> 01:00:38,140 find the concurrency, by Amdahl's law you're not going 1267 01:00:38,140 --> 01:00:40,440 to get a whole lot of speed up. 1268 01:00:40,440 --> 01:00:43,750 But then once you've figured out what the parallelism is, then 1269 01:00:43,750 --> 01:00:45,740 what you want to do is really get the performance on each 1270 01:00:45,740 --> 01:00:48,055 processor, the single-thread performance, to be really good. 1271 01:00:48,055 --> 01:00:49,870 You shouldn't ignore that. 1272 01:00:49,870 --> 01:00:51,630 The modern processors are complex. 1273 01:00:51,630 --> 01:00:53,700 You need instruction level parallelism, you need data 1274 01:00:53,700 --> 01:00:56,190 level parallelism, you need memory hierarchy 1275 01:00:56,190 --> 01:00:59,210 optimizations, and so you should consider those 1276 01:00:59,210 --> 01:01:00,230 optimizations. 1277 01:01:00,230 --> 01:01:02,640 Here, profiling tools could really help you figure out 1278 01:01:02,640 --> 01:01:06,030 where the biggest benefits to performance will come from. 1279 01:01:09,570 --> 01:01:11,490 You may have to, in fact, change everything. 1280 01:01:11,490 --> 01:01:13,320 You may have to change your algorithm, your data 1281 01:01:13,320 --> 01:01:14,970 structures, your program structure. 1282 01:01:14,970 --> 01:01:17,460 So in the MPEG decoder case, for example, I showed you that 1283 01:01:17,460 --> 01:01:20,910 if you change the flag that says don't use double 1284 01:01:20,910 --> 01:01:25,530 precision inverse DCT, use a numerical hack, then you can 1285 01:01:25,530 --> 01:01:27,330 get performance improvements but you're changing your 1286 01:01:27,330 --> 01:01:29,380 algorithm really.
1287 01:01:29,380 --> 01:01:32,120 You really want to focus on just the biggest nuggets -- 1288 01:01:32,120 --> 01:01:34,760 where is most of the performance coming from, or 1289 01:01:34,760 --> 01:01:36,600 where's the biggest performance bottleneck, and 1290 01:01:36,600 --> 01:01:38,060 that's the thing you want to optimize. 1291 01:01:38,060 --> 01:01:40,200 So remember the law of diminishing returns. 1292 01:01:40,200 --> 01:01:42,720 Don't spend your time on doing things that aren't going to 1293 01:01:42,720 --> 01:01:46,010 get you anything significant in return. 1294 01:01:46,010 --> 01:01:46,450 That's it. 1295 01:01:46,450 --> 01:01:47,700 Any questions? 1296 01:01:51,070 --> 01:01:51,830 OK. 1297 01:01:51,830 --> 01:01:53,080 How are you guys doing with the projects? 1298 01:01:56,830 --> 01:02:01,195 So, one of the added benefits of the central CVS repository 1299 01:02:01,195 --> 01:02:04,620 is I get notifications too when you submit things. 1300 01:02:04,620 --> 01:02:07,410 So I know of only two projects that have been submitting 1301 01:02:07,410 --> 01:02:08,580 things regularly. 1302 01:02:08,580 --> 01:02:11,950 So, I hope that'll pick up soon. 1303 01:02:11,950 --> 01:02:14,280 I guess you have a few minutes to finish the quiz and then we'll 1304 01:02:14,280 --> 01:02:14,930 see you next week. 1305 01:02:14,930 --> 01:02:16,180 Have a good weekend.