The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation, or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: So this hour we're going to talk about SIMD programming with Cell. First we'll talk a little bit about what SIMD is, then about the facilities that Cell and the compiler provide for programming with SIMD, and then some design considerations that you have to keep in mind when you're doing things.

All right, so the situation these days is that most compute-bound applications are running through a large piece of data and running the same computations on it over and over again, or rather, running the same computations across all the different pieces of data. And very frequently there will be no dependence between iterations when you're going through this data. So that means there are opportunities for you to data-parallelize.

So as an example, say we're multiplying a0 and b0 to get c0, and suppose we want to perform this operation across all the elements of arrays a, b, and c. Instead of multiplying two integers together, we're going to take two arrays, multiply each of the pairs element-wise, and write the results to a third array. So the picture is going to look something like this, and you would of course represent this using, for example, a for loop.

Now you can think of this as an operation that abstractly operates on these entire arrays. We're not going to go quite that far, but what we are going to do is think of these operations as acting on bundles of elements. So we're going to bundle our array elements into groups of four.
And then each time we're going to take a group, multiply it with another group using this element-wise multiplication, and write the result to a third bundle. OK, does that make sense? Now the thing about this kind of model is that Cell provides very good hardware support for something that looks kind of like this.

AUDIENCE: Is that actual Cell [INAUDIBLE]

PROFESSOR: Yes, I'll get into this. In fact, we'll be talking about the syntax and meaning of this kind of thing. All right?

So for this kind of thing to happen we need the compiler to support two different things. First, we need to be able to address these kinds of bundles of elements, and these are going to be called vectors. And second, we need to be able to perform operations on these vectors.

So Cell and the XLC compiler give us support for this. First, they provide registers which are capable of holding vectors. Normally you think of a register as holding one machine word; on a 32-bit machine that would be a 32-bit int, for example. What we have on the Cell are 128-bit registers, which can hold, for example, four ints right next to each other. So we're going to be able to take this bundle of ints and operate on it as a unit.

The second part is that we have operations that act on these vector registers. The Cell supports special assembly instructions that it interprets as acting on particular vectors. But we also have C++ language extensions called intrinsics, and those give us access to these special assembly instructions without requiring us to be poking around in the assembly.

All right, now the big draw of this is that these vector operations are going to be pretty much as fast as single scalar operations, which means that if we take advantage of them we can make our code run, say, four times as fast.
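[For reference, a minimal sketch of the element-wise multiply described above, first as a plain scalar loop and then stepping through the same arrays one bundle of four floats at a time, using the intrinsics introduced in the following slides. The array names, the length N, and the alignment attribute are illustrative assumptions; float arrays are used here because the SPU's vector multiply is most direct for floats.]

```c
#include <spu_intrinsics.h>

#define N 1024   /* illustrative length, a multiple of 4 */

float a[N] __attribute__((aligned(16)));
float b[N] __attribute__((aligned(16)));
float c[N] __attribute__((aligned(16)));

/* Scalar version: one multiply per iteration. */
void mult_scalar(void)
{
    int i;
    for (i = 0; i < N; i++)
        c[i] = a[i] * b[i];
}

/* SIMD version: each iteration multiplies a bundle of four floats. */
void mult_simd(void)
{
    vector float *va = (vector float *)a;
    vector float *vb = (vector float *)b;
    vector float *vc = (vector float *)c;
    int i;
    for (i = 0; i < N / 4; i++)
        vc[i] = spu_mul(va[i], vb[i]);
}
```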
PROFESSOR: OK, so how do we refer to these vectors when we're coding? XLC provides us with these intrinsics, and we have these vector data types. Each one just specifies how to interpret a consecutive group of 128 bits as some sort of vector, and you can have vectors of varying element sizes and varying numbers of elements.

So when you're programming on the PPU or the SPU you get these four different kinds of vector data types. You can declare things as, for example, vector signed int, which is what I mentioned in the example: four ints next to each other, each 32 bits. You could also have vectors which contain 16-bit integers or 8-bit integers, and you could also have vectors of floating point numbers. I should mention that all of these signed integer types also have unsigned equivalents. Anyway, you can just declare these anywhere in your code and use them as if they were a C++ data type. All right, any questions?

On the SPU you also get some additional vector data types. One is vector signed long long, which is 64-bit ints, and you can fit two of those in 128 bits. And you can also fit two double-precision floating point numbers in 128 bits.

Now the compilers actually support these types pretty nicely. Not only can you declare variables of these types pretty much anywhere in your code, you can also declare pointers to these types and arrays of these types. So they look pretty much like natural C++ types, except that they translate directly into these particular types that the hardware supports.
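[As a rough sketch, declarations of these types look like ordinary declarations; the variable names here are illustrative, and the long long and double forms are the SPU-only ones just mentioned.]

```c
#include <spu_intrinsics.h>

vector signed int       vi;   /* four 32-bit signed ints                */
vector unsigned short   vs;   /* eight 16-bit unsigned ints             */
vector signed char      vb;   /* sixteen 8-bit signed ints              */
vector float            vf;   /* four single-precision floats           */

vector signed long long vll;  /* two 64-bit ints (SPU only)             */
vector double           vd;   /* two double-precision floats (SPU only) */

vector float *pvf;            /* pointers to vector types are fine      */
vector signed int table[32];  /* so are arrays of vector types          */
```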
Now in order to manipulate these vector data, we have compiler extensions called intrinsics, which provide access to the assembly-level features that we want. Remember, there are specific assembly instructions that correspond to, for example, adding two vectors which each contain four 32-bit integers. And instead of going into the assembly and inserting that instruction ourselves, we just use a compiler intrinsic inside our C++ code. What it does is provide a notation that looks a lot like a function call, but the compiler automatically translates it into the correct assembly instruction. And again, you don't have to worry about going into the assembly and messing around with which instruction applies to which registers; you don't have to worry about register allocation at all. The compiler just figures out the right thing for you. And to use these in your SPU program you're going to want to include spu_intrinsics.h.

Now what's a little bit confusing is that you're going to have slightly different intrinsics available on the PPU and the SPU, because those actually have different instruction sets. But anyway, as an example, you can declare two variables of type vector signed int, add them using the intrinsic called spu_add, and assign the result to a third vector signed int. Questions? Yep.

AUDIENCE: In what way are they different if you're on the SPU or the PPU? Is it just that not entirely the same set of operations is available? Or are there actually semantic differences? Could you make a little header file that masks over the differences, mostly?

PROFESSOR: There are going to be some operations that are only available on one and not the other. But in general, if the names correspond, and I'll go into that in a little bit, then they should perform essentially the same function.

AUDIENCE: [INAUDIBLE] mostly was the [INAUDIBLE] also some name differences where there really don't need to be. For instance, if you try to do a shift on the PPU I believe it's [INAUDIBLE] SL, shift logical right, shift logical left, or shift arithmetic right. Sort of things you would remember.
AUDIENCE: On the SPU it's the acronym for rotate and mask for shift. So R-O-T-M-A-R or something like that. So yes, there are some differences that don't need to be there.

PROFESSOR: OK, so to actually create these vectors there are a couple of different notations you can use. The first is this thing that looks like a cast to, for example, vector signed int. So you do vector signed int in parentheses and then a list of the four integers you want to fill in, and that will create an integer vector and assign it to a. I believe you can also use that notation with just one integer, and it will fill in that integer in all four positions. There's also an SPU intrinsic called spu_splats that you can use to basically copy the same integer to all four components.

AUDIENCE: How does it know you're not using a comma operator?

PROFESSOR: Yeah, I don't know. Is that right, David, with the parentheses in the second part? OK.

AUDIENCE: Whatever.

AUDIENCE: Another caveat here from someone who's been in the trenches is that XLC likes this notation. GCC sometimes likes curly brace notation instead.

PROFESSOR: I'd seen both of those and I didn't know which to use.

[INTERPOSING VOICES]

AUDIENCE: [INAUDIBLE]

PROFESSOR: OK, great, thanks.

All right. And after you've assigned some of these variables, in order to get the pieces back out, one way you can do it is to use this union trick, where you allocate something of type vector signed int and then tell C++ that it can find an array of integers in the same place. And that will let you pull out the components. So if you define this union this way, then you get a type called intVec. And any time you have an intVec you can either do .vec to get at the vector signed int, the vector data type, or you can use .vals with an array index to get at the components of the vector. And you could also use the intrinsic spu_extract to pick out the same components.
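[Here is a small sketch of those notations side by side. It assumes SPU-side code and uses the curly brace initializer form that GCC prefers, as just noted; the union and variable names are illustrative.]

```c
#include <spu_intrinsics.h>
#include <stdio.h>

typedef union {
    vector signed int vec;      /* the whole 128-bit vector       */
    int               vals[4];  /* the same bits, seen as 4 ints  */
} intVec;

int main(void)
{
    /* Literal notation; XLC also accepts the cast-style form
       (vector signed int)(1, 2, 3, 4). */
    vector signed int a = {1, 2, 3, 4};

    /* spu_splats copies one scalar into every element. */
    vector signed int b = spu_splats(10);

    /* Element-wise add of the two vectors. */
    vector signed int c = spu_add(a, b);

    /* Pulling elements back out, two ways. */
    intVec u;
    u.vec = c;
    printf("%d %d\n", u.vals[0], spu_extract(c, 3));  /* prints 11 and 14 */
    return 0;
}
```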
XLC provides a bunch of different vector operations that you can use. There are integer operations, floating point operations, permutation and formatting operations which you can use to shuffle data around inside a vector, and there are also load and store instructions. And I believe we have a reference linked off the course website if you want to find out more about these; I'm only going to touch on a few of them.

OK, so the arithmetic and logical operations. Like I said, most of these are the same between the PPU and the SPU; there are some that are named slightly differently and some that are not available on the PPU. These are all things you would expect: add, subtract. Madd is multiply and then add, with three arguments. There's multiply; re is for reciprocal. You can also do bit-wise and, or, and xor, and I believe there are other logical operations there too.

Now the thing is, you do have to keep track of whether you're using a PPU or an SPU intrinsic, but you usually don't have to worry about selecting the right vector type. The compiler should figure out which vector types you're using and substitute the appropriate assembly instruction that produces a result of the same vector type. So all these operations are what we call generic: they stand in for all the specific instructions, each of which only applies to a single vector type. Does that make sense?
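[As a sketch of what these generic operations look like in code; the function and variable names are mine, and the usual spu_* spellings are assumed.]

```c
#include <spu_intrinsics.h>

/* a*b + c on four floats at once; spu_madd is the three-argument
   multiply-and-add mentioned above. */
vector float fused(vector float a, vector float b, vector float c)
{
    return spu_madd(a, b, c);
}

/* The same generic names work on integer vectors where the operation
   makes sense, for example element-wise add followed by a bit-wise and
   with a splatted mask. */
vector signed int combine(vector signed int x, vector signed int y)
{
    return spu_and(spu_add(x, y), spu_splats(0xFF));
}
```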
OK, so one handy thing is the permutation operation, and this allows you to rearrange the bytes of a vector, or of two vectors, arbitrarily. The syntax is spu_shuffle(a, b, pattern), where a and b are your source vectors and pattern tells you how to shuffle them. The pattern is interpreted as a vector of 16 bytes, and each byte tells the compiler how to pick out one byte of the result. The way a pattern byte is interpreted is that the low four bits specify which byte position the source byte comes from, and the next bit specifies whether you're going to pull from a or from b.

So as an example, here's a pattern vc. If you look at its second byte, which is 0x14 in hex, that means the second byte of the destination register is going to contain byte number four of b, counting from zero. So the 4 means select the byte numbered four, and the 1 means select from b. Does that make sense? And this is very versatile: by putting in the right pattern vector you can arrange for all these bytes to be shuffled around however you want.

AUDIENCE: The pattern is a constant [INAUDIBLE].

PROFESSOR: Pardon?

AUDIENCE: Is the pattern a constant or an immediate parameter?

PROFESSOR: You can fill in the pattern at run time, if that's what you're asking.

AUDIENCE: [INAUDIBLE]

AUDIENCE: [INAUDIBLE]
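[A small sketch of that shuffle example; the vector contents are made up, but the pattern byte 0x14 behaves as just described: low nibble 4, select-from-b bit set.]

```c
#include <spu_intrinsics.h>

vector unsigned char shuffle_demo(vector unsigned char a, vector unsigned char b)
{
    /* Pattern bytes 0x00-0x0f select that byte of a; 0x10-0x1f select
       byte n-16 of b.  Byte 1 here is 0x14, so byte 1 of the result is
       byte 4 of b; every other byte is copied straight from a. */
    vector unsigned char vc = {
        0x00, 0x14, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
        0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f };
    return spu_shuffle(a, b, vc);
}
```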
PROFESSOR: OK, also useful are these rotation operations, which will let you shift your vector left or right by some amount.

Now one thing to be aware of is that on the SPU you only have these 128-bit registers. On the PPU you have different registers which are suitable for holding different types; for example, there are word-sized registers for holding ints, and the PPU also has these 128-bit registers. But the SPU has nothing else. So that means whenever you're using scalar types on the SPU they're all going to live in these large registers, no matter what the size of the scalar you're using. And depending on the size of the scalar, it's going to go in a particular position inside this wide register, which is called a quadword register because it's 16 bytes.

Now the thing to watch out for is that whenever you load a scalar from memory into one of these registers, there may have to be a little extra processing done in order to shift the scalar into the right place inside the register. And furthermore, the hardware always wants to grab one of these quadwords at a time, so loading a scalar is not going to be any cheaper than loading a whole quadword register. So if possible you're going to want to load an entire quadword register at a time, and if you just need a part of it, you can figure that out later. But you might as well get the whole thing. Questions?

AUDIENCE: So when you, just a scalar question. So when you load a scalar value that's not aligned with the preferred position, is there overhead associated with that?

PROFESSOR: I'm not sure how much overhead is associated with that. Pardon? Oh, do you know?

AUDIENCE: Well, for a scalar it can only load on 16-byte boundaries. So it's going to load something that includes that, and then it's going to have to shift it into the right position.

PROFESSOR: So when it has to shift the scalar around, does that actually take longer than when it's in the natural position?

AUDIENCE: I don't know. Well, what you can do is set some flags in XLC that say, align all of my scalars correctly, and we'll waste 4x the space. It will even align the elements of an array so that each scalar can be loaded directly, wasting space so that everything in the array is [INAUDIBLE]. So you can have the compiler trade off space versus time for you with a couple of switches.

PROFESSOR: I see.

OK, so we're going to want to look at the sim application from recitation two, and we want to adapt it to make use of SIMD data types and intrinsics. So what we've done is, remember we had these x, y, z coordinates that we were manipulating. What we're going to do is pad each one; it was three words before, and we're going to pad each one so that it fills a quadword.
And so for each quadword, of course, the first three words are going to correspond to the x, y, z components, and we can grab those out using spu_extract or some other intrinsics.

Now when we're doing manipulations with these components, for example, we want to find the displacement between two locations, and that's just subtracting two of these coordinates. So that subtraction, which before required three floating point subtractions, we can replace with a single SIMD instruction. Does that make sense?
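[A minimal sketch of what this looks like; the type name and the idea of keeping the padded coordinate directly in a vector float are assumptions about the handout code, not the code itself.]

```c
#include <spu_intrinsics.h>

/* One padded coordinate: {x, y, z, unused} in a single quadword. */
typedef vector float coord_t;

/* Three scalar subtractions become one vector subtract. */
coord_t displacement(coord_t p, coord_t q)
{
    return spu_sub(p, q);
}

/* Individual components can still be pulled out when needed. */
float get_y(coord_t p)
{
    return spu_extract(p, 1);
}
```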
OK, so most of this has already been done, and we're providing most of the implementation of this SIMD version of sim. What we want you to do is download the tarball for this recitation, go in there, and fill in one of the blanks. All right? There's just one function that's been left unimplemented, and to see if you know what's going on, see if you can fill in the implementation for it. Any questions? This function you want to implement is basically going to take a vector float, and if that vector contains a, b, c and d, you want to return a vector each of whose elements is a plus b plus c plus d. Questions?

AUDIENCE: What directory under the...

AUDIENCE: [INAUDIBLE]

PROFESSOR: So we're going to go into sim a list.

AUDIENCE: But we can stay around afterwards and help you figure out what's going on.

PROFESSOR: OK, so here's one implementation. Basically we're going to just declare another vector float, and that vector float is basically the result of these swaps. So notice that in this first one we're swapping the first and second words, and the third and fourth words. That means down here we're going to want to carry bytes four, five, six, seven first and then bytes 0, 1, 2, 3; and then over here we want bytes 12, 13, 14, 15 and then 8, 9, 10, 11. Everyone see what's going on for the first shuffle? And then we're going to just add that to our original vector to get this. And then we can do it again; this time we just want to swap the two halves, so the shuffle pattern is going to be 8, 9, 10, 11, 12, 13, 14, 15 followed by 0, 1, 2, 3, 4, 5, 6, 7. Make sense?
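[Putting that together, a sketch of the whole function might look like the following; the function name is mine, and the recitation handout may spell things differently.]

```c
#include <spu_intrinsics.h>

/* Given v = {a, b, c, d}, return {a+b+c+d, a+b+c+d, a+b+c+d, a+b+c+d}. */
static inline vector float sum_across(vector float v)
{
    /* Swap word 0 with word 1, and word 2 with word 3:
       bytes 4..7, 0..3, 12..15, 8..11. */
    vector unsigned char swap_words = {
        0x04, 0x05, 0x06, 0x07, 0x00, 0x01, 0x02, 0x03,
        0x0c, 0x0d, 0x0e, 0x0f, 0x08, 0x09, 0x0a, 0x0b };

    /* Swap the two halves: bytes 8..15 then 0..7. */
    vector unsigned char swap_halves = {
        0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f,
        0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07 };

    v = spu_add(v, spu_shuffle(v, v, swap_words));   /* {a+b, a+b, c+d, c+d} */
    v = spu_add(v, spu_shuffle(v, v, swap_halves));  /* all elements a+b+c+d */
    return v;
}
```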
OK, so the way we translated the program we just used into SIMD was with an array of structs. Basically each of the structs that we had from our previous implementation just carried over, and we put all of those into an array, so the structs were right next to each other in memory.

Alternatively, we could have laid out the data in memory in a different way, and this is called a struct of arrays layout. Instead, what we can do is put all the like fields next to each other, so that we have, for example, an array of all the x components, then an array of all the y components, then an array of all the z components. And when you reorder the data this way you get different ways to process it. So for example, now each quadword, instead of containing the data for a single point, is going to contain the same component of four consecutive points. Everyone see that?

And we can actually implement the algorithm from before in this new layout, but we have to be a little bit more clever about how we put the elements together. Before, we were able to just subtract or multiply the quadwords with each other, because those corresponded to, for example, subtracting the coordinates of two points. This time we have to do some additional computation in order to put all the pieces together.

The trick behind this struct of arrays implementation, which I'll just gloss over, is that if we're storing state for eight objects, then for each object we need the position and the velocity, and for each of those we have x, y and z. So that means to store state for eight objects we need 8 times 6, which is 48 words, and we can put those in 12 quadwords if we pack them right.

And when we do SIMD operations on the quadwords that we pull out, we can get four pair interactions at a time. So suppose this quadword contains data corresponding to objects a, b, c and d, and over here we have a quadword containing data corresponding to objects one, two, three and four. With some SIMD operations we can figure out the pairwise interactions between objects a and one, b and two, c and three, and d and four. But of course we have to be able to find the interactions between any pair, not just these pairs that line up. So what we have to do is rotate the quadword over by one word and then do the same thing again. We do that four times in all, and then we add up the results.

So as you can see, this implementation is a little bit more involved, and it maps to the original implementation less directly. On the other hand, it does give us a really dramatic speedup, because we're using more of the vector words. Notice that in the first packing we had x, y and z and then the fourth slot was unused. Anyway, this time the struct of arrays implementation is actually 7 1/2 times faster than the array of structs implementation. So choosing this data layout correctly can actually be one of the really big determinants of how your program performs.
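[For reference, a sketch of the two layouts as C declarations; the field and type names are illustrative, not the recitation's actual code.]

```c
#define NUM_OBJ 1024   /* illustrative object count, a multiple of 4 */

/* Array of structs: each object's x, y, z padded out to one quadword. */
typedef struct {
    float x, y, z, pad;
} pos_aos_t;
pos_aos_t positions_aos[NUM_OBJ] __attribute__((aligned(16)));

/* Struct of arrays: all x's together, all y's, all z's.  Each quadword
   now holds the same component of four consecutive objects. */
typedef struct {
    float x[NUM_OBJ];
    float y[NUM_OBJ];
    float z[NUM_OBJ];
} pos_soa_t;
pos_soa_t positions_soa __attribute__((aligned(16)));
```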
AUDIENCE: The scalar version was like what, 480, something like that? Or is it not comparable?

PROFESSOR: Let's see, David, do you remember?

AUDIENCE: [INAUDIBLE]

PROFESSOR: OK, so...

AUDIENCE: [INAUDIBLE]

AUDIENCE: No, that was just on the PPU.

AUDIENCE: [INAUDIBLE]

[INTERPOSING VOICES]

AUDIENCE: [INAUDIBLE]

PROFESSOR: OK, so something like 400 for the double-buffered one and 300 for the array of structs.

OK, one other thing to worry about when you're dealing with these SIMD instructions is that you want to make sure that all your data are aligned correctly in memory. And like I said before, when you're pulling things in from memory you want to make sure that whatever you're pulling in is aligned on a quadword boundary. You can use the align compiler directive to tell the compiler, I want this piece of data aligned at a particular place. And if you do that on all your arrays, for example, and make sure that the array elements fit neatly into quadwords, then you should be OK.

Again, like I said before, you also want to transfer only multiples of 16 bytes on loads and stores. And so when you're doing processing it may help if you pad the end of your arrays so that they fill out a multiple of 16 bytes, because it's easier to just do that processing with a SIMD instruction than to have one or two elements hanging off the end that you have to worry about. Questions?

AUDIENCE: Question.

PROFESSOR: Yep.

AUDIENCE: Is it a good idea to pass parameters [INAUDIBLE] I mean, which one is preferred? [INAUDIBLE]

AUDIENCE: So you should [INAUDIBLE] for figuring out whether something can scale easily or not. So you might make [INAUDIBLE]. So in cases where you can avoid using pointers, you should do that.

PROFESSOR: OK.

[SIDE CONVERSATION]
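[Before moving on, a small sketch of the two points from a moment ago, quadword alignment and padding to a multiple of 16 bytes. It assumes the __attribute__((aligned)) spelling accepted by XLC and GCC as the alignment directive; N and the array name are illustrative.]

```c
#define N 1000

/* Round the element count up to a multiple of 4 floats (16 bytes) and
   align the array on a quadword boundary, so SIMD loads and stores can
   cover the whole array without a scalar clean-up loop at the end. */
#define N_PADDED ((N + 3) & ~3)

float samples[N_PADDED] __attribute__((aligned(16)));
```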
PROFESSOR: So one last thing that I should mention. I haven't really let on, but compilers can actually generate some of these SIMD instructions by themselves. If you declare your types to be vectors and then use just regular operators, apparently GCC and XLC will substitute the correct intrinsics for you. Of course that doesn't get you all the operations which are available with intrinsics. But anyway, automatically SIMD-izing your code is something that's really worth looking into; as we saw, it can give you a great performance improvement. The thing is that compilers are still not very good at doing this transformation automatically. So unlike instruction scheduling, where if you're passing -O5 your compiler will do a much better job than you would have time to do yourself, this is something that you should probably reserve some time for.

That's all. If you have any questions you can stick around and I'll try and help you.