1 00:00:00,030 --> 00:00:02,420 The following content is provided under a Creative 2 00:00:02,420 --> 00:00:03,850 Commons license. 3 00:00:03,850 --> 00:00:06,860 Your support will help MIT OpenCourseWare continue to 4 00:00:06,860 --> 00:00:10,540 offer high quality educational resources for free. 5 00:00:10,540 --> 00:00:13,410 To make a donation or view additional materials from 6 00:00:13,410 --> 00:00:17,610 hundreds of MIT courses, visit MIT OpenCourseWare at 7 00:00:17,610 --> 00:00:18,860 ocw.mit.edu. 8 00:00:21,390 --> 00:00:23,210 PROFESSOR: Let's get started. 9 00:00:23,210 --> 00:00:27,660 So what we are going to do today is go about discovering 10 00:00:27,660 --> 00:00:29,650 automatic parallelization methods. 11 00:00:29,650 --> 00:00:32,590 We know you guys are amazing hackers and you can actually 12 00:00:32,590 --> 00:00:34,730 do all those things by hand. 13 00:00:34,730 --> 00:00:40,580 But to make multi-core generally acceptable, can we 14 00:00:40,580 --> 00:00:41,510 do things automatically? 15 00:00:41,510 --> 00:00:44,880 Can we really reduce the burden on the programmers? 16 00:00:44,880 --> 00:00:48,460 So at the beginning I'm going to talk about general 17 00:00:48,460 --> 00:00:49,600 parallelizing compilers. 18 00:00:49,600 --> 00:00:50,540 What people have done. 19 00:00:50,540 --> 00:00:51,800 What's the state of the art. 20 00:00:51,800 --> 00:00:55,590 Kind of give you a feel for what is doable. 21 00:00:55,590 --> 00:00:58,120 Hopefully, that will be a little over an hour, and then 22 00:00:58,120 --> 00:01:02,560 we'll go talk about the StreamIt compiler, what we have done 23 00:01:02,560 --> 00:01:09,140 recently, and what this automation part can do. 24 00:01:09,140 --> 00:01:11,730 So, I'll talk a little bit about parallel execution. 25 00:01:11,730 --> 00:01:16,150 This is kind of what you know already. 
26 00:01:16,150 --> 00:01:19,600 Then go into parallelizing compilers, and talk about how 27 00:01:19,600 --> 00:01:21,670 to determine if something is parallel by doing data 28 00:01:21,670 --> 00:01:25,020 dependence analysis, and how to increase the amount of 29 00:01:25,020 --> 00:01:27,110 parallelism available in a loop, with what kind of 30 00:01:27,110 --> 00:01:28,610 transformations. 31 00:01:28,610 --> 00:01:32,570 Then we go look at how to generate code, because once 32 00:01:32,570 --> 00:01:34,280 you see that something is parallel, how do you actually get 33 00:01:34,280 --> 00:01:35,270 it to run in parallel. 34 00:01:35,270 --> 00:01:38,480 And finish up with actually how to do communication code 35 00:01:38,480 --> 00:01:44,330 in a machine such as a server. 36 00:01:44,330 --> 00:01:48,660 So in parallel execution, this is something -- it's a review. 37 00:01:48,660 --> 00:01:50,460 So there are many kinds of parallelism, things like 38 00:01:50,460 --> 00:01:51,240 instruction level parallelism. 39 00:01:51,240 --> 00:01:55,680 It's basically effected by hardware or compiler 40 00:01:55,680 --> 00:01:57,060 scheduling. 41 00:01:57,060 --> 00:01:59,730 As of today this is in abundance. 42 00:01:59,730 --> 00:02:02,850 In all superscalars we do that, in [OBSCURED] 43 00:02:02,850 --> 00:02:04,560 we do that. 44 00:02:04,560 --> 00:02:07,350 Then task level parallelism, it's what most of you 45 00:02:07,350 --> 00:02:08,860 guys are doing now. 46 00:02:08,860 --> 00:02:11,220 You probably find a program, you divide it into tasks, you 47 00:02:11,220 --> 00:02:14,200 get task level parallelism, mainly by hand. 48 00:02:14,200 --> 00:02:16,120 Some of you might be doing data level parallelism and 49 00:02:16,120 --> 00:02:19,300 also loop level parallelism. 50 00:02:19,300 --> 00:02:22,010 That can be by hand or compiler generated. 
51 00:02:22,010 --> 00:02:24,300 Then, of course, pipeline parallelism is mainly 52 00:02:24,300 --> 00:02:26,860 done in hardware, and languages like StreamIt do pipeline 53 00:02:26,860 --> 00:02:28,560 parallelism. 54 00:02:28,560 --> 00:02:31,435 Divide and conquer parallelism goes a little bit beyond what is done 55 00:02:31,435 --> 00:02:35,170 in hardware; it's mainly done by hand, for recursive functions. 56 00:02:35,170 --> 00:02:39,660 Today we are going to focus on loop level parallelism, 57 00:02:39,660 --> 00:02:43,360 particularly how to do loop level parallelism by the compiler. 58 00:02:43,360 --> 00:02:45,090 So why loops? 59 00:02:45,090 --> 00:02:48,000 So loops are interesting because people observed that in 60 00:02:48,000 --> 00:02:51,910 most code, 90% of execution time is in 10% of the code. 61 00:02:51,910 --> 00:02:55,690 Almost 99% of the execution time is in 10% of the code. 62 00:02:55,690 --> 00:03:00,080 That code is a loop, and it makes sense because running at 63 00:03:00,080 --> 00:03:05,990 3 gigahertz, if you only ran each instruction once, then you'd run 64 00:03:05,990 --> 00:03:09,920 through the hard drive in only a few minutes, so you need 65 00:03:09,920 --> 00:03:11,370 to have repeatability. 66 00:03:11,370 --> 00:03:12,830 A lot of the time repeatability means loops. 67 00:03:16,070 --> 00:03:17,970 Loops, if you can parallelize them, you can get really good 68 00:03:17,970 --> 00:03:21,190 performance because most of the time, each loop 69 00:03:21,190 --> 00:03:24,420 iteration has the same amount of work and you get nice good 70 00:03:24,420 --> 00:03:28,620 load balance, and it's somewhat easier to analyze, so that's 71 00:03:28,620 --> 00:03:29,750 why the compilers start there. 72 00:03:29,750 --> 00:03:33,070 Whereas if you try to get task level parallelism, things have 73 00:03:33,070 --> 00:03:38,350 a lot more complexities that an automatic compiler cannot handle. 74 00:03:38,350 --> 00:03:41,220 So there are two types of parallel loops. 
75 00:03:41,220 --> 00:03:43,120 One is a forall loop. 76 00:03:43,120 --> 00:03:45,720 That means there are no loop carried dependences. 77 00:03:45,720 --> 00:03:49,030 That means you can have the sequential code executing, then run 78 00:03:49,030 --> 00:03:52,390 everything in parallel, and at the end you have a barrier and 79 00:03:52,390 --> 00:03:53,710 when everybody finishes you continue on 80 00:03:53,710 --> 00:03:55,810 the sequential code. 81 00:03:55,810 --> 00:03:58,300 That is how you do a forall loop. 82 00:03:58,300 --> 00:04:01,510 Some languages, in fact, have an explicitly parallel construct, 83 00:04:01,510 --> 00:04:06,580 say OK, here's a forall loop and go do that. 84 00:04:06,580 --> 00:04:08,990 The other type of loop is called a 85 00:04:08,990 --> 00:04:10,990 foracross or doacross loop. 86 00:04:10,990 --> 00:04:13,860 That says OK, while the loop is parallel, there are some 87 00:04:13,860 --> 00:04:14,760 dependences. 88 00:04:14,760 --> 00:04:17,670 That means some value generated here is used 89 00:04:17,670 --> 00:04:18,720 somewhere here. 90 00:04:18,720 --> 00:04:20,670 So you can run it in parallel, but you have some 91 00:04:20,670 --> 00:04:22,280 communication going too. 92 00:04:22,280 --> 00:04:23,670 So you have to move data. 93 00:04:23,670 --> 00:04:26,590 So it's not completely running in parallel, there's some 94 00:04:26,590 --> 00:04:27,720 synchronization going on. 95 00:04:27,720 --> 00:04:29,300 But you can get large chunks running in parallel. 96 00:04:32,200 --> 00:04:36,840 So we kind of focus on doall loops today, and let's look at 97 00:04:36,840 --> 00:04:38,720 this example. 98 00:04:38,720 --> 00:04:40,940 We see it's a forall, so it's a parallel 99 00:04:40,940 --> 00:04:46,110 loop, or forall loop. 100 00:04:46,110 --> 00:04:48,430 When you know it's parallel -- in here, of course, 101 00:04:48,430 --> 00:04:51,030 the user said that. 
102 00:04:51,030 --> 00:04:53,930 What we can do is we can distribute the iterations by 103 00:04:53,930 --> 00:04:57,520 chunking up the iteration space into number-of-processors 104 00:04:57,520 --> 00:05:02,170 chunks, and basically run that. 105 00:05:02,170 --> 00:05:05,480 In SPMD mode, at the beginning the first processor 106 00:05:05,480 --> 00:05:10,120 can calculate the number of iterations you can run on each 107 00:05:10,120 --> 00:05:14,250 processor in here, and then you synchronize, you put a barrier 108 00:05:14,250 --> 00:05:17,170 there, so everybody kind of syncs up at that point. 109 00:05:17,170 --> 00:05:20,590 All the other processors are waiting, and at that point, everybody 110 00:05:20,590 --> 00:05:23,116 starts; when you reach this point each one is running its part 111 00:05:23,116 --> 00:05:25,420 of the iterations, and then you're going to put a barrier 112 00:05:25,420 --> 00:05:26,670 synchronization in place. 113 00:05:28,230 --> 00:05:32,150 Kind of obvious, parallel code basically in here, running on 114 00:05:32,150 --> 00:05:34,780 a shared memory machine at this point. 115 00:05:34,780 --> 00:05:36,310 So this is what we can do. 116 00:05:36,310 --> 00:05:39,000 I mean this is what we saw before. 117 00:05:39,000 --> 00:05:41,650 Of course, instead of doing that, you can also do fork 118 00:05:41,650 --> 00:05:44,890 join types, where once you want to run something in parallel, you 119 00:05:44,890 --> 00:05:49,220 can fork a thread and each thread gets some amount of 120 00:05:49,220 --> 00:05:51,480 iterations to run, and after that you merge together. 121 00:05:51,480 --> 00:05:54,180 So you can do both. 122 00:05:54,180 --> 00:05:55,290 So that's by hand. 123 00:05:55,290 --> 00:05:59,330 How do you do something like that by the compiler? 124 00:05:59,330 --> 00:06:01,540 That sounds simple enough, trivial enough. 125 00:06:01,540 --> 00:06:03,010 But how do you automate the entire process? 
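The chunking of a forall loop described above can be sketched as follows. This is a minimal illustration, not the code from the lecture slides: the names `chunk_bounds` and `parallel_forall` are hypothetical, each worker computes its own bounds (rather than the first processor computing them for everyone), and Python threads stand in for processors.

```python
import threading

def chunk_bounds(n, nprocs, pid):
    """Half-open range [lo, hi) of iterations owned by worker `pid` (hypothetical helper)."""
    chunk = (n + nprocs - 1) // nprocs  # ceiling division: iterations per worker
    lo = min(pid * chunk, n)
    hi = min(lo + chunk, n)
    return lo, hi

def parallel_forall(n, nprocs, body):
    """Run body(i) for i in range(n), split into nprocs contiguous chunks."""
    barrier = threading.Barrier(nprocs)

    def worker(pid):
        lo, hi = chunk_bounds(n, nprocs, pid)
        barrier.wait()              # everybody syncs up before the parallel loop
        for i in range(lo, hi):     # each worker runs only its part of the iterations
            body(i)
        barrier.wait()              # barrier at the end of the forall

    threads = [threading.Thread(target=worker, args=(p,)) for p in range(nprocs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                    # sequential code continues after this point

# A forall with no loop carried dependences: each iteration writes its own element.
squares = [0] * 10

def body(i):
    squares[i] = i * i

parallel_forall(10, 4, body)
```

With 10 iterations and 4 workers the chunks are `[0, 3)`, `[3, 6)`, `[6, 9)` and `[9, 10)`, so the last worker gets less work; that imbalance is the price of a simple block distribution.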
126 00:06:03,010 --> 00:06:06,480 How do you go about doing that? 127 00:06:06,480 --> 00:06:09,240 So, here are some normal loops, for loops. 128 00:06:09,240 --> 00:06:13,110 So the forall case was so simple because 129 00:06:13,110 --> 00:06:15,270 the forall construct means somebody looked at 130 00:06:15,270 --> 00:06:18,310 that and said this loop is parallel. 131 00:06:18,310 --> 00:06:21,470 But you look at these FOR loops, how many of these loops 132 00:06:21,470 --> 00:06:23,780 are parallel? 133 00:06:23,780 --> 00:06:26,940 Is the first loop parallel? 134 00:06:26,940 --> 00:06:27,160 Why? 135 00:06:27,160 --> 00:06:27,810 Why not? 136 00:06:27,810 --> 00:06:31,220 AUDIENCE: [OBSCURED.] 137 00:06:31,220 --> 00:06:36,480 PROFESSOR: It's not parallel because iteration one is 138 00:06:36,480 --> 00:06:38,860 using what you wrote in iteration zero. 139 00:06:38,860 --> 00:06:41,910 So iteration one has to wait until iteration zero is 140 00:06:41,910 --> 00:06:43,310 done, and so on. 141 00:06:43,310 --> 00:06:44,560 How about this one? 142 00:06:50,110 --> 00:06:50,460 Why? 143 00:06:50,460 --> 00:06:57,350 AUDIENCE: [NOISE.] 144 00:06:57,350 --> 00:07:01,380 PROFESSOR: Not really. 145 00:07:01,380 --> 00:07:04,500 So it's writing elements 0 to 5, it's reading 146 00:07:04,500 --> 00:07:08,040 elements 6 to 11. 147 00:07:08,040 --> 00:07:10,440 So they don't overlap. 148 00:07:10,440 --> 00:07:12,740 So what you read and what you write never overlap, so you 149 00:07:12,740 --> 00:07:16,240 can do it in any order, because a dependence 150 00:07:16,240 --> 00:07:18,990 means something you wrote, later you will read. 151 00:07:18,990 --> 00:07:19,920 This doesn't happen in here. 152 00:07:19,920 --> 00:07:33,600 How about this one? 153 00:07:33,600 --> 00:07:35,010 AUDIENCE: There's no dependence in there. 154 00:07:35,010 --> 00:07:35,250 PROFESSOR: Why? 155 00:07:35,250 --> 00:07:38,420 AUDIENCE: [OBSCURED.] 
156 00:07:38,420 --> 00:07:41,420 PROFESSOR: So you're writing even, you're reading odd. 157 00:07:41,420 --> 00:07:43,900 So there's no overlapping or anything like that. 158 00:07:43,900 --> 00:07:44,350 Question? 159 00:07:44,350 --> 00:07:47,020 OK. 160 00:07:47,020 --> 00:07:48,620 So, the way to look at that -- 161 00:07:48,620 --> 00:07:50,260 I'm going to go into a little bit of formalism. 162 00:07:50,260 --> 00:07:53,100 You can think about this as an iteration space. 163 00:07:53,100 --> 00:07:57,080 So if you look at each iteration separately, 164 00:07:57,080 --> 00:07:59,820 there could be thousands and millions of iterations and 165 00:07:59,820 --> 00:08:01,160 your compiler never [COUGHING] 166 00:08:01,160 --> 00:08:04,740 doing any work, and also some iteration spaces are defined by 167 00:08:04,740 --> 00:08:08,070 a range like 1 to n, so you don't even know exactly how 168 00:08:08,070 --> 00:08:09,905 many iterations are going to be there. 169 00:08:09,905 --> 00:08:13,310 So you can represent this as an abstract space. 170 00:08:13,310 --> 00:08:16,470 Normally, most of the loops you look at you 171 00:08:16,470 --> 00:08:17,470 normalize to step one. 172 00:08:17,470 --> 00:08:22,320 So what that means is it's all the integer points in that space. 173 00:08:22,320 --> 00:08:25,880 So if you have a loop like this, i equals 0 to 6, j 174 00:08:25,880 --> 00:08:27,600 equals i to 7. 175 00:08:27,600 --> 00:08:29,040 That's the iteration space, there are two 176 00:08:29,040 --> 00:08:31,080 dimensions in there. 177 00:08:31,080 --> 00:08:34,150 The points where each row starts are offset because it's not 178 00:08:34,150 --> 00:08:38,980 a rectangular space; it can have this structure because 179 00:08:38,980 --> 00:08:42,090 the j's go triangular in here. 
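To make the triangular iteration space concrete, here is a small sketch that simply enumerates it. It assumes the loop on the slide reads as `i` from 0 to 6 and `j` from `i` to 7 (both inclusive); the variable name `points` is illustrative only.

```python
# Enumerate the integer points of the iteration space in execution order.
# Assumed loop nest: for i in 0..6, for j in i..7 (inclusive bounds).
points = [(i, j) for i in range(0, 7) for j in range(i, 8)]

# Row i starts at j = i, so each row is one point shorter than the previous one.
row_lengths = [sum(1 for (i, _) in points if i == row) for row in range(7)]

print(len(points))    # total number of iterations
print(row_lengths)    # shrinking rows show the triangular shape
```

A compiler would not enumerate points like this, since the bounds may be huge or unknown at compile time; the listing is only meant to show what the abstract space contains.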
180 00:08:42,090 --> 00:08:44,340 So the way you can represent that is you can represent the 181 00:08:44,340 --> 00:08:48,990 iteration space by a vector i, and you can have one entry per 182 00:08:48,990 --> 00:08:50,340 dimension; here we use two dimensions. 183 00:08:50,340 --> 00:08:52,370 This is some i1, i2 space in here. 184 00:08:52,370 --> 00:08:54,900 So you can represent it like that. 185 00:08:54,900 --> 00:08:57,840 Then there's the notion of lexicographic ordering. 186 00:08:57,840 --> 00:09:00,540 That means if you execute the loop, what's the order in which you're 187 00:09:00,540 --> 00:09:01,990 going to visit these points. 188 00:09:01,990 --> 00:09:03,950 If you execute this loop, what you are going to do 189 00:09:03,950 --> 00:09:06,010 is you go from -- 190 00:09:06,010 --> 00:09:07,590 you go like this. 191 00:09:07,590 --> 00:09:09,810 This is the lexicographical ordering of 192 00:09:09,810 --> 00:09:12,070 everything in the loops. 193 00:09:12,070 --> 00:09:13,440 That's the normal execution order. 194 00:09:13,440 --> 00:09:15,440 That's the sequential order. 195 00:09:15,440 --> 00:09:17,940 At some point you want to make sure that anything we do kind 196 00:09:17,940 --> 00:09:20,210 of has the look and feel of the sequential 197 00:09:20,210 --> 00:09:23,430 lexicographical order. 198 00:09:23,430 --> 00:09:27,440 So, one thing you can say is if you have multiple 199 00:09:27,440 --> 00:09:33,610 dimensions, and there are two iterations, one iteration is 200 00:09:33,610 --> 00:09:37,180 lexicographically less than another iteration if, with all outer 201 00:09:37,180 --> 00:09:40,490 dimensions the same, you go to the first dimension 202 00:09:40,490 --> 00:09:44,650 where the numbers differ in the two iterations. 203 00:09:44,650 --> 00:09:46,960 Then that dictates if one is 204 00:09:46,960 --> 00:09:48,790 lexicographically less than the other. 
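The lexicographic comparison just described can be written down in a few lines. This is a minimal sketch; `lex_less` is a hypothetical helper name, and iteration vectors are modeled as plain tuples ordered from outermost to innermost loop index.

```python
def lex_less(a, b):
    """True if iteration vector a executes before iteration vector b."""
    for x, y in zip(a, b):
        if x != y:          # first dimension where the two iterations differ
            return x < y    # that dimension alone decides the order
    return False            # identical vectors: neither comes first

print(lex_less((1, 5), (2, 0)))  # outer index differs, inner index is ignored
print(lex_less((2, 1), (2, 3)))  # outer indices equal, so the next one decides
print(lex_less((2, 3), (2, 3)))  # same iteration
```

Python's built-in tuple comparison uses the same rule, so `a < b` on tuples of equal length gives the same answer.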
205 00:09:48,790 --> 00:09:51,840 So if the outer dimensions are the same, that means the next 206 00:09:51,840 --> 00:09:53,470 one decides, the next one decides, next one decides 207 00:09:53,470 --> 00:09:54,610 going down. 208 00:09:54,610 --> 00:09:57,000 The first one that's actually different decides who's before 209 00:09:57,000 --> 00:09:58,250 the other one. 210 00:10:00,630 --> 00:10:04,515 So another concept is called the affine loop nest. An affine loop 211 00:10:04,515 --> 00:10:08,770 nest says loop bounds are integer linear functions of 212 00:10:08,770 --> 00:10:11,840 constants, loop constant variables 213 00:10:11,840 --> 00:10:14,200 and outer loop indices. 214 00:10:14,200 --> 00:10:17,525 So that means if you want to get an affine function within a 215 00:10:17,525 --> 00:10:21,370 loop, that has to be a linear integer function 216 00:10:21,370 --> 00:10:26,500 where all the terms are either constants or loop 217 00:10:26,500 --> 00:10:26,950 constants -- 218 00:10:26,950 --> 00:10:29,760 that means a variable that doesn't change in the loop -- or 219 00:10:29,760 --> 00:10:30,940 outer loop indices. 220 00:10:30,940 --> 00:10:32,810 That makes it much easier to analyze. 221 00:10:35,550 --> 00:10:39,890 Also, for array accesses, in each dimension the access function has 222 00:10:39,890 --> 00:10:41,430 the same property. 223 00:10:41,430 --> 00:10:44,670 So of course, there are many programs that don't satisfy 224 00:10:44,670 --> 00:10:46,730 this, for example, if we do FFT. 225 00:10:46,730 --> 00:10:48,450 That doesn't satisfy that because you have 226 00:10:48,450 --> 00:10:50,390 exponentials in there. 227 00:10:50,390 --> 00:10:53,900 But what that means is with FFT, there's probably no way that 228 00:10:53,900 --> 00:10:55,620 the compiler's going to analyze that. 
229 00:10:55,620 --> 00:11:01,060 But most kinds of loops fit this kind of model and then 230 00:11:01,060 --> 00:11:03,120 you can put them into a nice mathematical framework and 231 00:11:03,120 --> 00:11:05,840 analyze that. What I'm going to do is kind of follow 232 00:11:05,840 --> 00:11:06,930 through some of the mathematical 233 00:11:06,930 --> 00:11:10,280 framework with you guys. 234 00:11:10,280 --> 00:11:14,100 So, what you can do here is if you look at this one, instead 235 00:11:14,100 --> 00:11:20,280 of representing this iteration space by each iteration, which 236 00:11:20,280 --> 00:11:23,650 can be huge or which is not even known at compile time, 237 00:11:23,650 --> 00:11:27,890 what you can do is you can represent this by kind of a 238 00:11:27,890 --> 00:11:31,800 bounding space of iterations, basically. 239 00:11:31,800 --> 00:11:35,270 So what this is, we don't mark every box there, but we say 240 00:11:35,270 --> 00:11:37,160 OK, look, if you put these planes -- 241 00:11:37,160 --> 00:11:40,650 I put four planes in here, and everything inside these planes 242 00:11:40,650 --> 00:11:43,120 represents this iteration space. 243 00:11:43,120 --> 00:11:46,850 That's nice because instead of going 0 to 6, if you go 0 to 244 00:11:46,850 --> 00:11:51,010 60,000, still I have the same equations, I don't suddenly 245 00:11:51,010 --> 00:11:55,370 have 6 million data points in here I need to represent. 246 00:11:55,370 --> 00:12:00,230 So, my representation doesn't grow with the size of my 247 00:12:00,230 --> 00:12:01,155 iteration space. 248 00:12:01,155 --> 00:12:03,500 It grows with the shape of this iteration space. 249 00:12:03,500 --> 00:12:06,590 If you have a complicated one, it can be difficult. 250 00:12:06,590 --> 00:12:08,890 So what you can do is write the iteration space as all 251 00:12:08,890 --> 00:12:13,140 iterations with i zero to six, j is i to 7. 252 00:12:13,140 --> 00:12:16,240 These are all linear functions. 
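The "four planes" idea can be sketched by storing the iteration space as a list of linear inequalities instead of a list of points. The encoding below is an assumption made for illustration: each constraint is a triple `(a, b, c)` meaning `a*i + b*j + c >= 0`, and the names `make_space` and `contains` are hypothetical.

```python
def make_space(i_max, j_max):
    """Four bounding half-planes for: 0 <= i <= i_max, i <= j <= j_max."""
    return [
        (1, 0, 0),        # i >= 0
        (-1, 0, i_max),   # i <= i_max
        (-1, 1, 0),       # j >= i
        (0, -1, j_max),   # j <= j_max
    ]

def contains(space, i, j):
    """A point is an iteration exactly when it satisfies every inequality."""
    return all(a * i + b * j + c >= 0 for (a, b, c) in space)

small = make_space(6, 7)
large = make_space(60000, 60001)

print(len(small), len(large))       # same number of constraints for both sizes
print(contains(small, 3, 5))        # inside the triangle
print(contains(small, 3, 2))        # j < i, so outside
```

The description stays at four constraints whether the loop runs 7 or 60,001 times; only the constant terms change, which is the point made in the lecture about the representation growing with the shape and not the size.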
253 00:12:16,240 --> 00:12:18,530 That makes our analysis easier. 254 00:12:18,530 --> 00:12:21,570 So the flip side of that is the data space. 255 00:12:21,570 --> 00:12:24,200 So, an m dimensional array has an m dimensional 256 00:12:24,200 --> 00:12:27,290 discrete Cartesian space. 257 00:12:27,290 --> 00:12:30,520 Basically, in the data space you don't have arrays that are 258 00:12:30,520 --> 00:12:31,240 odd shaped. 259 00:12:31,240 --> 00:12:34,420 It's almost always a hypercube. 260 00:12:34,420 --> 00:12:38,790 So something like that is a one dimensional space and 261 00:12:38,790 --> 00:12:40,130 something can be represented as a two 262 00:12:40,130 --> 00:12:42,470 dimensional space in here. 263 00:12:42,470 --> 00:12:45,990 So the data space has this nice property, in that sense it's a 264 00:12:45,990 --> 00:12:48,140 multi-dimensional hypercube. 265 00:12:48,140 --> 00:12:51,290 And what that gives you is kind of a bunch of 266 00:12:51,290 --> 00:12:54,525 mathematical techniques to kind of do and at least see 267 00:12:54,525 --> 00:12:56,450 some transformations we need to do in compiling. 268 00:12:59,570 --> 00:13:01,470 As humans, I think we can look at a lot more complicated 269 00:13:01,470 --> 00:13:06,320 loops by hand, and get a better idea what's going on. 270 00:13:06,320 --> 00:13:08,850 But in a compiler you need to have a very simple way of 271 00:13:08,850 --> 00:13:11,910 describing what to analyze, what to formulate, and having 272 00:13:11,910 --> 00:13:14,900 this model helps you put it into a nice mathematical framework 273 00:13:14,900 --> 00:13:17,250 you can work with. 274 00:13:17,250 --> 00:13:18,160 So the next thing is dependence. 275 00:13:18,160 --> 00:13:20,770 We have done that so I will go through this fast. So the 276 00:13:20,770 --> 00:13:22,330 first is a true dependence. 277 00:13:22,330 --> 00:13:25,890 What that means is I wrote something, I read it here. 
278 00:13:25,890 --> 00:13:27,750 So I really meant that, I actually 279 00:13:27,750 --> 00:13:29,860 really use that value. 280 00:13:29,860 --> 00:13:34,020 There are two other dependences, mainly because we are defining 281 00:13:34,020 --> 00:13:37,050 dependence on a location. One is an anti-dependence. 282 00:13:37,050 --> 00:13:39,870 That means I can't write it until this read is done 283 00:13:39,870 --> 00:13:41,460 because I can't destroy the value. 284 00:13:41,460 --> 00:13:44,480 Output dependence is there, so there's an ordering of writes that you 285 00:13:44,480 --> 00:13:45,730 need to maintain. 286 00:13:48,010 --> 00:13:53,920 So for dynamic instances, a data dependence exists between i and 287 00:13:53,920 --> 00:13:59,130 j if either i or j is a write operation, and i and j refer 288 00:13:59,130 --> 00:14:01,780 to the same variable, and i executes before j. 289 00:14:01,780 --> 00:14:05,590 So it's the same thing, one executes before the other. 290 00:14:05,590 --> 00:14:07,590 So depending on which one is the write and which 291 00:14:07,590 --> 00:14:10,680 gets there first in time, it becomes either true or anti. 292 00:14:10,680 --> 00:14:17,150 So a dependence is always going to be positive over time. 293 00:14:17,150 --> 00:14:19,050 So how about array accesses? 294 00:14:19,050 --> 00:14:21,930 So for one element, you can figure out what happened. 295 00:14:21,930 --> 00:14:23,890 So how do you do dependence on array accesses? 296 00:14:23,890 --> 00:14:26,040 Now things get a little bit complicated, because an array is 297 00:14:26,040 --> 00:14:28,210 not one element. 298 00:14:28,210 --> 00:14:29,750 So that's when you go to dependence analysis. 299 00:14:32,660 --> 00:14:36,620 So I will describe this using a bunch of examples. 300 00:14:36,620 --> 00:14:39,710 So in order to look at arrays, there are two spaces I need to 301 00:14:39,710 --> 00:14:40,960 worry about. 302 00:14:40,960 --> 00:14:44,930 One is the iteration space, one is the data space. 
303 00:14:44,930 --> 00:14:49,410 What we want to do is figure out what happens at every 304 00:14:49,410 --> 00:14:52,700 iteration for the data and what the dependences are, and kind of 305 00:14:52,700 --> 00:14:55,020 summarize this down. 306 00:14:55,020 --> 00:14:58,320 We don't want to look at, say OK, iteration one depends on 307 00:14:58,320 --> 00:15:00,470 the second, two depends on the third -- you don't want to list 308 00:15:00,470 --> 00:15:01,090 everything. 309 00:15:01,090 --> 00:15:02,390 We need to come up with a summary -- 310 00:15:02,390 --> 00:15:05,300 that's what basically dependence analysis will do. 311 00:15:05,300 --> 00:15:08,110 So if you have this access, this is this loop. 312 00:15:08,110 --> 00:15:12,000 What happens is as we run down, so the iterations we are 313 00:15:12,000 --> 00:15:12,970 running down here. 314 00:15:12,970 --> 00:15:15,760 So we have iteration zero, 1, 2, 3, 4, 5. 315 00:15:15,760 --> 00:15:18,040 First do the read, write, read, write. 316 00:15:18,040 --> 00:15:20,440 So this is kind of time going down there. 317 00:15:20,440 --> 00:15:23,860 What you do is in this one you are 318 00:15:23,860 --> 00:15:25,730 reading and you are writing. 319 00:15:25,730 --> 00:15:27,760 You're reading and writing, so you have a 320 00:15:27,760 --> 00:15:29,690 dependence like that. 321 00:15:29,690 --> 00:15:30,940 You see, that's an anti-dependence. 322 00:15:34,410 --> 00:15:36,270 Read, then write -- anti-dependence, I have an 323 00:15:36,270 --> 00:15:38,090 anti-dependence going on here. 324 00:15:38,090 --> 00:15:39,690 If you look at it, here's a dependence vector. 325 00:15:39,690 --> 00:15:42,270 What that means is there's a dependence at each of those 326 00:15:42,270 --> 00:15:45,870 things in there -- that's the anti-dependence going on. 327 00:15:45,870 --> 00:15:49,020 One way to look at a summary of this is: what is my iteration? 328 00:15:49,020 --> 00:15:52,210 My iteration goes like -- what's my dependence? 
329 00:15:52,210 --> 00:15:56,530 I have an anti-dependence within the same iteration, because my 330 00:15:56,530 --> 00:15:57,970 read and write have a 331 00:15:57,970 --> 00:15:59,990 dependence in the same iteration. 332 00:15:59,990 --> 00:16:01,990 So this is a way to kind of describe that. 333 00:16:01,990 --> 00:16:03,240 So a different one. 334 00:16:05,890 --> 00:16:07,060 This one. 335 00:16:07,060 --> 00:16:13,350 I did A[i + 1] equals A[i]. So what you realize is in iteration 336 00:16:13,350 --> 00:16:18,880 zero, you 337 00:16:18,880 --> 00:16:24,270 read A[0] and you wrote A[1], and in iteration 1, you read A[1] 338 00:16:24,270 --> 00:16:28,820 and wrote A[2], basically. 339 00:16:28,820 --> 00:16:31,210 Now what you have is your dependence is like 340 00:16:31,210 --> 00:16:34,600 that, going like that. 341 00:16:34,600 --> 00:16:37,090 So if you look at what's happening in here, if you 342 00:16:37,090 --> 00:16:39,890 summarize in here, what you have is a dependence going 343 00:16:39,890 --> 00:16:42,960 like that in iteration space. 344 00:16:42,960 --> 00:16:46,300 So that means iteration 1 actually has a 345 00:16:46,300 --> 00:16:49,130 true dependence: it uses something that iteration 346 00:16:49,130 --> 00:16:52,660 zero wrote; iteration 2 uses something from iteration 1, and you 347 00:16:52,660 --> 00:16:55,530 have dependences going like that. 348 00:16:55,530 --> 00:16:59,240 Sometimes this can be summarized as the dependence 349 00:16:59,240 --> 00:17:00,490 vector of 1. 350 00:17:05,750 --> 00:17:10,700 The previous one was zero because there's no loop 351 00:17:10,700 --> 00:17:11,480 carried dependence. 352 00:17:11,480 --> 00:17:13,650 In the outer loop there's a dependence of 1. 
353 00:17:13,650 --> 00:17:21,850 So if you have this one, A[i + 2], of course, the dependence gets 354 00:17:21,850 --> 00:17:26,510 carried across in here, skipping one iteration, and then you have a skip-by-one 355 00:17:26,510 --> 00:17:30,080 representation in here. 356 00:17:30,080 --> 00:17:32,320 If you have A[2i] and A[2i + 1], what you realize 357 00:17:32,320 --> 00:17:34,370 is there's no overlap. 358 00:17:34,370 --> 00:17:35,660 So there's basically no dependence. 359 00:17:38,840 --> 00:17:42,040 You kind of get how that analysis goes. 360 00:17:42,040 --> 00:17:46,810 So, to find data dependence in a loop, there's a little 361 00:17:46,810 --> 00:17:47,530 bit of legalese. 362 00:17:47,530 --> 00:17:48,520 So let me try to do that. 363 00:17:48,520 --> 00:17:54,220 So for every pair of array accesses, what you want to 364 00:17:54,220 --> 00:17:59,940 find is: is there a dynamic instance that happened, 365 00:17:59,940 --> 00:18:04,640 an iteration that wrote a value, and another dynamic 366 00:18:04,640 --> 00:18:08,240 instance that happened later that actually used that value. 367 00:18:08,240 --> 00:18:11,690 So the first access, there's a dynamic instance 368 00:18:11,690 --> 00:18:17,820 that wrote, or that accessed, and another iteration instance 369 00:18:17,820 --> 00:18:20,250 that also accessed the same location later. 370 00:18:20,250 --> 00:18:21,980 And one of them has to be a write, otherwise 371 00:18:21,980 --> 00:18:23,930 there's no true or anti-dependence. 372 00:18:23,930 --> 00:18:25,860 That's the notion that the second one came 373 00:18:25,860 --> 00:18:28,350 after the first one. 374 00:18:28,350 --> 00:18:30,000 You can also look at the same array access. 375 00:18:30,000 --> 00:18:32,270 It doesn't have to be two different accesses; it can be the 376 00:18:32,270 --> 00:18:33,510 same array access if you are writing. 377 00:18:33,510 --> 00:18:36,150 If you look at the same array access writing you can have 378 00:18:36,150 --> 00:18:37,370 output dependences also. 
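The pairwise check described here can be illustrated with a brute-force sketch that tries every pair of iterations and records how far apart the write and the read are. This is only a teaching aid under stated assumptions: the loop is taken to run `i = 0 .. n-1`, the access functions are passed in as Python lambdas, and `dependence_distances` is a hypothetical name. A real compiler summarizes this symbolically instead of enumerating.

```python
def dependence_distances(write_index, read_index, n):
    """Set of (ir - iw) over all iteration pairs where the write in iteration iw
    and the read in iteration ir touch the same array element."""
    distances = set()
    for iw in range(n):
        for ir in range(n):
            if write_index(iw) == read_index(ir):
                distances.add(ir - iw)
    return distances

n = 6  # assumed trip count, iterations 0..5

same_iter = dependence_distances(lambda i: i,     lambda i: i,         n)  # A[i]   = A[i]
carried_1 = dependence_distances(lambda i: i + 1, lambda i: i,         n)  # A[i+1] = A[i]
carried_2 = dependence_distances(lambda i: i + 2, lambda i: i,         n)  # A[i+2] = A[i]
even_odd  = dependence_distances(lambda i: 2 * i, lambda i: 2 * i + 1, n)  # A[2i]  = A[2i+1]

print(same_iter, carried_1, carried_2, even_odd)
```

A distance of 0 means the two accesses meet only inside one iteration, a nonzero distance means the dependence is carried across iterations, and an empty set means the written and read elements never overlap.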
379 00:18:37,370 --> 00:18:41,580 So it's basically between a read and a write, and a write 380 00:18:41,580 --> 00:18:42,590 and a write. 381 00:18:42,590 --> 00:18:45,600 Two different writes, or it can be the same write too. 382 00:18:45,600 --> 00:18:47,590 Key thing is we are looking at location. 383 00:18:47,590 --> 00:18:49,405 We're not looking at the value path; we ask who's actually accessing 384 00:18:49,405 --> 00:18:52,560 the same location. 385 00:18:52,560 --> 00:18:55,360 Loop carried dependence means the dependence 386 00:18:55,360 --> 00:18:57,790 crosses a loop boundary. 387 00:18:57,790 --> 00:19:03,100 That means the person who read and the person who wrote are in 388 00:19:03,100 --> 00:19:06,040 different loop iterations. 389 00:19:06,040 --> 00:19:08,290 If it's in the same iteration, then it's all local, because 390 00:19:08,290 --> 00:19:10,570 in my iteration I deal with that, I moved data around. 391 00:19:10,570 --> 00:19:13,300 But if what I'm writing is used by somebody else in a different 392 00:19:13,300 --> 00:19:18,220 iteration, I have a loop carried dependence going on. 393 00:19:18,220 --> 00:19:20,880 Basic thing is if there's a loop carried dependence, that loop is 394 00:19:20,880 --> 00:19:23,650 not parallelizable. 395 00:19:23,650 --> 00:19:26,800 What that means is I am writing in one iteration of 396 00:19:26,800 --> 00:19:28,830 the loop and somebody is reading in a different iteration 397 00:19:28,830 --> 00:19:29,630 of the loop. 398 00:19:29,630 --> 00:19:31,960 That means I actually have to move the data across, so they can't 399 00:19:31,960 --> 00:19:33,340 happen in parallel. 400 00:19:33,340 --> 00:19:34,590 That's a very simple way of looking at that. 401 00:19:37,930 --> 00:19:41,510 So, what we have done is -- 402 00:19:41,510 --> 00:19:44,550 OK, the basic idea is how to actually go and 403 00:19:44,550 --> 00:19:46,740 automate this process. 
404 00:19:46,740 --> 00:19:49,050 The simple notion is called data dependence analysis, and 405 00:19:49,050 --> 00:19:51,850 I will give you a formulation of that. 406 00:19:51,850 --> 00:19:57,700 So what you can formally do is use a set of equations. 407 00:19:57,700 --> 00:20:01,140 So what you want to say is there are two distinct 408 00:20:01,140 --> 00:20:03,110 iterations, one is the write iteration, 409 00:20:03,110 --> 00:20:06,150 one is the read iteration. 410 00:20:06,150 --> 00:20:07,700 One iteration writes the value, one 411 00:20:07,700 --> 00:20:09,200 iteration reads the value. 412 00:20:09,200 --> 00:20:14,200 So the write iteration basically writes A[iw + 1], the 413 00:20:14,200 --> 00:20:16,620 read iteration reads A[ir]. 414 00:20:16,620 --> 00:20:21,306 So we know both read and write have to be within the loop 415 00:20:21,306 --> 00:20:23,360 bounds, because we know that we can't be 416 00:20:23,360 --> 00:20:24,920 outside the loop bounds. 417 00:20:24,920 --> 00:20:28,410 Then we also want to make sure that it's a loop carried 418 00:20:28,410 --> 00:20:30,330 dependence, that means read and write can't 419 00:20:30,330 --> 00:20:31,560 be in the same iteration. 420 00:20:31,560 --> 00:20:33,330 If it's in the same iteration, I don't have a loop carried 421 00:20:33,330 --> 00:20:34,120 dependence. 422 00:20:34,120 --> 00:20:37,070 I am looking for a loop carried dependence at this point. 423 00:20:37,070 --> 00:20:41,250 Then what makes both the read and write 424 00:20:41,250 --> 00:20:42,790 access the same location? 425 00:20:42,790 --> 00:20:44,580 That means the access functions have to be the same. 426 00:20:44,580 --> 00:20:48,550 So the write access function is iw plus 1, and the read access 427 00:20:48,550 --> 00:20:51,330 function is ir. 428 00:20:51,330 --> 00:20:55,380 So the key thing is now we have set up the equations. 
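Written out, the system for the loop `A[i + 1] = A[i]` looks roughly like the following. The upper bound `N` is an assumption introduced here for illustration, since the transcript does not state the loop bounds; `i_w` and `i_r` are the write and read iterations named in the lecture.

```latex
\begin{aligned}
& 0 \le i_w \le N, \qquad 0 \le i_r \le N
  && \text{(both iterations lie within the loop bounds)} \\
& i_w \ne i_r
  && \text{(loop carried: the write and the read are in different iterations)} \\
& i_w + 1 = i_r
  && \text{(the write access } A[i_w + 1] \text{ and the read access } A[i_r] \text{ hit the same location)}
\end{aligned}
```

The loop has a loop carried dependence exactly when some pair of integers satisfies all of these constraints at once.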
429 00:20:55,380 --> 00:20:59,460 Are there any values for i and j, integer values, I'm 430 00:20:59,460 --> 00:21:03,470 sorry, iw and ir, such that these equations are true? 431 00:21:03,470 --> 00:21:06,140 If that is the case, we can say ah-ha, that is the case, 432 00:21:06,140 --> 00:21:10,630 the write and the read are happening 433 00:21:10,630 --> 00:21:13,210 in two different iterations -- one write iteration, one 434 00:21:13,210 --> 00:21:16,970 read iteration, touching the same location. 435 00:21:16,970 --> 00:21:18,460 Therefore that's a different [OBSCURED]. 436 00:21:18,460 --> 00:21:18,840 Is this true? 437 00:21:18,840 --> 00:21:20,280 Is there a set of values that makes this true? 438 00:21:28,710 --> 00:21:33,480 Yeah, I mean you can do iw equals 1 439 00:21:33,480 --> 00:21:36,120 and ir equals 2. 440 00:21:36,120 --> 00:21:38,670 So there's a value in there so these equations will come up 441 00:21:38,670 --> 00:21:43,560 with a solution, and at that point you have a dependence. 442 00:21:43,560 --> 00:21:51,670 AUDIENCE: [NOISE] 443 00:21:51,670 --> 00:21:56,260 PROFESSOR: So it's not always easy to make this formulation. 444 00:21:56,260 --> 00:21:59,620 So if the indices are calculated with something other than 445 00:21:59,620 --> 00:22:02,670 loop variables, I can't write the formulation. 446 00:22:02,670 --> 00:22:07,250 So the condition under which I can do this analysis is that the indices have 447 00:22:07,250 --> 00:22:09,110 to be built from constants and loop induction variables. 448 00:22:09,110 --> 00:22:16,540 Take A of B of i. So if my array access is A of B of i, I don't 449 00:22:16,540 --> 00:22:21,790 know how the numbers work if you have A of B of i. 450 00:22:21,790 --> 00:22:24,810 I have no idea which location of A it is without knowing 451 00:22:24,810 --> 00:22:25,770 the values of B of i. 452 00:22:25,770 --> 00:22:27,780 And B of i, I can't summarize it.
453 00:22:27,780 --> 00:22:31,330 Each B of i might be different and I can't come up with this 454 00:22:31,330 --> 00:22:34,750 nice single formulation that can check every B of i. 455 00:22:34,750 --> 00:22:36,330 And I'm in big trouble. 456 00:22:36,330 --> 00:22:50,070 This is doable, but this is not easy to do like this. 457 00:22:50,070 --> 00:22:50,780 Question? 458 00:22:50,780 --> 00:22:53,150 AUDIENCE: [NOISE] 459 00:22:53,150 --> 00:22:54,000 PROFESSOR: Yeah, that's right. 460 00:22:54,000 --> 00:22:57,800 So that's the interesting thing that we are not looking at. 461 00:22:57,800 --> 00:23:00,400 Because when we summarize it, what we are going to 462 00:23:00,400 --> 00:23:02,400 do is try to summarize for everything, 463 00:23:02,400 --> 00:23:05,730 every iteration, and we are not trying to divide it up by 464 00:23:05,730 --> 00:23:07,860 saying OK, can I find the parallel groups. 465 00:23:07,860 --> 00:23:08,480 Yes. 466 00:23:08,480 --> 00:23:10,340 You can do some more complicated analysis and do 467 00:23:10,340 --> 00:23:11,410 something like that. 468 00:23:11,410 --> 00:23:13,060 Yes. 469 00:23:13,060 --> 00:23:15,850 So the other interesting thing is OK, the next thing you want to 470 00:23:15,850 --> 00:23:20,020 see is whether you can find an output dependence. 471 00:23:20,020 --> 00:23:22,365 OK, are there two different iterations that are 472 00:23:22,365 --> 00:23:25,360 writing the same thing? 473 00:23:25,360 --> 00:23:29,350 What that means is the iterations are i1, i2, and i1 474 00:23:29,350 --> 00:23:33,190 not equals i2, and i1 plus 1 equals i2 plus 1. 475 00:23:33,190 --> 00:23:37,120 There's no solution to this one because i1 has to be 476 00:23:37,120 --> 00:23:39,820 equal to i2 according to this, and i1 cannot be equal to i2 477 00:23:39,820 --> 00:23:40,400 according to this one.
478 00:23:40,400 --> 00:23:44,020 That says OK, look, I don't have an output dependence because 479 00:23:44,020 --> 00:23:45,880 it can't be satisfied. 480 00:23:45,880 --> 00:23:49,880 OK, so here I know I have a loop-carried -- 481 00:23:49,880 --> 00:23:52,220 I haven't said true or anti, it depends on which 482 00:23:52,220 --> 00:23:54,370 direction this is. 483 00:23:54,370 --> 00:23:57,386 True or anti-dependence, but I don't have a loop-carried output 484 00:23:57,386 --> 00:24:01,070 [OBSCURED]. 485 00:24:01,070 --> 00:24:02,870 So how do we generalize this? 486 00:24:02,870 --> 00:24:06,410 So what you can do is use an integer vector i, so in order 487 00:24:06,410 --> 00:24:08,940 to generalize this, you can use integer programming. 488 00:24:08,940 --> 00:24:11,445 How many of you know integer programming or linear 489 00:24:11,445 --> 00:24:12,260 programming? 490 00:24:12,260 --> 00:24:14,390 OK. 491 00:24:14,390 --> 00:24:18,350 We are not going to go into detail, but I'll tell you what 492 00:24:18,350 --> 00:24:19,280 actually happens. 493 00:24:19,280 --> 00:24:24,050 So integer programming says there's a vector of variables 494 00:24:24,050 --> 00:24:28,570 i, and if you have a formulation like this, that 495 00:24:28,570 --> 00:24:32,360 is, A times i is less than or equal to b, where A and b are all 496 00:24:32,360 --> 00:24:38,230 constant integers, then you can use integer programming, and 497 00:24:38,230 --> 00:24:42,290 you can see whether there's a solution for i or not. 498 00:24:42,290 --> 00:24:45,120 If you do things like operations research, there's a 499 00:24:45,120 --> 00:24:46,890 lot of work around it. 500 00:24:46,890 --> 00:24:49,350 People actually want to know what the value of i is. We don't care 501 00:24:49,350 --> 00:24:51,445 that much what the values are, we just want to know whether there's 502 00:24:51,445 --> 00:24:53,520 a solution or not. 503 00:24:53,520 --> 00:24:55,600 If there's a solution, we know that there's a dependence.
504 00:24:55,600 --> 00:24:57,520 If there's no solution we know there's no dependence. 505 00:24:57,520 --> 00:24:59,810 So what we need to do is get this set of equations and 506 00:24:59,810 --> 00:25:02,420 put it in that form. 507 00:25:02,420 --> 00:25:03,140 That's simple. 508 00:25:03,140 --> 00:25:08,350 For example, what you want is A times i less than or equal to b -- 509 00:25:08,350 --> 00:25:14,680 that means you have constants, a1 i1 plus a2 i2, which is 510 00:25:14,680 --> 00:25:19,870 less than or equal to b. So you want to have 511 00:25:19,870 --> 00:25:22,500 this kind of a system. 512 00:25:22,500 --> 00:25:27,050 Not equals doesn't really belong there. 513 00:25:27,050 --> 00:25:29,390 So the way you deal with not equals is you split it into two 514 00:25:29,390 --> 00:25:34,710 different problems. You can say iw less than ir is one 515 00:25:34,710 --> 00:25:39,590 problem, and iw greater than ir is the other problem, and if 516 00:25:39,590 --> 00:25:42,070 either problem has a solution, you have a dependence. 517 00:25:42,070 --> 00:25:44,710 So that means one is true and one is anti. 518 00:25:44,710 --> 00:25:46,970 You can see whether it's a true dependence or an anti-dependence, 519 00:25:46,970 --> 00:25:50,580 you can look at that. 520 00:25:50,580 --> 00:25:52,610 This one is a little bit easier. 521 00:25:52,610 --> 00:25:56,890 This is less than, not actually less than -- 522 00:25:56,890 --> 00:25:58,140 less than or equal. 523 00:26:01,900 --> 00:26:04,520 How do you deal with equal? 524 00:26:04,520 --> 00:26:06,400 So the way you deal with equal is you write it in both 525 00:26:06,400 --> 00:26:07,330 directions. 526 00:26:07,330 --> 00:26:11,450 So A less than or equal to B, and B less 527 00:26:11,450 --> 00:26:14,464 than or equal to A, means A is actually equal to B. So you 528 00:26:14,464 --> 00:26:17,413 can actually write two different inequalities and get the equality 529 00:26:17,413 --> 00:26:17,840 down there.
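To make the encoding concrete, here is a hedged sketch of how the constraints for the assumed loop `A[i+1] = A[i]` fit the "A times i less than or equal to b" form. The loop bounds 0 to 5, the search box, and the `feasible` helper are assumptions for illustration; the brute-force search only stands in for a real integer programming package. Each row `(a_w, a_r, b)` means `a_w*iw + a_r*ir <= b`. On integers, a strict `iw < ir` becomes `iw - ir <= -1`.

```python
from itertools import product

def feasible(rows, box):
    """True if some integer pair (iw, ir) inside the search box
    satisfies every row a_w*iw + a_r*ir <= b."""
    lo, hi = box
    return any(
        all(a_w * iw + a_r * ir <= b for a_w, a_r, b in rows)
        for iw, ir in product(range(lo, hi + 1), repeat=2)
    )

LO, HI = 0, 5          # assumed loop bounds
BOX = (-10, 10)        # search box wider than the loop, so the bound rows matter

# LO <= iw <= HI and LO <= ir <= HI, each written as a "<=" row
loop_bounds = [(-1, 0, -LO), (1, 0, HI), (0, -1, -LO), (0, 1, HI)]
# iw + 1 == ir written in both directions
same_location = [(1, -1, -1), (-1, 1, 1)]
# iw != ir split into two separate problems
write_first = [(1, -1, -1)]    # iw < ir
read_first = [(-1, 1, -1)]     # ir < iw

true_dep = feasible(loop_bounds + same_location + write_first, BOX)
anti_dep = feasible(loop_bounds + same_location + read_first, BOX)

# Output dependence: two writes, i1 + 1 == i2 + 1 with i1 < i2
same_write = [(1, -1, 0), (-1, 1, 0)]
output_dep = feasible(loop_bounds + same_write + write_first, BOX)
```

Under these assumptions only the write-before-read problem has a solution, which matches the conclusion that this loop has a loop-carried dependence in one direction and no output dependence.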
530 00:26:17,840 --> 00:26:20,850 So you have to kind of massage things a little bit in here. 531 00:26:20,850 --> 00:26:27,620 So here are our original iteration bounds, and here's 532 00:26:27,620 --> 00:26:32,800 our one problem because we are saying the write happens before 533 00:26:32,800 --> 00:26:33,950 the read, so this is the true dependence that 534 00:26:33,950 --> 00:26:37,050 we are looking at. 535 00:26:37,050 --> 00:26:43,550 This is saying that the write location is the same as the 536 00:26:43,550 --> 00:26:45,560 read location and this is an equality, so I have two different 537 00:26:45,560 --> 00:26:46,930 inequalities in here. 538 00:26:46,930 --> 00:26:49,520 So kind of massage this a little bit to put it in that 539 00:26:49,520 --> 00:26:52,840 form, and we can come up with A's and b's. 540 00:26:52,840 --> 00:26:56,690 These are just manual steps, A's and b's, and now we are 541 00:26:56,690 --> 00:27:02,050 going to throw it into some super duper integer linear 542 00:27:02,050 --> 00:27:05,440 program package and it will say yes or no and you're set. 543 00:27:08,540 --> 00:27:09,820 And of course, you have to do another problem 544 00:27:09,820 --> 00:27:12,370 for the other side. 545 00:27:12,370 --> 00:27:16,780 You can generalize it for much more complex loop nests. So if 546 00:27:16,780 --> 00:27:19,310 you have this complicated loop nest in here, if 547 00:27:19,310 --> 00:27:21,950 you've got an n-deep nest, you have to solve 2n problems 548 00:27:21,950 --> 00:27:23,720 with all these different constraints. 549 00:27:23,720 --> 00:27:24,590 I'm not going to go over this. 550 00:27:24,590 --> 00:27:28,090 I have the slides in here. 551 00:27:28,090 --> 00:27:31,820 So that's the single dimension. 552 00:27:31,820 --> 00:27:35,770 So how about multi-dimensional dependences? 553 00:27:35,770 --> 00:27:39,580 So I have a two-dimensional iteration space here, and I 554 00:27:39,580 --> 00:27:43,350 have A of i, j equals A of i, j minus 1.
555 00:27:43,350 --> 00:27:45,140 That's my iteration space. 556 00:27:45,140 --> 00:27:47,240 What does my dependence look like? 557 00:27:47,240 --> 00:27:48,490 We have arrows too. 558 00:27:58,480 --> 00:27:59,730 Which direction are the arrows going? 559 00:27:59,730 --> 00:28:02,970 AUDIENCE: [OBSCURED] 560 00:28:02,970 --> 00:28:04,840 PROFESSOR: We have something like this. 561 00:28:04,840 --> 00:28:06,680 Yup. 562 00:28:06,680 --> 00:28:10,990 We have something like this because that's j minus 1, the 563 00:28:10,990 --> 00:28:12,470 i's are the same. 564 00:28:12,470 --> 00:28:16,750 Of course, if you have it the other way around, they go the other 565 00:28:16,750 --> 00:28:20,030 direction, one is anti and one is a true dependence, so you 566 00:28:20,030 --> 00:28:22,410 can figure that one out. 567 00:28:22,410 --> 00:28:23,730 And let's do something more complicated. 568 00:28:23,730 --> 00:28:25,670 First one. 569 00:28:25,670 --> 00:28:30,750 So A of i, j equals A of i minus 1, j plus 1. 570 00:28:30,750 --> 00:28:32,580 Which has to be a diagonal. 571 00:28:32,580 --> 00:28:37,910 Which diagonal does it go along? 572 00:28:37,910 --> 00:28:39,280 This way or this way? 573 00:28:42,900 --> 00:28:44,150 Who says this way? 574 00:28:46,910 --> 00:28:48,160 Who says this way? 575 00:28:51,820 --> 00:28:57,330 So, this is actually going in this direction. 576 00:29:00,630 --> 00:29:02,680 This is where you have to actually think about which iteration 577 00:29:02,680 --> 00:29:04,750 is actually the write and which is the read in here. 578 00:29:04,750 --> 00:29:06,200 So things get complicated. 579 00:29:06,200 --> 00:29:08,060 This one is even more interesting. 580 00:29:08,060 --> 00:29:08,770 This one. 581 00:29:08,770 --> 00:29:11,715 There's only a one-dimensional array but a two-dimensional loop 582 00:29:11,715 --> 00:29:17,250 nest. So what that means is who's 583 00:29:17,250 --> 00:29:18,530 writing and who's reading?
584 00:29:23,550 --> 00:29:26,580 If you look at it basically -- 585 00:29:26,580 --> 00:29:28,790 actually this is a little bit wrong, because the 586 00:29:28,790 --> 00:29:37,680 dependence analysis says -- actually, all these things, 587 00:29:37,680 --> 00:29:41,620 all these reads have to go to all the writes, because they 588 00:29:41,620 --> 00:29:44,980 are writing for any j, just writing the same thing. 589 00:29:44,980 --> 00:29:46,460 So this is a little bit wrong. 590 00:29:46,460 --> 00:29:48,620 This is actually more data flow analysis. 591 00:29:48,620 --> 00:29:52,070 This is different -- dependence means I don't care 592 00:29:52,070 --> 00:29:54,900 who the guy is who wrote, whether he's the last guy who wrote, 593 00:29:54,900 --> 00:29:57,000 but everybody's reading, everybody else is writing the 594 00:29:57,000 --> 00:30:01,880 same location. 595 00:30:01,880 --> 00:30:02,010 AUDIENCE: [OBSCURED]. 596 00:30:02,010 --> 00:30:03,370 PROFESSOR: You keep rewriting the same thing again 597 00:30:03,370 --> 00:30:05,060 and again and again. 598 00:30:05,060 --> 00:30:06,570 You start depending on -- 599 00:30:06,570 --> 00:30:12,140 It's not dependent on j, it's dependent on i. But the location 600 00:30:12,140 --> 00:30:14,840 says you have iterations writing to the same 601 00:30:14,840 --> 00:30:22,030 location with different j. So no matter what j, it's writing to 602 00:30:22,030 --> 00:30:23,280 the same location. 603 00:30:25,800 --> 00:30:27,010 You know what I'm saying? 604 00:30:27,010 --> 00:30:30,180 Because J thinks J. 605 00:30:30,180 --> 00:30:34,640 AUDIENCE: [NOISE]. 606 00:30:34,640 --> 00:30:36,770 PROFESSOR: This is iteration space. 607 00:30:36,770 --> 00:30:37,830 I am looking at iterations. 608 00:30:37,830 --> 00:30:38,030 I am looking at i and j. 609 00:30:38,030 --> 00:30:39,790 AUDIENCE: [OBSCURED]. 610 00:30:39,790 --> 00:30:42,640 PROFESSOR: B is a one-dimensional array.
611 00:30:42,640 --> 00:30:44,370 So B is a one-dimensional array. 612 00:30:44,370 --> 00:30:45,430 So what that means is -- 613 00:30:45,430 --> 00:30:47,840 The reason I'm saying this is that the iteration space and the array 614 00:30:47,840 --> 00:30:53,300 space are a mismatch. 615 00:30:53,300 --> 00:30:54,760 I'll correct this and put it in there because this is a 616 00:30:54,760 --> 00:30:55,740 data flow diagram. 617 00:30:55,740 --> 00:30:56,990 It's row independent. 618 00:30:58,800 --> 00:31:01,230 This one is writing to what? 619 00:31:01,230 --> 00:31:04,590 AUDIENCE: [OBSCURED]. 620 00:31:04,590 --> 00:31:08,390 PROFESSOR: Iteration space is i and j. So, this 621 00:31:08,390 --> 00:31:09,470 is writing to what? 622 00:31:09,470 --> 00:31:12,240 i zero is -- 623 00:31:12,240 --> 00:31:15,120 This is writing to what? 624 00:31:15,120 --> 00:31:16,370 B1. 625 00:31:18,720 --> 00:31:20,450 All those things are writing to B1. 626 00:31:23,070 --> 00:31:24,360 This is really -- 627 00:31:29,920 --> 00:31:33,860 So this is writing to B1, this is reading B zero. 628 00:31:33,860 --> 00:31:36,210 So this iteration is reading B1 again. 629 00:31:36,210 --> 00:31:37,990 So this was B1, this iteration is B1. 630 00:31:37,990 --> 00:31:41,570 So each of these is writing to B1, each of these is reading 631 00:31:41,570 --> 00:31:47,000 from B1, so each has to be dependent on each other. 632 00:31:47,000 --> 00:31:48,550 AUDIENCE: So I guess one thing that's confusing here is why 633 00:31:48,550 --> 00:31:51,578 isn't it just -- why don't we just have arrows going down 634 00:31:51,578 --> 00:31:52,070 the column? 635 00:31:52,070 --> 00:31:53,550 Why do we have all these--? 636 00:31:53,550 --> 00:31:56,420 PROFESSOR: Arrows going down the column means each is 637 00:31:56,420 --> 00:31:58,860 touching a different location. 638 00:31:58,860 --> 00:32:02,030 So what happens is that in this one, the arrows are 639 00:32:02,030 --> 00:32:03,280 going down this way.
640 00:32:03,280 --> 00:32:07,350 In this one -- what's written here is only that location, 641 00:32:07,350 --> 00:32:09,830 only this side I accidentally located. 642 00:32:09,830 --> 00:32:12,390 These are all writing to the same location and reading from 643 00:32:12,390 --> 00:32:13,210 the same location. 644 00:32:13,210 --> 00:32:16,180 AUDIENCE: Why isn't B iterated? 645 00:32:16,180 --> 00:32:17,390 PROFESSOR: This is iteration space. 646 00:32:17,390 --> 00:32:18,620 I have two different loops here. 647 00:32:18,620 --> 00:32:22,120 AUDIENCE: But I don't understand why B [NOISE.] 648 00:32:22,120 --> 00:32:24,110 PROFESSOR: This is my program. 649 00:32:24,110 --> 00:32:25,250 I can write this program. 650 00:32:25,250 --> 00:32:27,573 This is a little bit of a stupid program because I am 651 00:32:27,573 --> 00:32:30,090 kind of trying to do the same thing again and again. 652 00:32:30,090 --> 00:32:35,800 But hey, my program doesn't say array dimensions have to 653 00:32:35,800 --> 00:32:36,790 match your loop dimensions. 654 00:32:36,790 --> 00:32:39,050 It doesn't say that so you can have programs like that. 655 00:32:39,050 --> 00:32:40,300 You can have it the other way too. 656 00:32:42,440 --> 00:32:47,800 So the key thing is -- don't confuse iteration space 657 00:32:47,800 --> 00:32:48,750 with array space. 658 00:32:48,750 --> 00:32:50,280 They are two different spaces, with two different numbers of 659 00:32:50,280 --> 00:32:50,980 dimensions. 660 00:32:50,980 --> 00:32:52,440 That's the only point that I'm going to make here. 661 00:32:55,360 --> 00:32:58,645 So by doing dependence analysis, you can figure out 662 00:32:58,645 --> 00:33:00,410 -- now you can formulate this nicely -- 663 00:33:00,410 --> 00:33:03,550 figure out which loops are parallel. 664 00:33:03,550 --> 00:33:06,480 So that's really neat.
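The three pictures discussed above can be reproduced with a small enumeration. This is only a sketch: the helper name `dependence_distances`, the nest sizes, and the reading of the third example as `B[i+1] = B[i]` inside an i, j nest are assumptions, since the slides are not part of the transcript. For every pair of distinct iterations where one writes the location the other reads, it records the distance (di, dj) from the writing iteration to the reading one.

```python
from itertools import product

def dependence_distances(n, write_loc, read_loc):
    """Distinct (di, dj) from a writing iteration to a different
    iteration that reads the same array location, over an n x n nest."""
    iters = list(product(range(n), repeat=2))
    return sorted({
        (ri - wi, rj - wj)
        for wi, wj in iters
        for ri, rj in iters
        if (wi, wj) != (ri, rj) and write_loc(wi, wj) == read_loc(ri, rj)
    })

# A[i][j] = A[i][j-1]: arrows run along j, the i's are the same
along_j = dependence_distances(4, lambda i, j: (i, j), lambda i, j: (i, j - 1))

# A[i][j] = A[i-1][j+1]: arrows run along a diagonal
diagonal = dependence_distances(4, lambda i, j: (i, j), lambda i, j: (i - 1, j + 1))

# Assumed form B[i+1] = B[i] in a 2-D nest: the array is 1-D, so every j of
# row i writes the location that every j of row i+1 reads
one_d_array = dependence_distances(3, lambda i, j: i + 1, lambda i, j: i)
```

The last case shows why the arrows do not simply go down a column: the location depends only on i, so each iteration of one row is tied to every iteration of the next.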
665 00:33:06,480 --> 00:33:09,620 The next thing I'm going to go into is trying to figure out how 666 00:33:09,620 --> 00:33:11,970 you can increase the parallelism opportunities. 667 00:33:11,970 --> 00:33:14,550 Because there might be cases where in the original code you 668 00:33:14,550 --> 00:33:17,350 wrote, there might be some loops that are not 669 00:33:17,350 --> 00:33:20,580 parallelizable as is, and can you go and increase that. 670 00:33:20,580 --> 00:33:22,750 So I'm going to talk about a few different possibilities for 671 00:33:22,750 --> 00:33:24,000 doing that. 672 00:33:25,880 --> 00:33:28,270 Scalar privatization, I will just go into each of these 673 00:33:28,270 --> 00:33:30,550 separately. 674 00:33:30,550 --> 00:33:33,040 So here is an interesting program. 675 00:33:33,040 --> 00:33:37,490 You write to the temporary and use the 676 00:33:37,490 --> 00:33:39,080 temporary in here. 677 00:33:39,080 --> 00:33:41,080 You might not know you had written that but the compiler 678 00:33:41,080 --> 00:33:42,950 normally generates something like that because you always 679 00:33:42,950 --> 00:33:44,790 have temporaries in here, so this might be 680 00:33:44,790 --> 00:33:46,460 what the compiler generates. 681 00:33:46,460 --> 00:33:47,240 Is this loop parallel? 682 00:33:47,240 --> 00:33:56,020 AUDIENCE: Yup. 683 00:33:56,020 --> 00:33:56,290 PROFESSOR: Why? 684 00:33:56,290 --> 00:34:00,000 AUDIENCE: [OBSCURED]. 685 00:34:00,000 --> 00:34:02,150 PROFESSOR: Is the loop-carried dependence true or anti -- 686 00:34:02,150 --> 00:34:05,820 What's the true dependence, from which to which? 687 00:34:05,820 --> 00:34:08,260 Within the loop, a true dependence. 688 00:34:08,260 --> 00:34:09,510 What is the loop-carried dependence? 689 00:34:12,810 --> 00:34:14,070 Anti-dependence. 690 00:34:14,070 --> 00:34:20,710 Because I cannot -- you see, i equals 1 basically wrote here 691 00:34:20,710 --> 00:34:21,820 and is reading this.
692 00:34:21,820 --> 00:34:26,170 I can't write x in i equals 2 until i equals 1 is 693 00:34:26,170 --> 00:34:26,870 done reading that. 694 00:34:26,870 --> 00:34:29,210 I have one location and everybody's trying to read or 695 00:34:29,210 --> 00:34:31,450 write that, even though I don't really use the data. 696 00:34:31,450 --> 00:34:32,860 This is the sad thing about this. 697 00:34:32,860 --> 00:34:34,900 I'm not really using this guy's data, but I'm just 698 00:34:34,900 --> 00:34:36,730 waiting for the same space to occupy. 699 00:34:39,510 --> 00:34:43,410 So, there's a loop-carried dependence in here, and it's 700 00:34:43,410 --> 00:34:45,330 an anti-dependence. 701 00:34:45,330 --> 00:34:49,040 So what you can do is if you find any anti or output loop- 702 00:34:49,040 --> 00:34:50,880 carried dependence, you can get rid of them. 703 00:34:50,880 --> 00:34:53,220 I'm not really using that value, I'm just reusing a 704 00:34:53,220 --> 00:34:54,430 location in here. 705 00:34:54,430 --> 00:34:55,820 So how can we get rid of that? 706 00:34:55,820 --> 00:35:01,670 AUDIENCE: [OBSCURED]. 707 00:35:01,670 --> 00:35:02,040 PROFESSOR: Yeah. 708 00:35:02,040 --> 00:35:03,100 That's one thing. 709 00:35:03,100 --> 00:35:03,970 There's two ways of doing it. 710 00:35:03,970 --> 00:35:07,210 One is I assign to something local. 711 00:35:07,210 --> 00:35:11,060 So each processor will have its own copy, 712 00:35:11,060 --> 00:35:12,760 so I don't have that problem. 713 00:35:12,760 --> 00:35:17,670 So it's something like this, so that's [OBSCURED]. 714 00:35:17,670 --> 00:35:21,300 Or I can expand it into an array. 715 00:35:21,300 --> 00:35:23,480 The array can have either the number of processors or the number of 716 00:35:23,480 --> 00:35:24,860 iterations as its size, so each iteration 717 00:35:24,860 --> 00:35:27,590 uses a different location. 718 00:35:27,590 --> 00:35:30,510 This is more efficient than this one because there we are 719 00:35:30,510 --> 00:35:34,330 touching a lot more locations.
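A small sketch of the two fixes just described. The loop body (`x = a[i] * 2; b[i] = x + 1`) is a made-up stand-in for the slide's code, and all function names are hypothetical. The sketch also includes the copy-out of the last value of `x`, the finalization step that the lecture turns to next, so that each version leaves the same `x` behind.

```python
from concurrent.futures import ThreadPoolExecutor

def shared_scalar(a):
    """Original form: one shared x, so iterations conflict on its location."""
    b = [0] * len(a)
    x = None
    for i in range(len(a)):
        x = a[i] * 2          # every iteration writes the same x
        b[i] = x + 1          # ...and reads it back
    return b, x

def privatized(a):
    """Each iteration gets its own local copy of the temporary."""
    def body(a_i):
        x_private = a_i * 2   # private to this call
        return x_private + 1, x_private
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(body, a))     # iterations are independent
    b = [r[0] for r in results]
    x = results[-1][1] if results else None   # copy out the last value
    return b, x

def expanded(a):
    """Scalar expanded into an array with one slot per iteration."""
    x_tmp = [0] * len(a)
    b = [0] * len(a)
    for i in range(len(a)):   # no two iterations share a slot
        x_tmp[i] = a[i] * 2
        b[i] = x_tmp[i] + 1
    x = x_tmp[-1] if a else None
    return b, x
```

The expanded version trades memory for independence, which is the efficiency difference mentioned above.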
720 00:35:34,330 --> 00:35:36,330 I haven't done one thing here. 721 00:35:36,330 --> 00:35:37,210 I'm not complete. 722 00:35:37,210 --> 00:35:39,640 What have I forgotten to do in both of these? 723 00:35:39,640 --> 00:35:43,070 AUDIENCE: [OBSCURED]. 724 00:35:43,070 --> 00:35:45,980 PROFESSOR: Yeah, because afterwards somebody might 725 00:35:45,980 --> 00:35:47,880 use the final value assigned in the loop nest, so what you have to 726 00:35:47,880 --> 00:35:50,690 do is you have to kind of finalize x. 727 00:35:50,690 --> 00:35:53,730 Because I had a temporary variable, so at the end, the last 728 00:35:53,730 --> 00:35:56,940 value has to go into x. 729 00:35:56,940 --> 00:35:58,570 You can't keep just not 730 00:35:58,570 --> 00:36:00,740 calculating value in something. 731 00:36:00,740 --> 00:36:03,270 So in here, also, you just say the last value is x. 732 00:36:03,270 --> 00:36:06,390 But after you do that, basically now each of these 733 00:36:06,390 --> 00:36:07,640 loops is faster. 734 00:36:10,100 --> 00:36:11,350 Everybody got that? 735 00:36:13,420 --> 00:36:16,090 OK, here's another example. 736 00:36:16,090 --> 00:36:19,110 x equals x plus A of i. 737 00:36:19,110 --> 00:36:20,360 Do I have a loop-carried dependence? 738 00:36:30,780 --> 00:36:32,780 What is the loop-carried dependence? 739 00:36:32,780 --> 00:36:34,030 True or anti? 740 00:36:39,120 --> 00:36:39,400 True dependence. 741 00:36:39,400 --> 00:36:43,600 So this guy is actually reading the previous value and 742 00:36:43,600 --> 00:36:45,800 adding something to it. 743 00:36:45,800 --> 00:36:48,020 So of course with a true dependence I cannot seem to 744 00:36:48,020 --> 00:36:48,940 parallelize. 745 00:36:48,940 --> 00:36:51,760 But there are some interesting things we can do. 746 00:36:51,760 --> 00:36:55,740 That is an associative operation.
747 00:36:55,740 --> 00:36:58,300 I don't care which order these additions happen in, so I'm just 748 00:36:58,300 --> 00:37:00,330 adding a bunch of values in here. 749 00:37:00,330 --> 00:37:03,710 And the results are never used inside the loop. 750 00:37:03,710 --> 00:37:05,700 So we just keep adding things and at the end of the loop you 751 00:37:05,700 --> 00:37:08,600 get the sum total in here. 752 00:37:08,600 --> 00:37:10,580 I never used any kind of partial values anywhere. 753 00:37:10,580 --> 00:37:12,130 So that gives the idea. 754 00:37:12,130 --> 00:37:17,870 So what you can do is we can translate this into each of 755 00:37:17,870 --> 00:37:21,580 the guys doing a temporary addition 756 00:37:21,580 --> 00:37:22,460 into its own variable. 757 00:37:22,460 --> 00:37:27,650 So each processor just does a partial sum. 758 00:37:27,650 --> 00:37:31,390 At the end, once they're done, you basically do the full sum. 759 00:37:31,390 --> 00:37:33,290 Of course, you can do a tree or whatever much more 760 00:37:33,290 --> 00:37:35,700 complicated thing than that -- you can also parallelize this 761 00:37:35,700 --> 00:37:38,050 part as a tree addition. 762 00:37:38,050 --> 00:37:39,130 But you can do that. 763 00:37:39,130 --> 00:37:43,170 I mean Roderick talked about this in hand parallelization. 764 00:37:43,170 --> 00:37:46,040 But we are doing something very simple in here. 765 00:37:46,040 --> 00:37:50,150 So these compilers can figure out associative 766 00:37:50,150 --> 00:37:51,950 operations and do that. 767 00:37:51,950 --> 00:37:55,020 So this is where all the people who are in 768 00:37:55,020 --> 00:37:57,720 parallelizing, and all the people who are writing this 769 00:37:57,720 --> 00:38:00,100 scientific code kind of start having arguments. 770 00:38:00,100 --> 00:38:02,770 Because they say oh my God, you're reordering operations and 771 00:38:02,770 --> 00:38:05,700 it's going to have numerical stability issues.
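Here is a minimal sketch of the partial-sum translation, together with a tiny demonstration of why the scientific programmers object. The function name `partial_sum_reduce` and the chunking scheme are assumptions for illustration; the chunks are summed one after another here, where a parallelizing compiler would hand each chunk to its own processor.

```python
def partial_sum_reduce(values, num_workers):
    """Split the reduction into per-worker partial sums, then combine them."""
    chunk = max(1, (len(values) + num_workers - 1) // num_workers)
    partials = [
        sum(values[k:k + chunk])          # independent of every other chunk
        for k in range(0, len(values), chunk)
    ]
    return sum(partials)                  # final combine step

total = partial_sum_reduce(list(range(1, 101)), 4)

# Reordering is exact for integers, but floating-point addition is not
# associative, so a different grouping can change the result slightly.
left_to_right = (0.1 + 0.2) + 0.3
regrouped = 0.1 + (0.2 + 0.3)
grouping_matters = left_to_right != regrouped
```

The integer case gives the same total under any grouping; the floating-point case is the numerical-stability concern raised in the lecture.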
772 00:38:05,700 --> 00:38:06,720 Yes, all true. 773 00:38:06,720 --> 00:38:09,260 In compilers you have these flags that say OK, just forget 774 00:38:09,260 --> 00:38:12,800 about all these issues, and most probably it will be 775 00:38:12,800 --> 00:38:15,320 right, and in most code it will work. 776 00:38:15,320 --> 00:38:18,610 You might find that problem, too -- you change the operation 777 00:38:18,610 --> 00:38:21,370 order to get some parallelism and suddenly you are running 778 00:38:21,370 --> 00:38:22,960 into instability. 779 00:38:22,960 --> 00:38:25,190 There are some algorithms where you can't do that, but in most 780 00:38:25,190 --> 00:38:26,440 algorithms you can. 781 00:38:28,710 --> 00:38:30,090 So here's another interesting thing. 782 00:38:30,090 --> 00:38:35,430 So, I have a program like that, 2 to the power i, and of 783 00:38:35,430 --> 00:38:37,310 course, most of the time 784 00:38:37,310 --> 00:38:40,080 exponentiation is very expensive. 785 00:38:40,080 --> 00:38:41,450 If you have a smart compiler -- 786 00:38:41,450 --> 00:38:42,840 I don't have to exponentiate. 787 00:38:42,840 --> 00:38:44,390 There's this thing called strength reduction. 788 00:38:44,390 --> 00:38:44,970 They say wait a minute -- 789 00:38:44,970 --> 00:38:46,160 I will keep a variable t. 790 00:38:46,160 --> 00:38:49,270 This 2 to the power i means basically every time I 791 00:38:49,270 --> 00:38:52,150 multiply it by 2 and I can keep repeating that. 792 00:38:52,150 --> 00:38:57,210 Do you see why these two are equal there? 793 00:38:57,210 --> 00:38:57,940 This is good. 794 00:38:57,940 --> 00:38:59,550 A lot of good compilers do that. 795 00:38:59,550 --> 00:39:01,040 But now what did I suddenly do? 796 00:39:01,040 --> 00:39:03,740 AUDIENCE: [OBSCURED.] 797 00:39:03,740 --> 00:39:05,680 PROFESSOR: Yeah, I reduced the amount of computation, 798 00:39:05,680 --> 00:39:09,100 obviously, but I just introduced a loop-carried true 799 00:39:09,100 --> 00:39:10,350 dependence here.
800 00:39:12,760 --> 00:39:15,560 Because now I have t dependent on the previous t to calculate 801 00:39:15,560 --> 00:39:20,630 the next value, and while order-wise or sequential-wise 802 00:39:20,630 --> 00:39:24,350 this is a win, now suddenly I can't parallelize. 803 00:39:24,350 --> 00:39:26,840 Of course, a lot of times what happens is you have a 804 00:39:26,840 --> 00:39:27,750 very smart programmer. 805 00:39:27,750 --> 00:39:30,610 They say aha, I know this operation is expensive so I am 806 00:39:30,610 --> 00:39:33,580 going to do this myself and create a much simpler 807 00:39:33,580 --> 00:39:35,670 sequential program. 808 00:39:35,670 --> 00:39:37,380 Then you try to parallelize this and you can't. 809 00:39:37,380 --> 00:39:41,100 So what you might try to do is kind of do the transformation in this 810 00:39:41,100 --> 00:39:43,660 direction many times to make the program run a little 811 00:39:43,660 --> 00:39:47,260 bit slower sequentially just so you can actually go and 812 00:39:47,260 --> 00:39:49,340 parallelize it. 813 00:39:49,340 --> 00:39:50,770 So this gets a little bit counterintuitive. 814 00:39:50,770 --> 00:39:53,900 You just look at a program and say yeah, there is a loop- 815 00:39:53,900 --> 00:39:55,850 carried dependence, I can do it a little bit more expensively 816 00:39:55,850 --> 00:39:58,540 without the loop-carried dependence, and then suddenly 817 00:39:58,540 --> 00:39:59,320 my loop is parallelizable. 818 00:39:59,320 --> 00:40:01,460 So there might be cases where you might have to do it by 819 00:40:01,460 --> 00:40:04,020 hand, and a lot of automatic parallelizing 820 00:40:04,020 --> 00:40:05,990 compilers try to do this also. 821 00:40:05,990 --> 00:40:08,230 They kind of look at these kinds of things and try to 822 00:40:08,230 --> 00:40:09,450 move in that direction.
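The trade-off can be seen side by side in a short sketch. The loop body (filling a list with 2 to the power i) is an assumed stand-in for the slide's program, and both function names are hypothetical.

```python
def with_exponentiation(n):
    """Each element depends only on its own i, so iterations are
    independent and could run in any order or in parallel."""
    return [2 ** i for i in range(n)]

def strength_reduced(n):
    """Cheaper per iteration, but t in iteration i needs t from
    iteration i - 1: a loop-carried true dependence."""
    out = []
    t = 1
    for _ in range(n):
        out.append(t)
        t *= 2                # reads the previous iteration's t
    return out

same_result = with_exponentiation(10) == strength_reduced(10)
```

Both forms compute the same values; a sequential compiler prefers the second, while a parallelizing compiler may deliberately go back to the first.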
823 00:40:09,450 --> 00:40:11,290 Whereas, most of the sequential compilers are trying 824 00:40:11,290 --> 00:40:12,670 to find this and move in this direction. 825 00:40:16,320 --> 00:40:19,840 So, OK I said that. 826 00:40:19,840 --> 00:40:21,790 So, another thing is called array privatization. 827 00:40:21,790 --> 00:40:26,130 So for scalars, I showed you that when you have anti and output 828 00:40:26,130 --> 00:40:28,260 dependences on a variable, you can privatize. 829 00:40:28,260 --> 00:40:31,360 And in arrays, you have a lot more complexity. 830 00:40:31,360 --> 00:40:33,250 I'm not going to go into that, you can actually do private 831 00:40:33,250 --> 00:40:35,440 copies also in there. 832 00:40:35,440 --> 00:40:37,830 You can do a bunch of transformations. 833 00:40:37,830 --> 00:40:39,840 Another thing people do is called interprocedural 834 00:40:39,840 --> 00:40:41,740 parallelization. 835 00:40:41,740 --> 00:40:44,470 So the thing is you have a nice loop and you start 836 00:40:44,470 --> 00:40:46,070 analyzing the loop and in the middle of the loop you have a 837 00:40:46,070 --> 00:40:48,250 function call. 838 00:40:48,250 --> 00:40:50,120 Suddenly what are you going to do with it? 839 00:40:50,120 --> 00:40:52,400 You have no idea what the function does, and most of the 840 00:40:52,400 --> 00:40:54,530 simple analyses say OK, I can't parallelize anything 841 00:40:54,530 --> 00:40:55,750 that has a function call. 842 00:40:55,750 --> 00:40:57,430 That's not a good parallelizing compiler because 843 00:40:57,430 --> 00:40:59,780 a lot of loops have function calls and you might call 844 00:40:59,780 --> 00:41:04,090 something as simple as a sine function or some simple 845 00:41:04,090 --> 00:41:06,030 exponentiation function and then suddenly it's not 846 00:41:06,030 --> 00:41:08,750 parallelizable. 847 00:41:08,750 --> 00:41:10,470 This is a big problem. 848 00:41:10,470 --> 00:41:11,460 There are two things you can do.
849 00:41:11,460 --> 00:41:15,080 One is interprocedural analysis and another is inlining. 850 00:41:15,080 --> 00:41:19,600 So the interprocedural analysis says I'm going to 851 00:41:19,600 --> 00:41:24,370 analyze the entire program and when I have a function, I'm going to 852 00:41:24,370 --> 00:41:28,830 go and try to analyze the function itself also. 853 00:41:28,830 --> 00:41:33,220 What happens is -- so assume the functions are used 854 00:41:33,220 --> 00:41:36,060 many, many times, so the sine function might be used 855 00:41:36,060 --> 00:41:37,076 hundreds of times. 856 00:41:37,076 --> 00:41:39,380 So every time you have a call of the sine function, if you 857 00:41:39,380 --> 00:41:41,650 keep analyzing, reanalyzing what's happening inside of the 858 00:41:41,650 --> 00:41:44,450 sine function, you kind of have exponential blow up. 859 00:41:44,450 --> 00:41:48,800 So if your code size is n, you might have an exponential 860 00:41:48,800 --> 00:41:51,640 number of lines that need to be analyzed because every 861 00:41:51,640 --> 00:41:54,080 call needs to go there, call some other functions, you can 862 00:41:54,080 --> 00:41:55,490 see the blow up. 863 00:41:55,490 --> 00:41:57,530 And so the analysis might be expensive. 864 00:41:57,530 --> 00:42:00,620 The other option is you analyze each function once. 865 00:42:00,620 --> 00:42:01,910 Yeah, OK. 866 00:42:01,910 --> 00:42:04,160 I analyze this function once and every time I use that 867 00:42:04,160 --> 00:42:07,990 function I just use that analysis information. 868 00:42:07,990 --> 00:42:11,450 What that means is you have a kind of summary of what that 869 00:42:11,450 --> 00:42:13,390 function does for every call.
870 00:42:13,390 --> 00:42:15,580 This is not that easy and this runs into a thing called the 871 00:42:15,580 --> 00:42:18,660 unrealizable path problem, because you go into the function 872 00:42:18,660 --> 00:42:22,210 in one path -- 873 00:42:22,210 --> 00:42:26,460 assume you call foo from here and return here. 874 00:42:26,460 --> 00:42:28,470 You call it here and return here. 875 00:42:28,470 --> 00:42:31,270 So when you analyze, normally you can go from here to here, 876 00:42:31,270 --> 00:42:34,530 here to here, but if you treat foo as only one thing you 877 00:42:34,530 --> 00:42:36,466 might even think that you can go here to here 878 00:42:36,466 --> 00:42:38,220 and here to here. 879 00:42:38,220 --> 00:42:40,790 So this looks like one thing in here. 880 00:42:40,790 --> 00:42:44,462 You see that control here goes here, comes here, does a function 881 00:42:44,462 --> 00:42:46,550 call, goes here, because we are not treating 882 00:42:46,550 --> 00:42:48,610 this as a separate instance. 883 00:42:48,610 --> 00:42:50,480 So by analyzing it only once, 884 00:42:50,480 --> 00:42:52,650 this creates all this additional mess and then you can 885 00:42:52,650 --> 00:42:53,770 have problems in here. 886 00:42:53,770 --> 00:42:56,480 So these are the kind of researchy things people are 887 00:42:56,480 --> 00:42:57,210 working on. 888 00:42:57,210 --> 00:42:59,480 There's no perfect answer, these are complicated 889 00:42:59,480 --> 00:43:00,650 problems, so you have to do some 890 00:43:00,650 --> 00:43:05,770 interesting balance in here. 891 00:43:05,770 --> 00:43:08,360 Because the other thing is every analysis has to deal with that, 892 00:43:08,360 --> 00:43:10,030 so you have to kind of deal with the entire compiler, 893 00:43:10,030 --> 00:43:12,940 which is not simple. 894 00:43:12,940 --> 00:43:14,550 Inlining is much easier.
895 00:43:14,550 --> 00:43:16,700 It's a poor man's solution, so every time you have a function 896 00:43:16,700 --> 00:43:18,570 call, you just bring the function and 897 00:43:18,570 --> 00:43:19,855 just copy it in there. 898 00:43:19,855 --> 00:43:20,810 And every time you have a function call you bring the 899 00:43:20,810 --> 00:43:23,410 function and you can run it through the same compiler, but 900 00:43:23,410 --> 00:43:25,510 of course, you can have huge code blow up. 901 00:43:25,510 --> 00:43:28,060 It's not only analysis expense, you might have a 902 00:43:28,060 --> 00:43:30,730 function that before had only 100 lines, now we have 903 00:43:30,730 --> 00:43:32,760 millions of lines in there and then you run into cache 904 00:43:32,760 --> 00:43:34,660 problems, all those other issues. 905 00:43:34,660 --> 00:43:36,310 So it can be very expensive too. 906 00:43:36,310 --> 00:43:39,265 So what people do is things like selective inlining and a 907 00:43:39,265 --> 00:43:45,970 lot of kind of interesting combinations of these. 908 00:43:45,970 --> 00:43:48,010 Finally, loop transformations. 909 00:43:48,010 --> 00:43:53,560 So I have this loop, so I have A[i][j] computed from A[i][j-1] and 910 00:43:53,560 --> 00:43:57,000 A[i-1][j]. So look at my -- my arrowheads look too big there, 911 00:43:57,000 --> 00:44:00,280 but look at my dependences. 912 00:44:00,280 --> 00:44:02,020 Is any of this parallel? 913 00:44:02,020 --> 00:44:10,460 AUDIENCE: [OBSCURED.] 914 00:44:10,460 --> 00:44:11,710 PROFESSOR: Yeah. 915 00:44:13,840 --> 00:44:16,280 So, as she says, neither -- 916 00:44:16,280 --> 00:44:18,650 you can't parallelize i because there's a loop-carried 917 00:44:18,650 --> 00:44:21,260 dependence in the i dimension. 918 00:44:21,260 --> 00:44:23,800 You can't parallelize j because there's a loop-carried 919 00:44:23,800 --> 00:44:24,900 dependence in the j dimension. 920 00:44:24,900 --> 00:44:27,900 She has the right idea, because you can actually pipeline.
921 00:44:27,900 --> 00:44:30,480 So pipelining, we haven't figured out how 922 00:44:30,480 --> 00:44:32,070 to parallelize a pipeline. 923 00:44:32,070 --> 00:44:34,250 So the way you can do this simply is a 924 00:44:34,250 --> 00:44:37,410 thing called loop skewing. 925 00:44:37,410 --> 00:44:39,040 You can kind of -- 926 00:44:39,040 --> 00:44:42,080 because the iteration space is different from the data space. 927 00:44:42,080 --> 00:44:45,090 You can come up with a new iteration space that kind of 928 00:44:45,090 --> 00:44:47,790 skews the loop in there. 929 00:44:47,790 --> 00:44:51,120 So what it does is normally the iteration space, with this j 930 00:44:51,120 --> 00:44:54,480 outside, so you go execute like this. 931 00:44:54,480 --> 00:44:57,640 The skewed loop basically says I am executing 932 00:44:57,640 --> 00:44:59,650 this way, so I'm executing the pipeline, 933 00:44:59,650 --> 00:45:00,890 basically pipeline here. 934 00:45:00,890 --> 00:45:04,060 So I'm kind of going like this way, executing that way. 935 00:45:04,060 --> 00:45:09,470 If I could run that loop in that fashion, what I can do is 936 00:45:09,470 --> 00:45:12,600 I can run this -- after this iteration, when you go run the 937 00:45:12,600 --> 00:45:16,340 next iteration, there's no dependence across here. 938 00:45:16,340 --> 00:45:18,340 If I run here, I don't have dependence, so I can run each 939 00:45:18,340 --> 00:45:22,510 of these and I have a parallel set of iterations to run. 940 00:45:22,510 --> 00:45:25,670 So in here, what happens is this inner loop can be 941 00:45:25,670 --> 00:45:30,200 parallel, basically like your pipeline, but it's written in 942 00:45:30,200 --> 00:45:34,010 a way that I still have my two loops in here, but I have done 943 00:45:34,010 --> 00:45:36,700 this weird transformation. 944 00:45:36,700 --> 00:45:38,430 Another interesting thing is granularity of parallelism. 945 00:45:38,430 --> 00:45:40,950 Assume I have a loop like that, i and j.
946 00:45:40,950 --> 00:45:44,150 Which loop is parallel in here? 947 00:45:44,150 --> 00:45:46,050 i or j? 948 00:45:46,050 --> 00:45:47,740 j is parallel. 949 00:45:47,740 --> 00:45:48,700 OK, I do something like that. 950 00:45:48,700 --> 00:45:52,580 I say I run i, every iteration I do a barrier, I run j in 951 00:45:52,580 --> 00:45:56,770 parallel and I end up doing a barrier again. 952 00:45:56,770 --> 00:46:05,510 What might be a problem in something like this? 953 00:46:05,510 --> 00:46:09,440 I mean inner loop parallelism can be expensive, because every 954 00:46:09,440 --> 00:46:13,120 time I have to do this probably expensive barrier, run a few 955 00:46:13,120 --> 00:46:14,870 iterations, a few in this one, probably 956 00:46:14,870 --> 00:46:16,850 only like a few cycles. 957 00:46:16,850 --> 00:46:18,980 And then this very expensive barrier again, and everybody 958 00:46:18,980 --> 00:46:20,440 communicates -- 959 00:46:20,440 --> 00:46:23,050 all of those things. 960 00:46:23,050 --> 00:46:25,170 Most of the time when you do inner loop parallelism it 961 00:46:25,170 --> 00:46:27,510 actually slows down the program. 962 00:46:27,510 --> 00:46:29,640 You will probably find it too sometimes, if you define the 963 00:46:29,640 --> 00:46:32,060 granularity of parallelism to be too small, it actually has a 964 00:46:32,060 --> 00:46:34,650 negative impact, because all the communication you need to 965 00:46:34,650 --> 00:46:35,932 do, synchronization you need to do, all of 966 00:46:35,932 --> 00:46:39,740 that dominates the program. 967 00:46:39,740 --> 00:46:40,920 So inner loop is expensive. 968 00:46:40,920 --> 00:46:42,650 What are your choices? 969 00:46:42,650 --> 00:46:44,140 Don't parallelize. 970 00:46:44,140 --> 00:46:45,980 Pretty good choice for a lot of cases. 971 00:46:45,980 --> 00:46:47,510 You look at this and this is actually going to win you 972 00:46:47,510 --> 00:46:49,530 basically by doing that.
973 00:46:49,530 --> 00:46:51,960 Or you can transform it to outer loop parallelism. 974 00:46:51,960 --> 00:46:54,540 Take inner loop parallelism and you change it to get outer 975 00:46:54,540 --> 00:46:55,120 loop parallelism. 976 00:46:55,120 --> 00:46:57,070 This program is actually nice, there is some complex 977 00:46:57,070 --> 00:46:59,710 analysis you need to do to make sure that's legal. 978 00:46:59,710 --> 00:47:03,390 So you can basically take this one and 979 00:47:03,390 --> 00:47:06,170 transform in the other direction. 980 00:47:06,170 --> 00:47:10,780 What that means is kind of do a loop interchange. 981 00:47:10,780 --> 00:47:13,500 So now instead of i, you have j as the outer dimension, i as the inner 982 00:47:13,500 --> 00:47:16,050 dimension, inner loop. 983 00:47:16,050 --> 00:47:19,715 When you do that what you have is your barrier, and then you 984 00:47:19,715 --> 00:47:23,750 can run this in parallel and this like this. 985 00:47:23,750 --> 00:47:29,985 Suddenly, instead of having n barriers for that loop, you 986 00:47:29,985 --> 00:47:31,740 have only one barrier. 987 00:47:31,740 --> 00:47:34,940 Suddenly you have a much larger chunk you're running, 988 00:47:34,940 --> 00:47:41,070 and this can be run. 989 00:47:41,070 --> 00:47:42,670 OK, so this is great. 990 00:47:42,670 --> 00:47:44,960 So I talked to you all about all these nice transformations, 991 00:47:44,960 --> 00:47:45,690 stuff like that. 992 00:47:45,690 --> 00:47:47,790 So at some point when you know something is parallel you 993 00:47:47,790 --> 00:47:51,330 might want to go and generate the parallel form. 994 00:47:51,330 --> 00:47:56,150 So the problem is, depending on how you partition, the loop 995 00:47:56,150 --> 00:47:58,460 bounds have to be changed, and I'm going to talk to you about 996 00:47:58,460 --> 00:48:00,030 how to get loop bounds. 997 00:48:00,030 --> 00:48:02,440 So let's look at this program.
998 00:48:02,440 --> 00:48:08,790 So I have something in here and there's an inner loop that 999 00:48:08,790 --> 00:48:09,980 actually reads, outer loop writes. 1000 00:48:09,980 --> 00:48:10,660 Inner loop reads. 1001 00:48:10,660 --> 00:48:13,050 And it's a triangular thing. 1002 00:48:13,050 --> 00:48:14,300 It's a big mess. 1003 00:48:14,300 --> 00:48:19,110 Now I assume I want to run the i loop parallel. 1004 00:48:19,110 --> 00:48:22,860 So what that means is I want to run the first process -- 1005 00:48:22,860 --> 00:48:24,900 there is no for this one, this one on one iteration, two 1006 00:48:24,900 --> 00:48:28,450 iteration, three, four, whatever, each one's in here. 1007 00:48:28,450 --> 00:48:32,170 How do I actually go about generating code that 1008 00:48:32,170 --> 00:48:33,750 actually does that? 1009 00:48:33,750 --> 00:48:36,740 Each processor runs its right number of iteration. 1010 00:48:36,740 --> 00:48:39,410 This is a non-trivial thing because triangularly you get 1011 00:48:39,410 --> 00:48:42,430 something different and you can assume all this 1012 00:48:42,430 --> 00:48:44,090 complexity. 1013 00:48:44,090 --> 00:48:48,250 One thing I did is my iteration space between i and 1014 00:48:48,250 --> 00:48:54,050 j, this is my iteration space. 1015 00:48:54,050 --> 00:48:56,150 So I assume, assume I am running a processor. 1016 00:48:56,150 --> 00:48:59,320 Each I iteration run by your processor, you can say you 1017 00:48:59,320 --> 00:49:04,580 have then another dimension P, and say i equals P. So I can 1018 00:49:04,580 --> 00:49:06,770 look at now instead of a two dimensional space in a three 1019 00:49:06,770 --> 00:49:08,300 dimensional space. 1020 00:49:08,300 --> 00:49:10,340 So in this analysis, if you can think multi-dimensionally 1021 00:49:10,340 --> 00:49:12,800 it's actually very helpful because we can kind of keep 1022 00:49:12,800 --> 00:49:15,970 adding dimensions in here. 
1023 00:49:15,970 --> 00:49:19,380 So what are the loop bounds in here? 1024 00:49:19,380 --> 00:49:22,600 What we can do is use another technique called 1025 00:49:22,600 --> 00:49:26,530 Fourier-Motzkin Elimination to calculate loop bounds by using 1026 00:49:26,530 --> 00:49:28,040 projections of the iteration space. 1027 00:49:28,040 --> 00:49:29,330 I will go through later a bit to give you a 1028 00:49:29,330 --> 00:49:30,585 flavor for what it is. 1029 00:49:30,585 --> 00:49:33,910 It's also, if you are in to linear programming, this is 1030 00:49:33,910 --> 00:49:37,850 kind of extension techniques on that. 1031 00:49:37,850 --> 00:49:39,820 So the way we look at that is -- 1032 00:50:06,390 --> 00:50:10,960 A little bit too far. 1033 00:50:10,960 --> 00:50:18,600 I didn't realize MAC can be this slow. 1034 00:50:18,600 --> 00:50:26,480 [ASIDE CONVERSATION] 1035 00:50:26,480 --> 00:50:28,993 See this is why we need parallelism if you think this 1036 00:50:28,993 --> 00:50:33,000 running fast. So what you can do is you can think about this 1037 00:50:33,000 --> 00:50:34,960 as this three dimensional space. 1038 00:50:34,960 --> 00:50:36,400 i, j and p. 1039 00:50:36,400 --> 00:50:40,070 And because i is equal to p, if you get i and p, get a line 1040 00:50:40,070 --> 00:50:41,960 in that dimension and then j goes there. 1041 00:50:41,960 --> 00:50:44,050 So this is the kind of iteration space in here, and 1042 00:50:44,050 --> 00:50:47,930 that represents inequalities here. 1043 00:50:47,930 --> 00:50:52,630 So what I want is a loop where outer dimension is p, then the 1044 00:50:52,630 --> 00:50:54,800 next dimension is i and j. 1045 00:50:54,800 --> 00:50:56,420 We can think about it like that. 1046 00:50:56,420 --> 00:50:59,525 So what that means is I need to get my iteration ordering 1047 00:50:59,525 --> 00:51:04,140 -- when it happens, you just go like that. 1048 00:51:04,140 --> 00:51:05,190 All right, about doing that. 
1049 00:51:05,190 --> 00:51:07,570 So this is the kind of loop I want to generate -- let me go 1050 00:51:07,570 --> 00:51:09,090 and show you how we generate that. 1051 00:51:25,530 --> 00:51:29,750 So here's my space in here, so the first thing I want to do is my 1052 00:51:29,750 --> 00:51:32,100 innermost dimension, which is j. 1053 00:51:32,100 --> 00:51:34,360 And what I can do is I can look at this thing and say 1054 00:51:34,360 --> 00:51:36,540 what are the bounds of j. 1055 00:51:36,540 --> 00:51:40,370 So, each of the bounds of j can be described -- 1056 00:51:40,370 --> 00:51:41,520 with p and i. 1057 00:51:41,520 --> 00:51:44,470 I'll actually show you how to do that in a little while. 1058 00:51:44,470 --> 00:51:51,180 Then I will get j goes from 1 to i minus 1. 1059 00:51:51,180 --> 00:51:53,740 Then after that I can basically project it to 1060 00:51:53,740 --> 00:51:54,580 eliminate the j dimension. 1061 00:51:54,580 --> 00:51:56,860 So what I'm doing is I'm going to take the three dimensions and 1062 00:51:56,860 --> 00:51:59,780 project into two dimensions without j anymore, because now 1063 00:51:59,780 --> 00:52:04,110 all I have left is i and p, and I get a line in that dimension. 1064 00:52:04,110 --> 00:52:06,250 Then what I have to do is now I have to find i. 1065 00:52:06,250 --> 00:52:10,110 What are my bounds of i? 1066 00:52:10,110 --> 00:52:13,340 And the bound of i is actually i is equal to p. 1067 00:52:13,340 --> 00:52:14,690 You can figure that one out because 1068 00:52:14,690 --> 00:52:16,190 there's a line in there. 1069 00:52:16,190 --> 00:52:18,530 Then you eliminate i and now you get this one. 1070 00:52:21,330 --> 00:52:27,070 Then what are the bounds of p? p goes from basically 2 to n. 1071 00:52:27,070 --> 00:52:28,110 You just basically get that.
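The three bounds just read off (j from 1 to i - 1, i equal to p, p from 2 to n) can be turned into a per-processor loop along these lines. This is a rough sketch, not the compiler's actual output: `iterations_for` and `sequential_space` are hypothetical names, and the original nest is assumed to be the triangular one with j running from 1 to i - 1.

```python
def iterations_for(p, n):
    """(i, j) pairs executed by processor p, using the projected bounds."""
    if not 2 <= p <= n:                    # bounds of p: 2 .. n
        return []
    i = p                                  # bounds of i: the single point i = p
    return [(i, j) for j in range(1, i)]   # bounds of j: 1 .. i - 1


def sequential_space(n):
    """Assumed original triangular nest: for i in 2..n, for j in 1..i-1."""
    return [(i, j) for i in range(2, n + 1) for j in range(1, i)]


n = 5
spmd = [it for p in range(1, n + 1) for it in iterations_for(p, n)]
print(sorted(spmd) == sequential_space(n))   # every iteration is run exactly once
```

Each processor only evaluates its own slice, so the outer p "loop" here stands in for what every processor does independently with its own id.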
1072 00:52:28,110 --> 00:52:31,160 So you can do this projection in here -- let me go in there, 1073 00:52:31,160 --> 00:52:35,710 and now what you end up doing is you can get this, and of 1074 00:52:35,710 --> 00:52:39,240 course, the outer loop p is not a true -- like a loop. 1075 00:52:39,240 --> 00:52:41,050 You can say you get p, my_pid. 1076 00:52:41,050 --> 00:52:43,330 p is within this range. i equals p. 1077 00:52:43,330 --> 00:52:44,190 Do this one. 1078 00:52:44,190 --> 00:52:46,570 So this one, -- generated that piece of code. 1079 00:52:46,570 --> 00:52:49,280 So I will go a little bit into detail and show how this 1080 00:52:49,280 --> 00:52:51,080 happens, pretty much can happen. 1081 00:52:51,080 --> 00:52:54,640 So I have a little bit of a different space. 1082 00:52:54,640 --> 00:52:55,400 I'm doing a different projection. 1083 00:52:55,400 --> 00:52:57,050 I'm doing i, j, p. 1084 00:52:57,050 --> 00:53:00,340 I want to project first i away, then j away, and then p away, instead 1085 00:53:00,340 --> 00:53:01,830 of j, i, p as I did before. 1086 00:53:01,830 --> 00:53:04,250 So here's my iteration space, what do I do? 1087 00:53:04,250 --> 00:53:07,860 The first thing I do is I find the bounds of i. So I have 1088 00:53:07,860 --> 00:53:08,410 this thing. 1089 00:53:08,410 --> 00:53:14,230 I just basically expanded this, and eliminated -- the j 1090 00:53:14,230 --> 00:53:16,040 one doesn't contribute to the bounds of i, 1091 00:53:16,040 --> 00:53:17,220 but everybody else does. 1092 00:53:17,220 --> 00:53:19,630 So there are a bunch of things that i has to be less than 1093 00:53:19,630 --> 00:53:22,530 and a bunch that i has to be greater than. 1094 00:53:22,530 --> 00:53:26,840 Then what I have is the lower bound of i, it has to be the maximum of 1095 00:53:26,840 --> 00:53:28,540 these because it has to be greater than all three. 1096 00:53:28,540 --> 00:53:30,380 So it has to be max of this, this, and this.
1097 00:53:30,380 --> 00:53:32,190 It has to be less than these two, so it has to be 1098 00:53:32,190 --> 00:53:33,480 the min of these. 1099 00:53:33,480 --> 00:53:33,800 Question? 1100 00:53:33,800 --> 00:53:35,800 AUDIENCE: Well, why did you have to go through all this? 1101 00:53:35,800 --> 00:53:38,520 At least in this case, the outer loop was very simple, 1102 00:53:38,520 --> 00:53:39,970 you could have just directly mapped that. 1103 00:53:39,970 --> 00:53:42,590 PROFESSOR: I agree with you, it's a very simple thing, but 1104 00:53:42,590 --> 00:53:45,260 the problem is that's because you are smart and you can 1105 00:53:45,260 --> 00:53:48,270 think a little bit ahead in there, and if I'm programming 1106 00:53:48,270 --> 00:53:52,290 a computer, I can't say find these special cases. 1107 00:53:52,290 --> 00:53:54,850 So I want to come up with a mathematical way that is a 1108 00:53:54,850 --> 00:53:57,340 bulletproof way that will work from the simplest one to 1109 00:53:57,340 --> 00:53:59,970 very complicated, like for example, finding the loop 1110 00:53:59,970 --> 00:54:04,850 bounds for that loop transpose that I showed you before -- 1111 00:54:04,850 --> 00:54:09,160 no, the skew, that's what we called it before. 1112 00:54:09,160 --> 00:54:13,266 AUDIENCE: So it's not so much just defining an index to 1113 00:54:13,266 --> 00:54:14,226 iterate on, it's to find the best 1114 00:54:14,226 --> 00:54:16,910 index to map, to parallelize. 1115 00:54:16,910 --> 00:54:20,010 PROFESSOR: Any of them could be an issue, because you have -- 1116 00:54:20,010 --> 00:54:24,640 for example, if the inner dimension depends on i, and i 1117 00:54:24,640 --> 00:54:27,270 goes outside, then I can't make it depend on i. 1118 00:54:27,270 --> 00:54:33,850 So if I have something like for i equals something, for j 1119 00:54:33,850 --> 00:54:37,910 equals i to something. 1120 00:54:37,910 --> 00:54:41,350 Now if I switch these two I have for j.
1121 00:54:41,350 --> 00:54:42,770 I can't say it's i to something. 1122 00:54:42,770 --> 00:54:45,830 I have to get rid of i and I have to figure out in the for 1123 00:54:45,830 --> 00:54:49,230 i, this has to be something with j, with some function 1124 00:54:49,230 --> 00:54:52,160 with j in here. 1125 00:54:52,160 --> 00:54:56,690 So what is this function, how do you get that? 1126 00:54:56,690 --> 00:54:59,070 You need this kind of transformation to do that. 1127 00:54:59,070 --> 00:55:01,260 Next time I'll talk to you about whether you can do it a little 1128 00:55:01,260 --> 00:55:02,480 bit even better. 1129 00:55:02,480 --> 00:55:03,730 So I get this bound in here. 1130 00:55:06,600 --> 00:55:11,390 Then actually you found this is going from p to p. 1131 00:55:11,390 --> 00:55:16,400 So I can actually set i equal to p, because of the min and max in here. 1132 00:55:16,400 --> 00:55:19,370 Then after you do that, what you have to do is eliminate i. 1133 00:55:19,370 --> 00:55:25,980 The way you eliminate i is you take each of these lower bounds, which has to be always 1134 00:55:25,980 --> 00:55:29,160 less than n and less than p. 1135 00:55:29,160 --> 00:55:34,630 So you take these n lower bounds and m upper bounds here and you get n times m 1136 00:55:34,630 --> 00:55:35,660 constraints in here. 1137 00:55:35,660 --> 00:55:39,100 So the first three have to be less than n, and again, we repeat 1138 00:55:39,100 --> 00:55:42,260 it, they have to be less than p. 1139 00:55:42,260 --> 00:55:45,710 Then, of course, the missing constraint that 1 1140 00:55:45,710 --> 00:55:47,660 is less than j. 1141 00:55:47,660 --> 00:55:49,430 You put all those constraints together. 1142 00:55:49,430 --> 00:55:51,730 Now, the nice thing is in that one, it's still legal, it 1143 00:55:51,730 --> 00:55:54,050 still represents that space, but you don't 1144 00:55:54,050 --> 00:55:55,580 have i there anymore. 1145 00:55:55,580 --> 00:55:58,470 You can completely get rid of i.
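The elimination step just described, pairing every lower bound on i with every upper bound on i, can be sketched as a small Fourier-Motzkin routine. This is a minimal illustration under stated assumptions: the dictionary encoding of constraints, the helper names `eliminate` and `holds`, and the exact constraint list in `space` (1 <= i <= n, 1 <= j <= i - 1, i = p) are reconstructions for this sketch, not taken from the slides. It ignores integer rounding, which is harmless here because every coefficient is 1 or -1.

```python
from itertools import product

# A constraint is a dict of coefficients meaning: sum(coeff * var) + const >= 0.
# The key "const" holds the constant term; n is kept as a symbolic variable.


def eliminate(constraints, x):
    """Project variable x away by pairing every lower bound with every upper bound."""
    lower = [c for c in constraints if c.get(x, 0) > 0]    # x >= something
    upper = [c for c in constraints if c.get(x, 0) < 0]    # x <= something
    result = [c for c in constraints if c.get(x, 0) == 0]  # does not mention x
    for lo, up in product(lower, upper):
        a, b = lo[x], -up[x]                  # both positive multipliers
        combined = {}
        for key in set(lo) | set(up):
            value = b * lo.get(key, 0) + a * up.get(key, 0)
            if key != x and value != 0:
                combined[key] = value
        result.append(combined)               # x has cancelled out
    return result


def holds(constraint, env):
    """Check one constraint at a concrete point."""
    total = constraint.get("const", 0)
    total += sum(v * env[k] for k, v in constraint.items() if k != "const")
    return total >= 0


space = [
    {"i": 1, "const": -1},            # 1 <= i
    {"n": 1, "i": -1},                # i <= n
    {"j": 1, "const": -1},            # 1 <= j
    {"i": 1, "j": -1, "const": -1},   # j <= i - 1
    {"i": 1, "p": -1},                # p <= i
    {"p": 1, "i": -1},                # i <= p
]
projected = eliminate(space, "i")
print(len(projected))   # 3 lower bounds x 2 upper bounds, plus the untouched 1 <= j
```

Some of the combined constraints are redundant (one is trivially true), which is the redundancy a real implementation would prune before reading off the next set of bounds.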
1146 00:55:58,470 --> 00:56:01,160 So, by doing that -- and then of course, there's a lot of 1147 00:56:01,160 --> 00:56:03,940 redundancy in here, and then you can do some analysis and 1148 00:56:03,940 --> 00:56:05,915 eliminate redundancy and you end up in this set of 1149 00:56:05,915 --> 00:56:06,760 constraints. 1150 00:56:06,760 --> 00:56:09,800 That's where when you say what's the best, you can be 1151 00:56:09,800 --> 00:56:13,590 best -- it has to be correct or that means you can't have 1152 00:56:13,590 --> 00:56:15,220 additional iterations or less iterations. 1153 00:56:15,220 --> 00:56:19,400 But best depends on how complicated is the loop bound 1154 00:56:19,400 --> 00:56:21,610 calculation. 1155 00:56:21,610 --> 00:56:24,080 You can come up with a correct solution, and the best is 1156 00:56:24,080 --> 00:56:26,060 depending on which order you do that. 1157 00:56:26,060 --> 00:56:27,530 When you have two redundant thing, which one you 1158 00:56:27,530 --> 00:56:29,600 eliminate, so you can have a lot of heuristics saying OK, 1159 00:56:29,600 --> 00:56:32,520 look if this one looks harder to calculate, eliminate that 1160 00:56:32,520 --> 00:56:34,080 one with the other one. 1161 00:56:34,080 --> 00:56:36,800 So you get this set of constraints. 1162 00:56:36,800 --> 00:56:39,950 Then you have to do is now find the bounds of j. 1163 00:56:39,950 --> 00:56:42,195 So you have this set again. 1164 00:56:42,195 --> 00:56:46,450 To find a bound of j only two constraints are there, and you 1165 00:56:46,450 --> 00:56:51,210 know j goes to 1 to p minus 1, and you find the bound of j. 1166 00:56:51,210 --> 00:56:55,860 Getting rid of j means there's only two. 1167 00:56:55,860 --> 00:56:57,550 One get rid of p minus 1. 1168 00:56:57,550 --> 00:56:58,980 There are two left for p. 
1169 00:56:58,980 --> 00:57:02,200 You put it there, and then you can eliminate the redundancy 1170 00:57:02,200 --> 00:57:07,450 in here, and now you can find the bounds of p which goes 1171 00:57:07,450 --> 00:57:09,530 from 2 to n. 1172 00:57:09,530 --> 00:57:14,720 And suddenly you have the loop nest. So now I actually did 1173 00:57:14,720 --> 00:57:17,200 parallelization and a loop transpose in here. 1174 00:57:20,200 --> 00:57:23,480 I could combine those two, use this simple mathematical way 1175 00:57:23,480 --> 00:57:27,050 and find loop bounds in here. 1176 00:57:27,050 --> 00:57:30,650 So, I'm going to give you something even a little bit 1177 00:57:30,650 --> 00:57:32,830 interesting beyond that, which is communication code 1178 00:57:32,830 --> 00:57:34,080 generation. 1179 00:57:37,470 --> 00:57:39,240 So if you are dealing with a cache coherent shared memory 1180 00:57:39,240 --> 00:57:40,940 machine, you are done. 1181 00:57:40,940 --> 00:57:43,890 You generate code for the parallel loop nest, you can go home 1182 00:57:43,890 --> 00:57:45,900 because everything else will be done automatically. 1183 00:57:45,900 --> 00:57:48,920 But as we all know in something like Cell, if you 1184 00:57:48,920 --> 00:57:51,100 have non-cache-coherent shared memory or distributed 1185 00:57:51,100 --> 00:57:54,050 memory, you have to do this one first. Then you have to 1186 00:57:54,050 --> 00:57:56,640 identify communication and then you generate 1187 00:57:56,640 --> 00:57:59,590 communication code. 1188 00:57:59,590 --> 00:58:04,670 This is an additional burden in here. 1189 00:58:04,670 --> 00:58:07,630 So until now in data dependence analysis, what we 1190 00:58:07,630 --> 00:58:11,040 looked at was location-centric dependences. 1191 00:58:11,040 --> 00:58:13,950 Which location written by processor one is used by 1192 00:58:13,950 --> 00:58:15,650 processor two. 1193 00:58:15,650 --> 00:58:19,600 That's kind of a location-centric kind of view.
1194 00:58:19,600 --> 00:58:23,470 How about if there are multiple writes to the same location? 1195 00:58:23,470 --> 00:58:25,220 We showed that in the example, if multiple people write the same 1196 00:58:25,220 --> 00:58:29,450 location, which one should I use? 1197 00:58:29,450 --> 00:58:30,850 That's not clear. 1198 00:58:30,850 --> 00:58:32,730 What you are using is the last guy who wrote that 1199 00:58:32,730 --> 00:58:35,290 location before I read that thing, and that's not in these 1200 00:58:35,290 --> 00:58:36,690 data flow analyses. 1201 00:58:36,690 --> 00:58:40,110 No, data dependence analysis doesn't get it. 1202 00:58:40,110 --> 00:58:43,490 What you want is something value-centric. 1203 00:58:43,490 --> 00:58:47,330 Who was the last write before my iteration, 1204 00:58:47,330 --> 00:58:49,540 who wrote that location? 1205 00:58:49,540 --> 00:58:52,490 If I know the last write, he's the one I should be getting 1206 00:58:52,490 --> 00:58:53,570 the value from. 1207 00:58:53,570 --> 00:58:57,720 If the last write happened in the same processor, I am set 1208 00:58:57,720 --> 00:59:00,270 because I wrote the local copy and I don't need 1209 00:59:00,270 --> 00:59:02,010 to deal with anything. 1210 00:59:02,010 --> 00:59:05,040 If the last write happened in a different processor, you 1211 00:59:05,040 --> 00:59:07,135 need to get that value from the guys who wrote it and say, 1212 00:59:07,135 --> 00:59:09,470 OK, you wrote that value, give it to me. 1213 00:59:09,470 --> 00:59:12,340 If nobody wrote it and I'm reading it, that means the 1214 00:59:12,340 --> 00:59:16,340 value came from the original array because nobody had 1215 00:59:16,340 --> 00:59:17,600 written it in my iteration. 1216 00:59:17,600 --> 00:59:19,410 Then I'm reading something that has come from the 1217 00:59:19,410 --> 00:59:21,370 previous iteration. 1218 00:59:21,370 --> 00:59:23,630 So I have to get it from the original array.
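The three value-centric cases above (last write on the same processor, last write on a different processor, no write at all) can be illustrated by replaying a sequential trace. This is only a brute-force sketch with a made-up trace: `classify_reads`, the event tuples, and the location names are hypothetical, and a compiler would derive the same information symbolically rather than by replaying events.

```python
def classify_reads(trace):
    """trace: (processor, "write" | "read", location) events in sequential order."""
    last_writer = {}            # location -> processor of the most recent write
    sources = []
    for proc, op, loc in trace:
        if op == "write":
            last_writer[loc] = proc
        else:
            writer = last_writer.get(loc)
            if writer is None:
                kind = "original array"           # nobody wrote it inside the loop
            elif writer == proc:
                kind = "local copy"               # last write on the same processor
            else:
                kind = f"receive from {writer}"   # last write on another processor
            sources.append((proc, loc, kind))
    return sources


trace = [
    (1, "write", "X[1]"),
    (2, "write", "X[1]"),   # overwrites processor 1's value
    (2, "read", "X[1]"),
    (3, "read", "X[1]"),
    (3, "read", "X[9]"),
]
for entry in classify_reads(trace):
    print(entry)
```

Note that the read on processor 3 depends only on processor 2, even though processor 1 also wrote the same location; a location-centric view would report both.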
1219 00:59:23,630 --> 00:59:26,800 But I have these three different conditions. 1220 00:59:26,800 --> 00:59:29,260 So you need to represent that. 1221 00:59:29,260 --> 00:59:30,522 I'm not going to go into detail on this 1222 00:59:30,522 --> 00:59:34,160 representation called Last Write Trees. 1223 00:59:34,160 --> 00:59:41,160 So what it says is in this kind of a loop nest in here, 1224 00:59:41,160 --> 00:59:44,160 you have some read accesses and write accesses in here, and if 1225 00:59:44,160 --> 00:59:46,480 you look at it location-centrically you get 1226 00:59:46,480 --> 00:59:49,860 this entire complex graph, because this is the graph that 1227 00:59:49,860 --> 00:59:53,166 should have been in that example we gave. So these 1228 00:59:53,166 --> 00:59:55,760 arrows going in here. 1229 00:59:55,760 --> 00:59:57,500 I'm switching notation. 1230 00:59:57,500 --> 00:59:59,820 Before, i was going the other way around. j was in here. 1231 01:00:02,960 --> 01:00:06,590 But if you go look at value-centric, 1232 01:00:06,590 --> 01:00:07,440 this is what happens. 1233 01:00:07,440 --> 01:00:11,150 So you say all these guys basically got 1234 01:00:11,150 --> 01:00:12,910 the value from outside. 1235 01:00:12,910 --> 01:00:14,070 Nobody wrote it. 1236 01:00:14,070 --> 01:00:15,920 This got from -- this is the write, this is the last write, 1237 01:00:15,920 --> 01:00:16,730 this is the last write -- 1238 01:00:16,730 --> 01:00:19,380 I actually have my last write information. 1239 01:00:19,380 --> 01:00:21,970 So the way to look at that is some parts of the 1240 01:00:21,970 --> 01:00:23,640 iteration space got the value from somewhere, other parts got it from 1241 01:00:23,640 --> 01:00:24,740 somewhere else. 1242 01:00:24,740 --> 01:00:27,380 You can't kind of do a big summary, as you point out that 1243 01:00:27,380 --> 01:00:30,080 kind of dependence depends on where the iterations are.
1244 01:00:30,080 --> 01:00:34,320 So you can represent it using a tree when it shows up. 1245 01:00:34,320 --> 01:00:39,170 So you can say if j is greater than 1, here's the 1246 01:00:39,170 --> 01:00:41,430 relationship between reads and writes. 1247 01:00:41,430 --> 01:00:44,230 Otherwise the relationship means it came from outside. 1248 01:00:44,230 --> 01:00:46,550 So I can say that for each of the different places. 1249 01:00:46,550 --> 01:00:49,410 So you can think about this tree, it can be a lot more 1250 01:00:49,410 --> 01:00:50,200 complicated tree. 1251 01:00:50,200 --> 01:00:52,740 So for each part of the iteration space, I got data from 1252 01:00:52,740 --> 01:00:53,990 somewhere else. 1253 01:00:57,060 --> 01:00:59,730 So, you get this function here. 1254 01:00:59,730 --> 01:01:01,940 I think I'll go to the next slide. 1255 01:01:22,490 --> 01:01:27,240 So what you can do is now I have the processor that reads, 1256 01:01:27,240 --> 01:01:30,880 the processor that writes, and the iterations where I am reading 1257 01:01:30,880 --> 01:01:33,190 and writing. 1258 01:01:33,190 --> 01:01:38,090 One thing I can do is I can represent it using a huge 1259 01:01:38,090 --> 01:01:40,660 multi-dimensional space. 1260 01:01:40,660 --> 01:01:46,300 So what happens in here is the receive iterations, those are 1261 01:01:46,300 --> 01:01:48,260 the iterations where data actually has to be received in 1262 01:01:48,260 --> 01:01:49,440 communication. 1263 01:01:49,440 --> 01:01:51,790 Assume that the part I'm actually communicating is also 1264 01:01:51,790 --> 01:01:55,570 within the loop bound, so I can write that. 1265 01:01:55,570 --> 01:02:00,280 And the last write relation is that i 1266 01:02:00,280 --> 01:02:02,890 send has to be i receive. 1267 01:02:02,890 --> 01:02:04,140 We know that.
1268 01:02:06,040 --> 01:02:11,250 What you have is the parallelism with the processors -- this is 1269 01:02:11,250 --> 01:02:14,660 i iterations are parallel, so the processor, the receive processor, 1270 01:02:14,660 --> 01:02:18,500 is running iteration i, processor i. 1271 01:02:18,500 --> 01:02:20,410 Send iterations are the same because you want to 1272 01:02:20,410 --> 01:02:22,990 parallelize that loop basically. 1273 01:02:22,990 --> 01:02:26,900 Each iteration gets assigned to each processor. 1274 01:02:26,900 --> 01:02:29,370 Of course, you want to make sure the processor communication 1275 01:02:29,370 --> 01:02:29,960 is non-local. 1276 01:02:29,960 --> 01:02:31,900 If it's local I don't have to do communication. 1277 01:02:31,900 --> 01:02:35,310 I can represent this as this gigantic system of inequalities. 1278 01:02:35,310 --> 01:02:38,590 It has one, two, three, four, five, and there's a j receive 1279 01:02:38,590 --> 01:02:42,130 also in here, because you've got to remember I think the 1280 01:02:42,130 --> 01:02:45,480 program I wrote, the original program basically, the write 1281 01:02:45,480 --> 01:02:48,230 happens in the outer loop and the read happens in the inner loop. 1282 01:02:48,230 --> 01:02:51,690 So there's only j receive, and i send in here. 1283 01:02:51,690 --> 01:02:54,220 I'll show that later. 1284 01:02:54,220 --> 01:02:56,570 So I have five dimensions. 1285 01:02:56,570 --> 01:03:00,990 So I can't really draw five dimensions, but can I wait 1286 01:03:00,990 --> 01:03:02,240 until it comes back? 1287 01:03:24,640 --> 01:03:29,050 So what I have here is I have this complete system of 1288 01:03:29,050 --> 01:03:32,620 inequalities for receive and send communication. 1289 01:03:32,620 --> 01:03:36,650 Of course, since I can't draw five dimensions, and these 1290 01:03:36,650 --> 01:03:39,100 dimensions are the same, I just drew them as the same.
1291 01:03:39,100 --> 01:03:40,430 So you can actually assume that there's another two 1292 01:03:40,430 --> 01:03:42,410 dimensions for this one, and that's a 1293 01:03:42,410 --> 01:03:44,110 line in that dimension. 1294 01:03:47,320 --> 01:03:50,100 Actually, this is wrong. 1295 01:03:50,100 --> 01:03:50,200 Sorry. 1296 01:03:50,200 --> 01:03:53,200 This should be xi here written. 1297 01:03:56,330 --> 01:03:57,580 My program is wrong, sorry. 1298 01:04:04,290 --> 01:04:05,690 Now what do I do? 1299 01:04:05,690 --> 01:04:10,460 One more time it has to go. 1300 01:04:10,460 --> 01:04:17,300 It makes me slow down my lectures which is probably a 1301 01:04:17,300 --> 01:04:18,350 good thing. 1302 01:04:18,350 --> 01:04:19,490 There we go. 1303 01:04:19,490 --> 01:04:23,500 So what you can do is you can just scan these by projecting 1304 01:04:23,500 --> 01:04:26,720 different ways to calculate the send loop nest and receive 1305 01:04:26,720 --> 01:04:30,060 loop nest. So if you scan in that direction, what you end 1306 01:04:30,060 --> 01:04:34,280 up with is something saying for this processor you need to 1307 01:04:34,280 --> 01:04:39,090 send, for this iteration, this processor. 1308 01:04:39,090 --> 01:04:42,080 What you need to send will be received by these 1309 01:04:42,080 --> 01:04:46,650 processors and this iteration and this, and this you can 1310 01:04:46,650 --> 01:04:51,760 send xi to this iteration at this processor. 1311 01:04:51,760 --> 01:04:55,600 Because you had that relationship, you can get the 1312 01:04:55,600 --> 01:05:00,720 loop nest that actually will do the send. 1313 01:05:00,720 --> 01:05:03,590 The same way you can actually get a loop nest to do the 1314 01:05:03,590 --> 01:05:06,640 receive and it shows up. 1315 01:05:06,640 --> 01:05:11,340 So what that means is, so all these guys have to send, all 1316 01:05:11,340 --> 01:05:13,280 these iterations have to do the receive.
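The scanning-by-projection idea can be sketched with a small Fourier-Motzkin elimination. This is an illustrative sketch under stated assumptions, not the compiler's actual code: the two-variable system (receiver `i`, sender `j`, with `1 <= j <= i - 1` and `i <= 4`) is a made-up example, and the helper names `eliminate`, `bounds`, and `scan` are hypothetical. Each inequality is stored as `(coeffs, b)` meaning `coeffs[0]*i + coeffs[1]*j <= b`. Projection over the rationals is exact for this small system, but in general it can admit integer points that the original system does not.

```python
from itertools import product

# Assumed example: variables are (i, j) = (receiver, sender), with n = 4.
SYSTEM = [
    ((0, -1), -1),   # j >= 1
    ((-1, 1), -1),   # j <= i - 1
    ((1, 0), 4),     # i <= 4
    ((-1, 0), -1),   # i >= 1
]


def eliminate(ineqs, k):
    """Fourier-Motzkin step: project the system onto the variables other than k."""
    pos = [q for q in ineqs if q[0][k] > 0]    # upper bounds on variable k
    neg = [q for q in ineqs if q[0][k] < 0]    # lower bounds on variable k
    out = [q for q in ineqs if q[0][k] == 0]   # inequalities not involving k
    for (pc, pb), (nc, nb) in product(pos, neg):
        a, b = -nc[k], pc[k]                   # positive multipliers that cancel k
        coeffs = tuple(a * x + b * y for x, y in zip(pc, nc))
        out.append((coeffs, a * pb + b * nb))
    return out


def bounds(ineqs, k, fixed=None):
    """Integer (low, high) for variable k once every other variable is fixed or absent."""
    fixed = fixed or {}
    lo = hi = None
    for coeffs, b in ineqs:
        c = coeffs[k]
        if c == 0:
            continue
        rest = b - sum(coeffs[v] * val for v, val in fixed.items())
        if c > 0:
            ub = rest // c                     # floor(rest / c)
            hi = ub if hi is None else min(hi, ub)
        else:
            lb = -(rest // -c)                 # ceil(rest / c)
            lo = lb if lo is None else max(lo, lb)
    return lo, hi


def scan(ineqs, outer, inner):
    """Visit all integer points with `outer` as the outer loop variable."""
    lo, hi = bounds(eliminate(ineqs, inner), outer)
    points = []
    for a in range(lo, hi + 1):
        ilo, ihi = bounds(ineqs, inner, {outer: a})
        for b in range(ilo, ihi + 1):
            points.append((a, b))
    return points


if __name__ == "__main__":
    print("receive order (i, j):", scan(SYSTEM, 0, 1))
    print("send order    (j, i):", scan(SYSTEM, 1, 0))
```

Scanning with the receiver outermost gives the shape of a receive loop nest, and scanning with the sender outermost gives the shape of a send loop nest. Both orders visit the same set of points.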
1317 01:05:30,900 --> 01:05:35,330 So, if you projected in a different ordering, what you 1318 01:05:35,330 --> 01:05:39,550 end up with is you can say now this processor has to receive. 1319 01:05:39,550 --> 01:05:42,140 All these processors had to receive something sent by 1320 01:05:42,140 --> 01:05:44,500 these guys. 1321 01:05:44,500 --> 01:05:48,550 So now you can get that entire loop nest for receiving and 1322 01:05:48,550 --> 01:05:49,810 entire loop nest for sending, and you have 1323 01:05:49,810 --> 01:05:51,570 computation loop nest also. 1324 01:05:51,570 --> 01:05:53,790 The problem is you can't run them sequentially because 1325 01:05:53,790 --> 01:05:55,440 you have to run them in some interleaved order. 1326 01:05:55,440 --> 01:06:00,350 So what you have is something that next slide will show. 1327 01:06:00,350 --> 01:06:02,580 So you have this iteration, there's some computation 1328 01:06:02,580 --> 01:06:05,760 happen from all the one, and I will get a loop nest to do some 1329 01:06:05,760 --> 01:06:08,700 send, I need a loop nest to do some receive, in a one dimensional, 1330 01:06:08,700 --> 01:06:11,440 these kind of, you get three separate things. 1331 01:06:11,440 --> 01:06:13,540 But of course, what you have to do is you 1332 01:06:13,540 --> 01:06:14,940 have to generate code. 1333 01:06:14,940 --> 01:06:16,610 So the way to do that is --. 1334 01:06:21,850 --> 01:06:24,240 So what you have to do is kind of break this apart into 1335 01:06:24,240 --> 01:06:27,290 pieces where things happen, so this one you do computation, 1336 01:06:27,290 --> 01:06:31,330 this one you do computation and receive, and computation 1337 01:06:31,330 --> 01:06:32,580 send receive and whatever. 1338 01:06:34,840 --> 01:06:37,720 Should be probably send here and receive but -- 1339 01:06:40,350 --> 01:06:43,360 For that one, if you combine this you get a complicated 1340 01:06:43,360 --> 01:06:43,960 mess like this.
1341 01:06:43,960 --> 01:06:48,200 But this all can be done in a very automated fashion by 1342 01:06:48,200 --> 01:06:51,850 using this Fourier-Motzkin Elimination and this linear 1343 01:06:51,850 --> 01:06:55,830 representation. 1344 01:06:55,830 --> 01:06:57,240 Of course, you can do a lot of interesting 1345 01:06:57,240 --> 01:06:57,880 things on top of that. 1346 01:06:57,880 --> 01:06:59,600 You can eliminate redundant communication, if you 1347 01:06:59,600 --> 01:07:01,950 keep sending the same thing again that has already been sent, 1348 01:07:01,950 --> 01:07:04,780 eliminate that, you can aggregate communication. 1349 01:07:04,780 --> 01:07:06,650 Instead of sending a word at a time, you can send a bunch of 1350 01:07:06,650 --> 01:07:08,940 things in one packet. 1351 01:07:08,940 --> 01:07:09,810 You can do multicast. 1352 01:07:09,810 --> 01:07:12,160 So same thing, send to multiple people. 1353 01:07:12,160 --> 01:07:15,165 You don't have that much in Cell, but assume some machines 1354 01:07:15,165 --> 01:07:18,050 have multicast support, you can do that, and also you can 1355 01:07:18,050 --> 01:07:20,670 do some local memory management because if you have 1356 01:07:20,670 --> 01:07:23,270 distributed memory, you don't have to allocate everybody's 1357 01:07:23,270 --> 01:07:24,520 memory and only use a part. 1358 01:07:24,520 --> 01:07:26,370 You can say OK, look everybody only had to 1359 01:07:26,370 --> 01:07:29,270 allocate that part. 1360 01:07:29,270 --> 01:07:30,680 OK. 1361 01:07:30,680 --> 01:07:36,180 In summary, I think automatic parallelism of loops and 1362 01:07:36,180 --> 01:07:39,350 arrays -- we talked about data dependence analysis, and we 1363 01:07:39,350 --> 01:07:42,210 talked about iteration and data spaces, and how to do that, 1364 01:07:42,210 --> 01:07:46,890 and how to formulate it as an integer programming problem.
1365 01:07:46,890 --> 01:07:49,380 We can look at a lot of optimizations that can increase 1366 01:07:49,380 --> 01:07:51,740 parallelism and then do that. 1367 01:07:51,740 --> 01:07:55,760 Also, we can deal with things like communication code 1368 01:07:55,760 --> 01:07:58,150 generation and generating loop nests by doing this 1369 01:07:58,150 --> 01:07:59,570 Fourier-Motzkin Elimination. 1370 01:07:59,570 --> 01:08:03,260 So what I want to show out of this talk is that, in fact, 1371 01:08:03,260 --> 01:08:06,180 this parallelization -- automatic parallelization of 1372 01:08:06,180 --> 01:08:10,060 normal loops can be done by mapping into some nice 1373 01:08:10,060 --> 01:08:12,440 mathematical framework, and basically 1374 01:08:12,440 --> 01:08:15,520 manipulating in that math. 1375 01:08:15,520 --> 01:08:18,960 So there are many other things that really complicate 1376 01:08:18,960 --> 01:08:23,040 life when parallelizing programs. So like C, there are 1377 01:08:23,040 --> 01:08:24,860 pointers, you have to deal with that. 1378 01:08:24,860 --> 01:08:29,020 So this problem is not this simple, but what compiler 1379 01:08:29,020 --> 01:08:31,570 writers try to do most of the time is trying to find this 1380 01:08:31,570 --> 01:08:32,380 kind of thing. 1381 01:08:32,380 --> 01:08:34,960 Find interesting mathematical models and do a mapping in 1382 01:08:34,960 --> 01:08:38,150 there and then operate on that model and hopefully you can 1383 01:08:38,150 --> 01:08:41,940 get the analysis needed and even the transformation needed 1384 01:08:41,940 --> 01:08:43,760 using that kind of a nice model. 1385 01:08:43,760 --> 01:08:48,770 So I just kind of gave you a good feel for general 1386 01:08:48,770 --> 01:08:49,690 parallelizing compilers. 1387 01:08:49,690 --> 01:08:51,690 We will take a ten-minute break 1388 01:08:51,690 --> 01:08:54,760 and talk about streaming.
1389 01:08:54,760 --> 01:08:55,780 We'll see if I can make this computer run 1390 01:08:55,780 --> 01:08:57,030 faster in the meantime.