The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: So last week, in the last few lectures, you heard about parallel architectures, and lecture four started the discussion of concurrency: how do you take applications, or independent actors that want to operate on the same data, and make them run safely together?

Recapping the last two lectures, you saw two primary classes of architectures, although Saman talked about a few more. There was the class of shared memory processors -- the multicores that Intel, AMD, and PowerPC have today, for example -- where you have one copy of the data that is shared among all the different processors, because they essentially share the same memory. And you need things like atomicity and synchronization to make sure the sharing is done properly, so that you don't get into data race situations where multiple processors try to update the same data element and you end up with erroneous results.

You also heard about distributed memory processors. An example of that might be the Cell, loosely speaking, where you have cores that primarily access their own local memory. And while you can have a single global memory address space, to get data from memory you essentially have to communicate with the different processors to explicitly fetch data in and out. So things like data distribution -- where the data is, and what your communication pattern looks like -- affect your performance.

What I'm going to talk about in today's lecture is programming these two different kinds of architectures, shared memory processors and distributed memory processors, and present you with some concepts commonly used for programming these machines.

So in shared memory processors, you have, say, n processors, 1 to n, and they're connected to a single memory.
And if one processor asks for the value stored at address X, everybody knows where to go look, because there's only one address X. So different processors can communicate through shared variables, and you need things like locking, as I mentioned, to avoid race conditions or erroneous computation.

As an example of a straightforward parallelization on a shared memory machine, suppose you have a simple loop that's just running through an array, adding elements of array A to elements of array B and writing them to some new array, C. If I gave you this loop, you can probably recognize that there are really no data dependencies here. I can split up this loop into three chunks -- let's say I have three processors -- where one processor does all the computations for iterations zero through three, so the first four iterations, the second processor does the next four iterations, and the third processor does the last four iterations. That's what's shown here -- I should have brought a laser pointer. What you might need is some mechanism to tell the different processors, here's the code you need to run and maybe where to start, and then some way of synchronizing these different processors that says, I'm done, I can move on to the next computation step.

So this is an example of a data parallel computation. The loop has no real dependencies, and each processor can operate on a different piece of the data. What you can do is have a process -- a single application -- that forks off, or creates, what are commonly called threads, and each thread goes on and executes, in this case, the same computation. So a single process can create multiple concurrent threads, and each thread is really just a mechanism for encapsulating some trace of execution, some execution path. In this case you're essentially encapsulating this particular loop here.
And maybe you parameterize your start index and your ending index, or your loop bounds. And in a shared memory processor, since there's only a single memory, you really don't need to do anything special about the data in this particular example, because everybody knows where to go look for it. Everybody can access it, everything's independent, and there are no real issues with races or deadlocks.

So I just wrote down some actual code for that loop that parallelizes it using Pthreads, a commonly used threading mechanism, just to give you a bit of a flavor for the complexity: the simple loop that we had expands to a lot more code in this case. You have your arrays, A, B, and C, each with 12 elements, and you have the basic function -- this is the actual computation that we want to carry out. What I've done here is parameterize where you start in the array: you get this parameter, and then you calculate four iterations' worth of the computation. Now, in my main function, I have this concept of threads that I'm going to create -- in this case I'm going to create three of them. There are some parameters I have to pass in, some attributes which I'm not going to get into here. But then I pass in the function pointer; this is essentially a mechanism that says, once I've created this thread, go to this function and execute this particular code. And then there are some arguments to pass to that function -- here I'm just passing in the index at which each loop chunk starts. And after I've created each thread, implicitly in the thread creation, the code can immediately start running. Once all the threads have started running, I can essentially just exit the program because I've completed.
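The Pthreads version from the slide is along these lines. This is a minimal sketch, not the exact course code; the array size of 12 and the chunk of four iterations per thread follow the description above, and a join loop is added at the end as the usual way to wait for the threads rather than simply exiting.

```c
#include <pthread.h>

#define N        12   /* total number of elements          */
#define CHUNK     4   /* iterations handled by one thread  */
#define NTHREADS  3

int A[N], B[N], C[N];

/* Each thread computes four iterations, starting at the index
 * passed in through the void* argument. */
void *add_chunk(void *arg)
{
    int start = *(int *)arg;
    for (int i = start; i < start + CHUNK; i++)
        C[i] = A[i] + B[i];
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];
    int starts[NTHREADS] = {0, 4, 8};

    /* Fork: each thread runs the same function on a different chunk. */
    for (int t = 0; t < NTHREADS; t++)
        pthread_create(&threads[t], NULL, add_chunk, &starts[t]);

    /* Join: wait for every thread before using C. */
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(threads[t], NULL);

    return 0;
}
```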
So what I showed you with that first example was the concept of data parallelism. You're performing the same computation, but instead of operating on one big chunk of data, I've partitioned the data into smaller chunks and replicated the computation so that I can get that kind of parallelism. But there's another form of parallelism called control parallelism, which uses essentially the same threading model but doesn't necessarily run the same function or the same computation in each thread. I've illustrated that here, where these are your data parallel computations and these are some other computations in your code.

There is a programming model that allows you to express this kind of parallelism and tries to help the programmer by taking sequential code and adding annotations that say, this loop is data parallel, or this set of code has this kind of control parallelism in it. So you start with your parallel code. This is the same program, multiple data kind of parallelization. You might have seen in the previous lecture SIMD -- single instruction, or same instruction, multiple data -- which allowed you to execute the same operation, say an add, over multiple data elements. Here it's similar terminology: there's same program, multiple data, and multiple program, multiple data. This talk is largely focused on the SPMD model, where you essentially have one central decision maker, or you're trying to solve one central computation, and you're trying to parallelize it over your architecture to get the best performance. So you start off with your program, and then you annotate the code with what's parallel and what's not parallel. And you might add in some synchronization directives so that if you do in fact have sharing, you use the right locking mechanism to guarantee safety. Now, in OpenMP, there are some limitations as to what it can do.
It in fact assumes that the programmer knows what he's doing, and the programmer is largely responsible for getting the synchronization right -- if there is sharing, the dependencies have to be protected correctly. So you take your program, insert these annotations, and then you go on to test and debug.

A simple OpenMP example, again using the simple loop -- I've thrown away some of the extra code -- adds two extra pragmas. The first one, the parallel pragma, which I call the data parallel pragma, really says that you can execute as many copies of the following code block as there are processors, or as many as you have thread contexts. In this case I've implicitly made the assumption that I have three processors, so I can automatically partition my code into three sets, and this transformation can essentially be done automatically by the compiler. And then there's a for pragma that says this loop is parallel and you can divide up the work using the mechanism called work sharing: multiple threads collaborate to solve the same computation, but each one does a smaller amount of work.
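A minimal sketch of what that looks like, assuming the same 12-element arrays as before (the exact pragmas on the slide may differ slightly):

```c
#include <omp.h>

#define N 12
int A[N], B[N], C[N];

void add_arrays(void)
{
    /* Create a team of threads; each thread executes the block. */
    #pragma omp parallel
    {
        /* Work sharing: the loop iterations are divided among the
         * threads in the team, e.g. four each with three threads. */
        #pragma omp for
        for (int i = 0; i < N; i++)
            C[i] = A[i] + B[i];
    }
}
```

The two pragmas are often written as the single combined form `#pragma omp parallel for` placed directly on the loop.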
This is in contrast to what I'm going to focus on a lot more in the rest of the talk, which is distributed memory processors and programming for distributed memories. And this will feel a lot more like programming for the Cell as you get more involved in it and your projects get more intense.

So in distributed memory processors, to recap the previous lectures, you have n processors, each processor has its own memory, and they essentially share the interconnection network. Each processor has its own address X. So when a processor, P1, asks for X, it knows where to go look: it looks in its own local memory. If all processors ask for the value at address X, each one goes and looks in a different place -- there are n places to look, really -- and what's stored at those addresses will vary, because each is somebody's local memory.

So if one processor, say P1, wants the value stored at an address in processor two's memory, it actually has to request it explicitly. Processor two has to send it the data, and processor one has to figure out what to do with that copy -- it has to store it somewhere.

So message passing really exposes explicit communication to exchange data. You'll see that there are different kinds of data communication, but the concept of what you exchange has four things you need to address. One: how is the data described, and what does it describe? Two: how are the processes identified -- how do I identify that processor one is sending me this data, and if I'm receiving data, how do I know who I'm receiving it from? Three: are all messages the same -- if I send a message to somebody, do I have any guarantee that it's received or not? And four: what does it mean for a send operation or a receive operation to be completed -- is there some sort of acknowledgment process?

So here's an example of a message passing program -- and if you've started to look at the lab, you'll see that this is essentially where the lab came from; it's the same idea. I have some two-dimensional space, and I have points in this space: points B, which are these blue circles, and points A, which I've represented as these yellow or golden squares. What I want to do is, for every point in A, calculate the distance to all of the points in B. So there's a pairwise interaction between the two arrays. A simple loop essentially does this -- there are n squared interactions -- you have a loop over all the A elements and a loop over all the B elements, and you calculate, in this case, the Euclidean distance, which I'm not showing, and store it into some new array.
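In code, the sequential version is roughly the following. This is a sketch: the point type, array sizes, and names are placeholders, not the lab's actual declarations.

```c
#include <math.h>

#define NA 16   /* number of A points (placeholder) */
#define NB 16   /* number of B points (placeholder) */

typedef struct { double x, y; } point;

point  A[NA], B[NB];
double C[NA][NB];   /* C[i][j] = distance from A[i] to B[j] */

void all_pairs_distances(void)
{
    for (int i = 0; i < NA; i++)        /* loop over all the A elements */
        for (int j = 0; j < NB; j++)    /* loop over all the B elements */
            C[i][j] = sqrt((A[i].x - B[j].x) * (A[i].x - B[j].x) +
                           (A[i].y - B[j].y) * (A[i].y - B[j].y));
}
```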
So if I give you two processors to do this work, processor one and processor two, and I give you some mechanism to share data between the two -- here's my CPU, and each processor has its local memory -- what would be an approach for actually parallelizing this? Anybody look at the lab yet? OK, so what would you do with two processors?

AUDIENCE: One has half the memory [INAUDIBLE]

PROFESSOR: Right. So what was said was that you split one of the arrays in two, and you can actually get that kind of concurrency. So let's say processor one already has the data, and it has some place already allocated where it's going to write C, the results of the computation. Then I can break up the work just like it was suggested. What P1 has to do is send data to P2. It says, here's the data, here's the computation, go ahead and help me out. So I send the first array, and then I send half of the other elements that I want the calculations done for. And then P1 and P2 can start computing in parallel. But notice that P2 has its own array that it's going to store results in, and so as these compute, they actually fill in different logical parts of the overall result matrix. So at the end, for P1 to have all the results, P2 has to send it the rest of the matrix to complete it. And now P1 has all the results, the computation is done, and you can move on. Does that make sense? OK. So you'll get to actually do this as part of your labs.

So in this example messaging program, you start with the sequential code, and we had two processors. Processor one actually does the sending -- this is essentially a template for the code you'll end up writing. And it does halve the work in the outer loop: the A array over which it's iterating, it's only doing half as many iterations.
And processor two has to actually receive the data, and it specifies where to receive the data into. I've omitted some things here -- for example, extra information hidden in these parameters. Here you're sending all of A and all of B, whereas you could have specified extra parameters that say, I'm sending you A, here are n elements to read from A; here's B, here are n/2 elements to read from B; and so on. But the computation is essentially the same, except for the index at which you start, which in this case has changed for processor two. And now, when the computation is done, this one essentially waits until the data is received. Processor two eventually sends it that data, and now you can move on.

AUDIENCE: I have a question.

PROFESSOR: Yeah?

AUDIENCE: So would processor two have to wait for the data from processor one?

PROFESSOR: Yeah, I'll get into that later. So what does it mean to receive? To do this computation, I actually need this instruction to complete. So what does it need for that instruction to complete? I do have to get the data, because otherwise I don't know what to compute on. So there is some implicit synchronization that you have to do, and in some cases it's explicit. I'll get into that a little bit later. Does that sort of hint at the answer? Are you still confused?

AUDIENCE: So processor one doesn't do the computation but it still sends the data --

PROFESSOR: So in terms of tracing, processor one sends the data and then can immediately start executing its own code, right? Processor two, in this particular example, has to wait until it receives the data. Once this receive completes, then it can actually go and start executing the rest of the code. So imagine that it essentially says, wait until I have data, wait until I have something to do. Does that help?
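To make the two sides concrete, here is a minimal sketch of the exchange in MPI-style C. The course code uses its own send/receive primitives rather than MPI, so treat the calls below purely as a stand-in for the generic SEND and RECEIVE on the slides; NA, NB, the tags, and compute_rows are illustrative assumptions.

```c
#include <mpi.h>

#define NA 16
#define NB 16

double A[NA], B[NB], C[NA][NB];

/* The nested distance loop from the earlier sketch, restricted to
 * rows [row_lo, row_hi) of C -- assumed to be defined elsewhere. */
void compute_rows(int row_lo, int row_hi);

void processor_one(void)                   /* MPI rank 0 */
{
    /* Send all of A and all of B to the helper, as on the slide. */
    MPI_Send(A, NA, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    MPI_Send(B, NB, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD);

    compute_rows(0, NA/2);                 /* do the first half locally */

    /* Block until the helper's half of the result matrix arrives. */
    MPI_Recv(C[NA/2], (NA/2) * NB, MPI_DOUBLE, 1, 2, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
}

void processor_two(void)                   /* MPI rank 1 */
{
    /* Blocking receives: nothing to compute until the data is here. */
    MPI_Recv(A, NA, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Recv(B, NB, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    compute_rows(NA/2, NA);                /* same loop, different start index */

    /* Send the computed half of C back to processor one. */
    MPI_Send(C[NA/2], (NA/2) * NB, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD);
}
```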
AUDIENCE: Can the main processor [UNINTELLIGIBLE PHRASE]

PROFESSOR: Can the main processor --

AUDIENCE: I mean, in Cell, everybody is not a peer; there is a master there. And what the master can do, instead of doing computation, is basically be the quarterback -- sending data, receiving data -- and the SPEs can basically be waiting for data, doing the computation, and sending it back. So in some sense, in Cell you probably don't want to do the computation on the master, because that means the master slows down. The master will do only the data management. So that might be one symmetrical [UNINTELLIGIBLE]

PROFESSOR: And you'll see that in the example, because the PPE in that case has to send the data to two different SPEs. Yup?

AUDIENCE: In some sense [UNINTELLIGIBLE PHRASE] at points seems to be [UNINTELLIGIBLE] sense that if -- so you have a huge array and you want to [UNINTELLIGIBLE PHRASE] the data to receive the whole array, then you have to [UNINTELLIGIBLE]

PROFESSOR: Yeah, we'll get into that later. Yeah, that's a good point -- communication is not cheap, and if you don't take that into consideration, you end up paying a lot of overhead for parallelizing things.

AUDIENCE: [INAUDIBLE]

PROFESSOR: Well, you can do things in software as well. We'll get into that.

OK, so some crude performance analysis. I have to calculate this distance, and given two processors, I can effectively get a 2x speedup: by dividing up the work, I can get done in half the time. Well, if you gave me four processors, I can maybe get done four times as fast. And in my communication model here, I have one copy of one array that I'm essentially sending to every processor, and then I'm partitioning my other array, A, into smaller subsets and sending those to each of the different processors. We'll get into terminology for how to actually name these communications later.
But really the thing to take away here is that this granularity -- how I partition A -- affects my performance and communication almost directly. And the comment that was just made is, what do you do about communication? It's not free. So all of that will be addressed.

To understand performance, there are three main concepts that you essentially need to understand. One is coverage -- in other words, how much parallelism do I actually have in my application? That can affect how much work it is worth spending on this particular application. The second is granularity: how do you partition your data among your different processors so that you can keep communication down, keep synchronization down, and so on? The third is locality: while it's not shown in this particular example, if two processors are communicating, they may be close together or far apart, or the communication between one pair of processors may be far cheaper than between another pair -- can I exploit that in some way? We'll talk about that as well.

As an example of parallelism in an application -- there are two projects this term doing ray tracing, so I thought I'd include this slide -- how much parallelism do you have in a ray tracing program? In ray tracing, you have some camera source, some observer, and you're trying to figure out how to color or shade the different pixels on your screen. So what you do is shoot rays from a particular source through your image plane, and then you see how the rays bounce off other objects, and that allows you to render scenes in various ways. So you have different kinds of parallelism. You have your primary rays that are shot in, and if you're shooting into something like water, or some very reflective surface, or some surface that can reflect or transmit, you can end up with a lot more rays that are created at run time.
So there's dynamic parallelism in this particular example: you can end up shooting a lot of rays from any given hit point. So there are different kinds of parallelism you can exploit.

Not all programs have this kind of abundant, embarrassingly parallel computation. You saw some basic code sequences in earlier lectures. So here there's a sequential part, and the reason it's sequential is that there are data flow dependencies between the different computations: I calculate a here, but I need the result of a to do this next instruction; I calculate d here and I need that result to calculate e. But then this loop here is just initializing some big array, and I really can do that in parallel. So I have sequential parts and parallel parts. How does that affect my overall speedup?

There's a law which is really a demonstration of diminishing returns: Amdahl's Law. It says that if you have a really fast car, it's only as good to you as fast as you can drive it. If there's a lot of congestion on your road, or there are posted speed limits or some other mechanism, you really can't exploit all the speed of your car. In other words, you're only as fast as the slowest parts of your computation allow.

To look at this in more detail: your potential speedup is really proportional to the fraction of the code that can be parallelized. Say I have some computation with three parts: a sequential part that takes 25 seconds, a parallel part that takes 50 seconds, and another sequential part that runs in 25 seconds, so the total execution time is 100 seconds. If I have one processor, that's really all I can do. Now suppose you gave me more than one processor -- let's say five processors. Well, I can't do anything about the first sequential part; that's still going to take 25 seconds. And I can't do anything about the other sequential part either; that still takes 25 seconds.
But this parallel part I can essentially break up among the different processors -- five in this case -- and that gets me five-way parallelism, so the 50 seconds is now reduced to 10 seconds. Is that clear so far? So the overall running time in that case is 60 seconds. What would be my speedup? Well, you calculate speedup as the old running time divided by the new running time: 100 seconds divided by 60 seconds, so my parallel version is 1.67 times faster.

So this is great: if I increase the number of processors, then I should be able to get more and more parallelism. But it also means that there's an upper bound on how much speedup you can get. So let p be the fraction of work in your application that's parallel, and let n be the number of processors. Say the old running time is just one unit of work. The sequential work then takes 1 minus p, since p is the fraction that's parallel, and the parallel work takes p. Since I can spread that parallel fraction over n processors, its time becomes p over n, which I can make really small in the limit. Does that make sense so far? So the speedup tends to 1 over 1 minus p in the limit: as I increase the number of processors and n gets really large, that's essentially my upper bound on how fast the program can go -- how much I can exploit out of my program.

So what this law says -- the implication here -- is that if your program has a lot of inherent parallelism, you can do really well. But if your program doesn't have any parallelism, there's really nothing you can do, and parallel architectures won't really help you. And there are some interesting trade-offs that you might consider, for example, if you're designing a chip, or if you're looking at an application or a domain of applications and figuring out what the best architecture is to run them on.
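Written out, with p the parallel fraction and n the number of processors, the argument above is:

```latex
\text{speedup}(n) \;=\; \frac{T_{\text{old}}}{T_{\text{new}}}
                  \;=\; \frac{1}{(1-p) + p/n},
\qquad
\lim_{n \to \infty} \text{speedup}(n) \;=\; \frac{1}{1-p}.
```

Plugging in the numbers from the example, p = 0.5 and n = 5 gives 1 / (0.5 + 0.5/5) = 1/0.6, about 1.67, matching the 100/60 above.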
So in terms of performance scalability, as I increase the number of processors, I get speedup. You can define efficiency to be linear at 100%, but typically you end up in the sublinear domain, because communication is often not free. You can, however, get superlinear speedups on real architectures because of secondary and tertiary effects that come from register allocation or caching. Those can hide a lot of latency, or you can take advantage of pipelining mechanisms in the architecture to get superlinear speedups. So you can end up in either of two domains.

That was a small overview of the extent of parallelism in your program and how it affects your overall execution. The other concept is granularity. Given that I have this much parallelism, how do I exploit it? There are different ways of exploiting it, and that comes down to: how do I subdivide my problem? What is the granularity of the sub-problems I'm going to compute on? And granularity, from my perspective, is just a qualitative measure of the ratio of your computation to your communication. If you're doing a lot of computation with very little communication, you could be doing really well; or, vice versa, you could be communication limited, and then you need a lot of bandwidth, for example, in your architecture.

AUDIENCE: Like before, you really didn't have to give every single processor an entire copy of B.

PROFESSOR: Right. Yeah, good point. And as you saw in the previous slides, computation stages are separated by communication stages, and your communication in a lot of cases essentially serves as synchronization: I need everybody to get to the same point before I can move on logically in my computation. So there are two classes of granularity: there's fine grain and, as you'll see, coarse grain.
585 00:27:10,840 --> 00:27:14,010 So in fine-grain parallelism, you have low computation to 586 00:27:14,010 --> 00:27:15,460 communication ratio. 587 00:27:15,460 --> 00:27:18,150 And that has good properties in that you have a small 588 00:27:18,150 --> 00:27:21,010 amount of work done between communication stages. 589 00:27:24,410 --> 00:27:27,910 And it has bad properties in that it gives you less 590 00:27:27,910 --> 00:27:29,530 performance opportunity. 591 00:27:32,190 --> 00:27:33,850 It should be more, right? 592 00:27:33,850 --> 00:27:35,100 More opportunity for -- 593 00:27:35,100 --> 00:27:35,600 AUDIENCE: No. 594 00:27:35,600 --> 00:27:36,100 Less. 595 00:27:36,100 --> 00:27:36,480 PROFESSOR: Sorry. 596 00:27:36,480 --> 00:27:37,120 Yeah, yeah, sorry. 597 00:27:37,120 --> 00:27:38,370 I didn't get enough sleep. 598 00:27:41,670 --> 00:27:44,470 So less opportunities for performance enhancement, but 599 00:27:44,470 --> 00:27:48,870 you have high communication ratio because essentially 600 00:27:48,870 --> 00:27:50,800 you're communicating very often. 601 00:27:50,800 --> 00:27:55,120 So these are the computations here and these yellow bars are 602 00:27:55,120 --> 00:27:56,850 the synchronization points. 603 00:27:56,850 --> 00:27:59,640 So I have to distribute data or communicate. 604 00:27:59,640 --> 00:28:02,000 I do computations but, you know, computation 605 00:28:02,000 --> 00:28:02,990 doesn't last very long. 606 00:28:02,990 --> 00:28:06,410 And I do more communication or more synchronization, and I 607 00:28:06,410 --> 00:28:07,750 repeat the process. 608 00:28:07,750 --> 00:28:11,370 So naturally you can adjust this granularity to sort of 609 00:28:11,370 --> 00:28:12,530 reduce the communication overhead. 610 00:28:12,530 --> 00:28:14,510 AUDIENCE: [UNINTELLIGIBLE PHRASE] 611 00:28:14,510 --> 00:28:17,010 two things in that overhead part. 612 00:28:17,010 --> 00:28:18,120 One is the volume. 613 00:28:18,120 --> 00:28:21,380 So one, communication. 614 00:28:21,380 --> 00:28:23,484 Also there's a large part of synchronization cost. 615 00:28:23,484 --> 00:28:26,210 Basically you get a communication goal and you 616 00:28:26,210 --> 00:28:29,022 have to go start the messages and wait until 617 00:28:29,022 --> 00:28:29,210 everybody is done. 618 00:28:29,210 --> 00:28:32,250 So that overhead also can go. 619 00:28:32,250 --> 00:28:35,642 Even if you don't send that much data, just the fact that 620 00:28:35,642 --> 00:28:38,260 you are communicating, that means you have to do a lot of 621 00:28:38,260 --> 00:28:43,076 this additional bookkeeping stuff, that especially in the 622 00:28:43,076 --> 00:28:43,679 distributed [? memory ?] 623 00:28:43,679 --> 00:28:44,954 [? machine is ?] pretty expensive. 624 00:28:44,954 --> 00:28:45,910 PROFESSOR: Yeah. 625 00:28:45,910 --> 00:28:48,780 Thanks. 626 00:28:48,780 --> 00:28:52,490 So in coarse-grain parallelism, you sort of make 627 00:28:52,490 --> 00:28:55,070 the work chunks more and more so that you do the 628 00:28:55,070 --> 00:28:57,440 communication synchronization less and less. 629 00:28:57,440 --> 00:28:58,440 And so that's shown here. 630 00:28:58,440 --> 00:29:00,750 You do longer pieces of work and have fewer 631 00:29:00,750 --> 00:29:04,370 synchronization stages. 
So in that regime you can have more opportunities for performance improvements, but the tricky thing you get into is what's called load balancing. If each of these different computations takes a differing amount of time to complete, then a lot of processors might end up idle as they wait until everybody has essentially reached the finish line. Yep?

AUDIENCE: If you don't have to acknowledge that something's done, can't you just say, [? OK, I'm done with your result, ?] hand it to the initial processor, and keep doing whatever?

PROFESSOR: So, you can do that in cases where there is essentially a mechanism for it, or the application allows for it. But as I'll show -- well, you won't see it until the next lecture -- there are dependencies, for example, that might preclude you from doing that. If everybody needs to reach the same point because you're updating a large data structure before you can go on, then you might not be able to do that. Think of doing molecular dynamics simulations: you need everybody to calculate a new position before you can go on and calculate the new force interactions.

AUDIENCE: [UNINTELLIGIBLE] nothing else to calculate yet.

PROFESSOR: Right.

AUDIENCE: But also there is pipelining. So what do you talk about [UNINTELLIGIBLE], because you might want to get the next data while you're computing now, so that when I'm done I can start sending. [UNINTELLIGIBLE PHRASE] you can all have some of that.

PROFESSOR: Yep. Yeah, because communication is such an intensive part, there are different ways of dealing with it, and that will come right after load balancing.

So this is just an illustration of the load balancing problem. Things that appear in this lightish pink will serve as visual cues -- this is the same color coding scheme that David's using in the recitations -- so this is PPU code, and things that appear in yellow will be SPU code.
And these are just meant to show you how you might do things like this on Cell, to help you along in picking up more of the syntax and functionality you need for your programs.

So in the load balancing problem, you have, let's say, three different threads of computation, shown here as red, blue, and orange, and you've reached some communication stage. The PPU program in this case says, send a message to each of my SPEs -- each of my different processors -- telling them to start. Once every processor gets that message, they can start computing; let's assume they have their data and so on ready to go. What's going to happen is that each processor runs through the computation at a different rate. That could be because one processor is faster than another, or because one processor is more loaded than another, or just because each processor is assigned differing amounts of work -- one has a short loop, one has a longer loop. And so, as the animation shows, execution proceeds and everybody ends up waiting until the orange one has completed. Nobody can move on until everybody has reached the synchronization point, because there's a strict dependence being enforced here that says: I'm going to wait until everybody has told me they're done before I go on to the next step of the computation. And in Cell you do that using mailboxes, in this case. Is that clear so far?

So how do you get around this load balancing problem? Well, there are two different ways. There's static load balancing: I know my application really, really well, and I understand the different computations, so I can divide up the work and have a static mapping of the work to my processors. Static mapping just means, in this particular example, that I'm going to assign the work to the different processors and that's what the processors will do.
Work can't shift around between processors. So in this case I have a work queue, where each of those bars is some computation: I can assign some chunk to P1, processor one, and some chunk to processor two, and then the computation goes on. Those allocations don't change. This works well if I understand the application well and I know the computation, and if my cores are relatively homogeneous and there's not a lot of contention for them. If all the cores are the same and each core gets an equal share of the total work, this works really well because nobody sits idle for too long. It doesn't work so well for heterogeneous architectures or multicores where one core might be faster than another, because that increases the complexity of the allocation I need to do. And if there's a lot of contention for some resources, that can also throw off the static load balancing, so the work distribution might end up being uneven.

So the alternative is dynamic load balancing. And you could certainly do a hybrid, a static plus dynamic mechanism, although I don't have that in the slides. In the dynamic load balancing scheme there are two different mechanisms I'm going to illustrate. In the first scheme, you start with something like the static mechanism: I have some work going to processor one and some work going to processor two. But then, as processor two executes and completes faster than processor one, it takes on some of the additional work from processor one -- the work that was here is now shifted -- and so you can keep helping out your other processors to compute things faster. In the other scheme, you have a work queue where you're essentially distributing work on the fly: as things complete, you just send the processors more work to do. So in this animation, I start off and send work to two different processors. P2 is really fast, so it's just zipping through things. Then P1 eventually finishes and new work is allocated to each of the two processors. So dynamic load balancing is intended to even out the amount of work across processors as the program runs -- it really increases utilization, and processors spend less and less time sitting idle.
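A minimal sketch of the second mechanism -- a shared work queue -- using Pthreads on shared memory. This is a simplification and an assumption of mine, not the course's Cell code; on Cell, the queue would live on the PPU and chunks would be handed to SPEs through DMA transfers and mailbox messages.

```c
#include <pthread.h>

#define NUM_CHUNKS 64

static int next_chunk = 0;                      /* index of the next unclaimed chunk */
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;

void do_chunk(int c);                           /* the actual computation, assumed defined elsewhere */

/* Each worker repeatedly grabs the next chunk off the shared queue.
 * Faster (or less loaded) workers automatically end up doing more
 * chunks -- that is the dynamic load balancing. */
void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&qlock);
        int c = (next_chunk < NUM_CHUNKS) ? next_chunk++ : -1;
        pthread_mutex_unlock(&qlock);

        if (c < 0)          /* queue empty: nothing left to do */
            break;
        do_chunk(c);
    }
    return NULL;
}
```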
764 00:34:44,780 --> 00:34:48,360 And then P1 eventually finishes and new work is 765 00:34:48,360 --> 00:34:50,570 allocated to the two different processors. 766 00:34:50,570 --> 00:34:54,120 So dynamic load balancing is intended to sort of give equal 767 00:34:54,120 --> 00:34:57,250 amounts of work to the different processors. 768 00:34:57,250 --> 00:34:59,880 So it really increases utilization, and processors spend less and 769 00:34:59,880 --> 00:35:03,310 less time being idle. 770 00:35:03,310 --> 00:35:03,500 OK. 771 00:35:03,500 --> 00:35:08,480 So load balancing was one part of sort of how granularity can 772 00:35:08,480 --> 00:35:10,270 have a performance trade-off. 773 00:35:10,270 --> 00:35:11,660 The other is synchronization. 774 00:35:11,660 --> 00:35:14,520 So there were already some good questions as to, well, 775 00:35:14,520 --> 00:35:16,350 you know, how does this play into overall execution? 776 00:35:16,350 --> 00:35:17,000 When can I wait? 777 00:35:17,000 --> 00:35:18,810 When can't I wait? 778 00:35:18,810 --> 00:35:21,930 So I'm going to illustrate it with just a simple data 779 00:35:21,930 --> 00:35:22,960 dependence graph. 780 00:35:22,960 --> 00:35:25,600 Although you can imagine that in each one of these circles 781 00:35:25,600 --> 00:35:27,550 there's some really heavy load computation. 782 00:35:27,550 --> 00:35:29,340 And you'll see that in the next lecture, in fact. 783 00:35:29,340 --> 00:35:32,540 So if I have some simple computation here -- 784 00:35:32,540 --> 00:35:33,930 I have some operands. 785 00:35:33,930 --> 00:35:35,800 I'm doing an addition. 786 00:35:35,800 --> 00:35:36,910 Here I do another addition. 787 00:35:36,910 --> 00:35:38,690 I need both of these results before I can do this 788 00:35:38,690 --> 00:35:40,140 multiplication. 789 00:35:40,140 --> 00:35:43,070 Here I have, you know, some loop that's adding through 790 00:35:43,070 --> 00:35:44,200 some array elements. 791 00:35:44,200 --> 00:35:46,870 I need all those results before I do the final subtraction 792 00:35:46,870 --> 00:35:49,580 and produce my final result. 793 00:35:49,580 --> 00:35:52,330 So what are some synchronization points here? 794 00:35:52,330 --> 00:35:55,630 Well, it really depends on how I allocate the different 795 00:35:55,630 --> 00:35:59,090 instructions to processors. 796 00:35:59,090 --> 00:36:02,580 So if I have an allocation that just says, well, let's 797 00:36:02,580 --> 00:36:06,240 put all these chains on one processor, put these two 798 00:36:06,240 --> 00:36:08,470 chains on two different processors, well, where are my 799 00:36:08,470 --> 00:36:09,890 synchronization points? 800 00:36:09,890 --> 00:36:13,120 Well, it depends on where this guy is and where this guy is. 801 00:36:13,120 --> 00:36:15,880 Because for this instruction to execute, it needs to 802 00:36:15,880 --> 00:36:17,910 receive data from P1 and P2. 803 00:36:17,910 --> 00:36:23,260 So if P1 and P2 are different from what's in that box, 804 00:36:23,260 --> 00:36:24,100 somebody has to wait. 805 00:36:24,100 --> 00:36:24,460 And so there's a 806 00:36:24,460 --> 00:36:26,550 synchronization that has to happen. 807 00:36:32,200 --> 00:36:34,520 So essentially at all join points there's potential for 808 00:36:34,520 --> 00:36:35,510 synchronization. 809 00:36:35,510 --> 00:36:37,810 But I can adjust the granularity so that I can 810 00:36:37,810 --> 00:36:40,330 remove more and more synchronization points.
811 00:36:46,920 --> 00:36:50,950 So if I had assigned all this entire sub-graph to the same 812 00:36:50,950 --> 00:36:53,580 processor, I really get rid of the synchronization because it 813 00:36:53,580 --> 00:36:56,910 is essentially local to that particular processor. 814 00:36:56,910 --> 00:36:58,820 And there's no extra messaging that would have to happen 815 00:36:58,820 --> 00:37:01,415 across processors that says, I'm ready, or I'm ready to 816 00:37:01,415 --> 00:37:04,330 send you data, or you can move on to the next step. 817 00:37:04,330 --> 00:37:06,650 And so in this case the last synchronization point would be 818 00:37:06,650 --> 00:37:07,890 at this join point. 819 00:37:07,890 --> 00:37:12,470 Let's say if it's allocated on P1 or on some other processor. 820 00:37:12,470 --> 00:37:14,420 So how would I get rid of this synchronization point? 821 00:37:14,420 --> 00:37:19,080 AUDIENCE: Do the whole thing. 822 00:37:19,080 --> 00:37:19,390 PROFESSOR: Right. 823 00:37:19,390 --> 00:37:22,160 You put the entire thing on a single processor. 824 00:37:22,160 --> 00:37:23,940 But you get no parallelism in this case. 825 00:37:23,940 --> 00:37:26,360 So the coarse-grain versus fine-grain parallelism 826 00:37:26,360 --> 00:37:30,410 granularity issue comes into play. 827 00:37:30,410 --> 00:37:33,410 So the last sort of thing I'm going to talk about in terms 828 00:37:33,410 --> 00:37:37,440 of how granularity impacts performance -- and this was 829 00:37:37,440 --> 00:37:39,570 already touched on -- is that communication is really not 830 00:37:39,570 --> 00:37:42,760 cheap and can be quite overwhelming on a lot of 831 00:37:42,760 --> 00:37:43,820 architectures. 832 00:37:43,820 --> 00:37:46,600 And what's interesting about multicores is that they're 833 00:37:46,600 --> 00:37:48,840 essentially putting a lot more resources closer 834 00:37:48,840 --> 00:37:50,470 together on a chip. 835 00:37:50,470 --> 00:37:55,010 So it essentially is changing the factors for communication. 836 00:37:55,010 --> 00:37:57,910 So rather than having, you know, your parallel cluster 837 00:37:57,910 --> 00:38:00,700 now which is connected, say, by ethernet or some other 838 00:38:00,700 --> 00:38:03,690 high-speed link, now you essentially have large 839 00:38:03,690 --> 00:38:05,920 clusters or will have large clusters on a chip. 840 00:38:05,920 --> 00:38:09,250 So communication factors really change. 841 00:38:09,250 --> 00:38:15,310 But the cost model is reasonably well captured by these 842 00:38:15,310 --> 00:38:16,560 different parameters. 843 00:38:19,730 --> 00:38:22,450 So what is the cost of my communication? 844 00:38:22,450 --> 00:38:26,915 Well, it's equal to, well, how many messages am I sending and 845 00:38:26,915 --> 00:38:30,130 what is the frequency with which I'm sending them? 846 00:38:30,130 --> 00:38:32,130 There's some overhead per message. 847 00:38:32,130 --> 00:38:34,140 So I have to actually package data together. 848 00:38:34,140 --> 00:38:37,920 I have to stick in a control header and then send it out. 849 00:38:37,920 --> 00:38:40,360 So that takes me some work. On the receiver side, 850 00:38:40,360 --> 00:38:41,450 I have to take the message. 851 00:38:41,450 --> 00:38:46,390 I maybe have to decode the header, figure out where to 852 00:38:46,390 --> 00:38:48,470 store the data that's coming in on the message. 853 00:38:48,470 --> 00:38:51,700 So there's some overhead associated with that as well.
854 00:38:51,700 --> 00:38:55,420 There's a network delay for sending a message, so putting 855 00:38:55,420 --> 00:38:58,375 a message on the network so that it can be transmitted, or 856 00:38:58,375 --> 00:38:59,880 picking things up off the network. 857 00:38:59,880 --> 00:39:04,050 So there's a latency also associated with how long does 858 00:39:04,050 --> 00:39:07,765 it take for a message to get from point A to point B. What 859 00:39:07,765 --> 00:39:11,020 is the bandwidth that I have across a link? 860 00:39:11,020 --> 00:39:13,350 So if I have a lot of bandwidth then that can really 861 00:39:13,350 --> 00:39:16,670 lower my communication cost. But if I have little bandwidth 862 00:39:16,670 --> 00:39:19,450 then that can really create contention. 863 00:39:19,450 --> 00:39:20,940 How much data am I sending? 864 00:39:20,940 --> 00:39:22,780 And, you know, number of messages. 865 00:39:22,780 --> 00:39:25,730 So this numerator here is really an average of the data 866 00:39:25,730 --> 00:39:29,590 that you're sending per communication. 867 00:39:29,590 --> 00:39:32,100 There's a cost induced by contention. 868 00:39:32,100 --> 00:39:33,880 And then finally there's -- so all of 869 00:39:33,880 --> 00:39:35,580 these are added factors. 870 00:39:35,580 --> 00:39:37,830 The higher they are, except for bandwidth, because it's in 871 00:39:37,830 --> 00:39:39,020 the denominator here, the worse your 872 00:39:39,020 --> 00:39:40,800 communication cost becomes. 873 00:39:40,800 --> 00:39:45,010 So you can try to reduce the communication cost by 874 00:39:45,010 --> 00:39:46,220 communicating less. 875 00:39:46,220 --> 00:39:47,100 So you adjust your granularity. 876 00:39:47,100 --> 00:39:50,460 And that can impact your synchronization or what kind 877 00:39:50,460 --> 00:39:52,960 of data you're shipping around. 878 00:39:52,960 --> 00:39:55,450 You can do some architectural tweaks or maybe some software 879 00:39:55,450 --> 00:39:58,800 tweaks to really get the network latency down and the 880 00:39:58,800 --> 00:40:00,190 overhead per message down. 881 00:40:00,190 --> 00:40:03,880 So on something like the raw architecture, which we saw in 882 00:40:03,880 --> 00:40:06,230 Saman's lecture, there's a really fast mechanism to 883 00:40:06,230 --> 00:40:08,900 communicate with your nearest neighbor in three cycles. 884 00:40:08,900 --> 00:40:12,580 So one processor can send a single operand to another 885 00:40:12,580 --> 00:40:16,310 reasonably fast. You know, you can improve the bandwidth 886 00:40:16,310 --> 00:40:19,010 again with architectural mechanisms. 887 00:40:19,010 --> 00:40:21,720 You can do some tricks as to how you package your data in 888 00:40:21,720 --> 00:40:24,490 each message. 889 00:40:24,490 --> 00:40:27,380 And lastly, what I'm going to talk about in a couple of 890 00:40:27,380 --> 00:40:30,320 slides is, well, I can also improve it using some 891 00:40:30,320 --> 00:40:31,965 mechanisms that try to increase the 892 00:40:31,965 --> 00:40:33,390 overlap between messages. 893 00:40:33,390 --> 00:40:35,160 And what does this really mean? 894 00:40:35,160 --> 00:40:37,070 What am I overlapping it with? 895 00:40:37,070 --> 00:40:40,100 And it's really the communication and computation 896 00:40:40,100 --> 00:40:43,590 stages are going to somehow get aligned. 897 00:40:43,590 --> 00:40:46,300 So before I actually show you that, I just want to point out 898 00:40:46,300 --> 00:40:48,220 that there are two kinds of messages.
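Before getting to those two kinds of messages, it may help to see the cost model that was just described written out in one place. This is a reconstruction from the spoken description; the symbol names are chosen here, not copied from the slide:

C = f \cdot \left( o + l + \frac{n/m}{B} + t_c - \text{overlap} \right)

where f is the number (frequency) of messages, o is the overhead per message on the sender and the receiver, l is the network delay per message, n/m is the average data sent per message (total data n over m messages), B is the bandwidth of the link, t_c is the cost induced by contention, and overlap is the amount of communication you manage to hide behind computation. Everything except bandwidth and overlap makes the cost worse as it grows, which matches the discussion above.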
899 00:40:48,220 --> 00:40:51,020 There's data messages, and these are, for example, the 900 00:40:51,020 --> 00:40:54,240 arrays that I'm sending around to different processors for 901 00:40:54,240 --> 00:40:57,760 the distance calculations between points in space. 902 00:40:57,760 --> 00:40:59,650 But there are also control messages. 903 00:40:59,650 --> 00:41:02,460 So control messages essentially say, I'm done, or 904 00:41:02,460 --> 00:41:06,900 I'm ready to go, or is there any work for me to do? 905 00:41:06,900 --> 00:41:09,700 So on Cell, control messages, you know, you can think of 906 00:41:09,700 --> 00:41:12,560 using Mailboxes for those and the DMAs for doing the data 907 00:41:12,560 --> 00:41:13,590 communication. 908 00:41:13,590 --> 00:41:16,170 So data messages are relatively much larger -- 909 00:41:16,170 --> 00:41:19,150 you're sending a lot of data -- versus control messages 910 00:41:19,150 --> 00:41:22,190 that are really much shorter, just essentially just sending 911 00:41:22,190 --> 00:41:23,930 you very brief information. 912 00:41:27,640 --> 00:41:30,980 So in order to get that overlap, what you can do is 913 00:41:30,980 --> 00:41:33,610 essentially use this concept of pipelining. 914 00:41:33,610 --> 00:41:35,250 So you've seen pipelining in superscalar. 915 00:41:35,250 --> 00:41:37,620 Someone talked about that. 916 00:41:37,620 --> 00:41:40,130 And what you are essentially trying to do is break up the 917 00:41:40,130 --> 00:41:43,950 communication and computation into different stages and then 918 00:41:43,950 --> 00:41:45,860 figure out a way to overlap them so that you can 919 00:41:45,860 --> 00:41:47,970 essentially hide the latency for the 920 00:41:47,970 --> 00:41:51,090 sends and the receives. 921 00:41:51,090 --> 00:41:54,830 So let's say you have some work that you're doing, and it 922 00:41:54,830 --> 00:41:57,570 really requires you to send the data -- 923 00:41:57,570 --> 00:42:00,200 somebody has to send you the data or you essentially have 924 00:42:00,200 --> 00:42:02,440 to wait until you get it. 925 00:42:02,440 --> 00:42:04,850 And then after you've waited and the data is there, you can 926 00:42:04,850 --> 00:42:06,470 actually go on and do your work. 927 00:42:06,470 --> 00:42:07,670 So these are color coded. 928 00:42:07,670 --> 00:42:11,540 So this is essentially one iteration of the work. 929 00:42:11,540 --> 00:42:15,220 And so you could overlap them by breaking up the work into 930 00:42:15,220 --> 00:42:21,050 send, wait, work stages, where each iteration trying to send 931 00:42:21,050 --> 00:42:24,340 or request the data for the next iteration, I wait on the 932 00:42:24,340 --> 00:42:27,420 data from a previous iteration and then I do my work. 933 00:42:27,420 --> 00:42:29,910 So depending on how I partition, I can really get 934 00:42:29,910 --> 00:42:32,280 really good overlap. 935 00:42:32,280 --> 00:42:35,320 And so what you want to get to is the concept of the steady 936 00:42:35,320 --> 00:42:40,130 state, where in your main loop body, all you're doing is 937 00:42:40,130 --> 00:42:43,590 essentially pre-fetching or requesting data that's going 938 00:42:43,590 --> 00:42:46,030 to be used in future iterations for future work. 939 00:42:46,030 --> 00:42:49,890 And then you're waiting on -- 940 00:42:49,890 --> 00:42:51,490 yeah. 941 00:42:51,490 --> 00:42:54,100 I think my color coding is a little bogus. 942 00:42:54,100 --> 00:42:55,860 That's good. 
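As a rough sketch of that send, wait, work structure in C, here is what the steady-state loop could look like. The dma_get and dma_wait helpers, CHUNK, chunk_addr, and process_data are placeholders standing in for the real Cell DMA calls and the application code, not the actual SDK API:

/* Double buffering sketch: prefetch the next chunk into one buffer while
   computing on the chunk that has already arrived in the other buffer. */
float buf[2][CHUNK];                 /* two local buffers                 */
int id = 0;                          /* which buffer holds current data   */

dma_get(buf[0], chunk_addr(0), CHUNK);                   /* prime buffer 0 */
for (int i = 0; i < num_chunks; i++) {
    if (i + 1 < num_chunks)
        dma_get(buf[id ^ 1], chunk_addr(i + 1), CHUNK);  /* request next chunk */
    dma_wait(buf[id]);               /* block until chunk i has arrived   */
    process_data(buf[id], CHUNK);    /* do the work for this iteration    */
    id ^= 1;                         /* flip to the other buffer          */
}

The next slides walk through essentially this pattern using the Cell DMA interface.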
943 00:42:55,860 --> 00:42:58,360 So here's an example of how you might do this kind of 944 00:42:58,360 --> 00:43:01,700 buffer pipelining in Cell. 945 00:43:01,700 --> 00:43:05,710 So I have some main loop that's going to do some work, 946 00:43:05,710 --> 00:43:07,670 that's encapsulating this process data. 947 00:43:07,670 --> 00:43:09,780 And what I'm going to use is two buffers. 948 00:43:09,780 --> 00:43:12,750 So the scheme is also called double buffering. 949 00:43:12,750 --> 00:43:15,200 I'm going to use this ID to represent which buffer I'm 950 00:43:15,200 --> 00:43:15,700 going to use. 951 00:43:15,700 --> 00:43:18,910 So it's either buffer zero or buffer one. 952 00:43:18,910 --> 00:43:21,230 And this instruction here essentially flips the bit. 953 00:43:21,230 --> 00:43:23,700 So it's either zero or one. 954 00:43:23,700 --> 00:43:27,620 So I fetch data into buffer zero and then I enter my loop. 955 00:43:27,620 --> 00:43:30,680 So this is essentially the first send, which is trying to 956 00:43:30,680 --> 00:43:33,330 get me one iteration ahead. 957 00:43:33,330 --> 00:43:37,760 So I enter this main loop and I do some calculation to 958 00:43:37,760 --> 00:43:40,380 figure out where to write the next data. 959 00:43:40,380 --> 00:43:43,735 And then I do another request for the next data item that 960 00:43:43,735 --> 00:43:47,800 I'm going to -- sorry, there's an m missing here -- 961 00:43:47,800 --> 00:43:50,900 I'm going to fetch data into a different buffer, right. 962 00:43:50,900 --> 00:43:54,360 This is ID where I've already flipped the bit once. 963 00:43:54,360 --> 00:43:58,160 So this get is going to write data into buffer zero. 964 00:43:58,160 --> 00:44:01,730 And this get is going to write data into buffer one. 965 00:44:01,730 --> 00:44:02,770 I flip the bit again. 966 00:44:02,770 --> 00:44:08,720 So now I'm going to issue a wait instruction that says is 967 00:44:08,720 --> 00:44:10,380 the data from buffer zero ready? 968 00:44:10,380 --> 00:44:13,260 And if it is then I can go on and actually do my work. 969 00:44:13,260 --> 00:44:15,590 Does that make sense? 970 00:44:15,590 --> 00:44:16,150 People are confused? 971 00:44:16,150 --> 00:44:17,400 Should I go over it again? 972 00:44:19,772 --> 00:44:21,720 AUDIENCE: [INAUDIBLE] 973 00:44:21,720 --> 00:44:24,260 PROFESSOR: So this is an [? x or. ?] 974 00:44:24,260 --> 00:44:27,710 So I could have just said buffer equals zero or buffer 975 00:44:27,710 --> 00:44:28,960 equals one. 976 00:44:33,220 --> 00:44:34,410 Oh, sorry. 977 00:44:34,410 --> 00:44:34,620 This is one. 978 00:44:34,620 --> 00:44:34,690 Yeah. 979 00:44:34,690 --> 00:44:36,980 Yeah. 980 00:44:36,980 --> 00:44:40,950 So this is a one here. 981 00:44:40,950 --> 00:44:41,440 Last-minute editing. 982 00:44:41,440 --> 00:44:43,430 It's right there. 983 00:44:43,430 --> 00:44:44,672 Did that confuse you? 984 00:44:44,672 --> 00:44:45,195 AUDIENCE: No. 985 00:44:45,195 --> 00:44:47,810 But, like, I don't see [INAUDIBLE] 986 00:44:47,810 --> 00:44:48,140 PROFESSOR: Oh. 987 00:44:48,140 --> 00:44:48,410 OK. 988 00:44:48,410 --> 00:44:50,190 So I'll go over it again. 989 00:44:50,190 --> 00:44:53,600 So this get here is going to write into ID zero. 990 00:44:53,600 --> 00:44:56,500 So that's buffer zero. 991 00:44:56,500 --> 00:44:57,930 And then I'm going to change the ID. 992 00:44:57,930 --> 00:44:59,400 So imagine there's a one here.
993 00:44:59,400 --> 00:45:04,190 So now the next time I use ID, which is here, I'm trying to 994 00:45:04,190 --> 00:45:04,950 get the data. 995 00:45:04,950 --> 00:45:07,920 And I'm going to write it to buffer one. 996 00:45:07,920 --> 00:45:11,270 The DMA on the Cell processor essentially says I can send 997 00:45:11,270 --> 00:45:14,450 this request off and I can check later to see when that 998 00:45:14,450 --> 00:45:15,710 data is available. 999 00:45:15,710 --> 00:45:17,940 But that data is going to go into a different buffer, 1000 00:45:17,940 --> 00:45:19,420 essentially B1. 1001 00:45:19,420 --> 00:45:22,450 Whereas I'm going to work on buffer zero. 1002 00:45:22,450 --> 00:45:25,920 Because I changed the ID back here. 1003 00:45:25,920 --> 00:45:27,610 Now you get it? 1004 00:45:27,610 --> 00:45:30,790 So I fetch data into buffer zero initially 1005 00:45:30,790 --> 00:45:31,540 before I start to loop. 1006 00:45:31,540 --> 00:45:34,110 And then I start working. 1007 00:45:34,110 --> 00:45:36,380 I probably should have had an animation in here. 1008 00:45:36,380 --> 00:45:39,430 So then you go into your main loop. 1009 00:45:39,430 --> 00:45:42,880 You try to start fetching into buffer one and then you try to 1010 00:45:42,880 --> 00:45:44,280 compute out of buffer zero. 1011 00:45:44,280 --> 00:45:46,130 But before you can start computing out of buffer zero, 1012 00:45:46,130 --> 00:45:48,250 you just have to make sure that your data is there. 1013 00:45:48,250 --> 00:45:52,790 And so that's what the synchronization is doing here. 1014 00:45:52,790 --> 00:45:55,180 Hope that was clear. 1015 00:45:55,180 --> 00:45:58,710 OK, so this kind of computation and communication 1016 00:45:58,710 --> 00:46:01,680 overlap really helps in hiding the latency. 1017 00:46:01,680 --> 00:46:04,990 And it can be really useful in terms of improving 1018 00:46:04,990 --> 00:46:06,240 performance. 1019 00:46:09,720 --> 00:46:13,450 And there are different kinds of communication patterns. 1020 00:46:13,450 --> 00:46:14,720 So there's point to point. 1021 00:46:14,720 --> 00:46:18,080 And you can use these both for data communication or control 1022 00:46:18,080 --> 00:46:19,060 communication. 1023 00:46:19,060 --> 00:46:20,880 And it just means that, you know, one processor can 1024 00:46:20,880 --> 00:46:23,580 explicitly send a message to another processor. 1025 00:46:23,580 --> 00:46:26,345 There's also broadcast that says, hey, I have some data 1026 00:46:26,345 --> 00:46:28,603 that everybody's interested in, so I can just broadcast it 1027 00:46:28,603 --> 00:46:31,000 to everybody on the network. 1028 00:46:31,000 --> 00:46:32,930 Or a reduce, which is the opposite. 1029 00:46:32,930 --> 00:46:36,350 It says everybody on the network has data that I need 1030 00:46:36,350 --> 00:46:39,840 to compute, so everybody send me their data. 1031 00:46:39,840 --> 00:46:42,790 There's an all to all, which says all processors should 1032 00:46:42,790 --> 00:46:46,410 just do a global exchange of data that they have. And then 1033 00:46:46,410 --> 00:46:48,100 there's a scatter and a gather. 1034 00:46:48,100 --> 00:46:50,810 So a scatter and a gather are really different types of 1035 00:46:50,810 --> 00:46:54,770 broadcast. So it's one to several or one to many. 1036 00:46:54,770 --> 00:46:57,340 And gather, which is many to one.
1037 00:46:57,340 --> 00:47:00,370 So this is useful when you're doing a computation that 1038 00:47:00,370 --> 00:47:04,670 really is trying to pull data in together but only from a 1039 00:47:04,670 --> 00:47:06,260 subset of all processors. 1040 00:47:06,260 --> 00:47:12,250 So it depends on how you've partitioned your problems. 1041 00:47:12,250 --> 00:47:16,430 So there's a well-known sort of message passing library 1042 00:47:16,430 --> 00:47:24,950 specification called MPI that tries to specify all of these 1043 00:47:24,950 --> 00:47:29,120 different communications in order to sort of facilitate 1044 00:47:29,120 --> 00:47:30,310 parallel programming. 1045 00:47:30,310 --> 00:47:34,590 It's full featured; it actually has more types of communications 1046 00:47:34,590 --> 00:47:36,660 and more kinds of functionality than I showed on 1047 00:47:36,660 --> 00:47:38,870 the previous slides. 1048 00:47:38,870 --> 00:47:41,360 But it's not a language or a compiler specification. 1049 00:47:41,360 --> 00:47:43,470 It's really just a library that you can implement in 1050 00:47:43,470 --> 00:47:45,750 various ways on different architectures. 1051 00:47:45,750 --> 00:47:50,720 Again, it's same program, multiple data -- it supports the 1052 00:47:50,720 --> 00:47:53,080 SPMD model. 1053 00:47:53,080 --> 00:47:55,990 And it works reasonably well for parallel architectures: for 1054 00:47:55,990 --> 00:47:59,760 clusters, heterogeneous multicores, homogeneous 1055 00:47:59,760 --> 00:48:01,830 multicores. 1056 00:48:01,830 --> 00:48:05,240 Because really all it's doing is just abstracting out -- 1057 00:48:05,240 --> 00:48:08,130 it's giving you a mechanism to abstract out all the 1058 00:48:08,130 --> 00:48:11,540 communication that you would need in your computation. 1059 00:48:11,540 --> 00:48:15,280 So you can have additional things like precise buffer 1060 00:48:15,280 --> 00:48:16,370 management. 1061 00:48:16,370 --> 00:48:19,250 You can have some collective operations. 1062 00:48:19,250 --> 00:48:22,985 I'll show an example of doing things in a 1063 00:48:22,985 --> 00:48:27,370 scalable manner when a lot of things need to communicate 1064 00:48:27,370 --> 00:48:29,140 with each other. 1065 00:48:29,140 --> 00:48:32,840 So just a brief history of where MPI came from. 1066 00:48:32,840 --> 00:48:35,270 And, you know, very early when, you know, parallel 1067 00:48:35,270 --> 00:48:38,260 computers started becoming more and more widespread and 1068 00:48:38,260 --> 00:48:40,720 there were these networks and people had problems porting 1069 00:48:40,720 --> 00:48:43,840 their applications or writing applications for these 1070 00:48:43,840 --> 00:48:45,860 [? came, ?] just because it was difficult, as you might be 1071 00:48:45,860 --> 00:48:48,200 finding in terms of programming things with the 1072 00:48:48,200 --> 00:48:50,540 Cell processor. 1073 00:48:50,540 --> 00:48:52,800 You know, there needed to be ways to sort of address the 1074 00:48:52,800 --> 00:48:54,860 spectrum of communication. 1075 00:48:54,860 --> 00:48:58,330 And it often helps to have a standard because if everybody 1076 00:48:58,330 --> 00:49:01,680 implements the same standard specification, that allows 1077 00:49:01,680 --> 00:49:03,370 your code to be ported around from one 1078 00:49:03,370 --> 00:49:04,710 architecture to the other. 1079 00:49:04,710 --> 00:49:08,090 And so MPI came around. 1080 00:49:08,090 --> 00:49:11,130 The forum was organized in 1992.
1081 00:49:11,130 --> 00:49:13,980 And that had a lot of people participating in it from 1082 00:49:13,980 --> 00:49:17,000 vendors, you know, people like IBM, a company like IBM, 1083 00:49:17,000 --> 00:49:23,030 Intel, and people who had expertise in writing 1084 00:49:23,030 --> 00:49:27,550 libraries, users who were interested in using these 1085 00:49:27,550 --> 00:49:31,910 kinds of specifications to do their computations, so 1086 00:49:31,910 --> 00:49:36,010 scientific people who were in the scientific domain. 1087 00:49:36,010 --> 00:49:38,050 And it was finished in about 18 months. 1088 00:49:38,050 --> 00:49:40,582 I don't know if that's a reasonably long time or a 1089 00:49:40,582 --> 00:49:40,950 short time. 1090 00:49:40,950 --> 00:49:44,340 But considering, you know, I think the MPEG-4 standard took 1091 00:49:44,340 --> 00:49:48,170 a bit longer to do, as a comparison point. 1092 00:49:48,170 --> 00:49:49,880 I don't have the actual data. 1093 00:49:49,880 --> 00:49:53,270 So point-to-point communication -- 1094 00:49:53,270 --> 00:49:56,590 and again, a reminder, this is how you would do it on Cell. 1095 00:49:56,590 --> 00:50:00,300 These are SPE sends and receives. 1096 00:50:00,300 --> 00:50:02,490 You have one processor that's sending 1097 00:50:02,490 --> 00:50:04,660 it to another processor. 1098 00:50:04,660 --> 00:50:06,260 Or you have some network in between. 1099 00:50:06,260 --> 00:50:08,430 And processor A can essentially send the data 1100 00:50:08,430 --> 00:50:11,910 explicitly to processor two. 1101 00:50:11,910 --> 00:50:14,880 And the message in this case would include how the data is 1102 00:50:14,880 --> 00:50:17,240 packaged, some other information such as the length 1103 00:50:17,240 --> 00:50:20,570 of the data, destination, possibly some tag so you can 1104 00:50:20,570 --> 00:50:22,650 identify the actual communication. 1105 00:50:22,650 --> 00:50:26,560 And, you know, there's an actual mapping for the actual 1106 00:50:26,560 --> 00:50:29,760 functions on Cell. 1107 00:50:29,760 --> 00:50:32,950 And there's a get for the send and a put for the receive. 1108 00:50:39,230 --> 00:50:41,760 So there's a question of, well, how do I know if my data 1109 00:50:41,760 --> 00:50:42,880 actually got sent? 1110 00:50:42,880 --> 00:50:45,410 How do I know if it was received? 1111 00:50:45,410 --> 00:50:49,190 And there's, you know, you can think of a synchronous send 1112 00:50:49,190 --> 00:50:52,200 and a synchronous receive, or asynchronous communication. 1113 00:50:52,200 --> 00:50:53,940 So in the synchronous communication, you actually 1114 00:50:53,940 --> 00:50:55,390 wait for notification. 1115 00:50:55,390 --> 00:50:57,250 So this is kind of like your fax machine. 1116 00:50:57,250 --> 00:50:58,900 You put something into your fax. 1117 00:50:58,900 --> 00:51:01,190 It goes out and you eventually get a beep that says your 1118 00:51:01,190 --> 00:51:02,260 transmission was OK. 1119 00:51:02,260 --> 00:51:04,900 Or if it wasn't OK then, you know, you get a message that 1120 00:51:04,900 --> 00:51:06,630 says, you know, something went wrong. 1121 00:51:06,630 --> 00:51:09,520 And you can redo your communication. 1122 00:51:09,520 --> 00:51:11,730 An asynchronous send is kind of like your -- 1123 00:51:11,730 --> 00:51:13,055 AUDIENCE: Most [UNINTELLIGIBLE] you could get 1124 00:51:13,055 --> 00:51:13,320 a reply too. 1125 00:51:13,320 --> 00:51:15,340 PROFESSOR: Yeah, you can get a reply. 1126 00:51:15,340 --> 00:51:16,420 Thanks. 
1127 00:51:16,420 --> 00:51:18,360 An asynchronous send, it's like you write a letter, you 1128 00:51:18,360 --> 00:51:20,680 go put it in the mailbox, and you don't know whether it 1129 00:51:20,680 --> 00:51:26,425 actually made it into the actual postman's bag and it 1130 00:51:26,425 --> 00:51:29,660 was delivered to your destination or if it was 1131 00:51:29,660 --> 00:51:30,940 actually delivered. 1132 00:51:30,940 --> 00:51:34,200 So you only know that the message was sent. 1133 00:51:34,200 --> 00:51:36,250 You know, you put it in the mailbox. 1134 00:51:36,250 --> 00:51:37,940 But you don't know anything else about what happened to 1135 00:51:37,940 --> 00:51:41,100 the message along the way. 1136 00:51:41,100 --> 00:51:43,530 There's also the concept of a blocking versus a 1137 00:51:43,530 --> 00:51:44,940 non-blocking message. 1138 00:51:44,940 --> 00:51:48,110 So this is orthogonal really to synchronous versus 1139 00:51:48,110 --> 00:51:49,930 asynchronous. 1140 00:51:49,930 --> 00:51:54,940 So in blocking messages, a sender waits until there's 1141 00:51:54,940 --> 00:51:58,440 some signal that says the message has been transmitted. 1142 00:51:58,440 --> 00:52:03,180 So this is, for example if I'm writing data into a buffer, 1143 00:52:03,180 --> 00:52:05,350 and the buffer essentially gets transmitted to somebody 1144 00:52:05,350 --> 00:52:10,590 else, we wait until the buffer is empty. 1145 00:52:10,590 --> 00:52:13,070 And what that means is that somebody has read it on the 1146 00:52:13,070 --> 00:52:15,350 other end or somebody has drained that buffer from 1147 00:52:15,350 --> 00:52:16,670 somewhere else. 1148 00:52:16,670 --> 00:52:19,820 The receiver, if he's waiting on data, well, he just waits. 1149 00:52:19,820 --> 00:52:21,850 He essentially blocks until somebody has put 1150 00:52:21,850 --> 00:52:24,110 data into the buffer. 1151 00:52:24,110 --> 00:52:26,470 And you can get into potential deadlock situations. 1152 00:52:26,470 --> 00:52:30,100 So you saw deadlock with locks in the concurrency talk. 1153 00:52:30,100 --> 00:52:30,920 I'm going to show you a different 1154 00:52:30,920 --> 00:52:32,590 kind of deadlock example. 1155 00:52:35,200 --> 00:52:40,630 An example of a blocking send on Cell -- 1156 00:52:40,630 --> 00:52:43,250 allows you to use mailboxes. 1157 00:52:43,250 --> 00:52:47,260 Or you can sort of use mailboxes for that. 1158 00:52:47,260 --> 00:52:50,050 Mailboxes again are just for communicating short messages, 1159 00:52:50,050 --> 00:52:53,910 really, not necessarily for communicating data messages. 1160 00:52:53,910 --> 00:52:58,340 So an SPE does some work, and then it writes out a message, 1161 00:52:58,340 --> 00:53:02,750 in this case to notify the PPU that, let's say, it's done. 1162 00:53:02,750 --> 00:53:04,660 And then it goes on and does more work. 1163 00:53:04,660 --> 00:53:08,230 And then it wants to notify the PPU of something else. 1164 00:53:08,230 --> 00:53:12,190 So in this case this particular send will block 1165 00:53:12,190 --> 00:53:14,930 because, let's say, the PPU hasn't drained its mailbox. 1166 00:53:14,930 --> 00:53:16,220 It hasn't read the mailbox. 1167 00:53:16,220 --> 00:53:20,960 So you essentially stop and wait until the PPU has, you 1168 00:53:20,960 --> 00:53:21,810 know, caught up. 1169 00:53:21,810 --> 00:53:26,860 AUDIENCE: So all mailbox sends are blocking? 1170 00:53:26,860 --> 00:53:27,990 PROFESSOR: Yes. 1171 00:53:27,990 --> 00:53:29,240 David says yes. 
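As a rough illustration of that blocking behavior, the SPE side might look like the sketch below. The mbox_write helper and the message values are placeholders, not the actual Cell SDK calls:

/* Blocking notification sketch: the write returns only when there is room
   in the outbound mailbox, i.e. the PPU has read the earlier message. */
void spe_worker(void)
{
    do_work_phase_one();
    mbox_write(DONE_PHASE_ONE);   /* returns once the mailbox has room       */
    do_work_phase_two();
    mbox_write(DONE_PHASE_TWO);   /* may block until the PPU drains the box  */
}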
1172 00:53:36,680 --> 00:53:39,730 A non-blocking send is something that essentially 1173 00:53:39,730 --> 00:53:44,400 allows you to send a message out and just continue on. 1174 00:53:44,400 --> 00:53:48,650 You don't care exactly about what's happened to the message 1175 00:53:48,650 --> 00:53:51,910 or what's going on with the receiver. 1176 00:53:51,910 --> 00:53:54,530 So you write the data into the buffer and you 1177 00:53:54,530 --> 00:53:56,040 just continue executing. 1178 00:53:56,040 --> 00:53:58,740 And this really helps you in terms of avoiding idle times 1179 00:53:58,740 --> 00:54:00,560 and deadlocks, but it might not always be the 1180 00:54:00,560 --> 00:54:01,840 thing that you want. 1181 00:54:01,840 --> 00:54:05,970 So an example of sort of a non-blocking send and wait on 1182 00:54:05,970 --> 00:54:09,170 Cell is using the DMAs to ship data out. 1183 00:54:09,170 --> 00:54:11,580 You know, you can put something, put in a request to 1184 00:54:11,580 --> 00:54:13,910 send data out on the DMA. 1185 00:54:13,910 --> 00:54:19,190 And you could wait for it if you want in terms of reading 1186 00:54:19,190 --> 00:54:22,680 the status bits to make sure it's completed. 1187 00:54:22,680 --> 00:54:27,530 OK, so what is a source of deadlock in the blocking case? 1188 00:54:27,530 --> 00:54:30,130 And it really comes about if you don't really have enough 1189 00:54:30,130 --> 00:54:33,140 buffering in your communication network. 1190 00:54:33,140 --> 00:54:36,300 And often you can resolve that by having additional storage. 1191 00:54:36,300 --> 00:54:38,700 So let's say I have processor one and processor two and 1192 00:54:38,700 --> 00:54:41,000 they're trying to send messages to each other. 1193 00:54:41,000 --> 00:54:43,710 So processor one sends a message at the same time 1194 00:54:43,710 --> 00:54:45,170 processor two sends a message. 1195 00:54:45,170 --> 00:54:46,990 And these are going to go, let's say, 1196 00:54:46,990 --> 00:54:49,200 into the same buffer. 1197 00:54:49,200 --> 00:54:53,815 Well, neither can make progress because somebody has 1198 00:54:53,815 --> 00:54:55,930 to essentially drain that buffer before these receives 1199 00:54:55,930 --> 00:54:57,180 can execute. 1200 00:55:00,350 --> 00:55:02,990 So what happens with that code is it really depends on how 1201 00:55:02,990 --> 00:55:04,830 much buffering you have between the two. 1202 00:55:04,830 --> 00:55:06,370 If you have a lot of buffering, then you may never 1203 00:55:06,370 --> 00:55:08,000 see the deadlock. 1204 00:55:08,000 --> 00:55:13,930 But if you have a really tiny buffer, then you do a send. 1205 00:55:13,930 --> 00:55:18,180 The other person can't do the send because the buffer hasn't 1206 00:55:18,180 --> 00:55:18,970 been drained. 1207 00:55:18,970 --> 00:55:21,220 And so you end up with a deadlock. 1208 00:55:21,220 --> 00:55:23,600 And so a potential solution is, well, you actually 1209 00:55:23,600 --> 00:55:24,620 increase your buffer length. 1210 00:55:24,620 --> 00:55:26,170 But that doesn't always work because you can 1211 00:55:26,170 --> 00:55:27,480 still get into trouble. 1212 00:55:27,480 --> 00:55:30,040 So what you might need to do is essentially be more 1213 00:55:30,040 --> 00:55:33,990 diligent about how you order your sends and receives. 1214 00:55:33,990 --> 00:55:36,740 So if you have processor one doing a send, make sure it's 1215 00:55:36,740 --> 00:55:39,400 matched up with a receive on the other end. 
1216 00:55:39,400 --> 00:55:42,050 And similarly, if you're doing a receive here, make sure 1217 00:55:42,050 --> 00:55:44,100 there's sort of a matching send on the other end. 1218 00:55:44,100 --> 00:55:46,870 And that helps you in sort of making sure that things are 1219 00:55:46,870 --> 00:55:51,160 operating reasonably in lock step at, you know, partially 1220 00:55:51,160 --> 00:55:52,410 ordered times. 1221 00:55:59,750 --> 00:56:03,600 That was really examples of point-to-point communication. 1222 00:56:03,600 --> 00:56:05,990 A broadcast mechanism is slightly different. 1223 00:56:05,990 --> 00:56:09,190 It says, I have data that I want to send to everybody. 1224 00:56:09,190 --> 00:56:11,750 It could be really efficient for sending short control 1225 00:56:11,750 --> 00:56:15,570 messages, maybe even efficient for sending data messages. 1226 00:56:15,570 --> 00:56:18,950 So as an example, if you remember our calculation of 1227 00:56:18,950 --> 00:56:22,380 distances between all points, the parallelization strategy 1228 00:56:22,380 --> 00:56:24,900 said, well, I'm going to send one copy of 1229 00:56:24,900 --> 00:56:28,370 the array A to everybody. 1230 00:56:28,370 --> 00:56:29,820 In the two processor case that was easy. 1231 00:56:29,820 --> 00:56:32,730 But if I have n processors, then rather than sending 1232 00:56:32,730 --> 00:56:35,570 point-to-point communication from A to everybody else, what 1233 00:56:35,570 --> 00:56:38,300 I could do is just, say, broadcast A to everybody and 1234 00:56:38,300 --> 00:56:41,420 they can grab it off the network. 1235 00:56:41,420 --> 00:56:45,210 So in MPI there's this function, MPI 1236 00:56:45,210 --> 00:56:46,560 broadcast, that does that. 1237 00:56:46,560 --> 00:56:51,700 I'm using sort of generic abstract sends, receives and 1238 00:56:51,700 --> 00:56:53,350 broadcasts in my examples. 1239 00:56:53,350 --> 00:56:55,600 So you can broadcast A to everybody. 1240 00:56:55,600 --> 00:56:59,550 And then if I have n processors, then what I might 1241 00:56:59,550 --> 00:57:03,310 do is distribute the m's in a round robin manner to each of 1242 00:57:03,310 --> 00:57:04,230 the different processes. 1243 00:57:04,230 --> 00:57:05,710 So you pointed this out. 1244 00:57:05,710 --> 00:57:07,180 I don't have to send B to everybody. 1245 00:57:07,180 --> 00:57:09,390 I can just send, you know, in this case, 1246 00:57:09,390 --> 00:57:10,420 one particular element. 1247 00:57:10,420 --> 00:57:12,840 Is that clear? 1248 00:57:16,838 --> 00:57:18,670 AUDIENCE: There's no broadcast on Cell? 1249 00:57:18,670 --> 00:57:21,210 PROFESSOR: There is no broadcast on Cell. 1250 00:57:21,210 --> 00:57:25,680 There is no mechanism for reduction either. 1251 00:57:25,680 --> 00:57:30,340 And you can't quite do scatters and gathers. 1252 00:57:30,340 --> 00:57:32,430 I don't think. 1253 00:57:32,430 --> 00:57:35,380 OK, so an example of a reduction, you know, I said 1254 00:57:35,380 --> 00:57:37,650 it's the opposite of a broadcast. Everybody has data 1255 00:57:37,650 --> 00:57:39,860 that needs to essentially get to the same point. 1256 00:57:39,860 --> 00:57:45,610 So as an example, if everybody in this room had a value, 1257 00:57:45,610 --> 00:57:47,770 including myself, and I wanted to know what is the collective 1258 00:57:47,770 --> 00:57:49,670 value of everybody in the room, you all have to 1259 00:57:49,670 --> 00:57:50,840 send me your data. 
1260 00:57:50,840 --> 00:57:53,800 Now, this is important because if -- you know, in this case 1261 00:57:53,800 --> 00:57:54,700 we're doing an addition. 1262 00:57:54,700 --> 00:57:56,290 It's an associative operation. 1263 00:57:56,290 --> 00:57:58,050 So what we can do is we can be smart about 1264 00:57:58,050 --> 00:57:59,350 how the data is sent. 1265 00:57:59,350 --> 00:58:02,160 So, you know, guys that are close together can essentially 1266 00:58:02,160 --> 00:58:03,420 add up their numbers and forward me. 1267 00:58:03,420 --> 00:58:05,120 So instead of getting n messages I 1268 00:58:05,120 --> 00:58:06,520 can get log n messages. 1269 00:58:06,520 --> 00:58:08,450 And so if every pair of you added your numbers and 1270 00:58:08,450 --> 00:58:11,040 forwarded me that, that cuts down communication by half. 1271 00:58:11,040 --> 00:58:13,640 And so you can, you know -- starting from the back of 1272 00:58:13,640 --> 00:58:16,420 room, by the time you get to me, I only get two messages 1273 00:58:16,420 --> 00:58:18,800 instead of n messages. 1274 00:58:18,800 --> 00:58:21,680 So a reduction combines data from all processors. 1275 00:58:21,680 --> 00:58:24,030 In MPI, you know, there's this function MPI 1276 00:58:24,030 --> 00:58:26,300 reduce for doing that. 1277 00:58:26,300 --> 00:58:29,180 And the collective operations are things that are 1278 00:58:29,180 --> 00:58:29,920 associative. 1279 00:58:29,920 --> 00:58:32,730 And subtract -- 1280 00:58:32,730 --> 00:58:33,660 sorry. 1281 00:58:33,660 --> 00:58:39,500 And or and -- you can read them on the slide. 1282 00:58:39,500 --> 00:58:42,540 There is a semantic caveat here that no processor can 1283 00:58:42,540 --> 00:58:45,760 finish the reduction before all processors have at least 1284 00:58:45,760 --> 00:58:49,730 sent it one data or have contributed, rather, a 1285 00:58:49,730 --> 00:58:51,790 particular value. 1286 00:58:51,790 --> 00:58:54,740 So in many numerical algorithms, you can actually 1287 00:58:54,740 --> 00:59:00,200 use the broadcast and send to broadcast and reduce in place 1288 00:59:00,200 --> 00:59:04,430 of sends and receives because it really improves the 1289 00:59:04,430 --> 00:59:06,970 simplicity of your computation. 1290 00:59:06,970 --> 00:59:09,260 You don't have to do n sends to communicate there. 1291 00:59:09,260 --> 00:59:11,350 You can just broadcast. It gives you a mechanism for 1292 00:59:11,350 --> 00:59:13,970 essentially having a shared memory abstraction on 1293 00:59:13,970 --> 00:59:16,150 distributed memory architecture. 1294 00:59:16,150 --> 00:59:18,392 There are things like all to all communication which would 1295 00:59:18,392 --> 00:59:19,730 also help you in that sense. 1296 00:59:19,730 --> 00:59:24,960 Although I don't talk about all to all communication here. 1297 00:59:24,960 --> 00:59:27,360 So I'm going to show you an example of sort of a more 1298 00:59:27,360 --> 00:59:29,810 detailed MPI. 1299 00:59:29,810 --> 00:59:32,690 But I also want to contrast this to the OpenMP programming 1300 00:59:32,690 --> 00:59:36,000 on shared memory processors because one might look simpler 1301 00:59:36,000 --> 00:59:38,470 than the other. 1302 00:59:38,470 --> 00:59:40,540 So suppose that you have a numerical integration method 1303 00:59:40,540 --> 00:59:44,990 that essentially you're going to use to calculate pi. 
1304 00:59:44,990 --> 00:59:48,200 So as you get finer and finer, you can get more accurate -- 1305 00:59:48,200 --> 00:59:50,030 as you shrink these intervals you can get 1306 00:59:50,030 --> 00:59:53,900 better values for pi. 1307 00:59:53,900 --> 00:59:58,690 And the code for doing that is some C code. 1308 00:59:58,690 --> 01:00:00,270 You have some variables. 1309 01:00:00,270 --> 01:00:02,800 And then you have a step that essentially tells you how many 1310 01:00:02,800 --> 01:00:04,730 times you're going to do this computation. 1311 01:00:04,730 --> 01:00:07,440 And for each time step you calculate this particular 1312 01:00:07,440 --> 01:00:08,540 function here. 1313 01:00:08,540 --> 01:00:11,040 And you add it all up and in the end you can sort of print 1314 01:00:11,040 --> 01:00:13,640 out what is the value of pi that you calculated. 1315 01:00:13,640 --> 01:00:16,600 So clearly as, you know, as you shrink your intervals, you 1316 01:00:16,600 --> 01:00:20,340 can get more and more accurate measures of pi. 1317 01:00:20,340 --> 01:00:22,740 So that translates to increasing the number of steps 1318 01:00:22,740 --> 01:00:29,160 in that particular C code. 1319 01:00:29,160 --> 01:00:33,700 So you can use that numerical integration to calculate pi 1320 01:00:33,700 --> 01:00:34,900 with OpenMP. 1321 01:00:34,900 --> 01:00:37,356 And what that translates to is -- sorry, there should have 1322 01:00:37,356 --> 01:00:41,370 been an animation here to ask you what I should add in. 1323 01:00:41,370 --> 01:00:43,220 You have this particular loop. 1324 01:00:43,220 --> 01:00:46,330 And this is computation that you want to parallelize. 1325 01:00:46,330 --> 01:00:48,890 And there is really four questions that you essentially 1326 01:00:48,890 --> 01:00:50,540 have to go through. 1327 01:00:50,540 --> 01:00:52,220 Are there variables that are shared? 1328 01:00:52,220 --> 01:00:54,420 Because you have to get the process right. 1329 01:00:54,420 --> 01:00:56,200 If there are variables that are shared, you have to 1330 01:00:56,200 --> 01:01:01,830 explicitly synchronize them and use locks to protect them. 1331 01:01:01,830 --> 01:01:02,880 What values are private? 1332 01:01:02,880 --> 01:01:08,690 So in OpenMP, things that are private are data on the stack, 1333 01:01:08,690 --> 01:01:12,000 things that are defined lexically within the scope of 1334 01:01:12,000 --> 01:01:16,030 the computation that you encapsulate by an OpenMP 1335 01:01:16,030 --> 01:01:18,900 pragma, and what variables you might want 1336 01:01:18,900 --> 01:01:20,340 to use for a reduction. 1337 01:01:20,340 --> 01:01:23,490 So in this case I'm doing a summation, and this is the 1338 01:01:23,490 --> 01:01:25,400 computation that I can parallelize. 1339 01:01:25,400 --> 01:01:28,770 Then I essentially want to do a reduction for the plus 1340 01:01:28,770 --> 01:01:32,980 operator since I'm doing an addition on this variable. 1341 01:01:32,980 --> 01:01:34,760 This loop here is parallel. 1342 01:01:34,760 --> 01:01:36,040 It's data parallel. 1343 01:01:36,040 --> 01:01:38,900 I can split it up. 1344 01:01:38,900 --> 01:01:41,190 The for loop is also -- 1345 01:01:41,190 --> 01:01:43,810 I can do this work sharing on it. 1346 01:01:43,810 --> 01:01:45,960 So I use the parallel for pragma. 1347 01:01:45,960 --> 01:01:49,360 And the variable x here is private. 1348 01:01:49,360 --> 01:01:52,010 It's defined here but I can essentially give a directive 1349 01:01:52,010 --> 01:01:53,530 that says, this is private. 
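Putting those answers together, a minimal OpenMP version of the pi loop consistent with that description might look like the following sketch (compile with the compiler's OpenMP flag; this is not the exact slide code):

#include <stdio.h>

#define NUM_STEPS 100000

int main(void)
{
    double step = 1.0 / (double) NUM_STEPS;
    double x, sum = 0.0;
    int i;

    /* The loop is data parallel: iterations are shared among threads,
       x is private to each thread, and sum is combined with a + reduction. */
    #pragma omp parallel for private(x) reduction(+:sum)
    for (i = 0; i < NUM_STEPS; i++) {
        x = (i + 0.5) * step;           /* midpoint of the i-th interval   */
        sum += 4.0 / (1.0 + x * x);     /* integrand evaluated at midpoint */
    }

    printf("pi is approximately %.8f\n", step * sum);
    return 0;
}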
1350 01:01:53,530 --> 01:01:56,290 You can essentially rename it on each processor. 1351 01:01:56,290 --> 01:01:58,890 Its value won't have any effect on the overall 1352 01:01:58,890 --> 01:02:01,270 computation because each computation will have its own 1353 01:02:01,270 --> 01:02:03,910 local copy. 1354 01:02:03,910 --> 01:02:06,360 That clear so far? 1355 01:02:06,360 --> 01:02:09,950 So computing pi with integration using MPI takes up 1356 01:02:09,950 --> 01:02:12,110 two slides. 1357 01:02:12,110 --> 01:02:13,980 You know, I could fit it on one slide but you couldn't see 1358 01:02:13,980 --> 01:02:15,170 it in the back. 1359 01:02:15,170 --> 01:02:16,990 So there's some initialization. 1360 01:02:16,990 --> 01:02:20,130 In fact, I think there's only six basic MPI commands that 1361 01:02:20,130 --> 01:02:22,120 you need for computing. 1362 01:02:22,120 --> 01:02:26,030 Three of them are here and you'll see the others are MPI 1363 01:02:26,030 --> 01:02:27,680 send and MPI receive. 1364 01:02:27,680 --> 01:02:31,550 And there's one more that you'll see on the next slide. 1365 01:02:31,550 --> 01:02:33,170 So there's some loop that says while I'm 1366 01:02:33,170 --> 01:02:36,790 not done keep computing. 1367 01:02:36,790 --> 01:02:39,120 And what you do is you broadcast n to all the 1368 01:02:39,120 --> 01:02:39,990 different processors. 1369 01:02:39,990 --> 01:02:42,900 N is really your time step. 1370 01:02:42,900 --> 01:02:47,660 How many small intervals of execution are you going to do? 1371 01:02:47,660 --> 01:02:49,930 And you can go through, do your computation. 1372 01:02:49,930 --> 01:02:52,510 So now this -- the MPI essentially encapsulates the 1373 01:02:52,510 --> 01:02:55,290 computation over n processors. 1374 01:02:55,290 --> 01:02:58,250 And then you get to an MPI reduce command at some point 1375 01:02:58,250 --> 01:03:01,810 that says, OK, what values did everybody compute? 1376 01:03:01,810 --> 01:03:03,430 Do the reduction on that. 1377 01:03:03,430 --> 01:03:06,410 Write that value into my MPI. 1378 01:03:06,410 --> 01:03:10,700 Now what happens here is there's processor ID zero 1379 01:03:10,700 --> 01:03:12,140 which I'm going to consider the master. 1380 01:03:12,140 --> 01:03:15,030 So he's the one who's going to actually print out the value. 1381 01:03:15,030 --> 01:03:18,780 So the reduction essentially synchronizes until everybody's 1382 01:03:18,780 --> 01:03:22,370 communicated a value to processor zero. 1383 01:03:22,370 --> 01:03:23,990 And then it can print out the pi. 1384 01:03:23,990 --> 01:03:27,660 And then you can finalize, which actually makes sure the 1385 01:03:27,660 --> 01:03:28,890 computation can exit. 1386 01:03:28,890 --> 01:03:30,360 And you can go on and terminate. 1387 01:03:35,710 --> 01:03:39,750 So the last concept in terms of understanding performance 1388 01:03:39,750 --> 01:03:43,010 for parallelism is this notion of locality. 1389 01:03:43,010 --> 01:03:46,240 And there's locality in your communication and locality in 1390 01:03:46,240 --> 01:03:47,930 your computation. 1391 01:03:47,930 --> 01:03:50,700 So what do I mean by that? 1392 01:03:50,700 --> 01:03:55,690 So in terms of communication, you know, if I have two 1393 01:03:55,690 --> 01:03:58,830 operations and let's say -- this is a picture or schematic 1394 01:03:58,830 --> 01:04:02,130 of what the MIT raw chip looks like. 1395 01:04:02,130 --> 01:04:03,570 Each one of these is a core. 
1396 01:04:03,570 --> 01:04:06,620 There's some network, some basic computation elements. 1397 01:04:06,620 --> 01:04:09,270 And if I have, you know, an addition that feeds into a 1398 01:04:09,270 --> 01:04:12,910 shift, well, I can put the addition here and the shift 1399 01:04:12,910 --> 01:04:15,720 there, but that means I have a really long path that I need 1400 01:04:15,720 --> 01:04:17,700 to go to in terms of communicating 1401 01:04:17,700 --> 01:04:19,190 that data value around. 1402 01:04:19,190 --> 01:04:22,590 So the computation naturally should just be closer together 1403 01:04:22,590 --> 01:04:25,950 because that decreases the latency that I need to 1404 01:04:25,950 --> 01:04:27,940 communicate. 1405 01:04:27,940 --> 01:04:30,300 So rather than doing net mapping, what I might want to 1406 01:04:30,300 --> 01:04:32,900 do is just go to somebody who is close to me and available. 1407 01:04:32,900 --> 01:04:35,130 AUDIENCE: Also there are volume issues. 1408 01:04:35,130 --> 01:04:37,140 So assume more than that. 1409 01:04:37,140 --> 01:04:40,309 A lot of other people also want to communicate. 1410 01:04:40,309 --> 01:04:43,296 So if [UNINTELLIGIBLE] randomly distributed, you can 1411 01:04:43,296 --> 01:04:44,292 assume there's a lot more communication 1412 01:04:44,292 --> 01:04:47,880 going into the channel. 1413 01:04:47,880 --> 01:04:52,710 Whereas if you put locality in there then you can scale 1414 01:04:52,710 --> 01:04:56,210 communication much better than scaling the network. 1415 01:05:00,380 --> 01:05:02,400 PROFESSOR: There's also a notion of locality in terms of 1416 01:05:02,400 --> 01:05:03,040 memory accesses. 1417 01:05:03,040 --> 01:05:07,880 And these are potentially also very important or more 1418 01:05:07,880 --> 01:05:10,010 important, rather, because of the latencies 1419 01:05:10,010 --> 01:05:12,310 for accessing memory. 1420 01:05:12,310 --> 01:05:15,860 So if I have, you know, this loop that's doing some 1421 01:05:15,860 --> 01:05:19,270 addition or some computation on an array and I distribute 1422 01:05:19,270 --> 01:05:21,900 it, say, over four processors -- 1423 01:05:21,900 --> 01:05:24,880 this is, again, let's assume a data parallel loop. 1424 01:05:24,880 --> 01:05:27,460 So what I can do is have a work sharing mechanism that 1425 01:05:27,460 --> 01:05:29,970 says, this thread here will operate on 1426 01:05:29,970 --> 01:05:31,570 the first four indices. 1427 01:05:31,570 --> 01:05:34,135 This thread here will operate on the next four indices and 1428 01:05:34,135 --> 01:05:36,120 the next four and the next four. 1429 01:05:36,120 --> 01:05:39,530 And then you essentially get to join barrier and then you 1430 01:05:39,530 --> 01:05:40,950 can continue on. 1431 01:05:40,950 --> 01:05:44,840 And if we consider how the access patterns are going to 1432 01:05:44,840 --> 01:05:50,730 be generated for this particular loop, well, in the 1433 01:05:50,730 --> 01:05:52,560 sequential case I'm essentially 1434 01:05:52,560 --> 01:05:54,200 generating them in sequence. 1435 01:05:54,200 --> 01:05:56,620 So that allows me to exploit, for example, on traditional 1436 01:05:56,620 --> 01:05:59,780 [? CAT ?] architecture, a notion of spatial locality. 1437 01:05:59,780 --> 01:06:03,620 If I look at how things are organized in memory, in the 1438 01:06:03,620 --> 01:06:06,710 sequential case I can perhaps fetch an 1439 01:06:06,710 --> 01:06:07,870 entire block at a time. 
1440 01:06:07,870 --> 01:06:11,340 So I can fetch all the elements of A[0] 1441 01:06:11,340 --> 01:06:12,680 to A[3] in one shot. 1442 01:06:12,680 --> 01:06:16,820 I can fetch all the elements of A[4] to A[7] in one shot. 1443 01:06:16,820 --> 01:06:19,520 And that allows me to essentially improve 1444 01:06:19,520 --> 01:06:22,080 performance because I overlap communication. 1445 01:06:22,080 --> 01:06:25,890 I'm predicting that once I see a reference, I'm going to use 1446 01:06:25,890 --> 01:06:29,140 data that's adjacent to it in space. 1447 01:06:29,140 --> 01:06:31,070 There's also a notion of temporal locality that says 1448 01:06:31,070 --> 01:06:33,990 that if I use some particular data element, I'm going to 1449 01:06:33,990 --> 01:06:35,430 reuse it later on. 1450 01:06:35,430 --> 01:06:37,970 I'm not showing that here. 1451 01:06:37,970 --> 01:06:41,190 But in the parallel case what could happen is if each one of 1452 01:06:41,190 --> 01:06:43,920 these threads is requesting a different data element -- and 1453 01:06:43,920 --> 01:06:49,470 let's say execution essentially proceeds -- you 1454 01:06:49,470 --> 01:06:51,230 know, all the threads are requesting their 1455 01:06:51,230 --> 01:06:53,970 data at the same time. 1456 01:06:53,970 --> 01:06:56,240 Then all these requests are going to end up going to the 1457 01:06:56,240 --> 01:06:59,130 same memory bank. 1458 01:06:59,130 --> 01:07:02,420 The first thread is requesting A[0]. 1459 01:07:02,420 --> 01:07:05,770 The next thread is requesting A[4], the next thread 1460 01:07:05,770 --> 01:07:08,750 A[8], the next thread A[12]. 1461 01:07:08,750 --> 01:07:11,010 And all of these happen to be in the same memory bank. 1462 01:07:11,010 --> 01:07:13,040 So what that means is, you know, there's a lot of 1463 01:07:13,040 --> 01:07:14,940 contention for that one memory bank. 1464 01:07:14,940 --> 01:07:17,650 And in effect I've serialized the computation. 1465 01:07:17,650 --> 01:07:17,850 Right? 1466 01:07:17,850 --> 01:07:20,620 Everybody see that? 1467 01:07:20,620 --> 01:07:23,090 And, you know, this can be a problem in that you can 1468 01:07:23,090 --> 01:07:26,920 essentially fully serialize the computation in that, you 1469 01:07:26,920 --> 01:07:29,720 know, there's contention on the first bank, contention on 1470 01:07:29,720 --> 01:07:33,620 the second bank, and then contention on the third bank, 1471 01:07:33,620 --> 01:07:35,040 and then contention on the fourth bank. 1472 01:07:35,040 --> 01:07:38,000 And so I've done absolutely nothing other than pay 1473 01:07:38,000 --> 01:07:39,720 overhead for parallelization. 1474 01:07:39,720 --> 01:07:42,590 I've made extra work for myself [? concreting ?] 1475 01:07:42,590 --> 01:07:44,250 the threads. 1476 01:07:44,250 --> 01:07:46,810 Maybe I've done some extra work in terms of 1477 01:07:46,810 --> 01:07:48,840 synchronization. 1478 01:07:48,840 --> 01:07:50,460 So I'm fully serial. 1479 01:07:52,980 --> 01:07:55,810 So what you want to do is actually reorganize the way 1480 01:07:55,810 --> 01:07:59,840 data is laid out in memory so that you can effectively get 1481 01:07:59,840 --> 01:08:01,620 the benefit of parallelization. 1482 01:08:01,620 --> 01:08:07,420 So if you have the data organized as it is there, you can 1483 01:08:07,420 --> 01:08:09,670 shuffle things around.
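Concretely, assuming four banks that are word interleaved the way the picture suggests (bank = index mod 4), one way that shuffle could look is a simple transposed copy, so that each thread's four elements end up in a bank of their own. This is an illustration of the idea, not the slide's exact layout:

/* Remap a 16-element array so that thread t's elements, originally
   A[4t] .. A[4t+3], all land at indices congruent to t modulo 4. */
int A[16], B[16];
/* ... A gets filled with the input data ... */
for (int i = 0; i < 16; i++) {
    int t = i / 4;            /* thread that owns element i           */
    int k = i % 4;            /* position within that thread's chunk  */
    B[4 * k + t] = A[i];      /* new index mod 4 equals t             */
}
/* Thread t now walks B[t], B[4 + t], B[8 + t], B[12 + t], which all sit
   in bank t, so at every step the four threads hit four different banks. */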
1484 01:08:09,670 --> 01:08:13,000 And then you end up with fully parallel or a layout that's 1485 01:08:13,000 --> 01:08:16,320 more amenable to full parallelism because now each 1486 01:08:16,320 --> 01:08:17,930 thread is going to a different bank. 1487 01:08:17,930 --> 01:08:20,650 And that essentially gives you a four-way parallelism. 1488 01:08:20,650 --> 01:08:22,750 And so you get the performance benefits. 1489 01:08:26,480 --> 01:08:30,623 So there are different kinds of sort of considerations you 1490 01:08:30,623 --> 01:08:34,400 need to take into account for shared memory architectures in 1491 01:08:34,400 --> 01:08:37,400 terms of how the design affects the memory latency. 1492 01:08:37,400 --> 01:08:42,190 So in a uniform memory access architecture, every processor 1493 01:08:42,190 --> 01:08:44,240 is either, you can think of it as being 1494 01:08:44,240 --> 01:08:45,200 equidistant from memory. 1495 01:08:45,200 --> 01:08:48,295 Or another way, it has the same access latency for 1496 01:08:48,295 --> 01:08:50,100 getting data from memory. 1497 01:08:50,100 --> 01:08:54,480 Most shared memory architectures are non-uniform, 1498 01:08:54,480 --> 01:08:56,710 also known as NUMA architecture. 1499 01:08:56,710 --> 01:08:59,460 So you have physically partitioned memories. 1500 01:08:59,460 --> 01:09:03,020 And the processors can have the same address space, but 1501 01:09:03,020 --> 01:09:05,930 the placement of data affects the performance because going 1502 01:09:05,930 --> 01:09:10,860 to one bank versus another can be faster or slower. 1503 01:09:10,860 --> 01:09:12,560 So what kind of architecture is Cell? 1504 01:09:12,560 --> 01:09:19,100 Yeah. 1505 01:09:19,100 --> 01:09:19,710 No guesses? 1506 01:09:19,710 --> 01:09:22,910 AUDIENCE: It's not a shared memory. 1507 01:09:22,910 --> 01:09:23,150 PROFESSOR: Right. 1508 01:09:23,150 --> 01:09:24,720 It's not a shared memory architecture. 1509 01:09:27,770 --> 01:09:30,500 So a summary of parallel performance factors. 1510 01:09:30,500 --> 01:09:32,390 So there's three things I tried to cover. 1511 01:09:34,970 --> 01:09:36,510 Coverage or the extent of parallelism in the 1512 01:09:36,510 --> 01:09:37,420 application. 1513 01:09:37,420 --> 01:09:40,480 So you saw Amdahl's Law and it actually gave you a sort of a 1514 01:09:40,480 --> 01:09:43,990 model that said when is parallelizing your application 1515 01:09:43,990 --> 01:09:45,000 going to be worthwhile? 1516 01:09:45,000 --> 01:09:46,750 And it really boils down to how much parallelism you 1517 01:09:46,750 --> 01:09:48,990 actually have in your particular algorithm. 1518 01:09:48,990 --> 01:09:50,870 If your algorithm is sequential, then there's 1519 01:09:50,870 --> 01:09:56,110 really nothing you can do for programming for performance 1520 01:09:56,110 --> 01:09:57,820 using parallel architectures. 1521 01:09:57,820 --> 01:10:02,500 I talked about granularity of the data partitioning and the 1522 01:10:02,500 --> 01:10:04,360 granularity of the work distribution. 1523 01:10:04,360 --> 01:10:06,080 You know, if you had really fine-grain things versus 1524 01:10:06,080 --> 01:10:08,200 really coarse-grain things, how does that translate to 1525 01:10:08,200 --> 01:10:10,980 different communication costs? 1526 01:10:10,980 --> 01:10:13,770 And then last thing I shared was locality. 
1527 01:10:13,770 --> 01:10:16,620 So if you have near neighbors talking, that may be different 1528 01:10:16,620 --> 01:10:19,230 than two things that are further apart in space 1529 01:10:19,230 --> 01:10:20,730 communicating. 1530 01:10:20,730 --> 01:10:23,690 And there are some issues in terms of the memory latency 1531 01:10:23,690 --> 01:10:28,310 and how you actually can take advantage of that. 1532 01:10:28,310 --> 01:10:33,400 So this really is an overview of sort of the parallel 1533 01:10:33,400 --> 01:10:36,670 programming concepts and the performance implications. 1534 01:10:36,670 --> 01:10:39,530 So the next lecture will be, you know, how do I actually 1535 01:10:39,530 --> 01:10:40,560 parallelize my program? 1536 01:10:40,560 --> 01:10:42,480 And we'll talk about that.