The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: OK. So today we're going to continue on with some of the design patterns that we started talking about last week.

To recap, there are really four common steps to taking a program and parallelizing it. Often you're starting off with a program that's designed or written in a sequential manner. What you want to do is find tasks in the program -- these are independent pieces of work that you're going to be able to decompose from your sequential code. You're going to group tasks together into threads or processes. And then you'll essentially map each one of these threads or processes down to the actual hardware. And that will get you, eventually when these programs run, the concurrency and the performance speedups that you want.

As a reminder of what I talked about last week in terms of finding the tasks, or finding the concurrency: you start off with an application, you come up with a block-level diagram, and from that you try to understand where the time is spent in the computations and what some typical patterns are for how the computations are carried out.

So we talked about task decomposition -- independent tasks, or tasks that might be different, that the application is carrying out. In the MPEG encoder, we looked at decoding the motion vectors for temporal compression versus doing the spatial compression; those are substantially different kinds of work.

We talked about data decomposition.
So if you have some work that's really consuming a large chunk of data, and you realize that it's applying the same kind of work to each of those data pieces, then you can partition your data into smaller subsets and apply the same function over and over again. The motion compensation phase is one example where you can replicate the function, split up the data stream in different ways, and have these tasks proceed in parallel. So that's data decomposition.

And then we talked a little bit about making a case for a pipeline decomposition. You have a data assembly line or producer-consumer chains, and you essentially want to recognize those in your computation so that you can exploit them eventually when you're doing your mapping down to actual hardware.

But what does it mean for two tasks to actually be concurrent? And how do you know that you can safely run two tasks in parallel? This is something I went over only crudely last time, so to make it more concrete I'm highlighting Bernstein's condition. It says that given two tasks, if the input set of one task does not intersect with the output set of the other, and vice versa, and the two tasks don't update the same data structures in memory, then there are really no dependence issues between them: you can run them safely in parallel. So for tasks T1 and T2: if all the data that's consumed by T1 -- all the data elements read by T1 -- is different from the data that's written by T2, then T1 has no problem with T2 running in parallel, because T2 is updating an orthogonal data set. Similarly for T2 with respect to T1, and the outputs they produce must also be different.

So as an example, let's say you have two tasks. In T1 you're doing some basic statements -- these could be more coarse grained, there could be a lot more computation in here; I've just simplified it for the illustration. Task T1 does a equals x plus y, and task T2 does b equals x plus z.
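Written out as code, the two tasks might look like the minimal sketch below. This is not the lecture's own code; the OpenMP sections pragma is just one way to run two independent statements in parallel.

```c
#include <stdio.h>

int main(void) {
    int x = 1, y = 2, z = 3;
    int a, b;

    /* T1: reads {x, y}, writes {a}
       T2: reads {x, z}, writes {b}
       The read set of each task is disjoint from the write set of the
       other, and the write sets are disjoint from each other, so
       Bernstein's condition holds and the tasks are independent. */
    #pragma omp parallel sections
    {
        #pragma omp section
        a = x + y;              /* task T1 */

        #pragma omp section
        b = x + z;              /* task T2 */
    }

    printf("a = %d, b = %d\n", a, b);
    return 0;
}
```

Compiled without OpenMP the two statements simply run one after the other; nothing about the result changes, which is exactly what the disjoint read and write sets guarantee.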
So if we look at the read set for T1 -- all the variables or data structures or addresses that are read by the first task -- that's x and y here. And the write set is all the data that's written or produced by T1; here we're producing just one data value, and it goes into location a. Similarly we can come up with the read set and write set for T2, and that's shown here: T2 has x and z in its read set, and it produces one data value, b.

If we take the intersections of the read and write sets of the two tasks, they're empty. Each task reads something completely different from what's produced by the other, and they write to two completely different memory locations. So I can run these two tasks in parallel.

You can extend this analysis, and compilers can actually use this condition to determine when two tasks can be parallelized if you're doing automatic parallelization. You'll probably hear more about that later on.

So what I focused on last time were the finding-concurrency patterns. And I had identified four design spaces, based on the work that's outlined in the book by Mattson, Sanders, and Massingill, starting with two large concepts. The first helps you figure out how you're going to express your algorithm: first you find your concurrency, and then you organize it in some way. We're going to talk about that in more detail. And then, once you've organized your tasks in a way that expresses your overall computation, you need some software construction utilities or data structures or mechanisms for actually orchestrating the computation, for which people have also abstracted out some common patterns. I'll briefly talk about those as well.

On the algorithm expression side, these are essentially conceptualization steps that help you abstract out your problem.
And you may in fact think about your algorithm expression in different ways, to expose different kinds of concurrency or to be able to explore different ways of mapping the concurrency to hardware. The construction side is more about actual engineering and implementation. Here you're thinking about: what do the data structures look like? What is the communication pattern going to look like? Am I going to use things like MPI or OpenMP, and what does that help me with in terms of doing my implementation?

So given a collection of concurrent tasks -- you've done the first step in your four design steps -- what is your next step? It's really mapping those tasks that you've identified down to some sort of execution units. Threads are very common. This is essentially what we've been using on Cell: we take our computation, we wrap it into SPE threads, and then we can execute those at run time.

So there are some things to keep in mind, although you shouldn't over-constrain yourself with these considerations. What is the magnitude of the parallelism that you're going to get? Do you want hundreds or thousands of threads, or do you want something on the order of tens? This matters because you don't want to overwhelm the intended system that you're going to run on. We talked yesterday about the Cell processor: if you're creating a lot more than six threads, then you can create problems, or you essentially don't get extra parallelism, because each thread runs to completion on its SPE and the context-switch overhead is extremely high. So you don't want to spend too much engineering cost coming up with an algorithm implementation that's massively scalable to hundreds or thousands of threads when you can't actually exploit that.

But that doesn't mean you should over-constrain your implementation to the point where, if I now want to take your code and run it on a different machine, I essentially have to redesign or re-engineer the complete process.
So you want to avoid the tendency to over-constrain the implementation, and you want to leave your code in a way that's malleable, so you can easily make changes to factor in new platforms that you want to run on, or new machine architecture features that you might want to exploit.

There are three major organization principles I'm going to talk about, and none of them should be foreign to you at this point, because we've talked about them in different ways in the recitations and in previous lectures. The question is really: what is it that determines the algorithm structure, based on the set of tasks you're actually carrying out in your computation?

There's the principle that says organize things by tasks; I'm going to talk to that. Then there's a principle that says organize things by how you're doing the data decomposition: how you're distributing the data, how the data is laid out in memory, or how you're partitioning the data to compute on it dictates how you should organize your computation. And then there's organize by flow of data. That's something you'll hear more about in the next lecture, on streaming. But in this pattern, if there are specific computations that take advantage of a high-bandwidth flow of data between computations, you might want to exploit that for concurrency. We'll talk about that as well.

OK. So here's a design diagram for how you can actually go through the process; you can ask yourself a set of questions. If I want to organize things by tasks, then there are essentially two main patterns. If the code is recursive, then you want to apply a divide-and-conquer organization. If it's not recursive, then you essentially want to do task parallelism.

So in task parallelism -- I've listed two examples here.
But really any of the things that we've talked about in the past fit. Ray tracing: here you're shooting rays through a scene to try to determine how to render it, and each ray is a separate and independent computation. In molecular dynamics, you're trying to determine the non-bonded force calculations. There are some dependencies, but really you can do the calculation for one molecule or one atom independent of any other. And then there's the global dependence of having to update or communicate across all those molecules to reflect the new positions in the system.

So the common factors here are that your tasks are associated with iterations of a loop, and you can distribute them: each processor can do a different iteration of the loop. And often you know what the tasks are before you actually start your computation. Although in some cases, like ray tracing, you might generate more and more threads, or more and more computations, as you go along, because as a ray shoots off you're calculating new reflections, and that creates extra work. But largely you have these independent tasks that you can encapsulate in threads and run.

And this might appear subtle, but there are algorithm classes where not all tasks need to complete for you to arrive at a solution. In some cases you might converge to an acceptable solution, and you don't actually need to go through and execute all the outstanding computation before you can say the program is done. So there will be a tricky issue -- I'll revisit this briefly later on -- which is: how do you determine whether your program has actually terminated or completed?

In divide and conquer -- this is really for recursive programs.
You can think of a well-known sorting algorithm, merge sort, that classically fits this kind of picture. You have some really large array of data that you want to sort. You keep subdividing it into smaller and smaller chunks until you can do local reorderings, and then you start merging things back together. So this gives you a way to take a problem and divide it into subproblems: you split the data at some point and then you join it back together -- you merge it. You might see the terms fork and merge, or fork and join, used instead of split and join; I've used terminology that melds well with some of the concepts we use in streaming, which you'll see in the next lecture.

In these kinds of programs it's not always the case that each subproblem will have essentially the same amount of work to do. You might need more dynamic load balancing, because how you distribute the data might lead you to do more work in one subproblem than in another. That's as opposed to some of the other mechanisms, where static load balancing works really well. To remind you, static load balancing essentially says: you have some work, you assign it to each of the processors, and you're going to be relatively happy with what each processor's utilization looks like over time -- nobody's going to be too overwhelmed with the amount of work they have to do. In this case, you might end up needing dynamic load balancing, which says: I'm unhappy with the work distribution or utilization, some processors are more idle than others, so I might want to redistribute things.

So what we'll talk about is how this concept of a divide-and-conquer parallelization pattern works into an actual implementation -- how do I actually implement a divide-and-conquer organization?
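As a rough sketch of how that maps to code -- this is not code from the lecture; the cutoff value and the OpenMP 3.0-style task pragmas are assumptions for illustration -- a fork/join merge sort might look like this:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CUTOFF 1024               /* below this size, just sort sequentially */

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* merge the two sorted halves a[0..mid) and a[mid..n) using scratch space */
static void merge(int *a, int *tmp, int n, int mid) {
    int i = 0, j = mid, k = 0;
    while (i < mid && j < n) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < n)   tmp[k++] = a[j++];
    memcpy(a, tmp, n * sizeof(int));
}

static void merge_sort(int *a, int *tmp, int n) {
    if (n <= CUTOFF) {            /* base case: small subproblem, no fork */
        qsort(a, n, sizeof(int), cmp_int);
        return;
    }
    int mid = n / 2;

    /* "split": fork the two independent subproblems as tasks */
    #pragma omp task
    merge_sort(a, tmp, mid);
    #pragma omp task
    merge_sort(a + mid, tmp + mid, n - mid);

    /* "join": wait for both halves before merging them */
    #pragma omp taskwait
    merge(a, tmp, n, mid);
}

int main(void) {
    enum { N = 1 << 20 };
    int *a = malloc(N * sizeof(int)), *tmp = malloc(N * sizeof(int));
    for (int i = 0; i < N; i++) a[i] = rand();

    #pragma omp parallel          /* create the worker team once */
    #pragma omp single            /* one thread starts the recursion */
    merge_sort(a, tmp, N);

    printf("first=%d last=%d\n", a[0], a[N - 1]);
    free(a); free(tmp);
    return 0;
}
```

The cutoff is where the dynamic load-balancing question shows up: small leaves are cheap to hand out, and the runtime can steal whole subtrees from busy workers.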
The next organization is organize by data. So here you have some computation -- not sure why it's flickering.

AUDIENCE: Check your -- maybe your VGA cables aren't in good.

PROFESSOR: So in organize by data, you essentially want to apply this if you have a lot of computation that's using a shared global data structure, or that's going to update a central data structure. In molecular dynamics, for example, you have a huge array that records the position of each of the molecules. And while you can do the force calculations independently, eventually all the parallel tasks have to communicate with the central data structure and say: here are the new locations for all the molecules. So that has to go into a central repository.

And there are different kinds of decompositions within this organization. If your data structure is recursive -- a linked list or a tree or a graph -- then you can apply the recursive data pattern. If it's not, if it's linear, like an array or a vector, then you apply geometric decomposition. And you've essentially seen geometric decomposition; these were some of the labs that you've already done. In the example from yesterday's recitation, you're doing an n-body simulation in terms of which bodies are gravitating toward which: you're calculating the forces between pairs of objects, and depending on the force each object feels, you calculate a new motion vector and use that to update the position of each body in, say, the galaxy you're simulating. What we talked about yesterday was that, given an array of positions, each processor gets a sub-chunk of that position array and knows how to do its calculations locally based on that. And then you might also communicate local chunks to do more scalable computations.

Recursive data structures are a little bit more tricky. At face value you might think there's really no parallelism you can get out of a recursive data structure. If you're iterating over a list and you want to get the sum, well, I just need to go through the list. Can I really parallelize that?
There are, however, opportunities where you can reshape the computation in a way that exposes the concurrency. And often what this comes down to is that you're going to do more work, but that's OK because you're going to finish faster. So I'm going to illustrate this kind of work/concurrency tradeoff with an example.

In this application we have some graphs, and for each node in a graph we want to know: what is its root? This works well when you have a forest, where not all the nodes are connected, and given a node you want to know who the root of its tree is. What we can do is get more concurrency by changing the way we actually think about the algorithm. Rather than starting from each node and, in a directed graph, following its successor -- this is essentially order n, because from a node we may have to follow up to n links -- we can think about it slightly differently. What if, rather than finding the successor and then finding that successor's successor one link at a time, at each computational step we start with a node and ask: who is your successor's successor?

So we can converge in this example in three steps. From five, whose successor is six, we can ask: who is the successor's successor of five? And that would be two. Similarly you can do that for seven, and so on, and you keep asking the question. So you can distribute all these data structures, repeatedly ask these questions at all these n nodes, and it leads you to an order log n solution versus an order n solution.

But what have I done in each step? Well, I've actually created more work for myself -- I've increased the total amount of work. Right there. Right. Yes. Because now every one of the n nodes is issuing a query -- who is your successor's successor? -- at every step, whereas in the sequential case each node only follows its chain once. And that works really well.
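This is the classic pointer-jumping idea. Here is a minimal sketch of it on a successor array; the example forest and the array layout are made up, and a root is represented as a node whose successor is itself:

```c
#include <stdio.h>
#include <string.h>

#define N 8

int main(void) {
    /* succ[i] is the successor of node i; a root points to itself.
       Example forest (made up): nodes 0 and 4 are roots. */
    int succ[N] = {0, 0, 1, 2, 4, 4, 5, 6};
    int next[N];

    /* Pointer jumping: every node replaces its successor with its
       successor's successor.  Each step does order-n work, but all n
       updates are independent, so a parallel machine finishes in
       order log-n steps instead of following chains one link at a time. */
    for (int step = 0; step < N; step++) {
        int changed = 0;
        for (int i = 0; i < N; i++) {        /* this loop is the parallel part */
            next[i] = succ[succ[i]];
            if (next[i] != succ[i]) changed = 1;
        }
        memcpy(succ, next, sizeof(succ));
        if (!changed) break;                 /* everyone now points at a root */
    }

    for (int i = 0; i < N; i++)
        printf("root of %d is %d\n", i, succ[i]);
    return 0;
}
```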
So most strategies based on this pattern -- decomposing your computation according to a recursive data pattern -- lead you to doing more work, or at least some increase in the amount of work. But you win it back because you can decrease your execution time. And so this is a tradeoff you might want to consider.

AUDIENCE: In the first one, order n was sequential?

PROFESSOR: Yeah, yeah. It's a typo. Yeah.

So, organize by flow of data. This is essentially the pipeline model, and we talked about it again in some of the recitations in terms of SPE-to-SPE communication. Or do you want to organize based on event-based mechanisms? What these really come down to is: how regular is the flow of data in your application? If you have a regular, let's say one-way, flow through a stable computation path -- I've set up my algorithm structure, data is flowing through it at a regular rate, and the computation graph isn't changing very much -- then I can pipeline things really well. This could be a linear chain of computation or it could be nonlinear; there could be branches in the graph. And I can use that to exploit pipeline parallelism.

If I don't have this nice, regular structure, it could be that events are created at run time. For example, you're a car wash attendant and a new car comes in; you have to find a garage to assign to it and then turn on the car wash machine. If dynamic threads are created based on sensory input that comes in, then you might want to use event-based coordination: you have irregular computation, the computation might vary based on the data that comes into your system, and you might have unpredictable data flow.

So in the pipeline model, the thing to consider is the pipeline throughput versus the pipeline latency. The amount of concurrency in a pipeline is really limited by the number of stages.
This is nothing new -- you've seen it, for example, in superscalar pipelines. And just as in the case of an architectural pipeline, the amount of time it takes to fill the pipeline and the amount of time it takes to drain the pipeline can limit your parallelism, so you want those to be really small compared to the actual computation time you spend in your pipeline.

The performance metric is usually the throughput: how much data can you pump through your pipeline per unit time? In video encoding, it's the frames per second that you can produce. The pipeline latency, though, is also important, especially in a real-time application where you need a result every 10 milliseconds -- your pacemaker, for example, has to produce a beep or a signal to your heart at specific rates. So you need to consider your pipeline throughput versus your pipeline latency, and that can determine how many stages you might want to decompose or organize your application into.

In event-based coordination, these are interactions of tasks over unpredictable intervals, and you're more prone to deadlocks in these applications. You might have cyclic dependencies, where one task can't proceed until it gets data from another task, but that task in turn can't proceed until it gets data from the first. You can create these complex interactions that often lead to deadlock, so you have to be very careful in structuring things so you don't end up with feedback loops that block computation progress.

So given these three organizational structures -- I can organize my computation by tasks, by the data decomposition, or by the flow of data -- what are the supporting structures? How do I actually implement these? There are many different supporting structures.
I've identified four that occur most often in the literature and in books, using the common terminology: SPMD, loop parallelism, the master/worker pattern, and the fork/join pattern.

The SPMD pattern is the single program, multiple data concept. Here you have just one program: you write it once and then you assign it to each of your processors to run. So it's the same program; it just runs on different machines. Now, each instance of the code can take a different control flow path. Just because they're running the same program doesn't mean the computation happens in lock step -- that would be a SIMD or vector-like computation. In this model each instance can take independent control flow, so there can be different behavior in each instance of the code, but you're running the same code everywhere. This is slightly different, for example, from what you've seen on Cell, where you have the PPE thread that creates SPE threads: sometimes the SPE threads are all the same, but it's not always the case that the PPE thread and the SPE threads run the same program.

In the SPMD model there are really five steps. You initialize your computation and the world of code instances you're going to run, and for each one you obtain a unique identifier; this usually lets the instances determine who needs to communicate with whom, and any ordering dependencies. You run the same program on each processor. You also need to distribute your data between the different instances of your code. And once each instance is running and computing on its data, eventually you need to finalize in some way -- that might mean doing a reduction to communicate all the data to one processor, which actually outputs the value.

We saw an SPMD example with the numerical integration for calculating pi. And if you remember, we had this very simple C loop.
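For reference, here is a minimal reconstruction of that kind of loop in its MPI SPMD form. It is not the exact code from the slides; the step count and variable names are assumptions. Each process uses its rank as the unique identifier to pick which intervals it computes, and a reduction collects the partial sums at the end:

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    const int n = 100000;            /* number of integration steps (made up) */
    const double w = 1.0 / n;        /* width of each interval */
    double local = 0.0, pi = 0.0;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* my unique identifier */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many instances are running */

    /* The sequential loop would run i = 0..n-1 on one processor.
       Here each process takes iterations rank, rank+size, rank+2*size, ...
       which is a cyclic distribution of the intervals. */
    for (int i = rank; i < n; i += size) {
        double x = (i + 0.5) * w;            /* midpoint of interval i */
        local += 4.0 / (1.0 + x * x);        /* integrand whose integral is pi */
    }
    local *= w;

    /* Finalize: reduce all the partial sums down to process 0. */
    MPI_Reduce(&local, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("pi is approximately %.15f\n", pi);

    MPI_Finalize();
    return 0;
}
```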
And we showed the MPI implementation of that loop. In this code we're computing over different intervals: for each interval we calculate a value, and in the MPI program we're essentially deciding which intervals each process should run. So it's the same program, it runs on every single processor, and each processor determines, based on its ID, which intervals of the actual integration to do. In this model we're distributing the work relatively evenly.

And I can use different distributions. The first is a block distribution: each processor gets one entire contiguous slice of the iterations -- it starts at its slice and goes through to completion, the next processor starts at its slice and goes through to completion, and so on. The other is called a cyclic distribution, where I distribute work in a round-robin fashion or by some similar mechanism: with 10 processors and 100 steps, a processor does steps i, i plus 10, i plus 20, and so on, so each processor does smaller slices of each of those intervals. I've greyed out the components for the block distribution to show you the contrast here.

There are some challenges in the SPMD model. One is: how do you actually split your data correctly? You have to distribute your data in a way that doesn't increase contention on your memory system, so that each processor that's assigned computation has its data locally to operate on. And you want to achieve an even work distribution -- do you need a dynamic load balancing scheme, or can you use an alternative pattern if that's not suitable?

So the second pattern, as opposed to the SPMD pattern, is the loop parallelism pattern.
This is best suited when you have a program that you can't really change a whole lot, or that you don't want to change a whole lot, and a programming model that allows you to identify the loops that take up most of the computation and then insert annotations, or use some other mechanism, to automatically parallelize those loops. So we saw in the OpenMP example that you have some loops, and you can insert pragmas that say: this loop is parallel. The compiler and the run-time system can then automatically partition the loop into smaller chunks, and each chunk can compute in parallel. And you might apply this scheme in different ways depending on how well you understand your code, and on the platform: are you running on a shared memory machine? You can't afford to do a whole lot of restructuring, and communication costs might be really expensive.
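A minimal sketch of that annotation style -- an illustrative example rather than the one from the lecture. The pragma asks the compiler and runtime to split the iterations of an independent loop across a team of threads, and the schedule clause is where a block versus cyclic-style distribution of iterations shows up:

```c
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N], b[N], c[N];

    for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2.0 * i; }

    /* The loop body has no cross-iteration dependences, so it can be
       annotated as parallel.  schedule(static) hands each thread one
       contiguous block of iterations; schedule(static, 1) would give a
       cyclic (round-robin) distribution instead. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];

    printf("a[42] = %f\n", a[42]);
    return 0;
}
```

Compiled without OpenMP support, the pragma is simply ignored and the loop runs serially, which is part of the appeal of this pattern: the sequential code is left essentially untouched.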
The master/worker pattern is really starting to get closer to what we've done in the Cell recitations and the Cell labs. You have some world of independent tasks, and the master runs and distributes each of these tasks to different processors. There are several advantages you can leverage here. If your tasks are varied in nature -- they might finish at different times, or they might require different kinds of resources -- you can use this model to view your machine as a non-symmetric processor: not everybody is the same, and this model works really well for that. You can distribute the tasks and then do dynamic load balancing, because as workers finish you can ship them more and more work.

So it has some particularly relevant properties for heterogeneous computations, but it's also really good when you have a whole lot of parallelism in your application -- so-called embarrassingly parallel problems. Ray tracing, molecular dynamics, and a lot of scientific applications have these massive levels of parallelism, and you can use this essentially work-queue-based mechanism that says: I have all these tasks, and I'll just dispatch them to workers to compute.

And as I pointed out earlier, when do you define your entire computation to have completed? Sometimes you're computing until you've reached some result, and often you're willing to accept a result within some range of error. You might still have some threads in flight -- do you terminate your computation then or not? And what are the issues with synchronization? If you have many threads running together, does the communication between them to send out these control messages -- to say "I'm done" -- start to overwhelm you?

The fork/join pattern is really not conceptually too different, in my mind, from the master/worker model, and it's also very relevant to what we've done with Cell. The main difference might be that you have tasks that are dynamically created. In the embarrassingly parallel case, you actually know the world of all the potential tasks you're going to run in parallel. In the fork/join model, some new computation might come up as a result of, say, an event-based mechanism. So a task might be created dynamically, and tasks might later terminate or complete, and new ones come up as a result.

AUDIENCE: It almost seems like you fork the tasks once in the one model and then keep assigning work to them, while in the fork/join model you just keep forking, which might not be completely matched to the number of processors available -- you just fork them out.

PROFESSOR: So the process that's creating all these threads, or that's doing all the forking, is often known as the parent, and the tasks that are generated are the children. And eventually the parent can't continue, or can't resume, until its children have completed or have reached the join point.
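A minimal sketch of that parent/child structure with POSIX threads; the child_work function and the result handling are made up for illustration. pthread_create is the fork, and pthread_join is the join point the parent blocks on:

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_CHILDREN 4

/* hypothetical per-child work: square the child's id */
static void *child_work(void *arg) {
    long id = (long)arg;
    long *result = malloc(sizeof *result);
    *result = id * id;
    return result;                      /* handed back to the parent at join */
}

int main(void) {
    pthread_t child[NUM_CHILDREN];

    /* fork: the parent creates the children */
    for (long i = 0; i < NUM_CHILDREN; i++)
        pthread_create(&child[i], NULL, child_work, (void *)i);

    /* join: the parent cannot resume past this point until
       every child has completed */
    for (int i = 0; i < NUM_CHILDREN; i++) {
        void *ret;
        pthread_join(child[i], &ret);
        printf("child %d returned %ld\n", i, *(long *)ret);
        free(ret);
    }
    return 0;
}
```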
And so those are really some of the models that we've already seen, in a lot of cases in the recitations and labs, for how you run your computations. Some of you have already discovered these and are thinking about how your projects should be parallelized for your Cell demos.

Some of the other things I'm going to talk about are communication patterns. Two lectures ago you saw, for example, that you have point-to-point communication and broadcast communication. In point-to-point communication you have two tasks that need to communicate, and they can send explicit messages to each other. These could be control messages that say "I'm done" or "I'm waiting for data," or they could be data messages that actually ship you a particular data element that you might need. And again, we've seen this with Cell. Broadcast says: I have some result that everybody needs, so I send it out to everybody by some mechanism. There's no real broadcast mechanism on Cell.

The concept I'm going to talk about, though, is the reduction mechanism, which is really the inverse of the broadcast. In the broadcast, I have a data element that I need to send to everybody else. In the reduction, all of us have data that somebody else needs, so what we need to do is collectively bring that data together and generate an end result.

A simple example of a reduction: you have some array of elements that you want to add together, and the result of the collective operation is the final sum. So you have an array of four elements, A0, A1, A2, and A3, and you can do a serial reduction: I take A0 and add it to A1, and that gives me a result; then I add A2 to that, and then A3, and at the end I'll have calculated the sum from A0 to A3. That serial approach works regardless of the operation; but when the operation is associative, you can do better.
Addition is associative, so in this case I can actually do something more intelligent -- I think we talked about that last time, and I'm going to show you some more examples. And often the reduction is followed by a broadcast: here is the end result, who are all the people that need it? I broadcast it out so that everybody has the result. If your operation isn't associative, then you're essentially limited to a serial process, and that's not very good from a performance standpoint.

One of the tricks you can apply for actually getting performance out of your reduction is to go to a tree-based reduction. This might be very obvious: rather than doing A0 plus A1 and then adding A2 to that result, I can add A0 and A1 together, and in parallel I can add A2 and A3, and then I take those two results and add them together. So rather than taking n steps, I can do it in log n steps. This is particularly attractive when only one task needs the result -- in the MPI program where we're doing the integration to calculate pi, one processor needs to print out that value of pi.
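A minimal sketch of that tree-based sum over an array; this is illustrative only and assumes the element count is a power of two. At stride s, element i absorbs element i + s, so there are log2 n rounds, and all the additions within one round are independent:

```c
#include <stdio.h>

#define N 8                       /* a power of two keeps the tree balanced */

int main(void) {
    double a[N] = {1, 2, 3, 4, 5, 6, 7, 8};

    /* Tree-based reduction: log2(N) rounds.  Within one round the pairwise
       additions are independent, so they could run in parallel (one task
       per pair); across rounds there is a dependence. */
    for (int stride = 1; stride < N; stride *= 2) {
        #pragma omp parallel for              /* each pair is independent */
        for (int i = 0; i < N; i += 2 * stride)
            a[i] += a[i + stride];
    }

    printf("sum = %f\n", a[0]);               /* 36.0 */
    return 0;
}
```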
757 00:35:19,740 --> 00:35:23,340 And now we can communicate data from these two processors 758 00:35:23,340 --> 00:35:25,610 here to come up with the end -- 759 00:35:30,150 --> 00:35:31,630 PROFESSOR: It was there. 760 00:35:31,630 --> 00:35:32,720 All right. 761 00:35:32,720 --> 00:35:33,420 Must have been lost in the animation. 762 00:35:33,420 --> 00:35:37,000 So you actually do that the other way as well so that you 763 00:35:37,000 --> 00:35:42,050 have the sum A0 to A3 on all the different processors. 764 00:35:42,050 --> 00:35:45,350 Sorry about the lost animation. 765 00:35:45,350 --> 00:35:45,620 OK. 766 00:35:45,620 --> 00:35:47,870 So this is better than the tree-based approach with a 767 00:35:47,870 --> 00:35:51,070 broadcast because you end up with local 768 00:35:51,070 --> 00:35:53,410 results of your reduction. 769 00:35:53,410 --> 00:35:58,780 And rather than doing the broadcast following the 770 00:35:58,780 --> 00:36:01,900 tree-based reduction, which takes 2 log n steps in all, you end up with 771 00:36:01,900 --> 00:36:03,000 order log n. 772 00:36:03,000 --> 00:36:06,450 Everybody has a result in order log n versus an order 2 log n 773 00:36:06,450 --> 00:36:10,590 process for the tree-based plus broadcast. 774 00:36:10,590 --> 00:36:13,200 AUDIENCE: On the Cell processor but not in general. 775 00:36:13,200 --> 00:36:15,250 PROFESSOR: Not in general. 776 00:36:15,250 --> 00:36:17,900 It depends on sort of the architectural mechanism that 777 00:36:17,900 --> 00:36:20,860 you have for your network. 778 00:36:20,860 --> 00:36:23,220 If you actually do need to sort of, you know, if you have 779 00:36:23,220 --> 00:36:26,205 a broadcast mechanism, say a bus-based architecture where 780 00:36:26,205 --> 00:36:28,970 you can deposit a local value and everybody can pull that value, 781 00:36:28,970 --> 00:36:31,410 then, yeah, it can be more efficient. 782 00:36:31,410 --> 00:36:33,960 Or on optical networks, you can broadcast the data and 783 00:36:33,960 --> 00:36:35,480 everybody can just pick it up. 784 00:36:38,910 --> 00:36:40,340 OK. 785 00:36:40,340 --> 00:36:44,860 So summarizing all the different patterns, so here 786 00:36:44,860 --> 00:36:47,490 these are the actual mechanisms that you would use 787 00:36:47,490 --> 00:36:50,800 for how you would implement the different patterns. 788 00:36:50,800 --> 00:36:53,130 So in the SPMD you would write the same program. 789 00:36:53,130 --> 00:36:56,700 In loop parallelism you have your program and you might 790 00:36:56,700 --> 00:36:59,380 annotate sort of some pragmas that tell you how to 791 00:36:59,380 --> 00:37:01,200 parallelize your computation. 792 00:37:01,200 --> 00:37:04,740 In the master/worker model you might have sort of a master 793 00:37:04,740 --> 00:37:07,690 that's going to create threads and you actually know -- 794 00:37:07,690 --> 00:37:10,800 you might sort of have a very good idea of what is the kind 795 00:37:10,800 --> 00:37:12,780 of work you're going to have to do in each thread. 796 00:37:12,780 --> 00:37:16,340 In the fork/join model you have more dynamism. 797 00:37:16,340 --> 00:37:19,280 So you might create threads on the fly. 798 00:37:19,280 --> 00:37:25,890 And you apply these sort of based on appeal or what is 799 00:37:25,890 --> 00:37:28,490 more suited in terms of implementation to each of the 800 00:37:28,490 --> 00:37:31,190 different patterns for how you actually organize your data.
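[Before walking through that table, a quick aside to tie the reduction discussion back to code: in MPI the two flavors we just looked at correspond to the library collectives MPI_Reduce, which leaves the result on one root rank like the tree-based reduction, and MPI_Allreduce, which leaves the result on every rank, the way recursive doubling does. This is only a minimal sketch; the local values are made up for illustration, and MPI implementations are free to pick whatever internal algorithm they like.]

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Pretend each rank holds one element of the array. */
    double local = (double)rank;
    double sum_on_root, sum_everywhere;

    /* Tree-style reduction: only rank 0 ends up with the total. */
    MPI_Reduce(&local, &sum_on_root, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    /* Reduction where every rank ends up with the total; this is the
       collective that is often implemented with recursive doubling. */
    MPI_Allreduce(&local, &sum_everywhere, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum on root = %f\n", sum_on_root);
    printf("rank %d sees sum = %f\n", rank, sum_everywhere);

    MPI_Finalize();
    return 0;
}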
801 00:37:31,190 --> 00:37:34,020 So in the task parallelism model, this is where you have 802 00:37:34,020 --> 00:37:37,410 a world of threads that you know you're going to calculate 803 00:37:37,410 --> 00:37:39,290 or that you're going to use for your computation. 804 00:37:39,290 --> 00:37:43,610 And really you can use largely any one of these models. 805 00:37:43,610 --> 00:37:45,340 So I used a ranking system where four 806 00:37:45,340 --> 00:37:46,340 stars is really good. 807 00:37:46,340 --> 00:37:51,192 One star is sort of bad or no star means not well suited. 808 00:37:51,192 --> 00:37:53,320 AUDIENCE: Sort of in Cell, because of the 809 00:37:53,320 --> 00:37:55,220 inherent master there. 810 00:37:55,220 --> 00:37:58,110 Sometimes master/worker might get a little bit of a bias 811 00:37:58,110 --> 00:37:59,910 over this one. 812 00:37:59,910 --> 00:38:01,210 PROFESSOR: Right, so -- 813 00:38:01,210 --> 00:38:04,600 AUDIENCE: You don't have to pay a cost of having a master. 814 00:38:04,600 --> 00:38:04,950 PROFESSOR: Right. 815 00:38:04,950 --> 00:38:06,050 Right. 816 00:38:06,050 --> 00:38:10,230 Although you could use the Cell master to do regular 817 00:38:10,230 --> 00:38:11,660 computations as well. 818 00:38:11,660 --> 00:38:14,390 But, yes. 819 00:38:14,390 --> 00:38:18,700 So the divide and conquer model, you know, might be 820 00:38:18,700 --> 00:38:20,930 especially well suited for a fork and join because you're 821 00:38:20,930 --> 00:38:23,580 creating all these recursive subproblems. They might be 822 00:38:23,580 --> 00:38:24,080 heterogeneous. 823 00:38:24,080 --> 00:38:26,160 Depending on the nature of the computation that you do, you 824 00:38:26,160 --> 00:38:28,850 might have more subproblems created dynamically. 825 00:38:28,850 --> 00:38:30,300 Fork/join really works well for that. 826 00:38:30,300 --> 00:38:33,220 And in fact, you know, the subproblem structure that I 827 00:38:33,220 --> 00:38:35,400 showed, the graph of sort of division 828 00:38:35,400 --> 00:38:40,260 and then merging, works really well with the fork/join model. 829 00:38:40,260 --> 00:38:45,260 In the recursive, in the geometric decomposition -- 830 00:38:45,260 --> 00:38:48,420 this is essentially your lab one exercise and the things we 831 00:38:48,420 --> 00:38:50,330 went over yesterday in the recitation. 832 00:38:50,330 --> 00:38:54,860 You're taking data and you're partitioning it over multiple 833 00:38:54,860 --> 00:38:56,960 processors to actually compute in parallel. 834 00:38:56,960 --> 00:39:00,510 So this could be an SPMD implementation or it could be 835 00:39:00,510 --> 00:39:02,570 a loop parallelism implementation, 836 00:39:02,570 --> 00:39:04,880 which we didn't do. 837 00:39:04,880 --> 00:39:07,280 Less suitable are the master/worker and fork/join, 838 00:39:07,280 --> 00:39:10,960 often because the geometric decomposition applies some 839 00:39:10,960 --> 00:39:13,510 distribution to the data which has static properties that you 840 00:39:13,510 --> 00:39:15,640 can exploit in various ways. 841 00:39:15,640 --> 00:39:17,420 So you don't need to pay the overhead of 842 00:39:17,420 --> 00:39:21,320 master/worker or fork/join. 843 00:39:21,320 --> 00:39:25,670 Recursive data structures sort of have very specific models 844 00:39:25,670 --> 00:39:27,320 that you can run with. 845 00:39:27,320 --> 00:39:32,950 Largely master/worker is a decent implementation choice. 846 00:39:32,950 --> 00:39:36,180 SPMD is another.
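[As a loose illustration of the loop parallelism mechanism from that table -- this is not the lab one code, just a sketch of the pragma style with made-up names -- a geometric decomposition of a simple array update can be expressed by annotating the loop and letting the runtime split the iterations across processors:]

#include <omp.h>

/* Hypothetical relaxation step over an array of n cells.
   The pragma tells the compiler to split the iteration space across
   threads; with the usual static schedule each thread typically gets
   a contiguous chunk of the data, which is the geometric decomposition. */
void relax(float *next, const float *curr, int n)
{
    #pragma omp parallel for
    for (int i = 1; i < n - 1; i++)
        next[i] = 0.5f * (curr[i - 1] + curr[i + 1]);
}

[This pragma style doesn't map directly onto the SPEs, so treat it as an illustration of the mechanism rather than something you'd drop into the Cell labs as is.]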
847 00:39:36,180 --> 00:39:38,220 And you're going to hear more about sort of the pipeline 848 00:39:38,220 --> 00:39:40,620 mechanism in the next talk so I'm not going to talk about 849 00:39:40,620 --> 00:39:42,120 that very much. 850 00:39:42,120 --> 00:39:44,710 Event-based coordination, largely dynamic. 851 00:39:44,710 --> 00:39:47,090 So fork/join works really well. 852 00:39:47,090 --> 00:39:48,780 So one -- 853 00:39:48,780 --> 00:39:50,870 AUDIENCE: When you're buffering them you could do 854 00:39:50,870 --> 00:39:52,570 master/worker with pipelining? 855 00:39:52,570 --> 00:39:54,960 PROFESSOR: Yes, so next slide. 856 00:39:54,960 --> 00:39:58,860 So sort of these choices or these tradeoffs aren't really 857 00:39:58,860 --> 00:39:59,570 orthogonal. 858 00:39:59,570 --> 00:40:01,670 You can actually combine them in different ways. 859 00:40:01,670 --> 00:40:05,430 And in a lot of applications what you might find is that 860 00:40:05,430 --> 00:40:08,710 the different patterns compose hierarchically. 861 00:40:08,710 --> 00:40:12,540 And you actually want that in various ways -- 862 00:40:12,540 --> 00:40:13,610 for various reasons. 863 00:40:13,610 --> 00:40:17,530 So in the MPEG example, you know, we had tasks, and within 864 00:40:17,530 --> 00:40:21,720 each task we identified some pipeline stages. 865 00:40:21,720 --> 00:40:23,680 You know, here I have some data parallelism so I can 866 00:40:23,680 --> 00:40:27,590 apply the loop pattern here. 867 00:40:27,590 --> 00:40:30,400 And what I want to do is actually in my computation 868 00:40:30,400 --> 00:40:32,870 sort of express these different mechanisms so I can 869 00:40:32,870 --> 00:40:34,910 understand sort of different tradeoffs. 870 00:40:34,910 --> 00:40:37,670 And for really large applications, there might be 871 00:40:37,670 --> 00:40:40,630 different patterns that are well suited for the actual 872 00:40:40,630 --> 00:40:41,910 computation that I'm doing. 873 00:40:41,910 --> 00:40:45,860 So I can combine things like pipelining with a task-based 874 00:40:45,860 --> 00:40:49,270 mechanism or data parallelism to actually get really good 875 00:40:49,270 --> 00:40:50,660 performance speedups. 876 00:40:50,660 --> 00:40:53,850 And one of the things that might strike you is, well, 877 00:40:53,850 --> 00:40:55,740 heck, this is a whole lot of work that I have to do to 878 00:40:55,740 --> 00:40:58,810 actually get my code in the right shape so that I can 879 00:40:58,810 --> 00:41:01,050 actually take advantage of my parallel architecture. 880 00:41:01,050 --> 00:41:03,020 You know, I have to conceptually think about the 881 00:41:03,020 --> 00:41:04,250 question the right way. 882 00:41:04,250 --> 00:41:07,270 I have to maybe restructure my computation in different ways 883 00:41:07,270 --> 00:41:09,790 to actually exploit parallelism. 884 00:41:09,790 --> 00:41:11,260 Data distribution is really hard. 885 00:41:11,260 --> 00:41:12,810 I have to get that right. 886 00:41:12,810 --> 00:41:14,970 Synchronization issues might be a problem. 887 00:41:14,970 --> 00:41:16,330 And how much buffering do I need to do 888 00:41:16,330 --> 00:41:18,020 between different tasks?
889 00:41:18,020 --> 00:41:20,210 So the thing you're going to hear about in the next talk 890 00:41:20,210 --> 00:41:22,820 is, well, what if these things really fall out naturally from 891 00:41:22,820 --> 00:41:27,800 the way you actually write the program, and if the way you 892 00:41:27,800 --> 00:41:30,360 actually write your program matches really well with the 893 00:41:30,360 --> 00:41:32,530 intuitive, sort of natural 894 00:41:32,530 --> 00:41:35,000 conceptualization of the problem. 895 00:41:35,000 --> 00:41:37,410 And so I'll leave Bill to talk about that. 896 00:41:37,410 --> 00:41:40,190 And I'm going to stop here. 897 00:41:40,190 --> 00:41:41,883 Any questions? 898 00:41:41,883 --> 00:41:43,099 AUDIENCE: We can take in some questions and 899 00:41:43,099 --> 00:41:44,349 then everybody -- 900 00:41:50,650 --> 00:41:52,460 AUDIENCE: You talked about fork and join. 901 00:41:52,460 --> 00:41:57,735 When you have a parent thread that spawns off a child 902 00:41:57,735 --> 00:42:01,403 thread, how do you keep your parent thread from 903 00:42:01,403 --> 00:42:03,970 using up the SPE? 904 00:42:03,970 --> 00:42:07,290 PROFESSOR: So you have a fork/join where you have -- 905 00:42:07,290 --> 00:42:11,780 AUDIENCE: In most cases the parent might be the PPE. 906 00:42:11,780 --> 00:42:17,430 And so if you just do fork/join, you might not really 907 00:42:17,430 --> 00:42:20,220 use the PPE unless, you know, you have some time and 908 00:42:20,220 --> 00:42:22,670 you let it do some of the task and come back. 909 00:42:22,670 --> 00:42:26,859 AUDIENCE: So for our purposes we shouldn't spawn off new 910 00:42:26,859 --> 00:42:29,110 threads from the SPEs? 911 00:42:29,110 --> 00:42:29,810 PROFESSOR: So, yeah. 912 00:42:29,810 --> 00:42:32,390 So most of the threads that are spawned off 913 00:42:32,390 --> 00:42:34,160 are done by the PPE. 914 00:42:34,160 --> 00:42:35,560 So you have these -- 915 00:42:35,560 --> 00:42:39,200 in fact there was a good walkthrough in recitation yesterday. 916 00:42:39,200 --> 00:42:39,950 You have the PPE. 917 00:42:39,950 --> 00:42:43,065 Essentially it sends messages to the SPEs that say, create 918 00:42:43,065 --> 00:42:45,190 these threads and start running them. 919 00:42:45,190 --> 00:42:46,400 Here's the data for them. 920 00:42:46,400 --> 00:42:48,950 And then these threads run on the SPEs. 921 00:42:48,950 --> 00:42:50,680 And they just do local computation. 922 00:42:50,680 --> 00:42:53,570 And then they send messages back to the PPE that say, 923 00:42:53,570 --> 00:42:54,050 we're done. 924 00:42:54,050 --> 00:42:56,422 So that essentially implements the join mechanism. 925 00:42:56,422 --> 00:42:58,390 AUDIENCE: On the other hand, if you are doing something 926 00:42:58,390 --> 00:43:05,430 like the master/slave way, then the SPE can send a 927 00:43:05,430 --> 00:43:09,690 message and deliver another job to the PPE, 928 00:43:09,690 --> 00:43:11,830 which feeds the master. 929 00:43:11,830 --> 00:43:13,930 If an SPE sees there's some more computation, it can say, OK, 930 00:43:13,930 --> 00:43:16,360 look, put this into your queue, and keep sending 931 00:43:16,360 --> 00:43:19,740 messages, and so the master can look at that and update it. 932 00:43:19,740 --> 00:43:22,935 So, you know, it's not only the master who has to fork off but 933 00:43:22,935 --> 00:43:24,670 the slaves also. 934 00:43:24,670 --> 00:43:28,470 They still can send information back.
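[Just to pin down the shape of that exchange, here is a minimal generic fork/join sketch in C with pthreads -- not the actual Cell SDK calls, just the pattern the PPE-side code follows: the parent forks worker threads, each does its local computation, and the join is the parent waiting for each one to report back that it's done. The worker count and names are made up for illustration.]

#include <pthread.h>
#include <stdio.h>

#define NWORKERS 6   /* e.g., one worker per available SPE */

/* Each worker does its local computation on its own chunk of data. */
static void *worker(void *arg)
{
    int id = *(int *)arg;
    /* ... the worker's local computation would go here ... */
    printf("worker %d done\n", id);
    return NULL;
}

int main(void)
{
    pthread_t threads[NWORKERS];
    int ids[NWORKERS];

    /* Fork: the parent (think PPE) creates the worker threads. */
    for (int i = 0; i < NWORKERS; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, worker, &ids[i]);
    }

    /* Join: wait for each worker to report that it has finished. */
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(threads[i], NULL);

    printf("all workers joined\n");
    return 0;
}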
935 00:43:28,470 --> 00:43:34,180 So you can think about something like very 936 00:43:34,180 --> 00:43:36,250 confident that way. 937 00:43:36,250 --> 00:43:43,340 There are eight -- like if six SPEs are running and you first 938 00:43:43,340 --> 00:43:47,440 get something in there, an SPE that wants to divide it will take one 939 00:43:47,440 --> 00:43:50,066 task and run that, and hand the other one to the master: 940 00:43:50,066 --> 00:43:53,070 finish it, and here's my ID. 941 00:43:53,070 --> 00:43:55,560 Send me the message when it's done. 942 00:43:55,560 --> 00:43:56,510 And so you fork that and wait. 943 00:43:56,510 --> 00:43:59,740 So you can do something like that. 944 00:43:59,740 --> 00:44:05,210 So it's almost master/slave but the coordination is there. 945 00:44:05,210 --> 00:44:11,215 The trouble normally with fork/join is if you create too 946 00:44:11,215 --> 00:44:12,420 many threads. 947 00:44:12,420 --> 00:44:14,190 You are in like a thread hell because there are too many 948 00:44:14,190 --> 00:44:15,640 things to run. 949 00:44:15,640 --> 00:44:17,760 I don't know, can you swap threads on the SPE? 950 00:44:17,760 --> 00:44:19,650 PROFESSOR: No. 951 00:44:19,650 --> 00:44:22,035 AUDIENCE: So you can't even do that because of some physical 952 00:44:22,035 --> 00:44:22,300 limitation. 953 00:44:22,300 --> 00:44:28,410 You can't take up 1,000 threads and run another 954 00:44:28,410 --> 00:44:34,180 master/slave thing yourself, because that's 1,000 threads 955 00:44:34,180 --> 00:44:36,900 on top of your SPEs. 956 00:44:36,900 --> 00:44:39,820 And those are going to be blocked threads. 957 00:44:39,820 --> 00:44:40,310 PROFESSOR: Yeah. 958 00:44:40,310 --> 00:44:43,420 Context switching on the SPEs is very expensive. 959 00:44:43,420 --> 00:44:47,980 So on the PlayStation 3 you have six SPEs 960 00:44:47,980 --> 00:44:48,810 available to you. 961 00:44:48,810 --> 00:44:52,020 So if you have a lot more than six threads that you've 962 00:44:52,020 --> 00:44:55,190 created, essentially each one runs to completion. 963 00:44:55,190 --> 00:44:58,210 And then you swap that out and you bring in -- well, that 964 00:44:58,210 --> 00:44:59,090 terminates. 965 00:44:59,090 --> 00:45:02,200 You deallocate it from the SPE and you bring in a new thread. 966 00:45:02,200 --> 00:45:05,770 If you actually want to do more thread-like dynamic load 967 00:45:05,770 --> 00:45:08,720 balancing on the SPEs, it's not well suited for that. 968 00:45:08,720 --> 00:45:09,720 Just because the -- 969 00:45:09,720 --> 00:45:13,466 AUDIENCE: The best model there is master/slave. Because the 970 00:45:13,466 --> 00:45:14,990 PPE [UNINTELLIGIBLE PHRASE] 971 00:45:14,990 --> 00:45:15,370 the master part. 972 00:45:15,370 --> 00:45:18,080 It will run more sequential code. 973 00:45:18,080 --> 00:45:22,670 And when there's parallel work -- it will hand it out in a 974 00:45:22,670 --> 00:45:25,820 work queue model type and send stuff into the SPEs 975 00:45:25,820 --> 00:45:27,290 and feed them. 976 00:45:27,290 --> 00:45:30,750 So work queue type models can be used there. 977 00:45:30,750 --> 00:45:31,140 PROFESSOR: Yeah. 978 00:45:31,140 --> 00:45:33,820 And the SPMD model might not work really well because you 979 00:45:33,820 --> 00:45:37,030 have this heterogeneity in the actual hardware, right.
980 00:45:37,030 --> 00:45:39,530 So if I'm taking the same program running on the SPE 981 00:45:39,530 --> 00:45:43,360 versus the PPE, that code might not be -- so I 982 00:45:43,360 --> 00:45:45,010 essentially have to specialize the code. 983 00:45:45,010 --> 00:45:47,670 And that starts to deviate away from the SPMD model. 984 00:45:47,670 --> 00:45:51,565 AUDIENCE: I think most of the code you write for 985 00:45:51,565 --> 00:45:55,020 Cell will probably be master/worker. 986 00:45:55,020 --> 00:45:56,995 And if you try to do something other than that, you should think 987 00:45:56,995 --> 00:45:59,556 hard about why that's the case. 988 00:46:02,690 --> 00:46:04,140 PROFESSOR: You can do fork/join but 989 00:46:04,140 --> 00:46:05,120 you know, it's -- 990 00:46:05,120 --> 00:46:07,490 AUDIENCE: I mean you can't -- 991 00:46:07,490 --> 00:46:09,410 because you don't have virtualization. 992 00:46:09,410 --> 00:46:11,938 If you fork too much, where are you going to put those threads? 993 00:46:11,938 --> 00:46:12,404 PROFESSOR: Right. 994 00:46:12,404 --> 00:46:14,790 Sometimes you fork -- 995 00:46:14,790 --> 00:46:15,685 AUDIENCE: Yeah, but in that sense you shouldn't 996 00:46:15,685 --> 00:46:17,220 fork too much. 997 00:46:17,220 --> 00:46:19,330 To keep the work going you want the master. 998 00:46:19,330 --> 00:46:20,590 You can fork things -- 999 00:46:20,590 --> 00:46:22,770 you can do a virtual fork and send the work to the master 1000 00:46:22,770 --> 00:46:24,070 and say, here, I forked something. 1001 00:46:24,070 --> 00:46:26,350 Here's the work. 1002 00:46:26,350 --> 00:46:30,370 I mean, the key thing is do the simplest thing. 1003 00:46:30,370 --> 00:46:33,320 I mean, you guys have two weeks left. 1004 00:46:33,320 --> 00:46:36,590 And if you try doing anything complicated, you might end up 1005 00:46:36,590 --> 00:46:38,730 with a big mess that's undebuggable. 1006 00:46:38,730 --> 00:46:39,780 Just do simple things. 1007 00:46:39,780 --> 00:46:45,400 And I can vouch, parallelism is hard. 1008 00:46:45,400 --> 00:46:47,600 Debugging parallel code is even harder. 1009 00:46:47,600 --> 00:46:51,300 So you're sort of trying to push the limits on the 1010 00:46:51,300 --> 00:46:55,345 complexity of messages going all over the place and the 1011 00:46:55,345 --> 00:46:57,615 three different types of parallelism all trying to 1012 00:46:57,615 --> 00:46:59,040 compete in there. 1013 00:46:59,040 --> 00:47:00,935 Just do the simple thing. 1014 00:47:00,935 --> 00:47:01,882 Just get the simple thing working. 1015 00:47:01,882 --> 00:47:05,005 First get the sequential code working and keep adding more 1016 00:47:05,005 --> 00:47:05,650 and more to it. 1017 00:47:05,650 --> 00:47:08,430 And then make sure that at each level it works. 1018 00:47:08,430 --> 00:47:10,350 The problem with parallelism is that things are 1019 00:47:10,350 --> 00:47:12,020 nondeterministic, so some bugs might only show up sometimes. 1020 00:47:12,020 --> 00:47:13,950 That might be hard. 1021 00:47:13,950 --> 00:47:17,770 But design absolutely matters. 1022 00:47:17,770 --> 00:47:21,000 Another thing that I think would be nice, especially for doing demos and 1023 00:47:21,000 --> 00:47:24,570 stuff, would be to have a knob that basically 1024 00:47:24,570 --> 00:47:25,510 you can tune. 1025 00:47:25,510 --> 00:47:26,180 So you can say, OK, no SPEs, 1026 00:47:26,180 --> 00:47:29,624 everything running on the PPE; or one SPE; or two; and so on.
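[In that spirit, here is a hedged sketch of that knob in C with pthreads -- a tiny master/worker work queue where the number of workers is just a command-line parameter, so the same code runs with zero workers (the master does everything), one, two, and so on. All the names and the toy tasks are made up for illustration; on Cell the workers would be SPE threads rather than pthreads.]

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NTASKS 64
#define MAX_WORKERS 6

/* A trivial shared work queue: the tasks are just the indices
   0..NTASKS-1, and next_task is the head of the queue, protected
   by a mutex. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int next_task = 0;
static double results[NTASKS];

static void do_task(int t)
{
    results[t] = (double)t * t;   /* stand-in for the real work */
}

/* Each worker keeps pulling tasks off the queue until it's empty. */
static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        int t = (next_task < NTASKS) ? next_task++ : -1;
        pthread_mutex_unlock(&lock);
        if (t < 0)
            break;
        do_task(t);
    }
    return NULL;
}

int main(int argc, char **argv)
{
    /* The knob: number of workers; 0 means the master does everything. */
    int nworkers = (argc > 1) ? atoi(argv[1]) : 0;
    if (nworkers > MAX_WORKERS)
        nworkers = MAX_WORKERS;

    pthread_t threads[MAX_WORKERS];
    for (int i = 0; i < nworkers; i++)
        pthread_create(&threads[i], NULL, worker, NULL);

    /* The master can also chew on the queue while it waits. */
    worker(NULL);

    for (int i = 0; i < nworkers; i++)
        pthread_join(threads[i], NULL);

    printf("ran %d tasks with %d workers\n", NTASKS, nworkers);
    return 0;
}

[Because the same code path runs for every setting of the knob, you can compare timings directly when you demo.]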
1027 00:47:29,624 --> 00:47:32,410 So you can actually see, hopefully, in your code how 1028 00:47:32,410 --> 00:47:36,693 things move for the demo part. 1029 00:47:43,200 --> 00:47:43,770 PROFESSOR: You had a question? 1030 00:47:43,770 --> 00:47:46,550 All right. 1031 00:47:46,550 --> 00:47:48,130 We'll take a brief break and do the 1032 00:47:48,130 --> 00:47:49,380 quizzes in the meantime.