The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: So in this second lecture we're going to talk about some design patterns for parallel programming, and to tell you a little bit about what a design pattern is and why it is useful. Some of you, if you've taken object oriented programming, have probably already seen design patterns before.

I ended the last lecture with: OK, so I understand some of the performance implications, how do I go about parallelizing my program? This is a figure I found quite often in books and talks on parallel programming. It essentially lays out four common steps for parallelizing your program.

So often, you start out with a sequential program. This shouldn't be surprising, since for a long time, as you've heard in earlier lectures, people just wrote sequential code and that was good enough. Now the problem is you want to take that sequential code, or you want to keep writing sequential code because it's conceptually easier, and you want to be able to parallelize it so you can map it down to your parallel architecture, which in this example has four processors.

So the first step is you take your sequential program and you divide it up into tasks. During the project reviews yesterday, for example, when I talked to each team individually, we talked about this, and you stumbled onto these four steps whether you realized it or not. So you come up with these tasks, and each one essentially encapsulates some computation. Then you group them together -- this is a granularity adjustment -- and you map them down to processes; these are things you can compose into threads, for example.
And then you have to essentially map these down onto actual processors, and they have to talk with each other, so you have to orchestrate the communication, and then finally do the execution.

So let's step through each one of these at a time. Decomposition is really affected by Amdahl's Law: if there's not a whole lot of parallelism in the application, your decomposition is a waste of time; there's not really a whole lot to get. What you're trying to do is identify concurrency in your application and figure out at what level to exploit it. So you're trying to divide up your computation into tasks -- eventually these are going to be distributed among processors -- and you want to find enough of them so that you can keep all the processors busy. And remember that the number of these that you have gives you sort of an upper bound on your potential speedup.

And as in the ray tracing example that I showed, the number of tasks that you have may vary at run time. Sometimes you might have a lot of rays bouncing off a lot of things, and sometimes you might not have a whole lot of reflection going on, so the number of rays will change over time. In other applications, the interactions, for example between molecules, might change in a molecular dynamics simulator.

The assignment really affects granularity. This is where, having partitioned your tasks, you're trying to group them together while taking into account: what is the communication cost going to be? What kind of locality am I going to deal with? What kind of synchronization mechanisms do I need, and how often do I need to synchronize? You adjust your granularity so that you end up with things that are load balanced, and you try to reduce communication as much as possible. And structured approaches might work well here. You might look at the code, do some inspection, you might understand the application, but there are some well-known design patterns -- which is essentially the thing we're going to get to -- that try to help you with this.
As programmers, really, I think we worry about partitioning first. This is really independent of the architecture or programming model: just taking my application and figuring out, well, what are the different parts that I need to compose together to build my application? I'm going to show you an example of that. And one thing to keep in the back of your mind is that the complexity of how much partitioning work you actually have to do really affects your decision. If you start out with some piece of code, or you wrote your code one way, and you realize that to actually parallelize it requires a lot more work -- in some user studies we've done on trying to get performance from code, that really affects how much work you actually do. If something requires a lot of work, you might not do it even though it might have a really high payoff. So you want to keep complexity down, and it pays off to think well about your algorithm and how you structure it ahead of time.

And finally, the last two stages I've lumped together: orchestration and mapping. I have my tasks, they need to communicate, so what kind of computation primitives do I need? What kind of communication primitives do I need? Am I packaging things up into threads, and are they talking together over DMAs or shared memory? What you want to do is try to preserve locality and then figure out how to come up with a scheduling order that preserves the overall dependences of the computation.

Parallel programming by patterns is meant to essentially give you a cookbook, or set of recipes, you can follow to help you with the different steps: decompose, assign, orchestrate, and map. This can lead to really high quality solutions in some domains. In scientific computation there are a lot of problems that are well understood and well studied, and some of the frequently occurring things have been abstracted out and recorded in patterns.
And there's another purpose to patterns too, in that they provide you with a vocabulary: two programmers can talk to each other and use the right terminology, and that conveys a whole lot of information without having to actually go through and understand all the details. You instantaneously know what I mean if I use a particular pattern name. It can also help with software reusability, malleability, and modularity -- all of those things that are important from a software engineering perspective.

So, a brief history, which I found in some of the talks that I was researching. There's a book by Christopher Alexander from Berkeley in 1977 that actually looked at classifying patterns, or really listing patterns, from an architectural perspective. He tried to look at what patterns occur in designs of living spaces and record those. As an example, there's a six foot balcony pattern: if you're going to build a balcony, you should build it six feet deep and you should have it slightly recessed, and so on, because this is what's commonly used and these are the kinds of balconies that have good properties architecturally. Now, I don't know whether this book actually had a whole lot of impact on how people designed buildings -- certainly not, probably, for the Stata Center -- but some patterns from object oriented programming, which I think many of you have already seen, by the Gang of Four in 1995, really organized and classified and captured different ways of programming that people had been using. Things like the visitor pattern, for example, which some of you might know.

Then in 2005, not too long ago, there was a new book, which I'm using to create some of these slides, that really recorded patterns for parallel programming. And they identified four design spaces. These are structured to express or capture different elements.
Some elements are for algorithm expression -- I've listed those here -- and some are for the actual software construction, the actual implementation. Under algorithm expression it's really the business of decomposition: finding concurrency. Where are my tasks? In the algorithm structure space, you might need some way of packaging those tasks together so that they can talk to each other and make use of the parallel architecture. On the software construction side you're dealing with slightly lower-level details: what are some things you might need at a slightly lower level of implementation to actually get all the computation that's expressed at the algorithm level to work and run well? I'm going to talk about the latter part in the next lecture, and I'll cover much of the algorithm expression here -- at least the finding-concurrency part in this talk. If there's time I'll do algorithm structure; otherwise we'll just talk about it next time.

So let's say you're working with MPEG decoding. This is a pipeline picture of an MPEG-2 decoder, or rather a block-level diagram of an MPEG-2 decoder. You have this algorithm and you say, OK, I want to parallelize this. Where's my parallelism? Where's my concurrency? In MPEG-2 you have some bit stream, you do some decoding on it, and you end up with two things. You end up with motion vectors that tell you, here's somebody's head, and in the next scene it's moved to this particular location. So that's captured by the motion vectors; this recovers temporal information. Over here you recover spatial information. So in somebody's head you might have discovered some redundancies, and that redundancy was eliminated, so you essentially need to uncompress, or undo, that compression. So you go through some stages, and then you combine the two together -- combine the motion estimation and the recovered pictures -- to reconstruct the image, and then you might do some additional stages.
This particular stage here is indicated to be data parallel, in that I can do different scenes, for example, in parallel, or I might be able to do different slices of the picture in parallel. So I can essentially take advantage of data parallelism in the sense of taking a loop and breaking it up, as I showed in lecture 5.

In task decomposition, what we're looking for is really independent coarse-grain computation, and these often are inherent to the algorithm. Here I've outlined them in yellow: this is one particular task. I can have one thread of execution doing all the spatial decoding, and I can have another thread decoding all my motion vectors. In general, you're looking for sequences of statements that operate together as a group. These could be loops or they could be functions. Usually you want these to just fall out of your algorithm as it's expressed, and in a lot of cases they do, so depending on how you think about the program you might be able to find these more quickly or easily.

Data decomposition, which I've highlighted here, essentially says you have the same computation applied to lots of small data elements. You can take your large data set, partition it into smaller chunks, and do the computation over and over in parallel, so that allows you to get that kind of data parallelism across the data space.

And finally, I'm going to make a case for pipeline parallelism, which essentially says, well, I can recognize that I have a lot of stages in my computation, and it does help to have this kind of decomposition, just because you're familiar with pipelining concepts from other domains. This type of producer-consumer chain is actually beneficial, so it does help to expose these kinds of relationships.
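Here is a minimal sketch of that kind of coarse-grained task decomposition, one thread per task. The functions decode_motion_vectors() and decode_spatial() are hypothetical placeholders standing in for the two halves of the decoder front end, not a real MPEG-2 API; the point is only the structure: two independent tasks run concurrently, and their results are combined only after both finish.

```c
#include <pthread.h>
#include <stdio.h>

/* Placeholder tasks; a real decoder would do the actual work here. */
static void decode_motion_vectors(void) { printf("decoding motion vectors\n"); }
static void decode_spatial(void)        { printf("decoding spatial data\n"); }

static void *motion_task(void *arg)  { (void)arg; decode_motion_vectors(); return NULL; }
static void *spatial_task(void *arg) { (void)arg; decode_spatial();        return NULL; }

int main(void)
{
    pthread_t t1, t2;

    /* Task 1: recover temporal information (motion vectors). */
    pthread_create(&t1, NULL, motion_task, NULL);
    /* Task 2: recover spatial information (undo the compression). */
    pthread_create(&t2, NULL, spatial_task, NULL);

    /* Only after both coarse-grained tasks are done can the frame be
     * reconstructed by combining their results. */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```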
So what are some guidelines for actually coming up with your task decomposition? Where do you start? You have your algorithm, you understand the problem really well, you're writing some code, and the hope is that, as I've pointed out, you can look for natural code regions that encapsulate your computation. Function calls and distinct loop iterations are pretty good places to start looking.

And as a general rule, it's easier to start with as many tasks as possible and then fuse them to make them more coarse-grained than to go the other way around. It impacts your software engineering decisions, it impacts your software implementation, it impacts how you encapsulate things in the low-level details of the implementation. So it's always easier to fuse than to fission.

And you want to keep three things in mind: flexibility, efficiency, and simplicity. Flexibility says, if you've made some decisions, is that going to scale well, or is it going to allow you to make changes later? You might want parameterized tasks rather than fixed tasks, for example. The loops that I showed in the previous talk, each loop that I parallelized had a hard-coded number that said you're going to do four iterations. That may or may not work well: I can't reuse that code if I want to use that kind of data decomposition and work sharing on a longer loop, with a longer array, where I want each thread to do more work. So you might want to parameterize more things in your tasks.

Efficiency means keeping in mind that each of these tasks will eventually have to talk with other tasks. There are communication and synchronization costs that have to be taken into account, so you want these tasks to amortize the communication costs, or other overheads, over the computation. And you want to keep in mind that there are going to be dependencies between these tasks, and you don't want those dependencies to get out of hand. So you want to keep things under control.
And lastly, which is probably as important as the other two: simplicity. If you start decomposing your code into different chunks and you can't understand your code in the end, it doesn't help you from a debugging perspective, and it doesn't help you from a software engineering perspective, in being able to reuse your code or having other people understand it.

Guidelines for data decomposition are sort of similar. You essentially have to do both task and data decomposition to complete the process, and often your task decomposition dictates your data partitioning. So if I've split a loop across two different processes, I've essentially implied how data should be distributed between those two threads. Data decomposition, as opposed to task decomposition, is a good starting point if you're doing the same computation over and over again over really, really large data sets; you can use that as your yardstick to decide whether you do task decomposition first or data decomposition first.

I've just listed two common data decompositions here; I'll talk about more of these later on when we talk about actual performance optimizations. You can decompose arrays, for example, along rows or columns, or you can decompose them into blocks. And you have recursive data structures: a binary tree, for example, you might partition into left and right sub-trees. The thing you're trying to get to is to start with a problem, recursively subdivide it until you get to a manageable part, do the computation, and then figure out a way to do the integration. Merge sort is the classic example that captures this really well.

So again, the three key concepts to keep in mind when you're doing data decomposition: flexibility, efficiency, and simplicity.
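As a sketch of what a flexible, parameterized data decomposition might look like, here is a row-wise partitioning of an array where the number of threads is a run-time parameter rather than a hard-coded 4. The array sizes and the per-row work (scale_rows) are made up for illustration.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define ROWS 1000
#define COLS 1000

static double a[ROWS][COLS];

struct chunk { int first_row, last_row; };    /* rows [first_row, last_row) */

static void *scale_rows(void *arg)
{
    struct chunk *c = arg;
    for (int i = c->first_row; i < c->last_row; i++)
        for (int j = 0; j < COLS; j++)
            a[i][j] = a[i][j] * 2.0 + 1.0;    /* same work on every row     */
    return NULL;
}

int main(int argc, char **argv)
{
    /* Number of threads is a parameter, not a hard-coded 4. */
    int nthreads = (argc > 1) ? atoi(argv[1]) : 4;
    if (nthreads < 1) nthreads = 1;

    pthread_t    *tid = malloc(nthreads * sizeof *tid);
    struct chunk *ck  = malloc(nthreads * sizeof *ck);

    for (int t = 0; t < nthreads; t++) {
        ck[t].first_row = t * ROWS / nthreads;        /* row-wise blocks          */
        ck[t].last_row  = (t + 1) * ROWS / nthreads;  /* last block ends at ROWS  */
        pthread_create(&tid[t], NULL, scale_rows, &ck[t]);
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);

    printf("a[0][0] = %f using %d threads\n", a[0][0], nthreads);
    free(tid);
    free(ck);
    return 0;
}
```

Because each thread owns a disjoint block of rows, no synchronization is needed inside the loop, and the same code works whether you have 2, 4, or 16 processors.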
The first two, flexibility and efficiency, are really just meant to suggest that the size of the data chunks you've allocated actually leads to enough work, because you want to amortize the cost of communication or synchronization, but you also want each data chunk to generate about the same amount of work, for load balancing. And simplicity: for the same reason that task decomposition can get out of hand, data decomposition can get out of hand. You don't want data moving around all over the place so that it becomes, again, hard to debug, or manage, or make changes, or track dependencies.

Pipeline parallelism is actually classified somewhere else in the book. I've lifted it up and tried to make a case for it here, because I think it's just good practice to expose producer-consumer relationships in your code. If I have a function that's producing data that's going to be used by another function, as with the spatial decoding, or the different stages of classic ray tracing algorithms, you want to maintain that producer-consumer relationship, that assembly line analogy. What are some prime examples of pipelines in computer architecture? The instruction pipeline in your superscalar processor. But there are other examples of pipelines, things that you might have used in, say, the UNIX shell: you cat a file, pipe it to grep for some word, and then pipe that into word count. So I think it's a natural concept, we use it in many different ways, and it's good to practice it at the software level as well. And there are some computations in specific domains, like signal processing and graphics, where the pipeline model is a really important part of how computation gets carried out -- you have your graphics pipeline, for example, and pipelines in signal processing.

How much time do I have? How am I doing on time? OK, should I stop here?

AUDIENCE: About how much more?

PROFESSOR: 10 slides.
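To make the producer-consumer relationship behind pipeline parallelism concrete, here is a minimal sketch of a two-stage pipeline built on a bounded buffer. The "work" in each stage is a placeholder, and the buffer size and item count are arbitrary; the point is that the two stages overlap in time, just like stages of an instruction pipeline or a shell pipeline.

```c
#include <pthread.h>
#include <stdio.h>

#define BUF_SIZE 8
#define N_ITEMS  32

static int buffer[BUF_SIZE];
static int head, tail, count;                 /* circular-buffer state         */
static pthread_mutex_t lock      = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;

static void *producer(void *arg)
{
    (void)arg;
    for (int i = 0; i < N_ITEMS; i++) {
        int item = i * i;                     /* stage 1: "produce" something  */
        pthread_mutex_lock(&lock);
        while (count == BUF_SIZE)             /* wait if the buffer is full    */
            pthread_cond_wait(&not_full, &lock);
        buffer[tail] = item;
        tail = (tail + 1) % BUF_SIZE;
        count++;
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    for (int i = 0; i < N_ITEMS; i++) {
        pthread_mutex_lock(&lock);
        while (count == 0)                    /* wait if the buffer is empty   */
            pthread_cond_wait(&not_empty, &lock);
        int item = buffer[head];
        head = (head + 1) % BUF_SIZE;
        count--;
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&lock);
        printf("consumed %d\n", item);        /* stage 2: "consume" the item   */
    }
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```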
PROFESSOR: OK, so this is sort of a brief summary, which will lead into a much larger talk at the next lecture on how you actually go about re-engineering your code for parallelism. This comes into play if you start with sequential code and you're parallelizing it -- some of you are doing that for your projects -- or if you're writing code from scratch and you want to engineer it for parallelism as well.

I think it's important to understand the problem that you're working with. You want to survey your landscape, understand what other people might have done, and look for well-known solutions and common pitfalls. The patterns that I'm going to talk about in more detail really provide you with a list of questions to help you assess the existing code that you're working with, or the problem that you're trying to solve.

There are things you need to keep in mind that affect your overall correctness. For example, is your computation numerically stable? If you have a floating point computation, you might not be able to reorder all the operations, because that might affect your actual precision. Your overall output might be different, and that may or may not be acceptable. A lot of scientific codes, for example, that have to deal with a lot of precision might have to be cognizant of that fact.

You also want to define the scope of what you're trying to do, and will it be good enough? You want to do back-of-the-envelope calculations to make sure that the things you're suggesting are actually feasible, that they're actually practical, and that they will give you the sort of performance expectations that you've set out. You also want to understand your input range; you might be able to specialize if there are some cases, for example, that you're allowed to ignore. These are good things to keep in mind.
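Going back to the numerical stability point, here is a tiny example of why reordering matters: floating point addition is not associative, so a parallel decomposition that regroups a sum can change the answer. The specific constants are chosen only to make the effect visible in single precision.

```c
#include <stdio.h>

int main(void)
{
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;

    float one_order     = (a + b) + c;   /* = 0 + 1, so 1.0                  */
    float another_order = a + (b + c);   /* c is absorbed into b, so 0.0     */

    /* Same three numbers, two different results: typically prints
     * 1.000000 vs 0.000000 with IEEE single precision. */
    printf("%f vs %f\n", one_order, another_order);
    return 0;
}
```

Whether a difference like this matters is exactly the "may or may not be acceptable" judgment mentioned above.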
You also want to define a testing protocol. I think it's important to understand: you started out with some piece of code, you're going to make some changes to it, so how are you going to go about testing it? How might you go about debugging it? That could be where you spend a lot of your time. And then, having these things in mind, the parts that are worth looking at are the parts that make the most sense: where is your computation spending most of its time? Are there hot spots in your code? You can use profiling tools for that, and in fact you'll see some of that for Cell in some of the recitations later in the course.

So, a simple example: a molecular dynamics simulator. You have some space of molecules, which I'm just going to represent in 2D. You have water molecules, and I have some protein, and I'm trying to understand how the different atoms in that molecule move around so that I can determine the shape of the protein. So there are forces: there are bonded forces between the atoms -- I've just shown, for example, the bonded forces within my protein -- and then there are non-bonded forces, that is, how different atoms interact with each other because of electrostatic forces, for example.

What you try to do is figure out, for each atom, what are all the forces affecting it and what is its current position, and then you estimate where it's going to move based on, in the simplest case, a Newtonian f = ma type projection. In a naive algorithm you have n squared interactions: you have to calculate the forces on each molecule from all the others. But by understanding your problem you know that you can exploit the fact that the forces fall off very quickly with distance, so you can use a cutoff distance: if a molecule is way too far away, you can ignore it. And for people who do galaxy calculations, you know you can ignore the gravitational forces between constellations or clusters that are too far apart.
So here's the sequential code, some pseudo code for a molecular dynamics simulator. You have your atoms array, your force array, and your set of neighbors in a two-dimensional space, and you're going to go through and simulate different time steps. For each time step, for each atom, you compute the bonded forces, compute who my neighbors are -- these are the things that essentially encapsulate the cutoff distance -- and for those neighbors compute the forces between them, then update the position, and end. Since this is a loop, that might suggest essentially where to start looking for concurrency.

So you can start with the decomposition patterns -- there will be more in-depth details about those next. I'm going to give you some intuition, and then you would try to figure out whether your decomposition has to abide by certain dependencies, and what those dependencies are. How do you expose them? And then, how can you design, and how can you evaluate your design?

Screwed up again. I just fixed this. OK, so this is the pseudo code again from the previous slide. Since all you have is a simple loop, that essentially says this is where to look for the computation. And since you're doing the same computation for each atom, that again gives you the type of parallelism that we've talked about before. So you can look at splitting up the iterations and parallelizing those, so that each processor, for example, does one atom, or each processor does a collection of atoms.

But there are additional tasks -- so, data-level parallelism versus sort of control parallelism. For each atom you also want to calculate the forces: you want to calculate long-range interactions, find neighbors, update the position, and so on. Some of these have shared data and some of them do not, so you have to factor that in. Understanding the control dependencies essentially tells you how you need to lay out your orchestration.
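For reference, here is roughly what that sequential loop looks like written out in C. Every type and helper name (Atom, compute_bonded, find_neighbors, compute_nonbonded, update_position) is an assumed placeholder for whatever was on the slide, with the physics stubbed out; what matters is the loop structure, since that's where the concurrency is going to come from.

```c
#include <stdio.h>

#define N_ATOMS       64
#define MAX_NEIGHBORS 64
#define N_TIMESTEPS   10

typedef struct { double x, y, z; } Vec;
typedef struct { Vec pos, vel;  } Atom;

static Atom atoms[N_ATOMS];                    /* positions of all atoms       */
static Vec  force[N_ATOMS];                    /* force accumulated per atom   */
static int  neighbors[N_ATOMS][MAX_NEIGHBORS]; /* atoms within the cutoff      */

/* Stubs standing in for the real physics. */
static void compute_bonded(int i)           { force[i].x += 0.0; }
static int  find_neighbors(int i)           { (void)i; return 0; }
static void compute_nonbonded(int i, int j) { (void)i; (void)j; }
static void update_position(int i)          { atoms[i].pos.x += force[i].x; }

int main(void)
{
    for (int t = 0; t < N_TIMESTEPS; t++) {
        for (int i = 0; i < N_ATOMS; i++) {              /* the loop to parallelize */
            compute_bonded(i);                           /* bonded forces           */
            int n = find_neighbors(i);                   /* neighbors within cutoff */
            for (int k = 0; k < n; k++)
                compute_nonbonded(i, neighbors[i][k]);   /* non-bonded forces       */
        }
        for (int i = 0; i < N_ATOMS; i++)
            update_position(i);                          /* needs all forces done   */
    }
    printf("simulated %d steps for %d atoms\n", N_TIMESTEPS, N_ATOMS);
    return 0;
}
```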
So you have your bonded forces and you have your neighbor list, and that feeds your long-range calculations. But to do the position update I need both of those tasks to have completed. And each one of these tasks needs different data structures. Everybody essentially reads the locations of the atoms, and that's good, because within a time step it means I can distribute this work really well. But then there's a write synchronization problem, because eventually I have to update that array, so I have to be careful about who goes first. There's an accumulation, which means I can potentially do a reduction on those forces. There's some writing on the other end, but that seems to be a localized data structure, so for partitioning, for example, the neighbor lists might just be kept local to each processor. Coming up with this structure, this sort of block-level diagram, helps you figure out where your tasks are, helps you figure out what kind of synchronization mechanisms you need, and it can also suggest the data distribution you might need to reduce synchronization costs and problems.

And lastly, you want to evaluate your design. You want to keep in mind what your target architecture is. Are you trying to run on shared memory, or on distributed memory with message passing, or are you just doing this for one architecture? For your project you're doing this for Cell, so you can be very Cell-specific, but if you're doing this in other contexts, the architecture might influence some of your decisions. Does data sharing have enough special properties, like being read only, that you can exploit? Are there enough accumulations that you can exploit with reductions? Are there temporal constraints on data sharing that you can exploit, and can you deal with those efficiently? If you can't, then you have a problem, so you need to resolve that.
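As one concrete way to exploit the accumulation just mentioned, here is a small sketch in which each thread accumulates force contributions into its own private copy of the array, and the copies are reduced (summed) into the shared array only at the end, instead of locking the shared array on every update. The sizes and the per-pair contribution are made up for illustration.

```c
#include <pthread.h>
#include <stdio.h>

#define N_ATOMS   256
#define N_THREADS   4

static double force[N_ATOMS];                    /* shared result             */
static double partial[N_THREADS][N_ATOMS];       /* one private copy per thread */

static void *accumulate(void *arg)
{
    long t = (long)arg;
    /* Each thread handles a block of source atoms and writes only into its
     * own private array, so no locking is needed in this loop. */
    for (int i = t * (N_ATOMS / N_THREADS); i < (t + 1) * (N_ATOMS / N_THREADS); i++)
        for (int j = 0; j < N_ATOMS; j++)
            partial[t][j] += 0.001 * (i - j);    /* stand-in contribution     */
    return NULL;
}

int main(void)
{
    pthread_t tid[N_THREADS];
    for (long t = 0; t < N_THREADS; t++)
        pthread_create(&tid[t], NULL, accumulate, (void *)t);
    for (int t = 0; t < N_THREADS; t++)
        pthread_join(tid[t], NULL);

    for (int t = 0; t < N_THREADS; t++)          /* the reduction step        */
        for (int j = 0; j < N_ATOMS; j++)
            force[j] += partial[t][j];

    printf("force[0] = %f\n", force[0]);
    return 0;
}
```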
If the design is OK, then you move on to the next design space. At the next lecture I'll go through these in a lot more detail. That's it.