The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: So today, we're going to talk a bit more about parallelism and about how you get performance out of parallel codes. And also, we're going to take a little bit of a tour underneath the Cilk++ runtime system, so you can get an idea of what's going on underneath and why it is that when you code stuff, how it gets mapped and scheduled onto the processors.

So when people talk about parallelism, one of the first things that often comes up is what's called Amdahl's Law. Gene Amdahl was the architect of the IBM 360 computers, who then left IBM and formed his own company that made competing machines, and he made the following observation about parallel computing. He said-- and I'm paraphrasing here-- if half your application is parallel and half is serial, you can't get more than a factor of two speedup, no matter how many processors it runs on. So if you think about it, if it's half parallel and you managed to make that parallel part run in zero time, still the serial part will be half of the time, and you only get a factor of two speedup. You can generalize that to say if some fraction alpha can be run in parallel and the rest must be run serially, the speedup is at most 1 over 1 minus alpha.

OK, so this was used in the 1980s in particular to say why it was that parallel computing had no future, because you simply weren't going to be able to get very much speedup from parallel computing. You were going to spend extra hardware on the parallel parts of the system, and yet you might be limited in terms of how much parallelism there is in a particular application, and you wouldn't get very much speedup. You wouldn't get the bang for the buck, if you will. So things have changed today that make that not quite the same story.
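Written out from that serial/parallel split, with alpha the fraction that can run in parallel and T_1 the one-processor time, the observation is the following (a worked version of the statement in the lecture, not a slide reproduced here):

```latex
% Amdahl's Law from the serial/parallel split described above.
\[
  T_P \;\ge\; (1-\alpha)\,T_1 + \frac{\alpha\,T_1}{P}
  \qquad\Longrightarrow\qquad
  \text{speedup} \;=\; \frac{T_1}{T_P}
  \;\le\; \frac{1}{(1-\alpha) + \alpha/P}
  \;\le\; \frac{1}{1-\alpha}.
\]
% With alpha = 1/2 (half parallel, half serial), the bound is 2
% no matter how large P gets -- the factor-of-two ceiling.
```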
The first thing is that with multicore computers, it is pretty much just as inexpensive to produce a p-processor machine right now-- say, a six-processor machine-- as it is a one-processor machine. So it's not like you're actually paying for those extra processing cores. They come for free, because what else are you going to use that silicon for?

And the other thing is that we've had a large growth of understanding of problems for which there's ample parallelism, where that serial fraction of the time is, in fact, quite small. And the main place these things come from, it turns out-- this analysis is kind of a throughput analysis. It says, gee, I only get a 50% speedup for that application. But what most people care about in most interactive applications, at least for client-side programming, is response time. And for any problem that you have that has a response time that's too long and is compute intensive, using parallelism to make it so that the response is much zippier is definitely worthwhile.

And so this is true even for things like game programs. Game programs don't have quite a response time problem; they have what's called a time box problem, where you have a certain amount of time-- 13 milliseconds typically, because you need some slop to make sure that you can go from one frame to another-- about 13 milliseconds to do a rendering of whatever the frame is that the game player is going to see on his computer or her computer. And so in that time, you want to do as much as you possibly can, and so there's a big opportunity there to take advantage of parallelism in order to do more: have higher quality graphics, have better AI, have better physics, and all the other components that make up a game engine.

But one of the issues with Amdahl's Law-- and this analysis is a cogent analysis that Amdahl made-- one of the issues here is that it doesn't really say anything about how fast you can expect your application to run.
In other words, this is a nice sort of thing, but who really can decompose their application into the serial part and the part that can be parallel? Well, fortunately, there's been a lot of work in the theory of parallel systems to answer this question, and we're going to go over some of that really outstanding research that helps us understand what parallelism is.

So we're going to talk a little bit about what parallelism is and come up with a very specific measure of parallelism-- quantify parallelism, OK? We're also going to talk a little bit about scheduling theory and how the Cilk++ runtime system works. And then we're going to have a little chess lesson.

So who here plays chess? Nobody plays chess anymore. Who plays Angry Birds?

[LAUGHTER]

OK. So you don't have to know anything about chess to learn this chess lesson, that's OK.

So we'll start out with: what is parallelism? Let's recall first the basics of Cilk++. So here's the example of the lousy Fibonacci that everybody parallelizes because it's good didactically. We have the cilk_spawn statement, which says that the child can execute in parallel with the parent caller, and the sync, which says don't go past this point until all your spawned children have returned. And that's a local sync-- that's just a sync for that function. It's not a sync across the whole machine. Some of you may have had experience with OpenMP barriers, for example; that's a sync across the whole machine. This is not-- this is just a local sync for this function, saying when I sync, make sure all my children have returned before going past this point. And just remember also that Cilk keywords grant permission for parallel execution. They don't command parallel execution. OK, so we can always execute our code serially if we choose to.

Yes?

AUDIENCE: [UNINTELLIGIBLE] Can't the runtime figure out that spawning an extra child would be more expensive?
Can't it, like, look at this and be like--

PROFESSOR: We'll go into it. I'll show you how it works later in the lecture. I'll show you how it works, and then we can talk about what knobs you have to tune, OK?

So it's helpful to have an execution model for something like this. And so we're going to look at an abstract execution model, which is basically asking: what does the instruction trace look like for this program? So normally when you execute a program, you can imagine one instruction executing after the other. And if it's a serial program, all those instructions essentially form a long chain. Well, there's a similar thing for parallel computers, which is that instead of a chain, as you'll see, it gets bushier, and it's going to be a directed acyclic graph. So let's take a look at how we do this. We'll take the example of fib of 4.

So what we're going to do is start out here with a rectangle that I want you to think about as sort of a function call activation record. So it's a record on a stack. It's got variables associated with it. The only variable I'm going to keep track of is n, so that's what the 4 is there. OK, so we're going to do fib of 4. So in this activation frame, we have the variable 4, and now what I've done is I've color-coded the fib function here into the parts that are all serial. So there's a serial part up to where it spawns, then there's recursively calling the fib, and then there's returning. So there are sort of three parts to this function, each of which is, in fact, a chain of serial instructions. I'm going to collapse those chains into a single circle here that I'm going to call a strand.

OK, now what we do is we execute the strand, which corresponds to executing the instructions and advancing the program counter up until the point where we hit this fib of n minus 1. At that point, I basically call fib of n minus 1. So in this case, it's now going to be fib of 3. So that means I create a child and start executing, in the child, this prefix part of the function.
However, unlike an ordinary function call-- where I would make this call and then this guy would just sit here and wait until this frame was done-- since it's a spawn, what happens is I'm actually going to continue executing in the parent and execute, in fact, the green part. So in this case, evaluating the arguments, etc. Then it's going to spawn here, but this guy, in fact-- what it does when it gets here is it evaluates n minus 2 and does a call of fib of n minus 2. So I've indicated that this was a called frame by showing it in a light color. So these are spawn, spawn, call; meanwhile, this thing is going. So at this point, we now have one, two, three things that are operating in parallel at the same time.

We keep going on, OK? So this guy does a spawn and has a continuation; this one does a call, but while he's doing a call, he's waiting for the return, so he doesn't start executing the successor. He's stalled at the cilk_sync here. And we keep executing, and so as you can see, what's happening is we're actually creating a directed acyclic graph of these strands. So here, basically, this guy was able to execute because both of the children, the one that he had spawned and the one that he had called, have returned. And so this fellow, therefore, is able then to execute the return-- so the addition of x plus y in particular, and then the return to the parent.

And so what we end up with is all these serial chains of instructions, represented by these strands, all these circles, embedded in the call tree like you would have in an ordinary serial execution. You have a call tree that you execute up and down; you walk it like a stack, normally. Now, in fact, what we have embedded in there is the parallel execution, which forms a DAG, a directed acyclic graph. So when you start thinking in parallel, you have to start thinking about the DAG as your execution model, not a chain of instructions.
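For reference, the routine whose execution was just traced looks roughly like the sketch below. The slide's exact code isn't reproduced in the transcript, so this is a Cilk Plus-style sketch (the course's Cilk++ setup may use slightly different headers), not the lecture's verbatim program:

```cpp
#include <cstdint>
#include <cstdio>
#include <cilk/cilk.h>   // cilk_spawn / cilk_sync keywords (assumed header)

// The "lousy" didactic parallel Fibonacci: spawn one recursive call,
// do the other in the parent, then sync before combining.
int64_t fib(int64_t n) {
    if (n < 2) return n;                 // base case: a single strand
    int64_t x = cilk_spawn fib(n - 1);   // child may run in parallel with the continuation
    int64_t y = fib(n - 2);              // ordinary call in the parent (the "green" strand)
    cilk_sync;                           // local sync: wait for this function's spawned children
    return x + y;                        // final strand: add and return to the caller
}

int main() {
    std::printf("fib(4) = %lld\n", (long long)fib(4));
    return 0;
}
```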
And the nice thing about this particular execution model we're going to be looking at is that nowhere did I say how many processors we were running on. This is a processor-oblivious model. It doesn't know how many processors you're running on. In the execution model, we are simply thinking abstractly about what can run in parallel, not what actually does run in parallel in an execution. So any questions about this execution model?

OK. So just so that we have some terminology: the parallel instruction stream is a DAG with vertices and edges. Each vertex is a strand, OK? Which is a sequence of instructions not containing a call, spawn, sync, return, or thrown exception, if you're doing exceptions. We're not going to really talk about exceptions much. They are supported in the software that we'll be using, but for the most part, we're not going to have to worry about them. OK, so there's an initial strand where you start and a final strand where you end. Then each edge is a spawn or a call or a return, or what's called a continue edge or continuation edge, which goes from the parent, when a parent spawns something, to the next instruction after the spawn. So we can classify the edges in that fashion.

And I've only explained this for spawn and sync. As you recall from last time, we also talked about cilk_for. It turns out cilk_for is converted to spawns and syncs using a recursive divide-and-conquer approach. We'll talk about that next time, on Thursday. So we'll talk more about cilk_for and how it's implemented, and the implications for how loop parallelism works. At the fundamental level, the runtime system is only concerned with spawns and syncs.

Now, given that we have a DAG-- so I've taken away the call tree and just left the strands of a computation. It's actually not the same as the computation we saw before. We would like to understand: is this a good parallel program or not, based on the logical parallelism that I've exposed? So how much parallelism do you think is in here?
Give me a number. How many processors does it make sense to run this on? Five? That's as parallel as it gets. Let's take a look. We're going to do an analysis, and at the end of it, we'll know what the answer is.

So for that, let Tp be the execution time on p processors for this particular program. It turns out there are two measures that are really important. The first is called the work. OK, so of course, we know that real machines have caches, etc. Let's forget all of that. Just a very simple algorithmic model where every strand, let's say, costs us unit time, as opposed to in practice, where there may be many instructions and so forth. We can take that into account-- let's take that into account separately.

So T1 is the work. It's the time if I had to execute it on one processor; I've got to do all the work that's in here. So what's the work of this particular computation? I think it's 18, right? Yeah, 18. So T1 is the work. So even though I'm executing in parallel, I could execute it serially, and then T1 is the amount of work it would take.

The other measure is called the span, and it's sometimes called critical-path length or computational depth. And it corresponds to the longest path of dependencies in the DAG. We call it T infinity because even if you had an infinite number of processors, you still can't do this one until you finish that one. You can't do this one until you finish that one, can't do this one till you've finished that one, and so forth. So even with an infinite number of processors, I still wouldn't go faster than the span. That's why we denote it by T infinity.

So these are the two important measures. Now, what we're really interested in is Tp for a given p. As you'll see, we actually can get some bounds on the performance on p processors just by looking at the work, the span, and the number of processors we're executing on.
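These two measures are easy to compute mechanically once you have the strand DAG in hand. The sketch below is not part of the lecture-- just one way to make the definitions concrete, under the assumption that strands are numbered in topological order: work is the sum of the strand costs, and span is the longest weighted path.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// A strand DAG: cost[v] is the cost of strand v (unit cost in lecture),
// succ[v] lists the strands that depend on v. Strands are assumed to be
// numbered in topological order (predecessors before successors).
struct StrandDag {
    std::vector<int> cost;
    std::vector<std::vector<int>> succ;
};

// Work T1 = total cost of all strands.
long long work(const StrandDag& g) {
    long long t1 = 0;
    for (int c : g.cost) t1 += c;
    return t1;
}

// Span T_infinity = cost of the longest path through the DAG.
long long span(const StrandDag& g) {
    int n = (int)g.cost.size();
    std::vector<long long> longest(n, 0);   // longest path ending just before v
    long long t_inf = 0;
    for (int v = 0; v < n; ++v) {           // relies on topological numbering
        longest[v] += g.cost[v];
        t_inf = std::max(t_inf, longest[v]);
        for (int w : g.succ[v])
            longest[w] = std::max(longest[w], longest[v]);
    }
    return t_inf;
}

int main() {
    // Tiny example: one strand forks two unit-cost strands that join again.
    StrandDag g{{1, 1, 1, 1}, {{1, 2}, {3}, {3}, {}}};
    std::printf("work = %lld, span = %lld\n", work(g), span(g));  // work 4, span 3
    return 0;
}
```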
So the first bound is the following; it's called the Work Law. The Work Law says that the time on p processors is at least the time on one processor divided by p. So why does that Work Law make sense? What's it saying?

Sorry?

AUDIENCE: Like, work is conserved, sort of? I mean, you have to do the same amount of work.

PROFESSOR: You have to do the same amount of work, and on every time step, you can get at most p pieces of work done. So if you're running for fewer than T1 over p steps, you've done less than T1 work in time Tp. So you won't have done all the work if you run for less than this. So the time Tp must be at least T1 over p. You only get to do p work on each step. Is that pretty clear?

The second one should be even clearer: the Span Law. On p processors, you're not going to go faster than if you had an infinite number of processors, because the infinite-processor schedule could always just use fewer of its processors. Once again, this is a very simple model. We're not taking into account scheduling, we're not taking into account overheads or whatever-- just a simple conceptual model for understanding parallelism. So any questions about these two laws?

There are going to be a couple of formulas in this lecture today that you should write down and play with. So these two-- they may seem simple, but these are hugely important formulas. You should know that Tp is at least T1 over p, that's the Work Law, and that Tp is at least T infinity, that's the Span Law. Those are bounds on how fast you could execute. Do I have a question in the back there?

OK, so let's see what happens to work and span in terms of how we can understand our programs and decompose them. So suppose that I have a computation A followed by a computation B, and I connect them in series. What happens to the work? How does the work of the whole thing correspond to the work of A and the work of B? What's that?
AUDIENCE: [UNINTELLIGIBLE]

PROFESSOR: Yeah, add them together. You get T1 of A plus T1 of B. Take the work of this and the work of this. OK, that's pretty easy. What about the span? So the span is the longest path of dependencies. What happens to the span when I connect two things in series? Yeah, it just sums as well, because I take whatever the longest path is from here to here and then the longest one from here to here-- it just adds.

But now let's look at parallel composition. So now suppose that I can execute these two things in parallel. What happens to the work? It just adds, just as before. The work always adds. The work is easy because it's additive. What happens to the span? What's that?

AUDIENCE: [UNINTELLIGIBLE]

PROFESSOR: It's the max of the spans. Right, so whichever one of these has the longer span, that's going to be the span of the total. Does that give you some intuition? So when we analyze the spans of things, we're going to see, in fact, maxes occurring all over the place.

So speedup is defined to be T1 over Tp. Speedup is: how much faster am I on p processors than I am on one processor? Pretty easy. So if T1 over Tp is equal to p, we say we have perfect linear speedup, or linear speedup. That's good, right? Because if I use p processors, I'd like to have things go p times faster. OK, that would be the ideal world.

If T1 over Tp, which is the speedup, is greater than p, that says we have superlinear speedup. And in our model, we don't get that, because of the Work Law. The Work Law says Tp is greater than or equal to T1 over p, and if you just do a little algebra here, you get that T1 over Tp must be less than or equal to p. So you can't get superlinear speedup. In practice, there are situations where you can get superlinear speedup due to caching effects and a variety of things.
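Collecting the rules just stated for composing two computations A and B, and the speedup bound:

```latex
% Work and span under series and parallel composition, plus the speedup bound.
\[
  T_1(A\ \text{then}\ B) = T_1(A) + T_1(B), \qquad
  T_\infty(A\ \text{then}\ B) = T_\infty(A) + T_\infty(B),
\]
\[
  T_1(A \parallel B) = T_1(A) + T_1(B), \qquad
  T_\infty(A \parallel B) = \max\bigl(T_\infty(A),\, T_\infty(B)\bigr),
\]
\[
  \text{speedup} = \frac{T_1}{T_p} \le p
  \quad\text{by the Work Law } T_p \ge T_1/p .
\]
```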
We'll talk about some of those caching effects later. But in this simple model, we don't get that kind of behavior.

And of course, the case I left out is the common case, which is that T1 over Tp is less than p. And that's very common: people write code which doesn't give them linear speedup. We're mostly interested in getting linear speedup here. That's our goal, so that we're getting the most bang for the buck out of the processors we're using.

OK, parallelism. So we're finally to the point where I can talk about parallelism and give a quantitative definition of parallelism. So the Span Law says that Tp is at least T infinity, right? The time on p processors is at least the time on an infinite number of processors. So the maximum possible speedup-- that's T1 over Tp-- given T1 and T infinity, is T1 over T infinity. And we call that the parallelism. It's the maximum amount of speedup we could possibly attain.

So we have the speedup, and the Span Law says this is the maximum amount we can get. We could also view it as, if I look along the critical path of the computation, it's sort of the average amount of work at every level: the work, the total amount of stuff here, divided by that length there. That sort of tells us the width, the average amount of stuff that's going on in every step.

So for this example, what is the-- I forgot to put this on my slide-- what is the parallelism of this particular DAG here? Two, right? So the span has length nine-- this is assuming everything was unit time; obviously in reality, when you have more instructions, you would in fact make it be whatever the length of this was in terms of number of instructions, or what have you, of execution time of all these things. So this has length 9, there are 18 things here, and the parallelism is 2. So we can quantify parallelism precisely. We'll see why it's important to quantify it.
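In symbols, with the numbers from this small DAG:

```latex
% Parallelism = maximum possible speedup = work over span.
\[
  \overline{P} \;=\; \frac{T_1}{T_\infty} \;=\; \frac{18}{9} \;=\; 2
  \qquad\text{for the example DAG above.}
\]
```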
So that's the maximum speedup we're going to get when we run this application.

Here's another example we did before: fib of 4. Let's assume again that each strand takes unit time to execute. So what is the work in this particular computation? Assume every strand takes unit time to execute, which of course it doesn't, but-- anybody care to hazard a guess? 17, yeah, because there are four nodes here that have 3, plus 5. So 3 times 4 plus 5 is 17. So the work is 17.

OK, what's the span? This one's tricky. Too bad it's not a little bit more focused. What's the span?

AUDIENCE: 8.

PROFESSOR: 8, that's correct. Who got 7? Yeah, so I got 7 when I did this, and then I looked harder and it was 8. It's 8, so here it is. Here's the span. There it goes. Ooh, that little sidestep there, that's what makes it 8. OK, so basically it comes down here, and I had gone down like that when I did it, but in fact, you've got to go over and back up. So it's actually 8.

So that says that the parallelism is a little bit more than 2-- 2 and 1/8. What that says is that if I use many more than two processors, I can't get linear speedup anymore. I'm only going to get marginal performance gains if I use more than 2, because the maximum speedup I can get is like 2.125 if I had an infinite number of processors. So any questions about this?

So this, by the way, is deceptively simple, and yet, if you don't play around with it a little bit, you can get confused very easily. Deceptively simple, very powerful to be able to do this.
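Putting those fib(4) numbers together the same way:

```latex
% Work, span, and parallelism for the fib(4) strand DAG with unit-cost strands.
\[
  T_1 = 4 \cdot 3 + 5 = 17, \qquad
  T_\infty = 8, \qquad
  \overline{P} = \frac{T_1}{T_\infty} = \frac{17}{8} = 2.125 .
\]
```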
So here, for the analysis of parallelism, one of the things that we have going for us in using the Cilk tool suite is a program called Cilkview, which has a scalability analyzer. And it is like the race detector that I talked to you about last time in that it uses dynamic instrumentation. So you run your program under Cilkview-- it's like running it under Valgrind, for example, or what have you. So basically you run your program under it, and it analyzes your program for scalability. It computes the work and span of your program to derive some upper bounds on parallel performance, and it also estimates scheduling overhead to compute what's called the burdened span, for lower bounds.

So let's take a look. Here, for example, is a quicksort program. This is a C++ program. Here we're using a template so that the type of the items that I'm sorting can be a variable. So typename T-- can we shut the back door there? One of the TAs? Somebody run up to-- thank you.

So we have the type variable T, and we're going to quicksort from the beginning to the end of the array. And what we do is, just as you're familiar with in quicksort: if there's actually something to be sorted-- more than one thing-- then we find the middle by partitioning, and this is a bit of C++ magic to find the middle element. And then the important part, from our point of view, is that after we've done this partition, we quicksort the first part of the array, from the beginning to the middle, and then from the beginning plus 1 or the middle, whichever is greater, to the end. And then we sync.

So what we're doing is quicksort where we're spawning off the two subproblems to be solved in parallel recursively. So they're going to execute in parallel, and they're going to execute in parallel, and so forth. It's a fairly natural thing to do divide-and-conquer on quicksort, because the two subproblems can be operated on independently. We just sort them recursively, but we can sort them in parallel.
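A sketch of a routine along those lines, in Cilk style-- this is not the slide's exact code; the function name, the lambda-based partition, and the pivot choice here are illustrative:

```cpp
#include <algorithm>
#include <utility>
#include <cilk/cilk.h>   // cilk_spawn / cilk_sync (assumed header)

// Parallel quicksort: partition serially, then sort the two halves in parallel.
template <typename T>
void parallel_qsort(T* begin, T* end) {
    if (end - begin > 1) {                       // more than one element to sort
        T pivot = *(end - 1);                    // last element as pivot (one common choice)
        // Serial partition: everything less than the pivot moves to the front;
        // 'middle' is the first element that is not less than the pivot.
        T* middle = std::partition(begin, end - 1,
                                   [&](const T& x) { return x < pivot; });
        std::swap(*(end - 1), *middle);          // put the pivot into its final position
        cilk_spawn parallel_qsort(begin, middle);  // low side, possibly in parallel...
        parallel_qsort(middle + 1, end);           // ...with the high side
        cilk_sync;                               // wait for the spawned half before returning
    }
}
```

The partition itself runs serially, which is exactly what will limit the span in the analysis that follows.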
OK, so suppose that we are sorting 100,000 numbers. How much parallelism do you think is in this code? So remember that we're getting this recursive stuff done. How many people think-- well, it's not going to be more than 100,000, I promise you. So how many people think the parallelism is more than a million? Raise your hand, more than a million? And how many people think more than 100,000? And how many people think more than 10,000? OK, between the two. More than 1,000? OK, how about more than 100? 100 to 1,000? How about 10 to 100? How about between 1 and 10?

So a lot of people think between 1 and 10. Why do you think that there's so little parallelism in this? You don't have to justify yourself, OK. Well, let's see how much there is according to Cilkview.

So here's the type of output that you'll get. You'll get a graphical curve; you'll also get textual output. But this is sort of the graphical output. And this is basically showing what the running time here is. So the first thing it shows-- it will actually run your program, benchmark your program, on, in this case, up to 8 cores. So we ran it up to 8 cores, and it gives you what your measured speedup is.

The second thing is it tells you the parallelism. If you can't read that, it's 11.21. So we get about 11. Why do you think it's not higher? What's that?

AUDIENCE: It's the log.

PROFESSOR: What's the log?

AUDIENCE: [UNINTELLIGIBLE]

PROFESSOR: Yeah, but you're doing the two things in parallel, right? We'll actually analyze this. So it has to do with the fact that the partition routine is a serial piece of code, and it's big. So the initial partitioning takes you 100,000-- sorry, 100 million steps of doing a partition-- before you get to do any parallelism at all. And we'll see that in just a minute.

So it gives you the parallelism. It also plots this. So this is the parallelism-- notice that's the same number, 11.21, plotted as this bound. So it tells you the Span Law and it tells you the Work Law.
This is the linear speedup: if you were getting linear speedup, this is what your program would give you. So it gives you these two bounds, the Work Law and the Span Law, on your speedup. And then it also computes what's called the burdened parallelism, estimating scheduling overheads, to sort of give you a lower bound. Now, that's not to say that your numbers can't fall outside this range. But when they do, it will tell you essentially what the issues are with your program. And we'll discuss how you diagnose some of those issues-- actually, that's in one of the handouts that we've provided. I think that's in one of the handouts; if not, we'll make sure it's among the handouts.

So basically, this gives you a range for what you can expect. The important thing to notice here, for example, is that we're losing performance, but it's not due to the Work Law. In some sense, what's happening is we are losing it because of the Span Law, because we're starting to approach the point where the span is going to be the issue. We'll talk more about this. So the main thing is you have a tool that can tell you the work and span, so that you can analyze your own programs to understand whether you are bounded by parallelism, for example, in the code that you've written.

OK, let's do a theoretical analysis of this to understand why that number is small. So the main thing here is that the expected work, as you recall, of quicksort is order n log n. You tend to do order n log n work: you partition, and then you're solving two problems of the same size. If you actually draw out the recursion tree, it's log height with a linear amount of work on every level, for n log n total work. The expected span, however, is order n, because the partition routine is a serial program that partitions up the thing of size n in order n time. So when you compute the parallelism, you get parallelism of order log n, and log n is kind of puny parallelism-- that's our technical word for it.
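Written as recurrences, with the serial partition dominating the span (treating the split as roughly even, in expectation):

```latex
% Expected work and span of the parallel quicksort above, on n elements.
\begin{align*}
  T_1(n)      &= \Theta(n \lg n), \\
  T_\infty(n) &\approx \Theta(n) + T_\infty(n/2) = \Theta(n)
               \quad\text{(serial partition, then the deeper half)}, \\
  \overline{P}(n) &= \frac{T_1(n)}{T_\infty(n)} = \Theta(\lg n).
\end{align*}
```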
So puny parallelism is what we get out of quicksort.

So it turns out there are lots of things that you can analyze. Here's just a selection of some of the interesting practical algorithms and the kinds of analyses that you can do, showing that, for example, with merge sort you can do it with work n log n. You can get a span of log cubed n, and so then the parallelism is the ratio of the two. In fact, you can theoretically get a log squared n span, but that's not as practical an algorithm as the one that gives you log cubed n. And you can go through, and there are a whole bunch of algorithms for which you can get very good parallelism. So for all of these, if you look at the ratio, the parallelism is quite high.

So let's talk a little bit about what's going on underneath and why parallelism is important. When you describe your program in Cilk, you express the potential parallelism of your application. You don't say exactly how it's going to be scheduled; that's done by the Cilk++ scheduler, which maps the strands dynamically onto the processors at run time. So it's going to do the load balancing and everything necessary to balance your computation across the number of processors. We want to understand how that process works, because that's going to help us understand how it is that we can build codes that will map very effectively onto the number of processors.

Now, it turns out that the theory of distributed schedulers, such as the one in Cilk++, is complicated. I'll wave my hands about it towards the end, but the analysis of it is advanced-- you have to take a graduate course to get that stuff. So instead, we're going to explore the ideas with a centralized, much simpler scheduler, which serves as a surrogate for understanding what's going on. So the basic idea of almost all scheduling theory in this domain is greedy scheduling.
And so this is-- by the way, we're coming to the second thing you have to understand really well in order to be able to generate good code, the second sort of theoretical thing-- so the idea of a greedy scheduler is that you want to do as much work as possible on each step.

So the idea here is, let's take a look. For example, suppose that we've executed this part of the DAG already. Then there are a certain number of strands that are ready to execute, meaning all their predecessors have executed. How many strands are ready to execute on this DAG? Five, right? These guys. So those five strands are ready to execute.

So the idea is-- and let me illustrate for p equals 3-- the idea is to understand the execution in terms of two types of steps. In a greedy schedule, you always do as much as possible. So this is what would be called a complete step, because I can schedule all three processors to have some work to do on that step. So which are the best three guys to execute? Yes, so I'm not sure what the best three are, but for sure, you want to get this guy and this guy, right? Maybe that guy's not, but this guy you definitely want to execute. And these guys, I guess, OK.

So in a greedy scheduler, no, you're not allowed to look to see which ones are the best to execute. You don't know what the future is, and the scheduler isn't going to know what the future is, so it just executes any p of them. You just execute any p of them. In this case, I executed these p strands. In this case, I executed these three guys even though they weren't necessarily the best. And in a greedy scheduler, it doesn't look to see what's the best one to execute; it just executes as many as it can. In this case, it's p.

Now we have what's called an incomplete step. Notice nothing got enabled-- that was sort of too bad. So there are only two guys that are ready to go. What do you think happens if I have an incomplete step, namely fewer than p strands are ready?
I just execute all of them, as many as I can. Run all of them. So that's what a greedy scheduler does: at every step, it executes as many as it can, and we can classify the steps as ones which are complete, meaning we used all our processors, versus incomplete, meaning we only used a subset of our processors in scheduling it. So that's what a greedy scheduler does.

Now, the important thing is the analysis of this scheduler. And this is, by the way, the single most important thing in scheduling theory that you're ever going to learn, this particular theorem. It goes all the way back to 1968, and what it basically says is that any greedy scheduler achieves a bound of T1 over p plus T infinity. So why is that an interesting upper bound? Yeah?

AUDIENCE: That says that it's got the refinement of what you said before-- even if you add as many processors as you can, basically you're bounded by T infinity.

PROFESSOR: Yeah.

AUDIENCE: It's compulsory.

PROFESSOR: So basically, this term here is the term in the Work Law, this is the term in the Span Law, and we're saying you can always achieve the sum of those two lower bounds as an upper bound. So let's see how we do this, and then we'll look at some of the implications. Question-- do you have a question? No?

So here's the proof that you meet this bound. The proof says-- and I'll illustrate for p equals 3-- how many complete steps could we have? I'll argue that the number of complete steps is at most T1 over p. Why is that? Every complete step performs p work. So if I had more complete steps than T1 over p, I'd be doing more than T1 work. But I only have T1 work to do. So the maximum number of complete steps I could have is at most T1 over p. Do people follow that?

So the trickier part of the proof-- which is not all that tricky, but it's a little bit trickier-- is the other side.
770 00:44:08,630 --> 00:44:12,120 How many incomplete steps could I have? 771 00:44:12,120 --> 00:44:14,420 So we execute those. 772 00:44:14,420 --> 00:44:19,000 So I claim that the number of incomplete steps is bounded by 773 00:44:19,000 --> 00:44:22,610 the critical path length, by the span. 774 00:44:22,610 --> 00:44:24,440 Why is that? 775 00:44:24,440 --> 00:44:26,860 Well let's take a look at the part of the DAG that 776 00:44:26,860 --> 00:44:29,290 has yet to be executed. 777 00:44:29,290 --> 00:44:31,230 So that's this gray part here. 778 00:44:31,230 --> 00:44:33,270 There's some span associated with that. 779 00:44:33,270 --> 00:44:37,440 In this case, it's this longest path. 780 00:44:37,440 --> 00:44:46,460 When I execute all of the threads that are ready 781 00:44:46,460 --> 00:44:52,530 to go, I guarantee to reduce the span of that unexecuted 782 00:44:52,530 --> 00:44:54,820 DAG by at least one. 783 00:44:58,300 --> 00:45:02,780 So as I do here, so I reduce it by one when I execute. 784 00:45:02,780 --> 00:45:07,022 So if I have a complete step, I'm not guaranteed to reduce 785 00:45:07,022 --> 00:45:13,200 the span of the unexecuted DAG, because I may execute 786 00:45:13,200 --> 00:45:15,950 things that, as I showed you in this example, don't actually 787 00:45:15,950 --> 00:45:17,240 advance anything. 788 00:45:17,240 --> 00:45:23,770 But I execute all the ready threads on an incomplete step, 789 00:45:23,770 --> 00:45:25,490 and that's going to reduce it by one. 790 00:45:25,490 --> 00:45:28,410 So the number of incomplete steps is at most T infinity. 791 00:45:28,410 --> 00:45:32,650 So the total number of steps is at most the sum. 792 00:45:32,650 --> 00:45:35,710 So as I say, this proof you should understand in your 793 00:45:35,710 --> 00:45:39,380 sleep because it's the most important scheduling theory 794 00:45:39,380 --> 00:45:43,250 proof that you're going to probably see in your lifetime. 795 00:45:43,250 --> 00:45:48,180 It's very old, and really, very, very simple and yet, 796 00:45:48,180 --> 00:45:50,840 there's a huge amount of scheduling theory, if you have 797 00:45:50,840 --> 00:45:54,560 a look at scheduling theory, that comes out of this, just 798 00:45:54,560 --> 00:45:58,160 making this same problem more complicated and more real and 799 00:45:58,160 --> 00:46:00,340 more interesting and so forth. 800 00:46:00,340 --> 00:46:03,590 But this is really the crux of what's going on. 801 00:46:03,590 --> 00:46:07,510 Any questions about this proof? 802 00:46:07,510 --> 00:46:13,370 So one corollary of the greedy scheduling theorem is that 803 00:46:13,370 --> 00:46:16,650 any greedy scheduler achieves within a factor of two of 804 00:46:16,650 --> 00:46:17,900 optimal scheduling. 805 00:46:20,280 --> 00:46:21,400 So let's see why that is. 806 00:46:21,400 --> 00:46:24,070 So it's guaranteed as an upper bound to get within a factor 807 00:46:24,070 --> 00:46:26,220 of two of optimal. 808 00:46:26,220 --> 00:46:27,650 So here's the proof. 809 00:46:27,650 --> 00:46:31,700 So let Tp star be the execution time produced by the 810 00:46:31,700 --> 00:46:32,425 optimal scheduler. 811 00:46:32,425 --> 00:46:35,630 This is the scheduler that knows the whole DAG in advance 812 00:46:35,630 --> 00:46:38,000 and can schedule things exactly where they need to be 813 00:46:38,000 --> 00:46:40,790 scheduled to minimize the total amount of time.
814 00:46:40,790 --> 00:46:44,550 Now even though the optimal scheduler can schedule very 815 00:46:44,550 --> 00:46:47,760 efficiently, it's still bound by the Work Law 816 00:46:47,760 --> 00:46:50,170 and the Span Law. 817 00:46:50,170 --> 00:46:53,260 So therefore, Tp star has still got to be greater than 818 00:46:53,260 --> 00:46:56,730 T1 over p and greater than T infinity by the 819 00:46:56,730 --> 00:46:58,360 Work and Span Laws. 820 00:46:58,360 --> 00:47:01,850 Even though it's optimal, every scheduler must obey the 821 00:47:01,850 --> 00:47:05,190 Work Law and Span Law. 822 00:47:05,190 --> 00:47:08,680 So then we have, by the greedy scheduling theorem, Tp is at 823 00:47:08,680 --> 00:47:11,770 most T1 over p plus T infinity. 824 00:47:11,770 --> 00:47:15,660 Well that's at most twice the maximum of these two values, 825 00:47:15,660 --> 00:47:17,180 whichever is larger. 826 00:47:17,180 --> 00:47:20,880 I've just plugged in to get the maximum of those two and 827 00:47:20,880 --> 00:47:23,590 that's at most, by this equation, 828 00:47:23,590 --> 00:47:25,670 twice the optimal time. 829 00:47:29,060 --> 00:47:33,642 So this very simple corollary says oh, greedy 830 00:47:33,642 --> 00:47:35,110 scheduling is actually pretty good. 831 00:47:35,110 --> 00:47:37,400 It's not optimal, in fact, optimal 832 00:47:37,400 --> 00:47:39,200 scheduling is NP-complete. 833 00:47:39,200 --> 00:47:41,010 Very hard problem to solve. 834 00:47:41,010 --> 00:47:43,630 But to get within a factor of two, you just do greedy 835 00:47:43,630 --> 00:47:44,880 scheduling, it works just fine. 836 00:47:47,460 --> 00:47:52,770 More important is the next corollary, which has to do with 837 00:47:52,770 --> 00:47:54,630 when you get linear speedup. 838 00:47:54,630 --> 00:47:56,660 And this is, I think, the most important thing 839 00:47:56,660 --> 00:47:57,770 to get out of this. 840 00:47:57,770 --> 00:48:01,820 So any greedy scheduler achieves near perfect linear 841 00:48:01,820 --> 00:48:04,590 speedup whenever-- 842 00:48:04,590 --> 00:48:05,970 what's this thing on the left-hand side? 843 00:48:05,970 --> 00:48:08,940 What's the name we call that?-- 844 00:48:08,940 --> 00:48:10,550 the parallelism, right? 845 00:48:10,550 --> 00:48:13,900 That's the parallelism, is much bigger than the number of 846 00:48:13,900 --> 00:48:16,300 processors you're running on. 847 00:48:16,300 --> 00:48:19,440 So if the number of processors you're running on is smaller than 848 00:48:19,440 --> 00:48:23,400 the parallelism of your code, it says you can expect near 849 00:48:23,400 --> 00:48:26,510 perfect linear speedup. 850 00:48:26,510 --> 00:48:29,140 OK, so what does that say you want to do in your program? 851 00:48:29,140 --> 00:48:33,690 You want to make sure you have ample parallelism and then the 852 00:48:33,690 --> 00:48:37,210 scheduler will be able to schedule it so that you get 853 00:48:37,210 --> 00:48:39,170 near perfect linear speedup. 854 00:48:39,170 --> 00:48:42,210 Let's see why that's true. 855 00:48:42,210 --> 00:48:46,470 So T1 over T infinity is much bigger than p is equivalent to 856 00:48:46,470 --> 00:48:50,860 saying that T infinity is much less than T1 over p. 857 00:48:50,860 --> 00:48:53,960 That's just algebra. 858 00:48:53,960 --> 00:48:55,060 Well what does that mean? 859 00:48:55,060 --> 00:48:58,420 The greedy scheduling theorem says Tp is at most T1 over p 860 00:48:58,420 --> 00:48:59,700 plus T infinity.
861 00:48:59,700 --> 00:49:02,780 We just said that if we have this condition, then T 862 00:49:02,780 --> 00:49:08,020 infinity is very small compared to T1 over p. 863 00:49:08,020 --> 00:49:11,830 So if this is negligible, then the whole thing is 864 00:49:11,830 --> 00:49:13,195 about T1 over p. 865 00:49:15,850 --> 00:49:19,617 Well that just says that the speedup is about p. 866 00:49:23,320 --> 00:49:27,920 So the name of the game is to make sure that your span is 867 00:49:27,920 --> 00:49:31,950 relatively short compared to the amount of work per 868 00:49:31,950 --> 00:49:34,082 processor that you're doing. 869 00:49:34,082 --> 00:49:37,510 And in that case, you'll get linear speedup. 870 00:49:37,510 --> 00:49:40,050 And that happens when you've got enough parallelism 871 00:49:40,050 --> 00:49:43,150 compared to the number of processors you're running on. 872 00:49:43,150 --> 00:49:44,460 Any questions about this? 873 00:49:44,460 --> 00:49:50,000 This is like the most important thing you're going 874 00:49:50,000 --> 00:49:51,395 to learn about parallel computing. 875 00:49:57,410 --> 00:49:59,230 Everything else we're going to do is going to be derivatives 876 00:49:59,230 --> 00:50:02,430 of this, so if you don't understand this, you'll have a 877 00:50:02,430 --> 00:50:05,670 hard time with the other stuff. 878 00:50:05,670 --> 00:50:08,360 So in some sense, it's deceptively simple, right? 879 00:50:08,360 --> 00:50:13,730 We just have a few variables, T1, Tp, T infinity, p, there's 880 00:50:13,730 --> 00:50:14,890 not much else going on. 881 00:50:14,890 --> 00:50:19,590 But there are these bounds and these elegant theorems that 882 00:50:19,590 --> 00:50:25,430 tell us something about how, no matter what the shape of the 883 00:50:25,430 --> 00:50:29,200 DAG is or whatever, these two values, the work and the span, 884 00:50:29,200 --> 00:50:33,890 really characterize very closely where it is that you 885 00:50:33,890 --> 00:50:37,440 can expect to get linear speedup. 886 00:50:37,440 --> 00:50:39,660 Any questions? 887 00:50:39,660 --> 00:50:43,630 OK, good. 888 00:50:46,500 --> 00:50:50,220 So the quantity T1 over p T infinity, what is that? 889 00:50:50,220 --> 00:50:56,310 That's just the parallelism divided by p. 890 00:50:56,310 --> 00:50:59,410 That's called the parallel slackness. 891 00:50:59,410 --> 00:51:05,200 So if this parallel slackness is 10, it means you have 10 times 892 00:51:05,200 --> 00:51:08,120 more parallelism than processors. 893 00:51:08,120 --> 00:51:10,330 So if you have high slackness, you can expect 894 00:51:10,330 --> 00:51:12,340 to get linear speedup. 895 00:51:12,340 --> 00:51:14,070 If you have low slackness, don't expect 896 00:51:14,070 --> 00:51:15,320 to get linear speedup. 897 00:51:17,660 --> 00:51:18,540 OK. 898 00:51:18,540 --> 00:51:26,920 Now the scheduler we're using is not a greedy scheduler. 899 00:51:26,920 --> 00:51:33,530 It's better in many ways, because it's a distributed, 900 00:51:33,530 --> 00:51:35,580 what's called work stealing scheduler and I'll show you 901 00:51:35,580 --> 00:51:38,450 how it works in a little bit. 902 00:51:38,450 --> 00:51:41,070 But it's based on the same theory. 903 00:51:41,070 --> 00:51:46,340 Even though it's a more complicated scheduler from an 904 00:51:46,340 --> 00:51:48,900 analytical point of view, it's really based on the same 905 00:51:48,900 --> 00:51:51,080 theory as greedy scheduling.
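Before we look at it, here is the greedy theory from this stretch collected in one place, in the same notation (just a restatement of what was said, nothing new):

    T_P  <=  T_1/P + T_infinity                          (greedy bound)
    T_P  <=  2 * max(T_1/P, T_infinity)  <=  2 * T_P*    (within a factor of 2 of optimal)
    T_1/T_infinity >> P   ==>   T_infinity << T_1/P   ==>   T_P ~ T_1/P   ==>   speedup T_1/T_P ~ P
    parallel slackness  =  (T_1/T_infinity) / P  =  T_1 / (P * T_infinity)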
906 00:51:51,080 --> 00:51:57,110 It guarantees that the time on p processors is at most T1 907 00:51:57,110 --> 00:51:59,300 over p plus order T infinity. 908 00:51:59,300 --> 00:52:02,310 So there's a constant here. 909 00:52:02,310 --> 00:52:05,660 And it's a randomized scheduler, so it actually only 910 00:52:05,660 --> 00:52:08,120 guarantees this in expectation. 911 00:52:08,120 --> 00:52:11,590 It actually guarantees very close to this with high 912 00:52:11,590 --> 00:52:13,060 probability. 913 00:52:13,060 --> 00:52:19,190 OK so the difference is the big O, but if you look at any 914 00:52:19,190 --> 00:52:21,500 of the formulas that we did with the greedy scheduler, the 915 00:52:21,500 --> 00:52:24,480 fact that there's a constant there doesn't really matter. 916 00:52:24,480 --> 00:52:27,700 You get the same effect, it just means that the slackness 917 00:52:27,700 --> 00:52:30,580 that you need to get linear speedup has to not only 918 00:52:30,580 --> 00:52:33,010 overcome the T infinity, it's also got to overcome the 919 00:52:33,010 --> 00:52:36,200 constant there. 920 00:52:36,200 --> 00:52:40,440 And empirically, it actually turns out the greedy bound is not bad as 921 00:52:40,440 --> 00:52:44,040 an estimate. 922 00:52:44,040 --> 00:52:46,690 Not bad as an estimate, so this is sort of a model that 923 00:52:46,690 --> 00:52:49,450 we'll take as if we're doing things 924 00:52:49,450 --> 00:52:51,130 with a greedy scheduler. 925 00:52:51,130 --> 00:52:53,540 And that will be very close to what we're actually going 926 00:52:53,540 --> 00:52:58,790 to see in practice with the Cilk++ scheduler. 927 00:52:58,790 --> 00:53:01,620 So once again, it means near perfect linear speedup as long 928 00:53:01,620 --> 00:53:06,330 as p is much less than T1 over T infinity generally. 929 00:53:06,330 --> 00:53:10,820 And so Cilkview allows us to measure T1 and T infinity. 930 00:53:10,820 --> 00:53:13,320 So that's going to be good, because then we can figure out 931 00:53:13,320 --> 00:53:16,480 what our parallelism is and look to see, when we're running 932 00:53:16,480 --> 00:53:21,910 on typically 12 cores, how much parallelism do we have? 933 00:53:21,910 --> 00:53:25,500 If our parallelism is 12, we don't have a lot of slackness. 934 00:53:25,500 --> 00:53:27,440 We won't get very good speedup. 935 00:53:27,440 --> 00:53:30,550 But if we have a parallelism of say, 10 times more, say 936 00:53:30,550 --> 00:53:36,200 120, we should get very, very 937 00:53:36,200 --> 00:53:38,370 good speedup on 12 cores. 938 00:53:38,370 --> 00:53:40,930 We should get close to perfect speedup. 939 00:53:45,100 --> 00:53:47,490 So let's talk about the runtime system and how this 940 00:53:47,490 --> 00:53:50,740 work stealing scheduler works, because it's different 941 00:53:50,740 --> 00:53:51,890 from the other one. 942 00:53:51,890 --> 00:53:56,160 And this will be helpful also for understanding, when you 943 00:53:56,160 --> 00:53:59,530 program these things, what you can expect. 944 00:53:59,530 --> 00:54:07,730 So the basic idea of the scheduler is there are two 945 00:54:07,730 --> 00:54:11,810 strategies that people have explored for doing scheduling. 946 00:54:11,810 --> 00:54:16,110 One is called work sharing, which is not what Cilk++ does. 947 00:54:16,110 --> 00:54:19,390 But let me explain what work sharing is because it's 948 00:54:19,390 --> 00:54:22,200 helpful to contrast it with work stealing.
949 00:54:22,200 --> 00:54:25,990 So in work sharing, what you do is when you spawn off some 950 00:54:25,990 --> 00:54:32,280 work, you say let me go find some low-utilized processor 951 00:54:32,280 --> 00:54:37,450 and put that work there for it to operate on. 952 00:54:37,450 --> 00:54:41,470 The problem with work sharing is that you have to do some 953 00:54:41,470 --> 00:54:45,600 communication and synchronization every time you 954 00:54:45,600 --> 00:54:47,960 do a spawn. 955 00:54:47,960 --> 00:54:49,830 Every time you do a spawn, you're going to go out. 956 00:54:49,830 --> 00:54:52,290 This is kind of what Pthreads does, when 957 00:54:52,290 --> 00:54:53,580 you do Pthread create. 958 00:54:53,580 --> 00:54:58,120 It goes out and says OK, let me create all of the things it 959 00:54:58,120 --> 00:55:03,070 needs to do and get it scheduled then on a processor. 960 00:55:03,070 --> 00:55:06,410 Work stealing, on the other hand, takes 961 00:55:06,410 --> 00:55:08,310 the opposite approach. 962 00:55:08,310 --> 00:55:11,780 Whenever it spawns work, it's just going to keep that work 963 00:55:11,780 --> 00:55:16,230 local to it, but make it available for stealing. 964 00:55:16,230 --> 00:55:21,220 A processor that runs out of work is going to go looking 965 00:55:21,220 --> 00:55:23,720 for work to steal, to bring back. 966 00:55:23,720 --> 00:55:31,540 The advantage of work stealing is that the processor doesn't 967 00:55:31,540 --> 00:55:33,650 do any synchronization except when it's 968 00:55:33,650 --> 00:55:36,210 actually load balancing. 969 00:55:36,210 --> 00:55:42,850 So if all of the processors have ample work to do, then 970 00:55:42,850 --> 00:55:47,570 what happens is there's no overhead for scheduling 971 00:55:47,570 --> 00:55:48,380 whatsoever. 972 00:55:48,380 --> 00:55:51,600 They all just crank away. 973 00:55:51,600 --> 00:55:56,120 And so you get very, very low overheads when there's ample 974 00:55:56,120 --> 00:55:58,320 work to do on each processor. 975 00:55:58,320 --> 00:56:00,980 So let's see how this works. 976 00:56:00,980 --> 00:56:04,120 So the particular way that it maintains it is that 977 00:56:04,120 --> 00:56:08,180 basically, each processor maintains a work deque. 978 00:56:08,180 --> 00:56:13,750 So a deque is a double-ended queue of the ready strands. 979 00:56:13,750 --> 00:56:17,500 It manipulates the bottom of the deque like a stack. 980 00:56:17,500 --> 00:56:21,020 So what that says is, for example, here, we had a spawn 981 00:56:21,020 --> 00:56:24,310 followed by two calls. 982 00:56:24,310 --> 00:56:26,810 And basically, it's operating just as it would have to 983 00:56:26,810 --> 00:56:36,210 operate in an ordinary stack, an ordinary call stack. 984 00:56:36,210 --> 00:56:40,000 So, for example, this guy says call, well it pushes a frame 985 00:56:40,000 --> 00:56:44,460 on the bottom of the call stack just like normal. 986 00:56:44,460 --> 00:56:47,950 It says spawn, it pushes a spawn frame on the 987 00:56:47,950 --> 00:56:49,200 bottom of the deque. 988 00:56:52,910 --> 00:56:55,450 In fact, of course, it's running in parallel, so you 989 00:56:55,450 --> 00:56:58,420 can have a bunch of guys that are both calling and spawning 990 00:56:58,420 --> 00:57:01,270 and they all push whatever their frames are. 991 00:57:01,270 --> 00:57:05,380 When somebody says return, you just pop it off.
992 00:57:05,380 --> 00:57:10,420 So in the common case, each of these guys is just executing 993 00:57:10,420 --> 00:57:13,420 the code serially the way that it would normally 994 00:57:13,420 --> 00:57:15,295 execute in C or C++. 995 00:57:18,120 --> 00:57:25,370 However, if somebody runs out of work, then it becomes a 996 00:57:25,370 --> 00:57:33,310 thief and it looks for a victim, and the strategy that's 997 00:57:33,310 --> 00:57:36,150 used by Cilk++ is to look at random. 998 00:57:36,150 --> 00:57:43,120 It says let me just go to any other processor 999 00:57:43,120 --> 00:57:45,050 or any other worker-- 1000 00:57:45,050 --> 00:57:46,300 I call these workers-- 1001 00:57:48,880 --> 00:57:52,730 and grab away some of their work. 1002 00:57:52,730 --> 00:57:56,170 But when it grabs it away, what it does is it steals it 1003 00:57:56,170 --> 00:58:02,990 from the opposite end of the deque from where this 1004 00:58:02,990 --> 00:58:06,050 particular victim is actually doing its work. 1005 00:58:06,050 --> 00:58:09,650 So it steals the oldest stuff first. 1006 00:58:09,650 --> 00:58:12,970 So it moves that over, and what it's doing here is it's 1007 00:58:12,970 --> 00:58:14,710 stealing up to the point of a spawn. 1008 00:58:14,710 --> 00:58:17,740 So it steals from the top of the deque down to where there's 1009 00:58:17,740 --> 00:58:18,640 a spawn on top. 1010 00:58:18,640 --> 00:58:19,521 Yes? 1011 00:58:19,521 --> 00:58:21,970 AUDIENCE: Is there always a spawn on the 1012 00:58:21,970 --> 00:58:23,220 top of every deque? 1013 00:58:25,540 --> 00:58:28,140 PROFESSOR: Close, almost always. 1014 00:58:28,140 --> 00:58:31,150 Yes, so I think that you could say that there are. 1015 00:58:31,150 --> 00:58:34,290 So the initial deque does not have a spawn on top of it, but 1016 00:58:34,290 --> 00:58:37,250 you could imagine that it did. 1017 00:58:37,250 --> 00:58:39,240 And then when you steal, you're always stealing from 1018 00:58:39,240 --> 00:58:41,920 the top down to a spawn. 1019 00:58:41,920 --> 00:58:47,990 If there isn't something, if this is just a call here, this 1020 00:58:47,990 --> 00:58:49,640 cannot any longer be stolen. 1021 00:58:49,640 --> 00:58:52,170 There's no work there to be stolen because this is just a 1022 00:58:52,170 --> 00:58:54,990 single execution, there's nothing that's been spawned 1023 00:58:54,990 --> 00:58:57,440 off at this point. 1024 00:58:57,440 --> 00:59:00,200 This is the result of having been spawned as opposed to 1025 00:59:00,200 --> 00:59:01,950 it doing a spawn. 1026 00:59:01,950 --> 00:59:05,230 So yes, basically you're right. 1027 00:59:05,230 --> 00:59:06,890 There's a spawn on the top. 1028 00:59:06,890 --> 00:59:09,670 So it basically steals that off and then it resumes 1029 00:59:09,670 --> 00:59:15,540 execution afterwards and starts then operating just 1030 00:59:15,540 --> 00:59:19,190 like an ordinary deque. 1031 00:59:19,190 --> 00:59:24,550 So the theorem that you can prove for this type of 1032 00:59:24,550 --> 00:59:28,480 scheduler is that if you have sufficient parallelism, so you 1033 00:59:28,480 --> 00:59:31,910 all know what parallelism is at this point, you can prove 1034 00:59:31,910 --> 00:59:35,990 that the workers steal infrequently. 1035 00:59:35,990 --> 00:59:40,880 So in a typical execution, you might have a few hundred 1036 00:59:40,880 --> 00:59:44,200 load balancing operations of this nature for something 1037 00:59:44,200 --> 00:59:48,420 which is doing billions and billions of instructions.
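As a rough illustration of the deque discipline just described, here is a sketch in C++. This is not the actual Cilk++ runtime code; the names are made up for the example, and a mutex stands in for the much more careful synchronization protocol the real system uses. The owning worker pushes and pops at the bottom, while a thief steals from the top:

    #include <deque>
    #include <mutex>
    #include <optional>

    // Hypothetical sketch of one worker's deque of ready strands.
    struct Frame { /* function pointer, saved continuation state, ... */ };

    class WorkerDeque {
        std::deque<Frame> frames_;   // back = bottom (newest), front = top (oldest)
        std::mutex m_;               // stand-in for the runtime's real synchronization
    public:
        // The owning worker treats the bottom of the deque like its call stack.
        void push_bottom(Frame f) {
            std::lock_guard<std::mutex> g(m_);
            frames_.push_back(f);
        }
        std::optional<Frame> pop_bottom() {
            std::lock_guard<std::mutex> g(m_);
            if (frames_.empty()) return std::nullopt;   // out of work: go be a thief
            Frame f = frames_.back();
            frames_.pop_back();
            return f;
        }
        // A thief that picked this worker at random steals the oldest work,
        // from the opposite end of the deque from where the victim is working.
        std::optional<Frame> steal_top() {
            std::lock_guard<std::mutex> g(m_);
            if (frames_.empty()) return std::nullopt;
            Frame f = frames_.front();
            frames_.pop_front();
            return f;
        }
    };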
1038 00:59:48,420 --> 00:59:51,220 So you steal infrequently. 1039 00:59:51,220 --> 00:59:53,860 If you're stealing infrequently and all the rest 1040 00:59:53,860 --> 00:59:59,970 of the time you're just executing like C or C++, 1041 00:59:59,970 --> 01:00:02,330 hey, now you've got linear speedup because you've got all 1042 01:00:02,330 --> 01:00:04,175 of these guys working all the time. 1043 01:00:08,140 --> 01:00:11,230 And so as I say, the main thing to understand is that 1044 01:00:11,230 --> 01:00:14,550 there's this work stealing scheduler running underneath. 1045 01:00:14,550 --> 01:00:17,050 It's more complicated to analyze than the greedy 1046 01:00:17,050 --> 01:00:20,210 scheduler, but it gives you pretty much the same 1047 01:00:20,210 --> 01:00:24,620 qualitative kinds of results. 1048 01:00:24,620 --> 01:00:31,450 And the idea then is that the stealing occurs infrequently 1049 01:00:31,450 --> 01:00:32,500 so you get linear speedup. 1050 01:00:32,500 --> 01:00:35,470 So the idea then is, just as with greedy scheduling, make 1051 01:00:35,470 --> 01:00:37,540 sure you have enough parallelism, because then the 1052 01:00:37,540 --> 01:00:41,550 load balancing is a small fraction of the time these 1053 01:00:41,550 --> 01:00:45,890 processors are spending executing the code. 1054 01:00:45,890 --> 01:00:48,090 Because whenever it's doing things like work stealing, 1055 01:00:48,090 --> 01:00:52,460 it's not executing your code, making it go fast. 1056 01:00:52,460 --> 01:00:57,790 It's doing bookkeeping and overhead and stuff. 1057 01:00:57,790 --> 01:00:59,910 So you want to make sure that stays low. 1058 01:00:59,910 --> 01:01:03,290 So any questions about that? 1059 01:01:03,290 --> 01:01:07,150 So specifically, we have these bounds. 1060 01:01:07,150 --> 01:01:09,750 You achieve this expected running time, which I 1061 01:01:09,750 --> 01:01:10,400 mentioned before. 1062 01:01:10,400 --> 01:01:13,920 Let me give you a pseudo-proof of this. 1063 01:01:13,920 --> 01:01:19,030 So this is not a real proof because it ignores things like 1064 01:01:19,030 --> 01:01:21,540 independence of probabilities. 1065 01:01:21,540 --> 01:01:24,000 So when you do a probability analysis, you're not allowed 1066 01:01:24,000 --> 01:01:27,310 to multiply probabilities unless they're independent. 1067 01:01:27,310 --> 01:01:30,125 So anyway, here I'm multiplying probabilities as if they 1068 01:01:30,125 --> 01:01:32,080 were independent. 1069 01:01:32,080 --> 01:01:34,960 So the idea is you can view a processor as 1070 01:01:34,960 --> 01:01:36,940 either working or stealing. 1071 01:01:36,940 --> 01:01:38,880 So it goes into one of two modes. 1072 01:01:38,880 --> 01:01:42,220 It's going to be stealing if it's run out of work, 1073 01:01:42,220 --> 01:01:43,840 otherwise it's working. 1074 01:01:43,840 --> 01:01:46,630 So the total time all processors spend working is 1075 01:01:46,630 --> 01:01:51,900 T1, hooray, that's at least a bound. 1076 01:01:51,900 --> 01:01:56,060 Now it turns out that every steal has a 1 over p chance of 1077 01:01:56,060 --> 01:01:59,550 reducing the span by one. 1078 01:01:59,550 --> 01:02:04,040 So you can prove that the ready threads at the top 1079 01:02:04,040 --> 01:02:07,720 of all those deques are the ones that are in a 1080 01:02:07,720 --> 01:02:15,890 position to reduce the span of the unexecuted DAG 1081 01:02:15,890 --> 01:02:18,680 if you execute them.
1082 01:02:18,680 --> 01:02:21,780 And so whenever you steal, you have a 1 over p chance of 1083 01:02:21,780 --> 01:02:28,580 hitting the guy that matters for the span 1084 01:02:28,580 --> 01:02:30,360 of the unexecuted DAG. 1085 01:02:30,360 --> 01:02:33,190 So it's the same kind of thing as in the theory. 1086 01:02:33,190 --> 01:02:34,050 You have a 1 over p chance. 1087 01:02:34,050 --> 01:02:39,580 So the expected cost of all steals is order p T infinity. 1088 01:02:39,580 --> 01:02:43,580 So this is true, but not for this reason. 1089 01:02:43,580 --> 01:02:45,960 But the intuition is right. 1090 01:02:48,610 --> 01:02:52,070 So therefore the cost of all steals is order p T infinity and the 1091 01:02:52,070 --> 01:02:55,320 cost of the work is T1, so that's the total amount of 1092 01:02:55,320 --> 01:03:00,480 work and time spent stealing by all the p processors. 1093 01:03:00,480 --> 01:03:05,420 So to get the time spent doing that, we divide by p, because 1094 01:03:05,420 --> 01:03:08,220 there are p processors. 1095 01:03:08,220 --> 01:03:12,140 And when I do that, I get T1 over p plus order T infinity. 1096 01:03:12,140 --> 01:03:15,550 So that's kind of where that bound is coming from. 1097 01:03:15,550 --> 01:03:19,620 So you can see what's important here is that that 1098 01:03:19,620 --> 01:03:22,730 term, the order T infinity term, is the one where all 1099 01:03:22,730 --> 01:03:25,630 the overhead of scheduling and synchronization is. 1100 01:03:25,630 --> 01:03:28,120 There's no overhead for scheduling and synchronization 1101 01:03:28,120 --> 01:03:29,940 in the T1 over p term. 1102 01:03:29,940 --> 01:03:33,850 The only overhead there is to do things like mark the frames 1103 01:03:33,850 --> 01:03:39,140 as being a steal frame or a spawn frame and do the 1104 01:03:39,140 --> 01:03:43,820 bookkeeping of the deque as you're executing, so the spawn 1105 01:03:43,820 --> 01:03:48,070 can be implemented very cheaply. 1106 01:03:48,070 --> 01:03:55,130 Now in addition to the scheduling things, there are 1107 01:03:55,130 --> 01:03:57,020 some other things to understand a little bit about 1108 01:03:57,020 --> 01:04:02,960 the scheduler, and that is that it supports the C, C++ rule 1109 01:04:02,960 --> 01:04:03,720 for pointers. 1110 01:04:03,720 --> 01:04:08,130 So remember, in C and C++, you can pass a pointer to stack 1111 01:04:08,130 --> 01:04:12,110 space down, but you can't pass a pointer to stack space back 1112 01:04:12,110 --> 01:04:13,430 to your parent, right? 1113 01:04:13,430 --> 01:04:14,680 Because it's popped off. 1114 01:04:17,340 --> 01:04:22,990 So if you think about a C or C++ execution, let's say we 1115 01:04:22,990 --> 01:04:25,940 have this call structure here. 1116 01:04:25,940 --> 01:04:30,820 A really cannot see any of the stack space of B, C, D or E. So 1117 01:04:30,820 --> 01:04:33,130 this is what A gets to see. 1118 01:04:33,130 --> 01:04:36,350 And B, meanwhile, can see A's space, because that's down on 1119 01:04:36,350 --> 01:04:39,700 the stack, but it can't see C, D or E. Particularly if you're 1120 01:04:39,700 --> 01:04:42,490 executing this serially, it can't see C because C hasn't 1121 01:04:42,490 --> 01:04:45,860 executed yet when B executes. 1122 01:04:45,860 --> 01:04:49,190 However, C, it turns out, is the same thing. 1123 01:04:49,190 --> 01:04:51,310 It can't see any of the variables that might be 1124 01:04:51,310 --> 01:04:54,270 allocated in the space for B when it's 1125 01:04:54,270 --> 01:04:55,800 executing here on the stack.
1126 01:04:55,800 --> 01:04:58,660 You can see them in the heap, but not on the stack, because 1127 01:04:58,660 --> 01:05:02,250 B has been popped off at that point, and so forth. 1128 01:05:02,250 --> 01:05:05,190 So this is basically the normal rule, the normal views 1129 01:05:05,190 --> 01:05:09,790 of the stack that you get in C or C++. 1130 01:05:09,790 --> 01:05:14,380 In Cilk++, you get exactly the same behavior, except that 1131 01:05:14,380 --> 01:05:20,260 multiple ones of these views may exist at the same time. 1132 01:05:20,260 --> 01:05:23,900 So if, for example, B and C are both executing at the same 1133 01:05:23,900 --> 01:05:26,690 time, they each will see their own stack 1134 01:05:26,690 --> 01:05:30,600 space and A's stack space. 1135 01:05:30,600 --> 01:05:34,220 And so the cactus stack maintains the fiction that 1136 01:05:34,220 --> 01:05:36,280 you can sort of look up and see your 1137 01:05:36,280 --> 01:05:38,280 ancestors' stack space, but now multiple such views are maintained at once. 1138 01:05:38,280 --> 01:05:41,890 It's called a cactus stack because it's kind of like a 1139 01:05:41,890 --> 01:05:47,240 tree structure upside down, like-- what's the name of that 1140 01:05:47,240 --> 01:05:49,310 big cactus out West? 1141 01:05:49,310 --> 01:05:50,105 Yes, saguaro. 1142 01:05:50,105 --> 01:05:52,490 The saguaro cactus, yep. 1143 01:05:52,490 --> 01:05:55,290 This kind of looks like that if you look at the stacks. 1144 01:05:58,110 --> 01:06:04,070 This leads to a very powerful bound on how much space your 1145 01:06:04,070 --> 01:06:05,850 program is using. 1146 01:06:05,850 --> 01:06:08,820 So normally, if you do a greedy scheduler, you could 1147 01:06:08,820 --> 01:06:11,950 end up using gobs more space than you would in a serial 1148 01:06:11,950 --> 01:06:15,250 execution, gobs more stack space. 1149 01:06:15,250 --> 01:06:18,600 In Cilk++ programs, you have a bound. 1150 01:06:18,600 --> 01:06:22,530 p times s1 is the maximum amount of stack space you'll 1151 01:06:22,530 --> 01:06:25,420 ever use, where s1 is the stack space 1152 01:06:25,420 --> 01:06:27,250 used by a serial execution. 1153 01:06:27,250 --> 01:06:29,950 So if you can keep your serial execution to a reasonable 1154 01:06:29,950 --> 01:06:32,270 amount of stack space-- and usually it does-- 1155 01:06:32,270 --> 01:06:34,920 then in parallel, you don't use more than p times that 1156 01:06:34,920 --> 01:06:36,890 amount of stack space. 1157 01:06:36,890 --> 01:06:39,890 And the proof of that is sort of by induction, which 1158 01:06:39,890 --> 01:06:43,240 basically says there's a property called the Busy 1159 01:06:43,240 --> 01:06:50,530 Leaves Property that says that if you have a leaf that's 1160 01:06:50,530 --> 01:06:54,810 being worked on but hasn't been completed-- 1161 01:06:54,810 --> 01:06:57,720 so I've indicated those by the purple and pink ones-- 1162 01:06:57,720 --> 01:07:02,720 then if it's a leaf, it has a worker executing on it. 1163 01:07:02,720 --> 01:07:06,610 And so therefore, if you look at how much stack space you're 1164 01:07:06,610 --> 01:07:09,990 using, each of these guys can trace up and they may double 1165 01:07:09,990 --> 01:07:14,550 count the stack space, but it'll still be bounded by p 1166 01:07:14,550 --> 01:07:18,040 times the depth that they're at, or p times s1, which is 1167 01:07:18,040 --> 01:07:20,620 the maximum amount. 1168 01:07:20,620 --> 01:07:23,160 So it has good space bounds.
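Written as an inequality in the same style as the work and span bounds, that guarantee is:

    S_P  <=  P * S_1

where S_1 is the stack space used by the serial execution and S_P is the stack space used by a P-worker execution; the busy-leaves argument charges each of the at most P busy leaves at most S_1 of ancestor stack space, possibly double counting shared ancestors.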
1169 01:07:23,160 --> 01:07:26,100 That's not so crucial for you folks to know as a practical 1170 01:07:26,100 --> 01:07:29,690 matter, but it would be if this didn't hold. 1171 01:07:29,690 --> 01:07:32,410 If this didn't hold, then you would have more programming 1172 01:07:32,410 --> 01:07:33,660 problems than you'll have. 1173 01:07:37,080 --> 01:07:40,780 The implications of this work stealing scheduler are 1174 01:07:40,780 --> 01:07:45,920 interesting from the linguistic point of view, 1175 01:07:45,920 --> 01:07:49,180 because you can write code like this: for i gets one 1176 01:07:49,180 --> 01:07:54,740 to a billion, spawn some subroutine foo 1177 01:07:54,740 --> 01:07:58,000 of i, and then sync. 1178 01:07:58,000 --> 01:08:01,920 So one way of executing this, the way that the work sharing 1179 01:08:01,920 --> 01:08:05,180 schedulers tend to do this, is they say oh, I've got a 1180 01:08:05,180 --> 01:08:08,600 billion tasks to do. 1181 01:08:08,600 --> 01:08:13,090 So let me create a billion tasks and now schedule them, 1182 01:08:13,090 --> 01:08:16,180 and the space just vrooms to store all those billion tasks, 1183 01:08:16,180 --> 01:08:18,100 it gets to be huge. 1184 01:08:18,100 --> 01:08:20,200 Now of course, they have some strategies they can use to 1185 01:08:20,200 --> 01:08:22,939 reduce it by bunching tasks together and so forth. 1186 01:08:22,939 --> 01:08:26,689 But in principle, you've got a billion pieces of work to do 1187 01:08:26,689 --> 01:08:30,630 even if you execute on one processor. 1188 01:08:30,630 --> 01:08:34,580 Whereas in the work stealing type of execution, what happens 1189 01:08:34,580 --> 01:08:38,180 is you execute this, in effect, depth-first. 1190 01:08:38,180 --> 01:08:42,420 So basically, you're going to execute foo of 1 and then 1191 01:08:42,420 --> 01:08:44,620 you'll return. 1192 01:08:44,620 --> 01:08:48,342 And then you'll increment i and you'll execute foo of 2, 1193 01:08:48,342 --> 01:08:49,670 and you'll return. 1194 01:08:49,670 --> 01:08:52,970 At no time are you using more than, in this case, two stack 1195 01:08:52,970 --> 01:08:58,029 frames, one for this routine here and one for foo, because 1196 01:08:58,029 --> 01:08:59,240 you basically keep going up. 1197 01:08:59,240 --> 01:09:03,430 You're using your stack up on demand, rather than creating 1198 01:09:03,430 --> 01:09:05,469 all the work up front to be scheduled. 1199 01:09:08,090 --> 01:09:09,840 So the work stealing scheduler is very good from 1200 01:09:09,840 --> 01:09:10,720 that point of view. 1201 01:09:10,720 --> 01:09:13,890 The tricky thing for people to understand is that if you're 1202 01:09:13,890 --> 01:09:16,520 executing on multiple processors, when you do a Cilk 1203 01:09:16,520 --> 01:09:21,569 spawn, the processor, the worker that you're running on, 1204 01:09:21,569 --> 01:09:25,760 is going to execute foo of 1. 1205 01:09:25,760 --> 01:09:26,859 The next statement-- 1206 01:09:26,859 --> 01:09:28,270 which would basically be incrementing the 1207 01:09:28,270 --> 01:09:30,720 counter and so forth-- 1208 01:09:30,720 --> 01:09:33,910 is executed by whatever processor comes in and steals 1209 01:09:33,910 --> 01:09:35,719 that continuation. 1210 01:09:40,090 --> 01:09:42,790 So if you had two processors, they're each going to 1211 01:09:42,790 --> 01:09:44,770 basically be executing. 1212 01:09:44,770 --> 01:09:47,620 The first processor isn't the one that executes everything in 1213 01:09:47,620 --> 01:09:48,500 this function.
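For concreteness, the loop on the slide looks roughly like this in Cilk++ (a sketch of the pattern, not code from the lecture; foo and spawn_a_billion are stand-in names, and the cilk_spawn / cilk_sync keywords require the Cilk++ compiler):

    // Sketch of the spawn loop discussed above.
    void foo(long i);             // stand-in for whatever work each iteration does

    void spawn_a_billion() {
        for (long i = 1; i <= 1000000000L; ++i) {
            // The worker that executes the spawn dives into foo(i);
            // the continuation (incrementing i, the next iteration)
            // is what sits on the deque and what a thief may steal.
            cilk_spawn foo(i);
        }
        cilk_sync;                // wait for all spawned children to finish
    }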
1214 01:09:48,500 --> 01:09:51,630 This function has its execution shared: the strands 1215 01:09:51,630 --> 01:09:53,910 are going to be shared where the first part of it would be 1216 01:09:53,910 --> 01:09:56,170 done by processor one and the latter part of it would be 1217 01:09:56,170 --> 01:09:58,270 done by processor two. 1218 01:09:58,270 --> 01:10:01,160 And then when processor one finishes this off, it might go 1219 01:10:01,160 --> 01:10:04,635 back and steal back from processor two. 1220 01:10:07,150 --> 01:10:10,040 So the important thing there is it's generating its stack 1221 01:10:10,040 --> 01:10:16,240 needs sort of on demand, rather than all up front, and that 1222 01:10:16,240 --> 01:10:21,250 keeps the amount of stack space small as it executes. 1223 01:10:21,250 --> 01:10:23,950 So the moral is it's better to steal parents from their 1224 01:10:23,950 --> 01:10:26,780 children than to steal children from their parents. 1225 01:10:30,940 --> 01:10:33,710 So that's the advantage of doing this sort of parent 1226 01:10:33,710 --> 01:10:36,150 stealing, because you're always stealing the frame which 1227 01:10:36,150 --> 01:10:40,310 is an ancestor of where that worker is working, and that 1228 01:10:40,310 --> 01:10:44,050 means resuming a function right in the middle on a 1229 01:10:44,050 --> 01:10:44,920 different processor. 1230 01:10:44,920 --> 01:10:47,110 That's kind of the magic of the technology: how do you 1231 01:10:47,110 --> 01:10:50,710 actually move a stack frame from one place to another and 1232 01:10:50,710 --> 01:10:51,960 resume it in the middle? 1233 01:10:55,560 --> 01:10:57,640 Let's finish up here with a chess lesson. 1234 01:10:57,640 --> 01:10:59,760 I promised a chess lesson, so we might as well have 1235 01:10:59,760 --> 01:11:02,440 some fun and games. 1236 01:11:02,440 --> 01:11:06,160 We have a lot of experience at MIT with chess programs. 1237 01:11:06,160 --> 01:11:14,770 We've had a lot of success, probably our closest one was 1238 01:11:14,770 --> 01:11:19,010 Star Socrates 2.0, which took second place in the world 1239 01:11:19,010 --> 01:11:23,850 computer chess championship running on an 1824 node Intel 1240 01:11:23,850 --> 01:11:28,475 Paragon, so a big supercomputer running with a 1241 01:11:28,475 --> 01:11:28,980 Cilk scheduler. 1242 01:11:28,980 --> 01:11:33,990 We actually almost won that competition, and it's a sad 1243 01:11:33,990 --> 01:11:38,450 story; maybe sometime around dinner or something I 1244 01:11:38,450 --> 01:11:41,540 will tell you the story behind it, but I'm not going 1245 01:11:41,540 --> 01:11:45,940 to tell you now why we didn't take first place. 1246 01:11:45,940 --> 01:11:50,030 And we've had a bunch of other successes over the years. 1247 01:11:50,030 --> 01:11:52,040 Right now our chess programming is dormant, we're 1248 01:11:52,040 --> 01:11:55,680 not doing that in my group anymore, but in the past, we 1249 01:11:55,680 --> 01:11:57,900 had some very strong chess playing programs. 1250 01:12:00,580 --> 01:12:05,990 So what we did with Star Socrates, which is one of our 1251 01:12:05,990 --> 01:12:11,880 programs, was we wanted to understand the Cilk scheduler. 1252 01:12:11,880 --> 01:12:14,300 And so what we did is we ran a whole bunch of different 1253 01:12:14,300 --> 01:12:19,000 positions on different numbers of processors, which ran for 1254 01:12:19,000 --> 01:12:21,540 different amounts of time.
1255 01:12:21,540 --> 01:12:25,640 We wanted to plot them all on the same chart, and here's our 1256 01:12:25,640 --> 01:12:27,340 strategy for doing it. 1257 01:12:27,340 --> 01:12:31,230 What we decided to do was a standard speedup curve. 1258 01:12:31,230 --> 01:12:34,500 So a standard speedup curve says let's plot the number of 1259 01:12:34,500 --> 01:12:40,700 processors along this axis and the speedup along that axis. 1260 01:12:40,700 --> 01:12:44,550 But in order to fit all these things on the same 1261 01:12:44,550 --> 01:12:48,110 curve, what we did was we normalized the speedup. 1262 01:12:48,110 --> 01:12:49,710 So what's the maximum possible? 1263 01:12:49,710 --> 01:12:50,700 So here's the speedup. 1264 01:12:50,700 --> 01:12:52,770 If you look at the numerator here, this is the 1265 01:12:52,770 --> 01:12:54,280 speedup, T1 over Tp. 1266 01:12:54,280 --> 01:12:58,540 What we did is we normalized by the parallelism. 1267 01:12:58,540 --> 01:13:04,360 So we said what fraction of perfect speedup can we get? 1268 01:13:04,360 --> 01:13:12,480 So a one here says that I got exactly the 1269 01:13:12,480 --> 01:13:16,470 maximum possible speedup that I can get, because the maximum 1270 01:13:16,470 --> 01:13:20,190 possible value of T1 over Tp is T1 over T infinity. 1271 01:13:20,190 --> 01:13:23,400 So that's sort of the maximum. 1272 01:13:23,400 --> 01:13:25,560 On this axis, we said how many processors are 1273 01:13:25,560 --> 01:13:26,250 you running on? 1274 01:13:26,250 --> 01:13:27,890 Well, we looked at that relative to 1275 01:13:27,890 --> 01:13:29,760 essentially the slackness. 1276 01:13:29,760 --> 01:13:33,930 So notice by normalizing, we essentially have here the 1277 01:13:33,930 --> 01:13:35,620 inverse of the slackness. 1278 01:13:35,620 --> 01:13:39,130 So 1 here says that I'm running on exactly the same 1279 01:13:39,130 --> 01:13:42,360 number of processors as my parallelism. 1280 01:13:42,360 --> 01:13:46,950 A tenth here says I've got a slackness of 10, I'm running 1281 01:13:46,950 --> 01:13:51,250 on 10 times fewer processors than the parallelism. 1282 01:13:51,250 --> 01:13:55,310 Out here, I'm saying I've got way more processors than I have 1283 01:13:55,310 --> 01:13:56,610 parallelism. 1284 01:13:56,610 --> 01:13:58,200 So I plotted all the points. 1285 01:13:58,200 --> 01:14:01,040 So it doesn't show up very well here, but all those green 1286 01:14:01,040 --> 01:14:03,630 points, there are a lot of green points here, that's our 1287 01:14:03,630 --> 01:14:07,200 performance, measured performance. 1288 01:14:07,200 --> 01:14:10,050 You can sort of see they're green there, not the best 1289 01:14:10,050 --> 01:14:11,300 color for this projector. 1290 01:14:13,910 --> 01:14:20,090 So we plot on this essentially the Work Law and the Span Law. 1291 01:14:20,090 --> 01:14:22,980 So this is the Work Law, it says linear speedup, and this 1292 01:14:22,980 --> 01:14:24,560 is the Span Law. 1293 01:14:24,560 --> 01:14:28,360 And you can see that we're getting very close to perfect 1294 01:14:28,360 --> 01:14:35,600 linear speedup as long as our slackness is 10 or greater. 1295 01:14:35,600 --> 01:14:36,250 See that? 1296 01:14:36,250 --> 01:14:38,500 It's hugging that curve really tightly. 1297 01:14:38,500 --> 01:14:48,570 As we approach a slackness of 1, you can see that it starts 1298 01:14:48,570 --> 01:14:50,740 to go away from the linear speedup curve.
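Spelled out, the two normalized axes of that chart are (in the same notation as before, just restating the normalization described):

    y  =  (T_1/T_P) / (T_1/T_infinity)  =  T_infinity / T_P      (measured speedup as a fraction of the maximum possible speedup)
    x  =  P / (T_1/T_infinity)                                    (processor count as a fraction of the parallelism, i.e. the inverse of the slackness)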
1299 01:14:53,410 --> 01:14:56,060 So for this program, if you look, it says, gee, if we were 1300 01:14:56,060 --> 01:15:00,570 running with a slackness of 10, 10 times more 1301 01:15:00,570 --> 01:15:03,550 parallelism than processors, we're getting almost perfect 1302 01:15:03,550 --> 01:15:06,050 linear speedup in the number of processors we're running on, 1303 01:15:06,050 --> 01:15:09,440 across a wide range of numbers of processors, a wide range of 1304 01:15:09,440 --> 01:15:12,410 benchmarks for this chess program. 1305 01:15:12,410 --> 01:15:15,040 And in fact, this curve is the curve. 1306 01:15:15,040 --> 01:15:17,580 This is not an interpolation here, but rather it is just 1307 01:15:17,580 --> 01:15:19,900 the greedy scheduling curve, and you can see it does a 1308 01:15:19,900 --> 01:15:24,810 pretty good job of going through all the points here. 1309 01:15:24,810 --> 01:15:26,860 Greedy scheduling does a pretty good job of predicting 1310 01:15:26,860 --> 01:15:27,850 the performance. 1311 01:15:27,850 --> 01:15:29,920 The other thing you should notice is that although things 1312 01:15:29,920 --> 01:15:33,250 are very tight down here, as you approach up here, they 1313 01:15:33,250 --> 01:15:35,360 start getting more spread. 1314 01:15:35,360 --> 01:15:38,830 And the reason is that as you start having more of the span 1315 01:15:38,830 --> 01:15:42,480 mattering in the calculation, that's where all the 1316 01:15:42,480 --> 01:15:45,510 synchronization, communication, all the 1317 01:15:45,510 --> 01:15:49,750 overhead of actually doing the mechanics of moving a frame 1318 01:15:49,750 --> 01:15:52,890 from one processor to another comes into play, so you get 1319 01:15:52,890 --> 01:15:56,360 a lot more spread as you go up here. 1320 01:15:56,360 --> 01:15:58,610 So that's just the first part of the lesson. 1321 01:15:58,610 --> 01:16:05,560 The first part was, oh, the theory works out in practice 1322 01:16:05,560 --> 01:16:06,800 for real programs. 1323 01:16:06,800 --> 01:16:12,930 If you have like 10 times more parallelism than processors, 1324 01:16:12,930 --> 01:16:14,600 you're going to do a pretty good job of 1325 01:16:14,600 --> 01:16:16,590 getting linear speedup. 1326 01:16:16,590 --> 01:16:19,620 So that says you guys should be shooting for parallelism 1327 01:16:19,620 --> 01:16:26,660 on the order of 100 for running on 12 cores. 1328 01:16:26,660 --> 01:16:29,610 Somewhere in that vicinity you should be doing pretty well if 1329 01:16:29,610 --> 01:16:31,150 you've got parallelism of 100 when you 1330 01:16:31,150 --> 01:16:33,530 measure it for your codes. 1331 01:16:33,530 --> 01:16:35,740 So we normalized by the parallelism there. 1332 01:16:38,290 --> 01:16:43,840 Now the real lesson, though, was understanding how to use 1333 01:16:43,840 --> 01:16:47,270 things like work and span to make decisions in the design 1334 01:16:47,270 --> 01:16:49,730 of our program. 1335 01:16:49,730 --> 01:16:53,550 So as it turned out, Socrates for this particular 1336 01:16:53,550 --> 01:16:57,750 competition was to run on a 512 processor Connection 1337 01:16:57,750 --> 01:17:02,680 Machine at the University of Illinois. 1338 01:17:02,680 --> 01:17:08,950 So this was in the early 1990s. 1339 01:17:08,950 --> 01:17:12,000 It was one of the most powerful machines in the 1340 01:17:12,000 --> 01:17:17,270 world, and this thing is probably more powerful today. 1341 01:17:17,270 --> 01:17:20,150 But in those days, it was a pretty powerful machine.
1342 01:17:20,150 --> 01:17:21,540 I don't know whether this thing is, but this thing 1343 01:17:21,540 --> 01:17:25,610 probably, I'm pretty sure, is more powerful. 1344 01:17:25,610 --> 01:17:28,820 So this was a big machine. 1345 01:17:28,820 --> 01:17:31,300 However, here at MIT, we didn't have a great big 1346 01:17:31,300 --> 01:17:32,950 machine like that. 1347 01:17:32,950 --> 01:17:35,240 We only had a 32 processor CM5. 1348 01:17:37,800 --> 01:17:41,090 So we were developing on a little machine expecting to 1349 01:17:41,090 --> 01:17:42,340 run on a big machine. 1350 01:17:45,050 --> 01:17:48,040 So one of the developers proposed a change to the program 1351 01:17:48,040 --> 01:17:54,310 that produced a speedup of over 20% on the MIT machine. 1352 01:17:54,310 --> 01:17:57,910 So we said, oh, that's pretty good, a 25% improvement. 1353 01:17:57,910 --> 01:18:00,990 But we did a back of the envelope calculation and 1354 01:18:00,990 --> 01:18:05,030 rejected that improvement, because we were able to use 1355 01:18:05,030 --> 01:18:10,645 work and span to predict the behavior on the big machine. 1356 01:18:13,180 --> 01:18:16,670 So let's see how that worked out, why that worked out. 1357 01:18:16,670 --> 01:18:20,180 So I've fudged these numbers so that they're easy to do the 1358 01:18:20,180 --> 01:18:22,780 math on and easy to understand. 1359 01:18:22,780 --> 01:18:25,830 The real numbers actually did sort out very, very 1360 01:18:25,830 --> 01:18:28,610 similar to what I'm saying, they just weren't round 1361 01:18:28,610 --> 01:18:30,630 numbers like I'm going to give you. 1362 01:18:30,630 --> 01:18:34,610 So the original program ran for, let's say, 65 1363 01:18:34,610 --> 01:18:37,560 seconds on 32 cores. 1364 01:18:37,560 --> 01:18:41,480 The proposed program ran for 40 seconds on 32 cores. 1365 01:18:41,480 --> 01:18:43,830 Sounds like a good improvement to me. 1366 01:18:43,830 --> 01:18:46,790 Let's go for the faster program. 1367 01:18:46,790 --> 01:18:48,940 Well, hold your horses. 1368 01:18:48,940 --> 01:18:52,480 Let's take a look at our performance model based on 1369 01:18:52,480 --> 01:18:54,880 greedy scheduling: 1370 01:18:54,880 --> 01:18:57,500 that Tp is T1 over p plus T infinity. 1371 01:18:57,500 --> 01:19:00,860 To understand how this scales, we really need to know 1372 01:19:00,860 --> 01:19:03,830 what component of each of these things is work 1373 01:19:03,830 --> 01:19:05,040 and which is span. 1374 01:19:05,040 --> 01:19:07,520 Because that's how we're going to be able to predict what's 1375 01:19:07,520 --> 01:19:09,930 going to happen on the big machine. 1376 01:19:09,930 --> 01:19:15,360 So indeed, this original program had a work of 2048 1377 01:19:15,360 --> 01:19:19,760 seconds and a span of one second. 1378 01:19:19,760 --> 01:19:23,820 Now chess, it turns out, is a non-deterministic type of 1379 01:19:23,820 --> 01:19:28,670 program where you use speculative parallelism, and 1380 01:19:28,670 --> 01:19:32,205 so in order to get more parallelism, you can sacrifice 1381 01:19:32,205 --> 01:19:34,395 and do more work versus less work. 1382 01:19:34,395 --> 01:19:39,250 So this one over here that we improved it to had less work 1383 01:19:39,250 --> 01:19:42,405 on the benchmark, but it had a longer span. 1384 01:19:46,280 --> 01:19:48,100 So it had less work but a longer span.
1385 01:19:48,100 --> 01:19:52,730 So when we actually were going to run this, well, first of 1386 01:19:52,730 --> 01:19:57,050 all, we did the calculation and it actually came out 1387 01:19:57,050 --> 01:19:57,700 pretty close. 1388 01:19:57,700 --> 01:20:00,620 I was kind of surprised how closely the theory matched. 1389 01:20:00,620 --> 01:20:03,870 On 32 processors, when you do the work-span 1390 01:20:03,870 --> 01:20:08,920 calculation, you get the 65 seconds on the 32 processor 1391 01:20:08,920 --> 01:20:12,250 machine, and here we had 40 seconds. 1392 01:20:12,250 --> 01:20:20,200 But now what happens when we scale this to the big machine? 1393 01:20:20,200 --> 01:20:22,200 Here we scaled it to 512 cores. 1394 01:20:22,200 --> 01:20:25,100 So now we take the work divided by the number of 1395 01:20:25,100 --> 01:20:29,160 processors, 512, plus 1, and that's 5 seconds for this one. 1396 01:20:29,160 --> 01:20:33,340 Here we have less work, but we now have a much larger span. 1397 01:20:33,340 --> 01:20:36,790 So we have two seconds of work per processor, but now eight 1398 01:20:36,790 --> 01:20:42,130 seconds of span, for a total of 10 seconds. 1399 01:20:42,130 --> 01:20:45,920 So had we made this quote "improvement," our code would 1400 01:20:45,920 --> 01:20:48,035 have been half as fast. 1401 01:20:50,910 --> 01:20:52,160 It would not have scaled. 1402 01:20:55,020 --> 01:21:01,300 And so the point is that work and span typically will beat 1403 01:21:01,300 --> 01:21:05,420 running times for predicting scalability of performance. 1404 01:21:05,420 --> 01:21:07,440 So you can measure a particular thing, but what you 1405 01:21:07,440 --> 01:21:11,160 really want to know is: is this thing going to scale, and 1406 01:21:11,160 --> 01:21:12,860 how is it going to scale into the future? 1407 01:21:12,860 --> 01:21:16,550 So people building multicore applications today want to 1408 01:21:16,550 --> 01:21:17,750 know, when they code it up, that it's going to scale. 1409 01:21:17,750 --> 01:21:20,450 They don't want to be told in two years that they've got to 1410 01:21:20,450 --> 01:21:24,300 recode it all because the number of cores doubled. 1411 01:21:24,300 --> 01:21:27,370 They want to have some future-proof notion that hey, 1412 01:21:27,370 --> 01:21:33,490 there's a lot of parallelism in this program. 1413 01:21:33,490 --> 01:21:37,630 So work and span, work and span, eat it, 1414 01:21:37,630 --> 01:21:39,560 drink it, sleep it. 1415 01:21:39,560 --> 01:21:45,740 Work and span, work and span, work and span, work and span, 1416 01:21:45,740 --> 01:21:47,210 work and span, OK? 1417 01:21:47,210 --> 01:21:48,460 Work and span.
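Collecting the back-of-the-envelope numbers from the chess example in one place (the proposed program's total work of 1024 seconds is inferred from the 2 seconds of work per processor on 512 processors; everything else is as stated in the lecture):

    Original:  T_1 = 2048 s, T_infinity = 1 s
               T_32  ~ 2048/32  + 1 = 65 s        T_512 ~ 2048/512 + 1 =  5 s
    Proposed:  T_1 = 1024 s, T_infinity = 8 s
               T_32  ~ 1024/32  + 8 = 40 s        T_512 ~ 1024/512 + 8 = 10 s

So the "improvement" wins on the 32-core development machine but is twice as slow on the 512-core competition machine.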