We only have four more lectures left, and what Professor Demaine and I have decided to do is give two series of lectures on sort of advanced topics. So, today and Wednesday we're going to talk about parallel algorithms: algorithms where you have more than one processor whacking away on your problem. And this is a very hot topic right now, because all of the chip manufacturers are now producing so-called multicore processors, where you have more than one processor per chip. So, knowing something about that is good.

The second topic we're going to cover is caching, and how you design algorithms for systems with cache. Up to now, we've sort of programmed everything as if there were just a single level of memory, and for some problems that's not an entirely realistic model. You'd like to have some model for how the caching hierarchy works, and how you can take advantage of it. And there's been a lot of research in that area as well. Both of those topics actually turn out to be my own area of research, so this is especially fun for me. Actually, most of it's fun anyway.

So, today we'll talk about parallel algorithms. Now, it turns out that there are lots of models for parallel algorithms, and for parallelism. Whereas for serial algorithms most people share one basic model, the one we've been using all term, sometimes called the random-access machine model, in the parallel space there's just a huge number of models, and there is no general agreement on which is best, because different machines are made with different configurations, and so on; people haven't agreed even on how parallel machines should be organized.

So, we're going to deal with a particular model, which goes under the rubric of dynamic multithreading, and which is appropriate for the multicore machines that are now being built, for shared-memory programming. It's not appropriate for what's called distributed-memory programming, particularly because there the processors aren't all able to access a common memory. For those machines, you need more involved models.
And so, let me start just by giving an example of how one would write something in this model. I'm going to give you a program for calculating the nth Fibonacci number. This is actually a really bad algorithm, because it's the exponential-time algorithm, whereas we know from week one or two that you can calculate the nth Fibonacci number in how much time? Log n time. So, this is two exponentials off what you should be able to get. OK, so here's the code.
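[The code written on the board isn't captured in the transcript. As a rough sketch of what it presumably looked like, here it is in OpenCilk-style C, with the cilk_spawn and cilk_sync keywords standing in for the lecture's spawn and sync:]

    #include <cilk/cilk.h>

    long fib(long n) {
        if (n < 2)
            return n;                    /* base case: fib(0) = 0, fib(1) = 1 */
        long x = cilk_spawn fib(n - 1);  /* child may run in parallel with its parent */
        long y = cilk_spawn fib(n - 2);  /* a second child, also spawned, per the board */
        cilk_sync;                       /* wait until all spawned children are done */
        return x + y;                    /* safe: x and y have both been computed */
    }

[Spawning the second call matches the lecture's picture, which leaves an essentially empty thread between the second spawn and the sync; in practice you could just call fib(n - 2) directly.]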
OK, so this is essentially the pseudocode we would write. And let me just explain a little bit about the couple of keywords here that we haven't seen before: in particular, spawn and sync.

So spawn, which you use as a keyword before a subroutine call, says that the subroutine you're calling can execute at the same time as its parent. So here, when we say x equals spawn fib of n minus one, we immediately go on to the next statement. And now, while we're executing fib of n minus one, we can also be executing this next statement, which itself will spawn something off. OK, and we continue, and then we hit the sync statement. And what sync says is: wait until all children are done. So it says, once you get to this point, you've got to wait until everything you spawned has completed before you execute the x plus y, because otherwise you'd be trying to compute x plus y without having computed x and y yet. OK, so that's the basic structure.

Now, notice that nowhere in here did we say how many processors we're running on. This is just describing logical parallelism, not the actual parallelism when we execute it. And so what we need is a scheduler to determine how to map this dynamically unfolding execution onto whatever processors you have available. Today we're actually going to talk mostly about scheduling, and then next time we're going to talk about specific algorithms and how you analyze them.

OK, so you can view the actual multithreaded computation, if you take a look at the parallel instruction stream, as just a directed acyclic graph. So, let me show you how that works. Normally, when we have a serial instruction stream, I look at each instruction being executed. If I'm in a loop, I'm not looking at it as a loop; I'm just looking at the sequence of instructions that actually executed. So I can draw that as a chain: before I execute one instruction, I have to execute the one before it, and before that, the one before that. At least, that's the abstraction. If you've studied processors, you know there are a lot of tricks in there for finding instruction-level parallelism, for making that serial instruction stream actually execute in parallel. But what we're mostly going to be talking about is the logical parallelism here, and what we can do in that context.

So, in this DAG, the vertices are threads, which are maximal sequences of instructions not containing parallel control. And by parallel control, I just mean spawn, sync, and the return from a spawned procedure. So, let's just mark what the threads are here. When we enter the function, we execute sequentially up to either returning or starting to do the spawn of fib of n minus one. Let's call that thread A; it includes the calculation of n minus one, right up to the point where you actually make the subroutine jump. That's thread A. Thread B is what executes from the continuation of that spawn up through the spawn of fib of n minus two, the one that computes y. Then we'd have essentially an empty thread between that spawn and the sync, which I'll ignore for now. And finally, after the sync, we have a thread that runs up to the point where we return x plus y.
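[In terms of the sketch above, the thread boundaries the lecture is describing fall like this; the A, B, C labels follow the board:]

    long fib(long n) {
        /* Thread A: from entry, through computing n - 1, up to the first spawn. */
        if (n < 2)
            return n;
        long x = cilk_spawn fib(n - 1);
        /* Thread B: the continuation of the first spawn, up to the second spawn. */
        long y = cilk_spawn fib(n - 2);
        /* (An essentially empty thread sits between here and the sync.) */
        cilk_sync;
        /* Thread C: after the sync, the addition and the return of x + y. */
        return x + y;
    }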
So basically, we're just looking at maximal sequences of instructions that are all serial. Every time I hit a parallel-control instruction, a spawn, a sync, or a return from a spawn, that terminates the current thread. So we can look at the computation as a bunch of small threads. For those of you who are familiar with threads from Java threads, or from POSIX threads, the so-called P-threads, those are sort of heavyweight, static threads. This is a much lighter-weight notion of thread that we're using in this model.

OK, so those are the vertices. Now let me map out a little bit how this works, so we can see where the edges come from. So, let's imagine we're executing fib of four. I'm going to draw a horizontal oval; that's going to correspond to a procedure execution. And in this procedure there are essentially three threads. We start out with A; our initial thread is this guy here. Then, when he executes a spawn, we're going to create a new procedure, and he's going to execute a new A recursively within that procedure. But at the same time, we're also now allowed to go on and execute B in the parent: we have parallelism here when I do a spawn. So there's an edge here that we're going to call a spawn edge, and this one is called a continuation edge, because it simply continues the procedure execution.

OK, so now at this point we have two things that can execute at the same time. Once I've executed A, I have two things that can go. This one, fib of three, may spawn another procedure here, that's a fib of two, and simultaneously it can go on and execute its B, with a continuation edge. And B, in fact, can also spawn at this point, and that one is another fib of two. And now, at this point, we can't execute C yet, even though I've spawned things off.
And the reason is that C won't execute until we've executed the sync statement, which can't happen until both of the children we spawned have completed. So C just sort of sits there waiting, and a scheduler shouldn't try to schedule it; or if it does, nothing's going to happen there.

So, we can go on. Here we could call fib of one, and a fib of one is only going to execute an A thread: if we look at the code, when n is less than two it returns immediately, so it never executes a B or a C. And similarly, this guy here does a fib of one, and this guy, I guess, executes the A of a fib of one. And maybe now this guy calls another fib of one, and this guy does a fib of zero. I keep drawing that arrow to the wrong place, OK?

And now, once these guys return, say these two return here, I can now execute this C. But I can't execute it until both of those children are done, and that B is done. So you see that we get a synchronization point here before executing C. And then similarly here: now that we've executed this and this, we can now execute this guy here, and those returns go to there. Likewise here, this guy can now execute his C, and once both of those are done, we can execute this guy up here. And then we're done; this is our final thread.

I should also have labeled these edges coming back: each one of those is a return edge. So, the three types of edges are spawn edges, return edges, and continuation edges. And by describing it in this way, I essentially get a DAG that unfolds. So rather than having just a serial execution trace, I get something where I still have some serial dependencies, where some things have to be done before other things, but there are also things that can be done at the same time.

So, how are we doing? Yeah, question? If every spawn were immediately covered by a sync? Effectively, yeah. There's actually a null thread that gets executed in there, which I hadn't bothered to show.
But yes, basically you would then not have any parallelism, because you would spawn something off but then do nothing in the parent. So it would be pretty much the same as if it had executed serially.

OK, so you can see that what we had here, in some sense, is a DAG embedded in a tree. You have a tree that's sort of the procedure structure, but within it you have a DAG, and that DAG can actually get to be pretty complicated.

OK, now that we understand that we've got an underlying DAG, I want to switch to studying the performance attributes of a particular DAG execution, so, looking at performance measures. The notation we'll use is we'll let T_P be the running time of whatever our computation is on P processors. So, T_P is: how long does it take to execute this on P processors? Now, in general, this is not going to be just one particular number, because different scheduling disciplines would lead to different values of T_P. But when we talk about the running time, we'll still use this notation, and I'll try to be careful as we go through to make sure there's no confusion about what it means in context.

There are a couple of cases, though, which are fairly well defined. One is T_1, the running time on one processor. If I were to execute this on one processor, you can imagine it's just as if I had gotten rid of the spawns and syncs and everything, and just executed it. That gives a particular running time, and we call that running time on one processor the work. It's essentially the serial time. So when we talk about the work of a computation, we just mean essentially the serial running time.

The other measure that ends up being interesting is what we call T_infinity, and this is the critical-path length, which is essentially the longest path in the DAG. So, for example, let's look at fib of four in this example, and let's assume we have unit-time threads.
I know they're not unit time, but just for the purposes of understanding this, imagine that every thread costs me one unit of time to execute. What would be the work of this particular computation? Seventeen, right, because all we do is add up the threads: three, six, nine, 12, 13, 14, 15, 16, 17. So the work is 17 in this case with unit-time threads. In general, you would add up however many instructions were in each thread.

OK, and then T_infinity is the longest path. So this is the longest sequence: even if you had an infinite number of processors, you still couldn't just do everything at once, because some things have to come before other things. But if you had an infinite number of processors, as many processors as you want, what's the fastest you could possibly execute this? A little trickier. Seven? So, where's your seven? One, two, three, four, five, six, seven, eight; yeah, eight is the longest path. So, the work and the critical-path length, as we'll see, are the key attributes of any computation, and these particular counts are exact if the threads are unit time.
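[A quick sanity check on those two numbers. The recurrences below are my own bookkeeping of the picture, assuming unit-time threads: a fib call with n at least 2 contributes its three threads A, B, C plus its two children, and a base-case call is a single thread. They reproduce the 17 and the 8 from the board:]

    #include <stdio.h>

    /* Work: total number of unit-time threads in the DAG for fib(n). */
    long work(long n) {
        if (n < 2)
            return 1;                          /* a base case runs only its A thread */
        return 3 + work(n - 1) + work(n - 2);  /* threads A, B, C plus both children */
    }

    /* Critical path: A runs first, then the first child and B start together;
     * the second child starts one step after B; C runs after everything. */
    long span(long n) {
        if (n < 2)
            return 1;
        long s1 = span(n - 1);                 /* child spawned right after A */
        long s2 = 1 + span(n - 2);             /* child spawned after B, one step later */
        long longer = (s1 > s2) ? s1 : s2;
        return 2 + longer;                     /* plus A before and C after */
    }

    int main(void) {
        printf("work(4) = %ld\n", work(4));    /* prints 17, matching the board */
        printf("span(4) = %ld\n", span(4));    /* prints 8, matching the board */
        return 0;
    }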
OK, so we can use these two measures to derive lower bounds on T_P for any P between one and infinity. So, the first lower bound we can derive is that T_P has got to be at least T_1 over P. Why is that a lower bound? Yeah? OK, you've got the right idea, but can we be a little more articulate about it? That's right, you want to use all the processors; but if I could use all the processors, why couldn't T_P still be less than this? Why does it have to be at least as big as T_1 over P? I'm just asking for a little more precision in the answer, if we're going to persuade the rest of the class that this is the lower bound. Yeah?

Yeah, that's another way of looking at it. If you were to serialize the computation, whatever things you execute on each step, you do at most P of them, so serializing it would take you up to P serial steps per step of a machine with P processors. OK, maybe a little more precise. David? Yeah, good, so let me just state this. What are we relying on? P processors can do at most P work in one step, right? In one step, they do at most P work; they can't do more than P work. And so, if they can do at most P work in one step, and the number of steps were in fact less than T_1 over P, then they would have done less than T_1 work by the end, and there's T_1 work to be done. OK, I just stated that almost as badly as all the responses I got. [LAUGHTER] P processors can do at most P work in one step, right? So if there's T_1 work to be done, the number of steps is going to be at least T_1 over P. There we go. It wasn't that hard. I've got T_1 work to do; I can knock off at most P of it on every step; how many steps? Just divide. So it's going to be at least that many.

OK, good. The other lower bound is that T_P is greater than or equal to T_infinity. Somebody explain to me why that might be true. Yeah? Right: if you could do it in a certain amount of time with P processors, you can certainly do it in that time with an infinite number of processors. Now, this is in a model where, you know, there's lots of stuff this model doesn't capture, like communication costs and interference and all sorts of things. But it's a simple model which actually works out pretty well in practice, and in it you're not going to be able to do better with P processors than you could with an infinite number of processors.
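[Putting the two bounds together in one display: any execution on P processors satisfies]

    \[ T_P \;\ge\; \max\!\left( \frac{T_1}{P},\; T_\infty \right) \]

[This combined form is what the factor-of-two comparison below plays against.]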
OK, so those are helpful bounds to understand. When we're trying to make something go faster, it's nice to know what you could possibly hope to achieve, as opposed to beating your head against a wall wondering why you can't make it go much faster. Maybe it's because one of these lower bounds is operating.

OK, well, we're interested in how fast we can go. The main reason for using multiple processors is that you hope to go faster than you could with one processor. So, we define T_1 over T_P to be the speedup on P processors. That is: how much faster is it on P processors than on one processor? If T_1 over T_P is order P, we say that we have linear speedup. Why? Because that says that if I've thrown P processors at the job, I get a speedup proportional to P; in some sense, each processor contributed, to within a constant factor, its full measure of support. If T_1 over T_P were in fact equal to P, we'd call that perfect linear speedup; here we're giving ourselves, for theoretical purposes, a little bit of a constant-factor buffer. And if T_1 over T_P is greater than P, we call that superlinear speedup.

OK, so can somebody tell me: when can I get superlinear speedup? When can I get superlinear speedup? Never. OK, why never? Yeah: if we buy these lower bounds, the first one says T_P is greater than or equal to T_1 over P, and rearranging that says T_1 over T_P is less than or equal to P. So superlinear speedup is never possible in this model. There are other models where it is possible, due to caching effects and things of that nature. But in this simple model that we're dealing with, it's not possible to get superlinear speedup.
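[The one-line version of that argument, from the first lower bound:]

    \[ \frac{T_1}{T_P} \;\le\; \frac{T_1}{T_1 / P} \;=\; P \]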
Now, the maximum possible speedup, given some amount of work and critical-path length, is what? What's the maximum possible speedup I could get over any number of processors? No, I'm saying: no matter how many processors, what's the most speedup I could get? T_1 over T_infinity, because of the second lower bound. If I threw an infinite number of processors at the problem, that's going to give me my biggest speedup. And we call that the parallelism. So the parallelism of a particular computation is essentially the work divided by the critical-path length. Another way of viewing it is that this is the average amount of work that can be done in parallel along each step of the critical path. And we often denote it by P-bar.

So, do not get confused: P-bar does not have anything to do with P. P is the number of processors you happen to be running on; P-bar is defined purely in terms of the computation you're executing, not in terms of the machine you're running it on. It's just the average amount of work that can be done in parallel along each step of the critical path. OK, questions so far? So far we're mostly just doing definitions.

OK, so it's helpful to know what the parallelism is, because there's no real point in trying to get speedup bigger than the parallelism. So if you're given a particular computation, you'll be able to say: oh, it doesn't go any faster; you're throwing more processors at it; why isn't it going any faster? And the answer could be: no more parallelism.
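[For the fib(4) DAG on the board, with unit-time threads, this number is small:]

    \[ \bar{P} \;=\; \frac{T_1}{T_\infty} \;=\; \frac{17}{8} \;\approx\; 2.1 \]

[so no matter how many processors you throw at that particular computation, you can't hope for more than about a factor-of-two speedup.]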
OK, let's see. Yeah, I think we can erase the example here. We'll talk more about this model; mostly now we're going to just talk about DAGs, and we'll talk about the programming model next time. So, let's talk about scheduling.

The goal of a scheduler is to map the computation onto P processors. And this is typically done by a runtime system, which, if you will, is an algorithm running underneath the language layer that I showed you. So the programmer designs an algorithm using spawns and syncs and so forth, and underneath that there's an algorithm that has to map that executing program onto the processors of the machine as it executes. That's the scheduler, and it's typically part of the language runtime system.

Now, it turns out that online schedulers, let me just say, are complex. They're not necessarily easy things to build. Actually, they're not too bad, but we're not going to go there, because we only have two lectures to do this. Instead, what we'll do is illustrate the ideas using offline scheduling. You'll get an idea from this of what a scheduler does, and it turns out that doing these things online is another level of complexity beyond that. And typically the online schedulers that are good these days are randomized schedulers, and they have very strong proofs of their ability to perform. But we're not going to go there; we'll keep it simple. In particular, we're going to look at a particular type of scheduler called a greedy scheduler.

So, if you have a DAG to execute, the basic rule of the scheduler is that you can't execute a node until all of the nodes that precede it in the DAG have executed. So, you've got to wait until everything before a node has executed. And a greedy scheduler just says: let's try to do as much as possible on every step. In other words, I'm never going to guess that it's worthwhile to delay doing something; if I can do something now, I'm going to do it. And so each step is going to be one of two types. The first type is what we'll call a complete step, and this is a step in which there are at least P threads ready to run, where I'm executing on P processors. So, what's a greedy strategy here? I've got P processors; I've got at least P threads. Run any P.
Yeah, the first P would make sense if you had a notion of ordering; that would be perfectly reasonable. Here, we're just going to execute any P. We might make a mistake there, because there may be a particular thread that, if we executed it now, would enable more parallelism later on, and we might not pick that one; we don't know. So basically we just execute any P, willy-nilly. There's some nondeterminism, if you will, in this step, because which P you execute may or may not be a good choice.

OK, the second type of step we're going to have is an incomplete step, and this is a situation where we have fewer than P threads ready to run. So, what's our strategy there? Execute all of them. If we're greedy, there's no point in not executing something. So: if I have at least P threads ready to run, I execute any P; if I have fewer than P threads ready to run, I execute all of them.

So, it turns out this is a good strategy. It's not a perfect strategy; in fact, the problem of scheduling a DAG optimally on P processors is NP-complete, meaning it's very difficult. So, those of you who are going to take 6.045 or 6.840, and I highly recommend those courses, and we'll talk more about that in the last lecture when we talk about what's coming up in the theory engineering concentration, you can learn there about NP-completeness, and about how you show that for certain problems there are no good algorithms that we're aware of, and what exactly that means. So this type of scheduling problem turns out to be very difficult to solve optimally.
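[To make those two rules concrete: here is a minimal offline greedy scheduler, sketched in C. The DAG encoding, the array sizes, and the fib(2) example at the bottom are my own illustration, not anything from the lecture; the scheduler keeps in-degrees and, on each step, runs min(P, ready) of the ready threads, any of them.]

    #include <stdio.h>

    #define MAXN 64

    /* DAG of unit-time threads: succ[u] lists successors of thread u, and
     * indeg[u] counts not-yet-executed predecessors (zero means ready). */
    int nsucc[MAXN], succ[MAXN][MAXN], indeg[MAXN];

    /* Greedily schedule n threads on P processors; returns the step count. */
    int greedy_schedule(int n, int P) {
        int ready[MAXN], nready = 0, steps = 0, done = 0;
        for (int u = 0; u < n; u++)
            if (indeg[u] == 0)
                ready[nready++] = u;                /* initially ready threads */
        while (done < n) {
            /* Complete step: at least P ready, run any P.
             * Incomplete step: fewer than P ready, run all of them. */
            int run = (nready < P) ? nready : P;
            int batch[MAXN], nbatch = 0;
            while (nbatch < run)
                batch[nbatch++] = ready[--nready];  /* "any P": order arbitrary */
            steps++;
            done += nbatch;
            for (int i = 0; i < nbatch; i++)        /* retire the batch, releasing successors */
                for (int j = 0; j < nsucc[batch[i]]; j++)
                    if (--indeg[succ[batch[i]][j]] == 0)
                        ready[nready++] = succ[batch[i]][j];
        }
        return steps;
    }

    int main(void) {
        /* The five threads of the fib(2) picture: 0 = A, 1 = the child A for
         * fib(1), 2 = B, 3 = the child A for fib(0), 4 = C. */
        int edges[][2] = { {0,1}, {0,2}, {2,3}, {1,4}, {3,4}, {2,4} };
        for (int i = 0; i < 6; i++) {
            int u = edges[i][0], v = edges[i][1];
            succ[u][nsucc[u]++] = v;
            indeg[v]++;
        }
        /* Work is 5 and critical path is 4, so the theorem below promises
         * at most 5/2 + 4 steps on two processors; greedy takes 4. */
        printf("steps on 2 processors: %d\n", greedy_schedule(5, 2));
        return 0;
    }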
But there's a nice theorem, due independently to Graham and Brent. It says, essentially, that a greedy scheduler executes any computation G with work T_1 and critical-path length T_infinity in time T_P less than or equal to T_1 over P plus T_infinity, on a computer with P processors. OK, so it says that I can achieve T_1 over P plus T_infinity. So, what does that say? If we take a look and compare this with our lower bounds on the runtime, how efficient is this? How does this compare with the optimal execution? Yeah, it's 2-competitive. It's within a factor of two of optimal, because this term is a lower bound and this term is a lower bound: the sum T_1 over P plus T_infinity is at most twice the maximum of the two, and that maximum is a lower bound on the runtime of any schedule. So I'm within a factor of two of whichever is the stronger lower bound in any situation. So this says greedy scheduling gets you within a factor of two of the optimal runtime on P processors. OK, does everybody see that?

So, let's prove this theorem. It's quite an elegant theorem, and it's not a hard one. One of the nice things, by the way, about this week is that nothing is very hard; it just requires you to think differently. OK, so the proof has to do with counting up how many complete steps we have, and how many incomplete steps we have.

So, we'll start with the number of complete steps. Can somebody tell me the largest number of complete steps I could possibly have? Yeah, I heard somebody mumble it back there: T_1 over P. Why is that? Yeah: once you've had that many, you've done T_1 work. On every complete step I'm getting P work done, so if I did more than T_1 over P complete steps, there would be no more work left to be done. So the number of complete steps can't be bigger than T_1 over P. OK, so that's the first piece.

OK, now we're going to count up the incomplete steps, and show their number is bounded by T_infinity. So let's consider an incomplete step, and let's see what happens. Let G prime be the subgraph of G that remains to be executed. So we'll draw a picture here; let's draw it on a new board. So here, we're going to have our graph G, and we'll actually use P equals three as our example. So, imagine that this is the graph G. I'm not showing the procedures here, because this actually is a theorem that works for any DAG; the procedure outlines are not necessary. All we care about is the threads.
555 00:50:16,000 --> 00:50:25,000 I missed one. OK, so imagine that's my DAG, 556 00:50:25,000 --> 00:50:38,000 G, and imagine that I have executed up to this point. 557 00:50:38,000 --> 00:50:47,000 Which ones have I executed? Yeah, I've executed these guys. 558 00:50:47,000 --> 00:50:57,000 So, the things that are in G prime are just the things that 559 00:50:57,000 --> 00:51:04,000 have yet to be executed. And these guys are the ones 560 00:51:04,000 --> 00:51:09,000 that are already executed. And, we'll imagine that all of 561 00:51:09,000 --> 00:51:14,000 them are unit time threads without loss of generality. 562 00:51:14,000 --> 00:51:19,000 The theorem would go through, even if each of these had a 563 00:51:19,000 --> 00:51:23,000 particular time associated with it. 564 00:51:23,000 --> 00:51:27,000 The same scheduling algorithm will work just fine. 565 00:51:27,000 --> 00:51:32,000 So, how can I characterize the threads that are ready to be 566 00:51:32,000 --> 00:51:38,000 executed? Which are the threads that are 567 00:51:38,000 --> 00:51:42,000 ready to be executed here? Let's just see. 568 00:51:42,000 --> 00:51:46,000 So, that one? No, that's not ready to be 569 00:51:46,000 --> 00:51:48,000 executed. Why? 570 00:51:48,000 --> 00:51:52,000 Because it's got a predecessor here, this guy. 571 00:51:52,000 --> 00:51:59,000 OK, so this guy is ready to be executed, and this guy is ready 572 00:51:59,000 --> 00:52:04,000 to be executed. OK, so those two threads are 573 00:52:04,000 --> 00:52:08,000 ready to be, how can I characterize this? 574 00:52:08,000 --> 00:52:12,000 What's their property? What's a graph theoretic 575 00:52:12,000 --> 00:52:17,000 property in G prime that tells me whether or not something is 576 00:52:17,000 --> 00:52:21,000 ready to be executed? It has no predecessor, 577 00:52:21,000 --> 00:52:24,000 but what's another way of saying that? 578 00:52:24,000 --> 00:52:29,000 It's got no predecessor in G prime. 579 00:52:29,000 --> 00:52:38,000 What does it mean for a node not to have a predecessor in a 580 00:52:38,000 --> 00:52:43,000 graph? Its in degree is zero, 581 00:52:43,000 --> 00:52:46,000 right? Same thing. 582 00:52:46,000 --> 00:52:56,000 OK, the threads with in degree, zero and G prime are the ones 583 00:52:56,000 --> 00:53:06,000 that are ready to be executed. OK, and if it's incomplete 584 00:53:06,000 --> 00:53:11,000 step, what do I do? I'm going to execute says, 585 00:53:11,000 --> 00:53:17,000 if it's an incomplete step, I execute all of them. 586 00:53:17,000 --> 00:53:24,000 OK, so I execute all of these. OK, now I execute all of the in 587 00:53:24,000 --> 00:53:30,000 degree zero threads, what happens to the critical 588 00:53:30,000 --> 00:53:38,000 path length of the graph that remains to be executed? 589 00:53:38,000 --> 00:53:48,000 It decreases by one. OK, so the critical path length 590 00:53:48,000 --> 00:54:00,000 of what remains to be executed, G prime, is reduced by one. 591 00:54:00,000 --> 00:54:04,000 So, what's left to be executed on every incomplete step, 592 00:54:04,000 --> 00:54:08,000 what's left to be executed always reduces by one. 593 00:54:08,000 --> 00:54:12,000 Notice the next step here is going to be a complete step, 594 00:54:12,000 --> 00:54:16,000 because I've got four things that are ready to go. 595 00:54:16,000 --> 00:54:21,000 And, I can execute them in such a way that the critical path 596 00:54:21,000 --> 00:54:24,000 length doesn't get reduced on that step. 
OK, but when I have to execute all of the ready threads, then it does reduce the critical-path length. Now, of course, both things could happen at the same time; but any time I have an incomplete step, I'm guaranteed to reduce the critical-path length by one. So that implies the number of incomplete steps is at most T_infinity. And therefore T_P is at most the number of complete steps plus the number of incomplete steps, and we get our bound.

This is sort of an amortized argument, if you want to think of it that way: at every step I'm either amortizing the step against the work, or amortizing it against the critical-path length, or possibly both, but I'm doing at least one of those on every step. And so, in the end, I just have to add up the two contributions. Any questions about that?

So this, by the way, is the fundamental theorem of all scheduling. If you ever study anything having to do with scheduling, this basic result is sort of the foundation of a huge number of things. And then what people do is gussy it up: let's do this online, with a real scheduler, etc., where everybody is trying to match these bounds, the bounds an omniscient greedy scheduler would achieve. And there are all kinds of other variations. But this is the basic theorem that pervades the whole area of scheduling.

OK, let's do a quick corollary. I'm not going to erase those boards; those are just too important. Let's not erase those; I don't want to erase that either. We're going to go back to the top. Actually, we'll put the corollary here, because it's just one line. The corollary says you get linear speedup if the number of processors that you run your job on is order the parallelism. So a greedy scheduler gives you linear speedup if you're running on essentially the parallelism or fewer processors. OK, so let's see why that is, and I hope I'll fit this in. So, P-bar is T_1 over T_infinity.
And that implies that if P is order T_1 over T_infinity, then, just bringing it around, T_infinity is order T_1 over P. Everybody with me? It's just algebra: this is the definition of the parallelism, T_1 over T_infinity, so if P is order the parallelism, then P is order T_1 over T_infinity, and turning that around says T_infinity is order T_1 over P.

And so, to continue the proof: T_P is at most T_1 over P plus T_infinity, and if the second term is order T_1 over P, the whole thing is order T_1 over P. OK, so now I have that T_P is order T_1 over P, and what we need is to compute the speedup T_1 over T_P, which is therefore order P. Does everybody see that? So what that says is that if I have a certain amount of parallelism, and I run on essentially fewer processors than that parallelism, I get linear speedup if I use greedy scheduling. If I run on more processors than the parallelism, in some sense I'm being wasteful, because I can't possibly get enough speedup to justify those extra processors. So understanding the parallelism of a job gives you a sort of limit on the number of processors you want to have. And, in fact, I can achieve that.
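[The corollary's chain of reasoning, in one display:]

    \[ P = O\!\left(\frac{T_1}{T_\infty}\right) \;\Rightarrow\; T_\infty = O\!\left(\frac{T_1}{P}\right) \;\Rightarrow\; T_P \le \frac{T_1}{P} + T_\infty = O\!\left(\frac{T_1}{P}\right) \;\Rightarrow\; \frac{T_1}{T_P} = \Omega(P) \]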
Question? Yeah, really, in some sense, this is saying the speedup should be omega of P; yeah, that's fine. It's a question of, so ask again. No, no, that's only if it's bounded above by a constant. T_1 and T_infinity aren't constants; they're variables here. We are doing multivariable asymptotic analysis, so any of these quantities can be a function of the others, and can grow as much as we want. So when we say we're given a particular computation, we're really not given one fixed number; we're given a whole class of DAGs of various sizes. So I can look at growth: here we're talking about the growth of the runtime T_P as a function of T_1 and T_infinity. So I am talking about things that are growing here, OK?

OK, so let's put this to work. In fact, now I'm going to tell you a little bit about my own research, and how we used this in some of the work that we did. So, we've developed a dynamic multithreaded language called Cilk, spelled with a C because it's based on the language C. And it's not an acronym; it's because silk is like nice threads. Although, at one point, my students had a competition for what the acronym Cilk could mean, and the winner, it turns out, was Charles' Idiotic Linguistic Kluge. So anyway, if you want to take a look at it, you can find some material on it.

OK, and what Cilk uses is actually one of these more complicated schedulers: a randomized online scheduler. If you look at its expected runtime on P processors, it achieves T_1 over P plus O of T_infinity, provably. And empirically, if you actually look at the runtimes you get, to find out what's hidden in that big O, it turns out to be T_1 over P plus T_infinity with the constants very close to one. So, no guarantees, but this turns out to be a pretty good bound. Sometimes you see a coefficient on T_infinity that's up maybe close to four or something, but generally you don't see anything much bigger than that, and mostly, if you do a linear-regression curve fit, you get that the constant is close to one.

And so, if you use this formula as a model for your runtime, you get near-perfect linear speedup whenever the number of processors you're running on is much less than your average parallelism, which, of course, is the same thing as saying T_infinity is much less than T_1 over P. So what happens here is that when P is much less than P-bar, that is, when T_infinity is much less than T_1 over P, the T_infinity term ceases to matter very much, and you get very good speedup; in fact, almost perfect speedup.
723 01:05:36,000 --> 01:05:42,357 So, each processor gives you another processor's work as long 724 01:05:42,357 --> 01:05:48,503 as you are in the range where the number of processors is much 725 01:05:48,503 --> 01:05:52,211 less than the parallelism. 726 01:05:52,211 --> 01:05:58,463 Now, with this language, many years ago, which now seems like 727 01:05:58,463 --> 01:06:03,231 a very long time ago, OK, it turned out we competed. 728 01:06:03,231 --> 01:06:08,000 We built a bunch of chess programs. 729 01:06:08,000 --> 01:06:11,962 And, among our programs were Starsocrates 730 01:06:11,962 --> 01:06:16,312 and Cilkchess, and we also had several others. 731 01:06:16,312 --> 01:06:19,501 And these were, I would call them, 732 01:06:19,501 --> 01:06:22,014 world-class. In particular, 733 01:06:22,014 --> 01:06:26,750 we tied for first in the 1995 World Computer Chess 734 01:06:26,750 --> 01:06:32,066 Championship in Hong Kong, and then we had a playoff and 735 01:06:32,066 --> 01:06:35,860 we lost. It was really a shame. 736 01:06:35,860 --> 01:06:39,157 We almost won, running on a big parallel 737 01:06:39,157 --> 01:06:41,778 machine. Incidentally, 738 01:06:41,778 --> 01:06:47,020 some of you may know about the Deep Blue chess playing program. 739 01:06:47,020 --> 01:06:52,008 That was the last time, before they faced then world champion 740 01:06:52,008 --> 01:06:55,728 Kasparov, that they competed against other programs. 741 01:06:55,728 --> 01:06:58,941 They tied for third in that tournament. 742 01:06:58,941 --> 01:07:03,000 OK, so we actually out-placed them. 743 01:07:03,000 --> 01:07:07,159 However, in the head-to-head competition, we lost to them. 744 01:07:07,159 --> 01:07:11,099 So we had one loss in the tournament up to the point of 745 01:07:11,099 --> 01:07:13,872 the finals. They had a loss and a draw. 746 01:07:13,872 --> 01:07:17,375 Most people aren't aware that Deep Blue, in fact, 747 01:07:17,375 --> 01:07:21,608 was not the reigning World Computer Chess Champion when 748 01:07:21,608 --> 01:07:24,964 they faced Kasparov. The reason that they faced 749 01:07:24,964 --> 01:07:30,000 Kasparov was because IBM was willing to put up the money. 750 01:07:30,000 --> 01:07:38,029 OK, so we developed these chess programs, and the way we 751 01:07:38,029 --> 01:07:44,747 developed them, let me in particular talk about 752 01:07:44,747 --> 01:07:51,172 Starsocrates. We had this interesting anomaly 753 01:07:51,172 --> 01:07:55,699 come up. We were running on a 32 754 01:07:55,699 --> 01:08:03,000 processor computer at MIT for development. 755 01:08:03,000 --> 01:08:07,463 And, we had access to a 512 processor computer for the 756 01:08:07,463 --> 01:08:11,505 tournament at NCSA at the University of Illinois. 757 01:08:11,505 --> 01:08:16,389 So, we had this big machine. Of course, they didn't want to 758 01:08:16,389 --> 01:08:20,852 give it to us very much, but we had the same kind of machine, 759 01:08:20,852 --> 01:08:22,872 just a small one, at MIT. 760 01:08:22,872 --> 01:08:27,756 So, we would develop on the small machine, and occasionally we'd be able 761 01:08:27,756 --> 01:08:31,126 to run on the big one, and the big machine was what we were 762 01:08:31,126 --> 01:08:37,719 really developing for. So, let me show you sort of the 763 01:08:37,719 --> 01:08:40,000 anomaly that came up, OK?
764 01:08:48,000 --> 01:08:55,974 So, we had a version of a program that I'll call the 765 01:08:55,974 --> 01:09:02,854 original program, OK, and we had an optimized 766 01:09:02,854 --> 01:09:12,236 program that included some new features that were supposed to 767 01:09:12,236 --> 01:09:20,992 make the program go faster. And so, we timed it on our 32 768 01:09:20,992 --> 01:09:28,341 processor machine. And, it took us 65 seconds to 769 01:09:28,341 --> 01:09:33,839 run it. OK, and then we timed this new 770 01:09:33,839 --> 01:09:37,340 program. So, I'll call that T prime 771 01:09:37,340 --> 01:09:42,261 sub 32 on our 32 processor machine, and it ran in 40 772 01:09:42,261 --> 01:09:45,952 seconds on this particular benchmark. 773 01:09:45,952 --> 01:09:50,399 Now, let me just say, I've lied about the actual 774 01:09:50,399 --> 01:09:54,375 numbers here to make the calculations easy. 775 01:09:54,375 --> 01:10:01,000 But, the same idea happened. Just the numbers were messier. 776 01:10:01,000 --> 01:10:07,275 OK, so this looks like a significant improvement in 777 01:10:07,275 --> 01:10:12,421 runtime, but we rejected the optimization. 778 01:10:12,421 --> 01:10:19,574 OK, and the reason we rejected it is because we understood 779 01:10:19,574 --> 01:10:24,846 the issues of work and critical path. 780 01:10:24,846 --> 01:10:30,368 So, let me show you the analysis that we did, 781 01:10:30,368 --> 01:10:33,813 OK? So the analysis, 782 01:10:33,813 --> 01:10:37,441 it turns out, if we looked at our 783 01:10:37,441 --> 01:10:42,089 instrumentation, the work in this case was 784 01:10:42,089 --> 01:10:46,170 2,048. And, the critical path was one 785 01:10:46,170 --> 01:10:50,931 second, whereas over here with the optimized 786 01:10:50,931 --> 01:10:55,125 program, the work was, in fact, 1,024. 787 01:10:55,125 --> 01:11:00,000 But the critical path was eight. 788 01:11:00,000 --> 01:11:07,375 So, if we plug into our simple model here, the one I have up 789 01:11:07,375 --> 01:11:14,625 there with the approximation there, I have T_32 is equal to 790 01:11:14,625 --> 01:11:20,625 T_1 over 32 plus T infinity, and that's equal to, 791 01:11:20,625 --> 01:11:25,250 well, the work is 2,048 divided by 32. 792 01:11:25,250 --> 01:11:30,125 What's that? 64, good, plus the critical 793 01:11:30,125 --> 01:11:37,625 path, one, that's 65. So, that checks out with what 794 01:11:37,625 --> 01:11:40,000 we saw. OK, in fact, 795 01:11:40,000 --> 01:11:43,875 we did that, and it checked out. 796 01:11:43,875 --> 01:11:48,375 OK, it was very close. OK, over here, 797 01:11:48,375 --> 01:11:54,875 T prime sub 32 is T prime sub 1 over 32 plus T infinity 798 01:11:54,875 --> 01:12:02,750 prime, and that's equal to 1,024 divided by 32, which is 32, plus eight, 799 01:12:02,750 --> 01:12:07,981 the critical path here. That's 40. 800 01:12:07,981 --> 01:12:13,377 So, that checked out too. So, now what we did is we said, 801 01:12:13,377 --> 01:12:17,596 OK, let's extrapolate to our big 802 01:12:17,596 --> 01:12:21,422 machine. How fast are these things going 803 01:12:21,422 --> 01:12:25,445 to run on our big machine? Well, for that, 804 01:12:25,445 --> 01:12:29,958 we want T of 512. And, that's equal to T_1 over 805 01:12:29,958 --> 01:12:36,913 512 plus T infinity. And so, what's 2,048 divided by 806 01:12:36,913 --> 01:12:41,079 512? It's four, plus T infinity is 807 01:12:41,079 --> 01:12:44,235 one. That's equal to five. 808 01:12:44,235 --> 01:12:48,401 So, it would go quite a bit faster on this machine.
809 01:12:48,401 --> 01:12:55,471 But here, T prime of 512 is equal to T prime sub 1 over 512 810 01:12:55,471 --> 01:13:03,172 plus T infinity prime, which is equal to, well, 1,024 divided by 811 01:13:03,172 --> 01:13:11,000 512 is two, plus the critical path of eight, that's ten. 812 01:13:11,000 --> 01:13:15,913 OK, and so, you see that on the big machine, we would have been 813 01:13:15,913 --> 01:13:19,163 running twice as slow had we adopted that, 814 01:13:19,163 --> 01:13:23,205 quote, "optimization", OK, because we had run out of 815 01:13:23,205 --> 01:13:27,009 parallelism, and this was making the critical path longer. 816 01:13:27,009 --> 01:13:31,447 We needed to have a way of doing it where we could reduce 817 01:13:31,447 --> 01:13:34,459 the work. Yeah, it's good to reduce the 818 01:13:34,459 --> 01:13:39,135 work, but not if lengthening the critical path gets rid of the 819 01:13:39,135 --> 01:13:45,000 parallelism that we hoped to be able to use during the run. 820 01:13:45,000 --> 01:13:48,186 So, it's twice as slow, OK, twice as slow. 821 01:13:48,186 --> 01:13:52,927 So the moral is that the work and critical path length predict 822 01:13:52,927 --> 01:13:56,968 the performance better than the execution time alone, 823 01:13:56,968 --> 01:14:00,000 OK, when you look at scalability. 824 01:14:00,000 --> 01:14:03,600 And a big issue on a lot of these machines is scalability; 825 01:14:03,600 --> 01:14:07,263 not always, sometimes you're not worried about scalability. 826 01:14:07,263 --> 01:14:10,421 Sometimes you just care about the machine you have. Had we been running in the 827 01:14:10,421 --> 01:14:14,210 competition on a 32 processor machine, we would have accepted 828 01:14:14,210 --> 01:14:16,926 this optimization. It would have been a good 829 01:14:16,926 --> 01:14:19,515 trade-off. OK, but because we knew that we 830 01:14:19,515 --> 01:14:22,800 were running on a machine with a lot more processors, 831 01:14:22,800 --> 01:14:26,336 and that we were close to running out of the parallelism, 832 01:14:26,336 --> 01:14:29,936 it didn't make sense to be increasing the critical path at 833 01:14:29,936 --> 01:14:33,726 that point, because that was just reducing the parallelism of 834 01:14:33,726 --> 01:14:36,887 our calculation. OK, 835 01:14:36,887 --> 01:14:39,041 any questions about that first? No? 836 01:14:39,041 --> 01:14:40,626 OK. Next time, now that we 837 01:14:40,626 --> 01:14:44,111 understand the model for execution, we're going to start 838 01:14:44,111 --> 01:14:47,786 looking at the performance of particular algorithms when we 839 01:14:47,786 --> 01:14:50,701 code them up in a dynamic, multithreaded style, 840 01:14:50,701 --> 01:14:53,000 OK?
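The whole anomaly can be replayed in a few lines of Python; this sketch simply plugs the lecture's simplified work and critical-path numbers into the T_P = T_1/P + T_inf model:

    def t(work, span, p):
        # The model used above: T_P ~ T_1/P + T_infinity.
        return work / p + span

    orig = (2048, 1)    # original program: work 2,048, critical path 1
    opt = (1024, 8)     # "optimized" program: work 1,024, critical path 8

    for p in (32, 512):
        print(f"P = {p:3d}: original {t(*orig, p):4.0f} s, "
              f"optimized {t(*opt, p):4.0f} s")

    # P =  32: original 65 s, optimized 40 s  -> looks like a clear win
    # P = 512: original  5 s, optimized 10 s  -> twice as slow on the
    #                                            tournament machine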