The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: I guess [OBSCURED]. Let's get going. OK, should I introduce you?

BRADLEY KUSZMAUL: If you want. I can introduce myself.

PROFESSOR: We have Bradley Kuszmaul, who has been working on Cilk. It's a very interesting parallel programming project that has been going for a while, and there are a lot of interesting things he's developed, with multicore becoming very important.

BRADLEY KUSZMAUL: So how many of you people have ever heard of Cilk? Have used it? Those of you who have used it may find this talk old material, or whatever.

So Cilk is a system that runs on a shared-memory multiprocessor. This is not like the system you've been programming for this class. In this kind of machine you have processors, which each have a cache, some sort of a network, and a bunch of memory, and when the processors do memory operations they are all in the same address space. Typically the memory system provides some sort of coherence, like strong consistency or maybe release consistency.

We're interested in the case where the distance from processors to other processors, and from a processor to memory, may be nonuniform, so it's important to use the cache well in this kind of machine, because you can't just ignore the cache.

So the technology that I'm going to talk about for this kind of system is called Cilk. Cilk is a C-based language, it does dynamic multithreading, and it has a provably good runtime system; I'll talk about what all of those mean. Cilk runs on shared-memory machines like Suns and SGIs and, well, you probably can't find AlphaServers anymore. It runs on SMPs like the ones that are in everybody's laptops now.
There have been several interesting applications written in Cilk, including virus shell assembly, graphics rendering, and n-body simulation. We did a bunch of chess programs, because they were sort of the raison d'etre for Cilk.

One of the features of Cilk is that it automatically manages a lot of the low-level issues. You don't have to do load balancing, and you don't have to write protocols. You basically write programs that look a lot more like ordinary serial programs, instead of saying: first I'm going to do this, and then I'm going to set this variable, and then somebody else is going to read that variable. That's a protocol, and those are very difficult to get right.

AUDIENCE: [OBSCURED]

BRADLEY KUSZMAUL: Yeah, I'll mention that a little bit later. We had an award-winning chess player.

So to explain what Cilk's about, I'll talk about Fibonacci. Now, Fibonacci-- this is just to review in case you don't know C. You all know C, right? Fibonacci is the function where each number is the sum of the previous two Fibonacci numbers, and this is an implementation that basically does that computation directly. For Fibonacci of n: if n is less than 2, it's just n, so Fibonacci of zero is zero and Fibonacci of 1 is 1. From 2 on you have to do the recursion: you compute Fibonacci of n minus 1 and Fibonacci of n minus 2 and sum them together, and that's Fibonacci of n.

One observation about this function is that it's a really slow implementation of Fibonacci. You all know how to do this faster? How fast can you do Fibonacci? And how fast is this one?

AUDIENCE: [OBSCURED]

BRADLEY KUSZMAUL: So for those of you who don't know: you certainly know how to compute Fibonacci in linear time, just by keeping track of the most recent two. 1, 1, 2, 3, 5 -- you just do it. This is exponential time, and there's an algorithm that does it in logarithmic time. So this implementation is doubly, exponentially bad. But it's good as a didactic example because it's easy to understand.
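For reference, the serial C implementation being described is essentially this:

    /* Exponential-time Fibonacci: fib(n) = n for n < 2,
       otherwise fib(n-1) + fib(n-2). */
    int fib(int n)
    {
        if (n < 2) {
            return n;
        } else {
            int x = fib(n - 1);
            int y = fib(n - 2);
            return x + y;
        }
    }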
So to turn this into Cilk we just add some keywords, and I'll talk about what the keywords are in a minute. The key thing to understand is that if you delete the keywords you have a C program, and Cilk programs have the property that one of the legal semantics for the Cilk program is the C program that you get by deleting the keywords. Now, there are other possible semantics you could get -- not for this function; this function always produces the same answer because there are no race conditions in it -- but for programs that have races, you may have other semantics that the system could provide. This kind of language extension, where you can delete the extensions and get a correct implementation of the parallel program, is called a faithful extension. A lot of languages like OpenMP have the property that if you add these directives and then delete them, it can change the semantics of your program, so you have to be very careful. Now, if you're careful about programming OpenMP you can make it faithful, so that it has this property, but that's not always the case. Sure.

AUDIENCE: Is it built on the different [OBSCURED]?

BRADLEY KUSZMAUL: C77. No, C89.

AUDIENCE: OK, so there's no presumption about aliasing involved? It's assumed that the [OBSCURED].

BRADLEY KUSZMAUL: So the issue of restricted pointers, for example?

AUDIENCE: Restricted pointers.

BRADLEY KUSZMAUL: So Cilk turns out to work with C99 as well.

AUDIENCE: But is the presumption, though, for a pointer that it could alias?

BRADLEY KUSZMAUL: The Cilk compiler makes no assumptions about that. If you write a program and the back end-- Cilk works, and I'll talk about this in a couple of minutes, by transforming this into a C program; when you run it on one processor, it's just the original C program, in effect.
And so if you have a dialect of C that has restricted pointers and a compiler that--

PROFESSOR: You're making the assumption that if you make a mistake--

BRADLEY KUSZMAUL: If you make a mistake, the language doesn't stop you from making the mistake.

AUDIENCE: Well, but in C89 there's not a mistake. There's no assumption about aliasing, right? It could alias. So if I said--

BRADLEY KUSZMAUL: If because of the aliasing you write a program that has a race condition in it, which is erroneous--

AUDIENCE: It wouldn't be valid?

BRADLEY KUSZMAUL: No, it'd still be valid. It would just have a race in it, and you would have a non-determinate result.

PROFESSOR: It may not do what you want.

BRADLEY KUSZMAUL: It may not do what you want, but one of the legal executions of that parallel program is the original C program.

AUDIENCE: So there's no extra--

BRADLEY KUSZMAUL: At the level of doing analysis, Cilk doesn't do analysis. Cilk is a compiler that compiles this language, and the semantics are what they are -- and I'll talk about the semantics. The spawn means you can run the function in parallel, and if that doesn't give you the same answer every time, it's not the compiler's fault.

AUDIENCE: [OBSCURED]

BRADLEY KUSZMAUL: Pardon?

AUDIENCE: There has to be some guarantee [OBSCURED].

PROFESSOR: How in a race condition you get some [OBSCURED].

BRADLEY KUSZMAUL: One of the legal things the Cilk system could do is just run that program. Now, if you're running it on multiple processors, that's not what happens, because the other thing is there are some performance guarantees we get, so there's actually parallelism. But on one processor, in fact, that's exactly what the execution does.
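For concreteness, here is the Fibonacci example in MIT Cilk-5 syntax -- a reconstruction of the standard example rather than a verbatim copy of the slide. The keywords cilk, spawn, and sync are the ones explained next; deleting them gives back the serial C program:

    #include <stdio.h>
    #include <stdlib.h>

    cilk int fib(int n)
    {
        if (n < 2) {
            return n;
        } else {
            int x, y;
            x = spawn fib(n - 1);   /* child may run in parallel with the parent */
            y = spawn fib(n - 2);
            sync;                   /* wait for both spawned children to finish */
            return x + y;
        }
    }

    cilk int main(int argc, char *argv[])
    {
        int n = atoi(argv[1]);
        int result;
        result = spawn fib(n);
        sync;
        printf("fib(%d) = %d\n", n, result);
        return 0;
    }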
So Cilk does dynamic multithreading, and this is different from pthreads, for example, where you have a very heavyweight thread that costs tens of thousands of instructions to create. Cilk threads are really small, so in this program there's a Cilk thread that runs basically from when fib starts to here, and then--

I feel like there's a missing slide in here. I didn't tell you about spawn. OK, well, let me tell you about spawn. What the spawn means is that this function can run in parallel. That's very simple. What the sync means is that all the functions that were spawned off in this function have to finish before this function can proceed. So in a normal execution of C, when you call a function, the parent stops. In Cilk the parent can keep running: while that's running, the parent can spawn off this, and then the sync happens and now the parent has to stop. And this keyword, cilk, basically just says that this function can be spawned.

AUDIENCE: Is the sync in that scope or the children's scope?

BRADLEY KUSZMAUL: The sync is scoped within the function. So you could have a for loop that spawned off a whole bunch of stuff.

AUDIENCE: You could call the function instead of moving some spawns, but then [OBSCURED] in the sync.

BRADLEY KUSZMAUL: There's an implicit sync at the end of every function. So Cilk functions are strict.

PROFESSOR: [NOISE]

BRADLEY KUSZMAUL: You know, there are children down inside here, but this function can't return-- well, if I had omitted the sync, even down in some leaf the compiler puts one in before the function returns. There are some languages where somehow the intermediate function can go away and then you can sync directly with your grandparent.

AUDIENCE: Otherwise it would stop.
BRADLEY KUSZMAUL: So this gives you this dag: you have the part of the program that runs up to the first spawn, the part that runs between the spawns, the part that runs after the last spawn up to the sync, and then from there to the return.

I've got this drawing that shows this function sort of running. First the purple code runs and it gets to the spawn; it spawns off this guy, but now the second piece of code can start running. He does a spawn, so these two are running in parallel. Meanwhile, this guy has started. This is a base case, so he's not going to do anything. It just feels like there's something missing in this slide. Oh well. Essentially, going back to here, this part of the code couldn't run until after the sync, so this thing is sitting here waiting. So when these guys finally return, then this can run. This guy is stuck here. He runs, and he runs. These two return and the value comes up here, and now basically the function is done.

One observation here is that there's no mention of the number of processors in this code. You haven't specified how to schedule or how many processors there are. All you've specified is this directed acyclic graph that unfolds dynamically, and it's up to us to schedule it onto the processors. So this code is processor-oblivious: it's oblivious to the number of processors.

PROFESSOR: But because we're using the language, we probably have to create, write as many spawns depending on--

BRADLEY KUSZMAUL: No, what you do is you write as many spawns as you can. You expose all the parallelism in your code. You want this dag to have millions of threads in it concurrently, and then it's up to us to schedule that efficiently. So it's a different mindset: not "I have 4 processors, let me create 4 things to do," but "I have 4 processors, let me create a million things to do."
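As a hypothetical illustration of that mindset (this is not code from the lecture; the names walk and do_work are made up): a loop over N items expressed as recursive spawns, so the dag exposes all N-fold parallelism no matter how many processors you have.

    /* assumed serial leaf computation (placeholder) */
    void do_work(int i) { /* ... real work on item i ... */ }

    /* Recursively split the index range [lo, hi) and spawn both halves,
       instead of creating one task per processor. */
    cilk void walk(int lo, int hi)
    {
        if (hi - lo == 1) {
            do_work(lo);
        } else if (hi - lo > 1) {
            int mid = lo + (hi - lo) / 2;
            spawn walk(lo, mid);
            spawn walk(mid, hi);
            sync;
        }
    }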
And then the Cilk scheduler guarantees to give you-- you have 4 processors, I'll give you 4-fold speedup.

PROFESSOR: I guess what you'd like to avoid is the mindset where the programmer has to change or keep tuning the parameters for performance.

BRADLEY KUSZMAUL: There's some tuning that you do in order to make the leaf code efficient. There's some overhead for doing function calls -- it's a small overhead; it turns out the cost of a spawn is like three function calls. If you were actually trying to make this code run faster, you'd make the base case bigger, trying to speed things up a little bit at the leaves of this call tree. So there's this call tree, and inside the call tree is this dag.

It supports C's rule for pointers, for whatever dialect you have. If you have a pointer into the stack and then you call, you're allowed to use that pointer in C, and in Cilk you are as well. If you have a parallel thing going on -- where normally in C, A would call B, B returns, and then A calls C and D -- then C and D can refer to anything on A's frame, but C can't legally refer to something on B's, and the same rule applies in Cilk. We have a data structure that implements this; it's called a cactus stack, after the saguaro cactus -- that's the imagery -- and it lets you support that rule.

There are some advanced features in Cilk that have to do with speculative execution, and I'm going to skip over those today, because it turns out that 99% of the time you don't need that stuff.

We have some debugger support, so if you've written code that relied on some semantics that maybe you didn't like when you went to the parallel world, you'd like to find out. This is a tool that basically takes a Cilk program and an input data set, runs it, and tells you: is there any schedule that I could have chosen -- so it's that directed acyclic graph, and there's a whole bunch of possible schedules I could have chosen.
Is there any schedule that changes the order of two concurrent memory operations, where one of them is a write? We call this tool the Nondeterminator, because it finds all the determinacy races in your program. And the Cilk race detector is guaranteed to find those. There are a lot of race detectors where, if the race doesn't actually occur -- you have two things that are logically in parallel, but they don't actually run on different processors -- a lot of race detectors out there in the world won't report the race. So you get false negatives, and there are a bunch of false positives that show up. This basically only gives you the real ones.

AUDIENCE: Might that be an indicator that there might still be data races?

BRADLEY KUSZMAUL: So this doesn't analyze the program; it analyzes the execution. It's not trying to solve some NP-complete problem or Turing-complete problem. And so this reduces the problem of finding data races to a situation just like when you're trying to do code release and quality control for serial programs: you write tests. If you don't test your program, you don't know what it does, and that's the same property here. If you do find some race someday later, then you can write a test for it and know that you're testing to make sure that race didn't creep back into your code. That's what you want out of a software release strategy.

AUDIENCE: [NOISE]

BRADLEY KUSZMAUL: If you start putting in syncs, then maybe the race goes away because of that. But if you just put in instrumentation to try to figure out what's going on, it's still there. And the race detector sort of says: this variable in this function, and this variable in that function. You look at it and say, how could that happen? Finally you figure it out and you fix it, and then, if you're trying to do a software release, you build a regression test that will verify it with that input.

AUDIENCE: What if you have a situation where the spawn graph falls into a terminal--
--so it's not a race, but the spawn is there and it spawns a graph a little bit deeper?

BRADLEY KUSZMAUL: Yes. For example, our race detector understands locks. Part of the rule is that it doesn't report a race between two memory accesses if there was a lock that they both held in common. Now, you can still write buggy programs, because you can essentially lock, read the memory, unlock, then lock and write the memory; now the interleaving happens and there's a race. So the assumption of this race detector is that if you put locks in there, you've thought about it. This is for finding races that you forgot about, rather than races that you ostensibly thought about.

There are some races that are actually correct. For example, in the chess programs there's this big table that remembers all the chess positions that have been seen. If you don't get the right answer out of the table, it doesn't matter, because you search it again anyway. Not getting the right answer means you don't get any answer: you look something up, it's not there, so you search again. If you'd just waited a little longer, maybe somebody else would have put the value in and you could have saved a little work. In that case -- it turns out there's no parallel way to do that -- I'm willing to tolerate the race because it gives me performance. So you have what we call fake locks, which are basically things that look like lock calls but don't do anything, except tell the race detector: pretend there was a lock held in common. Yeah?

AUDIENCE: [UNINTELLIGIBLE PHRASE]

BRADLEY KUSZMAUL: If it says there's no race, it means that for every possible scheduling that--

AUDIENCE: [UNINTELLIGIBLE PHRASE]

BRADLEY KUSZMAUL: Well, you have that dag. Imagine running it on one processor. There are a lot of possible orders in which to run the dag.
And the rule is: was there a load and a store, or a store and a store, that switched orders in some possible schedule? That's the definition.

AUDIENCE: So in practice, sorry, one of the [INAUDIBLE] techniques is lock-free updates. Assuming, depending on the processor, that you have atomic writes, we want to deal with that data [UNINTELLIGIBLE] in the background--

BRADLEY KUSZMAUL: Those protocols are really hard to get right, but yes, it's an important trick.

AUDIENCE: Certainly [INAUDIBLE].

BRADLEY KUSZMAUL: So to convince the race detector not to complain, you put fake locks around it. You've programmed a sophisticated algorithm; it's up to you to get the details right.

The other property of this race detector is that it's fast: it runs in almost linear time. A lot of the race detectors that you find out there run in quadratic time -- if you want to run a million instructions, they have to compare every instruction to every other instruction. It turns out we don't have to do that. We run in time n times alpha of n, where alpha is the inverse Ackermann function. Anybody remember that from the union-find algorithm? It grows so slowly that it's essentially linear time. We actually now have a truly linear-time one that has performance advantages.

So let me do a little theory and practice. In Cilk we have some fundamental complexity measures that we worry about. We're interested in knowing, and being able to predict, the runtime of a Cilk program on P processors. So we want to know T sub P, the execution time on P processors; that's the goal. What we've got to work with is some directed acyclic graph for a particular input set -- if the program is deterministic and everything else, it's a well-defined graph -- and we can come up with some basic measures of this graph. T sub 1 is the work of the graph, which is the total time it would take to run that graph on one processor; or, if you assume these things all cost unit time, just the number of nodes.
So for this graph, what's the work? I heard "-teen" something-- 18? And the critical path is the longest path; if these nodes weren't unit time you'd have to weight them according to how much time they actually run. So what's the critical path here? 9. So I think those are right.

The lower bounds that you know, then: you don't expect the runtime on P processors to be faster than linear speedup. In this model that doesn't happen. In practice it turns out the cache does things -- you're adding more than just processors, you're adding more cache too, so all sorts of things can happen -- or maybe it means there's a better algorithm you should have used. So there are some funny things that happen if you have bad algorithms and so forth, but in this model you can't have more than linear speedup. You also can't get things done faster than the critical path. This model basically assumes that the costs of running these nodes are fixed, whereas with a real cache, changing the order of execution changes the actual costs of the nodes in the graph.

So those are the lower bounds, and the things that we want to know are speedups: that's T sub 1 over T sub P. And the parallelism of the graph is T sub 1 over T sub infinity, the work over the critical path. We've been calling this the span sometimes lately; some people call it depth. Span is easier to say than critical path, and depth has too many other meanings, so I kind of like span.

So what's the parallelism for this program? 18 over 9. We said T sub 1 was what? 18. And T sub infinity is 9. So if you had an infinite number of processors and you scheduled this as greedily as you could, it would take you 9 steps to run and you would be doing 18 things' worth of work. So on average there are two things to do: 1 plus 1 plus 1 plus 3 plus 4 plus 4 plus 1 plus 1 plus 1, divided by 9, turns out to be 2.
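In symbols, the measures and lower bounds just described, for this example dag:

    T_1 (work)                 = 18
    T_infinity (span)          = 9
    speedup on P processors    = T_1 / T_P
    parallelism                = T_1 / T_infinity = 18 / 9 = 2

    Lower bounds for any scheduler:  T_P >= T_1 / P   and   T_P >= T_infinity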
So the average parallelism, or just the parallelism, of the program is T sub 1 over T sub infinity. And this is a property that doesn't depend on the scheduler; it's a property of the program. It doesn't depend on how many processors you have.

AUDIENCE: [OBSCURED] You're saying, you're calling that the span now? Is that the one for us [OBSCURED]

BRADLEY KUSZMAUL: That's too long to say; I might as well say "critical path length." Critical path length, longest trace -- "span" is a mathematical-sounding name.

AUDIENCE: We just like to steal terminology.

BRADLEY KUSZMAUL: Well, yeah. So there's a theorem, due to Graham and Brent, that says there's some schedule that can actually achieve the sum of those two lower bounds. Linear speedup is one lower bound on the runtime and the critical path is the other, and there's some schedule that basically achieves the sum of those. How does that theorem work? Well, at each time step -- suppose we had 3 processors -- either there are at least 3 things ready to run, and in a greedy schedule you grab any 3 of them; or there are fewer than P things ready to run, like here, where these have all run and the green ones are the only 2 that are ready to go. What do you do then in a greedy schedule? You run them all.

And the argument goes: how many time steps could you execute 3 things? At most the work divided by the number of processors times, because after that you've used up all the work. And how many times could you execute fewer than P things? Well, every time you execute fewer than P things you're reducing the length of the remaining critical path, and you can't do that more than the span times. So a greedy scheduler will achieve some runtime which is within the sum of these two. It's actually the sum of these two minus 1; it turns out there has to be at least one node that counts against both the work and the critical path.
And so that means you're guaranteed to be within a factor of 2 of optimal with a greedy schedule. And it turns out that if you have a lot of parallelism compared to the number of processors -- say a graph that has million-fold parallelism and a thousand processors -- then the critical path is really small compared to the work, and with only 1000 processors the linear-speedup term is big. That means the bound is very close to the linear-speedup term alone, so essentially the corollary is that you get perfect linear speedup, asymptotically, if you have fewer processors than you have parallelism in your program. So the game here, at this level of understanding -- I haven't told you how the scheduler actually works -- is to write a program that's got a lot of parallelism, so that you can get linear speedup.

Well, the work-stealing scheduler is what we actually use. The problem is that greedy schedules can be hard to compute, especially if you imagine having a million processors and a program with billion-fold parallelism: on every clock cycle, finding something for each of the million guys to do is conceptually difficult. So instead we have a work-stealing scheduler; I'll talk about that in a second. It achieves bounds which are not quite as good as those. The bound is the same kind of thing -- it's the sum of two terms. One is the linear-speedup term, but instead of the other term being T sub infinity it's big O of T sub infinity, because you actually have to do communication sometimes if the critical path is long. Basically, you can imagine: if you have a lot of tasks and people to do them, it's easy to do that in parallel if there are no interdependencies among the tasks. But as soon as there are dependencies, you end up having to coordinate a lot, and that communication costs you -- there's lots of lore about adding programmers to a task and it slowing you down. Because basically communication gets you.
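Summarizing the scheduling bounds just described:

    Greedy scheduling (Graham and Brent):  T_P <= T_1 / P + T_infinity
        (within a factor of 2 of the optimal max(T_1 / P, T_infinity))
    Cilk's work-stealing scheduler:        T_P <= T_1 / P + O(T_infinity)
    Corollary: near-perfect linear speedup, T_P ~ T_1 / P,
        whenever P << T_1 / T_infinity (the parallelism).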
What we found empirically -- there's a theorem for this -- is that the runtime is actually still very close to the sum of those terms, or maybe those terms plus 2 times T sub infinity, something like that. And again, we basically get near-perfect speedup as long as the number of processors is a lot less than the parallelism; it should really be a "much less than."

The compiler has a mode where you can insert instrumentation, so you can run your program and it'll tell you the critical path length; you can compute these numbers. It's clear how to compute the work: you just sum up the runtimes of all the threads. To compute the critical path length you have to do some maxes and stuff as you go through the graph. And the average cost of a spawn these days is about 3 on something like a dual-core Pentium -- three times the cost of a function call. Most of that cost actually has to do with the memory barrier that we do at the spawn, because that machine doesn't have strong consistency, so you have to put this memory barrier in and that just empties all the pipelines. It does better on something like an SGI machine, which has strong -- well, traditional -- a MIPS machine that has strong consistency actually does better on the cost of that overhead.

Let me talk a little bit about chess. We had a bunch of chess programs. I wrote one in 1994 which placed third at the International Computer Chess Championship, and that was running on a big Connection Machine CM-5. I was one of the architects of that machine, so it was double fun. We wrote another program that placed second in '95, running on an 1,800-node Paragon, and that was a big computer back then. We built another program called Cilkchess, which placed first in '96 running on a relatively smaller machine. Then on a larger SGI Origin we ran some more, and at the World Computer Chess Championship in 1999 we beat Deep Blue and lost to a PC.
And people don't realize this, but at the time that Deep Blue beat Kasparov, it was not the World Computer Chess Champion; a PC was. So what? It's running a program. You know, there's this head and a tape; I don't know what it did. So this was a program called Fritz, which is a commercially available program. And those guys were very good -- the PC guys were very good on sort of the algorithm side. We got our advantage by brute force. We also had some real chess expertise on our team, but those guys were spending full time on things like pruning away sub-searches that they were convinced weren't going to pan out. Computer chess programs spend most of their time looking at situations that any person would look at and say, ah, Black's won, why are you even looking at this? And it keeps searching: well, maybe there's a way to get the queen. So computers are pretty dumb at that.

So basically those guys put a lot more chess intelligence in. And we also lost due to-- in this particular game, we were tied for first place and we decided to play a runoff game to find out who would win, and we lost due to a classic horizon effect. It turns out that we were searching to depth 12 in the tree and Fritz was searching to depth 11; even with all the heuristics and stuff they had in it, they were still not searching as deeply as we were. But there was a move that looked OK at depth 11, looked bad at depth 12, and at depth 13 looked really good again. So they saw the move and made it, for the wrong reason; we saw the move and didn't make it, for the right reason -- but it was wrong, and if we'd been able to search a little deeper, we would have seen that it was really the wrong thing to do. This happens all the time in chess; there's a little randomness in there. This horizon effect shows up, and again, it boils down to the fact that the programs are not intelligent.
A human would look at it and say, eventually that knight's going to fall. But if the computer can't see it within its search, you know?

We plotted the speedup of *Socrates, which was the first one, on this funny graph. It looks sort of like a typical linear-speedup graph: down here, with a small number of processors, you get good linear speedup, and eventually you stop getting linear speedup. That's in broad strokes what this graph looks like. But the axes are kind of funny. The axes aren't the number of processors and the speedup -- it's the number of processors divided by the parallelism of the program, and the speedup divided by the parallelism of the program. The reason we did that is that each of these data points is a different program, with different work and span.

If I'm trying to run one particular problem on a bunch of different numbers of processors, I can just draw that curve and see what happens as I get more processors -- at some point I'm not getting any advantage because I've got too many processors; I've exceeded the parallelism of the program. But if I'm trying to compare two different programs, how do I do that? Well, you can do that by normalizing by the parallelism. So down in this domain the number of processors is small compared to the average parallelism and we get good linear speedups; up in this domain the number of processors is large and it starts asymptoting to the point where the speedup approaches the parallelism, and that's sort of what happened. You get some noise out here, whereas down here it's nice and tight. That's because down there we're in the domain where the communication costs are infrequently paid, because there's lots of work to do and you don't have to communicate very much. Up here there's a lot of communication happening, and so the noise shows up more in the data. This curve here is the T sub 1 over P plus T sub infinity curve.
This is the T sub P equals T sub infinity curve, and that's the linear-speedup curve on this graph. So I think there's an important lesson in this graph besides the data itself, which is that if you're careful about choosing the axes, you can take a whole bunch of data that you couldn't see how to plot together, plot it together, and get something meaningful. In my Ph.D. thesis I had hundreds of little plots, one for each chess position, and I didn't figure out how-- they all look the same, right? But I didn't figure out that if I was careful I could actually make them be the same. That happened after I published my thesis: oh, we could just overlay them -- well, what's the normalization that makes that work?

So there's a speedup paradox that happened. Pardon?

AUDIENCE: [OBSCURED]

BRADLEY KUSZMAUL: Yeah, OK. There was a speedup paradox that happened while we were developing *Socrates. We were developing it for the 512-processor Connection Machine that was at the University of Illinois, but we only had a smaller machine on which to do our development. We had a 128-processor machine at MIT, and most days I could only get 32 processors because the machine was in heavy demand. So we had this program, and it ran on 32 processors in 65 seconds. And one of the developers said: here's a variation on the algorithm; it changes the dag; it's a heuristic; it makes the program run more efficiently. Look, it runs in only 40 seconds on 32 processors. So is that a good idea? It sure seemed like a good idea, but we were worried, because we knew that the transformation increased the critical path length of the program, so we weren't sure it was a good idea. So we did some calculation. We measured the work and the critical path. These numbers have been cooked a little bit to make the math easy -- this really did happen, but not with these exact numbers.
So we had a program whose work was 2048 seconds, with only 1 second of critical path. And the new program had only half as much work to do, but the critical path length was longer -- it was 8 seconds.

If you predict what the runtime is going to be on 32 processors, that formula says, well, 65 seconds. If you predict this one on 32 processors -- well, it's 40 seconds, and that looks good. But we were going to be running the tournament on 512 processors, where this term would start being less important than this term. So this really did happen, and we actually went back and validated that these numbers were right after we did the calculation, and it allowed us to do the engineering to make the right decision and not be misled by something that looked good in the test environment. We were able to predict what was going to happen on the big machine without actually having access to the big machine, and that was very important.

Let me do some algorithms. You guys have probably done some matrix multiplies over the past 3 weeks, right? That's probably the only thing you've been able to do, would be my guess. So matrix multiplication is this operation -- I won't talk about it, but you know what it is. In Cilk, instead of doing the standard triply nested loops, you do divide and conquer. We don't parallelize loops, we parallelize function calls, so you want to express loops as recursion. To multiply two big matrices you do a whole bunch of little matrix multiplications of the sub-blocks, and then those little matrix multiplications themselves go off and recursively do even smaller matrix multiplications. This requires 8 multiplications of matrices with half the number of rows and half the number of columns, and one addition at the end where you add two matrices together. That's the algorithm we use: it's the same total work as the standard one, it's just expressed recursively. So a matrix multiply is: you do these 8 multiplies. I had to create a temporary variable, so the first four multiply the A's and B's into C.
The second four multiply the A's and B's into T, and then I have to add T into C. So I do all those spawns, do all the multiplies, and I do a sync, because I'd better not start using the results of the multiplies and adding them until the multiplies are done.

AUDIENCE: Which four do you add?

BRADLEY KUSZMAUL: What? There's parallelism in the add -- matrix addition.

AUDIENCE: Yeah, but the add doesn't need a spawn, does it?

BRADLEY KUSZMAUL: Well, we spawn off the add. I don't understand--

[INTERPOSING VOICES]

BRADLEY KUSZMAUL: So you have to spawn Cilk functions even if you're only executing one of them at a time. Cilk functions are spawned, C functions are called. It's a decision that's built into the language. It's not really a fundamental decision; it's just the way we did it.

AUDIENCE: Why did you choose to have the keyword then? Is that just documentation on the caller side?

BRADLEY KUSZMAUL: Yeah, we found we were less likely to make a mistake if we built it into the type system in this way. But I'm not convinced that this is the best way to do the type system.

AUDIENCE: Can C functions spawn a Cilk function?

BRADLEY KUSZMAUL: No. You can only spawn, spawn, spawn, spawn, and then you can call C functions at the leaves. It turns out you actually can spawn Cilk functions from C if you're a little clever about it -- there's a mechanism where a Cilk system is running in the background, and from C you can say, OK, do this Cilk function in parallel. So we have that, but it's not didactic.

AUDIENCE: Sorry, I have a question about the spawning and syncing. Does the sync actually have to wait for the whole wave, or -- like, maybe not in the case of the add here, but in plenty of other practical functions, you can see the dependencies of a spawned function by looking at its parameters, right? Based on how those were built from previously spawned functions.
You could actually just start processing, as long as it's guaranteed that the results are available before you actually read them.

BRADLEY KUSZMAUL: So there's this other style of expressing parallelism, which you see in some of the data flow languages, where you say, well, I've computed this first multiply, why can't I get started on the corresponding part of the addition? And it turns out that in those models there are no performance guarantees. The real issue is that you run out of memory. It's a long topic, let's not go into it, but there's a serious technical issue with those programming models. We have very tight memory bounds as well, so we simultaneously get these good scheduling bounds and good memory bounds, whereas if you start things eagerly like that you can end up requiring a really large number of temporaries and run out of memory. The data flow machine used to have this number -- there was a student, Ken Traub, who was working on Monsoon when Greg Papadopoulos was here, and he came up with this term we called Traub's constant, which was how long the machine could be guaranteed to run before it crashed from being out of memory. And that was -- well, he took the amount of memory divided by the rate at which it got consumed, and that was it. And many data flow programs had that property: Monsoon could run for 40 seconds, and after that you never knew -- it might start crashing at any moment. So everybody wrote short data flow programs.

So one of the things you actually do when you're implementing this, when you're trying to engineer it to go fast, is you coarsen the base case, which I didn't describe up there. You don't just do a 1-by-1 matrix multiply down at the leaves of the recursion, because then you're not using the processor pipeline efficiently. You call the Intel Math Kernel Library or something on an 8-by-8 matrix, so the pipeline really gets a chance to chug away.

So, analysis. This matrix addition operation -- well, what's the work for matrix addition?
Well, the work to do a matrix addition on n-by-n matrices is: you have to do 4 additions of size n over 2, plus there's order-1 work for the spawns and the sync. And that recurrence has solution order n squared. Well, that's not surprising -- you have to add up two matrices that are n by n, and that's going to be n squared, so that's a good result. The critical path: well, you do all of these in parallel, so it's whatever the critical path of the longest one is, and they're all the same, so it's just the critical path at size n over 2 plus order 1, and that means the critical path is order log n.

For matrix multiplication -- sort of the reason I do this is because I can. This is a model in which I can do this analysis, so I have to do it. But really, being able to do this analysis is important when you're trying to make things run faster. For matrix multiplication, the work is: I have to do 8 little matrix multiplies plus I have to do the matrix add. That recurrence has solution order n cubed, and everybody knows there are order n cubed multiply-adds in a matrix multiply, so that's not very surprising. The critical path is -- well, I have to do an add, so that takes log n, plus I have to do a multiply on a matrix that's half the size. So the critical path length of the whole thing has solution order log squared n. The total parallelism of matrix multiplication is the work over the span, which is n cubed over log squared n. If you have a 1000-by-1000 matrix, that means your parallelism is close to 10 million. There's a lot of parallelism, and in fact we see perfect linear speedup on matrix multiply because there's so much parallelism in it.

It turns out that this stack temporary that I created, so that I could do these multiplies all in parallel, is actually costing me, because I'm on a machine that has cache and I want to use the cache effectively. I really don't want to create a whole big temporary matrix and blow my cache out if I can avoid it.
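The code being described isn't reproduced in the transcript, so here is a rough Cilk-5-style sketch of the divide-and-conquer multiply with a temporary, in the spirit of what was just analyzed. The helper names, the layout conventions (row-major, n a power of two, explicit leading dimensions), the BASE cutoff, and the use of malloc for the temporary are illustrative assumptions, not the code on the slide:

/* Rough sketch: C = A*B on n-by-n matrices by divide and conquer,
   using a temporary T for the second four products.  Error checking
   is omitted; this is not the slide's actual code. */
#include <stdlib.h>

#define BASE 16                 /* coarsened base case, per the talk   */

cilk void Add(double *C, int ldc, double *T, int ldt, int n)
{                               /* C += T on four quadrants in parallel */
    int i, j, h;
    if (n <= BASE) {
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++)
                C[i*ldc + j] += T[i*ldt + j];
        return;
    }
    h = n / 2;
    spawn Add(C,             ldc, T,             ldt, h);
    spawn Add(C + h,         ldc, T + h,         ldt, h);
    spawn Add(C + h*ldc,     ldc, T + h*ldt,     ldt, h);
    spawn Add(C + h*ldc + h, ldc, T + h*ldt + h, ldt, h);
    sync;                       /* work order n^2, span order log n     */
}

cilk void Mult(double *C, int ldc, double *A, int lda,
               double *B, int ldb, int n)
{                               /* C = A*B, overwriting C               */
    int i, j, k, h;
    double *T;
    if (n <= BASE) {            /* in a tuned version, call a serial
                                   kernel (e.g. MKL) here instead       */
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++) {
                double s = 0.0;
                for (k = 0; k < n; k++)
                    s += A[i*lda + k] * B[k*ldb + j];
                C[i*ldc + j] = s;
            }
        return;
    }
    h = n / 2;
    T = malloc(sizeof(double) * n * n);   /* the stack temporary        */
    /* first four products into C, the other four into T */
    spawn Mult(C,             ldc, A,             lda, B,             ldb, h);
    spawn Mult(C + h,         ldc, A,             lda, B + h,         ldb, h);
    spawn Mult(C + h*ldc,     ldc, A + h*lda,     lda, B,             ldb, h);
    spawn Mult(C + h*ldc + h, ldc, A + h*lda,     lda, B + h,         ldb, h);
    spawn Mult(T,             n,   A + h,         lda, B + h*ldb,     ldb, h);
    spawn Mult(T + h,         n,   A + h,         lda, B + h*ldb + h, ldb, h);
    spawn Mult(T + h*n,       n,   A + h*lda + h, lda, B + h*ldb,     ldb, h);
    spawn Mult(T + h*n + h,   n,   A + h*lda + h, lda, B + h*ldb + h, ldb, h);
    sync;                       /* all eight products must finish       */
    spawn Add(C, ldc, T, n, n); /* C += T */
    sync;
    free(T);
}

The recurrences quoted above fall straight out of this structure: Mult does 8 half-size multiplies plus one Add, giving work order n cubed and critical path order log squared n.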
So I proposed the following matrix multiply: I first do 4 of the matrix multiplies into C, then I do a sync, and then I do the other 4 into C and another sync. And I forgot to do the add -- oh no, those are multiply-adds, so they're multiplying and adding in. This saves space, because it doesn't need a temporary, but it increases the critical path. So is that a good idea or a bad idea? Well, we can answer part of that question with analysis. Saving space we know is going to save something; what does it do to the work and critical path? Well, the work is still the same, it's n cubed, because we didn't change the number of flops we're doing. But the critical path has grown. Instead of one matrix multiply's worth of critical path at each level, we have to do one, then sync, then do another one. So it's 2 matrix multiplies of half the size plus order 1, and that recurrence has solution order n instead of order log squared n. So that sounds bad -- we've made the critical path longer.

AUDIENCE: [OBSCURED]

BRADLEY KUSZMAUL: What? Yeah. So the parallelism is now order n squared instead of n cubed over log squared n, and for a 1000-by-1000 matrix that means you still have million-fold parallelism. So for relatively modest-sized matrices you still have plenty of parallelism to afford this optimization, so it's a good transformation to do. One of the advantages of Cilk is that you can do this kind of reasoning. You could say, let me do an optimization: I can do an optimization in my C code and I get to take advantage of it in the Cilk code. And I can do this kind of optimization, trading away parallelism -- if I have a lot of parallelism to spare, that's sometimes a good idea.

Ordinary matrix multiplication -- the loop version -- is just really bad. Basically you can imagine spawning off the n squared inner dot products here and computing them all in parallel. It has work n cubed and parallelism -- I mean, critical path -- log n, so the parallelism is even better: it's n cubed over log n instead of n squared.
That loop version looks better theoretically, but it's really bad in practice because it has such poor cache behavior. So we don't do that.

I'll just briefly talk about how the runtime works. Cilk does work-stealing. Each processor has a double-ended queue -- a deque. The bottom of the deque is the stack end, where you push and pop things, and the top is the end where somebody can take things off if they want to. So what's running is: all these processors are running, each on its own deque, and they're all running the ordinary serial code. That's the basic situation -- they're pretty much running the serial code most of the time. So some processor runs; it does a spawn, and what does that do? It pushes something onto its stack, because it's basically just a function call. And it does a couple more spawns, so more frames go on. Somebody returns, so he pops his stack. So far everything's going fine: they're not communicating, they're completely independent computations. Then this guy runs out of work. Now he has to do something. What he does is he goes and picks another processor at random and steals the thing at the other end of that processor's deque. He's unlikely to conflict, because that guy is pushing and popping down at the bottom, but there's a little protocol in there -- a non-blocking algorithm, actually, it's not a lock. So he goes and steals something and -- come on, slide over there. Whoa. Yes, that's animation, right? That's the extent of my animation. And then he starts working away.

And the theorem is that a work-stealing scheduler like this gives expected running time of T sub 1 over P plus T sub infinity on P processors -- with high probability, actually. And the pseudoproof is a little bit like the proof of Brent's theorem: at any step you're either working or stealing. If you're working, well, that gets charged against the T sub 1 over P term.
You can't do that very much, or you run out of work. If you're stealing, well, each steal has a chance of stealing the thing that's on the critical path. You may actually steal the wrong thing, but you have a 1-in-P chance of being the one who steals the thing that's on the critical path, in which case -- so each steal has a 1-over-P chance of reducing the critical path length by 1, so after about P times T infinity steals the critical path is all gone, and that's all the steals you can do. The high-probability version comes out of the same argument. And that gives you these bounds.

OK, I'm not going to give you all this stuff. Message passing sucks, you know. You guys know. There's probably nothing else in here.

So basically the pitch here is that you get high-level linguistic support for this very fine-grained parallelism. It's an algorithmic programming model, so that means you can do engineering for performance. There's fairly easy conversion of existing code, especially when you combine it with the race detector. You've got this factorization of the debugging problem: to debug your serial code you run it with all the Cilk stuff turned off -- you elide the keywords and make sure your program works. Then you run it with the race detector to make sure you get the same answer in parallel, and then you're done. Applications in Cilk don't just scale to large numbers of processors, they scale down to small numbers, which is important if you only have two processors, or one. You don't suddenly want to pay a factor of 10 to get off the ground, which happens sometimes on clusters running MPI -- you have to pay a big overhead before you've made any progress. And one of the advantages, for example, is that the number of processors might change dynamically. In this model that's OK, because it's not part of the program. So you may have the operating system reduce the number of actual worker threads doing that work-stealing, and that can work.
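Bradley's picture of the scheduler maps onto surprisingly little code. Below is only a toy sketch in plain C, assuming one mutex-protected deque per worker instead of the non-blocking protocol he mentions; the Task struct, the fixed-size buffers, and the start_workers helper are purely illustrative, and the real Cilk runtime also steals whole stack frames, implements sync, grows its deques, and terminates cleanly:

#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { void (*run)(void *); void *arg; } Task;

typedef struct {
    Task buf[1024];              /* fixed capacity keeps the sketch short     */
    long top, bottom;            /* thieves take from top, owner uses bottom  */
    pthread_mutex_t lock;
} Deque;

static Deque     *deques;        /* one deque per worker */
static pthread_t *threads;
static int        nworkers;

static void push_bottom(Deque *d, Task t)      /* owner: what a spawn does    */
{
    pthread_mutex_lock(&d->lock);
    d->buf[d->bottom++ % 1024] = t;
    pthread_mutex_unlock(&d->lock);
}

static int pop_bottom(Deque *d, Task *out)     /* owner: pop its own stack    */
{
    int ok = 0;
    pthread_mutex_lock(&d->lock);
    if (d->bottom > d->top) { *out = d->buf[--d->bottom % 1024]; ok = 1; }
    pthread_mutex_unlock(&d->lock);
    return ok;
}

static int steal_top(Deque *d, Task *out)      /* thief: take the oldest work */
{
    int ok = 0;
    pthread_mutex_lock(&d->lock);
    if (d->bottom > d->top) { *out = d->buf[d->top++ % 1024]; ok = 1; }
    pthread_mutex_unlock(&d->lock);
    return ok;
}

static void *worker(void *arg)
{
    int me = (int)(intptr_t)arg;
    unsigned seed = (unsigned)me + 1;
    for (;;) {                                 /* a real scheduler also terminates */
        Task t;
        if (pop_bottom(&deques[me], &t)) {
            t.run(t.arg);                      /* the common case: run serial code */
        } else {
            int victim = (int)(rand_r(&seed) % (unsigned)nworkers);
            if (victim != me && steal_top(&deques[victim], &t))
                t.run(t.arg);                  /* stolen work resumes over here    */
        }
    }
    return NULL;
}

void start_workers(int n)
{
    int i;
    nworkers = n;
    deques   = calloc((size_t)n, sizeof(Deque));
    threads  = calloc((size_t)n, sizeof(pthread_t));
    for (i = 0; i < n; i++) {
        pthread_mutex_init(&deques[i].lock, NULL);
        pthread_create(&threads[i], NULL, worker, (void *)(intptr_t)i);
    }
}

Even in the toy, the structural point survives: the owner pushes and pops at one end of its own deque, and a thief takes the oldest work from the other end of a randomly chosen victim, so the two ends rarely collide.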
One of the bad things about Cilk is that it doesn't really support data-parallel kinds of programming models. You really have to think of things in this divide-and-conquer view of the world. And some things are hard to express that way -- situations where you're doing a Jacobi update and you very carefully lay things out so each processor works on its local memory and they only have to communicate at the boundaries. That's difficult to do right in Cilk, because essentially every time you go around the loop of "I have all these things to do," the work-stealing happens randomly, and a given piece of work ends up on a different processor. So it's not very good at that sort of thing -- although it turns out the Jacobi update isn't a very good example, because there are more sophisticated algorithms that use the cache effectively, which you can express in Cilk, and I would have no idea how to say those in some of these data-parallel languages. Using the cache efficiently is really important on modern processors.

PROFESSOR: Thank you. Questions?

BRADLEY KUSZMAUL: You can download Cilk; there's a bunch of contributors -- those are the Cilkworms -- and you can download Cilk off our webpage. Just Google for Cilk and you'll find it. It's a great language, you'll love it. You'll love it much more than what you've been doing.

AUDIENCE: How does Cilk play with processor [OBSCURED]?

BRADLEY KUSZMAUL: Well, you have to have a language, a compiler, that can generate those -- if you have an assembly command, or you have some other compiler that can generate those. So I just won the HPC Challenge, which is this challenge where everybody tries to run parallel programs and argue that they get productivity. For that there were some codes like matrix multiply and LU decomposition with pivoting. Basically at the leaves of the computation I call the Intel Math Kernel Library, which in turn uses the SSE instructions.
You can do anything you can do in C in the C parts of the code, because the Cilk compiler just passes those through. So if you have some really efficient pipelined code for doing something, up to some point it makes sense to use that.

AUDIENCE: [OBSCURED]

BRADLEY KUSZMAUL: So I ran it on NASA's Columbia machine. The benchmark consists of -- well, there are 7 applications, 6 of which are actually well-defined. One of them is this thing that just measures network performance or something, so it doesn't have any real semantics. So there are 6 benchmarks. One of them is LU decomposition, one of them is DGEMM matrix multiplication, and there's FFT and 3 others. I implemented all 6; nobody else implemented all 6. It turns out you had to implement 3 in order to enter. Almost everybody implemented 3 or 4, but I did all 6, which is part of why I won. So I could argue that in a week's work I just implemented--

AUDIENCE: What is [OBSCURED]?

BRADLEY KUSZMAUL: So the prize has two components, performance and productivity -- or elegance, or something -- and it's completely whatever the judges want it to be. So it was up to me as a presenter to make the case that I was elegant, because I had my performance numbers, which were pretty good. And it turned out that the IBM entry for X10 did me more good than I did myself, I think, because they got up there and compared the performance of X10 to their Cilk implementation, and their X10 thing was almost as good as Cilk. So after that I think the judges said they had to give me the prize. Basically, it went down to Supercomputing, and each of us got 5 minutes to present, and there were 5 finalists. We did our presentations and then they gave out the -- so they divided the prize three ways: the people who got the absolute best performance, which were some people running UPC, and the people who had the most elegance, based on the minimal number of lines of code, and that was Cleve at -- what's his name? The MathWorks guy, the MATLAB guy.
Who said, look, matrix LU decomposition -- LU of P. It's very elegant, but I don't think it really explains what you have to do to solve the problem. So he won the prize for most elegant, and I got the prize for the best combination -- which they then changed: in the final citation for the prize they said most productivity. That was the prize. So I actually won the contest, because that's what the contest was supposed to be about, most productivity. But I only won a third of the prize money, because they divided it three ways.

PROFESSOR: Any other questions? Thank you.

BRADLEY KUSZMAUL: Thank you.

PROFESSOR: We'll take a 5-minute break, and since you had a guest lecturer I do have [OBSCURED]