The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

CHARLES LEISERSON: So today, more parallel programming, as we will do for the next couple of lectures as well. Today we're going to look at how to analyze multithreaded algorithms, and I'm going to start out with a review of what I hope most of you know from 6.006 or 6.046, which is how to solve divide-and-conquer recurrences. Now, we know that we can solve them with recursion trees, and that gets tedious after a while, so I want to go through the so-called Master Method to begin with, and then we'll get into the content of the course. It will be very helpful, since we're going to do so many divide-and-conquer recurrences. The difference between these divide-and-conquer recurrences and the ones for caching is that with caching, all the trickiness is in the base condition. Here, all the recurrences are going to be nice and clean, just like you learned in your algorithms class. So we'll start by talking about that, and then we'll go through several examples of analysis of algorithms. And it'll also tell us something about what we need to do to make our code go fast.

So the main method we're going to use is called the Master Method. It's for solving recurrences of the form T(n) = aT(n/b) + f(n), where we have some technical conditions: a is greater than or equal to 1, b is greater than 1, and f is asymptotically positive, meaning that f(n) is positive once n gets large enough. When we give a recurrence like this, if the base case is order 1, it's conventional not to give it, to just assume, yeah, we understand that when n is small enough, the result is constant. As I say, that's the place where this differs from the way we solve recurrences for caching, where you have to worry about what the base case of the recurrence is.
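In notation, the recurrence under discussion, with its implicit base case, is the following (a reconstruction of the spoken formula):

```latex
T(n) = a\,T(n/b) + f(n), \qquad a \ge 1,\quad b > 1,\quad f(n)\ \text{asymptotically positive},
```

with T(n) = Θ(1) for all sufficiently small n left implicit.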
So the way to solve this is, in fact, the way we've seen before: it's a recursion tree. We start with T(n), and then we replace T(n) by the right-hand side, just by substitution. What's always going to be in the tree as we develop it is the total amount of work. So we basically replace it by f(n) plus a copies of T(n/b). And then each of those we replace by a copies, so we get T(n/b^2), and so forth, continually replacing until we get down to T(1). And at the point T(1), we can no longer substitute, but we know that T(1) is order 1.

And now what we do is add across the rows. So we get f(n), then a f(n/b), then a^2 f(n/b^2), and we keep going down to the height of the tree. We're dividing the argument by b each time, so to get down to 1 takes log base b of n levels. The number of leaves, since this is a regular a-ary tree, is a to the height, which is a to the log base b of n, and that is just n^(log_b a): just this term, not the sum, just this term. And for each of those leaves, we're paying T(1), which is order 1.

Now, it turns out that if I add up all these terms, there's no closed-form solution. But there are three common situations that occur in practice, and the three cases have to do with comparing the number of leaves, which contributes n^(log_b a) times order 1, with f(n).
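Adding across the rows, the picture the recursion tree gives is this level-by-level sum (reconstructed in notation; the height of the tree is log_b n):

```latex
T(n) \;=\; \sum_{i=0}^{\log_b n \,-\, 1} a^{i}\, f\!\left(\frac{n}{b^{i}}\right) \;+\; \Theta\!\left(a^{\log_b n}\right),
\qquad a^{\log_b n} \;=\; n^{\log_b a}.
```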
So the first case is the case where n^(log_b a) is bigger than f(n). So whenever you're given one of these recurrences, compute n^(log_b a). I hope this is a repeat for most people; if not, that's fine, but hopefully it'll get you caught up. If n^(log_b a) is much bigger than f(n), then these terms are geometrically increasing as you go down the tree, and since the sum is geometrically increasing, all that matters is what's at the leaves. In fact, it has to be not just greater: it's got to be greater by a polynomial amount, by some n^epsilon for some epsilon greater than 0. So it might be n^(1/2), it might be n^(1/3), it could be n^100. But what it can't be is log n, because log n is less than any polynomial amount. So n^(log_b a) has to exceed f(n) by at least a factor of n^epsilon for some epsilon. In that case, the sum is geometrically increasing, and the answer is just what's at the leaves. So that's case one: geometrically increasing.

Case two is when things are actually fairly equal on every level, and the general case we'll look at is when the sum is arithmetically increasing. In particular, this occurs when f(n) is n^(log_b a) times log^k n, for some constant k that's at least 0. If k is equal to 0, it just says that f(n) is exactly the same as the number of leaves. In that case, it turns out that every level has almost exactly the same amount, and since there are log n levels, you tack on an extra log n for the solution: the solution is one more log. It turns out that whenever the sum grows arithmetically with the layer, you basically tack on one extra log. It's like summing a series: if you have a summation that goes from i equals 1 to n of i^2, the result is proportional to n^3, and similarly, if the terms are i^k, the result is going to be proportional to n^(k+1). That's basically what's going on here.

And then the third case is when the sum is geometrically decreasing, when the amount at the root dominates.
So in this case, n^(log_b a) is much less than f(n); specifically, it's at least a factor of n^epsilon smaller than f(n), for some constant epsilon greater than 0. It turns out, in addition, you need f(n) to satisfy a regularity condition, but this regularity condition is satisfied by all the normal functions that we're going to come up against. It's not satisfied by things like n^(sin n), which oscillates like crazy, and it also isn't satisfied by exponentially growing functions. But it is satisfied by anything that's polynomial, or polynomial times a logarithm, or what have you. So generally, we don't really have to check this too carefully. And then the answer there is just order f(n): since the sum is geometrically decreasing, f(n) at the root dominates.
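Putting the three cases side by side (reconstructed in standard notation; epsilon > 0 and k >= 0 are constants, and the regularity condition in case 3 is written in its usual form):

```latex
\begin{aligned}
\textbf{Case 1:}\quad & f(n) = O\!\left(n^{\log_b a - \epsilon}\right)
  &&\Longrightarrow\quad T(n) = \Theta\!\left(n^{\log_b a}\right),\\[2pt]
\textbf{Case 2:}\quad & f(n) = \Theta\!\left(n^{\log_b a}\lg^{k} n\right)
  &&\Longrightarrow\quad T(n) = \Theta\!\left(n^{\log_b a}\lg^{k+1} n\right),\\[2pt]
\textbf{Case 3:}\quad & f(n) = \Omega\!\left(n^{\log_b a + \epsilon}\right)
  \text{ and } a\,f(n/b) \le c\,f(n) \text{ for some } c < 1
  &&\Longrightarrow\quad T(n) = \Theta\!\left(f(n)\right).
\end{aligned}
```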
So is this review for everybody? Pretty much, yeah? You can do this in your head, because we're going to ask you to do this in your head during the lecture. Yeah, we're all good? OK, good.

One of the things that students, when they learn this in an algorithms class, don't recognize is that this also tells you where in your recursive program you should bother to try to eke out constant factors. So if you think about it, for example, in case three here, the sum is geometrically decreasing. Does it make sense to try to optimize the leaves? No, because very little time is spent there. It makes sense to optimize what's going on at the root, and to save anything you can at the root. And sometimes the root in particular has special properties that aren't true of the internal nodes, properties you can take advantage of there which you may not be able to take advantage of in general. But since the time is going to be dominated by the root, trying to save at the root makes sense. Correspondingly, if we're in case one, it's absolutely critical that you coarsen the recursion, because all the work is down at the leaves. And so if you want to get additional performance, you want to basically move the base case up high enough that you can cut off that constant overhead, and get factors of two, three, sometimes more, out of your code. So understanding the structure of the recursion allows you to figure out where it is that you should optimize your code. Of course, with loops, it's much easier. Where do you spend your time with loops to make code go fast? The innermost loop, right, because that's the one that's executing the most; the outer loops are not that important. This is the corresponding thing for recursion: figure out where the recursion is spending its time, and that's where you spend your time eking out extra factors.

Here's the cheat sheet. If f(n) is n^(log_b a - epsilon), the answer is n^(log_b a). If it's n^(log_b a + epsilon), the answer is f(n). And if it's n^(log_b a) times a logarithmic factor, where the log has exponent greater than or equal to 0, you add one to the exponent of the log. This is not all of the situations; there are recurrences it doesn't cover.

OK, quick quiz. T(n) = 4T(n/2) + n. What's the solution? n^2, good. So here n^(log_b a) is n^(log_2 4), which is n^2. That's much bigger than n; it's bigger by a factor of n. Here an epsilon of 1 would do, so would an epsilon of 1/2 or an epsilon of 1/4, but in particular, an epsilon of 1 would do. That's case one. The n^2 dominates, so the answer is n^2. The basic idea is, whichever side dominates, in case one and case three, that's the one that is the answer. Here we go. What about this one? [T(n) = 4T(n/2) + n^2.] n^2 log n, because the two sides are about the same size: it's n^2 times log^0 n, so tack on the extra log. How about this one? [T(n) = 4T(n/2) + n^3.] n^3. How about this one? [T(n) = 4T(n/2) + n^2/log n.]

AUDIENCE: [INAUDIBLE].
The Master Theorem [INAUDIBLE]?

CHARLES LEISERSON: Yeah, the Master Theorem does not apply to this one. It looks like it's case two with an exponent of minus 1, but that's bogus, because the exponent of the log must be greater than or equal to 0. So the Master Theorem does not apply here; this recurrence actually has the solution n^2 log log n, but that's not covered by the Master Theorem. You can have an infinite hierarchy of ever-narrower cases like this. So if you have something that looks like a Master Theorem type of recurrence but the theorem doesn't give you a solution, what's your best strategy for solving it?

AUDIENCE: [INAUDIBLE].

CHARLES LEISERSON: What's that?

AUDIENCE: [INAUDIBLE].

CHARLES LEISERSON: So a recursion tree can be good, but actually, the best is the substitution method, basically proving it by induction. The recursion tree can be very helpful in giving you a good guess for what you think the answer is, but the most reliable way to prove any of these things is the substitution method. Good enough. So that was review for, I hope, most people? Yeah? OK, good.
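For the record, the four quiz recurrences and their solutions, as discussed above (the left-hand sides are reconstructed from the spoken answers):

```latex
\begin{aligned}
T(n) &= 4T(n/2) + n          &&\Longrightarrow\; T(n) = \Theta(n^2)           &&\text{(case 1)},\\
T(n) &= 4T(n/2) + n^2        &&\Longrightarrow\; T(n) = \Theta(n^2 \lg n)     &&\text{(case 2, } k = 0\text{)},\\
T(n) &= 4T(n/2) + n^3        &&\Longrightarrow\; T(n) = \Theta(n^3)           &&\text{(case 3)},\\
T(n) &= 4T(n/2) + n^2/\lg n  &&\Longrightarrow\; T(n) = \Theta(n^2 \lg\lg n)  &&\text{(Master Theorem does not apply)}.
\end{aligned}
```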
OK, let's talk about parallel programming. We're going to start out with loops. So last time, we talked about how the Cilk++ runtime system is based on, essentially, implementing spawns and syncs using the work-stealing algorithm, and we talked about scheduling and so forth. We didn't talk about how loops are implemented, except to mention that they're implemented with divide and conquer. So here I want to go into the subtleties of loops, because probably most parallel programs that occur in the real world these days are programs where people just simply say, make this a parallel loop. That's it. So let's take as an example the in-place matrix transpose, where we're basically trying to flip everything along the main diagonal. I've used this figure before, I think. Let's just do it not cache-efficiently. The cache-efficient divide-and-conquer algorithm actually parallelizes beautifully as well, but let's not look at that version; let's look at a looping version, to understand what's going on. And once again here, as I did before, I'm going to make the indices for my implementation run from 0, not 1. Basically, I have an outer loop that goes from i equals 1 up to n minus 1, and an inner loop that goes from j equals 0 up to i minus 1, and then I do a little swap in there. And in this code, the outer loop is the one I've parallelized; the inner loop is running serially.

So let's analyze this particular piece of code to understand what's going on. The way this actually gets implemented is as follows. Here's the code on the left. What the Cilk++ compiler actually does is convert the loop into recursion, divide-and-conquer recursion. So it has a routine on a range from lo to hi; that's the common case, and we're going to call it on the range from 1 to n minus 1, because those are the indices I've given to the cilk_for loop. And if I have a range of values of i to do divide and conquer on, I basically divide that range in half, then I recursively spawn off the first half, execute the second half, and then cilk_sync, so the two halves are going off in parallel. And if I'm at the base, then I go through the inner loop and do the swaps for those values of i. So the outer loop is the parallel loop; that's the one we're doing divide and conquer on. We basically recursively spawn the first half, execute the second, and each of those recursively does the same thing until all the iterations have been done. Any questions about how that operates? So this is the way all parallel loops are done: basically this strategy.
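Here's a sketch of that transformation in Cilk-style C. The global matrix A, its dimension n, and the GRAIN constant are assumptions for illustration; the actual compiler-generated code differs in its details:

```c
#include <cilk/cilk.h>

#define GRAIN 1        /* coarsening threshold; discussed next */

extern int n;          /* assumed: matrix dimension */
extern double **A;     /* assumed: an n x n matrix, indexed from 0 */

/* The loop as the programmer writes it: outer loop parallel,
   inner loop serial. */
void transpose(void) {
    cilk_for (int i = 1; i < n; ++i) {
        for (int j = 0; j < i; ++j) {
            double t = A[i][j];
            A[i][j] = A[j][i];
            A[j][i] = t;
        }
    }
}

/* Roughly what the compiler turns the cilk_for into:
   divide and conquer over the iteration range [lo, hi). */
static void recur(int lo, int hi) {
    if (hi - lo > GRAIN) {
        int mid = lo + (hi - lo) / 2;
        cilk_spawn recur(lo, mid);   /* first half in parallel...  */
        recur(mid, hi);              /* ...with the second half    */
        cilk_sync;                   /* wait for both halves       */
    } else {
        for (int i = lo; i < hi; ++i)    /* base case: serial iterations */
            for (int j = 0; j < i; ++j) {
                double t = A[i][j];
                A[i][j] = A[j][i];
                A[j][i] = t;
            }
    }
}

void transpose_expanded(void) {
    recur(1, n);   /* iterations i = 1 .. n-1 */
}
```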
Now here, I mentioned that this base-case test is actually coarsened. We don't want to recurse all the way down to one iteration, because then we'd pay the recursion-call overhead on every single iteration. So in fact, what happens is you go down to some grain size of some number of iterations, and at that number of iterations, it just runs through them as an ordinary serial for loop, in order not to pay the function-call overhead all the way down. We're going to look at exactly that issue.

So let's look at this using the DAG model that I introduced last time. Remember that the rectangles here count as activation frames, stack frames on the call stack, and the circles here are strands, which are sequences of serial code. So what's happening here is, essentially, I'm running the code that divides the range into two parts, and I spawn one part. Then this guy spawns the other and waits for the return, and then these guys come back. I keep doing that recursively, and when I get to the bottom, I run through the innermost loop, which starts out with just one element to do, then two, then three. So in this case where I've drawn eight, I go through eight elements at the bottom here, if this were an eight-by-eight matrix that I was transposing. So there's more work in these leaves over here than there is over there. It's not something you can just map onto processors in some naive fashion; it does take some load balancing to parallelize this loop. Any questions about what's going on here? Yeah?

AUDIENCE: Why is it that it's one, two, three, four, up to eight?

CHARLES LEISERSON: Take a look: the inner loop goes from j equals 0 to i. So this guy just does one iteration of the inner loop, this guy does two, this guy does three, all the way up to this guy doing eight iterations, if it were an eight-by-eight matrix. And in general, if it's n by n, the leaves go from one unit of work at this end up to n units at that end, because I'm basically iterating through a triangular iteration space to do the transpose, and each leaf is swapping row by row. Questions? Is that good?
Everybody see what's going on? So now let's analyze this for work and span. So what is the work of this, in terms of n, if I have an n-by-n matrix? What's the work? The work is the ordinary serial running time, right? It's n^2. Good. So basically, it's order n^2, because the leaves are adding up as an arithmetic sequence up to n, and so the total amount in the leaves is order n^2.

What about this part up here? How much does that cost us in work; how much is in the control overhead of doing that outer loop? So asymptotically, how much is in here? The total is going to be n^2, that I guarantee you. But what's going on up here? How do I count that up?

AUDIENCE: I'm assuming that each strand is going to be constant time?

CHARLES LEISERSON: Yeah, in this case, it is constant time for these control nodes up here, because what am I doing? All I'm doing is the recursion code where I divide the range and then spawn off two things. That takes only a constant amount of manipulation. So this control part is all order n total. The reason is that, in some sense, there are n leaves here, and if you have a full binary tree, meaning every node either is a leaf or has two children, then the number of internal nodes of the tree is one less than the number of leaves. That's a basic property of full binary trees: the number of internal nodes here is going to be n minus 1. In particular, we have 7 here. Is that good? So this control part doesn't contribute significantly to the work; just this leaf part contributes to the work. Is that good?

What about the span for this? What's the span?

AUDIENCE: Log n.

CHARLES LEISERSON: It's not log n, but your heads are in the right place.
AUDIENCE: The longest path is going [INAUDIBLE].

CHARLES LEISERSON: So which is the longest path going to be here? Starting here and ending there, which way do we go?

AUDIENCE: Go all the way down.

CHARLES LEISERSON: Which way?

AUDIENCE: To the right.

CHARLES LEISERSON: Down to the right, over, down through this guy. How big is this guy? n. Then back up this way. So how much is in the part going down?

AUDIENCE: Log n.

CHARLES LEISERSON: Going down and up is log n, but this leaf is n. Good. So it's basically order n plus order log n: order n down here in the biggest leaf, plus order log n going up and down. That's order n. So the parallelism is the ratio of those two things, which is order n. That's got good parallelism. And if you imagine doing this on a large number of processors, it's very easy to get your benchmark of, say, 10 times more parallelism than the number of processors that you're running on. Everybody follow this? Good.

So the span of the loop control is order log n. And in general, when you have a for loop with n iterations, the loop control itself is going to add log n to the span; every time you hit a parallel loop, you add log of whatever the number of iterations is. And then we add the maximum span of the body. The worst case for the body here is when it's doing the whole row, because whenever we're looking at spans, we're always looking at the maximum over the things that are operating in parallel. Everybody good? Questions? Great.
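To summarize the analysis of this version, with only the outer loop parallel (notation reconstructed from the discussion):

```latex
\begin{aligned}
\text{Work: } T_1(n) &= \Theta(n^2) && \text{(triangular iteration space, plus } \Theta(n) \text{ loop control)},\\
\text{Span: } T_\infty(n) &= \Theta(\log n) + \Theta(n) \;=\; \Theta(n) && \text{(loop control, plus the longest serial row)},\\
\text{Parallelism: } \frac{T_1(n)}{T_\infty(n)} &= \Theta(n).
\end{aligned}
```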
So now let's do something a little more parallel: let's make both loops be parallel. So here we have a cilk_for loop on the outside, and then another cilk_for loop on the interior. Let's see what we get. So what's the work for this?

AUDIENCE: n squared.

CHARLES LEISERSON: Yeah, n squared. That's not going to change. What's the span?

AUDIENCE: Log n.

CHARLES LEISERSON: Yeah, log n. It's log n because the span of the outer control loop is going to add log n. The max span of the inner control loop: well, the inner loops are going from log of 1 up to log of i, but the maximum of those is going to be proportional to log n, even though the inner loops aren't all the same size. And the span of the body is now order 1. And we add the logs, because those things are in series; we don't multiply them. What we're doing is looking at the worst case: I have to do the control for this, plus the control for this, plus the worst iteration of the body, which in this case is just order 1. So the total is order log n. That can be confusing for people, why it is that we add here rather than multiply or do something else. So let me pause here for questions, if people have questions. Everybody with us? Anybody want clarification, or want to make a point that would lead to clarification? Yes, question.

AUDIENCE: If you were going to draw a tree like the previous slide, what would it look like?

CHARLES LEISERSON: Let's see. I had wanted to do that, and it got out of control. So what it would look like is this: if we go back to the previous slide, it basically would look like this, except that each one of these leaves is replaced by a control tree of its own, with as many leaves as the number here indicates. So once again, this one would be the leaf with the longest span, because it would be log of the largest number. But basically, each one of these would become a tree that came from this. Is that clear? That's a great question. Anybody else have questions as illuminating as that one? Everybody understand that explanation, what the tree would look like? OK, good.
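In notation, the span of the doubly parallel version adds up as follows (a reconstruction of the argument just given):

```latex
T_\infty(n)
\;=\; \underbrace{\Theta(\log n)}_{\text{outer loop control}}
\;+\; \underbrace{\max_{1 \le i < n}\,\Theta(\log i)}_{\text{inner loop control}}
\;+\; \underbrace{\Theta(1)}_{\text{body}}
\;=\; \Theta(\log n).
```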
So the parallelism here is n^2 over log n. Now, it's tempting, when you do parallel programming, to say, therefore this is the better parallel code. And the reason is, well, it does asymptotically have more parallelism. But generally, when you're programming, you're not trying to get the most parallelism. What you're trying to do is get sufficient parallelism. So if n is sufficiently large, it's going to be way more than enough. If n is a million, which is a typical problem size for a big loop, or even if it's a few thousand or whatever, it may be just fine to have parallelism on the order of 1,000, which is what the first version gives you. 1,000 iterations is generally a small number of iterations, so a 1,000-by-1,000 matrix is going to generate parallelism of 1,000 in the first version. In the second version, we're going to get parallelism of about 1 million divided by the log, 10 or 20, so something like 100,000. But if I have a 1,000-by-1,000 matrix, the difference between having parallelism of 1,000 and parallelism of 100,000, when I'm running on, let's say, 100 cores, doesn't matter. Up to 100 cores, it doesn't matter. And in fact, on 100 cores, that's really a tiny problem compared to the amount of memory you're going to have: a 1,000-by-1,000 matrix is tiny compared to the size of memory that you're going to have access to, and so forth. So for big problems, you really want to look at this and say: of the implementations that have ample parallelism, which ones are really going to give me the best bang for the buck on reasonable machine sizes?

That's different from things like work, or serial running time. Usually less running time is better, and in fact it's always better. But here, with parallelism: yes, it's good to reduce your span, but you don't have to minimize it to the extreme. You just have to get it small enough. Whereas the work term is the one you really want to minimize, because that's what you're going to have to do even in a serial implementation. Question.

AUDIENCE: So are you suggesting that the other code was OK?

CHARLES LEISERSON: We're going to look a little bit closer at the issue of overheads.
We're now going to take a look at what the difference between these two codes really is; we'll come back to that question in a minute. The way I want to do it is to take a look at the issue of overheads with a simpler example, where we can see what's really going on. So here, what I've got is a loop that is basically just doing vector addition: for i equals 0 to n minus 1, add b[i] into a[i]. Pretty simple code, and we want to make that be a parallel loop. So I get a recursion tree that looks like this, where I have constant work at each leaf. And of course, the work is order n, because I've got n leaves, and the internal control nodes are all constant-size strands, so this is all just order n work. And the span is basically log n, as we've seen, by going down one of these paths, for example. So the parallelism for this is order n over log n. A very simple problem.

But now let's look more closely at the overheads. The problem is that this work term contains substantial overhead. In other words, if I hadn't coarsened the recursion at all in the implementation of cilk_for, if the developers hadn't done that, then I've got a function call, n function calls here, for doing a single addition of values at each leaf. I've got n minus 1 of these control guys, which is approximately n, and I've got n of these leaves. And which are bigger, the control guys or the leaves? The control guys are way bigger: they've got a function call in there. This guy right here just has what? One floating-point addition. And so if I really did my divide and conquer down to a single element, this would be way slower on one processor than if I just ran it as a for loop. Because if I run a serial for loop, it just goes through, and the only overhead is incrementing i and testing for termination. That's it. And of course, that's a predictable branch, because it almost never terminates until it actually terminates, so that's exactly the sort of thing that makes a really, really tight loop with very few instructions.
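Concretely, the two versions being compared look something like this (a sketch; the function names are illustrative):

```c
#include <cilk/cilk.h>

/* Serial version: the only per-iteration overhead is the increment
   and the (highly predictable) termination test. */
void vadd_serial(double *a, const double *b, int n) {
    for (int i = 0; i < n; ++i)
        a[i] += b[i];
}

/* Parallel version: without coarsening, the divide-and-conquer
   expansion would pay a function call's worth of spawn overhead
   for every single floating-point addition. */
void vadd_parallel(double *a, const double *b, int n) {
    cilk_for (int i = 0; i < n; ++i)
        a[i] += b[i];
}
```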
But in the parallel implementation, there's going to be this function-call overhead everywhere. And so, in principle, this cilk_for loop would not be as efficient. It actually is efficient, but we're going to explain what goes on in the runtime system to understand why.

So here's the idea, and you can control this with a pragma. A pragma is a statement to the compiler that gives it a hint. And here, the pragma says: you can name a grain size and give it a value g. What that says is, rather than doing just one element when you get down to the bottom of the recursion, do g elements in a serial for loop when you get down to the bottom. That way, you halt the recursion earlier and you have fewer of these internal nodes. And if you make the grain size sufficiently large, you won't be able to see the cost of the recursion at the top.
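The grain-size pragma looks roughly like this in Cilk++/Cilk Plus syntax (a sketch; G can be any expression, and the loop body here is just the running example):

```c
#include <cilk/cilk.h>

void vadd_coarsened(double *a, const double *b, int n, int G) {
    /* Hint to the compiler and runtime: stop recursing once a range
       has at most G iterations, and run those serially at the leaf. */
    #pragma cilk grainsize = G
    cilk_for (int i = 0; i < n; ++i)
        a[i] += b[i];
}
```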
So let's analyze what happens when we do this; we can understand it vis-a-vis this equation. The idea here is: imagine that t_iter is the basic time for one iteration of the loop. Then the amount of work I have to do is n times the time for an iteration of the loop. And then, depending upon my grain size, I've got the work having to do with the internal nodes, and there are basically going to be n over g of those, times the time for a spawn, which I'm saying is the time to execute one of these control nodes. So if the leaves are batched into groups of g, then there are n over g such leaves. There's a minus 1 in there, but it doesn't matter; it's basically n over g times the time for the internal nodes. So everybody see where I'm getting this? I'm trying to account for the constants in the implementation. People follow where I'm getting this? Ask questions. I see a couple of people who are sort of going, not sure I understand. Yes?

AUDIENCE: The constants [INAUDIBLE].

CHARLES LEISERSON: Yes. So basically, the constants are these t_iter and t_spawn. t_spawn is the time to execute all that spawning mess; t_iter is the time to execute one iteration within a leaf. I'm doing, in this case, g of them per leaf. So I have n over g leaves, but each one is doing g iterations, so it's n over g times g, which is a total of n iterations, which makes sense: I should be doing n iterations if I'm adding two vectors. So that accounts for all the work in the leaves. Then, in addition, I've got all the work for the spawning, which is n over g times t_spawn. And as I say, you can play with the grain size yourself by just sticking in different grain-size directives. Otherwise, it turns out that the Cilk runtime system will pick what it deems to be a good grain size. And it usually does a good job, except sometimes. And that's why there's a parameter there: if there's a parameter, you can override the runtime's choice. Yes?

AUDIENCE: Is the pragma something that is enforced, or is it something that says, hey--

CHARLES LEISERSON: It's a hint.

AUDIENCE: It's a hint.

CHARLES LEISERSON: Yes, it's a hint. In other words, the compiler could ignore it.

AUDIENCE: The compiler is going to be like, oh, that's the total [INAUDIBLE] constant.

CHARLES LEISERSON: It's supposed to be something that gives a hint for performance reasons but does not affect the correctness of the program. The program is going to do the same thing regardless; the question is just what hint we give to the compiler and the runtime system. Yeah?

AUDIENCE: My question is, there are these cases where you say that the runtime system fails to find an appropriate value for that [INAUDIBLE]--
I mean, basically, it chooses one that's not as good. If you put a pragma on it, will the runtime system choose the one that you give it, or still choose--

CHARLES LEISERSON: If you give it, then in the current implementation, the runtime system always picks whatever you say here. And that can be an expression; you can evaluate something there. It's not just a constant. It could be the maximum of this and that times whatever, et cetera. Is that good?

So this is a description of the work. Now let's get a description, with the constants again, of the span. So what are the constants going to be for the span? Well, I'm now executing each leaf serially. For the span, we're basically going to go down one of these paths and back up; I'm not sure which one, but they're all fairly symmetric. But when I get to the leaf, I'm executing the leaf serially. So I'm going to pay g times the time per iteration, executed serially, plus log of n over g, where n over g is the number of leaves I have here, times the cost of a spawn, basically. Does that make sense?

So the idea is, what do we want to have here if I want good parallel code? We would like the work to be as small as possible. How do I make the work small? How can I set g to make the work small?

AUDIENCE: [INAUDIBLE].

CHARLES LEISERSON: Make g--

AUDIENCE: Square root of n.

CHARLES LEISERSON: Well, make g big or little? If you want this overhead term to be small, you want g to be big. But we also want to have a lot of parallelism, so I want the span term to be what? Small, which means I need to make g what? Well, we've got an n over g here, but it's inside a log, and it enters with a minus sign: log of n over g is log n minus log g, so making g bigger barely helps there, while the g times t_iter term grows. So really, to get the span small, I want g to be small. So I have tension, a trade-off. Let's analyze this a little bit.
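Written out with the constants, the two quantities in tension are these (reconstructed from the discussion):

```latex
\begin{aligned}
\text{Work: } T_1 &= n\,t_{\mathrm{iter}} \;+\; \frac{n}{g}\,t_{\mathrm{spawn}},\\[2pt]
\text{Span: } T_\infty &= g\,t_{\mathrm{iter}} \;+\; \log\!\left(\frac{n}{g}\right) t_{\mathrm{spawn}}.
\end{aligned}
```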
Essentially, if I look at this, from the span I want g to be small. But for the work, what I would like is for the first term to dominate the second term. If the first term here dominates the second term, then the work is going to be the same as if I did an ordinary for loop, to within a few percent. So if I take the ratio of these things, I want g to be bigger than the time to spawn divided by the time to iterate, t_spawn over t_iter. If g is much bigger than that, then this first term will be much bigger than that second term, and I don't have to worry about the spawn term. So I want g to be much bigger than that ratio, but otherwise as small as possible. There's no point in making it much bigger than what it takes to essentially wipe out the spawn overhead. People follow that?

So basically, the idea is that we pick a grain size that's large, but not too large; that's what you generally want to do, so that you have enough parallelism but you don't pay the overhead. The way the runtime system does it is with a somewhat complicated heuristic. It actually looks to see how many processors you're running on, and it uses a heuristic that says: let's make sure there's parallelism at least 10 times the number of processors, but there's no point in a grain size of more than about 500 iterations, because at 500 iterations, you can't see the spawn overhead regardless. So basically, it uses a formula of that nature to pick the grain size automatically. But you're free to pick it yourself. The point is that although cilk_for is doing divide and conquer, there is this issue of coarsening, and you do want to make sure that you have enough work to do in each of the leaves of the computation. And as I say, usually it'll guess right, but if you have trouble with that, you have a parameter you can play with.

Let's take a look at another implementation, just to try to understand this issue. Suppose I'm going to do a vector add.
783 00:43:06,100 --> 00:43:10,110 So here I have a vector add of two arrays, where I'm 784 00:43:10,110 --> 00:43:17,750 basically saying ai gets the value of b added into it. 785 00:43:17,750 --> 00:43:20,260 That's kind of the code we had before. 786 00:43:20,260 --> 00:43:25,440 But now, what I want to do is I'm going to implement a 787 00:43:25,440 --> 00:43:26,950 vector add using cilk spawn. 788 00:43:30,560 --> 00:43:34,160 So rather than using a cilk_for loop, I'm going to 789 00:43:34,160 --> 00:43:37,660 parallelize this loop by hand using cilk spawn. 790 00:43:37,660 --> 00:43:41,240 What I'm going to do is I'm going to say for j equals 0 up 791 00:43:41,240 --> 00:43:42,970 to-- and I'm going to jump by whatever my 792 00:43:42,970 --> 00:43:45,020 grain size is here-- 793 00:43:45,020 --> 00:43:50,610 and spawn off the addition of things of size, essentially, 794 00:43:50,610 --> 00:43:53,180 g, unless I get close to the end of the array. 795 00:43:53,180 --> 00:43:57,440 But basically, I'm always spawning off the next g 796 00:43:57,440 --> 00:44:00,200 iterations to do that in parallel. 797 00:44:00,200 --> 00:44:03,280 And then I sync all these spawns. 798 00:44:03,280 --> 00:44:06,180 So everybody understand the code? 799 00:44:06,180 --> 00:44:07,270 I see nods. 800 00:44:07,270 --> 00:44:09,740 I want to see everybody nod, actually, when I do this. 801 00:44:09,740 --> 00:44:12,690 Otherwise what happens is I see three people nod, and I 802 00:44:12,690 --> 00:44:13,770 assume that people are nodding. 803 00:44:13,770 --> 00:44:15,760 Because if you don't do it, you can shake your head, and I 804 00:44:15,760 --> 00:44:18,410 promise none of your friends will see that you're 805 00:44:18,410 --> 00:44:21,280 shaking your head. 806 00:44:21,280 --> 00:44:23,880 And since the TAs are doing the grading and they're facing 807 00:44:23,880 --> 00:44:26,450 this way, they won't see either. 808 00:44:26,450 --> 00:44:29,900 And so it's perfectly safe to let me know, and that way I 809 00:44:29,900 --> 00:44:31,150 can make sure you understand. 810 00:44:33,590 --> 00:44:37,290 So everybody understand what this does? 811 00:44:37,290 --> 00:44:38,500 OK, so I see a few more. 812 00:44:38,500 --> 00:44:38,820 No. 813 00:44:38,820 --> 00:44:39,910 OK, question? 814 00:44:39,910 --> 00:44:43,540 Do you have a question, or should I just explain again? 815 00:44:43,540 --> 00:44:49,490 So this is basically doing a vector add of b into a, of n 816 00:44:49,490 --> 00:44:51,970 iterations here. 817 00:44:51,970 --> 00:44:54,910 And we're going to call it here, when I do a vector add, 818 00:44:54,910 --> 00:44:57,490 of basically g iterations. 819 00:44:57,490 --> 00:45:00,670 So what it's doing is it's going to take my array of size 820 00:45:00,670 --> 00:45:05,590 n, bust it into chunks of size g, and spawn off the first 821 00:45:05,590 --> 00:45:08,230 one, spawn off the second one, spawn off the third one, each 822 00:45:08,230 --> 00:45:11,310 one to do g iterations. 823 00:45:11,310 --> 00:45:13,340 That make sense? 824 00:45:13,340 --> 00:45:14,700 We'll see it. 825 00:45:14,700 --> 00:45:17,330 So here's sort of the instruction stream 826 00:45:17,330 --> 00:45:18,980 for the code here. 827 00:45:18,980 --> 00:45:22,810 So basically, it says here is one, we spawn off something of 828 00:45:22,810 --> 00:45:27,370 size g, then we go on, we spawn off something else of 829 00:45:27,370 --> 00:45:28,870 size g, et cetera. 
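For reference, the code being described looks something like this. It's my reconstruction in Cilk-style C++, not the slide verbatim; the names vadd and vadd_spawned and the header spelling are assumptions:

    #include <cilk/cilk.h>  // cilk_spawn / cilk_sync (header as in Cilk Plus;
                            // the course's cilk++ spelling may differ slightly)

    // Serial vector add: A[i] += B[i] for n elements.
    void vadd(double *A, double *B, int n) {
        for (int i = 0; i < n; ++i)
            A[i] += B[i];
    }

    // Hand-parallelized version: spawn off one chunk of g iterations
    // at a time, then wait for all of the chunks at the cilk_sync.
    void vadd_spawned(double *A, double *B, int n, int g) {
        for (int j = 0; j < n; j += g) {
            int len = (g < n - j) ? g : n - j;  // last chunk may be short
            cilk_spawn vadd(A + j, B + j, len);
        }
        cilk_sync;
    }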
830 00:45:28,870 --> 00:45:32,740 We keep going up there until we hit the cilk sync. 831 00:45:32,740 --> 00:45:34,480 That make sense? 832 00:45:34,480 --> 00:45:38,610 Each of these is doing a vector add of size g using 833 00:45:38,610 --> 00:45:40,015 this serial routine. 834 00:45:42,610 --> 00:45:46,420 So let's analyze this to understand the efficiency of 835 00:45:46,420 --> 00:45:49,980 this type of looping structure. 836 00:45:49,980 --> 00:45:52,910 So let's assume for this analysis that g equals 1, to 837 00:45:52,910 --> 00:45:54,690 make it easy, so we don't have to worry about it. 838 00:45:54,690 --> 00:45:57,840 So we're simply spawning off one thing here, one thing 839 00:45:57,840 --> 00:46:01,760 here, one iteration here, all the way to the end. 840 00:46:01,760 --> 00:46:05,370 So what is the work for this, if I spawn off things of size 841 00:46:05,370 --> 00:46:09,370 one, asymptotic work? 842 00:46:09,370 --> 00:46:12,850 It's order n, because I've got n leaves, and I've got n guys 843 00:46:12,850 --> 00:46:13,710 that I'm spawning off. 844 00:46:13,710 --> 00:46:15,720 So it's order n. 845 00:46:15,720 --> 00:46:18,019 What's the span? 846 00:46:18,019 --> 00:46:20,800 AUDIENCE: [INAUDIBLE]. 847 00:46:20,800 --> 00:46:27,130 CHARLES LEISERSON: Yeah, it's also order n, because the 848 00:46:27,130 --> 00:46:29,110 critical path goes something like brrrup, brrrup, brrrup. 849 00:46:33,620 --> 00:46:35,850 That's order n length. 850 00:46:35,850 --> 00:46:37,760 It's not this, because that's only order 851 00:46:37,760 --> 00:46:38,820 one length, all those. 852 00:46:38,820 --> 00:46:42,130 The longest path is order n. 853 00:46:42,130 --> 00:46:49,220 So that says the parallelism is order one. 854 00:46:49,220 --> 00:46:53,720 Conclusion, at least with grain size one, this is a 855 00:46:53,720 --> 00:46:57,950 really bad way to implement a parallel loop. 856 00:46:57,950 --> 00:47:01,080 However, I guarantee, it may not be the people in this 857 00:47:01,080 --> 00:47:07,130 room, but some fraction of students in this class will 858 00:47:07,130 --> 00:47:12,080 write this rather than doing a cilk for. 859 00:47:12,080 --> 00:47:15,440 Bad idea. 860 00:47:15,440 --> 00:47:17,270 Bad idea. 861 00:47:17,270 --> 00:47:19,950 Generally, bad idea. 862 00:47:19,950 --> 00:47:20,862 Question? 863 00:47:20,862 --> 00:47:22,308 AUDIENCE: Do you think you could find a constant factor, 864 00:47:22,308 --> 00:47:23,558 not just [INAUDIBLE]? 865 00:47:26,164 --> 00:47:29,360 CHARLES LEISERSON: Well here, actually, with grain size one, 866 00:47:29,360 --> 00:47:31,960 this is really bad, because I've got this overhead of 867 00:47:31,960 --> 00:47:35,450 doing a spawn, and then I'm only doing one iteration. 868 00:47:35,450 --> 00:47:38,250 So the ideal thing would be that I really am only paying 869 00:47:38,250 --> 00:47:41,170 for the leaves, and the internal nodes, I don't have 870 00:47:41,170 --> 00:47:42,405 to pay anything for. 871 00:47:42,405 --> 00:47:44,182 Yeah, Eric? 872 00:47:44,182 --> 00:47:45,820 AUDIENCE: Shouldn't there be a sort of keyword 873 00:47:45,820 --> 00:47:46,560 in the b add too? 874 00:47:46,560 --> 00:47:47,150 CHARLES LEISERSON: In the where? 875 00:47:47,150 --> 00:47:48,175 AUDIENCE: In the b add? 876 00:47:48,175 --> 00:47:49,470 CHARLES LEISERSON: No, that's serial. 877 00:47:49,470 --> 00:47:51,190 That's a serial code. 
878 00:47:51,190 --> 00:47:53,066 AUDIENCE: No, but if you were going to call it with cilk 879 00:47:53,066 --> 00:47:56,140 spawn, don't you have to declare it cilk? 880 00:47:56,140 --> 00:47:58,581 Is that not the case? 881 00:47:58,581 --> 00:47:59,038 CHARLES LEISERSON: No. 882 00:47:59,038 --> 00:48:00,288 AUDIENCE: Never mind. 883 00:48:02,820 --> 00:48:03,900 CHARLES LEISERSON: Yes, question. 884 00:48:03,900 --> 00:48:05,884 AUDIENCE: If g is [INAUDIBLE], isn't that good enough? 885 00:48:08,420 --> 00:48:09,290 CHARLES LEISERSON: Yeah, so let's take a look. 886 00:48:09,290 --> 00:48:10,540 That's actually the next slide. 887 00:48:12,980 --> 00:48:17,036 This is basically what we call puny parallelism. 888 00:48:17,036 --> 00:48:20,580 We don't like puny parallelism. 889 00:48:20,580 --> 00:48:22,470 It doesn't have to be spectacular. 890 00:48:22,470 --> 00:48:25,680 It has to be good enough. 891 00:48:25,680 --> 00:48:28,455 And this is not good enough for most applications. 892 00:48:31,960 --> 00:48:34,250 So here's another implementation. 893 00:48:34,250 --> 00:48:35,620 Here's another way of doing it. 894 00:48:35,620 --> 00:48:40,710 Now let's analyze it where we have control over g. 895 00:48:40,710 --> 00:48:44,430 So we'll analyze it in terms of g, and then see whether 896 00:48:44,430 --> 00:48:47,270 there's a setting for which this makes sense. 897 00:48:47,270 --> 00:48:49,690 So if I analyze it in terms of g, now I have to do a little 898 00:48:49,690 --> 00:48:51,600 bit more careful analysis here. 899 00:48:51,600 --> 00:48:57,798 How much work do I have here in terms of n and g? 900 00:48:57,798 --> 00:48:59,200 AUDIENCE: It's the same. 901 00:48:59,200 --> 00:49:00,120 CHARLES LEISERSON: Yeah, the work is still 902 00:49:00,120 --> 00:49:01,370 asymptotically order n. 903 00:49:05,220 --> 00:49:07,940 Because I always have n work in the leaves, even if I do 904 00:49:07,940 --> 00:49:09,190 more iterations here. 905 00:49:09,190 --> 00:49:14,091 What increasing g does is it shrinks this, right? 906 00:49:14,091 --> 00:49:17,350 It shrinks this. 907 00:49:17,350 --> 00:49:18,850 The span for this is what? 908 00:49:23,240 --> 00:49:25,820 So I heard somebody say it. 909 00:49:25,820 --> 00:49:27,448 n over g plus g. 910 00:49:30,560 --> 00:49:32,240 And it corresponds to this path. 911 00:49:34,750 --> 00:49:37,660 So this is the n over g part up here, and 912 00:49:37,660 --> 00:49:39,126 this is the plus g. 913 00:49:41,720 --> 00:49:47,060 So what we want to do is 914 00:49:47,060 --> 00:49:47,820 minimize this. 915 00:49:47,820 --> 00:49:50,630 This has the smallest value when these two terms are 916 00:49:50,630 --> 00:49:55,370 equal, which you can either know as a basic fact of the 917 00:49:55,370 --> 00:49:58,300 summation of these kinds of things, or you could take 918 00:49:58,300 --> 00:50:02,210 derivatives and so forth. 919 00:50:02,210 --> 00:50:05,050 Or you can just eyeball it and say, gee, if g is bigger than 920 00:50:05,050 --> 00:50:08,710 square root of n, then this is going to be dominant, and 921 00:50:08,710 --> 00:50:11,500 if g is smaller than square root of n, then this is going 922 00:50:11,500 --> 00:50:12,630 to be dominant. 923 00:50:12,630 --> 00:50:15,130 And so when they're equal, that sounds like about when it 924 00:50:15,130 --> 00:50:17,720 should be the smallest, which is true.
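In symbols, the analysis just sketched (my notation):

\[
T_1(n) = \Theta(n),
\qquad
T_\infty(n) = \Theta\!\left(\frac{n}{g} + g\right),
\]

and n/g + g is minimized, up to constant factors, at \(g = \Theta(\sqrt{n})\), giving \(T_\infty = \Theta(\sqrt{n})\) and parallelism \(\Theta(n)/\Theta(\sqrt{n}) = \Theta(\sqrt{n})\).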
925 00:50:17,720 --> 00:50:20,270 So we pick it to be about square root of n, to 926 00:50:20,270 --> 00:50:22,970 minimize the span. 927 00:50:22,970 --> 00:50:25,620 There's nothing else here I have to minimize. 928 00:50:25,620 --> 00:50:30,200 So pick it around square root of n, then the span is around 929 00:50:30,200 --> 00:50:31,520 square root of n. 930 00:50:31,520 --> 00:50:37,680 And so then the parallelism is order square root of n. 931 00:50:37,680 --> 00:50:38,880 So that's pretty good. 932 00:50:38,880 --> 00:50:40,150 So that's not bad. 933 00:50:40,150 --> 00:50:42,950 So for something that's a big array, array of size 1 934 00:50:42,950 --> 00:50:46,310 million, parallelism might be 1,000. 935 00:50:46,310 --> 00:50:50,300 That might be just hunky dory. 936 00:50:50,300 --> 00:50:51,142 Question. 937 00:50:51,142 --> 00:50:51,594 What's that? 938 00:50:51,594 --> 00:50:54,510 AUDIENCE: I don't see where-- 939 00:50:54,510 --> 00:50:56,170 CHARLES LEISERSON: We've picked g to be equal to 940 00:50:56,170 --> 00:50:57,986 square root of n. 941 00:50:57,986 --> 00:51:00,944 AUDIENCE: [INAUDIBLE] plus n over g, plus g. 942 00:51:00,944 --> 00:51:02,430 I don't see where [INAUDIBLE]. 943 00:51:02,430 --> 00:51:05,870 CHARLES LEISERSON: You don't see where this g came from? 944 00:51:05,870 --> 00:51:09,540 This g comes from the fact that I'm doing g iterations here. 945 00:51:09,540 --> 00:51:11,780 So remember that these are now of size g. 946 00:51:11,780 --> 00:51:14,510 I'm doing g iterations in each leaf here, if 947 00:51:14,510 --> 00:51:15,860 I set g to be large. 948 00:51:15,860 --> 00:51:21,960 So I'm doing n over g pieces here, plus g iterations in my 949 00:51:21,960 --> 00:51:22,610 [INAUDIBLE]. 950 00:51:22,610 --> 00:51:24,090 Is that clear? 951 00:51:24,090 --> 00:51:26,120 So the n over g is this part. 952 00:51:26,120 --> 00:51:28,710 This size here, this is not one. 953 00:51:28,710 --> 00:51:30,620 This has g iterations in it. 954 00:51:30,620 --> 00:51:33,252 So the total span is g plus n over g. 955 00:51:37,370 --> 00:51:39,280 Any other questions about this? 956 00:51:39,280 --> 00:51:41,750 So basically, I get order square root of n. 957 00:51:44,270 --> 00:51:49,370 And so this is not necessarily a bad way of doing it, but the 958 00:51:49,370 --> 00:51:51,985 cilk for is a far more reliable way of making sure 959 00:51:51,985 --> 00:51:54,230 that you get the parallelism than spawning 960 00:51:54,230 --> 00:51:55,710 things off one by one. 961 00:51:55,710 --> 00:52:00,000 One of the things, by the way, in this, I've seen people 962 00:52:00,000 --> 00:52:03,370 write code where their first instinct is to write something 963 00:52:03,370 --> 00:52:06,600 like this, where the thing that they're spawning off is only 964 00:52:06,600 --> 00:52:07,660 constant time. 965 00:52:07,660 --> 00:52:11,810 And they say, gee, I spawned off n things. 966 00:52:11,810 --> 00:52:14,340 That's really parallel. 967 00:52:14,340 --> 00:52:18,270 When in fact, their parallelism is order one. 968 00:52:18,270 --> 00:52:22,600 So it's really seductive to think that you can get 969 00:52:22,600 --> 00:52:23,880 parallelism by this. 970 00:52:23,880 --> 00:52:27,160 It's much better to do divide and conquer, and cilk for does 971 00:52:27,160 --> 00:52:29,140 that for you automatically.
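For comparison, here is a minimal sketch of the divide-and-conquer looping style he's recommending--roughly what cilk_for generates for you--written by hand. The routine name and grain-size parameter are my own illustration:

    #include <cilk/cilk.h>

    // Divide-and-conquer parallel loop over [lo, hi): recursively split
    // the range in half, running the halves in parallel, until the range
    // is at most the grain size; then do the iterations serially.
    void vadd_dc(double *A, double *B, int lo, int hi, int grain) {
        if (hi - lo <= grain) {
            for (int i = lo; i < hi; ++i)  // serial leaf of <= grain iterations
                A[i] += B[i];
        } else {
            int mid = lo + (hi - lo) / 2;
            cilk_spawn vadd_dc(A, B, lo, mid, grain);  // left half in parallel
            vadd_dc(A, B, mid, hi, grain);             // right half in this strand
            cilk_sync;
        }
    }

With this structure the critical path passes through only about log of n over the grain size spawns plus one serial leaf, rather than the n over g spawns on the critical path of the one-by-one version.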
972 00:52:29,140 --> 00:52:31,060 If you're going to do it by hand--sometimes you do want to 973 00:52:31,060 --> 00:52:33,160 do it by hand--then you probably want to think more 974 00:52:33,160 --> 00:52:35,900 about divide and conquer to generate parallelism, because 975 00:52:35,900 --> 00:52:38,230 you'll have a smaller span than doing many 976 00:52:38,230 --> 00:52:39,480 things one at a time. 977 00:52:41,800 --> 00:52:47,520 So here's some tips for performance. 978 00:52:47,520 --> 00:52:52,090 So you want to minimize the span, because the parallelism is 979 00:52:52,090 --> 00:52:53,470 the work over the span. 980 00:52:53,470 --> 00:52:57,880 So you want to minimize the span to maximize parallelism. 981 00:52:57,880 --> 00:53:00,770 And in general, you should try to generate something like 10 982 00:53:00,770 --> 00:53:04,290 times more parallelism than processors, if you want to get 983 00:53:04,290 --> 00:53:05,510 near perfect linear speed-up. 984 00:53:05,510 --> 00:53:09,200 In other words, a parallel slackness of 10 or better is 985 00:53:09,200 --> 00:53:10,450 usually adequate. 986 00:53:13,340 --> 00:53:16,190 If you can get more, you can get a little 987 00:53:16,190 --> 00:53:22,720 more performance, but now you're talking performance 988 00:53:22,720 --> 00:53:27,110 increases in the range of 5% to 10%, 989 00:53:27,110 --> 00:53:29,890 something like that. 990 00:53:29,890 --> 00:53:33,520 Second thing is if you have plenty of parallelism, try to 991 00:53:33,520 --> 00:53:36,970 trade some of it off to reduce work overhead. 992 00:53:36,970 --> 00:53:38,060 So this is a general case. 993 00:53:38,060 --> 00:53:42,800 This is what actually goes on underneath in the cilk++ 994 00:53:42,800 --> 00:53:45,790 runtime system--it's trying to do this itself. 995 00:53:45,790 --> 00:53:49,640 But you in your own code can play exactly the same game. 996 00:53:49,640 --> 00:53:52,330 Whenever you have a problem and it says, whoa, look at all 997 00:53:52,330 --> 00:53:55,130 this parallelism, think about ways that you could reduce the 998 00:53:55,130 --> 00:54:00,140 parallelism and get something back in the efficiency of the 999 00:54:00,140 --> 00:54:03,490 work term, because the performance in the end is 1000 00:54:03,490 --> 00:54:08,070 going to be something like t1 over p plus t infinity. 1001 00:54:08,070 --> 00:54:11,400 If t infinity is small, it's like t1 over p, and so 1002 00:54:11,400 --> 00:54:16,060 anything you save in the t1 term is saving you overall. 1003 00:54:16,060 --> 00:54:19,630 It's going to be a savings for you overall. 1004 00:54:19,630 --> 00:54:22,200 Use divide and conquer recursion on parallel loops 1005 00:54:22,200 --> 00:54:24,870 rather than spawning one small thing after another. 1006 00:54:24,870 --> 00:54:28,930 In other words, do this, not this, generally. 1007 00:54:33,620 --> 00:54:36,220 And here's some more tips. 1008 00:54:36,220 --> 00:54:39,520 Another thing that can happen that we looked at here was 1009 00:54:39,520 --> 00:54:42,300 make sure that the amount of work you're doing is 1010 00:54:42,300 --> 00:54:45,390 reasonably large compared to the number of spawns. 1011 00:54:45,390 --> 00:54:47,950 You could also say this is true when you do recursion for 1012 00:54:47,950 --> 00:54:49,280 function calls.
1013 00:54:49,280 --> 00:54:52,040 Even if you're just in serial programming, you always 1014 00:54:52,040 --> 00:54:54,180 want to make sure that the number of function calls you're doing is 1015 00:54:54,180 --> 00:54:57,500 small compared to the amount of work you're doing if 1016 00:54:57,500 --> 00:55:00,120 you can, and that'll make things go faster. 1017 00:55:00,120 --> 00:55:08,050 So same thing here, you want to have a lot of work compared 1018 00:55:08,050 --> 00:55:09,620 to the total number of spawns that you're 1019 00:55:09,620 --> 00:55:11,780 doing in your program. 1020 00:55:11,780 --> 00:55:14,520 So spawns, by the way, in this system, are about three or 1021 00:55:14,520 --> 00:55:19,500 four times the cost of a function call. 1022 00:55:19,500 --> 00:55:22,800 They're sort of the same order of magnitude as a function 1023 00:55:22,800 --> 00:55:26,670 call, a little bit heavier than a function call. 1024 00:55:26,670 --> 00:55:31,960 So you can spawn pretty readily, as long as the total 1025 00:55:31,960 --> 00:55:37,350 number of spawns you're doing isn't dominating your work. 1026 00:55:37,350 --> 00:55:40,210 Generally parallelize outer loops as opposed to inner 1027 00:55:40,210 --> 00:55:43,620 loops if you're forced to make a choice. 1028 00:55:43,620 --> 00:55:47,090 So it's always better to have an outer loop that runs in 1029 00:55:47,090 --> 00:55:51,000 parallel rather than an inner loop that runs in parallel, 1030 00:55:51,000 --> 00:55:54,750 because when you do an inner loop that runs in parallel, 1031 00:55:54,750 --> 00:55:56,590 you've got a lot of overhead to overcome. 1032 00:55:56,590 --> 00:56:00,980 But in an outer loop, you've got all of the inner loop to 1033 00:56:00,980 --> 00:56:03,930 amortize against the cost of the spawns that are being used 1034 00:56:03,930 --> 00:56:06,570 to parallelize the outer loop. 1035 00:56:06,570 --> 00:56:10,195 So you'll do many fewer spawns in the implementation. 1036 00:56:12,810 --> 00:56:14,510 Watch out for scheduling overheads. 1037 00:56:18,620 --> 00:56:21,990 So if you do something like this-- 1038 00:56:21,990 --> 00:56:27,310 so here we're parallelizing an inner loop rather than an 1039 00:56:27,310 --> 00:56:27,640 outer loop. 1040 00:56:27,640 --> 00:56:30,470 Now this turns out, it doesn't matter which order we're going 1041 00:56:30,470 --> 00:56:33,000 in or whatever. 1042 00:56:33,000 --> 00:56:35,650 It's generally not desirable to do this because I'm paying 1043 00:56:35,650 --> 00:56:40,230 scheduling overhead n times through this loop, whereas 1044 00:56:40,230 --> 00:56:43,930 here, I pay for scheduling overhead just twice. 1045 00:56:46,920 --> 00:56:50,010 So it's generally better, if I have n pieces of work to do, 1046 00:56:50,010 --> 00:56:52,400 rather than, in this case, parallelizing-- 1047 00:56:55,200 --> 00:56:57,510 let me slow down here. 1048 00:56:57,510 --> 00:56:58,980 So let's look at what this code does. 1049 00:56:58,980 --> 00:57:00,835 This says, go for two iterations. 1050 00:57:03,740 --> 00:57:05,980 Do something for which it is going to take n 1051 00:57:05,980 --> 00:57:09,840 iterations for j. 1052 00:57:09,840 --> 00:57:12,235 So two iterations for i, n iterations for j. 1053 00:57:15,710 --> 00:57:17,970 If you look at the parallelism of this, what is the 1054 00:57:17,970 --> 00:57:20,560 parallelism of this assuming that f is constant time? 1055 00:57:23,830 --> 00:57:25,220 What's the parallelism of this code? 1056 00:57:33,460 --> 00:57:35,210 Two.
1057 00:57:35,210 --> 00:57:37,570 The parallelism of two, because I've got two things on 1058 00:57:37,570 --> 00:57:39,930 the outer loop here, and then each is n. 1059 00:57:39,930 --> 00:57:43,420 So my span is essentially n. 1060 00:57:43,420 --> 00:57:46,770 My work is like 2n, something like that. 1061 00:57:46,770 --> 00:57:49,250 So it's got a parallelism of two. 1062 00:57:49,250 --> 00:57:50,800 What's the parallelism of this code? 1063 00:58:04,970 --> 00:58:05,790 What's the parallelism? 1064 00:58:05,790 --> 00:58:08,220 It's not n, because I'm basically going through this 1065 00:58:08,220 --> 00:58:10,070 serially, the outer loop serially. 1066 00:58:18,680 --> 00:58:20,490 What's the theoretical parallelism of this? 1067 00:58:24,270 --> 00:58:29,430 So for each iteration here, the parallelism is two. 1068 00:58:29,430 --> 00:58:31,170 No, not n. 1069 00:58:31,170 --> 00:58:34,980 It can't be n, because I'm basically only parallelizing 1070 00:58:34,980 --> 00:58:37,530 two things, and I'm doing them serially. 1071 00:58:40,462 --> 00:58:44,540 The outer loop is going serially through the code and 1072 00:58:44,540 --> 00:58:47,480 it's spawning off two things, two things, two things, two 1073 00:58:47,480 --> 00:58:48,530 things, two things. 1074 00:58:48,530 --> 00:58:50,380 And waiting for them to be done, two things, wait for it 1075 00:58:50,380 --> 00:58:52,540 to be done, two things, wait for it to be done. 1076 00:58:52,540 --> 00:58:54,930 So the parallelism is two. 1077 00:58:54,930 --> 00:58:56,180 These have the same parallelism. 1078 00:58:58,690 --> 00:59:03,190 However if you run this, this one will give you a speedup of 1079 00:59:03,190 --> 00:59:07,530 two on two cores, very close to it. 1080 00:59:07,530 --> 00:59:09,990 Because of the scheduling overhead: here, you've only 1081 00:59:09,990 --> 00:59:12,390 paid once for the scheduling overhead, and then you're 1082 00:59:12,390 --> 00:59:14,640 doing a whole bunch of stuff. 1083 00:59:14,640 --> 00:59:17,140 So remember, to schedule it, it's got to be migrated, it's 1084 00:59:17,140 --> 00:59:19,910 got to be moved to another processor, et cetera. 1085 00:59:19,910 --> 00:59:25,410 This one, it's not even worth it probably to steal each of 1086 00:59:25,410 --> 00:59:26,250 these individual things. 1087 00:59:26,250 --> 00:59:28,770 You're spawning off things that are so small, this may 1088 00:59:28,770 --> 00:59:33,395 even have parallelism that's less than 1 in practice. 1089 00:59:33,395 --> 00:59:35,880 And if you look at the cilkview tool, this will show 1090 00:59:35,880 --> 00:59:38,180 you a high burdened parallelism. 1091 00:59:38,180 --> 00:59:41,450 Because the cilkview tool, the burdened parallelism tells you 1092 00:59:41,450 --> 00:59:46,980 what the overhead is from scheduling, as well as what 1093 00:59:46,980 --> 00:59:48,310 the actual parallelism is. 1094 00:59:48,310 --> 00:59:51,010 And it recognizes that oh, gee whiz. 1095 00:59:51,010 --> 00:59:53,625 This thing really has very small-- 1096 01:00:00,670 --> 01:00:02,200 there's almost no work in here. 1097 01:00:02,200 --> 01:00:04,000 So you're trying to parallelize something where 1098 01:00:04,000 --> 01:00:06,960 the work is so small, it's not even worth migrating it to 1099 01:00:06,960 --> 01:00:10,380 take advantage of it. 1100 01:00:10,380 --> 01:00:12,740 So those are some tips.
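To recap the comparison just made, here is a sketch of the two orderings. Both fragments are my illustration rather than the slide's exact code, and f stands in for the constant-time body:

    #include <cilk/cilk.h>

    extern void f(int i, int j);  // placeholder: constant-time body

    // Outer loop parallel: scheduling overhead is paid once, and each
    // of the two branches amortizes it over n serial inner iterations.
    void outer_parallel(int n) {
        cilk_for (int i = 0; i < 2; ++i)
            for (int j = 0; j < n; ++j)
                f(i, j);
    }

    // Inner loop parallel: the serial outer loop pays scheduling
    // overhead on every one of its n trips, and each cilk_for has only
    // two tiny iterations. Same parallelism of two on paper, but a
    // high burdened parallelism in practice.
    void inner_parallel(int n) {
        for (int j = 0; j < n; ++j)
            cilk_for (int i = 0; i < 2; ++i)
                f(i, j);
    }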
1101 01:00:12,740 --> 01:00:15,140 Now let's go through and analyze a bunch of algorithms 1102 01:00:15,140 --> 01:00:16,730 reasonably quickly. 1103 01:00:16,730 --> 01:00:21,670 We'll start with matrix multiplication. 1104 01:00:21,670 --> 01:00:22,935 People seen this problem before? 1105 01:00:28,400 --> 01:00:31,900 Here's the matrix multiplication problem. 1106 01:00:31,900 --> 01:00:33,780 And let's assume for simplicity that n 1107 01:00:33,780 --> 01:00:35,030 is a power of 2. 1108 01:00:38,470 --> 01:00:44,250 So basically, let's start out with just our looping version. 1109 01:00:44,250 --> 01:00:46,390 In fact, this isn't even a very good looping version, 1110 01:00:46,390 --> 01:00:49,340 because I've got the order of the loops wrong, I think. 1111 01:00:49,340 --> 01:00:52,380 But it is just illustrative. 1112 01:00:52,380 --> 01:00:55,080 Basically let's parallelize the outer two loops. 1113 01:00:55,080 --> 01:00:57,070 I can't parallelize the inner loop. 1114 01:00:57,070 --> 01:00:58,140 Why not? 1115 01:00:58,140 --> 01:01:00,180 What happens if I tried to parallelize the inner loop 1116 01:01:00,180 --> 01:01:03,095 with a cilk_for in this implementation? 1117 01:01:07,580 --> 01:01:11,040 Why can't I just put a cilk_for there? 1118 01:01:11,040 --> 01:01:12,408 Yes, somebody said it. 1119 01:01:12,408 --> 01:01:14,600 AUDIENCE: It does that in cij. 1120 01:01:14,600 --> 01:01:17,220 CHARLES LEISERSON: Yeah, we get a race condition. 1121 01:01:17,220 --> 01:01:19,980 We have more than two things in parallel trying to update 1122 01:01:19,980 --> 01:01:25,070 the same cij, and we'll have a race condition. 1123 01:01:25,070 --> 01:01:29,000 So always run cilkview to tell your performance. 1124 01:01:29,000 --> 01:01:33,695 But always, always, run cilk screen to tell whether or not 1125 01:01:33,695 --> 01:01:35,060 you've got races in your code. 1126 01:01:38,500 --> 01:01:40,650 So yeah, you'll have a race condition if you try to 1127 01:01:40,650 --> 01:01:43,160 naively parallelize the loop here. 1128 01:01:46,570 --> 01:01:47,990 So the work of this is what? 1129 01:01:53,460 --> 01:01:58,090 It's order n cubed, just three nested loops each going to n. 1130 01:01:58,090 --> 01:01:59,340 What's the span of this? 1131 01:02:12,180 --> 01:02:13,430 What's the span of this? 1132 01:02:20,990 --> 01:02:26,900 It's order n, because it's log n for this loop, log n for 1133 01:02:26,900 --> 01:02:30,290 this loop, plus the maximum of this, well, that's n. 1134 01:02:30,290 --> 01:02:34,860 Log n plus log n plus n is order n. 1135 01:02:34,860 --> 01:02:37,130 So order n span, which says 1136 01:02:37,130 --> 01:02:42,340 parallelism is order n squared. 1137 01:02:42,340 --> 01:02:45,080 So for 1,000 by 1,000 matrices, the parallelism is 1138 01:02:45,080 --> 01:02:49,376 on the order of a million. 1139 01:02:49,376 --> 01:02:50,626 Wow. 1140 01:02:52,430 --> 01:02:53,680 That's great. 1141 01:02:56,050 --> 01:03:00,600 However, it's on the order of a million, but as we know, 1142 01:03:00,600 --> 01:03:04,630 this doesn't use cache very effectively. 1143 01:03:04,630 --> 01:03:06,890 So one of the nice things about doing divide and conquer 1144 01:03:06,890 --> 01:03:08,860 is, as you know, that's a really good way to take 1145 01:03:08,860 --> 01:03:11,850 advantage of caching. 1146 01:03:11,850 --> 01:03:15,100 And this works in parallel, too. 
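For reference, the looping version being described looks something like this--a sketch assuming square row-major n-by-n matrices, not the slide's exact code:

    #include <cilk/cilk.h>

    // Parallelize the two outer loops; the inner k loop must stay
    // serial, because a cilk_for over k would race on C[i][j].
    // Work Theta(n^3); span Theta(lg n + lg n + n) = Theta(n);
    // parallelism Theta(n^2).
    void mm_loops(double *C, double *A, double *B, int n) {
        cilk_for (int i = 0; i < n; ++i)
            cilk_for (int j = 0; j < n; ++j)
                for (int k = 0; k < n; ++k)
                    C[i*n + j] += A[i*n + k] * B[k*n + j];
    }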
1147 01:03:15,100 --> 01:03:18,920 In particular because whenever you have sufficient 1148 01:03:18,920 --> 01:03:23,880 parallelism, these processors are executing the code just as 1149 01:03:23,880 --> 01:03:26,160 if they were executing serial code. 1150 01:03:26,160 --> 01:03:28,750 So you get all the same cache locality you would get in the 1151 01:03:28,750 --> 01:03:32,280 serial code in the parallel code, except for the times 1152 01:03:32,280 --> 01:03:34,300 that you're actually migrating work. 1153 01:03:34,300 --> 01:03:35,710 And if you have sufficient parallelism, 1154 01:03:35,710 --> 01:03:38,590 that isn't too often. 1155 01:03:38,590 --> 01:03:40,740 So let's take a look at recursive divide and conquer 1156 01:03:40,740 --> 01:03:41,890 multiplication. 1157 01:03:41,890 --> 01:03:45,770 So we're familiar with this, too. 1158 01:03:45,770 --> 01:03:48,520 So this is eight multiplications of n over 2 by 1159 01:03:48,520 --> 01:03:51,350 n over 2 matrices, and one addition of n by n matrices. 1160 01:03:51,350 --> 01:03:55,970 So here's a code using a little bit of C++ism. 1161 01:03:55,970 --> 01:04:01,560 So I've made the type a template variable, T. 1162 01:04:01,560 --> 01:04:07,710 So we're going to do matrix multiplication of an array, a, 1163 01:04:07,710 --> 01:04:10,220 the result is going to go in c, and we're going to 1164 01:04:10,220 --> 01:04:13,630 basically have a and b, and we're going to add 1165 01:04:13,630 --> 01:04:15,240 the result into c. 1166 01:04:15,240 --> 01:04:19,950 We have n, which is the side of the submatrix that we're 1167 01:04:19,950 --> 01:04:23,080 working on, and we're also going to have a size, which 1168 01:04:23,080 --> 01:04:26,105 is the length of the row in the original matrix. 1169 01:04:26,105 --> 01:04:29,870 So remember when we do matrix things, if I take a submatrix, 1170 01:04:29,870 --> 01:04:31,240 it's not contiguous in memory. 1171 01:04:31,240 --> 01:04:34,470 So I have to know the row size of the matrix that I'm in in 1172 01:04:34,470 --> 01:04:38,710 order to be able to calculate what the elements are. 1173 01:04:38,710 --> 01:04:41,370 So the way it's going to work is I'm going to assign this 1174 01:04:41,370 --> 01:04:46,060 temporary d, by using the new-- 1175 01:04:46,060 --> 01:04:49,096 which is basically memory allocation in C++-- 1176 01:04:49,096 --> 01:04:52,410 array of size n by n. 1177 01:04:52,410 --> 01:04:58,670 And what we're going to do is then do four of the recursive 1178 01:04:58,670 --> 01:05:05,470 multiplications, these guys here, into the elements of c, 1179 01:05:05,470 --> 01:05:12,000 and then four of them also into d using the temporary. 1180 01:05:12,000 --> 01:05:14,690 And then we're going to sync, after we get all that parallel 1181 01:05:14,690 --> 01:05:18,440 work done, and then we're going to add d into c, and 1182 01:05:18,440 --> 01:05:22,880 then we'll delete d, because we allocated it up here. 1183 01:05:22,880 --> 01:05:25,920 Everybody understand the code? 1184 01:05:25,920 --> 01:05:27,080 So we're doing this, it's just we're 1185 01:05:27,080 --> 01:05:30,490 going to do it in parallel. 1186 01:05:30,490 --> 01:05:31,520 Good? 1187 01:05:31,520 --> 01:05:33,620 Questions? 1188 01:05:33,620 --> 01:05:36,260 OK. 1189 01:05:36,260 --> 01:05:38,680 So this is the row length of the matrices so that I can do 1190 01:05:38,680 --> 01:05:41,590 the base cases, and in particular, partition the 1191 01:05:41,590 --> 01:05:43,350 matrices effectively.
1192 01:05:43,350 --> 01:05:45,720 I haven't shown that code. 1193 01:05:45,720 --> 01:05:47,980 And of course, the base case, normally, we would want to 1194 01:05:47,980 --> 01:05:49,560 coarsen for efficiency. 1195 01:05:49,560 --> 01:05:52,390 I would want to go down to something like maybe an eight 1196 01:05:52,390 --> 01:05:57,090 by eight or 16 by 16 matrix, and at that point switch to 1197 01:05:57,090 --> 01:06:01,880 something that's going to use the processor pipeline better. 1198 01:06:01,880 --> 01:06:04,360 The base cases, once again, I want to emphasize this because 1199 01:06:04,360 --> 01:06:07,430 a couple people on the quiz misunderstood this. 1200 01:06:07,430 --> 01:06:11,250 The reason you coarsen has nothing to do with caches. 1201 01:06:11,250 --> 01:06:14,410 The reason you coarsen is to overcome the overhead of the 1202 01:06:14,410 --> 01:06:18,440 function calls, and the coarsening is generally chosen 1203 01:06:18,440 --> 01:06:21,280 independent of what the size of the caches are. 1204 01:06:21,280 --> 01:06:25,590 It's not a parameter that has to be tuned to cache size. 1205 01:06:25,590 --> 01:06:28,190 It's a parameter that has to be tuned to function call, 1206 01:06:28,190 --> 01:06:33,080 versus ALU instructions, and what that balance is. 1207 01:06:33,080 --> 01:06:34,464 Question? 1208 01:06:34,464 --> 01:06:37,446 AUDIENCE: I mean, I understand that's true, but I thought-- 1209 01:06:37,446 --> 01:06:39,434 I mean, maybe I heard that wrong, 1210 01:06:39,434 --> 01:06:42,416 but I thought we wanted, in general, in terms of caching, 1211 01:06:42,416 --> 01:06:45,895 that you would choose it somehow so that all of the 1212 01:06:45,895 --> 01:06:48,380 data that you have would somehow fit-- 1213 01:06:48,380 --> 01:06:50,060 CHARLES LEISERSON: That's what the divide and conquer does 1214 01:06:50,060 --> 01:06:52,210 automatically. 1215 01:06:52,210 --> 01:06:54,550 The divide and conquer keeps halving it until it fits in 1216 01:06:54,550 --> 01:06:56,450 whatever size cache you have. 1217 01:06:56,450 --> 01:06:58,330 And in fact, we have three caches on the 1218 01:06:58,330 --> 01:06:59,770 machines we're using. 1219 01:06:59,770 --> 01:07:03,363 AUDIENCE: Yeah, but I'm saying if your coarsened constant is 1220 01:07:03,363 --> 01:07:05,160 too big, that's not going to happen. 1221 01:07:05,160 --> 01:07:07,030 CHARLES LEISERSON: If the coarsened constant is too big, 1222 01:07:07,030 --> 01:07:08,180 that's not going to happen. 1223 01:07:08,180 --> 01:07:12,120 But generally, the caches are much bigger than what you need 1224 01:07:12,120 --> 01:07:14,410 to do to amortize the cost. 1225 01:07:14,410 --> 01:07:16,660 But you're right, that is an assumption. 1226 01:07:16,660 --> 01:07:19,490 The caches are generally much bigger than the size that you 1227 01:07:19,490 --> 01:07:21,680 need in order to overcome function call overhead. 1228 01:07:21,680 --> 01:07:24,170 Function call overhead is not that high. 1229 01:07:24,170 --> 01:07:24,930 OK? 1230 01:07:24,930 --> 01:07:27,020 Good. 1231 01:07:27,020 --> 01:07:29,280 I'm glad I raised that issue again. 1232 01:07:29,280 --> 01:07:31,690 And so we're going to determine the submatrices by 1233 01:07:31,690 --> 01:07:33,330 index calculation. 1234 01:07:33,330 --> 01:07:36,400 And then we have to implement this parallel add, and that 1235 01:07:36,400 --> 01:07:42,270 I'm going to do just with a doubly nested for loop to add 1236 01:07:42,270 --> 01:07:43,520 the things.
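Putting the pieces just described together, here is a sketch of the recursive multiply-add. This is a reconstruction under my own conventions, not the slide's code: each matrix carries its own row stride so that the contiguous temporary d can be recursed into like the others, n is assumed a power of 2, and the base-case threshold of 16 is illustrative:

    #include <cilk/cilk.h>

    // C += A*B on n-by-n submatrices; cs, as, bs are the row strides
    // of C, A, B inside their original (possibly larger) matrices.
    void mm_dac(double *C, int cs, double *A, int as,
                double *B, int bs, int n) {
        if (n <= 16) {                        // coarsened base case
            for (int i = 0; i < n; ++i)
                for (int k = 0; k < n; ++k)
                    for (int j = 0; j < n; ++j)
                        C[i*cs + j] += A[i*as + k] * B[k*bs + j];
            return;
        }
        int h = n / 2;
        double *D = new double[n * n]();      // zeroed temporary, row stride n
        // Submatrix (r, c) of M with row stride s:
        #define SUB(M, r, c, s) ((M) + (r)*h*(s) + (c)*h)
        // Four products accumulate into C, four into D -- all eight parallel.
        cilk_spawn mm_dac(SUB(C,0,0,cs), cs, SUB(A,0,0,as), as, SUB(B,0,0,bs), bs, h);
        cilk_spawn mm_dac(SUB(C,0,1,cs), cs, SUB(A,0,0,as), as, SUB(B,0,1,bs), bs, h);
        cilk_spawn mm_dac(SUB(C,1,0,cs), cs, SUB(A,1,0,as), as, SUB(B,0,0,bs), bs, h);
        cilk_spawn mm_dac(SUB(C,1,1,cs), cs, SUB(A,1,0,as), as, SUB(B,0,1,bs), bs, h);
        cilk_spawn mm_dac(SUB(D,0,0,n),  n,  SUB(A,0,1,as), as, SUB(B,1,0,bs), bs, h);
        cilk_spawn mm_dac(SUB(D,0,1,n),  n,  SUB(A,0,1,as), as, SUB(B,1,1,bs), bs, h);
        cilk_spawn mm_dac(SUB(D,1,0,n),  n,  SUB(A,1,1,as), as, SUB(B,1,0,bs), bs, h);
        mm_dac(SUB(D,1,1,n), n, SUB(A,1,1,as), as, SUB(B,1,1,bs), bs, h); // no spawn needed
        cilk_sync;                            // wait for all eight products
        #undef SUB
        // The parallel add, C += D: the doubly nested cilk_for loop.
        cilk_for (int i = 0; i < n; ++i)
            cilk_for (int j = 0; j < n; ++j)
                C[i*cs + j] += D[i*n + j];
        delete[] D;                           // free the temporary
    }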
1237 01:07:45,530 --> 01:07:48,650 There's no cache behavior I can really take advantage of 1238 01:07:48,650 --> 01:07:51,750 here except for spatial locality. 1239 01:07:51,750 --> 01:07:53,760 There's no temporal locality because I'm just adding two 1240 01:07:53,760 --> 01:07:59,270 matrices once, so there's no real temporal locality that 1241 01:07:59,270 --> 01:08:00,726 I'll get out of it. 1242 01:08:00,726 --> 01:08:02,440 And here, I've actually done the index 1243 01:08:02,440 --> 01:08:03,690 calculations by hand. 1244 01:08:08,550 --> 01:08:11,050 So let's analyze this. 1245 01:08:11,050 --> 01:08:13,790 So to analyze the multiplication program, I have 1246 01:08:13,790 --> 01:08:16,590 to start by analyzing the addition program. 1247 01:08:16,590 --> 01:08:19,890 So this should be, I think, fairly straightforward. 1248 01:08:19,890 --> 01:08:26,290 What's the work for adding two n by n matrices here? 1249 01:08:26,290 --> 01:08:29,689 n squared, good, just doubly nested loop. 1250 01:08:29,689 --> 01:08:32,612 What's the span? 1251 01:08:32,612 --> 01:08:35,609 AUDIENCE: [INAUDIBLE]. 1252 01:08:35,609 --> 01:08:37,859 CHARLES LEISERSON: Yeah, here it's log n, very good. 1253 01:08:37,859 --> 01:08:40,830 Because I've got log n plus log n plus order one. 1254 01:08:44,600 --> 01:08:46,470 I'm not going to analyze the parallelism, because I really 1255 01:08:46,470 --> 01:08:48,260 don't care about the parallelism of the addition. 1256 01:08:48,260 --> 01:08:51,180 I really care about the parallelism of the matrix 1257 01:08:51,180 --> 01:08:51,859 multiplication. 1258 01:08:51,859 --> 01:08:54,859 But we'll plug those values in now. 1259 01:08:54,859 --> 01:08:57,430 What is the work of the matrix multiplication? 1260 01:08:57,430 --> 01:09:02,359 Well for this, what we want to do is get a recurrence that we 1261 01:09:02,359 --> 01:09:03,140 can then solve. 1262 01:09:03,140 --> 01:09:04,920 So what's the recurrence that we want to get 1263 01:09:04,920 --> 01:09:06,450 for m of 1 of n? 1264 01:09:11,050 --> 01:09:17,439 Yeah, it's going to be 8m sub 1 of n over 2, that 1265 01:09:17,439 --> 01:09:22,399 corresponds to these things, plus some constant stuff, plus 1266 01:09:22,399 --> 01:09:23,649 the work of the addition. 1267 01:09:26,300 --> 01:09:28,040 Does that make sense? 1268 01:09:28,040 --> 01:09:29,390 We analyze the work of the addition. 1269 01:09:29,390 --> 01:09:32,000 What's the work of the addition? 1270 01:09:32,000 --> 01:09:33,630 Order n squared. 1271 01:09:33,630 --> 01:09:36,100 So that's going to dominate that constant 1272 01:09:36,100 --> 01:09:38,569 there, so we get 8. 1273 01:09:38,569 --> 01:09:41,090 And what's the solution to this? 1274 01:09:41,090 --> 01:09:43,319 Back to Master Theorem. 1275 01:09:43,319 --> 01:09:46,210 Now we're going to start pulling out the Master Theorem 1276 01:09:46,210 --> 01:09:49,850 multiple times per slide for the rest of the lecture. 1277 01:09:49,850 --> 01:09:52,700 n cubed, because we have log base 2 of 8. 1278 01:09:52,700 --> 01:09:57,330 That's n cubed compared with n squared. 1279 01:09:57,330 --> 01:10:00,450 So we get a solution which is n cubed-- 1280 01:10:00,450 --> 01:10:02,390 Case 3 of the Master Theorem. 1281 01:10:02,390 --> 01:10:03,670 So that's good. 1282 01:10:03,670 --> 01:10:06,140 The work we're doing is the same asymptotic work we're 1283 01:10:06,140 --> 01:10:09,220 doing for the triply nested loop. 1284 01:10:09,220 --> 01:10:11,380 Now let's take a look at the span. 
1285 01:10:14,170 --> 01:10:15,790 So what's the span for this? 1286 01:10:21,030 --> 01:10:23,420 So once again, we want a recurrence. 1287 01:10:23,420 --> 01:10:24,670 What's the recurrence look like? 1288 01:10:31,930 --> 01:10:35,444 So the span of this is going to be the span of-- 1289 01:10:35,444 --> 01:10:37,345 it's going to be the sum of some things. 1290 01:10:39,990 --> 01:10:43,950 But the key observation is that it's going to be-- we 1291 01:10:43,950 --> 01:10:45,800 want the maximum of these guys. 1292 01:10:52,970 --> 01:10:55,850 So we're going to basically have the allocation as 1293 01:10:55,850 --> 01:11:00,900 constant time, we have the maximum of these, which is m 1294 01:11:00,900 --> 01:11:03,900 of infinity of n over 2, and then we have 1295 01:11:03,900 --> 01:11:07,160 the span of the add. 1296 01:11:07,160 --> 01:11:10,060 So we get this recurrence. 1297 01:11:10,060 --> 01:11:12,930 m infinity sub n over 2, because we have only to worry 1298 01:11:12,930 --> 01:11:15,740 about the worst of these guys. 1299 01:11:15,740 --> 01:11:17,400 The worst of them is-- 1300 01:11:17,400 --> 01:11:19,620 they're all symmetric, so it's basically the same. 1301 01:11:19,620 --> 01:11:21,800 We have a of n, and then there's a constant amount of 1302 01:11:21,800 --> 01:11:24,290 other overhead here. 1303 01:11:24,290 --> 01:11:28,320 Any questions about where I pulled that out of, why that's 1304 01:11:28,320 --> 01:11:29,570 the recurrence? 1305 01:11:32,710 --> 01:11:36,390 So this is the addition, the span of the addition of this 1306 01:11:36,390 --> 01:11:37,570 guy that we analyzed already. 1307 01:11:37,570 --> 01:11:41,000 What is the span of the addition? 1308 01:11:41,000 --> 01:11:42,900 What did we decide that was? 1309 01:11:42,900 --> 01:11:44,530 log n. 1310 01:11:44,530 --> 01:11:46,730 So basically, that dominates the order one. 1311 01:11:46,730 --> 01:11:48,920 So we get this term, and what's the solution of this 1312 01:11:48,920 --> 01:11:49,530 recurrence? 1313 01:11:49,530 --> 01:11:50,780 AUDIENCE: [INAUDIBLE]. 1314 01:11:54,270 --> 01:11:55,727 CHARLES LEISERSON: What case is this? 1315 01:11:55,727 --> 01:11:57,515 AUDIENCE: [INAUDIBLE] 1316 01:11:57,515 --> 01:11:58,860 log n squared. 1317 01:11:58,860 --> 01:12:02,500 CHARLES LEISERSON: Yes, it's log squared n. 1318 01:12:02,500 --> 01:12:04,530 So basically, it's case two. 1319 01:12:04,530 --> 01:12:07,880 So if I do n to the log base b of a, that's n to the log base 1320 01:12:07,880 --> 01:12:12,330 2 of 1, that's just 1. 1321 01:12:12,330 --> 01:12:15,900 And so this is basically a logarithmic factor times the 1322 01:12:15,900 --> 01:12:18,250 1, so we add an extra log. 1323 01:12:18,250 --> 01:12:19,500 We get log squared n. 1324 01:12:22,330 --> 01:12:24,300 That's just Master Theorem plugging in. 1325 01:12:24,300 --> 01:12:26,640 So here, the span is order log squared n. 1326 01:12:29,370 --> 01:12:33,840 And so we have the work of n cubed, the span of log squared 1327 01:12:33,840 --> 01:12:37,150 n, so the parallelism is the ratio, which is n cubed over 1328 01:12:37,150 --> 01:12:38,400 log squared n. 1329 01:12:40,810 --> 01:12:43,150 Not too bad for a 1,000 by 1,000 matrices, the 1330 01:12:43,150 --> 01:12:49,300 parallelism is about 10 million. 1331 01:12:49,300 --> 01:12:50,550 Plenty of parallelism. 
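To summarize the two recurrences just solved, in the lecture's notation:

\[
M_1(n) = 8\,M_1(n/2) + \Theta(n^2) = \Theta(n^3),
\qquad
M_\infty(n) = M_\infty(n/2) + \Theta(\log n) = \Theta(\log^2 n),
\]

so the parallelism is \(M_1(n)/M_\infty(n) = \Theta(n^3/\log^2 n)\).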
1332 01:12:52,940 --> 01:12:55,760 So let's use the fact that we have plenty of parallelism to 1333 01:12:55,760 --> 01:12:58,500 say, let's get rid of some of that parallelism and put it 1334 01:12:58,500 --> 01:13:02,180 back into making our code more efficient. 1335 01:13:02,180 --> 01:13:09,520 So in particular, this code uses an extra temporary d, 1336 01:13:09,520 --> 01:13:14,080 which it allocates here and it deletes here. 1337 01:13:14,080 --> 01:13:16,640 And generally, there's a good rule that says, if you use 1338 01:13:16,640 --> 01:13:18,570 more storage you're going to use more time, because you're 1339 01:13:18,570 --> 01:13:20,770 going to have to look at that storage, it's going to take up 1340 01:13:20,770 --> 01:13:23,220 space in your cache, and it's generally 1341 01:13:23,220 --> 01:13:24,300 going to make you slower. 1342 01:13:24,300 --> 01:13:28,470 So things that use less storage are generally faster. 1343 01:13:28,470 --> 01:13:30,240 Not always the case, sometimes there's a trade off. 1344 01:13:30,240 --> 01:13:34,290 But often it's the case, use more storage, it runs slower. 1345 01:13:34,290 --> 01:13:36,380 So let's get rid of this guy. 1346 01:13:36,380 --> 01:13:37,630 How do we get rid of this guy? 1347 01:13:44,440 --> 01:13:45,754 Yeah? 1348 01:13:45,754 --> 01:13:47,004 AUDIENCE: [INAUDIBLE PHRASE]. 1349 01:13:55,140 --> 01:13:55,530 CHARLES LEISERSON: You're going to do this serially, 1350 01:13:55,530 --> 01:13:55,860 you're saying? 1351 01:13:55,860 --> 01:13:58,510 AUDIENCE: Yeah, you do those serially in add. 1352 01:13:58,510 --> 01:14:00,880 CHARLES LEISERSON: If you do this serially in add, it turns 1353 01:14:00,880 --> 01:14:03,290 out if you do that, you're going to be in trouble because 1354 01:14:03,290 --> 01:14:07,120 you're going to not have very much parallelism, 1355 01:14:07,120 --> 01:14:09,330 unfortunately. 1356 01:14:09,330 --> 01:14:11,920 Actually, analyzing exactly what the parallelism is there 1357 01:14:11,920 --> 01:14:13,330 is actually pretty good. 1358 01:14:13,330 --> 01:14:15,130 It's a good puzzle. 1359 01:14:15,130 --> 01:14:18,740 Maybe we'll do that on the quiz, the take home problem 1360 01:14:18,740 --> 01:14:21,110 set we're calling it now, right? 1361 01:14:21,110 --> 01:14:23,170 We're going to have a take home problem set, maybe that's 1362 01:14:23,170 --> 01:14:26,100 a good one. 1363 01:14:26,100 --> 01:14:29,865 Yeah, so the idea is, you can sync. 1364 01:14:29,865 --> 01:14:36,810 And in particular, why not compute these, then sync, and 1365 01:14:36,810 --> 01:14:39,780 then compute these, adding their results into the places 1366 01:14:39,780 --> 01:14:41,030 where we added these in? 1367 01:14:43,850 --> 01:14:47,370 So it's making the program more serial, because I'm 1368 01:14:47,370 --> 01:14:50,420 putting in a sync. 1369 01:14:50,420 --> 01:14:52,300 That shouldn't have an impact on the work, but it will have 1370 01:14:52,300 --> 01:14:54,845 an impact on the span. 1371 01:14:58,430 --> 01:15:00,880 So we're going to trade it off, and the way we'll do that 1372 01:15:00,880 --> 01:15:04,470 is by putting essentially a sync in the middle. 1373 01:15:04,470 --> 01:15:06,700 And since they're adding it in, I don't even have we call 1374 01:15:06,700 --> 01:15:10,800 the addition routine, because it's just going to 1375 01:15:10,800 --> 01:15:12,800 add it in in place. 
1376 01:15:12,800 --> 01:15:16,190 So I spawn off these four guys, putting their results 1377 01:15:16,190 --> 01:15:20,260 into c, then I spawn off these four guys, and they add their 1378 01:15:20,260 --> 01:15:21,920 results into c. 1379 01:15:21,920 --> 01:15:25,060 Is that clear what the code is? 1380 01:15:25,060 --> 01:15:26,415 So let's analyze this. 1381 01:15:32,050 --> 01:15:41,570 So the work for this is order n cubed. 1382 01:15:41,570 --> 01:15:43,580 It's the same as anything else, we can come up with a 1383 01:15:43,580 --> 01:15:46,380 recurrence, slightly different from before because I only 1384 01:15:46,380 --> 01:15:48,510 have an order one there, but it doesn't really matter. 1385 01:15:48,510 --> 01:15:51,570 The answer is order n cubed. 1386 01:15:51,570 --> 01:15:55,850 The span, now this gets a little trickier. 1387 01:15:55,850 --> 01:15:57,200 What's the recurrence of the span? 1388 01:16:04,430 --> 01:16:06,570 AUDIENCE: [INAUDIBLE]. 1389 01:16:06,570 --> 01:16:07,706 CHARLES LEISERSON: What is that? 1390 01:16:07,706 --> 01:16:09,690 AUDIENCE: Twice the span of m of n over 2. 1391 01:16:09,690 --> 01:16:12,780 CHARLES LEISERSON: Twice the span of m of n 1392 01:16:12,780 --> 01:16:15,510 over 2, that's right. 1393 01:16:15,510 --> 01:16:18,560 So basically, we have the maximum of these guys, the 1394 01:16:18,560 --> 01:16:20,870 maximum of these guys, and then this is making those 1395 01:16:20,870 --> 01:16:23,090 things be in series. 1396 01:16:23,090 --> 01:16:25,650 So things that are in parallel I take the max, if it's in 1397 01:16:25,650 --> 01:16:26,940 series, I have to add them. 1398 01:16:26,940 --> 01:16:31,916 So I end up with 2m infinity of n over 2 plus order one. 1399 01:16:31,916 --> 01:16:35,270 Does that make sense? 1400 01:16:35,270 --> 01:16:36,250 OK, good. 1401 01:16:36,250 --> 01:16:37,550 So let's solve that recurrence. 1402 01:16:37,550 --> 01:16:40,260 What's the answer to that one? 1403 01:16:40,260 --> 01:16:41,120 That's order n. 1404 01:16:41,120 --> 01:16:44,290 Which case is it? 1405 01:16:44,290 --> 01:16:46,540 I never know what the cases are. 1406 01:16:46,540 --> 01:16:49,410 I know two, but one and three, it's like-- 1407 01:16:49,410 --> 01:16:52,030 they're the same thing, it's just which side it's in, so I 1408 01:16:52,030 --> 01:16:53,370 never remember what the number is. 1409 01:16:53,370 --> 01:16:55,250 But anyway, case one, yes. 1410 01:16:55,250 --> 01:16:57,720 Case one. 1411 01:16:57,720 --> 01:17:00,710 It's the one where this thing is bigger, so that's order n. 1412 01:17:04,410 --> 01:17:07,200 Good. 1413 01:17:07,200 --> 01:17:13,010 So then the work is n cubed, the span is order n, the 1414 01:17:13,010 --> 01:17:15,870 parallelism is order n squared. 1415 01:17:15,870 --> 01:17:18,660 So for 1,000 by 1,000 matrices, I get parallelism on 1416 01:17:18,660 --> 01:17:21,600 the order of a million, instead of before, where I had 1417 01:17:21,600 --> 01:17:25,710 parallelism on the order of 10 million. 1418 01:17:25,710 --> 01:17:29,540 So this turns out to be way better code than the previous one 1419 01:17:29,540 --> 01:17:33,840 because it avoids the temporary, and therefore 1420 01:17:33,840 --> 01:17:37,280 you get a constant factor improvement for that, and 1421 01:17:37,280 --> 01:17:42,300 still, on 12 cores, it's going to run pretty fast. 1422 01:17:42,300 --> 01:17:45,530 And in practice, this is a much better way to do it.
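A sketch of this no-temporary variant, in the same conventions as the earlier matrix-multiply sketch (again my reconstruction, with an illustrative base-case threshold):

    #include <cilk/cilk.h>

    // C += A*B with no temporary: four products into C, a sync, then
    // the other four accumulate into the same quadrants of C.
    // Work is still Theta(n^3); the extra sync raises the span to
    // Theta(n), which still leaves parallelism Theta(n^2).
    void mm_dac2(double *C, int cs, double *A, int as,
                 double *B, int bs, int n) {
        if (n <= 16) {                        // coarsened base case
            for (int i = 0; i < n; ++i)
                for (int k = 0; k < n; ++k)
                    for (int j = 0; j < n; ++j)
                        C[i*cs + j] += A[i*as + k] * B[k*bs + j];
            return;
        }
        int h = n / 2;
        #define SUB2(M, r, c, s) ((M) + (r)*h*(s) + (c)*h)
        cilk_spawn mm_dac2(SUB2(C,0,0,cs), cs, SUB2(A,0,0,as), as, SUB2(B,0,0,bs), bs, h);
        cilk_spawn mm_dac2(SUB2(C,0,1,cs), cs, SUB2(A,0,0,as), as, SUB2(B,0,1,bs), bs, h);
        cilk_spawn mm_dac2(SUB2(C,1,0,cs), cs, SUB2(A,1,0,as), as, SUB2(B,0,0,bs), bs, h);
        mm_dac2(SUB2(C,1,1,cs), cs, SUB2(A,1,0,as), as, SUB2(B,0,1,bs), bs, h);
        cilk_sync;   // first four products must finish before C is reused
        cilk_spawn mm_dac2(SUB2(C,0,0,cs), cs, SUB2(A,0,1,as), as, SUB2(B,1,0,bs), bs, h);
        cilk_spawn mm_dac2(SUB2(C,0,1,cs), cs, SUB2(A,0,1,as), as, SUB2(B,1,1,bs), bs, h);
        cilk_spawn mm_dac2(SUB2(C,1,0,cs), cs, SUB2(A,1,1,as), as, SUB2(B,1,0,bs), bs, h);
        mm_dac2(SUB2(C,1,1,cs), cs, SUB2(A,1,1,as), as, SUB2(B,1,1,bs), bs, h);
        cilk_sync;
        #undef SUB2
    }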
1423 01:17:45,530 --> 01:17:49,190 The actual best code that I know for doing this 1424 01:17:49,190 --> 01:17:52,590 essentially does divide and conquer in only one 1425 01:17:52,590 --> 01:17:54,580 dimension at a time. 1426 01:17:54,580 --> 01:17:57,270 So basically, it looks to see what's the long dimension, and 1427 01:17:57,270 --> 01:18:00,880 whatever the long dimension is, it slices it in half and 1428 01:18:00,880 --> 01:18:04,920 then recurses, and just does that as a binary thing. 1429 01:18:04,920 --> 01:18:06,450 And it basically is the same work, et cetera. 1430 01:18:06,450 --> 01:18:07,755 It's a little bit more tricky to analyze. 1431 01:18:14,820 --> 01:18:17,940 Let me quick do merge sort. 1432 01:18:17,940 --> 01:18:18,900 So you know merge sort. 1433 01:18:18,900 --> 01:18:23,470 There's merging two sorted arrays, we saw this before. 1434 01:18:23,470 --> 01:18:26,320 If I spend all this time doing animations, I might as well 1435 01:18:26,320 --> 01:18:29,830 get my mileage out of it. 1436 01:18:29,830 --> 01:18:30,480 There we go. 1437 01:18:30,480 --> 01:18:33,820 So you merge, that's basically what this code does. 1438 01:18:33,820 --> 01:18:37,090 Order n time to merge. 1439 01:18:37,090 --> 01:18:38,470 So here's merge sort. 1440 01:18:38,470 --> 01:18:40,880 So what I'll do in merge sort is the same thing I normally 1441 01:18:40,880 --> 01:18:52,210 do, except that I'll make recursive 1442 01:18:52,210 --> 01:18:53,500 routines go in parallel. 1443 01:18:53,500 --> 01:18:58,500 So when I do that, it basically divide and conquers 1444 01:18:58,500 --> 01:19:03,760 down, and then it sort of does this to merge things together. 1445 01:19:03,760 --> 01:19:08,710 So we saw this before, except now, I've got the fact that I 1446 01:19:08,710 --> 01:19:11,450 can sort two things in parallel rather than sorting 1447 01:19:11,450 --> 01:19:13,020 them serially. 1448 01:19:13,020 --> 01:19:14,375 So let's take a look at the work. 1449 01:19:14,375 --> 01:19:15,900 What's the work of merge sort? 1450 01:19:15,900 --> 01:19:18,770 We know that. 1451 01:19:18,770 --> 01:19:20,370 n log n, right? 1452 01:19:20,370 --> 01:19:26,790 2t of n over 2 plus order n, so that's order n log n. 1453 01:19:26,790 --> 01:19:29,590 The span is what? 1454 01:19:29,590 --> 01:19:30,840 What's the recurrence of the span? 1455 01:19:36,150 --> 01:19:38,890 So we're going to take the maximum of these two guys. 1456 01:19:38,890 --> 01:19:44,050 So we only have one term that involves t infinity, and then 1457 01:19:44,050 --> 01:19:46,570 the merge costs us order n, so we get this recurrence. 1458 01:19:49,440 --> 01:19:54,990 So that says that the solution is order n. 1459 01:19:54,990 --> 01:20:01,590 So therefore, the work is n log n, the span is order n, 1460 01:20:01,590 --> 01:20:03,858 and so the parallelism is order log n. 1461 01:20:06,546 --> 01:20:07,950 Puny. 1462 01:20:07,950 --> 01:20:08,410 Puny parallelism. 1463 01:20:08,410 --> 01:20:11,750 Log n is like, you can run it, and it'll work fine on a few 1464 01:20:11,750 --> 01:20:14,250 cores, but it's not to be something that generally will 1465 01:20:14,250 --> 01:20:17,570 scale and give you a lot of parallelism. 1466 01:20:17,570 --> 01:20:20,630 So it's pretty clear from this that the bottleneck-- 1467 01:20:20,630 --> 01:20:22,330 where's all the span going to? 1468 01:20:22,330 --> 01:20:23,580 It's going to that merge. 
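Here is a sketch of the merge sort just analyzed--my reconstruction, using an assumed scratch array--with the serial merge whose order-n cost is where all the span goes:

    #include <cilk/cilk.h>

    // Serial merge of sorted runs X[0..nx) and Y[0..ny) into out.
    void serial_merge(const int *X, int nx, const int *Y, int ny, int *out) {
        int i = 0, j = 0, k = 0;
        while (i < nx && j < ny) out[k++] = (X[i] <= Y[j]) ? X[i++] : Y[j++];
        while (i < nx) out[k++] = X[i++];
        while (j < ny) out[k++] = Y[j++];
    }

    // tmp is scratch space of length at least n. In real code you would
    // coarsen the base case rather than recursing all the way to n == 1.
    void merge_sort(int *A, int *tmp, int n) {
        if (n <= 1) return;
        int h = n / 2;
        cilk_spawn merge_sort(A, tmp, h);       // sort left half in parallel
        merge_sort(A + h, tmp + h, n - h);      // sort right half
        cilk_sync;
        serial_merge(A, h, A + h, n - h, tmp);  // the order-n serial merge
        for (int i = 0; i < n; ++i) A[i] = tmp[i];  // copy back
    }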
1469 01:20:26,730 --> 01:20:28,960 So when you understand that that's the structure of it, 1470 01:20:28,960 --> 01:20:31,410 now you say if you want to get parallelism, you've got to go 1471 01:20:31,410 --> 01:20:32,230 after the merge. 1472 01:20:32,230 --> 01:20:34,480 So here's how we parallelize the merge. 1473 01:20:34,480 --> 01:20:36,615 So we're going to look at merging of two arrays that are 1474 01:20:36,615 --> 01:20:38,920 of possibly different length. 1475 01:20:38,920 --> 01:20:41,220 So one we'll call A, and one we'll call B, 1476 01:20:41,220 --> 01:20:42,810 with na and nb elements. 1477 01:20:42,810 --> 01:20:46,540 And let me assume without loss of generality that na is 1478 01:20:46,540 --> 01:20:48,830 greater than or equal to nb, because otherwise I can just 1479 01:20:48,830 --> 01:20:51,280 switch the roles of A and B. 1480 01:20:51,280 --> 01:20:53,370 So the way that I'm going to do it is I'm going to find the 1481 01:20:53,370 --> 01:20:56,490 middle element of A. These are sorted arrays that 1482 01:20:56,490 --> 01:20:58,300 I'm going to merge. 1483 01:20:58,300 --> 01:21:01,990 I find the middle element of A, so these guys are less 1484 01:21:01,990 --> 01:21:04,540 than or equal to a of ma, and these are greater 1485 01:21:04,540 --> 01:21:05,930 than or equal to. 1486 01:21:05,930 --> 01:21:09,570 And now I binary search and find out where that middle 1487 01:21:09,570 --> 01:21:14,840 element would fall in the array B. So that costs me log 1488 01:21:14,840 --> 01:21:16,460 n time to binary search. 1489 01:21:16,460 --> 01:21:17,710 Remember binary search? 1490 01:21:22,420 --> 01:21:25,560 Then what I'm going to do is recursively merge these guys, 1491 01:21:25,560 --> 01:21:28,490 because these are sorted and less than or equal to ma, 1492 01:21:28,490 --> 01:21:31,950 recursively merge those and put this guy in the middle. 1493 01:21:35,140 --> 01:21:42,180 So when I do that, the key question when we analyze-- 1494 01:21:42,180 --> 01:21:45,310 it turns out the work is going to basically be the same, but 1495 01:21:45,310 --> 01:21:49,170 the key thing is going to be what happens to the span? 1496 01:21:49,170 --> 01:21:52,080 And the idea here is that the total number of elements in 1497 01:21:52,080 --> 01:21:59,690 the larger of these two things is going to be at most what? 1498 01:21:59,690 --> 01:22:04,015 Another way of looking at it is in the smaller partition, 1499 01:22:04,015 --> 01:22:06,230 if n is the total number of elements, the smaller 1500 01:22:06,230 --> 01:22:09,310 partition has how many elements at 1501 01:22:09,310 --> 01:22:11,010 least relative to n? 1502 01:22:13,750 --> 01:22:17,620 No matter where this binary search finds itself. 1503 01:22:17,620 --> 01:22:21,060 So the worst case is sort of going to come when this guy is 1504 01:22:21,060 --> 01:22:24,900 like at one end or the other. 1505 01:22:24,900 --> 01:22:28,070 And then the point is that because A is the larger array, 1506 01:22:28,070 --> 01:22:30,640 at least a quarter of the elements will still be in the 1507 01:22:30,640 --> 01:22:33,340 smaller partition. 1508 01:22:33,340 --> 01:22:37,280 Of all the elements here, at least a quarter will be in the 1509 01:22:37,280 --> 01:22:39,840 smaller partition, which will occur when B is equal 1510 01:22:39,840 --> 01:22:45,170 in size to A. So the number, in the larger of the recursive 1511 01:22:45,170 --> 01:22:46,790 merges, is at most 3/4 n. 1512 01:22:49,350 --> 01:22:50,750 Sound good?
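In code, the scheme just described might look like the following sketch (my reconstruction; the coarsening threshold of 64 is illustrative):

    #include <cilk/cilk.h>

    // Merge sorted A[0..na) and B[0..nb) into C. Swap so A is the
    // longer run, take A's middle element, binary-search where it
    // lands in B, place it, and merge the two sides in parallel.
    void p_merge(int *C, int *A, int na, int *B, int nb) {
        if (na < nb) {                         // keep A the longer array
            int *tp = A; A = B; B = tp;
            int tn = na; na = nb; nb = tn;
        }
        if (na == 0) return;                   // both runs empty
        if (na + nb <= 64) {                   // coarsened base: serial merge
            int i = 0, j = 0, k = 0;
            while (i < na && j < nb) C[k++] = (A[i] <= B[j]) ? A[i++] : B[j++];
            while (i < na) C[k++] = A[i++];
            while (j < nb) C[k++] = B[j++];
            return;
        }
        int ma = na / 2;
        int lo = 0, hi = nb;                   // binary search: first index
        while (lo < hi) {                      // of B with B[index] >= A[ma]
            int mid = (lo + hi) / 2;
            if (B[mid] < A[ma]) lo = mid + 1; else hi = mid;
        }
        int mb = lo;
        C[ma + mb] = A[ma];                    // middle element's final slot
        cilk_spawn p_merge(C, A, ma, B, mb);   // smaller-elements merge
        p_merge(C + ma + mb + 1, A + ma + 1, na - ma - 1, B + mb, nb - mb);
        cilk_sync;
    }

The larger of the two recursive merges gets at most 3/4 of the elements, which is the fact that bounds the span.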
1513 01:22:50,750 --> 01:22:54,410 That's the main, key idea behind this. 1514 01:22:54,410 --> 01:22:57,420 So here's the parallel merge. 1515 01:22:57,420 --> 01:23:01,290 Basically you do the binary search, and then you spawn the 1516 01:23:01,290 --> 01:23:02,660 two merges. 1517 01:23:02,660 --> 01:23:04,720 Here's one merge, and here's the other merge, 1518 01:23:04,720 --> 01:23:06,140 and then you sync. 1519 01:23:06,140 --> 01:23:09,590 So that's the code for the doing the parallel merge. 1520 01:23:09,590 --> 01:23:11,430 And now you want to incorporate that parallel 1521 01:23:11,430 --> 01:23:14,750 merge into the parallel merge sort. 1522 01:23:14,750 --> 01:23:16,805 Of course, you coarsen the base cases for efficiency. 1523 01:23:21,190 --> 01:23:24,190 So let's analyze the span of this. 1524 01:23:24,190 --> 01:23:29,470 So the span is basically then the span of something of 3/4, 1525 01:23:29,470 --> 01:23:36,380 at most 3/4, the size plus the log n for the binary search. 1526 01:23:36,380 --> 01:23:40,500 So the span of parallel merge is therefore order log squared 1527 01:23:40,500 --> 01:23:44,300 n, because the important thing is, I'm whacking off a 1528 01:23:44,300 --> 01:23:46,415 constant fraction here every time. 1529 01:23:46,415 --> 01:23:52,050 So I get log squared n as the span, and for the work I get this 1530 01:23:52,050 --> 01:23:58,270 hairy recurrence, that it's t of alpha n plus t of 1 minus alpha 1531 01:23:58,270 --> 01:24:03,920 n plus log n, where alpha falls in this range. 1532 01:24:03,920 --> 01:24:07,700 This does not satisfy the Master Theorem. 1533 01:24:07,700 --> 01:24:10,080 You can actually do this pretty easily with a recursion 1534 01:24:10,080 --> 01:24:13,270 tree, but the way to verify is-- 1535 01:24:13,270 --> 01:24:16,370 we call this technically a hairy recurrence. 1536 01:24:16,370 --> 01:24:20,360 That's the technical term for it. 1537 01:24:20,360 --> 01:24:23,830 So it turns out, this has order n, just like ordinary 1538 01:24:23,830 --> 01:24:28,010 merge, order n time. 1539 01:24:28,010 --> 01:24:30,540 You can use the substitution method, and I 1540 01:24:30,540 --> 01:24:32,230 won't drag you through it, but you can look 1541 01:24:32,230 --> 01:24:35,510 at it in the notes. 1542 01:24:35,510 --> 01:24:39,180 And this should be very familiar to you as having all 1543 01:24:39,180 --> 01:24:43,270 aced 6006, right? 1544 01:24:43,270 --> 01:24:44,560 Otherwise you wouldn't be here, right? 1545 01:24:47,930 --> 01:24:51,640 So the parallelism of the parallel merge is something 1546 01:24:51,640 --> 01:24:55,670 like n over log squared n. 1547 01:24:55,670 --> 01:25:00,510 So that's much better than having an order n bound. 1548 01:25:00,510 --> 01:25:03,340 And now, we can plug it into merge sort. 1549 01:25:03,340 --> 01:25:05,920 So the work is going to be the same as before, because I just 1550 01:25:05,920 --> 01:25:08,780 have the work of the merge, which is still order n. 1551 01:25:08,780 --> 01:25:12,190 So the work is order n log n, once again pulling out the 1552 01:25:12,190 --> 01:25:13,550 Master Theorem. 1553 01:25:13,550 --> 01:25:21,660 And then the span is the span of n over 2 plus log squared n, because basically, 1554 01:25:21,660 --> 01:25:27,590 I have the span of a problem of half the size plus the span 1555 01:25:27,590 --> 01:25:28,910 that I need to merge things. 1556 01:25:28,910 --> 01:25:30,660 That's order log squared n. 1557 01:25:30,660 --> 01:25:32,300 This I want to pause on for a moment.
1558 01:25:32,300 --> 01:25:35,520 People get this recurrence? 1559 01:25:35,520 --> 01:25:38,230 Because this is the span of the merge. 1560 01:25:38,230 --> 01:25:45,310 And so what I end up with is I get another log, log cubed n. 1561 01:25:45,310 --> 01:25:50,350 And so the total parallelism is n over log squared n. 1562 01:25:50,350 --> 01:25:55,440 And this is actually quite a practical thing to implement, 1563 01:25:55,440 --> 01:25:59,980 to get the n over log squared n parallelism versus just a 1564 01:25:59,980 --> 01:26:02,770 log n parallelism. 1565 01:26:02,770 --> 01:26:04,200 We're not going to do tableau construction. 1566 01:26:04,200 --> 01:26:07,170 You can read that up, that's on the notes that are online, 1567 01:26:07,170 --> 01:26:11,010 but you should read through that part of it. 1568 01:26:11,010 --> 01:26:13,520 It's got some nice animations which you don't get to see. 1569 01:26:20,630 --> 01:26:23,240 This is like when you do longest common subsequence and 1570 01:26:23,240 --> 01:26:25,460 stuff like that, how you would solve that type 1571 01:26:25,460 --> 01:26:27,110 of problem in parallel. 1572 01:26:27,110 --> 01:26:28,360 OK, great.
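As a recap, the final bounds for merge sort with the parallel merge:

\[
T_1(n) = 2\,T_1(n/2) + \Theta(n) = \Theta(n \log n),
\qquad
T_\infty(n) = T_\infty(n/2) + \Theta(\log^2 n) = \Theta(\log^3 n),
\]

giving parallelism \(\Theta(n \log n / \log^3 n) = \Theta(n/\log^2 n)\).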