The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

JOHN DONG: All right, so I'm sure everybody's curious about Project 2.2 Beta, so here are the preliminary performance results. There's still a bit more work to do on finalizing these numbers, but here's how they look. But before I show you that, I'd like to yell at you all for a little bit. This is a timeline of the submission deadline and when people submitted things. From this zone to this zone is an hour, and I see something like 50% of the commits made during that period. A little bit of background: the submission checking system automatically clones a repository up to the deadline and not a second after. So if you look from this point to this point, some people took quite a jump back. So just as a warning, please try to submit things on time. 11:59 means 11:59. And to drive that point further, this is an example of a commit that I saw in somebody's repository, whose name I blanked out. Obviously, it's seven seconds past the deadline, so the automatic repository cloner didn't grab it. And the previous commit before that was about 10 days earlier, when Reid pushed out the pentominoes grades. So don't do that either.

Here's a little breakdown of how often people commit. About a quarter of the class made only one commit to their repository. Half of you did 3 to 10 commits, which seems about right. And in the 21-plus bucket, there was somebody who did 100-some commits, which was pretty impressive. Yeah, very smart dude. Committing often is a good idea, so that you don't run into a situation like the one before. And next time, there is definitely going to be next to zero tolerance for people who don't commit things on time for the deadlines. Now, the numbers that people want.
So for rotate, just as an interesting data point, this is the 512 by 512 case. It seems like not everybody remembered to carry their optimizations for the 512 case over to the final submission; otherwise you would expect the numbers to be a bit more similar and not off by a factor of eight. And for rotate overall, that's the distribution. The speedup factor is normalized to some constant that gives everybody a reasonable number. And there were a lot of groups that had code that didn't build. Yes?

AUDIENCE: Does the speedup only include the rotate dot 64?

JOHN DONG: Yes. Performance was only tested on the rotate dot 64, which is what we said in the handout as well. So there were a lot of groups whose code didn't build, which really surprised me, and I think it's probably because half of you pushed things after the deadline, and I presume those pushes contained important commits toward making your code actually work. But in this case, there were no header files involved and there was no cross-testing. All we did was run your makefile on your code, and we replaced your testbed dot c and your ktiming, so I'm not quite sure why people had code that didn't build.

For sort, this was the maximum size input that we allowed you to run. One group did really well. So all of these are correct. The correctness test is built in, and I replaced it with a clean copy that contains a couple of additional checks, by the way. Then I tried the other extreme, a relatively small array.

AUDIENCE: Is the top the same person?

JOHN DONG: I'm not sure whether or not the top is the same person. But distribution-wise, it seems like people didn't quite remember to optimize for the smallest case. And these are the current overall speedup factors for sort, averaging; we did about 10 to 15 cases each for rotate and sort. And that's the overall speedup distribution.
SAMAN AMARASINGHE: OK, so now you're done with individual projects. You did the last project individually, and now we are moving into, again, a group project. So the first thing we have is an automated system set up for you to say who your group members are. We will send you information, and with that, what you have to do is run a script saying who your group members are. Both group members have to do it, and then we will basically set up that account for you.

That said, a lot of you didn't know, in the first project, how to work with a group and what the right mode of operation is. If we gave you 100,000 lines of code to write, it would make sense to say, OK, I'm going to divide the problem in half, one person does one half, the other person does the other half. But the reason for doing the group is to try to get you to do pair programming, because talking to a lot of you and getting a lot of feedback, it looks like most of you spent a huge amount of time debugging. And since you're only writing a small amount of code, it makes a lot more sense to sit with your partner in front of the screen, one person typing, the other person looking over, and then you have a much faster way of getting through the debugging process. So for the next one, don't try to divide the problem in half. Just try to find some time and sit with each other.

Then the other really disturbing thing is that there have been a couple of groups that were completely dysfunctional. We get emails saying, OK, my group member didn't talk to me, or they didn't do any work, or they were very condescending. And that's really sad, because from my experience with MIT students, when you go to a company, you will probably be the best programmers there. There's no question about it; I have seen that, to the point that some people might even resent having this best programmer around. But what I have also seen is that a lot of you cannot work in a group, and if you haven't developed that skill, you will not be the most impactful person.
I have seen that again and again in my experience doing a start-up. Our MIT students' way of making an impact is to pull all-nighters and do the entire project by themselves. That's doable when you're making a small change to a large project, but if you want to make a big change, you can't do that. You have to work with the group, figure out how to have an impact, how to communicate. This is more important learning than, say, figuring out how you can optimize something. So being an individual contributor who can do amazing things is important, but not being able to work with a group is going to make the impact you can have much smaller. So please, please learn how to work with your group members. Some of them might not be as good as you are, and that will probably be true in real life, too, but that doesn't mean you can be condescending toward them and make them feel inferior. That doesn't cut it. You have to learn how to work with these people.

So part of your learning is working with others, and that's a large part of your learning. Don't consider that to be some external thing, even though you might think you can do a better job on your own. Just work with the other person, especially in pair programming, where both of you are sitting together, because it's much easier that way. There are four eyes on your project, so your partner might see something you don't see. See whether you can work together like that. And I don't want to hear any more stories saying, look, my partner was too dumb, or my partner didn't show up, so I couldn't deal with it. Those are not great excuses. Some of them we will pay attention to, because it might be one person's unilateral actions that lead to that. Still, please try to figure out how you can work with your partners. I hope you have a good partner experience. Use pair programming, use a lot of good debugging techniques, and the next project will be fine.

CHARLES LEISERSON: Great.
We're going to talk more about caches. Whoo-whoo! OK. So for those who weren't here last time, we talked about the ideal cache model. As you recall, it has a two-level hierarchy, a cache size of M bytes, and a cache-line length of B bytes. It's fully associative and uses an optimal, omniscient replacement strategy. However, we also learned that LRU is a good substitute, and that any of the asymptotic results you can get with the optimal strategy, you can also get with LRU. The two performance measures we talked about were the work, which is what the processor ends up doing, and the cache misses, which are the transfers between cache and main memory. You only have to count one direction, because what goes in basically goes out, so more or less it's the same number.

OK, so I'd like to start today by talking about some very basic algorithms that you have seen in your algorithms and data structures class, but which may look new when we start taking caches into account. So the first one here is the problem of merging two sorted arrays. As you recall, you can basically do this in linear time. The way the algorithm works is that it looks at the first element of each of the two arrays to be merged, and whichever is smaller, it puts in the output. Then it advances the pointer past that element, and again, whichever is smaller, it puts in the output. At every step it's doing just a constant amount of work, and there are n items, so by the time this process is done, we've spent time proportional to the number of items in the output list. So this should be fairly familiar: the time to merge n elements is order n.
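Here's a minimal C sketch of that linear-time merge; the function name and the int element type are my own choices, not from the lecture:

```c
#include <stddef.h>

/* Merge sorted arrays a[0..na) and b[0..nb) into out[0..na+nb).
 * Each step does O(1) work and advances one input index, so the
 * whole merge takes Theta(n) time for n = na + nb elements. */
void merge(const int *a, size_t na, const int *b, size_t nb, int *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb)
        out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na) out[k++] = a[i++];  /* drain whichever input remains */
    while (j < nb) out[k++] = b[j++];
}
```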
Now, the reason merging is useful is that you can use it in a sorting algorithm, a merge sorting algorithm. The way merge sort works is that it essentially does divide and conquer on the array. It divides the array into two pieces, divides each of those into two pieces, and those into two, until it gets down to something of unit size. And then what it does is merge the pairs of arrays. So for example here, the 19 and 3 got merged together to become 3 and 19. The 12 and 46 were already in order, but you still had to do work to get them there, and so forth. So it puts everything in order in pairs, and then for each of those, it puts them together into fours, and for each of those, it puts them together into the final list. Now of course, the way it does this is not in the order I showed you. It actually goes down and does a walk of this tree. But conceptually, you can see that it essentially comes down to merging pairs, merging quadruples, merging octuples, and so forth, all the way until the program is done.
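And here's merge sort as a short C sketch built on the merge routine above; the caller-supplied scratch buffer is my own convention, not something specified in the lecture:

```c
/* Sort a[0..n), using tmp[0..n) as scratch space.
 * Work recurrence: W(n) = 2 W(n/2) + Theta(n) = Theta(n log n). */
void merge_sort(int *a, size_t n, int *tmp)
{
    if (n <= 1) return;                  /* one element: already sorted */
    size_t half = n / 2;
    merge_sort(a, half, tmp);            /* sort the left half */
    merge_sort(a + half, n - half, tmp); /* sort the right half */
    merge(a, half, a + half, n - half, tmp);
    for (size_t i = 0; i < n; i++)       /* copy the merged result back */
        a[i] = tmp[i];
}
```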
So to calculate the work of merge sort, this is something you've seen before, because it's exactly what you do in your algorithms class. You get a recurrence that says that the work, in this case, is: if you have only one element, a constant amount of work, and otherwise, I solve two problems of half the size plus order n work, which is the time to merge the two halves. So classic divide and conquer. And I'm sure you're familiar with what the solution to this recurrence is. What's the solution? n log n. I want to step through it anyway, just to get everybody warmed up on the way I want to solve recurrences, so that when we do the caching analysis, we have a common framework for understanding how that analysis will work.

So we're going to solve this recurrence, and if the base case is constant, we usually omit it; it's assumed. We start out with W of n, and we replace it by the right-hand side, where we put the constant term on the top and then the two children. Here I've gotten rid of the theta, because conceptually, when I'm done, I can put a big theta around the whole tree, and it just makes the math a little easier and a little clearer.

So then I take each of those and I split those, and this time I've got n over 4. Correct? I checked for that one this time. It's funny, because it was actually still wrong just a few minutes before class as I was going through. And we keep doing that until we get down to something of size one, until the recurrence bottoms out. So when you look at a recursion tree of this nature, the first thing you typically want to do is look at the height of the tree. In this case, we're taking a problem of size n, and we're halving it at every step. And so the number of times we have to halve the argument (which also turns out to be equal to the per-level work here, but that's just coincidence) is log n. So the height is log base 2 of n. Now what we typically do is add things up across the rows, across the levels. On the top level, we have n. On the next level, we have n. The next level, hey, n. To add up the bottom, just to make sure, we have to count how many leaves there are, and the number of leaves, since this is a binary tree, is just 2 to the height. So it's 2 to the log n, which is n. Then I add across all the leaves, and I get the order 1 at each leaf times n leaves, which is order n. And so now I'm in a position to add up the total work, which is basically log n levels of n, for a total of order n log n. So hopefully this is all review. Hopefully this is all review. If you haven't seen this before, it's really neat, isn't it? But you've missed something along the way.

So now with caching. The first thing to observe is that the merge subroutine incurs order n over B cache misses. As you're going through, these arrays are laid out contiguously in memory. You're just going through the data once, order n data, all accessed contiguously. And so every time you bring in data, you get full spatial locality. There are n elements, so it costs n over B.
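Collecting what's on the board (my rendering, with B the cache-line size as before):

```latex
W(n) = 2\,W(n/2) + \Theta(n) = \Theta(n \lg n)
\quad \text{($\lg n$ levels of $\Theta(n)$ each)},
\qquad
Q_{\mathrm{merge}}(n) = \Theta(n/B).
```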
So is that plain? Hopefully that part's plain: each access you bring in gets you the same factor of B. So now merge sort. And this is, once again, where the hard part is coming up with a recurrence, and then the other hard part is solving it. So there are two hard parts to recurrences. OK, so the merge sort algorithm solves two problems of size n over 2, and then does a merge. The second line here is pretty straightforward: I take the cache misses of the two subproblems, and I add a merge. I may have a few other accesses in there, but they're going to be dominated by the merge, so it's still going to be theta n over B.

Now the hard part, generally, of dealing with cache analysis is the base case, because the base case is more complicated than when you just do running time, where you get to run down to a base case of constant size. Here, you don't get to run down to a base case of constant size. So what it says here is that we're going to run down until I have a sorting problem that fits in cache: n is going to be less than some constant times M, for some sufficiently small constant c less than 1. When it finally fits in cache, how many cache misses does it take me to sort it? Well, I only need the cold misses to bring that array into cache, and that's just proportional to n over B, because for all the rest of the levels of merging, you're inside the cache. Does that make sense? So that's where we get this recurrence. It's always a tricky thing to figure out how to write the recurrence for a given problem. Then, as I say, the other tricky thing is how do you solve it?
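In symbols, the cache-miss recurrence for merge sort (my rendering of the board):

```latex
Q(n) =
\begin{cases}
\Theta(n/B) & \text{if } n < cM \text{ for a sufficiently small constant } c \le 1,\\
2\,Q(n/2) + \Theta(n/B) & \text{otherwise.}
\end{cases}
```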
But we're going to solve it essentially the same way as we did before. I'm not going to go through all the steps here, except to elaborate a bit. So we have n over B at the top, and then we divide it into two problems (whoops, there's a c there that doesn't belong; it should just be n over 2B on both), and then n over 4B, and so forth. And we keep going down until we get to our base case.

Now in our base case, what I claim is that when I hit this base case, it's going to be the case that n is, in fact, a constant factor times M, so that n over B is almost the same as M over B. And the reason is that just before I hit the base case, I was at size twice n, and that was bigger than my constant times M. So if twice n is bigger than my constant times M, but n is smaller than it, then n and M are essentially the same size to within a constant factor, within a factor of two, in fact. And so therefore, here I can say that it's order M over B.

And now the question is, how many levels did I have to go down, cutting things in half, before I got to something of size M over B? Well, the way I usually think about this is, you can do it by taking the difference, as I did before. The height of the whole tree is going to be log base 2 of n, and the height of the bottom part is basically log of the size of n when the base case occurs. Well, n at that point is something like cM. So the number of levels is basically log n minus log cM, which is log of n over cM. How about some questions? Yeah, question.

AUDIENCE: What's the reason why you just substituted on the left, [INAUDIBLE] over B, but on the right [INAUDIBLE]?

CHARLES LEISERSON: Here?

AUDIENCE: No.

CHARLES LEISERSON: Or you mean here?

AUDIENCE: [INAUDIBLE].

CHARLES LEISERSON: On the right side.

AUDIENCE: On the right-most leaf, it's n over B. Is that all of the leaves added up, because [INAUDIBLE]?

CHARLES LEISERSON: No, no, no, this is going to be all of the leaves added up here. This is the stack I have on the right-hand side, so we'll get there. So the point is, the number of leaves is 2 to this height, so that's just n over cM, and each leaf costs M over B.
Well, n over cM times M over B: the M's cancel, and I get essentially n over B at the leaves, with whatever that constant is. And so now I have n over B across every level, and when I add those up, I have log of n over cM levels, which is the same as log of n over M to within a constant. So I have n over B times log of n over M.
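Adding the levels and the leaves (my rendering):

```latex
Q(n)
= \underbrace{\Theta\!\left(\frac{n}{B}\right)\cdot \Theta\!\left(\lg\frac{n}{cM}\right)}_{\text{internal levels}}
+ \underbrace{\frac{n}{cM}\cdot\Theta\!\left(\frac{M}{B}\right)}_{\text{leaves}}
= \Theta\!\left(\frac{n}{B}\,\lg\frac{n}{M}\right).
```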
Yeah, question?

AUDIENCE: The initial assumption is that c is some sufficiently small number, so 1 over c would be a rather large factor.

CHARLES LEISERSON: It could potentially be a large factor, but it's a constant. In other words, it can't vary with n. In fact, for something like merge sort, the constant is typically only a few, because the question is, how many other things do you need to fit in cache at the same time? Here, you have to fit both the input and the output into cache in order not to take misses, so it's basically going to be a factor of 2 for merge sort. For the matrix multiplication, it was like a factor of three. So generally a fairly small number. Question?

AUDIENCE: I guess that makes sense, but [INAUDIBLE]. So for the size of the leaves, can you assume that n is more than cM, so you can substitute--

CHARLES LEISERSON: Yeah, because basically, when it hits this condition--

AUDIENCE: Right, I understand that. But then why isn't there more or less just one M over B at the bottom, because there are n over cM leaves, and n is the same as cM. At the bottom level, there should be--

CHARLES LEISERSON: Oh, did I do something wrong here? The number of leaves is--

AUDIENCE: [INAUDIBLE].

CHARLES LEISERSON: Right, right, sorry. This is the n at the top. You always have to be careful here. The n in the leaf count is not the little n at a leaf; it's the n that we had at the top. That's the nature of recurrences: the n keeps recurring, and you have to keep track of which one is which, so it can be confusing. We're analyzing everything in terms of the n that started out at the top. Some people write these things in terms of k and then analyze for n, and for some people that can be helpful, to disambiguate the two. I always find it wastes a variable, and you know, those variables are hard to come by. There's only a finite number of them.

OK, so are we good on this? So here, we ended up with n over B times log of n over M cache misses. So how does that compare? Let's do a little thinking about this. Here's the recurrence, and I solved it out to this, so let's look at what it means. If I have a really big n, much bigger than the size of my cache, then I'm going to have a factor of B log n fewer misses than work. So suppose n is as big as M squared, say. Then n over M would still be huge, and this log of n over M would still be order log n, so I would basically have n over B log n, for a factor of B log n fewer misses than work. If they're about the same size (did I get this right?), if n is approximately M, maybe just a little bit bigger, then the log here disappears completely, and so I basically just have n over B misses.

AUDIENCE: [INAUDIBLE].

CHARLES LEISERSON: Yeah, but if n is like--

AUDIENCE: [INAUDIBLE].

CHARLES LEISERSON: Yeah. In fact, for this, you have to be careful as you get to the base cases. Technically, for some of this, I should be saying 1 plus log of n over M, and in some of the things I do later, I will put in the ones.
But if you're looking at it asymptotically and n gets big, you don't have to worry about those cases. The 1 plus just handles whether you're looking at n getting large or whether you want a formula that holds for all n, even when n is small. Question?

AUDIENCE: [INAUDIBLE]?

CHARLES LEISERSON: The work was n log n, yes. The work was n log n. So here we basically have n over B log n misses, so I'm saving a factor of B in the case where they're about the same. Did I get this right? I'm just looking at this, and now I'm trying to reverse engineer what my argument is. So we're looking at n log n versus n over B times log of n over M.

AUDIENCE: [INAUDIBLE PHRASE]. So that you get a factor of B fewer misses, because you would be getting n over B times log of n; that's the only way you're getting a factor of B fewer misses. So I don't understand how you're saying that for n more or less equal to M. You would want something more like, for n--

CHARLES LEISERSON: Well, if n and M are about the same size, the number of cache misses is just n over B. The number of cache misses is n over B, and the work is n log n, so I've saved a factor of B times log n, OK? What did I say?

AUDIENCE: [INAUDIBLE].

CHARLES LEISERSON: B log M? No, I was saying that's for the case when n is much bigger than M. So let's take a look at that case; let me just do it on the board here. Let's suppose that n is like M squared, just as an example, a big number. So I'm going to look at, essentially, n over B times log of n over M. So what is n over M? It's about M, which is about the square root of n, right? So this basically ends up being approximately n over B times log of the square root of n, which is the same as log n to within a constant factor; I'm going to leave out the constant factors here. Then I want to compare that with n log n. So I get a factor of B fewer misses. So for the first one, yes, OK, I get a factor of B fewer misses; you're right.
Then over here I get a factor of B log n fewer misses. So I think I've got these switched. The case I was just doing is for n much bigger than M. So let's do the other case; I think I've got the two things switched, and I'll fix it in the notes. If n and M are approximately the same, then the log is a constant, right? So this ends up being approximately n over B. And now when I take a look at the difference between the two, I get B log n. So I had the two cases mixed. Yeah?

AUDIENCE: As n approaches M, the log approaches zero, but you were saying it technically should be--

CHARLES LEISERSON: 1 plus the log, yes.

AUDIENCE: So technically, that approaches one as the log approaches zero.

CHARLES LEISERSON: Yeah.

AUDIENCE: These things are really hard for me, because they seem really arbitrary. And then you're like, oh yeah, you can just put a 1 on top of there. I always miss those, because I usually try to do the math as rigorously as I can, and those ones generally do not appear, and you're like, oh, sure, whatever. So how am I supposed to know that the log is actually not going to be zero, so that I don't conclude, yeah, you're not going to take any cache misses?

CHARLES LEISERSON: Because generally, what we're doing is looking at how things scale, so we're generally looking at n being big, in which case it doesn't matter. These things only matter if n is small. For example, notice here that if n gets less than M, we're in real trouble, right? Because now the log is negative. Wait, what does that mean? Well, the answer is that the analysis was assuming that n was sufficiently large compared with M.

AUDIENCE: Why can't you just say, oh, when n over M is less than one, you can assume n is 2M? In that case, you get log of two, which is still something or other.

CHARLES LEISERSON: Yeah, exactly. So what happens in these things is that if you get right on the cusp of fitting in cache, then exactly what the answer of the analysis is, is dicey. But if you assume that it doesn't fit in, what's going to happen? Or that it does fit in, what's going to happen? And then the behavior right on the edge is somewhere in between. Good. So I switched these; I said this the other way around. That's funny. I went through this, and then in my notes I had them switched, and I said, oh my gosh, I did this wrong. And I've just gone through it, and it turns out I was right in my notes.
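For the record, here are the two regimes the right way around, as he settles on above (my summary; work is W(n) = Θ(n lg n) in both cases):

```latex
\begin{aligned}
n \approx M &:\; Q(n) = \Theta(n/B), && \text{a factor } \Theta(B \lg n) \text{ fewer misses than work};\\
n = M^2 &:\; Q(n) = \Theta\!\left(\tfrac{n}{B}\lg\tfrac{n}{M}\right) = \Theta\!\left(\tfrac{n}{B}\lg n\right), && \text{a factor } \Theta(B) \text{ fewer}.
\end{aligned}
```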
Now, one of the things, if you look at what's going on, let's just go back to this picture here. What's going on is that each one of the passes we're doing to do a merge takes n over B misses to do a binary merge. We're going through all the data just to merge two things, traversing all the data. So you can imagine: what would happen if I did, say, a four-way merge? With a four-way merge, I could actually merge four things with only a little bit more than n over B misses. In fact, that's what we're going to analyze in general. So the idea is that we can improve our cache efficiency by doing multi-way merging.

So the idea here is, let's merge R subarrays, where R is, let's say, less than n, with a tournament. So here are R subarrays, and each of them, let's say, is of size n over R. And what we're going to do is merge them with a tournament, where we say, who's the winner of these two, who's the winner of these two, et cetera. And then whoever wins at the top here, we take them and put them in the output, and then we repeat the tournament.
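As a concrete sketch of the tournament, here's one possible C implementation, entirely my own (a winner tree in heap layout; it assumes C99 VLAs and uses INT_MAX as an end-of-run sentinel, so element values must be less than INT_MAX):

```c
#include <limits.h>
#include <stddef.h>

/* One sorted input run. */
typedef struct { const int *data; size_t len, pos; } run_t;

/* Head of run r, or "infinity" once the run is exhausted. */
static int head(const run_t *runs, size_t r)
{
    return runs[r].pos < runs[r].len ? runs[r].data[runs[r].pos] : INT_MAX;
}

/* R-way merge of runs[0..R) into out[0..n), n = total element count.
 * tree[1] always holds the index of the run with the smallest head.
 * Setup plays Theta(R) matches; each extracted element replays only
 * its root-to-leaf path, i.e. Theta(log R) work per element. */
void rway_merge(run_t *runs, size_t R, int *out, size_t n)
{
    size_t tree[2 * R];                     /* C99 VLA; leaves at R..2R-1 */
    for (size_t r = 0; r < R; r++)
        tree[R + r] = r;                    /* leaf holds its run index */
    for (size_t i = R - 1; i >= 1; i--) {   /* play the initial tournament */
        size_t l = tree[2 * i], r = tree[2 * i + 1];
        tree[i] = head(runs, l) <= head(runs, r) ? l : r;
    }
    for (size_t k = 0; k < n; k++) {
        size_t w = tree[1];                 /* overall winner */
        out[k] = runs[w].data[runs[w].pos++];
        /* Replay only the matches on the winner's path to the root. */
        for (size_t i = (R + w) / 2; i >= 1; i /= 2) {
            size_t l = tree[2 * i], r = tree[2 * i + 1];
            tree[i] = head(runs, l) <= head(runs, r) ? l : r;
        }
    }
}
```

The replay loop touches only the log R nodes on the winner's path, which is exactly the property the lecture uses to charge log R work per output element.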
Now let's just look at what happens. It takes order R work to produce the first output. We've got R things here, and to play off this tournament, there are R nodes here; each has to do a constant amount of comparing before I end up with a single value to put in the output. So it costs me R to get this thing warmed up. But once I find the winner, and I remove the winner from whatever chain he might have come along, how quickly can I repopulate the tournament with the next guy? The next guy only has to play the tournament on the path that the winner was on; for all the other matches, we already know who won. So the second guy only costs me log R to produce, and the next guy is log R, and so on. Once we get going, an R-way merge costs only log R work per element. So the total work in merging is R, to get started, plus n log R. Well, R is less than n, so that's just n log R total to do the merging. That's the work.

Now, let's take a look at what happens if I do merge sort with R-way merges. If I have only one element, then it costs order one time, because there's nothing to do; just put it in the output. Otherwise, I've got R problems of size n over R to solve, and my merge takes n log R time.

So if I look at the recursion tree, I have n log R here at the top, then I branch R ways, and then I have n over R log R for each subproblem at the next level, n over R squared log R at the level after that, et cetera.
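In symbols, the work recurrence for R-way merge sort and its solution (my rendering):

```latex
W(n) =
\begin{cases}
\Theta(1) & \text{if } n = 1,\\
R\,W(n/R) + \Theta(n \lg R) & \text{otherwise,}
\end{cases}
\qquad
W(n) = \Theta\!\big(n \lg R \cdot \log_R n\big) = \Theta(n \lg n).
```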
AUDIENCE: You said that the cost of processing is R.

CHARLES LEISERSON: It's n log R.

AUDIENCE: But is--

CHARLES LEISERSON: Up front, there's an order R cost, but the order R cost is dominated by the n log R, so we don't have to count it separately. We just have to worry about this term.

So as I go through here, I basically end up having a tree which is only log base R of n tall, because I'm dividing things into R pieces each time, rather than into two pieces. So I only go log base R of n steps until I get to the base case. But I'm doing an R-way merge, so the number of leaves is still n. Now when I add across here, I get n times log R, and across the next level, I get n times log R again, because I've got R copies of n over R log R. Then I've got R squared copies of n over R squared log R, and so forth. So at every level, I have n log R, and the total is n log R times the number of levels, which is log base R of n, plus the order n work at the bottom, which we can ignore because it's dominated. And what do you notice here? What's log base R of n? That's just log n over log R. So the log R's cancel, and I get n log n plus n, which is just n log n. So after all that work, we still do the same amount of work: whether I do binary merging or R-way merging, the work is the same. But there's a big difference when it comes to caching.

So it's the same work as binary merge sort. Let's take a look at the caching. Let's assume that my tournament fits in cache: I want R to be less than some constant times M over B. So when I consider the R-way merging of contiguous arrays of total size n, the entire tournament plus one block from each array can fit in cache. So the tournament is never going to be responsible for generating cache misses, because I'm going to leave the tournament in cache. If I'm the optimal algorithm, I'm going to say, let's just leave the tournament in cache and bring in all the other things as we do the operation. Question?

AUDIENCE: [INAUDIBLE]

CHARLES LEISERSON: Those circles that I had, the tree.

AUDIENCE: Is that a cumulative list of the elements that you've merged in already?

CHARLES LEISERSON: I'm sorry, is the--

AUDIENCE: Is it a cumulative list of the arrays that you've merged already?

CHARLES LEISERSON: No, no, no, you haven't merged them. Let's just go back and make sure we understand the algorithm. The algorithm says that we compare the heads of each pair and produce a single value here, because these are already sorted to do the merge. These are already sorted.
So I just have the minimum of these two here, and the minimum of these two here, and the minimum of all four of them here, and so on, repeated. When we get to the top, we have the minimum of all of these guys; that's the minimum overall, and we put him in the output array. And now we walk back down the path that he came from. Let's walk down the path, say we get to this guy; let's advance the pointer in here and bring out another element. And now we play off the tournament here, play off the guy here, and he advances, and he advances, whatever. And now some other path may hold the minimum. But it only took me log R work, because I'm only keeping copies of the elements, if you will, or the results of the comparisons, along this path in the tree.

And that tree, we're saying, fits in the cache, plus one block from each array, the current block. Whatever cache block we're at in each of these arrays, no matter how far down we've gone, one block from each of them fits in cache. So the entire tournament plus one block from each array can fit in cache, and therefore the number of cache misses I take when I do the merge is essentially just the faults on that one cache block whenever I move past it in each array, plus the same for the output. So the total number of cache misses is going to be n over B, because I'm just striding straight through memory, and the tournament I don't have to worry about, because it's sitting in cache. And there's enough room in cache that for all the other stuff, I can keep one block from each array resident and still expect to find it there. In fact, you need the tall-cache assumption to ensure that they all fit.

So then for the R-way merge sort: if the problem is sufficiently small, once again we have the case that it fits in cache, so I only have the cold misses to get there, n over B, if n is less than cM.
769 00:42:29,120 --> 00:42:32,730 And otherwise, it's R copies of the number of cache misses 770 00:42:32,730 --> 00:42:42,220 for n over R, plus n over B. Because this is what it took 771 00:42:42,220 --> 00:42:44,050 us here to do the merge. 772 00:42:44,050 --> 00:42:48,400 We get only n over B faults when we merge, as long as the 773 00:42:48,400 --> 00:42:50,390 tournament fits in cache. 774 00:42:50,390 --> 00:42:52,410 If the tournament doesn't fit in cache, it's a more 775 00:42:52,410 --> 00:42:54,626 complicated analysis. 776 00:42:54,626 --> 00:42:57,972 AUDIENCE: --n over B, that's cold misses. 777 00:42:57,972 --> 00:43:00,840 You're getting the stuff-- 778 00:43:00,840 --> 00:43:03,070 CHARLES LEISERSON: Yeah, basically, it's the cold 779 00:43:03,070 --> 00:43:05,910 misses on the data, yes, basically. 780 00:43:11,200 --> 00:43:12,030 Good. 781 00:43:12,030 --> 00:43:16,230 So now, let's do the recursion tree for this. 782 00:43:16,230 --> 00:43:19,090 So we basically have n over B that we're going to pay at 783 00:43:19,090 --> 00:43:24,780 every level, dividing by R, et cetera, down to the point 784 00:43:24,780 --> 00:43:26,215 where things fit in cache. 785 00:43:30,290 --> 00:43:32,960 And by the time it fits in cache, it's going to be m over 786 00:43:32,960 --> 00:43:35,890 B, because n will be approximately m, just as we 787 00:43:35,890 --> 00:43:38,630 had before when we were doing the binary case. 788 00:43:38,630 --> 00:43:41,800 As soon as the subarray completely fits in cache, I 789 00:43:41,800 --> 00:43:43,750 don't take any more misses when I'm doing the sorting. 790 00:43:43,750 --> 00:43:46,390 So this is now analyzing not the merging, this is analyzing 791 00:43:46,390 --> 00:43:49,360 the sorting now. 792 00:43:49,360 --> 00:43:50,920 This is the sorting, not the merging. 793 00:43:54,330 --> 00:43:58,930 So we get down to m over B, and I've gone now log base R, 794 00:43:58,930 --> 00:44:01,180 not log base 2 as we did before, but log 795 00:44:01,180 --> 00:44:05,010 base R of n over cm. 796 00:44:05,010 --> 00:44:08,360 The number of leaves is n over cm, and so when I multiply 797 00:44:08,360 --> 00:44:11,330 this out, I get the same n over B here, and I've got n 798 00:44:11,330 --> 00:44:12,730 over B at every level here. 799 00:44:15,420 --> 00:44:17,060 So where's the win? 800 00:44:17,060 --> 00:44:21,440 The win is that I have only log base R of n over cm levels, 801 00:44:21,440 --> 00:44:26,040 rather than log base 2 of n over cm levels in the tree, because the 802 00:44:26,040 --> 00:44:29,640 amount that every level cost me was the same, 803 00:44:29,640 --> 00:44:30,890 asymptotically. 804 00:44:33,710 --> 00:44:37,520 So when I add it up, I get n over B log base R of n over m, 805 00:44:37,520 --> 00:44:40,600 instead of n over B log base 2 of n over m. 806 00:44:40,600 --> 00:44:42,530 So how do we tune R? 807 00:44:45,100 --> 00:44:49,460 Well if we just look at this formula here, if I want to 808 00:44:49,460 --> 00:44:51,580 tune R, what should I do to R to make this 809 00:44:51,580 --> 00:44:52,830 be as small as possible? 810 00:44:55,100 --> 00:44:57,300 Make it as big as possible. 811 00:44:57,300 --> 00:45:01,360 But I had to assume that R was less than some constant times 812 00:45:01,360 --> 00:45:04,890 m over B so that it fits in cache. 813 00:45:04,890 --> 00:45:12,850 So that's, in fact, what I do, is I say R is m over B.
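(Collecting the arithmetic of that analysis in one place -- a summary in standard notation, with M the cache size and B the line size; nothing here beyond what was just said:)

$$\text{work: }\ \underbrace{n\log R}_{\text{per level}}\cdot\underbrace{\log_R n}_{\text{levels}} \;=\; n\log R\cdot\frac{\log n}{\log R} \;=\; n\log n,$$

$$Q(n) \;=\; \begin{cases} \Theta(n/B), & n < cM,\\ R\,Q(n/R) + \Theta(n/B), & \text{otherwise,} \end{cases} \qquad\Rightarrow\qquad Q(n) \;=\; \Theta\!\left(\frac{n}{B}\,\log_R\frac{n}{M}\right),$$

and choosing $R = \Theta(M/B)$ turns that last log into $\log_{M/B}$.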
So it 814 00:45:12,850 --> 00:45:17,890 fits in cache, we have at least one block for each thing 815 00:45:17,890 --> 00:45:19,020 that we're merging. 816 00:45:19,020 --> 00:45:22,150 And then when we do the analysis now, I take log base 817 00:45:22,150 --> 00:45:27,210 m over B here, and that's compared to the binary one, 818 00:45:27,210 --> 00:45:31,910 which was log base 2. That's a factor of 819 00:45:31,910 --> 00:45:34,110 log of m over B savings in cache misses, 820 00:45:34,110 --> 00:45:38,050 because log base m over B of something is just its log over the log of m over B. 821 00:45:38,050 --> 00:45:39,700 Now, is that a significant number? 822 00:45:39,700 --> 00:45:41,110 Let's take a look. 823 00:45:41,110 --> 00:45:57,830 So if your L1 cache is 32 kilobytes, and we have cache 824 00:45:57,830 --> 00:46:01,410 lines of 64 bytes, then m over B is 2 to the 9th, so that is basically the difference in 825 00:46:01,410 --> 00:46:04,420 the exponents, a 9x savings. 826 00:46:04,420 --> 00:46:07,230 For the L2 cache, we get about a 12x savings. 827 00:46:07,230 --> 00:46:09,430 For L3, we get about a 17x savings. 828 00:46:09,430 --> 00:46:11,910 Now of course, there are some other constants going on in 829 00:46:11,910 --> 00:46:14,360 here, so you can't be absolutely sure that it's 830 00:46:14,360 --> 00:46:17,320 exactly these numbers, but it's going to be proportional 831 00:46:17,320 --> 00:46:20,200 to these numbers. 832 00:46:20,200 --> 00:46:24,750 So that's pretty good savings to do multi-way merging. 833 00:46:24,750 --> 00:46:28,030 So generally when you merge, don't merge pairs. 834 00:46:28,030 --> 00:46:30,270 Not a very good way of doing it if you want to take good 835 00:46:30,270 --> 00:46:33,210 advantage of cache. 836 00:46:33,210 --> 00:46:36,680 May give you some ideas for how to improve some sorts that 837 00:46:36,680 --> 00:46:37,930 you might have looked at. 838 00:46:40,890 --> 00:46:43,830 Now it turns out that there's a cache oblivious sorting 839 00:46:43,830 --> 00:46:47,360 algorithm, where you don't actually have to know the cache size-- 840 00:46:47,360 --> 00:46:50,360 that was a cache aware algorithm that knew the size 841 00:46:50,360 --> 00:46:52,635 of the cache, and we tuned R to get there. 842 00:46:52,635 --> 00:46:57,630 There is an algorithm called funnelsort, which is based on 843 00:46:57,630 --> 00:47:00,510 recursively sorting n to the 1/3 groups of 844 00:47:00,510 --> 00:47:03,200 n to the 2/3 items. 845 00:47:03,200 --> 00:47:06,930 And then you merge the sorted groups with a merging process 846 00:47:06,930 --> 00:47:08,780 called an n to the 1/3 funnel. 847 00:47:11,540 --> 00:47:15,230 So this is more for fun, although the sorting 848 00:47:15,230 --> 00:47:20,710 algorithm, in my experience, from what others have told me 849 00:47:20,710 --> 00:47:24,110 about implementing it and so forth, is probably about 30% 850 00:47:24,110 --> 00:47:27,750 slower than the best hand-tuned algorithm. 851 00:47:27,750 --> 00:47:30,020 Whereas with matrix multiplication, the cache 852 00:47:30,020 --> 00:47:33,450 oblivious algorithms are as good as any cache aware 853 00:47:33,450 --> 00:47:37,350 algorithm as a practical matter, here, they're off by 854 00:47:37,350 --> 00:47:39,660 about 20% or 30%. 855 00:47:39,660 --> 00:47:43,580 So an interesting research topic is to build one of these things 856 00:47:43,580 --> 00:47:45,980 and make it really efficient so that it can compete with 857 00:47:45,980 --> 00:47:48,030 real sorts.
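(Backing up a moment: to make the tournament merge from a few minutes ago concrete, here is a minimal C sketch -- not the course's code. It assumes R is a power of two at least 2, int keys, and INT_MAX as an end-of-run sentinel:)

#include <limits.h>
#include <stdlib.h>

typedef struct {
    const int *a;   /* one sorted input run */
    size_t i, n;    /* cursor and length */
} Run;

/* Head of a run; exhausted runs present an infinite key and so lose. */
static int head(const Run *r) {
    return r->i < r->n ? r->a[r->i] : INT_MAX;
}

/* tree[1] is the root; tree[R..2R-1] are leaves holding run indices.
   After one run advances, replay only the log R games on its path. */
static void replay(int *tree, const Run *runs, int R, int leaf) {
    for (int v = (R + leaf) / 2; v >= 1; v /= 2) {
        int l = tree[2 * v], r = tree[2 * v + 1];
        tree[v] = head(&runs[l]) <= head(&runs[r]) ? l : r;
    }
}

/* Merge R sorted runs totalling `total` elements into `out`. */
void rway_merge(Run *runs, int R, int *out, size_t total) {
    int *tree = malloc(2 * (size_t)R * sizeof *tree);
    for (int k = 0; k < R; k++) tree[R + k] = k;        /* leaves */
    for (int v = R - 1; v >= 1; v--) {                  /* initial playoff */
        int l = tree[2 * v], r = tree[2 * v + 1];
        tree[v] = head(&runs[l]) <= head(&runs[r]) ? l : r;
    }
    for (size_t j = 0; j < total; j++) {
        int w = tree[1];                  /* run holding the overall minimum */
        out[j] = runs[w].a[runs[w].i++];  /* emit its head, advance */
        replay(tree, runs, R, w);
    }
    free(tree);
}

(The point of the analysis above is that the 2R-entry tree stays resident in cache, so the misses come only from striding through the runs and the output.)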
858 00:47:48,030 --> 00:47:51,890 So the k funnel merges k cubed items in k sorted lists, 859 00:47:51,890 --> 00:47:54,070 incurring this many cache misses. 860 00:47:54,070 --> 00:47:58,090 Here, I did put in the one, for people who are concerned 861 00:47:58,090 --> 00:48:00,550 about the ones. 862 00:48:00,550 --> 00:48:04,400 And so then, you get this recurrence for the cache 863 00:48:04,400 --> 00:48:08,330 misses, because you solve n to the 1/3 problems of size n to 864 00:48:08,330 --> 00:48:13,900 the 2/3 recursively, plus this amount for merging. 865 00:48:13,900 --> 00:48:16,550 And that ends up giving you this bound, which turns out to 866 00:48:16,550 --> 00:48:17,800 be asymptotically optimal. 867 00:48:20,740 --> 00:48:25,050 And the way it works is there's basically, a k funnel 868 00:48:25,050 --> 00:48:27,520 is constructed recursively. 869 00:48:27,520 --> 00:48:32,120 And the idea is that what we have is, we have recursive square root of k 870 00:48:32,120 --> 00:48:36,570 funnels, so this is going to be a merging process, that is 871 00:48:36,570 --> 00:48:49,400 going to produce k cubed items by having buffers of size k to the 3/2, 872 00:48:49,400 --> 00:48:53,380 where each buffer is fed by its own square root of k funnel 873 00:48:53,380 --> 00:48:56,050 taking square root of k inputs. 874 00:48:56,050 --> 00:48:58,380 So each of these funnels is going to produce k to the 3/2 at a time, and there are square 875 00:48:58,380 --> 00:49:04,000 root of k of them, and each of 876 00:49:04,000 --> 00:49:06,290 these buffers is going to be length k to the 3/2, so repeating that, we end up 877 00:49:06,290 --> 00:49:09,260 with k cubed. 878 00:49:09,260 --> 00:49:12,870 And they basically feed each other, and then they get 879 00:49:12,870 --> 00:49:17,330 merged by their own square root of k funnel at the output, and each of these then 880 00:49:17,330 --> 00:49:19,290 recursively is constructed the same way. 881 00:49:21,980 --> 00:49:24,930 And the basic idea is that you keep filling the buffers, I 882 00:49:24,930 --> 00:49:28,315 think I say this here, so that all these buffers end up being 883 00:49:28,315 --> 00:49:30,050 in contiguous storage. 884 00:49:30,050 --> 00:49:33,710 And the idea is, rather than going and just getting one 885 00:49:33,710 --> 00:49:37,660 element out as you do in a typical tournament, as long as 886 00:49:37,660 --> 00:49:41,860 you're going to go merge, let's merge a lot of stuff and 887 00:49:41,860 --> 00:49:43,390 put it into our buffer so we don't have to 888 00:49:43,390 --> 00:49:45,350 go back here again. 889 00:49:45,350 --> 00:49:48,880 So you sort of batch your merging in local regions, and 890 00:49:48,880 --> 00:49:51,250 that ends up using the cache efficiently 891 00:49:51,250 --> 00:49:52,500 in the local regions. 892 00:49:54,960 --> 00:49:56,520 Enough of sorting. 893 00:49:56,520 --> 00:50:01,280 Let's go on to physics. 894 00:50:01,280 --> 00:50:04,260 So many of you have probably studied, in your linear 895 00:50:04,260 --> 00:50:07,630 algebra class or elsewhere, the heat equation. 896 00:50:07,630 --> 00:50:11,130 So, people familiar with heat diffusion? 897 00:50:11,130 --> 00:50:14,680 So it's a common one to do, and these were-- 898 00:50:14,680 --> 00:50:18,110 I have a former student, Matteo Frigo, who is a 899 00:50:18,110 --> 00:50:24,750 brilliant coder on anything cache oblivious. 900 00:50:24,750 --> 00:50:26,860 He's got the best code out there.
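(For reference, the funnelsort recurrence he sketched, written out -- this is the standard statement from the cache-oblivious sorting literature under the tall-cache assumption, up to constant factors, not something quoted from the slide:)

$$Q(n) \;=\; n^{1/3}\,Q\!\left(n^{2/3}\right) \;+\; \Theta\!\left(\frac{n}{B}\,\log_{M/B}\frac{n}{B}\right) \qquad\Rightarrow\qquad Q(n) \;=\; \Theta\!\left(\frac{n}{B}\,\log_{M/B}\frac{n}{B}\right),$$

which matches the lower bound for sorting in this model -- that is the sense in which it's asymptotically optimal.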
901 00:50:30,680 --> 00:50:35,660 So the 2D heat equation, what we do is let's let u(t, x, y) 902 00:50:35,660 --> 00:50:39,730 be the temperature at time t at point (x, y). 903 00:50:39,730 --> 00:50:41,910 And now you can go through the physics and come up with an 904 00:50:41,910 --> 00:50:46,210 equation that looks like this, which says that basically the 905 00:50:46,210 --> 00:50:51,010 partial of u with respect to t is proportional to the sum of 906 00:50:51,010 --> 00:50:55,240 the second partials with respect to x and 907 00:50:55,240 --> 00:50:58,510 with respect to y. 908 00:50:58,510 --> 00:51:01,380 So basically what that says is, the bigger the temperature difference 909 00:51:01,380 --> 00:51:04,210 between two things, the quicker things are going to 910 00:51:04,210 --> 00:51:07,890 adjust, the quicker the heat 911 00:51:07,890 --> 00:51:09,170 moves between them. 912 00:51:09,170 --> 00:51:14,220 And alpha is the thermal diffusivity, which has-- 913 00:51:14,220 --> 00:51:18,220 different materials have different thermal 914 00:51:18,220 --> 00:51:20,840 diffusivities. 915 00:51:20,840 --> 00:51:24,660 Say that three times fast. 916 00:51:24,660 --> 00:51:31,520 So if we do a simulation, we can end up with heat, say, 917 00:51:31,520 --> 00:51:35,840 put in like this, and after a while it looks like this. 918 00:51:35,840 --> 00:51:37,170 See if we can get this running here. 919 00:51:37,170 --> 00:51:38,420 So now, let me see. 920 00:51:45,500 --> 00:51:51,940 So I can move my cursor around and make things. 921 00:51:51,940 --> 00:51:54,080 You can just sort of see that it simulates. 922 00:51:54,080 --> 00:51:57,280 You can see the simulation is actually pretty slow. 923 00:51:57,280 --> 00:52:01,940 Now, on my slide, I have a thing here that says-- 924 00:52:01,940 --> 00:52:04,920 let's see if this breaks when we do it again. 925 00:52:04,920 --> 00:52:06,170 There we go. 926 00:52:08,960 --> 00:52:13,220 So we're getting around 100 frames per minute in doing 927 00:52:13,220 --> 00:52:14,470 this simulation. 928 00:52:17,550 --> 00:52:21,120 And so how does this simulation work? 929 00:52:21,120 --> 00:52:22,380 So let's take a look at that. 930 00:52:22,380 --> 00:52:27,040 It's kind of a neat problem. 931 00:52:27,040 --> 00:52:30,310 So this is what happened when I did 6.172 932 00:52:30,310 --> 00:52:31,050 for a little while. 933 00:52:31,050 --> 00:52:34,510 It basically gave me that after a while, because it just 934 00:52:34,510 --> 00:52:37,450 sort of averages things, smears it out. 935 00:52:37,450 --> 00:52:38,290 So what's going on? 936 00:52:38,290 --> 00:52:40,700 Let's look at it in one dimension, because it's easier 937 00:52:40,700 --> 00:52:45,410 to understand than if we take on two dimensions. 938 00:52:45,410 --> 00:52:48,940 So assuming that we have, say, a bar which has no 939 00:52:48,940 --> 00:52:50,190 differential in this direction, 940 00:52:50,190 --> 00:52:52,250 only in this direction. 941 00:52:52,250 --> 00:52:58,420 So then we get to drop the partials with respect to y. 942 00:52:58,420 --> 00:53:01,270 So if I take a look at that, what I can do is what's called 943 00:53:01,270 --> 00:53:02,110 a finite difference 944 00:53:02,110 --> 00:53:03,870 approximation, which you probably-- 945 00:53:03,870 --> 00:53:06,600 who's studied finite differences? 946 00:53:06,600 --> 00:53:07,760 So a few people. 947 00:53:07,760 --> 00:53:10,080 It's OK if you haven't. 948 00:53:10,080 --> 00:53:12,420 That's OK if you haven't, I'll teach it to you now.
949 00:53:12,420 --> 00:53:15,530 And then you're free to forget it, because that's not the 950 00:53:15,530 --> 00:53:17,520 part that I want you to understand, but it is 951 00:53:17,520 --> 00:53:18,780 interesting. 952 00:53:18,780 --> 00:53:22,660 So what I can do is look at the partial, for example, with 953 00:53:22,660 --> 00:53:26,890 respect to t, and just do an approximation that says, well, 954 00:53:26,890 --> 00:53:29,120 let me perturb t a little bit-- 955 00:53:29,120 --> 00:53:30,580 that's what it means. 956 00:53:30,580 --> 00:53:38,040 So I take u of t plus delta t minus u of t, divided by t plus delta t 957 00:53:38,040 --> 00:53:41,300 minus t, which gives me delta t in the denominator. 958 00:53:41,300 --> 00:53:42,270 And I can use that as an 959 00:53:42,270 --> 00:53:44,350 approximation for this partial. 960 00:53:47,410 --> 00:53:50,090 Then on the right hand side-- well first of all, let me get 961 00:53:50,090 --> 00:53:52,930 the first derivative with respect to x. 962 00:53:52,930 --> 00:53:55,470 And basically here what I'll do is I'll do an approximation 963 00:53:55,470 --> 00:54:00,740 where I take u at x plus delta x over 2 minus u at x minus delta x 964 00:54:00,740 --> 00:54:04,780 over 2, and once again, the difference in the terms there 965 00:54:04,780 --> 00:54:07,100 ends up being delta x. 966 00:54:07,100 --> 00:54:11,470 And now I use that to take the next one. 967 00:54:11,470 --> 00:54:14,540 So basically, to take this one, I basically take the 968 00:54:14,540 --> 00:54:19,600 partial at x plus delta x over 2, minus the partial 969 00:54:19,600 --> 00:54:24,440 at x minus delta x over 2, and take the difference of those, and 970 00:54:24,440 --> 00:54:25,850 do the approximation. 971 00:54:25,850 --> 00:54:29,830 And what happens is, if you look at it, when I take a 972 00:54:29,830 --> 00:54:34,920 partial here I'm adding delta x over 2 twice, so I end up 973 00:54:34,920 --> 00:54:38,640 getting just a delta x here, and then the two things on 974 00:54:38,640 --> 00:54:41,830 either side combined give me my original one, 2 times u(t, 975 00:54:41,830 --> 00:54:45,720 x), and then another one here, and now the whole thing over 976 00:54:45,720 --> 00:54:48,900 delta x squared. 977 00:54:48,900 --> 00:54:52,500 And so what I can do is to reduce this heat equation, 978 00:54:52,500 --> 00:54:55,670 which is continuous, to something that we can handle 979 00:54:55,670 --> 00:54:59,740 in a computer, which is discrete, by saying OK, let's 980 00:54:59,740 --> 00:55:03,090 just do this approximation that says that this term must 981 00:55:03,090 --> 00:55:06,350 be equal to that term. 982 00:55:06,350 --> 00:55:08,326 And if you've studied the linear algebra, you know that 983 00:55:08,326 --> 00:55:10,920 there are all kinds of conditions on convergence, and 984 00:55:10,920 --> 00:55:13,690 stability, and stuff like that, that are actually quite 985 00:55:13,690 --> 00:55:15,522 interesting from a numerical point of view, but we're not 986 00:55:15,522 --> 00:55:17,590 going to get into it. 987 00:55:17,590 --> 00:55:20,550 But basically, I've just taken that equation here and said, 988 00:55:20,550 --> 00:55:23,950 OK, that's my approximation for this one. 989 00:55:23,950 --> 00:55:25,210 And now what do I have here? 990 00:55:25,210 --> 00:55:31,640 I've got u of t plus delta t, and u of t, and then 991 00:55:31,640 --> 00:55:36,970 over here, they're all at time t, but now the deltas are in space-- 992 00:55:36,970 --> 00:55:39,880 whoops, that should have been a delta x there.
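(Put together, the discretization he's describing -- with the slide's typo corrected -- reads as follows; this is just the two approximations above set equal, nothing new:)

$$\frac{u(t+\Delta t,\,x) \;-\; u(t,\,x)}{\Delta t} \;=\; \alpha\;\frac{u(t,\,x+\Delta x) \;-\; 2\,u(t,\,x) \;+\; u(t,\,x-\Delta x)}{(\Delta x)^2}$$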
993 00:55:39,880 --> 00:55:40,870 I don't know how that got there. 994 00:55:40,870 --> 00:55:43,370 That should be a delta x there. 995 00:55:43,370 --> 00:55:45,340 They're all spatial over here. 996 00:55:48,520 --> 00:55:56,820 So what I can do is take this, and do an iterative process to 997 00:55:56,820 --> 00:55:58,730 compute this. 998 00:55:58,730 --> 00:56:04,170 And so the idea is, let me take this and throw this term 999 00:56:04,170 --> 00:56:08,590 onto the right hand side, and look at u of t plus delta t as 1000 00:56:08,590 --> 00:56:09,770 if it's t plus 1. 1001 00:56:09,770 --> 00:56:13,880 Let me make my delta t be one, essentially. 1002 00:56:13,880 --> 00:56:19,730 Throw the delta t over here times the alpha over delta x squared, 1003 00:56:19,730 --> 00:56:24,540 and then I get basically u of t plus 1 at x is based on u of t at x 1004 00:56:24,540 --> 00:56:27,640 plus 1, at x, and at x minus 1. 1005 00:56:27,640 --> 00:56:28,840 As I say, there's a typo here. 1006 00:56:28,840 --> 00:56:33,080 That should be a delta t. 1007 00:56:33,080 --> 00:56:36,370 So what that says is that if I look at my one-dimensional 1008 00:56:36,370 --> 00:56:40,410 process proceeding through time, what I'm doing is 1009 00:56:40,410 --> 00:56:44,320 updating every point here based on the three points 1010 00:56:44,320 --> 00:56:48,840 below it: directly below, diagonally to the right, and 1011 00:56:48,840 --> 00:56:51,640 diagonally to the left. 1012 00:56:51,640 --> 00:56:54,640 So this guy can be updated because of those. 1013 00:56:54,640 --> 00:56:58,390 These we're not going to update, because they're the boundary. 1014 00:56:58,390 --> 00:56:59,810 So these can be fixed. 1015 00:56:59,810 --> 00:57:02,455 In a periodic stencil, they may even wrap 1016 00:57:02,455 --> 00:57:05,660 around like a torus. 1017 00:57:05,660 --> 00:57:08,570 So basically, I can go through and update all these with 1018 00:57:08,570 --> 00:57:11,190 whatever that hairy equation is. 1019 00:57:11,190 --> 00:57:13,410 And this is basically what the code that I 1020 00:57:13,410 --> 00:57:14,660 showed you is doing. 1021 00:57:17,130 --> 00:57:21,360 It just keeps updating every point based on the three below it until 1022 00:57:21,360 --> 00:57:25,160 I've gone through a bunch of time steps, and that's how the 1023 00:57:25,160 --> 00:57:26,410 system evolves. 1024 00:57:31,330 --> 00:57:37,080 So any questions about how I got to here? 1025 00:57:37,080 --> 00:57:39,250 So we're now going to look at this purely 1026 00:57:39,250 --> 00:57:41,080 computer sciencey. 1027 00:57:41,080 --> 00:57:42,970 We don't have to understand any of those equations. 1028 00:57:42,970 --> 00:57:44,680 We just have to understand the structure. 1029 00:57:44,680 --> 00:57:47,980 The structure is that we're updating row t plus 1 based on 1030 00:57:47,980 --> 00:57:52,300 three points in row t with some function that some 1031 00:57:52,300 --> 00:57:58,530 physicist oracle gave us out of the blue. 1032 00:57:58,530 --> 00:58:03,180 And so here is a pretty simple algorithm to do it. 1033 00:58:03,180 --> 00:58:05,750 I basically have what's called the kernel, which does this 1034 00:58:05,750 --> 00:58:11,200 updating, basically updating each point based on its neighbors. 1035 00:58:11,200 --> 00:58:13,430 And what I'm going to do for computer science is I don't 1036 00:58:13,430 --> 00:58:17,400 need to keep all the intermediate values. 1037 00:58:17,400 --> 00:58:18,660 And so what I'm going to do is do what's 1038 00:58:18,660 --> 00:58:21,300 called an even-odd trick.
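(Here's the shape of that loop in C -- a sketch, not the course handout's code. The constant c stands for alpha times delta t over delta x squared, and the boundary cells are simply held fixed:)

#include <stddef.h>

/* Straightforward looping stencil with the even-odd trick:
   u holds two rows of length n, and u[(t % 2) * n + x] is u(t, x). */
void stencil_loop(double *u, int n, int T, double c) {
    for (int t = 0; t < T; t++) {
        const double *cur = u + (size_t)(t % 2) * n;       /* row at time t */
        double *next = u + (size_t)((t + 1) % 2) * n;      /* row at t + 1 */
        for (int x = 1; x < n - 1; x++)                    /* the kernel */
            next[x] = cur[x] + c * (cur[x + 1] - 2.0 * cur[x] + cur[x - 1]);
        next[0] = cur[0];                                  /* fixed boundary */
        next[n - 1] = cur[n - 1];
    }
}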
1039 00:58:21,300 --> 00:58:24,870 Basically if I have one row, I compute the next row into 1040 00:58:24,870 --> 00:58:29,160 another array, and then I'll reuse that first array-- it's 1041 00:58:29,160 --> 00:58:30,770 all been used up-- 1042 00:58:30,770 --> 00:58:33,020 and go back to the first one. 1043 00:58:33,020 --> 00:58:36,410 So basically here, I'm just going to update t plus 1 mod 1044 00:58:36,410 --> 00:58:43,550 2, and just allocate two arrays of size n, and just do 1045 00:58:43,550 --> 00:58:46,596 modding all the way up. 1046 00:58:46,596 --> 00:58:48,760 Is that clear? 1047 00:58:48,760 --> 00:58:50,880 And other than that, it's basically doing the same 1048 00:58:50,880 --> 00:58:53,120 thing, and I'm doing a little bit of fancy arithmetic here 1049 00:58:53,120 --> 00:58:54,910 by passing-- 1050 00:58:54,910 --> 00:58:59,450 C stuff, where I'm passing the pointer to where I am in 1051 00:58:59,450 --> 00:59:02,010 the array, so I only have to update it 1052 00:59:02,010 --> 00:59:04,900 locally within the array. 1053 00:59:04,900 --> 00:59:06,870 So I don't have to do double indexing once I'm in the 1054 00:59:06,870 --> 00:59:09,530 array, because I'm already indexed into the part of the 1055 00:59:09,530 --> 00:59:12,470 array that I'm going to use, and then I am doing flipping. 1056 00:59:12,470 --> 00:59:14,270 So this is just a little bit of cleverness. 1057 00:59:14,270 --> 00:59:17,330 You might want to study this later. 1058 00:59:17,330 --> 00:59:20,700 So what's happening then is I have this double nested loop 1059 00:59:20,700 --> 00:59:23,260 where I have a time loop on the outside, and a space loop 1060 00:59:23,260 --> 00:59:25,920 on the inside, and I'm basically going through and 1061 00:59:25,920 --> 00:59:30,270 using a stencil of this shape, this is called a three point 1062 00:59:30,270 --> 00:59:34,190 stencil, because you're basically taking three points 1063 00:59:34,190 --> 00:59:36,570 to update one point. 1064 00:59:36,570 --> 00:59:41,950 And now if I imagine that this dimension is bigger, n here is 1065 00:59:41,950 --> 00:59:46,490 bigger than my cache size, what's going to happen? 1066 00:59:46,490 --> 00:59:48,840 I'm going to take a cache fault here, these are all cold 1067 00:59:48,840 --> 00:59:49,850 misses, et cetera. 1068 00:59:49,850 --> 00:59:55,760 But when I get back to the beginning here, if I use LRU, 1069 00:59:55,760 --> 00:59:58,350 nothing is going to be left in cache that I happened to 1070 00:59:58,350 --> 00:59:59,430 bring in over here. 1071 00:59:59,430 --> 01:00:01,580 So I have to go and I take a cache fault 1072 01:00:01,580 --> 01:00:04,070 on every cache line. 1073 01:00:04,070 --> 01:00:10,810 And so if I'm going t steps into the future from where I 1074 01:00:10,810 --> 01:00:15,620 started, I basically have n times t updates, and I save a 1075 01:00:15,620 --> 01:00:20,935 factor of B, because I get the spatial locality, because u 1076 01:00:20,935 --> 01:00:25,490 of t at x minus 1, at x, and at x plus 1 are all generally on 1077 01:00:25,490 --> 01:00:30,500 the same-- are nearby, and all within one cache line. 1078 01:00:30,500 --> 01:00:32,170 Question? 1079 01:00:32,170 --> 01:00:34,020 AUDIENCE: The x's, what are the x's for? 1080 01:00:36,660 --> 01:00:37,055 CHARLES LEISERSON: Sorry. 1081 01:00:37,055 --> 01:00:38,110 I should have put the legend on here. 1082 01:00:38,110 --> 01:00:40,660 The x's are a miss.
1083 01:00:40,660 --> 01:00:44,030 So I do a miss when I update these, and then these I don't 1084 01:00:44,030 --> 01:00:45,300 miss on, because it was brought in 1085 01:00:45,300 --> 01:00:48,420 when I accessed that. 1086 01:00:48,420 --> 01:00:52,110 And then I do a miss, and I'll do it-- 1087 01:00:52,110 --> 01:00:55,660 so basically I do it, then I shift over the stencil by one, 1088 01:00:55,660 --> 01:00:58,440 and then I won't get a miss. 1089 01:00:58,440 --> 01:01:01,060 So I'm just looking at the misses on the reads, not 1090 01:01:01,060 --> 01:01:03,000 misses on the writes. 1091 01:01:03,000 --> 01:01:05,500 I should have made that clear, too. 1092 01:01:05,500 --> 01:01:07,340 But the point is, the writes don't help you, because it's 1093 01:01:07,340 --> 01:01:11,320 all out of cache by the time I get up here. 1094 01:01:11,320 --> 01:01:15,740 By the second row, if this is longer than my cache size, 1095 01:01:15,740 --> 01:01:17,220 none of that's there if I'm using LRU. 1096 01:01:21,358 --> 01:01:24,214 AUDIENCE: You have also a miss, like you need to get two 1097 01:01:24,214 --> 01:01:27,045 [INAUDIBLE]. 1098 01:01:27,045 --> 01:01:27,830 CHARLES LEISERSON: Yeah, but what I'm saying is I'm only 1099 01:01:27,830 --> 01:01:30,270 looking at the read misses. 1100 01:01:30,270 --> 01:01:33,260 Yes, there are write misses as well, but basically, I'm only 1101 01:01:33,260 --> 01:01:34,060 doing the read misses. 1102 01:01:34,060 --> 01:01:35,900 You can look at the write misses as well. 1103 01:01:35,900 --> 01:01:37,150 It makes the picture messier. 1104 01:01:40,120 --> 01:01:42,670 So we basically have nt over B. 1105 01:01:42,670 --> 01:01:44,950 However this, let me tell you, is the way that 1106 01:01:44,950 --> 01:01:46,200 everybody codes it. 1107 01:01:48,530 --> 01:01:51,410 And if you have a machine where you have any bandwidth 1108 01:01:51,410 --> 01:01:55,010 issues to memory, especially for these large problems, this 1109 01:01:55,010 --> 01:01:58,810 is not a very good way to do it, as it turns out. 1110 01:01:58,810 --> 01:02:02,030 So it turns out that what you want to do is, as we've seen, 1111 01:02:02,030 --> 01:02:04,960 divide and conquer is a really good way to do it. 1112 01:02:04,960 --> 01:02:08,970 But in this case, when we're doing divide and conquer, 1113 01:02:08,970 --> 01:02:13,980 we're actually not going to use rectangles, we're going to 1114 01:02:13,980 --> 01:02:15,230 use trapezoids. 1115 01:02:17,760 --> 01:02:19,920 And the reason is that a trapezoid has the nice 1116 01:02:19,920 --> 01:02:22,010 property that-- 1117 01:02:22,010 --> 01:02:27,100 notice that if I have all these points in cache, then 1118 01:02:27,100 --> 01:02:29,760 notice that I can compute all the guys on the 1119 01:02:29,760 --> 01:02:33,530 next level, and then I can compute all the guys on 1120 01:02:33,530 --> 01:02:35,660 the level after that. 1121 01:02:35,660 --> 01:02:37,910 And so for example, if you imagine that this part here 1122 01:02:37,910 --> 01:02:41,380 fit within cache, I could actually keep going.
1123 01:02:41,380 --> 01:02:44,580 I didn't have to stop here, I could keep going right up to a 1124 01:02:44,580 --> 01:02:48,480 triangle if I wanted to, and compute all the values without 1125 01:02:48,480 --> 01:02:52,190 having any more cache misses than those needed to bring in, 1126 01:02:52,190 --> 01:02:53,480 essentially, one row-- 1127 01:02:53,480 --> 01:02:57,390 two rows, actually, because I'm reusing the 1128 01:02:57,390 --> 01:02:59,660 rows as I go up. 1129 01:02:59,660 --> 01:03:02,490 So what we're going to do is traverse trapezoidal regions 1130 01:03:02,490 --> 01:03:07,720 of space-time points such that the points are between an 1131 01:03:07,720 --> 01:03:11,300 upper limit, T1, and a lower one, T0, and between an x0 and 1132 01:03:11,300 --> 01:03:15,330 an x1, where now I have slopes here that are going to be, in 1133 01:03:15,330 --> 01:03:18,480 general, plus 1 or minus 1. 1134 01:03:18,480 --> 01:03:22,680 And in fact, sometimes it will be straight, in which case 1135 01:03:22,680 --> 01:03:23,600 we'll call it 0. 1136 01:03:23,600 --> 01:03:29,760 It's really the inverse of the slope, but we'll still call it 1137 01:03:29,760 --> 01:03:32,220 zero rather than infinity. 1138 01:03:32,220 --> 01:03:33,380 So it's 1 over the slope. 1139 01:03:33,380 --> 01:03:35,260 There's a name for that, right? 1140 01:03:35,260 --> 01:03:37,730 Is that called the run or something? 1141 01:03:37,730 --> 01:03:39,620 I forget, I don't remember my calculus. 1142 01:03:42,570 --> 01:03:44,840 So that's what we're going to do. 1143 01:03:44,840 --> 01:03:49,230 And we're going to leave the upper and right borders undone, 1144 01:03:49,230 --> 01:03:51,270 so it's going to be a sort of half open 1145 01:03:51,270 --> 01:03:54,960 trapezoid: closed on the left and 1146 01:03:54,960 --> 01:03:58,200 bottom, and open on the top and right. 1147 01:03:58,200 --> 01:04:03,120 So the width is basically measured at the midpoint here, and the height 1148 01:04:03,120 --> 01:04:05,220 is the height, because the top and bottom are always going to be 1149 01:04:05,220 --> 01:04:09,420 parallel here. 1150 01:04:09,420 --> 01:04:13,790 So here's how our recursion is going to work. 1151 01:04:13,790 --> 01:04:18,390 If the height is 1, then we can compute all space-time 1152 01:04:18,390 --> 01:04:22,300 points in any way we want. 1153 01:04:22,300 --> 01:04:25,900 I can just go through them if I want, because they're all 1154 01:04:25,900 --> 01:04:26,570 independent. 1155 01:04:26,570 --> 01:04:28,930 None depends on anybody else. 1156 01:04:28,930 --> 01:04:30,540 So that's going to be our base case. 1157 01:04:33,730 --> 01:04:38,100 If the width is greater than twice the height, however, 1158 01:04:38,100 --> 01:04:39,930 then what we're going to do is we're going to cut the 1159 01:04:39,930 --> 01:04:45,740 trapezoid through the middle with a slope of minus 1. 1160 01:04:48,380 --> 01:04:51,680 And that will produce two new trapezoids, which we then will 1161 01:04:51,680 --> 01:04:58,936 recursively compute all the elements of. 1162 01:04:58,936 --> 01:05:02,000 So I'll start out with a trapezoid. 1163 01:05:02,000 --> 01:05:07,000 Basically, if it ends up that it's a long and wide one, I'm 1164 01:05:07,000 --> 01:05:10,700 going to make what's called a space cut, and cut it this 1165 01:05:10,700 --> 01:05:13,410 way, and then I'm going to recursively do this one and 1166 01:05:13,410 --> 01:05:15,470 then this one.
1167 01:05:15,470 --> 01:05:18,950 And notice that I can do that because-- 1168 01:05:18,950 --> 01:05:21,790 all these guys I can do, but then when I get to the border 1169 01:05:21,790 --> 01:05:24,830 here, this will already have been done by the time I'm 1170 01:05:24,830 --> 01:05:26,080 computing these guys. 1171 01:05:28,930 --> 01:05:32,510 So the requirement is that I've got to do things 1172 01:05:32,510 --> 01:05:36,360 according to that map of triples that I showed you 1173 01:05:36,360 --> 01:05:38,750 before, but I don't have to do them in the same order. 1174 01:05:38,750 --> 01:05:41,300 I don't have to do the whole bottom row first. 1175 01:05:41,300 --> 01:05:44,820 In this case, I can compute the whole trapezoid here, and 1176 01:05:44,820 --> 01:05:48,850 then I can compute this trapezoid here, and then all 1177 01:05:48,850 --> 01:05:50,940 the values that I'll need will have already been computed 1178 01:05:50,940 --> 01:05:54,870 over here, that are on the boundary of this trapezoid. 1179 01:05:54,870 --> 01:05:58,030 The other type of cut I'll do is what happens when a 1180 01:05:58,030 --> 01:06:01,680 trapezoid gets too tall for me. 1181 01:06:01,680 --> 01:06:04,470 So if the trapezoid is too tall, then what we'll do is 1182 01:06:04,470 --> 01:06:06,580 we'll slice it through the middle, but the other way. 1183 01:06:06,580 --> 01:06:08,380 We call that a time cut. 1184 01:06:08,380 --> 01:06:11,570 So we won't take it all the way through time, we'll only 1185 01:06:11,570 --> 01:06:12,990 take it partially through time. 1186 01:06:16,430 --> 01:06:19,000 Now you can show, and I'm not going to show this in detail, 1187 01:06:19,000 --> 01:06:21,910 but you can show that if I do this, my trapezoids are always 1188 01:06:21,910 --> 01:06:23,620 sort of medium sized. 1189 01:06:23,620 --> 01:06:26,520 I never get long, long skinny ones. 1190 01:06:26,520 --> 01:06:29,490 If I start with something that's sort of got a good 1191 01:06:29,490 --> 01:06:33,660 aspect ratio, I maintain a good aspect ratio through the 1192 01:06:33,660 --> 01:06:34,910 entire code. 1193 01:06:38,820 --> 01:06:40,070 So here's the implementation. 1194 01:06:43,700 --> 01:06:48,040 This is what Matteo Frigo wrote, and I've modified it a 1195 01:06:48,040 --> 01:06:48,900 little bit. 1196 01:06:48,900 --> 01:06:53,020 So basically, we pass in the values that let us 1197 01:06:53,020 --> 01:06:59,320 identify the trapezoid: t0, t1, x0 and then the slope on 1198 01:06:59,320 --> 01:07:04,870 the left side, x1 and the slope on the right side, where the 1199 01:07:04,870 --> 01:07:11,290 dx0 and the dx1s are all either 0, 1, or minus 1. 1200 01:07:11,290 --> 01:07:15,980 And then what I do is I look at what the height is that my 1201 01:07:15,980 --> 01:07:17,760 trapezoid is going to operate on. 1202 01:07:17,760 --> 01:07:22,920 And if the height is 1, well, then I just run through all 1203 01:07:22,920 --> 01:07:27,830 the elements, and I just compute the kernel-- 1204 01:07:27,830 --> 01:07:30,050 that program that I showed you before, that kernel-- 1205 01:07:30,050 --> 01:07:31,690 on all the elements. 1206 01:07:31,690 --> 01:07:33,920 Nothing really to be done there, just go through and 1207 01:07:33,920 --> 01:07:37,900 compute them individually with a for loop.
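(The structure he's walking through is essentially Frigo's published trapezoid recursion; here is a C sketch of it. The kernel body, the fixed N, and the 0.25 coefficient are illustrative stand-ins, not the actual handout code, and boundary handling is omitted:)

#define N 1000
static double u[2][N];   /* u[t % 2][x], the even-odd storage from before */

/* The three-point update at space-time point (t, x); sketch only. */
static void kernel(int t, int x) {
    u[(t + 1) % 2][x] = u[t % 2][x]
        + 0.25 * (u[t % 2][x + 1] - 2.0 * u[t % 2][x] + u[t % 2][x - 1]);
}

/* Trapezoid: t0 <= t < t1, and at time t the row spans
   x0 + dx0*(t - t0) <= x < x1 + dx1*(t - t0), with dx0, dx1 in {-1, 0, 1}. */
void walk1(int t0, int t1, int x0, int dx0, int x1, int dx1) {
    int lt = t1 - t0;
    if (lt == 1) {                          /* base case: height 1 */
        for (int x = x0; x < x1; x++)
            kernel(t0, x);
    } else if (lt > 1) {
        if (2 * (x1 - x0) + (dx1 - dx0) * lt >= 4 * lt) {
            /* wide: space cut through the center with slope -1 */
            int xm = (2 * (x0 + x1) + (2 + dx0 + dx1) * lt) / 4;
            walk1(t0, t1, x0, dx0, xm, -1);
            walk1(t0, t1, xm, -1, x1, dx1);
        } else {
            /* tall: time cut through the middle */
            int s = lt / 2;
            walk1(t0, t0 + s, x0, dx0, x1, dx1);
            walk1(t0 + s, t1, x0 + dx0 * s, dx0, x1 + dx1 * s, dx1);
        }
    }
}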
1208 01:07:37,900 --> 01:07:46,910 Otherwise, if I've got the situation where the trapezoid 1209 01:07:46,910 --> 01:07:51,330 is big, then I do this comparison, which-- 1210 01:07:51,330 --> 01:07:53,310 you can work out the math if you wish-- 1211 01:07:53,310 --> 01:07:56,950 I promise you tells you whether or not, 1212 01:07:56,950 --> 01:07:59,570 as I said before, the width is 1213 01:07:59,570 --> 01:08:01,130 more than twice the height. 1214 01:08:01,130 --> 01:08:07,360 And if so, I compute the middle point, and then I 1215 01:08:07,360 --> 01:08:09,120 partition it into two trapezoids, and I 1216 01:08:09,120 --> 01:08:12,790 recursively call the routine on them. 1217 01:08:12,790 --> 01:08:16,770 And otherwise, I simply cut the time in half, and then I 1218 01:08:16,770 --> 01:08:20,550 do the bottom half and then the upper half. 1219 01:08:20,550 --> 01:08:25,979 So getting all those parameters exactly right takes 1220 01:08:25,979 --> 01:08:30,310 a little bit of thinking, makes my brain hurt, but 1221 01:08:30,310 --> 01:08:33,149 Matteo is brilliant at this kind of coding. 1222 01:08:36,490 --> 01:08:38,540 So let's see how well this does. 1223 01:08:38,540 --> 01:08:41,479 So I'm not going to do a detailed analysis like I did 1224 01:08:41,479 --> 01:08:47,319 before, but basically what's going on is at this level, if 1225 01:08:47,319 --> 01:08:49,960 I'm doing divide and conquering, I'm only doing a 1226 01:08:49,960 --> 01:08:54,350 constant amount of work managing this stuff. 1227 01:08:54,350 --> 01:08:58,500 So the cache misses that I'm taking in the internal part of the 1228 01:08:58,500 --> 01:08:59,840 tree are all going to be order one. 1229 01:09:03,140 --> 01:09:10,729 Now each leaf is going to represent a trapezoid, which 1230 01:09:10,729 --> 01:09:15,990 is going to be approximately h times w, where h and w are 1231 01:09:15,990 --> 01:09:19,260 approximately equal, because they're going to be shaped-- 1232 01:09:19,260 --> 01:09:22,140 This is assuming I start out with a number of time steps to 1233 01:09:22,140 --> 01:09:28,720 do that is at least as large as the number of points that I 1234 01:09:28,720 --> 01:09:29,500 have to go over. 1235 01:09:29,500 --> 01:09:32,170 If I start out with something that's really thin and flat, 1236 01:09:32,170 --> 01:09:34,290 then it's not going to be the case. 1237 01:09:34,290 --> 01:09:36,410 But if I start out with something that's deep enough, 1238 01:09:36,410 --> 01:09:39,950 then I'm going to be able to make progress in an 1239 01:09:39,950 --> 01:09:43,620 unconventional order into time by moving the time 1240 01:09:43,620 --> 01:09:46,550 non-uniformly through the space. 1241 01:09:46,550 --> 01:09:53,340 So each leaf represents a fairly balanced trapezoid. 1242 01:09:53,340 --> 01:09:57,350 Each leaf basically is going to-- 1243 01:09:57,350 --> 01:10:03,050 if you look at it, the direction of the trapezoid is in time, 1244 01:10:03,050 --> 01:10:06,960 so this represents the spatial dimension, and if I have 1245 01:10:06,960 --> 01:10:10,430 something of size w, I can access it with 1246 01:10:10,430 --> 01:10:13,720 only w over B misses. 1247 01:10:13,720 --> 01:10:18,810 And when that fits in cache, where w is some constant fraction 1248 01:10:18,810 --> 01:10:20,830 of m, so w is order m. 1249 01:10:20,830 --> 01:10:24,230 So each of these things that's a leaf is only going to incur 1250 01:10:24,230 --> 01:10:25,480 w over B misses.
1251 01:10:30,080 --> 01:10:32,720 Now, the total number of space-time points I have to go 1252 01:10:32,720 --> 01:10:34,490 after is n times t. 1253 01:10:34,490 --> 01:10:37,610 n is going to be the full dimension this way, t is the 1254 01:10:37,610 --> 01:10:39,450 height that way. 1255 01:10:39,450 --> 01:10:42,810 And so since each leaf has hw points, I 1256 01:10:42,810 --> 01:10:44,570 have nt over hw leaves. 1257 01:10:47,130 --> 01:10:49,860 And the number of internal nodes is just the leaves minus 1258 01:10:49,860 --> 01:10:52,640 1, so they can't contribute substantially, because there's 1259 01:10:52,640 --> 01:10:57,840 only order one misses I'm taking here, whereas I've got 1260 01:10:57,840 --> 01:11:02,720 something on the order of w over B misses for this. 1261 01:11:02,720 --> 01:11:05,330 So therefore, now I can do my math. 1262 01:11:05,330 --> 01:11:08,100 The number of cache misses I'm going to take is-- 1263 01:11:08,100 --> 01:11:11,300 well, how many leaves do I have? 1264 01:11:11,300 --> 01:11:13,780 nt over hw. 1265 01:11:13,780 --> 01:11:15,420 And what does each one cost us? 1266 01:11:15,420 --> 01:11:19,220 w over B. 1267 01:11:19,220 --> 01:11:24,830 And so now, when I multiply that out, well, hw is about m 1268 01:11:24,830 --> 01:11:34,540 squared, and w over B is about m over B. And so I get nt over 1269 01:11:34,540 --> 01:11:38,200 MB as being the total number of misses. 1270 01:11:38,200 --> 01:11:42,270 So whereas the original algorithm took nt over B, 1271 01:11:42,270 --> 01:11:49,610 we've got this extra factor of the cache size in the denominator, 1272 01:11:49,610 --> 01:11:53,100 showing us that we have far fewer cache misses. 1273 01:11:53,100 --> 01:11:55,640 So the cache misses end up not being an issue for this. 1274 01:11:58,360 --> 01:11:59,610 Any questions about that? 1275 01:12:06,230 --> 01:12:11,410 So I want to show you a simulation of this three point 1276 01:12:11,410 --> 01:12:14,330 stencil, comparing the two things. 1277 01:12:14,330 --> 01:12:16,650 So this is going to be the looping version, where the red 1278 01:12:16,650 --> 01:12:21,170 dots are where the cache misses are, and this is going 1279 01:12:21,170 --> 01:12:23,590 to be the trapezoidal one. 1280 01:12:23,590 --> 01:12:27,950 And basically, I have an n of 95 and a t of 87, and what I'm 1281 01:12:27,950 --> 01:12:31,440 going to do is assume a fully associative LRU cache that 1282 01:12:31,440 --> 01:12:35,750 fits four points on a cache line, where the cache size is 1283 01:12:35,750 --> 01:12:39,520 32, two to the fifth as opposed to two to the 15th, 1284 01:12:39,520 --> 01:12:42,100 it's really little. 1285 01:12:42,100 --> 01:12:44,910 If I get a cache hit, I'm going to call it one cycle. 1286 01:12:44,910 --> 01:12:48,070 If I get a cache miss, I'm going to call it 10 cycles. 1287 01:12:48,070 --> 01:12:49,070 We're going to race them. 1288 01:12:49,070 --> 01:12:52,840 So on the left is the current world 1289 01:12:52,840 --> 01:12:54,510 champion, the looping algorithm. 1290 01:12:54,510 --> 01:12:59,130 And on the right is the cache oblivious trapezoid algorithm. 1291 01:12:59,130 --> 01:13:00,380 So let's go. 1292 01:13:07,750 --> 01:13:12,480 So you can see that it's basically, it's made a space 1293 01:13:12,480 --> 01:13:18,060 cut there, but it's made a time cut across the top there. 1294 01:13:18,060 --> 01:13:20,730 It said, this is too tall, so let me cut it this way.
1295 01:13:38,460 --> 01:13:40,764 And that guy's, meanwhile, taking all those-- you can see 1296 01:13:40,764 --> 01:13:42,140 how many cache misses he's taking. 1297 01:13:45,850 --> 01:13:47,100 Let's speed him up. 1298 01:13:52,770 --> 01:13:54,460 That's one way you can do it, is make it think. 1299 01:14:09,050 --> 01:14:12,670 So let's see what happens if I have a cache of size eight. 1300 01:14:18,020 --> 01:14:19,270 So here we go. 1301 01:14:31,320 --> 01:14:32,830 I think I'm just doing the same thing. 1302 01:14:32,830 --> 01:14:34,160 I know I can show you the cuts. 1303 01:14:34,160 --> 01:14:37,050 Can I show you the cuts? 1304 01:14:37,050 --> 01:14:37,270 I know. 1305 01:14:37,270 --> 01:14:39,540 I think it's because I'm not-- 1306 01:14:39,540 --> 01:14:41,350 OK, let's try it. 1307 01:14:41,350 --> 01:14:41,600 There we go. 1308 01:14:41,600 --> 01:14:44,700 Now I'm showing the cuts as they go on. 1309 01:14:44,700 --> 01:14:47,678 Let's do that again. 1310 01:14:47,678 --> 01:14:49,516 We'll go fast and do it again. 1311 01:14:58,230 --> 01:15:00,400 So those are the cuts that it's making to begin with as 1312 01:15:00,400 --> 01:15:01,760 it's doing the divide and conquer. 1313 01:15:06,450 --> 01:15:09,570 So I think this is the same size cache. 1314 01:15:14,100 --> 01:15:16,595 So now I think I'm doing a bigger cache. 1315 01:15:25,010 --> 01:15:26,830 I think I did a bigger cache, but I'm not sure I gave the 1316 01:15:26,830 --> 01:15:28,080 other guy a bigger cache. 1317 01:15:46,790 --> 01:15:48,330 Yeah, because it doesn't matter for the guy on the 1318 01:15:48,330 --> 01:15:49,100 left, right? 1319 01:15:49,100 --> 01:15:51,470 As long as the cache line is the same length and as long as 1320 01:15:51,470 --> 01:15:55,910 the cache is not big enough, he's going to do the same thing. 1321 01:15:55,910 --> 01:15:57,600 He didn't get to take advantage of the fact that the 1322 01:15:57,600 --> 01:16:00,350 cache was bigger, because it was still smaller than the 1323 01:16:00,350 --> 01:16:02,050 array that he's striding over out there. 1324 01:16:05,460 --> 01:16:06,880 Anyway, we can play with these all day. 1325 01:16:13,910 --> 01:16:16,240 So if you make the cache lines bigger, then of course it'll 1326 01:16:16,240 --> 01:16:19,790 go faster, because he'll have fewer misses. 1327 01:16:19,790 --> 01:16:21,680 He'll get to bring more in at a time. 1328 01:16:21,680 --> 01:16:22,930 So let's see here. 1329 01:16:25,330 --> 01:16:29,030 So let's now do it for real. 1330 01:16:29,030 --> 01:16:30,820 So this is a two-dimensional problem. 1331 01:16:30,820 --> 01:16:34,720 You can use the same thing to do what end up being 1332 01:16:34,720 --> 01:16:37,710 three-dimensional trapezoids. 1333 01:16:37,710 --> 01:16:40,130 In fact, you can generalize this trapezoid method to 1334 01:16:40,130 --> 01:16:41,790 multiple dimensions. 1335 01:16:41,790 --> 01:16:43,320 So this is the looping one. 1336 01:16:43,320 --> 01:16:45,240 So let's start that one out. 1337 01:16:51,020 --> 01:16:53,820 So it's going about 104 frames a minute. 1338 01:17:08,470 --> 01:17:12,590 I think by resizing it, the calibration is off. 1339 01:17:15,160 --> 01:17:18,000 But in any case, let's switch to the cache oblivious version. 1340 01:17:27,840 --> 01:17:29,090 Anybody notice something? 1341 01:17:31,460 --> 01:17:32,710 Slower. 1342 01:17:35,040 --> 01:17:36,290 Why is that? 1343 01:17:41,460 --> 01:17:43,010 This is the code exactly as I had it up there.
1344 01:17:46,120 --> 01:17:47,640 No, it's not because it's two dimensions. 1345 01:17:47,640 --> 01:17:48,890 AUDIENCE: [INAUDIBLE]. 1346 01:17:53,090 --> 01:17:54,090 CHARLES LEISERSON: I'm sorry? 1347 01:17:54,090 --> 01:17:54,390 [INTERPOSING VOICES] 1348 01:17:54,390 --> 01:17:58,505 CHARLES LEISERSON: Yeah, so now it's the trapezoiding at 1349 01:17:58,505 --> 01:17:59,755 only 86 frames. 1350 01:18:02,080 --> 01:18:04,150 What do you suppose is going on there? 1351 01:18:04,150 --> 01:18:06,750 AUDIENCE: You have [INAUDIBLE]. 1352 01:18:06,750 --> 01:18:08,000 CHARLES LEISERSON: Yeah. 1353 01:18:11,120 --> 01:18:14,870 So this is a case where if you look at the code I wrote, I 1354 01:18:14,870 --> 01:18:24,790 went down to a t, a delta t, of one in my recursion. 1355 01:18:27,510 --> 01:18:28,810 I recursed all the way down. 1356 01:18:28,810 --> 01:18:32,160 Let's see what happens if instead of going all the way 1357 01:18:32,160 --> 01:18:34,830 down, playing the trapezoid game on little tiny 1358 01:18:34,830 --> 01:18:41,070 trapezoids, suppose I go down only to, say, when t is 10, 1359 01:18:41,070 --> 01:18:45,280 and then do essentially the row major ones. 1360 01:18:45,280 --> 01:18:50,920 So I'm basically coarsening the leaves of the thing. 1361 01:18:50,920 --> 01:18:52,460 So to do that, I do this. 1362 01:18:52,460 --> 01:18:53,590 So now we go-- 1363 01:18:53,590 --> 01:18:54,840 ah. 1364 01:19:00,660 --> 01:19:02,670 So I have to coarsen in order to overcome the 1365 01:19:02,670 --> 01:19:04,250 procedure call overhead. 1366 01:19:04,250 --> 01:19:06,050 It has nothing to do with the cache. 1367 01:19:06,050 --> 01:19:08,440 It has to do with the fact that the way that you implement 1368 01:19:08,440 --> 01:19:10,770 recursion, recursion and function calls 1369 01:19:10,770 --> 01:19:11,770 have a cost to them. 1370 01:19:11,770 --> 01:19:15,600 And if all you're going to do is a little tiny update of 1371 01:19:15,600 --> 01:19:18,860 those few floating point operations-- 1372 01:19:18,860 --> 01:19:22,380 let's go back to the looping just to see. 1373 01:19:22,380 --> 01:19:30,540 The looping is going about 107, 108, and 1374 01:19:30,540 --> 01:19:37,310 trapezoiding at 136. 1375 01:19:37,310 --> 01:19:39,520 So unfortunately, you need a voodoo variable, but it's a 1376 01:19:39,520 --> 01:19:42,660 voodoo variable not to overcome the cache, but rather 1377 01:19:42,660 --> 01:19:45,470 to deal with the overhead on the 1378 01:19:45,470 --> 01:19:48,420 processor when you do function calls. 1379 01:19:48,420 --> 01:19:51,150 So let's see. 1380 01:19:51,150 --> 01:19:52,290 How coarse can we make it? 1381 01:19:52,290 --> 01:19:54,110 Let's try five, a coarsening of five? 1382 01:19:57,370 --> 01:19:58,620 That's still pretty good. 1383 01:20:02,040 --> 01:20:02,790 That's still pretty good. 1384 01:20:02,790 --> 01:20:04,040 How about four? 1385 01:20:07,940 --> 01:20:11,750 Still doing 131 frames a minute. 1386 01:20:11,750 --> 01:20:15,080 How about three? 1387 01:20:15,080 --> 01:20:16,620 Oh, we lost something there. 1388 01:20:19,540 --> 01:20:20,790 How about two? 1389 01:20:24,020 --> 01:20:30,070 So at a coarsening of two, I go 138, whereas the looping 1390 01:20:30,070 --> 01:20:32,890 goes at about the same. 1391 01:20:32,890 --> 01:20:33,720 I can't do 20. 1392 01:20:33,720 --> 01:20:34,780 I didn't program that in. 1393 01:20:34,780 --> 01:20:37,820 I just programmed up to 10.
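(What that coarsening looks like in the recursion, sketched in C -- CUTOFF stands in for the voodoo variable being adjusted here, and kernel is the same illustrative update as in the earlier sketch; everything else is the walk from before:)

#define CUTOFF 10   /* the coarsening "voodoo" parameter */

void walk1_coarse(int t0, int t1, int x0, int dx0, int x1, int dx1) {
    int lt = t1 - t0;
    if (lt <= CUTOFF) {
        /* coarsened base case: plain loops over up to CUTOFF rows,
           amortizing the function-call overhead */
        for (int t = t0; t < t1; t++, x0 += dx0, x1 += dx1)
            for (int x = x0; x < x1; x++)
                kernel(t, x);
    } else if (2 * (x1 - x0) + (dx1 - dx0) * lt >= 4 * lt) {
        int xm = (2 * (x0 + x1) + (2 + dx0 + dx1) * lt) / 4;   /* space cut */
        walk1_coarse(t0, t1, x0, dx0, xm, -1);
        walk1_coarse(t0, t1, xm, -1, x1, dx1);
    } else {
        int s = lt / 2;                                        /* time cut */
        walk1_coarse(t0, t0 + s, x0, dx0, x1, dx1);
        walk1_coarse(t0 + s, t1, x0 + dx0 * s, dx0, x1 + dx1 * s, dx1);
    }
}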
1394 01:20:37,820 --> 01:20:42,330 So if I go down to one, however, then you see it's not 1395 01:20:42,330 --> 01:20:42,970 that efficient. 1396 01:20:42,970 --> 01:20:46,780 But if I pick any number that's even slightly larger, 1397 01:20:46,780 --> 01:20:50,490 that gives me just enough that the function call overhead 1398 01:20:50,490 --> 01:20:51,300 ends up not being a 1399 01:20:51,300 --> 01:20:57,600 substantial cost. 1400 01:20:57,600 --> 01:20:59,630 So let me just wrap up now. 1401 01:20:59,630 --> 01:21:00,870 So I just have a couple more things. 1402 01:21:00,870 --> 01:21:05,750 So I'm not going to really talk about these, but there 1403 01:21:05,750 --> 01:21:09,640 are lots of cache oblivious algorithms that have been 1404 01:21:09,640 --> 01:21:18,250 discovered in the last 10 or 15 years for doing things like 1405 01:21:18,250 --> 01:21:23,340 matrix transposition, which is similar to rotating a matrix. 1406 01:21:23,340 --> 01:21:26,720 You can do that in a cache oblivious fashion. 1407 01:21:26,720 --> 01:21:32,270 Strassen's algorithm, which does matrix multiplication 1408 01:21:32,270 --> 01:21:34,720 using fewer than n cubed operations. 1409 01:21:34,720 --> 01:21:40,440 The FFT can be computed in a cache oblivious fashion. 1410 01:21:40,440 --> 01:21:44,770 And LU decomposition is a popular 1411 01:21:44,770 --> 01:21:48,740 way to solve linear systems. 1412 01:21:48,740 --> 01:21:52,690 In addition, there are cache oblivious data structures, and 1413 01:21:52,690 --> 01:21:53,760 here are just a few of them. 1414 01:21:53,760 --> 01:21:57,060 There's cache oblivious B-Trees and priority queues, 1415 01:21:57,060 --> 01:22:00,820 and things called ordered-file maintenance. 1416 01:22:00,820 --> 01:22:01,850 There's a whole raft. 1417 01:22:01,850 --> 01:22:05,740 There's probably now several hundred papers written on 1418 01:22:05,740 --> 01:22:09,240 cache oblivious algorithms, so it's something you should be aware 1419 01:22:09,240 --> 01:22:12,970 of, and you should understand how you go about designing an 1420 01:22:12,970 --> 01:22:14,040 algorithm of this nature. 1421 01:22:14,040 --> 01:22:17,290 Not all of them are straightforward. 1422 01:22:17,290 --> 01:22:22,910 For example, the FFT one does divide and conquer but not by 1423 01:22:22,910 --> 01:22:24,010 dividing it into two. 1424 01:22:24,010 --> 01:22:26,750 It divides it into square root of n pieces of size square 1425 01:22:26,750 --> 01:22:31,460 root of n each in order to get a good cache efficient 1426 01:22:31,460 --> 01:22:34,060 algorithm that doesn't have any tuning parameters. 1427 01:22:34,060 --> 01:22:38,010 But almost all of them, since they're recursive, do have 1428 01:22:38,010 --> 01:22:41,090 this annoying thing that you have to still coarsen the base 1429 01:22:41,090 --> 01:22:45,250 case in order to get really good performance if you're not 1430 01:22:45,250 --> 01:22:50,600 doing a lot of work in the leaves of the recursion. 1431 01:22:50,600 --> 01:22:51,850 So any questions?