-- week of 6.046. Woohoo! The topic of this final week, among our advanced topics, is cache oblivious algorithms. This is a particularly fun area, one dear to my heart because I've done a lot of research in this area. This is an area co-founded by Professor Leiserson. In fact, the first context in which I met Professor Leiserson was him giving a talk about cache oblivious algorithms at WADS '99 in Vancouver, I think. Yeah, that has to be an odd year. So I learned about cache oblivious algorithms then, started working in the area, and it's been a fun place to play.

But this topic in some sense was also developed in the context of this class. I think there was one semester, probably also '98-'99, where all of the problem sets were about cache oblivious algorithms. And they were, in particular, working out the research ideas at the same time. So it must have been a fun semester. We considered doing that this semester, but we kept things simpler. We know a lot more about cache oblivious algorithms by now, as you might expect. Right, I think that's all the setting. It was also developed with a bunch of MIT students, in particular an M.Eng. student, Harald Prokop; it was his M.Eng. thesis. Those are all the citations I will give for now. I haven't posted them yet, but there are some lecture notes already on my webpage, and I will link to them from the course website. They give all the references for all the results I'll be talking about. They've all been done in the last five years or so, in particular starting in '99 when the first paper was published. But I won't give the specific citations in lecture.

This topic is related to the topic of last week, multithreaded algorithms, although at a somewhat high level; it's also about dealing with parallelism in modern machines. Throughout this class, including these last two lectures, we've had this very simple model of a computer where we have random access: you can access memory at a cost of one, and you can read and write a word of memory. There are some details about how big a word can be and whatnot.
It's a pretty basic, simple, flat model. And in multithreaded algorithms, the idea is that maybe you have multiple threads of computation running at once, but you still have this very flat memory: everyone can access anything in memory at a constant cost.

We're going to change that model now. We are going to recognize that on a real machine, the memory is some hierarchy. You have some CPU; you have some cache, probably on the same chip, the level 1 cache; you have some level 2 cache; if you're lucky, maybe you have some level 3 cache, before you get to main memory. And then you probably have a really big disk, and probably there's even some cache out there too, but I won't even think about that. So the point is, you have lots of different levels of memory, and what's changing here is that things very close to the CPU are very fast to access. Usually level 1 cache you can access in one clock cycle, or a few. And then things get slower and slower. Main memory still costs something like 70 ns or so to access a chunk out of, and that's a long time; 70 ns is, of course, a very long time. So as we go out here, we get slower. But we also get bigger. If we could put everything in level 1 cache, the problem would be solved: that would be a flat memory, where accessing everything takes the same amount of time, as we assumed. But usually we can't afford that; it's not even possible to put everything in level 1 cache. There's a reason why there is a memory hierarchy. Does anyone have a suggestion on what that reason might be? It's like one of these limits in life. Yeah? Fast memory is expensive. That's the practical limitation, indeed: you could try to build more and more level 1 cache, and maybe you could try to, well, yeah. Expense is a good reason, and practically that's maybe why the sizes are what they are. But suppose really fast memory were really cheap. There is a physical limitation on what's going on. Yeah? The speed of light. Yeah, that's a bit of a problem, right?
No matter how much, let's suppose you can only fit so many bits in an atom; you can only fit so many bits in a particular amount of space. If you want more bits, you need more space, and the more space you have, the longer it's going to take for a round trip. So if you think of your CPU as a point in space, it's relatively small and it has to get the data in, and the bigger the data, the farther away it has to be. You can think of spheres around the CPU (we live in 3-D, though chips are usually 2-D, but never mind): the sphere that's closer to the CPU is a lot faster to access, and as you get farther away it costs more. That's essentially what this model is representing, although it's a bit of an approximation of the intrinsic physics and geometry and whatnot. But that's the idea: the latency, the round-trip time to get to some of this memory, has to be big.

In general, the cost to access memory is made up of two things. There's the latency, the round-trip time, which in particular is limited by the speed of light. And on top of the round-trip time, you also have to get the data out, and depending on how much data you want, that could take longer. So there's a second term: let's say the amount of data divided by the bandwidth, where the bandwidth is the rate at which you can get data out. So the access cost is roughly the latency plus the amount divided by the bandwidth. And if you look at the bandwidth of these various levels of memory, it's all pretty much the same. If you have a well-designed computer, the bandwidths should all be the same. You can still get data off disk really, really fast, usually at about the speed of your bus, and the bus gets to the CPU hopefully as fast as everything else. So even though the outer levels are slower, they're really only slower in terms of latency. So this bandwidth part is maybe reasonable; the bandwidth looks pretty much the same universally. It's the latency that's going up. So, if the latency is going up but we still get to divide by the same bandwidth, what should we do to make the access cost at all these levels about the same? The bandwidth term is fixed.
Let's say the latency is increasing, but the bandwidth is still staying big. What could we do to balance this formula? Change the amount. As the latency goes up, if we increase the amount we fetch, then the amortized cost to access one element goes down. This is amortization in a very simple sense. The cost before was to access a whole block, let's say, and the amount was the size of the block. So the amortized cost to access one element is going to be the latency divided by the size of the block (the amount), plus one over the bandwidth. This is what you should implicitly be thinking in your head. I'm just dividing by the amount because the amount is how many elements you get in one access, let's suppose. So we get this formula for the amortized cost: latency divided by block size, plus one over bandwidth.

The one-over-bandwidth term is going to be good no matter what level we are on, I claim; there's no real fundamental limitation there, except that it might be expensive. And the latency we can amortize away using the amount: whatever the latency is, as the latency gets bigger out here, we just grab more and more stuff per access, and then we make these two terms equal, let's say. That would be a good way to balance things.

So in particular, disk has a really high latency. Not only are there speed-of-light issues, there's actually the speed of the head moving across the tracks of the disk. That takes a long time; there's physical motion. Everything else here doesn't usually have physical motion; it's just electric. So disk is really, really slow in latency, and when you read something off of disk, you might as well read a lot of data, like a megabyte or so. That figure is probably even old these days; maybe you read multiple megabytes when you read anything from disk, if you want these terms to be matched.
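To make that balancing concrete, here is a tiny back-of-the-envelope calculation. The specific numbers (10 ms latency, 100 MB/s bandwidth) and the helper name `amortized_cost` are illustrative assumptions, not figures from the lecture; the point is just that setting the two terms equal gives a block size of latency times bandwidth, which for disk-like numbers comes out around a megabyte, matching the figure above.

```python
# Illustrative numbers only (assumed, not from the lecture):
# a disk-like level with high latency, and a fixed bus bandwidth.
latency = 10e-3          # seconds per access (round trip)
bandwidth = 100e6        # bytes per second

def amortized_cost(block):
    # Amortized cost per element when we fetch `block` bytes per access:
    #   latency / block + 1 / bandwidth
    return latency / block + 1 / bandwidth

# Balancing the two terms (latency / block == 1 / bandwidth) gives
# block = latency * bandwidth, which is about a megabyte for these numbers.
balanced_block = latency * bandwidth
print(balanced_block)                  # 1,000,000 bytes
print(amortized_cost(balanced_block))  # both terms contribute equally
print(amortized_cost(1))               # fetching one byte at a time: ~latency each
```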
OK, there's a bit of a problem with doing that. Any suggestions what the problem would be? You have this algorithm, and whenever it reads something off of disk, it reads an entire megabyte of stuff around the element it asked for. So the amortized cost per access is going to be reasonable, but that's actually assuming something. Yeah? Right: I'm assuming I'm actually going to use the rest of that data. If I'm going to read 10 MB around the one element I asked for (I access A[i], and I get 10 million items of A around position i), it would be kind of good if the algorithm actually used that data for something. That seems reasonable. So this is spatial locality. The goal in this world of cache oblivious algorithms, and cache efficient algorithms in general, is to get algorithms that perform well when this is happening. So this is the idea of blocking, and we want the algorithm to use all, or at least most, of the elements in a block, a consecutive chunk of memory. That is spatial locality. Ideally we'd use all of them right then, but depending on your algorithm, that's a little bit tricky.

There is another issue, though. You read your 10 MB into main memory, let's say, and your main memory these days should be 4 GB or something, so you can actually read a lot of different blocks into main memory. What you'd like is to be able to use those blocks for as long as possible. Maybe you don't even reuse them: if you have a linear-time algorithm, you're probably only going to visit each element a constant number of times, so this is enough. But if your algorithm takes more than linear time, you're going to be accessing elements more than once. So it would be a good idea not only to use all the elements of the blocks, but to use them as many times as you can before you have to throw the block out. That's temporal locality. So ideally you even reuse blocks as much as possible. I mean, we have all these caches. I didn't write that word; just in case, I don't know how to spell it, but it's not the money kind. We should use those caches for something. The fact is that they store more than one block; each cache can store several blocks. How many? Well, we'll get to that in a second.

OK, so this is the general motivation, but at this point the model is still pretty damn ugly.
If you wanted to design an algorithm that runs well on this kind of machine directly, it's possible, but it's pretty difficult, and essentially never done, let's say, even though this is what real machines look like. At least in theory, and pretty much in practice, the main thing to do is to think about two levels at a time. This is a simplification over that model where we can say a lot more about algorithms. In the full model, each of the levels has a different block size and a different total size; it's a mess to deal with and design algorithms for. If you just think about two levels, it's relatively easy.

So, we have our CPU, which we assume has only a constant number of registers. Once it has a couple of data items, it can add them and whatnot. Then we have a really fast pipe, which I draw thick, to some cache. And we have a relatively narrow pipe to some really big other storage, which I will call main memory. That's the general picture. Now, this could represent any two of the levels. It could be between L3 cache and main memory; that's maybe what the naming corresponds to best. Or the cache here could in fact be main memory, what we consider the RAM of the machine, and what's called main memory over there could be the disk. It's whatever you care about. Usually, if you have a program, we assume everything fits in main memory, and then you care about the caching behavior, so you probably look between these two levels. That's probably what matters most in your program, because the cost differential here is really big relative to the cost differential there. If your data doesn't even fit in main memory and you have to go to disk, then you really care about that level, because the cost differential there is huge; it's like six orders of magnitude, let's say. So in practice you think of just the two memory levels that are most relevant.

OK, now I'm going to define some parameters.
I'm going to call them cache and main memory, just for clarity, because I like to think of main memory the way it used to be; now all we have to worry about is this extra thing called cache. It has some bounded size, and there's a block size. The block size is B, and the number of blocks is M/B, so the total size of the cache is M. Main memory is also blocked into blocks of size B, and we assume it has essentially infinite size; we don't care about its size in this picture. It's whatever is big enough to hold your algorithm's data, or data structure, or whatever. So that's the general model. And for strange historical reasons, which I don't want to get into, these things are called capital M and capital B. Even though M sounds a lot like memory, it's really the cache size; don't ask. This is to preserve history.

OK, now what do we do with this model? It seems nice, but what do we measure about it? What I'm going to assume is that the cache is really fast, so the CPU can access cache essentially instantaneously. I still have to pay for the computation that the CPU is doing, but I'm assuming the cache is close enough that I don't care. And main memory is so big that it has to be far away, and therefore this pipe is a problem. What I should really draw is that the pipe is still thick, but really long: the latency is high, the bandwidth is still high. And all transfers here happen as blocks. So the idea is: the CPU asks for A[i], asks for something in memory. If it's in the cache, it gets it; that's free. Otherwise, it has to grab the entire block containing that element from main memory, bring it into cache, maybe kick somebody out if the cache was full, and then the CPU can use that data and keep going, until it accesses something else that's not in cache, and then it has to grab that from main memory. When you kick something out, you actually write it back to memory. That's the model. So, we suppose the accesses to cache are free.
But we can still think about the running time of the algorithm. I'm not going to change the definition of running time; that would be the computation time, or the work, if you want to use multithreaded lingo. So we still have time, and T(N) will still mean what it did before. This is just an extra level of refinement in our understanding of what's going on, essentially measuring the parallelism that we can exploit out of the memory system: when you access something, you actually get B items. So that's the old stuff.

Now, what I want to do is count memory transfers. These are transfers of blocks, so I should say block memory transfers, between the two levels, between the cache and main memory. Memory transfers are either reads or writes; maybe I should say that. This is the number of block reads and writes from and to main memory. So I'm going to introduce some notation. This is new notation, so we'll see how it works out. I want MT(N) to represent the number of memory transfers, instead of just the normal time, on a problem of size N. Really, this is a function that depends not only on N but also on the parameters B and M of our model. So it really should be MT_{B,M}(N), but that's obviously pretty messy, so I'm going to stick to MT(N). Mainly I care about the growth in terms of N; well, I care about the growth in terms of all of these things, but the only thing I can change is N. So most of the time, like when we are writing recurrences, only N is changing. I can't recurse on the block size, and I can't recurse on the size of the cache; those are given to me, they're fixed. So we'll be changing N mainly. But B and M always matter here. They're not constants; they're parameters of the model.

OK, easy enough. This is something called the disk access model, if you like DAM models, or the external memory model, or the cache-aware model. Maybe I should mention that: this is the cache-aware model.
In general, you have some algorithm that runs on this kind of machine model; that's a cache-aware algorithm. We're not too interested in cache-aware algorithms here, but we've seen one: B-trees. B-trees are a cache-aware data structure. You assume that there is some block size B underlying everything. Maybe you didn't see exactly this model; in particular, it didn't really matter how big the cache was, because all you wanted was this: when I read B items, I can use all of them as much as possible and figure out where I fit among those B items. And that gives me log base B of N memory transfers instead of log N, which is what you would get if you just used your favorite balanced binary search tree. Log base B of N is definitely better than log base 2 of N. So B-trees are a cache-aware algorithm.

What we would like to do today and next lecture is get cache oblivious algorithms. There's essentially only one difference between cache-aware algorithms and cache oblivious algorithms: in cache oblivious algorithms, the algorithm doesn't know what B and M are. This is a bit of a subtle point, but a very cool idea. You assume that this is the model of the machine, and you care about the number of memory transfers between this cache of size M with blocking B and main memory with the same blocking B, but the algorithm doesn't actually know the parameters of the model. It looks like this, but you don't know the width and you don't know the height. Why not? Well, the analysis knows what B and M are; the algorithm doesn't. We are going to write algorithms which look just like the boring old algorithms we've seen throughout this class. That's one of the nice things about this model: every algorithm we have seen is a cache oblivious algorithm, because we didn't even know the word cache in this class until today. So we already have lots of algorithms to choose from. The thing is, some of them will perform well in this model, and some of them won't. So we would like to design algorithms, just like our old algorithms, that happen to perform well in this context, no matter what B and M are.
Another way to say this: the same algorithm should work well for all values of B and M, if you have a good cache oblivious algorithm.

There are a few consequences to this assumption. In a cache-aware algorithm, you can explicitly say: OK, I'm blocking my memory into chunks of size B; here they are; I'm going to store these B elements here and these B elements there. Because you know B, you can do that. You can say, now I want to read these B items into my cache, and then write out those ones over there. You can explicitly maintain your cache. With cache oblivious algorithms you can't, because you don't know what the cache is. So it all has to be implicit. And this is pretty much how caches work anyway, except for disk, so it's a pretty reasonable model. In particular, when you access an element that's not in cache, you automatically fetch the block containing that element, and you pay one memory transfer for that if it wasn't already there.

Another bit of a catch here: what if your cache is full? Then you've got to kick some block out of your cache. So we need some model of which block gets kicked out, because we can't control that; our algorithm has no knowledge of what the blocks are. What we're going to assume in this model is the ideal thing: when you fetch a new block, if your cache is full, you evict the block that will be used farthest in the future. Sorry, the furthest: farthest is distance, furthest is time. Furthest in the future. This would be the best possible thing to do. It's a little bit hard to do in practice, because you don't generally know the future, unless you're omniscient. So this is a bit of an idealized model. But it's pretty reasonable, in the sense that if you've read reading handout number 20, the paper by Sleator and Tarjan, they introduce the idea of competitive algorithms. We only talked about a small portion of that paper, the move-to-front heuristic for storing a list. But it also proves that there are good strategies for this problem, and maybe you heard about this in recitation.
Some people covered it, some didn't. These are called paging strategies. You want to maintain some cache of pages, or blocks, and you pay whenever you have to access a block that's not in your cache. The best thing to do is to always kick out the block that will be used furthest in the future, because that way you make the most use of the blocks that are in there. This turns out to be the optimal offline strategy, if you knew the future. But there are algorithms that are essentially constant competitive against this strategy. I don't want to get into details, because they're not exactly constant competitive, but they are sufficiently competitive for the purposes of this lecture that we can make this assumption and not worry about it. Most of the time we don't even really use this assumption, but there it is. That's the cache oblivious model. It makes things cleaner to think about: anything that should be done will be done. And you can simulate that with least recently used, or whatever good heuristic you want that's competitive against the optimal.

OK, that's pretty much the cache oblivious model: once you have the two-level model, you just assume you don't know B and M, and you have this automatic fetching and writing back of blocks, and whatnot. A little bit more to say; I guess it may be obvious at this point, but I've been drawing everything as tables, so it's not really clear what the linear order is. The linear order is just the reading order. Although we don't explicitly say it most of the time, the typical model is that memory is a linear array. Everything that you ever store in your program is written in this linear array. If you've ever programmed in assembly or whatever, that's the model. You have the address space, and any address between here and here is somewhere you can actually write; this is physical memory, and it's all you can write to. It starts at zero and goes out to, let's call it infinity, over here. And if you allocate some array, maybe it occupies some space in the middle; who knows. We usually don't think about that much.
What I care about now is that memory itself is blocked in this view. However your stuff is stored in memory, it's blocked into chunks of length B. So if this is position one (let me call it one, to be a little nicer), then this is position B, this is position B+1, this is 2B, this is 2B+1, and so on. Those are the indexes into memory, and that's how the blocking happens. If you access something here, you get the chunk from your position rounded down to the previous multiple of B, up to the next multiple of B. That's what you always get. So if you think about some array that's allocated, say, here, you have to keep in mind that the array may not be perfectly aligned with the blocks. But more or less it will be, so we don't care too much; it's a bit of a subtlety there.

OK, so that's pretty much the model. Every algorithm we've seen, except B-trees, is a cache oblivious algorithm. And our question is: we know how everything runs in terms of running time; now we want to measure the number of memory transfers, MT(N). I want to mention one other fact, or theorem. I'll put it in brackets because I don't want to state it precisely. If you have an algorithm that is efficient on two levels, in other words, if we just think about the two-level world and your algorithm is cache oblivious, then it is efficient on any number of levels in your memory hierarchy, say L levels. I don't want to define what efficient means, but the intuition is this: if your machine really looks like this and you have a cache oblivious algorithm, you can apply the cache oblivious analysis for all B and M. So you can analyze the number of memory transfers here, here, here, here, and here. And if you have a good cache oblivious algorithm, the performance at all those levels has to be good, and therefore the whole performance is good. Good here means asymptotically optimal up to constant factors, something like that. I don't want to prove that; you can read the cache oblivious papers. That's a nice fact about cache oblivious algorithms. If you have a cache-aware algorithm that tunes to a particular value of B and a particular value of M, you don't get that property. So this is one nice feature of cache obliviousness. Another nice feature is that when you are coding the algorithm, you don't have to put in B and M, so that simplifies things a bit.
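None of this appears as code in the lecture, but here is a minimal sketch of how you could count memory transfers in the two-level model, just to make the definitions concrete. The class name TwoLevelModel is mine; it assumes a fully associative cache of M words with block size B, and it uses LRU eviction as a stand-in for the ideal furthest-in-the-future policy, which is in the spirit of the competitiveness argument above. Reads and writes are treated alike.

```python
from collections import OrderedDict

class TwoLevelModel:
    """Count block transfers between a cache of M words (block size B)
    and an infinite main memory, under LRU eviction."""
    def __init__(self, M, B):
        self.B, self.num_blocks = B, M // B
        self.cache = OrderedDict()   # block id -> None, kept in LRU order
        self.transfers = 0

    def access(self, addr):
        block = addr // self.B
        if block in self.cache:
            self.cache.move_to_end(block)       # cache hit: free
        else:
            self.transfers += 1                 # fetch the block: one memory transfer
            if len(self.cache) == self.num_blocks:
                self.cache.popitem(last=False)  # evict the least recently used block
            self.cache[block] = None

# Example: scanning an array of N words touches about N/B blocks.
mem = TwoLevelModel(M=1024, B=16)
N = 10_000
for i in range(N):          # visit A[0..N-1] in order
    mem.access(i)
print(mem.transfers)        # N/B = 625 transfers here
```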
So, let's do some algorithms; enough about models. We're going to start out with some really simple things, just to get warmed up on the analysis side. The most basic thing you can do that's good in a cache oblivious world is scanning. Scanning is just visiting the items in an array in order: visit A_1 up to A_N, in order, for some notion of visit, presumably some constant-time operation. For example, suppose you want to compute an aggregate of the array; you want to sum all the elements. You have one extra variable you're using, but you can store that in a register or whatever. So that's one simple example: sum the array.

Here's the picture. We have our memory; each of these cells represents one item, one element, log N bits, one word, whatever. Our array is somewhere in here, maybe there, and we go from here to here to here to here, and so on. So what does this cost? What is the number of memory transfers? We know this is a linear-time algorithm; it takes order N time. What does it cost in terms of memory transfers? N over B, pretty much. We like to say it's order N/B plus two, or plus one inside the big O. That plus one is a bit of a worry: N could be smaller than B. We really want to think about all the cases, especially because usually you're not doing this on something of size N; you're doing it on something of size k, where we don't really know much about k. In general it's N/B plus one, because we always need at least one memory transfer to look at anything, unless N is zero. And in particular it's plus two, if you care about the constants. If I don't write the big O, then it's plus two at most: you might essentially waste the first block, and then everything is fine for a while; and then, if you're unlucky, you essentially waste the last block, because there's just one element of the array in that block and you're not getting much out of it. Every block in the middle, though, every block between the first and the last, has to be full, so you're using all of those elements. So out of the N elements, you only have about N/B blocks, because each block has B elements. OK, that's pretty trivial.
Let me do something slightly more interesting, which is two scans at once. Here, by the way, we're not assuming anything about M, nothing about the size of the cache, just that it can hold a single block: the last block that we visited has to be there. But you can also do a constant number of parallel scans. This is not really parallel in the multithreaded sense; it's simulated parallelism. If you have a constant number of scans, you do one, do the other, do the other, come back, come back, come back; you visit them in turn, round robin, whatever.

For example, here's a cute piece of code. If you want to reverse an array, this is a good puzzle: you can do it by essentially two scans, where you repeatedly swap the first and last elements. So I'm swapping A[i] with A[N - i + 1], with i starting at one. So here's your array; suppose this is actually my array. I swap these two guys, then I swap these two guys, and so on. That will reverse my array, and hopefully it handles the middle element correctly as well if N is odd; it shouldn't do anything to it. And you can view this as two scans: there's one scan coming in this way, and there's also a reverse scan (ooh, more sophisticated) coming back this way. Of course, a reverse scan has the same analysis. And as long as your cache is big enough to store at least two blocks, which is a pretty reasonable assumption, so let's write it: assuming the number of blocks in the cache, which is M/B, is at least two, the number of memory transfers for this algorithm is still order N/B plus one. The constant goes up, maybe, but in this case it probably doesn't; who cares. As long as you're doing a constant number of scans over a constant number of arrays, one of which happens to be reversed, whatever, it will take what we call linear time: linear in the number of blocks of your input. OK, great. So now you can reverse an array: exciting.
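Here is roughly what that cute piece of code looks like as a sketch in Python (0-indexed, whereas the board uses a 1-indexed A[1..N]; the function name reverse_array is mine). The two indices are exactly the forward scan and the reverse scan, so as long as the cache holds at least two blocks, the whole thing costs O(N/B + 1) memory transfers.

```python
def reverse_array(A):
    """Reverse A in place with two parallel scans: one from the front,
    one from the back, swapping as they go."""
    i, j = 0, len(A) - 1
    while i < j:
        A[i], A[j] = A[j], A[i]   # swap A[i] with A[N - 1 - i]
        i += 1
        j -= 1
    # If len(A) is odd, the middle element is left alone, as it should be.
    return A

print(reverse_array([1, 2, 3, 4, 5]))   # [5, 4, 3, 2, 1]
```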
Let's try another simple algorithm on another board: binary search. Just like last week, we're going back to basics here. Scanning we didn't even talk about in this class; binary search is something we talked about a little bit. It was a simple divide and conquer algorithm; I hope you all remember it. If we look at an array (I'm not going to draw the cells here, because I want to imagine a really big array), suppose binary search always goes to the left: it starts by visiting the element in the middle, then it goes to the quarter mark, then it goes to the one-eighth mark, and so on. This is one hypothetical execution of a binary search. Eventually it finds the element it's looking for, or at least it finds where that element fits; so x is over here. We know that it takes log N time. How many memory transfers does it take? Now I've blocked this array into chunks of size B, blocks of size B. How many blocks do I touch? This one's a little bit more subtle. It depends on the relative sizes of N and B, yeah. Log base B of N would be a good guess. We would like it, let's say hope, to be log base B of N, because we know that B-trees can search in what's essentially a sorted list of N items in log base B of N memory transfers. That turns out to be optimal: in the cache oblivious model, or even in the two-level model, you've got to pay log base B of N. I won't prove that here; it's for the same reason you need log N comparisons to do binary search in the normal model.
Alas, it is possible to get log base B of N even without knowing B, but binary search does not do it. Log of N over B, yes. The number of memory transfers on N items is order log(N/B), plus one let's say, which is log N minus log B. Whereas log base B of N is log N divided by log B, and clearly dividing is much better than subtracting. So this would be good, but this is bad. Most of the time this is basically log N, which is no better; you're essentially not using blocks at all. The idea is, out here there's some little tiny block that contains the thing you're looking for (how tiny depends on how big B is). Each of these accesses will be in a different block until you get within about one block's worth of x. When you get within one block of x, there's only a constant number of blocks that matter, and so all of those accesses are indeed within the same block. But how many accesses are there in that range? Well, just log B, because if you're within an interval of size k, you're only going to spend log k steps in it. So you're saving log B at the end, but overall you're paying log N, so you only get log N minus log B, plus some constant. So this is bad news for binary search. Not all of the algorithms we've seen are going to work well in this model. We need a lot more thinking before we can solve what is essentially the binary search problem, finding an element in a sorted list, in log base B of N memory transfers without knowing B. We know we could use B-trees; if you knew B, great, that works, and it's optimal. But without knowing B, it's a little bit harder.
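To see how big the gap is, here is a small illustrative calculation; the particular N and B are made up for the example, not values from the lecture.

```python
import math

# Illustrative sizes: a billion-element sorted array, 1024-element blocks.
N = 2**30
B = 2**10

binary_search = math.log2(N) - math.log2(B)   # ~ memory transfers for binary search
btree_search  = math.log2(N) / math.log2(B)   # ~ memory transfers for a B-tree: log base B of N

print(binary_search)   # 20.0, barely better than the 30 of a plain log N
print(btree_search)    # 3.0, what we'd like to match without knowing B
```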
And this gets us into the world of divide and conquer. Like last week, and like the first few weeks of this class, divide and conquer is your friend. It turns out divide and conquer is not the only tool, but it's a really useful tool in designing cache oblivious algorithms. Let me say why. We'll see a bunch of divide and conquer based cache oblivious algorithms, and the intuition is that we can take all the favorite algorithms we have. Obviously it doesn't always work; binary search was a divide and conquer algorithm, and it's not so great. But in general, the idea is that your algorithm just does the normal divide and conquer thing: you divide your problem into subproblems of smaller size, repeatedly, all the way down to problems of constant size, just like before. But if you're recursively dividing your problem into smaller things, at some point you can think about it and say, wait: the algorithm divides all the way down, but in the analysis we can think about the point at which the problem fits in a block, or fits in cache. And that's the analysis. We'll think about the time when your problem is small enough that we can analyze it in some other way. Usually we analyze the algorithm recursively and get a recurrence, and what we're changing, essentially, is the base case. In the base case, we don't want to go down to constant size; that's too far. I'll show you some examples. We want to consider the point in the recursion at which either the problem fits in cache, so it has size less than or equal to M, or it fits in order one blocks, which is another natural place to stop; order one blocks would be even better than fitting in cache, and it means size order B. This will change the base case of the recurrence, and it will turn out to give us good answers instead of bad ones.

So let's do a simple example: our good friend order statistics, in particular finding medians. I hope you all know this by heart. Remember the worst-case linear-time median-finding algorithm by Blum et al. I'll write this fast. It turns out this is a good algorithm as it is. We partition our array conceptually into N/5 five-tuples, little groups of five; this may not be exactly how I wrote it last time, I didn't check, but it's the same algorithm. You compute the median of each five-tuple. Then you recursively compute the median of those medians.
721 00:47:11,000 --> 00:47:15,000 Then, you partition around x. So, that gave us some element 722 00:47:15,000 --> 00:47:20,000 that was roughly in the middle. It was within the middle half, 723 00:47:20,000 --> 00:47:22,000 I think. Partition around x, 724 00:47:22,000 --> 00:47:27,000 and then we show that you could always recurse on just one of 725 00:47:27,000 --> 00:47:29,000 the sides. 726 00:47:38,000 --> 00:47:41,000 OK, this was our good old friend for computing, 727 00:47:41,000 --> 00:47:43,000 order statistics, or medians, or whatnot. 728 00:47:43,000 --> 00:47:47,000 OK, so how much time does this, well, we know how much time 729 00:47:47,000 --> 00:47:50,000 this takes. It should be linear time. 730 00:47:50,000 --> 00:47:52,000 But how many memory transfers does this take? 731 00:47:52,000 --> 00:47:56,000 Well, conceptually partitioning that, I can do, 732 00:47:56,000 --> 00:47:58,000 in zero. Maybe I have to compute N over 733 00:47:58,000 --> 00:48:02,000 five, no big deal here. We're not thinking about 734 00:48:02,000 --> 00:48:05,000 computation. I have to find the median of 735 00:48:05,000 --> 00:48:07,000 each tuple. So, here it matters how my 736 00:48:07,000 --> 00:48:10,000 array is laid out. But, what I'm going to do is 737 00:48:10,000 --> 00:48:13,000 take my array, take the first five elements, 738 00:48:13,000 --> 00:48:16,000 and then the next five elements and so on. 739 00:48:16,000 --> 00:48:20,000 Those will be my five tuples. So, I can implement this just 740 00:48:20,000 --> 00:48:23,000 by scanning, and then computing the median on those five 741 00:48:23,000 --> 00:48:27,000 elements, which I stored in the five registers on my CPU. 742 00:48:27,000 --> 00:48:32,000 I'll assume that there are enough registers for that. 743 00:48:32,000 --> 00:48:35,000 And, I compute the median, write it out to some array out 744 00:48:35,000 --> 00:48:38,000 here. So, it's going to be one 745 00:48:38,000 --> 00:48:40,000 element. So, the median of here goes 746 00:48:40,000 --> 00:48:43,000 into there. The median of these guys goes 747 00:48:43,000 --> 00:48:46,000 into there, and so on. So, I'm scanning in here, 748 00:48:46,000 --> 00:48:50,000 and in parallel, I'm scanning an output in here. 749 00:48:50,000 --> 00:48:54,000 So, it's two parallel scans. So, that takes linear time. 750 00:48:54,000 --> 00:48:59,000 So, this takes order N over B plus one memory transfers. 751 00:48:59,000 --> 00:49:03,000 OK, then we have recursively compute the median of the 752 00:49:03,000 --> 00:49:06,000 medians. This step used to be T of N 753 00:49:06,000 --> 00:49:09,000 over five. Now it's MT of N over five, 754 00:49:09,000 --> 00:49:12,000 OK, with the same values of B and M. 755 00:49:12,000 --> 00:49:17,000 Then we partition around x. Partitioning is also like three 756 00:49:17,000 --> 00:49:19,000 parallel scans if you work it out. 757 00:49:19,000 --> 00:49:24,000 So, this is also going to take linear memory transfers, 758 00:49:24,000 --> 00:49:28,000 N over B plus one. And then, we recurse on one of 759 00:49:28,000 --> 00:49:33,000 the sides, and this is the fun part of the analysis which I 760 00:49:33,000 --> 00:49:37,000 won't repeat here. But, we get MT of, 761 00:49:37,000 --> 00:49:42,000 like, three quarters N. I think originally it was seven 762 00:49:42,000 --> 00:49:45,000 tenths, so we simplified to three quarters, 763 00:49:45,000 --> 00:49:49,000 which is hopefully bigger than seven tenths. 764 00:49:49,000 --> 00:49:52,000 Yeah, it is. 
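As a rough sketch of the two parallel scans just described (this is an illustration, not code from the lecture; it assumes the input array is stored contiguously, so reading it and writing the output are both sequential scans, costing about N/B + 1 memory transfers together):

def medians_of_five_tuples(a):
    """Scan the input array in groups of five and write out the median
    of each group: one sequential read scan plus one sequential write
    scan, so roughly O(N/B + 1) memory transfers on a contiguous array."""
    out = []
    for i in range(0, len(a), 5):
        group = sorted(a[i:i + 5])          # at most 5 elements: constant work, fits in registers
        out.append(group[len(group) // 2])  # median of the little group
    return out

The output array of N/5 medians is what the recursive call then works on.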
OK, so this is the new 765 00:49:52,000 --> 00:49:55,000 analysis. Now we get a recurrence. 766 00:49:55,000 --> 00:49:58,000 So, let's do that. 767 00:50:16,000 --> 00:50:22,000 So, the analysis is we get this MT of N is MT of N over five 768 00:50:22,000 --> 00:50:29,000 plus MT of three quarters N plus, this is just as before. 769 00:50:29,000 --> 00:50:35,000 Before we had linear work here. And now, we have what we call 770 00:50:35,000 --> 00:50:39,000 linear number of memory transfers, linear number of 771 00:50:39,000 --> 00:50:41,000 blocks. OK, I'll sort of ignore this 772 00:50:41,000 --> 00:50:44,000 plus one. It's not too critical. 773 00:50:44,000 --> 00:50:48,000 So, this is our recurrence. Now, it depends what our base 774 00:50:48,000 --> 00:50:51,000 case is. And, usually we would use a 775 00:50:51,000 --> 00:50:55,000 base case of constant size. So, let's see what happens if 776 00:50:55,000 --> 00:51:00,000 we use a base case of constant size just so that it's clear why 777 00:51:00,000 --> 00:51:05,000 this base case is so important. OK, this describes a recurrence 778 00:51:05,000 --> 00:51:07,000 as one of these hairy recurrences. 779 00:51:07,000 --> 00:51:09,000 And, I don't want to use substitution. 780 00:51:09,000 --> 00:51:12,000 I just want the intuition of why this is going to solve to 781 00:51:12,000 --> 00:51:14,000 something rather big. OK, and for me, 782 00:51:14,000 --> 00:51:17,000 the best intuition always comes from recursion trees. 783 00:51:17,000 --> 00:51:20,000 If you don't know the solution to recurrence and you need a 784 00:51:20,000 --> 00:51:24,000 good guess, use recursion trees. And today, I will only give you 785 00:51:24,000 --> 00:51:26,000 good guesses. I don't want to prove anything 786 00:51:26,000 --> 00:51:31,000 with substitution because I want to get to the bigger ideas. 787 00:51:31,000 --> 00:51:34,000 So, this is even messy from a recursion tree point of view 788 00:51:34,000 --> 00:51:38,000 because you have these unbalanced sizes where you start 789 00:51:38,000 --> 00:51:40,000 at the root with some of size N over B. 790 00:51:40,000 --> 00:51:44,000 Then you split it into something size one fifth N over 791 00:51:44,000 --> 00:51:47,000 B, and something of size three quarters N over B, 792 00:51:47,000 --> 00:51:51,000 which is annoying because now this subtree will be a lot 793 00:51:51,000 --> 00:51:54,000 bigger than this one, or this one will terminate 794 00:51:54,000 --> 00:51:56,000 faster. So, it's pretty unbalanced. 795 00:51:56,000 --> 00:52:00,000 But, summing per level doesn't really tell you a lot at this 796 00:52:00,000 --> 00:52:02,000 point. But let's just look at the 797 00:52:02,000 --> 00:52:07,000 bottom level. Look at all the leaves in this 798 00:52:07,000 --> 00:52:10,000 recursion tree. So, that's the base cases. 799 00:52:10,000 --> 00:52:13,000 How many base cases are there? This is an interesting 800 00:52:13,000 --> 00:52:16,000 question. We've never thought about it in 801 00:52:16,000 --> 00:52:21,000 the context of this recurrence. It gives a somewhat surprising 802 00:52:21,000 --> 00:52:23,000 answer. It was surprising to me the 803 00:52:23,000 --> 00:52:27,000 first time I worked it out. So, how many leaves does this 804 00:52:27,000 --> 00:52:32,000 recursion tree have? Well, we can write a 805 00:52:32,000 --> 00:52:35,000 recurrence. 
The number of leaves in a 806 00:52:35,000 --> 00:52:41,000 problem of size N, it's going to be the number of 807 00:52:41,000 --> 00:52:47,000 leaves in this problem plus the number of leaves in this problem 808 00:52:47,000 --> 00:52:52,000 plus zero. So, that's another recurrence. 809 00:52:52,000 --> 00:52:57,000 We'll call this L of N. OK, now the base case is really 810 00:52:57,000 --> 00:53:02,000 relevant. It determines the solution to 811 00:53:02,000 --> 00:53:04,000 this recurrence. And let's, again, 812 00:53:04,000 --> 00:53:08,000 assume that in a problem of size one, we have one leaf. 813 00:53:08,000 --> 00:53:12,000 That's our only base case. Well, it turns out, 814 00:53:12,000 --> 00:53:14,000 and here you need to guess, I think. 815 00:53:14,000 --> 00:53:17,000 This is not particularly obvious. 816 00:53:17,000 --> 00:53:21,000 Any of the TA's have guesses of the form of this solution? 817 00:53:21,000 --> 00:53:25,000 Or anybody, not just TA's. But this is open to everyone. 818 00:53:25,000 --> 00:53:28,000 If Charles were here, I would ask him. 819 00:53:28,000 --> 00:53:31,000 I had to think for a while, and it's not linear, 820 00:53:31,000 --> 00:53:37,000 right, because you're somehow decreasing quite a bit. 821 00:53:37,000 --> 00:53:42,000 So, it's smaller than linear, but it's more than a constant. 822 00:53:42,000 --> 00:53:47,000 OK, it's actually more than polylog, so what's your favorite 823 00:53:47,000 --> 00:53:50,000 function in the middle? N over log N, 824 00:53:50,000 --> 00:53:53,000 that's still too big. Keep going. 825 00:53:53,000 --> 00:53:57,000 You have an oracle here, so you can, N to the k, 826 00:53:57,000 --> 00:54:00,000 yeah, close. I mean, k is usually an 827 00:54:00,000 --> 00:54:04,000 integer. N to the alpha for some real 828 00:54:04,000 --> 00:54:09,000 number between zero and one. Yeah, that's what you meant. 829 00:54:09,000 --> 00:54:11,000 Sorry. It's like the shortest 830 00:54:11,000 --> 00:54:15,000 mathematical joke. Let epsilon be less than zero 831 00:54:15,000 --> 00:54:18,000 or for a sufficiently large epsilon. 832 00:54:18,000 --> 00:54:21,000 I don't know. So, you've got to use the right 833 00:54:21,000 --> 00:54:25,000 letters. So, let's suppose that it's N 834 00:54:25,000 --> 00:54:28,000 to the alpha. Then we would get this N over 835 00:54:28,000 --> 00:54:32,000 five to the alpha, and we'd get three quarters N 836 00:54:32,000 --> 00:54:36,000 to the alpha. When you have a nice recurrence 837 00:54:36,000 --> 00:54:40,000 like this, you can just try plugging in a guess and see 838 00:54:40,000 --> 00:54:42,000 whether it works, OK, and of course this will 839 00:54:42,000 --> 00:54:46,000 work only depending on alpha. So, we should get an equation 840 00:54:46,000 --> 00:54:49,000 on alpha here. So, everything has an N to the 841 00:54:49,000 --> 00:54:51,000 alpha, in fact, all of these terms. 842 00:54:51,000 --> 00:54:53,000 So, I can divide through my N to the alpha. 843 00:54:53,000 --> 00:54:56,000 That's assuming that it's not zero or something. 844 00:54:56,000 --> 00:54:59,000 That seems reasonable. So, we have one equals one 845 00:54:59,000 --> 00:55:04,000 fifth to the alpha plus three quarters to the alpha. 846 00:55:04,000 --> 00:55:10,000 This is something you won't get on a final because I don't know 847 00:55:10,000 --> 00:55:15,000 any good way to solve this except with like Maple or 848 00:55:15,000 --> 00:55:19,000 Mathematica. 
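If you want to see where the constant comes from, here is a quick numerical sketch (not from the lecture) that solves this kind of equation by bisection. Note that with the simplified constants 1/5 and 3/4 the root comes out near 0.91, and it is the original 7/10 that gives a value near 0.84, which is where the "about 0.8" quoted in a moment comes from; either way, the exponent is a constant strictly between zero and one, which is all the argument needs.

def solve_alpha(c1, c2, iters=60):
    """Bisection for the alpha in (0, 2) with c1**alpha + c2**alpha == 1.
    The left-hand side is strictly decreasing in alpha, so the root is unique."""
    lo, hi = 0.0, 2.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if c1 ** mid + c2 ** mid > 1:
            lo = mid   # still above 1, so alpha must be larger
        else:
            hi = mid
    return (lo + hi) / 2

print(solve_alpha(1 / 5, 3 / 4))    # about 0.91 for the simplified constants
print(solve_alpha(1 / 5, 7 / 10))   # about 0.84 for the original 7/10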
If you're smart I'm sure you 849 00:55:19,000 --> 00:55:24,000 could compute it in a nicer way, but alpha is about 0.8, 850 00:55:24,000 --> 00:55:28,000 it turns out. So, the number of leaves is 851 00:55:28,000 --> 00:55:34,000 this sort of in between constant and linear. 852 00:55:34,000 --> 00:55:37,000 Usually polynomial means you have an integer power. 853 00:55:37,000 --> 00:55:40,000 Let's call it a polynomial. Why not? 854 00:55:40,000 --> 00:55:43,000 There's a lot of leaves, is the point, 855 00:55:43,000 --> 00:55:47,000 and if we say that each leaf costs a constant number of 856 00:55:47,000 --> 00:55:50,000 memory transfers, we're in trouble because then 857 00:55:50,000 --> 00:55:54,000 the number of memory transfers has to be at least this. 858 00:55:54,000 --> 00:55:58,000 If it's at least that, that's potentially bigger than 859 00:55:58,000 --> 00:56:02,000 N over B, I mean, bigger than in an asymptotic 860 00:56:02,000 --> 00:56:06,000 sense. This is little omega of N over 861 00:56:06,000 --> 00:56:10,000 B if B is big. If B is at least N to the 0.2 862 00:56:10,000 --> 00:56:14,000 something, OK, or one seventh something. 863 00:56:14,000 --> 00:56:18,000 But if, in particular, B is at least N to the 0.2, 864 00:56:18,000 --> 00:56:22,000 then this should be bigger than that. 865 00:56:22,000 --> 00:56:27,000 So, this is a bad analysis because we're not going to get 866 00:56:27,000 --> 00:56:32,000 the answer we want, which is N over B. 867 00:56:32,000 --> 00:56:35,000 The best you can do for median is N over B because you have to 868 00:56:35,000 --> 00:56:38,000 read all the element, and you should spend linear 869 00:56:38,000 --> 00:56:40,000 time. So, we want to get N over B. 870 00:56:40,000 --> 00:56:42,000 This algorithm is N over B plus one. 871 00:56:42,000 --> 00:56:45,000 So, this is why you need a good base case, all right? 872 00:56:45,000 --> 00:56:48,000 So that makes the point. So, the question is, 873 00:56:48,000 --> 00:56:51,000 what base case should I use? 874 00:57:04,000 --> 00:57:06,000 So, we have this recurrence 875 00:57:21,000 --> 00:57:25,000 What base case should I use? Constant was too small. 876 00:57:25,000 --> 00:57:30,000 We have a couple of choices listed up here. 877 00:57:46,000 --> 00:57:55,000 Any suggestions? B, OK, MT of B is? 878 00:57:55,000 --> 00:58:01,000 The hard part. So, if my problem, 879 00:58:01,000 --> 00:58:07,000 if the size of my array fits in a block and I do all this stuff 880 00:58:07,000 --> 00:58:11,000 on it, how many memory transfers could that take? 881 00:58:11,000 --> 00:58:15,000 One, or a constant, depending on alignment. 882 00:58:15,000 --> 00:58:20,000 OK, maybe it takes two memory transfers, but constant. 883 00:58:20,000 --> 00:58:23,000 Good. That's clearly a lot better 884 00:58:23,000 --> 00:58:27,000 than this base case, MT of one equals order one, 885 00:58:27,000 --> 00:58:30,000 clearly stronger. So, hopefully, 886 00:58:30,000 --> 00:58:36,000 it gives the right answer, and now indeed it does. 887 00:58:36,000 --> 00:58:39,000 I love this analysis. So, I'm going to wave my hands. 888 00:58:39,000 --> 00:58:43,000 OK, but in particular, what this gives us, 889 00:58:43,000 --> 00:58:47,000 if we do the previous analysis, what is the number of leaves? 890 00:58:47,000 --> 00:58:51,000 So, in the leaves, now L of B equals one instead 891 00:58:51,000 --> 00:58:54,000 of L of one equals one. So, this stops earlier. 892 00:58:54,000 --> 00:58:59,000 When does it stop? 
Well, instead of getting N to 893 00:58:59,000 --> 00:59:02,000 the order of 0.8, whatever, we get N over B to 894 00:59:02,000 --> 00:59:06,000 the power of 0.8 whatever. OK, so it turns out the number 895 00:59:06,000 --> 00:59:10,000 of leaves is N over B to the alpha, which is little o of N 896 00:59:10,000 --> 00:59:12,000 over B. So, we don't care. 897 00:59:12,000 --> 00:59:15,000 It's tiny. And, if you look at the root 898 00:59:15,000 --> 00:59:17,000 cost is N over B in the recursion tree, 899 00:59:17,000 --> 00:59:22,000 the leaf cost is little o of N over B, and if you wave your 900 00:59:22,000 --> 00:59:26,000 hands, and close your eyes, and squint, the cost should be 901 00:59:26,000 --> 00:59:29,000 geometrically decreasing as we go down, I hope, 902 00:59:29,000 --> 00:59:34,000 more or less. It's a bit messy because of all 903 00:59:34,000 --> 00:59:39,000 the things terminating, but let's say cost is roughly 904 00:59:39,000 --> 00:59:42,000 geometric. Don't do this in the final, 905 00:59:42,000 --> 00:59:47,000 but you won't have any messy recurrences like this. 906 00:59:47,000 --> 00:59:50,000 So, don't worry. Down the tree, 907 00:59:50,000 --> 00:59:55,000 so you'd have to prove this formally, but I claim that the 908 00:59:55,000 --> 01:00:01,000 root cost dominates. And, the root cost is N over B. 909 01:00:13,000 --> 01:00:16,591 So, we get N over B. OK, so this is a nice, 910 01:00:16,591 --> 01:00:21,892 linear time algorithm for order statistics for cache oblivious. 911 01:00:21,892 --> 01:00:24,970 Great. This may turn you off a little 912 01:00:24,970 --> 01:00:29,758 bit, but even though this is like the simplest algorithm, 913 01:00:29,758 --> 01:00:34,460 it's also probably the most complicated analysis that we 914 01:00:34,460 --> 01:00:36,846 will do. In the future, 915 01:00:36,846 --> 01:00:40,234 our algorithms will be more complicated, and the analyses 916 01:00:40,234 --> 01:00:42,533 will be relatively simple. And usually, 917 01:00:42,533 --> 01:00:45,255 it's that way with cache oblivious algorithms. 918 01:00:45,255 --> 01:00:48,824 So, I'm giving you this sort of as the intuition of why this 919 01:00:48,824 --> 01:00:51,425 should be enough. Then you have to prove it. 920 01:00:51,425 --> 01:00:54,933 OK, let's go to another problem where divide and conquer is 921 01:00:54,933 --> 01:00:57,716 useful, our good friend, matrix multiplication. 922 01:00:57,716 --> 01:01:01,164 I don't know how many times we've seen this in this class, 923 01:01:01,164 --> 01:01:04,370 but in particular we saw it last week with a recursive 924 01:01:04,370 --> 01:01:08,000 matrix multiply, multithreaded algorithm. 925 01:01:08,000 --> 01:01:11,708 So, I won't give you the algorithm yet again, 926 01:01:11,708 --> 01:01:16,176 but we're going to analyze it in a very different way. 927 01:01:16,176 --> 01:01:20,475 So, we have C and we have A, and actually up to you. 928 01:01:20,475 --> 01:01:24,521 So, I could cover standard matrix multiplication, 929 01:01:24,521 --> 01:01:30,000 which is when you do it row by row, and column by column. 930 01:01:30,000 --> 01:01:32,331 And, we could see why that's bad. 931 01:01:32,331 --> 01:01:36,485 And then, we could do the recursive one and see why that's 932 01:01:36,485 --> 01:01:39,036 good. Or, we could skip the standard 933 01:01:39,036 --> 01:01:41,951 algorithm. So, how many people would like 934 01:01:41,951 --> 01:01:44,866 to see why the standard algorithm is bad? 
935 01:01:44,866 --> 01:01:47,198 Because it's not totally obvious. 936 01:01:47,198 --> 01:01:49,603 One, two, three, four, five, half? 937 01:01:49,603 --> 01:01:53,611 Wow, that's a lot of votes. Now, how many people want to 938 01:01:53,611 --> 01:01:55,433 skip to the chase? No one. 939 01:01:55,433 --> 01:01:58,129 One, OK. And, everyone else is asleep. 940 01:01:58,129 --> 01:02:01,190 So, that's pretty good, 50% awake, not bad. 941 01:02:01,190 --> 01:02:06,000 OK, then, so standard matrix multiplication. 942 01:02:06,000 --> 01:02:10,036 I'll do this fast because it is, I mean, you all know the 943 01:02:10,036 --> 01:02:13,207 algorithm, right? To compute this value of C, 944 01:02:13,207 --> 01:02:17,099 in A you take this row, and in B you take this column. 945 01:02:17,099 --> 01:02:19,477 Sorry, I did that a little bit sloppily. 946 01:02:19,477 --> 01:02:21,927 But this is supposed to be aligned. 947 01:02:21,927 --> 01:02:24,378 Right? So I take all of this stuff, 948 01:02:24,378 --> 01:02:27,837 I multiply it with all of the stuff, add them up, 949 01:02:27,837 --> 01:02:31,949 the dot product. That gives me this element. 950 01:02:31,949 --> 01:02:35,487 And, let's say I do them in this order row by row. 951 01:02:35,487 --> 01:02:39,241 So for every item in C, I loop over this row and this 952 01:02:39,241 --> 01:02:41,624 column, B, multiply them together. 953 01:02:41,624 --> 01:02:44,151 That is an access pattern in memory. 954 01:02:44,151 --> 01:02:48,555 So, exactly how much that costs depends how these matrices are 955 01:02:48,555 --> 01:02:51,732 laid out in memory. OK, this is a subtlety we 956 01:02:51,732 --> 01:02:55,703 haven't had to worry about before because everything was 957 01:02:55,703 --> 01:02:58,519 uniform. I'm going to assume, to give the 958 01:02:58,519 --> 01:03:02,057 standard algorithm the best chances of being good, 959 01:03:02,057 --> 01:03:05,956 I'm going to store C in row major order, A in row major 960 01:03:05,956 --> 01:03:10,000 order, and B in column major order. 961 01:03:10,000 --> 01:03:14,983 So, everything is nice and you're scanning. 962 01:03:14,983 --> 01:03:19,254 So then this inner product is a scan. 963 01:03:19,254 --> 01:03:21,389 Cool. Sounds great, 964 01:03:21,389 --> 01:03:24,711 doesn't it? It's bad, though. 965 01:03:24,711 --> 01:03:31,000 Assume A is row major, and B is column major. 966 01:03:31,000 --> 01:03:33,911 And C, you could assume is really either way, 967 01:03:33,911 --> 01:03:37,750 but if I'm doing it row by row, I'll assume it's row major. 968 01:03:37,750 --> 01:03:41,257 So, this is what I call the layout, the memory layout, 969 01:03:41,257 --> 01:03:43,904 of these matrices. OK, it's good for this 970 01:03:43,904 --> 01:03:46,551 algorithm, but the algorithm is not good. 971 01:03:46,551 --> 01:03:49,000 So, it won't be that great. 972 01:04:12,000 --> 01:04:16,227 So, how long does this take? How many memory transfers? 973 01:04:16,227 --> 01:04:20,533 We know it takes N^3 time. Not going to try and beat N^3 974 01:04:20,533 --> 01:04:22,882 here. Just going to try and get 975 01:04:22,882 --> 01:04:26,249 standard matrix multiplication going faster. 976 01:04:26,249 --> 01:04:30,711 So, well, for each item over here I pay N over B to do the 977 01:04:30,711 --> 01:04:36,801 scans and get the inner product. So, N over B per item. 978 01:04:36,801 --> 01:04:42,659 So, it's N over B, or we could go with the plus 979 01:04:42,659 --> 01:04:49,408 one here, to compute each c_ij.
So that would suggest, 980 01:04:49,408 --> 01:04:54,883 as an upper bound at least, it's N^3 over B. 981 01:04:54,883 --> 01:05:00,996 OK, and indeed that is the right bound, so theta. 982 01:05:00,996 --> 01:05:08,000 This is memory transfers, not time, obviously. 983 01:05:08,000 --> 01:05:12,349 That is indeed the case because if you look at consecutive, 984 01:05:12,349 --> 01:05:14,525 I do this c_ij, then this one, 985 01:05:14,525 --> 01:05:18,125 this one, this one, this one, keep incrementing j 986 01:05:18,125 --> 01:05:20,074 and keeping I fixed, right? 987 01:05:20,074 --> 01:05:23,824 So, the row that I use stays fixed for a long time. 988 01:05:23,824 --> 01:05:27,875 I get to reuse that if it happens, say that that fits a 989 01:05:27,875 --> 01:05:32,150 block maybe, I get to reuse that row several times if that 990 01:05:32,150 --> 01:05:36,631 happens to fit in cache. But the column is changing 991 01:05:36,631 --> 01:05:39,642 every single time. OK, so every time I moved here 992 01:05:39,642 --> 01:05:43,093 and compute the next c_ij, even if a column could fit in 993 01:05:43,093 --> 01:05:45,790 cache, I can't fit all the columns in cache. 994 01:05:45,790 --> 01:05:48,174 And the columns that I'm visiting move, 995 01:05:48,174 --> 01:05:50,119 you know, they just scan across. 996 01:05:50,119 --> 01:05:52,942 So, I'm scanning this whole matrix every time. 997 01:05:52,942 --> 01:05:55,766 And unless you're entire matrix fits in cache, 998 01:05:55,766 --> 01:05:58,840 in which case you could do anything, I don't care, 999 01:05:58,840 --> 01:06:02,353 it will take constant time, or you'll take M over B time, 1000 01:06:02,353 --> 01:06:05,302 enough to read it into the cache, do your stuff, 1001 01:06:05,302 --> 01:06:09,989 and write it back out. Except in that boring case, 1002 01:06:09,989 --> 01:06:14,115 you're going to have to pay N^2 over B for every row here 1003 01:06:14,115 --> 01:06:18,242 because you have to scan the whole collection of columns. 1004 01:06:18,242 --> 01:06:22,589 You have to read this entire matrix for every row over here. 1005 01:06:22,589 --> 01:06:26,494 So, you really do need N^3 over B for the whole thing. 1006 01:06:26,494 --> 01:06:30,043 So, it's usually a theta. So, you might say, 1007 01:06:30,043 --> 01:06:32,766 well, that's great. It's the size of my problem, 1008 01:06:32,766 --> 01:06:34,852 the usual running time, divided by B. 1009 01:06:34,852 --> 01:06:38,329 And that was the case when we are thinking about linear time, 1010 01:06:38,329 --> 01:06:41,168 N versus N over B. It's hard to beat N over B when 1011 01:06:41,168 --> 01:06:44,066 your problem is of size N. But now we have a cubed. 1012 01:06:44,066 --> 01:06:47,137 And, this gets back to, we have good spatial locality. 1013 01:06:47,137 --> 01:06:49,687 When we read a block, we use the whole thing. 1014 01:06:49,687 --> 01:06:51,019 Great. It seems optimal. 1015 01:06:51,019 --> 01:06:53,337 But we don't have good temporal locality. 1016 01:06:53,337 --> 01:06:56,350 It could be that maybe if we stored the right things, 1017 01:06:56,350 --> 01:06:59,074 we kept them around, we could them several times 1018 01:06:59,074 --> 01:07:04,000 because we're using each element like a cubed number of times. 1019 01:07:04,000 --> 01:07:08,990 That's not the right way of saying it, but we're reusing the 1020 01:07:08,990 --> 01:07:11,951 matrices a lot, reusing those items. 1021 01:07:11,951 --> 01:07:16,942 If we are doing N^3 work on N^2 things, we're reusing a lot. 
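For reference, a minimal sketch of the standard algorithm being analyzed here (not the lecture's board code): A is assumed to be given as a list of rows and B as a list of columns, matching the row-major / column-major layout above, so each entry of C is computed by one pair of sequential scans.

def matmul_standard(A_rows, B_cols, n):
    """Standard matrix multiplication, written to expose the access pattern
    discussed above: A_rows[i] is row i of A (row major), B_cols[j] is
    column j of B (column major)."""
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # dot product: two sequential scans of length n,
            # about N/B + 1 memory transfers per entry of C
            C[i][j] = sum(A_rows[i][k] * B_cols[j][k] for k in range(n))
    return C

Each c_ij costs about N/B + 1 transfers, and the columns of B get rescanned for every row of C, which is exactly the missing temporal locality just discussed.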
1022 01:07:16,942 --> 01:07:21,933 So, we want to do better than this, and that's the recursive 1023 01:07:21,933 --> 01:07:26,416 algorithm, which we've seen. So, we know the algorithm 1024 01:07:26,416 --> 01:07:29,800 pretty much. I just have to tell you what 1025 01:07:29,800 --> 01:07:36,588 the layout is. So, we're going to take C, 1026 01:07:36,588 --> 01:07:42,941 partition of C_1-1, C_1-2, and so on. 1027 01:07:42,941 --> 01:07:52,647 So, I have an N by N matrix, and I'm partitioning into N 1028 01:07:52,647 --> 01:08:02,176 over 2 by N over 2 submatrices, all three of them times 1029 01:08:02,176 --> 01:08:07,377 whatever. And, I could write this out yet 1030 01:08:07,377 --> 01:08:11,058 again but I won't. OK, we can recursively compute 1031 01:08:11,058 --> 01:08:15,200 this thing with eight matrix multiplies, and a bunch of 1032 01:08:15,200 --> 01:08:18,191 matrix additions. I don't care how many, 1033 01:08:18,191 --> 01:08:22,256 but a constant number. We see that at least twice now, 1034 01:08:22,256 --> 01:08:26,091 so I won't show it again. Now, how do I lay out the 1035 01:08:26,091 --> 01:08:29,005 matrices? Any suggestions how I lay out 1036 01:08:29,005 --> 01:08:32,979 the matrices? I could lay them out in row 1037 01:08:32,979 --> 01:08:35,693 major order. I'll call it major order. 1038 01:08:35,693 --> 01:08:38,185 But that might be less natural now. 1039 01:08:38,185 --> 01:08:42,000 We're not doing anything by rows or by columns. 1040 01:08:59,000 --> 01:09:03,014 So, what layout should I use? Yeah? 1041 01:09:03,014 --> 01:09:08,446 Quartet major order, maybe quadrant major order 1042 01:09:08,446 --> 01:09:12,933 unless you're musically inclined, yeah. 1043 01:09:12,933 --> 01:09:17,420 Good idea. You've never seen this order 1044 01:09:17,420 --> 01:09:21,671 before, so it's maybe not so natural. 1045 01:09:21,671 --> 01:09:26,158 Somehow I want to cluster it by blocks. 1046 01:09:26,158 --> 01:09:33,402 OK, I think that's about all. So, I mean, it's a recursive 1047 01:09:33,402 --> 01:09:36,576 layout. This was not an easy question. 1048 01:09:36,576 --> 01:09:39,751 It's OK. Store matrices or lay out the 1049 01:09:39,751 --> 01:09:44,899 matrices recursively by block. OK, I'm cheating a little bit. 1050 01:09:44,899 --> 01:09:49,961 I'm redefining the problem to say, assume that your matrices 1051 01:09:49,961 --> 01:09:54,680 are laid out in this way. But, it doesn't really matter. 1052 01:09:54,680 --> 01:09:56,568 We can cheat, can't we? 1053 01:09:56,568 --> 01:10:02,276 In fact, it doesn't matter. You can turn a matrix into this 1054 01:10:02,276 --> 01:10:06,315 layout without too much linear work, almost linear work. 1055 01:10:06,315 --> 01:10:07,637 Log factors, maybe. 1056 01:10:07,637 --> 01:10:11,676 OK, so if I want to store my matrix A as a linear thing, 1057 01:10:11,676 --> 01:10:15,274 I'm going to recursively defined that layout to be 1058 01:10:15,274 --> 01:10:19,019 recursively store the upper left corner, then store, 1059 01:10:19,019 --> 01:10:21,442 let's say, the upper right corner. 1060 01:10:21,442 --> 01:10:24,380 It doesn't matter which order I do these. 1061 01:10:24,380 --> 01:10:28,492 I should have drawn this wider, then store the lower left 1062 01:10:28,492 --> 01:10:34,000 corner, and then store the lower right corner recursively. 1063 01:10:34,000 --> 01:10:38,025 So, how do you store this? Well, you divide it in four, 1064 01:10:38,025 --> 01:10:40,634 and lay out the top left, and so on. 
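One concrete way to write down the recursive layout just described (a sketch under the assumption that N is a power of two; in the literature this quadrant-by-quadrant ordering is essentially the Z-order, or Morton order, layout, though any fixed order of the four quadrants works):

def quadrant_index(i, j, n):
    """Position of entry (i, j) in the linear array for an n-by-n matrix
    (n a power of two) stored recursively by quadrant: upper-left block
    first, then upper-right, then lower-left, then lower-right, each
    quadrant laid out recursively the same way."""
    if n == 1:
        return 0
    half = n // 2
    quadrant = 2 * (i >= half) + (j >= half)   # 0 = UL, 1 = UR, 2 = LL, 3 = LR
    offset = quadrant * half * half            # each quadrant occupies half*half slots
    return offset + quadrant_index(i % half, j % half, half)

Laying out C, A, and B this way is what guarantees, below, that every recursive multiply touches three contiguous chunks of memory.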
1065 01:10:40,634 --> 01:10:44,511 OK, this is a recursive definition of how the elements 1066 01:10:44,511 --> 01:10:47,046 should be stored in a linear array. 1067 01:10:47,046 --> 01:10:50,326 It's a weird one, but this is a very powerful 1068 01:10:50,326 --> 01:10:52,861 idea in cache oblivious algorithms. 1069 01:10:52,861 --> 01:10:57,408 We'll use this multiple times. OK, so now all we have to do is 1070 01:10:57,408 --> 01:11:00,241 analyze the number of memory transfers. 1071 01:11:00,241 --> 01:11:05,066 How hard could it be? So, we're going to store all 1072 01:11:05,066 --> 01:11:08,978 the matrices in this order, and we want to compute the 1073 01:11:08,978 --> 01:11:12,373 number of memory transfers on an N by N matrix. 1074 01:11:12,373 --> 01:11:15,547 See, I lapsed and I switched to lowercase n. 1075 01:11:15,547 --> 01:11:19,902 I should, throughout this week, be using uppercase N because 1076 01:11:19,902 --> 01:11:23,666 for historical reasons, any external memory kinds of 1077 01:11:23,666 --> 01:11:28,095 algorithms, two-level algorithms, always talk about capital N. 1078 01:11:28,095 --> 01:11:31,785 And, don't ask why. You should see what they define 1079 01:11:31,785 --> 01:11:37,995 little n to be. OK, so, any suggestions on what 1080 01:11:37,995 --> 01:11:45,342 the recurrence should be now? With all this fancy setup, the 1081 01:11:45,342 --> 01:11:49,724 recurrence is actually pretty easy. 1082 01:11:49,724 --> 01:11:57,071 So, definitely it involves multiplying matrices that are N 1083 01:11:57,071 --> 01:12:03,000 over 2 by N over 2. So, what goes here? 1084 01:12:03,000 --> 01:12:05,752 Eight, thank you. That you should know. 1085 01:12:05,752 --> 01:12:08,793 And the tricky part is what goes here. 1086 01:12:08,793 --> 01:12:12,487 OK, what goes here is, now, the fact that I can even 1087 01:12:12,487 --> 01:12:15,384 write this, this is the matrix additions. 1088 01:12:15,384 --> 01:12:18,788 Ignore those for now. Suppose there weren't any. 1089 01:12:18,788 --> 01:12:21,323 I just have to recursively multiply. 1090 01:12:21,323 --> 01:12:25,740 The fact that this actually is eight times memory transfers of 1091 01:12:25,740 --> 01:12:30,670 N over 2 relies on this layout. Right, I'm assuming that the 1092 01:12:30,670 --> 01:12:34,129 arrays that I'm given are given as contiguous intervals in 1093 01:12:34,129 --> 01:12:35,442 memory. If they aren't, 1094 01:12:35,442 --> 01:12:38,066 I mean, if they're scattered all over memory, 1095 01:12:38,066 --> 01:12:40,273 I'm screwed. There's nothing I can do. 1096 01:12:40,273 --> 01:12:43,434 So, but by assuming that I have this recursive layout, 1097 01:12:43,434 --> 01:12:46,835 I know that the recursive multiplies will always deal with 1098 01:12:46,835 --> 01:12:49,519 three consecutive chunks of memory, one for A, 1099 01:12:49,519 --> 01:12:52,202 one for B, one for C, OK, no matter what I do. 1100 01:12:52,202 --> 01:12:54,470 Because these are stored consecutively, 1101 01:12:54,470 --> 01:12:56,438 recursively I have that invariant. 1102 01:12:56,438 --> 01:12:59,540 And I can keep recursing. And I'm always dealing with 1103 01:12:59,540 --> 01:13:03,000 three consecutive chunks of memory. 1104 01:13:03,000 --> 01:13:08,327 That's why I need this layout: to be able to say this. 1105 01:13:08,327 --> 01:13:11,332 OK, now what does addition cost? 1106 01:13:11,332 --> 01:13:14,335 I'll just give you two matrices.
1107 01:13:14,335 --> 01:13:19,858 They're stored in some linear order, the same linear order 1108 01:13:19,858 --> 01:13:25,186 among the three of them. Do I care what the linear order 1109 01:13:25,186 --> 01:13:28,384 is? How should I add two matrices, 1110 01:13:28,384 --> 01:13:31,000 get the output? 1111 01:13:42,000 --> 01:13:43,000 Yeah? 1112 01:13:51,000 --> 01:13:54,850 Right, if each of the three arrays I'm dealing with are 1113 01:13:54,850 --> 01:13:58,559 stored in the same order, I can just scan in parallel 1114 01:13:58,559 --> 01:14:02,909 through all three of them and just add corresponding elements, 1115 01:14:02,909 --> 01:14:07,045 and output it to the third. So, I don't care what the order 1116 01:14:07,045 --> 01:14:10,682 is, as long as it's consistent and I get N^2 over B. 1117 01:14:10,682 --> 01:14:14,390 I'll ignore plus one here. That's just looking at the 1118 01:14:14,390 --> 01:14:16,529 entire matrix. So, there we go: 1119 01:14:16,529 --> 01:14:19,667 another recurrence. We've seen this with N^2, 1120 01:14:19,667 --> 01:14:23,090 and we just got N^3. But, it turns out now we get 1121 01:14:23,090 --> 01:14:26,371 something cooler if we use the right base case. 1122 01:14:26,371 --> 01:14:30,008 So now we get to the base case, ah, the tricky part. 1123 01:14:30,008 --> 01:14:35,000 So, any suggestions what base case I should use? 1124 01:14:35,000 --> 01:14:36,672 The block size, good suggestion. 1125 01:14:36,672 --> 01:14:38,829 So, if we have something of size order B, 1126 01:14:38,829 --> 01:14:41,850 we know that takes a constant number of memory transfers. 1127 01:14:41,850 --> 01:14:44,871 It turns out that's not enough. That won't solve it here. 1128 01:14:44,871 --> 01:14:46,381 But good guess. In this case, 1129 01:14:46,381 --> 01:14:49,294 it's not the right answer. I'll give you some intuition 1130 01:14:49,294 --> 01:14:51,182 why. We are trying to improve on N^3 1131 01:14:51,182 --> 01:14:53,178 over B. If you were just trying to get 1132 01:14:53,178 --> 01:14:55,443 it divided by B, this is a great base case. 1133 01:14:55,443 --> 01:14:58,572 But here, we know that just the improvement afforded by the 1134 01:14:58,572 --> 01:15:03,244 block size is not enough. We have to somehow use the fact 1135 01:15:03,244 --> 01:15:06,864 that the cache is big. It's M, so however big M is, 1136 01:15:06,864 --> 01:15:09,977 it's that big. OK, so if we want to get some 1137 01:15:09,977 --> 01:15:13,307 improvement on this, we've got to have M in the 1138 01:15:13,307 --> 01:15:16,276 formula somewhere, and there's no M's yet. 1139 01:15:16,276 --> 01:15:19,027 So, it's got to involve M. What's that? 1140 01:15:19,027 --> 01:15:21,271 MT of M over B? That would work, 1141 01:15:21,271 --> 01:15:25,108 but MT of M is also OK, I mean, some constant times M, 1142 01:15:25,108 --> 01:15:27,859 let's say. I want to make this constant 1143 01:15:27,859 --> 01:15:33,000 small enough so that the entire problem fits in cache. 1144 01:15:33,000 --> 01:15:37,006 So, it's like one third. I think it's actually, 1145 01:15:37,006 --> 01:15:40,837 oh wait, is it the square root of M actually? 1146 01:15:40,837 --> 01:15:43,537 Right, this is an N by N matrix. 1147 01:15:43,537 --> 01:15:47,456 So, it should be C times the square root of M. 1148 01:15:47,456 --> 01:15:50,330 Sorry. So, the square root of M by 1149 01:15:50,330 --> 01:15:53,552 square root of M matrix has M entries. 
1150 01:15:53,552 --> 01:15:58,603 If I make C like one third or something, then I can fit all 1151 01:15:58,603 --> 01:16:04,372 three matrices in memory. Actually, one over square root 1152 01:16:04,372 --> 01:16:06,903 of three would do, but who cares? 1153 01:16:06,903 --> 01:16:10,621 So, for some constant, C, now everything fits in 1154 01:16:10,621 --> 01:16:13,548 memory. How many memory transfers does 1155 01:16:13,548 --> 01:16:14,497 it take? One? 1156 01:16:14,497 --> 01:16:18,451 It's a bit too small, because I do have to read the 1157 01:16:18,451 --> 01:16:20,587 problem in. And now, I mean, 1158 01:16:20,587 --> 01:16:24,621 here was one because there's only one block to read. 1159 01:16:24,621 --> 01:16:27,548 Now how many blocks are there to read? 1160 01:16:27,548 --> 01:16:30,000 Constants? No. 1161 01:16:30,000 --> 01:16:30,369 B? No. 1162 01:16:30,369 --> 01:16:33,255 M over B, good. Get it right eventually. 1163 01:16:33,255 --> 01:16:37,102 That's the great thing about thinking with an oracle. 1164 01:16:37,102 --> 01:16:41,318 You can just keep guessing. M over B because we have cache 1165 01:16:41,318 --> 01:16:43,908 size M. There are M over B blocks in 1166 01:16:43,908 --> 01:16:46,201 that cache to read each one, OK? 1167 01:16:46,201 --> 01:16:49,382 This is maybe, you forgot what M was because 1168 01:16:49,382 --> 01:16:51,897 we haven't used it for a long time. 1169 01:16:51,897 --> 01:16:54,857 But M is the number of elements in cache. 1170 01:16:54,857 --> 01:16:59,000 This is the number of blocks in cache. 1171 01:16:59,000 --> 01:17:02,537 OK, some of you were saying B, and it's reasonable to assume 1172 01:17:02,537 --> 01:17:05,943 that M over B is about B. That's like a square cache, 1173 01:17:05,943 --> 01:17:08,892 but in general, we don't make that assumption. 1174 01:17:08,892 --> 01:17:11,381 OK, where are we? We're hopefully done, 1175 01:17:11,381 --> 01:17:14,460 just about, good, because we have three minutes. 1176 01:17:14,460 --> 01:17:17,800 So, that's our base case. I have a square root here; 1177 01:17:17,800 --> 01:17:20,815 I just forgot it. Now we just have to solve it. 1178 01:17:20,815 --> 01:17:23,434 Now, this is an easier recurrence, right? 1179 01:17:23,434 --> 01:17:27,497 I don't want to use the master method, because master method is 1180 01:17:27,497 --> 01:17:31,296 not going to handle these B's and M's, and these crazy base 1181 01:17:31,296 --> 01:17:35,271 cases. OK, master method would prove 1182 01:17:35,271 --> 01:17:36,054 N^3. Great. 1183 01:17:36,054 --> 01:17:40,282 Master method doesn't really think about these kinds of 1184 01:17:40,282 --> 01:17:42,789 cases. But with recursion trees, 1185 01:17:42,789 --> 01:17:47,331 if you remember way back to the proof of the master method, 1186 01:17:47,331 --> 01:17:52,030 you just look at whether the recursion tree is geometric going 1187 01:17:52,030 --> 01:17:55,945 up or down, or whether every level is equal, 1188 01:17:55,945 --> 01:17:59,000 and then you just add them up, level by level. The point is that this is a 1189 01:17:59,000 --> 01:18:02,680 nice recurrence. All of the sub problems are the 1190 01:18:02,680 --> 01:18:05,891 same size, and that analysis always works, 1191 01:18:05,891 --> 01:18:12,000 I say, when everything has the same size, all the children. 1192 01:18:12,000 --> 01:18:18,857 So, here's the recursion tree. We have N^2 over B at the top.
1193 01:18:18,857 --> 01:18:24,114 We split into eight subproblems where each one, 1194 01:18:24,114 --> 01:18:27,657 the cost is one half N^2 over B. 1195 01:18:27,657 --> 01:18:32,000 I'm not going to write them all. 1196 01:18:32,000 --> 01:18:34,716 There they are. You add them up. 1197 01:18:34,716 --> 01:18:38,921 How much do you get? Well, there's eight of them. 1198 01:18:38,921 --> 01:18:41,637 Eight times a half is two. Four. 1199 01:18:41,637 --> 01:18:44,265 [LAUGHTER] Thanks. Four, right? 1200 01:18:44,265 --> 01:18:48,909 OK, I'm bad at arithmetic. I probably already said it, 1201 01:18:48,909 --> 01:18:52,675 but there are three kinds of mathematicians, 1202 01:18:52,675 --> 01:18:56,006 those who can add, and those who can't. 1203 01:18:56,006 --> 01:19:01,000 OK, why am I looking at this? It's obvious. 1204 01:19:01,000 --> 01:19:03,800 OK, so we keep going. This looks geometrically 1205 01:19:03,800 --> 01:19:04,858 increasing. Right? 1206 01:19:04,858 --> 01:19:08,405 You just know in your heart that if you work out the first 1207 01:19:08,405 --> 01:19:12,263 two levels, you can tell whether it's geometrically increasing, 1208 01:19:12,263 --> 01:19:15,437 decreasing, or they're all equal, or something else. 1209 01:19:15,437 --> 01:19:18,984 And then you better think. But I see this as geometrically 1210 01:19:18,984 --> 01:19:21,412 increasing. It will indeed be like 16 at 1211 01:19:21,412 --> 01:19:22,843 the next level, I guess. 1212 01:19:22,843 --> 01:19:25,145 OK, it should be. So, it's increasing. 1213 01:19:25,145 --> 01:19:30,000 That means the leaves matter. So, let's work out the leaves. 1214 01:19:30,000 --> 01:19:33,960 And, this is where we use our base case. 1215 01:19:33,960 --> 01:19:38,630 So, we have a problem of size square root of M. 1216 01:19:38,630 --> 01:19:41,981 And so, yeah, you have a question? 1217 01:19:41,981 --> 01:19:45,840 Oh, indeed. I knew there was something. 1218 01:19:45,840 --> 01:19:50,003 I knew it was supposed to be two out here. 1219 01:19:50,003 --> 01:19:53,150 Thanks. This is why you're here. 1220 01:19:53,150 --> 01:19:57,110 It's actually N over two squared over B. 1221 01:19:57,110 --> 01:20:00,867 Thanks. I'm substituting N over 2 into 1222 01:20:00,867 --> 01:20:04,900 this. OK, so this is actually N^2 1223 01:20:04,900 --> 01:20:06,519 over 4 B. So, I get two, 1224 01:20:06,519 --> 01:20:09,546 because there are eight times one over four. 1225 01:20:09,546 --> 01:20:13,416 OK, I wasn't that far off then. It's still geometrically 1226 01:20:13,416 --> 01:20:15,529 increasing, still the case, OK? 1227 01:20:15,529 --> 01:20:17,992 But now, it actually doesn't matter. 1228 01:20:17,992 --> 01:20:21,371 Whatever the cost is, as long as it's bigger than 1229 01:20:21,371 --> 01:20:23,975 one, great. Now we look at the leaves. 1230 01:20:23,975 --> 01:20:26,157 The leaves are root M by root M. 1231 01:20:26,157 --> 01:20:29,958 I substitute root M into this: I get M over B with some 1232 01:20:29,958 --> 01:20:32,903 constants. Who cares? 1233 01:20:32,903 --> 01:20:36,787 So, each leaf is M over B, OK, lots of them. 1234 01:20:36,787 --> 01:20:40,038 How many are there? This is the only annoying thing 1235 01:20:40,038 --> 01:20:45,006 when you deal with recursion trees: counting the number of leaves 1236 01:20:45,006 --> 01:20:48,709 is always the annoying part. Oh boy, well, 1237 01:20:48,709 --> 01:20:53,948 we start with an N by N matrix. We stop when we get down to a 1238 01:20:53,948 --> 01:21:00,000 root M by root M matrix.
So, that sounds like something. 1239 01:21:00,000 --> 01:21:04,141 Oh boy, I'm cheating here. Really? 1240 01:21:04,141 --> 01:21:07,905 That many? It sounds plausible. 1241 01:21:07,905 --> 01:21:11,921 OK, the claim is, and I'll cheat. 1242 01:21:11,921 --> 01:21:19,450 So I'm going to use the oracle here, and we'll figure out why 1243 01:21:19,450 --> 01:21:24,470 this is the case. N over root M, cubed, leaves, 1244 01:21:24,470 --> 01:21:27,231 hey what? I think here, 1245 01:21:27,231 --> 01:21:33,979 it's hard to see the tree. But it's easy to see in the 1246 01:21:33,979 --> 01:21:36,178 matrix. Let's enter the matrix. 1247 01:21:36,178 --> 01:21:39,256 We have our big matrix. We divided in half. 1248 01:21:39,256 --> 01:21:43,654 We recursively divide in half. We recursively divide in half. 1249 01:21:43,654 --> 01:21:45,120 You get the idea, OK? 1250 01:21:45,120 --> 01:21:49,151 Now, at some point these sectors, let's say one of these 1251 01:21:49,151 --> 01:21:52,743 sectors, and each of these sectors, fits in cache. 1252 01:21:52,743 --> 01:21:56,994 And three of them fit in cache. So, that's when we stop the 1253 01:21:56,994 --> 01:22:02,320 recursion in the analysis. The algorithm goes all the way. 1254 01:22:02,320 --> 01:22:05,538 But in the analysis, let's say we stop at M. 1255 01:22:05,538 --> 01:22:08,981 OK, now, how many leaves or problems are there? 1256 01:22:08,981 --> 01:22:11,451 Oh man, this is still not obvious. 1257 01:22:11,451 --> 01:22:14,669 OK, the number of leaf chunks here is, like, 1258 01:22:14,669 --> 01:22:19,010 I mean, the number of these things is something like N over 1259 01:22:19,010 --> 01:22:21,629 root M, right, the number of chunks. 1260 01:22:21,629 --> 01:22:26,195 But, it's a little less clear because I have so many of these. 1261 01:22:26,195 --> 01:22:28,964 But, all right, so let's just suppose, 1262 01:22:28,964 --> 01:22:32,856 now, I think of normal, boring, matrix multiplication 1263 01:22:32,856 --> 01:22:38,119 on chunks of this size. That's essentially what the 1264 01:22:38,119 --> 01:22:42,200 leaves should tell me. I start with this big problem, 1265 01:22:42,200 --> 01:22:45,261 I recurse out to all these little, tiny, 1266 01:22:45,261 --> 01:22:48,950 multiply this by that, OK, this root M by root M 1267 01:22:48,950 --> 01:22:51,305 chunk. OK, how many operations, 1268 01:22:51,305 --> 01:22:54,680 how many multiplies do I do on those things? 1269 01:22:54,680 --> 01:22:57,034 N^3. But now, N, the size of my 1270 01:22:57,034 --> 01:23:00,488 matrix in terms of these little sub matrices, 1271 01:23:00,488 --> 01:23:05,859 is N over root M. So, it should be N over root 1272 01:23:05,859 --> 01:23:10,760 M, cubed, subproblems of this size. If you work it out, 1273 01:23:10,760 --> 01:23:16,478 normally we go down to things of constant size and we get 1274 01:23:16,478 --> 01:23:21,278 exactly N^3 of them. Now we are stopping short at 1275 01:23:21,278 --> 01:23:26,485 this point and saying, well, it's however many there 1276 01:23:26,485 --> 01:23:30,161 are, cubed. OK, this is a bit of hand 1277 01:23:30,161 --> 01:23:35,352 waving. You could work it out with the 1278 01:23:35,352 --> 01:23:39,151 recurrence on the number of leaves. 1279 01:23:39,151 --> 01:23:44,180 But there it is. So, the total here is N over, 1280 01:23:44,180 --> 01:23:49,656 let's work it out.
N^3 over M to the three halves, 1281 01:23:49,656 --> 01:23:56,025 that's this number of leaves, times the cost at each leaf, 1282 01:23:56,025 --> 01:24:01,054 which is M over B. So, some of the M's cancel, 1283 01:24:01,054 --> 01:24:07,759 and we get N^3 over B root M, which is a root M factor better 1284 01:24:07,759 --> 01:24:13,433 than N^3 over B. It's actually quite a lot, 1285 01:24:13,433 --> 01:24:16,522 the square root of the cache size. 1286 01:24:16,522 --> 01:24:20,359 That is optimal. The best two-level matrix 1287 01:24:20,359 --> 01:24:26,162 multiplication algorithm is N^3 over B root M memory transfers. 1288 01:24:26,162 --> 01:24:30,000 Pretty amazing, and I'm over time. 1289 01:24:30,000 --> 01:24:34,979 You can generalize this into all sorts of great things, 1290 01:24:34,979 --> 01:24:39,959 but the bottom line is this is a great way to do matrix 1291 01:24:39,959 --> 01:24:45,308 multiplication as a recursion. We'll see more recursion for 1292 01:24:45,308 --> 01:24:48,000 cache oblivious algorithms on Wednesday.
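Collecting the pieces of this last analysis in one place (the same quantities as above, just written out):
\[
\mathrm{MT}(N) = 8\,\mathrm{MT}\!\left(\frac{N}{2}\right) + \Theta\!\left(\frac{N^2}{B}\right),
\qquad
\mathrm{MT}\!\left(c\sqrt{M}\right) = \Theta\!\left(\frac{M}{B}\right),
\]
\[
\text{number of leaves} = \Theta\!\left(\left(\frac{N}{\sqrt{M}}\right)^{3}\right),
\qquad
\mathrm{MT}(N) = \Theta\!\left(\left(\frac{N}{\sqrt{M}}\right)^{3} \cdot \frac{M}{B}\right) = \Theta\!\left(\frac{N^{3}}{B\sqrt{M}}\right).
\]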