The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

ERIK DEMAINE: Welcome to the final week of 6.046. Are you excited?

[CHEERING]

Yeah, today--

AUDIENCE: Oh.

ERIK DEMAINE: Well, and sad, I know. It's tough. But we've got two more lectures, and they're on one topic, which is cache-oblivious algorithms. This is a really cool concept. It was actually originally developed in the context of 6.046, as sort of an interesting way to teach cache-efficient algorithms. But it turned into a whole research program in the late '90s, and now it's its own thing. It's kind of funny to bring it back to 6.046.

The whole idea is that in all of the algorithms we have seen, except maybe distributed algorithms, we've had this view that all of the data we can access costs the same. If we have an array, like a hash table, accessing anything in the hash table is equally costly. If we have a binary search tree, every node costs the same to access. But this is not real. Let me give you some idea of what a real computer looks like. You probably know this, but we've not yet thought about it in an algorithmic context.

These are caches, what are typically called caches, in your computer. Then you have what we've mostly been thinking about, which is main memory, your RAM. And then there's probably more stuff. These days, if you have a fancier computer, you have some big flash, which is maybe caching your disk, which is huge. And then maybe there's the internet at the end, if you like. So the point is, all the data in the world is not on your CPU.
And there's this big thing called the memory hierarchy, which dictates which things are fast and which things are slow-- not exactly which data items; that's up to you. The idea is that on board your CPU you have, these days, probably up to four levels of cache. As I've tried to draw them, they get increasingly big. Typical values: a level-one cache is something on the order of tens of kilobytes-- 32 K, whatever. A level-four cache these days, as introduced by the Haswell architecture, is about 100 megabytes. Main memory you know; this is the thing you usually think about. It's in the gigabytes-- these days you can buy computers with a terabyte of RAM; it's not crazy. Flash gets bigger. Disk-- these days you can buy a 4-terabyte single disk, but if you have a whole RAID of disks, you can have petabytes of data on one computer.

So things are getting bigger as we go farther to the right, but they're also getting slower. And the point of cache-efficient algorithms is to deal with the fact that things get slow when they get far away. This makes sense from a physics standpoint. If you think about how much data you can store in a cubic inch or something, and how much could possibly be near your CPU-- at some point you're just going to run out of space, and you've got to go farther away, and going farther away takes more time. So you can think of it-- I mean, there's the speed-of-light argument, that things that are farther away in your computer are going to take longer. Typical computers are not anywhere near the speed of light, so there's a more real issue, which is how long your traces are. And then, when you have physical moving parts, like a disk-- I don't know if you know, but disks actually spin, and there's a head, and it has to move around. That's called seek time. Moving a head around on the disk is really slow, on the order of milliseconds.
Whereas reading from on-chip cache is on the order of nanoseconds-- whatever your clock rate is, so a few billion times a second. So there's a big spread, a factor of a million or ten million, from level-one cache speed to disk speed. That sucks. And so you might think, well, if your data's big, you're just screwed-- you've got to deal with disk, and disk is slow. But that's not true. Life is not so bad.

In general, there are two notions of speed, and I've been kind of vague about them. One notion is latency: if right now I have the idea that I really need to fetch memory location 2 billion and 73, how long does it take for that data-- say, one word of data-- to come back? That's latency. But there's another issue, which is bandwidth: how fat are these pipes? What's the rate of information that I can pump? If I said, please give me all of main memory, in order, how fast could it pump it back? And that's actually really good.

So latency is like your startup cost: when I ask for something, how long does it take for that one thing to come? But then there's a data rate. And bandwidth you can generally make really large. For example, the bandwidth of a disk is pretty big. But even if it weren't big, you could just add 100 more disks, and then when you ask for some data, all 100 disks could give you data at the same speed-- provided you don't overload your bus; you've got to also make more buses and so on. So you can actually move a really huge amount of data per second. But still, the time to get there, the time for all the disks to seek their heads-- that's slow. It doesn't add up, though, because they're all doing it in parallel. So you can't reduce latency, but you can increase bandwidth. It doesn't quite match physics, but we can get pretty close to arbitrarily high bandwidth. And so, in a well-designed computer, the fatnesses of these pipes are going to increase-- or could increase, if you want.
So you can move lots of data around. But latency we can't get rid of, and this is annoying, because from an algorithmic standpoint, when we ask for something, we'd like it immediately. In a sequential algorithm, we can't do anything until that data arrives.

So cache efficiency is going to fix this by blocking. This is an old idea; it's been around since caches were introduced. When you ask for a single word in main memory, you don't get one word-- you get maybe 32 kilobytes of information, not just 4 bytes or 8 bytes. And we're kind of free to choose these block sizes however we want when we design the system. So we can set them, in a certain sense, to hide latency.

So if you think of amortizing the cost over the block, then you get something like: amortized cost per word equals latency divided by block size, plus one over bandwidth. Essentially, we divide the latency by the block size, and we have to pay one over bandwidth. Bandwidth is how many words a second you can read, say, from your memory, so one over bandwidth is going to be your cost. That term we can't change, but by adding enough disks, or making these pipes fat enough, you can basically make bandwidth big. Latency is the thing we can't control. But if the whole block is sort of useful, then we pay the initial startup time-- hey, give me this block-- and wait for the response only once for the entire block. So if there are block-size words in that block, then per item we're effectively dividing latency by block size. This is kind of rough, but this is the idea of how to reduce latency.

Now, for this to actually work, we need better algorithms. Pretty much every algorithm you've seen in this class so far works horribly in this model. So that's the point of today and next class: to fix that. For this kind of amortization to work-- I'm using "useful" in a vague sense so far; we'll make it formal in a moment.
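(In symbols, the rough accounting on the board-- my rendering of it:)

$$ \text{amortized cost per word} \;\approx\; \frac{\text{latency}}{\text{block size } B} \;+\; \frac{1}{\text{bandwidth}} $$

(With made-up but plausible disk numbers-- say, 10 milliseconds of latency amortized over a block of a million words-- the latency term comes to about 10 nanoseconds per word, the same order as the bandwidth term.)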
When I fetch an entire block, all of the elements in that block should be useful-- we should be able to compute something on them that we needed to compute. Otherwise, if I only needed the one item that I read out of the block, that's not going to help me so much. So I really want to structure my data in such a way that when I access one element, I'm also going to access the elements nearby it. Then this blocking will actually be useful. This is a property normally called spatial locality.

And the other thing we'd like: these caches have some size, so I can store more than just one block. It's not like I read one block, finish processing it, and then read the next block and go on. Some of these caches are actually pretty big-- if you think of main memory as a cache to your disk, that can be really big. So, ideally, the blocks that I'm using relate to each other in some way, or when I access a block, I'm going to access it for a while, along with other blocks. The way this is usually said is that we'd like to reuse the existing blocks in the cache as much as possible. And this you can think of as temporal locality: when I access a particular block, I'm going to access it again fairly soon. That way, it's actually useful to bring it into my cache, and then I use it many times. That would be even better.

I don't have to have both of these, and exactly to what extent I have them is going to dictate the overall time it takes to run my algorithm. But these are the ideal properties you want, in a very informal sense. In the rest of today, we're going to make this formal, and then we're going to develop some algorithms for this model. But this is the motivation. In reality, we're free to choose the block size in the system-- though, in a moment, I'm going to assume it's given to us. You'd normally set the block size so that those two terms come out roughly equal.
Because if you're spending the latency time to go and get something, you might as well get a whole chunk of something, according to whatever your bandwidth is. If it only costs you, say, twice as much to fetch an entire block as to fetch one word, that seems like a pretty good block size. So for something like disk, that block size is on the order of megabytes, maybe even bigger-- hundreds of megabytes. So think of the block sizes as really big. We really want all that data to be useful in some way.

Now, it's really hard to think about a memory hierarchy with so many levels. So we're going to focus on two levels at a time: the sort of cheap and small cache versus the huge thing, which I'll call disk, just for emphasis. I'm going to call this two-level model the external-memory model. It was originally introduced as a model for main memory versus disk, but you could apply it to any pair of levels. In general, you have your problem size N; choose the smallest level that fits N-- typically that's main memory, maybe it's disk-- and just think of the level between that and the previous one, so the last level and the next-to-last level. Often that's what matters. Like, if you run a program and you run out of RAM and start swapping to disk, that's when everything just slows to a crawl. You can see that difference at each of these levels, but it's probably most dramatic at disk, just because it's so slow-- a million times slower than RAM; or at least 1,000 times slower than RAM, I should say.

Anyway, so we have just two levels. Let me draw a more precise picture. We have the CPU-- this is where all of our operations happen, where we add numbers and so on. We'll think of it as having a constant number of registers; each register is one word. And then we have a really fat, low-latency pipe to the cache. The cache is going to be divided into blocks; let's say there are B words per block.
Instead of writing block size, I'll just write capital B. And the number of blocks I'm going to call M over B, so the total size of your cache is capital M. And then there is a relatively thin and slow connection-- this one's fast, this one's slow-- to your disk. Disk we'll think of as huge, essentially infinite in size. It's also divided into blocks of size B-- the same block size.

So this is the picture. Initially, all of the input is over here on the disk-- all of your N data items, whatever. Say you want to sort those items. In order to access those items, you first have to bring them into cache. That's going to be slow, but it's done in a blocked manner. When I want to access an individual item here, I have to request the entire block. When I request that block, it gets sent over here-- it takes a while-- and then I get to choose where to store it. Maybe I'll put it here. And then maybe I'll grab this block and store it here, and so on. Each of those is a block read; these are new instructions the CPU can do.

Eventually, this cache will get full, and then, before I bring in a new block, I have to kick out an old block-- meaning I need to take one of these blocks and write it to some position, maybe to the same place. In fact, we will always assume that you write it back to the same place, overwriting what was on the disk: you made some changes here, send it back. And, in general, what we're going to do is count how many times we read and write blocks. Question?

AUDIENCE: When you talked about how fast the connection is, you're just talking about latency, right?

ERIK DEMAINE: Yes, sorry, this is latency.

AUDIENCE: Yeah, so like the [INAUDIBLE] connections [? just don't have ?] [INAUDIBLE]?

ERIK DEMAINE: Right, this could have huge bandwidth. In this model, we're assuming the block size is fixed, and then latency versus bandwidth-- we're not going to think about bandwidth.
We'll assume the block size has been chosen in some reasonable way, and then all we need to do is count the number of blocks. But underneath, yeah, you have some kind of bandwidth. Presumably you set the block size to make these two terms roughly equal, and then latency and bandwidth are kind of the same thing. That's the idea. But really, we're just going to count latency: how many times do I have to request a block and wait for it to come over, and how many times do I write a block? I'm not going to worry about how much physical time it takes to do either of those things; I'm just going to count them and assume that that is what I need to minimize.

So I'm going to count what we call memory transfers: transfers of blocks between levels-- between these two levels. This is the number of blocks read from, or written to, disk. We're going to view accesses to the cache as free; I'm not going to count those. You don't need to worry about that so much, because we can still count the number of operations we do on the CPU. We can still think about how much regular time it takes to do the computation-- how many comparisons, how many additions, things like that. And that would include things like reading and writing individual elements from cache.

But we're going to view this connection-- let's say these are on the same chip, so reading cache is just as fast as reading from registers, and we're not going to worry about that time. What we're focusing on, for the purposes of this model, is between these two levels; the CPU and the cache are essentially one level, combined. I'll change that in a little bit, but for now, just think about the two levels. We're counting how many memory transfers we have between these two levels, cache and disk, and we want to minimize that. Now, just like before, we also want to minimize the running time in the usual, traditional measure.
And we want to minimize space and all the usual things we minimize. But now we have a new measure, the number of memory transfers, and we want our algorithm to minimize that too, for a given block size and a given cache size.

And at this point-- I'm going to change this in a moment-- the algorithm that we would write in this external-memory model explicitly manages the blocks. It has to explicitly read and write blocks. There's a software system that implements this model, particularly for disk, and lets you do this in a nice, controlled way-- maintain your memory, maintain reading and writing to disk. The operating system tries to do this, but it usually does a really bad job with swapping. There are software systems that let you take control and do much better.

So that's a good model. The external-memory model is especially good for disk. But it's not going to capture the finesse of all these other levels, and it's a little bit annoying to write algorithms in this way, explicitly reading and writing blocks. Today I will not write any such algorithms, although you could think about them. I personally love this other model, which is cache-obliviousness. It's going to lead to, in some sense, cleaner algorithms-- although it's more of a magic trick to get them to work. Writing the algorithms is very simple; analyzing them is more work. And it will capture, in some sense, all of these levels.

In fact, it is basically exactly this model-- almost the same. We're going to change one thing, which is where the "oblivious" comes from: we're going to say that the algorithm doesn't know the cache parameters. It doesn't know B or M. This is a little weird, and we're going to have to make some other changes to make it work. From an analysis perspective, I want to count memory transfers and analyze my algorithm with respect to this memory hierarchy, but the algorithm itself isn't allowed to know what that memory hierarchy looks like.
Another way to say this is that the algorithm has to work simultaneously for all values of B and all values of M. As you might imagine, this is not so easy. But there are some simple things where it's easy, and more complicated things where it's possible. And it gives you all sorts of cool things.

Let me first formalize the model a little bit. The other nice thing about cache-oblivious algorithms is that they correspond much more closely to how these caches actually work. When you write code on your CPU, you may have noticed you don't usually do block reads and block writes, unless you're dealing with flash or disk. All of this is taken care of for you; it's all done internal to the processor. When you access a word, behind the scenes, magically, the system-- the computer-- figures out which block to read, moves the entire block into a higher-level cache, and then just serves you words out of that block. You don't have explicit control over that.

So here's the way that works. When you access a word in memory-- and I'm going to think of memory as everything; this is what's stored on the disk, say, the entire memory system, the entire memory hierarchy. As usual in this class, we're going to think of the entire memory as a giant array of words; each of these squares is one word. But now the memory is also divided into blocks. Let's say every four words is a block boundary-- B equals 4, just for the sake of drawing a figure. When you access a single word, like this one, you get the entire block containing that word. And to emphasize: it's not you personally; the system somehow fetches the block containing that word. It has to do this automatically-- we can't explicitly read and write blocks in this model, because we don't know how big the blocks are, so we couldn't even name them.
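(Concretely-- a toy illustration of the alignment in the figure, with B = 4 as drawn; remember, the algorithm itself never gets to see this number:)

```python
B = 4  # block size in words: known to the system, hidden from the algorithm

def block_containing(i):
    # Accessing word i fetches the whole aligned block around it:
    # words [start, start + B), where start is i rounded down to a
    # multiple of B.
    start = (i // B) * B
    return list(range(start, start + B))

print(block_containing(6))  # [4, 5, 6, 7] -- touching word 6 drags in words 4..7
```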
But internally, on the real system and in your analysis, you think of it this way: whenever you touch something, you actually get all of this into the cache. So you hope that you will use things nearby, because you've already read them in; ideally, they're useful. But you don't know how many you've read in-- you've read in B of them, and you don't know what B is. The algorithm doesn't know.

One more detail: the cache is going to get full pretty quickly, and then, whenever you read something, you have to kick something out. In steady state, the cache might as well always stay full-- no reason to leave anything empty. So which block do you kick out? Any suggestions? Which block should I kick out, if I've been reading and writing to words within these blocks? Yeah?

AUDIENCE: [INAUDIBLE]

ERIK DEMAINE: The block that was fetched farthest in the past? Yeah, that is usually called First In, First Out-- FIFO. And that is a good strategy. Any other suggestions? Yeah.

AUDIENCE: [INAUDIBLE]

ERIK DEMAINE: The block that has been least recently used. So maybe you fetched it a long time ago, but you use it every clock cycle-- that one you should probably not throw away, because you use it a lot. That's called LRU, and that is also a good strategy. Other suggestions? Those are two good ones. If you go beyond that, I'm worried I won't know. But there are some bad strategies. Yeah?

AUDIENCE: Just random.

ERIK DEMAINE: Random-- yeah, random is probably pretty good. I don't know offhand. There are some randomized strategies that beat both of those. But from this perspective, both of these are good. We've got lots of Frisbees to go through, so-- that's a good answer. Random is definitely a good idea. I know there's a randomized strategy called [? BIT ?] that in certain senses is a little bit better.
But from my perspective, I think all of those are good. Random, I'd have to double-check whether you lose a log factor; in expectation it should be fine. So all of those strategies will work; you could define this model with any of them, and I think it would work fine-- except with random you'd get a bound in expectation.

So the system evicts, let's say, the least recently used page. The least recently loaded page would also work fine; that's FIFO. Sorry, I'm switching to "page," but I've been calling them blocks-- blocks and pages are the same thing for this lecture. Either at the end of this lecture or at the beginning of the next, I'll tell you why that's an OK assumption; let's not worry about it at this point.

So now we have a model: cache-oblivious. We have two models, actually. But now that the cache-oblivious model is complete, we're going to analyze things in it. Again, we're still counting the number of memory transfers; the algorithm is just not allowed to know B and M, and so we had to change the model to make the reading and writing of blocks automatic-- because the algorithm's not allowed to do it, so someone's got to.

The cool thing about the cache-oblivious model is that every algorithm you've seen in this class-- or most of the algorithms you've seen-- are, in a certain sense, cache-oblivious algorithms. They weren't aware of B and M before; they're still not. What changes is that now you can analyze them in this new way, in this new model. Now, as I said, almost all the algorithms we've seen are not going to perform well in this model. But that makes things interesting, and that's why we have some work to do.

I have some reasons for cache obliviousness-- why would you tie your hands behind your back and not know B or M? Reason one: it's cool. I think it's pretty amazing that you can actually do this. I guess reason two is that you can actually do it, for a lot of problems we care about: cache-oblivious algorithms exist that are just as good.
So, I mean, of course cache-oblivious algorithms exist; the point is that there are ones that are optimal-- within a constant factor of the best algorithm that knows B and M. That's surprising; that's the cool part.

In general, the algorithms are easier to write down, because we can use pseudocode just like before; we don't need to worry about blocking in the algorithm. The analysis is going to be harder, but that's unavoidable. In some sense, this makes it easier to write code. It's also a little easier to distribute your code, because every computer has different block sizes that matter. And as you change your value of N, a different level in the memory hierarchy is going to matter. Each of these levels-- I didn't mention-- has a different block size and, of course, a different cache size, so tuning your code every time to a different B or M is annoying.

The big gain here, though, I think, is that you capture the entire hierarchy, in a sense. In the real world, each of these pipes has its own latency-- let's just think about latency. And you'd like to minimize the number of block transfers between here and here, and the number of block transfers between here and here. Well, OK, I can't minimize all of them; that's a multidimensional problem. What I'd like to minimize is some weighted average of those things: latency times number of blocks here, plus latency times number of blocks here, plus latency times number of blocks here, and so on.

If you can find an optimal cache-oblivious algorithm, analyzed just with respect to two levels, then, because the algorithm's not allowed to know B and M, it has to work for all levels. It has to minimize the number of block transfers between all of these levels, and so, in particular, it will minimize the weighted sum of them. It's a bit hand-wavy-- you have to prove something there-- but you can prove it.
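(To make the two-level model concrete, here's a toy simulator of it-- my own sketch, not anything from the lecture: a fully associative cache of M/B blocks with LRU eviction, counting memory transfers. Writebacks are ignored; counting them too would change things by at most a constant factor.)

```python
from collections import OrderedDict

class IdealCache:
    """Toy (B, M) model: the cache holds M/B blocks, evicts the least
    recently used one, and we count block transfers from "disk"."""

    def __init__(self, B, M):
        self.B = B
        self.num_blocks = M // B
        self.resident = OrderedDict()  # block id -> None, LRU first
        self.transfers = 0

    def access(self, addr):
        block = addr // self.B                # block containing this word
        if block in self.resident:
            self.resident.move_to_end(block)  # hit: free, just refresh recency
        else:
            self.transfers += 1               # miss: one memory transfer
            if len(self.resident) >= self.num_blocks:
                self.resident.popitem(last=False)  # evict least recently used
            self.resident[block] = None

# Scanning 10**6 consecutive words with B = 1024 words per block:
c = IdealCache(B=1024, M=2**20)
for i in range(10**6):
    c.access(i)
print(c.transfers)  # 977, i.e. about N/B
```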
So there's a paper about this from 1999, by Frigo, Leiserson, Prokop, and Ramachandran. It's old enough that I remember all the names; after about 2001, when I became a professor, I can't remember anything, but before that, I can remember everything. Frigo we've talked about in the context of FFTW, the Fastest Fourier Transform in the West-- he was a student here, and FFTW uses a cache-oblivious Fast Fourier Transform algorithm. Leiserson you've probably seen on the cover of your textbook, or walking around Stata-- Professor Leiserson, here at MIT. And Prokop-- this is actually his MEng thesis. Pretty awesome MEng thesis.

All right, so, cool-- I think I've said all the things I wanted to say. If you want to see the proof that you can handle the entire memory hierarchy, you can read their paper. You have to make a couple of assumptions, but it's intuitive: a cache-oblivious algorithm has to work for all B and M, so it's going to optimize all the levels simultaneously. Doing that explicitly, with all the different B's and M's, would be really messy code-- probably also slower. Cache-oblivious is just going to do it for free, with the same code.

All right, let's do some algorithms. There's one easy algorithm which works great from a cache-oblivious perspective, which is scanning. Let me give you some Python code. For historical reasons, in this field, N is written with a capital letter-- don't ask, or don't worry about it. So here's some very simple code. Suppose you want to accumulate an array: you want to add up all of the items in the array, or multiply them, or take the min, or whatever. This is a typical kind of thing. And the array, we're going to think of-- so here was my memory. We're going to think of the array as being stored as some contiguous segment of that memory, let's say this segment. So this is important.
Assume the array is stored contiguously-- no holes-- relative to how it's mapped onto memory. And this is a realistic assumption: when you allocate a block of memory, the promise by the system is that it's essentially a contiguous chunk of memory, or disk, or whatever. And when Python makes an array, it does this; it guarantees that these things will be stored contiguously. If you used a dictionary, this would not be true; but for a regular array or list, this is true.

So I'm accessing the items in the array in order: I start at item zero and end at item N minus 1. That seems good, because I read this one, and I get the whole block. Then I read this one-- I already had that block, so it's free. This one's free, this one's free. Here I have to read a new block, but then this one's free. So the first item I access in each block costs one, but as long as my cache stores at least one block, that's enough-- and let's say the sum is in a register. So the cost is going to be-- actually, let's be a little more precise-- ceiling of N over B, almost. Without a big O here, this is right in the external-memory model, but not quite right in the cache-oblivious model. Can someone tell me why? Yeah?

AUDIENCE: If N is two, you could have it span a block boundary [INAUDIBLE].

ERIK DEMAINE: Good: N could be two, but the array could span a block boundary. Remember, the algorithm has no idea where the block boundaries are. And again, in reality, there are block boundaries all over the place, and there's no way to know; you can't request that, when you allocate an array, it always begins at a block boundary. So, great, you can span block boundaries in-- oh, way off. I just spanned a block boundary, sorry. So it's going to be, at most, ceiling of N over B, plus 1, cache-obliviously.
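(The Python on the board isn't captured in the transcript; a minimal version of the accumulation loop presumably looks like this:)

```python
def summation(A, N):
    # One left-to-right scan: each block of A is read exactly once, so
    # this costs at most ceil(N/B) + 1 memory transfers -- for every B
    # and M at once, since the code never mentions either.
    s = 0
    for i in range(N):
        s += A[i]
    return s
```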
So it's just going to hurt you by one. But I want to point out that there's a slight difference between the two models, even with this very simple algorithm. In general, I'm just going to think of this as big O of N over B, plus 1. There's some additive constant. I guess you could even say it's N over B plus big O of 1, but we won't worry about constant factors today. So that's scanning-- cache-oblivious and external-memory, both great. Slightly more interesting--

AUDIENCE: [INAUDIBLE]?

ERIK DEMAINE: Yeah. In the external-memory algorithm, because you're explicitly controlling the blocks-- you're explicitly reading and writing them, and you know where the block boundaries are-- you could, if you wanted to (you don't have to), choose the array to be aligned, to start at a block boundary. So that's the distinction. In the cache-oblivious model, you can't control that, so you have to worry about the worst case. In external memory, you could control it, and you could do better, and maybe you'd want to; otherwise it hurts you by a constant factor. And with disks, for example, you want things to be track-aligned, because if you have to go to an adjacent track, it's a lot more expensive-- you've got to move the head. A track is a circle: what you can read without moving the head. So, great.

So, slightly more interesting: you can do a constant number of parallel scans. That was one scan; here's an example of two scans. Again, we have one array of size N-- in Python notation, A[0:N] would be the whole thing. And what I want to do is swap A[i] with A[N - 1 - i]-- this is not Python, but it's, I think, textbook notation. You know what swap means. What does this do, assuming I got my minus ones right? Yeah?

AUDIENCE: It reverses the array.

ERIK DEMAINE: It reverses the array, good. We'll just run through these Frisbees. So this is a very simple algorithm for reversing the array.
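(The swap loop, written out-- a minimal sketch of what's on the board:)

```python
def reverse(A, N):
    # Two scans moving inward from both ends of A.  As long as the
    # cache holds at least two blocks, this costs O(N/B + 1) memory
    # transfers: each end walks through its blocks once.
    for i in range(N // 2):
        A[i], A[N - 1 - i] = A[N - 1 - i], A[i]
```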
It was originally by John Bentley, who was Charles Leiserson's adviser-- PhD adviser-- back in the day. So, very simple. But what's cool about it: if you think about the array and the order in which you're accessing things, it's like I have two fingers-- and I should have made this smaller, so here, we'll go down here. I start at the very beginning of the array and the very end of the array; then I go to the second element and the next-to-last element; and I advance like this.

So as long as your cache-- the number of blocks in the cache-- is at least two, which is totally reasonable (you can assume it's at least 100, typically; you've got at least 100 blocks, say)... In general, for any fixed constant, we're going to assume M over B is bigger than a constant; we'll only need two or three or four for the algorithms we cover. Then, great: when I access this item, I load in the block that contains it. I don't know how it's aligned, but I don't care so much. Then I load in the block that contains this item. And then the next accesses are free, until I advance to the next block. But once I advance to the next block, on the left or the right, I'll never have to access the old ones. And so, again, the cost here is just going to be the number of blocks, which is big O of N over B, plus 1.

So a constant number of parallel scans is going to cost basically the number of blocks in the array. If N is smaller than B, this is not so hot; but when N is bigger than B, this is just N over B, which is how much it takes just to read the data-- big deal.

So these are boring cache-oblivious algorithms; let's do interesting ones. I would say the central idea in cache-oblivious algorithms is to use divide and conquer. This goes back to the first few lectures of this class, and so we will go back to examples from there. Today we're going to do median finding, in particular, which we did in lecture two-- so really a blast from the past.
788 00:40:54,620 --> 00:40:57,040 But it's good review because the final covers everything, 789 00:40:57,040 --> 00:40:59,570 so you've got to remember that. 790 00:40:59,570 --> 00:41:02,430 Matrix multiplication, we've talked about, but not 791 00:41:02,430 --> 00:41:06,920 usually-- well, I guess we did actually use divide and conquer 792 00:41:06,920 --> 00:41:08,440 for Strassen's algorithm. 793 00:41:08,440 --> 00:41:11,310 We're going to use divide and conquer even for the boring algorithm 794 00:41:11,310 --> 00:41:12,139 today. 795 00:41:12,139 --> 00:41:14,680 And then next class, we're going to go back to van Emde Boas, 796 00:41:14,680 --> 00:41:16,150 but in a completely different way. 797 00:41:16,150 --> 00:41:18,280 So if you don't like van Emde Boas, 798 00:41:18,280 --> 00:41:21,800 don't worry; it's much simpler. 799 00:41:21,800 --> 00:41:24,930 So let's do median finding. 800 00:41:24,930 --> 00:41:29,808 Or actually, sorry, let me first talk about divide 801 00:41:29,808 --> 00:41:33,860 and conquer in general. 802 00:41:33,860 --> 00:41:35,360 You know what divide and conquer is. 803 00:41:35,360 --> 00:41:36,600 You take your problem. 804 00:41:36,600 --> 00:41:39,390 You split it into non-overlapping subproblems, 805 00:41:39,390 --> 00:41:42,665 recursively solve them, combine them. 806 00:41:42,665 --> 00:41:44,040 But what I want to stress here is 807 00:41:44,040 --> 00:41:47,110 what it's going to look like in a cache oblivious context. 808 00:41:47,110 --> 00:41:51,598 So the algorithm is going to look like a regular divide 809 00:41:51,598 --> 00:41:53,380 and conquer algorithm. 810 00:41:53,380 --> 00:42:01,800 So, in particular, the algorithm will recurse all the way to, 811 00:42:01,800 --> 00:42:05,090 let's say, constant size problems, 812 00:42:05,090 --> 00:42:11,900 whatever the base case is. 813 00:42:11,900 --> 00:42:18,410 So same as usual, but what's different is the analysis. 814 00:42:18,410 --> 00:42:23,610 When we analyze a cache oblivious algorithm, 815 00:42:23,610 --> 00:42:25,274 then we get to know what B and M are. 816 00:42:25,274 --> 00:42:27,190 In some sense, we're analyzing for all B and M. 817 00:42:27,190 --> 00:42:29,669 But let's suppose B and M are given to us; then 818 00:42:29,669 --> 00:42:32,451 we'll tell you how many memory transfers you need. 819 00:42:32,451 --> 00:42:33,950 This kind of bound, you need to know 820 00:42:33,950 --> 00:42:37,110 what B is to know what the value of this bound is. 821 00:42:37,110 --> 00:42:39,740 But you learn it as a function of B and, in general, 822 00:42:39,740 --> 00:42:41,500 a function of B and M, and that's 823 00:42:41,500 --> 00:42:46,390 the best you could hope for as a complete characterization. 824 00:42:46,390 --> 00:42:49,470 So in the analysis, let's just look at one value of B 825 00:42:49,470 --> 00:42:57,540 and one value of M. So the analysis knows B and M, 826 00:42:57,540 --> 00:43:11,620 and it's going to look at, let's say, the recursive level, 827 00:43:11,620 --> 00:43:15,230 where one of two things happens. 828 00:43:15,230 --> 00:43:28,660 Either the problem size fits in order one blocks, 829 00:43:28,660 --> 00:43:32,790 meaning it's order B size. 830 00:43:32,790 --> 00:43:34,690 That's an interesting level. 831 00:43:34,690 --> 00:43:39,650 Another interesting level, the more obvious one probably, 832 00:43:39,650 --> 00:43:44,045 is that it fits in cache.
833 00:43:44,045 --> 00:43:46,545 So that means that the size is less than or equal to capital 834 00:43:46,545 --> 00:43:52,870 M. Everything here is counted in terms of words. 835 00:43:52,870 --> 00:43:54,200 This is the more obvious one. 836 00:43:54,200 --> 00:43:57,247 For a lot of problems, the cache size isn't so relevant. 837 00:43:57,247 --> 00:43:58,830 What really matters is the block size. 838 00:43:58,830 --> 00:44:01,430 For example, scanning, you're only looking through the data 839 00:44:01,430 --> 00:44:01,930 once. 840 00:44:01,930 --> 00:44:03,721 So it doesn't matter how big your cache is, 841 00:44:03,721 --> 00:44:05,970 as long as it's not super tiny. 842 00:44:05,970 --> 00:44:08,420 As long as it has a few blocks, then 843 00:44:08,420 --> 00:44:13,360 it's just a function of B and N, no M involved. 844 00:44:13,360 --> 00:44:15,500 So for that kind of problem this would 845 00:44:15,500 --> 00:44:19,800 be more useful-- constant number of blocks. 846 00:44:19,800 --> 00:44:22,740 Because I think of the cache size M as being larger 847 00:44:22,740 --> 00:44:27,140 than any constant times B, this case is strictly smaller-- 848 00:44:27,140 --> 00:44:30,870 it's smaller than or equal to the problem fitting in cache. 849 00:44:30,870 --> 00:44:32,780 So when M is relevant, we'll look 850 00:44:32,780 --> 00:44:35,600 at this level and maybe the adjacent levels 851 00:44:35,600 --> 00:44:37,580 in the recursion. 852 00:44:37,580 --> 00:44:40,150 So the algorithm doesn't know what B and M are, so it's 853 00:44:40,150 --> 00:44:42,610 got to recurse all the way down-- turtles 854 00:44:42,610 --> 00:44:44,040 all the way down. 855 00:44:44,040 --> 00:44:45,890 But in the analysis, because we're only 856 00:44:45,890 --> 00:44:47,780 thinking about one value of B and M at a time, 857 00:44:47,780 --> 00:44:49,830 we can afford to just consider that one level, 858 00:44:49,830 --> 00:44:51,496 and that will be like the critical place 859 00:44:51,496 --> 00:44:52,599 where all the cost is. 860 00:44:52,599 --> 00:44:55,140 Because once things fit in cache and you've loaded things in, 861 00:44:55,140 --> 00:44:56,570 the cost will be zero. 862 00:44:56,570 --> 00:44:58,899 So below that, the base case is kind of trivial. 863 00:44:58,899 --> 00:45:00,440 So basically what this is going to do 864 00:45:00,440 --> 00:45:02,410 is make our base cases larger. 865 00:45:02,410 --> 00:45:04,510 Instead of our base case being constant, 866 00:45:04,510 --> 00:45:11,650 it's going to be order B or M. 867 00:45:11,650 --> 00:45:21,839 What don't I need? 868 00:45:21,839 --> 00:45:45,420 So now let's go on to median finding. 869 00:45:45,420 --> 00:45:47,990 Median finding, you're given an unsorted array. 870 00:45:47,990 --> 00:45:50,560 You want to find the median. 871 00:45:50,560 --> 00:45:55,440 And in lecture two, we had a linear time 872 00:45:55,440 --> 00:46:00,900 worst case algorithm for this. 873 00:46:00,900 --> 00:46:04,230 And so my goal today is to make it this running time. 874 00:46:04,230 --> 00:46:05,890 This is what you might call linear time 875 00:46:05,890 --> 00:46:08,360 in the cache oblivious model because that's how long it 876 00:46:08,360 --> 00:46:12,850 takes just to read the data. 877 00:46:12,850 --> 00:46:15,680 It turns out basically the same algorithm works. 878 00:46:15,680 --> 00:46:17,810 First, you've got to remember the algorithm. 879 00:46:17,810 --> 00:46:20,510 So let me write it down quickly.
880 00:46:20,510 --> 00:46:25,250 This is the sort of 5 by N over 5 array. 881 00:46:25,250 --> 00:46:29,816 So think of the array as being partitioned into, I'll 882 00:46:29,816 --> 00:46:35,540 call them, columns of five. 883 00:46:35,540 --> 00:46:42,990 So this picture of five dots by N over 5 dots-- this is 884 00:46:42,990 --> 00:46:44,361 dot, dot, dot. 885 00:46:44,361 --> 00:46:46,240 So this is five. 886 00:46:46,240 --> 00:46:48,025 Now, we didn't talk about it then, 887 00:46:48,025 --> 00:46:50,150 and there's a few different ways you could actually 888 00:46:50,150 --> 00:46:52,710 implement it, but let's say these-- the actual array is 889 00:46:52,710 --> 00:46:53,810 one-dimensional. 890 00:46:53,810 --> 00:46:55,610 Let's say these are the first five items. 891 00:46:55,610 --> 00:46:57,020 These are the next five items. 892 00:46:57,020 --> 00:47:01,610 So, in other words, this matrix is stored column by column. 893 00:47:01,610 --> 00:47:03,150 This is just a conceptual view. 894 00:47:03,150 --> 00:47:05,880 So we can define it either way, however we want. 895 00:47:05,880 --> 00:47:08,070 So I'm going to view it that way. 896 00:47:08,070 --> 00:47:12,090 And then what the rest of the algorithm did was to sort 897 00:47:12,090 --> 00:47:16,150 each column-- it's only five items, 898 00:47:16,150 --> 00:47:19,137 so you can sort each one in constant time. 899 00:47:19,137 --> 00:47:20,720 But, in particular, what we care about 900 00:47:20,720 --> 00:47:24,010 is the median of those five items. 901 00:47:24,010 --> 00:47:32,370 Then we recursively found the median of the medians-- call it x. 902 00:47:32,370 --> 00:47:41,150 This is the step we're going to have to change a little bit. 903 00:47:41,150 --> 00:47:46,350 Then we-- leave a little bit of space. 904 00:47:46,350 --> 00:47:52,580 Then we partition the array by x. 905 00:47:52,580 --> 00:47:55,190 Meaning we split the array into items less than 906 00:47:55,190 --> 00:48:00,350 or equal to x and things greater than x. 907 00:48:00,350 --> 00:48:03,040 We probably assumed there was only one value equal to x, 908 00:48:03,040 --> 00:48:05,160 but it doesn't matter. 909 00:48:05,160 --> 00:48:13,760 And finally, we recurse on one of those two halves. 910 00:48:13,760 --> 00:48:16,369 So this is a pretty crazy divide and conquer algorithm, one 911 00:48:16,369 --> 00:48:17,867 of the more sophisticated ones. 912 00:48:17,867 --> 00:48:19,700 You don't need to know all the details here, 913 00:48:19,700 --> 00:48:22,980 just that it worked and it ran in linear time. 914 00:48:22,980 --> 00:48:26,030 What's crazy about it is there are two recursive calls. 915 00:48:26,030 --> 00:48:27,460 Usually, like in merge sort, where 916 00:48:27,460 --> 00:48:30,090 you do two recursive calls and spend linear time 917 00:48:30,090 --> 00:48:32,300 to do the stuff, like this partition, 918 00:48:32,300 --> 00:48:34,470 you get n log n time, like merge sort. 919 00:48:34,470 --> 00:48:37,570 Here, because this array is a lot smaller, 920 00:48:37,570 --> 00:48:39,690 this is a size N over 5. 921 00:48:39,690 --> 00:48:41,610 And this one was reasonably small; 922 00:48:41,610 --> 00:48:48,800 it was like 7/10 N. Because 7/10 plus 1/5 923 00:48:48,800 --> 00:48:52,730 is strictly less than 1, this ends up being 924 00:48:52,730 --> 00:48:54,480 linear time instead of n log n. 925 00:48:54,480 --> 00:48:56,820 That's just review.
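Here is a Python sketch of those five steps. The medians are written out to a contiguous list, which is what the cache-oblivious analysis below will require; the list-based partition, rather than the in-place CLRS one, is an assumption for illustration:

    def select(A, k):
        # Return the k-th smallest element of A (0-indexed); the median
        # is select(A, len(A) // 2). Worst-case linear time.
        if len(A) <= 5:
            return sorted(A)[k]
        # Steps 1-2: view A as columns of five, sort each little column,
        # and write its median out to a contiguous array of medians.
        medians = [sorted(A[i:i + 5])[len(A[i:i + 5]) // 2]
                   for i in range(0, len(A), 5)]
        # Step 3: recursively find the median of the medians.
        x = select(medians, len(medians) // 2)
        # Step 4: partition by x, using plain scans.
        less = [a for a in A if a < x]
        equal = [a for a in A if a == x]
        greater = [a for a in A if a > x]
        # Step 5: recurse on the side containing the k-th element.
        if k < len(less):
            return select(less, k)
        if k < len(less) + len(equal):
            return x
        return select(greater, k - len(less) - len(equal))

    # select([7, 1, 5, 3, 9, 2, 8], 3) returns 5, the median.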
926 00:48:56,820 --> 00:49:05,270 Now, what I'd like to do is the same thing, same analysis, 927 00:49:05,270 --> 00:49:08,490 or same algorithm, but now I want to analyze it 928 00:49:08,490 --> 00:49:10,410 in this two-level model. 929 00:49:10,410 --> 00:49:28,740 So actually, I will erase this board. 930 00:49:28,740 --> 00:49:33,820 So now my array has been partitioned into blocks 931 00:49:33,820 --> 00:49:36,414 of size B, like this picture. 932 00:49:36,414 --> 00:49:37,580 In fact, it's quite similar. 933 00:49:37,580 --> 00:49:39,730 Here, we're partitioning things into blocks, 934 00:49:39,730 --> 00:49:40,670 but they're size five. 935 00:49:40,670 --> 00:49:41,620 That's different. 936 00:49:41,620 --> 00:49:44,990 Now someone has partitioned my array into blocks of size B. 937 00:49:44,990 --> 00:49:46,840 I need to count how many things I access. 938 00:49:46,840 --> 00:49:49,720 Well, let's just look line by line at this code 939 00:49:49,720 --> 00:49:50,530 and see what we do. 940 00:49:50,530 --> 00:49:52,490 Step one, we do absolutely nothing. 941 00:49:52,490 --> 00:49:56,440 This is a conceptual picture, so zero cost, great. 942 00:49:56,440 --> 00:50:00,600 Step one is zero, my favorite answer. 943 00:50:00,600 --> 00:50:03,630 Step two, we sort each column. 944 00:50:03,630 --> 00:50:04,630 How long does this take? 945 00:50:04,630 --> 00:50:12,230 What am I doing? 946 00:50:12,230 --> 00:50:17,420 It's right above me. 947 00:50:17,420 --> 00:50:18,350 AUDIENCE: N over B. 948 00:50:18,350 --> 00:50:21,024 ERIK DEMAINE: N over B, because this is a scan. 949 00:50:21,024 --> 00:50:22,440 It's a little bit weird of a scan. 950 00:50:22,440 --> 00:50:25,520 We look at five items, and then we 951 00:50:25,520 --> 00:50:28,370 look at the next five items, and then the next five items. 952 00:50:28,370 --> 00:50:29,550 But it's basically a scan. 953 00:50:29,550 --> 00:50:32,559 You could think of it as almost five parallel scans, I suppose, 954 00:50:32,559 --> 00:50:34,100 or you could just break into the case 955 00:50:34,100 --> 00:50:37,540 where maybe if B is a constant, then 956 00:50:37,540 --> 00:50:38,790 it doesn't matter what you do. 957 00:50:38,790 --> 00:50:42,681 But if B is bigger than a constant, then reading five items, 958 00:50:42,681 --> 00:50:44,680 those are all probably going to be in one block, 959 00:50:44,680 --> 00:50:47,290 except the ones that straddle the block boundaries. 960 00:50:47,290 --> 00:50:51,086 So in all cases, for step two-- maybe 961 00:50:51,086 --> 00:50:53,600 I should rewrite step one-- zero cost. 962 00:50:53,600 --> 00:51:01,240 Step two is order N over B plus 1, to be careful. 963 00:51:01,240 --> 00:51:03,636 That's a scan. 964 00:51:03,636 --> 00:51:05,010 Actually, it's two parallel scans, 965 00:51:05,010 --> 00:51:09,490 because we have to write out these medians somewhere, 966 00:51:09,490 --> 00:51:10,620 so we'll have two. 967 00:51:10,620 --> 00:51:14,320 Step three is to recursively find the median of the medians. 968 00:51:14,320 --> 00:51:22,140 Now, before, we had T of N is T of N over 5 969 00:51:22,140 --> 00:51:30,110 plus T of 7/10 N plus linear. 970 00:51:30,110 --> 00:51:34,002 In this new world-- this is regular running time. 971 00:51:34,002 --> 00:51:35,460 In this new world, I'm going to use 972 00:51:35,460 --> 00:51:38,640 a different notation for the recurrence, MT of N 973 00:51:38,640 --> 00:51:40,630 for memory transfers.
974 00:51:40,630 --> 00:51:42,490 This is a good old fashioned time, 975 00:51:42,490 --> 00:51:44,900 and this is our new modern notion of time-- 976 00:51:44,900 --> 00:51:47,760 how many block transfers do I need to do for problem size N. 977 00:51:47,760 --> 00:51:54,020 So this is a recursion, and should be MT of N over 5. 978 00:51:54,020 --> 00:52:00,500 But, and this is important, for this 979 00:52:00,500 --> 00:52:02,850 to be a subproblem of the same type, 980 00:52:02,850 --> 00:52:05,750 I need to know that the array that we're recursing on 981 00:52:05,750 --> 00:52:08,660 is stored contiguously. 982 00:52:08,660 --> 00:52:10,800 Before, I didn't need to do that. 983 00:52:10,800 --> 00:52:14,200 I could say, well, let's put the medians in the middle. 984 00:52:14,200 --> 00:52:18,341 So now every fifth item in this array is my new subarray. 985 00:52:18,341 --> 00:52:20,090 And so I could recursively call this thing 986 00:52:20,090 --> 00:52:22,690 and say, OK, here's my array, but really only think 987 00:52:22,690 --> 00:52:23,830 about every fifth item. 988 00:52:23,830 --> 00:52:25,650 That's like a stride in the array. 989 00:52:25,650 --> 00:52:27,460 And then the next recursive level, oh, only 990 00:52:27,460 --> 00:52:28,950 worry about every 25th item. 991 00:52:28,950 --> 00:52:32,260 And every 5 cubed item-- I'm going to stop computing-- 992 00:52:32,260 --> 00:52:33,870 and so on. 993 00:52:33,870 --> 00:52:37,360 And that would be fine for regular running time. 994 00:52:37,360 --> 00:52:39,554 But as my stride gets bigger and bigger, 995 00:52:39,554 --> 00:52:40,970 at some point, every item is going 996 00:52:40,970 --> 00:52:42,130 to be in a different block. 997 00:52:42,130 --> 00:52:43,130 That's bad. 998 00:52:43,130 --> 00:52:44,260 I don't want to do that. 999 00:52:44,260 --> 00:52:48,270 So when I find these medians, or when I recurse, 1000 00:52:48,270 --> 00:52:50,290 I need that the medians that I'm recursing on 1001 00:52:50,290 --> 00:52:51,670 are stored in a contiguous array. 1002 00:52:51,670 --> 00:52:52,690 Now, this is easy to do. 1003 00:52:52,690 --> 00:52:54,023 But we didn't have to do it before. 1004 00:52:54,023 --> 00:52:57,790 That's the key difference. 1005 00:52:57,790 --> 00:53:07,060 Make sure they are stored contiguously. 1006 00:53:07,060 --> 00:53:11,760 I can do that because when I sort each column in one scan, 1007 00:53:11,760 --> 00:53:14,250 I can have a second scan which is the output, which 1008 00:53:14,250 --> 00:53:16,360 is the array of medians. 1009 00:53:16,360 --> 00:53:17,960 So as I'm scanning through the input, 1010 00:53:17,960 --> 00:53:19,290 I'm going to output the median. 1011 00:53:19,290 --> 00:53:21,059 It's going to be 1/5 the size. 1012 00:53:21,059 --> 00:53:22,850 Then I've got all the medians nicely stored 1013 00:53:22,850 --> 00:53:25,070 in a contiguous array. 1014 00:53:25,070 --> 00:53:27,430 So with order one parallel scans, 1015 00:53:27,430 --> 00:53:33,810 same time here, this is actually a legitimate recursive call. 1016 00:53:33,810 --> 00:53:35,120 Then we partition. 1017 00:53:35,120 --> 00:53:42,920 Partition, again, is a bunch of parallel scans, I think, three. 1018 00:53:42,920 --> 00:53:44,640 You've got one reading scan, which 1019 00:53:44,640 --> 00:53:46,190 is you're reading through the array, 1020 00:53:46,190 --> 00:53:47,380 and you've got two writing scans.
1021 00:53:47,380 --> 00:53:49,420 You're writing out the elements less than or equal to x, 1022 00:53:49,420 --> 00:53:51,630 and you're writing out the elements greater than x. 1023 00:53:51,630 --> 00:53:53,050 But again, all of those are scans. 1024 00:53:53,050 --> 00:53:55,120 You're always writing the next element right 1025 00:53:55,120 --> 00:53:56,460 after the previous one. 1026 00:53:56,460 --> 00:53:58,350 So if you already have that block in memory 1027 00:53:58,350 --> 00:54:01,750 and if you assume that the number of blocks in cache 1028 00:54:01,750 --> 00:54:06,910 is at least three, then three parallel scans is fine. 1029 00:54:06,910 --> 00:54:09,190 It's different from the CLRS partition algorithm. 1030 00:54:09,190 --> 00:54:11,260 That one was fancy to be in place. 1031 00:54:11,260 --> 00:54:13,720 We're not trying to be in place or fancy at all. 1032 00:54:13,720 --> 00:54:16,310 Let's just do it with a bunch of scans. 1033 00:54:16,310 --> 00:54:18,524 So now we have two arrays-- the elements less than x, 1034 00:54:18,524 --> 00:54:19,690 the elements greater than x. 1035 00:54:19,690 --> 00:54:22,070 Then we recurse on one of them, and those elements 1036 00:54:22,070 --> 00:54:24,260 are consecutive already, so good. 1037 00:54:24,260 --> 00:54:26,630 This is a regular recursive call. 1038 00:54:26,630 --> 00:54:28,320 Again, we're maintaining the invariant 1039 00:54:28,320 --> 00:54:32,350 that the array is stored contiguously. 1040 00:54:32,350 --> 00:54:37,430 And by the old analysis, that array is sized at most 7/10 N. 1041 00:54:37,430 --> 00:54:45,918 So I get a new recurrence, which is MT of N is MT of N over 5 1042 00:54:45,918 --> 00:54:51,090 plus MT of 7/10 N-- this analysis feels very "empty"-- 1043 00:54:51,090 --> 00:54:57,150 sorry, bad joke-- plus N over B plus 1. 1044 00:54:57,150 --> 00:55:01,010 So basically the same recurrence, but now N over B 1045 00:55:01,010 --> 00:55:03,955 plus 1 for what we're doing here. 1046 00:55:03,955 --> 00:55:05,330 But I had to change the algorithm 1047 00:55:05,330 --> 00:55:07,970 a little bit for this recurrence to be correct, 1048 00:55:07,970 --> 00:55:10,430 for it to correctly reflect the number of memory transfers. 1049 00:55:10,430 --> 00:55:13,920 Now all we need to do is solve the recurrence. 1050 00:55:13,920 --> 00:55:18,100 And actually, in some sense, more importantly, 1051 00:55:18,100 --> 00:55:20,520 we need to figure out what the base case is. 1052 00:55:20,520 --> 00:55:25,630 Because we could say, all right, here's the usual base case. 1053 00:55:25,630 --> 00:55:27,170 If I have a constant sized problem, 1054 00:55:27,170 --> 00:55:29,140 well, that's going to be constant. 1055 00:55:29,140 --> 00:55:32,260 This is our base case for every recurrence we've ever done. 1056 00:55:32,260 --> 00:55:34,220 And that's enough usually. 1057 00:55:34,220 --> 00:55:36,900 It's going to give us a really bad answer here. 1058 00:55:36,900 --> 00:56:01,060 So let's go off to the side here and solve that recurrence. 1059 00:56:01,060 --> 00:56:05,534 So if that's my base case, well, in particular-- so 1060 00:56:05,534 --> 00:56:06,700 this is some recursion tree. 1061 00:56:06,700 --> 00:56:09,650 It's very uneven, so it's kind of annoying to draw. 1062 00:56:09,650 --> 00:56:13,500 But what I know with this base case is that 1063 00:56:13,500 --> 00:56:17,160 this overall MT of N is going to be at least the number 1064 00:56:17,160 --> 00:56:19,880 of leaves in the recursion tree.
1065 00:56:19,880 --> 00:56:22,955 So let's say MT of N is at least L 1066 00:56:22,955 --> 00:56:29,300 of N, the number of leaves in the recursion. 1067 00:56:29,300 --> 00:56:31,520 So this is really, if I run the algorithm, 1068 00:56:31,520 --> 00:56:35,630 how many base cases of constant size do I get? 1069 00:56:35,630 --> 00:56:46,050 And that satisfies-- so it's not obvious what that is. 1070 00:56:46,050 --> 00:56:47,199 There's no plus here. 1071 00:56:47,199 --> 00:56:49,490 The number of leaves is just how many leaves are over here, 1072 00:56:49,490 --> 00:56:52,560 plus how many leaves are over here, and L of 1 equals 1, say, 1073 00:56:52,560 --> 00:56:55,120 or some constant equals constant. 1074 00:56:55,120 --> 00:56:59,440 I happen to know, because I've seen lots of recurrences, 1075 00:56:59,440 --> 00:57:03,260 this solves to some N to the alpha. 1076 00:57:03,260 --> 00:57:08,730 I claim that L of N is N to the alpha for some constant alpha. 1077 00:57:08,730 --> 00:57:09,440 Why? 1078 00:57:09,440 --> 00:57:11,650 I'll just prove that it works. 1079 00:57:11,650 --> 00:57:16,250 So this is now N over 5 to the alpha, 1080 00:57:16,250 --> 00:57:19,950 and this is 7/10 N to the alpha. 1081 00:57:19,950 --> 00:57:24,640 If it's going to work, this recurrence should be satisfied. 1082 00:57:24,640 --> 00:57:26,830 And now, if you look at this equation, 1083 00:57:26,830 --> 00:57:30,080 there's a lot of N to the alphas, and they all cancel. 1084 00:57:30,080 --> 00:57:37,305 So I get 1 equals 1/5 to the alpha plus 7/10 to the alpha. 1085 00:57:37,305 --> 00:57:38,680 It's confusing because I was just 1086 00:57:38,680 --> 00:57:42,930 watching the TV show Alphas, but no relation. 1087 00:57:42,930 --> 00:57:45,370 So this is now something purely in terms of alpha. 1088 00:57:45,370 --> 00:57:47,580 You just need to check that there is a real solution. 1089 00:57:47,580 --> 00:57:48,440 There is one. 1090 00:57:48,440 --> 00:57:51,630 You have to plug it into Wolfram Alpha or something, 1091 00:57:51,630 --> 00:57:53,000 no pun intended. 1092 00:57:53,000 --> 00:57:55,967 Wow, they're just coming out today. 1093 00:57:55,967 --> 00:57:58,947 And then alpha is... next page... 1094 00:57:58,947 --> 00:58:01,947 I can't do this by hand. 1095 00:58:01,947 --> 00:58:08,207 Something like .83978. 1096 00:58:08,207 --> 00:58:15,487 So we get L of N is, say, at least N to the 0.8, or bigger. 1097 00:58:15,487 --> 00:58:21,247 It's sublinear, and that was enough when we cared about time. 1098 00:58:21,247 --> 00:58:23,687 But now it's bad news, because 1099 00:58:23,687 --> 00:58:29,087 our goal was to get N over B plus 1. 1100 00:58:29,087 --> 00:58:33,507 If B is huge, if B is bigger than N to the 0.2, 1101 00:58:33,507 --> 00:58:35,843 then we are not achieving this bound. 1102 00:58:35,843 --> 00:58:36,343 Right? 1103 00:58:36,343 --> 00:58:38,947 We're always paying at least N to the 0.8. 1104 00:58:38,947 --> 00:58:43,247 For example, if B is roughly N, we're way off! 1105 00:58:43,247 --> 00:58:45,247 But that's because we used the wrong base case. 1106 00:58:45,247 --> 00:58:49,360 Turns out if you use a better base case, things just work. 1107 00:58:49,360 --> 00:58:51,024 So let's do that. 1108 00:58:51,024 --> 00:58:53,940 I think it's going to be smaller. 1109 00:58:53,940 --> 00:58:55,912 So... the next base case... 1110 00:58:55,912 --> 00:58:56,616 I mean...
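A quick numeric check, by bisection, that 1 = (1/5)^alpha + (7/10)^alpha has its root near 0.8398 (illustrative; any root finder would do):

    def f(alpha):
        # Decreasing in alpha: f(0) = 1 > 0 and f(1) = -0.1 < 0,
        # so there is exactly one root in (0, 1).
        return (1 / 5) ** alpha + (7 / 10) ** alpha - 1

    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid  # the root lies to the right of mid
        else:
            hi = mid
    print(lo)  # about 0.83978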
1111 00:58:56,616 --> 00:58:58,532 When you're doing cache-oblivious analysis, 1112 00:58:58,532 --> 00:58:59,760 you never use this base case. 1113 00:58:59,760 --> 00:59:03,100 The first one you should think about is this one. 1114 00:59:03,100 --> 00:59:04,812 If you have a problem of a size that 1115 00:59:04,812 --> 00:59:06,320 fits in a constant number of blocks-- 1116 00:59:06,320 --> 00:59:08,518 well, of course, that's going to take-- 1117 00:59:08,518 --> 00:59:10,684 once they are read into the cache, 1118 00:59:10,684 --> 00:59:12,100 you're not going to pay anything. 1119 00:59:12,100 --> 00:59:14,524 How long does it take to read a constant number of blocks 1120 00:59:14,524 --> 00:59:15,024 into cache? 1121 00:59:15,024 --> 00:59:16,940 A constant number of memory transfers. 1122 00:59:16,940 --> 00:59:19,415 OK, this is obviously a strictly better base case 1123 00:59:19,415 --> 00:59:20,240 than this one. 1124 00:59:20,240 --> 00:59:23,090 Because we have the same thing on the right hand 1125 00:59:23,090 --> 00:59:26,240 side, a constant, but we've solved a larger problem. 1126 00:59:26,240 --> 00:59:29,600 So clearly you should cut here, instead of there. 1127 00:59:29,600 --> 00:59:34,740 Then the number of leaves in this recursion... 1128 00:59:34,740 --> 00:59:38,998 So same recurrence, different base case. 1129 00:59:38,998 --> 00:59:41,164 So we stop recursing conceptually in the analysis-- 1130 00:59:41,164 --> 00:59:42,980 the algorithm goes all the way down, 1131 00:59:42,980 --> 00:59:45,092 but in the analysis we stop recursing when 1132 00:59:45,092 --> 00:59:46,940 we reach a problem of size B. 1133 00:59:46,940 --> 00:59:54,168 The number of leaves in that new recursion tree will be N over B 1134 00:59:54,168 --> 00:59:55,900 to the alpha. 1135 00:59:55,900 --> 00:59:56,870 That's good! 1136 00:59:56,870 --> 00:59:59,780 That's smaller than N over B. 1137 00:59:59,780 --> 01:00:02,380 OK, now I'm going to wave my hands a little bit 1138 01:00:02,380 --> 01:00:10,780 and say, MT of N is order N over B plus 1. 1139 01:00:10,780 --> 01:00:13,090 I guess to do that, you want to prove it 1140 01:00:13,090 --> 01:00:16,390 the same way we did before when we solved this recurrence, 1141 01:00:16,390 --> 01:00:17,770 which is by substitution. 1142 01:00:17,770 --> 01:00:19,900 You assume this is true, you plug it in, 1143 01:00:19,900 --> 01:00:22,750 verify it can actually be done with some constants. 1144 01:00:22,750 --> 01:00:25,750 The intuition of what's going on is, in general, this recurrence 1145 01:00:25,750 --> 01:00:27,550 is dominated by the root. 1146 01:00:27,550 --> 01:00:31,759 The root cost for this recursion is N over B plus 1. 1147 01:00:31,759 --> 01:00:32,800 So this is the root cost. 1148 01:00:32,800 --> 01:00:34,630 I claim that, up to constant factors, 1149 01:00:34,630 --> 01:00:35,920 that is the overall cost. 1150 01:00:35,920 --> 01:00:38,440 Roughly because, as you go down the recursion tree, 1151 01:00:38,440 --> 01:00:41,431 the cost is decreasing geometrically. 1152 01:00:41,431 --> 01:00:43,180 But that's not obvious for this recurrence 1153 01:00:43,180 --> 01:00:44,694 because it's so uneven. 1154 01:00:44,694 --> 01:00:47,110 But it's kind of like the master method, a little fancier. 1155 01:00:47,110 --> 01:00:52,230 Intuitively, this should be obvious. 1156 01:00:52,230 --> 01:00:54,490 There's the root cost and then there's the other ones.
1157 01:00:54,490 --> 01:00:56,990 But to actually prove it, you should do the substitution method. 1158 01:00:56,990 --> 01:01:02,320 I want to go on to more interesting algorithms instead, 1159 01:01:02,320 --> 01:01:07,610 but any questions before we continue? 1160 01:01:07,610 --> 01:01:08,150 All right. 1161 01:01:08,150 --> 01:01:11,390 So next algorithm, that was median, now 1162 01:01:11,390 --> 01:01:18,670 we're going to do matrix multiplication via divide 1163 01:01:18,670 --> 01:01:32,620 and conquer. 1164 01:01:32,620 --> 01:01:34,060 So what we just saw was an example 1165 01:01:34,060 --> 01:01:37,464 where, in divide and conquer, in the analysis 1166 01:01:37,464 --> 01:01:39,130 we think about the case where things fit 1167 01:01:39,130 --> 01:01:40,810 in a constant number of blocks. 1168 01:01:40,810 --> 01:01:42,280 That was sort of case one. 1169 01:01:42,280 --> 01:01:44,372 The next example, matrix multiplication, 1170 01:01:44,372 --> 01:01:45,330 will be the other case. 1171 01:01:45,330 --> 01:01:53,650 So you get to see both types. 1172 01:01:53,650 --> 01:01:55,890 So multiplying matrices, something 1173 01:01:55,890 --> 01:01:57,270 we've done many times. 1174 01:01:57,270 --> 01:02:01,020 For example, in the FFT lecture and in Strassen's 1175 01:02:01,020 --> 01:02:03,764 algorithm, just to remind you. 1176 01:02:03,764 --> 01:02:05,430 I'm just thinking about the square case, 1177 01:02:05,430 --> 01:02:07,860 although this generalizes. 1178 01:02:07,860 --> 01:02:16,140 We have two square matrices, N by N. 1179 01:02:16,140 --> 01:02:18,420 Normally, I would say C equals A times B, 1180 01:02:18,420 --> 01:02:20,970 but I realized we used B for block size. 1181 01:02:20,970 --> 01:02:26,125 So this is going to be z equals x times y. 1182 01:02:26,125 --> 01:02:31,020 Hopefully that doesn't conflict with anything else, but no B's. 1183 01:02:31,020 --> 01:02:33,510 All right, so standard matrix multiplication. 1184 01:02:33,510 --> 01:02:42,090 Let's start with the standard algorithm. 1185 01:02:42,090 --> 01:02:43,920 Let's start by analyzing that. 1186 01:02:43,920 --> 01:02:46,770 Because if you're reasonably clever, 1187 01:02:46,770 --> 01:02:50,740 the standard algorithm is not so bad. 1188 01:02:50,740 --> 01:02:54,010 So in general, this won't matter too much. 1189 01:02:54,010 --> 01:02:57,150 Let's suppose we're computing z row by row, 1190 01:02:57,150 --> 01:03:03,150 and let's say we're currently computing this product cell. 1191 01:03:03,150 --> 01:03:06,780 So that product cell-- 1192 01:03:06,780 --> 01:03:15,410 this z i,j here-- is the dot product of this row with this column. 1193 01:03:15,410 --> 01:03:17,370 How do I compute dot products? 1194 01:03:17,370 --> 01:03:18,190 Two parallel scans. 1195 01:03:18,190 --> 01:03:18,690 Right? 1196 01:03:18,690 --> 01:03:20,520 I scan through this row and I parallel 1197 01:03:20,520 --> 01:03:22,020 scan through this column. 1198 01:03:22,020 --> 01:03:26,160 Now, it depends on the order in which you store x and y, 1199 01:03:26,160 --> 01:03:30,000 but let's suppose we can store x in row major order, 1200 01:03:30,000 --> 01:03:33,240 meaning row by row, and we store y in column major order, 1201 01:03:33,240 --> 01:03:34,530 meaning column by column. 1202 01:03:34,530 --> 01:03:36,178 Then this will be an honest to goodness 1203 01:03:36,178 --> 01:03:37,560 scan of a contiguous array. 1204 01:03:37,560 --> 01:03:41,550 Again, the order we store things in memory really matters.
1205 01:03:41,550 --> 01:03:42,990 So let's make our life ideal. 1206 01:03:42,990 --> 01:03:48,120 Let's say that this is row by row 1207 01:03:48,120 --> 01:03:52,740 and this one is column by column, then hey, 1208 01:03:52,740 --> 01:03:55,500 this is two parallel scans, so order N over B 1209 01:03:55,500 --> 01:03:58,300 to compute this cell. 1210 01:03:58,300 --> 01:04:07,500 OK, I claim that computing z i,j costs 1211 01:04:07,500 --> 01:04:12,360 N over B, so maybe plus 1. 1212 01:04:12,360 --> 01:04:14,187 Again, these are N by N matrices, 1213 01:04:14,187 --> 01:04:24,480 so total size N squared, which means the total cost is what? 1214 01:04:24,480 --> 01:04:30,480 N cubed over B plus N squared, I guess. 1215 01:04:30,480 --> 01:04:31,350 Seems pretty good. 1216 01:04:31,350 --> 01:04:34,200 I mean, we had a running time of N cubed before, 1217 01:04:34,200 --> 01:04:37,470 and we divided by B. How could you possibly do better? 1218 01:04:37,470 --> 01:04:40,140 Well, by being smarter. 1219 01:04:40,140 --> 01:04:46,500 This is not optimal; you can do better. 1220 01:04:46,500 --> 01:04:50,702 It's not obvious, but let me just spend 1221 01:04:50,702 --> 01:04:53,160 a little more time convincing you this is the right answer. 1222 01:04:53,160 --> 01:04:56,607 Not only is this big O, but for appropriate settings-- 1223 01:04:56,607 --> 01:05:01,110 in the worst case, this is going to be theta. 1224 01:05:01,110 --> 01:05:03,800 Because if you think of the order in which we're-- see, 1225 01:05:03,800 --> 01:05:06,680 we look at these rows several times. 1226 01:05:06,680 --> 01:05:09,330 And if you look at, when I compute this cell and this cell 1227 01:05:09,330 --> 01:05:12,420 and this cell of the z matrix, or the product matrix, 1228 01:05:12,420 --> 01:05:15,750 each of them uses the same row of x. 1229 01:05:15,750 --> 01:05:18,300 So maybe you could reuse that. 1230 01:05:18,300 --> 01:05:21,300 You could reuse that row of x. 1231 01:05:21,300 --> 01:05:23,550 That might actually be free, depending 1232 01:05:23,550 --> 01:05:24,910 on how B and N relate. 1233 01:05:24,910 --> 01:05:30,930 But the columns of y, those are different every time. 1234 01:05:30,930 --> 01:05:32,920 When I compute this one, I use the first column 1235 01:05:32,920 --> 01:05:35,760 of y; when I compute this one, I use the second column of y. 1236 01:05:35,760 --> 01:05:38,010 Unless the cache is so big that it 1237 01:05:38,010 --> 01:05:40,380 can store all of y-- which is like 1238 01:05:40,380 --> 01:05:42,779 storing the entire problem in cache, 1239 01:05:42,779 --> 01:05:44,460 that's unrealistic-- 1240 01:05:44,460 --> 01:05:48,900 so unless M is bigger than N squared, 1241 01:05:48,900 --> 01:05:52,290 in this algorithm at least, you have to read a new column of y 1242 01:05:52,290 --> 01:05:53,970 every single time. 1243 01:05:53,970 --> 01:05:55,960 So that's why it's theta N over B plus 1. 1244 01:05:55,960 --> 01:06:00,430 You need to spend N over B, assuming 1245 01:06:00,430 --> 01:06:06,160 M is less than N squared. 1246 01:06:06,160 --> 01:06:06,660 OK. 1247 01:06:06,660 --> 01:06:09,120 And I claim this is not the best you can do, because we're 1248 01:06:09,120 --> 01:06:10,380 going to do better. 1249 01:06:10,380 --> 01:06:28,490 And we're going to do better by divide and conquer.
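For reference, a Python sketch of the standard algorithm just analyzed, assuming x is stored row by row and y column by column so that each dot product is two parallel scans; the lists of lists standing in for the two flat layouts are an illustrative assumption:

    def matmul_standard(x_rows, y_cols):
        # x_rows[i] is row i of x; y_cols[j] is column j of y.
        # Each output cell is one dot product: Theta(N/B + 1) memory
        # transfers, so Theta(N^3/B + N^2) in total.
        n = len(x_rows)
        z = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                z[i][j] = sum(x_rows[i][t] * y_cols[j][t] for t in range(n))
        return z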
1250 01:06:28,490 --> 01:06:30,540 Now, you've already seen divide and conquer used 1251 01:06:30,540 --> 01:06:36,930 for matrix multiplication to get Strassen's algorithm, 1252 01:06:36,930 --> 01:06:44,100 and the idea there is to use blocks. 1253 01:06:44,100 --> 01:06:46,740 So this is sort of an algorithm you've already seen. 1254 01:06:46,740 --> 01:06:55,080 I'm going to divide the matrix z into N over 2 1255 01:06:55,080 --> 01:06:57,900 by N over 2 sub-matrices. 1256 01:06:57,900 --> 01:07:02,190 Each of these z i,j's is an N over 2 by N over 2 matrix. 1257 01:07:02,190 --> 01:07:15,400 And I do the same thing for x and y. 1258 01:07:15,400 --> 01:07:16,770 Make sure the numbers are right-- 1259 01:07:16,770 --> 01:07:19,190 x 1,2, y 2,1, and so on. 1260 01:07:19,190 --> 01:07:22,190 And you can write this out explicitly. 1261 01:07:22,190 --> 01:07:25,345 I prefer not to do all of it, but let's do one of them. 1262 01:07:25,345 --> 01:07:27,470 You can just think of these as two by two matrices, 1263 01:07:27,470 --> 01:07:29,570 because matrix multiplication is associative 1264 01:07:29,570 --> 01:07:30,740 and good things happen. 1265 01:07:30,740 --> 01:07:32,450 I can just take these two elements-- 1266 01:07:32,450 --> 01:07:34,640 but they're actually matrices, sorry. 1267 01:07:34,640 --> 01:07:38,192 I might take these two and dot product with these two. 1268 01:07:38,192 --> 01:07:46,790 And I get x1,1 y1,1 plus x1,2 y2,1, 1269 01:07:46,790 --> 01:07:50,220 and that's what I should set z1,1 to. 1270 01:07:50,220 --> 01:07:54,020 So this is a formula, but it's also a recursive algorithm. 1271 01:07:54,020 --> 01:07:57,470 It says, if I want to compute z, I'm going to say, 1272 01:07:57,470 --> 01:07:59,510 well, there are four subproblems. 1273 01:07:59,510 --> 01:08:01,399 The first one is to compute z1,1, 1274 01:08:01,399 --> 01:08:03,440 and I'm going to do that by recursively computing 1275 01:08:03,440 --> 01:08:06,500 the product of x1,1 and y1,1, recursively computing 1276 01:08:06,500 --> 01:08:10,034 the product of x1,2 and y2,1, and then adding them together. 1277 01:08:10,034 --> 01:08:10,950 This is not recursive. 1278 01:08:10,950 --> 01:08:13,080 Addition is easy. 1279 01:08:13,080 --> 01:08:13,580 OK. 1280 01:08:13,580 --> 01:08:15,800 And there's two products here, two products here, 1281 01:08:15,800 --> 01:08:17,341 two products here, two products here, 1282 01:08:17,341 --> 01:08:18,830 a total of eight products, so we're 1283 01:08:18,830 --> 01:08:25,055 going to have eight recursive calls of size N over 2. 1284 01:08:25,055 --> 01:08:26,930 If we look at the number of memory transfers, 1285 01:08:26,930 --> 01:08:31,689 this is 8 times a recursive call on N over 2 by N 1286 01:08:31,689 --> 01:08:37,550 over 2 sub-matrices plus the cost of addition. 1287 01:08:37,550 --> 01:08:41,300 And I claim the cost of addition is at most N squared over B 1288 01:08:41,300 --> 01:08:46,461 plus 1, because addition is basically parallel scans. 1289 01:08:46,461 --> 01:08:50,390 I can scan through x, scan through y. 1290 01:08:50,390 --> 01:08:52,609 As long as they're stored in the same order, 1291 01:08:52,609 --> 01:08:55,350 I just am adding them element by element, 1292 01:08:55,350 --> 01:08:59,960 and there's a third scan, which is writing out the z matrix 1293 01:08:59,960 --> 01:09:02,500 once things are linearized.
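A sketch of this divide-and-conquer multiply in Python, matching MT(N) = 8 MT(N/2) + O(N^2/B + 1). It assumes N is a power of 2, and, as discussed next, each quadrant must also be laid out contiguously in memory for the analysis to go through; plain lists of lists are an illustrative stand-in here:

    def matmul_rec(x, y):
        n = len(x)
        if n == 1:
            # The code recurses all the way down; only the *analysis*
            # stops early, at the size where three submatrices fit in cache.
            return [[x[0][0] * y[0][0]]]
        h = n // 2

        def quad(m, r, c):
            # The N/2 x N/2 submatrix with top-left corner at (r, c).
            return [row[c:c + h] for row in m[r:r + h]]

        def add(a, b):
            # Matrix addition: a scan, O(N^2/B + 1) memory transfers.
            return [[a[i][j] + b[i][j] for j in range(h)] for i in range(h)]

        x11, x12, x21, x22 = quad(x, 0, 0), quad(x, 0, h), quad(x, h, 0), quad(x, h, h)
        y11, y12, y21, y22 = quad(y, 0, 0), quad(y, 0, h), quad(y, h, 0), quad(y, h, h)
        # z1,1 = x1,1 y1,1 + x1,2 y2,1, and so on: eight recursive products.
        z11 = add(matmul_rec(x11, y11), matmul_rec(x12, y21))
        z12 = add(matmul_rec(x11, y12), matmul_rec(x12, y22))
        z21 = add(matmul_rec(x21, y11), matmul_rec(x22, y21))
        z22 = add(matmul_rec(x21, y12), matmul_rec(x22, y22))
        # Stitch the four quadrants back together into z.
        return ([z11[i] + z12[i] for i in range(h)] +
                [z21[i] + z22[i] for i in range(h)])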
1294 01:09:02,500 --> 01:09:06,100 Now, for this to work, for this to be a true recursion, 1295 01:09:06,100 --> 01:09:12,279 I need that, say, x1,1 and y1,1 are stored as contiguous things 1296 01:09:12,279 --> 01:09:13,690 in memory. 1297 01:09:13,690 --> 01:09:19,809 So this means that the layout of a matrix, 1298 01:09:19,809 --> 01:09:24,430 let's consider the matrix z, is going to be like the following. 1299 01:09:24,430 --> 01:09:27,370 I'm going to recursively lay out 1,1-- so when I say lay out, 1300 01:09:27,370 --> 01:09:29,920 I mean what order do I store the elements in memory? 1301 01:09:29,920 --> 01:09:32,050 What order do I store the cells in memory? 1302 01:09:32,050 --> 01:09:36,370 And what I'm going to say is, recursively lay out 1303 01:09:36,370 --> 01:09:40,859 the pieces-- there's four pieces-- recursively call 1304 01:09:40,859 --> 01:09:46,282 layout of those and then concatenate them together. 1305 01:09:46,282 --> 01:09:46,990 That's my layout. 1306 01:09:46,990 --> 01:09:48,430 So I'm going to store all of these items, 1307 01:09:48,430 --> 01:09:50,221 then I'm going to store all of these items, 1308 01:09:50,221 --> 01:09:52,510 and then all of these items, then all these items. 1309 01:09:52,510 --> 01:09:55,480 How do I store these items, in what order? 1310 01:09:55,480 --> 01:09:56,107 Recursively. 1311 01:09:56,107 --> 01:09:57,690 So I'm going to divide them like this, 1312 01:09:57,690 --> 01:09:59,650 store these before these before these before these, 1313 01:09:59,650 --> 01:10:00,640 how do I store these? 1314 01:10:00,640 --> 01:10:01,440 Recursively. 1315 01:10:01,440 --> 01:10:02,455 OK, same recursion. 1316 01:10:02,455 --> 01:10:04,510 So it's a really weird order, it's 1317 01:10:04,510 --> 01:10:06,820 a divide and conquer order. 1318 01:10:06,820 --> 01:10:08,380 There's only four things here. 1319 01:10:08,380 --> 01:10:10,510 In what order should I combine the four things? 1320 01:10:10,510 --> 01:10:11,716 Doesn't matter. 1321 01:10:11,716 --> 01:10:13,590 All that matters is that this is consecutive, 1322 01:10:13,590 --> 01:10:15,730 this is consecutive, and this is consecutive, 1323 01:10:15,730 --> 01:10:18,530 so that when I recurse, I'm recursing on consecutive chunks 1324 01:10:18,530 --> 01:10:19,030 of memory. 1325 01:10:19,030 --> 01:10:21,070 Otherwise the analysis just won't work. 1326 01:10:21,070 --> 01:10:24,690 So for this to be right, got to have this layout. 1327 01:10:24,690 --> 01:10:26,500 OK. 1328 01:10:26,500 --> 01:10:32,110 Now we just need to solve the recurrence, and we're done. 1329 01:10:32,110 --> 01:10:34,960 I already told you, the base case we're going to use 1330 01:10:34,960 --> 01:10:35,860 is this one. 1331 01:10:35,860 --> 01:10:37,526 We're going to use this one because it's 1332 01:10:37,526 --> 01:10:40,360 stronger and better, and we'll need it, in this case, 1333 01:10:40,360 --> 01:10:43,442 to get a better analysis. 1334 01:10:43,442 --> 01:10:45,400 You could solve it using the weaker base cases, 1335 01:10:45,400 --> 01:10:47,330 you'll get larger numbers. 1336 01:10:47,330 --> 01:10:51,309 But if you use the strongest base case, MT of-- it's 1337 01:10:51,309 --> 01:10:54,180 not M. Got to be a little careful. 1338 01:10:54,180 --> 01:10:56,980 Because N here is actually just one side length. 
1339 01:10:56,980 --> 01:11:00,445 This is an N by N matrix, so the total size 1340 01:11:00,445 --> 01:11:04,720 is N squared-- actually the total size is 3N squared, 1341 01:11:04,720 --> 01:11:09,460 so this is going to be the square root of M over 3, 1342 01:11:09,460 --> 01:11:12,880 some constant times the square root of M. It actually 1343 01:11:12,880 --> 01:11:14,510 doesn't matter what the constant is. 1344 01:11:14,510 --> 01:11:16,240 But this is the size of-- this is 1345 01:11:16,240 --> 01:11:18,400 the value of N for which all three matrices will 1346 01:11:18,400 --> 01:11:20,890 fit in cache. 1347 01:11:20,890 --> 01:11:27,309 So I claim we know this costs at most M over B memory transfers, 1348 01:11:27,309 --> 01:11:33,550 because we know 1349 01:11:33,550 --> 01:11:35,110 that all of these guys fit in cache 1350 01:11:35,110 --> 01:11:37,540 and because we know that they're stored consecutively 1351 01:11:37,540 --> 01:11:40,840 in memory-- well, three consecutive chunks. 1352 01:11:40,840 --> 01:11:45,940 Now, no matter what I do, there are only M over B blocks there, 1353 01:11:45,940 --> 01:11:47,660 and so at worst I read them all in. 1354 01:11:47,660 --> 01:11:50,470 But once the cache is filled with them, 1355 01:11:50,470 --> 01:11:52,540 for the duration of this recursion, 1356 01:11:52,540 --> 01:11:54,100 I won't be reading any other blocks, 1357 01:11:54,100 --> 01:11:56,990 and so the cache will just stay full with the problem. 1358 01:11:56,990 --> 01:11:59,440 And so I never pay more than this. 1359 01:11:59,440 --> 01:12:00,830 So that's the base case. 1360 01:12:00,830 --> 01:12:05,140 Easy, but you have to think about it for a second. 1361 01:12:05,140 --> 01:12:05,650 Cool. 1362 01:12:05,650 --> 01:12:08,640 Now we have a recurrence and a base case, 1363 01:12:08,640 --> 01:12:11,210 and now we have a good old fashioned recursion tree. 1364 01:12:11,210 --> 01:12:13,809 This one I can actually draw, because it's-- well, 1365 01:12:13,809 --> 01:12:17,120 partly because it's nice and uniform. 1366 01:12:17,120 --> 01:12:19,870 It just explodes rather fast. 1367 01:12:19,870 --> 01:12:25,960 So at the top we have a cost of N squared over B plus 1, 1368 01:12:25,960 --> 01:12:28,750 and we have eight recursive calls. 1369 01:12:28,750 --> 01:12:31,960 And the recursive calls are to something of size 1370 01:12:31,960 --> 01:12:39,970 N over 2, squared, over B, also known as N squared over 4B. 1371 01:12:39,970 --> 01:12:42,170 OK, so if I add up everything on this level, 1372 01:12:42,170 --> 01:12:46,180 I get N squared over B, and if I add up everything on this level 1373 01:12:46,180 --> 01:12:53,200 I'm going to get 8 times N squared over 4B-- is that right? 1374 01:12:53,200 --> 01:12:53,740 Yeah. 1375 01:12:53,740 --> 01:12:58,120 So 2 times N squared over B. 1376 01:12:58,120 --> 01:12:58,870 OK. 1377 01:12:58,870 --> 01:13:02,170 I did that in order to verify that the cost per level 1378 01:13:02,170 --> 01:13:06,430 is increasing geometrically, so all that will matter 1379 01:13:06,430 --> 01:13:08,770 is the leaf level. 1380 01:13:08,770 --> 01:13:11,520 This is the proof of the master theorem.
1381 01:13:11,520 --> 01:13:13,300 When things are doubling at every step-- 1382 01:13:13,300 --> 01:13:15,059 and this was just a special case, 1383 01:13:15,059 --> 01:13:18,529 but every level would look the same-- every level 1384 01:13:18,529 --> 01:13:20,070 of recursion, if you add them all up, 1385 01:13:20,070 --> 01:13:22,653 you're getting twice as much as you had at the previous level. 1386 01:13:22,653 --> 01:13:26,150 So all that will matter is the leaf level. 1387 01:13:26,150 --> 01:13:29,430 OK, the leaf level. 1388 01:13:29,430 --> 01:13:32,610 Actually, maybe I'll do it over here. 1389 01:13:32,610 --> 01:13:34,800 First question is how many leaves are there? 1390 01:13:34,800 --> 01:13:37,900 The leaves are this thing. 1391 01:13:37,900 --> 01:13:41,300 So the way I would think about this is, because everything 1392 01:13:41,300 --> 01:13:44,499 is nice and uniform, it's 8 to the power of the number of levels. 1393 01:13:44,499 --> 01:13:50,790 What's the number of levels? 1394 01:13:50,790 --> 01:13:53,320 Well, we're dividing by 2 each time, 1395 01:13:53,320 --> 01:13:56,559 so it's going to be log of something, 1396 01:13:56,559 --> 01:14:00,130 but it's no longer log N because we're stopping early. 1397 01:14:00,130 --> 01:14:03,450 We're stopping when N reaches this value. 1398 01:14:03,450 --> 01:14:10,856 So it turns out that is log of N divided by that value. 1399 01:14:10,856 --> 01:14:12,230 This is, how many times do I have 1400 01:14:12,230 --> 01:14:14,775 to multiply by 2 before I get to this, which 1401 01:14:14,775 --> 01:14:17,150 is the same thing as how many times do I have to divide N 1402 01:14:17,150 --> 01:14:19,636 by 2 before I get that? 1403 01:14:19,636 --> 01:14:20,780 Think about it. 1404 01:14:20,780 --> 01:14:22,320 OK, but 8 to the log. 1405 01:14:22,320 --> 01:14:24,980 This is 2 to the 3 times log. 1406 01:14:24,980 --> 01:14:27,690 2 to the log is just the thing. 1407 01:14:27,690 --> 01:14:36,830 So this is N over root M over B-- so many overs-- 1408 01:14:36,830 --> 01:14:39,017 to the third power. 1409 01:14:39,017 --> 01:14:40,600 OK, this is starting to look familiar. 1410 01:14:40,600 --> 01:14:46,080 This is N cubed, that should appear somewhere, 1411 01:14:46,080 --> 01:14:48,480 divided by square root of M over B. 1412 01:14:48,480 --> 01:14:50,090 This is the number of leaves. 1413 01:14:50,090 --> 01:14:55,880 Now, for each leaf we're paying this cost, 1414 01:14:55,880 --> 01:15:03,020 so the overall cost of MT of N is going to be this times this. 1415 01:15:03,020 --> 01:15:11,930 So let's do that and simplify. 1416 01:15:11,930 --> 01:15:18,062 So MT of N is going to be big O, because we're taking the leaf 1417 01:15:18,062 --> 01:15:20,020 level, but there are some other levels; that's just 1418 01:15:20,020 --> 01:15:23,530 going to lose us a factor of 2. 1419 01:15:23,530 --> 01:15:26,175 We have this thing multiplied by this thing. 1420 01:15:26,175 --> 01:15:34,220 So we've got N cubed over square root of M over B 1421 01:15:34,220 --> 01:15:40,798 times M over B. 1422 01:15:40,798 --> 01:15:43,360 AUDIENCE: You-- ERIK DEMAINE: I made a mistake. 1423 01:15:43,360 --> 01:15:44,720 Yeah, thank you. 1424 01:15:44,720 --> 01:15:45,970 This was supposed to be cubed. 1425 01:15:45,970 --> 01:15:49,600 So this was M over B to the 1/2, so now we have, 1426 01:15:49,600 --> 01:15:52,720 down here, M over B to the 3/2. 1427 01:15:52,720 --> 01:15:59,850 Thank you, thought that looked weird. 1428 01:15:59,850 --> 01:16:01,890 All right.
1429 01:16:01,890 --> 01:16:06,531 M over B to the 3/2. 1430 01:16:06,531 --> 01:16:07,031 OK. 1431 01:16:07,031 --> 01:16:18,430 AUDIENCE: [INAUDIBLE] ERIK DEMAINE: Yeah. 1432 01:16:18,430 --> 01:16:20,180 What was I doing here? 1433 01:16:20,180 --> 01:16:21,630 This is supposed to be M over 3. 1434 01:16:21,630 --> 01:16:23,605 I was not missing a stroke, thank you. 1435 01:16:23,605 --> 01:16:27,824 M over 3, this is supposed to be M over 3. 1436 01:16:27,824 --> 01:16:29,240 Wow. 1437 01:16:29,240 --> 01:16:32,260 OK, so this is M over 3. 1438 01:16:32,260 --> 01:16:34,780 I'm just going to drop the-- well, I'll put it here. 1439 01:16:34,780 --> 01:16:37,340 But then I'm just going to write theta 1440 01:16:37,340 --> 01:16:39,670 so I can forget about the 3, because that's just 1441 01:16:39,670 --> 01:16:41,360 a square root of 3 factor. 1442 01:16:41,360 --> 01:16:47,405 So now this is going to be M to the 3/2. 1443 01:16:47,405 --> 01:16:50,546 That makes me much happier. 1444 01:16:50,546 --> 01:16:53,410 Did I get it right this time? 1445 01:16:53,410 --> 01:16:54,430 Let's double-check. 1446 01:16:54,430 --> 01:16:57,910 So this is square root of M to the 3rd power, 1447 01:16:57,910 --> 01:17:00,840 so that's M to the 1/2 cubed, M to the 3/2. 1448 01:17:00,840 --> 01:17:04,850 I think that's good; this base case was square root of M. 1449 01:17:04,850 --> 01:17:07,929 OK, get it right. 1450 01:17:07,929 --> 01:17:10,282 So now this is M to the 3/2. 1451 01:17:10,282 --> 01:17:11,740 There is a square root that's going 1452 01:17:11,740 --> 01:17:16,270 to come back; there's M to the 3/2 and there's an M upstairs, 1453 01:17:16,270 --> 01:17:18,270 so one cancels. 1454 01:17:18,270 --> 01:17:23,439 We're going to be left with N cubed over square root of M 1455 01:17:23,439 --> 01:17:26,112 times B. OK. 1456 01:17:26,112 --> 01:17:28,570 There was a lower order term because I dropped this plus 1, 1457 01:17:28,570 --> 01:17:31,180 but let's not worry about that right now. 1458 01:17:31,180 --> 01:17:33,730 Here we had N cubed divided by B; that 1459 01:17:33,730 --> 01:17:35,200 was the standard algorithm. 1460 01:17:35,200 --> 01:17:39,160 Now we've got N cubed divided by B divided by square root of M. 1461 01:17:39,160 --> 01:17:40,635 That's big. 1462 01:17:40,635 --> 01:17:42,010 I mean, this is basically, you're 1463 01:17:42,010 --> 01:17:45,450 dividing by-- well, square root of your cache size. 1464 01:17:45,450 --> 01:17:46,030 Wow. 1465 01:17:46,030 --> 01:17:49,230 So who knows how big that is, but say, 1466 01:17:49,230 --> 01:17:52,430 between memory and disk, we're talking gigabytes. 1467 01:17:52,430 --> 01:17:54,330 So this is like billions. 1468 01:17:54,330 --> 01:17:57,450 Square root of a billion is still pretty big, 1469 01:17:57,450 --> 01:18:01,324 like 10,000 to 100,000, so this is a huge amount faster 1470 01:18:01,324 --> 01:18:02,490 than the standard algorithm. 1471 01:18:02,490 --> 01:18:04,920 You can do way better than scans. 1472 01:18:04,920 --> 01:18:07,650 Basically because we're reusing the same rows and columns 1473 01:18:07,650 --> 01:18:08,726 over and over. 1474 01:18:08,726 --> 01:18:10,559 Now, this is standard matrix multiplication. 1475 01:18:10,559 --> 01:18:12,610 You might ask, what about Strassen's algorithm? 1476 01:18:12,610 --> 01:18:13,770 Well, same thing works. 1477 01:18:13,770 --> 01:18:15,960 You can do the same analysis for Strassen, of course. 1478 01:18:15,960 --> 01:18:19,020 You get a similar improvement over Strassen.
1479 01:18:19,020 --> 01:18:21,960 You can do this for non-square matrices and all 1480 01:18:21,960 --> 01:18:23,580 those good things. 1481 01:18:23,580 --> 01:18:25,170 And one minute left. 1482 01:18:25,170 --> 01:18:27,100 And it's going to be enough, I think, 1483 01:18:27,100 --> 01:18:31,330 to cover LRU block replacement. 1484 01:18:31,330 --> 01:18:39,604 So here's what I want to say about LRU block replacement. 1485 01:18:39,604 --> 01:18:41,520 So in the beginning, we said the model is LRU, 1486 01:18:41,520 --> 01:18:44,670 or it could have been FIFO. 1487 01:18:44,670 --> 01:18:45,870 Remember that? 1488 01:18:45,870 --> 01:18:48,210 And this algorithm will work just fine from an LRU 1489 01:18:48,210 --> 01:18:49,620 perspective or a FIFO perspective, 1490 01:18:49,620 --> 01:18:51,570 if you think about it, but how do 1491 01:18:51,570 --> 01:18:53,700 we know that LRU is as good as anything? 1492 01:18:53,700 --> 01:18:58,790 I claim, if you look at some sequence of block accesses-- 1493 01:18:58,790 --> 01:19:01,890 so suppose you know what B is-- and you count, 1494 01:19:01,890 --> 01:19:07,020 for a cache of size M, how many memory transfers LRU does, 1495 01:19:07,020 --> 01:19:09,930 it's going to be within a factor of 2 of the optimal. 1496 01:19:09,930 --> 01:19:12,780 But not the optimal for a cache of size M, 1497 01:19:12,780 --> 01:19:15,732 the optimal for a cache of size M over 2. 1498 01:19:15,732 --> 01:19:17,190 This is a bit of a weird statement. 1499 01:19:17,190 --> 01:19:19,950 I have a factor of 2 here and a factor of 2 here. 1500 01:19:19,950 --> 01:19:27,450 This is a cool idea called resource augmentation, a 1501 01:19:27,450 --> 01:19:30,380 fancy word for a simple idea. 1502 01:19:30,380 --> 01:19:31,200 This we're used to. 1503 01:19:31,200 --> 01:19:33,370 This is approximation algorithms. 1504 01:19:33,370 --> 01:19:36,090 OK, but this is an approximation in cost. 1505 01:19:36,090 --> 01:19:38,040 Here we're approximating the resources 1506 01:19:38,040 --> 01:19:39,240 available to the algorithm. 1507 01:19:39,240 --> 01:19:42,900 We're changing the machine model, dividing M by 2, 1508 01:19:42,900 --> 01:19:46,540 and we get a nice result. 1509 01:19:46,540 --> 01:19:47,670 Why is this OK? 1510 01:19:47,670 --> 01:19:49,559 Because, if you look at a bound like this, 1511 01:19:49,559 --> 01:19:51,300 if you change M by a factor of 2, 1512 01:19:51,300 --> 01:19:53,130 it will not change the bound by more than a 1513 01:19:53,130 --> 01:19:54,880 factor of square root of 2. 1514 01:19:54,880 --> 01:19:56,910 So as long as you have at most, say, 1515 01:19:56,910 --> 01:19:59,670 a linear or polynomial dependence on M, 1516 01:19:59,670 --> 01:20:01,530 changing M by a constant factor will not 1517 01:20:01,530 --> 01:20:03,110 change the overall cost of the cache 1518 01:20:03,110 --> 01:20:03,690 oblivious algorithm. 1519 01:20:03,690 --> 01:20:05,430 This is why we can assume it's LRU. 1520 01:20:05,430 --> 01:20:07,680 The same is true for FIFO; it's probably 1521 01:20:07,680 --> 01:20:11,660 true in expectation for random replacement. 1522 01:20:11,660 --> 01:20:16,170 And I will leave it at that. 1523 01:20:16,170 --> 01:20:17,670 If you want to see the-- do you want 1524 01:20:17,670 --> 01:20:22,080 to see the proof of this theorem? 1525 01:20:22,080 --> 01:20:22,680 Tomorrow? 1526 01:20:22,680 --> 01:20:24,120 Or, Thursday? 1527 01:20:24,120 --> 01:20:24,620 Yes.
1528 01:20:24,620 --> 01:20:27,370 OK, we'll cover it on Thursday.
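As a companion to that theorem, an illustrative Python miss counter for LRU on a sequence of block accesses; the OrderedDict-based cache and the example sequence are assumptions, not from the lecture. The claim above says this count, run with M over B slots, is at most twice what the optimal offline policy pays with half as many slots:

    from collections import OrderedDict

    def lru_misses(accesses, num_slots):
        # Simulate an LRU cache with num_slots block slots (that's M/B)
        # and count the memory transfers (misses) on the access sequence.
        cache = OrderedDict()  # most recently used entries at the end
        misses = 0
        for block in accesses:
            if block in cache:
                cache.move_to_end(block)  # hit: refresh recency
            else:
                misses += 1
                if len(cache) >= num_slots:
                    cache.popitem(last=False)  # evict least recently used
                cache[block] = True
        return misses

    # lru_misses([0, 1, 2, 0, 1, 2], num_slots=2) returns 6: LRU thrashes,
    # while with 3 slots it would pay only the 3 compulsory misses.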