NARRATOR: The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

ERIK DEMAINE: Yeah. I'm going to talk about I/O models. Just to get a sense, how many people know a model called the I/O model? And how many people don't? It doesn't matter, I'm just curious.

As some of you may know, I/O models have a really rich history, and they're pretty fascinating. They all center on this problem of modeling the memory hierarchy in a computer. We have things like the RAM model of computation, where you can access anything in your memory at the same price. But the reality of computers is that you have things very close to you that are very cheap to access, and you have things very far from you that are big (you can get 3 terabyte disks these days) but very slow to access. And one of the big costs there is latency: the head has to move to the right position, and then you can read lots of data really fast. The disk can actually give you data very fast; the hard part is getting started reading stuff.

And so this is the sort of thing we want to model. These kinds of computers have been around for decades, as we'll see, and people have been trying to model them in as clean a way as possible, in a way that works well theoretically and matches practice in some ways.

I have just some fun additions to this slide. You can keep getting bigger, go to the internet, get to an exabyte or a zettabyte. You have to look up all the words for these. In the universe, you've got about 10 to the 83 atoms, so maybe roughly that many bits, but I don't know if there's a letter for that.

So how do we model this? Well, there are a lot of models. This is a partial list.
These are sort of the core models that were around, let's say, before this millennium. So we start in 1972 and work our way forward, and I'm going to go through all of these in different levels of detail. There are a couple of key features in a cache that we want to model, or maybe a few key features, and then there's some measure of simplicity, which is a little hard to define. The goal is to get all four of these things at once, and we get that, more or less, by the end.

So the first section is on this idealized two-level storage, which was introduced by Bob Floyd in 1972. This is what the first page of the paper looks like. It looks like it was typeset on a typewriter, with underlining: the good old days of computer science, the very early days of computer science. And this was published in a conference called The Complexity of Computer Computations. How many people have heard of that conference? No one. Wow. There it is. It's kind of a classic, because it had Karp's original paper on NP-completeness, so you've definitely read that paper. But there are a lot of neat papers in there, and a panel discussion, including what we should call algorithms, which is kind of a fun read.

So this is in the day when one of the state-of-the-art computers was the PDP-11. This is what a PDP-11, or one of them, looks like, probably one owned by Bell Labs, with Dennis Ritchie and Ken Thompson, the inventors of C and Unix, working away there. It has disks, each of which is about 2 megabytes in capacity. And it has internal memory, which was core memory at the time: each of these is a little circular magnetic core, and it stores 1 bit. In total, there are 8 kilobytes. So you get a sense of this already being an issue, and this is why the paper was written.

So here's the model he introduced, a very simple model, maybe the simplest we'll see. You have your CPU, which can do local computation. And then you have your memory, which is very big.
But in particular, it's divided into these blocks of size B, so each block can have up to B items. And what you're allowed to do in one block operation is read two of the blocks. You can read all the items in those blocks. So let's say you read these two blocks, and you pick some subset of those items to pick up. And then what you're allowed to do is store them somewhere else: you can pick some other target block, like this one, and copy those elements to overwrite that block. There's no computation in this model, because he was just interested in how you can permute items in that world. So it's a simple model, but you get the idea: you can read two blocks, take up to B items out of them, and stick them in here.

Here, we just ignore what the order is within a block, because we're assuming you can rearrange items once you read them in and spit them out. So don't worry about the order within the block; the question is, for every item, which block is it in? And we're assuming here that items are indivisible.

So here's the main theorem of that paper. If you're given N items and you want to permute them into N over B blocks, which means each of those blocks is going to be full (let's say that's the most interesting case), then you need to use order N over B log B block operations, even for a random permutation, on average and with high probability. So this is kind of nice, or kind of interesting, because just to touch those blocks requires N over B block operations. But there's an extra log factor that starts to creep up, which is maybe a little bit surprising; less surprising to people who are familiar with I/O models, but at the time, very new.

And I'm making a particular assumption here, but it's just a small thing. I thought I'd go through the proof of this theorem, because it's fairly simple. It's going to use a slightly simplified model where, instead of copying items, you actually move items. So these guys would disappear after you put them in this new block.
Because we're thinking about permutation problems, again, that doesn't really change anything. You can, for every item, see what path it follows to ultimately get to its target location, throw away all the extra copies, and just keep that one set of moves. And that will still be a valid solution in this model.

So how does the lower bound go? It's a simple potential argument. You look at, for every pair of blocks, how many items there are in block i that are destined for block j, that want to move from block i to block j. This count changes over time; block i is where the items currently are. So that's n_ij. You take n_ij log n_ij, and sum that up over all i's and j's. That's the potential function.

And our goal is to maximize that potential, because it's going to be (for those familiar with entropy) negative entropy. So it's going to be maximized when all the items are where they need to be. This is when everything is as clustered as possible; you can only have a cluster of size B, because only up to B items can be in the same place. One way to see this: in the target configuration, n_ii is B for all i. Everyone's where they're supposed to be. And so the potential is the number of items times log B. And each log n_ij is always at most log B, so that's the biggest the potential could ever hope to get.

So our goal is to increase the potential, that is, to decrease entropy, as much as possible. And we're starting with low potential. If you take a random permutation and compute the expected number of guys that are already in the block where they're supposed to be, it's very small, because most of them are going to be destined for some other block. So we're starting with a potential that's linear, and we need to get to N log B. And then the claim is that each block operation we do can only increase the potential by at most B.
And so that gives us this bound: the potential we need to get to, minus the potential we started with, divided by how much we can increase the potential in each step. That is basically N over B log B, minus a little-o term.

Why is this claim true? I'll just sketch it. The idea is this fun fact: (x + y) log (x + y) is at most x log x + y log y + (x + y). What this means is that if you have two clusters (our goal is to cluster things together and make bigger groups that are in the same place, or in the correct place) contributing x log x and y log y to this sum, and you merge them, then you now have this (x + y) log (x + y) potential. And the claim is that it could have only gone up by x plus y. And when you're moving B items, the total number of things you're moving is B, so you can only increase the potential by B.

So that was a quick sketch of this old paper. It's a fun read, quite clear, an easy argument.
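To make the potential function concrete, here's a minimal Python sketch, my own illustration with a made-up permutation of N = 8 items and B = 2, not anything from Floyd's paper. The scattered layout below has every n_ij equal to 1, so the potential is 0; the finished layout has potential N log B = 8. Since one block operation moves at most B = 2 items and, by the fun fact above, raises the potential by at most 2, you need at least (8 - 0)/2 = 4 operations, which is N over B times log B.

```python
import math
from collections import Counter

def potential(blocks, dest):
    # Phi = sum over pairs (i, j) of n_ij * log2(n_ij), where n_ij is the
    # number of items currently in block i that are destined for block j.
    counts = Counter((i, dest[item])
                     for i, blk in enumerate(blocks) for item in blk)
    return sum(n * math.log2(n) for n in counts.values())

B = 2
dest = {0: 3, 1: 1, 2: 0, 3: 2, 4: 1, 5: 3, 6: 2, 7: 0}  # made-up permutation

scattered = [[0, 1], [2, 3], [4, 5], [6, 7]]  # no two items share a destination
finished = [[2, 7], [1, 4], [3, 6], [0, 5]]   # every item in its target block

print(potential(scattered, dest))  # 0.0
print(potential(finished, dest))   # 8.0 = N * log2(B)
```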
So we proved this theorem, that you need at least N over B log B. But what is the right answer? There's actually not a matching upper bound. Of course, for B a constant, this is the right answer; it's N, but that's not so exciting. On the upper bound side, this paper has an almost matching upper bound. It's another log, but not quite the same log: N over B log N over B, instead of log B. And the rough idea of how to do that--

AUDIENCE: [INAUDIBLE]

ERIK DEMAINE: Yeah, question.

AUDIENCE: [INAUDIBLE]

ERIK DEMAINE: I said a tall disk assumption. I'm assuming N over B is greater than B: the number of blocks in your disk is at least the size of a block.

AUDIENCE: You needed that in the proof?

ERIK DEMAINE: I needed that in the proof, I think. Good question. Where, in the N over B log B bound?

AUDIENCE: [INAUDIBLE]

ERIK DEMAINE: Yeah. Exactly. Yeah, that's where I'm using it. Thanks. Otherwise this expectation doesn't work out. I mean, if you have one block, for example, this will fail, because you need zero operations. So there has to be some trade-off in the very small regime. OK.

So the way to get N over B log N over B is basically a radix sort. In one pass through the data, you can rewrite everything so the items whose low-order bit is 0 come before all the items whose low-order bit is 1. So in N over B transfers, you can sort by one bit of the target block ID of every item. And you do this log of N over B times, because that's how many blocks there are, so that's how many bits a block ID has. That's how many passes you need in a binary radix sort, and so you can achieve that bound.
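Here's a toy, in-memory sketch of that pass structure (my own illustration, not Floyd's code). Each iteration of the outer loop is one stable partition pass on one bit of the destination block ID; in the model, such a pass is just scans costing N over B transfers, and there are log2 of N over B passes.

```python
def permute_by_radix(items, dest, num_blocks):
    # LSD binary radix sort on the destination block ID: one stable
    # partition pass per bit. In the model, each pass costs N/B transfers.
    bits = max(1, (num_blocks - 1).bit_length())
    for bit in range(bits):
        zeros = [x for x in items if not (dest[x] >> bit) & 1]
        ones = [x for x in items if (dest[x] >> bit) & 1]
        items = zeros + ones                  # stable: preserves order
    return items

dest = {'a': 2, 'b': 0, 'c': 3, 'd': 1, 'e': 0, 'f': 2, 'g': 1, 'h': 3}
print(permute_by_radix(list('abcdefgh'), dest, num_blocks=4))
# ['b', 'e', 'd', 'g', 'a', 'f', 'c', 'h'], grouped by destination block
```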
And the paper actually claims that there's a matching lower bound. It's a little strange, because there's a careful proof given for the log B lower bound, and then this claim just says, "by information theoretic considerations," this is also a lower bound. This is in the days when we didn't distinguish between big O and big Omega, before [INAUDIBLE] paper. But this is not true, and we'll see that it's not true. It was settled about 14 years later, so we'll see the right answer. This is almost the right answer, but it doesn't quite work when B is very small. And one way to see that is when B is 1: when B is 1, the right answer is N, not N log N. So when B is less than log N over B, there's a slightly different answer, which we'll get to later. But that was the early days.

There are some other fun quotes from this paper, foreshadowing different things. One is the word RAM model, which is very common today, but not at the time. It says, obviously, these results apply for disks and drums, which was probably what he was thinking about originally, but also when the pages, the blocks, are words of internal memory and the records are the bits in those words. So this is the word RAM model. Here, I said just ignore the permutation within each block, but you can actually do all the things you need to do for these algorithms using shifts and logical OR, XOR, and AND operations. So all these algorithms work in the word RAM model too, which is kind of nifty.

Another thing foreshadows what we call the I/O model, which we'll get to in a little bit. It says, "work is in progress" (he got scooped, unfortunately, unless he meant by someone else) "attempting to study the case" where you can store more than two pages. Basically, this CPU can hold two of these blocks and then write one back out, but it has no bigger memory, or cache, than that. So that's where we were at the time.

The next chapter in this story is 1981. It's a good year; it's when I was born. And this is Hong and Kung's paper. You've probably heard about the red-blue pebble game. It's also a two-level model, but now there's a cache in the middle, and you can remember stuff for a while. I mean, you can remember up to M things before you have to kick them out. The difference here is that there are no blocks anymore; it's just items.

So let me tell you a little bit about the paper. This was the state of the art in computing at the time. The personal computer revolution was happening. They had the Apple II, TRS-80, VIC-20. All of these originally had about 4 kilobytes of RAM, and the disks could store maybe, I don't know, 360 kilobytes or so. But you could also connect a tape and other crazy things. So, again, this was relevant, and that's the setting in which they were writing.

They have this fun quote: "When a large computation is performed on a small device" (at that point, small devices were becoming common) you must decompose the computation into subcomputations. This is going to require a lot of I/O, and it's going to be slow. So how do we minimize I/O?
So their model: before I get to the red-blue pebble game model, it's based on a vanilla, single-color pebble game model by Hopcroft, Paul, and Valiant. This is the famous paper relating the time hierarchy and the space hierarchy. And what they said is, OK, let's think of the algorithm we're executing as a DAG. We start with some things that are inputs, and we want to compute stuff, where each computation depends on having these two values, and so on. In the end, we want to compute some outputs. So you can rewrite a computation in this kind of DAG form, and we're going to model the execution of that by playing a pebble game.

A node can have a pebble on it; for example, we could put a pebble on this node. In general, we are allowed to put a pebble on a node if all of its predecessors have a pebble. A pebble is going to correspond to being in memory. And we can also throw away a pebble, because we can just forget stuff. Unlike real life, you can just forget whatever you don't want to know anymore.

So you add a pebble. Let's say now we can add this pebble, because its predecessor has a pebble on it. We can add this pebble over here, add this pebble here. Now, we don't need this information anymore, because we've computed all the things out of it, so we can choose to remove that pebble. And now we can add this one, remove that one, add this one (you can check that I got all these right), add this one, remove that one, remove, add, remove, remove.

In the end, we want pebbles on the outputs, and we start with pebbles on the inputs. And in this case, their goal was to minimize the maximum number of pebbles over time. Here, there are up to four pebbles at any one moment; that means you need memory of size four. And they ended up proving that any DAG can be executed using N over log N maximum pebbles, which gave this theorem: if you use t units of time, you can fit in t over log t units of space. That was a neat advance, but it's beside the point.
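Here's a minimal checker for that single-color pebble game, a Python sketch of my own with a made-up four-node DAG. It enforces the placement rule (all predecessors pebbled) and reports the maximum number of pebbles ever in play, which is the memory needed.

```python
def max_pebbles(preds, moves, inputs, outputs):
    # preds[v] = predecessors of node v in the DAG.
    # moves = sequence of ('place', v) or ('remove', v) pebble moves.
    # Inputs start pebbled; outputs must end pebbled.
    pebbled = set(inputs)
    peak = len(pebbled)
    for op, v in moves:
        if op == 'place':
            assert all(u in pebbled for u in preds[v]), "preds must be pebbled"
            pebbled.add(v)
        else:
            pebbled.remove(v)            # forgetting is always allowed
        peak = max(peak, len(pebbled))
    assert outputs <= pebbled, "outputs must end up pebbled"
    return peak

preds = {'a': [], 'b': [], 'c': ['a', 'b'], 'd': ['c']}
moves = [('place', 'c'), ('remove', 'a'), ('remove', 'b'), ('place', 'd')]
print(max_pebbles(preds, moves, inputs={'a', 'b'}, outputs={'d'}))  # 3
```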
This is where Hong and Kung were coming from. They had this pebble model, and they wanted to use two colors of pebbles: one to represent the shallower level of the memory hierarchy, the cache, and the other to say that you're on disk somewhere. So a red pebble is going to be in cache; that's the hot stuff. And the blue pebbles are on disk; that's the cold stuff.

And it's basically the same rules. When you're initially placing a pebble, everything here has to be red: you can place a red pebble if your predecessors have red pebbles. We start out with the inputs being blue, so there are no red pebbles. But for free, or not for free, for unit cost, we can convert any red pebble to a blue pebble, or any blue pebble to a red pebble.

So let's go through this. I can make that one red. And now I can make this one red. Great. Now, I don't need this one right now, so I'm going to make it blue, meaning write it out to disk. I make this one red, make this one red. Now I can throw that one away; I don't need it in cache or on disk. I can put that one on disk, because I don't need it right now. I can bring that one back in from disk, write this one out, put that one onto disk, put that one onto disk. Now we'll go over here, read this back in from disk, finish off this section over here. And now I can throw that away, add this guy, throw that away. What do I need? Now I can write this out to disk; I'm done with that output. Now I've got to read all these guys in, and then I can do this one.

And so I needed a cache size here of four: the maximum number of red things at any moment was four. And I can get rid of those guys and write that one to disk. My goal is to get the outputs all blue.

But the objective here is different. Before, we were essentially minimizing cache size. Cache size now is given to us: we say we have a cache of size M.
But now, what we count are the number of reads and writes, the number of recolorings of pebbles. That is the number of I/Os.

And so you can think of this model as this picture I drew before. You have a cache; you can store up to M items. You can take any blue item and, for example, throw it away. I could move a red item over here and turn it blue; that corresponds to writing out to disk. I can bring a blue item back in to fill that spot; that corresponds to reading from disk. All of this as long as, at all times, I have at most M red items. And these are the same model.

So what Hong and Kung did is look at a bunch of different algorithms, not problems, but specific algorithms, things that you could compute in this DAG form. The DAG form is, I guess you could say, a class of algorithms: there are many ways to execute a DAG. You could follow any topological sort of the DAG; that's an algorithm in some sense. And so what they're finding is the best execution of these meta-algorithms, if you will. So that doesn't mean it's the best way to do matrix-vector multiplication. But it says that if you're following the standard algorithm, the standard DAG that you get from it, or the standard FFT DAG (I guess FFT actually is an algorithm), then the minimum number of memory transfers is this number of red-blue recolorings.

And so you get a variety. Of course, the speedups, relative to the regular RAM analysis, are going to be somewhere between 1 and M, I guess, for most problems at least. And for some problems, like matrix-vector multiplication and odd-even transposition sort [INAUDIBLE], you get very good speedups: M. Matrix multiplication, not quite as good: root M. And FFT: log M. Sorting was not analyzed here, because sorting is many different algorithms; just one specific algorithm is analyzed here, which also gives only log M.
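Before moving on, here's what that I/O accounting looks like as a sketch, my own Python illustration on the same made-up DAG as before. The move names (compute, read, write, evict) are my labels, not Hong and Kung's; reads and writes are the recolorings that cost one I/O each, and red pebbles are capped at M.

```python
def count_ios(preds, moves, inputs, outputs, M):
    # Red pebbles = items in cache (at most M at once); blue = items on disk.
    # Each read (blue item gains a red copy) or write (red item gains a
    # blue copy) is a recoloring and costs one I/O.
    red, blue, ios = set(), set(inputs), 0
    for op, v in moves:
        if op == 'compute':                      # free local computation
            assert all(u in red for u in preds[v])
            red.add(v)
        elif op == 'read':                       # bring a disk item into cache
            assert v in blue; red.add(v); ios += 1
        elif op == 'write':                      # write a cache item to disk
            assert v in red; blue.add(v); ios += 1
        elif op == 'evict':                      # forget it from cache, free
            red.discard(v)
        assert len(red) <= M, "cache overflow"
    assert outputs <= blue, "outputs must end on disk"
    return ios

preds = {'a': [], 'b': [], 'c': ['a', 'b'], 'd': ['c']}
moves = [('read', 'a'), ('read', 'b'), ('compute', 'c'), ('evict', 'a'),
         ('evict', 'b'), ('compute', 'd'), ('write', 'd')]
print(count_ios(preds, moves, inputs={'a', 'b'}, outputs={'d'}, M=3))  # 3
```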
So I don't want to go through these analyses, because a lot of them will follow from other results that we'll get to.

So at this point, we have two models. We have the idealized two-level storage of Floyd, and we have the red-blue pebble game of Hong and Kung. This one models caching, that you can store a bunch of things, but it does not have blocks. This one models blocking, but it does not have a cache, or rather it has a cache of constant size. So the idea is to merge these two models. And this is the Aggarwal and Vitter paper many of you have heard of, I'm sure. It was 1987, so six years after Hong and Kung. It has many names. I/O model is the original, I guess. External Memory Model is what I usually use, and a bunch of people here use. Disk Access Model has the nice advantage that you can call it the DAM model. And, again, our goal is to minimize the number of I/Os.

It's just a fusion of the two models. Now our cache has blocks of size B, and you have M over B of those blocks. And your disk is also divided into blocks of size B; we imagine it being as large as you need it to be, probably about order N. And what can you do? Well, you can pick up one of these blocks and read it in from disk to cache, kicking out whatever used to be there. You can do computation internally and change whatever these items are for free, let's say. You could measure time, but usually you just measure the number of memory transfers. And then you can take one of these blocks and write it back out to disk, kicking out whatever used to be there.

So it's the obvious hybrid of these models. But this turns out to be a really good model. Those other two models were interesting; they were toys; they were simple. This is basically as simple, but it spawned this whole field, and it's why we're here today. So this is a really cool model, with tons of results in it.
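As a concrete picture of the model, here's a toy disk access machine in Python, my own sketch, not Aggarwal and Vitter's. The disk is a list of blocks of B items, the cache holds M over B blocks, and every block read or write counts as one memory transfer. For simplicity it assumes evicted blocks are clean (already written back if needed).

```python
from collections import OrderedDict

class DAM:
    # Toy disk access machine: count block transfers between disk and a
    # cache of M // B block slots. Computation inside the cache is free.
    def __init__(self, disk, B, M):
        self.disk, self.slots = disk, M // B
        self.cache = OrderedDict()               # block index -> contents
        self.transfers = 0

    def read(self, i):
        if i not in self.cache:
            if len(self.cache) == self.slots:
                self.cache.popitem(last=False)   # evict oldest (assumed clean)
            self.cache[i] = list(self.disk[i])
            self.transfers += 1                  # one block read
        return self.cache[i]

    def write(self, i):
        self.disk[i] = list(self.cache[i])
        self.transfers += 1                      # one block write

# Touching all N = 32 items in order, with B = 4, costs N / B = 8 transfers.
B, M, data = 4, 16, list(range(32))
dam = DAM([data[k:k + B] for k in range(0, len(data), B)], B, M)
total = sum(sum(dam.read(i)) for i in range(len(dam.disk)))
print(total, dam.transfers)                      # 496 8
```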
It's interesting to see: I'm going to talk about a lot of models today, and we're sort of in the middle of them at the moment, but only two have really caught on in a big way and led to lots and lots of papers. This is one of them.

So let me tell you some basic results and how to get them. A simple algorithmic technique in external memory is to scan. So here's my data. If I just want to read items in order and stop at some point, after N items, then that costs me order N over B memory transfers. That's optimal; I've got to read the data in. I can accumulate, add them up, multiply them together, whatever. One thing to be careful about is the plus 1, or you could put a ceiling on that. If N is a lot less than B, then this is not a good strategy. But as long as N is at least order B, that's really efficient.

More generally, instead of just one scan, you can run up to M over B parallel scans. Because for a scan, you really just need to know: what is my current block? And we can fit M over B blocks in our cache. And so we can advance this scan a little bit, advance this scan a little bit, advance this one, and go back and forth, in any kind of interleaving we want of those M over B scans. Some of them could be read scans, and some could be write scans; some of them can go backwards, and some can go forwards. There are a lot of options here. And in particular, you can do something like this: given a little bit fewer than M over B lists of total size N, you can merge them all together. If they're sorted lists, you can merge them into one sorted list in optimal N over B time. So that's good. We'll use that in a moment.
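Here's what that multi-way merge looks like as a minimal in-memory Python sketch, my own illustration. In the model, each input list needs only its current block in cache, so with k at most M over B lists these are k parallel read scans plus one write scan, merging N total items in order N over B transfers; the heap bookkeeping happens in cache for free.

```python
import heapq

def kway_merge(sorted_lists):
    # One cursor per list = one parallel scan per list in the model.
    heap = [(lst[0], i, 0) for i, lst in enumerate(sorted_lists) if lst]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, j = heapq.heappop(heap)      # smallest current head
        out.append(val)                      # the write scan
        if j + 1 < len(sorted_lists[i]):
            heapq.heappush(heap, (sorted_lists[i][j + 1], i, j + 1))
    return out

print(kway_merge([[1, 4, 7], [2, 5, 8], [3, 6, 9]]))  # [1, 2, ..., 9]
```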
Here I have a little bit of a thought experiment, originally by Lars Arge, who will be speaking later. You know, is this really a big deal? A factor of B doesn't sound so big. Do I care? For example, suppose I'm going to traverse a linked list in memory, but it's actually stored on disk. Is it really important that I sort that list and do a scan, versus jumping around with random access?

And this is back of the envelope, just computing what things ought to be. If you have about a gigabyte of data, a block size of 32 kilobytes, which is probably on the small side, and a 1 millisecond disk access time, which is really fast (usually it's at least 2 milliseconds), then if you do things in random order, on average every access is going to require a memory transfer. That'll take about 70 hours, three days. But if you do a scan, if you presorted everything and you do a scan, then it will only take you 32 seconds. So it's a factor of about 8,000 in time. Space is a lot bigger than we conceptualize, and it makes things that were impractical to do, say, daily, very practical. So that's why we're here.
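Redoing that back-of-the-envelope in Python, assuming 4-byte items (the item size isn't stated in the talk, so that's my assumption), the numbers land close to the ones quoted:

```python
# Back-of-the-envelope: 1 GB of data, 32 KB blocks, 1 ms per disk access.
items = 2**30 // 4             # ~a gigabyte of assumed 4-byte items
per_block = 32 * 2**10 // 4    # items per 32-kilobyte block
access_ms = 1                  # 1 ms per disk access

random_hours = items * access_ms / 1000 / 3600       # one transfer per item
scan_seconds = (items / per_block) * access_ms / 1000  # one per block
print(round(random_hours), round(scan_seconds))      # ~75 hours vs ~33 seconds
print(round(random_hours * 3600 / scan_seconds))     # a factor of ~8192
```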
Let's do another problem. How about search? Suppose I have the items in sorted order, and I want to do binary search. Well, the right thing is not binary search, but B-way search, so log base B of N. The plus 1 is to handle the case when B equals 1; then you want log base 2.

So we have our items, and we want to search. First, why is this the right bound? Why is this optimal? You can do an information theoretic argument in the comparison model, assuming you're just comparing items. Then, whenever you read in a block (say the blocks have already been sorted, and you read in some block), what you learn from looking at those B items is where your query element, x, fits among those B items. You already know everything about the B items and how they relate to each other, but you learn where x is. So that gives you log of B plus 1 bits of information, because there are B plus 1 places where x could be. And you need to learn log of N plus 1 bits in total: you want to know where x fits among all the items. And so you divide log of N plus 1 by log of B plus 1, and that's log base B plus 1 of N plus 1. So that's the lower bound.

And the upper bound, as you've probably guessed by now, is a B-tree. You just have B items per node, sort of uniformly distributed through the sorted list. And then once you get those items, you go to the appropriate subtree and recurse. The height of such a tree is log base B plus 1 of N, and so it works. B-trees have the nice property that you can also do insertions and deletions in the same amount of time, though that's no longer so optimal. For searches, this is the right answer.
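Here's a minimal B-way search over a sorted array, my own Python sketch assuming distinct keys, rather than a full B-tree. Each iteration of the while loop reads one "block" of B separator keys, which is one memory transfer in the model, and shrinks the candidate range by roughly a factor of B + 1, so there are about log base B + 1 of N transfers.

```python
def b_way_search(a, x, B):
    # Search for x in the sorted list a (distinct keys assumed).
    lo, hi = 0, len(a)                 # current candidate range a[lo:hi]
    while hi - lo > B:
        # One block read: B separators, evenly spaced through the range.
        seps = [lo + (k + 1) * (hi - lo) // (B + 1) for k in range(B)]
        new_lo, new_hi = lo, hi
        for s in seps:                 # find which of the B+1 gaps holds x
            if a[s] <= x:
                new_lo = s
            else:
                new_hi = s
                break
        lo, hi = new_lo, new_hi
    for i in range(lo, hi):            # final block: scan <= B leftovers
        if a[i] == x:
            return i
    return None

a = list(range(0, 1000, 3))            # 334 sorted, distinct keys
print(b_way_search(a, 501, B=10))      # 167, since a[167] == 501
print(b_way_search(a, 500, B=10))      # None: 500 isn't in the list
```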
So the next thing you might want to do: I keep saying, assume it's sorted. I'd really like some sorted data, please. So how do I sort my data? I think the Aggarwal and Vitter paper has this fun quote about how, today, one fourth of all computation is sorting, and some machines are devoted entirely to sorting. It was the problem of the day; everyone was sorting. I assume people still sort, but I'm guessing it's not the dominant activity anymore. And it's a big deal: can I sort within one day, so that all the stuff that I learned today, all the transactions that happened today, I can sort them?

So it turns out the right answer, the sorting bound, is N over B log base M over B of N over B. If you haven't seen that, it looks kind of like a big thing, but those of us in the know can recite it in our sleep. It comes up all over the place; lots of problems are as hard as sorting and can be solved in the sorting bound.

To go back to the problem I was talking about in Floyd's model, the permutation problem (I know the permutation, I know where things are supposed to go, I just need to move them there physically), there it's slightly better. You have the sorting bound, which is essentially what we had before, but in some cases just doing the naive thing is better: sometimes it's better to just take every item and stick it where it belongs with completely random access. So you can always do it, of course, in N memory transfers, and sometimes that is slightly better than the sorting bound, because you don't have the log term. And so the minimum of those two is the right answer to Floyd's problem. He got the upper bound right (in his case, M over B is 3, so the log base M over B is just log base 2), but he missed this other term.

OK. So why is the sorting bound correct? I won't go through the permutation bound. The upper bound is clear. Information-theoretically, it's very easy to see why you can't do better than the sorting bound. Let's set up a little bit of ground rules. Let's suppose that whatever you have in cache, you sort it. Because why not? This is only going to help you, and everything you do in cache is free. So always keep the cache sorted. And to clean up the information that's around, I'm going to first do a pass where I read a block, sort the block, stick it back out, and repeat. So each block is presorted, and there's no sorting information inside a block; it's all about how blocks compare to each other.

So when I read a block (let's say this is my cache, and a new block comes in here), what I learn is where those B items live among the M items that I already had. It's just like the analysis before, except now I'm reading B items among M, instead of one item among B. And so the number of possible outcomes is M plus B choose B: you have M plus B things, and B of them are marked as having come from the new block. You take the log of that, and you get basically B log M over B bits learned from each step.

And the total number of bits we need to learn is N log N, as you know. But we already knew a few bits from that presorting pass at the beginning: we knew N log B bits, because each block of B things was presorted.
We have B log B per block, there are N over B blocks, so it's N log B in total. So we need to learn N log N minus N log B bits, which is N log of N over B. And in each step, we learn B log M over B. So you divide those two things, and you get N over B log base M over B of N over B. It's a good exercise in log rules and information theory. But now you see it's sort of the obvious bound, once you check how many bits you're learning in each step.

OK. How do we achieve this bound? What's an upper bound? I'm going to show you two ways to do it. The easy one, to me the conceptually easiest, is mergesort. They're actually kind of symmetric. So you probably know binary mergesort: you take your items, split them in half, recursively sort, merge. But we know that we can merge M over B sorted lists in linear time as well, meaning N over B time. So instead of doing binary mergesort, where we split in half, we're going to split into M over B equal-sized pieces, recursively sort them all, and then merge.

And the recurrence we get from that is (did I get this right? Yeah) M over B subproblems, each of size a factor of M over B smaller than N, and then to do the merge, we pay N over B plus 1. The plus 1 won't end up mattering; to make it not matter, we need to use a base case for this recurrence that's not 1, but B. B will work. You could also do M, but it doesn't really help you. Once we get down to a single block, of course, we can sort in constant time: read it in, sort it, write it back out.
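Here's that M over B-way mergesort as a compact Python sketch, my own illustration; real external mergesorts stream runs from disk, but the recursion structure and the base case are the point here. Single blocks are sorted "in cache" for free, and heapq.merge plays the role of the M over B parallel scans.

```python
import heapq

def external_mergesort(items, B, M):
    # (M/B)-way mergesort: a single block is a base case (sorted in cache);
    # otherwise split into M // B pieces, sort recursively, k-way merge.
    if len(items) <= B:
        return sorted(items)                  # one block: free in the model
    k = max(2, M // B)
    size = -(-len(items) // k)                # ceiling division
    runs = [external_mergesort(items[i:i + size], B, M)
            for i in range(0, len(items), size)]
    return list(heapq.merge(*runs))           # the M/B parallel scans

print(external_mergesort([5, 3, 8, 1, 9, 2, 7, 4, 6, 0], B=2, M=8))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], with recursion depth log_{M/B}(N/B)
```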
So you want to solve this recurrence. The easy way is to draw a recursion tree. At the root, you have a problem of size N, and we're paying N over B to solve it. We have branching factor M over B, and at the leaves, we have problems of size B, each of which has constant cost. I'm removing the big Os to make this diagram both more legible and more correct, because you can't use big Os when you're using dot dot dot. So no big Os for you.

So then you sum these level by level, and you see we have conservation of mass. We have N things here; we still have N things, they just got distributed. They're all being divided by B, so by linearity you get N over B at every level, including the leaves. The leaves you have to check specially, but there are indeed N over B leaves, because we stop when we get down to size B. So you add this up; we just need to know how many levels there are, and that is log base M over B of N over B, because there are N over B leaves and the branching factor is M over B. So you multiply, and you're done. Easy.

So mergesort is pretty cool, and this works really well in practice. It revolutionized the world of sorting in 1988. Here's a different approach, the inverse, more like quicksort, the one that you know is guaranteed to run in [INAUDIBLE] log N usually. Here, you can't do binary quicksort; you do square root of M over B-way quicksort. The square root is necessary just to do step one.

So step one is, I need to split. Now, I'm not splitting my list into arbitrary chunks: in the answer, the sorted answer, I need to find things that are evenly spaced. That's the hard part. Usually, you find the median to do this, but now we have to find square root of M over B median-like elements spread out through the answer. And we don't know the answer, so it's a little tricky.

Then, once we have those partition elements, we can just do it. This is the square root of M over B-way scan again: you scan through the data; for each item, you see how it compares to the partition elements (there aren't very many of them), and then you write it out to the corresponding list. You get square root of M over B plus 1 lists. And so that's efficient, because it's just a scan, or parallel scans.
794 00:33:35,960 --> 00:33:38,180 And then you recurse, and there's no combination.
795 00:33:38,180 --> 00:33:40,040 There's no merging to do.
796 00:33:40,040 --> 00:33:41,540 Once you've got them set up there,
797 00:33:41,540 --> 00:33:43,350 you recursively sort, and you're done.
798 00:33:43,350 --> 00:33:46,790 So the recurrence is exactly the same as mergesort.
799 00:33:46,790 --> 00:33:49,070 And the hard part is, how do you do this partitioning?
800 00:33:49,070 --> 00:33:51,162 And I'll just quickly sketch that.
801 00:33:51,162 --> 00:33:53,120 This is probably the most complicated algorithm
802 00:33:53,120 --> 00:33:56,330 in these slides.
803 00:33:56,330 --> 00:33:57,710 I'll tell you the algorithm.
804 00:33:57,710 --> 00:34:02,990 Exactly why it works will be familiar if you know the Blum
805 00:34:02,990 --> 00:34:07,090 et al. linear-time median-finding algorithm
806 00:34:07,090 --> 00:34:09,540 for regular internal memory.
807 00:34:09,540 --> 00:34:12,600 Here's what we're going to do.
808 00:34:12,600 --> 00:34:17,513 We're going to read M items into our cache and sort them.
809 00:34:17,513 --> 00:34:19,429 So that's a piece of the answer in some sense.
810 00:34:19,429 --> 00:34:21,000 But how it relates to the answer,
811 00:34:21,000 --> 00:34:23,360 which subset of the answer it is, we don't know.
812 00:34:23,360 --> 00:34:26,840 Sample that piece of the answer like this.
813 00:34:26,840 --> 00:34:30,290 Every square root of M over B items, take one guy.
814 00:34:30,290 --> 00:34:32,540 Spit that into an output list of samples.
815 00:34:32,540 --> 00:34:34,929 Do this over and over for all the items--
816 00:34:34,929 --> 00:34:37,699 read in M, sort, sample, spit out--
817 00:34:37,699 --> 00:34:39,770 and you end up with this many items.
818 00:34:39,770 --> 00:34:43,167 This is basically a trick to shrink your input.
819 00:34:43,167 --> 00:34:45,500 So now, we can do inefficient things on this many items,
820 00:34:45,500 --> 00:34:47,690 because there aren't that many of them.
821 00:34:47,690 --> 00:34:49,969 So what do we do?
822 00:34:49,969 --> 00:34:54,500 We just run the regular linear-time selection algorithm
823 00:34:54,500 --> 00:34:57,590 that you know and love from algorithms class
824 00:34:57,590 --> 00:35:02,330 to find the right item.
825 00:35:02,330 --> 00:35:07,370 So if you were splitting into four pieces,
826 00:35:07,370 --> 00:35:10,370 then you'd want the 25%, 50%, and 75% elements.
827 00:35:10,370 --> 00:35:12,684 You know how to find each of those in linear time.
828 00:35:12,684 --> 00:35:14,100 And it turns out if you re-analyze
829 00:35:14,100 --> 00:35:16,040 the regular linear-time selection, indeed,
830 00:35:16,040 --> 00:35:19,010 it runs in N over B time in external memory.
831 00:35:19,010 --> 00:35:21,177 So that's great.
832 00:35:21,177 --> 00:35:23,510 But now, we're doing this just repeatedly, over and over.
833 00:35:23,510 --> 00:35:24,860 You find the 25%.
834 00:35:24,860 --> 00:35:25,670 You find the 50%.
835 00:35:25,670 --> 00:35:27,450 For each of them, you spend linear time.
836 00:35:27,450 --> 00:35:29,300 But you multiply it out.
837 00:35:29,300 --> 00:35:31,400 You're only finding square root of M over B of them.
838 00:35:31,400 --> 00:35:35,799 And linear time on the samples is not N over B; it's N divided by this mess.
839 00:35:35,799 --> 00:35:37,340 You multiply them out, it disappears.
840 00:35:37,340 --> 00:35:39,860 You end up in regular linear time,
841 00:35:39,860 --> 00:35:42,500 N over B. You find a good set of partitions-- see the sketch below.
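A rough Python sketch of that partition-finding step, with hypothetical names: `select` stands in for the Blum et al. linear-time selection routine (plain sorting is used there only to keep the sketch short), and the actual block-by-block I/O is elided.

    def select(items, i):
        # Stand-in for linear-time selection: the i-th smallest item.
        # A real implementation would use median-of-medians.
        return sorted(items)[i]

    def find_partitions(data, M, k):
        """Find k approximately evenly spaced partition elements.

        data: all N input items, conceptually read M at a time.
        M:    how many items fit in the cache.
        k:    number of pivots wanted, ~sqrt(M/B) in the real algorithm.
        """
        stride = max(1, int(M ** 0.5))  # stand-in for sqrt(M/B)
        samples = []
        for start in range(0, len(data), M):
            chunk = sorted(data[start:start + M])  # fits in cache: sort it
            samples.extend(chunk[::stride])        # keep every stride-th item
        # The sample set is small, so run selection on it repeatedly
        # to pull out the 1/(k+1), 2/(k+1), ... quantiles as pivots.
        n = len(samples)
        return [select(samples, i * n // (k + 1)) for i in range(1, k + 1)]

Together with the distribution scan sketched above, that's the whole algorithm.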
842 00:35:42,500 --> 00:35:45,041 Why this is a good set is not totally clear.
843 00:35:45,041 --> 00:35:46,040 I won't justify it here.
844 00:35:46,040 --> 00:35:50,170 But it is good, so don't worry.
845 00:35:50,170 --> 00:35:51,710 OK.
846 00:35:51,710 --> 00:35:54,230 One embellishment to the external memory model
847 00:35:54,230 --> 00:35:59,700 before I go on is to distinguish-- not just saying,
848 00:35:59,700 --> 00:36:01,970 oh, well, every block is equally good.
849 00:36:01,970 --> 00:36:04,370 You want to count how many blocks you read.
850 00:36:04,370 --> 00:36:06,620 When you read one item, you get the whole block.
851 00:36:06,620 --> 00:36:07,934 And you'd better use that block.
852 00:36:07,934 --> 00:36:09,350 But you can furthermore say, well,
853 00:36:09,350 --> 00:36:11,960 it would be really good if I read a whole bunch of blocks
854 00:36:11,960 --> 00:36:12,930 in sequence.
855 00:36:12,930 --> 00:36:15,650 There are lots of reasons for this. In particular,
856 00:36:15,650 --> 00:36:17,990 disks are really good at sequential access,
857 00:36:17,990 --> 00:36:19,220 because they're spinning.
858 00:36:19,220 --> 00:36:21,769 It's very easy to seek to the thing right after you.
859 00:36:21,769 --> 00:36:23,810 First of all, it's easy to read the entire track,
860 00:36:23,810 --> 00:36:25,018 the whole circle of the disk.
861 00:36:25,018 --> 00:36:29,110 And it's easy to keep moving along like that.
862 00:36:29,110 --> 00:36:31,370 So here's a model that captures the idea
863 00:36:31,370 --> 00:36:35,756 that sequential block reads or writes are better than random.
864 00:36:35,756 --> 00:36:37,130 So here's the idea of sequential.
865 00:36:37,130 --> 00:36:44,600 If you read M items, so you read M over B blocks in sequence,
866 00:36:44,600 --> 00:36:47,030 then each of those is considered to be a sequential memory
867 00:36:47,030 --> 00:36:47,960 transfer.
868 00:36:47,960 --> 00:36:51,260 If you break that sequence, then you're starting a new sequence.
869 00:36:51,260 --> 00:36:53,540 Or it's just a random access if you don't fall
870 00:36:53,540 --> 00:36:56,250 into a big run like this.
871 00:36:56,250 --> 00:36:58,650 So there are a couple of results in this model.
872 00:36:58,650 --> 00:37:02,750 One is this harder version of external memory.
873 00:37:02,750 --> 00:37:05,180 So one thing is, what about sorting?
874 00:37:05,180 --> 00:37:06,335 We just covered sorting.
875 00:37:06,335 --> 00:37:09,320 It turns out those are pretty random-access in the algorithms
876 00:37:09,320 --> 00:37:10,100 we saw.
877 00:37:10,100 --> 00:37:15,410 But if you use binary mergesort, it is sequential.
878 00:37:15,410 --> 00:37:18,505 As you binary merge, things are good.
879 00:37:18,505 --> 00:37:19,880 And that's, essentially, the best
880 00:37:19,880 --> 00:37:22,070 you can do, surprisingly, in this model.
881 00:37:22,070 --> 00:37:27,650 If you want the number of random memory transfers
882 00:37:27,650 --> 00:37:29,900 to be little o of the sorting bound--
883 00:37:29,900 --> 00:37:32,300 so you want more than a constant fraction
884 00:37:32,300 --> 00:37:35,660 to be sequential-- then you need to use
885 00:37:35,660 --> 00:37:38,930 at least this many total memory transfers.
886 00:37:38,930 --> 00:37:44,390 And so binary mergesort is optimal in this model,
887 00:37:44,390 --> 00:37:47,780 assuming you want a reasonable number of sequential accesses.
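To make the accounting concrete, here's a tiny Python sketch of one plausible reading of this model: the rule that a run of up to M/B consecutive blocks counts as sequential is from the slide, while treating the transfer that starts a run as random, and all the names, are assumptions of the sketch.

    def count_transfers(block_trace, M_over_B):
        """Classify each block transfer in a trace as sequential or random.

        A transfer is sequential if it touches the block right after the
        previous one and the current run is shorter than M/B blocks;
        otherwise it breaks the run and counts as a random transfer.
        """
        sequential = random = 0
        prev, run = None, 0
        for b in block_trace:
            if prev is not None and b == prev + 1 and run < M_over_B:
                sequential += 1
                run += 1
            else:
                random += 1
                run = 1
            prev = b
        return sequential, random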
888 00:37:47,780 --> 00:37:50,060 And the main point of this paper was
889 00:37:50,060 --> 00:37:51,579 to solve suffix-tree construction
890 00:37:51,579 --> 00:37:52,370 in external memory.
891 00:37:52,370 --> 00:37:54,740 And what they prove is it reduces to sorting,
892 00:37:54,740 --> 00:37:56,000 essentially, and scans.
893 00:37:56,000 --> 00:37:57,740 And scans are good.
894 00:37:57,740 --> 00:38:00,170 So you get this exact same trade-off
895 00:38:00,170 --> 00:38:04,210 for suffix-tree construction-- a fair representation?
896 00:38:04,210 --> 00:38:08,391 I have to be careful, because so many of the authors are in this room.
897 00:38:08,391 --> 00:38:08,890 Cool.
898 00:38:08,890 --> 00:38:10,727 So let's move on to a different model.
899 00:38:10,727 --> 00:38:12,310 This is a model that did not catch on.
900 00:38:12,310 --> 00:38:13,810 But it's fun for historical reasons
901 00:38:13,810 --> 00:38:18,070 to see what it was about.
902 00:38:18,070 --> 00:38:20,960 You can see two issues here.
903 00:38:20,960 --> 00:38:23,800 One is, what about a deeper memory hierarchy?
904 00:38:23,800 --> 00:38:25,630 Two levels is nice.
905 00:38:25,630 --> 00:38:28,480 Yeah, in practice, two levels are all that matter.
906 00:38:28,480 --> 00:38:31,420 But we should really understand multiple levels.
907 00:38:31,420 --> 00:38:33,290 Surely, there's a clean way to do that.
908 00:38:33,290 --> 00:38:36,190 And so there are a bunch of models that try to do this.
909 00:38:36,190 --> 00:38:38,950 And by the end, we get something that's reasonable.
910 00:38:38,950 --> 00:38:43,060 And HMM is probably one of my favorite weird models.
911 00:38:43,060 --> 00:38:44,890 It's "particularly simple."
912 00:38:44,890 --> 00:38:47,670 This is a quote from their own paper,
913 00:38:47,670 --> 00:38:48,910 not that they're boastful.
914 00:38:48,910 --> 00:38:49,870 It is a simple model.
915 00:38:49,870 --> 00:38:51,850 This is true.
916 00:38:51,850 --> 00:38:55,390 And it does model, in some sense, a larger hierarchy.
917 00:38:55,390 --> 00:38:57,430 But the way it's phrased initially
918 00:38:57,430 --> 00:39:01,290 doesn't look like this picture, but they're equivalent.
919 00:39:01,290 --> 00:39:02,440 So it's a RAM model.
920 00:39:02,440 --> 00:39:05,150 So your memory is an array.
921 00:39:05,150 --> 00:39:09,490 If you want to access position x in the array, you pay f of x.
922 00:39:09,490 --> 00:39:12,970 And in the original definition, that's just log x.
923 00:39:12,970 --> 00:39:15,670 So what that corresponds to is the first item is free.
924 00:39:15,670 --> 00:39:17,110 The second item costs 1.
925 00:39:17,110 --> 00:39:19,240 The next two items cost 2.
926 00:39:19,240 --> 00:39:20,800 The next four items cost 3.
927 00:39:20,800 --> 00:39:23,230 The next eight items cost 4, and so on.
928 00:39:23,230 --> 00:39:26,470 So it's exactly this kind of memory hierarchy.
929 00:39:26,470 --> 00:39:27,820 And you can move items.
930 00:39:27,820 --> 00:39:28,490 You can copy.
931 00:39:28,490 --> 00:39:31,180 And you can do all the things you can do in a RAM.
932 00:39:31,180 --> 00:39:34,560 So this is a pretty good model of hierarchical memory.
933 00:39:34,560 --> 00:39:36,310 It's just a little hard.
934 00:39:36,310 --> 00:39:39,310 So, originally, they defined it with log x
935 00:39:39,310 --> 00:39:42,730 based on this book, which is the classic reference on VLSI
936 00:39:42,730 --> 00:39:44,050 at the time by Mead and Conway.
937 00:39:44,050 --> 00:39:47,380 It sort of revolutionized teaching VLSI.
938 00:39:47,380 --> 00:39:49,990 And it has this particular construction
939 00:39:49,990 --> 00:39:52,060 of a hierarchical RAM.
940 00:39:52,060 --> 00:39:54,200 I don't know if RAMs are actually built this way.
941 00:39:54,200 --> 00:39:57,100 But they have a sketch of how to do it
942 00:39:57,100 --> 00:40:00,500 that achieves logarithmic performance.
943 00:40:00,500 --> 00:40:05,980 The deeper you are, the more you pay--
944 00:40:05,980 --> 00:40:08,560 the bigger your space is, you need
945 00:40:08,560 --> 00:40:11,940 to pay logarithmically to access it.
946 00:40:11,940 --> 00:40:12,460 OK.
947 00:40:12,460 --> 00:40:14,650 So here are the results that they get in this model.
948 00:40:14,650 --> 00:40:15,816 I'm not going to prove them,
949 00:40:15,816 --> 00:40:18,970 because, again, they follow from the external-memory results in some sense.
950 00:40:18,970 --> 00:40:23,002 But you've got matrix multiplication, FFT, sorting,
951 00:40:23,002 --> 00:40:25,210 scanning, binary search-- a lot of the usual problems.
952 00:40:25,210 --> 00:40:31,150 You get kind of weird running times-- logs, log-logs, and so on.
953 00:40:31,150 --> 00:40:33,280 Here, it's a matter of slowdown versus speedup,
954 00:40:33,280 --> 00:40:36,864 because everything is going to cost more than constant now.
955 00:40:36,864 --> 00:40:38,280 So you want to minimize slowdowns.
956 00:40:38,280 --> 00:40:39,350 Sometimes you get constant.
957 00:40:39,350 --> 00:40:41,110 The worst slowdown you can get is log N,
958 00:40:41,110 --> 00:40:43,180 because everything can be accessed in, at most,
959 00:40:43,180 --> 00:40:44,870 log N time in this model.
960 00:40:44,870 --> 00:40:49,870 But I would say setting f of x to be log x doesn't really
961 00:40:49,870 --> 00:40:51,670 reveal what we care about.
962 00:40:51,670 --> 00:40:54,580 But in the same paper, they give a better perspective
963 00:40:54,580 --> 00:40:56,150 on their own work.
964 00:40:56,150 --> 00:40:59,290 So they say, well, let's look at the general case.
965 00:40:59,290 --> 00:41:00,890 Maybe log x isn't the right thing.
966 00:41:00,890 --> 00:41:02,890 Let's look at an arbitrary f of x.
967 00:41:02,890 --> 00:41:04,390 Well, you could write an arbitrary f
968 00:41:04,390 --> 00:41:08,320 of x as a weighted sum of threshold functions.
969 00:41:08,320 --> 00:41:10,300 I want to know: is x bigger than xi?
970 00:41:10,300 --> 00:41:12,910 If so, I pay wi.
971 00:41:12,910 --> 00:41:15,989 Well, that is just like this picture.
972 00:41:15,989 --> 00:41:17,530 Any function can be written like that
973 00:41:17,530 --> 00:41:19,390 if it's a discrete function.
974 00:41:19,390 --> 00:41:21,530 But you can also think of it in this form
975 00:41:21,530 --> 00:41:23,530 if the xi's are sorted.
976 00:41:23,530 --> 00:41:26,410 After you get beyond x0 items, you pay w0.
977 00:41:26,410 --> 00:41:31,010 After you get beyond x1 items total, you pay w1, and so on.
978 00:41:31,010 --> 00:41:33,400 So this gives you an arbitrary memory hierarchy,
979 00:41:33,400 --> 00:41:35,134 even with growing and shrinking sizes,
980 00:41:35,134 --> 00:41:36,550 which you'd never see in practice.
981 00:41:36,550 --> 00:41:38,810 But this is the general case.
982 00:41:38,810 --> 00:41:40,630 And we are going to assume here that f
983 00:41:40,630 --> 00:41:44,720 is polynomially bounded to make these functions reasonable-- see the formula below.
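In symbols, the decomposition and the assumption just described are the following (a sketch, with the bracket denoting an indicator function; the notation is mine):

    f(x) \;=\; \sum_i w_i \, [\, x > x_i \,],
    \qquad
    f(2x) \;=\; O\bigl(f(x)\bigr) \quad \text{(polynomially bounded)}.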
984 00:41:44,720 --> 00:41:47,230 So when you double the input, you only change the output
985 00:41:47,230 --> 00:41:48,190 by a constant factor.
986 00:41:51,350 --> 00:41:52,610 OK.
987 00:41:52,610 --> 00:41:53,110 Fine.
988 00:41:53,110 --> 00:41:55,060 So we have to solve this weighted sum.
989 00:41:55,060 --> 00:41:57,010 But let's just look at one of these.
990 00:41:57,010 --> 00:41:59,284 This is kind of the canonical function.
991 00:41:59,284 --> 00:42:00,950 The rest is just a weighted sum of them.
992 00:42:00,950 --> 00:42:03,074 And if you assume this polynomially bounded property,
993 00:42:03,074 --> 00:42:05,860 really it suffices to look at this.
994 00:42:05,860 --> 00:42:12,430 So this is called f sub M. We pay 1
995 00:42:12,430 --> 00:42:16,970 to access anything beyond M. And we pay 0 otherwise.
996 00:42:16,970 --> 00:42:20,230 So they've taken general f with this deep hierarchy,
997 00:42:20,230 --> 00:42:26,170 and they've reduced it to this model, the red-blue pebble
998 00:42:26,170 --> 00:42:28,232 game, which we've already seen.
999 00:42:28,232 --> 00:42:30,190 I don't know if they mentioned this explicitly,
1000 00:42:30,190 --> 00:42:32,440 but it's the same model again.
1001 00:42:32,440 --> 00:42:35,350 And that's good, because a lot of problems-- well,
1002 00:42:35,350 --> 00:42:36,800 they hadn't been solved exactly.
1003 00:42:36,800 --> 00:42:38,674 I would say, now, this paper is the first one
1004 00:42:38,674 --> 00:42:40,780 to really say, OK, sorting, what's the best
1005 00:42:40,780 --> 00:42:43,660 way I can sort in this model?
1006 00:42:43,660 --> 00:42:45,030 And they get something.
1007 00:42:45,030 --> 00:42:46,430 Do I have it here?
1008 00:42:46,430 --> 00:42:47,060 Yeah.
1009 00:42:47,060 --> 00:42:49,150 They aim for uniform optimality.
1010 00:42:49,150 --> 00:42:51,700 This means there's one algorithm that
1011 00:42:51,700 --> 00:42:56,170 works optimally for this threshold function no matter
1012 00:42:56,170 --> 00:42:57,040 what M is.
1013 00:42:57,040 --> 00:42:58,450 The algorithm doesn't get to know
1014 00:42:58,450 --> 00:43:00,340 M. You might say the algorithm is
1015 00:43:00,340 --> 00:43:04,800 oblivious to M. Sound familiar?
1016 00:43:04,800 --> 00:43:05,960 So this is a cool idea.
1017 00:43:05,960 --> 00:43:07,880 Of course, it does not have blocking yet.
1018 00:43:07,880 --> 00:43:10,160 But none of these models has blocking.
1019 00:43:10,160 --> 00:43:12,160 But they prove that if you're uniformly optimal,
1020 00:43:12,160 --> 00:43:16,010 if you work in the red-blue pebble game model for all M
1021 00:43:16,010 --> 00:43:17,810 with one algorithm, then, in fact, you
1022 00:43:17,810 --> 00:43:20,540 are optimal for all f of x, which means,
1023 00:43:20,540 --> 00:43:23,840 in particular for the deep hierarchy, you also work.
1024 00:43:23,840 --> 00:43:27,200 And they achieve tight bounds for a bunch of problems here.
1025 00:43:27,200 --> 00:43:29,270 You should recognize all of these bounds
1026 00:43:29,270 --> 00:43:32,030 are now, in some sense, particular cases
1027 00:43:32,030 --> 00:43:33,950 of the external memory bounds.
1028 00:43:33,950 --> 00:43:35,482 So for sorting, you have this,
1029 00:43:35,482 --> 00:43:37,940 except there's no B. The B has disappeared, because there's
1030 00:43:37,940 --> 00:43:38,960 no B in this model.
1031 00:43:38,960 --> 00:43:41,120 But, otherwise, it is N over B log base M over B
1032 00:43:41,120 --> 00:43:44,030 of N over B with B set to 1, and so on down the line.
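Concretely, the sorting row reads as the external-memory sorting bound with the block size pinned to 1 (a correspondence sketched here in my notation):

    \Theta\!\left(\frac{N}{B}\,\log_{M/B}\frac{N}{B}\right)\Bigg|_{B=1}
    \;=\; \Theta\bigl(N \log_{M} N\bigr).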
1033 00:43:44,030 --> 00:43:47,930 They said, oh, search here is really bad, because caching
1034 00:43:47,930 --> 00:43:49,310 doesn't really help for search.
1035 00:43:49,310 --> 00:43:50,750 But blocks help for search.
1036 00:43:50,750 --> 00:43:53,270 So when there's no B, these are exactly the bounds
1037 00:43:53,270 --> 00:43:55,160 you get for external memory.
1038 00:43:55,160 --> 00:43:56,900 So I mean, some of these were known.
1039 00:43:56,900 --> 00:43:59,240 These were already known by Hong and Kung,
1040 00:43:59,240 --> 00:44:01,760 because it's the same special case.
1041 00:44:01,760 --> 00:44:04,379 And then the others followed from external memory.
1042 00:44:04,379 --> 00:44:05,420 But this is kind of neat.
1043 00:44:05,420 --> 00:44:10,040 They're doing it in a somewhat stronger sense, because it's
1044 00:44:10,040 --> 00:44:14,000 uniform, without knowing M. So the uniformity
1045 00:44:14,000 --> 00:44:16,110 doesn't follow from this.
1046 00:44:16,110 --> 00:44:17,240 But they get uniformity.
1047 00:44:17,240 --> 00:44:20,930 And therefore, it works for all f.
1048 00:44:20,930 --> 00:44:22,860 OK.
1049 00:44:22,860 --> 00:44:24,402 They had another fun fact, which will
1050 00:44:24,402 --> 00:44:25,776 look familiar to those of you who
1051 00:44:25,776 --> 00:44:28,050 know the cache-oblivious model, which we'll get to.
1052 00:44:28,050 --> 00:44:29,967 They have this observation that while we
1053 00:44:29,967 --> 00:44:32,550 have these algorithms that are explicitly moving things around
1054 00:44:32,550 --> 00:44:34,230 in our RAM, it would be nice if we
1055 00:44:34,230 --> 00:44:37,080 didn't have to write that down explicitly in the algorithm.
1056 00:44:37,080 --> 00:44:40,650 Could we just use least-recently-used replacement,
1057 00:44:40,650 --> 00:44:43,550 so move things forward?
1058 00:44:43,550 --> 00:44:45,900 That works great if you know what M is.
1059 00:44:45,900 --> 00:44:49,490 Then you say, OK, if I need to get something from out here,
1060 00:44:49,490 --> 00:44:50,670 I'll move it over here.
1061 00:44:50,670 --> 00:44:53,190 And whatever was least recently used, I'll kick out.
1062 00:44:53,190 --> 00:44:55,020 And at this point-- just a couple
1063 00:44:55,020 --> 00:44:56,280 of years prior to this paper--
1064 00:44:56,280 --> 00:44:59,910 Sleator and Tarjan did the first paper on competitive analysis.
1065 00:44:59,910 --> 00:45:02,470 And they proved that LRU or even first in,
1066 00:45:02,470 --> 00:45:05,310 first out is good in the sense that if you just
1067 00:45:05,310 --> 00:45:08,330 double the size of your cache--
1068 00:45:08,330 --> 00:45:09,870 oh, I got this backwards.
1069 00:45:09,870 --> 00:45:13,290 T LRU of twice the cache is, at most,
1070 00:45:13,290 --> 00:45:15,540 2 times T OPT of the original cache.
1071 00:45:15,540 --> 00:45:16,800 So the 2 should be over here-- written out below.
1072 00:45:20,100 --> 00:45:20,850 Great.
1073 00:45:20,850 --> 00:45:24,180 And assuming you have a polynomially bounded growth
1074 00:45:24,180 --> 00:45:27,130 function, then this is only losing a constant factor.
1075 00:45:27,130 --> 00:45:27,630 OK.
1076 00:45:27,630 --> 00:45:28,830 But we don't know what M is.
1077 00:45:28,830 --> 00:45:31,271 This works for the threshold function f sub M.
1078 00:45:31,271 --> 00:45:33,270 But it doesn't work for an arbitrary function f,
1079 00:45:33,270 --> 00:45:35,330 or it doesn't work uniformly.
1080 00:45:35,330 --> 00:45:36,700 And we want a uniform solution.
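The corrected Sleator–Tarjan resource-augmentation bound, written out (the statement is standard; the notation here is mine):

    T_{\mathrm{LRU}}(2M) \;\le\; 2 \, T_{\mathrm{OPT}}(M).

That is, LRU with a cache of twice the size incurs at most twice the memory transfers of the optimal offline replacement. Back to the uniform solution we wanted.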
1081 00:45:36,700 --> 00:45:37,830 And they gave one.
1082 00:45:37,830 --> 00:45:39,619 I'll just sketch it here.
1083 00:45:39,619 --> 00:45:41,535 The idea is you have this arbitrary hierarchy.
1084 00:45:41,535 --> 00:45:42,960 You don't really know it.
1085 00:45:42,960 --> 00:45:45,720 Here, I'm going to assume I do know what f is.
1086 00:45:45,720 --> 00:45:47,600 So this is not uniform.
1087 00:45:47,600 --> 00:45:49,530 It's achieved in a different way.
1088 00:45:49,530 --> 00:45:52,500 But I'm going to basically rearrange the structure
1089 00:45:52,500 --> 00:45:55,200 to be roughly exponential to say, well,
1090 00:45:55,200 --> 00:45:57,390 I'm going to measure f of x as x increases.
1091 00:45:57,390 --> 00:45:59,714 And whenever f of x doubles, I'll draw a line.
1092 00:45:59,714 --> 00:46:01,380 These are not where the real levels are.
1093 00:46:01,380 --> 00:46:02,790 It's just a conceptual thing.
1094 00:46:02,790 --> 00:46:04,980 And then I do LRU on this structure.
1095 00:46:04,980 --> 00:46:08,100 So if I want to access something here, I pull it out.
1096 00:46:08,100 --> 00:46:08,940 I stick it in here.
1097 00:46:08,940 --> 00:46:10,920 Whatever is least recently used gets kicked out here.
1098 00:46:10,920 --> 00:46:12,000 And whatever is least recently used
1099 00:46:12,000 --> 00:46:13,550 gets kicked out here, here, here.
1100 00:46:13,550 --> 00:46:15,485 And you do a chain of LRUs.
1101 00:46:15,485 --> 00:46:17,610 Then you can prove that is within a constant factor
1102 00:46:17,610 --> 00:46:21,830 of optimal, but you do have to pay a startup cost.
1103 00:46:21,830 --> 00:46:24,080 It's similar to the move-to-front analysis
1104 00:46:24,080 --> 00:46:26,330 from Sleator and Tarjan.
1105 00:46:26,330 --> 00:46:27,200 OK.
1106 00:46:27,200 --> 00:46:30,590 Enough about HMM, sort of.
1107 00:46:30,590 --> 00:46:32,740 The next model is called BT.
1108 00:46:32,740 --> 00:46:35,810 It's the same as HMM, but they add blocks.
1109 00:46:35,810 --> 00:46:39,020 But not the blocks that we know from computer architecture,
1110 00:46:39,020 --> 00:46:41,030 but a different kind of block thing.
1111 00:46:41,030 --> 00:46:43,040 It's kind of similar.
1112 00:46:43,040 --> 00:46:45,800 Probably, [INAUDIBLE] constant factors and not so different.
1113 00:46:45,800 --> 00:46:50,312 So you have the old thing: accessing x costs f of x.
1114 00:46:50,312 --> 00:46:51,770 But, now, you have a new operation,
1115 00:46:51,770 --> 00:46:53,950 which is I can copy any interval, which
1116 00:46:53,950 --> 00:46:57,110 would look something like this, from x minus delta to x.
1117 00:46:57,110 --> 00:47:00,260 And I can copy it to y minus delta to y.
1118 00:47:00,260 --> 00:47:05,210 And I pay the time to seek there, f of max of x and y.
1119 00:47:05,210 --> 00:47:06,920 Or you could do f of x plus f of y.
1120 00:47:06,920 --> 00:47:08,220 It doesn't matter.
1121 00:47:08,220 --> 00:47:09,720 And then you pay plus delta-- see the formula below.
1122 00:47:09,720 --> 00:47:12,915 So you can move a big chunk relatively quickly.
1123 00:47:12,915 --> 00:47:15,290 You just pay once to get there, and then you can move it.
1124 00:47:15,290 --> 00:47:18,560 This is a lot more reasonable than HMM.
1125 00:47:18,560 --> 00:47:21,890 But it makes things a lot messier is the short answer.
1126 00:47:21,890 --> 00:47:24,969 Because-- here's a block move--
1127 00:47:24,969 --> 00:47:26,510 these are the sort of bounds you get.
1128 00:47:26,510 --> 00:47:28,100 They depend now on f.
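Restating that block-copy cost in symbols (a sketch following the description above, in my notation):

    \mathrm{cost}\bigl([\,x-\delta,\,x\,] \to [\,y-\delta,\,y\,]\bigr)
    \;=\; f\bigl(\max(x, y)\bigr) + \delta .

Using f(x) + f(y) in place of the max changes the model by at most a constant factor.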
1129 00:47:28,100 --> 00:47:31,280 And you don't get the same kind of uniformity,
1130 00:47:31,280 --> 00:47:32,780 as far as I can tell.
1131 00:47:32,780 --> 00:47:35,150 You can't just say, oh, it works for all f.
1132 00:47:35,150 --> 00:47:39,080 For each of these problems, this is basically scanning or matrix
1133 00:47:39,080 --> 00:47:40,430 multiplication.
1134 00:47:40,430 --> 00:47:43,460 It doesn't matter much until f of x gets really big, and then
1135 00:47:43,460 --> 00:47:45,170 something changes.
1136 00:47:45,170 --> 00:47:47,810 For dot product, you get log star, log log, or log,
1137 00:47:47,810 --> 00:47:51,950 depending on whether your f of x is log, subpolynomial,
1138 00:47:51,950 --> 00:47:53,450 or linear.
1139 00:47:53,450 --> 00:47:55,130 So I find this kind of unsatisfying.
1140 00:47:55,130 --> 00:47:59,180 So I'm just going to move on to MH, which is probably
1141 00:47:59,180 --> 00:48:01,100 the messiest of the models.
1142 00:48:01,100 --> 00:48:04,050 But in some sense, it's the most realistic of the models.
1143 00:48:04,050 --> 00:48:05,660 Here's the picture which I would draw
1144 00:48:05,660 --> 00:48:07,960 if someone asked me to draw a general memory hierarchy.
1145 00:48:07,960 --> 00:48:09,990 I have a CPU that connects to this cache for free.
1146 00:48:09,990 --> 00:48:12,140 It has blocks of size B0.
1147 00:48:12,140 --> 00:48:16,250 And to go to the next memory, it costs me some time, t0.
1148 00:48:16,250 --> 00:48:20,180 And the blocks that I read here are of size B0, and I write blocks of size B0.
1149 00:48:20,180 --> 00:48:22,070 So the transfers here are size B0.
1150 00:48:22,070 --> 00:48:24,030 And the next level has potentially a different block size.
1151 00:48:24,030 --> 00:48:26,030 It has a different cache size, M1.
1152 00:48:26,030 --> 00:48:27,620 And you pay.
1153 00:48:27,620 --> 00:48:30,410 So these blocks are subdivided into B0-sized blocks,
1154 00:48:30,410 --> 00:48:31,700 which is what happens here.
1155 00:48:31,700 --> 00:48:34,089 This is a generic multi-level memory hierarchy picture.
1156 00:48:34,089 --> 00:48:36,380 It's the obvious extension of the external memory model
1157 00:48:36,380 --> 00:48:38,960 to arbitrarily many levels.
1158 00:48:38,960 --> 00:48:41,660 And-- making it not exactly easy to program--
1159 00:48:41,660 --> 00:48:43,660 all levels can be transferring at once.
1160 00:48:43,660 --> 00:48:48,500 This is realistic, but hard to manipulate.
1161 00:48:48,500 --> 00:48:52,610 And they thought, oh, well, L parameters
1162 00:48:52,610 --> 00:48:54,360 for an L-level hierarchy is too many.
1163 00:48:54,360 --> 00:48:58,340 So let's reduce it to two parameters and one function.
1164 00:48:58,340 --> 00:49:00,800 So assume that the B's grow exponentially--
1165 00:49:00,800 --> 00:49:03,300 that these things grow roughly the same way,
1166 00:49:03,300 --> 00:49:04,940 with some aspect ratio alpha.
1167 00:49:04,940 --> 00:49:06,260 And then the ti--
1168 00:49:06,260 --> 00:49:08,092 this is the part that's hard to guess--
1169 00:49:08,092 --> 00:49:09,050 it grows exponentially.
1170 00:49:09,050 --> 00:49:11,450 And then there's some f of i, which we don't know--
1171 00:49:11,450 --> 00:49:13,070 maybe it's log i.
1172 00:49:13,070 --> 00:49:16,010 And because of that, this doesn't really
1173 00:49:16,010 --> 00:49:17,420 clean up the model enough.
1174 00:49:17,420 --> 00:49:20,600 You get bounds which-- it's interesting.
1175 00:49:20,600 --> 00:49:23,720 You can say as long as f of i is, at most, something,
1176 00:49:23,720 --> 00:49:26,120 then we get optimal bounds.
1177 00:49:26,120 --> 00:49:29,190 But sometimes when f of i grows, things change.
1178 00:49:29,190 --> 00:49:31,280 And it's interesting.
1179 00:49:31,280 --> 00:49:32,870 These algorithms follow approaches
1180 00:49:32,870 --> 00:49:35,570 that we will see in a moment-- divide and conquer.
1181 00:49:35,570 --> 00:49:39,420 But it's hard to state what the answers are.
1182 00:49:39,420 --> 00:49:41,636 What's B4?
1183 00:49:41,636 --> 00:49:42,760 I think that's just a typo.
1184 00:49:42,760 --> 00:49:45,530 That should be blank.
1185 00:49:45,530 --> 00:49:48,520 I mean, it's hard to beat an upper bound of 1.
1186 00:49:48,520 --> 00:49:50,810 It also seems wrong.
1187 00:49:50,810 --> 00:49:53,280 Ignore that row.
1188 00:49:53,280 --> 00:49:53,840 All right.
1189 00:49:53,840 --> 00:49:58,430 Finally, we go to the cache-oblivious model by Frigo
1190 00:49:58,430 --> 00:50:01,140 et al. in 1999.
1191 00:50:01,140 --> 00:50:02,430 This is another clean model.
1192 00:50:02,430 --> 00:50:07,000 And this is the other of the two models that really caught on.
1193 00:50:07,000 --> 00:50:10,400 It's motivated by all the models you've just seen.
1194 00:50:10,400 --> 00:50:13,310 And in particular, it picks up on the other successful model,
1195 00:50:13,310 --> 00:50:15,350 the External Memory Model, and says, OK,
1196 00:50:15,350 --> 00:50:17,750 let's take the External Memory Model-- exactly the same cost
1197 00:50:17,750 --> 00:50:18,410 model.
1198 00:50:18,410 --> 00:50:21,325 But suppose your algorithm doesn't know B or M.
1199 00:50:21,325 --> 00:50:23,450 And we're going to analyze it in this model knowing
1200 00:50:23,450 --> 00:50:24,230 what B and M are.
1201 00:50:24,230 --> 00:50:26,690 But, really, one algorithm has to work for all B and M.
1202 00:50:26,690 --> 00:50:30,560 This is uniformity from the--
1203 00:50:30,560 --> 00:50:33,170 I can't even remember the model names--
1204 00:50:33,170 --> 00:50:36,920 not UMH, but the HMM model.
1205 00:50:36,920 --> 00:50:39,110 So it's taking that idea, but applying it
1206 00:50:39,110 --> 00:50:41,450 to a model that has blocking.
1207 00:50:44,510 --> 00:50:47,750 So for this to be meaningful, block transfers
1208 00:50:47,750 --> 00:50:48,650 have to be automatic,
1209 00:50:48,650 --> 00:50:51,174 because you can't manually move between here and here.
1210 00:50:51,174 --> 00:50:53,090 In HMM, you could manually move things around,
1211 00:50:53,090 --> 00:50:55,040 because your memory is just a sequential thing.
1212 00:50:55,040 --> 00:50:56,570 But now, you don't know where the cutoff
1213 00:50:56,570 --> 00:50:57,710 is between cache and disk.
1214 00:50:57,710 --> 00:50:59,750 So you can't manually manage your memory.
1215 00:50:59,750 --> 00:51:02,360 So you have to assume automatic block replacement.
1216 00:51:02,360 --> 00:51:04,950 But we already know LRU or FIFO is only going
1217 00:51:04,950 --> 00:51:06,350 to lose a constant factor.
1218 00:51:06,350 --> 00:51:09,560 So that's cool.
1219 00:51:09,560 --> 00:51:12,096 I like this model, because it's clean.
1220 00:51:12,096 --> 00:51:14,720 Also-- in a certain sense; it's a little hard to formalize this--
1221 00:51:14,720 --> 00:51:17,840 it works for changing B, because it works for all B.
1222 00:51:17,840 --> 00:51:20,450 And so you can imagine even if B is not a uniform thing--
1223 00:51:20,450 --> 00:51:23,830 like the sizes of tracks on a disk are varying,
1224 00:51:23,830 --> 00:51:26,500 because circles have different sizes--
1225 00:51:26,500 --> 00:51:29,230 it probably works well in that setting.
1226 00:51:29,230 --> 00:51:32,110 It also works if your cache gets smaller, because you've
1227 00:51:32,110 --> 00:51:34,600 got a competing process.
1228 00:51:34,600 --> 00:51:38,470 It'll just adjust, because the analysis will work.
1229 00:51:38,470 --> 00:51:40,750 And the other fun thing is even though you're
1230 00:51:40,750 --> 00:51:43,240 analyzing on a two-level memory hierarchy,
1231 00:51:43,240 --> 00:51:46,850 it works on an arbitrary memory hierarchy, this MH thing.
1232 00:51:46,850 --> 00:51:48,850 This is a clean way to tackle MH.
1233 00:51:48,850 --> 00:51:54,040 You just need a cache-oblivious solution.
1234 00:51:54,040 --> 00:51:55,450 Cool.
1235 00:51:55,450 --> 00:51:59,070 Because you can imagine the levels
1236 00:51:59,070 --> 00:52:01,570 to the left of some point and the levels to the right of that
1237 00:52:01,570 --> 00:52:02,130 point.
1238 00:52:02,130 --> 00:52:03,880 And the cache-oblivious analysis tells you
1239 00:52:03,880 --> 00:52:06,338 that the number of transfers over this boundary is optimal.
1240 00:52:06,338 --> 00:52:08,540 And if that's true for every boundary,
1241 00:52:08,540 --> 00:52:10,990 then the overall thing will be optimal,
1242 00:52:10,990 --> 00:52:15,040 just like for HMM uniformity.
1243 00:52:15,040 --> 00:52:15,670 OK.
1244 00:52:15,670 --> 00:52:17,740 Quickly, some techniques from the cache-oblivious world.
1245 00:52:17,740 --> 00:52:20,050 I don't have much time, so I will just
1246 00:52:20,050 --> 00:52:21,220 give you a couple of sketches.
1247 00:52:21,220 --> 00:52:23,980 Scanning is one that generalizes great from external memory.
1248 00:52:23,980 --> 00:52:25,730 Of course, every cache-oblivious algorithm
1249 00:52:25,730 --> 00:52:27,070 is an external memory algorithm also.
1250 00:52:27,070 --> 00:52:29,980 So we should first try all the external memory techniques.
1251 00:52:29,980 --> 00:52:31,210 You can scan.
1252 00:52:31,210 --> 00:52:33,204 You can't really do M over B parallel scans,
1253 00:52:33,204 --> 00:52:34,870 because you don't know what M over B is.
1254 00:52:34,870 --> 00:52:36,995 But you can do a constant number of parallel scans.
1255 00:52:36,995 --> 00:52:40,770 So you could at least merge two lists.
1256 00:52:40,770 --> 00:52:41,740 OK.
1257 00:52:41,740 --> 00:52:44,980 Searching-- so this is the analog of binary search.
1258 00:52:44,980 --> 00:52:48,987 You'd like to achieve log base B of N query time.
1259 00:52:48,987 --> 00:52:49,820 And you can do that.
1260 00:52:49,820 --> 00:52:52,940 And this is in Harald Prokop's master's thesis.
1261 00:52:52,940 --> 00:52:55,900 So the idea is pretty cool.
1262 00:52:55,900 --> 00:53:00,910 You imagine a binary search tree built on the items.
1263 00:53:00,910 --> 00:53:03,430 We can't do a B-way tree, because we don't know what B is.
1264 00:53:03,430 --> 00:53:05,740 But then we cut it at the middle level,
1265 00:53:05,740 --> 00:53:08,046 recursively store the top part, and then
1266 00:53:08,046 --> 00:53:09,670 recursively store all the bottom parts,
1267 00:53:09,670 --> 00:53:11,950 and get root N chunks of size root
1268 00:53:11,950 --> 00:53:15,610 N. Do that recursively, and you get some kind of layout like this-- sketched in code below.
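Here's a minimal Python sketch of that recursive layout (the van Emde Boas layout) on an implicit perfect binary tree; the names are made up, and a real implementation would write the nodes into an array rather than return index lists.

    def veb_layout(height):
        """Heap indices (root = 1) of a perfect binary tree with
        `height` levels, listed in van Emde Boas order."""
        def layout(root, h):
            if h == 1:
                return [root]
            top_h = h // 2               # cut at (roughly) the middle level
            bot_h = h - top_h
            order = layout(root, top_h)  # store the top piece first...
            for j in range(2 ** top_h):  # ...then each bottom piece in turn
                order += layout(root * 2 ** top_h + j, bot_h)
            return order
        return layout(1, height)

    # For height 4 this gives [1, 2, 3, 4, 8, 9, 5, 10, 11, 6, 12, 13, 7, 14, 15]:
    # each recursive triangle of nodes is stored consecutively.

The analysis that follows counts how many of these consecutively stored triangles a root-to-leaf search path crosses.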
1269 00:53:15,610 --> 00:53:17,290 And it turns out this works very well.
1270 00:53:17,290 --> 00:53:18,940 Because at some level of the recursion,
1271 00:53:18,940 --> 00:53:20,705 whatever B is-- you don't know it when
1272 00:53:20,705 --> 00:53:21,830 you're doing the recursion,
1273 00:53:21,830 --> 00:53:23,412 but B is something.
1274 00:53:23,412 --> 00:53:25,120 And if you look at the level of recursion
1275 00:53:25,120 --> 00:53:27,536 where you straddle B here, these things are size, at most,
1276 00:53:27,536 --> 00:53:30,670 B. And the next level up is size bigger than B.
1277 00:53:30,670 --> 00:53:34,625 Then you look at a root-to-leaf path here.
1278 00:53:34,625 --> 00:53:36,250 It's a matter of how many of these blue
1279 00:53:36,250 --> 00:53:38,032 triangles you visit.
1280 00:53:38,032 --> 00:53:39,490 Well, the height of a blue triangle
1281 00:53:39,490 --> 00:53:41,590 is going to be around half log B,
1282 00:53:41,590 --> 00:53:44,500 because we're dividing in half until we hit log B.
1283 00:53:44,500 --> 00:53:47,800 So we might overshoot by a factor of 2, but that's all.
1284 00:53:47,800 --> 00:53:50,644 And we only have to pay 2 memory transfers to visit each of these,
1285 00:53:50,644 --> 00:53:52,810 because we don't know how it's aligned with a block,
1286 00:53:52,810 --> 00:53:55,520 but, at most, it fits in 2 blocks, certainly.
1287 00:53:55,520 --> 00:53:57,940 It's stored consecutively by the recursion.
1288 00:53:57,940 --> 00:53:59,230 And so you divide.
1289 00:53:59,230 --> 00:54:00,900 I mean, the number of triangles on the path,
1290 00:54:00,900 --> 00:54:03,460 it's going to be log base B of N times 2.
1291 00:54:03,460 --> 00:54:04,250 We pay 2 each.
1292 00:54:04,250 --> 00:54:06,702 So we get an upper bound of 4 times log base B of N.
1293 00:54:06,702 --> 00:54:07,660 Not as good as B-trees.
1294 00:54:07,660 --> 00:54:10,270 B-trees get 1 times log base B of N. Here,
1295 00:54:10,270 --> 00:54:12,484 we get 4 times log base B of N. This problem has
1296 00:54:12,484 --> 00:54:13,150 been considered.
1297 00:54:13,150 --> 00:54:17,900 The right answer is log base 2 of e plus little o-- about 1.44-- times log base B of N.
1298 00:54:17,900 --> 00:54:18,880 And that is tight.
1299 00:54:18,880 --> 00:54:22,370 You can't do better than that bound.
1300 00:54:22,370 --> 00:54:24,520 So cache-oblivious loses a constant factor
1301 00:54:24,520 --> 00:54:27,220 relative to external memory for that problem.
1302 00:54:27,220 --> 00:54:28,702 You can also make this dynamic.
1303 00:54:28,702 --> 00:54:30,160 This is where a bunch of us started
1304 00:54:30,160 --> 00:54:33,880 getting involved in this world, in the cache-oblivious world.
1305 00:54:33,880 --> 00:54:39,850 And this is a sketch of one of the methods, I think this one.
1306 00:54:39,850 --> 00:54:41,314 That's the one I usually teach.
1307 00:54:41,314 --> 00:54:43,480 You might have guessed these are from lecture notes,
1308 00:54:43,480 --> 00:54:45,700 these handwritten things.
1309 00:54:45,700 --> 00:54:47,300 I'll plug that in in a second.
1310 00:54:47,300 --> 00:54:50,260 So sorting is trickier.
1311 00:54:50,260 --> 00:54:51,760 There is an analog to mergesort.
1312 00:54:51,760 --> 00:54:54,290 There is an analog to distribution sort.
1313 00:54:54,290 --> 00:54:55,600 They achieve the sorting bound.
1314 00:54:55,600 --> 00:54:58,100 But they do need an assumption, this tall-cache assumption.
1315 00:54:58,100 --> 00:54:59,850 It's a little different from the last one.
1316 00:54:59,850 --> 00:55:01,720 This is a stronger assumption than before.
1317 00:55:01,720 --> 00:55:05,230 It says the cache is taller than it is wide, roughly,
1318 00:55:05,230 --> 00:55:07,600 up to some epsilon exponent.
1319 00:55:07,600 --> 00:55:11,290 So this is saying M over B is at least B to the epsilon.
1320 00:55:11,290 --> 00:55:14,000 Most caches have that property, so it's not that big a deal.
1321 00:55:14,000 --> 00:55:15,375 But you can prove it's necessary.
1322 00:55:15,375 --> 00:55:17,980 If you don't have it, you can't achieve the sorting bound.
1323 00:55:17,980 --> 00:55:20,320 You could also prove you cannot achieve the permutation bound,
1324 00:55:20,320 --> 00:55:21,569 because you can't do that min--
1325 00:55:21,569 --> 00:55:26,170 you don't know which is better. Same paper.
1326 00:55:26,170 --> 00:55:28,770 Finally, I wanted to plug this class.
1327 00:55:28,770 --> 00:55:31,420 It just got released, if you're interested.
1328 00:55:31,420 --> 00:55:32,810 It's advanced data structures.
1329 00:55:32,810 --> 00:55:35,602 There are video lectures for free streaming online.
1330 00:55:35,602 --> 00:55:37,810 There are three lectures about cache-oblivious stuff,
1331 00:55:37,810 --> 00:55:39,510 mostly on the data structure side, because it's
1332 00:55:39,510 --> 00:55:40,490 a data structures class.
1333 00:55:40,490 --> 00:55:42,323 But if you're interested in data structures,
1334 00:55:42,323 --> 00:55:43,900 you should check it out.
1335 00:55:43,900 --> 00:55:47,025 That is the end of my summary of a zillion models.
1336 00:55:47,025 --> 00:55:48,525 The ones to keep in mind, of course,
1337 00:55:48,525 --> 00:55:49,930 are external memory and cache-oblivious.
1338 00:55:49,930 --> 00:55:51,221 But the others are kind of fun.
1339 00:55:51,221 --> 00:55:54,550 And you really see the genesis of how this was
1340 00:55:54,550 --> 00:55:56,170 the union of these two models.
1341 00:55:56,170 --> 00:55:58,660 And this was sort of the culmination of this effort
1342 00:55:58,660 --> 00:56:02,170 to do multilevel in a clean way.
1343 00:56:02,170 --> 00:56:05,380 So I learned a lot looking at all these papers.
1344 00:56:05,380 --> 00:56:06,410 Hope you enjoyed it.
1345 00:56:06,410 --> 00:56:07,045 Thanks.
1346 00:56:07,045 --> 00:56:11,005 [APPLAUSE]
1347 00:56:11,005 --> 00:56:13,480 PROFESSOR: Are there any questions?
1348 00:56:13,480 --> 00:56:17,440 AUDIENCE: So all these are order-of-magnitude bounds.
1349 00:56:17,440 --> 00:56:20,424 I'm wondering about the constant factors.
1350 00:56:20,424 --> 00:56:22,090 ERIK DEMAINE: Are you guys going to talk
1351 00:56:22,090 --> 00:56:23,920 about that in your final talk?
1352 00:56:23,920 --> 00:56:28,370 Or who knows?
1353 00:56:28,370 --> 00:56:31,780 Or Lars maybe also?
1354 00:56:31,780 --> 00:56:33,490 Some of these papers even evaluated
1355 00:56:33,490 --> 00:56:36,790 that, especially these guys that had the messy models.
1356 00:56:36,790 --> 00:56:39,370 They were getting the parameters of, at that time,
1357 00:56:39,370 --> 00:56:41,890 a [INAUDIBLE] 6,000 processor, which is something I've
1358 00:56:41,890 --> 00:56:44,830 actually used, so not so old.
1359 00:56:44,830 --> 00:56:49,830 And they got very good matching even at that point.
1360 00:56:49,830 --> 00:56:54,100 I'd say external memory does very well for modeling disk.
1361 00:56:54,100 --> 00:56:56,830 I don't know if people use it a lot for cache.
1362 00:56:56,830 --> 00:56:58,420 No, I'm told.
1363 00:56:58,420 --> 00:57:04,010 Cache-oblivious, it's a little harder to measure.
1364 00:57:04,010 --> 00:57:06,370 Because you're not trying to tune to specific things. 1365 00:57:06,370 --> 00:57:10,190 But in practice, it seems to do very well for many problems. 1366 00:57:10,190 --> 00:57:11,507 That's the short answer. 1367 00:57:11,507 --> 00:57:13,006 AUDIENCE: [INAUDIBLE] it runs faster 1368 00:57:13,006 --> 00:57:15,940 than [INAUDIBLE] cache aware. 1369 00:57:15,940 --> 00:57:17,260 ERIK DEMAINE: Yeah. 1370 00:57:17,260 --> 00:57:19,060 It does better than our analysis said 1371 00:57:19,060 --> 00:57:21,880 it should do in some sense, because it's so flexible. 1372 00:57:21,880 --> 00:57:23,710 And the reality is very messy. 1373 00:57:23,710 --> 00:57:26,020 In reality, M is changing, because there's 1374 00:57:26,020 --> 00:57:29,710 all sorts of processes doing useless work. 1375 00:57:29,710 --> 00:57:31,780 And cache-oblivious will adjust to that. 1376 00:57:31,780 --> 00:57:35,069 And it's especially the case in internal memory, 1377 00:57:35,069 --> 00:57:35,860 in the cache world. 1378 00:57:35,860 --> 00:57:38,770 Things are very messy and fussy. 1379 00:57:38,770 --> 00:57:40,810 And the nice thing about cache-oblivious 1380 00:57:40,810 --> 00:57:42,560 is because you're not specifically tuning, 1381 00:57:42,560 --> 00:57:44,939 you have the potential to not die when you mess up. 1382 00:57:44,939 --> 00:57:46,397 AUDIENCE: I'd say that's especially 1383 00:57:46,397 --> 00:57:47,830 the case in the disk world. 1384 00:57:47,830 --> 00:57:49,080 ERIK DEMAINE: Oh, interesting. 1385 00:57:49,080 --> 00:57:50,320 AUDIENCE: [INAUDIBLE] But-- 1386 00:57:50,320 --> 00:57:51,880 ERIK DEMAINE: These are the guys who know. 1387 00:57:51,880 --> 00:57:54,171 AUDIENCE: [INAUDIBLE] people have different [INAUDIBLE] 1388 00:57:54,171 --> 00:57:55,150 ERIK DEMAINE: Yeah. 1389 00:57:55,150 --> 00:57:56,184 They're both relevant. 1390 00:57:56,184 --> 00:57:58,040 AUDIENCE: What's the future? 1391 00:57:58,040 --> 00:57:59,462 This is history. 1392 00:57:59,462 --> 00:58:00,170 ERIK DEMAINE: OK. 1393 00:58:00,170 --> 00:58:03,020 Well, for the future, you should go to the other talks, I guess. 1394 00:58:03,020 --> 00:58:05,600 There's still lots of open problems in both models. 1395 00:58:05,600 --> 00:58:08,330 External memory, I guess, graph algorithms and geometry 1396 00:58:08,330 --> 00:58:11,020 are still the main topics of ongoing research. 1397 00:58:11,020 --> 00:58:13,130 Cache-oblivious is similar. 1398 00:58:13,130 --> 00:58:14,930 At this point, I think-- 1399 00:58:14,930 --> 00:58:17,350 well, also geometry is a big one. 1400 00:58:17,350 --> 00:58:18,950 There's some external memory results 1401 00:58:18,950 --> 00:58:24,880 that have not yet been cache-oblivified in geometry. 1402 00:58:24,880 --> 00:58:25,931 AUDIENCE: Multicore. 1403 00:58:25,931 --> 00:58:26,930 ERIK DEMAINE: Multicore. 1404 00:58:26,930 --> 00:58:28,304 Oh, yeah, I forgot to say I'm not 1405 00:58:28,304 --> 00:58:31,327 going to talk about parallel models here. 1406 00:58:31,327 --> 00:58:32,660 Partly, because of lack of time. 1407 00:58:32,660 --> 00:58:35,930 Also, that's probably the most active-- 1408 00:58:35,930 --> 00:58:39,260 it's an interesting active area of research, something 1409 00:58:39,260 --> 00:58:42,830 I'm interested in particular. 1410 00:58:42,830 --> 00:58:45,350 There are some results about parallel cache-oblivious. 1411 00:58:45,350 --> 00:58:49,460 And all of these papers actually had parallelism. 
1412 00:58:49,460 --> 00:58:53,084 These had parallelism in a single disk.
1413 00:58:53,084 --> 00:58:55,000 There's another model that has multiple disks.
1414 00:58:55,000 --> 00:58:56,458 Those behave more or less the same.
1415 00:58:56,458 --> 00:58:58,759 You basically divide everything by p.
1416 00:58:58,759 --> 00:59:00,800 These models also tried to introduce parallelism.
1417 00:59:00,800 --> 00:59:04,130 Or there's a follow-up to UMH by these guys.
1418 00:59:04,130 --> 00:59:06,680 So there is work on parallel, but I
1419 00:59:06,680 --> 00:59:08,750 think multicore cache-oblivious is probably
1420 00:59:08,750 --> 00:59:11,044 the most exciting unknown or still-in-progress stuff.
1421 00:59:11,044 --> 00:59:12,460 AUDIENCE: Thank the speaker again.
1422 00:59:12,460 --> 00:59:13,360 ERIK DEMAINE: Thanks.
1423 00:59:13,360 --> 00:59:17,910 [APPLAUSE]