The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: So a couple of things I want to say about the final project. You guys should start thinking about it. So of course, you guys should think about the teams first, and submit team information. Say who you're going to team up with. By when? Team information, Josh, by when? Team information has to be in--

AUDIENCE: By tomorrow.

PROFESSOR: By tomorrow. OK, good. You should know your teams. Get them together, and use the same mechanism that we used before to submit the team information. So we are going to add one small thing this year that I think will be useful. Once you have submitted your design documents, we are going to get a design review done by your Masters. So what that means is next week we are going to schedule a design review with your Masters.
And you should send mail to your Masters, hopefully in the middle of next week, to schedule a design review. The design review will happen the week after Thanksgiving. So you'll submit your design doc next week, and then the week after Thanksgiving you'll have your design review. The earlier the better, because you can hopefully get some really good feedback before you go into implementation. So have an idea what you're doing. Of course you'll write it up for your design document. And then go to your Masters and say, here's what I'm planning to do. Get some good feedback, and hopefully doing that will make your life easier.

And then, before this, performance only mattered for your grade. We did this absolute grading. This year, we are actually going to have an in-class competition on the final day of class, to figure out who has the fastest ray tracer in the class. And for that we will actually give [? you ?] a little bit of a different [UNINTELLIGIBLE] than what I have given you. And so don't go too much into really hand-coding to that, because that might not work.
And so here's something hot off the press. For the winning team, there's going to be an Akamai prize. And this prize includes a celebration/demonstration at Akamai headquarters. You're going to go visit there [UNINTELLIGIBLE] and perhaps show off to their engineers the cool ray tracer you did. And also every member of the winning team is going to get an iPod Nano. Sorry, guys, last year it didn't happen. First time. So there's a lot at stake. So make sure your program is going to run as fast as it can.

OK, with that, let's get your slides on. So I'd like to introduce Bradley Kuszmaul. Bradley has been at MIT, in and out of MIT, for a long time, doing lots of cool stuff with high-performance-- yeah, make the screen bigger-- high-performance computing. He has done some really interesting data structure work, performance optimization work, and stuff like that. And today he's going to talk about an interesting data structure that goes all the way from theory to getting really, really high performance. Thank you, Bradley.
OK, you have the mic.

BRADLEY KUSZMAUL: So I'm going to talk about a data structure called fractal trees, which in the academic world are called streaming B-trees. But the marketing people didn't think very much of that, and a lot of these slides are borrowed from a company that I've started. So rather than redo that, I'm just going to stick to the terminology "fractal tree." I'm research faculty at MIT, and I'm a founder at Tokutek. And so that's sort of who I am. I'll do a little bit more introduction. So I have been around at MIT a long time. I have four MIT degrees. And I was one of the architects of the Connection Machine CM-5. And Charles was also one of the architects of that machine. So at the time, that was the fastest machine in the world, at least for some applications. And after getting my degrees and being an architect, I went and was a professor at Yale, and then later I was at Akamai. So I don't know what an Akamai prize is beyond an iPod, but maybe it's like all your content delivered free for a month or something.
And I'm now research faculty in the SuperTech Group, working with Charles. And I'm a founder of Tokutek, which is commercializing some work we did. A couple years ago, I started collaborating with Michael Bender and Martin Farach-Colton on data structures that are suited for storing data on disk. And we ended up a bit later starting a company to commercialize the research. And basically, I'll tell you sort of what the background is, and actually go into some technical detail on the data structure. So I don't know exactly what you've spent most of your time on, but a lot of high-performance work, especially in academia, focuses on the CPUs and using the CPUs efficiently, maybe getting lots of FLOPS or something. The Cilk work that Charles and I did, for example, is squarely in the category of how do you get more FLOPS, or more computrons, out of a particular machine. But it turns out often I/O is a big bottleneck, and so you see systems that look a little bit like this. You have a whole bunch of sensors somewhere, and the sensors might be something like a bunch of telescopes in an astronomy system.
They're sending millions of data items per second, and they have to be stored. And disk is where you have to store large amounts of data, because disk is orders of magnitude cheaper per byte than other storage systems. And then you want to do queries on that data, and you want to look at the data that's recent. So it's not good enough just to look at yesterday's data. You want to know what's going on right now. If your sensor array is a bunch of telescopes, and a supernova starts happening, you want to be able to find out quickly what's going on, so that you can broadcast the message to everybody in the world so they can all point their telescopes at the supernova while it's fresh. So that's the picture. Another example of a sensor system is the internet, where you have thousands or millions of people clicking away on Facebook, for example. You could view that collection of mice as, abstractly, a bunch of sensors. And so you see it in science. You see it on the internet. There's lots of applications.
For example, another one would be that you're looking for attacks on your internet infrastructure in a large corporation, or something. So trying to reduce this big sensor system to its fundamental problem: basically, we need to index the data. So the data indexing problem is this. Data is arriving in one order, and you want to ask about it in another order. So typically data is arriving in order by time. When an observation is made, the event is logged. When the next observation is made, the event is logged. And then you want to do a query: tell me everything that's happening in that particular area of the sky over the past month. So there's a big transposition that has to be done for these queries. Abstractly, the data's coming in in one order, and you want to sort it and get the data out in another order.

So one solution to this problem that a lot of people use is simply to sort the data. The data comes in. Sort it. Then you can query it in the order that makes sense. This is basically a simple-minded explanation of what a data warehouse is.
A data warehouse is all this data comes in-- Walmart runs one of these in Arkansas. All these events, which are people scanning cans of soup on bar codes all over the country out in Walmart stores-- all that data arrives in Arkansas, in one location. They sort the data overnight, and then the next morning they can answer questions like, what's the most popular food in the week before a hurricane strikes? Because this is the kind of request that Walmart might care about, because they get a forecast that a hurricane's coming, and it turns out they need to ship beer and blueberry Pop-Tarts to the local stores, which are the things that basically you can eat even if power has failed. The problem with sorting is that you have to wait overnight. And for Walmart, that might actually be good enough. But if you're the astronomer, that application is not so great. So this problem is called the indexing problem. We have to maintain indexes. And traditionally, the classical solution is to use a data structure called a B-tree. So do you all know what a B-tree is? [INAUDIBLE] data structures to algorithms?
A B-tree is like a search tree, except it's got some fan-out, and I'll talk about it in a second. They show up in virtually all storage systems today. They were invented about 40 years ago, and they show up in databases such as MyISAM or Oracle. They show up in file systems like XFS. You can think of what Unix file systems like ext do as being a variation of a B-tree. Basically, they're everywhere. Mike drew this picture of a B-tree. And I said, I don't get it. He said, well, there's a tree, and there's bees. And I said, but those are wasps. So anyway--

So a B-tree looks like this. It's a search tree, so that means everything is organized. It's got left children and right children, and there are actually many children. And like any other search tree, all the things to the left are before all the things to the right. That's sort of the property of trees that lets you do more than just a hash table. A hash table lets you do get and put. But a tree lets you do next. And that's the key observation of why you need something like a tree instead of a hash table.
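The get/put-versus-next distinction can be sketched in a few lines of Python (my own illustration, not from the lecture): a dict supports get and put, but has no ordered notion of "next", while any sorted-order structure does.

```python
import bisect

# A sorted list stands in for a search tree: keys are kept in order,
# so the successor ("next") of any key can be found cheaply.
keys = [10, 20, 30, 40, 50]

def next_key(sorted_keys, k):
    """Smallest key strictly greater than k, or None if there is none."""
    i = bisect.bisect_right(sorted_keys, k)
    return sorted_keys[i] if i < len(sorted_keys) else None

# A hash table (dict) supports get and put, but the only way to find
# the next key after 20 would be to scan every key it contains.
table = {k: "value" for k in keys}

print(next_key(keys, 20))   # 30
print(next_key(keys, 55))   # None
```

Scanning a range is then just repeated calls to next, which is exactly the operation a hash table cannot provide without examining everything.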
A lot of database queries-- if you go and click on Facebook on somebody's page, there are all these things that have been posted on somebody's wall. And what they've done when they organized that data is that they've organized it so that each of those items is a row in the database, and they're next to each other, so that you fetch the first one, which is like the home page of the person, and then next and next and next gives each of the messages that they want to display. And by making those things adjacent to each other, it means that they don't incur a disk I/O every time. If it were just a hash table, you'd be having to look all over the place to find those things. So B-trees are really fast if you do insertions sequentially. And the reason is you have a data structure that's too big to fit in main memory. If the data structure fits in main memory, this is just the wrong data structure, right? If it fits in main memory, what should you use to solve this problem? Any ideas? What data structure is like a B-tree except it doesn't have lots of fan-out? Does anybody know this stuff in this class?
Do you people know data structures at all? Maybe I'm in the wrong place. Because it's OK. Just a binary tree would be the data structure if you were doing this in memory, right? A binary tree would be fine. Or maybe you would try to minimize the number of cache misses or something. So for sequential inserts, if you're inserting at the end, basically all the stuff down the right spine of the tree is in main memory, and an insertion just inserts and inserts. You have no disk I/Os, and basically it runs extremely fast. The disk I/O is sequential. You get basically performance that's limited by the disk bandwidth, which is the rate at which the disk can write consecutive blocks. But B-trees are really slow if you're doing insertions that look random. The database world calls those high-entropy. And so basically the idea is I pick some leaf at random, and then I have to bring it into main memory, put the new record in there, and then eventually write it back out.
And because the data structure is spread all over the disk, each of those random blocks that I choose, when I bring it in, that's a random disk I/O, which is very expensive. So here, for this workload, unlike the previous workload, the performance of the system is limited by how fast you can move the disk head around, rather than how fast you can write having placed the disk head. And perhaps, on a disk drive, you can only do something like 100 disk head movements per second. And if you're writing small records that are like 100 bytes or something, you might find yourself using a thousandth of a percent of the disk's bandwidth performance. And so people hate that. They hate buying something and only being able to use a thousandth of a percent of its capacity. Right?

New B-trees. Something's wrong with that title. So B-trees are really fast at doing range queries, because basically once you've brought a block in and you want the next item, chances are the next item's also on the same page. So once in a while you go over a page boundary, but mostly you're just reading stuff very fast.
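The seek-limited argument can be checked with back-of-envelope arithmetic. The 100 head movements per second and 100-byte records are from the talk; the sequential bandwidth figure is my assumption, since the talk doesn't give one.

```python
# Back-of-envelope check of seek-limited random-insert throughput.
# Assumptions: 100 random head movements/sec and 100-byte records
# (from the talk); ~100 MB/s sequential bandwidth (my assumption).
seeks_per_sec = 100
record_bytes = 100
seq_bandwidth = 100e6          # bytes/sec, assumed

random_write_rate = seeks_per_sec * record_bytes   # 10,000 bytes/sec
utilization = random_write_rate / seq_bandwidth
print(f"{utilization:.6%}")    # about 0.01% of the assumed bandwidth
```

With these assumptions the utilization comes out around a hundredth of a percent; the exact fraction depends on the bandwidth you assume, but either way it is a tiny sliver of what the disk can deliver sequentially.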
Oh, I know what this is about. When a B-tree's new and it's been constructed sequentially, it's also very fast. When it gets old, what happens is the blocks themselves get moved around on disk. They're not next to each other. And this is a problem that people have spent a lot of time trying to solve: as B-trees get older, their performance degrades. This aging problem-- I saw one report that suggested that something like 2% of all the money spent by corporations on IT is spent dumping and reloading their B-trees to try to make this problem go away. So that's a lot of money or pain or something.

Well, B-trees are optimal for doing lookups. If you just want to look something up, there's an old argument that says, gee, if you're going to have a tree structure, which is what you need in order to do next operations, then you're going to have some path through the B-tree which is a certain depth, and you do it optimally by having the fan-out be the block size. And everything works. But that argument of optimality is not actually true for insertion workloads.
And this is where the data structures work that I've done with Mike and Martin sort of gets to be an advantage. To see that B-trees aren't optimal for insertions, here's a data structure that's really good at insertions. What is the data structure? I'm just going to append to the end of a file. Right? So it's great. Basically, it doesn't matter what the keys are. I can insert data into this data structure at disk bandwidth. What's the disadvantage of this data structure?

AUDIENCE: Lookups?

BRADLEY KUSZMAUL: Lookups. So what is the disadvantage? Lookups aren't so good. What is the cost of doing a lookup?

AUDIENCE: Order N?

BRADLEY KUSZMAUL: Order N. Yeah. You have to look at everything. It requires a scan of the entire table. And we'll get into what the cost model is in a second. But basically, you have to look at everything. So it's order N. It turns out the number of blocks you have to read in, which is the thing you care about-- it's order N over B.
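A minimal sketch of the append log (my own illustration, not code from the lecture): inserts are sequential appends, so they run at disk bandwidth, but a lookup has to scan every record ever written.

```python
class AppendLog:
    """Append-only table: great at inserts, terrible at lookups."""

    def __init__(self):
        self.log = []                 # stand-in for a file on disk

    def insert(self, key, value):
        # Sequential append: amortized 1/B of a block I/O per record.
        self.log.append((key, value))

    def lookup(self, key):
        # Must scan the whole table -- O(N/B) block reads; last write wins.
        result = None
        for k, v in self.log:
            if k == key:
                result = v
        return result

db = AppendLog()
db.insert(3, "c")
db.insert(1, "a")
db.insert(3, "c2")                    # later write shadows the earlier one
print(db.lookup(3))                   # "c2", found only by scanning everything
```

The insert path never reads anything back, which is exactly why it runs at bandwidth; the price is paid entirely on the query side.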
So we'll get into a performance model in just a second. So here we are. We have two data structures: a B-tree, which is not so great at insertions-- it's quite good at point queries and quite good at range queries, especially when it's young-- and this other data structure, the append log, which is wonderful for insertions and really bad for queries. So can you do something that's like the best of all possible worlds? You can also imagine a data structure that's the worst of all possible worlds, but it turns out that there are data structures that do well on both, and I'll show you how one works in a minute.

So to explain how it works and to do the analysis, we need to have a cost model. And we got into this just a minute ago, with: what is the cost model for a table scan? Is it order N? Well, if you're only counting the number of CPU cycles that you're using up, it's order N, because you have to look at every item. But if what you really care about is the number of disk I/Os, then you just count up the number of blocks.
And so in that model, the cost is order N over B. And that's the model that we're going to use to do this analysis. So in this model, we aren't going to care about CPU cost. We are going to care about disk I/O. And that's a pretty good place to design in if you're an engineer, because right now the number of CPU cycles that you get for a dollar is going up. It's been going up. It's continuing to go up. You have to write parallel programs today to get that, but you get a lot of cycles in a $100 package. But the number of disk I/Os per second that you're getting is essentially unchanged. It's maybe improved by a factor of two in 40 years. So that's the one to optimize for-- the one that's not changing. And use all those CPU cycles, if you can, to do something. So the model here is that we're going to have a memory and a disk. And there's some block size, B, which we may or may not know. And it's actually quite tricky on real disk systems to figure out what the right block size is. It's not 500 bytes, because that's not going to be a good block size.
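In this disk-access-machine style of model, only block transfers between memory and disk count. A tiny illustration (my own) of the scan cost just discussed:

```python
import math

# Disk-access-machine cost model: CPU work is free, and the cost of an
# operation is the number of B-sized blocks moved between disk and memory.

def scan_cost(n_items, block_size):
    """Blocks read to scan a table of n unit-sized items: ceil(N/B)."""
    return math.ceil(n_items / block_size)

# Scanning a million-item table with 1,000-item blocks costs 1,000 block
# reads, no matter how many CPU cycles the per-item comparisons burn.
print(scan_cost(10**6, 10**3))   # 1000
```

This is why the table scan is "order N over B" rather than order N: the unit of cost is the block, not the item.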
It might be more like a megabyte. And when we move stuff back and forth, we're going to move a block at a time. We're going to bring in a whole block from disk, and when we have to write a block out, we write the whole block out. So we're just going to count that up. There are two parameters: the block size, B, and the memory size, M. If the memory is as big as the entire disk, then the problem goes away, and if the memory's way too small-- like you can only have one block-- then it's very difficult to get anything done. So you need to be able to hold several blocks' worth of storage. The memory is treated as a cache for the disk. So once we've brought a block in, we can keep using it for a while until we get rid of it. So have you guys done any cache-oblivious data structures? OK. So you've seen this model. So the game here is to minimize the number of disk I/Os and not worry about the CPU cycles. So here are the theoretical results. We'll start with a B-tree.
435 00:20:45,660 --> 00:20:48,480 So a B-tree which has a block size B-- 436 00:20:48,480 --> 00:20:51,270 and here I'm going to assume that the things you're storing 437 00:20:51,270 --> 00:20:52,490 are unit sized. 438 00:20:52,490 --> 00:20:54,800 Because you can do the analysis, but it gets more 439 00:20:54,800 --> 00:20:55,680 complicated. 440 00:20:55,680 --> 00:21:02,540 So the cost of a lookup, which is the upper right side, is 441 00:21:02,540 --> 00:21:06,640 log N over log B. That's the same as-- 442 00:21:13,040 --> 00:21:17,260 you may not be used to manipulating these, but 443 00:21:17,260 --> 00:21:20,800 usually people write this as log base B of N. But that's 444 00:21:20,800 --> 00:21:26,920 the same as log N over log B. And I'm going to write it this 445 00:21:26,920 --> 00:21:30,380 way, because then it's easier to compare things. 446 00:21:30,380 --> 00:21:35,010 So if B is 1,000 or something, then 447 00:21:35,010 --> 00:21:36,410 basically instead of paying-- 448 00:21:40,590 --> 00:21:43,920 just as an example, if N is, say, 2 to the 40th-- 449 00:21:46,490 --> 00:21:48,670 let's take these all to be lg's, because it 450 00:21:48,670 --> 00:21:50,290 basically doesn't matter. 451 00:21:50,290 --> 00:21:54,880 So it's 40 over log base B, and if B is, say, 2 to the 452 00:21:54,880 --> 00:22:02,710 10th, then that means that if you have a trillion items and 453 00:22:02,710 --> 00:22:06,780 you have a fan-out of 1,000, it takes you at most four disk 454 00:22:06,780 --> 00:22:11,880 I/Os to find any particular item. 455 00:22:11,880 --> 00:22:15,590 An insertion cost is the same, because to do an insertion, we 456 00:22:15,590 --> 00:22:18,820 have to find the leaf that the item should have been in, and 457 00:22:18,820 --> 00:22:20,070 then put it there. 458 00:22:22,350 --> 00:22:26,370 So the append log-- well, what's the cost of insertion? 459 00:22:26,370 --> 00:22:29,130 Well, we're appending away, right?
460 00:22:29,130 --> 00:22:32,630 And once every B items, we actually have to do a disk 461 00:22:32,630 --> 00:22:35,660 I/O. So the cost of an insertion in the 462 00:22:35,660 --> 00:22:38,120 append log isn't 0. 463 00:22:38,120 --> 00:22:42,590 It's one Bth of a block I/O per object. 464 00:22:42,590 --> 00:22:44,640 And the point query cost looks really bad. 465 00:22:44,640 --> 00:22:47,970 It's N over B, which we already discussed. 466 00:22:47,970 --> 00:22:51,760 So the fractal tree has this kind of performance. 467 00:22:51,760 --> 00:22:54,750 It's log N over-- 468 00:22:54,750 --> 00:22:58,370 it's not B, which would be really great. 469 00:22:58,370 --> 00:23:01,150 It's something smaller. 470 00:23:01,150 --> 00:23:06,450 It's maybe square root of B for the insertion cost. 471 00:23:06,450 --> 00:23:11,110 And the lookup cost is log N over something, which I'm 472 00:23:11,110 --> 00:23:13,630 going to just hide. 473 00:23:13,630 --> 00:23:20,750 Let's set epsilon to 1/2 and work out what that is. 474 00:23:20,750 --> 00:23:23,700 Because epsilon = 1/2 is a good engineering point. 475 00:23:23,700 --> 00:23:32,810 So the insertion cost is log N over B to the 1/2, which is 476 00:23:32,810 --> 00:23:40,275 log N over root B. And the other one, the lookup cost-- 477 00:23:45,540 --> 00:23:47,780 there are big-O's all around here, but I'm not going to 478 00:23:47,780 --> 00:23:49,800 draw those again-- over 1/2-- 479 00:23:49,800 --> 00:23:51,540 so I'm going to maybe ignore that-- 480 00:23:51,540 --> 00:24:02,370 of the log of the square root of B. Did I do that right? 481 00:24:02,370 --> 00:24:05,700 B to the 1 minus 1/2. 482 00:24:05,700 --> 00:24:08,180 Put the 1/2 back in to make you happy. 483 00:24:08,180 --> 00:24:10,670 So big-O of that-- 484 00:24:10,670 --> 00:24:12,350 well, what's log of root B?
485 00:24:16,850 --> 00:24:18,350 AUDIENCE: [INAUDIBLE] 486 00:24:18,350 --> 00:24:19,780 BRADLEY KUSZMAUL: I can't quite hear you, 487 00:24:19,780 --> 00:24:23,020 but I know the answer. 488 00:24:23,020 --> 00:24:26,020 I can just say that's the same as log B, when I'm doing big- 489 00:24:26,020 --> 00:24:27,640 O's. Get rid of the halves. 490 00:24:27,640 --> 00:24:33,890 So it's log N over log B. So if you sort of choose block 491 00:24:33,890 --> 00:24:39,530 sizes, if you set this parameter to be something 492 00:24:39,530 --> 00:24:41,720 where you're doing something with the square root, you end 493 00:24:41,720 --> 00:24:45,320 up having lookups that cost asymptotically the 494 00:24:45,320 --> 00:24:48,450 same as for a B-tree. 495 00:24:48,450 --> 00:24:49,780 But there are these constants in there. 496 00:24:49,780 --> 00:24:51,480 There's a factor of 4 or something that 497 00:24:51,480 --> 00:24:53,900 I've glossed over. 498 00:24:53,900 --> 00:24:56,160 But asymptotically, it's the same. 499 00:24:56,160 --> 00:24:59,290 And insertions have this much better performance. 500 00:24:59,290 --> 00:25:03,360 What if B was 1,000? 501 00:25:03,360 --> 00:25:05,090 Then we're dividing by 30 here. 502 00:25:05,090 --> 00:25:07,140 But B isn't really 1,000. 503 00:25:07,140 --> 00:25:09,900 B's more like a million in a modern system. 504 00:25:09,900 --> 00:25:14,940 So you actually get to divide by something like 1,000 here. 505 00:25:14,940 --> 00:25:18,610 And that's a huge advantage, to basically make insertions 506 00:25:18,610 --> 00:25:22,852 asymptotically be 1,000 times faster, whatever that means. 507 00:25:25,670 --> 00:25:27,720 When you actually work out the constants, perhaps it's a 508 00:25:27,720 --> 00:25:32,120 factor of 100, is what we see in practice. 509 00:25:32,120 --> 00:25:39,070 So this is basically working out those details. 510 00:25:39,070 --> 00:25:40,160 So here's an example.
511 00:25:40,160 --> 00:25:42,060 Here is a data structure that can achieve this kind of 512 00:25:42,060 --> 00:25:44,380 performance. 513 00:25:44,380 --> 00:25:46,940 It's a simple version of a streaming B-tree 514 00:25:46,940 --> 00:25:48,200 or a fractal tree. 515 00:25:48,200 --> 00:25:50,180 And what this data structure is-- 516 00:25:50,180 --> 00:25:55,340 so first of all, we're kind of going to switch modes from 517 00:25:55,340 --> 00:25:58,770 marketoid, or at least explaining what it's good for, 518 00:25:58,770 --> 00:26:00,640 to talking about what a data structure is that actually 519 00:26:00,640 --> 00:26:02,070 solves the problem. 520 00:26:02,070 --> 00:26:04,500 So any questions before we dive down that path? 521 00:26:04,500 --> 00:26:05,640 OK. 522 00:26:05,640 --> 00:26:07,210 So if there are any questions, stop me. 523 00:26:07,210 --> 00:26:12,700 Because I like to race through this stuff if possible. 524 00:26:12,700 --> 00:26:15,890 So the deal here is that you're going to have log N 525 00:26:15,890 --> 00:26:20,230 arrays, and each one is a power of two in size. 526 00:26:20,230 --> 00:26:21,920 And you're going to have one for each power of 2. 527 00:26:21,920 --> 00:26:25,230 So there's going to be one array of size 1, one of size 528 00:26:25,230 --> 00:26:31,620 2, one of size 4 and 8 and 16, all the way up to a trillion, 529 00:26:31,620 --> 00:26:32,870 2 to the 40th. 530 00:26:35,130 --> 00:26:37,920 The second invariant of this data structure is each array 531 00:26:37,920 --> 00:26:42,660 is either completely full or completely empty. 532 00:26:42,660 --> 00:26:45,140 And the third one is that each array is sorted. 533 00:26:49,780 --> 00:26:53,390 So I'll do an example here. 534 00:26:53,390 --> 00:26:57,920 If I have four elements in the array, and these are the 535 00:26:57,920 --> 00:27:00,670 numbers, there's only one way for me to put those in that 536 00:27:00,670 --> 00:27:03,160 satisfy all those requirements. 
537 00:27:03,160 --> 00:27:05,720 Because there's four items, it has to go into the 538 00:27:05,720 --> 00:27:06,660 array of size four. 539 00:27:06,660 --> 00:27:07,480 It has to fill it up. 540 00:27:07,480 --> 00:27:09,670 I can't have any other way of doing that. 541 00:27:09,670 --> 00:27:12,900 And within that array of size four, they have to be sorted. 542 00:27:12,900 --> 00:27:16,330 So those four elements uniquely go there, and that's 543 00:27:16,330 --> 00:27:18,390 the end of the story for where four elements go. 544 00:27:21,210 --> 00:27:24,710 If there's 10 elements, you get a little freedom, because, 545 00:27:24,710 --> 00:27:27,000 well, we have to fill up the 2 array and we have to fill up 546 00:27:27,000 --> 00:27:29,880 the 8 array, because there's only one way to write in 547 00:27:29,880 --> 00:27:34,820 binary 10, which is 1010. 548 00:27:34,820 --> 00:27:35,870 But we get a little choice. 549 00:27:35,870 --> 00:27:39,070 It turns out that the bottom array has to be sorted and the 550 00:27:39,070 --> 00:27:41,950 top array, the array containing 5 and 551 00:27:41,950 --> 00:27:42,810 10, has to be sorted. 552 00:27:42,810 --> 00:27:46,850 But we could have put the five down here and, say, swapped 553 00:27:46,850 --> 00:27:49,210 the 5 and the 6, and that would've been a perfectly 554 00:27:49,210 --> 00:27:52,800 valid data structure for this set of data as well. 555 00:27:52,800 --> 00:27:55,080 So we get a little bit of freedom. 556 00:27:55,080 --> 00:27:57,180 OK? 557 00:27:57,180 --> 00:27:59,310 So that's the basic data structure. 558 00:27:59,310 --> 00:28:00,910 So now what do we do? 559 00:28:00,910 --> 00:28:02,360 How do you search this data structure? 560 00:28:02,360 --> 00:28:06,890 Well, the idea is just to perform a binary search in 561 00:28:06,890 --> 00:28:08,140 each of the arrays.
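To make those invariants concrete, here is a small sketch (my own illustration, not code from the lecture) that lays N sorted keys out into power-of-two arrays according to the binary representation of N, and then searches by binary-searching every non-empty array:

```python
from bisect import bisect_left

def build_levels(keys):
    """Level k holds either 0 or 2**k keys: it is full exactly when
    bit k of len(keys) is set, and every level is kept sorted."""
    keys = sorted(keys)
    n, pos = len(keys), 0
    levels = [[] for _ in range(n.bit_length())]
    for k in reversed(range(n.bit_length())):  # fill the big levels first
        if n >> k & 1:
            levels[k] = keys[pos:pos + (1 << k)]
            pos += 1 << k
    return levels

def search(levels, key):
    # One binary search per non-empty level: O(log^2 N) in the worst case.
    for arr in levels:
        i = bisect_left(arr, key)
        if i < len(arr) and arr[i] == key:
            return True
    return False

levels = build_levels(range(10))       # 10 = 1010 in binary
print([len(a) for a in levels])        # [0, 2, 0, 8]
print(search(levels, 7))               # True
```

Note the freedom mentioned above: any split of the keys into sorted runs of the right sizes would do; this sketch just hands the smallest keys to the largest array.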
562 00:28:13,060 --> 00:28:15,660 The advantage of this is it works, and it's a lot faster 563 00:28:15,660 --> 00:28:17,750 than a table scan. 564 00:28:17,750 --> 00:28:20,670 The disadvantage is it's actually quite a bit slower 565 00:28:20,670 --> 00:28:26,380 than a B-tree, because if you do the analysis here, which in 566 00:28:26,380 --> 00:28:28,290 this class, you probably-- you've done things like master 567 00:28:28,290 --> 00:28:29,740 theorem and stuff, right? 568 00:28:29,740 --> 00:28:33,550 So you know what the cost of doing the search in the 569 00:28:33,550 --> 00:28:35,440 biggest array is, right? 570 00:28:35,440 --> 00:28:38,552 How many disk I/Os is that in the worst case? 571 00:28:38,552 --> 00:28:40,025 AUDIENCE: Log N. 572 00:28:40,025 --> 00:28:45,170 BRADLEY KUSZMAUL: It's log N. It's going to be log base 2 of 573 00:28:45,170 --> 00:28:48,740 N, plus or minus a little bit. 574 00:28:48,740 --> 00:28:51,720 Just ignore all that stuff. 575 00:28:51,720 --> 00:28:55,670 I'll just do L-O-G. So what's the size of doing the 576 00:28:55,670 --> 00:28:56,810 second-biggest array? 577 00:28:56,810 --> 00:28:58,690 What's the cost of searching the second-biggest array? 578 00:29:02,010 --> 00:29:10,930 It's half as big, so it's log of N over 2, right? 579 00:29:10,930 --> 00:29:12,180 I can't write. 580 00:29:14,640 --> 00:29:23,590 So this is log N. This is equal to log of N minus 1. 581 00:29:23,590 --> 00:29:26,010 What's the next array? 582 00:29:26,010 --> 00:29:27,720 What's the cost of searching the next biggest array? 583 00:29:32,150 --> 00:29:33,645 Log of N minus 2-- 584 00:29:38,570 --> 00:29:42,700 you add that up, and what's the sum? 585 00:29:42,700 --> 00:29:44,640 We don't even need recurrences for this. 586 00:29:44,640 --> 00:29:47,200 We could have done it that way, but what's the sum? 
587 00:29:47,200 --> 00:29:50,310 When we finally get down to 1, and you search the bottom 588 00:29:50,310 --> 00:29:54,150 array, you have to do one disk I/O in the worst case. 589 00:29:54,150 --> 00:29:58,380 So this is an arithmetic sequence, right? 590 00:29:58,380 --> 00:29:59,610 So what's the answer? 591 00:29:59,610 --> 00:30:04,680 Big-O. I'm not even going to ask for the-- 592 00:30:04,680 --> 00:30:05,160 pardon? 593 00:30:05,160 --> 00:30:05,805 AUDIENCE: Log squared? 594 00:30:05,805 --> 00:30:08,710 BRADLEY KUSZMAUL: Yes, it's log squared, which is right 595 00:30:08,710 --> 00:30:09,960 there in green. 596 00:30:11,720 --> 00:30:14,230 So basically, this thing is really expensive. 597 00:30:14,230 --> 00:30:18,900 Log squared N, when we were trying to match a B-tree, 598 00:30:18,900 --> 00:30:23,920 which is log N over log B. So not only is it not log base B, 599 00:30:23,920 --> 00:30:26,260 it's log base 2 or something. 600 00:30:26,260 --> 00:30:28,380 But it's squaring it. 601 00:30:28,380 --> 00:30:31,710 So if you think of having a million items in your data 602 00:30:31,710 --> 00:30:35,750 structure, even a relatively small one, log base 2 603 00:30:35,750 --> 00:30:37,720 of 1 million is 20. 604 00:30:37,720 --> 00:30:40,092 If you square that, that's 400. 605 00:30:40,092 --> 00:30:41,660 Maybe you get to divide by 2. 606 00:30:41,660 --> 00:30:43,730 It's hundreds of disk I/Os just to do a 607 00:30:43,730 --> 00:30:46,250 lookup, instead of four. 608 00:30:46,250 --> 00:30:49,440 So this is just sucking at this point. 609 00:30:49,440 --> 00:30:53,740 So let's put that aside and see if we can do insertion, 610 00:30:53,740 --> 00:30:56,510 since we are doing so badly at [? lookups. ?] 611 00:30:56,510 --> 00:31:00,400 So to make this easier to think about, I'm going to add 612 00:31:00,400 --> 00:31:02,750 another set of temporary arrays. 613 00:31:02,750 --> 00:31:05,270 So I'm actually going to have two arrays of each size.
614 00:31:05,270 --> 00:31:08,490 And the idea is at the beginning of each step, after 615 00:31:08,490 --> 00:31:12,000 doing an insertion, all the temporary arrays are empty. 616 00:31:12,000 --> 00:31:14,810 I'm only going to have arrays on the left side that are 617 00:31:14,810 --> 00:31:17,670 going to have data in them. 618 00:31:17,670 --> 00:31:20,290 So to insert 15 into this data structure, there's only one 619 00:31:20,290 --> 00:31:21,230 place to put it. 620 00:31:21,230 --> 00:31:25,110 I put it in the one array, if I'm trying to be lazy about 621 00:31:25,110 --> 00:31:26,170 how much work I want to do. 622 00:31:26,170 --> 00:31:29,780 And it turns out, this is exactly what you want to do. 623 00:31:29,780 --> 00:31:32,210 You have an empty one array, a new element comes in, 624 00:31:32,210 --> 00:31:33,460 just put it in there. 625 00:31:36,010 --> 00:31:37,520 Now I want to insert a 7. 626 00:31:37,520 --> 00:31:39,450 There's no place in the one array, so I'm going to put it 627 00:31:39,450 --> 00:31:42,870 in the one array over on the temp side. 628 00:31:42,870 --> 00:31:44,860 And then I'm going to merge the two one arrays 629 00:31:44,860 --> 00:31:47,790 to make a two array. 630 00:31:47,790 --> 00:31:52,310 So the 15 and the 7 become 7 and 15 here. 631 00:31:52,310 --> 00:31:53,630 I couldn't put it there because that 632 00:31:53,630 --> 00:31:55,010 array already was full. 633 00:31:55,010 --> 00:32:01,110 And then I merge those two to make a new four array. 634 00:32:01,110 --> 00:32:03,140 So this is the final result after 635 00:32:03,140 --> 00:32:06,332 inserting those two items. 636 00:32:06,332 --> 00:32:07,582 Does that make sense? 637 00:32:10,440 --> 00:32:13,300 It's not a hard data structure. 638 00:32:13,300 --> 00:32:16,900 So one insertion can cause a whole bunch of merges. 639 00:32:16,900 --> 00:32:19,200 Here we have sort of an animation.
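The carry chain just described is exactly like adding 1 to a binary counter. Here is a minimal self-contained sketch of that merge-and-carry insertion (again my own illustration, not the lecture's code):

```python
from heapq import merge  # merges already-sorted iterables lazily

def cola_insert(levels, key):
    """Insert one key into a list of sorted arrays where levels[k]
    is either empty or holds exactly 2**k keys.  Carry a merged run
    upward until an empty level absorbs it."""
    carry = [key]
    k = 0
    while True:
        if k == len(levels):
            levels.append([])               # grow: a brand-new empty level
        if not levels[k]:
            levels[k] = carry               # empty slot: drop the carry here
            return
        carry = list(merge(levels[k], carry))  # merge two sorted 2**k runs
        levels[k] = []
        k += 1

levels = []
cola_insert(levels, 15)
cola_insert(levels, 7)
print(levels)  # [[], [7, 15]] -- the two singletons merged into the 2-array
```

This mirrors the 15-then-7 example above: the second insertion finds the one array full, so the two singletons merge into the array of size two.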
640 00:32:19,200 --> 00:32:24,100 So here I've laid out the one array across the top, and then 641 00:32:24,100 --> 00:32:27,200 the temporary array just under it, and then going down, we 642 00:32:27,200 --> 00:32:30,930 have a sequence of steps for the data structure over time. 643 00:32:30,930 --> 00:32:33,170 So we have the whole arrays. 644 00:32:33,170 --> 00:32:35,180 The one and the two and the four and the eight arrays 645 00:32:35,180 --> 00:32:38,100 are all full, and we insert one more item, which 646 00:32:38,100 --> 00:32:39,720 causes a big carry. 647 00:32:39,720 --> 00:32:43,450 So the one creates a two, the two twos create two fours, the 648 00:32:43,450 --> 00:32:46,660 two fours and the eight create two eights, and so forth. 649 00:32:46,660 --> 00:32:47,550 So here you are. 650 00:32:47,550 --> 00:32:49,190 You're running. 651 00:32:49,190 --> 00:32:52,310 You've built up a terabyte of data. 652 00:32:52,310 --> 00:32:53,970 You insert one more item, and now you have to 653 00:32:53,970 --> 00:32:56,380 rewrite all of disk. 654 00:32:56,380 --> 00:32:59,860 So that also sounds a little unappealing. 655 00:32:59,860 --> 00:33:03,880 But we'll build on this to make a data structure that 656 00:33:03,880 --> 00:33:05,620 actually works. 657 00:33:05,620 --> 00:33:08,980 So first let's analyze what the average cost for this data 658 00:33:08,980 --> 00:33:10,210 structure is. 659 00:33:10,210 --> 00:33:12,860 I've just sort of explained why-- there are some really 660 00:33:12,860 --> 00:33:14,210 bad cases where you're doing an 661 00:33:14,210 --> 00:33:16,010 insertion and it's expensive. 662 00:33:16,010 --> 00:33:19,040 But on average, it turns out it's really good. 663 00:33:19,040 --> 00:33:24,430 And the reason is that merging of sorted arrays is really I/O 664 00:33:24,430 --> 00:33:29,170 efficient, because the merge is essentially operating on 665 00:33:29,170 --> 00:33:30,480 that append data structure.
666 00:33:30,480 --> 00:33:33,710 We're reading two append data structures and then writing 667 00:33:33,710 --> 00:33:35,790 the answer into another append data structure. 668 00:33:35,790 --> 00:33:39,870 And that does hardly any I/O. 669 00:33:39,870 --> 00:33:44,960 So if you have two arrays of size X, the cost to merge them 670 00:33:44,960 --> 00:33:47,320 is you have to read the two arrays and you have to write 671 00:33:47,320 --> 00:33:48,070 the new array. 672 00:33:48,070 --> 00:33:54,070 And you add it all up, and that's order X over B I/Os. 673 00:33:54,070 --> 00:33:55,950 Maybe it's 4X over B or something. 674 00:33:55,950 --> 00:34:00,210 But big-O of X over B. So the merge is efficient. 675 00:34:00,210 --> 00:34:05,770 The cost per element for the merge is 1 over B, because 676 00:34:05,770 --> 00:34:08,420 order X elements were merged when we did that. 677 00:34:08,420 --> 00:34:10,210 And we get to spread the cost. 678 00:34:10,210 --> 00:34:14,650 Sure, we had to rewrite a trillion items when we filled 679 00:34:14,650 --> 00:34:19,940 up our disk, but actually, when you divide that out over 680 00:34:19,940 --> 00:34:23,409 the trillion items, it's not that much cost per item. 681 00:34:23,409 --> 00:34:29,659 And so the cost for each item of that big operation is only 682 00:34:29,659 --> 00:34:31,900 1 over B disk I/Os. 683 00:34:31,900 --> 00:34:36,030 And each item only has to be rewritten log N times. 684 00:34:36,030 --> 00:34:41,130 So the total average cost for an insertion of one element is 685 00:34:41,130 --> 00:34:46,010 log N over B, which is actually better than what I 686 00:34:46,010 --> 00:34:46,900 promised here. 687 00:34:46,900 --> 00:34:48,010 But this data structure's going to be 688 00:34:48,010 --> 00:34:48,940 worse somewhere else. 689 00:34:48,940 --> 00:34:51,530 So this is a simplified version.
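The amortized accounting just described can be summarized in one line (with big-O's throughout, as in the lecture):

```latex
\underbrace{O\!\left(\tfrac{1}{B}\right)}_{\text{I/Os per element per merge}}
\;\times\;
\underbrace{\log N}_{\text{times each element is merged}}
\;=\;
O\!\left(\frac{\log N}{B}\right)\ \text{I/Os per insertion, amortized.}
```

Each element climbs through at most log N levels before it lands in the largest array, and each climb charges it 1/B of a block I/O.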
690 00:34:51,530 --> 00:34:53,179 I'll get to within-- 691 00:34:53,179 --> 00:34:56,120 ignoring epsilons and things, it'll be good enough. 692 00:34:56,120 --> 00:34:59,280 So does that analysis make sense? 693 00:34:59,280 --> 00:35:03,310 It's not hard analysis, so if it doesn't make sense, it's 694 00:35:03,310 --> 00:35:04,840 not because of you. 695 00:35:04,840 --> 00:35:07,080 It's got to be because I didn't explain it, because 696 00:35:07,080 --> 00:35:10,410 it's too easy to not understand. 697 00:35:15,910 --> 00:35:19,800 So if you're going to build something like this, you can't 698 00:35:19,800 --> 00:35:23,940 just say, oh, well, your database is great except once 699 00:35:23,940 --> 00:35:29,100 every couple days it hangs for an hour while we resort 700 00:35:29,100 --> 00:35:30,030 everything. 701 00:35:30,030 --> 00:35:32,700 So the fix for this is that we're going to get rid of the 702 00:35:32,700 --> 00:35:33,800 worst case. 703 00:35:33,800 --> 00:35:36,310 And the idea is, well, let's just have a separate thread 704 00:35:36,310 --> 00:35:38,510 that does the merging of the arrays. 705 00:35:38,510 --> 00:35:41,140 So we insert something into a temporary array and just 706 00:35:41,140 --> 00:35:42,780 return immediately. 707 00:35:42,780 --> 00:35:48,640 And as long as the merge thread gets to do at least log 708 00:35:48,640 --> 00:35:53,830 N moves every time we insert something, it can keep up. 709 00:35:53,830 --> 00:35:57,190 You could actually do a very careful dance, where I insert 710 00:35:57,190 --> 00:35:59,400 something, and part of the insertion is I have to move 711 00:35:59,400 --> 00:36:01,600 something from this array and something from this array and 712 00:36:01,600 --> 00:36:04,400 something from this array, and I can keep everything up to 713 00:36:04,400 --> 00:36:05,590 date that way.
714 00:36:05,590 --> 00:36:09,660 So it's not very hard to de-amortize this algorithm-- 715 00:36:09,660 --> 00:36:14,400 that is, to turn the algorithm from good average-case 716 00:36:14,400 --> 00:36:16,840 behavior to good worst-case behavior. 717 00:36:16,840 --> 00:36:21,827 The worst-case behavior just becomes that it has to do log 718 00:36:21,827 --> 00:36:25,830 N work for an insertion, which isn't so bad. 719 00:36:28,450 --> 00:36:30,660 Does that make sense? 720 00:36:30,660 --> 00:36:31,968 Yeah. 721 00:36:31,968 --> 00:36:34,872 AUDIENCE: Does that work if these are [INAUDIBLE] items 722 00:36:34,872 --> 00:36:35,840 [INAUDIBLE]? 723 00:36:35,840 --> 00:36:40,680 What if somebody wants [INAUDIBLE]? 724 00:36:40,680 --> 00:36:41,430 BRADLEY KUSZMAUL: Ah. 725 00:36:41,430 --> 00:36:44,070 Well, OK, so the question-- 726 00:36:44,070 --> 00:36:45,480 let me repeat it and see if-- 727 00:36:45,480 --> 00:36:49,440 so you're in the middle of doing these merges and you 728 00:36:49,440 --> 00:36:51,690 have a background thread doing that, say, and somebody comes 729 00:36:51,690 --> 00:36:53,480 along and wants to do a query. 730 00:36:53,480 --> 00:36:54,354 AUDIENCE: Yeah. 731 00:36:54,354 --> 00:36:55,230 [INAUDIBLE] 732 00:36:55,230 --> 00:36:57,220 BRADLEY KUSZMAUL: So the trick there is that you put a bit 733 00:36:57,220 --> 00:36:58,940 on the array that says, the new array is 734 00:36:58,940 --> 00:37:00,050 not ready to query. 735 00:37:00,050 --> 00:37:03,670 Keep using the old arrays, which are still there. 736 00:37:03,670 --> 00:37:05,270 Just don't destroy the old ones until 737 00:37:05,270 --> 00:37:07,980 the new one's ready. 738 00:37:07,980 --> 00:37:12,250 So basically you have these two one-megabyte-sized things. 739 00:37:12,250 --> 00:37:14,430 You're trying to make a two-megabyte-sized one.
740 00:37:14,430 --> 00:37:18,520 You leave the one-megabyte ones lying around for a while 741 00:37:18,520 --> 00:37:21,650 while you're incrementally moving things down. 742 00:37:21,650 --> 00:37:24,820 And then suddenly, when the big one's done, you flip the 743 00:37:24,820 --> 00:37:28,460 bits, so in order-one operations, you can say, no, 744 00:37:28,460 --> 00:37:30,960 those two are no longer valid, and this one's valid. 745 00:37:30,960 --> 00:37:34,130 So queries should use this one instead of those. 746 00:37:34,130 --> 00:37:38,400 So that's basically the kind of trick you might do. 747 00:37:38,400 --> 00:37:41,360 Or you would just search the partially constructed arrays, 748 00:37:41,360 --> 00:37:42,005 if you have locks. 749 00:37:42,005 --> 00:37:43,350 There's lots of ways to do it. 750 00:37:47,440 --> 00:37:48,890 So that's a pretty good question. 751 00:37:48,890 --> 00:37:50,830 Yes. 752 00:37:50,830 --> 00:37:53,380 That's one that we had to think about a little. 753 00:37:53,380 --> 00:37:56,940 So it sounds glib, but it's like, how do we do this? 754 00:37:56,940 --> 00:37:59,850 Any other questions? 755 00:37:59,850 --> 00:38:01,750 OK. 756 00:38:01,750 --> 00:38:06,340 So now we've got to do something about the search, 757 00:38:06,340 --> 00:38:08,620 because the search is really bad. 758 00:38:08,620 --> 00:38:12,380 Well, it's not as bad as the insertion worst-case thing. 759 00:38:12,380 --> 00:38:15,000 I'm going to show you how to shave off a factor of log N, 760 00:38:15,000 --> 00:38:17,840 and I don't think I'm going to show you how to shave off the 761 00:38:17,840 --> 00:38:22,010 factor of 1 over log B. So we'll just get it down to log 762 00:38:22,010 --> 00:38:26,150 N instead of log squared N. Because if I actually want to 763 00:38:26,150 --> 00:38:28,330 get it down, then I have to give up-- 764 00:38:28,330 --> 00:38:31,410 remember, the performance that I had was log of N over B.
If 765 00:38:31,410 --> 00:38:33,960 I actually want to get rid of things, I have to 766 00:38:33,960 --> 00:38:34,650 do something else. 767 00:38:34,650 --> 00:38:37,250 There's a lower-bound argument. 768 00:38:37,250 --> 00:38:41,850 So the idea here is we're searching-- 769 00:38:41,850 --> 00:38:47,985 I'm going to flip those. 770 00:38:50,490 --> 00:38:52,505 We've got these arrays of various sizes. 771 00:38:56,100 --> 00:38:59,020 And I've just done a binary search on here and then here 772 00:38:59,020 --> 00:39:01,750 and then here, and I found out the thing I'm looking for 773 00:39:01,750 --> 00:39:04,020 wasn't here and it wasn't here and it wasn't here. 774 00:39:04,020 --> 00:39:07,800 That's where it would have been, if it had been there. 775 00:39:07,800 --> 00:39:09,730 It should have been there but it wasn't. 776 00:39:09,730 --> 00:39:12,000 It should have been here but it wasn't. 777 00:39:12,000 --> 00:39:14,910 And then I'm going to start searching in this array. 778 00:39:14,910 --> 00:39:18,850 And the intuition you might have is that, gee, it's kind 779 00:39:18,850 --> 00:39:21,890 of wasteful to start a whole new search on this array when 780 00:39:21,890 --> 00:39:23,700 we already knew where it wasn't in this array. 781 00:39:26,370 --> 00:39:27,630 Right? 782 00:39:27,630 --> 00:39:31,320 So for example, if the data were uniformly randomly 783 00:39:31,320 --> 00:39:35,270 distributed, and the thing was, say, 1/3 of the array 784 00:39:35,270 --> 00:39:39,230 here, I might gain some advantage by searching at the 785 00:39:39,230 --> 00:39:42,770 1/3 point over here to see if it's there. 786 00:39:42,770 --> 00:39:45,180 Now, that's kind of an intuition. 787 00:39:45,180 --> 00:39:46,610 I don't know how to make that work. 788 00:39:46,610 --> 00:39:50,290 But I do know how to make something work. 
789 00:39:50,290 --> 00:39:53,620 But the intuition is, having done some search here, I 790 00:39:53,620 --> 00:39:56,850 should in principle have information about where to 791 00:39:56,850 --> 00:39:58,680 limit the search so that I don't have to search the whole 792 00:39:58,680 --> 00:40:01,220 thing on the next array. 793 00:40:03,770 --> 00:40:04,120 OK? 794 00:40:04,120 --> 00:40:08,920 And here's basically what you do: every element gets 795 00:40:08,920 --> 00:40:11,730 a forward pointer to where that element would go in the 796 00:40:11,730 --> 00:40:13,430 next array. 797 00:40:13,430 --> 00:40:15,960 So for example, you have something here and something 798 00:40:15,960 --> 00:40:19,300 here, which are the two things that are less than and greater 799 00:40:19,300 --> 00:40:21,620 than the thing you're looking for. 800 00:40:21,620 --> 00:40:24,700 And it says, oh, those should have gone 801 00:40:24,700 --> 00:40:27,850 here in the next array. 802 00:40:27,850 --> 00:40:30,140 So if you maintain that 803 00:40:30,140 --> 00:40:34,150 information, it's almost enough. 804 00:40:34,150 --> 00:40:37,950 But let's gloss over the almost part. 805 00:40:37,950 --> 00:40:41,910 If the destinations of those two pointers are close 806 00:40:41,910 --> 00:40:45,240 together, then you've saved a lot of 807 00:40:45,240 --> 00:40:46,490 search in the next array. 808 00:40:49,840 --> 00:40:51,660 Does anybody see a bug in this? 809 00:40:51,660 --> 00:40:52,805 There is one. 810 00:40:52,805 --> 00:40:55,450 The almost part. 811 00:40:55,450 --> 00:40:57,590 You don't have to see it, because I've been thinking 812 00:40:57,590 --> 00:40:58,840 about this a lot. 813 00:41:01,740 --> 00:41:05,460 The problem is, what if all of these items are less than all 814 00:41:05,460 --> 00:41:08,718 of these items, for example?
815 00:41:08,718 --> 00:41:13,240 In which case, these pointers all point down to the 816 00:41:13,240 --> 00:41:16,220 beginning, and we've got nothing. 817 00:41:16,220 --> 00:41:17,530 That's a case where this fails. 818 00:41:17,530 --> 00:41:19,870 And that's allowed, right? 819 00:41:19,870 --> 00:41:23,330 In particular, if we were inserting things-- 820 00:41:23,330 --> 00:41:24,070 yeah. 821 00:41:24,070 --> 00:41:26,890 AUDIENCE: Then we know the element is in the biggest 822 00:41:26,890 --> 00:41:28,926 array, because the element was supposed to 823 00:41:28,926 --> 00:41:30,650 go between the two. 824 00:41:30,650 --> 00:41:33,060 BRADLEY KUSZMAUL: Ah, but in this array, we found out that 825 00:41:33,060 --> 00:41:37,700 it's above the last element, when we did our search, right? 826 00:41:37,700 --> 00:41:39,660 That's one of the possible ways-- 827 00:41:39,660 --> 00:41:42,250 the worst-case behavior is we've got something where this 828 00:41:42,250 --> 00:41:43,870 array is less than this array. 829 00:41:43,870 --> 00:41:44,890 We're looking for that item. 830 00:41:44,890 --> 00:41:49,030 So we do a binary search and find out, it's over here. 831 00:41:49,030 --> 00:41:51,220 And it doesn't help to special case this or something, 832 00:41:51,220 --> 00:41:53,840 because they could be all to the right or they could be all 833 00:41:53,840 --> 00:41:55,190 bunched up in funny ways. 834 00:41:55,190 --> 00:41:58,930 There's lots of screwy ways that this could go wrong. 835 00:41:58,930 --> 00:42:01,710 But the simple version, it's easy to come up with an 836 00:42:01,710 --> 00:42:03,580 example, which is everything's to the left. 837 00:42:03,580 --> 00:42:04,416 Yeah. 838 00:42:04,416 --> 00:42:10,248 AUDIENCE: Can you still save time by, when you do the 839 00:42:10,248 --> 00:42:12,678 binary search on the smallest array-- but I guess you'd want 840 00:42:12,678 --> 00:42:13,650 [INAUDIBLE] 841 00:42:13,650 --> 00:42:14,136 search. 
842 00:42:14,136 --> 00:42:15,870 It will help reduce the cost, which gives you the 843 00:42:15,870 --> 00:42:18,315 next one and so on? 844 00:42:18,315 --> 00:42:19,370 BRADLEY KUSZMAUL: Yeah. 845 00:42:19,370 --> 00:42:22,950 So there is a way to fix it so that the pointers in the 846 00:42:22,950 --> 00:42:26,480 smaller array do help you reduce the 847 00:42:26,480 --> 00:42:28,010 cost in the next array. 848 00:42:28,010 --> 00:42:32,050 And that is to seed the smaller array with some values 849 00:42:32,050 --> 00:42:32,850 from the next array. 850 00:42:32,850 --> 00:42:35,760 Like, suppose I put in every 20th item, and I stuck it in 851 00:42:35,760 --> 00:42:37,790 that array with a bit on it that says, oh, this is a 852 00:42:37,790 --> 00:42:41,100 repeat, it's going to be repeated again. 853 00:42:41,100 --> 00:42:44,630 So then I could guarantee that there's these dummies that I 854 00:42:44,630 --> 00:42:50,510 throw in here, which are evenly spaced, plus whatever 855 00:42:50,510 --> 00:42:51,150 else is in there. 856 00:42:51,150 --> 00:42:54,430 So put the other things in there, and they have forward 857 00:42:54,430 --> 00:42:55,900 pointers too. 858 00:42:55,900 --> 00:43:02,310 And now I'm guaranteed that the distance between two 859 00:43:02,310 --> 00:43:07,292 adjacent items is a constant. 860 00:43:07,292 --> 00:43:10,430 Does that make sense? 861 00:43:10,430 --> 00:43:16,040 The trick is to make it so that having found two adjacent 862 00:43:16,040 --> 00:43:17,820 items that bracket the thing you want-- 863 00:43:17,820 --> 00:43:21,570 then on the next array, the image of those two items is 864 00:43:21,570 --> 00:43:25,100 separated by at most 20 items. 865 00:43:25,100 --> 00:43:29,890 And so that gets you down to only log of N instead of log 866 00:43:29,890 --> 00:43:33,280 squared of N, because you're searching a constant number of items in 867 00:43:33,280 --> 00:43:35,535 this array, and there's only log N arrays.
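That seeding trick, a simplified flavor of fractional cascading, can be sketched as follows. This is my own illustration, not the lecture's code, and the names and the spacing constant (`every`) are made up for the example:

```python
from bisect import bisect_left

def seed_level(small, big, every=4):
    """Combine `small` with dummy copies of every `every`-th key of the
    bigger array; each entry keeps a forward pointer (an index into
    `big`) saying where its key would land in the next array."""
    keys = sorted(set(small) | set(big[::every]))
    return [(k, bisect_left(big, k)) for k in keys]

def cascade_search(level, big, key):
    """Follow the forward pointers of the two entries bracketing `key`,
    so only a short gap of `big` needs to be binary-searched."""
    ks = [k for k, _ in level]
    i = bisect_left(ks, key)
    lo = level[i - 1][1] if i > 0 else 0
    hi = level[i][1] if i < len(level) else len(big)
    j = bisect_left(big, key, lo, hi)   # the gap is O(every) keys wide
    return j < len(big) and big[j] == key

big = list(range(0, 100, 2))            # the next, bigger sorted array
level = seed_level([5, 33], big)
print(cascade_search(level, big, 40))   # True
print(cascade_search(level, big, 33))   # False
```

Because the dummies are evenly spaced, adjacent entries of `level` always point at nearby positions of `big`, which is exactly the guarantee that shrinks each per-array search to constant work.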
868 00:43:38,280 --> 00:43:39,110 Yeah. 869 00:43:39,110 --> 00:43:42,601 AUDIENCE: Doesn't that slow down the merging of the arrays? 870 00:43:42,601 --> 00:43:45,720 BRADLEY KUSZMAUL: Not asymptotically. 871 00:43:45,720 --> 00:43:48,390 Because asymptotically, what this means-- 872 00:43:48,390 --> 00:43:50,860 if I'm going to build that array, so I'm going to merge 873 00:43:50,860 --> 00:43:53,350 two arrays to make this array, I have to do an additional 874 00:43:53,350 --> 00:43:55,890 scan of this other array as I'm constructing this one. 875 00:43:55,890 --> 00:44:00,080 So the picture is I have two arrays, and I'm trying to 876 00:44:00,080 --> 00:44:02,390 merge them into this array. 877 00:44:02,390 --> 00:44:08,230 And I'm trying to also insert these dummy forward pointers 878 00:44:08,230 --> 00:44:11,770 from the next array, which is only twice as big. 879 00:44:11,770 --> 00:44:15,751 So the big O's are, if it's X, instead of it being 1, 2, 3, 880 00:44:15,751 --> 00:44:18,550 4X, it's 8X. 881 00:44:18,550 --> 00:44:19,800 So it's only a constant. 882 00:44:23,320 --> 00:44:25,960 So basically, I can read all three of these. 883 00:44:25,960 --> 00:44:29,660 I can read an array and the next one and the next array, 884 00:44:29,660 --> 00:44:31,460 which is twice as big, and the next array which is 885 00:44:31,460 --> 00:44:32,740 four times as big. 886 00:44:32,740 --> 00:44:36,510 It all adds up to 8 times the size of the original array. 887 00:44:36,510 --> 00:44:38,330 So at least the asymptotics aren't messed up. 888 00:44:38,330 --> 00:44:40,770 Maybe the engineer in you goes, bleh, I have to read the 889 00:44:40,770 --> 00:44:42,200 data eight times. 890 00:44:42,200 --> 00:44:48,320 But remember, the game here is not to get 100% of the disk's 891 00:44:48,320 --> 00:44:50,690 insertion capacity. 892 00:44:50,690 --> 00:44:55,200 That's not the game, going back to the marketing 893 00:44:55,200 --> 00:44:56,100 perspective. 
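The merge schedule under discussion is the binary-counter pattern: level i holds either nothing or a sorted run of 2^i items, and an insertion "carries" by merging equal-sized runs upward. A toy in-memory sketch-- illustrative, not the real on-disk code-- that also counts how often keys get rewritten:

```python
import math

def cola_insert(levels, key, moves):
    """levels[i] is either None or a sorted run of 2**i keys.  Like
    incrementing a binary counter, an insert 'carries': equal-sized runs
    are merged upward until a free slot is found.  moves[0] counts how
    many times keys are rewritten -- the analogue of merge I/O."""
    carry = [key]
    i = 0
    while True:
        if i == len(levels):
            levels.append(None)
        if levels[i] is None:
            levels[i] = carry          # the free (or cheap) insertion
            return
        run, levels[i] = levels[i], None
        merged = []
        while run or carry:            # plain two-way merge of sorted runs
            src = run if (run and (not carry or run[0] <= carry[0])) else carry
            merged.append(src.pop(0))
        moves[0] += len(merged)
        carry = merged
        i += 1

levels, moves, n = [], [0], 1024
for k in range(n):
    cola_insert(levels, k, moves)
print(moves[0] / n, math.log2(n))  # average rewrites per key vs log2(n)
```

Each key ends up rewritten about log2(n) times in total, which is exactly the constant-factor accounting being juggled above.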
894 00:44:56,100 --> 00:45:00,470 The competition is only getting 0.001% 895 00:45:00,470 --> 00:45:01,720 of the disk's capacity. 896 00:45:04,700 --> 00:45:07,640 That's what a B-tree gets in the worst case. 897 00:45:07,640 --> 00:45:11,730 And so we don't have to get 100% to be three orders of 898 00:45:11,730 --> 00:45:15,680 magnitude better, which is where we are. 899 00:45:15,680 --> 00:45:18,650 So it turns out that for this kind of thing, we end up 900 00:45:18,650 --> 00:45:23,370 getting 1% of the disk's capacity, and everybody's 901 00:45:23,370 --> 00:45:26,310 jumping around saying that's great, because it's 1,000 902 00:45:26,310 --> 00:45:27,560 times faster. 903 00:45:30,630 --> 00:45:34,070 And why do we only get 1%? 904 00:45:34,070 --> 00:45:39,100 Well, there's a factor of two here and there's a log N over 905 00:45:39,100 --> 00:45:46,050 there, and you divide all that, and it's a constant. 906 00:45:46,050 --> 00:45:50,100 It's a challenge, because the engineers at Tokutek are 907 00:45:50,100 --> 00:45:53,310 always having ideas for how to make it faster. 908 00:45:53,310 --> 00:45:57,260 And right now, making this data structure faster is not 909 00:45:57,260 --> 00:45:59,330 the thing that's going to make people buy it. 910 00:45:59,330 --> 00:46:01,390 Because it's already 1,000 times faster than the 911 00:46:01,390 --> 00:46:02,780 competition. 912 00:46:02,780 --> 00:46:06,140 What's going to make it sell is some other thing that adds 913 00:46:06,140 --> 00:46:08,180 features that make it so it's easy to use. 914 00:46:08,180 --> 00:46:09,980 So I keep having to say-- 915 00:46:09,980 --> 00:46:13,760 no, you really need to work on making it so that we can do 916 00:46:13,760 --> 00:46:17,370 backups, or something. 917 00:46:17,370 --> 00:46:20,820 It turns out, if you're selling a database, you need 918 00:46:20,820 --> 00:46:22,930 to do more than just queries and insertions. 
919 00:46:22,930 --> 00:46:24,290 You need to be able to do backups. 920 00:46:24,290 --> 00:46:26,360 You need to be able to recover from a crash. 921 00:46:26,360 --> 00:46:33,540 You need to be able to cope with the problem of some 922 00:46:33,540 --> 00:46:36,820 particularly heavy query that's going and starving all 923 00:46:36,820 --> 00:46:39,300 the other queries from getting their work done. 924 00:46:39,300 --> 00:46:42,460 All those problems turn out to be the problems that, if you 925 00:46:42,460 --> 00:46:44,460 do any of them badly, people won't buy you. 926 00:46:44,460 --> 00:46:49,710 And so I suspect that there's another factor of 10 to be 927 00:46:49,710 --> 00:46:54,650 gotten over this data structure, if you were to sit 928 00:46:54,650 --> 00:46:57,090 down and try to say, how could I make it be the fastest 929 00:46:57,090 --> 00:46:58,510 possible thing. 930 00:46:58,510 --> 00:47:02,110 And someday, that work will have to be done, because the 931 00:47:02,110 --> 00:47:03,900 competition will have it and we won't. 932 00:47:07,370 --> 00:47:09,270 So let's see. 933 00:47:09,270 --> 00:47:10,570 I mentioned some of these just now. 934 00:47:10,570 --> 00:47:14,010 So some of the things you have to do in order to have an 935 00:47:14,010 --> 00:47:17,630 industrial strength dictionary are you need to cope with 936 00:47:17,630 --> 00:47:20,240 variable-size rows. 937 00:47:20,240 --> 00:47:22,320 Now we assumed for the analysis that the rows were 938 00:47:22,320 --> 00:47:23,240 all unit size. 939 00:47:23,240 --> 00:47:25,630 In fact, database rows vary in size. 940 00:47:25,630 --> 00:47:26,670 And some of them are huge. 941 00:47:26,670 --> 00:47:28,550 Some of them are megabytes. 942 00:47:28,550 --> 00:47:31,670 Or sometimes people do things like they put satellite images 943 00:47:31,670 --> 00:47:33,280 into databases. 944 00:47:33,280 --> 00:47:36,310 So they end up having very large rows. 
945 00:47:36,310 --> 00:47:38,295 You have to do deletions as well as insertions. 946 00:47:41,000 --> 00:47:43,220 And it turns out we can do deletions just as fast as 947 00:47:43,220 --> 00:47:45,060 insertions. 948 00:47:45,060 --> 00:47:47,840 And the idea there is basically, if you want to do a 949 00:47:47,840 --> 00:47:53,310 delete, you just insert the thing with a bit on it 950 00:47:53,310 --> 00:47:55,240 that says, hey, this is really a deletion. 951 00:47:55,240 --> 00:47:57,660 And then, whenever you get a chance, when you're doing a 952 00:47:57,660 --> 00:48:01,490 merge, if you find something that has the same value, you 953 00:48:01,490 --> 00:48:03,210 just annihilate it. 954 00:48:03,210 --> 00:48:07,700 And the delete has to keep going down, because there 955 00:48:07,700 --> 00:48:09,640 might be more copies of it further 956 00:48:09,640 --> 00:48:11,600 down that were shadowed. 957 00:48:11,600 --> 00:48:15,480 And eventually, when you finally do the last merge, 958 00:48:15,480 --> 00:48:19,320 that tombstone goes away. 959 00:48:19,320 --> 00:48:21,460 You have to do transactions and logging. 960 00:48:21,460 --> 00:48:23,610 You have to do crash recovery. 961 00:48:23,610 --> 00:48:26,230 And it's a big pain to get that right, and a lot of 962 00:48:26,230 --> 00:48:29,980 companies have foundered when they tried to move from one 963 00:48:29,980 --> 00:48:31,320 mode to the other. 964 00:48:31,320 --> 00:48:33,850 How many of you have experienced the phenomenon that 965 00:48:33,850 --> 00:48:37,030 your file system didn't come back properly after a crash? 966 00:48:40,410 --> 00:48:42,050 You see the difference in age here. 967 00:48:42,050 --> 00:48:46,530 They're all using file systems that have transactional 968 00:48:46,530 --> 00:48:48,490 logging underneath them. 969 00:48:48,490 --> 00:48:50,714 When's the last time it happened? 970 00:48:50,714 --> 00:48:51,470 AUDIENCE: Tuesday. 
971 00:48:51,470 --> 00:48:53,430 BRADLEY KUSZMAUL: Tuesday. 972 00:48:53,430 --> 00:48:56,994 So the difference is you're paying attention and they're 973 00:48:56,994 --> 00:48:59,304 not, right? 974 00:48:59,304 --> 00:49:01,235 AUDIENCE: [INAUDIBLE] disk failure. 975 00:49:01,235 --> 00:49:02,660 BRADLEY KUSZMAUL: Disk failure. 976 00:49:02,660 --> 00:49:03,920 That's a different problem. 977 00:49:03,920 --> 00:49:06,160 AUDIENCE: Caching is not [? finalized. ?] 978 00:49:06,160 --> 00:49:07,360 BRADLEY KUSZMAUL: Yeah. 979 00:49:07,360 --> 00:49:09,650 You say everybody's running with their disk 980 00:49:09,650 --> 00:49:11,520 cache turned on. 981 00:49:11,520 --> 00:49:14,150 And on some file systems, that's a bad idea. 982 00:49:14,150 --> 00:49:18,520 So we're still suffering that it's been difficult to switch 983 00:49:18,520 --> 00:49:23,840 from the original Unix file system, which is 30 years old 984 00:49:23,840 --> 00:49:26,490 and wasn't designed to recover from a crash. 985 00:49:26,490 --> 00:49:30,870 You have to run fsck, and it doesn't always work. 986 00:49:30,870 --> 00:49:32,660 We still have file systems that don't 987 00:49:32,660 --> 00:49:33,800 recover from crashes. 988 00:49:33,800 --> 00:49:36,305 So you can see why that could be difficult. 989 00:49:40,540 --> 00:49:43,450 It turns out that one common use case is that the data is 990 00:49:43,450 --> 00:49:47,580 coming in sequentially, and this data structure just sucks 991 00:49:47,580 --> 00:49:51,060 compared to a B-tree in the case where you're inserting 992 00:49:51,060 --> 00:49:52,970 things and the data actually is already 993 00:49:52,970 --> 00:49:54,110 sorted as it's inserted. 994 00:49:54,110 --> 00:49:56,220 Because this is moving things all around and moving things 995 00:49:56,220 --> 00:49:56,710 all around. 996 00:49:56,710 --> 00:49:59,580 And it's like, why didn't you just notice that it's sorted 997 00:49:59,580 --> 00:50:02,000 and put it in? 
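The delete-as-insert scheme from a moment ago-- a "tombstone" bit that annihilates older copies as merges happen, and itself survives until the final merge-- can be sketched as a single merge step. This is an illustrative reconstruction, not the product's actual code; runs are sorted lists of (key, is_tombstone) pairs, newest run first, at most one entry per key in each run:

```python
def merge_with_tombstones(newer, older, is_last_level):
    """Merge two sorted runs of (key, is_tombstone) pairs, newest first.
    A tombstone annihilates the older copy of its key and keeps flowing
    down; only at the last merge does the tombstone itself go away."""
    out, i, j = [], 0, 0
    while i < len(newer) or j < len(older):
        if j == len(older) or (i < len(newer) and newer[i][0] <= older[j][0]):
            key, dead = newer[i]
            i += 1
            if j < len(older) and older[j][0] == key:
                j += 1                 # the shadowed older copy is annihilated
        else:
            key, dead = older[j]
            j += 1
        if dead and is_last_level:
            continue                   # the tombstone goes away at the bottom
        out.append((key, dead))
    return out

newer = [(2, False), (5, True), (9, False)]   # (key, is_tombstone)
older = [(1, False), (5, False), (7, True)]
print(merge_with_tombstones(newer, older, is_last_level=False))
print(merge_with_tombstones(newer, older, is_last_level=True))
```

Note that the tombstone for key 5 kills the older copy of 5 but is still emitted on intermediate merges, in case yet older copies are shadowed further down.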
998 00:50:02,000 --> 00:50:06,470 You have to get rid of the log base 2 to get it down to log 999 00:50:06,470 --> 00:50:10,620 base B of N instead of log base 2 of N for search costs. 1000 00:50:10,620 --> 00:50:16,060 Because people in fact do a lot more searches than-- 1001 00:50:16,060 --> 00:50:18,200 if you have to choose which to do better, you want to 1002 00:50:18,200 --> 00:50:20,260 generally do searches better. 1003 00:50:20,260 --> 00:50:22,550 And compression turns out to be important. 1004 00:50:22,550 --> 00:50:29,250 I had one customer who had a database 1005 00:50:29,250 --> 00:50:31,720 which was 300 gigabytes. 1006 00:50:31,720 --> 00:50:34,910 He had a whole bunch of servers, and on each server, 1007 00:50:34,910 --> 00:50:37,260 he had a 300 gigabyte database. 1008 00:50:37,260 --> 00:50:40,210 And with us, it was 70 1009 00:50:40,210 --> 00:50:42,780 gigabytes, because we compress. 1010 00:50:42,780 --> 00:50:45,380 And we just do simple compression of, basically, 1011 00:50:45,380 --> 00:50:46,200 large blocks. 1012 00:50:46,200 --> 00:50:48,885 When we do I/Os, we do I/Os of like a megabyte. 1013 00:50:51,480 --> 00:50:54,460 So when we take one of those megabytes, we compress it. 1014 00:50:54,460 --> 00:50:57,790 And it's a big advantage to compress a megabyte at a time, 1015 00:50:57,790 --> 00:50:59,020 instead of what-- 1016 00:50:59,020 --> 00:51:02,440 a lot of B-trees, they have maybe 16 kilobytes. 1017 00:51:02,440 --> 00:51:05,580 And gzip hardly gets a chance to get anywhere when you only 1018 00:51:05,580 --> 00:51:07,700 have 16 kilobytes. 1019 00:51:07,700 --> 00:51:09,960 And it gets down to 12 kilobytes. 1020 00:51:09,960 --> 00:51:13,000 But if you have a megabyte to work with and you compress it, 1021 00:51:13,000 --> 00:51:14,610 particularly if it's sorted-- 1022 00:51:14,610 --> 00:51:17,940 so this is a megabyte of data that's sorted, so compression 1023 00:51:17,940 --> 00:51:19,650 works pretty well on sorted data. 
1024 00:51:19,650 --> 00:51:22,900 So you get factors of 5 or 10 or something. 1025 00:51:22,900 --> 00:51:26,950 And so we asked him to dump the data without the indexes, 1026 00:51:26,950 --> 00:51:30,680 so just the primary table with no indexes, and then run that 1027 00:51:30,680 --> 00:51:31,780 through gzip. 1028 00:51:31,780 --> 00:51:33,210 And it was 50 gigabytes. 1029 00:51:33,210 --> 00:51:36,930 So the smallest he could store the raw data was 50 gigabytes, 1030 00:51:36,930 --> 00:51:39,140 and we were giving him a useful database that was 70 1031 00:51:39,140 --> 00:51:41,370 gigabytes that had a bunch of indexes. 1032 00:51:41,370 --> 00:51:42,620 So he was like, yeah. 1033 00:51:45,150 --> 00:51:47,360 And you have to deal with multithreading and lots of 1034 00:51:47,360 --> 00:51:48,560 clients and stuff. 1035 00:51:48,560 --> 00:51:50,990 So here's an example. 1036 00:51:50,990 --> 00:51:53,350 We worked with Mark Callahan, who was at Google at the 1037 00:51:53,350 --> 00:51:55,140 time-- he's now at Facebook-- 1038 00:51:55,140 --> 00:51:57,190 on trying to come up with some benchmarks, because none of 1039 00:51:57,190 --> 00:52:03,650 the benchmarks out in the world do a good job of 1040 00:52:03,650 --> 00:52:06,820 measuring this insertion performance problem. 1041 00:52:06,820 --> 00:52:09,860 So iiBench is an insertion benchmark. 1042 00:52:09,860 --> 00:52:14,170 And basically what it does is it sets up a database with 1043 00:52:14,170 --> 00:52:18,210 three indexes, and the indexes are random. 1044 00:52:18,210 --> 00:52:21,290 So it's actually harder than real workloads. 1045 00:52:21,290 --> 00:52:24,040 This workload, you basically have a row and then you create 1046 00:52:24,040 --> 00:52:26,930 a random key to point into that from 1047 00:52:26,930 --> 00:52:28,230 three different places. 
1048 00:52:28,230 --> 00:52:31,620 Real databases, it turns out, probably have more of a 1049 00:52:31,620 --> 00:52:32,620 Zipfian distribution. 1050 00:52:32,620 --> 00:52:36,660 Have you talked at all about Zipfian distributions of data? 1051 00:52:36,660 --> 00:52:37,850 So this is sort of an interesting thing. 1052 00:52:37,850 --> 00:52:49,810 If you're dealing with real-world caches, you should 1053 00:52:49,810 --> 00:52:54,310 know that data ain't uniformly randomly distributed. 1054 00:52:54,310 --> 00:52:55,970 That's a poor model. 1055 00:52:55,970 --> 00:53:03,150 So in particular, suppose I have memory and disk, and this 1056 00:53:03,150 --> 00:53:06,410 is 10% of the disk. 1057 00:53:06,410 --> 00:53:07,800 Very simple situation. 1058 00:53:07,800 --> 00:53:09,660 Very common ratio. 1059 00:53:09,660 --> 00:53:12,190 You'll see, this is how Facebook 1060 00:53:12,190 --> 00:53:13,440 sets up their databases. 1061 00:53:15,550 --> 00:53:19,810 They'll have a 300-gigabyte database and 30 gigs of RAM. 1062 00:53:19,810 --> 00:53:25,710 If the queries that you wanted to do were random, it wouldn't 1063 00:53:25,710 --> 00:53:27,790 matter what data structure you were using. 1064 00:53:27,790 --> 00:53:31,520 Let's suppose that God tells you where it is on disk, so 1065 00:53:31,520 --> 00:53:32,430 you don't have to find it. 1066 00:53:32,430 --> 00:53:34,680 You just have to move the disk head and move it. 1067 00:53:34,680 --> 00:53:37,310 If they're random, then basically, no matter what 1068 00:53:37,310 --> 00:53:41,510 you've done, 90% of the queries you have to do are 1069 00:53:41,510 --> 00:53:45,360 going to do a random disk I/O. 10% are going to already be 1070 00:53:45,360 --> 00:53:48,090 there, because you got lucky. 1071 00:53:48,090 --> 00:53:51,980 So that is not reflecting what's going on on any 1072 00:53:51,980 --> 00:53:53,350 workload that I know. 
1073 00:53:53,350 --> 00:53:57,570 What they'll see is more like 99% of the queries hit here, 1074 00:53:57,570 --> 00:54:03,150 and 1% go out here, or maybe 95% here and 5% go out there. 1075 00:54:03,150 --> 00:54:06,540 And it turns out that, for a lot of things, there is 1076 00:54:06,540 --> 00:54:10,710 a model of what's going on called a Zipfian distribution. 1077 00:54:10,710 --> 00:54:12,525 This would be a random uniform distribution. 1078 00:54:12,525 --> 00:54:15,860 It's like every item has equal probability of being chosen 1079 00:54:15,860 --> 00:54:17,460 for a query. 1080 00:54:17,460 --> 00:54:23,810 It turns out that for things like what's the popularity of 1081 00:54:23,810 --> 00:54:30,590 web pages, or if you have a library, what's the frequency 1082 00:54:30,590 --> 00:54:32,610 at which words appear in the library-- 1083 00:54:32,610 --> 00:54:36,880 so words like "the" appear frequently, and words like 1084 00:54:36,880 --> 00:54:39,940 "polymorphic" are less frequent. 1085 00:54:39,940 --> 00:54:44,520 So Zipf came up with this model, and there's a simple 1086 00:54:44,520 --> 00:54:47,650 version of the model, which says that the most popular 1087 00:54:47,650 --> 00:54:52,040 word has probability proportional to 1. 1088 00:54:52,040 --> 00:54:52,990 It's not going to be 1. 1089 00:54:52,990 --> 00:54:54,970 It's going to be proportional to 1. 1090 00:54:54,970 --> 00:54:58,330 The second most popular word is going to have 1/2. 1091 00:54:58,330 --> 00:55:02,580 The third most popular word is 1/3. 1092 00:55:02,580 --> 00:55:06,160 And the fourth most popular word is 1/4 the probability of 1093 00:55:06,160 --> 00:55:09,240 the first word, and so forth. 1094 00:55:09,240 --> 00:55:13,260 So if you plot this distribution, it kind of looks 1095 00:55:13,260 --> 00:55:14,440 like this, right? 1096 00:55:14,440 --> 00:55:17,590 It's like 1 over x. 
1097 00:55:17,590 --> 00:55:20,930 And what would you tell me if I told you that I had an 1098 00:55:20,930 --> 00:55:25,040 infinite universe of objects that had a probability 1099 00:55:25,040 --> 00:55:29,690 distribution like this? 1100 00:55:29,690 --> 00:55:32,740 Does that seem plausible? 1101 00:55:32,740 --> 00:55:33,300 Why? 1102 00:55:33,300 --> 00:55:34,796 You're saying no. 1103 00:55:34,796 --> 00:55:36,046 AUDIENCE: [INAUDIBLE PHRASE] 1104 00:55:40,508 --> 00:55:41,460 try adding them all together. 1105 00:55:41,460 --> 00:55:42,900 BRADLEY KUSZMAUL: If you add them all 1106 00:55:42,900 --> 00:55:45,520 together, it doesn't converge. 1107 00:55:45,520 --> 00:55:48,140 So it's a heavy-tailed distribution. 1108 00:55:48,140 --> 00:55:55,430 So it turns out that if you sum these up, the sum 1109 00:55:55,430 --> 00:56:05,540 from i equals 1 to n of 1 over i, it's the nth harmonic number. 1110 00:56:05,540 --> 00:56:07,030 And that grows with n. 1111 00:56:07,030 --> 00:56:09,870 It's basically like the integral under this curve, 1112 00:56:09,870 --> 00:56:14,410 from 1 to n. 1113 00:56:14,410 --> 00:56:15,910 It's close to that. 1114 00:56:15,910 --> 00:56:17,190 And what is the integral of that? 1115 00:56:20,900 --> 00:56:23,170 It's like something you learned seven years ago, and 1116 00:56:23,170 --> 00:56:24,535 now you've forgotten, right? 1117 00:56:24,535 --> 00:56:26,590 You learned it when you were sophomores in 1118 00:56:26,590 --> 00:56:28,510 high school, or something. 1119 00:56:28,510 --> 00:56:33,770 So it's approximately log of n. 1120 00:56:33,770 --> 00:56:39,410 Actually, log of n plus 0.57 is a very good 1121 00:56:39,410 --> 00:56:41,630 approximation for the nth harmonic number. 
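That harmonic-number arithmetic is easy to check numerically, and it also gives the cache-hit-rate story from before. A sketch assuming the textbook 1/i Zipf model; the exact hit-rate percentage depends on the real skew (pure 1/i gives a smaller number than the 95-99% quoted earlier, but the contrast with uniform is still dramatic):

```python
import math

GAMMA = 0.5772156649  # Euler-Mascheroni constant, the 0.57 above

def harmonic(n):
    """Exact nth harmonic number: 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))

def harmonic_approx(n):
    """The approximation from the board: H(n) ~ ln(n) + 0.57."""
    return math.log(n) + GAMMA

def zipf_hit_rate(n_items, cached_fraction):
    """Under Zipf, item i is queried with probability (1/i)/H(n), so a
    cache holding the hottest k items hits with probability H(k)/H(n)."""
    k = int(n_items * cached_fraction)
    return harmonic_approx(k) / harmonic_approx(n_items)

print(harmonic(1000), harmonic_approx(1000))  # agree to about 3 decimals
print(zipf_hit_rate(1000000, 0.10))           # ~0.84, versus 0.10 for uniform
```

With a million items and 10% of them in memory, a uniform workload hits memory 10% of the time, but a Zipfian one hits roughly H(100,000)/H(1,000,000), around 84%, which is why caching works at all.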
1122 00:56:41,630 --> 00:56:43,800 When you're doing this kind of analysis, boy, it depresses 1123 00:56:43,800 --> 00:56:47,000 people, because you say, oh, that's H of n, and if you have 1124 00:56:47,000 --> 00:56:50,510 a million items in your database, then the sum of all 1125 00:56:50,510 --> 00:56:52,270 those things is H of 1 million, and 1126 00:56:52,270 --> 00:56:53,540 what's H of 1 million? 1127 00:56:53,540 --> 00:56:56,930 Well, the log base 2 of 1 million, I know that, because 1128 00:56:56,930 --> 00:56:58,520 I'm a computer scientist. 1129 00:56:58,520 --> 00:57:00,560 So it's going to be like 20, and because we're doing log 1130 00:57:00,560 --> 00:57:05,370 base e in that formula, maybe it's 15 or something. 1131 00:57:05,370 --> 00:57:10,220 So if you have 1 million items, then the most popular 1132 00:57:10,220 --> 00:57:13,620 item is going to-- you have to divide by H of n here. 1133 00:57:13,620 --> 00:57:15,250 So the most popular item is going to 1134 00:57:15,250 --> 00:57:17,360 appear 1/15 of the time. 1135 00:57:17,360 --> 00:57:19,300 And the next most popular item is going to appear-- 1136 00:57:22,260 --> 00:57:23,620 emergency backup chalk. 1137 00:57:26,800 --> 00:57:30,530 Somebody's been burning both ends of this chalk. 1138 00:57:30,530 --> 00:57:34,810 1/30 of the time, and 1/45 of the time, and those add up. 1139 00:57:34,810 --> 00:57:37,430 When you go up to 1 over 1 million-- 1140 00:57:41,610 --> 00:57:43,250 another zero in there-- 1141 00:57:43,250 --> 00:57:49,060 times 15, that finite series will add up to 1, 1142 00:57:49,060 --> 00:57:50,060 approximately. 1143 00:57:50,060 --> 00:57:53,150 Except to the extent that I've approximated. 1144 00:57:53,150 --> 00:57:55,080 So this is what's going on. 1145 00:57:55,080 --> 00:58:01,270 So the most popular Facebook page-- 1146 00:58:01,270 --> 00:58:04,350 they might have 1 billion pages, so how 1147 00:58:04,350 --> 00:58:05,320 does that change things? 
1148 00:58:05,320 --> 00:58:08,120 Well, that means the most popular one has a probability 1149 00:58:08,120 --> 00:58:11,860 1 in 20, and the second most is 1 in 40. 1150 00:58:11,860 --> 00:58:14,540 And this explains why cache works for 1151 00:58:14,540 --> 00:58:15,820 this kind of workload. 1152 00:58:15,820 --> 00:58:20,370 Nobody really knows why Facebook pages and words in 1153 00:58:20,370 --> 00:58:25,300 libraries and everything else have this distribution, which 1154 00:58:25,300 --> 00:58:28,870 is named after a guy named Zipf. 1155 00:58:31,660 --> 00:58:32,510 But they do. 1156 00:58:32,510 --> 00:58:33,740 Everything has this property. 1157 00:58:33,740 --> 00:58:36,000 And so you can sort of predict what's happening. 1158 00:58:36,000 --> 00:58:38,400 So iiBench should have a Zipfian 1159 00:58:38,400 --> 00:58:39,470 distribution and it doesn't. 1160 00:58:39,470 --> 00:58:41,590 So this is painting a worse picture. 1161 00:58:41,590 --> 00:58:42,450 Or a better picture. 1162 00:58:42,450 --> 00:58:45,610 It's making us look better than we really are, because 1163 00:58:45,610 --> 00:58:49,840 the real world is going to have more hits on the stuff 1164 00:58:49,840 --> 00:58:53,700 that's in memory for a B-tree than this model, where 1165 00:58:53,700 --> 00:58:56,240 basically you're completely hosed all the time because 1166 00:58:56,240 --> 00:58:57,720 it's random. 1167 00:58:57,720 --> 00:59:00,960 So this is an example in the category of how to lie with 1168 00:59:00,960 --> 00:59:02,090 statistics. 1169 00:59:02,090 --> 00:59:05,650 And it's a pretty sophisticated lie. 1170 00:59:05,650 --> 00:59:07,255 If you're going to lie, be sophisticated. 1171 00:59:11,340 --> 00:59:14,050 So these measurements were taken in the top graph. 1172 00:59:14,050 --> 00:59:15,440 Up is good. 1173 00:59:15,440 --> 00:59:17,930 It's how many rows per second we could insert. 
1174 00:59:17,930 --> 00:59:22,800 And this axis is how many rows have been inserted so far. 1175 00:59:22,800 --> 00:59:27,200 And the green one is a B-tree. 1176 00:59:27,200 --> 00:59:29,750 According to Mark Callahan, who's essentially a 1177 00:59:29,750 --> 00:59:33,140 disinterested observer, it's the best implementation of a 1178 00:59:33,140 --> 00:59:36,130 B-tree ever. 1179 00:59:36,130 --> 00:59:38,840 And you can sort of see what happens, is that as you insert 1180 00:59:38,840 --> 00:59:41,450 stuff, the system falls out of main memory, and the 1181 00:59:41,450 --> 00:59:43,020 performance was really good at the beginning-- 1182 00:59:43,020 --> 00:59:46,150 40,000 per second-- and then boom, you're down to 200 down 1183 00:59:46,150 --> 00:59:47,410 here at the end, by the time you've 1184 00:59:47,410 --> 00:59:49,170 inserted a billion rows. 1185 00:59:49,170 --> 00:59:53,005 Whereas, for the fractal tree, you can 1186 00:59:53,005 --> 00:59:54,250 sort of see this noise. 1187 00:59:54,250 --> 00:59:57,390 That's because some insertions are a little cheaper than 1188 00:59:57,390 --> 00:59:59,690 other insertions. 1189 00:59:59,690 --> 01:00:01,760 Every other insertion's completely free, right? 1190 01:00:01,760 --> 01:00:02,630 You had a free spot. 1191 01:00:02,630 --> 01:00:03,880 You just put it in. 1192 01:00:07,610 --> 01:00:10,440 And of the insertions that weren't free, half of them-- 1193 01:00:10,440 --> 01:00:13,260 one out of four overall-- only had to do a little operation in memory. 1194 01:00:13,260 --> 01:00:17,100 So you see this high frequency noise, because some things are 1195 01:00:17,100 --> 01:00:18,540 cheaper than others. 1196 01:00:18,540 --> 01:00:21,730 And that's like a factor of 30 or something. 1197 01:00:24,640 --> 01:00:28,480 It turns out it even works on SSD, solid state disk. 1198 01:00:28,480 --> 01:00:29,300 You might think-- 1199 01:00:29,300 --> 01:00:31,500 all this time I've been talking about disk drives. 
1200 01:00:31,500 --> 01:00:34,440 Solid state disk has a complicated cache hierarchy 1201 01:00:34,440 --> 01:00:37,250 inside it, and we were surprised to see that 1202 01:00:37,250 --> 01:00:43,840 basically we're faster on this workload on a rotating disk 1203 01:00:43,840 --> 01:00:49,220 than a B-tree is on an SSD, which is orders of magnitude 1204 01:00:49,220 --> 01:00:52,500 faster in principle, but turns out that for various 1205 01:00:52,500 --> 01:00:53,750 reasons it's not. 1206 01:00:59,440 --> 01:01:03,190 One question I get often is, the world is moving away from 1207 01:01:03,190 --> 01:01:05,660 rotating disk to solid state disk. 1208 01:01:05,660 --> 01:01:06,760 A lot of applications-- 1209 01:01:06,760 --> 01:01:10,050 how many of you have solid state disks in your laptops? 1210 01:01:10,050 --> 01:01:12,430 That's a really good application for a solid state 1211 01:01:12,430 --> 01:01:15,070 disk, because it's not sensitive to 1212 01:01:15,070 --> 01:01:16,990 being knocked around. 1213 01:01:16,990 --> 01:01:20,550 So it's worth it to have a solid state disk even if it 1214 01:01:20,550 --> 01:01:22,810 were more expensive, which it is. 1215 01:01:22,810 --> 01:01:24,020 It turns out it's not that much more 1216 01:01:24,020 --> 01:01:24,950 expensive for a laptop. 1217 01:01:24,950 --> 01:01:27,990 It's a couple of hundred dollars more or something. 1218 01:01:27,990 --> 01:01:30,250 But the advantage of it is that if you go up in an 1219 01:01:30,250 --> 01:01:32,820 airplane and you're sitting and trying to type in the 1220 01:01:32,820 --> 01:01:35,140 middle of a thunderstorm, flying across-- 1221 01:01:35,140 --> 01:01:38,410 it doesn't care. 
1222 01:01:38,410 --> 01:01:40,415 Disk drives, if you do that-- 1223 01:01:40,415 --> 01:01:44,170 disk drives do not like flying at high altitude, because they 1224 01:01:44,170 --> 01:01:47,380 work by having a cushion of air that the head is flying 1225 01:01:47,380 --> 01:01:52,670 on, and in airplanes, which pressurize the cabin to the 1226 01:01:52,670 --> 01:01:56,450 equivalent of 8,000 feet, that's half an atmosphere. 1227 01:01:56,450 --> 01:01:59,340 So there's only half as much air keeping it off. 1228 01:01:59,340 --> 01:02:02,830 So if you travel a lot, that's when your disk drive will 1229 01:02:02,830 --> 01:02:06,100 fail-- when you're flying. 1230 01:02:06,100 --> 01:02:08,790 OK. 1231 01:02:08,790 --> 01:02:12,940 So it looks like, however, that rotating disk is getting 1232 01:02:12,940 --> 01:02:16,810 cheaper faster than solid state disk is. 1233 01:02:16,810 --> 01:02:21,200 So rotating disk is an order of magnitude cheaper per byte 1234 01:02:21,200 --> 01:02:23,250 than solid state disk today. 1235 01:02:23,250 --> 01:02:25,010 Maybe two orders of magnitude cheaper. 1236 01:02:25,010 --> 01:02:27,480 It's hard to measure fairly. 1237 01:02:27,480 --> 01:02:30,470 But rotating disk, according to Seagate-- 1238 01:02:30,470 --> 01:02:33,650 they're saying, by the end of the decade, we'll have 70 1239 01:02:33,650 --> 01:02:38,870 terabyte drives that are the same form factor. 1240 01:02:38,870 --> 01:02:41,870 And so you figure out what the Moore's Law is for that, and 1241 01:02:41,870 --> 01:02:45,330 it's better than for lithography. 1242 01:02:45,330 --> 01:02:48,670 Lithography is not going to be that much more 1243 01:02:48,670 --> 01:02:50,580 dense in that timeframe. 
1244 01:02:50,580 --> 01:02:55,440 So at least for the next 5 or 10 years, it looks like disk 1245 01:02:55,440 --> 01:02:59,460 drives are going to maintain their cost advantage over 1246 01:02:59,460 --> 01:03:01,450 solid state storage, and maybe even 1247 01:03:01,450 --> 01:03:02,770 spread that cost advantage. 1248 01:03:02,770 --> 01:03:05,920 So for any particular application, for storing your 1249 01:03:05,920 --> 01:03:09,390 music, SSD will be cheap enough, but for those people 1250 01:03:09,390 --> 01:03:11,820 that have really big data sets, like these new 1251 01:03:11,820 --> 01:03:13,510 telescopes they're putting up-- 1252 01:03:13,510 --> 01:03:15,040 these new telescopes are crazy. 1253 01:03:15,040 --> 01:03:17,480 These people are putting up these telescopes. 1254 01:03:17,480 --> 01:03:19,540 They're putting up 1,500 telescopes across the 1255 01:03:19,540 --> 01:03:21,300 Australian Outback. 1256 01:03:21,300 --> 01:03:24,370 And each of those telescopes in the first 15 minutes live 1257 01:03:24,370 --> 01:03:26,940 is going to produce more data than has come down from the 1258 01:03:26,940 --> 01:03:29,910 Hubble, total. 1259 01:03:29,910 --> 01:03:33,690 And there's just no way for them to-- 1260 01:03:33,690 --> 01:03:34,760 I don't know what they're going to do. 1261 01:03:34,760 --> 01:03:37,140 But it's a huge amount of data, and they're going to 1262 01:03:37,140 --> 01:03:39,040 have to use disks to store whatever it is that 1263 01:03:39,040 --> 01:03:40,010 they want to keep. 1264 01:03:40,010 --> 01:03:41,950 And they don't like throwing away data, because it's so 1265 01:03:41,950 --> 01:03:43,790 expensive to make. 1266 01:03:43,790 --> 01:03:46,720 So if I were a disk maker, I'd make sure that my salesmen had 1267 01:03:46,720 --> 01:03:48,322 an office somewhere out there. 
1268 01:03:51,930 --> 01:03:53,870 So the conclusion is you're not going to be able to, at 1269 01:03:53,870 --> 01:03:56,130 least for those applications, just have an 1270 01:03:56,130 --> 01:03:57,250 index in main memory. 1271 01:03:57,250 --> 01:03:59,430 You're going to have to have a data structure that 1272 01:03:59,430 --> 01:04:00,680 works well on disk. 1273 01:04:03,520 --> 01:04:05,740 The speed trends-- 1274 01:04:05,740 --> 01:04:08,530 well, seek time is not going to change. 1275 01:04:08,530 --> 01:04:09,350 It hasn't changed. 1276 01:04:09,350 --> 01:04:10,970 It's not going to change. 1277 01:04:10,970 --> 01:04:15,000 The bandwidth of a disk drive grows with the square root of 1278 01:04:15,000 --> 01:04:16,440 its capacity. 1279 01:04:16,440 --> 01:04:19,150 So if you quadruple the storage on the disk because 1280 01:04:19,150 --> 01:04:24,500 you've made the bits twice as dense in each dimension, then 1281 01:04:24,500 --> 01:04:27,800 one spin of the disk sees twice as many bits, not four 1282 01:04:27,800 --> 01:04:29,310 times as many bits. 1283 01:04:29,310 --> 01:04:32,110 So that projects out to something like disks that are 1284 01:04:32,110 --> 01:04:34,210 500 megabytes per second. 1285 01:04:34,210 --> 01:04:37,650 So how long is it going to take to back up a 67 terabyte 1286 01:04:37,650 --> 01:04:38,900 disk drive? 1287 01:04:42,860 --> 01:04:47,700 So there remain systems problems. 1288 01:04:47,700 --> 01:04:51,720 And I was explaining to my son that there's all these 1289 01:04:51,720 --> 01:04:54,020 problems in systems. 1290 01:04:54,020 --> 01:04:59,320 Data structures aren't well suited, and all these systems suck. 1291 01:04:59,320 --> 01:05:01,740 He said, well, isn't that horrible if 1292 01:05:01,740 --> 01:05:02,790 you're a computer scientist? 1293 01:05:02,790 --> 01:05:09,650 I said, no, because we make our living 1294 01:05:09,650 --> 01:05:11,060 off of these problems. 
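The backup question has a sobering back-of-the-envelope answer using the figures just given (roughly 67 terabytes at the projected ~500 megabytes per second of sequential bandwidth; both are the lecture's projections, not measurements):

```python
capacity = 67e12    # bytes: the ~67 TB drive projected above
bandwidth = 500e6   # bytes/second: sequential, from the sqrt-capacity scaling
seconds = capacity / bandwidth
days = seconds / 86400.0
print("full sequential read: %.0f seconds, about %.1f days" % (seconds, days))
```

That is about a day and a half of doing nothing but sequential reading, before writing the backup anywhere.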
1295 01:05:14,060 --> 01:05:15,270 So here are some problems. 1296 01:05:15,270 --> 01:05:17,320 There's plenty of living to be made yet. 1297 01:05:23,760 --> 01:05:26,630 Power consumption is also a big issue for these things. 1298 01:05:26,630 --> 01:05:32,300 If you fill up a room, like a Google data center, a room which is 1299 01:05:32,300 --> 01:05:36,960 probably bigger than this room, full of machines. 1300 01:05:36,960 --> 01:05:39,930 The Facebook data center is probably a room about this 1301 01:05:39,930 --> 01:05:41,690 size, full of machines. 1302 01:05:41,690 --> 01:05:46,420 And power and cooling is something like half the cost 1303 01:05:46,420 --> 01:05:47,440 of the machines. 1304 01:05:47,440 --> 01:05:52,940 The machines for something like Facebook, the hardware 1305 01:05:52,940 --> 01:05:55,160 might cost them $10 million or $20 million a year, and the 1306 01:05:55,160 --> 01:05:57,510 power and cooling is another $10 million or $20 million a 1307 01:05:57,510 --> 01:06:00,080 year, which is why they go off and they build these data 1308 01:06:00,080 --> 01:06:03,430 centers in places like North Carolina, where I guess 1309 01:06:03,430 --> 01:06:08,350 they're willing to give them power for free or something. 1310 01:06:08,350 --> 01:06:11,460 So making good use of disk bandwidth offers huge power 1311 01:06:11,460 --> 01:06:15,530 savings, because basically you can use disks, which are 1312 01:06:15,530 --> 01:06:17,450 cheaper than solid state for power. 1313 01:06:17,450 --> 01:06:24,000 And you want to use that well. 1314 01:06:24,000 --> 01:06:25,110 CPU trends. 1315 01:06:25,110 --> 01:06:26,900 Well, you've probably talked about this, right? 1316 01:06:26,900 --> 01:06:30,600 CPUs are going to get a lot more cores. 1317 01:06:30,600 --> 01:06:35,160 I actually have a 48-core machine that cost $10,000 that 1318 01:06:35,160 --> 01:06:37,340 I bought about a month ago. 
1319 01:06:37,340 --> 01:06:41,790 And our customers mostly use machines that 1320 01:06:41,790 --> 01:06:44,000 are like $5,000 machines. 1321 01:06:44,000 --> 01:06:46,610 So when I provisioned this machine, I said, well, I 1322 01:06:46,610 --> 01:06:49,370 should spend more and buy a machine that's twice as good as what 1323 01:06:49,370 --> 01:06:52,030 they're buying, because I'm developing software that 1324 01:06:52,030 --> 01:06:55,010 they're going to use next year. 1325 01:06:55,010 --> 01:06:58,440 So I bought a $10,000 machine, which is 48 cores. 1326 01:06:58,440 --> 01:07:03,160 And we're having all sorts of fun making a 1327 01:07:03,160 --> 01:07:06,110 living with that machine. 1328 01:07:06,110 --> 01:07:09,790 The memory bandwidth and the I/O bus bandwidth will grow. 1329 01:07:09,790 --> 01:07:14,410 And so I think it's going to get more and more exciting to 1330 01:07:14,410 --> 01:07:15,970 try to use all these cores. 1331 01:07:15,970 --> 01:07:21,180 Fractal trees have a lot of opportunity to use those cores 1332 01:07:21,180 --> 01:07:25,510 to improve performance and reduce the number of disk I/Os. 1333 01:07:25,510 --> 01:07:30,140 So the conclusion is, basically, these data 1334 01:07:30,140 --> 01:07:35,850 structures dominate B-trees asymptotically. 1335 01:07:35,850 --> 01:07:40,870 And then B-trees have 40 years of engineering advantage, but 1336 01:07:40,870 --> 01:07:42,120 that will evaporate eventually. 1337 01:07:45,300 --> 01:07:48,170 These data structures ride better technology curves than 1338 01:07:48,170 --> 01:07:52,810 B-trees do, and so I find it hard to believe that in 10 1339 01:07:52,810 --> 01:07:56,180 years anybody would design a system using a 1340 01:07:56,180 --> 01:08:00,700 B-tree, because how do you overcome those advantages? 1341 01:08:00,700 --> 01:08:03,660 So basically all storage systems are going to use data 1342 01:08:03,660 --> 01:08:06,940 structures that are like this, or something else. 
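[To make the "dominate asymptotically" claim concrete, here is a rough sketch comparing textbook amortized insert costs: O(log_B N) disk I/Os for a B-tree versus roughly O((log N)/B) for a buffered write-optimized structure such as a cache-oblivious lookahead array, used here as a simplified stand-in for the fractal-tree bound. The values of N and B are illustrative assumptions, not measurements from the talk:]

```python
import math

N = 10**9      # number of rows in the index (illustrative)
B = 1024       # keys that fit in one disk block (illustrative)

# Amortized disk I/Os per insert, constants dropped:
btree_ios = math.log(N, B)    # B-tree: O(log_B N), roughly 3 I/Os per insert
cola_ios = math.log2(N) / B   # buffered structure: O((log N)/B), far below 1

print(f"B-tree:   {btree_ios:.2f} I/Os per insert")
print(f"buffered: {cola_ios:.4f} I/Os per insert")
print(f"ratio:    {btree_ios / cola_ios:.0f}x")
```

[The gap is about two orders of magnitude at these sizes, and it widens as N grows, which is the sense in which B-trees lose asymptotically even with their engineering head start.]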
1343 01:08:06,940 --> 01:08:09,000 There's a whole bunch of other kinds of indexes that we 1344 01:08:09,000 --> 01:08:12,450 haven't attacked, things like indexing multi-dimensional 1345 01:08:12,450 --> 01:08:19,609 data or indexing data where you have very large keys, very 1346 01:08:19,609 --> 01:08:22,189 large rows. 1347 01:08:22,189 --> 01:08:25,729 Imagine that you're trying to index DNA sequences, which are 1348 01:08:25,729 --> 01:08:29,520 much bigger than a disk block. 1349 01:08:29,520 --> 01:08:33,160 So there's a whole bunch of interesting opportunities. 1350 01:08:33,160 --> 01:08:37,880 And that's what I'm working on. 1351 01:08:37,880 --> 01:08:41,660 So any questions or comments? 1352 01:08:41,660 --> 01:08:43,109 Arguments? 1353 01:08:43,109 --> 01:08:44,359 Fistfights? 1354 01:08:53,540 --> 01:08:54,232 OK. 1355 01:08:54,232 --> 01:08:55,359 AUDIENCE: Where's the mic? 1356 01:08:55,359 --> 01:08:57,138 BRADLEY KUSZMAUL: Where is the mic? 1357 01:08:57,138 --> 01:08:58,094 AUDIENCE: That's OK. 1358 01:08:58,094 --> 01:08:59,528 I can [INAUDIBLE]. 1359 01:08:59,528 --> 01:09:00,778 BRADLEY KUSZMAUL: It's on my coat. 1360 01:09:05,122 --> 01:09:07,540 PROFESSOR: So actually, this is a very interesting point, 1361 01:09:07,540 --> 01:09:11,029 because if you think about where the world is heading, I think that 1362 01:09:11,029 --> 01:09:14,109 big data is something that's very, very interesting, 1363 01:09:14,109 --> 01:09:16,819 because all these people are gathering huge amounts of 1364 01:09:16,819 --> 01:09:19,330 data, and they're storing huge amounts of data. 1365 01:09:19,330 --> 01:09:22,300 And what to do with the data, accessing it, is going to be 1366 01:09:22,300 --> 01:09:23,080 one big problem. 1367 01:09:23,080 --> 01:09:25,890 I mean, if you look at what people like Google are doing, 1368 01:09:25,890 --> 01:09:27,710 they're just collecting all of it. 1369 01:09:27,710 --> 01:09:29,250 Nobody's throwing anything out. 
1370 01:09:29,250 --> 01:09:34,800 And I believe if you can kind of look at them, analyze them, 1371 01:09:34,800 --> 01:09:36,720 do cool things with the data, it's going to 1372 01:09:36,720 --> 01:09:38,080 be very, very important. 1373 01:09:38,080 --> 01:09:40,729 So I think that would be a very interesting, 1374 01:09:40,729 --> 01:09:43,160 high-performance end. 1375 01:09:43,160 --> 01:09:46,220 It's not just doing number crunching. 1376 01:09:46,220 --> 01:09:47,569 Until now, when people look at 1377 01:09:47,569 --> 01:09:48,960 high performance, it's about CPU. 1378 01:09:48,960 --> 01:09:52,290 It's about how many floating-point operations per second can you do? 1379 01:09:52,290 --> 01:09:55,060 TeraFLOPS, petaFLOP machines and stuff like that. 1380 01:09:55,060 --> 01:09:56,140 But I think one thing that's really 1381 01:09:56,140 --> 01:09:58,250 interesting is it's not petaFLOPS. 1382 01:09:58,250 --> 01:10:00,690 It's how many terabytes of data can you process 1383 01:10:00,690 --> 01:10:03,150 to find something? 1384 01:10:03,150 --> 01:10:09,120 BRADLEY KUSZMAUL: So I was at a talk by Facebook, and they 1385 01:10:09,120 --> 01:10:14,410 serve 37 gigabytes of data per second out of their 1386 01:10:14,410 --> 01:10:17,310 database tier. 1387 01:10:17,310 --> 01:10:25,130 And that's a lot of serving. 1388 01:10:25,130 --> 01:10:27,130 Out of one little piece of whatever they're doing. 1389 01:10:29,780 --> 01:10:33,860 Those guys have three or five petabytes. 1390 01:10:33,860 --> 01:10:38,680 And in the petabyte club, they're small potatoes. 1391 01:10:38,680 --> 01:10:42,040 There's people who have hundreds of petabytes, people 1392 01:10:42,040 --> 01:10:43,290 with three-letter acronyms. 1393 01:10:46,448 --> 01:10:48,795 PROFESSOR: I mean, some of those three-letter acronym 1394 01:10:48,795 --> 01:10:52,270 places, the amount of data they are getting and they are 1395 01:10:52,270 --> 01:10:55,000 processing is just gigantic. 
1396 01:10:55,000 --> 01:11:00,170 And I think to a point that even some of the interesting 1397 01:11:00,170 --> 01:11:01,936 things about-- 1398 01:11:01,936 --> 01:11:06,040 if they keep growing their data centers at the rate they 1399 01:11:06,040 --> 01:11:09,490 keep growing in the next couple of decades, they will 1400 01:11:09,490 --> 01:11:11,290 need the entire power of the United States to power their 1401 01:11:11,290 --> 01:11:13,560 data centers, because they are at that kind of 1402 01:11:13,560 --> 01:11:14,810 scale at this point. 1403 01:11:16,940 --> 01:11:22,150 Even in these big national labs, the reason they can't 1404 01:11:22,150 --> 01:11:24,480 expand is not that they don't have money to buy the 1405 01:11:24,480 --> 01:11:26,690 machines, but that they don't have money to pay for the 1406 01:11:26,690 --> 01:11:28,750 electricity, and also they don't have electricity-- that 1407 01:11:28,750 --> 01:11:29,560 much electricity-- 1408 01:11:29,560 --> 01:11:30,820 [UNINTELLIGIBLE] 1409 01:11:30,820 --> 01:11:32,672 them to basically feed it. 1410 01:11:32,672 --> 01:11:35,740 BRADLEY KUSZMAUL: I've run into people for whom the power 1411 01:11:35,740 --> 01:11:37,010 issue was a big deal. 1412 01:11:37,010 --> 01:11:40,780 I look at it and say, eh, you bought a $5,000 machine. 1413 01:11:40,780 --> 01:11:44,230 You spend $5,000 in power over the lifetime of the machine. 1414 01:11:44,230 --> 01:11:47,390 It doesn't seem like it's that big a deal. 1415 01:11:47,390 --> 01:11:50,490 But they've filled up their data center, and 1416 01:11:50,490 --> 01:11:54,310 adding one more machine has a huge incremental cost, because 1417 01:11:54,310 --> 01:11:56,470 they can't fit one more in. 1418 01:11:56,470 --> 01:11:59,330 So that means they have to build another building. 1419 01:11:59,330 --> 01:12:06,260 And so almost everybody's facing that problem who's in 1420 01:12:06,260 --> 01:12:09,260 this business. 
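[Bradley's rough equivalence here, a $5,000 machine costing about $5,000 in power over its lifetime, can be sanity-checked with a back-of-the-envelope calculation. Every number below (wattage, lifetime, electricity rate, cooling overhead) is an illustrative assumption of mine, not a figure from the talk:]

```python
# Rough lifetime power cost for one server (all numbers illustrative).
watts = 400             # assumed average draw of a loaded server
years = 4               # assumed service lifetime
rate = 0.15             # assumed electricity price in $/kWh
cooling_overhead = 2.0  # assumed: each watt of computing needs ~1 more watt of cooling

kwh = watts / 1000 * 24 * 365 * years   # kilowatt-hours over the lifetime
cost = kwh * rate * cooling_overhead

print(f"lifetime power and cooling: ${cost:,.0f}")
```

[Under these assumptions the total lands in the low thousands of dollars, the same order of magnitude as the machine itself, which is consistent with the "$5,000 in power" remark.]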
1421 01:12:09,260 --> 01:12:10,980 And then they try to build a building somewhere where 1422 01:12:10,980 --> 01:12:12,350 there's natural cooling-- 1423 01:12:12,350 --> 01:12:14,870 Google's written these papers about, oh, it turns out if you 1424 01:12:14,870 --> 01:12:16,610 don't air condition your computers, most 1425 01:12:16,610 --> 01:12:17,860 of them work anyway. 1426 01:12:24,180 --> 01:12:27,710 So, well, air conditioning is a quarter of the cost over the 1427 01:12:27,710 --> 01:12:28,880 lifetime of the computer. 1428 01:12:28,880 --> 01:12:33,490 So if you can make more than 3/4 of them give you service, 1429 01:12:33,490 --> 01:12:36,815 you come out ahead. 1430 01:12:36,815 --> 01:12:40,660 GUEST SPEAKER: On that note, MIT is part of a consortium 1431 01:12:40,660 --> 01:12:45,590 that includes Harvard, Northeastern, Boston 1432 01:12:45,590 --> 01:12:50,130 University, and University of Massachusetts Amherst, to 1433 01:12:50,130 --> 01:12:54,920 relocate all of our high-performance computing 1434 01:12:54,920 --> 01:13:00,280 into a new green data center in Holyoke, Massachusetts. 1435 01:13:00,280 --> 01:13:04,410 So the idea is that rather than us locating things here 1436 01:13:04,410 --> 01:13:08,130 on campus, where the energy costs are high and we get a 1437 01:13:08,130 --> 01:13:16,060 lot of our energy from fuels that have a big carbon 1438 01:13:16,060 --> 01:13:19,660 footprint, locating it in Holyoke-- 1439 01:13:19,660 --> 01:13:27,080 they have a lot of hydro power and nuclear power there. 1440 01:13:27,080 --> 01:13:31,660 And they're able to build a building that is extremely 1441 01:13:31,660 --> 01:13:33,140 energy-efficient. 1442 01:13:33,140 --> 01:13:37,810 And it turns out that a bunch of years ago when they were 1443 01:13:37,810 --> 01:13:42,780 digging up Route 90, the Mass Pike, they laid a lot of 1444 01:13:42,780 --> 01:13:45,220 dark fiber down its length. 
1445 01:13:45,220 --> 01:13:48,080 And so what they're going to do is light up that fiber, 1446 01:13:48,080 --> 01:13:51,220 which comes right back here to the Boston area. 1447 01:13:51,220 --> 01:13:53,940 And so for most people who are using these very 1448 01:13:53,940 --> 01:13:56,210 high-performance things, it doesn't really matter where 1449 01:13:56,210 --> 01:13:59,590 it's located anymore, at that level. 1450 01:13:59,590 --> 01:14:03,830 So instead of just locating some piece of equipment here, 1451 01:14:03,830 --> 01:14:08,490 we just will locate it out there, and the price will drop 1452 01:14:08,490 --> 01:14:09,240 dramatically. 1453 01:14:09,240 --> 01:14:13,490 And it'll be a much greener way for us to be doing our 1454 01:14:13,490 --> 01:14:15,450 high-end computing. 1455 01:14:15,450 --> 01:14:16,170 Yeah, question? 1456 01:14:16,170 --> 01:14:18,510 AUDIENCE: Isn't someone talking about water-cooled 1457 01:14:18,510 --> 01:14:19,914 offshore floating data centers? 1458 01:14:19,914 --> 01:14:21,050 GUEST SPEAKER: Sure. 1459 01:14:21,050 --> 01:14:22,080 Sure. 1460 01:14:22,080 --> 01:14:25,930 So the question is, are people talking about water-cooled 1461 01:14:25,930 --> 01:14:27,630 offshore floating data centers? 1462 01:14:27,630 --> 01:14:27,850 Yeah. 1463 01:14:27,850 --> 01:14:34,500 I mean, locating things in some area where you can cool 1464 01:14:34,500 --> 01:14:36,860 things easily makes a lot of sense. 1465 01:14:36,860 --> 01:14:41,490 Usually, they tend to want those near rivers rather than 1466 01:14:41,490 --> 01:14:44,990 in the middle of the ocean, just because you get the 1467 01:14:44,990 --> 01:14:46,340 hydropower. 1468 01:14:46,340 --> 01:14:49,210 But even in the ocean, you can use currents to do very much 1469 01:14:49,210 --> 01:14:50,200 the same kind of thing. 
1470 01:14:50,200 --> 01:14:52,620 So for some of these things, people are looking very 1471 01:14:52,620 --> 01:15:00,830 seriously at a whole bunch of different strategies for 1472 01:15:00,830 --> 01:15:03,625 containing large-scale equipment. 1473 01:15:03,625 --> 01:15:08,060 PROFESSOR: So one that's very counterintuitive is people are 1474 01:15:08,060 --> 01:15:11,680 trying to build data centers in the middle of deserts, where 1475 01:15:11,680 --> 01:15:13,130 it's very hot. 1476 01:15:13,130 --> 01:15:15,715 I mean, why do you think people want to build a data 1477 01:15:15,715 --> 01:15:17,530 center in the middle of the desert? 1478 01:15:17,530 --> 01:15:19,310 AUDIENCE: Solar power? 1479 01:15:19,310 --> 01:15:21,110 PROFESSOR: Solar power is one thing. 1480 01:15:21,110 --> 01:15:22,626 No, it's not solar power. 1481 01:15:22,626 --> 01:15:23,964 AUDIENCE: It gets really cold at night. 1482 01:15:23,964 --> 01:15:25,760 PROFESSOR: No, it's not really cold at night. 1483 01:15:25,760 --> 01:15:26,990 That's not it. 1484 01:15:26,990 --> 01:15:29,740 GUEST SPEAKER: Cheap property. 1485 01:15:29,740 --> 01:15:33,120 PROFESSOR: No, the biggest thing about cooling is either 1486 01:15:33,120 --> 01:15:35,140 you can do air conditioning, where you're using power to 1487 01:15:35,140 --> 01:15:39,920 pull heat out, or you can use just water to cool. 1488 01:15:39,920 --> 01:15:43,750 And what happens is, in most other places the humidity 1489 01:15:43,750 --> 01:15:44,710 is too high. 1490 01:15:44,710 --> 01:15:47,470 And when you go to the desert, humidity is low enough that 1491 01:15:47,470 --> 01:15:49,590 you can just pump water through the thing and get the 1492 01:15:49,590 --> 01:15:52,280 water evaporating, and then use 1493 01:15:52,280 --> 01:15:53,750 that to cool the system. 
1494 01:15:53,750 --> 01:15:56,970 So sometimes they're looking at data centers in places 1495 01:15:56,970 --> 01:16:00,480 where it could be 120 degrees, but very low humidity. 1496 01:16:00,480 --> 01:16:02,940 And they think that is a lot more efficient 1497 01:16:02,940 --> 01:16:04,490 to cool. 1498 01:16:04,490 --> 01:16:06,570 So there are a lot of these interesting nonintuitive 1499 01:16:06,570 --> 01:16:08,280 things people are looking at. 1500 01:16:08,280 --> 01:16:10,435 So what [UNINTELLIGIBLE] they say is that humidity's the 1501 01:16:10,435 --> 01:16:11,988 killer, not the temperature. 1502 01:16:11,988 --> 01:16:15,480 AUDIENCE: What if you're located on the South-- 1503 01:16:15,480 --> 01:16:19,910 if you're located on the South Pole, then that's both cold 1504 01:16:19,910 --> 01:16:24,370 and really low humidity. 1505 01:16:24,370 --> 01:16:25,540 GUEST SPEAKER: Yeah. 1506 01:16:25,540 --> 01:16:28,270 I mean it'll be interesting to see how these things develop. 1507 01:16:28,270 --> 01:16:35,300 It's a very so-called hot topic these days, energy 1508 01:16:35,300 --> 01:16:36,090 for computing. 1509 01:16:36,090 --> 01:16:39,820 And the energy for computing of course matters also not 1510 01:16:39,820 --> 01:16:42,000 only at the large scale but also at the small scale, 1511 01:16:42,000 --> 01:16:49,410 because you want your favorite handheld to 1512 01:16:49,410 --> 01:16:51,110 use very little battery, 1513 01:16:51,110 --> 01:16:52,540 so your batteries last longer. 1514 01:16:52,540 --> 01:16:56,620 So the issue of energy, using that as a measure-- 1515 01:16:56,620 --> 01:16:59,830 we've mostly been looking at how fast we can make things 1516 01:16:59,830 --> 01:17:03,260 run in this class, but many of the lessons you can use to 1517 01:17:03,260 --> 01:17:06,550 say, well, how can I make this run as 1518 01:17:06,550 --> 01:17:09,180 energy-efficient as possible? 
1519 01:17:09,180 --> 01:17:13,410 And what you'll learn is that many of the lessons we've had 1520 01:17:13,410 --> 01:17:17,310 in the class apply. During the term we focused, as I say, on 1521 01:17:17,310 --> 01:17:18,180 performance. 1522 01:17:18,180 --> 01:17:21,070 But there are many resources in any given situation that 1523 01:17:21,070 --> 01:17:22,390 you might want to optimize. 1524 01:17:22,390 --> 01:17:25,370 And so understanding something about how do I minimize 1525 01:17:25,370 --> 01:17:29,200 energy, how do I minimize disk I/Os, how do I minimize clock 1526 01:17:29,200 --> 01:17:32,760 cycles, how do I minimize off-chip accesses-- 1527 01:17:32,760 --> 01:17:34,430 which tend to be much more energy 1528 01:17:34,430 --> 01:17:36,070 intensive than on-chip-- 1529 01:17:36,070 --> 01:17:40,740 all those different kinds of measures end up being part of 1530 01:17:40,740 --> 01:17:42,700 the mix of what you have to do when you're really engineering 1531 01:17:42,700 --> 01:17:43,950 these systems. 1532 01:17:46,142 --> 01:17:49,130 PROFESSOR: So I think another interesting thing is, because 1533 01:17:49,130 --> 01:17:54,170 we are in this time where some stuff grows at exponential 1534 01:17:54,170 --> 01:17:57,630 rates and stuff like that, some of those ratios that made 1535 01:17:57,630 --> 01:18:02,330 sense at some point just suddenly start causing really 1536 01:18:02,330 --> 01:18:02,960 bad problems. 1537 01:18:02,960 --> 01:18:06,770 Like, for example, in this, at some point the seek times were 1538 01:18:06,770 --> 01:18:08,870 normal enough that you didn't care. 1539 01:18:08,870 --> 01:18:13,620 And at some point, because the rest of the things took off so 1540 01:18:13,620 --> 01:18:16,610 fast, suddenly it becomes this really, really big 1541 01:18:16,610 --> 01:18:17,150 bottleneck. 1542 01:18:17,150 --> 01:18:19,660 BRADLEY KUSZMAUL: B-trees were a really good data 1543 01:18:19,660 --> 01:18:22,750 structure in 1972. 
1544 01:18:22,750 --> 01:18:26,590 Because, well, the seek time and the transfer 1545 01:18:26,590 --> 01:18:28,660 time and the CPU-- 1546 01:18:28,660 --> 01:18:33,040 the CPUs actually couldn't read in the data in one 1547 01:18:33,040 --> 01:18:36,530 rotation, so people didn't even read consecutive blocks, 1548 01:18:36,530 --> 01:18:38,860 because the CPU just couldn't handle data 1549 01:18:38,860 --> 01:18:40,040 coming in that fast. 1550 01:18:40,040 --> 01:18:42,200 You would stagger blocks around the disk, so that when 1551 01:18:42,200 --> 01:18:44,640 you did sequential reads, you'd get this one and then 1552 01:18:44,640 --> 01:18:46,330 this one and this one. 1553 01:18:46,330 --> 01:18:48,730 There was this whole thing about tuning your file system. 1554 01:18:48,730 --> 01:18:50,190 It's like-- 1555 01:18:50,190 --> 01:18:52,640 AUDIENCE: By the way, back when disks were-- 1556 01:18:52,640 --> 01:18:54,310 BRADLEY KUSZMAUL: Yeah. 1557 01:18:54,310 --> 01:18:55,180 Washing machine. 1558 01:18:55,180 --> 01:18:57,335 AUDIENCE: Washing machine size. 1559 01:18:57,335 --> 01:18:59,565 BRADLEY KUSZMAUL: For 20 megabytes. 1560 01:18:59,565 --> 01:19:01,040 PROFESSOR: Oh, yeah, that's the big disk. 1561 01:19:05,540 --> 01:19:07,870 So hopefully you guys got a feel for-- 1562 01:19:07,870 --> 01:19:09,840 we have been looking at this performance on a small 1563 01:19:09,840 --> 01:19:12,440 multi-core and stuff like that, how it can scale in 1564 01:19:12,440 --> 01:19:15,030 different directions and the kind of impact 1565 01:19:15,030 --> 01:19:16,365 performance can have. 
1566 01:19:16,365 --> 01:19:22,050 And in fact, if anybody has read books on why Google is 1567 01:19:22,050 --> 01:19:25,530 successful, one of the biggest things for their success is 1568 01:19:25,530 --> 01:19:29,090 they managed to do a huge amount of work very cheaply, 1569 01:19:29,090 --> 01:19:32,790 because if anybody did the amount of work they do in the 1570 01:19:32,790 --> 01:19:36,400 traditional way, they couldn't afford that model, to give it 1571 01:19:36,400 --> 01:19:39,820 away for free, supported by advertising. 1572 01:19:39,820 --> 01:19:42,360 They can get it done because it's about 1573 01:19:42,360 --> 01:19:43,980 optimization. 1574 01:19:43,980 --> 01:19:47,770 Performance basically relates to cost. 1575 01:19:47,770 --> 01:19:50,380 And if the cost is low enough, then they don't have to keep 1576 01:19:50,380 --> 01:19:53,370 charging a huge amount of money for each search. 1577 01:19:53,370 --> 01:19:57,790 GUEST SPEAKER: So let's thank Dr. Kuszmaul for 1578 01:19:57,790 --> 01:20:01,438 an excellent talk. 1579 01:20:01,438 --> 01:20:04,245 And can you hang out for just a little bit, if people want 1580 01:20:04,245 --> 01:20:04,710 to come down? 1581 01:20:04,710 --> 01:20:06,200 OK. 1582 01:20:06,200 --> 01:20:07,450 Thanks.