1 00:00:00,090 --> 00:00:02,490 The following content is provided under a Creative 2 00:00:02,490 --> 00:00:04,030 Commons license. 3 00:00:04,030 --> 00:00:06,360 Your support will help MIT OpenCourseWare 4 00:00:06,360 --> 00:00:10,720 continue to offer high quality educational resources for free. 5 00:00:10,720 --> 00:00:13,320 To make a donation or view additional materials 6 00:00:13,320 --> 00:00:17,280 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:17,280 --> 00:00:18,450 at ocw.mit.edu. 8 00:00:21,056 --> 00:00:25,280 ERIK DEMAINE: Welcome to 6.851 Advanced Data Structures. 9 00:00:25,280 --> 00:00:26,420 I am Erik Demaine. 10 00:00:26,420 --> 00:00:28,170 You can call me Erik. 11 00:00:28,170 --> 00:00:31,600 We have two TAs, Tom Morgan and Justin Zhang. 12 00:00:31,600 --> 00:00:32,900 Tom's back there. 13 00:00:32,900 --> 00:00:36,260 Justin is late. 14 00:00:36,260 --> 00:00:38,600 And this class is about all kinds 15 00:00:38,600 --> 00:00:40,567 of very cool data structures. 16 00:00:40,567 --> 00:00:42,650 You should have already seen basic data structures 17 00:00:42,650 --> 00:00:45,200 like balanced binary search trees and things 18 00:00:45,200 --> 00:00:48,140 like that, log n time to do whatever 19 00:00:48,140 --> 00:00:50,060 you want in one dimension. 20 00:00:50,060 --> 00:00:52,310 And here we're going to turn all those data structures 21 00:00:52,310 --> 00:00:54,143 on their head and consider them in all sorts 22 00:00:54,143 --> 00:00:56,930 of different models and additional cool problems. 23 00:00:56,930 --> 00:00:59,630 Today we're going to talk about time travel or temporal data 24 00:00:59,630 --> 00:01:02,770 structures, where you're manipulating time 25 00:01:02,770 --> 00:01:05,510 as any good time traveler should. 26 00:01:05,510 --> 00:01:07,940 Then we'll do geometry where we have higher dimensional 27 00:01:07,940 --> 00:01:09,620 data, more than one dimension. 
28 00:01:09,620 --> 00:01:11,120 Then we'll look at a problem called 29 00:01:11,120 --> 00:01:14,900 dynamic optimality, which is, is there one best binary search 30 00:01:14,900 --> 00:01:16,642 tree that rules them all? 31 00:01:16,642 --> 00:01:19,100 Then we'll look at something called memory hierarchy, which 32 00:01:19,100 --> 00:01:22,580 is a way to model more realistic computers which have cache 33 00:01:22,580 --> 00:01:24,560 and then more cache and then main memory 34 00:01:24,560 --> 00:01:26,630 and then disk and all these different levels. 35 00:01:26,630 --> 00:01:28,580 How do you optimize for that? 36 00:01:28,580 --> 00:01:31,770 Hashing is probably the most famous, and most popular, 37 00:01:31,770 --> 00:01:33,670 most used data structure in computer science. 38 00:01:33,670 --> 00:01:35,960 We'll do a little bit on that. 39 00:01:35,960 --> 00:01:39,050 Integers, when you know that your data is integers and not 40 00:01:39,050 --> 00:01:42,147 just arbitrary black boxes that you can compare or do whatever, 41 00:01:42,147 --> 00:01:43,730 you can do a lot better with integers. 42 00:01:43,730 --> 00:01:45,140 You usually beat log n time. 43 00:01:45,140 --> 00:01:47,100 Often you can get constant time. 44 00:01:47,100 --> 00:01:49,130 For example, if you want to do priority queues, 45 00:01:49,130 --> 00:01:51,800 you can do square root log log n time. 46 00:01:51,800 --> 00:01:55,100 That's the best known randomized. 47 00:01:55,100 --> 00:01:58,010 Dynamic graphs, you have a graph you want to store, 48 00:01:58,010 --> 00:02:01,440 and the edges are being added and maybe deleted, like you're 49 00:02:01,440 --> 00:02:02,690 representing a social network. 50 00:02:02,690 --> 00:02:04,040 And people are friending and de-friending. 51 00:02:04,040 --> 00:02:06,081 You want to maintain some interesting information 52 00:02:06,081 --> 00:02:07,760 about that graph. 
53 00:02:07,760 --> 00:02:10,669 Strings, you have a piece of text, 54 00:02:10,669 --> 00:02:12,555 such as the entire World Wide Web. 55 00:02:12,555 --> 00:02:14,180 And you want to search for a substring. 56 00:02:14,180 --> 00:02:15,540 How do you do that efficiently? 57 00:02:15,540 --> 00:02:17,420 It's sort of the Google problem. 58 00:02:17,420 --> 00:02:20,540 Or you're searching through DNA for patterns, whatever. 59 00:02:20,540 --> 00:02:22,350 And finally succinct data structures, 60 00:02:22,350 --> 00:02:24,170 which is all about taking what we normally 61 00:02:24,170 --> 00:02:26,660 consider optimal space, order n space, 62 00:02:26,660 --> 00:02:30,440 and reducing it down to the very bare minimum of bits of space. 63 00:02:30,440 --> 00:02:32,745 Usually if you want to store something 64 00:02:32,745 --> 00:02:34,370 where there's 2 to the n possibilities, 65 00:02:34,370 --> 00:02:36,620 you want to get away with n bits of space, 66 00:02:36,620 --> 00:02:40,030 maybe plus square root of n or something very tiny. 67 00:02:40,030 --> 00:02:42,740 So that's succinct data structures. 68 00:02:42,740 --> 00:02:44,790 So that's an overview of the entire class. 69 00:02:44,790 --> 00:02:47,970 And these are sort of the sections we'll be following. 70 00:02:47,970 --> 00:02:50,930 Let me give you a quick administrative overview 71 00:02:50,930 --> 00:02:54,350 of what we're doing. 72 00:02:54,350 --> 00:02:55,520 Requirements for the class-- 73 00:02:55,520 --> 00:02:58,317 I guess, first, attending lecture. 74 00:02:58,317 --> 00:02:59,900 Obviously if you don't attend lecture, 75 00:02:59,900 --> 00:03:00,983 there'll be videos online. 76 00:03:00,983 --> 00:03:02,810 So that's resolvable. 77 00:03:02,810 --> 00:03:05,000 But let me know if you're not going to make it. 78 00:03:05,000 --> 00:03:07,410 We're going to have problem sets roughly every week. 
79 00:03:07,410 --> 00:03:09,390 If you're taking the class for credit, 80 00:03:09,390 --> 00:03:12,500 they have a very simple rule of one page in, one page out. 81 00:03:12,500 --> 00:03:15,290 This is more a constraint on us to write problems that 82 00:03:15,290 --> 00:03:16,719 have easy or short answers. 83 00:03:16,719 --> 00:03:18,260 You probably need to think about them 84 00:03:18,260 --> 00:03:21,410 a little bit before they're transparent, but then easy 85 00:03:21,410 --> 00:03:23,060 to write up. 86 00:03:23,060 --> 00:03:26,430 And then scribing lectures-- so we have a scribe for today, 87 00:03:26,430 --> 00:03:27,580 I hope. 88 00:03:27,580 --> 00:03:28,550 Here? 89 00:03:28,550 --> 00:03:30,170 Yes, good. 90 00:03:30,170 --> 00:03:32,210 So most of the lectures have already 91 00:03:32,210 --> 00:03:34,580 been scribed in some version, and your goal 92 00:03:34,580 --> 00:03:38,420 is to revise those scribe notes so that, if you don't like 93 00:03:38,420 --> 00:03:42,110 handwritten notes, which are also online, they're easier 94 00:03:42,110 --> 00:03:44,320 for people to read. 95 00:03:44,320 --> 00:03:45,160 Let's see. 96 00:03:45,160 --> 00:03:46,187 Listeners welcome. 97 00:03:46,187 --> 00:03:48,020 We're going to have an open problem session. 98 00:03:48,020 --> 00:03:49,230 I really like open problems. 99 00:03:49,230 --> 00:03:50,730 I really like solving open problems. 100 00:03:50,730 --> 00:03:53,540 So we've done this every time this class has been offered. 101 00:03:53,540 --> 00:03:55,790 So if you're interested in also solving open problems, 102 00:03:55,790 --> 00:03:56,910 it's optional. 103 00:03:56,910 --> 00:03:59,480 I will organize-- in a couple of weeks, 104 00:03:59,480 --> 00:04:03,110 we'll have a weekly open problem session 105 00:04:03,110 --> 00:04:06,440 and try to solve all the things that 106 00:04:06,440 --> 00:04:09,080 push the frontier of advanced data structures. 
107 00:04:09,080 --> 00:04:11,810 So in class, we'll see the state of the art. 108 00:04:11,810 --> 00:04:15,671 And then we'll change the state of the art in those sessions. 109 00:04:15,671 --> 00:04:16,420 I think that's it. 110 00:04:16,420 --> 00:04:18,920 Any questions about the class before we 111 00:04:18,920 --> 00:04:20,089 get into the fun stuff? 112 00:04:22,980 --> 00:04:23,900 All right. 113 00:04:23,900 --> 00:04:27,142 Let's do some time traveling. 114 00:04:27,142 --> 00:04:28,850 Before I get to time traveling, though, I 115 00:04:28,850 --> 00:04:32,420 need to define our model of computation. 116 00:04:32,420 --> 00:04:35,716 A theme in this class is that the model of computation you're 117 00:04:35,716 --> 00:04:36,590 working with matters. 118 00:04:36,590 --> 00:04:38,060 Models matter. 119 00:04:38,060 --> 00:04:40,430 And there's lots of different models of computation. 120 00:04:40,430 --> 00:04:45,890 We'll see a few of the main ones in this class. 121 00:04:45,890 --> 00:04:48,580 And the starting point, and the one 122 00:04:48,580 --> 00:04:53,100 we'll be using throughout today, is called a pointer machine. 123 00:04:53,100 --> 00:04:54,485 It's an old one from the '80s. 124 00:04:57,240 --> 00:04:59,240 And it corresponds to what you might think about 125 00:04:59,240 --> 00:05:02,090 if you've done a lot of object-oriented programming, 126 00:05:02,090 --> 00:05:04,950 and before that, structure-oriented programming, 127 00:05:04,950 --> 00:05:05,760 I guess. 128 00:05:05,760 --> 00:05:08,660 So you have a bunch of nodes. 129 00:05:08,660 --> 00:05:13,970 They have some fields in them, a constant number of fields. 130 00:05:13,970 --> 00:05:16,610 You can think of these as objects or structs 131 00:05:16,610 --> 00:05:20,270 in C. It used to be records back in Pascal days, 132 00:05:20,270 --> 00:05:22,170 so a lot of the papers call them records. 133 00:05:22,170 --> 00:05:24,170 You could just have a constant number of fields. 
134 00:05:24,170 --> 00:05:25,300 You could think of those as numbered or labeled. 135 00:05:25,300 --> 00:05:27,175 It doesn't really matter because there's only 136 00:05:27,175 --> 00:05:28,550 a constant number of them. 137 00:05:28,550 --> 00:05:32,240 Each of the fields could be a pointer to another node, 138 00:05:32,240 --> 00:05:35,550 could be a null pointer, or could have some data in it. 139 00:05:35,550 --> 00:05:37,760 So I'll just assume that all my data is integers. 140 00:05:43,130 --> 00:05:45,090 You can have a pointer to yourself. 141 00:05:45,090 --> 00:05:48,440 You can have a pointer over here, whatever you want. 142 00:05:48,440 --> 00:05:52,640 A pointer machine would look something like this. 143 00:05:52,640 --> 00:05:55,040 In any moment, this is the state of the pointer machine. 144 00:05:55,040 --> 00:05:59,300 So you think of this as what the memory of your computer is storing. 145 00:05:59,300 --> 00:06:02,210 And then you have some operations 146 00:06:02,210 --> 00:06:03,660 that you're allowed to do. 147 00:06:03,660 --> 00:06:07,800 That's the computation part of the model. 148 00:06:07,800 --> 00:06:10,570 You can think of this as the memory model. 149 00:06:10,570 --> 00:06:13,100 What you're allowed to do are create nodes. 150 00:06:13,100 --> 00:06:15,705 You can say something like, x equals new node. 151 00:06:19,910 --> 00:06:24,990 You can, I don't know, look at fields. 152 00:06:24,990 --> 00:06:28,070 You can do x equals y.field. 153 00:06:28,070 --> 00:06:33,560 You can set fields, x.field equals y. 154 00:06:33,560 --> 00:06:37,400 You can compute on these data, so you can add 5 and 7, 155 00:06:37,400 --> 00:06:39,110 do things like that. 156 00:06:39,110 --> 00:06:40,990 I'm not going to worry about-- 157 00:06:40,990 --> 00:06:43,700 I'll just write et cetera. 
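The operations just listed can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `Node` and the field names `left`, `right`, and `data` are hypothetical and do not come from the lecture. The one point it is meant to show is that each node has a fixed, constant set of fields, and each field holds a pointer, null, or an integer.

```python
class Node:
    """A pointer-machine node with a constant number of fields.

    The field names are illustrative assumptions, not the lecture's.
    """
    __slots__ = ("left", "right", "data")  # fixed, constant set of fields

    def __init__(self):
        self.left = None    # null pointer
        self.right = None   # null pointer
        self.data = None    # integer data, once set


root = Node()            # x = new node
child = Node()           # another new node
child.data = 5 + 7       # compute on data, then store it in a field
root.left = child        # x.field = y
root.right = root        # a node may point to itself
y = root.left            # x = y.field (follow a pointer)
```

Using `__slots__` is a design choice for the sketch: it makes Python refuse any field outside the declared set, which mirrors the constant-size constraint on nodes.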
158 00:06:43,700 --> 00:06:45,680 This is more a model about how everything's 159 00:06:45,680 --> 00:06:48,380 organized in memory, not so much about what you're allowed 160 00:06:48,380 --> 00:06:49,620 to do to the data items. 161 00:06:49,620 --> 00:06:50,870 In this lecture, it won't matter what 162 00:06:50,870 --> 00:06:52,161 you're doing to the data items. 163 00:06:52,161 --> 00:06:53,930 We never touch them. 164 00:06:53,930 --> 00:06:56,840 We just copy them around. 165 00:06:56,840 --> 00:06:59,870 So am I missing anything? 166 00:06:59,870 --> 00:07:01,100 Probably. 167 00:07:01,100 --> 00:07:03,950 I guess you could destroy nodes if you felt like it. 168 00:07:03,950 --> 00:07:06,566 But we won't have to today, because we 169 00:07:06,566 --> 00:07:07,940 don't want to throw anything away 170 00:07:07,940 --> 00:07:09,200 when you're time traveling. 171 00:07:09,200 --> 00:07:10,100 It's too dangerous. 172 00:07:12,740 --> 00:07:18,570 And then the one catch here is, what are x and y? 173 00:07:18,570 --> 00:07:21,280 There's going to be one node in this data structure 174 00:07:21,280 --> 00:07:24,184 or in your memory called the root node. 175 00:07:24,184 --> 00:07:26,600 And you could think of that as that's the thing you always 176 00:07:26,600 --> 00:07:27,800 have in your head. 177 00:07:27,800 --> 00:07:29,390 This is like your cache, if you will. 178 00:07:29,390 --> 00:07:31,348 It's just got a constant number of things, just 179 00:07:31,348 --> 00:07:32,470 like any other node. 180 00:07:32,470 --> 00:07:37,490 And x and y are fields of the root. 181 00:07:40,520 --> 00:07:42,530 So that sort of ties things down. 182 00:07:42,530 --> 00:07:45,200 You're always working relative to the root. 183 00:07:45,200 --> 00:07:50,030 But you can look at the data, basically follow this pointer, 184 00:07:50,030 --> 00:07:53,040 by looking at the field. 
185 00:07:53,040 --> 00:07:55,850 You could set one of these pointers-- 186 00:07:55,850 --> 00:07:58,160 I think I probably need another operation here, 187 00:07:58,160 --> 00:08:03,936 like x equals y.field1, field2, that sort of thing, 188 00:08:03,936 --> 00:08:06,800 and maybe the reverse. 189 00:08:06,800 --> 00:08:09,200 But you can manipulate all nodes sort 190 00:08:09,200 --> 00:08:10,770 of via the root is the idea. 191 00:08:10,770 --> 00:08:12,710 You follow pointers, do whatever. 192 00:08:12,710 --> 00:08:14,780 So pretty obvious, slightly annoying 193 00:08:14,780 --> 00:08:16,250 to write down formally. 194 00:08:16,250 --> 00:08:18,710 But that is pointer machine. 195 00:08:23,460 --> 00:08:26,320 And what we're going to be talking about today 196 00:08:26,320 --> 00:08:28,829 in time travel is suppose someone 197 00:08:28,829 --> 00:08:31,120 gives me a pointer machine data structure, for example, 198 00:08:31,120 --> 00:08:33,510 balanced binary search tree, linked list. 199 00:08:33,510 --> 00:08:36,030 A lot of data structures, especially classic data 200 00:08:36,030 --> 00:08:39,192 structures, follow pointer machine model. 201 00:08:39,192 --> 00:08:41,400 What we'd like to do is transform that data structure 202 00:08:41,400 --> 00:08:42,816 or make a new pointer machine data 203 00:08:42,816 --> 00:08:45,120 structure that does extra cool things, 204 00:08:45,120 --> 00:08:47,470 namely travel through time. 205 00:08:47,470 --> 00:08:53,340 So that's what we're going to do. 206 00:08:53,340 --> 00:08:57,870 There's two senses of time travel or temporal data 207 00:08:57,870 --> 00:09:02,200 structures that we're going to cover in this class. 208 00:09:02,200 --> 00:09:05,400 The one for today is called persistence, 209 00:09:05,400 --> 00:09:08,310 where you don't forget anything, like an elephant. 210 00:09:08,310 --> 00:09:11,471 And the other one is retroactivity. 211 00:09:15,240 --> 00:09:16,800 Persistence will be today. 
212 00:09:16,800 --> 00:09:19,260 Retroactivity is next class. 213 00:09:19,260 --> 00:09:21,670 Basically, these correspond to two models of time travel. 214 00:09:21,670 --> 00:09:24,176 Persistence is the branching universe time travel model, 215 00:09:24,176 --> 00:09:25,800 where if you make a change in the past, 216 00:09:25,800 --> 00:09:27,350 you get a new universe. 217 00:09:27,350 --> 00:09:29,460 You never destroy old universes. 218 00:09:29,460 --> 00:09:33,255 Retroactivity is more like Back to the Future, 219 00:09:33,255 --> 00:09:35,130 when you go back, make a change, and then you 220 00:09:35,130 --> 00:09:37,920 can return to the present and see what happened. 221 00:09:37,920 --> 00:09:39,710 This is a lot harder to do. 222 00:09:39,710 --> 00:09:42,150 And we'll work on that next class. 223 00:09:42,150 --> 00:09:46,360 Persistence is what we will do today. 224 00:09:46,360 --> 00:09:48,670 So persistence. 225 00:09:57,940 --> 00:10:01,810 The general idea of persistence is to remember everything-- 226 00:10:01,810 --> 00:10:05,110 the general goal, I would say. 227 00:10:05,110 --> 00:10:07,360 And by everything, I mean different versions 228 00:10:07,360 --> 00:10:08,800 of the data structure. 229 00:10:08,800 --> 00:10:11,290 So you're doing data structures in general. 230 00:10:11,290 --> 00:10:14,550 We have update operations and query operations. 231 00:10:14,550 --> 00:10:16,387 We're mainly concerned about updates here. 232 00:10:16,387 --> 00:10:18,220 Every time you do an update, you think of it 233 00:10:18,220 --> 00:10:21,730 as taking a version of the data structure and making a new one. 234 00:10:21,730 --> 00:10:23,800 And you never want to destroy old versions. 235 00:10:23,800 --> 00:10:26,230 So even though an update like an insert or something 236 00:10:26,230 --> 00:10:29,230 changes the data structure, we want to remember that past data 237 00:10:29,230 --> 00:10:30,770 as well. 
238 00:10:30,770 --> 00:10:34,030 And then let's make this reasonable. 239 00:10:34,030 --> 00:10:37,791 All data structure operations are relative to a specified 240 00:10:37,791 --> 00:10:38,290 version. 241 00:10:47,200 --> 00:10:57,210 So an update makes and returns a new version. 242 00:11:05,690 --> 00:11:08,750 So when you do an insert, you specify 243 00:11:08,750 --> 00:11:10,850 a version of your data structure and the thing 244 00:11:10,850 --> 00:11:11,960 you want to insert. 245 00:11:11,960 --> 00:11:13,850 And the output is a new version. 246 00:11:13,850 --> 00:11:16,550 So then you could insert into that new version, keep going, 247 00:11:16,550 --> 00:11:20,210 or maybe go back to the old version, modify that. 248 00:11:20,210 --> 00:11:22,450 I haven't said exactly what's allowed here, 249 00:11:22,450 --> 00:11:25,460 but this is sort of the general goal. 250 00:11:25,460 --> 00:11:30,950 And then there are four levels of persistence 251 00:11:30,950 --> 00:11:33,020 that you might want to get. 252 00:11:33,020 --> 00:11:37,440 First level is called partial persistence. 253 00:11:37,440 --> 00:11:38,690 This is the easiest to obtain. 254 00:11:44,300 --> 00:11:47,630 And in partial persistence, you're 255 00:11:47,630 --> 00:11:55,460 only allowed to update the latest version, which 256 00:11:55,460 --> 00:12:01,670 means the versions are linearly ordered. 257 00:12:01,670 --> 00:12:03,770 This is the easiest to think about. 258 00:12:03,770 --> 00:12:09,300 And time travel can easily get confusing, so start simple. 259 00:12:09,300 --> 00:12:14,780 We have a timeline of various versions on it. 260 00:12:14,780 --> 00:12:17,420 This is the latest. 261 00:12:17,420 --> 00:12:19,850 And what we can do is update that version. 262 00:12:19,850 --> 00:12:23,750 We'll get a new version, and then our latest is this one. 
263 00:12:23,750 --> 00:12:27,950 What this allows is looking back at the past to an old version 264 00:12:27,950 --> 00:12:29,282 and querying that version. 265 00:12:29,282 --> 00:12:31,490 So you can still ask questions about the old version, 266 00:12:31,490 --> 00:12:33,830 if you want to be able to do a search on any of these data 267 00:12:33,830 --> 00:12:34,329 structures. 268 00:12:34,329 --> 00:12:35,600 But you can't change them. 269 00:12:35,600 --> 00:12:38,180 You can only change the most recent version. 270 00:12:38,180 --> 00:12:39,470 So that's nice. 271 00:12:39,470 --> 00:12:44,690 It's kind of like Time Machine on Mac, I guess. 272 00:12:44,690 --> 00:12:46,970 If you've ever seen the movie Deja Vu, which is not 273 00:12:46,970 --> 00:12:48,636 very common, but it's a good time travel 274 00:12:48,636 --> 00:12:51,500 movie, in the first half of the movie, all they can do 275 00:12:51,500 --> 00:12:52,550 is look back at the past. 276 00:12:52,550 --> 00:12:54,710 Later they discover that actually they 277 00:12:54,710 --> 00:12:57,850 have a full persistence model. 278 00:12:57,850 --> 00:12:59,930 It takes a while for dramatic effect. 279 00:13:04,250 --> 00:13:08,750 In full persistence, you can update anything you want-- 280 00:13:08,750 --> 00:13:11,015 so update any version. 281 00:13:18,070 --> 00:13:24,200 And so then the versions form a tree. 282 00:13:27,960 --> 00:13:28,460 OK. 283 00:13:28,460 --> 00:13:30,200 So in this model, maybe you initially 284 00:13:30,200 --> 00:13:32,420 have a nice line of versions. 285 00:13:32,420 --> 00:13:34,860 But now if I go back to this version and update it, 286 00:13:34,860 --> 00:13:37,340 I branch, get a new version here. 287 00:13:37,340 --> 00:13:40,400 And then I might keep modifying that version sometimes. 288 00:13:40,400 --> 00:13:41,660 Any of these guys can branch. 
289 00:13:44,240 --> 00:13:47,480 So this is why I call it the branching universe model, when 290 00:13:47,480 --> 00:13:49,685 you update, you branch. 291 00:13:52,520 --> 00:13:54,380 So no version ever gets destroyed here. 292 00:13:54,380 --> 00:13:56,720 Again, you can query all versions. 293 00:13:56,720 --> 00:13:59,360 But now you can also update any version. 294 00:13:59,360 --> 00:14:00,710 But you just make a new version. 295 00:14:00,710 --> 00:14:02,660 It's a totally new world. 296 00:14:02,660 --> 00:14:04,490 When I update this version, this version 297 00:14:04,490 --> 00:14:06,410 knows nothing about all the-- 298 00:14:06,410 --> 00:14:07,940 this doesn't know about this future. 299 00:14:07,940 --> 00:14:10,890 It's created its own future. 300 00:14:10,890 --> 00:14:14,310 There's no way to sort of merge those universes together. 301 00:14:14,310 --> 00:14:16,910 It's kind of sad. 302 00:14:16,910 --> 00:14:22,220 That's why we have the third level of persistence, 303 00:14:22,220 --> 00:14:24,680 which lets us merge timelines. 304 00:14:24,680 --> 00:14:27,940 It's great for lots of fiction out there. 305 00:14:35,880 --> 00:14:38,360 If you've seen the old TV show Sliders, 306 00:14:38,360 --> 00:14:40,660 that would be confluent persistence. 307 00:14:50,830 --> 00:15:01,790 So confluent persistence, you can combine two versions 308 00:15:01,790 --> 00:15:03,110 to create a new version. 309 00:15:09,220 --> 00:15:13,720 And in this case, again, you can't destroy old versions. 310 00:15:13,720 --> 00:15:16,260 In persistence, you never destroy versions. 311 00:15:16,260 --> 00:15:22,520 So now the versions form a DAG, directed acyclic graph. 312 00:15:22,520 --> 00:15:24,010 So now we're allowing-- 313 00:15:24,010 --> 00:15:25,520 OK, you make some changes, whatever. 314 00:15:25,520 --> 00:15:30,274 You branch your universe, make some changes. 
315 00:15:30,274 --> 00:15:32,440 And now I can say, OK, take this version of the data 316 00:15:32,440 --> 00:15:35,510 structure and this version and recombine them. 317 00:15:35,510 --> 00:15:38,730 Get a new version, and then maybe make some more changes. 318 00:15:38,730 --> 00:15:40,654 OK, what does combine mean? 319 00:15:40,654 --> 00:15:42,320 Well, it depends on your data structure. 320 00:15:42,320 --> 00:15:44,420 A lot of data structures have combine operations 321 00:15:44,420 --> 00:15:48,029 like if you have linked lists, you have two linked lists, 322 00:15:48,029 --> 00:15:49,070 you can concatenate them. 323 00:15:49,070 --> 00:15:50,180 That's an easy operation. 324 00:15:50,180 --> 00:15:51,721 Even if you have binary search trees, 325 00:15:51,721 --> 00:15:53,750 you can concatenate them reasonably easily 326 00:15:53,750 --> 00:15:56,670 and combine them into one big binary search tree. 327 00:15:56,670 --> 00:15:59,060 So if your data structure has an operation that 328 00:15:59,060 --> 00:16:01,850 takes as input two data structures, 329 00:16:01,850 --> 00:16:05,150 then what we're saying is now it can take two versions, which 330 00:16:05,150 --> 00:16:06,430 is more general. 331 00:16:06,430 --> 00:16:08,240 So I could take the same data structure, 332 00:16:08,240 --> 00:16:10,430 make some changes in one way, separately make 333 00:16:10,430 --> 00:16:12,138 some changes in a different way, and then 334 00:16:12,138 --> 00:16:14,660 try to concatenate them or do something crazy. 335 00:16:14,660 --> 00:16:16,550 This is hard to do, and most of it 336 00:16:16,550 --> 00:16:19,090 is an open problem whether it can be done. 337 00:16:19,090 --> 00:16:21,590 But I'll tell you about it. 338 00:16:21,590 --> 00:16:24,860 Then there's another level even more than confluent 339 00:16:24,860 --> 00:16:26,550 persistence. 
340 00:16:26,550 --> 00:16:30,590 This is hard to interpret in the time travel world, 341 00:16:30,590 --> 00:16:32,430 but it would be functional data structures. 342 00:16:32,430 --> 00:16:34,130 If you've ever programmed in a functional programming 343 00:16:34,130 --> 00:16:36,800 language, it's a little bit annoying from an algorithms 344 00:16:36,800 --> 00:16:39,650 perspective, because it constrains you to work 345 00:16:39,650 --> 00:16:43,220 in a purely functional world. 346 00:16:43,220 --> 00:16:45,900 You can never modify anything. 347 00:16:45,900 --> 00:16:46,400 OK. 348 00:16:46,400 --> 00:16:49,220 Now, we don't want to modify versions. 349 00:16:49,220 --> 00:16:49,830 That's fine. 350 00:16:49,830 --> 00:16:51,288 But in a functional data structure, 351 00:16:51,288 --> 00:16:54,110 you're not allowed to modify any nodes ever. 352 00:16:54,110 --> 00:16:55,720 All you can do is make new nodes. 353 00:17:03,670 --> 00:17:07,317 This is constraining, and you can't always 354 00:17:07,317 --> 00:17:09,400 get optimal running times in the functional world. 355 00:17:09,400 --> 00:17:11,358 But if you can get a functional data structure, 356 00:17:11,358 --> 00:17:13,300 you have all these things, because you 357 00:17:13,300 --> 00:17:14,259 can't destroy anything. 358 00:17:14,259 --> 00:17:16,216 If you can't destroy nodes, then in particular, 359 00:17:16,216 --> 00:17:17,430 you can't destroy versions. 360 00:17:17,430 --> 00:17:19,930 And all of these things just work for free. 361 00:17:19,930 --> 00:17:22,957 And so a bunch of special cases are known, 362 00:17:22,957 --> 00:17:24,790 interesting special cases, like search trees 363 00:17:24,790 --> 00:17:26,470 you can do in the functional world. 364 00:17:26,470 --> 00:17:29,050 And that makes all of these things easy. 
365 00:17:29,050 --> 00:17:30,790 So the rest of this lecture is going 366 00:17:30,790 --> 00:17:34,270 to be general techniques for doing partial and full persistence, 367 00:17:34,270 --> 00:17:36,250 what we know about confluent, and what 368 00:17:36,250 --> 00:17:41,060 we know about functional, brief overview. 369 00:17:41,060 --> 00:17:44,680 Any questions about those goals, problem definitions? 370 00:17:47,640 --> 00:17:48,191 Yeah. 371 00:17:48,191 --> 00:17:50,834 AUDIENCE: I'm still confused about functional, because-- 372 00:17:50,834 --> 00:17:52,500 ERIK DEMAINE: What does functional mean? 373 00:17:52,500 --> 00:17:55,260 AUDIENCE: [INAUDIBLE] 374 00:17:55,260 --> 00:17:57,680 ERIK DEMAINE: Yeah, I guess you'll see what-- 375 00:17:57,680 --> 00:18:00,200 functional looks like all the other things, I agree. 376 00:18:00,200 --> 00:18:02,360 You'll see in a moment how we actually implement 377 00:18:02,360 --> 00:18:03,380 partial and full persistence. 378 00:18:03,380 --> 00:18:07,490 We're going to be changing nodes a lot. 379 00:18:07,490 --> 00:18:10,670 As long as we still represent the same data 380 00:18:10,670 --> 00:18:13,040 in the old versions, we don't have to represent it 381 00:18:13,040 --> 00:18:14,164 in the same way. 382 00:18:14,164 --> 00:18:15,830 That lets us do things more efficiently. 383 00:18:15,830 --> 00:18:17,720 Whereas in functional, you have to represent 384 00:18:17,720 --> 00:18:19,430 all the old versions in exactly the way 385 00:18:19,430 --> 00:18:20,794 you used to represent them. 386 00:18:20,794 --> 00:18:22,460 Here we can kind of mangle things around 387 00:18:22,460 --> 00:18:23,585 and it makes things faster. 388 00:18:23,585 --> 00:18:25,490 Yeah, good question. 389 00:18:25,490 --> 00:18:29,882 So it seems almost the same, but it's nodes versus versions. 390 00:18:29,882 --> 00:18:31,340 I haven't really defined a version. 391 00:18:31,340 --> 00:18:34,580 But it's just that all the queries answer the same way. 
392 00:18:34,580 --> 00:18:38,580 That's what you need for persistence. 393 00:18:38,580 --> 00:18:40,780 Other questions? 394 00:18:40,780 --> 00:18:43,620 All right. 395 00:18:43,620 --> 00:18:45,811 Well, let's do some real data structures. 396 00:18:49,660 --> 00:18:51,310 We start with partial persistence. 397 00:18:55,907 --> 00:18:56,740 This is the easiest. 398 00:18:59,680 --> 00:19:02,080 For both partial and full persistence, 399 00:19:02,080 --> 00:19:06,400 there is the following result. Any pointer machine data 400 00:19:06,400 --> 00:19:19,660 structure, with one catch, a constant number of pointers 401 00:19:19,660 --> 00:19:22,220 to any node-- 402 00:19:22,220 --> 00:19:24,310 so this is constant in-degree. 403 00:19:27,726 --> 00:19:30,017 In a pointer machine, you always have a constant number 404 00:19:30,017 --> 00:19:32,540 of pointers out of a node at most. 405 00:19:32,540 --> 00:19:34,540 But for this result to hold, we also 406 00:19:34,540 --> 00:19:37,280 need a constant number of pointers into any node. 407 00:19:37,280 --> 00:19:38,806 So this is an extra constraint. 408 00:19:42,910 --> 00:19:47,320 Can be transformed into another data structure that 409 00:19:47,320 --> 00:19:53,770 is partially persistent and does all the things it used to do-- 410 00:19:53,770 --> 00:19:56,362 so I'll just say, can be made partially persistent. 411 00:20:00,300 --> 00:20:03,550 You have to pay something, but you have to pay very little-- 412 00:20:03,550 --> 00:20:12,550 constant amortized factor overhead, 413 00:20:12,550 --> 00:20:21,860 multiplicative overhead and constant amount 414 00:20:21,860 --> 00:20:30,290 of additive space per change in the data structure. 415 00:20:30,290 --> 00:20:33,660 So every time you do a modification in your pointer 416 00:20:33,660 --> 00:20:36,260 machine-- you set one of the fields to something-- 417 00:20:36,260 --> 00:20:37,830 you have to store that forever. 
418 00:20:37,830 --> 00:20:39,980 So, I mean, this is the best you could hope to do. 419 00:20:39,980 --> 00:20:43,250 You've got to store everything that happened. 420 00:20:43,250 --> 00:20:45,760 You pay a constant factor overhead, eh. 421 00:20:45,760 --> 00:20:46,670 We're theoreticians. 422 00:20:46,670 --> 00:20:48,170 That doesn't matter. 423 00:20:48,170 --> 00:20:50,840 Then you get any data structure in this world 424 00:20:50,840 --> 00:20:53,270 can be made partially persistent. 425 00:20:53,270 --> 00:20:54,026 That's nice. 426 00:20:54,026 --> 00:20:54,650 Let's prove it. 427 00:21:00,330 --> 00:21:02,140 OK, the idea is pretty simple. 428 00:21:04,780 --> 00:21:06,790 Pointer machines are all about nodes and fields. 429 00:21:06,790 --> 00:21:09,820 So we just need to simulate whatever the data structure is 430 00:21:09,820 --> 00:21:11,710 doing to those nodes and fields in a way 431 00:21:11,710 --> 00:21:13,540 that we don't lose all the information 432 00:21:13,540 --> 00:21:17,320 and we can still search it very quickly. 433 00:21:17,320 --> 00:21:21,190 First idea is to store back pointers. 434 00:21:21,190 --> 00:21:24,872 And this is why we need the constant in-degree constraint. 435 00:21:27,530 --> 00:21:31,075 So if we have a node-- 436 00:21:31,075 --> 00:21:32,610 how do I want to draw a node here? 437 00:21:35,640 --> 00:21:38,700 So maybe these are the three fields of the node. 438 00:21:38,700 --> 00:21:42,930 I want to also store some back pointers. 439 00:21:42,930 --> 00:21:48,480 Whenever there is a node that points to this node, 440 00:21:48,480 --> 00:21:50,670 I want to have a back pointer that 441 00:21:50,670 --> 00:21:54,210 points back so I know where all the pointers came from. 442 00:21:54,210 --> 00:21:57,600 If there's only p pointers, then this is fine. 443 00:21:57,600 --> 00:22:00,960 There'll be p fields here. 444 00:22:00,960 --> 00:22:03,879 So still constant, still in the pointer machine model. 
445 00:22:03,879 --> 00:22:05,670 OK, I'm going to need some other stuff too. 446 00:22:08,670 --> 00:22:11,310 So this is a simple thing, definitely want this. 447 00:22:11,310 --> 00:22:13,470 Because if my nodes ever move around, 448 00:22:13,470 --> 00:22:15,172 I've got to update the pointers to them. 449 00:22:15,172 --> 00:22:16,380 And where are those pointers? 450 00:22:16,380 --> 00:22:20,030 Well, the back pointers tell you where they are. 451 00:22:20,030 --> 00:22:22,410 Nodes will still be constant size, 452 00:22:22,410 --> 00:22:25,500 remain in pointer machine data structure. 453 00:22:25,500 --> 00:22:26,340 OK. 454 00:22:26,340 --> 00:22:28,230 That's idea one. 455 00:22:28,230 --> 00:22:35,080 Idea two is this part. 456 00:22:35,080 --> 00:22:39,485 This is going to store something called mods. 457 00:22:39,485 --> 00:22:44,020 It could stand for something, but I'll leave it as mods. 458 00:22:44,020 --> 00:22:56,530 So these are two of the fields of the data structure. 459 00:22:56,530 --> 00:23:01,480 Ah, one convenience here is for back pointers, 460 00:23:01,480 --> 00:23:04,840 I'm only going to store it for the latest version of the data 461 00:23:04,840 --> 00:23:05,780 structure. 462 00:23:16,310 --> 00:23:16,810 Sorry. 463 00:23:16,810 --> 00:23:19,610 I forgot about that. 464 00:23:19,610 --> 00:23:21,397 We'll come back to that later. 465 00:23:21,397 --> 00:23:23,480 And then the idea is to store these modifications. 466 00:23:23,480 --> 00:23:25,730 How many modifications? 467 00:23:25,730 --> 00:23:29,610 Let's say up to p, twice p. 468 00:23:34,835 --> 00:23:38,640 p was the bound on the in-degree of a node. 469 00:23:38,640 --> 00:23:43,870 So I'm going to allow 2p modifications over here. 470 00:23:43,870 --> 00:23:45,450 And what's a modification look like?
471 00:23:48,780 --> 00:23:50,770 It's going to consist of three things-- 472 00:23:50,770 --> 00:23:53,700 get them in the right order-- 473 00:23:53,700 --> 00:23:56,850 the version in which something was changed, 474 00:23:56,850 --> 00:24:02,830 the field that got changed, and the value it got changed to. 475 00:24:02,830 --> 00:24:07,920 So the idea is that these are the fields here. 476 00:24:07,920 --> 00:24:09,870 We're not going to touch those. 477 00:24:09,870 --> 00:24:11,784 Once they're set to something-- 478 00:24:11,784 --> 00:24:13,450 or, I mean, whatever they are initially, 479 00:24:13,450 --> 00:24:15,370 they will stay that way. 480 00:24:15,370 --> 00:24:18,420 And so instead of actually changing things like the data 481 00:24:18,420 --> 00:24:19,920 structure normally would, we're just 482 00:24:19,920 --> 00:24:21,810 going to add modifications here to say, oh, 483 00:24:21,810 --> 00:24:25,410 well at this time, this field changed to the value of 5. 484 00:24:25,410 --> 00:24:27,720 And then later on, it changed to the value 7. 485 00:24:27,720 --> 00:24:31,380 And then later on, this one changed to the value 23, 486 00:24:31,380 --> 00:24:32,320 whatever. 487 00:24:32,320 --> 00:24:36,292 So that's what they'll look like. 488 00:24:36,292 --> 00:24:37,500 There's a limit to how many-- 489 00:24:37,500 --> 00:24:40,920 we can only store a constant number of mods to each node. 490 00:24:40,920 --> 00:24:44,580 And our constant will be 2p. 491 00:24:44,580 --> 00:24:45,080 OK. 492 00:24:45,080 --> 00:24:46,220 Those are the ideas, and now it's 493 00:24:46,220 --> 00:24:47,761 just a matter of making this all work 494 00:24:47,761 --> 00:24:50,510 and analyzing that it's constant amortized overhead. 495 00:24:53,260 --> 00:25:06,230 So first thing is if you want to read a field, 496 00:25:06,230 --> 00:25:08,090 how would I read a field? 497 00:25:08,090 --> 00:25:09,020 This is really easy.
498 00:25:11,720 --> 00:25:15,920 First you look at what the field is in the node itself. 499 00:25:15,920 --> 00:25:17,831 But then it might have been changed. 500 00:25:17,831 --> 00:25:19,580 And so remember when I say read the field, 501 00:25:19,580 --> 00:25:21,600 I actually mean while I'm given some version, 502 00:25:21,600 --> 00:25:24,655 v, I want to know what is the value of this field at version 503 00:25:24,655 --> 00:25:28,280 v, because I want to be able to look at any of the old data 504 00:25:28,280 --> 00:25:29,630 structures too. 505 00:25:29,630 --> 00:25:37,370 So this would be at version v. I just 506 00:25:37,370 --> 00:25:38,840 look through all the modifications. 507 00:25:38,840 --> 00:25:40,923 There's constantly many, so it takes constant time 508 00:25:40,923 --> 00:25:43,730 to just flip through them and say, well, what changes 509 00:25:43,730 --> 00:25:46,160 have happened up to version v? 510 00:25:46,160 --> 00:25:56,750 So I look at mods with version less than 511 00:25:56,750 --> 00:25:59,450 or equal to v. That will be all the changes that happened up 512 00:25:59,450 --> 00:26:00,770 to this point. 513 00:26:00,770 --> 00:26:02,060 I see, did this field change? 514 00:26:02,060 --> 00:26:03,710 I look at the latest one. 515 00:26:03,710 --> 00:26:07,250 That will be how I read the field of the node, so 516 00:26:07,250 --> 00:26:08,517 constant time. 517 00:26:08,517 --> 00:26:10,850 There's lots of ways to make this efficient in practice. 518 00:26:10,850 --> 00:26:13,730 But for our purposes, it doesn't matter. 519 00:26:13,730 --> 00:26:16,010 It's constant. 520 00:26:16,010 --> 00:26:18,890 The hard part is how do you change a field? 521 00:26:18,890 --> 00:26:22,427 Because there might not be any room in the mod structure. 522 00:26:35,850 --> 00:26:44,525 So to modify, say we want to set node.field equal to x. 
523 00:26:47,240 --> 00:26:52,740 What we do is first we check, is there 524 00:26:52,740 --> 00:26:54,270 any space in the mod structure? 525 00:26:54,270 --> 00:27:03,080 If there's any blank mods, so if the node is not full, 526 00:27:03,080 --> 00:27:06,630 we just add a mod. 527 00:27:06,630 --> 00:27:13,106 So a mod will look like (now, field, x). 528 00:27:13,106 --> 00:27:15,810 Just throw that in there. 529 00:27:15,810 --> 00:27:17,650 Because right at this moment-- 530 00:27:17,650 --> 00:27:19,560 so we maintain a time counter, just increment 531 00:27:19,560 --> 00:27:21,930 it every time we do a change. 532 00:27:21,930 --> 00:27:23,320 This field changed that value. 533 00:27:23,320 --> 00:27:24,540 So that's the easy case. 534 00:27:24,540 --> 00:27:29,340 The trouble, of course, is if the node is full-- 535 00:27:29,340 --> 00:27:31,067 the moment you've all been waiting for. 536 00:27:31,067 --> 00:27:33,150 So what we're going to do here is make a new node. 537 00:27:33,150 --> 00:27:34,690 We've run out of space. 538 00:27:34,690 --> 00:27:36,064 So we need to make a new node. 539 00:27:36,064 --> 00:27:38,730 We're not going to touch the old node, just going to let it sit. 540 00:27:38,730 --> 00:27:40,530 It still maintains all those old versions. 541 00:27:40,530 --> 00:27:43,230 Now we want a new node that represents the latest 542 00:27:43,230 --> 00:27:44,970 and greatest of this node. 543 00:27:44,970 --> 00:27:45,490 OK. 544 00:27:45,490 --> 00:27:47,700 So make a new node. 545 00:27:51,420 --> 00:27:57,060 I'll call it node prime to distinguish from node, where 546 00:27:57,060 --> 00:28:04,705 with all the mods, and this modification in particular, 547 00:28:04,705 --> 00:28:05,205 applied. 548 00:28:07,770 --> 00:28:11,160 OK, so we make a new version of this node. 549 00:28:11,160 --> 00:28:15,370 It's going to have some different fields, whatever 550 00:28:15,370 --> 00:28:18,730 was the latest version represented by those mods.
551 00:28:18,730 --> 00:28:20,525 It's still going to have back pointers, 552 00:28:20,525 --> 00:28:24,810 so we have to maintain all those back pointers. 553 00:28:24,810 --> 00:28:26,400 And now the mod, initially, is going 554 00:28:26,400 --> 00:28:29,340 to be empty, because we just applied them all. 555 00:28:29,340 --> 00:28:32,460 So this new node doesn't have any recent mods. 556 00:28:32,460 --> 00:28:34,320 Old node represents the old versions. 557 00:28:34,320 --> 00:28:37,935 This node is going to represent the new versions. 558 00:28:37,935 --> 00:28:39,420 What's wrong with this picture? 559 00:28:39,420 --> 00:28:40,770 AUDIENCE: Update pointers. 560 00:28:40,770 --> 00:28:42,120 ERIK DEMAINE: Update pointers. 561 00:28:42,120 --> 00:28:43,890 Yeah, there's pointers to the old version 562 00:28:43,890 --> 00:28:47,580 of the node, which are fine for the old versions of the data 563 00:28:47,580 --> 00:28:48,259 structure. 564 00:28:48,259 --> 00:28:50,300 But for the latest version of the data structure, 565 00:28:50,300 --> 00:28:53,560 this node has moved to this new location. 566 00:28:53,560 --> 00:28:56,070 So if there are any old pointers to that node, 567 00:28:56,070 --> 00:28:58,440 we've got to update them in the current version. 568 00:28:58,440 --> 00:29:00,648 We have to update them to point to this node instead. 569 00:29:00,648 --> 00:29:04,060 The old versions are fine, but the new version is in trouble. 570 00:29:04,060 --> 00:29:06,120 Other questions or all the same answer? 571 00:29:06,120 --> 00:29:06,791 Yeah. 572 00:29:06,791 --> 00:29:10,719 AUDIENCE: So if you wanted to read an old version 573 00:29:10,719 --> 00:29:15,630 but you just have the new version, [INAUDIBLE]? 574 00:29:15,630 --> 00:29:16,555 ERIK DEMAINE: OK-- 575 00:29:16,555 --> 00:29:17,430 AUDIENCE: [INAUDIBLE] 576 00:29:17,430 --> 00:29:19,180 ERIK DEMAINE: The question is essentially, 577 00:29:19,180 --> 00:29:22,059 how do we hold on to versions? 
578 00:29:22,059 --> 00:29:24,600 Essentially, you can think of a version of the data structure 579 00:29:24,600 --> 00:29:26,404 as where the root node is. 580 00:29:26,404 --> 00:29:27,570 That's probably the easiest. 581 00:29:27,570 --> 00:29:29,528 I mean, in general, we're representing versions 582 00:29:29,528 --> 00:29:33,120 by a number, v. But we always start at the root. 583 00:29:33,120 --> 00:29:35,077 And so you're given the data structure, 584 00:29:35,077 --> 00:29:36,660 which is represented by the root node. 585 00:29:36,660 --> 00:29:40,020 And you say, search for the value 5. 586 00:29:40,020 --> 00:29:43,289 Is it in this binary search tree or whatever? 587 00:29:43,289 --> 00:29:45,330 And then you just start navigating from the root, 588 00:29:45,330 --> 00:29:49,560 but you know I'm in version a million or whatever. 589 00:29:49,560 --> 00:29:51,340 I know what version I'm looking for. 590 00:29:51,340 --> 00:29:56,040 So you start with the root, which never changes, let's say. 591 00:29:56,040 --> 00:29:58,320 And then you follow pointers that 592 00:29:58,320 --> 00:30:00,060 essentially tell you for that version 593 00:30:00,060 --> 00:30:01,380 where you should be going. 594 00:30:01,380 --> 00:30:03,780 I guess at the root version, it's a little trickier. 595 00:30:03,780 --> 00:30:07,710 You probably want a little array that says for this version, 596 00:30:07,710 --> 00:30:08,900 here's the root node. 597 00:30:08,900 --> 00:30:11,130 But that's a special case. 598 00:30:11,130 --> 00:30:11,670 Yeah. 599 00:30:11,670 --> 00:30:13,234 Another question? 600 00:30:13,234 --> 00:30:15,222 AUDIENCE: So on the new node that you 601 00:30:15,222 --> 00:30:19,606 created, the fields that you copied, you also have to have 602 00:30:19,606 --> 00:30:20,689 a version for them, right? 603 00:30:20,689 --> 00:30:22,180 Because [INAUDIBLE]?
604 00:30:26,670 --> 00:30:27,731 ERIK DEMAINE: These-- 605 00:30:27,731 --> 00:30:30,190 AUDIENCE: Or do you version the whole node? 606 00:30:30,190 --> 00:30:32,650 ERIK DEMAINE: Here we're versioning the whole node. 607 00:30:32,650 --> 00:30:34,600 The original field values represent 608 00:30:34,600 --> 00:30:37,940 what was originally there, whenever this node was created. 609 00:30:37,940 --> 00:30:40,250 Then the mods specify what time the fields change. 610 00:30:40,250 --> 00:30:44,200 So I don't think we need times here. 611 00:30:44,200 --> 00:30:45,326 All right. 612 00:30:45,326 --> 00:30:47,200 So we've got to update two kinds of pointers. 613 00:30:47,200 --> 00:30:48,810 There's regular pointers, which live 614 00:30:48,810 --> 00:30:52,140 in the fields, which are things pointing to the node. 615 00:30:52,140 --> 00:30:53,640 But then there's also back pointers. 616 00:30:53,640 --> 00:30:55,909 Because if this is a pointer to a node, 617 00:30:55,909 --> 00:30:57,950 then there'll be a back pointer back to the node. 618 00:30:57,950 --> 00:31:00,620 And all of those have to change. 619 00:31:00,620 --> 00:31:03,765 Conveniently, the back pointers are easy. 620 00:31:11,700 --> 00:31:13,535 So if they're back pointers to the node, 621 00:31:13,535 --> 00:31:14,910 we change them to the node prime. 622 00:31:14,910 --> 00:31:16,080 How do we find the back pointers? 623 00:31:16,080 --> 00:31:17,621 Well, we just follow all the pointers 624 00:31:17,621 --> 00:31:21,330 and then there will be back pointers there. 625 00:31:21,330 --> 00:31:23,040 Because I said we're only maintaining 626 00:31:23,040 --> 00:31:25,650 back pointers for the latest version, 627 00:31:25,650 --> 00:31:28,299 I don't need to preserve the old versions 628 00:31:28,299 --> 00:31:29,340 of those back pointers. 629 00:31:29,340 --> 00:31:31,025 So I just go in and I change them.
630 00:31:31,025 --> 00:31:33,150 It takes constant time, because the constant number 631 00:31:33,150 --> 00:31:35,700 of things I point to, each one has a back pointer. 632 00:31:35,700 --> 00:31:37,130 So this is cheap. 633 00:31:37,130 --> 00:31:39,210 There's no persistence here. 634 00:31:39,210 --> 00:31:41,940 That's an advantage of partial persistence. 635 00:31:41,940 --> 00:31:44,370 The hard part is updating the pointers 636 00:31:44,370 --> 00:31:45,684 because those live in fields. 637 00:31:45,684 --> 00:31:47,850 I need to remember the old versions of those fields. 638 00:31:47,850 --> 00:31:49,564 And that we do recursively. 639 00:31:58,746 --> 00:32:00,120 Because to change those pointers, 640 00:32:00,120 --> 00:32:01,170 that's a field update. 641 00:32:01,170 --> 00:32:02,940 That's something exactly of this form. 642 00:32:02,940 --> 00:32:05,970 So that's the same operation but on a different node. 643 00:32:05,970 --> 00:32:07,440 So I just do that. 644 00:32:07,440 --> 00:32:08,760 I claim this is good. 645 00:32:08,760 --> 00:32:11,400 That's the end of the algorithm. 646 00:32:11,400 --> 00:32:12,540 Now we need to analyze it. 647 00:32:24,886 --> 00:32:25,760 How do we analyze it? 648 00:32:25,760 --> 00:32:26,260 Any guesses? 649 00:32:29,498 --> 00:32:30,500 AUDIENCE: Amortize it. 650 00:32:30,500 --> 00:32:32,208 ERIK DEMAINE: Amortized analysis, exactly 651 00:32:32,208 --> 00:32:33,620 the answer I was looking for. 652 00:32:33,620 --> 00:32:34,580 OK. 653 00:32:34,580 --> 00:32:36,290 [INAUDIBLE] amortization. 654 00:32:36,290 --> 00:32:38,330 The most powerful technique in amortization 655 00:32:38,330 --> 00:32:40,460 is probably the potential method. 656 00:32:40,460 --> 00:32:42,370 So we're going to use that. 657 00:32:42,370 --> 00:32:44,360 There's a sort of more-- 658 00:32:44,360 --> 00:32:47,990 you'll see a charging argument in a moment.
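The read and modify procedures described above can be sketched in Python. This is an illustrative sketch under assumptions, not the paper's code: the names `Node`, `Store`, `new_node`, `read`, and `write` are made up here, fields live in a dict, Python lists stand in for the constant-size mod and back-pointer slots, and no node is assumed to point to itself.

```python
class Node:
    def __init__(self, fields):
        self.fields = dict(fields)  # original values, never changed after creation
        self.mods = []              # (version, field, value) triples, at most 2p
        self.back = []              # (node, field) pairs pointing here, latest version only


class Store:
    def __init__(self, p):
        self.cap = 2 * p            # p is the assumed bound on in-degree
        self.version = 0            # time counter, incremented on every change

    def new_node(self, fields):
        node = Node(fields)
        for f, target in node.fields.items():
            if isinstance(target, Node):
                target.back.append((node, f))
        return node

    def read(self, node, field, v):
        value = node.fields[field]
        for version, f, x in node.mods:       # mods are kept in version order
            if f == field and version <= v:
                value = x                     # latest change at or before v wins
        return value

    def write(self, node, field, x):
        self.version += 1
        self._set(node, field, x, self.version)
        return self.version

    def _set(self, node, field, x, version):
        # Back pointers are not persistent: just fix them up in place.
        old = self.read(node, field, version)
        if isinstance(old, Node):
            old.back.remove((node, field))
        if isinstance(x, Node):
            x.back.append((node, field))
        if len(node.mods) < self.cap:         # easy case: room for one more mod
            node.mods.append((version, field, x))
            return
        # Node is full: build node' with all mods and this change applied.
        latest = {f: self.read(node, f, version) for f in node.fields}
        latest[field] = x
        new = Node(latest)
        for f, target in latest.items():      # retarget back pointers of what we point to
            if isinstance(target, Node):
                i = target.back.index((node, f))
                target.back[i] = (new, f)
        for parent, f in list(node.back):     # recursive field updates in the parents
            self._set(parent, f, new, version)
```

To query version v, start at a root and call `read(..., v)` at every step. Tracking which node is the root in each version (the lookup table mentioned earlier) is left out of this sketch.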
659 00:32:50,914 --> 00:32:53,330 We want the potential function to represent when this data 660 00:32:53,330 --> 00:32:55,300 structure is in a bad state. 661 00:32:55,300 --> 00:32:58,807 Intuitively, it's in a bad state when a lot of nodes are full. 662 00:32:58,807 --> 00:33:00,890 Because then as soon as you make a change in them, 663 00:33:00,890 --> 00:33:03,680 they will burst, and you have to do all this crazy recursion 664 00:33:03,680 --> 00:33:04,580 and stuff. 665 00:33:04,580 --> 00:33:05,990 This case is nice and cheap. 666 00:33:05,990 --> 00:33:08,680 We just add a modification, constant time. 667 00:33:08,680 --> 00:33:10,430 This case, not so nice because we recurse. 668 00:33:10,430 --> 00:33:12,740 And then that's going to cause more recursions 669 00:33:12,740 --> 00:33:16,040 and all sorts of chaos could happen. 670 00:33:16,040 --> 00:33:20,030 So there's probably a few different potential functions 671 00:33:20,030 --> 00:33:21,080 that would work here. 672 00:33:21,080 --> 00:33:23,150 And an old version of these notes I said 673 00:33:23,150 --> 00:33:25,070 should be the number of full nodes. 674 00:33:25,070 --> 00:33:27,680 But I think we can make life a little bit easier 675 00:33:27,680 --> 00:33:32,390 by the following. 676 00:33:32,390 --> 00:33:36,674 Basically, the total number of modifications-- 677 00:33:36,674 --> 00:33:39,800 not quite the total, almost the total. 678 00:33:39,800 --> 00:33:49,760 So I'm going to do c times the sum of the number of mods 679 00:33:49,760 --> 00:33:56,190 in latest version nodes. 680 00:33:59,160 --> 00:34:00,530 OK. 681 00:34:00,530 --> 00:34:02,984 So because we sort of really only 682 00:34:02,984 --> 00:34:05,150 care about-- we're only changing the latest version, 683 00:34:05,150 --> 00:34:07,070 so I really only care about nodes that 684 00:34:07,070 --> 00:34:08,570 live in the latest version. 685 00:34:08,570 --> 00:34:09,659 What do I mean by this?
686 00:34:09,659 --> 00:34:11,909 Well, when I made this new node prime, 687 00:34:11,909 --> 00:34:14,000 this becomes the new representation of that node. 688 00:34:14,000 --> 00:34:15,690 The old version is dead. 689 00:34:15,690 --> 00:34:18,560 We will never change it again. 690 00:34:18,560 --> 00:34:21,080 If we're modifying, we will never even look at it again. 691 00:34:21,080 --> 00:34:24,805 Because now everything points to here. 692 00:34:24,805 --> 00:34:26,429 So I don't really care about that node. 693 00:34:26,429 --> 00:34:27,690 It's got a ton of mods. 694 00:34:27,690 --> 00:34:30,380 But what's nice is that when I create this new node, now 695 00:34:30,380 --> 00:34:31,882 the mod list is empty. 696 00:34:31,882 --> 00:34:33,840 So I start from scratch, just like reinstalling 697 00:34:33,840 --> 00:34:34,870 your operating system. 698 00:34:34,870 --> 00:34:38,010 It's a good feeling. 699 00:34:38,010 --> 00:34:45,090 And so the potential goes down by, I guess, c times 2 times p. 700 00:34:45,090 --> 00:34:49,764 When I do this change, potential goes down by basically p. 701 00:34:49,764 --> 00:34:52,230 AUDIENCE: Is c any constant or-- 702 00:34:52,230 --> 00:34:55,440 ERIK DEMAINE: c will be a constant to be determined. 703 00:34:55,440 --> 00:34:57,180 I mean, it could be 1. 704 00:34:57,180 --> 00:34:58,770 It depends how you want to define it. 705 00:34:58,770 --> 00:35:02,130 I'm going to use the CLRS notion of amortized cost, which 706 00:35:02,130 --> 00:35:06,777 is actual cost plus change in potential. 707 00:35:06,777 --> 00:35:08,610 And then I need a constant here, because I'm 708 00:35:08,610 --> 00:35:12,850 measuring a running time versus some combinatorial quantity. 709 00:35:12,850 --> 00:35:17,410 So this will be to match the running time that we'll get to. 710 00:35:17,410 --> 00:35:17,990 OK. 711 00:35:17,990 --> 00:35:22,290 So what is amortized cost? 
712 00:35:22,290 --> 00:35:24,920 There's sort of two cases of modification. 713 00:35:24,920 --> 00:35:28,440 There's the cheap case and the not so cheap case. 714 00:35:28,440 --> 00:35:30,945 In general, amortized cost-- 715 00:35:34,980 --> 00:35:37,920 in both cases, it's going to be at most-- 716 00:35:37,920 --> 00:35:39,810 well, first of all, we do some constant work 717 00:35:39,810 --> 00:35:44,640 just to figure out all this stuff, make copies, whatever. 718 00:35:44,640 --> 00:35:49,680 So that's some constant time. 719 00:35:49,680 --> 00:35:52,920 That's the part that I don't want to try to measure. 720 00:35:52,920 --> 00:35:55,140 Then potentially, we add a new mod. 721 00:35:55,140 --> 00:35:59,430 If we add a mod, that increases the potential by c. 722 00:35:59,430 --> 00:36:02,070 Because we're just counting mods, multiplying by c. 723 00:36:02,070 --> 00:36:04,702 So we might get plus 1 mod. 724 00:36:04,702 --> 00:36:06,160 This is going to be an upper bound. 725 00:36:06,160 --> 00:36:09,720 We don't always add 1, but worst case, we always add 1, 726 00:36:09,720 --> 00:36:11,880 let's say. 727 00:36:11,880 --> 00:36:14,220 And then there's this annoying part. 728 00:36:14,220 --> 00:36:16,500 And this might happen, might not happen. 729 00:36:16,500 --> 00:36:20,340 So then there's a plus maybe. 730 00:36:20,340 --> 00:36:23,310 If this happens, we decrease the potential 731 00:36:23,310 --> 00:36:26,310 because we empty out the mods for that node in terms 732 00:36:26,310 --> 00:36:27,720 of the latest version. 733 00:36:27,720 --> 00:36:34,500 So then we get a negative 2cp, change in potential. 734 00:36:34,500 --> 00:36:42,120 And then we'd have to pay I guess up to p recursions. 735 00:36:49,250 --> 00:36:51,520 Because we have to-- 736 00:36:51,520 --> 00:36:53,360 how many pointers are there to me? 737 00:36:53,360 --> 00:36:58,490 Well, at most p of them, because there are at most p pointers 738 00:36:58,490 --> 00:36:59,270 to any node.
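Written out, the bound being assembled on the board looks roughly as follows. This is a reconstruction from the spoken description, not a copy of the board: the symbol \hat{c} for the amortized cost of one modification is an assumed name, and the bracketed term is present only when the node was full.

```latex
\[
\hat{c} \;\le\; \underbrace{c}_{\text{constant work}}
        \;+\; \underbrace{c}_{\text{one new mod}}
        \;+\; \Big[\, \underbrace{-2cp}_{\text{mods emptied}}
        \;+\; \underbrace{p \cdot \hat{c}}_{\text{up to } p \text{ recursions}} \,\Big]
\]
\[
\text{Substituting } \hat{c} \le 2c \text{ for each recursive call: }\quad
-2cp + p \cdot 2c = 0
\quad\Longrightarrow\quad
\hat{c} \le 2c = O(1).
\]
```

The first inequality is the "weird recurrence"; the second line records why the bracket contributes nothing once each recursive call is itself bounded by 2c.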
739 00:37:02,750 --> 00:37:03,350 OK. 740 00:37:03,350 --> 00:37:05,110 This is kind of a weird-- 741 00:37:05,110 --> 00:37:06,510 it's not exactly algebra here. 742 00:37:06,510 --> 00:37:09,736 I have this thing, recursions. 743 00:37:09,736 --> 00:37:11,610 But if you think about how this would expand, 744 00:37:11,610 --> 00:37:13,160 all right, this is constant time. 745 00:37:13,160 --> 00:37:14,020 That's good. 746 00:37:14,020 --> 00:37:15,020 And then if we do this-- 747 00:37:15,020 --> 00:37:16,160 I'll put a question mark here. 748 00:37:16,160 --> 00:37:16,868 It might be here. 749 00:37:16,868 --> 00:37:18,110 It might not. 750 00:37:18,110 --> 00:37:19,820 If it's not here, fine, constant. 751 00:37:19,820 --> 00:37:24,280 If it is here, then this gets expanded into this thing. 752 00:37:24,280 --> 00:37:26,130 It's a weird way to write a recurrence. 753 00:37:26,130 --> 00:37:30,540 But we get p times whatever is in this right hand side. 754 00:37:30,540 --> 00:37:31,040 OK. 755 00:37:31,040 --> 00:37:33,440 But then there's this minus 2cp. 756 00:37:33,440 --> 00:37:36,560 So we're going to get p times 2c here. 757 00:37:36,560 --> 00:37:37,850 That's the initial cost. 758 00:37:37,850 --> 00:37:40,040 So that will cancel with this. 759 00:37:40,040 --> 00:37:41,910 And then we might get another recursion. 760 00:37:41,910 --> 00:37:43,910 But every time we get a recursion, all the terms 761 00:37:43,910 --> 00:37:44,899 cancel. 762 00:37:44,899 --> 00:37:46,940 So it doesn't matter whether this is here or not. 763 00:37:46,940 --> 00:37:49,610 You get 0, which is great. 764 00:37:49,610 --> 00:37:53,400 And you're left with the original 2c. 765 00:37:53,400 --> 00:37:55,410 Constant. 766 00:37:55,410 --> 00:37:56,300 OK. 767 00:37:56,300 --> 00:37:59,120 [INAUDIBLE] potential functions are always a little crazy. 768 00:37:59,120 --> 00:38:03,530 What's happening here is that, OK, maybe you add a mod. 769 00:38:03,530 --> 00:38:05,150 That's cheap.
770 00:38:05,150 --> 00:38:08,150 But when we have to do this work and we have to do this 771 00:38:08,150 --> 00:38:14,390 recursion-- this is up to 2p updates or recursions-- 772 00:38:14,390 --> 00:38:17,270 we are charging it to the emptying of this node. 773 00:38:17,270 --> 00:38:21,110 The number of mods went from 2p down to 0. 774 00:38:21,110 --> 00:38:22,940 And so we're just charging this update cost 775 00:38:22,940 --> 00:38:24,059 to that modification. 776 00:38:24,059 --> 00:38:26,600 So if you like charging schemes, this is much more intuitive. 777 00:38:26,600 --> 00:38:28,500 But with charging schemes, it's always a little careful. 778 00:38:28,500 --> 00:38:30,860 You have to make sure you're not double charging. 779 00:38:30,860 --> 00:38:34,790 Here it's obvious that you're not double charging. 780 00:38:34,790 --> 00:38:37,070 Kind of a cool and magical. 781 00:38:37,070 --> 00:38:42,010 This is a paper by Driscoll, Sarnak, Sleator, 782 00:38:42,010 --> 00:38:43,820 Tarjan from 1989. 783 00:38:43,820 --> 00:38:45,620 So it's very early days of amortization. 784 00:38:45,620 --> 00:38:47,760 But they knew how to do it. 785 00:38:47,760 --> 00:38:48,521 Question? 786 00:38:48,521 --> 00:38:50,504 AUDIENCE: [INAUDIBLE] 787 00:38:50,504 --> 00:38:52,670 ERIK DEMAINE: What happens if you overflow the root? 788 00:38:52,670 --> 00:38:54,753 Yeah, I never thought about the root before today. 789 00:38:54,753 --> 00:38:57,350 But I think the way to fix the root is 790 00:38:57,350 --> 00:39:02,780 just you have one big table that says, for a given version-- 791 00:39:02,780 --> 00:39:04,326 I guess a simple way would be to say, 792 00:39:04,326 --> 00:39:06,200 not only is a version a number, but it's also 793 00:39:06,200 --> 00:39:07,130 a pointer to the root. 794 00:39:07,130 --> 00:39:07,629 There we go. 795 00:39:07,629 --> 00:39:09,242 Pointer machine. 
796 00:39:09,242 --> 00:39:11,450 So that way you're just always explicitly maintaining 797 00:39:11,450 --> 00:39:15,110 the root copy or the pointer. 798 00:39:15,110 --> 00:39:18,710 Because otherwise, you're in trouble. 799 00:39:18,710 --> 00:39:21,530 AUDIENCE: Then can you go back to [INAUDIBLE]. 800 00:39:21,530 --> 00:39:24,310 ERIK DEMAINE: So in order to refer to an old version, 801 00:39:24,310 --> 00:39:26,750 you have to have the pointer to that root node. 802 00:39:26,750 --> 00:39:29,336 If you want to do it just from a version number, 803 00:39:29,336 --> 00:39:30,460 look at the data structure. 804 00:39:30,460 --> 00:39:31,450 Just from a version number, you would 805 00:39:31,450 --> 00:39:33,249 need some kind of lookup table, which 806 00:39:33,249 --> 00:39:34,540 is outside the pointer machine. 807 00:39:34,540 --> 00:39:36,280 So you could do it in a real computer, 808 00:39:36,280 --> 00:39:39,210 but a pointer machine is not technically allowed. 809 00:39:39,210 --> 00:39:40,570 So it's slightly awkward. 810 00:39:40,570 --> 00:39:42,400 No arrays are allowed in pointer machines, 811 00:39:42,400 --> 00:39:43,483 in case that wasn't clear. 812 00:39:43,483 --> 00:39:44,274 Another question? 813 00:39:44,274 --> 00:39:48,720 AUDIENCE: [INAUDIBLE] constant space to store for [INAUDIBLE]. 814 00:39:48,720 --> 00:39:54,294 And also, what if we have really big numbers [INAUDIBLE]? 815 00:39:54,294 --> 00:39:56,710 ERIK DEMAINE: In this model, in the pointer machine model, 816 00:39:56,710 --> 00:39:58,930 we're assuming that whatever the data is in the items 817 00:39:58,930 --> 00:40:01,480 take constant space each. 818 00:40:01,480 --> 00:40:03,670 If you want to know about bigger things in here, 819 00:40:03,670 --> 00:40:05,450 then refer to future lectures. 820 00:40:05,450 --> 00:40:06,910 This is time travel, after all. 821 00:40:06,910 --> 00:40:09,200 Just go to a future class and then come back. 
822 00:40:09,200 --> 00:40:11,920 [LAUGHS] So we'll get there, but right now, 823 00:40:11,920 --> 00:40:15,121 we're not thinking about what's in here. 824 00:40:15,121 --> 00:40:16,870 Whatever big thing you're trying to store, 825 00:40:16,870 --> 00:40:19,810 you reduce it down to constant size things. 826 00:40:19,810 --> 00:40:22,840 And then you spread them around nodes of a pointer machine. 827 00:40:22,840 --> 00:40:25,250 How you do that, that's up to the data structure. 828 00:40:25,250 --> 00:40:28,000 We're just transforming the data structure to be persistent. 829 00:40:28,000 --> 00:40:30,458 OK, you could ask about other models than pointer machines, 830 00:40:30,458 --> 00:40:34,530 but we're going to stick to pointer machines here. 831 00:40:34,530 --> 00:40:36,220 All right. 832 00:40:36,220 --> 00:40:38,110 That was partial persistence. 833 00:40:38,110 --> 00:40:41,540 Let's do full persistence. 834 00:40:41,540 --> 00:40:42,370 That was too easy. 835 00:40:46,300 --> 00:40:48,970 Same paper does full persistence. 836 00:40:48,970 --> 00:40:50,427 That was just a warm up. 837 00:40:50,427 --> 00:40:52,510 Full persistence is actually not that much harder. 838 00:40:55,070 --> 00:40:57,685 So let me tell you basically what changes. 839 00:41:04,240 --> 00:41:05,550 There are two issues. 840 00:41:05,550 --> 00:41:09,440 One is that everything here has to change and not by much. 841 00:41:09,440 --> 00:41:11,370 We're still going to use back pointers. 842 00:41:11,370 --> 00:41:12,860 We're still going to have my mods. 843 00:41:12,860 --> 00:41:15,026 The number of mods is going to be slightly different 844 00:41:15,026 --> 00:41:16,910 but basically the same. 845 00:41:16,910 --> 00:41:19,327 Back pointers no longer just refer to the latest version. 846 00:41:19,327 --> 00:41:21,410 We have to maintain back pointers in all versions. 847 00:41:21,410 --> 00:41:22,970 So that's annoying.
848 00:41:22,970 --> 00:41:24,274 But hey, that's life. 849 00:41:24,274 --> 00:41:25,940 The amortization, the potential function 850 00:41:25,940 --> 00:41:28,190 will change slightly but basically not much. 851 00:41:30,850 --> 00:41:33,116 Sort of the bigger issue you might first wonder about, 852 00:41:33,116 --> 00:41:35,240 and it's actually the most challenging technically, 853 00:41:35,240 --> 00:41:37,490 is versions are no longer numbers. 854 00:41:37,490 --> 00:41:39,140 Because it's not a line. 855 00:41:39,140 --> 00:41:41,247 Versions are nodes in a tree. 856 00:41:41,247 --> 00:41:42,830 You should probably call them vertices 857 00:41:42,830 --> 00:41:45,121 in a tree to distinguish them from nodes in the pointer 858 00:41:45,121 --> 00:41:46,580 machine. 859 00:41:46,580 --> 00:41:48,530 OK, so you've got this tree of versions. 860 00:41:48,530 --> 00:41:53,930 And then versions are just some point on that tree. 861 00:41:53,930 --> 00:41:57,020 This is annoying because we like lines. 862 00:41:57,020 --> 00:41:58,770 We don't like trees as much. 863 00:41:58,770 --> 00:42:00,770 So what we're going to do is linearize the tree. 864 00:42:04,320 --> 00:42:05,900 Like, when in doubt, cheat. 865 00:42:12,200 --> 00:42:13,880 How do we do this? 866 00:42:13,880 --> 00:42:15,310 With tree traversal. 867 00:42:15,310 --> 00:42:18,240 Imagine I'm going to draw a super complicated tree 868 00:42:18,240 --> 00:42:19,370 of versions. 869 00:42:19,370 --> 00:42:21,660 Say there are three versions. 870 00:42:21,660 --> 00:42:22,807 OK. 871 00:42:22,807 --> 00:42:24,890 I don't want to number them, because that would be 872 00:42:24,890 --> 00:42:26,330 kind of begging the question. 873 00:42:26,330 --> 00:42:30,530 So let's just call them x, y, and z. 874 00:42:33,080 --> 00:42:34,384 All right. 875 00:42:34,384 --> 00:42:36,050 I mean, it's a directed tree, because we 876 00:42:36,050 --> 00:42:37,464 have the older versions. 
877 00:42:37,464 --> 00:42:38,880 This is like the original version. 878 00:42:38,880 --> 00:42:39,754 And we made a change. 879 00:42:39,754 --> 00:42:42,690 We made a different change on the same version. 880 00:42:42,690 --> 00:42:45,520 What I'd like to do is a traversal of that tree, 881 00:42:45,520 --> 00:42:48,170 like a regular, as if you're going to sort those nodes. 882 00:42:48,170 --> 00:42:53,420 Actually, let me use color, high def here. 883 00:42:53,420 --> 00:42:55,820 So here's our traversal of the tree. 884 00:42:59,030 --> 00:43:01,790 And I want to look at the first and the last time I 885 00:43:01,790 --> 00:43:02,730 visit each node. 886 00:43:02,730 --> 00:43:05,360 So here's the first time I visit x. 887 00:43:05,360 --> 00:43:09,250 So I'll write this is the beginning of x. 888 00:43:09,250 --> 00:43:13,550 Capital X. Then this is the first time I visit y, 889 00:43:13,550 --> 00:43:15,530 so it's beginning of y. 890 00:43:15,530 --> 00:43:19,070 And then this is the last time I visit y, so it's the end of y. 891 00:43:19,070 --> 00:43:20,770 And then, don't care. 892 00:43:20,770 --> 00:43:24,800 Then this is the beginning of z. 893 00:43:24,800 --> 00:43:27,230 And this is the end of z. 894 00:43:27,230 --> 00:43:29,480 And then this is the end of x. 895 00:43:29,480 --> 00:43:38,830 If I write those sequentially, I get bx, by, ey, bz, ez, 896 00:43:38,830 --> 00:43:42,400 because this is so easy, ex. 897 00:43:42,400 --> 00:43:45,530 OK, you can think of these as parentheses, right? 898 00:43:45,530 --> 00:43:48,460 For whatever reason I chose b and e for beginning and ending, 899 00:43:48,460 --> 00:43:50,360 but this is like open parens, close parens. 900 00:43:50,360 --> 00:43:52,310 This is easy to do in linear time. 901 00:43:52,310 --> 00:43:53,690 I think you all know how. 902 00:43:53,690 --> 00:43:55,066 Except it's not a static problem. 903 00:43:55,066 --> 00:43:56,523 Versions are changing all the time.
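For a fixed version tree, the begin/end sequence can be produced by a plain depth-first traversal. A small Python sketch follows; the function name `linearize` and the dictionary of child lists are illustrative choices, not anything from the lecture, and it only covers the static case, not versions being added over time.

```python
def linearize(children, root):
    """Return the begin/end markers of a rooted tree in traversal order.

    `children` maps a version name to the list of versions derived from it.
    """
    order = []

    def visit(v):
        order.append("b" + v)             # first time we see v
        for child in children.get(v, []):
            visit(child)
        order.append("e" + v)             # last time we see v

    visit(root)
    return order


# The three-version example: y and z are both derived from x.
print(linearize({"x": ["y", "z"]}, "x"))
# ['bx', 'by', 'ey', 'bz', 'ez', 'ex']
```

Each marker is appended exactly once, so the traversal is linear in the number of versions. The recursion is kept for clarity; a very deep version tree would call for an explicit stack instead.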
904 00:43:56,523 --> 00:43:57,500 We're adding versions. 905 00:43:57,500 --> 00:43:59,458 We're never deleting versions, but we're always 906 00:43:59,458 --> 00:44:00,422 adding stuff to here. 907 00:44:00,422 --> 00:44:01,880 It's a little awkward, but the idea 908 00:44:01,880 --> 00:44:05,840 is I want to maintain this order, 909 00:44:05,840 --> 00:44:16,010 maintain the begin and the end of each you 910 00:44:16,010 --> 00:44:17,315 might say subtree of versions. 911 00:44:23,520 --> 00:44:25,770 This string, from bx to ex, represents 912 00:44:25,770 --> 00:44:29,820 all of the stuff in x's subtree, in the rooted tree 913 00:44:29,820 --> 00:44:30,660 starting at x. 914 00:44:33,890 --> 00:44:34,930 How do I maintain that? 915 00:44:40,550 --> 00:44:42,009 Using a data structure. 916 00:44:56,490 --> 00:45:00,830 So we're going to use something, a data structure we haven't yet 917 00:45:00,830 --> 00:45:02,210 seen. 918 00:45:02,210 --> 00:45:04,899 It will be in lecture 8. 919 00:45:04,899 --> 00:45:06,440 This is a time travel data structure, 920 00:45:06,440 --> 00:45:10,280 so I'm allowed to do that. 921 00:45:10,280 --> 00:45:14,150 So order maintenance data structure. 922 00:45:14,150 --> 00:45:16,970 You can think of this as a magical linked list. 923 00:45:16,970 --> 00:45:19,520 Let me tell you what the magical linked list can do. 924 00:45:19,520 --> 00:45:22,871 You can insert-- 925 00:45:22,871 --> 00:45:24,620 I'm going to call it an item, because node 926 00:45:24,620 --> 00:45:28,640 would be kind of confusing given where we are right now. 927 00:45:28,640 --> 00:45:32,480 You can insert a new item in the list immediately before 928 00:45:32,480 --> 00:45:34,850 or after a given item. 929 00:45:37,410 --> 00:45:37,910 OK. 930 00:45:37,910 --> 00:45:41,390 This is like a regular linked list. 931 00:45:41,390 --> 00:45:44,750 Here's a regular linked list. 
932 00:45:44,750 --> 00:45:48,290 And if I'm given a particular item like this one, 933 00:45:48,290 --> 00:45:51,190 I can say, well, insert a new item right here. 934 00:45:51,190 --> 00:45:51,721 You say, OK. 935 00:45:51,721 --> 00:45:52,220 Fine. 936 00:45:52,220 --> 00:45:57,060 I'll just make a new node and relink here, relink there. 937 00:45:57,060 --> 00:45:58,190 Constant time, right? 938 00:45:58,190 --> 00:46:00,050 So in an order maintenance data structure, 939 00:46:00,050 --> 00:46:01,940 you can do this in constant time. 940 00:46:01,940 --> 00:46:02,900 Wow! 941 00:46:02,900 --> 00:46:05,180 So amazing. 942 00:46:05,180 --> 00:46:08,410 OK, the catch is the second operation you can do. 943 00:46:08,410 --> 00:46:09,410 Maybe I'll number these. 944 00:46:09,410 --> 00:46:10,970 This is the update. 945 00:46:10,970 --> 00:46:13,490 Then there's the query. 946 00:46:13,490 --> 00:46:17,990 The query is, what is the relative order 947 00:46:17,990 --> 00:46:20,360 of two nodes, of two items? 948 00:46:24,700 --> 00:46:27,090 x and y. 949 00:46:27,090 --> 00:46:29,980 So now I give you this node and this node. 950 00:46:29,980 --> 00:46:32,420 And I say, which is to the left? 951 00:46:32,420 --> 00:46:34,030 Which is earlier in the order? 952 00:46:34,030 --> 00:46:36,070 I want to know, is x basically less than y 953 00:46:36,070 --> 00:46:37,690 in terms of the order in the list? 954 00:46:37,690 --> 00:46:41,136 Or is y less than x? 955 00:46:41,136 --> 00:46:42,760 And an order maintenance data structure 956 00:46:42,760 --> 00:46:45,910 can do this in constant time. 957 00:46:45,910 --> 00:46:50,486 Now it doesn't look like your mother's linked list, I guess. 958 00:46:50,486 --> 00:46:52,360 It's not the linked list you learned in school. 959 00:46:52,360 --> 00:46:54,700 It's a magical linked list that can somehow 960 00:46:54,700 --> 00:46:55,810 answer these queries. 961 00:46:55,810 --> 00:46:56,510 How?
962 00:46:56,510 --> 00:46:58,770 Go to lecture 7. 963 00:46:58,770 --> 00:46:59,270 OK. 964 00:46:59,270 --> 00:47:03,174 Forward reference, lecture 8, sorry. 965 00:47:03,174 --> 00:47:05,590 For now, we're just going to assume that this magical data 966 00:47:05,590 --> 00:47:06,740 structure exists. 967 00:47:06,740 --> 00:47:09,340 So in constant time, this is great. 968 00:47:09,340 --> 00:47:11,680 Because if we're maintaining these b's and e's, we 969 00:47:11,680 --> 00:47:16,090 want to maintain the order that these things appear in. 970 00:47:16,090 --> 00:47:17,620 If we want to create a new version, 971 00:47:17,620 --> 00:47:20,410 like suppose we were just creating version z, 972 00:47:20,410 --> 00:47:23,140 well, it used to be everything without this bz, ez. 973 00:47:23,140 --> 00:47:27,215 And we'd just insert two items in here, bz and ez. 974 00:47:27,215 --> 00:47:28,590 They're right next to each other. 975 00:47:28,590 --> 00:47:30,820 And if we were given version x, we could just say, 976 00:47:30,820 --> 00:47:34,300 oh, we'll look at ex and insert two items right before it. 977 00:47:34,300 --> 00:47:36,069 Or you can put them right after bx. 978 00:47:36,069 --> 00:47:37,610 I mean, there's no actual order here. 979 00:47:37,610 --> 00:47:40,690 So it could have been y first and then z or z first and then 980 00:47:40,690 --> 00:47:42,070 y. 981 00:47:42,070 --> 00:47:44,610 So it's really easy to add a new version in constant time. 982 00:47:44,610 --> 00:47:47,590 You just do two of these insert operations. 983 00:47:47,590 --> 00:47:50,680 And now you have this magical order operation, which 984 00:47:50,680 --> 00:47:54,500 if I'm given two versions-- 985 00:47:54,500 --> 00:47:56,800 I don't know, v and w-- 986 00:47:56,800 --> 00:48:00,250 and I want to know is v an ancestor of w, 987 00:48:00,250 --> 00:48:02,390 now I can do it in constant time. 
988 00:48:02,390 --> 00:48:09,700 So this lets me do a third operation, which is, is version 989 00:48:09,700 --> 00:48:21,850 v an ancestor of version w? 990 00:48:21,850 --> 00:48:26,350 Because that's going to be true if and only if bv 991 00:48:26,350 --> 00:48:37,195 and ev nest around bw and ew. 992 00:48:39,710 --> 00:48:40,290 OK. 993 00:48:40,290 --> 00:48:41,920 So that's just three tests. 994 00:48:41,920 --> 00:48:43,980 They're probably not all even necessary. 995 00:48:43,980 --> 00:48:45,390 This one always holds. 996 00:48:45,390 --> 00:48:50,670 But if these guys fit in between these guys, then you know-- 997 00:48:50,670 --> 00:48:54,810 now, what this tells us, what we care about here, 998 00:48:54,810 --> 00:48:58,020 is reading fields. 999 00:48:58,020 --> 00:49:00,300 When we read a field, we said, oh, we'll 1000 00:49:00,300 --> 00:49:02,670 apply all the modifications that apply to version 1001 00:49:02,670 --> 00:49:04,510 v. Before that, that was a linear order. 1002 00:49:04,510 --> 00:49:06,900 So it's just all versions less than or equal to v. Now 1003 00:49:06,900 --> 00:49:10,800 it's all versions that are ancestors of v. Given a mod, 1004 00:49:10,800 --> 00:49:13,830 we need to know, does this mod apply to my version? 1005 00:49:13,830 --> 00:49:16,560 And now I tell you, I can do that in constant time 1006 00:49:16,560 --> 00:49:17,610 through magic. 1007 00:49:17,610 --> 00:49:20,340 I just test these order relations. 1008 00:49:20,340 --> 00:49:24,360 If they hold, then that mod applies to my version. 1009 00:49:24,360 --> 00:49:27,360 So w's the version we're testing. 1010 00:49:27,360 --> 00:49:29,430 v is some version in the mod. 1011 00:49:29,430 --> 00:49:32,070 And I want to know, am I a descendant of that version? 1012 00:49:32,070 --> 00:49:34,100 If so, the mod applies. 1013 00:49:34,100 --> 00:49:36,630 And I update what the field is.
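As a rough sketch of how the ancestor test and the field read fit together, here is a toy version in which a plain Python list stands in for the order-maintenance structure. That stand-in is an assumption made for illustration: its inserts and order queries cost O(n), not the constant time the lecture assumes, and the names `VersionOrder` and `read_field` are hypothetical.

```python
class VersionOrder:
    """Toy stand-in for the order-maintenance structure (a plain list)."""

    def __init__(self, root):
        self.items = [("b", root), ("e", root)]

    def add_version(self, parent, child):
        # Insert b_child and e_child, adjacent, right before e_parent.
        i = self.items.index(("e", parent))
        self.items[i:i] = [("b", child), ("e", child)]

    def before(self, x, y):
        # Order query: does item x come earlier in the list than item y?
        return self.items.index(x) < self.items.index(y)

    def is_ancestor(self, v, w):
        # v is an ancestor of w (or w itself) iff b_v ... e_v nests
        # around b_w ... e_w.
        if v == w:
            return True
        return (self.before(("b", v), ("b", w))
                and self.before(("e", w), ("e", v)))


def read_field(order, original, mods, field, w):
    """Value of `field` at version w: the nearest ancestor's mod wins."""
    best, value = None, original[field]
    for v, f, val in mods:            # mods are (version, field, value)
        if f != field or not order.is_ancestor(v, w):
            continue
        # Ancestors of w form a chain, so keep the deeper of the two.
        if best is None or order.is_ancestor(best, v):
            best, value = v, val
    return value


order = VersionOrder("x")
order.add_version("x", "y")
order.add_version("x", "z")
order.add_version("z", "t")           # a later change made on top of z
mods = [("y", "val", 10), ("z", "val", 20)]
print(read_field(order, {"val": 0}, mods, "val", "t"))   # 20
```

Version t sees z's modification but not y's, because only z's begin and end markers nest around t's.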
1014 00:49:36,630 --> 00:49:39,030 I can do all pairwise ancestor checks and figure out, 1015 00:49:39,030 --> 00:49:43,200 what is the most recent version in my ancestor history 1016 00:49:43,200 --> 00:49:44,850 that modified a given field? 1017 00:49:44,850 --> 00:49:47,080 That lets me read a field in constant time. 1018 00:49:47,080 --> 00:49:49,080 Constants are getting kind of big at this point, 1019 00:49:49,080 --> 00:49:50,040 but it can be done. 1020 00:49:53,270 --> 00:49:54,550 Clear? 1021 00:49:54,550 --> 00:49:56,850 A little bit of a black box here. 1022 00:49:56,850 --> 00:50:01,920 But now we've gotten as far as reading. 1023 00:50:01,920 --> 00:50:04,750 And we don't need to change much else. 1024 00:50:04,750 --> 00:50:11,780 So this is good news. 1025 00:50:11,780 --> 00:50:15,280 Maybe I'll give you a bit of a diff. 1026 00:50:15,280 --> 00:50:26,340 So full persistence, fully persistent theorem-- 1027 00:50:26,340 --> 00:50:27,210 done. 1028 00:50:27,210 --> 00:50:27,710 OK. 1029 00:50:27,710 --> 00:50:30,080 Same theorem just with full persistence. 1030 00:50:30,080 --> 00:50:31,230 How do we do it? 1031 00:50:31,230 --> 00:50:35,542 We store back pointers now for all versions. 1032 00:50:35,542 --> 00:50:36,870 It's a little bit annoying. 1033 00:50:36,870 --> 00:50:40,832 But how many mods do we use? 1034 00:50:40,832 --> 00:50:42,540 There's lots of ways to get this to work, 1035 00:50:42,540 --> 00:50:44,890 but I'm going to change this number 1036 00:50:44,890 --> 00:50:51,702 to 2 times (d plus p plus 1). 1037 00:50:51,702 --> 00:50:56,450 Wait, what's d? d is the number of fields here. 1038 00:50:56,450 --> 00:50:57,140 OK. 1039 00:50:57,140 --> 00:50:59,150 We said it was a constant number of fields. 1040 00:50:59,150 --> 00:51:03,680 I never said what that constant is. d for out degree, I guess. 1041 00:51:03,680 --> 00:51:09,289 So p is in degree, max in degree. d is max out degree.
1042 00:51:09,289 --> 00:51:11,330 So just slightly more-- the main reason for this 1043 00:51:11,330 --> 00:51:14,450 is because back pointers now are treated like everyone else. 1044 00:51:14,450 --> 00:51:17,434 We have to treat both the out pointers and the in pointers 1045 00:51:17,434 --> 00:51:18,350 as basically the same. 1046 00:51:18,350 --> 00:51:19,880 So instead of p, we have d plus p. 1047 00:51:19,880 --> 00:51:23,231 And there's a plus 1 just for safety. 1048 00:51:23,231 --> 00:51:28,330 It gets my amortization to work, hopefully. 1049 00:51:28,330 --> 00:51:29,320 OK. 1050 00:51:29,320 --> 00:51:32,430 Not much else-- this page is all the same. 1051 00:51:32,430 --> 00:51:35,830 Mods are still, you give versions, fields, values, 1052 00:51:35,830 --> 00:51:36,620 reading. 1053 00:51:36,620 --> 00:51:41,030 OK, well, this is no longer less than or equal to v. But 1054 00:51:41,030 --> 00:51:47,060 this is now with a version, sort of the nearest version, that's 1055 00:51:47,060 --> 00:51:50,660 an ancestor of v. 1056 00:51:50,660 --> 00:51:52,719 That's what we were just talking about. 1057 00:51:52,719 --> 00:51:54,260 So that can be done in constant time. 1058 00:51:54,260 --> 00:51:57,400 Check it for all of them, constant work. 1059 00:51:57,400 --> 00:51:58,130 OK. 1060 00:51:58,130 --> 00:52:00,490 That was the first part. 1061 00:52:04,710 --> 00:52:07,160 Now we get to the hard part, which is modification. 1062 00:52:07,160 --> 00:52:08,410 This is going to be different. 1063 00:52:08,410 --> 00:52:10,810 Maybe I should just erase-- 1064 00:52:10,810 --> 00:52:13,690 yeah, I think I'll erase everything, 1065 00:52:13,690 --> 00:52:14,930 except the first clause. 1066 00:52:24,270 --> 00:52:24,890 OK. 1067 00:52:24,890 --> 00:52:26,910 If a node is not full, we'll just 1068 00:52:26,910 --> 00:52:28,460 add a mod, just like before. 1069 00:52:28,460 --> 00:52:31,040 What changes is when a node is full.
1070 00:52:36,000 --> 00:52:38,220 Here we have to do something completely different. 1071 00:52:38,220 --> 00:52:38,910 Why? 1072 00:52:38,910 --> 00:52:41,070 Because if we just make a new version 1073 00:52:41,070 --> 00:52:45,050 of this node that has empty mods, this one's still full. 1074 00:52:45,050 --> 00:52:48,810 And I can keep modifying the same version. 1075 00:52:48,810 --> 00:52:52,830 This new node that I just erased represents some new version. 1076 00:52:52,830 --> 00:52:54,720 But if I keep modifying an old version, which 1077 00:52:54,720 --> 00:52:57,760 I can do in full persistence, this node keeps being full. 1078 00:52:57,760 --> 00:53:00,019 And I keep paying potentially huge cost. 1079 00:53:00,019 --> 00:53:02,310 If all the nodes were full, and when I make this change 1080 00:53:02,310 --> 00:53:04,764 every node gets copied, and then I 1081 00:53:04,764 --> 00:53:06,180 make a change to the same version, 1082 00:53:06,180 --> 00:53:07,388 every node gets copied again. 1083 00:53:07,388 --> 00:53:09,860 This is going to take linear time per operation. 1084 00:53:09,860 --> 00:53:11,940 So I can't do the old strategy. 1085 00:53:11,940 --> 00:53:15,304 I need to somehow make this node less full. 1086 00:53:15,304 --> 00:53:17,220 This is where we're definitely not functional. 1087 00:53:17,220 --> 00:53:19,050 None of this was functional, but now I'm 1088 00:53:19,050 --> 00:53:24,240 going to change an old node, not just make a new one in a more 1089 00:53:24,240 --> 00:53:25,860 drastic way. 1090 00:53:25,860 --> 00:53:27,060 Before I was adding a mod. 1091 00:53:27,060 --> 00:53:28,560 That's not a functional operation. 1092 00:53:28,560 --> 00:53:33,870 Now I'm actually going to remove mods from a node to rebalance. 1093 00:53:33,870 --> 00:53:43,050 So what I'd like to do is split the node into two halves. 1094 00:53:43,050 --> 00:53:43,550 OK. 
1095 00:53:43,550 --> 00:53:46,295 So I had some big node that was-- 1096 00:53:46,295 --> 00:53:50,190 I'll draw it-- completely full. 1097 00:53:50,190 --> 00:53:52,890 Now I'm going to make two nodes. 1098 00:53:52,890 --> 00:53:53,820 Here we go. 1099 00:53:59,020 --> 00:54:01,030 This one is going to be half full. 1100 00:54:01,030 --> 00:54:04,930 This one's going to be half full of mods. 1101 00:54:04,930 --> 00:54:05,460 OK. 1102 00:54:05,460 --> 00:54:08,100 The only question left is, what do I do with all these things? 1103 00:54:12,150 --> 00:54:14,700 Basically what I'd like to do is have the-- 1104 00:54:14,700 --> 00:54:18,720 on the one hand, I want to have the old node. 1105 00:54:18,720 --> 00:54:20,550 It's just where it used to be. 1106 00:54:20,550 --> 00:54:23,820 I've just removed half of the mods, the second half, 1107 00:54:23,820 --> 00:54:25,710 the later half. 1108 00:54:25,710 --> 00:54:26,740 What does that mean? 1109 00:54:26,740 --> 00:54:27,420 I don't know. 1110 00:54:27,420 --> 00:54:29,320 Figure it out. 1111 00:54:29,320 --> 00:54:31,050 It's linearized. 1112 00:54:31,050 --> 00:54:32,710 I haven't thought deeply about that. 1113 00:54:32,710 --> 00:54:36,300 Now we're going to make a new node with the second half 1114 00:54:36,300 --> 00:54:36,880 of the mods. 1115 00:54:40,160 --> 00:54:41,640 It's more painful than I thought. 1116 00:54:41,640 --> 00:54:45,180 In reality, these mods represent a tree of modifications. 1117 00:54:45,180 --> 00:54:48,450 And what you need to do is find a partition of that tree 1118 00:54:48,450 --> 00:54:51,000 into two roughly equal halves. 1119 00:54:51,000 --> 00:54:52,870 You can actually do a 1/3, 2/3 split. 1120 00:54:52,870 --> 00:54:57,049 That's also in a future lecture, whose number I forget. 1121 00:54:57,049 --> 00:54:58,590 So really, you're splitting this tree 1122 00:54:58,590 --> 00:55:01,440 into two roughly balanced halves.
1123 00:55:01,440 --> 00:55:03,750 And so this 2 might actually need to change to a 3, 1124 00:55:03,750 --> 00:55:06,330 but it's a constant. 1125 00:55:06,330 --> 00:55:07,590 OK. 1126 00:55:07,590 --> 00:55:09,330 What I want is for this to represent 1127 00:55:09,330 --> 00:55:10,350 a subtree of versions. 1128 00:55:10,350 --> 00:55:11,860 Let me draw the picture. 1129 00:55:11,860 --> 00:55:15,180 So here's a tree of versions represented by the old mods. 1130 00:55:15,180 --> 00:55:18,580 I'd like to cut out a subtree rooted at some node. 1131 00:55:18,580 --> 00:55:21,540 So let's just assume for now this has exactly 1132 00:55:21,540 --> 00:55:22,890 half the nodes. 1133 00:55:22,890 --> 00:55:25,470 And this has half the nodes. 1134 00:55:25,470 --> 00:55:29,180 In reality, I think it can be 1/3, 2/3. 1135 00:55:29,180 --> 00:55:29,680 OK. 1136 00:55:29,680 --> 00:55:32,930 But let's keep it convenient. 1137 00:55:32,930 --> 00:55:34,750 So I want the new node to represent 1138 00:55:34,750 --> 00:55:37,630 this subtree and this node to represent everything else. 1139 00:55:37,630 --> 00:55:41,650 This node is as if this stuff hasn't happened yet. 1140 00:55:41,650 --> 00:55:44,714 I mean, so it represents all these old versions that do not, 1141 00:55:44,714 --> 00:55:45,880 that are not in the subtree. 1142 00:55:45,880 --> 00:55:47,800 This represents all the latest stuff. 1143 00:55:47,800 --> 00:55:49,750 So what I'm going to do is like before, I 1144 00:55:49,750 --> 00:55:54,090 want to apply some mods to these fields. 1145 00:55:54,090 --> 00:55:58,320 And whatever mods were relevant at this point, whatever 1146 00:55:58,320 --> 00:56:02,610 had been applied, I apply those to the fields here. 1147 00:56:02,610 --> 00:56:06,900 And so that means I can remove all of these mods. 1148 00:56:06,900 --> 00:56:09,360 I only cared about these ones. 1149 00:56:09,360 --> 00:56:11,220 Update these fields accordingly.
1150 00:56:11,220 --> 00:56:14,040 I still have the other mods to represent all the other changes 1151 00:56:14,040 --> 00:56:16,030 that could be in that subtree. 1152 00:56:16,030 --> 00:56:16,530 OK. 1153 00:56:16,530 --> 00:56:33,255 So we actually split the tree, and we apply mods to new nodes. 1154 00:56:38,680 --> 00:56:40,050 Anything else I need to say? 1155 00:56:42,542 --> 00:56:44,000 Oh, now we need to update pointers. 1156 00:56:44,000 --> 00:56:45,124 That's always the fun part. 1157 00:56:49,550 --> 00:56:50,530 Let's go over here. 1158 00:57:05,300 --> 00:57:07,490 So old node hasn't moved. 1159 00:57:07,490 --> 00:57:09,150 But this new node has moved. 1160 00:57:09,150 --> 00:57:13,880 So for all of these versions, I want 1161 00:57:13,880 --> 00:57:18,020 to change the pointer that used to point to old node 1162 00:57:18,020 --> 00:57:20,806 so that it now points to new node. 1163 00:57:20,806 --> 00:57:21,930 In this version, it's fine. 1164 00:57:21,930 --> 00:57:23,150 It should still point to old node, 1165 00:57:23,150 --> 00:57:25,280 because this represents all those old versions. 1166 00:57:25,280 --> 00:57:28,170 But for the new version, that version in the subtree, 1167 00:57:28,170 --> 00:57:30,781 I've got to point here instead. 1168 00:57:30,781 --> 00:57:31,280 OK. 1169 00:57:31,280 --> 00:57:37,330 So how many pointers could there be to this node 1170 00:57:37,330 --> 00:57:38,850 that need to change? 1171 00:57:38,850 --> 00:57:41,630 That's a tricky part in this analysis. 1172 00:57:41,630 --> 00:57:45,200 Think about it for a while. 1173 00:57:45,200 --> 00:57:47,200 I mean, in this new node, whatever 1174 00:57:47,200 --> 00:57:50,222 is pointed to by either here or here in the new node also has 1175 00:57:50,222 --> 00:57:50,930 a return pointer. 1176 00:57:50,930 --> 00:57:52,130 All pointers are bidirectional.
1177 00:57:52,130 --> 00:57:54,338 So we don't really care about whether they're forward 1178 00:57:54,338 --> 00:57:54,926 or backward. 1179 00:57:54,926 --> 00:57:56,300 How many pointers are there here? 1180 00:57:56,300 --> 00:57:59,360 Well, there's d here and there's p here. 1181 00:57:59,360 --> 00:58:01,280 But then there's also some additional pointers 1182 00:58:01,280 --> 00:58:02,750 represented over here. 1183 00:58:02,750 --> 00:58:04,100 How many? 1184 00:58:04,100 --> 00:58:06,890 Well, if we assume this magical 50/50 split, 1185 00:58:06,890 --> 00:58:12,920 there's right now d plus p plus 1 mods over here, half of them. 1186 00:58:12,920 --> 00:58:16,100 Each of them might be a pointer to some other place, which 1187 00:58:16,100 --> 00:58:18,715 has a return pointer in that version. 1188 00:58:18,715 --> 00:58:23,690 So number of back pointers that we need to update 1189 00:58:23,690 --> 00:58:27,150 is going to be this, 2 times d plus 2 times p plus 1. 1190 00:58:30,700 --> 00:58:41,850 So recursively update at most 2 times d plus 2 times p 1191 00:58:41,850 --> 00:58:44,840 plus 1 pointers to the node. 1192 00:58:50,270 --> 00:58:52,550 The good news is this is really only half of them 1193 00:58:52,550 --> 00:58:54,850 or some fraction of them. 1194 00:58:54,850 --> 00:58:57,417 It used to be-- 1195 00:58:57,417 --> 00:58:59,000 well, there were more pointers before. 1196 00:58:59,000 --> 00:59:00,290 We don't have to deal with these ones. 1197 00:59:00,290 --> 00:59:01,831 That's where we're saving, and that's 1198 00:59:01,831 --> 00:59:03,622 why this amortization works. 1199 00:59:03,622 --> 00:59:06,080 Let me give you a potential function that makes this work-- 1200 00:59:12,950 --> 00:59:23,760 is minus c times sum of the number of empty mod slots. 1201 00:59:23,760 --> 00:59:26,370 It's kind of the same potential but before 1202 00:59:26,370 --> 00:59:28,530 we had this notion of dead and alive nodes.
1203 00:59:28,530 --> 00:59:30,450 Now everything's alive because everything 1204 00:59:30,450 --> 00:59:31,980 could change at any moment. 1205 00:59:31,980 --> 00:59:36,030 So instead, I'm going to measure how much room I have 1206 00:59:36,030 --> 00:59:37,134 in each node. 1207 00:59:37,134 --> 00:59:38,550 Before I had no room in this node. 1208 00:59:38,550 --> 00:59:41,760 Now I have half the space in both nodes. 1209 00:59:41,760 --> 00:59:44,070 So that's good news. 1210 00:59:44,070 --> 00:59:48,300 Whenever we have this recursion, we 1211 00:59:48,300 --> 00:59:56,630 can charge it to a potential decrease. 1212 00:59:56,630 --> 01:00:01,160 Phi goes down by-- 1213 01:00:01,160 --> 01:00:03,720 because I have a negative sign here-- 1214 01:00:03,720 --> 01:00:13,740 c times, oh man, 2 times (d plus p plus 1), I think. 1215 01:00:13,740 --> 01:00:15,780 Because there's d plus p plus 1 space here, 1216 01:00:15,780 --> 01:00:17,230 d plus p plus 1 space here. 1217 01:00:17,230 --> 01:00:18,990 I mean, we added one whole new node. 1218 01:00:18,990 --> 01:00:20,820 And total capacity of a node in mods 1219 01:00:20,820 --> 01:00:23,610 is 2 times (d plus p plus 1). 1220 01:00:23,610 --> 01:00:26,010 So we get that times c. 1221 01:00:26,010 --> 01:00:28,530 And this is basically just enough, 1222 01:00:28,530 --> 01:00:32,010 because this is 2 times d plus 2 times p plus 2. 1223 01:00:32,010 --> 01:00:34,140 And here we have a plus 1. 1224 01:00:34,140 --> 01:00:39,690 And so the recursion gets annihilated by 2 times d plus 1225 01:00:39,690 --> 01:00:41,280 2 times p plus 1. 1226 01:00:41,280 --> 01:00:43,440 And then there's one c left over to absorb 1227 01:00:43,440 --> 01:00:47,241 whatever constant cost there was to do all this other work. 1228 01:00:47,241 --> 01:00:51,570 So I got the constants just to work, 1229 01:00:51,570 --> 01:00:54,340 except that I cheated and it's really a 1/3, 2/3 split.
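Written out, the accounting just described looks roughly like this. It assumes the idealized 50/50 split and, as a simplifying assumption of this sketch, charges each recursive pointer update an amortized cost of c; with the real 1/3, 2/3 split the constants would change.

```latex
% Potential: every node has 2(d+p+1) mod slots.
\Phi \;=\; -\,c \sum_{\text{nodes}} \bigl(\text{number of empty mod slots}\bigr)

% Splitting a full node: 0 empty slots before, (d+p+1) empty slots in
% each of the two resulting nodes afterwards.
\Delta\Phi \;=\; -\,c \cdot 2(d+p+1) \;=\; -\,c\,(2d+2p+2)

% At most 2d+2p+1 pointers are updated recursively, each charged c:
c\,(2d+2p+1) \;-\; c\,(2d+2p+2) \;=\; -\,c

% The leftover c absorbs the O(1) local work of the split.
```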
1230 01:00:54,340 --> 01:00:57,090 So probably all of these constants have to change, 1231 01:00:57,090 --> 01:00:58,102 such is life. 1232 01:00:58,102 --> 01:01:01,490 But I think you get the idea. 1233 01:01:01,490 --> 01:01:03,900 Any questions about full persistence? 1234 01:01:07,110 --> 01:01:10,200 This is fun stuff, time travel. 1235 01:01:10,200 --> 01:01:11,426 Yeah? 1236 01:01:11,426 --> 01:01:14,630 AUDIENCE: So in the first half of the thing where 1237 01:01:14,630 --> 01:01:16,583 the if, there's room you can put it in. 1238 01:01:16,583 --> 01:01:17,056 ERIK DEMAINE: Right. 1239 01:01:17,056 --> 01:01:17,919 AUDIENCE: I have a question about how 1240 01:01:17,919 --> 01:01:19,421 we represent the version. 1241 01:01:19,421 --> 01:01:23,016 Because before when we said restore now [INAUDIBLE]. 1242 01:01:23,016 --> 01:01:25,920 It made more sense if now was like a timestamp or something. 1243 01:01:25,920 --> 01:01:26,670 ERIK DEMAINE: OK. 1244 01:01:26,670 --> 01:01:31,470 Right, so how do we represent a version even here or anywhere? 1245 01:01:31,470 --> 01:01:34,230 When we do a modification, an update, in the data structure, 1246 01:01:34,230 --> 01:01:36,420 we want to return the new version. 1247 01:01:36,420 --> 01:01:39,810 Basically, we're going to actually store 1248 01:01:39,810 --> 01:01:41,042 the DAG of versions. 1249 01:01:41,042 --> 01:01:43,250 And a version is going to be represented by a pointer 1250 01:01:43,250 --> 01:01:44,400 into this DAG. 1251 01:01:44,400 --> 01:01:47,340 One of the nodes in this DAG becomes a version. 1252 01:01:47,340 --> 01:01:50,400 Every node in this DAG is going to store a pointer 1253 01:01:50,400 --> 01:01:53,640 to the corresponding b character and a corresponding e character 1254 01:01:53,640 --> 01:01:56,460 in this data structure, which then 1255 01:01:56,460 --> 01:01:57,924 lets you do anything you want. 
1256 01:01:57,924 --> 01:01:59,590 Then you can query against that version, 1257 01:01:59,590 --> 01:02:01,690 whether it's an ancestor of another version. 1258 01:02:01,690 --> 01:02:02,981 So yeah, I didn't mention that. 1259 01:02:02,981 --> 01:02:04,230 Versions are nodes in here. 1260 01:02:04,230 --> 01:02:06,647 Nodes in here have pointers to the b's and e's. 1261 01:02:06,647 --> 01:02:08,730 And vice versa, the b's and e's have pointers back 1262 01:02:08,730 --> 01:02:10,731 to the corresponding version node. 1263 01:02:10,731 --> 01:02:12,480 And then you can keep track of everything. 1264 01:02:12,480 --> 01:02:14,790 Good question. 1265 01:02:14,790 --> 01:02:15,390 Yeah? 1266 01:02:15,390 --> 01:02:16,270 AUDIENCE: [INAUDIBLE] question. 1267 01:02:16,270 --> 01:02:17,150 Remind me what d is in this. 1268 01:02:17,150 --> 01:02:19,108 ERIK DEMAINE: Oh, d was the maximum out degree. 1269 01:02:19,108 --> 01:02:26,970 It's the number of fields in a node, as defined right here. 1270 01:02:26,970 --> 01:02:29,701 Other questions? 1271 01:02:29,701 --> 01:02:30,200 Whew. 1272 01:02:30,200 --> 01:02:31,305 OK, a little breather. 1273 01:02:31,305 --> 01:02:33,450 That was partial persistence, full persistence. 1274 01:02:33,450 --> 01:02:36,730 This is, unfortunately, the end of the really good results. 1275 01:02:36,730 --> 01:02:38,650 As long as we have constant degree nodes, 1276 01:02:38,650 --> 01:02:41,320 in and out degree, we can do all. 1277 01:02:41,320 --> 01:02:44,830 We can do full persistence for free. 1278 01:02:44,830 --> 01:02:47,080 Obviously there are practical constants involved here. 1279 01:02:47,080 --> 01:02:53,170 But in theory, you can do this perfectly. 1280 01:02:53,170 --> 01:02:54,830 Before we go on to confluence, there 1281 01:02:54,830 --> 01:02:58,210 is one positive result, which is what if you 1282 01:02:58,210 --> 01:03:00,615 don't like amortized bounds.
1283 01:03:00,615 --> 01:03:02,740 There are various reasons amortized bounds might not 1284 01:03:02,740 --> 01:03:03,070 be good. 1285 01:03:03,070 --> 01:03:04,861 Maybe you really care about every operation 1286 01:03:04,861 --> 01:03:08,740 being no slower than it was except by a constant factor. 1287 01:03:08,740 --> 01:03:11,500 We're amortizing here, so some operations get really slow. 1288 01:03:11,500 --> 01:03:14,110 But the others are all fast to compensate. 1289 01:03:14,110 --> 01:03:19,540 You can deamortize, it's called. 1290 01:03:22,600 --> 01:03:30,280 You can get constant worst case slowdown 1291 01:03:30,280 --> 01:03:31,870 for partial persistence. 1292 01:03:36,770 --> 01:03:44,260 This is a result of Gerth Brodal from the late '90s, '97. 1293 01:03:44,260 --> 01:03:47,149 For full persistence-- so it's an open problem. 1294 01:03:47,149 --> 01:03:48,940 I don't know if people have worked on that. 1295 01:03:55,801 --> 01:03:56,300 All right. 1296 01:03:56,300 --> 01:03:59,515 So some, mostly good results. 1297 01:03:59,515 --> 01:04:01,640 Let's move on to confluent persistence where things 1298 01:04:01,640 --> 01:04:03,606 get a lot more challenging. 1299 01:04:17,511 --> 01:04:20,010 Lots of things go out the window with confluent persistence. 1300 01:04:20,010 --> 01:04:23,520 In particular, your versions are now a DAG. 1301 01:04:23,520 --> 01:04:25,650 It's a lot harder to linearize a DAG. 1302 01:04:25,650 --> 01:04:28,980 Trees are not that far from paths. 1303 01:04:28,980 --> 01:04:33,672 But DAGs are quite far from paths, unfortunately. 1304 01:04:33,672 --> 01:04:35,130 But that's not all that goes wrong. 1305 01:04:44,660 --> 01:04:50,060 Let me first tell you the kind of end effect as a user. 1306 01:04:50,060 --> 01:04:52,060 Imagine you have a data structure. 1307 01:04:54,830 --> 01:04:57,500 Think of it as a list, I guess, which 1308 01:04:57,500 --> 01:04:59,330 is a list of characters in your document.
1309 01:04:59,330 --> 01:05:03,410 You're using vi or Word, your favorite, whatever. 1310 01:05:03,410 --> 01:05:05,060 It's a text editor. 1311 01:05:05,060 --> 01:05:06,680 You've got a string of words. 1312 01:05:06,680 --> 01:05:09,785 And now you like to do things like copy and paste. 1313 01:05:09,785 --> 01:05:11,270 It's a nice operation. 1314 01:05:11,270 --> 01:05:16,340 So you select an interval of the string and you copy it. 1315 01:05:16,340 --> 01:05:18,340 And then you paste it somewhere else. 1316 01:05:18,340 --> 01:05:21,950 So now you've got two copies of that string. 1317 01:05:21,950 --> 01:05:24,050 This is, in some sense, what you might 1318 01:05:24,050 --> 01:05:27,960 call a confluent operation, because-- 1319 01:05:27,960 --> 01:05:30,470 yeah, maybe a cleaner way to think of it is the following. 1320 01:05:30,470 --> 01:05:31,910 You have your string. 1321 01:05:31,910 --> 01:05:33,950 Now I have an operation, which is split it. 1322 01:05:33,950 --> 01:05:35,840 So now I have two strings. 1323 01:05:35,840 --> 01:05:36,340 OK. 1324 01:05:36,340 --> 01:05:38,298 And now I have an operation, which is split it. 1325 01:05:38,298 --> 01:05:40,770 Now I have three strings. 1326 01:05:40,770 --> 01:05:41,270 OK. 1327 01:05:41,270 --> 01:05:44,280 Now I have an operation which is concatenate. 1328 01:05:44,280 --> 01:05:47,330 So I can, for example, reconstruct 1329 01:05:47,330 --> 01:05:49,850 the original string-- actually, I have the original string. 1330 01:05:49,850 --> 01:05:51,940 No biggie. 1331 01:05:51,940 --> 01:05:54,470 Let's say-- because I have all versions. 1332 01:05:54,470 --> 01:05:55,520 I never lose them. 1333 01:05:55,520 --> 01:05:59,090 So now instead, I'm going to cut the string here, let's say. 1334 01:05:59,090 --> 01:06:03,710 So now I have this and this. 1335 01:06:03,710 --> 01:06:06,620 And now I can do things like concatenate 1336 01:06:06,620 --> 01:06:10,010 from here to here to here. 
1337 01:06:10,010 --> 01:06:16,801 And I will get this plus this plus this. 1338 01:06:16,801 --> 01:06:17,300 OK. 1339 01:06:17,300 --> 01:06:18,579 This guy moved here. 1340 01:06:18,579 --> 01:06:20,870 So that's a copy/paste operation with a constant number 1341 01:06:20,870 --> 01:06:22,100 of splits and concatenates. 1342 01:06:22,100 --> 01:06:23,810 I could also do cut and paste. 1343 01:06:23,810 --> 01:06:26,720 With confluence, I can do crazy cuts and pastes 1344 01:06:26,720 --> 01:06:28,950 in all sorts of ways. 1345 01:06:28,950 --> 01:06:29,910 So what? 1346 01:06:29,910 --> 01:06:32,120 Well, the so what is I can actually 1347 01:06:32,120 --> 01:06:33,990 double the size of my data structure 1348 01:06:33,990 --> 01:06:36,050 in a constant number of operations. 1349 01:06:36,050 --> 01:06:38,270 I can take, for example, the entire string 1350 01:06:38,270 --> 01:06:40,031 and concatenate it to itself. 1351 01:06:40,031 --> 01:06:41,780 That will double the number of characters, 1352 01:06:41,780 --> 01:06:43,740 number of elements in there. 1353 01:06:43,740 --> 01:06:45,900 I can do that again and again and again. 1354 01:06:45,900 --> 01:06:51,380 So in u updates, I can potentially 1355 01:06:51,380 --> 01:06:53,000 get a data structure size 2 to the u. 1356 01:06:57,770 --> 01:06:58,610 Kind of nifty. 1357 01:06:58,610 --> 01:07:00,350 I think this is why confluence is cool. 1358 01:07:00,350 --> 01:07:02,700 It's also why it's hard. 1359 01:07:02,700 --> 01:07:03,900 So not a big surprise. 1360 01:07:03,900 --> 01:07:08,130 But, here we go. 1361 01:07:08,130 --> 01:07:13,490 In that case, the version DAG, for reference, looks like this. 1362 01:07:13,490 --> 01:07:16,180 You're taking the same version, combining it. 1363 01:07:16,180 --> 01:07:20,460 So here I'm assuming I have a concatenate operation. 1364 01:07:20,460 --> 01:07:24,240 And so the effect here, every time I do this, 1365 01:07:24,240 --> 01:07:25,140 I double the size. 
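The doubling effect is easy to see with ordinary strings. A small sketch, where the `concat` helper and the starting string are assumptions made for the example:

```python
def concat(a, b):
    """Confluent-style operation: combine two versions into a new one."""
    return a + b


s = "ab"                       # the initial version
sizes = [len(s)]
for _ in range(5):             # u = 5 updates
    s = concat(s, s)           # concatenate the latest version with itself
    sizes.append(len(s))

print(sizes)                   # [2, 4, 8, 16, 32, 64]
```

After u self-concatenations the length is the starting length times 2 to the u, which is why a structure that stores the result explicitly cannot keep every update cheap.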
1366 01:07:44,210 --> 01:07:44,817 All right. 1367 01:07:44,817 --> 01:07:46,900 What do I want to say about confluent persistence? 1368 01:07:46,900 --> 01:07:47,399 All right. 1369 01:07:47,399 --> 01:07:53,200 Let me start with the most general result, which 1370 01:07:53,200 --> 01:08:04,340 is by Fiat and Kaplan in 2003. 1371 01:08:04,340 --> 01:08:08,817 They define a notion called effective depth of a version. 1372 01:08:08,817 --> 01:08:09,900 Let me just write it down. 1373 01:08:21,180 --> 01:08:24,870 It's kind of like if you took this DAG 1374 01:08:24,870 --> 01:08:30,113 and expanded it out to be a tree of all possible paths. 1375 01:08:30,113 --> 01:08:31,529 Instead of point to the same node, 1376 01:08:31,529 --> 01:08:33,330 you could just duplicate that node 1377 01:08:33,330 --> 01:08:35,260 and then have pointers left and right. 1378 01:08:35,260 --> 01:08:35,760 OK. 1379 01:08:35,760 --> 01:08:38,218 So if I did that, of course, this size grows exponentially. 1380 01:08:38,218 --> 01:08:41,310 It explicitly represents the size of my data structure. 1381 01:08:41,310 --> 01:08:42,810 At the bottom, if I have u things, 1382 01:08:42,810 --> 01:08:45,960 I'm going to have 2 to the u leaves at the bottom. 1383 01:08:45,960 --> 01:08:49,080 But then I can easily measure the number of paths 1384 01:08:49,080 --> 01:08:50,500 from the root to the same version. 1385 01:08:50,500 --> 01:08:52,250 At the bottom, I still label it, oh, those 1386 01:08:52,250 --> 01:08:54,630 are all v. They're all the same version down there. 1387 01:08:54,630 --> 01:08:56,664 So exponential number of paths, if I take log, 1388 01:08:56,664 --> 01:08:58,080 I get what I call effective depth. 1389 01:08:58,080 --> 01:09:02,250 It's like if you somehow could rebalance that tree, 1390 01:09:02,250 --> 01:09:05,910 this is the best you could hope to do. 1391 01:09:05,910 --> 01:09:07,270 It's not really a lower bound. 1392 01:09:07,270 --> 01:09:08,040 But it's a number. 
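The notion of effective depth described above can be sketched as a small computation. This is a hedged illustration that follows the lecture's informal description (the log of the number of root-to-version paths in the version DAG); the exact definition in the Fiat and Kaplan paper may differ by additive constants. The DAG encoding (a dict from each version to the list of its parent versions) and the names `effective_depths` and `doubling_dag` are assumptions made for this example.

```python
import math
from functools import lru_cache


def effective_depths(parents, root):
    """Return {version: log2(number of root-to-version paths)}.

    parents maps each version to a list of its parent versions; a parent
    may appear twice when a version is combined with itself. Every version
    is assumed to be reachable from root.
    """
    @lru_cache(maxsize=None)
    def paths(v):
        if v == root:
            return 1
        return sum(paths(p) for p in parents[v])

    return {v: math.log2(paths(v)) for v in parents}


def doubling_dag(u):
    """Version i is built by combining version i - 1 with itself."""
    parents = {0: []}
    for i in range(1, u + 1):
        parents[i] = [i - 1, i - 1]
    return parents


depths = effective_depths(doubling_dag(5), root=0)
print(depths[5])
```

In the doubling DAG the number of paths to version u is 2^u, so the value computed here is linear in u, whereas a plain chain of versions has exactly one path to each version.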
1393 01:09:08,040 --> 01:09:09,000 It's a thing. 1394 01:09:09,000 --> 01:09:10,470 OK. 1395 01:09:10,470 --> 01:09:17,370 Then the result they achieve is that the overhead is 1396 01:09:17,370 --> 01:09:19,790 log the number of updates plus-- this 1397 01:09:19,790 --> 01:09:22,290 is a multiplicative overhead, so you take your running time. 1398 01:09:22,290 --> 01:09:25,979 You multiply it by this. 1399 01:09:25,979 --> 01:09:28,649 And this is a time and a space overhead. 1400 01:09:31,529 --> 01:09:34,260 So maximum effective depth of all versions, maybe even 1401 01:09:34,260 --> 01:09:39,100 sum of effective depths, but we'll just say max to be safe. 1402 01:09:39,100 --> 01:09:41,800 Sorry-- sum over all the operations. 1403 01:09:41,800 --> 01:09:43,129 This is per operation. 1404 01:09:43,129 --> 01:09:44,670 You pay basically the effective depth 1405 01:09:44,670 --> 01:09:48,779 of that operation as a factor. 1406 01:09:48,779 --> 01:09:51,330 Now, the annoying thing is if you have this kind of set up 1407 01:09:51,330 --> 01:09:54,720 where the size grew exponentially, 1408 01:09:54,720 --> 01:09:56,490 then number of paths is exponential. 1409 01:09:56,490 --> 01:09:59,220 Log of the number of paths is linear in u. 1410 01:09:59,220 --> 01:10:06,420 And so this factor could be as much as u, linear slowdown. 1411 01:10:06,420 --> 01:10:08,820 Now, Fiat and Kaplan argue linear slowdown is not 1412 01:10:08,820 --> 01:10:13,440 that bad, because if you weren't even persistent, if you did 1413 01:10:13,440 --> 01:10:18,410 this in the naive way of just recopying the data, 1414 01:10:18,410 --> 01:10:21,579 you were actually spending exponential time to build 1415 01:10:21,579 --> 01:10:22,620 the final data structure. 1416 01:10:22,620 --> 01:10:23,619 It has exponential size. 
1417 01:10:23,619 --> 01:10:26,800 Just to represent it explicitly requires exponential time, 1418 01:10:26,800 --> 01:10:29,820 so losing a linear factor to do u operations 1419 01:10:29,820 --> 01:10:31,800 is now u squared time instead of 2 to the u. 1420 01:10:31,800 --> 01:10:35,190 So it's a big improvement to do this. 1421 01:10:35,190 --> 01:10:40,440 The downside of this approach is that even if you have a version 1422 01:10:40,440 --> 01:10:43,410 DAG that looks like this, even if the size of the data 1423 01:10:43,410 --> 01:10:46,402 structure is staying normal, staying linear, so 1424 01:10:46,402 --> 01:10:48,360 this potential, you could be doubling the size. 1425 01:10:48,360 --> 01:10:49,920 But we don't know what this merge operation is. 1426 01:10:49,920 --> 01:10:51,794 Maybe it just throws away one of the versions 1427 01:10:51,794 --> 01:10:53,220 or does something-- 1428 01:10:53,220 --> 01:10:55,230 somehow takes half the nodes from one 1429 01:10:55,230 --> 01:10:57,188 side, half the nodes from the other side maybe. 1430 01:10:57,188 --> 01:10:58,830 These operations do preserve size. 1431 01:10:58,830 --> 01:11:02,520 Then there's no great reason why it should be a linear slowdown, 1432 01:11:02,520 --> 01:11:03,671 but it is. 1433 01:11:03,671 --> 01:11:04,170 OK? 1434 01:11:04,170 --> 01:11:07,650 So it's all right but not great. 1435 01:11:10,830 --> 01:11:13,560 And it's the best general result we know. 1436 01:11:13,560 --> 01:11:15,540 They also prove a lower bound. 1437 01:11:21,420 --> 01:11:30,345 So lower bound is sum of effective depths, total bits of space. 1438 01:11:37,230 --> 01:11:37,730 OK. 1439 01:11:37,730 --> 01:11:40,020 What does this mean? 1440 01:11:40,020 --> 01:11:42,170 So even if this is not happening, 1441 01:11:42,170 --> 01:11:44,150 the number of bits of space you need 1442 01:11:44,150 --> 01:11:45,800 in the worst case-- this does not 1443 01:11:45,800 --> 01:11:47,810 apply to every data structure.
1444 01:11:47,810 --> 01:11:49,790 That's one catch. 1445 01:11:49,790 --> 01:11:52,070 They give a specific data structure 1446 01:11:52,070 --> 01:11:53,630 where you need this much space. 1447 01:11:53,630 --> 01:11:57,050 So it's similar to this kind of picture. 1448 01:11:57,050 --> 01:11:58,940 We'll go into the details. 1449 01:11:58,940 --> 01:12:00,860 And you need this much space. 1450 01:12:00,860 --> 01:12:02,720 Now, this is kind of bad, because if there's 1451 01:12:02,720 --> 01:12:06,440 u operations, and each of these is u, that's u squared space. 1452 01:12:06,440 --> 01:12:09,395 So we actually need a factor u blow up in space. 1453 01:12:09,395 --> 01:12:11,290 It looks like. 1454 01:12:11,290 --> 01:12:14,150 But to be more precise, what this means is 1455 01:12:14,150 --> 01:12:17,270 that you need omega e of v space, and therefore 1456 01:12:17,270 --> 01:12:27,830 time overhead per update, if-- 1457 01:12:27,830 --> 01:12:29,570 this is not written in the paper-- 1458 01:12:29,570 --> 01:12:30,560 queries are free. 1459 01:12:35,300 --> 01:12:40,400 Implicit here, they just want to slow down and increase space 1460 01:12:40,400 --> 01:12:43,310 for the updates you do, which is pretty natural. 1461 01:12:43,310 --> 01:12:46,870 Normally you think of queries as not increasing space. 1462 01:12:46,870 --> 01:12:49,600 But in order to construct this lower bound, 1463 01:12:49,600 --> 01:12:52,360 they actually do this many queries. 1464 01:12:52,360 --> 01:12:55,900 So they do e of v queries and then one update. 1465 01:12:55,900 --> 01:12:59,410 And they say, oh well, space had to go up by an extra e of v. 1466 01:12:59,410 --> 01:13:02,470 So if you only charge updates for the space, 1467 01:13:02,470 --> 01:13:04,120 then yes, you have to lose potentially 1468 01:13:04,120 --> 01:13:07,780 a linear factor, this effective depth, potentially u.
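For reference, the Fiat and Kaplan bounds stated verbally above can be written out as follows. This is a summary of what the lecture says, not a quotation from the paper; here u is the number of updates, e(v) is the effective depth of version v, and the symbol P(v) for the number of root-to-v paths in the version DAG is introduced only for this note.

```latex
% Effective depth, as described in the lecture (up to additive constants):
e(v) \approx \log P(v)

% Upper bound: multiplicative time and space overhead per operation
O\bigl(\log u + \max_{v} e(v)\bigr)

% Doubling example: P(v) = 2^{u}, so e(v) = \Theta(u); u operations then
% cost about u \cdot u = u^{2}, versus 2^{u} for explicit recopying.

% Lower bound on total space, in bits, for a specific data structure:
\Omega\Bigl(\sum_{v} e(v)\Bigr)
% that is, \Omega(e(v)) space per update when queries are not charged.
```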
1469 01:13:07,780 --> 01:13:09,550 But if you also charge the queries, 1470 01:13:09,550 --> 01:13:13,270 it's still constant in their example. 1471 01:13:13,270 --> 01:13:18,100 So open question, for confluent persistence, 1472 01:13:18,100 --> 01:13:21,130 can you achieve constant everything? 1473 01:13:21,130 --> 01:13:27,160 Constant time and space overheads, 1474 01:13:27,160 --> 01:13:33,610 multiplicative factor per operation, 1475 01:13:33,610 --> 01:13:35,425 both updates and queries. 1476 01:13:35,425 --> 01:13:37,300 So if you charge the queries, potentially you 1477 01:13:37,300 --> 01:13:38,980 could get constant everything. 1478 01:13:38,980 --> 01:13:41,040 This is a relatively new realization. 1479 01:13:43,890 --> 01:13:47,325 And no one knows how to do this yet. 1480 01:13:47,325 --> 01:13:47,950 Nice challenge. 1481 01:13:47,950 --> 01:13:50,530 I think maybe we'll work on that in our first problem session. 1482 01:13:50,530 --> 01:13:51,196 I would like to. 1483 01:13:53,600 --> 01:13:54,770 Questions about that result? 1484 01:13:54,770 --> 01:13:56,450 I'm not going to prove the result. 1485 01:13:56,450 --> 01:13:59,540 But it is a fancy rebalancing of those kinds 1486 01:13:59,540 --> 01:14:02,300 of pictures to get this log. 1487 01:14:10,266 --> 01:14:12,390 There are other results I'd like to tell you about. 1488 01:14:32,630 --> 01:14:34,710 So brand new result-- 1489 01:14:34,710 --> 01:14:35,980 that was from 2003. 1490 01:14:35,980 --> 01:14:38,300 This is from 2012-- 1491 01:14:38,300 --> 01:14:42,590 no, '11, '11, sorry. 1492 01:14:42,590 --> 01:14:47,480 It's SODA, which is in January, so it's a little confusing. 1493 01:14:47,480 --> 01:14:49,250 Is it '11? 1494 01:14:49,250 --> 01:14:50,267 Maybe '12. 1495 01:14:50,267 --> 01:14:51,350 Actually now I'm not sure. 1496 01:14:51,350 --> 01:14:54,750 It's February already, right? 1497 01:14:54,750 --> 01:14:56,870 A January, either this year or last year.
1498 01:15:00,310 --> 01:15:02,840 It's not as general a transformation. 1499 01:15:02,840 --> 01:15:05,330 It's only going to hold in what's called a disjoint case. 1500 01:15:05,330 --> 01:15:07,820 But it gets a very good bound-- 1501 01:15:07,820 --> 01:15:09,850 not quite constant, but logarithmic. 1502 01:15:09,850 --> 01:15:12,420 OK, logarithmic would also be nice. 1503 01:15:12,420 --> 01:15:17,075 Or log, log n, whatever n is. 1504 01:15:17,075 --> 01:15:22,450 Pick your favorite n, number of operations, say. 1505 01:15:22,450 --> 01:15:22,950 OK. 1506 01:15:25,700 --> 01:15:39,830 If you assume that confluent operations are performed only 1507 01:15:39,830 --> 01:15:46,070 on two versions with no shared nodes-- 1508 01:15:50,360 --> 01:15:53,870 OK, this would be a way to forbid this kind of behavior 1509 01:15:53,870 --> 01:15:56,660 where I concatenate the data structure with itself. 1510 01:15:56,660 --> 01:15:58,520 All the nodes are common. 1511 01:15:58,520 --> 01:16:01,840 If I guarantee that maybe I, you know, slice this up, slice it, 1512 01:16:01,840 --> 01:16:03,590 dice it, wherever, and then re-merge them 1513 01:16:03,590 --> 01:16:06,230 in some other order, but I never use two copies 1514 01:16:06,230 --> 01:16:10,130 of the same piece, that would be a valid confluent 1515 01:16:10,130 --> 01:16:12,260 operation over here. 1516 01:16:12,260 --> 01:16:13,880 This is quite a strong restriction 1517 01:16:13,880 --> 01:16:16,580 that you're not allowed. 1518 01:16:16,580 --> 01:16:19,030 If you try to, who knows what happens. 1519 01:16:19,030 --> 01:16:19,980 Behavior's undefined. 1520 01:16:19,980 --> 01:16:21,830 It won't tell you, oh, those two versions 1521 01:16:21,830 --> 01:16:22,871 have this node in common. 1522 01:16:22,871 --> 01:16:24,600 You've got to make a second copy of it. 1523 01:16:24,600 --> 01:16:27,099 So somehow you have to guarantee that confluent operations 1524 01:16:27,099 --> 01:16:29,270 never overlap.
1525 01:16:29,270 --> 01:16:30,757 But they can be reordered. 1526 01:16:33,740 --> 01:16:39,500 Then you can get order log n overhead. 1527 01:16:39,500 --> 01:16:40,850 n is the number of operations. 1528 01:16:45,390 --> 01:16:46,970 I have a sketch of a proof of this 1529 01:16:46,970 --> 01:16:48,870 but not very much time to talk about it. 1530 01:16:48,870 --> 01:16:49,370 All right. 1531 01:16:49,370 --> 01:16:51,570 Let me give you a quick picture. 1532 01:16:51,570 --> 01:16:55,790 In general, the versions form a DAG. 1533 01:16:55,790 --> 01:17:00,950 But if you make this assumption, and you look at a single node, 1534 01:17:00,950 --> 01:17:03,620 and look at all the versions where that node appears, 1535 01:17:03,620 --> 01:17:05,210 that is a tree. 1536 01:17:05,210 --> 01:17:07,370 Because you're not allowed to remerge versions 1537 01:17:07,370 --> 01:17:08,720 that have the same node. 1538 01:17:08,720 --> 01:17:11,480 So while the big picture is a DAG, 1539 01:17:11,480 --> 01:17:15,090 the small picture of a single guy is some tree. 1540 01:17:17,504 --> 01:17:18,920 I'm drawing all these wiggly lines 1541 01:17:18,920 --> 01:17:20,000 because there are all these versions where 1542 01:17:20,000 --> 01:17:21,560 the node isn't changing. 1543 01:17:21,560 --> 01:17:23,300 This is the entire version DAG. 1544 01:17:23,300 --> 01:17:26,540 And then some of these nodes-- 1545 01:17:26,540 --> 01:17:29,000 some of these versions, I should say-- 1546 01:17:29,000 --> 01:17:31,925 that node that we're thinking about changes. 1547 01:17:31,925 --> 01:17:33,860 OK, whenever it branches, it's probably 1548 01:17:33,860 --> 01:17:36,410 because the actual node changed, maybe. 1549 01:17:36,410 --> 01:17:37,470 I don't know. 1550 01:17:37,470 --> 01:17:40,170 Anyway there are some dots here where the version changed, 1551 01:17:40,170 --> 01:17:41,960 some of the leaves, maybe, that changed. 1552 01:17:41,960 --> 01:17:44,420 Maybe some of them haven't yet. 
1553 01:17:44,420 --> 01:17:48,350 In fact, let's see. 1554 01:17:48,350 --> 01:17:51,170 Here where it's changed, it could be that we destroyed the node. 1555 01:17:51,170 --> 01:17:54,560 Maybe it's gone from the actual data structure. 1556 01:17:54,560 --> 01:17:56,542 But there still may be versions down here. 1557 01:17:56,542 --> 01:17:57,500 It's not really a tree. 1558 01:17:57,500 --> 01:17:59,480 It's a whole DAG of stuff down there. 1559 01:17:59,480 --> 01:18:01,400 So that's kind of ugly. 1560 01:18:01,400 --> 01:18:03,080 Whenever the node still exists, 1561 01:18:03,080 --> 01:18:05,300 I guess that is an actual leaf of the DAG. 1562 01:18:05,300 --> 01:18:06,650 So those are OK. 1563 01:18:06,650 --> 01:18:08,870 But as soon as I maybe delete that node, 1564 01:18:08,870 --> 01:18:11,610 then there can be a whole subtree down there. 1565 01:18:11,610 --> 01:18:12,110 OK. 1566 01:18:12,110 --> 01:18:15,120 So now if you look at an arbitrary version, 1567 01:18:15,120 --> 01:18:17,580 so what we're thinking about is how to implement reading, 1568 01:18:17,580 --> 01:18:18,080 let's say. 1569 01:18:18,080 --> 01:18:21,110 Reading and writing are more or less the same. 1570 01:18:21,110 --> 01:18:22,280 I give you a version. 1571 01:18:22,280 --> 01:18:23,720 I give you a node, and I give you a field. 1572 01:18:23,720 --> 01:18:26,180 I want to know, what is the value of that field, that node, 1573 01:18:26,180 --> 01:18:27,810 that version? 1574 01:18:27,810 --> 01:18:30,014 So now where could a version fall? 1575 01:18:30,014 --> 01:18:31,430 Well it has to be in this subtree. 1576 01:18:31,430 --> 01:18:33,990 Because the node has to exist. 1577 01:18:36,950 --> 01:18:38,390 And then it's maybe a pointer. 1578 01:18:38,390 --> 01:18:42,830 A pointer could be to another node, which 1579 01:18:42,830 --> 01:18:44,540 also has this kind of picture. 1580 01:18:44,540 --> 01:18:46,460 They could be overlapping trees.
1581 01:18:46,460 --> 01:18:48,140 In general, there are three cases. 1582 01:18:48,140 --> 01:18:51,110 Either you're lucky, and the version you're talking about 1583 01:18:51,110 --> 01:18:53,960 is a version where the node was changed. 1584 01:18:53,960 --> 01:18:58,470 In that case, the data is just stored right there. 1585 01:18:58,470 --> 01:18:59,125 That's easy. 1586 01:18:59,125 --> 01:19:01,250 So you could just say, oh, how did the node change? 1587 01:19:01,250 --> 01:19:02,630 Oh, that's what the field is. 1588 01:19:02,630 --> 01:19:05,190 OK, follow the pointer. 1589 01:19:05,190 --> 01:19:08,210 A slightly harder case it's a version 1590 01:19:08,210 --> 01:19:09,590 in between two such changes. 1591 01:19:09,590 --> 01:19:11,660 And maybe these are not updates. 1592 01:19:11,660 --> 01:19:17,330 So I sort of want to know, what was the previous version where 1593 01:19:17,330 --> 01:19:21,650 this node changed in constant time? 1594 01:19:21,650 --> 01:19:22,620 It can be done. 1595 01:19:22,620 --> 01:19:25,160 Not constant time, actually, logarithmic time, 1596 01:19:25,160 --> 01:19:28,120 using a data structure called link-cut trees, 1597 01:19:28,120 --> 01:19:31,010 another fun black box for now, which 1598 01:19:31,010 --> 01:19:36,171 we will cover in lecture 19, far in the future. 1599 01:19:36,171 --> 01:19:36,670 OK. 1600 01:19:39,884 --> 01:19:40,800 Well, that's one case. 1601 01:19:40,800 --> 01:19:43,190 There's also the version where maybe a version 1602 01:19:43,190 --> 01:19:45,110 is down here in a subtree. 1603 01:19:45,110 --> 01:19:48,340 I guess then the node didn't exist. 1604 01:19:48,340 --> 01:19:50,490 Well, all these things can happen. 1605 01:19:50,490 --> 01:19:51,800 And that's even harder. 1606 01:19:51,800 --> 01:19:53,360 It's messy. 
1607 01:19:53,360 --> 01:19:59,720 They use another trick, which is called fractional cascading, 1608 01:19:59,720 --> 01:20:02,240 which I'm not even going to try to describe what it means. 1609 01:20:02,240 --> 01:20:04,190 But it's got a very cool name. 1610 01:20:04,190 --> 01:20:06,080 Because we'll be covering it in lecture 3. 1611 01:20:06,080 --> 01:20:07,284 So stay tuned for that. 1612 01:20:07,284 --> 01:20:09,450 I'm not going to say how it applies to this setting, 1613 01:20:09,450 --> 01:20:13,330 but it's a necessary step in here. 1614 01:20:13,330 --> 01:20:15,364 In the remaining zero minutes, let 1615 01:20:15,364 --> 01:20:17,780 me tell you a little bit about functional data structures. 1616 01:20:17,780 --> 01:20:20,005 [LAUGHTER] 1617 01:20:20,900 --> 01:20:21,910 Beauty of time travel. 1618 01:20:24,830 --> 01:20:31,130 Functional-- I just want to give you 1619 01:20:31,130 --> 01:20:33,590 some examples of things that can be done functionally. 1620 01:20:33,590 --> 01:20:35,798 There's a whole book about functional data structures 1621 01:20:35,798 --> 01:20:36,700 by Okasaki. 1622 01:20:36,700 --> 01:20:38,180 It's pretty cool. 1623 01:20:38,180 --> 01:20:42,320 A simple example is balanced BSTs. 1624 01:20:42,320 --> 01:20:44,600 So if you just want to get log n time for everything, 1625 01:20:44,600 --> 01:20:45,890 you can do that functionally. 1626 01:20:45,890 --> 01:20:46,890 It's actually really easy. 1627 01:20:46,890 --> 01:20:48,920 You pick your favorite balanced BST, like red-black trees. 1628 01:20:48,920 --> 01:20:51,260 You implement it top down so you never follow parent pointers. 1629 01:20:51,260 --> 01:20:52,710 So you don't need parent pointers. 1630 01:20:52,710 --> 01:20:57,710 So then as you make changes down the tree, you just copy. 1631 01:20:57,710 --> 01:20:58,995 It's called path copying.
1632 01:20:58,995 --> 01:21:00,620 Whenever you're about to make a change, 1633 01:21:00,620 --> 01:21:02,070 make a copy of that node. 1634 01:21:02,070 --> 01:21:05,450 So you end up copying all the changed nodes and all 1635 01:21:05,450 --> 01:21:06,200 their ancestors. 1636 01:21:06,200 --> 01:21:09,650 There's only log n of them, so it takes log n time. 1637 01:21:09,650 --> 01:21:10,500 Clear? 1638 01:21:10,500 --> 01:21:11,500 Easy. 1639 01:21:11,500 --> 01:21:12,620 It's a nice technique. 1640 01:21:12,620 --> 01:21:14,600 Sometimes path copying is very useful. 1641 01:21:14,600 --> 01:21:16,170 Like link-cut trees, for example, 1642 01:21:16,170 --> 01:21:17,450 can be made functional. 1643 01:21:17,450 --> 01:21:19,905 We don't know what they are, but they're basically a BST. 1644 01:21:19,905 --> 01:21:21,280 And you can make them functional. 1645 01:21:21,280 --> 01:21:23,291 We use that in a paper. 1646 01:21:23,291 --> 01:21:23,790 All right. 1647 01:21:23,790 --> 01:21:25,970 Deques. 1648 01:21:25,970 --> 01:21:27,800 These are doubly ended queues. 1649 01:21:27,800 --> 01:21:29,990 So it's like a stack and a queue and everything. 1650 01:21:29,990 --> 01:21:32,900 You can insert and delete from the beginning and the end. 1651 01:21:32,900 --> 01:21:34,730 People start to know what these are now, 1652 01:21:34,730 --> 01:21:35,980 because Python calls them that. 1653 01:21:35,980 --> 01:21:41,090 But you can also do concatenation 1654 01:21:41,090 --> 01:21:43,310 with deques in constant time per operation. 1655 01:21:43,310 --> 01:21:44,150 This is cool. 1656 01:21:44,150 --> 01:21:46,220 Deques are not very hard to make functional. 1657 01:21:46,220 --> 01:21:48,500 But you can do deques and you can concatenate them 1658 01:21:48,500 --> 01:21:51,980 like we were doing in the figure that's right behind this board. 1659 01:21:51,980 --> 01:21:53,690 Constant time split is a little harder.
1660 01:21:53,690 --> 01:21:56,870 That's actually one of my open problems. 1661 01:21:56,870 --> 01:22:01,580 Can you do lists with split and concatenate in constant time-- 1662 01:22:01,580 --> 01:22:05,840 functionally or confluently, persistently, or whatever? 1663 01:22:05,840 --> 01:22:08,580 Another example-- oh, you can do a mix of the two. 1664 01:22:08,580 --> 01:22:12,390 You can get log n search in constant time deque operations, 1665 01:22:12,390 --> 01:22:14,870 is you can do tries. 1666 01:22:14,870 --> 01:22:17,900 So a trie is a tree with a fixed topology. 1667 01:22:17,900 --> 01:22:20,010 Think of it as a directory tree. 1668 01:22:20,010 --> 01:22:21,530 So maybe you're using Subversion. 1669 01:22:21,530 --> 01:22:23,120 Subversion has time travel operations. 1670 01:22:23,120 --> 01:22:26,240 You can copy an entire subtree from one version 1671 01:22:26,240 --> 01:22:30,620 and stick it into a new version, another version. 1672 01:22:30,620 --> 01:22:32,685 So you get a version DAG. 1673 01:22:32,685 --> 01:22:34,850 It's a confluently persistent data structure-- 1674 01:22:34,850 --> 01:22:37,610 not implemented optimally, because we don't necessarily 1675 01:22:37,610 --> 01:22:38,240 know how. 1676 01:22:38,240 --> 01:22:40,200 But there is one paper. 1677 01:22:40,200 --> 01:22:43,910 This actually came from the open problem section of this class 1678 01:22:43,910 --> 01:22:45,590 four years ago, I think. 1679 01:22:45,590 --> 01:22:49,520 It's with Eric Price and Stefan Langerman. 1680 01:22:49,520 --> 01:22:50,929 You can get very good results. 1681 01:22:50,929 --> 01:22:52,970 I won't write them down because it takes a while.
1682 01:22:52,970 --> 01:22:56,450 Basically log the degree of the nodes factor 1683 01:22:56,450 --> 01:22:59,690 and get functional, and you can be even fancier 1684 01:22:59,690 --> 01:23:02,480 and get slightly better bounds like log log the degree 1685 01:23:02,480 --> 01:23:05,370 and get confluently persistent with various tricks, 1686 01:23:05,370 --> 01:23:07,530 including using all of these data structures. 1687 01:23:07,530 --> 01:23:09,800 So if you want to implement Subversion optimally, 1688 01:23:09,800 --> 01:23:14,390 that is known how to be done but hasn't actually been done yet. 1689 01:23:14,390 --> 01:23:18,110 Because there are those pesky constant factors. 1690 01:23:18,110 --> 01:23:19,670 I think that's all. 1691 01:23:19,670 --> 01:23:23,030 What is known about functional is there's a log n separation. 1692 01:23:23,030 --> 01:23:26,890 You can be log n away from the best. 1693 01:23:26,890 --> 01:23:30,230 That's the worst separation known, 1694 01:23:30,230 --> 01:23:33,012 between functional and just a regular old data structure. 1695 01:23:33,012 --> 01:23:34,220 It'd be nice to improve that. 1696 01:23:34,220 --> 01:23:35,345 Lots of open problems here. 1697 01:23:35,345 --> 01:23:38,140 Maybe we'll work on them next time.