1 00:00:08,000 --> 00:00:14,000 Good morning. Today we're going to talk about 2 00:00:14,000 --> 00:00:18,000 augmenting data structures. 3 00:00:18,000 --> 00:00:27,000 That one is 23 and that is 23. 4 00:00:27,000 --> 00:00:33,000 And this is a -- Normally, rather than designing 5 00:00:33,000 --> 00:00:37,000 data structures from scratch, you tend to take existing data 6 00:00:37,000 --> 00:00:40,000 structures and build your functionality into them. 7 00:00:40,000 --> 00:00:44,000 And that is a process we call data-structure augmentation. 8 00:00:44,000 --> 00:00:48,000 And this also today marks sort of the start of the design phase 9 00:00:48,000 --> 00:00:51,000 of the class. We spent a lot of time doing 10 00:00:51,000 --> 00:00:54,000 analysis up to this point. And now we're still going to 11 00:00:54,000 --> 00:00:58,000 learn some new analytical techniques. 12 00:00:58,000 --> 00:01:01,000 But we're going to start turning our focus more toward 13 00:01:01,000 --> 00:01:05,000 how is it that you design efficient data structures, 14 00:01:05,000 --> 00:01:08,000 efficient algorithms for various problems? 15 00:01:08,000 --> 00:01:11,000 So this is a good example of the design phase. 16 00:01:11,000 --> 00:01:14,000 It is a really good idea, at this point, 17 00:01:14,000 --> 00:01:18,000 if you have not done so, to review the textbook Appendix 18 00:01:18,000 --> 00:01:20,000 B. You should take that as 19 00:01:20,000 --> 00:01:24,000 additional reading to make sure that you are familiar, 20 00:01:24,000 --> 00:01:29,000 because over the next few weeks we're going to hit almost every 21 00:01:29,000 --> 00:01:33,000 topic in Appendix B. It is going to be brought to 22 00:01:33,000 --> 00:01:37,000 bear on the subjects that we're talking about. 
23 00:01:37,000 --> 00:01:41,000 If you're going to go scramble to learn that while you're also 24 00:01:41,000 --> 00:01:45,000 trying to learn the material, it will be more onerous than if 25 00:01:45,000 --> 00:01:48,000 you just simply review the material now. 26 00:01:48,000 --> 00:01:52,000 We're going to start with an illustration of the problem of 27 00:01:52,000 --> 00:01:55,000 dynamic order statistics. 28 00:02:00,000 --> 00:02:03,000 We are familiar with finding things like the median or the 29 00:02:03,000 --> 00:02:08,000 kth order statistic or whatever. Now we want to do the same 30 00:02:08,000 --> 00:02:11,000 thing but we want to do it with a dynamic set. 31 00:02:11,000 --> 00:02:14,000 Rather than being given all the data upfront, 32 00:02:14,000 --> 00:02:18,000 we're going to have a set. And then at some point somebody 33 00:02:18,000 --> 00:02:21,000 is going to be doing typically insert and delete. 34 00:02:21,000 --> 00:02:24,000 And at some point somebody is going to say OK, 35 00:02:24,000 --> 00:02:30,000 select for me the ith largest guy or the ith smallest guy -- 36 00:02:41,000 --> 00:02:58,000 -- in the dynamic set. Or, something like OS-Rank of 37 00:02:58,000 --> 00:03:05,000 x. The rank of x in the sorted 38 00:03:05,000 --> 00:03:09,000 order of the set. 39 00:03:14,000 --> 00:03:16,000 So either I want to just say, for example, 40 00:03:16,000 --> 00:03:19,000 if I gave n over 2, if I had n elements in the set 41 00:03:19,000 --> 00:03:22,000 and I said n over 2, I am asking for the median. 42 00:03:22,000 --> 00:03:25,000 I could be asking for the mean. I could be asking for quartile. 43 00:03:25,000 --> 00:03:29,000 Here I take an element and say, OK, so where does that element 44 00:03:29,000 --> 00:03:33,000 fall among all of the other elements in the set? 
45 00:03:33,000 --> 00:03:37,000 And, in addition, these are dynamic sets so I 46 00:03:37,000 --> 00:03:45,000 want to be able to do insert and delete, I want to be able to add 47 00:03:45,000 --> 00:03:50,000 and remove elements. The solution we are going to 48 00:03:50,000 --> 00:03:56,000 look at for this one, the basic idea is to keep the 49 00:03:56,000 --> 00:04:03,000 sizes of subtrees in the nodes of a red-black tree. 50 00:04:08,000 --> 00:04:12,000 Let me draw a picture as an example. 51 00:04:30,000 --> 00:04:32,000 In this tree -- 52 00:04:37,000 --> 00:04:39,000 I didn't draw the NILs for this. 53 00:04:39,000 --> 00:04:44,000 I am going to keep two values. I am going to keep the key. 54 00:04:44,000 --> 00:04:48,000 And so for the keys, what I will do is just use 55 00:04:48,000 --> 00:04:51,000 letters of the alphabet. 56 00:05:06,000 --> 00:05:11,000 And this is a red-black tree. Just for practice, 57 00:05:11,000 --> 00:05:16,000 how can I label this tree so it's a red-black tree? 58 00:05:16,000 --> 00:05:21,000 I haven't shown the NILs. Remember the NILs are all 59 00:05:21,000 --> 00:05:24,000 black. How can I label this, 60 00:05:24,000 --> 00:05:29,000 red and black? Make sure it is a red-black 61 00:05:29,000 --> 00:05:33,000 tree. Not every tree can be labeled 62 00:05:33,000 --> 00:05:36,000 as a red-black tree, right? 63 00:05:36,000 --> 00:05:42,000 This is good practice because this sort of thing shows up on 64 00:05:42,000 --> 00:05:45,000 quizzes. Make F red, good, 65 00:05:45,000 --> 00:05:51,000 and everything else black, that is certainly a solution. 66 00:05:51,000 --> 00:05:57,000 Because then that basically brings the level of this guy up 67 00:05:57,000 --> 00:06:01,000 to here. Actually, I had a more 68 00:06:01,000 --> 00:06:06,000 complicated one because it seemed like more fun. 
69 00:06:06,000 --> 00:06:12,000 What I did was I made this guy black and then these two guys 70 00:06:12,000 --> 00:06:16,000 red and black and red, black and red, 71 00:06:16,000 --> 00:06:21,000 black and black. But your solution is perfectly 72 00:06:21,000 --> 00:06:25,000 good as well. So we don't have any two reds 73 00:06:25,000 --> 00:06:31,000 in a row on any path. And all the black height from 74 00:06:31,000 --> 00:06:36,000 any particular point going down we get the same number of blacks 75 00:06:36,000 --> 00:06:38,000 whichever way we go. Good. 76 00:06:38,000 --> 00:06:42,000 The idea here now is that, we're going to keep the subtree 77 00:06:42,000 --> 00:06:47,000 sizes, these are the keys that are stored in our dynamic set, 78 00:06:47,000 --> 00:06:52,000 we're going to keep the subtree sizes in the red-black tree. 79 00:06:52,000 --> 00:06:55,000 For example, this guy has size one. 80 00:06:55,000 --> 00:07:00,000 These guys have size one because they're leaves. 81 00:07:00,000 --> 00:07:08,000 And then we can just work up. So this has size three, 82 00:07:08,000 --> 00:07:16,000 this guy has size five, this guy has size three, 83 00:07:16,000 --> 00:07:25,000 and this guy has five plus three plus one is nine. 84 00:07:25,000 --> 00:07:35,000 In general, we will have size of x is equal to size of left of 85 00:07:35,000 --> 00:07:45,000 x plus the size of the right child of x plus one. 86 00:07:45,000 --> 00:07:48,000 That is how I compute it recursively. 87 00:07:48,000 --> 00:07:52,000 A very simple formula for what the size is. 88 00:07:52,000 --> 00:07:58,000 It turns out that for the code that we're going to want to 89 00:07:58,000 --> 00:08:03,000 write to implement these operations, it is going to be 90 00:08:03,000 --> 00:08:09,000 convenient to be talking about the size of NIL. 91 00:08:09,000 --> 00:08:12,000 So what is the size of NIL? Zero. 92 00:08:12,000 --> 00:08:16,000 Size of NIL, there are no elements there. 
93 00:08:16,000 --> 00:08:22,000 However, in most program languages, if I take size of 94 00:08:22,000 --> 00:08:26,000 NIL, what will happen? You get an error. 95 00:08:26,000 --> 00:08:33,000 That is kind of inconvenient. What I have to do in my code is 96 00:08:33,000 --> 00:08:37,000 that everywhere that I might want to take size of NIL, 97 00:08:37,000 --> 00:08:41,000 or take the size of anything, I have to say, 98 00:08:41,000 --> 00:08:46,000 well, if it's NIL then return zero, otherwise return the size 99 00:08:46,000 --> 00:08:49,000 field, etc. There is an implementation 100 00:08:49,000 --> 00:08:52,000 trick that we're going to use to simplify that. 101 00:08:52,000 --> 00:08:56,000 It's called using a sentinel. 102 00:09:01,000 --> 00:09:05,000 A sentinel is nothing more than a dummy record. 103 00:09:05,000 --> 00:09:10,000 Instead of using a NIL, we will actually use a NIL 104 00:09:10,000 --> 00:09:14,000 sentinel. We will use a dummy record for 105 00:09:14,000 --> 00:09:18,000 NIL such that size of NIL is equal to zero. 106 00:09:18,000 --> 00:09:24,000 Instead of any place I would have used NIL in the tree, 107 00:09:24,000 --> 00:09:31,000 instead I will have a special record that I will call NIL. 108 00:09:31,000 --> 00:09:35,000 But it will be a whole record. And that way I can set its size 109 00:09:35,000 --> 00:09:38,000 field to be zero, and then I don't have to check 110 00:09:38,000 --> 00:09:42,000 that as a special case. That is a very common type of 111 00:09:42,000 --> 00:09:46,000 programming trick to use, is to use sentinels to simplify 112 00:09:46,000 --> 00:09:51,000 code so you don't have all these boundary cases or you don't have 113 00:09:51,000 --> 00:09:55,000 to write an extra function when all I want to do is just index 114 00:09:55,000 --> 00:10:00,000 the size of something. Everybody with me on that? 
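The sentinel trick described above can be sketched in Python. This is a minimal illustration, not the lecture's own code: the names `Node`, `NIL` and `leaf` are assumptions made for the example. The point is that the size formula can be written once, with no special case for NIL.

```python
class _NilSentinel:
    """Dummy record standing in for NIL; its size field is 0."""
    key = None
    size = 0


# One shared sentinel is used wherever a plain NIL would have appeared.
NIL = _NilSentinel()


class Node:
    """Tree node holding a key and the size of the subtree rooted at it."""

    def __init__(self, key, left=NIL, right=NIL):
        self.key = key
        self.left = left
        self.right = right
        # size[x] = size[left[x]] + size[right[x]] + 1, with no NIL check needed.
        self.size = left.size + right.size + 1


def leaf(key):
    """Hypothetical helper: a node whose children are both the sentinel."""
    return Node(key)


d = leaf("D")
h = leaf("H")
f = Node("F", d, h)
print(d.size, h.size, f.size)
```

Without the sentinel, every read of a size field would need an "if it's NIL then return zero" branch or a helper function.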
115 00:10:00,000 --> 00:10:06,000 So let's write the code for OS-Select given this 116 00:10:06,000 --> 00:10:09,000 representation. 117 00:10:17,000 --> 00:10:30,000 And this is going to basically give us the ith smallest in the 118 00:10:30,000 --> 00:10:37,000 subtree rooted at x. It's actually going to be a 119 00:10:37,000 --> 00:10:42,000 little bit more general. If I want to implement the 120 00:10:42,000 --> 00:10:47,000 OS-Select of i up there, I basically give it the root 121 00:10:47,000 --> 00:10:50,000 and i. But we're going to build this 122 00:10:50,000 --> 00:10:55,000 recursively so it's going to be helpful to have the node in 123 00:10:55,000 --> 00:10:59,000 which we're trying to find the subtree. 124 00:10:59,000 --> 00:11:02,000 Here is the code. 125 00:12:22,000 --> 00:12:28,000 This is the code. And let's just see how it works 126 00:12:28,000 --> 00:12:34,000 and then we will argue why it works. 127 00:12:34,000 --> 00:12:41,000 As an example, let's do OS-Select of the root 128 00:12:41,000 --> 00:12:46,000 and 5. We're going to find the fifth 129 00:12:46,000 --> 00:12:54,000 largest in the set. We have OS-Select of the root 130 00:12:54,000 --> 00:13:00,000 and 5. This is inconvenient. 131 00:13:00,000 --> 00:13:08,000 We start out at the top, well, let's just switch the 132 00:13:08,000 --> 00:13:11,000 boards. Here we go. 133 00:13:11,000 --> 00:13:17,000 We start at the top, and i is the root. 134 00:13:17,000 --> 00:13:23,000 Excuse me, i is 5, sorry, and the root. 135 00:13:23,000 --> 00:13:28,000 i=5. We want to find the fifth 136 00:13:28,000 --> 00:13:35,000 largest. We first compute this value k. 137 00:13:35,000 --> 00:13:39,000 k is the size of left of x plus 1. 138 00:13:39,000 --> 00:13:44,000 What is that value? What is k anyway? 139 00:13:44,000 --> 00:13:50,000 What is it? Well, in this case it is 6. 140 00:13:50,000 --> 00:13:56,000 Good. But what is the meaning of k? 141 00:14:02,000 --> 00:14:03,000 The order. The rank.
142 00:14:03,000 --> 00:14:07,000 Good, the rank of the current node. 143 00:14:07,000 --> 00:14:10,000 This is the rank of the current node. 144 00:14:10,000 --> 00:14:15,000 k is always the size of the left subtree plus 1. 145 00:14:15,000 --> 00:14:19,000 That is just the rank of the current node. 146 00:14:19,000 --> 00:14:23,000 We look here and we say, well, the rank is k. 147 00:14:23,000 --> 00:14:30,000 Now, if it is equal then we found the element we want. 148 00:14:30,000 --> 00:14:32,000 But, otherwise, if i is less, 149 00:14:32,000 --> 00:14:36,000 we know it's going to be in the left subtree. 150 00:14:36,000 --> 00:14:42,000 All we're doing then is recursing in the left subtree. 151 00:14:42,000 --> 00:14:47,000 And here we will recurse. We will want the fifth largest 152 00:14:47,000 --> 00:14:50,000 one. And now this time k is going to 153 00:14:50,000 --> 00:14:52,000 be equal to what? Two. 154 00:14:52,000 --> 00:14:56,000 Now here we say, OK, this is bigger, 155 00:14:56,000 --> 00:15:01,000 so therefore the element we want is going to be in the right 156 00:15:01,000 --> 00:15:06,000 subtree. But we don't want the ith 157 00:15:06,000 --> 00:15:11,000 largest guy in the right subtree, because we already know 158 00:15:11,000 --> 00:15:15,000 there are going to be two guys over here. 159 00:15:15,000 --> 00:15:19,000 We want the third largest guy in this subtree. 160 00:15:19,000 --> 00:15:24,000 We have i equals 3 as we recurse into this subtree. 161 00:15:24,000 --> 00:15:30,000 And now we compute k for here. This plus 1 is 2. 162 00:15:30,000 --> 00:15:34,000 And that says we recursed right here. 163 00:15:34,000 --> 00:15:39,000 And then we have i=1, k=1, and we return in this code 164 00:15:39,000 --> 00:15:43,000 a pointer to this node. 165 00:15:55,000 --> 00:16:04,000 So this returns a pointer to the node containing H whose key 166 00:16:04,000 --> 00:16:10,000 is H. 
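The walk-through above can be reproduced with a short Python sketch of OS-Select. The board code is not in the transcript, so this follows the steps as spoken (compute k, compare with i, recurse left with i or right with i minus k). Two things are assumptions: the example tree's key letters, chosen only to be consistent with the sizes quoted in the lecture (9 at the root, 5 and 3 below it), and the use of `None` with a `size` helper in place of the NIL sentinel to keep the block self-contained.

```python
def size(x):
    """Subtree size, treating a missing child as size 0."""
    return x.size if x is not None else 0


class Node:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right
        self.size = size(left) + size(right) + 1


def os_select(x, i):
    """Return the node with the i-th smallest key (1-based) in the subtree rooted at x.

    Assumes 1 <= i <= size(x); no range checking is done in this sketch.
    """
    k = size(x.left) + 1  # rank of x within its own subtree
    if i == k:
        return x
    if i < k:
        return os_select(x.left, i)
    # Skip the k keys that are at or to the left of x.
    return os_select(x.right, i - k)


# Assumed example tree: keys in sorted order are A C D F H M N P Q.
root = Node(
    "M",
    Node("C", Node("A"), Node("F", Node("D"), Node("H"))),
    Node("P", Node("N"), Node("Q")),
)

print(os_select(root, 5).key)
```

On this tree the call visits k = 6 at the root, k = 2 with i = 5, k = 2 with i = 3, and finally k = 1 with i = 1, matching the sequence traced in the lecture.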
Just to make a comment here, 167 00:16:10,000 --> 00:16:15,000 we discovered k is equal to the rank of x. 168 00:16:15,000 --> 00:16:22,000 Any questions about what is going on in this code? 169 00:16:22,000 --> 00:16:27,000 OK. It's basically just finding its 170 00:16:27,000 --> 00:16:33,000 way down. The subtree sizes help it make 171 00:16:33,000 --> 00:16:39,000 the decision as to which way it should go to find which is the 172 00:16:39,000 --> 00:16:43,000 ith largest. We can do a quick analysis. 173 00:16:43,000 --> 00:16:49,000 On our red-black tree, how long does OS-Select take to 174 00:16:49,000 --> 00:16:50,000 run? Yeah? 175 00:16:50,000 --> 00:16:57,000 Yeah, order log n if there are n elements in the tree. 176 00:16:57,000 --> 00:17:01,000 Because the red-black tree is a balanced tree. 177 00:17:01,000 --> 00:17:07,000 Its height is order log n. In fact, this code will work in 178 00:17:07,000 --> 00:17:12,000 order log n time on any tree whose height is order log n. 179 00:17:12,000 --> 00:17:19,000 And so if you have a guaranteed height, the way that red-black 180 00:17:19,000 --> 00:17:25,000 trees do, you're in good shape. OS-Rank, we won't do but it is 181 00:17:25,000 --> 00:17:30,000 in the book, also gets order log n. 182 00:17:30,000 --> 00:17:35,000 Here is a question I want to pose. 183 00:17:35,000 --> 00:17:42,000 Why not just keep the ranks themselves? 184 00:17:58,000 --> 00:18:01,000 Yeah? It's the node itself. 185 00:18:01,000 --> 00:18:04,000 Otherwise, you cannot take left of it. 186 00:18:04,000 --> 00:18:07,000 I mean, if we were doing this in a decent language, 187 00:18:07,000 --> 00:18:11,000 strongly typed language there would be no confusion. 188 00:18:11,000 --> 00:18:15,000 But we're writing in this pseudocode that is good because 189 00:18:15,000 --> 00:18:18,000 it's compact, which lets you focus on the 190 00:18:18,000 --> 00:18:19,000 algorithm.
But, of course, 191 00:18:19,000 --> 00:18:24,000 it doesn't have a lot of the things you would really want if 192 00:18:24,000 --> 00:18:28,000 you were programming things of scale like type safety and so 193 00:18:28,000 --> 00:18:33,000 forth. Yeah? 194 00:18:41,000 --> 00:18:44,000 It is basically hard to maintain when you modify it. 195 00:18:44,000 --> 00:18:48,000 For example, if we actually kept the ranks 196 00:18:48,000 --> 00:18:51,000 in the nodes, certainly it would be easy to 197 00:18:51,000 --> 00:18:53,000 find the element of a given rank. 198 00:18:53,000 --> 00:18:57,000 But all I have to do is insert the smallest element, 199 00:18:57,000 --> 00:19:03,000 an element that is smaller than all of the other elements. 200 00:19:03,000 --> 00:19:06,000 And what happens? All the ranks have to be 201 00:19:06,000 --> 00:19:10,000 changed. Order n changes have to be made 202 00:19:10,000 --> 00:19:14,000 if that's what I was maintaining, whereas with 203 00:19:14,000 --> 00:19:18,000 subtree sizes that's a lot easier. 204 00:19:18,000 --> 00:19:22,000 Because it's hard to maintain -- 205 00:19:27,000 --> 00:19:33,000 -- when the red-black tree is modified. 206 00:19:33,000 --> 00:19:38,000 And that is the other sort of tricky thing when you're 207 00:19:38,000 --> 00:19:43,000 augmenting a data structure. You want to put in the things 208 00:19:43,000 --> 00:19:49,000 that your operations go fast, but you cannot forget that 209 00:19:49,000 --> 00:19:55,000 there are already underlying operations on the data structure 210 00:19:55,000 --> 00:20:00,000 that have to be maintained in some way. 211 00:20:00,000 --> 00:20:03,000 Can we close this door, please? 212 00:20:03,000 --> 00:20:08,000 Thank you. We have to look at what are the 213 00:20:08,000 --> 00:20:14,000 modifying operations and how do we maintain them. 214 00:20:14,000 --> 00:20:21,000 The modifying operations for red-black trees are insert and 215 00:20:21,000 --> 00:20:25,000 delete. 
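To make the cost of storing ranks directly concrete, here is a small Python sketch. It is an illustration only: the dictionary representation and the function name `insert_with_stored_ranks` are assumptions, and keys are assumed distinct. It counts how many stored ranks must be rewritten by one insertion.

```python
def insert_with_stored_ranks(ranks, new_key):
    """Insert new_key into a key -> rank map and return how many old ranks changed."""
    changed = 0
    smaller = 0
    for key in ranks:
        if key > new_key:
            # Every larger key moves one position up in sorted order.
            ranks[key] += 1
            changed += 1
        else:
            smaller += 1
    ranks[new_key] = smaller + 1
    return changed


ranks = {"C": 1, "F": 2, "M": 3}
changed = insert_with_stored_ranks(ranks, "A")
print(changed, ranks)
```

Inserting a key smaller than everything else touches all n stored ranks, which is the order n behavior the lecture warns about.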
If I were augmenting a binary 216 00:20:25,000 --> 00:20:33,000 heap, what operations would I have to worry about? 217 00:20:38,000 --> 00:20:44,000 If I were augmenting a heap, what are the modifying 218 00:20:44,000 --> 00:20:47,000 operations? Binary min heap, 219 00:20:47,000 --> 00:20:52,000 for example, classic priority queue? 220 00:20:52,000 --> 00:20:58,000 Who remembers heaps? What are the operations on a 221 00:20:58,000 --> 00:21:04,000 heap? There's a good final question. 222 00:21:04,000 --> 00:21:09,000 Take-home exam, don't worry about it. 223 00:21:09,000 --> 00:21:16,000 Final, worry about it. What are the operations on a 224 00:21:16,000 --> 00:21:20,000 heap? Just look it up on Books24 or 225 00:21:20,000 --> 00:21:23,000 whatever it is, right? 226 00:21:23,000 --> 00:21:30,000 AnswerMan? What does AnswerMan say? 227 00:21:30,000 --> 00:21:36,000 OK. And? If it's a min heap. It's min, extract min, 228 00:21:36,000 --> 00:21:43,000 typical operations and insert. And of those which are 229 00:21:43,000 --> 00:21:47,000 modifying? Insert and extract min, 230 00:21:47,000 --> 00:21:50,000 OK? So, min is not. 231 00:21:50,000 --> 00:21:57,000 You don't have to worry about min because all that is is a 232 00:21:57,000 --> 00:22:01,000 query. You want to distinguish, among 233 00:22:01,000 --> 00:22:06,000 operations on a dynamic data structure, those that modify and 234 00:22:06,000 --> 00:22:09,000 those that don't, because the ones that don't 235 00:22:09,000 --> 00:22:14,000 modify the data structure are all perfectly fine as long as 236 00:22:14,000 --> 00:22:16,000 you haven't destroyed information. 237 00:22:16,000 --> 00:22:18,000 The queries, those are easy. 238 00:22:18,000 --> 00:22:22,000 But the operations that modify the data structure, 239 00:22:22,000 --> 00:22:26,000 those we're very concerned about in making sure we can 240 00:22:26,000 --> 00:22:29,000 maintain.
Our strategy for dealing with 241 00:22:29,000 --> 00:22:34,000 insert and delete in this case is to update the subtree sizes 242 00:22:34,000 --> 00:22:36,000 -- 243 00:22:43,000 --> 00:22:51,000 -- when inserting or deleting. For example, 244 00:22:51,000 --> 00:23:00,000 let's look at what happens when I insert k. 245 00:23:00,000 --> 00:23:07,000 Element key k. I am going to want to insert it 246 00:23:07,000 --> 00:23:14,000 in here, right? What is going to happen to this 247 00:23:14,000 --> 00:23:20,000 subtree size if I am inserting k in here? 248 00:23:20,000 --> 00:23:25,000 This is going to increase to 10. 249 00:23:25,000 --> 00:23:35,000 And then I go left. This one is going to increase 250 00:23:35,000 --> 00:23:41,000 to 6. Here it is going to increase to 4. 251 00:23:41,000 --> 00:23:42,000 Here 2. 252 00:23:42,000 --> 00:23:50,000 And then I will put my k down there with a 1. 253 00:23:50,000 --> 00:23:56,000 So I just updated on the way down. 254 00:23:56,000 --> 00:24:00,000 Pretty easy. Yeah? 255 00:24:00,000 --> 00:24:04,000 But now it's not a red-black tree anymore. 256 00:24:04,000 --> 00:24:09,000 You have to rebalance, so you must also handle 257 00:24:09,000 --> 00:24:12,000 rebalancing. Because, remember, 258 00:24:12,000 --> 00:24:17,000 and this is something that people tend to forget so it's 259 00:24:17,000 --> 00:24:22,000 always, I think, helpful when I see patterns 260 00:24:22,000 --> 00:24:28,000 going on to tell everybody what the pattern is so that you can 261 00:24:28,000 --> 00:24:34,000 be sure of it in your work that you're not falling into that 262 00:24:34,000 --> 00:24:39,000 pattern. What people tend to forget when 263 00:24:39,000 --> 00:24:43,000 they're doing red-black trees is they tend to remember the tree 264 00:24:43,000 --> 00:24:46,000 insert part of it, but red-black insert, 265 00:24:46,000 --> 00:24:50,000 that RB insert procedure actually has two parts to it.
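The "update on the way down" step can be sketched in Python as follows. This covers only the plain tree-insert part of a red-black insert: there is no coloring and no rebalancing here. The key letters are an assumed example tree chosen to match the sizes quoted in the lecture, and `None` stands in for NIL.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.size = 1  # a freshly inserted node is a leaf


def tree_insert(root, key):
    """Plain BST insert that adds 1 to the size of every node on the search path."""
    new = Node(key)
    if root is None:
        return new
    x = root
    while True:
        x.size += 1  # the new key will end up somewhere below x
        if key < x.key:
            if x.left is None:
                x.left = new
                return root
            x = x.left
        else:
            if x.right is None:
                x.right = new
                return root
            x = x.right


# Build the assumed nine-key example tree, then insert K.
root = None
for key in "MCPAFNQDH":
    root = tree_insert(root, key)
size_before = root.size

root = tree_insert(root, "K")
path = [root, root.left, root.left.right, root.left.right.right]
print([(n.key, n.size) for n in path])
```

After inserting K, the sizes along the path read 10, 6, 4, 2, and K itself sits at the bottom with size 1, as in the lecture's example.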
266 00:24:50,000 --> 00:24:54,000 First you call tree insert and then you have to rebalance. 267 00:24:54,000 --> 00:24:58,000 And so you've got to make sure you do the whole of the 268 00:24:58,000 --> 00:25:02,000 red-black insert. Not just the tree insert part. 269 00:25:02,000 --> 00:25:05,000 We just did the tree insert part. 270 00:25:05,000 --> 00:25:09,000 That was easy. We also have to handle 271 00:25:09,000 --> 00:25:12,000 rebalancing. So there are two types of 272 00:25:12,000 --> 00:25:18,000 things we have to worry about. One is red-black color changes. 273 00:25:18,000 --> 00:25:23,000 Well, fortunately those have no effect on subtree sizes. 274 00:25:23,000 --> 00:25:27,000 If I change the colors of things, no effect, 275 00:25:27,000 --> 00:25:34,000 no problem. But also the interesting one is 276 00:25:34,000 --> 00:25:39,000 rotations. Rotations, it turns out, 277 00:25:39,000 --> 00:25:46,000 are fairly easy to fix up. Because when I do a rotation, 278 00:25:46,000 --> 00:25:52,000 I can update the nodes based on the children. 279 00:25:52,000 --> 00:25:59,000 I will show you that. You basically look at children 280 00:25:59,000 --> 00:26:09,000 and fix up, in this case, in order one time per rotation. 281 00:26:09,000 --> 00:26:12,000 For example, imagine that I had a piece of 282 00:26:12,000 --> 00:26:16,000 my tree that looked like this. 283 00:26:23,000 --> 00:26:26,000 And let's say it was 7, 3, 4, the subtree sizes. 284 00:26:26,000 --> 00:26:30,000 I'm not going to put the values in here. 285 00:26:30,000 --> 00:26:36,000 And I did a right rotation on that edge to put them the other 286 00:26:36,000 --> 00:26:40,000 way. And so these guys get hooked up 287 00:26:40,000 --> 00:26:45,000 this way. Always the three children stay 288 00:26:45,000 --> 00:26:50,000 as three children. We just swing this guy over to 289 00:26:50,000 --> 00:26:58,000 there and make this guy be the parent of the other one.
290 00:26:58,000 --> 00:27:03,000 And so now the point is that I can just simply update this guy 291 00:27:03,000 --> 00:27:08,000 to be, well, he's got 8, 3 plus 4 plus 1 using our 292 00:27:08,000 --> 00:27:13,000 formula for what the size is. And now, for this one, 293 00:27:13,000 --> 00:27:19,000 it's going to be 8 plus 7 plus 1 is 16, or, if I think about 294 00:27:19,000 --> 00:27:24,000 it, it's going to be whatever that was before because I 295 00:27:24,000 --> 00:27:30,000 haven't changed this subtree size with a rotation. 296 00:27:30,000 --> 00:27:33,000 Everything beneath this edge is still beneath this edge. 297 00:27:33,000 --> 00:27:36,000 And so I fixed it up in order one time. 298 00:27:36,000 --> 00:27:40,000 There are certain other types of operations sometimes that 299 00:27:40,000 --> 00:27:42,000 occur where this isn't the value. 300 00:27:42,000 --> 00:27:46,000 If I wasn't doing subtree sizes but was doing some other 301 00:27:46,000 --> 00:27:50,000 property of the subtree, it could be that this was no 302 00:27:50,000 --> 00:27:53,000 longer 16 in which case the effect might propagate up 303 00:27:53,000 --> 00:27:58,000 towards the root. There is a nice little lemma in 304 00:27:58,000 --> 00:28:03,000 the book that shows the conditions under which you can 305 00:28:03,000 --> 00:28:08,000 make sure that the re-balancing doesn't cost you too much. 306 00:28:08,000 --> 00:28:13,000 So that was pretty good. Now, insert and delete, 307 00:28:13,000 --> 00:28:18,000 that is all we have to do for rotations, are therefore still 308 00:28:18,000 --> 00:28:22,000 order log n time, because a red-black tree only 309 00:28:22,000 --> 00:28:28,000 has to do order one rotations. Do they normally take constant 310 00:28:28,000 --> 00:28:32,000 time? Well, they still take constant 311 00:28:32,000 --> 00:28:35,000 time. They just take a little bit 312 00:28:35,000 --> 00:28:39,000 bigger constant. 
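The rotation fix-up described above can be sketched in Python. This is a simplified illustration under stated assumptions: nodes have no parent pointers or colors, the three hanging subtrees are represented by stub nodes carrying only a size (the hypothetical `stub_size` parameter), and `None` stands in for NIL. It reproduces the 7, 3, 4 example.

```python
def size(x):
    """Subtree size, treating a missing child as size 0."""
    return x.size if x is not None else 0


class Node:
    def __init__(self, key, left=None, right=None, stub_size=None):
        self.key = key
        self.left = left
        self.right = right
        # stub_size lets a single node stand in for a whole subtree of that size.
        self.size = stub_size if stub_size is not None else size(left) + size(right) + 1


def right_rotate(y):
    """Rotate right around y and return the new top node (y's old left child)."""
    x = y.left
    y.left = x.right
    x.right = y
    # Only x and y changed children. Recompute y first, since it is now below x.
    y.size = size(y.left) + size(y.right) + 1
    x.size = size(x.left) + size(x.right) + 1
    return x


# Three subtrees of sizes 7, 3 and 4, left to right.
a = Node("a", stub_size=7)
b = Node("b", stub_size=3)
c = Node("c", stub_size=4)
x = Node("x", a, b)
y = Node("y", x, c)
size_before = y.size

top = right_rotate(y)
print(top.key, top.size, top.right.key, top.right.size)
```

The lower node ends up with size 3 + 4 + 1 = 8 and the new top with 8 + 7 + 1 = 16, the same size the top of this piece had before the rotation, so nothing above it needs to change.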
And so now we've been able to 313 00:28:39,000 --> 00:28:45,000 build this great data structure that supports dynamic order 314 00:28:45,000 --> 00:28:50,000 statistic queries and it works in order log n time for insert, 315 00:28:50,000 --> 00:28:54,000 delete and the various queries. OS-Select. 316 00:28:54,000 --> 00:28:59,000 I can also just search for an element. 317 00:28:59,000 --> 00:29:05,000 I have taken the basic data structure and have added some 318 00:29:05,000 --> 00:29:11,000 new operations on it. Any questions about what we did 319 00:29:11,000 --> 00:29:14,000 here? Do people understand this 320 00:29:14,000 --> 00:29:16,000 reasonably well? OK. 321 00:29:16,000 --> 00:29:23,000 Then let's generalize, always a dangerous thing. 322 00:29:37,000 --> 00:29:42,000 Augmenting data structures. What I would like to do is give 323 00:29:42,000 --> 00:29:47,000 you a little methodology for how you go about doing this safely 324 00:29:47,000 --> 00:29:52,000 so you don't forget things. The most common thing, 325 00:29:52,000 --> 00:29:56,000 by the way, if there is an augmentation problem on the 326 00:29:56,000 --> 00:30:01,000 take-home or if there is one on the final, I guarantee that 327 00:30:01,000 --> 00:30:07,000 probably a quarter of the class will forget the rotations if 328 00:30:07,000 --> 00:30:12,000 they augment a red-black tree. I guarantee it. 329 00:30:12,000 --> 00:30:16,000 Anyway, here is a little methodology to check yourself. 330 00:30:16,000 --> 00:30:19,000 As I mentioned, the reason why this is so 331 00:30:19,000 --> 00:30:22,000 important is because this is, in practice, 332 00:30:22,000 --> 00:30:25,000 the thing that you do most of the time. 333 00:30:25,000 --> 00:30:30,000 You don't just use a data structure as given. 334 00:30:30,000 --> 00:30:34,000 You take a data structure. You say I have my own 335 00:30:34,000 --> 00:30:37,000 operations I want to layer onto this.
336 00:30:37,000 --> 00:30:40,000 We're going to give a methodology. 337 00:30:40,000 --> 00:30:43,000 And what I will do, as I go along, 338 00:30:43,000 --> 00:30:48,000 is I will use the example of order statistics trees to 339 00:30:48,000 --> 00:30:52,000 illustrate the methodology. It is four steps. 340 00:30:52,000 --> 00:30:58,000 The first is choose an underlying data structure. 341 00:31:04,000 --> 00:31:09,000 Which in the case of order statistics tree was what? 342 00:31:09,000 --> 00:31:11,000 Red-black tree. 343 00:31:19,000 --> 00:31:23,000 And the second thing we do is we figure out what additional 344 00:31:23,000 --> 00:31:27,000 information we wish to maintain in that data structure. 345 00:31:38,000 --> 00:31:43,000 Which in this case is the subtree sizes. 346 00:31:43,000 --> 00:31:49,000 Subtree sizes is what we keep for this one. 347 00:31:49,000 --> 00:31:55,000 And when we did this we could make mistakes, 348 00:31:55,000 --> 00:31:58,000 right? We could have said, 349 00:31:58,000 --> 00:32:05,000 oh, let's keep the rank. And we start playing with it 350 00:32:05,000 --> 00:32:09,000 and discover we can do that. It just goes really slowly. 351 00:32:09,000 --> 00:32:14,000 It takes some creativity to figure out what is the 352 00:32:14,000 --> 00:32:18,000 information that you're going to be able to keep, 353 00:32:18,000 --> 00:32:22,000 but also to maintain the other properties that you want. 354 00:32:22,000 --> 00:32:26,000 The third step is verify that the information can be 355 00:32:26,000 --> 00:32:29,000 maintained -- 356 00:32:34,000 --> 00:32:38,000 -- for the modifying operations on the data structure. 357 00:32:45,000 --> 00:32:50,000 And so in this case, for OS trees, 358 00:32:50,000 --> 00:32:59,000 the modifying operations were insert and delete. 359 00:32:59,000 --> 00:33:01,000 And, of course, we had to make sure we dealt 360 00:33:01,000 --> 00:33:03,000 with rotations.
361 00:33:10,000 --> 00:33:14,000 And because rotations are part of that we could break it down 362 00:33:14,000 --> 00:33:17,000 into the tree insert, the tree delete and rotations. 363 00:33:17,000 --> 00:33:20,000 And once we did that everything was fine. 364 00:33:20,000 --> 00:33:24,000 We didn't, for this particular problem, have to worry about 365 00:33:24,000 --> 00:33:27,000 color changes. But that's another thing that 366 00:33:27,000 --> 00:33:32,000 under some things you might have to worry about. 367 00:33:32,000 --> 00:33:35,000 For some reason the color made a difference. 368 00:33:35,000 --> 00:33:38,000 Usually that doesn't make a difference. 369 00:33:38,000 --> 00:33:43,000 And then the fourth step is to develop new operations. 370 00:33:50,000 --> 00:33:56,000 Presumably that use the info that you have now stored. 371 00:33:56,000 --> 00:34:02,000 And this was OS-Select and OS-Rank, which we didn't give 372 00:34:02,000 --> 00:34:07,000 but which is there. And also it's a nice little 373 00:34:07,000 --> 00:34:12,000 puzzle to figure out yourself, how you would build OS-Rank. 374 00:34:12,000 --> 00:34:17,000 Not a hard piece of code. This methodology is not 375 00:34:17,000 --> 00:34:22,000 actually the way you do this. This is one of these things 376 00:34:22,000 --> 00:34:27,000 that's more like a checklist, because you see whether or not 377 00:34:27,000 --> 00:34:31,000 you've got -- When you're actually doing this 378 00:34:31,000 --> 00:34:34,000 maybe you developed the new operations first. 379 00:34:34,000 --> 00:34:37,000 You've got to keep in mind the new operations while you're 380 00:34:37,000 --> 00:34:40,000 verifying that the information you're storing can be here. 381 00:34:40,000 --> 00:34:44,000 Maybe you will then go back and change this and sort of sort 382 00:34:44,000 --> 00:34:46,000 through it. This is more a checklist that 383 00:34:46,000 --> 00:34:49,000 when you're done this is how you write it up.
384 00:34:49,000 --> 00:34:52,000 This is how you document that what you've done is, 385 00:34:52,000 --> 00:34:54,000 in fact, a good thing. You have a checklist. 386 00:34:54,000 --> 00:34:56,000 Here is my underlying data structure. 387 00:34:56,000 --> 00:35:00,000 Here is the additional information I need. 388 00:35:00,000 --> 00:35:03,000 See, I can still support the modifying operations that the 389 00:35:03,000 --> 00:35:07,000 data structure used to have and now here are my new operations 390 00:35:07,000 --> 00:35:10,000 and see what those are. It's really a checklist. 391 00:35:10,000 --> 00:35:13,000 Not a prescription for the order in which you do things. 392 00:35:13,000 --> 00:35:16,000 You must do all these steps, not necessarily in this order. 393 00:35:16,000 --> 00:35:19,000 This is a guide for your documentation. 394 00:35:19,000 --> 00:35:22,000 When we ask for you to augment a data structure, 395 00:35:22,000 --> 00:35:25,000 generally we're asking you to tell us what the four steps are. 396 00:35:25,000 --> 00:35:29,000 It will help you organize your things. 397 00:35:29,000 --> 00:35:33,000 It will also help make sure you don't forget some step along the 398 00:35:33,000 --> 00:35:36,000 way. I've seen people who have added 399 00:35:36,000 --> 00:35:40,000 the information and developed new operations but completely 400 00:35:40,000 --> 00:35:44,000 forgot to verify that the information could be maintained. 401 00:35:44,000 --> 00:35:48,000 So you want to make sure that you've done all those. 402 00:35:48,000 --> 00:35:51,000 Usually you have to play -- 403 00:35:56,000 --> 00:35:59,000 -- with interactions -- 404 00:36:04,000 --> 00:36:07,000 -- between steps. It's not just a do this, 405 00:36:07,000 --> 00:36:12,000 do this, do this. We're going to do now a more 406 00:36:12,000 --> 00:36:17,000 complicated data structure.
It's not that much more 407 00:36:17,000 --> 00:36:24,000 complicated, but its correctness is actually kind of challenging. 408 00:36:33,000 --> 00:36:36,000 And it is actually a very practical and useful data 409 00:36:36,000 --> 00:36:40,000 structure. I am amazed at how many people 410 00:36:40,000 --> 00:36:45,000 aren't aware that there are data structures of this nature that 411 00:36:45,000 --> 00:36:49,000 are useful for them when I see people writing really slow code. 412 00:36:49,000 --> 00:36:55,000 And so the example we're going to do is interval trees. 413 00:37:00,000 --> 00:37:08,000 And the idea of this is that we want to maintain a set of 414 00:37:08,000 --> 00:37:11,000 intervals. For example, 415 00:37:11,000 --> 00:37:18,000 time intervals. I have a whole database of time 416 00:37:18,000 --> 00:37:24,000 intervals that I'm trying to maintain. 417 00:37:24,000 --> 00:37:30,000 Let's just do an example here. 418 00:38:00,000 --> 00:38:08,000 This is going from 7 to 10, 5 to 11 and 4 to 8, 419 00:38:08,000 --> 00:38:14,000 from 15 to 18, 17 to 19 and 21 to 23. 420 00:38:14,000 --> 00:38:24,000 This is a set of intervals. And if we have an interval i, 421 00:38:24,000 --> 00:38:34,000 let's say this is interval i, which is 7,10. 422 00:38:34,000 --> 00:38:38,000 We're going to call this endpoint the low endpoint of i 423 00:38:38,000 --> 00:38:41,000 and this we're going to call the high endpoint of i. 424 00:38:41,000 --> 00:38:46,000 The reason I use low and high rather than left or right is 425 00:38:46,000 --> 00:38:50,000 because we're going to have a tree, and we're going to want 426 00:38:50,000 --> 00:38:53,000 the left subtree and the right subtree. 427 00:38:53,000 --> 00:38:58,000 So if I start saying left and right for intervals and left and 428 00:38:58,000 --> 00:39:03,000 right for tree we're going to get really confused. 429 00:39:03,000 --> 00:39:05,000 This is also a tip. 
Let me say, when you're coding, 430 00:39:05,000 --> 00:39:09,000 you really have to think hard sometimes about the words that 431 00:39:09,000 --> 00:39:12,000 you're using for things, especially things like left and 432 00:39:12,000 --> 00:39:15,000 right because they get so overused throughout programming. 433 00:39:15,000 --> 00:39:18,000 It's a good idea to come up with a whole wealth of synonyms 434 00:39:18,000 --> 00:39:22,000 for different situations so that it is clear in any piece of code 435 00:39:22,000 --> 00:39:24,000 what you're talking about, for example, 436 00:39:24,000 --> 00:39:27,000 the intervals versus the tree, because we're going to 437 00:39:27,000 --> 00:39:33,000 have both going on here. And what we're going to do is 438 00:39:33,000 --> 00:39:41,000 we want to support insertion and deletion of intervals here. 439 00:39:41,000 --> 00:39:49,000 And we're going to have a query, which is going to be the 440 00:39:49,000 --> 00:39:57,000 new operation we're going to develop, which is going to be to 441 00:39:57,000 --> 00:40:03,000 find an interval, any interval in the set that 442 00:40:03,000 --> 00:40:09,000 overlaps a given query interval. 443 00:40:15,000 --> 00:40:23,000 So I give you a query interval like, say, 6, 14 and you can 444 00:40:23,000 --> 00:40:31,000 return this guy or this guy or this guy. You couldn't return any 445 00:40:31,000 --> 00:40:38,000 of these because these all start after 14. 446 00:40:38,000 --> 00:40:41,000 So I can return any one of those. 447 00:40:41,000 --> 00:40:47,000 I only have to return one. I just have to find one guy 448 00:40:47,000 --> 00:40:52,000 that overlaps. Any question about what we're 449 00:40:52,000 --> 00:40:55,000 going to be setting up here? OK. 450 00:40:55,000 --> 00:41:01,000 Our methodology is we're going to pick, first of all, 451 00:41:01,000 --> 00:41:06,000 step one. And here is our methodology.
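To make the query concrete before any tree is involved, here is a minimal brute-force sketch in Python. It assumes closed intervals stored as `(low, high)` tuples; the names `overlaps` and `find_overlap_linear` are made up for illustration, and the linear scan is only a reference point, not the data structure the lecture goes on to build.

```python
from typing import Optional, Sequence, Tuple

Interval = Tuple[int, int]  # (low, high), assumed closed on both ends


def overlaps(a: Interval, b: Interval) -> bool:
    """Two closed intervals overlap unless one ends before the other starts."""
    return not (a[1] < b[0] or b[1] < a[0])


def find_overlap_linear(intervals: Sequence[Interval],
                        query: Interval) -> Optional[Interval]:
    """Return any one stored interval that overlaps query, or None (O(n) scan)."""
    for interval in intervals:
        if overlaps(interval, query):
            return interval
    return None


# The six intervals from the board example.
board = [(7, 10), (5, 11), (4, 8), (15, 18), (17, 19), (21, 23)]
hit = find_overlap_linear(board, (6, 14))
```

For the query 6, 14 any of the first three intervals would be an acceptable answer; this scan happens to return the first one it meets.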
452 00:41:06,000 --> 00:41:12,000 Step one is we're going to choose an underlying data structure. 453 00:41:12,000 --> 00:41:18,000 Does anybody have a suggestion as to what data structure we 454 00:41:18,000 --> 00:41:24,000 ought to use here to support interval trees? 455 00:41:32,000 --> 00:41:38,000 What data structure should we try to start with here to support 456 00:41:38,000 --> 00:41:41,000 interval trees? Anybody have any idea? 457 00:41:41,000 --> 00:41:45,000 A red-black tree. A binary search tree. 458 00:41:45,000 --> 00:41:50,000 Red-black tree. We're going to use a red-black 459 00:41:50,000 --> 00:41:52,000 tree. 460 00:41:57,000 --> 00:42:02,000 Oh, I've got to say what it is keyed on. 461 00:42:02,000 --> 00:42:06,000 What is going to be the key for my red-black tree? 462 00:42:06,000 --> 00:42:10,000 For each interval, what should I use for a key? 463 00:42:10,000 --> 00:42:14,000 This is where there are a bunch of options, right? 464 00:42:14,000 --> 00:42:19,000 Throw out some ideas. It's always better to branch 465 00:42:19,000 --> 00:42:23,000 than it is to prune. You can always prune later, 466 00:42:23,000 --> 00:42:28,000 but if you don't branch you will never get the chance to 467 00:42:28,000 --> 00:42:32,000 prune. So generation of ideas. 468 00:42:32,000 --> 00:42:37,000 You'll need that when you're doing the design phase and doing 469 00:42:37,000 --> 00:42:40,000 the take-home exam. Yeah? 470 00:42:40,000 --> 00:42:43,000 We're calling that the low endpoint. 471 00:42:43,000 --> 00:42:48,000 OK, you could do low endpoint. What other ideas are there? 472 00:42:48,000 --> 00:42:52,000 High endpoint. Now you can look at low 473 00:42:52,000 --> 00:42:57,000 endpoint, high endpoint. Well, between low and high 474 00:42:57,000 --> 00:43:02,000 which is better? That one is not going to 475 00:43:02,000 --> 00:43:06,000 matter, right?
So doing high versus low, 476 00:43:06,000 --> 00:43:13,000 we don't have to consider that, but there is another natural 477 00:43:13,000 --> 00:43:18,000 point you want to think about using like the median, 478 00:43:18,000 --> 00:43:23,000 the middle point. At least that is symmetric. 479 00:43:23,000 --> 00:43:27,000 What do you think? What else might I use? 480 00:43:27,000 --> 00:43:32,000 The length? I think the length doesn't feel 481 00:43:32,000 --> 00:43:36,000 to me productive. This is just purely a matter of 482 00:43:36,000 --> 00:43:39,000 intuition. It doesn't feel productive, 483 00:43:39,000 --> 00:43:43,000 because if I know the length I don't know where it is so it's 484 00:43:43,000 --> 00:43:48,000 going to be hard to maintain information about where it is 485 00:43:48,000 --> 00:43:51,000 for queries. It turns out we're going to use 486 00:43:51,000 --> 00:43:55,000 the low left endpoint, but I think to me that was sort 487 00:43:55,000 --> 00:44:02,000 of a surprise that you'd want to use that and not the middle one. 488 00:44:02,000 --> 00:44:06,000 Because you're favoring one endpoint over the other. 489 00:44:06,000 --> 00:44:11,000 It turns out that's the right thing to do, surprisingly. 490 00:44:11,000 --> 00:44:16,000 There is another strategy. Actually, there's another type 491 00:44:16,000 --> 00:44:22,000 of tree called a segment tree. Actually, what you do is you 492 00:44:22,000 --> 00:44:27,000 store both the left and right endpoints separately in the 493 00:44:27,000 --> 00:44:30,000 tree. And then you maintain a data 494 00:44:30,000 --> 00:44:35,000 structure where the line segments go up through the tree 495 00:44:35,000 --> 00:44:40,000 on to the other. There are lots of things you 496 00:44:40,000 --> 00:44:45,000 can do, but we're just going to keep it keyed on the low 497 00:44:45,000 --> 00:44:47,000 endpoint. That's why this is a more 498 00:44:47,000 --> 00:44:50,000 clever data structure in some ways. 
499 00:44:50,000 --> 00:44:54,000 Now, this is harder. That is why this is a clever 500 00:44:54,000 --> 00:44:58,000 data structure. What are we going to store in 501 00:44:58,000 --> 00:45:03,000 the -- I think any of those ideas are 502 00:45:03,000 --> 00:45:08,000 good ideas to throw out and look at. 503 00:45:08,000 --> 00:45:14,000 You don't know which one is going to work until you play 504 00:45:14,000 --> 00:45:17,000 with it. This one, though, 505 00:45:17,000 --> 00:45:22,000 is, I think, much harder to guess. 506 00:45:22,000 --> 00:45:28,000 You're going to store in a node the largest endpoint value, 507 00:45:28,000 --> 00:45:33,000 I will call it m, in the subtree rooted at that 508 00:45:33,000 --> 00:45:36,000 node. 509 00:45:45,000 --> 00:45:48,000 We'll draw it like this, a node like this. 510 00:45:48,000 --> 00:45:52,000 We will put the interval here and we will put the m value 511 00:45:52,000 --> 00:45:53,000 here. 512 00:46:02,000 --> 00:46:04,000 Let's draw a picture. 513 00:46:38,000 --> 00:46:42,000 Once again, I am not drawing the NILs. 514 00:47:00,000 --> 00:47:05,000 I hope that that is a search tree that is keyed on the low 515 00:47:05,000 --> 00:47:08,000 endpoint. 4, 5, 7, 15, 516 00:47:08,000 --> 00:47:11,000 17, 21. It is keyed on the low 517 00:47:11,000 --> 00:47:15,000 endpoint. If this is a red-black tree, 518 00:47:15,000 --> 00:47:21,000 let's just do another practice. How can I color this so that it 519 00:47:21,000 --> 00:47:27,000 is a legal red-black tree? Not too relevant to what we're 520 00:47:27,000 --> 00:47:32,000 doing right now, but a little drill doesn't hurt 521 00:47:32,000 --> 00:47:35,000 sometimes. Remember, the NILs are not 522 00:47:35,000 --> 00:47:39,000 drawn and they are all black. And the root is black. 523 00:47:39,000 --> 00:47:42,000 I will give that one to you. 524 00:47:52,000 --> 00:47:54,000 Good. This will work. 525 00:47:54,000 --> 00:48:00,000 You sort of go through a little puzzle.
526 00:48:00,000 --> 00:48:03,000 A logic puzzle. Because this is really short so 527 00:48:03,000 --> 00:48:06,000 it better not have any reds in it. 528 00:48:06,000 --> 00:48:11,000 This has got to be black. Now, if I'm going to balance 529 00:48:11,000 --> 00:48:15,000 the height, I have got to have a layer of black here. 530 00:48:15,000 --> 00:48:19,000 It couldn't be that one. It's got to be these two. 531 00:48:19,000 --> 00:48:22,000 Good. Now let's compute the m value 532 00:48:22,000 --> 00:48:26,000 for each of these. It's the largest value in the 533 00:48:26,000 --> 00:48:36,000 subtree rooted at that node. What's the largest value in the 534 00:48:36,000 --> 00:48:43,000 subtree rooted at this node? 10. 535 00:48:43,000 --> 00:48:47,000 And in this one? 18. 536 00:48:47,000 --> 00:49:00,000 In this one? 8. 537 00:49:00,000 --> 00:49:12,000 In general, m is going to be the maximum of three possible 538 00:49:12,000 --> 00:49:20,000 values. Either the high endpoint of the 539 00:49:20,000 --> 00:49:34,000 interval at x or m of the left of x or m of the right of x. 540 00:49:40,000 --> 00:49:44,000 Does everybody see that? It is going to be m of x for 541 00:49:44,000 --> 00:49:46,000 any node. I just have to look, 542 00:49:46,000 --> 00:49:50,000 what is the maximum here, what is the maximum here and 543 00:49:50,000 --> 00:49:53,000 what is the high endpoint of the interval. 544 00:49:53,000 --> 00:49:58,000 Whichever one of those is largest, that's the largest for 545 00:49:58,000 --> 00:50:00,000 that subtree. 546 00:50:15,000 --> 00:50:19,000 The modifying operations. 547 00:50:29,000 --> 00:50:33,000 Let's first do insert. How can I do insert? 548 00:50:33,000 --> 00:50:38,000 There are two parts. The first part is to do the 549 00:50:38,000 --> 00:50:44,000 tree insert, just a normal insert into a binary search 550 00:50:44,000 --> 00:50:46,000 tree. 551 00:50:55,000 --> 00:51:03,000 What do I do? Insert a new interval?
552 00:51:20,000 --> 00:51:23,000 Insert a new interval here? How can I fix up the m's? 553 00:51:33,000 --> 00:51:35,000 That's right. You just go down the tree and 554 00:51:35,000 --> 00:51:39,000 look at my current interval. And if it's got a bigger max, 555 00:51:39,000 --> 00:51:43,000 this is something that is going into that subtree. 556 00:51:43,000 --> 00:51:46,000 If its high endpoint is bigger than the current max, 557 00:51:46,000 --> 00:51:50,000 update the current max. I just do that as I'm going 558 00:51:50,000 --> 00:51:54,000 through the insertion, wherever it happens to end up, 559 00:51:54,000 --> 00:51:58,000 in every subtree that it hits, every node that it hits on the 560 00:51:58,000 --> 00:52:04,000 way down. I just update it with the 561 00:52:04,000 --> 00:52:11,000 maximum wherever it happens to fall. 562 00:52:11,000 --> 00:52:17,000 Good. You just fix them on the way 563 00:52:17,000 --> 00:52:19,000 down. 564 00:52:25,000 --> 00:52:30,000 But we also have to do the other part. 565 00:52:30,000 --> 00:52:37,000 Also need to handle rotations. 566 00:52:45,000 --> 00:52:51,000 So let's just see how we might do rotations as an example. 567 00:53:00,000 --> 00:53:03,000 Let's say this is 11, 15, 30. 568 00:53:14,000 --> 00:53:16,000 Let's say I'm doing a right rotation. 569 00:53:16,000 --> 00:53:19,000 This is coming off from somewhere. 570 00:53:32,000 --> 00:53:37,000 That is coming off. This is still going to be the 571 00:53:37,000 --> 00:53:43,000 child that has 30, the one that has 14 and the one 572 00:53:43,000 --> 00:53:48,000 that has 19. And so now we've rotated this 573 00:53:48,000 --> 00:53:53,000 way, so this is the 11, 15 and this is the 6, 574 00:53:53,000 --> 00:53:55,000 20. For this one, 575 00:53:55,000 --> 00:54:02,000 I just use my formula here. I just look here and say which 576 00:54:02,000 --> 00:54:04,000 is the biggest, 14, 15 or 19? 577 00:54:04,000 --> 00:54:06,000 19. And I look here.
578 00:54:06,000 --> 00:54:08,000 Which is the biggest? 30, 19 or 20? 579 00:54:08,000 --> 00:54:10,000 30. Or, once again, 580 00:54:10,000 --> 00:54:12,000 it turns out, not too hard to show, 581 00:54:12,000 --> 00:54:17,000 that it's always whatever was there, because we're talking 582 00:54:17,000 --> 00:54:20,000 about the biggest thing in the subtree. 583 00:54:20,000 --> 00:54:24,000 And the membership of the subtree hasn't changed when we 584 00:54:24,000 --> 00:54:28,000 do the rotation. That just took me order one 585 00:54:28,000 --> 00:54:31,000 time to fix up. 586 00:54:51,000 --> 00:55:08,000 Fixing up the m's during rotation takes O(1) time. 587 00:55:08,000 --> 00:55:19,000 So the total insert time is O(lg n). 588 00:55:25,000 --> 00:55:27,000 Once I figured out that this is the right information, 589 00:55:27,000 --> 00:55:29,000 of course we don't know what we're using this information for 590 00:55:29,000 --> 00:55:32,000 yet. But once I know that that is 591 00:55:32,000 --> 00:55:36,000 the information, showing you that insert and 592 00:55:36,000 --> 00:55:41,000 delete continue to work in order log n time is easy. 593 00:55:41,000 --> 00:55:46,000 Now, delete is actually a little bit trickier but I will 594 00:55:46,000 --> 00:55:50,000 just say it is similar. Because in delete you go 595 00:55:50,000 --> 00:55:56,000 through and you find something, you may have to go through the 596 00:55:56,000 --> 00:56:02,000 whole business of swapping it. If it's an internal node you've 597 00:56:02,000 --> 00:56:05,000 got to swap it with its successor or predecessor. 598 00:56:05,000 --> 00:56:08,000 And so there are a bunch of things that have to be dealt 599 00:56:08,000 --> 00:56:12,000 with, but it is all stuff where you can update the information 600 00:56:12,000 --> 00:56:15,000 using this thing.
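The m bookkeeping for insert and rotation can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the lecture's code: it uses a plain binary search tree with no red-black coloring or rebalancing, the names `Node`, `fix_m`, `insert` and `right_rotate` are invented here, and the leaf intervals in the rotation example (2, 30; 8, 14; 17, 19) are hypothetical stand-ins for the three subtrees whose m values on the board are 30, 14 and 19.

```python
from typing import Optional


class Node:
    def __init__(self, low: int, high: int) -> None:
        self.low, self.high = low, high   # the interval; low is the BST key
        self.left: Optional["Node"] = None
        self.right: Optional["Node"] = None
        self.m = high                     # largest high endpoint in this subtree


def fix_m(x: Node) -> None:
    """m[x] = max(high[int[x]], m[left[x]], m[right[x]]), skipping NIL children."""
    x.m = max([x.high] + [c.m for c in (x.left, x.right) if c is not None])


def insert(root: Optional[Node], low: int, high: int) -> Node:
    """Plain BST insert keyed on low; bump m at every node passed on the way down."""
    new = Node(low, high)
    if root is None:
        return new
    x = root
    while True:
        x.m = max(x.m, high)              # the new interval lands in x's subtree
        if low < x.low:
            if x.left is None:
                x.left = new
                return root
            x = x.left
        else:
            if x.right is None:
                x.right = new
                return root
            x = x.right


def right_rotate(y: Node) -> Node:
    """Lift y's left child above y and repair the two affected m fields in O(1)."""
    x = y.left
    assert x is not None, "right rotation needs a left child"
    y.left, x.right = x.right, y
    fix_m(y)                              # y is now the lower node, so fix it first
    fix_m(x)
    return x


# Rotation example: 11, 15 on top with 6, 20 as its left child.
y = Node(11, 15)
y.left = Node(6, 20)
y.left.left, y.left.right = Node(2, 30), Node(8, 14)
y.right = Node(17, 19)
fix_m(y.left)
fix_m(y)
top = right_rotate(y)
```

After the rotation, the lower node 11, 15 gets m = max(15, 14, 19) and the new top node 6, 20 gets m = max(20, 30, 19), which is the same value the old top node held, since the set of intervals under the top has not changed.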
And it's all essentially local 601 00:56:15,000 --> 00:56:19,000 changes when you're updating this information because you can 602 00:56:19,000 --> 00:56:23,000 do it essentially only on a path up from the root and most of the 603 00:56:23,000 --> 00:56:27,000 tree is never dealt with. I will leave that for you folks 604 00:56:27,000 --> 00:56:32,000 to work out. It's also in the book if you 605 00:56:32,000 --> 00:56:36,000 want to cheat, but it is a good exercise. 606 00:56:36,000 --> 00:56:41,000 Any questions about the first three steps? 607 00:56:41,000 --> 00:56:45,000 Fourth step is new operations. 608 00:57:18,000 --> 00:57:28,000 Interval search of i is going to find an interval that 609 00:57:28,000 --> 00:57:35,000 overlaps the interval i. So i here is an interval. 610 00:57:35,000 --> 00:57:39,000 It's got two coordinates. And this, rather than writing 611 00:57:39,000 --> 00:57:43,000 recursively, we're going to write as, it's sort of going to 612 00:57:43,000 --> 00:57:46,000 be recursive, but we're going to write it 613 00:57:46,000 --> 00:57:49,000 with a while loop. You could write it recursively. 614 00:57:49,000 --> 00:57:53,000 The other one that we wrote, we could have written as a 615 00:57:53,000 --> 00:57:57,000 while loop as well and not had the recursive call. 616 00:57:57,000 --> 00:58:02,000 Here we're going to basically just start x gets the root. 617 00:58:02,000 --> 00:58:05,000 And then while -- 618 00:59:47,000 --> 00:59:56,000 That is the code. Let's just see how it works. 619 00:59:56,000 --> 01:00:05,000 Let's search for the interval 14, 16 -- 620 01:00:12,000 --> 01:00:15,202 -- in this tree. Let's see. 621 01:00:15,202 --> 01:00:21,239 x starts out at the root. And while it is not NIL, 622 01:00:21,239 --> 01:00:29,000 and it's not NIL because it's the root, what is this doing? 623 01:00:29,000 --> 01:00:31,000 Somebody tell me what that code does. 624 01:00:50,000 --> 01:00:56,000 Well, what is this doing? 
This is testing something 625 01:00:56,000 --> 01:01:01,952 between i and int of x. Int of x is the interval stored 626 01:01:01,952 --> 01:01:05,000 at x. What is this testing for? 627 01:01:17,000 --> 01:01:19,000 I hope I got it right. 628 01:01:30,000 --> 01:01:34,000 What is this testing for? Yeah? 629 01:01:41,000 --> 01:01:46,333 Above or below? I need just simple words. 630 01:01:46,333 --> 01:01:52,866 Test for overlaps. In particular test whether they 631 01:01:52,866 --> 01:01:55,000 do or don't? 632 01:02:00,000 --> 01:02:01,778 Do? Don't? 633 01:02:01,778 --> 01:02:12,251 If I get to this point, what do I know about i and int 634 01:02:12,251 --> 01:02:16,005 of x? Don't overlap. 635 01:02:16,005 --> 01:02:28,059 They don't overlap because the high of one is smaller than the 636 01:02:28,059 --> 01:02:35,417 low of the other. The high of one is smaller than 637 01:02:35,417 --> 01:02:39,239 the low of the other. They don't overlap that way. 638 01:02:39,239 --> 01:02:41,735 Could they overlap the other way? 639 01:02:41,735 --> 01:02:46,259 No because we're testing also whether the low of the one is 640 01:02:46,259 --> 01:02:48,832 bigger than the high of the other. 641 01:02:48,832 --> 01:02:52,654 They're saying it's either like this or like this. 642 01:02:52,654 --> 01:02:56,554 This is testing not overlap. That makes it simpler. 643 01:02:56,554 --> 01:03:01,000 When I'm searching for 14, 16, I check here. 644 01:03:01,000 --> 01:03:04,340 And I say do they overlap? And the answer is, 645 01:03:04,340 --> 01:03:08,591 now we can understand it without having to go through all 646 01:03:08,591 --> 01:03:12,387 the arithmetic calculations, no they don't overlap. 647 01:03:12,387 --> 01:03:15,424 If they did overlap, I found what I want. 648 01:03:15,424 --> 01:03:19,675 And what's going to happen? 
I am going to drop out of the 649 01:03:19,675 --> 01:03:24,230 while loop and just return x, because I will return something 650 01:03:24,230 --> 01:03:26,507 that overlaps. That is my goal. 651 01:03:26,507 --> 01:03:30,000 Here it says they don't overlap. 652 01:03:30,000 --> 01:03:34,731 So then I say, well, if left of x is not NIL, 653 01:03:34,731 --> 01:03:39,462 in other words, I've got a left child and low 654 01:03:39,462 --> 01:03:44,193 of i is less than or equal to m of left of x, 655 01:03:44,193 --> 01:03:48,924 then we go left. What happens in this case if 656 01:03:48,924 --> 01:03:51,505 I'm searching for 14, 16? 657 01:03:51,505 --> 01:03:57,096 Is the low of i less than or equal to m of left of x? 658 01:03:57,096 --> 01:04:03,181 Low of i is 14. And I am searching. 659 01:04:03,181 --> 01:04:07,702 And is it less than 18? Yes. 660 01:04:07,702 --> 01:04:16,576 Therefore, what do I do? I go left and make x point to 661 01:04:16,576 --> 01:04:20,093 this guy. Now I check. 662 01:04:20,093 --> 01:04:23,274 Does it overlap? No. 663 01:04:23,274 --> 01:04:29,637 I take a look at the left guy. It is 8. 664 01:04:29,637 --> 01:04:36,000 I compare 8 with 14, right? 665 01:04:36,000 --> 01:04:40,508 Is 14 less than or equal to 8? No, so I go right. 666 01:04:40,508 --> 01:04:48,729 And now I discover that I have an overlap here. 667 01:04:48,729 --> 01:04:55,093 It returns then the 15, 18 as an overlapping one. 668 01:04:55,093 --> 01:05:12,000 If I were searching for 12, 14, 669 01:05:12,000 --> 01:05:16,556 I would go up to the top. And I look, 12, 670 01:05:16,556 --> 01:05:22,708 14, it doesn't overlap here. I look at the 18 and it is 671 01:05:22,708 --> 01:05:27,037 greater so I go left. I then look here. 672 01:05:27,037 --> 01:05:30,000 Does it overlap? No. 673 01:05:30,000 --> 01:05:34,740 So then what happens? I look at the left. 674 01:05:34,740 --> 01:05:38,413 It says I go right. I look here.
675 01:05:38,413 --> 01:05:42,207 Then I go and I look at the left. 676 01:05:42,207 --> 01:05:44,696 It says, no, go right. 677 01:05:44,696 --> 01:05:49,674 I go here, which is NIL, and now it is NIL. 678 01:05:49,674 --> 01:05:52,637 I return NIL. And does 12, 679 01:05:52,637 --> 01:05:56,666 14 overlap anything in the set? No. 680 01:05:56,666 --> 01:06:02,000 So, therefore, it always works. 681 01:06:02,000 --> 01:06:02,971 OK? OK. 682 01:06:02,971 --> 01:06:12,520 We're going to do correctness in a minute, but let's just do 683 01:06:12,520 --> 01:06:21,421 our analysis first so we don't have to do it because the 684 01:06:21,421 --> 01:06:30,000 correctness is going to be a little bit tricky. 685 01:06:30,000 --> 01:06:36,095 Time = O(lg n) because all I am doing is going down the tree. 686 01:06:36,095 --> 01:06:41,377 It takes time proportional to the height of the tree. 687 01:06:41,377 --> 01:06:46,457 That's pretty easy. If I need to list all overlaps, 688 01:06:46,457 --> 01:06:52,552 suppose I want to list all the overlaps, how quickly can I do 689 01:06:52,552 --> 01:06:55,701 that? Can somebody suggest how I 690 01:06:55,701 --> 01:07:02,000 could use this as a subroutine to list all overlaps? 691 01:07:13,000 --> 01:07:16,840 Suppose I have k overlaps, k intervals that overlap my 692 01:07:16,840 --> 01:07:21,043 query interval and I want to find every single one of them, 693 01:07:21,043 --> 01:07:23,000 how fast can I do that? 694 01:07:31,000 --> 01:07:33,000 How do I do it? 695 01:07:44,000 --> 01:07:49,271 How do I do it? If I search a second time, 696 01:07:49,271 --> 01:07:53,000 I might get the same value. 697 01:08:02,000 --> 01:08:04,400 Yeah, there you go. Do what? 698 01:08:04,400 --> 01:08:08,933 When you find it delete it. Put it over to the side. 699 01:08:08,933 --> 01:08:13,199 Find the next one, delete it until there are none 700 01:08:13,199 --> 01:08:16,133 left. 
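The search code written on the board does not appear in the transcript. The following is a minimal Python reconstruction from the spoken description, assuming closed intervals. The names `Node` and `interval_search` are illustrative, and the tree is wired by hand in one shape that is consistent with the two walk-throughs (keys 4, 5, 7, 15, 17, 21, with m values 8, 18 and 10 where the lecture reads them off); the actual drawing on the board may differ.

```python
from typing import Optional


class Node:
    def __init__(self, low, high, left=None, right=None):
        self.low, self.high = low, high   # interval stored at this node
        self.left, self.right = left, right
        # m = largest high endpoint in the subtree rooted here
        self.m = max([high] + [c.m for c in (left, right) if c is not None])


def interval_search(root: Optional[Node], low: int, high: int) -> Optional[Node]:
    """Return some node whose interval overlaps [low, high], or None."""
    x = root
    # Keep walking while x exists and its interval does NOT overlap the query.
    while x is not None and (low > x.high or x.low > high):
        if x.left is not None and low <= x.left.m:
            x = x.left    # if anything below x overlaps, something on the left does
        else:
            x = x.right   # nothing in the left subtree can overlap
    return x


# Hand-built tree keyed on low endpoints (shape is an assumption).
root = Node(17, 19,
            left=Node(5, 11,
                      left=Node(4, 8),
                      right=Node(15, 18, left=Node(7, 10))),
            right=Node(21, 23))

found = interval_search(root, 14, 16)    # walks root -> 5, 11 -> 15, 18
missing = interval_search(root, 12, 14)  # walks off the tree and returns None
```

Each iteration moves one level down, so the loop runs at most once per level, which is where the O(lg n) bound on a balanced tree comes from.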
And then, if I don't want to 701 01:08:16,133 --> 01:08:20,577 modify the data structure, insert them all back in. 702 01:08:20,577 --> 01:08:24,221 It costs me k lg n if they are k overlaps. 703 01:08:24,221 --> 01:08:30,000 That's actually called an output sensitive algorithm. 704 01:08:30,000 --> 01:08:34,064 Because the running time of it depends upon how much it 705 01:08:34,064 --> 01:08:37,000 outputs, so this is output sensitive. 706 01:08:42,000 --> 01:08:47,357 The best to date for this problem, by the way, 707 01:08:47,357 --> 01:08:54,380 of listing all is O(k+lg n) with a different data structure. 708 01:08:54,380 --> 01:08:59,738 And, actually, that was open for a while as an 709 01:08:59,738 --> 01:09:07,000 open problem. OK. Correctness. 710 01:09:12,000 --> 01:09:16,697 Why does this algorithm always work correctly? 711 01:09:16,697 --> 01:09:22,126 The key issue of the correctness is that I am picking 712 01:09:22,126 --> 01:09:25,049 one way to go, left or right. 713 01:09:25,049 --> 01:09:29,328 And that's great, as long as it is in that 714 01:09:29,328 --> 01:09:33,636 subtree. But how do I know that when I 715 01:09:33,636 --> 01:09:39,181 pick I decide I'm going to go left that it might not be in the 716 01:09:39,181 --> 01:09:42,636 right subtree and I went the wrong way? 717 01:09:42,636 --> 01:09:47,000 Or, if I went right, that I accidentally left one 718 01:09:47,000 --> 01:09:51,363 out on the left side? We're always going just one 719 01:09:51,363 --> 01:09:54,272 direction. And that's sort of the 720 01:09:54,272 --> 01:09:59,000 cleverness of the code. The theorem is let's let L be 721 01:09:59,000 --> 01:10:05,000 the set of intervals i prime in the left of a node x. 722 01:10:05,000 --> 01:10:14,106 And R be the set of i primes in the right of x. 723 01:10:14,106 --> 01:10:23,213 And now there are two parts I am going to show. 
724 01:10:23,213 --> 01:10:33,705 If the search goes right then the set of i prime in L, 725 01:10:33,705 --> 01:10:44,000 such that i prime overlaps i is the empty set. 726 01:10:44,000 --> 01:10:48,833 That's the first thing I do. If it goes right then there is 727 01:10:48,833 --> 01:10:52,250 nothing in the left subtree that overlaps. 728 01:10:52,250 --> 01:10:55,666 It's always, whenever the code goes right, 729 01:10:55,666 --> 01:11:00,583 no problem, because there was nothing in the left subtree to 730 01:11:00,583 --> 01:11:03,783 be found. Does everybody understand what 731 01:11:03,783 --> 01:11:05,982 that says? We are going to prove this, 732 01:11:05,982 --> 01:11:08,419 but I want to make sure people understand. 733 01:11:08,419 --> 01:11:11,986 Because the second one is going to be harder to understand so 734 01:11:11,986 --> 01:11:15,136 you've got to make sure you understand this one first. 735 01:11:15,136 --> 01:11:16,800 Any questions about this? OK. 736 01:11:16,800 --> 01:11:19,000 If the search goes left -- 737 01:11:27,000 --> 01:11:40,808 -- then the set of i prime in L such that i prime overlaps i 738 01:11:40,808 --> 01:11:49,000 being the empty set implies that the set of i prime in R such that i prime overlaps i is also the empty set. 739 01:12:00,000 --> 01:12:02,329 OK. What is this saying? 740 01:12:02,329 --> 01:12:06,987 If the search goes left, if the left was empty, 741 01:12:06,987 --> 01:12:10,936 in other words, if you went left and you 742 01:12:10,936 --> 01:12:16,000 discovered that there was nothing in there to find, 743 01:12:16,000 --> 01:12:21,568 no overlapping interval to find then it is OK because it 744 01:12:21,568 --> 01:12:27,443 wouldn't have helped me to go right anyway because there is 745 01:12:27,443 --> 01:12:32,000 nothing in the right to be found.
746 01:12:32,000 --> 01:12:37,809 So it is not guaranteeing that there is nothing to be found in 747 01:12:37,809 --> 01:12:43,333 the left, but if there happens to be nothing to find in the 748 01:12:43,333 --> 01:12:49,333 left it is OK because there was nothing to be found in the right 749 01:12:49,333 --> 01:12:52,571 either. That is what the second one 750 01:12:52,571 --> 01:12:54,476 says. In either case, 751 01:12:54,476 --> 01:13:00,000 you're OK to go the way. So let's do this proof. 752 01:13:05,000 --> 01:13:09,090 Does everybody understand what the proof says? 753 01:13:09,090 --> 01:13:12,090 Understanding the proof is tricky. 754 01:13:12,090 --> 01:13:14,545 It's logic. Logic is tricky. 755 01:13:14,545 --> 01:13:20,000 Suppose the search goes right. We'll do the first one. 756 01:13:27,000 --> 01:13:37,275 If left of x is NIL then we are done since we proved what we 757 01:13:37,275 --> 01:13:44,938 wanted to prove. If we go right there are two 758 01:13:44,938 --> 01:13:52,775 possibilities, either we have left of x be NIL 759 01:13:52,775 --> 01:14:00,389 or left of x is not NIL. So if it is NIL we are OK 760 01:14:00,389 --> 01:14:05,455 because we said if it goes right I want to prove this, 761 01:14:05,455 --> 01:14:10,904 that the things in the left subtree that overlap is empty. 762 01:14:10,904 --> 01:14:16,257 If there is nothing there, there is clearly nothing there 763 01:14:16,257 --> 01:14:20,080 that overlaps. Otherwise, the low of i is 764 01:14:20,080 --> 01:14:24,000 greater than m of the left of x. 765 01:14:29,000 --> 01:14:34,775 If I look at x here, either x was NIL in the while 766 01:14:34,775 --> 01:14:41,847 statement here or this is true. We just said it is not NIL so 767 01:14:41,847 --> 01:14:45,501 let's take a look at, excuse me. 768 01:14:45,501 --> 01:14:50,216 I'm on the wrong line. I am in this loop. 769 01:14:50,216 --> 01:14:55,756 Left of x was not NIL and the low of i was this. 
770 01:14:55,756 --> 01:15:01,530 Which way am I going here? I am going right. 771 01:15:01,530 --> 01:15:06,572 Therefore, this was not true. So either left of x was 772 01:15:06,572 --> 01:15:11,794 NIL, which was the first one, or low of i is greater than m 773 01:15:11,794 --> 01:15:14,675 of left of x if I am going right. 774 01:15:14,675 --> 01:15:19,176 If I'm going right one of those two had to be true. 775 01:15:19,176 --> 01:15:23,408 The first one was easy. Otherwise, we have this, 776 01:15:23,408 --> 01:15:28,000 low of i is greater than m of left of x. 777 01:15:28,000 --> 01:15:31,798 Now this has got to be that value. 778 01:15:31,798 --> 01:15:38,359 m of left of x is the right endpoint, is the high endpoint 779 01:15:38,359 --> 01:15:42,043 of some interval in that subtree. 780 01:15:42,043 --> 01:15:47,338 This is equal to the high of j for some j in L. 781 01:15:47,338 --> 01:15:54,129 So m of left of x must be equal to the high endpoint of some interval 782 01:15:54,129 --> 01:16:00,000 because that's how we're picking the m's. 783 01:16:00,000 --> 01:16:13,863 For some j in the left subtree. And no other interval in L has 784 01:16:13,863 --> 01:16:20,000 a larger high endpoint -- 785 01:16:27,000 --> 01:16:33,456 -- than high of j. If I draw a picture here, 786 01:16:33,456 --> 01:16:39,400 I have over here i and this is the low of i. 787 01:16:39,400 --> 01:16:47,557 And I have j where we say its high endpoint is less than the 788 01:16:47,557 --> 01:16:53,087 low of i. This is j, and I don't know how 789 01:16:53,087 --> 01:17:00,000 far over it goes. And this has high of j -- 790 01:17:08,000 --> 01:17:12,575 -- which is the highest one in the left subtree. 791 01:17:12,575 --> 01:17:18,026 There is nobody else who has got a higher right endpoint. 792 01:17:18,026 --> 01:17:23,282 There is nobody else in this subtree who could possibly 793 01:17:23,282 --> 01:17:30,000 overlap i, because all of them end somewhere before this point.
794 01:17:30,000 --> 01:17:38,076 This point is the highest one in a subtree. 795 01:17:38,076 --> 01:17:49,230 Therefore, i prime in L such that i prime overlaps i is the 796 01:17:49,230 --> 01:17:55,384 empty set. And now the hard case. 797 01:17:55,384 --> 01:18:00,786 Everybody stretch. Hard case. 798 01:18:00,786 --> 01:18:05,266 Does everybody follow this? The point is that because this 799 01:18:05,266 --> 01:18:09,039 is the highest guy everybody else has to be left, 800 01:18:09,039 --> 01:18:13,675 so if you didn't overlap the highest guy you're not going to 801 01:18:13,675 --> 01:18:18,000 overlap anybody. Suppose the search goes left -- 802 01:18:24,000 --> 01:18:34,000 -- and that there is nothing to overlap in the left subtree. 803 01:18:34,000 --> 01:18:38,777 I went left here but I am not going to find anything. 804 01:18:38,777 --> 01:18:43,922 Now I want to prove that it wouldn't have helped me to go 805 01:18:43,922 --> 01:18:46,954 right. That's essentially what the 806 01:18:46,954 --> 01:18:50,812 theorem here says. That if I assume this it 807 01:18:50,812 --> 01:18:53,752 wouldn't have helped to go right. 808 01:18:53,752 --> 01:19:00,000 I want to show that there is nothing in the right subtree. 809 01:19:00,000 --> 01:19:07,277 So going left was OK because I wasn't going to find anything 810 01:19:07,277 --> 01:19:11,348 anyway. Similarly, we go through a 811 01:19:11,348 --> 01:19:17,145 similar analysis. Low of i is less than or equal 812 01:19:17,145 --> 01:19:23,312 to m of the left of x, which once again is equal to 813 01:19:23,312 --> 01:19:34,053 the high of j for some j in L. We are just saying if I go left 814 01:19:34,053 --> 01:19:41,473 these things must be true. I went left. 815 01:19:41,473 --> 01:19:52,213 Since j is in L it doesn't overlap i, because the set of 816 01:19:52,213 --> 01:20:01,000 things that overlap i in L is empty set. 
817 01:20:01,000 --> 01:20:14,022 Since j doesn't overlap i that implies that the high of i must 818 01:20:14,022 --> 01:20:20,000 be less than the low of j. 819 01:20:25,000 --> 01:20:31,913 Since j is in L and it doesn't overlap i, what are the 820 01:20:31,913 --> 01:20:38,145 possibilities? We essentially have here, 821 01:20:38,145 --> 01:20:45,939 if I draw a picture, I have j in L and I have i 822 01:20:45,939 --> 01:20:51,412 here. The point is that it doesn't 823 01:20:51,412 --> 01:21:00,035 overlap it, therefore, it must be to the left because 824 01:21:00,035 --> 01:21:07,000 its low endpoint is less than this. 825 01:21:07,000 --> 01:21:11,659 But it doesn't overlap it, therefore its high endpoint 826 01:21:11,659 --> 01:21:15,000 must be left of the low of this one. 827 01:21:28,000 --> 01:21:30,000 Now we will use the binary search tree property. 828 01:21:37,000 --> 01:21:44,576 That implies that for all i prime in R, everything in the 829 01:21:44,576 --> 01:21:50,664 right subtree, we have low of j is less than 830 01:21:50,664 --> 01:21:57,835 or equal to low of i prime, so we're sorted on the low 831 01:21:57,835 --> 01:22:02,439 endpoints. Everything in the right subtree 832 01:22:02,439 --> 01:22:07,081 must have a low endpoint that starts to the right of the low 833 01:22:07,081 --> 01:22:10,464 endpoint of j because j is in the left subtree. 834 01:22:10,464 --> 01:22:15,106 And everything in the whole tree is sorted by low endpoints, 835 01:22:15,106 --> 01:22:19,355 so anything in the right subtree is going to start over 836 01:22:19,355 --> 01:22:21,558 here. Those are other things. 837 01:22:21,558 --> 01:22:25,964 These are the i primes in R. We don't know how many there 838 01:22:25,964 --> 01:22:31,000 are, but they all start to the right of this point. 839 01:22:31,000 --> 01:22:40,333 So they cannot overlap i either, therefore, 840 01:22:40,333 --> 01:22:50,555 there is nothing.
The set of i primes in R that overlap i is also 841 01:22:50,555 --> 01:22:53,000 nobody. 842 01:22:57,000 --> 01:23:02,942 Just to go back again, the basic idea is that since 843 01:23:02,942 --> 01:23:10,547 this guy doesn't overlap the guy who is in the left and everybody 844 01:23:10,547 --> 01:23:16,252 to the right is going to be further to the right, 845 01:23:16,252 --> 01:23:23,144 if I go left and don't find anything that's OK because I am 846 01:23:23,144 --> 01:23:28,255 not going to find anything over here anyway. 847 01:23:28,255 --> 01:23:35,147 They are not going to overlap. Data-structure augmentation, 848 01:23:35,147 --> 01:23:41,652 great stuff. It will give you a lot of rich, 849 01:23:41,652 --> 01:23:47,189 rich data structures built on the ones you know, 850 01:23:47,189 --> 01:23:52,137 hash tables, heaps, binary search trees and 851 01:23:52,137 --> 01:23:55,000 so forth.
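For reference, here is a minimal Python sketch of the search that the proof above reasons about. It is an illustration under stated assumptions, not the lecture's own code: the names `Node`, `insert`, `overlaps` and `interval_search` are hypothetical, intervals are closed `(low, high)` pairs, and the tree is a plain unbalanced binary search tree keyed on low endpoints rather than a balanced one, purely to keep the example short. Each node stores m, the largest high endpoint in its subtree, and the search goes left exactly when the left subtree exists and low of i is less than or equal to m of left of x. The intervals in the usage lines are illustrative values.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Interval = Tuple[int, int]  # closed interval (low, high); an assumption of this sketch


@dataclass
class Node:
    interval: Interval
    m: int  # largest high endpoint anywhere in this node's subtree
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root: Optional[Node], interval: Interval) -> Node:
    """Unbalanced BST insert keyed on the low endpoint; m is updated on the way down."""
    if root is None:
        return Node(interval, interval[1])
    root.m = max(root.m, interval[1])
    if interval[0] < root.interval[0]:
        root.left = insert(root.left, interval)
    else:
        root.right = insert(root.right, interval)
    return root


def overlaps(a: Interval, b: Interval) -> bool:
    """Two closed intervals overlap when neither lies entirely to one side of the other."""
    return a[0] <= b[1] and b[0] <= a[1]


def interval_search(root: Optional[Node], i: Interval) -> Optional[Interval]:
    """Return some stored interval that overlaps i, or None if there is none."""
    x = root
    while x is not None and not overlaps(i, x.interval):
        if x.left is not None and i[0] <= x.left.m:
            # Going left: if nothing here overlaps i, nothing on the right does either.
            x = x.left
        else:
            # Going right: left is empty, or every interval there ends before low of i.
            x = x.right
    return None if x is None else x.interval


# Usage with illustrative intervals.
root = None
for iv in [(17, 19), (5, 11), (22, 23), (4, 8), (15, 18), (7, 10)]:
    root = insert(root, iv)

print(interval_search(root, (14, 16)))  # (15, 18)
print(interval_search(root, (12, 14)))  # None
```

The two branches of the loop correspond to the two cases of the proof: the `else` branch is the easy case (going right is safe), and the `if` branch is the hard case (going left and failing means the right subtree would have failed too). Because the search follows a single root-to-leaf path, its cost is proportional to the height of the tree, which is why a balanced tree is the natural choice in practice.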