1 00:00:00,050 --> 00:00:01,770 The following content is provided 2 00:00:01,770 --> 00:00:04,010 under a Creative Commons license. 3 00:00:04,010 --> 00:00:06,860 Your support will help MIT OpenCourseWare continue 4 00:00:06,860 --> 00:00:10,720 to offer high quality educational resources for free. 5 00:00:10,720 --> 00:00:13,330 To make a donation or view additional materials 6 00:00:13,330 --> 00:00:17,226 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:17,226 --> 00:00:17,851 at ocw.mit.edu. 8 00:00:21,830 --> 00:00:23,830 PROFESSOR: If you guys want me to cover anything 9 00:00:23,830 --> 00:00:25,300 in particular, is there anything you 10 00:00:25,300 --> 00:00:26,508 didn't understand in lecture? 11 00:00:29,840 --> 00:00:32,640 In the last section, I covered the recursion trees 12 00:00:32,640 --> 00:00:34,170 because they will be on the Pset, 13 00:00:34,170 --> 00:00:35,940 and people said they were a bit unclear, 14 00:00:35,940 --> 00:00:42,312 so we can do that and cover less of the stuff that I have here. 15 00:00:42,312 --> 00:00:44,770 Or if there's anything else, you can tell me what you want. 16 00:00:44,770 --> 00:00:48,040 So there I cover recursion trees because someone said, hey, 17 00:00:48,040 --> 00:00:50,000 can you go over that again? 18 00:00:50,000 --> 00:00:51,310 Is there any pain points? 19 00:00:56,620 --> 00:00:57,120 No? 20 00:00:57,120 --> 00:00:57,936 OK. 21 00:00:57,936 --> 00:00:59,810 So then I'm going to give you the same choice 22 00:00:59,810 --> 00:01:02,310 that I gave to people last time, and that 23 00:01:02,310 --> 00:01:06,320 is we can go over recursion trees again, 24 00:01:06,320 --> 00:01:08,630 but if I do that, then I won't have time to go over 25 00:01:08,630 --> 00:01:11,129 the code for deleting a node from a binary search tree. 26 00:01:11,129 --> 00:01:12,920 So we'll go through the theory and you guys 27 00:01:12,920 --> 00:01:15,070 will have to go through the code on your own. 
28 00:01:15,070 --> 00:01:19,460 But instead, we'll go over recursion trees again 29 00:01:19,460 --> 00:01:21,770 and remember how you solve a recurrence using 30 00:01:21,770 --> 00:01:23,450 recursion trees. 31 00:01:23,450 --> 00:01:25,120 The alternative is we don't do that 32 00:01:25,120 --> 00:01:29,982 and we complete the deletions part. 33 00:01:29,982 --> 00:01:31,690 AUDIENCE: I feel like covering deletions, 34 00:01:31,690 --> 00:01:34,358 since we didn't do that in lecture, that would probably 35 00:01:34,358 --> 00:01:36,032 be more helpful. 36 00:01:36,032 --> 00:01:37,240 PROFESSOR: Let's take a vote. 37 00:01:37,240 --> 00:01:40,730 Who wants to do deletions in painstaking detail? 38 00:01:40,730 --> 00:01:45,410 So deletions and not recursion? 39 00:01:45,410 --> 00:01:49,452 Who wants to do recursion trees and not deletion? 40 00:01:49,452 --> 00:01:50,580 AUDIENCE: It's about equal. 41 00:01:50,580 --> 00:01:52,205 PROFESSOR: It's equal and nobody cares. 42 00:01:52,205 --> 00:01:52,870 I'm really sad. 43 00:01:58,376 --> 00:02:00,230 AUDIENCE: Let's do both in half detail. 44 00:02:00,230 --> 00:02:01,310 PROFESSOR: OK, sure. 45 00:02:01,310 --> 00:02:04,080 Who remembers merge sort? 46 00:02:04,080 --> 00:02:06,490 What does merge sort do really quick? 47 00:02:06,490 --> 00:02:10,050 AUDIENCE: It takes some sort of unsorted array, 48 00:02:10,050 --> 00:02:13,310 splits it in half, and then continually splits it, 49 00:02:13,310 --> 00:02:15,480 and then once it finally gets to the point 50 00:02:15,480 --> 00:02:20,540 where you have arrays of two elements, then it sorts them, 51 00:02:20,540 --> 00:02:23,088 and then sorts those, and then sorts those. 52 00:02:23,088 --> 00:02:24,669 It's a fun thing. 53 00:02:24,669 --> 00:02:25,960 And then it merges [INAUDIBLE]. 54 00:02:25,960 --> 00:02:28,680 PROFESSOR: That's so much code. 
55 00:02:28,680 --> 00:02:31,360 I don't like to write much code because for every line of code 56 00:02:31,360 --> 00:02:33,151 that you write, you might have a bug in it, 57 00:02:33,151 --> 00:02:34,530 so I like to write less code. 58 00:02:34,530 --> 00:02:37,950 So the way I do it is when I get to an array size of one 59 00:02:37,950 --> 00:02:40,550 element, I know it's already sorted. 60 00:02:40,550 --> 00:02:41,640 So merge sort. 61 00:02:41,640 --> 00:02:43,380 You have an array, it's unsorted. 62 00:02:43,380 --> 00:02:47,030 Split it into two halves, call merge sort on each half, 63 00:02:47,030 --> 00:02:49,780 assume that magically, they're going to come back sorted, 64 00:02:49,780 --> 00:02:52,300 and then you merge the sorted halves. 65 00:02:52,300 --> 00:02:53,810 How much time does merging take? 66 00:02:57,590 --> 00:02:58,460 OK. 67 00:02:58,460 --> 00:03:03,042 So the recursion for the running time of merge sort? 68 00:03:03,042 --> 00:03:04,536 AUDIENCE: Why does it take n time? 69 00:03:04,536 --> 00:03:05,161 Just too large? 70 00:03:05,161 --> 00:03:06,869 AUDIENCE: Isn't it the finger thing where 71 00:03:06,869 --> 00:03:09,018 you take each element, and you're like, this one, 72 00:03:09,018 --> 00:03:12,260 is that greater or less than, then you put it in the array. 73 00:03:12,260 --> 00:03:12,760 So you get-- 74 00:03:12,760 --> 00:03:16,694 PROFESSOR: Please take my word for it that it's order n. 75 00:03:16,694 --> 00:03:18,860 AUDIENCE: I'll explain it and then I'll be confused. 76 00:03:21,670 --> 00:03:22,985 PROFESSOR: OK, so order n. 77 00:03:22,985 --> 00:03:24,075 What's the recursion? 78 00:03:28,422 --> 00:03:30,130 Don't give me the solution because then I 79 00:03:30,130 --> 00:03:32,530 can't do the trees anymore, so give me the recursion 80 00:03:32,530 --> 00:03:33,675 before it's solved. 81 00:03:33,675 --> 00:03:37,160 Give me the recurrence formula. 
82 00:03:37,160 --> 00:03:39,566 So it starts with T of N, right? 83 00:03:39,566 --> 00:03:43,117 AUDIENCE: It starts with N over 2 plus N, I think. 84 00:03:46,844 --> 00:03:47,635 PROFESSOR: Perfect. 85 00:03:50,730 --> 00:03:52,690 So you take the array, you split it into two, 86 00:03:52,690 --> 00:03:56,150 you call merge sort on the two halves of the arrays. 87 00:03:56,150 --> 00:03:57,670 So you call merge sort twice. 88 00:03:57,670 --> 00:03:59,140 That's why you have a 2 here. 89 00:03:59,140 --> 00:03:59,800 The 2 matters. 90 00:03:59,800 --> 00:04:01,970 Without it, you get a different answer. 91 00:04:01,970 --> 00:04:04,040 And when you call it, the arrays that you 92 00:04:04,040 --> 00:04:06,910 give it are half the size, and then merge 93 00:04:06,910 --> 00:04:08,760 takes order N time. 94 00:04:08,760 --> 00:04:11,750 Splitting depends on what you're using to store your arrays. 95 00:04:11,750 --> 00:04:14,860 Can be constant time or it can be order N. So the time 96 00:04:14,860 --> 00:04:17,810 won't change because of split. 97 00:04:17,810 --> 00:04:20,845 How do we solve this recurrence? 98 00:04:24,162 --> 00:04:25,870 The recursion tree method says that we're 99 00:04:25,870 --> 00:04:27,680 going to draw a call graph. 100 00:04:27,680 --> 00:04:29,200 So we start out with a call to merge 101 00:04:29,200 --> 00:04:31,910 sort with an array of size N. Then 102 00:04:31,910 --> 00:04:33,610 it's going to call merge sort again, 103 00:04:33,610 --> 00:04:35,000 but after the array is split. 104 00:04:35,000 --> 00:04:38,730 So it's going to call merge sort twice, size is N over 2. 105 00:04:41,390 --> 00:04:45,680 This guy gets an array of N over 2, calls merge sort. 106 00:04:45,680 --> 00:04:50,760 Two arrays, sizes N over 4, N over 4. 107 00:04:50,760 --> 00:04:51,630 This does the same. 
108 00:04:54,940 --> 00:05:00,420 So this goes on forever and ever and ever until at some point 109 00:05:00,420 --> 00:05:03,240 we reach our base case. 110 00:05:03,240 --> 00:05:05,940 So we're going to have a bunch of calls 111 00:05:05,940 --> 00:05:09,460 here where the array size is? 112 00:05:09,460 --> 00:05:10,521 What's our base case? 113 00:05:10,521 --> 00:05:11,020 1. 114 00:05:11,020 --> 00:05:11,519 Excellent. 115 00:05:15,870 --> 00:05:18,240 So this is the call graph for merge sort, and let's 116 00:05:18,240 --> 00:05:20,930 put the base case here so we know what we're talking about. 117 00:05:20,930 --> 00:05:23,370 T of 1 is theta 1. 118 00:05:25,876 --> 00:05:27,250 Now inside the nodes, we're going 119 00:05:27,250 --> 00:05:32,150 to put the cost for each call without counting 120 00:05:32,150 --> 00:05:34,770 the sub-call, so the children here. 121 00:05:34,770 --> 00:05:38,370 That's this guy here, except instead of order N, 122 00:05:38,370 --> 00:05:40,160 I will write CN. 123 00:05:40,160 --> 00:05:42,870 Remember how sometimes we use CN instead 124 00:05:42,870 --> 00:05:45,500 of the order of notation? 125 00:05:45,500 --> 00:05:49,150 The reason we do that is if I put in the asymptotic notation, 126 00:05:49,150 --> 00:05:52,930 then we're going to be tempted to sum them up. 127 00:05:52,930 --> 00:05:56,950 You're allowed to sum terms using asymptotic notation as 128 00:05:56,950 --> 00:05:59,200 long as there's a finite number of them, 129 00:05:59,200 --> 00:06:02,420 but here, it turns out there's an infinite number of them. 130 00:06:02,420 --> 00:06:05,150 Also, if you go this way, you can never go wrong. 131 00:06:05,150 --> 00:06:08,030 You always get the right answer, so that's 132 00:06:08,030 --> 00:06:11,620 why we switch from order N to CN. 
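The merge sort procedure described above (split in half, recurse on each half, treat a one-element array as already sorted, then merge in order N time) can be sketched roughly as follows. This is a minimal illustration, not the course's reference code; the names `merge` and `merge_sort` and the use of Python list slicing for the split are assumptions made for the example.

```python
def merge(left, right):
    """Merge two already-sorted lists into one sorted list.

    Each element is appended exactly once, so the work is linear
    in len(left) + len(right).
    """
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # At most one of these two tails is non-empty.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def merge_sort(items):
    """Return a sorted copy of items."""
    # Base case: zero or one element is already sorted.
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    # Two recursive calls on halves: the 2T(N/2) part of the recurrence.
    # The merge is the order N part.
    return merge(merge_sort(items[:mid]), merge_sort(items[mid:]))
```

Slicing copies the halves, so the split here costs order N rather than constant time; as noted in the discussion, that does not change the recurrence.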
133 00:06:11,620 --> 00:06:14,980 In order to merge sort an array of size N, 134 00:06:14,980 --> 00:06:18,100 we're going to merge sort two arrays of size N over 2 135 00:06:18,100 --> 00:06:21,690 and then spend CN time on doing the merge. 136 00:06:21,690 --> 00:06:23,355 What are the costs here? 137 00:06:26,050 --> 00:06:28,140 To sort an array of N over 2, what's 138 00:06:28,140 --> 00:06:31,670 the cost outside the cost to merge? 139 00:06:31,670 --> 00:06:33,510 AUDIENCE: C of N over 2. 140 00:06:33,510 --> 00:06:34,860 PROFESSOR: Perfect. 141 00:06:34,860 --> 00:06:37,000 C times N over 2. 142 00:06:37,000 --> 00:06:39,410 C times N over 2. 143 00:06:39,410 --> 00:06:41,046 How about here? 144 00:06:41,046 --> 00:06:42,978 AUDIENCE: C times N over 4. 145 00:06:42,978 --> 00:06:45,876 PROFESSOR: Perfect. 146 00:06:45,876 --> 00:06:48,850 CN over 4. 147 00:06:48,850 --> 00:06:50,610 My nodes are really ugly. 148 00:06:50,610 --> 00:06:52,860 I should have drawn them like this from the beginning. 149 00:06:52,860 --> 00:06:54,270 CN over 4. 150 00:06:54,270 --> 00:06:55,250 There you go. 151 00:06:55,250 --> 00:06:57,620 How about down here? 152 00:06:57,620 --> 00:07:01,142 AUDIENCE: C of N over 2 to the i. 153 00:07:01,142 --> 00:07:03,490 PROFESSOR: You're going one step ahead. 154 00:07:03,490 --> 00:07:05,610 We'll do that right next. 155 00:07:05,610 --> 00:07:08,540 AUDIENCE: C of N over log N, right? 156 00:07:08,540 --> 00:07:11,220 Because there are log N levels, so-- 157 00:07:11,220 --> 00:07:14,140 PROFESSOR: Let's not worry about the number of levels. 158 00:07:14,140 --> 00:07:15,980 You're ruining my steps. 159 00:07:15,980 --> 00:07:18,455 I was going to get to that two steps after this. 160 00:07:18,455 --> 00:07:19,564 AUDIENCE: Is it just C? 161 00:07:19,564 --> 00:07:20,320 PROFESSOR: Yep. 162 00:07:20,320 --> 00:07:21,840 So array size is 1, right? 163 00:07:21,840 --> 00:07:28,140 So the cost is C. C, C, C, C. 
OK, you guys got it 164 00:07:28,140 --> 00:07:30,179 if you're thinking of levels already. 165 00:07:30,179 --> 00:07:31,720 The next thing I want to do is I want 166 00:07:31,720 --> 00:07:34,740 to figure out how many levels I have in this tree. 167 00:07:34,740 --> 00:07:36,600 Why do I care about that? 168 00:07:36,600 --> 00:07:40,900 The answer for T of N is the sum of all these costs in here 169 00:07:40,900 --> 00:07:44,550 because the cost of merge sorting an array of size N 170 00:07:44,550 --> 00:07:49,940 is the merge sort plus the costs for sorting the two arrays. 171 00:07:49,940 --> 00:07:52,890 And the nodes here keep track of all the time spent 172 00:07:52,890 --> 00:07:56,840 in recursive sub-calls, so if we can add up everything up, 173 00:07:56,840 --> 00:08:01,880 we have the answer to T of N. It turns out the easiest 174 00:08:01,880 --> 00:08:06,050 way to do that is to sum up the cost at each level 175 00:08:06,050 --> 00:08:10,875 because the costs are this guy copied over here. 176 00:08:10,875 --> 00:08:13,800 For a level, they tend to be the same, 177 00:08:13,800 --> 00:08:17,360 so it's reasonably easy to add them up, 178 00:08:17,360 --> 00:08:19,430 except in order to be able to add those up, 179 00:08:19,430 --> 00:08:22,980 you have to know how many levels you have. 180 00:08:22,980 --> 00:08:26,920 So how do I know how many levels I have? 181 00:08:26,920 --> 00:08:30,470 Someone already told me log N. How do I get to that log N? 182 00:08:35,980 --> 00:08:40,440 So when I get to the bottommost level, 183 00:08:40,440 --> 00:08:43,460 the number has to be 1, the number next to the node, 184 00:08:43,460 --> 00:08:45,080 because that's my base case. 185 00:08:45,080 --> 00:08:47,490 When I have a one element array, it's sorted, I'm done. 186 00:08:47,490 --> 00:08:49,650 I return. 
187 00:08:49,650 --> 00:08:53,130 So I can say that for each level, 188 00:08:53,130 --> 00:08:58,150 the number next to the node is something as a function of L. 189 00:08:58,150 --> 00:08:59,750 Here, I'm going to say that this is 190 00:08:59,750 --> 00:09:06,850 N over 1, which is N over 2 to the 0 power. 191 00:09:06,850 --> 00:09:11,480 And this is N over 2, so it's N over 2 to the first power. 192 00:09:11,480 --> 00:09:16,280 This is N over 2 to the second, and so on and so forth. 193 00:09:16,280 --> 00:09:18,476 It might not be obvious if you only have two levels. 194 00:09:18,476 --> 00:09:20,100 I don't want to draw a lot on the board 195 00:09:20,100 --> 00:09:23,180 because I don't have a lot of space and I'd get my nodes 196 00:09:23,180 --> 00:09:25,420 all messed into each other. 197 00:09:25,420 --> 00:09:28,690 If it takes more than two levels to see the pattern, go for it. 198 00:09:28,690 --> 00:09:31,440 Expand for three levels, four levels, five levels, 199 00:09:31,440 --> 00:09:35,070 whatever it takes to get it right on a Pset or on a test. 200 00:09:35,070 --> 00:09:37,430 So you see the pattern, then you write the formula 201 00:09:37,430 --> 00:09:42,140 for the node size at the level. 202 00:09:42,140 --> 00:09:44,770 And assuming this pattern holds, we 203 00:09:44,770 --> 00:09:50,270 see that the size of a node at level l, the size 204 00:09:50,270 --> 00:09:56,880 is N over 2 to the l minus 1. 205 00:09:56,880 --> 00:09:58,330 Fair enough? 206 00:09:58,330 --> 00:10:01,770 You can say N over 2 to the l, and forget 207 00:10:01,770 --> 00:10:04,730 that there's a minus 1, and then the asymptotics will save you, 208 00:10:04,730 --> 00:10:09,680 so it's no big deal, but this is the real number. 209 00:10:09,680 --> 00:10:16,310 So that means that at the bottommost level, at level l, 210 00:10:16,310 --> 00:10:18,900 this size is going to be 1. 211 00:10:18,900 --> 00:10:24,110 N over 2 to the l minus 1 equals 1. 
212 00:10:24,110 --> 00:10:26,620 So now this is an equation, so I can solve for l. 213 00:10:26,620 --> 00:10:31,630 I pull this on the right side, N equals 2 to the l minus 1, 214 00:10:31,630 --> 00:10:37,771 so l minus 1 equals-- anyone? 215 00:10:37,771 --> 00:10:39,020 The inverse of an exponential? 216 00:10:41,720 --> 00:10:43,220 AUDIENCE: I wasn't paying attention. 217 00:10:43,220 --> 00:10:45,390 Sorry. 218 00:10:45,390 --> 00:10:50,837 AUDIENCE: Log N. 219 00:10:50,837 --> 00:10:53,170 PROFESSOR: The inverse of an exponential is a logarithm. 220 00:10:53,170 --> 00:10:58,470 Keep that in mind for solving 6.006 problems. 221 00:10:58,470 --> 00:11:06,289 l minus 1 is log N so l is log n plus 1, roughly log n. 222 00:11:06,289 --> 00:11:08,330 I could use log n plus 1 and go through the math. 223 00:11:08,330 --> 00:11:10,580 It's a bit more painful and, because we're 224 00:11:10,580 --> 00:11:13,940 using asymptotics, it doesn't really matter. 225 00:11:13,940 --> 00:11:15,610 So now we know how many levels we have. 226 00:11:15,610 --> 00:11:18,050 Let's see what's the cost at the level. 227 00:11:18,050 --> 00:11:19,740 So all the calls at a certain level, 228 00:11:19,740 --> 00:11:21,640 what's the sum of the costs? 229 00:11:21,640 --> 00:11:25,400 For this level, what's the cost? 230 00:11:29,600 --> 00:11:30,100 CN. 231 00:11:30,100 --> 00:11:32,310 And that was the easy question. 232 00:11:32,310 --> 00:11:33,635 Just the root, right? 233 00:11:33,635 --> 00:11:34,510 How about this level? 234 00:11:38,770 --> 00:11:43,180 Because I have two nodes, the cost in each node is CN over 2. 235 00:11:43,180 --> 00:11:44,060 How about this level? 236 00:11:49,800 --> 00:11:52,110 Four nodes, each node CN over 4. 237 00:11:52,110 --> 00:11:53,340 How about the bottom level? 238 00:11:58,752 --> 00:12:01,680 AUDIENCE: CN. 239 00:12:01,680 --> 00:12:04,486 PROFESSOR: Why is it CN? 
240 00:12:04,486 --> 00:12:08,440 AUDIENCE: Because there are N arrays of size 1. 241 00:12:08,440 --> 00:12:09,710 PROFESSOR: N arrays of size 1. 242 00:12:09,710 --> 00:12:11,530 Excellent. 243 00:12:11,530 --> 00:12:14,510 A cute argument I heard once is you start out with N, 244 00:12:14,510 --> 00:12:16,910 you split it into N over 2 and N over 2. 245 00:12:16,910 --> 00:12:19,520 Then you split this guy in N over 4, N over 4, 246 00:12:19,520 --> 00:12:21,980 so this is like conservation of mass. 247 00:12:21,980 --> 00:12:26,380 If you start with N and here, you don't end up with N total, 248 00:12:26,380 --> 00:12:30,210 then you lost some element somewhere on the way. 249 00:12:30,210 --> 00:12:32,100 So CN. 250 00:12:32,100 --> 00:12:33,310 CN, CN, CN, CN. 251 00:12:33,310 --> 00:12:34,530 I think I see a pattern. 252 00:12:34,530 --> 00:12:38,540 I think it's reasonable to say that for every level, it's CN. 253 00:12:38,540 --> 00:12:40,950 And if you write the proof, you can prove that 254 00:12:40,950 --> 00:12:44,300 by using math instead of waving hands. 255 00:12:44,300 --> 00:12:47,420 So CN times the number of levels, right? 256 00:12:47,420 --> 00:12:56,040 The answer for this guy is T of N is CN times l. 257 00:12:56,040 --> 00:12:57,586 What's l? 258 00:12:57,586 --> 00:12:59,289 AUDIENCE: N log N. 259 00:12:59,289 --> 00:13:00,080 PROFESSOR: Roughly. 260 00:13:00,080 --> 00:13:10,240 OK. So order of N log N. C becomes 261 00:13:10,240 --> 00:13:14,630 order of, l is order of log N, N stays the same. 262 00:13:18,320 --> 00:13:18,970 Any questions? 263 00:13:24,100 --> 00:13:27,837 Are people getting it or did I confuse you even more? 264 00:13:27,837 --> 00:13:28,670 AUDIENCE: We got it. 265 00:13:28,670 --> 00:13:29,897 PROFESSOR: OK, sweet. 266 00:13:29,897 --> 00:13:31,230 Thank you for the encouragement. 267 00:13:31,230 --> 00:13:36,200 So this gets you through problem one of Pset 2. 
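The level-by-level sum in the recursion tree argument above can be checked numerically. The sketch below assumes N is a power of two so the halves split evenly, and the function name `level_costs` and the parameter `c` (standing in for the constant C) are made up for this illustration.

```python
def level_costs(n, c=1):
    """Return the total cost at each level of the merge sort recursion tree.

    Level 1 is the root (one node of size n); the last level holds
    n nodes of size 1. Assumes n is a power of two.
    """
    if n < 1 or n & (n - 1) != 0:
        raise ValueError("this sketch assumes n is a power of two")
    costs = []
    nodes, size = 1, n
    while size >= 1:
        # nodes * (c * size) is always c * n: the "conservation of mass" argument.
        costs.append(nodes * c * size)
        nodes *= 2
        size //= 2
    return costs


costs = level_costs(8)
print(costs)       # every level costs c * n
print(len(costs))  # number of levels: log2(n) + 1
print(sum(costs))  # total: c * n * (log2(n) + 1)
```

For N = 8 there are four levels of cost 8 each, which matches l = log N + 1 and a total of CN times l.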
268 00:13:36,200 --> 00:13:38,400 So in this case, the tree is nicely balanced. 269 00:13:38,400 --> 00:13:40,284 The cost at each level is the same. 270 00:13:40,284 --> 00:13:42,700 When [INAUDIBLE] talked about recursion trees in lectures, 271 00:13:42,700 --> 00:13:45,230 he showed two more trees, one where 272 00:13:45,230 --> 00:13:49,350 pretty much all the cost was up here-- the cost of the children 273 00:13:49,350 --> 00:13:54,000 was negligible-- and one tree where all the cost was 274 00:13:54,000 --> 00:13:57,240 concentrated here, so the cost of all the inner nodes 275 00:13:57,240 --> 00:14:01,160 was negligible and the leaves were doing all the real work. 276 00:14:01,160 --> 00:14:03,804 So don't be scared if your costs aren't the same. 277 00:14:03,804 --> 00:14:05,970 Just sum them up and you'll get to the right answer. 278 00:14:11,470 --> 00:14:13,770 Now I'm going to talk about binary search trees, 279 00:14:13,770 --> 00:14:18,920 except I will make a five minute general talk about data 280 00:14:18,920 --> 00:14:20,700 structures before I do that. 281 00:14:20,700 --> 00:14:22,310 So we use the term "data structures." 282 00:14:22,310 --> 00:14:24,296 I think we covered it well, and I 283 00:14:24,296 --> 00:14:25,670 want to give you a couple of tips 284 00:14:25,670 --> 00:14:29,280 for dealing with them on Pset 1. 285 00:14:29,280 --> 00:14:31,390 A data structure is a bunch of algorithms 286 00:14:31,390 --> 00:14:35,380 that help you store and then retrieve information. 287 00:14:35,380 --> 00:14:36,960 You have two types of algorithms. 288 00:14:36,960 --> 00:14:44,795 You have queries, and you have updates. 289 00:14:49,850 --> 00:14:51,920 You start out with an empty data structure, 290 00:14:51,920 --> 00:14:55,870 like an empty binary search tree or an empty list, 291 00:14:55,870 --> 00:14:58,370 and then you throw some data at it. 292 00:14:58,370 --> 00:14:59,840 That's when you update it. 
293 00:14:59,840 --> 00:15:03,129 Then you ask it some questions, and that's when you query it. 294 00:15:03,129 --> 00:15:05,670 Then maybe you throw more data at it, so you do more updates, 295 00:15:05,670 --> 00:15:07,794 and you ask more questions, so you do more queries. 296 00:15:10,190 --> 00:15:12,460 What are the queries and the updates 297 00:15:12,460 --> 00:15:15,200 for the binary search trees that we talked about in lecture? 298 00:15:18,660 --> 00:15:21,140 AUDIENCE: A query would be like, what's 299 00:15:21,140 --> 00:15:23,977 your right child, what's your left child? 300 00:15:23,977 --> 00:15:25,310 PROFESSOR: So that's for a node. 301 00:15:28,502 --> 00:15:29,960 AUDIENCE: What are you looking for? 302 00:15:29,960 --> 00:15:31,460 PROFESSOR: I'm looking for something 303 00:15:31,460 --> 00:15:32,812 for the entire data structure. 304 00:15:32,812 --> 00:15:34,520 So for the entire tree, what's a question 305 00:15:34,520 --> 00:15:36,090 that you would ask the tree? 306 00:15:36,090 --> 00:15:37,147 AUDIENCE: Max. 307 00:15:37,147 --> 00:15:37,730 PROFESSOR: OK. 308 00:15:46,750 --> 00:15:47,250 Min. 309 00:15:50,914 --> 00:15:51,830 AUDIENCE: Next larger. 310 00:15:51,830 --> 00:15:52,460 PROFESSOR: Next larger. 311 00:15:52,460 --> 00:15:53,940 Are you looking at the nodes? 312 00:15:58,640 --> 00:16:02,860 AUDIENCE: Is there an are you balanced question? 313 00:16:02,860 --> 00:16:05,665 PROFESSOR: Well, I would say that the most popular operation 314 00:16:05,665 --> 00:16:08,980 in a binary search tree is Search, which 315 00:16:08,980 --> 00:16:12,560 looks for-- we call it Find in the code 316 00:16:12,560 --> 00:16:16,640 because most code implementations call it Find 317 00:16:16,640 --> 00:16:18,359 nowadays, but binary search tree. 318 00:16:18,359 --> 00:16:19,650 What are you going to do in it? 319 00:16:19,650 --> 00:16:20,870 You search for a value. 
320 00:16:20,870 --> 00:16:23,920 That's why it has the Search in binary search. 321 00:16:23,920 --> 00:16:27,580 So queries are operations where you ask questions to the data 322 00:16:27,580 --> 00:16:31,540 structure and it doesn't change. 323 00:16:31,540 --> 00:16:32,400 How about updates? 324 00:16:32,400 --> 00:16:33,940 What did we learn for updates? 325 00:16:36,712 --> 00:16:37,640 AUDIENCE: Insert. 326 00:16:37,640 --> 00:16:39,740 PROFESSOR: Excellent. 327 00:16:39,740 --> 00:16:41,630 So Insert was covered in lecture, 328 00:16:41,630 --> 00:16:44,400 and we're doing Delete today. 329 00:16:49,740 --> 00:16:53,030 So data structures have this property 330 00:16:53,030 --> 00:16:56,460 that's called the representation invariant, RI, 331 00:16:56,460 --> 00:16:57,930 or Rep Invariant. 332 00:17:05,819 --> 00:17:09,835 Actually, before I get there, the rep invariant 333 00:17:09,835 --> 00:17:12,640 says that the data in the data structures 334 00:17:12,640 --> 00:17:14,829 is organized in this way, and as long 335 00:17:14,829 --> 00:17:17,349 as it's organized in this way, the data structure functions 336 00:17:17,349 --> 00:17:18,490 correctly. 337 00:17:18,490 --> 00:17:20,810 Can someone guess for a sorted array 338 00:17:20,810 --> 00:17:24,829 what's the representation invariant? 339 00:17:24,829 --> 00:17:27,294 AUDIENCE: It can mean sorted. 340 00:17:27,294 --> 00:17:27,960 PROFESSOR: Yeah. 341 00:17:27,960 --> 00:17:29,350 A sorted array should be sorted. 342 00:17:29,350 --> 00:17:31,690 Sounds like a very good rep invariant. 343 00:17:31,690 --> 00:17:33,530 So the elements should be stored in an array. 344 00:17:33,530 --> 00:17:38,340 Every element should be smaller than any element after it. 
345 00:17:38,340 --> 00:17:42,370 And as long as the rep invariant holds, so as long 346 00:17:42,370 --> 00:17:44,770 as elements are stored in the right way in the data 347 00:17:44,770 --> 00:17:48,890 structure, the queries will return the right results. 348 00:17:48,890 --> 00:17:50,360 If the rep invariant doesn't hold, 349 00:17:50,360 --> 00:17:53,364 then God knows what's going to happen. 350 00:17:53,364 --> 00:17:54,780 What can you do in a sorted array 351 00:17:54,780 --> 00:17:56,800 as long as the rep invariant holds? 352 00:18:00,970 --> 00:18:01,927 Sorted array. 353 00:18:01,927 --> 00:18:04,010 What's the reason why I would have a sorted array? 354 00:18:04,010 --> 00:18:06,390 What can I do that's fast in a sorted array? 355 00:18:06,390 --> 00:18:08,660 AUDIENCE: Min and Max. 356 00:18:08,660 --> 00:18:10,140 PROFESSOR: I can do that very fast. 357 00:18:10,140 --> 00:18:10,790 That's good. 358 00:18:10,790 --> 00:18:12,414 What's the running time? 359 00:18:12,414 --> 00:18:13,290 AUDIENCE: A constant. 360 00:18:13,290 --> 00:18:14,219 PROFESSOR: Perfect. 361 00:18:14,219 --> 00:18:16,510 Min you look at the beginning, Max you look at the end. 362 00:18:16,510 --> 00:18:17,140 Yes? 363 00:18:17,140 --> 00:18:18,210 AUDIENCE: Binary search. 364 00:18:18,210 --> 00:18:19,251 PROFESSOR: Binary search. 365 00:18:19,251 --> 00:18:21,130 That's the other reason for that. 366 00:18:21,130 --> 00:18:23,090 So binary search runs in order log 367 00:18:23,090 --> 00:18:26,690 N time, doesn't have to look at most of the array, 368 00:18:26,690 --> 00:18:28,690 tells you whether an element is there or not. 369 00:18:28,690 --> 00:18:31,100 Now, what if the array is unsorted? 370 00:18:31,100 --> 00:18:33,840 Will the algorithm work? 371 00:18:33,840 --> 00:18:35,270 It might say something isn't there 372 00:18:35,270 --> 00:18:36,353 when it actually is there. 373 00:18:36,353 --> 00:18:38,790 You can't do binary search on a non-sorted array. 
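To make the point about queries depending on the rep invariant concrete, here is a small sketch of binary search over a Python list. The function name `binary_search` is an assumption for the example; the second call shows the query reporting that an element is missing when the list is not sorted.

```python
def binary_search(items, target):
    """Return True if target is in items, assuming items is sorted ascending."""
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if items[mid] == target:
            return True
        if items[mid] < target:
            lo = mid + 1   # target can only be in the right half
        else:
            hi = mid - 1   # target can only be in the left half
    return False


print(binary_search([1, 3, 5, 7, 9], 7))  # rep invariant holds: correct answer
print(binary_search([9, 1, 7, 3, 5], 9))  # rep invariant broken: 9 is present but not found
```

Each step discards half of the remaining range, which is where the order log N running time comes from, and also why an unsorted input can send the search into the wrong half.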
374 00:18:38,790 --> 00:18:40,440 So if the rep invariant doesn't hold, 375 00:18:40,440 --> 00:18:42,390 your queries might give you a wrong answer. 376 00:18:44,940 --> 00:18:46,460 How about updates? 377 00:18:46,460 --> 00:18:49,370 How do you insert something in a sorted list? 378 00:18:53,223 --> 00:18:54,764 AUDIENCE: You find where it should go 379 00:18:54,764 --> 00:18:55,885 and you move everything. 380 00:18:55,885 --> 00:18:56,510 PROFESSOR: Yep. 381 00:18:56,510 --> 00:18:59,110 So you have to move everything, make room for it, 382 00:18:59,110 --> 00:19:01,030 and put it there so that the array is still 383 00:19:01,030 --> 00:19:02,900 sorted at the end. 384 00:19:02,900 --> 00:19:04,570 You can't just append things at the end, 385 00:19:04,570 --> 00:19:10,610 even though that would be faster and lazier and less code. 386 00:19:10,610 --> 00:19:12,467 When you do an update to a data structure, 387 00:19:12,467 --> 00:19:14,550 you have to make sure that the rep invariant still 388 00:19:14,550 --> 00:19:16,300 holds at the end. 389 00:19:16,300 --> 00:19:18,800 Sort of a correctness proof for an update algorithm 390 00:19:18,800 --> 00:19:21,580 says that if the rep invariant holds at the beginning, 391 00:19:21,580 --> 00:19:24,900 the rep invariant is guaranteed to hold at the end. 392 00:19:24,900 --> 00:19:27,640 Why do we care about this rep invariant stuff? 393 00:19:27,640 --> 00:19:32,700 Suppose you have a problem, say on the next Pset, that 394 00:19:32,700 --> 00:19:36,890 asks you to find the place that's slow in your code 395 00:19:36,890 --> 00:19:39,490 and then speed it up. 396 00:19:39,490 --> 00:19:42,336 And suppose you recognize the data structure there, 397 00:19:42,336 --> 00:19:43,960 and you say that's inefficient, and you 398 00:19:43,960 --> 00:19:46,430 want to implement another data structure that 399 00:19:46,430 --> 00:19:49,620 would be more efficient. 400 00:19:49,620 --> 00:19:52,000 You're going to implement it. 
401 00:19:52,000 --> 00:19:54,390 You might have bugs in an update. 402 00:19:54,390 --> 00:19:56,460 How do you find the bugs? 403 00:19:56,460 --> 00:19:58,570 Queries give you the wrong answers. 404 00:19:58,570 --> 00:20:01,510 You might do queries a long time after you do updates, 405 00:20:01,510 --> 00:20:04,624 and you're not going to know which update failed. 406 00:20:04,624 --> 00:20:06,790 The right way to do this is you implement the method 407 00:20:06,790 --> 00:20:10,250 called Check RI-- that's what I call it-- so 408 00:20:10,250 --> 00:20:12,170 check the representation invariant. 409 00:20:12,170 --> 00:20:14,630 And that method walks through the entire data structure 410 00:20:14,630 --> 00:20:16,960 and makes sure that the rep invariant holds, 411 00:20:16,960 --> 00:20:19,089 and if it doesn't, it raises an exception 412 00:20:19,089 --> 00:20:21,380 because you know that whatever you try to do from there 413 00:20:21,380 --> 00:20:25,580 is not going to work, so there's no reason to keep going. 414 00:20:25,580 --> 00:20:28,110 So at the end of every update, you add a call 415 00:20:28,110 --> 00:20:32,050 to this Check RI method until you're 416 00:20:32,050 --> 00:20:34,480 sure that your code is correct. 417 00:20:34,480 --> 00:20:36,390 And after you're done debugging your code, 418 00:20:36,390 --> 00:20:40,180 you remove this method and you submit the code. 419 00:20:40,180 --> 00:20:42,310 Why do I want to remove the method? 420 00:20:42,310 --> 00:20:45,290 It might be painfully slow and inefficient, much slower 421 00:20:45,290 --> 00:20:47,530 than the actual queries and updates. 422 00:20:47,530 --> 00:20:49,010 For example, let's take a heap. 423 00:20:49,010 --> 00:20:52,610 Do people remember heaps from lecture? 424 00:20:52,610 --> 00:20:54,870 What's the query for a heap? 425 00:20:54,870 --> 00:20:55,930 Say you have a max heap. 426 00:20:55,930 --> 00:20:58,567 What's a query? 
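The Check RI debugging pattern described above can be sketched on the sorted array example from earlier. This is only an illustration: the class name `SortedArray`, the method name `check_ri`, and the choice to allow duplicate values are assumptions, and the standard library `bisect` module is used to find the insertion point.

```python
import bisect


class SortedArray:
    """A list kept in ascending order (hypothetical example class)."""

    def __init__(self):
        self.items = []

    def check_ri(self):
        """Walk the whole structure and raise if the rep invariant is broken.

        This takes order N time, far slower than the queries, so it is a
        debugging aid to be removed once the updates are trusted.
        """
        for i in range(1, len(self.items)):
            if self.items[i - 1] > self.items[i]:
                raise AssertionError(f"rep invariant violated at index {i}")

    def insert(self, value):
        """Update: find where value belongs, shift the rest, keep the list sorted."""
        index = bisect.bisect_left(self.items, value)
        self.items.insert(index, value)
        self.check_ri()  # remove after debugging

    def min(self):
        """Query: the smallest element is at the beginning."""
        return self.items[0]

    def max(self):
        """Query: the largest element is at the end."""
        return self.items[-1]
```

With the check at the end of every update, a buggy update fails immediately at the point where it breaks the invariant, instead of surfacing much later as a wrong query answer.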
427 00:20:58,567 --> 00:20:59,650 AUDIENCE: Where's the max? 428 00:20:59,650 --> 00:21:00,800 PROFESSOR: OK, cool. 429 00:21:00,800 --> 00:21:03,570 So for a max heap, a query would be max. 430 00:21:03,570 --> 00:21:04,200 Running time? 431 00:21:07,511 --> 00:21:08,457 AUDIENCE: Constant. 432 00:21:08,457 --> 00:21:09,248 PROFESSOR: Perfect. 433 00:21:09,248 --> 00:21:10,360 Constant. 434 00:21:10,360 --> 00:21:13,880 What do you do? 435 00:21:13,880 --> 00:21:15,122 Look at the top? 436 00:21:15,122 --> 00:21:16,307 AUDIENCE: Yeah, exactly. 437 00:21:16,307 --> 00:21:16,890 PROFESSOR: OK. 438 00:21:16,890 --> 00:21:17,640 Sweet. 439 00:21:17,640 --> 00:21:23,430 So what are the two popular updates in a max heap? 440 00:21:23,430 --> 00:21:25,600 AUDIENCE: There would be Insert as well. 441 00:21:25,600 --> 00:21:28,480 PROFESSOR: OK. 442 00:21:28,480 --> 00:21:29,770 Insert. 443 00:21:29,770 --> 00:21:32,306 And did we teach you general delete? 444 00:21:39,310 --> 00:21:44,817 Usually Extract Max is simpler. 445 00:21:44,817 --> 00:21:45,650 That's all you need. 446 00:21:51,570 --> 00:21:53,270 What's the running time for Insert? 447 00:21:56,830 --> 00:21:59,390 Do people remember heaps? 448 00:21:59,390 --> 00:22:03,190 AUDIENCE: I think it was per N, but I'm not completely sure. 449 00:22:03,190 --> 00:22:04,560 PROFESSOR: Anyone else? 450 00:22:04,560 --> 00:22:05,250 It's not. 451 00:22:05,250 --> 00:22:08,135 Life would be bad if it would be N. 452 00:22:08,135 --> 00:22:09,217 AUDIENCE: N squared? 453 00:22:09,217 --> 00:22:09,800 PROFESSOR: No. 454 00:22:13,835 --> 00:22:16,210 It's better than N, so you guys are doing a binary search 455 00:22:16,210 --> 00:22:18,660 over the few running times that I gave you earlier. 456 00:22:18,660 --> 00:22:20,118 AUDIENCE: [INAUDIBLE] add to the N, 457 00:22:20,118 --> 00:22:22,498 and then you compare your neighbor, and then you 458 00:22:22,498 --> 00:22:24,245 [INAUDIBLE]. 
459 00:22:24,245 --> 00:22:26,780 AUDIENCE: If it's an array, there isn't-- 460 00:22:26,780 --> 00:22:29,272 PROFESSOR: So conceptually, a heap looks like this. 461 00:22:29,272 --> 00:22:30,980 And yeah, it becomes an array eventually, 462 00:22:30,980 --> 00:22:32,750 but let's look at it this way. 463 00:22:36,500 --> 00:22:37,950 It is a full binary tree. 464 00:22:37,950 --> 00:22:41,290 Binary tree means that each node has at most two children, 465 00:22:41,290 --> 00:22:44,160 and full means that every level except for the last level 466 00:22:44,160 --> 00:22:45,420 is completely populated. 467 00:22:45,420 --> 00:22:48,770 So every internal node has exactly two children, 468 00:22:48,770 --> 00:22:53,050 and in here, every node except for some nodes 469 00:22:53,050 --> 00:22:56,700 and then some nodes after it will not have. 470 00:22:56,700 --> 00:22:59,090 Everything to the left is fully populated, 471 00:22:59,090 --> 00:23:01,870 and then at some point, you stop having children. 472 00:23:01,870 --> 00:23:05,150 It turns out that this is easy to store in an array, 473 00:23:05,150 --> 00:23:06,660 but I will not go over that. 474 00:23:06,660 --> 00:23:09,660 Instead, I want to go over inserting. 475 00:23:09,660 --> 00:23:13,980 What's the rep invariant for a heap? 476 00:23:13,980 --> 00:23:15,960 AUDIENCE: The max in the top, right? 477 00:23:15,960 --> 00:23:18,572 Well, for max heap, and then the two children 478 00:23:18,572 --> 00:23:20,456 are less than the next node. 479 00:23:20,456 --> 00:23:21,330 PROFESSOR: All right. 480 00:23:21,330 --> 00:23:23,515 So the guy here has to be bigger than these guys, 481 00:23:23,515 --> 00:23:25,640 then the guy here has to be bigger than these guys, 482 00:23:25,640 --> 00:23:26,980 and so on and so forth. 
483 00:23:26,980 --> 00:23:29,990 And if you use induction, you can 484 00:23:29,990 --> 00:23:32,212 prove that if this is bigger than this, 485 00:23:32,212 --> 00:23:34,170 it has to be bigger than these guys, and bigger 486 00:23:34,170 --> 00:23:36,970 than these guys, and bigger than everything, and it's a max. 487 00:23:36,970 --> 00:23:40,060 That's the reason why we have that rep invariant. 488 00:23:40,060 --> 00:23:42,480 So the way we insert a node is we add it 489 00:23:42,480 --> 00:23:45,280 at the bottom, the only place where we could add it. 490 00:23:45,280 --> 00:23:48,190 And then if this guy is bigger than this guy, 491 00:23:48,190 --> 00:23:51,110 the rep invariant is violated, so we swap them 492 00:23:51,110 --> 00:23:53,270 in order to fix that. 493 00:23:53,270 --> 00:23:54,270 Now the guy is here. 494 00:23:54,270 --> 00:23:56,890 If this is bigger than this, we do another swap. 495 00:23:56,890 --> 00:24:00,400 If this is bigger than this, we do another swap. 496 00:24:00,400 --> 00:24:03,990 So you're going to go from the bottom of the heap potentially 497 00:24:03,990 --> 00:24:06,210 all the way up to the root. 498 00:24:06,210 --> 00:24:08,820 So the running time of insert is order 499 00:24:08,820 --> 00:24:10,700 of the height of the heap. 500 00:24:13,580 --> 00:24:16,810 Now, the heap is a full binary tree. 501 00:24:16,810 --> 00:24:17,445 I said "full." 502 00:24:17,445 --> 00:24:18,320 I keep saying "full." 503 00:24:18,320 --> 00:24:20,765 The reason I care about full is that the full binary tree 504 00:24:20,765 --> 00:24:24,767 is guaranteed to have a height of log N. It's always 505 00:24:24,767 --> 00:24:26,350 log N, where N is the number of nodes. 506 00:24:30,120 --> 00:24:35,303 So inserting in a heap takes log N. 507 00:24:35,303 --> 00:24:36,469 AUDIENCE: I have a question. 
508 00:24:36,469 --> 00:24:39,415 Didn't they say that because it's in an array, 509 00:24:39,415 --> 00:24:44,230 then to find it-- oh no, I guess because you can still 510 00:24:44,230 --> 00:24:44,949 do the swaps. 511 00:24:44,949 --> 00:24:46,490 PROFESSOR: You can still do the swaps 512 00:24:46,490 --> 00:24:48,720 when you have it serialized in an array. 513 00:24:48,720 --> 00:24:50,450 You know that given an item's index, 514 00:24:50,450 --> 00:24:53,894 the parent is that index divided by 2. 515 00:24:53,894 --> 00:24:55,810 So you add an element at the end of the array, 516 00:24:55,810 --> 00:24:57,100 and then you know what the parent is, 517 00:24:57,100 --> 00:24:58,988 and then you keep swapping and swapping and swapping 518 00:24:58,988 --> 00:24:59,944 towards the [INAUDIBLE]. 519 00:24:59,944 --> 00:25:01,300 AUDIENCE: You don't ever have to put it in 520 00:25:01,300 --> 00:25:02,072 and shift everything over. 521 00:25:02,072 --> 00:25:02,420 You're only swapping. 522 00:25:02,420 --> 00:25:03,045 PROFESSOR: Yep. 523 00:25:03,045 --> 00:25:04,200 You only swap. 524 00:25:04,200 --> 00:25:06,131 That's important. 525 00:25:06,131 --> 00:25:06,880 Thanks for asking. 526 00:25:06,880 --> 00:25:07,650 That's important. 527 00:25:07,650 --> 00:25:11,850 So log N. Extract max, take my word for it, 528 00:25:11,850 --> 00:25:16,380 also log N. What's the running time for checking 529 00:25:16,380 --> 00:25:19,760 the invariant in a heap? 530 00:25:19,760 --> 00:25:22,590 So to make sure that this guy is a heap, if I had numbers here, 531 00:25:22,590 --> 00:25:24,000 what would you have to do? 532 00:25:29,280 --> 00:25:31,295 AUDIENCE: You'd have to look at every node. 533 00:25:31,295 --> 00:25:31,920 PROFESSOR: Yep. 534 00:25:31,920 --> 00:25:34,304 So running time? 535 00:25:34,304 --> 00:25:36,547 AUDIENCE: Theta of N. 536 00:25:36,547 --> 00:25:37,172 PROFESSOR: Yep. 
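The heap operations discussed above can be sketched in Python as follows. This is an illustrative sketch, not the course's handout code: the class name `MaxHeap`, the `items` list, and the choice to leave index 0 unused (so that the parent of index i is i // 2, as described above) are all assumptions made for the example. It shows an O(1) max query, an O(log N) insert that only swaps toward the root, and a Theta(N) `check_ri` that looks at every node.

```python
class MaxHeap:
    """Max-heap stored in a list; index 0 is unused so parent(i) == i // 2."""

    def __init__(self):
        self.items = [None]  # placeholder keeps the indexing 1-based

    def max(self):
        # Query: the largest key sits at the root, so this is O(1).
        return self.items[1]

    def insert(self, key):
        # Add at the end of the array (the bottom of the tree), then swap
        # upward while the new key is bigger than its parent: O(log N).
        self.items.append(key)
        i = len(self.items) - 1
        while i > 1 and self.items[i] > self.items[i // 2]:
            parent = i // 2
            self.items[i], self.items[parent] = self.items[parent], self.items[i]
            i = parent
        # self.check_ri()  # enable while debugging, remove before submitting

    def check_ri(self):
        # Compares every non-root node with its parent, so it costs Theta(N).
        for i in range(2, len(self.items)):
            if self.items[i] > self.items[i // 2]:
                raise ValueError(f"heap invariant violated at index {i}")


heap = MaxHeap()
for key in [3, 9, 1, 7, 12, 5]:
    heap.insert(key)
heap.check_ri()    # raises only if the rep invariant is broken
print(heap.max())
```

Leaving the `check_ri` call commented out inside `insert` mirrors the workflow described above: turn it on while debugging, and take it out once the updates are trusted, since it would otherwise make every O(log N) update cost Theta(N).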
537 00:25:40,820 --> 00:25:43,430 So if I'm going to submit code for a heap 538 00:25:43,430 --> 00:25:46,410 where the operations are order of log N, 539 00:25:46,410 --> 00:25:50,730 or order 1, but then each of these calls Check RI, 540 00:25:50,730 --> 00:25:52,640 that's going to be painfully slow because I'm 541 00:25:52,640 --> 00:25:55,650 making the updates be order N instead of log N. 542 00:25:55,650 --> 00:25:59,160 So you're putting Check RI calls in every update. 543 00:25:59,160 --> 00:26:00,607 You debug your code. 544 00:26:00,607 --> 00:26:02,690 When you make sure it's correct, you remove those, 545 00:26:02,690 --> 00:26:05,810 and then you submit the Pset. 546 00:26:05,810 --> 00:26:08,530 Make sense? 547 00:26:08,530 --> 00:26:09,760 Sweet. 548 00:26:09,760 --> 00:26:12,070 And we looked a little bit at heaps, which is good. 549 00:26:16,520 --> 00:26:18,320 Binary search trees. 550 00:26:18,320 --> 00:26:21,000 So a binary tree is a tree where every node 551 00:26:21,000 --> 00:26:23,830 has at most two children. 552 00:26:23,830 --> 00:26:28,460 When we code this up, we represent a node as a Python 553 00:26:28,460 --> 00:26:34,810 object, and for a node, we keep track of the left child, 554 00:26:34,810 --> 00:26:41,880 of the right child, parent, and then this is a hollow tree. 555 00:26:41,880 --> 00:26:43,100 It's not very useful. 556 00:26:43,100 --> 00:26:46,370 This becomes useful when you start putting keys in the nodes 557 00:26:46,370 --> 00:26:49,690 so that you can find them and do other things with them. 558 00:26:49,690 --> 00:26:50,830 So each node has a key. 559 00:26:53,640 --> 00:26:55,390 Let me draw a binary search tree. 560 00:27:11,275 --> 00:27:14,360 Can people see this? 561 00:27:14,360 --> 00:27:15,750 So this is a binary tree. 562 00:27:15,750 --> 00:27:18,360 Can someone say something a bit more specific about it? 563 00:27:23,050 --> 00:27:24,400 AUDIENCE: It's unbalanced. 
564 00:27:24,400 --> 00:27:24,690 PROFESSOR: OK. 565 00:27:24,690 --> 00:27:25,356 It's imbalanced. 566 00:27:28,590 --> 00:27:31,260 So that means that finding things all the way 567 00:27:31,260 --> 00:27:34,051 at the bottom is going to be expensive. 568 00:27:34,051 --> 00:27:34,550 What else? 569 00:27:37,520 --> 00:27:39,140 So I said it's a binary tree. 570 00:27:39,140 --> 00:27:40,900 Give me something more specific. 571 00:27:45,830 --> 00:27:50,060 So binary tree just means that every node has two children. 572 00:27:50,060 --> 00:27:52,470 There's a bit more structure in this guy. 573 00:27:52,470 --> 00:27:55,840 So if I look at the root, if I look at 23, 574 00:27:55,840 --> 00:27:58,740 all the nodes to the left are smaller. 575 00:27:58,740 --> 00:28:01,635 All the nodes to the right are bigger. 576 00:28:01,635 --> 00:28:04,320 Now, if I look at 8, all the nodes to the left are smaller, 577 00:28:04,320 --> 00:28:06,360 all the nodes to the right are greater. 578 00:28:10,310 --> 00:28:15,750 This additional rep invariant defines a binary search tree. 579 00:28:15,750 --> 00:28:19,820 This is what we talked about in class. 580 00:28:19,820 --> 00:28:22,320 BST. 581 00:28:22,320 --> 00:28:24,945 Why would I want to have this rep invariant? 582 00:28:24,945 --> 00:28:27,540 It sounds like a pain to maintain nodes 583 00:28:27,540 --> 00:28:29,810 with all these ordering constraints. 584 00:28:29,810 --> 00:28:32,710 What's the advantage of doing that? 585 00:28:32,710 --> 00:28:33,996 AUDIENCE: Search is fast. 586 00:28:33,996 --> 00:28:34,620 PROFESSOR: Yep. 587 00:28:34,620 --> 00:28:35,310 Search is fast. 588 00:28:35,310 --> 00:28:37,960 How do I do search? 589 00:28:37,960 --> 00:28:41,220 If you're looking for 42 or for 16, 590 00:28:41,220 --> 00:28:43,440 you'd be like, oh, it's less than 23. 591 00:28:43,440 --> 00:28:45,640 I'll get on this path. 592 00:28:45,640 --> 00:28:48,880 PROFESSOR: So start at the root, compare my key to the root. 
593 00:28:48,880 --> 00:28:50,089 If it's smaller, go left. 594 00:28:50,089 --> 00:28:51,130 If it's bigger, go right. 595 00:28:51,130 --> 00:28:53,600 Then keep doing that until I arrive somewhere 596 00:28:53,600 --> 00:28:57,590 or until I arrive at a dead end if I'm looking for 14. 597 00:28:57,590 --> 00:28:59,860 This is a lot like binary search. 598 00:28:59,860 --> 00:29:02,330 Binary search in an array, you look at the middle. 599 00:29:02,330 --> 00:29:03,759 If your key is smaller, go left. 600 00:29:03,759 --> 00:29:05,300 If your key is bigger, then go right. 601 00:29:09,680 --> 00:29:13,290 Let's look at the code for a little bit. 602 00:29:13,290 --> 00:29:16,490 Look at the BST Node Class, and you'll 603 00:29:16,490 --> 00:29:19,020 see that it has the fields that we have up here. 604 00:29:19,020 --> 00:29:21,370 And look at the Find method, and this is pretty much 605 00:29:21,370 --> 00:29:23,020 the binary search code. 606 00:29:23,020 --> 00:29:26,920 Lines 8 and 9 have the return condition when you're happy 607 00:29:26,920 --> 00:29:31,187 and you found the key, and then line 10 compares 608 00:29:31,187 --> 00:29:33,520 the key that you're looking for with the key in the node 609 00:29:33,520 --> 00:29:38,010 that you're at, and then lines 11, 14, 16, and 19 are pretty 610 00:29:38,010 --> 00:29:39,776 much copy pasted, except one of them 611 00:29:39,776 --> 00:29:41,400 deals with the left case, the other one 612 00:29:41,400 --> 00:29:44,910 deals with the right case. 613 00:29:44,910 --> 00:29:46,955 What is the running time for Find? 614 00:29:57,890 --> 00:30:00,780 AUDIENCE: Wouldn't it be log N, right? 615 00:30:00,780 --> 00:30:03,440 PROFESSOR: I wish. 616 00:30:03,440 --> 00:30:06,060 If this is all you have to do to get log N, 617 00:30:06,060 --> 00:30:08,360 then I would have to write a lot less code. 618 00:30:11,060 --> 00:30:14,540 So not quite log N. 
We will have to go through next lecture 619 00:30:14,540 --> 00:30:20,020 to get to log N. Until then, what's the running time? 620 00:30:20,020 --> 00:30:20,900 AUDIENCE: Order h. 621 00:30:20,900 --> 00:30:21,525 PROFESSOR: Yep. 622 00:30:24,030 --> 00:30:26,030 So you told me at the beginning it's unbalanced. 623 00:30:26,030 --> 00:30:27,135 AUDIENCE: Yeah. 624 00:30:27,135 --> 00:30:29,010 PROFESSOR: So then it's not going to be fast. 625 00:30:32,800 --> 00:30:34,440 OK, so order h. 626 00:30:34,440 --> 00:30:37,090 The reason why we care about h, and the reason 627 00:30:37,090 --> 00:30:39,980 we don't say order N, is because next 628 00:30:39,980 --> 00:30:42,415 lecture after we learn how to balance a tree, 629 00:30:42,415 --> 00:30:45,040 there's some magic that you can do to these binary search trees 630 00:30:45,040 --> 00:30:47,721 to guarantee that the height is order of log N. 631 00:30:47,721 --> 00:30:50,220 And then we'll go through all the running times that we have 632 00:30:50,220 --> 00:30:52,676 and replace h with log N. 633 00:30:52,676 --> 00:30:55,050 Now, it happens that in this case, if you would have told 634 00:30:55,050 --> 00:30:59,260 me order N, I couldn't argue with you because worst case, 635 00:30:59,260 --> 00:31:02,790 searches are order N. Can someone give me a binary search 636 00:31:02,790 --> 00:31:07,071 tree that exposes this degenerate case? 637 00:31:07,071 --> 00:31:07,570 Yes? 638 00:31:07,570 --> 00:31:09,195 AUDIENCE: If it's completely unbalanced 639 00:31:09,195 --> 00:31:13,150 and every node is greater than the parent nodes. 640 00:31:13,150 --> 00:31:17,435 PROFESSOR: So give me some inserts that create it. 641 00:31:17,435 --> 00:31:18,820 AUDIENCE: Insert 5. 642 00:31:18,820 --> 00:31:20,622 PROFESSOR: 5. 643 00:31:20,622 --> 00:31:22,390 AUDIENCE: Insert 10. 644 00:31:22,390 --> 00:31:23,222 PROFESSOR: 10. 645 00:31:23,222 --> 00:31:24,520 AUDIENCE: Insert 15. 
646 00:31:24,520 --> 00:31:25,638 PROFESSOR: 15. 647 00:31:25,638 --> 00:31:27,042 AUDIENCE: Insert 20. 648 00:31:27,042 --> 00:31:27,980 PROFESSOR: Yep. 649 00:31:27,980 --> 00:31:28,940 And I could keep going. 650 00:31:28,940 --> 00:31:30,540 I could say, 1, 2, 3, 4, 5. 651 00:31:30,540 --> 00:31:31,740 I could say 5, 10, 15. 652 00:31:31,740 --> 00:31:34,000 As long as these keep growing, this is basically 653 00:31:34,000 --> 00:31:37,290 going to be a list, so searching is 654 00:31:37,290 --> 00:31:41,517 order N. This is a degenerate case. 655 00:31:41,517 --> 00:31:43,600 Turns out it doesn't happen too often in practice. 656 00:31:43,600 --> 00:31:47,340 If you have random data, the height will be roughly log N. 657 00:31:47,340 --> 00:31:49,530 But in order to avoid those degenerate cases, 658 00:31:49,530 --> 00:31:54,180 we'll be doing balanced trees later on. 659 00:31:54,180 --> 00:31:55,680 So we covered Find. 660 00:31:55,680 --> 00:31:56,854 We know it's order h. 661 00:31:56,854 --> 00:31:58,270 How do you insert, really quickly? 662 00:32:09,104 --> 00:32:10,520 AUDIENCE: Do you mean in searching 663 00:32:10,520 --> 00:32:13,000 when it's balanced or unbalanced? 664 00:32:13,000 --> 00:32:14,520 PROFESSOR: This guy. 665 00:32:14,520 --> 00:32:16,310 So the trees look exactly the same. 666 00:32:16,310 --> 00:32:18,420 If it's balanced, it's going to look more 667 00:32:18,420 --> 00:32:20,380 like that than like this. 668 00:32:20,380 --> 00:32:22,200 Actually, this is balanced. 669 00:32:22,200 --> 00:32:23,550 This is perfectly unbalanced. 670 00:32:23,550 --> 00:32:25,125 This is somewhere in the middle. 671 00:32:25,125 --> 00:32:27,500 If it's balanced, it's just going to look more like this, 672 00:32:27,500 --> 00:32:30,180 but it's still a binary search tree. 673 00:32:30,180 --> 00:32:33,481 How would you insert a node? 674 00:32:33,481 --> 00:32:33,980 Yes? 
675 00:32:33,980 --> 00:32:35,944 AUDIENCE: Can't you start at the root 676 00:32:35,944 --> 00:32:38,399 and find your way down, and then the first open child 677 00:32:38,399 --> 00:32:43,686 that you see that's applicable to your element, stick it there? 678 00:32:43,686 --> 00:32:44,310 PROFESSOR: Yep. 679 00:32:44,310 --> 00:32:49,406 So if I wanted to insert 14, which way do I go? 680 00:32:49,406 --> 00:32:51,280 AUDIENCE: So you'd look at 23, and you'd say, 681 00:32:51,280 --> 00:32:53,215 it's less than 23, go left. 682 00:32:53,215 --> 00:32:54,670 You'd look at 8. 683 00:32:54,670 --> 00:32:56,610 You'd say, it's greater than 8. 684 00:32:56,610 --> 00:32:57,565 You'd go right. 685 00:32:57,565 --> 00:32:58,065 Look at 16. 686 00:32:58,065 --> 00:33:00,005 You'd say it's less, so you go left. 687 00:33:00,005 --> 00:33:00,975 15, it's less. 688 00:33:00,975 --> 00:33:03,855 Then you have an open spot so you stick it there. 689 00:33:03,855 --> 00:33:04,730 PROFESSOR: Excellent. 690 00:33:04,730 --> 00:33:06,930 Thank you. 691 00:33:06,930 --> 00:33:07,770 Yes? 692 00:33:07,770 --> 00:33:08,936 AUDIENCE: I have a question. 693 00:33:08,936 --> 00:33:11,176 What if we want to insert 5? 694 00:33:11,176 --> 00:33:12,359 Then-- 695 00:33:12,359 --> 00:33:14,025 PROFESSOR: So if you want to insert who? 696 00:33:14,025 --> 00:33:14,915 AUDIENCE: 5. 697 00:33:14,915 --> 00:33:17,590 Or actually no, we can't. 698 00:33:17,590 --> 00:33:22,184 I'm thinking, is there any case in which we need to move a node? 699 00:33:22,184 --> 00:33:23,600 PROFESSOR: How would you insert 5? 700 00:33:23,600 --> 00:33:24,100 Let's see. 701 00:33:24,100 --> 00:33:25,552 What would you do for 5? 702 00:33:25,552 --> 00:33:29,408 AUDIENCE: For 5, then we'd insert it to the right of 4, 703 00:33:29,408 --> 00:33:30,860 right? 704 00:33:30,860 --> 00:33:32,656 PROFESSOR: Smaller, smaller, greater, 5. 705 00:33:32,656 --> 00:33:33,155 Right? 
706 00:33:36,410 --> 00:33:38,560 AUDIENCE: So there would be no case in which we'd 707 00:33:38,560 --> 00:33:41,360 need to swap nodes or something? 708 00:33:41,360 --> 00:33:42,084 PROFESSOR: No. 709 00:33:42,084 --> 00:33:43,000 You're thinking ahead. 710 00:33:43,000 --> 00:33:46,900 We'll talk about that a little later when we get to deleting. 711 00:33:46,900 --> 00:33:51,720 As long as you follow a path in the tree, the path that finding 712 00:33:51,720 --> 00:33:54,720 would get you to, as soon as you hit a dead end, 713 00:33:54,720 --> 00:33:56,046 that's where your node belongs. 714 00:33:56,046 --> 00:33:58,420 Because you know next time you're going to search for it, 715 00:33:58,420 --> 00:34:02,230 the search is going to follow that path and find the node. 716 00:34:02,230 --> 00:34:02,800 Yes? 717 00:34:02,800 --> 00:34:04,924 AUDIENCE: If you have values that are the same, like two 718 00:34:04,924 --> 00:34:07,620 nodes with the same number, does it 719 00:34:07,620 --> 00:34:09,360 matter which side you put it on? 720 00:34:09,360 --> 00:34:10,991 PROFESSOR: You don't. 721 00:34:10,991 --> 00:34:11,880 AUDIENCE: Oh, I see. 722 00:34:11,880 --> 00:34:15,670 It's more like you would only have four 1's in the tree. 723 00:34:15,670 --> 00:34:16,440 PROFESSOR: Yes. 724 00:34:16,440 --> 00:34:18,805 So if you're trying to store keys and values, 725 00:34:18,805 --> 00:34:20,179 then what you'd have to do if you 726 00:34:20,179 --> 00:34:21,945 want to allow multiple values for the same key 727 00:34:21,945 --> 00:34:23,540 is you have a linked list going off 728 00:34:23,540 --> 00:34:26,770 of this, or each node holds an array of values 729 00:34:26,770 --> 00:34:28,305 aside from the key. 730 00:34:28,305 --> 00:34:28,929 Smart question. 731 00:34:28,929 --> 00:34:29,699 Thank you. 
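The find and insert procedures described above can be sketched in Python like this. It is a minimal sketch, not the code from the course handout: the class name `BSTNode`, the loop-based style, and the `height` helper are assumptions for illustration, and the example keys only loosely follow the tree drawn on the board. Duplicate keys are simply ignored, which is one way of taking the easy way out.

```python
class BSTNode:
    """A binary search tree node with a key and left/right/parent links."""

    def __init__(self, key, parent=None):
        self.key = key
        self.parent = parent
        self.left = None
        self.right = None

    def find(self, key):
        # Walk down from this node: smaller keys go left, bigger go right.
        # Returns the matching node, or None on a dead end. Cost is O(h).
        node = self
        while node is not None and node.key != key:
            node = node.left if key < node.key else node.right
        return node

    def insert(self, key):
        # Follow the same path find would take; the dead end is where
        # the new node belongs. Cost is O(h).
        node = self
        while True:
            if key == node.key:
                return node  # assumption: duplicates are not stored
            side = "left" if key < node.key else "right"
            child = getattr(node, side)
            if child is None:
                child = BSTNode(key, parent=node)
                setattr(node, side, child)
                return child
            node = child


def height(node):
    # Number of nodes on the longest path from this node down to a leaf.
    if node is None:
        return 0
    return 1 + max(height(node.left), height(node.right))


root = BSTNode(23)
for key in [8, 42, 4, 16, 15]:
    root.insert(key)
new_node = root.insert(14)   # 23 -> 8 -> 16 -> 15, then the open left spot
print(new_node.parent.key)

chain = BSTNode(5)
for key in [10, 15, 20]:     # increasing keys build the degenerate case
    chain.insert(key)
print(height(chain), height(root))
```

Inserting keys in increasing order makes every new node a right child, so the tree degenerates into a list and both operations cost order N, which is why the running time is stated in terms of h.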
732 00:34:29,699 --> 00:34:33,800 That trips you up every time you do actual code, 733 00:34:33,800 --> 00:34:35,889 so that's the right question to ask yourself 734 00:34:35,889 --> 00:34:36,880 when you're implementing this. 735 00:34:36,880 --> 00:34:37,838 Will I have duplicates? 736 00:34:37,838 --> 00:34:39,810 How do I handle them? 737 00:34:39,810 --> 00:34:40,489 We don't. 738 00:34:40,489 --> 00:34:43,300 We take the easy way out. 739 00:34:43,300 --> 00:34:45,460 So if you look at Insert, on the next page, 740 00:34:45,460 --> 00:34:48,440 you will see that the code is pretty much the Find code 741 00:34:48,440 --> 00:34:53,869 copy pasted, except when Self Left is None 742 00:34:53,869 --> 00:34:55,820 or Self Right is None, instead of returning, 743 00:34:55,820 --> 00:34:56,850 it creates a new node. 744 00:35:02,296 --> 00:35:03,820 Does that make sense to people? 745 00:35:06,915 --> 00:35:07,415 All right. 746 00:35:10,720 --> 00:35:13,730 So Delete is going to be the hardest operation for today. 747 00:35:13,730 --> 00:35:15,760 Before we do that, let's do a warm up operation. 748 00:35:18,820 --> 00:35:29,350 Let's say I want to implement Find Next Larger, also called 749 00:35:29,350 --> 00:35:32,110 Successor in some implementations. 750 00:35:32,110 --> 00:35:33,930 So I have a node. 751 00:35:33,930 --> 00:35:41,130 Say I have node 8, and I want to find the next key 752 00:35:41,130 --> 00:35:43,710 in the tree that's strictly larger than 8 753 00:35:43,710 --> 00:35:46,190 but smaller than anything else. 754 00:35:46,190 --> 00:35:48,770 So if I would take these nodes and write them down in order, 755 00:35:48,770 --> 00:35:52,400 I want to find the element that would go right after it. 756 00:35:52,400 --> 00:35:54,026 How do I do that? 757 00:35:54,026 --> 00:35:54,920 Don't cheat. 758 00:35:54,920 --> 00:35:59,588 Don't look at the code, or make my life easier and do searches. 
759 00:35:59,588 --> 00:36:01,079 AUDIENCE: Go down one to the right, 760 00:36:01,079 --> 00:36:03,297 and you try to get down left as far as you can. 761 00:36:03,297 --> 00:36:03,880 PROFESSOR: OK. 762 00:36:03,880 --> 00:36:04,870 Very good. 763 00:36:04,870 --> 00:36:11,310 So I have a node, and it has some subtree here, 764 00:36:11,310 --> 00:36:17,690 so I can go to the right here, I can go all the way left. 765 00:36:17,690 --> 00:36:19,370 We have an operation that does this, 766 00:36:19,370 --> 00:36:22,182 and it's called Min for a tree. 767 00:36:22,182 --> 00:36:24,390 In order to find the minimum in a binary search tree, 768 00:36:24,390 --> 00:36:25,870 you keep going left. 769 00:36:25,870 --> 00:36:28,870 For example, in this case, you get 4, which is good. 770 00:36:28,870 --> 00:36:31,970 So the way you would code this up is if you have Min, 771 00:36:31,970 --> 00:36:33,740 you go to the right if you can, and then 772 00:36:33,740 --> 00:36:36,390 you call Min on the subtree. 773 00:36:36,390 --> 00:36:41,340 And you can see that lines 3 and 4 do exactly that. 774 00:36:41,340 --> 00:36:43,180 Good guess. 775 00:36:43,180 --> 00:36:45,630 But you can see line 1 says case one, 776 00:36:45,630 --> 00:36:50,040 so you have the right answer for one case. 777 00:36:50,040 --> 00:36:53,760 Now we have to handle more difficult cases. 778 00:36:53,760 --> 00:36:59,490 What if instead, I go down a bunch of nodes, 779 00:36:59,490 --> 00:37:04,780 and I want to find the successor for this guy, for example, 780 00:37:04,780 --> 00:37:06,420 and there's nothing here. 781 00:37:06,420 --> 00:37:07,050 What do I do? 782 00:37:12,090 --> 00:37:17,160 So if I want to find the successor for 8, what do I do? 783 00:37:17,160 --> 00:37:17,660 Sorry. 784 00:37:17,660 --> 00:37:18,368 It has an answer. 785 00:37:18,368 --> 00:37:24,247 What if I want to find the successor for 4? 786 00:37:27,229 --> 00:37:28,230 AUDIENCE: Go up one. 
787 00:37:28,230 --> 00:37:28,813 PROFESSOR: OK. 788 00:37:28,813 --> 00:37:29,900 Go up one. 789 00:37:29,900 --> 00:37:31,522 Why does that work? 790 00:37:31,522 --> 00:37:33,370 AUDIENCE: You know it's going to be greater. 791 00:37:33,370 --> 00:37:37,780 PROFESSOR: So I'm going up right. 792 00:37:37,780 --> 00:37:41,000 So I know that everything here is guaranteed to be smaller, 793 00:37:41,000 --> 00:37:44,330 everything here is guaranteed to be greater than this guy. 794 00:37:44,330 --> 00:37:48,180 This guy is up right, so this is guaranteed to be greater 795 00:37:48,180 --> 00:37:50,040 than this, and everything here is 796 00:37:50,040 --> 00:37:52,550 guaranteed to be greater than this, and so on and so forth 797 00:37:52,550 --> 00:37:54,270 for the entire tree. 798 00:37:54,270 --> 00:37:58,150 So if I go up right, I'm happy. 799 00:37:58,150 --> 00:38:00,030 I definitely found my answer. 800 00:38:00,030 --> 00:38:06,040 Now, what if I have something that looks like this, 801 00:38:06,040 --> 00:38:08,160 and I want to find the successor for this guy? 802 00:38:13,040 --> 00:38:14,730 AUDIENCE: There is none. 803 00:38:14,730 --> 00:38:16,470 PROFESSOR: In this case, there is none 804 00:38:16,470 --> 00:38:18,210 if there's nothing else here. 805 00:38:18,210 --> 00:38:22,910 What if I have this, but then I have this? 806 00:38:22,910 --> 00:38:24,300 So I came down this way. 807 00:38:32,760 --> 00:38:35,765 AUDIENCE: Are you saying you're calling on that last node? 808 00:38:35,765 --> 00:38:36,390 PROFESSOR: Yep. 809 00:38:36,390 --> 00:38:37,473 AUDIENCE: Find the larger? 810 00:38:41,300 --> 00:38:44,299 I guess you'd just trace back up. 811 00:38:44,299 --> 00:38:45,590 PROFESSOR: And where do I stop? 812 00:38:51,410 --> 00:38:54,110 AUDIENCE: It affects the tree, so you go up one from there. 813 00:38:54,110 --> 00:38:56,320 You don't stop there. 814 00:38:56,320 --> 00:38:58,890 PROFESSOR: Why can't I stop here? 
815 00:38:58,890 --> 00:39:03,911 AUDIENCE: Because you know that that-- not necessarily. 816 00:39:03,911 --> 00:39:06,160 AUDIENCE: You know that everything in that long branch 817 00:39:06,160 --> 00:39:09,429 right there is less than that node [INAUDIBLE]. 818 00:39:09,429 --> 00:39:11,220 PROFESSOR: This is to the left of this guy, 819 00:39:11,220 --> 00:39:16,620 so this guy has to be greater than everything here, 820 00:39:16,620 --> 00:39:20,640 and then you can repeat the argument that we had before. 821 00:39:20,640 --> 00:39:22,060 So here, we could stop right away 822 00:39:22,060 --> 00:39:23,700 because we could branch left. 823 00:39:23,700 --> 00:39:26,340 In this case, you have to go up until you're 824 00:39:26,340 --> 00:39:29,580 able to go left and up. 825 00:39:29,580 --> 00:39:34,310 If you get to the root, then what happened? 826 00:39:34,310 --> 00:39:37,025 Then we're in this case, and you have no successor. 827 00:39:40,730 --> 00:39:42,740 So take a look at the code. 828 00:39:42,740 --> 00:39:45,320 The next larger, lines 1 through 9. 829 00:39:45,320 --> 00:39:49,070 Case two, 6 through 8, does exactly that. 830 00:39:49,070 --> 00:39:52,910 If I can't go to my right and find the tree there, 831 00:39:52,910 --> 00:39:57,570 then I go up through my parent chain, 832 00:39:57,570 --> 00:40:00,942 and as long as I have to go up to the left, 833 00:40:00,942 --> 00:40:02,900 so as long as I'm the right child of my parent, 834 00:40:02,900 --> 00:40:04,120 I have to keep going. 835 00:40:04,120 --> 00:40:06,810 The moment I find the parent where I'm the left child, 836 00:40:06,810 --> 00:40:07,340 I stop. 837 00:40:07,340 --> 00:40:08,255 That's my successor. 838 00:40:11,330 --> 00:40:14,270 What if I would have to find the predecessor instead? 839 00:40:14,270 --> 00:40:16,810 So the element that's smaller than me 840 00:40:16,810 --> 00:40:19,734 but bigger than everything else in the tree. 
841 00:40:19,734 --> 00:40:20,400 What would I do? 842 00:40:31,624 --> 00:40:33,640 AUDIENCE: It's just the opposite. 843 00:40:33,640 --> 00:40:35,422 PROFESSOR: Just the opposite. 844 00:40:35,422 --> 00:40:39,718 So how do I do the opposite? 845 00:40:39,718 --> 00:40:43,809 AUDIENCE: You can take the max of the left side tree, 846 00:40:43,809 --> 00:40:53,400 or traverse up, and if that's less than-- 847 00:40:53,400 --> 00:40:55,900 PROFESSOR: OK, so if I have a left subtree, fine. 848 00:40:55,900 --> 00:40:59,070 Call max on it and get the rightmost node there. 849 00:40:59,070 --> 00:41:04,250 If not, I go up, and when do I stop? 850 00:41:04,250 --> 00:41:10,598 When I go left or right? 851 00:41:10,598 --> 00:41:13,430 AUDIENCE: You'd have to go right. 852 00:41:13,430 --> 00:41:14,850 Is that right? 853 00:41:14,850 --> 00:41:15,630 PROFESSOR: Yep. 854 00:41:15,630 --> 00:41:20,300 So last time, in this case, when I was going up, 855 00:41:20,300 --> 00:41:22,160 if I was going left, I had to keep going, 856 00:41:22,160 --> 00:41:25,250 and the moment I went right, I was happy and I stopped. 857 00:41:25,250 --> 00:41:28,560 What if I want to find the predecessor? 858 00:41:28,560 --> 00:41:29,810 It's the opposite, right? 859 00:41:29,810 --> 00:41:34,770 So I will go this way, and the moment I can go this way, 860 00:41:34,770 --> 00:41:35,960 I'm done. 861 00:41:35,960 --> 00:41:37,624 How do you do this in code? 862 00:41:42,570 --> 00:41:43,765 Slightly tricky. 863 00:41:43,765 --> 00:41:44,940 Just slightly, I promise. 864 00:41:49,080 --> 00:41:51,616 AUDIENCE: [INAUDIBLE]. 865 00:41:51,616 --> 00:41:53,790 PROFESSOR: It's hard. 866 00:41:53,790 --> 00:41:56,630 What I would do is copy paste the code, 867 00:41:56,630 --> 00:41:59,725 replace "left" with "right" everywhere, and replace "min" 868 00:41:59,725 --> 00:42:03,251 with "max." 869 00:42:03,251 --> 00:42:05,240 You get it done. 
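The two cases for the successor, and the mirrored predecessor, can be sketched as follows. This is an illustrative sketch rather than the handout's code: the names `Node`, `insert`, `subtree_min`, `subtree_max`, `next_larger`, and `predecessor` are assumptions for the example, and the keys only loosely follow the tree on the board.

```python
class Node:
    def __init__(self, key, parent=None):
        self.key = key
        self.parent = parent
        self.left = None
        self.right = None


def insert(root, key):
    # Plain BST insert used only to build the example tree.
    node = root
    while True:
        side = "left" if key < node.key else "right"
        child = getattr(node, side)
        if child is None:
            setattr(node, side, Node(key, parent=node))
            return getattr(node, side)
        node = child


def subtree_min(node):
    while node.left is not None:   # keep going left
        node = node.left
    return node


def subtree_max(node):
    while node.right is not None:  # keep going right
        node = node.right
    return node


def next_larger(node):
    # Case 1: there is a right subtree, so take its minimum.
    if node.right is not None:
        return subtree_min(node.right)
    # Case 2: climb while we are a right child; the first parent reached
    # from its left side is the successor (None if we run past the root).
    while node.parent is not None and node is node.parent.right:
        node = node.parent
    return node.parent


def predecessor(node):
    # Same code with left/right and min/max exchanged.
    if node.left is not None:
        return subtree_max(node.left)
    while node.parent is not None and node is node.parent.left:
        node = node.parent
    return node.parent


root = Node(23)
nodes = {23: root}
for key in [8, 42, 4, 16, 15]:
    nodes[key] = insert(root, key)

print(next_larger(nodes[8]).key)    # minimum of the right subtree of 8
print(next_larger(nodes[16]).key)   # climbs past 8 and stops at the root
print(next_larger(nodes[42]))       # the maximum has no successor
print(predecessor(nodes[15]).key)
```

Both functions walk at most one root-to-leaf path, so they run in O(h) like find and insert.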
870 00:42:05,240 --> 00:42:07,910 So we talked about how the tree is symmetric, right? 871 00:42:07,910 --> 00:42:12,670 So every time, instead of saying "left," you say "right," 872 00:42:12,670 --> 00:42:14,830 and instead of saying "min," you say "max." 873 00:42:14,830 --> 00:42:15,860 That's how you do this. 874 00:42:18,760 --> 00:42:19,905 How do we do deletions? 875 00:42:23,170 --> 00:42:25,910 So suppose I'm in this tree and I want to delete 15. 876 00:42:25,910 --> 00:42:28,710 What do I do? 877 00:42:28,710 --> 00:42:29,570 AUDIENCE: Kill it. 878 00:42:29,570 --> 00:42:30,510 PROFESSOR: Kill it. 879 00:42:30,510 --> 00:42:31,714 Very good. 880 00:42:31,714 --> 00:42:32,880 What if I want to delete 16? 881 00:42:32,880 --> 00:42:34,011 What do I do? 882 00:42:39,663 --> 00:42:43,797 AUDIENCE: You need to put 15 where 16 is. 883 00:42:43,797 --> 00:42:44,380 PROFESSOR: OK. 884 00:42:44,380 --> 00:42:46,715 So I would put 15 here. 885 00:42:51,770 --> 00:42:53,640 So I had 16. 886 00:42:56,520 --> 00:42:58,085 Suppose I have a big tree here. 887 00:43:03,087 --> 00:43:04,670 Actually, let's go for an easier case. 888 00:43:04,670 --> 00:43:09,920 Let's say I have this tree here. 889 00:43:09,920 --> 00:43:12,000 So you're here, you have a big tree here, 890 00:43:12,000 --> 00:43:13,070 you don't have anything here, and you 891 00:43:13,070 --> 00:43:14,070 want to delete this guy. 892 00:43:17,450 --> 00:43:19,165 AUDIENCE: You know that everything less 893 00:43:19,165 --> 00:43:21,370 than the top node is going to be less than it, 894 00:43:21,370 --> 00:43:22,830 so you can just move that up. 895 00:43:22,830 --> 00:43:27,769 PROFESSOR: Everything less than this guy is also 896 00:43:27,769 --> 00:43:29,060 going to be less than this guy. 897 00:43:29,060 --> 00:43:32,250 So you're saying move the whole tree up. 898 00:43:32,250 --> 00:43:32,992 AUDIENCE: Yep. 
899 00:43:32,992 --> 00:43:34,450 PROFESSOR: So the way we do that is 900 00:43:34,450 --> 00:43:38,290 we'd take this node's left link and make it point here, 901 00:43:38,290 --> 00:43:42,690 and take this guy's parent link and make it point here, 902 00:43:42,690 --> 00:43:44,730 and this guy sort of goes away. 903 00:43:48,250 --> 00:43:50,090 So we have two cases for deleting. 904 00:43:50,090 --> 00:43:53,250 We have if you're a leaf, we'll take you out. 905 00:44:00,600 --> 00:44:01,100 Sorry. 906 00:44:01,100 --> 00:44:02,190 I got confused. 907 00:44:02,190 --> 00:44:04,350 If you have one child and that child 908 00:44:04,350 --> 00:44:07,860 is in the same direction as your parent, then you can do this. 909 00:44:07,860 --> 00:44:21,730 What if you have one child, but it's a zigzag like this? 910 00:44:21,730 --> 00:44:22,380 What do you do? 911 00:44:29,520 --> 00:44:32,020 AUDIENCE: It's still greater than, so you do the same thing. 912 00:44:32,020 --> 00:44:33,250 PROFESSOR: Exactly. 913 00:44:33,250 --> 00:44:33,800 Same thing. 914 00:44:37,200 --> 00:44:40,269 Just change this guy, change this guy, and I'm happy. 915 00:44:40,269 --> 00:44:42,810 So it doesn't matter if you have a zigzag or a straight line. 916 00:44:42,810 --> 00:44:45,290 It might help you think about it to convince yourself 917 00:44:45,290 --> 00:44:49,970 that the code is correct, but in the end, you do the same thing. 918 00:44:49,970 --> 00:44:54,840 Now, what if I want to delete node 8? 919 00:44:54,840 --> 00:45:10,370 So what if I have a nasty case where I want to delete this guy 920 00:45:10,370 --> 00:45:12,833 and it has children both on the left and on the right? 
921 00:45:22,200 --> 00:45:25,030 AUDIENCE: You have to take 8, compare it to its parent 922 00:45:25,030 --> 00:45:26,605 and compare it to its right child, 923 00:45:26,605 --> 00:45:28,667 and see which one is greater in order 924 00:45:28,667 --> 00:45:33,857 to figure out which node gets replaced in its spot. 925 00:45:33,857 --> 00:45:34,440 PROFESSOR: OK. 926 00:45:34,440 --> 00:45:37,064 So there is replacing that's going to happen. 927 00:45:37,064 --> 00:45:38,230 The answer is really tricky. 928 00:45:38,230 --> 00:45:41,290 I always forget this when coding. 929 00:45:41,290 --> 00:45:44,130 Try to understand it, and if it doesn't work, 930 00:45:44,130 --> 00:45:45,300 refer to the textbook. 931 00:45:45,300 --> 00:45:48,270 When you forget it, because you will, refer to the textbook 932 00:45:48,270 --> 00:45:50,120 or to the internet. 933 00:45:50,120 --> 00:45:54,650 So what you do is I can't just magically replace this node 934 00:45:54,650 --> 00:45:57,650 with one of the subtrees, but we talked right 935 00:45:57,650 --> 00:46:07,030 before this about Next Greater, so finding a node's successor. 936 00:46:07,030 --> 00:46:10,560 If this node has both a left subtree and a right subtree, 937 00:46:10,560 --> 00:46:13,460 then I know that if I call Find Successor on it, 938 00:46:13,460 --> 00:46:17,600 I'm going to go somewhere inside here, 939 00:46:17,600 --> 00:46:23,000 and I'm going to find a node somewhere in here all the way 940 00:46:23,000 --> 00:46:25,510 to the left that is this guy's successor. 941 00:46:29,670 --> 00:46:32,340 So what I'm going to do is I'm going 942 00:46:32,340 --> 00:46:38,030 to delete this node instead, and then I'm going to take its key 943 00:46:38,030 --> 00:46:38,990 and put it up here. 
944 00:46:42,880 --> 00:46:48,550 So if I want to delete 8, what I do is I find its successor, 945 00:46:48,550 --> 00:46:53,180 then I delete it, then I take the 15 that was here-- you 946 00:46:53,180 --> 00:46:53,930 can see it, right? 947 00:46:53,930 --> 00:46:54,638 It's still there. 948 00:46:57,290 --> 00:46:59,680 Put it here. 949 00:46:59,680 --> 00:47:05,470 So the reason this works is that everything here 950 00:47:05,470 --> 00:47:08,010 is greater than this guy. 951 00:47:08,010 --> 00:47:11,240 Everything here is smaller than this guy. 952 00:47:11,240 --> 00:47:14,480 This is the next node that's greater than this guy, 953 00:47:14,480 --> 00:47:16,760 but everything else is bigger than it, 954 00:47:16,760 --> 00:47:19,590 right, because we wanted it to be a successor. 955 00:47:19,590 --> 00:47:22,840 So if I take this value and I put it up here, 956 00:47:22,840 --> 00:47:25,500 everything in here is still going to be greater than it. 957 00:47:30,310 --> 00:47:33,000 This is a successor of this guy, so everything here 958 00:47:33,000 --> 00:47:35,000 is still going to be smaller than the successor. 959 00:47:44,310 --> 00:47:45,360 Great. 960 00:47:45,360 --> 00:47:48,620 In order to do a delete, I find the successor, 961 00:47:48,620 --> 00:47:50,455 and then I call Delete on it. 962 00:47:50,455 --> 00:47:51,830 How do I know that this will end? 963 00:47:51,830 --> 00:47:54,210 How do I know that I'm not going to go 964 00:47:54,210 --> 00:47:58,533 into a loop that runs forever? 965 00:47:58,533 --> 00:47:59,699 AUDIENCE: Because it's not-- 966 00:47:59,699 --> 00:48:01,947 AUDIENCE: It's acyclic, right? 967 00:48:01,947 --> 00:48:02,530 PROFESSOR: OK. 968 00:48:05,670 --> 00:48:07,540 First answer, good. 969 00:48:07,540 --> 00:48:11,290 Eventually, worst case, I'm going to get to the maximum, 970 00:48:11,290 --> 00:48:16,340 and then not going to have to delete the successor anymore.
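The deletion procedure described above can be sketched in Python as follows. This is a minimal sketch, not the course's actual code: it assumes each node carries `parent`, `left`, and `right` links, and the names `Node`, `insert`, `delete`, and `inorder` are made up for illustration. When the node has two children, its successor is the leftmost node of its right subtree, so the successor has no left child and can be spliced out with the easy case.

```python
class Node:
    """A BST node with parent, left, and right links (illustrative names)."""

    def __init__(self, key):
        self.key = key
        self.parent = None
        self.left = None
        self.right = None


def insert(root, key):
    """Plain BST insert; returns the (possibly new) root."""
    new = Node(key)
    if root is None:
        return new
    node = root
    while True:
        side = "left" if key < node.key else "right"
        child = getattr(node, side)
        if child is None:
            setattr(node, side, new)
            new.parent = node
            return root
        node = child


def delete(root, node):
    """Delete `node` from the tree rooted at `root`; returns the new root."""
    if node.left is not None and node.right is not None:
        # Sad case: find the successor, the leftmost node of the right subtree.
        succ = node.right
        while succ.left is not None:
            succ = succ.left
        node.key = succ.key  # move the successor's key up
        node = succ          # now remove the successor, which has no left child
    # Happy case: at most one child, so splice the node out by relinking.
    child = node.left if node.left is not None else node.right
    if child is not None:
        child.parent = node.parent
    if node.parent is None:
        return child         # the removed node was the root
    if node.parent.left is node:
        node.parent.left = child
    else:
        node.parent.right = child
    return root


def inorder(node):
    """Keys in sorted order, used only to check the result."""
    if node is None:
        return []
    return inorder(node.left) + [node.key] + inorder(node.right)
```

The splice at the end handles the leaf case and both one-child shapes (straight line or zigzag) with the same few link updates, which is why the two-child case costs only one extra walk down to the successor.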
971 00:48:16,340 --> 00:48:17,780 Now, another thing to note here is 972 00:48:17,780 --> 00:48:20,710 that if this guy is the successor of this guy, 973 00:48:20,710 --> 00:48:24,300 it can't have anything on the left, because if it would, 974 00:48:24,300 --> 00:48:27,800 then whatever is down here has to be bigger than this, 975 00:48:27,800 --> 00:48:29,520 and whatever's to the left of this node 976 00:48:29,520 --> 00:48:32,560 has to be smaller than this. 977 00:48:32,560 --> 00:48:34,740 But we said that this is the successor of this, 978 00:48:34,740 --> 00:48:36,500 so there's nothing here. 979 00:48:36,500 --> 00:48:40,000 So this will be one of the easy cases that we talked about. 980 00:48:40,000 --> 00:48:42,500 The successor either has no kids, 981 00:48:42,500 --> 00:48:47,940 or it has only one child, only one subtree. 982 00:48:47,940 --> 00:48:51,140 So then I can delete it using one of the easy cases. 983 00:48:51,140 --> 00:48:55,170 So in fact, worst case that happens in a delete is my node 984 00:48:55,170 --> 00:48:56,230 has two subtrees. 985 00:48:56,230 --> 00:48:59,460 Then I find the successor that's only going to have one subtree, 986 00:48:59,460 --> 00:49:01,130 I change my links there, and I'm done. 987 00:49:03,970 --> 00:49:05,660 What is the running time for Delete? 988 00:49:15,103 --> 00:49:17,186 AUDIENCE: Is it order h, because you should do it 989 00:49:17,186 --> 00:49:19,939 all the way down to the bottom of the tree, right? 990 00:49:19,939 --> 00:49:21,480 PROFESSOR: You have the right answer. 991 00:49:21,480 --> 00:49:22,684 Let's see why it's order h. 992 00:49:22,684 --> 00:49:23,850 It has to be order h, right? 993 00:49:23,850 --> 00:49:25,433 Otherwise, the tree would be too slow. 994 00:49:25,433 --> 00:49:29,130 If it's order N, then it's bad. 995 00:49:29,130 --> 00:49:32,490 So why would Delete be order h? 996 00:49:32,490 --> 00:49:35,350 This was a heap, right, so I can't use this. 
997 00:49:35,350 --> 00:49:39,401 I'm going to write "delete" here again. 998 00:49:39,401 --> 00:49:41,900 So the first thing you do is you have to search for the key, 999 00:49:41,900 --> 00:49:42,630 right? 1000 00:49:42,630 --> 00:49:43,610 That's order h. 1001 00:49:46,320 --> 00:49:48,960 Now, if it's a happy case, if it's case one or two, 1002 00:49:48,960 --> 00:49:50,750 you change some links and you're done. 1003 00:49:50,750 --> 00:49:51,830 What's the time for that? 1004 00:49:54,679 --> 00:49:55,470 AUDIENCE: Constant. 1005 00:49:55,470 --> 00:49:56,810 PROFESSOR: Constant. 1006 00:49:56,810 --> 00:49:59,950 So happy case, order h for sure. 1007 00:49:59,950 --> 00:50:00,659 Now sad case. 1008 00:50:00,659 --> 00:50:02,200 If you have two children, what do you 1009 00:50:02,200 --> 00:50:06,580 have to do after you realize that you have two subtrees? 1010 00:50:06,580 --> 00:50:08,047 AUDIENCE: Find the successor. 1011 00:50:08,047 --> 00:50:08,630 PROFESSOR: OK. 1012 00:50:08,630 --> 00:50:12,378 What's the running time for finding a successor? 1013 00:50:12,378 --> 00:50:14,270 AUDIENCE: Order h. 1014 00:50:14,270 --> 00:50:15,090 PROFESSOR: Order h. 1015 00:50:19,310 --> 00:50:21,287 Once I find the successor, what do I do? 1016 00:50:25,560 --> 00:50:29,287 Call Delete on that, and what happens? 1017 00:50:29,287 --> 00:50:30,620 It's a happy case or a sad case? 1018 00:50:30,620 --> 00:50:32,410 AUDIENCE: It's a happy case. 1019 00:50:32,410 --> 00:50:35,076 PROFESSOR: Happy case, a few links get swapped, 1020 00:50:35,076 --> 00:50:36,070 constant time. 1021 00:50:36,070 --> 00:50:40,010 So worst case, order h plus order h. 1022 00:50:40,010 --> 00:50:41,260 Order h. 1023 00:50:41,260 --> 00:50:45,280 So insertions are order h, deletions are order h. 1024 00:50:45,280 --> 00:50:46,410 AUDIENCE: The first one. 1025 00:50:46,410 --> 00:50:47,950 Because the second one is from finding the successor. 
1026 00:50:47,950 --> 00:50:49,033 What is the first one for? 1027 00:50:49,033 --> 00:50:52,930 PROFESSOR: Finding the node for a key in the tree. 1028 00:50:52,930 --> 00:50:55,910 So if I say Delete 8, then you have to find 8. 1029 00:50:55,910 --> 00:50:58,880 If I give you the node, then you don't have that. 1030 00:50:58,880 --> 00:50:59,506 Good question. 1031 00:50:59,506 --> 00:51:00,380 It's a good question. 1032 00:51:00,380 --> 00:51:01,000 Thank you. 1033 00:51:06,380 --> 00:51:07,500 So that's insertion. 1034 00:51:07,500 --> 00:51:08,470 That's deletion. 1035 00:51:11,169 --> 00:51:12,585 Let's look at the code for Delete. 1036 00:51:17,430 --> 00:51:18,810 Looks kind of long. 1037 00:51:22,320 --> 00:51:26,637 So lines through 21, happy case or sad case? 1038 00:51:31,410 --> 00:51:33,150 Try to do it by looking at the "if" 1039 00:51:33,150 --> 00:51:35,010 instead of looking at the comments. 1040 00:51:38,130 --> 00:51:42,180 So lines through 21 for Delete. 1041 00:51:45,484 --> 00:51:46,900 AUDIENCE: On this tree? 1042 00:51:46,900 --> 00:51:49,258 Which tree, because there are two deletes? 1043 00:51:49,258 --> 00:51:50,133 PROFESSOR: Oh really? 1044 00:51:53,440 --> 00:51:53,940 Sorry. 1045 00:51:53,940 --> 00:51:56,145 Why do we have two deletes? 1046 00:51:56,145 --> 00:52:01,409 AUDIENCE: There's BST Delete and then there's BST Node Delete. 1047 00:52:01,409 --> 00:52:02,450 PROFESSOR: So BST Delete. 1048 00:52:06,000 --> 00:52:08,400 Finds the node, and then calls Delete on the node. 1049 00:52:11,540 --> 00:52:13,460 And then if the node is a tree's root, 1050 00:52:13,460 --> 00:52:16,480 then it updates the tree's root. 1051 00:52:16,480 --> 00:52:19,410 So let's look at the node's delete. 1052 00:52:19,410 --> 00:52:20,295 Oh, I see. 1053 00:52:20,295 --> 00:52:21,920 I think I was looking at the wrong one. 1054 00:52:26,020 --> 00:52:26,580 Thank you. 1055 00:52:26,580 --> 00:52:29,280 My Delete was much longer than yours.
1056 00:52:29,280 --> 00:52:33,760 So lines 3 through 12, happy case or sad case? 1057 00:52:40,180 --> 00:52:44,290 Look at the "if" on line 3 and tell me, 1058 00:52:44,290 --> 00:52:45,650 what case is it going for? 1059 00:52:51,208 --> 00:52:52,124 AUDIENCE: [INAUDIBLE]. 1060 00:52:55,660 --> 00:52:57,580 PROFESSOR: If it doesn't have a left child 1061 00:52:57,580 --> 00:52:59,280 or it doesn't have a right child, 1062 00:52:59,280 --> 00:53:01,750 is that the happy case or the sad case? 1063 00:53:01,750 --> 00:53:02,480 AUDIENCE: Happy. 1064 00:53:02,480 --> 00:53:03,396 PROFESSOR: Happy case. 1065 00:53:03,396 --> 00:53:08,300 So lines 4 through 12 handle the happy case. 1066 00:53:08,300 --> 00:53:10,990 Lines 14 through 16 handle the sad case. 1067 00:53:14,710 --> 00:53:17,860 Do lines 14 through 16 make sense? 1068 00:53:17,860 --> 00:53:19,740 Find the successor, then swap the keys, 1069 00:53:19,740 --> 00:53:21,148 then delete that successor. 1070 00:53:26,820 --> 00:53:29,660 Now, lines 4 through 11 are pretty much what 1071 00:53:29,660 --> 00:53:33,960 we talked about here, except I can't draw arrows on the board 1072 00:53:33,960 --> 00:53:37,472 and instead I have to change left and right links. 1073 00:53:37,472 --> 00:53:41,130 Line 4 has to see if we're a left child or a right child, 1074 00:53:41,130 --> 00:53:45,240 and then lines 5 through 7 and 9 through 11 are pretty much 1075 00:53:45,240 --> 00:53:48,600 copy paste, swap left with right. 1076 00:53:48,600 --> 00:53:51,166 And they changed the links like we changed them here. 1077 00:53:59,330 --> 00:54:03,150 Do we have any questions on Deletes? 1078 00:54:03,150 --> 00:54:08,440 AUDIENCE: So if the successor had a right child, 1079 00:54:08,440 --> 00:54:15,050 then all you do, you just do the workaround thing where 1080 00:54:15,050 --> 00:54:15,550 you just-- 1081 00:54:15,550 --> 00:54:17,530 PROFESSOR: Yep. 
1082 00:54:17,530 --> 00:54:19,990 So the case that it doesn't have two children. 1083 00:54:19,990 --> 00:54:21,870 As long as it doesn't have both children, 1084 00:54:21,870 --> 00:54:24,328 you're in the happy case and you can do some link swapping. 1085 00:54:28,355 --> 00:54:29,870 Are you guys burned out already? 1086 00:54:33,240 --> 00:54:33,910 Fair enough. 1087 00:54:37,250 --> 00:54:38,540 I left a part out. 1088 00:54:38,540 --> 00:54:43,120 What I left out is how to augment a binary tree. 1089 00:54:43,120 --> 00:54:46,740 So binary trees by default can answer the question, 1090 00:54:46,740 --> 00:54:49,500 what's the minimum node in a tree in order h. 1091 00:54:49,500 --> 00:54:51,710 You go all the way to the left, you find the minimum. 1092 00:54:51,710 --> 00:54:53,260 That's the minimum. 1093 00:54:53,260 --> 00:54:56,920 It turns out that if you make a node a little bit fatter, 1094 00:54:56,920 --> 00:55:03,070 so if instead of storing, say, 23 in this node, I store 23, 1095 00:55:03,070 --> 00:55:05,630 and I store the fact that the minimum in my left subtree 1096 00:55:05,630 --> 00:55:10,230 is 4, then it turns out that I can answer the question 1097 00:55:10,230 --> 00:55:13,066 in constant time, what's the minimum? 1098 00:55:13,066 --> 00:55:15,475 Oh gee, if you store the minimum here, 1099 00:55:15,475 --> 00:55:18,550 of course you can retrieve it in constant time, right? 1100 00:55:18,550 --> 00:55:21,940 The hard part is, how do you handle insertions and updates 1101 00:55:21,940 --> 00:55:25,580 in the same time? 1102 00:55:25,580 --> 00:55:27,770 So the idea is that if I have a node 1103 00:55:27,770 --> 00:55:33,990 and I have a function here, say the minimum of everything, 1104 00:55:33,990 --> 00:55:40,530 if I have two children, here they're 15 and 42, 1105 00:55:40,530 --> 00:55:42,800 and say the minimum in this tree is 4 1106 00:55:42,800 --> 00:55:45,850 and the minimum in this tree is. 
1107 00:55:45,850 --> 00:55:50,076 So if I already computed the function for these guys, 1108 00:55:50,076 --> 00:55:51,700 how do I compute the function for this? 1109 00:55:54,039 --> 00:55:55,580 AUDIENCE: [INAUDIBLE] and compare it? 1110 00:55:55,580 --> 00:55:56,220 PROFESSOR: Yep. 1111 00:55:56,220 --> 00:55:59,190 Take the minimum of these two guys, right? 1112 00:55:59,190 --> 00:56:02,290 There are some special cases if you don't have a child. 1113 00:56:02,290 --> 00:56:06,540 If you don't have a left child, then you're the minimum. 1114 00:56:06,540 --> 00:56:08,440 But you write down those special cases, 1115 00:56:08,440 --> 00:56:10,930 and you can compute this in how much time? 1116 00:56:14,650 --> 00:56:17,020 AUDIENCE: Order h, right? 1117 00:56:17,020 --> 00:56:19,250 PROFESSOR: What if I already computed the answer 1118 00:56:19,250 --> 00:56:20,759 for the children? 1119 00:56:20,759 --> 00:56:22,300 How much time does it take to compute 1120 00:56:22,300 --> 00:56:24,432 the answer for a single node? 1121 00:56:24,432 --> 00:56:25,304 AUDIENCE: Constant. 1122 00:56:25,304 --> 00:56:26,137 PROFESSOR: Constant. 1123 00:56:26,137 --> 00:56:26,720 OK. 1124 00:56:26,720 --> 00:56:28,271 AUDIENCE: For a tree, though. 1125 00:56:28,271 --> 00:56:29,770 PROFESSOR: For a tree, it's order h. 1126 00:56:29,770 --> 00:56:30,270 Yeah. 1127 00:56:30,270 --> 00:56:32,791 You're getting ahead. 1128 00:56:32,791 --> 00:56:33,540 You're rushing me. 1129 00:56:33,540 --> 00:56:34,748 You're not letting me finish. 1130 00:56:34,748 --> 00:56:38,200 AUDIENCE: Are you saying that we store the minimum value? 1131 00:56:38,200 --> 00:56:39,345 PROFESSOR: So for every-- 1132 00:56:39,345 --> 00:56:43,425 AUDIENCE: Each node has a field that 1133 00:56:43,425 --> 00:56:46,316 says what the minimum value is in that tree. 1134 00:56:46,316 --> 00:56:47,550 PROFESSOR: Yep, exactly. 1135 00:56:47,550 --> 00:56:51,340 So for each node, what's the minimum in the subtree. 
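The constant-time step discussed above, computing a node's stored minimum from its children's already-computed minimums, might look like the sketch below. The field name `min` and the helper `update_min` are assumptions for illustration, not names from the course code.

```python
def update_min(node):
    """Recompute node.min in constant time, assuming the children's
    min fields are already correct."""
    node.min = node.key
    for child in (node.left, node.right):
        # A missing child is the special case: it contributes nothing.
        if child is not None and child.min < node.min:
            node.min = child.min


class Node:
    """A binary tree node augmented with the minimum key in its subtree."""

    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right
        update_min(self)  # children are built first, so their min is ready
```

In a binary search tree the left child's stored minimum always wins when a left child exists, so the comparison against the right child is redundant there. It is kept so the same shape carries over to other functions of the two children.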
1136 00:56:51,340 --> 00:56:54,740 So if I add a node here, suppose I add three 1137 00:56:54,740 --> 00:57:01,840 and I had my minimums, what changed? 1138 00:57:01,840 --> 00:57:03,870 This subtree changed, this subtree changed, 1139 00:57:03,870 --> 00:57:07,670 this subtree changed, and then this subtree changed. 1140 00:57:07,670 --> 00:57:12,150 So I have to update the minimums here, here, here, here. 1141 00:57:12,150 --> 00:57:13,910 Nothing else changed. 1142 00:57:13,910 --> 00:57:16,190 Outside the path where I did the Insert, 1143 00:57:16,190 --> 00:57:19,680 nothing changed, so I don't have to update anything. 1144 00:57:19,680 --> 00:57:22,790 So what I do is after the Insert, I go back up 1145 00:57:22,790 --> 00:57:25,305 and I re-compute the values. 1146 00:57:25,305 --> 00:57:26,180 So here, I'll have 3. 1147 00:57:26,180 --> 00:57:29,774 I go back up 3, 3, 3. 1148 00:57:29,774 --> 00:57:32,980 AUDIENCE: You could when you're passing down, though. 1149 00:57:32,980 --> 00:57:35,818 When you're going down that column, 1150 00:57:35,818 --> 00:57:37,600 you can just compare it on the way down. 1151 00:57:37,600 --> 00:57:39,100 You don't have to go back up, right? 1152 00:57:39,100 --> 00:57:40,040 PROFESSOR: Yep. 1153 00:57:40,040 --> 00:57:42,480 So the advantage of doing it the way I'm saying it 1154 00:57:42,480 --> 00:57:46,400 is that you can have other functions instead of minimum. 1155 00:57:46,400 --> 00:57:48,640 As long as you can compute the function 1156 00:57:48,640 --> 00:57:51,480 inside the parent in constant time 1157 00:57:51,480 --> 00:57:53,850 using the function from the children, 1158 00:57:53,850 --> 00:57:57,379 it makes sense to compute the function on the children first. 1159 00:57:57,379 --> 00:57:59,420 There's an obvious function that I can't tell you 1160 00:57:59,420 --> 00:58:02,760 because that's on the Pset, but when you see the next Pset, 1161 00:58:02,760 --> 00:58:05,810 you'll see what I mean. 
1162 00:58:05,810 --> 00:58:08,565 So if you have a function where you 1163 00:58:08,565 --> 00:58:11,190 know the result for the children and you can compute the result 1164 00:58:11,190 --> 00:58:15,820 for the parent in constant time, then after you do the Insert, 1165 00:58:15,820 --> 00:58:21,550 you go up on the path and you re-compute the function. 1166 00:58:21,550 --> 00:58:25,478 When you delete, what do you do? 1167 00:58:25,478 --> 00:58:26,874 AUDIENCE: Same thing. 1168 00:58:26,874 --> 00:58:27,790 PROFESSOR: Same thing. 1169 00:58:30,640 --> 00:58:35,850 If this goes away, then this subtree changed, and then 1170 00:58:35,850 --> 00:58:37,530 if there would be something else here, 1171 00:58:37,530 --> 00:58:41,240 then this subtree changed, but nothing else changed. 1172 00:58:41,240 --> 00:58:43,880 So whenever you do an Insert or a Delete, all you have to do 1173 00:58:43,880 --> 00:58:46,210 is go back up the path to the parent 1174 00:58:46,210 --> 00:58:52,000 and re-compute the function that you're trying to compute. 1175 00:58:52,000 --> 00:58:55,310 And that's tree augmentation. 1176 00:58:55,310 --> 00:58:58,370 Does this make sense somewhat? 1177 00:58:58,370 --> 00:58:59,440 That's it. 1178 00:58:59,440 --> 00:59:00,960 So what you'll find in lecture notes 1179 00:59:00,960 --> 00:59:05,850 is a harder way of doing it that works for minimum, 1180 00:59:05,850 --> 00:59:08,990 but what I told you works for everything. 1181 00:59:08,990 --> 00:59:12,790 So don't tell people I told you how to do this for everything. 1182 00:59:12,790 --> 00:59:15,250 Sure nobody's going to know.
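The augmentation recipe above (insert, then walk back up the path recomputing the function) can be sketched as follows. This is a sketch under assumptions: the field name `aug`, the `combine` callback, and the helper names are hypothetical. `combine(key, left_value, right_value)` must run in constant time and receives `None` for a missing child.

```python
class Node:
    """A BST node with a parent link and one augmented field."""

    def __init__(self, key):
        self.key = key
        self.parent = None
        self.left = None
        self.right = None
        self.aug = key


def subtree_min(key, left_value, right_value):
    """Example combine function: the minimum key in the subtree."""
    return min(v for v in (key, left_value, right_value) if v is not None)


def insert(root, key, combine):
    """Insert `key`, then recompute `aug` on the path back to the root."""
    new = Node(key)
    new.aug = combine(key, None, None)
    if root is None:
        return new
    node = root
    while True:  # ordinary BST descent, order h
        if key < node.key:
            if node.left is None:
                node.left = new
                break
            node = node.left
        else:
            if node.right is None:
                node.right = new
                break
            node = node.right
    new.parent = node
    while node is not None:  # walk back up, order h
        left_value = node.left.aug if node.left is not None else None
        right_value = node.right.aug if node.right is not None else None
        node.aug = combine(node.key, left_value, right_value)
        node = node.parent
    return root
```

Only the nodes on the insertion path are touched, so the extra work is order h. A delete would run the same upward loop starting from the parent of the node that was spliced out.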