2 00:00:07,000 --> 00:00:11,000 Today we're going to not talk about sorting. 3 00:00:11,000 --> 00:00:14,000 This is an exciting new development. 4 00:00:14,000 --> 00:00:18,000 We're going to talk about another problem, 5 00:00:18,000 --> 00:00:23,000 a related problem, but a different problem. 6 00:00:35,000 --> 00:00:38,000 We're going to talk about another problem that we would 7 00:00:38,000 --> 00:00:41,000 like to solve in linear time. Last class we talked about we 8 00:00:41,000 --> 00:00:44,000 could do sorting in linear time. To do that we needed some 9 00:00:44,000 --> 00:00:47,000 additional assumptions. Today we're going to look at a 10 00:00:47,000 --> 00:00:51,000 problem that really only needs linear time, even though at 11 00:00:51,000 --> 00:00:54,000 first glance it might look like it requires sorting. 12 00:00:54,000 --> 00:00:56,000 So this is going to be an easier problem. 13 00:00:56,000 --> 00:01:00,000 The problem is I give you a bunch of numbers. 14 00:01:00,000 --> 00:01:06,000 Let's call them elements. And they are in some array, 15 00:01:06,000 --> 00:01:11,000 let's say. And they're in no particular 16 00:01:11,000 --> 00:01:18,000 order, so unsorted. I want to find the kth smallest 17 00:01:18,000 --> 00:01:20,000 element. 18 00:01:26,000 --> 00:01:30,000 This is called the element of rank k. 19 00:01:37,000 --> 00:01:39,000 In other words, I have this list of numbers 20 00:01:39,000 --> 00:01:43,000 which is unsorted. And, if I were to sort it, 21 00:01:43,000 --> 00:01:46,000 I would like to know what the kth element is. 22 00:01:46,000 --> 00:01:50,000 But I'm not allowed to sort it. One solution to this problem, 23 00:01:50,000 --> 00:01:54,000 this is the naïve algorithm, is you just sort and then 24 00:01:54,000 --> 00:01:57,000 return the kth element. This is another possible 25 00:01:57,000 --> 00:02:03,000 definition of the problem. 
And we would like to do better 26 00:02:03,000 --> 00:02:05,000 than that. So you could sort, 27 00:02:05,000 --> 00:02:10,000 what's called the array A, and then return A[k]. 28 00:02:10,000 --> 00:02:16,000 That is one thing we could do. And if we use heap sort or 29 00:02:16,000 --> 00:02:20,000 mergesort, this will take n lg n time. 30 00:02:20,000 --> 00:02:23,000 We would like to do better than n lg n. 31 00:02:23,000 --> 00:02:29,000 Ideally linear time. The problem is pretty natural, 32 00:02:29,000 --> 00:02:34,000 straightforward. It has various applications. 33 00:02:34,000 --> 00:02:39,000 Depending on how you choose k, k could be any number between 1 34 00:02:39,000 --> 00:02:41,000 and n. For example, 35 00:02:41,000 --> 00:02:44,000 if we choose k=1 that element has a name. 36 00:02:44,000 --> 00:02:47,000 Any suggestions of what the name is? 37 00:02:47,000 --> 00:02:50,000 The minimum. That's easy. 38 00:02:50,000 --> 00:02:55,000 Any suggestions on how we could find the minimum element in an 39 00:02:55,000 --> 00:02:59,000 array in linear time? Right. 40 00:02:59,000 --> 00:03:04,000 Just scan through the array. Keep track of what the smallest 41 00:03:04,000 --> 00:03:08,000 number is that you've seen. The same thing with the 42 00:03:08,000 --> 00:03:12,000 maximum, k=n. These are rather trivial. 43 00:03:12,000 --> 00:03:17,000 But a more interesting version of the order statistic problem 44 00:03:17,000 --> 00:03:21,000 is to find the median. This is either k equals n plus 45 00:03:21,000 --> 00:03:26,000 1 over 2 floor or ceiling. I will call both of those 46 00:03:26,000 --> 00:03:29,000 elements medians. 47 00:03:34,000 --> 00:03:37,000 Finding the median of an unsorted array in linear time is 48 00:03:37,000 --> 00:03:39,000 quite tricky. And that sort of is the main 49 00:03:39,000 --> 00:03:41,000 goal of this lecture, is to be able to find the 50 00:03:41,000 --> 00:03:43,000 medians. 
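The naive approach and the two trivial order statistics described above can be sketched as follows. This is only an illustration: the function names are hypothetical, and ranks are 1-indexed as in the lecture while Python lists are 0-indexed.

```python
def select_by_sorting(a, k):
    """Return the element of rank k (1-indexed) by sorting a copy: O(n lg n)."""
    if not 1 <= k <= len(a):
        raise ValueError("k must be between 1 and len(a)")
    # The lecture's A[k] is 1-indexed, so subtract 1 for Python's 0-indexed lists.
    return sorted(a)[k - 1]


def minimum_by_scan(a):
    """Return the rank-1 element with a single linear scan."""
    smallest = a[0]
    for x in a[1:]:
        if x < smallest:
            smallest = x  # remember the smallest number seen so far
    return smallest


def median_ranks(n):
    """Return the two median ranks, floor((n+1)/2) and ceil((n+1)/2)."""
    return (n + 1) // 2, (n + 2) // 2
```

For an even number of elements the two median ranks differ, for example `median_ranks(8)` gives ranks 4 and 5, and for an odd number they coincide.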
For free we're going to be able 51 00:03:43,000 --> 00:03:46,000 to find the arbitrary kth smallest element, 52 00:03:46,000 --> 00:03:48,000 but typically we're most interested in finding the 53 00:03:48,000 --> 00:03:50,000 median. And on Friday in recitation 54 00:03:50,000 --> 00:03:52,000 you'll see why that is so useful. 55 00:03:52,000 --> 00:03:55,000 There are all sorts of situations where you can use 56 00:03:55,000 --> 00:03:58,000 median for really effective divide-and-conquer without 57 00:03:58,000 --> 00:04:02,000 having to sort. You can solve a lot of problems 58 00:04:02,000 --> 00:04:07,000 in linear time as a result. And we're going to cover today 59 00:04:07,000 --> 00:04:10,000 two algorithms for finding order statistics. 60 00:04:10,000 --> 00:04:15,000 Both of them are linear time. The first one is randomized, 61 00:04:15,000 --> 00:04:18,000 so it's only linear expected time. 62 00:04:18,000 --> 00:04:21,000 And the second one is worst-case linear time, 63 00:04:21,000 --> 00:04:25,000 and it will build on the randomized version. 64 00:04:25,000 --> 00:04:31,000 Let's start with a randomized divide-and-conquer algorithm. 65 00:04:46,000 --> 00:04:49,000 This algorithm is called rand-select. 66 00:05:02,000 --> 00:05:06,000 And the parameters are a little bit more than what we're used 67 00:05:06,000 --> 00:05:08,000 to. In the order statistics problem 68 00:05:08,000 --> 00:05:12,000 you're given an array A. And here I've changed notation 69 00:05:12,000 --> 00:05:15,000 and I'm looking for the ith smallest element, 70 00:05:15,000 --> 00:05:18,000 so i is the index I'm looking for. 71 00:05:18,000 --> 00:05:21,000 And I'm also going to change the problem a little bit. 72 00:05:21,000 --> 00:05:25,000 And instead of trying to find it in the whole array, 73 00:05:25,000 --> 00:05:29,000 I'm going to look in a particular interval of the 74 00:05:29,000 --> 00:05:33,000 array, A from p up to q. 
We're going to need that for a 75 00:05:33,000 --> 00:05:36,000 recursion. This better be a recursive 76 00:05:36,000 --> 00:05:39,000 algorithm because we're using divide-and-conquer. 77 00:05:39,000 --> 00:05:41,000 Here is the algorithm. 78 00:05:51,000 --> 00:05:54,000 With a base case. It's pretty simple. 79 00:05:54,000 --> 00:06:00,000 Then we're going to use part of the quicksort algorithm, 80 00:06:00,000 --> 00:06:03,000 randomized quicksort. 81 00:06:09,000 --> 00:06:13,000 We didn't actually define this subroutine two lectures ago, 82 00:06:13,000 --> 00:06:17,000 but you should know what it does, especially if you've read 83 00:06:17,000 --> 00:06:20,000 the textbook. This says in the array A[p...q] 84 00:06:20,000 --> 00:06:24,000 pick a random element, so pick a random index between 85 00:06:24,000 --> 00:06:30,000 p and q, swap it with the first element, then call partition. 86 00:06:30,000 --> 00:06:34,000 And partition uses that first element to split the rest of the 87 00:06:34,000 --> 00:06:39,000 array into less than or equal to that random partition and 88 00:06:39,000 --> 00:06:42,000 greater than or equal to that partition. 89 00:06:42,000 --> 00:06:47,000 This is just picking a random partition element between p and 90 00:06:47,000 --> 00:06:52,000 q, cutting the array in half, although the two sizes may not 91 00:06:52,000 --> 00:06:54,000 be equal. And it returns the index of 92 00:06:54,000 --> 00:07:00,000 that partition element, some number between p and q. 93 00:07:00,000 --> 00:07:08,000 And we're going to define k to be this particular value, 94 00:07:08,000 --> 00:07:15,000 r minus p plus 1. And the reason for that is that 95 00:07:15,000 --> 00:07:21,000 k is then the rank of the partition element. 96 00:07:21,000 --> 00:07:30,000 This is in A[p...q]. Let me draw a picture here. 97 00:07:30,000 --> 00:07:34,000 We have our array A. It starts at p and ends at q. 
98 00:07:34,000 --> 00:07:38,000 There is other stuff, but for this recursive call all we 99 00:07:38,000 --> 00:07:42,000 care about is p up to q. We pick a random partition 100 00:07:42,000 --> 00:07:47,000 element, say this one, and we partition things so that 101 00:07:47,000 --> 00:07:50,000 everything in here, let's call this r, 102 00:07:50,000 --> 00:07:55,000 is less than or equal to A[r] and everything up here is 103 00:07:55,000 --> 00:08:00,000 greater than or equal to A[r]. And A[r] is our partition 104 00:08:00,000 --> 00:08:03,000 element. After this call, 105 00:08:03,000 --> 00:08:06,000 that's what the array looks like. 106 00:08:06,000 --> 00:08:09,000 And we get r. We get the index of where 107 00:08:09,000 --> 00:08:14,000 the partition element is stored. The number of elements that are 108 00:08:14,000 --> 00:08:20,000 less than or equal to A[r], including A[r] itself, is r minus p plus 1. 109 00:08:20,000 --> 00:08:23,000 There will be r minus p elements here, 110 00:08:23,000 --> 00:08:28,000 and we're adding 1 to get this element. 111 00:08:28,000 --> 00:08:32,000 And, if you start counting at 1, if this is rank 1, 112 00:08:32,000 --> 00:08:35,000 rank 2, this element will have rank k. 113 00:08:35,000 --> 00:08:40,000 That's just from the construction in the partition. 114 00:08:40,000 --> 00:08:46,000 And now we get to recurse. And there are three cases -- 115 00:08:53,000 --> 00:08:55,000 -- depending on how i relates to k. 116 00:08:55,000 --> 00:08:57,000 Remember i is the rank that we're looking for, 117 00:08:57,000 --> 00:09:01,000 k is the rank that we happen to get out of this random 118 00:09:01,000 --> 00:09:03,000 partition. We don't have much control over 119 00:09:03,000 --> 00:09:07,000 k, but if we're lucky i=k. That's the element we want. 120 00:09:13,000 --> 00:09:15,000 Then we just return the partition element. 
121 00:09:15,000 --> 00:09:18,000 More likely is that the element we're looking for is either to 122 00:09:18,000 --> 00:09:20,000 the left or to the right. And if it's to the left we're 123 00:09:20,000 --> 00:09:23,000 going to recurse in the left-hand portion of the array. 124 00:09:23,000 --> 00:09:26,000 And if it's to the right we're going to recurse in the 125 00:09:26,000 --> 00:09:28,000 right-hand portion. So, pretty straightforward at 126 00:09:28,000 --> 00:09:30,000 this point. 127 00:09:45,000 --> 00:09:48,000 I just have to get all the indices right. 128 00:10:08,000 --> 00:10:11,000 Either we're going to recurse on the part between p and r 129 00:10:11,000 --> 00:10:14,000 minus 1, that's this case. The rank we're looking for is 130 00:10:14,000 --> 00:10:17,000 to the left of the rank of element A[r]. 131 00:10:17,000 --> 00:10:20,000 Or, we're going to recurse on the right part between r plus 1 132 00:10:20,000 --> 00:10:22,000 and q. Where we recurse on the left 133 00:10:22,000 --> 00:10:25,000 part the rank we're looking for remains the same, 134 00:10:25,000 --> 00:10:28,000 but when we recurse on the right part the rank we're 135 00:10:28,000 --> 00:10:33,000 looking for gets offset. Because we sort of got rid of 136 00:10:33,000 --> 00:10:38,000 the k elements over here. I should have written this 137 00:10:38,000 --> 00:10:42,000 length is k. We've sort of swept away k 138 00:10:42,000 --> 00:10:46,000 ranks of elements. And now within this array we're 139 00:10:46,000 --> 00:10:51,000 looking for the i minus kth smallest element. 140 00:10:51,000 --> 00:10:55,000 That's the recursion. We only recurse once. 141 00:10:55,000 --> 00:11:00,000 And random partition is not a recursion. 142 00:11:00,000 --> 00:11:04,000 That just takes linear time. And the total amount of work 143 00:11:04,000 --> 00:11:09,000 we're doing here should be linear time plus one recursion. 
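The algorithm as described so far can be sketched in Python. This is a minimal sketch under stated assumptions, not the code written on the board: the transcript does not spell out the partition loop or the base case, so the single-pass partition around the first element and the `p == q` base case below are assumptions, as are the function and parameter names. Positions `p`, `q`, `r` are 0-indexed here, while the ranks `i` and `k` stay 1-indexed as in the lecture.

```python
import random


def rand_partition(a, p, q, rng=random):
    """Partition a[p..q] (inclusive) around a random pivot; return the pivot's index r."""
    s = rng.randint(p, q)        # random index between p and q, inclusive
    a[p], a[s] = a[s], a[p]      # swap it with the first element
    pivot = a[p]
    i = p                        # invariant: a[p+1..i] holds elements <= pivot
    for j in range(p + 1, q + 1):
        if a[j] <= pivot:
            i += 1
            a[i], a[j] = a[j], a[i]
    a[p], a[i] = a[i], a[p]      # place the pivot between the two sides
    return i


def rand_select(a, p, q, i, rng=random):
    """Return the i-th smallest element (rank i, 1-indexed) of a[p..q], rearranging a."""
    if p == q:                   # assumed base case: a single element
        return a[p]
    r = rand_partition(a, p, q, rng)
    k = r - p + 1                # rank of a[r] within a[p..q]
    if i == k:
        return a[r]              # lucky: the pivot is the element we want
    if i < k:
        return rand_select(a, p, r - 1, i, rng)      # rank is unchanged on the left
    return rand_select(a, r + 1, q, i - k, rng)      # k ranks swept away on the right
```

Calling `rand_select(values, 0, len(values) - 1, i)` on a list of distinct numbers should agree with `sorted(values)[i - 1]`, whatever pivots the random choices produce. Only one of the two recursive calls is ever made, which is the difference from quicksort that the analysis relies on.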
144 00:11:09,000 --> 00:11:14,000 And we'd next like to see what the total running time is in 145 00:11:14,000 --> 00:11:19,000 expectation, but let's first do a little example -- 146 00:11:26,000 --> 00:11:29,000 -- to make this algorithm perfectly clear. 147 00:11:29,000 --> 00:11:33,000 Let's suppose we're looking for the seventh smallest element in 148 00:11:33,000 --> 00:11:35,000 this array. 149 00:11:50,000 --> 00:11:53,000 And let's suppose, just for example, 150 00:11:53,000 --> 00:11:57,000 that the pivot we're using is just the first element. 151 00:11:57,000 --> 00:12:02,000 So, nothing fancy. I would have to flip a few 152 00:12:02,000 --> 00:12:06,000 coins in order to generate a random one, so let's just pick 153 00:12:06,000 --> 00:12:09,000 this one. If I partition at the element 154 00:12:09,000 --> 00:12:13,000 6, this is actually an example we did two weeks ago, 155 00:12:13,000 --> 00:12:17,000 and I won't go through it again, but we get the same 156 00:12:17,000 --> 00:12:21,000 array, as we did two weeks ago, namely 2, 5, 157 00:12:21,000 --> 00:12:23,000 3, 6, 8, 13, 10 and 11. 158 00:12:23,000 --> 00:12:26,000 If you run through the partitioning algorithm, 159 00:12:26,000 --> 00:12:31,000 that happens to be the order that it throws the elements 160 00:12:31,000 --> 00:12:35,000 into. And this is our position r. 161 00:12:35,000 --> 00:12:37,000 This is p here. It's just 1. 162 00:12:37,000 --> 00:12:40,000 And q is just the end. And I am looking for the 163 00:12:40,000 --> 00:12:44,000 seventh smallest element. And it happens when I run this 164 00:12:44,000 --> 00:12:48,000 partition that 6 falls into the fourth place. 165 00:12:48,000 --> 00:12:52,000 And we know that means, because all the elements here 166 00:12:52,000 --> 00:12:56,000 are less than 6 and all the elements here are greater than 167 00:12:56,000 --> 00:13:00,000 6, if this array were sorted, 6 would be right here in 168 00:13:00,000 --> 00:13:05,000 position four. 
So, r here is 4. 169 00:13:05,000 --> 00:13:09,000 Yeah? The 12 turned into an 11? 170 00:13:09,000 --> 00:13:13,000 This was an 11, believe it or not. 171 00:13:13,000 --> 00:13:16,000 Let me be simple. Sorry. 172 00:13:16,000 --> 00:13:20,000 Sometimes my ones look like twos. 173 00:13:20,000 --> 00:13:27,000 Not a good feature. That's an easy way to cover. 174 00:13:27,000 --> 00:13:31,000 [LAUGHTER] Don't try that on exams. 175 00:13:31,000 --> 00:13:33,000 Oh, that one was just a two. No. 176 00:13:33,000 --> 00:13:37,000 Even though we're not sorting the array, we're only spending 177 00:13:37,000 --> 00:13:39,000 linear work here to partition by 6. 178 00:13:39,000 --> 00:13:43,000 We know that if we had sorted the array 6 would fall here. 179 00:13:43,000 --> 00:13:46,000 We don't know about these other elements. 180 00:13:46,000 --> 00:13:49,000 They're not in sorted order, but from the properties of 181 00:13:49,000 --> 00:13:52,000 partition we know 6 went to the right spot. 182 00:13:52,000 --> 00:13:56,000 We now know rank of 6 is 4. We happened to be looking for 7 183 00:13:56,000 --> 00:14:00,000 and we happened to get this number 4. 184 00:14:00,000 --> 00:14:03,000 We want something over here. It turns out we're looking for 185 00:14:03,000 --> 00:14:05,000 10, I guess. No, 11. 186 00:14:05,000 --> 00:14:08,000 There should be eight elements in this array, 187 00:14:08,000 --> 00:14:10,000 so it's the next to max. Max here is 13, 188 00:14:10,000 --> 00:14:14,000 I'm cheating here. The answer we're looking for is 11. We know that what we're looking 189 00:14:16,000 --> 00:14:20,000 for is in the right-hand part because the rank we're looking 190 00:14:20,000 --> 00:14:22,000 for is 7, which is bigger than 4. 191 00:14:22,000 --> 00:14:25,000 Now, what rank are we looking for in here? 192 00:14:25,000 --> 00:14:30,000 Well, we've gotten rid of four elements over here. 
193 00:14:30,000 --> 00:14:35,000 It happened here that k is also 4 because p is 1 in this 194 00:14:35,000 --> 00:14:38,000 example. The rank of 6 was 4. 195 00:14:38,000 --> 00:14:41,000 We throw away those four elements. 196 00:14:41,000 --> 00:14:46,000 Now we're looking for rank 7 minus 4 which is 3. 197 00:14:46,000 --> 00:14:49,000 And, indeed, the rank 3 element here is 198 00:14:49,000 --> 00:14:53,000 still 11. So, you recursively find that. 199 00:14:53,000 --> 00:14:58,000 That's your answer. Now that algorithm should be 200 00:14:58,000 --> 00:15:03,000 pretty clear. The tricky part is to analyze 201 00:15:03,000 --> 00:15:05,000 it. And the analysis here is quite 202 00:15:05,000 --> 00:15:10,000 a bit like randomized quicksort, although not quite as hairy, 203 00:15:10,000 --> 00:15:13,000 so it will go faster. But it will be also sort of a 204 00:15:13,000 --> 00:15:18,000 nice review of the randomized quicksort analysis which was a 205 00:15:18,000 --> 00:15:21,000 bit tricky and always good to see a couple of times. 206 00:15:21,000 --> 00:15:26,000 We're going to follow the same kind of outline as before to 207 00:15:26,000 --> 00:15:31,000 look at the expected running time of this algorithm. 208 00:15:31,000 --> 00:15:34,000 And to start out we're going to, as before, 209 00:15:34,000 --> 00:15:39,000 look at some intuition just to feel good about ourselves. 210 00:15:39,000 --> 00:15:44,000 Also feel bad as you'll see. Let's think about two sort of 211 00:15:44,000 --> 00:15:49,000 extreme cases, a good case and the worst case. 212 00:15:49,000 --> 00:15:54,000 And I should mention that in all of the analyses today we 213 00:15:54,000 --> 00:15:58,000 assume the elements are distinct. 214 00:16:04,000 --> 00:16:08,000 It gets really messy if the elements are not distinct. 
215 00:16:08,000 --> 00:16:12,000 And you may even have to change the algorithms a little bit 216 00:16:12,000 --> 00:16:16,000 because if all the elements are equal, if you pick a random 217 00:16:16,000 --> 00:16:19,000 element, the partition does not do so well. 218 00:16:19,000 --> 00:16:24,000 But let's assume they're all distinct, which is the really 219 00:16:24,000 --> 00:16:28,000 interesting case. A pretty lucky case -- 220 00:16:28,000 --> 00:16:32,000 I mean the best case is we partition right in the middle. 221 00:16:32,000 --> 00:16:37,000 The number of elements to the left of our partition is equal 222 00:16:37,000 --> 00:16:42,000 to the number of elements to the right of our partition. 223 00:16:42,000 --> 00:16:47,000 But almost as good would be some kind of 1/10 to 9/10 split. 224 00:16:47,000 --> 00:16:50,000 Any constant fraction, we should feel that. 225 00:16:50,000 --> 00:16:54,000 Any constant fraction is as good as 1/2. 226 00:16:54,000 --> 00:16:58,000 Then the recurrence we get is, let's say at most, 227 00:16:58,000 --> 00:17:01,000 this bad. So, it depends. 228 00:17:01,000 --> 00:17:04,000 If we have let's say 1/10 on the left and 9/10 on the right 229 00:17:04,000 --> 00:17:08,000 every time we do a partition. It depends where our answer is. 230 00:17:08,000 --> 00:17:12,000 It could be if i is really small it's in the 1/10 part. 231 00:17:12,000 --> 00:17:16,000 If i is really big it's going to be in the 9/10 part, 232 00:17:16,000 --> 00:17:19,000 or most of the time it's going to be in the 9/10 part. 233 00:17:19,000 --> 00:17:23,000 We're doing worst-case analysis within the lucky case, 234 00:17:23,000 --> 00:17:25,000 so we're happy to have upper bounds. 235 00:17:25,000 --> 00:17:30,000 I will say T(n) is at most T(9n/10)+Theta(n). 236 00:17:30,000 --> 00:17:34,000 Clearly it's worse if we're in the bigger part. 237 00:17:34,000 --> 00:17:38,000 What is the solution to this recurrence? 
238 00:17:38,000 --> 00:17:42,000 Oh, solving recurrences was so long ago. 239 00:17:42,000 --> 00:17:47,000 What method should we use for solving this recurrence? 240 00:17:47,000 --> 00:17:51,000 The master method. What case are we in? 241 00:17:51,000 --> 00:17:52,000 Three. Good. 242 00:17:52,000 --> 00:17:55,000 You still remember. This is Case 3. 243 00:17:55,000 --> 00:18:01,000 We're looking at n^(log_b(a)). b here is 10/9, 244 00:18:01,000 --> 00:18:06,000 although it doesn't really matter because a is 1. 245 00:18:06,000 --> 00:18:11,000 log base anything of 1 is 0. So, this is n^0 which is 1. 246 00:18:11,000 --> 00:18:14,000 And n is polynomially larger than 1. 247 00:18:14,000 --> 00:18:18,000 This is going to be O(n), which is good. 248 00:18:18,000 --> 00:18:21,000 That is what we want, linear time. 249 00:18:21,000 --> 00:18:25,000 If we're in the lucky case, great. 250 00:18:25,000 --> 00:18:30,000 Unfortunately this is only intuition. 251 00:18:30,000 --> 00:18:32,000 And we're not always going to get the lucky case. 252 00:18:32,000 --> 00:18:35,000 We could do the same kind of analysis as we did with 253 00:18:35,000 --> 00:18:38,000 randomized quicksort. If you alternate between lucky 254 00:18:38,000 --> 00:18:41,000 and unlucky, things will still be good, but let's just talk 255 00:18:41,000 --> 00:18:44,000 about the unlucky case to show how bad things can get. 256 00:18:44,000 --> 00:18:48,000 And this really would be a worst-case analysis. 257 00:18:53,000 --> 00:19:00,000 In the unlucky case we get a split of 0:n-1. 258 00:19:00,000 --> 00:19:04,000 Because we're removing the partition element either way. 259 00:19:04,000 --> 00:19:09,000 And there could be nothing less than the partition element. 260 00:19:09,000 --> 00:19:14,000 We have 0 on the left-hand side and we have n-1 on the 261 00:19:14,000 --> 00:19:18,000 right-hand side. Now we get a recurrence like 262 00:19:18,000 --> 00:19:23,000 T(n)=T(n-1) plus linear cost. 
And what's the solution to that 263 00:19:23,000 --> 00:19:25,000 recurrence? n^2. 264 00:19:25,000 --> 00:19:27,000 Yes. This one you should just know. 265 00:19:27,000 --> 00:19:33,000 It's n^2 because it's an arithmetic series. 266 00:19:38,000 --> 00:19:40,000 And that's pretty bad. This is much, 267 00:19:40,000 --> 00:19:43,000 much worse than sorting and then picking the ith element. 268 00:19:43,000 --> 00:19:46,000 In the worst-case this algorithm really sucks, 269 00:19:46,000 --> 00:19:49,000 but most of the time it's going to do really well. 270 00:19:49,000 --> 00:19:52,000 And, unless you're really, really unlucky and every coin 271 00:19:52,000 --> 00:19:56,000 you flip gives the wrong answer, you won't get this case and you 272 00:19:56,000 --> 00:19:58,000 will get something more like the lucky case. 273 00:19:58,000 --> 00:20:02,000 At least that's what we'd like to prove. 274 00:20:02,000 --> 00:20:05,000 And we will prove that the expected running time here is 275 00:20:05,000 --> 00:20:07,000 linear. So, it's very rare to get 276 00:20:07,000 --> 00:20:09,000 anything quadratic. But later on we will see how to 277 00:20:09,000 --> 00:20:11,000 make the worst-case linear as well. 278 00:20:11,000 --> 00:20:15,000 This would really, really solve the problem. 279 00:20:30,000 --> 00:20:34,000 Let's get into the analysis. 280 00:20:43,000 --> 00:20:47,000 Now, you've seen an analysis much like this before. 281 00:20:47,000 --> 00:20:51,000 What do you suggest we do in order to analyze this expected 282 00:20:51,000 --> 00:20:54,000 time? It's a divide-and-conquer 283 00:20:54,000 --> 00:20:59,000 algorithm, so we kind of like to write down the recurrence on 284 00:20:59,000 --> 00:21:03,000 something resembling the running time. 
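The two recurrences from the intuition above can also be unrolled directly, as an alternative to the master method used in the lecture. This is only a sketch: the constant c standing in for the Theta(n) term is an assumption introduced here for illustration.

```latex
% Lucky case: unrolling gives a geometric series
% (c is an assumed constant for the Theta(n) term)
T(n) \le T\!\left(\tfrac{9}{10}n\right) + cn
     \le cn \sum_{j \ge 0} \left(\tfrac{9}{10}\right)^{j} + O(1)
     = 10\,cn + O(1) = O(n)

% Unlucky case: unrolling gives an arithmetic series
T(n) = T(n-1) + cn
     = c \sum_{j=1}^{n} j + O(1)
     = c\,\frac{n(n+1)}{2} + O(1) = \Theta(n^2)
```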
285 00:21:09,000 --> 00:21:12,000 I don't need the answer, but what's the first step that 286 00:21:12,000 --> 00:21:16,000 we might do to analyze the expected running time of this 287 00:21:16,000 --> 00:21:18,000 algorithm? Sorry? 288 00:21:18,000 --> 00:21:20,000 Look at different cases, yeah. 289 00:21:20,000 --> 00:21:22,000 Exactly. We have all these possible ways 290 00:21:22,000 --> 00:21:25,000 that random partition could split. 291 00:21:25,000 --> 00:21:30,000 It could split 0 to the n-1. It could split in half. 292 00:21:30,000 --> 00:21:33,000 There are n choices where it could split. 293 00:21:33,000 --> 00:21:35,000 How can we break into those cases? 294 00:21:35,000 --> 00:21:38,000 Indicator random variables. Cool. 295 00:21:38,000 --> 00:21:41,000 Exactly. That's what we want to do. 296 00:21:41,000 --> 00:21:46,000 Indicator random variable suggests that what we're dealing 297 00:21:46,000 --> 00:21:50,000 with is not exactly just a function T(n) but it's a random 298 00:21:50,000 --> 00:21:53,000 variable. This is one subtlety. 299 00:21:53,000 --> 00:21:57,000 T(n) depends on the random choices, so it's really a random 300 00:21:57,000 --> 00:22:00,000 variable. 301 00:22:05,000 --> 00:22:08,000 And then we're going to use indicator random variables to 302 00:22:08,000 --> 00:22:10,000 get a recurrence on T(n). 303 00:22:25,000 --> 00:22:32,000 So, T(n) is the running time of rand-select on an input of size 304 00:22:32,000 --> 00:22:33,000 n. 305 00:22:40,000 --> 00:22:46,000 And I am also going to write down explicitly an assumption 306 00:22:46,000 --> 00:22:49,000 about the random numbers. 307 00:22:55,000 --> 00:23:00,000 That they should be chosen independently from each other. 308 00:23:00,000 --> 00:23:03,000 Every time I call random partition, it's generating a 309 00:23:03,000 --> 00:23:07,000 completely independent random number from all the other times 310 00:23:07,000 --> 00:23:10,000 I call random partition. 
That is important, 311 00:23:10,000 --> 00:23:12,000 of course, for this analysis to work. 312 00:23:12,000 --> 00:23:15,000 We will see why some point down the line. 313 00:23:15,000 --> 00:23:19,000 And now, to sort of write down an equation for T(n) we're going 314 00:23:19,000 --> 00:23:24,000 to define indicator random variables, as you suggested. 315 00:23:36,000 --> 00:23:44,000 And we will call it X_k. And this is for all k=0...n-1. 316 00:23:50,000 --> 00:23:54,000 Indicator random variables either 1 or 0. 317 00:23:54,000 --> 00:24:00,000 And it's going to be 1 if the partition comes out k on the 318 00:24:00,000 --> 00:24:06,000 left-hand side. So say the partition generates 319 00:24:06,000 --> 00:24:11,000 a k:n-k-1 split and it is 0 otherwise. 320 00:24:11,000 --> 00:24:17,000 We have n of these indicator random variables between 321 00:24:17,000 --> 00:24:20,000 0...n-1. And in each case, 322 00:24:20,000 --> 00:24:27,000 no matter how the random choice comes out, exactly one of them 323 00:24:27,000 --> 00:24:32,000 will be 1. All the others will be 0. 324 00:24:32,000 --> 00:24:37,000 Now we can divide out the running time of this algorithm 325 00:24:37,000 --> 00:24:40,000 based on which case we're in. 326 00:24:49,000 --> 00:24:57,000 That will sort of unify this intuition that we did and get 327 00:24:57,000 --> 00:25:02,000 all the cases. And then we can look at the 328 00:25:02,000 --> 00:25:08,000 expectation. T(n), if we just split out by 329 00:25:08,000 --> 00:25:15,000 cases, we have an upper bound like this. 330 00:25:28,000 --> 00:25:33,000 If we have 0 to n-1 split, the worst is we have n-1. 331 00:25:33,000 --> 00:25:38,000 Then we have to recurse in a problem of size n-1. 332 00:25:38,000 --> 00:25:43,000 In fact, it would be pretty hard to recurse in a problem of 333 00:25:43,000 --> 00:25:47,000 size 0. If we have a 1 to n-2 split 334 00:25:47,000 --> 00:25:51,000 then we take the max of the two sides. 
335 00:25:51,000 --> 00:25:58,000 That's certainly going to give us an upper bound and so on. 336 00:26:03,000 --> 00:26:06,000 And at the bottom you get an n-1 to 0 split. 337 00:26:14,000 --> 00:26:16,000 This is now sort of conditioning on various events, 338 00:26:16,000 --> 00:26:19,000 but we have indicator random variables to tell us when these 339 00:26:19,000 --> 00:26:21,000 events happen. We can just multiply each of 340 00:26:21,000 --> 00:26:25,000 these values by the indicator random variable and it will come 341 00:26:25,000 --> 00:26:28,000 out 0 if that's not the case and will come out 1 and give us this 342 00:26:28,000 --> 00:26:31,000 value if that happens to be the split. 343 00:26:31,000 --> 00:26:37,000 So, if we add up all of those we'll get the same thing. 344 00:26:37,000 --> 00:26:45,000 This is equal to the sum over all k of the indicator random 345 00:26:45,000 --> 00:26:52,000 variable times the cost in that case, which is t of max k, 346 00:26:52,000 --> 00:26:57,000 and the other side, which is n-k-1, 347 00:26:57,000 --> 00:27:01,000 plus theta n. This is our recurrence, 348 00:27:01,000 --> 00:27:04,000 in some sense, for the random variable 349 00:27:04,000 --> 00:27:09,000 representing running time. Now, the value will depend on 350 00:27:09,000 --> 00:27:13,000 which case we come into. We know the probability of each 351 00:27:13,000 --> 00:27:19,000 of these events happening is the same because we're choosing the 352 00:27:19,000 --> 00:27:23,000 partition element uniformly at random, but we cannot really 353 00:27:23,000 --> 00:27:29,000 simplify much beyond this until we take expectations. 354 00:27:29,000 --> 00:27:32,000 We know this random variable could be as big as n^2. 355 00:27:32,000 --> 00:27:37,000 Hopefully it's usually linear. We will take expectations of 356 00:27:37,000 --> 00:27:40,000 both sides and get what we want. 
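Written out, the indicator random variables and the recurrence just described look roughly like this. The notation follows the spoken description, and the exact layout on the board is not recoverable from the transcript.

```latex
% Indicator for each possible split, k = 0, 1, ..., n-1
X_k =
\begin{cases}
  1 & \text{if the partition generates a } k : n-k-1 \text{ split} \\
  0 & \text{otherwise}
\end{cases}

% Exactly one X_k equals 1, so the sum picks out the cost of the case that occurs
T(n) \le \sum_{k=0}^{n-1} X_k \Bigl( T\bigl(\max\{k,\; n-k-1\}\bigr) + \Theta(n) \Bigr)
```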
357 00:27:54,000 --> 00:27:58,000 Let's look at the expectation of this random variable, 358 00:27:58,000 --> 00:28:02,000 which is just the expectation, I will copy over, 359 00:28:02,000 --> 00:28:07,000 summation we have here so I can work on this board. 360 00:28:30,000 --> 00:28:33,000 I want to compute the expectation of this summation. 361 00:28:33,000 --> 00:28:36,000 What property of expectation should I use? 362 00:28:36,000 --> 00:28:39,000 Linearity, good. We can bring the summation 363 00:28:39,000 --> 00:28:41,000 outside. 364 00:29:08,000 --> 00:29:09,000 Now I have a sum of expectation. 365 00:29:09,000 --> 00:29:12,000 Let's look at each expectation individually. 366 00:29:12,000 --> 00:29:15,000 It's a product of two random variables, if you will. 367 00:29:15,000 --> 00:29:19,000 This is an indicator random variable and this is some more 368 00:29:19,000 --> 00:29:22,000 complicated function, some more complicated random 369 00:29:22,000 --> 00:29:24,000 variable representing some running time, 370 00:29:24,000 --> 00:29:28,000 which depends on what random choices are made in that 371 00:29:28,000 --> 00:29:31,000 recursive call. Now what should I do? 372 00:29:31,000 --> 00:29:37,000 I have the expectation of the product of two random variables. 373 00:29:37,000 --> 00:29:39,000 Independence, exactly. 374 00:29:39,000 --> 00:29:45,000 If I know that these two random variables are independent then I 375 00:29:45,000 --> 00:29:51,000 know that the expectation of the product is the product of the 376 00:29:51,000 --> 00:29:55,000 expectations. Now we have to check are they 377 00:29:55,000 --> 00:29:58,000 independent? I hope so because otherwise 378 00:29:58,000 --> 00:30:04,000 there isn't much else I can do. Why are they independent? 379 00:30:04,000 --> 00:30:07,000 Sorry? Because we stated that they 380 00:30:07,000 --> 00:30:10,000 are, right. Because of this assumption. 
381 00:30:10,000 --> 00:30:14,000 We assume that all the random numbers are chosen 382 00:30:14,000 --> 00:30:17,000 independently. We need to sort of interpolate 383 00:30:17,000 --> 00:30:19,000 that here. These X_k's, 384 00:30:19,000 --> 00:30:21,000 all the X_k's, X_0 up to X_{n-1}, 385 00:30:21,000 --> 00:30:26,000 so all the ones appearing in this summation are dependent 386 00:30:26,000 --> 00:30:30,000 upon a single random choice of this particular call to random 387 00:30:30,000 --> 00:30:36,000 partition. All of these are correlated, 388 00:30:36,000 --> 00:30:44,000 because if one of them is 1, all the others are forced to be 0. So, there is a lot of 389 00:30:47,000 --> 00:30:54,000 correlation among the X_k's. But with respect to everything 390 00:30:54,000 --> 00:31:00,000 that is in here, and the only random part is 391 00:31:00,000 --> 00:31:07,000 this T(max(k, n-k-1)). That is the reason that this 392 00:31:07,000 --> 00:31:12,000 random variable is independent from these. 393 00:31:12,000 --> 00:31:19,000 The same thing as quicksort, but I know some people got 394 00:31:19,000 --> 00:31:24,000 confused about it a couple lectures ago so I am 395 00:31:24,000 --> 00:31:29,000 reiterating. We get the product of 396 00:31:29,000 --> 00:31:35,000 expectations, E[X_k] E[T(max(k, n-k-1))]. 397 00:31:35,000 --> 00:31:40,000 I mean the order n comes outside, but let's leave it 398 00:31:40,000 --> 00:31:44,000 inside for now. There is no expectation to 399 00:31:44,000 --> 00:31:49,000 compute there for order n. Order n is order n. 400 00:31:49,000 --> 00:31:55,000 What is the expectation of X_k? 1/n, because they're all chosen 401 00:31:55,000 --> 00:32:00,000 with equal probability. There is n of them, 402 00:32:00,000 --> 00:32:04,000 so the expectation is 1/n. The value is either 1 or 0. 403 00:32:04,000 --> 00:32:07,000 We start to be able to split this up. 
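The steps taken so far can be collected into one chain. This is a sketch of the spoken derivation, with each line annotated by the property it uses.

```latex
E[T(n)]
  \le E\!\left[ \sum_{k=0}^{n-1} X_k \Bigl( T\bigl(\max\{k,\, n-k-1\}\bigr) + \Theta(n) \Bigr) \right]
  % linearity of expectation
  = \sum_{k=0}^{n-1} E\!\left[ X_k \Bigl( T\bigl(\max\{k,\, n-k-1\}\bigr) + \Theta(n) \Bigr) \right]
  % X_k is independent of the random choices made inside the recursive call
  = \sum_{k=0}^{n-1} E[X_k] \cdot E\!\left[ T\bigl(\max\{k,\, n-k-1\}\bigr) + \Theta(n) \right]
  % E[X_k] = 1/n, since each of the n splits is equally likely
  = \frac{1}{n} \sum_{k=0}^{n-1} E\!\left[ T\bigl(\max\{k,\, n-k-1\}\bigr) + \Theta(n) \right]
```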
404 00:32:07,000 --> 00:32:12,000 We have 1/n times this expected value of some recursive T call, 405 00:32:12,000 --> 00:32:15,000 and then we have plus 1 over n times order n, 406 00:32:15,000 --> 00:32:20,000 also known as a constant, but everything is summed up n 407 00:32:20,000 --> 00:32:23,000 times so let's expand this. 408 00:32:35,000 --> 00:32:42,000 I have the sum k=0 to n-1. I guess the 1/n can come 409 00:32:42,000 --> 00:32:47,000 outside. And we have expectation of 410 00:32:47,000 --> 00:32:54,000 [T(max(k, n-k-1))]. Lots of nifty braces there. 411 00:32:54,000 --> 00:32:59,000 And then plus we have, on the other hand, 412 00:32:59,000 --> 00:33:06,000 the sum k=0 to n-1. Let me just write that out 413 00:33:06,000 --> 00:33:08,000 again. We have a 1/n in front and we 414 00:33:08,000 --> 00:33:12,000 have a Theta(n) inside. This summation is order n^2. 415 00:33:12,000 --> 00:33:16,000 And then we're dividing by n, so this whole thing is, 416 00:33:16,000 --> 00:33:20,000 again, order n. Nothing fancy happened there. 417 00:33:20,000 --> 00:33:25,000 This is really just saying the expectation of order n is order 418 00:33:25,000 --> 00:33:27,000 n. Average value of order n is 419 00:33:27,000 --> 00:33:31,000 order n. What is interesting is this 420 00:33:31,000 --> 00:33:35,000 part. Now, what could we do with this 421 00:33:35,000 --> 00:33:38,000 summation? Here we start to differ from 422 00:33:38,000 --> 00:33:43,000 randomized quicksort because we have this max. 423 00:33:43,000 --> 00:33:48,000 In randomized quicksort we had the sum of T(k) plus T(n-k-1) 424 00:33:48,000 --> 00:33:52,000 because we were making both recursive calls. 425 00:33:52,000 --> 00:33:56,000 Here we're only making the biggest one. 426 00:33:56,000 --> 00:34:03,000 That max is really a pain for evaluating this recurrence. 427 00:34:03,000 --> 00:34:11,000 How could I get rid of the max? That's one way to think of it. 428 00:34:11,000 --> 00:34:13,000 Yeah?
429 00:34:18,000 --> 00:34:20,000 Exactly. I could sum only up to halfway 430 00:34:20,000 --> 00:34:23,000 and then double. In other words, 431 00:34:23,000 --> 00:34:26,000 terms are getting repeated twice here. 432 00:34:26,000 --> 00:34:30,000 When k=0 or when k=n-1, I get the same T(n-1). 433 00:34:30,000 --> 00:34:33,000 When k=1 or n-2, I get the same thing, 434 00:34:33,000 --> 00:34:37,000 2 and n-3. What I will actually do is sum 435 00:34:37,000 --> 00:34:42,000 from halfway up. That's a little bit cleaner. 436 00:34:42,000 --> 00:34:45,000 And let me get the indices right. 437 00:34:45,000 --> 00:34:49,000 Floor of n/2 up to n-1 will be safe. 438 00:34:49,000 --> 00:34:55,000 And then I just have E[T(k)], except I forgot to multiply by 439 00:34:55,000 --> 00:35:01,000 2, so I'm going to change this 1 to a 2. 440 00:35:01,000 --> 00:35:04,000 And order n is preserved. This is just because each term 441 00:35:04,000 --> 00:35:07,000 is appearing twice. I can factor it out. 442 00:35:07,000 --> 00:35:10,000 And if n is odd, I'm actually double-counting 443 00:35:10,000 --> 00:35:13,000 somewhat, but it's certainly at most that. 444 00:35:13,000 --> 00:35:17,000 So, that's a safe upper bound. And upper bounds are all we 445 00:35:17,000 --> 00:35:20,000 care about because we're hoping to get linear. 446 00:35:20,000 --> 00:35:24,000 And the running time of this algorithm is definitely at least 447 00:35:24,000 --> 00:35:29,000 linear, so we just need an upper bound that is linear. 448 00:35:29,000 --> 00:35:32,000 So, this is a recurrence. E[T(n)] is at most 2/n times 449 00:35:32,000 --> 00:35:36,000 the sum, over the upper half of the numbers between 0 and n, of 450 00:35:36,000 --> 00:35:39,000 E[T(k)], plus Theta(n). It's a bit of a hairy recurrence. 451 00:35:39,000 --> 00:35:41,000 We want to solve it, though. 452 00:35:41,000 --> 00:35:45,000 And it's actually a little bit easier than the randomized 453 00:35:45,000 --> 00:35:48,000 quicksort recurrence. We're going to solve it.
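For reference, the chain of steps spoken above can be written out as one derivation. This is a reconstruction from the transcript rather than a copy of the board, so the exact layout is an assumption; the symbols X_k, T and Theta(n) are the ones used in the lecture.

```latex
\begin{align*}
E[T(n)]
  &= E\!\left[\sum_{k=0}^{n-1} X_k \bigl(T(\max(k,\,n-k-1)) + \Theta(n)\bigr)\right] \\
  &= \sum_{k=0}^{n-1} E[X_k]\; E\bigl[T(\max(k,\,n-k-1)) + \Theta(n)\bigr]
     && \text{(linearity, then independence)} \\
  &= \frac{1}{n}\sum_{k=0}^{n-1} E\bigl[T(\max(k,\,n-k-1))\bigr]
     + \frac{1}{n}\sum_{k=0}^{n-1} \Theta(n)
     && \text{(since } E[X_k] = 1/n\text{)} \\
  &\le \frac{2}{n}\sum_{k=\lfloor n/2 \rfloor}^{n-1} E[T(k)] + \Theta(n)
     && \text{(each term appears at most twice)}
\end{align*}
```

The last step is an inequality rather than an equality because, for odd n, the middle term is counted twice even though it appears only once.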
454 00:35:48,000 --> 00:35:51,000 What method should we use? Sorry? 455 00:35:51,000 --> 00:35:53,000 Master method? Master would be nice, 456 00:35:53,000 --> 00:35:57,000 except that each of the recursive calls is with a 457 00:35:57,000 --> 00:36:01,000 different value of k. The master method only works 458 00:36:01,000 --> 00:36:05,000 when all the calls are with the same value, same size. 459 00:36:05,000 --> 00:36:09,000 Alas, it would be nice if we could use the master method. 460 00:36:09,000 --> 00:36:11,000 What else do we have? Substitution. 461 00:36:11,000 --> 00:36:13,000 When it's hard, when in doubt, 462 00:36:13,000 --> 00:36:16,000 use substitution. I mean the good thing here is 463 00:36:16,000 --> 00:36:20,000 we know what we want. From the intuition at least, 464 00:36:20,000 --> 00:36:23,000 which is now erased, we really feel that this should 465 00:36:23,000 --> 00:36:26,000 be linear time. So, we know what we want to 466 00:36:26,000 --> 00:36:31,000 prove. And indeed we can prove it just 467 00:36:31,000 --> 00:36:35,000 directly with substitution. 468 00:36:42,000 --> 00:36:46,000 I want to claim there is some constant c greater than zero 469 00:36:46,000 --> 00:36:49,000 such that E[T(n)], according to this recurrence, 470 00:36:49,000 --> 00:36:54,000 is at most c times n. Let's prove that over here. 471 00:37:00,000 --> 00:37:04,000 As we guessed, the proof is by substitution. 472 00:37:13,000 --> 00:37:18,000 What that means is we're going to assume, by induction, 473 00:37:18,000 --> 00:37:22,000 that this inequality is true for all smaller m. 474 00:37:22,000 --> 00:37:28,000 I will just say for all m less than n. And we need to prove it for n. 475 00:37:28,000 --> 00:37:33,000 We get E[T(n)]. Now we are just going to expand 476 00:37:33,000 --> 00:37:36,000 using the recurrence that we have. 477 00:37:36,000 --> 00:37:40,000 It's at most this. I will copy that over.
478 00:37:54,000 --> 00:37:57,000 And then each of these recursive calls is with some 479 00:37:57,000 --> 00:38:00,000 value k that is strictly smaller than n. 480 00:38:00,000 --> 00:38:03,000 Sorry, I copied it wrong, floor of n over 2, 481 00:38:03,000 --> 00:38:07,000 not zero. And so I can apply the 482 00:38:07,000 --> 00:38:11,000 induction hypothesis to each of these. 483 00:38:11,000 --> 00:38:16,000 This is at most c times k by the induction hypothesis. 484 00:38:16,000 --> 00:38:20,000 And so I get this inequality. 485 00:38:37,000 --> 00:38:40,000 This c can come outside the summation because it's just a 486 00:38:40,000 --> 00:38:43,000 constant. And I will be slightly tedious 487 00:38:43,000 --> 00:38:47,000 in writing this down again, because what I care about is 488 00:38:47,000 --> 00:38:50,000 the summation here that is left over. 489 00:38:56,000 --> 00:39:01,000 This is a good old-fashioned summation. 490 00:39:01,000 --> 00:39:04,000 And if you remember back to your summation tricks or 491 00:39:04,000 --> 00:39:07,000 whatever, you should be able to evaluate this. 492 00:39:07,000 --> 00:39:11,000 If we started at zero and went up to n minus 1, 493 00:39:11,000 --> 00:39:14,000 that's just an arithmetic series, but here we have the 494 00:39:14,000 --> 00:39:16,000 tail end of an arithmetic series. 495 00:39:16,000 --> 00:39:19,000 And you should know, at least up to theta, 496 00:39:19,000 --> 00:39:21,000 what this is, right? 497 00:39:21,000 --> 00:39:23,000 n^2, yeah. It's definitely Theta(n^2). 498 00:39:23,000 --> 00:39:26,000 But we need here a slightly better upper bound because, 499 00:39:26,000 --> 00:39:31,000 as we will see, the constants really matter. 500 00:39:31,000 --> 00:39:35,000 What we're going to use is that this summation is at most 3/8 501 00:39:35,000 --> 00:39:38,000 times n^2. And that will be critical, 502 00:39:38,000 --> 00:39:41,000 the fact that 3/8 is smaller than 1/2, I believe.
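The 3/8 bound on the tail of the arithmetic series is left as an exercise in the lecture. As a sanity check before attempting the induction, here is a minimal Python sketch that tests the inequality numerically for small n. The function names `tail_sum` and `bound_holds` are made up for this illustration, and a numerical check is of course not a proof.

```python
def tail_sum(n: int) -> int:
    """Sum of k for k = floor(n/2), ..., n-1 (the tail of the arithmetic series)."""
    return sum(range(n // 2, n))


def bound_holds(n: int) -> bool:
    """Check tail_sum(n) <= (3/8) * n^2 using integer arithmetic only."""
    return 8 * tail_sum(n) <= 3 * n * n


if __name__ == "__main__":
    # Check every n in a modest range; this only tests, it does not prove.
    assert all(bound_holds(n) for n in range(1, 2001))
    # For n = 10 the tail is 5 + 6 + 7 + 8 + 9 = 35, against a bound of 37.5.
    print(tail_sum(10), 3 * 10 * 10 / 8)  # prints: 35 37.5
```

Comparing 8 times the sum against 3 n^2 avoids floating-point rounding in the check itself.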
503 00:39:41,000 --> 00:39:44,000 So it's going to get rid of this 2. 504 00:39:44,000 --> 00:39:47,000 I am not going to prove this. This is an exercise. 505 00:39:47,000 --> 00:39:52,000 When you know that it is true, it's easy because you can just 506 00:39:52,000 --> 00:39:55,000 prove it by induction. Figuring out that number is a 507 00:39:55,000 --> 00:40:00,000 little bit more work, but not too much more. 508 00:40:00,000 --> 00:40:04,000 So you should prove that by induction. 509 00:40:04,000 --> 00:40:09,000 Now let me simplify. This is a bit messy, 510 00:40:09,000 --> 00:40:15,000 but what I want is c times n. Let's write it as our desired 511 00:40:15,000 --> 00:40:22,000 value minus the residual. And here we have some crazy 512 00:40:22,000 --> 00:40:26,000 fractions. This is 2 times 3 which is 6 513 00:40:26,000 --> 00:40:31,000 over 8 which is 3/4, right? 514 00:40:31,000 --> 00:40:34,000 Here we have 1, so we have to subtract off 1/4 515 00:40:34,000 --> 00:40:37,000 to get 3/4. And this should be, 516 00:40:37,000 --> 00:40:42,000 I guess, 1/4 times c times n. And then we have this Theta(n), 517 00:40:42,000 --> 00:40:45,000 which with double negation becomes a plus Theta(n). 518 00:40:45,000 --> 00:40:49,000 That should be clear. I am just rewriting that. 519 00:40:49,000 --> 00:40:52,000 So we have what we want over here. 520 00:40:52,000 --> 00:40:57,000 And then we hope that this is nonnegative because what we want 521 00:40:57,000 --> 00:41:03,000 is that this is less than or equal to c times n. 522 00:41:03,000 --> 00:41:06,000 That will be true, provided this thing is 523 00:41:06,000 --> 00:41:09,000 nonnegative. And it looks pretty good 524 00:41:09,000 --> 00:41:13,000 because we're free to choose c however large we want. 525 00:41:13,000 --> 00:41:17,000 Whatever constant is embedded in this theta notation is one 526 00:41:17,000 --> 00:41:21,000 fixed constant, whatever makes this recurrence 527 00:41:21,000 --> 00:41:24,000 true.
We just set c to be bigger than 528 00:41:24,000 --> 00:41:28,000 4 times that constant and then this will be nonnegative. 529 00:41:28,000 --> 00:41:32,000 So this is true for c sufficiently large to dwarf that 530 00:41:32,000 --> 00:41:36,000 theta constant. There's also the base case. 531 00:41:36,000 --> 00:41:41,000 I just have to make the cursory mention that we choose c large 532 00:41:41,000 --> 00:41:45,000 enough so that this claim is true, even in the base case 533 00:41:45,000 --> 00:41:48,000 where n is at most some constant. 534 00:41:48,000 --> 00:41:52,000 Here it's like 1 or so because then we're not making a 535 00:41:52,000 --> 00:41:55,000 recursive call. What we get -- 536 00:41:55,000 --> 00:41:59,000 This algorithm, randomized select, 537 00:41:59,000 --> 00:42:05,000 has expected running time order n, Theta(n). 538 00:42:12,000 --> 00:42:15,000 The annoying thing is that in the worst-case, 539 00:42:15,000 --> 00:42:19,000 if you're really, really unlucky it's n^2. 540 00:42:19,000 --> 00:42:23,000 Any questions before we move on from this point? 541 00:42:23,000 --> 00:42:29,000 This finished off the proof of this fact that we have Theta(n) 542 00:42:29,000 --> 00:42:32,000 expected time. We already saw the n^2 543 00:42:32,000 --> 00:42:34,000 worst-case. All perfectly clear? 544 00:42:34,000 --> 00:42:37,000 Good. You should go over these 545 00:42:37,000 --> 00:42:39,000 proofs. They're intrinsically related 546 00:42:39,000 --> 00:42:43,000 between randomized quicksort and randomized select. 547 00:42:43,000 --> 00:42:47,000 Know them in your heart. This is a great algorithm that 548 00:42:47,000 --> 00:42:52,000 works really well in practice because most of the time you're 549 00:42:52,000 --> 00:42:54,000 going to split, say, in the middle, 550 00:42:54,000 --> 00:43:00,000 somewhere between a 1/4 and 3/4 and everything is good. 551 00:43:00,000 --> 00:43:03,000 It's extremely unlikely that you get the n^2 worst-case.
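Written out in full, the substitution step of the proof looks roughly as follows. This is a reconstruction from the spoken argument; the symbol a is introduced here only as a stand-in for the fixed constant hidden in the Theta(n) term and does not appear in the lecture.

```latex
\begin{align*}
E[T(n)]
  &\le \frac{2}{n}\sum_{k=\lfloor n/2 \rfloor}^{n-1} c\,k + a\,n
     && \text{(induction hypothesis, with } a\,n \text{ for } \Theta(n)\text{)} \\
  &\le \frac{2c}{n}\cdot\frac{3}{8}\,n^{2} + a\,n
     \;=\; \frac{3}{4}\,c\,n + a\,n \\
  &= c\,n - \Bigl(\frac{1}{4}\,c\,n - a\,n\Bigr)
     && \text{(desired value minus residual)} \\
  &\le c\,n \qquad \text{whenever } c \ge 4a .
\end{align*}
```

The residual is nonnegative exactly when c is at least 4a, which is the "4 times that constant" condition stated in the lecture.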
552 00:43:03,000 --> 00:43:06,000 It would have to happen with like 1 over n^n probability or 553 00:43:06,000 --> 00:43:08,000 something really, really small. 554 00:43:08,000 --> 00:43:10,000 But I am a theoretician at least. 555 00:43:10,000 --> 00:43:14,000 And it would be really nice if you could get Theta(n) in the 556 00:43:14,000 --> 00:43:16,000 worst-case. That would be the cleanest 557 00:43:16,000 --> 00:43:19,000 result that you could hope for because that's optimal. 558 00:43:19,000 --> 00:43:21,000 You cannot do better than Theta(n). 559 00:43:21,000 --> 00:43:23,000 You've got to look at the elements. 560 00:43:23,000 --> 00:43:25,000 So, you might ask, can we get rid of this 561 00:43:25,000 --> 00:43:29,000 worst-case behavior and somehow avoid randomization and 562 00:43:29,000 --> 00:43:33,000 guarantee Theta(n) worst-case running time? 563 00:43:33,000 --> 00:43:39,000 And you can but it's a rather nontrivial algorithm. 564 00:43:39,000 --> 00:43:45,000 And this is going to be one of the most sophisticated that 565 00:43:45,000 --> 00:43:51,000 we've seen so far. It won't continue to be the 566 00:43:51,000 --> 00:43:58,000 most sophisticated algorithm we will see, but here it is. 567 00:43:58,000 --> 00:44:04,000 Worst-case linear time order statistics. 568 00:44:09,000 --> 00:44:22,000 And this is an algorithm by several, all very famous people, 569 00:44:22,000 --> 00:44:32,000 Blum, Floyd, Pratt, Rivest and Tarjan. 570 00:44:32,000 --> 00:44:35,000 I think I've only met the B and the R and the T. 571 00:44:35,000 --> 00:44:39,000 Oh, no, I've met Pratt as well. I'm getting close to all the 572 00:44:39,000 --> 00:44:42,000 authors. This is a somewhat old result, 573 00:44:42,000 --> 00:44:46,000 but at the time it was a major breakthrough and still is an 574 00:44:46,000 --> 00:44:50,000 amazing algorithm. Ron Rivest is a professor here. 575 00:44:50,000 --> 00:44:52,000 You should know him from the R in RSA. 
576 00:44:52,000 --> 00:44:56,000 When I took my PhD comprehensives some time ago, 577 00:44:56,000 --> 00:45:00,000 on the cover sheet was a joke question. 578 00:45:00,000 --> 00:45:04,000 It asked of the authors of the worst-case linear time order 579 00:45:04,000 --> 00:45:08,000 statistics algorithm, which of them is the most rich? 580 00:45:08,000 --> 00:45:13,000 Sadly it was not a graded part of the comprehensive exam, 581 00:45:13,000 --> 00:45:18,000 but it was an amusing question. I won't answer it here because 582 00:45:18,000 --> 00:45:21,000 we're on tape, [LAUGHTER] but think about it. 583 00:45:21,000 --> 00:45:25,000 It may not be obvious. Several of them are rich. 584 00:45:25,000 --> 00:45:30,000 It's just the question of who is the most rich. 585 00:45:30,000 --> 00:45:33,000 Anyway, before they were rich they came up with this 586 00:45:33,000 --> 00:45:35,000 algorithm. They've come up with many 587 00:45:35,000 --> 00:45:38,000 algorithms since, even after getting rich, 588 00:45:38,000 --> 00:45:42,000 believe it or not. What we want is a good pivot, 589 00:45:42,000 --> 00:45:45,000 guaranteed good pivot. Random pivot is going to be 590 00:45:45,000 --> 00:45:48,000 really good. And so the simplest algorithm 591 00:45:48,000 --> 00:45:52,000 is just pick a random pivot. It's going to be good with high 592 00:45:52,000 --> 00:45:55,000 probability. We want to force a good pivot 593 00:45:55,000 --> 00:45:58,000 deterministically. And the new idea here is we're 594 00:45:58,000 --> 00:46:02,000 going to generate it recursively. 595 00:46:02,000 --> 00:46:04,000 What else could we do but recurse? 596 00:46:04,000 --> 00:46:08,000 Well, you should know from your recurrences that if we did two 597 00:46:08,000 --> 00:46:12,000 recursive calls on problems of half the size and we have 598 00:46:12,000 --> 00:46:16,000 linear extra work, that's the mergesort recurrence, 599 00:46:16,000 --> 00:46:20,000 T(n) = 2T(n/2) + Theta(n).
You should recite it in your 600 00:46:20,000 --> 00:46:21,000 sleep. That's n lg n. 601 00:46:21,000 --> 00:46:25,000 So we cannot recurse on two problems of half the size. 602 00:46:25,000 --> 00:46:30,000 We've got to do better. Somehow these recursions have 603 00:46:30,000 --> 00:46:32,000 to add up to strictly less than n. 604 00:46:32,000 --> 00:46:35,000 That's the magic of this algorithm. 605 00:46:35,000 --> 00:46:39,000 So this will just be called select instead of rand-select. 606 00:46:39,000 --> 00:46:44,000 And it really depends on an array, but I will focus on the 607 00:46:44,000 --> 00:46:48,000 i-th element that we want to select and the size of the array 608 00:46:48,000 --> 00:46:53,000 that we want to select in. And I am going to write this 609 00:46:53,000 --> 00:46:57,000 algorithm slightly less formally than randomized-select because 610 00:46:57,000 --> 00:47:02,000 it's a bit higher level of an algorithm. 611 00:47:22,000 --> 00:47:31,000 And let me draw over here the picture of the algorithm. 612 00:47:31,000 --> 00:47:36,000 The first step is sort of the weirdest and it's one of the key 613 00:47:36,000 --> 00:47:38,000 ideas. You take your elements, 614 00:47:38,000 --> 00:47:43,000 and they are in no particular order, so instead of drawing 615 00:47:43,000 --> 00:47:47,000 them on a line, I am going to draw them in a 5 616 00:47:47,000 --> 00:47:49,000 by n over 5 grid. Why not? 617 00:47:49,000 --> 00:47:54,000 This, unfortunately, takes a little while to draw, 618 00:47:54,000 --> 00:48:00,000 but it will take you equally long so I will take my time. 619 00:48:00,000 --> 00:48:02,000 It doesn't really matter what the width is, 620 00:48:02,000 --> 00:48:06,000 but it should be width n over 5 so make sure you draw your 621 00:48:06,000 --> 00:48:08,000 figure accordingly. Width n over 5, 622 00:48:08,000 --> 00:48:10,000 but the height should be exactly 5. 623 00:48:10,000 --> 00:48:13,000 I think I got it right. I can count that high.
624 00:48:13,000 --> 00:48:15,000 Here is 5. And this should be, 625 00:48:15,000 --> 00:48:17,000 well, you know, our number may not be divisible 626 00:48:17,000 --> 00:48:20,000 by 5, so maybe it ends off in sort of an odd way. 627 00:48:20,000 --> 00:48:24,000 But what I would like is that these chunks should be floor of 628 00:48:24,000 --> 00:48:26,000 n over 5. And then we will have, 629 00:48:26,000 --> 00:48:30,000 at most, four elements left over. 630 00:48:30,000 --> 00:48:33,000 So I am going to ignore those. They don't really matter. 631 00:48:33,000 --> 00:48:36,000 It's just an additive constant. Here is my array. 632 00:48:36,000 --> 00:48:39,000 I just happened to write it in this funny way. 633 00:48:39,000 --> 00:48:42,000 And I will call these vertical things groups. 634 00:48:42,000 --> 00:48:45,000 I would circle them, and I did that in my notes, 635 00:48:45,000 --> 00:48:49,000 but things get really messy if you start circling. 636 00:48:49,000 --> 00:48:53,000 This diagram is going to get really full, just to warn you. 637 00:48:53,000 --> 00:48:55,000 By the end it will be almost unintelligible, 638 00:48:55,000 --> 00:49:00,000 but there it is. If you are really feeling 639 00:49:00,000 --> 00:49:03,000 bored, you can draw this a few times. 640 00:49:03,000 --> 00:49:06,000 And you should draw how it grows. 641 00:49:06,000 --> 00:49:10,000 So there are the groups, vertical groups of five. 642 00:49:10,000 --> 00:49:12,000 Next step. 643 00:49:18,000 --> 00:49:24,000 The second step is to recurse. This is where things are a bit 644 00:49:24,000 --> 00:49:28,000 unusual, well, even more unusual. 645 00:49:28,000 --> 00:49:32,000 Oops, sorry. I really should have had a line 646 00:49:32,000 --> 00:49:37,000 between one and two so I am going to have to move this down 647 00:49:37,000 --> 00:49:40,000 and insert it here. I also, in step one, 648 00:49:40,000 --> 00:49:44,000 want to find the median of each group. 
649 00:49:53,000 --> 00:49:56,000 What I would like to do is just imagine this figure, 650 00:49:56,000 --> 00:49:59,000 each of the five elements in each group gets reorganized so 651 00:49:59,000 --> 00:50:02,000 that the middle one is the median. 652 00:50:02,000 --> 00:50:05,000 So I am going to call these the medians of each group. 653 00:50:05,000 --> 00:50:10,000 I have five elements so the median is right in the middle. 654 00:50:10,000 --> 00:50:13,000 There are two elements less than the median, 655 00:50:13,000 --> 00:50:15,000 two elements greater than the median. 656 00:50:15,000 --> 00:50:19,000 Again, we're assuming all elements are distinct. 657 00:50:19,000 --> 00:50:21,000 So there they are. I compute them. 658 00:50:21,000 --> 00:50:24,000 How long does that take me? n over 5 groups, 659 00:50:24,000 --> 00:50:30,000 each with five elements, compute the median of each one? 660 00:50:30,000 --> 00:50:32,000 Sorry? Yeah, 2 times n over 5. 661 00:50:32,000 --> 00:50:34,000 It's Theta(n), that's all I need to know. 662 00:50:34,000 --> 00:50:38,000 I mean, you're counting comparisons, which is good. 663 00:50:38,000 --> 00:50:42,000 It's definitely Theta(n). The point is within each group, 664 00:50:42,000 --> 00:50:46,000 I only have to do a constant number of comparisons because 665 00:50:46,000 --> 00:50:48,000 it's a constant number of elements. 666 00:50:48,000 --> 00:50:51,000 It doesn't matter. You could use randomized select 667 00:50:51,000 --> 00:50:54,000 for all I care. No matter what you do, 668 00:50:54,000 --> 00:50:59,000 it can only take a constant number of comparisons. 669 00:50:59,000 --> 00:51:03,000 As long as you don't make a comparison more than once. 670 00:51:03,000 --> 00:51:07,000 So this is easy. You could sort the five numbers 671 00:51:07,000 --> 00:51:12,000 and then look at the third one, it doesn't matter because there 672 00:51:12,000 --> 00:51:16,000 are only five of them. That's one nifty idea.
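Step one can be sketched in a few lines of Python. This is an illustrative sketch, not the lecture's pseudocode: the function name `group_medians` is invented here, the groups are taken as consecutive runs of five in the input list, and each group is simply sorted, which the lecture notes is fine because a group has only five elements.

```python
def group_medians(a):
    """Return the median of each complete group of five consecutive elements.

    Leftover elements (at most four) are ignored, as in the lecture.
    """
    medians = []
    full = len(a) - len(a) % 5              # length covered by complete groups
    for start in range(0, full, 5):
        group = sorted(a[start:start + 5])  # constant work: only five elements
        medians.append(group[2])            # the third smallest is the median
    return medians


if __name__ == "__main__":
    # Two complete groups and two leftover elements.
    print(group_medians([9, 1, 8, 2, 7, 3, 6, 4, 5, 0, 11, 12]))  # prints: [7, 4]
```

Each group costs a constant amount of work, so the whole pass is linear in the length of the input.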
673 00:51:16,000 --> 00:51:21,000 Already we have some elements that are sort of vaguely in the 674 00:51:21,000 --> 00:51:25,000 middle but just of the group. And we've only done linear 675 00:51:25,000 --> 00:51:29,000 work. So doing well so far. 676 00:51:29,000 --> 00:51:33,000 Now we get to the second step, which I started to write 677 00:51:33,000 --> 00:51:36,000 before, where we recurse. 678 00:51:58,000 --> 00:52:01,000 So the next idea is, well, we have these floor of 679 00:52:01,000 --> 00:52:04,000 n over 5 medians. I am going to compute the 680 00:52:04,000 --> 00:52:07,000 median of those medians. I am imagining that I 681 00:52:07,000 --> 00:52:09,000 rearranged these. And, unfortunately, 682 00:52:09,000 --> 00:52:11,000 it's an even number, there are six of them, 683 00:52:11,000 --> 00:52:15,000 but I will rearrange so that this guy, I have drawn in a 684 00:52:15,000 --> 00:52:18,000 second box, is the median of these elements so that these two 685 00:52:18,000 --> 00:52:22,000 elements are strictly less than this guy, these three elements 686 00:52:22,000 --> 00:52:24,000 are strictly greater than this guy. 687 00:52:24,000 --> 00:52:27,000 Now, that doesn't directly tell me anything, it would seem, 688 00:52:27,000 --> 00:52:31,000 about any of the elements out here. 689 00:52:31,000 --> 00:52:35,000 We will come back to that. In fact, it does tell us about 690 00:52:35,000 --> 00:52:38,000 some of the elements. But right now this element is 691 00:52:38,000 --> 00:52:42,000 just the median of these guys. Each of these guys is a median 692 00:52:42,000 --> 00:52:45,000 of five elements. That's all we know. 693 00:52:45,000 --> 00:52:49,000 If we do that recursively, this is going to take T of n 694 00:52:49,000 --> 00:52:51,000 over 5 time. So far so good. 695 00:52:51,000 --> 00:52:55,000 We can afford a recursion on a problem of size n over 5 and 696 00:52:55,000 --> 00:52:58,000 linear work.
We know that works out to 697 00:52:58,000 --> 00:53:00,000 linear time. But there is more. 698 00:53:00,000 --> 00:53:02,000 We're obviously not done yet. 699 00:53:10,000 --> 00:53:12,000 The next step is x is our partition element. 700 00:53:12,000 --> 00:53:15,000 We partition there. The rest of the algorithm is 701 00:53:15,000 --> 00:53:19,000 just like randomized partition, so we're going to define k to 702 00:53:19,000 --> 00:53:21,000 be the rank of x. And this can be done, 703 00:53:21,000 --> 00:53:25,000 I mean it's n minus r plus 1 or whatever, but I'm not going to 704 00:53:25,000 --> 00:53:30,000 write out how to do that because we're at a higher level here. 705 00:53:30,000 --> 00:53:34,000 But it can be done. And then we have the three-way 706 00:53:34,000 --> 00:53:37,000 branching. So if i happens to equal k 707 00:53:37,000 --> 00:53:41,000 we're happy. The pivot element is the 708 00:53:41,000 --> 00:53:46,000 element we're looking for, but more likely i is either 709 00:53:46,000 --> 00:53:49,000 less than k or it is bigger than k. 710 00:53:49,000 --> 00:53:53,000 And then we make the appropriate recursive call, 711 00:53:53,000 --> 00:54:00,000 so here we recursively select the i-th smallest element -- 712 00:54:08,000 --> 00:54:11,000 -- in the lower part of the array. 713 00:54:11,000 --> 00:54:16,000 Left of the partition element. Otherwise, we recursively 714 00:54:16,000 --> 00:54:22,000 select the i minus k-th smallest element in the upper part of the 715 00:54:22,000 --> 00:54:25,000 array. I am writing this at a high 716 00:54:25,000 --> 00:54:30,000 level because we've already seen it. 717 00:54:30,000 --> 00:54:36,000 All of this is the same as the last couple steps of randomized 718 00:54:36,000 --> 00:54:37,000 select. 719 00:54:45,000 --> 00:54:48,000 That is the algorithm. The real question is why does 720 00:54:48,000 --> 00:54:50,000 it work? Why is this linear time? 
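Putting the steps together, the whole algorithm can be sketched in Python. This is a minimal sketch under stated assumptions rather than the lecture's pseudocode: the name `select` and the helper lists `lower` and `upper` are illustrative, the partition builds new lists instead of rearranging the array in place, inputs of at most five elements are handled by sorting, and all elements are assumed distinct, as in the lecture.

```python
def select(a, i):
    """Return the i-th smallest element of a (i is 1-based).

    Assumes all elements of a are distinct.
    """
    if not 1 <= i <= len(a):
        raise ValueError("rank out of range")
    if len(a) <= 5:
        return sorted(a)[i - 1]          # base case: constant-size input

    # Step 1: medians of the floor(n/5) complete groups of five;
    # the leftover elements (at most four) are ignored here.
    full = len(a) - len(a) % 5
    medians = [sorted(a[s:s + 5])[2] for s in range(0, full, 5)]

    # Step 2: recursively find the median x of the group medians.
    x = select(medians, (len(medians) + 1) // 2)

    # Step 3: partition around x; k is the rank of x in a.
    lower = [v for v in a if v < x]
    upper = [v for v in a if v > x]
    k = len(lower) + 1

    # Step 4: three-way branch, as in randomized select.
    if i == k:
        return x
    if i < k:
        return select(lower, i)
    return select(upper, i - k)


if __name__ == "__main__":
    data = [(7 * j) % 101 for j in range(101)]   # a permutation of 0..100
    print(select(data, 51))                      # the median; prints: 50
```

Building `lower` and `upper` as new lists costs extra memory but keeps the partition step linear, which is all the analysis needs.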
721 00:54:50,000 --> 00:54:53,000 The first question is what's the recurrence? 722 00:54:53,000 --> 00:54:56,000 We cannot quite write it down yet because we don't know how 723 00:54:56,000 --> 00:55:00,000 big these recursive subproblems could be. 724 00:55:00,000 --> 00:55:03,000 We're going to either recurse in the lower part or the upper 725 00:55:03,000 --> 00:55:07,000 part, that's just like before. If we're unlucky and we have a 726 00:55:07,000 --> 00:55:11,000 split of like zero to n minus one, this is going to be a 727 00:55:11,000 --> 00:55:14,000 quadratic time algorithm. The claim is that this 728 00:55:14,000 --> 00:55:18,000 partition element is guaranteed to be pretty good and good 729 00:55:18,000 --> 00:55:21,000 enough. The running time of this thing 730 00:55:21,000 --> 00:55:24,000 will be T of something times n, and we don't know what the 731 00:55:24,000 --> 00:55:27,000 something is yet. How big could it be? 732 00:55:27,000 --> 00:55:32,000 Well, I could ask you. But we're sort of indirect here 733 00:55:32,000 --> 00:55:34,000 so I will tell you. We have already a recursive 734 00:55:34,000 --> 00:55:38,000 call of T of n over 5. It better be that whatever 735 00:55:38,000 --> 00:55:41,000 constant, so it's going to be something times n, 736 00:55:41,000 --> 00:55:44,000 it better be that that constant is strictly less than 4/5. 737 00:55:44,000 --> 00:55:48,000 If it's equal to 4/5 then you're not splitting up the 738 00:55:48,000 --> 00:55:51,000 problem enough, and you get an n lg n running time. 739 00:55:51,000 --> 00:55:55,000 If it's strictly less than 4/5 then you're reducing the problem 740 00:55:55,000 --> 00:55:59,000 by at least a constant factor. In the sense that if you add up all 741 00:55:59,000 --> 00:56:03,000 the recursive subproblems, n over 5 and something times n, 742 00:56:03,000 --> 00:56:07,000 you get something that is a constant strictly less than one 743 00:56:07,000 --> 00:56:09,000 times n.
That forces the work to be 744 00:56:09,000 --> 00:56:12,000 geometric. If it's geometric you're going 745 00:56:12,000 --> 00:56:15,000 to get linear time. So this is intuition but it's 746 00:56:15,000 --> 00:56:18,000 the right intuition. Whenever you're aiming for 747 00:56:18,000 --> 00:56:21,000 linear time keep that in mind. If you're doing a 748 00:56:21,000 --> 00:56:24,000 divide-and-conquer, you've got to get the total 749 00:56:24,000 --> 00:56:27,000 subproblem size to be some constant less than one times n. 750 00:56:27,000 --> 00:56:32,000 That will work. OK, so we've got to work out 751 00:56:32,000 --> 00:56:37,000 this constant here. And we're going to use this 752 00:56:37,000 --> 00:56:42,000 figure, which so far looks surprisingly uncluttered. 753 00:56:42,000 --> 00:56:48,000 Now we will make it cluttered. What I would like to do is draw 754 00:56:48,000 --> 00:56:53,000 an arrow between two vertices, two points, elements, 755 00:56:53,000 --> 00:57:00,000 whatever you want to call them. Let's call them a and b. 756 00:57:00,000 --> 00:57:04,000 And I want to orient the arrow so it points to a larger value, 757 00:57:04,000 --> 00:57:06,000 so this means that a is less than b. 758 00:57:06,000 --> 00:57:09,000 This is notation just for the diagram. 759 00:57:09,000 --> 00:57:13,000 And so this element, I am going to write down what I 760 00:57:13,000 --> 00:57:15,000 know. This element is the median of 761 00:57:15,000 --> 00:57:19,000 these five elements. I will suppose that it is drawn 762 00:57:19,000 --> 00:57:22,000 so that these elements are larger than the median, 763 00:57:22,000 --> 00:57:25,000 these elements are smaller than the median. 764 00:57:25,000 --> 00:57:28,000 Therefore, I have arrows like this. 765 00:57:28,000 --> 00:57:33,000 Here is where I wish I had some colored chalk. 766 00:57:33,000 --> 00:57:36,000 This is just stating this guy is in the middle of those five 767 00:57:36,000 --> 00:57:39,000 elements. 
I know that in every single 768 00:57:39,000 --> 00:57:40,000 column. 769 00:57:55,000 --> 00:57:58,000 Here is where the diagram starts to get messy. 770 00:57:58,000 --> 00:58:01,000 I am not done yet. Now, we also know that this 771 00:58:01,000 --> 00:58:03,000 element is the median of the medians. 772 00:58:03,000 --> 00:58:06,000 Of all the squared elements, this guy is the middle. 773 00:58:06,000 --> 00:58:10,000 And I will draw it so that these are the ones smaller than 774 00:58:10,000 --> 00:58:13,000 the median, these are the ones larger than the median. 775 00:58:13,000 --> 00:58:15,000 I mean the algorithm cannot do this. 776 00:58:15,000 --> 00:58:18,000 It doesn't necessarily know how all this works. 777 00:58:18,000 --> 00:58:20,000 I guess it could, but this is just for analysis 778 00:58:20,000 --> 00:58:23,000 purposes. We know this guy is bigger than 779 00:58:23,000 --> 00:58:25,000 that one and bigger than that one. 780 00:58:25,000 --> 00:58:29,000 We don't directly know about the other elements. 781 00:58:29,000 --> 00:58:33,000 We just know that that one is bigger than both of those and 782 00:58:33,000 --> 00:58:37,000 this guy is smaller than these. Now, that is as messy as the 783 00:58:37,000 --> 00:58:40,000 figure will get. Now, the nice thing about less 784 00:58:40,000 --> 00:58:43,000 than is that it's a transitive relation. 785 00:58:43,000 --> 00:58:47,000 If I have a directed path in this graph, I know that this 786 00:58:47,000 --> 00:58:51,000 element is strictly less than that element because this is 787 00:58:51,000 --> 00:58:54,000 less than that one and this is less than that one. 788 00:58:54,000 --> 00:58:59,000 Even though directly I only know within a column and within 789 00:58:59,000 --> 00:59:02,000 this middle row, I actually know that this 790 00:59:02,000 --> 00:59:05,000 element -- This is x, by the way. 
791 00:59:05,000 --> 00:59:10,000 This element is larger than all of these elements because it's 792 00:59:10,000 --> 00:59:15,000 larger than this one and this one and each of these is larger 793 00:59:15,000 --> 00:59:17,000 than all of those by these arrows. 794 00:59:17,000 --> 00:59:22,000 I also know that all of these elements in this rectangle here, 795 00:59:22,000 --> 00:59:27,000 and you don't have to do this but I will make the background 796 00:59:27,000 --> 00:59:32,000 even more cluttered. All of these elements in this 797 00:59:32,000 --> 00:59:37,000 rectangle are greater than or equal to this one and all of the 798 00:59:37,000 --> 00:59:42,000 elements in this rectangle are less than or equal to x. 799 00:59:42,000 --> 00:59:47,000 Now, how many are there? Well, this is roughly halfway 800 00:59:47,000 --> 00:59:52,000 along the set of groups and this is 3/5 of these columns. 801 00:59:52,000 --> 00:59:57,000 So what we get is that there are at least -- 802 00:59:57,000 --> 01:00:03,554 We have n over 5 groups and we have half of the groups that 803 01:00:03,554 --> 01:00:10,222 we're looking at here roughly, so let's call that floor of n 804 01:00:10,222 --> 01:00:16,664 over 2, and then within each group we have three elements. 805 01:00:16,664 --> 01:00:23,219 So we have at least 3 * floor(floor(n/5) / 2) 806 01:00:23,219 --> 01:00:30,000 elements that are less than or equal to x. 807 01:00:30,000 --> 01:00:36,222 And we have the same that are greater than or equal to x. 808 01:00:36,222 --> 01:00:40,444 Let me simplify this a little bit more. 809 01:00:40,444 --> 01:00:45,222 I can also give you some more justification, 810 01:00:45,222 --> 01:00:51,222 and we drew the picture, but just for why this is true. 811 01:00:51,222 --> 01:00:57,777 We have at least n over 5 over 2 group medians that are less 812 01:00:57,777 --> 01:01:02,622 than or equal to x. This is the argument we use.
813 01:01:02,622 --> 01:01:05,809 We have half of the group medians are less than or equal 814 01:01:05,809 --> 01:01:08,590 to x because x is the median of the group median, 815 01:01:08,590 --> 01:01:11,892 so that is no big surprise. This is almost an equality but 816 01:01:11,892 --> 01:01:14,905 we're making floors so it's greater than or equal to. 817 01:01:14,905 --> 01:01:18,034 And then, for each group median, we know that there are 818 01:01:18,034 --> 01:01:21,568 three elements there that are less than or equal to that group 819 01:01:21,568 --> 01:01:23,133 median. So, by transitivity, 820 01:01:23,133 --> 01:01:25,218 they're also less than or equal to x. 821 01:01:25,218 --> 01:01:30,664 We get this number times three. This is actually just floor of 822 01:01:30,664 --> 01:01:33,773 n over 10. I was being unnecessarily 823 01:01:33,773 --> 01:01:38,126 complicated there, but that is where it came from. 824 01:01:38,126 --> 01:01:43,544 What we know is that this thing is now at least 3 times n over 825 01:01:43,544 --> 01:01:48,252 10, which is roughly 3/10 of elements are in one side. 826 01:01:48,252 --> 01:01:53,137 In fact, at least 3/10 of the elements are in each side. 827 01:01:53,137 --> 01:01:59,000 Therefore, each side has at most 7/10 elements roughly. 828 01:01:59,000 --> 01:02:01,214 So the number here will be 7/10. 829 01:02:01,214 --> 01:02:04,642 And, if I'm lucky, 7/10 plus 1/5 is strictly less 830 01:02:04,642 --> 01:02:06,428 than one. I believe it is, 831 01:02:06,428 --> 01:02:09,142 but I have trouble working with tenths. 832 01:02:09,142 --> 01:02:11,357 I can only handle powers of two. 833 01:02:11,357 --> 01:02:14,857 What we're going to use is a minor simplification, 834 01:02:14,857 --> 01:02:19,214 which just barely still works, is a little bit easier to think 835 01:02:19,214 --> 01:02:21,785 about. It's mainly to get rid of this 836 01:02:21,785 --> 01:02:24,285 floor because the floor is annoying. 
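The counting argument so far can be checked empirically. This is only a sketch for analysis purposes: the helper name `side_counts` is hypothetical, it assumes floor of n over 5 full groups of five with any leftover elements ignored when forming medians, and it sorts the group medians to find x, which the real algorithm would do with a recursive call instead.

```python
def side_counts(a):
    """Return (#elements <= x, #elements >= x, guaranteed lower bound),
    where x is the median of the medians of the groups of five."""
    n = len(a)
    # Median of each full group of five (leftover elements form no group).
    medians = sorted(sorted(a[i:i + 5])[2] for i in range(0, n - 4, 5))
    # Lower median of the group medians; sorting here is for analysis only.
    x = medians[(len(medians) - 1) // 2]
    at_most_x = sum(1 for v in a if v <= x)
    at_least_x = sum(1 for v in a if v >= x)
    # 3 * floor(floor(n/5) / 2), which equals 3 * floor(n/10).
    return at_most_x, at_least_x, 3 * ((n // 5) // 2)


# A fixed, scrambled input of 1000 distinct values (an arbitrary choice).
data = [(i * 856) % 1009 for i in range(1000)]
print(side_counts(data))
```

Both counts should come out at least as large as the third number, whatever order the input is in.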
837 01:02:24,285 --> 01:02:28,214 And we don't really have a sloppiness lemma that applies 838 01:02:28,214 --> 01:02:31,463 here. It turns out if n is 839 01:02:31,463 --> 01:02:34,975 sufficiently large, 3 times floor of n over 10 is 840 01:02:34,975 --> 01:02:38,707 greater than or equal to n over 4. Quarters I can handle. 841 01:02:38,707 --> 01:02:42,365 The claim is that each side has size at least 1/4 of the elements, 842 01:02:42,365 --> 01:02:46,609 therefore each side has size at most 3/4 because there's a 843 01:02:46,609 --> 01:02:49,317 quarter on the other side. This will be 3/4. 844 01:02:49,317 --> 01:02:53,048 And I can definitely tell that 1/5 is less than 1/4. 845 01:02:53,048 --> 01:02:57,292 This is going to add up to something strictly less than one 846 01:02:57,292 --> 01:03:01,292 and then it will work. How is my time? 847 01:03:01,292 --> 01:03:02,929 Good. At this point, 848 01:03:02,929 --> 01:03:05,686 the rest of the analysis is easy. 849 01:03:05,686 --> 01:03:09,993 How the heck you would come up with this algorithm, 850 01:03:09,993 --> 01:03:14,818 you realize that this is clearly a really good choice for 851 01:03:14,818 --> 01:03:19,643 finding a partition element, just barely good enough that 852 01:03:19,643 --> 01:03:22,830 both recursions add up to linear time. 853 01:03:22,830 --> 01:03:28,000 Well, that's why it took so many famous people. 854 01:03:28,000 --> 01:03:30,780 Especially in quizzes, but I think in general in this 855 01:03:30,780 --> 01:03:34,241 class, you won't have to come up with an algorithm this clever 856 01:03:34,241 --> 01:03:37,531 because you can just use this algorithm to find the median. 857 01:03:37,531 --> 01:03:40,312 And the median is a really good partition element. 858 01:03:40,312 --> 01:03:43,375 Now that you know this algorithm, now that we're beyond 859 01:03:43,375 --> 01:03:45,815 1973, you don't need to know how to do this.
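The "sufficiently large" in the quarter bound above can be pinned down with a quick check. This is a minimal sketch; the function name `quarter_bound_holds` and the search limit of 10,000 are arbitrary choices made for illustration.

```python
def quarter_bound_holds(n: int) -> bool:
    """True when 3 * floor(n / 10) >= n / 4, written in integers
    (multiply both sides by 4) to avoid floating point."""
    return 12 * (n // 10) >= n


# Collect every n in the tested range where the bound fails.
failures = [n for n in range(1, 10_000) if not quarter_bound_holds(n)]
print(max(failures))
```

Within the tested range the largest failure is n = 49, which is consistent with the lecture treating n less than or equal to 50 as the constant-size base case. The finite search is not itself a proof: writing n = 10k + r with r at most 9, the bound needs 12k to be at least 10k + 9, which holds once k is at least 5.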
860 01:03:45,815 --> 01:03:48,482 I mean you should know how this algorithm works, 861 01:03:48,482 --> 01:03:51,943 but you don't need to do this in another algorithm because you 862 01:03:51,943 --> 01:03:55,234 can just say run this algorithm, you will get the median in 863 01:03:55,234 --> 01:03:58,524 linear time, and then you can partition to the left and the 864 01:03:58,524 --> 01:04:02,225 right. And then the left and the right 865 01:04:02,225 --> 01:04:04,737 will have exactly equal size. Great. 866 01:04:04,737 --> 01:04:07,321 This is a really powerful subroutine. 867 01:04:07,321 --> 01:04:11,700 You could use this all over the place, and you will on Friday. 868 01:04:11,700 --> 01:04:14,858 Have I analyzed the running time pretty much? 869 01:04:14,858 --> 01:04:18,806 The first step is linear. The second step is T of n over 5. 870 01:04:18,806 --> 01:04:20,027 The third step, 871 01:04:20,027 --> 01:04:22,037 I didn't write it, is linear. 872 01:04:22,037 --> 01:04:25,410 And then the last step is just a recursive call. 873 01:04:25,410 --> 01:04:29,000 And now we know that this is 3/4. 874 01:04:34,000 --> 01:04:40,000 I get this recurrence. T of n is, I'll say at most, 875 01:04:40,000 --> 01:04:47,079 T of n over 5 plus T of 3/4n. You could have also used 7/10. 876 01:04:47,079 --> 01:04:54,400 It would give the same answer, but you would also need a floor 877 01:04:54,400 --> 01:05:01,000 so we won't do that. I claim that this is linear. 878 01:05:01,000 --> 01:05:07,000 How should I prove it? Substitution. 879 01:05:12,000 --> 01:05:15,901 Claim that T of n is at most again c times n, 880 01:05:15,901 --> 01:05:19,891 that will be enough. Proof is by substitution. 881 01:05:19,891 --> 01:05:23,704 Again, we assume this is true for smaller n. 882 01:05:23,704 --> 01:05:28,758 And want to prove it for n. We have T of n is at most this 883 01:05:28,758 --> 01:05:31,489 thing. T of n over 5.
884 01:05:31,489 --> 01:05:36,489 And by induction, because n over 5 is smaller than 885 01:05:36,489 --> 01:05:40,000 n, we know that this is at most c. 886 01:05:40,000 --> 01:05:43,723 Let me write it as c over 5 times n. 887 01:05:43,723 --> 01:05:47,765 Sure, why not. Then we have here 3/4cn. 888 01:05:47,765 --> 01:05:53,085 And then we have a linear term. Now, unfortunately, 889 01:05:53,085 --> 01:06:00,000 I have to deal with things that are not powers of two. 890 01:06:00,000 --> 01:06:02,447 I will cheat and look at my notes. 891 01:06:02,447 --> 01:06:06,599 This is also known as 19/20 times c times n plus theta n. 892 01:06:06,599 --> 01:06:10,826 And the point is just that this is strictly less than one. 893 01:06:10,826 --> 01:06:15,202 Because it's strictly less than one, I can write this as one 894 01:06:15,202 --> 01:06:19,206 times c times n minus some constant, here it happens to be 895 01:06:19,206 --> 01:06:22,766 1/20, as long as I have something left over here, 896 01:06:22,766 --> 01:06:26,622 1/20 times c times n. Then I have this annoying theta 897 01:06:26,622 --> 01:06:30,923 n term which I want to get rid of because I want this to be 898 01:06:30,923 --> 01:06:34,783 nonnegative. But it is nonnegative, 899 01:06:34,783 --> 01:06:38,432 as long as I set c to be really, really large, 900 01:06:38,432 --> 01:06:41,918 at least 20 times whatever constant is here. 901 01:06:41,918 --> 01:06:46,216 So this is at most c times n for c sufficiently large. 902 01:06:46,216 --> 01:06:50,189 And, oh, by the way, if n is less than or equal to 903 01:06:50,189 --> 01:06:54,404 50, which we used up here, then T of n is a constant, 904 01:06:54,404 --> 01:06:59,270 it doesn't really matter what you do, and T of n is at most c 905 01:06:59,270 --> 01:07:03,000 times n for c sufficiently large. 906 01:07:03,000 --> 01:07:06,017 That proves this claim. Of course, the constant here is 907 01:07:06,017 --> 01:07:08,421 pretty damn big.
It depends exactly what the 908 01:07:08,421 --> 01:07:11,606 constants and the running times are, which depends on your 909 01:07:11,606 --> 01:07:14,960 machine, but practically this algorithm is not so hot because 910 01:07:14,960 --> 01:07:18,089 the constants are pretty big. Even though this element is 911 01:07:18,089 --> 01:07:20,772 guaranteed to be somewhere vaguely in the middle, 912 01:07:20,772 --> 01:07:23,566 and even though these recursions add up to strictly 913 01:07:23,566 --> 01:07:26,752 less than n and it's geometric, it's geometric because the 914 01:07:26,752 --> 01:07:31,000 problem is reducing by at least a factor of 19/20 each time. 915 01:07:31,000 --> 01:07:34,742 So it actually takes a while for the problem to get really 916 01:07:34,742 --> 01:07:37,106 small. Practically you probably don't 917 01:07:37,106 --> 01:07:40,782 want to use this algorithm unless you cannot somehow flip 918 01:07:40,782 --> 01:07:43,146 coins. The randomized algorithm works 919 01:07:43,146 --> 01:07:46,166 really, really fast. Theoretically this is your 920 01:07:46,166 --> 01:07:50,237 dream, the best you could hope for because it's linear time and 921 01:07:50,237 --> 01:07:53,257 you need linear time as guaranteed linear time. 922 01:07:53,257 --> 01:07:55,161 I will mention, before we end, 923 01:07:55,161 --> 01:07:57,000 an exercise. 924 01:08:03,000 --> 01:08:06,375 Why did we use groups of five? Why not groups of three? 925 01:08:06,375 --> 01:08:09,062 As you might guess, the answer is because it 926 01:08:09,062 --> 01:08:11,125 doesn't work with groups of three. 927 01:08:11,125 --> 01:08:13,812 But it's quite constructive to find out why. 928 01:08:13,812 --> 01:08:17,562 If you work through this math with groups of three instead of 929 01:08:17,562 --> 01:08:20,250 groups of five, you will find that you don't 930 01:08:20,250 --> 01:08:23,062 quite get the problem reduction that you need. 
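One way to start on this exercise is to tabulate, for each group size, the two fractions that appear in the recurrence. This is a rough sketch that ignores floors entirely, and the function name `recursion_fractions` is made up for illustration. For an odd group size g, the median-of-medians call is on a 1/g fraction of the elements, and the same counting argument as before guarantees that a (g + 1)/2 times 1/(2g) fraction lands on each side of x.

```python
from fractions import Fraction


def recursion_fractions(g: int):
    """For odd group size g, return (fraction of n in the median-of-medians
    call, worst-case fraction of n in the final recursive call), ignoring floors."""
    median_call = Fraction(1, g)
    # Half of the n/g group medians, each bringing (g + 1) / 2 elements with it.
    guaranteed_each_side = Fraction(g + 1, 2) * Fraction(1, 2 * g)
    worst_side = 1 - guaranteed_each_side
    return median_call, worst_side


for g in (3, 5, 7):
    median_call, worst_side = recursion_fractions(g)
    print(g, median_call + worst_side)
```

The printed sum is the total fraction of the problem passed to the two recursive calls. The linear-time argument needs that sum to be strictly less than one.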
931 01:08:23,062 --> 01:08:27,000 Five is the smallest number for which this works. 932 01:08:27,000 --> 01:08:30,176 It would work with seven, but theoretically not any 933 01:08:30,176 --> 01:08:32,973 better than a constant factor. Any questions? 934 01:08:32,973 --> 01:08:35,069 All right. Then recitation Friday. 935 01:08:35,069 --> 01:08:37,801 Homework lab Sunday. Problem set due Monday. 936 01:08:37,801 --> 01:08:40,000 Quiz one in two weeks.
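For reference, the four steps analyzed in this lecture can be put together roughly as follows. This is a sketch under stated assumptions, not the lecture's own pseudocode: the name `select`, the use of new lists instead of in-place partitioning, and the handling of duplicate values are choices made here. The base case of n at most 50 is the cutoff mentioned in the analysis, and leftover elements beyond the last full group of five are left out of the medians step but still take part in the partition.

```python
def select(a, k):
    """Return the k-th smallest element of a (k is 1-indexed)."""
    n = len(a)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(a)")
    # Base case: constant-size input, so sorting costs constant time.
    if n <= 50:
        return sorted(a)[k - 1]
    # Step 1: median of each full group of five (linear time overall).
    medians = [sorted(a[i:i + 5])[2] for i in range(0, n - 4, 5)]
    # Step 2: recursively find x, the median of the group medians.
    x = select(medians, (len(medians) + 1) // 2)
    # Step 3: partition around x (linear time).
    lower = [v for v in a if v < x]
    upper = [v for v in a if v > x]
    equal = n - len(lower) - len(upper)  # copies of x itself
    # Step 4: recurse only into the side that contains rank k.
    if k <= len(lower):
        return select(lower, k)
    if k <= len(lower) + equal:
        return x
    return select(upper, k - len(lower) - equal)


# Example: the lower median of a scrambled list of 1000 distinct values.
data = [(i * 856) % 1009 for i in range(1000)]
print(select(data, (len(data) + 1) // 2) == sorted(data)[499])
```

Building new lists keeps the sketch short at the cost of extra memory; an in-place partition as in quicksort would avoid that.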