Today we're going to talk about sorting, which may not come as such a big surprise. We talked about sorting for a while, but we're going to talk about it at a somewhat higher level and question some of the assumptions that we've been making so far. And we're going to ask the question: how fast can we sort? A pretty natural question. You may think you know the answer. Perhaps you do. Any suggestions on what the answer to this question might be? There are several possible answers. Many of them are partially correct. Let's hear any kinds of answers you'd like and start waking up this fresh morning. Sorry? Theta n log n. That's a good answer. That's often correct. Any other suggestions? N squared. That's correct if all you're allowed to do is swap adjacent elements. Good. That was close. I will see if I can make every answer correct. Usually n squared is not the right answer, but in some models it is. Yeah? Theta n is also sometimes the right answer. The real answer is "it depends". That's the point of today's lecture. It depends on what we call the computational model: what you're allowed to do.
And, in particular here, with sorting, what we care about is the order of the elements: how are you allowed to manipulate the elements, what are you allowed to do with them to find out their order. The model is what you can do with the elements.

Now, we've seen several sorting algorithms. Do you want to shout some out? I think we've seen four, but maybe you know even more algorithms. Quicksort. Keep going. Heapsort. Merge sort. You can remember all the way back to Lecture 1. Any others? Insertion sort. All right. You're on top of it today. I don't know exactly why, but these two are single words and these two are two words. That's the style.

What is the running time of quicksort? This is a bit tricky. n lg n in the average case. Or, if we randomize quicksort, randomized quicksort runs in n lg n expected time for any input sequence. Let's say n lg n randomized. That's theta. And the worst case with plain old quicksort, where you just pick the first element as the partition element, that's n^2. Heapsort, what's the running time there? n lg n always. Merge sort, I hope you can remember that as well: n lg n. And insertion sort?
n^2. All of these algorithms run no faster than n lg n, so we might ask: can we do better than n lg n? And that is a question, in some sense, we will answer both yes and no to today.

But all of these algorithms have something in common in terms of the model of what you're allowed to do with the elements. Any guesses on what that model might be? Yeah? You compare pairs of elements, exactly. That is indeed the model used by all four of these algorithms. And in that model n lg n is the best you can do. We have so far just looked at what are called comparison sorting algorithms, or "comparison sorts". And this is a model for the sorting problem of what you're allowed to do. Here all you can do is use comparisons, meaning less than, greater than, less than or equal to, greater than or equal to, equals, to determine the relative order of elements.

This is a restriction on algorithms. It is, in some sense, stating what kinds of elements we're dealing with. They are elements that we can somehow compare. They have a total order: some are less, some are bigger. But it also restricts the algorithm.
You could say, well, I'm sorting integers, but still I'm only allowed to do comparisons with them. I'm not allowed to multiply the integers or do other weird things. That's the comparison sorting model.

And this lecture, in some sense, follows the standard mathematical progression where you have a theorem, then you have a proof, then you have a counterexample. It's always a good way to have a math lecture. We're going to prove the theorem that no comparison sorting algorithm runs in better than n lg n comparisons. State the theorem, prove that, and then we'll give a counterexample in the sense that if you go outside the comparison sorting model you can do better: you can get linear time in some cases, better than n lg n. So, that is what we're doing today.

But first we're going to stick to this comparison model and try to understand why we need n lg n comparisons if that's all we're allowed to do. And for that we're going to look at something called decision trees, which in some sense is another model of what you're allowed to do in an algorithm, but it's more general than the comparison model.
And let's try an example to get some intuition. Suppose we want to sort three elements. This is not very challenging, but we'll get to draw the decision tree that corresponds to sorting three elements. Here is one solution, I claim. This is, in a certain sense, an algorithm, but it's drawn as a tree instead of pseudocode.

What this tree means is that at each node you're making a comparison. This says compare a_1 versus a_2. If a_1 is smaller than a_2 you go this way, if it is bigger than a_2 you go this way, and then you proceed. When you get down to a leaf, that is the answer. Remember, the sorting problem is you're trying to find a permutation of the inputs that puts them in sorted order.

Let's try it with some sequence of numbers, say 9, 4 and 6. We want to sort 9, 4 and 6, so first we compare the first element with the second element. 9 is bigger than 4, so we go down this way. Then we compare the first element with the third element, that's 9 versus 6. 9 is bigger than 6, so we go this way. And then we compare the second element with the third element; 4 is less than 6, so we go this way. And the claim is that this is the correct permutation of the elements.
You take a_2, which is 4, then you take a_3, which is 6, and then you take a_1, which is 9, so indeed that works out. And if I wrote this down right, this is a sorting algorithm in the decision tree model.

In general, let me just say the rules of this game. In general, we have n elements we want to sort. And I only drew the n = 3 case because these trees get very big very quickly. Each internal node, so every non-leaf node, has a label of the form i : j, where i and j are between 1 and n. And this means that we compare a_i with a_j. And we have two subtrees from every such node. We have the left subtree, which tells you what the algorithm does, what subsequent comparisons it makes, if it comes out less than. And we have to be a little bit careful because it could also come out equal. What we will do is the left subtree corresponds to less than or equal to, and the right subtree corresponds to strictly greater than. That is a little bit more precise than what we were doing here. Here all the elements were distinct, so no problem. But, in general, we care about the equality case too, to be general. So, that was the internal nodes.
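The n = 3 tree from the board can also be written as nested comparisons. Here is a sketch in Python (the function name and exact tree shape are my own choices, but it follows the rules above: left branch on less-than-or-equal, right branch on strictly greater, and each leaf returns the permutation as 1-based indices):

```python
def sort3(a):
    """Decision tree for sorting three elements, as nested comparisons.
    Returns the permutation (pi(1), pi(2), pi(3)) such that
    a[pi(1)-1] <= a[pi(2)-1] <= a[pi(3)-1]."""
    a1, a2, a3 = a
    if a1 <= a2:              # node 1:2
        if a2 <= a3:          # node 2:3
            return (1, 2, 3)
        elif a1 <= a3:        # node 1:3
            return (1, 3, 2)
        else:
            return (3, 1, 2)
    else:
        if a1 <= a3:          # node 1:3
            return (2, 1, 3)
        elif a2 <= a3:        # node 2:3
            return (2, 3, 1)
        else:
            return (3, 2, 1)
```

Running it on the lecture's example, `sort3([9, 4, 6])` takes the right branch at 1:2, the right branch at 1:3, and the left branch at 2:3, landing at the leaf (2, 3, 1): take a_2, then a_3, then a_1.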
And then each leaf node gives you a permutation. So, in order to be the answer to that sorting problem, that permutation had better have the property that it orders the elements. This is from the first lecture, when we defined the sorting problem: some permutation pi on n things such that a_pi(1) <= a_pi(2) <= ... <= a_pi(n).

So, that is the definition of a decision tree. Any binary tree with these kinds of labels satisfies all these properties. That is, in some sense, a sorting algorithm. It's a sorting algorithm in the decision tree model. Now, as you might expect, this is really not too different from the comparison model. If I give you a comparison sorting algorithm (we have these four: quicksort, heapsort, merge sort and insertion sort), all of them can be translated into the decision tree model. It's sort of a graphical representation of what the algorithm does. It's not a terribly useful one for writing down an algorithm. Any guesses why? Why do we not draw these pictures as a definition of quicksort or a definition of merge sort? It depends on the size of the input, that's a good point.
This tree is specific to the value of n, so it is, in some sense, not as generic. Now, we could try to write down a construction, for an arbitrary value of n, of one of these decision trees, and that would give us sort of a real algorithm that works for any input size. But even then this is not a terribly convenient representation for writing down an algorithm.

Well, let's write down a transformation that converts a comparison sorting algorithm to a decision tree, and then maybe you will see why. This is not a useless model, obviously, or I wouldn't be telling you about it. It will be very powerful for proving that we cannot do better than n lg n, but for writing down an algorithm, if you were going to implement something, this tree is not so useful. Even if you had a decision tree computer, whatever that is. But let's prove this theorem that decision trees, in some sense, model comparison sorting algorithms, which we call just comparison sorts.

This is a transformation. And we're going to build one tree for each value of n. The decision trees depend on n. The algorithm, hopefully, well, it depends on n, but it works for all values of n.
And we're just going to think of the algorithm as splitting into two forks, the left subtree and the right subtree, whenever it makes a comparison. If we take a comparison sort like merge sort, it does lots of stuff. It does index arithmetic, it does recursion, whatever. But at some point it makes a comparison, and then we say, OK, there are two halves of the algorithm. There is what the algorithm would do if the comparison came out less than or equal to, and what the algorithm would do if the comparison came out greater than. So, you can build a tree in this way.

In some sense, what this tree is doing is listing all possible executions of this algorithm, considering what would happen for all possible values of those comparisons. We will call these all possible instruction traces. If you write down all the instructions that are executed by this algorithm, for all possible input arrays a_1 to a_n, and see all the comparisons, how they could come out and what the algorithm does, in the end you will get a tree. Now, how big will that tree be, roughly? As a function of n. Yeah? Right.
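One way to see the instruction-trace idea concretely is to wrap the elements so that every comparison gets recorded, then run an ordinary comparison sort on them. This is only an illustrative sketch, not something from the lecture: the wrapper class, the particular sort (insertion sort), and the names are my own. Each input steers the sort down one root-to-leaf path of the decision tree, and the recorded trace is exactly that path:

```python
def insertion_sort(a):
    # Standard insertion sort, phrased so that every element-order
    # test goes through the <= operator.
    for i in range(1, len(a)):
        j = i
        while j > 0 and not (a[j - 1] <= a[j]):
            a[j - 1], a[j] = a[j], a[j - 1]
            j -= 1

class Traced:
    """Wraps a value and records every <= comparison into `trace`."""
    def __init__(self, idx, val, trace):
        self.idx, self.val, self.trace = idx, val, trace
    def __le__(self, other):
        outcome = self.val <= other.val
        self.trace.append((self.idx, other.idx, outcome))
        return outcome

def comparison_trace(a):
    """Run insertion sort on a copy of `a` and return the root-to-leaf
    path it takes in the decision tree: a list of (i, j, a_i <= a_j),
    with i and j the 1-based positions of the elements in the input."""
    trace = []
    elems = [Traced(i + 1, v, trace) for i, v in enumerate(a)]
    insertion_sort(elems)
    return trace
```

For example, an already-sorted 3-element input takes a short path of two comparisons, while a reverse-sorted input takes a path of three; collecting the traces over all possible inputs would enumerate the whole tree.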
If it's got to be able to sort every possible list of length n, at the leaves I have to have all the permutations of those elements. That is a lot. There are a lot of permutations on n elements: there are n factorial of them. n factorial is exponential; it's really big. So, this tree is huge. It's going to be exponential in the input size n. That is why we don't write algorithms down normally as a decision tree, even though in some cases maybe we could. It's not a very compact representation. These algorithms, you write them down in pseudocode, they have constant length. It's a very succinct representation of the algorithm. Here the length depends on n, and it depends exponentially on n, which is not useful if you wanted to implement the algorithm, because writing down the algorithm would take a long time. But, nonetheless, we can use this as a tool to analyze these comparison sorting algorithms. We have all of these. Any algorithm can be transformed in this way into a decision tree. And now we have this observation that the number of leaves in this decision tree has to be really big. Let me talk about leaves in a second.
Before we get to leaves, let's talk about the depth of the tree. This decision tree represents all possible executions of the algorithm. If I look at a particular execution, which corresponds to some root-to-leaf path in the tree, the running time, or the number of comparisons made by that execution, is just the length of the path. And, therefore, the worst-case running time, over all possible inputs of length n, is going to be... n - 1? Could be. Depends on the decision tree. But, as a function of the decision tree? The longest path, right, which is called the height of the tree.

So, this is what we want to measure. We want to claim that the height of the tree has to be at least n lg n, with an omega in front. That is what we'll prove. And the only thing we're going to use is that the number of leaves in that tree has to be big: it has to be n factorial.

This is a lower bound on decision tree sorting. And the lower bound says that if you have any decision tree that sorts n elements, then its height has to be at least n lg n, up to constant factors. So, that is the theorem. Now we're going to prove the theorem.
And we're going to use that the number of leaves in that tree must be at least n factorial, because there are n factorial permutations of the inputs. All of them could happen. And so, for this algorithm to be correct, it has to detect every one of those permutations in some way. Now, it may do it very quickly. It had better only need n lg n comparisons, because we know that's possible. The depth of the tree may not be too big, but it has to have a huge number of leaves down there. It has to branch enough to get n factorial leaves, because it has to give the right answer on all possible inputs. This is, in some sense, counting the number of possible inputs that we have to distinguish. This is the number of leaves.

What we care about is the height of the tree. Let's call the height of the tree h. Now, if I have a tree of height h, how many leaves could it have? What's the maximum number of leaves it could have? 2^h, exactly. Because this is a binary tree (comparison trees always have a branching factor of 2), the number of leaves has to be at most 2^h if I have a height-h tree. Now, this gives me a relation.
The number of leaves has to be greater than or equal to n factorial, and the number of leaves has to be less than or equal to 2^h. Therefore, n factorial is less than or equal to 2^h, if I got that right. Now, again, we care about h in terms of n factorial, so we solve this by taking logs. And I am also going to flip sides. Now h is at least log base 2, because there is a 2 over here, of n factorial: h >= lg(n!).

There is a property that I'm using here in order to derive this inequality from that inequality. This is a technical aside, but it's important that you realize there is a technical issue here. The general principle I'm applying is: I have some inequality, I do the same thing to both sides, and hopefully that inequality should still be true. But, in order for that to be the case, I need a property about the operation that I'm performing. It has to be a monotonic transformation. Here what I'm using is that log is a monotonically increasing function. That is important. If I multiplied both sides by -1, which is a decreasing function, the inequality would have to get flipped.
For the fact that the inequality is not flipping here, I need to know that log is monotonically increasing. And it is, so we're fine; we just need to be careful here. Now we need some approximation of n factorial in order to figure out what its log is. Does anyone know a good approximation for n factorial? Not necessarily the equation, but the name. Stirling's formula. Good. You all remember Stirling. And I just need the highest-order term, which I believe is that: n factorial is at least (n/e)^n. So, that's all we need here.

Now I can use properties of logs to bring the n outside. This is n lg (n/e). And then lg (n/e) I can simplify: that is just lg n - lg e. So, this is n(lg n - lg e). lg e is a constant, so it's really tiny compared to this lg n, which is growing with n. This is Omega(n lg n). All we care about is the leading term. It is actually Theta(n lg n), but because we have a greater-than-or-equal-to, all we care about is the omega. A theta here wouldn't give us anything stronger. Of course, not all algorithms have n lg n running time or make n lg n comparisons.
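The chain of inequalities is easy to sanity-check numerically. Here is a small sketch (the function name is mine) comparing the exact lower bound lg(n!) with the Stirling-style estimate n lg(n/e) used above:

```python
import math

def decision_tree_height_lower_bound(n):
    # Any decision tree that sorts n elements has at least n! leaves,
    # and a binary tree of height h has at most 2^h leaves,
    # so h >= lg(n!).
    return math.log2(math.factorial(n))

for n in (10, 100, 1000):
    exact = decision_tree_height_lower_bound(n)       # lg(n!)
    stirling = n * math.log2(n / math.e)              # n(lg n - lg e)
    assert stirling <= exact                          # (n/e)^n <= n!
```

For n = 3 this gives lg(6), about 2.58, so any correct tree on three elements needs height at least 3, matching the tree drawn earlier.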
Some of them do, some of them are worse, but this proves that all of them require a height of at least n lg n. There you see the proof, once you observe the fact about the number of leaves, and if you remember Stirling's formula. So, you should know this proof. You can show that all sorts of problems require n lg n time with this kind of technique, provided you're in some kind of a decision tree model. That's important. We really need that our algorithm can be phrased as a decision tree. And, in particular, we know from this transformation that all comparison sorts can be represented as decision trees. But there are some sorting algorithms which cannot be represented as a decision tree. And we will turn to that momentarily.

But before we get there: I phrased this theorem as a lower bound on decision tree sorting, but, of course, we also get a lower bound on comparison sorting. And, in particular, it tells us that merge sort and heapsort are asymptotically optimal. Their dependence on n, in terms of asymptotic notation, so ignoring constant factors, is optimal: these algorithms are optimal in terms of growth with n, but this is only in the comparison model.
So, among comparison sorting algorithms, which these are, they are asymptotically optimal. They use the minimum number of comparisons, up to constant factors. In fact, their whole running time is dominated by the number of comparisons. It's all Theta(n lg n). So, this is good news.

And I should probably mention a little bit about what happens with randomized algorithms. What I've described here really only applies, in some sense, to deterministic algorithms. Does anyone see what would change with randomized algorithms, or where I've assumed that I had a deterministic comparison sort? This is a bit subtle. And I only noticed it reading the notes this morning: oh, wait. I will give you a hint. It's over here, the right-hand side of the world.

If I have a deterministic algorithm, what the algorithm does is completely determined at each step. As long as I know all the comparisons that it made up to some point, it is determined what that algorithm will do. But, if I have a randomized algorithm, it also depends on the outcomes of some coin flips. Any suggestions of what breaks over here? There is more than one tree, exactly.
So, we had this assumption that we only have one tree for each n. In fact, what we get is a probability distribution over trees. For each value of n, if you take all the possible executions of that algorithm, all the instruction traces, well, now, in addition to branching on comparisons, we also branch on whether a coin flip came out heads or tails, or, however we're generating random numbers, it came out with some value between 1 and n. So, we get a probability distribution over trees.

This lower bound still applies, though. Because, no matter what tree we get, I don't really care: I get at least one tree for each n. And this proof applies to every tree. So, no matter what tree you get, if it is a correct tree, it has to have height Omega(n lg n). This lower bound applies even for randomized algorithms. You cannot get better than n lg n, because no matter what tree it comes up with, no matter how those coin flips come out, this argument still applies. Every tree that comes out has to be correct, so this is really "at least one tree". And that will now work.
We also get the fact that 408 00:30:47,000 --> 00:30:52,000 randomized quicksort is asymptotically optimal in 409 00:30:52,000 --> 00:30:54,000 expectation. 410 00:31:05,000 --> 00:31:09,000 But, in order to say that randomized quicksort is 411 00:31:09,000 --> 00:31:13,000 asymptotically optimal, we need to know that all 412 00:31:13,000 --> 00:31:19,000 randomized algorithms require Omega(n lg n) comparisons. 413 00:31:19,000 --> 00:31:22,000 Now we know that so all is well. 414 00:31:22,000 --> 00:31:27,000 That is the comparison model. Any questions before we go on? 415 00:31:27,000 --> 00:31:31,000 Good. The next topic is to burst 416 00:31:31,000 --> 00:31:37,000 outside of the comparison model and try to sort in linear time. 417 00:31:43,000 --> 00:31:45,000 It is pretty clear that, as long as you don't have some 418 00:31:45,000 --> 00:31:48,000 kind of a parallel algorithm or something really fancy, 419 00:31:48,000 --> 00:31:51,000 you cannot sort any better than linear time because you've at 420 00:31:51,000 --> 00:31:54,000 least got to look at the data. No matter what you're doing 421 00:31:54,000 --> 00:31:56,000 with the data, you've got to look at it, 422 00:31:56,000 --> 00:31:59,000 otherwise you're not sorting it correctly. 423 00:31:59,000 --> 00:32:01,000 So, linear time is the best we could hope for. 424 00:32:01,000 --> 00:32:05,000 N lg n is pretty close. How could we sort in linear 425 00:32:05,000 --> 00:32:07,000 time? Well, we're going to need some 426 00:32:07,000 --> 00:32:10,000 more powerful assumption. And this is the counter 427 00:32:10,000 --> 00:32:12,000 example. We're going to have to move 428 00:32:12,000 --> 00:32:16,000 outside the comparison model and do something else with our 429 00:32:16,000 --> 00:32:18,000 elements. And what we're going to do is 430 00:32:18,000 --> 00:32:21,000 assume that they're integers in a particular range, 431 00:32:21,000 --> 00:32:24,000 and we will use that to sort in linear time. 
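As a quick numeric sanity check on that Omega(n lg n) barrier (my illustration, not part of the lecture): any correct comparison decision tree for n elements must have height at least lg(n!), and lg(n!) tracks n lg n closely.

```python
import math

# lg(n!) is the minimum height of any correct comparison decision tree;
# compare it against n lg n for a few sizes.
for n in (10, 100, 1000):
    height = math.log2(math.factorial(n))
    print(n, round(height), round(n * math.log2(n)))
```

The gap between the two columns is only the constant factor hidden in the Theta.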
432 00:32:24,000 --> 00:32:27,000 We're going to see two algorithms for sorting faster 433 00:32:27,000 --> 00:32:32,000 than n lg n. The first one is pretty simple, 434 00:32:32,000 --> 00:32:35,000 and we will use it in the second algorithm. 435 00:32:35,000 --> 00:32:40,000 It's called counting sort. The input to counting sort is 436 00:32:40,000 --> 00:32:44,000 an array, as usual, but we're going to assume 437 00:32:44,000 --> 00:32:49,000 something about what those array elements look like. Each A[i] is an integer from 438 00:32:49,000 --> 00:32:52,000 the range of 1 to k. This is a pretty strong 439 00:32:52,000 --> 00:32:55,000 assumption. And the running time is 440 00:32:55,000 --> 00:33:01,000 actually going to depend on k. If k is small it is going to be 441 00:33:01,000 --> 00:33:06,000 a good algorithm. If k is big it's going to be a 442 00:33:06,000 --> 00:33:10,000 really bad algorithm, worse than n lg n. 443 00:33:10,000 --> 00:33:15,000 Our goal is to output some sorted version of this array. 444 00:33:15,000 --> 00:33:20,000 Let's call this the sorting of A. It's going to be easier to 445 00:33:20,000 --> 00:33:25,000 write down the output directly instead of writing down 446 00:33:25,000 --> 00:33:32,000 a permutation for this algorithm. And then we have some auxiliary 447 00:33:32,000 --> 00:33:36,000 storage. I'm about to write down the 448 00:33:36,000 --> 00:33:41,000 pseudocode, which is why I'm declaring all my variables here. 449 00:33:41,000 --> 00:33:45,000 And the auxiliary storage will have length k, 450 00:33:45,000 --> 00:33:48,000 which is the range on my input values. 451 00:33:48,000 --> 00:33:52,000 Let's see the algorithm. 452 00:34:07,000 --> 00:34:09,000 This is counting sort. 453 00:34:17,000 --> 00:34:20,000 And it takes a little while to write down but it's pretty 454 00:34:20,000 --> 00:34:22,000 straightforward. 455 00:34:28,000 --> 00:34:32,000 First we do some initialization. 456 00:34:32,000 --> 00:34:36,000 Then we do some counting.
457 00:35:04,000 --> 00:35:06,000 Then we do some summing. 458 00:35:50,000 --> 00:35:54,000 And then we actually write the output. 459 00:36:28,000 --> 00:36:30,000 Is that algorithm perfectly clear to everyone? 460 00:36:30,000 --> 00:36:33,000 No one. Good. This should illustrate how obscure pseudocode can be. 461 00:36:33,000 --> 00:36:36,000 And when you're solving your problem sets, 462 00:36:36,000 --> 00:36:39,000 you should keep in mind that it's really hard to understand 463 00:36:39,000 --> 00:36:41,000 an algorithm just given pseudocode like this. 464 00:36:41,000 --> 00:36:45,000 You need some kind of English description of what's going on 465 00:36:45,000 --> 00:36:48,000 because, while you could work through and figure out what this 466 00:36:48,000 --> 00:36:51,000 means, it could take half an hour to an hour. 467 00:36:51,000 --> 00:36:53,000 And that's not a good way of expressing yourself. 468 00:36:53,000 --> 00:36:57,000 And so what I will give you now is the English description, 469 00:36:57,000 --> 00:37:01,000 but we will refer back to this to understand. 470 00:37:01,000 --> 00:37:05,000 This is sort of our bible of what the algorithm is supposed 471 00:37:05,000 --> 00:37:07,000 to do. Let me go over it briefly. 472 00:37:07,000 --> 00:37:11,000 The first step is just some initialization. 473 00:37:11,000 --> 00:37:15,000 The C[i]'s are going to count some things, count occurrences 474 00:37:15,000 --> 00:37:18,000 of values. And so first we set them to 475 00:37:18,000 --> 00:37:20,000 zero. Then, for every value we see 476 00:37:20,000 --> 00:37:25,000 A[j], we're going to increment the counter for that value A[j]. 477 00:37:25,000 --> 00:37:30,000 Then the C[i]'s will give me the number of elements equal to a 478 00:37:30,000 --> 00:37:35,000 particular value i.
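Since the blackboard pseudocode is invisible in the transcript, here is my reconstruction of the four steps just narrated (initialize, count, sum, write the output), as a Python sketch assuming the keys are integers in the range 1 to k:

```python
def counting_sort(A, k):
    """Sort a list A of integers drawn from 1..k in Theta(n + k) time."""
    C = [0] * (k + 1)           # step 1: initialization (index 0 unused)
    for x in A:                 # step 2: count occurrences of each value
        C[x] += 1
    for i in range(2, k + 1):   # step 3: prefix sums, so C[i] = #elements <= i
        C[i] += C[i - 1]
    B = [None] * len(A)         # step 4: distribute into the output array,
    for x in reversed(A):       # scanning right to left for stability
        C[x] -= 1               # C[x] becomes a 0-based slot for this copy of x
        B[C[x]] = x
    return B
```

On the example coming up, counting_sort([4, 1, 3, 4, 3], 4) returns [1, 3, 3, 4, 4].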
Then I'm going to take prefix 479 00:37:35,000 --> 00:37:39,000 sums, which will make it so that C[i] gives me the number of 480 00:37:39,000 --> 00:37:42,000 keys, the number of elements less than or equal to i 481 00:37:42,000 --> 00:37:45,000 instead of equals. And then, finally, 482 00:37:45,000 --> 00:37:49,000 it turns out that's enough to put all the elements in the 483 00:37:49,000 --> 00:37:52,000 right place. This I will call distribution. 484 00:37:52,000 --> 00:37:56,000 This is the distribution step. And it's probably the least 485 00:37:56,000 --> 00:38:01,000 obvious of all the steps. And let's do an example to make 486 00:38:01,000 --> 00:38:04,000 it more obvious what's going on. 487 00:38:12,000 --> 00:38:30,000 Let's take an array A = [4, 1, 3, 4, 3]. 488 00:38:30,000 --> 00:38:36,000 And then I want some array C. And let me add some indices 489 00:38:36,000 --> 00:38:43,000 here so we can see what the algorithm is really doing. 490 00:38:43,000 --> 00:38:50,000 Here it turns out that all of my numbers are in the range 1 to 491 00:38:50,000 --> 00:38:54,000 4, so k = 4. My array C has four values. 492 00:38:54,000 --> 00:39:00,000 Initially, I set them all to zero. 493 00:39:00,000 --> 00:39:03,000 That's easy. And now I want to count through 494 00:39:03,000 --> 00:39:07,000 everything. And let me not cheat here. 495 00:39:07,000 --> 00:39:10,000 I'm in the second step, so to speak. 496 00:39:10,000 --> 00:39:13,000 And I look for each element in order. 497 00:39:13,000 --> 00:39:17,000 I look at the C[i] value. The first element is 4, 498 00:39:17,000 --> 00:39:20,000 so I look at C[4]. That is 0. 499 00:39:20,000 --> 00:39:24,000 I increment it to 1. Then I look at element 1. 500 00:39:24,000 --> 00:39:28,000 That's 0. I increment it to 1. 501 00:39:28,000 --> 00:39:30,000 Then I look at 3 and that's here. 502 00:39:30,000 --> 00:39:33,000 It is also 0. I increment it to 1. 503 00:39:33,000 --> 00:39:37,000 Not so exciting so far.
Now I see 4, 504 00:39:37,000 --> 00:39:40,000 which I've seen before, how exciting. 505 00:39:40,000 --> 00:39:44,000 I had value 1 in here, I increment it to 2. 506 00:39:44,000 --> 00:39:48,000 Then I see value 3, which also had a value of 1. 507 00:39:48,000 --> 00:39:51,000 I increment that to 2. The result is [1, 508 00:39:51,000 --> 00:39:55,000 0, 2, 2]. That's what my array C looks 509 00:39:55,000 --> 00:40:00,000 like at this point in the algorithm. 510 00:40:00,000 --> 00:40:04,000 Now I do a relatively simple transformation of taking prefix 511 00:40:04,000 --> 00:40:05,000 sums. I want to know, 512 00:40:05,000 --> 00:40:09,000 instead of these individual values, the sum of this prefix, 513 00:40:09,000 --> 00:40:13,000 the sum of this prefix, the sum of this prefix and the 514 00:40:13,000 --> 00:40:17,000 sum of this prefix. I will call that C prime just 515 00:40:17,000 --> 00:40:21,000 so we don't get too lost in all these different versions of C. 516 00:40:21,000 --> 00:40:23,000 This is just 1. And 1 plus 0 is 1. 517 00:40:23,000 --> 00:40:25,000 1 plus 2 is 3. 3 plus 2 is 5. 518 00:40:25,000 --> 00:40:30,000 So, these are sort of the running totals. 519 00:40:30,000 --> 00:40:33,000 There are five elements total, there are three elements less 520 00:40:33,000 --> 00:40:37,000 than or equal to 3, there is one element less than 521 00:40:37,000 --> 00:40:38,000 or equal to 2, and so on. 522 00:40:38,000 --> 00:40:40,000 Now, the fun part, the distribution. 523 00:40:40,000 --> 00:40:43,000 And this is where we get our array B. 524 00:40:43,000 --> 00:40:46,000 B better have the same size, every element better appear 525 00:40:46,000 --> 00:40:50,000 here somewhere and they should come out in sorted order. 526 00:40:50,000 --> 00:40:54,000 Let's just run the algorithm. j is going to start at the end 527 00:40:54,000 --> 00:40:58,000 of the array and work its way down to 1, the beginning of the 528 00:40:58,000 --> 00:41:02,000 array. 
And what we do is we pick up 529 00:41:02,000 --> 00:41:05,000 the last element of A, A[n]. 530 00:41:05,000 --> 00:41:11,000 We look at the counter. We look at the C vector for 531 00:41:11,000 --> 00:41:14,000 that value. Here the value is 3, 532 00:41:14,000 --> 00:41:19,000 and this is the third column, so that has number 3. 533 00:41:19,000 --> 00:41:24,000 And the claim is that's where it belongs in B. 534 00:41:24,000 --> 00:41:29,000 You take this number 3, you put it in index 3 of the 535 00:41:29,000 --> 00:41:34,000 array B. And then you decrement the 536 00:41:34,000 --> 00:41:37,000 counter. I'm going to replace 3 here 537 00:41:37,000 --> 00:41:40,000 with 2. And the idea is these numbers 538 00:41:40,000 --> 00:41:44,000 tell you where those values should go. 539 00:41:44,000 --> 00:41:48,000 Anything of value 1 should go at position 1. 540 00:41:48,000 --> 00:41:53,000 Anything with value 3 should go at position 3 or less. 541 00:41:53,000 --> 00:41:59,000 This is going to be the last place that a 3 should go. 542 00:41:59,000 --> 00:42:02,000 And then anything with value 4 should go at position 5 or less, 543 00:42:02,000 --> 00:42:06,000 definitely should go at the end of the array because 4 is the 544 00:42:06,000 --> 00:42:09,000 largest value. And this counter will work out 545 00:42:09,000 --> 00:42:13,000 perfectly because these counts have left enough space in each 546 00:42:13,000 --> 00:42:15,000 section of the array. Effectively, 547 00:42:15,000 --> 00:42:18,000 this part is reserved for ones, there are no twos, 548 00:42:18,000 --> 00:42:21,000 this part is reserved for threes, and this part is 549 00:42:21,000 --> 00:42:24,000 reserved for fours. You can check if that's really 550 00:42:24,000 --> 00:42:27,000 what this array means. Let's finish running the 551 00:42:27,000 --> 00:42:31,000 algorithm. That was the last element. 552 00:42:31,000 --> 00:42:34,000 I won't cross it off, but we've sort of done that. 
553 00:42:34,000 --> 00:42:36,000 Now I look at the next to last element. 554 00:42:36,000 --> 00:42:38,000 That's a 4. Fours go in position 5. 555 00:42:38,000 --> 00:42:42,000 So, I put my 4 here in position 5 and I decrement that counter. 556 00:42:42,000 --> 00:42:45,000 Next I look at another 3. Threes now go in position 2, 557 00:42:45,000 --> 00:42:48,000 so that goes there. And then I decrement that 558 00:42:48,000 --> 00:42:50,000 counter. I won't actually use that 559 00:42:50,000 --> 00:42:53,000 counter anymore, but let's decrement it because 560 00:42:53,000 --> 00:42:57,000 that's what the algorithm says. I look at the previous element. 561 00:42:57,000 --> 00:43:00,000 That's a 1. Ones go in position 1, 562 00:43:00,000 --> 00:43:04,000 so I put it here and decrement that counter. 563 00:43:04,000 --> 00:43:09,000 And finally I have another 4. And fours go in position 4 now, 564 00:43:09,000 --> 00:43:13,000 position 4 is here, and I decrement that counter. 565 00:43:13,000 --> 00:43:18,000 So, that's counting sort. And you'll notice that all the 566 00:43:18,000 --> 00:43:23,000 elements appear and they appear in order, so that's the 567 00:43:23,000 --> 00:43:26,000 algorithm. Now, what's the running time of 568 00:43:26,000 --> 00:43:31,000 counting sort? kn is an upper bound. 569 00:43:31,000 --> 00:43:35,000 It's a little bit better than that. 570 00:43:35,000 --> 00:43:43,000 Actually, quite a bit better. This requires some summing. 571 00:43:43,000 --> 00:43:49,000 Let's go back to the top of the algorithm. 572 00:43:49,000 --> 00:43:53,000 How much time does this step take? 573 00:43:53,000 --> 00:43:57,000 k. How much time does this step 574 00:43:57,000 --> 00:44:00,000 take? n. 575 00:44:00,000 --> 00:44:05,000 How much time does this step take? 576 00:44:05,000 --> 00:44:10,000 k. 
Each of these operations in the 577 00:44:10,000 --> 00:44:17,000 for loops is taking constant time, so the cost is just how many 578 00:44:17,000 --> 00:44:22,000 iterations of that for loop there are. 579 00:44:22,000 --> 00:44:29,000 And, finally, this step takes n. 580 00:44:29,000 --> 00:44:35,000 So, the total running time of counting sort is k + n. 581 00:44:35,000 --> 00:44:43,000 And this is a great algorithm if k is relatively small, 582 00:44:43,000 --> 00:44:49,000 like at most n. If k is big like n^2 or 2^n or 583 00:44:49,000 --> 00:44:54,000 whatever, this is not such a good algorithm, 584 00:44:54,000 --> 00:45:01,000 but if k = O(n) this is great. And we get our linear time 585 00:45:01,000 --> 00:45:04,000 sorting algorithm. Not only do we need the 586 00:45:04,000 --> 00:45:08,000 assumption that our numbers are integers, but we need that the 587 00:45:08,000 --> 00:45:12,000 range of the integers is pretty small for this algorithm to 588 00:45:12,000 --> 00:45:14,000 work. If all the numbers are between 589 00:45:14,000 --> 00:45:17,000 1 and order n then we get a linear time algorithm. 590 00:45:17,000 --> 00:45:20,000 But as soon as they're up to n lg n we're toast. 591 00:45:20,000 --> 00:45:24,000 We're back to n lg n sorting. It's not so great. 592 00:45:24,000 --> 00:45:27,000 So, you could write a combination algorithm that says, 593 00:45:27,000 --> 00:45:31,000 well, if k is bigger than n lg n, then I will just use merge 594 00:45:31,000 --> 00:45:35,000 sort. And if it's less than n lg n 595 00:45:35,000 --> 00:45:38,000 I'll use counting sort. And that would work, 596 00:45:38,000 --> 00:45:42,000 but we can do better than that. How's the time? 597 00:45:42,000 --> 00:45:46,000 It is worth noting that we've beaten our bound, 598 00:45:46,000 --> 00:45:51,000 but only assuming that we're outside the comparison model.
599 00:45:51,000 --> 00:45:55,000 We haven't really contradicted the original theorem, 600 00:45:55,000 --> 00:46:00,000 we're just changing the model. And it's always good to 601 00:46:00,000 --> 00:46:04,000 question what you're allowed to do in any problem scenario. 602 00:46:04,000 --> 00:46:07,000 In, say, some practical scenarios, this would be great 603 00:46:07,000 --> 00:46:10,000 if the numbers you're dealing with are, say, 604 00:46:10,000 --> 00:46:12,000 a byte long. Then k is only 2^8, 605 00:46:12,000 --> 00:46:15,000 which is 256. You need this auxiliary array 606 00:46:15,000 --> 00:46:17,000 of size 256, and this is really fast. 607 00:46:17,000 --> 00:46:21,000 256 + n, no matter how big n is it's linear in n. 608 00:46:21,000 --> 00:46:24,000 If you know your numbers are small, it's great. 609 00:46:24,000 --> 00:46:27,000 But if your numbers are bigger, say you still know 610 00:46:27,000 --> 00:46:30,000 they're integers but they fit in like 32 bit words, 611 00:46:30,000 --> 00:46:35,000 then life is not so easy. Because k is then 2^32, 612 00:46:35,000 --> 00:46:39,000 which is 4.2 billion or so, which is pretty big. 613 00:46:39,000 --> 00:46:43,000 And you would need this auxiliary array of 4.2 billion 614 00:46:43,000 --> 00:46:46,000 words, which is probably like 16 gigabytes. 615 00:46:46,000 --> 00:46:51,000 So, you just need to initialize that array before you can even 616 00:46:51,000 --> 00:46:54,000 get started. Unless n is much, 617 00:46:54,000 --> 00:46:58,000 much more than 4 billion and you have 16 gigabytes of storage 618 00:46:58,000 --> 00:47:02,000 just to throw away (and I don't even have any 619 00:47:02,000 --> 00:47:06,000 machines with 16 gigabytes of RAM), this is not such a great 620 00:47:06,000 --> 00:47:10,000 algorithm. Just to get a feel, 621 00:47:10,000 --> 00:47:13,000 it's good if the numbers are really small.
622 00:47:13,000 --> 00:47:18,000 What we're going to do next is come up with a fancier algorithm 623 00:47:18,000 --> 00:47:22,000 that uses this as a subroutine on small numbers and combines 624 00:47:22,000 --> 00:47:25,000 this algorithm to handle larger numbers. 625 00:47:25,000 --> 00:47:29,000 That algorithm is called radix sort. 626 00:47:29,000 --> 00:47:34,000 But we need one important property of counting sort before 627 00:47:34,000 --> 00:47:36,000 we can go there. 628 00:47:42,000 --> 00:47:45,000 And that important property is stability. 629 00:47:50,000 --> 00:47:58,000 A stable sorting algorithm preserves the order of equal 630 00:47:58,000 --> 00:48:05,000 elements, let's say the relative order. 631 00:48:19,000 --> 00:48:21,000 This is a bit subtle because usually we think of elements 632 00:48:21,000 --> 00:48:24,000 just as numbers. And, yeah, we had a couple 633 00:48:24,000 --> 00:48:25,000 threes and we had a couple fours. 634 00:48:25,000 --> 00:48:28,000 It turns out, if you look at the order of 635 00:48:28,000 --> 00:48:31,000 those threes and the order of those fours, we kept them in 636 00:48:31,000 --> 00:48:33,000 order. Because we took the last three 637 00:48:33,000 --> 00:48:36,000 and we put it here. Then we took the next to the 638 00:48:36,000 --> 00:48:39,000 last three and we put it to the left of that, as we're 639 00:48:39,000 --> 00:48:42,000 decrementing our counter and moving from the end of the array 640 00:48:42,000 --> 00:48:45,000 to the beginning of the array. No matter how we do that, 641 00:48:45,000 --> 00:48:49,000 the orders of those threes are preserved, the orders of the 642 00:48:49,000 --> 00:48:51,000 fours are preserved. This may seem like a relatively 643 00:48:51,000 --> 00:48:54,000 simple thing, but if you look at the other 644 00:48:54,000 --> 00:48:57,000 four sorting algorithms we've seen, not all of them are 645 00:48:57,000 --> 00:49:00,000 stable. So, this is an exercise.
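To make the definition concrete, here is a tiny illustration of mine (using Python's built-in sort, which happens to be stable, rather than any algorithm from the lecture): records with equal keys keep their original relative order.

```python
# (key, label) pairs; the labels let us see where equal keys end up.
records = [(3, 'a'), (1, 'b'), (3, 'c'), (1, 'd')]
result = sorted(records, key=lambda r: r[0])
# A stable sort keeps 'b' before 'd' and 'a' before 'c':
print(result)  # [(1, 'b'), (1, 'd'), (3, 'a'), (3, 'c')]
```

An unstable sort would be free to emit (1, 'd') before (1, 'b'), which is exactly what radix sort cannot tolerate.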
646 00:49:06,000 --> 00:49:11,000 The exercise is to figure out which of the other sorting algorithms 647 00:49:11,000 --> 00:49:15,000 we've seen are stable and which are not. 648 00:49:21,000 --> 00:49:25,000 I encourage you to work that out because this is the sort of 649 00:49:25,000 --> 00:49:29,000 thing that we ask on quizzes. But for now all we need is that 650 00:49:29,000 --> 00:49:33,000 counting sort is stable. And I won't prove this, 651 00:49:33,000 --> 00:49:37,000 but it should be pretty obvious from the algorithm. 652 00:49:37,000 --> 00:49:41,000 Now we get to talk about radix sort. 653 00:49:55,000 --> 00:50:01,000 Radix sort is going to work for a much larger range of numbers 654 00:50:01,000 --> 00:50:04,000 in linear time. It still has to make an 655 00:50:04,000 --> 00:50:09,000 assumption about how big those numbers are, but it will be a 656 00:50:09,000 --> 00:50:13,000 much more lax assumption. Now, to increase suspense even 657 00:50:13,000 --> 00:50:18,000 further, I am going to tell you some history about radix sort. 658 00:50:18,000 --> 00:50:22,000 This is one of the oldest sorting algorithms. 659 00:50:22,000 --> 00:50:26,000 It's probably the oldest implemented sorting algorithm. 660 00:50:26,000 --> 00:50:32,000 It was implemented around 1890. This is Herman Hollerith. 661 00:50:32,000 --> 00:50:35,000 Let's say around 1890. Has anyone heard of Hollerith 662 00:50:35,000 --> 00:50:37,000 before? A couple people. 663 00:50:37,000 --> 00:50:41,000 Not too many. He is sort of an important guy. 664 00:50:41,000 --> 00:50:43,000 He was a lecturer at MIT at some point. 665 00:50:43,000 --> 00:50:47,000 He developed an early version of punch cards. 666 00:50:47,000 --> 00:50:51,000 Punch card technology. This is before my time so I 667 00:50:51,000 --> 00:50:54,000 even have to look at my notes to remember. 668 00:50:54,000 --> 00:50:57,000 Oh, yeah, they're called punch cards. 669 00:50:57,000 --> 00:51:02,000 You may have seen them.
If not they're in the 670 00:51:02,000 --> 00:51:06,000 PowerPoint lecture notes. There's this big grid. 671 00:51:06,000 --> 00:51:11,000 These days, if you've used a modern punch card recently, 672 00:51:11,000 --> 00:51:16,000 they are 80 characters wide and, I don't know, 673 00:51:16,000 --> 00:51:21,000 I think it's something like 16, I don't remember exactly. 674 00:51:21,000 --> 00:51:25,000 And then you punch little holes here. 675 00:51:25,000 --> 00:51:30,000 You have this magic machine. It's like a typewriter. 676 00:51:30,000 --> 00:51:34,000 You press a letter and that corresponds to some character. 677 00:51:34,000 --> 00:51:38,000 Maybe it will punch out a hole here, punch out a hole here. 678 00:51:38,000 --> 00:51:42,000 You can see the website if you want to know exactly how this 679 00:51:42,000 --> 00:51:46,000 works for historical reasons. You don't see these too often 680 00:51:46,000 --> 00:51:49,000 anymore, but this is in particular the reason why most 681 00:51:49,000 --> 00:51:53,000 terminals are 80 characters wide because that was how things 682 00:51:53,000 --> 00:51:55,000 were. Hollerith actually didn't 683 00:51:55,000 --> 00:51:59,000 develop these punch cards exactly, although eventually he 684 00:51:59,000 --> 00:52:01,000 did. In the beginning, 685 00:52:01,000 --> 00:52:04,000 in 1890, the big deal was the US Census. 686 00:52:04,000 --> 00:52:07,000 If you watched the news, I guess like a year or two ago, 687 00:52:07,000 --> 00:52:10,000 the US Census was a big deal because it's really expensive to 688 00:52:10,000 --> 00:52:12,000 collect all this data from everyone. 689 00:52:12,000 --> 00:52:15,000 And the Constitution says you've got to collect data about 690 00:52:15,000 --> 00:52:18,000 everyone every ten years. And it was getting hard. 691 00:52:18,000 --> 00:52:20,000 In particular, in 1880, they did the census. 692 00:52:20,000 --> 00:52:24,000 And it took them almost ten years to complete the census. 
693 00:52:24,000 --> 00:52:27,000 The population kept going up, and ten years to do a ten-year 694 00:52:27,000 --> 00:52:30,000 census, that's going to start getting expensive when they 695 00:52:30,000 --> 00:52:34,000 overlap with each other. So, for 1890 they wanted to do 696 00:52:34,000 --> 00:52:37,000 something fancier. And Hollerith said, 697 00:52:37,000 --> 00:52:40,000 OK, I'm going to build a machine to take in the 698 00:52:40,000 --> 00:52:42,000 data. It was a modified punch card 699 00:52:42,000 --> 00:52:46,000 where you would mark out particular squares depending on 700 00:52:46,000 --> 00:52:50,000 your status, whether you were single or married or whatever. 701 00:52:50,000 --> 00:52:53,000 All the things they wanted to know on the census they would 702 00:52:53,000 --> 00:52:57,000 encode in binary onto this card. And then he built a machine 703 00:52:57,000 --> 00:53:02,000 that would sort these cards so you could do counting. 704 00:53:02,000 --> 00:53:05,000 And, in some sense, these are numbers. 705 00:53:05,000 --> 00:53:10,000 And the numbers aren't too big, but they're big enough that 706 00:53:10,000 --> 00:53:15,000 counting sort wouldn't work. I mean if there were a hundred 707 00:53:15,000 --> 00:53:18,000 binary digits here, 2^100 is pretty overwhelming, 708 00:53:18,000 --> 00:53:24,000 so we cannot use counting sort. The first idea was the wrong 709 00:53:24,000 --> 00:53:27,000 idea. I'm going to think of these as 710 00:53:27,000 --> 00:53:30,000 numbers. Let's say each of these columns 711 00:53:30,000 --> 00:53:34,000 is one number. And so there's sort of the most 712 00:53:34,000 --> 00:53:38,000 significant number out here and there is the least significant 713 00:53:38,000 --> 00:53:40,000 number out here. The first idea was you sort by 714 00:53:40,000 --> 00:53:43,000 the most significant digit first.
715 00:53:50,000 --> 00:53:53,000 That's not such a great algorithm, because if you sort 716 00:53:53,000 --> 00:53:58,000 by the most significant digit you get a bunch of buckets each 717 00:53:58,000 --> 00:54:01,000 with a pile of cards. And this was a physical device. 718 00:54:01,000 --> 00:54:04,000 It wasn't exactly an electronically controlled 719 00:54:04,000 --> 00:54:06,000 computer. It was a human that would push 720 00:54:06,000 --> 00:54:09,000 down some kind of reader. It would see which holes in the 721 00:54:09,000 --> 00:54:12,000 first column are punched. And then it would open a 722 00:54:12,000 --> 00:54:15,000 physical bin in which the person would sort of swipe it and it 723 00:54:15,000 --> 00:54:17,000 would just fall into the right bin. 724 00:54:17,000 --> 00:54:20,000 It was semi-automated. I mean the computer was the 725 00:54:20,000 --> 00:54:22,000 human plus the machine, but never mind. 726 00:54:22,000 --> 00:54:25,000 This was the procedure. You sorted it into bins. 727 00:54:25,000 --> 00:54:28,000 Then you had to go through and sort each bin by the second 728 00:54:28,000 --> 00:54:32,000 digit. And pretty soon the number of 729 00:54:32,000 --> 00:54:36,000 bins gets pretty big. And if you don't have too many 730 00:54:36,000 --> 00:54:40,000 digits this is OK, but it's not the right thing to 731 00:54:40,000 --> 00:54:41,000 do. The right idea, 732 00:54:41,000 --> 00:54:45,000 which is what Hollerith came up with after that, 733 00:54:45,000 --> 00:54:50,000 was to sort by the least significant digit first. 734 00:55:00,000 --> 00:55:03,000 And you should also do that using a stable sorting 735 00:55:03,000 --> 00:55:05,000 algorithm. Now, Hollerith probably didn't 736 00:55:05,000 --> 00:55:08,000 call it a stable sorting algorithm at the time, 737 00:55:08,000 --> 00:55:11,000 but we will. And this won Hollerith lots of 738 00:55:11,000 --> 00:55:14,000 money and good things.
He founded the Tabulating 739 00:55:14,000 --> 00:55:17,000 Machine Company in 1896, and that merged with several 740 00:55:17,000 --> 00:55:21,000 other companies in 1911 to form something you may have heard of 741 00:55:21,000 --> 00:55:24,000 called IBM in 1924. That may be the context in 742 00:55:24,000 --> 00:55:28,000 which you've heard of Hollerith, or if you've done punch cards 743 00:55:28,000 --> 00:55:32,000 before. The whole idea is that we're 744 00:55:32,000 --> 00:55:37,000 doing a digit by digit sort. I should have mentioned that at 745 00:55:37,000 --> 00:55:40,000 the beginning. And we're going to do it from 746 00:55:40,000 --> 00:55:43,000 least significant to most significant. 747 00:55:43,000 --> 00:55:48,000 It turns out that works. And to see that let's do an 748 00:55:48,000 --> 00:55:50,000 example. I think I'm going to need a 749 00:55:50,000 --> 00:55:55,000 whole two boards ideally. First we'll see an example. 750 00:55:55,000 --> 00:55:59,000 Then we'll prove the theorem. The proof is actually pretty 751 00:55:59,000 --> 00:56:03,000 darn easy. But, nonetheless, 752 00:56:03,000 --> 00:56:07,000 it's rather counterintuitive that this works if you haven't seen 753 00:56:07,000 --> 00:56:10,000 it before. Certainly, the first time I saw 754 00:56:10,000 --> 00:56:14,000 it, it was quite a surprise. The nice thing also about this 755 00:56:14,000 --> 00:56:19,000 algorithm is there are no bins. It's all one big bin at all 756 00:56:19,000 --> 00:56:21,000 times. Let's take some numbers. 757 00:56:23,000 --> 00:56:28,000 I'm spacing out the digits so we can see them a little bit 758 00:56:28,000 --> 00:56:30,000 better. 759 00:56:30,000 --> 00:56:33,000 329, 457, 657, 839, 436, 720 and 355. 760 00:56:33,000 --> 00:56:38,000 I'm assuming here we're using decimal numbers. 761 00:56:38,000 --> 00:56:43,000 Why not? Hopefully these are not yet 762 00:56:43,000 --> 00:56:47,000 sorted. We'd like to sort them.
763 00:56:47,000 --> 00:56:54,000 The first thing we do is take the least significant digit, 764 00:56:54,000 --> 00:57:00,000 sort by the least significant digit. 765 00:57:00,000 --> 00:57:04,000 And whenever we have equal elements like these two nines, 766 00:57:04,000 --> 00:57:07,000 we preserve their relative order. 767 00:57:07,000 --> 00:57:11,000 So, 329 is going to remain above 839. 768 00:57:11,000 --> 00:57:16,000 It doesn't matter here because we're doing the first sort, 769 00:57:16,000 --> 00:57:20,000 but in general we're always using a stable sorting 770 00:57:20,000 --> 00:57:23,000 algorithm. When we sort by this column, 771 00:57:23,000 --> 00:57:27,000 first we get the zero, so that's 720, 772 00:57:27,000 --> 00:57:30,000 then we get the 5, that's 355, 773 00:57:30,000 --> 00:57:31,000 then the 6, that's 436. 774 00:57:31,000 --> 00:57:36,000 Stop me if I make a mistake. Then we get the 7s, 775 00:57:36,000 --> 00:57:42,000 and we preserve the order. Here it happens to be the right 776 00:57:42,000 --> 00:57:47,000 order, but it may not be at this point. 777 00:57:47,000 --> 00:57:51,000 We haven't even looked at the other digits. 778 00:57:51,000 --> 00:57:54,000 Then we get 9s, there are two 9s, 779 00:57:54,000 --> 00:57:57,000 329 and 839. All right so far? 780 00:57:57,000 --> 00:58:03,000 Good. Now we sort by the middle 781 00:58:03,000 --> 00:58:07,000 digit, the next least significant. 782 00:58:07,000 --> 00:58:12,000 And we start out with what looks like the 2s. 783 00:58:12,000 --> 00:58:17,000 There is a 2 up here and a 2 down here. 784 00:58:17,000 --> 00:58:23,000 Of course, we write the first 2 first, 720, then 329. 785 00:58:23,000 --> 00:58:30,000 Then we have the 3s, so we have 436 and 839. 786 00:58:30,000 --> 00:58:33,000 Then we have a bunch of 5s it looks like. 787 00:58:33,000 --> 00:58:36,000 Have I missed anyone so far? No. 788 00:58:36,000 --> 00:58:38,000 Good. We have three 5s, 789 00:58:38,000 --> 00:58:42,000 355, 457 and 657.
I like to check that I haven't 790 00:58:42,000 --> 00:58:45,000 lost any elements. We have seven here, 791 00:58:45,000 --> 00:58:48,000 seven here and seven elements here. 792 00:58:48,000 --> 00:58:51,000 Good. Finally, we sort by the last 793 00:58:51,000 --> 00:58:53,000 digit. One thing to notice, 794 00:58:53,000 --> 00:59:00,000 by the way, is before we sorted by the last digit -- 795 00:59:00,000 --> 00:59:05,000 Currently these numbers don't resemble sorted order at all. 796 00:59:05,000 --> 00:59:10,000 But if you look at everything beyond the digit we haven't yet 797 00:59:10,000 --> 00:59:15,000 sorted, so these two digits, that's nice and sorted, 798 00:59:15,000 --> 00:59:17,000 20, 29, 36, 39, 55, 57, 57. 799 00:59:17,000 --> 00:59:20,000 Pretty cool. Let's finish it off. 800 00:59:20,000 --> 00:59:23,000 We stably sort by the first digit. 801 00:59:23,000 --> 00:59:29,000 And the smallest number we get is a 3, so we get 329 and 355, then 802 00:59:36,000 --> 00:59:45,000 436 and 457, then we get a 6, 803 00:59:45,000 --> 00:59:55,000 657, then a 7, and then we have an 8. 804 00:59:55,000 --> 01:00:01,631 And check. I still have seven elements. 805 01:00:01,631 --> 01:00:03,203 Good. I haven't lost anyone. 806 01:00:03,203 --> 01:00:05,533 And, indeed, they're now in sorted order. 807 01:00:05,533 --> 01:00:08,097 And you can start to see why this is working. 808 01:00:08,097 --> 01:00:11,417 When I have equal elements here, I have already sorted the 809 01:00:11,417 --> 01:00:13,398 suffix. Let's write down a proof of 810 01:00:13,398 --> 01:00:15,029 that. What is nice about this 811 01:00:15,029 --> 01:00:17,650 algorithm is we're not partitioning into bins. 812 01:00:17,650 --> 01:00:20,970 We always keep the huge batch of elements in one big pile, 813 01:00:20,970 --> 01:00:23,650 but we're just going through it multiple times. 814 01:00:23,650 --> 01:00:27,087 In general, we sort of need to go through it multiple times.
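The three passes just traced can be replayed mechanically (a sketch of mine using Python's stable built-in sort keyed on one digit at a time, standing in for the stable sort on the board):

```python
nums = [329, 457, 657, 839, 436, 720, 355]
for t in range(3):  # digit 0 is the least significant
    nums = sorted(nums, key=lambda x: x // 10 ** t % 10)
    print("after pass", t + 1, nums)
```

The intermediate lists match the three columns from the example, and the final pass leaves the numbers fully sorted.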
815 01:00:27,087 --> 01:00:32,006 Hopefully not too many times. But let's first argue 816 01:00:32,006 --> 01:00:36,019 correctness. To analyze the running time is 817 01:00:36,019 --> 01:00:41,751 a little bit tricky here because it depends on how you partition 818 01:00:41,751 --> 01:00:44,808 into digits. Correctness is easy. 819 01:00:44,808 --> 01:00:50,159 We just induct on the digit position that we're currently 820 01:00:50,159 --> 01:00:55,891 sorting, so let's call that t. And we can assume by induction 821 01:00:55,891 --> 01:01:02,656 that it's sorted beyond digit t. This is our induction 822 01:01:02,656 --> 01:01:07,841 hypothesis. We assume that we're sorted on 823 01:01:07,841 --> 01:01:14,924 the low-order t - 1 digits. And then the next thing we do 824 01:01:14,924 --> 01:01:21,501 is sort on the t-th digit. We just need to check that 825 01:01:21,501 --> 01:01:26,561 things work. And we restore the induction 826 01:01:26,561 --> 01:01:32,000 hypothesis for t instead of t - 1. 827 01:01:32,000 --> 01:01:36,009 When we sort on the t-th digit there are two cases. 828 01:01:36,009 --> 01:01:40,981 If we look at any two elements, we want to know whether they're 829 01:01:40,981 --> 01:01:45,150 put in the right order. If two elements are the same, 830 01:01:45,150 --> 01:01:49,000 let's say they have the same t-th digit -- 831 01:01:58,000 --> 01:02:02,000 This is the tricky case. If they have the same t-th 832 01:02:02,000 --> 01:02:05,519 digit then their order should not be changed. 833 01:02:05,519 --> 01:02:09,360 So, by stability, we know that they remain in the 834 01:02:09,360 --> 01:02:14,400 same order because stability is supposed to preserve things that 835 01:02:14,400 --> 01:02:17,519 have the same key that we're sorting on. 
836 01:02:17,519 --> 01:02:21,920 And then, by the induction hypothesis, we know that that 837 01:02:21,920 --> 01:02:26,239 keeps them in sorted order because induction hypothesis 838 01:02:26,239 --> 01:02:30,000 says that they used to be sorted. 839 01:02:30,000 --> 01:02:35,369 Adding on this value in the front that's the same in both 840 01:02:35,369 --> 01:02:39,684 doesn't change anything so they remain sorted. 841 01:02:39,684 --> 01:02:44,000 And if they have differing t-th digits -- 842 01:02:54,000 --> 01:03:00,000 -- then this sorting step will put them in the right order. 843 01:03:00,000 --> 01:03:03,189 Because that's what sorting does. 844 01:03:03,189 --> 01:03:08,870 This is the most significant digit, so you've got to order 845 01:03:08,870 --> 01:03:12,558 them by the t-th digit if they differ. 846 01:03:12,558 --> 01:03:17,840 The rest are irrelevant. So, proof here of correctness 847 01:03:17,840 --> 01:03:22,026 is very simple once you know the algorithm. 848 01:03:22,026 --> 01:03:25,514 Any questions before we go on? Good. 849 01:03:25,514 --> 01:03:30,000 We're going to use counting sort. 850 01:03:30,000 --> 01:03:30,344 We could use any sorting algorithm we want for individual 851 01:03:30,344 --> 01:03:30,713 digits, but the only algorithm that we know that runs in less 852 01:03:30,713 --> 01:03:30,916 than n lg n time is counting sort. 853 01:03:30,916 --> 01:03:31,267 So, we better use that one to sort of bootstrap and get an 854 01:03:31,267 --> 01:03:31,501 even faster and more general algorithm. 855 01:03:31,501 --> 01:03:31,883 I just erased the running time. Counting sort runs in order k + 856 01:03:31,883 --> 01:03:36,003 n time. We need to remember that. 857 01:03:36,003 --> 01:03:44,329 And the range of the numbers is 1 to k or 0 to k - 1. 
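Counting sort itself, keyed on a single digit, can be sketched as follows. This is a minimal version in the spirit of the lecture; the function and parameter names are my own.

```python
def counting_sort_by_key(a, key, k):
    """Stable counting sort in O(n + k) time: key(x) must lie in range(k)."""
    count = [0] * k
    for x in a:                      # histogram of key values
        count[key(x)] += 1
    for v in range(1, k):            # prefix sums: count[v] = # of keys <= v
        count[v] += count[v - 1]
    out = [None] * len(a)
    for x in reversed(a):            # backward pass keeps equal keys stable
        count[key(x)] -= 1
        out[count[key(x)]] = x
    return out

# sort by last decimal digit: elements with equal digits keep their input order
print(counting_sort_by_key([21, 13, 11, 23], key=lambda x: x % 10, k=10))
# [21, 11, 13, 23]
```

The backward walk over the input is what makes it stable: among equal keys, the last one in the input lands in the last reserved slot.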
858 01:03:44,329 --> 01:03:53,616 When we sort by a particular digit, we shouldn't use an n lg n 859 01:03:53,616 --> 01:04:02,743 algorithm because then this thing will take n lg n for one 860 01:04:02,743 --> 01:04:09,788 round and it's going to have multiple rounds. 861 01:04:09,788 --> 01:04:15,552 That's going to be worse than n lg n. 862 01:04:15,552 --> 01:04:25,000 We're going to use counting sort for each round. 863 01:04:32,000 --> 01:04:34,931 We use counting sort for each digit. 864 01:04:34,931 --> 01:04:40,125 And we know the running time of counting sort here is order k + 865 01:04:40,125 --> 01:04:42,973 n. But I don't want to assume that 866 01:04:42,973 --> 01:04:46,324 my integers are split into digits for me. 867 01:04:46,324 --> 01:04:50,261 That's sort of giving away too much flexibility. 868 01:04:50,261 --> 01:04:55,287 Because if I have some number written in whatever form it is, 869 01:04:55,287 --> 01:05:00,062 probably written in binary, I can cluster together some of 870 01:05:00,062 --> 01:05:04,000 those bits and call that a digit. 871 01:05:04,000 --> 01:05:07,415 Let's think of our numbers as binary. 872 01:05:07,415 --> 01:05:12,442 Suppose we have n integers. And they're in some range. 873 01:05:12,442 --> 01:05:16,901 And we want to know how big a range they can be. 874 01:05:16,901 --> 01:05:21,264 Let's say, a sort of practical way of thinking, 875 01:05:21,264 --> 01:05:26,577 you know, we're in a binary world, each integer is b bits 876 01:05:26,577 --> 01:05:29,774 long. So, in other words, 877 01:05:29,774 --> 01:05:35,283 the range is from 0 to 2^b - 1. I will assume that my numbers 878 01:05:35,283 --> 01:05:39,765 are non-negative. It doesn't make much difference 879 01:05:39,765 --> 01:05:42,006 if they're negative, too. 
880 01:05:42,006 --> 01:05:47,515 I want to know how big a b I can handle, but I don't want to 881 01:05:47,515 --> 01:05:52,650 split into bits as my digits because then I would have b 882 01:05:52,650 --> 01:05:59,000 digits and I would have to do b rounds of this algorithm. 883 01:05:59,000 --> 01:06:02,839 The number of rounds of this algorithm is the number of 884 01:06:02,839 --> 01:06:05,754 digits that I have. And each one costs me, 885 01:06:05,754 --> 01:06:08,598 let's hope, linear time. And, indeed, 886 01:06:08,598 --> 01:06:10,589 if I use a single bit, k = 2. 887 01:06:10,589 --> 01:06:14,428 And so this is order n. But then the running time would 888 01:06:14,428 --> 01:06:17,557 be order n per round. And there are b digits, 889 01:06:17,557 --> 01:06:21,183 if I consider them to be bits, order n times b time. 890 01:06:21,183 --> 01:06:24,240 And even if b is something small like log n, 891 01:06:24,240 --> 01:06:27,866 if I have log n bits, then these are numbers between 892 01:06:27,866 --> 01:06:32,549 0 and n - 1. I already know how to sort 893 01:06:32,549 --> 01:06:36,666 numbers between 0 and n - 1 in linear time. 894 01:06:36,666 --> 01:06:41,372 Here I'm spending n lg n time, so that's no good. 895 01:06:41,372 --> 01:06:47,549 Instead, what we're going to do is take a bunch of bits and call 896 01:06:47,549 --> 01:06:51,470 that a digit, the most bits we can handle 897 01:06:51,470 --> 01:06:56,078 with counting sort. The notation will be: I split 898 01:06:56,078 --> 01:07:01,846 each integer into b/r digits, each r bits long. 899 01:07:01,846 --> 01:07:06,630 In other words, I think of my number as being 900 01:07:06,630 --> 01:07:11,086 in base 2^r. And I happen to be writing it 901 01:07:11,086 --> 01:07:15,869 down in binary, but I cluster together r bits 902 01:07:15,869 --> 01:07:20,108 and I get a bunch of digits in base 2^r. 903 01:07:20,108 --> 01:07:26,195 And then there are b/r digits. 
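Extracting one of these r-bit digits is cheap, just a shift and a mask. A quick sketch (the helper name is mine):

```python
def digit(x, i, r):
    """The i-th base-2^r digit of x: shift past the i*r low bits,
    then mask off r bits."""
    return (x >> (i * r)) & ((1 << r) - 1)

# 0b1011_0110 with r = 4: low digit 0b0110 = 6, high digit 0b1011 = 11
print(digit(0b10110110, 0, 4), digit(0b10110110, 1, 4))  # 6 11
```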
This b/r is the number of 904 01:07:26,195 --> 01:07:30,000 rounds. And this base -- 905 01:07:30,000 --> 01:07:34,104 This is the maximum value I have in one of these digits. 906 01:07:34,104 --> 01:07:37,537 It's between 0 and 2^r - 1. This is, in some sense, 907 01:07:37,537 --> 01:07:40,000 k for a run of counting sort. 908 01:07:49,000 --> 01:07:54,673 What is the running time? Well, I have b/r rounds. 909 01:07:54,673 --> 01:08:00,000 It's b/r times the running time for a round. 910 01:08:00,000 --> 01:08:05,830 Which I have n numbers and my value of k is 2^r. 911 01:08:05,830 --> 01:08:10,917 This is the running time of counting sort, 912 01:08:10,917 --> 01:08:18,236 n + k, this is the number of rounds, so this is (b/r)(n + 2^r). 913 01:08:18,236 --> 01:08:23,198 And I am free to choose r however I want. 914 01:08:23,198 --> 01:08:30,145 What I would like to do is minimize this running time over my 915 01:08:30,145 --> 01:08:35,703 choices of r. Any suggestions on how I might 916 01:08:35,703 --> 01:08:40,303 find the minimum running time over all choices of r? 917 01:08:40,303 --> 01:08:44,000 Techniques, not necessarily solutions. 918 01:08:53,000 --> 01:08:55,488 We're not used to this because it's asymptotic, 919 01:08:55,488 --> 01:08:58,288 but forget the big O here. How do I minimize a function 920 01:08:58,288 --> 01:09:01,336 with respect to one variable? Take the derivative, 921 01:09:01,336 --> 01:09:03,541 yeah. I can take the derivative of 922 01:09:03,541 --> 01:09:06,080 this function by r, differentiate by r, 923 01:09:06,080 --> 01:09:10,022 set the derivative equal to 0, and that should be a critical 924 01:09:10,022 --> 01:09:13,496 point in this function. It turns out this function is 925 01:09:13,496 --> 01:09:16,368 unimodal in r and you will find the minimum. 926 01:09:16,368 --> 01:09:19,510 We could do that. I'm not going to do it because 927 01:09:19,510 --> 01:09:23,385 it takes a little bit more work. You should try it at home. 
928 01:09:23,385 --> 01:09:27,059 It will give you the exact minimum, which is good if you 929 01:09:27,059 --> 01:09:32,283 know what this constant is. Differentiate with respect to r 930 01:09:32,283 --> 01:09:35,305 and set to 0. I am going to do it a little 931 01:09:35,305 --> 01:09:39,063 bit more intuitively, in other words less precisely, 932 01:09:39,063 --> 01:09:41,788 but I will still get the right answer. 933 01:09:41,788 --> 01:09:46,210 And definitely I will get an upper bound because I can choose 934 01:09:46,210 --> 01:09:50,115 r to be whatever I want. It turns out this will be the 935 01:09:50,115 --> 01:09:53,210 right answer. Let's just think about growth 936 01:09:53,210 --> 01:09:56,526 in terms of r. There are essentially two terms 937 01:09:56,526 --> 01:10:00,024 here. I have (b/r)n and I have 938 01:10:00,024 --> 01:10:03,315 (b/r)2^r. Now, (b/r)n would like r to be 939 01:10:03,315 --> 01:10:07,364 as big as possible. The bigger r is, the number of 940 01:10:07,364 --> 01:10:10,992 rounds goes down. This number in front of n, 941 01:10:10,992 --> 01:10:16,138 this coefficient in front of n goes down, so I would like r to 942 01:10:16,138 --> 01:10:18,669 be big. So, (b/r)n wants r big. 943 01:10:18,669 --> 01:10:23,478 However, r cannot be too big. This is saying I want digits 944 01:10:23,478 --> 01:10:28,540 that have a lot of bits in them. It cannot be too big because 945 01:10:28,540 --> 01:10:34,465 there's the 2^r term out here. If this happens to be bigger 946 01:10:34,465 --> 01:10:39,220 than n then this will dominate in terms of growth of r. 947 01:10:39,220 --> 01:10:43,182 This is going to be b times 2^r over r. 948 01:10:43,182 --> 01:10:46,264 2^r is much, much bigger than r, 949 01:10:46,264 --> 01:10:50,490 so it's going to grow much faster is what I mean. 950 01:10:50,490 --> 01:10:55,949 And so I really don't want r to be too big for this other term. 
951 01:10:55,949 --> 01:11:00,000 So, that is (b/r)2^r wants r small. 952 01:11:00,000 --> 01:11:06,684 Provided that this term is bigger or equal to this term 953 01:11:06,684 --> 01:11:11,758 then I can set r pretty big for that term. 954 01:11:11,758 --> 01:11:16,710 What I want is the n to dominate the 2^r. 955 01:11:16,710 --> 01:11:23,641 Provided I have that then I can set r as large as I want. 956 01:11:23,641 --> 01:11:30,697 Let's say I want to choose r to be maximum subject to this 957 01:11:30,697 --> 01:11:38,000 condition that n is greater than or equal to 2^r. 958 01:11:38,000 --> 01:11:42,291 This is an upper bound on 2^r, and an upper bound on r. 959 01:11:42,291 --> 01:11:44,899 In other words, I want r = lg n. 960 01:11:44,899 --> 01:11:49,948 This turns out to be the right answer up to constant factors. 961 01:11:49,948 --> 01:11:53,566 There we go. And definitely choosing r to be 962 01:11:53,566 --> 01:11:58,951 lg n will give me an upper bound on the best running time I could 963 01:11:58,951 --> 01:12:04,000 get because I can choose it to be whatever I want. 964 01:12:04,000 --> 01:12:10,564 If you differentiate you will indeed get the same answer. 965 01:12:10,564 --> 01:12:15,956 This was not quite a formal argument but close, 966 01:12:15,956 --> 01:12:21,699 because the big O is all about what grows fastest. 967 01:12:21,699 --> 01:12:26,036 If we plug in r = lg n we get bn/lg n. 968 01:12:26,036 --> 01:12:31,780 The n and the 2^r are equal, that's a factor of 2, 969 01:12:31,780 --> 01:12:38,704 2 times n, not a big deal. It comes out in the O. 970 01:12:38,704 --> 01:12:44,788 We have bn/lg n, where lg n is r. We have to think about what 971 01:12:44,788 --> 01:12:49,859 this means and translate it in terms of range. 972 01:12:49,859 --> 01:12:56,957 b was the number of bits in our number, which corresponds to the 973 01:12:56,957 --> 01:13:03,417 range of the number. 
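A quick numeric check of this choice, with an assumed word size and input size (n = 2^16 keys of b = 64 bits; both values are hypothetical), shows that r = lg n lands within a small constant factor of the true minimum of (b/r)(n + 2^r):

```python
import math

def cost(b, r, n):
    # b/r rounds of counting sort, each taking about n + 2^r steps
    return math.ceil(b / r) * (n + 2**r)

n, b = 1 << 16, 64
opt = min(cost(b, r, n) for r in range(1, b + 1))   # brute-force minimum
lg = int(math.log2(n))
print(cost(b, lg, n) / opt < 2)                     # True: within a factor of 2
```

The ceiling on b/r makes the exact minimizer jump around a little, which is why the claim is only "up to constant factors."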
I've got 20 minutes under so 974 01:13:03,417 --> 01:13:08,543 far in lecture so I can go 20 minutes over, 975 01:13:08,543 --> 01:13:11,228 right? No, I'm kidding. 976 01:13:11,228 --> 01:13:15,988 Almost done. Let's say that our numbers, 977 01:13:15,988 --> 01:13:21,724 our integers, are in the range, we have 0 to 2^b, 978 01:13:21,724 --> 01:13:26,606 I'm going to say that it's range 0 to n^d. 979 01:13:26,606 --> 01:13:33,449 This should be a -1 here. If I have numbers that are 980 01:13:33,449 --> 01:13:38,632 between 0 and n^d - 1 where d is a constant or d is some 981 01:13:38,632 --> 01:13:42,306 parameter, so this is a polynomial in n, 982 01:13:42,306 --> 01:13:45,604 then you work out this running time. 983 01:13:45,604 --> 01:13:49,844 It is order dn. This is the way to think about 984 01:13:49,844 --> 01:13:54,179 it because now we can compare to counting sort. 985 01:13:54,179 --> 01:13:59,644 Counting sort could handle 0 up to some constant times n in 986 01:13:59,644 --> 01:14:04,501 linear time. Now I can handle 0 up to n to 987 01:14:04,501 --> 01:14:07,434 some constant power in linear time. 988 01:14:07,434 --> 01:14:12,178 If d is order 1 then we get a linear time sorting 989 01:14:12,178 --> 01:14:15,543 algorithm. And that is cool as long as d 990 01:14:15,543 --> 01:14:19,511 is at most lg n. As long as your numbers are at 991 01:14:19,511 --> 01:14:24,255 most n^(lg n) then we have something that beats our n lg n 992 01:14:24,255 --> 01:14:29,000 sorting algorithms. And this is pretty nice. 993 01:14:29,000 --> 01:14:33,099 Whenever you know that your numbers are order lg n bits 994 01:14:33,099 --> 01:14:36,048 long we are happy, and you get some smooth 995 01:14:36,048 --> 01:14:37,990 tradeoff there. 
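This claim can be checked with a small sketch that sorts n numbers below n^d using d stable passes in base n; as before, Python's stable `sorted()` stands in for counting sort with k = n, and the names are mine.

```python
import random

def sort_polynomial_range(a, d):
    """Sort integers in [0, len(a)**d) with d stable base-n passes.
    Each pass would be O(n) with counting sort (k = n), so O(d*n) total."""
    n = len(a)
    for t in range(d):
        a = sorted(a, key=lambda x: (x // n**t) % n)
    return a

n, d = 100, 3
data = [random.randrange(n**d) for _ in range(n)]    # numbers up to n^3 - 1
print(sort_polynomial_range(data, d) == sorted(data))  # True
```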
For example, 996 01:14:37,990 --> 01:14:42,018 if we have our 32 bit numbers and we split into let's say 997 01:14:42,018 --> 01:14:46,262 eight bit chunks then we'll only have to do four rounds, each 998 01:14:46,262 --> 01:14:49,570 linear time, and we need just 256 entries of working space. 999 01:14:49,570 --> 01:14:52,735 We were doing four rounds for 32 bit numbers. 1000 01:14:52,735 --> 01:14:56,835 If you use an n lg n algorithm, you're going to be doing lg n 1001 01:14:56,835 --> 01:15:00,941 rounds through your numbers. n is like 2000, 1002 01:15:00,941 --> 01:15:03,515 and that's at least 11 rounds for example. 1003 01:15:03,515 --> 01:15:07,281 You would think this algorithm is going to be much faster for 1004 01:15:07,281 --> 01:15:09,038 small numbers. Unfortunately, 1005 01:15:09,038 --> 01:15:11,612 counting sort is not very good on a cache. 1006 01:15:11,612 --> 01:15:14,311 In practice, radix sort is not that fast an 1007 01:15:14,311 --> 01:15:17,199 algorithm unless your numbers are really small. 1008 01:15:17,199 --> 01:15:19,584 Something like quicksort can do better. 1009 01:15:19,584 --> 01:15:22,660 It's sort of a shame, but theoretically this is very 1010 01:15:22,660 --> 01:15:25,045 beautiful. And there are contexts where 1011 01:15:25,045 --> 01:15:29,000 this is really the right way to sort things. 1012 01:15:29,000 --> 01:15:34,352 I will mention finally that if you have arbitrary integers that 1013 01:15:34,352 --> 01:15:39,100 are one word length long. Here we're assuming that there 1014 01:15:39,100 --> 01:15:44,280 are b bits in a word and there is some indirect dependence on b 1015 01:15:44,280 --> 01:15:46,093 here. 
But, in general, 1016 01:15:46,093 --> 01:15:51,100 if you have a bunch of integers and they're one word length 1017 01:15:51,100 --> 01:15:55,589 long, and you can manipulate a word in constant time, 1018 01:15:55,589 --> 01:16:00,597 then the best algorithm we know for sorting runs in n times 1019 01:16:00,597 --> 01:16:05,000 square root of lg lg n time expected. 1020 01:16:05,000 --> 01:16:08,719 It is a randomized algorithm. We're not going to cover that 1021 01:16:08,719 --> 01:16:11,798 algorithm in this class. It's rather complicated. 1022 01:16:11,798 --> 01:16:15,068 I didn't even cover it in Advanced Algorithms when I 1023 01:16:15,068 --> 01:16:17,570 taught it. If you want something easier, 1024 01:16:17,570 --> 01:16:21,289 you can get n times square root of lg lg n time worst-case. 1025 01:16:21,289 --> 01:16:23,406 And that paper is almost readable. 1026 01:16:23,406 --> 01:16:26,035 I have taught that in Advanced Algorithms. 1027 01:16:26,035 --> 01:16:28,729 If you're interested in this kind of stuff, 1028 01:16:28,729 --> 01:16:32,000 take Advanced Algorithms next fall. 1029 01:16:32,000 --> 01:16:34,552 It's one of the follow-ons to this class. 1030 01:16:34,552 --> 01:16:38,317 These are much more complicated algorithms, but it gives you 1031 01:16:38,317 --> 01:16:40,870 some sense. You can even break out of the 1032 01:16:40,870 --> 01:16:43,742 dependence on b, as long as you know that b is 1033 01:16:43,742 --> 01:16:46,486 at most a word. And I will stop there unless 1034 01:16:46,486 --> 01:16:49,000 there are any questions. Then see you Wednesday.