1 00:00:07,000 --> 00:00:11,000 So, the topic today is dynamic programming. 2 00:00:21,000 --> 00:00:25,000 The term programming in the name of this term doesn't refer 3 00:00:25,000 --> 00:00:30,000 to computer programming. OK, programming is an old word 4 00:00:30,000 --> 00:00:35,000 that means any tabular method for accomplishing something. 5 00:00:35,000 --> 00:00:39,000 So, you'll hear about linear programming and dynamic 6 00:00:39,000 --> 00:00:42,000 programming. Either of those, 7 00:00:42,000 --> 00:00:47,000 even though we now incorporate those algorithms in computer 8 00:00:47,000 --> 00:00:52,000 programs, originally computer programming, you were given a 9 00:00:52,000 --> 00:00:57,000 datasheet and you put one line per line of code as a tabular 10 00:00:57,000 --> 00:01:04,000 method for giving the machine instructions as to what to do. 11 00:01:04,000 --> 00:01:07,000 OK, so the term programming is older. 12 00:01:07,000 --> 00:01:11,000 Of course, and now conventionally when you see 13 00:01:11,000 --> 00:01:15,000 programming, you mean software, computer programming. 14 00:01:15,000 --> 00:01:18,000 But that wasn't always the case. 15 00:01:18,000 --> 00:01:22,000 And these terms continue in the literature. 16 00:01:22,000 --> 00:01:26,000 So, dynamic programming is a design technique like other 17 00:01:26,000 --> 00:01:33,000 design techniques we've seen such as divide and conquer. 18 00:01:33,000 --> 00:01:40,000 OK, so it's a way of solving a class of problems rather than a 19 00:01:40,000 --> 00:01:43,000 particular algorithm or something. 20 00:01:43,000 --> 00:01:50,000 So, we're going to work through this for the example of the 21 00:01:50,000 --> 00:01:55,000 so-called longest common subsequence problem, 22 00:01:55,000 --> 00:02:00,000 sometimes called LCS, OK, which is a problem that 23 00:02:00,000 --> 00:02:06,000 comes up in a variety of contexts. 
24 00:02:06,000 --> 00:02:10,000 And it's particularly important in computational biology, 25 00:02:10,000 --> 00:02:14,000 where you have long DNA strands, and you're trying to 26 00:02:14,000 --> 00:02:19,000 find commonalities between two strings, OK, one which may be a 27 00:02:19,000 --> 00:02:23,000 genome, and one may be various, when people do, 28 00:02:23,000 --> 00:02:28,000 what is that thing called when they do the evolutionary 29 00:02:28,000 --> 00:02:31,000 comparisons? The evolutionary trees, 30 00:02:31,000 --> 00:02:33,000 yeah, right, yeah, exactly, 31 00:02:33,000 --> 00:02:35,000 phylogenetic trees, there you go, 32 00:02:35,000 --> 00:02:44,000 OK, phylogenetic trees. Good, so here's the problem. 33 00:02:44,000 --> 00:02:54,000 So, you're given two sequences, x going from one to m, 34 00:02:54,000 --> 00:03:04,000 and y running from one to n. You want to find a longest 35 00:03:04,000 --> 00:03:12,000 subsequence common to both. OK, and here I say a, 36 00:03:12,000 --> 00:03:19,000 not the, although it's common to talk about the longest common 37 00:03:19,000 --> 00:03:24,000 subsequence. Usually the longest common 38 00:03:24,000 --> 00:03:29,000 subsequence isn't unique. There could be several 39 00:03:29,000 --> 00:03:35,000 different subsequences that tie for that. 40 00:03:35,000 --> 00:03:41,000 However, people tend to, it's one of the sloppinesses 41 00:03:41,000 --> 00:03:45,000 that people will say. I will try to say a, 42 00:03:45,000 --> 00:03:51,000 unless it's unique. But I may slip as well because 43 00:03:51,000 --> 00:03:57,000 it's just such a common thing to just talk about the, 44 00:03:57,000 --> 00:04:02,000 even though there might be multiple. 45 00:04:02,000 --> 00:04:07,000 So, here's an example. Suppose x is this sequence, 46 00:04:07,000 --> 00:04:14,000 and y is this sequence. So, what is a longest common 47 00:04:14,000 --> 00:04:16,000 subsequence of those two sequences? 
48 00:04:16,000 --> 00:04:19,000 See if you can just eyeball it. 49 00:04:35,000 --> 00:04:45,000 AB: length two? Anybody have one longer? 50 00:04:45,000 --> 00:04:51,000 Excuse me? BDB, BDB. 51 00:04:51,000 --> 00:05:02,000 BDAB, BDAB, BDAB, anything longer? 52 00:05:02,000 --> 00:05:09,000 So, BDAB: that's the longest one. 53 00:05:09,000 --> 00:05:20,000 Is there another one that's the same length? 54 00:05:20,000 --> 00:05:35,000 Is there another one that ties? BCAB, BCAB, another one? 55 00:05:35,000 --> 00:05:40,000 BCBA, yeah, there are a bunch of them all of length four. 56 00:05:40,000 --> 00:05:45,000 There isn't one of length five. OK, we are actually going to 57 00:05:45,000 --> 00:05:49,000 come up with an algorithm that, if it's correct, 58 00:05:49,000 --> 00:05:54,000 we're going to show it's correct, guarantees that there 59 00:05:54,000 --> 00:05:58,000 isn't one of length five. So all those are, 60 00:05:58,000 --> 00:06:03,000 we can say, any one of these is the longest common subsequence 61 00:06:03,000 --> 00:06:06,000 of x and y. We tend to use it this way 62 00:06:06,000 --> 00:06:11,000 using functional notation, but it's not a function; it's 63 00:06:11,000 --> 00:06:17,000 really a relation. So, we'll say something is an 64 00:06:17,000 --> 00:06:20,000 LCS when really we only mean it's an element, 65 00:06:20,000 --> 00:06:23,000 if you will, of the set of longest common 66 00:06:23,000 --> 00:06:26,000 subsequences. Once again, it's classic 67 00:06:26,000 --> 00:06:29,000 abuse of notation. As long as we know what we 68 00:06:29,000 --> 00:06:35,000 mean, it's OK to abuse notation. What we can't do is misuse it. 69 00:06:35,000 --> 00:06:40,000 But abuse, yeah! Make it so it's easy to deal 70 00:06:40,000 --> 00:06:43,000 with. But you have to know what's 71 00:06:43,000 --> 00:06:47,000 going on underneath. 
OK, so let's see, 72 00:06:47,000 --> 00:06:53,000 so there's a fairly simple brute force algorithm for 73 00:06:53,000 --> 00:06:59,000 solving this problem. And that is, 74 00:06:59,000 --> 00:07:10,000 let's just check every, maybe some of you did this in 75 00:07:10,000 --> 00:07:22,000 your heads, subsequence of x from one to m to see if it's 76 00:07:22,000 --> 00:07:31,000 also a subsequence of y of one to n. 77 00:07:31,000 --> 00:07:36,000 So, just take every subsequence that you can get here, 78 00:07:36,000 --> 00:07:40,000 check it to see if it's in there. 79 00:07:40,000 --> 00:07:43,000 So let's analyze that. 80 00:07:52,000 --> 00:07:58,000 So, to check, so if I give you a subsequence 81 00:07:58,000 --> 00:08:05,000 of x, how long does it take you to check whether it is, 82 00:08:05,000 --> 00:08:14,000 in fact, a subsequence of y? So, I give you something like 83 00:08:14,000 --> 00:08:18,000 BCAB. How long does it take me to 84 00:08:18,000 --> 00:08:24,000 check to see if it's a subsequence of y? 85 00:08:24,000 --> 00:08:28,000 Length of y, which is order n. 86 00:08:28,000 --> 00:08:34,000 And how do you do it? Yeah, you just scan. 87 00:08:34,000 --> 00:08:39,000 So as you hit the first character that matches, 88 00:08:39,000 --> 00:08:41,000 great. Now, if you will, 89 00:08:41,000 --> 00:08:46,000 recursively see whether the suffix of your string matches 90 00:08:46,000 --> 00:08:50,000 the suffix of y. OK, and so, you are just simply 91 00:08:50,000 --> 00:08:54,000 walking down the tree to see if it matches. 92 00:08:54,000 --> 00:08:59,000 You're walking down the string to see if it matches. 93 00:08:59,000 --> 00:09:04,000 OK, then the second thing is, then how many subsequences of x 94 00:09:04,000 --> 00:09:08,000 are there? Two to the n? 95 00:09:08,000 --> 00:09:15,000 x just goes from one to m, two to the m subsequences of x, 96 00:09:15,000 --> 00:09:20,000 OK, two to the m. 
Two to the m subsequences of x, 97 00:09:20,000 --> 00:09:25,000 OK, one way to see that, you say, well, 98 00:09:25,000 --> 00:09:32,000 how many subsequences are there of something there? 99 00:09:32,000 --> 00:09:35,000 If I consider a bit vector of length m, OK, 100 00:09:35,000 --> 00:09:39,000 that's one or zero, just every position where 101 00:09:39,000 --> 00:09:42,000 there's a one, I take out, that identifies an 102 00:09:42,000 --> 00:09:45,000 element that I'm going to take out. 103 00:09:45,000 --> 00:09:50,000 OK, then that gives me a mapping from each subsequence of 104 00:09:50,000 --> 00:09:55,000 x, from each bit vector to a different subsequence of x. 105 00:09:55,000 --> 00:09:58,000 Now, of course, you could have matching 106 00:09:58,000 --> 00:10:01,000 characters there, that in the worst case, 107 00:10:01,000 --> 00:10:06,000 all of the characters are different. 108 00:10:06,000 --> 00:10:14,000 OK, and so every one of those will be a unique subsequence. 109 00:10:14,000 --> 00:10:22,000 So, each bit vector of length m corresponds to a subsequence. 110 00:10:22,000 --> 00:10:29,000 That's a generally good trick to know. 111 00:10:29,000 --> 00:10:38,000 So, the worst-case running time of this method is order n times 112 00:10:38,000 --> 00:10:43,000 two to the m, which is, since m is in the 113 00:10:43,000 --> 00:10:52,000 exponent, is exponential time. And there's a technical term 114 00:10:52,000 --> 00:10:59,000 that we use when something is exponential time. 115 00:10:59,000 --> 00:11:03,000 Slow: good. OK, very good. 116 00:11:03,000 --> 00:11:06,000 OK, slow, OK, so this is really bad. 117 00:11:06,000 --> 00:11:12,000 This is taking a long time to crank out how long the longest 118 00:11:12,000 --> 00:11:17,000 common subsequence is because there's so many subsequences. 
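[Editor's sketch] The brute-force method just analyzed can be written down as follows. This is a minimal illustration, not the lecturer's code: the function names are hypothetical, and the two test strings are an assumption, since the sequences on the board are not captured in the transcript. The pair used here is a familiar textbook example that is consistent with the answers called out in class (BDAB, BCAB, BCBA, all of length four).

```python
def is_subsequence(s, y):
    """Check whether s is a subsequence of y with one left-to-right scan of y."""
    it = iter(y)
    # `ch in it` advances the iterator until it finds ch, so order is respected.
    return all(ch in it for ch in s)


def lcs_brute_force(x, y):
    """Try all 2^m bit vectors over x; keep a longest one that also fits in y."""
    m = len(x)
    best = ""
    for mask in range(1 << m):
        # Bit i of mask set means "keep x[i]" in the candidate subsequence.
        candidate = "".join(x[i] for i in range(m) if (mask >> i) & 1)
        if len(candidate) > len(best) and is_subsequence(candidate, y):
            best = candidate
    return best


if __name__ == "__main__":
    # Assumed example strings (not taken from the transcript).
    print(len(lcs_brute_force("ABCBDAB", "BDCABA")))
```

Each of the 2^m candidates costs one scan of y, which matches the order n times two to the m bound stated above.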
119 00:11:17,000 --> 00:11:23,000 OK, so we're going to now go through a process of developing 120 00:11:23,000 --> 00:11:27,000 a far more efficient algorithm for this problem. 121 00:11:27,000 --> 00:11:34,000 OK, and we're actually going to go through several stages. 122 00:11:34,000 --> 00:11:42,000 The first one is to go through a simplification stage. 123 00:11:42,000 --> 00:11:52,000 OK, and what we're going to do is look at simply the length of 124 00:11:52,000 --> 00:11:59,000 the longest common subsequence of x and y. 125 00:11:59,000 --> 00:12:03,000 And then what we'll do is extend the algorithm to find the 126 00:12:03,000 --> 00:12:06,000 longest common subsequence itself. 127 00:12:06,000 --> 00:12:10,000 OK, so we're going to look at the length. 128 00:12:10,000 --> 00:12:13,000 So, simplify the problem, if you will, 129 00:12:13,000 --> 00:12:16,000 to just try to compute the length. 130 00:12:16,000 --> 00:12:19,000 What's nice is the length is unique. 131 00:12:19,000 --> 00:12:23,000 OK, there's only going to be one length that's going to be 132 00:12:23,000 --> 00:12:27,000 the longest. OK, and what we'll do is just 133 00:12:27,000 --> 00:12:31,000 focus on the problem of computing the length. 134 00:12:31,000 --> 00:12:36,000 And then what we'll do is back up from that and figure out 135 00:12:36,000 --> 00:12:43,000 what actually is the subsequence that realizes that length. 136 00:12:43,000 --> 00:12:46,000 OK, and that will be a big simplification because we don't 137 00:12:46,000 --> 00:12:50,000 have to keep track of a lot of different possibilities at every 138 00:12:50,000 --> 00:12:52,000 stage. We just have to keep track of 139 00:12:52,000 --> 00:12:54,000 the one number, which is the length. 140 00:12:54,000 --> 00:12:57,000 So, it sort of reduces it to a numerical problem. 141 00:12:57,000 --> 00:13:00,000 We'll adopt the following notation. 
142 00:13:00,000 --> 00:13:04,000 It's pretty standard notation, but I just want, 143 00:13:04,000 --> 00:13:09,000 if I put absolute values around the string or a sequence, 144 00:13:09,000 --> 00:13:13,000 it denotes the length of the sequence, S. 145 00:13:13,000 --> 00:13:19,000 OK, so that's the first thing. The second thing we're going to 146 00:13:19,000 --> 00:13:22,000 do is, actually, we're going to, 147 00:13:22,000 --> 00:13:28,000 which takes a lot more insight when you come up with a problem 148 00:13:28,000 --> 00:13:33,000 like this, and in some sense, 149 00:13:33,000 --> 00:13:39,000 ends up being the hardest part of designing a good dynamic 150 00:13:39,000 --> 00:13:47,000 programming algorithm for any problem, which is we're going to 151 00:13:47,000 --> 00:13:53,000 actually look not at all subsequences of x and y, 152 00:13:53,000 --> 00:13:56,000 but just prefixes. 153 00:14:06,000 --> 00:14:13,000 OK, we're just going to look at prefixes and we're going to show 154 00:14:13,000 --> 00:14:20,000 how we can express the length of the longest common subsequence 155 00:14:20,000 --> 00:14:24,000 of prefixes in terms of each other. 156 00:14:24,000 --> 00:14:28,000 In particular, we're going to define c of ij 157 00:14:28,000 --> 00:14:34,000 to be the length of the longest common subsequence 158 00:14:34,000 --> 00:14:41,000 of the prefix of x going from one to i, and y going from one 159 00:14:41,000 --> 00:14:48,000 to j. And what we are going to do is 160 00:14:48,000 --> 00:14:56,000 we're going to calculate c[i,j] for all ij. 161 00:14:56,000 --> 00:15:04,000 And if we do that, how then do we solve the 162 00:15:04,000 --> 00:15:15,000 problem of the longest common subsequence of x and y? 163 00:15:15,000 --> 00:15:19,000 How do we solve the longest common subsequence? 164 00:15:19,000 --> 00:15:23,000 Suppose we've solved this for all i and j. 
165 00:15:23,000 --> 00:15:29,000 How then do we compute the length of the longest common 166 00:15:29,000 --> 00:15:33,000 subsequence of x and y? Yeah, c[m,n], 167 00:15:33,000 --> 00:15:37,000 that's all, OK? So then, c of m, 168 00:15:37,000 --> 00:15:44,000 n is just equal to the length of the longest common subsequence of x and y, 169 00:15:44,000 --> 00:15:50,000 because if I go from one to n, I'm done, OK? 170 00:15:50,000 --> 00:15:56,000 And so, it's going to turn out that what we want to do is 171 00:15:56,000 --> 00:16:02,000 figure out how to express c[m,n], in general, 172 00:16:02,000 --> 00:16:08,000 c[i,j], in terms of other c[i,j]. 173 00:16:08,000 --> 00:16:18,000 So, let's see how we do that. OK, so our theorem is going to 174 00:16:18,000 --> 00:16:23,000 say that c[i,j] is just -- 175 00:17:05,000 --> 00:17:10,000 OK, it says that if the i'th character matches the j'th 176 00:17:10,000 --> 00:17:17,000 character, the i'th character of x matches the j'th character 177 00:17:17,000 --> 00:17:23,000 of y, then c of ij is just c of i minus one, j minus one plus 178 00:17:23,000 --> 00:17:26,000 one. And if they don't match, 179 00:17:26,000 --> 00:17:31,000 then it's going to be the larger of c[i, 180 00:17:31,000 --> 00:17:35,000 j-1], and c[i-1, j], OK? 181 00:17:35,000 --> 00:17:38,000 So that's what we're going to prove. 182 00:17:38,000 --> 00:17:44,000 And that's going to give us a way of relating the calculation 183 00:17:44,000 --> 00:17:49,000 of a given c[i,j] to values that are strictly smaller, 184 00:17:49,000 --> 00:17:56,000 OK, that is at least one of the arguments is smaller of the two 185 00:17:56,000 --> 00:18:00,000 arguments. OK, and that's going to give us 186 00:18:00,000 --> 00:18:05,000 a way of being able, then, to understand how to 187 00:18:05,000 --> 00:18:11,000 calculate c[i,j]. So, let's prove this theorem. 188 00:18:11,000 --> 00:18:18,000 So, we'll start with the case x[i] equals y of j. 
189 00:18:18,000 --> 00:18:22,000 And so, let's draw a picture here. 190 00:18:22,000 --> 00:18:26,000 So, we have x here. 191 00:18:50,000 --> 00:18:52,000 And here is y. 192 00:19:13,000 --> 00:19:19,000 OK, so here's my sequence, x, which I'm sort of drawing as 193 00:19:19,000 --> 00:19:25,000 this elongated box, sequence y, and I'm saying that 194 00:19:25,000 --> 00:19:30,000 x[i] and y[j], those are equal. 195 00:19:38,000 --> 00:19:46,000 OK, so let's see what that means. 196 00:19:46,000 --> 00:20:01,000 OK, so let's let z of one to k be, in fact, the longest common 197 00:20:01,000 --> 00:20:12,000 subsequence of x of one to i, y of one to j, 198 00:20:12,000 --> 00:20:23,000 where c of ij is equal to k. OK, so the longest common 199 00:20:23,000 --> 00:20:29,000 subsequence of x of one to i and y of one to j has some 200 00:20:29,000 --> 00:20:32,000 length. Let's call it k. 201 00:20:32,000 --> 00:20:39,000 And so, let's say that we have some sequence which realizes 202 00:20:39,000 --> 00:20:42,000 that. OK, we'll call it z. 203 00:20:42,000 --> 00:20:48,000 OK, so then, can somebody tell me what z of 204 00:20:48,000 --> 00:20:50,000 k is? 205 00:21:04,000 --> 00:21:05,000 What is z of k here? 206 00:21:14,000 --> 00:21:18,000 Yeah, it's actually equal to x of i, which is also equal to y 207 00:21:18,000 --> 00:21:19,000 of j. Why is that? 208 00:21:19,000 --> 00:21:23,000 Why couldn't it be some other value? 209 00:21:41,000 --> 00:21:43,000 Yeah, so you got the right idea. 210 00:21:43,000 --> 00:21:46,000 So, the idea is, suppose that the sequence 211 00:21:46,000 --> 00:21:50,000 didn't include this element here as the last element, 212 00:21:50,000 --> 00:21:55,000 the longest common subsequence. OK, so then it includes a bunch 213 00:21:55,000 --> 00:21:59,000 of values in here, and a bunch of values in here, 214 00:21:59,000 --> 00:22:03,000 same values. It doesn't include this or 215 00:22:03,000 --> 00:22:07,000 this. 
Well, then I could just tack on 216 00:22:07,000 --> 00:22:13,000 this extra character and make it be longer, make it k plus one 217 00:22:13,000 --> 00:22:18,000 because these two match. OK, so if the sequence ended 218 00:22:18,000 --> 00:22:20,000 before -- 219 00:22:34,000 --> 00:22:40,000 -- just extend it by tacking on x[i]. 220 00:22:40,000 --> 00:22:48,000 OK, it would be fairly simple to just tack on x[i]. 221 00:22:48,000 --> 00:22:58,000 OK, so if that's the case, then if I look at z going one 222 00:22:58,000 --> 00:23:05,000 up to k minus one, that's certainly a common 223 00:23:05,000 --> 00:23:14,000 sequence of x of 1 up to, excuse me, of up to i minus 224 00:23:14,000 --> 00:23:20,000 one. And, y of one up to j minus 225 00:23:20,000 --> 00:23:26,000 one, OK, because this is a longest common sequence. 226 00:23:26,000 --> 00:23:33,000 z is a longest common sequence of x of one to i, 227 00:23:33,000 --> 00:23:38,000 y of one to j. And, we know what the last 228 00:23:38,000 --> 00:23:41,000 character is. It's just x[i], 229 00:23:41,000 --> 00:23:43,000 or equivalently, y[j]. 230 00:23:43,000 --> 00:23:47,000 So therefore, everything except the last 231 00:23:47,000 --> 00:23:53,000 character must at least be a common sequence of x of one to i 232 00:23:53,000 --> 00:23:57,000 minus one, y of one to j minus one. 233 00:23:57,000 --> 00:24:04,000 Everybody with me? It must be a common sequence. 234 00:24:04,000 --> 00:24:12,000 OK, now, what do you also suspect? What do you also suspect about 235 00:24:12,000 --> 00:24:18,000 z of one to k minus one? It's a common sequence of these 236 00:24:18,000 --> 00:24:19,000 two. Yeah? 237 00:24:19,000 --> 00:24:26,000 Yeah, it's a longest common sequence. 238 00:24:26,000 --> 00:24:34,000 So that's what we claim, z of one up to k minus one is 239 00:24:34,000 --> 00:24:42,000 in fact a longest common subsequence of x of one to i 240 00:24:42,000 --> 00:24:48,000 minus one, and y of one to j minus one, OK? 
241 00:24:48,000 --> 00:24:57,000 So, let's prove that claim. So, we'll just have a little 242 00:24:57,000 --> 00:25:09,000 diversion to prove the claim. OK, so suppose that w is a 243 00:25:09,000 --> 00:25:21,000 longer common sequence, that is, that the length 244 00:25:21,000 --> 00:25:30,000 of w is bigger than k minus one. 245 00:25:30,000 --> 00:25:35,000 OK, so suppose we have a longer common sequence than z of one 246 00:25:35,000 --> 00:25:38,000 to k minus one. So, it's got to have length 247 00:25:38,000 --> 00:25:42,000 that's bigger than k minus one if it's longer. 248 00:25:42,000 --> 00:25:47,000 OK, and now what we do is we use a classic argument you're 249 00:25:47,000 --> 00:25:51,000 going to see multiple times, not just this week, 250 00:25:51,000 --> 00:25:56,000 which it will be important for this week, but through several 251 00:25:56,000 --> 00:25:59,000 lectures. Hence, it's called a cut and 252 00:25:59,000 --> 00:26:06,000 paste argument. So, the idea is let's take a 253 00:26:06,000 --> 00:26:15,000 look at w, concatenate it with that last character, 254 00:26:15,000 --> 00:26:19,000 z of k. So, this is string, 255 00:26:19,000 --> 00:26:27,000 OK, so that's just my notation for string 256 00:26:27,000 --> 00:26:36,000 concatenation. OK, so I take whatever I 257 00:26:36,000 --> 00:26:48,000 claimed was a longer common subsequence, and I concatenate z 258 00:26:48,000 --> 00:26:56,000 of k to it. OK, so that is certainly a 259 00:26:56,000 --> 00:27:11,000 common sequence of x of one to i, and y of one to j. 260 00:27:11,000 --> 00:27:18,000 And it has length bigger than k because it's basically, 261 00:27:18,000 --> 00:27:24,000 what is its length? The length of w is bigger than 262 00:27:24,000 --> 00:27:28,000 k minus one. I add one character. 263 00:27:28,000 --> 00:27:37,000 So, this combination here, now, has length bigger than k. 
264 00:27:37,000 --> 00:27:43,000 OK, and that's a contradiction, thereby proving the claim. 265 00:27:43,000 --> 00:27:47,000 So, I'm simply saying, I claim this. 266 00:27:47,000 --> 00:27:52,000 Suppose you have a longer one. Well, let me show, 267 00:27:52,000 --> 00:27:58,000 if I had a longer common sequence for the prefixes where 268 00:27:58,000 --> 00:28:05,000 we dropped the character from both strings, if it was longer 269 00:28:05,000 --> 00:28:12,000 there, then we would have made the whole thing longer. 270 00:28:12,000 --> 00:28:16,000 So that can't be. So, therefore, 271 00:28:16,000 --> 00:28:22,000 this must be a longest common subsequence, OK? 272 00:28:22,000 --> 00:28:27,000 Questions? Because you are going to need 273 00:28:27,000 --> 00:28:33,000 to be able to do this kind of proof ad nauseam, 274 00:28:33,000 --> 00:28:39,000 almost. So, if there are any questions, 275 00:28:39,000 --> 00:28:42,000 let them at me, people. 276 00:28:42,000 --> 00:28:47,000 OK, so now what we have established is that z one 277 00:28:47,000 --> 00:28:55,000 through k minus one is a longest common subsequence of the two prefixes 278 00:28:55,000 --> 00:29:05,000 when we drop the last character. So, thus, we have c of i minus 279 00:29:05,000 --> 00:29:11,000 one, j minus one is equal to what? 280 00:29:11,000 --> 00:29:19,000 What's c of i minus one, j minus one? 281 00:29:31,000 --> 00:29:33,000 k minus one; thank you. 282 00:29:33,000 --> 00:29:40,000 Let's move on with the class, right, OK, which implies that c 283 00:29:40,000 --> 00:29:47,000 of ij is just equal to c of i minus one, j minus one plus one. 284 00:29:47,000 --> 00:29:54,000 So, it's fairly straightforward if you think about what's going 285 00:29:54,000 --> 00:29:57,000 on there. It's not always as 286 00:29:57,000 --> 00:30:04,000 straightforward in some problems as it is for longest common 287 00:30:04,000 --> 00:30:08,000 subsequence. 
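[Editor's note] Written out, the theorem whose first case was just proved reads as follows. The board formula is not captured in the transcript, so this is a reconstruction from the spoken statement; the zero base cases in the last line are an assumption, since the lecture sets base cases aside.

```latex
c[i,j] =
\begin{cases}
  c[i-1,\,j-1] + 1                      & \text{if } x[i] = y[j], \\
  \max\bigl(c[i-1,\,j],\; c[i,\,j-1]\bigr) & \text{otherwise,}
\end{cases}
\qquad
c[i,0] = c[0,j] = 0 \quad \text{(assumed base cases).}
```

Here $c[i,j] = |\mathrm{LCS}(x[1..i],\, y[1..j])|$, so the quantity of interest is $c[m,n]$.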
The idea is, 288 00:30:08,000 --> 00:30:13,000 so I'm not going to go through the other cases. 289 00:30:13,000 --> 00:30:16,000 They are similar. But, in fact, 290 00:30:16,000 --> 00:30:21,000 we've hit on one of the two hallmarks of dynamic 291 00:30:21,000 --> 00:30:24,000 programming. So, by hallmarks, 292 00:30:24,000 --> 00:30:30,000 I mean when you see this kind of structure in a problem, 293 00:30:30,000 --> 00:30:36,000 there's a good chance that dynamic programming is going to 294 00:30:36,000 --> 00:30:41,000 work as a strategy. The dynamic programming 295 00:30:41,000 --> 00:30:44,000 hallmark is the following. 296 00:30:55,000 --> 00:31:02,000 This is number one. And that is the property of 297 00:31:02,000 --> 00:31:09,000 optimal substructure. OK, what that says is an 298 00:31:09,000 --> 00:31:16,000 optimal solution to a problem, and by this, 299 00:31:16,000 --> 00:31:21,000 we really mean problem instance. 300 00:31:21,000 --> 00:31:31,000 But it's tedious to keep saying problem instance. 301 00:31:31,000 --> 00:31:35,000 A problem is generally, in computer science, 302 00:31:35,000 --> 00:31:42,000 viewed as having an infinite number of instances typically, 303 00:31:42,000 --> 00:31:48,000 OK, so sorting is a problem. A sorting instance is a 304 00:31:48,000 --> 00:31:53,000 particular input. OK, so we're really talking 305 00:31:53,000 --> 00:31:59,000 about problem instances, but I'm just going to say 306 00:31:59,000 --> 00:32:04,000 problem, OK? So, when you have an optimal 307 00:32:04,000 --> 00:32:09,000 solution to a problem, contains optimal solutions to 308 00:32:09,000 --> 00:32:17,000 subproblems. OK, and that's worth drawing a 309 00:32:17,000 --> 00:32:22,000 box around because it's so important. 
310 00:32:22,000 --> 00:32:25,000 OK, so here, for example, 311 00:32:25,000 --> 00:32:33,000 if z is a longest common subsequence of x and y, 312 00:32:33,000 --> 00:32:55,000 OK, then any prefix of z is a longest common subsequence of a 313 00:32:55,000 --> 00:33:09,000 prefix of x, and a prefix of y, OK? 314 00:33:09,000 --> 00:33:12,000 So, this is basically what it says. 315 00:33:12,000 --> 00:33:16,000 I look at the problem, and I can see that there is 316 00:33:16,000 --> 00:33:21,000 optimal substructure going on. OK, in this case, 317 00:33:21,000 --> 00:33:26,000 and the idea is that almost always, it means that there's a 318 00:33:26,000 --> 00:33:32,000 cut and paste argument you could do to demonstrate that, 319 00:33:32,000 --> 00:33:36,000 OK, that if the substructure were not optimal, 320 00:33:36,000 --> 00:33:41,000 then you'd be able to find a better solution to the overall 321 00:33:41,000 --> 00:33:49,000 problem using cut and paste. OK, so this theorem, 322 00:33:49,000 --> 00:33:57,000 now, gives us a strategy for being able to compute longest 323 00:33:57,000 --> 00:34:01,000 common subsequence. 324 00:34:24,000 --> 00:34:29,000 Here's the code; oh wait. 325 00:34:38,000 --> 00:34:41,000 OK, so I'm going to ignore base cases in this, 326 00:34:41,000 --> 00:34:42,000 if -- 327 00:35:44,000 --> 00:35:54,000 And we will return the value of the longest common subsequence. 328 00:35:54,000 --> 00:36:02,000 It's basically just implementing this theorem. 329 00:36:02,000 --> 00:36:06,000 OK, so it's either the longest common subsequence if they 330 00:36:06,000 --> 00:36:09,000 match. It's the longest common 331 00:36:09,000 --> 00:36:14,000 subsequence of one of the prefixes where you drop that 332 00:36:14,000 --> 00:36:18,000 character for both strings and add one because that's the 333 00:36:18,000 --> 00:36:22,000 matching one. 
Or, you drop a character from 334 00:36:22,000 --> 00:36:26,000 x, and it's the longest common subsequence of that. 335 00:36:26,000 --> 00:36:31,000 Or you drop a character from y, whichever one of those is 336 00:36:31,000 --> 00:36:34,000 longer. That ends up being the longest 337 00:36:34,000 --> 00:36:43,000 common subsequence. OK, so what's the worst case 338 00:36:43,000 --> 00:36:52,000 for this program? What's going to happen in the 339 00:36:52,000 --> 00:37:00,000 worst case? Which of these two clauses is 340 00:37:00,000 --> 00:37:09,000 going to cause us more headache? The second clause: 341 00:37:09,000 --> 00:37:12,000 why the second clause? Yeah, you're doing two LCS 342 00:37:12,000 --> 00:37:16,000 sub-calculations here. Here, you're only doing one. 343 00:37:16,000 --> 00:37:19,000 Not only that, but you get to decrement both 344 00:37:19,000 --> 00:37:22,000 indices, whereas here you've basically got to, 345 00:37:22,000 --> 00:37:26,000 you only get to decrement one index, and you've got to 346 00:37:26,000 --> 00:37:29,000 calculate two of them. So that's going to generate the 347 00:37:29,000 --> 00:37:34,000 tree. So, in the worst case, 348 00:37:34,000 --> 00:37:42,000 x of i is not equal to y of j for all i and j. 349 00:37:42,000 --> 00:37:52,000 So, let's draw a recursion tree for this program to sort of get 350 00:37:52,000 --> 00:38:02,000 an understanding as to what is going on to help us. 351 00:38:02,000 --> 00:38:06,000 And, I'm going to do it with m equals seven, 352 00:38:06,000 --> 00:38:12,000 and n equals six. OK, so we start up the top with 353 00:38:12,000 --> 00:38:16,000 my two indices being seven and six. 354 00:38:16,000 --> 00:38:22,000 And then, in the worst case, we had to execute these. 355 00:38:22,000 --> 00:38:27,000 So, this is going to end up being six, six, 356 00:38:27,000 --> 00:38:34,000 and seven, five for indices after the first call. 
357 00:38:34,000 --> 00:38:37,000 And then, this guy is going to split. 358 00:38:37,000 --> 00:38:44,000 And he's going to produce five, six here, decrement the first 359 00:38:44,000 --> 00:38:48,000 index, i. And then, if I keep going down 360 00:38:48,000 --> 00:38:52,000 here, we're going to get four, six and five, 361 00:38:52,000 --> 00:38:56,000 five. And these guys keep extending 362 00:38:56,000 --> 00:38:58,000 here. I get six five, 363 00:38:58,000 --> 00:39:02,000 five five, six four, OK? 364 00:39:02,000 --> 00:39:08,000 Over here, I'm going to decrement the first index, 365 00:39:08,000 --> 00:39:15,000 six five, and I get five five, six four, and these guys keep 366 00:39:15,000 --> 00:39:17,000 going down. And over here, 367 00:39:17,000 --> 00:39:22,000 I get seven four. And then we get six four, 368 00:39:22,000 --> 00:39:27,000 seven three, and those keep going down. 369 00:39:27,000 --> 00:39:33,000 So, we keep just building this tree out. 370 00:39:33,000 --> 00:39:38,000 OK, so what's the height of this tree? 371 00:39:38,000 --> 00:39:46,000 Not of this one for the particular value of m and n, 372 00:39:46,000 --> 00:39:54,000 but in terms of m and n. What's the height of this tree? 373 00:39:54,000 --> 00:40:01,000 It's the max of m and n. You've got the right, 374 00:40:01,000 --> 00:40:07,000 it's theta of the max. It's not the max. 375 00:40:07,000 --> 00:40:10,000 Max would be, in this case, 376 00:40:10,000 --> 00:40:14,000 you're saying it has height seven. 377 00:40:14,000 --> 00:40:18,000 But, I think you can sort of see, for example, 378 00:40:18,000 --> 00:40:23,000 along a path like this that, in fact, I've only, 379 00:40:23,000 --> 00:40:28,000 after going three levels, reduced m plus n, 380 00:40:28,000 --> 00:40:32,000 good, very good, m plus n. 381 00:40:32,000 --> 00:40:39,000 So, height here is m plus n. OK, and it's binary. 
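[Editor's sketch] The recursive procedure whose call tree was just drawn was written on the board and is not captured in the transcript. A minimal reconstruction from the spoken description might look like the following. The function name, the zero base cases (which the lecture explicitly skips), and the call counter are all assumptions added for illustration; the counter only makes the size of the tree visible.

```python
def lcs_length_naive(x, y, i, j, counter):
    """Length of an LCS of x[1..i] and y[1..j], by direct recursion.

    Indices are 1-based as in the lecture, so x[i] is x[i - 1] in Python.
    counter is a one-element list that tallies the number of calls made.
    """
    counter[0] += 1
    if i == 0 or j == 0:          # assumed base case: an empty prefix
        return 0
    if x[i - 1] == y[j - 1]:      # matching last characters: one subproblem
        return lcs_length_naive(x, y, i - 1, j - 1, counter) + 1
    # No match: two subproblems, each decrementing only one index.
    return max(lcs_length_naive(x, y, i - 1, j, counter),
               lcs_length_naive(x, y, i, j - 1, counter))


if __name__ == "__main__":
    # Worst case from the lecture: m = 7, n = 6, and no characters match.
    calls = [0]
    length = lcs_length_naive("A" * 7, "B" * 6, 7, 6, calls)
    print(length, calls[0])
```

With no matching characters, every call on nonempty prefixes branches twice, so the tally lands in the thousands even though the inputs have only seven and six characters.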
382 00:40:39,000 --> 00:40:45,000 So, the height: that implies the work is 383 00:40:45,000 --> 00:40:51,000 exponential in m and n. All that work, 384 00:40:51,000 --> 00:41:01,000 and are we any better off than the brute force algorithm? 385 00:41:01,000 --> 00:41:05,000 Not really. And, our technical term for 386 00:41:05,000 --> 00:41:09,000 this is slow. OK, and we like speed. 387 00:41:09,000 --> 00:41:14,000 OK, we like fast. OK, but I'm sure that some of 388 00:41:14,000 --> 00:41:20,000 you have observed something interesting about this tree. 389 00:41:20,000 --> 00:41:25,000 Yeah, there's a lot of repeated work here. 390 00:41:25,000 --> 00:41:31,000 Right, there's a lot of repeated work. 391 00:41:31,000 --> 00:41:34,000 In particular, this whole subtree, 392 00:41:34,000 --> 00:41:40,000 and this whole subtree, OK, they are the same. 393 00:41:40,000 --> 00:41:46,000 That's the same subtree, the same subproblem that you 394 00:41:46,000 --> 00:41:51,000 are solving. OK, you can even see over here, 395 00:41:51,000 --> 00:41:58,000 there is even similarity between this whole subtree and 396 00:41:58,000 --> 00:42:03,000 this whole subtree. OK, so there's lots of repeated 397 00:42:03,000 --> 00:42:08,000 work. OK, and one thing is, 398 00:42:08,000 --> 00:42:13,000 if you want to do things fast, don't keep doing the same 399 00:42:13,000 --> 00:42:17,000 thing. OK, don't keep doing the same 400 00:42:17,000 --> 00:42:21,000 thing. When you find you are repeating 401 00:42:21,000 --> 00:42:25,000 something, figure out a way of not doing it. 402 00:42:25,000 --> 00:42:30,000 So, that brings up our second hallmark for dynamic 403 00:42:30,000 --> 00:42:33,000 programming. 404 00:42:50,000 --> 00:43:07,000 And that's a property called overlapping subproblems, 405 00:43:07,000 --> 00:43:19,000 OK? 
OK, recursive solution contains 406 00:43:19,000 --> 00:43:33,000 many, excuse me, contains a small number of 407 00:43:33,000 --> 00:43:50,000 distinct subproblems repeated many times. 408 00:43:50,000 --> 00:43:54,000 And once again, this is important enough to put 409 00:43:54,000 --> 00:43:58,000 a box around. I don't put boxes around too 410 00:43:58,000 --> 00:44:01,000 many things. Maybe I should put more boxes 411 00:44:01,000 --> 00:44:05,000 around things. This is definitely one to put a 412 00:44:05,000 --> 00:44:08,000 box around, OK? So, for example, 413 00:44:08,000 --> 00:44:12,000 so here we have a recursive solution. 414 00:44:12,000 --> 00:44:15,000 This tree is exponential in size. 415 00:44:15,000 --> 00:44:19,000 It's two to the m plus n in height, in size, 416 00:44:19,000 --> 00:44:24,000 in the total number of problems if I actually implemented it like 417 00:44:24,000 --> 00:44:27,000 that. But how many distinct 418 00:44:27,000 --> 00:44:33,000 subproblems are there? m times n, OK? 419 00:44:33,000 --> 00:44:42,000 So, the longest common subsequence, the subproblem 420 00:44:42,000 --> 00:44:49,000 space contains m times n distinct subproblems. 421 00:44:49,000 --> 00:45:00,000 OK, and then this is a small number compared with two to the 422 00:45:00,000 --> 00:45:07,000 m plus n, or two to the n, or two to the m, 423 00:45:07,000 --> 00:45:13,000 or whatever. OK, this is small, 424 00:45:13,000 --> 00:45:19,000 OK, because each subproblem is characterized 425 00:45:19,000 --> 00:45:24,000 by an i and a j. And i goes from one to m, 426 00:45:24,000 --> 00:45:27,000 and j goes from one to n, OK? 427 00:45:27,000 --> 00:45:34,000 There aren't that many different subproblems. 428 00:45:34,000 --> 00:45:36,000 It's just the product of the two. 429 00:45:36,000 --> 00:45:41,000 So, here's an improved algorithm, which is often a good 430 00:45:41,000 --> 00:45:45,000 way to solve it.
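To make the overlapping-subproblems point concrete, here is a minimal Python sketch that runs the plain recursion and counts how often each (i, j) subproblem gets solved. The function name, the `calls` counter, and the two example strings are assumptions for illustration; the strings are a standard textbook pair with m = 7 and n = 6, not necessarily the ones on the board.

```python
from collections import Counter


def lcs_naive(x, y, i, j, calls):
    """Length of an LCS of the prefixes x[:i] and y[:j], with no memo."""
    calls[(i, j)] += 1  # record every time this subproblem is solved
    if i == 0 or j == 0:
        return 0  # an empty prefix has an empty LCS with anything
    if x[i - 1] == y[j - 1]:
        return lcs_naive(x, y, i - 1, j - 1, calls) + 1
    return max(lcs_naive(x, y, i - 1, j, calls),
               lcs_naive(x, y, i, j - 1, calls))


x, y = "ABCBDAB", "BDCABA"  # assumed example strings (m = 7, n = 6)
calls = Counter()
length = lcs_naive(x, y, len(x), len(y), calls)
total_calls = sum(calls.values())  # nodes in the recursion tree
distinct = len(calls)              # distinct (i, j) pairs actually visited
print(length, total_calls, distinct)
```

The number of distinct pairs can never exceed (m + 1)(n + 1) once the empty-prefix base cases are included, while the total call count grows much faster on inputs with few matching characters.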
It's an algorithm called a 431 00:45:45,000 --> 00:45:48,000 memo-ization algorithm. 432 00:45:56,000 --> 00:46:02,000 And, this is memo-ization, not memorization because what 433 00:46:02,000 --> 00:46:09,000 you're going to do is make a little memo whenever you solve a 434 00:46:09,000 --> 00:46:14,000 subproblem. Make a little memo that says I 435 00:46:14,000 --> 00:46:19,000 solved this already. And if ever you are asked for 436 00:46:19,000 --> 00:46:25,000 it rather than recalculating it, say, oh, I see that. 437 00:46:25,000 --> 00:46:30,000 I did that before. Here's the answer, 438 00:46:30,000 --> 00:46:32,000 OK? So, here's the code. 439 00:46:32,000 --> 00:46:40,000 It's very similar to that code. So, it basically keeps a table 440 00:46:40,000 --> 00:46:44,000 around of c[i,j]. It says, what we do is we 441 00:46:44,000 --> 00:46:47,000 check. If the entry for c[i,j] is nil, 442 00:46:47,000 --> 00:46:51,000 we haven't computed it, then we compute it. 443 00:46:51,000 --> 00:46:55,000 And, how do we compute it? Just the same way we did 444 00:46:55,000 --> 00:46:57,000 before. 445 00:47:34,000 --> 00:47:45,000 OK, so this whole part here, OK, is exactly what we have had 446 00:47:45,000 --> 00:47:51,000 before. It's the same as before. 447 00:47:51,000 --> 00:47:59,000 And then, we just return c[i,j]. 448 00:47:59,000 --> 00:48:03,000 We don't bother to keep recalculating. 449 00:48:03,000 --> 00:48:07,000 OK, so if it's nil, we calculate it. 450 00:48:07,000 --> 00:48:12,000 Otherwise, we just return it. If it's not calculated, 451 00:48:12,000 --> 00:48:18,000 calculate and return it. Otherwise, just return it: 452 00:48:18,000 --> 00:48:21,000 OK, pretty straightforward code. 453 00:48:21,000 --> 00:48:23,000 OK. 454 00:48:33,000 --> 00:48:38,000 OK, now the tricky thing is how much time does it take to 455 00:48:38,000 --> 00:48:40,000 execute this? 456 00:48:58,000 --> 00:49:04,000 This takes a little bit of thinking.
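The board code is not captured in the transcript, so the following is only a sketch of the memoized recursion as described: a table c[i][j] whose entries start out as nil (`None` here), filled in on first request and simply returned afterwards. The names `lcs_memo` and `solve`, and the explicit handling of empty prefixes, are assumptions.

```python
def lcs_memo(x, y):
    """Length of an LCS of x and y, using top-down recursion with a memo table."""
    m, n = len(x), len(y)
    # c[i][j] will hold the LCS length of x[:i] and y[:j]; None means "not yet computed".
    c = [[None] * (n + 1) for _ in range(m + 1)]

    def solve(i, j):
        if c[i][j] is None:
            # Same computation as the plain recursion, done at most once per cell.
            if i == 0 or j == 0:
                c[i][j] = 0
            elif x[i - 1] == y[j - 1]:
                c[i][j] = solve(i - 1, j - 1) + 1
            else:
                c[i][j] = max(solve(i - 1, j), solve(i, j - 1))
        return c[i][j]  # either freshly computed or looked up

    return solve(m, n)


print(lcs_memo("ABCBDAB", "BDCABA"))
```

The recursion depth is at most m + n, so for very long strings in Python the bottom-up form is the safer choice.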
457 00:49:04,000 --> 00:49:10,000 Yeah? Yeah, it takes order MN. 458 00:49:10,000 --> 00:49:18,000 OK, why is that? Yeah, but I have to look up 459 00:49:18,000 --> 00:49:25,000 c[i,j]. I might call c[i,j] a bunch of 460 00:49:25,000 --> 00:49:29,000 times. When I'm doing this, 461 00:49:29,000 --> 00:49:38,000 I'm still calling it recursively. 462 00:49:38,000 --> 00:49:43,000 Yeah, so you have to, so each recursive call is going 463 00:49:43,000 --> 00:49:50,000 to look at, and the worst-case, say, is going to look at the 464 00:49:50,000 --> 00:49:55,000 max of these two things. Well, this is going to involve 465 00:49:55,000 --> 00:50:00,000 a recursive call, and a lookup. 466 00:50:00,000 --> 00:50:05,000 So, this might take a fair amount of effort to calculate. 467 00:50:05,000 --> 00:50:09,000 I mean, you're right, and your intuition is right. 468 00:50:09,000 --> 00:50:13,000 Let's see if we can get a more precise argument, 469 00:50:13,000 --> 00:50:17,000 why this is taking order m times n. 470 00:50:17,000 --> 00:50:21,000 What's going on here? Because not every time I call 471 00:50:21,000 --> 00:50:27,000 this is it going to just take me a constant amount of work to do 472 00:50:27,000 --> 00:50:30,000 this. Sometimes it's going to take me 473 00:50:30,000 --> 00:50:34,000 a lot of work. Sometimes I get lucky, 474 00:50:34,000 --> 00:50:41,000 and I return it. So, your intuition is dead on. 475 00:50:41,000 --> 00:50:47,000 It's dead on. We just need a little bit more 476 00:50:47,000 --> 00:50:55,000 articulate explanation, so that everybody is on board. 477 00:50:55,000 --> 00:51:01,000 Try again? Good, at most three times, 478 00:51:01,000 --> 00:51:04,000 yeah. OK, so that's one way to look 479 00:51:04,000 --> 00:51:05,000 at it. Yeah. 
480 00:51:05,000 --> 00:51:09,000 There is another way to look at it that's kind of what you are 481 00:51:09,000 --> 00:51:12,000 expressing there is an amortized, a bookkeeping, 482 00:51:12,000 --> 00:51:15,000 way of looking at this. What's the amortized cost? 483 00:51:15,000 --> 00:51:18,000 You could say what the amortized cost of calculating 484 00:51:18,000 --> 00:51:21,000 one of these, where basically whenever I call 485 00:51:21,000 --> 00:51:24,000 it, I'm going to charge a constant amount for looking up. 486 00:51:24,000 --> 00:51:28,000 And so, I could get to look up whatever is in here to call the 487 00:51:28,000 --> 00:51:31,000 things. But if it, in fact, 488 00:51:31,000 --> 00:51:35,000 so in some sense, this charge here, 489 00:51:35,000 --> 00:51:41,000 of calling it and returning it, etc., I charged that to my 490 00:51:41,000 --> 00:51:44,000 caller. OK, so I charged these lines 491 00:51:44,000 --> 00:51:50,000 and this line to the caller. And I charge the rest of these 492 00:51:50,000 --> 00:51:55,000 lines to the c[i,j] element. And then, the point is that 493 00:51:55,000 --> 00:52:02,000 every caller basically only ends up being charged for a constant 494 00:52:02,000 --> 00:52:07,000 amount of stuff. OK, to calculate one c[i,j], 495 00:52:07,000 --> 00:52:11,000 it's only an amortized constant amount of stuff that I'm 496 00:52:11,000 --> 00:52:16,000 charging to that calculation of i and j, that calculation of i 497 00:52:16,000 --> 00:52:19,000 and j. OK, so you can view it in terms 498 00:52:19,000 --> 00:52:23,000 of amortized analysis doing a bookkeeping argument that just 499 00:52:23,000 --> 00:52:27,000 says, let me charge enough to calculate my own, 500 00:52:27,000 --> 00:52:32,000 do all my own local things plus enough to look up the value in 501 00:52:32,000 --> 00:52:36,000 the next level and get it returned. 
502 00:52:36,000 --> 00:52:40,000 OK, and then if it has to go off and calculate, 503 00:52:40,000 --> 00:52:46,000 well, that's OK because that's all been charged to a different 504 00:52:46,000 --> 00:52:50,000 ij at that point. So, every cell only costs me a 505 00:52:50,000 --> 00:52:56,000 constant amount of time, and there are order MN cells, for a total of order 506 00:52:56,000 --> 00:53:00,000 MN. OK: constant work per entry. 507 00:53:00,000 --> 00:53:04,000 OK, and you can sort of use an amortized analysis to argue 508 00:53:04,000 --> 00:53:07,000 that. How much space does it take? 509 00:53:07,000 --> 00:53:12,000 We haven't usually looked at space, but here we are going to 510 00:53:12,000 --> 00:53:15,000 start looking at space. That turns out, 511 00:53:15,000 --> 00:53:20,000 for some of these algorithms, to be really important. 512 00:53:20,000 --> 00:53:23,000 How much space do I need, storage space? 513 00:53:23,000 --> 00:53:28,000 Yeah, also m times n, OK, to store the c[i,j] table. 514 00:53:28,000 --> 00:53:30,000 OK, the rest, storing x and y, 515 00:53:30,000 --> 00:53:35,000 OK, that's just m plus n. So, that's negligible, 516 00:53:35,000 --> 00:53:37,000 but mostly I need the space m times n. 517 00:53:37,000 --> 00:53:41,000 So, this memo-ization type algorithm is a really good 518 00:53:41,000 --> 00:53:44,000 strategy in programming for many things where, 519 00:53:44,000 --> 00:53:48,000 when you have the same parameters, you're going to get 520 00:53:48,000 --> 00:53:51,000 the same results. It doesn't work in programs 521 00:53:51,000 --> 00:53:53,000 where you have a side effect, necessarily, 522 00:53:53,000 --> 00:53:57,000 that is, when the calculation for a given set of parameters 523 00:53:57,000 --> 00:54:03,000 might be different on each call.
But for something which is 524 00:54:03,000 --> 00:54:08,000 essentially like a functional programming type of environment, 525 00:54:08,000 --> 00:54:13,000 then if you've calculated it once, you can look it up. 526 00:54:13,000 --> 00:54:19,000 And, so this is very helpful. But, it takes a fair amount of 527 00:54:19,000 --> 00:54:24,000 space, and it also doesn't proceed in a very orderly way. 528 00:54:24,000 --> 00:54:29,000 So, there is another strategy for doing exactly the same 529 00:54:29,000 --> 00:54:34,000 calculation in a bottom-up way. And that's what we call dynamic 530 00:54:34,000 --> 00:54:42,000 programming. OK, the idea is to compute the 531 00:54:42,000 --> 00:54:49,000 table bottom-up. I think I'm going to get rid 532 00:54:49,000 --> 00:54:56,000 of, I think what we'll do is we'll just use, 533 00:54:56,000 --> 00:55:07,000 actually I think what I'm going to do is use this board. 534 00:55:33,000 --> 00:55:38,000 OK, so here's the idea. What we're going to do is look 535 00:55:38,000 --> 00:55:45,000 at the c[i,j] table and realize that there's actually an orderly 536 00:55:45,000 --> 00:55:51,000 way of filling in the table. This is sort of a top-down with 537 00:55:51,000 --> 00:55:55,000 memo-ization. OK, but there's actually a way 538 00:55:55,000 --> 00:56:00,000 we can do it bottom up. So, here's the idea. 539 00:56:00,000 --> 00:56:07,000 So, let's make our table. OK, so there's x. 540 00:56:07,000 --> 00:56:18,000 And then, there's y. And, I'm going to initialize 541 00:56:18,000 --> 00:56:28,000 the empty string. I didn't cover the base cases 542 00:56:28,000 --> 00:56:39,000 for c[i,j], but c of zero meaning a prefix with no 543 00:56:39,000 --> 00:56:45,000 elements in it. The prefix of that with 544 00:56:45,000 --> 00:56:48,000 anything else, the length is zero. 545 00:56:48,000 --> 00:56:53,000 OK, so that's basically how I'm going to bound the borders here. 
546 00:56:53,000 --> 00:56:57,000 And now, what I can do is just use my formula, 547 00:56:57,000 --> 00:57:00,000 which I've conveniently erased up there, OK, 548 00:57:00,000 --> 00:57:04,000 to compute what is the longest common subsequence, 549 00:57:04,000 --> 00:57:09,000 length of the longest common subsequence from this character 550 00:57:09,000 --> 00:57:15,000 in y, and this character in x up to this character. 551 00:57:15,000 --> 00:57:19,000 So here, for example, they don't match. 552 00:57:19,000 --> 00:57:24,000 So, it's the maximum of these two values. 553 00:57:24,000 --> 00:57:29,000 Here, they do match. OK, so it says it's one plus 554 00:57:29,000 --> 00:57:34,000 the value here. And, I'm going to draw a line. 555 00:57:34,000 --> 00:57:38,000 Whenever I'm going to get a match, I'm going to draw a line 556 00:57:38,000 --> 00:57:41,000 like that, indicating that I had that first case, 557 00:57:41,000 --> 00:57:44,000 the case where they had a good match. 558 00:57:44,000 --> 00:57:47,000 And so, all I'm doing is applying that recursive formula 559 00:57:47,000 --> 00:57:52,000 from the theorem that we proved. So here, it's basically they 560 00:57:52,000 --> 00:57:54,000 don't match. So, it's the maximum of those 561 00:57:54,000 --> 00:57:56,000 two. Here, they match. 562 00:57:56,000 --> 00:58:01,000 So, it's one plus that guy. Here, they don't match. 563 00:58:01,000 --> 00:58:06,000 So, it's basically the maximum of these two. 564 00:58:06,000 --> 00:58:11,000 Here, they don't match. So it's the maximum. 565 00:58:11,000 --> 00:58:17,000 So, it's one plus that guy. So, everybody understand how I 566 00:58:17,000 --> 00:58:23,000 filled out that first row? OK, well, with that you guys can 567 00:58:23,000 --> 00:58:27,000 help. OK, so this one is what? 568 00:58:27,000 --> 00:58:32,000 Just call it out. Zero, good. 569 00:58:32,000 --> 00:58:41,000 One, because it's the maximum, one, two, right.
570 00:58:41,000 --> 00:58:47,000 This one, now, gets from there, 571 00:58:47,000 --> 00:58:52,000 two, two. OK, here, zero, 572 00:58:52,000 --> 00:59:03,000 one, because it's the maximum of those two. 573 00:59:03,000 --> 00:59:15,000 Two, two, two, good. 574 00:59:15,000 --> 00:59:34,000 One, one, two, two, two, three, 575 00:59:34,000 --> 00:59:48,000 three. One, two, three, 576 00:59:48,000 --> 01:00:00,250 get that line, three, four, 577 01:00:00,250 --> 01:00:05,974 OK. One there, three, 578 01:00:05,974 --> 01:00:10,000 three, four, good, four. 579 01:00:10,000 --> 01:00:14,199 OK, and our answer: four. 580 01:00:14,199 --> 01:00:23,125 So this is blindingly fast code if you code this up, 581 01:00:23,125 --> 01:00:33,275 OK, because it gets to use the fact that modern machines in 582 01:00:33,275 --> 01:00:45,000 particular do very well on regular strides through memory. 583 01:00:45,000 --> 01:00:50,012 So, if you're just plowing through memory across like this, 584 01:00:50,012 --> 01:00:55,024 OK, and your two-dimensional array is stored in that order, 585 01:00:55,024 --> 01:00:58,308 which it is, otherwise you go this way, 586 01:00:58,308 --> 01:01:02,802 stored in that order. This can really fly in terms of 587 01:01:02,802 --> 01:01:11,948 the speed of the calculation. So, how much time did it take 588 01:01:11,948 --> 01:01:17,897 us to do this? Yeah, order MN, 589 01:01:17,897 --> 01:01:20,769 theta MN. Yeah? 590 01:01:20,769 --> 01:01:30,000 We'll talk about space in just a minute. 591 01:01:30,000 --> 01:01:33,875 OK, so hold that question. Good question, 592 01:01:33,875 --> 01:01:36,491 good question, already, wow, 593 01:01:36,491 --> 01:01:40,657 good, OK, how do I now figure out, remember, 594 01:01:40,657 --> 01:01:46,179 we had the simplification. We were going to just calculate 595 01:01:46,179 --> 01:01:49,764 the length. OK, it turns out I can now 596 01:01:49,764 --> 01:01:54,415 figure out a particular sequence that matches it. 
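The bottom-up fill of the c[i,j] table can be sketched in Python as follows. This is a minimal illustration assuming row 0 and column 0 stand for the empty prefix; the function name `lcs_table` and the example strings (a common textbook pair of lengths 7 and 6 whose LCS length is four) are assumptions, not taken from the board.

```python
def lcs_table(x, y):
    """Fill the whole c table bottom-up; c[i][j] is the LCS length of x[:i] and y[:j]."""
    m, n = len(x), len(y)
    # Row 0 and column 0 are the empty-prefix border, all zeros.
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # go row by row ...
        for j in range(1, n + 1):      # ... and left to right within a row
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1             # match: diagonal neighbor plus one
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])   # no match: best of up and left
    return c


table = lcs_table("ABCBDAB", "BDCABA")
for row in table:
    print(row)
print("LCS length:", table[-1][-1])
```

Each cell depends only on cells above it or to its left, which is why a single pass in row-major order is enough and why the running time is proportional to m times n.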
597 01:01:54,415 --> 01:01:58,000 And basically, I do that. 598 01:01:58,000 --> 01:02:04,932 I can reconstruct the longest common subsequence by tracing 599 01:02:04,932 --> 01:02:09,474 backwards. So essentially I start here. 600 01:02:09,474 --> 01:02:15,928 Here I have a choice because this one was dependent on, 601 01:02:15,928 --> 01:02:22,980 since it doesn't have a bar here, it was dependent on one of 602 01:02:22,980 --> 01:02:28,000 these two. So, let me go this way. 603 01:02:28,000 --> 01:02:33,444 OK, and now I have a diagonal element here. 604 01:02:33,444 --> 01:02:41,222 So what I'll do is simply mark the character that appeared in 605 01:02:41,222 --> 01:02:45,370 those positions as I go this way. 606 01:02:45,370 --> 01:02:51,203 I have three here. And now, let me keep going, 607 01:02:51,203 --> 01:02:56,129 three here, and now I have another one. 608 01:02:56,129 --> 01:03:03,000 So that means this character gets selected. 609 01:03:03,000 --> 01:03:08,632 And then I go up to here, OK, and then up to here. 610 01:03:08,632 --> 01:03:15,643 And now I go diagonally again, which means that this character 611 01:03:15,643 --> 01:03:18,977 is selected. And I go to here, 612 01:03:18,977 --> 01:03:24,724 and then I go here. And then, I go up here and this 613 01:03:24,724 --> 01:03:30,471 character is selected. So here is my longest common 614 01:03:30,471 --> 01:03:35,098 subsequence. And this was just one path 615 01:03:35,098 --> 01:03:37,843 back. I could have gone a path like 616 01:03:37,843 --> 01:03:42,203 this and gotten a different longest common subsequence. 
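The backward trace just described can be sketched like this: build the table, start at the bottom-right cell, take the diagonal whenever the two characters match (that character is selected), and otherwise step toward the larger neighbor. The function names and the tie-breaking rule (prefer stepping up) are assumptions; a different tie-break can yield a different, equally long subsequence.

```python
def lcs_reconstruct(x, y):
    """Return one longest common subsequence of x and y by tracing the table backwards."""
    m, n = len(x), len(y)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])

    picked = []
    i, j = m, n
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            picked.append(x[i - 1])   # diagonal step: this character is selected
            i -= 1
            j -= 1
        elif c[i - 1][j] >= c[i][j - 1]:
            i -= 1                    # value came from above (ties go up, an arbitrary choice)
        else:
            j -= 1                    # value came from the left
    return "".join(reversed(picked))  # characters were collected back to front


def is_subsequence(s, t):
    """True if s can be obtained from t by deleting characters."""
    it = iter(t)
    return all(ch in it for ch in s)


result = lcs_reconstruct("ABCBDAB", "BDCABA")
print(result, len(result))
```

No explicit back pointers are stored here; the direction is recovered from the table values and the characters themselves, at the cost of keeping the full table.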
617 01:03:42,203 --> 01:03:45,997 OK, so that simplification of just saying, look, 618 01:03:45,997 --> 01:03:49,468 let me just run backwards and figure it out, 619 01:03:49,468 --> 01:03:53,989 that's actually pretty good because it means that by just 620 01:03:53,989 --> 01:03:58,026 calculating the value, then figuring out these back 621 01:03:58,026 --> 01:04:04,000 pointers to let me reconstruct it is a fairly simple process. 622 01:04:04,000 --> 01:04:10,075 OK, if I had to think about that to begin with, 623 01:04:10,075 --> 01:04:14,962 it would have been a much bigger mess. 624 01:04:14,962 --> 01:04:19,452 OK, so the space, I just mentioned, 625 01:04:19,452 --> 01:04:25,264 was order MN because we still need the table. 626 01:04:25,264 --> 01:04:32,000 So, you can actually do the min of m and n. 627 01:04:32,000 --> 01:04:37,970 OK, to get to your question, how do you do the min of m and 628 01:04:37,970 --> 01:04:41,367 n? Diagonal stripes won't give you 629 01:04:41,367 --> 01:04:45,897 min of m and n. That'll give you the sum of m 630 01:04:45,897 --> 01:04:48,676 and n. So, going in stripes, 631 01:04:48,676 --> 01:04:53,308 maybe I'm not quite sure I know what you mean. 632 01:04:53,308 --> 01:04:58,250 So, you're saying, so what's the order I would do 633 01:04:58,250 --> 01:05:01,661 here? So, I would start. 634 01:05:01,661 --> 01:05:06,461 I would do this one first. Then which one would I do? 635 01:05:06,461 --> 01:05:10,246 This one and this one? And then, this one, 636 01:05:10,246 --> 01:05:12,923 this one, this one, like this? 637 01:05:12,923 --> 01:05:18,000 That's a perfectly good order. OK, and so you're saying, 638 01:05:18,000 --> 01:05:22,800 then, so I'm keeping the diagonal there all the time. 639 01:05:22,800 --> 01:05:28,615 So, you're saying the length of the diagonal is the min of m and 640 01:05:28,615 --> 01:05:31,633 n? I think that's right. 
641 01:05:31,633 --> 01:05:36,068 OK, there is another way you can do it that's a little bit 642 01:05:36,068 --> 01:05:39,881 more straightforward, which is you compare m to n. 643 01:05:39,881 --> 01:05:42,993 Whichever is smaller, well, first of all, 644 01:05:42,993 --> 01:05:45,871 let's just do this existing algorithm. 645 01:05:45,871 --> 01:05:50,228 If I just simply did row by row, I don't need more than a 646 01:05:50,228 --> 01:05:53,418 previous row. OK, I just need one row at a 647 01:05:53,418 --> 01:05:56,141 time. So, I can go ahead and compute 648 01:05:56,141 --> 01:06:00,421 just one row because once I computed the succeeding row, 649 01:06:00,421 --> 01:06:04,910 the first row is unimportant. And in fact, 650 01:06:04,910 --> 01:06:07,263 I don't even need the whole row. 651 01:06:07,263 --> 01:06:10,754 All I need is just the current row that I'm on, 652 01:06:10,754 --> 01:06:14,093 plus one or two elements of the previous row, 653 01:06:14,093 --> 01:06:16,522 plus the end of the previous row. 654 01:06:16,522 --> 01:06:20,848 So, I use a prefix of this row, and an extra two elements, 655 01:06:20,848 --> 01:06:24,263 and the suffix of this row. So, it's actually, 656 01:06:24,263 --> 01:06:28,058 you can do it with one row, plus order one elements. 657 01:06:28,058 --> 01:06:32,535 And then, I could do it either running vertically or running 658 01:06:32,535 --> 01:06:35,495 horizontally, whichever one gives me the 659 01:06:35,495 --> 01:06:40,303 smaller space. OK, and it might be that your 660 01:06:40,303 --> 01:06:43,084 diagonal trick would work there too. 661 01:06:43,084 --> 01:06:45,785 I'd have to think about that. Yeah? 662 01:06:45,785 --> 01:06:50,392 Ooh, that's a good question. So, you can do the calculation 663 01:06:50,392 --> 01:06:53,570 of the length in one row plus order one 664 01:06:53,570 --> 01:06:57,415 elements.
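A minimal sketch of the one-row idea, assuming we only want the length: keep a single row indexed by the shorter string, plus one saved value for the diagonal neighbor. The function and variable names are hypothetical, and swapping the arguments is just one way to pick the direction that gives the smaller row.

```python
def lcs_length_small_space(x, y):
    """LCS length using one row of min(len(x), len(y)) + 1 cells plus O(1) extra."""
    if len(y) > len(x):
        x, y = y, x                  # make y the shorter string so the row is as small as possible
    n = len(y)
    row = [0] * (n + 1)              # before the loop this is the all-zero border row
    for xi in x:
        diag = 0                     # value of c[i-1][j-1]; starts at the border column
        for j in range(1, n + 1):
            above = row[j]           # c[i-1][j], about to be overwritten
            if xi == y[j - 1]:
                row[j] = diag + 1
            else:
                row[j] = max(above, row[j - 1])   # row[j-1] already holds c[i][j-1]
            diag = above             # becomes the diagonal for the next column
    return row[n]


print(lcs_length_small_space("ABCBDAB", "BDCABA"))
```

At any moment the row holds a prefix of the current row and a suffix of the previous one, which matches the description above. The table needed for tracing back is gone, so this version returns only the length.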
OK, and our exercise, 665 01:06:57,415 --> 01:07:04,203 and this is a hard exercise, OK, so a good one to do is 666 01:07:04,203 --> 01:07:11,221 to use small space and still be able to reconstruct the LCS, because 667 01:07:11,221 --> 01:07:18,469 with the naive way that we were just doing it, it's not clear how you 668 01:07:18,469 --> 01:07:24,336 would go backwards from that because you've lost the 669 01:07:24,336 --> 01:07:29,168 information. OK, so this is actually a very 670 01:07:29,168 --> 01:07:37,182 interesting and tricky problem. And, it turns out it succumbs, 671 01:07:37,182 --> 01:07:43,329 of all things, to divide and conquer, OK, rather than some 672 01:07:43,329 --> 01:07:47,060 more straightforward tabular thing. 673 01:07:47,060 --> 01:07:51,231 OK: so very good practice, for example, 674 01:07:51,231 --> 01:07:57,268 for the upcoming take home quiz, OK, which is all a design 675 01:07:57,268 --> 01:08:03,493 and cleverness type of quiz. OK, so this is a good one for 676 01:08:03,493 --> 01:08:07,191 people to take on. So, this is basically the 677 01:08:07,191 --> 01:08:11,319 tabular method that's called dynamic programming. 678 01:08:11,319 --> 01:08:16,479 OK, memo-ization is not dynamic programming, even though it's 679 01:08:16,479 --> 01:08:18,714 related. It's memo-ization. 680 01:08:18,714 --> 01:08:23,788 And, we're going to see a whole bunch of other problems that 681 01:08:23,788 --> 01:08:27,314 succumb to dynamic programming approaches. 682 01:08:27,314 --> 01:08:31,098 It's a very cool method, and on the homework, 683 01:08:31,098 --> 01:08:36,000 so let me just mention the homework again. 684 01:08:36,000 --> 01:08:38,216 On the homework, we're going to look at a 685 01:08:38,216 --> 01:08:40,434 problem called the edit distance problem. 686 01:08:40,434 --> 01:08:42,763 Edit distance is you are given two strings.
687 01:08:42,763 --> 01:08:46,256 And you can imagine that you're typing on a keyboard with one of 688 01:08:46,256 --> 01:08:48,862 the strings there. And what you have to do is by 689 01:08:48,862 --> 01:08:50,303 doing inserts, and deletes, 690 01:08:50,303 --> 01:08:52,631 and replaces, and moving the cursor around, 691 01:08:52,631 --> 01:08:55,182 you've got to transform one string to the next. 692 01:08:55,182 --> 01:08:57,399 And, each of those operations has a cost. 693 01:08:57,399 --> 01:09:00,671 And your job is to minimize the cost of transforming the one 694 01:09:00,671 --> 01:09:05,565 string into the other. This actually turns out also to 695 01:09:05,565 --> 01:09:09,537 be useful for computational biology applications. 696 01:09:09,537 --> 01:09:12,600 And, in fact, there have been editors, 697 01:09:12,600 --> 01:09:14,917 screen editors, text editors, 698 01:09:14,917 --> 01:09:19,881 that implement algorithms of this nature in order to minimize 699 01:09:19,881 --> 01:09:24,931 the number of characters that have to be sent as IO in and out 700 01:09:24,931 --> 01:09:28,568 of the system. So, the warning is, 701 01:09:28,568 --> 01:09:33,274 you better get going on your programming on problem one on 702 01:09:33,274 --> 01:09:37,816 the homework today if at all possible because whenever I 703 01:09:37,816 --> 01:09:41,862 assign programming, since we don't do that as sort 704 01:09:41,862 --> 01:09:45,660 of a routine thing, I'm just concerned for some 705 01:09:45,660 --> 01:09:50,283 people that they will not be able to get things like the 706 01:09:50,283 --> 01:09:53,422 input and output to work, and so forth. 707 01:09:53,422 --> 01:09:57,550 We have example problems, and such, on the website.
708 01:09:57,550 --> 01:10:00,853 And we also have, you can write it in any 709 01:10:00,853 --> 01:10:03,743 language you want, including Matlab, 710 01:10:03,743 --> 01:10:08,697 Python, whatever your favorite is; the solutions will be written 711 01:10:08,697 --> 01:10:14,425 in Java and Python. OK, so the fastest solutions 712 01:10:14,425 --> 01:10:19,188 are likely to be written in C. OK, you can also do it in 713 01:10:19,188 --> 01:10:21,960 assembly language if you care to. 714 01:10:21,960 --> 01:10:24,905 You laugh. I used to be an assembly 715 01:10:24,905 --> 01:10:28,716 language programmer back in the days of yore. 716 01:10:28,716 --> 01:10:34,086 OK, so I do encourage people to get started on this because let 717 01:10:34,086 --> 01:10:39,370 me mention, the other thing is that this particular problem on 718 01:10:39,370 --> 01:10:45,000 this problem set is an absolutely mandatory problem. 719 01:10:45,000 --> 01:10:49,662 OK, all the problems are mandatory, but as you know you 720 01:10:49,662 --> 01:10:54,583 can skip them and it doesn't hurt you too much if you only 721 01:10:54,583 --> 01:10:57,605 skip one or two. This one, if you skip it, 722 01:10:57,605 --> 01:11:00,367 hurts big time: one letter grade. 723 01:11:00,367 --> 01:11:03,000 It must be done.