1
00:00:07,000 --> 00:00:11,000
So, the topic today is dynamic
programming.

2
00:00:21,000 --> 00:00:25,000
The term programming in the
name of this term doesn't refer

3
00:00:25,000 --> 00:00:30,000
to computer programming.
OK, programming is an old word

4
00:00:30,000 --> 00:00:35,000
that means any tabular method
for accomplishing something.

5
00:00:35,000 --> 00:00:39,000
So, you'll hear about linear
programming and dynamic

6
00:00:39,000 --> 00:00:42,000
programming.
Either of those,

7
00:00:42,000 --> 00:00:47,000
even though we now incorporate
those algorithms in computer

8
00:00:47,000 --> 00:00:52,000
programs, originally computer
programming, you were given a

9
00:00:52,000 --> 00:00:57,000
datasheet and you put one line
per line of code as a tabular

10
00:00:57,000 --> 00:01:04,000
method for giving the machine
instructions as to what to do.

11
00:01:04,000 --> 00:01:07,000
OK, so the term programming is
older.

12
00:01:07,000 --> 00:01:11,000
Of course, and now
conventionally when you see

13
00:01:11,000 --> 00:01:15,000
programming, you mean software,
computer programming.

14
00:01:15,000 --> 00:01:18,000
But that wasn't always the
case.

15
00:01:18,000 --> 00:01:22,000
And these terms continue in the
literature.

16
00:01:22,000 --> 00:01:26,000
So, dynamic programming is a
design technique like other

17
00:01:26,000 --> 00:01:33,000
design techniques we've seen
such as divide and conquer.

18
00:01:33,000 --> 00:01:40,000
OK, so it's a way of solving a
class of problems rather than a

19
00:01:40,000 --> 00:01:43,000
particular algorithm or
something.

20
00:01:43,000 --> 00:01:50,000
So, we're going to work through
this for the example of

21
00:01:50,000 --> 00:01:55,000
so-called longest common
subsequence problem,

22
00:01:55,000 --> 00:02:00,000
sometimes called LCS,
OK, which is a problem that

23
00:02:00,000 --> 00:02:06,000
comes up in a variety of
contexts.

24
00:02:06,000 --> 00:02:10,000
And it's particularly important
in computational biology,

25
00:02:10,000 --> 00:02:14,000
where you have long DNA
strands, and you're trying to

26
00:02:14,000 --> 00:02:19,000
find commonalities between two
strings, OK, one which may be a

27
00:02:19,000 --> 00:02:23,000
genome, and one may be various,
when people do,

28
00:02:23,000 --> 00:02:28,000
what is that thing called when
they do the evolutionary

29
00:02:28,000 --> 00:02:31,000
comparisons?
The evolutionary trees,

30
00:02:31,000 --> 00:02:33,000
yeah, right,
yeah, exactly,

31
00:02:33,000 --> 00:02:35,000
phylogenetic trees,
there you go,

32
00:02:35,000 --> 00:02:44,000
OK, phylogenetic trees.
Good, so here's the problem.

33
00:02:44,000 --> 00:02:54,000
So, you're given two sequences,
x going from one to m,

34
00:02:54,000 --> 00:03:04,000
and y running from one to n.
You want to find a longest

35
00:03:04,000 --> 00:03:12,000
sequence common to both.
OK, and here I say a,

36
00:03:12,000 --> 00:03:19,000
not the, although it's common
to talk about the longest common

37
00:03:19,000 --> 00:03:24,000
subsequence.
Usually the longest common

38
00:03:24,000 --> 00:03:29,000
subsequence isn't unique.
There could be several

39
00:03:29,000 --> 00:03:35,000
different subsequences that tie
for that.

40
00:03:35,000 --> 00:03:41,000
However, people tend to,
it's one of the sloppinesses

41
00:03:41,000 --> 00:03:45,000
that people will say.
I will try to say a,

42
00:03:45,000 --> 00:03:51,000
unless it's unique.
But I may slip as well because

43
00:03:51,000 --> 00:03:57,000
it's just such a common thing to
just talk about the,

44
00:03:57,000 --> 00:04:02,000
even though there might be
multiple.

45
00:04:02,000 --> 00:04:07,000
So, here's an example.
Suppose x is this sequence,

46
00:04:07,000 --> 00:04:14,000
and y is this sequence.
So, what is a longest common

47
00:04:14,000 --> 00:04:16,000
subsequence of those two
sequences?

48
00:04:16,000 --> 00:04:19,000
See if you can just eyeball it.

49
00:04:35,000 --> 00:04:45,000
AB: length two?
Anybody have one longer?

50
00:04:45,000 --> 00:04:51,000
Excuse me?
BDB, BDB.

51
00:04:51,000 --> 00:05:02,000
BDAB, BDAB, BDAB,
anything longer?

52
00:05:02,000 --> 00:05:09,000
So, BDAB: that's the longest
one.

53
00:05:09,000 --> 00:05:20,000
Is there another one that's the
same length?

54
00:05:20,000 --> 00:05:35,000
Is there another one that ties?
BCAB, BCAB, another one?

55
00:05:35,000 --> 00:05:40,000
BCBA, yeah, there are a bunch
of them all of length four.

56
00:05:40,000 --> 00:05:45,000
There isn't one of length five.
OK, we are actually going to

57
00:05:45,000 --> 00:05:49,000
come up with an algorithm that,
if it's correct,

58
00:05:49,000 --> 00:05:54,000
we're going to show it's
correct, guarantees that there

59
00:05:54,000 --> 00:05:58,000
isn't one of length five.
So all those are,

60
00:05:58,000 --> 00:06:03,000
we can say, any one of these is
the longest common subsequence

61
00:06:03,000 --> 00:06:06,000
of x and y.
We tend to use it this way

62
00:06:06,000 --> 00:06:11,000
using functional notation,
but it's not a function; it's

63
00:06:11,000 --> 00:06:17,000
really a relation.
So, we'll say something is an

64
00:06:17,000 --> 00:06:20,000
LCS when really we only mean
it's an element,

65
00:06:20,000 --> 00:06:23,000
if you will,
of the set of longest common

66
00:06:23,000 --> 00:06:26,000
subsequences.
Once again, it's classic

67
00:06:26,000 --> 00:06:29,000
abuse of notation.
As long as we know what we

68
00:06:29,000 --> 00:06:35,000
mean, it's OK to abuse notation.
What we can't do is misuse it.

69
00:06:35,000 --> 00:06:40,000
But abuse, yeah!
Make it so it's easy to deal

70
00:06:40,000 --> 00:06:43,000
with.
But you have to know what's

71
00:06:43,000 --> 00:06:47,000
going on underneath.
OK, so let's see,

72
00:06:47,000 --> 00:06:53,000
so there's a fairly simple
brute force algorithm for

73
00:06:53,000 --> 00:06:59,000
solving this problem.
And that is,

74
00:06:59,000 --> 00:07:10,000
let's just check every,
maybe some of you did this in

75
00:07:10,000 --> 00:07:22,000
your heads, subsequence of x
from one to m to see if it's

76
00:07:22,000 --> 00:07:31,000
also a subsequence of y of one
to n.

77
00:07:31,000 --> 00:07:36,000
So, just take every subsequence
that you can get here,

78
00:07:36,000 --> 00:07:40,000
check it to see if it's in
there.

79
00:07:40,000 --> 00:07:43,000
So let's analyze that.

80
00:07:52,000 --> 00:07:58,000
So, to check,
so if I give you a subsequence

81
00:07:58,000 --> 00:08:05,000
of x, how long does it take you
to check whether it is,

82
00:08:05,000 --> 00:08:14,000
in fact, a subsequence of y?
So, I give you something like

83
00:08:14,000 --> 00:08:18,000
BCAB.
How long does it take me to

84
00:08:18,000 --> 00:08:24,000
check to see if it's a
subsequence of y?

85
00:08:24,000 --> 00:08:28,000
Length of y,
which is order n.

86
00:08:28,000 --> 00:08:34,000
And how do you do it?
Yeah, you just scan.

87
00:08:34,000 --> 00:08:39,000
So as you hit the first
character that matches,

88
00:08:39,000 --> 00:08:41,000
great.
Now, if you will,

89
00:08:41,000 --> 00:08:46,000
recursively see whether the
suffix of your string matches

90
00:08:46,000 --> 00:08:50,000
the suffix of y.
OK, and so, you are just simply

91
00:08:50,000 --> 00:08:54,000
walking down the tree to see if
it matches.

92
00:08:54,000 --> 00:08:59,000
You're walking down the string
to see if it matches.
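
The linear scan just described can be sketched as follows. This is a minimal Python sketch, not code from the lecture: the name `is_subsequence` is hypothetical, and the string "BDCABA" is only assumed to be the y written on the board, since the board contents are not in the transcript.

```python
def is_subsequence(s: str, y: str) -> bool:
    """Return True if s is a subsequence of y, using one left-to-right scan of y."""
    k = 0  # index of the next character of s that still has to be matched
    for ch in y:
        if k < len(s) and ch == s[k]:
            k += 1  # matched s[k]; the rest of s is checked against the rest of y
    return k == len(s)


# Assumed example input (not confirmed by the transcript).
print(is_subsequence("BCAB", "BDCABA"))
print(is_subsequence("BCAB", "ABCD"))
```

Each character of y is examined once, which is where the order n cost per check comes from.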

93
00:08:59,000 --> 00:09:04,000
OK, then the second thing is,
then how many subsequences of x

94
00:09:04,000 --> 00:09:08,000
are there?
Two to the n?

95
00:09:08,000 --> 00:09:15,000
x just goes from one to m,
two to the m subsequences of x,

96
00:09:15,000 --> 00:09:20,000
OK, two to the m.
Two to the m subsequences of x,

97
00:09:20,000 --> 00:09:25,000
OK, one way to see that,
you say, well,

98
00:09:25,000 --> 00:09:32,000
how many subsequences are there
of something there?

99
00:09:32,000 --> 00:09:35,000
If I consider a bit vector of
length m, OK,

100
00:09:35,000 --> 00:09:39,000
that's one or zero,
just every position where

101
00:09:39,000 --> 00:09:42,000
there's a one,
I take out, that identifies an

102
00:09:42,000 --> 00:09:45,000
element that I'm going to take
out.

103
00:09:45,000 --> 00:09:50,000
OK, then that gives me a
mapping from each subsequence of

104
00:09:50,000 --> 00:09:55,000
x, from each bit vector to a
different subsequence of x.

105
00:09:55,000 --> 00:09:58,000
Now, of course,
you could have matching

106
00:09:58,000 --> 00:10:01,000
characters there,
but in the worst case,

107
00:10:01,000 --> 00:10:06,000
all of the characters are
different.

108
00:10:06,000 --> 00:10:14,000
OK, and so every one of those
will be a unique subsequence.

109
00:10:14,000 --> 00:10:22,000
So, each bit vector of length m
corresponds to a subsequence.

110
00:10:22,000 --> 00:10:29,000
That's a generally good trick
to know.
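
The bit-vector trick can be illustrated with a short Python sketch. The generator name `subsequences` is hypothetical, and the sketch assumes that a one bit means "keep this character".

```python
def subsequences(x: str):
    """Yield the subsequence of x selected by each bit vector of length len(x)."""
    m = len(x)
    for mask in range(1 << m):  # one integer per bit vector, 2**m in total
        # Bit i of mask set means x[i] is kept in the subsequence.
        yield "".join(x[i] for i in range(m) if (mask >> i) & 1)


subs = list(subsequences("ABC"))
print(len(subs))       # number of bit vectors of length 3
print(len(set(subs)))  # number of distinct subsequences they produce
```

When all characters of x differ, every bit vector gives a different subsequence; with repeated characters, several bit vectors can map to the same one.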

111
00:10:29,000 --> 00:10:38,000
So, the worst-case running time
of this method is order n times

112
00:10:38,000 --> 00:10:43,000
two to the m,
which is, since m is in the

113
00:10:43,000 --> 00:10:52,000
exponent, is exponential time.
And there's a technical term

114
00:10:52,000 --> 00:10:59,000
that we use when something is
exponential time.

115
00:10:59,000 --> 00:11:03,000
Slow: good.
OK, very good.

116
00:11:03,000 --> 00:11:06,000
OK, slow, OK,
so this is really bad.

117
00:11:06,000 --> 00:11:12,000
This is taking a long time to
crank out how long the longest

118
00:11:12,000 --> 00:11:17,000
common subsequence is because
there's so many subsequences.
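
Putting the two pieces together gives the brute-force method. The following is a sketch under assumptions: the function name is hypothetical, and the inputs "ABCBDAB" and "BDCABA" are the classic textbook pair, assumed (not confirmed by the transcript) to match the board example.

```python
def brute_force_lcs_length(x: str, y: str) -> int:
    """Try all 2**len(x) subsequences of x; keep the longest that also occurs in y."""
    m = len(x)
    best = 0
    for mask in range(1 << m):  # 2**m candidate subsequences
        cand = [x[i] for i in range(m) if (mask >> i) & 1]
        k = 0
        for ch in y:  # single scan of y to test the candidate
            if k < len(cand) and ch == cand[k]:
                k += 1
        if k == len(cand):  # every character of cand was matched in order
            best = max(best, len(cand))
    return best


print(brute_force_lcs_length("ABCBDAB", "BDCABA"))  # 4
```

Note that building each candidate here also costs order m, so this sketch is slightly worse than the n times two to the m bound; the exponential factor is what matters.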

119
00:11:17,000 --> 00:11:23,000
OK, so we're going to now go
through a process of developing

120
00:11:23,000 --> 00:11:27,000
a far more efficient algorithm
for this problem.

121
00:11:27,000 --> 00:11:34,000
OK, and we're actually going to
go through several stages.

122
00:11:34,000 --> 00:11:42,000
The first one is to go through
simplification stage.

123
00:11:42,000 --> 00:11:52,000
OK, and what we're going to do
is look at simply the length of

124
00:11:52,000 --> 00:11:59,000
the longest common subsequence of x
and y.

125
00:11:59,000 --> 00:12:03,000
And then what we'll do is
extend the algorithm to find the

126
00:12:03,000 --> 00:12:06,000
longest common subsequence
itself.

127
00:12:06,000 --> 00:12:10,000
OK, so we're going to look at
the length.

128
00:12:10,000 --> 00:12:13,000
So, simplify the problem,
if you will,

129
00:12:13,000 --> 00:12:16,000
to just try to compute the
length.

130
00:12:16,000 --> 00:12:19,000
What's nice is the length is
unique.

131
00:12:19,000 --> 00:12:23,000
OK, there's only going to be
one length that's going to be

132
00:12:23,000 --> 00:12:27,000
the longest.
OK, and what we'll do is just

133
00:12:27,000 --> 00:12:31,000
focus on the problem of
computing the length.

134
00:12:31,000 --> 00:12:36,000
And then what we'll do is we can
back up from that and figure out

135
00:12:36,000 --> 00:12:43,000
what actually is the subsequence
that realizes that length.

136
00:12:43,000 --> 00:12:46,000
OK, and that will be a big
simplification because we don't

137
00:12:46,000 --> 00:12:50,000
have to keep track of a lot of
different possibilities at every

138
00:12:50,000 --> 00:12:52,000
stage.
We just have to keep track of

139
00:12:52,000 --> 00:12:54,000
the one number,
which is the length.

140
00:12:54,000 --> 00:12:57,000
So, it sort of reduces it to
a numerical problem.

141
00:12:57,000 --> 00:13:00,000
We'll adopt the following
notation.

142
00:13:00,000 --> 00:13:04,000
It's pretty standard notation,
but I just want,

143
00:13:04,000 --> 00:13:09,000
if I put absolute values around
the string or a sequence,

144
00:13:09,000 --> 00:13:13,000
it denotes the length of the
sequence, S.

145
00:13:13,000 --> 00:13:19,000
OK, so that's the first thing.
The second thing we're going to

146
00:13:19,000 --> 00:13:22,000
do is, actually,
we're going to,

147
00:13:22,000 --> 00:13:28,000
which takes a lot more insight
when you come up with a problem

148
00:13:28,000 --> 00:13:33,000
like this,
and in some sense,

149
00:13:33,000 --> 00:13:39,000
ends up being the hardest part
of designing a good dynamic

150
00:13:39,000 --> 00:13:47,000
programming algorithm for any
problem, which is we're going to

151
00:13:47,000 --> 00:13:53,000
actually look not at all
subsequences of x and y,

152
00:13:53,000 --> 00:13:56,000
but just prefixes.

153
00:14:06,000 --> 00:14:13,000
OK, we're just going to look at
prefixes and we're going to show

154
00:14:13,000 --> 00:14:20,000
how we can express the length of
the longest common subsequence

155
00:14:20,000 --> 00:14:24,000
of prefixes in terms of each
other.

156
00:14:24,000 --> 00:14:28,000
In particular,
we're going to define c of ij

157
00:14:28,000 --> 00:14:34,000
to be the length of
the longest common subsequence

158
00:14:34,000 --> 00:14:41,000
of the prefix of x going from
one to i, and of y going from one

159
00:14:41,000 --> 00:14:48,000
to j.
And what we are going to do is

160
00:14:48,000 --> 00:14:56,000
we're going to calculate c[i,j]
for all i and j.

161
00:14:56,000 --> 00:15:04,000
And if we do that,
how then do we solve the

162
00:15:04,000 --> 00:15:15,000
problem of the longest common
subsequence of x and y?

163
00:15:15,000 --> 00:15:19,000
How do we solve the longest
common subsequence?

164
00:15:19,000 --> 00:15:23,000
Suppose we've solved this for
all i and j.

165
00:15:23,000 --> 00:15:29,000
How then do we compute the
length of the longest common

166
00:15:29,000 --> 00:15:33,000
subsequence of x and y?
Yeah, c[m,n],

167
00:15:33,000 --> 00:15:37,000
that's all, OK?
So then, c of m,

168
00:15:37,000 --> 00:15:44,000
n is just equal to the length of the longest
common subsequence of x and y,

169
00:15:44,000 --> 00:15:50,000
because if I go from one to n,
I'm done, OK?

170
00:15:50,000 --> 00:15:56,000
And so, it's going to turn out
that what we want to do is

171
00:15:56,000 --> 00:16:02,000
figure out how to express
c[m,n], in general,

172
00:16:02,000 --> 00:16:08,000
c[i,j], in terms of other
c[i,j].

173
00:16:08,000 --> 00:16:18,000
So, let's see how we do that.
OK, so our theorem is going to

174
00:16:18,000 --> 00:16:23,000
say that c[i,j] is just --

175
00:17:05,000 --> 00:17:10,000
OK, it says that if the i'th
character matches the j'th

176
00:17:10,000 --> 00:17:17,000
character, that is, the i'th character
of x matches the j'th character

177
00:17:17,000 --> 00:17:23,000
of y, then c of ij is just c of
i minus one, j minus one plus

178
00:17:23,000 --> 00:17:26,000
one.
And if they don't match,

179
00:17:26,000 --> 00:17:31,000
then it's going to be
the larger of c[i,

180
00:17:31,000 --> 00:17:35,000
j-1], and c[i-1,
j], OK?

181
00:17:35,000 --> 00:17:38,000
So that's what we're going to
prove.
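
The theorem was written on the board and is not in the transcript; reconstructed from the spoken statement, it reads as below. The boundary values c[i,0] = c[0,j] = 0 are an assumption here, since the lecture sets base cases aside.

```latex
c[i,j] =
\begin{cases}
  c[i-1,\,j-1] + 1 & \text{if } x[i] = y[j], \\
  \max\{\, c[i-1,\,j],\; c[i,\,j-1] \,\} & \text{otherwise.}
\end{cases}
```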

182
00:17:38,000 --> 00:17:44,000
And that's going to give us a
way of relating the calculation

183
00:17:44,000 --> 00:17:49,000
of a given c[i,j] to values that
are strictly smaller,

184
00:17:49,000 --> 00:17:56,000
OK, that is at least one of the
arguments is smaller of the two

185
00:17:56,000 --> 00:18:00,000
arguments.
OK, and that's going to give us

186
00:18:00,000 --> 00:18:05,000
a way of being able,
then, to understand how to

187
00:18:05,000 --> 00:18:11,000
calculate c[i,j].
So, let's prove this theorem.

188
00:18:11,000 --> 00:18:18,000
So, we'll start with a case
x[i] equals y of j.

189
00:18:18,000 --> 00:18:22,000
And so, let's draw a picture
here.

190
00:18:22,000 --> 00:18:26,000
So, we have x here.

191
00:18:50,000 --> 00:18:52,000
And here is y.

192
00:19:13,000 --> 00:19:19,000
OK, so here's my sequence,
x, which I'm sort of drawing as

193
00:19:19,000 --> 00:19:25,000
this elongated box,
sequence y, and I'm saying that

194
00:19:25,000 --> 00:19:30,000
x[i] and y[j],
those are equal.

195
00:19:38,000 --> 00:19:46,000
OK, so let's see what that
means.

196
00:19:46,000 --> 00:20:01,000
OK, so let's let z of one to k
be, in fact, the longest common

197
00:20:01,000 --> 00:20:12,000
subsequence of x of one to i,
y of one to j,

198
00:20:12,000 --> 00:20:23,000
where c of ij is equal to k.
OK, so the longest common

199
00:20:23,000 --> 00:20:29,000
subsequence of x of one to
i and y of one to j has some

200
00:20:29,000 --> 00:20:32,000
value.
Let's call it k.

201
00:20:32,000 --> 00:20:39,000
And so, let's say that we have
some sequence which realizes

202
00:20:39,000 --> 00:20:42,000
that.
OK, we'll call it z.

203
00:20:42,000 --> 00:20:48,000
OK, so then,
can somebody tell me what z of

204
00:20:48,000 --> 00:20:50,000
k is?

205
00:21:04,000 --> 00:21:05,000
What is z of k here?

206
00:21:14,000 --> 00:21:18,000
Yeah, it's actually equal to x
of i, which is also equal to y

207
00:21:18,000 --> 00:21:19,000
of j?
Why is that?

208
00:21:19,000 --> 00:21:23,000
Why couldn't it be some other
value?

209
00:21:41,000 --> 00:21:43,000
Yeah, so you got the right
idea.

210
00:21:43,000 --> 00:21:46,000
So, the idea is,
suppose that the sequence

211
00:21:46,000 --> 00:21:50,000
didn't include this element here
as the last element,

212
00:21:50,000 --> 00:21:55,000
the longest common subsequence.
OK, so then it includes a bunch

213
00:21:55,000 --> 00:21:59,000
of values in here,
and a bunch of values in here,

214
00:21:59,000 --> 00:22:03,000
same values.
It doesn't include this or

215
00:22:03,000 --> 00:22:07,000
this.
Well, then I could just tack on

216
00:22:07,000 --> 00:22:13,000
this extra character and make it
be longer, make it k plus one

217
00:22:13,000 --> 00:22:18,000
because these two match.
OK, so if the sequence ended

218
00:22:18,000 --> 00:22:20,000
before --

219
00:22:34,000 --> 00:22:40,000
-- just extend it by tacking on
x[i].

220
00:22:40,000 --> 00:22:48,000
OK, it would be fairly simple
to just tack on x[i].

221
00:22:48,000 --> 00:22:58,000
OK, so if that's the case,
then if I look at z going one

222
00:22:58,000 --> 00:23:05,000
up to k minus one,
that's certainly a common

223
00:23:05,000 --> 00:23:14,000
sequence of x of 1 up to,
excuse me, of up to i minus

224
00:23:14,000 --> 00:23:20,000
one.
And, y of one up to j minus

225
00:23:20,000 --> 00:23:26,000
one, OK, because this is a
longest common sequence.

226
00:23:26,000 --> 00:23:33,000
z is a longest common sequence
of x of one to i,

227
00:23:33,000 --> 00:23:38,000
y of one to j.
And, we know what the last

228
00:23:38,000 --> 00:23:41,000
character is.
It's just x[i],

229
00:23:41,000 --> 00:23:43,000
or equivalently,
y[j].

230
00:23:43,000 --> 00:23:47,000
So therefore,
everything except the last

231
00:23:47,000 --> 00:23:53,000
character must at least be a
common sequence of x of one to i

232
00:23:53,000 --> 00:23:57,000
minus one, y of one to j minus
one.

233
00:23:57,000 --> 00:24:04,000
Everybody with me?
It must be a common sequence.

234
00:24:04,000 --> 00:24:12,000
OK, now, what do you also suspect?
What do you also suspect about

235
00:24:12,000 --> 00:24:18,000
z of one to k minus one?
It's a common sequence of these

236
00:24:18,000 --> 00:24:19,000
two.
Yeah?

237
00:24:19,000 --> 00:24:26,000
Yeah, it's a longest common
sequence.

238
00:24:26,000 --> 00:24:34,000
So that's what we claim,
z of one up to k minus one is

239
00:24:34,000 --> 00:24:42,000
in fact a longest common
subsequence of x of one to i

240
00:24:42,000 --> 00:24:48,000
minus one, and y of one to j
minus one, OK?

241
00:24:48,000 --> 00:24:57,000
So, let's prove that claim.
So, we'll just have a little

242
00:24:57,000 --> 00:25:09,000
diversion to prove the claim.
OK, so suppose that w is a

243
00:25:09,000 --> 00:25:21,000
longer common sequence,
that is, that the length,

244
00:25:21,000 --> 00:25:30,000
of w, is bigger than k minus
one.

245
00:25:30,000 --> 00:25:35,000
OK, so suppose we have a longer
common sequence than z of one

246
00:25:35,000 --> 00:25:38,000
to k minus one.
So, it's got to have length

247
00:25:38,000 --> 00:25:42,000
that's bigger than k minus one
if it's longer.

248
00:25:42,000 --> 00:25:47,000
OK, and now what we do is we
use a classic argument you're

249
00:25:47,000 --> 00:25:51,000
going to see multiple times,
not just this week,

250
00:25:51,000 --> 00:25:56,000
which it will be important for
this week, but through several

251
00:25:56,000 --> 00:25:59,000
lectures.
It's called a cut and

252
00:25:59,000 --> 00:26:06,000
paste argument.
So, the idea is let's take a

253
00:26:06,000 --> 00:26:15,000
look at w, concatenate it with
that last character,

254
00:26:15,000 --> 00:26:19,000
z of k.
So, this is string,

255
00:26:19,000 --> 00:26:27,000
OK, so that's just my
terminology for string

256
00:26:27,000 --> 00:26:36,000
concatenation.
OK, so I take whatever I

257
00:26:36,000 --> 00:26:48,000
claimed was a longer common
subsequence, and I concatenate z

258
00:26:48,000 --> 00:26:56,000
of k to it.
OK, so that is certainly a

259
00:26:56,000 --> 00:27:11,000
common sequence of x of one to i,
and y of one to j.

260
00:27:11,000 --> 00:27:18,000
And it has length bigger than k
because it's basically,

261
00:27:18,000 --> 00:27:24,000
what is its length?
The length of w is bigger than

262
00:27:24,000 --> 00:27:28,000
k minus one.
I add one character.

263
00:27:28,000 --> 00:27:37,000
So, this combination here,
now, has length bigger than k.

264
00:27:37,000 --> 00:27:43,000
OK, and that's a contradiction,
thereby proving the claim.
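
Written out symbolically, the cut-and-paste step is the following. This is a reconstruction from the spoken argument, using the double bar for string concatenation; w is the hypothetical longer common subsequence of x[1..i-1] and y[1..j-1].

```latex
|w| > k - 1
\;\Longrightarrow\;
\bigl|\, w \,\|\, z[k] \,\bigr| = |w| + 1 > k ,
\qquad
\text{and } w \,\|\, z[k] \text{ is a common subsequence of } x[1..i] \text{ and } y[1..j],
```

which contradicts c[i,j] = k.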

265
00:27:43,000 --> 00:27:47,000
So, I'm simply saying,
I claim this.

266
00:27:47,000 --> 00:27:52,000
Suppose you have a longer one.
Well, let me show,

267
00:27:52,000 --> 00:27:58,000
if I had a longer common
sequence for the prefixes where

268
00:27:58,000 --> 00:28:05,000
we dropped the character from
both strings if it was longer

269
00:28:05,000 --> 00:28:12,000
there, then we would have made
the whole thing longer.

270
00:28:12,000 --> 00:28:16,000
So that can't be.
So, therefore,

271
00:28:16,000 --> 00:28:22,000
this must be a longest common
subsequence, OK?

272
00:28:22,000 --> 00:28:27,000
Questions?
Because you are going to need

273
00:28:27,000 --> 00:28:33,000
to be able to do this kind of
proof ad nauseam,

274
00:28:33,000 --> 00:28:39,000
almost.
So, if there are any questions,

275
00:28:39,000 --> 00:28:42,000
let them at me,
people.

276
00:28:42,000 --> 00:28:47,000
OK, so now what we have
established is that z one

277
00:28:47,000 --> 00:28:55,000
through k minus one is a longest common
subsequence of the two prefixes

278
00:28:55,000 --> 00:29:05,000
when we drop the last character.
So, thus, we have c of i minus

279
00:29:05,000 --> 00:29:11,000
one, j minus one is equal to
what?

280
00:29:11,000 --> 00:29:19,000
What's c of i minus one,
j minus one?

281
00:29:31,000 --> 00:29:33,000
k minus one;
thank you.

282
00:29:33,000 --> 00:29:40,000
Let's move on with the class,
right, OK, which implies that c

283
00:29:40,000 --> 00:29:47,000
of ij is just equal to c of i
minus one, j minus one plus one.

284
00:29:47,000 --> 00:29:54,000
So, it's fairly straightforward
if you think about what's going

285
00:29:54,000 --> 00:29:57,000
on there.
It's not always as

286
00:29:57,000 --> 00:30:04,000
straightforward in some problems
as it is for longest common

287
00:30:04,000 --> 00:30:08,000
subsequence.
The idea is,

288
00:30:08,000 --> 00:30:13,000
so I'm not going to go through
the other cases.

289
00:30:13,000 --> 00:30:16,000
They are similar.
But, in fact,

290
00:30:16,000 --> 00:30:21,000
we've hit on one of the two
hallmarks of dynamic

291
00:30:21,000 --> 00:30:24,000
programming.
So, by hallmarks,

292
00:30:24,000 --> 00:30:30,000
I mean when you see this kind
of structure in a problem,

293
00:30:30,000 --> 00:30:36,000
there's a good chance that
dynamic programming is going to

294
00:30:36,000 --> 00:30:41,000
work as a strategy.
The dynamic programming

295
00:30:41,000 --> 00:30:44,000
hallmark is the following.

296
00:30:55,000 --> 00:31:02,000
This is number one.
And that is the property of

297
00:31:02,000 --> 00:31:09,000
optimal substructure.
OK, what that says is an

298
00:31:09,000 --> 00:31:16,000
optimal solution to a problem,
and by this,

299
00:31:16,000 --> 00:31:21,000
we really mean problem
instance.

300
00:31:21,000 --> 00:31:31,000
But it's tedious to keep saying
problem instance.

301
00:31:31,000 --> 00:31:35,000
A problem is generally,
in computer science,

302
00:31:35,000 --> 00:31:42,000
viewed as having an infinite
number of instances typically,

303
00:31:42,000 --> 00:31:48,000
OK, so sorting is a problem.
A sorting instance is a

304
00:31:48,000 --> 00:31:53,000
particular input.
OK, so we're really talking

305
00:31:53,000 --> 00:31:59,000
about problem instances,
but I'm just going to say

306
00:31:59,000 --> 00:32:04,000
problem, OK?
So, when you have an optimal

307
00:32:04,000 --> 00:32:09,000
solution to a problem, it
contains optimal solutions to

308
00:32:09,000 --> 00:32:17,000
subproblems.
OK, and that's worth drawing a

309
00:32:17,000 --> 00:32:22,000
box around because it's so
important.

310
00:32:22,000 --> 00:32:25,000
OK, so here,
for example,

311
00:32:25,000 --> 00:32:33,000
if z is a longest common
subsequence of x and y,

312
00:32:33,000 --> 00:32:55,000
OK, then any prefix of z is a
longest common subsequence of a

313
00:32:55,000 --> 00:33:09,000
prefix of x, and a prefix of y,
OK?

314
00:33:09,000 --> 00:33:12,000
So, this is basically what it
says.

315
00:33:12,000 --> 00:33:16,000
I look at the problem,
and I can see that there is

316
00:33:16,000 --> 00:33:21,000
optimal substructure going on.
OK, in this case,

317
00:33:21,000 --> 00:33:26,000
and the idea is that almost
always, it means that there's a

318
00:33:26,000 --> 00:33:32,000
cut and paste argument you could
do to demonstrate that,

319
00:33:32,000 --> 00:33:36,000
OK, that if the substructure
were not optimal,

320
00:33:36,000 --> 00:33:41,000
then you'd be able to find a
better solution to the overall

321
00:33:41,000 --> 00:33:49,000
problem using cut and paste.
OK, so this theorem,

322
00:33:49,000 --> 00:33:57,000
now, gives us a strategy for
being able to compute longest

323
00:33:57,000 --> 00:34:01,000
common subsequence.

324
00:34:24,000 --> 00:34:29,000
Here's the code; oh wait.

325
00:34:38,000 --> 00:34:41,000
OK, so going to ignore base
cases in this,

326
00:34:41,000 --> 00:34:42,000
if --

327
00:35:44,000 --> 00:35:54,000
And we will return the value of
the longest common subsequence.

328
00:35:54,000 --> 00:36:02,000
It's basically just
implementing this theorem.
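
The code on the board is not captured in the transcript. A minimal Python sketch of a recursive procedure that implements the theorem might look like this; the name `lcs_length`, the explicit base case, and the example strings are assumptions, not the lecturer's code.

```python
def lcs_length(x: str, y: str, i: int, j: int) -> int:
    """Length of an LCS of the prefixes x[1..i] and y[1..j] (1-based, as in the lecture)."""
    if i == 0 or j == 0:  # assumed base case: an empty prefix has LCS length 0
        return 0
    if x[i - 1] == y[j - 1]:  # x[i] = y[j] in the lecture's 1-based notation
        return lcs_length(x, y, i - 1, j - 1) + 1
    # No match: drop a character from x or from y and take the larger result.
    return max(lcs_length(x, y, i - 1, j),
               lcs_length(x, y, i, j - 1))


x, y = "ABCBDAB", "BDCABA"  # assumed example pair
print(lcs_length(x, y, len(x), len(y)))  # 4
```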

329
00:36:02,000 --> 00:36:06,000
OK, so it's either the longest
common subsequence if they

330
00:36:06,000 --> 00:36:09,000
match.
It's the longest common

331
00:36:09,000 --> 00:36:14,000
subsequence of one of the
prefixes where you drop that

332
00:36:14,000 --> 00:36:18,000
character from both strings and
add one because that's the

333
00:36:18,000 --> 00:36:22,000
matching one.
Or, you drop a character from

334
00:36:22,000 --> 00:36:26,000
x, and it's the longest common
subsequence of that.

335
00:36:26,000 --> 00:36:31,000
Or you drop a character from y,
whichever one of those is

336
00:36:31,000 --> 00:36:34,000
longer.
That ends up being the longest

337
00:36:34,000 --> 00:36:43,000
common subsequence.
OK, so what's the worst case

338
00:36:43,000 --> 00:36:52,000
for this program?
What's going to happen in the

339
00:36:52,000 --> 00:37:00,000
worst case?
Which of these two clauses is

340
00:37:00,000 --> 00:37:09,000
going to cause us more headache?
The second clause:

341
00:37:09,000 --> 00:37:12,000
why the second clause?
Yeah, you're doing two LCS

342
00:37:12,000 --> 00:37:16,000
sub-calculations here.
Here, you're only doing one.

343
00:37:16,000 --> 00:37:19,000
Not only that,
but you get to decrement both

344
00:37:19,000 --> 00:37:22,000
indices, whereas here you've
basically got to,

345
00:37:22,000 --> 00:37:26,000
you only get to decrement one
index, and you've got to

346
00:37:26,000 --> 00:37:29,000
calculate two of them.
So that's going to generate the

347
00:37:29,000 --> 00:37:34,000
tree.
So, in the worst case,

348
00:37:34,000 --> 00:37:42,000
x of i is not equal to y of j
for all i and j.

349
00:37:42,000 --> 00:37:52,000
So, let's draw a recursion tree
for this program to sort of get

350
00:37:52,000 --> 00:38:02,000
an understanding as to what is
going on to help us.

351
00:38:02,000 --> 00:38:06,000
And, I'm going to do it with m
equals seven,

352
00:38:06,000 --> 00:38:12,000
and n equals six.
OK, so we start up the top with

353
00:38:12,000 --> 00:38:16,000
my two indices being seven and
six.

354
00:38:16,000 --> 00:38:22,000
And then, in the worst case,
we had to execute these.

355
00:38:22,000 --> 00:38:27,000
So, this is going to end up
being six, six,

356
00:38:27,000 --> 00:38:34,000
and seven, five for indices
after the first call.

357
00:38:34,000 --> 00:38:37,000
And then, this guy is going to
split.

358
00:38:37,000 --> 00:38:44,000
And he's going to produce five,
six here, decrement the first

359
00:38:44,000 --> 00:38:48,000
index, i.
And then, if I keep going down

360
00:38:48,000 --> 00:38:52,000
here, we're going to get four,
six and five,

361
00:38:52,000 --> 00:38:56,000
five.
And these guys keep extending

362
00:38:56,000 --> 00:38:58,000
here.
I get six five,

363
00:38:58,000 --> 00:39:02,000
five five, six four,
OK?

364
00:39:02,000 --> 00:39:08,000
Over here, I'm going to
decrement the first index,

365
00:39:08,000 --> 00:39:15,000
six five, and I get five five,
six four, and these guys keep

366
00:39:15,000 --> 00:39:17,000
going down.
And over here,

367
00:39:17,000 --> 00:39:22,000
I get seven four.
And then we get six four,

368
00:39:22,000 --> 00:39:27,000
seven three,
and those keep going down.

369
00:39:27,000 --> 00:39:33,000
So, we keep just building this
tree out.

370
00:39:33,000 --> 00:39:38,000
OK, so what's the height of
this tree?

371
00:39:38,000 --> 00:39:46,000
Not of this one for the
particular value of m and n,

372
00:39:46,000 --> 00:39:54,000
but in terms of m and n.
What's the height of this tree?

373
00:39:54,000 --> 00:40:01,000
It's the max of m and n.
You've got the right,

374
00:40:01,000 --> 00:40:07,000
it's theta of the max.
It's not the max.

375
00:40:07,000 --> 00:40:10,000
Max would be,
in this case,

376
00:40:10,000 --> 00:40:14,000
you're saying it has height
seven.

377
00:40:14,000 --> 00:40:18,000
But, I think you can sort of
see, for example,

378
00:40:18,000 --> 00:40:23,000
along a path like this that,
in fact, I've only,

379
00:40:23,000 --> 00:40:28,000
after going three levels,
reduced m plus n,

380
00:40:28,000 --> 00:40:32,000
good, very good,
m plus n.

381
00:40:32,000 --> 00:40:39,000
So, height here is m plus n.
OK, and it's binary.

382
00:40:39,000 --> 00:40:45,000
So, the height:
that implies the work is

383
00:40:45,000 --> 00:40:51,000
exponential in m and n.
All that work,

384
00:40:51,000 --> 00:41:01,000
and are we any better off than
the brute force algorithm?
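
To get a feel for how large the worst-case recursion tree is, one can count its nodes directly. This is an illustrative sketch: the function name is hypothetical, and it assumes the recursion stops as soon as either index reaches zero.

```python
def worst_case_calls(i: int, j: int) -> int:
    """Number of nodes in the recursion tree when no pair of characters ever matches."""
    if i == 0 or j == 0:  # assumed base case: this call is a leaf
        return 1
    # Every internal node makes two recursive calls, each decrementing one index.
    return 1 + worst_case_calls(i - 1, j) + worst_case_calls(i, j - 1)


print(worst_case_calls(7, 6))  # 3431 calls for m = 7, n = 6
```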

385
00:41:01,000 --> 00:41:05,000
Not really.
And, our technical term for

386
00:41:05,000 --> 00:41:09,000
this is slow.
OK, and we like speed.

387
00:41:09,000 --> 00:41:14,000
OK, we like fast.
OK, but I'm sure that some of

388
00:41:14,000 --> 00:41:20,000
you have observed something
interesting about this tree.

389
00:41:20,000 --> 00:41:25,000
Yeah, there's a lot of repeated
work here.

390
00:41:25,000 --> 00:41:31,000
Right, there's a lot of
repeated work.

391
00:41:31,000 --> 00:41:34,000
In particular,
this whole subtree,

392
00:41:34,000 --> 00:41:40,000
and this whole subtree,
OK, they are the same.

393
00:41:40,000 --> 00:41:46,000
That's the same subtree,
the same subproblem that you

394
00:41:46,000 --> 00:41:51,000
are solving.
OK, you can even see over here,

395
00:41:51,000 --> 00:41:58,000
there is even similarity
between this whole subtree and

396
00:41:58,000 --> 00:42:03,000
this whole subtree.
OK, so there's lots of repeated

397
00:42:03,000 --> 00:42:08,000
work.
OK, and one thing is,

398
00:42:08,000 --> 00:42:13,000
if you want to do things fast,
don't keep doing the same

399
00:42:13,000 --> 00:42:17,000
thing.
OK, don't keep doing the same

400
00:42:17,000 --> 00:42:21,000
thing.
When you find you are repeating

401
00:42:21,000 --> 00:42:25,000
something, figure out a way of
not doing it.

402
00:42:25,000 --> 00:42:30,000
So, that brings up our second
hallmark for dynamic

403
00:42:30,000 --> 00:42:33,000
programming.

404
00:42:50,000 --> 00:43:07,000
And that's a property called
overlapping subproblems,

405
00:43:07,000 --> 00:43:19,000
OK?
OK, recursive solution contains

406
00:43:19,000 --> 00:43:33,000
many, excuse me,
contains a small number of

407
00:43:33,000 --> 00:43:50,000
distinct subproblems repeated
many times.

408
00:43:50,000 --> 00:43:54,000
And once again,
this is important enough to put

409
00:43:54,000 --> 00:43:58,000
a box around.
I don't put boxes around too

410
00:43:58,000 --> 00:44:01,000
many things.
Maybe I should put more boxes

411
00:44:01,000 --> 00:44:05,000
around things.
This is definitely one to put a

412
00:44:05,000 --> 00:44:08,000
box around, OK?
So, for example,

413
00:44:08,000 --> 00:44:12,000
so here we have a recursive
solution.

414
00:44:12,000 --> 00:44:15,000
This tree is exponential in
size.

415
00:44:15,000 --> 00:44:19,000
It's two to the m plus n in
height, in size,

416
00:44:19,000 --> 00:44:24,000
in the total number of problems
if I actually implemented it like

417
00:44:24,000 --> 00:44:27,000
that.
But how many distinct

418
00:44:27,000 --> 00:44:33,000
subproblems are there?
m times n, OK?

419
00:44:33,000 --> 00:44:42,000
So, for longest common
subsequence, the subproblem

420
00:44:42,000 --> 00:44:49,000
space contains m times n,
distinct subproblems.

421
00:44:49,000 --> 00:45:00,000
OK, and then this is a small
number compared with two to the

422
00:45:00,000 --> 00:45:07,000
m plus n, or two to the n,
or two to the m,

423
00:45:07,000 --> 00:45:13,000
or whatever.
OK, this is small,

424
00:45:13,000 --> 00:45:19,000
OK, because for each
subproblem, it's characterized

425
00:45:19,000 --> 00:45:24,000
by an i and a j.
And i goes from one to m,

426
00:45:24,000 --> 00:45:27,000
and j goes from one to n,
OK?

427
00:45:27,000 --> 00:45:34,000
There aren't that many
different subproblems.

428
00:45:34,000 --> 00:45:36,000
It's just the product of the
two.
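
The gap between the size of the recursion tree and the number of distinct subproblems can be checked with a small experiment. What follows is a minimal Python sketch, not the code from the board: the name `count_lcs_calls` and the test strings are assumptions made for illustration, and it counts the base cases (i = 0 or j = 0) as subproblems as well, so the bound on distinct subproblems is (m+1)(n+1) rather than m times n.

```python
def count_lcs_calls(x, y):
    """Run the plain recursive LCS-length recurrence and count the work.

    Returns (length, total_calls, distinct_subproblems).
    """
    calls = 0
    distinct = set()

    def lcs(i, j):
        # lcs(i, j) is the LCS length of the prefixes x[:i] and y[:j].
        nonlocal calls
        calls += 1
        distinct.add((i, j))
        if i == 0 or j == 0:
            return 0
        if x[i - 1] == y[j - 1]:
            return lcs(i - 1, j - 1) + 1
        return max(lcs(i - 1, j), lcs(i, j - 1))

    length = lcs(len(x), len(y))
    return length, calls, len(distinct)


if __name__ == "__main__":
    # Worst case for the recursion: no characters ever match.
    length, calls, distinct = count_lcs_calls("aaaaaaaa", "bbbbbbbb")
    print(length, calls, distinct)
```

With two strings that never match, every call branches twice, so the call count grows exponentially while the set of distinct (i, j) pairs stays bounded by the size of the table.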

429
00:45:36,000 --> 00:45:41,000
So, here's an improved
algorithm, which is often a good

430
00:45:41,000 --> 00:45:45,000
way to solve it.
It's an algorithm called a

431
00:45:45,000 --> 00:45:48,000
memo-ization algorithm.

432
00:45:56,000 --> 00:46:02,000
And, this is memo-ization,
not memorization because what

433
00:46:02,000 --> 00:46:09,000
you're going to do is make a
little memo whenever you solve a

434
00:46:09,000 --> 00:46:14,000
subproblem.
Make a little memo that says I

435
00:46:14,000 --> 00:46:19,000
solved this already.
And if ever you are asked for

436
00:46:19,000 --> 00:46:25,000
it rather than recalculating it,
say, oh, I see that.

437
00:46:25,000 --> 00:46:30,000
I did that before.
Here's the answer,

438
00:46:30,000 --> 00:46:32,000
OK?
So, here's the code.

439
00:46:32,000 --> 00:46:40,000
It's very similar to that code.
So, it basically keeps a table

440
00:46:40,000 --> 00:46:44,000
around of c[i,j].
It says, what we do is we

441
00:46:44,000 --> 00:46:47,000
check.
If the entry for c[i,j] is nil,

442
00:46:47,000 --> 00:46:51,000
we haven't computed it,
then we compute it.

443
00:46:51,000 --> 00:46:55,000
And, how do we compute it?
Just the same way we did

444
00:46:55,000 --> 00:46:57,000
before.

445
00:47:34,000 --> 00:47:45,000
OK, so this whole part here,
OK, is exactly what we have had

446
00:47:45,000 --> 00:47:51,000
before.
It's the same as before.

447
00:47:51,000 --> 00:47:59,000
And then, we just return
c[i,j].

448
00:47:59,000 --> 00:48:03,000
So we don't bother to keep
recalculating,

449
00:48:03,000 --> 00:48:07,000
OK, so if it's nil,
we calculate it.

450
00:48:07,000 --> 00:48:12,000
Otherwise, we just return it.
If it's not calculated,

451
00:48:12,000 --> 00:48:18,000
calculate and return it.
Otherwise, just return it:

452
00:48:18,000 --> 00:48:21,000
OK, pretty straightforward
code.

453
00:48:21,000 --> 00:48:23,000
OK.
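
The memoized procedure described above can be sketched as follows. This is an illustrative Python version under stated assumptions, not a transcription of the board: `lcs_length_memo` is a hypothetical name, Python's `None` stands in for nil, the table has an extra row and column for the empty prefix, and the strings in the usage line are the textbook pair ABCBDAB / BDCABA, which may or may not be the ones used in the lecture.

```python
def lcs_length_memo(x, y):
    """Top-down LCS length with a memo table c[i][j] (None means 'not computed')."""
    m, n = len(x), len(y)
    c = [[None] * (n + 1) for _ in range(m + 1)]

    def lcs(i, j):
        if c[i][j] is None:
            # Not computed yet: compute it exactly as the plain recursion would.
            if i == 0 or j == 0:
                c[i][j] = 0
            elif x[i - 1] == y[j - 1]:
                c[i][j] = lcs(i - 1, j - 1) + 1
            else:
                c[i][j] = max(lcs(i - 1, j), lcs(i, j - 1))
        # Either way, the memo now holds the answer.
        return c[i][j]

    return lcs(m, n)


if __name__ == "__main__":
    print(lcs_length_memo("ABCBDAB", "BDCABA"))
```

The recursion depth is at most m + n, so for long inputs in Python the bottom-up form is the safer choice.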

454
00:48:33,000 --> 00:48:38,000
OK, now the tricky thing is how
much time does it take to

455
00:48:38,000 --> 00:48:40,000
execute this?

456
00:48:58,000 --> 00:49:04,000
This takes a little bit of
thinking.

457
00:49:04,000 --> 00:49:10,000
Yeah?
Yeah, it takes order MN.

458
00:49:10,000 --> 00:49:18,000
OK, why is that?
Yeah, but I have to look up

459
00:49:18,000 --> 00:49:25,000
c[i,j].
I might call c[i,j] a bunch of

460
00:49:25,000 --> 00:49:29,000
times.
When I'm doing this,

461
00:49:29,000 --> 00:49:38,000
I'm still calling it
recursively.

462
00:49:38,000 --> 00:49:43,000
Yeah, so you have to,
so each recursive call is going

463
00:49:43,000 --> 00:49:50,000
to look at, in the worst case,
say, is going to look at the

464
00:49:50,000 --> 00:49:55,000
max of these two things.
Well, this is going to involve

465
00:49:55,000 --> 00:50:00,000
a recursive call,
and a lookup.

466
00:50:00,000 --> 00:50:05,000
So, this might take a fair
amount of effort to calculate.

467
00:50:05,000 --> 00:50:09,000
I mean, you're right,
and your intuition is right.

468
00:50:09,000 --> 00:50:13,000
Let's see if we can get a more
precise argument,

469
00:50:13,000 --> 00:50:17,000
why this is taking order m
times n.

470
00:50:17,000 --> 00:50:21,000
What's going on here?
Because not every time I call

471
00:50:21,000 --> 00:50:27,000
this is it going to just take me
a constant amount of work to do

472
00:50:27,000 --> 00:50:30,000
this.
Sometimes it's going to take me

473
00:50:30,000 --> 00:50:34,000
a lot of work.
Sometimes I get lucky,

474
00:50:34,000 --> 00:50:41,000
and I return it.
So, your intuition is dead on.

475
00:50:41,000 --> 00:50:47,000
It's dead on.
We just need a little bit more

476
00:50:47,000 --> 00:50:55,000
articulate explanation,
so that everybody is on board.

477
00:50:55,000 --> 00:51:01,000
Try again?
Good, at most three times,

478
00:51:01,000 --> 00:51:04,000
yeah.
OK, so that's one way to look

479
00:51:04,000 --> 00:51:05,000
at it.
Yeah.

480
00:51:05,000 --> 00:51:09,000
There is another way to look at
it that's kind of what you are

481
00:51:09,000 --> 00:51:12,000
expressing there: an
amortized, a bookkeeping,

482
00:51:12,000 --> 00:51:15,000
way of looking at this.
What's the amortized cost?

483
00:51:15,000 --> 00:51:18,000
You could ask what the
amortized cost is of calculating

484
00:51:18,000 --> 00:51:21,000
one of these,
where basically whenever I call

485
00:51:21,000 --> 00:51:24,000
it, I'm going to charge a
constant amount for looking up.

486
00:51:24,000 --> 00:51:28,000
And so, I could get to look up
whatever is in here to call the

487
00:51:28,000 --> 00:51:31,000
things.
But if it, in fact,

488
00:51:31,000 --> 00:51:35,000
so in some sense,
this charge here,

489
00:51:35,000 --> 00:51:41,000
of calling it and returning it,
etc., I charge that to my

490
00:51:41,000 --> 00:51:44,000
caller.
OK, so I charge these lines

491
00:51:44,000 --> 00:51:50,000
and this line to the caller.
And I charge the rest of these

492
00:51:50,000 --> 00:51:55,000
lines to the c[i,j] element.
And then, the point is that

493
00:51:55,000 --> 00:52:02,000
every caller basically only ends
up being charged for a constant

494
00:52:02,000 --> 00:52:07,000
amount of stuff.
OK, to calculate one c[i,j],

495
00:52:07,000 --> 00:52:11,000
it's only an amortized constant
amount of stuff that I'm

496
00:52:11,000 --> 00:52:16,000
charging to that calculation of
i and j, that calculation of i

497
00:52:16,000 --> 00:52:19,000
and j.
OK, so you can view it in terms

498
00:52:19,000 --> 00:52:23,000
of amortized analysis doing a
bookkeeping argument that just

499
00:52:23,000 --> 00:52:27,000
says, let me charge enough to
calculate my own,

500
00:52:27,000 --> 00:52:32,000
do all my own local things plus
enough to look up the value in

501
00:52:32,000 --> 00:52:36,000
the next level and get it
returned.

502
00:52:36,000 --> 00:52:40,000
OK, and then if it has to go
off and calculate,

503
00:52:40,000 --> 00:52:46,000
well, that's OK because that's
all been charged to a different

504
00:52:46,000 --> 00:52:50,000
ij at that point.
So, every cell only costs me a

505
00:52:50,000 --> 00:52:56,000
constant amount of time, and with
order MN cells, that's a total of order

506
00:52:56,000 --> 00:53:00,000
MN.
OK: constant work per entry.
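
The bookkeeping argument can be checked empirically. The sketch below is an assumption-laden illustration (the name `memo_lcs_with_counts` and its counters are hypothetical): it counts how many calls are made in total and how many table entries are actually computed. Each computed entry issues at most two recursive calls, so the total number of calls is at most one (the top-level call) plus twice the number of computed entries.

```python
def memo_lcs_with_counts(x, y):
    """Memoized LCS length, instrumented to count calls and computed entries.

    Returns (length, total_calls, computed_entries).
    """
    m, n = len(x), len(y)
    c = [[None] * (n + 1) for _ in range(m + 1)]
    stats = {"calls": 0, "computed": 0}

    def lcs(i, j):
        stats["calls"] += 1          # charged to the caller: call, lookup, return
        if c[i][j] is None:
            stats["computed"] += 1   # charged to the entry c[i][j] itself
            if i == 0 or j == 0:
                c[i][j] = 0
            elif x[i - 1] == y[j - 1]:
                c[i][j] = lcs(i - 1, j - 1) + 1
            else:
                c[i][j] = max(lcs(i - 1, j), lcs(i, j - 1))
        return c[i][j]

    length = lcs(m, n)
    return length, stats["calls"], stats["computed"]


if __name__ == "__main__":
    print(memo_lcs_with_counts("aaaaaaaa", "bbbbbbbb"))
```

Since no entry is computed more than once, both counters are bounded by a constant times the table size, which is the order m times n bound.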

507
00:53:00,000 --> 00:53:04,000
OK, and you can sort of use an
amortized analysis to argue

508
00:53:04,000 --> 00:53:07,000
that.
How much space does it take?

509
00:53:07,000 --> 00:53:12,000
We haven't usually looked at
space, but here we are going to

510
00:53:12,000 --> 00:53:15,000
start looking at space.
That turns out,

511
00:53:15,000 --> 00:53:20,000
for some of these algorithms,
to be really important.

512
00:53:20,000 --> 00:53:23,000
How much space do I need,
storage space?

513
00:53:23,000 --> 00:53:28,000
Yeah, also m times n,
OK, to store the c[i,j] table.

514
00:53:28,000 --> 00:53:30,000
OK, the rest,
storing x and y,

515
00:53:30,000 --> 00:53:35,000
OK, that's just m plus n.
So, that's negligible,

516
00:53:35,000 --> 00:53:37,000
but mostly I need the space m
times n.

517
00:53:37,000 --> 00:53:41,000
So, this memo-ization type
algorithm is a really good

518
00:53:41,000 --> 00:53:44,000
strategy in programming for many
things where,

519
00:53:44,000 --> 00:53:48,000
when you have the same
parameters, you're going to get

520
00:53:48,000 --> 00:53:51,000
the same results.
It doesn't work in programs

521
00:53:51,000 --> 00:53:53,000
where you have a side effect,
necessarily,

522
00:53:53,000 --> 00:53:57,000
that is, when the calculation
for a given set of parameters

523
00:53:57,000 --> 00:54:03,000
might be different on each call.
But for something which is

524
00:54:03,000 --> 00:54:08,000
essentially like a functional
programming type of environment,

525
00:54:08,000 --> 00:54:13,000
then if you've calculated it
once, you can look it up.

526
00:54:13,000 --> 00:54:19,000
And, so this is very helpful.
But, it takes a fair amount of

527
00:54:19,000 --> 00:54:24,000
space, and it also doesn't
proceed in a very orderly way.

528
00:54:24,000 --> 00:54:29,000
So, there is another strategy
for doing exactly the same

529
00:54:29,000 --> 00:54:34,000
calculation in a bottom-up way.
And that's what we call dynamic

530
00:54:34,000 --> 00:54:42,000
programming.
OK, the idea is to compute the

531
00:54:42,000 --> 00:54:49,000
table bottom-up.
I think I'm going to get rid

532
00:54:49,000 --> 00:54:56,000
of, I think what we'll do is
we'll just use,

533
00:54:56,000 --> 00:55:07,000
actually I think what I'm going
to do is use this board.

534
00:55:33,000 --> 00:55:38,000
OK, so here's the idea.
What we're going to do is look

535
00:55:38,000 --> 00:55:45,000
at the c[i,j] table and realize
that there's actually an orderly

536
00:55:45,000 --> 00:55:51,000
way of filling in the table.
This is sort of a top-down with

537
00:55:51,000 --> 00:55:55,000
memo-ization.
OK, but there's actually a way

538
00:55:55,000 --> 00:56:00,000
we can do it bottom up.
So, here's the idea.

539
00:56:00,000 --> 00:56:07,000
So, let's make our table.
OK, so there's x.

540
00:56:07,000 --> 00:56:18,000
And then, there's y.
And, I'm going to initialize

541
00:56:18,000 --> 00:56:28,000
the empty string.
I didn't cover the base cases

542
00:56:28,000 --> 00:56:39,000
for c[i,j], but c of zero
meaning a prefix with no

543
00:56:39,000 --> 00:56:45,000
elements in it.
The prefix of that with

544
00:56:45,000 --> 00:56:48,000
anything else,
the length is zero.

545
00:56:48,000 --> 00:56:53,000
OK, so that's basically how I'm
going to bound the borders here.

546
00:56:53,000 --> 00:56:57,000
And now, what I can do is just
use my formula,

547
00:56:57,000 --> 00:57:00,000
which I've conveniently erased
up there, OK,

548
00:57:00,000 --> 00:57:04,000
to compute what is the longest
common subsequence,

549
00:57:04,000 --> 00:57:09,000
length of the longest common
subsequence from this character

550
00:57:09,000 --> 00:57:15,000
in y, and this character in x up
to this character.

551
00:57:15,000 --> 00:57:19,000
So here, for example,
they don't match.

552
00:57:19,000 --> 00:57:24,000
So, it's the maximum of these
two values.

553
00:57:24,000 --> 00:57:29,000
Here, they do match.
OK, so it says it's one plus

554
00:57:29,000 --> 00:57:34,000
the value here.
And, I'm going to draw a line.

555
00:57:34,000 --> 00:57:38,000
Whenever I'm going to get a
match, I'm going to draw a line

556
00:57:38,000 --> 00:57:41,000
like that, indicating that I had
that first case,

557
00:57:41,000 --> 00:57:44,000
the case where they had a good
match.

558
00:57:44,000 --> 00:57:47,000
And so, all I'm doing is
applying that recursive formula

559
00:57:47,000 --> 00:57:52,000
from the theorem that we proved.
So here, it's basically they

560
00:57:52,000 --> 00:57:54,000
don't match.
So, it's the maximum of those

561
00:57:54,000 --> 00:57:56,000
two.
Here, they match.

562
00:57:56,000 --> 00:58:01,000
So, it's one plus that guy.
Here, they don't match.

563
00:58:01,000 --> 00:58:06,000
So, it's basically the maximum
of these two.

564
00:58:06,000 --> 00:58:11,000
Here, they don't match.
So it's the maximum.

565
00:58:11,000 --> 00:58:17,000
Here, they match, so it's one plus that guy.
So, everybody understand how I

566
00:58:17,000 --> 00:58:23,000
filled out that first row?
OK, well, with that you guys can

567
00:58:23,000 --> 00:58:27,000
help.
OK, so this one is what?

568
00:58:27,000 --> 00:58:32,000
Just call it out.
Zero, good.

569
00:58:32,000 --> 00:58:41,000
One, because it's the maximum,
one, two, right.

570
00:58:41,000 --> 00:58:47,000
This one, now,
gets from there,

571
00:58:47,000 --> 00:58:52,000
two, two.
OK, here, zero,

572
00:58:52,000 --> 00:59:03,000
one, because it's the maximum
of those two.

573
00:59:03,000 --> 00:59:15,000
Two, two, two,
good.

574
00:59:15,000 --> 00:59:34,000
One, one, two,
two, two, three,

575
00:59:34,000 --> 00:59:48,000
three.
One, two, three,

576
00:59:48,000 --> 01:00:00,250
get that line,
three, four,

577
01:00:00,250 --> 01:00:05,974
OK.
One there, three,

578
01:00:05,974 --> 01:00:10,000
three, four,
good, four.

579
01:00:10,000 --> 01:00:14,199
OK, and our answer:
four.
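
The row-by-row filling of the table can be sketched in Python as follows. This is a minimal illustration, assuming the hypothetical name `lcs_table`, a table indexed as c[i][j] with x along the rows (the board may have the axes the other way around), and the textbook strings ABCBDAB and BDCABA as sample input.

```python
def lcs_table(x, y):
    """Bottom-up LCS: c[i][j] is the LCS length of x[:i] and y[:j]."""
    m, n = len(x), len(y)
    # Row 0 and column 0 are the empty-prefix border, initialized to zero.
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                # Match: one plus the diagonal neighbor.
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                # No match: the larger of the entry above and the entry to the left.
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    return c


if __name__ == "__main__":
    table = lcs_table("ABCBDAB", "BDCABA")
    for row in table:
        print(row)
    print("LCS length:", table[-1][-1])
```

The two nested loops touch each of the (m+1)(n+1) entries once and do constant work per entry.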

580
01:00:14,199 --> 01:00:23,125
So this is blindingly fast code
if you code this up,

581
01:00:23,125 --> 01:00:33,275
OK, because it gets to use the
fact that modern machines in

582
01:00:33,275 --> 01:00:45,000
particular do very well on
regular strides through memory.

583
01:00:45,000 --> 01:00:50,012
So, if you're just plowing
through memory across like this,

584
01:00:50,012 --> 01:00:55,024
OK, and your two-dimensional
array is stored in that order,

585
01:00:55,024 --> 01:00:58,308
which it is,
otherwise you go this way,

586
01:00:58,308 --> 01:01:02,802
stored in that order.
This can really fly in terms of

587
01:01:02,802 --> 01:01:11,948
the speed of the calculation.
So, how much time did it take

588
01:01:11,948 --> 01:01:17,897
us to do this?
Yeah, order MN,

589
01:01:17,897 --> 01:01:20,769
theta MN.
Yeah?

590
01:01:20,769 --> 01:01:30,000
We'll talk about space in just
a minute.

591
01:01:30,000 --> 01:01:33,875
OK, so hold that question.
Good question,

592
01:01:33,875 --> 01:01:36,491
good question,
already, wow,

593
01:01:36,491 --> 01:01:40,657
good, OK, how do I now figure
out, remember,

594
01:01:40,657 --> 01:01:46,179
we had the simplification.
We were going to just calculate

595
01:01:46,179 --> 01:01:49,764
the length.
OK, it turns out I can now

596
01:01:49,764 --> 01:01:54,415
figure out a particular sequence
that matches it.

597
01:01:54,415 --> 01:01:58,000
And basically,
I do that.

598
01:01:58,000 --> 01:02:04,932
I can reconstruct the longest
common subsequence by tracing

599
01:02:04,932 --> 01:02:09,474
backwards.
So essentially I start here.

600
01:02:09,474 --> 01:02:15,928
Here I have a choice because
this one was dependent on,

601
01:02:15,928 --> 01:02:22,980
since it doesn't have a bar
here, it was dependent on one of

602
01:02:22,980 --> 01:02:28,000
these two.
So, let me go this way.

603
01:02:28,000 --> 01:02:33,444
OK, and now I have a diagonal
element here.

604
01:02:33,444 --> 01:02:41,222
So what I'll do is simply mark
the character that appeared in

605
01:02:41,222 --> 01:02:45,370
those positions as I go this
way.

606
01:02:45,370 --> 01:02:51,203
I have three here.
And now, let me keep going,

607
01:02:51,203 --> 01:02:56,129
three here, and now I have
another one.

608
01:02:56,129 --> 01:03:03,000
So that means this character
gets selected.

609
01:03:03,000 --> 01:03:08,632
And then I go up to here,
OK, and then up to here.

610
01:03:08,632 --> 01:03:15,643
And now I go diagonally again,
which means that this character

611
01:03:15,643 --> 01:03:18,977
is selected.
And I go to here,

612
01:03:18,977 --> 01:03:24,724
and then I go here.
And then, I go up here and this

613
01:03:24,724 --> 01:03:30,471
character is selected.
So here is my longest common

614
01:03:30,471 --> 01:03:35,098
subsequence.
And this was just one path

615
01:03:35,098 --> 01:03:37,843
back.
I could have gone a path like

616
01:03:37,843 --> 01:03:42,203
this and gotten a different
longest common subsequence.
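
The backward trace can be sketched as below. This is one possible Python rendering under assumptions: `lcs_table` and `reconstruct_lcs` are hypothetical names, no explicit back pointers are stored (the direction is re-derived from the table values), and ties are broken by moving up, which is an arbitrary choice. A different tie-breaking rule can yield a different, equally long subsequence.

```python
def lcs_table(x, y):
    """Bottom-up table: c[i][j] is the LCS length of x[:i] and y[:j]."""
    m, n = len(x), len(y)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    return c


def reconstruct_lcs(x, y):
    """Trace back from the bottom-right corner to recover one LCS."""
    c = lcs_table(x, y)
    i, j = len(x), len(y)
    picked = []
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            # Diagonal step: this character is part of the subsequence.
            picked.append(x[i - 1])
            i -= 1
            j -= 1
        elif c[i - 1][j] >= c[i][j - 1]:
            i -= 1  # the value came from the entry above (ties go up)
        else:
            j -= 1  # the value came from the entry to the left
    # Characters were collected back to front.
    return "".join(reversed(picked))


if __name__ == "__main__":
    print(reconstruct_lcs("ABCBDAB", "BDCABA"))
```

The trace takes at most m + n steps, since every step decreases i, j, or both.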

617
01:03:42,203 --> 01:03:45,997
OK, so that simplification of
just saying, look,

618
01:03:45,997 --> 01:03:49,468
let me just run backwards and
figure it out,

619
01:03:49,468 --> 01:03:53,989
that's actually pretty good
because it means that by just

620
01:03:53,989 --> 01:03:58,026
calculating the value,
then figuring out these back

621
01:03:58,026 --> 01:04:04,000
pointers to let me reconstruct
it is a fairly simple process.

622
01:04:04,000 --> 01:04:10,075
OK, if I had to think about
that to begin with,

623
01:04:10,075 --> 01:04:14,962
it would have been a much
bigger mess.

624
01:04:14,962 --> 01:04:19,452
OK, so the space,
I just mentioned,

625
01:04:19,452 --> 01:04:25,264
was order MN because we still
need the table.

626
01:04:25,264 --> 01:04:32,000
So, you can actually do the min
of m and n.

627
01:04:32,000 --> 01:04:37,970
OK, to get to your question,
how do you do the min of m and

628
01:04:37,970 --> 01:04:41,367
n?
Diagonal stripes won't give you

629
01:04:41,367 --> 01:04:45,897
min of m and n.
That'll give you the sum of m

630
01:04:45,897 --> 01:04:48,676
and n.
So, going in stripes,

631
01:04:48,676 --> 01:04:53,308
maybe I'm not quite sure I know
what you mean.

632
01:04:53,308 --> 01:04:58,250
So, you're saying,
so what's the order I would do

633
01:04:58,250 --> 01:05:01,661
here?
So, I would start.

634
01:05:01,661 --> 01:05:06,461
I would do this one first.
Then which one would I do?

635
01:05:06,461 --> 01:05:10,246
This one and this one?
And then, this one,

636
01:05:10,246 --> 01:05:12,923
this one, this one,
like this?

637
01:05:12,923 --> 01:05:18,000
That's a perfectly good order.
OK, and so you're saying,

638
01:05:18,000 --> 01:05:22,800
then, so I'm keeping the
diagonal there all the time.

639
01:05:22,800 --> 01:05:28,615
So, you're saying the length of
the diagonal is the min of m and

640
01:05:28,615 --> 01:05:31,633
n?
I think that's right.

641
01:05:31,633 --> 01:05:36,068
OK, there is another way you
can do it that's a little bit

642
01:05:36,068 --> 01:05:39,881
more straightforward,
which is you compare m to n.

643
01:05:39,881 --> 01:05:42,993
Whichever is smaller,
well, first of all,

644
01:05:42,993 --> 01:05:45,871
let's just do this existing
algorithm.

645
01:05:45,871 --> 01:05:50,228
If I just simply did row by
row, I don't need more than a

646
01:05:50,228 --> 01:05:53,418
previous row.
OK, I just need one row at a

647
01:05:53,418 --> 01:05:56,141
time.
So, I can go ahead and compute

648
01:05:56,141 --> 01:06:00,421
just one row because once I
computed the succeeding row,

649
01:06:00,421 --> 01:06:04,910
the first row is unimportant.
And in fact,

650
01:06:04,910 --> 01:06:07,263
I don't even need the whole
row.

651
01:06:07,263 --> 01:06:10,754
All I need is just the current
row that I'm on,

652
01:06:10,754 --> 01:06:14,093
plus one or two elements of the
previous row,

653
01:06:14,093 --> 01:06:16,522
plus the end of the previous
row.

654
01:06:16,522 --> 01:06:20,848
So, I use a prefix of this row,
and an extra two elements,

655
01:06:20,848 --> 01:06:24,263
and the suffix of this row.
So, it's actually,

656
01:06:24,263 --> 01:06:28,058
you can do it with one row,
plus order one elements.

657
01:06:28,058 --> 01:06:32,535
And then, I could do it either
running vertically or running

658
01:06:32,535 --> 01:06:35,495
horizontally,
whichever one gives me the

659
01:06:35,495 --> 01:06:40,303
smaller space.
OK, and it might be that your

660
01:06:40,303 --> 01:06:43,084
diagonal trick would work there
too.

661
01:06:43,084 --> 01:06:45,785
I'd have to think about that.
Yeah?
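
The one-row idea can be sketched like this. It is a minimal Python illustration, assuming the hypothetical name `lcs_length_small_space`; it keeps a single row plus two scalars (`diag` and `up`) and swaps the strings so the row runs along the shorter one. It computes only the length, not the subsequence itself.

```python
def lcs_length_small_space(x, y):
    """LCS length using one row of min(m, n) + 1 entries plus O(1) extra."""
    if len(y) > len(x):
        x, y = y, x  # make y the shorter string, so the row is as short as possible
    row = [0] * (len(y) + 1)  # row[j] holds c[i-1][j] until it is overwritten
    for i in range(1, len(x) + 1):
        diag = 0  # c[i-1][j-1]; for j = 1 this is the border value c[i-1][0] = 0
        for j in range(1, len(y) + 1):
            up = row[j]  # c[i-1][j], saved before it is overwritten
            if x[i - 1] == y[j - 1]:
                row[j] = diag + 1
            else:
                row[j] = max(up, row[j - 1])  # row[j-1] is already c[i][j-1]
            diag = up  # becomes c[i-1][j-1] for the next column
    return row[len(y)]


if __name__ == "__main__":
    print(lcs_length_small_space("ABCBDAB", "BDCABA"))
```

Because earlier rows are overwritten, the backward trace is no longer possible from this table alone, which is what makes reconstructing the subsequence in small space a harder problem.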

662
01:06:45,785 --> 01:06:50,392
Ooh, that's a good question.
So, you can do the calculation

663
01:06:50,392 --> 01:06:53,570
of the length,
in one row plus order one

664
01:06:53,570 --> 01:06:57,415
elements.
OK, and our exercise,

665
01:06:57,415 --> 01:07:04,203
and this is a hard exercise,
OK, so a good one to do, is

666
01:07:04,203 --> 01:07:11,221
to do small space and allow you
to reconstruct the LCS because

667
01:07:11,221 --> 01:07:18,469
the naïve way that we were just
doing it, it's not clear how you

668
01:07:18,469 --> 01:07:24,336
would go backwards from that
because you've lost the

669
01:07:24,336 --> 01:07:29,168
information.
OK, so this is actually a very

670
01:07:29,168 --> 01:07:37,182
interesting and tricky problem.
And, it turns out it succumbs

671
01:07:37,182 --> 01:07:43,329
of all things to divide and
conquer, OK, rather than some

672
01:07:43,329 --> 01:07:47,060
more straightforward tabular
thing.

673
01:07:47,060 --> 01:07:51,231
OK: so very good practice,
for example,

674
01:07:51,231 --> 01:07:57,268
for the upcoming take home
quiz, OK, which is all design

675
01:07:57,268 --> 01:08:03,493
and cleverness type quiz.
OK, so this is a good one for

676
01:08:03,493 --> 01:08:07,191
people to take on.
So, this is basically the

677
01:08:07,191 --> 01:08:11,319
tabular method that's called
dynamic programming.

678
01:08:11,319 --> 01:08:16,479
OK, memo-ization is not dynamic
programming, even though it's

679
01:08:16,479 --> 01:08:18,714
related.
It's memo-ization.

680
01:08:18,714 --> 01:08:23,788
And, we're going to see a whole
bunch of other problems that

681
01:08:23,788 --> 01:08:27,314
succumb to dynamic programming
approaches.

682
01:08:27,314 --> 01:08:31,098
It's a very cool method,
and on the homework,

683
01:08:31,098 --> 01:08:36,000
so let me just mention the
homework again.

684
01:08:36,000 --> 01:08:38,216
On the homework,
we're going to look at a

685
01:08:38,216 --> 01:08:40,434
problem called the edit distance
problem.

686
01:08:40,434 --> 01:08:42,763
Edit distance is you are given
two strings.

687
01:08:42,763 --> 01:08:46,256
And you can imagine that you're
typing in a keyboard with one of

688
01:08:46,256 --> 01:08:48,862
the strings there.
And what you have to do is by

689
01:08:48,862 --> 01:08:50,303
doing inserts,
and deletes,

690
01:08:50,303 --> 01:08:52,631
and replaces,
and moving the cursor around,

691
01:08:52,631 --> 01:08:55,182
you've got to transform one
string to the next.

692
01:08:55,182 --> 01:08:57,399
And, each of those operations
has a cost.

693
01:08:57,399 --> 01:09:00,671
And your job is to minimize the
cost of transforming the one

694
01:09:00,671 --> 01:09:05,565
string into the other.
This actually turns out also to

695
01:09:05,565 --> 01:09:09,537
be useful for computational
biology applications.

696
01:09:09,537 --> 01:09:12,600
And, in fact,
there have been editors,

697
01:09:12,600 --> 01:09:14,917
screen editors,
text editors,

698
01:09:14,917 --> 01:09:19,881
that implement algorithms of
this nature in order to minimize

699
01:09:19,881 --> 01:09:24,931
the number of characters that
have to be sent as IO in and out

700
01:09:24,931 --> 01:09:28,568
of the system.
So, the warning is,

701
01:09:28,568 --> 01:09:33,274
you better get going on your
programming on problem one on

702
01:09:33,274 --> 01:09:37,816
the homework today if at all
possible because whenever I

703
01:09:37,816 --> 01:09:41,862
assign programming,
since we don't do that as sort

704
01:09:41,862 --> 01:09:45,660
of a routine thing,
I'm just concerned for some

705
01:09:45,660 --> 01:09:50,283
people that they will not be
able to get things like the

706
01:09:50,283 --> 01:09:53,422
input and output to work,
and so forth.

707
01:09:53,422 --> 01:09:57,550
We have example problems,
and such, on the website.

708
01:09:57,550 --> 01:10:00,853
And we also have,
you can write it in any

709
01:10:00,853 --> 01:10:03,743
language you want,
including Matlab,

710
01:10:03,743 --> 01:10:08,697
Python, whatever your favorite,
the solutions will be written

711
01:10:08,697 --> 01:10:14,425
in Java and Python.
OK, so the fastest solutions

712
01:10:14,425 --> 01:10:19,188
are likely to be written in C.
OK, you can also do it in

713
01:10:19,188 --> 01:10:21,960
assembly language if you care
to.

714
01:10:21,960 --> 01:10:24,905
You laugh.
I used to be an assembly

715
01:10:24,905 --> 01:10:28,716
language programmer back in the
days of yore.

716
01:10:28,716 --> 01:10:34,086
OK, so I do encourage people to
get started on this because let

717
01:10:34,086 --> 01:10:39,370
me mention, the other thing is
that this particular problem on

718
01:10:39,370 --> 01:10:45,000
this problem set is an
absolutely mandatory problem.

719
01:10:45,000 --> 01:10:49,662
OK, all the problems are
mandatory, but as you know you

720
01:10:49,662 --> 01:10:54,583
can skip them and it doesn't
hurt you too much if you only

721
01:10:54,583 --> 01:10:57,605
skip one or two.
This one, if you skip it,

722
01:10:57,605 --> 01:11:00,367
hurts big time:
one letter grade.

723
01:11:00,367 --> 01:11:03,000
It must be done.