1
00:00:00,070 --> 00:00:01,780
The following
content is provided

2
00:00:01,780 --> 00:00:04,019
under a Creative
Commons license.

3
00:00:04,019 --> 00:00:06,870
Your support will help MIT
OpenCourseWare continue

4
00:00:06,870 --> 00:00:10,730
to offer high quality
educational resources for free.

5
00:00:10,730 --> 00:00:13,340
To make a donation or
view additional materials

6
00:00:13,340 --> 00:00:14,960
from hundreds of
MIT courses, visit

7
00:00:14,960 --> 00:00:16,210
mitopencourseware@ocw.mit.edu.

8
00:00:26,430 --> 00:00:28,630
PROFESSOR: All right,
so let's get started.

9
00:00:28,630 --> 00:00:34,800
So today we're going to
review local alignment,

10
00:00:34,800 --> 00:00:39,270
we talked about last time, and
introduce global alignment,

11
00:00:39,270 --> 00:00:42,810
also talking about
issues related

12
00:00:42,810 --> 00:00:45,340
to protein sequences,
which include

13
00:00:45,340 --> 00:00:49,670
some more interesting
scoring matrices.

14
00:00:49,670 --> 00:00:54,980
So just some info on topic
one, which we are still in.

15
00:00:54,980 --> 00:00:57,082
So I will have an
overview slide.

16
00:00:57,082 --> 00:00:58,707
It'll have a blue
background, and there

17
00:00:58,707 --> 00:01:00,665
will be a review slide
with a purple background

18
00:01:00,665 --> 00:01:01,870
in every lecture.

19
00:01:01,870 --> 00:01:06,440
So last time we talked about
local alignment and some

20
00:01:06,440 --> 00:01:08,990
of the statistics
associated with that,

21
00:01:08,990 --> 00:01:13,210
and also a little bit about
sequencing, technologies,

22
00:01:13,210 --> 00:01:16,890
both conventional
Sanger DNA sequencing

23
00:01:16,890 --> 00:01:20,070
as well as second
generation sequencing.

24
00:01:20,070 --> 00:01:22,680
And at the beginning of the
local alignment section,

25
00:01:22,680 --> 00:01:25,900
we introduced a simple
BLAST-like algorithm,

26
00:01:25,900 --> 00:01:28,750
and then we talked about
statistics, target frequencies,

27
00:01:28,750 --> 00:01:30,940
mismatched penalties,
that sort of thing.

28
00:01:33,630 --> 00:01:36,660
So there were a couple questions
at the end which I just

29
00:01:36,660 --> 00:01:38,260
wanted to briefly answer.

30
00:01:38,260 --> 00:01:45,500
So I believe it was Joe asked
about how the dye is attached

31
00:01:45,500 --> 00:01:50,070
to the DNTP in dye
terminator sequencing.

32
00:01:50,070 --> 00:01:51,530
And it appears
that it's attached

33
00:01:51,530 --> 00:01:54,950
to the base, sort of the
backside of the base, not

34
00:01:54,950 --> 00:01:57,562
the Watson-Crick
face, obviously.

35
00:01:57,562 --> 00:01:59,520
That seems to be the
common way that it's done.

36
00:01:59,520 --> 00:02:01,140
And then there was
another question

37
00:02:01,140 --> 00:02:02,690
from somebody in the back.

38
00:02:02,690 --> 00:02:08,520
I don't remember who asked about
when you're making libraries,

39
00:02:08,520 --> 00:02:11,480
how do you make sure that
each of your insert sequences

40
00:02:11,480 --> 00:02:14,310
has the two different adapters,
one adaptor on one side

41
00:02:14,310 --> 00:02:16,610
and the other adapter
on the other side?

42
00:02:16,610 --> 00:02:20,040
And there are at least
three ways to do this.

43
00:02:20,040 --> 00:02:22,650
So simplest is in
RNA ligation, when

44
00:02:22,650 --> 00:02:25,235
you take advantage of the
different chemistry at the five

45
00:02:25,235 --> 00:02:27,604
prime and three prime
ends of the small RNA

46
00:02:27,604 --> 00:02:28,770
that you're trying to clone.

47
00:02:28,770 --> 00:02:32,920
So you just use the
phosphate and NLH

48
00:02:32,920 --> 00:02:34,850
to ligate two
different adapters.

49
00:02:34,850 --> 00:02:39,450
Another more complicated way
occurs in ribosome footprint

50
00:02:39,450 --> 00:02:43,640
profiling, which is a method for
mapping the precise locations

51
00:02:43,640 --> 00:02:47,410
of ribosomes along
mRNAs, and involves polyA

52
00:02:47,410 --> 00:02:53,040
tailing, and then
introducing the adapters

53
00:02:53,040 --> 00:02:57,070
together, the two adapters,
with a polyT primer

54
00:02:57,070 --> 00:02:58,580
that primes off the polyA tail.

55
00:02:58,580 --> 00:03:01,150
And then you circularize, and
then you PCR off the circles.

56
00:03:01,150 --> 00:03:02,566
And it's a little
bit complicated,

57
00:03:02,566 --> 00:03:04,690
but you can look it up
in the reference that's

58
00:03:04,690 --> 00:03:07,390
up here on the slide.

59
00:03:07,390 --> 00:03:08,850
It's working now.

60
00:03:08,850 --> 00:03:11,180
And then, finally,
the way that's

61
00:03:11,180 --> 00:03:14,100
actually most commonly
used for protocols

62
00:03:14,100 --> 00:03:17,780
like RNA seq and
genomic DNA sequencing

63
00:03:17,780 --> 00:03:24,930
is that after you make
your double strand DNA,

64
00:03:24,930 --> 00:03:27,560
there's an enzyme that adds
a single A to the three

65
00:03:27,560 --> 00:03:29,170
prime end of each strand.

66
00:03:29,170 --> 00:03:32,160
So now you have a
symmetrical molecule.

67
00:03:32,160 --> 00:03:38,780
But then you add these funny
Y-shaped adapters that have

68
00:03:38,780 --> 00:03:42,000
and overhanging T on,
say, the red guy here.

69
00:03:42,000 --> 00:03:47,150
And so what will happen is
that each of these Y's can

70
00:03:47,150 --> 00:03:48,250
be ligated here.

71
00:03:48,250 --> 00:03:51,870
But each of the inserts,
independent of which

72
00:03:51,870 --> 00:03:55,520
strand it is, will have a red
adapter at the five prime end

73
00:03:55,520 --> 00:03:59,000
and a blue adaptor at
the three prime end.

74
00:03:59,000 --> 00:04:02,470
Any questions
about this or about

75
00:04:02,470 --> 00:04:08,084
sequencing technologies before
we go to local alignments?

76
00:04:08,084 --> 00:04:08,980
OK, good.

77
00:04:08,980 --> 00:04:12,890
It was a good question,
and that's the answer.

78
00:04:12,890 --> 00:04:16,690
So we motivated our
discussion of local alignments

79
00:04:16,690 --> 00:04:18,750
last time by talking
about this example,

80
00:04:18,750 --> 00:04:22,289
where you have a non-coding
RNA that you found, inhuman.

81
00:04:22,289 --> 00:04:24,580
You BLAST it against mouse,
and you get this alignment.

82
00:04:24,580 --> 00:04:25,430
Is this significant?

83
00:04:25,430 --> 00:04:29,900
So is this really likely to
be a homologous sequence?

84
00:04:29,900 --> 00:04:32,822
And how do you find
the alignments?

85
00:04:32,822 --> 00:04:35,120
And so we said
that, well, there's

86
00:04:35,120 --> 00:04:38,790
this theory that's
exact, at least

87
00:04:38,790 --> 00:04:42,060
exact in the asymptotic sense
for large query and database

88
00:04:42,060 --> 00:04:47,620
sizes that tells us the
statistical significance

89
00:04:47,620 --> 00:04:52,010
of the highest scoring
ungapped local alignment.

90
00:04:52,010 --> 00:04:54,760
And it's given by
this formula here,

91
00:04:54,760 --> 00:04:58,370
which is the extreme value
or Gumbel distribution.

92
00:04:58,370 --> 00:05:00,580
And then we talked
about the constraints

93
00:05:00,580 --> 00:05:02,624
or the expected score
has to be negative,

94
00:05:02,624 --> 00:05:04,915
but positive scores have to
be possible for this theory

95
00:05:04,915 --> 00:05:05,840
to work.

96
00:05:05,840 --> 00:05:08,570
And we also talked
about an algorithm.

97
00:05:08,570 --> 00:05:11,720
But if you remember, the
algorithms was very simple.

98
00:05:11,720 --> 00:05:15,595
It involved-- this
is zero-- keeping

99
00:05:15,595 --> 00:05:17,200
track of the cumulative score.

100
00:05:17,200 --> 00:05:21,360
So we have a mismatch and a
match, mismatch, mismatch,

101
00:05:21,360 --> 00:05:24,071
mismatch, match, match, match.

102
00:05:24,071 --> 00:05:25,820
That is a high scoring
segment, et cetera.

103
00:05:25,820 --> 00:05:29,010
So you keep track of the lowest
point you've ever been to

104
00:05:29,010 --> 00:05:30,219
as well as the current score.

105
00:05:30,219 --> 00:05:32,385
And when the current score
exceeds that lowest point

106
00:05:32,385 --> 00:05:36,850
you've ever been to by more
than we've ever seen before,

107
00:05:36,850 --> 00:05:41,890
more than this, then that's
your high scoring segment.

108
00:05:41,890 --> 00:05:47,600
Now it turns out, if this
is not intuitive to you,

109
00:05:47,600 --> 00:05:52,420
there's another
algorithm which I find,

110
00:05:52,420 --> 00:05:53,580
personally, more intuitive.

111
00:05:53,580 --> 00:05:55,670
So I just want to tell you
about that one as well.

112
00:05:55,670 --> 00:05:57,086
And it's basically
the same thing,

113
00:05:57,086 --> 00:06:01,780
except whenever you go
negative, you reset to zero.

114
00:06:01,780 --> 00:06:05,000
So here, we were
going to go negative,

115
00:06:05,000 --> 00:06:08,560
so we just reset to zero.

116
00:06:08,560 --> 00:06:10,170
That was on this mismatch here.

117
00:06:10,170 --> 00:06:11,460
Then we have a match.

118
00:06:11,460 --> 00:06:12,480
Now we're at plus 1.

119
00:06:12,480 --> 00:06:13,380
That's fine.

120
00:06:13,380 --> 00:06:14,950
Now we have a mismatch.

121
00:06:14,950 --> 00:06:16,220
Now we're down to zero.

122
00:06:16,220 --> 00:06:17,780
We don't need to do anything.

123
00:06:17,780 --> 00:06:19,820
Now we have another mismatch.

124
00:06:19,820 --> 00:06:21,850
Here, we're still at zero.

125
00:06:21,850 --> 00:06:23,680
Remember, we just stay at zero.

126
00:06:23,680 --> 00:06:25,990
We were going to go negative,
but we stayed at zero.

127
00:06:25,990 --> 00:06:28,310
Another mismatch, we
still stay at zero.

128
00:06:28,310 --> 00:06:33,290
And now we have these
three matches in a row.

129
00:06:33,290 --> 00:06:34,920
My line is not
staying very flat.

130
00:06:34,920 --> 00:06:38,970
But this should've
been here flat at zero.

131
00:06:38,970 --> 00:06:44,700
The point is that now the
highest scoring segment

132
00:06:44,700 --> 00:06:46,920
is the highest point
you ever reach.

133
00:06:46,920 --> 00:06:49,220
So it's very simple.

134
00:06:49,220 --> 00:06:52,140
So this is actually slightly
easier to implement.

135
00:06:52,140 --> 00:06:53,980
And that's sort
of a little trick.

136
00:06:53,980 --> 00:06:59,860
So for local alignments,
you can often reset to zero.

137
00:06:59,860 --> 00:07:01,320
Any questions about that?

138
00:07:04,390 --> 00:07:08,830
So well, we talked about
computational efficiency,

139
00:07:08,830 --> 00:07:12,384
this big O notation,
where you consider

140
00:07:12,384 --> 00:07:13,925
the number of
individual computations

141
00:07:13,925 --> 00:07:16,840
that are required to run
an algorithm as a function

142
00:07:16,840 --> 00:07:18,660
of the size of the
input, basically

143
00:07:18,660 --> 00:07:20,550
the number of units
in the problem base

144
00:07:20,550 --> 00:07:24,360
pairs, amino acid
residues, whatever.

145
00:07:24,360 --> 00:07:28,360
So computer scientists look
at the asymptotic worst case

146
00:07:28,360 --> 00:07:29,660
running time.

147
00:07:29,660 --> 00:07:32,870
That's either because
they're pessimistic,

148
00:07:32,870 --> 00:07:36,000
or perhaps because they
want to guarantee things.

149
00:07:36,000 --> 00:07:38,430
They want to say, it's not
going to be worse than this.

150
00:07:38,430 --> 00:07:39,850
Maybe it'll be faster,
and then you'll be happy.

151
00:07:39,850 --> 00:07:41,488
But I can guarantee
you, it's not

152
00:07:41,488 --> 00:07:42,654
going to be worse than this.

153
00:07:42,654 --> 00:07:48,340
And so in this case, the
algorithm we talked about

154
00:07:48,340 --> 00:07:52,100
was order n times
n, where that's

155
00:07:52,100 --> 00:07:55,170
the lengths of
the two sequences.

156
00:07:55,170 --> 00:07:58,190
So toward the end
last time, we get

157
00:07:58,190 --> 00:08:01,160
we talked about this
lambda parameter

158
00:08:01,160 --> 00:08:05,060
and said that lambda is the
unique positive solution

159
00:08:05,060 --> 00:08:11,370
to this equation here,
where sij are the scores

160
00:08:11,370 --> 00:08:16,110
and pi and rj are the
nucleotide frequencies.

161
00:08:16,110 --> 00:08:18,740
And then there's this
target frequency formula

162
00:08:18,740 --> 00:08:23,550
that comes up that says that
if you use a scoring system sij

163
00:08:23,550 --> 00:08:26,380
to apply to sequences,
and then you pull out

164
00:08:26,380 --> 00:08:28,410
just the high scoring
segments, the ones that

165
00:08:28,410 --> 00:08:31,620
are unusually high
scoring, they will have

166
00:08:31,620 --> 00:08:35,390
a frequency of
matching nucleotides

167
00:08:35,390 --> 00:08:39,330
qij that's given by the
product of the frequencies

168
00:08:39,330 --> 00:08:43,370
in the two sequences basically
weighted by e to the lambda

169
00:08:43,370 --> 00:08:44,030
sij.

170
00:08:44,030 --> 00:08:48,440
So matches will occur more
strongly, because that

171
00:08:48,440 --> 00:08:51,020
has a positive work, and
mismatches less strongly.

172
00:08:51,020 --> 00:08:54,000
And that then gives
rise to this notion

173
00:08:54,000 --> 00:08:57,270
that there's an optimal
mismatch penalty, if you just

174
00:08:57,270 --> 00:08:59,800
consider scoring systems
that have plus 1 for a match

175
00:08:59,800 --> 00:09:02,790
and m for a mismatch, some
negative number, that's

176
00:09:02,790 --> 00:09:05,780
given by this equation
here, and here I've

177
00:09:05,780 --> 00:09:07,750
worked out a couple of values.

178
00:09:07,750 --> 00:09:14,760
So the theory says that to find
matches that are 99% identical,

179
00:09:14,760 --> 00:09:16,740
you should use a mismatched
score of minus 3,

180
00:09:16,740 --> 00:09:21,610
but for 75% identical,
you should use minus 1.

181
00:09:21,610 --> 00:09:24,560
And I asked you to think
about does that make sense,

182
00:09:24,560 --> 00:09:29,390
or how is that true?

183
00:09:29,390 --> 00:09:41,140
So y is minus 3
better than minus 1

184
00:09:41,140 --> 00:09:44,380
for finding nearly
identical matches.

185
00:09:44,380 --> 00:09:46,542
Anyone have an idea
or thought on this?

186
00:09:46,542 --> 00:09:48,000
There's some thoughts
on the slide.

187
00:09:48,000 --> 00:09:50,990
But can anyone intuitively
explain why this is true?

188
00:09:57,862 --> 00:09:58,820
Yeah, what's your name?

189
00:09:58,820 --> 00:09:59,445
AUDIENCE: Eric.

190
00:09:59,445 --> 00:10:00,700
PROFESSOR: Yeah, go ahead.

191
00:10:00,700 --> 00:10:03,050
AUDIENCE: With a mismatch
penalty of minus 3,

192
00:10:03,050 --> 00:10:05,793
you actually need more
steps climbing back up

193
00:10:05,793 --> 00:10:09,230
to get back to
some local maximum.

194
00:10:09,230 --> 00:10:14,386
And therefore, you do require
a longer stretch [INAUDIBLE]

195
00:10:14,386 --> 00:10:19,732
matches in order to
get a significant hit.

196
00:10:19,732 --> 00:10:24,373
That's my guess as
to why a score of m

197
00:10:24,373 --> 00:10:28,830
equals minus 3, a
penalty, [INAUDIBLE]

198
00:10:28,830 --> 00:10:30,437
why it would be
better at finding

199
00:10:30,437 --> 00:10:35,110
the higher identity matches.

200
00:10:35,110 --> 00:10:39,380
PROFESSOR: OK, because
the minus 3 makes

201
00:10:39,380 --> 00:10:42,120
you go down faster, so it
takes longer to recover,

202
00:10:42,120 --> 00:10:46,350
so you can only find
nearly identical things

203
00:10:46,350 --> 00:10:48,150
with that kind of
scoring system.

204
00:10:48,150 --> 00:10:49,080
Is that your point?

205
00:10:49,080 --> 00:10:50,322
OK, that's a good point.

206
00:10:52,990 --> 00:10:55,640
So yeah, when would
you want to use

207
00:10:55,640 --> 00:10:58,852
a mismatched penalty of minus 1?

208
00:10:58,852 --> 00:11:02,756
AUDIENCE: When you're trying
to look for things that are

209
00:11:02,756 --> 00:11:07,148
[INAUDIBLE], but
maybe not so close.

210
00:11:07,148 --> 00:11:09,588
Well, when you're looking
for a [INAUDIBLE],

211
00:11:09,588 --> 00:11:13,004
you're looking for [INAUDIBLE].

212
00:11:13,004 --> 00:11:15,629
That kind of situation.

213
00:11:15,629 --> 00:11:17,170
PROFESSOR: And so
let's say I'm using

214
00:11:17,170 --> 00:11:20,960
a mismatch penalty of minus 2.

215
00:11:20,960 --> 00:11:26,980
Can I find regions
that are 66% identical?

216
00:11:35,746 --> 00:11:36,977
AUDIENCE: Probably.

217
00:11:36,977 --> 00:11:37,810
PROFESSOR: Probably.

218
00:11:37,810 --> 00:11:41,730
AUDIENCE: But not guaranteed.

219
00:11:41,730 --> 00:11:44,480
PROFESSOR: Anyone else
have a comment on that?

220
00:11:44,480 --> 00:11:45,600
Match is plus 1.

221
00:11:45,600 --> 00:11:49,830
Mismatch is minus 2
regions of 66% identity.

222
00:11:58,920 --> 00:12:00,020
Yeah, with Levi, yeah.

223
00:12:00,020 --> 00:12:03,116
AUDIENCE: No, since your
score will just be zero.

224
00:12:03,116 --> 00:12:04,070
PROFESSOR: Yeah.

225
00:12:04,070 --> 00:12:05,000
That's correct.

226
00:12:05,000 --> 00:12:07,255
So Levi's comment is
your score will be zero.

227
00:12:11,970 --> 00:12:18,360
Well, I'll just say plus for
match, plus, plus, minus, plus,

228
00:12:18,360 --> 00:12:19,300
plus, minus.

229
00:12:19,300 --> 00:12:22,270
I mean, it'll be interspersed.

230
00:12:22,270 --> 00:12:24,470
It doesn't have to be like
this for every triplet.

231
00:12:24,470 --> 00:12:27,680
But but on average, you'll have
two matches for every match.

232
00:12:27,680 --> 00:12:29,480
That's what 66% identity means.

233
00:12:29,480 --> 00:12:33,320
And so these will score
a total of plus 2.

234
00:12:33,320 --> 00:12:35,505
And this will score minus 2.

235
00:12:35,505 --> 00:12:39,940
And so you basically will
never rise much above zero.

236
00:12:39,940 --> 00:12:46,090
And so you you can't really
use that mismatch penalty.

237
00:12:46,090 --> 00:12:47,040
There's a limit.

238
00:12:47,040 --> 00:12:50,779
66% is sort of at a point
where you can no longer see.

239
00:12:50,779 --> 00:12:52,320
You could potentially
see things that

240
00:12:52,320 --> 00:12:54,820
are 75% identical if
they were incredibly

241
00:12:54,820 --> 00:12:56,690
long with that kind
of mismatch penalty.

242
00:12:56,690 --> 00:13:01,360
But you just can't see
anything below 2/3% identity

243
00:13:01,360 --> 00:13:02,150
with minus 2.

244
00:13:02,150 --> 00:13:05,900
So to find those low things,
you have to use the lower.

245
00:13:05,900 --> 00:13:07,840
You have to go down
to minus 1 if you

246
00:13:07,840 --> 00:13:12,710
want to find the
really weak matches.

247
00:13:12,710 --> 00:13:14,950
But they will have to be
correspondingly very long

248
00:13:14,950 --> 00:13:18,350
in order to achieve
statistical significance.

249
00:13:18,350 --> 00:13:22,340
So correspondingly, the
reason why it's better

250
00:13:22,340 --> 00:13:25,080
to use a harsher mismatch
penalty of minus 3

251
00:13:25,080 --> 00:13:26,970
to find the nearly
identical regions

252
00:13:26,970 --> 00:13:31,160
is that, in this
equation, when you

253
00:13:31,160 --> 00:13:37,230
go from having a plus 1, minus
1 scoring system to plus 1,

254
00:13:37,230 --> 00:13:39,880
minus 3, lambda will change.

255
00:13:39,880 --> 00:13:41,730
This equation will
no longer satisfied

256
00:13:41,730 --> 00:13:47,440
so that a new value of
lambda will be relevant.

257
00:13:47,440 --> 00:13:50,640
And that value will be larger.

258
00:13:50,640 --> 00:13:52,580
That's not totally
obvious from this equation

259
00:13:52,580 --> 00:13:54,740
because you sort of
have one term, which

260
00:13:54,740 --> 00:13:57,950
is either the minus lambda in
one term or either plus lambda.

261
00:13:57,950 --> 00:14:01,090
But it turns out that
that making the mismatch

262
00:14:01,090 --> 00:14:04,630
penalty more negative will
lead to a solution that's

263
00:14:04,630 --> 00:14:06,470
a bigger value of lambda.

264
00:14:06,470 --> 00:14:11,300
So that means that
the same score, x,

265
00:14:11,300 --> 00:14:20,480
will lead to a larger
negative exponent here.

266
00:14:20,480 --> 00:14:20,980
and?

267
00:14:20,980 --> 00:14:23,574
How will that
affect the p-value?

268
00:14:23,574 --> 00:14:24,740
Anyone take us through this?

269
00:14:24,740 --> 00:14:26,281
It's a little bit
convoluted with all

270
00:14:26,281 --> 00:14:28,430
these negative
exponentials and stuff,

271
00:14:28,430 --> 00:14:32,690
but can someone explain to us
how that affects the p-value?

272
00:14:32,690 --> 00:14:33,280
Same x.

273
00:14:33,280 --> 00:14:34,610
We're going to increase lambda.

274
00:14:34,610 --> 00:14:35,776
What happens to the p-value?

275
00:14:41,581 --> 00:14:44,300
This gets bigger, more negative.

276
00:14:44,300 --> 00:14:49,550
That means this e to the
minus thing gets closer to 0.

277
00:14:49,550 --> 00:14:52,110
That means that this is
inside an exponential.

278
00:14:52,110 --> 00:14:54,940
As that thing gets closer
to 0, this whole term

279
00:14:54,940 --> 00:14:58,840
here gets closer to 1.

280
00:14:58,840 --> 00:15:00,550
Therefore you're
subtracting it from 1.

281
00:15:00,550 --> 00:15:03,580
Therefore the p-value
gets smaller, closer to 0,

282
00:15:03,580 --> 00:15:04,985
more significant.

283
00:15:04,985 --> 00:15:05,860
Does that made sense?

284
00:15:05,860 --> 00:15:07,740
So it's good to
just work through

285
00:15:07,740 --> 00:15:10,850
how this equation works.

286
00:15:10,850 --> 00:15:13,070
So that's all I wanted
to say about mismatch

287
00:15:13,070 --> 00:15:15,250
penalties for DNA.

288
00:15:15,250 --> 00:15:18,150
Any questions about that?

289
00:15:20,647 --> 00:15:22,480
So how do you actually
use this in practice?

290
00:15:22,480 --> 00:15:24,010
So if you just
Google "BLAST end,"

291
00:15:24,010 --> 00:15:25,610
you'll get to this website.

292
00:15:25,610 --> 00:15:30,440
It's been set up at NCBI
for about 20 years or so.

293
00:15:30,440 --> 00:15:32,650
And of course, it's gone
through various iterations

294
00:15:32,650 --> 00:15:33,983
and improvements over the years.

295
00:15:33,983 --> 00:15:37,280
And if you look
down at the bottom,

296
00:15:37,280 --> 00:15:40,780
there is a place where
you can click and set

297
00:15:40,780 --> 00:15:42,790
the algorithm parameters.

298
00:15:42,790 --> 00:15:45,510
And there are a number of
parameters you can set.

299
00:15:48,340 --> 00:15:50,630
Some of them affect the speed.

300
00:15:50,630 --> 00:15:54,050
But we're focused here
mostly on parameters

301
00:15:54,050 --> 00:15:57,100
that will affect the quality,
the nature of the alignments

302
00:15:57,100 --> 00:15:57,810
that you find.

303
00:15:57,810 --> 00:16:01,730
And so here, you can set
not arbitrary penalties,

304
00:16:01,730 --> 00:16:07,480
but you can set within a range
of standard mismatch penalties.

305
00:16:07,480 --> 00:16:10,200
You can do 1 minus 1,
1 minus 2, et cetera.

306
00:16:14,940 --> 00:16:19,490
So what about sequences
that code for protein?

307
00:16:19,490 --> 00:16:21,230
So exons, for example.

308
00:16:21,230 --> 00:16:27,340
So you can search them with a
nucleotide search, like BLAST.

309
00:16:27,340 --> 00:16:31,070
But it can often be
the case that you'll

310
00:16:31,070 --> 00:16:35,590
do better if you first
translate your exon

311
00:16:35,590 --> 00:16:37,960
into the corresponding
amino acid sequence using

312
00:16:37,960 --> 00:16:41,510
a genetic code and then
search that peptide.

313
00:16:41,510 --> 00:16:44,440
Now you may or may not know
the reading frame of your exon

314
00:16:44,440 --> 00:16:46,550
a priori, or even know
that it is an exon,

315
00:16:46,550 --> 00:16:52,025
so BLAST automatically will
do this translation for you.

316
00:16:52,025 --> 00:16:53,650
So for example, with
this DNA sequence,

317
00:16:53,650 --> 00:16:55,850
it'll translate in all
three of the reading frames,

318
00:16:55,850 --> 00:16:59,020
leading to essentially
this bag of peptides

319
00:16:59,020 --> 00:17:02,680
here, where sometimes you'll
hit a stop code on, like right

320
00:17:02,680 --> 00:17:03,230
here.

321
00:17:03,230 --> 00:17:04,979
And then it just treats
it as, OK, there's

322
00:17:04,979 --> 00:17:06,750
a little PR dipeptide there.

323
00:17:06,750 --> 00:17:09,869
And then there's a longer
peptide here, [INAUDIBLE],

324
00:17:09,869 --> 00:17:10,849
and so forth.

325
00:17:10,849 --> 00:17:14,609
So it just makes these bags of
peptides for each reading frame

326
00:17:14,609 --> 00:17:18,000
and searches all those peptides
against some target, which

327
00:17:18,000 --> 00:17:21,770
can be approaching
database or a DNA database,

328
00:17:21,770 --> 00:17:23,849
again, translated in
all the reading frames.

329
00:17:23,849 --> 00:17:27,890
So the folks at NCBI have made
all these different flavors

330
00:17:27,890 --> 00:17:30,060
of BLAST available.

331
00:17:30,060 --> 00:17:31,550
So BLASTP is for proteins.

332
00:17:31,550 --> 00:17:32,430
N is for nucleotides.

333
00:17:32,430 --> 00:17:36,260
And then the translating ones
are called things like BLASTX

334
00:17:36,260 --> 00:17:38,830
for a nucleotide query
against a protein database.

335
00:17:38,830 --> 00:17:41,870
TBLASTN for a protein query
against a nucleotide database,

336
00:17:41,870 --> 00:17:44,140
which gets translated in
all frames, or TBLASTX,

337
00:17:44,140 --> 00:17:47,440
where you translate both
the nucleotide sequences

338
00:17:47,440 --> 00:17:49,430
in all frames.

339
00:17:49,430 --> 00:17:51,790
And then there's a number
of other versions of BLAST

340
00:17:51,790 --> 00:17:54,590
which we probably won't
discuss but that are well

341
00:17:54,590 --> 00:17:59,110
described in the textbook
and other accessible

342
00:17:59,110 --> 00:18:03,150
online sources.

343
00:18:03,150 --> 00:18:04,370
Let me ask you this.

344
00:18:04,370 --> 00:18:05,990
So remember ESTs.

345
00:18:05,990 --> 00:18:10,610
So ESTs are segments of cDNAs
that typically correspond

346
00:18:10,610 --> 00:18:18,690
to one ABI 3700 Sanger
sequenc off of that cDNA,

347
00:18:18,690 --> 00:18:22,270
so one read, like
600 bases or so.

348
00:18:22,270 --> 00:18:28,980
So let's say you have
some ESTs from chimp.

349
00:18:28,980 --> 00:18:32,070
And you don't have
the chimp genome yet.

350
00:18:32,070 --> 00:18:35,360
So you're going to search
them against human.

351
00:18:35,360 --> 00:18:37,250
What would you do?

352
00:18:40,420 --> 00:18:43,510
Would you use a
translating search?

353
00:18:43,510 --> 00:18:48,711
Or would you use
a BLASTN search?

354
00:18:48,711 --> 00:18:49,460
Or does it matter?

355
00:18:56,760 --> 00:19:01,780
Chimp is a 98% identical
human, very high.

356
00:19:05,540 --> 00:19:08,000
Any ideas?

357
00:19:08,000 --> 00:19:08,668
Yeah, Tim.

358
00:19:08,668 --> 00:19:11,058
AUDIENCE: You could use
a translating search,

359
00:19:11,058 --> 00:19:18,610
because you know that the cDNAs
are at least coding for RNAs.

360
00:19:18,610 --> 00:19:26,450
And so if you just use
a nucleotide search,

361
00:19:26,450 --> 00:19:30,860
then you're not going to
have functional significance

362
00:19:30,860 --> 00:19:32,820
in terms of the alignment.

363
00:19:32,820 --> 00:19:35,790
But if it's going
for a protein, then--

364
00:19:35,790 --> 00:19:38,380
PROFESSOR: You mean you won't
know whether it is the protein

365
00:19:38,380 --> 00:19:41,390
coding part of the cDNA or not?

366
00:19:41,390 --> 00:19:43,856
AUDIENCE: So I just
mean that if you're

367
00:19:43,856 --> 00:19:45,397
looking between
chimp and human, then

368
00:19:45,397 --> 00:19:47,538
you're expecting some
sort of mismatch.

369
00:19:47,538 --> 00:19:51,673
But it's possible that it
could be a functional mismatch.

370
00:19:51,673 --> 00:19:55,398
Then you know that the cDNA
is maybe coding for a protein.

371
00:19:55,398 --> 00:19:58,302
Therefore, if the
mismatch is between two

372
00:19:58,302 --> 00:20:01,980
similar amino acids, then
that would be picked up

373
00:20:01,980 --> 00:20:05,181
by a translating
search, but it would

374
00:20:05,181 --> 00:20:07,014
be skewed against it
in a nucleotide search.

375
00:20:07,014 --> 00:20:09,190
PROFESSOR: OK, fair enough.

376
00:20:09,190 --> 00:20:11,770
But if you assume that
the two genomes are,

377
00:20:11,770 --> 00:20:15,480
let's say, 97% identical, even
in a non-coding region, which

378
00:20:15,480 --> 00:20:16,230
they're very high.

379
00:20:16,230 --> 00:20:18,390
I don't remember the exact
percent, but very high.

380
00:20:18,390 --> 00:20:24,250
Then if you're searching 600
nucleotides against the genome,

381
00:20:24,250 --> 00:20:25,870
even if it's 95%
identical, you'll

382
00:20:25,870 --> 00:20:27,990
easily find that under either.

383
00:20:27,990 --> 00:20:33,660
So either answer is
correct, BLASTN or BLASTX.

384
00:20:33,660 --> 00:20:38,650
And the UTRs could only be found
by-- if it happened that this

385
00:20:38,650 --> 00:20:40,270
was a sequence from
a three prime UTR,

386
00:20:40,270 --> 00:20:44,100
you could only find that
by BLASTN typically.

387
00:20:44,100 --> 00:20:49,451
What if it's a human EST
against the mouse genome?

388
00:20:52,160 --> 00:20:54,570
So mouse exons are
about 80% identical

389
00:20:54,570 --> 00:20:58,671
to human exons at the
nucleotide level, typically.

390
00:20:58,671 --> 00:20:59,310
Any ideas?

391
00:21:03,175 --> 00:21:04,550
What kind of search
would you do?

392
00:21:04,550 --> 00:21:07,290
BLASTN, BLASTX,
or something else?

393
00:21:10,454 --> 00:21:12,880
TBLASTX.

394
00:21:12,880 --> 00:21:13,631
Yeah, go ahead.

395
00:21:13,631 --> 00:21:15,790
AUDIENCE: I have
another question.

396
00:21:15,790 --> 00:21:17,672
What exactly is the
kind of question

397
00:21:17,672 --> 00:21:19,912
we're trying to answer by
doing this BLAST search?

398
00:21:19,912 --> 00:21:21,370
PROFESSOR: Oh,
well, I was assuming

399
00:21:21,370 --> 00:21:27,410
you're just trying to find
the closest homologous cDNA

400
00:21:27,410 --> 00:21:33,600
or exons in the genome-- exons,
I guess, yeah, the exons.

401
00:21:33,600 --> 00:21:36,755
of the homologous gene.

402
00:21:36,755 --> 00:21:38,085
Yeah, that's a good question.

403
00:21:38,085 --> 00:21:39,210
Exons of a homologous gene.

404
00:21:39,210 --> 00:21:43,410
We've got a human EST going
against the mouse genome.

405
00:21:43,410 --> 00:21:44,750
When do we do?

406
00:21:44,750 --> 00:21:49,025
AUDIENCE: I suggest
BLASTP because--

407
00:21:49,025 --> 00:21:51,900
PROFESSOR: Well,
BLASTP, that's protein.

408
00:21:51,900 --> 00:21:54,650
This is a nucleotide
sequence against nucleotide.

409
00:21:54,650 --> 00:21:59,969
So we can do BLASTN
or TBLASTX, let's say.

410
00:21:59,969 --> 00:22:01,009
AUDIENCE: TBLASTX.

411
00:22:01,009 --> 00:22:01,800
PROFESSOR: TBLASTX.

412
00:22:01,800 --> 00:22:05,190
You translate your EST,
translate the genome,

413
00:22:05,190 --> 00:22:07,090
search those peptides.

414
00:22:07,090 --> 00:22:10,518
TBLASTX, why?

415
00:22:10,518 --> 00:22:13,003
AUDIENCE: The
nucleotide sequences

416
00:22:13,003 --> 00:22:17,476
may be only about
80% similarity,

417
00:22:17,476 --> 00:22:21,452
but the protein
sequences functionally,

418
00:22:21,452 --> 00:22:23,937
due to the functional
constraints,

419
00:22:23,937 --> 00:22:26,980
you might actually get
higher similarities there.

420
00:22:26,980 --> 00:22:27,760
PROFESSOR: Yeah.

421
00:22:27,760 --> 00:22:28,800
It's exactly right.

422
00:22:28,800 --> 00:22:33,241
So they are about, on
average, about 80% identical.

423
00:22:33,241 --> 00:22:33,990
It varies by gene.

424
00:22:33,990 --> 00:22:37,070
But a lot of those
variations that occur

425
00:22:37,070 --> 00:22:39,227
are at the third side
of the codon that

426
00:22:39,227 --> 00:22:41,060
don't effect the amino
acid, because there's

427
00:22:41,060 --> 00:22:42,750
a lot of constraint
on protein sequence.

428
00:22:42,750 --> 00:22:45,460
And so you'll do
better, in general,

429
00:22:45,460 --> 00:22:48,650
with a translating search
than with a nucleotide search.

430
00:22:48,650 --> 00:22:50,260
Although, they both may work.

431
00:22:50,260 --> 00:22:53,250
But you may find a
more complete match

432
00:22:53,250 --> 00:22:54,900
with a translating search.

433
00:22:54,900 --> 00:22:55,400
That's good.

434
00:22:55,400 --> 00:22:56,237
Everyone got that?

435
00:22:56,237 --> 00:22:56,736
Sally, yeah.

436
00:22:56,736 --> 00:22:59,027
AUDIENCE: Is there a reason
why you wouldn't use BLASTX

437
00:22:59,027 --> 00:23:01,214
and instead you use TBLASTX?

438
00:23:01,214 --> 00:23:02,880
PROFESSOR: Yeah, I
just gave the example

439
00:23:02,880 --> 00:23:04,213
of searching against the genome.

440
00:23:04,213 --> 00:23:08,642
But you could search against
the mouse proteome as well.

441
00:23:08,642 --> 00:23:09,600
You might or might not.

442
00:23:09,600 --> 00:23:11,230
It depends how well
annotated that genome is.

443
00:23:11,230 --> 00:23:12,530
Mouse is pretty well annotated.

444
00:23:12,530 --> 00:23:14,321
Almost all the proteins
are probably known.

445
00:23:14,321 --> 00:23:15,660
So you probably get it.

446
00:23:15,660 --> 00:23:21,477
But if you were searching
against some more obscure

447
00:23:21,477 --> 00:23:23,310
organism, the chameleon
genome or something,

448
00:23:23,310 --> 00:23:25,427
and it wasn't well
annotated, then you

449
00:23:25,427 --> 00:23:28,010
might do better with searching
against the genome, because you

450
00:23:28,010 --> 00:23:31,200
could find a new x on there.

451
00:23:31,200 --> 00:23:32,173
OK, good.

452
00:23:32,173 --> 00:23:33,173
Question yeah, go ahead.

453
00:23:33,173 --> 00:23:36,027
AUDIENCE: So when we do these
translations, these nucleotide,

454
00:23:36,027 --> 00:23:38,287
amino acid things,
do we get all frames?

455
00:23:38,287 --> 00:23:39,620
Do the algorithms to all frames?

456
00:23:39,620 --> 00:23:41,990
PROFESSOR: Yeah, all six frames.

457
00:23:41,990 --> 00:23:44,220
So three frames on the plus
strand, and three frames

458
00:23:44,220 --> 00:23:45,639
on the reverse strand.

459
00:23:45,639 --> 00:23:46,138
Yeah.

460
00:23:49,870 --> 00:23:51,660
All right, great.

461
00:23:51,660 --> 00:23:55,360
So that's the end of local
alignment, for the moment.

462
00:23:55,360 --> 00:24:00,245
And we're going to now move
on to global alignment using

463
00:24:00,245 --> 00:24:00,870
two algorithms.

464
00:24:04,134 --> 00:24:06,050
For global alignment,
Needleman-Wunch-Sellers,

465
00:24:06,050 --> 00:24:09,870
and then for gapped local
alignment, Smith-Waterman.

466
00:24:09,870 --> 00:24:11,930
And toward the end,
we're going to introduce

467
00:24:11,930 --> 00:24:17,210
the concept of amino acid
substitution matrices.

468
00:24:17,210 --> 00:24:20,500
So the background for
today, the textbook

469
00:24:20,500 --> 00:24:22,860
does a pretty good
job on these topics,

470
00:24:22,860 --> 00:24:25,660
especially the
pages indicated are

471
00:24:25,660 --> 00:24:29,970
good for introducing the
PAM series of matrices.

472
00:24:29,970 --> 00:24:34,190
We'll talked a little bit today
and a little bit next time.

473
00:24:34,190 --> 00:24:36,850
So why would we align
protein sequences?

474
00:24:36,850 --> 00:24:44,390
So the most obvious reason is to
find homologues that we might,

475
00:24:44,390 --> 00:24:47,360
then, want to investigate,
or we might, for example,

476
00:24:47,360 --> 00:24:51,590
if you have a human protein
and you find homologous mouse

477
00:24:51,590 --> 00:24:54,510
protein, and that mouse
protein has known function from

478
00:24:54,510 --> 00:24:58,010
a knockout or from some
biochemical studies,

479
00:24:58,010 --> 00:25:02,274
for example, then you can guess
that the human protein will

480
00:25:02,274 --> 00:25:03,190
have similar function.

481
00:25:03,190 --> 00:25:05,340
So we often use this
type of inference

482
00:25:05,340 --> 00:25:11,320
that sequence similarity implies
similarity in function and/or

483
00:25:11,320 --> 00:25:13,280
structure.

484
00:25:13,280 --> 00:25:14,380
So how true is this?

485
00:25:14,380 --> 00:25:17,990
So it turns out,
from a wide body

486
00:25:17,990 --> 00:25:21,530
of literature, that this
inference that sequence

487
00:25:21,530 --> 00:25:26,660
similarity implies functional
and structural similarity

488
00:25:26,660 --> 00:25:30,820
is almost always true when
the sequence similarity is

489
00:25:30,820 --> 00:25:33,710
more than about 30% identity
over the whole length

490
00:25:33,710 --> 00:25:37,830
of a protein, over
300, 400 amino acids.

491
00:25:37,830 --> 00:25:40,470
That's a good inference.

492
00:25:40,470 --> 00:25:44,110
Below that, sort of
in the 20% to 30%

493
00:25:44,110 --> 00:25:45,650
sequence similarity,
that's often

494
00:25:45,650 --> 00:25:47,450
referred to as
the Twilight Zone,

495
00:25:47,450 --> 00:25:49,570
where sometimes it's
a good inference,

496
00:25:49,570 --> 00:25:51,780
and sometimes it's not.

497
00:25:51,780 --> 00:25:56,410
So you need to be a
little bit careful.

498
00:25:56,410 --> 00:26:00,030
And below that, it's deeper
into the Twilight Zone,

499
00:26:00,030 --> 00:26:02,500
where most of the time you
probably shouldn't trust it.

500
00:26:02,500 --> 00:26:06,500
But occasionally, you can see
these very remote homologies.

501
00:26:06,500 --> 00:26:09,820
You might want to have
additional information

502
00:26:09,820 --> 00:26:12,420
to support that
kind of inference.

503
00:26:12,420 --> 00:26:15,740
And I want to just point out
that the converse is just not

504
00:26:15,740 --> 00:26:16,950
true in biology.

505
00:26:16,950 --> 00:26:19,780
So structural
similarity does not

506
00:26:19,780 --> 00:26:23,850
imply sequence
similarity or even

507
00:26:23,850 --> 00:26:28,550
derivation from a
common ancestor.

508
00:26:28,550 --> 00:26:34,070
So you may think,
well, every protein

509
00:26:34,070 --> 00:26:39,030
has a really complex, elaborate
three dimensional structure,

510
00:26:39,030 --> 00:26:43,844
and there's no way that
could ever evolve twice.

511
00:26:43,844 --> 00:26:46,260
And it's true that probably
that exact structure can never

512
00:26:46,260 --> 00:26:46,801
evolve twice.

513
00:26:46,801 --> 00:26:49,850
But a very similar structure,
a similar fold even,

514
00:26:49,850 --> 00:26:53,745
in terms of the topology
of alpha helices

515
00:26:53,745 --> 00:26:55,820
and beta strands, which
Professor Frank will

516
00:26:55,820 --> 00:26:59,430
talk about later in the
course, the identical fold

517
00:26:59,430 --> 00:27:01,080
can involve more than once.

518
00:27:01,080 --> 00:27:03,450
It's not that hard
to evolve a pattern

519
00:27:03,450 --> 00:27:05,960
of alpha helices
and beta strands.

520
00:27:05,960 --> 00:27:09,480
And so this point about
structural similarity

521
00:27:09,480 --> 00:27:11,950
not implying
sequence similarity,

522
00:27:11,950 --> 00:27:17,120
the way I think about it
is like this, like here

523
00:27:17,120 --> 00:27:19,260
are two organisms.

524
00:27:19,260 --> 00:27:20,982
This is a hummingbird,
you've all seen.

525
00:27:20,982 --> 00:27:22,440
And some of you
may have seen this.

526
00:27:22,440 --> 00:27:26,030
This is a hawk moth,
which is an insect that

527
00:27:26,030 --> 00:27:29,640
is roughly two inches long,
beats its wings very fast,

528
00:27:29,640 --> 00:27:32,770
has a long tongue that
sips nectar from flowers.

529
00:27:32,770 --> 00:27:35,870
So it basically occupies
the same ecological niche

530
00:27:35,870 --> 00:27:38,109
as a hummingbird, and
looks very, very similar

531
00:27:38,109 --> 00:27:39,400
to a hummingbird at a distance.

532
00:27:39,400 --> 00:27:41,890
From 10 or more feet,
you often can't tell.

533
00:27:44,859 --> 00:27:46,400
This is an insect,
and that's a bird.

534
00:27:46,400 --> 00:27:50,210
The last common
ancestor was something

535
00:27:50,210 --> 00:27:53,220
that probably lived
500 million years ago,

536
00:27:53,220 --> 00:27:55,770
and certainly didn't
have wings, and may not

537
00:27:55,770 --> 00:27:58,020
have had legs or eyes.

538
00:27:58,020 --> 00:27:59,630
And yet, they've
independently evolved

539
00:27:59,630 --> 00:28:01,130
eyes and wings and
all these things.

540
00:28:01,130 --> 00:28:04,520
So when there's
selective pressure

541
00:28:04,520 --> 00:28:10,890
to evolve something, either
a morphology or a protein

542
00:28:10,890 --> 00:28:13,320
structure, for
example, evolution

543
00:28:13,320 --> 00:28:16,530
is flexible enough that it can
evolve it many, many times.

544
00:28:16,530 --> 00:28:20,380
So here's an example from
the protein structure world.

545
00:28:20,380 --> 00:28:23,520
This is homophilous
iron binding protein.

546
00:28:23,520 --> 00:28:27,570
This is just the iron
coordination center.

547
00:28:27,570 --> 00:28:31,600
And this is now a eukaryotic
protein called lactoferrin.

548
00:28:31,600 --> 00:28:34,220
Turns out these
guys are homologous.

549
00:28:34,220 --> 00:28:38,810
But eukaryotes and bacteria
diverged 2 million years

550
00:28:38,810 --> 00:28:44,770
ago or so, so their ancestry
is very, very ancient.

551
00:28:44,770 --> 00:28:48,160
And yet, you can see that in
this iron coordination center,

552
00:28:48,160 --> 00:28:51,235
you have a tyrosine
pointing into the iron here.

553
00:28:51,235 --> 00:28:54,810
And you have a histidine
up here, and so forth.

554
00:28:54,810 --> 00:28:59,240
So the geometry has
been highly conserved.

555
00:28:59,240 --> 00:29:00,460
It's not perfectly conserved.

556
00:29:00,460 --> 00:29:02,840
Like, here you
have a carboxylate.

557
00:29:02,840 --> 00:29:04,090
And here you have a phosphate.

558
00:29:04,090 --> 00:29:05,715
So there's been a
little bit of change.

559
00:29:05,715 --> 00:29:14,160
But overall, this way
of coordinating iron

560
00:29:14,160 --> 00:29:16,720
has basically evolved
independently.

561
00:29:16,720 --> 00:29:18,620
So although these
are homologous,

562
00:29:18,620 --> 00:29:22,920
the last common
ancestor bound anions--

563
00:29:22,920 --> 00:29:24,760
that's known from
[INAUDIBLE] construction.

564
00:29:24,760 --> 00:29:29,530
So they independently evolved
the ability to bind cations,

565
00:29:29,530 --> 00:29:31,430
like iron.

566
00:29:31,430 --> 00:29:34,300
And here is actually
my favorite example.

567
00:29:34,300 --> 00:29:39,450
So here's a protein called
ribosome recycling factor.

568
00:29:39,450 --> 00:29:40,340
And that's its shape.

569
00:29:40,340 --> 00:29:42,680
So it's a very
unusual shaped protein

570
00:29:42,680 --> 00:29:45,214
that's kind of shaped like an L.

571
00:29:45,214 --> 00:29:47,440
Does this remind
anyone of anything,

572
00:29:47,440 --> 00:29:49,150
this particular shape?

573
00:29:49,150 --> 00:29:55,610
Have you seen this in another
biomolecule at some point?

574
00:29:55,610 --> 00:29:57,080
AUDIENCE: [INAUDIBLE].

575
00:29:57,080 --> 00:29:58,663
PROFESSOR: Something
like [INAUDIBLE].

576
00:29:58,663 --> 00:29:59,990
OK, could be.

577
00:29:59,990 --> 00:30:00,780
Any other guesses?

578
00:30:05,037 --> 00:30:06,670
How about this?

579
00:30:06,670 --> 00:30:08,360
That's a tRNA.

580
00:30:08,360 --> 00:30:13,100
So the 3D structure of
tRNA is almost identical,

581
00:30:13,100 --> 00:30:16,640
both in terms of
the overall shape

582
00:30:16,640 --> 00:30:21,000
and in terms of the geometry.

583
00:30:21,000 --> 00:30:24,190
Sorry, I'm having issues
with my animations here.

584
00:30:24,190 --> 00:30:30,900
The geometry of these, they're
both about 70 angstroms long.

585
00:30:30,900 --> 00:30:31,650
So why is that?

586
00:30:31,650 --> 00:30:34,930
Why would this protein
evolve to have the same three

587
00:30:34,930 --> 00:30:36,890
dimensional shape as a tRNA?

588
00:30:42,710 --> 00:30:44,254
Any ideas?

589
00:30:44,254 --> 00:30:46,190
AUDIENCE: [INAUDIBLE].

590
00:30:46,190 --> 00:30:47,160
PROFESSOR: [INAUDIBLE].

591
00:30:47,160 --> 00:30:47,940
Right, exactly.

592
00:30:47,940 --> 00:30:50,430
It fits into the
ribosome, and it's

593
00:30:50,430 --> 00:30:55,000
involved, when the
ribosome is stalled,

594
00:30:55,000 --> 00:30:56,920
and basically
releasing the ribosome.

595
00:30:56,920 --> 00:31:00,210
So it's mimicking a tRNA
in terms of structure.

596
00:31:00,210 --> 00:31:03,950
And so the point about
this is that, if you

597
00:31:03,950 --> 00:31:06,430
were to take a bunch
of biomolecules

598
00:31:06,430 --> 00:31:09,040
and match them up using
a structure comparison

599
00:31:09,040 --> 00:31:12,410
algorithm to find similar ones--
these two are clearly similar.

600
00:31:12,410 --> 00:31:15,520
And yet, they probably never
had a common ancestor right,

601
00:31:15,520 --> 00:31:18,250
because one's an RNA
in one's a protein.

602
00:31:23,220 --> 00:31:23,860
OK.

603
00:31:23,860 --> 00:31:26,100
So now what we're going to talk
about a few different types

604
00:31:26,100 --> 00:31:26,683
of alignments.

605
00:31:26,683 --> 00:31:29,280
So we talked about
local alignments,

606
00:31:29,280 --> 00:31:32,120
where you don't try to
align the entire sequence

607
00:31:32,120 --> 00:31:33,790
of your query or your database.

608
00:31:33,790 --> 00:31:38,020
You just find smaller
regions of high similarity.

609
00:31:38,020 --> 00:31:41,810
Global alignment, where
you try to align the two

610
00:31:41,810 --> 00:31:43,380
proteins from end
to end, you assume

611
00:31:43,380 --> 00:31:46,530
that these two proteins
are homologous,

612
00:31:46,530 --> 00:31:51,210
and actually that they
haven't had major insertions

613
00:31:51,210 --> 00:31:55,322
or rearrangements
of their sequence.

614
00:31:55,322 --> 00:31:56,780
And then semi-global,
which is sort

615
00:31:56,780 --> 00:31:59,256
of a little twist on global.

616
00:31:59,256 --> 00:32:01,630
And we'll talk about a few
different scoring systems-- so

617
00:32:01,630 --> 00:32:04,150
ungapped, which we've been
talking about until now,

618
00:32:04,150 --> 00:32:07,950
and then we'll introduce
gaps of two types that

619
00:32:07,950 --> 00:32:10,140
are called linear and affine.

620
00:32:10,140 --> 00:32:14,880
And the nomenclature is a little
bit confusing, as you'll see.

621
00:32:14,880 --> 00:32:18,010
They're both linear, in a sense.

622
00:32:18,010 --> 00:32:23,100
So a common way to represent
sequence alignments, especially

623
00:32:23,100 --> 00:32:26,370
in the protein alignment-- you
can do it for protein or DNA--

624
00:32:26,370 --> 00:32:29,790
is what's called a dot matrix.

625
00:32:29,790 --> 00:32:31,170
Now we've got two proteins.

626
00:32:31,170 --> 00:32:34,960
They might be 500 amino
acids long each, let's say.

627
00:32:34,960 --> 00:32:37,990
You write sequence
one along the x-axis,

628
00:32:37,990 --> 00:32:40,400
sequence two along the y-axis.

629
00:32:40,400 --> 00:32:44,152
And then you make a dot
in this matrix whenever

630
00:32:44,152 --> 00:32:46,360
they have identical residues,
although probably there

631
00:32:46,360 --> 00:32:47,830
would be a lot
more dots in this.

632
00:32:47,830 --> 00:32:51,360
So let's say, whenever you have
three residues in a row that

633
00:32:51,360 --> 00:32:53,910
are identical-- OK, that's
going to occur fairly rarely,

634
00:32:53,910 --> 00:32:56,170
since there's 20 amino acids.

635
00:32:56,170 --> 00:32:57,290
And you make that dot.

636
00:32:57,290 --> 00:33:01,490
And for these two
proteins, you don't

637
00:33:01,490 --> 00:33:02,980
get any off diagonal dots.

638
00:33:02,980 --> 00:33:06,470
You just get these three
diagonal lines here.

639
00:33:06,470 --> 00:33:11,360
So what does that tell you
about the history of these two

640
00:33:11,360 --> 00:33:12,890
proteins?

641
00:33:12,890 --> 00:33:14,300
What's that right there?

642
00:33:17,054 --> 00:33:17,891
Sally.

643
00:33:17,891 --> 00:33:19,349
AUDIENCE: An
insertion or deletion.

644
00:33:19,349 --> 00:33:20,849
PROFESSOR: An
insertion or deletion.

645
00:33:20,849 --> 00:33:22,232
An insertion in which protein?

646
00:33:22,232 --> 00:33:23,190
AUDIENCE: Sequence two.

647
00:33:23,190 --> 00:33:24,356
PROFESSOR: Or a deletion in?

648
00:33:24,356 --> 00:33:24,978
AUDIENCE: One.

649
00:33:24,978 --> 00:33:25,806
PROFESSOR: OK.

650
00:33:25,806 --> 00:33:27,490
Everyone got that?

651
00:33:27,490 --> 00:33:28,350
OK, good.

652
00:33:28,350 --> 00:33:32,010
There's extra sequence
in sequence two here

653
00:33:32,010 --> 00:33:33,362
that's not in sequence one.

654
00:33:33,362 --> 00:33:35,570
You don't know whether it's
an insertion or deletion.

655
00:33:35,570 --> 00:33:37,880
It could be either one,
based on this information.

656
00:33:37,880 --> 00:33:41,030
Sometimes you can figure that
out from other information.

657
00:33:41,030 --> 00:33:44,050
So sometimes you call that an
indel-- insertion or deletion.

658
00:33:44,050 --> 00:33:48,860
And then, what is
this down here?

659
00:33:48,860 --> 00:33:50,850
Someone else?

660
00:33:50,850 --> 00:33:53,880
Insertion, I heard, insertion
in sequence one or deletion

661
00:33:53,880 --> 00:33:55,010
in sequence two.

662
00:33:55,010 --> 00:33:58,100
OK, good.

663
00:33:58,100 --> 00:33:59,810
All right, so what
type of alignment

664
00:33:59,810 --> 00:34:03,080
would be most appropriate
for this pair sequences,

665
00:34:03,080 --> 00:34:04,345
a local or a global?

666
00:34:10,662 --> 00:34:12,646
AUDIENCE: I would do
global, because they're

667
00:34:12,646 --> 00:34:14,964
very, very similar. [INAUDIBLE].

668
00:34:14,964 --> 00:34:15,630
PROFESSOR: Yeah.

669
00:34:15,630 --> 00:34:19,150
They are quite similar
across their entire lengths,

670
00:34:19,150 --> 00:34:21,850
just with these
two major indels.

671
00:34:21,850 --> 00:34:24,920
So that's sort of the
classical case where

672
00:34:24,920 --> 00:34:27,060
you want to do the
global alignment.

673
00:34:27,060 --> 00:34:27,960
All right.

674
00:34:27,960 --> 00:34:31,337
So what about
these two proteins?

675
00:34:31,337 --> 00:34:32,920
Based on this dot
matrix, what can you

676
00:34:32,920 --> 00:34:35,570
say about the relation
between these two,

677
00:34:35,570 --> 00:34:38,179
and what type of
alignment would you

678
00:34:38,179 --> 00:34:42,652
want to use when comparing
these two proteins?

679
00:34:42,652 --> 00:34:43,610
Yeah, what's your name?

680
00:34:43,610 --> 00:34:44,101
AUDIENCE: Sonia.

681
00:34:44,101 --> 00:34:45,083
PROFESSOR: Go ahead, Sonia.

682
00:34:45,083 --> 00:34:47,541
AUDIENCE: It looks like they've
got similar domains, maybe.

683
00:34:47,541 --> 00:34:50,560
So local alignment
might be better.

684
00:34:50,560 --> 00:34:52,810
PROFESSOR: And why wouldn't
you do a global alignment?

685
00:34:52,810 --> 00:34:55,970
AUDIENCE: Local, because the
local alignment might actually

686
00:34:55,970 --> 00:34:57,972
find those domains and
tell you what they are.

687
00:34:57,972 --> 00:34:59,930
PROFESSOR: So a local
alignment should at least

688
00:34:59,930 --> 00:35:03,700
find these two guys here.

689
00:35:03,700 --> 00:35:06,900
And why do these two
parallel diagonal

690
00:35:06,900 --> 00:35:08,689
lines, what does that tell you?

691
00:35:08,689 --> 00:35:10,934
AUDIENCE: That the
two different proteins

692
00:35:10,934 --> 00:35:13,679
have similar sequences, just in
different parts of the protein,

693
00:35:13,679 --> 00:35:16,180
different areas
relative to the start.

694
00:35:16,180 --> 00:35:16,960
PROFESSOR: Right.

695
00:35:16,960 --> 00:35:18,134
Yeah, go ahead.

696
00:35:18,134 --> 00:35:19,800
AUDIENCE: Doesn't it
just basically mean

697
00:35:19,800 --> 00:35:22,315
that there's a section
in sequence two

698
00:35:22,315 --> 00:35:23,880
that's in sequence one twice?

699
00:35:23,880 --> 00:35:25,100
PROFESSOR: Yeah, exactly.

700
00:35:25,100 --> 00:35:31,830
So this segment of sequence two,
here-- sorry, having trouble,

701
00:35:31,830 --> 00:35:35,680
there we go, that apart-- is
present twice in sequence one.

702
00:35:35,680 --> 00:35:39,590
It's present once from
about here over to here,

703
00:35:39,590 --> 00:35:41,970
and then it's present once
from here over to here.

704
00:35:41,970 --> 00:35:43,380
So it's repeated.

705
00:35:43,380 --> 00:35:48,580
So repeats and
things like that will

706
00:35:48,580 --> 00:35:50,080
confuse your global alignment.

707
00:35:50,080 --> 00:35:55,140
The global alignment needs to
align each residue-- or trying

708
00:35:55,140 --> 00:35:56,750
to align each residue
in protein one

709
00:35:56,750 --> 00:35:58,200
to each residue in protein two.

710
00:35:58,200 --> 00:35:59,660
And here, it's ambiguous.

711
00:35:59,660 --> 00:36:01,920
It's not clear which
part of sequence one

712
00:36:01,920 --> 00:36:03,670
to align to that
part of sequence two.

713
00:36:03,670 --> 00:36:04,800
So it'll get confused.

714
00:36:04,800 --> 00:36:06,170
It'll choose one or the other.

715
00:36:06,170 --> 00:36:07,830
But that may be
wrong, and that really

716
00:36:07,830 --> 00:36:09,730
doesn't capture what
actually happens.

717
00:36:09,730 --> 00:36:12,460
So yeah, so here
a local alignment

718
00:36:12,460 --> 00:36:13,461
would be more suitable.

719
00:36:13,461 --> 00:36:13,960
Good.

720
00:36:16,630 --> 00:36:20,080
So let's talk now about
gaps, again, which

721
00:36:20,080 --> 00:36:22,710
can be called indels.

722
00:36:22,710 --> 00:36:26,100
In protein sequence alignments,
or DNA, which many of you

723
00:36:26,100 --> 00:36:29,850
have probably seen, you
often use maybe just

724
00:36:29,850 --> 00:36:32,150
a dash to represent a gap.

725
00:36:32,150 --> 00:36:33,870
So in this alignment
here, you can

726
00:36:33,870 --> 00:36:36,180
see that's kind of a
reasonable alignment, right?

727
00:36:36,180 --> 00:36:38,960
You've got pretty good
matching on both sides.

728
00:36:38,960 --> 00:36:44,750
But there's nothing in
the second sequence that

729
00:36:44,750 --> 00:36:46,960
matches the RG in
the first sequence.

730
00:36:46,960 --> 00:36:50,200
So that would be a reasonable
alignment of those two.

731
00:36:50,200 --> 00:36:53,880
And so what's often
used is what's

732
00:36:53,880 --> 00:36:55,420
called a linear gap penalty.

733
00:36:55,420 --> 00:36:59,500
So if you have end gaps,
like in this case two,

734
00:36:59,500 --> 00:37:02,580
you assign a gap
penalty A, let's say.

735
00:37:02,580 --> 00:37:06,330
And A is a negative number.

736
00:37:06,330 --> 00:37:10,950
And then you can just run the
same kinds of algorithms, where

737
00:37:10,950 --> 00:37:13,846
you add up matches,
penalize mismatches,

738
00:37:13,846 --> 00:37:15,470
but then you have an
additional penalty

739
00:37:15,470 --> 00:37:19,540
you apply when you
introduce a gap.

740
00:37:19,540 --> 00:37:22,080
And typically,
the gap penalty is

741
00:37:22,080 --> 00:37:26,656
more severe than your
average mismatch.

742
00:37:26,656 --> 00:37:28,030
But there's really
no theory that

743
00:37:28,030 --> 00:37:31,190
says exactly how the gap
penalty should be chosen.

744
00:37:31,190 --> 00:37:33,935
But empirically,
in cases where you

745
00:37:33,935 --> 00:37:35,560
should know the
answer, where you have,

746
00:37:35,560 --> 00:37:37,260
for example, a
structural alignment,

747
00:37:37,260 --> 00:37:40,382
you can often find that
a gap penalty that's

748
00:37:40,382 --> 00:37:42,090
bigger than your
average mismatch penalty

749
00:37:42,090 --> 00:37:46,226
is usually the
right thing to do.

750
00:37:46,226 --> 00:37:47,100
So why would that be?

751
00:37:47,100 --> 00:37:51,400
Why would a gap penalty--
why would you want to set it

752
00:37:51,400 --> 00:37:53,740
larger than a typical mismatch?

753
00:37:53,740 --> 00:37:54,480
Any ideas?

754
00:37:54,480 --> 00:37:55,162
Yeah, what's your name?

755
00:37:55,162 --> 00:37:55,995
AUDIENCE: I'm Chris.

756
00:37:55,995 --> 00:37:57,050
PROFESSOR: Chris.

757
00:37:57,050 --> 00:38:03,970
AUDIENCE: Because having
mutations that shift the frame

758
00:38:03,970 --> 00:38:07,730
or that one insert would
have insertions or deletions

759
00:38:07,730 --> 00:38:14,239
is far more uncommon than just
having changing [INAUDIBLE].

760
00:38:14,239 --> 00:38:16,780
PROFESSOR: Mutations that create
insertions and deletions are

761
00:38:16,780 --> 00:38:19,650
less common than those that
introduce substitutions

762
00:38:19,650 --> 00:38:20,800
of residues.

763
00:38:20,800 --> 00:38:22,780
Everyone got that?

764
00:38:22,780 --> 00:38:23,730
That's true.

765
00:38:23,730 --> 00:38:27,630
And do you know by what factor?

766
00:38:27,630 --> 00:38:29,940
AUDIENCE: Oh, I couldn't
give you a number.

767
00:38:29,940 --> 00:38:32,795
PROFESSOR: So I mean,
this varies by organism,

768
00:38:32,795 --> 00:38:36,910
and It varies by what type of
insertion you're looking at.

769
00:38:36,910 --> 00:38:39,450
But even at the single
nucleotide level,

770
00:38:39,450 --> 00:38:42,160
having insertions
is about an order

771
00:38:42,160 --> 00:38:44,700
of magnitude less
common than having

772
00:38:44,700 --> 00:38:46,450
a substitution in
those lineages.

773
00:38:46,450 --> 00:38:49,320
And here, in order to get
an amino acid insertion,

774
00:38:49,320 --> 00:38:52,960
you actually have to have a
triplet insertion, three or six

775
00:38:52,960 --> 00:38:56,680
or some multiple of
three into the exon.

776
00:38:56,680 --> 00:38:58,520
And that's quite
a bit less common.

777
00:38:58,520 --> 00:39:00,050
So they occur less commonly.

778
00:39:00,050 --> 00:39:03,460
A mutation occurs less commonly,
and therefore the mutation

779
00:39:03,460 --> 00:39:08,250
is actually accepted by
evolution even less commonly.

780
00:39:08,250 --> 00:39:13,850
And an alternative is a
so-called affine gap penalty,

781
00:39:13,850 --> 00:39:18,440
which is defined
as G plus n lambda.

782
00:39:18,440 --> 00:39:21,590
So n is the number
of gaps, and then G

783
00:39:21,590 --> 00:39:24,030
is what's called a
gap opening penalty.

784
00:39:24,030 --> 00:39:27,860
So the idea here is
that basically the gaps

785
00:39:27,860 --> 00:39:30,930
tend to cluster.

786
00:39:30,930 --> 00:39:37,210
So having an insertion
is a rare thing.

787
00:39:37,210 --> 00:39:39,650
You penalize that
with G. But then,

788
00:39:39,650 --> 00:39:41,910
if you're going to
have an insertion,

789
00:39:41,910 --> 00:39:44,250
sometimes you'll have a big
insertion of two or three

790
00:39:44,250 --> 00:39:47,000
or four codons.

791
00:39:47,000 --> 00:39:51,840
A four codon insertion should
not be penalized twice as much

792
00:39:51,840 --> 00:39:57,690
as a two codon insertion,
because only one gap actually

793
00:39:57,690 --> 00:39:58,190
occurred.

794
00:39:58,190 --> 00:39:59,815
And when you have
this insertion event,

795
00:39:59,815 --> 00:40:04,530
it can any variety of sizes.

796
00:40:04,530 --> 00:40:06,520
You still penalize
more for a bigger gap

797
00:40:06,520 --> 00:40:10,910
than for a smaller gap,
but it's no longer linear.

798
00:40:10,910 --> 00:40:12,530
I mean, it's still
a linear function,

799
00:40:12,530 --> 00:40:14,830
just with this
constant thing added.

800
00:40:14,830 --> 00:40:18,350
So these are the two common
types of gap penalties

801
00:40:18,350 --> 00:40:20,000
that you'll see
in the literature.

802
00:40:20,000 --> 00:40:22,570
The affine works a
little bit better,

803
00:40:22,570 --> 00:40:27,180
but it is a little bit more
complicated to implement.

804
00:40:27,180 --> 00:40:31,860
So sometimes you'll see both
of them used in practice.

805
00:40:31,860 --> 00:40:35,350
And then, of course, by changing
your definition of gamma,

806
00:40:35,350 --> 00:40:37,550
you could have a
G plus n minus 1.

807
00:40:37,550 --> 00:40:42,750
So that first gap would
be G, and then all

808
00:40:42,750 --> 00:40:44,660
the subsequent gaps would gamma.

809
00:40:44,660 --> 00:40:49,000
So you're not going to have
to double score something.

810
00:40:49,000 --> 00:40:49,580
All right.

811
00:40:49,580 --> 00:40:50,080
OK.

812
00:40:53,440 --> 00:40:54,440
You've got two proteins.

813
00:40:54,440 --> 00:40:59,620
How do you actually find the
optimal global alignment?

814
00:40:59,620 --> 00:41:03,270
Any ideas on how to do this?

815
00:41:03,270 --> 00:41:06,450
So we can write one sequence
down one axis, one down

816
00:41:06,450 --> 00:41:07,550
the other axis.

817
00:41:07,550 --> 00:41:09,650
We can make this dot plot.

818
00:41:09,650 --> 00:41:13,330
The dot plot can give us some
ideas about what's going on.

819
00:41:13,330 --> 00:41:16,840
But how do we actually
find the optimal one

820
00:41:16,840 --> 00:41:20,880
where we want to start
from the beginning?

821
00:41:20,880 --> 00:41:23,750
In the end, we're going
to write the two sequences

822
00:41:23,750 --> 00:41:24,960
one above the other.

823
00:41:24,960 --> 00:41:28,845
And if the first residue
or the first sequence is n,

824
00:41:28,845 --> 00:41:31,090
and maybe we'll
align it to here,

825
00:41:31,090 --> 00:41:35,580
then we have to write
the entire sequence here

826
00:41:35,580 --> 00:41:37,950
all the way down to the end.

827
00:41:37,950 --> 00:41:40,830
And below it has to
be either a residue

828
00:41:40,830 --> 00:41:43,090
in sequence two or a gap.

829
00:41:43,090 --> 00:41:45,050
And again, we can
have gaps up here.

830
00:41:45,050 --> 00:41:47,307
So you have to do something.

831
00:41:47,307 --> 00:41:49,890
You have to make it all the way
from the beginning to the end.

832
00:41:49,890 --> 00:41:52,450
And we're just going to sum
the scores of all the matching

833
00:41:52,450 --> 00:41:55,970
residues, of all the mismatching
residues, and of all the gaps.

834
00:41:55,970 --> 00:41:58,130
How do we find that alignment?

835
00:42:02,570 --> 00:42:03,070
Chris.

836
00:42:03,070 --> 00:42:07,116
AUDIENCE: Well, since we're
using dynamic programming,

837
00:42:07,116 --> 00:42:09,266
I'm guessing that you're
going to have to fill out

838
00:42:09,266 --> 00:42:13,280
a matrix of some
sort and backtrack.

839
00:42:13,280 --> 00:42:16,610
PROFESSOR: And so when you see
the term dynamic programming,

840
00:42:16,610 --> 00:42:18,890
what does that mean to you?

841
00:42:18,890 --> 00:42:23,490
AUDIENCE: You're going to
find solutions to sub problems

842
00:42:23,490 --> 00:42:26,190
until you find a
smaller solution.

843
00:42:26,190 --> 00:42:27,956
Then you'd backtrack
through what

844
00:42:27,956 --> 00:42:31,400
you've solved so far to
find the global sequence.

845
00:42:31,400 --> 00:42:32,170
PROFESSOR: Good.

846
00:42:32,170 --> 00:42:33,628
That's a good way
of describing it.

847
00:42:33,628 --> 00:42:35,770
So what smaller
problems are you going

848
00:42:35,770 --> 00:42:37,916
to break this
large problem into?

849
00:42:37,916 --> 00:42:39,415
AUDIENCE: The smaller
sub-sequences.

850
00:42:43,875 --> 00:42:45,500
PROFESSOR: Which
smaller sub-sequences?

851
00:42:49,285 --> 00:42:49,830
Anyone else?

852
00:42:49,830 --> 00:42:52,690
You are definitely on
the right track here.

853
00:42:52,690 --> 00:42:54,780
Go ahead.

854
00:42:54,780 --> 00:42:57,720
AUDIENCE: I mean, it
says at the top there,

855
00:42:57,720 --> 00:43:00,660
one sequence across the
top and one down the side.

856
00:43:00,660 --> 00:43:03,600
You could start with just
the gap versus the sequence

857
00:43:03,600 --> 00:43:08,070
and say your gap will
increase as you move across.

858
00:43:08,070 --> 00:43:11,620
Basically, each cell there could
be filled out with information

859
00:43:11,620 --> 00:43:14,010
from some of its neighbors.

860
00:43:14,010 --> 00:43:15,925
So you want to make
sure that you fill out

861
00:43:15,925 --> 00:43:19,430
old cells in some
order so that we

862
00:43:19,430 --> 00:43:21,658
can proceed to the next
level with what we've

863
00:43:21,658 --> 00:43:23,570
[? written down. ?]

864
00:43:23,570 --> 00:43:25,260
PROFESSOR: So if
you had precisely

865
00:43:25,260 --> 00:43:30,270
to find a sub problem where you
could see what the answer is,

866
00:43:30,270 --> 00:43:34,060
and then a slightly larger sub
problem whose solution would

867
00:43:34,060 --> 00:43:38,459
build on the solution that first
one, where would you start?

868
00:43:38,459 --> 00:43:40,125
What would be your
smallest sub problem?

869
00:43:40,125 --> 00:43:42,336
AUDIENCE: I'd start
with the top row,

870
00:43:42,336 --> 00:43:45,240
because you could just
the gap versus gap,

871
00:43:45,240 --> 00:43:49,112
and then move in the row,
because you don't need anything

872
00:43:49,112 --> 00:43:51,540
above that.

873
00:43:51,540 --> 00:43:56,370
PROFESSOR: And then what's
the smallest actual problem

874
00:43:56,370 --> 00:44:01,460
where you actually have
parts of the protein aligned?

875
00:44:01,460 --> 00:44:04,250
AUDIENCE: One row in
column two, basically.

876
00:44:04,250 --> 00:44:06,725
If it's a match,
you have some score.

877
00:44:06,725 --> 00:44:10,685
And if it's a mismatch,
you have some other score.

878
00:44:10,685 --> 00:44:16,130
And you want the best
possible one in each block.

879
00:44:16,130 --> 00:44:19,360
PROFESSOR: Yeah, OK.

880
00:44:19,360 --> 00:44:20,440
Yeah.

881
00:44:20,440 --> 00:44:21,770
That's good.

882
00:44:21,770 --> 00:44:26,290
So just to generalize
this-- hopefully this

883
00:44:26,290 --> 00:44:32,480
is blank-- in general, you
could think about we've got,

884
00:44:32,480 --> 00:44:37,720
let's say, 1 to n here,
and a sequence 1 to n here.

885
00:44:37,720 --> 00:44:42,710
You could think about a position
i here and a position j here.

886
00:44:42,710 --> 00:44:48,790
And we could say finding the
optimal global alignment,

887
00:44:48,790 --> 00:44:49,839
that's a big problem.

888
00:44:49,839 --> 00:44:50,630
That's complicated.

889
00:44:50,630 --> 00:44:54,920
But finding an alignment
of just the sequence from 1

890
00:44:54,920 --> 00:44:57,440
to i in the first protein
against the sequence

891
00:44:57,440 --> 00:44:59,650
from 1 to j in the
second protein,

892
00:44:59,650 --> 00:45:01,310
that could be pretty easy.

893
00:45:01,310 --> 00:45:05,150
If i is 2 and j is 2,
you've got a dipeptide

894
00:45:05,150 --> 00:45:06,265
against a dipeptide.

895
00:45:06,265 --> 00:45:07,890
You could actually
try all combinations

896
00:45:07,890 --> 00:45:10,790
and get the optimal
alignment there.

897
00:45:10,790 --> 00:45:15,800
And so the idea, then, is if you
can record those optimal scores

898
00:45:15,800 --> 00:45:20,010
here in this matrix, then you
could build out, for example,

899
00:45:20,010 --> 00:45:24,000
like this, and find
the optimal alignments

900
00:45:24,000 --> 00:45:28,280
of increasingly
bigger sub problems

901
00:45:28,280 --> 00:45:32,626
where you add another residue
in each direction, for example.

902
00:45:32,626 --> 00:45:34,710
Does that makes sense to you?

903
00:45:34,710 --> 00:45:37,260
The idea of a dynamic
programming algorithm

904
00:45:37,260 --> 00:45:39,925
is it's a form of
recursive optimization.

905
00:45:39,925 --> 00:45:43,420
So you first optimize
something small,

906
00:45:43,420 --> 00:45:46,040
and then you optimize
something bigger

907
00:45:46,040 --> 00:45:49,560
using the solution you got
from that smaller piece.

908
00:45:49,560 --> 00:45:52,480
And the way that that's
done for protein sequences

909
00:45:52,480 --> 00:45:58,050
in Neeleman-Wunsch is
to, as we were saying,

910
00:45:58,050 --> 00:46:02,000
first consider that there
might be a gap in one aligning

911
00:46:02,000 --> 00:46:04,310
to a residue in the other.

912
00:46:04,310 --> 00:46:08,260
So we need to put these
gaps down across the top

913
00:46:08,260 --> 00:46:10,090
and down the side.

914
00:46:13,140 --> 00:46:16,580
This is a linear gap
penalty, for example.

915
00:46:16,580 --> 00:46:19,900
And so here would
be how you start.

916
00:46:19,900 --> 00:46:24,170
And this is a gap penalty,
obviously, of minus 8.

917
00:46:24,170 --> 00:46:28,060
So if you're the
optimal solution that

918
00:46:28,060 --> 00:46:36,800
begins with this V in
the top sequence aligned

919
00:46:36,800 --> 00:46:42,826
to this gap in the
vertical sequence,

920
00:46:42,826 --> 00:46:44,450
there's one gap there,
so it's minus 8.

921
00:46:44,450 --> 00:46:49,690
And then if you want to start
with two gaps against this V

922
00:46:49,690 --> 00:46:51,600
and D, then it's minus 16.

923
00:46:51,600 --> 00:46:54,310
So that's how you
would start it.

924
00:46:54,310 --> 00:46:58,040
So you start with these problems
where there's no options.

925
00:46:58,040 --> 00:47:01,220
If you have two gaps against
two residues, that's minus 16.

926
00:47:01,220 --> 00:47:04,780
By our scoring system,
it's unambiguous.

927
00:47:04,780 --> 00:47:08,230
So you just can fill those in.

928
00:47:08,230 --> 00:47:10,070
And then you can
start thinking about,

929
00:47:10,070 --> 00:47:16,920
what do we put right here?

930
00:47:16,920 --> 00:47:21,140
What score should
we put right there?

931
00:47:21,140 --> 00:47:24,870
Remember, we're defining
the entries in this matrix

932
00:47:24,870 --> 00:47:31,740
as the optimal score of the
sub-sequence of the top protein

933
00:47:31,740 --> 00:47:35,445
up to position i against
the vertical protein up

934
00:47:35,445 --> 00:47:36,440
to position j.

935
00:47:36,440 --> 00:47:38,990
So that would be the
top protein position

936
00:47:38,990 --> 00:47:42,580
one up to the vertical
protein position one.

937
00:47:42,580 --> 00:47:45,150
What score would that be?

938
00:47:45,150 --> 00:47:48,835
What's the optical
alignment there

939
00:47:48,835 --> 00:47:50,640
that ends position
one both sequences?

940
00:47:53,010 --> 00:47:54,510
It'll depend on
your scoring system.

941
00:47:54,510 --> 00:47:57,450
But for a reasonable scoring
system, that's a match.

942
00:47:57,450 --> 00:47:59,220
That's going to get
some positive score.

943
00:47:59,220 --> 00:48:02,650
That's going to be better than
anything involving a gap in one

944
00:48:02,650 --> 00:48:04,990
against a gap in the other
or something crazy like that.

945
00:48:04,990 --> 00:48:08,680
So that's going to get whatever
your VV match score is.

946
00:48:11,860 --> 00:48:15,140
This is your Sij from
your scoring matrix

947
00:48:15,140 --> 00:48:18,390
for your different amino acids.

948
00:48:18,390 --> 00:48:26,030
And then, basically the
way that this is done

949
00:48:26,030 --> 00:48:35,470
is to consider that when
you're matching that position

950
00:48:35,470 --> 00:48:39,170
one against position one, you
might have come from a gap

951
00:48:39,170 --> 00:48:42,250
before in one sequence or a
gap in the other sequence,

952
00:48:42,250 --> 00:48:45,190
or from a match position
in the other sequence.

953
00:48:45,190 --> 00:48:47,830
And that leads to
these three arrows.

954
00:48:47,830 --> 00:48:52,630
I think it gets clear if I write
up the whole algorithm here.

955
00:48:52,630 --> 00:48:57,010
So Sij is the score of the
optimal alignment ending

956
00:48:57,010 --> 00:49:00,630
at position i in sequence one
and position j in sequence two.

957
00:49:00,630 --> 00:49:04,470
Requires that we know
what's above, to the left,

958
00:49:04,470 --> 00:49:06,280
and diagonally above.

959
00:49:06,280 --> 00:49:12,570
And you solve it from
the top and left down

960
00:49:12,570 --> 00:49:15,810
to the bottom and
right, which is often

961
00:49:15,810 --> 00:49:19,490
called dynamic programming.

962
00:49:19,490 --> 00:49:23,380
And let's just look at
what the recursion is.

963
00:49:23,380 --> 00:49:27,780
So Needleman and
Wunsch basically

964
00:49:27,780 --> 00:49:32,470
observed that you could find
this optimal global alignment

965
00:49:32,470 --> 00:49:38,390
score by filling in the
matrix by at each point taking

966
00:49:38,390 --> 00:49:42,500
the maximum of these
three scores here.

967
00:49:42,500 --> 00:49:44,860
So you take the
maximum of the score

968
00:49:44,860 --> 00:49:51,660
that you had above and to the
left, so diagonally above, plus

969
00:49:51,660 --> 00:49:55,090
sigma of xi yj.

970
00:49:55,090 --> 00:49:57,420
Sigma, in this case,
is the scoring matrix

971
00:49:57,420 --> 00:50:00,040
that you're using
that's 20 by 20 that

972
00:50:00,040 --> 00:50:04,990
scores each amino acid against
each other amino acid residue.

973
00:50:04,990 --> 00:50:07,970
You add that score if you're
going to move diagonally

974
00:50:07,970 --> 00:50:09,720
to whatever the optimal
score was there,

975
00:50:09,720 --> 00:50:14,710
or if you're moving to
the right or down, you're

976
00:50:14,710 --> 00:50:17,290
adding a gap in one
sequence or the other.

977
00:50:17,290 --> 00:50:20,200
So you have to add A, which
is this gap penalty, which

978
00:50:20,200 --> 00:50:23,150
is a negative number, to
whatever the optimal alignment

979
00:50:23,150 --> 00:50:25,890
was before.

980
00:50:25,890 --> 00:50:29,330
I think it's maybe easier
if we do an example here.

981
00:50:29,330 --> 00:50:34,050
So here is the PAM250
scoring matrix.

982
00:50:34,050 --> 00:50:37,866
So this was actually developed
by Dayhoff back in the '70s.

983
00:50:37,866 --> 00:50:39,240
This might be an
updated version,

984
00:50:39,240 --> 00:50:43,150
but it's more or less
the same as the original.

985
00:50:43,150 --> 00:50:46,110
Notice, it's a
triangular matrix.

986
00:50:46,110 --> 00:50:48,508
Why is that?

987
00:50:48,508 --> 00:50:50,215
AUDIENCE: It's symmetric.

988
00:50:50,215 --> 00:50:51,590
PROFESSOR: It's
symmetric, right.

989
00:50:51,590 --> 00:50:52,530
So it has a diagonal.

990
00:50:52,530 --> 00:50:54,780
But then everything
below the diagonal,

991
00:50:54,780 --> 00:50:56,942
it would be mirrored
above the diagonal,

992
00:50:56,942 --> 00:50:57,900
because it's symmetric.

993
00:50:57,900 --> 00:51:01,660
Because you don't know when
you see a valine matched

994
00:51:01,660 --> 00:51:05,760
to a leucine, it's the same as
a leucine matched to a valine,

995
00:51:05,760 --> 00:51:09,160
because it's a symmetrical
definition of scoring.

996
00:51:09,160 --> 00:51:12,360
And here are two
relevant scores.

997
00:51:12,360 --> 00:51:18,080
So notice that VV has a score
of plus 4 in this matrix.

998
00:51:18,080 --> 00:51:23,780
And over here, VD has
a score of minus 2.

999
00:51:23,780 --> 00:51:25,420
So I'll just write those down.

1000
00:51:32,220 --> 00:51:36,814
Anyone notice anything else
interesting about this matrix?

1001
00:51:36,814 --> 00:51:39,480
We haven't said exactly where it
comes from, but we're going to.

1002
00:51:39,480 --> 00:51:40,250
Yeah, what's your name?

1003
00:51:40,250 --> 00:51:41,170
AUDIENCE: Michael.

1004
00:51:41,170 --> 00:51:42,090
PROFESSOR: Go ahead.

1005
00:51:42,090 --> 00:51:43,770
AUDIENCE: Not all the
diagonal values are the same.

1006
00:51:43,770 --> 00:51:45,519
PROFESSOR: Not all
diagonals are the same.

1007
00:51:45,519 --> 00:51:51,810
In fact, there's a pretty
big range, from 2 up to 17,

1008
00:51:51,810 --> 00:51:52,850
so a big range.

1009
00:51:52,850 --> 00:51:55,420
And anything else?

1010
00:51:55,420 --> 00:51:56,800
OK, I'm sorry, go ahead.

1011
00:51:56,800 --> 00:51:57,672
What's your name?

1012
00:51:57,672 --> 00:51:58,380
AUDIENCE: Tagius.

1013
00:51:58,380 --> 00:51:58,786
PROFESSOR: Tagius, yeah.

1014
00:51:58,786 --> 00:51:58,990
Go ahead.

1015
00:51:58,990 --> 00:52:00,448
AUDIENCE: There
are positive values

1016
00:52:00,448 --> 00:52:01,850
for things that
are not the same?

1017
00:52:01,850 --> 00:52:02,610
PROFESSOR: Yeah.

1018
00:52:02,610 --> 00:52:05,420
So all the diagonal
terms are positive.

1019
00:52:05,420 --> 00:52:09,440
So a match of any
particular residue type

1020
00:52:09,440 --> 00:52:11,500
to its identical
residue is always

1021
00:52:11,500 --> 00:52:13,790
scored positively, but
with varying scores.

1022
00:52:13,790 --> 00:52:16,310
And there are also some positive
scores in the off diagonal.

1023
00:52:16,310 --> 00:52:19,290
And where are those
positive scores occurring?

1024
00:52:19,290 --> 00:52:22,670
Notice they tend to
be to nearby residues.

1025
00:52:22,670 --> 00:52:27,710
And notice the order of
residues is not alphabetical.

1026
00:52:27,710 --> 00:52:31,550
So someone who knows a
lot about amino acids,

1027
00:52:31,550 --> 00:52:36,670
what can you see
about these scores?

1028
00:52:36,670 --> 00:52:38,390
Yeah, go ahead.

1029
00:52:38,390 --> 00:52:40,840
AUDIENCE: I think these
amino acids [INAUDIBLE]

1030
00:52:40,840 --> 00:52:42,800
based on their [INAUDIBLE].

1031
00:52:47,704 --> 00:52:49,870
PROFESSOR: So the comment
was that the residues have

1032
00:52:49,870 --> 00:52:53,339
been grouped by similar
chemistry of their side chains.

1033
00:52:53,339 --> 00:52:54,380
And that's exactly right.

1034
00:52:54,380 --> 00:52:58,205
So the basic residues,
histidine, arginine,

1035
00:52:58,205 --> 00:53:00,160
and lysine, are all together.

1036
00:53:00,160 --> 00:53:02,520
The acidic residues,
aspartate and glutamate,

1037
00:53:02,520 --> 00:53:06,330
are here, along with
asparagine and glutamine.

1038
00:53:06,330 --> 00:53:13,380
And notice that D to E
has a positive score here.

1039
00:53:13,380 --> 00:53:15,450
It's 3.

1040
00:53:15,450 --> 00:53:19,650
It's almost as good as D to D
or E to E, which are plus 4.

1041
00:53:19,650 --> 00:53:23,770
So recognizing that you can
often substitute in evolution

1042
00:53:23,770 --> 00:53:25,880
an aspartate for a glutamate.

1043
00:53:25,880 --> 00:53:30,530
So yeah, so it basically,
to some extent,

1044
00:53:30,530 --> 00:53:34,420
is scoring for
similar chemistry.

1045
00:53:34,420 --> 00:53:38,430
But that doesn't explain
why, on the diagonal,

1046
00:53:38,430 --> 00:53:41,510
you have such a large
range of values.

1047
00:53:41,510 --> 00:53:44,530
Why is a tryptophan
more like a tryptophan

1048
00:53:44,530 --> 00:53:47,490
than a serine is like a serine.

1049
00:53:47,490 --> 00:53:48,832
Tim, you want to comment?

1050
00:53:48,832 --> 00:53:52,720
AUDIENCE: Perhaps it's because
tryptophans occur very rarely

1051
00:53:52,720 --> 00:53:55,636
in all proteins [INAUDIBLE].

1052
00:53:55,636 --> 00:54:00,010
So if you've got
two [INAUDIBLE],

1053
00:54:00,010 --> 00:54:03,412
that's a lot rarer of an
occurrence and [INAUDIBLE].

1054
00:54:05,350 --> 00:54:07,850
PROFESSOR: So Tim's point was
that tryptophans occur rarely,

1055
00:54:07,850 --> 00:54:10,960
so when you see two
tryptophans aligned,

1056
00:54:10,960 --> 00:54:12,682
you should take note of it.

1057
00:54:12,682 --> 00:54:13,890
It can anchor your alignment.

1058
00:54:13,890 --> 00:54:15,731
You can be more
confident in that.

1059
00:54:15,731 --> 00:54:16,230
Sally.

1060
00:54:16,230 --> 00:54:18,700
AUDIENCE: Well, tryptophans
are also incredibly bulky,

1061
00:54:18,700 --> 00:54:24,134
and also have the ability to
make electric interactions,

1062
00:54:24,134 --> 00:54:25,616
electro-static interactions.

1063
00:54:25,616 --> 00:54:27,098
PROFESSOR: Not really
electro-static, you would say,

1064
00:54:27,098 --> 00:54:27,598
more--

1065
00:54:27,598 --> 00:54:28,491
[INTERPOSING VOICES]

1066
00:54:28,491 --> 00:54:29,074
AUDIENCE: Yes.

1067
00:54:29,074 --> 00:54:31,544
But they do have
a lot of abilities

1068
00:54:31,544 --> 00:54:34,014
to interact with
other side chains.

1069
00:54:34,014 --> 00:54:37,966
And cysteines contribute
very, very strongly

1070
00:54:37,966 --> 00:54:41,918
to the three dimensional
structure of the protein.

1071
00:54:41,918 --> 00:54:42,906
PROFESSOR: Why is that?

1072
00:54:42,906 --> 00:54:45,870
AUDIENCE: Well, because
they can form [INAUDIBLE].

1073
00:54:45,870 --> 00:54:47,181
PROFESSOR: OK.

1074
00:54:47,181 --> 00:54:47,680
Yeah.

1075
00:54:47,680 --> 00:54:50,950
So maybe you don't put your
tryptophans and your cysteines

1076
00:54:50,950 --> 00:54:54,000
into your protein by
chance, or you only

1077
00:54:54,000 --> 00:54:55,630
put them when you
want them, when

1078
00:54:55,630 --> 00:54:57,880
there's enough space
for a tryptophan.

1079
00:54:57,880 --> 00:54:59,650
And when you substitute
something smaller,

1080
00:54:59,650 --> 00:55:00,550
it leaves a gap.

1081
00:55:00,550 --> 00:55:04,430
It leaves a 3D spatial gap.

1082
00:55:04,430 --> 00:55:06,100
And so you don't want that.

1083
00:55:06,100 --> 00:55:07,880
You don't get good packing.

1084
00:55:07,880 --> 00:55:10,147
When you have cysteines,
they form disulfide bonds.

1085
00:55:10,147 --> 00:55:12,230
If you change it to something
that's non-cysteine,

1086
00:55:12,230 --> 00:55:13,271
it can form that anymore.

1087
00:55:13,271 --> 00:55:15,270
That could be disruptive
to the overall fold.

1088
00:55:15,270 --> 00:55:19,110
So those ones tend to be more
conserved in protein sequence

1089
00:55:19,110 --> 00:55:21,170
alignments, absolutely.

1090
00:55:21,170 --> 00:55:24,320
Whereas, for example, if you
look at these hydrophobics,

1091
00:55:24,320 --> 00:55:29,240
the MILV group
down here, they all

1092
00:55:29,240 --> 00:55:32,920
have positive scores
relative to each other.

1093
00:55:32,920 --> 00:55:35,740
And that says that, basically,
most the time when those

1094
00:55:35,740 --> 00:55:37,360
are used-- I mean,
there are sometimes

1095
00:55:37,360 --> 00:55:38,740
when it went really matters.

1096
00:55:38,740 --> 00:55:42,270
But a lot of time, if you just
want a transmembrane segment,

1097
00:55:42,270 --> 00:55:46,990
you can often substitute any one
of those at several positions

1098
00:55:46,990 --> 00:55:50,749
and it'll work equally well
as a transmembrane segment.

1099
00:55:50,749 --> 00:55:52,040
So these are not random at all.

1100
00:55:52,040 --> 00:55:54,300
There's some patterns here.

1101
00:55:54,300 --> 00:55:57,010
So let's go back
to this algorithm.

1102
00:55:57,010 --> 00:56:00,150
So now, if we're going to
implement this recursion,

1103
00:56:00,150 --> 00:56:04,320
so we fill on the top
row and the left column,

1104
00:56:04,320 --> 00:56:07,340
and then we need to
fill in this first.

1105
00:56:07,340 --> 00:56:10,280
I would argue the first
interesting place in the matrix

1106
00:56:10,280 --> 00:56:11,620
is right here.

1107
00:56:11,620 --> 00:56:17,460
And we consider
adding a gap here.

1108
00:56:17,460 --> 00:56:21,020
When you move vertically
or horizontally,

1109
00:56:21,020 --> 00:56:24,705
you're not adding a
match or adding a match.

1110
00:56:24,705 --> 00:56:28,799
So from this position, this is
sort of the beginning point.

1111
00:56:28,799 --> 00:56:31,090
It doesn't actually correspond
to a particular position

1112
00:56:31,090 --> 00:56:31,780
in the protein.

1113
00:56:31,780 --> 00:56:33,580
We're going to add
now the score for VV.

1114
00:56:33,580 --> 00:56:36,960
And we said that VV, you look
it up in that PAM matrix,

1115
00:56:36,960 --> 00:56:37,900
and it's plus 4.

1116
00:56:37,900 --> 00:56:39,670
So we're going to
add 4 there to 0.

1117
00:56:39,670 --> 00:56:43,120
And so that's clearly
bigger than minus 16,

1118
00:56:43,120 --> 00:56:45,717
which is what you get from
coming above or coming

1119
00:56:45,717 --> 00:56:46,300
from the left.

1120
00:56:46,300 --> 00:56:48,360
So you put in the 4.

1121
00:56:48,360 --> 00:56:52,700
And then you also, in addition
to putting that 4 there,

1122
00:56:52,700 --> 00:56:55,060
you also keep the arrow.

1123
00:56:55,060 --> 00:56:56,560
So there's that red arrow.

1124
00:56:56,560 --> 00:56:59,242
We remember where we came
from in this algorithm.

1125
00:56:59,242 --> 00:57:01,450
Because someone said something
about backtracking-- I

1126
00:57:01,450 --> 00:57:04,620
think Chris-- so that's
going to be relevant later.

1127
00:57:04,620 --> 00:57:07,400
So we basically get rid
of those two dotted arrows

1128
00:57:07,400 --> 00:57:09,850
and just keep that red
arrow as well as the score.

1129
00:57:09,850 --> 00:57:14,880
And then we fill in
the next position here.

1130
00:57:14,880 --> 00:57:18,860
And so to fill in
this, now we're

1131
00:57:18,860 --> 00:57:23,480
considering going to the second
position in sequence one,

1132
00:57:23,480 --> 00:57:27,410
but we're still only at the
first position in sequence two.

1133
00:57:27,410 --> 00:57:31,830
So if we match V to V,
then we'd have to add,

1134
00:57:31,830 --> 00:57:38,370
basically, a gap in
one of the sequences.

1135
00:57:38,370 --> 00:57:40,260
Basically it would be
a gap in sequence two.

1136
00:57:40,260 --> 00:57:42,050
And that's going to be minus 8.

1137
00:57:42,050 --> 00:57:45,610
So you take 4, and then plus
minus 8, so it's negative 4.

1138
00:57:45,610 --> 00:57:50,900
Or you could do minus 8
and then plus negative 2,

1139
00:57:50,900 --> 00:57:53,960
if you want to start
from a gap and then

1140
00:57:53,960 --> 00:57:58,300
add a DV mismatch
there, because minus 2

1141
00:57:58,300 --> 00:58:01,630
was the score for a DV mismatch.

1142
00:58:01,630 --> 00:58:04,769
Or again, you can start from a
gap and then add another gap.

1143
00:58:04,769 --> 00:58:05,810
OK, does that make sense?

1144
00:58:05,810 --> 00:58:10,420
So what is the
maximum going to be?

1145
00:58:10,420 --> 00:58:12,290
Negative 4.

1146
00:58:12,290 --> 00:58:14,610
And the arrow is going
to be horizontal,

1147
00:58:14,610 --> 00:58:17,910
because we got some bonus
points for that VV match,

1148
00:58:17,910 --> 00:58:19,850
and now it's carrying over.

1149
00:58:19,850 --> 00:58:21,100
We're negative, but that's OK.

1150
00:58:21,100 --> 00:58:24,180
We're going to just keep
the maximum, whatever it is.

1151
00:58:24,180 --> 00:58:27,670
All right, so it's minus 4,
and the horizontal arrow.

1152
00:58:27,670 --> 00:58:30,880
And then here's the
entire matrix filled out.

1153
00:58:30,880 --> 00:58:34,690
And you'll have a chance
to do this for yourself

1154
00:58:34,690 --> 00:58:36,810
on problem set one.

1155
00:58:36,810 --> 00:58:39,022
And I've also filled in arrows.

1156
00:58:39,022 --> 00:58:40,480
I haven't filled
in all the arrows,

1157
00:58:40,480 --> 00:58:42,260
because it gets
kind of cluttered.

1158
00:58:42,260 --> 00:58:45,770
But all the relevant arrows
here are filled in, as well as

1159
00:58:45,770 --> 00:58:47,480
some irrelevant arrows.

1160
00:58:47,480 --> 00:58:53,920
And so then, once
I fill this in,

1161
00:58:53,920 --> 00:58:56,100
what do I do with
this information?

1162
00:58:56,100 --> 00:58:59,532
How do I get an actual
alignment out of this matrix?

1163
00:59:03,780 --> 00:59:04,322
Any ideas?

1164
00:59:04,322 --> 00:59:05,280
Yeah, what's your name?

1165
00:59:05,280 --> 00:59:06,196
AUDIENCE: [INAUDIBLE].

1166
00:59:14,749 --> 00:59:17,290
PROFESSOR: Yeah, so what he said
is start at the bottom right

1167
00:59:17,290 --> 00:59:20,390
corner and go
backwards following

1168
00:59:20,390 --> 00:59:21,790
the red arrows in reverse.

1169
00:59:21,790 --> 00:59:23,719
Is that right?

1170
00:59:23,719 --> 00:59:25,010
So why the bottom right corner?

1171
00:59:25,010 --> 00:59:26,303
What's special about that?

1172
00:59:26,303 --> 00:59:27,219
AUDIENCE: [INAUDIBLE].

1173
00:59:31,233 --> 00:59:34,180
PROFESSOR: Yeah.

1174
00:59:34,180 --> 00:59:35,910
It's a score of the
optimal alignment

1175
00:59:35,910 --> 00:59:38,550
of the entire sequence one
against the entire sequence

1176
00:59:38,550 --> 00:59:39,930
two.

1177
00:59:39,930 --> 00:59:41,270
So that's the answer.

1178
00:59:41,270 --> 00:59:43,790
That's what we define as the
optimal global alignment.

1179
00:59:43,790 --> 00:59:47,710
And then you want to
know how you got there.

1180
00:59:47,710 --> 00:59:49,290
And so how did we get there?

1181
00:59:49,290 --> 00:59:54,760
So the fact that there's
this red arrow here,

1182
00:59:54,760 --> 00:59:59,476
what does that red arrow
correspond to specifically?

1183
00:59:59,476 --> 01:00:00,392
AUDIENCE: [INAUDIBLE].

1184
01:00:03,180 --> 01:00:03,950
PROFESSOR: Right.

1185
01:00:03,950 --> 01:00:07,630
In this particular case, for
this particular red arrow,

1186
01:00:07,630 --> 01:00:09,260
remember the
diagonals are matches.

1187
01:00:09,260 --> 01:00:10,380
So what match is that?

1188
01:00:10,380 --> 01:00:11,334
AUDIENCE: [INAUDIBLE].

1189
01:00:11,334 --> 01:00:13,250
PROFESSOR: Yeah, that's
a Y to Y match, right?

1190
01:00:13,250 --> 01:00:14,166
Can everyone see that?

1191
01:00:14,166 --> 01:00:16,340
We added Y to Y,
which was plus 10,

1192
01:00:16,340 --> 01:00:19,370
to whatever this
13 was and got 23.

1193
01:00:19,370 --> 01:00:20,960
OK, so now we go back to here.

1194
01:00:20,960 --> 01:00:22,310
And then how do we get here?

1195
01:00:22,310 --> 01:00:27,900
We came from up here by
going this diagonal arrow.

1196
01:00:27,900 --> 01:00:28,940
What is that?

1197
01:00:28,940 --> 01:00:30,110
What match was that?

1198
01:00:30,110 --> 01:00:32,016
That's a
cysteine-cysteine match.

1199
01:00:32,016 --> 01:00:33,390
And then how do
we get to this 1?

1200
01:00:36,090 --> 01:00:37,870
We came vertically.

1201
01:00:37,870 --> 01:00:40,440
And so what does that mean?

1202
01:00:40,440 --> 01:00:41,765
AUDIENCE: [INAUDIBLE].

1203
01:00:41,765 --> 01:00:43,765
PROFESSOR: We inserted a
gap, in which sequence?

1204
01:00:47,090 --> 01:00:48,455
The first one.

1205
01:00:48,455 --> 01:00:49,080
The second one?

1206
01:00:52,130 --> 01:00:55,230
What do people think?

1207
01:00:55,230 --> 01:00:55,954
Moving down.

1208
01:00:55,954 --> 01:00:56,870
AUDIENCE: [INAUDIBLE].

1209
01:01:01,220 --> 01:01:03,670
PROFESSOR: Yeah, the top one.

1210
01:01:03,670 --> 01:01:05,560
And so that got us to here.

1211
01:01:05,560 --> 01:01:09,220
Here's a match, plus 2 for
having a serine-serine match.

1212
01:01:09,220 --> 01:01:13,450
Here's a plus 3 for
having a D to E mismatch.

1213
01:01:13,450 --> 01:01:16,020
But remember, that's those
are chemically similar,

1214
01:01:16,020 --> 01:01:17,340
so they get a positive score.

1215
01:01:17,340 --> 01:01:19,871
And then this is the V to V.

1216
01:01:19,871 --> 01:01:20,940
So can you see?

1217
01:01:20,940 --> 01:01:24,080
I think I have the
optimal alignment written

1218
01:01:24,080 --> 01:01:28,160
somewhere here,
hopefully, there.

1219
01:01:28,160 --> 01:01:29,580
That's called the trace back.

1220
01:01:29,580 --> 01:01:34,650
And then that is the alignment.

1221
01:01:34,650 --> 01:01:37,500
OK, we align the Y to
the Y, the C to the C.

1222
01:01:37,500 --> 01:01:43,500
Then we have basically a gap
in this top sequence-- that's

1223
01:01:43,500 --> 01:01:48,550
that purple dash there--
that's corresponding to that L.

1224
01:01:48,550 --> 01:01:50,902
And you can see why we wanted
to put that gap in there,

1225
01:01:50,902 --> 01:01:52,360
because we want
these S's to match,

1226
01:01:52,360 --> 01:01:53,710
and we want the C's to match.

1227
01:01:53,710 --> 01:01:55,270
And the only way
to connect those

1228
01:01:55,270 --> 01:01:57,880
is to have a gap in the purple.

1229
01:01:57,880 --> 01:02:00,640
And the purple was shorter
than the green sequence

1230
01:02:00,640 --> 01:02:02,290
anyway, so we kind
of knew that there

1231
01:02:02,290 --> 01:02:05,030
was going to be a gap somewhere.

1232
01:02:05,030 --> 01:02:06,530
And good.

1233
01:02:06,530 --> 01:02:09,150
And that's the
optimal alignment.

1234
01:02:13,470 --> 01:02:17,610
So that's just some philosophy
on Needlemen-Wunsch alignments.

1235
01:02:22,960 --> 01:02:24,890
So what is
semi-global alignment?

1236
01:02:24,890 --> 01:02:26,400
You don't see that
that commonly.

1237
01:02:26,400 --> 01:02:27,180
It's not that big a deal.

1238
01:02:27,180 --> 01:02:28,929
I don't want to spend
too much time on it.

1239
01:02:28,929 --> 01:02:33,870
But it is actually
reasonable a lot of times

1240
01:02:33,870 --> 01:02:36,440
that, let's say you
have a protein that

1241
01:02:36,440 --> 01:02:40,580
has a particular
enzymatic activity,

1242
01:02:40,580 --> 01:02:48,760
and you may find that the
whole, the bulk of the protein

1243
01:02:48,760 --> 01:02:51,150
is well conserved
across species.

1244
01:02:51,150 --> 01:02:53,620
But then at the N and
C termini, there's

1245
01:02:53,620 --> 01:02:55,490
a little bit of flutter.

1246
01:02:55,490 --> 01:02:58,510
You can add a few residues or
delete a few residues, and not

1247
01:02:58,510 --> 01:03:00,570
much matters at the
N and C termini.

1248
01:03:00,570 --> 01:03:03,657
Or it may matter not for
the structure but for,

1249
01:03:03,657 --> 01:03:06,240
you know, you're adding a single
peptide so it'll be secreted,

1250
01:03:06,240 --> 01:03:07,989
or you're adding some
localization signal.

1251
01:03:07,989 --> 01:03:09,640
You're adding some
little thing that

1252
01:03:09,640 --> 01:03:11,760
isn't necessarily conserved.

1253
01:03:11,760 --> 01:03:15,870
And so a semi-global
alignment, where

1254
01:03:15,870 --> 01:03:18,820
you use the same
algorithm, except

1255
01:03:18,820 --> 01:03:23,530
that you initialize the edges of
the dynamic programming matrix

1256
01:03:23,530 --> 01:03:29,570
to 0, instead of the minus
8, minus 16 whole gap,

1257
01:03:29,570 --> 01:03:30,500
and go to 0.

1258
01:03:30,500 --> 01:03:32,920
So we're not going to penalize
for gaps of the edges.

1259
01:03:32,920 --> 01:03:35,940
And then, instead of
requiring the trace back

1260
01:03:35,940 --> 01:03:38,200
to begin at the
bottom right, Smn,

1261
01:03:38,200 --> 01:03:40,150
you allow it to
begin at the highest

1262
01:03:40,150 --> 01:03:42,970
score in the bottom row
or the rightmost column.

1263
01:03:42,970 --> 01:03:46,390
And when you do the
trace back as before,

1264
01:03:46,390 --> 01:03:51,950
these two changes basically find
the optimal global alignment

1265
01:03:51,950 --> 01:03:56,420
but allowing arbitrary
numbers of gaps at the ends

1266
01:03:56,420 --> 01:03:58,570
and just finding the core match.

1267
01:03:58,570 --> 01:04:01,110
It has to go basically
to the end of one

1268
01:04:01,110 --> 01:04:02,740
or the other
sequences, but then you

1269
01:04:02,740 --> 01:04:05,353
can have other residues
hanging off the end

1270
01:04:05,353 --> 01:04:09,180
on the other sequence, if
you want, with no penalty.

1271
01:04:09,180 --> 01:04:11,920
And this sometimes will
give a better answer,

1272
01:04:11,920 --> 01:04:14,170
so it's worth knowing about.

1273
01:04:14,170 --> 01:04:17,510
And it's quite
easy to implement.

1274
01:04:17,510 --> 01:04:20,560
Now what about gapped
local alignments?

1275
01:04:20,560 --> 01:04:22,264
So what if you
have two proteins?

1276
01:04:22,264 --> 01:04:23,680
Do you remember
those two proteins

1277
01:04:23,680 --> 01:04:26,400
where we had the two diagonal?

1278
01:04:26,400 --> 01:04:29,425
I guess they were
diagonal lines.

1279
01:04:29,425 --> 01:04:30,150
How where they?

1280
01:04:30,150 --> 01:04:30,605
Something like that.

1281
01:04:30,605 --> 01:04:31,979
Anyway, diagonal
lines like that.

1282
01:04:31,979 --> 01:04:37,730
So where in this protein on the
vertical, there is a sequence

1283
01:04:37,730 --> 01:04:44,249
here that matches two segments
of the horizontal protein.

1284
01:04:44,249 --> 01:04:46,790
So for those two, you don't want
to do this global alignment.

1285
01:04:46,790 --> 01:04:47,640
It'll get confused.

1286
01:04:47,640 --> 01:04:51,220
It doesn't know whether
to match this guy to this

1287
01:04:51,220 --> 01:04:53,400
or this other one
to the sequence.

1288
01:04:53,400 --> 01:04:55,790
So you want to use
a local alignment.

1289
01:04:55,790 --> 01:04:58,710
So how do we modify this
Needleman-Wunsch algorithm

1290
01:04:58,710 --> 01:05:00,520
to do local alignment?

1291
01:05:05,460 --> 01:05:07,068
Any ideas?

1292
01:05:07,068 --> 01:05:09,044
It's not super hard.

1293
01:05:17,950 --> 01:05:19,790
Yeah, go ahead.

1294
01:05:19,790 --> 01:05:23,180
AUDIENCE: If the score
is going to go negative,

1295
01:05:23,180 --> 01:05:26,810
instead of putting a negative
score, you just put 0,

1296
01:05:26,810 --> 01:05:36,380
and you start from where you get
the highest total score, rather

1297
01:05:36,380 --> 01:05:40,540
than the last
column or last row.

1298
01:05:40,540 --> 01:05:43,060
Start your traceback
from the highest score.

1299
01:05:43,060 --> 01:05:46,470
PROFESSOR: So whenever you're
going negative, you reset to 0.

1300
01:05:46,470 --> 01:05:49,380
Now what does that
remind you of?

1301
01:05:49,380 --> 01:05:52,800
That's the same trick
we did write previously

1302
01:05:52,800 --> 01:05:55,570
with ungapped local alignment.

1303
01:05:55,570 --> 01:05:57,090
So you reset to 0.

1304
01:05:57,090 --> 01:06:00,160
And that's as a no
penalty, because if you're

1305
01:06:00,160 --> 01:06:03,770
going negative, it's better
just to throw that stuff away

1306
01:06:03,770 --> 01:06:05,220
and start over.

1307
01:06:05,220 --> 01:06:07,380
We can do that because
we're doing local alignment.

1308
01:06:07,380 --> 01:06:09,030
We don't have to
align the whole thing.

1309
01:06:09,030 --> 01:06:11,410
So that's allowed.

1310
01:06:11,410 --> 01:06:16,600
And then, rather than going
to the bottom right corner,

1311
01:06:16,600 --> 01:06:18,220
you can be anywhere
in the matrix.

1312
01:06:18,220 --> 01:06:21,360
You look for that highest score
and then do the traceback.

1313
01:06:21,360 --> 01:06:22,910
That's exactly right.

1314
01:06:22,910 --> 01:06:27,600
So it's not that different.

1315
01:06:27,600 --> 01:06:29,920
There are a few
constraints, though,

1316
01:06:29,920 --> 01:06:31,100
now on the scoring system.

1317
01:06:31,100 --> 01:06:34,250
So if you think about the
Needleman-Wunsch algorithm,

1318
01:06:34,250 --> 01:06:38,150
we could actually use a matrix
that had all positive scores.

1319
01:06:38,150 --> 01:06:40,860
You could take
the PAM250 matrix.

1320
01:06:40,860 --> 01:06:42,900
And let's say the
most negative score

1321
01:06:42,900 --> 01:06:46,280
there is, I don't know,
like minus 10 or something,

1322
01:06:46,280 --> 01:06:49,620
and you could just add 10, or
even add 20 to all those score.

1323
01:06:49,620 --> 01:06:50,960
So they're all positive now.

1324
01:06:50,960 --> 01:06:52,640
And you could still
run that algorithm.

1325
01:06:52,640 --> 01:06:56,890
And it would still produce
more or less sensible results.

1326
01:06:56,890 --> 01:07:00,330
I mean, they t wouldn't be
as good as the real PAM250,

1327
01:07:00,330 --> 01:07:03,460
but you would still get
a coherent alignment out

1328
01:07:03,460 --> 01:07:04,470
of the other end.

1329
01:07:04,470 --> 01:07:06,920
But that is no
longer true when you

1330
01:07:06,920 --> 01:07:09,300
talk about the
Smith-Waterman algorithm,

1331
01:07:09,300 --> 01:07:12,540
for the same reason that an
ungapped local alignment,

1332
01:07:12,540 --> 01:07:15,807
we had to require that the
expected score be negative,

1333
01:07:15,807 --> 01:07:17,640
because you have to
have this negative drift

1334
01:07:17,640 --> 01:07:20,771
to find small regions
that go in the positive.

1335
01:07:20,771 --> 01:07:23,270
So if you have this rule, this
kind of permissive that says,

1336
01:07:23,270 --> 01:07:25,300
whenever we go negative
we can just reset to 0,

1337
01:07:25,300 --> 01:07:29,420
then you have to have this
negative drift in order

1338
01:07:29,420 --> 01:07:32,700
for positive scoring
stuff to be unusual.

1339
01:07:37,400 --> 01:07:39,726
All right, so that's
another constraint there.

1340
01:07:39,726 --> 01:07:42,100
You have to have negative
values for mismatches-- I mean,

1341
01:07:42,100 --> 01:07:43,230
not all mismatches.

1342
01:07:43,230 --> 01:07:46,810
But if you took two random
residues in alignment,

1343
01:07:46,810 --> 01:07:49,511
the average score
has to be negative.

1344
01:07:49,511 --> 01:07:51,680
I should probably rephrase
that, but more or less.

1345
01:07:51,680 --> 01:07:54,840
And here's an example
of Smith-Waterman.

1346
01:07:54,840 --> 01:07:59,940
So you right zeroes down the
left side and across the top.

1347
01:07:59,940 --> 01:08:02,690
And that's because,
remember, if you go negative,

1348
01:08:02,690 --> 01:08:03,530
you reset to 0.

1349
01:08:03,530 --> 01:08:04,730
So we're doing that.

1350
01:08:04,730 --> 01:08:07,840
And then you take the
maximum of four things.

1351
01:08:07,840 --> 01:08:12,150
So coming from the diagonal and
adding the score of the match,

1352
01:08:12,150 --> 01:08:14,410
that's the same as before.

1353
01:08:14,410 --> 01:08:17,700
Coming from the left and
adding a gap in one sequence,

1354
01:08:17,700 --> 01:08:19,859
coming from above
and adding a gap

1355
01:08:19,859 --> 01:08:22,979
in the other sequence, or 0.

1356
01:08:22,979 --> 01:08:26,620
This "or 0" business
allows us to reset to 0

1357
01:08:26,620 --> 01:08:30,550
if we ever go negative.

1358
01:08:30,550 --> 01:08:33,770
And when you have a 0, you still
keep track of these arrows.

1359
01:08:33,770 --> 01:08:36,710
But when you have a
0, there's no arrow.

1360
01:08:36,710 --> 01:08:37,600
You're starting it.

1361
01:08:37,600 --> 01:08:39,350
You're starting the
alignment right there.

1362
01:08:43,010 --> 01:08:45,930
So that's Smith-Waterman.

1363
01:08:45,930 --> 01:08:46,970
It's helpful.

1364
01:08:46,970 --> 01:08:48,569
I think on problem
set one, you'll

1365
01:08:48,569 --> 01:08:50,943
have some experience thinking
about both Needleman-Wunsch

1366
01:08:50,943 --> 01:08:52,055
and Smith-Waterman.

1367
01:08:52,055 --> 01:08:53,930
They sort of behave a
little bit differently,

1368
01:08:53,930 --> 01:08:55,649
but they're highly related.

1369
01:08:55,649 --> 01:08:59,109
So it's important to
understand how they're similar,

1370
01:08:59,109 --> 01:09:00,710
how they're different.

1371
01:09:00,710 --> 01:09:05,149
And what I want to focus on for
the remainder of this lecture

1372
01:09:05,149 --> 01:09:09,479
is just introducing the concept
of amino acid similarity

1373
01:09:09,479 --> 01:09:10,250
matrices.

1374
01:09:10,250 --> 01:09:12,990
We saw that PAM matrix, but
where does it come from?

1375
01:09:12,990 --> 01:09:15,170
And what does it mean?

1376
01:09:15,170 --> 01:09:19,810
And does it work well or not,
and are there alternatives?

1377
01:09:19,810 --> 01:09:22,740
So we could use this
identity matrix.

1378
01:09:22,740 --> 01:09:25,749
But as we've heard,
there are a number

1379
01:09:25,749 --> 01:09:28,770
of reasons why that
may not be optimal.

1380
01:09:28,770 --> 01:09:32,100
For example, the cysteines, we
should surely score them more,

1381
01:09:32,100 --> 01:09:34,250
because they're often
involved in disulfide bonds,

1382
01:09:34,250 --> 01:09:38,160
and those have major structural
effects on the protein

1383
01:09:38,160 --> 01:09:41,649
and are likely to be conserved
more than your average leucine

1384
01:09:41,649 --> 01:09:44,779
or alanine or whatever.

1385
01:09:44,779 --> 01:09:47,729
So clearly, scoring
system should

1386
01:09:47,729 --> 01:09:50,990
favor matching identical or
related amino acids, penalize

1387
01:09:50,990 --> 01:09:54,210
poor matches and for gaps.

1388
01:09:54,210 --> 01:09:56,985
And there's also
an argument that

1389
01:09:56,985 --> 01:10:00,890
can be made that it should
have to do with how often one

1390
01:10:00,890 --> 01:10:04,820
residue is substituted for
another during evolution.

1391
01:10:04,820 --> 01:10:07,060
So that commonly
substituted thing

1392
01:10:07,060 --> 01:10:10,270
should have either positive
scores or less negative

1393
01:10:10,270 --> 01:10:14,250
scores than rarely
substituted things.

1394
01:10:14,250 --> 01:10:18,280
And perhaps not
totally obvious, but it

1395
01:10:18,280 --> 01:10:20,860
is if you think
about it for a while,

1396
01:10:20,860 --> 01:10:30,630
is that any scoring system that
you dream up carries with it

1397
01:10:30,630 --> 01:10:34,130
an implicit model of
molecular evolution

1398
01:10:34,130 --> 01:10:35,780
for how often things
are going to be

1399
01:10:35,780 --> 01:10:37,510
substituted for each other.

1400
01:10:37,510 --> 01:10:40,550
So it's going to turn out
that the score is roughly

1401
01:10:40,550 --> 01:10:42,920
proportional to a
[INAUDIBLE] score

1402
01:10:42,920 --> 01:10:46,930
for the occurrence of
that pair of residues,

1403
01:10:46,930 --> 01:10:48,430
divided by how
often it would occur

1404
01:10:48,430 --> 01:10:50,370
by chance, something like that.

1405
01:10:50,370 --> 01:10:54,710
And so that if you assign
positive scores to things,

1406
01:10:54,710 --> 01:10:57,090
to certain pairs
of residues, you're

1407
01:10:57,090 --> 01:11:00,240
basically implying that
those things will commonly

1408
01:11:00,240 --> 01:11:02,860
interchange during evolution.

1409
01:11:02,860 --> 01:11:12,930
And so if you want to have
realistic, useful scores,

1410
01:11:12,930 --> 01:11:15,840
it helps to think about what the
implicit evolutionary model is

1411
01:11:15,840 --> 01:11:21,740
and whether that is a realistic
model for how proteins evolve.

1412
01:11:21,740 --> 01:11:25,420
So I'm going to come to Dayhoff.

1413
01:11:25,420 --> 01:11:29,510
And so unlike
later matrices, she

1414
01:11:29,510 --> 01:11:33,590
had an explicit
evolutionary model,

1415
01:11:33,590 --> 01:11:38,800
like an actual mathematical
model, for how proteins evolve.

1416
01:11:38,800 --> 01:11:42,550
And the idea was
that there are going

1417
01:11:42,550 --> 01:11:44,280
to be alignments
of some proteins.

1418
01:11:44,280 --> 01:11:45,970
And keep in mind,
this was in 1978.

1419
01:11:45,970 --> 01:11:49,389
So the protein database probably
had like 1,000 proteins in it,

1420
01:11:49,389 --> 01:11:49,930
or something.

1421
01:11:49,930 --> 01:11:50,929
It was very, very small.

1422
01:11:53,850 --> 01:11:57,790
But there were some
alignments that were obvious.

1423
01:11:57,790 --> 01:12:02,390
If you see two protein
segments of 50 residues long

1424
01:12:02,390 --> 01:12:07,720
that are 85 identical, there's
no way that occurred by chance.

1425
01:12:07,720 --> 01:12:10,040
You don't even need to
do statistics on that.

1426
01:12:10,040 --> 01:12:10,890
So you're sure.

1427
01:12:10,890 --> 01:12:13,750
So she took these very high
confidence protein sequence

1428
01:12:13,750 --> 01:12:18,020
alignments, and she calculated
the actual residue residue

1429
01:12:18,020 --> 01:12:21,730
substitution frequencies, how
often we have a valine in one

1430
01:12:21,730 --> 01:12:23,860
sequence as a substitute
for a leucine.

1431
01:12:23,860 --> 01:12:26,490
And it's actually
assumed it's symmetrical.

1432
01:12:26,490 --> 01:12:28,750
Again, you don't
know the direction.

1433
01:12:28,750 --> 01:12:33,540
And calculated these
substitution frequencies.

1434
01:12:33,540 --> 01:12:38,620
Basically estimated
what she called

1435
01:12:38,620 --> 01:12:42,540
a PAM1 one matrix,
which is a matrix that

1436
01:12:42,540 --> 01:12:46,100
implies 1% divergence
between proteins.

1437
01:12:46,100 --> 01:12:49,100
So there's, on average,
only a 1% chance

1438
01:12:49,100 --> 01:12:51,730
any given residue will change.

1439
01:12:51,730 --> 01:12:54,420
And the real alignments had
greater divergence than that.

1440
01:12:54,420 --> 01:12:56,310
They had something
like 15% divergence.

1441
01:12:56,310 --> 01:12:58,530
But you can look at
those frequencies

1442
01:12:58,530 --> 01:13:00,130
and reduce them
by a factor of 15,

1443
01:13:00,130 --> 01:13:03,270
and you'll get not exactly
15 but something like 15.

1444
01:13:03,270 --> 01:13:05,150
And you'll get
something where there's

1445
01:13:05,150 --> 01:13:09,240
a 1% chance of substitution.

1446
01:13:09,240 --> 01:13:13,400
And then once you have that
model for what 1% sequence

1447
01:13:13,400 --> 01:13:14,970
substitution looks
like, turns out

1448
01:13:14,970 --> 01:13:19,170
you can just represent
that as a matrix

1449
01:13:19,170 --> 01:13:24,540
and multiply it up to get a
matrix that describes what

1450
01:13:24,540 --> 01:13:33,080
5% sequence substitution looks
like, or 10% or 50% or 250%.

1451
01:13:33,080 --> 01:13:36,470
So that PAM250 matrix that we
talked about before, that's

1452
01:13:36,470 --> 01:13:44,940
a model for what 250% amino
acid substitution looks like.

1453
01:13:44,940 --> 01:13:46,270
How does that even make sense?

1454
01:13:46,270 --> 01:13:49,981
How can you have more than 100%?

1455
01:13:49,981 --> 01:13:55,460
Is anyone with me on this?

1456
01:13:55,460 --> 01:13:56,390
Tim, yeah.

1457
01:13:56,390 --> 01:13:57,764
AUDIENCE: Because
it can go back.

1458
01:13:57,764 --> 01:14:00,235
So it's more likely, in
some cases, that you revert

1459
01:14:00,235 --> 01:14:02,833
rather than [INAUDIBLE].

1460
01:14:02,833 --> 01:14:04,300
PROFESSOR: Right.

1461
01:14:04,300 --> 01:14:08,580
So a PAM10 matrix means, on
average, 10% of the residues

1462
01:14:08,580 --> 01:14:09,980
have changed.

1463
01:14:09,980 --> 01:14:11,960
But a few of those
residues might

1464
01:14:11,960 --> 01:14:15,550
have actually-- so maybe about
90% won't have changed at all.

1465
01:14:15,550 --> 01:14:17,130
Some will have
changed once, but some

1466
01:14:17,130 --> 01:14:19,810
might have even changed
twice, even at 10%.

1467
01:14:19,810 --> 01:14:23,240
And when you get to
250%, on average,

1468
01:14:23,240 --> 01:14:26,010
every residue has
changed 2 and 1/2 times.

1469
01:14:26,010 --> 01:14:28,640
But again, a few residues
might have remained the same.

1470
01:14:28,640 --> 01:14:30,740
And some residues that
change-- for example,

1471
01:14:30,740 --> 01:14:35,360
if you had an isoleucine
that mutated to a valine,

1472
01:14:35,360 --> 01:14:39,440
it might have actually changed
back already in that time.

1473
01:14:39,440 --> 01:14:43,729
So it basically accounts for
all those sorts of things.

1474
01:14:43,729 --> 01:14:45,645
And if you have commonly
substituted residues,

1475
01:14:45,645 --> 01:14:49,490
you get that type of
evolution happening.

1476
01:14:49,490 --> 01:14:49,990
All right.

1477
01:14:49,990 --> 01:14:51,930
So she took these protein
sequence alignments--

1478
01:14:51,930 --> 01:14:53,240
it looks something like
this-- and calculated

1479
01:14:53,240 --> 01:14:54,460
these statistics.

1480
01:14:54,460 --> 01:14:57,170
Again, I don't want to
go through this in detail

1481
01:14:57,170 --> 01:15:00,390
during lecture, because
it's very well described

1482
01:15:00,390 --> 01:15:02,830
in the text.

1483
01:15:02,830 --> 01:15:05,930
But what I do want to do
is introduce this concept

1484
01:15:05,930 --> 01:15:07,630
of a Markov chain,
because it's sort

1485
01:15:07,630 --> 01:15:12,230
of what is underlying
these Dayhoff matrices.

1486
01:15:12,230 --> 01:15:13,410
So let's think about it.

1487
01:15:13,410 --> 01:15:16,800
We'll do more on this next time.

1488
01:15:16,800 --> 01:15:19,900
But imagine that you
were able to sequence

1489
01:15:19,900 --> 01:15:22,310
the genomes of
cartoon characters

1490
01:15:22,310 --> 01:15:24,960
with some newly
developed technology

1491
01:15:24,960 --> 01:15:28,680
and you chose to analyze
the complicated genetics

1492
01:15:28,680 --> 01:15:30,610
of the Simpson lineage.

1493
01:15:30,610 --> 01:15:32,860
I'm assuming you all
know these people.

1494
01:15:32,860 --> 01:15:39,350
This is Grandpa and Homer eating
doughnut and his son, Bart.

1495
01:15:39,350 --> 01:15:41,710
So imagine this is
Grandpa's genome

1496
01:15:41,710 --> 01:15:45,480
at the apolipoprotein A locus.

1497
01:15:45,480 --> 01:15:53,730
And a mutation occurred that
he then passed on to Homer.

1498
01:15:53,730 --> 01:15:59,730
So this mutation occurred in the
germ line, passed on to Homer.

1499
01:15:59,730 --> 01:16:04,140
And then when Homer passed
on his genes to Bart,

1500
01:16:04,140 --> 01:16:08,816
another mutation occurred
here, changing this AT pair

1501
01:16:08,816 --> 01:16:11,750
to a GC pair in Bart.

1502
01:16:11,750 --> 01:16:17,757
So this, I would argue,
is a type of Markov chain.

1503
01:16:17,757 --> 01:16:18,840
So what is a Markov chain?

1504
01:16:18,840 --> 01:16:21,500
So it's just a
stochastic process.

1505
01:16:21,500 --> 01:16:24,310
So a stochastic process
is a random process,

1506
01:16:24,310 --> 01:16:25,780
is sort of the general meaning.

1507
01:16:25,780 --> 01:16:28,880
But here we're
going to be dealing

1508
01:16:28,880 --> 01:16:32,880
with discrete stochastic
processes, which is just

1509
01:16:32,880 --> 01:16:35,620
a sequence of random variables.

1510
01:16:35,620 --> 01:16:41,030
So X1 here is a random variable
that represents, for example,

1511
01:16:41,030 --> 01:16:43,470
the genome of an
individual, or it

1512
01:16:43,470 --> 01:16:46,000
could represent the
genotype, in this case,

1513
01:16:46,000 --> 01:16:48,693
at a particular position,
maybe whether it's an A,

1514
01:16:48,693 --> 01:16:53,400
C, G, or T at one particular
position in the genome.

1515
01:16:53,400 --> 01:16:57,230
And now the index here-- one,
two, three, and so forth--

1516
01:16:57,230 --> 01:16:58,950
is going to represent time.

1517
01:16:58,950 --> 01:17:05,460
So X1 might be the
genotype in Grandpa Simpson

1518
01:17:05,460 --> 01:17:07,360
at a particular position.

1519
01:17:07,360 --> 01:17:11,340
And X2 might be the
genotype of Homer Simpson.

1520
01:17:11,340 --> 01:17:14,450
And X3 would be the genotype
in the next generation, which

1521
01:17:14,450 --> 01:17:17,370
would be Bart Simpson.

1522
01:17:17,370 --> 01:17:20,120
And what a Markov
chain is is it's

1523
01:17:20,120 --> 01:17:22,770
a particular type of
stochastic process

1524
01:17:22,770 --> 01:17:26,950
that arises commonly in
natural sciences, really,

1525
01:17:26,950 --> 01:17:29,320
and other places
all over the place.

1526
01:17:29,320 --> 01:17:32,260
So it's a good one
to know that has

1527
01:17:32,260 --> 01:17:33,740
what's called the
Markov property.

1528
01:17:33,740 --> 01:17:37,240
And that says that
the probability

1529
01:17:37,240 --> 01:17:40,700
that the next
random variable, or

1530
01:17:40,700 --> 01:17:43,760
the genotype at the next
generation, if you will-- so

1531
01:17:43,760 --> 01:17:48,340
Xn plus 1 equals
some value, j, which

1532
01:17:48,340 --> 01:17:50,090
could be any of the
possible values,

1533
01:17:50,090 --> 01:17:55,480
say any of the four bases,
conditional on the values of X1

1534
01:17:55,480 --> 01:17:58,610
through Xn, that is the entire
history of the process up

1535
01:17:58,610 --> 01:18:03,040
to that time, is equal to
the conditional probability

1536
01:18:03,040 --> 01:18:09,300
that Xn plus 1 equals j given
only that little xn equals

1537
01:18:09,300 --> 01:18:10,580
some particular value.

1538
01:18:10,580 --> 01:18:14,500
So basically what it says
that if I tell you what

1539
01:18:14,500 --> 01:18:18,570
Homer's genotype
was at this locus,

1540
01:18:18,570 --> 01:18:21,950
and I tell you what Grandpa
Simpson's genotype was at that

1541
01:18:21,950 --> 01:18:24,480
locus, you can just
ignore Grandpa Simpson's.

1542
01:18:24,480 --> 01:18:25,310
That's irrelevant.

1543
01:18:25,310 --> 01:18:27,400
It only matters what
Homer's genotype

1544
01:18:27,400 --> 01:18:31,226
was for the purpose of
predicting Bart's genotype.

1545
01:18:31,226 --> 01:18:32,660
Does that make sense?

1546
01:18:32,660 --> 01:18:36,870
So it really doesn't
matter whether that base

1547
01:18:36,870 --> 01:18:41,450
in Homer's genome was the same
as it was in Grandpa Simpson's

1548
01:18:41,450 --> 01:18:44,650
genome, or whether it
was a mutation that's

1549
01:18:44,650 --> 01:18:47,040
specific to Homer,
because Homer is

1550
01:18:47,040 --> 01:18:49,350
the one who passes
on DNA to Bart.

1551
01:18:49,350 --> 01:18:50,250
Does that make sense?

1552
01:18:50,250 --> 01:18:54,975
So you only look
back one generation.

1553
01:18:54,975 --> 01:18:56,470
It's a type of
memoryless process,

1554
01:18:56,470 --> 01:18:59,160
that you only remember
the last generation.

1555
01:18:59,160 --> 01:19:01,650
That's the only thing
that's relevant.

1556
01:19:01,650 --> 01:19:03,710
And so to understand
Markov chains,

1557
01:19:03,710 --> 01:19:07,100
it's very important
that you all review

1558
01:19:07,100 --> 01:19:09,389
your conditional probability.

1559
01:19:09,389 --> 01:19:11,680
So we're going to do a little
bit more on Markov chains

1560
01:19:11,680 --> 01:19:12,730
next time.

1561
01:19:12,730 --> 01:19:18,220
P A given B, what
does that mean?

1562
01:19:18,220 --> 01:19:20,950
If you don't remember, look
it up in the Probability

1563
01:19:20,950 --> 01:19:25,340
and Statistics, because that's
sort of essential to Markov

1564
01:19:25,340 --> 01:19:25,840
chains.

1565
01:19:25,840 --> 01:19:29,210
So next time we're going to
talk about comparative genomics,

1566
01:19:29,210 --> 01:19:31,700
which will involve some
applications of some

1567
01:19:31,700 --> 01:19:35,390
of the alignment methods that
we've been talking about.

1568
01:19:35,390 --> 01:19:41,320
And I may post some examples of
interesting comparative genomic

1569
01:19:41,320 --> 01:19:44,162
research papers, which are
going to be optional reading.

1570
01:19:44,162 --> 01:19:46,745
You may get a little more out
of the lecture if you read them,

1571
01:19:46,745 --> 01:19:49,400
but it's not essential.