The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation, or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: Any questions from last time about Gibbs sampling? No? So at the end, we introduced this concept of relative entropy. So I just wanted to briefly review this, and make sure it's clear to everyone.

So the relative entropy is a measure of distance between probability distributions. It can be written different ways, often with this D(p, q) notation. And as you'll see, it's the mean bit score: if you're scoring a motif with a foreground model, Pk, and a background model, qk, it's the average log-odds score under the motif model.

And I asked you to show that under the special case where qk is 1 over 4 to the w -- that is, uniform background -- the relative entropy of the motif ends up being simply 2w minus H of p. Did anyone have a chance to do this? It's pretty simple -- has anyone done this? Can anyone show this? Want me to do it briefly? How many would like to actually see this derivation? It's very, very quick. Few people, OK. So I'll just do that really quick.

So summation Pk log Pk over qk equals -- so you rewrite it as a difference; the log of a quotient is the difference of the logs. So summation Pk log Pk plus summation Pk log qk. OK, and then the special case that we're dealing with here is that qk is equal to a quarter, if we're dealing with the simplest case of a one-base motif. And so you recognize that that first term is minus H of p, right? H of p is defined as minus that, so it's minus H of p. And this here, that's just a quarter. Log 2 of a quarter is minus 2. You can take the minus 2 outside of the sum, so you're ending up with minus 2 -- I'm sorry. How come Sally didn't correct me? Usually she catches these things.
So that's a minus there, right, because we're taking the difference. And so then we have a minus 2 that we're pulling out from this, and you're left with summation Pk. And summation Pk sums to 1. So that's just 1. And so this equals minus minus 2, or 2 minus H of p.

And there are many other results of this type that can be shown in information theory. Often there are simple results you can get just by splitting the expression into different terms and summing.

So another result that I mentioned earlier, without showing, is that if you have a motif, say, of length 2, the information content of that motif model can be broken into the information content of each position, if your model is such that the positions are independent. So you would have, in that case -- let's just take the entropy of a model on dinucleotides. That would be minus summation Pi Pj log Pi Pj, if you have a model where the two are independent, and this sum would be taken over both i and j.

And so if you want to show that this is equal to -- I claim that this is equal to the-- Anyway, if you have different positions, in general -- this would be the more general term, where you have two different compositions at the two positions of the motif -- then you can show that it's equal to basically the sum of the entropies at the two positions. OK, you do the same thing. You separate out the log of the product in terms of the sum of the logs, and then you do properties of summations until you get the answer. OK, so this is your homework, and obviously it won't be graded. But we'll check in next Thursday and see if anyone has questions with that.

So what is the use of relative entropy? The main use in bioinformatics is that it's a measure that takes into account non-uniform backgrounds. The standard definition of information basically works when the background is uniform, but falls apart when it's non-uniform.
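For reference, here is the board derivation above, plus the homework identity, written out. This is just the algebra described in the lecture, using base-2 logs, with q uniform over the 4 to the w possible w-mers in the first result, and two independent positions with compositions p and q in the second:

```latex
% Relative entropy with a uniform background, q_k = (1/4)^w:
D(p \| q) = \sum_k p_k \log_2 \frac{p_k}{q_k}
          = \sum_k p_k \log_2 p_k - \sum_k p_k \log_2 q_k
          = -H(p) - \log_2\!\left(4^{-w}\right) \sum_k p_k
          = 2w - H(p)

% Homework: entropy of an independent two-position model is additive
% (uses \sum_i p_i = \sum_j q_j = 1):
H = -\sum_{i,j} p_i q_j \log_2 (p_i q_j)
  = -\sum_{i,j} p_i q_j \left(\log_2 p_i + \log_2 q_j\right)
  = -\sum_i p_i \log_2 p_i - \sum_j q_j \log_2 q_j
  = H(p) + H(q)
```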
So if you have a very biased genome, like this one shown here, which is 75% A+T, then the information content of this motif -- which is P(C) equals 1 -- using the standard method would be two bits. But then that would predict, using the formula, that the motif occurs once every 2 to the information content bases. That would be 2 to the 2, which would be 4 bases, and that's clearly incorrect in this case. But the relative entropy -- if you do it, there will be four terms, but three of them just have a 0. And then one of them has a 1, so it's 1 times log of 1 over 1/8 in this case, and that will be equal to 3. And so the relative entropy clearly gives you a more sensible answer. It's a good measure for non-uniform backgrounds.

Questions about relative entropy?

All right, so then we said you can use a weight matrix, or a position-specific probability matrix, for a motif like this five-prime splice site motif, assuming independence between positions. But if that's not true, then a natural generalization would be an inhomogeneous Markov model. So now we're going to say that the base at position k depends on the base at position k minus 1, but not on anything before that. And so the probability of generating a particular sequence, S1 to S9, is now given by this expression here, where for every base after the first, you have a conditional probability. This is the conditional probability of seeing the base S2 at position minus 2, given that you saw S1 at position minus 3, and so forth. And again, you can take the log for convenience, if you like.

So I actually implemented both of these models. So just for thinking about it, if you want to implement this, you have parameters -- these conditional probability parameters -- and you estimate them as shown here. So remember, the conditional probability of A given B is the joint probability divided by the probability of B. And so in this case, that would be the joint probability of seeing C A at minus 3, minus 2, divided by the probability of seeing C at minus 3.
You could use the ratio of the frequencies or, in this case, the counts, because the normalization constant will cancel. Is that clear?

So I actually implemented both the weight matrix model and a first-order Markov model of five-prime splice sites, and scored some genomic sequence. And what you can see here -- the units are in 1/10th-bit units -- is that they both are partially successful in separating real five-prime splice sites, shown in black, from the background, shown in light bars. But in both cases, it's not a perfect separation. There's some overlap here. And if you zoom in there, you can see that the Markov model is a little bit better. It has a tighter tail on the left. So it's generally separating the true sites from the decoys a little bit better. Not dramatically better, but slightly better. Yes, question?

AUDIENCE: From the previous slide, could you clarify what the letter R and the letter S are?

PROFESSOR: Yes, sorry about that. R would be the odds ratio -- so it's the ratio of the probability of generating that sequence under the foreground model -- the plus model, we're calling it -- divided by the probability under the background, or minus, model. And then, I think I pointed out last time that when you take products of probabilities, they tend to get very small. This can cause computational problems. And so if you just take the log, you convert it into a sum. And so we'll often use score, or S, for the log of the odds ratio. Sorry, should have marked that more clearly.

So Markov models can improve performance when there is dependence, and when you have enough data to estimate the increased number of parameters. And it doesn't just have to be dependence on the previous base -- you can have a model where the probability of the next base depends on the two previous bases. That would be called a second-order Markov model, or in general, a kth-order Markov model. Sometimes, these dependencies actually occur in practice.
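A minimal sketch of how the two models just described can be trained from aligned sites and used to score a candidate sequence. The function names and the uniform 0.25 background are illustrative assumptions, not the implementation behind the slides, and for brevity it uses raw frequencies with no pseudocounts, so a zero count would make log2 fail (pseudocounts, discussed shortly, are the standard fix):

```python
from math import log2

def train_wmm(sites):
    # Weight matrix / position-specific probability matrix:
    # P(base) estimated independently at each of the w positions.
    w = len(sites[0])
    return [{b: sum(s[i] == b for s in sites) / len(sites) for b in "ACGT"}
            for i in range(w)]

def train_markov1(sites):
    # Inhomogeneous first-order conditionals for positions 1..w-1:
    # P(base at i | base at i-1) = count(dinucleotide) / count(previous base).
    w = len(sites[0])
    cond = []
    for i in range(1, w):
        table = {}
        for prev in "ACGT":
            n_prev = sum(s[i - 1] == prev for s in sites)
            table[prev] = {b: (sum(s[i - 1] == prev and s[i] == b
                                   for s in sites) / n_prev) if n_prev else 0.25
                           for b in "ACGT"}
        cond.append(table)
    return cond

def score_markov1(seq, first_pos, cond, q=0.25):
    # Log-odds score S: log2 P(seq | motif) - log2 P(seq | background),
    # with a background that emits every base with probability q.
    s = log2(first_pos[seq[0]] / q)
    for i in range(1, len(seq)):
        s += log2(cond[i - 1][seq[i - 1]][seq[i]] / q)
    return s
```

Here first_pos would be the first column of the weight matrix, since the first position has no predecessor, which is why that position only needs four parameters.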
With five-prime splice sites, it's a nice example, because there's probably a couple thousand of them in the human genome, and we know them very well, so you can make quite complex models and have enough data to train them. But in general, if you're thinking about modeling a transcription factor binding site, or something, often you might have dozens or, at best, hundreds of examples, typically. And so you might not have enough to train some of the larger models. So how many parameters do you need to fit a kth-order Markov model? So question first, yeah?

AUDIENCE: [INAUDIBLE] If you're comparing the first-order Markov models with WMM, what is WMM?

PROFESSOR: Weight matrix model, or position-specific probability matrix. Just a model with independence between the positions.

Coming back to this case. So let's suppose you are thinking about making a kth-order Markov model, because you do some statistical tests and you find there's some dependence between sets of positions in your motif. How many parameters would there be? So if you have an independence model -- a weight matrix, or position-specific probability matrix -- there are four parameters at each position, the probabilities of the four bases. Really there are only three free parameters, because the fourth is determined -- they sum to 1 -- but let's just think about it as four. Four parameters times the width of the motif.

So if I now go to a first-order Markov model, now there are more parameters, because I have these conditional probabilities at each position. So how many parameters are there? For a first-order Markov? How many do I need to estimate? Yeah, Kevin?

AUDIENCE: I think it would be 16 at each position.

PROFESSOR: Yeah, 16 at each position, except the first position, which has four. OK, and what about a second-order Markov model, where you condition on the two previous positions? 64, right?
Because you have two previous bases you're conditioning on -- that's 16 possible contexts times 4. And so in general, the formula is 4 to the k plus 1. This is really the issue -- if you have only 100 sequences, and you need to estimate 64 parameters at each position, you don't have enough data to estimate those. So you shouldn't use such a high-order model.

All right, so let's think about this -- what could happen if you don't have enough data to estimate parameters, and how can you get around that? So let's just take a very simple example. So suppose you were studying a new transcription factor. You had done some sort of pull-down assay, followed by, say, conventional sequencing, and identified 10 sequences that bind to that transcription factor. And these are the 10 sequences, and you align them. You see there is sort of a pattern there -- there's usually an A at the first position, and usually a C at the second, and so forth. And so you consider making a weight matrix model. Then you tally up -- there's eight A's, one C, one G, and no T's at the first position.

So how confident can you be that T is not compatible with binding of this transcription factor? Who thinks you can be very confident? Most of you are shaking your heads. So if you're not confident, why are you not confident? I think -- wait, were you shaking your head? What's the problem here? It's just too small a sample, right? Maybe T occurs rarely. So suppose that T occurs at a frequency of 10% in natural sequences, and we just have a random sample of those. What's the probability we wouldn't see any T's in a sample of size 10? Anyone have an idea? Anyone have a ballpark number on this? Yeah, Simona?

AUDIENCE: 0.9 to the 10th.

PROFESSOR: 0.9 to the 10th, OK. And what is that?

AUDIENCE: 0.9 is the probability that you grab one and don't see a T, and then you do that 10 times.

PROFESSOR: Yeah, exactly.
In general it's a binomial thing, but it works out to be 0.9 to the 10th. And that's roughly -- this is like a Poisson. There's a mean of 1, so it's roughly e to the minus 1, so about a 35% chance that you don't see any T's. So we really shouldn't be confident. T probably doesn't have a frequency of 0.5, but it could easily have a frequency of 10%, or even 5%, or even 15%, and you might have just not seen it. So you don't want to assign a probability of 0 to T. But what value should you assign for something you haven't seen? Sally?

So it turns out there is a principled way to do this, called pseudocounts. So basically, if you use maximum likelihood estimation -- maximum likelihood, it turns out, is equal to the observed frequency. But if you assume that the true frequency is unknown, but was sampled from all possible, reasonable frequencies -- so that's a Dirichlet distribution -- then you can calculate what the posterior distribution is in a Bayesian framework: given that you observed, for example, zero T's, what's the distribution of that parameter, the frequency of T? And it turns out it's equivalent to adding a single count to each of your bins.

I'm not going to go through the derivation, because it takes time, but it is well described in the appendix of a book called Biological Sequence Analysis, published about 10, 15 years ago by a number of leaders in the field -- Durbin, Eddy, Krogh, and Mitchison. And there's also a derivation of this in the probability and statistics primer.

So basically, you just do this posterior calculation, and it turns out to be equivalent to adding 1 count. So when you add 1 count -- and then, of course, you re-normalize, and then you get a frequency -- what it effectively does is reduce the frequency of the things that you observe most commonly, and boost up the things that you don't see, so that you actually end up assigning a probability of 0.07 to T.
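Both calculations just described fit in a few lines; this is a sketch assuming the uniform add-one (Laplace) pseudocount from the lecture:

```python
from math import exp

# Chance of seeing zero T's in 10 draws when the true T frequency is 10%:
p_no_T = 0.9 ** 10        # about 0.349
poisson = exp(-1)         # Poisson approximation, mean 1: about 0.368 (~35%)

# Add-one pseudocounts for the observed column: 8 A, 1 C, 1 G, 0 T.
counts = {"A": 8, "C": 1, "G": 1, "T": 0}
total = sum(counts.values()) + 4              # one pseudocount per base
probs = {b: (n + 1) / total for b, n in counts.items()}
# probs["T"] is 1/14, about 0.07, matching the value quoted above.
```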
Now, if you had a larger sample -- so let's imagine instead of 8, 1, 1, 0, it was 80, 10, 10, 0 -- you still add a single count. So you can see, in that case, you're only going to be adding a very small amount, close to 1%, for T. So as you get more data, it converges to the maximum likelihood estimate. But it does something more reasonable, more open-minded, in a case where you're really limited in terms of data.

So the limitation -- you always want to be aware, when you're considering going to a more complex model to get better predictability, of how much data you have, and whether you have enough to accurately estimate the parameters. And if you don't, you either simplify the model, or, if you can't simplify it anymore, consider using pseudocounts. Sometimes you'll see smaller pseudocounts added -- like instead of 1, 1, 1, 1, you might see a quarter: one pseudocount distributed across the four bins. There are arguments pro and con, which I won't go into.

So for the remainder of today, I want to introduce hidden Markov models. We'll talk about some of the terminology, some applications, and the Viterbi algorithm -- which is a core algorithm when using HMMs to predict things -- and then we'll give a couple of examples. So we'll talk about the CpG island HMM, which is about the simplest HMM I could think of, which is good for illustrating the mechanics of HMMs. And then a couple later, probably coming into the next lecture, some examples of real-world HMMs, like one that predicts transmembrane helices.

So some background reading for today's lecture is posted on the course website: there's a Nature Biotechnology primer on HMMs, and there's a little bit in the textbook. But really, if you want to understand the guts of HMMs, you should read the Rabiner tutorial, which is really pretty well done. For Thursday's lecture, I will post another of these Nature Biotechnology primers, on RNA folding.
This one actually has a little bit more content -- it probably takes a little bit longer to absorb than some of the others, but it's still a good introduction to the topic. And then it turns out the text has a pretty good section on RNA folding, so take a look at chapter 11.

So hidden Markov models can be thought of as a general approach for modeling sequence labeling problems. You have sequences -- they might be genomic sequences, protein sequences, RNA sequences. And these sequences have features -- promoters; they may have domains, et cetera, linear motifs. And you want to label those features in an unknown sequence. So a classical example would be gene finding. You have a genomic sequence; some parts are, say, exons, some are introns. You want to be able to label them; it's not known. But you might have a training set of known exons and introns, and you might learn what the sequence composition of each of those labels looks like, and then make a model that ties it all together.

And what HMMs allow you to do, though, is to have transition probabilities between the different states. You can model states, you can model the length of different types of states to some extent -- as we'll see -- and you can model which states need to follow other states. They're relatively easy to design; you can just simply draw a graph. It can even have cycles in it; that's OK. And they've been described as the LEGOs of computational sequence analysis.

They were developed originally in electrical engineering four or five decades ago for applications in voice recognition, and they're still used in voice recognition. So when you are calling up some large corporation, and instead of a person answering the phone, some computer answers the phone and attempts to recognize your voice, it could well be an HMM on the other end, which is either correctly recognizing what you're saying, or not. So you can thank them or blame them, as you wish.
410 00:22:32,990 --> 00:22:37,870 All right, so Markov Model example-- we did this before, 411 00:22:37,870 --> 00:22:42,330 imagine the genotype at a particular locus, 412 00:22:42,330 --> 00:22:48,070 and successive generations is thought of as a Markov chain. 413 00:22:48,070 --> 00:22:51,030 Bart's genotype depends on Homer's, but is conditionally 414 00:22:51,030 --> 00:22:55,090 independent of Grandpa Simpson's, given Homer's. 415 00:22:55,090 --> 00:22:56,990 So now what's a hidden Markov model? 416 00:22:56,990 --> 00:23:01,830 So imagine that our DNA sequencer is not 417 00:23:01,830 --> 00:23:03,350 working that week, we can't actually 418 00:23:03,350 --> 00:23:05,910 go in and measure the genotype. 419 00:23:05,910 --> 00:23:10,260 But instead, we're going to observe some phenotype that's 420 00:23:10,260 --> 00:23:12,470 dependent on genotype. 421 00:23:12,470 --> 00:23:17,680 But it's not dependent in a deterministic way, 422 00:23:17,680 --> 00:23:20,180 it's dependent in a more complex way, because there's 423 00:23:20,180 --> 00:23:23,390 an impact of environment, as well, let's say. 424 00:23:23,390 --> 00:23:28,710 So we're imagining that your genotype at the apolipoprotein 425 00:23:28,710 --> 00:23:31,140 locus is correlated with cholesterol, 426 00:23:31,140 --> 00:23:32,970 but doesn't completely predict it. 427 00:23:32,970 --> 00:23:36,690 So you're homozygous, you tend to have higher LDL cholesterol 428 00:23:36,690 --> 00:23:38,200 than you are heterozygous. 429 00:23:38,200 --> 00:23:39,990 But there's a distribution depending 430 00:23:39,990 --> 00:23:43,690 on how many doughnuts you eat, or something like that. 431 00:23:43,690 --> 00:23:47,160 Imagine that we observe that grandpa 432 00:23:47,160 --> 00:23:51,560 had low cholesterol, 150, Homer had high cholesterol, 433 00:23:51,560 --> 00:23:56,020 and Bart's cholesterol is intermediate. 434 00:23:56,020 --> 00:23:59,550 Now if we had just observed Bart's cholesterol, 435 00:23:59,550 --> 00:24:04,170 we would say, well, it could go either way. 436 00:24:04,170 --> 00:24:07,720 It could be homozygous or heterozygous. 437 00:24:07,720 --> 00:24:09,750 You would just look at the population frequency 438 00:24:09,750 --> 00:24:13,450 of those two, and would use that to guess. 439 00:24:13,450 --> 00:24:16,600 But remember, we know his father's cholesterol, which 440 00:24:16,600 --> 00:24:19,990 was 250, makes it much more likely 441 00:24:19,990 --> 00:24:25,714 that his father was homozygous, and then that, in turn, biases 442 00:24:25,714 --> 00:24:27,380 the distribution [? of it. ?] So that'll 443 00:24:27,380 --> 00:24:30,560 make it a little bit more likely that Bart, himself, is 444 00:24:30,560 --> 00:24:32,100 homozygous, if you didn't know. 445 00:24:32,100 --> 00:24:37,590 So this is the basic idea-- you have some observable phenotype, 446 00:24:37,590 --> 00:24:41,080 if you will, that depends, in a probabilistic way, 447 00:24:41,080 --> 00:24:42,600 on something hidden. 448 00:24:42,600 --> 00:24:48,180 And that hidden thing has some dependent structure to it. 449 00:24:48,180 --> 00:24:50,720 And you want to, then, predict those hidden states 450 00:24:50,720 --> 00:24:52,040 from the observable data. 451 00:24:52,040 --> 00:24:54,640 So we'll give some more examples coming up. 452 00:24:54,640 --> 00:24:57,670 And the way to think about these models, or at least a handy way 453 00:24:57,670 --> 00:25:01,110 to think about them, is as generative models. 
And so this is from the Rabiner tutorial -- you imagine an HMM being used to generate observable sequences. So there are these hidden states -- think of them as genotypes -- and observables -- think of them as the cholesterol levels. So the way that it works is: you choose an initial state from one of your possible hidden states, according to some initial distribution, and you set the time variable equal to 1. In this case it's t, which will, in our case, often be the position in the sequence. And then you choose an observed value, according to some probability distribution that depends on what that hidden state was. And then you transition to a new state, and then you emit another one.

So we'll do an example. Let's say bacterial gene finding is our application, and we're going to model a bacterial gene -- these are protein-coding genes, so it's got to have a start codon, it's got to have an open reading frame, and then it's got to have a stop codon. So how many different states do we need in our HMM? What should our states be? Anyone? Do you want to make that -- Tim?

AUDIENCE: Maybe you need four states, because the start state, the orf state, the stop state, and the non-genic state.

PROFESSOR: OK -- start, orf, stop, and then intergenic, or non-genic. OK, now remember, these are the hidden states, so what are they going to emit? They emit observable data; what's that observable data going to be? Sequence, OK. And how many bases of sequence should each of them emit?

AUDIENCE: Well, I guess we don't know.

PROFESSOR: You have a choice. You're the model builder; you can do anything you want. 1, 5, 10 -- any number of bases you want. And they can emit different things, if you want. This is generative; you can do anything you want -- there will be consequences later, but for now -- I'm going to call this -- go ahead.

AUDIENCE: You could start with the start and the stop states maybe being three.
PROFESSOR: Three, OK. So this is going to emit three nucleotides. How about this state? What should this emit?

AUDIENCE: Any number.

PROFESSOR: Any number? Yeah, OK -- Sally?

AUDIENCE: If you let it emit one number, and then add a self-cycle, then that would work.

PROFESSOR: So Sally wants to have this state emit one nucleotide, but she wants it to have a chance of returning to itself, so that then we can have strings of N's to represent intergenic sequence. Does that make sense? And these, I agree -- three is a good choice here. If you had this one emit three as well, then your genes would have to be a multiple of three apart from each other, which isn't realistic. You would miss out on some genes that way. So this has to be able to emit arbitrary numbers. So you could have it emit an arbitrary number, but it's going to turn out to make the Viterbi algorithm easier if it just emits one and recurs, as Sally suggested.

And then we have our orf state. So how about here? What should we do here?

AUDIENCE: It can be three, and then you put the circle [INAUDIBLE].

PROFESSOR: So I'm going to change the name to codon, because it's going to emit one codon -- three nucleotides -- and then recur to itself. And now, what transitions should we allow between states?

AUDIENCE: So start to orf, orf to stop, then stop to N, and then N to start.

PROFESSOR: Any others? Yeah?

AUDIENCE: N could go to stop, as well.

PROFESSOR: I'm sorry, N could go to stop?

AUDIENCE: Yeah, so that the gene [INAUDIBLE].

PROFESSOR: OK, so that's a question. We're thinking of a gene on the plus strand; a gene could well be on the opposite strand. And so we should probably make a model of where you would hit stop on the other strand, which would emit a triplet of the reverse complement of the stop codon, [INAUDIBLE] et cetera. That's true -- excellent point.
And then you would traverse this whole circle in the opposite direction. But it wouldn't be the same state. It would be stop -- because it would emit different things. So you'd have minus stop -- stop, minus strand. And then you'd have some other states there. And I'm not going to draw those, but that's a point. And you could have a teeny one-codon gene if you want, but it's probably not worth it.

All right, everyone have an idea about this HMM? So this is a model you have to specify in order for it to actually generate sequence. This model will actually generate annotations and sequence. You have to specify where to start -- so you have to have some probability of starting, whether the first base that you're going to generate is going to be intergenic, or start, or codon, et cetera. And you might give it a high probability of this, and then it'll generate a label -- so, for example, let's say N. And then it'll generate a base, let's say G. And then you look at these probabilities -- the transition probability here, versus this -- and you either generate another N, or you generate a start. And let's say you go to start; then you'll generate three bases, so A T G. And then you would go to the codon state, you would emit some other triplet, and so forth.

So this is a model that will generate strings of annotations with associated bases. It still doesn't predict gene structure yet, but at least it generates gene structures.

All right, so for the sake of illustrating the Viterbi algorithm, we're going to use a simpler HMM than that. So this one only has two states, and its purpose is to predict CpG islands in a vertebrate genome. So what are CpG islands? Anyone remember? What is a CpG island? Anyone heard of this? I'm sure some of you have. Well, the definition here is going to be regions of high C+G content, and relatively high abundance of CpG dinucleotides, which are unmethylated. So what is the p here?
The p means that the CG we're talking about is C followed by G along the particular DNA strand, just to distinguish it from C base-paired with G. We're not talking about a base pair here; we're talking about C and G following each other along the strand. So this dinucleotide is rare in vertebrate genomes, because CpG is the site of a methylase, and methylation of the C is mutagenic -- it leads to a much higher rate of mutation. So CpGs often mutate away, except for the ones that are necessary. But there are certain regions, often near promoters, that are unmethylated, and therefore CpGs can accumulate to higher frequencies. And so you can actually look for these regions and use them to predict where promoters are. That's one application.

So they have higher CpG dinucleotide content, and also higher C and G content. The background of the human genome is only about 40% C+G, so it's a bit AT-rich, and so you see these patches of, say, 50% to 60% C+G that are often associated with promoters -- with the promoters of roughly half of human genes.

So we're going to -- I always drop that little clicker thing. Here it is. We're going to make a model of these, and then run it to predict promoters in the genome. So here's our model. We have two states: we have a genome state -- this sort of generic position in the genome -- and then we have an island state. We have the simplest possible transitions: you can go genome to genome, genome to island, island to genome, or island to island. So now you can generate islands of arbitrary size, interspersed with genomic regions of arbitrary size. And then each of those hidden states is going to emit a single base. So a CpG island in this model is a stretch of I states in a row, flanked by G states, if you will. Everyone clear on this setup? Good.

So here, in order to fully specify the model, you need to say what all the parameters are. And there are really three classes of parameters.
There are initiation probabilities -- so the green here is the notation used in the Rabiner tutorial, except they call them pi j's. So here, I'm going to say there's a 99% chance you start in the generic genome state, and a 1% chance you start in an island state, because islands are not that common.

And then you need to specify transition probabilities. So there are four possible transitions you could make, and you need to assign probabilities to them. So if the average length of an island were 1,000 bases, then a reasonable value for the I-to-I transition would be 0.999. You have a 99.9% chance of emitting another island base, and a 0.1% chance of leaving that island state. If you just run that in this generative mode, it would generate a variety of lengths of islands, but on average, they'd be about one kb long, because the probability of terminating is one in 1,000. And then if we imagine that those one-kb islands are interspersed with genomic regions that are about, say, 100 kilobases long on average, then you would get this five-nines probability for P G G, and 10 to the minus fifth as the probability of going from genome to island. That would generate widely spaced islands, on average 100 kb apart and about one kb in length. Is that making sense?

And now, the third type of probability we need to specify are called emission probabilities, which are the b j of k in Rabiner notation. And this is where the predictive power is going to come in. There has to be a difference in the emissions if you're going to have any ability to predict these features, and so we're going to imagine that the genome is 40% C+G, and islands are 60% C+G. So it's base composition that we're modeling. We're not doing the dinucleotides here; that would make it more complicated. We're just looking for patches of high G+C content.

So now we've fully specified our model.
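In code, the fully specified model and the choose-emit-transition recipe from the Rabiner description fit in a few lines. This is a sketch, not the course's implementation, and the per-base emission probabilities are one assumed way of realizing "40% versus 60% C+G":

```python
import random

# Two-state CpG-island HMM with the parameters given above.
states = ["G", "I"]                        # G = genome, I = island
init = {"G": 0.99, "I": 0.01}              # initiation probabilities
trans = {"G": {"G": 0.99999, "I": 0.00001},
         "I": {"G": 0.001, "I": 0.999}}
emit = {"G": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},   # 40% C+G
        "I": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}}   # 60% C+G

def draw(dist):
    # Sample one outcome from a {value: probability} dict.
    return random.choices(list(dist), weights=list(dist.values()))[0]

def generate(n):
    # Run the model generatively: hidden labels and observed bases.
    h = draw(init)
    labels, seq = [], []
    for _ in range(n):
        labels.append(h)
        seq.append(draw(emit[h]))
        h = draw(trans[h])
    return "".join(labels), "".join(seq)
```

Run generatively for long enough, this produces islands averaging about one kb, spaced about 100 kb apart, as described above.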
The problem here is that the model is written as the hidden generating the observable, and the problem we're faced with, in practice, is that we have the observable sequence, and we want to go back to the hidden. So we need to reverse the conditioning that's in the model. So when you see this type of problem, how do you reverse conditioning? In general, what's a good way to do it? You see P of A given B, but the model you have is written as P of B given A. What do you do? What's that?

AUDIENCE: Bayes' theorem.

PROFESSOR: Yeah, Bayes' theorem. Right, let's do Bayes' theorem here. Remember the definition of conditional probability. If we have P of A given B -- so this might be the hidden states, given the observables -- we want to write that in terms of P of B given A. So what do we do first? How do we derive Bayes' rule? You first write the definition of conditional probability -- right, that's just the definition. And now what do I do? Split the top part into what?

AUDIENCE: P of A times P of B given [INAUDIBLE].

PROFESSOR: P of B given A. That's just another way of writing the joint probability P of A and B, using the definition of conditional probability again. So now it's written the other way. That's basically the idea. So this is the simple form. And like I said, I don't usually call it a theorem, because it's so simple -- it's something you can derive in 30 seconds. It should maybe be called a rule, or something.

There is a more general form. The simple form is for where you have two states -- basically, B or not B -- that you're dealing with. And there's this more general form that's shown on the slide, which is for when you have many states. And it's basically the same idea; it's just that we've rewritten this term, this P of B, and split it up into all the possible states. The slide starts from P of B given A and goes the other way -- anyway, you rewrite the bottom term as a sum over all the possible cases.

All right, so how does that apply to HMMs?
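For reference before applying it, here is the board derivation and the slide's many-state form, written out:

```latex
% Definition of conditional probability, then factor the joint the other way:
P(A \mid B) = \frac{P(A, B)}{P(B)} = \frac{P(A)\,P(B \mid A)}{P(B)}

% General form from the slide: expand P(B) over all possible states A_1, ..., A_n:
P(A_i \mid B) = \frac{P(A_i)\,P(B \mid A_i)}{\sum_{j=1}^{n} P(A_j)\,P(B \mid A_j)}
```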
734 00:39:37,510 --> 00:39:44,670 So with HMMs, we're interested in the joint probability 735 00:39:44,670 --> 00:39:47,960 of a set of hidden states, and a set of observable states. 736 00:39:47,960 --> 00:39:52,440 So H, capital H, is going to be a vector that 737 00:39:52,440 --> 00:39:58,415 specifies the particular hidden state-- for instance, island 738 00:39:58,415 --> 00:40:03,060 or genome-- at position 1, that's H1 all the way to H N. 739 00:40:03,060 --> 00:40:07,760 So little h's are specific values for those hidden state. 740 00:40:07,760 --> 00:40:11,400 And then, big O is a vector that describes 741 00:40:11,400 --> 00:40:14,440 the different bases in the genome. 742 00:40:14,440 --> 00:40:16,410 So O1 is the first base in the genome, 743 00:40:16,410 --> 00:40:24,310 up to O N. One can imagine comparing two H vectors, 744 00:40:24,310 --> 00:40:29,090 one of which, H versus the H primes, what's the what's 745 00:40:29,090 --> 00:40:33,010 the probability of this hidden state versus that? 746 00:40:33,010 --> 00:40:36,500 You could compare them in terms of their joint probabilities 747 00:40:36,500 --> 00:40:38,450 with this model, and perhaps favor 748 00:40:38,450 --> 00:40:42,770 those that have higher probabilities. 749 00:40:42,770 --> 00:40:43,810 Yeah? 750 00:40:43,810 --> 00:40:45,780 AUDIENCE: [INAUDIBLE] of the two capital 751 00:40:45,780 --> 00:40:49,674 H's have any different notation? 752 00:40:49,674 --> 00:40:52,090 Like the second one being H prime, or something like that? 753 00:40:52,090 --> 00:40:53,256 Or are they supposed to be-- 754 00:40:53,256 --> 00:40:55,607 PROFESSOR: With the H, in this case, 755 00:40:55,607 --> 00:40:57,940 these are probability statements about random variables. 756 00:40:57,940 --> 00:41:00,120 So H is a random variable, which could 757 00:41:00,120 --> 00:41:02,860 assume any possible sequence of hidden states. 758 00:41:02,860 --> 00:41:08,230 The little h's are specific values. 759 00:41:08,230 --> 00:41:11,190 So for instance, imagine comparing 760 00:41:11,190 --> 00:41:19,910 what's the probability of H equals genome, genome, 761 00:41:19,910 --> 00:41:23,160 genome, versus the probability that H 762 00:41:23,160 --> 00:41:26,210 equals genome, genome, island. 763 00:41:26,210 --> 00:41:28,400 So the little h's, or the little h primes, 764 00:41:28,400 --> 00:41:30,430 are specific instances. 765 00:41:30,430 --> 00:41:33,010 The H's is a random variable, unknown. 766 00:41:33,010 --> 00:41:34,211 Does that help? 767 00:41:41,280 --> 00:41:43,840 OK, so how do we apply Bayes' rule? 768 00:41:43,840 --> 00:41:46,920 So what we're interested in here is the probability 769 00:41:46,920 --> 00:41:50,380 that H, this unknown variable that represents hidden states, 770 00:41:50,380 --> 00:41:53,330 that it equals a particular set of hidden states, 771 00:41:53,330 --> 00:41:57,040 little h1 to h N, given the observables, little 772 00:41:57,040 --> 00:42:01,200 o1 to little oN, which is the actual sequence that we see. 773 00:42:01,200 --> 00:42:05,210 And we can write that using definition 774 00:42:05,210 --> 00:42:07,710 of conditional probability as the joint probability 775 00:42:07,710 --> 00:42:10,370 patient of H and O, over the probability of O. 776 00:42:10,370 --> 00:42:13,430 And then Bayes' rule, we just apply conditional probability 777 00:42:13,430 --> 00:42:13,930 again. 778 00:42:13,930 --> 00:42:20,960 It's P H times P O, given H, over P O. 
779 00:42:20,960 --> 00:42:25,590 So it turns out that this P O-- so what is P O 780 00:42:25,590 --> 00:42:30,000 equals O1 to O N in this model? 781 00:42:30,000 --> 00:42:33,720 Well, the model specifies how to generate the hidden states, 782 00:42:33,720 --> 00:42:37,300 and how the observables are generated from those hidden 783 00:42:37,300 --> 00:42:39,335 states. 784 00:42:39,335 --> 00:42:46,310 So P O is actually defined as the sum of P O comma H 785 00:42:46,310 --> 00:42:51,540 equals the first possible hidden state, plus the same term 786 00:42:51,540 --> 00:42:52,470 for the second. 787 00:42:52,470 --> 00:42:56,330 You have to sum over all the possible outcomes 788 00:42:56,330 --> 00:42:58,790 of the hidden states, every possible thing. 789 00:42:58,790 --> 00:43:00,650 So if we have a sequence of length three, 790 00:43:00,650 --> 00:43:04,120 you have to sum over the possibility 791 00:43:04,120 --> 00:43:11,850 that H might be G G G or G G I, or G I G, or G I I, 792 00:43:11,850 --> 00:43:15,400 or I G G, et cetera. 793 00:43:19,760 --> 00:43:23,417 You have to sum over eight possibilities here. 794 00:43:23,417 --> 00:43:25,000 And if the sequence is a million long, 795 00:43:25,000 --> 00:43:28,500 you have to sum over 2 to the one millionth possibilities. 796 00:43:28,500 --> 00:43:31,610 That sounds complicated to calculate. 797 00:43:31,610 --> 00:43:34,080 So it turns out that there's actually a trick, 798 00:43:34,080 --> 00:43:35,830 and you can calculate it. 799 00:43:35,830 --> 00:43:37,950 But you don't have to. 800 00:43:37,950 --> 00:43:40,410 That's one of the good things: 801 00:43:40,410 --> 00:43:42,550 we can just treat it as a constant. 802 00:43:42,550 --> 00:43:47,780 So notice that the denominator here is independent of the H's. 803 00:43:47,780 --> 00:43:50,320 So we'll just treat that as a constant, an unknown constant. 804 00:43:50,320 --> 00:43:55,720 And what we're interested in is which possible value of H 805 00:43:55,720 --> 00:43:56,900 has a higher probability. 806 00:43:56,900 --> 00:44:04,080 So we're just going to try to maximize P H equals H1 to H N-- 807 00:44:04,080 --> 00:44:06,590 find the optimal sequence of hidden states 808 00:44:06,590 --> 00:44:09,310 that optimizes that joint probability, 809 00:44:09,310 --> 00:44:12,600 the joint probability with the observable values, 810 00:44:12,600 --> 00:44:19,360 O1 to O N. Is that making sense? 811 00:44:19,360 --> 00:44:22,350 Basically, we want to find the sequence of hidden states, 812 00:44:22,350 --> 00:44:24,540 we'll call it H opt. 813 00:44:24,540 --> 00:44:27,030 So now H opt here is a particular vector. 814 00:44:27,030 --> 00:44:29,460 Capital H, by itself, is a random vector. 815 00:44:29,460 --> 00:44:33,110 This is now a particular vector of hidden states, 816 00:44:33,110 --> 00:44:35,430 H1 opt through H N opt. 817 00:44:35,430 --> 00:44:41,350 And it's defined as the vector of hidden states 818 00:44:41,350 --> 00:44:46,330 that maximizes the joint probability with O equals 819 00:44:46,330 --> 00:44:53,220 O1 to O N, where that's the observed sequence that we're 820 00:44:53,220 --> 00:44:55,160 dealing with. 821 00:44:55,160 --> 00:44:59,450 So now what I'm telling you is if we 822 00:44:59,450 --> 00:45:01,960 can find the vector of hidden states 823 00:45:01,960 --> 00:45:03,830 that maximizes the joint probability, 824 00:45:03,830 --> 00:45:08,520 then that will also maximize the conditional probability of H 825 00:45:08,520 --> 00:45:14,680 given O.
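For completeness, the brute-force version of that sum can be written directly. This hypothetical sketch reuses the joint_prob helper from the earlier sketch and is only feasible for toy lengths, which is exactly the point; it also makes explicit why the denominator can be ignored when ranking hidden-state sequences.

```python
from itertools import product

def marginal_prob(obs, states, init, trans, emit):
    """P(O = obs): sum the joint probability over every possible hidden path.
    For K states and length N there are K**N terms (2**N with genome/island),
    so this brute force is for toy examples only; the forward algorithm is
    the 'trick' that avoids it.  Key point for what follows: this quantity
    does not depend on which h we are scoring, so for finding the best h
    it is a constant we can drop -- argmax_h P(h | obs) = argmax_h P(h, obs)."""
    return sum(joint_prob(h, obs, init, trans, emit)
               for h in product(states, repeat=len(obs)))
```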
And here the language of linguistics 826 00:45:14,680 --> 00:45:18,420 is often used: it's called the optimal parse of the sequence. 827 00:45:18,420 --> 00:45:21,980 You'll see that sometimes, I might say that. 828 00:45:21,980 --> 00:45:28,510 So the solution is to define these variables, 829 00:45:28,510 --> 00:45:35,610 R I of H, which are defined as the probability 830 00:45:35,610 --> 00:45:37,730 of the optimal parse of the subsequence from one 831 00:45:37,730 --> 00:45:40,980 to I-- not the whole long sequence, 832 00:45:40,980 --> 00:45:43,380 but a little piece of it from the beginning 833 00:45:43,380 --> 00:45:47,620 to a particular place in the middle, that ends in state H. 834 00:45:47,620 --> 00:45:48,640 And so first, 835 00:45:48,640 --> 00:45:52,580 we calculate R 1 of state 1, the probability 836 00:45:52,580 --> 00:45:54,320 of generating the first base 837 00:45:54,320 --> 00:45:56,420 and ending in hidden state 1. 838 00:45:56,420 --> 00:45:59,040 And then we would do the hidden state 2, 839 00:45:59,040 --> 00:46:02,110 and then we basically have to figure out a way, a recursion, 840 00:46:02,110 --> 00:46:07,800 for getting the probabilities of the optimal parses ending 841 00:46:07,800 --> 00:46:10,850 at each of the states at position 2, 842 00:46:10,850 --> 00:46:12,334 given the values at position 1. 843 00:46:12,334 --> 00:46:14,000 And then we go, work our way all the way 844 00:46:14,000 --> 00:46:16,250 down to the end of the sequence. 845 00:46:16,250 --> 00:46:18,110 And then we'll figure out which is better, 846 00:46:18,110 --> 00:46:19,940 and then we'll backtrack to figure out 847 00:46:19,940 --> 00:46:23,460 what that optimal parse was. 848 00:46:23,460 --> 00:46:25,000 We'll do an example on the board, 849 00:46:25,000 --> 00:46:31,180 this is unlikely to be completely clear at this point. 850 00:46:31,180 --> 00:46:32,800 But don't worry. 851 00:46:32,800 --> 00:46:35,770 So why is this called the Viterbi algorithm? 852 00:46:35,770 --> 00:46:38,320 Well, this is the guy who figured it out. 853 00:46:38,320 --> 00:46:40,590 He was actually an MIT alum. 854 00:46:40,590 --> 00:46:44,122 He did his bachelor's and master's in double E, 855 00:46:44,122 --> 00:46:48,220 I don't know, quite a while ago, '50s or '60s. 856 00:46:48,220 --> 00:46:51,210 And later went on to found Qualcomm, 857 00:46:51,210 --> 00:46:57,060 and is now a big philanthropist, who apparently supports USC. 858 00:46:57,060 --> 00:47:00,540 I don't know why he lost his loyalty to MIT, 859 00:47:00,540 --> 00:47:03,870 but maybe he'll come back and give us a seminar. 860 00:47:03,870 --> 00:47:06,810 I actually met him once. 861 00:47:06,810 --> 00:47:08,840 Let's talk about his algorithm a little more. 862 00:47:08,840 --> 00:47:14,690 So what I want to do is I want to take a particular HMM, 863 00:47:14,690 --> 00:47:18,480 so we'll take our CpG island HMM, 864 00:47:18,480 --> 00:47:21,370 and then we'll go through the actual Viterbi 865 00:47:21,370 --> 00:47:26,140 algorithm on the board for a particular sequence. 866 00:47:26,140 --> 00:47:29,660 And you'll see that it's actually pretty simple. 867 00:47:29,660 --> 00:47:31,200 But then you'll also see that it's 868 00:47:31,200 --> 00:47:33,770 not totally obvious why it works.
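In symbols, the R variables and their recursion can be written as follows; this is a standard statement of the Viterbi recursion, reconstructed here (with pi for initial, a for transition, and e for emission probabilities) rather than copied from the slide:

$$R_1(h) = \pi_h \, e_h(o_1), \qquad R_i(h) = e_h(o_i)\,\max_{h'}\big[\,R_{i-1}(h')\,a_{h'h}\,\big], \qquad P(H^{\mathrm{opt}}, O) = \max_h R_N(h).$$

Each max also records which h' achieved it; those are the arrows that get circled in the board example below, and following them backward from the best final state recovers the optimal parse.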
869 00:47:36,500 --> 00:47:38,210 The mechanics of it are not that bad, 870 00:47:38,210 --> 00:47:41,690 but really understanding how it 871 00:47:41,690 --> 00:47:44,430 is able to come up with the optimal parses, that's 872 00:47:44,430 --> 00:47:49,270 the more subtle part. 873 00:47:49,270 --> 00:47:53,360 So let's suppose we have a sequence, 874 00:47:53,360 --> 00:47:58,620 A C G. Can anyone tell me what the optimal parse 875 00:47:58,620 --> 00:48:01,394 of this sequence is, without doing Viterbi? 876 00:48:04,240 --> 00:48:07,050 With this particular model, these initiation 877 00:48:07,050 --> 00:48:11,740 probabilities, transitions, and emissions? 878 00:48:11,740 --> 00:48:16,570 Do you know what it's going to be in advance? 879 00:48:16,570 --> 00:48:17,910 Any guesses? 880 00:48:17,910 --> 00:48:19,879 AUDIENCE: How about genome, island, island. 881 00:48:19,879 --> 00:48:21,295 PROFESSOR: Genome, island, island. 882 00:48:21,295 --> 00:48:24,220 Because, you're saying, that way the emissions 883 00:48:24,220 --> 00:48:26,010 will be optimized, right? 884 00:48:26,010 --> 00:48:28,400 Because you'll emit the C's and G's. 885 00:48:28,400 --> 00:48:29,680 OK, that's a reasonable guess. 886 00:48:29,680 --> 00:48:31,496 Sally's shaking her head, though. 887 00:48:31,496 --> 00:48:33,137 AUDIENCE: The transitional probability 888 00:48:33,137 --> 00:48:36,059 from being in the genome is very, very small, 889 00:48:36,059 --> 00:48:38,210 and so it's more likely that it'll either only 890 00:48:38,210 --> 00:48:39,960 be in the genome or only be in the island. 891 00:48:39,960 --> 00:48:41,585 PROFESSOR: So the transition from going 892 00:48:41,585 --> 00:48:44,220 from a genome to island or island to genome is very small, 893 00:48:44,220 --> 00:48:46,220 and so she's saying you're going to pay a bigger 894 00:48:46,220 --> 00:48:48,640 penalty for making that transition in there, that 895 00:48:48,640 --> 00:48:50,630 may not be offset by the emissions. 896 00:48:50,630 --> 00:48:51,780 Right, is that your point? 897 00:48:51,780 --> 00:48:53,820 Yeah, question? 898 00:48:53,820 --> 00:48:56,985 AUDIENCE: Check here-- when we're talking about the optimal 899 00:48:56,985 --> 00:49:01,700 parse, we're saying let's maximize the probability 900 00:49:01,700 --> 00:49:03,700 of that letter-- 901 00:49:03,700 --> 00:49:06,188 PROFESSOR: The joint probability. 902 00:49:06,188 --> 00:49:08,658 AUDIENCE: Sorry, the joint probability of that letter-- 903 00:49:08,658 --> 00:49:10,866 PROFESSOR: Of that [INAUDIBLE] state and that letter. 904 00:49:10,866 --> 00:49:12,116 AUDIENCE: OK, so that means-- 905 00:49:12,116 --> 00:49:14,092 PROFESSOR: Or that set of [INAUDIBLE] states 906 00:49:14,092 --> 00:49:15,670 and that set of bases. 907 00:49:15,670 --> 00:49:16,070 AUDIENCE: So when we're computing 908 00:49:16,070 --> 00:49:17,486 across this three-letter thing, we 909 00:49:17,486 --> 00:49:21,846 have to say the probability of the letter, then let's 910 00:49:21,846 --> 00:49:24,266 multiply it by the probability of the transition 911 00:49:24,266 --> 00:49:28,622 to the next letter, and then multiply it again 912 00:49:28,622 --> 00:49:31,410 [INAUDIBLE] and that letter. 913 00:49:31,410 --> 00:49:33,020 PROFESSOR: So let's do this. 914 00:49:33,020 --> 00:49:36,400 If A C G is our sequence-- I'm just 915 00:49:36,400 --> 00:49:38,290 going to space it out a little bit.
916 00:49:38,290 --> 00:49:43,970 Here's our A at position one, here's our C at position two, 917 00:49:43,970 --> 00:49:49,100 and our G at position three. 918 00:49:49,100 --> 00:49:51,440 And then we have our hidden states. 919 00:49:51,440 --> 00:49:58,100 And so we'll write genome first, and then we have island here. 920 00:50:01,370 --> 00:50:05,700 And so what is the optimal parse of the sequence from base 921 00:50:05,700 --> 00:50:10,920 one to base one, that ends in genome? 922 00:50:10,920 --> 00:50:12,800 It's just the one that starts in genome, 923 00:50:12,800 --> 00:50:15,430 because it doesn't go-- right, so it's just genome. 924 00:50:15,430 --> 00:50:17,920 And what is its probability? 925 00:50:17,920 --> 00:50:19,890 That's how this thing is defined here. 926 00:50:19,890 --> 00:50:24,870 This is this R I H thing I was talking about. 927 00:50:24,870 --> 00:50:28,230 This is H, and this is I here. 928 00:50:31,130 --> 00:50:35,130 So the probability of the optimal parse 929 00:50:35,130 --> 00:50:38,350 of the sequence, up to position one, that ends in genome, 930 00:50:38,350 --> 00:50:40,200 is just the one that starts in genome, 931 00:50:40,200 --> 00:50:41,651 and then emits that base. 932 00:50:41,651 --> 00:50:43,650 So what's the probability of starting in genome? 933 00:50:46,310 --> 00:50:49,270 It's five nines, right? 934 00:50:49,270 --> 00:50:54,130 So that's the initial probability-- 9 9 9 9 9. 935 00:50:54,130 --> 00:50:58,040 And then what's the probability of emitting an A, given 936 00:50:58,040 --> 00:51:00,810 that we're in the genome state? 937 00:51:00,810 --> 00:51:02,380 0.3. 938 00:51:02,380 --> 00:51:06,360 So I claim that this is the value 939 00:51:06,360 --> 00:51:11,370 of R1 of genome, of the genome state. 940 00:51:11,370 --> 00:51:12,590 OK, that's the optimal parse. 941 00:51:12,590 --> 00:51:14,006 There's only one parse, so there's 942 00:51:14,006 --> 00:51:17,080 nothing-- it is what it is. 943 00:51:17,080 --> 00:51:20,190 You start here, there's no transitions-- we started here-- 944 00:51:20,190 --> 00:51:22,930 and then you emit an A. 945 00:51:22,930 --> 00:51:26,070 What's the probability of the optimal parse 946 00:51:26,070 --> 00:51:29,149 ending in island at position one of the sequence? 947 00:51:29,149 --> 00:51:29,690 Someone else? 948 00:51:29,690 --> 00:51:31,391 Yeah, question? 949 00:51:31,391 --> 00:51:33,796 AUDIENCE: Why are we using the transition probability? 950 00:51:33,796 --> 00:51:37,171 PROFESSOR: This is the initial-- Oh, I'm sorry. 951 00:51:37,171 --> 00:51:37,670 Correct. 952 00:51:37,670 --> 00:51:40,286 Thank you, thank you-- what was your name? 953 00:51:40,286 --> 00:51:41,036 AUDIENCE: Deborah. 954 00:51:41,036 --> 00:51:41,910 PROFESSOR: Deborah, OK, thanks Deborah. 955 00:51:41,910 --> 00:51:43,451 It should be the initial probability, 956 00:51:43,451 --> 00:51:45,740 which is 0.99, good. 957 00:51:45,740 --> 00:51:47,160 Initial probability. 958 00:51:47,160 --> 00:51:47,910 What about island? 959 00:51:47,910 --> 00:51:50,062 Deborah, you want to take this one? 960 00:51:50,062 --> 00:51:51,970 AUDIENCE: 0.01. 961 00:51:51,970 --> 00:51:53,470 PROFESSOR: 0.01 to be in island. 962 00:51:53,470 --> 00:51:58,930 And what about the emission probability? 963 00:51:58,930 --> 00:52:00,840 We have to start in island, and then 964 00:52:00,840 --> 00:52:03,617 emit an A, with probability what? 965 00:52:03,617 --> 00:52:04,200 AUDIENCE: 0.2.
966 00:52:04,200 --> 00:52:06,810 PROFESSOR: 0.2, yeah, it's up on the screen. 967 00:52:06,810 --> 00:52:09,080 Should be, hopefully. 968 00:52:09,080 --> 00:52:10,900 Yeah, so 0.01 times 0.2-- 0.002. 969 00:52:10,900 --> 00:52:12,400 So who's winning so far? 970 00:52:12,400 --> 00:52:15,640 If the sequences ended at position one? 971 00:52:15,640 --> 00:52:16,690 Genome, genome's winning. 972 00:52:16,690 --> 00:52:18,190 This is a lot bigger than that, it's 973 00:52:18,190 --> 00:52:21,170 about 150 times bigger, right? 974 00:52:21,170 --> 00:52:25,210 Now what do we do when we go-- we said we have to do recursion, right? 975 00:52:25,210 --> 00:52:27,290 We have to figure out the probability 976 00:52:27,290 --> 00:52:30,840 of the optimal parse ending at position two in each 977 00:52:30,840 --> 00:52:34,480 of these states, given the optimal parse ending 978 00:52:34,480 --> 00:52:36,280 at position one. 979 00:52:36,280 --> 00:52:37,420 How do we figure that out? 980 00:52:43,594 --> 00:52:45,260 What are we going to write here, or what 981 00:52:45,260 --> 00:52:51,310 do we have to compare to figure out what to put here? 982 00:52:51,310 --> 00:52:53,200 There are two possible parses ending in genome 983 00:52:53,200 --> 00:52:54,880 at position two-- there's the one that 984 00:52:54,880 --> 00:52:56,770 started in genome at position one, 985 00:52:56,770 --> 00:52:58,700 and there's the one that started in island. 986 00:52:58,700 --> 00:53:02,680 So you have to compare this to this. 987 00:53:02,680 --> 00:53:06,200 So you compare what the probability of this parse 988 00:53:06,200 --> 00:53:09,110 was times the transition probability, 989 00:53:09,110 --> 00:53:12,740 and then the emission in that state. 990 00:53:12,740 --> 00:53:13,930 So what would that be? 991 00:53:16,780 --> 00:53:19,730 What would this be, if we stay in genome? 992 00:53:19,730 --> 00:53:22,380 What's the transition? 993 00:53:22,380 --> 00:53:25,010 Now we've got our five nines, yeah, good. 994 00:53:25,010 --> 00:53:26,430 So five nines. 995 00:53:26,430 --> 00:53:28,074 And the emission is what? 996 00:53:28,074 --> 00:53:30,490 AUDIENCE: Genome at 0.2. 997 00:53:30,490 --> 00:53:32,680 PROFESSOR: 0.2, right. 998 00:53:32,680 --> 00:53:35,860 And times this. 999 00:53:35,860 --> 00:53:37,070 And what about this one? 1000 00:53:37,070 --> 00:53:38,720 What are we going to multiply when 1001 00:53:38,720 --> 00:53:41,356 we consider this island to genome transition? 1002 00:53:41,356 --> 00:53:48,132 AUDIENCE: [INAUDIBLE] 0.01? 1003 00:53:48,132 --> 00:53:50,760 Because the genome is still 0.2. 1004 00:53:50,760 --> 00:53:53,096 PROFESSOR: It's still 0.2. 1005 00:53:53,096 --> 00:53:54,720 So we take the maximum of these, right? 1006 00:53:54,720 --> 00:53:56,678 We're doing optimal parse, highest probability. 1007 00:53:56,678 --> 00:53:58,920 So which one of these two turns out bigger? 1008 00:53:58,920 --> 00:54:00,450 Clearly, the top one is bigger. 1009 00:54:00,450 --> 00:54:01,590 This one is already bigger than this, 1010 00:54:01,590 --> 00:54:03,214 and we're multiplying by the same factors, 1011 00:54:03,214 --> 00:54:11,370 so clearly the answer here is 0.99 times 1012 00:54:11,370 --> 00:54:18,530 0.3 times-- my nines are going to have 1013 00:54:18,530 --> 00:54:24,250 to get really skinny here-- 0.99999 and 0.2. 1014 00:54:24,250 --> 00:54:25,150 That's the winner. 1015 00:54:25,150 --> 00:54:27,620 And the other thing we do, besides recording this number, 1016 00:54:27,620 --> 00:54:30,654 is we circle this arrow.
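A quick numeric check of the board computation so far, using the parameter values quoted in the lecture; the T emissions and the exact dictionary layout are assumptions filled in by symmetry, so treat this as a sketch rather than the slide's model.

```python
# Parameters as quoted in the lecture (T emissions inferred by symmetry).
init  = {"G": 0.99, "I": 0.01}
trans = {"G": {"G": 0.99999, "I": 0.00001},   # five nines; 10^-5 to island
         "I": {"G": 0.001,   "I": 0.999}}     # 10^-3 back; 0.999 to stay
emit  = {"G": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
         "I": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}}

# Position 1 (the base A): only one parse ends in each state.
R1 = {s: init[s] * emit[s]["A"] for s in "GI"}
# -> {'G': 0.297, 'I': 0.002}; genome is winning by about a factor of 150.

# Position 2 (the base C): max over the two possible predecessors,
# then multiply by the emission in the state we end in.
R2_G = max(R1["G"] * trans["G"]["G"], R1["I"] * trans["I"]["G"]) * emit["G"]["C"]
R2_I = max(R1["G"] * trans["G"]["I"], R1["I"] * trans["I"]["I"]) * emit["I"]["C"]
print(R1, R2_G, R2_I)  # stay-in-genome and island-to-island win the two maxes
```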
1017 00:54:30,654 --> 00:54:31,570 Does this ring a bell? 1018 00:54:31,570 --> 00:54:33,350 This is sort of like Needleman-Wunsch 1019 00:54:33,350 --> 00:54:36,710 or Smith-Waterman, where you don't just 1020 00:54:36,710 --> 00:54:38,330 record what's the best score, but you 1021 00:54:38,330 --> 00:54:39,454 remember how you got there. 1022 00:54:39,454 --> 00:54:41,280 We're going to need that later. 1023 00:54:41,280 --> 00:54:42,540 And what about here? 1024 00:54:42,540 --> 00:54:46,400 What's the optimal parse-- what's 1025 00:54:46,400 --> 00:54:48,340 the probability of the optimal parse ending 1026 00:54:48,340 --> 00:54:50,936 in island at position two? 1027 00:54:50,936 --> 00:54:52,602 Or what do I have to do to calculate it? 1028 00:55:00,314 --> 00:55:01,278 Sorry? 1029 00:55:01,278 --> 00:55:05,640 AUDIENCE: You have to calculate the [INAUDIBLE]. 1030 00:55:05,640 --> 00:55:09,290 PROFESSOR: Right, you consider going genome to island here, 1031 00:55:09,290 --> 00:55:10,630 and island to island. 1032 00:55:10,630 --> 00:55:12,510 And who's going to win that race? 1033 00:55:12,510 --> 00:55:13,412 Do you have an idea? 1034 00:55:16,860 --> 00:55:19,120 Genome to island had a head start, 1035 00:55:19,120 --> 00:55:23,590 but it pays a penalty for the transition. 1036 00:55:23,590 --> 00:55:27,510 The transition is pretty small, that's 10 to the minus fifth. 1037 00:55:27,510 --> 00:55:30,966 And what about this one? 1038 00:55:30,966 --> 00:55:33,150 This has a much higher transition probability 1039 00:55:33,150 --> 00:55:35,780 of 0.999. 1040 00:55:35,780 --> 00:55:38,090 And so even though you were starting from something bigger-- 1041 00:55:38,090 --> 00:55:40,550 this is about 150 times smaller than this-- 1042 00:55:40,550 --> 00:55:43,120 this is being multiplied by 10 to the minus fifth, 1043 00:55:43,120 --> 00:55:46,010 and this is being multiplied by something that's around 1. 1044 00:55:46,010 --> 00:55:49,060 So this one will win, island to island will win. 1045 00:55:49,060 --> 00:55:50,930 Everyone agree on that? 1046 00:55:50,930 --> 00:55:53,940 And what will that value be? 1047 00:55:53,940 --> 00:55:59,150 So it's whatever it was before times 1048 00:55:59,150 --> 00:56:02,034 the transition, which is what? 1049 00:56:02,034 --> 00:56:02,950 From island to island? 1050 00:56:05,590 --> 00:56:07,040 0.999. 1051 00:56:07,040 --> 00:56:08,250 Times the emission which is? 1052 00:56:11,870 --> 00:56:13,132 0.3. 1053 00:56:13,132 --> 00:56:14,590 Island is more likely to emit 1054 00:56:14,590 --> 00:56:17,270 a C. Everyone clear on that? 1055 00:56:20,850 --> 00:56:24,370 And then, we're not done until we circle this arrow here. 1056 00:56:24,370 --> 00:56:27,230 That was the winner, the winner was coming from island, 1057 00:56:27,230 --> 00:56:28,830 remaining on island. 1058 00:56:28,830 --> 00:56:32,292 And then we keep going like this. 1059 00:56:32,292 --> 00:56:33,750 Do you want me to do one more base? 1060 00:56:33,750 --> 00:56:35,583 How many people want me to do one more base, 1061 00:56:35,583 --> 00:56:37,696 and how many people want me to stop this? 1062 00:56:37,696 --> 00:56:39,970 I'll do one more base, but you guys 1063 00:56:39,970 --> 00:56:42,750 will have to help me a little bit.
1064 00:56:42,750 --> 00:56:44,960 Who is going to win-- now we want 1065 00:56:44,960 --> 00:56:49,120 the probability of the optimal parse ending in G, 1066 00:56:49,120 --> 00:56:54,300 ending at position three, which is a G, and ending in genome, 1067 00:56:54,300 --> 00:56:56,150 or ending in island. 1068 00:56:56,150 --> 00:57:02,840 So for ending in genome, where is that one going to come from? 1069 00:57:02,840 --> 00:57:04,730 Which is going to win? 1070 00:57:04,730 --> 00:57:08,530 This one, or this one? 1071 00:57:08,530 --> 00:57:09,770 AUDIENCE: Stay in genome. 1072 00:57:09,770 --> 00:57:11,769 PROFESSOR: Yeah, stay in genome is going to win. 1073 00:57:11,769 --> 00:57:13,920 This one is already bigger than this, 1074 00:57:13,920 --> 00:57:16,860 and the transition probability here-- this 1075 00:57:16,860 --> 00:57:19,860 is a 10 to the minus 3 transition probability. 1076 00:57:19,860 --> 00:57:22,400 And this is a probability that's near one, 1077 00:57:22,400 --> 00:57:25,320 so the transitions are going to dominate here. 1078 00:57:25,320 --> 00:57:29,720 And so you're going to have this term-- I'm 1079 00:57:29,720 --> 00:57:38,120 going to call that R 2 of G. That's this notation here. 1080 00:57:38,120 --> 00:57:40,920 And times the transition probability, 1081 00:57:40,920 --> 00:57:44,645 genome to genome, which is all these nines here. 1082 00:57:47,345 --> 00:57:48,720 And then the emission probability 1083 00:57:48,720 --> 00:57:51,540 of a G in the genome state is 0.2. 1084 00:57:51,540 --> 00:57:55,840 And who's going to win here for the optimal parse, ending 1085 00:57:55,840 --> 00:57:58,035 at position three, in the island state? 1086 00:58:02,990 --> 00:58:06,640 Is it going to be this guy, island to island, or this one, 1087 00:58:06,640 --> 00:58:09,150 changing from genome to island? 1088 00:58:09,150 --> 00:58:11,880 Island to island, because, again, the transition 1089 00:58:11,880 --> 00:58:14,210 probability is prohibitive-- that's a 10 1090 00:58:14,210 --> 00:58:16,740 to the minus fifth penalty there. 1091 00:58:16,740 --> 00:58:18,480 So you're going to stay in island. 1092 00:58:18,480 --> 00:58:21,670 So this one won here, and this one won here. 1093 00:58:21,670 --> 00:58:29,610 And so this term here is R 2 of island 1094 00:58:29,610 --> 00:58:32,480 times the transition probability, island to island, 1095 00:58:32,480 --> 00:58:36,810 which is 0.999 times the emission probability, which 1096 00:58:36,810 --> 00:58:41,230 is 0.3 of a G in the island state. 1097 00:58:41,230 --> 00:58:45,440 Everyone clear on that? 1098 00:58:45,440 --> 00:58:53,640 Now, if we went out another 20 bases, what's going to happen? 1099 00:58:53,640 --> 00:58:54,931 Probably not a lot. 1100 00:58:54,931 --> 00:58:58,040 Probably the same kind of stuff that's happening. 1101 00:58:58,040 --> 00:59:01,050 That seems kind of boring, but when would 1102 00:59:01,050 --> 00:59:04,590 we actually get a crossover? 1103 00:59:04,590 --> 00:59:05,722 What would it take? 1104 00:59:05,722 --> 00:59:08,934 To push you over and cause you to transition from one 1105 00:59:08,934 --> 00:59:09,578 to the other? 1106 00:59:12,952 --> 00:59:16,675 AUDIENCE: The odds slowly stacked against you for long enough? 1107 00:59:16,675 --> 00:59:18,550 PROFESSOR: Yeah, that's a good way to put it. 1108 00:59:18,550 --> 00:59:19,880 So let me give you an example. 1109 00:59:19,880 --> 00:59:27,690 This is the Viterbi algorithm, written out mathematically.
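As a complement to the slide, here is a minimal log-space sketch of the whole algorithm; this is an assumption-laden illustration (function name, data layout), not the slide's pseudocode. Logs are used so that the long sequences discussed next don't underflow to zero.

```python
from math import log

def viterbi(obs, states, init, trans, emit):
    """Log-space Viterbi: R[i][s] is the log probability of the optimal
    parse of obs[0..i] that ends in state s.  Runtime is O(K^2 L): for
    each of the L positions, every (state, predecessor) pair is tried."""
    # Initialization: R_1(h) = pi_h * e_h(o_1), in logs.
    R = [{s: log(init[s]) + log(emit[s][obs[0]]) for s in states}]
    back = [{}]                                  # back[i][s] = best predecessor
    for i in range(1, len(obs)):
        R.append({})
        back.append({})
        for s in states:
            # Best way to arrive in s: max over R_{i-1}(s') + log a_{s'->s}.
            prev = max(states, key=lambda p: R[i - 1][p] + log(trans[p][s]))
            R[i][s] = R[i - 1][prev] + log(trans[prev][s]) + log(emit[s][obs[i]])
            back[i][s] = prev                    # "circle the arrow" we came from
    # Termination: take the larger final entry, then follow the arrows back.
    state = max(states, key=lambda s: R[-1][s])
    path = [state]
    for i in range(len(obs) - 1, 0, -1):
        state = back[i][state]
        path.append(state)
    return "".join(reversed(path))

# With init/trans/emit from the sketch above, viterbi("ACGT" * 10000, "GI",
# init, trans, emit) comes out all G's, matching the example discussed next.
```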
1110 00:59:27,690 --> 00:59:29,750 We can go over this in a moment, but I just 1111 00:59:29,750 --> 00:59:33,020 want to try to stay with the intuition here. 1112 00:59:33,020 --> 00:59:38,460 We did that, now I want to do this. 1113 00:59:38,460 --> 00:59:45,760 Suppose your sequence is A C G T, repeating 10,000 times. 1114 00:59:45,760 --> 00:59:51,310 Can anyone figure out what the optimal parse of that sequence 1115 00:59:51,310 --> 00:59:54,704 would be, without doing Viterbi in their head? 1116 00:59:59,970 --> 01:00:01,510 Start and stay in genome. 1117 01:00:01,510 --> 01:00:03,070 Can you explain why? 1118 01:00:03,070 --> 01:00:06,111 AUDIENCE: Because it's equal to the-- 1119 01:00:06,111 --> 01:00:09,640 what are the [? widths? ?] Because it's 1120 01:00:09,640 --> 01:00:13,080 homogeneous in composition, as opposed 1121 01:00:13,080 --> 01:00:19,238 to enriched for C and G, and it just repeats without pattern. 1122 01:00:19,238 --> 01:00:22,094 Or it repeats throughout without [? concentrating the C's and G's ?] 1123 01:00:22,094 --> 01:00:22,930 anywhere. 1124 01:00:22,930 --> 01:00:25,555 PROFESSOR: Right, so the unit of the repeat, this A C G T unit, 1125 01:00:25,555 --> 01:00:27,020 is not biased for either one. 1126 01:00:27,020 --> 01:00:32,120 So there will be two 0.3 emissions and two 0.2 emissions, 1127 01:00:32,120 --> 01:00:34,100 whether you go through those in G G G G 1128 01:00:34,100 --> 01:00:37,200 or in I I I I. Does that make sense? 1129 01:00:37,200 --> 01:00:39,670 So the emissions will be the same, if you're all in genome, 1130 01:00:39,670 --> 01:00:41,150 or if you're all in island. 1131 01:00:41,150 --> 01:00:46,360 And the initial probabilities favor genome, 1132 01:00:46,360 --> 01:00:49,080 and the transitions also favor staying in genome. 1133 01:00:49,080 --> 01:00:51,433 Right, so all genome. 1134 01:00:51,433 --> 01:00:52,710 Can everyone see that? 1135 01:00:56,419 --> 01:00:58,377 So do you want to take a stab at this next one? 1136 01:01:04,710 --> 01:01:05,770 This one's harder. 1137 01:01:05,770 --> 01:01:08,500 Let me ask you, in the optimal parse, 1138 01:01:08,500 --> 01:01:12,729 what state is it going to end in? 1139 01:01:12,729 --> 01:01:13,437 AUDIENCE: Genome. 1140 01:01:13,437 --> 01:01:16,270 PROFESSOR: Genome, you've got a run of [? 1,000 ?] T's. 1141 01:01:16,270 --> 01:01:20,330 And in genome, the emissions favor emitting T's. 1142 01:01:20,330 --> 01:01:23,450 So clearly, it's going to end in genome. 1143 01:01:23,450 --> 01:01:27,610 And then, what about those runs of C's and G's in the middle 1144 01:01:27,610 --> 01:01:28,215 there? 1145 01:01:28,215 --> 01:01:32,560 Are any of those long enough to trigger a transition to island? 1146 01:01:32,560 --> 01:01:34,102 What was your name again? 1147 01:01:34,102 --> 01:01:34,810 AUDIENCE: Daniel. 1148 01:01:34,810 --> 01:01:37,210 PROFESSOR: Daniel, so you're shaking your head. 1149 01:01:37,210 --> 01:01:39,010 You think they're not long enough. 1150 01:01:39,010 --> 01:01:44,230 So you think the winner's going to be genome all the way? 1151 01:01:44,230 --> 01:01:45,778 Who thinks they're long enough? 1152 01:01:45,778 --> 01:01:48,218 Or maybe some of them are? 1153 01:01:48,218 --> 01:01:49,908 Go ahead, what was your name? 1154 01:01:49,908 --> 01:01:50,658 AUDIENCE: Michael. 1155 01:01:50,658 --> 01:01:51,699 PROFESSOR: Michael, yeah.
1156 01:01:51,699 --> 01:01:54,946 AUDIENCE: The ones at length 80 and 60 are long enough, 1157 01:01:54,946 --> 01:01:56,890 but the one at length 20 is not. 1158 01:01:56,890 --> 01:01:59,352 PROFESSOR: OK, and why do you say that? 1159 01:01:59,352 --> 01:02:03,860 AUDIENCE: Just looking at powers of 3 over 2, 1160 01:02:03,860 --> 01:02:08,320 3 times 10 to the 3 isn't enough to overcome the difference 1161 01:02:08,320 --> 01:02:11,750 in transition probabilities between the island and genome. 1162 01:02:11,750 --> 01:02:17,630 But 3 times 10 to the 10 and 1 times 10 to the 14 1163 01:02:17,630 --> 01:02:25,680 is, over the length of those sequences. 1164 01:02:25,680 --> 01:02:28,585 The difference in probability of making that 1165 01:02:28,585 --> 01:02:31,177 switch once at the beginning [INAUDIBLE]. 1166 01:02:31,177 --> 01:02:33,510 PROFESSOR: OK, did everyone get what Michael was saying? 1167 01:02:33,510 --> 01:02:36,500 So, Michael, can you explain why powers of 1.5 1168 01:02:36,500 --> 01:02:38,130 are relevant here? 1169 01:02:38,130 --> 01:02:48,620 AUDIENCE: Oh, that's the ratio of emission probability 1170 01:02:48,620 --> 01:02:52,545 for the C's and the G's 1171 01:02:52,545 --> 01:02:54,920 between island and genome. 1172 01:02:54,920 --> 01:02:58,016 So in island it's 0.3, and in genome it's 1173 01:02:58,016 --> 01:02:58,890 0.2 over [INAUDIBLE]. 1174 01:02:58,890 --> 01:03:01,390 PROFESSOR: Right, so when you're going through a run of C's, 1175 01:03:01,390 --> 01:03:04,830 if you're in the island state, you get a power of 1.5, 1176 01:03:04,830 --> 01:03:08,520 in terms of emissions at each position. 1177 01:03:08,520 --> 01:03:09,780 What about the transitions? 1178 01:03:09,780 --> 01:03:12,240 You're sort of glossing over those. 1179 01:03:12,240 --> 01:03:13,670 Why is that? 1180 01:03:13,670 --> 01:03:16,950 AUDIENCE: Because that only has to happen once 1181 01:03:16,950 --> 01:03:18,450 at the beginning. 1182 01:03:18,450 --> 01:03:26,020 So the ratio between the transition probabilities 1183 01:03:26,020 --> 01:03:30,310 is really high, but as long as the compounded 1184 01:03:30,310 --> 01:03:32,345 ratio of the emission probability 1185 01:03:32,345 --> 01:03:35,091 is high enough over a [INAUDIBLE] 1186 01:03:35,091 --> 01:03:39,755 of sequences, that as long as that compound emission 1187 01:03:39,755 --> 01:03:41,965 is greater than that one-off ratio at the beginning, 1188 01:03:41,965 --> 01:03:51,790 then the island is more [INAUDIBLE]. 1189 01:03:51,790 --> 01:03:56,560 PROFESSOR: Yeah, so if you think about the transitions, I to I, 1190 01:03:56,560 --> 01:04:01,290 or G to G, as being close to 1-- so if you think of them as 1, 1191 01:04:01,290 --> 01:04:03,920 then you can ignore them, and only focus on the cases 1192 01:04:03,920 --> 01:04:06,710 where it transitions from G to I, and I to G. 1193 01:04:06,710 --> 01:04:11,010 So you say that 60 and 80 are long enough. 1194 01:04:11,010 --> 01:04:15,880 So your prediction is that the optimal parse 1195 01:04:15,880 --> 01:04:30,460 is G 1,000, I 80, G another 2,020-- 1196 01:04:30,460 --> 01:04:37,370 you said that one wasn't going to-- and then I 60, G 1,000. 1197 01:04:37,370 --> 01:04:38,990 Michael, is that what you're saying? 1198 01:04:38,990 --> 01:04:41,560 Can you read this? 1199 01:04:41,560 --> 01:04:43,520 AUDIENCE: Yeah.
1200 01:04:43,520 --> 01:04:48,920 PROFESSOR: OK, so why do you say that 10 to the 10th 1201 01:04:48,920 --> 01:04:53,139 is enough to flip the switch, and 10 to the 3rd is not? 1202 01:04:53,139 --> 01:04:56,492 AUDIENCE: If I remember the numbers from the previous slide 1203 01:04:56,492 --> 01:04:57,930 correctly-- 1204 01:04:57,930 --> 01:05:01,410 PROFESSOR: A couple of slides back? 1205 01:05:01,410 --> 01:05:05,165 AUDIENCE: So if you look at the ratio of the probability 1206 01:05:05,165 --> 01:05:08,395 of staying in the genome, and the probability of going from 1207 01:05:08,395 --> 01:05:10,135 the genome to the island, it's-- 1208 01:05:10,135 --> 01:05:11,520 PROFESSOR: 10 to the 5th. 1209 01:05:11,520 --> 01:05:12,770 AUDIENCE: Yeah, 10 to the 5th. 1210 01:05:12,770 --> 01:05:19,001 So whatever happens going over the next [? run ?] of sequences 1211 01:05:19,001 --> 01:05:23,474 has to overcome the difference in ratio for the switch 1212 01:05:23,474 --> 01:05:25,322 to become more likely. 1213 01:05:25,322 --> 01:05:26,030 PROFESSOR: Right. 1214 01:05:26,030 --> 01:05:31,820 So if everyone agrees that we're going to start in genome, 1215 01:05:31,820 --> 01:05:35,430 we've got a run of 1,000 A's, and genome is favored anyway-- 1216 01:05:35,430 --> 01:05:38,830 so that's clear, we're going to be in genome at the beginning 1217 01:05:38,830 --> 01:05:41,590 for the first 1,000, and be in genome at the end, 1218 01:05:41,590 --> 01:05:43,440 then if you're going to go to island, 1219 01:05:43,440 --> 01:05:46,050 you have to pay two penalties, basically. 1220 01:05:46,050 --> 01:05:48,460 You pay the penalty of starting in island, which 1221 01:05:48,460 --> 01:05:51,657 is 10 to the minus 5th-- this is maybe a slightly different way 1222 01:05:51,657 --> 01:05:53,240 than you were thinking about it, but I 1223 01:05:53,240 --> 01:05:56,300 think it's equivalent-- 10 to the minus 5th to switch to island, 1224 01:05:56,300 --> 01:05:59,770 and then you pay a penalty coming back, 10 to the minus 3. 1225 01:05:59,770 --> 01:06:02,000 And all the other transitions are near 1. 1226 01:06:02,000 --> 01:06:04,150 So it's like a 10 to the minus 8 penalty 1227 01:06:04,150 --> 01:06:07,360 for going from genome to island and back. 1228 01:06:07,360 --> 01:06:10,970 And so if the emissions are greater than 10 1229 01:06:10,970 --> 01:06:15,523 to the 8th-- favor island by a factor of 10 to the 8th-- 1230 01:06:15,523 --> 01:06:18,070 it'll be worth doing that. 1231 01:06:18,070 --> 01:06:19,134 Does that make sense? 1232 01:06:19,134 --> 01:06:21,554 AUDIENCE: I forgot about the penalty of [INAUDIBLE], 1233 01:06:21,554 --> 01:06:23,567 but it's still the [INAUDIBLE]. 1234 01:06:23,567 --> 01:06:24,942 PROFESSOR: Yeah, it's still true. 1235 01:06:24,942 --> 01:06:25,910 Everyone see that? 1236 01:06:25,910 --> 01:06:28,930 You have to pay a penalty of 10 to the 8th 1237 01:06:28,930 --> 01:06:31,215 to go from genome to island and back. 1238 01:06:31,215 --> 01:06:32,840 But the emissions can make up for that. 1239 01:06:32,840 --> 01:06:36,130 Even though it seems small, it seems like 60 bases is not 1240 01:06:36,130 --> 01:06:41,110 enough-- it's multiplicative, and it adds up. 1241 01:06:41,110 --> 01:06:41,880 Sally?
1242 01:06:41,880 --> 01:06:44,880 AUDIENCE: So it seems like the [INAUDIBLE] 1243 01:06:44,880 --> 01:06:50,380 to me is going to return a lagging answer, 1244 01:06:50,380 --> 01:06:53,046 because we're not going to actually switch 1245 01:06:53,046 --> 01:06:57,380 away from genome in our HMM until we hit the point where we should 1246 01:06:57,380 --> 01:07:00,970 [? tip, ?] which would be about 60 G's into the run of 80. 1247 01:07:00,970 --> 01:07:03,057 PROFESSOR: So you're saying it's not actually 1248 01:07:03,057 --> 01:07:05,492 going to predict the right thing? 1249 01:07:05,492 --> 01:07:09,388 AUDIENCE: Do you have to rerun the [INAUDIBLE] processing 1250 01:07:09,388 --> 01:07:11,830 to get it actually in line to the correct thing? 1251 01:07:11,830 --> 01:07:15,060 PROFESSOR: What do people think about this? 1252 01:07:15,060 --> 01:07:16,681 Yeah, comment? 1253 01:07:16,681 --> 01:07:18,180 AUDIENCE: That's not quite the case, 1254 01:07:18,180 --> 01:07:22,210 because you [? pack it ?] or you [? stack it ?] both 1255 01:07:22,210 --> 01:07:24,900 in the genome and island possibilities, 1256 01:07:24,900 --> 01:07:27,647 and your transition is the penalty. 1257 01:07:27,647 --> 01:07:31,270 So it's the highest impact penalty. 1258 01:07:31,270 --> 01:07:36,830 So when you go island to island to island in that string of 80, 1259 01:07:36,830 --> 01:07:39,440 the transition will only be valid 1260 01:07:39,440 --> 01:07:41,392 starting at the first one. 1261 01:07:44,308 --> 01:07:44,808 [INAUDIBLE] 1262 01:07:50,176 --> 01:07:53,766 PROFESSOR: OK, we're at position 1,000. 1263 01:07:53,766 --> 01:07:55,390 I think you're on the right track here. 1264 01:07:55,390 --> 01:07:58,722 So I'm going to claim that the Viterbi will transition 1265 01:07:58,722 --> 01:08:00,430 at the right place, because it's actually 1266 01:08:00,430 --> 01:08:03,680 proven to generate the optimal parse. 1267 01:08:03,680 --> 01:08:09,030 So I'm right, but I totally get your intuition. 1268 01:08:09,030 --> 01:08:11,787 This is the key thing-- most people's intuition, 1269 01:08:11,787 --> 01:08:13,870 my intuition, everyone's intuition when they first 1270 01:08:13,870 --> 01:08:17,090 hear about this is that it seems like you don't transition 1271 01:08:17,090 --> 01:08:17,590 soon enough. 1272 01:08:17,590 --> 01:08:19,950 It seems like you have to look into the future 1273 01:08:19,950 --> 01:08:21,886 to know to transition at that place, right? 1274 01:08:21,886 --> 01:08:23,760 And obviously you can't look into the future, 1275 01:08:23,760 --> 01:08:26,430 it's a recursion. 1276 01:08:26,430 --> 01:08:28,410 How does it work? 1277 01:08:28,410 --> 01:08:31,680 Clearly, this is going to be the winner. 1278 01:08:31,680 --> 01:08:37,014 So let's go to position 1,001, that's the first C. 1279 01:08:37,014 --> 01:08:43,160 And this guy is going to come from here, 1280 01:08:43,160 --> 01:08:47,620 this guy is the winner overall-- G 1,000 is clearly the winner. 1281 01:08:47,620 --> 01:08:50,189 But what about this guy? 1282 01:08:50,189 --> 01:08:52,840 Where's it coming from? 1283 01:08:52,840 --> 01:08:57,040 G 1,000, it's coming from there. 1284 01:08:57,040 --> 01:09:03,770 And in fact, the previous guy came from G 1,000. 1285 01:09:03,770 --> 01:09:08,240 I 1,000 came from G 999, and so forth. 1286 01:09:08,240 --> 01:09:10,710 Now, here's the interesting question. 1287 01:09:10,710 --> 01:09:16,040 What happens at 1,002?
1288 01:09:16,040 --> 01:09:23,729 Sally, I want you to tell me what happens at 1,002. 1289 01:09:23,729 --> 01:09:25,040 Who wins here? 1290 01:09:25,040 --> 01:09:26,000 AUDIENCE: Genome. 1291 01:09:26,000 --> 01:09:28,040 PROFESSOR: Genome. 1292 01:09:28,040 --> 01:09:28,958 Who wins here? 1293 01:09:28,958 --> 01:09:29,666 AUDIENCE: Island. 1294 01:09:29,666 --> 01:09:31,380 PROFESSOR: Island. 1295 01:09:31,380 --> 01:09:34,600 It had been transitioning late-- genome has got a head start, 1296 01:09:34,600 --> 01:09:36,700 so the best way to be in island 1297 01:09:36,700 --> 01:09:38,910 is to have been in genome as long as possible, 1298 01:09:38,910 --> 01:09:41,640 up until position 1,000. 1299 01:09:41,640 --> 01:09:43,450 And that was still true at 1,001. 1300 01:09:43,450 --> 01:09:45,370 It's no longer true after that. 1301 01:09:45,370 --> 01:09:48,819 It was actually better to have transitioned back here 1302 01:09:48,819 --> 01:09:52,075 to get that one extra emission, that one power of 1.5 1303 01:09:52,075 --> 01:09:54,120 from emitting that C in the island state. 1304 01:09:54,120 --> 01:09:56,453 If you're going to be in island anyway-- 1305 01:09:56,453 --> 01:09:59,780 this is much lower than this, at this point. 1306 01:09:59,780 --> 01:10:02,317 It's about 10 to the 5th lower. 1307 01:10:02,317 --> 01:10:03,650 But that's OK, we still keep it. 1308 01:10:03,650 --> 01:10:05,140 It's the best that ends in island. 1309 01:10:05,140 --> 01:10:07,710 Do you see what I'm saying? 1310 01:10:07,710 --> 01:10:11,840 OK, there were all these-- island always 1311 01:10:11,840 --> 01:10:14,334 had to come from genome at the latest possible time, up 1312 01:10:14,334 --> 01:10:16,250 until this point, and now it's actually better 1313 01:10:16,250 --> 01:10:18,780 to have made that transition there, and then stay in island. 1314 01:10:18,780 --> 01:10:21,460 So you can see island is going to win for a while, 1315 01:10:21,460 --> 01:10:24,460 and then it'll flip back. 1316 01:10:24,460 --> 01:10:30,080 And the question is going to be down here at 1,060, 1317 01:10:30,080 --> 01:10:32,960 going to 1,061. 1318 01:10:32,960 --> 01:10:33,850 Who's bigger here? 1319 01:10:33,850 --> 01:10:37,180 This guy was perhaps-- well, we don't even 1320 01:10:37,180 --> 01:10:38,850 know exactly how we got here. 1321 01:10:38,850 --> 01:10:47,410 But you can see that this parse here that stays in island 1322 01:10:47,410 --> 01:10:48,590 is going to be optimal. 1323 01:10:48,590 --> 01:10:51,230 And the question is, would it beat just staying in genome? 1324 01:10:51,230 --> 01:10:54,170 And the answer is yes, because the 10 to the 10th 1325 01:10:54,170 --> 01:10:56,950 it gained in emissions overcomes the 10 to the 8th penalty 1326 01:10:56,950 --> 01:10:59,921 that it paid. 1327 01:10:59,921 --> 01:11:01,170 Now what do you do at the end? 1328 01:11:01,170 --> 01:11:03,900 How do you actually find the optimal parse overall? 1329 01:11:07,050 --> 01:11:21,690 I go out to position whatever it is, 4,160. 1330 01:11:21,690 --> 01:11:24,000 I've got a probability here, probability here, 1331 01:11:24,000 --> 01:11:26,730 what do I do with those? 1332 01:11:26,730 --> 01:11:29,500 Right, but what do I do first? 1333 01:11:29,500 --> 01:11:31,500 You pick the bigger one, whichever one's bigger. 1334 01:11:31,500 --> 01:11:35,810 We decided that this one is going to be bigger, right? 1335 01:11:35,810 --> 01:11:38,560 And then remember all the arrows that I circled?
1336 01:11:38,560 --> 01:11:42,910 You just backtrack through and figure out what it was. 1337 01:11:42,910 --> 01:11:45,382 Does that make sense? 1338 01:11:45,382 --> 01:11:46,590 That's the Viterbi algorithm. 1339 01:11:46,590 --> 01:11:53,350 We'll do a little bit more on this next time, 1340 01:11:53,350 --> 01:11:55,060 and definitely field questions. 1341 01:11:55,060 --> 01:11:57,500 It's a little bit tricky to get your head around. 1342 01:12:01,450 --> 01:12:03,960 It's a dynamic programming algorithm, like Needleman-Wunsch 1343 01:12:03,960 --> 01:12:06,710 or Smith-Waterman, but a little bit different. 1344 01:12:09,400 --> 01:12:12,230 The runtime-- what is the runtime, 1345 01:12:12,230 --> 01:12:15,540 for those who were sleeping and didn't notice 1346 01:12:15,540 --> 01:12:17,860 that little thing I flashed up there? 1347 01:12:17,860 --> 01:12:21,030 Or, if you read it, can you explain where it comes from? 1348 01:12:21,030 --> 01:12:24,969 How does the runtime depend on the number of hidden states 1349 01:12:24,969 --> 01:12:26,260 and the length of the sequence? 1350 01:12:31,140 --> 01:12:37,590 I've got K states, sequence of length L, what is the runtime? 1351 01:12:37,590 --> 01:12:39,822 So I'm going to put this up here. 1352 01:12:39,822 --> 01:12:41,298 This might help. 1353 01:12:47,210 --> 01:12:50,750 So when you look at the recursion like this, 1354 01:12:50,750 --> 01:12:53,640 when you want to think about the runtime-- 1355 01:12:53,640 --> 01:12:57,191 forget about initialization and termination, that's not 1356 01:12:57,191 --> 01:12:57,690 [INAUDIBLE]. 1357 01:12:57,690 --> 01:13:01,550 It's what you do on the typical intermediate state that 1358 01:13:01,550 --> 01:13:02,680 determines the runtime. 1359 01:13:02,680 --> 01:13:04,760 That's what grows with sequence length. 1360 01:13:04,760 --> 01:13:09,460 So what do you have to do at each-- 1361 01:13:09,460 --> 01:13:11,969 to go from position I to position I plus 1? 1362 01:13:11,969 --> 01:13:12,885 How many calculations? 1363 01:13:16,208 --> 01:13:19,694 AUDIENCE: You have to do N calculations for 33. 1364 01:13:19,694 --> 01:13:21,700 Is that right? 1365 01:13:21,700 --> 01:13:23,530 PROFESSOR: Yeah, so 33. 1366 01:13:23,530 --> 01:13:25,600 Yeah, the notation is a little bit 1367 01:13:25,600 --> 01:13:28,460 different, but how many-- let me ask you this, 1368 01:13:28,460 --> 01:13:31,010 how many transitions do you have to consider? 1369 01:13:31,010 --> 01:13:34,110 If I have an HMM with K hidden states? 1370 01:13:34,110 --> 01:13:35,405 AUDIENCE: K squared. 1371 01:13:35,405 --> 01:13:37,780 PROFESSOR: K squared, right? 1372 01:13:37,780 --> 01:13:40,155 So you're going to have to do K squared calculations, 1373 01:13:40,155 --> 01:13:42,790 basically, to go from position I to I plus 1. 1374 01:13:42,790 --> 01:13:44,870 So what is the overall dependence 1375 01:13:44,870 --> 01:13:48,460 on K and L, the length of the sequence? 1376 01:13:52,780 --> 01:13:55,560 OK, it's K squared L. It's linear in the sequence. 1377 01:13:55,560 --> 01:13:57,540 So is this good or bad? 1378 01:13:57,540 --> 01:13:58,062 Yes, Sally? 1379 01:13:58,062 --> 01:14:00,437 AUDIENCE: Doesn't this assume that the graph is complete? 1380 01:14:00,437 --> 01:14:03,916 And if you don't actually have [INAUDIBLE] 1381 01:14:03,916 --> 01:14:05,442 you can get a little faster? 1382 01:14:05,442 --> 01:14:06,900 PROFESSOR: Yeah, it's a good point.
1383 01:14:06,900 --> 01:14:09,390 So this is the worst case, or this 1384 01:14:09,390 --> 01:14:12,670 is in the case where you can transition from any state 1385 01:14:12,670 --> 01:14:13,500 to any other state. 1386 01:14:13,500 --> 01:14:18,410 If you remember, the gene-finding HMM-- 1387 01:14:18,410 --> 01:14:20,890 I might have erased it, I think I erased it-- 1388 01:14:20,890 --> 01:14:23,080 if you can see this subtle-- 1389 01:14:23,080 --> 01:14:26,530 No, remember Tim designed an HMM for gene-finding 1390 01:14:26,530 --> 01:14:30,160 here, which only had some of the arrows allowed. 1391 01:14:30,160 --> 01:14:33,010 So if that's true, if there's a bunch of zero probabilities 1392 01:14:33,010 --> 01:14:34,870 for transitions, then you can ignore those, 1393 01:14:34,870 --> 01:14:36,200 and it actually speeds it up. 1394 01:14:36,200 --> 01:14:37,660 That's true. 1395 01:14:37,660 --> 01:14:38,680 It's a good point. 1396 01:14:38,680 --> 01:14:39,430 Everyone got this? 1397 01:14:39,430 --> 01:14:42,360 So this is the worst case. 1398 01:14:42,360 --> 01:14:45,320 K squared L-- is this good or bad? 1399 01:14:45,320 --> 01:14:47,290 Fast or slow? 1400 01:14:47,290 --> 01:14:49,340 Slow? 1401 01:14:49,340 --> 01:14:52,600 I mean, it depends on the structure of your HMM. 1402 01:14:52,600 --> 01:14:56,650 For a simple HMM, like the CpG island HMM, 1403 01:14:56,650 --> 01:15:00,210 this is like blindingly fast. 1404 01:15:00,210 --> 01:15:01,850 K squared is 4, right? 1405 01:15:01,850 --> 01:15:03,740 So it takes the same order of magnitude 1406 01:15:03,740 --> 01:15:05,510 as just reading the sequence. 1407 01:15:05,510 --> 01:15:07,750 So it'll be super, super fast. 1408 01:15:07,750 --> 01:15:10,750 If you make a really complicated HMM, it can be slower. 1409 01:15:10,750 --> 01:15:14,930 But the point is that for genomic sequence analysis, 1410 01:15:14,930 --> 01:15:16,250 L is big. 1411 01:15:16,250 --> 01:15:19,646 So as long as you keep K small, it'll run fast. 1412 01:15:19,646 --> 01:15:22,310 It's much better than sequence comparison, where you end up 1413 01:15:22,310 --> 01:15:24,274 with these L squared types of things. 1414 01:15:24,274 --> 01:15:25,940 So it's faster than sequence comparison. 1415 01:15:25,940 --> 01:15:27,410 So that's really one of the reasons 1416 01:15:27,410 --> 01:15:30,640 why Viterbi is so popular, is it's super fast. 1417 01:15:30,640 --> 01:15:33,020 So in the last couple minutes, I just 1418 01:15:33,020 --> 01:15:35,500 want to say a few things about the midterm. 1419 01:15:35,500 --> 01:15:38,490 You guys did remember there is a midterm, right? 1420 01:15:38,490 --> 01:15:44,440 So the midterm is a week from today, Tuesday, March 18. 1421 01:15:44,440 --> 01:15:46,820 For everybody, it's going to be here, 1422 01:15:46,820 --> 01:15:50,240 except for those who are in 6.874-- 1423 01:15:50,240 --> 01:15:54,730 and those people should go to 68-180. 1424 01:15:54,730 --> 01:15:57,200 And because they're going to be given extra time, 1425 01:15:57,200 --> 01:15:58,810 you should go there early. 1426 01:15:58,810 --> 01:16:00,670 Go there at 12:40. 1427 01:16:00,670 --> 01:16:03,200 Everyone else, who's not in 6.874, 1428 01:16:03,200 --> 01:16:06,850 should come here to the regular class by 1:00 PM, 1429 01:16:06,850 --> 01:16:09,500 just so you have a chance to get set up and everything.
1430 01:16:09,500 --> 01:16:12,380 And then the exam will start promptly at 1:05, 1431 01:16:12,380 --> 01:16:16,820 and will end at 2:25, an hour and 20 minutes. 1432 01:16:16,820 --> 01:16:20,337 It is closed book, open notes. 1433 01:16:20,337 --> 01:16:22,420 So don't bring your textbook, but you can bring up 1434 01:16:22,420 --> 01:16:24,440 to two pages-- they can be double sided 1435 01:16:24,440 --> 01:16:25,700 if you want-- of notes. 1436 01:16:25,700 --> 01:16:27,010 So why do we do this? 1437 01:16:27,010 --> 01:16:32,950 Well, we think that the act of going through the lectures 1438 01:16:32,950 --> 01:16:35,590 and textbook, or whatever other notes you have, 1439 01:16:35,590 --> 01:16:41,480 and deciding what's most important, may be helpful. 1440 01:16:41,480 --> 01:16:44,320 And so this, hopefully, will be a useful studying exercise: 1441 01:16:44,320 --> 01:16:46,403 figure out what's most important and write it down 1442 01:16:46,403 --> 01:16:49,340 on a piece of paper if you are likely to forget it-- maybe 1443 01:16:49,340 --> 01:16:50,970 complicated equations, things like 1444 01:16:50,970 --> 01:16:53,520 that, you might want to write down. 1445 01:16:53,520 --> 01:16:55,570 No calculators or other electronic aids. 1446 01:16:55,570 --> 01:16:57,050 But you won't need them. 1447 01:16:57,050 --> 01:17:03,400 If you get an answer that's e squared over 17 factorial, 1448 01:17:03,400 --> 01:17:05,710 you're not asked to convert that into a decimal. 1449 01:17:05,710 --> 01:17:08,389 Just leave it like that. 1450 01:17:08,389 --> 01:17:09,430 So what should you study? 1451 01:17:09,430 --> 01:17:12,610 So you should study your lecture notes, readings, and tutorials, 1452 01:17:12,610 --> 01:17:14,240 and past exams. 1453 01:17:14,240 --> 01:17:17,050 Past exams have been posted to the course website. 1454 01:17:17,050 --> 01:17:18,960 p-sets as well. 1455 01:17:18,960 --> 01:17:21,780 The midterm exams from past years are posted. 1456 01:17:21,780 --> 01:17:25,110 And there's some variation in topics from year to year, 1457 01:17:25,110 --> 01:17:28,570 so if you're reading through a midterm from a past year, 1458 01:17:28,570 --> 01:17:33,000 and you run across an unfamiliar phrase or concept, 1459 01:17:33,000 --> 01:17:35,270 you have to ask yourself, was I just 1460 01:17:35,270 --> 01:17:37,450 dozing off when that was discussed? 1461 01:17:37,450 --> 01:17:39,320 Or was that not discussed this year? 1462 01:17:39,320 --> 01:17:41,730 And act appropriately. 1463 01:17:41,730 --> 01:17:44,800 The content of the midterm will be all the lectures, 1464 01:17:44,800 --> 01:17:48,752 all the topics up through today-- hidden Markov models. 1465 01:17:48,752 --> 01:17:50,960 And I'll do a little bit more on hidden Markov models 1466 01:17:50,960 --> 01:17:51,840 on Thursday. 1467 01:17:51,840 --> 01:17:57,060 That part could be on the exam, but the next major topic-- RNA 1468 01:17:57,060 --> 01:18:00,880 secondary structure-- will not be on the exam. 1469 01:18:00,880 --> 01:18:02,627 It'll be on a p-set in the future. 1470 01:18:02,627 --> 01:18:03,960 Any questions about the midterm? 1471 01:18:10,060 --> 01:18:13,341 And the TAs will be doing some review stuff in sections 1472 01:18:13,341 --> 01:18:13,840 this week. 1473 01:18:17,670 --> 01:18:19,420 OK, thank you. 1474 01:18:19,420 --> 01:18:21,420 See you on Thursday.