1 00:00:00,060 --> 00:00:01,476 NARRATOR: The following content is 2 00:00:01,476 --> 00:00:04,019 provided under a Creative Commons license. 3 00:00:04,019 --> 00:00:06,870 Your support will help MIT OpenCourseWare continue 4 00:00:06,870 --> 00:00:10,730 to offer high-quality educational resources for free. 5 00:00:10,730 --> 00:00:13,330 To make a donation, or view additional materials 6 00:00:13,330 --> 00:00:17,236 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:17,236 --> 00:00:17,861 at ocw.mit.edu. 8 00:00:27,000 --> 00:00:33,690 PROFESSOR: All right, we should get started. 9 00:00:33,690 --> 00:00:36,600 So it's good to be back. 10 00:00:36,600 --> 00:00:41,920 We'll be discussing DNA sequence motifs. 11 00:00:41,920 --> 00:00:44,120 Oh yeah, we were, if you're wondering, 12 00:00:44,120 --> 00:00:48,060 yes, the instructors were at the awards on Sunday. 13 00:00:48,060 --> 00:00:48,830 It was great. 14 00:00:48,830 --> 00:00:52,420 The pizza was delicious. 15 00:00:52,420 --> 00:00:56,560 So today, we're going to be talking about DNA and protein 16 00:00:56,560 --> 00:01:01,110 sequence motifs, which are essentially the building 17 00:01:01,110 --> 00:01:07,770 blocks of regulatory information, in a sense. 18 00:01:07,770 --> 00:01:11,940 Before we get started, I wanted to just see 19 00:01:11,940 --> 00:01:15,540 if there are any questions about material 20 00:01:15,540 --> 00:01:19,100 that Professor Gifford covered from the past couple days? 21 00:01:19,100 --> 00:01:21,490 No guarantees I'll be able to answer them, 22 00:01:21,490 --> 00:01:26,760 but just general things related to transcriptome analysis, 23 00:01:26,760 --> 00:01:28,990 or PCA? 24 00:01:28,990 --> 00:01:29,730 Anything? 25 00:01:29,730 --> 00:01:33,020 Hopefully, you all got the email that he sent out about, 26 00:01:33,020 --> 00:01:36,790 basically, what you're expected to get. 
27 00:01:36,790 --> 00:01:42,070 So at the level of the document that's posted, 28 00:01:42,070 --> 00:01:43,570 that's sort of what we're expecting. 29 00:01:43,570 --> 00:01:45,200 So if you haven't had linear algebra, 30 00:01:45,200 --> 00:01:47,190 that should still be accessible-- 31 00:01:47,190 --> 00:01:50,080 not necessarily all the derivations. 32 00:01:50,080 --> 00:01:53,630 Any questions about that? 33 00:01:53,630 --> 00:01:57,210 OK, so as a reminder, team projects, 34 00:01:57,210 --> 00:02:01,150 your aims are due soon. 35 00:02:01,150 --> 00:02:02,740 We'll post a slightly-- there's been 36 00:02:02,740 --> 00:02:05,160 a request for more detailed information on what we'd 37 00:02:05,160 --> 00:02:09,660 like in the aims, so we'll post something more detailed 38 00:02:09,660 --> 00:02:13,500 on the website this evening, and probably 39 00:02:13,500 --> 00:02:15,860 extend the deadline a day or two, just 40 00:02:15,860 --> 00:02:19,270 to give you a little bit more time on the aims. 41 00:02:19,270 --> 00:02:22,070 So after you submit your aims-- this 42 00:02:22,070 --> 00:02:24,420 is students who are taking the project 43 00:02:24,420 --> 00:02:29,650 component of the course-- then your team 44 00:02:29,650 --> 00:02:32,760 will be assigned to one of the three instructors 45 00:02:32,760 --> 00:02:38,170 as a mentor/advisor, and we will schedule a time 46 00:02:38,170 --> 00:02:40,680 to meet with you in the next week or two 47 00:02:40,680 --> 00:02:43,500 to discuss your aims, just to assess 48 00:02:43,500 --> 00:02:48,140 the feasibility of the project and so forth, before you launch 49 00:02:48,140 --> 00:02:50,570 into it. 50 00:02:50,570 --> 00:02:56,421 All right-- any questions from past lectures? 51 00:02:56,421 --> 00:02:57,920 All right, today we're going to talk 52 00:02:57,920 --> 00:03:02,280 about modeling and discovery of sequence motifs. 
53 00:03:02,280 --> 00:03:05,500 We'll give an example of a particular algorithm that's 54 00:03:05,500 --> 00:03:08,870 used in motif finding called the Gibbs Sampling Algorithm. 55 00:03:08,870 --> 00:03:12,170 It's not the only algorithm, it's not even necessarily 56 00:03:12,170 --> 00:03:13,280 the best algorithm. 57 00:03:13,280 --> 00:03:15,000 It's pretty good. 58 00:03:15,000 --> 00:03:17,330 It works in many cases. 59 00:03:17,330 --> 00:03:18,470 It's an early algorithm. 60 00:03:18,470 --> 00:03:21,700 But it's interesting to talk about 61 00:03:21,700 --> 00:03:24,490 because it illustrates the problem in general, 62 00:03:24,490 --> 00:03:26,800 and also it's an example of a stochastic algorithm-- 63 00:03:26,800 --> 00:03:30,930 an algorithm where what it does is determined 64 00:03:30,930 --> 00:03:32,410 at random, to some extent. 65 00:03:32,410 --> 00:03:36,337 And yet still often converges to a particular answer. 66 00:03:36,337 --> 00:03:38,170 So it's interesting from that point of view. 67 00:03:38,170 --> 00:03:40,420 And we'll talk about a few other types 68 00:03:40,420 --> 00:03:42,480 of motif finding algorithms. 69 00:03:42,480 --> 00:03:46,476 And we'll do a little bit on statistical entropy 70 00:03:46,476 --> 00:03:47,850 and information content, which is 71 00:03:47,850 --> 00:03:50,540 a handy way of describing motifs. 72 00:03:50,540 --> 00:03:54,470 And talk a little bit about parameter estimation, 73 00:03:54,470 --> 00:03:58,250 as well, which is critical when you have a motif 74 00:03:58,250 --> 00:04:00,440 and you want to build a model of it 75 00:04:00,440 --> 00:04:05,380 to then discover additional instances of that motif. 76 00:04:05,380 --> 00:04:09,480 So some reading for today-- I posted some nature 77 00:04:09,480 --> 00:04:13,780 biotechnology primers on motifs and motif discovery, 78 00:04:13,780 --> 00:04:16,589 which are pretty easy reading. 
79 00:04:16,589 --> 00:04:19,390 The textbook, chapter 6, also has some good information 80 00:04:19,390 --> 00:04:22,200 on motifs, I encourage you to look at that. 81 00:04:22,200 --> 00:04:25,180 And I've also posted the original paper 82 00:04:25,180 --> 00:04:28,010 by Bailey and Elkan on the MEME algorithm, which 83 00:04:28,010 --> 00:04:30,560 is kind of related to the Gibbs Sampling Algorithm, 84 00:04:30,560 --> 00:04:35,720 but uses expectation maximization. 85 00:04:35,720 --> 00:04:40,340 And so it's a really nice paper-- take a look at that. 86 00:04:40,340 --> 00:04:44,370 And I'll also post the original Gibbs Sampler paper later 87 00:04:44,370 --> 00:04:45,280 today. 88 00:04:45,280 --> 00:04:47,520 And then on Tuesday, we're going to be talking 89 00:04:47,520 --> 00:04:50,160 about Markov and hidden Markov models. 90 00:04:50,160 --> 00:04:56,030 And so take a look at the primer on HMMs, 91 00:04:56,030 --> 00:04:57,980 as well as there is some information 92 00:04:57,980 --> 00:05:00,480 on HMMs in the text. 93 00:05:00,480 --> 00:05:02,410 It's not really a distinct section, 94 00:05:02,410 --> 00:05:04,900 it's kind of scattered throughout the text. 95 00:05:04,900 --> 00:05:10,086 So the best approach is to look in the index for HMMs, 96 00:05:10,086 --> 00:05:16,310 and read the relevant parts that you're interested in. 97 00:05:16,310 --> 00:05:19,100 And if you really want to understand 98 00:05:19,100 --> 00:05:21,430 the mechanics of HMMs, and how to actually implement 99 00:05:21,430 --> 00:05:25,680 one in depth, then I strongly recommend this Rabiner tutorial 100 00:05:25,680 --> 00:05:28,050 on HMMs, which is posted. 101 00:05:28,050 --> 00:05:29,920 So everyone please, please read that. 102 00:05:29,920 --> 00:05:36,030 I will use the same notation, to the extent possible, 103 00:05:36,030 --> 00:05:40,310 as the Rabiner paper when talking about some 104 00:05:40,310 --> 00:05:42,440 of the algorithms used in HMMs in lecture.
105 00:05:42,440 --> 00:05:45,420 So it should synergize well. 106 00:05:48,280 --> 00:05:50,775 So what is a sequence motif? 107 00:05:53,335 --> 00:05:54,710 In general, it's a pattern that's 108 00:05:54,710 --> 00:05:58,300 common to a set of DNA, RNA, or protein sequences, 109 00:05:58,300 --> 00:06:01,920 that share a biological property. 110 00:06:01,920 --> 00:06:06,410 So for example, all of the binding sites of the Myc 111 00:06:06,410 --> 00:06:09,680 transcription factor-- there's probably a pattern 112 00:06:09,680 --> 00:06:13,500 that they share, and you call that the motif for Myc. 113 00:06:13,500 --> 00:06:19,280 Can you give some examples of where you might get DNA motifs? 114 00:06:19,280 --> 00:06:21,215 Or protein motifs? 115 00:06:21,215 --> 00:06:23,970 Anyone have another example of a type of motif 116 00:06:23,970 --> 00:06:25,740 that would be interesting? 117 00:06:25,740 --> 00:06:27,765 What about one that's defined on function? 118 00:06:27,765 --> 00:06:28,390 Yeah, go ahead. 119 00:06:28,390 --> 00:06:29,557 What's your name? 120 00:06:29,557 --> 00:06:30,704 AUDIENCE: Dan. [INAUDIBLE] 121 00:06:30,704 --> 00:06:31,370 PROFESSOR: Yeah. 122 00:06:31,370 --> 00:06:37,560 So each kinase typically has a certain sequence motif 123 00:06:37,560 --> 00:06:40,655 that determines which proteins it phosphorylates. 124 00:06:40,655 --> 00:06:41,790 Right. 125 00:06:41,790 --> 00:06:42,812 Other examples? 126 00:06:42,812 --> 00:06:45,270 Yeah, so in that case, you might determine it functionally. 127 00:06:45,270 --> 00:06:47,644 You might purify that protein, incubate it 128 00:06:47,644 --> 00:06:50,060 with a pool of peptides, and see what gets phosphorylated, 129 00:06:50,060 --> 00:06:51,450 for example. 130 00:06:51,450 --> 00:06:52,316 Yeah, in the back? 131 00:06:52,316 --> 00:06:55,184 AUDIENCE: I'm [INAUDIBLE], and promonocytes. 132 00:06:55,184 --> 00:06:56,600 PROFESSOR: What was the first one?
133 00:06:56,600 --> 00:06:57,076 AUDIENCE: Promonocytes? 134 00:06:57,076 --> 00:06:58,028 Oh, that one? 135 00:06:58,028 --> 00:06:59,456 Oh, that was my name. 136 00:06:59,456 --> 00:07:01,850 PROFESSOR: Yeah, OK. 137 00:07:01,850 --> 00:07:03,540 And as to promoter motifs, sir? 138 00:07:03,540 --> 00:07:05,510 Some examples? 139 00:07:05,510 --> 00:07:11,613 AUDIENCE: Like, [INAUDIBLE] in transcription mining site. 140 00:07:11,613 --> 00:07:12,196 PROFESSOR: Ah. 141 00:07:12,196 --> 00:07:13,188 Yeah. 142 00:07:13,188 --> 00:07:15,172 And so you would identify those how? 143 00:07:15,172 --> 00:07:17,156 AUDIENCE: By looking at sequences 144 00:07:17,156 --> 00:07:20,132 upstream of [INAUDIBLE], and seeing 145 00:07:20,132 --> 00:07:23,092 what different sequences have in common? 146 00:07:23,092 --> 00:07:23,800 PROFESSOR: Right. 147 00:07:23,800 --> 00:07:27,521 So I think there's at least three ways-- OK, 148 00:07:27,521 --> 00:07:29,020 four ways I can think of identifying 149 00:07:29,020 --> 00:07:29,730 those types of motifs. 150 00:07:29,730 --> 00:07:31,563 That's probably one of the most common types 151 00:07:31,563 --> 00:07:34,600 of motifs encountered in molecular biology. 152 00:07:34,600 --> 00:07:38,160 So one way, you take a bunch of genes, 153 00:07:38,160 --> 00:07:40,490 where you've identified the transcription start site. 154 00:07:40,490 --> 00:07:44,409 You just look for patterns-- short sub-sequences 155 00:07:44,409 --> 00:07:45,450 that they have in common. 156 00:07:45,450 --> 00:07:49,120 That might give you the TATA box, for example. 157 00:07:49,120 --> 00:07:52,150 Another way would be, what about comparative genomics? 158 00:07:52,150 --> 00:07:54,040 You take each individual one, look 159 00:07:54,040 --> 00:07:56,720 to see which parts of that promoter are conserved. 160 00:07:56,720 --> 00:08:01,090 That can also help you refine your motifs. 
161 00:08:01,090 --> 00:08:03,100 Protein binding, you could do ChIP-Seq, 162 00:08:03,100 --> 00:08:05,520 that could give you motifs. 163 00:08:05,520 --> 00:08:07,160 And what about a functional readout? 164 00:08:07,160 --> 00:08:10,080 You clone a bunch of random sequences 165 00:08:10,080 --> 00:08:12,490 upstream of a luciferase reporter, see 166 00:08:12,490 --> 00:08:15,580 which ones actually drive expression, for example. 167 00:08:15,580 --> 00:08:16,780 So, that would be another. 168 00:08:16,780 --> 00:08:19,460 Yeah, absolutely, so there's a bunch of different ways 169 00:08:19,460 --> 00:08:21,570 to define them. 170 00:08:21,570 --> 00:08:23,780 In terms of when we talk about motifs, 171 00:08:23,780 --> 00:08:28,660 there are several different models of increasing resolution 172 00:08:28,660 --> 00:08:30,440 that people use. 173 00:08:30,440 --> 00:08:33,960 So people often talk about the consensus sequence-- 174 00:08:33,960 --> 00:08:38,510 so you say the TATA box, which, of course, 175 00:08:38,510 --> 00:08:41,659 describes the actual motif-- T-A-T-A-A-A, 176 00:08:41,659 --> 00:08:43,470 something like that. 177 00:08:43,470 --> 00:08:46,690 But that's really just the consensus 178 00:08:46,690 --> 00:08:49,210 of a bunch of TATA box motifs. 179 00:08:49,210 --> 00:08:51,490 You rarely find the perfect consensus 180 00:08:51,490 --> 00:08:54,820 in real promoters-- the real, naturally occurring ones 181 00:08:54,820 --> 00:08:58,080 are usually one or two mismatches away. 182 00:08:58,080 --> 00:09:00,084 So that doesn't fully capture it. 183 00:09:00,084 --> 00:09:02,000 So sometimes you'll have a regular expression.
184 00:09:02,000 --> 00:09:06,710 So an example would be if you were describing mammalian 185 00:09:06,710 --> 00:09:12,300 5 prime splice sites, you might describe the motif as GT, A 186 00:09:12,300 --> 00:09:20,960 or G, AGT, or sometimes abbreviated as GTR AGT, where 187 00:09:20,960 --> 00:09:24,026 R is shorthand for either purine nucleotide-- either A 188 00:09:24,026 --> 00:09:29,710 or G. In some motifs you could have GT, NN, GT, or something 189 00:09:29,710 --> 00:09:32,280 like that. 190 00:09:32,280 --> 00:09:35,170 Those can be captured, often, by regular expressions 191 00:09:35,170 --> 00:09:38,950 in a scripting language like Python or Perl. 192 00:09:38,950 --> 00:09:41,740 Another very common description of motifs 193 00:09:41,740 --> 00:09:43,980 would be a weight matrix. 194 00:09:43,980 --> 00:09:47,990 So you'll see a matrix where the width of the matrix 195 00:09:47,990 --> 00:09:50,730 is the number of bases in the motif. 196 00:09:50,730 --> 00:09:53,270 And then there are four rows, which 197 00:09:53,270 --> 00:09:55,910 are the four bases-- we'll see that in a moment. 198 00:09:55,910 --> 00:09:59,280 Sometimes these are described as position-specific probability 199 00:09:59,280 --> 00:10:01,990 matrices, or position-specific score matrices. 200 00:10:01,990 --> 00:10:03,550 We'll come to that in a moment. 201 00:10:03,550 --> 00:10:04,890 And then there are more complicated models. 202 00:10:04,890 --> 00:10:06,360 So it's increasingly becoming clear 203 00:10:06,360 --> 00:10:10,340 that the simple weight matrix is too limited-- it doesn't 204 00:10:10,340 --> 00:10:15,410 capture all the information that's present in motifs. 205 00:10:15,410 --> 00:10:17,980 So we talked about where do motifs come from. 206 00:10:17,980 --> 00:10:19,690 These are just some examples. 207 00:10:19,690 --> 00:10:22,740 I think I talked about all of these, 208 00:10:22,740 --> 00:10:25,250 except for in vitro binding.
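[The GTRAGT regular expression mentioned above can be sketched in Python; the example sequence here is invented for illustration.]

```python
import re

# The GTRAGT consensus: GT, then R (= A or G), then AGT.
# The character class [AG] plays the role of the IUPAC code R.
five_prime_ss = re.compile(r"GT[AG]AGT")

seq = "CAGGTAAGTTTCAGGTGAGT"  # invented genomic snippet for illustration
matches = [m.start() for m in five_prime_ss.finditer(seq)]
print(matches)  # start positions of candidate 5 prime splice sites
```

A regular expression is all-or-nothing: a sequence either matches GTRAGT or it doesn't, with no notion of a better or worse match, which is exactly the limitation the weight-matrix description addresses.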
209 00:10:25,250 --> 00:10:28,010 So in addition to doing a ChIP-seq, where you're 210 00:10:28,010 --> 00:10:30,480 looking at the binding of the endogenous protein, 211 00:10:30,480 --> 00:10:33,020 you could also make recombinant protein-- incubate that 212 00:10:33,020 --> 00:10:35,690 with a random pool of DNA molecules, 213 00:10:35,690 --> 00:10:38,750 pull down, and see what binds to it, for example. 214 00:10:42,010 --> 00:10:44,200 So why are they important? 215 00:10:44,200 --> 00:10:46,310 They're important for obvious reasons-- 216 00:10:46,310 --> 00:10:49,280 that they can identify proteins that 217 00:10:49,280 --> 00:10:52,390 have a specific biological property of interest. 218 00:10:52,390 --> 00:10:54,480 For example, being phosphorylated by a particular 219 00:10:54,480 --> 00:10:54,980 kinase. 220 00:10:54,980 --> 00:10:57,460 Or promoters that have a particular property. 221 00:10:57,460 --> 00:10:59,480 That is, that they're likely to be regulated 222 00:10:59,480 --> 00:11:03,510 by a particular transcription factor, et cetera. 223 00:11:03,510 --> 00:11:08,020 And ultimately, if you're very interested 224 00:11:08,020 --> 00:11:12,160 in the regulation of a particular gene, 225 00:11:12,160 --> 00:11:17,180 knowing what motifs are upstream and how strong the evidence is 226 00:11:17,180 --> 00:11:19,480 for each particular transcription factor that 227 00:11:19,480 --> 00:11:22,040 might or might not bind there, can 228 00:11:22,040 --> 00:11:27,527 be very useful in understanding the regulation of that gene. 229 00:11:27,527 --> 00:11:29,610 And they're also going to be important for efforts 230 00:11:29,610 --> 00:11:30,780 to model gene expression.
231 00:11:30,780 --> 00:11:33,650 So, a goal of systems biology would 232 00:11:33,650 --> 00:11:42,100 be to predict, from a given starting point, if we introduce 233 00:11:42,100 --> 00:11:44,780 some perturbation-- for example, if we knock out 234 00:11:44,780 --> 00:11:46,870 or knock down a particular transcription factor, 235 00:11:46,870 --> 00:11:51,470 or over-express it, how will the system behave? 236 00:11:51,470 --> 00:11:53,220 So you'd really want to be able to predict 237 00:11:53,220 --> 00:11:57,620 how the occupancy of that transcription factor 238 00:11:57,620 --> 00:11:58,360 would change. 239 00:11:58,360 --> 00:12:01,720 You'd want to know, first, where it is at endogenous levels, 240 00:12:01,720 --> 00:12:04,210 and then how its occupancy at every promoter 241 00:12:04,210 --> 00:12:06,300 will change when you perturb its levels. 242 00:12:06,300 --> 00:12:09,460 And then, what effects that will have 243 00:12:09,460 --> 00:12:13,500 on expression of downstream genes. 244 00:12:13,500 --> 00:12:16,370 So these sorts of models all require 245 00:12:16,370 --> 00:12:19,940 really accurate descriptions of motifs. 246 00:12:19,940 --> 00:12:22,230 OK, so these are some examples of protein motifs. 247 00:12:22,230 --> 00:12:25,560 Anyone recognize this one? 248 00:12:25,560 --> 00:12:29,720 What motif is that? 249 00:12:29,720 --> 00:12:30,941 So it says X's. 250 00:12:30,941 --> 00:12:32,440 X's would be degenerate positions, 251 00:12:32,440 --> 00:12:34,510 and C's would be cysteines. 252 00:12:34,510 --> 00:12:36,965 And H's [INAUDIBLE]. 253 00:12:36,965 --> 00:12:38,438 What is this? 254 00:12:38,438 --> 00:12:39,911 What does this define? 255 00:12:39,911 --> 00:12:41,089 What protein has this? 256 00:12:41,089 --> 00:12:42,857 What can you predict about its function? 257 00:12:42,857 --> 00:12:43,839 AUDIENCE: Zinc finger. 258 00:12:43,839 --> 00:12:44,760 PROFESSOR: Zinc finger, right.
259 00:12:44,760 --> 00:12:47,360 So it's a motif commonly seen in genome-binding transcription 260 00:12:47,360 --> 00:12:52,400 factors, and it coordinates to zinc. 261 00:12:52,400 --> 00:12:55,460 What about this one? 262 00:12:55,460 --> 00:12:57,340 Any guesses on what this motif is? 263 00:12:57,340 --> 00:13:00,550 This is quite a short motif. 264 00:13:00,550 --> 00:13:01,884 Yeah? 265 00:13:01,884 --> 00:13:02,371 AUDIENCE: That's a phosphorylation. 266 00:13:02,371 --> 00:13:03,345 PROFESSOR: Phosphorylation site. 267 00:13:03,345 --> 00:13:03,845 Yeah. 268 00:13:03,845 --> 00:13:06,754 And how do you know that? 269 00:13:06,754 --> 00:13:10,039 AUDIENCE: The [INAUDIBLE] and the [INAUDIBLE] next to it 270 00:13:10,039 --> 00:13:12,600 means it's [INAUDIBLE]. 271 00:13:12,600 --> 00:13:15,121 PROFESSOR: OK, so you even know what kinase it is, yeah. 272 00:13:15,121 --> 00:13:15,620 Exactly. 273 00:13:15,620 --> 00:13:16,745 So that's sort of the view. 274 00:13:16,745 --> 00:13:18,922 So, serine, threonine, and tyrosine 275 00:13:18,922 --> 00:13:20,630 are the residues that get phosphorylated. 276 00:13:20,630 --> 00:13:23,350 And so if you see a motif with a serine in the middle, 277 00:13:23,350 --> 00:13:26,140 it's a good chance it's a phosphorylation site. 278 00:13:28,850 --> 00:13:32,720 Here are some-- you can think of them as DNA sequence motifs, 279 00:13:32,720 --> 00:13:34,950 because they occur in genes, but they, of course, 280 00:13:34,950 --> 00:13:37,100 function at the RNA level. 281 00:13:37,100 --> 00:13:40,310 These are the motifs that occur at the boundaries 282 00:13:40,310 --> 00:13:42,770 of mammalian introns. 283 00:13:42,770 --> 00:13:46,700 So this first one is the 5 prime splice site motif. 284 00:13:46,700 --> 00:13:49,800 So these would be the bases that occur at the last three 285 00:13:49,800 --> 00:13:51,390 bases of the exon. 286 00:13:51,390 --> 00:13:55,079 The first two of the intron here are almost always GT.
287 00:13:55,079 --> 00:13:57,370 And then you have this position that I mentioned here-- 288 00:13:57,370 --> 00:13:59,900 it's almost always A or G. 289 00:13:59,900 --> 00:14:02,660 And then some positions that are biased for A, biased for G, 290 00:14:02,660 --> 00:14:06,870 and then slightly biased for T. And that 291 00:14:06,870 --> 00:14:11,075 is what you see when you look at a whole bunch of five 292 00:14:11,075 --> 00:14:14,080 prime ends of mammalian introns-- they have this motif. 293 00:14:14,080 --> 00:14:17,710 So some will have better matches, or worse, 294 00:14:17,710 --> 00:14:19,980 to this particular pattern. 295 00:14:19,980 --> 00:14:22,530 And that's the average pattern that you see. 296 00:14:22,530 --> 00:14:25,290 And it turns out that in this case, 297 00:14:25,290 --> 00:14:29,240 the recognition of that site is not by a protein, per se, 298 00:14:29,240 --> 00:14:32,410 but it's by a ribonucleoprotein complex. 299 00:14:32,410 --> 00:14:35,160 So there's actually an RNA called U1 snRNA 300 00:14:35,160 --> 00:14:37,880 that base pairs with the five prime splice site. 301 00:14:37,880 --> 00:14:40,760 And its sequence, or part of its sequence, 302 00:14:40,760 --> 00:14:45,320 is perfectly complementary to the consensus five prime splice 303 00:14:45,320 --> 00:14:45,820 site. 304 00:14:45,820 --> 00:14:48,645 So we can understand why five prime splice sites have 305 00:14:48,645 --> 00:14:50,660 this motif-- they're evolving to have 306 00:14:50,660 --> 00:14:53,400 a certain degree of complementarity to U1, 307 00:14:53,400 --> 00:14:55,140 in order to get efficiently recognized 308 00:14:55,140 --> 00:14:58,250 by the splicing machinery. 309 00:14:58,250 --> 00:15:01,260 Then at the three prime end of introns, 310 00:15:01,260 --> 00:15:02,650 you see this motif here. 311 00:15:02,650 --> 00:15:05,450 So here's the last base of the intron, a G, and then 312 00:15:05,450 --> 00:15:06,610 an A before it.
313 00:15:06,610 --> 00:15:09,230 Almost all introns end with AG. 314 00:15:09,230 --> 00:15:12,000 Then you have a pyrimidine ahead of it. 315 00:15:12,000 --> 00:15:15,230 Then you have basically an irrelevant position here 316 00:15:15,230 --> 00:15:20,720 at minus four, which is not strongly conserved. 317 00:15:20,720 --> 00:15:23,924 And then a stretch of residues that are usually, 318 00:15:23,924 --> 00:15:26,340 but not always, pyrimidines-- called the pyrimidine tract. 319 00:15:26,340 --> 00:15:29,080 And in this case, the recognition 320 00:15:29,080 --> 00:15:31,910 is actually by proteins rather than RNA. 321 00:15:31,910 --> 00:15:33,790 And there are two proteins. 322 00:15:33,790 --> 00:15:36,280 One called U2AF65 that binds the pyrimidine tract, 323 00:15:36,280 --> 00:15:40,550 and one, U2AF35, that binds that last YAG motif. 324 00:15:40,550 --> 00:15:42,590 And then there's an upstream motif 325 00:15:42,590 --> 00:15:44,820 here, that's just upstream of the 3 prime splice site 326 00:15:44,820 --> 00:15:47,350 that is quite degenerate and hard to find, 327 00:15:47,350 --> 00:15:49,890 called the branch point motif. 328 00:15:49,890 --> 00:15:52,210 OK, so, let's take an example. 329 00:15:52,210 --> 00:15:55,650 So the five prime splice site is a nice example of a motif, 330 00:15:55,650 --> 00:15:59,010 because you can uniquely align them, right? 331 00:15:59,010 --> 00:16:01,530 You can sequence DNA, sequence genomes, 332 00:16:01,530 --> 00:16:03,790 align the cDNA to the genome; that tells you 333 00:16:03,790 --> 00:16:05,700 exactly where the splice junctions are. 334 00:16:05,700 --> 00:16:10,710 And you can take the exons that have a 5 prime splice site, 335 00:16:10,710 --> 00:16:14,680 and align the sequences at the exon/intron boundary 336 00:16:14,680 --> 00:16:16,630 and get a precise motif.
337 00:16:16,630 --> 00:16:19,255 And then you can tally up the frequencies of the bases, 338 00:16:19,255 --> 00:16:20,630 and make a table like this, which 339 00:16:20,630 --> 00:16:23,760 we would call a position-specific probability 340 00:16:23,760 --> 00:16:26,050 matrix. 341 00:16:26,050 --> 00:16:31,900 And what you could then do to predict additional, 342 00:16:31,900 --> 00:16:35,850 say, five prime splice-site motifs in other genes-- 343 00:16:35,850 --> 00:16:38,010 for example, genes where you didn't get good cDNA 344 00:16:38,010 --> 00:16:40,560 coverage, because let's say they're not expressed 345 00:16:40,560 --> 00:16:43,530 in the cells that you analyzed-- you could then 346 00:16:43,530 --> 00:16:48,450 make this odds ratio here. 347 00:16:48,450 --> 00:16:50,920 So here we have a candidate sequence. 348 00:16:50,920 --> 00:16:56,425 So the motif is nine positions, often numbered minus 3 349 00:16:56,425 --> 00:16:58,420 to minus 1, which would be the exonic parts of this. 350 00:16:58,420 --> 00:17:00,830 And then plus 1 to plus 6 would be the first six 351 00:17:00,830 --> 00:17:01,680 bases of the intron. 352 00:17:01,680 --> 00:17:03,760 That's just the convention that's used. 353 00:17:03,760 --> 00:17:06,690 I'm sure it's going to drive the computer scientists crazy 354 00:17:06,690 --> 00:17:09,036 because we're not starting at 0, but that's 355 00:17:09,036 --> 00:17:10,619 usually what's used in the literature. 356 00:17:10,619 --> 00:17:13,010 And so we have a nine-base motif.
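[The tallying step just described-- aligned sites in, position-specific probability matrix out-- can be sketched like this; the four example sites are invented, and a real matrix would be estimated from thousands of aligned splice sites.]

```python
# Invented 9-base aligned regions: last 3 exon bases + first 6 intron bases.
sites = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "TGGGTAAGT"]

bases = "ACGT"
n = len(sites)
width = len(sites[0])

# ppm[b][j] = fraction of sites with base b at motif position j
ppm = {b: [sum(s[j] == b for s in sites) / n for j in range(width)]
       for b in bases}

# Each column is a probability distribution over A, C, G, T.
for j in range(width):
    assert abs(sum(ppm[b][j] for b in bases) - 1.0) < 1e-9

print(ppm["G"][3], ppm["T"][4])  # the near-invariant GT at intron positions +1, +2
```

With more sites you would typically also add pseudocounts, so that a base never seen at a position gets a small nonzero probability rather than a hard zero.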
357 00:17:13,010 --> 00:17:16,050 And then we're going to calculate the probability 358 00:17:16,050 --> 00:17:19,970 of generating that particular sequence, S given 359 00:17:19,970 --> 00:17:23,020 plus-- meaning given our foreground, 360 00:17:23,020 --> 00:17:27,760 or motif model-- as the product of the probability 361 00:17:27,760 --> 00:17:33,150 of generating the first base in the sequence, S1, using the column 362 00:17:33,150 --> 00:17:36,290 probability in the minus 3 position. 363 00:17:36,290 --> 00:17:41,660 So if the first base is a C, for example, that would be 0.4. 364 00:17:41,660 --> 00:17:43,910 And then the probability of generating the second base 365 00:17:43,910 --> 00:17:46,670 in the sequence using the next column, and so forth. 366 00:17:49,540 --> 00:17:54,207 If you made a vector for each position that had a 1 367 00:17:54,207 --> 00:17:56,040 for the base that occurred at that position, 368 00:17:56,040 --> 00:17:58,180 and a 0 for the other bases, and then you 369 00:17:58,180 --> 00:18:02,680 just did the dot product of that with the matrix, you'd get this. 370 00:18:02,680 --> 00:18:04,130 So, we multiply probabilities. 371 00:18:04,130 --> 00:18:08,420 So that is assuming independence between positions. 372 00:18:08,420 --> 00:18:12,600 And so that's a key assumption-- weight matrices assume 373 00:18:12,600 --> 00:18:16,020 that each position in the motif contributes independently 374 00:18:16,020 --> 00:18:17,980 to the overall strength of that motif. 375 00:18:17,980 --> 00:18:20,774 And that may or may not be true. They don't assume 376 00:18:20,774 --> 00:18:22,190 that it's homogeneous-- that is, you 377 00:18:22,190 --> 00:18:28,327 usually have, in a typical case, different probabilities 378 00:18:28,327 --> 00:18:30,160 in different columns, so it's inhomogeneous-- 379 00:18:30,160 --> 00:18:31,570 but they do assume independence. 380 00:18:31,570 --> 00:18:35,770 And then you often want to use a background model.
381 00:18:35,770 --> 00:18:38,100 For example, if your genome composition 382 00:18:38,100 --> 00:18:40,690 is 25% of each of the nucleotides, 383 00:18:40,690 --> 00:18:43,550 you could just have a background probability that 384 00:18:43,550 --> 00:18:45,930 was equally likely for each of the four, 385 00:18:45,930 --> 00:18:51,300 and then calculate the probability, S given minus, 386 00:18:51,300 --> 00:18:53,700 of generating that particular [INAUDIBLE] 387 00:18:53,700 --> 00:18:56,880 under the background model, and take the ratio of those two. 388 00:18:56,880 --> 00:18:58,800 And the advantage of that is that then you 389 00:18:58,800 --> 00:19:02,890 can find sequences that are-- that ratio, 390 00:19:02,890 --> 00:19:07,117 it could be 100 times more like a 5 prime splice site 391 00:19:07,117 --> 00:19:08,700 than like background-- or 1,000 times. 392 00:19:08,700 --> 00:19:12,480 Or you have some sort of scaling on it. 393 00:19:12,480 --> 00:19:14,850 Whereas, if you just take the raw probability, 394 00:19:14,850 --> 00:19:17,661 it's going to be something that's on the order of 1/4 395 00:19:17,661 --> 00:19:18,160 to the 9th power. 396 00:19:18,160 --> 00:19:20,631 So some very, very small number that's a little hard to 397 00:19:20,631 --> 00:19:21,130 work with. 398 00:19:24,800 --> 00:19:27,410 So when people talk about motifs, 399 00:19:27,410 --> 00:19:30,770 they often use language like exact, or precise, 400 00:19:30,770 --> 00:19:34,310 versus degenerate, strong versus weak, good versus lousy, 401 00:19:34,310 --> 00:19:38,140 depending on the context, who's listening. 402 00:19:38,140 --> 00:19:41,830 So an example of these would be a restriction enzyme. 403 00:19:41,830 --> 00:19:43,920 You often say restriction enzymes have 404 00:19:43,920 --> 00:19:46,380 very precise sequence specificity, 405 00:19:46,380 --> 00:19:51,190 they only cut-- EcoRI only cuts at GAATTC.
406 00:19:51,190 --> 00:19:55,300 Whereas, a TATA binding protein is somewhat more degenerate. 407 00:19:55,300 --> 00:19:58,510 It'll bind to a range of things. 408 00:19:58,510 --> 00:20:01,819 So I use degenerate there, you could say it's a weaker motif. 409 00:20:01,819 --> 00:20:04,110 You'll often-- if you want to try to make this precise, 410 00:20:04,110 --> 00:20:06,840 then the language of entropy and information 411 00:20:06,840 --> 00:20:10,360 offers additional terminology, like high information 412 00:20:10,360 --> 00:20:13,630 content, low entropy, et cetera. 413 00:20:13,630 --> 00:20:19,230 So let's take a look at this as perhaps a more natural, or more 414 00:20:19,230 --> 00:20:22,530 precise way of describing what we mean, here. 415 00:20:22,530 --> 00:20:25,220 So imagine you have a motif. 416 00:20:25,220 --> 00:20:27,840 We're going to do a motif of length one-- 417 00:20:27,840 --> 00:20:31,200 just to keep the math super simple, but you'll 418 00:20:31,200 --> 00:20:33,440 see it easily generalizes. 419 00:20:33,440 --> 00:20:40,090 So you have probabilities of the four nucleotides that are Pk. 420 00:20:40,090 --> 00:20:42,500 And you have background probabilities, qk. 421 00:20:42,500 --> 00:20:45,690 And we're going to assume those are all uniform, 422 00:20:45,690 --> 00:20:47,250 they're all a quarter. 423 00:20:47,250 --> 00:20:51,920 So then the statistical, or Shannon, entropy 424 00:20:51,920 --> 00:20:57,520 of a probability distribution-- or vector of probabilities, 425 00:20:57,520 --> 00:21:01,250 if you will-- is defined here. 426 00:21:01,250 --> 00:21:10,590 So H of q, where q is a distribution or, in this case, 427 00:21:10,590 --> 00:21:19,460 vector, is defined as minus the summation of qk log qk, 428 00:21:19,460 --> 00:21:20,160 in general. 429 00:21:20,160 --> 00:21:22,230 And then if you wanted to be in units of bits, 430 00:21:22,230 --> 00:21:25,060 you'd use log base 2.
So how many people have seen this equation before? Like half, I'm going to go with. OK, good. So who can tell me, first of all-- is this a positive quantity, negative quantity, non-negative, or what? Yeah, go ahead.

AUDIENCE: Log q_k is always going to be negative, so you take the negative of the sum of all the negatives to get a positive number.

PROFESSOR: Right, so this, in general, is a non-negative quantity, because we have this minus sign here. We're taking logs of things that are between 0 and 1, so the logs are negative, right? OK. And then what would be the entropy if I say that the distribution q is this-- 0, 1, 0, 0-- meaning it's a motif that's 100% C? What is the entropy of that? What was your name?

AUDIENCE: William.

PROFESSOR: William.

AUDIENCE: The entropy would be 0, because the vector is deterministic-- the base is known with certainty.

PROFESSOR: Right. And if we do the math-- you'll get, for the C term, you'll have a sum.
You'll have three terms that are 0 log 0-- it might crash your calculator, I guess-- and then one term that is 1 log 1. And 1 log 1, that's easy: that's 0, right? 0 log 0, you could say, is undefined. But by continuity-- taking the limit of x log x as x gets small, using L'Hôpital's rule, which gives 0-- this is defined to be 0 in information theory. And 1 log 1 is always 0. So the whole thing comes out to be 0. It's deterministic. Entropy is a measure of uncertainty, and so that makes sense-- if you know what the base is, there's no uncertainty; the entropy is 0.

So what about this vector-- 1/4, 1/4, 1/4, 1/4-- 25% of each of the bases? What is H of q? Anyone? I'm going to make you show me why, so-- anyone want to attempt this? Levi?

AUDIENCE: I think it's 2.

PROFESSOR: 2, OK. Can you explain?

AUDIENCE: Because the log of the 1/4's is going to be negative 2, and you're multiplying that by 1/4, so you're getting 1/2 for each, and adding them up equals 2.
PROFESSOR: Right-- in total, there are going to be four terms that are 1/4 times log of 1/4. Log of 1/4 is minus 2; 1/4 times minus 2 is minus 1/2; four times minus 1/2 is minus 2; and then you change the sign, because of the minus in front. So that equals 2.

And what about this one? Anyone see that one? This is a coin flip, basically, right? It's either A or G. [INAUDIBLE]. Anyone? Levi, want to do this one again?

AUDIENCE: It's 1.

PROFESSOR: OK, and why?

AUDIENCE: Because you have two terms of 0 log 0, which is 0, and two terms of 1/2 times the log of 1/2-- the log of 1/2 is just negative 1-- so you have two halves.

PROFESSOR: Yeah. So two terms like that, and then two terms that turn out to be 0-- 0 log 0-- and a minus in front. So that will be 1. So a coin flip has one bit of information. That's basically what we mean: if you have a fair coin and you don't know the outcome, we're going to call that one bit.
And so a base that could be any of the four, equally likely, has twice as much uncertainty.

This is related to the Boltzmann entropy that you may be familiar with from statistical mechanics, which is the log of the number of states: if you have N states and they're all equally likely, the Shannon entropy turns out to be the log of the number of states. We saw that here-- four states, equally likely, comes out to log of 4, or 2 bits. And that's true in general. So you can think of this as a generalization of Boltzmann entropy, if you want to.

OK, so why did he call it entropy? It turns out that Shannon, who was developing this in the late '40s as a theory of communication, scratched his head a little bit. He talked to his friend John von Neumann-- none other than the von Neumann involved in inventing computers-- and he said, "My concern was what to call it. I thought of calling it information. But the word was overly used." So back in 1949, information was already overused. "And so I decided to call it uncertainty."
And then he discussed it with John von Neumann, who had a better idea. He said, "You should call it entropy. In the first place, your uncertainty function has already been used in statistical mechanics under that name"-- so it already has a name. "In the second place, and more important, nobody knows what entropy really is, so in a debate you will always have the advantage." So keep that in mind. After you've taken this class, just start throwing it around, and you will win a lot of debates.

So how is information related to entropy? The way we're going to define it here, which is how it's often defined, is that information is reduction in uncertainty. So suppose I'm dealing with an unknown DNA sequence-- the lambda phage genome, say-- and it has 25% of each base. If you tell me you're going to send me two bases, I have no idea-- they could be any pair of bases. My uncertainty is 2 bits per base, or 4 bits, before you tell me anything.
If you then tell me it's the TA motif, which is always T followed by A, then now my uncertainty is 0, so the amount of information you just gave me is 4 bits-- you reduced my uncertainty from 4 bits to 0. So we define the information at a particular position as the entropy before-- "before" meaning the background, which is sort of your null hypothesis-- minus the entropy after-- after you've told me that this is an instance of the motif, with a particular model.

So, in this case, the information is going to be the entropy before-- this is just H of q, this term here-- minus this term, which is H of p. If the background is uniform, we said H of q is 2 bits per position, and so the information content of the motif is just 2 minus the entropy of the motif model. In general, it turns out that if the positions in the motif are independent, then the information content of the motif is 2w minus H of the motif, where w is the width of the motif.

So for example-- we said the entropy of this motif is 2 bits, right? Therefore, the information content is what?
Let's say this is our p-- this is our motif model. What is its information content?

AUDIENCE: 0?

PROFESSOR: 0. Why is it 0? Yeah, back row.

AUDIENCE: Because the entropy of that is 2, and the entropy of the null hypothesis, so to say, is also 2. So 2 minus 2 is 0.

PROFESSOR: Right-- the entropy of the background is 2, and the entropy of this is also 2. So 2 minus 2 is 0. And what about this? Let's say this was our motif-- a motif that's either A or G. We said the entropy of this is 1 bit, so what is the information content of this motif?

AUDIENCE: 1.

PROFESSOR: 1, and why is it 1?

AUDIENCE: Background is 2, and entropy here is 1.

PROFESSOR: Background is 2, entropy is 1. OK? And what about if I tell you it's the EcoRI restriction site? So it's GAATTC, a precise six-base motif-- it has to be exactly those bases. What is the information content of that motif?
In the back?

AUDIENCE: It's 12.

PROFESSOR: 12-- 12 what?

AUDIENCE: 12 bits.

PROFESSOR: 12 bits, and why is that?

AUDIENCE: Because the background is 2 times 6-- six bases, and 2 bits for each. And all the bases are determined at the specific [INAUDIBLE] enzyme site, so the entropy of that is 0, and 12 minus 0 is 12.

PROFESSOR: Right, the entropy of that motif is 0. Imagine the 4,096 possible six-mers: one of them has probability 1, and all the others have 0. You're going to have that big sum, and it's going to come out to be 0, OK?

Why is this useful at all-- or is it? One of the reasons why it's useful-- sorry, that's on a later slide. Just hang with me, and it will be clear why it's useful in a few slides. But for now, we have a description of information content. So the EcoRI site has 12 bits of information, a completely random position has 0, a short four-cutter restriction enzyme would have 2 times 4, or 8, bits of information, and an eight-cutter 16 bits.
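The worked examples above fit in a few lines of Python. This sketch assumes a uniform 25% background and independent positions, so the information content is I = 2w − H(motif):

```python
import math

def entropy_bits(probs):
    # H = -sum p log2 p, with 0 log 0 taken as 0 by continuity
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_content(pwm):
    """I = 2w - H(motif) bits, assuming a uniform (25% each) background
    and independent positions; pwm is one probability vector per position."""
    w = len(pwm)
    return 2 * w - sum(entropy_bits(col) for col in pwm)

uniform = [0.25, 0.25, 0.25, 0.25]
a_or_g = [0.5, 0.0, 0.5, 0.0]                          # order A, C, G, T
base = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0],
        "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}
ecori = [base[b] for b in "GAATTC"]                    # each position certain

print(information_content([uniform]))   # random position: 0 bits
print(information_content([a_or_g]))    # A-or-G position: 1 bit
print(information_content(ecori))       # EcoRI site: 12 bits
```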
So you can see, as the restriction enzyme gets longer, there's more information content.

So let's talk about the motif-finding problem, and then we'll return to the usefulness of information content. Can everyone see the motif that's present in all these sequences? If anyone can't, please let me know. You probably can't. What about now? These are the same sequences, but I've aligned them. Can anyone see a motif?

AUDIENCE: GGG GGG.

PROFESSOR: Yeah, I heard some G's. Right-- so there's this motif over here. It's pretty weak, and pretty degenerate. There are definitely some exceptions, but you can see that a lot of the sequences have at least GGC, possibly with an A after that. So this is the problem we're dealing with. You have a bunch of promoters, and the transcription factor that binds may be fairly degenerate-- maybe because it likes to bind cooperatively with several of its buddies, and so it doesn't have to have a very strong instance of the motif present. And so it can be quite difficult to find.
So that's why there's a real bioinformatics challenge. Motif finding is not done by lining up sequences by hand and drawing boxes-- although that's how the first motif was found, the TATA box. That's why it's called the TATA box: someone just drew a box on a sequence alignment. But these days, most motifs require some sort of algorithm to find.

Like I said, it's essentially a local multiple alignment problem. You want a multiple alignment, but it doesn't have to be global-- it can be local, just over a sub-region.

There are basically at least three different general approaches to the problem of motif finding. One approach is the so-called enumerative, or dictionary, approach. In this approach you say, well, we're looking for a motif of length 6, because this is a leucine zipper transcription factor that we're modeling, and they usually have binding sites around 6, so we're going to guess 6. And we're going to enumerate all the six-mers-- there are 4,096 of them.
We're going to count up their occurrences in a set of promoters that, for example, are turned on when you over-express this factor, and look at those frequencies divided by the frequencies of those six-mers in some background set-- either random sequences, or promoters that didn't turn on, something like that. You have two classes, and you look for statistical enrichment.

This approach is fine-- there's nothing wrong with it, and people use it all the time. One of the downsides, though, is that you're doing a lot of statistical tests. You're essentially testing each six-mer-- 4,096 statistical tests-- so you have to adjust the statistical significance for the number of tests that you do, and that can reduce your power. That's one main drawback. The other is that maybe this protein binds a rather degenerate motif, and a precise six-mer is just too precise-- none of them will occur often enough. You really have to have a degenerate motif, something like CRYGY.
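A bare-bones sketch of this enumerative approach in Python. The toy "induced" and "control" promoter sequences, and the planted six-mer TGACTC, are made up for illustration; a real analysis would attach a significance test with multiple-testing correction rather than using a raw frequency ratio:

```python
from collections import Counter
from itertools import product

def kmer_counts(seqs, k):
    """Count occurrences of every k-mer across a set of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

def enrichment(foreground, background, k=6, pseudocount=1):
    """Rank all 4**k k-mers by their count ratio between two sequence sets.

    Pseudocounts avoid division by zero; in practice you would also
    correct p-values for the 4**k tests being performed.
    """
    fg, bg = kmer_counts(foreground, k), kmer_counts(background, k)
    ratios = {}
    for kmer in map("".join, product("ACGT", repeat=k)):
        ratios[kmer] = (fg[kmer] + pseudocount) / (bg[kmer] + pseudocount)
    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)

# Toy example: TGACTC is planted in each of the "induced" promoters
induced = ["AATGACTCGG", "CCTGACTCAT", "GGTGACTCTT"]
control = ["AACGTGTACG", "CCATGCGTAT", "GGCATATGTT"]
print(enrichment(induced, control)[0])  # top hit: ('TGACTC', 4.0)
```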
That's really the motif it binds to, and so you don't see it unless you use something more degenerate. So you can generalize this to use regular expressions, et cetera. It's a reasonable approach.

Another approach, which we'll talk about in a moment, is probabilistic optimization, where you wander around the space of possible motifs until you find one that looks strong. And then there are deterministic versions of this, like MEME. We're going to focus today on the second one, mostly because it's a little bit more mysterious and interesting as an algorithm. And it's also [INAUDIBLE].

So, if the motif landscape looked like this-- imagine all possible motifs, where you've somehow come up with a 2D lattice of the possible motif sequences, and the strength of each motif, or the degree to which that motif description corresponds to the true motif, is represented by the height here-- then there's basically one optimal motif, and the closer you get to it, the better the fit. Then our problem is going to be relatively easy.
But it's also possible that it looks something like this: there are a lot of decoy motifs, or weaker motifs that are only slightly enriched in the sequence space. And so you can easily get tripped up if you're wandering around randomly. We don't know a priori, and it's probably not as simple as the first example. That's one of the issues that motivates these stochastic algorithms.

So, just to put this in context: the Gibbs motif sampler that we're going to be talking about is a Monte Carlo algorithm. That just means it's an algorithm that does some random sampling somewhere in it, so that the outcome you get isn't necessarily deterministic. You run it at different times, and you'll actually get different outputs, which can be a little bit disconcerting and annoying at times. But it turns out to be useful in some cases. There's also a special case called a Las Vegas algorithm, where the algorithm knows when it has found the optimal answer. But in general, not-- in general, you don't know for sure.
So the Gibbs motif sampler is basically a model where you have a likelihood for generating a set of sequences, S. Imagine you have 40 sequences that are bacterial promoters, each 40 bases long, let's say. That's your S. What you do, then, is consider a model in which there is a particular instance of the motif you're trying to discover at a particular position in each one of those sequences-- not necessarily the same position, just some position in each sequence. And we're going to describe the composition of that motif by a weight matrix-- one of these matrices of width W, with four rows specifying the frequencies of the four nucleotides at each position.

The setup here is that you want to calculate, or think about, the probability of (S, A)-- S is the actual sequences, and A is basically a vector that specifies the location of the motif instance in each of those 40 sequences. You want to calculate that conditional on capital theta, which is our weight matrix. In this case, I think I made a motif of length 8, and it's shown there in red.
So there's going to be a weight matrix of length 8, and then there's going to be some sort of background frequency vector, which might be the background composition of the E. coli genome, for example. And so the probability of generating those sequences, together with those particular locations, is going to be proportional to this: basically, you use the little-theta background vector for all the positions except the ones inside the motif, starting at position A_k here. For those 8 positions you use the corresponding columns of the weight matrix, and then you go back to using the background probabilities. Question, yeah?

AUDIENCE: Is this for finding motifs based on other known motifs? Or is this--

PROFESSOR: No-- I'm sorry, I should have prefaced that. We're doing de novo motif finding. We're going to give the algorithm some sequences of a given length-- they can even be of variable lengths-- and we're going to give it a guess of what the length of the motif is. So we're going to say, we think it's 8.
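The generating model just described can be sketched as follows: background probabilities at every position outside a window of width W, and the weight matrix columns inside it. The width-3 weight matrix and its 0.7/0.1 values are invented for illustration:

```python
def sequence_likelihood(seq, start, pwm, background):
    """P(seq | motif instance starting at `start`): use the background
    frequency for every position outside the motif window, and the
    corresponding weight matrix column for positions inside it."""
    w = len(pwm)
    p = 1.0
    for i, base in enumerate(seq):
        if start <= i < start + w:
            p *= pwm[i - start][base]   # inside the motif window
        else:
            p *= background[base]       # outside: background model
    return p

background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
pwm = [  # toy width-3 motif favoring A, C, G
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]

# Placing the motif over the ACG at position 2 yields a much larger
# likelihood than placing it over the TT at position 0
print(sequence_likelihood("TTACGT", 2, pwm, background))
print(sequence_likelihood("TTACGT", 0, pwm, background))
```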
That could come from structural reasons. Or often you really have no idea, so you just guess-- a lot of times it's kind of short, so we'll go with 6 or 8, or you try different lengths. Totally de novo motif finding.

OK, so how does the algorithm work? You have N sequences of length L, and you've guessed that the motif has width W. You choose starting positions at random-- this is a vector of the starting position in each sequence, and we're going to choose completely random positions within the N sequences. They have to be at least W before the end, so that we get a whole motif-- that's just an accounting thing to make it simpler. Then you choose one of the sequences at random-- say, the first sequence. You make a weight matrix model of width W from the instances in the other sequences. So for example-- actually, I have slides on this, so we'll just do it with the slides; you'll see what this looks like in a moment. You have instances here in this sequence, here in this one, here.
You take all those, line them up, make a weight matrix out of them, and then you score the positions in sequence 1 for how well they match.

So, let me just do this. These are your motif instances-- again, totally random at the beginning. Then you build a weight matrix from those by lining them up and just counting frequencies. Then you pick a sequence at random-- and your weight matrix doesn't include that sequence, typically. And then you take your theta matrix and slide it along the sequence. You consider every sub-sequence of length W-- the one that goes from 1 to W, the one that goes from 2 to W plus 1, et cetera-- all the way along the sequence until you get to the end. And you calculate the probability of each, using the likelihood I gave you before. So it's basically the probability of generating the sequence where you use the background vector for all the positions except the particular motif instance you're considering, and you use the motif model for that. Does that make sense?
876 00:44:01,190 --> 00:44:06,810 So, if you happen to have a good-looking occurrence of the motif 877 00:44:06,810 --> 00:44:10,220 at this position, here, in the sequence, 878 00:44:10,220 --> 00:44:15,560 then you would get a higher likelihood. 879 00:44:15,560 --> 00:44:24,010 So for example, if the motif was, let's say it's 3 long, 880 00:44:24,010 --> 00:44:40,650 and it happened to favor ACG, then 881 00:44:40,650 --> 00:44:44,210 if you have a sequence here that has, 882 00:44:44,210 --> 00:44:46,364 let's say, TTT, that's going 883 00:44:46,364 --> 00:44:48,030 to have a low probability in this motif. 884 00:44:48,030 --> 00:44:50,380 It's going to be 0.1 cubed. 885 00:44:50,380 --> 00:44:54,152 And then if you have an occurrence of, say, ACT, 886 00:44:54,152 --> 00:44:55,860 that's going to have a higher probability. 887 00:44:55,860 --> 00:44:58,100 It's going to be 0.7 times 0.7 times 0.1. 888 00:44:58,100 --> 00:44:59,260 So, quite a bit higher. 889 00:44:59,260 --> 00:45:02,450 So you start, it'll be low for this triplet 890 00:45:02,450 --> 00:45:04,370 here-- so I'll put a low value here. 891 00:45:04,370 --> 00:45:07,190 TTA is also going to be low. 892 00:45:07,190 --> 00:45:09,350 TAC, also low. 893 00:45:09,350 --> 00:45:12,540 But ACT, that matches 2 out of 3 to the motif. 894 00:45:12,540 --> 00:45:15,040 It's going to be a lot better. 895 00:45:15,040 --> 00:45:17,855 And then the next triplet is going to be low again, et cetera. 896 00:45:17,855 --> 00:45:20,230 So you just slide this along and calculate probabilities. 897 00:45:23,130 --> 00:45:27,270 And then what you do is you sample from this distribution. 898 00:45:27,270 --> 00:45:31,500 These probabilities don't necessarily sum to 1. 899 00:45:31,500 --> 00:45:34,165 But you re-normalize them so that they do sum to 1, 900 00:45:34,165 --> 00:45:36,440 you just add them up, divide by the sum. 901 00:45:36,440 --> 00:45:37,590 Now they sum to 1.
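The renormalize-and-sample step just described could be written like this. A minimal sketch-- the function name and the use of Python's random module are my own choices, not from the lecture.

```python
import random

def sample_start(scores, rng=random):
    """Renormalize window scores so they sum to 1, then sample a
    start position from the resulting probability distribution."""
    total = sum(scores)
    probs = [s / total for s in scores]   # now they sum to 1
    r = rng.random()
    cumulative = 0.0
    for start, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return start
    return len(probs) - 1                 # guard against float rounding
```

High-scoring positions are sampled most often, but low-scoring ones still get picked occasionally-- which, as discussed below, is what lets the sampler escape bad assignments.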
902 00:45:37,590 --> 00:45:39,919 And now you sample those sites in that sequence, 903 00:45:39,919 --> 00:45:41,710 according to that probability distribution. 904 00:45:45,560 --> 00:45:47,500 Like I said, in this case you might end up 905 00:45:47,500 --> 00:45:49,729 sampling-- that's the highest probability site, 906 00:45:49,729 --> 00:45:50,770 so you might sample that. 907 00:45:50,770 --> 00:45:53,874 But you also might sample one of these other ones. 908 00:45:53,874 --> 00:45:55,540 It's unlikely you would sample this one, 909 00:45:55,540 --> 00:45:56,640 because that's very low. 910 00:45:56,640 --> 00:46:03,650 But you actually sometimes sample one that's not so great. 911 00:46:03,650 --> 00:46:06,790 So you sample a starting position in that sequence, 912 00:46:06,790 --> 00:46:09,490 and you basically-- wherever you had originally 913 00:46:09,490 --> 00:46:13,640 assigned it in sequence 1, now you move it to that new location. 914 00:46:13,640 --> 00:46:17,380 We've just changed the assignment 915 00:46:17,380 --> 00:46:20,950 of where we think the motif might be in that sequence. 916 00:46:20,950 --> 00:46:22,450 And then you choose another sequence 917 00:46:22,450 --> 00:46:24,440 at random from your list. 918 00:46:24,440 --> 00:46:26,450 Often you go through the sequences sequentially, 919 00:46:26,450 --> 00:46:29,570 and then you make a new weight matrix model. 920 00:46:29,570 --> 00:46:32,450 So how will that weight matrix model differ from the last one? 921 00:46:32,450 --> 00:46:35,730 Well it'll differ because the instance 922 00:46:35,730 --> 00:46:39,780 of the motif in sequence 1 is now at a new location, 923 00:46:39,780 --> 00:46:40,360 in general. 924 00:46:40,360 --> 00:46:42,776 I mean, you might have sampled the exact same location you 925 00:46:42,776 --> 00:46:44,777 started, but in general it'll move. 926 00:46:44,777 --> 00:46:46,860 And so now, you've got a slightly different weight 927 00:46:46,860 --> 00:46:48,250 matrix.
928 00:46:48,250 --> 00:46:52,200 Most of the data going into it, N minus 1, 929 00:46:52,200 --> 00:46:53,200 is going to be the same. 930 00:46:53,200 --> 00:46:55,040 But one of them is going to be different. 931 00:46:55,040 --> 00:46:57,400 So it'll change a little bit. 932 00:46:57,400 --> 00:46:59,690 You make a new weight matrix, and then you 933 00:46:59,690 --> 00:47:00,810 pick a new sequence. 934 00:47:00,810 --> 00:47:03,000 You slide that weight matrix along that sequence, 935 00:47:03,000 --> 00:47:05,541 you get this distribution, you sample from that distribution, 936 00:47:05,541 --> 00:47:07,980 and you keep going. 937 00:47:07,980 --> 00:47:12,800 Yeah, this was described by Lawrence in 1993, 938 00:47:12,800 --> 00:47:16,670 and I'll post that paper. 939 00:47:16,670 --> 00:47:19,210 OK, so you sample a position with that, 940 00:47:19,210 --> 00:47:20,890 and you update the location. 941 00:47:20,890 --> 00:47:23,130 So now we sampled that really high-probability one, 942 00:47:23,130 --> 00:47:26,520 so we moved the motif over to that new orange location, 943 00:47:26,520 --> 00:47:28,742 there. 944 00:47:28,742 --> 00:47:30,920 I don't know if these animations are helping at all. 945 00:47:30,920 --> 00:47:33,670 And then you update your weight matrix. 946 00:47:37,080 --> 00:47:40,240 And then you iterate until convergence. 947 00:47:40,240 --> 00:47:44,700 So you typically have a set of N sequences, 948 00:47:44,700 --> 00:47:46,259 you go through them once. 949 00:47:46,259 --> 00:47:48,800 You have a weight matrix, and then you go through them again. 950 00:47:48,800 --> 00:47:50,110 You go through a few times. 951 00:47:50,110 --> 00:47:51,960 And maybe at a certain point, you 952 00:47:51,960 --> 00:47:54,600 end up re-sampling the same sites as you 953 00:47:54,600 --> 00:47:57,910 did in the last iteration-- same exact sites. 954 00:47:57,910 --> 00:47:59,720 You've converged.
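Putting the pieces together, the whole loop-- random starts, a leave-one-out weight matrix, scoring, sampling, iterating-- might look like this. This is a sketch under stated assumptions, not the implementation from the lecture: the 0.5 pseudocount, the uniform background, and a fixed number of sweeps in place of a convergence test are all simplifications.

```python
import random

BASES = "ACGT"

def gibbs_motif(seqs, W, n_sweeps=200, rng=None):
    """One plain version of the Gibbs motif sampler described above.
    Returns the sampled start position for each sequence."""
    rng = rng or random.Random()
    # 1. Choose a completely random start in each sequence, at least
    #    W before the end so a whole motif fits.
    starts = [rng.randrange(len(s) - W + 1) for s in seqs]
    for _ in range(n_sweeps):
        for i in range(len(seqs)):
            # 2. Build a weight matrix from the instances in the
            #    other N-1 sequences (pseudocount avoids zeros).
            counts = [{b: 0.5 for b in BASES} for _ in range(W)]
            for k, (seq, s) in enumerate(zip(seqs, starts)):
                if k != i:
                    for j in range(W):
                        counts[j][seq[s + j]] += 1
            theta = [{b: col[b] / sum(col.values()) for b in BASES}
                     for col in counts]
            # 3. Score every width-W window of sequence i against
            #    the motif model relative to a uniform background.
            scores = []
            for start in range(len(seqs[i]) - W + 1):
                ratio = 1.0
                for j in range(W):
                    ratio *= theta[j][seqs[i][start + j]] / 0.25
                scores.append(ratio)
            # 4. Sample the new start in proportion to the scores.
            starts[i] = rng.choices(range(len(scores)),
                                    weights=scores)[0]
    return starts
```

In practice you would also track the theta matrix from sweep to sweep and stop once it stops changing, and-- as noted later in the lecture-- run the whole thing several times from different random starts and trust the motif that keeps coming back.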
955 00:47:59,720 --> 00:48:03,265 Or, you keep track of the theta matrices 956 00:48:03,265 --> 00:48:06,320 that you get after going through the whole set of sequences, 957 00:48:06,320 --> 00:48:08,460 and from one iteration to the next, 958 00:48:08,460 --> 00:48:11,545 the theta matrix hasn't really changed much. 959 00:48:11,545 --> 00:48:12,440 You've converged. 960 00:48:16,790 --> 00:48:20,430 So let's do an example of this. 961 00:48:20,430 --> 00:48:23,570 Here I made up a motif, and this is a representation 962 00:48:23,570 --> 00:48:26,650 where the four bases have these colors assigned to them. 963 00:48:26,650 --> 00:48:29,440 And you can see that this motif is quite strong. 964 00:48:29,440 --> 00:48:32,870 It really strongly prefers A at this position 965 00:48:32,870 --> 00:48:34,259 here, and et cetera. 966 00:48:34,259 --> 00:48:36,550 And I put it at the same position in all the sequences, 967 00:48:36,550 --> 00:48:40,090 just to make life simple. 968 00:48:40,090 --> 00:48:46,010 And then a former student in the lab, [INAUDIBLE], 969 00:48:46,010 --> 00:48:50,990 he implemented the Gibbs Sampler in Matlab, actually, 970 00:48:50,990 --> 00:48:53,730 and made a little video of what's going on. 971 00:48:53,730 --> 00:48:58,470 So the upper part shows the current weight matrix. 972 00:48:58,470 --> 00:49:01,590 Notice it's pretty random-looking 973 00:49:01,590 --> 00:49:02,550 at the beginning. 974 00:49:02,550 --> 00:49:10,740 And the right parts show where the motif 975 00:49:10,740 --> 00:49:13,874 is, or the position that we're currently considering. 976 00:49:13,874 --> 00:49:15,540 So this shows the position that was last 977 00:49:15,540 --> 00:49:19,800 sampled in the last round.
978 00:49:19,800 --> 00:49:21,570 And this shows the probability density 979 00:49:21,570 --> 00:49:24,560 along each sequence of what's the probability 980 00:49:24,560 --> 00:49:28,050 that the motif occurs at each particular place 981 00:49:28,050 --> 00:49:30,620 in the sequence. 982 00:49:30,620 --> 00:49:32,440 And that's what happens over time. 983 00:49:32,440 --> 00:49:35,410 So it's obviously very fast, so I'll run it again 984 00:49:35,410 --> 00:49:38,070 and maybe pause it partway. 985 00:49:38,070 --> 00:49:41,160 We're starting from a very random-looking motif. 986 00:49:44,480 --> 00:49:48,420 This is what you get after not too many iterations-- probably 987 00:49:48,420 --> 00:49:50,870 like 100 or so. 988 00:49:50,870 --> 00:49:55,730 And now you can see your motif-- your weight matrix is now quite 989 00:49:55,730 --> 00:49:59,670 biased, and now favors A at this position, and so forth. 990 00:49:59,670 --> 00:50:03,680 And the locations of your motif, most of them 991 00:50:03,680 --> 00:50:05,840 are around this position, around 6 or 7 992 00:50:05,840 --> 00:50:09,450 in the sequence-- that's where we put the motif in. 993 00:50:09,450 --> 00:50:11,020 But not all, some of them. 994 00:50:11,020 --> 00:50:14,410 And then you can see the probabilities-- white is high, 995 00:50:14,410 --> 00:50:17,010 black is low-- in some sequences, 996 00:50:17,010 --> 00:50:18,530 it's very, very confident, the motif 997 00:50:18,530 --> 00:50:21,090 is exactly at that position, like this first sequence here. 998 00:50:21,090 --> 00:50:23,980 And others, it's got some uncertainty 999 00:50:23,980 --> 00:50:26,460 about where the motif might be.
1000 00:50:26,460 --> 00:50:29,450 And then we let it run a little bit more, 1001 00:50:29,450 --> 00:50:33,010 and it eventually converges to being 1002 00:50:33,010 --> 00:50:37,213 very confident that the motif has the sequence, A C G T A G C 1003 00:50:37,213 --> 00:50:41,324 A, and that it occurs at that particular position 1004 00:50:41,324 --> 00:50:41,990 in the sequence. 1005 00:50:45,200 --> 00:50:49,630 So who can tell me why this actually works? 1006 00:50:49,630 --> 00:50:51,590 We're choosing positions at random, 1007 00:50:51,590 --> 00:50:55,280 updating a weight matrix, why does that actually 1008 00:50:55,280 --> 00:50:58,646 help you find the real motif that's in these sequences? 1009 00:51:01,840 --> 00:51:04,310 Any ideas? 1010 00:51:04,310 --> 00:51:06,774 Or who can make an argument that it shouldn't work? 1011 00:51:09,830 --> 00:51:10,330 Yeah? 1012 00:51:10,330 --> 00:51:11,473 What was your name again? 1013 00:51:11,473 --> 00:51:12,056 AUDIENCE: Dan. 1014 00:51:12,056 --> 00:51:13,347 PROFESSOR: Dan, yeah, go ahead. 1015 00:51:13,347 --> 00:51:18,678 AUDIENCE: So, couldn't it, sort of, in a certain situation 1016 00:51:18,678 --> 00:51:25,110 have different sub-motifs that are also sort of rich, 1017 00:51:25,110 --> 00:51:27,090 and because you're sampling randomly 1018 00:51:27,090 --> 00:51:31,960 you might be stuck inside of those boundaries 1019 00:51:31,960 --> 00:51:36,142 where you're searching your composition? 1020 00:51:36,142 --> 00:51:37,350 PROFESSOR: Yeah, that's good. 1021 00:51:37,350 --> 00:51:39,560 So Dan's point is that you can get 1022 00:51:39,560 --> 00:51:45,350 stuck in sub-optimal smaller or weaker motifs. 1023 00:51:45,350 --> 00:51:47,035 So that's certainly true. 1024 00:51:47,035 --> 00:51:49,160 So you're saying, maybe this example is artificial? 
1025 00:51:49,160 --> 00:51:51,326 Because I had started with totally random sequences, 1026 00:51:51,326 --> 00:51:54,160 and I put a pretty strong motif in a particular place, 1027 00:51:54,160 --> 00:51:58,080 so there were no-- it's more like that mountain, 1028 00:51:58,080 --> 00:52:01,780 that structure where there's just one motif to find. 1029 00:52:01,780 --> 00:52:03,580 So it's perhaps an easy case. 1030 00:52:03,580 --> 00:52:06,740 But still, what I want to know is how does this algorithm, 1031 00:52:06,740 --> 00:52:09,322 how did it actually find that motif? 1032 00:52:09,322 --> 00:52:15,290 He implemented exactly that algorithm that I described. 1033 00:52:15,290 --> 00:52:19,806 Why does it tend to go towards [INAUDIBLE]? 1034 00:52:19,806 --> 00:52:22,056 After a long time, remember it's a long time, 1035 00:52:22,056 --> 00:52:23,222 it's hundreds of iterations. 1036 00:52:23,222 --> 00:52:27,614 AUDIENCE: So you're covering a lot in the sequence, 1037 00:52:27,614 --> 00:52:34,214 just the random searching of the sequence, when you're-- 1038 00:52:34,214 --> 00:52:35,755 PROFESSOR: There are many iterations. 1039 00:52:35,755 --> 00:52:37,570 You're considering many possible locations 1040 00:52:37,570 --> 00:52:40,620 within the sequences, that's true. 1041 00:52:40,620 --> 00:52:47,238 But why does it eventually-- why does it converge to something? 1042 00:52:47,238 --> 00:52:48,935 AUDIENCE: I guess, because you're 1043 00:52:48,935 --> 00:52:52,806 seeing your motif more plainly than you're 1044 00:52:52,806 --> 00:52:56,694 seeing other random motifs. 1045 00:52:56,694 --> 00:52:59,610 So it will hit it more frequently-- randomly. 1046 00:52:59,610 --> 00:53:03,012 And therefore, converge [INAUDIBLE]. 1047 00:53:03,012 --> 00:53:04,990 PROFESSOR: Yeah, that's true. 1048 00:53:04,990 --> 00:53:10,760 Can someone give more of the intuition behind this? 1049 00:53:10,760 --> 00:53:11,734 Yeah?
1050 00:53:11,734 --> 00:53:13,216 AUDIENCE: I just have a question. 1051 00:53:13,216 --> 00:53:15,192 Is each iteration an independent test? 1052 00:53:15,192 --> 00:53:20,626 For example, if you iterate over the same sequence set 1053 00:53:20,626 --> 00:53:24,792 100 times, and you're updating your weight matrix each time, 1054 00:53:24,792 --> 00:53:29,485 does that mean that updating the weight matrix is also 1055 00:53:29,485 --> 00:53:31,481 taking into account the previous-- 1056 00:53:31,481 --> 00:53:34,475 that this is the same sample space? 1057 00:53:34,475 --> 00:53:35,937 PROFESSOR: Yeah, the weight matrix, 1058 00:53:35,937 --> 00:53:38,270 after you go through one iteration of all the sequences, 1059 00:53:38,270 --> 00:53:40,120 you have a weight matrix. 1060 00:53:40,120 --> 00:53:43,900 You carry that over, you don't start from scratch. 1061 00:53:43,900 --> 00:53:45,470 You bring that weight matrix back up, 1062 00:53:45,470 --> 00:53:49,150 and use that to score, let's say, that first sequence. 1063 00:53:49,150 --> 00:53:56,090 Yeah, the weight matrix just keeps moving around. 1064 00:53:56,090 --> 00:53:58,450 Moves a little bit every time you sample a sequence. 1065 00:53:58,450 --> 00:54:00,740 AUDIENCE: So you constantly get a strong [INAUDIBLE]. 1066 00:54:00,740 --> 00:54:02,206 PROFESSOR: Well, does it? 1067 00:54:02,206 --> 00:54:04,150 AUDIENCE: Well, I guess-- 1068 00:54:04,150 --> 00:54:07,070 PROFESSOR: Would it constantly get stronger? 1069 00:54:07,070 --> 00:54:11,190 What's to make it get stronger or weaker? 1070 00:54:11,190 --> 00:54:14,599 I mean, this is sort of-- you're on the right track. 1071 00:54:14,599 --> 00:54:18,008 AUDIENCE: If it is random, then there's 1072 00:54:18,008 --> 00:54:21,379 some probability that you're going to find this motif again, 1073 00:54:21,379 --> 00:54:22,878 at which point it will get stronger.
1074 00:54:22,878 --> 00:54:28,235 But, if it's-- given enough iterations, 1075 00:54:28,235 --> 00:54:35,044 it gets stronger as long as you hit different spots at random. 1076 00:54:35,044 --> 00:54:35,960 PROFESSOR: Yeah, yeah. 1077 00:54:35,960 --> 00:54:39,832 That's what I'm-- I think there was a comment. 1078 00:54:39,832 --> 00:54:40,800 Jacob, yeah? 1079 00:54:40,800 --> 00:54:43,731 AUDIENCE: Well, you can think about it as a random walk 1080 00:54:43,731 --> 00:54:44,956 through the landscape. 1081 00:54:44,956 --> 00:54:47,436 Eventually, it has a high probability 1082 00:54:47,436 --> 00:54:50,412 of taking that motif, and updating the [INAUDIBLE] 1083 00:54:50,412 --> 00:54:54,293 direction, just from the probability of [INAUDIBLE]. 1084 00:54:54,293 --> 00:54:54,876 PROFESSOR: OK. 1085 00:54:54,876 --> 00:54:57,356 AUDIENCE: And given the [INAUDIBLE]. 1086 00:55:00,332 --> 00:55:09,240 PROFESSOR: OK, let's say I had 100 sequences of length, 1087 00:55:09,240 --> 00:55:11,330 I don't know, 30. 1088 00:55:11,330 --> 00:55:14,130 And the width of the motif is 6. 1089 00:55:20,160 --> 00:55:22,310 So here's our sequences. 1090 00:55:22,310 --> 00:55:29,800 We choose random positions for the start position, 1091 00:55:29,800 --> 00:55:32,490 and let's say it was this example where 1092 00:55:32,490 --> 00:55:38,932 the real motif, I put it right here, in all the sequences. 1093 00:55:38,932 --> 00:55:39,890 That's where it starts. 1094 00:55:43,160 --> 00:55:43,785 Does this help? 1095 00:55:46,370 --> 00:55:49,630 So it's 30 and 6, so there's 25 possible start positions. 1096 00:55:49,630 --> 00:55:53,060 I did that to make it a little easier. 1097 00:55:53,060 --> 00:55:56,070 So what would happen in that first iteration? 1098 00:55:56,070 --> 00:55:57,920 What can you say about what the weight 1099 00:55:57,920 --> 00:55:59,730 matrix would look like? 1100 00:55:59,730 --> 00:56:06,315 It's going to be width W-- you know, columns 1, 2, 3, up to 6.
1101 00:56:09,790 --> 00:56:13,206 We're going to give it 100 positions at random. 1102 00:56:13,206 --> 00:56:16,150 The motif is here-- let's say it's 1103 00:56:16,150 --> 00:56:19,230 a very strong motif, that's a 12-bit motif. 1104 00:56:19,230 --> 00:56:24,390 So it's 100%-- it's EcoRI. 1105 00:56:24,390 --> 00:56:25,286 It's that. 1106 00:56:28,430 --> 00:56:31,570 What would that weight matrix look like, 1107 00:56:31,570 --> 00:56:33,670 in this first iteration, when you first 1108 00:56:33,670 --> 00:56:37,490 just sample the sites at random? 1109 00:56:37,490 --> 00:56:40,130 What kind of probabilities would it have? 1110 00:56:40,130 --> 00:56:42,110 AUDIENCE: [INAUDIBLE] 1111 00:56:42,110 --> 00:56:44,454 PROFESSOR: Equal? 1112 00:56:44,454 --> 00:56:47,940 OK-- perfectly equal? 1113 00:56:47,940 --> 00:56:48,851 AUDIENCE: Roughly. 1114 00:56:48,851 --> 00:56:49,434 PROFESSOR: OK. 1115 00:56:49,434 --> 00:56:50,430 Any box? 1116 00:56:58,398 --> 00:57:01,620 Are we likely to hit the actual motif, ever, 1117 00:57:01,620 --> 00:57:03,318 in that first iteration? 1118 00:57:03,318 --> 00:57:06,282 AUDIENCE: No, because you have a uniform probability 1119 00:57:06,282 --> 00:57:07,270 of sampling. 1120 00:57:07,270 --> 00:57:10,514 Well, uniform at each one of the 25 positions? 1121 00:57:10,514 --> 00:57:11,222 PROFESSOR: Right. 1122 00:57:11,222 --> 00:57:14,186 AUDIENCE: Right now, you're not sampling proportional 1123 00:57:14,186 --> 00:57:15,668 to the likelihood. 1124 00:57:15,668 --> 00:57:19,406 PROFESSOR: So the chance of hitting the motif in any given 1125 00:57:19,406 --> 00:57:20,114 sequence is what? 1126 00:57:20,114 --> 00:57:20,608 AUDIENCE: 1/25. 1127 00:57:20,608 --> 00:57:21,274 PROFESSOR: 1/25. 1128 00:57:21,274 --> 00:57:22,584 We have 100 sequences.
1129 00:57:22,584 --> 00:57:25,529 AUDIENCE: So that's four out of-- 1130 00:57:25,529 --> 00:57:27,862 PROFESSOR: So on average, I'll hit the motif four times, 1131 00:57:27,862 --> 00:57:28,530 right. 1132 00:57:28,530 --> 00:57:32,680 The other 96 positions will be essentially random, right? 1133 00:57:35,760 --> 00:57:40,080 So you initially said this was going to be uniform, right? 1134 00:57:40,080 --> 00:57:43,280 On average, 25% of each base, plus or minus 1135 00:57:43,280 --> 00:57:47,730 a little bit of sampling error-- could be 23, 24, 26. 1136 00:57:47,730 --> 00:57:52,210 But now, you pointed out that it's going to be four. 1137 00:57:52,210 --> 00:57:55,900 You're going to hit the motif four times, on average. 1138 00:57:55,900 --> 00:57:59,280 So, can you say anything more? 1139 00:57:59,280 --> 00:58:02,737 AUDIENCE: Could you maybe have a slight bias towards G 1140 00:58:02,737 --> 00:58:04,228 on the first position? 1141 00:58:04,228 --> 00:58:08,701 Slightly biased towards A on the second and third? 1142 00:58:08,701 --> 00:58:12,180 Slightly biased towards T on the fourth and fifth. 1143 00:58:12,180 --> 00:58:14,665 And slightly biased towards C in the sixth? 1144 00:58:14,665 --> 00:58:16,156 So it would be slightly biased-- 1145 00:58:16,156 --> 00:58:19,138 PROFESSOR: Right, so remind me of your name? 1146 00:58:19,138 --> 00:58:20,132 AUDIENCE: I'm Eric. 1147 00:58:20,132 --> 00:58:22,370 PROFESSOR: Eric, OK, so Eric says 1148 00:58:22,370 --> 00:58:24,640 that because four of the sequences 1149 00:58:24,640 --> 00:58:26,160 will have a G at the first position, 1150 00:58:26,160 --> 00:58:28,451 because those are the ones where you sampled the motif, 1151 00:58:28,451 --> 00:58:31,810 and the other 96 will have each of the four bases equally 1152 00:58:31,810 --> 00:58:36,760 likely, on average you have like 24%-- plus 4 for G, right?
1153 00:58:36,760 --> 00:58:38,640 Something like 28%-- this will be 1154 00:58:38,640 --> 00:58:42,860 28%, plus or minus a little bit. 1155 00:58:42,860 --> 00:58:50,520 And these other ones will be whatever that works out to be, 1156 00:58:50,520 --> 00:58:54,790 24 or something like that-- 24-ish, on average. 1157 00:58:54,790 --> 00:58:56,540 Again, it may not come out exactly 1158 00:58:56,540 --> 00:58:59,750 like-- G may not be number one, but it's more often 1159 00:58:59,750 --> 00:59:01,680 going to be number one than any other base. 1160 00:59:01,680 --> 00:59:06,370 And on average, it'll be more like 28% rather than 25%. 1161 00:59:06,370 --> 00:59:08,600 And similarly for position two, A 1162 00:59:08,600 --> 00:59:12,610 will be 28%, and three, and et cetera. 1163 00:59:12,610 --> 00:59:16,264 And then the sixth will be-- C will 1164 00:59:16,264 --> 00:59:17,430 have a little bit of a bias. 1165 00:59:17,430 --> 00:59:20,210 OK, so even in that first round, when 1166 00:59:20,210 --> 00:59:22,380 you're sampling that first sequence, 1167 00:59:22,380 --> 00:59:24,920 the matrix is going to be slightly biased 1168 00:59:24,920 --> 00:59:27,050 toward the motif-- depending how the sampling went. 1169 00:59:27,050 --> 00:59:30,100 You might not have hit any instances of the motif, right? 1170 00:59:30,100 --> 00:59:34,766 But often, it'll be a little bit-- 1171 00:59:34,766 --> 00:59:37,760 Is that enough of a bias to give you 1172 00:59:37,760 --> 00:59:42,994 a good chance of selecting the motif in that first sequence? 1173 00:59:42,994 --> 00:59:44,743 AUDIENCE: You mean in the first iteration? 1174 00:59:44,743 --> 00:59:48,155 PROFESSOR: Let's say the first random sequence that's sampled. 1175 00:59:48,155 --> 00:59:48,655 No. 1176 00:59:48,655 --> 00:59:50,122 You're shaking your head. 1177 00:59:50,122 --> 00:59:55,501 Not enough of a bias because-- it's 1178 00:59:55,501 --> 01:00:01,955 0.28 over 0.25 to the sixth power, right?
1179 01:00:01,955 --> 01:00:03,050 So it's like-- 1180 01:00:03,050 --> 01:00:05,228 AUDIENCE: The likelihood is still close to 1. 1181 01:00:05,228 --> 01:00:07,025 Like, that's [INAUDIBLE] ratio. 1182 01:00:07,025 --> 01:00:09,150 PROFESSOR: So it's something like 1.1 to the sixth, 1183 01:00:09,150 --> 01:00:10,400 or something like that. 1184 01:00:10,400 --> 01:00:13,810 So it might be close to 2, might be twice as likely. 1185 01:00:13,810 --> 01:00:16,750 But still, there's 25 positions. 1186 01:00:16,750 --> 01:00:18,230 Does that make sense? 1187 01:00:18,230 --> 01:00:21,790 So it's quite likely that you won't 1188 01:00:21,790 --> 01:00:25,020 sample the motif in that first-- you'll sample something else. 1189 01:00:25,020 --> 01:00:31,440 Which will take it away in some random direction. 1190 01:00:31,440 --> 01:00:34,462 So who can tell me how this actually ends up working? 1191 01:00:34,462 --> 01:00:36,170 Why does it actually converge eventually, 1192 01:00:36,170 --> 01:00:37,710 if you let it run long enough? 1193 01:00:52,622 --> 01:00:53,614 AUDIENCE: [INAUDIBLE]. 1194 01:00:58,078 --> 01:00:59,670 PROFESSOR: So the information content, 1195 01:00:59,670 --> 01:01:01,900 what will happen to that? 1196 01:01:01,900 --> 01:01:05,074 So the information content, if it was completely random-- 1197 01:01:05,074 --> 01:01:06,360 we said that would be uniform. 1198 01:01:06,360 --> 01:01:08,630 That would be zero information content, right? 1199 01:01:08,630 --> 01:01:12,330 This matrix, which has around 28% at six different positions, 1200 01:01:12,330 --> 01:01:16,190 will have an information content that's low, but non-zero. 1201 01:01:16,190 --> 01:01:19,700 It might end up being like 1 bit, or something. 1202 01:01:19,700 --> 01:01:23,599 And if you then sample motifs that are not the motif, 1203 01:01:23,599 --> 01:01:25,640 they will tend to reduce the information content, 1204 01:01:25,640 --> 01:01:27,910 tend to bring it back toward random.
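The arithmetic from the board-- 100 sequences, length 30, width 6, an EcoRI-strength motif planted in each-- can be written out explicitly. A small worked sketch; the variable names are mine.

```python
n_seqs, L, W = 100, 30, 6
positions = L - W + 1                  # 25 possible start positions
hits = n_seqs / positions              # on average, 4 sequences hit the motif

# First column of the initial weight matrix: the ~96 random picks
# contribute about a quarter of each base, and the 4 hits all
# contribute the motif's base (G, for EcoRI's first position).
p_motif_base = (hits + (n_seqs - hits) * 0.25) / n_seqs   # about 0.28
p_other_base = ((n_seqs - hits) * 0.25) / n_seqs          # about 0.24

# Likelihood ratio of the true motif site versus background across
# all six positions -- only about 2, spread over 25 candidate sites,
# so the true site is still unlikely to be sampled early on.
ratio = (p_motif_base / 0.25) ** W
```

So even with the motif planted in every sequence, the first-round weight matrix gives the true site only about twice the background likelihood, which is why the early sampling wanders.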
1205 01:01:27,910 --> 01:01:34,680 If you sample locations that have the motif, 1206 01:01:34,680 --> 01:01:36,900 what will that do to the information content? 1207 01:01:36,900 --> 01:01:38,162 Boost it. 1208 01:01:38,162 --> 01:01:40,620 So what would you expect if we were to plot the information 1209 01:01:40,620 --> 01:01:43,090 content over time, what would that look like? 1210 01:01:46,090 --> 01:01:47,590 AUDIENCE: It should trend upwards, 1211 01:01:47,590 --> 01:01:50,430 but it could fluctuate. 1212 01:01:50,430 --> 01:01:51,378 PROFESSOR: Yeah. 1213 01:01:51,378 --> 01:01:53,290 AUDIENCE: Over the number of iterations? 1214 01:01:59,929 --> 01:02:01,470 PROFESSOR: I think I blocked it here. 1215 01:02:01,470 --> 01:02:03,886 Let me see if I can-- Let's try this. 1216 01:02:03,886 --> 01:02:05,050 I think I plotted it. 1217 01:02:08,000 --> 01:02:09,951 OK, never mind. 1218 01:02:09,951 --> 01:02:11,450 I wanted to keep it very mysterious, 1219 01:02:11,450 --> 01:02:13,470 so you guys have to figure it out. 1220 01:02:13,470 --> 01:02:24,430 The answer is that it will-- basically what happens is you 1221 01:02:24,430 --> 01:02:26,670 start with a weight matrix like this. 1222 01:02:26,670 --> 01:02:31,390 A lot of times, because the bias for the motif is quite weak, 1223 01:02:31,390 --> 01:02:33,490 a lot of times you'll sample-- even 1224 01:02:33,490 --> 01:02:35,034 for a sequence, what matters is-- 1225 01:02:35,034 --> 01:02:37,450 like, if you had a sequence where the location, initially, 1226 01:02:37,450 --> 01:02:40,494 was not the motif, and then you sample another location that's 1227 01:02:40,494 --> 01:02:42,910 not the motif, that's not really going to change anything. 1228 01:02:42,910 --> 01:02:44,285 It'll change things a little bit, 1229 01:02:44,285 --> 01:02:46,280 but not in any particular direction. 
1230 01:02:46,280 --> 01:02:48,330 What really matters is when you get to a sequence 1231 01:02:48,330 --> 01:02:51,040 where you already had the motif, if you now sample one that's 1232 01:02:51,040 --> 01:02:54,000 not the motif, your information content will get weaker. 1233 01:02:54,000 --> 01:02:57,420 It will become more uniform. 1234 01:02:57,420 --> 01:03:00,410 But if you have a sequence where it wasn't the motif, 1235 01:03:00,410 --> 01:03:05,440 but now you happen to sample the motif, then it'll get stronger. 1236 01:03:05,440 --> 01:03:07,920 And when it gets stronger, it will then 1237 01:03:07,920 --> 01:03:11,820 be more likely to pick the motif in the next sequence, 1238 01:03:11,820 --> 01:03:13,584 and so on. 1239 01:03:13,584 --> 01:03:15,750 So basically what happens to the information content 1240 01:03:15,750 --> 01:03:19,440 is that over many iterations-- it starts near 0. 1241 01:03:19,440 --> 01:03:22,654 And can occasionally go up a little bit. 1242 01:03:22,654 --> 01:03:25,070 And then once it exceeds the threshold, it goes like that. 1243 01:03:27,630 --> 01:03:30,770 So what happens is it stumbles onto a few instances 1244 01:03:30,770 --> 01:03:33,230 of the motif that bias the weight matrix. 1245 01:03:33,230 --> 01:03:34,900 And if they don't bias it enough, 1246 01:03:34,900 --> 01:03:36,874 it'll just fall off that. 1247 01:03:36,874 --> 01:03:38,540 It's like trying to climb the mountain-- 1248 01:03:38,540 --> 01:03:40,480 but it's walking in a random direction. 1249 01:03:40,480 --> 01:03:42,710 So sometimes it will turn around and go back down. 1250 01:03:42,710 --> 01:03:47,270 But then when it gets high enough, it'll be obvious. 1251 01:03:47,270 --> 01:03:52,020 Once you have a, say, 20 times greater likelihood 1252 01:03:52,020 --> 01:03:55,010 of picking that motif than any other position, most 1253 01:03:55,010 --> 01:03:56,300 of the time you will pick it.
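The information content that keeps coming up here-- the single-number summary of how biased the weight matrix is-- can be computed directly. A minimal sketch, assuming the usual definition of 2 minus the Shannon entropy, in bits, summed over the columns.

```python
from math import log2

def information_content(theta):
    """Total information content of a weight matrix, in bits.

    theta: list of columns, each a dict mapping base -> probability.
    A uniform column contributes 0 bits; a column fixed on a single
    base contributes the maximum of 2 bits.
    """
    total = 0.0
    for column in theta:
        entropy = -sum(p * log2(p) for p in column.values() if p > 0)
        total += 2.0 - entropy
    return total
```

Plotting this value over the sweeps gives the trajectory described here: near zero while the sampler wanders, then a sharp rise once a few true instances bias the matrix enough to become self-reinforcing.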
1254 01:03:56,300 --> 01:03:59,576 And very soon, it'll be stronger. 1255 01:03:59,576 --> 01:04:01,200 And the next round, when it's stronger, 1256 01:04:01,200 --> 01:04:03,220 you'll have a greater bias for picking 1257 01:04:03,220 --> 01:04:04,770 the motif, and so forth. 1258 01:04:04,770 --> 01:04:06,501 Question? 1259 01:04:06,501 --> 01:04:08,370 AUDIENCE: For this specific example, 1260 01:04:08,370 --> 01:04:11,850 N is much greater than L minus W. 1261 01:04:11,850 --> 01:04:15,730 How true is that for practical examples? 1262 01:04:15,730 --> 01:04:18,380 PROFESSOR: That's a very good question. 1263 01:04:18,380 --> 01:04:23,950 There is sometimes-- depends on how commonly your motif occurs 1264 01:04:23,950 --> 01:04:27,840 in the genome, and how good your data is, really, 1265 01:04:27,840 --> 01:04:30,370 and what the source of your data is. 1266 01:04:30,370 --> 01:04:31,870 So sometimes it can be very limited, 1267 01:04:31,870 --> 01:04:37,190 sometimes-- If you do ChIP-Seq you might have 10,000 1268 01:04:37,190 --> 01:04:38,940 peaks that you're analyzing, or something. 1269 01:04:38,940 --> 01:04:40,870 So you could have a huge number. 1270 01:04:40,870 --> 01:04:43,500 But on the other hand, if you did some functional 1271 01:04:43,500 --> 01:04:48,320 assay that's quite laborious for a motif that drives luciferase, 1272 01:04:48,320 --> 01:04:50,770 or something, and you can only test a few, 1273 01:04:50,770 --> 01:04:51,940 you might only have 10. 1274 01:04:51,940 --> 01:04:55,980 So it varies all over the map. 1275 01:04:55,980 --> 01:04:57,740 So that's a good question. 1276 01:04:57,740 --> 01:05:00,130 We'll come back to that in a little bit. 1277 01:05:00,130 --> 01:05:01,200 Simona? 1278 01:05:01,200 --> 01:05:02,750 AUDIENCE: If you have a short motif, 1279 01:05:02,750 --> 01:05:04,270 does it make sense, then, to reduce 1280 01:05:04,270 --> 01:05:05,710 the number of sequences you have?
1281 01:05:05,710 --> 01:05:07,880 Because maybe it won't converge? 1282 01:05:07,880 --> 01:05:09,846 PROFESSOR: Reduce the number of sequences? 1283 01:05:09,846 --> 01:05:11,345 What do you people think about that? 1284 01:05:11,345 --> 01:05:12,830 Is that a good idea or a bad idea? 1285 01:05:16,790 --> 01:05:21,182 It's true that it might converge faster 1286 01:05:21,182 --> 01:05:22,640 with a smaller number of sequences, 1287 01:05:22,640 --> 01:05:25,080 but you also might not find it at all. 1288 01:05:25,080 --> 01:05:27,640 So generally you're losing information, 1289 01:05:27,640 --> 01:05:30,320 so you want to have more sequences 1290 01:05:30,320 --> 01:05:32,415 up to a certain point. 1291 01:05:32,415 --> 01:05:34,790 Let's just do a couple more examples, and I'll come back. 1292 01:05:34,790 --> 01:05:36,080 Those are both good questions. 1293 01:05:36,080 --> 01:05:37,329 OK, so here's this weak motif. 1294 01:05:37,329 --> 01:05:39,590 So this is the one where you guys couldn't see it 1295 01:05:39,590 --> 01:05:41,240 when I just put the sequences up. 1296 01:05:41,240 --> 01:05:43,880 You can only see it when it's aligned-- 1297 01:05:43,880 --> 01:05:47,190 it's this thing with GGC, here. 1298 01:05:47,190 --> 01:05:52,410 And here is, again, the Gibbs Sampler. 1299 01:05:52,410 --> 01:05:55,775 And what happened? 1300 01:06:00,980 --> 01:06:04,698 Who can summarize what happened here? 1301 01:06:12,926 --> 01:06:13,822 Yeah, David? 1302 01:06:13,822 --> 01:06:15,030 AUDIENCE: It didn't converge. 1303 01:06:15,030 --> 01:06:17,894 PROFESSOR: Yeah, it didn't quite converge. 1304 01:06:17,894 --> 01:06:21,130 The motif is usually on the right side, 1305 01:06:21,130 --> 01:06:24,890 and it found something that's like the motif. 1306 01:06:24,890 --> 01:06:29,330 But it's not quite right-- it's got that A, it's G A G C, 1307 01:06:29,330 --> 01:06:33,310 it should be G G C.
And so it sampled some other things, 1308 01:06:33,310 --> 01:06:35,050 and it got off track a little bit, 1309 01:06:35,050 --> 01:06:36,766 because probably by chance, there 1310 01:06:36,766 --> 01:06:39,140 were some things that looked a little bit like the motif, 1311 01:06:39,140 --> 01:06:41,090 and it was finding some instances of that, 1312 01:06:41,090 --> 01:06:42,920 and some instances of the real motif. 1313 01:06:42,920 --> 01:06:44,640 And yeah, it didn't quite converge. 1314 01:06:44,640 --> 01:06:50,220 And you can see these probability vectors here, 1315 01:06:50,220 --> 01:06:53,000 they have multiple white dots in many of the rows. 1316 01:06:53,000 --> 01:06:54,604 So it doesn't know, it's uncertain. 1317 01:06:54,604 --> 01:06:55,770 So it keeps bouncing around. 1318 01:06:55,770 --> 01:06:57,890 So it didn't really converge, it was too weak, 1319 01:06:57,890 --> 01:07:00,930 it was too challenging for the algorithm. 1320 01:07:04,130 --> 01:07:10,750 This is just a summary of how the Gibbs Sampler works. 1321 01:07:10,750 --> 01:07:15,390 It's not guaranteed to converge to the same motif every time. 1322 01:07:15,390 --> 01:07:19,390 So what you generally will want to do is run it several times, 1323 01:07:19,390 --> 01:07:22,890 and if nine out of 10 times you get the same motif, 1324 01:07:22,890 --> 01:07:25,400 you should trust that. 1325 01:07:25,400 --> 01:07:26,480 Go ahead. 1326 01:07:26,480 --> 01:07:33,062 AUDIENCE: Over here, are we optimizing for convergence 1327 01:07:33,062 --> 01:07:35,930 of the value of the information content? 1328 01:07:35,930 --> 01:07:37,600 PROFESSOR: No, the information content 1329 01:07:37,600 --> 01:07:41,880 is just describing-- it's just a handy single-number 1330 01:07:41,880 --> 01:07:44,550 description of how biased the weight matrix is. 1331 01:07:44,550 --> 01:07:49,200 So it's not actually directly being optimized. 
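To make the procedure concrete, here is a minimal sketch in Python of the Gibbs Sampler's basic site-sampling step, assuming one site per sequence and a uniform background. The function names, the pseudocount value, and the uniform-background assumption are my own simplifications, not code from the lecture.

```python
import random

def count_matrix(seqs, positions, W, pseudo=1.0):
    """Position-specific base frequencies (with pseudocounts)
    estimated from the currently chosen site in each sequence."""
    counts = [{b: pseudo for b in "ACGT"} for _ in range(W)]
    for seq, pos in zip(seqs, positions):
        for j in range(W):
            counts[j][seq[pos + j]] += 1
    return [{b: col[b] / sum(col.values()) for b in col} for col in counts]

def sample_site(seq, pwm, W, background=0.25):
    """One Gibbs step: sample a start position in `seq` in proportion
    to the motif-versus-background likelihood ratio at that position."""
    weights = []
    for i in range(len(seq) - W + 1):
        ratio = 1.0
        for j in range(W):
            ratio *= pwm[j][seq[i + j]] / background
        weights.append(ratio)
    return random.choices(range(len(weights)), weights=weights)[0]
```

In the full algorithm you would hold one sequence out, build the matrix from the sites in the rest, sample a new site for the held-out sequence, and cycle until the matrix stops changing. The more biased the matrix gets, the more the sampling favors real motif instances, which is the self-reinforcing behavior described above.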
1332 01:07:49,200 --> 01:07:53,700 But it turns out that this way of sampling 1333 01:07:53,700 --> 01:07:58,350 tends to increase information content. 1334 01:07:58,350 --> 01:08:01,985 It's sort of a self-reinforcing kind of a thing. 1335 01:08:04,366 --> 01:08:05,740 But it's not directly doing that. 1336 01:08:05,740 --> 01:08:10,110 However MEME, more or less, directly does that. 1337 01:08:10,110 --> 01:08:14,450 The problem with that is that, where do you start? 1338 01:08:14,450 --> 01:08:16,500 Imagine an algorithm like this, but where 1339 01:08:16,500 --> 01:08:18,292 you deterministically-- instead of sampling 1340 01:08:18,292 --> 01:08:20,583 from the positions in the sequence, where it might have 1341 01:08:20,583 --> 01:08:22,310 a motif in proportion to probabilities, 1342 01:08:22,310 --> 01:08:24,643 you just chose the one that had the highest probability. 1343 01:08:24,643 --> 01:08:26,800 That's more or less what MEME does. 1344 01:08:26,800 --> 01:08:29,960 And so what are the pros and cons 1345 01:08:29,960 --> 01:08:31,540 of that approach, versus this one? 1346 01:08:39,444 --> 01:08:40,432 Any ideas? 1347 01:08:50,312 --> 01:08:54,620 OK, one of the disadvantages is that the initial choice 1348 01:08:54,620 --> 01:08:58,420 of-- how you're initially seeding your matrix, 1349 01:08:58,420 --> 01:08:59,580 matters a lot. 1350 01:08:59,580 --> 01:09:05,290 That slight bias-- it might be that you had a slight bias, 1351 01:09:05,290 --> 01:09:08,660 and it didn't come out being G was number one. 1352 01:09:08,660 --> 01:09:10,850 It was actually-- T was number one, just because 1353 01:09:10,850 --> 01:09:14,960 of the quirks of the sampling. 1354 01:09:14,960 --> 01:09:18,880 So what would this be, 31 or something? 1355 01:09:18,880 --> 01:09:22,279 Anyway, it's higher than these other guys. 1356 01:09:22,279 --> 01:09:27,550 And so then you're always picking the highest. 
1357 01:09:27,550 --> 01:09:30,167 It'll become a self-fulfilling prophecy. 1358 01:09:30,167 --> 01:09:31,500 So that's the problem with MEME. 1359 01:09:31,500 --> 01:09:33,210 So the way that MEME gets around that, 1360 01:09:33,210 --> 01:09:35,380 is it uses multiple different seeding, 1361 01:09:35,380 --> 01:09:37,130 multiple different starting points, 1362 01:09:37,130 --> 01:09:39,569 and goes to the end with all of them. 1363 01:09:39,569 --> 01:09:42,399 And then it evaluates, how good a model did we get at the end? 1364 01:09:42,399 --> 01:09:44,819 And whichever was the best one, it takes that. 1365 01:09:44,819 --> 01:09:47,580 So it actually takes longer, but you only 1366 01:09:47,580 --> 01:09:50,330 need to run it once because it's deterministic. 1367 01:09:50,330 --> 01:09:53,334 You use a deterministic set of starting points, 1368 01:09:53,334 --> 01:09:54,750 you run a deterministic algorithm, 1369 01:09:54,750 --> 01:09:57,650 and then you evaluate. 1370 01:09:57,650 --> 01:10:01,400 The Gibbs, it can go off on a tangent, 1371 01:10:01,400 --> 01:10:03,590 but because it's sampling so randomly, 1372 01:10:03,590 --> 01:10:05,526 it often will fall off, then, and come back 1373 01:10:05,526 --> 01:10:06,900 to something that's more uniform. 1374 01:10:06,900 --> 01:10:08,640 And when it's a uniform matrix, it's 1375 01:10:08,640 --> 01:10:10,140 really sampling completely randomly, 1376 01:10:10,140 --> 01:10:12,560 exploring the space in an unbiased way. 1377 01:10:12,560 --> 01:10:13,724 Tim? 1378 01:10:13,724 --> 01:10:17,580 AUDIENCE: For genomes that have inherent biases that you know 1379 01:10:17,580 --> 01:10:23,160 going in, do you precalculate-- do you just recalculate 1380 01:10:23,160 --> 01:10:28,730 the weight matrix before, to [? affect those classes? ?] 
1381 01:10:28,730 --> 01:10:33,400 For example, if you had 80% AT content, 1382 01:10:33,400 --> 01:10:36,895 then you're not looking for-- you know, immediately, 1383 01:10:36,895 --> 01:10:41,386 that you're going to hit an A or a T off the first iteration. 1384 01:10:41,386 --> 01:10:43,237 So how do you deal with that? 1385 01:10:43,237 --> 01:10:44,278 PROFESSOR: Good question. 1386 01:10:48,800 --> 01:10:53,100 So these are some features that affect motif finding. 1387 01:10:53,100 --> 01:10:59,485 I think that we've now hit at least a few of these-- number 1388 01:10:59,485 --> 01:11:01,990 of sequences, length of sequences, information content, 1389 01:11:01,990 --> 01:11:07,140 and motif, and basically whether the background is biased 1390 01:11:07,140 --> 01:11:08,060 or not. 1391 01:11:08,060 --> 01:11:13,130 So, in general, higher information content motifs, 1392 01:11:13,130 --> 01:11:15,660 or lower information content, are easier to find-- 1393 01:11:15,660 --> 01:11:18,792 who thinks higher? 1394 01:11:18,792 --> 01:11:19,500 Who thinks lower? 1395 01:11:22,590 --> 01:11:26,055 Someone, can you explain? 1396 01:11:26,055 --> 01:11:27,013 AUDIENCE: I don't know. 1397 01:11:27,013 --> 01:11:27,638 I just guessed. 1398 01:11:27,638 --> 01:11:30,210 PROFESSOR: Just a guess? 1399 01:11:30,210 --> 01:11:31,636 OK, in back, can you explain? 1400 01:11:31,636 --> 01:11:32,135 Lower? 1401 01:11:32,135 --> 01:11:33,510 AUDIENCE: Low information content 1402 01:11:33,510 --> 01:11:35,991 is basically very uniform. 1403 01:11:35,991 --> 01:11:38,324 PROFESSOR: Low information means nearly uniform-- right, 1404 01:11:38,324 --> 01:11:41,186 those are very hard to find. 1405 01:11:41,186 --> 01:11:43,660 That's like that GGC one. 1406 01:11:43,660 --> 01:11:45,200 The high information content motif, 1407 01:11:45,200 --> 01:11:47,366 those are the very strong ones, like that first one. 1408 01:11:47,366 --> 01:11:48,700 Those are much easier to find. 
1409 01:11:48,700 --> 01:11:50,640 Because when you stumble on to them, 1410 01:11:50,640 --> 01:11:54,870 it biases the matrix more, and you rapidly converge to that. 1411 01:11:54,870 --> 01:11:57,150 OK, high information is easy to find. 1412 01:11:57,150 --> 01:12:01,200 So if I have one motif per sequence, 1413 01:12:01,200 --> 01:12:03,226 what about the length of the sequence? 1414 01:12:03,226 --> 01:12:05,190 Is longer or shorter better? 1415 01:12:07,930 --> 01:12:09,706 Is long better? 1416 01:12:09,706 --> 01:12:11,800 Who thinks shorter is better? 1417 01:12:11,800 --> 01:12:14,291 Shorter-- can you explain why short? 1418 01:12:14,291 --> 01:12:16,582 AUDIENCE: Shouldn't it be the smaller the search space, 1419 01:12:16,582 --> 01:12:18,842 the fewer the problems? 1420 01:12:18,842 --> 01:12:21,207 PROFESSOR: Exactly, the shorter the search space, 1421 01:12:21,207 --> 01:12:23,290 and your motif, there's less place for it to hide. 1422 01:12:23,290 --> 01:12:25,900 You're more likely to sample it. 1423 01:12:25,900 --> 01:12:27,510 Shorter is better. 1424 01:12:27,510 --> 01:12:30,830 If you think about-- if you have a motif like TATA, which 1425 01:12:30,830 --> 01:12:35,770 is typically 30 bases from the TSS, 1426 01:12:35,770 --> 01:12:40,320 if you happen to know that, and you give it plus 1 to minus 50, 1427 01:12:40,320 --> 01:12:41,820 you're giving it a small region, you 1428 01:12:41,820 --> 01:12:43,150 can easily find the TATA box. 1429 01:12:43,150 --> 01:12:47,890 If you give it plus 1 to minus 2,000 or something, 1430 01:12:47,890 --> 01:12:48,980 you may not find it. 1431 01:12:48,980 --> 01:12:52,680 It's diluted, essentially. 1432 01:12:52,680 --> 01:12:54,486 Number of sequences-- the more the better. 1433 01:12:54,486 --> 01:12:56,610 This is a little more subtle, as Simona was saying. 1434 01:12:56,610 --> 01:12:58,410 It affects convergence time, and so forth. 
1435 01:12:58,410 --> 01:13:01,870 But in general, the more the better. 1436 01:13:01,870 --> 01:13:06,520 And if you guessed the wrong length of your matrix, 1437 01:13:06,520 --> 01:13:09,370 that makes it worse than if you guess 1438 01:13:09,370 --> 01:13:11,310 the right length, in either direction. 1439 01:13:11,310 --> 01:13:15,690 For example, it's a six-base motif, you guess three. 1440 01:13:15,690 --> 01:13:18,670 The information content, even if it's a 12-bit motif, 1441 01:13:18,670 --> 01:13:21,256 there are only six bits that you could hope to find, 1442 01:13:21,256 --> 01:13:23,380 because you can only find three of those positions. 1443 01:13:23,380 --> 01:13:27,176 So clearly, effectively it's a smaller information 1444 01:13:27,176 --> 01:13:28,550 content, and much harder to find. 1445 01:13:28,550 --> 01:13:29,960 And vice versa. 1446 01:13:32,720 --> 01:13:36,390 Another thing that occurs in practice 1447 01:13:36,390 --> 01:13:39,350 is what's called shifted motifs. 1448 01:13:39,350 --> 01:13:44,740 Your motif is G A A T T C. Imagine in your first iteration 1449 01:13:44,740 --> 01:13:49,680 you happen to hit several of these sequences, starting here. 1450 01:13:49,680 --> 01:13:52,330 You hit the motif, but off by two 1451 01:13:52,330 --> 01:13:54,150 at several different places. 1452 01:13:54,150 --> 01:13:55,680 That'll bias the first position to be 1453 01:13:55,680 --> 01:13:59,230 A, and the second position to be T, and so forth. 1454 01:13:59,230 --> 01:14:03,090 And then you tend to find other shifted versions of that motif. 1455 01:14:03,090 --> 01:14:06,700 You may well converge to this-- A T T C N N, or something 1456 01:14:06,700 --> 01:14:09,480 like that-- which is not quite right. 1457 01:14:09,480 --> 01:14:12,950 It's close, you're very close, but not quite right. 1458 01:14:12,950 --> 01:14:16,060 And it's not as information rich as the real motif. 
1459 01:14:16,060 --> 01:14:19,480 Because it's got those two N's at the end, instead of G A. 1460 01:14:19,480 --> 01:14:21,290 So one thing that's done in practice 1461 01:14:21,290 --> 01:14:26,020 is a lot of times, every so often, the algorithm will say, 1462 01:14:26,020 --> 01:14:28,310 what would happen if we shifted all of our positions 1463 01:14:28,310 --> 01:14:30,990 over to the left by one or two? 1464 01:14:30,990 --> 01:14:32,410 Or to the right by one or two? 1465 01:14:32,410 --> 01:14:34,730 Would the information content go up? 1466 01:14:34,730 --> 01:14:37,540 If so, let's do that. 1467 01:14:37,540 --> 01:14:41,210 So basically, shifted versions of the motif become 1468 01:14:41,210 --> 01:14:43,630 local, near-optimal solutions. 1469 01:14:43,630 --> 01:14:46,296 So you have to avoid them. 1470 01:14:46,296 --> 01:14:47,670 And biased background composition 1471 01:14:47,670 --> 01:14:49,120 is very difficult to deal with. 1472 01:14:49,120 --> 01:14:54,650 So I will just give you one or two more 1473 01:14:54,650 --> 01:14:57,490 examples of that in a moment, and continue. 1474 01:14:57,490 --> 01:15:02,340 So in practice, I would say the Gibbs Sampler is sometimes 1475 01:15:02,340 --> 01:15:05,830 used, or AlignACE, which is a version of Gibbs Sampler. 1476 01:15:05,830 --> 01:15:08,530 But probably more often, people use 1477 01:15:08,530 --> 01:15:12,390 an algorithm called MEME, which is this EM algorithm, which, 1478 01:15:12,390 --> 01:15:14,550 like I said, is deterministic, so you always 1479 01:15:14,550 --> 01:15:17,600 get the same answer, which makes you feel good. 1480 01:15:17,600 --> 01:15:21,280 May or may not always be right, but you can try it out here 1481 01:15:21,280 --> 01:15:22,100 at this website. 
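The shift fix described above, periodically asking whether sliding all the current sites left or right raises the information content, can be sketched like this. This is a hypothetical sketch under my own naming and pseudocount choices, not the code of any particular motif finder.

```python
import math

def information_content(seqs, positions, W, pseudo=0.5):
    """Total information content (2 - H bits per column) of the
    alignment defined by the current site positions."""
    ic = 0.0
    for j in range(W):
        counts = {b: pseudo for b in "ACGT"}
        for seq, pos in zip(seqs, positions):
            counts[seq[pos + j]] += 1
        total = sum(counts.values())
        # entropy term: sum p*log2(p) is negative, so 2 + sum = 2 - H
        ic += 2.0 + sum((c / total) * math.log2(c / total)
                        for c in counts.values())
    return ic

def best_shift(seqs, positions, W, max_shift=2):
    """Shift move: try sliding every site left/right by up to
    `max_shift` and keep the offset that maximizes information
    content (offset 0 wins ties)."""
    best = (information_content(seqs, positions, W), 0)
    for d in range(-max_shift, max_shift + 1):
        if d == 0:
            continue
        shifted = [p + d for p in positions]
        if all(0 <= p <= len(s) - W for s, p in zip(seqs, shifted)):
            best = max(best, (information_content(seqs, shifted, W), d))
    return [p + best[1] for p in positions]
```

For instance, if G A A T T C is planted in every sequence but the current sites all start two bases too early, shifting by +2 lines up all six columns and the information content jumps, so the move is accepted.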
1482 01:15:22,100 --> 01:15:23,755 And actually, the Fraenkel Lab has 1483 01:15:23,755 --> 01:15:26,350 a very nice website called WebMotifs 1484 01:15:26,350 --> 01:15:30,700 that runs several different motif finders including, 1485 01:15:30,700 --> 01:15:33,080 like I said, MEME and AlignACE, which 1486 01:15:33,080 --> 01:15:35,420 is similar to Gibbs, as well as some others. 1487 01:15:35,420 --> 01:15:40,190 And it integrates the output, so that's often a handy thing 1488 01:15:40,190 --> 01:15:41,951 to use. 1489 01:15:41,951 --> 01:15:43,200 You can read about them there. 1490 01:15:43,200 --> 01:15:48,370 And then I just wanted to say a couple words-- 1491 01:15:48,370 --> 01:15:53,070 this is related to Tim's comment about the biased background. 1492 01:15:53,070 --> 01:15:56,130 How do you actually deal with that? 1493 01:15:56,130 --> 01:16:04,360 And this relates to the notion of the mean bit score of a motif. 1494 01:16:04,360 --> 01:16:10,040 So if I were to give you a motif model, P, and a background 1495 01:16:10,040 --> 01:16:14,780 model, q, then the natural scoring system, if you wanted 1496 01:16:14,780 --> 01:16:16,940 additive scores, instead of multiplicative, 1497 01:16:16,940 --> 01:16:18,310 you would just take the log. 1498 01:16:18,310 --> 01:16:23,050 So log P over q, I would argue, is a natural additive score. 1499 01:16:23,050 --> 01:16:25,600 And that's often what you'll see in a weight matrix-- you'll 1500 01:16:25,600 --> 01:16:30,116 see log probabilities, or logs of ratios of probabilities. 1501 01:16:30,116 --> 01:16:31,490 And so then you just add them up, 1502 01:16:31,490 --> 01:16:33,700 and it makes life a bit simpler. 1503 01:16:33,700 --> 01:16:36,350 And so then, if you were to calculate what's the mean bit 1504 01:16:36,350 --> 01:16:40,570 score-- if I had a bunch of instances of a motif, 1505 01:16:40,570 --> 01:16:46,580 it will be given by this formula that's here in the upper right. 
1506 01:16:46,580 --> 01:16:48,420 So that's your score. 1507 01:16:48,420 --> 01:16:51,710 And this is the mean, where you're averaging 1508 01:16:51,710 --> 01:16:58,150 with respect to the motif model probabilities. 1509 01:16:58,150 --> 01:17:03,130 So it turns out, then, that if qk, your background, 1510 01:17:03,130 --> 01:17:06,360 is uniform, motif of width w-- so its probability 1511 01:17:06,360 --> 01:17:10,420 of any w-mer is 1/4 to the w-- then 1512 01:17:10,420 --> 01:17:13,410 it's true that the mean bit-score is 1513 01:17:13,410 --> 01:17:17,110 2w minus the entropy of the motif, which 1514 01:17:17,110 --> 01:17:20,270 is the same as the information content of the motif, 1515 01:17:20,270 --> 01:17:22,760 using our previous definition. 1516 01:17:22,760 --> 01:17:27,380 So that's just a handy relationship. 1517 01:17:27,380 --> 01:17:32,340 And you can do a little algebra to show that, if you want. 1518 01:17:32,340 --> 01:17:43,150 So basically, summation Pk log Pk over qk-- this log, 1519 01:17:43,150 --> 01:17:44,630 you turn that into a difference-- 1520 01:17:44,630 --> 01:17:56,890 so that's summation Pk log Pk minus summation Pk log qk. 1521 01:17:56,890 --> 01:18:00,810 And then you can do some rearrangement, and sum them up, 1522 01:18:00,810 --> 01:18:02,920 and you'll get this formula. 1523 01:18:02,920 --> 01:18:06,480 I'll leave that as an exercise, and any questions on it, 1524 01:18:06,480 --> 01:18:08,790 we can do it next time. 1525 01:18:08,790 --> 01:18:12,250 So what I wanted to get to is sort of this big question 1526 01:18:12,250 --> 01:18:15,050 that I posed earlier-- what's the use of knowing 1527 01:18:15,050 --> 01:18:17,520 the information content of a motif? 
1528 01:18:17,520 --> 01:18:26,630 And the answer is that one use is that it's true, in general, 1529 01:18:26,630 --> 01:18:29,540 that a motif with n bits of information 1530 01:18:29,540 --> 01:18:34,140 will occur about once every 2 to the n bases of random sequence. 1531 01:18:34,140 --> 01:18:39,190 So we said a six-cutter restriction enzyme, EcoRI, 1532 01:18:39,190 --> 01:18:42,660 has an information content of 12 bits. 1533 01:18:42,660 --> 01:18:45,260 So by this rule, it should occur about once every 1534 01:18:45,260 --> 01:18:47,700 2 to the 12th bases of sequence. 1535 01:18:47,700 --> 01:18:50,060 And if you know your powers of 2, which you should all 1536 01:18:50,060 --> 01:18:54,370 commit to memory, that's about 4,000. 1537 01:18:54,370 --> 01:18:57,330 2 to the 12th is 4 to the sixth, is 4,096. 1538 01:18:57,330 --> 01:19:00,680 So it'll occur about once every 4 [? kb, ?] which 1539 01:19:00,680 --> 01:19:03,490 if you've ever cut E. coli DNA, you know is about right-- 1540 01:19:03,490 --> 01:19:07,610 your fragments come out to be about 4 [? kb. ?] 1541 01:19:07,610 --> 01:19:11,570 So this turns out to be strictly true for any motif 1542 01:19:11,570 --> 01:19:14,920 that you can represent by a regular expression, 1543 01:19:14,920 --> 01:19:16,610 like a precise motif, or something 1544 01:19:16,610 --> 01:19:21,660 where you have a degenerate R or Y or N in it-- still true. 1545 01:19:21,660 --> 01:19:23,770 And if you have a more general motif that's 1546 01:19:23,770 --> 01:19:27,490 described by a weight matrix, then you have to define a threshold, 1547 01:19:27,490 --> 01:19:32,960 and it's roughly true, but not exactly. 1548 01:19:32,960 --> 01:19:35,870 All right, so what do you do when the background 1549 01:19:35,870 --> 01:19:38,320 composition is biased, like Tim was saying? 1550 01:19:38,320 --> 01:19:41,100 What if it's 80% A plus T? 
1551 01:19:41,100 --> 01:19:47,310 So then, it turns out that this mean bit-score is a good way 1552 01:19:47,310 --> 01:19:49,830 to go. 1553 01:19:49,830 --> 01:19:52,460 So like I said, the mean bit-score 1554 01:19:52,460 --> 01:19:55,750 equals the information content in this special case, 1555 01:19:55,750 --> 01:19:58,720 where the background is uniform. 1556 01:19:58,720 --> 01:20:02,790 But if the background is not uniform, 1557 01:20:02,790 --> 01:20:05,950 then you can still calculate this mean bit-score, 1558 01:20:05,950 --> 01:20:08,181 and it'll still be meaningful. 1559 01:20:08,181 --> 01:20:09,680 But now it's called something else-- 1560 01:20:09,680 --> 01:20:12,730 it's called relative entropy. 1561 01:20:12,730 --> 01:20:14,860 Actually it has several names-- relative entropy, 1562 01:20:14,860 --> 01:20:17,270 Kullback-Leibler distance is another, 1563 01:20:17,270 --> 01:20:18,900 and information for discrimination-- 1564 01:20:18,900 --> 01:20:21,330 depending on whether you're reading the EE 1565 01:20:21,330 --> 01:20:24,410 literature, or statistics, or whatever. 1566 01:20:24,410 --> 01:20:26,110 And so it turns out that if you have 1567 01:20:26,110 --> 01:20:29,970 a very biased composition-- so here's one that's 75% A T, 1568 01:20:29,970 --> 01:20:33,020 probability of A and T are 3/8, C and G are 1/8. 1569 01:20:35,990 --> 01:20:40,760 If your motif is just C 100% of the time, 1570 01:20:40,760 --> 01:20:43,920 your information content, by the original formula 1571 01:20:43,920 --> 01:20:48,230 that I gave you, would be 2 bits. 1572 01:20:48,230 --> 01:20:53,560 However, the relative entropy will be 3 bits-- 1573 01:20:53,560 --> 01:20:56,620 if you just plug these numbers into this formula, 1574 01:20:56,620 --> 01:21:00,490 it will turn out to be 3 bits. 1575 01:21:00,490 --> 01:21:03,750 My question is, which one better describes 1576 01:21:03,750 --> 01:21:08,400 the frequency of C in the background sequence? 
1577 01:21:08,400 --> 01:21:10,345 Frequency of this motif-- the motif 1578 01:21:10,345 --> 01:21:14,050 is just a C. You can see that the relative entropy says 1579 01:21:14,050 --> 01:21:16,530 that actually, that's stronger than it appears. 1580 01:21:16,530 --> 01:21:18,600 Because it's a C, and that's a rare nucleotide, 1581 01:21:18,600 --> 01:21:20,225 it's actually stronger than it appears. 1582 01:21:20,225 --> 01:21:22,340 And so once every 2 to the 3rd bases is a better estimate 1583 01:21:22,340 --> 01:21:24,990 of its frequency than once every 2 squared. 1584 01:21:24,990 --> 01:21:25,840 So relative entropy. 1585 01:21:25,840 --> 01:21:27,660 So when you run a motif 1586 01:21:27,660 --> 01:21:29,970 finder on a sequence of biased composition, 1587 01:21:29,970 --> 01:21:31,930 you can say, what's the relative entropy 1588 01:21:31,930 --> 01:21:33,330 of this motif at the end? 1589 01:21:33,330 --> 01:21:36,750 And look at the ones that are strong. 1590 01:21:36,750 --> 01:21:40,580 We'll come back to this a little more next time. 1591 01:21:40,580 --> 01:21:42,700 Next time, we'll talk about hidden Markov models, 1592 01:21:42,700 --> 01:21:45,590 and please take a look at the readings. 1593 01:21:45,590 --> 01:21:48,840 And please, those who are doing projects, 1594 01:21:48,840 --> 01:21:50,760 look for more detailed instructions 1595 01:21:50,760 --> 01:21:54,070 to be posted tonight. 1596 01:21:54,070 --> 01:21:55,620 Thanks.
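The arithmetic in that last example is easy to check. Here is a small sketch (the helper name is my own, and treating EcoRI as six independent fixed bases is a shorthand, not code from the lecture) verifying that a fixed C is worth 2 bits against a uniform background but 3 bits against the 75% A+T background, and that a 12-bit motif lands about once per 4,096 bases of random sequence.

```python
import math

def relative_entropy(p, q):
    """D(p || q) in bits; terms with p[b] == 0 contribute nothing."""
    return sum(pb * math.log2(pb / q[b]) for b, pb in p.items() if pb > 0)

uniform = {b: 0.25 for b in "ACGT"}
at_rich = {"A": 3/8, "C": 1/8, "G": 1/8, "T": 3/8}
always_c = {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0}

# Against a uniform background this is the ordinary information
# content: 2 bits for a fixed base.
print(relative_entropy(always_c, uniform))   # 2.0
# Against the 75% A+T background, the same C is worth 3 bits,
# i.e., one occurrence per 2**3 = 8 bases instead of per 4.
print(relative_entropy(always_c, at_rich))   # 3.0

# EcoRI (G A A T T C): 6 fixed bases x 2 bits = 12 bits, so roughly
# one site per 2**12 = 4096 bases of random sequence.
print(2 ** (6 * relative_entropy(always_c, uniform)))  # 4096.0
```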