The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: So good afternoon, once again. And welcome back to Computational Systems Biology, lecture number seven. Today we're going to put to good use two things that we have learned. We've learned how to align high-throughput reads to genomes. And we've learned how to take a collection of high-throughput reads and assemble a genome. And today we're going to delve into the mysteries of transcriptional regulation.

So what I'd like to discuss with you today is a very important set of techniques that allows us to elucidate exactly how genes are regulated. It's probably the most important technique for allowing us to get at that question. I'm sure many of you are familiar with the idea of transcriptional regulators, which are proteins that bind in a sequence-specific way to the genome and act as molecular switches. We'll return to some aspects of these when we talk about proteomics later in the term. But suffice to say, these proteins have domains that interact in a sequence-specific way with the DNA bases in one of the grooves of DNA. They also typically contain an activation domain or repression domain that interacts with other proteins that can cause the genome to fold up, and that can also help recruit the RNA polymerase holoenzyme to actually turn on gene transcription.

So here we have a figure of a collection of Pit1 molecules interacting with the genome. And of course, there are many flavors of genomic regulators. It's estimated that humans have about 2,000 different proteins that act as these molecular switches, some as activators, some as repressors.
And we're going to be talking today, fundamentally, about how these molecules interact with the genome and control gene expression, through the analysis of where they actually interact with specific genomic loci.

So if we were to draw a picture of how we understand gene regulation in cartoon form: we have a gene here with its transcription start site, and we can imagine RNA molecules being produced off of this genomic template. We know that there are non-coding regions of a gene that permit the binding of these regulators. And my definition of a gene is all of the DNA that's required to make a specific transcript or protein. So that includes not only the coding parts of the gene but the non-coding, regulatory parts as well.

So we can imagine that out here there is a collection of regulators, perhaps just one, that bind to a sequence and in turn activate this gene, producing the RNA transcript, which is then in turn turned into another protein. This protein may undergo some sort of post-translational modification by a signaling pathway or other mechanism that activates it. So some regulators need to be activated before they can bind, and some do not. And this activated regulator can bind to another gene and cause another RNA to be expressed. And it may be, in this second context, that we need two different proteins to bind to activate the gene.

And so during the course of the term, we're going to be talking about many aspects of these regulatory networks, including things like what the regulatory code of a genome is, that is, where these binding sites are and how they're occupied. We'll return to that later on in today's lecture. We'll talk about the dynamics of binding of proteins, including how concentration dependent they are.
And we'll talk about combinatorial control, whether or not, for example, both of these have to be present or just one of them needs to be present for a gene to be transcribed, and how you can use these regulatory sequences to implement very complex computational functions. But suffice to say, the most important thing that we have to identify is the programming that underlies the genome, which includes these regulatory sequences and exactly how they're occupied by regulatory proteins.

Ideally, what we would like to be able to do is a single experiment that elucidated all of the regulatory sites in the genome and which ones were occupied by which proteins. But presently, that's technically not possible. So we'll begin today with a technique that allows us to consider one protein at a time and identify where it occupies the genome.

Now there are other kinds of proteins whose association with the genome we can identify that are not transcriptional regulators, per se. For example, we all know that chromatin is organized on spools called nucleosomes. These nucleosomes are composed of eight histones that have tails on them, and there can be covalent marks put on these tails. We'll return to this later on when we talk about epigenetics. But I did want to mention today that it's possible to identify where histones with specific marks are present in the genome and what genome sequences they are associated with. So we can look at sequence-specific regulators, and we can look at more general epigenetic marks, all using the same technology.

And this slide simply recapitulates what we just talked about over here on the left-hand side of the board. But we want to know, basically, what and where, in terms of these genomic regulators: what they are and where they are in the genome. And today we're going to assume a fairly simple model, which is that regulators that are proximal to a gene most probably regulate it.
Although we know in practice, actually, it appears that roughly one third of the regulators that are proximal to a gene actually skip it and regulate a gene further down the genome. It's not really understood very well how the genome folds in three-dimensional space to allow these distal regulatory interactions to occur. But I'll just point out to you that the simplistic model that proximal binding regulates proximal genes doesn't always hold, especially when we get into mammals and other higher organisms.

And as I mentioned, another aspect of this is that certain proteins may need to be modified to become active, and thus there are signaling pathways. You can imagine signaling pathways responding to environmental stimuli outside of cells. These signaling pathways can interact with transcriptional activators and modify what targets they seek in the genome. So these sorts of regulatory networks will be talked about specifically in a separate lecture later in the term, but they're extraordinarily important. And a foundational aspect of them is putting together the wiring diagram. And the wiring diagram has to do with where the regulators occupy the genome and what genes those regulators regulate.

In order to do that, we're going to utilize a technique today called ChIP-seq, which stands for chromatin immunoprecipitation followed by sequencing. And we can now reliably identify where regulators bind to the genome within roughly 10 base pairs. So the spatial resolution has gotten exceptionally good with high-throughput sequencing, as we'll see, which really makes this a fantastic era to be in, because this really wasn't possible 5 or even 10 years ago. And so we now have the tools to be able to take apart the regulatory occupancy of the genome and discern exactly where these proteins are binding.
Here is the way that this is done. I'll describe the protocol to you in general, and then I'm going to pause for a second and see if anybody has any questions about the specifics. The essential idea is that you have a collection of cells. Typically you need a lot of cells; we're talking 10 million cells. For certain kinds of marks, you can get down below a million or even to 100,000 cells. But to get robust signals, you need a lot of cells at present. And all these cells obviously have chromosomes inside of them with proteins that are occupying them.

The essential idea is that you take a flash photography picture of the cell while it's alive. You add a cross-linking agent, and that cross-linking creates bonds between the proteins and the genome, the DNA, where those proteins are sitting. You then isolate the chromatin material, and you wind up with pieces of DNA with proteins occupying them, all the proteins. So not just some of the proteins, but all the proteins are non-selectively cross-linked to the genome.

You then can take this extract and fragment it. Typically you fragment it by using sonication, which is mechanical energy, which causes the DNA to break at random locations. There are more modern techniques that we'll touch on at the end of today's lecture where you can enzymatically digest these fragments right down to where the protein is. But suffice to say, you get small fragments, which you then can immunopurify with an antibody to a protein of interest.

So one condition of using this technology is that either, A, you have a good antibody to a regulatory protein that you care about, or, B, you have tagged this protein such that it has a FLAG tag, myc tag, or some other epitope tag on it, which allows you to use an antibody or other purification methodology for that specific tag. So either you have a good antibody or you have a tag on the protein. One problem with tags on proteins is that they can render the proteins nonfunctional.
If they're nonfunctional, then, of course, they're not going to bind where they should bind. And one has to be careful about this, because if you introduce a tag and you have a couple of good copies of the protein that are untagged and one copy that is tagged, it's hard to tell whether or not the tagged version is actually doing what you think it does. So one has to be careful. But suffice to say, assuming that you have some way of immunopurifying this protein, you can then use the antibodies to purify just those fragments that have the protein of interest.

After you've purified the protein of interest, you can reverse the cross-linking and end up with a collection of fragments, which you then sequence using a high-throughput sequencing instrument. Now recall that, in the usually applied protocol, the fragmentation is random. Each molecule has two ends, but for each one you probably only sequence one end. Sequencing an end of these molecules gives you a sequence tag that is near where the binding event occurred, but not exactly at it. And we're going to take those tags, and if we have our genome, here represented as this short, horizontal chalk line, we'll take our reads, align them to the genome, and try to discern from those aligned reads where the original proteins were binding.

Now our job is to do the best possible alignment, or discovery, of where the proteins are binding, given this evidence. So we have a collection of evidence exhibited by the read sequences. The other thing that we will do is take the original population of molecules and sequence them as well, sometimes called the whole cell extract sequence, as a control. And we'll see why we need this control a little bit later on. But this is going to be the purified, so-called IP (for immunoprecipitate) fraction, which we'll sequence. And this will be the whole cell extract, which should not be enriched for any particular protein.
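To make the computational starting point concrete, here is a minimal sketch of how the aligned reads from the IP and whole cell extract channels might be reduced to strand-specific 5' tag positions, which is the raw evidence that the analysis below works from. This is an illustration only: the file names are hypothetical, and it assumes the reads have already been aligned and stored in sorted, indexed BAM files readable with pysam.

```python
# Minimal sketch: collect strand-specific 5' read positions from ChIP-seq
# alignments. BAM files are assumed sorted and indexed; names are hypothetical.
from collections import Counter
import pysam

def five_prime_positions(bam_path, chrom):
    """Return Counters of 5' read-start positions on the + and - strands."""
    plus, minus = Counter(), Counter()
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(chrom):
            if read.is_unmapped or read.is_secondary:
                continue
            if read.is_reverse:
                # For a minus-strand read, the 5' end is its rightmost base.
                minus[read.reference_end - 1] += 1
            else:
                plus[read.reference_start] += 1
    return plus, minus

# IP channel (immunoprecipitated fragments) and whole cell extract control.
ip_plus, ip_minus = five_prime_positions("ip_sample.bam", "chr10")
wce_plus, wce_minus = five_prime_positions("wce_control.bam", "chr10")
```

The whole cell extract counts are kept alongside the IP counts so that, later, enrichment can be judged against a non-specific background rather than against zero.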
Now before I go on, I'd be happy to entertain any questions about the details of this protocol, because it's really important that you feel comfortable with it before we talk about the computational analysis of the output. So if anybody has any questions, now would be a great time to ask. Yes.

AUDIENCE: I have more of a scientific question. This assumes that we know a transcription factor. Are there ways, methods, to figure out transcription factors so that you can design antibodies to bind to them?

PROFESSOR: So the question is, this assumes that we know the regulators that we're interested in ahead of time. And is there a de novo way of discovering heretofore unknown regulators that are binding to the genome? The answer to that question is sometimes, as is usually the case. Later in the term, we'll talk about other methodologies for looking at the regulatory occupancy of the genome that don't depend upon immunopurification, in which case we'll get an understanding of what's going on with the genome at the level of knowing which particular sequences are occupied without knowing what's sitting there. From the sequence, sometimes we can infer the family of the protein that is sitting there.

But in general, the holy grail of this, which has not really fully materialized, would be as follows: instead of purifying with an antibody and then sequencing, why not purify with a nucleic acid sequence and then do mass spec, to actually take the proteins off of the DNA, run them through mass spec, and figure out what's there. We and others have attempted this, and at times you get good results. Mass spec is improving greatly, but it's still a process fraught with noise. And there's a paper just published in Nature Methods late last year on something called the CRAPome. Have you heard of this paper before? It is all the junk you get when you run mass spec experiments.
And so when you run a mass spec experiment, you can just take all the stuff in the CRAPome out of it, and it actually helps you quite a bit. That gives you an idea of what the state of the art of mass spec is. It's a little bit noisy. But I think your question is great. I think we need to get there. We need to get to the place where we can take portions of the genome, run them through mass spec, and figure out what is populating them de novo, without having to know ahead of time. Any other questions? OK. Great.

So the figure on the slide up there also describes the ChIP-seq protocol. And I'll also say that some people believed this would never work. They actually thought that when you did the purification, you would just get so much background sequence that when you mapped it to the genome, you could never discern any signal whatsoever. And so there were lively debates about this until somebody actually made it work, and then the argument was over, because it made everything else completely and totally obsolete. So it wasn't good to be on the wrong side of that argument, I'll tell you that. I wasn't, but all right. I was on the right side.

But suffice to say, here's a close-up picture of Mr. Protein, or Ms. Protein, and what happens when there is breakage around that site, followed by sequencing. As you can see, the little black lines connecting the protein and the DNA are supposed to indicate contact sites. And the little yellow arrows are supposed to indicate breakage sites of the DNA, caused in this case by mechanical breakage through sonication. And you get reads from both strands of the DNA. Remember that a sequencing instrument always sequences from five prime to three prime. So you're going to get the reads on the red strand and on the blue strand, shown here. And when we do the mapping, we know which strand they're mapped on.
And the profile is shown in the lower plot, showing the density of mapped reads versus distance from where we believe the protein is sitting. The tag density refers to tags, or sequence tags, or reads, that are aligned, using the methodology we discussed two lectures ago, to the genome we assembled last lecture.

Now, the characteristic shape shown in this picture is not the same for all proteins. It is something that varies from protein to protein. And thus, one of the things that we'll want to do during our discovery of where these proteins are binding is always learn the shape of the read distribution that will come out of a particular binding event.

So just to show you some actual data, so you'll get a feel for what we're talking about, this is actual data from the Oct4 protein, which is an embryonic regulator, a pluripotency factor, binding to the mouse genome around the SOCS2 gene. You can see the two distinct peaks on the upper track, both the plus strand reads and the minus strand reads, shown in blue and red respectively. Each one of the black and white bars at the top (they probably can't be read from the back of the room) is 1,000 bases, to give you some idea of the scale of the genome that we're looking at here. You can see the SOCS2 gene below; the exons are the solid bars. And then you see the whole cell extract channel, which we talked about earlier. The whole cell extract channel is simply giving us a background set of reads that are nonspecific. And so you might have a set of, say, 10 or 20 million reads, something like that, for one of these experiments, which you map to the genome to get a picture that looks like this.

So now our job is to take the read sets that we see here, genome wide, and figure out every place that the Oct4 protein is binding to the genome. Now there are several ways that we could approach this question. One way to approach the question would be to simply ask, where are the peaks?
And so you hear this kind of exploration often described as peak finding. Where can you find the peaks? And where is the middle of each peak? Now the problem with this approach is that it works just fine when a peak represents a single binding event. So imagine that these two fingers here are binding events, and they're fairly far apart on the genome. As they come closer and closer together, what will happen is that, instead of having two peaks, we're going to wind up with one broad peak. And there's a lot of biology present in this kind of binding of the same protein proximal to itself. So we need to be able to resolve these sorts of events that occur underneath a single enrichment, or single peak, into two separate binding events.

And this is shown in the next couple of slides, where we look at what we would expect from a single event, in terms of a read enrichment profile once it's aligned to the genome. And we think about the possibility that there are two events, here shown in indiscernible gray and blue and red. And we note that each one of these will have its own specific profile. And then you can consider them to be added together to get the peak that we observe.

Now one of the reasons this additive property works is that, remember, we're working with a large population of cells, and regulators don't always occupy a site. Thus, what we're looking at in terms of the reads is the sum of all of the evidence from all of the cells. And so even though the proteins are close to one another, we often can find an additive effect between that proximal binding.

So how can we handle this? Well, we're going to do two key algorithmic things. We're going to model the spatial distribution of the reads that come out of a specific event, as I suggested earlier. And we're going to keep that model up to date.
That is, we can learn that model by first running our method using a common distribution of reads, identifying a bunch of events that we think are bindings of a single protein, taking those events and using them to build a better model of what the read distribution looks like, and then running the algorithm again. And with the better distribution, we can do a much better job of resolving multiple events out of single peaks.

The next thing we're going to do is model the genome at a single base pair level. So we're going to consider every single base as potentially being the center point of a protein binding event, and using that model, try to sort through how we could have observed the reads that we are presented with.

First, the spatial distribution that we build is going to look something like this. We consider a 400 base pair window, and we learn the distribution of reads, and we build an actual empirical distribution. We don't fit a parametric distribution to it; rather, we keep an exact distribution, or histogram, of what we observe, averaged over many, many events. So when we're fitting things, we have the best possible estimate. Yes.

AUDIENCE: I was just trying to remember the origin of the [INAUDIBLE] clearly. So within the protocol, are you sequencing with the proteins bound to these fragments?

PROFESSOR: No. See, you can't do that. You have to reverse the cross-linking from the DNA. And then there's a step here, which we omitted for simplicity, where we amplify the DNA.

AUDIENCE: So I was just wondering, if there's no protein, why doesn't the polymerase just read through the whole thing from one side? Why is there a peak? There seems to be a loss of signal right where the protein is bound.

PROFESSOR: Well, that depends upon how long the reads are and how hard you fragment. And in fact, that can occur.
But typically, we're using fairly short reads, like 35 base pair reads, and we're fragmenting the DNA to be, say, perhaps 200 to 300 base pairs long. So we're reading the first 35 base pairs of the DNA fragment, but we're not really reading all the way through. We could read all the way through if we wanted, but there really wouldn't be a point to that. The thing that we're observing here is where the five prime end of the read is, where the leftmost edge of the read is. So even though we might be reading all the way through, we're just using the left edge of it. Does that answer your question?

AUDIENCE: Yeah.

PROFESSOR: OK, great. Any other questions? Yes, at the back.

AUDIENCE: So to clarify, this distribution that's being shown up here, on both the positive and negative strands...

PROFESSOR: Yes.

AUDIENCE: This is the position of where the reads started, not the count of the number of times that particular base was seen in the sequencing result. Is that correct?

PROFESSOR: Let me repeat the question. What we're observing in the distribution is where the read starts, and not the number of times that base shows up in the sequencing data. It is the number of reads whose five prime position starts at that base. OK? So each read only gets one count. Does that help? OK.

The other thing that we're going to do, for simplicity, is assume that the plus and the minus strand distributions are symmetric. So we only learn one distribution, and then we flip it to get the minus strand. And that's shown here, where we can articulate this as an empirical distribution, where the probability of a read given a base position is described in terms of the distance between where we are considering that the binding event may have occurred and where the read is. So it's important for us to look at this in some detail, so that you're comfortable with it.
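As a concrete illustration of learning this empirical distribution, here is a minimal sketch that histograms the strand-oriented offsets of 5' read positions within a 400 base pair window around a set of putative single binding events, mirroring the minus strand so that only one symmetric profile is kept. The event list, the Counter inputs (from the earlier sketch), and the window size are assumptions for illustration, not the published implementation.

```python
# Minimal sketch: learn an empirical read profile p(offset | event) from a set
# of putative single-event locations. `events` is a hypothetical list of event
# coordinates; `plus`/`minus` are Counters of 5' read positions per strand.
from collections import Counter

def learn_read_profile(events, plus, minus, half_window=200):
    hist = Counter()
    for m in events:
        for offset in range(-half_window, half_window + 1):
            # Plus-strand reads are counted at their offset from the event.
            hist[offset] += plus.get(m + offset, 0)
            # Minus-strand reads are mirrored, so one symmetric profile is kept.
            hist[-offset] += minus.get(m + offset, 0)
    total = sum(hist.values()) or 1
    # Normalize to an empirical probability of seeing a read at each offset.
    return {d: hist[d] / total for d in range(-half_window, half_window + 1)}
```

In a first pass this profile can be seeded with a generic shape and then re-estimated from the events the method itself discovers, which is the "keep the model up to date" step described above.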
Here's our genome again. And let's assume that we have a binding event at base m along the genome, and we have a read here, r sub n, at some position. The probability that this read was caused by this binding event can be described as the probability of read n given that we're considering an event at location m.

Now of course, it could be that there are other possible locations that could have caused this read. So let us suppose that we model all of those positions along the genome as a vector pi. Each element of pi describes the probability, or the strength, of a binding event having occurred at a particular location. We can now describe the probability of read sub n given pi as equal to the summation, where m runs from 1 to big M, assuming that there are M bases in this genome, of P of r sub n given m, times pi sub m, like this. So we are mixing together here all of the positions along the genome to try to explain this read. So the probability of the read, given this vector pi, which describes all the possible events that could have created this read, is this formulation, which considers the probability of the read given each position times the probability that an event occurred at that position, subject to the constraint that all of the pi sub m's sum to 1. So we're just assigning probability mass along the genome, from whence all of the reads originally came.

So this is considering a single read, r sub n, and where that might have originated from. Yes.

AUDIENCE: Does the constraint basically state that only one event occurred in this fragment?

PROFESSOR: The question is, does this constraint imply that only one event occurred? No. The constraint is implying that we only have one unit of probability mass to assign along the genome, which will generate all of the reads. And thus, this vector describes the contribution of each base to the reads that we observe. So let us say, simplistically, that there are only two events in the genome, and we had a perfect solution.
Two points in the genome, m1 and m2, would have 0.5 as their values for pi, and all the other values in the genome would be 0. We're going to try to make this as sparse as possible, with as many zeroes as possible. So only at the places in the genome where protein is actually binding will pi sub m be nonzero. Does that make sense? These are great questions. Yes. And if people could say their names first, that would be great. Yes.

AUDIENCE: Sara. Just to clarify, the pi vector is completely empirical?

PROFESSOR: This distribution is completely empirical. Yes, that's right, Sara, completely empirical. I'll also say, just so you know, that there are many ways of doing this kind of discovery, as you might imagine. The way we're going to describe today is the one that was selected as part of the ENCODE 3 pipeline for the government's ENCODE project. So what I'm going to talk about today is the methodology that's being used for the next set of data for ENCODE 3, followed by IDR analysis, which is also part of ENCODE 3. So what you're hearing about today is a pipeline that will be published next year as part of the ENCODE project. This method has been published, but the third phase of the analysis of the Encyclopedia of DNA Elements (which is what ENCODE stands for) is utilizing it. OK, any other questions? Yes.

AUDIENCE: Does the shape of this binding event tell you anything about the topology of the actual protein?

PROFESSOR: It does, actually. And we'll return to that. The shape of this binding can tell you something about the class of protein, which is still an area of active research. But also note that that is a little bit confounded by the fact that when you have homotypic binding, which means you have these closely spaced binding events, you get these broader peaks.
And so there's a lot of research into what the shapes on the genome mean and what biological function or mechanism they might imply. Yes.

AUDIENCE: Can you explain pi one more time?

PROFESSOR: Yeah, explain pi one more time, sure. So pi is describing where there are binding events along the genome. For example, if we just had two binding events, m1 and m2, then pi of m1 would be equal to 0.5, and pi of m2 would be equal to 0.5, and all the other values of pi would be 0. So we're just describing with pi where the events are occurring. That gives us the locations of the events. OK? And you'll see in a moment why we articulated it that way. But that's what pi is doing. Does that answer your question? OK. Any other questions? Yes.

AUDIENCE: In cases when you have two peaks really close together, to the point that there's some sort of [INAUDIBLE] between, how do you constrain your pi?

PROFESSOR: How do you constrain the pi when you have closely spaced peaks?

AUDIENCE: Yeah.

PROFESSOR: Well, you don't constrain it. You actually want...

AUDIENCE: [INAUDIBLE]

PROFESSOR: Well, I'm going to show you some actual examples of this algorithm running. I'm going to give you an animation of the algorithm running, so you can actually watch what it does. And you'll see, first it's going to be something that isn't too pleasant, and then we'll fix it. But the key thing here is sparsity. What we want to do is enforce pi being as sparse as possible while explaining the data. One of the common problems in any approach to machine learning is that, with a suitably complex model, you can model anything, but it's not necessarily interpretable. So what we want to do here is make pi as simple as possible while still being able to explain the data that we see. However, if a single event cannot explain an observed read distribution at a particular point in the genome, we'll need to bring another event in.
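To connect this back to the formula on the board, here is a minimal sketch of the single-read mixture probability. It assumes `profile` is the empirical read distribution learned above and `pi` is a dictionary from candidate genome positions to weights that sum to 1; the names and data layout are illustrative only. Because the learned profile is zero outside its window, only positions within a few hundred bases of the read actually contribute to the sum.

```python
# Minimal sketch: P(r_n | pi) = sum over m of P(r_n | event at m) * pi_m.
# `profile` maps strand-oriented offsets (read 5' position minus event
# position, mirrored for minus-strand reads) to probabilities; `pi` maps
# candidate event positions to weights. Names are assumptions for illustration.
def read_probability(read_pos, strand, pi, profile, half_window=200):
    prob = 0.0
    for offset in range(-half_window, half_window + 1):
        m = read_pos - offset              # candidate event position
        weight = pi.get(m, 0.0)
        if weight == 0.0:
            continue                       # sparse pi: most positions are zero
        d = offset if strand == "+" else -offset
        prob += profile.get(d, 0.0) * weight
    return prob
```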
All right, so that is how to think about a single read. And now we can say that the probability of our entire read set, given pi, is quite simple. It's simply the product over all the reads. Sorry. Like so. So this is the probability of the entire read set. We had this previously, which is the probability of a single read, and we take the product of the probabilities for each individual read to get the likelihood of all the reads.

So now all we need to do to solve this problem is this. We need to say that pi hat is equal to the arg max over pi of P of R given pi, which gives us the maximum likelihood estimate for pi. Now it's easy to write that down: just find the setting for pi that maximizes the likelihood of the observed reads, and poof, you're done, because now you've come up with a pi that describes where the binding events are along the genome at single base pair resolution, assuming that pi is modeling every single base. Does everybody see that? Any questions about that? Yes.

AUDIENCE: [INAUDIBLE]

PROFESSOR: Little n is the number of reads. Big M is the number of bases in the genome. n is the number of reads.

AUDIENCE: The length of pi, right? Is that equal to the number of binding events? Or is it equal to just the number of [INAUDIBLE]?

PROFESSOR: The length of pi?

AUDIENCE: Yeah.

PROFESSOR: It's the number of bases in the genome, and hopefully the number of binding events is much, much, much smaller than that. A typical number of binding events is 5,000 to 30,000, across a genome that has 3 billion bases, something like that. So it's much, much smaller. OK?

So this is what we would like to solve.
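Collecting the spoken formulas in one place, the model as described on the board can be written as follows (the notation follows the lecture's definitions: r_n is a read, R is the full read set of N reads, and pi is the vector of event weights over the M bases):

```latex
\begin{align*}
P(r_n \mid \pi) &= \sum_{m=1}^{M} P(r_n \mid m)\,\pi_m
  && \text{(a single read as a mixture over event positions)}\\
P(R \mid \pi) &= \prod_{n=1}^{N} \sum_{m=1}^{M} P(r_n \mid m)\,\pi_m
  && \text{(likelihood of the whole read set)}\\
\hat{\pi} &= \operatorname*{arg\,max}_{\pi} P(R \mid \pi),
  \quad \text{subject to } \sum_{m=1}^{M} \pi_m = 1,\ \pi_m \ge 0
  && \text{(maximum likelihood estimate)}
\end{align*}
```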
And another way to look at this model, which may be somewhat more confusing, is as follows. We have these events spread along the genome, and we have the reads shown on the lower line, and the events are generating the reads. So this is called a generative model, because the derivation of the likelihood directly follows from the fact that, if we knew where the events were, we could exactly come up with the best solution for the assignment of pi, and thus for the likelihood. So this is what we have on the board over here. But we can't solve this directly. Yes. Question in the back.

AUDIENCE: Hi. My name is Eric. I have a little question. What is the definition of GPS here?

PROFESSOR: Oh, sorry. Yeah, GPS is the name of this algorithm. I told it to an editor and they hated it. But at any rate, it's called the genome positioning system.

[LAUGHTER]

Yeah, so you don't like it either, right, Eric? Oh, well. But yes, it locates proteins on the genome, right? Yeah. Good question.

So this is going to be, I think, our first introduction to the EM algorithm as a way of solving complex problems like this. And here is the insight we're going to use to solve this problem. Imagine I had the function g. So g is going to be this function that tells us which reads came from which events and where those events are in the genome. If you knew g exactly, this would be a really trivial problem to solve. It would tell you, for every single read, which particular binding event caused it.

And if we knew that, it would be great, because then we could say something like this. We could say, well, suppose we knew g. Then the number of reads created by a binding event at location m would simply be (it's not a very good summation sign here) a sum over all the reads, counting up the number of them where read n was caused by an event at location m. If we just summed up this number, we'd have the number of reads caused by an event at location m. Is everybody cool with that? No. Do you see the definition there of g on the overhead projector?
774 00:43:20,000 --> 00:43:25,120 So every time a read n comes from event m, 775 00:43:25,120 --> 00:43:28,400 that function is going to be equal to 1. 776 00:43:28,400 --> 00:43:31,790 So we just count the number of times it's 1. 777 00:43:31,790 --> 00:43:33,510 For a given location m, we find out 778 00:43:33,510 --> 00:43:36,980 how many reads came from that event-- could be 5, 779 00:43:36,980 --> 00:43:40,010 could be 10, could be 100. 780 00:43:40,010 --> 00:43:41,890 We don't know exactly how many were generated 781 00:43:41,890 --> 00:43:43,481 by that particular event, but there's 782 00:43:43,481 --> 00:43:44,480 going to be some number. 783 00:43:44,480 --> 00:43:49,170 If we have 10 million reads, and we have 100,000 events, 784 00:43:49,170 --> 00:43:50,750 we're going to get about 100 reads 785 00:43:50,750 --> 00:43:53,901 per event, something like that. 786 00:43:53,901 --> 00:43:55,775 You can give that kind of order of magnitude. 787 00:43:58,360 --> 00:43:59,014 Everybody OK? 788 00:43:59,014 --> 00:44:00,680 Any questions about the details of this? 789 00:44:00,680 --> 00:44:04,780 Because the next step is going to be causing leaps of faith, 790 00:44:04,780 --> 00:44:07,340 so I want to make sure that everybody's on firm ground 791 00:44:07,340 --> 00:44:09,110 before we take the leap together. 792 00:44:09,110 --> 00:44:09,610 Yes. 793 00:44:09,610 --> 00:44:13,960 AUDIENCE: So you haven't enforced sparsity yet? 794 00:44:13,960 --> 00:44:15,822 PROFESSOR: I'm not enforcing sparsity yet. 795 00:44:15,822 --> 00:44:17,905 You like sparsity. 796 00:44:17,905 --> 00:44:19,780 You're going to keep me to that, right? 797 00:44:19,780 --> 00:44:22,364 So later on, you'll hold me to it, all right? 798 00:44:22,364 --> 00:44:23,030 That's your job. 799 00:44:23,030 --> 00:44:23,610 What's your name? 800 00:44:23,610 --> 00:44:24,460 AUDIENCE: [INAUDIBLE] 801 00:44:24,460 --> 00:44:25,043 PROFESSOR: OK. 802 00:44:25,043 --> 00:44:26,310 That's your job, all right? 803 00:44:26,310 --> 00:44:27,930 Mr. Sparsity. 804 00:44:27,930 --> 00:44:28,700 I like that. 805 00:44:28,700 --> 00:44:29,445 He's been sparse. 806 00:44:29,445 --> 00:44:31,500 I like it. 807 00:44:31,500 --> 00:44:34,290 So if this is the number of reads assigned 808 00:44:34,290 --> 00:44:38,780 to a particular location, then we 809 00:44:38,780 --> 00:44:41,870 know that pi sub m will simply be 810 00:44:41,870 --> 00:44:55,680 n sub m over the summation of all of the reads assigned 811 00:44:55,680 --> 00:44:59,030 to all of the events. 812 00:44:59,030 --> 00:45:01,450 And some of these are going to be zero. 813 00:45:01,450 --> 00:45:03,470 A lot of them will be zero. 814 00:45:03,470 --> 00:45:09,440 So here, I don't know this assignment of reads to events. 815 00:45:09,440 --> 00:45:11,100 It's latent. 816 00:45:11,100 --> 00:45:15,420 It's something I have invented which I do not know. 817 00:45:15,420 --> 00:45:20,696 But if I did know it, I'd be able to compute pi directly, 818 00:45:20,696 --> 00:45:24,750 like so, because I'd be able to figure out 819 00:45:24,750 --> 00:45:28,060 for each location on the genome how many reads were there. 820 00:45:28,060 --> 00:45:30,060 And that's how much responsibility that location 821 00:45:30,060 --> 00:45:31,710 had for generating all of the reads. 822 00:45:35,660 --> 00:45:37,270 Is everybody with me so far? 823 00:45:37,270 --> 00:45:38,240 Any questions at all? 824 00:45:42,330 --> 00:45:43,420 OK.
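To make that counting concrete, here is a minimal sketch, assuming a toy genome and a hypothetical hard assignment g that in practice is never observed; the variable names are illustrative, not code from the lecture's GPS implementation.

```python
import numpy as np

# Hypothetical hard assignment: for each read n, g[n] is the genomic location m
# of the event that generated it. In reality this assignment is latent.
g = np.array([500, 500, 550, 500, 550, 500, 500, 550, 500, 550])

genome_length = 1400                # toy genome, matching the synthetic example later
N = np.zeros(genome_length)         # N[m] = number of reads caused by an event at m
for m in g:
    N[m] += 1                       # count reads whose generating event is at location m

pi = N / N.sum()                    # pi[m] = responsibility of location m for the reads
print(pi[500], pi[550])             # 0.6 and 0.4 here; every other location is 0
```

With a known g this really is trivial, which is exactly why the trick of guessing the assignment, re-estimating pi, and repeating is so appealing.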
825 00:45:43,420 --> 00:45:47,480 So remember, this is latent. 826 00:45:47,480 --> 00:45:49,570 We don't know what it is, but if we did know, 827 00:45:49,570 --> 00:45:52,010 we'd be in great shape. 828 00:45:52,010 --> 00:45:58,180 So our job now is to estimate g. 829 00:45:58,180 --> 00:46:03,750 And if we could estimate g, we can estimate pi. 830 00:46:03,750 --> 00:46:06,016 And once we get a better estimate for pi, 831 00:46:06,016 --> 00:46:07,390 I'd like to suggest to you we can 832 00:46:07,390 --> 00:46:10,800 get a better estimate for g. 833 00:46:10,800 --> 00:46:14,240 And we can go between these two steps. 834 00:46:14,240 --> 00:46:17,390 And so that's what the expectation, maximization 835 00:46:17,390 --> 00:46:20,880 algorithm does in the lower part here, which 836 00:46:20,880 --> 00:46:24,450 is that the left part is estimating gamma, which 837 00:46:24,450 --> 00:46:39,860 is our estimate for g, which is looking at the number of reads 838 00:46:39,860 --> 00:46:44,440 that we think are softly assigned to a location m 839 00:46:44,440 --> 00:46:46,690 over the total number of softly assigned 840 00:46:46,690 --> 00:46:49,710 reads from all locations. 841 00:46:49,710 --> 00:47:00,190 And that gives the fraction of a read at n assigned to event m. 842 00:47:00,190 --> 00:47:07,210 So we are computing an estimate of g. 843 00:47:07,210 --> 00:47:09,690 But to compute this estimate, we have to have pi. 844 00:47:09,690 --> 00:47:12,780 It's telling us where we believe that these reads are 845 00:47:12,780 --> 00:47:13,390 coming from. 846 00:47:16,090 --> 00:47:18,560 And once we have gamma, which is an estimate, 847 00:47:18,560 --> 00:47:24,140 we can compute pi, just in the same way we computed it here, 848 00:47:24,140 --> 00:47:26,790 if we knew g. 849 00:47:26,790 --> 00:47:33,500 So we compute the expectation of this latent function, which 850 00:47:33,500 --> 00:47:37,580 we don't know, and then we maximize 851 00:47:37,580 --> 00:47:41,700 pi to maximize our likelihood. 852 00:47:41,700 --> 00:47:45,940 And you can derive this directly by taking 853 00:47:45,940 --> 00:47:49,420 the log of that likelihood probability up there 854 00:47:49,420 --> 00:47:52,220 and maximizing it. 855 00:47:52,220 --> 00:47:57,662 But suffice to say, you wind up with this formulation 856 00:47:57,662 --> 00:47:58,495 in the EM framework. 857 00:48:01,050 --> 00:48:06,080 So I can go between these two different steps-- 858 00:48:06,080 --> 00:48:09,060 the E step, the M step, the E step, the M step. 859 00:48:09,060 --> 00:48:11,420 And each time as I go through it, 860 00:48:11,420 --> 00:48:15,860 I get a better and better approximation for pi, 861 00:48:15,860 --> 00:48:19,510 until I ultimately get it within some preset level 862 00:48:19,510 --> 00:48:21,840 of convergence. 863 00:48:21,840 --> 00:48:23,290 And then I'm finished. 864 00:48:23,290 --> 00:48:31,200 I actually can report my pi. 865 00:48:31,200 --> 00:48:34,600 But before I leave the EM algorithm, and this formulation 866 00:48:34,600 --> 00:48:36,730 of it, are there any questions at all 867 00:48:36,730 --> 00:48:40,807 about what's going on here? 868 00:48:40,807 --> 00:48:41,890 Yes, question in the back. 869 00:48:41,890 --> 00:48:43,552 AUDIENCE: So the solution to the EM algorithm 870 00:48:43,552 --> 00:48:44,740 doesn't always [INAUDIBLE]? 871 00:48:50,450 --> 00:48:52,960 PROFESSOR: It gives you a solution. 
872 00:48:56,560 --> 00:48:59,750 And it is the optimum solution given the starting conditions 873 00:48:59,750 --> 00:49:01,140 that we give it. 874 00:49:01,140 --> 00:49:02,650 AUDIENCE: But it depends on the starting conditions, right? 875 00:49:02,650 --> 00:49:04,858 PROFESSOR: It does depend on the starting conditions. 876 00:49:08,077 --> 00:49:08,910 Any other questions? 877 00:49:11,540 --> 00:49:12,260 Yes, in the back. 878 00:49:12,260 --> 00:49:13,759 AUDIENCE: Is this dependent upon you 879 00:49:13,759 --> 00:49:16,220 having discrete binding events? 880 00:49:16,220 --> 00:49:20,180 So if you have a protein that has more diffuse localization 881 00:49:20,180 --> 00:49:22,655 across the genome, for example, RNA Pol II, 882 00:49:22,655 --> 00:49:24,234 would this model break down? 883 00:49:24,234 --> 00:49:25,900 PROFESSOR: So the question is, does this 884 00:49:25,900 --> 00:49:27,858 depend upon you having discrete binding events? 885 00:49:27,858 --> 00:49:30,770 What biological assumptions are we making in this model about 886 00:49:30,770 --> 00:49:33,770 whether or not a protein is actually fixed and welded 887 00:49:33,770 --> 00:49:35,570 to the genome in a particular location 888 00:49:35,570 --> 00:49:37,432 or whether or not it's drifting around? 889 00:49:37,432 --> 00:49:39,140 And this might be of particular interest, 890 00:49:39,140 --> 00:49:41,139 for example, if we're looking at histones, which 891 00:49:41,139 --> 00:49:43,620 are known to slide up and down the genome, 892 00:49:43,620 --> 00:49:45,440 and we're immunoprecipitating them. 893 00:49:45,440 --> 00:49:48,490 What will this give us? 894 00:49:48,490 --> 00:49:50,380 We are making the assumption that we're 895 00:49:50,380 --> 00:49:52,880 dealing with proteins that have punctate binding properties, 896 00:49:52,880 --> 00:49:55,521 that actually are point binding proteins. 897 00:49:55,521 --> 00:49:58,020 And thus what we're going to get out of this, as you'll see, 898 00:49:58,020 --> 00:50:00,940 is going to be a single location. 899 00:50:00,940 --> 00:50:03,600 There are other methodologies, which I won't describe today, 900 00:50:03,600 --> 00:50:10,120 for essentially deconvolving the motion of a protein 901 00:50:10,120 --> 00:50:12,840 on the genome and coming up with its middle 902 00:50:12,840 --> 00:50:16,570 or mean position while allowing it to move. 903 00:50:16,570 --> 00:50:19,200 But today's algorithm is designed 904 00:50:19,200 --> 00:50:22,350 for point binding proteins. 905 00:50:22,350 --> 00:50:25,180 Good question, OK? 906 00:50:25,180 --> 00:50:26,730 All right. 907 00:50:26,730 --> 00:50:28,934 So just to be clear, what we're going to do 908 00:50:28,934 --> 00:50:30,600 is we're going to initialize everything, 909 00:50:30,600 --> 00:50:33,200 such that we'll begin by initializing pi 910 00:50:33,200 --> 00:50:36,550 to be 1 over the size of the genome. 911 00:50:36,550 --> 00:50:39,560 And at the very end of this, the strength of a binding event 912 00:50:39,560 --> 00:50:47,670 will be simply the number of reads assigned to that event 913 00:50:47,670 --> 00:50:52,460 by summing up our estimate of g. 
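The whole E-step/M-step loop fits in a few lines. The sketch below is only an illustration under stated assumptions: it substitutes a simple Gaussian shape for the empirically learned read distribution that GPS actually uses, and the read counts, spread, and iteration count are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
L = 1400                                          # toy genome length
# Synthetic 5' read positions from two planted events, at 500 and 550.
reads = np.concatenate([rng.normal(500, 20, 300),
                        rng.normal(550, 20, 200)])

def read_prob(read_pos, event_pos, sd=20.0):
    # Stand-in for the read distribution Pr(read position | event at m).
    return np.exp(-0.5 * ((read_pos - event_pos) / sd) ** 2)

positions = np.arange(L)                          # candidate event locations m
pi = np.full(L, 1.0 / L)                          # initialize pi to 1 over the genome size

for _ in range(50):
    # E-step: gamma[n, m] is the soft assignment of read n to an event at m.
    lik = read_prob(reads[:, None], positions[None, :])
    gamma = pi[None, :] * lik
    gamma /= gamma.sum(axis=1, keepdims=True)
    # M-step: pi[m] is the expected number of reads from m over the total reads.
    pi = gamma.sum(axis=0) / len(reads)

print(positions[np.argsort(pi)[-5:]])             # strongest locations cluster near 500 and 550
```

Run as is, this reproduces the behavior shown next: with no prior, the probability mass smears across many neighboring positions instead of collapsing onto two clean events.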
914 00:50:52,460 --> 00:50:56,340 And the nice thing about this is that because the number 915 00:50:56,340 --> 00:50:59,620 of reads assigned to an event, in some sense, 916 00:50:59,620 --> 00:51:02,165 is scaled relative to the total number of reads 917 00:51:02,165 --> 00:51:06,090 in the experiment, we can algorithmically take our genome 918 00:51:06,090 --> 00:51:08,920 and chop it up into independent pieces 919 00:51:08,920 --> 00:51:10,990 and process them all in parallel, 920 00:51:10,990 --> 00:51:14,760 which is what this methodology does. 921 00:51:14,760 --> 00:51:15,460 All right. 922 00:51:15,460 --> 00:51:16,960 So let's have a look at what happens 923 00:51:16,960 --> 00:51:19,360 when we run this algorithm. 924 00:51:19,360 --> 00:51:20,130 So here we are. 925 00:51:20,130 --> 00:51:21,210 This is synthetic data. 926 00:51:21,210 --> 00:51:24,930 I have events planted at 500 and 550 base pairs. 927 00:51:24,930 --> 00:51:29,620 The x-axis is between 0 and 1,400 base pairs. 928 00:51:29,620 --> 00:51:34,930 And the y-axis is pi from 0 to 1. 929 00:51:34,930 --> 00:51:38,728 So here we go. 930 00:51:38,728 --> 00:51:39,676 It's running. 931 00:51:44,420 --> 00:51:46,030 And this is one base pair resolution. 932 00:51:49,390 --> 00:51:54,320 And it is still running-- and stop. 933 00:51:54,320 --> 00:52:00,070 Well, we get sort of a fuzzy mess there, don't we? 934 00:52:00,070 --> 00:52:02,390 And the reason we're getting a fuzzy mess there 935 00:52:02,390 --> 00:52:06,820 is that we have got a lot of reads we've created, 936 00:52:06,820 --> 00:52:09,810 and there are a lot of different genome positions 937 00:52:09,810 --> 00:52:13,270 that are claiming some responsibility for this, which 938 00:52:13,270 --> 00:52:15,640 my friend, Mr. Sparsity, doesn't like, right? 939 00:52:15,640 --> 00:52:19,360 We need to clean this up somehow. 940 00:52:19,360 --> 00:52:22,260 And let's see what happens when we actually 941 00:52:22,260 --> 00:52:23,570 look at actual data. 942 00:52:23,570 --> 00:52:27,190 So here is that original data I showed you 943 00:52:27,190 --> 00:52:31,910 for the two Oct4 binding events next to the SOCS2 gene. 944 00:52:31,910 --> 00:52:36,000 And I'll run the algorithm on these data. 945 00:52:36,000 --> 00:52:40,530 And you can see it working away here. 946 00:52:40,530 --> 00:52:46,590 And once again, it's giving us a spread of events, 947 00:52:46,590 --> 00:52:48,330 and you can see it even beginning 948 00:52:48,330 --> 00:52:52,340 to fill in along where there's noise in the genome 949 00:52:52,340 --> 00:52:54,150 and stopped here. 950 00:52:54,150 --> 00:52:57,750 And it's assigning the probability mass of pi 951 00:52:57,750 --> 00:53:00,180 all over the place. 952 00:53:00,180 --> 00:53:02,280 It is not sparse at all. 953 00:53:05,680 --> 00:53:08,500 So does anybody have any suggestions 954 00:53:08,500 --> 00:53:10,960 for how to fix this? 955 00:53:10,960 --> 00:53:11,940 We've worked all night. 956 00:53:11,940 --> 00:53:13,530 We've got our algorithm running. 957 00:53:13,530 --> 00:53:15,363 We thought it was going to be totally great. 958 00:53:15,363 --> 00:53:20,640 When we run it, it gives us this unfortunate, smeary kind 959 00:53:20,640 --> 00:53:23,460 of result. 960 00:53:23,460 --> 00:53:25,250 What could we do? 961 00:53:25,250 --> 00:53:28,630 Any ideas at all? 962 00:53:28,630 --> 00:53:29,618 Yes. 963 00:53:29,618 --> 00:53:32,455 AUDIENCE: Add a prior where most of the reads are.
964 00:53:32,455 --> 00:53:33,990 PROFESSOR: Add a prior. 965 00:53:33,990 --> 00:53:37,355 You saw the words "no prior," a little tip-off. 966 00:53:37,355 --> 00:53:39,530 Yeah, absolutely, good point. 967 00:53:39,530 --> 00:53:40,174 Add a prior. 968 00:53:40,174 --> 00:53:41,590 So what we're going to do is we're 969 00:53:41,590 --> 00:53:44,330 going to try and add a prior that's going to prejudice pi 970 00:53:44,330 --> 00:53:48,540 to be 0 to create sparsity. 971 00:53:48,540 --> 00:53:51,690 So we like pi being 0. 972 00:53:51,690 --> 00:53:55,030 So we'll add what's called a negative Dirichlet prior, 973 00:53:55,030 --> 00:53:57,800 which looks like this, which is that the probability of pi sub m 974 00:53:57,800 --> 00:54:02,620 is proportional to 1 over pi sub m raised to the alpha power. 975 00:54:02,620 --> 00:54:06,350 And as pi gets smaller, that gets much, much bigger. 976 00:54:06,350 --> 00:54:11,490 And thus, if we add this prior, which 977 00:54:11,490 --> 00:54:13,820 happens to be a very convenient prior to use, 978 00:54:13,820 --> 00:54:18,720 we can force pi to be 0 in many cases. 979 00:54:18,720 --> 00:54:22,750 And when we take that prior into account and we compute 980 00:54:22,750 --> 00:54:29,200 the maximum a posteriori values for pi, 981 00:54:29,200 --> 00:54:31,880 the update rules are very similar, 982 00:54:31,880 --> 00:54:33,950 except that what's interesting is 983 00:54:33,950 --> 00:54:36,690 that the rule on the left for computing our estimate of g 984 00:54:36,690 --> 00:54:38,660 is identical. 985 00:54:38,660 --> 00:54:42,025 But on the right, what we do is we take the number of reads 986 00:54:42,025 --> 00:54:44,350 that we observe at a particular location 987 00:54:44,350 --> 00:54:46,750 and we subtract alpha from it, which 988 00:54:46,750 --> 00:54:51,180 was that exponent in our prior. 989 00:54:51,180 --> 00:54:54,720 And what that simply means is that if you don't have alpha 990 00:54:54,720 --> 00:54:56,640 reads, you're history. 991 00:54:56,640 --> 00:54:58,890 You're going to get eliminated. 992 00:54:58,890 --> 00:55:01,840 So this is going to do something called component elimination. 993 00:55:01,840 --> 00:55:03,380 If you don't have enough strength, 994 00:55:03,380 --> 00:55:06,040 you're going to get zapped to zero. 995 00:55:06,040 --> 00:55:08,374 Now you don't want to do this too aggressively early on. 996 00:55:08,374 --> 00:55:10,581 You want to let things sort of percolate for a while. 997 00:55:10,581 --> 00:55:13,160 So what this algorithm does is it doesn't eliminate 998 00:55:13,160 --> 00:55:17,950 all of the components that don't have alpha reads at the outset. 999 00:55:17,950 --> 00:55:20,430 It lets it run for a little while. 1000 00:55:20,430 --> 00:55:25,400 But it provides you with a way of ensuring that components 1001 00:55:25,400 --> 00:55:28,120 get eliminated and set to zero when they actually don't have 1002 00:55:28,120 --> 00:55:29,540 enough support in the read set. 1003 00:55:34,022 --> 00:55:35,522 So once again, all we're going to do 1004 00:55:35,522 --> 00:55:40,160 is we're going to add this prior on pi, which we will wind up 1005 00:55:40,160 --> 00:55:43,130 multiplying times that top equation. 1006 00:55:43,130 --> 00:55:45,830 We'll get the joint probability of r and pi in this case, 1007 00:55:45,830 --> 00:55:51,590 and then we're going to maximize it using this EM adaptation.
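Under that prior, only the M-step changes: subtract alpha from each location's expected read count and zero out anything that does not survive. A minimal sketch, with alpha set arbitrarily here (the GPS paper describes how it is really chosen):

```python
import numpy as np

def m_step_sparse(gamma, alpha=3.0):
    # gamma: (number of reads, L) soft assignments from the E-step.
    # MAP update under the prior P(pi_m) proportional to 1 / pi_m^alpha:
    # locations whose expected read count does not exceed alpha are eliminated.
    # (The real algorithm delays elimination for a few iterations; this sketch does not.)
    expected_counts = gamma.sum(axis=0)
    trimmed = np.maximum(expected_counts - alpha, 0.0)
    if trimmed.sum() == 0.0:
        return expected_counts / expected_counts.sum()   # degenerate fallback
    return trimmed / trimmed.sum()
```

Swapping this in for the plain M-step of the earlier sketch drives most entries of pi to exactly zero, which is the component elimination just described; the surviving components absorb the probability mass.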
1008 00:55:51,590 --> 00:55:53,310 And when we do so, let me show you 1009 00:55:53,310 --> 00:55:56,740 what happens to our first example that we had. 1010 00:56:04,240 --> 00:56:05,957 OK, that's much cleaner, right? 1011 00:56:05,957 --> 00:56:07,790 We actually only have two events popping out 1012 00:56:07,790 --> 00:56:11,370 of this at the right locations. 1013 00:56:11,370 --> 00:56:12,770 All right. 1014 00:56:12,770 --> 00:56:14,680 So that's looking good. 1015 00:56:14,680 --> 00:56:19,430 And the probabilities sum to 1, which we like. 1016 00:56:19,430 --> 00:56:24,000 And then we'll run this on the Oct4 data. 1017 00:56:27,300 --> 00:56:29,310 And you can see now, instead of getting 1018 00:56:29,310 --> 00:56:32,180 the mushing around those locations, what's 1019 00:56:32,180 --> 00:56:35,452 going to happen is that most of the probability mass 1020 00:56:35,452 --> 00:56:37,660 is going to be absorbed into just a couple components 1021 00:56:37,660 --> 00:56:38,870 around those binding events. 1022 00:56:43,490 --> 00:56:49,010 Now another way to test this is to ask whether or not, 1023 00:56:49,010 --> 00:56:53,260 if we looked at what we believe are closely spaced 1024 00:56:53,260 --> 00:56:59,650 homotypic events, in this case, for CTCF, with that prior, 1025 00:56:59,650 --> 00:57:02,170 we still can recover those two events being 1026 00:57:02,170 --> 00:57:04,460 next to each other in the genome. 1027 00:57:04,460 --> 00:57:11,930 And so if we run this, you can see it running along here. 1028 00:57:11,930 --> 00:57:16,640 Each one of these iterations, by the way, is an EM step. 1029 00:57:16,640 --> 00:57:18,160 And there you go. 1030 00:57:18,160 --> 00:57:20,790 And you can see that even from the sparse data 1031 00:57:20,790 --> 00:57:23,060 you see above-- that's the actual data being used, 1032 00:57:23,060 --> 00:57:27,950 all those little bars up there, read counts 1033 00:57:27,950 --> 00:57:31,230 for the five prime ends of the reads-- we can recover 1034 00:57:31,230 --> 00:57:36,110 the position of those binding events. 1035 00:57:36,110 --> 00:57:36,830 Question, yes. 1036 00:57:36,830 --> 00:57:37,705 AUDIENCE: [INAUDIBLE] 1037 00:57:40,036 --> 00:57:42,140 PROFESSOR: It depends upon the number of reads, 1038 00:57:42,140 --> 00:57:45,390 but it's somewhere around three or four. 1039 00:57:45,390 --> 00:57:48,110 I mean, if you look at the GPS paper, which I posted, 1040 00:57:48,110 --> 00:57:51,290 it tells you how alpha is set. 1041 00:57:51,290 --> 00:57:51,790 Yes. 1042 00:57:51,790 --> 00:57:53,750 AUDIENCE: Do you use whole cell extract data 1043 00:57:53,750 --> 00:57:56,230 when you are trying to compute? 1044 00:57:56,230 --> 00:57:58,480 PROFESSOR: Question, do we use whole cell extract data? 1045 00:57:58,480 --> 00:57:59,980 I told you all how important it was, 1046 00:57:59,980 --> 00:58:01,640 and I haven't mentioned it again. 1047 00:58:01,640 --> 00:58:03,550 We're about to get to that, because that's very important. 1048 00:58:03,550 --> 00:58:04,940 I'm glad you asked that question. 1049 00:58:04,940 --> 00:58:05,760 Yes. 1050 00:58:05,760 --> 00:58:08,175 AUDIENCE: It looks like, along the genome coordinates, 1051 00:58:08,175 --> 00:58:10,170 each event is only one base pair, 1052 00:58:10,170 --> 00:58:12,880 but binding is usually more than one base pair. 1053 00:58:12,880 --> 00:58:14,980 So is that the center of the binding? 1054 00:58:14,980 --> 00:58:16,050 PROFESSOR: Yes.
1055 00:58:16,050 --> 00:58:23,509 Well, it's the center of the read distribution function. 1056 00:58:23,509 --> 00:58:26,050 Whether or not it's the center of the binding of the protein, 1057 00:58:26,050 --> 00:58:27,100 I really can't tell. 1058 00:58:31,600 --> 00:58:33,960 OK, great. 1059 00:58:33,960 --> 00:58:37,410 And I'll just point out that a power of this method 1060 00:58:37,410 --> 00:58:40,690 is its ability to take apart piles of reads 1061 00:58:40,690 --> 00:58:45,560 like this into these so-called homotypic binding events, where 1062 00:58:45,560 --> 00:58:46,890 you can see where the motif is. 1063 00:58:46,890 --> 00:58:50,690 And you can see this method will take apart that pile of reads 1064 00:58:50,690 --> 00:58:54,510 into two independent events, whereas other popular methods, 1065 00:58:54,510 --> 00:58:57,610 of which there are many for analyzing 1066 00:58:57,610 --> 00:59:00,800 these types of ChIP-seq data, don't take apart 1067 00:59:00,800 --> 00:59:04,045 the event into multiple events, because most of them 1068 00:59:04,045 --> 00:59:06,020 make the assumption that one pile of reads 1069 00:59:06,020 --> 00:59:07,311 equals one binding event. 1070 00:59:10,960 --> 00:59:15,754 Now back to our question about whole cell extract. 1071 00:59:15,754 --> 00:59:17,920 What happens if you get to a part of the genome that 1072 00:59:17,920 --> 00:59:18,630 looks like this? 1073 00:59:18,630 --> 00:59:21,130 We're going to go back to our Oct4 data. 1074 00:59:21,130 --> 00:59:25,370 And we have our Oct4 IP track, which is above, 1075 00:59:25,370 --> 00:59:26,980 and then our whole cell extract track 1076 00:59:26,980 --> 00:59:28,146 is down there on the bottom. 1077 00:59:30,630 --> 00:59:31,530 What would you say? 1078 00:59:31,530 --> 00:59:33,920 Would you say those are really Oct4 binding events? 1079 00:59:33,920 --> 00:59:35,961 Or do you think something else is going on there? 1080 00:59:35,961 --> 00:59:37,710 If so, what might it be? 1081 00:59:42,180 --> 00:59:45,315 Any ideas about what's going on here? 1082 00:59:45,315 --> 00:59:45,815 Yes. 1083 00:59:45,815 --> 00:59:48,470 AUDIENCE: Regions with a lot of repeats in the genome? 1084 00:59:48,470 --> 00:59:50,920 PROFESSOR: Regions with repeats in the genome? 1085 00:59:50,920 --> 00:59:51,510 Yes. 1086 00:59:51,510 --> 00:59:56,830 In fact, you can see repeats annotated in green right there. 1087 00:59:56,830 --> 01:00:00,510 And as you recall from last time when we talked about assembly, 1088 01:00:00,510 --> 01:00:04,000 it's very difficult to actually get 1089 01:00:04,000 --> 01:00:08,200 the number of repeat elements correct. 1090 01:00:08,200 --> 01:00:10,820 And a consequence of that is if you underestimate 1091 01:00:10,820 --> 01:00:12,620 the number of repeat elements in the genome 1092 01:00:12,620 --> 01:00:16,290 and you collapse them, when you do sequencing of the genome 1093 01:00:16,290 --> 01:00:18,910 and you remap it, you're going to get regions of the genome 1094 01:00:18,910 --> 01:00:22,480 where a lot of reads pile up, because they're actually 1095 01:00:22,480 --> 01:00:24,890 coming from many different places in the genome, 1096 01:00:24,890 --> 01:00:28,240 but they only have one place to align. 1097 01:00:28,240 --> 01:00:31,692 And these towers of reads, as they're called, 1098 01:00:31,692 --> 01:00:35,270 can create all sorts of artifacts.
1099 01:00:35,270 --> 01:00:38,060 So in order to judge the significance of events, 1100 01:00:38,060 --> 01:00:40,100 what we'll do is a two-step process. 1101 01:00:40,100 --> 01:00:44,980 First, we will run our discovery pipeline 1102 01:00:44,980 --> 01:00:47,650 and discover where all the events are in the IP channel 1103 01:00:47,650 --> 01:00:50,190 without regard to the whole cell extract channel, 1104 01:00:50,190 --> 01:00:52,120 in this particular case. 1105 01:00:52,120 --> 01:00:53,930 And then we will filter the events 1106 01:00:53,930 --> 01:00:59,440 and say which of the events are real in view of the data 1107 01:00:59,440 --> 01:01:02,190 from the whole cell extract channel. 1108 01:01:02,190 --> 01:01:05,100 And so the way to represent this is 1109 01:01:05,100 --> 01:01:11,060 what is the likelihood we would have seen those IP reads 1110 01:01:11,060 --> 01:01:17,710 at random given the whole cell extract channel? 1111 01:01:17,710 --> 01:01:19,670 So how can we formulate this problem? 1112 01:01:19,670 --> 01:01:21,790 Well imagine that we take all of the reads 1113 01:01:21,790 --> 01:01:25,306 that we see in both channels and we add them together. 1114 01:01:25,306 --> 01:01:26,680 So we get a total number of reads 1115 01:01:26,680 --> 01:01:29,985 at a particular location in the genome. 1116 01:01:29,985 --> 01:01:31,985 If, in fact, they were going to occur at random, 1117 01:01:31,985 --> 01:01:34,109 we'd be doing a coin flip. 1118 01:01:34,109 --> 01:01:35,650 Some reads will go on the IP channel. 1119 01:01:35,650 --> 01:01:38,790 Some reads will go on the whole cell extract channel. 1120 01:01:38,790 --> 01:01:42,660 So now we can ask what's the probability we observed greater 1121 01:01:42,660 --> 01:01:45,080 than a certain number of reads on the IP channel 1122 01:01:45,080 --> 01:01:48,670 by chance, assuming a coin flipping model where 1123 01:01:48,670 --> 01:01:51,520 we've added the reads together. 1124 01:01:51,520 --> 01:01:54,670 Another way to view that is what's the chance that we have 1125 01:01:54,670 --> 01:01:57,500 observed less than a certain number of 1126 01:01:57,500 --> 01:02:00,160 reads in the whole cell extract channel 1127 01:02:00,160 --> 01:02:02,900 assuming a coin flipping model? 1128 01:02:02,900 --> 01:02:08,350 And that will give us a p-value under the null hypothesis 1129 01:02:08,350 --> 01:02:10,700 that, in fact, there is no binding event 1130 01:02:10,700 --> 01:02:12,740 and simply what's going on is that we 1131 01:02:12,740 --> 01:02:17,150 have reads being flipped between the two channels out 1132 01:02:17,150 --> 01:02:18,980 of a pool of the total number of reads. 1133 01:02:21,980 --> 01:02:25,125 And if we take that approach, we can formulate it 1134 01:02:25,125 --> 01:02:28,140 as the binomial in the following way. 1135 01:02:28,140 --> 01:02:34,390 We can compute a p-value for a given binding event using 1136 01:02:34,390 --> 01:02:37,670 the data from both the IP channel 1137 01:02:37,670 --> 01:02:43,140 and from the control channel. 1138 01:02:43,140 --> 01:02:47,740 And the way that we do this is that we look at the number of 1139 01:02:47,740 --> 01:02:49,850 reads assigned to an event, because we already 1140 01:02:49,850 --> 01:02:50,960 know that number. 1141 01:02:50,960 --> 01:02:53,050 We've computed that number. 1142 01:02:53,050 --> 01:02:57,880 And we take a similar window around the control channel.
1143 01:02:57,880 --> 01:02:59,840 We count the number of reads in it 1144 01:02:59,840 --> 01:03:02,900 and ask whether or not the reads in the IP channel 1145 01:03:02,900 --> 01:03:07,370 are sufficiently larger than the reads in the control 1146 01:03:07,370 --> 01:03:12,370 channel to reject the null hypothesis that, in fact, they 1147 01:03:12,370 --> 01:03:13,430 occurred at random. 1148 01:03:16,360 --> 01:03:18,600 So this is the way that we compute the p-values 1149 01:03:18,600 --> 01:03:19,100 for events. 1150 01:03:23,920 --> 01:03:27,670 So once we've computed the p-values for events, 1151 01:03:27,670 --> 01:03:31,790 we still have our multiple hypothesis correction problem, 1152 01:03:31,790 --> 01:03:36,890 which is that if we have 100,000 events, 1153 01:03:36,890 --> 01:03:44,060 we need to set our p-value cutoff at an appropriate value. 1154 01:03:44,060 --> 01:03:47,380 And I think that you may have heard 1155 01:03:47,380 --> 01:03:49,930 about this before in recitation. 1156 01:03:49,930 --> 01:03:55,900 But the way that this is done in this system 1157 01:03:55,900 --> 01:03:59,960 is to use a Benjamini-Hochberg correction. 1158 01:03:59,960 --> 01:04:04,800 And the essential idea is that we take all of the p-values 1159 01:04:04,800 --> 01:04:13,550 and we organize them from smallest to largest. 1160 01:04:13,550 --> 01:04:17,450 And we're going to take a set of the top events, 1161 01:04:17,450 --> 01:04:21,140 say 1 through 4, and call them significant. 1162 01:04:21,140 --> 01:04:25,860 And we call them significant under the constraint 1163 01:04:25,860 --> 01:04:32,800 that p-value sub i is less than or equal to i over n 1164 01:04:32,800 --> 01:04:35,680 times alpha, where n is the total number of events 1165 01:04:35,680 --> 01:04:38,580 and alpha is our desired false discovery rate. 1166 01:04:43,770 --> 01:04:46,520 So this allows us in a principled way 1167 01:04:46,520 --> 01:04:50,010 to select the events that we believe 1168 01:04:50,010 --> 01:04:53,240 are significant up to a false discovery rate, which might be, 1169 01:04:53,240 --> 01:04:55,430 for example, 0.05, a value that's typically 1170 01:04:55,430 --> 01:04:56,860 used for a false discovery rate. 1171 01:05:02,410 --> 01:05:06,650 So we've talked about how to discover events, 1172 01:05:06,650 --> 01:05:10,530 how to judge their significance individually, 1173 01:05:10,530 --> 01:05:15,734 how to take a ranked list of events from a given experiment 1174 01:05:15,734 --> 01:05:17,400 and determine which ones are significant 1175 01:05:17,400 --> 01:05:20,200 up to a desired false discovery rate. 1176 01:05:20,200 --> 01:05:22,890 The false discovery rate means that, let's say, 1177 01:05:22,890 --> 01:05:27,080 we have 1,000 events that come out. 1178 01:05:27,080 --> 01:05:28,990 If this false discovery rate was 0.1, 1179 01:05:28,990 --> 01:05:33,350 we would expect 100 of them to be false positives. 1180 01:05:33,350 --> 01:05:35,880 So it's the fraction of positives 1181 01:05:35,880 --> 01:05:39,551 that we detect that we think are going to be false. 1182 01:05:39,551 --> 01:05:41,342 And we can set that to be whatever we want. 1183 01:05:51,360 --> 01:05:54,395 Now let's talk about the analysis of these sort of data. 1184 01:05:54,395 --> 01:05:55,520 There's a question up here. 1185 01:05:55,520 --> 01:05:56,361 Yes.
1186 01:05:56,361 --> 01:05:59,127 AUDIENCE: So this randomness, does that correct for the fact 1187 01:05:59,127 --> 01:06:04,780 that in your previous slide the coin isn't necessarily fair? 1188 01:06:04,780 --> 01:06:06,880 PROFESSOR: Oh, the null hypothesis 1189 01:06:06,880 --> 01:06:12,780 assumed that the coin was fair, that reads could either occur. 1190 01:06:12,780 --> 01:06:15,970 That in this case, for example, that the reads were 1191 01:06:15,970 --> 01:06:19,090 either occurring in the control channel or in the IP channel 1192 01:06:19,090 --> 01:06:22,640 with equal likelihood, assuming because they were coming 1193 01:06:22,640 --> 01:06:27,097 from the same process, which was not related to the IP itself. 1194 01:06:27,097 --> 01:06:29,180 They're coming from some underlying noise process. 1195 01:06:31,925 --> 01:06:34,207 AUDIENCE: So typically you'd have some fixed number 1196 01:06:34,207 --> 01:06:40,564 of reads, and so in the IP channel, the bulk of your reads 1197 01:06:40,564 --> 01:06:42,683 was in your peaks. 1198 01:06:42,683 --> 01:06:45,130 Then you would have fewer in your-- 1199 01:06:45,130 --> 01:06:47,010 PROFESSOR: So the question is, how do you 1200 01:06:47,010 --> 01:06:48,990 normalize for the number of reads 1201 01:06:48,990 --> 01:06:50,210 and how they're being spent? 1202 01:06:50,210 --> 01:06:52,060 Because in the IP channel, you're 1203 01:06:52,060 --> 01:06:55,410 spending the reads on IP enriched events. 1204 01:06:55,410 --> 01:06:56,790 In the whole cell extract channel, 1205 01:06:56,790 --> 01:06:58,450 you're not spending the reads on those, 1206 01:06:58,450 --> 01:07:01,820 so you have more reads to go around for bad things, 1207 01:07:01,820 --> 01:07:03,650 so to speak. 1208 01:07:03,650 --> 01:07:07,520 So what this algorithm does-- which I'm glad 1209 01:07:07,520 --> 01:07:11,620 you brought it up-- is it takes regions of the genome that 1210 01:07:11,620 --> 01:07:15,960 are event free, and it matches them up against one another 1211 01:07:15,960 --> 01:07:17,500 and fits a model against them to be 1212 01:07:17,500 --> 01:07:22,900 able to scale the control reads so that the number of reads 1213 01:07:22,900 --> 01:07:24,840 in the IP and extract channel are 1214 01:07:24,840 --> 01:07:27,865 the same in the regions that are free of events. 1215 01:07:30,590 --> 01:07:34,260 And so it matches things, so it is 0.5. 1216 01:07:34,260 --> 01:07:34,760 OK? 1217 01:07:34,760 --> 01:07:38,560 AUDIENCE: Even if it wasn't 0.5, wouldn't this rank list 1218 01:07:38,560 --> 01:07:41,080 correct for that? 1219 01:07:41,080 --> 01:07:42,580 PROFESSOR: No, it would not. 1220 01:07:47,460 --> 01:07:53,180 Let us suppose that we made a horrible mistake 1221 01:07:53,180 --> 01:07:56,900 and that we let these events through 1222 01:07:56,900 --> 01:07:59,320 because our p was wrong. 1223 01:07:59,320 --> 01:08:01,260 They would have a very large number of reads, 1224 01:08:01,260 --> 01:08:02,718 and they would be very significant. 1225 01:08:04,990 --> 01:08:06,690 As a consequence, they would float up 1226 01:08:06,690 --> 01:08:09,260 to the top of this list as having the lowest 1227 01:08:09,260 --> 01:08:11,630 p-values, or the most significant, 1228 01:08:11,630 --> 01:08:13,580 and they would pop through this. 1229 01:08:13,580 --> 01:08:17,051 So we would report them as binding events, 1230 01:08:17,051 --> 01:08:18,009 which is not desirable. 1231 01:08:21,859 --> 01:08:24,339 These are all great questions.
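Pulling the significance pieces together, here is a hedged sketch of the whole filtering step: scale the control channel using event-free windows, compute a one-sided binomial p-value per event, and then apply the Benjamini-Hochberg rule across all events. The counts, window scheme, and function names are illustrative assumptions, not the exact procedure in the GPS paper.

```python
import numpy as np
from scipy.stats import binom

def control_scale(ip_window_counts, control_window_counts, has_event):
    # Fit a single scale factor from event-free regions so that the control
    # channel matches the IP channel where there are no binding events
    # (a crude stand-in for the model fitting described above).
    free = ~np.asarray(has_event)
    return np.asarray(ip_window_counts)[free].sum() / np.asarray(control_window_counts)[free].sum()

def event_pvalue(ip_count, control_count, scale=1.0):
    # Null hypothesis: after scaling, each read lands in either channel with
    # probability 0.5, so the observed IP count is a one-sided binomial tail.
    total = int(round(ip_count + control_count * scale))
    return binom.sf(ip_count - 1, total, 0.5)

def benjamini_hochberg(p_values, alpha=0.05):
    # Keep the largest prefix of sorted p-values with p_(i) <= (i / n) * alpha.
    p = np.asarray(p_values)
    n = len(p)
    order = np.argsort(p)
    passing = p[order] <= (np.arange(1, n + 1) / n) * alpha
    keep = np.zeros(n, dtype=bool)
    if passing.any():
        keep[order[:np.max(np.nonzero(passing)[0]) + 1]] = True
    return keep

# Hypothetical per-window counts used to fit the scale factor.
ip_windows = np.array([400, 30, 25, 500, 28])
control_windows = np.array([60, 55, 50, 70, 52])
windows_with_events = np.array([True, False, False, True, False])
scale = control_scale(ip_windows, control_windows, windows_with_events)

# Hypothetical events: (reads assigned to the event, reads in the matched control window).
events = [(85, 20), (40, 35), (120, 15), (22, 25)]
pvals = [event_pvalue(ip, ctrl, scale) for ip, ctrl in events]
print(np.round(pvals, 6))
print(benjamini_hochberg(pvals, alpha=0.05))
```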
1232 01:08:24,339 --> 01:08:25,464 Any other questions at all? 1233 01:08:29,979 --> 01:08:36,359 So we've talked about this. 1234 01:08:36,359 --> 01:08:40,779 The next question you might have would be, 1235 01:08:40,779 --> 01:08:42,649 you've done two replicates of the same chip. 1236 01:08:42,649 --> 01:08:45,180 You always want to do two of every experiment, if not three 1237 01:08:45,180 --> 01:08:45,680 or four. 1238 01:08:48,620 --> 01:08:51,565 And I know it's expensive, but you 1239 01:08:51,565 --> 01:08:53,939 don't know where you are if you only do one of something. 1240 01:08:53,939 --> 01:08:55,930 Trust me. 1241 01:08:55,930 --> 01:08:58,420 So assuming you've done two experiments that 1242 01:08:58,420 --> 01:09:01,120 were identical ChIP experiments, and you 1243 01:09:01,120 --> 01:09:03,729 would like to know whether or not 1244 01:09:03,729 --> 01:09:06,160 the results between those two experiments are concordant. 1245 01:09:09,880 --> 01:09:13,214 How might you go about determining this? 1246 01:09:13,214 --> 01:09:14,380 Does anybody have any ideas? 1247 01:09:24,770 --> 01:09:26,410 Let me suggest something to you. 1248 01:09:29,200 --> 01:09:39,189 Imagine that we create two lists-- one 1249 01:09:39,189 --> 01:09:42,640 for experiment x and one for experiment y. 1250 01:09:42,640 --> 01:09:44,250 And we're going to put in this list, 1251 01:09:44,250 --> 01:09:46,609 in rank order, the strongest events 1252 01:09:46,609 --> 01:09:49,450 and where they occur in the genome. 1253 01:09:49,450 --> 01:09:53,067 And let us suppose that we are able to match events 1254 01:09:53,067 --> 01:09:55,275 across experiments when they're in the same location. 1255 01:09:59,520 --> 01:10:08,920 So what we'll do is we'll say Event one occurs here. 1256 01:10:08,920 --> 01:10:11,010 Event 13 occurs here and here. 1257 01:10:11,010 --> 01:10:13,290 Event 9 occurs here and here. 1258 01:10:13,290 --> 01:10:15,220 Event 10 is matched. 1259 01:10:15,220 --> 01:10:17,270 Event 11 occurs here. 1260 01:10:17,270 --> 01:10:20,450 Event 57 occurs here. 1261 01:10:20,450 --> 01:10:24,290 And these are ranked in order of significance. 1262 01:10:24,290 --> 01:10:27,960 They're not completely concordant, 1263 01:10:27,960 --> 01:10:31,190 and they're ranked according to the significance 1264 01:10:31,190 --> 01:10:35,440 in this particular experiment and this particular experiment. 1265 01:10:35,440 --> 01:10:37,180 And we would like to know whether or not 1266 01:10:37,180 --> 01:10:40,980 we think these two experiments are 1267 01:10:40,980 --> 01:10:42,462 good replicates of one another. 1268 01:10:45,380 --> 01:10:49,870 Now the nice thing about converting this 1269 01:10:49,870 --> 01:10:54,980 into a rank test is that we are no longer 1270 01:10:54,980 --> 01:11:02,310 sensitive to the number of reads or any specific numeric values 1271 01:11:02,310 --> 01:11:04,070 represented by each one of these events. 1272 01:11:04,070 --> 01:11:07,530 We're simply ranking them in terms of their importance 1273 01:11:07,530 --> 01:11:13,950 and asking whether or not they appear to be quite similar. 1274 01:11:13,950 --> 01:11:19,640 So one way to do this is simply to take a rank correlation, 1275 01:11:19,640 --> 01:11:24,390 which is a correlation between the ranks of identical events 1276 01:11:24,390 --> 01:11:26,520 in the two lists. 
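As a quick illustration of that point, with made-up strength values rather than the ones on the slide: a relationship that is monotonic but far from linear gets a Pearson correlation below 1, while the Spearman rank correlation is exactly 1.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical event strengths for the same events in two replicate experiments.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = x ** 3                                   # monotonic in x, but far from linear

r_pearson, _ = pearsonr(x, y)
r_spearman, _ = spearmanr(x, y)
print(round(r_pearson, 2), r_spearman)       # Pearson below 1, Spearman exactly 1.0
```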
1277 01:11:26,520 --> 01:11:29,390 And so, for example, the lower right-hand corner 1278 01:11:29,390 --> 01:11:36,179 of this slide shows you an example of X and Y values 1279 01:11:36,179 --> 01:11:38,220 where, although there's not really a linear relationship 1280 01:11:38,220 --> 01:11:41,560 between them, the Pearson correlation coefficient 1281 01:11:41,560 --> 01:11:43,510 is 0.88. 1282 01:11:43,510 --> 01:11:45,260 If we look at the rank correlation, which 1283 01:11:45,260 --> 01:11:47,551 is defined by the equation on the left, which is really 1284 01:11:47,551 --> 01:11:51,290 simply the correlation between their rank values, 1285 01:11:51,290 --> 01:11:53,650 the Spearman correlation for rank is 1. 1286 01:11:53,650 --> 01:11:56,650 They're perfectly matched. 1287 01:11:56,650 --> 01:12:01,640 So if you want to consider whether or not 1288 01:12:01,640 --> 01:12:05,150 two experiments are very similar, 1289 01:12:05,150 --> 01:12:08,420 I'd suggest you consider rank based tests. 1290 01:12:11,820 --> 01:12:15,920 But we can do even more than this. 1291 01:12:15,920 --> 01:12:18,280 Imagine that we would like to know 1292 01:12:18,280 --> 01:12:23,760 at what point along this list things 1293 01:12:23,760 --> 01:12:27,830 are no longer concordant, that we would like to consider 1294 01:12:27,830 --> 01:12:31,460 all the events that are consistent among these two 1295 01:12:31,460 --> 01:12:32,860 experiments. 1296 01:12:32,860 --> 01:12:37,320 And we assume that there is a portion of the experiment 1297 01:12:37,320 --> 01:12:43,590 results that are concordant and part that is discordant. 1298 01:12:43,590 --> 01:12:47,960 And we want to learn where the boundary is. 1299 01:12:47,960 --> 01:12:50,300 So what we'll do is this. 1300 01:12:50,300 --> 01:12:56,580 We will look at the number of events that are concordant up 1301 01:12:56,580 --> 01:12:59,152 to a particular point. 1302 01:12:59,152 --> 01:13:01,610 Let's see here, make sure we get the parametrization right. 1303 01:13:07,550 --> 01:13:14,650 So if we have lists that are n long, psi of n of t 1304 01:13:14,650 --> 01:13:26,950 is the number of events that are paired 1305 01:13:26,950 --> 01:13:33,760 in the top n times t events. 1306 01:13:33,760 --> 01:13:37,060 So t is simply the fraction. 1307 01:13:37,060 --> 01:13:40,730 If t is 0.25, that would mean that in the top 25% 1308 01:13:40,730 --> 01:13:44,180 of the events, the number of events that are paired. 1309 01:13:47,020 --> 01:13:49,676 Actually, it's the fraction of events. 1310 01:13:49,676 --> 01:13:50,175 Sorry. 1311 01:13:54,900 --> 01:14:03,430 So assuming that we had perfect replicates, what we would see 1312 01:14:03,430 --> 01:14:09,430 would be that psi of n of t would look like this. 1313 01:14:22,931 --> 01:14:23,430 Yes? 1314 01:14:23,430 --> 01:14:25,440 AUDIENCE: What is t? 1315 01:14:25,440 --> 01:14:28,424 PROFESSOR: t is the fraction of the events 1316 01:14:28,424 --> 01:14:29,382 that we're considering. 1317 01:14:29,382 --> 01:14:33,600 So this t might be equal to 0.25. 1318 01:14:33,600 --> 01:14:38,060 If this was 10, then this would be-- oh, sorry. 1319 01:14:38,060 --> 01:14:40,570 I think this t equals 0.5. 1320 01:14:40,570 --> 01:14:44,180 If n was equal to 10, this would be the first five events. 1321 01:14:44,180 --> 01:14:45,337 So t is the fraction. 1322 01:14:45,337 --> 01:14:46,670 n is the total number of events.
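Here is a sketch of that correspondence curve for two hypothetical ranked lists whose top halves agree and whose bottoms are noise. Following the definition above, psi_n(t) is taken here as the fraction of all n events that are paired within the top n times t of both lists, so perfectly matched replicates would trace the diagonal.

```python
import numpy as np

def psi_curve(ranked_x, ranked_y, ts):
    # ranked_x, ranked_y: event identifiers listed from strongest to weakest.
    # For each fraction t, count the events appearing in the top n*t of both
    # lists and divide by n, the total number of events.
    n = len(ranked_x)
    values = []
    for t in ts:
        k = max(1, int(round(n * t)))
        shared = set(ranked_x[:k]) & set(ranked_y[:k])
        values.append(len(shared) / n)
    return np.array(values)

# Hypothetical replicates: concordant at the top, discordant after rank 6.
x = [1, 13, 9, 10, 11, 57, 3, 8, 22, 4]
y = [13, 1, 10, 9, 57, 11, 40, 99, 61, 7]
ts = np.linspace(0.1, 1.0, 10)
print(psi_curve(x, y, ts))   # rises with slope near 1, then flattens in the noise
```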
1323 01:14:51,640 --> 01:15:02,130 So if this is t, which is the fraction which 1324 01:15:02,130 --> 01:15:06,070 goes from 0 to 1, and this is psi of n of t, 1325 01:15:06,070 --> 01:15:08,345 this would be a perfectly matched set of replicates. 1326 01:15:11,630 --> 01:15:15,765 At 0.5, 0.5 of the events are matched, which is perfect. 1327 01:15:15,765 --> 01:15:17,015 Can't do any better than that. 1328 01:15:19,930 --> 01:15:26,070 So this is simply telling us, as we 1329 01:15:26,070 --> 01:15:30,570 take larger and larger fractions of our events, what 1330 01:15:30,570 --> 01:15:33,800 fraction of them are matched up to that point? 1331 01:15:33,800 --> 01:15:34,590 Question. 1332 01:15:34,590 --> 01:15:39,958 AUDIENCE: So this assumes that the rank order 1333 01:15:39,958 --> 01:15:42,418 is important for the correlation, right? 1334 01:15:42,418 --> 01:15:44,189 The higher rank they are, we expect 1335 01:15:44,189 --> 01:15:45,862 them to be better correlated. 1336 01:15:45,862 --> 01:15:49,397 For example, if in the middle they 1337 01:15:49,397 --> 01:15:53,388 were the best correlated, but not the top or bottom in terms 1338 01:15:53,388 --> 01:15:56,548 of the ranking, then you might have a weird looking-- 1339 01:15:56,548 --> 01:15:57,496 PROFESSOR: Yeah. 1340 01:15:57,496 --> 01:15:59,200 The question is, doesn't this assume 1341 01:15:59,200 --> 01:16:01,562 that we care most about the events 1342 01:16:01,562 --> 01:16:03,520 up here, as opposed to the events, for example, 1343 01:16:03,520 --> 01:16:04,750 in the middle. 1344 01:16:04,750 --> 01:16:09,550 And this analysis is done to consider 1345 01:16:09,550 --> 01:16:12,920 how many of the top events we take 1346 01:16:12,920 --> 01:16:16,510 that we think are consistent across replicates. 1347 01:16:16,510 --> 01:16:19,170 So it starts at the top as a consequence of that assumption. 1348 01:16:25,920 --> 01:16:28,440 This is the definition of psi n, which 1349 01:16:28,440 --> 01:16:30,210 is the fraction of the top events that 1350 01:16:30,210 --> 01:16:32,640 are paired in the top events. 1351 01:16:32,640 --> 01:16:34,590 It's roughly linear from the point 1352 01:16:34,590 --> 01:16:37,220 where events are no longer reproducible. 1353 01:16:37,220 --> 01:16:43,320 And psi prime of n is the derivative of that function. 1354 01:16:43,320 --> 01:16:46,150 And graphically, if we look at these, 1355 01:16:46,150 --> 01:16:48,430 we can see that there's perfect correspondence 1356 01:16:48,430 --> 01:16:51,700 on the left, up to some point. 1357 01:16:51,700 --> 01:16:56,780 And then the hypothesis of this irreproducible discovery rate 1358 01:16:56,780 --> 01:17:01,390 is at that point things stop being in correspondence 1359 01:17:01,390 --> 01:17:05,980 because you've hit the noise part of the event population, 1360 01:17:05,980 --> 01:17:07,900 and you get to junk. 1361 01:17:07,900 --> 01:17:11,710 And as soon as you get to junk, things fall apart. 1362 01:17:11,710 --> 01:17:13,760 And what this methodology does is 1363 01:17:13,760 --> 01:17:16,090 it attempts to fit the distribution to both 1364 01:17:16,090 --> 01:17:18,450 the correspondence and the non-correspondence parts 1365 01:17:18,450 --> 01:17:23,460 of the population, and you get to set a parameter, which 1366 01:17:23,460 --> 01:17:26,510 is the probability that you're willing to accept something 1367 01:17:26,510 --> 01:17:28,174 that's part of the junk population 1368 01:17:28,174 --> 01:17:29,715 of the non-correspondence population.
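One crude way to locate the point where the curve hits the junk, continuing the sketch above, is to look at the smoothed slope and flag where it first drops well below 1. This is only a finger-in-the-wind illustration; the actual IDR method fits a two-component mixture (reproducible plus irreproducible) to the paired ranks and thresholds on the probability of the irreproducible component.

```python
import numpy as np

def breakpoint_estimate(psi_values, ts, window=2, slope_cutoff=0.5):
    # Smoothed discrete slope of the correspondence curve: it stays near 1
    # while the replicates agree and drops toward 0 in the irreproducible tail.
    psi_values, ts = np.asarray(psi_values), np.asarray(ts)
    slopes = (psi_values[window:] - psi_values[:-window]) / (ts[window:] - ts[:-window])
    below = np.nonzero(slopes < slope_cutoff)[0]
    return ts[below[0]] if below.size else 1.0   # fraction where correspondence breaks down

psi_values = [0.0, 0.2, 0.2, 0.4, 0.4, 0.6, 0.6, 0.6, 0.6, 0.6]  # curve from the sketch above
ts = np.linspace(0.1, 1.0, 10)
print(breakpoint_estimate(psi_values, ts))   # about 0.6: beyond that, junk
```

In practice you would let the IDR machinery do this fit properly; the point here is only the shape of the argument.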
1369 01:17:32,500 --> 01:17:35,990 And the way this is used, for example, in the ENCODE project 1370 01:17:35,990 --> 01:17:40,550 is that all ChIP-seq experiments are done in parallel. 1371 01:17:40,550 --> 01:17:43,830 Event discovery is done independently on them. 1372 01:17:43,830 --> 01:17:46,600 All of the events from each independent experiment 1373 01:17:46,600 --> 01:17:48,970 are ranked from 1 to n. 1374 01:17:48,970 --> 01:17:51,580 And then IDR is done on the results 1375 01:17:51,580 --> 01:17:54,430 to figure out which of the events 1376 01:17:54,430 --> 01:17:57,277 are consistent across replicates. 1377 01:17:57,277 --> 01:17:59,360 And obviously, if you've had a very bad replicate, 1378 01:17:59,360 --> 01:18:02,856 you get a very small number of events, if any at all. 1379 01:18:02,856 --> 01:18:04,600 But if you have good replicates, then you 1380 01:18:04,600 --> 01:18:07,850 get a large number of events. 1381 01:18:07,850 --> 01:18:13,380 A secondary question is, imagine you had processing algorithm A 1382 01:18:13,380 --> 01:18:16,990 and processing algorithm B. It might be that processing 1383 01:18:16,990 --> 01:18:21,550 algorithm A typically gave you more reproducible 1384 01:18:21,550 --> 01:18:26,580 events than processing algorithm B, that whenever you ran A, 1385 01:18:26,580 --> 01:18:28,450 you could get much further down this curve 1386 01:18:28,450 --> 01:18:31,250 before things fell apart. 1387 01:18:31,250 --> 01:18:35,435 Does that mean that A is necessarily better, or not? 1388 01:18:35,435 --> 01:18:36,185 What do you think? 1389 01:18:39,634 --> 01:18:41,925 Would you always pick algorithm A if that was the case? 1390 01:18:44,460 --> 01:18:45,852 Any insight on this? 1391 01:18:45,852 --> 01:18:47,740 Yes? 1392 01:18:47,740 --> 01:18:50,100 AUDIENCE: No, probably, because there's 1393 01:18:50,100 --> 01:18:51,794 experimental considerations to it. 1394 01:18:51,794 --> 01:18:52,460 PROFESSOR: Yeah. 1395 01:18:52,460 --> 01:18:55,206 I would go with [INAUDIBLE] experimental considerations. 1396 01:18:55,206 --> 01:18:56,580 But the reason you might not pick 1397 01:18:56,580 --> 01:19:00,420 it is it might be that if algorithm A always gave you 1398 01:19:00,420 --> 01:19:03,930 the same wrong answers in the same order, 1399 01:19:03,930 --> 01:19:06,590 it would score perfectly on this, right? 1400 01:19:06,590 --> 01:19:08,380 It doesn't mean that it's right. 1401 01:19:08,380 --> 01:19:12,090 It just means it's consistent. 1402 01:19:12,090 --> 01:19:14,650 So for example, it could be calling all the towers 1403 01:19:14,650 --> 01:19:18,530 as events, and it could always give you, 1404 01:19:18,530 --> 01:19:22,090 for every experiment, the same results. 1405 01:19:22,090 --> 01:19:23,630 In fact, one interesting thing to do 1406 01:19:23,630 --> 01:19:26,940 is to run this kind of analysis against your experiment, 1407 01:19:26,940 --> 01:19:29,390 and an outlier shouldn't match your experiment. 1408 01:19:29,390 --> 01:19:31,492 It should fail. 1409 01:19:31,492 --> 01:19:33,200 It's always good to run negative controls 1410 01:19:33,200 --> 01:19:35,730 when you're doing data analysis. 1411 01:19:35,730 --> 01:19:39,500 So this IDR analysis simply allows 1412 01:19:39,500 --> 01:19:42,837 you to pick alpha, which is the probability of pairs 1413 01:19:42,837 --> 01:19:45,330 coming from the irreproducible part 1414 01:19:45,330 --> 01:19:48,435 of the mixture that you're willing to accept.
1415 01:19:48,435 --> 01:19:50,060 And the way this is used, as I said, 1416 01:19:50,060 --> 01:19:53,820 is that here you see different ways of analyzing 1417 01:19:53,820 --> 01:19:54,910 ChIP-seq data. 1418 01:19:54,910 --> 01:19:57,400 And the little vertical tick mark 1419 01:19:57,400 --> 01:19:58,920 indicates how many peaks they get 1420 01:19:58,920 --> 01:20:01,360 before they can't go any further because they 1421 01:20:01,360 --> 01:20:05,040 hit the IDR bound alpha. 1422 01:20:05,040 --> 01:20:07,340 And some methodologies can produce far more events 1423 01:20:07,340 --> 01:20:08,930 consistently than other methods can. 1424 01:20:12,320 --> 01:20:14,680 So that's just something for you to keep in mind. 1425 01:20:14,680 --> 01:20:18,840 And I will not have time to present all the rest of this. 1426 01:20:18,840 --> 01:20:21,180 But I did want to point out one thing to you, which 1427 01:20:21,180 --> 01:20:25,940 is that the methodology we talked about today 1428 01:20:25,940 --> 01:20:30,610 is able to resolve where things are 1429 01:20:30,610 --> 01:20:33,350 in the genome at exceptional spatial resolution, 1430 01:20:33,350 --> 01:20:36,450 within 20 base pairs genome wide. 1431 01:20:36,450 --> 01:20:39,780 And it's always run genome wide at single base pair resolution. 1432 01:20:39,780 --> 01:20:43,500 And it can do things like compute 1433 01:20:43,500 --> 01:20:47,030 for pairs of transcription factors 1434 01:20:47,030 --> 01:20:49,120 that have been profiled using ChIP-seq, 1435 01:20:49,120 --> 01:20:52,440 the spacing between them for spacings that are significant. 1436 01:20:52,440 --> 01:20:54,240 And so you can see that we're beginning 1437 01:20:54,240 --> 01:20:59,280 to understand the regulatory grammar of the genome in terms 1438 01:20:59,280 --> 01:21:01,640 of the way factors organize together 1439 01:21:01,640 --> 01:21:04,810 and interact to implement the combinatorial switches 1440 01:21:04,810 --> 01:21:07,520 that we've talked about. 1441 01:21:07,520 --> 01:21:08,450 That's it for today. 1442 01:21:08,450 --> 01:21:10,090 You guys have been totally great. 1443 01:21:10,090 --> 01:21:13,060 We'll see you on Tuesday for RNA-seq. 1444 01:21:13,060 --> 01:21:15,830 And have a great weekend until then.