1
00:00:00,060 --> 00:00:01,780
The following
content is provided

2
00:00:01,780 --> 00:00:04,019
under a Creative
Commons license.

3
00:00:04,019 --> 00:00:06,870
Your support will help MIT
OpenCourseWare continue

4
00:00:06,870 --> 00:00:10,730
to offer high quality
educational resources for free.

5
00:00:10,730 --> 00:00:13,340
To make a donation or
view additional materials

6
00:00:13,340 --> 00:00:17,215
from hundreds of MIT courses,
visit MIT OpenCourseWare

7
00:00:17,215 --> 00:00:17,840
at ocw.mit.edu.

8
00:00:26,840 --> 00:00:31,390
PROFESSOR: All right, well,
good afternoon and welcome back.

9
00:00:31,390 --> 00:00:35,400
We have an exciting fun-filled
program for you this afternoon.

10
00:00:35,400 --> 00:00:36,176
I'm David Gifford.

11
00:00:36,176 --> 00:00:37,550
I'm delighted to
be back with you

12
00:00:37,550 --> 00:00:40,970
again, here in computational
systems biology.

13
00:00:40,970 --> 00:00:43,140
Today we're going to talk
about chromatin structure

14
00:00:43,140 --> 00:00:45,540
and how we can analyze it.

15
00:00:45,540 --> 00:00:51,290
And to give you the narrative
arc for our discussion today,

16
00:00:51,290 --> 00:00:54,130
we're first going to
begin with looking

17
00:00:54,130 --> 00:00:56,750
at computational methods that
can break the, quote unquote

18
00:00:56,750 --> 00:01:01,120
code, that describes
the epigenome.

19
00:01:01,120 --> 00:01:03,900
Now, epigenetic state is
extraordinarily important

20
00:01:03,900 --> 00:01:05,800
and one way you
can visualize this

21
00:01:05,800 --> 00:01:08,400
is that the genome is
like a hotel filled

22
00:01:08,400 --> 00:01:09,780
with lots of different rooms.

23
00:01:09,780 --> 00:01:13,360
And a lot of the doors are
locked and some of the doors

24
00:01:13,360 --> 00:01:13,990
are unlocked.

25
00:01:13,990 --> 00:01:15,890
And only in the doors
that we can go into,

26
00:01:15,890 --> 00:01:18,320
where the genome is
open and accessible

27
00:01:18,320 --> 00:01:22,270
can there actually be work
done, regulation performed

28
00:01:22,270 --> 00:01:26,254
and transcripts
and proteins made.

29
00:01:26,254 --> 00:01:28,420
So we're going to talk about
how to actually analyze

30
00:01:28,420 --> 00:01:30,777
epigenetic state.

31
00:01:30,777 --> 00:01:32,360
And then we're going
to talk about how

32
00:01:32,360 --> 00:01:35,150
to use epigenetic
information to understand

33
00:01:35,150 --> 00:01:39,260
the entire regulatory
occupancy of the genome.

34
00:01:39,260 --> 00:01:42,230
We've already talked about
ChIP-seq and the idea

35
00:01:42,230 --> 00:01:44,620
that we can understand where
individual regulators sit

36
00:01:44,620 --> 00:01:49,950
on the genome, and how they
regulate proximal genes.

37
00:01:49,950 --> 00:01:53,590
We're now going to see if we
can learn more about the genome.

38
00:01:53,590 --> 00:01:56,610
How its state-- whether
it's open or closed--

39
00:01:56,610 --> 00:01:58,380
is itself regulated.

40
00:01:58,380 --> 00:02:00,440
And answer a puzzle.

41
00:02:00,440 --> 00:02:04,430
The puzzle is, if there
are hundreds of thousands

42
00:02:04,430 --> 00:02:06,390
of possible binding
locations that

43
00:02:06,390 --> 00:02:09,770
are equally good
for a regulator,

44
00:02:09,770 --> 00:02:12,680
why are only tens of
thousands occupied?

45
00:02:12,680 --> 00:02:15,420
And how are those sites picked?

46
00:02:15,420 --> 00:02:18,630
Because that level of regulation
is extraordinarily important

47
00:02:18,630 --> 00:02:21,860
to establish a basal
level of what genes

48
00:02:21,860 --> 00:02:24,140
are accessible and operating.

49
00:02:24,140 --> 00:02:29,540
And finally, we're going to
talk about how we can map,

50
00:02:29,540 --> 00:02:32,270
which regulatory
regions in the genome

51
00:02:32,270 --> 00:02:36,160
are affecting which genes.

52
00:02:36,160 --> 00:02:40,360
It turns out that about
1/3 of the regulatory sites

53
00:02:40,360 --> 00:02:44,440
in the genome skip over a
gene that's closest to them

54
00:02:44,440 --> 00:02:48,240
to regulate a gene
that's farther away.

55
00:02:48,240 --> 00:02:49,510
This is in mammalian genomes.

56
00:02:49,510 --> 00:02:52,530
And so given that
rough approximation,

57
00:02:52,530 --> 00:02:54,380
how is it that we
can make connections

58
00:02:54,380 --> 00:03:00,720
between regulatory sites and
the genes that they control?

59
00:03:00,720 --> 00:03:03,190
Now, in computational
systems biology,

60
00:03:03,190 --> 00:03:05,340
we always talk a
lot about biology,

61
00:03:05,340 --> 00:03:08,880
but we also need to reflect
upon the computational methods

62
00:03:08,880 --> 00:03:11,090
that we're bringing to
bear on these questions.

63
00:03:11,090 --> 00:03:13,300
And so, today, we're
going to be talking

64
00:03:13,300 --> 00:03:14,550
about three different methods.

65
00:03:14,550 --> 00:03:17,520
We'll talk about dynamic
Bayesian networks as a way

66
00:03:17,520 --> 00:03:21,140
to approach, understanding
the histone code.

67
00:03:21,140 --> 00:03:24,580
We'll talk about how to
classify factor binding,

68
00:03:24,580 --> 00:03:27,180
using log likelihood ratios.

69
00:03:27,180 --> 00:03:29,280
And finally, we'll
turn to our friend,

70
00:03:29,280 --> 00:03:33,250
the hypergeometric
distribution to analyze

71
00:03:33,250 --> 00:03:35,200
which locations
in the genome are

72
00:03:35,200 --> 00:03:36,430
interacting with one another.

73
00:03:39,010 --> 00:03:43,680
So let's begin with
establishing a vocabulary.

74
00:03:43,680 --> 00:03:46,230
I'm sure some of you
have seen this before.

75
00:03:46,230 --> 00:03:48,070
This is the way
that chromatin can

76
00:03:48,070 --> 00:03:50,630
be thought of being organized
at different levels.

77
00:03:50,630 --> 00:03:53,440
There's the primary
DNA sequence,

78
00:03:53,440 --> 00:03:58,930
which can include
methylated CpGs.

79
00:03:58,930 --> 00:04:01,820
That's cytosine,
phosphate, guanine.

80
00:04:01,820 --> 00:04:09,470
And the nice thing about
that is that it's symmetrical

81
00:04:09,470 --> 00:04:15,830
so that when you have a CpG,
a methyltransferase during DNA

82
00:04:15,830 --> 00:04:18,050
replication can copy
that methyl mark over.

83
00:04:18,050 --> 00:04:20,910
So it's a mark that's heritable.

84
00:04:20,910 --> 00:04:23,880
The next level down
are histone tails.

85
00:04:23,880 --> 00:04:29,310
On the amino terminus
of histones H3 and H4,

86
00:04:29,310 --> 00:04:32,060
different chemical
modifications can be made,

87
00:04:32,060 --> 00:04:33,960
and they serve as
signposts, as we'll

88
00:04:33,960 --> 00:04:35,739
see, to give us
clues about what's

89
00:04:35,739 --> 00:04:37,780
going on in the genome in
that proximal location.

90
00:04:40,710 --> 00:04:43,340
The next level down
is, whether or not

91
00:04:43,340 --> 00:04:46,830
the chromatin is
compacted or not.

92
00:04:46,830 --> 00:04:48,630
Whether it's open or closed.

93
00:04:48,630 --> 00:04:50,360
And that relates
to whether or not

94
00:04:50,360 --> 00:04:54,260
DNA binding proteins are
actually on the genome.

95
00:04:54,260 --> 00:04:57,380
And finally, certain
domains of the genome

96
00:04:57,380 --> 00:05:00,160
can be associated with
the nuclear lamina.

97
00:05:00,160 --> 00:05:03,880
And so there are different levels
of organization of chromatin.

98
00:05:03,880 --> 00:05:08,620
And we'll be exploring
all of these today.

99
00:05:08,620 --> 00:05:13,690
So the cartoon
version of the way

100
00:05:13,690 --> 00:05:20,260
that the genome is
organized is that at the top

101
00:05:20,260 --> 00:05:22,080
we have a transcribed gene.

102
00:05:22,080 --> 00:05:24,480
And you can see that
there's an enhancer that

103
00:05:24,480 --> 00:05:29,310
is interacting with the RNA
polymerase II start site.

104
00:05:29,310 --> 00:05:31,120
And you can see
varied histone marks

105
00:05:31,120 --> 00:05:34,920
that are associated with
this activated gene.

106
00:05:34,920 --> 00:05:36,640
There are also marks
that are associated

107
00:05:36,640 --> 00:05:37,860
with that active enhancer.

108
00:05:40,790 --> 00:05:44,040
Down below, you see
an inactive gene.

109
00:05:44,040 --> 00:05:46,650
And you can see that there's
a boundary element that's

110
00:05:46,650 --> 00:05:50,240
bound by CTCF, one of
whose functions

111
00:05:50,240 --> 00:05:53,820
is to serve as a genomic
insulator, which insulates

112
00:05:53,820 --> 00:05:58,140
the effect of the enhancer
above from the gene below.

113
00:05:58,140 --> 00:06:01,790
So through careful biochemical
analysis over the years,

114
00:06:01,790 --> 00:06:10,620
these different marks have been
analyzed and characterized.

115
00:06:10,620 --> 00:06:15,340
And a general paradigm
for understanding

116
00:06:15,340 --> 00:06:19,410
how the marks transition
as genes are activated

117
00:06:19,410 --> 00:06:21,640
is shown here.

118
00:06:21,640 --> 00:06:24,870
So genes that are
fairly active and cycle

119
00:06:24,870 --> 00:06:27,880
between active and
inactive states typically

120
00:06:27,880 --> 00:06:31,750
have a high CpG content
in their promoters.

121
00:06:31,750 --> 00:06:33,750
And transition is
shown on the left.

122
00:06:33,750 --> 00:06:37,140
Where in the repressed
state on the bottom,

123
00:06:37,140 --> 00:06:41,990
they're marked by
H3K27 trimethyl marks.

124
00:06:41,990 --> 00:06:47,210
When they're poised, they have
both H3K4 trimethyl and H3K27

125
00:06:47,210 --> 00:06:48,490
trimethyl.

126
00:06:48,490 --> 00:06:55,320
And when they're active, they
only have H3K4 trimethyl.

127
00:06:55,320 --> 00:06:59,200
And on the right hand side are
genes that are less active.

128
00:06:59,200 --> 00:07:02,480
So in their completely shut down
state, they may have no marks,

129
00:07:02,480 --> 00:07:04,650
but the DNA is
methylated, silencing

130
00:07:04,650 --> 00:07:06,670
that region of the genome.

131
00:07:06,670 --> 00:07:11,280
And other marks then,
culminating in H3K4 trimethyl

132
00:07:11,280 --> 00:07:15,320
once again when they
become active at the top.

133
00:07:15,320 --> 00:07:20,040
So I'm summarizing
for you here, decades

134
00:07:20,040 --> 00:07:23,590
of research in histone marks.

135
00:07:23,590 --> 00:07:28,520
And it has been
summarized in figures

136
00:07:28,520 --> 00:07:33,627
like this, where you can
look at different classes

137
00:07:33,627 --> 00:07:35,960
of genetic elements-- whether
they be promoters in front

138
00:07:35,960 --> 00:07:40,330
of genes, gene bodies
themselves, enhancers,

139
00:07:40,330 --> 00:07:42,830
or the large scale
repression of the genome--

140
00:07:42,830 --> 00:07:44,840
and you can look
at the associated

141
00:07:44,840 --> 00:07:47,985
marks with those
characteristic elements.

142
00:07:52,180 --> 00:07:56,340
OK, so, how can we
learn this de novo?

143
00:07:56,340 --> 00:07:59,520
That is, you could
memorize, and of course it's

144
00:07:59,520 --> 00:08:01,694
important to
understand, for example,

145
00:08:01,694 --> 00:08:03,360
if you want to look
for active enhancers

146
00:08:03,360 --> 00:08:05,470
in the genome, that
looking for things

147
00:08:05,470 --> 00:08:12,657
like H3K4 monomethyl and H3K27
acetyl marks together,

148
00:08:12,657 --> 00:08:14,740
would give you a good clue
where the enhancers are

149
00:08:14,740 --> 00:08:17,520
in the genome that are active.

150
00:08:17,520 --> 00:08:20,150
But if we want to
learn all this de novo,

151
00:08:20,150 --> 00:08:23,622
without having to memorize it
or rely upon the literature,

152
00:08:23,622 --> 00:08:26,080
the great thing is that there's
a lot of data out there now

153
00:08:26,080 --> 00:08:29,990
that characterizes, or profiles
all these marks, genome-wide,

154
00:08:29,990 --> 00:08:32,080
in a variety of cellular states.

155
00:08:32,080 --> 00:08:34,799
And there's the epigenome
roadmap initiative

156
00:08:34,799 --> 00:08:38,820
to look at this in hundreds
of different cell types.

157
00:08:38,820 --> 00:08:43,770
So, what is the histone code?

158
00:08:43,770 --> 00:08:48,600
That is, how can we
unravel the different marks

159
00:08:48,600 --> 00:08:51,650
present in the genome and
understand what they mean?

160
00:08:51,650 --> 00:08:54,810
Because the genome doesn't come
ready-made with those little

161
00:08:54,810 --> 00:08:57,560
cute labels that we had on
it-- enhancer, gene body,

162
00:08:57,560 --> 00:08:59,030
and so forth.

163
00:08:59,030 --> 00:09:00,900
So somehow, if we
want to understand

164
00:09:00,900 --> 00:09:03,360
the grammar of the
genome and its function,

165
00:09:03,360 --> 00:09:07,650
we're going to need to be
able to annotate it, hopefully

166
00:09:07,650 --> 00:09:11,210
with computational help.

167
00:09:11,210 --> 00:09:13,890
So here's a picture of
what typical data looks

168
00:09:13,890 --> 00:09:15,970
like along the genome.

169
00:09:15,970 --> 00:09:19,500
So, obviously you can't
read any of the legends

170
00:09:19,500 --> 00:09:20,480
on the left-hand side.

171
00:09:20,480 --> 00:09:22,063
If you want to look
at the slides that

172
00:09:22,063 --> 00:09:24,850
are posted on Stellar, you
can see the actual marks.

173
00:09:24,850 --> 00:09:26,780
But the reason I posted
this is because you

174
00:09:26,780 --> 00:09:28,821
can see the little pink
thing at the top-- that's

175
00:09:28,821 --> 00:09:32,360
where the RNA transcript has
been mapped to the genome.

176
00:09:32,360 --> 00:09:35,450
The actual annotated
genes are above.

177
00:09:35,450 --> 00:09:37,740
And then down below you
can see a whole collection

178
00:09:37,740 --> 00:09:41,510
of histone marks and other
kinds of chromatin information

179
00:09:41,510 --> 00:09:43,340
that have been
mapped to the genome

180
00:09:43,340 --> 00:09:46,710
and spatially
create patterns that

181
00:09:46,710 --> 00:09:52,880
are suggestive of the function
of the genomic elements,

182
00:09:52,880 --> 00:09:54,600
if they're properly interpreted.

183
00:09:54,600 --> 00:10:01,460
And below, you see in blue,
the binding of different TFs,

184
00:10:01,460 --> 00:10:04,670
as determined by ChIP-seq.

185
00:10:04,670 --> 00:10:08,600
So, what we would
like to do then,

186
00:10:08,600 --> 00:10:13,130
is to take this
kind of information

187
00:10:13,130 --> 00:10:15,810
and automatically learn,
or automatically annotate

188
00:10:15,810 --> 00:10:20,172
the genome as to its
functional elements.

189
00:10:20,172 --> 00:10:21,880
Let me stop here and
ask, how many people

190
00:10:21,880 --> 00:10:27,630
have seen histone mark
information before?

191
00:10:27,630 --> 00:10:28,820
OK.

192
00:10:28,820 --> 00:10:32,860
And how many people have
used it in their research?

193
00:10:32,860 --> 00:10:35,710
Not too many-- a couple people?

194
00:10:35,710 --> 00:10:37,410
OK.

195
00:10:37,410 --> 00:10:40,240
So it's getting
quite easy to collect

196
00:10:40,240 --> 00:10:45,690
and there are a couple of ways
of analyzing this kind of data,

197
00:10:45,690 --> 00:10:47,760
genome-wide.

198
00:10:47,760 --> 00:10:51,640
One way is that we could
run a hidden Markov

199
00:10:51,640 --> 00:10:55,670
model over these data
and predict states

200
00:10:55,670 --> 00:10:56,730
at regular intervals.

201
00:10:56,730 --> 00:10:59,460
For example, every 200
bases down the genome,

202
00:10:59,460 --> 00:11:02,920
and see how the HMM transitions
from state to state and let

203
00:11:02,920 --> 00:11:08,330
the state suggest what the
underlying genome elements

204
00:11:08,330 --> 00:11:10,920
are doing.

205
00:11:10,920 --> 00:11:16,220
Another way is to use a
dynamic Bayesian network.

206
00:11:16,220 --> 00:11:19,790
So a dynamic Bayesian network
is simply a Bayesian network.

207
00:11:19,790 --> 00:11:22,760
We've talked about those before.

208
00:11:22,760 --> 00:11:25,810
And it models data
sampled along the genome.

209
00:11:25,810 --> 00:11:29,510
And so it's a directed
acyclic graph.

210
00:11:29,510 --> 00:11:31,580
There are tools out
there that allow

211
00:11:31,580 --> 00:11:34,850
us to learn these
models directly.

212
00:11:34,850 --> 00:11:40,140
And it allows us, as we'll
see, to analyze the genome

213
00:11:40,140 --> 00:11:45,450
at high resolution, and
to handle missing data.

214
00:11:45,450 --> 00:11:47,010
So we'll be talking
about Segway,

215
00:11:47,010 --> 00:11:50,470
which is a particular
dynamic Bayesian network that

216
00:11:50,470 --> 00:11:52,220
takes the kind of data
we saw on the slide

217
00:11:52,220 --> 00:11:58,500
before and essentially parses
it into labels that allow us

218
00:11:58,500 --> 00:12:01,640
to assign function to
different genomic elements.

219
00:12:01,640 --> 00:12:04,670
And it does this in
an unsupervised way.

220
00:12:04,670 --> 00:12:07,980
What I mean by that is
that it is automatically

221
00:12:07,980 --> 00:12:12,100
learning the states,
and then afterwards we

222
00:12:12,100 --> 00:12:14,950
can look at the states and
assign meaning to them.

223
00:12:17,660 --> 00:12:23,360
So here is the dynamic Bayesian
network that Segway uses.

224
00:12:23,360 --> 00:12:26,020
And let me explain
this somewhat scary

225
00:12:26,020 --> 00:12:28,970
looking diagram of lots of
little boxes and pointers

226
00:12:28,970 --> 00:12:31,160
to you.

227
00:12:31,160 --> 00:12:36,380
The genome is described
through the variables

228
00:12:36,380 --> 00:12:38,880
on the bottom-- the
observation variables,

229
00:12:38,880 --> 00:12:41,860
going from left to
right, where each base is

230
00:12:41,860 --> 00:12:44,440
a separate observation
variable which consists

231
00:12:44,440 --> 00:12:47,730
of the level of a
particular histone mark

232
00:12:47,730 --> 00:12:51,720
at a particular base position
as described by mapped

233
00:12:51,720 --> 00:12:54,420
reads to that location.

234
00:12:54,420 --> 00:12:56,645
The little square
box-- the little box

235
00:12:56,645 --> 00:12:59,050
that says "x" on it with the
other small print you can't

236
00:12:59,050 --> 00:13:01,880
read-- is simply an
indicator, whether or not

237
00:13:01,880 --> 00:13:03,890
the data is present.

238
00:13:03,890 --> 00:13:06,940
If the data is absent, we
don't try and model it.

239
00:13:06,940 --> 00:13:09,860
If that box contains a zero,
we don't model the data.

240
00:13:09,860 --> 00:13:13,960
If the box is one, then we
attempt to model the data.

241
00:13:13,960 --> 00:13:16,710
And the most important part of
the dynamic Bayesian network

242
00:13:16,710 --> 00:13:20,960
is the q box above, where
those are the states.

243
00:13:20,960 --> 00:13:25,330
And each state describes an
ensemble of different histone

244
00:13:25,330 --> 00:13:27,380
marks that are output.

245
00:13:27,380 --> 00:13:30,260
And so the key thing
is that for each state

246
00:13:30,260 --> 00:13:33,460
we learn what marks
it's outputting.

247
00:13:33,460 --> 00:13:35,240
And the model learns
this automatically

248
00:13:35,240 --> 00:13:37,920
through a learning phase.

249
00:13:37,920 --> 00:13:42,970
The boxes above
simply are a counter.

250
00:13:42,970 --> 00:13:47,420
And the counter allows us
to define maximum lengths

251
00:13:47,420 --> 00:13:51,720
for particular states, so
states don't run on forever.

252
00:13:51,720 --> 00:13:53,650
So unlike a hidden
Markov model that

253
00:13:53,650 --> 00:13:55,460
doesn't have that
kind of control,

254
00:13:55,460 --> 00:14:00,880
we can adjust how long we
want the states to last.

255
00:14:00,880 --> 00:14:05,880
So this model, if you
turned it 90 degrees

256
00:14:05,880 --> 00:14:10,220
and rotated it clockwise,
would be more familiar to you

257
00:14:10,220 --> 00:14:12,585
because all the arrows
would be flowing

258
00:14:12,585 --> 00:14:14,200
from the top of the screen down.

259
00:14:14,200 --> 00:14:17,990
There are no cycles in this
directed acyclic graph.

260
00:14:17,990 --> 00:14:21,271
And therefore, it can be
probabilistically viewed

261
00:14:21,271 --> 00:14:22,645
and learned in
the same framework

262
00:14:22,645 --> 00:14:24,970
that we learn a
Bayesian network.

263
00:14:24,970 --> 00:14:27,790
In fact, it is a
Bayesian network.

264
00:14:27,790 --> 00:14:29,660
The reason it's
called dynamic is

265
00:14:29,660 --> 00:14:33,310
because we are learning
temporal information,

266
00:14:33,310 --> 00:14:36,650
or in this case,
spatial information

267
00:14:36,650 --> 00:14:38,850
with these different
observations

268
00:14:38,850 --> 00:14:42,540
along the bottom of the model.

269
00:14:42,540 --> 00:14:44,760
Now before I go on,
perhaps somebody

270
00:14:44,760 --> 00:14:46,760
could ask me a question
about the details

271
00:14:46,760 --> 00:14:49,180
of these dynamic
Bayesian networks,

272
00:14:49,180 --> 00:14:53,330
because the ability to
automatically assign labels

273
00:14:53,330 --> 00:14:57,790
to genome function, given
the histone marks is really

274
00:14:57,790 --> 00:15:00,450
a key thing that's gone on
the last couple of years.

275
00:15:00,450 --> 00:15:01,401
Yes?

276
00:15:01,401 --> 00:15:03,325
AUDIENCE: Could you
re-explain that--

277
00:15:03,325 --> 00:15:06,700
what the labeled-- the second
[INAUDIBLE] was all about?

278
00:15:06,700 --> 00:15:07,610
PROFESSOR: Sure.

279
00:15:07,610 --> 00:15:16,300
So the Q label is right
here, these labels.

280
00:15:16,300 --> 00:15:19,050
And each of these
Q labels defines

281
00:15:19,050 --> 00:15:20,290
one of a number of states.

282
00:15:20,290 --> 00:15:23,196
For example, 24
different states.

283
00:15:23,196 --> 00:15:28,420
And a given state describes
the expected output

284
00:15:28,420 --> 00:15:32,100
in terms of what histone marks
are present in that state.

285
00:15:32,100 --> 00:15:34,701
So it's going to
describe the means of all

286
00:15:34,701 --> 00:15:35,950
those different histone marks.

287
00:15:35,950 --> 00:15:38,570
24 different means,
let's say, of the marks

288
00:15:38,570 --> 00:15:41,090
it's going to output.

289
00:15:41,090 --> 00:15:46,160
And the job of fitting the model
is picking the right states,

290
00:15:46,160 --> 00:15:48,770
or a set of 24
states, each of which

291
00:15:48,770 --> 00:15:53,540
is most descriptive of its
particular subset of chromatin

292
00:15:53,540 --> 00:15:54,990
marks.

293
00:15:54,990 --> 00:15:59,100
And then defining how we
transition between states.

294
00:15:59,100 --> 00:16:04,000
So we not only need to
define what a state means

295
00:16:04,000 --> 00:16:07,150
in terms of the marks
that it outputs, but also

296
00:16:07,150 --> 00:16:11,290
when we transition from
one state to another.

297
00:16:11,290 --> 00:16:13,244
Does that make sense to you?

298
00:16:13,244 --> 00:16:16,632
AUDIENCE: So I know it
states the information that

299
00:16:16,632 --> 00:16:18,568
tells at each of the Q boxes.

300
00:16:18,568 --> 00:16:22,260
Is that a series
of probabilities?

301
00:16:22,260 --> 00:16:24,592
Or is it something else?

302
00:16:24,592 --> 00:16:26,970
PROFESSOR: It's actually
a discrete number, right.

303
00:16:26,970 --> 00:16:30,000
So it actually is a
single-- there's only

304
00:16:30,000 --> 00:16:31,520
a single state in each Q box.

305
00:16:31,520 --> 00:16:33,570
So it might be a
number between 1 and 24

306
00:16:33,570 --> 00:16:35,190
that we're going to learn.

307
00:16:35,190 --> 00:16:37,460
And based upon
that number, we're

308
00:16:37,460 --> 00:16:41,970
going to have a
description of the marks

309
00:16:41,970 --> 00:16:45,190
that we would expect to
see at the observation

310
00:16:45,190 --> 00:16:47,810
at that particular
genomic location.

311
00:16:47,810 --> 00:16:53,960
And so our job here is to
learn those 24 different states

312
00:16:53,960 --> 00:16:58,910
and what they output
in the training phase,

313
00:16:58,910 --> 00:17:00,870
and then once we've
trained the model,

314
00:17:00,870 --> 00:17:03,430
we can go back and look
at other held out data,

315
00:17:03,430 --> 00:17:04,899
and then we can
decode the genome.

316
00:17:04,899 --> 00:17:06,690
Because we know what
the states are, and we

317
00:17:06,690 --> 00:17:09,190
know what they are
supposed to be producing,

318
00:17:09,190 --> 00:17:13,010
we can use a Viterbi decoder
and go back and-- as we

319
00:17:13,010 --> 00:17:15,930
did with the HMM and we
learned the HMM-- go back

320
00:17:15,930 --> 00:17:19,550
and read off on the
histone mark sequence

321
00:17:19,550 --> 00:17:21,640
and figure out what
their relative states are

322
00:17:21,640 --> 00:17:25,569
for each base position
of the genome.

323
00:17:25,569 --> 00:17:27,079
Is that helpful?

324
00:17:27,079 --> 00:17:29,314
Yes?

325
00:17:29,314 --> 00:17:32,920
Any other questions about
dynamic Bayesian networks?

326
00:17:32,920 --> 00:17:33,790
Yes?

327
00:17:33,790 --> 00:17:36,086
AUDIENCE: How do you choose
the number of states?

328
00:17:36,086 --> 00:17:37,710
PROFESSOR: That's a
very good question.

329
00:17:37,710 --> 00:17:40,220
How do you choose
the number of states?

330
00:17:40,220 --> 00:17:43,320
Well, if you choose
too many states,

331
00:17:43,320 --> 00:17:45,300
they obviously don't
really become descriptive

332
00:17:45,300 --> 00:17:46,890
and you can become
overfit and then

333
00:17:46,890 --> 00:17:48,985
can start fitting
your model to noise.

334
00:17:48,985 --> 00:17:52,260
And if you choose too few
states, what will happen

335
00:17:52,260 --> 00:17:54,320
is, that states can
get collapsed together

336
00:17:54,320 --> 00:17:56,430
and they won't be
adequately descriptive.

337
00:17:56,430 --> 00:17:59,010
The answer is, it's more
or less trial and error.

338
00:17:59,010 --> 00:18:01,160
There really isn't
a principled way

339
00:18:01,160 --> 00:18:03,620
to choose the right
number of states

340
00:18:03,620 --> 00:18:05,580
in this particular context.

341
00:18:05,580 --> 00:18:06,930
Now, you could do--

342
00:18:06,930 --> 00:18:08,421
AUDIENCE: What's
the trial, then?

343
00:18:08,421 --> 00:18:10,906
You run it and you
get a set of things,

344
00:18:10,906 --> 00:18:12,727
and what do you do
with those labels?

345
00:18:12,727 --> 00:18:14,310
PROFESSOR: What do
you do with labels?

346
00:18:14,310 --> 00:18:17,260
AUDIENCE: Yeah, how
do you evaluate it?

347
00:18:17,260 --> 00:18:19,290
PROFESSOR: You
typically, in both

348
00:18:19,290 --> 00:18:23,690
of these cases-- both in the
case of ChromHMM and this--

349
00:18:23,690 --> 00:18:26,180
you rely upon the
previous literature.

350
00:18:26,180 --> 00:18:29,020
And we saw on that
slide earlier,

351
00:18:29,020 --> 00:18:31,780
what marks are associated
with what kinds of features.

352
00:18:31,780 --> 00:18:33,820
So you use the prior
literature and you

353
00:18:33,820 --> 00:18:36,720
use what the states are telling
you they're describing to try

354
00:18:36,720 --> 00:18:39,190
and associate those
states with what's

355
00:18:39,190 --> 00:18:41,540
known about genome function.

356
00:18:44,641 --> 00:18:45,657
All right, yes?

357
00:18:45,657 --> 00:18:47,656
AUDIENCE: Where does that
information concerning

358
00:18:47,656 --> 00:18:50,280
the distance between
states go again?

359
00:18:50,280 --> 00:18:51,711
Like, the counter?

360
00:18:51,711 --> 00:18:53,619
Like, how does that
control how long

361
00:18:53,619 --> 00:18:55,309
the states go on
and whether or not--

362
00:18:55,309 --> 00:18:57,850
PROFESSOR: What happens is that
the counter at the top, the C

363
00:18:57,850 --> 00:19:03,200
variables, influence the J
variables you can see there.

364
00:19:03,200 --> 00:19:04,860
When the J variable
terms to a 1,

365
00:19:04,860 --> 00:19:07,290
it forces the state transition.

366
00:19:07,290 --> 00:19:11,530
So the counters count
down and can then

367
00:19:11,530 --> 00:19:14,240
force a state
transition which will

368
00:19:14,240 --> 00:19:17,800
cause the Q variable to change.

369
00:19:17,800 --> 00:19:22,040
It's sort of a-- that particular
formulation of this model

370
00:19:22,040 --> 00:19:24,750
is a bit of a, sort
of Rube Goldberg kind

371
00:19:24,750 --> 00:19:26,570
of hackish kind of thing.

372
00:19:26,570 --> 00:19:29,135
I think to make it get
out of particular states.

373
00:19:33,210 --> 00:19:38,640
But it works, as we'll
see in just a moment.

374
00:19:38,640 --> 00:19:39,650
OK.

375
00:19:39,650 --> 00:19:45,480
So here's an example
of it operating.

376
00:19:45,480 --> 00:19:50,400
And you can see the different
states on the y-axis here.

377
00:19:50,400 --> 00:19:53,250
You can see the different
state transitions

378
00:19:53,250 --> 00:19:55,060
as we go down the genome.

379
00:19:55,060 --> 00:19:57,670
And you can see the
annotations that it's

380
00:19:57,670 --> 00:20:00,730
outputting, corresponding
to the histone marks.

381
00:20:00,730 --> 00:20:04,020
And so what this
is doing is it's

382
00:20:04,020 --> 00:20:08,130
decoding for us what it thinks
is going on in the genome,

383
00:20:08,130 --> 00:20:10,620
solely with reference
to the histone marks,

384
00:20:10,620 --> 00:20:15,180
without reference to primary
sequence or anything else.

385
00:20:15,180 --> 00:20:17,830
And this kind of
decoding is most useful

386
00:20:17,830 --> 00:20:22,340
when we want to discover things
like regulatory elements.

387
00:20:22,340 --> 00:20:27,600
When we want to look for H3K4
mono or dimethyl, and H3K27

388
00:20:27,600 --> 00:20:31,200
acetyl for example, and identify
those regions of the genome

389
00:20:31,200 --> 00:20:33,150
that we think are
active enhancers.

390
00:20:33,150 --> 00:20:33,650
OK.

391
00:20:37,129 --> 00:20:40,120
OK.

392
00:20:40,120 --> 00:20:48,310
So, any questions at all about
histone marks and decoding?

393
00:20:48,310 --> 00:20:50,610
Do you get the
general idea that you

394
00:20:50,610 --> 00:20:55,620
can assay these histone
marks through ChIP-seq using

395
00:20:55,620 --> 00:20:59,590
antibodies that are specific
to a particular mark.

396
00:20:59,590 --> 00:21:04,250
Pull down the histones that
are associated with DNA

397
00:21:04,250 --> 00:21:06,130
with that mark and map
them to the genome.

398
00:21:06,130 --> 00:21:10,910
So we get one track for
each ChIP-seq experiment.

399
00:21:10,910 --> 00:21:15,180
We can profile all the marks
that we think are relevant,

400
00:21:15,180 --> 00:21:18,470
and then we can look at
what those marks imply

401
00:21:18,470 --> 00:21:22,710
about both the static
structure of our genome,

402
00:21:22,710 --> 00:21:31,310
and also how it's being
used as cells differentiate

403
00:21:31,310 --> 00:21:34,191
or in different
environmental conditions.

404
00:21:34,191 --> 00:21:34,690
OK.

405
00:21:37,950 --> 00:21:39,550
OK.

406
00:21:39,550 --> 00:21:45,730
So, let's go on, then,
to the next step, which

407
00:21:45,730 --> 00:21:54,310
is that if we understand the
sort of epigenetic state,

408
00:21:54,310 --> 00:22:02,050
how is that established and
how is the opening of chromatin

409
00:22:02,050 --> 00:22:06,750
regulated and how is it that
factors find particular places

410
00:22:06,750 --> 00:22:09,460
in the genome to bind?

411
00:22:09,460 --> 00:22:13,770
So, the puzzle I talked
to you about earlier

412
00:22:13,770 --> 00:22:15,900
was that there are
hundreds of thousands

413
00:22:15,900 --> 00:22:18,170
of particular motifs
in the genome,

414
00:22:18,170 --> 00:22:20,730
but a very small
number are actually

415
00:22:20,730 --> 00:22:24,170
bound by regulatory factors.

416
00:22:24,170 --> 00:22:27,550
And you might think
that the difference

417
00:22:27,550 --> 00:22:31,430
is that the ones that are bound
have different DNA sequences.

418
00:22:31,430 --> 00:22:34,410
But in fact, on the
right-hand side, what we see

419
00:22:34,410 --> 00:22:38,194
is that identical DNA sequences
are bound differentially

420
00:22:38,194 --> 00:22:39,360
in two different conditions.

421
00:22:39,360 --> 00:22:41,280
Shown there are
sites that are only

422
00:22:41,280 --> 00:22:44,300
bound, for example,
in endodermal tissues

423
00:22:44,300 --> 00:22:46,860
or in ES cells.

424
00:22:46,860 --> 00:22:49,840
So it isn't the sequence
that's controlling

425
00:22:49,840 --> 00:22:54,010
the specificity of the
binding, it's something else.

426
00:22:54,010 --> 00:22:56,360
And we'd like to figure out
what that something else is.

427
00:22:56,360 --> 00:23:00,700
We'd like to understand
the rules that

428
00:23:00,700 --> 00:23:03,170
govern where those factors
are binding in the genome.

429
00:23:06,810 --> 00:23:12,400
So a set of factors are
known that bind to the genome

430
00:23:12,400 --> 00:23:13,140
and open it.

431
00:23:13,140 --> 00:23:15,200
They're called pioneer factors.

432
00:23:15,200 --> 00:23:18,220
There are some well known
pioneer factors like FoxA

433
00:23:18,220 --> 00:23:22,930
and some of the iPS
reprogramming factors.

434
00:23:22,930 --> 00:23:26,080
And the idea is that
they're able to bind

435
00:23:26,080 --> 00:23:28,990
to closed chromatin
and to open it up

436
00:23:28,990 --> 00:23:33,220
to provide accessibility
to other factors.

437
00:23:33,220 --> 00:23:36,220
So what we would
like to do, is to see

438
00:23:36,220 --> 00:23:39,570
if there's a way that we
could, both understand

439
00:23:39,570 --> 00:23:41,690
how to discover those
factors automatically,

440
00:23:41,690 --> 00:23:45,790
using a computational
method, and secondarily,

441
00:23:45,790 --> 00:23:49,190
understand where factors are
binding in a single experiment

442
00:23:49,190 --> 00:23:50,140
across the genome.

443
00:23:53,320 --> 00:23:57,279
So the results I'm going to
show you can be summarized here.

444
00:23:57,279 --> 00:23:58,820
I'm going to show
you a method called

445
00:23:58,820 --> 00:24:04,070
PIQ that can predict where
TFs bind from DNase-seq data

446
00:24:04,070 --> 00:24:05,525
that I'll describe in a moment.

447
00:24:05,525 --> 00:24:08,020
We'll identify pioneer factors.

448
00:24:08,020 --> 00:24:10,860
We'll show that certain of these
pioneer factors are directional

449
00:24:10,860 --> 00:24:14,560
and only operate in
one way on the genome.

450
00:24:14,560 --> 00:24:17,460
And finally, that the
opening of the genome

451
00:24:17,460 --> 00:24:22,350
allows settler factors to come
in and to bind to the genome.

452
00:24:22,350 --> 00:24:27,280
So let's begin with
what DNase-seq data is,

453
00:24:27,280 --> 00:24:29,420
and how we can use
it to predict where

454
00:24:29,420 --> 00:24:30,670
TFs are binding to the genome.

455
00:24:33,700 --> 00:24:37,410
So DNase-seq is a
methodology for exploring

456
00:24:37,410 --> 00:24:40,320
what parts of the
genome are open.

457
00:24:40,320 --> 00:24:42,330
So here's the idea.

458
00:24:42,330 --> 00:24:48,190
You take your cell
and you expose it,

459
00:24:48,190 --> 00:24:52,280
once you've isolated the
chromatin, to DNase-1, which

460
00:24:52,280 --> 00:24:55,670
will cut or nick
DNA at locations

461
00:24:55,670 --> 00:24:59,130
where the DNA is open.

462
00:24:59,130 --> 00:25:01,885
You then can collect the
DNA, size separate it

463
00:25:01,885 --> 00:25:02,842
and sequence it.

464
00:25:02,842 --> 00:25:04,300
And thus, you're
going to have more

465
00:25:04,300 --> 00:25:09,240
reads where the
DNA has been open,

466
00:25:09,240 --> 00:25:11,365
and fewer reads where it's
protected by proteins.

467
00:25:13,910 --> 00:25:16,810
So the cartoon below
gives you an idea

468
00:25:16,810 --> 00:25:20,670
that, where there are
histones-- each histone

469
00:25:20,670 --> 00:25:23,910
has about 147 bases of
DNA wrapped around it.

470
00:25:23,910 --> 00:25:28,230
Or where there are other
proteins hiding the DNA,

471
00:25:28,230 --> 00:25:32,010
you're going to cast
shadows on this.

472
00:25:32,010 --> 00:25:37,520
So we're going to be looking
at the shadows and also

473
00:25:37,520 --> 00:25:40,670
the accessible parts,
by looking directly

474
00:25:40,670 --> 00:25:41,845
at the DNase-seq reads.

475
00:25:45,040 --> 00:25:48,180
So if we sequence
deeply enough we

476
00:25:48,180 --> 00:25:53,140
can understand that
each binding protein has

477
00:25:53,140 --> 00:25:58,500
its own particular
profile of protection.

478
00:25:58,500 --> 00:26:01,330
So if you look at these
different proteins,

479
00:26:01,330 --> 00:26:05,010
they cast particular
shadows on the genome.

480
00:26:05,010 --> 00:26:09,370
I'm showing here a window
that's 400 base pairs wide.

481
00:26:09,370 --> 00:26:15,630
This is the average of thousands
of different binding instances.

482
00:26:15,630 --> 00:26:18,480
So this is not one binding
instance on the top row.

483
00:26:18,480 --> 00:26:21,550
You can see how CTCF
and other factors

484
00:26:21,550 --> 00:26:27,470
have particular shadows
they cast or profiles.

485
00:26:27,470 --> 00:26:29,214
Yes?

486
00:26:29,214 --> 00:26:33,038
AUDIENCE: How do you know
which factor was at which site?

487
00:26:33,038 --> 00:26:33,994
[INAUDIBLE].

488
00:26:33,994 --> 00:26:36,384
PROFESSOR: How do we know
which factor is at which site?

489
00:26:36,384 --> 00:26:38,140
By the motifs that
are under the site.

490
00:26:41,400 --> 00:26:42,810
And what's
interesting about CTCF

491
00:26:42,810 --> 00:26:47,160
is that you can actually see
how it phases the nucleosomes.

492
00:26:47,160 --> 00:26:51,090
You can see the, sort of,
periodic pattern in CTCF.

493
00:26:51,090 --> 00:26:55,200
And those dips are where
the nucleosomes are.

494
00:26:55,200 --> 00:26:58,570
There's a lot you can
tell from these patterns

495
00:26:58,570 --> 00:27:04,340
about the underlying molecular
mechanism of what's going on.

496
00:27:04,340 --> 00:27:08,530
Now, you can see at the very
bottom, the aggregate CTCF

497
00:27:08,530 --> 00:27:10,020
profile.

498
00:27:10,020 --> 00:27:13,060
And if all the CTCF
bindings looked like that,

499
00:27:13,060 --> 00:27:14,840
it'd be really easy.

500
00:27:14,840 --> 00:27:18,180
But above it, I've shown
you what an individual CTCF

501
00:27:18,180 --> 00:27:21,040
site looks like, you can
see how sparse it is.

502
00:27:21,040 --> 00:27:23,520
We just don't get
enough read density to be

503
00:27:23,520 --> 00:27:28,730
able to recover a beautiful
protection profile like that.

504
00:27:28,730 --> 00:27:30,810
So we're always working
against a lot of noise

505
00:27:30,810 --> 00:27:33,330
in this kind of
biological environment.

506
00:27:33,330 --> 00:27:35,220
And so our
computational technique

507
00:27:35,220 --> 00:27:37,030
will need to come up
with an adequate model

508
00:27:37,030 --> 00:27:39,150
to overcome that noise.

509
00:27:39,150 --> 00:27:41,970
But if we can, right,
the great promise

510
00:27:41,970 --> 00:27:44,990
is that with a single
experiment we'll

511
00:27:44,990 --> 00:27:47,910
be able to identify where all
these different factors are

512
00:27:47,910 --> 00:27:53,160
binding to the genome
from one set of data.

513
00:27:53,160 --> 00:28:00,100
So, just reiterating now,
if you think about the input

514
00:28:00,100 --> 00:28:04,460
to this algorithm-- we're
going to have three things

515
00:28:04,460 --> 00:28:06,200
that we input to the algorithm.

516
00:28:06,200 --> 00:28:09,990
We input the original
genome sequence.

517
00:28:09,990 --> 00:28:12,780
We input the motifs
of the factors

518
00:28:12,780 --> 00:28:16,550
that we care about, that
we think are interesting.

519
00:28:16,550 --> 00:28:20,510
And we input the
DNase-seq data that

520
00:28:20,510 --> 00:28:23,090
has been aligned to the genome.

521
00:28:23,090 --> 00:28:25,070
So those are the three inputs.

522
00:28:25,070 --> 00:28:27,310
And the output of
the algorithm is

523
00:28:27,310 --> 00:28:31,810
going to be the predictions
of which motifs are occupied

524
00:28:31,810 --> 00:28:35,780
by the factors,
probabilistically.

525
00:28:35,780 --> 00:28:39,490
And in order to do
that, for each protein

526
00:28:39,490 --> 00:28:43,070
we need to learn its
protection profile.

527
00:28:43,070 --> 00:28:45,370
And we need to
score that profile

528
00:28:45,370 --> 00:28:48,040
against each
instance of the motif

529
00:28:48,040 --> 00:28:50,440
to see whether or not we
think the protein is actually

530
00:28:50,440 --> 00:28:54,498
sitting at that
location in the genome.

531
00:28:54,498 --> 00:28:56,704
Any questions at all about that?

532
00:29:03,165 --> 00:29:04,160
No?

533
00:29:04,160 --> 00:29:07,010
OK.

534
00:29:07,010 --> 00:29:08,280
Don't hesitate to stop me.

535
00:29:08,280 --> 00:29:12,790
So the design goals for this
particular computational

536
00:29:12,790 --> 00:29:16,140
algorithm, as I said earlier,
are resistance to low coverage

537
00:29:16,140 --> 00:29:17,110
and lots of noise.

538
00:29:17,110 --> 00:29:20,010
To be able to handle
multiple experiments at once,

539
00:29:20,010 --> 00:29:23,070
it has to work on the
entire mammalian genome.

540
00:29:23,070 --> 00:29:25,390
It has to have high
spatial accuracy

541
00:29:25,390 --> 00:29:31,890
and it has to have good
behavior in bad cases.

542
00:29:31,890 --> 00:29:36,970
So in order to model the
underlying read distribution

543
00:29:36,970 --> 00:29:40,660
of the genome, what
we're going to do

544
00:29:40,660 --> 00:29:46,150
is something that
is in principle

545
00:29:46,150 --> 00:29:47,210
quite straightforward.

546
00:29:47,210 --> 00:29:49,501
Which is that we're going to
model all the counts that we

547
00:29:49,501 --> 00:29:52,890
see in the genome by a
Poisson distribution.

548
00:29:52,890 --> 00:29:55,860
So in each base of
the genome, the counts

549
00:29:55,860 --> 00:29:58,940
that we see there in
the DNase-seq data

550
00:29:58,940 --> 00:30:01,080
are modeled by a Poisson.

551
00:30:01,080 --> 00:30:06,160
And this is assuming that
there's no protein bound there.

552
00:30:06,160 --> 00:30:09,280
So what we're trying to do
is to model the background

553
00:30:09,280 --> 00:30:14,310
distribution of counts
without any kind of binding.

554
00:30:14,310 --> 00:30:17,930
And the log rate
of that Poisson is

555
00:30:17,930 --> 00:30:21,290
going to be taken from
a multivariate normal.

556
00:30:21,290 --> 00:30:24,070
And the particular structure
of that multivariate normal

557
00:30:24,070 --> 00:30:25,760
provides a lot of smoothing.

558
00:30:25,760 --> 00:30:28,500
So we can learn from
that multivariate normal

559
00:30:28,500 --> 00:30:31,040
how to fill in
missing information.

560
00:30:31,040 --> 00:30:33,760
It's very important
to build strength

561
00:30:33,760 --> 00:30:35,700
from neighboring bases.

562
00:30:35,700 --> 00:30:38,130
So, even though we may not
have lots of information

563
00:30:38,130 --> 00:30:39,750
for this base, if
we have information

564
00:30:39,750 --> 00:30:43,750
for all the bases around us,
we can use that information

565
00:30:43,750 --> 00:30:47,730
to build strength to estimate
what we should see at this base

566
00:30:47,730 --> 00:30:51,280
if it's not occupied.
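A minimal sketch of the idea of pooling information from neighboring bases. It is not the multivariate normal model from the lecture: it swaps in a plain Gaussian-kernel average of log counts as a simpler stand-in, and the function name, bandwidth, and toy counts are assumptions made for illustration.
```python
import math
def smooth_log_rate(counts, bandwidth=2.0):
    """Kernel-weighted average of log(count + 1) around each base."""
    logs = [math.log(c + 1.0) for c in counts]
    smoothed = []
    for i in range(len(logs)):
        # Gaussian weights: nearby bases contribute more than distant ones.
        weights = [math.exp(-0.5 * ((i - j) / bandwidth) ** 2) for j in range(len(logs))]
        total = sum(w * x for w, x in zip(weights, logs))
        smoothed.append(total / sum(weights))
    return smoothed
# Toy cut counts (hypothetical): base 3 has no reads, but its neighbors do.
counts = [4, 5, 3, 0, 4, 6, 5]
log_rates = smooth_log_rate(counts)
```
In this toy case the estimate at the empty base is pulled above zero by the reads around it, which is the behavior the smoothing is meant to provide.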

567
00:30:51,280 --> 00:30:56,950
So the details of how we
learn the mean and the sigma

568
00:30:56,950 --> 00:30:59,436
matrix you see up
there for estimating

569
00:30:59,436 --> 00:31:01,310
the multivariate normal
are outside the scope

570
00:31:01,310 --> 00:31:03,440
of what I'm going
to talk about today.

571
00:31:03,440 --> 00:31:07,560
But suffice to say, they
can be effectively learned.

572
00:31:07,560 --> 00:31:13,210
And the second thing we need
to learn are these profiles.

573
00:31:13,210 --> 00:31:18,170
And so each protein is
going to have a profile.

574
00:31:18,170 --> 00:31:20,782
Here shown 400 bases wide.

575
00:31:20,782 --> 00:31:25,080
And it describes how that
protein, so to speak,

576
00:31:25,080 --> 00:31:26,455
casts a shadow on the genome.

577
00:31:28,960 --> 00:31:32,230
And we judge the significance
of these profiles--

578
00:31:32,230 --> 00:31:33,940
and remember that
one of my points

579
00:31:33,940 --> 00:31:36,250
was I wanted this to be robust.

580
00:31:36,250 --> 00:31:44,480
So I will not make calls for
proteins where I cannot get

581
00:31:44,480 --> 00:31:49,090
a robust profile that is
significant above background.

582
00:31:49,090 --> 00:31:52,740
And I also exclude the
middle region of the profile

583
00:31:52,740 --> 00:31:56,580
because it's been shown that
the actual cutting enzymes are

584
00:31:56,580 --> 00:31:59,260
sequence specific
to some extent.

585
00:31:59,260 --> 00:32:02,160
The DNase-1 cutting enzyme.

586
00:32:02,160 --> 00:32:04,440
And so we don't
simply want to be

587
00:32:04,440 --> 00:32:06,985
picking up sequence
bias in our profile.

588
00:32:09,740 --> 00:32:14,590
So we learn these
profiles that describe

589
00:32:14,590 --> 00:32:19,220
for each particular
motif-- and typically we

590
00:32:19,220 --> 00:32:24,090
can take in hundreds of motifs,
over 500 motifs at once--

591
00:32:24,090 --> 00:32:27,440
for each motif, what its
protection looks like.

592
00:32:30,160 --> 00:32:34,350
So what we then have-- we're
going to learn this, actually,

593
00:32:34,350 --> 00:32:37,480
in an iterative process, but
what we're going to have is--

594
00:32:37,480 --> 00:32:41,720
now we have a model of what the
unoccupied genome looks like.

595
00:32:41,720 --> 00:32:47,860
And we have a model of the
reads that a particular protein

596
00:32:47,860 --> 00:32:50,260
at a motif location
is going to produce.

597
00:32:52,820 --> 00:32:59,960
And we can put those two
things together and the way

598
00:32:59,960 --> 00:33:05,290
that we do that is that we
have a binding variable.

599
00:33:05,290 --> 00:33:07,120
Shown there as delta.

600
00:33:07,120 --> 00:33:13,500
And we can either add or
not add the binding profile

601
00:33:13,500 --> 00:33:18,030
of a particular protein in
a location in the genome.

602
00:33:18,030 --> 00:33:20,960
And that will change the
expected number of counts

603
00:33:20,960 --> 00:33:23,670
that we see.

604
00:33:23,670 --> 00:33:32,060
So the key part of this is that
we use a likelihood ratio shown

605
00:33:32,060 --> 00:33:33,362
as the second probability.

606
00:33:33,362 --> 00:33:34,820
It's not really a
probability, it's

607
00:33:34,820 --> 00:33:39,600
a ratio, which is the
probability of a count, given

608
00:33:39,600 --> 00:33:43,530
that a protein j is
binding at that location,

609
00:33:43,530 --> 00:33:48,910
versus the probability of the
counts, were it not binding.

610
00:33:48,910 --> 00:33:52,870
And that quantity
is key because it's

611
00:33:52,870 --> 00:33:56,370
going to be-- once
we log transform it,

612
00:33:56,370 --> 00:33:59,594
will be a key component
of our test statistic

613
00:33:59,594 --> 00:34:01,260
to figure out whether
or not a protein's

614
00:34:01,260 --> 00:34:02,634
binding at a
particular location.

615
00:34:05,740 --> 00:34:11,630
And so the way that we go about
that is that we log that ratio

616
00:34:11,630 --> 00:34:14,670
and we add it to some other
prior information that gives us

617
00:34:14,670 --> 00:34:20,699
an overall measure
for whether or not

618
00:34:20,699 --> 00:34:23,260
the protein is binding
at a particular location.
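A minimal sketch of the score just described, assuming the binding profile is added to the per-base log rate when the binding variable is 1. The names log_likelihood_ratio, binding_score, and prior_log_odds, and all of the toy numbers, are assumptions for illustration rather than the method's actual code.
```python
import math
def log_likelihood_ratio(counts, background_log_rate, profile):
    """log P(counts | bound) - log P(counts | unbound) under per-base Poissons."""
    llr = 0.0
    for c, mu, beta in zip(counts, background_log_rate, profile):
        # The log(c!) terms cancel in the ratio, leaving only the rate terms.
        llr += c * beta - (math.exp(mu + beta) - math.exp(mu))
    return llr
def binding_score(counts, background_log_rate, profile, prior_log_odds=0.0):
    """Log likelihood ratio plus prior information expressed as log odds."""
    return log_likelihood_ratio(counts, background_log_rate, profile) + prior_log_odds
# Toy example (hypothetical numbers): flanks enriched, center protected.
background = [math.log(2.0)] * 4
profile = [0.5, -1.0, -1.0, 0.5]
footprint_score = binding_score([4, 0, 1, 5], background, profile)
flat_score = binding_score([2, 2, 2, 2], background, profile)
```
Counts that follow the profile's shape score above zero, while counts that look like plain background score below zero.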

619
00:34:23,260 --> 00:34:27,770
And then we can rank
these for all the motifs

620
00:34:27,770 --> 00:34:31,120
for that particular
protein in the genome.

621
00:34:31,120 --> 00:34:34,107
And then we can make
calls using a null set.

622
00:34:34,107 --> 00:34:35,940
So we could look in the
genome for locations

623
00:34:35,940 --> 00:34:39,179
that we know are not occupied,
compute a distribution

624
00:34:39,179 --> 00:34:43,800
of that statistic,
and then we can say,

625
00:34:43,800 --> 00:34:46,610
for what values of this
statistic that we observe,

626
00:34:46,610 --> 00:34:52,030
at the actual motif sites,
is it so unlikely that this

627
00:34:52,030 --> 00:34:53,110
would occur at random.

628
00:34:53,110 --> 00:34:57,540
At some desired p
value by looking

629
00:34:57,540 --> 00:35:01,670
at the area in the
tail of the null set.
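A minimal sketch of making a call from a null set: the p-value is estimated as the fraction of null scores at least as large as the observed score. The function name, the add-one correction, and the toy null scores are assumptions for illustration.
```python
def empirical_p_value(score, null_scores):
    """Fraction of null scores at least as large as the observed score."""
    exceed = sum(1 for s in null_scores if s >= score)
    # Add-one correction keeps the estimate away from exactly zero.
    return (exceed + 1) / (len(null_scores) + 1)
# Hypothetical scores computed at locations believed to be unoccupied.
null_scores = [-2.1, -0.4, 0.3, 0.9, 1.5, -1.2, 0.1, 2.2, -0.8]
p_strong = empirical_p_value(3.4, null_scores)
p_weak = empirical_p_value(0.0, null_scores)
```
A motif site would be called bound when its p-value falls below the desired threshold.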

630
00:35:04,290 --> 00:35:09,440
So, just summarizing, we
learn a background model

631
00:35:09,440 --> 00:35:14,460
of the genome, which
is a Poisson that

632
00:35:14,460 --> 00:35:18,460
takes log rates from
a multivariate normal.

633
00:35:18,460 --> 00:35:24,100
We learn patterns, or
profiles of protection,

634
00:35:24,100 --> 00:35:30,480
or the production of
reads for each motif.

635
00:35:30,480 --> 00:35:34,250
And at each motif location,
we ask the question

636
00:35:34,250 --> 00:35:37,260
whether or not, it's
likely that the protein

637
00:35:37,260 --> 00:35:42,400
was there and actually caused
the reads that we're seeing,

638
00:35:42,400 --> 00:35:44,310
using a log likelihood ratio.

639
00:35:49,210 --> 00:35:50,790
So what we're
integrating together,

640
00:35:50,790 --> 00:35:52,400
when we take all
these things, is

641
00:35:52,400 --> 00:35:55,280
that we're taking our
original DNase-seq reads,

642
00:35:55,280 --> 00:36:02,330
we're taking our TF-specific
binding profiles.

643
00:36:02,330 --> 00:36:07,130
We can build strength across
experiments for the background

644
00:36:07,130 --> 00:36:13,330
model and we can also learn,
to what extent, the strength

645
00:36:13,330 --> 00:36:18,930
of binding is influenced by
the match of the position-specific

646
00:36:18,930 --> 00:36:21,610
weight matrix to a particular location
to a particular location

647
00:36:21,610 --> 00:36:23,970
in the genome.

648
00:36:23,970 --> 00:36:27,560
And then we can
produce binding calls.

649
00:36:27,560 --> 00:36:33,480
And when we do so,
it works quite well.

650
00:36:33,480 --> 00:36:38,710
So here you see three
different mouse ES cell factors.

651
00:36:38,710 --> 00:36:43,820
And the area under
this receiver operating

652
00:36:43,820 --> 00:36:46,000
curve-- we've talked
about this before.

653
00:36:46,000 --> 00:36:48,190
Remember a receiver
operating characteristic

654
00:36:48,190 --> 00:36:50,990
curve-- has false
positives increasing

655
00:36:50,990 --> 00:36:55,120
on the x-axis and true positives
increasing on the y-axis.

656
00:36:55,120 --> 00:36:58,040
And if we had a perfect method,
the area under that curve

657
00:36:58,040 --> 00:37:01,480
would be 1.0.
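A minimal sketch of computing the area under an ROC curve, using the equivalent pairwise definition: the chance that a randomly chosen true site outscores a randomly chosen false site. The function name and the toy scores and labels are assumptions for illustration; in practice the labels would come from a gold standard such as ChIP-seq calls.
```python
def roc_auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
# Hypothetical predictions: label 1 marks a site bound in the gold standard.
perfect = roc_auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
mixed = roc_auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
```
The pairwise form is quadratic in the number of sites, so it suits small checks rather than genome-wide evaluation.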

658
00:37:01,480 --> 00:37:06,730
And so for this method,
the area under the ROC

659
00:37:06,730 --> 00:37:10,530
curve for these three
factors, using ChIP-seq data,

660
00:37:10,530 --> 00:37:16,869
as the absolute gold
standard, is over 0.9.

661
00:37:16,869 --> 00:37:18,410
And you might say,
well that's great,

662
00:37:18,410 --> 00:37:20,560
but how well does
it work in general?

663
00:37:20,560 --> 00:37:23,510
I mean, for example,
the ENCODE project

664
00:37:23,510 --> 00:37:26,241
has used hundreds and hundreds
of ChIP-seq experiments

665
00:37:26,241 --> 00:37:27,740
to profile where
factors are binding

666
00:37:27,740 --> 00:37:29,780
in different cellular states.

667
00:37:29,780 --> 00:37:32,620
If you take the DNase-seq data
from those matched cell types

668
00:37:32,620 --> 00:37:35,495
and you ask, can you reproduce
the ChIP-seq data?

669
00:37:38,670 --> 00:37:42,929
The answer is, a lot
of the time we can,

670
00:37:42,929 --> 00:37:44,220
using this kind of methodology.

671
00:37:44,220 --> 00:37:48,310
And that is, the
AUC mean is 0.93

672
00:37:48,310 --> 00:37:51,540
compared to 313 different
ChIP-seq experiments.

673
00:37:54,360 --> 00:37:59,090
So this methodology of
looking at open chromatin

674
00:37:59,090 --> 00:38:02,630
allows us to identify where
lots of different factors

675
00:38:02,630 --> 00:38:04,710
bind to the genome.

676
00:38:04,710 --> 00:38:11,680
And about 75 different
factors are strongly

677
00:38:11,680 --> 00:38:16,150
detectable using
this methodology.

678
00:38:16,150 --> 00:38:20,040
So it's detectable if
it has a strong motif,

679
00:38:20,040 --> 00:38:22,350
if it binds in
DNase-accessible regions

680
00:38:22,350 --> 00:38:25,100
and has strong
DNA-binding affinity.

681
00:38:25,100 --> 00:38:27,550
So I tell you this
just so you know

682
00:38:27,550 --> 00:38:30,450
that there are
new methods coming

683
00:38:30,450 --> 00:38:33,600
that allow us to take
a single experiment

684
00:38:33,600 --> 00:38:39,830
and analyze it and determine
where a large number of factors

685
00:38:39,830 --> 00:38:43,730
bind from that single
experimental data set.

686
00:38:46,290 --> 00:38:49,130
Now, a second question
we wanted to answer

687
00:38:49,130 --> 00:38:54,595
was, how is it that chromatin
opening and closing is

688
00:38:54,595 --> 00:38:56,170
controlled?

689
00:38:56,170 --> 00:39:01,640
And since we had a direct read
out of what chromatin is open,

690
00:39:01,640 --> 00:39:04,220
because reads are
being produced there,

691
00:39:04,220 --> 00:39:05,980
we could look in an
experimental system

692
00:39:05,980 --> 00:39:08,470
where we measured
chromatin accessibility

693
00:39:08,470 --> 00:39:11,150
through developmental time.

694
00:39:11,150 --> 00:39:15,910
And the idea was that as we
measured this accessibility,

695
00:39:15,910 --> 00:39:19,850
we could look at the
places that changed

696
00:39:19,850 --> 00:39:25,900
and determine what underlying
motifs were present that

697
00:39:25,900 --> 00:39:29,405
perhaps were causing the
genome to undergo this opening

698
00:39:29,405 --> 00:39:29,905
process.

699
00:39:32,760 --> 00:39:37,750
So we developed an
underlying theory

700
00:39:37,750 --> 00:39:42,340
that pioneer factors would bind
to closed chromatin as shown

701
00:39:42,340 --> 00:39:46,150
in the middle panel
and open it up,

702
00:39:46,150 --> 00:39:49,150
and that we could observe those
by looking at the differential

703
00:39:49,150 --> 00:39:51,320
accessibility of the genome
at two different time

704
00:39:51,320 --> 00:39:55,080
points that were related.

705
00:39:55,080 --> 00:39:59,960
And we couldn't observe pioneers
that didn't open up chromatin.

706
00:39:59,960 --> 00:40:03,720
And for non-pioneers--
obviously the left-hand panel--

707
00:40:03,720 --> 00:40:08,340
they would not, in
our design here,

708
00:40:08,340 --> 00:40:09,755
lead to increased accessibility.

709
00:40:13,050 --> 00:40:22,670
So we then looked at designing
computational indices that

710
00:40:22,670 --> 00:40:25,263
measured the--
oh, question, yes?

711
00:40:25,263 --> 00:40:27,195
AUDIENCE: When you
say pioneer factors,

712
00:40:27,195 --> 00:40:31,542
are you looking at what proteins
are pioneer factors, or are you

713
00:40:31,542 --> 00:40:34,239
looking at what sequences they
bind to that are [INAUDIBLE].

714
00:40:34,239 --> 00:40:35,780
PROFESSOR: So the
question is, are we

715
00:40:35,780 --> 00:40:38,080
looking at what
proteins are factors,

716
00:40:38,080 --> 00:40:40,042
or are we looking at
what sequence, right?

717
00:40:40,042 --> 00:40:42,000
What we're doing is,
we're making an assumption

718
00:40:42,000 --> 00:40:45,920
that the underlying sequence
denotes one or more proteins

719
00:40:45,920 --> 00:40:48,500
and thus, we are
hypothesizing it's

720
00:40:48,500 --> 00:40:51,510
the proteins that are actually
binding to the sequence, that's

721
00:40:51,510 --> 00:40:52,860
causing that.

722
00:40:52,860 --> 00:40:55,730
And then later on, we'll go back
and test that experimentally,

723
00:40:55,730 --> 00:40:57,340
as you'll see in a second.

724
00:40:57,340 --> 00:41:00,340
OK?

725
00:41:00,340 --> 00:41:03,330
So here there are three
different metrics,

726
00:41:03,330 --> 00:41:06,460
which is the dynamic opening
of chromatin from one time

727
00:41:06,460 --> 00:41:10,690
point to the next, the
static openness of chromatin

728
00:41:10,690 --> 00:41:14,230
around a particular factor,
and a social index showing

729
00:41:14,230 --> 00:41:16,700
how many other factors
are around where

730
00:41:16,700 --> 00:41:20,150
a particular factor binds.

731
00:41:20,150 --> 00:41:24,660
And you can see that these
things are distributed in a way

732
00:41:24,660 --> 00:41:29,190
that certain of the factors have
a very high index in multiple

733
00:41:29,190 --> 00:41:30,225
of these scores.

734
00:41:33,390 --> 00:41:39,840
And thus, we were
able to classify

735
00:41:39,840 --> 00:41:44,910
a certain set of factors
as what we classified

736
00:41:44,910 --> 00:41:48,160
as computational pioneers,
that would open up the genome.

737
00:41:50,730 --> 00:41:54,000
Now, in any kind of
computational work,

738
00:41:54,000 --> 00:41:56,400
we're actually looking
at correlative analysis,

739
00:41:56,400 --> 00:41:57,890
which is never causal.

740
00:41:57,890 --> 00:41:58,390
Right.

741
00:41:58,390 --> 00:42:02,470
So we have to go back and we
have to test whether or not

742
00:42:02,470 --> 00:42:06,590
our computational
predictions are correct.

743
00:42:06,590 --> 00:42:12,940
So in order to do
that, we built a test

744
00:42:12,940 --> 00:42:15,880
construct where we
could put the pioneers

745
00:42:15,880 --> 00:42:20,240
in on the left-hand side
and ask, whether or not

746
00:42:20,240 --> 00:42:22,840
the pioneer would
open up chromatin

747
00:42:22,840 --> 00:42:26,600
and enable the expression
of a GFP marker.

748
00:42:26,600 --> 00:42:28,910
And the red bars
show the factors

749
00:42:28,910 --> 00:42:30,940
that we thought were pioneers.

750
00:42:30,940 --> 00:42:36,480
And as you can see, in
this case, all but one

751
00:42:36,480 --> 00:42:42,520
of the predicted pioneers
produce GFP activity.

752
00:42:42,520 --> 00:42:44,950
And this construct was
designed in an interesting way.

753
00:42:44,950 --> 00:42:48,960
We had to design it so that
the pioneers themselves

754
00:42:48,960 --> 00:42:51,930
were not simply activators.

755
00:42:51,930 --> 00:42:54,780
And so it was upstream of
another activator, which

756
00:42:54,780 --> 00:42:57,750
is a retinoic acid
receptor site.

757
00:42:57,750 --> 00:43:00,180
And so in the absence of
retinoic acid receptor,

758
00:43:00,180 --> 00:43:03,410
we had to ensure that when
we turned on the pioneer,

759
00:43:03,410 --> 00:43:06,110
GFP was not turned on.

760
00:43:06,110 --> 00:43:08,015
It was only with the
addition of the pioneer

761
00:43:08,015 --> 00:43:12,130
to open the chromatin
and the activator

762
00:43:12,130 --> 00:43:14,210
that we actually
got GFP expression.

763
00:43:17,750 --> 00:43:18,690
OK.

764
00:43:18,690 --> 00:43:24,630
So, through this
methodology we discovered

765
00:43:24,630 --> 00:43:31,520
about 120 different motifs
corresponding to proteins

766
00:43:31,520 --> 00:43:36,320
that we found computationally
open chromatin up.

767
00:43:36,320 --> 00:43:37,227
Yes?

768
00:43:37,227 --> 00:43:38,726
AUDIENCE: [INAUDIBLE]
concentrations

769
00:43:38,726 --> 00:43:41,720
of different pioneer
factors are different,

770
00:43:41,720 --> 00:43:44,215
wouldn't that show up
differentially [INAUDIBLE]?

771
00:43:49,205 --> 00:43:52,070
PROFESSOR: The question
is, if the concentration

772
00:43:52,070 --> 00:43:53,960
of different pioneer
factors was different,

773
00:43:53,960 --> 00:43:56,240
wouldn't that show
up differentially?

774
00:43:56,240 --> 00:43:59,170
And that's precisely, we
think how chromatin structures

775
00:43:59,170 --> 00:44:01,310
are regulated.

776
00:44:01,310 --> 00:44:06,320
That we think that the
concentration, or presence

777
00:44:06,320 --> 00:44:10,160
of different pioneer factors,
is regulating the openness

778
00:44:10,160 --> 00:44:12,330
or closeness of different
parts of the genome,

779
00:44:12,330 --> 00:44:16,770
based upon where their
motifs are occurring.

780
00:44:16,770 --> 00:44:19,916
Is that, in part,
answering your question?

781
00:44:19,916 --> 00:44:23,360
AUDIENCE: Yes, but,
if a concentration

782
00:44:23,360 --> 00:44:25,820
of a particular
pioneer factor is low,

783
00:44:25,820 --> 00:44:30,710
do they necessarily have lesser
binding sites on the genome?

784
00:44:30,710 --> 00:44:32,920
PROFESSOR: So you're
asking, how is

785
00:44:32,920 --> 00:44:34,640
the concentration
of a pioneer factor

786
00:44:34,640 --> 00:44:37,720
related to its ability
to open chromatin

787
00:44:37,720 --> 00:44:39,780
and whether or not a
higher dosage would

788
00:44:39,780 --> 00:44:40,820
open more chromatin?

789
00:44:40,820 --> 00:44:41,430
AUDIENCE: Yes.

790
00:44:41,430 --> 00:44:44,620
PROFESSOR: I don't have a
good answer to that question.

791
00:44:44,620 --> 00:44:46,120
Those experiments
haven't been done.

792
00:44:48,940 --> 00:44:55,680
However, one thing you may have
noticed about these profiles--

793
00:44:55,680 --> 00:44:58,650
remember these are the same
profiles that we talked

794
00:44:58,650 --> 00:45:03,200
about earlier of DNase-1
read production

795
00:45:03,200 --> 00:45:05,220
around a particular factor.

796
00:45:05,220 --> 00:45:08,305
And what you might notice is
that some of these profiles

797
00:45:08,305 --> 00:45:08,930
are asymmetric.

798
00:45:12,030 --> 00:45:14,720
And that they appear to be
producing more reads in one

799
00:45:14,720 --> 00:45:16,306
direction than the
other direction.

800
00:45:19,620 --> 00:45:21,684
And so this is all
computational analysis, right.

801
00:45:21,684 --> 00:45:23,350
But when you see
something like that you

802
00:45:23,350 --> 00:45:24,970
say, well gee, why
is that going on?

803
00:45:24,970 --> 00:45:30,130
Why is it that for NRF-1 the
left-hand side has a lot more

804
00:45:30,130 --> 00:45:33,960
reads than the right-hand side?

805
00:45:33,960 --> 00:45:38,280
Now, of course, the only reason
that we can produce an oriented

806
00:45:38,280 --> 00:45:42,530
profile like that is that the
NRF-1 motif is not palindromic,

807
00:45:42,530 --> 00:45:43,030
right.

808
00:45:43,030 --> 00:45:45,400
We can actually orient
it in the genome

809
00:45:45,400 --> 00:45:49,190
and so we know that the
more reads, in this case,

810
00:45:49,190 --> 00:45:50,850
are coming from
the five prime end

811
00:45:50,850 --> 00:45:55,070
than from the three prime end.

812
00:45:55,070 --> 00:45:56,890
So what do you think
would cause that?

813
00:45:56,890 --> 00:45:58,776
Does anybody have a--
when we first saw this,

814
00:45:58,776 --> 00:45:59,900
we didn't know what it was.

815
00:45:59,900 --> 00:46:01,858
But anybody have an idea
of what that could be?

816
00:46:08,440 --> 00:46:09,750
Oh, yes.

817
00:46:09,750 --> 00:46:12,060
AUDIENCE: It's the remodelers
that these transcription

818
00:46:12,060 --> 00:46:15,690
factors are calling in tend to
open the chromatin more on one

819
00:46:15,690 --> 00:46:17,440
side of the motif
than the other.

820
00:46:17,440 --> 00:46:19,160
PROFESSOR: Right,
so if the remodelers

821
00:46:19,160 --> 00:46:23,937
are working in some sort
of directional way, right.

822
00:46:23,937 --> 00:46:25,020
So that's what we thought.

823
00:46:25,020 --> 00:46:28,410
We didn't know whether
they were or not.

824
00:46:28,410 --> 00:46:35,470
And so we went back to our
assay and we tested the motifs,

825
00:46:35,470 --> 00:46:38,990
both in the forward and
the reverse direction.

826
00:46:38,990 --> 00:46:39,670
Right.

827
00:46:39,670 --> 00:46:41,070
To see whether or
not it mattered

828
00:46:41,070 --> 00:46:45,000
which way the motif
went into the construct,

829
00:46:45,000 --> 00:46:49,840
based upon selecting factors,
based upon a symmetry

830
00:46:49,840 --> 00:46:56,060
score that we computed for
their read profile, right?
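A minimal sketch of one way to score how lopsided a read profile is. The lecture does not give the formula, so this definition, (left - right) / (left + right) around the motif center, and the name asymmetry_score are hypothetical; the profile is assumed to be already oriented by motif strand.
```python
def asymmetry_score(profile):
    """(left - right) / (left + right) for reads on each side of the motif center."""
    center = len(profile) // 2
    left = sum(profile[:center])
    right = sum(profile[center:])
    total = left + right
    # An empty profile is treated as symmetric.
    return 0.0 if total == 0 else (left - right) / total
# Hypothetical aggregate profiles: one skewed toward the 5' side, one balanced.
skewed = asymmetry_score([3, 3, 1, 1])
balanced = asymmetry_score([1, 2, 2, 1])
```
Under this definition the score runs from -1 to 1, with positive values meaning more reads on the left (5') side.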

831
00:46:56,060 --> 00:47:06,190
And what we found was that, in
fact, it was the case that when

832
00:47:06,190 --> 00:47:10,840
the motif was properly
oriented it would turn on GFP

833
00:47:10,840 --> 00:47:14,790
and when it was in the other
direction it would not.

834
00:47:14,790 --> 00:47:18,670
So it appeared, for the
factors that we tested,

835
00:47:18,670 --> 00:47:24,620
that they did have directional
chromatin opening properties.

836
00:47:24,620 --> 00:47:26,120
And so that's an
interesting concept

837
00:47:26,120 --> 00:47:28,060
that you actually
can have chromatin

838
00:47:28,060 --> 00:47:31,400
being opened in one direction
but not the other direction,

839
00:47:31,400 --> 00:47:33,550
because it admits
the idea of some sort

840
00:47:33,550 --> 00:47:38,770
of genomic parentheses,
where you could imagine

841
00:47:38,770 --> 00:47:41,730
part of the genome
being accessible where

842
00:47:41,730 --> 00:47:43,350
the other part is not.

843
00:47:47,000 --> 00:47:54,100
And overall this led us to
classifying protein factors

844
00:47:54,100 --> 00:47:56,230
that are operating in
genome accessibility

845
00:47:56,230 --> 00:47:59,300
into three classes.

846
00:47:59,300 --> 00:48:01,660
Here shown as two, where
we have pioneers which

847
00:48:01,660 --> 00:48:05,530
are the things that
open up the genome,

848
00:48:05,530 --> 00:48:08,190
and settlers that follow
behind and actually

849
00:48:08,190 --> 00:48:12,380
bind in the regions where
the chromatin is open.

850
00:48:12,380 --> 00:48:15,780
That is, it's much more likely
that those factors are going

851
00:48:15,780 --> 00:48:18,630
to bind where the doors
of the rooms are open,

852
00:48:18,630 --> 00:48:20,740
and the pioneers are
the proteins that

853
00:48:20,740 --> 00:48:24,865
come along and open the doors,
in particular, chromatin

854
00:48:24,865 --> 00:48:25,365
domains.

855
00:48:27,910 --> 00:48:30,980
And there were a couple of other
tests that we wanted to do.

856
00:48:30,980 --> 00:48:36,600
We wanted to test whether
or not we could knock out

857
00:48:36,600 --> 00:48:43,260
this pioneering activity by
taking a pioneer and just

858
00:48:43,260 --> 00:48:45,880
only including its
DNA-binding domain

859
00:48:45,880 --> 00:48:47,840
and knocking out the
rest of its domain

860
00:48:47,840 --> 00:48:52,250
which might be operative
in doing this chromatin

861
00:48:52,250 --> 00:48:53,225
remodeling.

862
00:48:53,225 --> 00:48:54,930
And then we asked,
whether or not, when

863
00:48:54,930 --> 00:48:58,830
we expressed this sort
of poisoned pioneer,

864
00:48:58,830 --> 00:49:03,341
whether or not it would affect
the binding of nearby factors.

865
00:49:03,341 --> 00:49:06,050
And, in fact, when
you do express

866
00:49:06,050 --> 00:49:08,880
this sort of poisoned
pioneer, it does

867
00:49:08,880 --> 00:49:13,000
reduce the binding
of nearby factors.

868
00:49:13,000 --> 00:49:15,010
Here, we have a dominant
negative for NFYA

869
00:49:15,010 --> 00:49:16,830
and dominant negative for NRF1.

870
00:49:16,830 --> 00:49:23,780
It reduces the binding
of nearby factors.

871
00:49:23,780 --> 00:49:29,830
And finally, we wanted
to know, if we included

872
00:49:29,830 --> 00:49:33,820
a dominant negative for
the directional pioneer,

873
00:49:33,820 --> 00:49:37,100
if it actually
would preferentially

874
00:49:37,100 --> 00:49:39,730
affect the binding
of [INAUDIBLE] on one

875
00:49:39,730 --> 00:49:44,290
side of its binding
occurrences or the other side.

876
00:49:44,290 --> 00:49:46,210
And so we looked
at Myc sites that

877
00:49:46,210 --> 00:49:48,800
were oriented with
respect to NFYA.

878
00:49:48,800 --> 00:49:53,690
And when we add
the NFYA, you can

879
00:49:53,690 --> 00:49:59,780
see that it actually-- the
dominant negative NFYA-- when

880
00:49:59,780 --> 00:50:04,550
the Myc site is downstream of where
we think NFYA is opening up

881
00:50:04,550 --> 00:50:08,420
the chromatin, the binding
is substantially reduced.

882
00:50:08,420 --> 00:50:11,170
Whereas, when the
Myc site is not

883
00:50:11,170 --> 00:50:14,120
on the side where we think
that NFYA is opening,

884
00:50:14,120 --> 00:50:16,310
it doesn't really
have an effect.

885
00:50:16,310 --> 00:50:19,350
So this is further
confirmation of the idea

886
00:50:19,350 --> 00:50:22,330
that in vivo, these
factors are actually

887
00:50:22,330 --> 00:50:25,716
operating in a directional way.

888
00:50:25,716 --> 00:50:29,060
Now I tell you all
this because, you know,

889
00:50:29,060 --> 00:50:31,040
we do a lot of
computational analysis

890
00:50:31,040 --> 00:50:33,360
and it's important to
follow up and understand

891
00:50:33,360 --> 00:50:35,670
what the correlations tell us.

892
00:50:35,670 --> 00:50:37,240
So when you do
computational analysis

893
00:50:37,240 --> 00:50:40,507
and you see a very
interesting pattern,

894
00:50:40,507 --> 00:50:42,715
the thing to keep in mind
is, what kind of experiment

895
00:50:42,715 --> 00:50:46,340
can I design to
test whether or not

896
00:50:46,340 --> 00:50:48,270
my hypothesis is correct or not?

897
00:50:51,240 --> 00:50:56,910
We also did an analysis across
human and mouse data sets

898
00:50:56,910 --> 00:51:00,210
and found that
for a given motif,

899
00:51:00,210 --> 00:51:02,940
and thus, protein
family, it appeared

900
00:51:02,940 --> 00:51:04,980
that the chromatin
opening index was largely

901
00:51:04,980 --> 00:51:08,160
preserved, evolutionarily.

902
00:51:08,160 --> 00:51:10,910
So that there are
similar pioneers

903
00:51:10,910 --> 00:51:12,470
between human and mouse.

904
00:51:15,580 --> 00:51:20,000
Are there any questions
at all about the idea?

905
00:51:20,000 --> 00:51:22,590
So I told you, I mean, when you
go to a cocktail party tonight,

906
00:51:22,590 --> 00:51:25,390
you say hey, you know, did
you know that DNase-seq

907
00:51:25,390 --> 00:51:28,420
is this really cool technique
that not only tells you

908
00:51:28,420 --> 00:51:31,200
whether or not chromatin is
open or not, but, you know,

909
00:51:31,200 --> 00:51:32,390
where factors bind?

910
00:51:32,390 --> 00:51:34,900
And some of those factors
open up the chromatin itself

911
00:51:34,900 --> 00:51:38,780
and, plus, get this,
some of the factors only

912
00:51:38,780 --> 00:51:42,500
do it in one direction, right.

913
00:51:42,500 --> 00:51:44,500
That'd be a good
conversation starter, right?

914
00:51:44,500 --> 00:51:47,320
That'd be the end of
the conversation, no.

915
00:51:47,320 --> 00:51:49,550
You get the idea, right.

916
00:51:49,550 --> 00:51:52,655
So are there any questions
about DNase-1 seq analysis?

917
00:51:55,220 --> 00:51:55,940
Yes?

918
00:51:55,940 --> 00:51:58,928
AUDIENCE: A little unrelated,
but I was just wondering--

919
00:51:58,928 --> 00:52:04,420
in the literature where people
have identified factors that

920
00:52:04,420 --> 00:52:07,850
either directly reprogram
between different cell types,

921
00:52:07,850 --> 00:52:10,655
or go through some sort of
[INAUDIBLE] intermediate--

922
00:52:10,655 --> 00:52:11,280
PROFESSOR: Yes.

923
00:52:11,280 --> 00:52:13,488
AUDIENCE: There are a number
of transcription factors

924
00:52:13,488 --> 00:52:16,242
that have been
identified. [INAUDIBLE]

925
00:52:16,242 --> 00:52:17,661
but there are others.

926
00:52:17,661 --> 00:52:22,496
Do you often see, or always
see some of the pioneers

927
00:52:22,496 --> 00:52:24,400
that you've identified
in those cases.

928
00:52:24,400 --> 00:52:25,050
And then--

929
00:52:25,050 --> 00:52:25,410
PROFESSOR: Yes.

930
00:52:25,410 --> 00:52:27,076
AUDIENCE: And then,
a follow-up question

931
00:52:27,076 --> 00:52:29,695
would be, do you think that if
you took some of the pioneers

932
00:52:29,695 --> 00:52:32,120
that you generated that
were not known before

933
00:52:32,120 --> 00:52:35,922
and expressed them
in cell types,

934
00:52:35,922 --> 00:52:38,352
that they would open
up the chromatin

935
00:52:38,352 --> 00:52:40,550
sufficiently to potentially
reprogram the mistakes?

936
00:52:40,550 --> 00:52:41,258
PROFESSOR: Right.

937
00:52:41,258 --> 00:52:43,130
So the question
was, is it the case

938
00:52:43,130 --> 00:52:45,870
that known
reprogramming factors,

939
00:52:45,870 --> 00:52:47,596
at times are powerful pioneers?

940
00:52:47,596 --> 00:52:50,230
The answer is yes.

941
00:52:50,230 --> 00:52:53,130
The second question was,
now that you have a broader

942
00:52:53,130 --> 00:52:55,370
repertoire of pioneer
factors, and you

943
00:52:55,370 --> 00:52:59,230
can identify what they're
doing, is it possible to,

944
00:52:59,230 --> 00:53:02,580
in a principled way, engineer
the opening of chromatin

945
00:53:02,580 --> 00:53:05,375
by perhaps expressing those
factors to see whether or not

946
00:53:05,375 --> 00:53:07,625
you could match a particular
desired epigenetic state,

947
00:53:07,625 --> 00:53:09,950
let's say?

948
00:53:09,950 --> 00:53:12,500
Our preliminary results are yes
on the second count as well.

949
00:53:12,500 --> 00:53:16,430
That there appear to
be pioneer factors that

950
00:53:16,430 --> 00:53:18,490
operate, sort of at a
basal level that keep,

951
00:53:18,490 --> 00:53:23,525
sort of, the sort of usual
rooms open in the genome.

952
00:53:23,525 --> 00:53:25,150
And then there are
factors that operate

953
00:53:25,150 --> 00:53:27,600
in a lineage-specific
way.

954
00:53:27,600 --> 00:53:29,720
And when we express
lineage-specific pioneer

955
00:53:29,720 --> 00:53:34,160
factors, they don't completely
mimic but largely mimic

956
00:53:34,160 --> 00:53:35,650
the chromatin state
that's present

957
00:53:35,650 --> 00:53:41,560
in the corresponding
lineage committed cells.

958
00:53:41,560 --> 00:53:44,320
And so we think that for
principled reprogramming

959
00:53:44,320 --> 00:53:48,800
of cells, the basal level
of establishing matched

960
00:53:48,800 --> 00:53:51,300
open states is going to be
an interesting and important

961
00:53:51,300 --> 00:53:52,530
avenue to explore.

962
00:53:52,530 --> 00:53:54,620
Does that answer your question?

963
00:53:54,620 --> 00:53:55,120
Yeah.

964
00:53:57,830 --> 00:54:00,480
OK.

965
00:54:00,480 --> 00:54:09,880
So, now we're going to
turn to another-- well let

966
00:54:09,880 --> 00:54:13,190
me just first summarize
what I just told you about,

967
00:54:13,190 --> 00:54:14,920
which is that we
can predict where

968
00:54:14,920 --> 00:54:18,055
TFs bind from DNase-seq data.

969
00:54:18,055 --> 00:54:19,895
We can identify these
pioneer factors.

970
00:54:19,895 --> 00:54:21,580
Some of them are directional.

971
00:54:21,580 --> 00:54:24,860
And other factors follow
these pioneers and bind

972
00:54:24,860 --> 00:54:25,920
sort of in their wake.

973
00:54:25,920 --> 00:54:30,620
Where the pioneers actually
open up the chromatin.

974
00:54:30,620 --> 00:54:35,630
And returning to our
narrative arc for today,

975
00:54:35,630 --> 00:54:37,920
we've talked about the
idea of histone marks.

976
00:54:37,920 --> 00:54:40,510
We've talked about the
idea of chromatin openness

977
00:54:40,510 --> 00:54:42,030
and closeness.

978
00:54:42,030 --> 00:54:45,270
And now I'd like to talk about
the important question of how

979
00:54:45,270 --> 00:54:49,940
we can understand which
regulatory regions are

980
00:54:49,940 --> 00:54:51,435
regulating which genes.

981
00:54:54,030 --> 00:54:56,540
Now the traditional
way to approach this,

982
00:54:56,540 --> 00:55:02,770
is that if you have a regulatory
region, the thing that you do

983
00:55:02,770 --> 00:55:04,890
is you look for
the closest gene.

984
00:55:04,890 --> 00:55:11,100
And you go, aha, that's the one
that that regulatory region is

985
00:55:11,100 --> 00:55:13,060
controlling.

986
00:55:13,060 --> 00:55:15,040
This applies not only
for regulatory regions

987
00:55:15,040 --> 00:55:15,950
but for SNPs, right.

988
00:55:15,950 --> 00:55:19,760
If you find a SNP
or a polymorphism

989
00:55:19,760 --> 00:55:22,630
you are likely to
assume that it's

990
00:55:22,630 --> 00:55:25,910
regulating the closest gene.
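The closest-gene heuristic described here can be sketched in a few lines. This is only an illustration: the gene names and coordinates are made up, and using the transcription start site (TSS) as the reference point is an assumption.

```python
import bisect

def closest_gene(position, genes):
    """Return the name of the gene whose TSS is nearest to `position`.

    `genes` is a list of (tss_coordinate, name) tuples on one chromosome,
    sorted by coordinate. Ties go to the upstream gene.
    """
    coords = [tss for tss, _ in genes]
    i = bisect.bisect_left(coords, position)
    # Only the neighbors on either side of the insertion point can be nearest.
    candidates = genes[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda g: abs(g[0] - position))[1]

# Hypothetical annotation: three genes on one chromosome.
genes = [(10_000, "geneA"), (250_000, "geneB"), (900_000, "geneC")]

print(closest_gene(120_000, genes))  # a regulatory region or SNP at 120 kb
```

The heuristic uses only linear distance along the chromosome, which is why it can miss an enhancer that loops over to a more distant gene.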

991
00:55:25,910 --> 00:55:29,770
It could have an effect
on the closest gene.

992
00:55:29,770 --> 00:55:36,670
But there are other ways of
approaching that question

993
00:55:36,670 --> 00:55:39,260
with molecular protocols.

994
00:55:39,260 --> 00:55:45,220
And drawing you once again
a cartoon of genome looping,

995
00:55:45,220 --> 00:55:50,180
you can see how an enhancer is
coming in contact with the Pol

996
00:55:50,180 --> 00:55:52,420
II holoenzyme apparatus.

997
00:55:52,420 --> 00:55:56,080
And this enhancer will
include regulators

998
00:55:56,080 --> 00:56:00,920
that will cause Pol II
to begin transcription.

999
00:56:00,920 --> 00:56:05,590
And if somehow we could
capture these complexes

1000
00:56:05,590 --> 00:56:11,990
so that we could examine them
and figure out what bits of DNA

1001
00:56:11,990 --> 00:56:15,130
are associated with
one another, we

1002
00:56:15,130 --> 00:56:19,340
could map, directly, what
enhancers are controlling what

1003
00:56:19,340 --> 00:56:24,310
genes, when they're
active in this form.

1004
00:56:24,310 --> 00:56:30,490
So the essential idea of a
variety of different protocols,

1005
00:56:30,490 --> 00:56:36,960
whether it be protocols
like Hi-C or ChIA-PET

1006
00:56:36,960 --> 00:56:39,850
that we're going to
talk about, is the same.

1007
00:56:39,850 --> 00:56:43,570
The difference is that
in the case of ChIA-PET,

1008
00:56:43,570 --> 00:56:45,790
we're only going to look
at interactions that

1009
00:56:45,790 --> 00:56:48,780
are defined by a
particular protein.

1010
00:56:48,780 --> 00:56:51,460
So what we're going to do in
the slides I'm going to show you

1011
00:56:51,460 --> 00:56:53,625
today, is we're
going to only look

1012
00:56:53,625 --> 00:56:56,560
at interactions that are
mediated through RNA polymerase

1013
00:56:56,560 --> 00:56:57,940
II.

1014
00:56:57,940 --> 00:57:00,190
And those are particularly
interesting interactions

1015
00:57:00,190 --> 00:57:03,000
as you can see,
because they involve

1016
00:57:03,000 --> 00:57:06,360
actively transcribed genes.

1017
00:57:06,360 --> 00:57:09,860
So if we could capture all
the RNA polymerase II mediated

1018
00:57:09,860 --> 00:57:16,650
interactions, we'd
be in great shape.

1019
00:57:16,650 --> 00:57:23,070
So, we have a lot of very
talented biologists here.

1020
00:57:23,070 --> 00:57:29,100
So would anybody like to make
a suggestion for a protocol

1021
00:57:29,100 --> 00:57:32,075
for actually revealing
these interactions?

1022
00:57:34,910 --> 00:57:39,260
Does anybody have any ideas
how you'd go about that?

1023
00:57:39,260 --> 00:57:41,165
Or what enzyme
might be involved?

1024
00:57:44,980 --> 00:57:46,510
Any ideas?

1025
00:57:46,510 --> 00:57:49,953
Don't be bashful now.

1026
00:57:49,953 --> 00:57:50,452
Yes.

1027
00:57:50,452 --> 00:57:55,380
AUDIENCE: How about fixing
everything in place where it is

1028
00:57:55,380 --> 00:57:58,967
and then getting
[INAUDIBLE] through DNA.

1029
00:57:58,967 --> 00:57:59,550
PROFESSOR: OK.

1030
00:57:59,550 --> 00:58:02,160
Fixing everything
where it is in place.

1031
00:58:02,160 --> 00:58:03,090
That's good.

1032
00:58:03,090 --> 00:58:06,902
So we might cross link this
whole thing, for example.

1033
00:58:06,902 --> 00:58:07,800
OK.

1034
00:58:07,800 --> 00:58:11,710
And then any other
ideas what we would do?

1035
00:58:11,710 --> 00:58:13,817
That's done, this
protocol-- yes.

1036
00:58:13,817 --> 00:58:19,670
AUDIENCE: Well, [INAUDIBLE] that
you've going to be [INAUDIBLE].

1037
00:58:19,670 --> 00:58:23,736
And then digesting the
DNA that's coming out,

1038
00:58:23,736 --> 00:58:26,345
and then ligating
the pieces of DNA that

1039
00:58:26,345 --> 00:58:30,950
are closest together
in the sequence.

1040
00:58:30,950 --> 00:58:32,060
PROFESSOR: OK.

1041
00:58:32,060 --> 00:58:35,160
So I think what you're
suggesting goes something

1042
00:58:35,160 --> 00:58:36,365
like this.

1043
00:58:36,365 --> 00:58:38,680
All right.

1044
00:58:38,680 --> 00:58:44,770
Which is, that imagine that
we cross link those complexes

1045
00:58:44,770 --> 00:58:47,940
and we precipitate them.

1046
00:58:47,940 --> 00:58:55,310
And then what we do is we,
in a very dilute solution,

1047
00:58:55,310 --> 00:58:58,990
we ligate the DNA together.

1048
00:58:58,990 --> 00:59:02,430
And so we get two kinds
of ligation products.

1049
00:59:02,430 --> 00:59:05,060
On the left-hand side we
get self-ligation products

1050
00:59:05,060 --> 00:59:08,500
where a DNA molecule
ligates to itself.

1051
00:59:08,500 --> 00:59:12,200
And on the right-hand side we
get inter-ligation products,

1052
00:59:12,200 --> 00:59:17,960
where the piece of DNA
that the enhancer was on,

1053
00:59:17,960 --> 00:59:22,790
ligates to the pieces of DNA
that the RNA polymerase was

1054
00:59:22,790 --> 00:59:25,540
transcribing the gene on.

1055
00:59:25,540 --> 00:59:28,500
And those inter-ligation
bits of DNA,

1056
00:59:28,500 --> 00:59:32,330
the ones that are red and blue,
are really interesting, right.

1057
00:59:32,330 --> 00:59:35,287
Because they contain both
the enhancer sequence

1058
00:59:35,287 --> 00:59:36,370
and the promoter sequence.

1059
00:59:39,030 --> 00:59:44,690
And all we need to do now is
to sequence those molecules

1060
00:59:44,690 --> 00:59:52,080
from the ends and figure out
where they are in the genome.
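Once both ends of each molecule are mapped, separating self-ligation from inter-ligation products can be sketched as below. The 3,000 bp span cutoff and the function name `classify_pair` are illustrative assumptions, not values from the lecture.

```python
from collections import Counter

def classify_pair(pair, self_ligation_span=3000):
    """Label one paired-end tag as a 'self' or 'inter' ligation product.

    `pair` is ((chrom1, pos1), (chrom2, pos2)), the mapped locations of
    the two sequenced ends. Ends that map close together on the same
    chromosome are treated as one fragment ligated to itself.
    """
    (chrom1, pos1), (chrom2, pos2) = pair
    if chrom1 == chrom2 and abs(pos1 - pos2) <= self_ligation_span:
        return "self"
    return "inter"

# Hypothetical mapped read pairs.
pairs = [
    (("chr3", 34_650_000), ("chr3", 34_650_900)),  # ends 900 bp apart
    (("chr3", 34_650_000), ("chr3", 34_760_000)),  # ends 110 kb apart
    (("chr3", 34_650_000), ("chr7", 5_000_000)),   # different chromosomes
]

print(Counter(classify_pair(p) for p in pairs))
```

Only the pairs labeled `inter` carry information about which distant pieces of DNA were held together in the same complex.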

1061
00:59:52,080 --> 00:59:53,026
Yes?

1062
00:59:53,026 --> 00:59:56,428
AUDIENCE: How much variation
would there be in the sequence?

1063
00:59:56,428 --> 01:00:00,302
I guess I'm just wondering-- the
RNA polymerase is not static,

1064
01:00:00,302 --> 01:00:00,802
is it?

1065
01:00:00,802 --> 01:00:03,718
In terms of its interaction
with the enhancer and the gene.

1066
01:00:03,718 --> 01:00:07,262
I just don't know what
we would be capturing in this--

1067
01:00:07,262 --> 01:00:07,970
PROFESSOR: Right.

1068
01:00:07,970 --> 01:00:08,820
AUDIENCE: [INAUDIBLE]
doesn't just

1069
01:00:08,820 --> 01:00:10,590
touch at the beginning
and then [INAUDIBLE].

1070
01:00:10,590 --> 01:00:10,980
PROFESSOR: Right.

1071
01:00:10,980 --> 01:00:12,646
And I think that's a
very good question.

1072
01:00:12,646 --> 01:00:19,140
And in fact, a PhD thesis was
just written on this topic.

1073
01:00:19,140 --> 01:00:22,030
Which is, when you have
proteins that are moving down

1074
01:00:22,030 --> 01:00:24,590
the genome, in
some sense, you're

1075
01:00:24,590 --> 01:00:27,020
looking at a blurred picture.

1076
01:00:27,020 --> 01:00:31,540
So how do you
de-blur the picture

1077
01:00:31,540 --> 01:00:34,320
so that it's brought
sharply into focus?

1078
01:00:34,320 --> 01:00:38,340
And so you compute something
called a point spread function

1079
01:00:38,340 --> 01:00:43,210
which describes how things are
spread out down the genome.

1080
01:00:43,210 --> 01:00:46,480
And then you invert that to get
a more focused picture of where

1081
01:00:46,480 --> 01:00:49,810
the protein is actually,
primarily located.

1082
01:00:49,810 --> 01:00:50,650
But you're right.

1083
01:00:50,650 --> 01:00:52,714
Things like RNA
polymerase II are not

1084
01:00:52,714 --> 01:00:54,255
thought of as
point-binding proteins.

1085
01:00:54,255 --> 01:00:56,630
They're actually proteins
in motion most of the time

1086
01:00:56,630 --> 01:00:57,950
when they're doing their work.

1087
01:00:57,950 --> 01:01:01,950
AUDIENCE: [INAUDIBLE]
that it's polymerizing,

1088
01:01:01,950 --> 01:01:04,489
does that mean that it's
still continually bound

1089
01:01:04,489 --> 01:01:05,280
to the [INAUDIBLE]?

1090
01:01:05,280 --> 01:01:06,950
PROFESSOR: No.

1091
01:01:06,950 --> 01:01:08,660
Although, I don't
think we really

1092
01:01:08,660 --> 01:01:14,350
understand all of the
details of that mechanism.

1093
01:01:14,350 --> 01:01:17,440
But, suffice to say
that what I can do

1094
01:01:17,440 --> 01:01:20,090
is I can start showing
you data and from the data

1095
01:01:20,090 --> 01:01:25,040
we can try and
understand mechanism.

1096
01:01:25,040 --> 01:01:27,147
These are all great
questions, right.

1097
01:01:27,147 --> 01:01:27,646
Yes.

1098
01:01:27,646 --> 01:01:30,194
AUDIENCE: When we did the
digestion and ligation,

1099
01:01:30,194 --> 01:01:32,900
you're going to get a lot
of random ligation, right?

1100
01:01:32,900 --> 01:01:34,400
PROFESSOR: A lot
of random ligation?

1101
01:01:34,400 --> 01:01:37,225
AUDIENCE: Yeah, between DNA
sequences that aren't, I

1102
01:01:37,225 --> 01:01:38,890
guess, as close?

1103
01:01:38,890 --> 01:01:41,240
Or you shouldn't really be
ligating certain things?

1104
01:01:41,240 --> 01:01:44,850
PROFESSOR: Well, this picture is
a little bit deceiving, right?

1105
01:01:44,850 --> 01:01:48,020
Because there's actually another
complex just like the one

1106
01:01:48,020 --> 01:01:51,230
at the top, right
to its left, right?

1107
01:01:51,230 --> 01:01:56,620
And you could imagine those
things ligating together.

1108
01:01:56,620 --> 01:01:59,750
And so now you're going to
get ligation products that

1109
01:01:59,750 --> 01:02:01,190
are noise.

1110
01:02:01,190 --> 01:02:02,260
They don't mean anything.

1111
01:02:02,260 --> 01:02:04,267
AUDIENCE: Do you just
throw those out, I guess?

1112
01:02:04,267 --> 01:02:06,850
PROFESSOR: Well, the problem is,
you don't know which ones are

1113
01:02:06,850 --> 01:02:08,300
noise and which ones aren't.

1114
01:02:08,300 --> 01:02:09,920
Right?

1115
01:02:09,920 --> 01:02:13,260
Now, there are some clever
tricks you can play.

1116
01:02:13,260 --> 01:02:17,730
One clever trick is
to change the protocol

1117
01:02:17,730 --> 01:02:23,500
to do these kinds
of reactions, not

1118
01:02:23,500 --> 01:02:29,430
in solution, but
in some sort of gel

1119
01:02:29,430 --> 01:02:32,690
or other thing that
keeps the products apart.

1120
01:02:32,690 --> 01:02:35,770
The other thing you
can do is estimate

1121
01:02:35,770 --> 01:02:38,720
how bad the situation is.

1122
01:02:38,720 --> 01:02:40,330
And how might you do that?

1123
01:02:40,330 --> 01:02:44,650
What you do is, you
take one set of-- you

1124
01:02:44,650 --> 01:02:47,860
take your original preparation
and you split it into two.

1125
01:02:47,860 --> 01:02:48,810
OK.

1126
01:02:48,810 --> 01:02:54,180
And you color this one red and
this one blue using linkers,

1127
01:02:54,180 --> 01:02:55,310
right.

1128
01:02:55,310 --> 01:02:59,130
And then you put them together
and you do this reaction.

1129
01:02:59,130 --> 01:03:01,090
And then you ask,
how many molecules

1130
01:03:01,090 --> 01:03:03,550
have the red and the
blue linkers on them.

1131
01:03:03,550 --> 01:03:06,050
And then you know those are
bad ones because they actually

1132
01:03:06,050 --> 01:03:08,890
came from different
complexes, right.

1133
01:03:08,890 --> 01:03:13,770
And so by estimating the amount
of critical chimeric products

1134
01:03:13,770 --> 01:03:18,260
you get, from that split and
then recombined approach,

1135
01:03:18,260 --> 01:03:21,750
you can optimize the protocol to
reduce the chimeric production

1136
01:03:21,750 --> 01:03:22,430
rate.
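The split-and-recombine estimate can be sketched as follows. One assumption goes beyond what is said in the lecture: if the two halves are equal, a random ligation between complexes pairs a red linker with a blue one only half the time, so the sketch doubles the observed mixed-linker fraction. The function name and the counts are hypothetical.

```python
def estimate_chimera_rate(same_linker_pairs, mixed_linker_pairs):
    """Estimate the fraction of ligation products that joined two different complexes.

    Assumes the sample was split into two equal halves before linkers were
    added. Under that assumption a random inter-complex ligation joins a
    red and a blue linker with probability 1/2, so the mixed-linker
    fraction reveals only about half of the chimeras.
    """
    total = same_linker_pairs + mixed_linker_pairs
    if total == 0:
        raise ValueError("no ligation products counted")
    mixed_fraction = mixed_linker_pairs / total
    return min(1.0, 2.0 * mixed_fraction)

# Hypothetical counts: 100 of 1,000 sequenced molecules carry both linkers.
print(estimate_chimera_rate(same_linker_pairs=900, mixed_linker_pairs=100))
```

Tracking this number across protocol variants is one way to tell whether a change actually reduces chimeric ligation.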

1137
01:03:22,430 --> 01:03:26,010
Current chimeric production
rates are about 20%.

1138
01:03:26,010 --> 01:03:27,080
Something of that order.

1139
01:03:27,080 --> 01:03:27,580
OK.

1140
01:03:27,580 --> 01:03:30,070
It used to be 50%,
that's really bad.

1141
01:03:30,070 --> 01:03:31,050
OK.

1142
01:03:31,050 --> 01:03:38,520
So you can try
and optimize that.

1143
01:03:38,520 --> 01:03:42,090
Now, if the protocol
has these issues--

1144
01:03:42,090 --> 01:03:44,660
you have a moving protein
that was brought up here,

1145
01:03:44,660 --> 01:03:46,600
right, that you're
trying to capture.

1146
01:03:46,600 --> 01:03:51,210
You've got a lot of noise
coming from the background

1147
01:03:51,210 --> 01:03:55,340
of these reactions, right.

1148
01:03:55,340 --> 01:03:57,510
Why are we doing this?

1149
01:03:57,510 --> 01:04:00,870
Well, it's the only
game in town right now.

1150
01:04:00,870 --> 01:04:04,030
If you want to have
a mechanistic way

1151
01:04:04,030 --> 01:04:07,900
of understanding what enhancers
are communicating with what

1152
01:04:07,900 --> 01:04:11,960
genes, this and its
family-- I broadly

1153
01:04:11,960 --> 01:04:14,545
call this a family
of protocols--

1154
01:04:14,545 --> 01:04:16,640
is really the only way to go.

1155
01:04:16,640 --> 01:04:18,440
OK.

1156
01:04:18,440 --> 01:04:23,260
The interesting thing
is that when you do,

1157
01:04:23,260 --> 01:04:25,280
you get data like this.

1158
01:04:25,280 --> 01:04:27,540
And so, what you're
looking at here

1159
01:04:27,540 --> 01:04:29,880
is exactly the same
location in the genome.

1160
01:04:29,880 --> 01:04:33,590
It's about 600,000 bases
across from left to right.

1161
01:04:33,590 --> 01:04:34,375
OK.

1162
01:04:34,375 --> 01:04:38,270
And at the very bottom,
you see the SOX2 gene.

1163
01:04:38,270 --> 01:04:41,840
And you have three
different cellular states.

1164
01:04:41,840 --> 01:04:44,070
The top state--
our motor neurons

1165
01:04:44,070 --> 01:04:49,520
have been programmed through
the ectopic expression

1166
01:04:49,520 --> 01:04:52,240
of three transcription factors.

1167
01:04:52,240 --> 01:04:56,520
The second set of
interactions are motor neurons

1168
01:04:56,520 --> 01:04:59,800
that have been
produced by exposure

1169
01:04:59,800 --> 01:05:02,720
to small molecules
over a 7-day period.

1170
01:05:02,720 --> 01:05:05,030
And the bottom set
of interactions

1171
01:05:05,030 --> 01:05:09,824
are from mouse ES cells
that are pluripotent.

1172
01:05:09,824 --> 01:05:11,240
And what's interesting
is that you

1173
01:05:11,240 --> 01:05:16,560
can see how-- I'm
going to point here.

1174
01:05:16,560 --> 01:05:20,550
You can see here-- this is the
SOX2 gene down at the bottom.

1175
01:05:20,550 --> 01:05:24,580
And you can see here--
this regulatory region

1176
01:05:24,580 --> 01:05:30,460
is interacting heavily with
the SOX2 gene in the ES cell state.

1177
01:05:30,460 --> 01:05:35,360
And above here, I have
put SOX2 ChIP-seq data.

1178
01:05:35,360 --> 01:05:40,310
So you can actually see that
SOX2 is regulating itself.

1179
01:05:40,310 --> 01:05:46,320
And up here, we have the
same SOX2 gene locus.

1180
01:05:46,320 --> 01:05:50,710
And OLIG2 is a key regulator
of this motor neuron fate.

1181
01:05:50,710 --> 01:05:53,600
And you can see that it
appears that OLIG2 is now

1182
01:05:53,600 --> 01:05:56,490
regulating SOX2.

1183
01:05:56,490 --> 01:06:00,760
And we don't have as complete
dependence upon the SOX2 locus

1184
01:06:00,760 --> 01:06:02,760
as we had before.

1185
01:06:02,760 --> 01:06:05,730
And up here in the induced
motor neuron state,

1186
01:06:05,730 --> 01:06:08,180
LHX4 is one of the
reprogramming factors

1187
01:06:08,180 --> 01:06:12,560
and you can see how it is
interacting with SOX2 here

1188
01:06:12,560 --> 01:06:15,480
and over here.

1189
01:06:15,480 --> 01:06:19,890
So what this methodology
allows us to do,

1190
01:06:19,890 --> 01:06:25,380
is to tie these regulatory
regions to the genes

1191
01:06:25,380 --> 01:06:32,750
that they are regulating,
albeit with some issues.

1192
01:06:32,750 --> 01:06:37,876
So, we'll talk about the
issues in just a second.

1193
01:06:37,876 --> 01:06:45,290
Are there any questions at all
about the idea of capturing,

1194
01:06:45,290 --> 01:06:50,000
in essence, the folding of the
genome with this methodology

1195
01:06:50,000 --> 01:06:53,620
to link regulatory
regions to genes?

1196
01:06:56,830 --> 01:06:57,350
Yes?

1197
01:06:57,350 --> 01:06:58,646
AUDIENCE: I have a question.

1198
01:06:58,646 --> 01:07:02,678
So in each of
those charts you've

1199
01:07:02,678 --> 01:07:05,392
got arcs describing regions
that are interacting.

1200
01:07:05,392 --> 01:07:06,017
PROFESSOR: Yes.

1201
01:07:06,017 --> 01:07:07,450
AUDIENCE: Is that correct?

1202
01:07:07,450 --> 01:07:08,510
PROFESSOR: Yes.

1203
01:07:08,510 --> 01:07:12,290
The little loops underneath
are the actual read pairs

1204
01:07:12,290 --> 01:07:14,560
that came out of the sequencer.

1205
01:07:14,560 --> 01:07:16,780
And the green dotted
lines are the interactions

1206
01:07:16,780 --> 01:07:18,384
I'm suggesting are significant.

1207
01:07:21,040 --> 01:07:23,200
So I'm showing you
the raw data and I'm

1208
01:07:23,200 --> 01:07:28,360
showing you the hypothesized
or purported interactions

1209
01:07:28,360 --> 01:07:32,168
with the green dotted lines.

1210
01:07:32,168 --> 01:07:32,668
Right?

1211
01:07:35,500 --> 01:07:36,640
Right?

1212
01:07:36,640 --> 01:07:42,810
AUDIENCE: So how is your raw
sequencing data then transformed

1213
01:07:42,810 --> 01:07:46,720
into this set of interactions?

1214
01:07:46,720 --> 01:07:49,890
PROFESSOR: How is the raw
sequencing data-- remember

1215
01:07:49,890 --> 01:07:52,970
that what came out
of the protocol

1216
01:07:52,970 --> 01:07:59,500
were molecules on the
right-hand side that

1217
01:07:59,500 --> 01:08:05,307
had little bits of DNA from two
different places in the genome.

1218
01:08:05,307 --> 01:08:07,453
AUDIENCE: I'm
sorry, I meant, how

1219
01:08:07,453 --> 01:08:11,508
did you determine-- because
I'm assuming each of these arcs

1220
01:08:11,508 --> 01:08:14,615
has to have a single base start
site and a single base end

1221
01:08:14,615 --> 01:08:15,114
site.

1222
01:08:15,114 --> 01:08:15,580
PROFESSOR: Correct.

1223
01:08:15,580 --> 01:08:16,955
AUDIENCE: However,
your reads are

1224
01:08:16,955 --> 01:08:19,199
going to span-- your
joined paired reads are

1225
01:08:19,199 --> 01:08:21,620
going to span a number of bases.

1226
01:08:21,620 --> 01:08:23,451
So you have a number
of bases coming

1227
01:08:23,451 --> 01:08:25,457
from the red part
and a number of bases

1228
01:08:25,457 --> 01:08:26,540
coming from the blue part.

1229
01:08:26,540 --> 01:08:28,732
PROFESSOR: We've got
20, 20 something, yeah.

1230
01:08:28,732 --> 01:08:32,099
AUDIENCE: How do you determine
which of these red bases

1231
01:08:32,099 --> 01:08:34,504
and which of these blue
bases are your start

1232
01:08:34,504 --> 01:08:36,430
and end points for
the [INAUDIBLE].

1233
01:08:36,430 --> 01:08:38,830
PROFESSOR: Well, you are
looking at a 600,000 base pair

1234
01:08:38,830 --> 01:08:41,830
window of the
genome and we're not

1235
01:08:41,830 --> 01:08:43,813
quite at the resolution
of 28 bases yet.

1236
01:08:43,813 --> 01:08:44,354
AUDIENCE: OK.

1237
01:08:44,354 --> 01:08:46,460
PROFESSOR: So, you know--

1238
01:08:46,460 --> 01:08:49,960
AUDIENCE: So this is not
necessarily single base pair

1239
01:08:49,960 --> 01:08:52,044
resolution, but this
is a region resolution?

1240
01:08:52,044 --> 01:08:55,109
Is that correct?

1241
01:08:55,109 --> 01:08:57,300
PROFESSOR: Once
again, the question

1242
01:08:57,300 --> 01:09:00,580
of how to improve the spatial
resolution of these results

1243
01:09:00,580 --> 01:09:02,720
is a subject of active research.

1244
01:09:02,720 --> 01:09:05,990
And once again, you
can deconvolve things

1245
01:09:05,990 --> 01:09:08,590
like the shearing to
actually get things

1246
01:09:08,590 --> 01:09:12,779
down to within, say, 10 to
100 base pairs resolution.

1247
01:09:12,779 --> 01:09:13,667
AUDIENCE: OK.

1248
01:09:13,667 --> 01:09:14,250
PROFESSOR: OK?

1249
01:09:14,250 --> 01:09:15,770
AUDIENCE: Got it.

1250
01:09:15,770 --> 01:09:19,659
PROFESSOR: But you can't
identify the exact motif

1251
01:09:19,659 --> 01:09:21,620
that the things land on, right.

1252
01:09:21,620 --> 01:09:23,920
You can get in the
ballpark, so to speak, right.

1253
01:09:23,920 --> 01:09:27,970
You can figure out where
you need to look for motifs.

1254
01:09:27,970 --> 01:09:32,850
And so one thing
that we and others do

1255
01:09:32,850 --> 01:09:34,760
is look at these
regions and we ask

1256
01:09:34,760 --> 01:09:38,740
what motifs are present
in these regions.

1257
01:09:38,740 --> 01:09:42,060
Or if you have matched DNase-seq
data, you can go back

1258
01:09:42,060 --> 01:09:44,979
and you can say, aha,
I have DNase-seq data.

1259
01:09:44,979 --> 01:09:48,180
I have this data and
I know that there's

1260
01:09:48,180 --> 01:09:50,180
something going on at
that region of the genome.

1261
01:09:50,180 --> 01:09:51,971
What proteins do I
think are sitting there,

1262
01:09:51,971 --> 01:09:55,330
based upon the protection
profiles I see.

1263
01:09:55,330 --> 01:09:55,830
Right.

1264
01:09:55,830 --> 01:09:57,454
So you can take an
integrative approach

1265
01:09:57,454 --> 01:09:59,740
where you use different
data types to begin

1266
01:09:59,740 --> 01:10:02,035
to pick apart the
regulatory network.

1267
01:10:02,035 --> 01:10:05,600
Where you see the connections
directly molecularly,

1268
01:10:05,600 --> 01:10:08,070
and you see the
regulatory proteins

1269
01:10:08,070 --> 01:10:11,101
that are binding
at those locations.

1270
01:10:11,101 --> 01:10:11,600
OK?

1271
01:10:11,600 --> 01:10:13,225
Was that helpful?

1272
01:10:13,225 --> 01:10:13,725
Good.

1273
01:10:13,725 --> 01:10:14,557
Good questions.

1274
01:10:14,557 --> 01:10:15,390
Any other questions?

1275
01:10:15,390 --> 01:10:15,925
Yes?

1276
01:10:15,925 --> 01:10:17,550
AUDIENCE: Would you consider
Hi-C and 5C and all of those

1277
01:10:17,550 --> 01:10:19,059
to be the same
family of technique?

1278
01:10:19,059 --> 01:10:19,850
PROFESSOR: I would.

1279
01:10:19,850 --> 01:10:25,520
They're all sort of the same
family and they're improving.

1280
01:10:25,520 --> 01:10:28,860
I'm about to tell you why
this doesn't work very well.

1281
01:10:28,860 --> 01:10:32,770
But, that said, it's the
best thing we have going.

1282
01:10:32,770 --> 01:10:34,220
Right.

1283
01:10:34,220 --> 01:10:37,060
5C is not any to any.

1284
01:10:37,060 --> 01:10:38,920
It's one to any.

1285
01:10:38,920 --> 01:10:43,740
This protocol, when you do
one experiment with this,

1286
01:10:43,740 --> 01:10:47,170
it tells you all the interacting
regions in the genome.

1287
01:10:47,170 --> 01:10:49,070
Right.

1288
01:10:49,070 --> 01:10:51,170
I believe 5C-- help
me if I'm wrong.

1289
01:10:51,170 --> 01:10:52,900
You pick one anchor
location and then

1290
01:10:52,900 --> 01:10:54,290
you can tell all the
regions in the genome that

1291
01:10:54,290 --> 01:10:56,150
are interacting with
that anchor location.

1292
01:10:56,150 --> 01:10:57,150
AUDIENCE: Isn't that 3C?

1293
01:10:57,150 --> 01:10:58,137
PROFESSOR: What?

1294
01:10:58,137 --> 01:10:59,220
AUDIENCE: 3C's one to one.

1295
01:10:59,220 --> 01:11:00,916
4C's one to any.

1296
01:11:00,916 --> 01:11:01,790
AUDIENCE: And 5C is--

1297
01:11:01,790 --> 01:11:02,987
AUDIENCE: 5C's any to any.

1298
01:11:02,987 --> 01:11:04,435
PROFESSOR: And 5C's any to any?

1299
01:11:04,435 --> 01:11:04,935
OK.

1300
01:11:04,935 --> 01:11:06,580
I stand corrected.

1301
01:11:06,580 --> 01:11:07,280
Thank you.

1302
01:11:10,751 --> 01:11:11,250
Yeah.

1303
01:11:14,501 --> 01:11:15,000
OK.

1304
01:11:17,660 --> 01:11:20,950
You didn't critique
my bond type.

1305
01:11:20,950 --> 01:11:23,091
See I was trying to
get you and you didn't.

1306
01:11:23,091 --> 01:11:23,590
OK.

1307
01:11:26,920 --> 01:11:28,240
Any other questions about this?

1308
01:11:31,520 --> 01:11:32,810
OK.

1309
01:11:32,810 --> 01:11:34,940
What could go wrong?

1310
01:11:34,940 --> 01:11:35,964
What could go wrong?

1311
01:11:35,964 --> 01:11:37,630
Well, I can tell you
what will go wrong.

1312
01:11:37,630 --> 01:11:44,290
What will go wrong is that it
has a low true positive rate.

1313
01:11:44,290 --> 01:11:44,790
OK.

1314
01:11:47,860 --> 01:11:49,200
And how can you tell that?

1315
01:11:49,200 --> 01:11:53,080
You do the experiment
twice and you

1316
01:11:53,080 --> 01:11:55,680
get thousands of interactions
from each experiment in exactly

1317
01:11:55,680 --> 01:11:59,710
matched conditions and
there's a very small overlap

1318
01:11:59,710 --> 01:12:02,230
between the conditions.

1319
01:12:02,230 --> 01:12:04,830
Oops.

1320
01:12:04,830 --> 01:12:08,180
So, that's a pretty
big oops, right?

1321
01:12:08,180 --> 01:12:11,350
Because you would like it
to be the case that when

1322
01:12:11,350 --> 01:12:15,460
you do an experiment multiple
times, you get the same answer.

1323
01:12:15,460 --> 01:12:19,560
So let us just
suppose that you get

1324
01:12:19,560 --> 01:12:23,562
10,000 interactions
in experiment one.

1325
01:12:23,562 --> 01:12:27,380
10,000 interactions in
experiment two, but only

1326
01:12:27,380 --> 01:12:31,380
2,000 of them are the same.
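The replicate comparison described here can be sketched in a few lines. This is only an illustration: the function name `replicate_overlap`, the representation of an interaction as an unordered pair of anchor bins, and the synthetic bin numbers are all assumptions, not part of the lecture.

```python
def replicate_overlap(rep1, rep2):
    """Compare two replicate interaction lists.

    Each interaction is a pair of anchor bins; order within a pair is
    ignored, so (a, b) and (b, a) count as the same interaction.
    Returns (shared count, fraction of rep1 that is shared, Jaccard index).
    """
    set1 = {tuple(sorted(pair)) for pair in rep1}
    set2 = {tuple(sorted(pair)) for pair in rep2}
    shared = len(set1 & set2)
    union = len(set1 | set2)
    frac_of_rep1 = shared / len(set1) if set1 else 0.0
    jaccard = shared / union if union else 0.0
    return shared, frac_of_rep1, jaccard


# Synthetic data mimicking the numbers in the lecture:
# 10,000 interactions per replicate, 2,000 in common.
rep1 = [(i, i + 1) for i in range(0, 10_000)]
rep2 = [(i + 1, i) for i in range(8_000, 18_000)]  # reversed order on purpose
shared, frac, jaccard = replicate_overlap(rep1, rep2)
print(shared, frac, jaccard)
```

With these made-up inputs only a fifth of each replicate is reproduced, which is the kind of small overlap being described.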

1327
01:12:31,380 --> 01:12:33,750
What could possibly
be going wrong?

1328
01:12:37,190 --> 01:12:37,760
Any ideas?

1329
01:12:40,264 --> 01:12:42,430
If you're looking at the
data, what would you think?

1330
01:12:48,730 --> 01:12:50,080
Well?

1331
01:12:50,080 --> 01:12:50,580
Yeah?

1332
01:12:50,580 --> 01:12:52,371
AUDIENCE: [INAUDIBLE]
could be really high,

1333
01:12:52,371 --> 01:12:54,994
so you're just seeing
a couple of things

1334
01:12:54,994 --> 01:12:56,782
that are above the background.

1335
01:12:56,782 --> 01:12:58,130
And they don't necessarily--

1336
01:12:58,130 --> 01:12:58,838
PROFESSOR: Right.

1337
01:12:58,838 --> 01:13:00,500
So is it maybe
that, you know, it's

1338
01:13:00,500 --> 01:13:02,280
just tough to get
these interactions out.

1339
01:13:02,280 --> 01:13:06,162
And so you got a lot
of background trash.

1340
01:13:06,162 --> 01:13:07,620
And the things that
are significant

1341
01:13:07,620 --> 01:13:12,185
are tough to pick out.

1342
01:13:12,185 --> 01:13:12,685
Yeah?

1343
01:13:12,685 --> 01:13:16,509
AUDIENCE: Maybe it's a real
biological noise issue?

1344
01:13:16,509 --> 01:13:20,733
So rather than the technique,
actually any given time that

1345
01:13:20,733 --> 01:13:24,461
the interactions are so diverse
that when you take the snap

1346
01:13:24,461 --> 01:13:25,460
shot you can't--

1347
01:13:25,460 --> 01:13:26,518
PROFESSOR: I like
that explanation

1348
01:13:26,518 --> 01:13:28,180
because it's very pleasing
and makes me feel good.

1349
01:13:28,180 --> 01:13:29,470
And I would be hopeful
that that would be true

1350
01:13:29,470 --> 01:13:31,845
that there's enough biological
noise that that's actually

1351
01:13:31,845 --> 01:13:32,885
what I'm observing.

1352
01:13:32,885 --> 01:13:34,810
It doesn't make me feel
too warm and fuzzy,

1353
01:13:34,810 --> 01:13:36,830
but you know, I'd
go with that, right.

1354
01:13:40,194 --> 01:13:41,860
The other thing you
might think is, gee,

1355
01:13:41,860 --> 01:13:43,720
if we just sequenced
that library more,

1356
01:13:43,720 --> 01:13:46,670
we'd get more interactions
out of them, right?

1357
01:13:46,670 --> 01:13:49,120
So you go off and you compute
the library complexity

1358
01:13:49,120 --> 01:13:53,210
of your library and you go,
oops, that's not going to work.

1359
01:13:53,210 --> 01:13:55,980
There just isn't enough
diversity in the library.

1360
01:13:55,980 --> 01:13:59,080
Meaning that the underlying
biological protocol did not

1361
01:13:59,080 --> 01:14:02,400
produce enough of those
interesting inter-ligation

1362
01:14:02,400 --> 01:14:06,560
events to allow you to reveal
more information about what's

1363
01:14:06,560 --> 01:14:07,840
going on.

1364
01:14:07,840 --> 01:14:09,890
OK.

1365
01:14:09,890 --> 01:14:15,130
Now if I ask you to judge the
significance of an interaction

1366
01:14:15,130 --> 01:14:16,630
pair here.

1367
01:14:16,630 --> 01:14:19,360
Let's think about
this using what

1368
01:14:19,360 --> 01:14:22,060
we know already
from the subject.

1369
01:14:22,060 --> 01:14:23,290
OK.

1370
01:14:23,290 --> 01:14:25,375
So I'm going to draw a picture.

1371
01:14:28,480 --> 01:14:31,470
So I have my genome.

1372
01:14:31,470 --> 01:14:37,400
And let's just say that I have
a location, CA and a location CB

1373
01:14:37,400 --> 01:14:43,690
and I have a pile of ends that
wind up in those two locations.

1374
01:14:43,690 --> 01:14:45,200
OK.

1375
01:14:45,200 --> 01:14:51,080
And what I would like
to know is-- and I have,

1376
01:14:51,080 --> 01:14:56,230
let me just see what
variable I used for this.

1377
01:14:56,230 --> 01:15:01,900
And I have a certain number of
interactions between a and b.

1378
01:15:01,900 --> 01:15:05,840
That is I have a certain
number of reads that

1379
01:15:05,840 --> 01:15:09,080
cross between these two
locations in the genome.

1380
01:15:09,080 --> 01:15:11,511
And I'd like to know whether
or not this number of reads

1381
01:15:11,511 --> 01:15:12,135
is significant.

1382
01:15:14,860 --> 01:15:17,240
OK.

1383
01:15:17,240 --> 01:15:18,652
How could I estimate that?

1384
01:15:21,810 --> 01:15:24,500
Any ideas?

1385
01:15:24,500 --> 01:15:29,140
Oh, I'm also going
to tell you that n

1386
01:15:29,140 --> 01:15:38,160
is the total number
of read ends observed.

1387
01:15:42,700 --> 01:15:43,200
OK.

1388
01:15:46,040 --> 01:15:49,530
Well, here is the idea.

1389
01:15:49,530 --> 01:15:54,240
I've got n total
read ends, right?

1390
01:15:54,240 --> 01:15:57,770
I've got ca read ends here.

1391
01:15:57,770 --> 01:16:00,720
I've got cb read
ends here, and I

1392
01:16:00,720 --> 01:16:05,490
have iab that are overlapping.

1393
01:16:05,490 --> 01:16:07,890
So now, this is just our old
friend, the hypergeometric,

1394
01:16:07,890 --> 01:16:08,390
right.

1395
01:16:08,390 --> 01:16:11,290
We can ask what is the
probability of that happening

1396
01:16:11,290 --> 01:16:12,960
at random?

1397
01:16:12,960 --> 01:16:18,460
This many interactions or
more would happen at random.

1398
01:16:18,460 --> 01:16:20,510
And if it's very
unlikely, we would

1399
01:16:20,510 --> 01:16:23,330
reject the null hypothesis
and accept that there's really

1400
01:16:23,330 --> 01:16:25,730
an interaction going on here.

1401
01:16:25,730 --> 01:16:27,390
OK?

1402
01:16:27,390 --> 01:16:31,130
So, just to be more
precise about that.

1403
01:16:31,130 --> 01:16:32,380
This is what it looks like.

1404
01:16:32,380 --> 01:16:34,170
You've seen this before.

1405
01:16:34,170 --> 01:16:37,300
That the probability of
those interactions happening

1406
01:16:37,300 --> 01:16:40,610
on a null model, given a total
number of interactions, n, and

1407
01:16:40,610 --> 01:16:45,070
ca and cb is given by
the hypergeometric.

1408
01:16:45,070 --> 01:16:46,640
OK.

1409
01:16:46,640 --> 01:16:50,690
So that's one way of going
about assessing whether or not

1410
01:16:50,690 --> 01:16:52,672
the interactions we
see are significant.

1411
01:16:56,920 --> 01:17:00,156
Now, let me ask you a
slightly different question.

1412
01:17:00,156 --> 01:17:00,655
Right.

1413
01:17:04,290 --> 01:17:13,900
Imagine that I have-- and
I'm being very generous here.

1414
01:17:13,900 --> 01:17:20,330
Imagine that I have
two experiment-- that's

1415
01:17:20,330 --> 01:17:21,365
the wrong size bubbles.

1416
01:17:24,610 --> 01:17:25,870
I don't want to mislead you.

1417
01:17:28,772 --> 01:17:30,480
One of your friends
comes to you and says,

1418
01:17:30,480 --> 01:17:32,430
"I've done this
experiment twice."

1419
01:17:32,430 --> 01:17:34,690
Twice, OK.

1420
01:17:34,690 --> 01:17:41,900
"And each time I get
1,000 interactions."

1421
01:17:41,900 --> 01:17:46,600
So each one gives
you 1,000, let's say.

1422
01:17:46,600 --> 01:17:54,290
"And I have 900 that are common
between the two replicates."

1423
01:17:54,290 --> 01:17:57,150
And your friend says,
"how many interactions

1424
01:17:57,150 --> 01:18:02,190
do you think there
are in total?"

1425
01:18:02,190 --> 01:18:03,328
How could we estimate that?

1426
01:18:07,720 --> 01:18:11,570
Well, what's interesting
about this problem

1427
01:18:11,570 --> 01:18:19,470
is that what we're
asking is what's n?

1428
01:18:19,470 --> 01:18:19,970
Right.

1429
01:18:19,970 --> 01:18:22,670
What's the total number of
interactions of which we're

1430
01:18:22,670 --> 01:18:26,810
observing this set and this set
of which 900 is overlapping.

1431
01:18:26,810 --> 01:18:29,860
There's the
hypergeometric again.

1432
01:18:29,860 --> 01:18:36,420
So all we need to do is to find
the maximum value, the best

1433
01:18:36,420 --> 01:18:43,800
value for n that predicts the
observed overlap given that we

1434
01:18:43,800 --> 01:18:49,460
have two experiments
with m

1435
01:18:49,460 --> 01:18:54,510
and n different observations,
and we have an overlap of k.

1436
01:18:54,510 --> 01:18:55,335
OK.

1437
01:18:55,335 --> 01:18:57,820
Does that make
sense to everybody?

1438
01:18:57,820 --> 01:19:02,760
Of how to estimate the
total number of interactions

1439
01:19:02,760 --> 01:19:06,016
out there, making
the assumption that

1440
01:19:06,016 --> 01:19:07,199
they're all equally likely.

1441
01:19:10,990 --> 01:19:12,916
Any questions about that at all?

1442
01:19:19,540 --> 01:19:20,440
OK.

1443
01:19:20,440 --> 01:19:26,670
And, just so you know, you can
approximate this, this way.

1444
01:19:26,670 --> 01:19:30,710
Which is that the maximum
likelihood estimate

1445
01:19:30,710 --> 01:19:32,690
of the total number
of interactions

1446
01:19:32,690 --> 01:19:35,280
is approximately
m times n over k,

1447
01:19:35,280 --> 01:19:38,780
as seen by the
approximation on the bottom.
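As a rough sketch of this estimate: with two replicates of sizes m and n and an overlap of k, the total is approximately the product of the two sizes divided by the overlap. The code below writes the unknown total as `N` to keep it apart from the replicate size n. The function names are hypothetical, and the exact search assumes, as the lecture does, that every interaction is equally likely to be observed.

```python
def approx_total(m, n, k):
    """Closed-form approximation to the total number of interactions."""
    if k <= 0:
        raise ValueError("need at least one shared interaction")
    return m * n / k


def mle_total(m, n, k):
    """Integer N that maximizes the hypergeometric likelihood of
    seeing an overlap of k between samples of size m and n.

    The likelihood ratio L(N + 1) / L(N) equals
    (N+1-m)(N+1-n) / ((N+1)(N+1-m-n+k)), so we step N upward while
    that ratio is greater than 1. Ties go to the smaller N.
    """
    if k <= 0:
        raise ValueError("need at least one shared interaction")
    N = m + n - k  # smallest total consistent with the observations
    while (N + 1 - m) * (N + 1 - n) > (N + 1) * (N + 1 - m - n + k):
        N += 1
    return N


# The friend's experiment: 1,000 interactions in each replicate, 900 shared.
print(approx_total(1000, 1000, 900), mle_total(1000, 1000, 900))

# The earlier, less reproducible case: 10,000 in each, 2,000 shared.
print(approx_total(10_000, 10_000, 2_000), mle_total(10_000, 10_000, 2_000))
```

The contrast between the two cases is the point: a 90 percent overlap suggests the replicates have seen most of what is there, while a 20 percent overlap suggests most interactions have not been observed at all.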

1448
01:19:38,780 --> 01:19:40,780
OK?

1449
01:19:40,780 --> 01:19:43,850
Just so that you can
approximate how many things

1450
01:19:43,850 --> 01:19:45,810
are out there that you
haven't seen when you've

1451
01:19:45,810 --> 01:19:48,460
done a couple of replicates.

1452
01:19:48,460 --> 01:19:50,200
OK, you guys have
been totally great.

1453
01:19:50,200 --> 01:19:52,033
We've talked about a
lot of different things

1454
01:19:52,033 --> 01:19:54,830
today in chromatin
architecture and structure.

1455
01:19:54,830 --> 01:19:57,365
Sort of the DC to light
version of chromatin structure

1456
01:19:57,365 --> 01:20:00,670
and architecture lecture.

1457
01:20:00,670 --> 01:20:03,720
Next time we're going
to talk about building

1458
01:20:03,720 --> 01:20:05,677
genetic models of eQTLs.

1459
01:20:05,677 --> 01:20:07,135
And the time after
that we're going

1460
01:20:07,135 --> 01:20:09,220
to talk about human genetics.

1461
01:20:09,220 --> 01:20:10,380
Thank you so much.

1462
01:20:10,380 --> 01:20:11,730
Have a great, long weekend.

1463
01:20:11,730 --> 01:20:13,850
We'll see you next Thursday.