The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: We're going to finish up a little bit from last time on gene regulatory networks, see how the different methods that we looked at compared, and then we'll dive into protein interaction networks. Were there any questions from last time? OK. Very good.

So recall that we started off with the DREAM challenge, in which they provided unlabeled data representing gene expression measurements, either for a completely synthetic case, in silico data, or for three different actual experiments: one in E. coli, one in S. cerevisiae, and one in S. aureus. For some of those, it was straight expression data under different conditions; in other cases, there were actual knock-down experiments or other kinds of perturbations.
Then they gave that data out to the community and asked people to use whatever methods they wanted to try to rediscover the gene regulatory networks automatically. With some preliminary analysis, we saw that there were a couple of main clusters of kinds of analyses that all had similar properties across these data sets. There were the Bayesian networks, which we've now discussed in two separate contexts. Then we looked at regression-based techniques and mutual-information-based techniques. And there were a bunch of other kinds of approaches, some of which actually combine multiple predictors from different kinds of algorithms together. They evaluated how well each of these did on all the different data sets.

So first, the results on the in silico data, which they're showing as an area under the precision-recall curve. Obviously, higher numbers are better here. In this first group over here are the regression-based techniques, then mutual information, correlation, and Bayesian networks, and then "other": things that didn't fall into any of those particular categories.
"Meta" were techniques that use more than one class of prediction and then develop their own prediction based on those individual techniques. Then they defined something that they call the community prediction, in which they combine data from many of the different techniques together with their own algorithms to come up with what they call the "wisdom of the crowds." And R represents a random collection of other predictions.

You can see that on these in silico data, the performances don't differ dramatically from one another. Within each class, if you look at the best performer, they're all sort of in the same league, though obviously some of the classes do consistently better. Now, their point in their analysis is about the wisdom of the crowds: that taking all these data together, even including some of the bad ones, is beneficial. That's not the main thing I wanted to get out of these data for our purposes. But notice that for these in silico data, the area under the curve is about 30-something percent.
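As a rough sketch of how this kind of scoring works (not code from the challenge itself): the area under the precision-recall curve for a ranked list of predicted regulatory edges can be approximated by average precision. The transcription factors, targets, and gold-standard network below are invented for illustration.

```python
# Sketch: scoring a ranked list of predicted regulatory edges against a
# gold-standard network using average precision, a common estimate of
# the area under the precision-recall curve (AUPR).

def aupr(ranked_edges, true_edges):
    """Average precision: each true edge found at rank k contributes
    the precision at that rank; the sum is normalized by the number
    of true edges."""
    tp = 0
    area = 0.0
    for k, edge in enumerate(ranked_edges, start=1):
        if edge in true_edges:
            tp += 1
            area += tp / k  # precision at this recall step
    return area / len(true_edges)

# Hypothetical predictions, highest-confidence edge first
predictions = [("TF1", "geneA"), ("TF2", "geneB"), ("TF1", "geneC"),
               ("TF3", "geneD"), ("TF2", "geneE")]
gold = {("TF1", "geneA"), ("TF1", "geneC"), ("TF2", "geneE")}

print(round(aupr(predictions, gold), 3))  # → 0.756
```

A perfect ranking (all true edges first) gives 1.0, which is why the single-digit percentages on the real data discussed below are so striking.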
Now this is the first real experimental data we'll look at: the E. coli data. And notice the change of scale: the best performer is achieving less than 10% of the optimal result. So you can see that the real data are much, much harder than the in silico data. And here the performance varies quite a lot. You can see that the Bayesian networks are struggling compared to some of the other techniques; the best of those doesn't really get close to the best of some of these other approaches.

What they did next was to take some of the predictions from their community predictions, built off of all these other data, and actually test them. So they built regulatory networks for E. coli and for S. aureus, and then they did some experiments to test them. I think the results overall are kind of encouraging, in the sense that if you focus on the top pie chart here, for about half of all the things that they tested, they could get some support. In some cases, it was very strong support; in other cases, it wasn't quite as good.
So the glass is half empty or half full. But one of the interesting things is that the data are quite variable over the different predictions that they make. Each one of these circles represents a regulator and the things that they claim are targets of that regulator. Things in blue were confirmed by their experiments, and the things in blue with black outlines are the controls, which they knew would be right. You can see that for PurR, they do very well. For some of these others, they do mediocre. But there are some, which they're honest enough to admit, that they do very poorly on: they didn't get any of their predictions right for this regulator. This probably reflects the kind of data that they had, in terms of what conditions were being tested.

So, so far, things look reasonable. I think the real shocker of this paper does not appear in the abstract or the title, but it is in one of the main figures, if you pay attention. These were the results for the in silico data, where everything looked pretty good. With the change of scale for E. coli, there's some variation.
But you can still make arguments for the methods. These, though, are the results for Saccharomyces cerevisiae. This is the organism, yeast, on which most of the gene regulatory algorithms were originally developed. People actually built careers off of saying how great their algorithms were at reconstructing these regulatory networks. And when we look at these completely blinded data, where people don't know what they're looking for, you can see that the actual results are rather terrible. The area under the curve is in the single digits of percentage, and it doesn't seem to matter what algorithm they're using: they're all doing very badly. And the community predictions are no better, in some cases worse, than the individual ones. So this is really a stunning result. It's there in the data, and if you dig into the supplement, they actually explain what's going on, I think, pretty clearly.

Remember that all of these predictions are being made by looking for a transcriptional regulator whose own expression increases or decreases.
And that change in its own expression is predictive of its targets. The hypothesis is that when you have more of an activator, you'll have more of its targets coming on; when you have less of an activator, you'll have less of the targets. And you look through all the data, whether by Bayesian networks or regression, to find those kinds of relationships.

Now, what if those relationships don't actually exist in the data? That's what this chart shows. The green curves are pairs of genes that have no known regulatory relationship with each other; they're measuring the correlation, across all the data sets, between such pairs. The purple are pairs that are targets of the same transcription factor, and the orange are pairs where one is the activator or repressor of the other. In the in silico data, there's a very nice spread between the green, the orange, and the purple. The co-regulated genes are very highly correlated with each other.
The ones in parent-child relationships, a regulator and its target, have a pretty good correlation, much different from the distribution that you see for the things that are not interacting. On these data, the algorithms do their best. Then you look at the E. coli data, and you can see that in E. coli the curves are much closer to each other, but there's still some spread. But when you look at yeast, again, where a lot of these algorithms were developed, you can see there's almost no difference among the correlations for things that have no relationship to each other, things that are co-regulated by the same regulatory protein, and those parent-child relationships. They're all quite similar. And it doesn't matter whether you use correlation analysis or mutual information. Over here, in this right-hand panel, they've blown up the bottom part of the curve, and you can see how similar these are. So again, this is the mutual information spread for the in silico data, for E. coli, and then for yeast. OK.
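The two pairwise statistics being compared here can be sketched concretely. This is not the paper's code; the expression profiles of the hypothetical regulator and target below are invented, and the mutual information estimator is the crudest possible (equal-width binning).

```python
# Sketch: Pearson correlation and binned mutual information between the
# expression profiles of two genes across conditions.

import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def mutual_information(x, y, bins=2):
    """MI in bits, with equal-width binning of each profile."""
    def binned(v):
        lo, hi = min(v), max(v)
        return [min(int((a - lo) / (hi - lo) * bins), bins - 1) for a in v]
    bx, by = binned(x), binned(y)
    n = len(x)
    mi = 0.0
    for i in range(bins):
        for j in range(bins):
            pxy = sum(1 for a, b in zip(bx, by) if a == i and b == j) / n
            px, py = bx.count(i) / n, by.count(j) / n
            if pxy > 0:
                mi += pxy * math.log2(pxy / (px * py))
    return mi

# A regulator and a hypothetical target that tracks it closely
tf     = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
target = [1.1, 2.2, 2.9, 4.3, 5.1, 5.8]
print(round(pearson(tf, target), 2))         # close to 1.0
print(mutual_information(tf, target) > 0.5)  # strongly dependent
```

The point of the figure is that in the yeast data, these statistics for true regulator-target pairs look almost identical to those for unrelated pairs, so no method built on them can separate the classes.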
So what I think we can say about expression analysis is that expression data are very, very powerful for some things and rather poor for other applications. They're very powerful for classification and clustering; we saw that earlier. Now, what those clusters mean, that's the inference problem they're trying to solve now. And the expression data are not sufficient to figure out what the regulatory proteins are that are causing those sets of genes to be co-expressed, at least not in yeast. And I think there's every expectation that if you did the same thing in humans, you would get the same result.

So the critical question is: if you do want to build models of how regulation is taking place in organisms, what do you do? The answer is that you need some other kind of data. One thing you might ask, if we go back to the core of the analysis, is: what's wrong? Why is it that these gene expression levels cannot be used to predict the regulatory networks? And it comes down to whether mRNA levels are predictive of protein levels.
A couple of groups have looked into this. One of the earlier studies was this one, from 2009, where they used microarray data and looked at mRNA expression levels versus protein levels. And what do you see? You see that there is a trend: the R squared is around 0.2, but there's a huge spread. For any position on the x-axis, a particular level of mRNA, you can have a thousand-fold variation in the protein levels.

A lot of people saw this and said, well, we know there are problems with microarrays; they're not really great at quantifying mRNA levels, particularly low ones. So maybe this will all get better if we use RNA-Seq. That turns out not to be the case. There was a very careful study published in 2012, where the group used microarray data, RNA-Seq data, and a number of different ways of calling the proteomics data. You might say, well, maybe some of the problem is that you're not doing a very good job of inferring protein levels from mass spec data. So they tried a whole bunch of different ways of calling the mass spec data.
You should focus on the numbers in these columns for the average and the best correlations between the RNA data, in the columns, and the proteomic data, in the rows. And you can see that in the best-case scenario, you can get these up to a 0.54 correlation, which is still pretty weak.

So what's going on? What we've been focusing on is the idea that RNA levels are going to be very well correlated with protein levels, and I think a lot of the literature is based on hypotheses that are almost identical. But in reality, of course, there are a lot of processes involved. There's the process of translation, which has a rate associated with it and regulatory steps associated with it. And then there are degradation pathways: the RNA gets degraded at some rate, and the protein gets degraded at some rate. Sometimes those rates are regulated, sometimes they're not; sometimes it depends on the sequence. So what would happen if you actually measured what's going on?
That was done in this 2011 paper, where the group used a labeling technique for proteins to [INAUDIBLE] and measure steady-state levels of proteins, then labeled the proteins at specific times and saw how much newly synthesized protein there was at various times. Similarly for RNA, they used a technology that allowed them to separate newly synthesized transcripts from the bulk RNA. Once you have those data, you can find out what the spread is in the half-lives of proteins and in the abundance of proteins.

If you focus on the left-hand side, these are the measured half-lives for various RNAs, in blue, and proteins, in red. If you look at the spread in the red ones, you've got at least three orders of magnitude of range in protein half-lives. That's really at the heart of why RNA levels are so poorly predictive of protein levels: there's such a range in protein stability. The RNAs also spread over probably one or two orders of magnitude in stability. And then here are the abundances.
You can see that the range of abundance of proteins, in average copies per cell, is extremely large, from 100 to 10^8 copies per cell. Now, if you plot the protein half-lives against the RNA half-lives, you can see there's no correlation. These are completely independent processes that determine whether an RNA is degraded or a protein is degraded.

So when you try to figure out the relationship between RNA levels and protein levels, you really have to resort to a set of differential equations to map out what all the rates are. If you know all those rates, then you can estimate what the relationships will be. And they did exactly that. These charts show what they inferred to be the contribution of each of these components to protein levels. On the left-hand side are the results from the cells for which they had the most data, where they built a model on the same cells from which they collected the data. In these cells, the RNA levels account for about 40% of the variance in protein levels.
The biggest thing that affects the abundance of proteins is the rate of translation. They then took the model built from one set of cells and tried to use it to predict outcomes in another set of cells, in replicate, and the results are quite similar. They also did it for an entirely different cell type. In all of these cases, the precise amounts vary, but you can see that the red bars, which represent the amount of information contributed by the RNA, account for less than about half of what you can get from the other sources. So this gets back to why it's so hard to infer regulatory networks solely from RNA levels.

This is the plot they get when they compare protein levels and RNA levels experimentally. Again, you see that big spread, with an R squared of about 0.4, which at the time they were very proud of; they write several times in the article that this is the best anyone has seen to date. But if you incorporate all these other pieces of information about RNA stability and protein stability, you can actually get a very, very good correlation.
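The kind of rate model being described can be sketched as a pair of linear differential equations per gene: dR/dt = v_tx − k_rdeg·R for the mRNA and dP/dt = k_tl·R − k_pdeg·P for the protein. All the rate constants below are invented; the point is that two genes with identical mRNA dynamics can have very different protein levels once translation and degradation rates differ.

```python
# Sketch: one-mRNA, one-protein rate model with synthesis and degradation,
# integrated by simple Euler steps to (near) steady state.

def simulate(v_tx, k_rdeg, k_tl, k_pdeg, t_end=200.0, dt=0.01):
    """Euler integration of dR/dt = v_tx - k_rdeg*R,
                            dP/dt = k_tl*R - k_pdeg*P."""
    R = P = 0.0
    for _ in range(int(t_end / dt)):
        R += dt * (v_tx - k_rdeg * R)
        P += dt * (k_tl * R - k_pdeg * P)
    return R, P

# Two genes with identical mRNA dynamics but different protein turnover
R1, P1 = simulate(v_tx=2.0, k_rdeg=0.2, k_tl=1.0, k_pdeg=0.5)
R2, P2 = simulate(v_tx=2.0, k_rdeg=0.2, k_tl=1.0, k_pdeg=0.05)

print(round(R1, 1), round(R2, 1))  # same mRNA steady state in both
print(round(P1, 1), round(P2, 1))  # ~10x difference in protein level
```

At steady state, R* = v_tx/k_rdeg and P* = k_tl·R*/k_pdeg, so without knowing the per-gene translation and degradation rates, the mRNA level alone under-determines the protein level.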
So once you know the variation in protein stability and RNA stability for each and every protein and RNA, then you can do a good job of predicting protein levels from RNA levels. But without all those data, you can't. Any questions on this?

So what are we going to do then? We really have two primary options. We can try to explicitly model all of these regulatory steps, include them in our predictive models, and build up gene regulatory networks and protein models that actually incorporate all those different kinds of data; we'll see that in just a minute. The other thing we can try to do is, rather than focus on what's downstream of RNA synthesis, the protein levels, focus on what's upstream of RNA synthesis and look at what the production of RNAs, which RNAs are getting turned on and off, tells us about the signaling pathways and the transcription factors.
That's going to be the topic of one of the upcoming lectures, in which Professor Gifford will look at variations in epigenomic data and use them to identify sequences that reveal which regulatory proteins are bound under certain conditions and not others. Questions? Yeah?

AUDIENCE: In a typical experiment, for how many mRNAs or proteins can the rate constants be estimated?

PROFESSOR: The question was how many rate constants you can estimate in a typical experiment. I should say, first of all, these are not typical experiments. Very few people do this kind of analysis; it's actually very time consuming and very expensive. In this one, I'll get the numbers roughly wrong, but it was thousands, some decent fraction of the proteome, but not the entire one. Most of the data sets in papers you'll read do not include any analysis of stability rates or degradation rates. They only look at the bulk abundance of the RNAs. Other questions? OK.
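To make the rate-constant estimation concrete, here is a toy illustration (not from any of the papers discussed) of how a single degradation rate constant, and hence a half-life, could be fit from a pulse-labeling time course, assuming simple exponential decay. The time points and measurements are invented.

```python
# Sketch: fit a first-order decay rate k from labeled-molecule levels
# over time, via least squares on log(level) = log(L0) - k*t.
# Half-life then follows as t_half = ln(2) / k.

import math

def fit_decay_rate(times, levels):
    """Slope of a least-squares line through (t, log level); returns k."""
    logs = [math.log(v) for v in levels]
    n = len(times)
    mt, ml = sum(times) / n, sum(logs) / n
    slope = (sum((t - mt) * (l - ml) for t, l in zip(times, logs))
             / sum((t - mt) ** 2 for t in times))
    return -slope

# Labeled protein remaining at 0, 2, 4, 8 hours (synthetic, true k = 0.1/h)
times = [0.0, 2.0, 4.0, 8.0]
levels = [math.exp(-0.1 * t) for t in times]

k = fit_decay_rate(times, levels)
print(round(k, 3))                # 0.1 per hour
print(round(math.log(2) / k, 1))  # half-life of about 6.9 hours
```

Doing this genome-wide is exactly what makes these experiments so expensive: you need a full time course per molecule, for thousands of mRNAs and proteins.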
So in that upcoming lecture, we're going to actually try to go backwards. We're going to say, we see these changes in RNA; what does that tell us about which regulatory regions of the genome were active or not? And then you can go upstream from that and try to figure out the signaling pathways. So if I know the changes in RNA, I can deduce, as we'll see in that upcoming lecture, the sequences, the identity of the DNA-binding proteins. And then I can try to figure out what the signaling pathways were that drove those changes in gene expression.

Later in this lecture, we'll talk about the network modeling problem: assuming you knew these transcription factors, what could you do to infer this network? But before we get to that, I'd like to talk about an interesting modeling approach that tries to take into account all these degradation pathways, looks specifically at each kind of regulation as an explicit step in the model, and sees how that copes with some of these issues. So this is work from Josh Stuart. One of the first papers is here; we'll look at some later ones as well.
423 00:17:20,010 --> 00:17:22,720 And the idea here is to explicitly, as I said, 424 00:17:22,720 --> 00:17:25,130 deal with many, many different steps in regulation 425 00:17:25,130 --> 00:17:28,860 and try to be quite specific about what kinds of data 426 00:17:28,860 --> 00:17:32,750 are informing about what step in the process. 427 00:17:32,750 --> 00:17:35,120 So we measure the things in the bottom 428 00:17:35,120 --> 00:17:39,300 here-- arrays that tell us how many copies of a gene 429 00:17:39,300 --> 00:17:41,282 there are in the genome, especially in cancer. 430 00:17:41,282 --> 00:17:42,740 And you can get big changes of what 431 00:17:42,740 --> 00:17:45,510 are called copy number, amplifications, or deletions 432 00:17:45,510 --> 00:17:46,970 of large chunks of chromosomes. 433 00:17:46,970 --> 00:17:49,250 You need to take that into account. 434 00:17:49,250 --> 00:17:52,120 All the RNA-Seq and microarrays that we were talking about 435 00:17:52,120 --> 00:17:53,620 in measuring transcription levels-- 436 00:17:53,620 --> 00:17:54,870 what do they actually tell us? 437 00:17:54,870 --> 00:17:56,590 Well, they give us some information 438 00:17:56,590 --> 00:17:58,370 about what they're directly connected to. 439 00:17:58,370 --> 00:18:01,195 So the transcriptomic data tells something about the expression 440 00:18:01,195 --> 00:18:01,930 state. 441 00:18:01,930 --> 00:18:04,380 But notice they have explicitly separated the expression 442 00:18:04,380 --> 00:18:07,370 state of the RNA from the protein level. 443 00:18:07,370 --> 00:18:08,870 And they separated the protein level 444 00:18:08,870 --> 00:18:10,560 from the protein activity. 445 00:18:10,560 --> 00:18:12,657 And they have these little black boxes in here 446 00:18:12,657 --> 00:18:14,740 that represent the different kinds of regulations. 
447 00:18:14,740 --> 00:18:18,504 So however many copies of a gene you have in the genome, 448 00:18:18,504 --> 00:18:20,920 there's some regulatory event, transcriptional regulation, 449 00:18:20,920 --> 00:18:22,545 that determines how much expression you 450 00:18:22,545 --> 00:18:24,219 get at the mRNA level. 451 00:18:24,219 --> 00:18:25,760 There's another regulatory event here 452 00:18:25,760 --> 00:18:28,100 that determines at what rate those RNAs are 453 00:18:28,100 --> 00:18:29,890 turned into proteins. 454 00:18:29,890 --> 00:18:31,674 And there are other regulatory steps here 455 00:18:31,674 --> 00:18:33,340 that have to do with signaling pathways, 456 00:18:33,340 --> 00:18:35,589 for example, that determine whether those proteins are 457 00:18:35,589 --> 00:18:36,650 active or not. 458 00:18:36,650 --> 00:18:39,108 So we're going to treat each of those as separate variables 459 00:18:39,108 --> 00:18:41,490 in our model that are going to be 460 00:18:41,490 --> 00:18:43,410 connected by these black boxes. 461 00:18:46,254 --> 00:18:47,920 So they call their algorithm "Paradigm," 462 00:18:47,920 --> 00:18:49,590 and they developed it in the context 463 00:18:49,590 --> 00:18:51,059 of looking at cancer data. 464 00:18:51,059 --> 00:18:53,600 In cancer data, the two primary kinds of information they had 465 00:18:53,600 --> 00:18:58,100 were the RNA levels from either microarray or RNA-Seq and then 466 00:18:58,100 --> 00:18:59,710 these copy number variations, again, 467 00:18:59,710 --> 00:19:01,990 representing amplifications or deletions 468 00:19:01,990 --> 00:19:03,820 of chunks of the genome. 469 00:19:03,820 --> 00:19:05,780 And what they're trying to infer from that 470 00:19:05,780 --> 00:19:07,870 is how active different components are of 471 00:19:07,870 --> 00:19:10,706 known signaling pathways. 
472 00:19:10,706 --> 00:19:12,580 Now the approach that they used that involved 473 00:19:12,580 --> 00:19:14,413 all of those little black boxes is something 474 00:19:14,413 --> 00:19:16,060 called a factor graph. 475 00:19:16,060 --> 00:19:18,855 And factor graphs can be thought of in the same context 476 00:19:18,855 --> 00:19:19,730 as Bayesian networks. 477 00:19:19,730 --> 00:19:23,540 In fact, Bayesian networks are a type of factor graph. 478 00:19:23,540 --> 00:19:25,670 So if I have a Bayesian network that 479 00:19:25,670 --> 00:19:27,730 represents these three variables, where they're 480 00:19:27,730 --> 00:19:29,770 directly connected by edges, in a factor graph, 481 00:19:29,770 --> 00:19:32,700 there would be this extra kind of node-- this black box 482 00:19:32,700 --> 00:19:35,960 or red box-- that's the factor that's going to connect them. 483 00:19:35,960 --> 00:19:37,467 So what do these things do? 484 00:19:37,467 --> 00:19:39,050 Well, again, they're bipartite graphs. 485 00:19:39,050 --> 00:19:40,800 They always have these two different kinds 486 00:19:40,800 --> 00:19:43,696 of nodes-- the random variables and the factors. 487 00:19:43,696 --> 00:19:45,820 And the reason they're called factor graphs is they 488 00:19:45,820 --> 00:19:48,460 describe how the global function-- in our case, 489 00:19:48,460 --> 00:19:51,120 it's going to be the global probability distribution-- 490 00:19:51,120 --> 00:19:53,390 can be broken down into factorable components. 491 00:19:53,390 --> 00:19:56,330 These components can then be combined in a product to recover 492 00:19:56,330 --> 00:20:02,090 the global probability function. 
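To make that factorization concrete, here is a minimal Python sketch -- the three binary variables and the factor tables are invented for illustration, not taken from the paper. The global (unnormalized) function is just the product of the local factors, and normalizing it over all assignments gives the joint probability:

```python
from itertools import product

# A tiny hypothetical factor graph over three binary variables.
# Each factor touches only a subset of the variables (its scope);
# the global function is the product of all the factors.
factors = [
    (("x1", "x2"), lambda a: 2.0 if a["x1"] == a["x2"] else 1.0),   # fA(x1, x2)
    (("x2", "x3"), lambda a: 3.0 if a["x2"] and a["x3"] else 1.0),  # fB(x2, x3)
    (("x3",),      lambda a: 0.5 if a["x3"] else 1.0),              # fC(x3)
]

def global_function(assignment):
    """Product of every factor evaluated at one full assignment."""
    value = 1.0
    for _scope, f in factors:
        value *= f(assignment)
    return value

# Summing over all 2^3 assignments gives the normalizing constant,
# which turns the product of factors into a probability distribution.
Z = sum(global_function(dict(zip(("x1", "x2", "x3"), vals)))
        for vals in product((0, 1), repeat=3))
```

Dividing `global_function` by `Z` then gives the joint probability of any full assignment.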
493 00:20:02,090 --> 00:20:04,386 So if I have some global function over all 494 00:20:04,386 --> 00:20:06,760 the variables, you can think of this again, specifically, 495 00:20:06,760 --> 00:20:09,218 as the probability function-- the joint probability for all 496 00:20:09,218 --> 00:20:11,910 the variables in my system-- I want to be able to divide it 497 00:20:11,910 --> 00:20:14,230 into a product of individual terms, 498 00:20:14,230 --> 00:20:16,900 where I don't have all the variables in each of these f's. 499 00:20:16,900 --> 00:20:19,382 They're just some subset of variables. 500 00:20:19,382 --> 00:20:23,150 And each of these represents one of these terms 501 00:20:23,150 --> 00:20:25,990 in that global product. 502 00:20:25,990 --> 00:20:27,837 The only things that are in this function 503 00:20:27,837 --> 00:20:29,670 are things to which it's directly connected. 504 00:20:29,670 --> 00:20:32,540 So these edges exist solely between a factor 505 00:20:32,540 --> 00:20:35,500 and the variables that are terms in that equation. 506 00:20:35,500 --> 00:20:36,450 Is that clear? 507 00:20:43,710 --> 00:20:45,130 So in this context, the variables 508 00:20:45,130 --> 00:20:46,464 are going to be nodes. 509 00:20:46,464 --> 00:20:47,880 And their allowed values are going 510 00:20:47,880 --> 00:20:51,770 to be whether they're activated or not activated. 511 00:20:51,770 --> 00:20:54,044 The factors are going to describe the relationships 512 00:20:54,044 --> 00:20:54,960 among those variables. 513 00:20:54,960 --> 00:20:57,970 We previously saw those as being cases of regulation. 514 00:20:57,970 --> 00:20:59,689 Is the RNA turned into protein? 515 00:20:59,689 --> 00:21:00,730 Is the protein activated? 516 00:21:04,282 --> 00:21:05,740 And what we'd like to be able to do is 517 00:21:05,740 --> 00:21:07,410 compute marginal probabilities. 
518 00:21:07,410 --> 00:21:09,650 So we've got some big network that 519 00:21:09,650 --> 00:21:12,580 represents our understanding of all the signaling pathways 520 00:21:12,580 --> 00:21:15,610 and all the transcriptional regulatory networks in a cancer 521 00:21:15,610 --> 00:21:16,110 cell. 522 00:21:16,110 --> 00:21:18,830 And we want to ask about a particular pathway 523 00:21:18,830 --> 00:21:21,980 or a particular protein, what's the probability 524 00:21:21,980 --> 00:21:24,040 that this protein or this pathway is activated, 525 00:21:24,040 --> 00:21:27,669 marginalized over all the other variables? 526 00:21:27,669 --> 00:21:28,460 So that's our goal. 527 00:21:28,460 --> 00:21:30,220 Our goal is to find a way to compute 528 00:21:30,220 --> 00:21:32,844 these marginal probabilities efficiently. 529 00:21:32,844 --> 00:21:34,260 And how do you compute a marginal? 530 00:21:34,260 --> 00:21:37,430 Well, obviously you need to sum over all the configurations 531 00:21:37,430 --> 00:21:41,190 of all the variables that have your particular variable 532 00:21:41,190 --> 00:21:41,820 at its value. 533 00:21:41,820 --> 00:21:44,390 So if I want to know if MYC and MAX are active, 534 00:21:44,390 --> 00:21:47,300 I set MYC and MAX equal to active. 535 00:21:47,300 --> 00:21:49,200 And then I sum over all the configurations 536 00:21:49,200 --> 00:21:50,882 that are consistent with that. 537 00:21:50,882 --> 00:21:52,590 And in general, that would be hard to do. 538 00:21:52,590 --> 00:21:54,600 But the factor graph gives us an efficient way 539 00:21:54,600 --> 00:21:55,897 of figuring out how to do that. 540 00:21:55,897 --> 00:21:56,980 I'll show you in a second. 541 00:22:00,210 --> 00:22:01,610 So I have some global function. 542 00:22:01,610 --> 00:22:04,220 In this case, this little factor graph over here, 543 00:22:04,220 --> 00:22:07,450 this is the global function. 
544 00:22:07,450 --> 00:22:10,850 Now remember, these represent the factors, 545 00:22:10,850 --> 00:22:12,330 and they only have edges to things 546 00:22:12,330 --> 00:22:14,850 that are terms in their equations. 547 00:22:14,850 --> 00:22:18,850 So this one over here is a function of x3 and x5. 548 00:22:18,850 --> 00:22:24,920 And so it has edges to x3 and x5, and so on for all of them. 549 00:22:24,920 --> 00:22:27,730 And if I want to explicitly compute the marginal 550 00:22:27,730 --> 00:22:29,590 with respect to a particular variable, 551 00:22:29,590 --> 00:22:32,160 so the marginal with respect to x1 552 00:22:32,160 --> 00:22:36,550 set equal to a, so I'd have this function with x1 553 00:22:36,550 --> 00:22:40,900 equal to a times the sum over all possible states of x2, 554 00:22:40,900 --> 00:22:45,137 the sum over all possible states of x3, x4, and x5. 555 00:22:45,137 --> 00:22:45,720 Is that clear? 556 00:22:45,720 --> 00:22:47,428 That's just the definition of a marginal. 557 00:22:51,190 --> 00:22:54,850 They introduced a notation in factor graphs that's called 558 00:22:54,850 --> 00:22:55,985 a "not-sum." 559 00:22:55,985 --> 00:22:59,679 It's a rather terrible name-- the not-sum, or summary. 560 00:22:59,679 --> 00:23:01,220 So I like this term, summary, better. 561 00:23:01,220 --> 00:23:02,870 The summary over all the variables. 562 00:23:02,870 --> 00:23:05,540 So if I want to figure out the summary for x1, 563 00:23:05,540 --> 00:23:07,970 that's the sum over all the other variables 564 00:23:07,970 --> 00:23:09,600 of all their possible states when 565 00:23:09,600 --> 00:23:13,290 I set x1 equal to a, in this case. 566 00:23:13,290 --> 00:23:14,510 So it's purely a definition. 
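Written as code, the summary is just a brute-force enumeration over the other variables. This is a sketch with an invented global function g, purely to pin down the definition (the five-variable example on the slides works the same way):

```python
from itertools import product

def summary(g, variables, fixed_var, fixed_value):
    """The 'not-sum' for fixed_var: sum g over every configuration of
    the other (binary) variables, holding fixed_var at fixed_value."""
    others = [v for v in variables if v != fixed_var]
    total = 0.0
    for vals in product((0, 1), repeat=len(others)):
        assignment = dict(zip(others, vals))
        assignment[fixed_var] = fixed_value
        total += g(assignment)
    return total

# An invented global function over x1..x3, for illustration only.
g = lambda a: (1 + a["x1"]) * (1 + a["x2"] * a["x3"])

# Unnormalized marginal for x1 = 1: the summary for x1.
marginal_x1 = summary(g, ["x1", "x2", "x3"], "x1", 1)
```

Note that this enumeration is exponential in the number of variables, which is exactly the blow-up the tree-based computation in the lecture is designed to avoid.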
567 00:23:14,510 --> 00:23:19,140 So then I can rewrite-- and you can work this through by hand 568 00:23:19,140 --> 00:23:21,340 after class-- but I can rewrite this, 569 00:23:21,340 --> 00:23:24,220 which is this intuitive way of thinking of the marginal, 570 00:23:24,220 --> 00:23:27,840 in terms of these not-sums, where each one of these 571 00:23:27,840 --> 00:23:30,940 is over all the other variables that are not 572 00:23:30,940 --> 00:23:33,380 the one that's in the brackets. 573 00:23:33,380 --> 00:23:35,310 So that's just the definition. 574 00:23:35,310 --> 00:23:37,112 OK, this hasn't really helped us very much, 575 00:23:37,112 --> 00:23:38,570 if we don't have some efficient way 576 00:23:38,570 --> 00:23:39,820 of computing these marginals. 577 00:23:39,820 --> 00:23:41,470 And that's what the factor graph does. 578 00:23:41,470 --> 00:23:43,650 So we've got some factor graph. 579 00:23:43,650 --> 00:23:46,760 We have this representation, either 580 00:23:46,760 --> 00:23:50,020 in terms of graph or equation, of how the global function can 581 00:23:50,020 --> 00:23:51,340 be partitioned. 582 00:23:51,340 --> 00:23:54,280 Now if I take any one of these factor graphs, 583 00:23:54,280 --> 00:23:56,210 and I want to compute a marginal over a node, 584 00:23:56,210 --> 00:24:00,180 I can re-draw the factor graph so that variable of interest 585 00:24:00,180 --> 00:24:01,620 is the root node. 586 00:24:01,620 --> 00:24:02,120 Right? 587 00:24:02,120 --> 00:24:04,380 Everyone see that these two representations 588 00:24:04,380 --> 00:24:06,206 are completely equivalent? 589 00:24:06,206 --> 00:24:08,722 I've just yanked x1 up to the top. 590 00:24:08,722 --> 00:24:10,055 So now this is a tree structure. 591 00:24:12,482 --> 00:24:14,190 So this is that factor graph that we just 592 00:24:14,190 --> 00:24:15,546 saw drawn as a tree. 
593 00:24:15,546 --> 00:24:17,670 And this is what's called an expression tree, which 594 00:24:17,670 --> 00:24:19,470 is going to tell us how to compute 595 00:24:19,470 --> 00:24:22,575 the marginal over the structure of the graph. 596 00:24:25,330 --> 00:24:28,880 So this is just copied from the previous picture. 597 00:24:28,880 --> 00:24:32,650 And now we're going to come up with a program for computing 598 00:24:32,650 --> 00:24:35,930 these marginals, using this tree structure. 599 00:24:35,930 --> 00:24:39,520 So first I'm going to compute that summary function-- 600 00:24:39,520 --> 00:24:43,430 the sum over all states of the other variables for everything 601 00:24:43,430 --> 00:24:46,300 below this point, starting with the lowest point in the graph. 602 00:24:46,300 --> 00:24:48,870 And we can compute the summary function there. 603 00:24:48,870 --> 00:24:55,590 And that's this term, the summary for x3 of just this fE. 604 00:24:55,590 --> 00:25:00,440 I do the same thing for fD, the summary for it. 605 00:25:00,440 --> 00:25:02,720 And then I go up a level in the tree, 606 00:25:02,720 --> 00:25:06,327 and I multiply the summary for everything below it. 607 00:25:06,327 --> 00:25:08,410 So I'm going to compute the product of the summary 608 00:25:08,410 --> 00:25:09,510 functions. 609 00:25:09,510 --> 00:25:12,010 And I always compute the summary with respect to the parent. 610 00:25:12,010 --> 00:25:15,550 So here the parent was x3, for both of these. 611 00:25:15,550 --> 00:25:19,240 So these are summaries with respect to x3. 612 00:25:19,240 --> 00:25:20,260 Here who's the parent? 613 00:25:20,260 --> 00:25:20,760 x1. 614 00:25:20,760 --> 00:25:23,045 And so the summary is to x1. 615 00:25:27,280 --> 00:25:28,100 Yes? 616 00:25:28,100 --> 00:25:29,558 AUDIENCE: Are there directed edges? 617 00:25:29,558 --> 00:25:33,860 In the sense that in f, in the example on the right, 618 00:25:33,860 --> 00:25:37,557 is fD just relating how x4 relates to x3? 
619 00:25:37,557 --> 00:25:38,890 PROFESSOR: That's exactly right. 620 00:25:38,890 --> 00:25:44,210 So the edges represent which factor you're related to. 621 00:25:44,210 --> 00:25:47,040 So that's why I can redraw it in any way. 622 00:25:47,040 --> 00:25:49,872 I'm always going to go from the leaves up. 623 00:25:49,872 --> 00:25:54,274 I don't have to worry about any directed edges in the graph. 624 00:25:54,274 --> 00:25:54,940 Other questions? 625 00:25:58,540 --> 00:26:01,080 So what this does is it gives us a way 626 00:26:01,080 --> 00:26:04,310 to efficiently, over a complicated graph structure, 627 00:26:04,310 --> 00:26:07,390 compute marginals. 628 00:26:07,390 --> 00:26:09,580 And they're typically thought of in terms 629 00:26:09,580 --> 00:26:12,430 of messages that are being sent from the bottom of the graph up 630 00:26:12,430 --> 00:26:13,160 to the top. 631 00:26:13,160 --> 00:26:15,451 And you can have a rule for computing these marginals. 632 00:26:15,451 --> 00:26:17,260 And the rule is as follows. 633 00:26:17,260 --> 00:26:19,580 Each vertex waits for the messages 634 00:26:19,580 --> 00:26:21,270 from all of its children, until it 635 00:26:21,270 --> 00:26:23,820 gets its-- the messages are accumulating their way up 636 00:26:23,820 --> 00:26:24,564 the graph. 637 00:26:24,564 --> 00:26:25,980 And every node is waiting until it 638 00:26:25,980 --> 00:26:29,900 hears from all of its progeny about what's going on. 639 00:26:29,900 --> 00:26:33,790 And then it sends the signal up above it to its parent, 640 00:26:33,790 --> 00:26:35,110 based on the following rules. 641 00:26:35,110 --> 00:26:38,380 A variable node just takes the product of the children. 642 00:26:38,380 --> 00:26:41,120 And a factor node-- one of those little black boxes-- 643 00:26:41,120 --> 00:26:44,017 computes the summary for the children 644 00:26:44,017 --> 00:26:45,350 and sends that up to the parent. 
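Those two rules can be sketched recursively. The chain-shaped graph below (x1 -- fA -- x2 -- fB -- x3, rooted at the query variable x1) and its factor tables are invented for illustration; each entry of a message is the summary of everything below, evaluated at one value of the parent:

```python
from itertools import product

def factor_message(scope, table, child_msgs, parent):
    """A factor node: sum out every child variable with respect to the
    parent, weighting each term by the messages arriving from below."""
    msg = {0: 0.0, 1: 0.0}
    children = [v for v in scope if v != parent]
    for pv in (0, 1):
        for vals in product((0, 1), repeat=len(children)):
            a = dict(zip(children, vals))
            a[parent] = pv
            w = table[tuple(a[v] for v in scope)]
            for c in children:
                w *= child_msgs[c][a[c]]
            msg[pv] += w
    return msg

def variable_message(child_msgs):
    """A variable node: just the product of its children's messages."""
    msg = {0: 1.0, 1: 1.0}
    for m in child_msgs.values():
        for v in (0, 1):
            msg[v] *= m[v]
    return msg

# Hypothetical chain x1 -- fA -- x2 -- fB -- x3, rooted at x1.
fA = {(x1, x2): 2.0 if x1 == x2 else 1.0 for x1 in (0, 1) for x2 in (0, 1)}
fB = {(x2, x3): 1.0 + x2 * x3 for x2 in (0, 1) for x3 in (0, 1)}

leaf = {0: 1.0, 1: 1.0}  # message sent up by the leaf variable x3
m_fB_to_x2 = factor_message(("x2", "x3"), fB, {"x3": leaf}, "x2")
m_x2_to_fA = variable_message({"fB": m_fB_to_x2})
m_fA_to_x1 = factor_message(("x1", "x2"), fA, {"x2": m_x2_to_fA}, "x1")
# m_fA_to_x1[a] is the unnormalized marginal for x1 = a.
```

Because each message sums out only the variables below it, the work grows with the sizes of the individual factors rather than with the full joint table.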
645 00:26:45,350 --> 00:26:47,540 And it's the summary with respect to the parent, 646 00:26:47,540 --> 00:26:50,924 just like in the examples before. 647 00:26:50,924 --> 00:26:53,090 So this is a formula for computing single marginals. 648 00:26:53,090 --> 00:26:55,780 Now it turns out-- I'm not going to go into details of this. 649 00:26:55,780 --> 00:26:57,080 It's kind of complicated. 650 00:26:57,080 --> 00:27:00,780 But you actually can, based on this core idea, 651 00:27:00,780 --> 00:27:04,950 come up with an efficient way of computing all of the marginals 652 00:27:04,950 --> 00:27:06,450 without having to do this separately 653 00:27:06,450 --> 00:27:07,325 for every single one. 654 00:27:07,325 --> 00:27:09,370 And that's called a message passing algorithm. 655 00:27:09,370 --> 00:27:10,870 And if you're really interested, you 656 00:27:10,870 --> 00:27:14,730 can look into the citation for how that's done. 657 00:27:14,730 --> 00:27:18,960 So the core idea is that we can take a representation 658 00:27:18,960 --> 00:27:22,185 of our belief of how this global function-- in our case, 659 00:27:22,185 --> 00:27:24,560 it's going to be the joint probability-- factors in terms 660 00:27:24,560 --> 00:27:26,670 of particular biological processes. 661 00:27:26,670 --> 00:27:30,040 We can encode what we know about the regulation in that factor 662 00:27:30,040 --> 00:27:31,457 graph, the structure of the graph. 663 00:27:31,457 --> 00:27:33,081 And then we could have an efficient way 664 00:27:33,081 --> 00:27:35,110 of computing the marginals, which will tell us, 665 00:27:35,110 --> 00:27:37,010 given the data, what's the probability 666 00:27:37,010 --> 00:27:38,960 that this particular pathway is active? 667 00:27:41,510 --> 00:27:44,010 So in this particular case, in this paradigm model, 668 00:27:44,010 --> 00:27:45,930 the variables can take three states-- 669 00:27:45,930 --> 00:27:49,130 activated, deactivated, or unchanged. 
670 00:27:49,130 --> 00:27:51,800 And this is, in a tumor setting, for example, 671 00:27:51,800 --> 00:27:55,024 you might say the tumor is just like the wild type cell, 672 00:27:55,024 --> 00:27:57,440 or the tumor has activation with respect to the wild type, 673 00:27:57,440 --> 00:27:59,648 or it has a repression with respect to the wild type. 674 00:28:03,010 --> 00:28:06,510 Again, this is the structure of the factor graph that they're 675 00:28:06,510 --> 00:28:09,450 using and the different kinds of information that they have. 676 00:28:09,450 --> 00:28:11,230 The primary experimental data are just 677 00:28:11,230 --> 00:28:15,320 these arrays that tell us about SNPs and copy number variation 678 00:28:15,320 --> 00:28:18,310 and then arrays or RNA-Seq to tell us about the transcript 679 00:28:18,310 --> 00:28:19,910 levels. 680 00:28:19,910 --> 00:28:21,290 But now they can encode all sorts 681 00:28:21,290 --> 00:28:23,310 of rather complicated biological functions 682 00:28:23,310 --> 00:28:25,510 in the graph structure itself. 683 00:28:25,510 --> 00:28:28,750 So transcription regulation is shown here. 684 00:28:28,750 --> 00:28:31,016 Why is the edge from activity to here? 685 00:28:34,984 --> 00:28:36,990 Because we don't want to just infer 686 00:28:36,990 --> 00:28:40,990 that if there's more of the protein, there's more activity. 687 00:28:40,990 --> 00:28:42,760 So we're actually explicitly computing 688 00:28:42,760 --> 00:28:45,040 the activity of each protein. 689 00:28:45,040 --> 00:28:48,440 So if an RNA gets transcribed, it's 690 00:28:48,440 --> 00:28:51,400 because some transcription factor was active. 691 00:28:51,400 --> 00:28:53,570 And the transcription factor might not 692 00:28:53,570 --> 00:28:56,590 be active, even if the levels of the transcription factor 693 00:28:56,590 --> 00:28:57,790 are high. 
694 00:28:57,790 --> 00:29:00,322 That's one of the pieces that's not 695 00:29:00,322 --> 00:29:02,530 encoded in all of those things that were in the DREAM 696 00:29:02,530 --> 00:29:05,560 challenge, that are really critical for representing 697 00:29:05,560 --> 00:29:07,240 the regulatory structure. 698 00:29:07,240 --> 00:29:09,140 Similarly, protein activation-- I 699 00:29:09,140 --> 00:29:11,810 can have protein that goes from being present to being active. 700 00:29:11,810 --> 00:29:14,350 So think of a kinase, that itself 701 00:29:14,350 --> 00:29:16,855 needs to be phosphorylated to be active. 702 00:29:16,855 --> 00:29:18,230 So that would be that transition. 703 00:29:18,230 --> 00:29:20,060 Some other kinase comes in. 704 00:29:20,060 --> 00:29:22,240 And if that other kinase 1 is active, 705 00:29:22,240 --> 00:29:24,210 then it can phosphorylate kinase 2 706 00:29:24,210 --> 00:29:26,392 and make that one active. 707 00:29:26,392 --> 00:29:27,850 And so it's pretty straightforward. 708 00:29:27,850 --> 00:29:30,179 You can also represent the formation of a complex. 709 00:29:30,179 --> 00:29:32,220 So the fact that all the proteins are in the cell 710 00:29:32,220 --> 00:29:34,678 doesn't necessarily mean they're forming an active complex. 711 00:29:34,678 --> 00:29:37,590 So the next step then can be here. 712 00:29:37,590 --> 00:29:39,190 Only when I have all of them, would I 713 00:29:39,190 --> 00:29:40,470 have activity of the complex. 714 00:29:40,470 --> 00:29:43,730 We'll talk about how AND-like connections are formed. 715 00:29:43,730 --> 00:29:46,060 And then they also can incorporate OR. 716 00:29:46,060 --> 00:29:47,160 So what does that mean? 717 00:29:47,160 --> 00:29:50,290 So if I know that all members of the gene family 718 00:29:50,290 --> 00:29:52,790 can do something, I might want to explicitly represent 719 00:29:52,790 --> 00:29:57,230 that gene family as an element to the graph-- a variable. 
720 00:29:57,230 --> 00:29:59,400 Is any member of this family active? 721 00:29:59,400 --> 00:30:01,140 And so that would be done this way, where 722 00:30:01,140 --> 00:30:03,340 if you have an OR-like function here, then 723 00:30:03,340 --> 00:30:07,110 this factor would make this gene active if any of the parents 724 00:30:07,110 --> 00:30:07,720 are active. 725 00:30:11,290 --> 00:30:13,250 So there, they give a toy example, 726 00:30:13,250 --> 00:30:15,630 where they're trying to figure out if the P53 pathway is 727 00:30:15,630 --> 00:30:18,820 active, so MDM2 is an inhibitor of P53. 728 00:30:18,820 --> 00:30:21,430 P53 can be an activator of apoptosis-related genes. 729 00:30:21,430 --> 00:30:24,260 And so, separately for MDM2 and for P53, 730 00:30:24,260 --> 00:30:27,390 they have the factor graphs that show the relationship 731 00:30:27,390 --> 00:30:29,560 between copy number variation and transcript 732 00:30:29,560 --> 00:30:32,230 level and protein level and activity. 733 00:30:32,230 --> 00:30:33,610 And those relate to each other. 734 00:30:33,610 --> 00:30:35,750 And then those relate to the apoptotic pathway. 735 00:30:39,035 --> 00:30:40,910 So what they want to do then is take the data 736 00:30:40,910 --> 00:30:43,300 that they have, in terms of these pathways, 737 00:30:43,300 --> 00:30:45,322 and they want to compute the likelihood ratios. 738 00:30:45,322 --> 00:30:46,780 What's the probability of observing 739 00:30:46,780 --> 00:30:52,250 the data, given a hypothesis that this pathway is active 740 00:30:52,250 --> 00:30:54,250 and all my other settings of the parameters? 741 00:30:54,250 --> 00:30:55,708 And compare that to the probability 742 00:30:55,708 --> 00:30:58,902 of the data, given that that pathway is not active. 743 00:30:58,902 --> 00:31:00,610 So these are the kinds of likelihood ratios 744 00:31:00,610 --> 00:31:02,526 we've been seeing now in a couple of lectures. 
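The comparison at the end is a log-likelihood ratio. A toy sketch with invented numbers -- in the real pipeline the two likelihoods would come from marginalizing the factor graph with the pathway's activity variable clamped to each hypothesis:

```python
import math

# Hypothetical likelihoods of the observed tumor data under the
# two hypotheses (these numbers are made up for illustration).
p_data_given_active = 0.012
p_data_given_inactive = 0.003

# Positive values favor "pathway active"; negative favor "not active".
log_likelihood_ratio = math.log(p_data_given_active / p_data_given_inactive)
```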
745 00:31:04,631 --> 00:31:07,130 So now it gets into the details of how you actually do this. 746 00:31:07,130 --> 00:31:09,400 So there's a lot of manual steps involved here. 747 00:31:09,400 --> 00:31:12,840 So if I want to encode a regulatory pathway as a factor 748 00:31:12,840 --> 00:31:18,052 graph, it's currently done in a manual way or semi-manual way. 749 00:31:18,052 --> 00:31:19,510 You convert what's in the databases 750 00:31:19,510 --> 00:31:20,968 into the structure of a factor graph. 751 00:31:20,968 --> 00:31:23,420 And you make a series of decisions 752 00:31:23,420 --> 00:31:25,100 about exactly how you want to do that. 753 00:31:25,100 --> 00:31:26,891 You can argue with the particular decisions 754 00:31:26,891 --> 00:31:29,620 they made, but they're reasonable ones. 755 00:31:29,620 --> 00:31:31,390 People could do things differently. 756 00:31:31,390 --> 00:31:37,757 So they convert the regulatory networks into graphs. 757 00:31:37,757 --> 00:31:39,840 And then they have to define some of the functions 758 00:31:39,840 --> 00:31:41,050 on this graph. 759 00:31:41,050 --> 00:31:44,810 So they define the expected state of a variable, 760 00:31:44,810 --> 00:31:47,260 based on the state of its parents. 761 00:31:47,260 --> 00:31:50,879 And they take a majority vote of the parents. 762 00:31:50,879 --> 00:31:53,420 So a parent that's connected by a positive edge, meaning it's 763 00:31:53,420 --> 00:31:55,860 an activator, if the parent is active, 764 00:31:55,860 --> 00:31:59,020 then it contributes a plus 1 to the child. 765 00:31:59,020 --> 00:32:01,360 If it's connected by a repressive edge, 766 00:32:01,360 --> 00:32:04,200 then the parent being active would make a vote of minus 1 767 00:32:04,200 --> 00:32:05,040 for the child. 768 00:32:05,040 --> 00:32:10,580 And you take the majority vote of all those votes. 769 00:32:10,580 --> 00:32:11,990 So that's what this says. 
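That voting rule is compact enough to write out. A sketch, using the activated/unchanged/repressed states encoded as +1/0/-1 (the encoding is assumed here for illustration, not quoted from the paper):

```python
def expected_child_state(parent_states, edge_signs):
    """Majority vote of the parents.  An active parent (+1) connected by
    an activating edge (+1) votes +1 for the child; connected by a
    repressive edge (-1) it votes -1.  Unchanged parents (0) abstain.
    Returns +1 (active), -1 (repressed), or 0 (tie / unchanged)."""
    vote = sum(state * sign for state, sign in zip(parent_states, edge_signs))
    return (vote > 0) - (vote < 0)

# Two activating parents outvote one active repressor of the child.
expected_child_state([+1, +1, +1], [+1, +1, -1])
```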
770 00:32:11,990 --> 00:32:14,950 But the nice thing is that you can also incorporate logic. 771 00:32:14,950 --> 00:32:17,000 So for example, when we said, is any member 772 00:32:17,000 --> 00:32:18,250 of this pathway active? 773 00:32:18,250 --> 00:32:20,630 And you have a family member node. 774 00:32:20,630 --> 00:32:23,570 So that can be done with an OR function. 775 00:32:23,570 --> 00:32:25,700 And there, it's these same factors 776 00:32:25,700 --> 00:32:28,480 that will determine-- so some of these edges 777 00:32:28,480 --> 00:32:29,960 are going to get labeled "maximum" 778 00:32:29,960 --> 00:32:33,130 or "minimum," that tell you what's 779 00:32:33,130 --> 00:32:35,420 the expected value of the child, based on the parent. 780 00:32:35,420 --> 00:32:38,129 So if it's an OR, then if any of the parents are active, 781 00:32:38,129 --> 00:32:39,170 then the child is active. 782 00:32:39,170 --> 00:32:40,753 And if it's AND, you need all of them. 783 00:32:43,150 --> 00:32:45,906 So you could have described all of these networks 784 00:32:45,906 --> 00:32:46,780 by Bayesian networks. 785 00:32:46,780 --> 00:32:48,700 But the advantage of a factor graph 786 00:32:48,700 --> 00:32:50,580 is that you're explicitly able to include 787 00:32:50,580 --> 00:32:54,386 all these steps to describe this regulation in an intuitive way. 788 00:32:54,386 --> 00:32:55,760 So you can go back to your models 789 00:32:55,760 --> 00:32:57,730 and understand what you've done, and change it 790 00:32:57,730 --> 00:32:59,820 in an obvious way. 791 00:32:59,820 --> 00:33:01,830 Now critically, we're not trying to learn 792 00:33:01,830 --> 00:33:03,750 the structure of the graph from the data. 793 00:33:03,750 --> 00:33:05,870 We're imposing the structure of the graph. 
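With the same +1/0/-1 encoding, the "maximum" and "minimum" edge labels reduce to one-liners -- again a sketch of the idea, not the paper's exact implementation:

```python
def or_factor(parent_states):
    """OR-like ("maximum") factor: the child is active if any parent is."""
    return max(parent_states)

def and_factor(parent_states):
    """AND-like ("minimum") factor: the child is active only if every
    parent is; one repressed parent (-1) drags the whole complex down."""
    return min(parent_states)
```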
794 00:33:05,870 --> 00:33:07,900 We still need to learn a lot of variables, 795 00:33:07,900 --> 00:33:10,550 and that's done using expectation maximization, 796 00:33:10,550 --> 00:33:13,620 as we saw in the Bayesian networks. 797 00:33:13,620 --> 00:33:15,300 And then, again, it's a factor graph, 798 00:33:15,300 --> 00:33:17,970 which primarily means that we can factor the global function 799 00:33:17,970 --> 00:33:20,550 into all of these factor nodes. 800 00:33:20,550 --> 00:33:23,850 So the total probability is normalized, 801 00:33:23,850 --> 00:33:27,500 but it's the product of these factors which 802 00:33:27,500 --> 00:33:29,720 have to do with just the variables that are connected 803 00:33:29,720 --> 00:33:33,830 to that factor node in the graph. 804 00:33:33,830 --> 00:33:35,230 And this notation that you'll see 805 00:33:35,230 --> 00:33:37,410 if you look through this, this notation 806 00:33:37,410 --> 00:33:39,060 means the setting of all the variables 807 00:33:39,060 --> 00:33:40,650 consistent with something. 808 00:33:40,650 --> 00:33:44,020 So let's see that-- here we go. 809 00:33:44,020 --> 00:33:47,220 So this here, this is the setting of all the variables 810 00:33:47,220 --> 00:33:49,800 X, consistent with the data that we have-- so the data 811 00:33:49,800 --> 00:33:53,870 being the arrays, the RNA-Seq, if you had it. 812 00:33:53,870 --> 00:33:56,630 And so we want to compute the marginal probability 813 00:33:56,630 --> 00:33:59,680 of some particular variable being at a particular setting, 814 00:33:59,680 --> 00:34:02,320 given the fully specified factor graph. 815 00:34:02,320 --> 00:34:06,490 And we just take the product of all of these marginals. 816 00:34:06,490 --> 00:34:07,550 Is that clear? 817 00:34:07,550 --> 00:34:10,050 Consistent with all the settings where 818 00:34:10,050 --> 00:34:15,639 that variable is set to x equals a. 819 00:34:15,639 --> 00:34:16,139 Questions? 820 00:34:18,850 --> 00:34:19,350 OK. 
821 00:34:19,350 --> 00:34:21,099 And we can compute the likelihood function 822 00:34:21,099 --> 00:34:21,870 in the same way. 823 00:34:21,870 --> 00:34:24,400 So then what actually happens when you try to do this? 824 00:34:24,400 --> 00:34:28,469 So they give an example here in this more recent paper, where 825 00:34:28,469 --> 00:34:30,250 it's basically a toy example. 826 00:34:30,250 --> 00:34:32,469 But they're modeling all of these different states 827 00:34:32,469 --> 00:34:33,320 in the cells. 828 00:34:33,320 --> 00:34:35,550 So G are the number of genomic copies, 829 00:34:35,550 --> 00:34:38,060 T, the level of transcripts. 830 00:34:38,060 --> 00:34:41,061 Those are connected by a factor to what you actually measure. 831 00:34:41,061 --> 00:34:42,810 So there is some true change in the number 832 00:34:42,810 --> 00:34:44,050 of copies in the cell. 833 00:34:44,050 --> 00:34:46,370 And then there's what appears in your array. 834 00:34:46,370 --> 00:34:49,650 There's some true number of copies of RNA in the cell. 835 00:34:49,650 --> 00:34:52,550 And then there's what you get out of your RNA-Seq. 836 00:34:52,550 --> 00:34:54,300 So that's what these factors represent-- 837 00:34:54,300 --> 00:34:55,799 and then these are regulatory terms. 838 00:34:55,799 --> 00:34:59,460 So how much transcript you get depends on these two variables, 839 00:34:59,460 --> 00:35:03,390 the epigenetic state of the promoter 840 00:35:03,390 --> 00:35:05,880 and the regulatory proteins that interact with it. 841 00:35:05,880 --> 00:35:07,730 How much transcript gets turned into protein 842 00:35:07,730 --> 00:35:10,480 depends on regulatory proteins. 843 00:35:10,480 --> 00:35:12,910 And those are determined by upstream signaling events. 844 00:35:12,910 --> 00:35:14,410 And how much protein becomes active, 845 00:35:14,410 --> 00:35:16,860 again, is determined by the upstream signaling events. 
846 00:35:16,860 --> 00:35:21,757 And then those can have effects on downstream pathways as well. 847 00:35:21,757 --> 00:35:24,090 So then in this toy example, they're looking at MYC/MAX. 848 00:35:24,090 --> 00:35:28,390 They're trying to figure out whether it's active or not. 849 00:35:28,390 --> 00:35:30,170 So we've got this pathway. 850 00:35:30,170 --> 00:35:32,470 PAK2 represses MYC/MAX. 851 00:35:32,470 --> 00:35:38,030 MYC/MAX activates these two genes and represses this one. 852 00:35:38,030 --> 00:35:39,710 And so if these were the data that we 853 00:35:39,710 --> 00:35:42,830 had coming from copy number variation, DNA methylation, 854 00:35:42,830 --> 00:35:47,940 and RNA expression, then I'd see that the following states 855 00:35:47,940 --> 00:35:53,240 of the downstream genes-- this one's active. 856 00:35:53,240 --> 00:35:55,220 This one's repressed. 857 00:35:55,220 --> 00:35:55,970 This one's active. 858 00:35:55,970 --> 00:35:57,130 This one's repressed. 859 00:35:57,130 --> 00:36:00,330 They infer that MYC/MAX is active. 860 00:36:00,330 --> 00:36:03,510 Oh, but what about the fact that this one should also 861 00:36:03,510 --> 00:36:05,030 be activated? 862 00:36:05,030 --> 00:36:06,720 That can be explained away by the fact 863 00:36:06,720 --> 00:36:11,940 that there's a difference in the epigenetic state between ENO1 864 00:36:11,940 --> 00:36:14,880 and the other two. 865 00:36:14,880 --> 00:36:18,360 And then the belief propagation allows 866 00:36:18,360 --> 00:36:21,010 us to transfer that information upward through the graph 867 00:36:21,010 --> 00:36:25,030 to figure out, OK, so now we've decided that MYC/MAX is active. 868 00:36:25,030 --> 00:36:28,430 And that gives us information about the state of the proteins 869 00:36:28,430 --> 00:36:33,030 upstream of it and the activity then of PAK2, 870 00:36:33,030 --> 00:36:34,915 which is a repressor of MYC/MAX. 
871 00:36:39,220 --> 00:36:41,650 Questions on the factor graphs specifically 872 00:36:41,650 --> 00:36:43,559 or anything that's come up until now? 873 00:36:48,740 --> 00:36:52,810 So this has all been reasoning on known pathways. 874 00:36:52,810 --> 00:36:56,150 One of the big promises of these systematic approaches 875 00:36:56,150 --> 00:36:58,390 is the hope that we can discover new pathways. 876 00:36:58,390 --> 00:37:01,330 Can we discover things we don't already know about? 877 00:37:01,330 --> 00:37:04,040 And for this, we're going to look at interactome graphs, 878 00:37:04,040 --> 00:37:06,251 so graphs that are built primarily 879 00:37:06,251 --> 00:37:08,250 from high throughput protein-protein interaction 880 00:37:08,250 --> 00:37:10,083 data, but could also be built, as we'll see, 881 00:37:10,083 --> 00:37:14,514 from other kinds of large-scale connections. 882 00:37:14,514 --> 00:37:16,430 And we're going to look at what the underlying 883 00:37:16,430 --> 00:37:17,971 structure of these networks could be. 884 00:37:17,971 --> 00:37:19,686 And so they could arise from a graph 885 00:37:19,686 --> 00:37:21,310 where you put an edge between two nodes 886 00:37:21,310 --> 00:37:25,730 if they're co-expressed, if they have high mutual information. 887 00:37:25,730 --> 00:37:27,620 That's what we saw in, say, ARACNE, 888 00:37:27,620 --> 00:37:31,100 which we talked about a lecture ago. 889 00:37:31,100 --> 00:37:35,050 Or if, say, two-hybrid and affinity capture mass spec 890 00:37:35,050 --> 00:37:37,210 indicated direct physical interaction 891 00:37:37,210 --> 00:37:39,020 or say a high throughput genetic screen 892 00:37:39,020 --> 00:37:40,394 indicated a genetic interaction. 893 00:37:40,394 --> 00:37:42,310 These are going to be very, very large graphs. 
894 00:37:42,310 --> 00:37:44,860 And we're going to look at some of the algorithmic problems 895 00:37:44,860 --> 00:37:46,500 that we have dealing with huge graphs 896 00:37:46,500 --> 00:37:49,690 and how to compress the information down so we get 897 00:37:49,690 --> 00:37:52,795 some piece of the network that's quite interpretable. 898 00:37:52,795 --> 00:37:54,420 And we'll look at various kinds of ways 899 00:37:54,420 --> 00:37:59,490 of analyzing these graphs that are listed here. 900 00:37:59,490 --> 00:38:03,654 So one of the advantages of dealing with data in the graph 901 00:38:03,654 --> 00:38:06,070 formulation is that we can leverage the fact that computer 902 00:38:06,070 --> 00:38:08,760 science has dealt with large graphs for quite a while 903 00:38:08,760 --> 00:38:11,580 now, often in the context of telecommunications. 904 00:38:11,580 --> 00:38:14,460 Now big data, Facebook, Google-- they're always 905 00:38:14,460 --> 00:38:16,260 dealing with things in a graph formulation. 906 00:38:16,260 --> 00:38:20,430 So there are a lot of algorithms that we can take advantage of. 907 00:38:20,430 --> 00:38:23,270 We're going to look at how to use quick distance 908 00:38:23,270 --> 00:38:24,459 calculations on graphs. 909 00:38:24,459 --> 00:38:26,500 And we'll look at that specifically in an example 910 00:38:26,500 --> 00:38:29,570 of how to find the kinase-target relationships. 911 00:38:29,570 --> 00:38:31,630 Then we'll look at how to cluster large graphs 912 00:38:31,630 --> 00:38:33,640 to find subgraphs that either represent 913 00:38:33,640 --> 00:38:35,150 an interesting topological feature 914 00:38:35,150 --> 00:38:37,160 of the inherent structure of the graph 915 00:38:37,160 --> 00:38:40,830 or perhaps to represent active pieces of the network. 
916 00:38:40,830 --> 00:38:43,006 And then we'll look at other kinds of optimization 917 00:38:43,006 --> 00:38:45,380 techniques to help us find the part of the network that's 918 00:38:45,380 --> 00:38:50,390 most relevant to our particular experimental setting. 919 00:38:50,390 --> 00:38:54,080 So let's start with ostensibly a simple problem. 920 00:38:54,080 --> 00:38:57,140 I know a lot about-- I have a lot of protein phosphorylation 921 00:38:57,140 --> 00:38:57,640 data. 922 00:38:57,640 --> 00:38:59,600 I'd like to figure out what kinase 923 00:38:59,600 --> 00:39:03,060 it was that phosphorylated a particular protein. 924 00:39:03,060 --> 00:39:05,270 So let's say I have this protein that's 925 00:39:05,270 --> 00:39:08,560 involved in cancer signaling, Rad50. 926 00:39:08,560 --> 00:39:10,750 And I know it's phosphorylated at these two sites. 927 00:39:10,750 --> 00:39:12,660 And I have the sequences of those sites. 928 00:39:12,660 --> 00:39:14,970 So what kinds of tools do we have at our disposal 929 00:39:14,970 --> 00:39:16,900 if I have a set of sequences that I believe 930 00:39:16,900 --> 00:39:18,570 are phosphorylated, that would help 931 00:39:18,570 --> 00:39:21,481 me try to figure out what kinase did the phosphorylation? 932 00:39:21,481 --> 00:39:21,980 Any ideas? 933 00:39:26,910 --> 00:39:29,740 So if I know the specificity of the kinases, what could I do? 934 00:39:32,320 --> 00:39:34,260 I could look for a sequence match 935 00:39:34,260 --> 00:39:36,440 between the specificity of the kinase 936 00:39:36,440 --> 00:39:38,417 and the sequence of the protein, right? 937 00:39:38,417 --> 00:39:40,250 In the same way that we can look for a match 938 00:39:40,250 --> 00:39:42,790 between the specificity of a transcription factor 939 00:39:42,790 --> 00:39:46,330 and the region of the genome to which it binds. 
940 00:39:46,330 --> 00:39:49,470 So if I have a library of specificity motifs 941 00:39:49,470 --> 00:39:51,580 for different kinases, where every position here 942 00:39:51,580 --> 00:39:53,859 represents a piece of the recognition element, 943 00:39:53,859 --> 00:39:56,150 and the height of the letters represents the information 944 00:39:56,150 --> 00:39:57,900 content, I can scan those. 945 00:39:57,900 --> 00:40:00,310 And I can see what family of kinases 946 00:40:00,310 --> 00:40:03,440 are most likely to be responsible for phosphorylating 947 00:40:03,440 --> 00:40:04,482 these sites. 948 00:40:04,482 --> 00:40:06,190 But again, those are families of kinases. 949 00:40:06,190 --> 00:40:07,564 There are many individual members 950 00:40:07,564 --> 00:40:08,860 of each of those families. 951 00:40:08,860 --> 00:40:10,380 So how do I find the specific member 952 00:40:10,380 --> 00:40:12,330 of that family that's most likely to carry out 953 00:40:12,330 --> 00:40:13,620 the regulation? 954 00:40:13,620 --> 00:40:15,120 So here's what happens in this paper. 955 00:40:15,120 --> 00:40:17,244 It's called NetworKIN. And they say, well, 956 00:40:17,244 --> 00:40:18,570 let's use the graph properties. 957 00:40:18,570 --> 00:40:23,290 Let's try to figure out which proteins are physically linked 958 00:40:23,290 --> 00:40:26,390 relatively closely in the network to the target. 959 00:40:26,390 --> 00:40:29,080 So in this case, they've got Rad50 over here. 960 00:40:29,080 --> 00:40:33,620 And they're trying to figure out which kinase is regulating it. 961 00:40:33,620 --> 00:40:35,980 So here are two kinases that have similar specificity. 962 00:40:35,980 --> 00:40:37,669 But this one's directly connected 963 00:40:37,669 --> 00:40:39,210 in the interaction network, so it's 964 00:40:39,210 --> 00:40:41,870 more likely to be responsible. 
965 00:40:41,870 --> 00:40:44,430 And here's the member of the kinase family that 966 00:40:44,430 --> 00:40:47,010 seems to be consistent with the sequence being phosphorylated 967 00:40:47,010 --> 00:40:48,130 over here. 968 00:40:48,130 --> 00:40:50,910 It's not directly connected, but it's relatively close. 969 00:40:50,910 --> 00:40:53,530 And so that's also a highly probable member, 970 00:40:53,530 --> 00:40:56,110 compared to one that's more distantly related. 971 00:40:56,110 --> 00:40:58,870 So in general, if I've got a set of kinases 972 00:40:58,870 --> 00:41:02,410 that are all equally good sequence matches to the target 973 00:41:02,410 --> 00:41:05,560 sequence, represented by these dashed lines, but one of them 974 00:41:05,560 --> 00:41:08,780 is physically linked as well, perhaps directly and perhaps 975 00:41:08,780 --> 00:41:10,610 indirectly, I have higher confidence 976 00:41:10,610 --> 00:41:13,360 in this kinase because of its physical links 977 00:41:13,360 --> 00:41:15,857 than I do in these. 978 00:41:15,857 --> 00:41:18,190 So that's fine if you want to look at things one by one. 979 00:41:18,190 --> 00:41:19,690 But if you want to look at this at a global scale, 980 00:41:19,690 --> 00:41:21,110 we need very efficient algorithms 981 00:41:21,110 --> 00:41:23,980 for figuring out what the distance is in this interaction 982 00:41:23,980 --> 00:41:29,190 network between any kinase and any target. 983 00:41:29,190 --> 00:41:31,590 So how do you go about efficiently computing distances? 984 00:41:31,590 --> 00:41:34,030 Well that's where converting things into a graph structure 985 00:41:34,030 --> 00:41:35,180 is helpful. 986 00:41:35,180 --> 00:41:37,070 So when we talk about graphs here, 987 00:41:37,070 --> 00:41:40,654 we mean sets of vertices and the edges that connect them. 988 00:41:40,654 --> 00:41:42,820 The vertices, in our case, are going to be proteins. 
989 00:41:42,820 --> 00:41:44,290 The edges are going to perhaps represent 990 00:41:44,290 --> 00:41:46,789 physical interactions or some of these other kinds of graphs 991 00:41:46,789 --> 00:41:49,520 we talked about. 992 00:41:49,520 --> 00:41:52,049 These graphs can be directed, or they can be undirected. 993 00:41:52,049 --> 00:41:53,090 Undirected would be what? 994 00:41:53,090 --> 00:41:54,950 For example, say, two-hybrid. 995 00:41:54,950 --> 00:41:57,150 I don't know which one's doing what to which. 996 00:41:57,150 --> 00:41:59,280 I just know that two proteins can come together. 997 00:41:59,280 --> 00:42:01,260 Whereas a directed edge might be this kinase 998 00:42:01,260 --> 00:42:02,460 phosphorylates this target. 999 00:42:02,460 --> 00:42:05,130 And so it's a directed edge. 1000 00:42:05,130 --> 00:42:07,091 I can have weights associated with these edges. 1001 00:42:07,091 --> 00:42:08,590 We'll see in a second how we can use 1002 00:42:08,590 --> 00:42:11,240 that to encode our confidence that the edge represents 1003 00:42:11,240 --> 00:42:14,680 a true physical interaction. 1004 00:42:14,680 --> 00:42:17,580 We can also talk about the degree, the number of edges 1005 00:42:17,580 --> 00:42:20,770 that come into a node or leave a node. 1006 00:42:20,770 --> 00:42:22,740 And for our purposes, it's rather important 1007 00:42:22,740 --> 00:42:25,170 to talk about the path, the set of vertices 1008 00:42:25,170 --> 00:42:27,840 that can get me from one node to another node, 1009 00:42:27,840 --> 00:42:31,476 without ever retracing my steps. 1010 00:42:31,476 --> 00:42:34,100 And we're going to talk about path length, 1011 00:42:34,100 --> 00:42:35,600 so if my graph is unweighted, that's 1012 00:42:35,600 --> 00:42:39,000 just the number of edges along the path. 1013 00:42:39,000 --> 00:42:40,990 But if my graph has edge weights, 1014 00:42:40,990 --> 00:42:43,675 it's going to be the sum of the edge weights along that path. 
1015 00:42:43,675 --> 00:42:44,290 Is that clear? 1016 00:42:48,040 --> 00:42:50,640 And then we're going to use an adjacency matrix 1017 00:42:50,640 --> 00:42:51,640 to represent the graphs. 1018 00:42:51,640 --> 00:42:53,400 So I have two completely equivalent formulations 1019 00:42:53,400 --> 00:42:53,941 of the graph. 1020 00:42:53,941 --> 00:42:56,190 One is the picture on the left-hand side, 1021 00:42:56,190 --> 00:42:59,310 and the other one is the matrix on the right-hand side, where 1022 00:42:59,310 --> 00:43:02,820 a 1 between any row and column represents 1023 00:43:02,820 --> 00:43:03,820 the presence of an edge. 1024 00:43:03,820 --> 00:43:10,500 So the only edge connecting node 1 goes to node 2. 1025 00:43:10,500 --> 00:43:13,610 Whereas, node 2 is connected both to node 1 and to node 3. 1026 00:43:13,610 --> 00:43:14,720 Hopefully, that agrees. 1027 00:43:14,720 --> 00:43:15,365 OK. 1028 00:43:15,365 --> 00:43:16,170 Is that clear? 1029 00:43:20,537 --> 00:43:22,370 And if I have a weighted graph, then instead 1030 00:43:22,370 --> 00:43:23,910 of putting zeros or ones in the matrix, 1031 00:43:23,910 --> 00:43:25,868 I'll put the actual edge weights in the matrix. 1032 00:43:28,620 --> 00:43:31,950 So there are algorithms that exist for efficiently finding 1033 00:43:31,950 --> 00:43:35,740 shortest paths in large graphs. 1034 00:43:35,740 --> 00:43:37,910 So we can very rapidly, for example, 1035 00:43:37,910 --> 00:43:40,130 compute the shortest path between any two nodes, 1036 00:43:40,130 --> 00:43:43,720 based solely on that adjacency matrix. 1037 00:43:43,720 --> 00:43:46,170 Now why are we going to look at weighted graphs? 1038 00:43:46,170 --> 00:43:48,650 Because that gives us the way to encode our confidence 1039 00:43:48,650 --> 00:43:50,040 in the underlying data. 
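As a sketch of how an adjacency matrix supports those fast path computations, here is breadth-first search on the toy three-node graph just described (node 1 connected to node 2, node 2 connected to nodes 1 and 3); the matrix is zero-indexed and purely illustrative.

```python
from collections import deque

# Adjacency matrix for a toy undirected graph: node 1 -- node 2 -- node 3
# (rows/columns are 0-indexed here; a 1 marks the presence of an edge).
A = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]

def shortest_path_length(adj, src, dst):
    """Breadth-first search: fewest edges from src to dst in an
    unweighted graph, or None if dst is unreachable."""
    n = len(adj)
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        if u == dst:
            return dist[u]
        for v in range(n):
            if adj[u][v] and v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return None

print(shortest_path_length(A, 0, 2))  # node 1 to node 3, via node 2 -> 2
```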
1040 00:43:50,040 --> 00:43:54,350 So because the total distance in the network 1041 00:43:54,350 --> 00:43:57,230 is the sum of the edge weights, if I set my edge weights 1042 00:43:57,230 --> 00:44:01,150 to be negative log of a probability, 1043 00:44:01,150 --> 00:44:03,302 then if I sum all the edge weights, 1044 00:44:03,302 --> 00:44:05,385 I'm taking the product of all those probabilities. 1045 00:44:08,560 --> 00:44:10,800 And so the shortest path is going 1046 00:44:10,800 --> 00:44:14,390 to be the most probable path as well, because it's 1047 00:44:14,390 --> 00:44:18,900 going to be the minimum of the sum of the negative log. 1048 00:44:18,900 --> 00:44:21,475 So it's going to be the maximum of the joint probability. 1049 00:44:21,475 --> 00:44:24,270 Is that clear? 1050 00:44:24,270 --> 00:44:24,770 OK. 1051 00:44:24,770 --> 00:44:25,470 Very good. 1052 00:44:25,470 --> 00:44:30,200 So by encoding our network as a weighted graph, where the edge 1053 00:44:30,200 --> 00:44:31,932 weights are minus log of the probability, 1054 00:44:31,932 --> 00:44:34,140 then when I use these standard algorithms for finding 1055 00:44:34,140 --> 00:44:35,806 the shortest path between any two nodes, 1056 00:44:35,806 --> 00:44:38,870 I'm also getting the most probable path between these two 1057 00:44:38,870 --> 00:44:41,160 proteins. 1058 00:44:41,160 --> 00:44:44,210 So where do these edge weights come from? 1059 00:44:44,210 --> 00:44:47,090 So if my network consists say of affinity capture mass 1060 00:44:47,090 --> 00:44:48,530 spec and two hybrid interactions, 1061 00:44:48,530 --> 00:44:51,700 how would I compute the edge weights for that network? 1062 00:45:01,524 --> 00:45:03,190 We actually explicitly talked about this 1063 00:45:03,190 --> 00:45:04,468 just a lecture or two ago. 1064 00:45:08,950 --> 00:45:10,960 So I have all this affinity capture mass spec, 1065 00:45:10,960 --> 00:45:12,150 two hybrid data. 
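A minimal sketch of that trick, with made-up confidence values: give each edge the weight minus log of its probability and run a standard shortest-path algorithm (Dijkstra here), and the minimum-weight path is also the maximum-probability path.

```python
import heapq
import math

# Hypothetical interaction confidences: edge -> probability it's real.
# Edge weight = -log(p), so minimizing the weight sum maximizes the
# product of the probabilities along the path.
edges = {
    ("A", "B"): 0.9, ("B", "C"): 0.9,  # confident two-step route
    ("A", "C"): 0.5,                   # less confident direct edge
}

graph = {}
for (u, v), p in edges.items():
    w = -math.log(p)
    graph.setdefault(u, []).append((v, w))
    graph.setdefault(v, []).append((u, w))  # undirected

def most_probable_path(src, dst):
    """Dijkstra on -log(p) weights; returns (path probability, path)."""
    heap = [(0.0, src, [src])]
    seen = set()
    while heap:
        cost, u, path = heapq.heappop(heap)
        if u == dst:
            return math.exp(-cost), path
        if u in seen:
            continue
        seen.add(u)
        for v, w in graph.get(u, []):
            if v not in seen:
                heapq.heappush(heap, (cost + w, v, path + [v]))
    return 0.0, []

prob, path = most_probable_path("A", "C")
print(path, prob)  # A-B-C wins: 0.9 * 0.9 = 0.81 beats the direct 0.5 edge
```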
1066 00:45:12,150 --> 00:45:13,620 And I want to assign a probability 1067 00:45:13,620 --> 00:45:18,350 to every edge that tells me how confident I am that it's real. 1068 00:45:18,350 --> 00:45:21,050 So we already saw that in the context of this paper 1069 00:45:21,050 --> 00:45:23,970 where we use Bayesian networks and gold standards to compute 1070 00:45:23,970 --> 00:45:26,345 the probability for every single edge in the interactome. 1071 00:45:28,800 --> 00:45:29,300 OK. 1072 00:45:29,300 --> 00:45:32,520 So that works pretty well if you can define the gold standards. 1073 00:45:32,520 --> 00:45:35,630 It turns out that that has not been the most popular way 1074 00:45:35,630 --> 00:45:38,270 of dealing with mammalian data. 1075 00:45:38,270 --> 00:45:40,160 It works pretty well for yeast, but it's not 1076 00:45:40,160 --> 00:45:42,400 what's used primarily in mammalian data. 1077 00:45:42,400 --> 00:45:45,830 So in mammalian data, the databases are much larger. 1078 00:45:45,830 --> 00:45:48,280 The number of gold standards is smaller. 1079 00:45:48,280 --> 00:45:51,270 People rely on more ad hoc methods. 1080 00:45:51,270 --> 00:45:54,370 One of the big advances, technically, for the field 1081 00:45:54,370 --> 00:45:57,060 was the development of a common way for all these databases 1082 00:45:57,060 --> 00:45:59,580 of protein-protein interactions to report their data, 1083 00:45:59,580 --> 00:46:01,150 to be able to interchange them. 1084 00:46:01,150 --> 00:46:05,580 There are standards called PSICQUIC and PSISCORE that 1085 00:46:05,580 --> 00:46:09,485 allow a client to pull information 1086 00:46:09,485 --> 00:46:11,610 from all the different databases of protein-protein 1087 00:46:11,610 --> 00:46:12,720 interactions. 
1088 00:46:12,720 --> 00:46:15,810 And because you can get all the data in a common format 1089 00:46:15,810 --> 00:46:18,500 where it's traceable back to the underlying experiment, 1090 00:46:18,500 --> 00:46:21,510 then you can start computing confidence scores 1091 00:46:21,510 --> 00:46:23,110 based on these properties, what we 1092 00:46:23,110 --> 00:46:26,450 know about where the data came from in a high throughput way. 1093 00:46:26,450 --> 00:46:28,510 Different people have different approaches 1094 00:46:28,510 --> 00:46:30,620 to computing those scores. 1095 00:46:30,620 --> 00:46:32,380 So there's a common format for that 1096 00:46:32,380 --> 00:46:34,790 as well, which is this PSISCORE where 1097 00:46:34,790 --> 00:46:38,150 you can build your interaction database from whichever 1098 00:46:38,150 --> 00:46:40,390 one of these underlying databases you want, 1099 00:46:40,390 --> 00:46:41,740 filter it however you want. 1100 00:46:41,740 --> 00:46:45,780 And then send your database to one of these scoring servers. 1101 00:46:45,780 --> 00:46:47,690 And they'll send you back the scores 1102 00:46:47,690 --> 00:46:50,130 according to their algorithm. 1103 00:46:50,130 --> 00:46:52,940 One that I kind of like is the MIscore algorithm. 1104 00:46:52,940 --> 00:46:54,509 It digs down into the underlying data 1105 00:46:54,509 --> 00:46:56,050 of what kind of experiments were done 1106 00:46:56,050 --> 00:46:58,139 and how many experiments were done. 1107 00:46:58,139 --> 00:47:00,180 Again, they make all sorts of arbitrary decisions 1108 00:47:00,180 --> 00:47:01,013 in how they do that. 1109 00:47:01,013 --> 00:47:03,400 But the arbitrary decisions seem reasonable 1110 00:47:03,400 --> 00:47:05,790 in the absence of any other data. 
1111 00:47:05,790 --> 00:47:10,260 So their scores are based on these three kinds of terms-- 1112 00:47:10,260 --> 00:47:12,180 how many publications there are associated 1113 00:47:12,180 --> 00:47:17,130 with any interaction, what experimental method was used, 1114 00:47:17,130 --> 00:47:19,434 and then also, if there's an annotation in the database 1115 00:47:19,434 --> 00:47:21,725 saying that we know that this is a genetic interaction, 1116 00:47:21,725 --> 00:47:23,559 or we know that it's a physical interaction. 1117 00:47:23,559 --> 00:47:25,599 And then they put weights on all of these things. 1118 00:47:25,599 --> 00:47:27,280 So people can argue about what the best 1119 00:47:27,280 --> 00:47:28,924 way of approaching this is. 1120 00:47:28,924 --> 00:47:30,590 The fundamental point is that we can now 1121 00:47:30,590 --> 00:47:33,030 have a very, very large database of known 1122 00:47:33,030 --> 00:47:34,630 interactions, with weights. 1123 00:47:34,630 --> 00:47:37,360 So by last count, there are about 250,000 1124 00:47:37,360 --> 00:47:40,352 protein-protein interactions for humans in these databases. 1125 00:47:40,352 --> 00:47:41,810 So you have that giant interactome. 1126 00:47:41,810 --> 00:47:44,380 It's got all these scores associated with it. 1127 00:47:44,380 --> 00:47:46,390 And now we can dive into that and say, 1128 00:47:46,390 --> 00:47:51,880 these data are largely unbiased by our prior notions 1129 00:47:51,880 --> 00:47:53,750 about what's important. 1130 00:47:53,750 --> 00:47:55,560 They're built up from high throughput data. 1131 00:47:55,560 --> 00:47:57,910 So unlike the carefully curated pathways 1132 00:47:57,910 --> 00:48:00,077 that are what everybody's been studying for decades, 1133 00:48:00,077 --> 00:48:01,993 there might be information here about pathways 1134 00:48:01,993 --> 00:48:02,830 no one knows about. 1135 00:48:02,830 --> 00:48:05,250 Can we find those pathways in different contexts? 
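The real MIscore has its own published normalizations, so purely as an illustration of combining those three kinds of terms, here is a made-up weighted score; all the per-method values, type values, and weights below are hypothetical, not MIscore's actual numbers.

```python
import math

# Sketch of an MIscore-style confidence: a publication-count term,
# an experimental-method term, and an interaction-type term, each
# in [0, 1], combined with weights. All numbers are invented for
# illustration; the real MIscore defines its own normalized values.
METHOD_SCORE = {"affinity capture-MS": 0.8, "two-hybrid": 0.6}
TYPE_SCORE = {"physical": 1.0, "genetic": 0.5}

def confidence(n_publications, methods, itype,
               w_pub=0.5, w_method=0.3, w_type=0.2):
    # Saturating publication term: more papers help, capped at 1.
    pub_term = min(math.log(1 + n_publications) / math.log(1 + 7), 1.0)
    # Credit the most reliable method reported for this edge.
    method_term = max(METHOD_SCORE.get(m, 0.3) for m in methods)
    type_term = TYPE_SCORE.get(itype, 0.3)
    total = w_pub + w_method + w_type
    return (w_pub * pub_term + w_method * method_term
            + w_type * type_term) / total

print(confidence(3, ["two-hybrid", "affinity capture-MS"], "physical"))
```

A well-replicated physical interaction scores higher than a single-publication genetic one, which is the qualitative behavior the lecture describes.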
1136 00:48:05,250 --> 00:48:07,050 What can we learn from that? 1137 00:48:07,050 --> 00:48:09,127 So one early thing people can do is just 1138 00:48:09,127 --> 00:48:10,710 try to find pieces of the network that 1139 00:48:10,710 --> 00:48:12,168 seem to be modular, where there are 1140 00:48:12,168 --> 00:48:15,650 more interactions among the components of that module 1141 00:48:15,650 --> 00:48:17,990 than there are to other pieces of the network. 1142 00:48:17,990 --> 00:48:20,700 And you can find those modules in two different ways. 1143 00:48:20,700 --> 00:48:24,240 One is just based on the underlying network. 1144 00:48:24,240 --> 00:48:27,540 And one is based on the network, plus some external data 1145 00:48:27,540 --> 00:48:28,290 you have. 1146 00:48:28,290 --> 00:48:29,920 So one would be to say, are there 1147 00:48:29,920 --> 00:48:32,790 proteins that fundamentally interact with each other 1148 00:48:32,790 --> 00:48:34,531 under all possible settings? 1149 00:48:34,531 --> 00:48:36,780 And then we would say, in my particular patient sample 1150 00:48:36,780 --> 00:48:40,290 or my disease or my microorganism, 1151 00:48:40,290 --> 00:48:42,370 which proteins seem to be functioning 1152 00:48:42,370 --> 00:48:44,810 in this particular condition? 1153 00:48:44,810 --> 00:48:47,230 So one is the topological model. 1154 00:48:47,230 --> 00:48:48,740 That's just the network itself. 1155 00:48:48,740 --> 00:48:51,534 And one is the functional model, where I layer on the information 1156 00:48:51,534 --> 00:48:53,950 that the dark nodes are active in my particular condition. 1157 00:48:56,590 --> 00:49:00,000 So an early use of this kind of approach 1158 00:49:00,000 --> 00:49:02,995 was to try to annotate nodes-- for a large fraction 1159 00:49:02,995 --> 00:49:05,130 of even well-studied genomes, we don't know 1160 00:49:05,130 --> 00:49:07,500 the function of many of those genes. 
1161 00:49:07,500 --> 00:49:09,490 So what if I use the structure of the network 1162 00:49:09,490 --> 00:49:13,060 to infer that if some protein is close to another protein 1163 00:49:13,060 --> 00:49:14,750 in this interaction network, it is 1164 00:49:14,750 --> 00:49:16,670 likely to have similar function? 1165 00:49:16,670 --> 00:49:19,280 And statistically, that's definitely true. 1166 00:49:19,280 --> 00:49:24,410 So this graph shows, for things where we know the function, 1167 00:49:24,410 --> 00:49:26,590 the semantic similarity on the y-axis, 1168 00:49:26,590 --> 00:49:28,427 the distance in the network on the x-axis, 1169 00:49:28,427 --> 00:49:30,510 things that are close to each other in the network 1170 00:49:30,510 --> 00:49:32,930 of interactions, are also more likely to be 1171 00:49:32,930 --> 00:49:35,645 similar in terms of function. 1172 00:49:35,645 --> 00:49:37,020 So how do we go about doing that? 1173 00:49:37,020 --> 00:49:38,520 So let's say we have got this graph. 1174 00:49:38,520 --> 00:49:40,850 We've got some unknown node labeled u. 1175 00:49:40,850 --> 00:49:43,810 And we've got two known nodes in black. 1176 00:49:43,810 --> 00:49:46,250 And we want to systematically deduce for every example 1177 00:49:46,250 --> 00:49:50,170 like this, every u, what its annotation should be. 1178 00:49:50,170 --> 00:49:52,710 So I could just look at its neighbors, 1179 00:49:52,710 --> 00:49:54,949 and depending on how I set the window around it, 1180 00:49:54,949 --> 00:49:56,490 do I look at the immediate neighbors? 1181 00:49:56,490 --> 00:49:57,410 Do I go two out? 1182 00:49:57,410 --> 00:49:58,470 Do I go three out? 1183 00:49:58,470 --> 00:50:00,430 I could get different answers. 1184 00:50:00,430 --> 00:50:03,070 So if I set K equal to 1, I've got the unknown node, 1185 00:50:03,070 --> 00:50:04,710 but all the neighbors are also unknown. 1186 00:50:04,710 --> 00:50:07,570 If I go two steps out, then I pick up two knowns. 
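A sketch of that K-step neighborhood idea on a hypothetical graph: with K equal to 1, the unknown node u sees only unlabeled neighbors; with K equal to 2, it picks up two annotated nodes and can take a majority vote. The graph, labels, and vote rule here are illustrative, not any particular paper's exact method.

```python
from collections import Counter, deque

# Toy graph: unknown node "u" with unlabeled neighbors "a" and "b",
# and labeled nodes "c" and "d" two hops away. All hypothetical.
graph = {
    "u": ["a", "b"],
    "a": ["u", "c"],
    "b": ["u", "d"],
    "c": ["a"], "d": ["b"],
}
labels = {"c": "kinase", "d": "kinase"}  # only some nodes are annotated

def annotate(node, k):
    """Majority label among annotated nodes within k hops of `node`,
    or None if no labeled node is that close."""
    dist = {node: 0}
    q = deque([node])
    votes = Counter()
    while q:
        u = q.popleft()
        if u in labels and u != node:
            votes[labels[u]] += 1
        for v in graph.get(u, []):
            if v not in dist and dist[u] + 1 <= k:
                dist[v] = dist[u] + 1
                q.append(v)
    return votes.most_common(1)[0][0] if votes else None

print(annotate("u", 1))  # immediate neighbors are unlabeled -> None
print(annotate("u", 2))  # two hops out picks up the two known nodes
```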
1187 00:50:10,380 --> 00:50:13,940 Now there's a fundamental assumption going on here 1188 00:50:13,940 --> 00:50:17,300 that the node has the same function as its neighbors, 1189 00:50:17,300 --> 00:50:20,430 which is fine when the neighbors are homogeneous. 1190 00:50:20,430 --> 00:50:23,970 But what do you do when the neighbors are heterogeneous? 1191 00:50:23,970 --> 00:50:27,440 So in this case, I've got two unknowns u and v. 1192 00:50:27,440 --> 00:50:30,260 And if I just were to take the K nearest neighbors, 1193 00:50:30,260 --> 00:50:32,390 they would have the same neighborhood, right? 1194 00:50:32,390 --> 00:50:34,760 But I might have a prior expectation that u is more like 1195 00:50:34,760 --> 00:50:39,460 the black nodes, and v is more like the grey nodes. 1196 00:50:39,460 --> 00:50:42,270 So how do you choose the best annotation? 1197 00:50:42,270 --> 00:50:45,290 The K nearest neighbors is OK, but it's not optimal. 1198 00:50:45,290 --> 00:50:48,530 So here's one approach, which says the following. 1199 00:50:48,530 --> 00:50:51,750 I'm going to go through for every function, 1200 00:50:51,750 --> 00:50:54,094 every annotation in my database, separately. 1201 00:50:54,094 --> 00:50:56,260 And for each annotation, I'll set all the nodes that 1202 00:50:56,260 --> 00:50:59,180 have that annotation to plus 1 and every node 1203 00:50:59,180 --> 00:51:01,680 that doesn't have that annotation, either it's unknown 1204 00:51:01,680 --> 00:51:04,890 or it's got some other annotation, to minus 1. 1205 00:51:04,890 --> 00:51:06,570 And then for every unknown, I'm going 1206 00:51:06,570 --> 00:51:09,960 to try to find the setting which is going 1207 00:51:09,960 --> 00:51:12,550 to maximize the sum of products. 1208 00:51:12,550 --> 00:51:15,570 So we're going to take the sum of the products of u 1209 00:51:15,570 --> 00:51:18,200 and all of its neighbors. 
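As a sketch of that sum-of-products rule on a hypothetical neighborhood (two neighbors carrying the annotation, one not), the unknown node simply takes the sign that maximizes the sum of products with its neighbors' current plus/minus 1 states.

```python
# Sum-of-products rule for one annotation: known nodes carry +1
# (annotated) or -1 (not); an unknown node is assigned the sign that
# maximizes the sum over its neighbors of s_u * s_v. Since the sum is
# linear in s_u, that's just the sign of the neighbors' total.
# Toy graph and states are hypothetical.
graph = {
    "u": ["a", "b", "c"],
    "a": ["u"], "b": ["u"], "c": ["u"],
}
state = {"a": +1, "b": +1, "c": -1, "u": +1}  # u's start is arbitrary

def local_update(node):
    """Best +/-1 setting for `node` given its neighbors' states."""
    score = sum(state[v] for v in graph[node])
    return +1 if score >= 0 else -1

state["u"] = local_update("u")
print(state["u"])  # two +1 neighbors outvote one -1 neighbor -> +1
```

Repeating this update node by node is exactly the local optimization discussed next, which is why it can get stuck short of the global optimum.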
1210 00:51:18,200 --> 00:51:21,742 So in this setting, if I set u to plus 1, 1211 00:51:21,742 --> 00:51:23,908 then I do better than if I set it to minus 1, right? 1212 00:51:27,260 --> 00:51:29,870 Because I'll get plus 1 plus 1 minus 1. 1213 00:51:29,870 --> 00:51:32,760 So that will be better than setting it to minus 1. 1214 00:51:32,760 --> 00:51:33,260 Yes. 1215 00:51:33,260 --> 00:51:35,627 AUDIENCE: Are we ignoring all the edge weights? 1216 00:51:35,627 --> 00:51:37,960 PROFESSOR: In this case, we're ignoring the edge weights. 1217 00:51:37,960 --> 00:51:40,090 We'll come back to using the edge weights later. 1218 00:51:40,090 --> 00:51:42,052 But this was done with an unweighted graph. 1219 00:51:42,052 --> 00:51:44,052 AUDIENCE: [INAUDIBLE] [? nearest neighborhood ?] 1220 00:51:44,052 --> 00:51:45,489 they're using it then? 1221 00:51:45,489 --> 00:51:47,780 PROFESSOR: So here they're using the nearest neighbors. 1222 00:51:47,780 --> 00:51:50,004 That's right, with no cutoff, right? 1223 00:51:50,004 --> 00:51:50,795 So any interaction. 1224 00:51:56,800 --> 00:51:59,982 So then we could iterate this to convergence. 1225 00:51:59,982 --> 00:52:01,190 That's one problem with this. 1226 00:52:01,190 --> 00:52:02,840 But maybe a more fundamental problem 1227 00:52:02,840 --> 00:52:05,880 is that you're never going to get the best overall solution 1228 00:52:05,880 --> 00:52:08,730 by this local optimization procedure. 1229 00:52:08,730 --> 00:52:10,950 So consider a setting like this. 1230 00:52:10,950 --> 00:52:13,700 Remember, I'm trying to maximize the sum 1231 00:52:13,700 --> 00:52:16,950 of the product of the settings for neighbors. 1232 00:52:16,950 --> 00:52:21,330 So how could I ever-- it seems plausible that all A, B, and C 1233 00:52:21,330 --> 00:52:24,280 here, should have the red annotation, right? 1234 00:52:24,280 --> 00:52:27,000 But if I set C to red, that doesn't help me. 
1235 00:52:27,000 --> 00:52:29,250 If I set A to red, that doesn't help me. 1236 00:52:29,250 --> 00:52:32,190 If I set B to red, it makes things worse. 1237 00:52:32,190 --> 00:52:34,820 So no local change is going to get me where I want to go. 1238 00:52:37,374 --> 00:52:38,540 So let's think for a second. 1239 00:52:38,540 --> 00:52:40,340 What algorithms have we already seen 1240 00:52:40,340 --> 00:52:42,540 that could help us get to the right answer? 1241 00:52:42,540 --> 00:52:45,320 We can't get here by local optimization. 1242 00:52:45,320 --> 00:52:48,170 We need to find the global minimum, not the local minimum. 1243 00:52:48,170 --> 00:52:49,670 So what algorithms have we seen that 1244 00:52:49,670 --> 00:52:51,140 help us find that global minimum? 1245 00:52:54,612 --> 00:52:58,180 Yeah, exactly-- simulated annealing. 1246 00:52:58,180 --> 00:53:01,040 So the simulated annealing version in this setting 1247 00:53:01,040 --> 00:53:02,700 is as follows. 1248 00:53:02,700 --> 00:53:04,280 I initialize the graph. 1249 00:53:04,280 --> 00:53:06,850 I pick a neighboring node, v, that I'm going to add. 1250 00:53:06,850 --> 00:53:09,830 Say we'll turn one of these red. 1251 00:53:09,830 --> 00:53:16,370 I check the value of that sum of the products for this new one. 1252 00:53:16,370 --> 00:53:19,864 And if it's improving things, I keep it. 1253 00:53:19,864 --> 00:53:21,905 But the critical thing is, if it doesn't improve, 1254 00:53:21,905 --> 00:53:23,530 if it makes things worse, I still 1255 00:53:23,530 --> 00:53:24,780 keep it with some probability. 1256 00:53:24,780 --> 00:53:27,480 It's based on how bad things have gotten. 1257 00:53:27,480 --> 00:53:29,295 And by doing this, we can climb the hill 1258 00:53:29,295 --> 00:53:33,630 and get over to some global optimum. 1259 00:53:33,630 --> 00:53:35,660 So we saw simulated annealing before. 1260 00:53:35,660 --> 00:53:36,490 In what context? 
1261 00:53:36,490 --> 00:53:38,386 In the side chain placement problem. 1262 00:53:38,386 --> 00:53:39,510 Here we're seeing it again. 1263 00:53:39,510 --> 00:53:40,370 It's quite broad. 1264 00:53:40,370 --> 00:53:42,299 Any time you've got a local optimization that 1265 00:53:42,299 --> 00:53:43,840 doesn't get you where you need to go, 1266 00:53:43,840 --> 00:53:45,114 you need global optimization. 1267 00:53:45,114 --> 00:53:46,530 You can think simulated annealing. 1268 00:53:46,530 --> 00:53:49,761 It's quite often a plausible way to go. 1269 00:53:49,761 --> 00:53:50,260 All right. 1270 00:53:50,260 --> 00:53:53,092 So this is one approach for annotation. 1271 00:53:53,092 --> 00:53:55,050 We also wanted to see whether we could discover 1272 00:53:55,050 --> 00:53:56,890 inherent structure in these graphs. 1273 00:53:56,890 --> 00:53:58,600 So often, we'll be interested in trying 1274 00:53:58,600 --> 00:54:00,600 to find clusters in a graph. 1275 00:54:00,600 --> 00:54:03,680 Some graphs have obvious structures in them. 1276 00:54:03,680 --> 00:54:05,940 Other graphs, it's a little less obvious. 1277 00:54:05,940 --> 00:54:07,780 What algorithms exist for trying to do this? 1278 00:54:07,780 --> 00:54:10,521 We're going to look at two relatively straightforward 1279 00:54:10,521 --> 00:54:11,020 ways. 1280 00:54:11,020 --> 00:54:13,010 One is called edge betweenness clustering 1281 00:54:13,010 --> 00:54:16,730 and the other one is a Markov process. 1282 00:54:16,730 --> 00:54:19,160 Edge betweenness, I think, is the most intuitive. 1283 00:54:19,160 --> 00:54:25,860 So I look at each edge, and I ask 1284 00:54:25,860 --> 00:54:28,370 for all pairs of nodes in the graph, 1285 00:54:28,370 --> 00:54:30,360 does the shortest path between those nodes 1286 00:54:30,360 --> 00:54:31,395 pass through this edge? 
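A minimal sketch of that annealing loop in Python, on the two-color (+1/-1) annotation objective described above. The graph, node names, temperature schedule, and step count are all invented for illustration:

```python
import math
import random

def anneal_labels(nodes, edges, steps=5000, t0=2.0, cooling=0.999, seed=0):
    """Simulated annealing over +/-1 node labels, maximizing the
    sum over edges of s[u] * s[v] (agreement between neighbors)."""
    rng = random.Random(seed)
    nbrs = {v: [] for v in nodes}
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    s = {v: rng.choice([-1, 1]) for v in nodes}
    score = sum(s[u] * s[v] for u, v in edges)
    best_score = score
    t = t0
    for _ in range(steps):
        v = rng.choice(nodes)
        # Flipping v only changes the terms on edges incident to v.
        delta = -2 * s[v] * sum(s[w] for w in nbrs[v])
        # Always accept improvements; accept a worse move with
        # probability exp(delta / t), which is what lets us climb
        # out of the local optima that defeat greedy flipping.
        if delta >= 0 or rng.random() < math.exp(delta / t):
            s[v] = -s[v]
            score += delta
            best_score = max(best_score, score)
        t *= cooling  # cool the temperature
    return best_score

# Path graph A - B - C: the optimum is all nodes the same color,
# which makes both edge terms +1, for a score of 2.
print(anneal_labels(["A", "B", "C"], [("A", "B"), ("B", "C")]))
```

With the temperature decaying toward zero, late moves become effectively greedy, so the loop settles into whatever basin it has climbed into.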
1287 00:54:35,270 --> 00:54:38,790 So if I look at this edge, very few shortest paths 1288 00:54:38,790 --> 00:54:40,240 go through this edge, right? 1289 00:54:40,240 --> 00:54:42,640 Just the shortest path for those two nodes. 1290 00:54:42,640 --> 00:54:47,759 But if I look at this edge, all of the shortest paths 1291 00:54:47,759 --> 00:54:50,050 between any node on this side and any node on this side 1292 00:54:50,050 --> 00:54:51,325 have to pass through there. 1293 00:54:51,325 --> 00:54:55,090 So that has a high betweenness. 1294 00:54:55,090 --> 00:54:58,750 So if I want a cluster, I can go through my graph. 1295 00:54:58,750 --> 00:55:01,400 I can compute betweenness. 1296 00:55:01,400 --> 00:55:03,470 I take the edge that has the highest betweenness, 1297 00:55:03,470 --> 00:55:05,330 and I remove it from my graph. 1298 00:55:05,330 --> 00:55:07,720 And then I repeat. 1299 00:55:07,720 --> 00:55:09,960 And I'll be slowly breaking my graph down 1300 00:55:09,960 --> 00:55:14,050 into chunks that are relatively more connected internally 1301 00:55:14,050 --> 00:55:15,890 than they are to things in other pieces. 1302 00:55:19,430 --> 00:55:20,360 Any questions? 1303 00:55:20,360 --> 00:55:21,860 So that's an entire edge betweenness 1304 00:55:21,860 --> 00:55:22,860 clustering algorithm. 1305 00:55:22,860 --> 00:55:23,840 Pretty straightforward. 1306 00:55:27,480 --> 00:55:32,590 Now an alternative is a Markov clustering method. 1307 00:55:32,590 --> 00:55:34,430 And the Markov clustering method is 1308 00:55:34,430 --> 00:55:37,780 based on the idea of random walks in the graph. 1309 00:55:37,780 --> 00:55:41,070 So again, let's try to develop some intuition here. 
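That loop can be sketched in plain Python for an unweighted graph. As a simplification, this counts one BFS shortest path per node pair (the full Girvan-Newman method counts all shortest paths), and the two-triangle example graph is invented:

```python
from collections import deque, defaultdict

def one_shortest_path(adj, s, t):
    """One BFS shortest path from s to t (graph assumed connected)."""
    prev = {s: None}
    q = deque([s])
    while q:
        u = q.popleft()
        if u == t:
            path = []
            while u is not None:
                path.append(u)
                u = prev[u]
            return path[::-1]
        for w in adj[u]:
            if w not in prev:
                prev[w] = u
                q.append(w)

def betweenness_split(nodes, edges):
    """Repeatedly remove the highest-betweenness edge until the
    graph falls into two connected components, then return them."""
    edges = [frozenset(e) for e in edges]
    while True:
        adj = defaultdict(list)
        for e in edges:
            u, v = tuple(e)
            adj[u].append(v)
            adj[v].append(u)
        # Check whether the graph has already split in two.
        seen = {nodes[0]}
        q = deque([nodes[0]])
        while q:
            u = q.popleft()
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    q.append(w)
        if len(seen) < len(nodes):
            return sorted(seen), sorted(v for v in nodes if v not in seen)
        # Count, for every node pair, the edges on its shortest path.
        count = defaultdict(int)
        for i, s in enumerate(nodes):
            for t in nodes[i + 1:]:
                path = one_shortest_path(adj, s, t)
                for a, b in zip(path, path[1:]):
                    count[frozenset((a, b))] += 1
        edges.remove(max(edges, key=lambda e: count[e]))

# Two triangles joined by the bridge c-d: every cross-pair shortest
# path uses the bridge, so it has the highest betweenness and goes first.
left, right = betweenness_split(
    list("abcdef"),
    [("a", "b"), ("b", "c"), ("c", "a"),
     ("d", "e"), ("e", "f"), ("f", "d"), ("c", "d")])
print(left, right)
```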
1310 00:55:41,070 --> 00:55:42,610 If I start at some node over here, 1311 00:55:42,610 --> 00:55:46,220 and I randomly wander across this graph, 1312 00:55:46,220 --> 00:55:48,991 I'm more likely to stay on the left-hand side 1313 00:55:48,991 --> 00:55:51,490 than I am to move all the way across to the right-hand side, 1314 00:55:51,490 --> 00:55:54,060 correct? 1315 00:55:54,060 --> 00:55:56,090 So can I formalize that and actually come up 1316 00:55:56,090 --> 00:55:58,500 with a measure of how often any node will visit 1317 00:55:58,500 --> 00:56:01,410 any other and then use that to cluster the graph? 1318 00:56:05,990 --> 00:56:07,720 So remember our adjacency matrix, 1319 00:56:07,720 --> 00:56:12,020 which just represented which nodes were connected to which. 1320 00:56:12,020 --> 00:56:16,280 And what happens if I multiply the adjacency matrix by itself? 1321 00:56:16,280 --> 00:56:19,290 So I raise it to some power. 1322 00:56:19,290 --> 00:56:23,750 Well, if I multiply the adjacency matrix by itself 1323 00:56:23,750 --> 00:56:27,770 just once, the squared adjacency matrix has the property 1324 00:56:27,770 --> 00:56:30,760 that it tells me how many paths of length 2 1325 00:56:30,760 --> 00:56:33,150 exist between any two nodes. 1326 00:56:33,150 --> 00:56:36,160 So the adjacency matrix told me how many paths of length 1 1327 00:56:36,160 --> 00:56:36,851 exist. 1328 00:56:36,851 --> 00:56:37,350 Right? 1329 00:56:37,350 --> 00:56:38,704 You're directly connected. 1330 00:56:38,704 --> 00:56:40,120 If I square the adjacency matrix, 1331 00:56:40,120 --> 00:56:43,510 it tells me how many paths of length 2 exist. 1332 00:56:43,510 --> 00:56:46,790 The N-th power tells me how many paths of length N exist. 1333 00:56:46,790 --> 00:56:48,150 So let's see if that works. 1334 00:56:48,150 --> 00:56:49,710 This claims that there are exactly 1335 00:56:49,710 --> 00:56:53,334 two paths that connect node 2 to node 2. 
1336 00:56:53,334 --> 00:56:54,375 What are those two paths? 1337 00:56:59,060 --> 00:57:00,180 Connect node 2 to node 2. 1338 00:57:00,180 --> 00:57:01,930 I go here, and I go back. 1339 00:57:01,930 --> 00:57:06,030 That's a path of length 2, and this is a path of length 2. 1340 00:57:06,030 --> 00:57:08,200 And there are zero paths of length 2 1341 00:57:08,200 --> 00:57:13,210 that connect node 2 to node 3, because 1, 2. 1342 00:57:13,210 --> 00:57:15,110 I'm not back at 3. 1343 00:57:15,110 --> 00:57:18,390 So in general, A to the N equals m 1344 00:57:18,390 --> 00:57:23,220 if there exist exactly m paths of length N between those two 1345 00:57:23,220 --> 00:57:24,020 nodes. 1346 00:57:24,020 --> 00:57:25,160 So how does this help me? 1347 00:57:25,160 --> 00:57:28,830 Well, you take that idea of the N-th power of the adjacency 1348 00:57:28,830 --> 00:57:32,360 matrix and convert it to a transition probability matrix, 1349 00:57:32,360 --> 00:57:34,510 simply by normalizing. 1350 00:57:34,510 --> 00:57:36,969 So if I were to do a random walk in this graph, 1351 00:57:36,969 --> 00:57:39,010 what's the probability that I'll move from node i 1352 00:57:39,010 --> 00:57:41,420 to node j in a certain number of steps? 1353 00:57:41,420 --> 00:57:43,330 That's what I want to compute. 1354 00:57:43,330 --> 00:57:45,779 So I need to have a stochastic matrix, 1355 00:57:45,779 --> 00:57:47,195 where the sum of the probabilities 1356 00:57:47,195 --> 00:57:50,426 for any transition is 1. 1357 00:57:50,426 --> 00:57:51,550 I have to end up somewhere. 1358 00:57:51,550 --> 00:57:53,370 I either end up back in myself, or I end up 1359 00:57:53,370 --> 00:57:54,203 at some other node. 1360 00:57:54,203 --> 00:57:56,810 I'm just going to take that adjacency matrix 1361 00:57:56,810 --> 00:57:59,370 and normalize the columns. 1362 00:57:59,370 --> 00:58:03,140 And then that gives me the stochastic matrix. 
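The path-counting claim is easy to check numerically. A small sketch, where the 1 - 2 - 3 path graph mirrors the example (0-based indexing, so node 2 is row/column 1):

```python
def matmul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Adjacency matrix for the path graph 1 - 2 - 3.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]

A2 = matmul(A, A)  # A2[i][j] = number of length-2 paths i -> j
print(A2[1][1])    # node 2 back to node 2: out-and-back two ways
print(A2[1][2])    # node 2 to node 3 in exactly two steps: none

# Column-normalizing A instead gives the one-step transition matrix,
# where each column sums to 1.
T = [[A[i][j] / sum(A[k][j] for k in range(3)) for j in range(3)]
     for i in range(3)]
```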
1363 00:58:03,140 --> 00:58:05,460 And then I can exponentiate the stochastic matrix 1364 00:58:05,460 --> 00:58:08,290 to figure out my probability of moving from any node 1365 00:58:08,290 --> 00:58:11,594 to any other in a certain number of steps. 1366 00:58:11,594 --> 00:58:12,510 Any questions on that? 1367 00:58:15,201 --> 00:58:15,700 OK. 1368 00:58:18,790 --> 00:58:23,830 So if we simply keep multiplying this stochastic matrix, 1369 00:58:23,830 --> 00:58:26,450 we'll get the probabilities for increasing numbers of moves. 1370 00:58:26,450 --> 00:58:28,700 But it doesn't give us sharp partitions of the matrix. 1371 00:58:28,700 --> 00:58:31,034 So to do a Markov clustering, we alternate exponentiation 1372 00:58:31,034 --> 00:58:32,950 of this matrix with what's called an inflation 1373 00:58:32,950 --> 00:58:35,720 operator, which is the following. 1374 00:58:38,500 --> 00:58:43,930 This inflation operator takes the r-th power 1375 00:58:43,930 --> 00:58:48,100 of each entry of the matrix and puts in the denominator 1376 00:58:48,100 --> 00:58:51,945 the sum of the r-th powers of the entries in that column. 1377 00:58:51,945 --> 00:58:52,820 So here's an example. 1378 00:58:52,820 --> 00:58:57,275 Let's say I've got two probabilities-- 0.9 and 0.1. 1379 00:58:57,275 --> 00:59:01,416 When I inflate it, I square the numerator, 1380 00:59:01,416 --> 00:59:03,290 and I square each element of the denominator. 1381 00:59:03,290 --> 00:59:09,210 Now I've gone from 0.9 to roughly 0.99 and 0.1 to roughly 0.01. 1382 00:59:09,210 --> 00:59:11,380 So this inflation operator exaggerates 1383 00:59:11,380 --> 00:59:14,314 all my probabilities and makes the higher probabilities more 1384 00:59:14,314 --> 00:59:16,480 probable and makes the lower probabilities even less 1385 00:59:16,480 --> 00:59:18,910 probable. 
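In code, the inflation step for one column is just an entrywise power followed by renormalization. A sketch reproducing the 0.9/0.1 example (the exact inflated values come out near 0.99 and 0.01):

```python
def inflate_column(col, r=2):
    """Raise each entry to the r-th power, then renormalize so the
    column still sums to 1 -- the Markov clustering inflation step."""
    powered = [p ** r for p in col]
    z = sum(powered)
    return [p / z for p in powered]

print(inflate_column([0.9, 0.1]))  # roughly [0.988, 0.012]
```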
1386 00:59:18,910 --> 00:59:20,430 So I take this matrix that 1387 00:59:20,430 --> 00:59:22,280 represents random walks in my graph, 1388 00:59:22,280 --> 00:59:24,510 and I exaggerate it with the inflation operator. 1389 00:59:24,510 --> 00:59:27,310 And that takes the basic clustering, 1390 00:59:27,310 --> 00:59:30,990 and it makes it more compact. 1391 00:59:30,990 --> 00:59:35,540 So the algorithm for this Markov clustering is as follows. 1392 00:59:35,540 --> 00:59:37,220 I start with a graph. 1393 00:59:37,220 --> 00:59:38,370 I add loops to the graph. 1394 00:59:38,370 --> 00:59:39,237 Why do I add loops? 1395 00:59:39,237 --> 00:59:41,820 Because I need some probability that I stay in the same place, 1396 00:59:41,820 --> 00:59:42,910 right? 1397 00:59:42,910 --> 00:59:44,425 And in a normal adjacency matrix, 1398 00:59:44,425 --> 00:59:45,800 you can't stay in the same place. 1399 00:59:45,800 --> 00:59:47,734 You have to go somewhere. 1400 00:59:47,734 --> 00:59:48,400 So I add a loop. 1401 00:59:48,400 --> 00:59:51,510 So there's always a self loop. 1402 00:59:51,510 --> 00:59:56,680 Then I set the inflation parameter to some value. 1403 00:59:56,680 --> 01:00:01,176 M_1 is the matrix of random walks in the original graph. 1404 01:00:01,176 --> 01:00:02,720 I multiply that. 1405 01:00:02,720 --> 01:00:05,110 I inflate it. 1406 01:00:05,110 --> 01:00:07,550 And then I find the difference. 1407 01:00:07,550 --> 01:00:11,480 And I do that until the difference in this 1408 01:00:11,480 --> 01:00:15,000 matrix between iterations gets below some value. 1409 01:00:15,000 --> 01:00:17,770 And what I end up with then are relatively sharp partitions 1410 01:00:17,770 --> 01:00:20,734 of the overall structure. 1411 01:00:20,734 --> 01:00:24,710 So I'll show you an example of how that works. 1412 01:00:24,710 --> 01:00:26,260 So in this case, the authors were 1413 01:00:26,260 --> 01:00:32,210 using a matrix where the nodes represented proteins. 
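Putting the pieces together, a compact sketch of the whole loop in plain Python: self-loops, column normalization, then alternating expansion and inflation until the matrix stops changing. The two-triangle toy graph and the attractor-based cluster read-out are illustrative choices, not the exact implementation from any particular paper:

```python
def mcl(adj, r=2, max_iter=100, tol=1e-8):
    """Markov clustering sketch on an adjacency matrix (list of rows)."""
    n = len(adj)
    # Add self-loops so a walker can stay put, then column-normalize.
    M = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    for j in range(n):
        z = sum(M[i][j] for i in range(n))
        for i in range(n):
            M[i][j] /= z
    for _ in range(max_iter):
        # Expansion: square the matrix (longer random walks).
        E = [[sum(M[i][k] * M[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: entrywise r-th power, renormalize each column.
        for j in range(n):
            z = sum(E[i][j] ** r for i in range(n))
            for i in range(n):
                E[i][j] = E[i][j] ** r / z
        diff = max(abs(E[i][j] - M[i][j]) for i in range(n) for j in range(n))
        M = E
        if diff < tol:
            break
    # Read out clusters: group columns by their heaviest row (attractor).
    clusters = {}
    for j in range(n):
        attractor = max(range(n), key=lambda i: M[i][j])
        clusters.setdefault(attractor, []).append(j)
    return sorted(sorted(c) for c in clusters.values())

# Two triangles (0,1,2) and (3,4,5) joined by the bridge edge 2-3.
A = [[0, 1, 1, 0, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [1, 1, 0, 1, 0, 0],
     [0, 0, 1, 0, 1, 1],
     [0, 0, 0, 1, 0, 1],
     [0, 0, 0, 1, 1, 0]]
print(mcl(A))
```

On this graph the walk probabilities concentrate within each triangle, so the partition recovers the two triangles.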
1414 01:00:32,210 --> 01:00:34,225 The edges represented BLAST hits. 1415 01:00:34,225 --> 01:00:35,990 And what they wanted to do was find 1416 01:00:35,990 --> 01:00:39,610 families of proteins that had sequence 1417 01:00:39,610 --> 01:00:40,770 similarity to each other. 1418 01:00:40,770 --> 01:00:44,152 But they didn't want it to be entirely dominated by domains. 1419 01:00:44,152 --> 01:00:46,610 So they figured that this graph structure would be helpful, 1420 01:00:46,610 --> 01:00:48,630 because you'd get-- for any protein, 1421 01:00:48,630 --> 01:00:53,262 there'd be edges, not just to things 1422 01:00:53,262 --> 01:00:55,470 that had similar common domains, but also 1423 01:00:55,470 --> 01:00:59,670 edges connecting it to other proteins as well. 1424 01:00:59,670 --> 01:01:03,750 So in the original graph, the edges are these BLAST values. 1425 01:01:03,750 --> 01:01:05,980 They come up with the transition matrix. 1426 01:01:05,980 --> 01:01:08,170 They convert it into the Markov matrix, 1427 01:01:08,170 --> 01:01:10,540 and they carry out that exponentiation. 1428 01:01:10,540 --> 01:01:12,680 And what they end up with are clusters 1429 01:01:12,680 --> 01:01:17,190 where any individual domain can appear in multiple clusters. 1430 01:01:17,190 --> 01:01:20,060 The clusters are determined not just by the highest BLAST hit, 1431 01:01:20,060 --> 01:01:22,760 but by the whole network property of what other proteins 1432 01:01:22,760 --> 01:01:24,940 they're connected to. 1433 01:01:24,940 --> 01:01:28,300 And it's also been done with a network, where the underlying 1434 01:01:28,300 --> 01:01:30,120 network represents gene expression, 1435 01:01:30,120 --> 01:01:33,330 and edges between two genes represent the degree 1436 01:01:33,330 --> 01:01:37,480 of correlation of the expression across a very large data 1437 01:01:37,480 --> 01:01:39,980 set for 61 mouse tissues. 
1438 01:01:39,980 --> 01:01:42,050 And once again, you take the overall graph, 1439 01:01:42,050 --> 01:01:44,140 and you can break it down into clusters, 1440 01:01:44,140 --> 01:01:46,540 where you can find functional annotations 1441 01:01:46,540 --> 01:01:47,570 for specific clusters. 1442 01:01:50,320 --> 01:01:54,210 Any questions then on the Markov clustering? 1443 01:01:54,210 --> 01:01:55,930 So these are two separate ways of looking 1444 01:01:55,930 --> 01:01:57,962 at the underlying structure of a graph. 1445 01:01:57,962 --> 01:02:00,170 We had the edge betweenness clustering and the Markov 1446 01:02:00,170 --> 01:02:00,862 clustering. 1447 01:02:00,862 --> 01:02:03,070 Now when you do this, you have to make some decision: 1448 01:02:03,070 --> 01:02:04,500 I've found this cluster. 1449 01:02:04,500 --> 01:02:06,260 Now how do I decide what it's doing? 1450 01:02:06,260 --> 01:02:08,350 So you need to do some sort of annotation. 1451 01:02:08,350 --> 01:02:09,940 So once I have a cluster, how am I 1452 01:02:09,940 --> 01:02:13,840 going to assign a function to that cluster? 1453 01:02:13,840 --> 01:02:16,220 So one thing I could do would be to look 1454 01:02:16,220 --> 01:02:18,590 at things that already have an annotation. 1455 01:02:18,590 --> 01:02:19,680 So I got some cluster. 1456 01:02:19,680 --> 01:02:21,110 Maybe two members of this cluster 1457 01:02:21,110 --> 01:02:23,540 have an annotation and two members of this one. 1458 01:02:23,540 --> 01:02:25,110 And that's fine. 1459 01:02:25,110 --> 01:02:26,910 But what do I do when a cluster has 1460 01:02:26,910 --> 01:02:29,840 a whole bunch of different annotations? 1461 01:02:29,840 --> 01:02:31,885 So I could be arbitrary. 1462 01:02:31,885 --> 01:02:33,930 I could just take the one that's the most common. 1463 01:02:33,930 --> 01:02:36,471 But a nicer way to do it is with the hypergeometric distribution 1464 01:02:36,471 --> 01:02:38,620 that you saw in the earlier part of the semester. 
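That enrichment test is easy to sketch: with N genes in the graph, K of them carrying an annotation, and k of the n genes in a cluster carrying it, the hypergeometric tail gives the probability of seeing an overlap at least that large by chance. All the counts below are invented for illustration:

```python
from math import comb

def enrichment_pvalue(k, n, K, N):
    """P(X >= k) for a hypergeometric draw: at least k annotated
    genes in a cluster of n, when K of the N genes in the whole
    graph carry the annotation."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / comb(N, n)

# 4 of the 5 genes in a cluster share "DNA repair", which annotates
# only 20 of the 1,000 genes in the network -- a tiny p-value, so the
# cluster gets that annotation.
print(enrichment_pvalue(4, 5, 20, 1000))
```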
1465 01:02:43,950 --> 01:02:46,790 So these are all ways of clustering the underlying graph 1466 01:02:46,790 --> 01:02:48,420 without any reference to specific data 1467 01:02:48,420 --> 01:02:50,540 for a particular condition that you're interested in. 1468 01:02:50,540 --> 01:02:51,915 A slightly harder problem is when 1469 01:02:51,915 --> 01:02:53,680 I do have those specific data, and I'd 1470 01:02:53,680 --> 01:02:55,620 like to find a piece of the network that's 1471 01:02:55,620 --> 01:02:57,732 most relevant to those specific data. 1472 01:02:57,732 --> 01:02:59,690 So it could be different in different settings. 1473 01:02:59,690 --> 01:03:01,190 Maybe the part of the network that's 1474 01:03:01,190 --> 01:03:02,869 relevant in the cancer setting is not 1475 01:03:02,869 --> 01:03:05,160 the part of the network that's relevant in the diabetes 1476 01:03:05,160 --> 01:03:07,640 setting. 1477 01:03:07,640 --> 01:03:10,160 So one way to think about this is that I have the network, 1478 01:03:10,160 --> 01:03:12,000 and I paint onto it my expression 1479 01:03:12,000 --> 01:03:13,620 data or my proteomic data. 1480 01:03:13,620 --> 01:03:16,850 And then I want to find chunks of the network that 1481 01:03:16,850 --> 01:03:18,640 are enriched in activity. 1482 01:03:18,640 --> 01:03:21,600 So this is sometimes called the active subgraph problem. 1483 01:03:21,600 --> 01:03:23,530 And how do we find the active subgraph? 1484 01:03:23,530 --> 01:03:25,510 Well, it's not that different from the problem 1485 01:03:25,510 --> 01:03:27,190 that we just looked at. 1486 01:03:27,190 --> 01:03:31,030 So if I want to figure out a piece of the network that's 1487 01:03:31,030 --> 01:03:33,790 active, I could just take the things that are immediately 1488 01:03:33,790 --> 01:03:35,160 connected to each other. 1489 01:03:35,160 --> 01:03:37,040 That doesn't give me the global picture. 
1490 01:03:37,040 --> 01:03:38,870 So instead why don't I try to find 1491 01:03:38,870 --> 01:03:40,350 larger chunks of the network where 1492 01:03:40,350 --> 01:03:43,300 I can include some nodes for which I do not 1493 01:03:43,300 --> 01:03:45,240 have specific data? 1494 01:03:45,240 --> 01:03:46,877 And one way that's been done for that 1495 01:03:46,877 --> 01:03:48,710 is, again, the simulated annealing approach. 1496 01:03:48,710 --> 01:03:51,880 So you can try to find pieces of the network that 1497 01:03:51,880 --> 01:03:54,700 maximize the probability that all 1498 01:03:54,700 --> 01:03:57,790 the things in the subnetwork are active. 1499 01:04:00,302 --> 01:04:01,760 Another formulation of this problem 1500 01:04:01,760 --> 01:04:04,126 is something that's called the Steiner tree problem. 1501 01:04:04,126 --> 01:04:06,000 And in the Steiner tree, I want to find trees 1502 01:04:06,000 --> 01:04:10,060 in the network that consist of all the nodes that are active, 1503 01:04:10,060 --> 01:04:15,494 plus some nodes that are not, for which I have no data. 1504 01:04:15,494 --> 01:04:17,160 And those nodes for which I have no data 1505 01:04:17,160 --> 01:04:18,712 are called Steiner nodes. 1506 01:04:18,712 --> 01:04:20,920 And this was a problem that was looked at extensively 1507 01:04:20,920 --> 01:04:21,836 in telecommunications. 1508 01:04:21,836 --> 01:04:25,090 So if I want to wire up a bunch of buildings-- 1509 01:04:25,090 --> 01:04:29,130 back when people used wires-- say to give telephone service, 1510 01:04:29,130 --> 01:04:31,552 so I need to figure out what the minimum cost is 1511 01:04:31,552 --> 01:04:32,510 for wiring them all up. 1512 01:04:32,510 --> 01:04:35,580 And sometimes, that involves sticking a pole in the ground, 1513 01:04:35,580 --> 01:04:37,850 then having everybody communicate to that pole. 
1514 01:04:37,850 --> 01:04:43,000 So if I've got paying customers over here, 1515 01:04:43,000 --> 01:04:45,110 and I want to wire them to each other, 1516 01:04:45,110 --> 01:04:51,160 I could run wires between everybody. 1517 01:04:51,160 --> 01:04:52,330 But I don't have to. 1518 01:04:52,330 --> 01:04:55,290 If I stick a pole over here, then I don't need this wire, 1519 01:04:55,290 --> 01:04:58,194 and I don't need this wire, and I don't need this wire. 1520 01:04:58,194 --> 01:04:59,860 So this is what's called a Steiner node. 1521 01:05:06,940 --> 01:05:11,470 And so in graph theory, there are pretty efficient algorithms 1522 01:05:11,470 --> 01:05:16,030 for finding a Steiner graph-- the Steiner tree-- the smallest 1523 01:05:16,030 --> 01:05:17,600 tree that connects all of the nodes. 1524 01:05:17,600 --> 01:05:20,640 Now the problem in our setting is that we don't necessarily 1525 01:05:20,640 --> 01:05:22,371 want to connect every node, because we're 1526 01:05:22,371 --> 01:05:24,120 going to have in our data some things that 1527 01:05:24,120 --> 01:05:25,640 are false positives. 1528 01:05:25,640 --> 01:05:27,660 And if we connect too many things in our graph, 1529 01:05:27,660 --> 01:05:31,050 we end up with what are lovingly called "hairballs." 1530 01:05:31,050 --> 01:05:33,100 So I'll give you a specific example of that. 1531 01:05:33,100 --> 01:05:34,891 Here's some data that we were working with. 1532 01:05:34,891 --> 01:05:37,830 We had a relatively small number of experimental hits that 1533 01:05:37,830 --> 01:05:39,460 were detected as changing in a cancer 1534 01:05:39,460 --> 01:05:42,240 setting and the interactome graph. 
1535 01:05:42,240 --> 01:05:46,399 And if you simply look for the shortest path, 1536 01:05:46,399 --> 01:05:48,190 I should say, between the experimental hits 1537 01:05:48,190 --> 01:05:49,640 across the interactome, you end up 1538 01:05:49,640 --> 01:05:53,590 with something that looks very similar to the interactome. 1539 01:05:53,590 --> 01:05:56,286 So you start off with a relatively small set of nodes, 1540 01:05:56,286 --> 01:05:57,910 and you try to find the subnetwork that 1541 01:05:57,910 --> 01:05:59,060 includes everything. 1542 01:05:59,060 --> 01:06:02,290 And you get a giant graph. 1543 01:06:02,290 --> 01:06:04,060 And it's very hard to figure out what 1544 01:06:04,060 --> 01:06:06,179 to do with a graph that's this big. 1545 01:06:06,179 --> 01:06:07,970 I mean, there may be some information here, 1546 01:06:07,970 --> 01:06:09,939 but you've taken a relatively simple problem 1547 01:06:09,939 --> 01:06:12,230 to try to understand the relationship among these hits. 1548 01:06:12,230 --> 01:06:13,896 And you've turned it into a problem that 1549 01:06:13,896 --> 01:06:18,070 now involves hundreds and hundreds of nodes. 1550 01:06:18,070 --> 01:06:20,530 So these kinds of problems arise, as I said, 1551 01:06:20,530 --> 01:06:22,400 in part, because of noise in the data. 1552 01:06:22,400 --> 01:06:25,060 So some of these hits are not real. 1553 01:06:25,060 --> 01:06:26,740 And incorporating those, obviously, 1554 01:06:26,740 --> 01:06:30,470 makes me take very long paths in the interactome, 1555 01:06:30,470 --> 01:06:33,160 but also arises because of the noise in the interactome-- 1556 01:06:33,160 --> 01:06:35,910 both false positives and false negatives. 1557 01:06:35,910 --> 01:06:38,710 So I have two proteins that I'm trying to connect, 1558 01:06:38,710 --> 01:06:40,710 and there's a false positive in the interactome. 1559 01:06:40,710 --> 01:06:42,582 It's going to draw a line between them. 
1560 01:06:42,582 --> 01:06:44,540 If there's a false negative in the interactome, 1561 01:06:44,540 --> 01:06:47,780 maybe these things really do interact, but there's no edge. 1562 01:06:47,780 --> 01:06:49,720 If I force the algorithm to find a connection, 1563 01:06:49,720 --> 01:06:51,720 it probably can, because most of the interactome 1564 01:06:51,720 --> 01:06:54,400 is one giant connected component. 1565 01:06:54,400 --> 01:06:56,630 But it could be a very, very long path. 1566 01:06:56,630 --> 01:06:58,619 It goes through many other proteins. 1567 01:06:58,619 --> 01:07:00,910 And so in the process of trying to connect all my data, 1568 01:07:00,910 --> 01:07:02,555 I can get extremely large graphs. 1569 01:07:05,230 --> 01:07:07,304 So to avoid having giant networks-- 1570 01:07:07,304 --> 01:07:08,970 so on this projector, unfortunately, you 1571 01:07:08,970 --> 01:07:10,190 can't see this very well. 1572 01:07:10,190 --> 01:07:13,947 But there are a lot of edges among all the nodes here. 1573 01:07:13,947 --> 01:07:15,280 Most of you have your computers. 1574 01:07:15,280 --> 01:07:16,321 You can look at it there. 1575 01:07:16,321 --> 01:07:20,150 So in a Steiner tree approach, if my data 1576 01:07:20,150 --> 01:07:24,220 are the ones that are yellow, they're called terminals. 1577 01:07:24,220 --> 01:07:26,450 And the grey ones, I have no data. 1578 01:07:26,450 --> 01:07:30,770 And if I ask to try to solve the Steiner tree problem, 1579 01:07:30,770 --> 01:07:33,625 it's going to have to find a way to connect this node up 1580 01:07:33,625 --> 01:07:34,750 to the rest of the network. 1581 01:07:37,852 --> 01:07:39,310 But if this one's a false positive, 1582 01:07:39,310 --> 01:07:41,760 that's not the desired outcome. 
1583 01:07:41,760 --> 01:07:43,620 So there are optimization techniques 1584 01:07:43,620 --> 01:07:46,110 that actually allow me to tell the algorithm that it's 1585 01:07:46,110 --> 01:07:49,615 OK to leave out some of the data to get a more compact network. 1586 01:07:52,497 --> 01:07:54,330 So one of those approaches is called the prize 1587 01:07:54,330 --> 01:07:56,130 collecting Steiner tree problem. 1588 01:07:56,130 --> 01:07:58,580 And the idea here is the following. 1589 01:07:58,580 --> 01:08:01,370 For every node for which I have experimental data, 1590 01:08:01,370 --> 01:08:05,410 I associate with that node a prize. 1591 01:08:05,410 --> 01:08:07,590 The prize is larger, the more confident 1592 01:08:07,590 --> 01:08:10,640 I am that that node is relevant in the experiment. 1593 01:08:10,640 --> 01:08:12,900 And for every edge, I take the edge weight, 1594 01:08:12,900 --> 01:08:15,400 and I convert it into a cost. 1595 01:08:15,400 --> 01:08:19,439 If I have a high confidence edge, there's a low cost. 1596 01:08:19,439 --> 01:08:20,810 It's cheap. 1597 01:08:20,810 --> 01:08:24,520 Low confidence edges are going to be very expensive. 1598 01:08:24,520 --> 01:08:26,210 And now I ask the algorithm to try 1599 01:08:26,210 --> 01:08:28,790 to connect up all the things it can. 1600 01:08:28,790 --> 01:08:31,540 Every time it includes a node, it keeps 1601 01:08:31,540 --> 01:08:35,040 the prize, but it had to add an edge, so it pays the cost. 1602 01:08:35,040 --> 01:08:37,250 So there's a trade-off for every node. 1603 01:08:37,250 --> 01:08:41,260 So if the algorithm wants to include this node, 1604 01:08:41,260 --> 01:08:44,642 then it's going to pay the price for all the edges, 1605 01:08:44,642 --> 01:08:45,850 but it gets to keep the node's prize. 1606 01:08:45,850 --> 01:08:47,766 So the optimization function is the following. 1607 01:08:47,766 --> 01:08:53,220 For every vertex that's not in the tree, there's a penalty. 
1608 01:08:53,220 --> 01:08:55,319 And for every edge in the tree, there's a cost. 1609 01:08:55,319 --> 01:08:57,810 And you want to minimize the sum of these two terms. 1610 01:08:57,810 --> 01:09:01,282 You want to minimize the edge costs you pay for. 1611 01:09:01,282 --> 01:09:02,740 And you want to minimize the number 1612 01:09:02,740 --> 01:09:04,705 of prizes you leave behind. 1613 01:09:04,705 --> 01:09:05,695 Is that clear? 1614 01:09:12,140 --> 01:09:15,439 So then the algorithm can, depending on the optimization 1615 01:09:15,439 --> 01:09:19,492 terms, figure out: is it more of a benefit to include this node, 1616 01:09:19,492 --> 01:09:21,950 keep the prize, and pay all the edge costs, or the opposite? 1617 01:09:21,950 --> 01:09:23,044 Throw it out. 1618 01:09:23,044 --> 01:09:24,710 You don't get to keep the prize, but you 1619 01:09:24,710 --> 01:09:26,560 don't have to pay the edge costs. 1620 01:09:26,560 --> 01:09:28,810 And so that turns these very, very large networks 1621 01:09:28,810 --> 01:09:30,350 into relatively compact ones. 1622 01:09:30,350 --> 01:09:32,910 Now solving this problem is actually rather computationally 1623 01:09:32,910 --> 01:09:33,899 challenging. 1624 01:09:33,899 --> 01:09:36,415 You can do it with integer linear programming, 1625 01:09:36,415 --> 01:09:38,579 but it takes a huge amount of memory. 1626 01:09:38,579 --> 01:09:40,620 There's also a message passing approach. 1627 01:09:40,620 --> 01:09:42,800 If you're interested in the underlying algorithms, 1628 01:09:42,800 --> 01:09:45,990 you can look at some of these papers. 1629 01:09:45,990 --> 01:09:47,750 So what happens when you actually do this? 1630 01:09:47,750 --> 01:09:49,920 So that hairball that I showed you before 1631 01:09:49,920 --> 01:09:53,160 consisted of a very small initial data set. 
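Since the exact solvers are heavyweight, a brute-force sketch over a tiny graph makes the objective concrete. The graph, prizes, and costs below are invented, with one low-prize node behind an expensive edge standing in for a false positive:

```python
from itertools import combinations

def is_tree(es):
    """True if the edge set forms a single connected acyclic graph."""
    vs = {v for e in es for v in e}
    if len(es) != len(vs) - 1:
        return False
    seen, stack = set(), [next(iter(vs))]
    while stack:
        u = stack.pop()
        if u in seen:
            continue
        seen.add(u)
        for a, b in es:
            if a == u:
                stack.append(b)
            elif b == u:
                stack.append(a)
    return seen == vs

def pcst_brute_force(nodes, edges, cost, prize):
    """Minimize (prizes of excluded nodes) + (costs of used edges)
    over all candidate trees -- feasible only for toy graphs; the
    real problem is solved with ILP or message passing."""
    best_obj, best_tree = sum(prize.values()), ()  # empty tree baseline
    for r in range(1, len(edges) + 1):
        for es in combinations(edges, r):
            if not is_tree(es):
                continue
            vs = {v for e in es for v in e}
            obj = (sum(p for v, p in prize.items() if v not in vs)
                   + sum(cost[e] for e in es))
            if obj < best_obj:
                best_obj, best_tree = obj, es
    return best_obj, best_tree

# Three confident hits A, B, C around a Steiner node S (no prize),
# plus a dubious hit D reachable only through an expensive edge.
nodes = ["A", "B", "C", "S", "D"]
edges = [("A", "S"), ("B", "S"), ("C", "S"), ("C", "D")]
cost = {("A", "S"): 1, ("B", "S"): 1, ("C", "S"): 1, ("C", "D"): 5}
prize = {"A": 10, "B": 10, "C": 10, "D": 0.5}
obj, tree = pcst_brute_force(nodes, edges, cost, prize)
print(obj, tree)  # forfeiting D's small prize beats paying for its edge
```

The trade-off is visible directly: including D would cost 5 to save a prize of 0.5, so the optimal tree is the three-edge star that leaves D out.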
1632 01:09:53,160 --> 01:09:55,960 If you do a shortest path search across the network, 1633 01:09:55,960 --> 01:09:59,110 you get thousands of edges shown here. 1634 01:09:59,110 --> 01:10:02,640 But the prize collecting Steiner tree solution to this problem 1635 01:10:02,640 --> 01:10:07,210 is actually extremely compact, and it consists of subnetworks. 1636 01:10:07,210 --> 01:10:08,590 You can cluster it automatically. 1637 01:10:08,590 --> 01:10:10,620 This was clustered by hand, but you get more or less 1638 01:10:10,620 --> 01:10:11,160 the same results. 1639 01:10:11,160 --> 01:10:12,409 It's just not quite as pretty. 1640 01:10:12,409 --> 01:10:16,109 If you cluster by hand or by say, edge betweenness, then 1641 01:10:16,109 --> 01:10:17,650 you get subnetworks that are enriched 1642 01:10:17,650 --> 01:10:19,910 in various reasonable cellular processes. 1643 01:10:19,910 --> 01:10:22,030 This was a network built from cancer data. 1644 01:10:22,030 --> 01:10:25,150 And you can see things that are highly relevant to cancer-- DNA 1645 01:10:25,150 --> 01:10:29,250 damage, cell cycle, and so on. 1646 01:10:29,250 --> 01:10:30,750 And the really nice thing about this 1647 01:10:30,750 --> 01:10:32,400 then is it gives you a very focused way 1648 01:10:32,400 --> 01:10:34,030 to then go and do experiments. 1649 01:10:34,030 --> 01:10:35,570 So you can take the networks that come out of it. 1650 01:10:35,570 --> 01:10:37,486 And now you're not operating on a network that 1651 01:10:37,486 --> 01:10:39,510 consists of tens of thousands of edges. 1652 01:10:39,510 --> 01:10:41,790 You're working on a network that consists 1653 01:10:41,790 --> 01:10:43,800 of very small sets of proteins. 
1654 01:10:43,800 --> 01:10:45,760 So in this particular case, we actually 1655 01:10:45,760 --> 01:10:48,220 were able to go in and test a number of the nodes that 1656 01:10:48,220 --> 01:10:50,910 were not detected by the experimental data, 1657 01:10:50,910 --> 01:10:53,390 but were inferred by the algorithm-- the Steiner 1658 01:10:53,390 --> 01:10:56,610 nodes, which had no direct experimental data. 1659 01:10:56,610 --> 01:11:00,380 We tested whether blocking the activities of these nodes 1660 01:11:00,380 --> 01:11:02,950 had any effect on the growth of these tumor cells. 1661 01:11:02,950 --> 01:11:04,334 We showed that nodes that were 1662 01:11:04,334 --> 01:11:06,250 very central to the network, that were included 1663 01:11:06,250 --> 01:11:08,710 in the prize collecting Steiner tree solution, 1664 01:11:08,710 --> 01:11:11,830 had a high probability of being cancer targets, 1665 01:11:11,830 --> 01:11:14,319 whereas the ones that were just slightly more removed 1666 01:11:14,319 --> 01:11:15,610 were much lower in probability. 1667 01:11:18,750 --> 01:11:22,650 So one of the advantages of these large interaction graphs 1668 01:11:22,650 --> 01:11:24,630 is they give us a natural way to integrate 1669 01:11:24,630 --> 01:11:26,970 many different kinds of data. 1670 01:11:26,970 --> 01:11:31,310 So we already saw that the protein levels and the mRNA 1671 01:11:31,310 --> 01:11:35,440 levels agreed very poorly with each other. 1672 01:11:35,440 --> 01:11:37,254 And we talked about the fact that one thing 1673 01:11:37,254 --> 01:11:38,670 you could do with those data would 1674 01:11:38,670 --> 01:11:41,940 be to try to find the connections between not 1675 01:11:41,940 --> 01:11:44,161 the RNAs and the proteins, but the connections 1676 01:11:44,161 --> 01:11:45,660 between the RNAs and the things that 1677 01:11:45,660 --> 01:11:48,040 drove the expression of the RNA. 
1678 01:11:48,040 --> 01:11:50,790 And so as I said, we'll see in one of Professor Gifford's 1679 01:11:50,790 --> 01:11:52,790 lectures precisely how to do that. 1680 01:11:52,790 --> 01:11:57,120 But once you are able to do that, you take epigenetic data, 1681 01:11:57,120 --> 01:12:02,300 look at the regions that are regulatory around the sites 1682 01:12:02,300 --> 01:12:04,240 of genes that are changing in transcription. 1683 01:12:04,240 --> 01:12:06,380 You can infer DNA binding proteins. 1684 01:12:06,380 --> 01:12:07,880 And then you can pile all those data 1685 01:12:07,880 --> 01:12:09,420 onto an interaction graph, where you've 1686 01:12:09,420 --> 01:12:10,628 got different kinds of edges. 1687 01:12:10,628 --> 01:12:13,010 So you've got RNA nodes that represent the transcript 1688 01:12:13,010 --> 01:12:13,700 levels. 1689 01:12:13,700 --> 01:12:15,200 You've got the transcription factors 1690 01:12:15,200 --> 01:12:16,880 that you infer from the epigenetic data. 1691 01:12:16,880 --> 01:12:18,750 And then you've got the protein-protein interaction 1692 01:12:18,750 --> 01:12:20,208 data that came from the two hybrid 1693 01:12:20,208 --> 01:12:21,820 and the affinity capture mass spec. 1694 01:12:21,820 --> 01:12:23,695 And now you can put all those different kinds 1695 01:12:23,695 --> 01:12:25,860 of data in the same graph. 1696 01:12:25,860 --> 01:12:27,910 And even though there's no correlation 1697 01:12:27,910 --> 01:12:31,520 between what happens in an RNA and what happens at the protein 1698 01:12:31,520 --> 01:12:33,704 level-- or very low correlation-- 1699 01:12:33,704 --> 01:12:35,120 there's this physical process that 1700 01:12:35,120 --> 01:12:37,030 links that RNA up to the signaling 1701 01:12:37,030 --> 01:12:38,155 pathways that are above it. 1702 01:12:38,155 --> 01:12:40,600 And by using the prize collecting Steiner tree 1703 01:12:40,600 --> 01:12:42,230 approaches, you can rediscover those connections. 
1704 01:12:45,444 --> 01:12:46,860 And these kinds of networks can be 1705 01:12:46,860 --> 01:12:49,250 very valuable for other kinds of data that don't agree. 1706 01:12:49,250 --> 01:12:53,330 So it's not unique to transcript data and proteome data. 1707 01:12:53,330 --> 01:12:55,580 It turns out there are many different kinds of omic data that, 1708 01:12:55,580 --> 01:12:58,420 when looked at individually, give you very different views 1709 01:12:58,420 --> 01:12:59,800 of what's going on in a cell. 1710 01:12:59,800 --> 01:13:05,200 So if you take knockout data: which genes, when knocked out, 1711 01:13:05,200 --> 01:13:06,200 affect the phenotype? 1712 01:13:06,200 --> 01:13:09,845 And which genes, in the same condition, 1713 01:13:09,845 --> 01:13:10,720 change in expression? 1714 01:13:10,720 --> 01:13:12,678 Those give you two completely different answers 1715 01:13:12,678 --> 01:13:15,930 about which genes are important in a particular setting. 1716 01:13:15,930 --> 01:13:19,671 So here we're looking at which genes are differentially 1717 01:13:19,671 --> 01:13:21,670 expressed when you put cells under a whole bunch 1718 01:13:21,670 --> 01:13:23,810 of these different conditions. 1719 01:13:23,810 --> 01:13:25,900 And which genes, when knocked out, 1720 01:13:25,900 --> 01:13:28,445 affect viability in that condition. 1721 01:13:28,445 --> 01:13:30,445 And then the right-hand column shows the overlap 1722 01:13:30,445 --> 01:13:32,010 in the number of genes. 1723 01:13:32,010 --> 01:13:33,995 And you can see the overlap is small. 1724 01:13:33,995 --> 01:13:35,370 In fact, it's less than you would 1725 01:13:35,370 --> 01:13:39,190 expect by chance for most of these. 1726 01:13:39,190 --> 01:13:42,900 So just to drill that home, if I do two separate experiments 1727 01:13:42,900 --> 01:13:45,580 on exactly the same experimental system, 1728 01:13:45,580 --> 01:13:48,116 say yeast responding to DNA damage.
1729 01:13:48,116 --> 01:13:49,490 And in one case, I read out which 1730 01:13:49,490 --> 01:13:51,652 genes are important by looking at RNA levels. 1731 01:13:51,652 --> 01:13:53,110 And in the other one, I read out which 1732 01:13:53,110 --> 01:13:55,484 genes are important by knocking every gene out and seeing 1733 01:13:55,484 --> 01:13:56,700 whether it affects viability. 1734 01:13:56,700 --> 01:13:59,580 We'll get two completely different sets of genes. 1735 01:13:59,580 --> 01:14:03,700 And we'll also have two completely different sets 1736 01:14:03,700 --> 01:14:05,750 of gene ontology categories. 1737 01:14:05,750 --> 01:14:07,710 But there is some underlying biological process 1738 01:14:07,710 --> 01:14:10,284 that gives rise to that, right? 1739 01:14:10,284 --> 01:14:11,700 And one of the reasons for this is 1740 01:14:11,700 --> 01:14:15,030 different assays are measuring different things. 1741 01:14:15,030 --> 01:14:18,250 So it turns out, if you look-- at least in yeast-- 1742 01:14:18,250 --> 01:14:21,190 over 156 different experiments, for which there's 1743 01:14:21,190 --> 01:14:24,280 both transcriptional data and genetic data, 1744 01:14:24,280 --> 01:14:26,100 the things that come out in genetic screens 1745 01:14:26,100 --> 01:14:27,880 seem to be master regulators: 1746 01:14:27,880 --> 01:14:30,637 things that, when knocked out, have a big effect on phenotype. 1747 01:14:30,637 --> 01:14:32,470 Whereas the things that change in expression 1748 01:14:32,470 --> 01:14:35,030 tend to be effector molecules.
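The claim that an overlap can be smaller than chance is easy to quantify under the standard null model, where the two hit lists are drawn independently from the genome. A minimal sketch; the gene counts below are invented for illustration, not the actual numbers from the slide:

```python
from math import comb

def expected_overlap(N, n1, n2):
    """Expected size of the intersection of two gene sets of sizes n1
    and n2 drawn independently at random from a genome of N genes."""
    return n1 * n2 / N

def overlap_tail(N, n1, n2, k):
    """Hypergeometric P(overlap >= k) under that same null model."""
    return sum(comb(n1, i) * comb(N - n1, n2 - i)
               for i in range(k, min(n1, n2) + 1)) / comb(N, n2)

# ~6,000 yeast genes, 300 differentially expressed genes, 200 genetic
# hits: about 10 genes would overlap by chance alone, so observing only
# a handful of shared genes is a depletion relative to the null.
print(expected_overlap(6000, 300, 200))  # 10.0
```

A one-sided test like `overlap_tail` (or its complement, for depletion) is the usual way such gene-set overlaps are scored.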
1749 01:14:35,030 --> 01:14:37,094 And so in, say, the DNA damage case, 1750 01:14:37,094 --> 01:14:38,510 the proteins that were knocked out 1751 01:14:38,510 --> 01:14:39,926 and have a big effect on phenotype 1752 01:14:39,926 --> 01:14:43,056 are ones that detect DNA damage and signal to the nucleus 1753 01:14:43,056 --> 01:14:44,680 that there has been DNA damage, 1754 01:14:44,680 --> 01:14:47,520 which then goes on to block the cell cycle 1755 01:14:47,520 --> 01:14:50,780 and initiate the DNA damage repair response. 1756 01:14:50,780 --> 01:14:52,690 Those things show up as genetic hits, 1757 01:14:52,690 --> 01:14:55,049 but they don't show up as differentially expressed. 1758 01:14:55,049 --> 01:14:57,340 The things that do show up as differentially expressed 1759 01:14:57,340 --> 01:14:58,160 are the repair enzymes. 1760 01:14:58,160 --> 01:14:59,330 Those, when you knock them out, don't 1761 01:14:59,330 --> 01:15:01,370 have a big effect on phenotype, because they're 1762 01:15:01,370 --> 01:15:03,364 highly redundant. 1763 01:15:03,364 --> 01:15:05,030 But there are these underlying pathways. 1764 01:15:05,030 --> 01:15:07,520 And so the idea is, well, you could reconstruct these by, 1765 01:15:07,520 --> 01:15:09,050 again, using the epigenetic data, 1766 01:15:09,050 --> 01:15:11,010 the kind of thing Professor Gifford 1767 01:15:11,010 --> 01:15:13,000 will talk about in upcoming lectures, 1768 01:15:13,000 --> 01:15:15,590 to infer the transcription factors, and then the network 1769 01:15:15,590 --> 01:15:19,370 properties, to try to build up a full network of how those 1770 01:15:19,370 --> 01:15:21,150 relate to upstream signaling pathways 1771 01:15:21,150 --> 01:15:23,290 that would then include some of the genetic hits. 1772 01:15:27,490 --> 01:15:32,030 I think I'll skip to the punchline here.
1773 01:15:49,130 --> 01:15:51,870 So we've looked at a number of different modeling approaches 1774 01:15:51,870 --> 01:15:54,400 for these large interactomes. 1775 01:15:54,400 --> 01:15:57,670 We've also looked at ways of identifying 1776 01:15:57,670 --> 01:15:59,670 transcriptional regulatory networks using 1777 01:15:59,670 --> 01:16:02,252 mutual information, regression, Bayesian networks. 1778 01:16:02,252 --> 01:16:03,960 And how do all these things fit together? 1779 01:16:03,960 --> 01:16:05,590 And when would you want to use one of these techniques, 1780 01:16:05,590 --> 01:16:07,214 and when would you want to use another? 1781 01:16:07,214 --> 01:16:10,017 So I like to think about the problem along these two axes. 1782 01:16:10,017 --> 01:16:11,600 On one dimension, we're thinking about 1783 01:16:11,600 --> 01:16:13,440 whether we have systems of known components or unknown 1784 01:16:13,440 --> 01:16:14,260 components. 1785 01:16:14,260 --> 01:16:15,759 And on the other, whether we want 1786 01:16:15,759 --> 01:16:17,490 to identify physical relationships 1787 01:16:17,490 --> 01:16:19,450 or statistical relationships. 1788 01:16:19,450 --> 01:16:21,830 So clustering, regression, mutual information-- those 1789 01:16:21,830 --> 01:16:23,510 are very, very powerful for looking 1790 01:16:23,510 --> 01:16:26,430 at the entire genome, the entire proteome. 1791 01:16:26,430 --> 01:16:28,800 What they give you are statistical relationships. 1792 01:16:28,800 --> 01:16:30,880 There's no guarantee of a functional link, right? 1793 01:16:30,880 --> 01:16:34,200 We saw that in the prediction that postprandial laughter 1794 01:16:34,200 --> 01:16:36,700 predicts breast cancer outcome; there's 1795 01:16:36,700 --> 01:16:38,760 no causal link between those. 1796 01:16:38,760 --> 01:16:40,260 Ultimately, you can find some reason 1797 01:16:40,260 --> 01:16:42,040 why it's not totally random.
1798 01:16:42,040 --> 01:16:43,960 But it's not as if that's going to lead you 1799 01:16:43,960 --> 01:16:46,290 to new drug targets. 1800 01:16:46,290 --> 01:16:49,740 But those can be run in a completely hypothesis-free way, 1801 01:16:49,740 --> 01:16:52,630 with no external data. 1802 01:16:52,630 --> 01:16:55,764 Bayesian networks are somewhat more causal. 1803 01:16:55,764 --> 01:16:57,430 But depending on how much data you have, 1804 01:16:57,430 --> 01:16:58,850 they may not be perfectly causal. 1805 01:16:58,850 --> 01:17:01,257 You need a lot of intervention data. 1806 01:17:01,257 --> 01:17:03,340 We also saw that they did not perform particularly 1807 01:17:03,340 --> 01:17:06,010 well in discovering gene regulatory networks 1808 01:17:06,010 --> 01:17:07,464 in the DREAM challenge. 1809 01:17:07,464 --> 01:17:09,130 These interactome models that we've just 1810 01:17:09,130 --> 01:17:11,990 been talking about work very well across giant omic data 1811 01:17:11,990 --> 01:17:12,490 sets. 1812 01:17:15,510 --> 01:17:17,530 And they require this external data. 1813 01:17:17,530 --> 01:17:18,700 They need the interactome. 1814 01:17:18,700 --> 01:17:20,060 So they work well in organisms for which 1815 01:17:20,060 --> 01:17:21,680 you have all that interactome data. 1816 01:17:21,680 --> 01:17:25,310 They're not going to work in an organism for which you don't. 1817 01:17:25,310 --> 01:17:26,960 What they give you at the end, though, 1818 01:17:26,960 --> 01:17:30,409 is a graph that tells you relationships 1819 01:17:30,409 --> 01:17:31,200 among the proteins. 1820 01:17:31,200 --> 01:17:32,700 But it doesn't tell you what's going 1821 01:17:32,700 --> 01:17:35,040 to happen if you start to perturb those networks.
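As a reminder of what a purely statistical relationship means here, this is a minimal sketch of the mutual-information score that the relevance-network family of methods computes for every gene pair. The expression profiles are toy binary vectors for hypothetical genes; real pipelines first discretize continuous expression into bins:

```python
from collections import Counter
from math import log2

def mutual_information(x, y):
    """I(X;Y) = sum over (a,b) of p(a,b) * log2( p(a,b) / (p(a)p(b)) )
    for two discretized expression profiles of equal length."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * log2((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

tf     = [0, 0, 1, 1, 0, 1, 0, 1]  # hypothetical transcription factor
target = [0, 0, 1, 1, 0, 1, 0, 1]  # perfectly coupled to the TF
other  = [0, 1, 0, 1, 0, 0, 1, 1]  # statistically independent of the TF

print(mutual_information(tf, target))  # 1.0 (bits)
print(mutual_information(tf, other))   # 0.0 (bits)
```

Methods in this family rank all gene pairs by such a score and then prune indirect edges; as stressed above, a high score guarantees only a statistical link, not a causal one.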
1822 01:17:35,040 --> 01:17:39,570 So if I give you the active subgraph that 1823 01:17:39,570 --> 01:17:42,440 has all the proteins and genes that are changing expression 1824 01:17:42,440 --> 01:17:45,329 in my tumor sample, now the question is, OK, 1825 01:17:45,329 --> 01:17:47,120 should you inhibit the nodes in that graph? 1826 01:17:47,120 --> 01:17:49,220 Or should you activate the nodes in that graph? 1827 01:17:49,220 --> 01:17:51,160 And the interactome model doesn't tell you 1828 01:17:51,160 --> 01:17:52,260 the answer to that. 1829 01:17:52,260 --> 01:17:54,676 And so what you're going to hear about in the next lecture 1830 01:17:54,676 --> 01:17:56,750 from Professor Lauffenburger are models 1831 01:17:56,750 --> 01:17:58,100 that live up in this space. 1832 01:17:58,100 --> 01:18:00,906 Once you've defined a relatively small piece of the network, 1833 01:18:00,906 --> 01:18:02,780 you can use other kinds of approaches-- logic-based 1834 01:18:02,780 --> 01:18:06,580 models, differential equation-based models, decision 1835 01:18:06,580 --> 01:18:09,410 trees, and other techniques that will actually 1836 01:18:09,410 --> 01:18:10,910 make very quantitative predictions. 1837 01:18:10,910 --> 01:18:13,640 What happens if I inhibit a particular node? 1838 01:18:13,640 --> 01:18:16,267 Does it activate the process, or does it repress the process? 1839 01:18:16,267 --> 01:18:17,850 And so what you could think about then 1840 01:18:17,850 --> 01:18:20,420 is going from a completely unbiased view of what's 1841 01:18:20,420 --> 01:18:24,680 going on in a cell: collect all the various kinds of omic data, 1842 01:18:24,680 --> 01:18:26,370 and go through these kinds of modeling 1843 01:18:26,370 --> 01:18:28,949 approaches to identify a subnetwork that's of interest. 1844 01:18:28,949 --> 01:18:31,240 And then use the techniques that we'll [? be hearing ?]
1845 01:18:31,240 --> 01:18:34,219 about in the next lecture to figure out quantitatively 1846 01:18:34,219 --> 01:18:36,510 what would happen if I were to inhibit individual nodes 1847 01:18:36,510 --> 01:18:40,500 or inhibit combinations of nodes or activate, and so on. 1848 01:18:40,500 --> 01:18:44,300 Any questions on anything we've talked about so far? 1849 01:18:44,300 --> 01:18:45,578 Yes. 1850 01:18:45,578 --> 01:18:48,392 AUDIENCE: Can you say again the fundamental difference 1851 01:18:48,392 --> 01:18:51,242 between why you get those two different results if you're 1852 01:18:51,242 --> 01:18:56,277 just reading out the gene expression versus the proteins? 1853 01:18:56,277 --> 01:18:57,110 PROFESSOR: Oh, sure. 1854 01:18:57,110 --> 01:18:57,610 Right. 1855 01:18:57,610 --> 01:19:01,204 So we talked about the fact that if you look at genetic hits, 1856 01:19:01,204 --> 01:19:02,870 and you look at differential expression, 1857 01:19:02,870 --> 01:19:05,536 you get two completely different views of what's going on in cells. 1858 01:19:05,536 --> 01:19:06,470 So why is that? 1859 01:19:06,470 --> 01:19:09,060 So the genetic hits tend to hit master regulators, things 1860 01:19:09,060 --> 01:19:10,720 where knocking out a single gene 1861 01:19:10,720 --> 01:19:13,097 has a global effect on the response. 1862 01:19:13,097 --> 01:19:14,555 So in the case of DNA damage, those 1863 01:19:14,555 --> 01:19:17,380 are things that detect the DNA damage. 1864 01:19:17,380 --> 01:19:20,530 Those genes often tend not to change very much 1865 01:19:20,530 --> 01:19:22,630 in expression. 1866 01:19:22,630 --> 01:19:24,650 So transcription factors are very low abundance. 1867 01:19:24,650 --> 01:19:25,760 They usually don't change very much. 1868 01:19:25,760 --> 01:19:27,820 A lot of signaling proteins are kept at a constant level, 1869 01:19:27,820 --> 01:19:29,980 and they're regulated post-transcriptionally.
1870 01:19:29,980 --> 01:19:32,350 So those don't show up in the differential expression. 1871 01:19:32,350 --> 01:19:35,160 The things that are changing in expression-- 1872 01:19:35,160 --> 01:19:40,110 say the response regulators, the DNA damage response-- 1873 01:19:40,110 --> 01:19:41,510 those often are redundant. 1874 01:19:41,510 --> 01:19:44,660 So one good analogy is to think about a smoke detector. 1875 01:19:44,660 --> 01:19:46,370 A smoke detector is on all the time. 1876 01:19:46,370 --> 01:19:48,139 You don't wait until there's a fire. 1877 01:19:48,139 --> 01:19:50,180 So that's not going to be changing in expression, 1878 01:19:50,180 --> 01:19:51,390 if you will. 1879 01:19:51,390 --> 01:19:54,330 But if you knock it out, you've got a big problem. 1880 01:19:54,330 --> 01:19:56,410 The effectors, say the sprinklers-- 1881 01:19:56,410 --> 01:19:58,540 the sprinklers only come on when there's a fire. 1882 01:19:58,540 --> 01:20:00,140 So that's like the response genes. 1883 01:20:00,140 --> 01:20:02,002 They come on only in certain circumstances, 1884 01:20:02,002 --> 01:20:03,210 but they're highly redundant. 1885 01:20:03,210 --> 01:20:04,835 Any room will have multiple sprinklers, 1886 01:20:04,835 --> 01:20:06,860 so if one gets damaged or is blocked, 1887 01:20:06,860 --> 01:20:08,190 you still get a response. 1888 01:20:08,190 --> 01:20:10,550 So that's why you get this discrepancy between the two 1889 01:20:10,550 --> 01:20:11,551 different kinds of data. 1890 01:20:11,551 --> 01:20:12,924 But again, in both cases, there's 1891 01:20:12,924 --> 01:20:15,290 an underlying physical process that gives rise to both. 1892 01:20:15,290 --> 01:20:17,040 And if you do this properly, you can 1893 01:20:17,040 --> 01:20:19,664 detect that with these interactome models. 1894 01:20:19,664 --> 01:20:20,330 Other questions? 1895 01:20:22,720 --> 01:20:23,220 OK. 1896 01:20:23,220 --> 01:20:25,000 Very good.