The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: So you'll recall last time we were working on protein-protein interactions. We're going to do a little bit to finish that up, with a topic that will be a good transition to the study of gene regulatory networks.

The precise things we're going to discuss today: we're going to start off with Bayesian networks for protein-protein interaction prediction. And then we're going to get into gene expression data, at several different levels. We'll talk about some basic questions of how to compare two expression vectors for a gene, that is, distance metrics. We'll talk about how to cluster gene expression data, and the idea of identifying signatures, sets of genes that might be predictive of some biological property, for example, susceptibility to a disease. And then we'll talk about a number of different ways that people have developed to try to identify gene regulatory networks. That often goes by the name of modules. I don't particularly like that name, but that's what you'll find in the literature. And we're going to focus on a few of these that have recently been compared head to head, using both synthetic and real data, and we'll see some of the results from that head-to-head comparison.

So let's just launch into it. Remember, last time we had started this unit looking at structural predictions for proteins, and we started talking about how to predict protein-protein interactions. Last time we talked about both computational methods and also experimental data that could give us information about protein-protein interactions, ostensibly measuring direct interactions. But we saw that there were possibly very, very high error rates.
So we needed ways of integrating lots of different kinds of data in a probabilistic framework, so we could predict, for any pair of proteins, the probability that they interact, not just the fact that they were detected in one assay or the other. And we started to talk about Bayesian networks in this context. They're useful, as we'll see today, both for predicting protein-protein interactions and also for the gene regulatory network problem.

So Bayesian networks are a tool for reasoning probabilistically. That's their fundamental purpose. And we saw that they consist of a graph, the network, and then the probabilities attached to each edge, the conditional probability tables. And we can learn these from the data, either in a completely objective way, where we learn both the structure and the probabilities, or where we impose the structure initially and then simply learn the probability tables. And we had nodes that represented the variables. They could be hidden nodes, where we don't know what the true answer is, or observed nodes, where we do. So in our case, we're trying to predict protein-protein interactions. There's some hidden variable that represents whether proteins A and B truly interact. We don't know that answer. But we do know whether that interaction was detected in experiment one, two, three, or four. Those are the effects, the observed nodes. And so we want to reason backwards from the observations to the hidden causes.

So last time we talked about the high throughput experiments that directly measure protein-protein interactions. We talked about yeast two-hybrid and affinity capture mass spec, here listed as pull-downs. And those could be used to predict protein-protein interactions by themselves. But we want to find out what other kinds of data we can use to amplify these results, to give us independent information about whether two proteins interact.
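Before turning to those other kinds of data, here is a minimal sketch of the backwards reasoning just described, assuming the assays are independent given the hidden state (the naive Bayes assumption). The prior and the per-assay detection probabilities below are invented for illustration; in practice they would be estimated from gold-standard data.

```python
# Hypothetical sketch: infer P(interact | observations) for one protein pair.

prior_interact = 0.01  # assumed prior probability that a random pair interacts

# (P(detected | interact), P(detected | no interact)) per assay; invented values
assays = {
    "y2h_1": (0.30, 0.01),
    "y2h_2": (0.25, 0.02),
    "pulldown": (0.40, 0.03),
}

observations = {"y2h_1": True, "y2h_2": False, "pulldown": True}

# Multiply likelihoods across assays (the naive independence assumption)
like_pos, like_neg = 1.0, 1.0
for name, (p_pos, p_neg) in assays.items():
    if observations[name]:
        like_pos *= p_pos
        like_neg *= p_neg
    else:
        like_pos *= 1.0 - p_pos
        like_neg *= 1.0 - p_neg

# Bayes' rule: reason backwards from the observed detections to the hidden cause
posterior = (like_pos * prior_interact) / (
    like_pos * prior_interact + like_neg * (1.0 - prior_interact)
)
print(f"P(interact | data) = {posterior:.3f}")
```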
And one thing you could look at is whether the expression of the two genes that you think might interact is similar. So if you look over many, many different conditions, you might expect that two proteins that interact with each other would be expressed under similar conditions. Certainly if you saw two proteins that had exactly opposite expression patterns, you would be very unlikely to believe that they interacted. So the question is, how much is true at the other end of the spectrum? If things are very highly correlated, do they have a high probability of interaction?

So this graph is a histogram, for proteins that are known to interact, proteins that were shown in these high throughput experiments to interact, and proteins that are known not to interact, of how similar the expression is. On the far right are things that have extremely different expression patterns, a high distance. And we'll talk specifically about what distance is in just a minute. But these are very dissimilar expression patterns; these are very similar ones. So what do you see from this plot we looked at last time? We saw that the interacting proteins are shifted a bit to the left. So the interacting ones have a higher probability of having similar expression patterns than the ones that don't interact. But we couldn't draw any cutoff and say that everything with this level of expression similarity is guaranteed to interact. There's no way to divide these. So this will be useful in a probabilistic setting, but by itself it would not be highly predictive.

We also talked about evolutionary patterns, and we discussed whether the red or the green patterns here would be more predictive. And which one was it, anyone remember? How many people thought the red was more predictive? How many the green? Right, the greens win. And we talked about coevolution in other ways.
So the paper that, I think, was one of the first to do this really nicely, to try to predict protein-protein interactions using Bayesian networks, is this one from Mark Gerstein's lab. And they start off, as we talked about previously: we need some gold standard interactions, where we know two proteins really do interact or don't. So they built their gold standard data set. The positive training data they took from a database called MIPS, which is a hand-curated database that digs into the literature quite deeply to find out whether two proteins interact or not. And the negative data they took were protein pairs identified as being localized to different parts of the cell. This was done in yeast, where there is pretty good data on subcellular localization for a lot of proteins.

So these are the data that went into their prediction. These were the experiments we've already talked about: the affinity capture mass spec and the yeast two-hybrid. And the other kinds of data they used were expression correlation, which we just talked about; annotations, whether proteins had the same annotation for function; and essentiality. In yeast, it's pretty easy to go through every gene in the genome, knock it out, and determine whether that kills the cell or not. So they can label every gene in yeast as to whether it's essential for survival or not.

And you can see here the number of interactions that were involved. They decided to break this down into two separate prediction problems. One was an experimental problem: using the four different large scale protein-protein interaction data sets in yeast to predict interactions. The other used these other kinds of data, which were less direct. And they used slightly different kinds of Bayesian networks for the two. For this one, they used a naive Bayes. And what's the underlying assumption of the naive Bayes?
The underlying assumption is that all the data are independent. So we looked at this previously. We discussed how, if you're trying to identify the likelihood ratio and use it to rank things, you primarily need to focus on this term, because this term will be the same for every pair of proteins that you're examining. Yes?

AUDIENCE: Could you state again whether in a naive Bayes, all data are dependent or independent?

PROFESSOR: Independent.

AUDIENCE: OK.

PROFESSOR: OK. So let's actually look at some of their data. In this table, they're looking at the likelihood ratio that two proteins interact, based on whether the two proteins are essential: both essential, one essential and one nonessential, or both nonessential. So that's what these codes here mean. EE, both essential; NN, both nonessential; EN, one and the other. And so they've computed, for all those protein pairs, how many in their gold standard are EE, how many are EN, and how many are NN.

So here are the numbers for the EE. Of the roughly 2,000 gold standard positives, just over 1,000 are EE. So that comes out to a probability of both being essential, given that you're a positive, that you're in the gold standard, of roughly 50%, right? And you can compute something similar for the negatives, the ones that definitely don't interact. So the probability of both being essential, given that it's a negative, is about 15%, 14%. And so then the likelihood ratio comes out to just under four. So there's a fourfold increase in the probability that something is interacting, given that both proteins are essential, than not.

And this is the table of all of the terms, for all of the different things that they were considering that were not direct experiments. So this is the essentiality. This is expression correlation, with various values for the threshold, how similar the expression had to be.
And these are the terms from the databases for annotation. And then for each of these, we get a likelihood ratio of how predictive it is. So it's kind of informative to look at some of these numbers. We already saw that essentiality is pretty weak: the fact that two genes are both essential only gives you a slightly increased chance that they're interacting than not. But if two genes have extremely high expression correlation, then they're more than a hundredfold more likely to interact than not. And the numbers for the annotations are significantly less than that.

So this is a naive Bayes: we're going to multiply all those likelihood ratios together. Now, for the experimental data, they said, well, these are probably not all independent. The probability that you pick something up in one two-hybrid experiment is probably highly correlated with the probability that you pick it up in another two-hybrid experiment. And one would hope that there's some correlation between the things you're identifying in two-hybrid and in affinity capture mass spec, although we'll see whether or not that's the case.

So they used what they refer to as a fully connected Bayes. And what do we mean by that? Remember, this was the naive Bayes, where everything is independent, so the probability of some observation is the product of all the individual probabilities. But in a fully connected Bayes, we don't have that independence assumption. So you need to explicitly compute the probability of an interaction based on all the possible combinations of outcomes in those experiments. That's not that much harder. We simply have a table now, where these columns represent each of the experimental data types, the affinity capture mass spec and the two-hybrids. Ones indicate that the interaction was detected; zeros, that it was not.
And then we simply look again in our gold standard and ask: for protein pairs with whatever detection pattern is shown here, say detected in all of the experiments except Ito, how many of the gold positives do we get, and how many of the gold negatives? And then we can compute the probabilities.

Now, it's important to look at some of the numbers in these tables and dig in, because you'll see the numbers here are really, really small. So they have to be interpreted with caution, and some of these things might not hold up with much larger data sets. You might imagine that the things that are experimentally detected in all of the high-throughput assays would be the most confident. That doesn't turn out to be the case. These are sorted by the log likelihood ratio, and the best one is not 1, 1, 1. It's up there, but it's not at the top of the pack. And that's probably just the statistics of small numbers; if the databases were larger and the experiments were larger, it probably would work out that way.

So, any questions about how they formulated this problem as a Bayesian network, or how they implemented it? OK.
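To make the two formulations concrete, here is an illustrative sketch of computing likelihood ratios from gold-standard counts. All of the counts are invented, not the paper's numbers, though the essentiality counts are chosen to echo the roughly 50% versus 14% split discussed above.

```python
# Invented gold-standard sizes
n_pos, n_neg = 2000, 500000

def likelihood_ratio(pos_count, neg_count):
    # LR = P(value | gold positive) / P(value | gold negative)
    return (pos_count / n_pos) / (neg_count / n_neg)

# Naive Bayes: one likelihood ratio per feature value; the per-feature
# ratios for a pair are then multiplied together.
essentiality_counts = {  # value -> (gold positives, gold negatives); invented
    "EE": (1000, 70000),
    "EN": (750, 230000),
    "NN": (250, 200000),
}
print(f"LR(EE) = {likelihood_ratio(*essentiality_counts['EE']):.2f}")  # ~3.6

# Fully connected Bayes: one likelihood ratio per joint detection pattern
# across the four experiments, so correlations between assays are captured
# rather than assumed away. Keys are (exp1, exp2, exp3, ito); counts invented,
# but chosen so that 1,1,1,1 is not the top-ranked pattern, as in the lecture.
joint_counts = {
    (1, 1, 1, 0): (16, 4),
    (1, 1, 1, 1): (20, 12),
}
for pattern, (pos, neg) in joint_counts.items():
    print(pattern, f"LR = {likelihood_ratio(pos, neg):.0f}")
```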
So the results, then. Once we have these likelihood ratios, we can try to choose a threshold for deciding what we're going to consider to be a true interaction and not. So here they've plotted, for different likelihood ratio thresholds, how many of the true positives you get right versus how many you get wrong: the number of true positives over the number of false positives. And you can arbitrarily decide, OK, I want to get more right than wrong. Not a bad way to decide things. So your passing grade here is 50%. If I draw a horizontal line, wanting to get more right than wrong, you'll see that all of the individual signals that they were using, essentiality, database annotation, and so on, fall below that. So individually, they predict more wrong than right. But if you combine the data using this Bayesian network, then you can choose a likelihood threshold where you do get more right than wrong, and you can set your threshold wherever you want. Similarly, for the direct experimental data, you do better by combining (these are the light pink lines) than you would with any of the individual data sets. So this shows the utility of combining the data and reasoning from the data probabilistically. Any questions?

So we'll return to Bayesian networks in a bit, in the context of discovering gene regulatory networks. We now want to move to gene expression data. And the primary reason to be so interested in gene expression data is simply that there's a huge amount of it out there. Just a short time ago we passed the million mark in the number of expression data sets that have been collected in the databases. There's much less of any other kind of high throughput data; if you look at proteomics or high-throughput genetic screens, there are tiny numbers compared to gene expression data. So obviously, techniques for analyzing gene expression data are going to play a very important role for a long time to come.

Some of what I'm going to discuss today is covered in your textbook; I encourage you to look at section 16.2. The fundamental thing that we're interested in doing is seeing how much biological knowledge we can infer from gene expression data. We might imagine that genes that are coexpressed under particular sets of conditions have functional similarity and reflect common regulatory mechanisms, and our goal, then, is to discover those mechanisms. So fundamental to this: any time we have a pair of genes and we look at their gene expression data, we want to decide how similar they are. So let's imagine that we had these data for four genes. It's a time series experiment, and we're looking at the different expression levels. And we want some quantitative measure to decide which two genes are most similar.
Well, it turns out to be a lot more subtle than we might think. At first glance, it seems pretty obvious that these two are the most similar. But it really depends on what kind of similarity you're asking about. We can describe the expression data for any gene as simply a multi-dimensional vector, where this is the set of expression values we detected for the first gene across all the different experimental conditions, and so on for the second. And what would be the most intuitive way of describing the distance between two multi-dimensional vectors? It would simply be the Euclidean distance, right? So that's perfectly reasonable. We can decide that the distance between two gene expression profiles is simply the square root of the sum of the squared differences. So we'll take the sum over all the experimental conditions that we've looked at (maybe it's a time series, maybe it's different perturbations), look at the difference in expression of gene A and gene B in each condition k, and then evaluating this will tell us how similar two genes are in their expression profiles.

Well, that's a specific example of a distance metric. It turns out that there's a formal definition of a distance metric. Distances have the following properties. They're never negative. They are equal to zero under exactly one condition: the two data points are the same. And they're symmetric: the distance from A to B is the same as the distance from B to A. Now, to be a true distance, you also have to satisfy the triangle inequality, that the distance from x to z is less than or equal to the sum of the distances through a third point. But we will find out that we don't actually need that for similarity measures.
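A minimal sketch of that Euclidean distance, d(A, B) = sqrt(sum over k of (A_k - B_k)^2), with invented toy expression vectors:

```python
import math

def euclidean_distance(a, b):
    """Square root of the sum of squared differences across conditions k."""
    return math.sqrt(sum((ak - bk) ** 2 for ak, bk in zip(a, b)))

# Toy expression vectors over five conditions (invented values)
gene_a = [2.0, 4.0, 6.0, 4.0, 2.0]
gene_b = [2.1, 3.9, 6.2, 4.1, 1.8]
print(euclidean_distance(gene_a, gene_b))  # small value: similar profiles
```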
So we can have either a true distance metric for comparing gene expression data sets, or similarity measures as well.

So let's go back to the simple example. We decided that the red and the blue genes were nearly identical in terms of Euclidean distance. But that's not always exactly what we care about. In biological settings, frequently the absolute level of gene expression is on some arbitrary scale. Certainly with expression arrays it was completely arbitrary; it had to do with fluorescence properties and how well probes hybridized to each other. But even with mRNA, how do we really know that 1,000 copies is fundamentally different from 1,200 copies of an RNA in the cell? We don't. So we might be more interested in distance metrics that capture not just the similarity of these two, but the fact that these two are also quite similar to this one, in terms of the trajectory of the plot.

So can we come up with measures that capture this one as well? A very common one is the Pearson correlation. In Pearson correlation, we're going to look at not just the expression of a gene across conditions, but the z-score of that gene. So we'll take the data for all of the genes in a particular condition, and we'll compute the z-score from the difference between the expression of a particular gene and the average expression across the whole data set, normalized by the standard deviation. Yes?

AUDIENCE: [INAUDIBLE] square there?

PROFESSOR: Yes, you're right, there should be a square there. Thank you.

So then to compute the Pearson correlation between two genes, A and B, we take the product of the z-score for A and the z-score for B, summed over all the experiments.
And these values, as we'll see in a second, are going to range from plus 1, which would be a perfect correlation, to minus 1, which would be a perfect anti-correlation. And then we're going to define the distance as 1 minus this value. So things that are perfectly correlated would have a distance of zero, and things that are anti-correlated would have a large distance.

So if we take a look at these two: obviously, by Euclidean distance they'd be quite different from each other. But once we've converted the expression values into z-scores over here, you can see that this one is the most negative of all of them, and this is the lowest one in all of these; this one's the highest. And similarly for the red one, lowest to highest. So the z-scores track very well. And when I take the product, the signs of the z-scores for A and B are always the same. So when I sum the product of the z-scores, I get a large number, and the normalization guarantees that it comes out to one. And so the red and blue here will have a very high correlation coefficient; in this case, it's going to be a correlation coefficient of 1. Whereas compared to this one, which is relatively flat, the correlation coefficient will be approximately zero. Any questions on that?

So what about, say, this pair, with nearly opposite patterns? Their z-scores are going to have almost the opposite sign every single time, and so that sum is going to be a large negative value. So these will be highly anti-correlated, with a correlation coefficient of minus 1.

OK. So we have these two different ways of computing distance measures. We can compute the Euclidean distance, which would make the red and blue the same, but treat the green one as being completely different. Or we have the correlation, which would group all of these together as being similar.
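Here is a sketch of that correlation-based distance, standardizing each gene's vector to z-scores and then taking d = 1 - r. The profiles are invented, with the red one just a scaled copy of the blue one:

```python
import statistics

def zscores(values):
    """Standardize a vector: subtract the mean, divide by the standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

def pearson_distance(a, b):
    """d = 1 - r, where r is the mean product of z-scores across experiments."""
    za, zb = zscores(a), zscores(b)
    r = sum(x * y for x, y in zip(za, zb)) / len(za)
    return 1.0 - r

blue = [1.0, 2.0, 3.0, 2.0, 1.0]      # invented profile
red = [10.0, 20.0, 30.0, 20.0, 10.0]  # same shape, 10x the scale
print(pearson_distance(blue, red))                 # ~0.0: perfectly correlated
print(pearson_distance(blue, [-v for v in blue]))  # ~2.0: anti-correlated
```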
Which one you want is going to depend on your setting. If you look in your textbook, you'll see a lot of other definitions of distance as well.

Now, what if you're missing a particular data point? This used to be a lot more of a problem with arrays than it is with RNA-seq. With arrays, you'd often have dirt on the array that would literally cover up spots. But you have a bunch of choices. The most extreme would be to just ignore that row or column of your matrix across all data sets. That's usually not what we want to do. You could put in some arbitrary small value. But frequently we will do what's called imputing, where we'll try to identify the genes that have the most similar expression, and replace the missing value with a value from the ones that we do know.

So, distance metrics: pretty straightforward. Now we want to use these distance metrics to actually cluster the data. And what's the idea here? That if we look across enough data sets, we might find certain groups of genes that function similarly across all those data sets, and that might be revealing as to their biological function. So this is an example of an unsupervised learning problem. We don't know what the classes are before we go in; we don't even know how many there are. We want to learn that from the data. This is a very large area of machine learning, and we're just going to scrape the surface. Some of you may be familiar with the fact that these kinds of machine learning algorithms are used widely outside of biology. They're used by Netflix to tell you what movie to choose next, by Amazon to try to sell you new products, and by all the advertisers who send pop-up ads to your computer.

But in our biological setting, we have our gene expression data, collected possibly over very large numbers of conditions, and we want to find groups of genes that have some similarity. This is a figure from one of the very early papers that established how people present these data.
So you'll almost always see the same kind of presentation. Typically you'll get a heat map, where genes are rows, and the different experiments (here time points, but they could be different perturbations) are the columns. Genes that go up in expression are red, and genes that go down in expression are green. And apologies to anyone who's colorblind, but that's just what the convention has become.

OK, so then why cluster? If we cluster across the rows, then we'll get sets of genes that, hopefully, if we do this properly, behave similarly across different subsets of the experiments. And those might represent similar functions. And if we cluster the columns, then we get different experiments that show similar responses. In this case, that might be different times that are similar; hopefully those are ones that are close to each other. But if we have lots of different patients, as we'll see in a second, they might represent patients who have a similar version of a disease.

And in fact, the clustering of genes does work. Even in this very early paper, they were able to identify a bunch of subsets of genes that showed similar expression at different time points, and that turned out to be enriched in different categories. These ones were enriched in cholesterol biosynthesis, whereas these were enriched in wound healing, and so on.

So how do you actually do clustering? This kind of clustering is called hierarchical, and it's pretty straightforward. There are two versions of hierarchical clustering, what's called agglomerative and divisive. In agglomerative clustering, you start off with each data point in its own cluster. Then you search for the most similar data point, and you group those together. And you keep doing that iteratively, building up larger and larger clusters.

So we've discussed how to compare individual genes.
You should be able, right now, if I gave you the vector of expression for a single gene, to find the other genes in the data set that are most similar, by, say, Euclidean distance or Pearson correlation, or what have you. But once you've grouped two genes together, how do you decide whether a third gene is similar to those two? Now we have to make some choices, and there are a number of different choices that are commonly made.

So let's say these are our data. We've got these two clusters, Y and Z, and each circle represents a data point in those clusters. So we've got four genes in each cluster. Now we want to decide on a distance measure to compare cluster Y to cluster Z. So what could we do? What are some possibilities? What might you do?

AUDIENCE: We could take the average of all points.

PROFESSOR: You could take the average of all points, right. What else could you do? Only a limited number of possibilities.

AUDIENCE: Centroid?

PROFESSOR: Yeah, so centroid, you could take some sort of average, right. Any other possibilities?

AUDIENCE: You can pick a representative from each set [INAUDIBLE].

PROFESSOR: So you could pick a representative, right? How would you decide in advance what that would be, though? So maybe you have a way, maybe not. And what other possibilities are there? Yeah?

AUDIENCE: Measure all the distances [INAUDIBLE] to all the nodes in the other.

PROFESSOR: Right. So you could do all to all. What else could you do? You could take the minimum of all those values, or you could take the maximum of all those values. And we'll see that all of those are things that people do. So in this clustering field, there are already rather uninformative terms for some of these kinds of decisions.
So it's called single linkage when you decide that the distance between two clusters is based on the minimum distance between any member of cluster Y and any member of cluster Z. Complete linkage takes the maximum distance. And then the extremely unfortunately named Unweighted Pair Group Method using Centroids (UPGMC, I won't try to say that very often) takes the centroid, which was an early suggestion from the class. And UPGMA, the Unweighted Pair Group Method with Arithmetic Mean, takes the average of all the distances. All suggestions that people have made.

So when would you use one versus the other? Well, a priori, you don't necessarily know, but it's good to know how they'll behave. So what do you imagine is going to happen if you use single linkage versus complete linkage? Remember, single linkage is the minimum distance, and complete linkage is the maximum distance. So what's going to happen in this case, if I use the minimum distance? Which two groups will I combine?

AUDIENCE: The blue and the red.

PROFESSOR: The blue and the red, right? Whereas if I use the maximum distance, then I'll combine the green and the red. So it's important to recognize, then, that single linkage has this property of chaining together clusters based on points that are near each other, whereas complete linkage is resistant to grouping things together if they have outliers. So they'll behave differently. Now, if your data are compact, and you really do have tight clusters, it's not going to matter too much which you use. But in most biological settings, we're dealing with much noisier data, so you actually will get different results based on this choice. And as far as I know, there's no really principled way, if you have no prior knowledge, to figure out which to use.

Now, all of these hierarchical clusterings come with what's called a dendrogram, and you'll see these at the top of all the clustering figures.
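As a sketch of how these linkage criteria are typically run in practice, assuming SciPy is available (the expression matrix below is invented): 'single' is the minimum rule, 'complete' the maximum, 'centroid' is UPGMC, and 'average' is UPGMA. The scipy.cluster.hierarchy.dendrogram function can then draw the tree for any of these.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy expression matrix: rows are genes, columns are conditions (invented data,
# three groups of ten genes centered at different expression levels)
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(loc, 0.3, size=(10, 6)) for loc in (0.0, 2.0, 4.0)])

for method in ("single", "complete", "centroid", "average"):
    tree = linkage(data, method=method, metric="euclidean")
    # Cutting the resulting dendrogram at distance 3 yields flat clusters
    labels = fcluster(tree, t=3.0, criterion="distance")
    print(method, labels)
```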
And this dendrogram represents the process by which the data were clustered. The things that are most similar are most tightly connected in the dendrogram. So for these two data points, one and two, you have to go up very little on the y-axis to get from one to two, whereas if you want to go from one to 16, you have to traverse the entire dendrogram. So the distance between two samples is how far vertically you have to go to connect them.

Now, the good thing about the dendrogram is that you can then understand the clustering of the data. I can cut this dendrogram at any particular distance and get clean divisions among my data. So if I cut here, at this distance level, then I have two groups: one small, consisting of these data, and one large, consisting of these. Whereas if I cut down here, I have more groups in my data. So it doesn't require me to know in advance how many groups I have; I can look at the dendrogram and infer it.

The one risk is that you always get a dendrogram that's hierarchical, regardless of whether the data were hierarchical or not. So it's more a reflection of how you did your clustering than of any fundamental structure of the data. The fact that you get a hierarchical dendrogram means really nothing about your data; it's simply a tool that you can use to try to divide the data up into different groups.

Any questions on the hierarchical clustering? Yes?

AUDIENCE: If each data point is its own cluster, then won't that be consistent across, like, single linkage, complete linkage-- like, why would you cluster? Does that question make sense? Like, if you cut it down below, then haven't you minimized-- don't you successively minimize the variance, I guess, up to your clusters, by--

PROFESSOR: So if I cut it at the lowest level, everybody is their own cluster. That's true. Right. I'm interested in finding out whether there are genes that behave similarly across the data sets.
Or--

AUDIENCE: My question is, how would you go about determining how many clusters you want?

PROFESSOR: Oh, OK. So we'll come to that in a second. For hierarchical clustering, you don't actually have any objective way of doing that. But we'll talk about other methods right now where it's a little bit clearer. Fundamentally, though, there aren't a lot of good ways of knowing a priori what the right number of clusters is. But we'll look at some measures in a second that help.

So hierarchical clustering, as your question implies, doesn't really tell you how many clusters there are. Another approach is to decide in advance how many clusters you expect, and then see whether you can get the data to group into that number or not. An example of that is something called k-means clustering. The nice thing about it is that it does give you sharp divisions. But again, if you choose k incorrectly, as we'll see in a second, you will nevertheless still get k clusters. So k refers to the number of clusters that you tell the algorithm you expect to get; you specify that in advance. And then you try to find a set of clusters that minimizes the distance between each point and the center of the cluster it's assigned to. Is that clear? That's what these equations represent. The center of each cluster, the centroid, is just the average of the coordinates over all the members of that cluster. And we're trying to find the set of clusters, C, that minimizes the sum of the squares of the distances between each member of a cluster and its centroid.

Any questions on how we're doing this? OK. All right. So what's the actual algorithm? It's remarkably simple. I choose an initial set of random positions, and then I have a simple loop that I repeat until convergence. For every point, I assign it to the nearest centroid.
So if my starting centroids are these circles, I look at every data point and I ask, how close is it to each one of these? That's what the boundaries defined by these lines are. Everything above this line belongs to this centroid. Everything over here belongs to this centroid. So I divide the data up by which centroid each point is closest to, and I assign it to that centroid. That's step one. And in step two, I compute new centroids. That's what these triangles represent. So after I did that partitioning, it turns out that most of the things that were assigned to the triangular cluster live over here. So the centroid moves from being here to here. And I iterate this process. That's the entire K-means clustering algorithm.
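[Editor's note: in code, that loop is only a few lines. A minimal NumPy sketch of the two steps just described--assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. The initialization scheme and convergence test are simple illustrative choices.]

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # initialize centroids at k randomly chosen data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # step 1: assign each point to the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 2: move each centroid to the mean of its assigned points
        new_centroids = np.array(
            [X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
             for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # converged: the centroids (and the objective) stopped changing
        centroids = new_centroids
    return centroids, labels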
So here's an example where I generated data from three Gaussians. I chose initial points, which are the circles. I follow that protocol: here's the first step; it computes new centroids, the triangles; the second step; and then it converges--the distance stops changing.

Now this question's already come up: what happens if you choose the wrong K? Say I believe there are three clusters, and really that's not the case. What's going to happen? So in this data set there really were five clusters. Here, they're clustered correctly. What if I told the algorithm to do K-means clustering with a K of three? It would still find a way to come up with three clusters. So now it's grouped these two things, which were clearly generated from different Gaussians, together. It's grouped these two, which were generated from different Gaussians, together, and so on. All right. So K-means clustering will do what you tell it to do, regardless of whether that's the right answer or not. And if you tell it there are more clusters than really are there, then it'll start chopping up well-defined clusters into sub-clusters. So here it split this elongated one into two sub-clusters. It split this one arbitrarily into two, just so it gets the final number that we asked for.

Then how do you know what to do? Well, as I said, there's no guaranteed way to know. But one thing you can do is make this kind of plot, which shows, for different values of K on the x-axis, the sum of the distances within the clusters--the distance to the centroid within each cluster--on the y-axis. As I increase K, while I'm still correctly partitioning my data--when there really are more subgroups than I've so far defined--I'll see big drops. So when I go from saying there are two clusters to saying there are three, in that case I get a big drop in the distance between members of the clusters, because I'm no longer putting a data point over here in the same cluster as a data point over there. But once I go beyond the correct number, which was five, you see that the benefits really start to trail off. So there's an inflection point here, an elbow--sometimes this is called an elbow plot. After I go past the right number, I get less and less benefit from each additional cluster. So this gives us an empirical way of choosing an approximately correct value for K.
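[Editor's note: a sketch of that elbow heuristic, reusing the kmeans() function sketched above; the five-Gaussian data set is made up so that the bend lands at K = 5.]

import numpy as np

def within_cluster_ss(X, centroids, labels):
    # sum over clusters of squared distances from members to their centroid
    return sum(((X[labels == j] - c) ** 2).sum() for j, c in enumerate(centroids))

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=m, scale=0.3, size=(50, 2))
               for m in [(0, 0), (3, 0), (0, 3), (3, 3), (6, 1.5)]])

for k in range(1, 9):
    centroids, labels = kmeans(X, k)
    print(k, round(within_cluster_ss(X, centroids, labels), 1))
# the objective drops sharply up to k = 5 and only slowly afterwards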
Any questions on K-means? Yes?

AUDIENCE: Does K-means recapitulate the clusters that you would get if you cut off your dendrogram from hierarchical clustering at a certain level?

PROFESSOR: Not necessarily.

AUDIENCE: OK. But maybe. I don't know. It sort of seems to me as if you picked a level where you have a certain number of clusters, that that's similar, at least by centroid, by using the center?

PROFESSOR: Yeah, I think because of the way that you do it, you're not even guaranteed to have a level where you have exactly the right number. Other questions? Yes?

AUDIENCE: Could you just very quickly go over how you initialize where the starting points are, and the break-ups?

PROFESSOR: All right, so the question is, how do you initialize the starting points? In fact, you have to make some arbitrary decisions about how to initialize them. They're usually chosen at random, and you will get different results depending on how you do that. So when you run it, it's non-deterministic in that sense. And you often want to initialize multiple times and make sure you get similar results. Very good question. And in fact, that was not a setup. But what happens if you choose pathologically bad initial conditions? You have the potential to converge to the right answer, but you're not guaranteed to converge to the right answer. So here's an example where--I guess there really are three clusters in the data. I chose [INAUDIBLE] three, but I stuck all my initial coordinates down in the lower right-hand corner. And then when I do the clustering, if things go well, I get the right answer. But we're not guaranteed.

But one thing we are guaranteed is that we always get convergence. The algorithm will converge, because at each step it's either reducing the objective function or leaving it the same. So we're guaranteed convergence. But, as we've seen previously in other settings, we may end up in a local minimum rather than the global optimum. And the way to fix that would be to initialize again, with new starting positions.
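[Editor's note: a sketch of that multiple-restart safeguard, reusing the kmeans() and within_cluster_ss() functions sketched above: run from several random initializations and keep the solution with the lowest objective.]

def kmeans_restarts(X, k, n_restarts=10):
    best = None
    for seed in range(n_restarts):
        centroids, labels = kmeans(X, k, seed=seed)
        score = within_cluster_ss(X, centroids, labels)
        if best is None or score < best[0]:
            best = (score, centroids, labels)  # keep the lowest objective seen
    return best[1], best[2]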
Other questions?

What about a setting like this, where we've got two well-defined clusters and somebody who lives straight in the middle? What's the algorithm going to do? Well, sometimes it'll put that point in one cluster, and sometimes it'll end up in the other. So an alternative to K-means clustering, which has to make one or the other arbitrary decision, is something called fuzzy K-means, which can actually give a point partial membership in both clusters.

It's very similar in structure to K-means, with one important difference, which is a membership variable that tells you, for every data point, how much it belongs to cluster one, cluster two, cluster three, and so on.

So in both algorithms, we start off by choosing initial points as the cluster means, and looping. Previously, we would make a hard assignment of each data point x sub i to a single cluster. Here, we're going to calculate the probability that each data point belongs to each cluster. And that's where you get the fuzziness, because a point can have a non-unit, nonzero probability of belonging to any of the clusters. And whereas in K-means we recalculated the mean value by just averaging over everybody in that cluster, in fuzzy K-means we don't have everybody in the cluster--everybody belongs partially to the cluster. So we're going to take a weighted average.

So here are the details of how you do that. In K-means, we were minimizing this function: we were trying to find the cluster memberships that would minimize the distance of every member of a cluster to the centroid of that cluster. Here it looks almost the same, except we now have this new variable, mu, which is the membership--the membership of point j in cluster i. So I'm trying to minimize a very similar function. But now, if all my mus are one, then what do I get? K-means, right? But as soon as the mus are allowed to vary from one--they can be between zero and one--then points can contribute more or less. So that point that was stuck in the middle of the two clusters, if it had a mu of 0.5 for each, it would contribute half to each, and then both centroids would move a little bit towards the middle.
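[Editor's note: a minimal sketch of those two fuzzy updates--the standard fuzzy c-means membership formula and the membership-weighted centroid--where m > 1 is the usual "fuzzifier" exponent (as m approaches 1 you recover hard K-means). The initialization is illustrative.]

import numpy as np

def fuzzy_kmeans(X, k, m=2.0, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # distances from every point to every centroid (small floor avoids /0)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) + 1e-12
        # membership mu of each point in each cluster; rows sum to 1
        inv = d ** (-2.0 / (m - 1.0))
        mu = inv / inv.sum(axis=1, keepdims=True)        # shape (n_points, k)
        # membership-weighted average replaces the hard per-cluster mean
        w = mu ** m
        centroids = (w.T @ X) / w.sum(axis=0)[:, None]
    return centroids, mu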
So what's the result of K-means--I'm sorry, fuzzy K-means clustering? We still get K clusters. But now every gene, every object that we're clustering, has a partial membership. So here's an example of that, where they did fuzzy K-means clustering with six different clusters. But now every profile, every gene, has a color associated with it that represents this mu value, going from zero to one with these rainbow colors. The things that are reddish, or pink, are the high-confidence things that are very strongly in that cluster, and only that cluster. Whereas the things that are more towards the yellow end of the spectrum are partially in this cluster and partially in other clusters. Questions? Any questions?

So K-means we've defined in terms of Euclidean distance. And that has clear advantages, in terms of computing things very easily. But it has some disadvantages as well. One of the disadvantages is that because we're using the squared distance, outliers have a very big effect, because I'm squaring the difference between vectors. That may not be the worst thing. But Euclidean distances also restrict us to things for which we can compute a centroid--we have to have data for which you can actually compute the mean value of all members of a cluster. Sometimes you want to cluster things for which we only have qualitative data, where instead of having a distance measure, we have a similarity. This doesn't come up quite as often in--well, it certainly doesn't come up in gene expression data or RNA-seq. But you can imagine more qualitative data, where you ask people about similarity between different things, or behavioral features--where you know the similarity between two objects, but you have no way of calculating the average object. One setting that you might [INAUDIBLE] have looked at: if you're trying to cluster, say, sequence motifs that you've computed with the EM algorithm. What's the average sequence motif? That doesn't necessarily represent any true object, right?
999 00:41:06,380 --> 00:41:08,430 You might be better off-- you can calculate it. 1000 00:41:08,430 --> 00:41:09,650 But it doesn't mean anything. 1001 00:41:09,650 --> 00:41:11,140 You might be better off calculating 1002 00:41:11,140 --> 00:41:15,550 using rather than the average motif, the most 1003 00:41:15,550 --> 00:41:17,870 central of the motifs that you actually observed. 1004 00:41:17,870 --> 00:41:20,362 So that would be called a medoid, or an exemplar. 1005 00:41:20,362 --> 00:41:22,820 It's a member of your cluster that's closest to the middle, 1006 00:41:22,820 --> 00:41:28,130 even if it it's not smack dab in the middle. 1007 00:41:28,130 --> 00:41:31,690 So instead of K-means, we can just think, well, K-medoids. 1008 00:41:31,690 --> 00:41:35,090 So in K-means, we actually computed a centroid. 1009 00:41:35,090 --> 00:41:37,910 And in medoids, we'll choose the existing data 1010 00:41:37,910 --> 00:41:41,730 point that's most central. 1011 00:41:41,730 --> 00:41:42,770 So what does that mean? 1012 00:41:53,200 --> 00:41:57,715 If these are my data, the true mean is somewhere over here. 1013 00:42:00,335 --> 00:42:01,460 But this one is the medoid. 1014 00:42:05,600 --> 00:42:08,410 It's an exemplar that's close to the central point. 1015 00:42:08,410 --> 00:42:11,952 But if there actually isn't anything here, 1016 00:42:11,952 --> 00:42:12,660 then there isn't. 1017 00:42:12,660 --> 00:42:14,618 So we're going to use the thing that's closest. 1018 00:42:14,618 --> 00:42:16,680 So if these were all sequence motifs, 1019 00:42:16,680 --> 00:42:18,550 rather than using some sequence motif that 1020 00:42:18,550 --> 00:42:20,600 doesn't exist as the center of your cluster, 1021 00:42:20,600 --> 00:42:22,960 you would use a sequence motif that actually does exist, 1022 00:42:22,960 --> 00:42:24,222 and it's close to the center. 1023 00:42:29,870 --> 00:42:32,020 So it's a simple variation on the K-means. 1024 00:42:35,180 --> 00:42:39,540 Instead choosing K points in arbitrary space 1025 00:42:39,540 --> 00:42:41,600 as our starting positions, we're going 1026 00:42:41,600 --> 00:42:46,510 to choose K examples from the data as our starting medoids. 1027 00:42:46,510 --> 00:42:49,180 And then we're going to place each point in the cluster that 1028 00:42:49,180 --> 00:42:52,310 has the closest medoid, rather than median. 1029 00:42:52,310 --> 00:42:54,130 And then when we do the update step, 1030 00:42:54,130 --> 00:42:56,840 instead of choosing the average position to represent 1031 00:42:56,840 --> 00:42:59,592 the cluster, we'll choose the medoid. 1032 00:42:59,592 --> 00:43:03,860 The exemplar that's closest to the middle. 1033 00:43:03,860 --> 00:43:06,495 Any questions on this? 1034 00:43:06,495 --> 00:43:06,995 Yes? 1035 00:43:06,995 --> 00:43:08,453 AUDIENCE: So if you use the medoid, 1036 00:43:08,453 --> 00:43:10,306 do you lose the guaranteed convergence? 1037 00:43:10,306 --> 00:43:11,681 Because I can picture a situation 1038 00:43:11,681 --> 00:43:13,835 where you're sort of oscillating because now 1039 00:43:13,835 --> 00:43:15,034 you have a discrete stack. 1040 00:43:15,034 --> 00:43:16,794 PROFESSOR: That's a good question. 1041 00:43:16,794 --> 00:43:17,710 That's probably right. 1042 00:43:17,710 --> 00:43:18,835 Actually, I should think about that. 1043 00:43:18,835 --> 00:43:19,510 I"m not sure. 1044 00:43:22,570 --> 00:43:25,034 Yeah, that's probably right. 1045 00:43:25,034 --> 00:43:25,700 Other questions? 1046 00:43:30,701 --> 00:43:31,200 OK. 
Any questions on this? Yes?

AUDIENCE: So if you use the medoid, do you lose the guaranteed convergence? Because I can picture a situation where you're sort of oscillating, because now you have a discrete set.

PROFESSOR: That's a good question. That's probably right. Actually, I should think about that. I'm not sure. Yeah, that's probably right. Other questions? OK.

There are a lot of other techniques for clustering. Your textbook talks about self-organizing maps, which were quite popular at one point. And there's also a nice technique called affinity propagation, which is a little bit outside the scope of this course, but has proved quite useful for clustering. OK.

So why bother to do all this clustering? Our goal is to try to find some biological information, not just to find groups of genes. So what can you do with these things? Well, one thing that was identified early on is that if I could find sets of genes that behave similarly, maybe those could be used in a predictive way--to predict outcomes for patients, or some biological function. So we're going to look at that first.

One of the early papers in this field did clustering of microarrays for patients who had B-cell lymphoma. The patients had different kinds of B-cell lymphomas. So they took their data and they clustered it. Again, each row represents a gene, and each column represents a patient. With this projector it's a little bit hard to see, but when you look at the notes separately, you'll be able to see that in the dendrogram there's a nice, sharp division between two large groups of patients. And it turns out that when you look at the pathologists' annotations for these patients--which were completely independent of the gene expression data--almost all the patients in the left-hand group had one kind of lymphoma, and all the patients in the right-hand group had a different kind of lymphoma. This got people very excited, because it suggested that purely molecular features might be at least as good as pathological studies. So maybe you could completely automate the identification of different tumor types.
Now, the next thing that got people even more excited was the idea that maybe you could use these patterns not just to recapitulate what a pathologist would find, but to go beyond it, and actually make predictions about the patients. So in these plots--I don't know if we've seen these before in the class--on the x-axis is survival time, and on the y-axis is the fraction of patients in a particular group who survived that long. So as the patients die, obviously the curve drops down. Each one of these drops represents the death of a patient, or the loss of the patient from the study for other reasons.

Let's start with the one in the middle. This is what the clinicians would have decided. Here are patients that they defined by clinical standards as likely to do well, versus patients whom they defined by clinical standards as likely to do poorly. And you can see there's a big difference in the plots for the low-clinical-risk patients at the top and the high-clinical-risk patients at the bottom. On the left-hand side is what you get when you use purely gene expression data to cluster the patients into groups, which turn out to be high risk or low risk. And you can see that it's a little bit more statistically significant for the clinical risk, but it's pretty good over here, too.
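[Editor's note: a sketch of how such a survival comparison is typically done, assuming the expression data live in a pandas DataFrame and that the lifelines survival-analysis package is available for the log-rank test. Every variable and function name here is hypothetical, not from the study discussed.]

import numpy as np
from lifelines.statistics import logrank_test

def compare_signature_groups(expr, signature_genes, surv_time, event_observed):
    """expr: DataFrame (patients x genes); surv_time, event_observed: arrays."""
    score = expr[signature_genes].mean(axis=1)    # per-patient signature score
    high = (score > score.median()).to_numpy()    # split at the median score
    result = logrank_test(surv_time[high], surv_time[~high],
                          event_observed_A=event_observed[high],
                          event_observed_B=event_observed[~high])
    return result.p_value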
Now the really impressive thing is this: what if you take the patients that the clinicians define as low clinical risk, and then you look at their gene expression data? Could you separate out, among those allegedly low-clinical-risk patients, the ones who are actually at high risk? Maybe then they would be diverted to more aggressive therapy than patients who really and truly are low risk. And what they were able to show, with just barely statistical significance, is that even among the clinically defined low-risk patients, there is--based on these gene signatures--the ability to distinguish patients who are going to do better from patients who are going to do worse.

So this was over a decade ago, and it really set off a frenzy of people looking for gene signatures for all sorts of things that might be highly predictive. Now, the fact that something is correlated doesn't, of course, prove any causality. So one of the questions is: if I find a gene signature that is predictive of an outcome in one of these studies, can I use it to go backwards and actually define a therapy? In the ideal setting, I would have these gene signatures; I'd discover that they are clinically associated with outcome; I could dig in and discover what makes the patients who do worse do worse, and go and treat that. So is that the case or not?

Let me show you some data from a breast cancer data set. Here's a breast cancer data set--again the same kind of plot, where we've got the survival statistic on the y-axis and the number of years on the x-axis. And based on a gene signature, this group has defined a group that does better and a group that does worse. The p-value is significant, and the hazard ratio--the death rate versus the control group--is approximately two. OK. So does this lead us to any mechanistic insight into breast cancer? Well, it turns out that in this case, the gene signature was defined based on postprandial laughter--after-dinner humor. Here's a gene set that defines something that has absolutely nothing to do with breast cancer, and it's predicting the outcome of breast cancer patients. Which led to the somewhat-of-a-joke line that they were testing whether laughter really is the best medicine. OK. So they went on--they tried other gene sets.
Here's a gene set that's not even defined in humans. It's the homologs of genes that are associated with social defeat in mice. And once again, you get a statistically significant p-value and good hazard ratios. So what's going on? Well, these plots are not from a study that's actually trying to predict an outcome in breast cancer. It's a study that shows that most randomly selected sets of genes in the genome will give a result that's correlated with patient outcome in breast cancer. Yes?

AUDIENCE: I'm a little confused. In the previous graph, could you just explain what is the black and what is the red? Is that individuals or groups?

PROFESSOR: So the black are people that have the gene set signature--who have high levels of the genes that are defined in this gene set. And the red are ones that have low levels, or the other way around. But it's dividing all patients into two groups, based on whether they have a particular level of expression of this gene set, and then following those patients over time: do they do better or worse? And similarly for all these plots.

They had another one, which is a little less amusing: localization of skin fibroblasts. The real critical point is this. Here, they computed--under the expectation that all genes are independent of each other--the probability that a gene signature is correlated with outcome, for gene sets that were chosen at random, or chosen from a database of gene signatures that people have identified as being associated with pathways. And you get a very, very large fraction. So this is the log of the p-value, so more negative values are more significant.
A huge fraction of all gene sets that you pull at random from the genome, or that you pull from a compendium of known pathways, are going to be associated with outcome in this breast cancer data set. So it's not just well-annotated cancer pathways that are associated. It's gene sets associated, as we've seen, with laughter, or social defeat in mice, and so on--all sorts of crazy things that have no mechanistic link to breast cancer.

Let's take a second for that to sink in. I pull genes at random from the genome. I divide patients based on whether they have high levels of expression of that random set of genes or low levels of expression of it. And I'm extremely likely to be able to predict the outcome in breast cancer. That should be rather disturbing, right?

And it turns out--before we get to the answer--that this is not unique to breast cancer. They went through a whole bunch of data sets in the literature. Each row is a different previously published study, where someone had claimed to identify a signature for a particular kind of disease or outcome. And they took their random gene sets and asked how well the random gene sets did in predicting the outcome in those patients. These yellow plots represent the probability distribution for the random gene sets--again, on this projector it's hard to see, but there's a highlight on the left-hand side at where the best 5% of the random gene sets fall. This blue line is the nominal level of statistical significance. It turns out that a few of these studies didn't even reach a normal level of statistical significance, let alone a comparison against random gene sets. But for most of these, you don't do better than a good fraction of the randomly selected gene sets.
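[Editor's note: that control is easy to sketch--draw many random gene sets of the same size as a published signature and ask how often they, too, "predict" outcome. If the real signature doesn't beat this null distribution, its association with outcome carries no special meaning. This reuses the hypothetical compare_signature_groups() helper sketched earlier.]

import numpy as np

def random_signature_null(expr, sig_size, surv_time, event_observed, n_draws=1000):
    rng = np.random.default_rng(0)
    genes = expr.columns.to_numpy()
    pvals = []
    for _ in range(n_draws):
        draw = rng.choice(genes, size=sig_size, replace=False)
        pvals.append(compare_signature_groups(expr, list(draw),
                                              surv_time, event_observed))
    return np.array(pvals)

# fraction of random signatures "significant" at p < 0.05; in the breast
# cancer data discussed here, that fraction was strikingly large:
# frac = (random_signature_null(expr, 50, surv_time, events) < 0.05).mean()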
So how could this be? Well, it turns out there is an answer to why this happens, and it's really quite fascinating. Here, they're using the hazard ratio, which is the death rate for the patients who have the signature over that of the control group. So a high hazard ratio means the signature is strongly associated with a bad outcome. And they've plotted that against the correlation of the genes in the gene signature with the expression of a gene called PCNA, Proliferating Cell Nuclear Antigen. And it turns out a very, very large fraction of the genome is coexpressed.

So genes are not expressed like completely independent random variables. There are lots of genes that show very similar expression levels across all the data sets. Now, PCNA is a gene that's been known by pathologists for a long time as having higher levels in the most aggressive tumors. And a very, very large fraction of the genome is coexpressed with PCNA. So high levels of a randomly selected set of genes are going to be a very good predictor of tumor outcome, because high levels of randomly selected genes also mean a very high probability of having a high level of PCNA, which is a tumor marker.

So we have to proceed with a lot of caution. We can find things that are highly correlated with outcome, and those could have good value as prognostic indicators. But there are going to be a lot of possible sets of genes that have that property--they're good predictors of outcome--and many of them will have absolutely nothing to do, causally, with the process of the disease. So at the very least, it means don't start a drug company over every set of genes that you identify as associated with outcome. But in the worst-case scenario, it also means that those predictions will break down under settings that we haven't yet examined. And that's the real fear: you have a gene set signature that you think is highly predictive of outcome, but it's only because you looked at a particular set of patients. You look at a different set of patients, and that correlation will break down.
So this is an area of research that's still quite in flux, in terms of how much utility there will be in identifying gene set signatures in this completely objective way. And what we'll see in the course of this lecture and the next one is that it's probably going to be much more useful to incorporate other kinds of information that will constrain us to be more mechanistic. Any questions? All right.

So now we're going to really get into the meat of the identification of gene modules. And we're going to try to see how much we can learn about regulatory structure from the gene expression data. So we're going to move up from the pure expression data--say, these genes at the bottom--to try to figure out what set of transcription factors were driving them, and maybe what signaling pathways lie upstream of those transcription factors and turn them on. And the fundamental difference between clustering, which is what we've been looking at until now, and these modules, as people like to call them, is that you can have a whole bunch of genes--we've just seen this--that are correlated with each other without being causally linked to each other. So we'd like to figure out which genes are actually functionally related, and not just statistically related.

And the paper that's going to serve as our organizing principle for the rest of this lecture, maybe bleeding into the next lecture, is this recently published paper called the DREAM5 challenge. This, like some of the other challenges we've seen before, is a case where the organizers have data sets for which they know the answer--what the regulatory structure is. They send out the data, people try to make the best predictions they can, and then the organizers unseal the answers to let people know how well they did. And so you can get a relatively objective view of how well different kinds of approaches work. So this is the overall structure of the challenge.
They had four different kinds of data. Three are real data sets from different organisms: E. coli, yeast, and Staphylococcus aureus. And the fourth one, the one at the top here, is completely synthetic data that they generated. And you get a sense of the scale of the data sets: how many genes are involved, how many potential regulators. In some cases, they've given you specific information on the perturbations--knockouts, antibiotics, toxins--and, again, the number of conditions being looked at, the number of arrays. They provide the data in a way that makes it very hard for the groups doing the analysis to trace it back to particular genes, because you don't want people to use external data to make their predictions. So everyone makes their predictions. As part of this challenge, the organizers also made their own meta-predictions, based on the individual predictions from the different groups; we'll take a look at that in a second. And then they score how well everyone did.

Now, we'll get into the details of the scoring a little bit later. But what they found, at the highest level, is that different kinds of methods behaved similarly. The main groups of methods were these regression-based techniques--we'll talk about those in a second; Bayesian networks, which we've already discussed in a different context; a hodgepodge of different kinds of things; and then mutual information and correlation. So we're going to look at each of these main categories of prediction methods.

We're going to start with the Bayesian networks, which we just finished talking about in a completely different context. Here, instead of trying to predict whether an interaction is real based on the experimental data, we're going to try to predict whether a particular protein is involved in regulating a set of genes, based on the expression data.
So in this context--let's say I have cancer data sets, and I want to decide whether p53 is activated in those tumors. This is a known pathway for p53. So if I told you the pathway, how might you figure out whether p53 is active from gene expression data? I tell you this pathway, I give you this expression data--what's a simple thing that you could do right away to decide whether you think p53 is active or not? p53 is a transcriptional activator, so it should be turning on its target genes when it's on. So what's an obvious thing to do?

AUDIENCE: Check the expression levels of the targets.

PROFESSOR: Thank you. Right, so we could check the expression levels of the targets and compute some simple statistics. OK, well, that could work. But of course there could be other transcriptional regulators that regulate a similar set of genes. So that's not a guarantee that p53 is on--it might be some other transcriptional regulator.

We could also look at the pathways that activate p53 and ask whether those genes are on. So we've got in this pathway a bunch of kinases--ATM, CHK1, and so on--that activate p53. Now, if we had proteomic data, we could actually look at whether those proteins are phosphorylated. But we have much, much less proteomic data, and most of these settings only have gene expression data. But you can look at: is that gene expressed? Has the expression of one of these activating proteins gone up? And you can then try to make an inference: if there's more of these activating proteins, then maybe p53 is active, and therefore it's turning on its targets. That's one step removed, though. Just the fact that there's a lot of ATM mRNA around doesn't mean that there's a lot of the ATM protein, and it certainly doesn't mean that ATM is phosphorylated and turning on its target. So again, we don't have a guarantee there.
We could look more specifically at whether the genes are differentially expressed. The fact that they're on may not be as informative as if they were uniquely on in this tumor and not in control cells from the same patient. So that can be informative. But again, changes in gene expression are not uniquely related to changes in protein level. So we're going to have to proceed with a bit of caution.

So the first step we're going to take in this direction is to try to build a Bayesian network. That's going to give us a way to reason probabilistically over all of these kinds of data, which by themselves are not great guarantees that we're getting the right answer--just like in the protein-protein interaction prediction problem, where individually, coexpression wasn't all that great and essentiality wasn't all that great, but taken together they could be quite helpful.

So we want to compute the probability that the p53 pathway is active, given the data. And the only data we're going to have in this setting is gene expression data. So we're going to assume that for the targets of a transcription factor to be active, the transcription factor itself has to be expressed at a higher level. That's a commonly used restriction in analyzing these kinds of data.

So we're going to try to compute the probability that p53 is activated, given the data. How would I compute the probability that, given that some transcription factor is on, I see expression from its target genes? How would I do this? I would just go into the data and count, in the same way that we did in our previous setting. We could just look over all the experiments and tabulate: when one of the targets is up in expression, how often is the transcription factor that's potentially activating it up? And how often does each of the possible combinations occur? And then we can use Bayesian statistics to compute the probability that a transcription factor is up--activated--given that I've seen the gene expression data. Is that clear? Good.
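[Editor's note: a sketch of that counting argument--binarize the expression data across experiments, tabulate the conditional frequencies, and invert with Bayes' rule. The binarization rule and variable names are illustrative.]

import numpy as np

def p_tf_given_targets(tf_up, targets_up):
    """tf_up: (n_experiments,) bool; targets_up: (n_experiments,) bool
    (e.g. 'most targets above their median expression' in that experiment)."""
    p_tf = tf_up.mean()                          # P(TF up)
    p_t_given_tf = targets_up[tf_up].mean()      # P(targets up | TF up)
    p_t_given_not = targets_up[~tf_up].mean()    # P(targets up | TF down)
    # Bayes' rule: P(TF up | targets up)
    num = p_t_given_tf * p_tf
    return num / (num + p_t_given_not * (1 - p_tf))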
Now, we don't want to include just the downstream factors, because there may be multiple transcription factors that are equally likely to be driving expression of that set of genes. We want to include the upstream regulators as well. And here we're going to take advantage of one of the properties of Bayesian nets that we looked at: explaining away. You'll remember this example, where we decided that if I see that the grass is wet, and I know that it's raining, then I can consider it less likely that the sprinklers were on--even though there's no causal relationship between them. So if I see that a set of targets of transcription factor A are on, and I have evidence that the pathway upstream of A is on, that reduces my inferred probability that transcription factor B is responsible. So that's one of the nice things about Bayesian networks: it gives us a way of reasoning automatically over all the data, and not just the downstream targets.
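[Editor's note: explaining away is easy to verify numerically on that rain/sprinkler/wet-grass example. The sketch below enumerates the joint distribution by brute force; the probability values are made up for illustration.]

from itertools import product

p_rain, p_sprinkler = 0.2, 0.1
p_wet = {(True, True): 0.99, (True, False): 0.9,
         (False, True): 0.9, (False, False): 0.01}

def joint(r, s, w):
    pr = p_rain if r else 1 - p_rain
    ps = p_sprinkler if s else 1 - p_sprinkler
    pw = p_wet[(r, s)] if w else 1 - p_wet[(r, s)]
    return pr * ps * pw

def prob_sprinkler(given_rain=None):
    """P(sprinkler on | grass wet [, rain state]) by enumerating the joint."""
    num = den = 0.0
    for r, s in product([True, False], repeat=2):
        if given_rain is not None and r != given_rain:
            continue
        p = joint(r, s, True)          # grass is observed wet
        den += p
        if s:
            num += p
    return num / den

print(round(prob_sprinkler(), 3))                  # P(S | W)    ~ 0.352
print(round(prob_sprinkler(given_rain=True), 3))   # P(S | W, R) ~ 0.109
# observing rain "explains away" the wet grass, dropping P(sprinkler)
# back toward its prior of 0.1, even though rain and sprinkler are
# a priori independent.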
And Bayesian networks can have multiple layers. We can have one transcription factor turning on another one, which turns on another one, with as many layers as necessary. But one thing we can't have is cycles. So we can't have a transcription factor at the bottom of the network going back and activating things at the top. That's a fundamental limitation of Bayesian networks.

We've already talked about the fact that with Bayesian networks there are two problems we have to solve. We have to be able to define the structure, if we don't know it a priori. Here, we don't know it a priori, so we're going to have to learn the structure of the network. And then, given the structure, we're going to have to learn all the probabilities -- the conditional probability tables that relate each variable to its parents.

And then just two more small points. If I give you only expression data, without any interventions -- just the observations -- then I can't decide what is a cause and what is an effect. Here this was shown in the context of proteomics, but the same is true for gene expression data. If I have two variables, x and y, that are highly correlated, it could be that x activates y, or it could be that y activates x. But if I perturb the system, and block the activity of one of these two genes or proteins, then I can start to tell the difference. In this case, if you inhibit x, you don't see any activation of y -- that's the yellow, all down here. But if you inhibit y, you still see the full range of activity of x. So that implies that x is the activator of y. And so in these settings, if you want to learn a Bayesian network from data, you need more than just a compendium of gene expression data. If you want to get the directions correct, you need perturbations, where someone has actually inhibited particular genes or proteins.

Now, in a lot of these Bayesian networks, we're not going to try to include every possible gene and every possible protein, either because we don't have measurements of it, or because we need a compact network. So there will often be cases where the true regulator in some causal chain is missing from our data. Imagine this is the true causal chain: x activates y, which then activates z and w. But either because we don't have data on y, or because we left it out to make our model more compact, it's not in the model. We can still pick up the relationships between x and z, and between x and w. But the data will be much noisier, because we're missing that information in the conditional probability tables relating x to y, and then y to its two targets.
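Going back to the point about perturbations: here is a small simulation (an assumed toy model, not real data) of why interventions can orient an edge that observational correlation cannot. Clamping the true cause collapses the effect's range; clamping the effect leaves the cause untouched:

```python
import numpy as np

# Assumed ground truth: x activates y. Observationally they are just correlated.
rng = np.random.default_rng(1)
n = 1000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=0.3, size=n)

print("observational corr(x, y):", round(np.corrcoef(x, y)[0, 1], 2))  # ~1.0, symmetric

# Intervene on x (inhibit it): y loses its activity range, down to the noise floor.
y_after = 2.0 * np.zeros(n) + rng.normal(scale=0.3, size=n)
print("std of y after inhibiting x:", round(y_after.std(), 2))         # ~0.3

# Intervene on y: x is generated upstream, so it keeps its full range.
x_after = rng.normal(size=n)
print("std of x after inhibiting y:", round(x_after.std(), 2))         # ~1.0
```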
So Bayesian networks we've now seen quite a lot of, and we have some idea of how to transfer them from one domain to the domain of gene expression data. The next approach we want to look at is a regression-based approach.

The regression-based approaches are founded on a simple idea, which is that the expression of a gene is going to be some function of the expression levels of its regulators. We're actually going to try to come up with a formula that relates the activity levels of the transcription factors to the activity level of the target. In this cartoon, I've got a gene that's on under one condition and off under some other conditions. What transforms it from off to on is the introduction of more of these transcription factors binding to the promoter. So in general, I have some predicted level of expression for the gene -- call it the predicted level y -- and it's some function, unspecified at this point, of all the expression levels of the transcription factors that regulate that gene.

Just to keep the nomenclature straight: x sub g is the expression of gene g, and capital X sub TF(g) is the set of transcription factors that I believe regulate that gene. So x_g = f(X_TF(g)) plus a noise term, where f is an arbitrary function. We include the noise term because this is the observed gene expression, not some sort of platonic view of the true gene expression.

Now frequently we'll assume a specific function -- the simplest one you can imagine, which is a linear function. So the expression of any particular gene is going to be a linear function, a sum, of the expression of all of its regulators, where each regulator has associated with it a coefficient beta. And that beta coefficient tells us how much a particular regulator influences that gene.
So, say, p53 might have a very large value, and some other transcriptional regulator might have a small value, representing their relative influence. Now, I don't know the beta values in advance, so that's one of the things I need to learn. I want to find a setting that tells me what the beta values are for every possible transcription factor. If the algorithm sets a beta value to zero, what does that tell me about that transcriptional regulator? No influence, right. And the higher the value, the greater the influence.

OK, so how do we discover these? The approach is to come up with some objective function that we're going to try to optimize. An obvious objective function is the difference between the observed expression value for each gene and the expected one, based on that linear function. We're going to choose the set of beta parameters that minimizes the difference between the observed and the expected -- that minimizes the sum of the squares. So we minimize the residual sum of squares error between the predicted and the observed.

This is a relatively standard regression problem, just in a different setting. Now, one of the problems with a standard regression problem is that we'll typically get a lot of very small values of beta. We won't get all zeros or all ones, meaning the algorithm is 100% certain that these are the drivers and these are not -- we'll get small values for many, many transcription factors. And OK, that could represent the reality. But the bad thing is that those beta values are going to be unstable: small changes in the training data will give you big changes in which transcription factors have which values. So that's not a desirable setting. There's a whole field built up around trying to come up with better solutions.
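Here is a minimal sketch of that least-squares fit on synthetic data, using numpy's lstsq to minimize the residual sum of squares (all names and numbers here are illustrative, not from any real data set):

```python
import numpy as np

# X holds TF expression (rows = experiments, columns = candidate regulators);
# y is the target gene's expression. In the assumed true model, only
# regulators 0 and 3 actually influence the target.
rng = np.random.default_rng(2)
n_exp, n_tf = 100, 8
X = rng.normal(size=(n_exp, n_tf))
beta_true = np.zeros(n_tf)
beta_true[0], beta_true[3] = 1.5, -0.8
y = X @ beta_true + rng.normal(scale=0.2, size=n_exp)

# Choose the betas that minimize the residual sum of squares ||y - X beta||^2.
beta_hat, rss, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_hat, 2))   # near 1.5 and -0.8 for TFs 0 and 3, small elsewhere
```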
I've given you some references here. One of them is a paper that did well in the DREAM challenge. The other is a very good textbook, The Elements of Statistical Learning. And there are various techniques that allow you to limit the number of betas that are non-zero. By doing that, you get more robust predictions -- at a cost, right, because there could be a lot of transcription factors that really do have small influences. But we'll trade that off, by getting more accurate predictions for the regulators that have the big influences.
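One standard technique of this kind is the lasso, which adds an L1 penalty to the residual sum of squares so that most betas are driven exactly to zero. A minimal sketch, assuming scikit-learn is available (the alpha value is an arbitrary illustrative choice; larger alpha means fewer non-zero regulators):

```python
import numpy as np
from sklearn.linear_model import Lasso   # assumes scikit-learn is installed

# Same synthetic setup as the least-squares sketch above.
rng = np.random.default_rng(3)
n_exp, n_tf = 100, 8
X = rng.normal(size=(n_exp, n_tf))
beta_true = np.zeros(n_tf)
beta_true[0], beta_true[3] = 1.5, -0.8
y = X @ beta_true + rng.normal(scale=0.2, size=n_exp)

# The L1 penalty trades a little fit for a sparse, more stable set of betas.
model = Lasso(alpha=0.1).fit(X, y)
print(np.round(model.coef_, 2))                       # exact zeros except TFs 0 and 3
print("non-zero regulators:", np.flatnonzero(model.coef_))
```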
Are there any questions on regression?

So the last of the methods we're examining uses mutual information. We've already seen mutual information in the course. Information content is related to the probability of observing some symbol in an alphabet. In most languages, the probability of observing particular letters is quite variable: Es are very common in English, other letters are less common, as anyone who plays Hangman or watches Wheel of Fortune knows. And we defined the entropy as the sum, over all possible outcomes, of the probability of observing each value times its information content. We can define it in the discrete case or in the continuous case. The critical quantity is the mutual information between two variables: the difference between the sum of the entropies of those variables taken independently and their joint entropy. High mutual information means that one variable gives me significant knowledge of what the other variable is doing -- it reduces my uncertainty. That's the critical idea.

OK. So we looked at correlation before. There can be settings where you have very low correlation between two variables, but high mutual information. Consider these two genes -- protein A and protein B -- where the blue dots show the relationship between them. You can see that there's a lot of information in these two variables: knowing the value of A gives me high confidence in the value of B. But there's no linear relationship that describes them. So if I use mutual information, I can capture situations like this that I can't capture with correlation.

And these kinds of situations actually occur. For example, in a feed-forward loop: say we've got a regulator A, and it directly activates B. It also directly activates C, but C inhibits B. So you've got the path on the left-hand side pressing the accelerator, and the path on the right-hand side pressing the brake. That's called an incoherent feed-forward loop. Under different settings you can get different kinds of results, of which this is one example. You can get much more complicated behavior -- there are papers that have mapped out these behaviors across many parameter settings -- including switches in the behavior. But in a lot of these settings, you will have high mutual information between two variables even if there's no linear correlation between them.

A well-publicized algorithm that uses mutual information to infer gene regulatory networks is called ARACNe. They go through and compute the mutual information between all pairs of genes in their data set. Now, one question you have with mutual information is: what defines a significant level of mutual information? An obvious way to figure out what's significant is to do randomizations, and that's what they did. They shuffled the expression data and computed the mutual information among pairs of genes where there shouldn't be any relationship, because the data had been shuffled. Then you can decide whether the observed mutual information is significantly greater than what you get from the randomized data.
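To make both points concrete -- strong dependence with near-zero linear correlation, and a permutation null for significance -- here is a small sketch using a plug-in, histogram-based MI estimate (a deliberate simplification; ARACNe itself uses a more careful estimator, and all data here are synthetic):

```python
import numpy as np

def mutual_info(x, y, bins=8):
    """Plug-in MI estimate (in bits) for two continuous vectors, after binning."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return (pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum()

rng = np.random.default_rng(4)
a = rng.uniform(-1, 1, 2000)
b = a ** 2 + rng.normal(scale=0.05, size=a.size)   # strong dependence, ~zero correlation

mi_obs = mutual_info(a, b)
# Permutation null: shuffling one vector destroys any real relationship,
# so the null distribution tells us what MI to expect by chance alone.
null = [mutual_info(a, rng.permutation(b)) for _ in range(200)]
print("corr:", round(np.corrcoef(a, b)[0, 1], 2))                 # ~0
print("MI:", round(mi_obs, 2),
      " null 95th pct:", round(np.percentile(null, 95), 2))       # MI >> null
```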
Now, the other thing that happens with mutual information is that indirect effects still produce high mutual information. Consider the set of genes shown here. You've got G2, which is actually a regulator of G1 and G3. So G2 is going to have high mutual information with G1, and with G3. Now, what about G1 and G3? They're going to behave very similarly as well, so there will be a high degree of mutual information between G1 and G3. If I just rely on mutual information, I can't tell what's a regulator and what's a fellow target at the same level of regulation -- both being affected by something above them. I can't tell the difference between those two. So they use what's called the data processing inequality: the true regulatory interactions should have higher mutual information than the edge that's just between two common targets of the same parent. And so they drop from their network those edges which are the lowest of the three in a triangle.
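Here is the pruning step in miniature (the MI values are toy numbers, not from the paper): for every fully connected triangle of genes, the data processing inequality says to drop the weakest of the three edges:

```python
import itertools

# Toy MI values: G2 is the real regulator of G1 and G3, so the G1-G3 edge
# (two common targets of the same parent) is the indirect, weakest one.
mi = {
    frozenset(("G1", "G2")): 0.9,
    frozenset(("G2", "G3")): 0.8,
    frozenset(("G1", "G3")): 0.5,
}
edges = set(mi)
for tri in itertools.combinations(("G1", "G2", "G3"), 3):
    tri_edges = [frozenset(p) for p in itertools.combinations(tri, 2)]
    if all(e in edges for e in tri_edges):
        edges.discard(min(tri_edges, key=mi.get))   # remove the weakest leg

print(sorted(tuple(sorted(e)) for e in edges))      # G1-G3 has been pruned
```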
So that was the original ARACNe algorithm. They then modified it a little, to try to be more specific about the regulators being picked up, and they called this approach MINDy. The core idea is that in addition to the transcription factors, you might have another protein -- a modulator -- that turns a transcription factor on or off. So if I look over different concentrations of the transcription factor, I might find that there are some cases where this other protein turns it on, and other cases where it turns it off.

Consider these two data sets, looking at different concentrations of a particular transcription factor and different expression levels of its target. In one case -- the blue dots -- the modulator isn't present at all, or is present at its lowest possible level. In the red case, it's present at a high level. You can see that when the modulator is present only at low levels, there's no relationship between the target and its transcription factor. But when the modulator is present at a high level, there's a linear response of the target to its transcription factor. So this modulator seems to be a necessary component.

They went through and defined a whole set of scenarios like this, and then systematically searched the data for these modulators. They started with the expression data set -- genes in rows, experiments in columns. They do a set of filtering steps to remove things that would be problematic for the analysis: they had to start with a list of candidate modulators and transcription factors, and they removed the ones where there isn't enough variation, and so on. Then, for every modulator and transcription factor pair, they examine the cases where the modulator is present at its highest level and where it's present at its lowest level. Let's say that when the modulator is present at a high level, there's high mutual information between the transcription factor and the target, and when the modulator is absent, there's no mutual information -- that's the setting we looked at before. That would suggest that the modulator is an activator, a positive modulator. You can have the opposite situation, where there's mutual information between the transcription factor and its target when the modulator is at low levels, and nothing when the modulator is at a high level. That would suggest the modulator is a negative regulator. And then there are scenarios where the mutual information between transcription factor and target is either uniformly high or uniformly low, so the modulator doesn't seem to be doing anything. So we break it down into these categories.
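A minimal sketch of that conditional-MI comparison -- an illustration in the spirit of MINDy, not the published algorithm; the generative model below is assumed for demonstration:

```python
import numpy as np

def mutual_info(x, y, bins=8):
    """Plug-in MI estimate for two vectors after binning (as in the earlier sketch)."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return (pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum()

rng = np.random.default_rng(5)
n = 3000
mod = rng.uniform(0, 1, n)                 # candidate modulator expression
tf = rng.uniform(0, 1, n)                  # transcription factor expression
# Assumed toy model: the TF only drives the target when the modulator is high.
target = np.where(mod > 0.5, 2 * tf, 0.0) + rng.normal(scale=0.1, size=n)

lo = mod < np.quantile(mod, 0.25)          # modulator-lowest samples
hi = mod > np.quantile(mod, 0.75)          # modulator-highest samples
print("MI(TF, target) | modulator low :", round(mutual_info(tf[lo], target[lo]), 2))
print("MI(TF, target) | modulator high:", round(mutual_info(tf[hi], target[hi]), 2))
# A large high-minus-low MI difference flags the modulator as a positive one.
```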
And you can look at all the different categories in their supplemental tables. One thing that's kind of interesting is that they assume that regardless of how high the transcription factor goes, you'll always see an increase in the expression of the target. So there's no saturation, which is an unnatural assumption for these data sets.

OK. So I think I'll close with this example from their experiment, and then in the next lecture we'll look at how these different methods fare against each other in the DREAM challenge. They specifically wanted to find regulators of MYC. Here's data for a particular candidate modulator, STK38. Here's the set of tumors where STK38 expression is lowest, and a set of tumors where STK38 expression is highest, each sorted by the expression level of MYC. On the left-hand side, you'll see there's no particular relationship between the expression level of MYC and its targets. On the right-hand side, there is a relationship between the expression level of MYC and the targets. So apparently -- at least at this level of mutual information -- having higher levels of STK38 causes the relationship to appear. That would be an example of an activator.

OK. So this technique has a lot of advantages. It allows you to search rapidly over very large data sets, to find potential target-transcription factor relationships, and also potential modulators. It has some limitations. The key limitation is that the signal has to be present in the expression data set. So for a protein like p53, which we know is activated by other processes such as phosphorylation, or NF-kappaB, which is also regulated post-translationally, you might not get any signal. There has to be a case where the transcription factor itself is changing in expression. It also won't work if the modulator is always highly correlated with its target, for some other biological reason.
If the modulator is always on, for its own reasons, whenever the target is, then you'll never be able to divide the data in this way.

One of the other things I think is problematic with these networks is that you get such large networks that they're very hard to interpret. In this case, this is the nearest neighbors of just one node in an ARACNe network. And this is a mutual information network of microRNA modulators that has a quarter of a million interactions. In these data sets, you often end up selecting a very, very large fraction of all the potential modulators. So of all the candidate transcription factors and modulators, it comes up with an answer that roughly 10% to 20% of them are regulating any particular gene, which seems awfully high.

OK. Any questions on the methods we've seen so far? OK. So when we come back on Thursday, we'll take a look, head to head, at how these different methods perform on both synthetic and real data sets.