The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: OK. So we've been talking about predicting the structure of proteins. At the end of the last lecture we started to talk a little bit about predicting interactions, and that's going to be the focus of today's lecture. We identified a couple of different possible prediction challenges. One was quantitative prediction of what happens when you make specific mutations in a known protein complex. We also talked about trying to predict the structure of, say, just a pair of proteins, and then trying to do that on the global scale for all known proteins.

And so last time, if you recall, we thought that initially maybe this would be a simple problem. We have proteins of known structure that form a complex, and the structure of the complex is also known. We want to predict the change in affinity when a specific mutation is made. In principle, this should be easy, because we have all those different formulations for the potential energy function. If we figure out the local structural changes due to the insertion or deletion of some side chain, then we should be able to predict the change in the potential energy, and therefore the change in the energy of the complex.

But in fact, it turned out to be very, very hard to do that. In this plot, the black circles are the prediction algorithms for this problem, compared to simply using a substitution matrix-- the BLOSUM substitution matrix-- with performance measured as the area under the curve for separating beneficial mutations from deleterious mutations. You can see that very, very few of the black dots get far away from that really simple default model. A lot of them do worse.
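To make that baseline concrete, here is a minimal sketch of scoring mutations with BLOSUM62 and computing the area under the ROC curve for separating beneficial from deleterious mutations. The mutation list and labels are invented for illustration, only the few BLOSUM62 entries needed here are hard-coded, and the AUC is computed through the rank-sum (Mann-Whitney) identity rather than with an external library.

```python
# Minimal sketch: BLOSUM62 as a baseline predictor of mutation effect,
# evaluated by area under the ROC curve (AUC). The mutation data below
# are invented; only the BLOSUM62 entries needed here are included.

BLOSUM62 = {  # a few symmetric entries from the standard matrix
    ("W", "A"): -3, ("R", "K"): 2, ("Y", "F"): 3,
    ("D", "A"): -2, ("L", "I"): 2, ("G", "P"): -2,
}

def blosum_score(wt, mut):
    return BLOSUM62.get((wt, mut), BLOSUM62.get((mut, wt), 0))

# (wild-type residue, mutant residue, 1 if the mutation improved binding)
mutations = [
    ("R", "K", 1), ("L", "I", 1), ("Y", "F", 1),
    ("W", "A", 0), ("D", "A", 0), ("G", "P", 0),
]

def auc(scored):
    """AUC = P(score of a random positive > score of a random negative)."""
    pos = [s for s, y in scored if y == 1]
    neg = [s for s, y in scored if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scored = [(blosum_score(wt, mut), label) for wt, mut, label in mutations]
print("BLOSUM baseline AUC:", auc(scored))  # 1.0 on this toy data
```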
So OK, maybe that's not such a simple problem, because it requires a highly quantitative prediction. Maybe we'll do better just trying to predict which proteins interact at all. And that's going to be the focus of today's lecture.

Now, that also has a problem, right? Because even if I know the structures of two proteins, I don't necessarily know which surfaces of those proteins interact. So I have to solve the docking problem of which part of protein A interacts with which part of protein B. That's just the beginning of my problem, and then I have to make a series of subsequent decisions. For any potential partner of my protein, I need to solve the docking problem-- the relative position and orientation. Now, in this little cartoon it's shown as a completely static protein approaching another static protein, where the only thing changing is the relative coordinates. But of course there will be local changes in conformation, perhaps even global ones. So we need to be able to make some estimate of what those structural rearrangements will be when the two proteins interact. And only after we've come up with our best estimate of the structural rearrangements can we estimate the interaction energy and decide whether it's better than some threshold.

OK. So one of the problems that's pretty obvious from this is that this kind of approach, if we do it rigorously through all the steps, would be extremely slow. Another problem that's perhaps a little bit less obvious is that it's going to be very prone to false positives. And why do you think that might be? What am I not taking into account here?

AUDIENCE: Are you not taking into account the desolvation [INAUDIBLE]?

PROFESSOR: So one answer is that I'm not taking account of the desolvation, but in fact, I can do that. Right?
Some of the potential energy functions we looked at-- the statistician's version rather than the physicist's-- make it pretty easy to incorporate the desolvation. Any other thoughts as to what I'm not taking into account? What other proteins should I be considering when I'm considering an interaction problem?

I've isolated, in this case, two proteins. I'm asking, in a universe where these are the only two proteins that exist, will they have a favorable interaction energy? What I really need to know is whether that interaction energy is more favorable than all the competing interactions they could have. So even if I find something that's potentially a good interaction, it may not be the best possible interaction. And if I then consider the concentration of this protein and the concentrations of all the other molecules out there that have a higher affinity, it could turn out that this is actually a rather poor interaction partner for my protein. So we have that false positive problem.

OK. But let's focus on the computational efficiency problem, because that's at least one where we can come up with some nice algorithms. What we want to do is limit our search space. If I have a query protein and I want to ask what it interacts with, instead of doing the pairwise comparison of this protein against every other protein in the database, and doing very precise structural calculations on all of those, maybe there's some way I can prefilter the set of proteins it might interact with. And that's what we're going to look at. So we're going to try to efficiently choose potential partners before we do any structural comparison. And then, once we have those partners, we're going to avoid doing detailed calculations until we have a relatively high degree of confidence, by other criteria, that these proteins could interact.
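As a rough illustration of that strategy, here is a minimal sketch of a screening cascade: cheap checks run over the whole database, and the expensive structural calculation is invoked only for candidates that survive them. The function names (quick_check, interface_check, refine_and_score) and the score cutoff are hypothetical placeholders, not part of either paper's actual implementation; the point is only the control flow.

```python
# Minimal sketch of a prefiltering cascade for interaction prediction.
# The check functions and threshold are hypothetical placeholders; the
# point is the control flow: cheap filters first, expensive structural
# refinement last, and only for the few survivors.

def predict_partners(query, database,
                     quick_check, interface_check, refine_and_score,
                     score_cutoff=-5.0):
    candidates = []
    for target in database:
        if not quick_check(query, target):        # e.g. sequence/fold class
            continue                              # reject cheaply
        if not interface_check(query, target):    # e.g. interface match,
            continue                              # conservation, hotspots
        candidates.append(target)

    predictions = []
    for target in candidates:                     # expensive step, few calls
        energy = refine_and_score(query, target)  # flexible refinement
        if energy < score_cutoff:                 # keep favorable complexes
            predictions.append((target, energy))
    return sorted(predictions, key=lambda te: te[1])
```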
We're going to look at two papers that describe algorithms for solving this problem; they're both uploaded to the website. The first one is called PRISM, and it actually uses structural calculations. Then we'll look at PrePPI, which handles everything without ever explicitly calculating the structures.

OK. So what does PRISM do? Well, it's based on the notion that there are a limited number of architectures through which proteins can interact. If we can identify those architectures, then we can try to figure out whether a protein is a potential partner of another one before we do the detailed, costly calculations. In addition, within those architectures not all amino acids are equal; some contribute more to the energy than others. By identifying those critical residues, we can once again focus our computational effort on the complexes that are most likely to be important.

So it has these two components. First, a rigid-body structural comparison: the two proteins do not change their own coordinates, they're just brought together in different relative orientations. Then, once the proteins have passed a series of checks, we allow flexible refinement, using the kinds of energies we looked at in the previous lectures, to decide how high an affinity this complex could have. The critical thing is that we make some of these early decisions, after the rigid-body comparison, using structural similarity, evolutionary conservation, and particularly these regions called hotspots. These are sites where most of the free energy of interaction at an interface comes from; it's not, as I said, uniformly distributed.

So I showed you this slide last time. It shows chymotrypsin in light gray and its interaction with some protein partners.
These two partners share some global similarity with each other, whereas this third partner is quite different from either of them globally. But you can see that at the interface, it's actually quite similar. And so this gives you hope that even if you can't find a direct homologue-- say you were trying to figure out what this protein in yellow interacts with, you searched the database, and you couldn't find any structural homologue of the whole protein-- if you could look for homologues of just the regions that form the interface, you might be able to figure out that it interacts with the same protein as these other two.

OK. So what about this idea of hotspots? This was an idea first developed in 1995 in this paper by Clackson and Wells, where they were looking at the interaction of a cell surface receptor with its ligand. They did systematic mutagenesis across the surface of the interface to see, when any single amino acid is mutated to alanine, how much it affects the energy of interaction. What they found was highly non-uniform. This lower curve shows the change in free energy when you mutate particular individual amino acids to alanine. You can see there are big losses of free energy at some positions, while at others there's almost no change in the free energy of binding. In a few places you actually gain binding energy by mutating a side chain to alanine.

So in this particular case, and it has held up over many, many cases since, the free energy of binding is not uniform across the surface but is concentrated in what have been called hotspots. Here is a structure of human growth hormone and its receptor. In red are the few amino acids that contribute very large amounts-- more than one and a half kcal per mole-- to the energy of interaction. And it doesn't correspond to any simple structural parameter.
It's not the amino acids with the biggest surface area, for example, or anything like that. So it's not trivial to figure out what these regions are, although there are some prediction algorithms. These studies, and subsequent ones, have indicated that roughly 10% of the amino acids at the interface make the biggest contribution. There are some trends, but none of them are hard rules. Hotspots tend to be rich in three amino acids: tryptophan, arginine, and tyrosine. As you might imagine, these are regions of the protein that are highly complementary, so there will be a patch on one protein that's a hotspot matching up with a patch on the other protein that's also a hotspot. And it's an interesting note that around the regions where the hotspots occur, there are other amino acids that exclude solvent from the interface; they call that an O-ring. So these are some of the features that tend to occur at protein interfaces.

So in the PRISM algorithm, they do the following. They start off with a template-- two proteins that are known to interact-- and they define the interface simply by close approach of amino acids in one chain to amino acids in the other. In this case, the regions of the proteins that interact are shown as balls. Then they isolate the interfacial residues and ignore the rest of the protein, because, as we said, the parts that interact could be homologous in different proteins even if the global structures are not, right? So the structural similarity calculations are done purely on the interface residues and not on the entire structure.

Then, with that template, you can look at lots of proteins and see whether they have any structural match to the pieces that interact. Here they've identified this protein, ASPP2, which has structural homology to I kappa B at the interface, although globally it's quite different.
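As a concrete illustration of that interface definition, here is a minimal sketch that marks a residue as interfacial if any of its atoms comes within a distance cutoff of an atom in the other chain. The data layout (residues as lists of 3D atom coordinates), the function name, and the 5 angstrom cutoff are assumptions for the example, not PRISM's exact parameters.

```python
import math

# Minimal sketch: define interfacial residues by close approach between
# chains. Each chain is a dict {residue_id: [(x, y, z), ...atom coords]}.
# The 5.0 angstrom cutoff is an illustrative choice, not PRISM's value.

def _min_dist(atoms_a, atoms_b):
    return min(math.dist(a, b) for a in atoms_a for b in atoms_b)

def interface_residues(chain_a, chain_b, cutoff=5.0):
    """Return the residue ids of chain_a and chain_b that face each other."""
    iface_a, iface_b = set(), set()
    for ra, atoms_a in chain_a.items():
        for rb, atoms_b in chain_b.items():
            if _min_dist(atoms_a, atoms_b) <= cutoff:
                iface_a.add(ra)
                iface_b.add(rb)
    return iface_a, iface_b

# Toy example with made-up coordinates:
chain_a = {("A", 10): [(0.0, 0.0, 0.0)], ("A", 11): [(30.0, 0.0, 0.0)]}
chain_b = {("B", 55): [(3.0, 0.0, 0.0)], ("B", 56): [(60.0, 0.0, 0.0)]}
print(interface_residues(chain_a, chain_b))  # ({('A', 10)}, {('B', 55)})
```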
Now, once they have this potential partner for NF kappa B, this ASPP2, they test whether there's a good structural match, specifically whether the match is good in the regions predicted to be hotspots-- they have an algorithm for predicting hotspots-- and whether there is sequence conservation at those hotspots. Only then do they do the flexible refinement of the type we looked at in the previous lecture-- energy minimization and other approaches-- to figure out the best possible structure of this complex and what its free energy would be.

So here's their description of the procedure. They have template proteins and targets. They do a structural alignment and ask whether it passes some thresholds; these are very, very fast calculations. Only if a pair passes these fast checks do you do more detailed calculations, and finally, only if it passes those do you do the very computationally expensive refinement.

One critical thing to remember about this algorithm is that it doesn't require the template and the query to be perfectly matched in structure. In fact, the elements of structure at the interface could come from different parts of the chain; they don't take the chain order into account. So if I had a beta sheet structure in one protein that looks like this, in my query those two strands could be very indirectly connected. I don't care that there's a huge gap or insertion; I just care that locally, at the interface, one protein looks a lot like the other.

There was a question in the back.

AUDIENCE: How do you search a database for 3D structures? Are you just looking at all the [INAUDIBLE]?

PROFESSOR: That's right. So the question was, how do you search a database for 3D structure? You do structural similarity comparisons that are based on the 3D coordinates. The simplest way to do it, but not the most efficient, is to find the rigid-body superposition that minimizes the root mean squared deviation, which was a metric we gave in one of the previous lectures.
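For reference, here is a minimal sketch of that superposition-plus-RMSD calculation using the standard Kabsch algorithm with NumPy. It assumes the two structures come as equal-length arrays of already-corresponding coordinates, which is the pre-aligned case rather than a full database search; the function name is ours.

```python
import numpy as np

# Minimal sketch: optimal rigid-body superposition (Kabsch algorithm)
# followed by RMSD. Assumes X and Y are (N, 3) coordinate arrays whose
# rows already correspond to each other.

def superpose_rmsd(X, Y):
    Xc = X - X.mean(axis=0)              # center both coordinate sets
    Yc = Y - Y.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc.T @ Yc)  # SVD of the covariance matrix
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])           # avoid an improper rotation
    R = Vt.T @ D @ U.T                   # optimal rotation of Xc onto Yc
    diff = Xc @ R.T - Yc
    return np.sqrt((diff ** 2).sum() / len(X))

# Toy example: Y is X rotated 90 degrees about z, so the RMSD is ~0.
X = np.array([[1.0, 0, 0], [0, 1.0, 0], [0, 0, 1.0], [1.0, 1.0, 1.0]])
Rz = np.array([[0, -1.0, 0], [1.0, 0, 0], [0, 0, 1.0]])
Y = X @ Rz.T
print(round(superpose_rmsd(X, Y), 6))  # 0.0
```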
There are faster things you can do as well. You could imagine looking at certain global features of the elements of secondary structure, and so on, and there's been a lot of work making those algorithms very fast. Other questions? Good question.

So they give an example in their paper: starting from this known complex of a cyclin-dependent kinase, the cyclin, and p27, the inhibitor, and then looking for structural matches. They identify this potential structural match, refine it, and get an interaction energy. They try another one that has no global structural similarity; again, once it passes all the checks, you compute the refinement and the energy. And similarly on this side. So from this initial complex, where we had two proteins known to interact in the PDB, they can make predictions that these other proteins are likely to interact, even though, again, at the global level there's very little similarity. Is that clear?

OK. So the advantage of this approach is that it eventually does do the structural refinements that allow us to find the best match between two potentially interacting proteins. But that's also its weakness, because it takes a lot of computational time. This other approach, called PrePPI, never actually does the structural refinements of the type we talked about in the previous lecture. So how does it figure out whether two proteins are likely to interact?

This is their schematic, and we'll go through the steps. You start off with two query proteins that you want to know whether they interact. You do a sequence similarity search against a database of known structures, so you find sequence homologues of those proteins with known structure.
They call those homology models, MA and MB. Now they look through the database for all the structural homologues-- not sequence homologues, but structural homologues-- of MA and MB. So they get a series of structural neighbors that they call NA1 through NAn and NB1 through NBn. These are the neighbors of those homology models. And they ask whether any neighbor in the first set and any neighbor in the second set are known to interact. That known interaction can then serve as a model for the interaction of the queries, right? So far so good.

Then they do a sequence alignment of MA and MB, which are the known structures found by sequence similarity to the queries, against the two proteins that are known to interact. So now they've got a potential model for the interaction of the queries, built from two proteins of known structure whose structural neighbors are known to interact. OK? So it's two steps removed from the actual interaction.

Now, while their figure says they do a structural superposition, that's not in fact what they do. If you look at it carefully, it's a sequence analysis, and I'll take you through the steps in a second. So they mean "structural" in a rather loose way. They're only doing sequence comparisons here; they never actually build a homology model for the queries.

OK. So this figure comes from the supplement, where, for some mysterious reason, they've changed all the nomenclature. Things that were previously called NA and NB are now called TA and TB. Take what you get.

So this is a pair of interacting proteins where the structure of the complex is known-- these are the structural neighbors of MA and MB, which you don't know whether they interact or not. They identify the interacting residues in this structure; that's what's represented by the black lines connecting the blue dots.
So these are the interacting residues from the two template proteins, the neighbors NA and NB. And they ask whether the amino acids in MA and MB are also good matches for this interface, and they have a number of criteria for doing that.

They come up with five measures. The first is the structural similarity between MA and MB and their neighbors NA and NB. Then they ask how many, and what fraction, of the amino acids at this interface can be aligned. This is a sequence-based alignment of MA against the template-- here called TA, though it was previously called NA, just to make life complicated. The interacting residues are all the blue ones in the structure of the TA-TB complex, and they ask what number and what fraction of those amino acids are aligned in the sequence alignment. In this case, I guess, it's four pairs of amino acids-- one, two, three, and four, indicated by these four lines-- that are both interacting in the structure of the complex and can be aligned to sequences in MA and MB.

And then they use other algorithms, based primarily on machine learning on protein interfaces, to decide whether the amino acids that would sit at those positions in the interface are residues that typically occur at interfaces. This is the kind of statistics I showed you before from those older papers-- that roughly 10% of the amino acids are in hotspots, and that certain kinds of amino acids predominate there. So there are a number of algorithms, and they list a bunch, that they use to come up with a score for whether these residues are, in fact, statistically likely to be good matches.
So they have these criteria, and they decide that some fraction of the amino acids at this interface in MA and MB are likely to be reasonable ones to sit at the interface. With all that done, they then combine all of these different scores with a Bayesian classifier-- we'll talk a little bit later in this lecture, and probably in the next one as well, about what a Bayesian classifier is. They plug in all the scores derived from these proteins to decide whether the two query proteins are likely to interact.

The advantage of this approach is that it's extremely fast. Everything we've talked about involves very quick calculations; even the structural alignments are fast, and the sequence alignments certainly are. So you can get through the whole database very quickly. They've actually computed the potential interaction partners for every pair of proteins in various genomes based solely on these alignments.

The disadvantage-- so what's the disadvantage of this method?

AUDIENCE: You can't get a de novo interaction?

PROFESSOR: We can't get any de novo interaction. If there are no neighboring structures that interact, it will never come up with a prediction. So that's an important point. And the other problem is that because it doesn't do the structural refinement-- it has given up on that slow calculation-- it also loses a lot of potential specificity. All the conformational changes that can occur will be invisible to an algorithm like this.

So we have these two competing approaches. Yes, question in the back.

AUDIENCE: Couldn't this method actually be used as an input to, say, a refinement step, for example?

PROFESSOR: The question was, could you use this kind of approach as an input to the refinement step? And absolutely one could. Is there another question back there? Other questions? All right.
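To give a flavor of that final step, here is a minimal sketch of a naive Bayes combination of several independent evidence scores into a posterior probability of interaction. The feature names, likelihood ratios, and prior are invented for illustration; PrePPI's actual classifier is trained on its own features and reference sets, so this is only the general idea of multiplying independent evidence onto a prior.

```python
# Minimal sketch of combining independent evidence with a naive Bayes
# classifier. The likelihood ratios and prior below are invented for
# illustration; a real classifier would learn them from training data.

def posterior_interaction(likelihood_ratios, prior=0.001):
    """Combine per-feature likelihood ratios P(score|interact)/P(score|not)
    under a naive (independence) assumption, starting from a prior."""
    odds = prior / (1.0 - prior)
    for lr in likelihood_ratios:
        odds *= lr                      # independence: odds multiply
    return odds / (1.0 + odds)          # convert odds back to probability

# Hypothetical evidence for one candidate pair: structural similarity,
# fraction of interface residues aligned, and interface residue propensity.
evidence = {"structural_similarity": 40.0,
            "interface_alignment":   25.0,
            "residue_propensity":     8.0}

p = posterior_interaction(evidence.values())
print(f"P(interact | evidence) = {p:.3f}")   # ~0.889 with these numbers
```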
So we're going to take a slight turn here in the lecture, move away from purely computational approaches, and look at how interaction measurements are actually made. One of the big changes of the last decade or so is that we've gone from an era when interactions were measured pairwise to interactions being measured in bulk, through high-throughput measurements. And we'll see that this leads to some statistical problems which eventually bring us back to computational issues as well.

If you want to measure all the proteins that interact in an organism, that turns out, obviously, to be very difficult. One big advance that has helped is the idea of tagging proteins and using mass spectrometry to figure out what they interact with. In these two sets of papers, which were some of the early ones done in yeast, they took one protein at a time and attached a tag to it. I'll talk about exactly what those tags are, but they are labels that allow you to attach the protein to a solid support. By attaching it to a solid support, you can then purify any proteins that stuck to protein one here. After you purify them, you run them out on a gel, cut the bands out, and determine the identity of the interacting proteins by mass spec. This sounds very labor intensive, but it's still a lot faster than anything that came before it. And with this approach, they were able to go through entire genomes-- proteomes, I should say-- and find interacting partners for very large fractions of all the proteins there.

So with this approach, what kinds of proteins do you think are likely to be false positives? Any thoughts? Yes.

AUDIENCE: Proteins stuck on the column that have nothing to do with the interaction [INAUDIBLE].

PROFESSOR: Exactly. So one thing that can be quite problematic is proteins that stick to the column regardless of which protein you put there.
And we'll see an approach to getting rid of that. Other kinds of problems? A variant of that? Thoughts?

What about proteins that tend to stick to other proteins non-specifically, right? Those are going to be quite problematic too. And what are the likely false negatives in an approach like this-- the proteins that really do interact with the blue one but aren't picked up? Yes.

AUDIENCE: Weak interaction partners [INAUDIBLE].

PROFESSOR: Weak interaction partners-- particularly complexes with short half-lives. Because you do a lot of washing, so detection is going to depend on the half-life. Very good. What else? Yeah.

AUDIENCE: Maybe something that interacts in the tag region?

PROFESSOR: Something that interacts in the tag region, right. Something that interacts right around here would be lost, because the tag would sterically interfere. Very good. Anything else? What about the concentration of proteins-- how does that influence whether they show up here?

All right. So if I have a protein at very high concentration, it may interact even though naturally it doesn't-- they never see each other, they're in different compartments-- but when you [INAUDIBLE] and do this, it shows up. Low abundance proteins, on the other hand, are going to be quite problematic, because there will be very little of them in these complexes compared to the high abundance proteins. They won't be detected by this method; they'll never make it to the mass spec, and so on. So we've got both false positives and false negatives in these approaches.

Now, one of the things that came up was proteins that stick non-specifically to the column. There was a clever approach in one of these early papers, which got widely adopted, to avoid that. It's called tandem affinity purification, or TAP-tagging. The idea is the following. We have some gene.
We use homologous recombination-- this was done in yeast, where that's easy-- to insert a sequence that codes for the following: a piece of protein with no known function that acts as a spacer, followed by this calmodulin-binding peptide, followed by a protease recognition site, and then protein A. Once this protein gets expressed-- and it gets expressed at its native levels, because you're inserting this into the genome; it's not on an exogenous promoter, it's in its normal position-- whatever that protein was, it now has all these pieces at its C terminus.

So how does that help? In the purification, we start with IgG, which binds protein A. So that's what attaches us to the solid support. And attached to the solid support will also be all those things that are nonspecific binders. If I have some nonspecific binder that just likes my solid support, it will be there too. If I just acid-washed everything off the column and ran my gels with that, or boiled it off in SDS, I would get the nonspecific proteins as well. But what they do instead is cleave here with a very specific protease that recognizes this site. It's called the tobacco etch virus (TEV) protease, and it has a very long recognition sequence, so you can make sure it doesn't cut anywhere in any other protein. So now, instead of eluting non-specifically with acid or detergent, you elute specifically with TEV, and this part of the protein falls off.

Then you do a second purification that relies on this other piece of the protein: you pull out only the things you want-- the ones carrying the CBP, the calmodulin-binding peptide-- using a different kind of solid support that has calmodulin attached to it. And so through this process you can get rid of a lot of nonspecific binders. It doesn't help you with the false negatives, right?
You've made the wash conditions even harsher, so you're going to lose more proteins, but you'll pick up fewer false positives. And finally, the last elution step actually uses EGTA, which is a chelating agent. The interaction between CBP and calmodulin depends on calcium, and EGTA pulls the calcium out of that interaction. So it's again a very specific way of eluting, rather than a nonspecific one like heat, salt, acid, or detergent.

So that has been one technology-- affinity purification followed by mass spec-- that's given us a lot of information about protein-protein interactions. A competing technology that has also contributed quite a lot is the yeast two-hybrid. In this approach, you have a reporter gene that normally is not transcribed. Upstream of it there is a designed DNA binding site, a DNA-binding protein, and your bait protein fused to it. You want to find every protein that can interact with this bait. The prey is attached to an activation domain. If the two proteins don't interact, the activation domain never gets recruited to the reporter, and there's no transcription. But if the green protein and the blue protein do interact, then the activation domain gets recruited to the promoter, transcription is turned on, and you get a signal.

So what are some of the advantages of this approach? It doesn't require you to purify anything, so it should be much more sensitive to low abundance proteins. That's definitely an advantage. It will also pick up a lot of transient interactions; you may not get continuous activation, but you'll get transient activation, and if you've set the conditions up properly, you can detect it. But it has its own biases-- none of these techniques is going to be perfect. It's going to be biased against proteins that don't express well. This is, as the name implies, typically done in yeast.
So if you have human proteins that you express in yeast, or plant proteins that you express in yeast, there can be some proteins that just will not express well in that organism. What else can be a problem? Some proteins don't do well in the nucleus, right? So if you're interested in interactions involving membrane proteins, it's going to be very hard to get them to the nucleus, and therefore you'll never pick up those interactions.

OK. So we've got these two different technologies-- affinity capture mass spec and the two-hybrid. Questions on those technologies? Yes.

AUDIENCE: Could another control for the mass spec purification be just to subtract out everything that elutes non-specifically?

PROFESSOR: The question was, could you subtract out anything that's nonspecific. And yes, if you've got what you might call frequent flyers-- proteins that show up in every single purification-- then you can simply ignore them, and that is often done. That will help you with things that bind the support very non-specifically. What's more of a problem are proteins that have some affinity for your protein X but are not really highly specific for it; they tend to bind to certain kinds of patches. Those are harder to figure out, because they won't stick to everything. Good question. Other questions?

All right. So we've got these different technologies, and we know there are problems with each approach. What we'd really like to be able to do is compute the probability that two proteins interact, based on the data. So now we're turning back to more mathematical, computational approaches.

Let's just consider one experiment first, and let's talk about a gold standard. What's a gold standard? It's a set of protein pairs that we have extremely high confidence interact, because they were analyzed by some other technology.
Not two-hybrid, not affinity capture mass spec, but much more direct measurements-- physical measurements, maybe structural work. There are a number of criteria that go into it. So we have this gold standard data set, where we know the proteins definitely interact, and we have our experiment. Clearly, anything in the overlap we can count as true positives, right? We detected it, and it's in the database of gold standards. And things that are in the gold standard that we missed are obviously false negatives: we report them as non-interacting, but in fact they do interact.

The question is, how much of the rest is true positive-- everything that's detected in the experiment but for which we have no information in the database? That could be for one of two reasons, right? It could be that they really don't interact, or it could be that no one has measured it; the whole point of the experiment is to find new things. So is there any way to estimate what fraction of the things unique to this experiment are true positives, and what fraction are false positives? That's what we'd like to figure out.

Now, if we just had one experiment, that would be very challenging. But what happens when we have two experiments? Say we have two affinity capture mass spec experiments, or maybe an affinity capture mass spec experiment and a two-hybrid. Now let's think about the overlap of those two experiments with the gold standard. I've got this region of overlap between experiment 1 and experiment 2, and then this region that overlaps among all three-- experiment 1, experiment 2, and the gold standard. Those are clearly true positives, right? They're high confidence, because I picked them up in both experiments and they're in the gold standard. What about all the things in what I've labeled here region II?
773 00:32:38,920 --> 00:32:41,670 Well, if we believe that these two experiments are 774 00:32:41,670 --> 00:32:45,540 independent of each other in a rigorous way-- 775 00:32:45,540 --> 00:32:47,300 so let's say one's a two-hybrid and one's 776 00:32:47,300 --> 00:32:49,807 an affinity capture mass spec, there's no particular reason 777 00:32:49,807 --> 00:32:51,390 that the false positives for one would 778 00:32:51,390 --> 00:32:54,200 be false positives in the other. 779 00:32:54,200 --> 00:32:56,890 In that case, I can call this region 2 780 00:32:56,890 --> 00:32:58,185 my consensus true positives. 781 00:32:58,185 --> 00:33:00,820 I have a very high confidence that these 782 00:33:00,820 --> 00:33:02,660 are true interactors. 783 00:33:02,660 --> 00:33:05,180 Everyone buy that? 784 00:33:05,180 --> 00:33:06,380 Seem reasonable? 785 00:33:06,380 --> 00:33:06,880 OK. 786 00:33:06,880 --> 00:33:11,240 So here's where the trick comes in. 787 00:33:11,240 --> 00:33:14,040 What fraction of all these consensus true positives 788 00:33:14,040 --> 00:33:16,930 are picked up in the gold standard? 789 00:33:16,930 --> 00:33:19,210 This ratio, right? 790 00:33:19,210 --> 00:33:21,430 Region 1 over region 2. 791 00:33:21,430 --> 00:33:21,960 OK. 792 00:33:21,960 --> 00:33:26,380 So now I've got this region of things that are picked up-- 793 00:33:26,380 --> 00:33:28,870 the true positives from this experiment, then 794 00:33:28,870 --> 00:33:31,130 the gold standard. 795 00:33:31,130 --> 00:33:33,740 And then I've got this region that's unique to experiment 2 796 00:33:33,740 --> 00:33:35,698 and it's going to be some mix of true positives 797 00:33:35,698 --> 00:33:36,700 and false positives. 798 00:33:36,700 --> 00:33:41,110 And the authors of this paper that are cited here 799 00:33:41,110 --> 00:33:42,500 make the following argument. 800 00:33:42,500 --> 00:33:49,300 We're going to assume that the ratio of I to II 801 00:33:49,300 --> 00:33:51,420 is the same as the ratio of III to IV. 802 00:33:56,030 --> 00:33:59,744 So the fraction of consensus true positives 803 00:33:59,744 --> 00:34:01,910 that are picked-- these are independent experiments. 804 00:34:01,910 --> 00:34:03,730 So the fraction of true positives 805 00:34:03,730 --> 00:34:05,480 that are picked up in the gold standard 806 00:34:05,480 --> 00:34:07,250 is going to be constant, whether they're in the consensus 807 00:34:07,250 --> 00:34:07,890 or not. 808 00:34:07,890 --> 00:34:09,636 So the fraction at ratio of I to II 809 00:34:09,636 --> 00:34:11,719 is going to be the same as the ratio of III to IV. 810 00:34:11,719 --> 00:34:15,010 So by that then, I can figure out how much of this region 811 00:34:15,010 --> 00:34:16,840 consists of true positives and how much 812 00:34:16,840 --> 00:34:18,590 consists of false positives. 813 00:34:18,590 --> 00:34:21,840 Everyone buy that? 814 00:34:21,840 --> 00:34:23,053 Yeah. 815 00:34:23,053 --> 00:34:25,490 AUDIENCE: Can I check-- are we not saying 816 00:34:25,490 --> 00:34:30,729 that the gold standard represents all true positives? 817 00:34:30,729 --> 00:34:31,520 PROFESSOR: Correct. 818 00:34:31,520 --> 00:34:35,670 Well, we're saying that the gold standard consists of things 819 00:34:35,670 --> 00:34:37,412 that we know to interact-- 820 00:34:37,412 --> 00:34:38,830 AUDIENCE: But there may be more. 821 00:34:38,830 --> 00:34:40,060 PROFESSOR: But there may be more. 822 00:34:40,060 --> 00:34:42,518 And the goal of our experiment is to find those other ones. 
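To make the bookkeeping concrete, here is a minimal sketch in Python of the estimate that this argument gives. It assumes, as above, that true positives land in the gold standard at the same rate whether or not they are in the consensus; the region counts below are invented purely to show the arithmetic, not taken from the actual study.

```python
# Regions of the Venn diagram (counts are hypothetical, for illustration only):
#   I   consensus pairs (both experiments) that are also in the gold standard
#   II  consensus pairs that are not in the gold standard
#   III pairs unique to this experiment that are in the gold standard
#   IV  pairs unique to this experiment with no gold-standard annotation
#       (an unknown mix of true and false positives)

def split_unannotated(n_I, n_II, n_III, n_IV):
    """Estimate the true/false split of region IV, assuming
    I / II equals III / (true part of IV)."""
    est_true = min(n_IV, n_III * n_II / n_I)  # rearrange the ratio assumption
    est_false = n_IV - est_true
    return est_true, est_false

# Hypothetical counts, roughly the scale discussed in the lecture:
print(split_unannotated(n_I=300, n_II=1500, n_III=200, n_IV=16000))
# -> (1000.0, 15000.0): most of the unannotated region is estimated to be false.
```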
823 00:34:45,290 --> 00:34:45,790 All right. 824 00:34:45,790 --> 00:34:48,940 So if you accept that premise, which seems plausible, 825 00:34:48,940 --> 00:34:51,489 then you can compute what fraction of all the things 826 00:34:51,489 --> 00:34:53,530 that are picked up in each of these experiments 827 00:34:53,530 --> 00:34:56,889 are likely to be true positives. 828 00:34:56,889 --> 00:34:58,080 So drum roll please. 829 00:34:58,080 --> 00:34:59,987 It turns out that the number's not that high. 830 00:35:03,120 --> 00:35:06,520 So the fraction of things in the consensus 831 00:35:06,520 --> 00:35:09,459 was 347 out of almost 2000. 832 00:35:09,459 --> 00:35:11,500 And if you do the math then, what you end up with 833 00:35:11,500 --> 00:35:15,240 is that the true fraction in this region, 834 00:35:15,240 --> 00:35:20,040 for which we have no data, is 1,123 out of-- 835 00:35:20,040 --> 00:35:25,606 and the false piece in this is going to be almost 15,000. 836 00:35:25,606 --> 00:35:27,480 And they went ahead and did this for a number 837 00:35:27,480 --> 00:35:29,670 of different experiments and computed 838 00:35:29,670 --> 00:35:34,230 the fraction of derived false positives for these data-- 839 00:35:34,230 --> 00:35:36,320 might be a little bit hard to see on this screen. 840 00:35:36,320 --> 00:35:40,120 But the numbers range from 50% false positives 841 00:35:40,120 --> 00:35:45,410 to, in some cases, over 90% false positives. 842 00:35:45,410 --> 00:35:47,770 That's a little disturbing, right? 843 00:35:47,770 --> 00:35:51,224 So these technologies are good at picking up interactions, 844 00:35:51,224 --> 00:35:52,890 but there's reason to be very skeptical. 845 00:35:55,290 --> 00:35:55,790 OK. 846 00:35:55,790 --> 00:35:57,960 So now we've got a serious problem, 847 00:35:57,960 --> 00:35:59,670 because how are we going to figure out 848 00:35:59,670 --> 00:36:01,890 which of these interactions to trust when we know 849 00:36:01,890 --> 00:36:07,340 that a very, very large fraction of them are false positives? 850 00:36:07,340 --> 00:36:08,270 So what could you do? 851 00:36:08,270 --> 00:36:11,570 Well, you could take only the little bit of overlap. 852 00:36:11,570 --> 00:36:17,061 You could say, I have that Venn diagram-- method 1, method 2. 853 00:36:17,061 --> 00:36:18,560 They did agree on a bunch of things. 854 00:36:18,560 --> 00:36:21,330 So I could take only those. 855 00:36:21,330 --> 00:36:22,886 That obviously throws away a lot. 856 00:36:22,886 --> 00:36:25,510 Someone else suggested we could throw away the sticky proteins, 857 00:36:25,510 --> 00:36:26,010 right? 858 00:36:26,010 --> 00:36:27,980 So maybe there are nonspecific proteins 859 00:36:27,980 --> 00:36:29,604 that don't show up in every experiment, 860 00:36:29,604 --> 00:36:31,819 but they show up in a very, very large fraction 861 00:36:31,819 --> 00:36:32,610 of all experiments. 862 00:36:32,610 --> 00:36:34,290 Maybe I toss those out. 863 00:36:34,290 --> 00:36:36,542 That's another possibility. 864 00:36:36,542 --> 00:36:38,250 But what we really want to do is actually 865 00:36:38,250 --> 00:36:40,460 come up with a probability estimate. 866 00:36:40,460 --> 00:36:41,960 To not have to make a hard decision, 867 00:36:41,960 --> 00:36:43,918 but come up with an estimate of the probability 868 00:36:43,918 --> 00:36:45,790 that things interact based on all the data. 869 00:36:45,790 --> 00:36:49,117 So how do we go about doing that? 
870 00:36:49,117 --> 00:36:51,700 So first of all, what happens if you just require a consensus?
871 00:36:51,700 --> 00:36:55,840 So this plot shows accuracy and coverage
872 00:36:55,840 --> 00:36:59,990 of the gold standard for individual experiments
873 00:36:59,990 --> 00:37:05,750 with different thresholds for deciding what's interacting,
874 00:37:05,750 --> 00:37:07,650 different cutoffs and things.
875 00:37:07,650 --> 00:37:10,360 So the individual experiments are shown here.
876 00:37:10,360 --> 00:37:12,400 And then if you require two methods
877 00:37:12,400 --> 00:37:14,990 to pick something up, or three methods to pick something up,
878 00:37:14,990 --> 00:37:17,010 you can get better and better in your accuracy.
879 00:37:17,010 --> 00:37:18,350 This is a log-log plot.
880 00:37:18,350 --> 00:37:20,710 So if you require three methods to agree
881 00:37:20,710 --> 00:37:22,750 before you call something a true positive,
882 00:37:22,750 --> 00:37:25,000 you can get up to-- I'm not sure exactly what this is,
883 00:37:25,000 --> 00:37:26,790 but 80%, 90% possibly.
884 00:37:26,790 --> 00:37:27,560 Right?
885 00:37:27,560 --> 00:37:29,590 But look at where you are on the y-axis.
886 00:37:29,590 --> 00:37:31,760 You'd only get less than 1%
887 00:37:31,760 --> 00:37:33,750 coverage of the gold standard.
888 00:37:33,750 --> 00:37:35,404 So that's not a great approach.
889 00:37:35,404 --> 00:37:37,070 So what we really want to do, as I said,
890 00:37:37,070 --> 00:37:39,780 is to try to estimate the probability that proteins
891 00:37:39,780 --> 00:37:43,610 interact given all of our available data.
892 00:37:43,610 --> 00:37:47,700 And the data could be specific experiments.
893 00:37:47,700 --> 00:37:49,700 Say the two different mass spec experiments
894 00:37:49,700 --> 00:37:50,930 we just referred to.
895 00:37:50,930 --> 00:37:52,630 Or as we'll see a little bit later
896 00:37:52,630 --> 00:37:55,230 in this lecture and possibly the next one, other kinds
897 00:37:55,230 --> 00:37:58,090 of external data that are not direct physical measurements
898 00:37:58,090 --> 00:38:00,690 of interaction, but might give us confidence
899 00:38:00,690 --> 00:38:03,257 that things interact based on similarity in annotation,
900 00:38:03,257 --> 00:38:05,090 or similarity in gene expression, and so on.
901 00:38:05,090 --> 00:38:07,040 And we'll get into details of that.
902 00:38:07,040 --> 00:38:07,540 OK.
903 00:38:07,540 --> 00:38:09,350 So to do this, we need to have a little bit
904 00:38:09,350 --> 00:38:11,365 of a refresher on Bayesian statistics.
905 00:38:14,780 --> 00:38:16,640 So I want to measure the probability
906 00:38:16,640 --> 00:38:21,340 that an interaction is true given the available data.
907 00:38:21,340 --> 00:38:22,410 Right?
908 00:38:22,410 --> 00:38:26,130 And I can estimate that based on the probability of observing
909 00:38:26,130 --> 00:38:30,090 the data for things that I know to be true
910 00:38:30,090 --> 00:38:31,330 and these prior estimates.
911 00:38:31,330 --> 00:38:34,960 So what's the prior probability that an interaction is true
912 00:38:34,960 --> 00:38:37,710 and the prior probability of observing a particular data
913 00:38:37,710 --> 00:38:38,340 set.
914 00:38:38,340 --> 00:38:40,427 Now, this by itself isn't really that helpful.
915 00:38:40,427 --> 00:38:42,760 I haven't told you yet how to calculate any of the terms
916 00:38:42,760 --> 00:38:43,890 on the right.
917 00:38:43,890 --> 00:38:45,340 But bear with me.
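For reference, the relationship just described, written in symbols with D standing for all of the observed data, is Bayes' rule:

$$ P(\mathrm{true} \mid D) \;=\; \frac{P(D \mid \mathrm{true})\, P(\mathrm{true})}{P(D)} $$

The two terms in the numerator are the probability of observing these data for an interaction known to be true and the prior probability that an interaction is true; the denominator is the prior probability of observing that particular data configuration at all.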
918 00:38:45,340 --> 00:38:48,100 If I want to decide the likelihood 919 00:38:48,100 --> 00:38:51,660 that a protein interacts-- how likely is it? 920 00:38:51,660 --> 00:38:53,610 Is it more likely that it interacts or not? 921 00:38:53,610 --> 00:38:54,676 I can compute this ratio. 922 00:38:54,676 --> 00:38:56,175 The probability that the interaction 923 00:38:56,175 --> 00:38:58,205 is true given the data over the probability 924 00:38:58,205 --> 00:39:00,330 an interaction is false given the data. 925 00:39:00,330 --> 00:39:03,330 That's the likelihood ratio. 926 00:39:03,330 --> 00:39:07,440 So by this formula, I then cancel out this probability 927 00:39:07,440 --> 00:39:10,030 of the data, the prior probability of the data. 928 00:39:10,030 --> 00:39:13,200 And if I had a way of calculating this, 929 00:39:13,200 --> 00:39:15,510 and we'll get to it in a second, then if it's 930 00:39:15,510 --> 00:39:18,030 more likely than not to be a true interaction, 931 00:39:18,030 --> 00:39:20,405 I can call it an interaction, right, if it's less likely. 932 00:39:20,405 --> 00:39:21,863 So if this ratio is greater than 1, 933 00:39:21,863 --> 00:39:23,300 I accept it as a true interaction. 934 00:39:23,300 --> 00:39:27,520 If this ratio is less than 1, then I reject it. 935 00:39:27,520 --> 00:39:28,020 OK. 936 00:39:28,020 --> 00:39:29,600 So now our challenge is to figure out 937 00:39:29,600 --> 00:39:31,800 how to compute these terms. 938 00:39:31,800 --> 00:39:34,400 One more thing to note is if all I want to do 939 00:39:34,400 --> 00:39:38,900 is be able to rank every interaction by this likelihood 940 00:39:38,900 --> 00:39:41,940 ratio, rather than coming up with a hard threshold, 941 00:39:41,940 --> 00:39:44,360 then I actually don't need all these terms. 942 00:39:44,360 --> 00:39:46,640 So this is the likelihood ratio. 943 00:39:46,640 --> 00:39:48,930 I can convert it to a log space. 944 00:39:48,930 --> 00:39:51,260 So it's going to be the sum of these two terms. 945 00:39:51,260 --> 00:39:53,180 And if I'm simply ranking everything 946 00:39:53,180 --> 00:39:56,160 by this log likelihood ratio, this term 947 00:39:56,160 --> 00:39:58,440 is the same for every interaction. 948 00:39:58,440 --> 00:40:01,220 It's just composed of prior probabilities. 949 00:40:01,220 --> 00:40:04,800 So it's not going to affect the ranking at all. 950 00:40:04,800 --> 00:40:05,740 Any questions on that? 951 00:40:05,740 --> 00:40:07,760 Is that clear? 952 00:40:07,760 --> 00:40:10,042 Good. 953 00:40:10,042 --> 00:40:12,250 So if I just want to come up with a ranking function, 954 00:40:12,250 --> 00:40:14,680 all I need to do-- all-- I need to do 955 00:40:14,680 --> 00:40:16,930 is to be able to estimate the probability of observing 956 00:40:16,930 --> 00:40:19,630 data for true interactions and the probability of observing 957 00:40:19,630 --> 00:40:21,699 that set of data for false interactions. 958 00:40:21,699 --> 00:40:22,490 Everybody buy that? 959 00:40:26,250 --> 00:40:27,360 Yes, please. 960 00:40:27,360 --> 00:40:29,440 AUDIENCE: When you say that prior probability is 961 00:40:29,440 --> 00:40:30,960 the same for all interactions, we're 962 00:40:30,960 --> 00:40:34,459 saying we're assuming the same prior probability for all, 963 00:40:34,459 --> 00:40:36,874 or is this [INAUDIBLE]? 964 00:40:36,874 --> 00:40:38,485 PROFESSOR: That's its definition. 
965 00:40:38,485 --> 00:40:41,070 We mean, what is the prior probability that proteins
966 00:40:41,070 --> 00:40:42,740 interact versus the prior probability that they don't?
967 00:40:42,740 --> 00:40:46,959 So it's independent of the proteins that we're looking at.
968 00:40:46,959 --> 00:40:47,625 Other questions?
969 00:40:51,320 --> 00:40:51,820 All right.
970 00:40:51,820 --> 00:40:54,282 So we need a way of computing this piece
971 00:40:54,282 --> 00:40:55,990 of all the things we've looked at before.
972 00:40:55,990 --> 00:40:58,560 So how do we get an estimate of the probability of observing
973 00:40:58,560 --> 00:41:00,550 a particular configuration of the data?
974 00:41:00,550 --> 00:41:02,530 Meaning, I detect it in experiment 1
975 00:41:02,530 --> 00:41:06,290 and not in experiment 2, but in experiment 3.
976 00:41:06,290 --> 00:41:09,710 What's the probability of that given it's a true interaction?
977 00:41:09,710 --> 00:41:11,880 So that's what we're going to dive into right now.
978 00:41:11,880 --> 00:41:12,430 OK.
979 00:41:12,430 --> 00:41:15,240 So one thing we could do to make life simpler,
980 00:41:15,240 --> 00:41:18,480 and then we'll remove this simplification later,
981 00:41:18,480 --> 00:41:21,050 but let's, for the time being, assume that all of my data
982 00:41:21,050 --> 00:41:23,890 are independent.
983 00:41:23,890 --> 00:41:26,470 So the two-hybrid is going to have completely different
984 00:41:26,470 --> 00:41:29,467 mistakes than the affinity capture mass spec.
985 00:41:29,467 --> 00:41:31,050 So those two data sets are going to be
986 00:41:31,050 --> 00:41:32,910 completely independent of each other.
987 00:41:32,910 --> 00:41:38,870 So I can write this as a product of a particular observation--
988 00:41:38,870 --> 00:41:40,470 a particular mass spec experiment
989 00:41:40,470 --> 00:41:43,542 and a particular two-hybrid experiment for true interactions
990 00:41:43,542 --> 00:41:44,500 and false interactions.
991 00:41:44,500 --> 00:41:46,580 So it's the product of the probability
992 00:41:46,580 --> 00:41:50,320 that a particular experiment would detect an interaction
993 00:41:50,320 --> 00:41:52,970 if the interaction is true over the probability
994 00:41:52,970 --> 00:41:55,630 that that particular experiment would detect it
995 00:41:55,630 --> 00:41:57,640 if there was no interaction.
996 00:41:57,640 --> 00:42:01,500 I'm just going to multiply all of those probabilities.
997 00:42:01,500 --> 00:42:02,100 Yes.
998 00:42:02,100 --> 00:42:03,900 AUDIENCE: [INAUDIBLE].
999 00:42:03,900 --> 00:42:07,961 This is one interaction pair?
1000 00:42:07,961 --> 00:42:08,960 PROFESSOR: That's right.
1001 00:42:08,960 --> 00:42:10,390 AUDIENCE: And you take the product
1002 00:42:10,390 --> 00:42:12,995 over all the interaction pairs within one
1003 00:42:12,995 --> 00:42:14,240 run of the experiment.
1004 00:42:14,240 --> 00:42:17,425 Is that correct?
1005 00:42:17,425 --> 00:42:18,800 PROFESSOR: If I want to determine
1006 00:42:18,800 --> 00:42:22,740 whether a particular interaction pair--
1007 00:42:22,740 --> 00:42:25,390 I want to compute this log likelihood
1008 00:42:25,390 --> 00:42:27,150 ratio, or this, actually, ranking ratio,
1009 00:42:27,150 --> 00:42:28,920 because I've thrown away the priors.
1010 00:42:28,920 --> 00:42:31,378 I want to compute this ranking ratio for a particular pair.
1011 00:42:31,378 --> 00:42:33,020 So I've got protein A and protein B.
1012 00:42:33,020 --> 00:42:34,940 And I want to determine whether I believe
1013 00:42:34,940 --> 00:42:36,775 it to be more likely to interact or not,
1014 00:42:36,775 --> 00:42:38,400 and rank it with all the others, right?
1015 00:42:38,400 --> 00:42:41,390 So I'm doing this for a pair of proteins now.
1016 00:42:41,390 --> 00:42:42,620 So far so good?
1017 00:42:42,620 --> 00:42:44,050 Now, for that pair of proteins, I
1018 00:42:44,050 --> 00:42:47,140 have a series of observations, or lack of observations, right?
1019 00:42:47,140 --> 00:42:49,327 I have a whole bunch of experiments.
1020 00:42:49,327 --> 00:42:51,160 This experiment detected it, that experiment
1021 00:42:51,160 --> 00:42:53,566 didn't detect it, this one did.
1022 00:42:53,566 --> 00:42:55,440 So what's the probability that these proteins--
1023 00:42:55,440 --> 00:42:58,660 that A and B really interact given that I got yes, no, yes
1024 00:42:58,660 --> 00:42:59,670 in my experiments?
1025 00:42:59,670 --> 00:43:02,226 And then for a new protein pair, it might be no, no, yes,
1026 00:43:02,226 --> 00:43:04,725 and I want to figure out the probability for that pair.
1027 00:43:04,725 --> 00:43:07,695 AUDIENCE: So is the scale of the big letter M,
1028 00:43:07,695 --> 00:43:10,830 is it on the order of like 10 experiments, 100 experiments,
1029 00:43:10,830 --> 00:43:12,067 or thousands of experiments?
1030 00:43:12,067 --> 00:43:12,650 PROFESSOR: Ah.
1031 00:43:12,650 --> 00:43:14,825 So the question is, what's the scale of this.
1032 00:43:14,825 --> 00:43:17,200 So obviously, that's going to depend on what kind of data
1033 00:43:17,200 --> 00:43:19,870 I bring in, but in these cases, it's small.
1034 00:43:19,870 --> 00:43:22,300 So we have a handful of these high throughput experiments
1035 00:43:22,300 --> 00:43:25,190 over entire genomes and proteomes.
1036 00:43:25,190 --> 00:43:26,416 So there's not going to be a lot.
1037 00:43:26,416 --> 00:43:27,790 So in some of these early papers,
1038 00:43:27,790 --> 00:43:29,530 there were four interaction experiments
1039 00:43:29,530 --> 00:43:30,614 that they were looking at.
1040 00:43:30,614 --> 00:43:32,488 Now the numbers might be a little bit bigger,
1041 00:43:32,488 --> 00:43:33,800 but not significantly greater.
1042 00:43:38,130 --> 00:43:38,630 All right.
1043 00:43:38,630 --> 00:43:42,070 So now to compute this, we need a set of gold standards.
1044 00:43:42,070 --> 00:43:44,790 But now we don't just need gold standard positive interactions,
1045 00:43:44,790 --> 00:43:46,498 proteins that we know really do interact.
1046 00:43:46,498 --> 00:43:50,130 We also need proteins that we know really don't interact.
1047 00:43:50,130 --> 00:43:53,420 Because I want to compute the probability of an observation
1048 00:43:53,420 --> 00:43:55,595 given that some interaction is definitely wrong.
1049 00:43:58,970 --> 00:44:01,820 So precisely how I compute these terms
1050 00:44:01,820 --> 00:44:03,550 is going to depend on the kinds of data.
1051 00:44:03,550 --> 00:44:05,220 The experiments I've just been talking about,
1052 00:44:05,220 --> 00:44:06,860 these high throughput mass spec ones,
1053 00:44:06,860 --> 00:44:10,110 were the ones where we looked at the ratio of the consensus
1054 00:44:10,110 --> 00:44:14,330 true positives and estimated that 96% of all the data
1055 00:44:14,330 --> 00:44:15,710 were possibly in error.
1056 00:44:15,710 --> 00:44:18,030 The details of how to do those calculations are here.
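As a concrete illustration of the ranking just described, here is a minimal sketch in Python. It assumes we have already estimated, from the gold-standard positive and negative sets, how often each experiment detects a true interaction and how often it reports a false one; the experiment names and probabilities below are made up.

```python
import math

# Hypothetical per-experiment detection probabilities, estimated from
# gold-standard positive and negative pairs (numbers are invented):
#   (P(detected | true interaction), P(detected | false interaction))
detection_probs = {
    "ms_1":       (0.30, 0.02),   # affinity capture mass spec, screen 1
    "ms_2":       (0.25, 0.03),   # affinity capture mass spec, screen 2
    "two_hybrid": (0.20, 0.05),
}

def log_likelihood_ratio(observations):
    """Score one protein pair from a dict like {"ms_1": True, "two_hybrid": False}.
    Assumes the experiments make independent errors, so the per-experiment
    ratios multiply (their logs add)."""
    llr = 0.0
    for expt, detected in observations.items():
        p_true, p_false = detection_probs[expt]
        if detected:
            llr += math.log(p_true / p_false)
        else:
            llr += math.log((1 - p_true) / (1 - p_false))
    return llr

# Example: seen in both mass spec screens but not in the two-hybrid.
print(log_likelihood_ratio({"ms_1": True, "ms_2": True, "two_hybrid": False}))
```

A positive score means the observations are more consistent with a true interaction than a false one, and pairs can simply be ranked by this score without ever fixing a hard threshold.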
1057 00:44:18,030 --> 00:44:22,100 I leave you to look that up if you're interested. 1058 00:44:22,100 --> 00:44:23,510 But now what we're going to do is 1059 00:44:23,510 --> 00:44:26,650 we're going to see how, if we were to rank interactions 1060 00:44:26,650 --> 00:44:32,150 based on this term, we can avoid having 1061 00:44:32,150 --> 00:44:33,920 to throw out most of our data. 1062 00:44:33,920 --> 00:44:36,637 So we said if we require all the experiments to agree, 1063 00:44:36,637 --> 00:44:38,470 we're going to have very, very low coverage. 1064 00:44:38,470 --> 00:44:40,450 Now we're instead going to rank everything 1065 00:44:40,450 --> 00:44:42,420 based on this likelihood ratio, or something 1066 00:44:42,420 --> 00:44:44,175 derived from the likelihood ratio. 1067 00:44:44,175 --> 00:44:45,800 So in this paper where they were simply 1068 00:44:45,800 --> 00:44:47,550 looking at the protein-protein interaction 1069 00:44:47,550 --> 00:44:51,570 data sets to compute these interactions, 1070 00:44:51,570 --> 00:44:55,800 they ranked everything based on that ranking function we just 1071 00:44:55,800 --> 00:44:56,840 described. 1072 00:44:56,840 --> 00:44:59,610 And then as you vary your threshold, 1073 00:44:59,610 --> 00:45:01,860 you can figure out how many true positives you have 1074 00:45:01,860 --> 00:45:05,180 and how many false positives you have in the gold standard. 1075 00:45:05,180 --> 00:45:07,270 True interactors and false interactors. 1076 00:45:07,270 --> 00:45:09,110 And you can compute this curve, right? 1077 00:45:09,110 --> 00:45:12,800 For any particular value of that ranking ratio, 1078 00:45:12,800 --> 00:45:16,240 what's my sensitivity and what's my specificity? 1079 00:45:16,240 --> 00:45:18,270 Are you clear what this plot means? 1080 00:45:21,770 --> 00:45:23,886 And here they've plotted the values 1081 00:45:23,886 --> 00:45:25,010 for individual experiments. 1082 00:45:25,010 --> 00:45:29,950 And this is the value for an independent database 1083 00:45:29,950 --> 00:45:33,425 of gold standard interactions. 1084 00:45:33,425 --> 00:45:34,800 And so now, where do they come up 1085 00:45:34,800 --> 00:45:37,160 with their true positives and their false positives? 1086 00:45:37,160 --> 00:45:39,660 A lot of this is going to depend on how representative those 1087 00:45:39,660 --> 00:45:40,160 are. 1088 00:45:40,160 --> 00:45:42,480 And all these numbers are subject to revision 1089 00:45:42,480 --> 00:45:45,280 if you decide that the true positives and false positives 1090 00:45:45,280 --> 00:45:47,860 that people are using are not accurate enough. 1091 00:45:47,860 --> 00:45:52,320 So they used two well annotated databases of interactions. 1092 00:45:52,320 --> 00:45:54,140 One from MIPS and one from SGD. 1093 00:45:54,140 --> 00:45:56,270 And you can play those off against each other 1094 00:45:56,270 --> 00:45:58,119 as the database of true positives. 1095 00:45:58,119 --> 00:45:59,660 In some ways, that's the easier thing 1096 00:45:59,660 --> 00:46:02,650 because people like to report that proteins interact. 1097 00:46:02,650 --> 00:46:05,210 They tend not to like to report the proteins don't interact. 1098 00:46:05,210 --> 00:46:07,560 You don't see a lot of nature papers saying protein 1099 00:46:07,560 --> 00:46:10,152 x doesn't interact with protein y. 1100 00:46:10,152 --> 00:46:11,860 So how are you going to figure out, then, 1101 00:46:11,860 --> 00:46:13,950 what are your true negatives? 
1102 00:46:13,950 --> 00:46:17,467 So the strategies that they used-- well,
1103 00:46:17,467 --> 00:46:19,800 one possibility is they're annotated to be in complexes,
1104 00:46:19,800 --> 00:46:22,110 and those complexes are different from each other.
1105 00:46:22,110 --> 00:46:23,320 That's not bad, right?
1106 00:46:23,320 --> 00:46:26,030 But it's not a guarantee either.
1107 00:46:26,030 --> 00:46:27,760 Or this is a little bit better.
1108 00:46:27,760 --> 00:46:31,296 They're annotated to be in different parts of the cell.
1109 00:46:31,296 --> 00:46:33,440 Of course, if those annotations aren't perfect,
1110 00:46:33,440 --> 00:46:35,910 or the proteins are present at low concentrations, you could still be wrong.
1111 00:46:35,910 --> 00:46:37,840 Or that they have anti-correlated gene
1112 00:46:37,840 --> 00:46:38,340 expression.
1113 00:46:38,340 --> 00:46:39,580 I kind of like this one.
1114 00:46:39,580 --> 00:46:42,164 So it's one thing to be not correlated, but if you're
1115 00:46:42,164 --> 00:46:43,830 anti-correlated, that seems pretty suggestive
1116 00:46:43,830 --> 00:46:47,032 that these two proteins are never in a complex together.
1117 00:46:47,032 --> 00:46:49,240 Again, it's no guarantee because, as we'll talk about
1118 00:46:49,240 --> 00:46:51,850 in some detail later, RNA levels are not
1119 00:46:51,850 --> 00:46:53,511 very good predictors of protein levels.
1120 00:46:53,511 --> 00:46:55,260 But if you apply enough of these criteria,
1121 00:46:55,260 --> 00:46:56,840 you can come up with a set of proteins
1122 00:46:56,840 --> 00:46:58,430 that you have fairly high confidence really
1123 00:46:58,430 --> 00:46:59,200 don't interact.
1124 00:46:59,200 --> 00:47:01,750 You combine that with the databases of proteins
1125 00:47:01,750 --> 00:47:04,420 with very high confidence that they do interact,
1126 00:47:04,420 --> 00:47:06,880 and you can get the true positives and false positives
1127 00:47:06,880 --> 00:47:08,213 that you need for this analysis.
1128 00:47:13,071 --> 00:47:13,570 All right.
1129 00:47:13,570 --> 00:47:16,290 So that's a way of combining some information.
1130 00:47:16,290 --> 00:47:18,320 We're going to see a generalization of that
1131 00:47:18,320 --> 00:47:19,584 called Bayesian networks.
1132 00:47:19,584 --> 00:47:21,250 We've mentioned these already in at least
1133 00:47:21,250 --> 00:47:22,624 two different contexts, and they'll
1134 00:47:22,624 --> 00:47:26,120 come up again later in the course as well.
1135 00:47:26,120 --> 00:47:28,190 So these are very general methods
1136 00:47:28,190 --> 00:47:31,580 for reasoning probabilistically.
1137 00:47:31,580 --> 00:47:33,470 We will see them in the context here
1138 00:47:33,470 --> 00:47:34,810 of predicting interactions.
1139 00:47:34,810 --> 00:47:37,060 We'll see them later in the context of gene regulation
1140 00:47:37,060 --> 00:47:38,070 and signaling as well.
1141 00:47:41,530 --> 00:47:44,410 What we fundamentally need for a Bayesian network
1142 00:47:44,410 --> 00:47:47,320 is a graphical structure that represents our understanding
1143 00:47:47,320 --> 00:47:50,499 of the relationship between causes and effects,
1144 00:47:50,499 --> 00:47:52,040 and a set of probabilities that allow
1145 00:47:52,040 --> 00:47:54,830 us to compute things on this network.
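As a very stripped-down illustration of those two ingredients, a Bayesian network can be represented as a graph plus one conditional probability table per node; the structure and numbers below are invented, just to make the idea concrete.

```python
# Each node lists its parents and a conditional probability table (CPT)
# giving P(node = True | parent values). Structure and numbers are invented.
network = {
    "interacts": {"parents": [], "cpt": {(): 0.01}},
    "ms_hit":    {"parents": ["interacts"], "cpt": {(True,): 0.30, (False,): 0.02}},
    "y2h_hit":   {"parents": ["interacts"], "cpt": {(True,): 0.20, (False,): 0.05}},
}

def prob_true(node, assignment):
    """P(node = True | the values of its parents given in `assignment`)."""
    parent_values = tuple(assignment[p] for p in network[node]["parents"])
    return network[node]["cpt"][parent_values]

# Chance that a truly interacting pair is picked up in the mass spec screen:
print(prob_true("ms_hit", {"interacts": True}))   # 0.3
```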
1146 00:47:54,830 --> 00:47:58,060 We'll show you examples where those networks are derived 1147 00:47:58,060 --> 00:48:01,490 from our prior understanding of the problem, 1148 00:48:01,490 --> 00:48:03,490 but also ones where the structure of the network 1149 00:48:03,490 --> 00:48:04,730 is learned from the data. 1150 00:48:07,250 --> 00:48:11,280 And we're going to see two primary contexts. 1151 00:48:11,280 --> 00:48:14,471 First we have this question of whether proteins interact. 1152 00:48:14,471 --> 00:48:16,220 That's what we've just been talking about. 1153 00:48:16,220 --> 00:48:19,070 So here are four experiments, the in vitro pulldown 1154 00:48:19,070 --> 00:48:22,230 experiments and yeast two-hybrid experiments, 1155 00:48:22,230 --> 00:48:25,220 that give us relatively independent information 1156 00:48:25,220 --> 00:48:27,195 about whether proteins interact. 1157 00:48:27,195 --> 00:48:28,820 And we're going to look at a paper that 1158 00:48:28,820 --> 00:48:31,590 used those data with a Bayesian network 1159 00:48:31,590 --> 00:48:34,270 to compute the probability that two proteins really do interact 1160 00:48:34,270 --> 00:48:36,295 based on the combination of all the data, 1161 00:48:36,295 --> 00:48:38,420 rather than throwing out anything that doesn't fall 1162 00:48:38,420 --> 00:48:41,050 in the overlap, which could be a very, very small number. 1163 00:48:41,050 --> 00:48:42,550 And then later on we'll see examples 1164 00:48:42,550 --> 00:48:45,360 of using Bayesian networks to understand biological networks. 1165 00:48:45,360 --> 00:48:47,980 So this might be a set of transcription factors 1166 00:48:47,980 --> 00:48:51,460 that are regulating a set of differentially expressed genes. 1167 00:48:51,460 --> 00:48:53,280 And the structure of the graphical network 1168 00:48:53,280 --> 00:48:55,280 for a Bayesian network has a lot of similarities 1169 00:48:55,280 --> 00:48:57,580 to the way we normally think about transcriptional 1170 00:48:57,580 --> 00:48:58,850 regulatory networks. 1171 00:48:58,850 --> 00:49:01,390 So there's sort of a natural way of transferring 1172 00:49:01,390 --> 00:49:05,460 our regulatory problem into a graphical network problem. 1173 00:49:05,460 --> 00:49:08,210 But we're going to focus on these prediction 1174 00:49:08,210 --> 00:49:10,800 problems for protein-protein interactions first. 1175 00:49:10,800 --> 00:49:17,070 Now, if I just want to compute the probability of detecting 1176 00:49:17,070 --> 00:49:19,320 an interaction in various experiments, given that it's 1177 00:49:19,320 --> 00:49:21,260 true or false, I could explicitly 1178 00:49:21,260 --> 00:49:23,090 compute that probability. 1179 00:49:23,090 --> 00:49:26,010 And we saw examples of that just now. 1180 00:49:26,010 --> 00:49:28,080 But some of these Bayesian network problems 1181 00:49:28,080 --> 00:49:30,980 become much, much too large to do that. 1182 00:49:30,980 --> 00:49:35,580 This is a little tiny piece of a Bayesian network 1183 00:49:35,580 --> 00:49:37,270 that is supposed to represent I believe 1184 00:49:37,270 --> 00:49:40,480 it's transcriptional regulatory network. 1185 00:49:40,480 --> 00:49:43,730 You could never possibly write down all of the terms 1186 00:49:43,730 --> 00:49:47,250 in this probability, where every node could, in principle depend 1187 00:49:47,250 --> 00:49:48,820 on every other node in the network. 1188 00:49:48,820 --> 00:49:52,080 It would just be a ridiculously large problem. 
1189 00:49:52,080 --> 00:49:55,890 In fact, how large would it be if I've got N binary variables, 1190 00:49:55,890 --> 00:49:58,410 my gene is on or off, my interaction is true or false, 1191 00:49:58,410 --> 00:50:01,130 I have 2 to the N possible states? 1192 00:50:01,130 --> 00:50:01,940 Right? 1193 00:50:01,940 --> 00:50:04,274 And the only constraint I have, in principle, 1194 00:50:04,274 --> 00:50:06,440 is that all the probabilities have to add up to one. 1195 00:50:06,440 --> 00:50:08,740 So I have 2 to the N minus 1. 1196 00:50:08,740 --> 00:50:13,770 2 to the N minus 1 possible variables that I need to set. 1197 00:50:13,770 --> 00:50:16,530 So that's a ridiculously large number in most contexts. 1198 00:50:16,530 --> 00:50:19,670 So how do Bayesian networks help us solve this problem? 1199 00:50:19,670 --> 00:50:21,455 Well, we represent our understanding 1200 00:50:21,455 --> 00:50:23,940 of the problem in a graphical structure 1201 00:50:23,940 --> 00:50:26,310 where we have causes and effects. 1202 00:50:26,310 --> 00:50:28,980 And there'll be a direct arrow from a cause to an effect. 1203 00:50:28,980 --> 00:50:30,636 I don't always know the cause. 1204 00:50:30,636 --> 00:50:32,010 So in our context, we were trying 1205 00:50:32,010 --> 00:50:34,560 to figure out whether two proteins interact. 1206 00:50:34,560 --> 00:50:36,495 What do we measure? 1207 00:50:36,495 --> 00:50:38,120 We actually don't measure interactions. 1208 00:50:38,120 --> 00:50:40,610 We measure the result of a particular experiment, which 1209 00:50:40,610 --> 00:50:43,170 is a combination of whether interacted 1210 00:50:43,170 --> 00:50:45,650 and all sorts of noise that we've just discussed. 1211 00:50:45,650 --> 00:50:49,430 So the effects that we observe are detected in experiment one 1212 00:50:49,430 --> 00:50:51,260 or detected in experiment two. 1213 00:50:51,260 --> 00:50:54,450 The cause is, did it interact or not? 1214 00:50:54,450 --> 00:50:56,770 So the cause is hidden, the effects are observed. 1215 00:50:59,959 --> 00:51:01,750 Now, in the case we were looking at before, 1216 00:51:01,750 --> 00:51:03,166 we treated all these probabilities 1217 00:51:03,166 --> 00:51:04,042 as being independent. 1218 00:51:04,042 --> 00:51:06,000 But we might know something about the structure 1219 00:51:06,000 --> 00:51:08,360 of our experiments, the kinds of experiments we're doing, 1220 00:51:08,360 --> 00:51:10,401 that might lead us to have a different structure. 1221 00:51:10,401 --> 00:51:14,900 So we could have an interaction that gives rise 1222 00:51:14,900 --> 00:51:16,529 to all different kinds of data. 1223 00:51:16,529 --> 00:51:18,570 But depending on whether the protein's a membrane 1224 00:51:18,570 --> 00:51:20,180 protein or highly expressed, it might 1225 00:51:20,180 --> 00:51:22,590 influence the results of certain experiments 1226 00:51:22,590 --> 00:51:26,370 and not influence the results of others, right? 1227 00:51:26,370 --> 00:51:28,200 So like a two-hybrid would be very biased 1228 00:51:28,200 --> 00:51:29,330 by which one of these? 1229 00:51:32,320 --> 00:51:33,550 The membrane, right? 1230 00:51:33,550 --> 00:51:35,810 And then the affinity capture mass spec 1231 00:51:35,810 --> 00:51:37,310 could be very influenced by proteins 1232 00:51:37,310 --> 00:51:41,409 that are expressed at very high levels or very low levels. 
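For reference, the general statement behind this is that a Bayesian network factors the joint distribution over variables $x_1, \ldots, x_N$ according to the graph,

$$ P(x_1, \ldots, x_N) \;=\; \prod_{i=1}^{N} P\big(x_i \mid \mathrm{Pa}(x_i)\big), $$

where $\mathrm{Pa}(x_i)$ denotes the parents of node $x_i$. For binary variables this replaces the $2^N - 1$ entries of the full joint table with one small conditional table per node, $\sum_i 2^{|\mathrm{Pa}(x_i)|}$ parameters in total.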
1233 00:51:41,409 --> 00:51:43,700 If we assume that all the observations are independent,
1234 00:51:43,700 --> 00:51:44,705 then we multiply probabilities.
1235 00:51:44,705 --> 00:51:46,329 And we'll go into more detail, but this
1236 00:51:46,329 --> 00:51:48,530 is what we're looking at up until now.
1237 00:51:48,530 --> 00:51:52,425 In cases where we believe that all the observations are not
1238 00:51:52,425 --> 00:51:53,800 independent, then we're not going
1239 00:51:53,800 --> 00:51:54,880 to simply multiply things.
1240 00:51:54,880 --> 00:51:56,380 We'll see there's a more precise way
1241 00:51:56,380 --> 00:51:59,424 of computing the probabilities.
1242 00:51:59,424 --> 00:52:01,590 Now in this case, I've drawn the graphical structure
1243 00:52:01,590 --> 00:52:04,280 because I believe that I know what's going on.
1244 00:52:04,280 --> 00:52:06,280 But in the more general case that we'll look at,
1245 00:52:06,280 --> 00:52:08,363 we'll actually derive the structure from the data.
1246 00:52:11,110 --> 00:52:13,520 One of the nice things about Bayesian networks
1247 00:52:13,520 --> 00:52:15,850 is that it removes the need to have all 2 to the N
1248 00:52:15,850 --> 00:52:19,350 minus 1 possible parameters, because it tells us there
1249 00:52:19,350 --> 00:52:21,550 are certain independence conditions.
1250 00:52:21,550 --> 00:52:25,470 So a node is independent of its ancestors given its parents.
1251 00:52:25,470 --> 00:52:26,795 What does that mean?
1252 00:52:26,795 --> 00:52:28,920 If I'm trying to reason about the expression of one
1253 00:52:28,920 --> 00:52:32,120 of the genes down here, and I know that this transcription
1254 00:52:32,120 --> 00:52:35,717 factor is on, I don't really care
1255 00:52:35,717 --> 00:52:37,800 what the probability is that any particular parent
1256 00:52:37,800 --> 00:52:39,690 of that transcription factor is on, right?
1257 00:52:39,690 --> 00:52:42,670 So I don't need to know anything about transcription factor B1
1258 00:52:42,670 --> 00:52:43,880 if I know the state of B2.
1259 00:52:43,880 --> 00:52:45,990 If this is on, then that's the only thing
1260 00:52:45,990 --> 00:52:49,450 that's going to affect whether it's turning on these genes,
1261 00:52:49,450 --> 00:52:52,370 regardless of what the activation state of its parent
1262 00:52:52,370 --> 00:52:53,600 was.
1263 00:52:53,600 --> 00:52:54,470 Is that clear?
1264 00:52:54,470 --> 00:52:55,367 Yes.
1265 00:52:55,367 --> 00:52:57,020 AUDIENCE: The slide's saying TF B1.
1266 00:52:57,020 --> 00:52:59,720 [INAUDIBLE] TF B2?
1267 00:52:59,720 --> 00:53:00,522 It says TF A1.
1268 00:53:00,522 --> 00:53:01,480 PROFESSOR: Yeah, sorry.
1269 00:53:01,480 --> 00:53:02,854 That should say TF B1.
1270 00:53:07,070 --> 00:53:07,570 Thank you.
1271 00:53:11,590 --> 00:53:12,090 OK.
1272 00:53:12,090 --> 00:53:13,670 So we'll do a little example.
1273 00:53:13,670 --> 00:53:16,130 It's admission season both for graduate school
1274 00:53:16,130 --> 00:53:16,880 and undergraduate.
1275 00:53:16,880 --> 00:53:19,350 So let's do a little toy example where
1276 00:53:19,350 --> 00:53:21,600 we're going to get rid of the admissions committees
1277 00:53:21,600 --> 00:53:22,974 and just do automated admissions.
1278 00:53:26,630 --> 00:53:29,659 So we're going to collect various data about students,
1279 00:53:29,659 --> 00:53:31,700 and then we're going to build a Bayesian network.
1280 00:53:31,700 --> 00:53:33,850 And that network is going to decide
1281 00:53:33,850 --> 00:53:36,649 whether to admit students, in this simplified version.
1282 00:53:36,649 --> 00:53:38,940 And the only information that will go into our decision
1283 00:53:38,940 --> 00:53:44,380 will be the grades on the transcript and the GREs.
1284 00:53:44,380 --> 00:53:46,740 Hopefully that's not the case.
1285 00:53:46,740 --> 00:53:49,070 And we believe that certain things
1286 00:53:49,070 --> 00:53:51,880 influence your grades and your GREs.
1287 00:53:51,880 --> 00:53:53,410 Whether or not the student is smart
1288 00:53:53,410 --> 00:53:54,850 certainly should have some influence,
1289 00:53:54,850 --> 00:53:56,683 but also the grade inflation at their school
1290 00:53:56,683 --> 00:53:59,610 will have some influence.
1291 00:53:59,610 --> 00:54:02,580 So a prediction problem in a Bayesian network
1292 00:54:02,580 --> 00:54:06,180 is going from the causes to the effects.
1293 00:54:06,180 --> 00:54:09,010 So if I want to predict whether a student's admitted,
1294 00:54:09,010 --> 00:54:10,870 I only need to look upstream.
1295 00:54:10,870 --> 00:54:15,110 So we want to predict-- we observe the things on the top.
1296 00:54:15,110 --> 00:54:16,570 Say, grades and GREs, and we want
1297 00:54:16,570 --> 00:54:19,260 to predict whether this student should be admitted or not.
1298 00:54:19,260 --> 00:54:21,780 There's another problem called an inference problem, which
1299 00:54:21,780 --> 00:54:23,970 is when we observe the effect and we
1300 00:54:23,970 --> 00:54:28,460 want to make inferences about the causes.
1301 00:54:28,460 --> 00:54:30,950 So an example of that would be, you apply for an internship
1302 00:54:30,950 --> 00:54:34,190 and they say, oh, she's a student at MIT.
1303 00:54:34,190 --> 00:54:35,080 I bet she's smart.
1304 00:54:35,080 --> 00:54:35,580 Right?
1305 00:54:35,580 --> 00:54:38,964 They're doing an inference problem.
1306 00:54:38,964 --> 00:54:41,630 We'll leave it for you to decide whether you and your colleagues
1307 00:54:41,630 --> 00:54:44,500 are as smart as everyone thinks, but hopefully you are.
1308 00:54:44,500 --> 00:54:45,010 OK.
1309 00:54:45,010 --> 00:54:46,540 So we've got these two different kinds of problems.
1310 00:54:46,540 --> 00:54:48,581 We've got prediction problems from top to bottom,
1311 00:54:48,581 --> 00:54:50,850 and inference problems from bottom to top.
1312 00:54:54,741 --> 00:54:56,990 And we're going to talk about conditional probability.
1313 00:54:56,990 --> 00:54:59,156 So if I've got some very small piece of this network
1314 00:54:59,156 --> 00:55:02,030 with just two nodes, I could write out
1315 00:55:02,030 --> 00:55:06,130 all the possible probabilities for any pair of those nodes.
1316 00:55:06,130 --> 00:55:09,600 So the probability that a student is not smart
1317 00:55:09,600 --> 00:55:12,300 and that student has low grades, the probability
1318 00:55:12,300 --> 00:55:14,970 that the student is not smart and the student has
1319 00:55:14,970 --> 00:55:18,090 good grades, and so on, for all possible pairwise combinations.
1320 00:55:18,090 --> 00:55:20,510 Or I could write this as a conditional probability, which
1321 00:55:20,510 --> 00:55:22,860 tends to be an easier way to think about the problem.
1322 00:55:22,860 --> 00:55:25,430 What's the conditional probability of a student
1323 00:55:25,430 --> 00:55:29,920 being smart given that they've got good grades
1324 00:55:29,920 --> 00:55:33,230 or given that they have bad grades?
1325 00:55:33,230 --> 00:55:35,800 They have the same information.
1326 00:55:35,800 --> 00:55:38,020 For this one, I need additional information
1327 00:55:38,020 --> 00:55:41,845 about the total probability of students being smart or not.
1328 00:55:41,845 --> 00:55:43,720 And the total number of variables, as I said,
1329 00:55:43,720 --> 00:55:44,845 in either case is the same.
1330 00:55:44,845 --> 00:55:46,690 So these are completely interchangeable,
1331 00:55:46,690 --> 00:55:49,495 but it's a lot easier to reason with conditional probabilities
1332 00:55:49,495 --> 00:55:51,120 than with the joint probability tables.
1333 00:55:51,120 --> 00:55:54,067 Those we'll see in a second.
1334 00:55:54,067 --> 00:55:56,400 So as I've said, you don't need a full probability table
1335 00:55:56,400 --> 00:55:57,358 for a Bayesian network.
1336 00:55:57,358 --> 00:55:59,430 You don't need 2 to the N minus 1 variables.
1337 00:55:59,430 --> 00:56:00,929 And the fundamental reason for that
1338 00:56:00,929 --> 00:56:02,470 is that the joint probability is only
1339 00:56:02,470 --> 00:56:04,430 going to depend on the parents.
1340 00:56:04,430 --> 00:56:08,224 So in this toy example, the GRE scores over here
1341 00:56:08,224 --> 00:56:09,765 are not dependent on grade inflation.
1342 00:56:15,310 --> 00:56:17,580 Now, that all hopefully makes sense.
1343 00:56:17,580 --> 00:56:18,080 Questions?
1344 00:56:21,220 --> 00:56:23,152 Bayesian networks get a little murky next,
1345 00:56:23,152 --> 00:56:25,110 so I'm going to try to give you some intuition-- oh, yes.
1346 00:56:25,110 --> 00:56:26,358 Question, please.
1347 00:56:26,358 --> 00:56:29,844 AUDIENCE: You said that the parents don't affect
1348 00:56:29,844 --> 00:56:33,828 their children, but if grade inflation affects the grades,
1349 00:56:33,828 --> 00:56:37,314 how does that influence-- will that
1350 00:56:37,314 --> 00:56:39,320 influence the grade [INAUDIBLE]?
1351 00:56:39,320 --> 00:56:41,932 PROFESSOR: Sorry, can you say the question again?
1352 00:56:41,932 --> 00:56:43,390 AUDIENCE: I guess I'm just confused
1353 00:56:43,390 --> 00:56:46,140 by this particular example.
1354 00:56:46,140 --> 00:56:48,020 What do you mean by the joint probability?
1355 00:56:48,020 --> 00:56:49,430 The joint probability of what?
1356 00:56:52,840 --> 00:56:55,040 PROFESSOR: So if I want to figure out
1357 00:56:55,040 --> 00:56:57,910 the probability of some particular configuration of all
1358 00:56:57,910 --> 00:57:01,130 the nodes in my network, I don't necessarily
1359 00:57:01,130 --> 00:57:03,670 need to consider all possibilities.
1360 00:57:03,670 --> 00:57:05,970 Because for example, if I want to consider
1361 00:57:05,970 --> 00:57:07,560 all of the joint probability samples
1362 00:57:07,560 --> 00:57:11,080 with settings for the GREs, whether the student had
1363 00:57:11,080 --> 00:57:13,220 good GRE scores or not, that's not
1364 00:57:13,220 --> 00:57:18,350 going to be influenced by the student's school's grade
1365 00:57:18,350 --> 00:57:21,338 inflation policies.
1366 00:57:21,338 --> 00:57:23,730 AUDIENCE: But wouldn't the grades be influenced by the--
1367 00:57:23,730 --> 00:57:25,188 PROFESSOR: But the grades would be.
1368 00:57:25,188 --> 00:57:25,910 That's right.
1369 00:57:25,910 --> 00:57:27,920 So some of the variables I can remove 1370 00:57:27,920 --> 00:57:30,507 and others-- some of the joint probability statements 1371 00:57:30,507 --> 00:57:32,340 I don't need to worry about and others I do. 1372 00:57:32,340 --> 00:57:33,840 And which ones I need to consider 1373 00:57:33,840 --> 00:57:35,381 is determined by the graph structure. 1374 00:57:37,790 --> 00:57:38,300 Yes. 1375 00:57:38,300 --> 00:57:40,450 AUDIENCE: How is the graph structure determined? 1376 00:57:40,450 --> 00:57:41,033 PROFESSOR: OK. 1377 00:57:41,033 --> 00:57:43,010 So how is the graph structure determined? 1378 00:57:43,010 --> 00:57:45,050 So it's determined in one of two ways. 1379 00:57:45,050 --> 00:57:48,650 I can draw it in advance because I believe that I know something 1380 00:57:48,650 --> 00:57:51,630 about my setting, I believe that these data are independent. 1381 00:57:51,630 --> 00:57:55,090 Then it has that structure like this. 1382 00:57:55,090 --> 00:57:57,275 Cause and a bunch of independent effects. 1383 00:58:01,560 --> 00:58:06,510 Or perhaps I claim to know that actually two of these things 1384 00:58:06,510 --> 00:58:10,360 have a common parent as well. 1385 00:58:10,360 --> 00:58:12,320 In some cases I know. 1386 00:58:12,320 --> 00:58:14,580 We'll also talk about how to learn the structure 1387 00:58:14,580 --> 00:58:16,770 from the data, which is the more common setting 1388 00:58:16,770 --> 00:58:17,964 in regulatory networks. 1389 00:58:17,964 --> 00:58:19,380 So in these kinds of problems when 1390 00:58:19,380 --> 00:58:21,180 trying to decide how to integrate 1391 00:58:21,180 --> 00:58:23,829 different proteomic data sets, typically people 1392 00:58:23,829 --> 00:58:25,870 make arbitrary decisions about what the structure 1393 00:58:25,870 --> 00:58:28,640 is based on their knowledge of the system. 1394 00:58:28,640 --> 00:58:31,750 But if you're trying to figure out de novo which proteins 1395 00:58:31,750 --> 00:58:34,126 interact with which, which proteins regulate which genes, 1396 00:58:34,126 --> 00:58:35,791 then you have to learn it from the data. 1397 00:58:35,791 --> 00:58:37,974 And we'll talk about how to do that in a second. 1398 00:58:37,974 --> 00:58:38,640 Great questions. 1399 00:58:38,640 --> 00:58:39,780 Any other questions? 1400 00:58:39,780 --> 00:58:41,413 Anything in the quiet half of the room? 1401 00:58:46,510 --> 00:58:47,010 OK. 1402 00:58:47,010 --> 00:58:49,590 So as I said, this part of it, I think 1403 00:58:49,590 --> 00:58:51,170 you can usually come up with cases 1404 00:58:51,170 --> 00:58:53,296 that give you fairly good intuition. 1405 00:58:53,296 --> 00:58:55,670 One of the things that is true in these Bayesian networks 1406 00:58:55,670 --> 00:58:58,370 which most people find a little bit surprising at first 1407 00:58:58,370 --> 00:59:00,950 is something called explaining away. 1408 00:59:00,950 --> 00:59:04,050 So let's look at this Bayesian network. 1409 00:59:04,050 --> 00:59:06,180 I go outside and I detect that things 1410 00:59:06,180 --> 00:59:08,827 are slippery on the grass. 1411 00:59:08,827 --> 00:59:10,410 So that could be for a lot of reasons, 1412 00:59:10,410 --> 00:59:13,251 but one possible reason is that the grass is wet. 1413 00:59:13,251 --> 00:59:13,750 OK. 1414 00:59:13,750 --> 00:59:15,541 What are the causes of the grass being wet? 1415 00:59:15,541 --> 00:59:17,409 Well, it could have rained or the sprinklers 1416 00:59:17,409 --> 00:59:18,200 might have been on. 
1417 00:59:20,720 --> 00:59:23,300 And depending on this as an example-- so
1418 00:59:23,300 --> 00:59:26,320 a lot of the Bayesian network formalism was developed at UCLA
1419 00:59:26,320 --> 00:59:29,050 by Judea Pearl and colleagues.
1420 00:59:29,050 --> 00:59:32,090 And of course, in California it doesn't rain that often.
1421 00:59:32,090 --> 00:59:34,890 So there the season is a strong determiner of these things.
1422 00:59:34,890 --> 00:59:36,700 Not so much around here.
1423 00:59:36,700 --> 00:59:38,955 So in this example that they like to use,
1424 00:59:38,955 --> 00:59:40,330 does the probability that it's
1425 00:59:40,330 --> 00:59:44,946 raining depend on whether the sprinkler is on or not?
1426 00:59:44,946 --> 00:59:46,890 Now, the answer should be no, right?
1427 00:59:46,890 --> 00:59:52,310 I mean, in reality, when you think about-- there's
1428 00:59:52,310 --> 00:59:54,800 no causal relationship between the sprinkler being on
1429 00:59:54,800 --> 00:59:56,390 and the rain.
1430 00:59:56,390 --> 00:59:59,650 But in fact, when we're reasoning over these networks,
1431 00:59:59,650 --> 01:00:00,820 our belief about one actually is influenced by the other.
1432 01:00:03,620 --> 01:00:07,470 In a probabilistic model, if I know that it's raining,
1433 01:00:07,470 --> 01:00:09,774 and I know the grass is wet, then what
1434 01:00:09,774 --> 01:00:11,440 do I think about the sprinkler being on?
1435 01:00:11,440 --> 01:00:13,100 Do I think it's just as likely?
1436 01:00:13,100 --> 01:00:14,600 No, I think it's less likely, right?
1437 01:00:14,600 --> 01:00:17,058 If I go outside and see the grass is wet, there are clouds,
1438 01:00:17,058 --> 01:00:20,510 the rain is coming down, is the sprinkler
1439 01:00:20,510 --> 01:00:21,890 likely to be on or not?
1440 01:00:21,890 --> 01:00:23,930 It's likely to be off, right?
1441 01:00:23,930 --> 01:00:27,380 So there's no causal relationship,
1442 01:00:27,380 --> 01:00:29,880 but there's the probabilistic relationship through the graph
1443 01:00:29,880 --> 01:00:30,380 structure.
1444 01:00:30,380 --> 01:00:32,070 And that's called explaining away.
1445 01:00:32,070 --> 01:00:34,820 And you can take a whole course on how to understand
1446 01:00:34,820 --> 01:00:37,530 which relationships you can detect and which not.
1447 01:00:37,530 --> 01:00:40,410 This is not the place to try to go into that,
1448 01:00:40,410 --> 01:00:42,510 but I hope you'll be familiar with this problem.
1449 01:00:42,510 --> 01:00:44,490 And I'll try to give you a toy example that
1450 01:00:44,490 --> 01:00:47,180 makes it a little bit more obvious in terms
1451 01:00:47,180 --> 01:00:50,210 of the equations where this comes from.
1452 01:00:50,210 --> 01:00:57,560 So imagine this very silly game that we play, where we toss coins.
1453 01:00:57,560 --> 01:00:59,820 We toss a coin twice.
1454 01:00:59,820 --> 01:01:02,360 And if it turns up heads both times, you get a point.
1455 01:01:02,360 --> 01:01:04,631 If it turns up tails both times, you get a point.
1456 01:01:04,631 --> 01:01:07,256 But if one's a head and one's a tail, you don't get any points.
1457 01:01:10,260 --> 01:01:15,330 Now, does the probability that I tossed a head on the first time
1458 01:01:15,330 --> 01:01:19,330 depend on whether I toss a tail on the second time?
1459 01:01:19,330 --> 01:01:21,230 So causally, obviously not, right?
1460 01:01:21,230 --> 01:01:24,280 First of all, it happened earlier in time.
1461 01:01:24,280 --> 01:01:28,590 And secondly, the coin tosses are completely independent.
1462 01:01:28,590 --> 01:01:31,210 But what happens when I know the outcome? 1463 01:01:31,210 --> 01:01:34,790 What if I know what score you got? 1464 01:01:34,790 --> 01:01:40,422 So if I know your score, then is the probability 1465 01:01:40,422 --> 01:01:42,130 that I tossed the heads on the first time 1466 01:01:42,130 --> 01:01:44,100 independent of whether I got a tail on the second time? 1467 01:01:44,100 --> 01:01:44,850 What do you think? 1468 01:01:44,850 --> 01:01:47,400 How many people think it is independent then? 1469 01:01:47,400 --> 01:01:49,410 How many people think it's not independent. 1470 01:01:49,410 --> 01:01:49,910 Very good. 1471 01:01:49,910 --> 01:01:51,300 It's not independent. 1472 01:01:51,300 --> 01:01:54,520 And obviously, here's the math to prove it, 1473 01:01:54,520 --> 01:01:56,970 but your intuition does the same thing. 1474 01:01:56,970 --> 01:02:00,450 So what's the probability that I tossed a head 1475 01:02:00,450 --> 01:02:02,510 on the second time given that I got a one, 1476 01:02:02,510 --> 01:02:08,270 I scored, and I tossed a tail on the first time? 1477 01:02:08,270 --> 01:02:10,880 Obviously, it's zero, right? 1478 01:02:10,880 --> 01:02:14,570 So here's the probability of getting 1479 01:02:14,570 --> 01:02:17,810 a head in the first time and scoring one, 1480 01:02:17,810 --> 01:02:20,430 and tails on the second time is exactly zero. 1481 01:02:20,430 --> 01:02:22,270 So that's called explaining away. 1482 01:02:22,270 --> 01:02:27,050 You can reduce your belief in certain parents 1483 01:02:27,050 --> 01:02:30,490 based on what you know about the children. 1484 01:02:30,490 --> 01:02:32,760 Think of this coin toss example or the rain 1485 01:02:32,760 --> 01:02:35,620 in California and the sprinklers. 1486 01:02:35,620 --> 01:02:36,120 All right. 1487 01:02:36,120 --> 01:02:37,495 So as this come up several times, 1488 01:02:37,495 --> 01:02:39,690 how do we obtain the Bayesian network structure? 1489 01:02:39,690 --> 01:02:41,680 There are two problems that we need to be able to solve. 1490 01:02:41,680 --> 01:02:43,540 We need to be able to learn the structure, 1491 01:02:43,540 --> 01:02:48,000 and we need to be able to learn these probability tables. 1492 01:02:48,000 --> 01:02:50,540 If we know structure, how do we get the probabilities? 1493 01:02:50,540 --> 01:02:53,350 Well, we need to identify some objective function we're 1494 01:02:53,350 --> 01:02:56,367 going to try to optimize, and then choose values 1495 01:02:56,367 --> 01:02:57,950 for all probability distributions that 1496 01:02:57,950 --> 01:02:59,390 optimize that objective function. 1497 01:02:59,390 --> 01:03:00,550 And that's the kind of thing we've 1498 01:03:00,550 --> 01:03:02,310 been doing all along, just like in the Gibbs sampler. 1499 01:03:02,310 --> 01:03:04,740 We need some objective function or protein structure. 1500 01:03:04,740 --> 01:03:06,690 We need some objective function that we're 1501 01:03:06,690 --> 01:03:07,731 going to try to optimize. 1502 01:03:07,731 --> 01:03:10,090 So there are two common ones that are used a lot. 1503 01:03:10,090 --> 01:03:14,380 There's maximum likelihood and the maximum posterior. 
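Written in symbols, with D for the training data, theta for the full set of conditional probability parameters, and G for the network structure, those two objectives are

$$ \hat{\theta}_{\mathrm{ML}} = \arg\max_{\theta} P(D \mid \theta, G), \qquad \hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} \frac{P(D \mid \theta, G)\, P(\theta \mid G)}{P(D \mid G)}. $$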
1504 01:03:14,380 --> 01:03:18,176 So maximum likelihood is defined as the set of param-- theta 1505 01:03:18,176 --> 01:03:20,550 is all the parameters, all the probability distributions, 1506 01:03:20,550 --> 01:03:23,090 the probability of getting a score of one given that you had 1507 01:03:23,090 --> 01:03:25,070 heads and tails, whatever it may be. 1508 01:03:25,070 --> 01:03:26,810 The probability of getting admitted 1509 01:03:26,810 --> 01:03:29,650 given that you had certain GREs and certain grades. 1510 01:03:29,650 --> 01:03:34,010 So we want to find the set of parameters, 1511 01:03:34,010 --> 01:03:35,800 all those probability distributions, that 1512 01:03:35,800 --> 01:03:36,730 maximize this. 1513 01:03:36,730 --> 01:03:39,870 The probability of the data, our training data, 1514 01:03:39,870 --> 01:03:42,920 given those parameters. 1515 01:03:42,920 --> 01:03:44,580 That's a pretty obvious one. 1516 01:03:44,580 --> 01:03:47,620 And the maximum posterior includes some of our beliefs 1517 01:03:47,620 --> 01:03:49,610 about the prior probability of the data 1518 01:03:49,610 --> 01:03:52,740 and the prior probability of the parameters. 1519 01:03:52,740 --> 01:03:54,250 This is a little bit less intuitive 1520 01:03:54,250 --> 01:03:55,440 because you have to ask, well, where 1521 01:03:55,440 --> 01:03:56,565 do those numbers come from? 1522 01:03:56,565 --> 01:04:00,170 And that, again, is a whole course unto itself. 1523 01:04:00,170 --> 01:04:00,670 OK. 1524 01:04:00,670 --> 01:04:02,110 Now, how do you find these parameters? 1525 01:04:02,110 --> 01:04:03,850 Again, it's the kinds of search problems 1526 01:04:03,850 --> 01:04:06,620 that we've looked at before, various kinds of hill climbing. 1527 01:04:06,620 --> 01:04:09,690 So gradient descent, expectation maximization, 1528 01:04:09,690 --> 01:04:12,080 Gibbs sampling, which you've looked at explicitly. 1529 01:04:12,080 --> 01:04:14,140 And again, the full details of how to do that 1530 01:04:14,140 --> 01:04:15,850 are outside of our scope today. 1531 01:04:15,850 --> 01:04:16,350 OK. 1532 01:04:16,350 --> 01:04:19,490 So in our example of this coin toss game, 1533 01:04:19,490 --> 01:04:21,740 we would use one of these two functions 1534 01:04:21,740 --> 01:04:25,800 to try to decide what's the probability of getting 1535 01:04:25,800 --> 01:04:29,000 heads or tails for any given score. 1536 01:04:29,000 --> 01:04:30,931 That's what the kinds of parameters are. 1537 01:04:34,302 --> 01:04:35,760 Now, the structure problem actually 1538 01:04:35,760 --> 01:04:37,260 turns out to be really, really hard, 1539 01:04:37,260 --> 01:04:40,600 because there are a very exponentially large number 1540 01:04:40,600 --> 01:04:42,890 of potential structures to draw from. 1541 01:04:42,890 --> 01:04:47,410 And unless you've got some prior knowledge, 1542 01:04:47,410 --> 01:04:51,070 it can be impossible, depending on how much data you have, 1543 01:04:51,070 --> 01:04:53,060 to actually build this structure. 1544 01:04:53,060 --> 01:04:56,697 So there are many algorithms that have been proposed. 1545 01:04:56,697 --> 01:04:58,280 And a lot of our settings, we're going 1546 01:04:58,280 --> 01:04:59,950 to use some kind of prior knowledge 1547 01:04:59,950 --> 01:05:01,422 to reduce the search space. 
1548 01:05:01,422 --> 01:05:03,880 So if we're trying to talk about transcriptional regulatory 1549 01:05:03,880 --> 01:05:06,240 networks, it's very common to assume that there are only 1550 01:05:06,240 --> 01:05:10,325 some kinds of nodes that can be causes and other kinds of nodes 1551 01:05:10,325 --> 01:05:11,450 that can be effects, right? 1552 01:05:11,450 --> 01:05:13,550 So gene expression would be the effect, 1553 01:05:13,550 --> 01:05:15,780 and then you would limit your causes 1554 01:05:15,780 --> 01:05:17,270 to only be transcription factors, 1555 01:05:17,270 --> 01:05:19,519 or maybe signaling molecules or something 1556 01:05:19,519 --> 01:05:22,040 like that, and not allow all 20,000 genes to be causes 1557 01:05:22,040 --> 01:05:26,800 and all 20,000 genes to be effects. 1558 01:05:26,800 --> 01:05:29,301 So there are a lot of resources to learn more 1559 01:05:29,301 --> 01:05:30,300 about Bayesian networks. 1560 01:05:30,300 --> 01:05:33,530 As I said, you can have whole courses on this. 1561 01:05:33,530 --> 01:05:36,299 I think there are a lot of good tutorials at this website. 1562 01:05:36,299 --> 01:05:38,590 I've also put in the notes a little toy example for you 1563 01:05:38,590 --> 01:05:42,880 to work through all the probabilities, which I think, 1564 01:05:42,880 --> 01:05:45,290 in the interest of time, we won't go through in detail. 1565 01:05:49,530 --> 01:05:50,030 All right. 1566 01:05:50,030 --> 01:05:52,490 So to motivate what we're going to do in the next lecture, 1567 01:05:52,490 --> 01:05:54,440 I just want to talk about other kinds of data 1568 01:05:54,440 --> 01:05:57,140 that you could bring to bear on this problem of predicting 1569 01:05:57,140 --> 01:05:58,404 which proteins interact. 1570 01:05:58,404 --> 01:05:59,820 We'll see, then, how that gets fed 1571 01:05:59,820 --> 01:06:01,856 into an interaction Bayesian network 1572 01:06:01,856 --> 01:06:02,855 to make the predictions. 1573 01:06:05,660 --> 01:06:08,084 So we've talked about affinity capture and two-hybrid, 1574 01:06:08,084 --> 01:06:09,500 but what other kinds of data could 1575 01:06:09,500 --> 01:06:12,130 we use to predict the probability of interaction? 1576 01:06:12,130 --> 01:06:14,930 Well, one thing you could use would be gene expression data. 1577 01:06:14,930 --> 01:06:17,430 And the idea is that if two proteins interact, 1578 01:06:17,430 --> 01:06:20,339 they should be present in the cell at the same time, right? 1579 01:06:20,339 --> 01:06:21,880 So we talked about this a little bit. 1580 01:06:21,880 --> 01:06:23,380 If they're anti-correlated, it seems 1581 01:06:23,380 --> 01:06:24,670 very unlikely they interact. 1582 01:06:24,670 --> 01:06:26,950 What about if they're correlated, but not perfectly 1583 01:06:26,950 --> 01:06:27,970 correlated? 1584 01:06:27,970 --> 01:06:33,290 So here's a plot that shows a histogram of proteins that 1585 01:06:33,290 --> 01:06:37,090 are known to interact and proteins that are known not to interact. 1586 01:06:37,090 --> 01:06:40,010 So the empty circles are known interacting proteins, 1587 01:06:40,010 --> 01:06:42,810 the dark circles are non-interacting proteins, 1588 01:06:42,810 --> 01:06:46,680 and the other ones are based on the experimental data. 1589 01:06:46,680 --> 01:06:49,020 And the distance here is the difference 1590 01:06:49,020 --> 01:06:50,300 between expression profiles.
1591 01:06:50,300 --> 01:06:52,930 And we'll talk in a coming lecture about exactly how to compute 1592 01:06:52,930 --> 01:06:55,380 distance between expression profiles. 1593 01:06:55,380 --> 01:06:57,700 But the further to the right it is, the less similar 1594 01:06:57,700 --> 01:07:01,004 the expression profiles are across large data sets. 1595 01:07:01,004 --> 01:07:02,420 So what you see is that the interacting 1596 01:07:02,420 --> 01:07:04,740 proteins tend to be shifted more to the left, with more 1597 01:07:04,740 --> 01:07:08,560 similar expression profiles than the non-interacting ones. 1598 01:07:08,560 --> 01:07:11,637 But what do you notice about this? 1599 01:07:11,637 --> 01:07:13,220 There's no way to draw a line and say, 1600 01:07:13,220 --> 01:07:15,450 everything to the right of this is in one class 1601 01:07:15,450 --> 01:07:17,670 and everything to the left is in another, right? 1602 01:07:17,670 --> 01:07:20,260 So by itself, it's not going to get us very far. 1603 01:07:20,260 --> 01:07:22,560 There are plenty of non-interacting proteins 1604 01:07:22,560 --> 01:07:25,440 that have very highly correlated gene expression and plenty 1605 01:07:25,440 --> 01:07:27,206 of interacting proteins that have poorly 1606 01:07:27,206 --> 01:07:28,330 correlated gene expression. 1607 01:07:28,330 --> 01:07:31,520 So it's a trend, not a rule. 1608 01:07:31,520 --> 01:07:33,290 Now, what about evolution? 1609 01:07:33,290 --> 01:07:37,930 So if I look over many, many organisms, I might expect what? 1610 01:07:37,930 --> 01:07:40,140 The proteins that interact with each other 1611 01:07:40,140 --> 01:07:43,800 are going to appear in the same species, right? 1612 01:07:43,800 --> 01:07:45,490 So let's look at these two cases. 1613 01:07:45,490 --> 01:07:48,530 We've got a bunch of-- eight different genomes. 1614 01:07:48,530 --> 01:07:52,290 And I've got gene 1 and gene 2, which I suspect might interact, 1615 01:07:52,290 --> 01:07:55,344 and gene 3 and gene 4, which I suspect might interact. 1616 01:07:55,344 --> 01:07:56,760 Now, looking at these two patterns 1617 01:07:56,760 --> 01:07:59,850 of evolution, for which pair do we have more confidence 1618 01:07:59,850 --> 01:08:00,800 that they actually interact? 1619 01:08:00,800 --> 01:08:03,049 The red one or the green one? 1620 01:08:03,049 --> 01:08:05,340 So what do we notice about the difference between them? 1621 01:08:05,340 --> 01:08:11,060 What's true of the red one compared to the green one? 1622 01:08:11,060 --> 01:08:11,827 Yeah. 1623 01:08:11,827 --> 01:08:14,160 AUDIENCE: The red one is only in one branch of the tree. 1624 01:08:14,160 --> 01:08:16,368 PROFESSOR: The red one is only in one branch of the tree 1625 01:08:16,368 --> 01:08:17,990 and the green one is scattered across. 1626 01:08:17,990 --> 01:08:19,439 So let's take a vote. 1627 01:08:19,439 --> 01:08:21,100 Do we believe that the red one is 1628 01:08:21,100 --> 01:08:23,189 better evidence of interaction or the green one 1629 01:08:23,189 --> 01:08:25,270 is better evidence of interaction? 1630 01:08:25,270 --> 01:08:27,319 Red? 1631 01:08:27,319 --> 01:08:29,520 Green? 1632 01:08:29,520 --> 01:08:33,399 Can I have an advocate of green? 1633 01:08:33,399 --> 01:08:36,020 Someone explain their rationale? 1634 01:08:36,020 --> 01:08:38,050 Anyone on the quiet side of the room? 1635 01:08:38,050 --> 01:08:40,525 All right, Ed.
1636 01:08:40,525 --> 01:08:43,990 AUDIENCE: Because red is only on one branch of the tree, 1637 01:08:43,990 --> 01:08:47,455 I'd expect that they're naturally more 1638 01:08:47,455 --> 01:08:50,425 correlated with each other. 1639 01:08:50,425 --> 01:08:53,395 They have less-- they appear together 1640 01:08:53,395 --> 01:09:00,517 in [INAUDIBLE] so I'd expect [INAUDIBLE]. 1641 01:09:00,517 --> 01:09:01,100 PROFESSOR: OK. 1642 01:09:01,100 --> 01:09:04,615 So the argument is that red only occurs in one part of the tree. 1643 01:09:04,615 --> 01:09:08,090 And so there could be a very simple explanation 1644 01:09:08,090 --> 01:09:10,560 for all the reds being in one part of the tree and not the other, 1645 01:09:10,560 --> 01:09:13,090 which would be a single gain or loss event. 1646 01:09:13,090 --> 01:09:13,590 Right? 1647 01:09:13,590 --> 01:09:16,359 Somewhere early on, perhaps here, 1648 01:09:16,359 --> 01:09:18,430 I gain those two proteins. 1649 01:09:18,430 --> 01:09:20,810 And then they're inherited throughout that part of the tree, 1650 01:09:20,810 --> 01:09:23,640 the way most genes get inherited. 1651 01:09:23,640 --> 01:09:26,050 Whereas here, we've got independent events 1652 01:09:26,050 --> 01:09:27,700 of gain and loss. 1653 01:09:27,700 --> 01:09:30,729 And at each one of these independent events, 1654 01:09:30,729 --> 01:09:32,520 we're getting them moving jointly, either 1655 01:09:32,520 --> 01:09:34,254 into or out of the genome. 1656 01:09:34,254 --> 01:09:35,670 So there's more evidence for green 1657 01:09:35,670 --> 01:09:38,819 to be interacting than red. 1658 01:09:38,819 --> 01:09:41,090 Everyone buy that? 1659 01:09:41,090 --> 01:09:44,510 Even some of the advocates of red? 1660 01:09:44,510 --> 01:09:46,979 Questions? 1661 01:09:46,979 --> 01:09:47,800 Yes. 1662 01:09:47,800 --> 01:09:51,768 AUDIENCE: Could there be a way of either objectively 1663 01:09:51,768 --> 01:09:58,216 or mathematically [INAUDIBLE] that way, 1664 01:09:58,216 --> 01:10:01,460 or is it just the reasoning [INAUDIBLE]? 1665 01:10:01,460 --> 01:10:03,430 PROFESSOR: One can do the statistics on it 1666 01:10:03,430 --> 01:10:04,509 with known ones, right? 1667 01:10:04,509 --> 01:10:06,050 I think that's probably the best way. 1668 01:10:06,050 --> 01:10:09,620 And we'll actually see that in one of these papers that 1669 01:10:09,620 --> 01:10:12,030 uses-- well, actually, now I don't 1670 01:10:12,030 --> 01:10:14,290 recall whether they use this co-evolution. 1671 01:10:14,290 --> 01:10:15,790 But yes, there are plenty of papers 1672 01:10:15,790 --> 01:10:17,190 that have actually done the statistics on that. 1673 01:10:17,190 --> 01:10:17,981 So it is supported. 1674 01:10:21,380 --> 01:10:23,520 And a related kind of question is 1675 01:10:23,520 --> 01:10:26,114 what's called the Rosetta Stone approach. 1676 01:10:26,114 --> 01:10:27,530 Unfortunately, the term Rosetta 1677 01:10:27,530 --> 01:10:29,930 gets used far too much in computational biology. 1678 01:10:29,930 --> 01:10:32,360 So this has nothing to do with the other Rosetta 1679 01:10:32,360 --> 01:10:35,230 that we've been talking about. 1680 01:10:35,230 --> 01:10:37,750 And this has to do with how often you 1681 01:10:37,750 --> 01:10:42,905 find the same pair of genes in the same genome versus split up 1682 01:10:42,905 --> 01:10:44,360 in different genomes. 1683 01:10:44,360 --> 01:10:45,870 OK.
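Here is a minimal sketch, with invented gene names and made-up numbers, of how the two evidence types just discussed might be scored for a candidate pair: a correlation between expression profiles, and a co-occurrence score between phylogenetic presence/absence profiles. The naive co-occurrence score deliberately ignores the tree, which is exactly why it cannot distinguish the red pattern (one ancient joint gain) from the green one (repeated joint gains and losses); tree-aware methods count independent events instead.

```python
import math

# --- 1) Expression similarity: Pearson correlation across conditions ---
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

expr_gene1 = [0.2, 1.5, 3.1, 0.9, 2.2, 0.1]   # made-up expression, 6 conditions
expr_gene2 = [0.4, 1.2, 2.8, 1.1, 2.0, 0.3]
print("expression correlation:", round(pearson(expr_gene1, expr_gene2), 3))

# --- 2) Phylogenetic profiles: presence (1) / absence (0) across 8 genomes ---
def cooccurrence(p, q):
    both = sum(a and b for a, b in zip(p, q))
    either = sum(a or b for a, b in zip(p, q))
    return both / either          # Jaccard similarity of the two profiles

profile_red_1 = [1, 1, 1, 1, 0, 0, 0, 0]      # together, but only in one clade
profile_red_2 = [1, 1, 1, 1, 0, 0, 0, 0]
profile_green_3 = [1, 0, 1, 0, 0, 1, 0, 1]    # together, scattered across the tree
profile_green_4 = [1, 0, 1, 0, 0, 1, 0, 1]

print("red co-occurrence:  ", cooccurrence(profile_red_1, profile_red_2))      # 1.0
print("green co-occurrence:", cooccurrence(profile_green_3, profile_green_4))  # 1.0
# Both pairs score 1.0 here even though the green pattern is stronger evidence;
# a tree-aware score would credit the green pair for multiple independent joint
# gain/loss events and the red pair for only one.
```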
1684 01:10:45,870 --> 01:10:48,380 So what we're going to look at next time, then, 1685 01:10:48,380 --> 01:10:52,430 is an approach that combines these kinds of data 1686 01:10:52,430 --> 01:10:55,100 with the physical protein interaction measurements 1687 01:10:55,100 --> 01:10:58,685 from the two-hybrid and the affinity capture mass spec, and that 1688 01:10:58,685 --> 01:11:00,560 actually uses the Bayesian networks we talked 1689 01:11:00,560 --> 01:11:03,490 about this time to predict whether two proteins are 1690 01:11:03,490 --> 01:11:05,940 likely to interact based on all of the available data: 1691 01:11:05,940 --> 01:11:09,950 the evolutionary arguments, the essentiality arguments, 1692 01:11:09,950 --> 01:11:12,780 and then the interaction data. 1693 01:11:12,780 --> 01:11:15,670 Any final questions? 1694 01:11:15,670 --> 01:11:18,120 OK, see you next time.
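As a preview of that combination step, here is a minimal naive-Bayes-style sketch of the general idea: treat each evidence source as roughly independent given whether the pair interacts, and multiply likelihood ratios into the prior odds. Every number below is invented for illustration; in practice the likelihood ratios are estimated from gold-standard sets of interacting and non-interacting pairs.

```python
# All numbers below are invented for illustration only.
prior_odds = 1 / 600   # assumed prior odds that a random protein pair interacts

# Hypothetical likelihood ratios P(feature | interacting) / P(feature | not)
likelihood_ratios = {
    "two_hybrid_positive": 50.0,
    "affinity_capture_positive": 100.0,
    "correlated_expression": 3.0,
    "similar_phylogenetic_profile": 4.0,
}

posterior_odds = prior_odds
for feature, lr in likelihood_ratios.items():
    posterior_odds *= lr          # naive (conditional independence) assumption

posterior_prob = posterior_odds / (1 + posterior_odds)
print(f"posterior probability of interaction ~= {posterior_prob:.3f}")
```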