1
00:00:00,060 --> 00:00:01,780
The following
content is provided

2
00:00:01,780 --> 00:00:04,019
under a Creative
Commons license.

3
00:00:04,019 --> 00:00:06,870
Your support will help MIT
OpenCourseWare continue

4
00:00:06,870 --> 00:00:10,730
to offer high quality
educational resources for free.

5
00:00:10,730 --> 00:00:13,340
To make a donation or
view additional materials

6
00:00:13,340 --> 00:00:17,217
from hundreds of MIT courses,
visit MIT OpenCourseWare

7
00:00:17,217 --> 00:00:17,842
at ocw.mit.edu.

8
00:00:26,460 --> 00:00:29,190
PROFESSOR: Welcome
back, everyone.

9
00:00:29,190 --> 00:00:30,934
I hope you had a good break.

10
00:00:30,934 --> 00:00:32,600
Hopefully you also
remember a little bit

11
00:00:32,600 --> 00:00:34,050
about what we did last time.

12
00:00:34,050 --> 00:00:35,680
So if you'll
recall, last time we

13
00:00:35,680 --> 00:00:37,130
did an introduction
to protein structure.

14
00:00:37,130 --> 00:00:38,790
We talked a little bit
about some of the issues

15
00:00:38,790 --> 00:00:40,220
in predicting protein structure.

16
00:00:40,220 --> 00:00:42,280
Now we're going to go
into that in more detail.

17
00:00:42,280 --> 00:00:45,050
And last time, we'd broken
down the structure prediction

18
00:00:45,050 --> 00:00:47,200
problem into a couple
of sub-problems.

19
00:00:47,200 --> 00:00:49,930
So there was a problem of
secondary structure prediction,

20
00:00:49,930 --> 00:00:51,680
which we discussed a
little bit last time.

21
00:00:51,680 --> 00:00:53,910
And remember that the
early algorithms developed

22
00:00:53,910 --> 00:00:57,480
in the '70s get about
60% accuracy, and decades

23
00:00:57,480 --> 00:00:59,507
of research has only
marginally improved that.

24
00:00:59,507 --> 00:01:01,340
But we're going to see
that some of the work

25
00:01:01,340 --> 00:01:03,750
on domain structure
recognition and predicting

26
00:01:03,750 --> 00:01:05,489
novel three-dimensional
structures

27
00:01:05,489 --> 00:01:07,330
has really advanced
very dramatically

28
00:01:07,330 --> 00:01:09,072
in the last few years.

29
00:01:09,072 --> 00:01:10,780
Now, the other thing
I hope you'll recall

30
00:01:10,780 --> 00:01:13,790
is that we had this dichotomy
between two approaches

31
00:01:13,790 --> 00:01:16,330
to the energetics of
protein structure.

32
00:01:16,330 --> 00:01:19,250
We had the physicist's
approach and we

33
00:01:19,250 --> 00:01:20,901
had the statistician's
approach, right?

34
00:01:20,901 --> 00:01:23,400
Now, what were some of the key
differences between these two

35
00:01:23,400 --> 00:01:24,810
approaches?

36
00:01:24,810 --> 00:01:26,640
Anyone want to
volunteer a difference

37
00:01:26,640 --> 00:01:29,047
between the statistical
approach to parametrizing

38
00:01:29,047 --> 00:01:30,130
the energy of a structure?

39
00:01:30,130 --> 00:01:32,130
So we're trying to come
up with an equation that

40
00:01:32,130 --> 00:01:34,380
will convert coordinates
into energy, right?

41
00:01:34,380 --> 00:01:36,810
And what were some of the
differences between the physics

42
00:01:36,810 --> 00:01:38,393
approach and the
statistical approach?

43
00:01:41,266 --> 00:01:41,890
Any volunteers?

44
00:01:41,890 --> 00:01:42,230
Yes.

45
00:01:42,230 --> 00:01:43,518
AUDIENCE: I think the
statistical approach didn't

46
00:01:43,518 --> 00:01:45,611
change the phi and
psi angles, right?

47
00:01:45,611 --> 00:01:49,000
It just changed other variables.

48
00:01:49,000 --> 00:01:50,290
PROFESSOR: So you're close.

49
00:01:50,290 --> 00:01:50,480
Right.

50
00:01:50,480 --> 00:01:52,865
So the statistical-- or maybe
you said the right thing,

51
00:01:52,865 --> 00:01:53,000
actually.

52
00:01:53,000 --> 00:01:54,870
So the statistical
approach keeps a lot

53
00:01:54,870 --> 00:01:57,930
of the pieces of the protein rigid,
whereas the physics approach

54
00:01:57,930 --> 00:02:00,000
allows all the atoms
to move independently.

55
00:02:00,000 --> 00:02:01,500
So one of the key
differences, then,

56
00:02:01,500 --> 00:02:04,390
is that in the physics
approach, two atoms that

57
00:02:04,390 --> 00:02:07,210
are bonded to each other
still move apart based

58
00:02:07,210 --> 00:02:09,479
on a spring function.

59
00:02:09,479 --> 00:02:12,820
It's a very stiff spring, but
the atoms move independently.

60
00:02:12,820 --> 00:02:14,330
In the statistical
approach, we just

61
00:02:14,330 --> 00:02:15,700
fix the distance between them.

62
00:02:15,700 --> 00:02:18,580
Similarly for a tetrahedrally
coordinated atom,

63
00:02:18,580 --> 00:02:22,529
in the physics approach
those angles can deform.

64
00:02:22,529 --> 00:02:24,320
In the statistical
approach, they're fixed.

65
00:02:24,320 --> 00:02:24,830
Right?

66
00:02:24,830 --> 00:02:26,246
So in the statistical
approach, we

67
00:02:26,246 --> 00:02:29,590
have more or less
fixed geometry.

68
00:02:29,590 --> 00:02:32,550
In the physics approach, every
atom moves independently.

69
00:02:32,550 --> 00:02:34,750
Anyone else remember
another key difference?

70
00:02:34,750 --> 00:02:37,220
Where do the energy
functions come from?

71
00:02:46,420 --> 00:02:46,920
Volunteers?

72
00:02:46,920 --> 00:02:47,310
All right.

73
00:02:47,310 --> 00:02:48,940
So in the physics
approach, they're all

74
00:02:48,940 --> 00:02:52,170
derived as much as possible
from physical principles,

75
00:02:52,170 --> 00:02:52,970
you might imagine.

76
00:02:52,970 --> 00:02:54,594
Whereas in the
statistical approach,

77
00:02:54,594 --> 00:02:57,260
we're trying to recreate what we
see in nature, even if we don't

78
00:02:57,260 --> 00:02:59,560
have a good physical
grounding for it.

79
00:02:59,560 --> 00:03:01,080
So this is most
dramatic in trying

80
00:03:01,080 --> 00:03:02,950
to predict the
solvation free energies.

81
00:03:02,950 --> 00:03:03,450
Right?

82
00:03:03,450 --> 00:03:07,050
How much does it cost you if
you put a hydrophobic atom

83
00:03:07,050 --> 00:03:08,950
into a polar environment?

84
00:03:08,950 --> 00:03:09,450
Right?

85
00:03:09,450 --> 00:03:11,070
So in the physics
approach, you actually

86
00:03:11,070 --> 00:03:11,925
have to have water molecules.

87
00:03:11,925 --> 00:03:13,341
They have to
interact with matter.

88
00:03:13,341 --> 00:03:15,310
That turns out to be
really, really hard to do.

89
00:03:15,310 --> 00:03:18,000
In the statistical approach, we
come up with an approximation.

90
00:03:18,000 --> 00:03:20,275
How much solvent
accessible surface area

91
00:03:20,275 --> 00:03:23,590
is there on the polar
atom when it's free?

92
00:03:23,590 --> 00:03:25,630
When it's in the
protein structure?

93
00:03:25,630 --> 00:03:30,110
And then we scale the transfer
energies by that amount.

94
00:03:30,110 --> 00:03:32,730
OK, so these are then
the main differences.

95
00:03:35,284 --> 00:03:36,480
Gotta be careful here.

96
00:03:39,560 --> 00:03:42,100
So we've got fixed geometry
in the statistical approach.

97
00:03:42,100 --> 00:03:43,900
We often use discrete rotamers.

98
00:03:43,900 --> 00:03:44,400
Remember?

99
00:03:44,400 --> 00:03:48,160
The side-chain angles, in
principle, can rotate freely.

100
00:03:48,160 --> 00:03:49,860
But only
a few conformations

101
00:03:49,860 --> 00:03:53,140
are typically observed, so
we often restrict ourselves

102
00:03:53,140 --> 00:03:56,080
to the most commonly observed
combinations of the chi angles.

103
00:03:56,080 --> 00:03:57,900
And then we have the
statistical potential

104
00:03:57,900 --> 00:03:59,690
that depends on the
frequency at which we

105
00:03:59,690 --> 00:04:01,005
observe things in the database.

106
00:04:01,005 --> 00:04:03,130
And that could be the
frequency at which we observe

107
00:04:03,130 --> 00:04:05,569
particular atoms at
precise distances.

108
00:04:05,569 --> 00:04:07,610
It could be the fraction
of time that something's

109
00:04:07,610 --> 00:04:11,509
solvent accessible versus not.

110
00:04:11,509 --> 00:04:13,550
And the other thing that
we talked about a little

111
00:04:13,550 --> 00:04:15,260
bit last time was
this thought problem.

112
00:04:15,260 --> 00:04:16,682
If I have a protein
sequence and I

113
00:04:16,682 --> 00:04:18,640
have two potential
structures, how

114
00:04:18,640 --> 00:04:20,250
could I use these
potential energies--

115
00:04:20,250 --> 00:04:22,630
whether they're derived
from the physics approach

116
00:04:22,630 --> 00:04:24,090
or from the
statistical approach--

117
00:04:24,090 --> 00:04:27,110
how could I use these potential
energies to decide which

118
00:04:27,110 --> 00:04:29,940
of the two structures
is correct?

119
00:04:29,940 --> 00:04:32,737
So one possibility is that
I have two structures.

120
00:04:32,737 --> 00:04:35,070
One of them is truly the
structure and the other is not.

121
00:04:35,070 --> 00:04:35,570
Right?

122
00:04:35,570 --> 00:04:37,470
Your fiendish lab mate
knows the structure

123
00:04:37,470 --> 00:04:39,220
but refuses to tell you.

124
00:04:39,220 --> 00:04:41,632
So in that case,
what would I do?

125
00:04:41,632 --> 00:04:43,590
I know that one of these
structures is correct.

126
00:04:43,590 --> 00:04:44,400
I don't know which one.

127
00:04:44,400 --> 00:04:46,180
How could I use the
potential energy function

128
00:04:46,180 --> 00:04:47,490
to decide which one's correct?

129
00:04:54,240 --> 00:04:56,485
What's going to be true
of the correct structure?

130
00:04:56,485 --> 00:04:57,730
AUDIENCE: Minimal energy.

131
00:04:57,730 --> 00:04:58,810
PROFESSOR: It's going
to have lower energy.

132
00:04:58,810 --> 00:04:59,830
So is that sufficient?

133
00:04:59,830 --> 00:05:00,200
No.

134
00:05:00,200 --> 00:05:00,400
Right?

135
00:05:00,400 --> 00:05:02,280
There's a subtlety
we have to face here.

136
00:05:02,280 --> 00:05:06,900
So if I just plug my protein
sequence onto one of these two

137
00:05:06,900 --> 00:05:09,932
structures and compute
the free energy,

138
00:05:09,932 --> 00:05:11,640
there's no guarantee
that the correct one

139
00:05:11,640 --> 00:05:12,806
will have lower free energy.

140
00:05:12,806 --> 00:05:15,010
Why?

141
00:05:15,010 --> 00:05:19,180
What decision do I have to
make when I put a protein

142
00:05:19,180 --> 00:05:20,990
sequence onto a
backbone structure?

143
00:05:24,750 --> 00:05:25,265
Yes.

144
00:05:25,265 --> 00:05:26,890
AUDIENCE: How to
orient the side chain.

145
00:05:26,890 --> 00:05:27,215
PROFESSOR: Exactly.

146
00:05:27,215 --> 00:05:29,254
I need to decide how to
orient the side chains.

147
00:05:29,254 --> 00:05:30,670
If I orient the
side chains wrong,

148
00:05:30,670 --> 00:05:32,674
then I'll have side chains
literally overlapping

149
00:05:32,674 --> 00:05:33,340
with each other.

150
00:05:33,340 --> 00:05:35,532
That'll have incredibly
high energy, right?

151
00:05:35,532 --> 00:05:36,990
So there's no
guarantee that simply

152
00:05:36,990 --> 00:05:39,360
having the right
structure will give you

153
00:05:39,360 --> 00:05:41,640
the minimal free energy
until you correctly

154
00:05:41,640 --> 00:05:43,299
place all the side chains.

155
00:05:43,299 --> 00:05:44,590
OK, but that's the simple case.

156
00:05:44,590 --> 00:05:46,090
Now, that's in the
case where you've

157
00:05:46,090 --> 00:05:49,280
got this fiendish friend who
knows the correct structure.

158
00:05:49,280 --> 00:05:51,680
But of course, in the general
domain recognition problem,

159
00:05:51,680 --> 00:05:53,180
we don't know the
correct structure.

160
00:05:53,180 --> 00:05:54,260
We have homologues.

161
00:05:54,260 --> 00:05:56,600
So we have some
sequence, and we believe

162
00:05:56,600 --> 00:05:59,535
that it's either homologous
to Protein A or to Protein B,

163
00:05:59,535 --> 00:06:01,380
and I want to decide
which one's correct.

164
00:06:01,380 --> 00:06:03,500
So in both cases, the
structure's wrong.

165
00:06:03,500 --> 00:06:05,410
It's this question of
how wrong it is, right?

166
00:06:05,410 --> 00:06:06,530
So now the problem
actually becomes

167
00:06:06,530 --> 00:06:08,960
harder, because not only do
I need to get the right side

168
00:06:08,960 --> 00:06:11,020
chain conformations, but I
need to get the right backbone

169
00:06:11,020 --> 00:06:11,600
conformation.

170
00:06:11,600 --> 00:06:14,130
It's going to be close to one
of these structures, perhaps,

171
00:06:14,130 --> 00:06:17,320
but it's never going
to be identical.

172
00:06:17,320 --> 00:06:19,440
So both of these
situations are examples

173
00:06:19,440 --> 00:06:21,230
where we have to do some
kind of refinement

174
00:06:21,230 --> 00:06:22,606
of an initial
starting structure.

175
00:06:22,606 --> 00:06:24,771
And what we're going to
talk about for the next part

176
00:06:24,771 --> 00:06:26,680
of the lecture are
alternative strategies

177
00:06:26,680 --> 00:06:28,950
for refining a partially
correct structure.

178
00:06:28,950 --> 00:06:31,050
And we're going to look
at three strategies.

179
00:06:31,050 --> 00:06:34,034
The simplest one is called
energy minimization.

180
00:06:34,034 --> 00:06:35,950
Then we're going to look
at molecular dynamics

181
00:06:35,950 --> 00:06:38,690
and simulated annealing.

182
00:06:38,690 --> 00:06:40,910
So energy minimization
starts with this principle

183
00:06:40,910 --> 00:06:43,409
that we talked about last time--
I remember that came up here--

184
00:06:43,409 --> 00:06:46,300
that a stable structure has to
be a minimum of free energy.

185
00:06:46,300 --> 00:06:46,800
Right?

186
00:06:46,800 --> 00:06:49,852
Because if it's not, then there
are forces acting on the atoms

187
00:06:49,852 --> 00:06:51,310
that are going
to drive it away

188
00:06:51,310 --> 00:06:53,460
from that structure to
some other structure.

189
00:06:53,460 --> 00:06:55,730
Now, the fact that it is
a minimum of free energy

190
00:06:55,730 --> 00:06:58,690
does not guarantee that it is
the minimum of free energy.

191
00:06:58,690 --> 00:07:02,220
So it's possible that there
are other energetic minima.

192
00:07:02,220 --> 00:07:02,760
Right?

193
00:07:02,760 --> 00:07:05,120
The protein structure,
if it's stable,

194
00:07:05,120 --> 00:07:08,180
is at the very least a
local energetic minimum.

195
00:07:08,180 --> 00:07:10,330
It may also be the global
free energy minimum.

196
00:07:10,330 --> 00:07:12,410
We just don't know
the answer to that.

197
00:07:12,410 --> 00:07:14,110
Now, this was a
big area of debate

198
00:07:14,110 --> 00:07:16,810
in the early days of the
protein structure field,

199
00:07:16,810 --> 00:07:19,350
whether proteins could
fold spontaneously.

200
00:07:19,350 --> 00:07:22,230
If they did, then it meant
that they were at least

201
00:07:22,230 --> 00:07:24,290
apparently global
free energy minima.

202
00:07:24,290 --> 00:07:26,111
Chris Anfinsen actually
won the Nobel Prize

203
00:07:26,111 --> 00:07:27,860
for demonstrating that
some proteins could

204
00:07:27,860 --> 00:07:29,950
fold independently
outside of the cell.

205
00:07:29,950 --> 00:07:32,930
So at least some proteins had
all the structural information

206
00:07:32,930 --> 00:07:35,005
implicit in their
sequence, right?

207
00:07:35,005 --> 00:07:37,380
And that seems to imply that
they are global free energy

208
00:07:37,380 --> 00:07:38,140
minima.

209
00:07:38,140 --> 00:07:40,280
But there are other
proteins, we now know,

210
00:07:40,280 --> 00:07:42,030
where the most commonly
observed structure

211
00:07:42,030 --> 00:07:44,770
is only a local
free energy minimum.

212
00:07:44,770 --> 00:07:47,420
And it's got very high energetic
barriers that prevent it

213
00:07:47,420 --> 00:07:50,640
from actually getting to the
global free energy minimum.

214
00:07:50,640 --> 00:07:52,382
But regardless of
the case, if we

215
00:07:52,382 --> 00:07:53,840
have an initial
starting structure,

216
00:07:53,840 --> 00:07:56,580
we could try to find the nearest
local free energy minimum,

217
00:07:56,580 --> 00:07:59,134
and perhaps that is
the stable structure.

218
00:07:59,134 --> 00:08:00,550
So in our context,
we were talking

219
00:08:00,550 --> 00:08:03,890
about packing the side chains
on the surface of the protein

220
00:08:03,890 --> 00:08:06,640
that we believe might
be the right structure.

221
00:08:06,640 --> 00:08:08,680
So imagine that this
is the true structure

222
00:08:08,680 --> 00:08:10,280
and we've got the
side chain, and it's

223
00:08:10,280 --> 00:08:13,320
making the dashed green lines
represent hydrogen bonds.

224
00:08:13,320 --> 00:08:15,630
It's making a series
of hydrogen bonds

225
00:08:15,630 --> 00:08:17,910
from this nitrogen
and this oxygen

226
00:08:17,910 --> 00:08:20,100
to pieces of the
rest of the protein.

227
00:08:20,100 --> 00:08:22,480
Now, we get the crude
backbone structure.

228
00:08:22,480 --> 00:08:23,820
We pop in our side chains.

229
00:08:23,820 --> 00:08:26,240
We don't necessarily-- in
fact, we almost never--

230
00:08:26,240 --> 00:08:28,820
will choose randomly to
have the right conformation

231
00:08:28,820 --> 00:08:30,660
to pick up all these
hydrogen bonds.

232
00:08:30,660 --> 00:08:32,610
So we'll start off with
some structure that

233
00:08:32,610 --> 00:08:34,210
looks like this,
where it's rotated,

234
00:08:34,210 --> 00:08:37,080
so that instead of seeing both
the nitrogen and the oxygen,

235
00:08:37,080 --> 00:08:39,600
you can only see the profile.

236
00:08:39,600 --> 00:08:43,970
And so the question is
whether we can get from one to the other

237
00:08:43,970 --> 00:08:47,232
by following the
energetic minima.

238
00:08:47,232 --> 00:08:48,190
So that's the question.

239
00:08:48,190 --> 00:08:49,564
How would we go
about doing this?

240
00:08:49,564 --> 00:08:51,700
Well, we have this
function that tells us

241
00:08:51,700 --> 00:08:54,167
the potential energy for every
XYZ coordinate of the atom.

242
00:08:54,167 --> 00:08:55,750
That's what we talked
about last time,

243
00:08:55,750 --> 00:08:57,280
and you can go back
and look at your notes

244
00:08:57,280 --> 00:08:58,400
for those two approaches.

245
00:08:58,400 --> 00:09:00,727
So how could we minimize
this free energy?

246
00:09:00,727 --> 00:09:02,560
Well, it's no different
from other functions

247
00:09:02,560 --> 00:09:03,950
that we want to minimize, right?

248
00:09:03,950 --> 00:09:05,158
We take the first derivative.

249
00:09:05,158 --> 00:09:07,424
We look for places where the
first derivative is zero.

250
00:09:07,424 --> 00:09:09,840
The one difference is that we
can't write out analytically

251
00:09:09,840 --> 00:09:11,850
what this function
looks like and choose

252
00:09:11,850 --> 00:09:16,130
directions and locations in
space that are the minima.

253
00:09:16,130 --> 00:09:18,280
So we're going to have
to take an approach that

254
00:09:18,280 --> 00:09:22,010
has a series of perturbations to
a structure that try to improve

255
00:09:22,010 --> 00:09:25,209
the free energy systematically.

256
00:09:25,209 --> 00:09:27,750
The simplest to understand is
this gradient descent approach,

257
00:09:27,750 --> 00:09:30,810
which says that I have some
initial coordinates that I

258
00:09:30,810 --> 00:09:35,120
choose and I take a
step in the direction

259
00:09:35,120 --> 00:09:39,100
of the first derivative
of the function.

260
00:09:39,100 --> 00:09:40,267
So what does that look like?

261
00:09:40,267 --> 00:09:41,516
So here are two possibilities.

262
00:09:41,516 --> 00:09:42,670
I've got this function.

263
00:09:42,670 --> 00:09:47,647
If I start off at x equals
2, this minus some epsilon,

264
00:09:47,647 --> 00:09:49,480
some small value times
the first derivative,

265
00:09:49,480 --> 00:09:51,195
is going to point
me to the left.

266
00:09:51,195 --> 00:09:53,230
And I'm going to take
steps to the left

267
00:09:53,230 --> 00:09:57,390
until this function, f prime,
the first derivative, is zero.

268
00:09:57,390 --> 00:09:59,100
Then I'm going to stop moving.

269
00:09:59,100 --> 00:10:01,990
So I move from my initial
coordinate a little bit

270
00:10:01,990 --> 00:10:04,360
each time to the left
until I get to the minimum.

271
00:10:04,360 --> 00:10:06,320
And similarly, if I
start off on the right,

272
00:10:06,320 --> 00:10:08,170
I'll move a little bit
further to the right

273
00:10:08,170 --> 00:10:10,187
each time until the
first derivative is zero.

274
00:10:10,187 --> 00:10:11,270
So that looks pretty good.

275
00:10:11,270 --> 00:10:13,210
It can take a lot
of steps, though.

276
00:10:13,210 --> 00:10:16,100
And it's not actually guaranteed
to have great convergence

277
00:10:16,100 --> 00:10:16,600
properties.

278
00:10:16,600 --> 00:10:18,849
Because of the number of
steps you might have to take,

279
00:10:18,849 --> 00:10:20,770
it might take quite a long time.

280
00:10:20,770 --> 00:10:22,220
So that's the first
derivative, in

281
00:10:22,220 --> 00:10:24,499
a simple one-dimensional case.
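
The one-dimensional update described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function f(x) = (x - 1)^2, the step size `epsilon`, and the stopping tolerance `tol` are hypothetical choices, not the function shown on the lecture slide.

```python
def gradient_descent_1d(f_prime, x0, epsilon=0.1, tol=1e-8, max_steps=10_000):
    """Repeat x <- x - epsilon * f'(x) until the derivative is nearly zero."""
    x = x0
    for _ in range(max_steps):
        slope = f_prime(x)
        if abs(slope) < tol:       # first derivative is (almost) zero: stop moving
            break
        x = x - epsilon * slope    # small step against the slope
    return x


# Hypothetical example: f(x) = (x - 1)^2, whose single minimum is at x = 1.
def f_prime(x):
    return 2.0 * (x - 1.0)


x_from_right = gradient_descent_1d(f_prime, x0=2.0)    # steps to the left
x_from_left = gradient_descent_1d(f_prime, x0=-3.0)    # steps to the right
print(x_from_right, x_from_left)
```

With a small `epsilon` each step shrinks as the slope flattens, which is one reason the method can need many steps to converge.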

282
00:10:24,499 --> 00:10:26,415
We're dealing with a
multi-dimensional vector,

283
00:10:26,415 --> 00:10:27,810
so instead of doing
the first derivative

284
00:10:27,810 --> 00:10:29,268
we use the gradient,
which is a set

285
00:10:29,268 --> 00:10:31,230
of partial first derivatives.

286
00:10:31,230 --> 00:10:34,060
And I think one thing that's
useful to point out here

287
00:10:34,060 --> 00:10:37,350
is that, of course, the force
is negative of the gradient

288
00:10:37,350 --> 00:10:38,762
of the potential energy.

289
00:10:38,762 --> 00:10:40,220
So when we do
gradient descent, you

290
00:10:40,220 --> 00:10:42,247
can think of it from
a physical perspective

291
00:10:42,247 --> 00:10:44,205
as always moving in the
direction of the force.

292
00:10:46,770 --> 00:10:47,850
So I have some structure.

293
00:10:47,850 --> 00:10:50,100
It's not the true
native structure,

294
00:10:50,100 --> 00:10:52,680
but I take incremental steps
in the direction of the force

295
00:10:52,680 --> 00:10:54,690
and I move towards
some local minimum.
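
As a rough sketch of the multi-dimensional case, the snippet below follows the force (the negative gradient) on a made-up two-coordinate energy surface. The functions `energy` and `gradient`, the step size, and the starting points are all assumptions chosen for illustration; a real protein energy function has one coordinate per atom or per rotatable bond.

```python
def energy(p):
    """Toy energy with two wells along x; the well near x = -1 is the deeper one."""
    x, y = p
    return (x**2 - 1.0)**2 + 0.3 * x + y**2


def gradient(p):
    """Partial first derivatives of the toy energy."""
    x, y = p
    return (4.0 * x * (x**2 - 1.0) + 0.3, 2.0 * y)


def minimize(p0, step=0.05, tol=1e-8, max_steps=100_000):
    """Take small steps in the direction of the force until the force vanishes."""
    p = tuple(p0)
    for _ in range(max_steps):
        force = tuple(-g for g in gradient(p))   # force = -gradient of the energy
        if max(abs(f) for f in force) < tol:
            break
        p = tuple(pi + step * fi for pi, fi in zip(p, force))
    return p


end_a = minimize((1.5, 0.5))     # starts in the right-hand well
end_b = minimize((-1.5, 0.5))    # starts in the left-hand well
print(end_a, energy(end_a))
print(end_b, energy(end_b))
```

The two runs stop in different wells: each start slides to its nearest minimum, and only one of those is the lower-energy one.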

296
00:10:59,059 --> 00:11:01,350
And we've done this in the
case of a continuous energy,

297
00:11:01,350 --> 00:11:03,516
but you can actually also
do this for discrete ones.

298
00:11:03,516 --> 00:11:05,370
Now, the critical point
was that you're not

299
00:11:05,370 --> 00:11:08,840
guaranteed to get to the
correct energetic structure.

300
00:11:08,840 --> 00:11:12,820
So in the case that I showed
you before where we had the side

301
00:11:12,820 --> 00:11:16,300
chain side-on, if you actually
do the minimization there,

302
00:11:16,300 --> 00:11:19,280
you actually end up with the
side chain rotated 180 degrees

303
00:11:19,280 --> 00:11:20,500
from where it's supposed to be.

304
00:11:20,500 --> 00:11:22,390
So it eliminates all
the steric clashes,

305
00:11:22,390 --> 00:11:25,260
but it doesn't actually pick
up all the hydrogen bonds.

306
00:11:25,260 --> 00:11:28,700
So this is an example of a
local energetic minimum that's

307
00:11:28,700 --> 00:11:31,450
not the global energetic minimum.

308
00:11:31,450 --> 00:11:32,610
Any questions on that?

309
00:11:35,570 --> 00:11:36,590
Yes.

310
00:11:36,590 --> 00:11:38,850
AUDIENCE: Where do all these
n-dimensional equations

311
00:11:38,850 --> 00:11:39,522
come from?

312
00:11:39,522 --> 00:11:40,980
PROFESSOR: Where
do what come from?

313
00:11:40,980 --> 00:11:43,014
AUDIENCE: The
n-dimensional equations.

314
00:11:43,014 --> 00:11:45,180
PROFESSOR: So these are the
equations for the energy

315
00:11:45,180 --> 00:11:48,280
in terms of every single
atom in the protein

316
00:11:48,280 --> 00:11:50,590
if you're allowing the
atoms to move, or in terms

317
00:11:50,590 --> 00:11:52,060
of every rotatable
bond, if you're

318
00:11:52,060 --> 00:11:54,170
allowing only bonds to rotate.

319
00:11:54,170 --> 00:11:57,230
So the question was, where do
the multi-dimensional equations

320
00:11:57,230 --> 00:11:58,804
come from.

321
00:11:58,804 --> 00:11:59,470
Other questions?

322
00:12:03,063 --> 00:12:04,470
OK.

323
00:12:04,470 --> 00:12:06,450
All right, so that's
the simplest approach.

324
00:12:06,450 --> 00:12:07,970
Literally minimize the energy.

325
00:12:07,970 --> 00:12:10,303
But we said it has this problem
that it's not guaranteed

326
00:12:10,303 --> 00:12:12,100
to find the global
free energy minimum.

327
00:12:12,100 --> 00:12:14,550
Another approach is
molecular dynamics.

328
00:12:14,550 --> 00:12:16,210
So this actually
attempts to simulate

329
00:12:16,210 --> 00:12:19,320
what's going on in a
protein structure in vitro,

330
00:12:19,320 --> 00:12:22,576
by simulating the force on
every atom and the velocity.

331
00:12:22,576 --> 00:12:24,450
Previously, there was
no measure of velocity.

332
00:12:24,450 --> 00:12:24,890
Right?

333
00:12:24,890 --> 00:12:25,973
All the atoms were static.

334
00:12:25,973 --> 00:12:27,840
We looked at what the
gradient of the energy

335
00:12:27,840 --> 00:12:29,860
was and we moved by
some arbitrary step

336
00:12:29,860 --> 00:12:31,900
function in the
direction of the force.

337
00:12:31,900 --> 00:12:33,020
Now we're actually
going to have velocities

338
00:12:33,020 --> 00:12:34,269
associated with all the atoms.

339
00:12:34,269 --> 00:12:36,210
They're going to be
moving around in space.

340
00:12:36,210 --> 00:12:39,157
And the
coordinate at any time t of i

341
00:12:39,157 --> 00:12:40,990
is going to be determined
by the coordinates

342
00:12:40,990 --> 00:12:44,160
of the previous
time, t of i minus 1

343
00:12:44,160 --> 00:12:46,045
plus a velocity
times the time step.

344
00:12:46,045 --> 00:12:47,920
And the velocities are
going to be determined

345
00:12:47,920 --> 00:12:49,510
by the forces,
which are determined

346
00:12:49,510 --> 00:12:51,700
by the gradient of
the potential energy.

347
00:12:51,700 --> 00:12:52,200
Right?

348
00:12:52,200 --> 00:12:54,950
So we start off, always, with
that potential energy function,

349
00:12:54,950 --> 00:12:58,054
which is either from
the physics approach

350
00:12:58,054 --> 00:12:59,220
or the statistical approach.

351
00:12:59,220 --> 00:13:00,980
That gives us
velocities, eventually

352
00:13:00,980 --> 00:13:02,245
giving us the coordinates.
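
A minimal sketch of that update loop, assuming a single particle on a harmonic spring (energy 0.5 * k * x^2) rather than a protein. The names `simulate`, `k`, `mass`, and `dt` are hypothetical. The velocity is updated from the force first and the position from the new velocity, which is one simple way to order the two steps; production molecular dynamics codes typically use Verlet-family integrators.

```python
def simulate(x0, v0, k=1.0, mass=1.0, dt=0.01, n_steps=1000):
    """Integrate one particle on a spring: the force sets the velocity, the velocity sets the position."""
    def force(x):
        return -k * x                      # F = -dE/dx for E = 0.5 * k * x^2

    x, v = x0, v0
    trajectory = [x]
    for _ in range(n_steps):
        v = v + (force(x) / mass) * dt     # velocity from the force
        x = x + v * dt                     # x(t_i) = x(t_{i-1}) + v * dt
        trajectory.append(x)
    return trajectory, v


trajectory, v_end = simulate(x0=1.0, v0=0.0)
x_end = trajectory[-1]
total_energy = 0.5 * v_end**2 + 0.5 * x_end**2   # kinetic + potential, with k = mass = 1
print(x_end, v_end, total_energy)
```

Unlike energy minimization, the particle here never settles: it keeps oscillating, trading potential energy for kinetic energy while the total stays close to its starting value.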

353
00:13:02,245 --> 00:13:03,620
So we start off
with the protein.

354
00:13:03,620 --> 00:13:05,245
There are some serious
questions of how

355
00:13:05,245 --> 00:13:06,944
you equilibrate the atoms.

356
00:13:06,944 --> 00:13:09,110
So you start off with a
completely static structure.

357
00:13:09,110 --> 00:13:10,542
You want to apply forces to it.

358
00:13:10,542 --> 00:13:12,000
There are some
subtleties as to how

359
00:13:12,000 --> 00:13:14,208
you go about doing that,
but then you actually end up

360
00:13:14,208 --> 00:13:16,650
simulating the motion
of all the atoms.

361
00:13:16,650 --> 00:13:19,946
And just to give you a sense
of what that looks like,

362
00:13:19,946 --> 00:13:21,360
I'll show you a quick movie.

363
00:13:29,020 --> 00:13:33,916
So this is the simulation of the
folding of a protein structure.

364
00:13:33,916 --> 00:13:35,540
And the backbone is
mostly highlighted.

365
00:13:35,540 --> 00:13:37,740
Most of the side chains
are not being shown.

366
00:13:37,740 --> 00:13:41,490
Actually, in bold, but you
can see the stick figures.

367
00:13:41,490 --> 00:13:44,760
And slowly it's accumulating
its three-dimensional structure.

368
00:13:44,760 --> 00:13:47,044
[VIDEO PLAYBACK]

369
00:14:30,196 --> 00:14:32,180
[LAUGHTER]

370
00:15:02,247 --> 00:15:03,080
[END VIDEO PLAYBACK]

371
00:15:03,080 --> 00:15:04,955
PROFESSOR: OK, I think
you get the idea here.

372
00:15:09,080 --> 00:15:10,875
Oh, it won't let me give up.

373
00:15:10,875 --> 00:15:12,220
OK, here we go.

374
00:15:12,220 --> 00:15:14,530
OK, so these are
the equations that

375
00:15:14,530 --> 00:15:17,710
are governing the motion
in an example like that.

376
00:15:17,710 --> 00:15:24,540
Now, the advantage of
this is we're actually

377
00:15:24,540 --> 00:15:26,480
simulating the protein folding.

378
00:15:26,480 --> 00:15:28,690
So if we do it correctly,
we should always

379
00:15:28,690 --> 00:15:29,610
get the right answer.

380
00:15:29,610 --> 00:15:32,770
Of course, that's not
what happens in reality.

381
00:15:32,770 --> 00:15:35,610
Probably the biggest problem
is just computational speed.

382
00:15:35,610 --> 00:15:39,000
So these simulations--
even very, very

383
00:15:39,000 --> 00:15:40,890
short ones like the
one I showed you--

384
00:15:40,890 --> 00:15:43,790
so how long does it take a
protein to fold in vitro?

385
00:15:43,790 --> 00:15:46,137
A long folding might
take a millisecond,

386
00:15:46,137 --> 00:15:47,720
and for a very small
protein like that

387
00:15:47,720 --> 00:15:49,544
it might be orders
of magnitude faster.

388
00:15:49,544 --> 00:15:50,960
But to actually
compute that could

389
00:15:50,960 --> 00:15:53,720
take many, many, many days.

390
00:15:53,720 --> 00:15:56,800
So a lot of computing
resources going into this.

391
00:15:56,800 --> 00:15:58,700
Also, if we want to
accurately represent

392
00:15:58,700 --> 00:16:01,580
solvation-- the interaction of
the protein with water, which

393
00:16:01,580 --> 00:16:04,112
is what causes the hydrophobic
collapse, as we saw-- then

394
00:16:04,112 --> 00:16:06,570
you actually would have to have
water in those simulations.

395
00:16:06,570 --> 00:16:08,944
And each water molecule adds
a lot of degrees of freedom,

396
00:16:08,944 --> 00:16:12,000
so that increases the
computational cost, as well.

397
00:16:12,000 --> 00:16:15,040
So all of these things determine
the radius of convergence.

398
00:16:15,040 --> 00:16:17,620
How far away can you be
from the true structure

399
00:16:17,620 --> 00:16:19,214
and still get there?

400
00:16:19,214 --> 00:16:20,630
For very small
proteins like this,

401
00:16:20,630 --> 00:16:22,213
with a lot of
computational resources,

402
00:16:22,213 --> 00:16:26,330
you can get from an unfolded
protein to the folded state.

403
00:16:26,330 --> 00:16:28,170
We'll see some
important advances that

404
00:16:28,170 --> 00:16:30,840
allow us to get around
this, but in most cases

405
00:16:30,840 --> 00:16:32,985
we only can do
relatively local changes.

406
00:16:35,990 --> 00:16:40,450
So that brings us to our third
approach for refining protein

407
00:16:40,450 --> 00:16:42,920
structures, which is
called simulated annealing.

408
00:16:42,920 --> 00:16:44,900
And the inspiration
for this name

409
00:16:44,900 --> 00:16:47,670
comes from metallurgy
and how to get

410
00:16:47,670 --> 00:16:50,490
the best atomic
structure in a metal.

411
00:16:50,490 --> 00:16:53,090
I don't know if any of you have
ever done any metalworking.

412
00:16:53,090 --> 00:16:54,376
Anyone?

413
00:16:54,376 --> 00:16:56,410
Oh, OK, well one person.

414
00:16:56,410 --> 00:16:57,700
That's better than most years.

415
00:16:57,700 --> 00:17:01,920
I have not, but I understand
that in metallurgy--

416
00:17:01,920 --> 00:17:04,469
and you can correct me if I'm
wrong-- that by repeatedly

417
00:17:04,469 --> 00:17:06,010
raising and lowering
the temperature,

418
00:17:06,010 --> 00:17:08,119
you can get better
metal structures.

419
00:17:08,119 --> 00:17:09,680
Is that reasonably accurate?

420
00:17:09,680 --> 00:17:10,274
OK.

421
00:17:10,274 --> 00:17:12,440
You can talk to one of your
fellow students for more

422
00:17:12,440 --> 00:17:13,730
details if you're interested.

423
00:17:13,730 --> 00:17:15,450
So this similar
idea is going to be

424
00:17:15,450 --> 00:17:18,490
used in this
computational approach.

425
00:17:18,490 --> 00:17:21,589
We're going to try to find
the most probable conformation

426
00:17:21,589 --> 00:17:24,420
of atoms by trying to get
out of some local minima

427
00:17:24,420 --> 00:17:27,069
by raising the
energy of the system

428
00:17:27,069 --> 00:17:28,870
and then changing
the temperatures,

429
00:17:28,870 --> 00:17:31,120
or raising and lowering it
according to some heating

430
00:17:31,120 --> 00:17:33,484
and cooling schedule to get
the atoms into their most

431
00:17:33,484 --> 00:17:35,650
probable conformation, the
most stable conformation.

432
00:17:38,605 --> 00:17:40,230
And this goes back
to this idea that we

433
00:17:40,230 --> 00:17:41,710
started with the local minima.

434
00:17:41,710 --> 00:17:43,730
If we're just doing
energy minimization,

435
00:17:43,730 --> 00:17:46,520
we're not going to be able
to get from this minimum

436
00:17:46,520 --> 00:17:48,830
to this minimum, because
these energetic barriers are

437
00:17:48,830 --> 00:17:49,367
in the way.

438
00:17:49,367 --> 00:17:51,200
So we need to raise the
energy of the system

439
00:17:51,200 --> 00:17:53,600
to jump over these
energetic barriers

440
00:17:53,600 --> 00:17:57,410
before we can get to the
global free energy minimum.

441
00:17:57,410 --> 00:18:00,120
But if we just move at very
high temperature all the time,

442
00:18:00,120 --> 00:18:02,582
we will sample the
entire energetic space

443
00:18:02,582 --> 00:18:04,040
but it's going to
take a long time.

444
00:18:04,040 --> 00:18:05,285
We're going to be sampling
a lot of conformations

445
00:18:05,285 --> 00:18:07,270
that are low
probability, as well.

446
00:18:07,270 --> 00:18:08,910
So this approach
allows us to balance

447
00:18:08,910 --> 00:18:11,535
the need for speed and the
need to be at high temperature

448
00:18:11,535 --> 00:18:13,410
where we can overcome
some of these barriers.

449
00:18:22,870 --> 00:18:25,120
So one thing that I
want to stress here

450
00:18:25,120 --> 00:18:27,640
is that we've made a physical
analogy to this metallurgy

451
00:18:27,640 --> 00:18:28,140
process.

452
00:18:28,140 --> 00:18:30,514
We're talking about raising
the temperature of the system

453
00:18:30,514 --> 00:18:32,410
and letting the atoms
evolve under forces,

454
00:18:32,410 --> 00:18:34,200
but it's in no way
meant to simulate

455
00:18:34,200 --> 00:18:36,040
what's going on in
protein folding.

456
00:18:36,040 --> 00:18:37,669
So molecular dynamics
would try to say,

457
00:18:37,669 --> 00:18:39,710
this is what's actually
happening to this protein

458
00:18:39,710 --> 00:18:41,690
as it folds in water.

459
00:18:41,690 --> 00:18:44,250
Simulated annealing is
using high temperature

460
00:18:44,250 --> 00:18:46,635
to search over spaces
and then low temperature.

461
00:18:46,635 --> 00:18:49,010
But these temperatures are much,
much higher than the protein

462
00:18:49,010 --> 00:18:51,620
would ever encounter, so
it's not a simulation.

463
00:18:51,620 --> 00:18:54,678
It's a search strategy.

464
00:18:54,678 --> 00:18:58,992
OK, so the key to
this-- and I'll

465
00:18:58,992 --> 00:19:00,700
tell you the full
algorithm in a second--

466
00:19:00,700 --> 00:19:02,040
but at various steps
in the algorithm

467
00:19:02,040 --> 00:19:03,706
we're trying to make
decisions about how

468
00:19:03,706 --> 00:19:05,650
to move from our current
set of coordinates

469
00:19:05,650 --> 00:19:07,630
to some alternative
set of coordinates.

470
00:19:07,630 --> 00:19:11,030
Now, that new set of coordinates
we're going to call the test state.

471
00:19:11,030 --> 00:19:13,569
And we're going to decide
whether the new state is

472
00:19:13,569 --> 00:19:15,360
more or less probable
than the current one.

473
00:19:15,360 --> 00:19:15,540
Right?

474
00:19:15,540 --> 00:19:17,706
If it's lower in energy,
then what's it going to be?

475
00:19:17,706 --> 00:19:19,700
It's going to be
more probable, right?

476
00:19:19,700 --> 00:19:21,710
And so in this
algorithm, we're always

477
00:19:21,710 --> 00:19:24,160
going to accept those states
that are lower in free energy

478
00:19:24,160 --> 00:19:25,970
than our current state.

479
00:19:25,970 --> 00:19:28,097
What happens when
the state is higher

480
00:19:28,097 --> 00:19:29,680
in free energy than
our current state?

481
00:19:29,680 --> 00:19:32,100
So it turns out we are going
to accept it probabilistically.

482
00:19:32,100 --> 00:19:34,599
Sometimes it's going to move
up in energy and sometimes not,

483
00:19:34,599 --> 00:19:36,630
and that is going
to allow us to go

484
00:19:36,630 --> 00:19:38,730
over some of those
energetic barriers

485
00:19:38,730 --> 00:19:42,100
and try to get to new
energetic states that would not

486
00:19:42,100 --> 00:19:44,470
be accessible to
pure minimization.

487
00:19:44,470 --> 00:19:47,440
So the form of this is the
Boltzmann equation, right?

488
00:19:47,440 --> 00:19:49,625
The probability of some
test state compared

489
00:19:49,625 --> 00:19:51,250
to the probability
of a reference state

490
00:19:51,250 --> 00:19:55,300
is going to be the ratio of
these two Boltzmann equations--

491
00:19:55,300 --> 00:19:57,650
the energy of the test
state over the energy

492
00:19:57,650 --> 00:19:58,670
of the current state.

493
00:19:58,670 --> 00:20:01,915
So it's e to the minus the
difference in energy over kT.

494
00:20:01,915 --> 00:20:03,790
And we'll come back to
where this temperature

495
00:20:03,790 --> 00:20:05,150
term comes from in a second.
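
Written out, the ratio described here takes the following form. This is a sketch in standard notation; the symbol names E_test and E_current are labels assumed for this note, not taken from the slides.

```latex
\frac{P(S_{\text{test}})}{P(S_{\text{current}})}
  = \frac{e^{-E_{\text{test}}/kT}}{e^{-E_{\text{current}}/kT}}
  = e^{-(E_{\text{test}} - E_{\text{current}})/kT}
```

When the test state is lower in energy the ratio is greater than 1, and when it is higher the ratio falls between 0 and 1.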

496
00:20:07,770 --> 00:20:10,190
OK, so here's the
full algorithm.

497
00:20:10,190 --> 00:20:12,740
We will either iterate for
a fixed number of steps

498
00:20:12,740 --> 00:20:14,060
or until convergence.

499
00:20:14,060 --> 00:20:16,280
We'll see we don't
always converge.

500
00:20:16,280 --> 00:20:18,910
We have some initial
conformation.

501
00:20:18,910 --> 00:20:20,630
Our current conformation
will be state n,

502
00:20:20,630 --> 00:20:22,200
and we can
compute its energy

503
00:20:22,200 --> 00:20:23,770
from those potential
energy functions

504
00:20:23,770 --> 00:20:26,124
that we discussed
in the last meeting.

505
00:20:26,124 --> 00:20:28,290
We're going to choose a
neighboring state at random.

506
00:20:28,290 --> 00:20:30,420
So what does neighboring mean?

507
00:20:30,420 --> 00:20:32,750
So if I'm defining this in
terms of XYZ coordinates,

508
00:20:32,750 --> 00:20:34,410
for every atom I've
got a set of XYZ

509
00:20:34,410 --> 00:20:37,090
coordinates. I'm going to
change a few of them

510
00:20:37,090 --> 00:20:38,090
by a small amount.

511
00:20:38,090 --> 00:20:38,240
Right?

512
00:20:38,240 --> 00:20:39,823
If I change them all
by large amounts,

513
00:20:39,823 --> 00:20:41,520
I have a completely
different structure.

514
00:20:41,520 --> 00:20:43,228
So I'm going to make
small perturbations.

515
00:20:43,228 --> 00:20:47,780
And if I'm doing this
with fixed backbone angles

516
00:20:47,780 --> 00:20:49,620
and just rotating the
side chains, then what

517
00:20:49,620 --> 00:20:52,580
would a neighboring state be?

518
00:20:52,580 --> 00:20:53,140
Any thoughts?

519
00:20:59,530 --> 00:21:01,490
What would a
neighboring state be?

520
00:21:01,490 --> 00:21:03,740
Anyone?

521
00:21:03,740 --> 00:21:05,615
Change a few of the side
chain angles, right?

522
00:21:05,615 --> 00:21:07,698
So we don't want to globally
change the structure.

523
00:21:07,698 --> 00:21:09,770
We want some continuity
between the current state

524
00:21:09,770 --> 00:21:11,230
and the next state.

525
00:21:11,230 --> 00:21:13,350
So we're going to
choose an adjacent state

526
00:21:13,350 --> 00:21:15,550
in that sense, in
the state space.

527
00:21:15,550 --> 00:21:17,180
And then here are the rules.

528
00:21:17,180 --> 00:21:19,340
If the new state
has an energy that's

529
00:21:19,340 --> 00:21:23,100
lower than the current state,
we simply accept the new state.

530
00:21:23,100 --> 00:21:24,850
If not, this is where
it gets interesting.

531
00:21:24,850 --> 00:21:26,430
Then, we accept
that higher energy

532
00:21:26,430 --> 00:21:28,151
with a probability
that's associated

533
00:21:28,151 --> 00:21:29,650
with the difference
in the energies.

534
00:21:29,650 --> 00:21:31,380
So if the difference
is very, very large,

535
00:21:31,380 --> 00:21:32,900
there's a low
probability it'll accept.

536
00:21:32,900 --> 00:21:34,733
If the energy is only
slightly higher, then

537
00:21:34,733 --> 00:21:36,979
there's a higher
probability that we accept.

538
00:21:36,979 --> 00:21:39,270
If we reject it, we just drop
back to our current state

539
00:21:39,270 --> 00:21:41,140
and we look for
a new test state.

540
00:21:41,140 --> 00:21:41,780
OK?
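
The accept/reject rules above can be sketched as a single step at a fixed temperature. This is a minimal illustration, not the lecture's own code: the names `metropolis_step`, `toy_energy` and `toy_neighbor` are hypothetical, and the one-dimensional "landscape" stands in for a real potential energy function.

```python
import math
import random

def metropolis_step(state, energy_fn, neighbor_fn, kT, rng=random):
    """One accept/reject decision at a fixed temperature (kT in energy units)."""
    test = neighbor_fn(state, rng)               # small random perturbation
    delta = energy_fn(test) - energy_fn(state)   # E_test - E_current
    if delta <= 0:
        return test                              # lower energy: always accept
    if rng.random() < math.exp(-delta / kT):
        return test                              # higher energy: accept sometimes
    return state                                 # rejected: keep the current state

# Hypothetical toy landscape with two wells, near x = +1 and x = -1.
def toy_energy(x):
    return (x * x - 1.0) ** 2 + 0.3 * x

# Hypothetical neighbor rule: nudge the single coordinate by a small amount.
def toy_neighbor(x, rng):
    return x + rng.uniform(-0.1, 0.1)
```

Calling `metropolis_step` repeatedly from the returned state gives the iteration described in the lecture; the size of the perturbation in `toy_neighbor` is the "how far away do we search" choice.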

541
00:21:41,780 --> 00:21:43,250
Any questions on how we do this?

542
00:21:47,360 --> 00:21:48,410
Question, yes.

543
00:21:48,410 --> 00:21:51,690
AUDIENCE: How far away do
we search for neighbors?

544
00:21:51,690 --> 00:21:53,720
PROFESSOR: That's the
art of this process,

545
00:21:53,720 --> 00:21:55,330
so I can't give you a straight answer.

546
00:21:55,330 --> 00:21:58,957
Different approaches will
use different thresholds.

547
00:21:58,957 --> 00:21:59,790
Any other questions?

548
00:22:04,300 --> 00:22:06,746
OK, so the key thing
I want you to realize,

549
00:22:06,746 --> 00:22:08,120
then, is there's
this distinction

550
00:22:08,120 --> 00:22:09,500
between the
minimization approach

551
00:22:09,500 --> 00:22:10,940
and simulated
annealing approach.

552
00:22:10,940 --> 00:22:13,080
Minimization can only
go from state one

553
00:22:13,080 --> 00:22:15,639
to the local free
energy minimum,

554
00:22:15,639 --> 00:22:17,680
whereas the simulated
annealing has the potential

555
00:22:17,680 --> 00:22:19,350
to go much further
afield, and potentially

556
00:22:19,350 --> 00:22:21,058
to get to the global
free energy minimum.

557
00:22:21,058 --> 00:22:22,990
But it's not
guaranteed to find it.

558
00:22:22,990 --> 00:22:26,120
OK, so let's say we
start in state one

559
00:22:26,120 --> 00:22:28,039
and our neighbor
state was state two.

560
00:22:28,039 --> 00:22:30,080
So we'd accept that with
100% probability, right?

561
00:22:30,080 --> 00:22:31,464
Because it's lower in energy.

562
00:22:31,464 --> 00:22:33,380
Then let's say the
neighboring state turns out

563
00:22:33,380 --> 00:22:35,694
to be state three.
That's higher in energy,

564
00:22:35,694 --> 00:22:37,610
so there's a probability
that we'll accept it,

565
00:22:37,610 --> 00:22:39,359
based on the difference
between the energy

566
00:22:39,359 --> 00:22:40,612
of state two and state three.

567
00:22:40,612 --> 00:22:42,320
Similarly from state
three to state four,

568
00:22:42,320 --> 00:22:44,570
so we might drop
back to state two.

569
00:22:44,570 --> 00:22:45,780
We might go up.

570
00:22:45,780 --> 00:22:48,110
And then we can eventually
get over the hump this way

571
00:22:48,110 --> 00:22:49,330
with some probability.

572
00:22:49,330 --> 00:22:51,610
It's a sum of each
of those steps.

573
00:22:51,610 --> 00:22:52,110
OK?

574
00:22:58,550 --> 00:23:01,744
OK, so if this is our
function for deciding

575
00:23:01,744 --> 00:23:03,160
whether to accept
a new state, how

576
00:23:03,160 --> 00:23:06,070
does temperature
affect our decisions?

577
00:23:06,070 --> 00:23:10,962
What happens when the
temperature is very, very high,

578
00:23:10,962 --> 00:23:12,420
if you look at that equation?

579
00:23:12,420 --> 00:23:14,630
So it's e to the minus delta.

580
00:23:14,630 --> 00:23:17,010
The difference in
the energy over kT.

581
00:23:17,010 --> 00:23:19,200
So if T is very,
very large, then

582
00:23:19,200 --> 00:23:22,159
what happens to that exponent?

583
00:23:22,159 --> 00:23:22,950
It approaches zero.

584
00:23:22,950 --> 00:23:27,240
So e to the minus zero is going
to be approximately 1, right?

585
00:23:27,240 --> 00:23:29,570
So at very high temperatures,
we almost always

586
00:23:29,570 --> 00:23:31,180
take the high energy state.

587
00:23:31,180 --> 00:23:33,980
So that's what allows us to
climb those energetic hills.

588
00:23:33,980 --> 00:23:35,380
If I have a very
high temperature

589
00:23:35,380 --> 00:23:36,838
in my simulated
annealing, then I'm

590
00:23:36,838 --> 00:23:39,094
always going over
those barriers.

591
00:23:39,094 --> 00:23:40,510
So conversely,
what happens, then,

592
00:23:40,510 --> 00:23:44,187
when I set the
temperature very low?

593
00:23:44,187 --> 00:23:45,895
Then there's a very,
very low probability

594
00:23:45,895 --> 00:23:48,640
of accepting those
changes, right?

595
00:23:48,640 --> 00:23:51,350
So if I have a very low
temperature-- temperature

596
00:23:51,350 --> 00:23:54,230
approximately zero-- then
I'll never go uphill.

597
00:23:54,230 --> 00:23:56,440
Almost never go uphill.

598
00:23:56,440 --> 00:23:58,990
So we have a lot of control
over how much of the space

599
00:23:58,990 --> 00:24:03,657
this algorithm explores by
how we set the temperature.
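
To make the temperature dependence concrete, here is a small numerical sketch. The function name `accept_probability` and the energy and kT values are arbitrary choices for illustration.

```python
import math

def accept_probability(delta_e, kT):
    """Probability of accepting a move that changes the energy by delta_e."""
    if delta_e <= 0:
        return 1.0                      # downhill moves are always accepted
    return math.exp(-delta_e / kT)      # uphill moves: e^(-delta_e / kT)

# Same uphill step (delta_e = 1.0) at high, moderate, and low temperature.
for kT in (100.0, 1.0, 0.01):
    print(kT, accept_probability(1.0, kT))
```

At the highest kT the probability is close to 1, and at the lowest it is effectively 0, which matches the two limits discussed above.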

600
00:24:03,657 --> 00:24:06,240
So this is again a little bit
of the art of simulated annealing--

601
00:24:06,240 --> 00:24:08,510
deciding exactly what
annealing schedule to use,

602
00:24:08,510 --> 00:24:10,405
what temperature
program you use.

603
00:24:10,405 --> 00:24:12,490
Do you start off high
and go linearly down?

604
00:24:12,490 --> 00:24:14,090
Do you use some other,
more complicated function

605
00:24:14,090 --> 00:24:15,173
to decide the temperature?

606
00:24:15,173 --> 00:24:17,180
We won't go into exactly
how to choose these.

607
00:24:17,180 --> 00:24:19,360
[INAUDIBLE] you could
track some of these things

608
00:24:19,360 --> 00:24:22,510
down from the references
that are in the notes.

609
00:24:22,510 --> 00:24:23,484
So we have this choice.

610
00:24:23,484 --> 00:24:24,900
But the basic idea
is, we're going

611
00:24:24,900 --> 00:24:26,233
to start at higher temperatures.

612
00:24:26,233 --> 00:24:28,207
We're going to explore
most of the space.

613
00:24:28,207 --> 00:24:29,790
And then, as we lower
the temperature,

614
00:24:29,790 --> 00:24:32,466
we freeze ourselves into the
most probable conformations.
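
Putting the pieces together, a full annealing run might look like the sketch below. Everything here is an assumption made for illustration: the linear cooling schedule, the step size, the toy `double_well` energy, and the choice to remember the best state seen, which the lecture does not mention.

```python
import math
import random

def anneal(energy_fn, start, n_steps=20000, kT_start=2.0, kT_end=0.01,
           step_size=0.2, seed=0):
    """Simulated annealing on a single coordinate with linear cooling."""
    rng = random.Random(seed)
    state = best = start
    e_state = e_best = energy_fn(start)
    for i in range(n_steps):
        # Linear cooling from kT_start to kT_end (one schedule among many).
        kT = kT_start + (kT_end - kT_start) * i / (n_steps - 1)
        test = state + rng.uniform(-step_size, step_size)
        e_test = energy_fn(test)
        delta = e_test - e_state
        # Accept downhill moves always, uphill moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / kT):
            state, e_state = test, e_test
            if e_state < e_best:
                best, e_best = state, e_state   # bookkeeping added for this sketch
    return best, e_best

def double_well(x):
    # Local minimum near x = +1, deeper minimum near x = -1, barrier near x = 0.
    return (x * x - 1.0) ** 2 + 0.3 * x

best_x, best_e = anneal(double_well, start=1.0)
```

Starting in the shallower well at x = 1, plain minimization would stay there; the high-temperature phase is what lets the search cross the barrier near x = 0.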

615
00:24:35,980 --> 00:24:38,740
Now, there's nothing that
restricts simulated annealing

616
00:24:38,740 --> 00:24:40,420
to protein structure.

617
00:24:40,420 --> 00:24:42,120
This approach is
actually quite general.

618
00:24:42,120 --> 00:24:44,490
It's called the
Metropolis-Hastings algorithm.

619
00:24:44,490 --> 00:24:47,375
It's often used in cases where
there's no energy whatsoever

620
00:24:47,375 --> 00:24:50,350
and it's thought of purely
in probabilistic terms.

621
00:24:50,350 --> 00:24:53,700
So if I have some probabilistic
function-- some probability

622
00:24:53,700 --> 00:24:57,580
of being in some state S-- I
can choose a neighboring state

623
00:24:57,580 --> 00:24:59,220
at random.

624
00:24:59,220 --> 00:25:01,060
Then I can compute
an acceptance ratio,

625
00:25:01,060 --> 00:25:03,690
which is the probability
of being in state S

626
00:25:03,690 --> 00:25:06,570
test over the probability
of being in the current state.

627
00:25:06,570 --> 00:25:08,870
This is what we did in terms
of the Boltzmann equation,

628
00:25:08,870 --> 00:25:11,078
but if I have some other formulation
for the probabilities

629
00:25:11,078 --> 00:25:12,560
I'll just use that.

630
00:25:12,560 --> 00:25:15,860
And then, just like in our
protein folding example,

631
00:25:15,860 --> 00:25:18,770
if this acceptance
ratio is greater than 1,

632
00:25:18,770 --> 00:25:20,050
we accept the new state.

633
00:25:20,050 --> 00:25:21,980
If it's less than
1, then we accept it

634
00:25:21,980 --> 00:25:24,740
with a probabilistic statement.

635
00:25:24,740 --> 00:25:26,924
And so this is a very
general approach.

636
00:25:26,924 --> 00:25:28,840
I think you might see
it in your problem sets.

637
00:25:28,840 --> 00:25:30,881
We certainly have done
this on past exams-- asked

638
00:25:30,881 --> 00:25:34,250
you to apply this algorithm to
other probabilistic settings.

639
00:25:34,250 --> 00:25:37,490
So it's a very, very general
way to search or sample

640
00:25:37,490 --> 00:25:41,030
across a probabilistic
landscape.
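
As an illustration of the purely probabilistic version, here is a sketch with no energy function at all. The ring of five states, the `weights` list, and the function name `metropolis_sample` are hypothetical. The neighbor rule is symmetric, so the acceptance ratio reduces to the ratio of the two probabilities.

```python
import random
from collections import Counter

def metropolis_sample(prob, n_states, n_samples, seed=0):
    """Visit states 0..n_states-1 in proportion to prob(state).

    Neighbors are one step left or right on a ring, so the proposal is
    symmetric and the acceptance ratio is prob(test) / prob(current).
    """
    rng = random.Random(seed)
    state = 0
    counts = Counter()
    for _ in range(n_samples):
        test = (state + rng.choice((-1, 1))) % n_states
        ratio = prob(test) / prob(state)
        if ratio >= 1 or rng.random() < ratio:
            state = test                 # accepted: move to the test state
        counts[state] += 1               # rejected moves count the current state again
    return counts

weights = [1.0, 2.0, 4.0, 2.0, 1.0]      # hypothetical, unnormalized probabilities
counts = metropolis_sample(lambda s: weights[s], len(weights), 50000)
```

Over a long run the visit counts should approach the normalized weights, so the middle state should be visited roughly 40% of the time. Note that the probabilities never need to be normalized, because only their ratio is used.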

641
00:25:41,030 --> 00:25:44,020
OK, so we've seen these
three separate approaches,

642
00:25:44,020 --> 00:25:46,330
starting with an
approximate structure

643
00:25:46,330 --> 00:25:48,370
and trying to get to
the correct structure.

644
00:25:48,370 --> 00:25:50,170
We have energy
minimization, which

645
00:25:50,170 --> 00:25:53,130
will move towards the
local conformation.

646
00:25:53,130 --> 00:25:55,230
So it's very fast
compared to the other two,
compared the other two,

647
00:25:55,230 --> 00:25:57,470
but it's restricted
to local changes.

648
00:25:57,470 --> 00:25:59,220
We have molecular
dynamics, which actually

649
00:25:59,220 --> 00:26:01,880
tries to simulate the
biological process.

650
00:26:01,880 --> 00:26:03,590
Computationally very intensive.

651
00:26:03,590 --> 00:26:05,131
And then we have
simulated annealing,

652
00:26:05,131 --> 00:26:07,030
which tries to shortcut
the route to some

653
00:26:07,030 --> 00:26:08,980
of these global
free energy minima

654
00:26:08,980 --> 00:26:11,930
by raising the temperature,
pretending we're at this very

655
00:26:11,930 --> 00:26:13,930
high temperature so we
can sample all the space,

656
00:26:13,930 --> 00:26:17,170
and then cooling down so
we trap a high probability

657
00:26:17,170 --> 00:26:18,680
conformation.

658
00:26:18,680 --> 00:26:20,715
Any questions on any of
these three approaches?

659
00:26:25,090 --> 00:26:27,260
OK.

660
00:26:27,260 --> 00:26:29,230
All right, so I'm going
to go through now some

661
00:26:29,230 --> 00:26:32,150
of the approaches
that have already

662
00:26:32,150 --> 00:26:35,029
been used to try to
solve protein structures.

663
00:26:35,029 --> 00:26:36,320
We started off with a sequence.

664
00:26:36,320 --> 00:26:38,800
We'd like to figure out
what the structure is.

665
00:26:38,800 --> 00:26:42,090
And this field has had
a tremendous advance,

666
00:26:42,090 --> 00:26:44,990
because in 1995 a group
got together and came up

667
00:26:44,990 --> 00:26:47,550
with an objective
way of evaluating

668
00:26:47,550 --> 00:26:49,945
whether these
methods were working.

669
00:26:49,945 --> 00:26:51,570
So lots of people
have proposed methods

670
00:26:51,570 --> 00:26:53,390
for predicting
protein structure,

671
00:26:53,390 --> 00:26:57,540
and what the CASP group
did in '95 was they said,

672
00:26:57,540 --> 00:27:01,030
we will collect structures
from crystallographers,

673
00:27:01,030 --> 00:27:04,430
NMR spectroscopists,
that they have not yet

674
00:27:04,430 --> 00:27:06,280
published but they know
they're likely to be

675
00:27:06,280 --> 00:27:11,020
able to get within the
time scale of this project.

676
00:27:11,020 --> 00:27:13,440
We will send out those
sequences to the modelers.

677
00:27:13,440 --> 00:27:15,750
The modelers will attempt
to predict the structure,

678
00:27:15,750 --> 00:27:16,990
and then at the end
of the competition

679
00:27:16,990 --> 00:27:18,065
we'll go back to the
crystallographers

680
00:27:18,065 --> 00:27:20,489
and the spectroscopists and
say, OK, give us a structure

681
00:27:20,489 --> 00:27:22,280
and now we'll compare
the predicted answers

682
00:27:22,280 --> 00:27:22,870
to the real ones.

683
00:27:22,870 --> 00:27:24,860
So no one knows
what the answer is

684
00:27:24,860 --> 00:27:28,260
until all the
submissions are there,

685
00:27:28,260 --> 00:27:30,630
and then you can see objectively
which of the approaches

686
00:27:30,630 --> 00:27:32,435
did the best.

687
00:27:32,435 --> 00:27:34,310
And one of the approaches
that consistently

688
00:27:34,310 --> 00:27:36,601
has done very well, which
we'll look at in some detail,

689
00:27:36,601 --> 00:27:38,510
is this approach called Rosetta.

690
00:27:38,510 --> 00:27:43,410
So you can look at
the details online.

691
00:27:43,410 --> 00:27:46,740
They split this modeling
problem into two types.

692
00:27:46,740 --> 00:27:48,450
There are ones for
which you can come up

693
00:27:48,450 --> 00:27:50,135
with a reasonable
homology model.

694
00:27:50,135 --> 00:27:52,010
This can be very, very
low sequence homology,

695
00:27:52,010 --> 00:27:54,343
but there's something in the
database of known structures

696
00:27:54,343 --> 00:27:57,400
that is similar in
sequence to the query.

697
00:27:57,400 --> 00:28:00,850
And then ones where
it's completely de novo.

698
00:28:00,850 --> 00:28:03,769
So how do they go about
predicting these structures?

699
00:28:03,769 --> 00:28:06,060
So if there's homology, you
can imagine the first thing

700
00:28:06,060 --> 00:28:08,860
you want to do is align your
sequence to the sequence

701
00:28:08,860 --> 00:28:11,000
of the protein that
has a known structure.

702
00:28:11,000 --> 00:28:14,930
Now, if it's high homology this
is not a hard problem, right?

703
00:28:14,930 --> 00:28:16,410
We just need to do a few tweaks.

704
00:28:16,410 --> 00:28:19,170
But we get to places--
what's called the Twilight

705
00:28:19,170 --> 00:28:22,490
Zone, in fact-- where there's
a high probability that you're

706
00:28:22,490 --> 00:28:25,410
wrong, that your sequence
alignments could be to entirely

707
00:28:25,410 --> 00:28:26,420
the wrong structure.

708
00:28:26,420 --> 00:28:28,602
And that's where
things get interesting.

709
00:28:28,602 --> 00:28:30,310
So they've got high
sequence similarity--

710
00:28:30,310 --> 00:28:32,450
greater than 50%
sequence similarity that

711
00:28:32,450 --> 00:28:34,660
are considered
relatively easy problems.

712
00:28:34,660 --> 00:28:38,120
These medium problems that are
20% to 50% sequence similarity.

713
00:28:38,120 --> 00:28:40,770
And then very low sequence
similarity problems-- less

714
00:28:40,770 --> 00:28:42,680
than 20% sequence similarity.

715
00:28:46,560 --> 00:28:49,452
OK, so you've already
seen in this course methods

716
00:28:49,452 --> 00:28:50,910
for doing sequence
alignment, so we

717
00:28:50,910 --> 00:28:53,790
don't have to go into
that in any detail.

718
00:28:53,790 --> 00:28:56,139
But there are a lot of
different specific approaches

719
00:28:56,139 --> 00:28:57,430
for how to do those alignments.

720
00:28:57,430 --> 00:29:00,830
You could do anything from BLAST
to highly sophisticated Markov

721
00:29:00,830 --> 00:29:03,770
models to try to decide what's
most similar to your protein

722
00:29:03,770 --> 00:29:04,270
structure.

723
00:29:04,270 --> 00:29:06,353
And one of the important
things that Rosetta found

724
00:29:06,353 --> 00:29:08,090
was not to rely on
any single method

725
00:29:08,090 --> 00:29:10,730
but to try a bunch of
different alignment approaches

726
00:29:10,730 --> 00:29:12,160
and then follow
through with many

727
00:29:12,160 --> 00:29:14,030
of the different alignments.

728
00:29:14,030 --> 00:29:15,570
And then we get
this problem of how

729
00:29:15,570 --> 00:29:17,840
do you refine the models,
which is what we've already

730
00:29:17,840 --> 00:29:21,090
started to talk about.

731
00:29:21,090 --> 00:29:22,820
So in the general
refinement procedure,

732
00:29:22,820 --> 00:29:25,170
when you have a protein that's
relatively in good shape

733
00:29:25,170 --> 00:29:28,140
they apply random perturbations
to the backbone torsion angle.

734
00:29:28,140 --> 00:29:29,890
So this is again the
statistical approach,

735
00:29:29,890 --> 00:29:31,389
not allowing
every atom to move.

736
00:29:31,389 --> 00:29:35,411
They're just rotating a certain
number of the rotatable side

737
00:29:35,411 --> 00:29:35,910
chains.

738
00:29:35,910 --> 00:29:38,370
So we've got the phi, psi
angles in the backbone,

739
00:29:38,370 --> 00:29:41,270
and some of the side chains.

740
00:29:41,270 --> 00:29:43,882
They do what's called rotamer
optimization of the side chain.

741
00:29:43,882 --> 00:29:44,840
So what does that mean?

742
00:29:44,840 --> 00:29:47,180
Remember that we
could allow the side

743
00:29:47,180 --> 00:29:48,980
chains to rotate
freely, but very, very

744
00:29:48,980 --> 00:29:51,170
few of those rotations
are frequently observed.

745
00:29:51,170 --> 00:29:53,400
So we're going to choose,
as these three choices,

746
00:29:53,400 --> 00:29:56,025
among the best possible
rotamers, rotational isomers.

747
00:29:58,940 --> 00:30:02,240
And then once we've found
a nearly optimal side chain

748
00:30:02,240 --> 00:30:05,130
conformation from those
highly probable ones,

749
00:30:05,130 --> 00:30:07,814
then we allow more
continuous optimization

750
00:30:07,814 --> 00:30:08,605
of the side chains.

751
00:30:14,080 --> 00:30:16,357
So when you have a very,
very high sequence homology

752
00:30:16,357 --> 00:30:18,190
template, you don't
need to do a lot of work

753
00:30:18,190 --> 00:30:19,260
on most of the structure.

754
00:30:19,260 --> 00:30:19,960
Right?

755
00:30:19,960 --> 00:30:21,010
Most of it's going
to be correct.

756
00:30:21,010 --> 00:30:22,740
So we're going to
focus on those places

757
00:30:22,740 --> 00:30:24,370
where the alignment is poor.

758
00:30:24,370 --> 00:30:26,884
That seems pretty intuitive.

759
00:30:26,884 --> 00:30:28,550
Things get a little
bit more interesting

760
00:30:28,550 --> 00:30:32,040
when you've got these medium
sequence similarity templates.

761
00:30:32,040 --> 00:30:34,429
So here, even your basic
alignment might not be right.

762
00:30:34,429 --> 00:30:36,470
So they actually proceed
with multiple alignments

763
00:30:36,470 --> 00:30:40,330
and carry them through
the refinement process.

764
00:30:40,330 --> 00:30:42,925
And then, how do you decide
which one's the best?

765
00:30:42,925 --> 00:30:44,750
You use the potential
energy function.

766
00:30:44,750 --> 00:30:44,950
Right?

767
00:30:44,950 --> 00:30:46,491
So you've already
taken a whole bunch

768
00:30:46,491 --> 00:30:48,450
of starting conformations.

769
00:30:48,450 --> 00:30:50,620
We've taken them through
this refinement procedure.

770
00:30:50,620 --> 00:30:52,510
You now believe that
those energies represent

771
00:30:52,510 --> 00:30:54,770
the probability that the
structure is correct,

772
00:30:54,770 --> 00:30:57,020
so you're going to choose
which of those conformations

773
00:30:57,020 --> 00:30:58,750
to use based on the energy.

774
00:31:02,050 --> 00:31:06,350
OK, in these medium sequence
similarity templates,

775
00:31:06,350 --> 00:31:09,120
the refinement doesn't do
the entire protein structure,

776
00:31:09,120 --> 00:31:10,750
but it focuses on
particular regions.

777
00:31:10,750 --> 00:31:12,920
So places where there
are gaps, insertions,

778
00:31:12,920 --> 00:31:14,300
and deletions in the alignment.

779
00:31:14,300 --> 00:31:14,800
Right?

780
00:31:14,800 --> 00:31:16,508
So your alignment is
uncertain, so that's

781
00:31:16,508 --> 00:31:18,259
where you need to
refine the structure.

782
00:31:18,259 --> 00:31:20,175
Places that were loops
in the starting models,

783
00:31:20,175 --> 00:31:22,040
so they weren't
highly constrained.

784
00:31:22,040 --> 00:31:23,540
So it's plausible
that they're going

785
00:31:23,540 --> 00:31:25,780
to be different in
the starting structure

786
00:31:25,780 --> 00:31:29,945
from some homologous protein
and in the final structure.

787
00:31:29,945 --> 00:31:32,320
And then, regions where the
sequence conservation is low.

788
00:31:32,320 --> 00:31:35,440
So even if there is a
reasonably good alignment,

789
00:31:35,440 --> 00:31:36,940
there's some
probability that things

790
00:31:36,940 --> 00:31:40,619
have changed during evolution.

791
00:31:40,619 --> 00:31:42,660
Now, when they do a
refinement, how do they do that?

792
00:31:42,660 --> 00:31:45,170
In these places that
we've just outlined,

793
00:31:45,170 --> 00:31:48,300
they don't simply randomly
perturb all of the angles.

794
00:31:48,300 --> 00:31:51,240
But actually, they take
a segment of the protein,

795
00:31:51,240 --> 00:31:53,380
and exactly how long
those segments are

796
00:31:53,380 --> 00:31:56,610
has changed over the course
of the Rosetta algorithm's

797
00:31:56,610 --> 00:31:57,560
refinement.

798
00:31:57,560 --> 00:32:01,130
But say something on the order
of three to six amino acids.

799
00:32:01,130 --> 00:32:03,525
And you look in the
database for proteins

800
00:32:03,525 --> 00:32:06,250
that have known structure that
contain the same amino acid

801
00:32:06,250 --> 00:32:06,750
sequence.

802
00:32:06,750 --> 00:32:09,220
So it could be a completely
unrelated protein structure,

803
00:32:09,220 --> 00:32:11,960
but you develop
a peptide library

804
00:32:11,960 --> 00:32:14,036
for all of those short
sequences for all

805
00:32:14,036 --> 00:32:15,410
the different
possible structures

806
00:32:15,410 --> 00:32:16,150
that they've adopted.

807
00:32:16,150 --> 00:32:18,066
So you know that those
are at least structures

808
00:32:18,066 --> 00:32:20,626
that are consistent with
that local sequence,

809
00:32:20,626 --> 00:32:22,000
although they
might be completely

810
00:32:22,000 --> 00:32:23,630
wrong for this
individual protein.

811
00:32:23,630 --> 00:32:26,810
So you pop in all of
those alternative possible

812
00:32:26,810 --> 00:32:28,841
structures.

813
00:32:28,841 --> 00:32:30,645
So OK, we replace
the torsion angles

814
00:32:30,645 --> 00:32:32,410
with those of peptides
of known structure,

815
00:32:32,410 --> 00:32:35,177
and then we do a local
optimization using

816
00:32:35,177 --> 00:32:37,010
the kinds of minimization
algorithms we just

817
00:32:37,010 --> 00:32:39,370
talked about to see whether
there is a structure that's

818
00:32:39,370 --> 00:32:41,659
roughly compatible with
that little peptide

819
00:32:41,659 --> 00:32:43,450
that you took from the
database that's also

820
00:32:43,450 --> 00:32:45,600
consistent with the
rest of the structure.

821
00:32:45,600 --> 00:32:49,050
And after you've done that,
then you do a global refinement.

822
00:32:49,050 --> 00:32:50,175
Questions on that approach?

823
00:32:55,710 --> 00:32:57,750
OK, so does this work?

824
00:32:57,750 --> 00:33:00,770
It's one of the best competitors
in this CASP competition.

825
00:33:00,770 --> 00:33:04,230
So here are examples where the
native structure's in blue.

826
00:33:04,230 --> 00:33:06,910
The best model they
produced was in red,

827
00:33:06,910 --> 00:33:09,880
and the best template-- that's
the homologous protein--

828
00:33:09,880 --> 00:33:11,320
is in green.

829
00:33:11,320 --> 00:33:13,960
And you can see that they
agree remarkably well.

830
00:33:13,960 --> 00:33:15,580
OK?

831
00:33:15,580 --> 00:33:18,310
So this is very
impressive, especially

832
00:33:18,310 --> 00:33:20,240
compared to some of
the other algorithms.

833
00:33:20,240 --> 00:33:21,740
But again, it's
focusing on proteins

834
00:33:21,740 --> 00:33:24,380
where there's at least some
decent homology to start with.

835
00:33:27,660 --> 00:33:30,270
If you look here at the
center of these proteins,

836
00:33:30,270 --> 00:33:32,850
you can see the original
structure, I believe, is blue,

837
00:33:32,850 --> 00:33:34,090
and their model's in red.

838
00:33:34,090 --> 00:33:36,800
You can see they also get the
side chain conformations more

839
00:33:36,800 --> 00:33:38,660
or less correct, which
is quite remarkable.

840
00:33:43,135 --> 00:33:44,510
Now, what gets
really interesting

841
00:33:44,510 --> 00:33:45,710
is when they work on
these proteins that

842
00:33:45,710 --> 00:33:47,140
have very low
sequence homologies.

843
00:33:47,140 --> 00:33:50,120
So we're talking about 20%
sequence similarity or less.

844
00:33:50,120 --> 00:33:53,035
So quite often, you'll actually
have globally the wrong

845
00:33:53,035 --> 00:33:55,830
fold-- at 20%
sequence similarity.

846
00:33:55,830 --> 00:33:56,830
So what do they do here?

847
00:33:56,830 --> 00:33:59,036
They start by saying,
OK, we have no guarantee

848
00:33:59,036 --> 00:34:00,910
that our templates are
even remotely correct.

849
00:34:00,910 --> 00:34:02,370
So they're going to start
with a lot of templates

850
00:34:02,370 --> 00:34:04,820
and they're going to refine
all of these in parallel

851
00:34:04,820 --> 00:34:08,198
in hopes that some of them come
out right at the other end.

852
00:34:08,198 --> 00:34:10,489
And these are what they call
more aggressive refinement

853
00:34:10,489 --> 00:34:11,010
strategies.

854
00:34:11,010 --> 00:34:14,736
So before, where did we focus
our refinement energies?

855
00:34:14,736 --> 00:34:17,150
We focused on places that
were poorly constrained,

856
00:34:17,150 --> 00:34:20,761
either by evolution or
regions of the structure that

857
00:34:20,761 --> 00:34:22,219
weren't well-constrained,
or places

858
00:34:22,219 --> 00:34:23,552
where the alignment wasn't good.

859
00:34:23,552 --> 00:34:26,480
Here, they actually go after
the relatively well-defined

860
00:34:26,480 --> 00:34:28,279
secondary structure
elements, as well.

861
00:34:28,279 --> 00:34:29,820
And so they will
allow something that

862
00:34:29,820 --> 00:34:33,480
was a clear alpha helix
in all of the templates

863
00:34:33,480 --> 00:34:35,879
to change some of the structure
by taking peptides out

864
00:34:35,879 --> 00:34:37,670
of the database that
have other structures.

865
00:34:37,670 --> 00:34:38,170
OK?

866
00:34:38,170 --> 00:34:41,380
So you take a very,
very aggressive approach

867
00:34:41,380 --> 00:34:42,567
to the refinement.

868
00:34:42,567 --> 00:34:44,900
You rebuild the secondary
structure elements, as well as

869
00:34:44,900 --> 00:34:47,389
these gaps, insertions,
loops, and regions

870
00:34:47,389 --> 00:34:48,764
with low sequence conservation.

871
00:34:48,764 --> 00:34:50,389
And I think the really
remarkable thing

872
00:34:50,389 --> 00:34:51,763
is that this
approach also works.

873
00:34:51,763 --> 00:34:55,239
It doesn't work quite as
well, but here's a side

874
00:34:55,239 --> 00:34:58,570
by side comparison of a native
structure and the best model.

875
00:34:58,570 --> 00:35:01,010
So this is the hidden
structure that was only

876
00:35:01,010 --> 00:35:03,740
known to the crystallographer,
or the spectroscopist,

877
00:35:03,740 --> 00:35:06,700
who agreed to participate
in this CASP competition.

878
00:35:06,700 --> 00:35:08,244
And here is the
model they submitted

879
00:35:08,244 --> 00:35:09,660
blind without
knowing what it was.

880
00:35:09,660 --> 00:35:11,493
And you can see again
and again that there's

881
00:35:11,493 --> 00:35:14,350
a pretty good global similarity
between the structures

882
00:35:14,350 --> 00:35:17,320
that they propose
and the actual ones.

883
00:35:17,320 --> 00:35:17,900
Not always.

884
00:35:17,900 --> 00:35:20,520
I mean, here's an example where
the good parts are highlighted

885
00:35:20,520 --> 00:35:22,470
and the not-so-good
parts are shown in white

886
00:35:22,470 --> 00:35:24,030
so you can barely see them.

887
00:35:24,030 --> 00:35:25,830
[LAUGHTER]

888
00:35:25,830 --> 00:35:27,672
PROFESSOR: But even
so, give them that.

889
00:35:27,672 --> 00:35:28,630
Give them their credit.

890
00:35:28,630 --> 00:35:32,820
It's a remarkably
good agreement.

891
00:35:32,820 --> 00:35:36,542
Now, we've looked at cases
where there's very high sequence

892
00:35:36,542 --> 00:35:39,000
similarity, where there's medium
sequence similarity, where

893
00:35:39,000 --> 00:35:40,250
there's low sequence similarity.

894
00:35:40,250 --> 00:35:42,583
But the hardest category are
ones where there's actually

895
00:35:42,583 --> 00:35:45,949
nothing in the structural
database that's a detectable

896
00:35:45,949 --> 00:35:47,490
homologue to the
protein of interest.

897
00:35:47,490 --> 00:35:48,906
So how do you go
about doing that?

898
00:35:48,906 --> 00:35:50,310
That's the de novo case.

899
00:35:50,310 --> 00:35:52,930
So in that case, they take
the following strategy.

900
00:35:52,930 --> 00:35:56,900
They do a Monte Carlo
search for backbone angles.

901
00:35:56,900 --> 00:35:59,360
So specifically, they
take short regions--

902
00:35:59,360 --> 00:36:01,102
and again, the exact length
changes in different

903
00:36:01,102 --> 00:36:03,060
versions of the algorithm,
but it's anywhere from

904
00:36:03,060 --> 00:36:06,670
three to nine amino
acids in the backbone.

905
00:36:06,670 --> 00:36:10,079
They find similar
peptides in the database

906
00:36:10,079 --> 00:36:10,870
of known structure.

907
00:36:10,870 --> 00:36:13,220
They take the
backbone conformations

908
00:36:13,220 --> 00:36:14,490
from the database.

909
00:36:14,490 --> 00:36:17,020
They set the angles
to match those.

910
00:36:17,020 --> 00:36:18,930
And then, they use those
Metropolis criteria

911
00:36:18,930 --> 00:36:20,310
that we looked at in
simulated annealing.

912
00:36:20,310 --> 00:36:20,520
Right?

913
00:36:20,520 --> 00:36:22,200
The relative probability
of the states,

914
00:36:22,200 --> 00:36:23,658
determined by the
Boltzmann energy,

915
00:36:23,658 --> 00:36:25,930
to decide whether
to accept or not.

916
00:36:25,930 --> 00:36:27,906
If it's lower
energy, what happens?

917
00:36:27,906 --> 00:36:29,160
Do you accept?

918
00:36:29,160 --> 00:36:30,650
Do you not accept?

919
00:36:30,650 --> 00:36:31,420
AUDIENCE: Accept.

920
00:36:31,420 --> 00:36:32,336
PROFESSOR: You accept.

921
00:36:32,336 --> 00:36:34,260
And if it's higher energy,
how do you decide?

922
00:36:34,260 --> 00:36:35,136
AUDIENCE: [INAUDIBLE]

923
00:36:35,136 --> 00:36:36,635
PROFESSOR: [INAUDIBLE],
probability.

924
00:36:36,635 --> 00:36:37,170
Very good.

925
00:36:37,170 --> 00:36:41,680
OK, so they do a fixed number
of Monte Carlo steps-- 36,000.

926
00:36:41,680 --> 00:36:43,800
And then they repeat
this entire process

927
00:36:43,800 --> 00:36:46,260
to get 2,000 final structures.

928
00:36:46,260 --> 00:36:46,900
OK?

929
00:36:46,900 --> 00:36:48,983
Because they really have
very, very low confidence

930
00:36:48,983 --> 00:36:51,740
in any individual one
of these structures.

931
00:36:51,740 --> 00:36:53,240
OK, now you've got
2,000 structures,

932
00:36:53,240 --> 00:36:54,614
but you're allowed
to submit one.

933
00:36:54,614 --> 00:36:55,900
So what do you do?

934
00:36:55,900 --> 00:36:57,674
So they cluster
them to try to see

935
00:36:57,674 --> 00:36:59,590
whether there are common
patterns that emerge,

936
00:36:59,590 --> 00:37:00,964
and then they
refine the clusters

937
00:37:00,964 --> 00:37:03,910
and they submit each cluster
as a potential solution

938
00:37:03,910 --> 00:37:06,930
to this problem.

939
00:37:06,930 --> 00:37:09,460
OK, questions on the
Rosetta approach?

940
00:37:09,460 --> 00:37:11,137
Yes.

941
00:37:11,137 --> 00:37:13,678
AUDIENCE: Can you mention again
why the short region of three

942
00:37:13,678 --> 00:37:16,300
to nine amino acids,
and whether [INAUDIBLE].

943
00:37:19,919 --> 00:37:21,460
PROFESSOR: So the
question is, what's

944
00:37:21,460 --> 00:37:24,890
the motivation for taking
these short regions

945
00:37:24,890 --> 00:37:27,710
from the structural database?

946
00:37:27,710 --> 00:37:29,255
Ultimately, this is
a modeling choice

947
00:37:29,255 --> 00:37:30,880
that they made that
seems to work well.

948
00:37:30,880 --> 00:37:32,150
So it's an empirical choice.

949
00:37:32,150 --> 00:37:34,680
But what possibly motivated
them, you might ask, right?

950
00:37:34,680 --> 00:37:37,080
So, the thought has been in
this field for a long time,

951
00:37:37,080 --> 00:37:39,120
and it's still, I
think, unproven,

952
00:37:39,120 --> 00:37:42,040
that certain sequences will
have a certain propensity

953
00:37:42,040 --> 00:37:43,050
to certain structures.

954
00:37:43,050 --> 00:37:44,990
We saw this in the secondary
structure prediction

955
00:37:44,990 --> 00:37:47,156
algorithms, that there were
certain amino acids that

956
00:37:47,156 --> 00:37:49,450
occurred much more
frequently in alpha helixes.

957
00:37:49,450 --> 00:37:53,740
So it could be that there
are certain structures that

958
00:37:53,740 --> 00:37:56,480
are very likely to occur
for short peptides,

959
00:37:56,480 --> 00:37:58,280
and other ones that
almost never occur.

960
00:37:58,280 --> 00:38:01,410
And so if you had a large enough
database of protein structures,

961
00:38:01,410 --> 00:38:03,580
then that would be a
sensible sampling approach.

962
00:38:03,580 --> 00:38:06,012
Now, in practice, could you
have gotten some good answer

963
00:38:06,012 --> 00:38:06,970
in some other approach?

964
00:38:06,970 --> 00:38:07,553
We don't know.

965
00:38:07,553 --> 00:38:09,480
This is what
actually worked well.

966
00:38:09,480 --> 00:38:12,380
So there's no real theoretical
justification for it

967
00:38:12,380 --> 00:38:14,090
other than that
crude observation

968
00:38:14,090 --> 00:38:17,030
that there is some information
content that's local,

969
00:38:17,030 --> 00:38:20,110
and then a lot of information
content that's global.

970
00:38:20,110 --> 00:38:20,957
Yes?

971
00:38:20,957 --> 00:38:23,040
AUDIENCE: So when you're
doing a de novo approach,

972
00:38:23,040 --> 00:38:25,510
is it general that you
come up with a bunch

973
00:38:25,510 --> 00:38:27,980
of different clusters
as your answer,

974
00:38:27,980 --> 00:38:29,956
whereas with the
homology approach,

975
00:38:29,956 --> 00:38:32,255
you are more confident
of the structure answer?

976
00:38:32,255 --> 00:38:34,630
PROFESSOR: So the question
was, if you're doing a de novo

977
00:38:34,630 --> 00:38:36,080
approach, is it
generally the case

978
00:38:36,080 --> 00:38:38,080
that you have lots
of individual,

979
00:38:38,080 --> 00:38:40,320
or clusters of structures,
whereas in homology you

980
00:38:40,320 --> 00:38:40,820
tend not to.

981
00:38:40,820 --> 00:38:41,670
And yes, that's correct.

982
00:38:41,670 --> 00:38:43,294
So in the de novo,
there are frequently

983
00:38:43,294 --> 00:38:45,610
going to be multiple
solutions that

984
00:38:45,610 --> 00:38:48,080
look equally plausible to you,
whereas the homology tends

985
00:38:48,080 --> 00:38:51,210
to drive you to certain classes.

986
00:38:51,210 --> 00:38:51,840
Good questions.

987
00:38:51,840 --> 00:38:52,673
Any other questions?

988
00:39:01,290 --> 00:39:03,340
All right, so that was CASP.

989
00:39:03,340 --> 00:39:08,050
One was in 1995, which
seems like an eon ago.

990
00:39:08,050 --> 00:39:10,100
So how have things
improved over the course

991
00:39:10,100 --> 00:39:12,067
of the last decade or two?

992
00:39:12,067 --> 00:39:14,400
So there was an interesting
paper that came out recently

993
00:39:14,400 --> 00:39:17,240
that just looked at the
differences between CASP 10,

994
00:39:17,240 --> 00:39:19,230
one of the most
recent ones, and CASP 5.

995
00:39:19,230 --> 00:39:21,280
They're every two years,
so that's a decade.

996
00:39:21,280 --> 00:39:23,200
So how have things
improved or not

997
00:39:23,200 --> 00:39:25,820
over the last decade
in this challenge?

998
00:39:25,820 --> 00:39:30,420
So in this chart, the
y-axis is the percent

999
00:39:30,420 --> 00:39:34,160
of the residues
that were modeled

1000
00:39:34,160 --> 00:39:35,671
and that were not
in the template.

1001
00:39:35,671 --> 00:39:36,170
OK?

1002
00:39:36,170 --> 00:39:37,670
So I've got some template.

1003
00:39:37,670 --> 00:39:41,420
Some fraction of the amino acids
have no match in the template.

1004
00:39:41,420 --> 00:39:44,030
How many of those
do I get correct?

1005
00:39:44,030 --> 00:39:45,782
As a function of
target difficulty,

1006
00:39:45,782 --> 00:39:47,990
they have their own definition
for target difficulty.

1007
00:39:47,990 --> 00:39:49,830
You can look in the
actual paper to find out

1008
00:39:49,830 --> 00:39:51,870
what is in the CASP
competition, but it's

1009
00:39:51,870 --> 00:39:54,945
a combination of structural
and sequence data.

1010
00:39:54,945 --> 00:39:56,320
So let's just take
it that they

1011
00:39:56,320 --> 00:39:57,440
made some reasonable
choices here.

1012
00:39:57,440 --> 00:39:58,400
They actually put
a lot of effort

1013
00:39:58,400 --> 00:40:00,550
into coming up with
criteria for evaluation.

1014
00:40:00,550 --> 00:40:04,120
Every point in this diagram
represents some submitted

1015
00:40:04,120 --> 00:40:06,580
structure.

1016
00:40:06,580 --> 00:40:09,440
The CASP5, a decade
ago, are the triangles.

1017
00:40:09,440 --> 00:40:14,000
CASP 9, two years
ago, were the squares,

1018
00:40:14,000 --> 00:40:16,100
and the CASP10 are the circles.

1019
00:40:16,100 --> 00:40:20,015
And then the
trend lines for CASP9

1020
00:40:20,015 --> 00:40:23,760
and CASP10 are shown
here-- these two lines.

1021
00:40:23,760 --> 00:40:27,350
And you can see that they
do better for the easier

1022
00:40:27,350 --> 00:40:29,640
structures and worse for
the harder structures, which

1023
00:40:29,640 --> 00:40:33,610
is what you'd expect,
whereas CASP5 was pretty much

1024
00:40:33,610 --> 00:40:36,940
flat across all of them
and did about as well even

1025
00:40:36,940 --> 00:40:39,230
on the easy structures
as these ones are

1026
00:40:39,230 --> 00:40:40,740
doing on the hard structures.

1027
00:40:40,740 --> 00:40:43,802
So in terms of the fraction
of the protein that they don't

1028
00:40:43,802 --> 00:40:46,010
have a template for that
they're able to get correct,

1029
00:40:46,010 --> 00:40:48,770
they're doing much, much
better in the later CASPs

1030
00:40:48,770 --> 00:40:50,070
than they did a decade earlier.

1031
00:40:50,070 --> 00:40:51,540
So that's kind of encouraging.

1032
00:40:51,540 --> 00:40:54,270
Unfortunately, the story isn't
always that straightforward.

1033
00:40:54,270 --> 00:40:59,040
So this chart is, again, target
difficulty on the x-axis.

1034
00:40:59,040 --> 00:41:02,420
The y-axis is what they call
the Global Distance Test,

1035
00:41:02,420 --> 00:41:05,300
and it's a measure of model accuracy.

1036
00:41:05,300 --> 00:41:08,770
It's the percent of the carbon
alpha atoms in the predictions

1037
00:41:08,770 --> 00:41:11,470
that are close-- and they have
a precise definition of close

1038
00:41:11,470 --> 00:41:14,120
that you can look up-- that are
close to the true structure.
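
A rough Python sketch of a score in this spirit, assuming the model and native structures are already superimposed and given as matching lists of C-alpha coordinates. The default cutoffs of 1, 2, 4 and 8 angstroms follow the commonly quoted GDT_TS convention. The real test also searches over superpositions, which this sketch skips. The name gdt_like_score is illustrative.

```python
import math


def gdt_like_score(model_ca, native_ca, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Percent of C-alpha atoms close to the native, averaged over cutoffs.

    model_ca, native_ca: equal-length lists of (x, y, z) tuples, assumed
                         to be pre-aligned in the same coordinate frame.
    cutoffs: distance thresholds in angstroms.
    """
    if len(model_ca) != len(native_ca) or not model_ca:
        raise ValueError("need two non-empty coordinate lists of equal length")
    # Distance between each modeled atom and its native counterpart.
    dists = [math.dist(m, n) for m, n in zip(model_ca, native_ca)]
    # Fraction of atoms within each cutoff, then the mean over cutoffs.
    fractions = [sum(d <= c for d in dists) / len(dists) for c in cutoffs]
    return 100.0 * sum(fractions) / len(fractions)
```

A model identical to the native scores 100, and the score falls as more atoms drift past each cutoff.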

1039
00:41:14,120 --> 00:41:17,900
So for a perfect model, it would
be up here in the 90% to 100%

1040
00:41:17,900 --> 00:41:21,090
range, and then random
models would be down here.

1041
00:41:21,090 --> 00:41:24,760
You can see a lot of
them are close to random.

1042
00:41:24,760 --> 00:41:26,670
But more important here
are the trend lines.

1043
00:41:26,670 --> 00:41:28,850
So the trend line for
CASP10, the most recent one

1044
00:41:28,850 --> 00:41:30,910
in this report, is black.

1045
00:41:30,910 --> 00:41:35,610
And for CASP5, it's
this yellow one,

1046
00:41:35,610 --> 00:41:39,110
which is not that
different from the black.

1047
00:41:39,110 --> 00:41:43,070
So what this shows is that,
over the course of a decade,

1048
00:41:43,070 --> 00:41:45,270
the actual prediction
accuracy overall

1049
00:41:45,270 --> 00:41:48,770
has not improved that much.

1050
00:41:48,770 --> 00:41:50,530
It's a little bit shocking.

1051
00:41:50,530 --> 00:41:54,350
So they tried in this paper to
figure out, why is that?

1052
00:41:54,350 --> 00:41:56,930
I mean, the percentage of
the amino acids that you're

1053
00:41:56,930 --> 00:41:59,790
getting correct is going up,
but overall accuracy has not.

1054
00:41:59,790 --> 00:42:01,850
And so they make
some claims that it

1055
00:42:01,850 --> 00:42:05,350
could be that target difficulty
is not really a fair measure,

1056
00:42:05,350 --> 00:42:12,640
because a lot of the proteins
that are being submitted

1057
00:42:12,640 --> 00:42:16,150
are now actually much harder
in a different sense, in that

1058
00:42:16,150 --> 00:42:18,750
they're not single domain
proteins initially.

1059
00:42:18,750 --> 00:42:20,830
So in CASP5, a lot
of them were proteins

1060
00:42:20,830 --> 00:42:22,750
that had independent structures.

1061
00:42:22,750 --> 00:42:24,980
By the time of CASP10,
a lot of the proteins

1062
00:42:24,980 --> 00:42:26,550
that are being
submitted are more

1063
00:42:26,550 --> 00:42:28,450
interesting structural problems
in that their folding

1064
00:42:28,450 --> 00:42:30,783
is contingent on interactions
with lots of other things.

1065
00:42:30,783 --> 00:42:32,420
So maybe all the
information you need

1066
00:42:32,420 --> 00:42:35,065
is not contained entirely in
the sequence of the peptide

1067
00:42:35,065 --> 00:42:36,884
that you've been given
to test but depends

1068
00:42:36,884 --> 00:42:38,925
more on the interactions
of it with its partners.

1069
00:42:42,784 --> 00:42:44,200
So those were for
homology models.

1070
00:42:44,200 --> 00:42:46,580
These are the free
modeling results.

1071
00:42:46,580 --> 00:42:49,380
So in free modeling, there's
no homology to look at,

1072
00:42:49,380 --> 00:42:52,780
so they don't have a measure of
difficulty except for length.

1073
00:42:52,780 --> 00:42:55,170
They're using, again,
that Global Distance Test.

1074
00:42:55,170 --> 00:42:56,650
So up here are perfect models.

1075
00:42:56,650 --> 00:42:59,420
Down here are nearly
random models.

1076
00:42:59,420 --> 00:43:01,260
CASP10 is in red.

1077
00:43:01,260 --> 00:43:03,260
CASP5, a decade
earlier, is in green.

1078
00:43:03,260 --> 00:43:06,900
And you can see the trend
lines are very, very similar.

1079
00:43:06,900 --> 00:43:10,370
And CASP9, which is
the dashed line here,

1080
00:43:10,370 --> 00:43:13,236
looks almost identical to CASP5.

1081
00:43:13,236 --> 00:43:14,860
So again, this is
not very encouraging.

1082
00:43:14,860 --> 00:43:17,250
It says that the
accuracy of the models

1083
00:43:17,250 --> 00:43:19,925
has not improved very
much over the last decade.

1084
00:43:19,925 --> 00:43:21,550
And then, they do
point out that if you

1085
00:43:21,550 --> 00:43:26,390
focus on the short structures,
then it's kind of interesting.

1086
00:43:26,390 --> 00:43:30,400
So in CASP5, which are the
triangles, only one of these

1087
00:43:30,400 --> 00:43:32,880
was above 60%.

1088
00:43:32,880 --> 00:43:37,080
In CASP9, 5 out
of 11 were pretty good.

1089
00:43:37,080 --> 00:43:40,906
But then you get to
CASP10 and now only three

1090
00:43:40,906 --> 00:43:41,780
are greater than 60%.

1091
00:43:41,780 --> 00:43:43,620
So it's been
fluctuating quite a lot.

1092
00:43:43,620 --> 00:43:47,045
So modeling de novo is still
a very, very hard problem.

1093
00:43:47,045 --> 00:43:48,670
And they have a whole
bunch of theories

1094
00:43:48,670 --> 00:43:50,360
as to why that could be.

1095
00:43:50,360 --> 00:43:51,740
They proposed, as
I already said,

1096
00:43:51,740 --> 00:43:54,820
that maybe the models that
they're trying to solve

1097
00:43:54,820 --> 00:43:57,880
have gotten harder in ways
that are not easy to assess.

1098
00:43:57,880 --> 00:44:00,600
A lot of the proteins that
previously wouldn't have had

1099
00:44:00,600 --> 00:44:03,080
a homologue now already do,
because there has been a decade

1100
00:44:03,080 --> 00:44:05,980
of structural work trying
to fill in missing domain

1101
00:44:05,980 --> 00:44:08,330
structures.

1102
00:44:08,330 --> 00:44:11,580
And that these targets tend
to have more irregularity.

1103
00:44:11,580 --> 00:44:13,209
They tend to be part
of larger proteins.

1104
00:44:13,209 --> 00:44:14,875
So again, there's not
enough information

1105
00:44:14,875 --> 00:44:16,580
in the sequence of
what you're given

1106
00:44:16,580 --> 00:44:17,746
to make the full prediction.

1107
00:44:20,115 --> 00:44:20,615
Questions?

1108
00:44:26,330 --> 00:44:28,660
So what we've seen so far
has been the Rosetta approach

1109
00:44:28,660 --> 00:44:29,910
to solving protein structures.

1110
00:44:29,910 --> 00:44:32,150
And it really is,
throw everything at it.

1111
00:44:32,150 --> 00:44:33,460
Any trick that you've got.

1112
00:44:33,460 --> 00:44:34,740
Let's look into the databases.

1113
00:44:34,740 --> 00:44:37,160
Let's take homologous proteins.

1114
00:44:37,160 --> 00:44:37,660
Right?

1115
00:44:37,660 --> 00:44:41,697
So we have these high,
medium, and low-level homologues.

1116
00:44:41,697 --> 00:44:43,280
And even when we're
doing a homologue,

1117
00:44:43,280 --> 00:44:45,490
we don't restrict ourselves
to that protein structure.

1118
00:44:45,490 --> 00:44:47,531
But for certain parts,
we'll go into the database

1119
00:44:47,531 --> 00:44:50,050
and find the structures of
peptides of length three

1120
00:44:50,050 --> 00:44:50,860
to nine.

1121
00:44:50,860 --> 00:44:53,480
Pull those out of the
database. Plug those in.

1122
00:44:53,480 --> 00:44:56,860
Our potential energy functions
are a grab bag of information,

1123
00:44:56,860 --> 00:44:59,381
some of which has strong
physical principles, some of which

1124
00:44:59,381 --> 00:45:01,130
is just curve fitting
to make sure that we

1125
00:45:01,130 --> 00:45:03,880
keep the hydrophobics inside
and hydrophilics outside.

1126
00:45:03,880 --> 00:45:06,790
So we throw any information
that we have at the problem,

1127
00:45:06,790 --> 00:45:11,120
whereas our physicist has
disdain for that approach.

1128
00:45:11,120 --> 00:45:11,790
He says, no, no.

1129
00:45:11,790 --> 00:45:13,510
We're going to do this
purely by the book.

1130
00:45:13,510 --> 00:45:16,560
All of our equations are going
to have some physical grounding

1131
00:45:16,560 --> 00:45:17,244
to them.

1132
00:45:17,244 --> 00:45:19,160
We're not going to start
with homology models.

1133
00:45:19,160 --> 00:45:21,160
We're going to try to do the
simulation that I showed you

1134
00:45:21,160 --> 00:45:23,090
a little movie of for
every single protein we

1135
00:45:23,090 --> 00:45:26,780
want to know the structure of.

1136
00:45:26,780 --> 00:45:28,790
Now, why is that problem hard?

1137
00:45:28,790 --> 00:45:33,320
It's because these
potential energy landscapes

1138
00:45:33,320 --> 00:45:34,451
are incredibly complex.

1139
00:45:34,451 --> 00:45:34,950
Right?

1140
00:45:34,950 --> 00:45:36,030
They're very rugged.

1141
00:45:36,030 --> 00:45:38,800
Trying to get from any current
position to any other position

1142
00:45:38,800 --> 00:45:42,010
requires going over
many, many minima.

1143
00:45:42,010 --> 00:45:44,640
So the reason it's
hard to do, then,

1144
00:45:44,640 --> 00:45:47,377
is it's primarily a
computing power issue.

1145
00:45:47,377 --> 00:45:48,960
There's just not
enough computer power

1146
00:45:48,960 --> 00:45:50,251
to solve all of these problems.

1147
00:45:50,251 --> 00:45:52,464
So what one group, D. E.
Shaw, did was they said,

1148
00:45:52,464 --> 00:45:54,130
well, we can solve
that by just spending

1149
00:45:54,130 --> 00:45:58,900
a lot of money, which
fortunately they had.

1150
00:45:58,900 --> 00:46:01,290
So they designed
hardware that actually

1151
00:46:01,290 --> 00:46:06,300
solves individual components of
the potential energy function

1152
00:46:06,300 --> 00:46:08,450
in hardware rather
than in software.

1153
00:46:08,450 --> 00:46:11,760
So they have a chip that
they call Anton that actually

1154
00:46:11,760 --> 00:46:15,463
has parts of it that solve the
electrostatic function, the van

1155
00:46:15,463 --> 00:46:17,480
der Waals function.

1156
00:46:17,480 --> 00:46:20,100
And so in these chips,
rather than in software,

1157
00:46:20,100 --> 00:46:22,180
you are going as fast
as you conceivably

1158
00:46:22,180 --> 00:46:24,190
can to solve the energy terms.

1159
00:46:24,190 --> 00:46:26,890
And that allows you to
sample much, much more space.

1160
00:46:26,890 --> 00:46:29,710
Run your simulations
for much, much longer

1161
00:46:29,710 --> 00:46:31,260
in terms of real time.

1162
00:46:31,260 --> 00:46:32,460
And they do remarkably well.

1163
00:46:32,460 --> 00:46:34,890
So here are some pictures
from a paper of theirs--

1164
00:46:34,890 --> 00:46:37,457
a couple of years ago
now-- with the predicted

1165
00:46:37,457 --> 00:46:38,540
and the actual structures.

1166
00:46:38,540 --> 00:46:40,331
I don't even remember
which color is which,

1167
00:46:40,331 --> 00:46:41,960
but you can see it
doesn't much matter.

1168
00:46:41,960 --> 00:46:45,990
They get them down to
very, very high resolution.

1169
00:46:45,990 --> 00:46:50,350
Now, what do you notice
about all these structures?

1170
00:46:50,350 --> 00:46:51,750
AUDIENCE: They're small.

1171
00:46:51,750 --> 00:46:53,740
PROFESSOR: They're small, right?

1172
00:46:53,740 --> 00:46:55,490
So obviously there's
a reason for that.

1173
00:46:55,490 --> 00:46:57,910
That's what you can do in
reasonable compute time,

1174
00:46:57,910 --> 00:47:01,392
even with high-end computing
that's special purpose.

1175
00:47:01,392 --> 00:47:02,850
So we're still not
in a state where

1176
00:47:02,850 --> 00:47:04,850
they can fold any
arbitrary structure.

1177
00:47:04,850 --> 00:47:07,370
What else do you
notice about them?

1178
00:47:07,370 --> 00:47:08,412
Yeah, in the back.

1179
00:47:08,412 --> 00:47:09,396
AUDIENCE: They have
very well-defined

1180
00:47:09,396 --> 00:47:09,890
secondary structures.

1181
00:47:09,890 --> 00:47:11,190
PROFESSOR: They have
very well-defined

1182
00:47:11,190 --> 00:47:12,064
secondary structures.

1183
00:47:12,064 --> 00:47:14,152
And they're specifically
what, mostly?

1184
00:47:14,152 --> 00:47:15,153
AUDIENCE: Alpha helixes.

1185
00:47:15,153 --> 00:47:16,485
PROFESSOR: Alpha helixes, right.

1186
00:47:16,485 --> 00:47:19,170
And it turns out that a lot more
information is encoded locally

1187
00:47:19,170 --> 00:47:21,480
in an alpha helix than
in a beta sheet, which

1188
00:47:21,480 --> 00:47:24,760
is going to be contingent on
what that piece of protein

1189
00:47:24,760 --> 00:47:25,480
comes up against.

1190
00:47:25,480 --> 00:47:25,700
Right?

1191
00:47:25,700 --> 00:47:27,240
Whereas in the
alpha helix, we saw

1192
00:47:27,240 --> 00:47:30,000
that you can get 60% accuracy
with very crude algorithms,

1193
00:47:30,000 --> 00:47:30,590
right?

1194
00:47:30,590 --> 00:47:34,675
So we're going to do best
with these physics approaches

1195
00:47:34,675 --> 00:47:37,660
when we have small proteins
that are largely alpha helical.

1196
00:47:37,660 --> 00:47:41,300
But in later papers-- well
here's even an example.

1197
00:47:41,300 --> 00:47:43,582
Here's one that has a
certain amount of beta sheet.

1198
00:47:43,582 --> 00:47:45,790
And the structures are going
to get larger with time.

1199
00:47:45,790 --> 00:47:47,160
So it's not an inherent problem.

1200
00:47:47,160 --> 00:47:49,820
It's just a question of
how fast the hardware is

1201
00:47:49,820 --> 00:47:52,450
today versus tomorrow.

1202
00:47:52,450 --> 00:47:54,860
OK, a third approach.

1203
00:47:54,860 --> 00:47:56,620
So we had the
statistical approach.

1204
00:47:56,620 --> 00:47:58,120
We have the physics approach.

1205
00:47:58,120 --> 00:48:00,310
The third approach, that
I won't go into detail

1206
00:48:00,310 --> 00:48:02,910
but you can play around
with literally yourselves,

1207
00:48:02,910 --> 00:48:05,870
is a game where
we have humans who

1208
00:48:05,870 --> 00:48:08,530
try to identify the
right structure,

1209
00:48:08,530 --> 00:48:12,360
just as humans do very well
in other kinds of pattern

1210
00:48:12,360 --> 00:48:13,680
recognition problems.

1211
00:48:13,680 --> 00:48:18,560
So you can try this video game
where you're given structures

1212
00:48:18,560 --> 00:48:21,040
to try to solve and say, oh,
should I make that helical?

1213
00:48:21,040 --> 00:48:22,790
Should I rotate that side chain?

1214
00:48:22,790 --> 00:48:24,300
So give it a try.

1215
00:48:24,300 --> 00:48:28,480
Just Google Foldit,
and you can find out

1216
00:48:28,480 --> 00:48:32,991
whether you can be the best
gamers and beat the hardware.

1217
00:48:32,991 --> 00:48:33,490
All right.

1218
00:48:36,200 --> 00:48:37,950
So so far we've been
talking about solving

1219
00:48:37,950 --> 00:48:40,090
the structures of
individual proteins.

1220
00:48:40,090 --> 00:48:43,210
We've seen there is some
success in this field.

1221
00:48:43,210 --> 00:48:45,660
It's improved a
lot in some ways.

1222
00:48:45,660 --> 00:48:48,820
Between CASP1 and CASP5 I think
there's been huge improvements.

1223
00:48:48,820 --> 00:48:51,410
Between CASP5 and CASP10, maybe
the problems have gotten harder.

1224
00:48:51,410 --> 00:48:52,460
Maybe there have
been no improvements.

1225
00:48:52,460 --> 00:48:54,390
We'll leave that for
others to decide.

1226
00:48:54,390 --> 00:48:56,729
What I'd like to look at
in the end of this lecture

1227
00:48:56,729 --> 00:48:58,270
and the beginning
of the next lecture

1228
00:48:58,270 --> 00:49:00,480
are problems of proteins
interacting with each other,

1229
00:49:00,480 --> 00:49:02,063
and can we predict
those interactions?

1230
00:49:02,063 --> 00:49:04,956
And that'll, then, lead us
towards even larger systems

1231
00:49:04,956 --> 00:49:05,830
and network problems.

1232
00:49:08,596 --> 00:49:09,970
So we're going to
break this down

1233
00:49:09,970 --> 00:49:12,680
to three separate
prediction problems.

1234
00:49:12,680 --> 00:49:15,550
The first of these is predicting
the effect of a point mutation

1235
00:49:15,550 --> 00:49:17,120
on the stability
of a known complex.

1236
00:49:17,120 --> 00:49:19,500
So in some ways, you might
think this is an easy problem.

1237
00:49:19,500 --> 00:49:20,440
I've got two proteins.

1238
00:49:20,440 --> 00:49:21,398
I know their structure.

1239
00:49:21,398 --> 00:49:22,350
I know they interact.

1240
00:49:22,350 --> 00:49:24,560
I want to predict whether
a mutation stabilizes

1241
00:49:24,560 --> 00:49:27,120
that interaction or
makes it fall apart.

1242
00:49:27,120 --> 00:49:29,210
That's the first
of the problems.

1243
00:49:29,210 --> 00:49:30,960
We can try to
predict the structure

1244
00:49:30,960 --> 00:49:33,450
of particular complexes,
and we can then

1245
00:49:33,450 --> 00:49:36,060
try to generalize that and try
to predict every protein that

1246
00:49:36,060 --> 00:49:38,690
interacts with
every other protein.

1247
00:49:38,690 --> 00:49:42,550
We'll see how we
do on all of those.

1248
00:49:42,550 --> 00:49:45,020
So we'll go into one of these
competition papers, which

1249
00:49:45,020 --> 00:49:46,710
are very good at
evaluating the fields.

1250
00:49:46,710 --> 00:49:50,580
This competition paper looked at
what I call the simple problem.

1251
00:49:50,580 --> 00:49:53,000
So you've got two proteins
of known structure.

1252
00:49:53,000 --> 00:49:55,480
The authors of the paper,
who issued the challenge,

1253
00:49:55,480 --> 00:49:58,690
knew the answer for the effect
of every possible mutation

1254
00:49:58,690 --> 00:50:01,210
at a whole bunch of positions
along these proteins

1255
00:50:01,210 --> 00:50:05,380
on the-- well, an approximation
to the free energy of binding.

1256
00:50:05,380 --> 00:50:07,610
So they challenged
the competitors

1257
00:50:07,610 --> 00:50:09,610
to try to figure out, we
give you the structure,

1258
00:50:09,610 --> 00:50:12,740
we tell you all the
positions we've mutated,

1259
00:50:12,740 --> 00:50:15,450
and you tell us whether those
mutations made the complex more

1260
00:50:15,450 --> 00:50:17,900
stable or made the
complex less stable.

1261
00:50:21,270 --> 00:50:24,250
Now specifically, they had two
separate protein structures.

1262
00:50:24,250 --> 00:50:26,770
They mutated 53
positions in one.

1263
00:50:26,770 --> 00:50:28,490
45 positions in another.

1264
00:50:28,490 --> 00:50:30,790
They didn't directly measure
the free energy of binding

1265
00:50:30,790 --> 00:50:32,850
for every possible complex,
but they used a high throughput

1266
00:50:32,850 --> 00:50:33,350
assay.

1267
00:50:33,350 --> 00:50:34,890
We won't go into
the details, but it

1268
00:50:34,890 --> 00:50:37,410
should track, more or
less, with the free energy.

1269
00:50:37,410 --> 00:50:42,290
So things that seem to be more
stable interactors here probably

1270
00:50:42,290 --> 00:50:45,370
are lower free energy complexes.

1271
00:50:45,370 --> 00:50:49,230
OK, so how would you go
about trying to solve this?

1272
00:50:49,230 --> 00:50:51,294
So using these potential
energy functions

1273
00:50:51,294 --> 00:50:52,710
that we've already
seen, you could

1274
00:50:52,710 --> 00:50:57,000
try to plug in the mutation
into the structure.

1275
00:50:57,000 --> 00:51:00,190
And what would you
have to do then

1276
00:51:00,190 --> 00:51:02,730
in order to evaluate the energy?

1277
00:51:02,730 --> 00:51:06,170
Before you evaluate the energy.

1278
00:51:06,170 --> 00:51:08,170
So I've got known structure.

1279
00:51:08,170 --> 00:51:13,390
I say, position 23 I'm mutating
from phenylalanine to alanine.

1280
00:51:13,390 --> 00:51:14,930
I'll say alanine
to phenylalanine.

1281
00:51:14,930 --> 00:51:15,980
Make it a little
more interesting.

1282
00:51:15,980 --> 00:51:16,480
OK?

1283
00:51:16,480 --> 00:51:18,380
So I've now stuck on
this big side chain.

1284
00:51:18,380 --> 00:51:20,310
So what do I need to do before
I can evaluate the structure

1285
00:51:20,310 --> 00:51:20,810
energy?

1286
00:51:20,810 --> 00:51:22,856
AUDIENCE: Make sure
there's no clashes.

1287
00:51:22,856 --> 00:51:24,480
PROFESSOR: Make sure
no clashes, right?

1288
00:51:24,480 --> 00:51:25,380
So I have to do one
of those methods

1289
00:51:25,380 --> 00:51:28,230
that we already described
for optimizing the side chain

1290
00:51:28,230 --> 00:51:29,850
conformation, and
then I can decide,

1291
00:51:29,850 --> 00:51:32,200
based on the free energy,
whether it's an improvement

1292
00:51:32,200 --> 00:51:33,870
or makes things worse.

1293
00:51:33,870 --> 00:51:36,380
OK, so let's see how they do.

1294
00:51:36,380 --> 00:51:39,284
So here's an example
of a solution.

1295
00:51:39,284 --> 00:51:41,700
The submitter, the person who
has the algorithm for making

1296
00:51:41,700 --> 00:51:44,455
a prediction, decides on
some cutoff in their energy

1297
00:51:44,455 --> 00:51:45,830
function, whether
they think this

1298
00:51:45,830 --> 00:51:47,890
is improving things or
making things worse.

1299
00:51:47,890 --> 00:51:49,340
So they decide on the color.

1300
00:51:49,340 --> 00:51:51,220
Each one of these
dots represents

1301
00:51:51,220 --> 00:51:52,095
a different mutation.

1302
00:51:55,010 --> 00:51:58,420
On the y-axis is the
actual change in binding,

1303
00:51:58,420 --> 00:51:59,910
the observed change in binding.

1304
00:51:59,910 --> 00:52:01,660
So things above zero
are improved binding.

1305
00:52:01,660 --> 00:52:04,010
Below zero are worse binding.

1306
00:52:04,010 --> 00:52:07,607
And here are the predictions
on the submitter's scale.

1307
00:52:07,607 --> 00:52:09,690
And here the submitter
said that everything in red

1308
00:52:09,690 --> 00:52:12,940
should be worse and everything
green should be better.

1309
00:52:12,940 --> 00:52:15,210
And you can see that
there's some trend.

1310
00:52:15,210 --> 00:52:18,530
They're doing reasonably well
in predicting all these red guys

1311
00:52:18,530 --> 00:52:20,930
as being bad, but
they're not doing so well

1312
00:52:20,930 --> 00:52:23,780
in the neutral ones, clearly,
and certainly not doing

1313
00:52:23,780 --> 00:52:26,707
that well in the improved ones.

1314
00:52:26,707 --> 00:52:29,290
Now, is this one of the better
submitters or one of the worst?

1315
00:52:29,290 --> 00:52:30,420
You'd hope that this
is one of the worst,

1316
00:52:30,420 --> 00:52:32,337
but in fact this is one
of the top submitters.

1317
00:52:32,337 --> 00:52:33,794
In fact, not just
the top submitter

1318
00:52:33,794 --> 00:52:35,410
but top submitter
looking at mutations

1319
00:52:35,410 --> 00:52:37,300
that are right at the
interface where you'd think

1320
00:52:37,300 --> 00:52:38,450
they'd do the best, right?

1321
00:52:38,450 --> 00:52:41,090
So if there's some mutation on
the backside of the protein,

1322
00:52:41,090 --> 00:52:42,320
there's less
structural information

1323
00:52:42,320 --> 00:52:44,340
about what that's going to
be doing in the complex.

1324
00:52:44,340 --> 00:52:45,800
There could be some
surprising results.

1325
00:52:45,800 --> 00:52:47,466
But here, these are
amino acid mutations

1326
00:52:47,466 --> 00:52:50,450
right at the interface.

1327
00:52:50,450 --> 00:52:52,650
So here's an example
of the top performer.

1328
00:52:52,650 --> 00:52:54,290
This is the graph
I just showed you,

1329
00:52:54,290 --> 00:52:55,748
focusing only on
the residues

1330
00:52:55,748 --> 00:52:57,484
of the interface, and all sites.

1331
00:52:57,484 --> 00:52:58,650
And here's an average group.

1332
00:52:58,650 --> 00:53:00,030
And you can see the
average groups are really

1333
00:53:00,030 --> 00:53:01,430
doing rather abysmally.

1334
00:53:04,330 --> 00:53:08,270
So this blue cluster that's
almost entirely below zero

1335
00:53:08,270 --> 00:53:09,777
was supposed to be neutral.

1336
00:53:09,777 --> 00:53:11,860
And these green ones were
supposed to be improved,

1337
00:53:11,860 --> 00:53:14,690
and they're almost
entirely below zero.

1338
00:53:14,690 --> 00:53:16,650
This is not an encouraging story.

1339
00:53:16,650 --> 00:53:19,140
So how do we
evaluate objectively

1340
00:53:19,140 --> 00:53:21,060
whether they're
really doing well?

1341
00:53:21,060 --> 00:53:23,720
So we have some sort
of baseline measure.

1342
00:53:23,720 --> 00:53:26,240
What is the sort
of baseline algorithm

1343
00:53:26,240 --> 00:53:29,360
you could use to predict
whether a mutation is improving

1344
00:53:29,360 --> 00:53:31,347
or hurting this interface?

1345
00:53:31,347 --> 00:53:32,846
So all of their
algorithms are going

1346
00:53:32,846 --> 00:53:34,630
to use some kind
of energy function.

1347
00:53:34,630 --> 00:53:37,005
What have we already seen in
earlier parts of this course

1348
00:53:37,005 --> 00:53:38,130
that we could use?

1349
00:53:38,130 --> 00:53:40,930
Well, we could use the
substitution matrices, right?

1350
00:53:40,930 --> 00:53:42,580
We have the BLOSUM
substitution matrix

1351
00:53:42,580 --> 00:53:45,520
that tells us how
surprised we should

1352
00:53:45,520 --> 00:53:47,750
be when we see, in evolution,
that Amino Acid A turns

1353
00:53:47,750 --> 00:53:51,170
into Amino Acid B.
So we could use,

1354
00:53:51,170 --> 00:53:52,950
in this case, the BLOSUM matrix.

1355
00:53:52,950 --> 00:53:54,645
That gives us for
each mutation a score.

1356
00:53:54,645 --> 00:53:57,900
It ranges from minus 4 to 11.

1357
00:53:57,900 --> 00:54:00,090
And we can rank
every mutation based

1358
00:54:00,090 --> 00:54:02,840
on the BLOSUM matrix
for the substitution

1359
00:54:02,840 --> 00:54:06,212
and say, OK, at some value
in this range things should

1360
00:54:06,212 --> 00:54:07,670
be getting better
or getting worse.

1361
00:54:10,810 --> 00:54:13,426
So here's an area
under the curve plot

1362
00:54:13,426 --> 00:54:15,050
where we've plotted
the false positives

1363
00:54:15,050 --> 00:54:18,040
and true positive
rates as I change

1364
00:54:18,040 --> 00:54:19,800
my threshold for
that BLOSUM matrix.

1365
00:54:19,800 --> 00:54:24,100
So I compute what the
mutation's BLOSUM score is,

1366
00:54:24,100 --> 00:54:27,400
and then I say, OK, is a
value of 11 bad or is it good?

1367
00:54:27,400 --> 00:54:28,684
Is a value of 10 bad or good?

1368
00:54:28,684 --> 00:54:30,100
That's what this
curve represents.

1369
00:54:30,100 --> 00:54:33,950
As I vary that threshold,
how many do I get right

1370
00:54:33,950 --> 00:54:36,260
and how many do I get wrong?

1371
00:54:36,260 --> 00:54:38,680
If I'm doing the
decisions at random,

1372
00:54:38,680 --> 00:54:41,290
then I'll be getting
roughly equal true positives

1373
00:54:41,290 --> 00:54:42,750
and false positives.

1374
00:54:42,750 --> 00:54:45,630
They do slightly better than
random using this matrix.

1375
00:54:45,630 --> 00:54:49,020
Now, the best algorithm at
predicting that uses energies

1376
00:54:49,020 --> 00:54:51,140
only does marginally better.

1377
00:54:51,140 --> 00:54:54,440
So this is the best
algorithm at predicting.

1378
00:54:54,440 --> 00:54:58,220
This is this baseline algorithm
using just the BLOSUM matrix.

1379
00:54:58,220 --> 00:55:02,270
You can see that the green curve
predicting beneficial mutations

1380
00:55:02,270 --> 00:55:03,350
is really hard.

1381
00:55:03,350 --> 00:55:05,200
They don't do much
better than random.

1382
00:55:05,200 --> 00:55:07,430
And for the
deleterious mutations,

1383
00:55:07,430 --> 00:55:10,090
they do somewhat better.

1384
00:55:10,090 --> 00:55:12,520
So we could make these
plots for every single one

1385
00:55:12,520 --> 00:55:14,260
of the algorithms,
but a little easier

1386
00:55:14,260 --> 00:55:17,380
is to just compute the
area under the curve.

1387
00:55:17,380 --> 00:55:19,030
So how much of the area?

1388
00:55:19,030 --> 00:55:22,040
If I were doing perfectly, I
would get 100% true positives

1389
00:55:22,040 --> 00:55:23,410
and no false positives, right?

1390
00:55:23,410 --> 00:55:25,205
So my line would go
straight up and across

1391
00:55:25,205 --> 00:55:27,260
and the area under the
curve would be one.

1392
00:55:27,260 --> 00:55:30,200
And if I'm doing terribly,
I'll get no true positives

1393
00:55:30,200 --> 00:55:32,105
and all false positives.

1394
00:55:32,105 --> 00:55:34,284
I'd be flatlining and
my area would be zero.

1395
00:55:34,284 --> 00:55:35,700
So the area under
the curve, which

1396
00:55:35,700 --> 00:55:37,330
is normalized
between zero and one,

1397
00:55:37,330 --> 00:55:39,830
will give me a sense of how
well these algorithms are doing.

1398
00:55:39,830 --> 00:55:44,160
So this plot-- focus first
on the black dots-- shows

1399
00:55:44,160 --> 00:55:46,880
at each one of these algorithms
what the area under the curve

1400
00:55:46,880 --> 00:55:50,410
is for beneficial and
deleterious mutations.

1401
00:55:50,410 --> 00:55:53,230
Beneficial on the x-axis,
deleterious mutations

1402
00:55:53,230 --> 00:55:54,650
on the y-axis.

1403
00:55:54,650 --> 00:55:56,600
The BLOSUM matrix is here.

1404
00:55:56,600 --> 00:56:00,440
So good algorithms should be
above that and to the right.

1405
00:56:00,440 --> 00:56:03,059
They should be having a better
area under the curve.

1406
00:56:03,059 --> 00:56:04,850
And you can see the
perfect algorithm would

1407
00:56:04,850 --> 00:56:06,260
have been all the way up here.

1408
00:56:06,260 --> 00:56:08,880
None of the black dots
are even remotely close.

1409
00:56:08,880 --> 00:56:11,910
The G21, which we'll talk
about a little bit in a minute,

1410
00:56:11,910 --> 00:56:15,890
is somewhat better than the
BLOSUM matrix, but not a lot.

1411
00:56:19,640 --> 00:56:24,270
Now, I'm going to ignore the
second round in much detail,

1412
00:56:24,270 --> 00:56:26,530
because this is a case
where people weren't doing

1413
00:56:26,530 --> 00:56:28,650
so well in the first round so
they went out and gave them

1414
00:56:28,650 --> 00:56:30,850
some of the information about
mutations at all the positions.

1415
00:56:30,850 --> 00:56:32,320
And that really changes
the nature of the problem,

1416
00:56:32,320 --> 00:56:33,730
because then you have
a tremendous amount

1417
00:56:33,730 --> 00:56:35,570
of information about which
positions are important

1418
00:56:35,570 --> 00:56:37,430
and how much of a difference
those mutations are making.

1419
00:56:37,430 --> 00:56:39,200
So we'll ignore
the second round,

1420
00:56:39,200 --> 00:56:42,300
which I think is an overly
generous way of comparing

1421
00:56:42,300 --> 00:56:43,700
these algorithms.

1422
00:56:43,700 --> 00:56:46,060
OK, so what did the authors
of this paper observe?

1423
00:56:46,060 --> 00:56:48,060
They observed that the
best algorithms were only

1424
00:56:48,060 --> 00:56:50,410
doing marginally better
than random choice.

1425
00:56:50,410 --> 00:56:53,230
So three times better.

1426
00:56:53,230 --> 00:56:57,080
And that there seemed to be
a particular problem looking

1427
00:56:57,080 --> 00:56:59,535
at mutations that
affect polar positions.

1428
00:57:02,370 --> 00:57:05,510
One of the things that I think
was particularly interesting

1429
00:57:05,510 --> 00:57:09,150
and quite relevant when we
think about these things

1430
00:57:09,150 --> 00:57:11,724
in a thermodynamic context
is that the algorithms that

1431
00:57:11,724 --> 00:57:13,890
did better-- none of them
could be really considered

1432
00:57:13,890 --> 00:57:16,690
to do really well-- but the
algorithms that did better

1433
00:57:16,690 --> 00:57:19,440
didn't just focus on
the energetic change

1434
00:57:19,440 --> 00:57:22,490
between forming the
native complex over here

1435
00:57:22,490 --> 00:57:24,990
and forming this mutant
complex indicated by the star.

1436
00:57:24,990 --> 00:57:27,120
But they also focused on
the effect of the mutation

1437
00:57:27,120 --> 00:57:29,590
on the stability of
the mutated protein.

1438
00:57:29,590 --> 00:57:31,480
So there's an
equilibrium not just

1439
00:57:31,480 --> 00:57:33,580
moving between the free
proteins and the complex,

1440
00:57:33,580 --> 00:57:35,830
but also moving
between the free proteins that

1441
00:57:35,830 --> 00:57:38,882
are folded and the free
proteins that are unfolded.

1442
00:57:38,882 --> 00:57:40,590
And some of these
mutations are affecting

1443
00:57:40,590 --> 00:57:42,506
the energy of the folded
state, and so they're

1444
00:57:42,506 --> 00:57:45,550
driving things to the
left, to the unfolded.

1445
00:57:45,550 --> 00:57:47,560
And if you don't include
that, then you actually

1446
00:57:47,560 --> 00:57:48,268
get into trouble.

1447
00:57:48,268 --> 00:57:50,370
And I've put a link here
to some lecture notes

1448
00:57:50,370 --> 00:57:52,580
from a different course that
I teach where you can look up

1449
00:57:52,580 --> 00:57:54,705
some details and more
sophisticated approaches that

1450
00:57:54,705 --> 00:57:58,340
actually do take into account
a lot of the unfolded states.

1451
00:58:02,570 --> 00:58:06,760
So the best approaches--
best of a bad lot--

1452
00:58:06,760 --> 00:58:10,150
consider the effects of
mutations on stability.

1453
00:58:10,150 --> 00:58:13,860
They also model packing,
electrostatics, and solvation.

1454
00:58:13,860 --> 00:58:15,530
But the actual
algorithms that they used

1455
00:58:15,530 --> 00:58:17,095
were a whole mishmash
of approaches.

1456
00:58:17,095 --> 00:58:19,680
So there didn't seem to emerge
a common pattern in what they

1457
00:58:19,680 --> 00:58:21,929
were doing, and I thought I
would take you through one

1458
00:58:21,929 --> 00:58:24,080
of these to see what
actually they were doing.

1459
00:58:24,080 --> 00:58:28,080
So the best one was this
machine learning approach, G21.

1460
00:58:28,080 --> 00:58:30,090
So this is how they
solved the problem.

1461
00:58:30,090 --> 00:58:33,150
First of all, they dug
through the literature

1462
00:58:33,150 --> 00:58:36,940
and found 930 cases where they
could associate a mutation

1463
00:58:36,940 --> 00:58:38,660
with a change in energy.

1464
00:58:38,660 --> 00:58:41,441
These had nothing to do with
proteins under consideration.

1465
00:58:41,441 --> 00:58:43,190
They were completely
different structures.

1466
00:58:43,190 --> 00:58:44,815
But they were cases
where they actually

1467
00:58:44,815 --> 00:58:47,849
had energetic information
for each mutation.

1468
00:58:47,849 --> 00:58:49,390
Then they go through
and try to predict

1469
00:58:49,390 --> 00:58:51,870
what the structural change
will be in the protein,

1470
00:58:51,870 --> 00:58:55,669
using somebody else's
algorithm, FoldX.

1471
00:58:55,669 --> 00:58:57,710
And now, they describe
each mutant, not just with

1472
00:58:57,710 --> 00:58:59,210
a single energy--
we have focused,

1473
00:58:59,210 --> 00:59:02,080
for example, on PyRosetta,
which you'll use in process--

1474
00:59:02,080 --> 00:59:04,782
but they actually had
85 different features

1475
00:59:04,782 --> 00:59:06,490
from a whole bunch of
different programs.

1476
00:59:06,490 --> 00:59:07,820
So they're taking a
pretty agnostic view.

1477
00:59:07,820 --> 00:59:10,361
They're saying, we don't know
which of these energy functions

1478
00:59:10,361 --> 00:59:13,380
is the best, so let's let
the machine learning decide.

1479
00:59:13,380 --> 00:59:16,360
So every single mutation that's
posed to them as a problem,

1480
00:59:16,360 --> 00:59:18,500
they have 85
different parameters

1481
00:59:18,500 --> 00:59:22,560
as to whether it's
improving things or not.

1482
00:59:22,560 --> 00:59:26,290
And then, they had their
database of 930 mutations.

1483
00:59:26,290 --> 00:59:28,195
For each one of those
they had 85 parameters.

1484
00:59:31,510 --> 00:59:33,030
So those are labeled
training data.

1485
00:59:33,030 --> 00:59:35,900
They know whether things
are getting better or worse.

1486
00:59:35,900 --> 00:59:40,790
They actually don't even rely
on a single machine learning

1487
00:59:40,790 --> 00:59:41,310
method.

1488
00:59:41,310 --> 00:59:43,360
They actually used five
different approaches.

1489
00:59:43,360 --> 00:59:47,419
We'll discuss Bayesian
nets later in this course.

1490
00:59:47,419 --> 00:59:49,210
Most of these others
we won't cover at all,

1491
00:59:49,210 --> 00:59:51,970
but they used a lot of different
computational approaches

1492
00:59:51,970 --> 00:59:55,480
to try to decide how to go
from those 85 parameters

1493
00:59:55,480 --> 00:59:59,471
to a prediction of whether the
structure is improved or not.

1494
01:00:03,620 --> 01:00:05,770
So this actually
shows the complexity

1495
01:00:05,770 --> 01:00:08,310
of this apparently
simple problem, right?

1496
01:00:08,310 --> 01:00:11,620
Here's a case where I have two
proteins of known structure.

1497
01:00:11,620 --> 01:00:14,790
I'm making very specific
point mutations,

1498
01:00:14,790 --> 01:00:19,802
and even so I do only
marginally better than random.

1499
01:00:19,802 --> 01:00:22,010
And even throwing at it all
the best machine learning

1500
01:00:22,010 --> 01:00:22,560
techniques.

1501
01:00:22,560 --> 01:00:24,840
So there's clearly a
lot in protein structure

1502
01:00:24,840 --> 01:00:28,014
that we don't yet have
parametrized in these energy

1503
01:00:28,014 --> 01:00:28,514
functions.

1504
01:00:35,270 --> 01:00:36,910
So maybe some of
these other problems

1505
01:00:36,910 --> 01:00:38,820
are actually not as
hard as we thought.

1506
01:00:38,820 --> 01:00:41,150
Maybe instead of trying to
be very precise in terms

1507
01:00:41,150 --> 01:00:43,380
of the energetic change
for a single mutation

1508
01:00:43,380 --> 01:00:47,020
at an interface, we'd do
better trying to predict rather

1509
01:00:47,020 --> 01:00:49,250
crude parameters of which
two proteins interact

1510
01:00:49,250 --> 01:00:49,917
with each other.

1511
01:00:49,917 --> 01:00:51,333
So that's what
we're going to look

1512
01:00:51,333 --> 01:00:52,810
at in the next
part of the course.

1513
01:00:52,810 --> 01:00:55,590
We're going to
look at whether we

1514
01:00:55,590 --> 01:00:58,210
can use structural data to
predict which two proteins will

1515
01:00:58,210 --> 01:01:00,060
interact.

1516
01:01:00,060 --> 01:01:02,990
So here we've got a problem,
which is a docking problem.

1517
01:01:02,990 --> 01:01:04,190
I've got two proteins.

1518
01:01:04,190 --> 01:01:06,106
Say they're of known
structure, but I've never

1519
01:01:06,106 --> 01:01:07,960
seen them interact
with each other.

1520
01:01:07,960 --> 01:01:09,230
So how do they come together?

1521
01:01:09,230 --> 01:01:12,350
Which faces of the proteins are
interacting with each other?

1522
01:01:12,350 --> 01:01:14,590
That's called a docking problem.

1523
01:01:14,590 --> 01:01:17,490
And if I wanted to try to
systematically figure out

1524
01:01:17,490 --> 01:01:20,011
whether Protein A and Protein
B interact with each other,

1525
01:01:20,011 --> 01:01:22,510
I would have to do a search
over all possible conformations,

1526
01:01:22,510 --> 01:01:23,409
right?

1527
01:01:23,409 --> 01:01:24,950
Then I could use
the energy functions

1528
01:01:24,950 --> 01:01:27,700
to try to predict which
one has the lowest energy.

1529
01:01:27,700 --> 01:01:29,860
But it actually would be
a computationally very

1530
01:01:29,860 --> 01:01:31,890
inefficient way to do things.

1531
01:01:31,890 --> 01:01:34,700
So we could imagine we
wanted to solve this problem.

1532
01:01:34,700 --> 01:01:36,240
For each potential
partner, we could

1533
01:01:36,240 --> 01:01:39,046
evaluate all relative
positions and orientations.

1534
01:01:39,046 --> 01:01:41,420
Then, when they come together
we can't just rely on that,

1535
01:01:41,420 --> 01:01:42,870
but as we've seen
several times now we're

1536
01:01:42,870 --> 01:01:45,161
going to have to do local
conformational changes to see

1537
01:01:45,161 --> 01:01:47,610
how they fit together for
each possible docking.

1538
01:01:47,610 --> 01:01:49,110
And then, once
we've done that, we

1539
01:01:49,110 --> 01:01:50,860
can say, OK, which of
these has the lowest

1540
01:01:50,860 --> 01:01:52,010
energy of interaction?

1541
01:01:52,010 --> 01:01:54,640
So that, obviously, is going
to be too computationally

1542
01:01:54,640 --> 01:01:56,055
intensive to do
on a large scale.

1543
01:01:56,055 --> 01:01:57,430
It could work very
well if you've

1544
01:01:57,430 --> 01:01:59,179
got a particular pair
of proteins that you

1545
01:01:59,179 --> 01:01:59,960
need to study.

1546
01:01:59,960 --> 01:02:01,710
But on a big scale, if
we wanted to predict

1547
01:02:01,710 --> 01:02:03,960
all possible interactions,
we wouldn't really

1548
01:02:03,960 --> 01:02:06,060
be able to get very far.

1549
01:02:06,060 --> 01:02:09,960
So what people typically do is
use other kinds of information

1550
01:02:09,960 --> 01:02:11,446
to reduce the search space.

1551
01:02:11,446 --> 01:02:13,070
And what we'll see
in the next lecture,

1552
01:02:13,070 --> 01:02:16,390
then, are different ways
to approach this problem.

1553
01:02:16,390 --> 01:02:19,899
Now, one question we
should ask is, what role

1554
01:02:19,899 --> 01:02:21,440
is structural homology
going to play?

1555
01:02:21,440 --> 01:02:23,820
Should I expect that
any two proteins that

1556
01:02:23,820 --> 01:02:27,710
interact with each
other-- let's say

1557
01:02:27,710 --> 01:02:29,962
that I have Protein A and
I know its interactors.

1558
01:02:34,390 --> 01:02:38,380
So I've got A known to
interact with B. Right?

1559
01:02:38,380 --> 01:02:41,190
So I know this interface.

1560
01:02:41,190 --> 01:02:44,990
And now I have
protein C, and I'm not

1561
01:02:44,990 --> 01:02:47,920
sure if it interacts or not.

1562
01:02:47,920 --> 01:02:52,130
Should I expect the interface
of C, that touches A,

1563
01:02:52,130 --> 01:02:53,380
to match the interface of B?

1564
01:02:53,380 --> 01:02:57,450
Should these be homologous?

1565
01:02:57,450 --> 01:02:59,090
And if not precisely
homologous, then

1566
01:02:59,090 --> 01:03:00,871
are there properties
that we can expect

1567
01:03:00,871 --> 01:03:02,370
that should be
similar between them?

1568
01:03:03,280 --> 01:03:05,210
So different
approaches we can take.

1569
01:03:05,210 --> 01:03:06,670
And there are
certainly cases where

1570
01:03:06,670 --> 01:03:10,880
you have proteins that interact
with a common target that

1571
01:03:10,880 --> 01:03:13,250
have no overall structural
similarity to each other

1572
01:03:13,250 --> 01:03:15,150
but do have local
structural similarity.

1573
01:03:15,150 --> 01:03:17,530
So here's an example
of subtilisin,

1574
01:03:17,530 --> 01:03:20,530
which is shown in light
gray, and pieces of it

1575
01:03:20,530 --> 01:03:22,645
that interact with the
target are shown in red.

1576
01:03:22,645 --> 01:03:25,020
So here are two proteins that
are relatively structurally

1577
01:03:25,020 --> 01:03:26,940
homologous-- they interact
at the same region.

1578
01:03:26,940 --> 01:03:28,679
That's not too surprising.

1579
01:03:28,679 --> 01:03:30,220
But here's a subtilisin
inhibitor that

1580
01:03:30,220 --> 01:03:33,310
has no global structural
similarity to these two

1581
01:03:33,310 --> 01:03:36,740
proteins, and yet its
interactions with subtilisin

1582
01:03:36,740 --> 01:03:37,850
are quite similar.

1583
01:03:37,850 --> 01:03:41,380
So we might expect, even if
C and B don't look globally

1584
01:03:41,380 --> 01:03:42,880
anything like each
other, they might

1585
01:03:42,880 --> 01:03:44,470
have this local similarity.

1586
01:03:50,130 --> 01:03:52,510
OK, actually I think we'd
like to turn back your exams.

1587
01:03:52,510 --> 01:03:54,630
So maybe I'll stop here.

1588
01:03:54,630 --> 01:03:56,300
We'll return the
exams in the class,

1589
01:03:56,300 --> 01:03:59,340
and then we'll pick up at this
point in the next lecture.