1
00:00:00,060 --> 00:00:01,780
The following
content is provided

2
00:00:01,780 --> 00:00:04,019
under a Creative
Commons license.

3
00:00:04,019 --> 00:00:06,870
Your support will help MIT
OpenCourseWare continue

4
00:00:06,870 --> 00:00:10,730
to offer high quality
educational resources for free.

5
00:00:10,730 --> 00:00:13,340
To make a donation or
view additional materials

6
00:00:13,340 --> 00:00:17,217
from hundreds of MIT courses,
visit MIT OpenCourseWare

7
00:00:17,217 --> 00:00:17,842
at ocw.mit.edu.

8
00:00:25,752 --> 00:00:26,960
PROFESSOR: I'm Ernest Fraenkel.

9
00:00:26,960 --> 00:00:29,430
I'll be teaching
the next two lectures.

10
00:00:29,430 --> 00:00:32,030
I'd like to encourage you to
contact me outside of class

11
00:00:32,030 --> 00:00:33,990
if you have any questions,
if you want to meet.

12
00:00:33,990 --> 00:00:36,045
And also, please, during
class, ask questions.

13
00:00:36,045 --> 00:00:38,420
It's a somewhat impersonal
setting with the video cameras

14
00:00:38,420 --> 00:00:42,270
and the amphitheater, but
hopefully we can overcome that.

15
00:00:42,270 --> 00:00:46,470
This unit is going to focus
on moving across scales

16
00:00:46,470 --> 00:00:48,392
in computational
biology, looking

17
00:00:48,392 --> 00:00:49,850
from computational
issues that deal

18
00:00:49,850 --> 00:00:53,260
with the fundamentals of protein
structure at the atomic level

19
00:00:53,260 --> 00:00:56,695
to the level of protein-protein
interactions between pairs

20
00:00:56,695 --> 00:00:59,760
of molecules, protein-DNA
interactions

21
00:00:59,760 --> 00:01:01,630
and small molecules,
and then ultimately

22
00:01:01,630 --> 00:01:02,700
into protein networks.

23
00:01:02,700 --> 00:01:04,640
So we've got a lot
of ground to cover,

24
00:01:04,640 --> 00:01:06,460
but I think we'll be able to do it.

25
00:01:06,460 --> 00:01:09,410
As you've seen in the syllabus,
the first couple of lectures

26
00:01:09,410 --> 00:01:12,190
are really a detailed
look at protein structure,

27
00:01:12,190 --> 00:01:15,230
molecular level
analysis, and then we'll

28
00:01:15,230 --> 00:01:17,540
move into some of these
other levels of higher

29
00:01:17,540 --> 00:01:20,910
order, including protein-DNA
interactions and gene

30
00:01:20,910 --> 00:01:21,790
regulatory networks.

31
00:01:24,355 --> 00:01:26,730
I think many of you are probably
familiar with this quote,

32
00:01:26,730 --> 00:01:28,910
that "nothing in
biology makes sense

33
00:01:28,910 --> 00:01:30,770
except in the light
of evolution."

34
00:01:30,770 --> 00:01:34,310
And I'd like to offer a
modified version of that, which

35
00:01:34,310 --> 00:01:38,050
is little in biology makes sense
except in light of structure,

36
00:01:38,050 --> 00:01:40,887
protein structure,
DNA structure.

37
00:01:40,887 --> 00:01:43,470
We've, of course, seen this very
early on in molecular biology

38
00:01:43,470 --> 00:01:46,410
when the structure of DNA was
solved, and immediately became

39
00:01:46,410 --> 00:01:49,870
clear why it was the
basis for heredity.

40
00:01:49,870 --> 00:01:54,200
But protein structures have had an
even more lasting impact time

41
00:01:54,200 --> 00:01:56,957
and time again, many,
many more events,

42
00:01:56,957 --> 00:01:59,040
which have really
revolutionized the understanding

43
00:01:59,040 --> 00:02:01,020
of particular
biological problems.

44
00:02:01,020 --> 00:02:04,050
So one example that was
stunning at the time

45
00:02:04,050 --> 00:02:06,520
had to do with the most
frequently mutated protein

46
00:02:06,520 --> 00:02:07,300
in cancer.

47
00:02:07,300 --> 00:02:09,070
This is the p53 gene.

48
00:02:09,070 --> 00:02:12,870
It's mutated in about
half of all cancers,

49
00:02:12,870 --> 00:02:14,435
and what was observed
early on-- this

50
00:02:14,435 --> 00:02:16,143
was in the days before
genomic sequencing

51
00:02:16,143 --> 00:02:18,310
when it was actually
very expensive and hard

52
00:02:18,310 --> 00:02:21,860
to identify mutations in tumors.

53
00:02:21,860 --> 00:02:23,900
So they focused on
this particular gene,

54
00:02:23,900 --> 00:02:25,960
and they observed that
the mutations clustered.

55
00:02:25,960 --> 00:02:28,835
So this is the structure of
the gene from the N-terminus--

56
00:02:28,835 --> 00:02:30,960
the protein from the
N-terminus to the C-terminus,

57
00:02:30,960 --> 00:02:33,081
and the bars indicate the
frequency of mutations.

58
00:02:33,081 --> 00:02:35,330
And you can see that they're
all clustered pretty much

59
00:02:35,330 --> 00:02:37,910
in the center of this molecule.

60
00:02:37,910 --> 00:02:38,680
Now, why is that?

61
00:02:38,680 --> 00:02:41,380
It was enigmatic until the
structure was solved here

62
00:02:41,380 --> 00:02:44,217
at MIT by Carl Pabo
and his post-doc

63
00:02:44,217 --> 00:02:46,950
at the time, Nikola Pavletich,
and they showed, actually,

64
00:02:46,950 --> 00:02:49,320
that these correspond
to critical domains.

65
00:02:49,320 --> 00:02:50,870
And in a second
paper, they actually

66
00:02:50,870 --> 00:02:55,400
showed why the mutations occur
in those particular locations.

67
00:02:55,400 --> 00:02:57,590
So if you look at the
plot on the upper left,

68
00:02:57,590 --> 00:03:00,520
here's the protein
sequence; above it,

69
00:03:00,520 --> 00:03:02,300
the frequency of
mutations; below it,

70
00:03:02,300 --> 00:03:04,990
the secondary
structure elements.

71
00:03:04,990 --> 00:03:06,720
And you'll see that
mutations occur

72
00:03:06,720 --> 00:03:10,010
in regions that don't have any
regular secondary structure

73
00:03:10,010 --> 00:03:12,570
and can occur
frequently in regions

74
00:03:12,570 --> 00:03:15,081
with secondary
structure or not at all

75
00:03:15,081 --> 00:03:16,580
in regions with
secondary structure.

76
00:03:16,580 --> 00:03:19,038
So the mere fact that there's
a secondary structure element

77
00:03:19,038 --> 00:03:21,314
does not define why
there're mutations.

78
00:03:21,314 --> 00:03:22,980
But when the
three-dimensional structure

79
00:03:22,980 --> 00:03:24,817
was solved in the
complex with DNA,

80
00:03:24,817 --> 00:03:26,650
over here on the right--
this is the protein

81
00:03:26,650 --> 00:03:29,747
structure on the left, the
DNA structure on the right,

82
00:03:29,747 --> 00:03:32,080
and in yellow are some of
these highly mutated residues.

83
00:03:32,080 --> 00:03:35,260
It turns out that all of the
frequently mutated residues

84
00:03:35,260 --> 00:03:38,390
are ones that occur at
the protein-DNA interface.

85
00:03:38,390 --> 00:03:39,950
All right, so in
a single picture,

86
00:03:39,950 --> 00:03:42,020
we now understand
what was an enigma

87
00:03:42,020 --> 00:03:43,390
for years and years and years.

88
00:03:43,390 --> 00:03:45,760
Why are the mutations so
particularly clustered

89
00:03:45,760 --> 00:03:47,532
in this protein in
non-obvious ways?

90
00:03:47,532 --> 00:03:49,490
Since that is the interface
between the protein

91
00:03:49,490 --> 00:03:51,950
and the DNA, these
mutations upset

92
00:03:51,950 --> 00:03:57,700
the transcriptional regulation
through the action of p53.

93
00:03:57,700 --> 00:04:00,304
So if we want to understand
protein structure in order

94
00:04:00,304 --> 00:04:01,845
to understand protein
function, where

95
00:04:01,845 --> 00:04:03,860
are we going to get
these structures from?

96
00:04:03,860 --> 00:04:08,710
So the statistics on how
proteins themselves-- I

97
00:04:08,710 --> 00:04:09,400
show here.

98
00:04:09,400 --> 00:04:12,510
This is from the-- I'll call it
the PDB, the Protein Data Bank.

99
00:04:12,510 --> 00:04:15,130
Its full name is the
RCSB Protein Data Bank,

100
00:04:15,130 --> 00:04:16,685
but it's usually
just called the PDB.

101
00:04:16,685 --> 00:04:18,810
And here, it shows that,
at the time of this slide,

102
00:04:18,810 --> 00:04:20,560
around 80,000
structures have been

103
00:04:20,560 --> 00:04:23,460
determined by x-ray
crystallography.

104
00:04:23,460 --> 00:04:26,410
The next most frequent method
was NMR, Nuclear Magnetic

105
00:04:26,410 --> 00:04:29,230
Resonance, which identified
about 10,000 structures,

106
00:04:29,230 --> 00:04:31,300
and all the other
techniques produce

107
00:04:31,300 --> 00:04:33,690
very, very few
structures, hundreds

108
00:04:33,690 --> 00:04:36,010
of structures rather
than thousands.

109
00:04:36,010 --> 00:04:37,400
So how do these techniques work?

110
00:04:37,400 --> 00:04:39,400
Well, they don't magically
give you a structure.

111
00:04:39,400 --> 00:04:39,900
Right?

112
00:04:39,900 --> 00:04:42,483
They give you information that
you have to use computationally

113
00:04:42,483 --> 00:04:44,490
to derive the structure.

114
00:04:44,490 --> 00:04:46,880
Here's a schematic
of how structures

115
00:04:46,880 --> 00:04:48,750
are solved by x-ray
crystallography.

116
00:04:48,750 --> 00:04:50,970
One has to actually grow
a crystal of the protein

117
00:04:50,970 --> 00:04:52,872
or the protein and
other molecules

118
00:04:52,872 --> 00:04:54,330
that you're interested
in studying.

119
00:04:54,330 --> 00:04:57,190
These are not giant
crystals like quartz.

120
00:04:57,190 --> 00:04:58,950
They're even smaller
than table salt.

121
00:04:58,950 --> 00:05:01,950
They're usually barely
visible with the naked eye,

122
00:05:01,950 --> 00:05:03,095
and they're very unstable.

123
00:05:03,095 --> 00:05:07,150
They have to be kept in
solution or, often, frozen,

124
00:05:07,150 --> 00:05:08,910
and you shoot a very
high-powered x-ray

125
00:05:08,910 --> 00:05:09,870
beam through them.

126
00:05:09,870 --> 00:05:12,204
Now, most of the x-rays are--
what are they going to do?

127
00:05:12,204 --> 00:05:14,661
They're going to pass right
through because x-rays interact

128
00:05:14,661 --> 00:05:15,770
very weakly with matter.

129
00:05:15,770 --> 00:05:17,561
But a few of the x-rays
will be diffracted,

130
00:05:17,561 --> 00:05:19,390
and from that weak
diffraction pattern,

131
00:05:19,390 --> 00:05:22,350
you can actually deduce
where the electrons were

132
00:05:22,350 --> 00:05:27,560
that scattered the x-rays
as they hit the crystal.

133
00:05:27,560 --> 00:05:30,800
And so this is a
picture, the lower right,

134
00:05:30,800 --> 00:05:34,620
of an electron density cloud in
light blue with the protein

135
00:05:34,620 --> 00:05:37,040
structures snaking
through it, and what

136
00:05:37,040 --> 00:05:39,530
you can calculate,
after a lot of work,

137
00:05:39,530 --> 00:05:41,450
from these crystallographic
diffraction

138
00:05:41,450 --> 00:05:44,441
patterns is the location
of the electron density.

139
00:05:44,441 --> 00:05:46,190
And then there's a
computational challenge

140
00:05:46,190 --> 00:05:48,810
to try to figure out the
location of the atoms that

141
00:05:48,810 --> 00:05:51,360
would have given rise
to that electron density

142
00:05:51,360 --> 00:05:54,050
that then, when hit with
x-rays, would have given rise

143
00:05:54,050 --> 00:05:55,900
to the x-ray
diffraction pattern.

144
00:05:55,900 --> 00:05:59,170
So it's actually an
iterative process

145
00:05:59,170 --> 00:06:03,494
where one arrives at the initial
structure and then calculates,

146
00:06:03,494 --> 00:06:05,410
from that structure,
where the electrons would

147
00:06:05,410 --> 00:06:07,330
be, from the
position of electrons

148
00:06:07,330 --> 00:06:09,920
where the diffraction pattern
would be when the x-rays hit

149
00:06:09,920 --> 00:06:14,690
it, and determines how well
that predicted diffraction

150
00:06:14,690 --> 00:06:17,140
pattern agrees with the
actual diffraction pattern,

151
00:06:17,140 --> 00:06:18,739
and then continuously iterates.

152
00:06:18,739 --> 00:06:21,030
And so this is obviously a
highly computational problem

153
00:06:21,030 --> 00:06:22,500
because you not
only have to find

154
00:06:22,500 --> 00:06:25,540
positions that are maximally
consistent with the observed

155
00:06:25,540 --> 00:06:28,020
diffraction pattern, but also
positions that are actually

156
00:06:28,020 --> 00:06:29,620
consistent with physics.

157
00:06:29,620 --> 00:06:32,270
So if we have a piece
of a molecule here,

158
00:06:32,270 --> 00:06:34,190
we can't just put
our atoms anywhere.

159
00:06:34,190 --> 00:06:37,370
They need to be positioned
with well defined distances

160
00:06:37,370 --> 00:06:40,300
for the bonds, the
bond angles, and so on.

161
00:06:40,300 --> 00:06:42,980
So it's a highly coupled
problem that we have to solve,

162
00:06:42,980 --> 00:06:45,897
and we'll look at some of
the techniques that underlie

163
00:06:45,897 --> 00:06:47,980
these approaches, although
we won't look specifically

164
00:06:47,980 --> 00:06:50,470
at how to solve x-ray
crystal structures.

165
00:06:50,470 --> 00:06:52,450
I mentioned the second
most common technique

166
00:06:52,450 --> 00:06:54,810
is nuclear magnetic
resonance, and this

167
00:06:54,810 --> 00:06:57,860
is a technology that does
not require the crystals,

168
00:06:57,860 --> 00:07:00,180
but requires a very
high concentration

169
00:07:00,180 --> 00:07:03,320
of soluble protein, which
presents its own problems.

170
00:07:03,320 --> 00:07:05,610
And the information
that you get out

171
00:07:05,610 --> 00:07:07,500
of a nuclear magnetic
resonance structure

172
00:07:07,500 --> 00:07:09,380
is not the electron
density locations,

173
00:07:09,380 --> 00:07:10,900
but it's actually
a set of distances

174
00:07:10,900 --> 00:07:13,420
that tell you the relative
distance between two

175
00:07:13,420 --> 00:07:15,510
atoms, usually protons,
in the structure,

176
00:07:15,510 --> 00:07:18,440
and that's what's represented
by these yellow lines here.

177
00:07:18,440 --> 00:07:20,830
And once again, we've got a
hard computational problem

178
00:07:20,830 --> 00:07:23,910
where we need to figure out a
structure of the protein that's

179
00:07:23,910 --> 00:07:25,540
consistent with all
the physical forces

180
00:07:25,540 --> 00:07:29,910
and also puts particular
protons at particular distances

181
00:07:29,910 --> 00:07:31,460
from each other.
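To make the NMR constraint idea concrete, here is a minimal sketch of checking a candidate structure against a set of proton-proton distance restraints. The atom labels, coordinates, and the 5-angstrom upper bounds are made up for illustration, and `restraint_violations` is a hypothetical helper name, not something from an NMR package.

```python
import math

def restraint_violations(coords, restraints, tolerance=0.0):
    """Return (atom_i, atom_j, excess) for every restraint that is violated.

    coords     : dict mapping an atom label to its (x, y, z) position
    restraints : list of (atom_i, atom_j, upper_bound) distance restraints
    """
    violations = []
    for atom_i, atom_j, upper_bound in restraints:
        d = math.dist(coords[atom_i], coords[atom_j])
        if d > upper_bound + tolerance:
            violations.append((atom_i, atom_j, d - upper_bound))
    return violations

# Illustrative (invented) proton positions, in angstroms.
coords = {
    "HA": (0.0, 0.0, 0.0),
    "HB": (3.0, 0.0, 0.0),
    "HN": (0.0, 6.0, 0.0),
}
# Each restraint says: these two protons should be no farther apart than this.
restraints = [("HA", "HB", 5.0), ("HA", "HN", 5.0)]

for atom_i, atom_j, excess in restraint_violations(coords, restraints):
    print(f"{atom_i}-{atom_j} is too long by {excess:.2f} A")
```

A real structure calculation would combine a penalty like this with the physical terms (bond lengths, angles, and so on) and then search for coordinates that minimize both.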

182
00:07:31,460 --> 00:07:33,350
So we talk about solving
crystal structures,

183
00:07:33,350 --> 00:07:34,980
solving NMR
structures, because it

184
00:07:34,980 --> 00:07:37,300
is the solution to a
very, very complicated

185
00:07:37,300 --> 00:07:39,190
computational challenge.

186
00:07:39,190 --> 00:07:40,690
So these techniques
that we're going

187
00:07:40,690 --> 00:07:42,300
to look at, while
not specifically

188
00:07:42,300 --> 00:07:47,040
for the solution of
crystal and NMR structures,

189
00:07:47,040 --> 00:07:48,520
underlie those technologies.

190
00:07:48,520 --> 00:07:50,270
What we're going to
focus on is actually

191
00:07:50,270 --> 00:07:51,980
perhaps an even more
complicated problem,

192
00:07:51,980 --> 00:07:54,080
the de novo discovery
of protein structures.

193
00:07:54,080 --> 00:07:55,210
So if I start off
with a sequence,

194
00:07:55,210 --> 00:07:56,590
can I actually
tell you something

195
00:07:56,590 --> 00:07:58,800
important and accurate
about the structure?

196
00:08:03,220 --> 00:08:06,599
Now, there's a nice
summary in a book called

197
00:08:06,599 --> 00:08:08,140
Structural Bioinformatics
that really

198
00:08:08,140 --> 00:08:10,410
deals with a lot of the issues
around computational biology

199
00:08:10,410 --> 00:08:12,360
as it relates to structure,
that highlights many

200
00:08:12,360 --> 00:08:14,350
of the differences between the
kinds of algorithms we've been

201
00:08:14,350 --> 00:08:16,830
looking at up until now in
this course and the kinds

202
00:08:16,830 --> 00:08:18,430
of approaches that
we need to take

203
00:08:18,430 --> 00:08:21,550
in our understanding
of protein structure.

204
00:08:21,550 --> 00:08:23,020
So the first and
most fundamental

205
00:08:23,020 --> 00:08:24,130
obvious thing is
that we're dealing

206
00:08:24,130 --> 00:08:25,546
with three-dimensional
structures,

207
00:08:25,546 --> 00:08:28,180
so we're moving away from the
simple linear representations

208
00:08:28,180 --> 00:08:29,860
of the data and
dealing with more

209
00:08:29,860 --> 00:08:33,770
complicated
three-dimensional problems.

210
00:08:33,770 --> 00:08:37,320
And therefore, we encounter
all sorts of new problems.

211
00:08:37,320 --> 00:08:38,950
We no longer have a
discrete search space.

212
00:08:38,950 --> 00:08:40,525
We have a continuous
search space,

213
00:08:40,525 --> 00:08:41,900
and we'll look at
algorithms that

214
00:08:41,900 --> 00:08:44,290
try to reduce that continuous
search space back down

215
00:08:44,290 --> 00:08:48,044
to a discrete one to make
it a simpler problem.

216
00:08:48,044 --> 00:08:49,960
But perhaps most
fundamentally, the difference

217
00:08:49,960 --> 00:08:53,410
is that now we have to bring
in a lot of physical knowledge

218
00:08:53,410 --> 00:08:54,610
to underlie our algorithms.

219
00:08:54,610 --> 00:08:57,240
It's not enough to solve this
as a complete abstraction

220
00:08:57,240 --> 00:08:58,790
from the physics,
but we actually

221
00:08:58,790 --> 00:09:02,014
have to deal with the physics
in the heart of the algorithms.

222
00:09:02,014 --> 00:09:03,680
And we'll look at the
issues highlighted

223
00:09:03,680 --> 00:09:06,590
in red in the rest of this talk.

224
00:09:06,590 --> 00:09:08,090
Another thing that's
going to emerge

225
00:09:08,090 --> 00:09:09,959
is that it would
be nice if there

226
00:09:09,959 --> 00:09:12,250
was a simple mapping of
protein sequence to structures,

227
00:09:12,250 --> 00:09:13,916
and if that were the
case, you'd imagine

228
00:09:13,916 --> 00:09:16,457
that two proteins that are
very different in sequence

229
00:09:16,457 --> 00:09:17,790
would have different structures.

230
00:09:17,790 --> 00:09:18,970
But in fact, that's
not the case.

231
00:09:18,970 --> 00:09:21,410
You can have two proteins
that have almost no sequence

232
00:09:21,410 --> 00:09:23,659
similarity at all but adopt
the same three-dimensional

233
00:09:23,659 --> 00:09:27,486
structure, so clearly, it's an
extremely complicated problem

234
00:09:27,486 --> 00:09:28,860
made more complicated
by the fact

235
00:09:28,860 --> 00:09:30,360
that we don't know
all the structures.

236
00:09:30,360 --> 00:09:32,401
It's not like we're
selecting from a discrete set

237
00:09:32,401 --> 00:09:35,090
of known structures to figure
out what our new molecule is.

238
00:09:35,090 --> 00:09:37,110
We have a potentially
infinite number

239
00:09:37,110 --> 00:09:41,187
of conformations of protein
chains we need to deal with.

240
00:09:41,187 --> 00:09:43,770
OK, so I hope that you've had a
chance to look at the material

241
00:09:43,770 --> 00:09:46,910
that I've posted online for
review of protein structure.

242
00:09:46,910 --> 00:09:48,320
If you haven't, please do so.

243
00:09:48,320 --> 00:09:49,480
It'll be very helpful
in understanding

244
00:09:49,480 --> 00:09:51,160
the next few lectures,
and I'll assume

245
00:09:51,160 --> 00:09:53,368
that you're familiar with
the basic elements of protein

246
00:09:53,368 --> 00:09:55,110
structure, what
alpha helices are,

247
00:09:55,110 --> 00:09:57,880
what beta sheets are, primary
structure, secondary structure,

248
00:09:57,880 --> 00:09:58,750
and so on.

249
00:09:58,750 --> 00:10:01,190
And I'll also encourage
you to become familiar

250
00:10:01,190 --> 00:10:01,899
with amino acids.

251
00:10:01,899 --> 00:10:04,314
It's very hard to understand
anything in protein structure

252
00:10:04,314 --> 00:10:06,790
without having some knowledge
of what the amino acids are.

253
00:10:06,790 --> 00:10:09,600
The textbook has
a nice figure that

254
00:10:09,600 --> 00:10:12,535
summarizes the many overlapping
ways to describe the features

255
00:10:12,535 --> 00:10:16,020
in amino acids, so please
familiarize yourself with that.

256
00:10:16,020 --> 00:10:18,550
So these are resources
that we posted online.

257
00:10:18,550 --> 00:10:21,130
Also, the Protein
Data Bank, the RCSB,

258
00:10:21,130 --> 00:10:23,540
has fantastic resources
online for beginning

259
00:10:23,540 --> 00:10:25,820
to understand protein
structure, so I

260
00:10:25,820 --> 00:10:27,689
encourage you to look
at their website.

261
00:10:27,689 --> 00:10:29,230
In particular, in
their website, they

262
00:10:29,230 --> 00:10:31,444
have tools that you can
download to visualize protein

263
00:10:31,444 --> 00:10:32,860
structures, and
that's going to be

264
00:10:32,860 --> 00:10:35,349
a critical component of
understanding these algorithms,

265
00:10:35,349 --> 00:10:37,640
to actually understand what
these structures look like.

266
00:10:37,640 --> 00:10:39,598
I've highlighted two
that I find particularly

267
00:10:39,598 --> 00:10:42,400
easy to use PyMOL
and Swiss PDB Viewer.

268
00:10:42,400 --> 00:10:43,954
You can not only
look at structures

269
00:10:43,954 --> 00:10:46,120
with these techniques, you
can actually modify them.

270
00:10:46,120 --> 00:10:49,560
You can do homology modeling.

271
00:10:49,560 --> 00:10:52,950
So before we get into algorithms
for understanding protein

272
00:10:52,950 --> 00:10:54,590
structure, we need
to understand how

273
00:10:54,590 --> 00:10:56,350
protein structures
are represented.

274
00:10:56,350 --> 00:10:59,652
I've already mentioned that
there are these repeating units

275
00:10:59,652 --> 00:11:01,860
that I'd like you to already
know about-- alpha helices,

276
00:11:01,860 --> 00:11:02,420
beta sheets.

277
00:11:02,420 --> 00:11:04,550
We won't go into
those in any detail.

278
00:11:04,550 --> 00:11:06,461
But the two more
quantitative ways

279
00:11:06,461 --> 00:11:07,960
of describing protein
structure have

280
00:11:07,960 --> 00:11:09,751
to do with the
three-dimensional coordinates,

281
00:11:09,751 --> 00:11:11,290
the XYZ coordinates
of every atom,

282
00:11:11,290 --> 00:11:12,890
and internal
coordinates, and we'll

283
00:11:12,890 --> 00:11:15,820
go through those in a
little bit of detail.

284
00:11:15,820 --> 00:11:18,150
So again, this PDB
website has a lot

285
00:11:18,150 --> 00:11:19,800
of great resources
for understanding

286
00:11:19,800 --> 00:11:22,760
what these
coordinates look like.

287
00:11:22,760 --> 00:11:26,260
They have a good description
of what's called a PDB file,

288
00:11:26,260 --> 00:11:29,000
and those PDB files look
like this at the outset.

289
00:11:29,000 --> 00:11:31,360
They have what is now called
metadata, but at the time

290
00:11:31,360 --> 00:11:33,320
was just information about
how the protein structure was

291
00:11:33,320 --> 00:11:34,080
solved.

292
00:11:34,080 --> 00:11:38,630
So it'll tell you what organism
the protein comes from,

293
00:11:38,630 --> 00:11:40,630
where it was actually
synthesized if it wasn't

294
00:11:40,630 --> 00:11:43,260
purified from that organism, but
if it was made recombinantly,

295
00:11:43,260 --> 00:11:46,250
details like that, details about
how the crystal structure was

296
00:11:46,250 --> 00:11:48,160
determined.

297
00:11:48,160 --> 00:11:50,810
The sequence-- most of
this won't concern us,

298
00:11:50,810 --> 00:11:53,880
but what will concern us is
this bottom section shown here

299
00:11:53,880 --> 00:11:55,380
in more detail.

300
00:11:55,380 --> 00:11:58,240
So let's just look at what
each of these lines represents.

301
00:11:58,240 --> 00:12:00,410
The lines that contain
information about the atomic

302
00:12:00,410 --> 00:12:02,850
coordinates all begin
with the word ATOM,

303
00:12:02,850 --> 00:12:04,950
and then there's an
index number that

304
00:12:04,950 --> 00:12:08,620
just is referenced for
each line of the file,

305
00:12:08,620 --> 00:12:12,050
tells you what kind of atom it
is, what chain in the protein

306
00:12:12,050 --> 00:12:13,660
it is, and the residue number.

307
00:12:13,660 --> 00:12:16,540
So here, it's starting
with residue 100.

308
00:12:16,540 --> 00:12:18,110
The sequence here
can be arbitrary

309
00:12:18,110 --> 00:12:20,210
and may not relate to the
sequence of the protein

310
00:12:20,210 --> 00:12:25,372
as it appears in
SWISS-PROT or GenBank.

311
00:12:25,372 --> 00:12:26,830
And then the next
three columns are

312
00:12:26,830 --> 00:12:28,455
the ones that are
most important to us,

313
00:12:28,455 --> 00:12:31,050
so these are the XYZ
coordinates of the atom.

314
00:12:31,050 --> 00:12:33,012
So to identify the
position of any molecule

315
00:12:33,012 --> 00:12:34,720
in three-dimensional
space, obviously you

316
00:12:34,720 --> 00:12:36,340
need three coordinates,
and so those

317
00:12:36,340 --> 00:12:39,140
are what those three
coordinates are.

318
00:12:39,140 --> 00:12:41,390
And they're followed by these
two other numbers, which

319
00:12:41,390 --> 00:12:44,080
actually are very interesting
numbers because they tell us

320
00:12:44,080 --> 00:12:47,260
something about how certain
we are that the molecule is

321
00:12:47,260 --> 00:12:49,890
really-- the atom is really at
that position in the crystal

322
00:12:49,890 --> 00:12:51,040
structure.

323
00:12:51,040 --> 00:12:53,710
So the first of these
is the occupancy.

324
00:12:53,710 --> 00:12:55,312
In a crystal structure,
we're actually

325
00:12:55,312 --> 00:12:57,520
getting the information
about thousands and thousands

326
00:12:57,520 --> 00:13:00,734
of molecules that are in the
repeating units of the crystal,

327
00:13:00,734 --> 00:13:02,150
and it's possible
that there could

328
00:13:02,150 --> 00:13:04,720
be some variation in the
structure between one

329
00:13:04,720 --> 00:13:06,620
unit of the crystal
and the next.

330
00:13:06,620 --> 00:13:09,030
So you could have a side
chain that, in one crystal,

331
00:13:09,030 --> 00:13:11,230
is over here and in
the next crystal--

332
00:13:11,230 --> 00:13:13,880
a repeating unit of the
crystal, is over there.

333
00:13:13,880 --> 00:13:16,110
If there are discrete
conformations,

334
00:13:16,110 --> 00:13:18,830
then you imagine that the
signal will be reduced,

335
00:13:18,830 --> 00:13:20,580
and you'll actually
get some superposition

336
00:13:20,580 --> 00:13:22,920
of all the possible
conformations.

337
00:13:22,920 --> 00:13:24,870
So number one here
means that there

338
00:13:24,870 --> 00:13:27,877
seems to be one
predominant conformation.

339
00:13:27,877 --> 00:13:30,460
But if there is more than one,
and they're discrete-- if they're

340
00:13:30,460 --> 00:13:32,084
continuous, it'll
just look like noise.

341
00:13:32,084 --> 00:13:34,080
It'll be hard to
determine the coordinates.

342
00:13:34,080 --> 00:13:36,580
But if they're
discrete positions,

343
00:13:36,580 --> 00:13:39,700
then you might find, for
example, an occupancy of 0.5

344
00:13:39,700 --> 00:13:42,050
and then another line
with the other position

345
00:13:42,050 --> 00:13:44,270
with an occupancy of 0.5.

346
00:13:44,270 --> 00:13:46,950
So that's when there's
discrete locations where

347
00:13:46,950 --> 00:13:49,200
these atoms are located.

348
00:13:49,200 --> 00:13:51,410
The B factor's called
the thermal factor,

349
00:13:51,410 --> 00:13:54,030
and it tells you how
much thermal motion there

350
00:13:54,030 --> 00:13:56,089
was in the crystal
at that position.

351
00:13:56,089 --> 00:13:57,130
Now, what does that mean?

352
00:13:57,130 --> 00:13:58,380
If we think about a
crystal structure,

353
00:13:58,380 --> 00:14:00,580
there'll be some parts of
it that are rock solid.

354
00:14:00,580 --> 00:14:02,660
In the center, it's
highly constrained.

355
00:14:02,660 --> 00:14:04,860
The dense core of the
protein, not too much

356
00:14:04,860 --> 00:14:06,110
is going to be changing.

357
00:14:06,110 --> 00:14:07,730
But on the surface
of the protein,

358
00:14:07,730 --> 00:14:10,770
there can be residues
that are highly flexible.

359
00:14:10,770 --> 00:14:14,070
And so as those are being
knocked around in the crystal,

360
00:14:14,070 --> 00:14:17,800
they are scattering the x-rays
in slightly different ways.

361
00:14:17,800 --> 00:14:19,880
But they're not in
discrete conformations,

362
00:14:19,880 --> 00:14:23,180
so we're not going to see
multiple independent positions.

363
00:14:23,180 --> 00:14:25,227
We'll just see some
average positions.

364
00:14:25,227 --> 00:14:27,560
And that kind of noise can
be accounted for with these B

365
00:14:27,560 --> 00:14:31,510
factors, where high numbers
represent highly mobile parts

366
00:14:31,510 --> 00:14:33,120
of the structure,
and low numbers

367
00:14:33,120 --> 00:14:35,035
represent very stable ones.

368
00:14:35,035 --> 00:14:37,820
A very low number here
would be, say, a 20.

369
00:14:37,820 --> 00:14:39,890
These numbers of 80--
typically, things like that

370
00:14:39,890 --> 00:14:41,390
occur at the ends
of molecules where

371
00:14:41,390 --> 00:14:45,180
there is a lot of
structural flexibility.
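As a rough sketch of how the ATOM lines just described can be read, the snippet below slices one line by the fixed column positions of the PDB format and pulls out the XYZ coordinates, the occupancy, and the B factor. The sample atom and its numbers are invented for illustration, and `AtomRecord` and `parse_atom_line` are hypothetical names, not part of any PDB library.

```python
from typing import NamedTuple

class AtomRecord(NamedTuple):
    serial: int
    name: str
    alt_loc: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float

def parse_atom_line(line: str) -> AtomRecord:
    """Parse one fixed-width ATOM line (PDB columns are 1-based)."""
    if not line.startswith("ATOM"):
        raise ValueError("not an ATOM record")
    return AtomRecord(
        serial=int(line[6:11]),         # cols 7-11  index number
        name=line[12:16].strip(),       # cols 13-16 kind of atom
        alt_loc=line[16].strip(),       # col 17     alternate location label
        res_name=line[17:20].strip(),   # cols 18-20 residue name
        chain=line[21],                 # col 22     chain
        res_seq=int(line[22:26]),       # cols 23-26 residue number
        x=float(line[30:38]),           # cols 31-38
        y=float(line[38:46]),           # cols 39-46
        z=float(line[46:54]),           # cols 47-54
        occupancy=float(line[54:60]),   # cols 55-60
        b_factor=float(line[60:66]),    # cols 61-66
    )

# An invented example line, assembled piece by piece so the columns are visible.
SAMPLE_LINE = (
    "ATOM  "     # cols 1-6   record name
    "    1"      # cols 7-11  index number
    " "          # col 12
    " N  "       # cols 13-16 atom name
    " "          # col 17     alternate location (blank: only one position)
    "ALA"        # cols 18-20 residue name
    " "          # col 21
    "A"          # col 22     chain
    " 100"       # cols 23-26 residue number
    "    "       # cols 27-30
    "  11.104"   # cols 31-38 x
    "   6.134"   # cols 39-46 y
    "  -6.504"   # cols 47-54 z
    "  1.00"     # cols 55-60 occupancy
    " 20.00"     # cols 61-66 B factor
)

atom = parse_atom_line(SAMPLE_LINE)
print(atom.res_seq, (atom.x, atom.y, atom.z), atom.occupancy, atom.b_factor)
```

When a side chain has two discrete positions, the file carries two lines for the same atom with different alternate location labels and occupancies that add up to the whole, for example 0.5 and 0.5.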

372
00:14:45,180 --> 00:14:46,920
So we have this one
way of describing

373
00:14:46,920 --> 00:14:51,010
the structure of a protein where
we specify the XYZ coordinates

374
00:14:51,010 --> 00:14:54,620
of every one of these atoms,
and we'd have these other two

375
00:14:54,620 --> 00:14:58,200
parameters to represent thermal
motion and static disorder.

376
00:14:58,200 --> 00:15:00,380
Now, are those coordinates
uniquely defined?

377
00:15:00,380 --> 00:15:02,780
If I have this
structure, is there

378
00:15:02,780 --> 00:15:06,360
exactly one way to write
down the XYZ coordinates?

379
00:15:06,360 --> 00:15:07,000
Hands?

380
00:15:07,000 --> 00:15:08,900
How many people say yes?

381
00:15:08,900 --> 00:15:10,650
How many people say no?

382
00:15:10,650 --> 00:15:12,984
Why not?

383
00:15:12,984 --> 00:15:14,370
AUDIENCE: You can rotate it.

384
00:15:14,370 --> 00:15:15,578
PROFESSOR: You can rotate it.

385
00:15:15,578 --> 00:15:16,370
You set the origin.

386
00:15:16,370 --> 00:15:16,894
Right?

387
00:15:16,894 --> 00:15:18,560
So there's no unique
way of defining it,

388
00:15:18,560 --> 00:15:20,390
and that'll come up again later.
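A small sketch of why the XYZ coordinates are not unique: rotating the structure and moving the origin changes every coordinate, but the distances between atoms stay the same. The three atom positions, the 40-degree rotation, and the shift are arbitrary values chosen for illustration.

```python
import math
from itertools import combinations

def rotate_z(p, angle_deg):
    """Rotate point p about the z axis by angle_deg degrees."""
    a = math.radians(angle_deg)
    x, y, z = p
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a),
            z)

def translate(p, shift):
    """Move point p by the vector shift (equivalent to choosing a new origin)."""
    return tuple(c + s for c, s in zip(p, shift))

def pairwise_distances(points):
    """All interatomic distances, in a fixed pair order."""
    return [math.dist(a, b) for a, b in combinations(points, 2)]

# Three invented atom positions, in angstroms.
atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 1.4, 0.0)]

# Same structure written in a different frame: rotated, then shifted.
moved = [translate(rotate_z(p, 40.0), (10.0, -3.0, 7.5)) for p in atoms]

print(atoms)
print(moved)
print(pairwise_distances(atoms))
print(pairwise_distances(moved))
```

Both coordinate lists describe the same structure, which is why comparing two structures requires superimposing them first rather than comparing raw coordinates.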

389
00:15:20,390 --> 00:15:22,372
OK, now, this is
a very precise way

390
00:15:22,372 --> 00:15:24,330
of describing the
three-dimensional coordinates

391
00:15:24,330 --> 00:15:28,120
in protein, but it's not a very
concise way of representing it.

392
00:15:28,120 --> 00:15:29,490
Now, why is that?

393
00:15:29,490 --> 00:15:31,476
Well, as the static
model represents,

394
00:15:31,476 --> 00:15:33,350
there are certain parts
of protein structures

395
00:15:33,350 --> 00:15:36,080
that are really not going
to change very much.

396
00:15:36,080 --> 00:15:37,940
The lengths of the
bonds change very little

397
00:15:37,940 --> 00:15:39,770
in protein structures.

398
00:15:39,770 --> 00:15:42,510
The angles, the tetrahedrally
coordinated carbon,

399
00:15:42,510 --> 00:15:45,520
doesn't suddenly
become flat, planar.

400
00:15:45,520 --> 00:15:48,550
These things happen very-- there
may be very small deformations.

401
00:15:48,550 --> 00:15:51,960
So if I had to specify the XYZ
coordinates of this carbon,

402
00:15:51,960 --> 00:15:53,460
I really don't have
too many degrees

403
00:15:53,460 --> 00:15:55,820
of freedom for where
the other carbon can be.

404
00:15:55,820 --> 00:15:58,380
It has to lie on a sphere
at a certain distance.

405
00:15:58,380 --> 00:16:00,860
So instead of representing
XYZ coordinates of every atom,

406
00:16:00,860 --> 00:16:04,020
I can use internal coordinates.

407
00:16:04,020 --> 00:16:08,590
So here in this slide,
we have amino acids--

408
00:16:08,590 --> 00:16:11,830
the amino nitrogen,
the carbonyl carbon.

409
00:16:11,830 --> 00:16:13,580
So this is a single amino acid.

410
00:16:13,580 --> 00:16:16,020
Here's the peptide bond
that goes to the next one.

411
00:16:16,020 --> 00:16:18,050
And as this diagram
indicates, the bond

412
00:16:18,050 --> 00:16:20,922
between the carbonyl
carbon of one amino acid

413
00:16:20,922 --> 00:16:22,505
and the amide nitrogen
of the next one

414
00:16:22,505 --> 00:16:25,597
is planar, so that angle
isn't even rotating.

415
00:16:25,597 --> 00:16:28,180
So that's one degree of freedom
that we've completely removed.

416
00:16:28,180 --> 00:16:31,986
The angles that rotate in the
backbone are called phi and psi;

417
00:16:31,986 --> 00:16:37,410
phi over here,
and psi over here.

418
00:16:37,410 --> 00:16:39,160
So those are two
degrees of freedom

419
00:16:39,160 --> 00:16:42,930
that determine how
this amino acid is--

420
00:16:42,930 --> 00:16:45,560
the conformation
of this amino acid.

421
00:16:45,560 --> 00:16:47,640
So instead of specifying
all the coordinates,

422
00:16:47,640 --> 00:16:49,630
I can specify the
backbone simply

423
00:16:49,630 --> 00:16:52,280
by giving two numbers to every
amino acid, the phi and psi

424
00:16:52,280 --> 00:16:55,220
angles, with the
assumption that the omega

425
00:16:55,220 --> 00:16:58,560
angle, this peptide
backbone, remains constant.

426
00:16:58,560 --> 00:17:00,057
And similarly for
the side chains,

427
00:17:00,057 --> 00:17:01,890
and we'll go into this
in more detail later,

428
00:17:01,890 --> 00:17:04,859
we can then give the
coordinates, the rotation,

429
00:17:04,859 --> 00:17:06,930
of rotatable bonds
in the side chain

430
00:17:06,930 --> 00:17:09,155
and not specify every
atom as we go out.

431
00:17:09,155 --> 00:17:11,530
OK, so we've got these two
different ways of representing

432
00:17:11,530 --> 00:17:14,569
protein structure, and we'll
see that they're both used.

433
00:17:14,569 --> 00:17:17,829
Any questions on this?

434
00:17:17,829 --> 00:17:19,540
Great.

435
00:17:19,540 --> 00:17:21,774
OK, so if we're looking
at protein structures,

436
00:17:21,774 --> 00:17:23,440
one question we want
to ask is how do we

437
00:17:23,440 --> 00:17:27,339
compare two protein
structures to each other?

438
00:17:27,339 --> 00:17:29,070
So I already mentioned
that proteins

439
00:17:29,070 --> 00:17:31,410
can have similar
structure, whether or not

440
00:17:31,410 --> 00:17:32,950
they are highly
similar in sequence.

441
00:17:32,950 --> 00:17:35,380
So if I have two proteins that
are highly homologous, that

442
00:17:35,380 --> 00:17:38,350
do have a high level of sequence
similarity-- for example,

443
00:17:38,350 --> 00:17:40,050
these two orthologs,
this one from cow

444
00:17:40,050 --> 00:17:42,195
and this one from rat--
you can see, at a distance,

445
00:17:42,195 --> 00:17:43,820
they both have very
similar structures.

446
00:17:43,820 --> 00:17:46,480
They also have 74%
sequence similarity,

447
00:17:46,480 --> 00:17:47,970
so that's not surprising.

448
00:17:47,970 --> 00:17:50,400
But you can get proteins
that have very low sequence

449
00:17:50,400 --> 00:17:51,570
similarity.

450
00:17:51,570 --> 00:17:53,660
They're still
evolutionarily related,

451
00:17:53,660 --> 00:17:56,330
like these orthologs, two
different species that

452
00:17:56,330 --> 00:17:59,630
have the same protein, or
paralogs, a single species that

453
00:17:59,630 --> 00:18:02,890
has two similar copies,
but nonidentical copies,

454
00:18:02,890 --> 00:18:05,410
of the same protein, that
maintain the same structure

455
00:18:05,410 --> 00:18:09,410
when they only have about 20%
to 30% sequence similarity.

456
00:18:09,410 --> 00:18:13,440
And you can get even more
distant relationships.

457
00:18:13,440 --> 00:18:15,500
So here are two
proteins, both in

458
00:18:15,500 --> 00:18:20,580
human, evolutionarily related,
but only 4% sequence identity.

459
00:18:20,580 --> 00:18:24,077
And yet at a distance,
they look almost identical.

460
00:18:24,077 --> 00:18:25,910
And those are evolutionarily
related proteins,

461
00:18:25,910 --> 00:18:27,880
but we can also have things
that are called analogs, which

462
00:18:27,880 --> 00:18:29,570
have no evolutionary
relationship,

463
00:18:29,570 --> 00:18:32,020
no obvious sequence
similarity, and yet adopt

464
00:18:32,020 --> 00:18:34,410
almost identical
protein structures.

465
00:18:34,410 --> 00:18:36,950
So this adds to the complexity
of the biological problems

466
00:18:36,950 --> 00:18:38,605
that we're going
to try to solve.

467
00:18:38,605 --> 00:18:40,320
All right, so how
do I quantitatively

468
00:18:40,320 --> 00:18:43,510
compare two protein structures?

469
00:18:43,510 --> 00:18:45,180
So the common
measurement is something

470
00:18:45,180 --> 00:18:47,760
called RMSD, Root
Mean Square Deviation,

471
00:18:47,760 --> 00:18:49,260
and here, I have a
set of structures

472
00:18:49,260 --> 00:18:50,542
that were solved by NMR.

473
00:18:50,542 --> 00:18:53,000
And you can see that there's
a core of the structure that's

474
00:18:53,000 --> 00:18:54,510
well determined
and then there are

475
00:18:54,510 --> 00:18:56,635
pieces of the structure
that are poorly determined.

476
00:18:56,635 --> 00:18:58,862
There weren't enough
constraints to define them.

477
00:18:58,862 --> 00:19:00,570
And these proteins
have all been aligned,

478
00:19:00,570 --> 00:19:04,510
so the XYZ coordinates have
been rotated and translated

479
00:19:04,510 --> 00:19:05,625
to give maximal agreement.

480
00:19:05,625 --> 00:19:07,000
And what's the
agreement measure?

481
00:19:07,000 --> 00:19:08,541
It's this Root Mean
Square Deviation.

482
00:19:08,541 --> 00:19:12,230
So I need to define pairs of
atoms in my two structures.

483
00:19:12,230 --> 00:19:14,970
If it's, in this case, the same
structure, that's really easy.

484
00:19:14,970 --> 00:19:18,780
Every atom has a match
in this structure that

485
00:19:18,780 --> 00:19:21,107
was solved with
the same molecule.

486
00:19:21,107 --> 00:19:23,190
But if we're dealing with
two homologous proteins,

487
00:19:23,190 --> 00:19:24,981
then that becomes a
little bit more tricky.

488
00:19:24,981 --> 00:19:27,480
We need to define which amino
acids are going to match up.

489
00:19:27,480 --> 00:19:30,447
We can also define whether we
care about changes in the side

490
00:19:30,447 --> 00:19:33,030
chains, or whether we only care
about changes in the backbone,

491
00:19:33,030 --> 00:19:34,030
whether we're going
to worry about

492
00:19:34,030 --> 00:19:36,064
whether the protons are in
the right places or not.

493
00:19:36,064 --> 00:19:37,730
And you'll see that
these alignments can

494
00:19:37,730 --> 00:19:41,710
be done with either only
heavy chain, heavy atoms,

495
00:19:41,710 --> 00:19:44,940
meaning excluding the hydrogens,
or only main chain atoms,

496
00:19:44,940 --> 00:19:48,600
meaning excluding the
side chains completely.

497
00:19:48,600 --> 00:19:51,359
But once we've defined the
pairs of corresponding atoms,

498
00:19:51,359 --> 00:19:53,650
then we're going to take the
difference in the distance

499
00:19:53,650 --> 00:19:55,729
squared, sum of the
squares of the distances

500
00:19:55,729 --> 00:19:58,020
between the corresponding
atoms and their x-coordinate,

501
00:19:58,020 --> 00:19:59,680
their y-coordinate, and
their z-coordinate.

502
00:19:59,680 --> 00:20:01,055
Take the square root
of the mean of that sum,

503
00:20:01,055 --> 00:20:03,917
and that's going to give us
the Root Mean Square Deviation.

504
00:20:03,917 --> 00:20:06,250
And of course, we have to
minimize that Root Mean Square

505
00:20:06,250 --> 00:20:08,200
Deviation with these
rigid body rotations

506
00:20:08,200 --> 00:20:10,310
to account for the fact
that I could have my PDB

507
00:20:10,310 --> 00:20:12,305
file with the
origin at this atom.

508
00:20:12,305 --> 00:20:14,680
Or I could have my PDB file
with the origin at that atom,

509
00:20:14,680 --> 00:20:15,940
and so on.
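[Editor's sketch, not part of the lecture.] The RMSD described here can be written as a short Python function. This is a minimal sketch under stated assumptions: the atom pairing is already decided (same index means corresponding atom), and only the translation part of the rigid-body fit is handled, by moving both structures to their centroids. The optimal rotation, which a full superposition would also need (the Kabsch algorithm is one common choice), is deliberately left out. The names `rmsd` and `center` are hypothetical.

```python
import math

def center(coords):
    """Translate a list of (x, y, z) tuples so the centroid is at the origin."""
    n = len(coords)
    cx = sum(c[0] for c in coords) / n
    cy = sum(c[1] for c in coords) / n
    cz = sum(c[2] for c in coords) / n
    return [(x - cx, y - cy, z - cz) for x, y, z in coords]

def rmsd(coords_a, coords_b):
    """Root mean square deviation over paired atoms (no rotation applied)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty, equally long coordinate lists")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # squared distance between corresponding atoms
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))
```

Restricting the comparison to main chain atoms or to heavy atoms would simply mean filtering both coordinate lists before calling `rmsd`.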

510
00:20:15,940 --> 00:20:18,150
OK.

511
00:20:18,150 --> 00:20:20,770
Any questions so far?

512
00:20:20,770 --> 00:20:21,450
Yes.

513
00:20:21,450 --> 00:20:25,079
AUDIENCE: Do we consider every
single atom in the molecule?

514
00:20:25,079 --> 00:20:26,370
PROFESSOR: So we have a choice.

515
00:20:26,370 --> 00:20:28,530
The question was do we
consider every single atom

516
00:20:28,530 --> 00:20:29,197
in the molecule?

517
00:20:29,197 --> 00:20:31,029
We don't have to do,
and it depends, really,

518
00:20:31,029 --> 00:20:32,810
on the problem that
we're trying to solve.

519
00:20:32,810 --> 00:20:35,760
So if we're looking for
whether two proteins have

520
00:20:35,760 --> 00:20:38,130
the same fold, we might not
care about the side chains.

521
00:20:38,130 --> 00:20:40,770
We might restrict ourselves
to main chain atoms.

522
00:20:40,770 --> 00:20:43,170
But if we're trying to
decide whether two crystal

523
00:20:43,170 --> 00:20:45,410
structures are in good
agreement with each other,

524
00:20:45,410 --> 00:20:47,554
or say, as we'll
see in a few minutes,

525
00:20:47,554 --> 00:20:49,720
we're going to try to predict
the structure of a protein,

526
00:20:49,720 --> 00:20:51,870
and we have the experimentally
determined structure

527
00:20:51,870 --> 00:20:53,619
of the same protein,
and we want to decide

528
00:20:53,619 --> 00:20:55,203
whether those two
agree, in that case,

529
00:20:55,203 --> 00:20:56,660
we might actually
want to make sure

530
00:20:56,660 --> 00:20:58,760
that every single atom
is in the right position.

531
00:20:58,760 --> 00:21:00,600
So it'll depend on the question
that we're trying to answer.

532
00:21:00,600 --> 00:21:01,360
Good question.

533
00:21:01,360 --> 00:21:02,193
Any other questions?

534
00:21:06,100 --> 00:21:08,197
OK.

535
00:21:08,197 --> 00:21:09,780
All right, so so
far, I've shown a lot

536
00:21:09,780 --> 00:21:11,690
of static pictures of molecules.

537
00:21:11,690 --> 00:21:14,185
I do want to stress
that molecules actually

538
00:21:14,185 --> 00:21:16,560
move around a lot, so I'll
just show a little movie here.

539
00:21:19,548 --> 00:21:23,034
[VIDEO PLAYBACK]

540
00:23:17,574 --> 00:23:18,954
[END VIDEO PLAYBACK]

541
00:23:18,954 --> 00:23:20,870
PROFESSOR: OK, so that
was, in part, an excuse

542
00:23:20,870 --> 00:23:23,390
to play a little New
Age music in class,

543
00:23:23,390 --> 00:23:26,570
but more fundamentally,
it was to remind you

544
00:23:26,570 --> 00:23:29,050
that, despite the
fact that we're

545
00:23:29,050 --> 00:23:31,350
going to show you a lot of
static pictures of proteins,

546
00:23:31,350 --> 00:23:33,250
they're actually
extremely dynamic.

547
00:23:33,250 --> 00:23:36,170
And they have well
defined structures,

548
00:23:36,170 --> 00:23:38,590
but they may have more than
one well defined structure,

549
00:23:38,590 --> 00:23:40,849
especially those molecules
that are doing work.

550
00:23:40,849 --> 00:23:42,390
They're actually
moving things along.

551
00:23:42,390 --> 00:23:43,640
They have multiple structures.

552
00:23:43,640 --> 00:23:45,670
And so when we consider
the protein structure,

553
00:23:45,670 --> 00:23:47,730
it's an approximation,
and we're always

554
00:23:47,730 --> 00:23:52,150
going to mean the protein
structures, not a singular one.

555
00:23:52,150 --> 00:23:55,010
OK, so what determines
the protein structure?

556
00:23:55,010 --> 00:23:56,640
Well, I've told
you it's physics.

557
00:23:56,640 --> 00:23:58,620
Fundamentally, it's
a physical problem,

558
00:23:58,620 --> 00:24:03,170
so the optimal protein structure
has to be an energetic minimum.

559
00:24:03,170 --> 00:24:05,450
There has to be no net
force acting on the protein.

560
00:24:05,450 --> 00:24:09,180
The force is the negative derivative
of the potential energy,

561
00:24:09,180 --> 00:24:11,120
so that derivative has to be 0.

562
00:24:11,120 --> 00:24:13,307
So we have to have a minimum
of protein structure.

563
00:24:13,307 --> 00:24:15,640
Now, that doesn't mean that
there's exactly one minimum.

564
00:24:15,640 --> 00:24:19,140
Those proteins that had multiple
conformations in that movie

565
00:24:19,140 --> 00:24:21,980
obviously had multiple minima
that they could adopt depending

566
00:24:21,980 --> 00:24:23,630
on other circumstances,
but there

567
00:24:23,630 --> 00:24:25,540
has to be at least
a local minimum.

568
00:24:25,540 --> 00:24:30,055
So if we knew this U, this
potential energy function,

569
00:24:30,055 --> 00:24:31,680
and we could take
the derivative of it,

570
00:24:31,680 --> 00:24:34,830
we could identify the protein
structure or the protein

571
00:24:34,830 --> 00:24:38,020
structures by simply
identifying the minima

572
00:24:38,020 --> 00:24:39,770
in that potential
energy function.

573
00:24:39,770 --> 00:24:42,280
Now, would that life
were so simple, right?

574
00:24:42,280 --> 00:24:45,385
But we will see that there are
ways of parameterizing the U

575
00:24:45,385 --> 00:24:48,340
and using it to optimize
the structure so it finds

576
00:24:48,340 --> 00:24:49,790
this, at least local, minimum.
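[Editor's sketch, not part of the lecture.] The statement that the force is the negative derivative of U, and that a structure sits where that derivative is zero, can be illustrated with a toy one-dimensional potential. The double-well function U(x) = (x^2 - 1)^2 below is an assumption chosen only because it has two minima; it is not a protein energy function. Following the force downhill from different starting points lands in different local minima.

```python
def potential(x):
    """Toy double-well potential with minima at x = -1 and x = +1."""
    return (x * x - 1.0) ** 2

def force(x):
    """Force is the negative derivative of the potential: F = -dU/dx."""
    return -4.0 * x * (x * x - 1.0)

def minimize(x, step=0.05, n_steps=1000):
    """Move along the force in small steps until it (nearly) vanishes."""
    for _ in range(n_steps):
        x += step * force(x)
    return x

left_minimum = minimize(-0.5)   # settles in the well on the left
right_minimum = minimize(0.5)   # settles in the well on the right
```

Which minimum is found depends on the starting point, which is why the transcript speaks of finding at least a local minimum rather than the single global one.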

577
00:24:52,345 --> 00:24:53,720
And we're going
to look primarily

578
00:24:53,720 --> 00:24:56,920
at two different ways of
describing the potential energy

579
00:24:56,920 --> 00:24:57,610
function.

580
00:24:57,610 --> 00:24:59,890
One of them, we're going to look
at the problem like a physicist

581
00:24:59,890 --> 00:25:01,348
one, and the other
way, we're going

582
00:25:01,348 --> 00:25:03,500
to look at it as a
statistician would.

583
00:25:03,500 --> 00:25:07,080
So the physicist wants to
describe, as you might imagine,

584
00:25:07,080 --> 00:25:10,200
the physical forces that
underlie the protein structure,

585
00:25:10,200 --> 00:25:11,580
and so as much as
possible, we're

586
00:25:11,580 --> 00:25:13,080
going to try to
write down equations

587
00:25:13,080 --> 00:25:14,665
that represent those forces.

588
00:25:14,665 --> 00:25:16,040
Now, we're not
always going to be

589
00:25:16,040 --> 00:25:18,380
able to do that because
a lot of forces involved

590
00:25:18,380 --> 00:25:19,790
are quantum mechanical.

591
00:25:19,790 --> 00:25:21,360
The mere fact that
two solid objects

592
00:25:21,360 --> 00:25:25,260
don't pass through each other is
because of exclusion principles

593
00:25:25,260 --> 00:25:26,644
that deal with
quantum mechanics.

594
00:25:26,644 --> 00:25:29,060
We're not going to write down
quantum mechanical equations

595
00:25:29,060 --> 00:25:30,726
for every atom in our
protein structure,

596
00:25:30,726 --> 00:25:33,134
but we will write down equations
that approximate those.

597
00:25:33,134 --> 00:25:34,550
And wherever
possible, we're going

598
00:25:34,550 --> 00:25:37,100
to try to tie the
terms in our equations

599
00:25:37,100 --> 00:25:39,320
into something
identifiable in physics,

600
00:25:39,320 --> 00:25:43,190
and a very good example of this
approach is the CHARMM program.

601
00:25:43,190 --> 00:25:45,350
And these approaches
actually were the ones

602
00:25:45,350 --> 00:25:48,450
that won the Nobel Prize in
chemistry this past year.

603
00:25:48,450 --> 00:25:51,540
At the other end of the spectrum
are the statistical approaches.

604
00:25:51,540 --> 00:25:53,831
Here, we don't really care
what the underlying physical

605
00:25:53,831 --> 00:25:54,470
properties are.

606
00:25:54,470 --> 00:25:57,600
We want equations that
capture what we see in nature.

607
00:25:57,600 --> 00:25:59,860
Now, often, these two
approaches will align very well.

608
00:25:59,860 --> 00:26:02,860
There'll be some approximations
that the physicist makes

609
00:26:02,860 --> 00:26:05,422
to capture a fundamental
physical force.

610
00:26:05,422 --> 00:26:07,880
That's simply the best way to
describe what you see in nature,

611
00:26:07,880 --> 00:26:11,160
and so those two terms may look
indistinguishable in the CHARMM

612
00:26:11,160 --> 00:26:14,150
version or my favorite
statistical approach, which

613
00:26:14,150 --> 00:26:15,360
is Rosetta.

614
00:26:15,360 --> 00:26:17,330
So we'll see that some
terms in these functions

615
00:26:17,330 --> 00:26:18,749
agree between
CHARMM and Rosetta.

616
00:26:18,749 --> 00:26:20,790
Well, there'll be places
where they fundamentally

617
00:26:20,790 --> 00:26:24,040
disagree on how to describe
the molecular potential energy

618
00:26:24,040 --> 00:26:27,180
function because one is trying
to describe the physical forces

619
00:26:27,180 --> 00:26:30,230
and the other one is trying to
describe the statistical ones.

620
00:26:30,230 --> 00:26:32,750
Do we have any native speakers
of German in the audience?

621
00:26:32,750 --> 00:26:34,020
AUDIENCE: I'm a speaker.

622
00:26:34,020 --> 00:26:35,914
PROFESSOR: You want to
read the joke for us?

623
00:26:35,914 --> 00:26:36,538
AUDIENCE: Yeah.

624
00:26:36,538 --> 00:26:39,810
Institute for Quantum
Physics, and it says

625
00:26:39,810 --> 00:26:41,767
"You can find yourself
here or here."

626
00:26:41,767 --> 00:26:42,350
PROFESSOR: OK.

627
00:26:42,350 --> 00:26:43,610
AUDIENCE: [LAUGHTER]

628
00:26:43,610 --> 00:26:46,150
PROFESSOR: All right,
so for the video,

629
00:26:46,150 --> 00:26:48,710
it's the Institute
for Quantum Mechanics.

630
00:26:48,710 --> 00:26:51,315
And you go to a map at MIT,
and it'll say, you find,

631
00:26:51,315 --> 00:26:51,940
"You are here."

632
00:26:51,940 --> 00:26:52,410
Right?

633
00:26:52,410 --> 00:26:54,535
But in the Institute for
Quantum Mechanics, it says

634
00:26:54,535 --> 00:26:56,216
"You're either here or here."

635
00:26:56,216 --> 00:26:57,590
So that's the
physicist's approach.

636
00:26:57,590 --> 00:27:00,080
We really do have to think
about those quantum mechanical

637
00:27:00,080 --> 00:27:02,470
features, whereas on
the right-hand side

638
00:27:02,470 --> 00:27:03,850
is the statistician's approach.

639
00:27:03,850 --> 00:27:06,130
It says "Data don't
make any sense.

640
00:27:06,130 --> 00:27:08,140
We'll have to resort
to statistics."

641
00:27:08,140 --> 00:27:08,950
OK?

642
00:27:08,950 --> 00:27:11,190
So the statistician
can get pretty far

643
00:27:11,190 --> 00:27:13,936
without understanding the
underlying physical forces.

644
00:27:13,936 --> 00:27:16,060
All right, so let's look
at the physicist's approach

645
00:27:16,060 --> 00:27:18,393
first, so we're going to break
down the potential energy

646
00:27:18,393 --> 00:27:21,722
function into bonded terms
and non-bonded terms.

647
00:27:21,722 --> 00:27:23,180
So the bonded terms,
as they sound,

648
00:27:23,180 --> 00:27:24,990
are going to be atoms that
are close to each other

649
00:27:24,990 --> 00:27:27,370
in the bonded structures, so
certainly these two atoms,

650
00:27:27,370 --> 00:27:29,690
because they're connected
by a single bond,

651
00:27:29,690 --> 00:27:31,000
are going to be bonded terms.

652
00:27:31,000 --> 00:27:34,170
But we'll see groups of three
or four atoms near each other

653
00:27:34,170 --> 00:27:35,359
will also be bonded terms.

654
00:27:35,359 --> 00:27:36,900
And the non-bonded
terms will be when

655
00:27:36,900 --> 00:27:39,690
I have another molecule that
comes close, but isn't directly

656
00:27:39,690 --> 00:27:40,330
connected.

657
00:27:40,330 --> 00:27:44,160
What are the physical
forces between these two?

658
00:27:44,160 --> 00:27:46,560
So these bonded terms
then first break down

659
00:27:46,560 --> 00:27:47,754
into a lot of sub terms.

660
00:27:47,754 --> 00:27:49,420
I'll show you the
functional forms here.

661
00:27:49,420 --> 00:27:51,070
We'll just look at a
few of them in detail

662
00:27:51,070 --> 00:27:53,278
and then give you a sense
of what the other ones are.

663
00:27:53,278 --> 00:27:55,100
So this first one
is the bonded term

664
00:27:55,100 --> 00:27:58,060
that describes, actually, the
distance between two bonded

665
00:27:58,060 --> 00:27:59,150
atoms.

666
00:27:59,150 --> 00:28:00,800
Now, again, this
is fundamentally

667
00:28:00,800 --> 00:28:03,020
a quantum mechanical
property, but it

668
00:28:03,020 --> 00:28:04,646
would be too
computationally expensive

669
00:28:04,646 --> 00:28:06,020
to describe the
quantum mechanics

670
00:28:06,020 --> 00:28:09,630
and not really necessary because
you can do pretty well by just

671
00:28:09,630 --> 00:28:12,720
describing this
as a stiff spring.

672
00:28:12,720 --> 00:28:15,360
So that's what this quadratic
form of the equation

673
00:28:15,360 --> 00:28:16,380
represents.

674
00:28:16,380 --> 00:28:20,670
So we simply define
b naught here

675
00:28:20,670 --> 00:28:22,710
as the equilibrium
position between these two

676
00:28:22,710 --> 00:28:23,950
atoms of particular types.

677
00:28:23,950 --> 00:28:26,546
These would be two tetrahedrally
coordinated carbons,

678
00:28:26,546 --> 00:28:28,170
and that would be
determined by looking

679
00:28:28,170 --> 00:28:29,753
at a lot of very,
very high resolution

680
00:28:29,753 --> 00:28:31,350
structures in small
molecule crystals

681
00:28:31,350 --> 00:28:34,760
so we know what the typical
distance for this bond is.

682
00:28:34,760 --> 00:28:36,302
We get that as a parameter.

683
00:28:36,302 --> 00:28:38,260
There would be a big file
in the CHARMM program

684
00:28:38,260 --> 00:28:40,801
that lists all those parameters
for every one of these bonded

685
00:28:40,801 --> 00:28:44,290
terms, and then if there's
a small deviation from that,

686
00:28:44,290 --> 00:28:47,090
because the molecule's
stretched a bit

687
00:28:47,090 --> 00:28:48,540
in your refinement
process, there

688
00:28:48,540 --> 00:28:51,190
would be a penalty to
pull it back in just

689
00:28:51,190 --> 00:28:52,630
like a spring pulls it back in.
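[Editor's sketch, not part of the lecture.] The stiff-spring bond term can be sketched in a few lines of Python. The quadratic form k * (b - b0)^2 follows the description in the transcript, with b0 as the equilibrium length. The numeric values of `k_bond` and `b0` used in the example are illustrative placeholders, not parameters taken from any CHARMM parameter file.

```python
def bond_energy(b, b0, k_bond):
    """Harmonic (spring-like) penalty for a bond of length b.

    b0 is the equilibrium length and k_bond the stiffness; both would
    normally be looked up per pair of atom types.
    """
    return k_bond * (b - b0) ** 2

def total_bond_energy(lengths, b0, k_bond):
    """Sum the spring penalty over several bonds of the same type."""
    return sum(bond_energy(b, b0, k_bond) for b in lengths)

# Illustrative placeholder values (assumed, not real force-field parameters):
example = total_bond_energy([1.53, 1.55, 1.50], b0=1.53, k_bond=200.0)
```

A bond at exactly b0 contributes nothing, and the penalty grows with the square of the stretch or compression, which is what pulls the geometry back during refinement.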

690
00:28:56,314 --> 00:28:58,230
Now, it turns out that
when you go this route,

691
00:28:58,230 --> 00:29:00,396
you have to actually come
up with a lot of equations

692
00:29:00,396 --> 00:29:03,000
to maintain the geometry
because, again, we're

693
00:29:03,000 --> 00:29:05,584
going to have to not only worry
about these distance bonds,

694
00:29:05,584 --> 00:29:07,000
but we need to
worry about angles.

695
00:29:07,000 --> 00:29:10,810
So we've got the angle between
this bond and this bond.

696
00:29:10,810 --> 00:29:12,070
What keeps that in place?

697
00:29:12,070 --> 00:29:14,610
So we need to add another
term that's a second term here

698
00:29:14,610 --> 00:29:17,024
to make the angle
between these fixed,

699
00:29:17,024 --> 00:29:18,440
and then we have
to deal with what

700
00:29:18,440 --> 00:29:21,650
are called dihedral angles
to make sure that these four

701
00:29:21,650 --> 00:29:26,015
atoms lie in the
allowed geometry.

702
00:29:26,015 --> 00:29:27,640
And so each one of
these terms accounts

703
00:29:27,640 --> 00:29:28,850
for something like that.

704
00:29:28,850 --> 00:29:30,770
This last term over
here makes sure

705
00:29:30,770 --> 00:29:33,090
that the phi and psi angles
are consistent with what

706
00:29:33,090 --> 00:29:36,725
we see in quantum mechanics as
corrected for any deviations

707
00:29:36,725 --> 00:29:39,040
that we see in these
small molecules.

708
00:29:39,040 --> 00:29:42,100
So a lot of terms with a lot
of parameters, they're trying

709
00:29:42,100 --> 00:29:45,492
to capture the best description
of what we observe, and each one

710
00:29:45,492 --> 00:29:46,950
is motivated by
the fact that there

711
00:29:46,950 --> 00:29:49,650
is some quantum mechanical
principle underlying it.

712
00:29:49,650 --> 00:29:50,339
So-- yes?

713
00:29:50,339 --> 00:29:51,714
AUDIENCE: Why is
the [INAUDIBLE]?

714
00:29:54,696 --> 00:29:58,407
PROFESSOR: I actually don't
know the answer to that.

715
00:29:58,407 --> 00:29:59,990
But there's a reference
there that I'm

716
00:29:59,990 --> 00:30:01,239
sure will give you the answer.

717
00:30:03,970 --> 00:30:06,894
OK, now what about
these non-bonded terms?

718
00:30:06,894 --> 00:30:08,310
So non-bonded terms,
as I said, are

719
00:30:08,310 --> 00:30:11,760
molecules that are distant from
each other in the structure

720
00:30:11,760 --> 00:30:13,470
of the protein, but
close to each other

721
00:30:13,470 --> 00:30:14,720
in three-dimensional space.

722
00:30:14,720 --> 00:30:18,300
And there are two
fundamental forces here.

723
00:30:18,300 --> 00:30:20,890
The first one is called the
Lennard-Jones potential,

724
00:30:20,890 --> 00:30:23,020
and the second one is
the electrostatic one.

725
00:30:23,020 --> 00:30:26,320
And the Lennard-Jones potential
itself has these two terms.

726
00:30:26,320 --> 00:30:30,390
One is an r to the 6th term, a negative
r to the 6th dependency.

727
00:30:30,390 --> 00:30:32,780
The other one is
positive in r to the 12th.

728
00:30:32,780 --> 00:30:35,260
The negative r to the 6th
is an attractive potential.

729
00:30:35,260 --> 00:30:36,840
That's why it's
negative, and it's

730
00:30:36,840 --> 00:30:39,400
because of small
induced dipoles that

731
00:30:39,400 --> 00:30:41,950
occur in the electron clouds
of each of these atoms that

732
00:30:41,950 --> 00:30:44,560
pull the molecules together.

733
00:30:44,560 --> 00:30:46,370
And the 1 over r to
the 6th dependency

734
00:30:46,370 --> 00:30:50,120
has to do with the physics
of two dipoles interacting.

735
00:30:50,120 --> 00:30:53,280
The r to the 12th term
is an approximation

736
00:30:53,280 --> 00:30:54,540
to a quantum mechanical force.

737
00:30:54,540 --> 00:30:58,520
So the reason the two molecules
don't pass through each other,

738
00:30:58,520 --> 00:31:01,690
as we said already, is because of
quantum mechanical forces.

739
00:31:01,690 --> 00:31:03,440
That would be very
expensive to compute,

740
00:31:03,440 --> 00:31:05,540
so we come up with a term
that's easy to compute.

741
00:31:05,540 --> 00:31:07,430
And of course, an
r to the 12th term is simply

742
00:31:07,430 --> 00:31:09,330
the square of an
r to the 6th term,

743
00:31:09,330 --> 00:31:11,957
so if you already computed 1
over r to the 6th between two

744
00:31:11,957 --> 00:31:14,540
atoms, you just square that, and
you get 1 over r to the 12th.

745
00:31:14,540 --> 00:31:16,165
So it's very
computationally efficient,

746
00:31:16,165 --> 00:31:19,920
and you adjust the
parameters, these r mins,

747
00:31:19,920 --> 00:31:23,150
so that it works out so that
these things agree reasonably

748
00:31:23,150 --> 00:31:24,750
well with the
crystal structures.

749
00:31:24,750 --> 00:31:26,875
And these are crystal
structures of small molecules

750
00:31:26,875 --> 00:31:28,502
that we know in great detail.

751
00:31:28,502 --> 00:31:29,960
And then the
electrostatics is what

752
00:31:29,960 --> 00:31:31,510
you might expect
for electrostatics.

753
00:31:31,510 --> 00:31:35,110
It's got a potential that
varies as 1 over the distance,

754
00:31:35,110 --> 00:31:36,930
and as the product
of those charges,

755
00:31:36,930 --> 00:31:39,980
these can be full charges or
they can be partial charges.

756
00:31:39,980 --> 00:31:42,892
And there's a term
here, this epsilon,

757
00:31:42,892 --> 00:31:45,100
which is the dielectric
constant, and that represents

758
00:31:45,100 --> 00:31:47,890
the fact that, in vacuum,
there'd be much greater

759
00:31:47,890 --> 00:31:50,440
force pulling two
oppositely charged molecules

760
00:31:50,440 --> 00:31:54,790
together than in water because
the water's going to shield.

761
00:31:54,790 --> 00:31:56,970
And so these
electrostatic terms,

762
00:31:56,970 --> 00:32:01,730
this dielectric
constant term,

763
00:32:01,730 --> 00:32:05,990
can vary from one, which is
vacuum, to, say, 80 for water.

764
00:32:05,990 --> 00:32:09,640
And setting that
is a bit of an art.
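[Editor's sketch, not part of the lecture.] The electrostatic term can be sketched as a Coulomb energy scaled by the dielectric constant. The conversion factor of roughly 332 (kcal/mol times angstrom per squared elementary charge) is a commonly used approximate value and is treated here as an assumption, as are the function and parameter names.

```python
# Approximate Coulomb constant in kcal*angstrom/(mol*e^2); assumed value.
COULOMB_KCAL = 332.06

def coulomb_energy(q_i, q_j, r, dielectric=1.0):
    """Electrostatic energy of two point charges at distance r.

    q_i and q_j may be full or partial charges (in elementary charge
    units); dielectric is 1 for vacuum and about 80 for bulk water.
    """
    return COULOMB_KCAL * q_i * q_j / (dielectric * r)

in_vacuum = coulomb_energy(1.0, -1.0, 3.0, dielectric=1.0)
in_water = coulomb_energy(1.0, -1.0, 3.0, dielectric=80.0)
```

The same pair of opposite charges is held together 80 times less strongly in water than in vacuum in this model, which is the shielding effect the dielectric term is meant to capture.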

765
00:32:09,640 --> 00:32:11,810
OK, so what do these
potentials look like?

766
00:32:11,810 --> 00:32:12,800
Those are shown here.

767
00:32:12,800 --> 00:32:15,370
This is the, in dark lines,
the sum of the van der Waals

768
00:32:15,370 --> 00:32:16,580
potential.

769
00:32:16,580 --> 00:32:18,850
It consists of that
attractive term,

770
00:32:18,850 --> 00:32:20,910
which has the r
to the 6th dependency,

771
00:32:20,910 --> 00:32:22,660
and the repulsive term
with the r to the 12th.

772
00:32:22,660 --> 00:32:25,964
And why does it go up so
high at short distances?

773
00:32:25,964 --> 00:32:27,170
AUDIENCE: [INAUDIBLE].

774
00:32:27,170 --> 00:32:28,628
PROFESSOR: Right,
because you can't

775
00:32:28,628 --> 00:32:30,574
have molecules that overlap.

776
00:32:30,574 --> 00:32:31,990
You'll see that
there's a minimum,

777
00:32:31,990 --> 00:32:33,930
so there's an optimal
distance barring

778
00:32:33,930 --> 00:32:35,890
any other forces
between two atoms.

779
00:32:35,890 --> 00:32:39,130
So that's roughly what
these hard sphere distances

780
00:32:39,130 --> 00:32:42,880
represent in the scale models.

781
00:32:42,880 --> 00:32:44,380
And then the
electrostatic potential

782
00:32:44,380 --> 00:32:48,700
also, obviously,
has an attractive term,

783
00:32:48,700 --> 00:32:50,830
but it's going to
blow up as you get

784
00:32:50,830 --> 00:32:54,340
to small values,
increasingly favorable.

785
00:32:54,340 --> 00:32:57,910
And so the net sum of
those two is shown here,

786
00:32:57,910 --> 00:33:00,510
the combination of van der
Waals and electrostatics.

787
00:33:00,510 --> 00:33:03,354
It, again, has a strong
minimum but becomes

788
00:33:03,354 --> 00:33:05,270
highly positive as you
get to close distances.

789
00:33:09,340 --> 00:33:12,110
OK, any questions
on these forces?

790
00:33:12,110 --> 00:33:12,610
Yes?

791
00:33:12,610 --> 00:33:14,930
AUDIENCE: Do the van der
Waals equal the Lennard-Jones

792
00:33:14,930 --> 00:33:15,430
potential?

793
00:33:15,430 --> 00:33:16,513
Or is that something else?

794
00:33:16,513 --> 00:33:18,874
PROFESSOR: Yeah,
typically, those two terms

795
00:33:18,874 --> 00:33:19,915
are used interchangeably.

796
00:33:19,915 --> 00:33:21,224
Yeah.

797
00:33:21,224 --> 00:33:21,890
Other questions?

798
00:33:25,150 --> 00:33:26,620
OK.

799
00:33:26,620 --> 00:33:28,450
All right, so that's
how the physicist

800
00:33:28,450 --> 00:33:31,300
would describe the
potential energy function.

801
00:33:31,300 --> 00:33:32,850
Rosetta, as I told
you, is an example

802
00:33:32,850 --> 00:33:34,250
of the statistical approach.

803
00:33:34,250 --> 00:33:36,870
It rejects all this
sharp definition

804
00:33:36,870 --> 00:33:39,870
of trying to compute exactly
the right distance between two

805
00:33:39,870 --> 00:33:42,510
atoms by having a stiff
spring between them

806
00:33:42,510 --> 00:33:44,727
and says let's just fix
a lot of these angles.

807
00:33:44,727 --> 00:33:46,935
So we're going to fix the
distance between two atoms.

808
00:33:46,935 --> 00:33:48,540
There's no point
in having them vary

809
00:33:48,540 --> 00:33:50,890
by tiny, tiny fractions
in the bond length.

810
00:33:50,890 --> 00:33:52,906
We're going to fix the
tetrahedral coordination

811
00:33:52,906 --> 00:33:54,030
of our tetrahedral carbons.

812
00:33:54,030 --> 00:33:55,780
We're not going to let them
deform because that never

813
00:33:55,780 --> 00:33:57,470
would happen in
reality, and so we're

814
00:33:57,470 --> 00:34:00,350
going to focus our
search over the space

815
00:34:00,350 --> 00:34:02,240
entirely over the
rotatable bonds.

816
00:34:02,240 --> 00:34:04,220
So remember, how
many rotatable bonds

817
00:34:04,220 --> 00:34:05,900
did we have in the backbone?

818
00:34:05,900 --> 00:34:06,650
We had two, right?

819
00:34:06,650 --> 00:34:08,080
We had the phi and
the psi angles,

820
00:34:08,080 --> 00:34:09,790
and then the side
chains then will

821
00:34:09,790 --> 00:34:12,659
have rotatable bonds
over the side chains.

822
00:34:12,659 --> 00:34:15,030
So in this example,
this is a cysteine.

823
00:34:15,030 --> 00:34:16,580
Here's the backbone.

824
00:34:16,580 --> 00:34:18,670
Here's the sulfur.

825
00:34:18,670 --> 00:34:20,991
And we have exactly one
rotatable bond of interest

826
00:34:20,991 --> 00:34:23,449
because we don't really care
where the hydrogen is located.

827
00:34:23,449 --> 00:34:24,854
So we've got this chi 1 angle.

828
00:34:24,854 --> 00:34:26,270
If there were more
atoms out here,

829
00:34:26,270 --> 00:34:29,040
this would be called
chi 2 and chi 3.

830
00:34:29,040 --> 00:34:33,719
And these can rotate, but
they don't rotate freely.

831
00:34:33,719 --> 00:34:35,440
We don't observe, in
crystal structures,

832
00:34:35,440 --> 00:34:37,790
every possible rotation
of these angles,

833
00:34:37,790 --> 00:34:40,770
and that's what this plot
on the left represents.

834
00:34:40,770 --> 00:34:44,929
For this side chain, there's
a chi 1, a chi 2, and a chi 3,

835
00:34:44,929 --> 00:34:47,792
and the dark regions represent
the observed conformations

836
00:34:47,792 --> 00:34:49,250
over many, many
crystal structures.

837
00:34:49,250 --> 00:34:51,210
And you can see it's
highly nonuniform.

838
00:34:51,210 --> 00:34:53,172
Now why is that?

839
00:34:53,172 --> 00:34:54,741
I see people with
their hands trying

840
00:34:54,741 --> 00:34:55,949
to figure it out in the back.

841
00:34:55,949 --> 00:34:57,910
So why is that?

842
00:34:57,910 --> 00:35:00,210
Figure that's what
you guys are doing.

843
00:35:00,210 --> 00:35:03,370
If not, it's very
interesting sign language.

844
00:35:03,370 --> 00:35:06,390
So if we look down one of
these tetrahedral carbon-carbon

845
00:35:06,390 --> 00:35:09,340
bonds, we have apparently
a free rotation.

846
00:35:09,340 --> 00:35:11,099
But in fact, in some of
these conformations,

847
00:35:11,099 --> 00:35:12,890
we're going to have a
lot of steric clashes

848
00:35:12,890 --> 00:35:15,940
between the atoms on one carbon
and the atoms on the other,

849
00:35:15,940 --> 00:35:18,510
and so this is not a
favorable conformation.

850
00:35:18,510 --> 00:35:20,629
The favorable
conformation is offset,

851
00:35:20,629 --> 00:35:23,170
and that propagates throughout
all the chains in the protein.

852
00:35:23,170 --> 00:35:25,503
So there'll be certain angles
that are highly preferred,

853
00:35:25,503 --> 00:35:26,920
and other ones that are not.

854
00:35:26,920 --> 00:35:29,417
These highly preferred
angles are called rotamers,

855
00:35:29,417 --> 00:35:30,750
and so we'll use the term a lot.

856
00:35:30,750 --> 00:35:33,200
It stands for
rotational isomers.

857
00:35:33,200 --> 00:35:35,235
And so now, we've turned
our continuous problem

858
00:35:35,235 --> 00:35:38,070
of figuring out what the
optimal angle is for this chi 1

859
00:35:38,070 --> 00:35:40,370
rotation into a discrete
problem where maybe there

860
00:35:40,370 --> 00:35:45,000
are only two or three possible
options for that rotation.

861
00:35:45,000 --> 00:35:46,760
And so now, we
can decide is this

862
00:35:46,760 --> 00:35:49,550
better than this
one or this one?
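A minimal sketch of that discrete choice, assuming three candidate chi 1 angles and a made-up energy function. Neither the angles nor the clash penalty come from Rosetta; they only illustrate scoring a handful of rotamers instead of searching a continuous angle.

```python
import math

# Hypothetical chi 1 rotamers in degrees (roughly the staggered positions).
ROTAMERS_DEG = (-60.0, 60.0, 180.0)


def toy_energy(chi_deg, clash_deg=90.0):
    """Toy score: a threefold torsion term plus a made-up clash penalty.

    The torsion term is lowest at the staggered angles (-60, 60, 180) and
    highest at the eclipsed ones. The clash term penalizes angles near
    clash_deg, standing in for a neighboring atom that is in the way.
    """
    torsion = 1.0 + math.cos(math.radians(3.0 * chi_deg))
    # Smallest angular difference to the clash direction, in [0, 180].
    diff = abs((chi_deg - clash_deg + 180.0) % 360.0 - 180.0)
    clash = 5.0 * math.exp(-(diff / 30.0) ** 2)
    return torsion + clash


def best_rotamer(energy, rotamers=ROTAMERS_DEG):
    """Return the discrete rotamer with the lowest energy."""
    return min(rotamers, key=energy)


if __name__ == "__main__":
    print(best_rotamer(toy_energy))
```

Any scoring function can be passed to `best_rotamer`; the search itself is just a comparison over a short list.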

863
00:35:49,550 --> 00:35:51,750
Questions on rotamers
or any of this?

864
00:35:54,870 --> 00:35:57,690
Excellent.

865
00:35:57,690 --> 00:36:00,092
OK, so how do we determine--
we've decided then

866
00:36:00,092 --> 00:36:01,550
we're going to
describe the protein

867
00:36:01,550 --> 00:36:03,770
entirely by these
internal coordinates--

868
00:36:03,770 --> 00:36:06,550
the phi and psi of the backbone,
the chi angles of the side

869
00:36:06,550 --> 00:36:07,480
chain.

870
00:36:07,480 --> 00:36:10,170
We still need a potential
energy function, right?

871
00:36:10,170 --> 00:36:12,600
That hasn't told us how to
find the optimal settings,

872
00:36:12,600 --> 00:36:15,880
and we're going to try to
avoid the approach of CHARMM,

873
00:36:15,880 --> 00:36:17,740
where we actually look
at quantum mechanics

874
00:36:17,740 --> 00:36:20,387
to decide what
all the terms are.

875
00:36:20,387 --> 00:36:22,220
So how do they actually
go about doing this?

876
00:36:22,220 --> 00:36:26,150
Well, they take a number of high
resolution crystal structures,

877
00:36:26,150 --> 00:36:27,910
and they characterize
certain properties

878
00:36:27,910 --> 00:36:29,076
in those crystal structures.

879
00:36:29,076 --> 00:36:30,960
For example, they
might characterize

880
00:36:30,960 --> 00:36:33,960
how often a certain
aliphatic carbon-- how often

881
00:36:33,960 --> 00:36:36,569
aliphatic carbons are
near amide nitrogens,

882
00:36:36,569 --> 00:36:38,110
and they might
measure the distance--

883
00:36:38,110 --> 00:36:41,940
they do measure the distance
between these amide nitrogens

884
00:36:41,940 --> 00:36:44,790
and aliphatic carbons across
all the crystal structures

885
00:36:44,790 --> 00:36:47,290
and determine how often
those distances occur.

886
00:36:47,290 --> 00:36:49,700
And you can actually turn
those observations, then,

887
00:36:49,700 --> 00:36:51,890
into a potential energy
function by simply

888
00:36:51,890 --> 00:36:53,980
using Boltzmann's equation.

889
00:36:53,980 --> 00:36:56,380
So we can figure
out how frequently

890
00:36:56,380 --> 00:36:58,537
we get certain
distances. On the x-axis

891
00:36:58,537 --> 00:37:01,860
is distance, on the y-axis is
frequency, number of entries

892
00:37:01,860 --> 00:37:07,770
in the crystal structures,
and then by Boltzmann's Law,

893
00:37:07,770 --> 00:37:11,152
we can compute the
density of states

894
00:37:11,152 --> 00:37:13,610
over some reference, which is
actually very hard to define.

895
00:37:13,610 --> 00:37:15,850
And you can look at some
of the references referred

896
00:37:15,850 --> 00:37:18,350
to in the slides to figure out
how that's currently defined,

897
00:37:18,350 --> 00:37:21,010
but we have to find some
arbitrary reference state.

898
00:37:21,010 --> 00:37:24,270
Then the energy
of being in any one of these states

899
00:37:24,270 --> 00:37:26,880
is going to be a function,
a logarithmic function,

900
00:37:26,880 --> 00:37:29,837
of the frequency
of those states.
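The conversion from observed frequencies to energies can be sketched as E = -kT ln(p_obs / p_ref) per distance bin. This is a minimal sketch, assuming binned counts and a user-supplied reference distribution; the choice of reference state is the hard part and is not addressed here.

```python
import math

K_B = 0.0019872  # Boltzmann constant in kcal/(mol*K)


def statistical_potential(observed_counts, reference_counts, temperature=298.15):
    """Convert per-bin counts into energies: E = -kT * ln(p_obs / p_ref).

    observed_counts  -- how often each distance bin occurs in the structures
    reference_counts -- counts for the same bins in the assumed reference state
    """
    kt = K_B * temperature
    total_obs = sum(observed_counts)
    total_ref = sum(reference_counts)
    energies = []
    for n_obs, n_ref in zip(observed_counts, reference_counts):
        if n_ref <= 0:
            raise ValueError("reference counts must be positive")
        if n_obs == 0:
            energies.append(float("inf"))  # never observed: treat as forbidden
            continue
        ratio = (n_obs / total_obs) / (n_ref / total_ref)
        energies.append(-kt * math.log(ratio))
    return energies


if __name__ == "__main__":
    # Hypothetical counts for four distance bins, with a flat reference.
    print(statistical_potential([2, 50, 30, 18], [25, 25, 25, 25]))
```

Bins seen more often than the reference come out with negative (favorable) energy, and rarely seen bins come out positive.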

901
00:37:29,837 --> 00:37:32,170
All right, so we've got an
energy term that's determined

902
00:37:32,170 --> 00:37:35,200
solely by the
observations of distances,

903
00:37:35,200 --> 00:37:37,670
that doesn't say I know that
this one's charged and this one

904
00:37:37,670 --> 00:37:38,169
isn't.

905
00:37:38,169 --> 00:37:40,730
It just says here's
an oxygen attached

906
00:37:40,730 --> 00:37:42,230
to a carbon with double bonds.

907
00:37:42,230 --> 00:37:44,484
Here's a carbon that's not.

908
00:37:44,484 --> 00:37:46,400
How often are they at
any particular distance?

909
00:37:46,400 --> 00:37:48,610
And we go through lots and
lots of other properties,

910
00:37:48,610 --> 00:37:51,630
and we'll go into detail now
on what those other terms are

911
00:37:51,630 --> 00:37:54,630
to look through high
resolution crystal structures,

912
00:37:54,630 --> 00:37:57,191
see what certain
properties are, turn those

913
00:37:57,191 --> 00:37:59,190
into potential energy
functions that we can then

914
00:37:59,190 --> 00:38:02,750
use to identify the optimum
rotations for the side

915
00:38:02,750 --> 00:38:05,558
chain and the backbone.

916
00:38:05,558 --> 00:38:08,100
Oh, and I should also point
out that when we do this,

917
00:38:08,100 --> 00:38:09,600
we'll have different terms
for different things.

918
00:38:09,600 --> 00:38:11,730
We'll have a term for distances
between different kinds

919
00:38:11,730 --> 00:38:12,290
of atoms.

920
00:38:12,290 --> 00:38:16,659
We'll have terms for some
of these other pieces

921
00:38:16,659 --> 00:38:19,200
of potential energy that we'll
describe in subsequent slides,

922
00:38:19,200 --> 00:38:20,575
and we're going
to need to decide

923
00:38:20,575 --> 00:38:23,350
how to weight all of those,
all those independent terms,

924
00:38:23,350 --> 00:38:25,126
to get them to give
us reasonable protein

925
00:38:25,126 --> 00:38:26,250
structures when we're done.

926
00:38:26,250 --> 00:38:28,830
And that, once again, is
a curve fitting exercise,

927
00:38:28,830 --> 00:38:31,400
finding the numbers
that best fit the data

928
00:38:31,400 --> 00:38:35,640
without any guiding physical
principle underneath it.
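A minimal sketch of how independently computed terms combine into one score, assuming a simple weighted sum. The term names, values, and weights below are hypothetical placeholders, not the contents of any Rosetta weights file.

```python
def total_energy(terms, weights):
    """Weighted sum of independent energy terms.

    terms   -- mapping from term name to its unweighted value
    weights -- mapping from term name to its fitted weight
    A term with no weight raises KeyError rather than being silently dropped.
    """
    return sum(weights[name] * value for name, value in terms.items())


if __name__ == "__main__":
    # Hypothetical values for one candidate structure.
    terms = {"vdw_attractive": -12.0, "vdw_repulsive": 3.0, "hbond": -4.0}
    weights = {"vdw_attractive": 1.0, "vdw_repulsive": 0.5, "hbond": 2.0}
    print(total_energy(terms, weights))
```

The fitting step then amounts to adjusting the numbers in `weights` until the scores produce reasonable structures.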

929
00:38:35,640 --> 00:38:37,060
So you'll be using PyRosetta.

930
00:38:37,060 --> 00:38:38,830
And in PyRosetta,
you'll see the terms

931
00:38:38,830 --> 00:38:40,430
on the board for
the potential energy

932
00:38:40,430 --> 00:38:42,920
functions, the
different features

933
00:38:42,920 --> 00:38:44,330
of the potential
energy function,

934
00:38:44,330 --> 00:38:45,996
and I'll step you
through a few of these

935
00:38:45,996 --> 00:38:48,850
just so you know
what you're using.

936
00:38:48,850 --> 00:38:51,670
There'll also be files
in the PyRosetta installation

937
00:38:51,670 --> 00:38:53,830
that will give you the
relative weights for each

938
00:38:53,830 --> 00:38:55,310
of these terms.

939
00:38:55,310 --> 00:38:58,750
OK, so these first
are the van der Waals,

940
00:38:58,750 --> 00:39:02,460
and here, the shape of the curve
looks just like we saw before.

941
00:39:02,460 --> 00:39:04,560
It has to, in some
sense because they're

942
00:39:04,560 --> 00:39:06,490
trying to solve the
same physical problem,

943
00:39:06,490 --> 00:39:08,570
but the motivation
is very different.

944
00:39:08,570 --> 00:39:10,820
There's no attempt to
decide that it should be a 1

945
00:39:10,820 --> 00:39:13,230
over r to the 6th because of
dipole-dipole interactions.

946
00:39:13,230 --> 00:39:16,050
It's simply, how do I find
the function that accurately

947
00:39:16,050 --> 00:39:19,440
represents what I
see in the database?

948
00:39:19,440 --> 00:39:22,510
So again, computed, this is
the fa attractive and the fa

949
00:39:22,510 --> 00:39:24,800
repulsive, and
those are determined

950
00:39:24,800 --> 00:39:26,300
based on the
statistics of what's

951
00:39:26,300 --> 00:39:29,930
observed in the
crystal structures.

952
00:39:29,930 --> 00:39:35,590
This one, the hbond, breaks down
into backbone and side chain,

953
00:39:35,590 --> 00:39:38,240
long range and short range.

954
00:39:38,240 --> 00:39:40,810
And the goal of the
hbonds-- so hydrogen bonds

955
00:39:40,810 --> 00:39:42,810
are one of the principal
determinants of protein

956
00:39:42,810 --> 00:39:45,430
structure, and you'll see
that in the reading materials

957
00:39:45,430 --> 00:39:46,590
that are posted online.

958
00:39:46,590 --> 00:39:48,756
And one of the critical
things about a hydrogen bond

959
00:39:48,756 --> 00:39:50,730
is that it needs to
be nearly linear.

960
00:39:50,730 --> 00:39:59,164
So the line between-- the
angle between this atom,

961
00:39:59,164 --> 00:40:01,330
which has the hydrogen
attached, and this one, which

962
00:40:01,330 --> 00:40:03,170
is the free electron
pair, has to be

963
00:40:03,170 --> 00:40:04,710
as close to linear as possible.

964
00:40:04,710 --> 00:40:06,300
And the more it
deviates from linear,

965
00:40:06,300 --> 00:40:09,250
the weaker the
hydrogen bond will be.

966
00:40:09,250 --> 00:40:11,970
And so this hydrogen
bonding potential

967
00:40:11,970 --> 00:40:15,040
has terms that describe the
distance between the atoms that

968
00:40:15,040 --> 00:40:17,400
are donating and
accepting the hydrogen

969
00:40:17,400 --> 00:40:19,420
as well as the
angle between them,

970
00:40:19,420 --> 00:40:21,870
and it's been parameterized
to represent, separately,

971
00:40:21,870 --> 00:40:24,540
things that are far from each
other, close to each other,

972
00:40:24,540 --> 00:40:26,470
things that are side
chain, or main chain.

973
00:40:26,470 --> 00:40:28,730
And here's where it's
really the statistician

974
00:40:28,730 --> 00:40:30,460
against the physicist.

975
00:40:30,460 --> 00:40:33,630
Why divide up side
chain and main chain?

976
00:40:33,630 --> 00:40:37,500
There's no physical principle
that drives you to do that.

977
00:40:37,500 --> 00:40:40,440
It's simply because that's what
gives the best fit to the data,

978
00:40:40,440 --> 00:40:43,540
so the statistician is not
afraid to add terms that

979
00:40:43,540 --> 00:40:46,010
make their models
better fit reality,

980
00:40:46,010 --> 00:40:50,800
even if they don't represent any
fundamental physical principle.

981
00:40:50,800 --> 00:40:53,255
And we'll see it gets
even more dramatic

982
00:40:53,255 --> 00:40:55,190
with some of these other terms.

983
00:40:55,190 --> 00:40:57,667
So this is the
Ramachandran plot,

984
00:40:57,667 --> 00:40:59,250
which you'll also
see in your reading.

985
00:40:59,250 --> 00:41:01,230
It represents the
observed frequencies

986
00:41:01,230 --> 00:41:03,490
of phi and the psi angles.

987
00:41:03,490 --> 00:41:05,960
And as you know, there
are only a couple positions

988
00:41:05,960 --> 00:41:08,839
on this phi and psi plot
that are frequently observed,

989
00:41:08,839 --> 00:41:11,130
representing the different
regular secondary structures

990
00:41:11,130 --> 00:41:14,600
primarily, alpha helix and
beta sheet, as indicated.

991
00:41:14,600 --> 00:41:17,400
And rather than trying to
capture the fact that proteins

992
00:41:17,400 --> 00:41:18,950
should form alpha
helices by having

993
00:41:18,950 --> 00:41:21,890
really good forces all
around, they simply

994
00:41:21,890 --> 00:41:25,217
prefer angles that are observed
in the Ramachandran plot.

995
00:41:25,217 --> 00:41:27,300
So we're going to give a
potential energy function

996
00:41:27,300 --> 00:41:30,340
that's going to penalize you
if your phi and psi end up

997
00:41:30,340 --> 00:41:33,220
over here, and reward you
if your phi and psi end up

998
00:41:33,220 --> 00:41:34,820
in one of these positions.
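A minimal sketch of such a term, assuming the phi/psi plane is cut into square bins and each bin is scored by the negative log of how often it is observed. The bin width and pseudocount are illustrative choices, and the angle list in the example is invented.

```python
import math
from collections import Counter

BIN_DEG = 30  # hypothetical bin width in degrees


def bin_of(phi, psi, width=BIN_DEG):
    """Map a (phi, psi) pair in degrees to a grid cell, wrapping at 360."""
    return (int(((phi + 180.0) % 360.0) // width),
            int(((psi + 180.0) % 360.0) // width))


def build_rama_score(observed_angles, width=BIN_DEG, pseudocount=1.0):
    """Return score(phi, psi) = -ln(frequency of that bin in the observations)."""
    n_bins = (360 // width) ** 2
    counts = Counter(bin_of(phi, psi, width) for phi, psi in observed_angles)
    total = len(observed_angles) + pseudocount * n_bins

    def score(phi, psi):
        freq = (counts[bin_of(phi, psi, width)] + pseudocount) / total
        return -math.log(freq)

    return score


if __name__ == "__main__":
    # Invented observations: many helix-like, many sheet-like, a few others.
    observed = [(-60, -45)] * 50 + [(-120, 130)] * 40 + [(60, 45)] * 2
    score = build_rama_score(observed)
    print(score(-57, -47), score(0, 180))
```

Frequently observed regions get a low score (a reward), and regions that never appear in the data get the highest score (a penalty).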

999
00:41:34,820 --> 00:41:37,250
So from the physicist,
this is cheating,

1000
00:41:37,250 --> 00:41:41,250
and for the statistician,
it makes perfect sense.

1001
00:41:41,250 --> 00:41:43,906
Shouldn't laugh at that.

1002
00:41:43,906 --> 00:41:46,030
OK, and the same will be
true for the rotamers.

1003
00:41:46,030 --> 00:41:47,571
So we said that,
for the side chains,

1004
00:41:47,571 --> 00:41:49,920
there are certain angles
that we prefer over others

1005
00:41:49,920 --> 00:41:51,410
because that's what we
observe in the database.

1006
00:41:51,410 --> 00:41:53,118
Again, we're not going
to try to get them

1007
00:41:53,118 --> 00:41:55,500
by making sure that there's
repulsion between these two

1008
00:41:55,500 --> 00:41:56,910
atoms when they're eclipsed.

1009
00:41:56,910 --> 00:41:58,910
We're going to get
there simply by saying

1010
00:41:58,910 --> 00:42:01,050
the potential energy
is lower when you're

1011
00:42:01,050 --> 00:42:02,830
in one of these
staggered conformations

1012
00:42:02,830 --> 00:42:04,705
than when you're in one of the
eclipsed conformations.

1013
00:42:07,770 --> 00:42:09,640
OK, now, the place
where the difference

1014
00:42:09,640 --> 00:42:12,670
between the statistician and
the physicist is most dramatic

1015
00:42:12,670 --> 00:42:16,690
comes when we look at
the solvation terms.

1016
00:42:16,690 --> 00:42:18,909
So a lot of what goes on
in protein structure--

1017
00:42:18,909 --> 00:42:20,700
determines protein
structure, I should say,

1018
00:42:20,700 --> 00:42:23,930
is the interaction of
the protein with water.

1019
00:42:23,930 --> 00:42:27,850
It's bathed in a bath of
55 molar water molecules,

1020
00:42:27,850 --> 00:42:28,516
highly polar.

1021
00:42:28,516 --> 00:42:30,640
They normally are hydrogen
bonding with each other.

1022
00:42:30,640 --> 00:42:32,600
When the protein sits
in there, the protein

1023
00:42:32,600 --> 00:42:34,620
has to start hydrogen
bonding with them.

1024
00:42:34,620 --> 00:42:36,680
And where do we find
hydrophobic residues

1025
00:42:36,680 --> 00:42:39,430
in a protein structure,
with your hands?

1026
00:42:39,430 --> 00:42:40,750
Outside or inside?

1027
00:42:40,750 --> 00:42:41,582
Inside, right?

1028
00:42:41,582 --> 00:42:44,040
So the hydrophobic residues are
all going to be buried inside.

1029
00:42:44,040 --> 00:42:45,662
Why is that?

1030
00:42:45,662 --> 00:42:47,330
Well, it's actually
really, really hard

1031
00:42:47,330 --> 00:42:49,163
to describe in terms
of fundamental physical

1032
00:42:49,163 --> 00:42:50,270
principles.

1033
00:42:50,270 --> 00:42:52,940
In fact, it's really hard to
describe the structure of water

1034
00:42:52,940 --> 00:42:54,540
by fundamental
physical principles.

1035
00:42:54,540 --> 00:42:56,340
Simulations that try
to get water to freeze

1036
00:42:56,340 --> 00:42:58,270
were only successful
a few years ago.

1037
00:42:58,270 --> 00:43:00,300
So we've tried to
simulate water using

1038
00:43:00,300 --> 00:43:01,390
basic physical principles.

1039
00:43:01,390 --> 00:43:03,566
It's very hard to
get it to form ice

1040
00:43:03,566 --> 00:43:05,190
when you lower the
temperature, so it's

1041
00:43:05,190 --> 00:43:06,981
going to be even harder,
then, to represent

1042
00:43:06,981 --> 00:43:10,370
how a complicated protein
structure immersed in the water

1043
00:43:10,370 --> 00:43:14,750
actually interacts with
those water molecules.

1044
00:43:14,750 --> 00:43:17,200
So you've got all these
water molecules interacting

1045
00:43:17,200 --> 00:43:19,300
with polar residues
or non-polar residues.

1046
00:43:19,300 --> 00:43:21,936
The physicist really
struggles to represent those.

1047
00:43:21,936 --> 00:43:23,310
And just to show
you why that is,

1048
00:43:23,310 --> 00:43:24,570
let me show you,
again, a little movie.

1049
00:43:24,570 --> 00:43:26,770
Unfortunately, no new
age music with this one.

1050
00:43:26,770 --> 00:43:27,310
I apologize.

1051
00:43:51,820 --> 00:43:54,940
So what's shown here
is a sphere immersed

1052
00:43:54,940 --> 00:43:56,310
in a bunch of water molecules.

1053
00:43:56,310 --> 00:43:58,310
The red is the oxygen.

1054
00:43:58,310 --> 00:44:01,415
The little white parts
are the hydrogens.

1055
00:44:01,415 --> 00:44:02,790
You can see them
wiggling around.

1056
00:44:02,790 --> 00:44:05,900
And what's the fundamental
feature that you observe?

1057
00:44:05,900 --> 00:44:07,800
All right, they're
forming almost a cage

1058
00:44:07,800 --> 00:44:09,420
around this
hydrophobic molecule.

1059
00:44:09,420 --> 00:44:10,170
Why is that?

1060
00:44:14,170 --> 00:44:14,670
Yeah?

1061
00:44:14,670 --> 00:44:16,336
AUDIENCE: It's hard
for them to interact

1062
00:44:16,336 --> 00:44:18,170
with a non-polar residue.

1063
00:44:18,170 --> 00:44:19,819
PROFESSOR: Right, so
it's hard for them

1064
00:44:19,819 --> 00:44:21,360
to interact with a
non-polar residue.

1065
00:44:21,360 --> 00:44:23,332
So the water molecules
want to minimize

1066
00:44:23,332 --> 00:44:24,290
their potential energy.

1067
00:44:24,290 --> 00:44:26,070
They're going to
do that by forming

1068
00:44:26,070 --> 00:44:28,070
hydrogen bonds with something.

1069
00:44:28,070 --> 00:44:31,310
In bulk solvent, they form it
with other water molecules.

1070
00:44:31,310 --> 00:44:35,250
Here, they can't form any
hydrogen bonds with a sphere,

1071
00:44:35,250 --> 00:44:39,440
so they have to do
this complicated dance

1072
00:44:39,440 --> 00:44:42,050
to try to form hydrogen
bonds with each other

1073
00:44:42,050 --> 00:44:44,680
with this thing stuck
in the middle of them.

1074
00:44:44,680 --> 00:44:47,540
And this is, at its heart,
the fundamental driving force

1075
00:44:47,540 --> 00:44:49,150
behind the hydrophobic
effect, that

1076
00:44:49,150 --> 00:44:51,524
which causes the hydrophobic
residues to be buried inside

1077
00:44:51,524 --> 00:44:52,870
of the protein.

1078
00:44:52,870 --> 00:44:55,720
Very, very hard, as
I said, to simulate

1079
00:44:55,720 --> 00:44:57,770
using fundamental
physical forces.

1080
00:44:57,770 --> 00:44:59,350
So what does the
statistician do?

1081
00:45:02,740 --> 00:45:06,730
The statistician has a mixture
of experimental observation

1082
00:45:06,730 --> 00:45:10,210
and statistics at
their benefit, so we

1083
00:45:10,210 --> 00:45:13,360
can measure how hydrophobic
any molecule is.

1084
00:45:13,360 --> 00:45:17,190
We can take carbons and drop
them into non-polar solvents,

1085
00:45:17,190 --> 00:45:20,900
into polar solvents, and
determine what fraction of time

1086
00:45:20,900 --> 00:45:23,134
a molecule will spend
in a polar environment

1087
00:45:23,134 --> 00:45:25,050
versus a non-polar
environment, and from that,

1088
00:45:25,050 --> 00:45:28,350
get a free energy for
the transfer of any atom

1089
00:45:28,350 --> 00:45:32,750
from a hydrophobic environment
to a hydrophilic environment.

1090
00:45:32,750 --> 00:45:38,280
That can give us this delta
G Ref, shown over here.

1091
00:45:38,280 --> 00:45:42,460
OK, now, in a
protein, that molecule

1092
00:45:42,460 --> 00:45:46,087
is not fully solvent exposed
even when it's on the surface,

1093
00:45:46,087 --> 00:45:47,670
because water molecules
trying to come

1094
00:45:47,670 --> 00:45:49,420
at it from this direction
can't get to it,

1095
00:45:49,420 --> 00:45:51,240
from this direction
can't get to it.

1096
00:45:51,240 --> 00:45:54,230
So the transfer
energy for this carbon

1097
00:45:54,230 --> 00:45:57,130
to go from fully solvent
exposed to buried

1098
00:45:57,130 --> 00:46:01,140
is different from
the isolated carbon.

1099
00:46:01,140 --> 00:46:02,730
And so the statistician
says, OK, I'll

1100
00:46:02,730 --> 00:46:04,438
come up with a function
to describe that.

1101
00:46:04,438 --> 00:46:07,910
I will describe what
else is near this atom

1102
00:46:07,910 --> 00:46:09,486
in the rest of the
protein structure.

1103
00:46:09,486 --> 00:46:11,110
That's what the term
on the right does.

1104
00:46:11,110 --> 00:46:14,507
It's a sum over all
other neighboring atoms

1105
00:46:14,507 --> 00:46:16,590
and describes the volume
of the neighboring group.

1106
00:46:16,590 --> 00:46:18,830
Is the thing next to it
really big or really small?

1107
00:46:18,830 --> 00:46:20,030
Usually not described,
necessarily,

1108
00:46:20,030 --> 00:46:20,946
at the level of atoms.

1109
00:46:20,946 --> 00:46:22,510
It might be side
chains depending

1110
00:46:22,510 --> 00:46:24,180
on which program is doing it.

1111
00:46:24,180 --> 00:46:26,530
But I have some measure of
the volume of the neighbors.

1112
00:46:26,530 --> 00:46:28,488
If that volume is really
large, then this thing

1113
00:46:28,488 --> 00:46:30,420
is already in a
hydrophobic environment

1114
00:46:30,420 --> 00:46:32,150
even when it's sitting in
water because it's

1115
00:46:32,150 --> 00:46:34,130
surrounded by bulky things.

1116
00:46:34,130 --> 00:46:35,700
If the neighbors
are small, then it's

1117
00:46:35,700 --> 00:46:38,620
a more hydrophilic environment
when it's sitting in water,

1118
00:46:38,620 --> 00:46:40,870
and that's going to
modulate this free energy.
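A toy sketch of that function, assuming the general shape dG = dG_ref - sum over neighbors of f(r) * V, with f chosen here as dG_ref times a Gaussian falloff divided by a burial volume. The falloff length, the burial volume, and the clamp at full burial are all assumptions made for illustration, not parameters from any published model.

```python
import math


def solvation_energy(dg_ref, neighbors, decay=3.5, shell_volume=300.0):
    """Toy solvation term that shrinks dg_ref as neighbors crowd the atom.

    dg_ref       -- transfer free energy of the fully solvent-exposed atom
    neighbors    -- iterable of (distance, volume) pairs for nearby groups
    decay        -- hypothetical length scale (angstroms) for the falloff
    shell_volume -- hypothetical neighbor volume that fully buries the atom
    """
    buried = 0.0
    for distance, volume in neighbors:
        # Close, bulky neighbors exclude more solvent than distant, small ones.
        buried += volume * math.exp(-(distance / decay) ** 2)
    fraction = min(1.0, buried / shell_volume)  # clamp at fully buried
    return dg_ref * (1.0 - fraction)


if __name__ == "__main__":
    print(solvation_energy(2.0, []))                          # isolated atom
    print(solvation_energy(2.0, [(4.0, 20.0), (5.0, 30.0)]))  # partly buried
```

No water molecules appear anywhere in the calculation; only the positions and sizes of the neighboring groups are used.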

1119
00:46:43,600 --> 00:46:46,810
Is this function clear?

1120
00:46:46,810 --> 00:46:51,440
OK, so by combining this
observation from small molecule

1121
00:46:51,440 --> 00:46:54,257
transfer experiments
and these observations

1122
00:46:54,257 --> 00:46:55,840
based on the structure
of the protein,

1123
00:46:55,840 --> 00:46:58,600
we can get an approximation
for the hydrophobic effect.

1124
00:46:58,600 --> 00:47:00,310
How expensive is it
to have this piece

1125
00:47:00,310 --> 00:47:03,644
of the protein in solvent
versus in the hydrophobic core?

1126
00:47:03,644 --> 00:47:05,810
And again, we never had to
do any quantum mechanical

1127
00:47:05,810 --> 00:47:06,510
calculations.

1128
00:47:06,510 --> 00:47:08,354
We never had to actually
explicitly compute

1129
00:47:08,354 --> 00:47:10,270
the interaction of this
molecule with solvent.

1130
00:47:10,270 --> 00:47:13,164
We don't need any
water in the structure.

1131
00:47:13,164 --> 00:47:15,080
It's simply the geometry
of the protein that's

1132
00:47:15,080 --> 00:47:17,340
going to give us a good
approximation to the energy

1133
00:47:17,340 --> 00:47:18,820
function.

1134
00:47:18,820 --> 00:47:22,950
All right, so you can look
through all the details

1135
00:47:22,950 --> 00:47:26,200
of these online in the
Rosetta documentation

1136
00:47:26,200 --> 00:47:28,430
that we provided to get
a better sense of what

1137
00:47:28,430 --> 00:47:30,300
all these functions
are, but you can

1138
00:47:30,300 --> 00:47:32,480
see there are a lot of terms.

1139
00:47:32,480 --> 00:47:34,011
It's increasingly incremental.

1140
00:47:34,011 --> 00:47:35,760
You find something
wrong with your models.

1141
00:47:35,760 --> 00:47:37,650
You add a term to try
to account for that.

1142
00:47:37,650 --> 00:47:40,690
Again, not driven necessarily
by the physical forces.

1143
00:47:40,690 --> 00:47:43,470
OK, so what have we seen so far?

1144
00:47:43,470 --> 00:47:45,180
We've seen the
motivation for this unit,

1145
00:47:45,180 --> 00:47:46,710
to begin with
protein structures,

1146
00:47:46,710 --> 00:47:48,085
that the protein
structure really

1147
00:47:48,085 --> 00:47:50,700
helps us understand the
biological molecules that we're

1148
00:47:50,700 --> 00:47:51,755
looking at.

1149
00:47:51,755 --> 00:47:54,130
These structures are going to
influence our understanding

1150
00:47:54,130 --> 00:47:56,760
of all biology, so we need
to be good at predicting

1151
00:47:56,760 --> 00:47:58,830
these protein structures
or solving them

1152
00:47:58,830 --> 00:48:02,160
when we have experimental data.

1153
00:48:02,160 --> 00:48:03,862
The computational
methods that we're

1154
00:48:03,862 --> 00:48:05,320
going to use--
we're going to focus

1155
00:48:05,320 --> 00:48:07,652
on solving protein structures
de novo, predicting them,

1156
00:48:07,652 --> 00:48:09,110
but those same
techniques are going

1157
00:48:09,110 --> 00:48:10,740
to underlie the
methods that are used

1158
00:48:10,740 --> 00:48:13,264
to solve x-ray
crystallography and NMR.

1159
00:48:13,264 --> 00:48:15,430
And fundamentally then, we
have these two approaches

1160
00:48:15,430 --> 00:48:17,630
to describing the
potential energy.

1161
00:48:17,630 --> 00:48:21,310
That's the statistician and
the physicist's approach.

1162
00:48:21,310 --> 00:48:23,400
And remember, the
key simplifications

1163
00:48:23,400 --> 00:48:27,330
of the statistician are that
we use a fixed geometry.

1164
00:48:27,330 --> 00:48:28,960
We're not trying to
figure out the XYZ

1165
00:48:28,960 --> 00:48:30,230
coordinates of every atom.

1166
00:48:30,230 --> 00:48:33,800
We're simply trying to
figure out the bond angles.

1167
00:48:33,800 --> 00:48:35,440
We're going to use
rotamers, so we're

1168
00:48:35,440 --> 00:48:37,320
going to turn our
continuous choices often

1169
00:48:37,320 --> 00:48:38,190
into discrete ones.

1170
00:48:38,190 --> 00:48:40,190
And we're going to derive
statistical potentials

1171
00:48:40,190 --> 00:48:42,950
to represent the potential
energy, which may or may not

1172
00:48:42,950 --> 00:48:44,994
have a clear physical basis.

1173
00:48:44,994 --> 00:48:47,410
All right, so let's start with
a little thought experiment

1174
00:48:47,410 --> 00:48:49,200
as we try to get into some of
these prediction algorithms.

1175
00:48:49,200 --> 00:48:50,620
So I have a sequence.

1176
00:48:50,620 --> 00:48:53,700
It's about, I don't know,
100 amino acids long,

1177
00:48:53,700 --> 00:48:55,270
and here are two
protein structures.

1178
00:48:55,270 --> 00:48:56,980
One is predominantly
alpha helical.

1179
00:48:56,980 --> 00:48:58,620
One is predominantly beta sheet.

1180
00:48:58,620 --> 00:49:00,859
How could I tell-- this is
not a rhetorical question.

1181
00:49:00,859 --> 00:49:02,150
I want you to think for a second.

1182
00:49:02,150 --> 00:49:05,117
How could I tell whether
the sequence prefers

1183
00:49:05,117 --> 00:49:07,450
the structure on the top or
the structure on the bottom?

1184
00:49:11,550 --> 00:49:13,650
So we have, actually, a
lot of the tools in place.

1185
00:49:13,650 --> 00:49:14,530
Yes, in the back.

1186
00:49:14,530 --> 00:49:19,480
AUDIENCE: Can you, based on
previously known sequences,

1187
00:49:19,480 --> 00:49:22,945
know which sequence is
predominant in which

1188
00:49:22,945 --> 00:49:23,950
[INAUDIBLE]?

1189
00:49:23,950 --> 00:49:26,100
PROFESSOR: OK, so
the answer was we

1190
00:49:26,100 --> 00:49:28,395
could look at previously
known sequences.

1191
00:49:28,395 --> 00:49:30,270
We can look for homology,
and that's actually

1192
00:49:30,270 --> 00:49:31,770
going to be a very
powerful tool.

1193
00:49:31,770 --> 00:49:35,570
So if there is a homologue in
the database that is closely

1194
00:49:35,570 --> 00:49:39,390
related to this protein, and
it has a known structure,

1195
00:49:39,390 --> 00:49:41,810
then problem solved.

1196
00:49:41,810 --> 00:49:43,340
What if there isn't?

1197
00:49:43,340 --> 00:49:45,400
What's my next step?

1198
00:49:45,400 --> 00:49:46,315
Yes?

1199
00:49:46,315 --> 00:49:49,780
AUDIENCE: What if you
start with a description

1200
00:49:49,780 --> 00:49:55,720
of the secondary structure,
say the helices and the sheet,

1201
00:49:55,720 --> 00:50:00,175
and you counted how often a
particular amino acid showed up

1202
00:50:00,175 --> 00:50:02,650
in each of those structures?

1203
00:50:02,650 --> 00:50:05,850
Could you then compute
maybe a likelihood

1204
00:50:05,850 --> 00:50:07,702
across a stretch of amino acids?

1205
00:50:07,702 --> 00:50:08,410
PROFESSOR: Great.

1206
00:50:08,410 --> 00:50:10,430
So that answer was
what if I looked

1207
00:50:10,430 --> 00:50:12,880
at these alpha helices
and beta sheets

1208
00:50:12,880 --> 00:50:15,850
and computed how often
certain amino acids occur

1209
00:50:15,850 --> 00:50:18,050
in alpha helices
versus beta sheets,

1210
00:50:18,050 --> 00:50:20,932
and then I looked in
my protein sequence

1211
00:50:20,932 --> 00:50:23,140
and checked whether I have
the right amino acids that

1212
00:50:23,140 --> 00:50:25,348
are more favorable in
alpha helices or beta sheets.

1213
00:50:25,348 --> 00:50:28,220
And we'll see that's an approach
that's been used successfully.

1214
00:50:28,220 --> 00:50:30,420
That's secondary
structure prediction.

1215
00:50:30,420 --> 00:50:31,910
OK, other ideas.

1216
00:50:31,910 --> 00:50:33,300
Yep?

1217
00:50:33,300 --> 00:50:37,670
AUDIENCE: So if you have the
position of the 3D structure,

1218
00:50:37,670 --> 00:50:39,670
you can thread your sequence
through the structure

1219
00:50:39,670 --> 00:50:43,100
and then put it through
your energy function,

1220
00:50:43,100 --> 00:50:46,285
see which one is the
lower [INAUDIBLE].

1221
00:50:46,285 --> 00:50:47,160
PROFESSOR: Excellent.

1222
00:50:47,160 --> 00:50:51,280
So another thing I can do is, if
I have these two structures, I

1223
00:50:51,280 --> 00:50:53,450
have their precise
three-dimensional structures,

1224
00:50:53,450 --> 00:50:57,010
I could try to put my
sequence onto that structure,

1225
00:50:57,010 --> 00:50:59,590
actually put the right
side chains for my sequence

1226
00:50:59,590 --> 00:51:02,189
into that backbone conformation.

1227
00:51:02,189 --> 00:51:03,230
And then what would I do?

1228
00:51:03,230 --> 00:51:05,710
I would actually measure
the potential energy

1229
00:51:05,710 --> 00:51:09,170
of the protein in the top structure
and the potential energy

1230
00:51:09,170 --> 00:51:11,110
of the protein in
the bottom structure.

1231
00:51:11,110 --> 00:51:12,630
If the potential
energy is higher,

1232
00:51:12,630 --> 00:51:15,360
is that the favorable structure
or the unfavorable structure?

1233
00:51:15,360 --> 00:51:16,990
Favorable?

1234
00:51:16,990 --> 00:51:18,074
Unfavorable?

1235
00:51:18,074 --> 00:51:19,240
Right, it's the unfavorable.

1236
00:51:19,240 --> 00:51:21,460
So I want the lower
free energy structure.

1237
00:51:21,460 --> 00:51:25,480
OK, so let's think
about-- that's correct,

1238
00:51:25,480 --> 00:51:26,759
and that's where we're headed.

1239
00:51:26,759 --> 00:51:28,800
But what are going to be
some of the complexities

1240
00:51:28,800 --> 00:51:31,020
of that approach?

1241
00:51:31,020 --> 00:51:33,980
So first of all, what
about these side chains?

1242
00:51:33,980 --> 00:51:36,330
I have to now take a
backbone structure that

1243
00:51:36,330 --> 00:51:38,399
had some other amino
acid sequence on it,

1244
00:51:38,399 --> 00:51:40,190
and I have to put these
new side chains on.

1245
00:51:40,190 --> 00:51:41,140
Right?

1246
00:51:41,140 --> 00:51:45,030
If I put those on in the
wrong way-- let's say,

1247
00:51:45,030 --> 00:51:47,224
this is the true one--
let's say one of these

1248
00:51:47,224 --> 00:51:48,140
is the true structure.

1249
00:51:48,140 --> 00:51:49,730
Let's begin with
a simplification.

1250
00:51:49,730 --> 00:51:52,809
All right, so let's say your
fiendish labmate has actually

1251
00:51:52,809 --> 00:51:54,350
solved the structure
of your protein,

1252
00:51:54,350 --> 00:51:56,090
but refuses to tell you
what the answer is.

1253
00:51:56,090 --> 00:51:59,350
AUDIENCE: [LAUGHTER]

1254
00:51:59,350 --> 00:52:01,810
PROFESSOR: And she actually
has solved two structures,

1255
00:52:01,810 --> 00:52:03,500
neither one of which she's going
to give you the sequence to.

1256
00:52:03,500 --> 00:52:05,020
But she's giving you the
coordinates for both of them.

1257
00:52:05,020 --> 00:52:06,330
They're the same length.

1258
00:52:06,330 --> 00:52:09,090
And so she asks you,
ha, you took 791.

1259
00:52:09,090 --> 00:52:10,350
You can figure this out.

1260
00:52:10,350 --> 00:52:13,581
Tell me whether
your sequence is actually

1261
00:52:13,581 --> 00:52:15,080
in this structure
or that structure.

1262
00:52:15,080 --> 00:52:17,010
She says one of them
is exactly right.

1263
00:52:17,010 --> 00:52:18,510
You just don't know which one.

1264
00:52:18,510 --> 00:52:20,510
OK, so she gives you the
backbone coordinates,

1265
00:52:20,510 --> 00:52:21,080
so you go.

1266
00:52:21,080 --> 00:52:23,320
You put your amino
acid sequence, say,

1267
00:52:23,320 --> 00:52:28,070
with Swiss [? PDB. ?] You add to
the backbone all the right side

1268
00:52:28,070 --> 00:52:28,570
chains.

1269
00:52:28,570 --> 00:52:30,486
But now, you have to
make a bunch of decisions

1270
00:52:30,486 --> 00:52:32,990
for these side
chain conformations.

1271
00:52:32,990 --> 00:52:35,710
If you make the wrong
decision, what happens?

1272
00:52:35,710 --> 00:52:39,350
Well, you stick this atom close
to where some other atom is.

1273
00:52:39,350 --> 00:52:41,820
Now, you've got an
optimization problem, right?

1274
00:52:41,820 --> 00:52:43,410
You believe that one
of these backbone

1275
00:52:43,410 --> 00:52:44,826
coordinates is
correct, but you've

1276
00:52:44,826 --> 00:52:47,970
got a very highly coupled
optimization problem.

1277
00:52:47,970 --> 00:52:51,560
You need to figure out the right
rotations for every single side

1278
00:52:51,560 --> 00:52:54,040
chain on this protein, and
you can't do it one by one.

1279
00:52:54,040 --> 00:52:57,026
You can't take a greedy approach
because if I put this side

1280
00:52:57,026 --> 00:52:59,400
chain here, and I put this
side chain here, they collide,

1281
00:52:59,400 --> 00:53:01,230
but if this was wrong and
supposed to be over there,

1282
00:53:01,230 --> 00:53:02,740
then maybe this is the
right conformation.

1283
00:53:02,740 --> 00:53:04,640
So I have a coupled
problem, so it turns out

1284
00:53:04,640 --> 00:53:08,550
to be a computationally
expensive thing to compute.
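
The coupling described above can be sketched with a toy rotamer problem. This is a minimal illustration, not a real side chain packing method: the three residues, their two rotamers each, and every energy value are invented so that placing side chains one at a time gets stuck with clashes that a full search avoids.

```python
from itertools import product

# Hypothetical toy system: 3 residues, 2 rotamers each. All energies are invented.
SELF_ENERGY = [
    [0.0, 0.5],  # residue 0: rotamer 0 looks best in isolation
    [0.0, 3.0],  # residue 1
    [0.0, 3.0],  # residue 2
]

# Clash penalties keyed by ((residue_i, rotamer_i), (residue_j, rotamer_j)), i < j.
PAIR_ENERGY = {
    ((0, 0), (1, 0)): 2.0,
    ((0, 0), (2, 0)): 2.0,
}

def total_energy(rotamers):
    """Self energies plus pairwise clashes for the residues placed so far."""
    energy = sum(SELF_ENERGY[i][r] for i, r in enumerate(rotamers))
    for i in range(len(rotamers)):
        for j in range(i + 1, len(rotamers)):
            energy += PAIR_ENERGY.get(((i, rotamers[i]), (j, rotamers[j])), 0.0)
    return energy

def greedy_assignment():
    """Place residues one at a time, never revisiting an earlier choice."""
    chosen = []
    for options in SELF_ENERGY:
        best = min(range(len(options)), key=lambda r: total_energy(chosen + [r]))
        chosen.append(best)
    return tuple(chosen)

def exhaustive_assignment():
    """Try every combination; exponential in the number of residues."""
    choices = [range(len(options)) for options in SELF_ENERGY]
    return min(product(*choices), key=total_energy)

greedy = greedy_assignment()
optimal = exhaustive_assignment()
print("greedy ", greedy, total_energy(greedy))
print("optimal", optimal, total_energy(optimal))
```

The greedy pass commits residue 0 to the rotamer that looks best alone and then pays for two clashes, while the exhaustive search moves residue 0 aside at a small cost. The exhaustive search is only feasible here because the toy system is tiny.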

1285
00:53:08,550 --> 00:53:11,380
So we're going to look at
what to do if we know the backbone

1286
00:53:11,380 --> 00:53:14,080
conformation, but we don't know
the side chain conformation.

1287
00:53:14,080 --> 00:53:15,996
We can try to solve that
optimization problem,

1288
00:53:15,996 --> 00:53:18,190
and you'll actually do
that in your problem set.

1289
00:53:18,190 --> 00:53:20,000
Now, what if the
backbone conformation

1290
00:53:20,000 --> 00:53:21,570
isn't exactly correct?

1291
00:53:21,570 --> 00:53:23,470
So let's say you do what
was first suggested,

1292
00:53:23,470 --> 00:53:25,630
and you search the
sequence database.

1293
00:53:25,630 --> 00:53:28,660
You take this sequence, and
you find that it actually

1294
00:53:28,660 --> 00:53:31,360
has two homologs, two
things with sequence

1295
00:53:31,360 --> 00:53:32,210
similarity.

1296
00:53:32,210 --> 00:53:34,550
There are two proteins
with 20% sequence identity

1297
00:53:34,550 --> 00:53:37,210
that have completely
different structures.

1298
00:53:37,210 --> 00:53:38,770
This one has 20%
sequence identity,

1299
00:53:38,770 --> 00:53:40,960
and this one has 20%
sequence identity.

1300
00:53:40,960 --> 00:53:43,970
So you have no way of deciding
which one's which, right?

1301
00:53:43,970 --> 00:53:47,190
And neither one is going to be
the right protein structure.

1302
00:53:47,190 --> 00:53:49,786
So you know that by putting the
side chains onto these protein

1303
00:53:49,786 --> 00:53:52,410
structures, you do have to solve
those problems with side chain

1304
00:53:52,410 --> 00:53:54,430
optimization, but what,
obviously, is the other thing

1305
00:53:54,430 --> 00:53:56,221
that you're going to
need to have to solve?

1306
00:53:57,112 --> 00:53:59,320
All right, you're going to
need to solve the backbone

1307
00:53:59,320 --> 00:54:01,710
optimization problem, and
this becomes even more

1308
00:54:01,710 --> 00:54:05,040
coupled because when
I move this backbone,

1309
00:54:05,040 --> 00:54:06,860
then the side
chains move with it.

1310
00:54:06,860 --> 00:54:09,110
So now, I've got a very,
very complicated optimization

1311
00:54:09,110 --> 00:54:09,985
problem to deal with.

1312
00:54:09,985 --> 00:54:13,790
The search space is enormous,
and even if I discretize it,

1313
00:54:13,790 --> 00:54:15,560
it's still very, very large.

1314
00:54:15,560 --> 00:54:17,160
In fact, there's
something famous

1315
00:54:17,160 --> 00:54:18,530
called the Levinthal Paradox.

1316
00:54:18,530 --> 00:54:20,540
Of course, Cy
Levinthal, who was once

1317
00:54:20,540 --> 00:54:23,180
upon a time a professor here
and then moved to Columbia--

1318
00:54:23,180 --> 00:54:26,380
he did a back of the
envelope calculation

1319
00:54:26,380 --> 00:54:29,110
for extremely simple models
of protein structure.

1320
00:54:29,110 --> 00:54:31,560
If you imagine the proteins
were to randomly search over

1321
00:54:31,560 --> 00:54:34,710
all possible conformations
with very rapid switching

1322
00:54:34,710 --> 00:54:36,180
between possible
conformations, it

1323
00:54:36,180 --> 00:54:38,060
would take basically
the lifetime

1324
00:54:38,060 --> 00:54:41,240
of the universe for a
protein to ever fold.
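
A back-of-the-envelope version of this estimate can be written in a few lines. The chain length, the number of states per residue, and the sampling rate below are illustrative assumptions and are not figures quoted in the lecture; the point is only the order of magnitude.

```python
# Back-of-the-envelope Levinthal estimate.
# The three parameters below are assumed for illustration only.
N_RESIDUES = 100            # assumed chain length
STATES_PER_RESIDUE = 3      # assumed discrete conformations per residue
SAMPLES_PER_SECOND = 1e13   # assumed conformations tried per second

SECONDS_PER_YEAR = 3600 * 24 * 365.25
AGE_OF_UNIVERSE_YEARS = 1.38e10  # approximate

def levinthal_years(n_residues, states, rate):
    """Years needed to visit every conformation exactly once."""
    n_conformations = states ** n_residues  # exact Python integer
    return n_conformations / rate / SECONDS_PER_YEAR

years = levinthal_years(N_RESIDUES, STATES_PER_RESIDUE, SAMPLES_PER_SECOND)
print(f"{years:.2e} years")
print(f"{years / AGE_OF_UNIVERSE_YEARS:.2e} times the age of the universe")
```

Even with very fast switching, the exponential number of conformations dominates, which is why a random exhaustive search cannot be how proteins fold.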

1325
00:54:41,240 --> 00:54:43,030
So proteins don't
do random searches

1326
00:54:43,030 --> 00:54:45,590
over all possible conformations,
and they can check out

1327
00:54:45,590 --> 00:54:47,426
conformations
incredibly rapidly.

1328
00:54:47,426 --> 00:54:49,050
So we certainly can't
do that, so we'll

1329
00:54:49,050 --> 00:54:52,224
look at the
optimization techniques.

1330
00:54:52,224 --> 00:54:55,920
All right, so we discussed
how to use energy optimization

1331
00:54:55,920 --> 00:54:59,610
functions to try to decide
which one's correct,

1332
00:54:59,610 --> 00:55:03,107
and that even if the
structure is the correct one,

1333
00:55:03,107 --> 00:55:04,940
we have the side chain
optimization problem.

1334
00:55:04,940 --> 00:55:06,940
If the structure's the incorrect
one, we've got two problems.

1335
00:55:06,940 --> 00:55:08,570
We've got the backbone
conformation and the side

1336
00:55:08,570 --> 00:55:09,270
chain.

1337
00:55:09,270 --> 00:55:12,470
This is frequently called
fold recognition or threading.

1338
00:55:12,470 --> 00:55:14,460
This choice of, you've
got a protein structure.

1339
00:55:14,460 --> 00:55:16,780
You want to decide if
your sequence matches

1340
00:55:16,780 --> 00:55:18,120
this one or that one.

1341
00:55:18,120 --> 00:55:19,620
There are a couple
of other problems

1342
00:55:19,620 --> 00:55:21,150
that we're going to look at.

1343
00:55:21,150 --> 00:55:26,260
So this was already raised by
one of the students, the idea

1344
00:55:26,260 --> 00:55:28,292
that we try to predict
the secondary structure

1345
00:55:28,292 --> 00:55:30,500
of this protein, so we'll
look at secondary structure

1346
00:55:30,500 --> 00:55:31,860
prediction algorithms.

1347
00:55:31,860 --> 00:55:36,110
This was a very early area
of computational effort

1348
00:55:36,110 --> 00:55:38,220
in structural
biology, and we'll see

1349
00:55:38,220 --> 00:55:41,574
that the early methods
are remarkably good.

1350
00:55:41,574 --> 00:55:42,990
We can look for
domain structures,

1351
00:55:42,990 --> 00:55:44,540
and this is really
a sequence problem.

1352
00:55:44,540 --> 00:55:46,081
So we can look
through our sequences,

1353
00:55:46,081 --> 00:55:48,680
and rather than looking for
sequence identity or similarity

1354
00:55:48,680 --> 00:55:50,080
with known
structures, we can see

1355
00:55:50,080 --> 00:55:51,690
whether there are
certain patterns,

1356
00:55:51,690 --> 00:55:53,273
like the hidden
Markov models that you

1357
00:55:53,273 --> 00:55:54,980
looked at in a
previous lecture, that

1358
00:55:54,980 --> 00:55:58,660
can allow us to recognize the
domain structure of a protein

1359
00:55:58,660 --> 00:56:02,120
even without an identical
sequence in the database.

1360
00:56:02,120 --> 00:56:04,874
So we won't go over that
kind of analysis anymore,

1361
00:56:04,874 --> 00:56:06,290
and then we'll
spend a good amount

1362
00:56:06,290 --> 00:56:08,510
of time looking at ways of
solving novel structures.

1363
00:56:08,510 --> 00:56:10,810
So if you don't have a
fiendish friend who solved

1364
00:56:10,810 --> 00:56:13,110
your structure for you,
and there is no homologue

1365
00:56:13,110 --> 00:56:14,930
in the database,
all is not lost.

1366
00:56:14,930 --> 00:56:18,424
You actually can now predict
novel structures of proteins

1367
00:56:18,424 --> 00:56:19,465
simply from the sequence.

1368
00:56:22,085 --> 00:56:25,260
All right, so a little
history as to the prediction

1369
00:56:25,260 --> 00:56:26,320
of protein structure.

1370
00:56:26,320 --> 00:56:28,995
It really starts
with Linus Pauling,

1371
00:56:28,995 --> 00:56:32,010
who went on to win the
Nobel Prize for this work.

1372
00:56:32,010 --> 00:56:35,940
And this is in the era-- this
paper was published in 1951.

1373
00:56:35,940 --> 00:56:40,960
This was what computers
looked like in 1951,

1374
00:56:40,960 --> 00:56:42,960
and that thing probably
has a lot less computing

1375
00:56:42,960 --> 00:56:45,640
power than your iPhone
or your Android.

1376
00:56:45,640 --> 00:56:48,140
So Linus Pauling did
not solve the structure

1377
00:56:48,140 --> 00:56:51,040
of the alpha helix, predict
that alpha helices existed,

1378
00:56:51,040 --> 00:56:51,785
using computers.

1379
00:56:51,785 --> 00:56:54,680
He actually did it
entirely with paper models.

1380
00:56:54,680 --> 00:56:58,220
And in fact, he solved this--
he got the key insights

1381
00:56:58,220 --> 00:57:01,780
for the alpha helix when
he was lying sick in bed.

1382
00:57:01,780 --> 00:57:05,600
That's a very productive sick
leave, you might imagine.

1383
00:57:05,600 --> 00:57:08,840
He was using paper
models, but it

1384
00:57:08,840 --> 00:57:10,330
wasn't all done
while lying in bed.

1385
00:57:10,330 --> 00:57:12,460
So he and others,
the field as a whole,

1386
00:57:12,460 --> 00:57:15,520
had spent a lot of time
observing small molecule

1387
00:57:15,520 --> 00:57:17,490
distances, so they
had some idea what

1388
00:57:17,490 --> 00:57:18,929
to expect in protein structures.

1389
00:57:18,929 --> 00:57:20,970
They didn't know the
three-dimensional structure,

1390
00:57:20,970 --> 00:57:22,511
but they knew a lot
of the parameters

1391
00:57:22,511 --> 00:57:23,940
about how far apart things were.

1392
00:57:23,940 --> 00:57:25,542
And they also knew
that hydrogen bonds

1393
00:57:25,542 --> 00:57:27,500
were going to be extremely
favorable in protein

1394
00:57:27,500 --> 00:57:28,360
structures.

1395
00:57:28,360 --> 00:57:31,320
And so he looked for
a repeating structure

1396
00:57:31,320 --> 00:57:33,940
that would maximize the
number of hydrogen bonds

1397
00:57:33,940 --> 00:57:35,930
that occur within the
protein backbone chain.

1398
00:57:39,710 --> 00:57:43,610
And he knew, also, the
backbone-- that the amide bonds

1399
00:57:43,610 --> 00:57:44,860
would be planar and so on.

1400
00:57:44,860 --> 00:57:47,810
So there were a lot of
principles that underlay this,

1401
00:57:47,810 --> 00:57:50,290
but it was really
a tour de force

1402
00:57:50,290 --> 00:57:53,010
of just thinking
rather than computing.

1403
00:57:53,010 --> 00:57:54,990
Another really important
contribution early on

1404
00:57:54,990 --> 00:58:00,150
was made by Ramachandran,
who was at Madras University,

1405
00:58:00,150 --> 00:58:02,180
and his insight had
to do with the fact

1406
00:58:02,180 --> 00:58:03,940
that not all backbone
conformations were

1407
00:58:03,940 --> 00:58:04,690
equally favorable.

1408
00:58:04,690 --> 00:58:06,606
So remember, we have
these two rotatable bonds

1409
00:58:06,606 --> 00:58:07,510
in the backbone.

1410
00:58:07,510 --> 00:58:10,200
We have the phi angle
and the psi angle.

1411
00:58:10,200 --> 00:58:14,150
And this plot
shows that there'll

1412
00:58:14,150 --> 00:58:16,250
be certain conformations
of phi and psi angles

1413
00:58:16,250 --> 00:58:19,430
that are observed within
these dashed lines,

1414
00:58:19,430 --> 00:58:22,320
and then the other
conformations, which are almost

1415
00:58:22,320 --> 00:58:23,947
never observed.

1416
00:58:23,947 --> 00:58:25,280
Now, how did he figure that out?

1417
00:58:25,280 --> 00:58:27,240
Once again, it wasn't
with computation.

1418
00:58:27,240 --> 00:58:29,350
It was simply with paper
models and figuring out

1419
00:58:29,350 --> 00:58:32,677
what the distances would be, and
then carefully reasoning over

1420
00:58:32,677 --> 00:58:33,760
those possible structures.

1421
00:58:33,760 --> 00:58:36,620
So you can get very far in
this field, initially, back

1422
00:58:36,620 --> 00:58:39,450
then, by simple hard thought.

1423
00:58:39,450 --> 00:58:40,909
OK, so with these
two observations,

1424
00:58:40,909 --> 00:58:42,366
we knew that there
were going to be

1425
00:58:42,366 --> 00:58:44,260
certain kinds of regular
secondary structure

1426
00:58:44,260 --> 00:58:46,520
and that not all
backbone conformations

1427
00:58:46,520 --> 00:58:49,040
were equally favorable.

1428
00:58:49,040 --> 00:58:51,740
OK, but now, we want to advance to
actually predicting structures

1429
00:58:51,740 --> 00:58:54,670
of particular proteins, not just
saying that proteins in general

1430
00:58:54,670 --> 00:58:55,810
will contain alpha helices.

1431
00:58:55,810 --> 00:58:57,480
So how do we go
about doing that?

1432
00:58:57,480 --> 00:58:59,090
So the first
advances here were

1433
00:58:59,090 --> 00:59:01,990
trying to predict the
structure of alpha helices,

1434
00:59:01,990 --> 00:59:04,910
and this paper in
the 1960s introduced

1435
00:59:04,910 --> 00:59:07,770
the concept of a helical wheel.

1436
00:59:07,770 --> 00:59:09,870
Now, the idea here,
if you'll imagine

1437
00:59:09,870 --> 00:59:13,430
that this eraser
is an alpha helix,

1438
00:59:13,430 --> 00:59:16,820
I'm going to look down the
backbone of the alpha helix.

1439
00:59:16,820 --> 00:59:18,440
And I'll see that
the side chains

1440
00:59:18,440 --> 00:59:20,150
emerge at regular positions.

1441
00:59:20,150 --> 00:59:21,990
There's going to be a
100 degree rotation

1442
00:59:21,990 --> 00:59:25,680
between each sequential
residue in the backbone

1443
00:59:25,680 --> 00:59:27,200
as it goes around the helix.

1444
00:59:27,200 --> 00:59:30,580
It's going to be displaced
and rotated by 100 degrees,

1445
00:59:30,580 --> 00:59:32,510
and I could plot,
on a piece of paper,

1446
00:59:32,510 --> 00:59:36,749
the helical projection,
which is shown here.

1447
00:59:36,749 --> 00:59:38,040
So here's the first amino acid.

1448
00:59:38,040 --> 00:59:39,770
100 degrees later, the second.

1449
00:59:39,770 --> 00:59:41,440
100 degrees later, the third.

1450
00:59:41,440 --> 00:59:45,720
And I can ask whether the
residues on that backbone

1451
00:59:45,720 --> 00:59:48,250
have a sequence that
puts all the hydrophobics

1452
00:59:48,250 --> 00:59:52,410
and hydrophilics on the
same side, as in this case,

1453
00:59:52,410 --> 00:59:53,784
or on different sides.

1454
00:59:53,784 --> 00:59:55,200
Now, what difference
does it make?

1455
00:59:55,200 --> 00:59:56,741
Well, if I have an
alpha helix that's

1456
00:59:56,741 --> 01:00:00,430
lying on the surface
of a protein,

1457
01:00:00,430 --> 01:00:03,540
this could have one side
that's solvent exposed and one

1458
01:00:03,540 --> 01:00:05,350
side that's protected.

1459
01:00:05,350 --> 01:00:08,040
So we would expect that some
of these alpha helices lying

1460
01:00:08,040 --> 01:00:10,100
on the surface would
be amphipathic.

1461
01:00:10,100 --> 01:00:13,440
Half of them would be
hydrophobic,

1462
01:00:13,440 --> 01:00:15,220
and half of them
would be hydrophilic.

1463
01:00:15,220 --> 01:00:17,340
And purely, as someone
suggested from the pattern

1464
01:00:17,340 --> 01:00:20,220
of the amino acids, and here the
hydrophobicity of the pattern

1465
01:00:20,220 --> 01:00:23,330
of the amino acids, we could
make reasonable predictions

1466
01:00:23,330 --> 01:00:27,000
of whether this protein forms a
particular kind of alpha helix,

1467
01:00:27,000 --> 01:00:29,080
an amphipathic alpha helix.

1468
01:00:29,080 --> 01:00:32,334
Now, is that going to help
us for all alpha helices?

1469
01:00:32,334 --> 01:00:34,500
Obviously not, because I
can have alpha helices that

1470
01:00:34,500 --> 01:00:36,440
are totally solvent
exposed, and I

1471
01:00:36,440 --> 01:00:39,575
can have alpha helices
that are totally protected.

1472
01:00:39,575 --> 01:00:41,950
So this pattern will occur in
some alpha helices, but not

1473
01:00:41,950 --> 01:00:42,720
all.

1474
01:00:42,720 --> 01:00:44,520
So another idea
that was raised here

1475
01:00:44,520 --> 01:00:47,151
and was used early
on with great success

1476
01:00:47,151 --> 01:00:49,400
was to actually figure out
whether certain amino acids

1477
01:00:49,400 --> 01:00:51,860
have a particular alpha
helical propensity.

1478
01:00:51,860 --> 01:00:54,270
Do they occur more
frequently in alpha helices?

1479
01:00:54,270 --> 01:00:55,340
At the time, it was
also thought maybe

1480
01:00:55,340 --> 01:00:57,131
you could find propensities
for beta sheets

1481
01:00:57,131 --> 01:00:58,260
and other structures.

1482
01:00:58,260 --> 01:01:02,600
So compute the statistics
for every amino acid,

1483
01:01:02,600 --> 01:01:04,450
shown as a row here.

1484
01:01:04,450 --> 01:01:06,160
How often is it observed
in the database?

1485
01:01:06,160 --> 01:01:09,150
How often does it
occur in an alpha helix?

1486
01:01:09,150 --> 01:01:11,450
And how often does it occur
in a beta sheet or in a coil?

1487
01:01:11,450 --> 01:01:14,110
And from these, then, we
would compute probabilities

1488
01:01:14,110 --> 01:01:17,510
and use, perhaps,
Bayesian statistics to compute

1489
01:01:17,510 --> 01:01:20,524
the posterior
expectation for having

1490
01:01:20,524 --> 01:01:21,940
a certain sequence
in an alpha helix.

1491
01:01:21,940 --> 01:01:25,130
They didn't quite use
Bayesian statistics here.

1492
01:01:25,130 --> 01:01:28,254
They came up with a
rather ad hoc approach,

1493
01:01:28,254 --> 01:01:29,670
and when you read
it in hindsight,

1494
01:01:29,670 --> 01:01:30,930
it seems kind of crazy.

1495
01:01:30,930 --> 01:01:31,850
But actually, you
have to remember

1496
01:01:31,850 --> 01:01:32,940
when this was being done.

1497
01:01:32,940 --> 01:01:36,330
This is being done before a
big influx of mathematicians

1498
01:01:36,330 --> 01:01:38,120
into structural biology.

1499
01:01:38,120 --> 01:01:41,849
This is 1974, and they used
more physical reasoning.

1500
01:01:41,849 --> 01:01:43,640
They knew something
about how alpha helices

1501
01:01:43,640 --> 01:01:44,960
formed from chemistry.

1502
01:01:44,960 --> 01:01:46,390
They knew that,
typically, there's a

1503
01:01:46,390 --> 01:01:49,610
nucleation event, where a small
piece of helix forms initially,

1504
01:01:49,610 --> 01:01:51,257
and then that extends.

1505
01:01:51,257 --> 01:01:53,090
They knew that there
were these propensities

1506
01:01:53,090 --> 01:01:55,280
for certain amino acids
to form alpha helices,

1507
01:01:55,280 --> 01:01:58,820
and other amino acids, which
tended to break the helix.

1508
01:01:58,820 --> 01:02:01,080
And they came up with
an ad hoc algorithm

1509
01:02:01,080 --> 01:02:02,980
that counted how often
you had strong helix

1510
01:02:02,980 --> 01:02:05,300
formers, how often you had breakers.
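
The counting step can be sketched as a propensity calculation: how often a residue appears in helices relative to how often any residue does. This is only an illustration of the statistics described above, not the published algorithm. The counts, the threshold of 1.0 for calling a residue a former or a breaker, and the window average are all hypothetical.

```python
# Hypothetical counts per amino acid: (total occurrences, occurrences in helices).
COUNTS = {
    "A": (200, 110),
    "E": (150, 90),
    "G": (180, 40),
    "P": (100, 15),
    "K": (170, 75),
}

def helix_propensities(counts):
    """Fraction of each residue found in helices, divided by the overall fraction."""
    total = sum(n for n, _ in counts.values())
    in_helix = sum(h for _, h in counts.values())
    background = in_helix / total
    return {aa: (h / n) / background for aa, (n, h) in counts.items()}

def window_scores(sequence, propensities, width=4):
    """Mean propensity in each sliding window (width is an assumed parameter)."""
    scores = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        scores.append(sum(propensities[aa] for aa in window) / width)
    return scores

props = helix_propensities(COUNTS)
for aa, p in sorted(props.items(), key=lambda item: -item[1]):
    label = "former" if p > 1.0 else "breaker"  # assumed cutoff
    print(f"{aa}: {p:.2f} ({label})")

print([round(s, 2) for s in window_scores("AEEAKGPG", props)])
```

A propensity above 1 means the residue is enriched in helices relative to the background, and windows with a high average would be candidate nucleation sites in this simplified picture.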

1511
01:02:05,300 --> 01:02:08,180
You can see all the
details in the references.

1512
01:02:08,180 --> 01:02:11,700
The amazing thing is, with this
very ad hoc thing and a very,

1513
01:02:11,700 --> 01:02:13,464
very small database
of protein structures,

1514
01:02:13,464 --> 01:02:15,380
you could look at the
total number of residues

1515
01:02:15,380 --> 01:02:17,380
that they're looking at
over all the structures,

1516
01:02:17,380 --> 01:02:20,320
there's 2,473, and that's
residues, not structures.

1517
01:02:20,320 --> 01:02:22,680
And now, we have
many, many more times

1518
01:02:22,680 --> 01:02:25,160
than that of
structures of proteins.

1519
01:02:25,160 --> 01:02:27,820
Even with that,
in 1974, they were

1520
01:02:27,820 --> 01:02:32,420
able to achieve 60%
accuracy in predicting

1521
01:02:32,420 --> 01:02:34,080
the secondary
structure of proteins,

1522
01:02:34,080 --> 01:02:36,320
so it's really an
astounding accomplishment.

1523
01:02:36,320 --> 01:02:38,570
And to put that in
perspective, there

1524
01:02:38,570 --> 01:02:41,750
was an evaluation of a whole
bunch of secondary structure

1525
01:02:41,750 --> 01:02:44,220
prediction algorithms
done about a decade ago,

1526
01:02:44,220 --> 01:02:46,740
and things haven't changed
that much since then, where

1527
01:02:46,740 --> 01:02:50,820
between 1974 and
2003, almost 30 years,

1528
01:02:50,820 --> 01:02:55,200
they went from 60%
accuracy to 76% accuracy.

1529
01:02:55,200 --> 01:02:57,905
OK, well, that's not
bad, but it's not a lot

1530
01:02:57,905 --> 01:02:59,530
for-- you'd expect
maybe over 30 years,

1531
01:02:59,530 --> 01:03:00,200
you could do a lot better.

1532
01:03:00,200 --> 01:03:02,110
So the simple approach
really captured

1533
01:03:02,110 --> 01:03:06,696
the fundamentals of predicting
secondary structure.

1534
01:03:06,696 --> 01:03:08,570
There's a lot of work
that's been done since,

1535
01:03:08,570 --> 01:03:10,361
and I encourage you to
look in the textbook

1536
01:03:10,361 --> 01:03:13,020
if you're interested, to look
at all the newer algorithms that

1537
01:03:13,020 --> 01:03:15,270
have tried to solve the
secondary structure prediction

1538
01:03:15,270 --> 01:03:17,310
problem.

1539
01:03:17,310 --> 01:03:19,204
OK.

1540
01:03:19,204 --> 01:03:21,370
All right, so secondary
structure prediction, then--

1541
01:03:21,370 --> 01:03:23,536
you can look in the textbook
for the modern methods,

1542
01:03:23,536 --> 01:03:26,050
but the fundamental ideas were
laid down by Chou and Fasman

1543
01:03:26,050 --> 01:03:28,500
in the 1974 paper.

1544
01:03:28,500 --> 01:03:31,420
We've already said that looking
at the kinds of approaches

1545
01:03:31,420 --> 01:03:33,240
that we discussed
earlier in the course

1546
01:03:33,240 --> 01:03:34,984
can help you solve
domain structures.

1547
01:03:34,984 --> 01:03:37,150
I would like to focus on,
at the end of this lecture

1548
01:03:37,150 --> 01:03:39,733
and the beginning of the next
lecture, how to actually

1549
01:03:39,733 --> 01:03:43,457
solve novel structures from
purely amino acid sequence,

1550
01:03:43,457 --> 01:03:45,040
and we're going to
go back to the idea

1551
01:03:45,040 --> 01:03:46,790
that there is a potential
energy function.

1552
01:03:46,790 --> 01:03:50,390
We now have both the CHARMM
approach and the Rosetta

1553
01:03:50,390 --> 01:03:53,930
approach to protein
structure, and so there

1554
01:03:53,930 --> 01:03:55,450
is some protein
folding landscape.

1555
01:03:55,450 --> 01:03:56,701
There's an energy function.

1556
01:03:56,701 --> 01:03:58,200
If you have different
conformations,

1557
01:03:58,200 --> 01:04:00,580
you'll be at different
positions in the landscape,

1558
01:04:00,580 --> 01:04:02,710
and we'd like to
figure out how to go

1559
01:04:02,710 --> 01:04:06,310
from some starting conformation
that may be arbitrary and find

1560
01:04:06,310 --> 01:04:08,080
our way to the minimum
energy structure.

1561
01:04:14,489 --> 01:04:18,620
All right, so there are going
to be three fundamental things

1562
01:04:18,620 --> 01:04:21,210
that we'll talk about
in the next lecture.

1563
01:04:21,210 --> 01:04:23,840
We're going to talk about
energy minimization,

1564
01:04:23,840 --> 01:04:26,300
how to use these potential
energy functions that we

1565
01:04:26,300 --> 01:04:29,250
started off with to go
from approximate structures

1566
01:04:29,250 --> 01:04:30,640
to the refined structure.

1567
01:04:30,640 --> 01:04:32,380
That's the thought
problem I gave you.

1568
01:04:32,380 --> 01:04:35,120
You have the structure, but
you have the wrong side chains.

1569
01:04:35,120 --> 01:04:36,470
Could you minimize them?

1570
01:04:36,470 --> 01:04:38,280
And so that's making
small changes.

1571
01:04:38,280 --> 01:04:39,960
We'll discuss
molecular dynamics,

1572
01:04:39,960 --> 01:04:44,552
which actually tries to simulate
all the forces on a protein

1573
01:04:44,552 --> 01:04:46,510
and to actually carry
out a physical simulation

1574
01:04:46,510 --> 01:04:47,279
of the process.

1575
01:04:47,279 --> 01:04:48,820
That's the CHARMM
approach, and we'll

1576
01:04:48,820 --> 01:04:51,589
see some interesting
variants on that.

1577
01:04:51,589 --> 01:04:53,630
And then we'll look at
simulated annealing, which

1578
01:04:53,630 --> 01:04:56,005
is an optimization technique
that's actually quite broad,

1579
01:04:56,005 --> 01:04:57,490
but can be applied
here, to search

1580
01:04:57,490 --> 01:04:59,770
over large, large
conformational spaces,

1581
01:04:59,770 --> 01:05:01,860
much further than a protein
would actually evolve

1582
01:05:01,860 --> 01:05:03,560
in a molecular dynamics
simulation that's

1583
01:05:03,560 --> 01:05:04,855
simulating protein function.

1584
01:05:04,855 --> 01:05:07,230
You allow the protein, now,
to jump between conformations

1585
01:05:07,230 --> 01:05:09,790
that have no real
potential to transfer

1586
01:05:09,790 --> 01:05:14,220
between at normal room
temperature in water,

1587
01:05:14,220 --> 01:05:16,580
but can be done, obviously,
easily in the computer.

1588
01:05:16,580 --> 01:05:17,480
So I'll stop here.

1589
01:05:17,480 --> 01:05:20,010
Any questions before we close?