1 00:00:00,060 --> 00:00:01,780 The following content is provided 2 00:00:01,780 --> 00:00:04,019 under a Creative Commons license. 3 00:00:04,019 --> 00:00:06,870 Your support will help MIT OpenCourseWare continue 4 00:00:06,870 --> 00:00:10,730 to offer high quality educational resources for free. 5 00:00:10,730 --> 00:00:13,340 To make a donation or view additional materials 6 00:00:13,340 --> 00:00:17,217 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:17,217 --> 00:00:17,842 at ocw.mit.edu. 8 00:00:25,752 --> 00:00:26,960 PROFESSOR: I'm Ernst Frankel. 9 00:00:26,960 --> 00:00:29,430 I'll be teaching the next two lectures. 10 00:00:29,430 --> 00:00:32,030 I'd like to encourage you to contact me outside of class 11 00:00:32,030 --> 00:00:33,990 if you have any questions, if you want to meet. 12 00:00:33,990 --> 00:00:36,045 And also, please, during class, ask questions. 13 00:00:36,045 --> 00:00:38,420 It's a somewhat impersonal setting with the video cameras 14 00:00:38,420 --> 00:00:42,270 and the amphitheater, but hopefully we can overcome that. 15 00:00:42,270 --> 00:00:46,470 This unit is going to focus on moving across scales 16 00:00:46,470 --> 00:00:48,392 in computational biology, looking 17 00:00:48,392 --> 00:00:49,850 from computational issues that deal 18 00:00:49,850 --> 00:00:53,260 with the fundamentals of protein structure at the atomic level 19 00:00:53,260 --> 00:00:56,695 to the level of protein-protein interactions between pairs 20 00:00:56,695 --> 00:00:59,760 of molecules, protein-DNA interactions 21 00:00:59,760 --> 00:01:01,630 and small molecules, and then ultimately 22 00:01:01,630 --> 00:01:02,700 into protein networks. 23 00:01:02,700 --> 00:01:04,640 So we've got a lot of ground to cover, 24 00:01:04,640 --> 00:01:06,460 but I think we'll be able to do it. 
25 00:01:06,460 --> 00:01:09,410 As you've seen in the syllabus, the first couple of lectures 26 00:01:09,410 --> 00:01:12,190 are really a detailed look at protein structure, 27 00:01:12,190 --> 00:01:15,230 molecular level analysis, and then we'll 28 00:01:15,230 --> 00:01:17,540 move into some of these other levels of higher 29 00:01:17,540 --> 00:01:20,910 order, including protein-DNA interactions and gene 30 00:01:20,910 --> 00:01:21,790 regulatory networks. 31 00:01:24,355 --> 00:01:26,730 I think many of you are probably familiar with this quote, 32 00:01:26,730 --> 00:01:28,910 that "nothing in biology makes sense 33 00:01:28,910 --> 00:01:30,770 except in the light of evolution." 34 00:01:30,770 --> 00:01:34,310 And I'd like to offer a modified version of that, which 35 00:01:34,310 --> 00:01:38,050 is little in biology makes sense except in light of structure, 36 00:01:38,050 --> 00:01:40,887 protein structure, DNA structure. 37 00:01:40,887 --> 00:01:43,470 We've, of course, seen this very early on in molecular biology 38 00:01:43,470 --> 00:01:46,410 when the structure of DNA was solved, and immediately became 39 00:01:46,410 --> 00:01:49,870 clear why it was the basis for heredity. 40 00:01:49,870 --> 00:01:54,200 But protein structures have had an even more lasting impact time 41 00:01:54,200 --> 00:01:56,957 and time again, many, many more events, 42 00:01:56,957 --> 00:01:59,040 which have really revolutionized the understanding 43 00:01:59,040 --> 00:02:01,020 of particular biological problems. 44 00:02:01,020 --> 00:02:04,050 So one example that was stunning at the time 45 00:02:04,050 --> 00:02:06,520 had to do with the most frequently mutated protein 46 00:02:06,520 --> 00:02:07,300 in cancer. 47 00:02:07,300 --> 00:02:09,070 This is the p53 gene. 
48 00:02:09,070 --> 00:02:12,870 It's mutated in about half of all cancers, 49 00:02:12,870 --> 00:02:14,435 and what was observed early on-- this 50 00:02:14,435 --> 00:02:16,143 was in the days before genomic sequencing 51 00:02:16,143 --> 00:02:18,310 when it was actually very expensive and hard 52 00:02:18,310 --> 00:02:21,860 to identify mutations in tumors. 53 00:02:21,860 --> 00:02:23,900 So they focused on this particular gene, 54 00:02:23,900 --> 00:02:25,960 and they observed that the mutations clustered. 55 00:02:25,960 --> 00:02:28,835 So this is the structure of the gene from the n-terminus-- 56 00:02:28,835 --> 00:02:30,960 the protein from the n-terminus and the c-terminus, 57 00:02:30,960 --> 00:02:33,081 and the bars indicate the frequency of mutations. 58 00:02:33,081 --> 00:02:35,330 And you can see that they're all clustered pretty much 59 00:02:35,330 --> 00:02:37,910 in the center of this molecule. 60 00:02:37,910 --> 00:02:38,680 Now, why is that? 61 00:02:38,680 --> 00:02:41,380 It was enigmatic until the structure was solved here 62 00:02:41,380 --> 00:02:44,217 at MIT by Carl Pabo and his post-doc 63 00:02:44,217 --> 00:02:46,950 at the time, Nikola Pavletich, and they showed, actually, 64 00:02:46,950 --> 00:02:49,320 that these correspond to critical domains. 65 00:02:49,320 --> 00:02:50,870 And in a second paper, they actually 66 00:02:50,870 --> 00:02:55,400 showed why the mutations occur in those particular locations. 67 00:02:55,400 --> 00:02:57,590 So if you look at the plot on the upper left, 68 00:02:57,590 --> 00:03:00,520 here's the protein sequence; above it, 69 00:03:00,520 --> 00:03:02,300 the frequency of mutations; below it, 70 00:03:02,300 --> 00:03:04,990 the secondary structure elements. 
71 00:03:04,990 --> 00:03:06,720 And you'll see that mutations occur 72 00:03:06,720 --> 00:03:10,010 in regions that don't have any regular secondary structure 73 00:03:10,010 --> 00:03:12,570 and can occur frequently in regions 74 00:03:12,570 --> 00:03:15,081 with secondary structure or not at all 75 00:03:15,081 --> 00:03:16,580 in regions with secondary structure. 76 00:03:16,580 --> 00:03:19,038 So the mere fact that there's a secondary structure element 77 00:03:19,038 --> 00:03:21,314 does not define why there're mutations. 78 00:03:21,314 --> 00:03:22,980 But when the three-dimensional structure 79 00:03:22,980 --> 00:03:24,817 was solved in the complex with DNA, 80 00:03:24,817 --> 00:03:26,650 over here on the right-- this is the protein 81 00:03:26,650 --> 00:03:29,747 structure on the left, the DNA structure on the right, 82 00:03:29,747 --> 00:03:32,080 and in yellow are some of these highly mutated residues. 83 00:03:32,080 --> 00:03:35,260 It turns out that all of the frequently mutated residues 84 00:03:35,260 --> 00:03:38,390 are ones that occur at the protein-DNA interface. 85 00:03:38,390 --> 00:03:39,950 All right, so in a single picture, 86 00:03:39,950 --> 00:03:42,020 we now understand what was an enigma 87 00:03:42,020 --> 00:03:43,390 for years and years and years. 88 00:03:43,390 --> 00:03:45,760 Why are the mutations so particularly clustered 89 00:03:45,760 --> 00:03:47,532 in this protein in non-obvious ways? 90 00:03:47,532 --> 00:03:49,490 Since that is the interface between the protein 91 00:03:49,490 --> 00:03:51,950 and the DNA, these mutations upset 92 00:03:51,950 --> 00:03:57,700 the transcriptional regulation through the action of p53. 93 00:03:57,700 --> 00:04:00,304 So if we want to understand protein structure in order 94 00:04:00,304 --> 00:04:01,845 to understand protein function, where 95 00:04:01,845 --> 00:04:03,860 are we going to get these structures from? 
96 00:04:03,860 --> 00:04:08,710 So the statistics on how protein structures are determined-- I 97 00:04:08,710 --> 00:04:09,400 show here. 98 00:04:09,400 --> 00:04:12,510 This is from the-- I'll call it the PDB, the Protein Data Bank. 99 00:04:12,510 --> 00:04:15,130 Its full name is the RCSB Protein Data Bank, 100 00:04:15,130 --> 00:04:16,685 but it's usually just called the PDB. 101 00:04:16,685 --> 00:04:18,810 And here, it shows that, at the time of this slide, 102 00:04:18,810 --> 00:04:20,560 around 80,000 structures have been 103 00:04:20,560 --> 00:04:23,460 determined by x-ray crystallography. 104 00:04:23,460 --> 00:04:26,410 The next most frequent method was NMR, Nuclear Magnetic 105 00:04:26,410 --> 00:04:29,230 Resonance, which identified about 10,000 structures, 106 00:04:29,230 --> 00:04:31,300 and all the other techniques produce 107 00:04:31,300 --> 00:04:33,690 very, very few structures, hundreds 108 00:04:33,690 --> 00:04:36,010 of structures rather than thousands. 109 00:04:36,010 --> 00:04:37,400 So how do these techniques work? 110 00:04:37,400 --> 00:04:39,400 Well, they don't magically give you a structure. 111 00:04:39,400 --> 00:04:39,900 Right? 112 00:04:39,900 --> 00:04:42,483 They give you information that you have to use computationally 113 00:04:42,483 --> 00:04:44,490 to derive the structure. 114 00:04:44,490 --> 00:04:46,880 Here's a schematic of how structures 115 00:04:46,880 --> 00:04:48,750 are solved by x-ray crystallography. 116 00:04:48,750 --> 00:04:50,970 One has to actually grow a crystal of the protein 117 00:04:50,970 --> 00:04:52,872 or the protein and other molecules 118 00:04:52,872 --> 00:04:54,330 that you're interested in studying. 119 00:04:54,330 --> 00:04:57,190 These are not giant crystals like quartz. 120 00:04:57,190 --> 00:04:58,950 They're even smaller than table salt. 121 00:04:58,950 --> 00:05:01,950 They're usually barely visible with the naked eye, 122 00:05:01,950 --> 00:05:03,095 and they're very unstable. 
123 00:05:03,095 --> 00:05:07,150 They have to be kept in solution or, often, frozen, 124 00:05:07,150 --> 00:05:08,910 and you shoot a very high-powered x-ray 125 00:05:08,910 --> 00:05:09,870 beam through them. 126 00:05:09,870 --> 00:05:12,204 Now, most of the x-rays are-- what are they going to do? 127 00:05:12,204 --> 00:05:14,661 They're going to pass right through because x-rays interact 128 00:05:14,661 --> 00:05:15,770 very weakly with matter. 129 00:05:15,770 --> 00:05:17,561 But a few of the x-rays will be diffracted, 130 00:05:17,561 --> 00:05:19,390 and from that weak diffraction pattern, 131 00:05:19,390 --> 00:05:22,350 you can actually deduce where the electrons were 132 00:05:22,350 --> 00:05:27,560 that scattered the x-rays as they hit the crystal. 133 00:05:27,560 --> 00:05:30,800 And so this is a picture, in the lower right, 134 00:05:30,800 --> 00:05:34,620 of an electron density cloud in light blue with the protein 135 00:05:34,620 --> 00:05:37,040 structure snaking through it, and what 136 00:05:37,040 --> 00:05:39,530 you can calculate, after a lot of work, 137 00:05:39,530 --> 00:05:41,450 from these crystallographic diffraction 138 00:05:41,450 --> 00:05:44,441 patterns is the location of the electron density. 139 00:05:44,441 --> 00:05:46,190 And then there's a computational challenge 140 00:05:46,190 --> 00:05:48,810 to try to figure out the location of the atoms that 141 00:05:48,810 --> 00:05:51,360 would have given rise to that electron density 142 00:05:51,360 --> 00:05:54,050 that then, when hit with x-rays, would have given rise 143 00:05:54,050 --> 00:05:55,900 to the x-ray diffraction pattern. 
144 00:05:55,900 --> 00:05:59,170 So it's actually an iterative process 145 00:05:59,170 --> 00:06:03,494 where one arrives at an initial structure and then calculates, 146 00:06:03,494 --> 00:06:05,410 from that structure, where the electrons would 147 00:06:05,410 --> 00:06:07,330 be, from the position of electrons 148 00:06:07,330 --> 00:06:09,920 where the diffraction pattern would be when the x-rays hit 149 00:06:09,920 --> 00:06:14,690 it, and determines how well that predicted diffraction 150 00:06:14,690 --> 00:06:17,140 pattern agrees with the actual diffraction pattern, 151 00:06:17,140 --> 00:06:18,739 and then continuously iterates. 152 00:06:18,739 --> 00:06:21,030 And so this is obviously a highly computational problem 153 00:06:21,030 --> 00:06:22,500 because you not only have to find 154 00:06:22,500 --> 00:06:25,540 positions that are maximally consistent with the observed 155 00:06:25,540 --> 00:06:28,020 diffraction pattern, but also positions that are actually 156 00:06:28,020 --> 00:06:29,620 consistent with physics. 157 00:06:29,620 --> 00:06:32,270 So if we have a piece of a molecule here, 158 00:06:32,270 --> 00:06:34,190 we can't just put our atoms anywhere. 159 00:06:34,190 --> 00:06:37,370 They need to be positioned with well defined distances 160 00:06:37,370 --> 00:06:40,300 for the bonds, the bond angles, and so on. 161 00:06:40,300 --> 00:06:42,980 So it's a highly coupled problem that we have to solve, 162 00:06:42,980 --> 00:06:45,897 and we'll look at some of the techniques that underlie 163 00:06:45,897 --> 00:06:47,980 these approaches, although we won't look specifically 164 00:06:47,980 --> 00:06:50,470 at how to solve x-ray crystal structures. 
165 00:06:50,470 --> 00:06:52,450 I mentioned the second most common technique 166 00:06:52,450 --> 00:06:54,810 is nuclear magnetic resonance, and this 167 00:06:54,810 --> 00:06:57,860 is a technology that does not require the crystals, 168 00:06:57,860 --> 00:07:00,180 but requires a very high concentration 169 00:07:00,180 --> 00:07:03,320 of soluble protein, which presents its own problems. 170 00:07:03,320 --> 00:07:05,610 And the information that you get out 171 00:07:05,610 --> 00:07:07,500 of a nuclear magnetic resonance structure 172 00:07:07,500 --> 00:07:09,380 is not the electron density locations, 173 00:07:09,380 --> 00:07:10,900 but it's actually a set of distances 174 00:07:10,900 --> 00:07:13,420 that tell you the relative distance between two 175 00:07:13,420 --> 00:07:15,510 atoms, usually protons, in the structure, 176 00:07:15,510 --> 00:07:18,440 and that's what's represented by these yellow lines here. 177 00:07:18,440 --> 00:07:20,830 And once again, we've got a hard computational problem 178 00:07:20,830 --> 00:07:23,910 where we need to figure out a structure of the protein that's 179 00:07:23,910 --> 00:07:25,540 consistent with all the physical forces 180 00:07:25,540 --> 00:07:29,910 and also puts particular protons at particular distances 181 00:07:29,910 --> 00:07:31,460 from each other. 182 00:07:31,460 --> 00:07:33,350 So we talk about solving crystal structures, 183 00:07:33,350 --> 00:07:34,980 solving NMR structures, because it 184 00:07:34,980 --> 00:07:37,300 is the solution to a very, very complicated 185 00:07:37,300 --> 00:07:39,190 computational challenge. 186 00:07:39,190 --> 00:07:40,690 So these techniques that we're going 187 00:07:40,690 --> 00:07:42,300 to look at, while not specifically 188 00:07:42,300 --> 00:07:47,040 for the solution of crystal and NMR structures, 189 00:07:47,040 --> 00:07:48,520 underlie those technologies. 
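Editorial sketch: the NMR description above amounts to scoring a candidate structure by how badly it breaks a list of proton-proton distance restraints. The Python below is a minimal illustration of that idea only. The function name `restraint_violation`, the atom labels, the coordinates, and the use of upper distance bounds with a squared penalty are all assumptions made for this example, not details given in the lecture.

```python
import math

def restraint_violation(coords, restraints):
    """Sum of squared amounts by which atom pairs exceed their upper distance bound.

    coords: dict mapping a (hypothetical) atom label to an (x, y, z) tuple.
    restraints: list of (label_i, label_j, upper_bound) tuples, in the same
    length unit as the coordinates.
    """
    total = 0.0
    for label_i, label_j, upper in restraints:
        distance = math.dist(coords[label_i], coords[label_j])
        excess = max(0.0, distance - upper)  # satisfied restraints cost nothing
        total += excess ** 2
    return total

# Hypothetical proton positions and restraints, for illustration only.
coords = {
    "HA_10": (0.0, 0.0, 0.0),
    "HN_42": (3.0, 4.0, 0.0),
    "HB_57": (0.0, 0.0, 2.5),
}
restraints = [
    ("HA_10", "HN_42", 4.0),  # actual distance 5.0, exceeds the bound by 1.0
    ("HA_10", "HB_57", 5.0),  # actual distance 2.5, within the bound
]
score = restraint_violation(coords, restraints)
print(score)
```

A real structure calculation would minimize a score of this kind together with the physical terms (bond lengths, angles, and so on) that the lecture mentions.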
190 00:07:48,520 --> 00:07:50,270 What we're going to focus on is actually 191 00:07:50,270 --> 00:07:51,980 perhaps an even more complicated problem, 192 00:07:51,980 --> 00:07:54,080 the de novo discovery of protein structures. 193 00:07:54,080 --> 00:07:55,210 So if I start off with a sequence, 194 00:07:55,210 --> 00:07:56,590 can I actually tell you something 195 00:07:56,590 --> 00:07:58,800 important and accurate about the structure? 196 00:08:03,220 --> 00:08:06,599 Now, there's a nice summary in a book called 197 00:08:06,599 --> 00:08:08,140 Structural Bioinformatics that really 198 00:08:08,140 --> 00:08:10,410 deals with a lot of the issues around computational biology 199 00:08:10,410 --> 00:08:12,360 as it relates to structure, that highlights many 200 00:08:12,360 --> 00:08:14,350 of the differences between the kinds of algorithms we've been 201 00:08:14,350 --> 00:08:16,830 looking at up until now in this course and the kinds 202 00:08:16,830 --> 00:08:18,430 of approaches that we need to take 203 00:08:18,430 --> 00:08:21,550 in our understanding of protein structure. 204 00:08:21,550 --> 00:08:23,020 So the first and most fundamental 205 00:08:23,020 --> 00:08:24,130 obvious thing is that we're dealing 206 00:08:24,130 --> 00:08:25,546 with three-dimensional structures, 207 00:08:25,546 --> 00:08:28,180 so we're moving away from the simple linear representations 208 00:08:28,180 --> 00:08:29,860 of the data and dealing with more 209 00:08:29,860 --> 00:08:33,770 complicated three-dimensional problems. 210 00:08:33,770 --> 00:08:37,320 And therefore, we encounter all sorts of new problems. 211 00:08:37,320 --> 00:08:38,950 We no longer have a discrete search space. 
212 00:08:38,950 --> 00:08:40,525 We have a continuous search space, 213 00:08:40,525 --> 00:08:41,900 and we'll look at algorithms that 214 00:08:41,900 --> 00:08:44,290 try to reduce that continuous search space back down 215 00:08:44,290 --> 00:08:48,044 to a discrete one to make it a simpler problem. 216 00:08:48,044 --> 00:08:49,960 But perhaps most fundamentally, the difference 217 00:08:49,960 --> 00:08:53,410 is that now we have to bring in a lot of physical knowledge 218 00:08:53,410 --> 00:08:54,610 to underlie our algorithms. 219 00:08:54,610 --> 00:08:57,240 It's not enough to solve this as a complete abstraction 220 00:08:57,240 --> 00:08:58,790 from the physics, but we actually 221 00:08:58,790 --> 00:09:02,014 have to deal with the physics in the heart of the algorithms. 222 00:09:02,014 --> 00:09:03,680 And we'll look at the issues highlighted 223 00:09:03,680 --> 00:09:06,590 in red in the rest of this talk. 224 00:09:06,590 --> 00:09:08,090 Another thing that's going to emerge 225 00:09:08,090 --> 00:09:09,959 is that it would be nice if there 226 00:09:09,959 --> 00:09:12,250 was a simple mapping of protein sequence to structures, 227 00:09:12,250 --> 00:09:13,916 and if that were the case, you'd imagine 228 00:09:13,916 --> 00:09:16,457 that two proteins that are very different in sequence 229 00:09:16,457 --> 00:09:17,790 would have different structures. 230 00:09:17,790 --> 00:09:18,970 But in fact, that's not the case. 231 00:09:18,970 --> 00:09:21,410 You can have two proteins that have almost no sequence 232 00:09:21,410 --> 00:09:23,659 similarity at all but adopt the same three-dimensional 233 00:09:23,659 --> 00:09:27,486 structure, so clearly, it's an extremely complicated problem 234 00:09:27,486 --> 00:09:28,860 made more complicated by the fact 235 00:09:28,860 --> 00:09:30,360 that we don't know all the structures. 
236 00:09:30,360 --> 00:09:32,401 It's not like we're selecting from a discrete set 237 00:09:32,401 --> 00:09:35,090 of known structures to figure out what our new molecule is. 238 00:09:35,090 --> 00:09:37,110 We have, potentially, an infinite number 239 00:09:37,110 --> 00:09:41,187 of conformations of protein chains we need to deal with. 240 00:09:41,187 --> 00:09:43,770 OK, so I hope that you've had a chance to look at the material 241 00:09:43,770 --> 00:09:46,910 that I've posted online for review of protein structure. 242 00:09:46,910 --> 00:09:48,320 If you haven't, please do so. 243 00:09:48,320 --> 00:09:49,480 It'll be very helpful in understanding 244 00:09:49,480 --> 00:09:51,160 the next few lectures, and I'll assume 245 00:09:51,160 --> 00:09:53,368 that you're familiar with the basic elements of protein 246 00:09:53,368 --> 00:09:55,110 structure, what alpha helices are, 247 00:09:55,110 --> 00:09:57,880 what beta sheets are, primary structure, secondary structure, 248 00:09:57,880 --> 00:09:58,750 and so on. 249 00:09:58,750 --> 00:10:01,190 And I'll also encourage you to become familiar 250 00:10:01,190 --> 00:10:01,899 with amino acids. 251 00:10:01,899 --> 00:10:04,314 It's very hard to understand anything in protein structure 252 00:10:04,314 --> 00:10:06,790 without having some knowledge of what the amino acids are. 253 00:10:06,790 --> 00:10:09,600 The textbook has a nice figure that 254 00:10:09,600 --> 00:10:12,535 summarizes the many overlapping ways to describe the features 255 00:10:12,535 --> 00:10:16,020 in amino acids, so please familiarize yourself with that. 256 00:10:16,020 --> 00:10:18,550 So these are resources that we posted online. 257 00:10:18,550 --> 00:10:21,130 Also, the Protein Data Bank, the RCSB, 258 00:10:21,130 --> 00:10:23,540 has fantastic resources online for beginning 259 00:10:23,540 --> 00:10:25,820 to understand protein structure, so I 260 00:10:25,820 --> 00:10:27,689 encourage you to look at their website. 
261 00:10:27,689 --> 00:10:29,230 In particular, in their website, they 262 00:10:29,230 --> 00:10:31,444 have tools that you can download to visualize protein 263 00:10:31,444 --> 00:10:32,860 structures, and that's going to be 264 00:10:32,860 --> 00:10:35,349 a critical component of understanding these algorithms, 265 00:10:35,349 --> 00:10:37,640 to actually understand what these structures look like. 266 00:10:37,640 --> 00:10:39,598 I've highlighted two that I find particularly 267 00:10:39,598 --> 00:10:42,400 easy to use, PyMOL and Swiss PDB Viewer. 268 00:10:42,400 --> 00:10:43,954 You can not only look at structures 269 00:10:43,954 --> 00:10:46,120 with these techniques, you can actually modify them. 270 00:10:46,120 --> 00:10:49,560 You can do homology modeling. 271 00:10:49,560 --> 00:10:52,950 So before we get into algorithms for understanding protein 272 00:10:52,950 --> 00:10:54,590 structure, we need to understand how 273 00:10:54,590 --> 00:10:56,350 protein structures are represented. 274 00:10:56,350 --> 00:10:59,652 I've already mentioned that there are these repeating units 275 00:10:59,652 --> 00:11:01,860 that I'd like you to already know about-- alpha helices, 276 00:11:01,860 --> 00:11:02,420 beta sheets. 277 00:11:02,420 --> 00:11:04,550 We won't go into those in any detail. 278 00:11:04,550 --> 00:11:06,461 But the two more quantitative ways 279 00:11:06,461 --> 00:11:07,960 of describing protein structure have 280 00:11:07,960 --> 00:11:09,751 to do with the three-dimensional coordinates, 281 00:11:09,751 --> 00:11:11,290 the XYZ coordinates of every atom, 282 00:11:11,290 --> 00:11:12,890 and internal coordinates, and we'll 283 00:11:12,890 --> 00:11:15,820 go through those in a little bit of detail. 284 00:11:15,820 --> 00:11:18,150 So again, this PDB website has a lot 285 00:11:18,150 --> 00:11:19,800 of great resources for understanding 286 00:11:19,800 --> 00:11:22,760 what these coordinates look like. 
287 00:11:22,760 --> 00:11:26,260 They have a good description of what's called a PDB file, 288 00:11:26,260 --> 00:11:29,000 and those PDB files look like this at the outset. 289 00:11:29,000 --> 00:11:31,360 They have what is now called metadata, but at the time 290 00:11:31,360 --> 00:11:33,320 was just information about how the protein structure was 291 00:11:33,320 --> 00:11:34,080 solved. 292 00:11:34,080 --> 00:11:38,630 So it'll tell you what organism the protein comes from, 293 00:11:38,630 --> 00:11:40,630 where it was actually synthesized if it wasn't 294 00:11:40,630 --> 00:11:43,260 purified from that organism, but if it was made recombinantly, 295 00:11:43,260 --> 00:11:46,250 details like that, details about how the crystal structure was 296 00:11:46,250 --> 00:11:48,160 determined. 297 00:11:48,160 --> 00:11:50,810 The sequence-- most of this won't concern us, 298 00:11:50,810 --> 00:11:53,880 but what will concern us is this bottom section shown here 299 00:11:53,880 --> 00:11:55,380 in more detail. 300 00:11:55,380 --> 00:11:58,240 So let's just look at what each of these lines represents. 301 00:11:58,240 --> 00:12:00,410 The lines that contain information about the atomic 302 00:12:00,410 --> 00:12:02,850 coordinates all begin with the word ATOM, 303 00:12:02,850 --> 00:12:04,950 and then there's an index number that 304 00:12:04,950 --> 00:12:08,620 just is referenced for each line of the file, 305 00:12:08,620 --> 00:12:12,050 tells you what kind of atom it is, what chain in the protein 306 00:12:12,050 --> 00:12:13,660 it is, and the residue number. 307 00:12:13,660 --> 00:12:16,540 So here, it's starting with residue 100. 308 00:12:16,540 --> 00:12:18,110 The sequence here can be arbitrary 309 00:12:18,110 --> 00:12:20,210 and may not relate to the sequence of the protein 310 00:12:20,210 --> 00:12:25,372 as it appears in SWISS-PROT or GenBank. 
311 00:12:25,372 --> 00:12:26,830 And then the next three columns are 312 00:12:26,830 --> 00:12:28,455 the ones that are most important to us, 313 00:12:28,455 --> 00:12:31,050 so these are the XYZ coordinates of the atom. 314 00:12:31,050 --> 00:12:33,012 So to identify the position of any molecule 315 00:12:33,012 --> 00:12:34,720 in three-dimensional space, obviously you 316 00:12:34,720 --> 00:12:36,340 need three coordinates, and so those 317 00:12:36,340 --> 00:12:39,140 are what those three coordinates are. 318 00:12:39,140 --> 00:12:41,390 And they're followed by these two other numbers, which 319 00:12:41,390 --> 00:12:44,080 actually are very interesting numbers because they tell us 320 00:12:44,080 --> 00:12:47,260 something about how certain we are that the molecule is 321 00:12:47,260 --> 00:12:49,890 really-- the atom is really at that position in the crystal 322 00:12:49,890 --> 00:12:51,040 structure. 323 00:12:51,040 --> 00:12:53,710 So the first of these is the occupancy. 324 00:12:53,710 --> 00:12:55,312 In a crystal structure, we're actually 325 00:12:55,312 --> 00:12:57,520 getting the information about thousands and thousands 326 00:12:57,520 --> 00:13:00,734 of molecules that are in the repeating units of the crystal, 327 00:13:00,734 --> 00:13:02,150 and it's possible that there could 328 00:13:02,150 --> 00:13:04,720 be some variation in the structure between one 329 00:13:04,720 --> 00:13:06,620 unit of the crystal and the next. 330 00:13:06,620 --> 00:13:09,030 So you could have a side chain that, in one crystal, 331 00:13:09,030 --> 00:13:11,230 is over here and in the next crystal-- 332 00:13:11,230 --> 00:13:13,880 a repeating unit of the crystals over there. 
333 00:13:13,880 --> 00:13:16,110 If there are discrete conformations, 334 00:13:16,110 --> 00:13:18,830 then you imagine that the signal will be reduced, 335 00:13:18,830 --> 00:13:20,580 and you'll actually get some superposition 336 00:13:20,580 --> 00:13:22,920 of all the possible conformations. 337 00:13:22,920 --> 00:13:24,870 So number one here means that there 338 00:13:24,870 --> 00:13:27,877 seems to be one predominant conformation. 339 00:13:27,877 --> 00:13:30,460 But if there is more than one, and they're discrete-- if they're 340 00:13:30,460 --> 00:13:32,084 continuous, it'll just look like noise. 341 00:13:32,084 --> 00:13:34,080 It'll be hard to determine the coordinates. 342 00:13:34,080 --> 00:13:36,580 But if they're discrete positions, 343 00:13:36,580 --> 00:13:39,700 then you might find, for example, an occupancy of 0.5 344 00:13:39,700 --> 00:13:42,050 and then another line with the other position 345 00:13:42,050 --> 00:13:44,270 with an occupancy of 0.5. 346 00:13:44,270 --> 00:13:46,950 So that's when there's discrete locations where 347 00:13:46,950 --> 00:13:49,200 these atoms are located. 348 00:13:49,200 --> 00:13:51,410 The B factor's called the thermal factor, 349 00:13:51,410 --> 00:13:54,030 and it tells you how much thermal motion there 350 00:13:54,030 --> 00:13:56,089 was in the crystal at that position. 351 00:13:56,089 --> 00:13:57,130 Now, what does that mean? 352 00:13:57,130 --> 00:13:58,380 If we think about a crystal structure, 353 00:13:58,380 --> 00:14:00,580 there'll be some parts of it that are rock solid. 354 00:14:00,580 --> 00:14:02,660 In the center, it's highly constrained. 355 00:14:02,660 --> 00:14:04,860 The dense core of the protein, not too much 356 00:14:04,860 --> 00:14:06,110 is going to be changing. 357 00:14:06,110 --> 00:14:07,730 But on the surface of the protein, 358 00:14:07,730 --> 00:14:10,770 there can be residues that are highly flexible. 
359 00:14:10,770 --> 00:14:14,070 And so as those are being knocked around in the crystal, 360 00:14:14,070 --> 00:14:17,800 they are scattering the x-rays in slightly different ways. 361 00:14:17,800 --> 00:14:19,880 But they're not in discrete conformations, 362 00:14:19,880 --> 00:14:23,180 so we're not going to see multiple independent positions. 363 00:14:23,180 --> 00:14:25,227 We'll just see some average positions. 364 00:14:25,227 --> 00:14:27,560 And that kind of noise can be accounted for with these B 365 00:14:27,560 --> 00:14:31,510 factors, where high numbers represent highly mobile parts 366 00:14:31,510 --> 00:14:33,120 of the structure, and low numbers 367 00:14:33,120 --> 00:14:35,035 represent very stable ones. 368 00:14:35,035 --> 00:14:37,820 A very low number here would be, say, a 20. 369 00:14:37,820 --> 00:14:39,890 These numbers of 80-- typically, things like that 370 00:14:39,890 --> 00:14:41,390 occur at the ends of molecules where 371 00:14:41,390 --> 00:14:45,180 there is a lot of structural flexibility. 372 00:14:45,180 --> 00:14:46,920 So we have this one way of describing 373 00:14:46,920 --> 00:14:51,010 the structure of a protein where we specify the XYZ coordinates 374 00:14:51,010 --> 00:14:54,620 of every one of these atoms, and we'd have these other two 375 00:14:54,620 --> 00:14:58,200 parameters to represent thermal motion and static disorder. 376 00:14:58,200 --> 00:15:00,380 Now, are those coordinates uniquely defined? 377 00:15:00,380 --> 00:15:02,780 If I have this structure, is there 378 00:15:02,780 --> 00:15:06,360 exactly one way to write down the XYZ coordinates? 379 00:15:06,360 --> 00:15:07,000 Hands? 380 00:15:07,000 --> 00:15:08,900 How many people say yes? 381 00:15:08,900 --> 00:15:10,650 How many people say no? 382 00:15:10,650 --> 00:15:12,984 Why not? 383 00:15:12,984 --> 00:15:14,370 AUDIENCE: You can rotate it. 384 00:15:14,370 --> 00:15:15,578 PROFESSOR: You can rotate it. 
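Editorial sketch: the ATOM record fields described above (index number, atom name, chain, residue number, XYZ coordinates, occupancy, B factor) sit in fixed character columns of a PDB file. The Python below reads one such record by column position. The `Atom` type, the function name `parse_atom_line`, and the sample line with its coordinate values are invented for illustration; only the column layout follows the published PDB file format.

```python
from typing import NamedTuple

class Atom(NamedTuple):
    serial: int       # index number of the record
    name: str         # atom name, e.g. "N" or "CA"
    res_name: str     # three-letter residue name
    chain: str        # chain identifier
    res_seq: int      # residue number (may not match sequence databases)
    x: float
    y: float
    z: float
    occupancy: float  # fraction of molecules with the atom at this position
    b_factor: float   # thermal factor; higher means more mobile

def parse_atom_line(line: str) -> Atom:
    """Parse one fixed-width ATOM record (0-based slices of 1-based columns)."""
    if not line.startswith("ATOM"):
        raise ValueError("not an ATOM record")
    return Atom(
        serial=int(line[6:11]),
        name=line[12:16].strip(),
        res_name=line[17:20].strip(),
        chain=line[21],
        res_seq=int(line[22:26]),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
        occupancy=float(line[54:60]),
        b_factor=float(line[60:66]),
    )

# Hypothetical record: backbone nitrogen of residue 100 in chain A.
sample = "ATOM      1  N   SER A 100      11.104   6.134  -6.504  1.00 20.00"
atom = parse_atom_line(sample)
print(atom.res_seq, (atom.x, atom.y, atom.z), atom.occupancy, atom.b_factor)
```

Slicing by column rather than splitting on whitespace matters because adjacent fields in real files can touch, for example when a coordinate fills its whole eight-character field.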
385 00:15:15,578 --> 00:15:16,370 You set the origin. 386 00:15:16,370 --> 00:15:16,894 Right? 387 00:15:16,894 --> 00:15:18,560 So there's no unique way of defining it, 388 00:15:18,560 --> 00:15:20,390 and that'll come up again later. 389 00:15:20,390 --> 00:15:22,372 OK, now, this is a very precise way 390 00:15:22,372 --> 00:15:24,330 of describing the three-dimensional coordinates 391 00:15:24,330 --> 00:15:28,120 in protein, but it's not a very concise way of representing it. 392 00:15:28,120 --> 00:15:29,490 Now, why is that? 393 00:15:29,490 --> 00:15:31,476 Well, as the static model represents, 394 00:15:31,476 --> 00:15:33,350 there are certain parts of protein structures 395 00:15:33,350 --> 00:15:36,080 that are really not going to change very much. 396 00:15:36,080 --> 00:15:37,940 The lengths of the bonds change very little 397 00:15:37,940 --> 00:15:39,770 in protein structures. 398 00:15:39,770 --> 00:15:42,510 The angles, the tetrahedrally coordinated carbon, 399 00:15:42,510 --> 00:15:45,520 doesn't suddenly become flat, planar. 400 00:15:45,520 --> 00:15:48,550 These things happen very-- there may be very small deformations. 401 00:15:48,550 --> 00:15:51,960 So if I had to specify the XYZ coordinates of this carbon, 402 00:15:51,960 --> 00:15:53,460 I really don't have too many degrees 403 00:15:53,460 --> 00:15:55,820 of freedom for where the other carbon can be. 404 00:15:55,820 --> 00:15:58,380 It has to lie in a sphere at a certain distance. 405 00:15:58,380 --> 00:16:00,860 So instead of representing XYZ coordinates of every atom, 406 00:16:00,860 --> 00:16:04,020 I can use internal coordinates. 407 00:16:04,020 --> 00:16:08,590 So here in this slide, we have amino acids-- 408 00:16:08,590 --> 00:16:11,830 the amino nitrogen, the carbonyl carbon. 409 00:16:11,830 --> 00:16:13,580 So this is a single amino acid. 410 00:16:13,580 --> 00:16:16,020 Here's the peptide bond that goes to the next one. 
411 00:16:16,020 --> 00:16:18,050 And as this diagram indicates, the bond 412 00:16:18,050 --> 00:16:20,922 between the carbonyl carbon of one amino acid 413 00:16:20,922 --> 00:16:22,505 and the amide nitrogen of the next one 414 00:16:22,505 --> 00:16:25,597 is planar, so that angle isn't even rotating. 415 00:16:25,597 --> 00:16:28,180 So that's one degree of freedom that we've completely removed. 416 00:16:28,180 --> 00:16:31,986 The angles that rotate in the backbone are called phi and psi; 417 00:16:31,986 --> 00:16:37,410 phi over here, and psi over here. 418 00:16:37,410 --> 00:16:39,160 So those are two degrees of freedom 419 00:16:39,160 --> 00:16:42,930 that determine how this amino acid is-- 420 00:16:42,930 --> 00:16:45,560 the conformation of this amino acid. 421 00:16:45,560 --> 00:16:47,640 So instead of specifying all the coordinates, 422 00:16:47,640 --> 00:16:49,630 I can specify the backbone simply 423 00:16:49,630 --> 00:16:52,280 by giving two numbers to every amino acid, the phi and psi 424 00:16:52,280 --> 00:16:55,220 angles, with the assumption that the omega 425 00:16:55,220 --> 00:16:58,560 angle, this peptide backbone, remains constant. 426 00:16:58,560 --> 00:17:00,057 And similarly for the side chains, 427 00:17:00,057 --> 00:17:01,890 and we'll go into this in more detail later, 428 00:17:01,890 --> 00:17:04,859 we can then give the coordinates, the rotation, 429 00:17:04,859 --> 00:17:06,930 of rotatable bonds in the side chain 430 00:17:06,930 --> 00:17:09,155 and not specify every atom as we go out. 431 00:17:09,155 --> 00:17:11,530 OK, so we've got these two different ways of representing 432 00:17:11,530 --> 00:17:14,569 protein structure, and we'll see that they're both used. 433 00:17:14,569 --> 00:17:17,829 Any questions on this? 434 00:17:17,829 --> 00:17:19,540 Great. 
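Editorial sketch: the two representations are linked by the dihedral (torsion) angle defined by four consecutive atoms. Phi is conventionally measured over C of the previous residue, N, C-alpha, C, and psi over N, C-alpha, C, N of the next residue. The Python below computes a signed dihedral from four XYZ points. The function name `dihedral_degrees` and the test points are made up for illustration, and the atan2 formulation is one common way to compute the angle, not necessarily the one used in the course.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )

def dihedral_degrees(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 bond."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal to the plane containing p0, p1, p2
    n2 = _cross(b2, b3)  # normal to the plane containing p1, p2, p3
    b2_length = math.sqrt(_dot(b2, b2))
    y = b2_length * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

# Made-up points, not real backbone atoms: a quarter turn about the z axis.
plus_90 = dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
minus_90 = dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, -1, 1))
print(plus_90, minus_90)
```

Applying a function like this to the backbone atoms of each residue turns a list of XYZ coordinates into the phi and psi values the lecture describes. The angle depends only on the relative positions of the four atoms, so rotating or translating the whole molecule leaves it unchanged.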
435 00:17:19,540 --> 00:17:21,774 OK, so if we're looking at protein structures, 436 00:17:21,774 --> 00:17:23,440 one question we want to ask is how do we 437 00:17:23,440 --> 00:17:27,339 compare two protein structures to each other? 438 00:17:27,339 --> 00:17:29,070 So I already mentioned that proteins 439 00:17:29,070 --> 00:17:31,410 can have similar structure, whether or not 440 00:17:31,410 --> 00:17:32,950 they are highly similar in sequence. 441 00:17:32,950 --> 00:17:35,380 So if I have two proteins that are highly homologous, that 442 00:17:35,380 --> 00:17:38,350 do have a high level of sequence similarity-- for example, 443 00:17:38,350 --> 00:17:40,050 these two orthologs, this one from cow 444 00:17:40,050 --> 00:17:42,195 and this one from rat-- you can see, at a distance, 445 00:17:42,195 --> 00:17:43,820 they both have very similar structures. 446 00:17:43,820 --> 00:17:46,480 They also have 74% sequence similarity, 447 00:17:46,480 --> 00:17:47,970 so that's not surprising. 448 00:17:47,970 --> 00:17:50,400 But you can get proteins that have very low sequence 449 00:17:50,400 --> 00:17:51,570 similarity. 450 00:17:51,570 --> 00:17:53,660 They're still evolutionarily related, 451 00:17:53,660 --> 00:17:56,330 like these orthologs, two different species that 452 00:17:56,330 --> 00:17:59,630 have the same protein, or paralogs, a single species that 453 00:17:59,630 --> 00:18:02,890 has two similar, but nonidentical, copies 454 00:18:02,890 --> 00:18:05,410 of the same protein that maintain the same structure 455 00:18:05,410 --> 00:18:09,410 even when they have only about 20% to 30% sequence similarity. 456 00:18:09,410 --> 00:18:13,440 And you can get even more distant relationships. 457 00:18:13,440 --> 00:18:15,500 So here are two proteins, both in 458 00:18:15,500 --> 00:18:20,580 human, evolutionarily related, but only 4% sequence identity. 459 00:18:20,580 --> 00:18:24,077 And yet at a distance, they look almost identical.
460 00:18:24,077 --> 00:18:25,910 And those are evolutionary related proteins, 461 00:18:25,910 --> 00:18:27,880 but we can also have things that are called analogs, which 462 00:18:27,880 --> 00:18:29,570 have no evolutionary relationship, 463 00:18:29,570 --> 00:18:32,020 no obvious sequence similarity, and yet adopt 464 00:18:32,020 --> 00:18:34,410 almost identical protein structures. 465 00:18:34,410 --> 00:18:36,950 So this adds to the complexity of the biological problems 466 00:18:36,950 --> 00:18:38,605 that we're going to try to solve. 467 00:18:38,605 --> 00:18:40,320 All right, so how do I quantitatively 468 00:18:40,320 --> 00:18:43,510 compare two protein structures? 469 00:18:43,510 --> 00:18:45,180 So the common measurement is something 470 00:18:45,180 --> 00:18:47,760 called RMSD, Root Mean Square Deviation, 471 00:18:47,760 --> 00:18:49,260 and here, I have a set of structures 472 00:18:49,260 --> 00:18:50,542 that were solved by NMR. 473 00:18:50,542 --> 00:18:53,000 And you can see that there's a core of the structure that's 474 00:18:53,000 --> 00:18:54,510 well determined and then there are 475 00:18:54,510 --> 00:18:56,635 pieces of the structure that are poorly determined. 476 00:18:56,635 --> 00:18:58,862 There weren't enough constraints to define them. 477 00:18:58,862 --> 00:19:00,570 And these proteins have all been aligned, 478 00:19:00,570 --> 00:19:04,510 so the XYZ coordinates have been rotated and translated 479 00:19:04,510 --> 00:19:05,625 to give maximal agreement. 480 00:19:05,625 --> 00:19:07,000 And what's the agreement measure? 481 00:19:07,000 --> 00:19:08,541 It's this Root Mean Square Deviation. 482 00:19:08,541 --> 00:19:12,230 So I need to define pairs of atoms in my two structures. 483 00:19:12,230 --> 00:19:14,970 If it's, in this case, the same structure, that's really easy. 
484 00:19:14,970 --> 00:19:18,780 Every atom has a match in this structure that 485 00:19:18,780 --> 00:19:21,107 was solved with the same molecule. 486 00:19:21,107 --> 00:19:23,190 But if we're dealing with two homologous proteins, 487 00:19:23,190 --> 00:19:24,981 then that becomes a little bit more tricky. 488 00:19:24,981 --> 00:19:27,480 We need to define which amino acids are going to match up. 489 00:19:27,480 --> 00:19:30,447 We can also define whether we care about changes in the side 490 00:19:30,447 --> 00:19:33,030 chains, or whether we only care about changes in the backbone, 491 00:19:33,030 --> 00:19:34,030 whether we're going to worry about 492 00:19:34,030 --> 00:19:36,064 whether the protons are in the right places or not. 493 00:19:36,064 --> 00:19:37,730 And you'll see that these alignments can 494 00:19:37,730 --> 00:19:41,710 be done with either only heavy atoms, 495 00:19:41,710 --> 00:19:44,940 meaning excluding the hydrogens, or only main chain atoms, 496 00:19:44,940 --> 00:19:48,600 meaning excluding the side chains completely. 497 00:19:48,600 --> 00:19:51,359 But once we've defined the pairs of corresponding atoms, 498 00:19:51,359 --> 00:19:53,650 then we're going to take the squared distances, 499 00:19:53,650 --> 00:19:55,729 the sum of the squares of the differences 500 00:19:55,729 --> 00:19:58,020 between the corresponding atoms in their x-coordinate, 501 00:19:58,020 --> 00:19:59,680 their y-coordinate, and their z-coordinate. 502 00:19:59,680 --> 00:20:01,055 Take the square root of the mean of that sum, 503 00:20:01,055 --> 00:20:03,917 and that's going to give us the Root Mean Square Deviation. 504 00:20:03,917 --> 00:20:06,250 And of course, we have to minimize that Root Mean Square 505 00:20:06,250 --> 00:20:08,200 Deviation with these rigid body rotations 506 00:20:08,200 --> 00:20:10,310 to account for the fact that I could have my PDB 507 00:20:10,310 --> 00:20:12,305 file with the origin at this atom.
508 00:20:12,305 --> 00:20:14,680 Or I could have my PDB file with the origin at that atom, 509 00:20:14,680 --> 00:20:15,940 and so on. 510 00:20:15,940 --> 00:20:18,150 OK. 511 00:20:18,150 --> 00:20:20,770 Any questions so far? 512 00:20:20,770 --> 00:20:21,450 Yes. 513 00:20:21,450 --> 00:20:25,079 AUDIENCE: Do we consider every single atom in the molecule? 514 00:20:25,079 --> 00:20:26,370 PROFESSOR: So we have a choice. 515 00:20:26,370 --> 00:20:28,530 The question was do we consider every single atom 516 00:20:28,530 --> 00:20:29,197 in the molecule? 517 00:20:29,197 --> 00:20:31,029 We don't have to, and it depends, really, 518 00:20:31,029 --> 00:20:32,810 on the problem that we're trying to solve. 519 00:20:32,810 --> 00:20:35,760 So if we're looking for whether two proteins have 520 00:20:35,760 --> 00:20:38,130 the same fold, we might not care about the side chains. 521 00:20:38,130 --> 00:20:40,770 We might restrict ourselves to main chain atoms. 522 00:20:40,770 --> 00:20:43,170 But if we're trying to decide whether two crystal 523 00:20:43,170 --> 00:20:45,410 structures are in good agreement with each other, 524 00:20:45,410 --> 00:20:47,554 or say, as we'll see in a few minutes, 525 00:20:47,554 --> 00:20:49,720 we're going to try to predict the structure of a protein, 526 00:20:49,720 --> 00:20:51,870 and we have the experimentally determined structure 527 00:20:51,870 --> 00:20:53,619 of the same protein, and we want to decide 528 00:20:53,619 --> 00:20:55,203 whether those two agree, in that case, 529 00:20:55,203 --> 00:20:56,660 we might actually want to make sure 530 00:20:56,660 --> 00:20:58,760 that every single atom is in the right position. 531 00:20:58,760 --> 00:21:00,600 So it'll depend on the question that we're trying to answer. 532 00:21:00,600 --> 00:21:01,360 Good question. 533 00:21:01,360 --> 00:21:02,193 Any other questions? 534 00:21:06,100 --> 00:21:08,197 OK.
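As a rough illustration of the RMSD calculation described above, the sketch below computes the root of the mean squared distance between paired atoms after moving both structures to a common centroid. It only removes the translation; the rigid body rotation that the lecture mentions (usually found with a superposition algorithm) is deliberately left out, so this is the RMSD for structures that are already rotationally aligned. The function names and the list-of-tuples input are assumptions for the example.

```python
import math

def centroid(coords):
    """Average position of a list of (x, y, z) tuples."""
    n = len(coords)
    return tuple(sum(p[i] for p in coords) / n for i in range(3))

def rmsd_after_centering(a, b):
    """RMSD between paired atoms once both sets share a centroid.

    Assumes a[i] corresponds to b[i]; no rotation is applied.
    """
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty, equally sized atom lists")
    ca, cb = centroid(a), centroid(b)
    total = 0.0
    for pa, pb in zip(a, b):
        for i in range(3):
            diff = (pa[i] - ca[i]) - (pb[i] - cb[i])
            total += diff * diff
    return math.sqrt(total / len(a))
```

Choosing which atoms go into `a` and `b` (all atoms, heavy atoms only, or main chain only) is exactly the choice discussed in the answer to the audience question.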
535 00:21:08,197 --> 00:21:09,780 All right, so so far, I've shown a lot 536 00:21:09,780 --> 00:21:11,690 of static pictures of molecules. 537 00:21:11,690 --> 00:21:14,185 I do want to stress that molecules actually 538 00:21:14,185 --> 00:21:16,560 move around a lot, so I'll just show a little movie here. 539 00:21:19,548 --> 00:21:23,034 [VIDEO PLAYBACK] 540 00:23:17,574 --> 00:23:18,954 [END VIDEO PLAYBACK] 541 00:23:18,954 --> 00:23:20,870 PROFESSOR: OK, so that was, in part, an excuse 542 00:23:20,870 --> 00:23:23,390 to play a little New Age music in class, 543 00:23:23,390 --> 00:23:26,570 but more fundamentally, it was to remind you 544 00:23:26,570 --> 00:23:29,050 that, despite the fact that we're 545 00:23:29,050 --> 00:23:31,350 going to show you a lot of static pictures of proteins, 546 00:23:31,350 --> 00:23:33,250 they're actually extremely dynamic. 547 00:23:33,250 --> 00:23:36,170 And they have well defined structures, 548 00:23:36,170 --> 00:23:38,590 but they may have more than one well defined structure, 549 00:23:38,590 --> 00:23:40,849 especially those molecules that are doing work. 550 00:23:40,849 --> 00:23:42,390 They're actually moving things along. 551 00:23:42,390 --> 00:23:43,640 They have multiple structures. 552 00:23:43,640 --> 00:23:45,670 And so when we consider the protein structure, 553 00:23:45,670 --> 00:23:47,730 it's an approximation, and we're always 554 00:23:47,730 --> 00:23:52,150 going to mean the protein structures, not singular one. 555 00:23:52,150 --> 00:23:55,010 OK, so what determines the protein structure? 556 00:23:55,010 --> 00:23:56,640 Well, I've told you it's physics. 557 00:23:56,640 --> 00:23:58,620 Fundamentally, it's a physical problem, 558 00:23:58,620 --> 00:24:03,170 so the optimal protein structure has to be an energetic minimum. 559 00:24:03,170 --> 00:24:05,450 There has to be no net force acting on the protein. 
560 00:24:05,450 --> 00:24:09,180 The force is the negative derivative of the potential energy, 561 00:24:09,180 --> 00:24:11,120 so that derivative has to be 0. 562 00:24:11,120 --> 00:24:13,307 So the protein structure has to sit at a minimum. 563 00:24:13,307 --> 00:24:15,640 Now, that doesn't mean that there's exactly one minimum. 564 00:24:15,640 --> 00:24:19,140 Those proteins that had multiple conformations in that movie 565 00:24:19,140 --> 00:24:21,980 obviously had multiple minima that they could adopt depending 566 00:24:21,980 --> 00:24:23,630 on other circumstances, but there 567 00:24:23,630 --> 00:24:25,540 has to be at least a local minimum. 568 00:24:25,540 --> 00:24:30,055 So if we knew this U, this potential energy function, 569 00:24:30,055 --> 00:24:31,680 and we could take the derivative of it, 570 00:24:31,680 --> 00:24:34,830 we could identify the protein structure or the protein 571 00:24:34,830 --> 00:24:38,020 structures by simply identifying the minima 572 00:24:38,020 --> 00:24:39,770 in that potential energy function. 573 00:24:39,770 --> 00:24:42,280 Now, would that life were so simple, right? 574 00:24:42,280 --> 00:24:45,385 But we will see that there are ways of parameterizing the U 575 00:24:45,385 --> 00:24:48,340 and using it to optimize the structure so it finds 576 00:24:48,340 --> 00:24:49,790 this, at least local, minimum. 577 00:24:52,345 --> 00:24:53,720 And we're going to look primarily 578 00:24:53,720 --> 00:24:56,920 at two different ways of describing the potential energy 579 00:24:56,920 --> 00:24:57,610 function. 580 00:24:57,610 --> 00:24:59,890 One of them, we're going to look at the problem like a physicist 581 00:24:59,890 --> 00:25:01,348 would, and the other way, we're going 582 00:25:01,348 --> 00:25:03,500 to look at it as a statistician would.
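The statement that a structure sits where the force, the negative derivative of U, is zero can be illustrated with a toy one-dimensional potential. The double-well function below is purely an assumption chosen because it has two minima; it is not a protein energy function. Following the force downhill from different starting points lands in different local minima, which is the situation described for proteins with more than one conformation.

```python
def potential(x):
    """Toy double-well U(x) with minima at x = -1 and x = +1 (illustrative only)."""
    return (x * x - 1.0) ** 2

def force(x):
    """Force is the negative derivative of the potential: F = -dU/dx."""
    return -4.0 * x * (x * x - 1.0)

def relax(x, step=0.05, iterations=1000):
    """Move along the force in small steps until the position settles."""
    for _ in range(iterations):
        x += step * force(x)
    return x
```

Calling `relax(0.5)` and `relax(-0.5)` should settle near +1 and -1 respectively, each a local minimum where the force vanishes.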
583 00:25:03,500 --> 00:25:07,080 So the physicist wants to describe, as you might imagine, 584 00:25:07,080 --> 00:25:10,200 the physical forces that underlie the protein structure, 585 00:25:10,200 --> 00:25:11,580 and so as much as possible, we're 586 00:25:11,580 --> 00:25:13,080 going to try to write down equations 587 00:25:13,080 --> 00:25:14,665 that represent those forces. 588 00:25:14,665 --> 00:25:16,040 Now, we're not always going to be 589 00:25:16,040 --> 00:25:18,380 able to do that because a lot of forces involved 590 00:25:18,380 --> 00:25:19,790 are quantum mechanical. 591 00:25:19,790 --> 00:25:21,360 The mere fact the two solid objects 592 00:25:21,360 --> 00:25:25,260 don't pass through each other is because of exclusion principles 593 00:25:25,260 --> 00:25:26,644 that deal with quantum mechanics. 594 00:25:26,644 --> 00:25:29,060 We're not going to write down quantum mechanical equations 595 00:25:29,060 --> 00:25:30,726 for every atom in our protein structure, 596 00:25:30,726 --> 00:25:33,134 but we will write down equations that approximate those. 597 00:25:33,134 --> 00:25:34,550 And wherever possible, we're going 598 00:25:34,550 --> 00:25:37,100 to try to tie the terms in our equations 599 00:25:37,100 --> 00:25:39,320 into something identifiable in physics, 600 00:25:39,320 --> 00:25:43,190 and a very good example of this approach is the CHARMM program. 601 00:25:43,190 --> 00:25:45,350 And these approaches actually were the ones 602 00:25:45,350 --> 00:25:48,450 that won the Nobel Prize in chemistry this past year. 603 00:25:48,450 --> 00:25:51,540 At the other end of the spectrum are the statistical approaches. 604 00:25:51,540 --> 00:25:53,831 Here, we don't really care what the underlying physical 605 00:25:53,831 --> 00:25:54,470 properties are. 606 00:25:54,470 --> 00:25:57,600 We want equations that capture what we see in nature. 607 00:25:57,600 --> 00:25:59,860 Now, often, these two approaches will align very well. 
608 00:25:59,860 --> 00:26:02,860 There'll be some approximations that the physicist makes 609 00:26:02,860 --> 00:26:05,422 to capture a fundamental physical force 610 00:26:05,422 --> 00:26:07,880 that are simply the best way to describe what you see in nature, 611 00:26:07,880 --> 00:26:11,160 and so those two terms may look indistinguishable in the CHARMM 612 00:26:11,160 --> 00:26:14,150 version or my favorite statistical approach, which 613 00:26:14,150 --> 00:26:15,360 is Rosetta. 614 00:26:15,360 --> 00:26:17,330 So we'll see that some terms in these functions 615 00:26:17,330 --> 00:26:18,749 agree between CHARMM and Rosetta. 616 00:26:18,749 --> 00:26:20,790 But there'll be places where they fundamentally 617 00:26:20,790 --> 00:26:24,040 disagree on how to describe the molecular potential energy 618 00:26:24,040 --> 00:26:27,180 function because one is trying to describe the physical forces 619 00:26:27,180 --> 00:26:30,230 and the other one is trying to describe the statistical ones. 620 00:26:30,230 --> 00:26:32,750 Do we have any native speakers of German in the audience? 621 00:26:32,750 --> 00:26:34,020 AUDIENCE: I'm a speaker. 622 00:26:34,020 --> 00:26:35,914 PROFESSOR: You want to read the joke for us? 623 00:26:35,914 --> 00:26:36,538 AUDIENCE: Yeah. 624 00:26:36,538 --> 00:26:39,810 Institute for Quantum Physics, and it says 625 00:26:39,810 --> 00:26:41,767 "You can find yourself here or here." 626 00:26:41,767 --> 00:26:42,350 PROFESSOR: OK. 627 00:26:42,350 --> 00:26:43,610 AUDIENCE: [LAUGHTER] 628 00:26:43,610 --> 00:26:46,150 PROFESSOR: All right, so for the video, 629 00:26:46,150 --> 00:26:48,710 it's the Institute for Quantum Mechanics. 630 00:26:48,710 --> 00:26:51,315 And you go to a map at MIT, and it'll say, you find, 631 00:26:51,315 --> 00:26:51,940 "You are here." 632 00:26:51,940 --> 00:26:52,410 Right?
633 00:26:52,410 --> 00:26:54,535 But in the Institute for Quantum Mechanics, it says 634 00:26:54,535 --> 00:26:56,216 "You're either here or here." 635 00:26:56,216 --> 00:26:57,590 So that's the physicist's approach. 636 00:26:57,590 --> 00:27:00,080 We really do have to think about those quantum mechanical 637 00:27:00,080 --> 00:27:02,470 features, whereas on the right-hand side 638 00:27:02,470 --> 00:27:03,850 is the statistician's approach. 639 00:27:03,850 --> 00:27:06,130 It says "Data don't make any sense. 640 00:27:06,130 --> 00:27:08,140 We'll have to resort to statistics." 641 00:27:08,140 --> 00:27:08,950 OK? 642 00:27:08,950 --> 00:27:11,190 So the statistician can get pretty far 643 00:27:11,190 --> 00:27:13,936 without understanding the underlying physical forces. 644 00:27:13,936 --> 00:27:16,060 All right, so let's look at this physicist's approach 645 00:27:16,060 --> 00:27:18,393 first, so we're going to break down the potential energy 646 00:27:18,393 --> 00:27:21,722 function into bonded terms and non-bonded terms. 647 00:27:21,722 --> 00:27:23,180 So the bonded terms, as they sound, 648 00:27:23,180 --> 00:27:24,990 are going to be atoms that are close to each other 649 00:27:24,990 --> 00:27:27,370 in the bonded structures, so certainly these two atoms, 650 00:27:27,370 --> 00:27:29,690 because they're connected by a single bond, 651 00:27:29,690 --> 00:27:31,000 are going to be bonded terms. 652 00:27:31,000 --> 00:27:34,170 But we'll see groups of three or four atoms near each other 653 00:27:34,170 --> 00:27:35,359 will also be bonded terms. 654 00:27:35,359 --> 00:27:36,900 And the non-bonded terms will be when 655 00:27:36,900 --> 00:27:39,690 I have another molecule that comes close, but isn't directly 656 00:27:39,690 --> 00:27:40,330 connected. 657 00:27:40,330 --> 00:27:44,160 What are the physical forces between these two?
658 00:27:44,160 --> 00:27:46,560 So these bonded terms then first break down 659 00:27:46,560 --> 00:27:47,754 into a lot of sub terms. 660 00:27:47,754 --> 00:27:49,420 I'll show you the functional forms here. 661 00:27:49,420 --> 00:27:51,070 We'll just look at a few of them in detail 662 00:27:51,070 --> 00:27:53,278 and then give you a sense of what the other ones are. 663 00:27:53,278 --> 00:27:55,100 So this first one is the bonded term 664 00:27:55,100 --> 00:27:58,060 that describes, actually, the distance between two bonded 665 00:27:58,060 --> 00:27:59,150 atoms. 666 00:27:59,150 --> 00:28:00,800 Now, again, this is fundamentally 667 00:28:00,800 --> 00:28:03,020 quantum mechanical property, but it 668 00:28:03,020 --> 00:28:04,646 would be too computationally expensive 669 00:28:04,646 --> 00:28:06,020 to describe the quantum mechanics 670 00:28:06,020 --> 00:28:09,630 and not really necessary because you can do pretty well by just 671 00:28:09,630 --> 00:28:12,720 describing this as a stiff spring. 672 00:28:12,720 --> 00:28:15,360 So that's what this quadratic form of the equation 673 00:28:15,360 --> 00:28:16,380 represents. 674 00:28:16,380 --> 00:28:20,670 So we simply define b naught here 675 00:28:20,670 --> 00:28:22,710 as the equilibrium position between these two 676 00:28:22,710 --> 00:28:23,950 atoms, particular types. 677 00:28:23,950 --> 00:28:26,546 There would be two tetrahedral coordinated carbons, 678 00:28:26,546 --> 00:28:28,170 and that would be determined by looking 679 00:28:28,170 --> 00:28:29,753 at a lot of very, very high resolution 680 00:28:29,753 --> 00:28:31,350 structures in small molecule crystals 681 00:28:31,350 --> 00:28:34,760 so we know what the typical distance for this bond is. 682 00:28:34,760 --> 00:28:36,302 We get that as a parameter. 
683 00:28:36,302 --> 00:28:38,260 There would be a big file in the CHARMM program 684 00:28:38,260 --> 00:28:40,801 that lists all those parameters for every one of these bonded 685 00:28:40,801 --> 00:28:44,290 terms, and then if there's a small deviation from that, 686 00:28:44,290 --> 00:28:47,090 because the molecules stretched a bit 687 00:28:47,090 --> 00:28:48,540 in your refinement process, there 688 00:28:48,540 --> 00:28:51,190 would be a penalty to pull it back in just 689 00:28:51,190 --> 00:28:52,630 like a spring pulls it back in. 690 00:28:56,314 --> 00:28:58,230 Now, it turns out that when you go this route, 691 00:28:58,230 --> 00:29:00,396 you have to actually come up with a lot of equations 692 00:29:00,396 --> 00:29:03,000 to maintain the geometry because, again, we're 693 00:29:03,000 --> 00:29:05,584 going to have to not only worry about these distance bonds, 694 00:29:05,584 --> 00:29:07,000 but we need to worry about angles. 695 00:29:07,000 --> 00:29:10,810 So we've got the angle between this bond and this bond. 696 00:29:10,810 --> 00:29:12,070 What keeps that in place? 697 00:29:12,070 --> 00:29:14,610 So we need to add another term that's a second term here 698 00:29:14,610 --> 00:29:17,024 to make the angle between these fixed, 699 00:29:17,024 --> 00:29:18,440 and then we have to deal with what 700 00:29:18,440 --> 00:29:21,650 are called dihedral angles to make sure that these four 701 00:29:21,650 --> 00:29:26,015 atoms lie in the allowed geometry. 702 00:29:26,015 --> 00:29:27,640 And so each one of these terms accounts 703 00:29:27,640 --> 00:29:28,850 for something like that. 
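The bonded terms just described can be sketched as three small functions: a stiff spring on the bond length, a similar spring on the bond angle, and a periodic term on the dihedral. The quadratic and cosine forms below are the ones commonly used in molecular mechanics force fields; the parameter values in the usage note are illustrative assumptions, not numbers read from a CHARMM parameter file.

```python
import math

def bond_energy(b, b0, k_b):
    """Stiff-spring penalty for a bond length b away from equilibrium b0."""
    return k_b * (b - b0) ** 2

def angle_energy(theta_deg, theta0_deg, k_theta):
    """Same quadratic form for the angle between two bonds (radians inside)."""
    delta = math.radians(theta_deg - theta0_deg)
    return k_theta * delta ** 2

def dihedral_energy(phi_deg, k_phi, n, delta_deg=0.0):
    """Periodic torsion term with multiplicity n and phase delta."""
    return k_phi * (1.0 + math.cos(math.radians(n * phi_deg - delta_deg)))

def bonded_energy(bonds, angles, dihedrals):
    """Sum the three kinds of terms; each argument is a list of tuples."""
    total = sum(bond_energy(*t) for t in bonds)
    total += sum(angle_energy(*t) for t in angles)
    total += sum(dihedral_energy(*t) for t in dihedrals)
    return total
```

For example, `bond_energy(1.63, 1.53, 222.5)` gives the penalty for a carbon-carbon bond stretched by 0.1 angstrom under a hypothetical force constant of 222.5, and the penalty drops to zero when the bond is at its equilibrium length.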
704 00:29:28,850 --> 00:29:30,770 This last term over here makes sure 705 00:29:30,770 --> 00:29:33,090 that the phi and psi angles are consistent with what 706 00:29:33,090 --> 00:29:36,725 we see in quantum mechanics as corrected for any deviations 707 00:29:36,725 --> 00:29:39,040 that we see in these small molecules. 708 00:29:39,040 --> 00:29:42,100 So there are a lot of terms with a lot of parameters that are trying 709 00:29:42,100 --> 00:29:45,492 to capture the best description of what we observe, and each one 710 00:29:45,492 --> 00:29:46,950 is motivated by the fact that there 711 00:29:46,950 --> 00:29:49,650 is some quantum mechanical principle underlying it. 712 00:29:49,650 --> 00:29:50,339 So-- yes? 713 00:29:50,339 --> 00:29:51,714 AUDIENCE: Why is the [INAUDIBLE]? 714 00:29:54,696 --> 00:29:58,407 PROFESSOR: I actually don't know the answer to that. 715 00:29:58,407 --> 00:29:59,990 But there's a reference there that I'm 716 00:29:59,990 --> 00:30:01,239 sure will give you the answer. 717 00:30:03,970 --> 00:30:06,894 OK, now what about these non-bonded terms? 718 00:30:06,894 --> 00:30:08,310 So non-bonded terms, as I said, apply to 719 00:30:08,310 --> 00:30:11,760 atoms that are distant from each other in the structure 720 00:30:11,760 --> 00:30:13,470 of the protein, but close to each other 721 00:30:13,470 --> 00:30:14,720 in three-dimensional space. 722 00:30:14,720 --> 00:30:18,300 And there are two fundamental forces here. 723 00:30:18,300 --> 00:30:20,890 The first one is called the Lennard-Jones potential, 724 00:30:20,890 --> 00:30:23,020 and the second one is the electrostatic one. 725 00:30:23,020 --> 00:30:26,320 And the Lennard-Jones potential itself has these two terms. 726 00:30:26,320 --> 00:30:30,390 One is an r to the 6th term, a negative 1 over r to the 6th dependency. 727 00:30:30,390 --> 00:30:32,780 The other one is a positive 1 over r to the 12th. 728 00:30:32,780 --> 00:30:35,260 The negative 1 over r to the 6th is an attractive potential.
729 00:30:35,260 --> 00:30:36,840 That's why it's negative, and it's 730 00:30:36,840 --> 00:30:39,400 because of small induced dipoles that 731 00:30:39,400 --> 00:30:41,950 occur in the electron clouds of each of these atoms that 732 00:30:41,950 --> 00:30:44,560 pull the molecules together. 733 00:30:44,560 --> 00:30:46,370 And the 1 over r to the 6th dependency 734 00:30:46,370 --> 00:30:50,120 has to do with the physics of two dipoles interacting. 735 00:30:50,120 --> 00:30:53,280 The 1 over r to the 12th term is an approximation 736 00:30:53,280 --> 00:30:54,540 to a quantum mechanical force. 737 00:30:54,540 --> 00:30:58,520 So the reason the two molecules don't pass through each other, 738 00:30:58,520 --> 00:31:01,690 as we said already, is because of quantum mechanical forces. 739 00:31:01,690 --> 00:31:03,440 That would be very expensive to compute, 740 00:31:03,440 --> 00:31:05,540 so we come up with a term that's easy to compute. 741 00:31:05,540 --> 00:31:07,430 And of course, a 1 over r to the 12th term is simply 742 00:31:07,430 --> 00:31:09,330 the square of a 1 over r to the 6th term, 743 00:31:09,330 --> 00:31:11,957 so if you already computed 1 over r to the 6th between two 744 00:31:11,957 --> 00:31:14,540 atoms, you just square that, and you get 1 over r to the 12th. 745 00:31:14,540 --> 00:31:16,165 So it's very computationally efficient, 746 00:31:16,165 --> 00:31:19,920 and you adjust the parameters, these r mins, 747 00:31:19,920 --> 00:31:23,150 so that it works out so that these things agree reasonably 748 00:31:23,150 --> 00:31:24,750 well with the crystal structures. 749 00:31:24,750 --> 00:31:26,875 And these are crystal structures of small molecules 750 00:31:26,875 --> 00:31:28,502 that we know in great detail. 751 00:31:28,502 --> 00:31:29,960 And then the electrostatics is what 752 00:31:29,960 --> 00:31:31,510 you might expect for electrostatics.
753 00:31:31,510 --> 00:31:35,110 It's got a potential that varies as 1 over the distance, 754 00:31:35,110 --> 00:31:36,930 and as the product of the charges. 755 00:31:36,930 --> 00:31:39,980 These can be full charges or they can be partial charges. 756 00:31:39,980 --> 00:31:42,892 And there's a term here, this epsilon, 757 00:31:42,892 --> 00:31:45,100 which is the dielectric constant, and that represents 758 00:31:45,100 --> 00:31:47,890 the fact that, in vacuum, there'd be a much greater 759 00:31:47,890 --> 00:31:50,440 force pulling two oppositely charged molecules 760 00:31:50,440 --> 00:31:54,790 together than in water because the water's going to shield. 761 00:31:54,790 --> 00:31:56,970 And so in these electrostatic terms, 762 00:31:56,970 --> 00:32:01,730 this dielectric constant term 763 00:32:01,730 --> 00:32:05,990 can vary from one, which is vacuum, to, say, 80 for water. 764 00:32:05,990 --> 00:32:09,640 And setting that is a bit of an art. 765 00:32:09,640 --> 00:32:11,810 OK, so what do these potentials look like? 766 00:32:11,810 --> 00:32:12,800 Those are shown here. 767 00:32:12,800 --> 00:32:15,370 This is the, in dark lines, the sum of the van der Waals 768 00:32:15,370 --> 00:32:16,580 potential. 769 00:32:16,580 --> 00:32:18,850 It consists of that attractive term, 770 00:32:18,850 --> 00:32:20,910 which has the 1 over r to the 6th dependency, 771 00:32:20,910 --> 00:32:22,660 and the repulsive term with the 1 over r to the 12th. 772 00:32:22,660 --> 00:32:25,964 And why does it go up so high at short distances? 773 00:32:25,964 --> 00:32:27,170 AUDIENCE: [INAUDIBLE]. 774 00:32:27,170 --> 00:32:28,628 PROFESSOR: Right, because you can't 775 00:32:28,628 --> 00:32:30,574 have molecules that overlap. 776 00:32:30,574 --> 00:32:31,990 You'll see that there's a minimum, 777 00:32:31,990 --> 00:32:33,930 so there's an optimal distance barring 778 00:32:33,930 --> 00:32:35,890 any other forces between two atoms.
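The two non-bonded terms can be sketched in a few lines. The Lennard-Jones function below uses a common form written in terms of the well depth and the minimum-energy distance r_min, and it reuses the 1 over r to the 6th value by squaring it, as described in the lecture. The electrostatic function is Coulomb's law scaled by a dielectric constant. The conversion factor of about 332 (kcal/mol, angstroms, elementary charges) is an approximate value, and all names and default arguments are assumptions for this example.

```python
COULOMB_CONSTANT = 332.06  # approximate, kcal*angstrom/(mol*e^2)

def lennard_jones(r, well_depth, r_min):
    """Lennard-Jones energy with its minimum of -well_depth at r = r_min."""
    s6 = (r_min / r) ** 6   # attractive 1/r^6 part
    s12 = s6 * s6           # repulsive 1/r^12 part, obtained by squaring
    return well_depth * (s12 - 2.0 * s6)

def electrostatic(r, q1, q2, dielectric=1.0):
    """Coulomb energy; dielectric=1 is vacuum, about 80 approximates water."""
    return COULOMB_CONSTANT * q1 * q2 / (dielectric * r)

def nonbonded(r, well_depth, r_min, q1, q2, dielectric=1.0):
    """Sum of the two non-bonded terms for one pair of atoms."""
    return lennard_jones(r, well_depth, r_min) + electrostatic(r, q1, q2, dielectric)
```

Evaluating `lennard_jones` at distances shorter than `r_min` shows the steep rise that keeps atoms from overlapping, and comparing `electrostatic(r, 1, -1, 1.0)` with `electrostatic(r, 1, -1, 80.0)` shows how strongly water is assumed to shield the charges.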
779 00:32:35,890 --> 00:32:39,130 So that's roughly what these hard sphere distances 780 00:32:39,130 --> 00:32:42,880 represent in the scale models. 781 00:32:42,880 --> 00:32:44,380 And then the electrostatic potential 782 00:32:44,380 --> 00:32:48,700 also, obviously, has an attractive term, 783 00:32:48,700 --> 00:32:50,830 but it's going to blow up as you get 784 00:32:50,830 --> 00:32:54,340 to small values, increasingly favorable. 785 00:32:54,340 --> 00:32:57,910 And so the net sum of those two is shown here, 786 00:32:57,910 --> 00:33:00,510 the combination of van der Waals and electrostatics. 787 00:33:00,510 --> 00:33:03,354 It, again, has a strong minimum but becomes 788 00:33:03,354 --> 00:33:05,270 highly positive as you get to close distances. 789 00:33:09,340 --> 00:33:12,110 OK, any questions on these forces? 790 00:33:12,110 --> 00:33:12,610 Yes? 791 00:33:12,610 --> 00:33:14,930 AUDIENCE: Does the van der Waals equal the Lennard-Jones 792 00:33:14,930 --> 00:33:15,430 potential? 793 00:33:15,430 --> 00:33:16,513 Or is that something else? 794 00:33:16,513 --> 00:33:18,874 PROFESSOR: Yeah, typically, those two terms 795 00:33:18,874 --> 00:33:19,915 are used interchangeably. 796 00:33:19,915 --> 00:33:21,224 Yeah. 797 00:33:21,224 --> 00:33:21,890 Other questions? 798 00:33:25,150 --> 00:33:26,620 OK. 799 00:33:26,620 --> 00:33:28,450 All right, so that's how the physicist 800 00:33:28,450 --> 00:33:31,300 would describe the potential energy function. 801 00:33:31,300 --> 00:33:32,850 Rosetta, as I told you, is an example 802 00:33:32,850 --> 00:33:34,250 of the statistical approach. 803 00:33:34,250 --> 00:33:36,870 It rejects all this sharp definition 804 00:33:36,870 --> 00:33:39,870 of trying to compute exactly the right distance between two 805 00:33:39,870 --> 00:33:42,510 atoms by having a stiff spring between them 806 00:33:42,510 --> 00:33:44,727 and says let's just fix a lot of these angles.
807 00:33:44,727 --> 00:33:46,935 So we're going to fix the distance between two atoms. 808 00:33:46,935 --> 00:33:48,540 There's no point in having them vary 809 00:33:48,540 --> 00:33:50,890 by tiny, tiny fractions in the bond length. 810 00:33:50,890 --> 00:33:52,906 We're going to fix a tetrahedral coordination 811 00:33:52,906 --> 00:33:54,030 of our tetrahedral carbons. 812 00:33:54,030 --> 00:33:55,780 We're not going to let them deform because that never 813 00:33:55,780 --> 00:33:57,470 would happen in reality, and so we're 814 00:33:57,470 --> 00:34:00,350 going to focus our search over the space 815 00:34:00,350 --> 00:34:02,240 entirely over the rotatable bonds. 816 00:34:02,240 --> 00:34:04,220 So remember, how many rotatable bonds 817 00:34:04,220 --> 00:34:05,900 did we have in the backbone? 818 00:34:05,900 --> 00:34:06,650 We had two, right? 819 00:34:06,650 --> 00:34:08,080 We had the phi and the psi angles, 820 00:34:08,080 --> 00:34:09,790 and then the side chains then will 821 00:34:09,790 --> 00:34:12,659 have rotatable bonds over the side chains. 822 00:34:12,659 --> 00:34:15,030 So in this example, this is a cysteine. 823 00:34:15,030 --> 00:34:16,580 Here's the backbone. 824 00:34:16,580 --> 00:34:18,670 Here's the sulfur. 825 00:34:18,670 --> 00:34:20,991 And we have exactly one rotatable bond of interest 826 00:34:20,991 --> 00:34:23,449 because we don't really care where the hydrogen is located. 827 00:34:23,449 --> 00:34:24,854 So we've got this chi 1 angle. 828 00:34:24,854 --> 00:34:26,270 If there were more atoms out here, 829 00:34:26,270 --> 00:34:29,040 this would be called chi 2 and chi 3. 830 00:34:29,040 --> 00:34:33,719 And these can rotate, but they don't rotate freely. 831 00:34:33,719 --> 00:34:35,440 We don't observe, in crystal structures, 832 00:34:35,440 --> 00:34:37,790 every possible rotation of these angles, 833 00:34:37,790 --> 00:34:40,770 and that's what this plot on the left represents. 
834 00:34:40,770 --> 00:34:44,929 For this side chain, there's a chi 1, a chi 2, and a chi 3, 835 00:34:44,929 --> 00:34:47,792 and the dark regions represent the observed conformations 836 00:34:47,792 --> 00:34:49,250 over many, many crystal structures. 837 00:34:49,250 --> 00:34:51,210 And you can see it's highly nonuniform. 838 00:34:51,210 --> 00:34:53,172 Now why is that? 839 00:34:53,172 --> 00:34:54,741 I see people with their hands trying 840 00:34:54,741 --> 00:34:55,949 to figure it out in the back. 841 00:34:55,949 --> 00:34:57,910 So why is that? 842 00:34:57,910 --> 00:35:00,210 I figure that's what you guys are doing. 843 00:35:00,210 --> 00:35:03,370 If not, it's very interesting sign language. 844 00:35:03,370 --> 00:35:06,390 So if we look down one of these tetrahedral carbon-carbon 845 00:35:06,390 --> 00:35:09,340 bonds, we have apparently a free rotation. 846 00:35:09,340 --> 00:35:11,099 But in fact, in some of these conformations, 847 00:35:11,099 --> 00:35:12,890 we're going to have a lot of steric clashes 848 00:35:12,890 --> 00:35:15,940 between the atoms on one carbon and the atoms on the other, 849 00:35:15,940 --> 00:35:18,510 and so this is not a favorable conformation. 850 00:35:18,510 --> 00:35:20,629 The favorable conformation is offset, 851 00:35:20,629 --> 00:35:23,170 and that propagates throughout all the chains in the protein. 852 00:35:23,170 --> 00:35:25,503 So there'll be certain angles that are highly preferred, 853 00:35:25,503 --> 00:35:26,920 and other ones that are not. 854 00:35:26,920 --> 00:35:29,417 These highly preferred angles are called rotamers, 855 00:35:29,417 --> 00:35:30,750 and so we'll use the term a lot. 856 00:35:30,750 --> 00:35:33,200 It stands for rotational isomers.
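One way to picture rotamers is as a short list of allowed chi angles from which the best is picked by some score. The sketch below uses three staggered chi 1 values and a simple threefold torsion energy that is high when the substituents are eclipsed and low when they are offset. The angle list, the barrier height, and the function names are all illustrative assumptions, not values from a real rotamer library.

```python
import math

# Hypothetical chi 1 rotamers in degrees: the three staggered positions.
ROTAMERS_CHI1 = (-60.0, 60.0, 180.0)

def torsion_energy(chi_deg, barrier=3.0):
    """Threefold torsion term: highest when eclipsed, zero when staggered."""
    return 0.5 * barrier * (1.0 + math.cos(math.radians(3.0 * chi_deg)))

def best_rotamer(score, candidates=ROTAMERS_CHI1):
    """Pick the candidate angle with the lowest score."""
    return min(candidates, key=score)
```

Because only a handful of candidates are scored, any scoring function can be passed in, for example `best_rotamer(lambda chi: torsion_energy(chi) + clash_penalty(chi))`, where `clash_penalty` stands for a hypothetical term describing the surrounding atoms.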
857 00:35:33,200 --> 00:35:35,235 And so now, we've turned our continuous problem 858 00:35:35,235 --> 00:35:38,070 of figuring out what the optimal angle is for this chi 1 859 00:35:38,070 --> 00:35:40,370 rotation into a discrete problem where maybe there 860 00:35:40,370 --> 00:35:45,000 are only two or three possible options for that rotation. 861 00:35:45,000 --> 00:35:46,760 And so now, we can decide is this 862 00:35:46,760 --> 00:35:49,550 better than this one or this one? 863 00:35:49,550 --> 00:35:51,750 Questions on rotamers or any of this? 864 00:35:54,870 --> 00:35:57,690 Excellent. 865 00:35:57,690 --> 00:36:00,092 OK, so how do we determine-- we've decided then 866 00:36:00,092 --> 00:36:01,550 we're going to describe the protein 867 00:36:01,550 --> 00:36:03,770 entirely by these internal coordinates-- 868 00:36:03,770 --> 00:36:06,550 the phi, the psi, the backbone, the chi angles of the side 869 00:36:06,550 --> 00:36:07,480 chain. 870 00:36:07,480 --> 00:36:10,170 We still need a potential energy function, right? 871 00:36:10,170 --> 00:36:12,600 That hasn't told us how to find the optimal settings, 872 00:36:12,600 --> 00:36:15,880 and we're going to try to avoid the approach of CHARMM, 873 00:36:15,880 --> 00:36:17,740 where we actually look at quantum mechanics 874 00:36:17,740 --> 00:36:20,387 to decide what all the terms are. 875 00:36:20,387 --> 00:36:22,220 So how do they actually go about doing this? 876 00:36:22,220 --> 00:36:26,150 Well, they take a number of high resolution crystal structures, 877 00:36:26,150 --> 00:36:27,910 and they characterize certain properties 878 00:36:27,910 --> 00:36:29,076 in those crystal structures. 
879 00:36:29,076 --> 00:36:30,960 For example, they might characterize 880 00:36:30,960 --> 00:36:33,960 how often a certain aliphatic carbon-- how often 881 00:36:33,960 --> 00:36:36,569 aliphatic carbons are near amide nitrogens, 882 00:36:36,569 --> 00:36:38,110 and they might measure the distance-- 883 00:36:38,110 --> 00:36:41,940 they do measure the distance between these amide nitrogens 884 00:36:41,940 --> 00:36:44,790 and aliphatic carbons across all the crystal structures 885 00:36:44,790 --> 00:36:47,290 and determine how often those distances occur. 886 00:36:47,290 --> 00:36:49,700 And you can actually turn those observations, then, 887 00:36:49,700 --> 00:36:51,890 into a potential energy function by simply 888 00:36:51,890 --> 00:36:53,980 using Boltzmann's equation. 889 00:36:53,980 --> 00:36:56,380 So we can figure out how frequently 890 00:36:56,380 --> 00:36:58,537 we get certain distances-- on the x-axis 891 00:36:58,537 --> 00:37:01,860 is distance, on the y-axis is frequency, number of entries 892 00:37:01,860 --> 00:37:07,770 in the crystal structure-- and then by Boltzmann's Law, 893 00:37:07,770 --> 00:37:11,152 we can compute the density of states 894 00:37:11,152 --> 00:37:13,610 over some reference, which is actually very hard to define. 895 00:37:13,610 --> 00:37:15,850 And you can look at some of the references referred 896 00:37:15,850 --> 00:37:18,350 to in the slides to figure out how currently that's defined, 897 00:37:18,350 --> 00:37:21,010 but we have to find some arbitrary reference state 898 00:37:21,010 --> 00:37:24,270 to figure out the probability of being in any one of these states 899 00:37:24,270 --> 00:37:26,880 is going to be a function, a logarithmic function, 900 00:37:26,880 --> 00:37:29,837 of the frequency of those states.
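The step from observed frequencies to an energy can be sketched in a few lines of Python. This is only a minimal illustration of Boltzmann inversion, E = -kT ln(P_obs / P_ref), not the procedure any particular program uses. The function name `boltzmann_inversion`, the pseudocount, and the choice of reference counts are all assumptions made for the example; as noted above, defining the reference state is the hard part in practice.

```python
import math

KT_KCAL_PER_MOL = 0.593  # kT at roughly 298 K, in kcal/mol


def boltzmann_inversion(observed_counts, reference_counts,
                        kt=KT_KCAL_PER_MOL, pseudocount=1.0):
    """Turn histogram counts per distance bin into a statistical energy per bin.

    observed_counts:  how often each distance bin is seen in the structures
    reference_counts: how often each bin would be seen in the (assumed)
                      reference state
    The pseudocount is a hypothetical smoothing choice that avoids log(0).
    """
    n_bins = len(observed_counts)
    obs_total = sum(observed_counts) + pseudocount * n_bins
    ref_total = sum(reference_counts) + pseudocount * n_bins
    energies = []
    for obs, ref in zip(observed_counts, reference_counts):
        p_obs = (obs + pseudocount) / obs_total
        p_ref = (ref + pseudocount) / ref_total
        # Bins seen more often than the reference get a negative (favorable) energy.
        energies.append(-kt * math.log(p_obs / p_ref))
    return energies


# Made-up counts for five distance bins, compared against a flat reference.
observed = [0, 5, 50, 20, 10]
reference = [17, 17, 17, 17, 17]
energies = boltzmann_inversion(observed, reference)
```

With these made-up counts, the bin that is observed most often gets the lowest energy and the bin that is never observed gets the highest, which is all the potential is meant to encode.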
901 00:37:29,837 --> 00:37:32,170 All right, so we've got an energy term that's determined 902 00:37:32,170 --> 00:37:35,200 solely by the observations of distances, 903 00:37:35,200 --> 00:37:37,670 that doesn't say I know that this one's charged and this one 904 00:37:37,670 --> 00:37:38,169 isn't. 905 00:37:38,169 --> 00:37:40,730 It just says here's an oxygen attached 906 00:37:40,730 --> 00:37:42,230 to a carbon with double bonds. 907 00:37:42,230 --> 00:37:44,484 Here's a carbon that's not. 908 00:37:44,484 --> 00:37:46,400 How often are they at any particular distance? 909 00:37:46,400 --> 00:37:48,610 And we go through lots and lots of other properties, 910 00:37:48,610 --> 00:37:51,630 and we'll go into detail now to what those other terms are 911 00:37:51,630 --> 00:37:54,630 to look through high resolution crystal structures, 912 00:37:54,630 --> 00:37:57,191 see what certain properties are, turn those 913 00:37:57,191 --> 00:37:59,190 into potential energy functions that we can then 914 00:37:59,190 --> 00:38:02,750 use to identify the optimum rotations for the side 915 00:38:02,750 --> 00:38:05,558 chain and the backbone. 916 00:38:05,558 --> 00:38:08,100 Oh, and I should also point out that when we do this, 917 00:38:08,100 --> 00:38:09,600 we'll have different terms for different things. 918 00:38:09,600 --> 00:38:11,730 We'll have a term for distances between different kinds 919 00:38:11,730 --> 00:38:12,290 of atoms. 920 00:38:12,290 --> 00:38:16,659 We'll have terms for some of these other pieces 921 00:38:16,659 --> 00:38:19,200 of potential energy that we'll describe in subsequent slides, 922 00:38:19,200 --> 00:38:20,575 and we're going to need to decide 923 00:38:20,575 --> 00:38:23,350 how to weight all of those, all those independent terms, 924 00:38:23,350 --> 00:38:25,126 to get them to give us reasonable protein 925 00:38:25,126 --> 00:38:26,250 structures when we're done.
926 00:38:26,250 --> 00:38:28,830 And that, once again, is a curve fitting exercise, 927 00:38:28,830 --> 00:38:31,400 finding the numbers that best fit the data 928 00:38:31,400 --> 00:38:35,640 without any guiding physical principle underneath it. 929 00:38:35,640 --> 00:38:37,060 So you'll be using PyRosetta. 930 00:38:37,060 --> 00:38:38,830 And in PyRosetta, you'll see the terms 931 00:38:38,830 --> 00:38:40,430 on the board for the potential energy 932 00:38:40,430 --> 00:38:42,920 functions, the different features 933 00:38:42,920 --> 00:38:44,330 of the potential energy function, 934 00:38:44,330 --> 00:38:45,996 and I'll step you through a few of these 935 00:38:45,996 --> 00:38:48,850 just so you know what you're using. 936 00:38:48,850 --> 00:38:51,670 There'll also be files in the PyRosetta installation 937 00:38:51,670 --> 00:38:53,830 that will give you the relative weights for each 938 00:38:53,830 --> 00:38:55,310 of these terms. 939 00:38:55,310 --> 00:38:58,750 OK, so these first are the van der Waals, 940 00:38:58,750 --> 00:39:02,460 and here, the shape of the curve looks just like we saw before. 941 00:39:02,460 --> 00:39:04,560 It has to, in some sense because they're 942 00:39:04,560 --> 00:39:06,490 trying to solve the same physical problem, 943 00:39:06,490 --> 00:39:08,570 but the motivation is very different. 944 00:39:08,570 --> 00:39:10,820 There's no attempt to decide that it should be a 1 945 00:39:10,820 --> 00:39:13,230 over r to the 6th because of dipole-dipole interactions. 946 00:39:13,230 --> 00:39:16,050 It's simply, how do I find the function that accurately 947 00:39:16,050 --> 00:39:19,440 represents what I see in the database? 948 00:39:19,440 --> 00:39:22,510 So again, computed, this is the fa attractive and the fa 949 00:39:22,510 --> 00:39:24,800 repulsive, and those are determined 950 00:39:24,800 --> 00:39:26,300 based on the statistics of what's 951 00:39:26,300 --> 00:39:29,930 observed in the crystal structures.
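The way individually derived terms are combined with fitted weights can be sketched as a weighted sum. The block below is plain Python and does not call PyRosetta. The term names follow the labels mentioned in the lecture, but every weight and raw value is a made-up placeholder, and `total_score` is a hypothetical helper; the real relative weights are the ones in the files that come with the installation.

```python
# Hypothetical weights, chosen only for illustration. The fitted values live
# in the weight files that ship with the software.
EXAMPLE_WEIGHTS = {
    "fa_atr": 1.0,       # attractive part of the van der Waals-like term
    "fa_rep": 0.5,       # repulsive part
    "hbond_sr_bb": 1.0,  # short-range backbone hydrogen bonds
    "rama": 0.25,        # backbone phi/psi preference
}


def total_score(term_values, weights=EXAMPLE_WEIGHTS):
    """Combine independently computed energy terms into one number.

    term_values maps a term name to its unweighted value for one structure.
    Terms missing from term_values contribute nothing.
    """
    return sum(weight * term_values.get(name, 0.0)
               for name, weight in weights.items())


# Unweighted term values for an imaginary structure.
raw_terms = {"fa_atr": -10.0, "fa_rep": 2.0, "hbond_sr_bb": -3.0, "rama": 4.0}
score = total_score(raw_terms)
```

Changing a weight changes which structures look best, which is why choosing the weights is itself a fitting problem against known structures.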
952 00:39:29,930 --> 00:39:35,590 This one, the hbond, breaks down into backbone and side chain, 953 00:39:35,590 --> 00:39:38,240 long range and short range. 954 00:39:38,240 --> 00:39:40,810 And the goal of the hbonds-- so hydrogen bonds 955 00:39:40,810 --> 00:39:42,810 are one of the principal determinants of protein 956 00:39:42,810 --> 00:39:45,430 structure, and you'll see that in the reading materials 957 00:39:45,430 --> 00:39:46,590 that are posted online. 958 00:39:46,590 --> 00:39:48,756 And one of the critical things about a hydrogen bond 959 00:39:48,756 --> 00:39:50,730 is that it needs to be nearly planar. 960 00:39:50,730 --> 00:39:59,164 So the line between-- the angle between this atom, 961 00:39:59,164 --> 00:40:01,330 which has the hydrogen attached, and this one, which 962 00:40:01,330 --> 00:40:03,170 is the free electron pair, has to be 963 00:40:03,170 --> 00:40:04,710 as close to linear as possible. 964 00:40:04,710 --> 00:40:06,300 And the more it deviates from linear, 965 00:40:06,300 --> 00:40:09,250 the weaker the hydrogen bond will be. 966 00:40:09,250 --> 00:40:11,970 And so this hydrogen bonding potential 967 00:40:11,970 --> 00:40:15,040 has terms that describe the distance between the atoms that 968 00:40:15,040 --> 00:40:17,400 are donating and accepting the hydrogen 969 00:40:17,400 --> 00:40:19,420 as well as the angle between them, 970 00:40:19,420 --> 00:40:21,870 and it's been parameterized to represent, separately, 971 00:40:21,870 --> 00:40:24,540 things that are far from each other, close to each other, 972 00:40:24,540 --> 00:40:26,470 things that are side chain, or main chain. 973 00:40:26,470 --> 00:40:28,730 And here's where it's really the statistician 974 00:40:28,730 --> 00:40:30,460 against the physicist. 975 00:40:30,460 --> 00:40:33,630 Why divide up side chain and main chain? 976 00:40:33,630 --> 00:40:37,500 There's no physical principle that drives you to do that. 
977 00:40:37,500 --> 00:40:40,440 It's simply because that's what gives the best fit to the data, 978 00:40:40,440 --> 00:40:43,540 so the statistician is not afraid to add terms that 979 00:40:43,540 --> 00:40:46,010 make their models better fit reality, 980 00:40:46,010 --> 00:40:50,800 even if they don't represent any fundamental physical principle. 981 00:40:50,800 --> 00:40:53,255 And we'll see it gets even more dramatic 982 00:40:53,255 --> 00:40:55,190 with some of these other terms. 983 00:40:55,190 --> 00:40:57,667 So this is the Ramachandran plot, 984 00:40:57,667 --> 00:40:59,250 which you'll also see in your reading. 985 00:40:59,250 --> 00:41:01,230 It represents the observed frequencies 986 00:41:01,230 --> 00:41:03,490 of phi and the psi angles. 987 00:41:03,490 --> 00:41:05,960 And as you know, there are only a couple of positions 988 00:41:05,960 --> 00:41:08,839 on this phi and psi plot that are frequently observed, 989 00:41:08,839 --> 00:41:11,130 representing the different regular secondary structures, 990 00:41:11,130 --> 00:41:14,600 primarily alpha helix and beta sheet, as indicated. 991 00:41:14,600 --> 00:41:17,400 And rather than trying to capture the fact that proteins 992 00:41:17,400 --> 00:41:18,950 should form alpha helices by having 993 00:41:18,950 --> 00:41:21,890 really good forces all around, they simply 994 00:41:21,890 --> 00:41:25,217 prefer angles that are observed in the Ramachandran plot. 995 00:41:25,217 --> 00:41:27,300 So we're going to give a potential energy function 996 00:41:27,300 --> 00:41:30,340 that's going to penalize you if your phi and psi ends up 997 00:41:30,340 --> 00:41:33,220 over here, and reward you if your phi and psi ends up 998 00:41:33,220 --> 00:41:34,820 in one of these positions. 999 00:41:34,820 --> 00:41:37,250 So from the physicist, this is cheating, 1000 00:41:37,250 --> 00:41:41,250 and for the statistician, it makes perfect sense. 1001 00:41:41,250 --> 00:41:43,906 Shouldn't laugh at that.
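A Ramachandran-style term of this kind amounts to a lookup table over binned phi and psi angles. The sketch below is a toy version under stated assumptions: the 30 degree bin width, the pseudocount, and the function names `build_rama_table` and `rama_score` are choices made for this example, and the "observed" angles are invented rather than taken from a structure database.

```python
import math
from collections import Counter

BIN_WIDTH = 30  # degrees per bin; an arbitrary choice for this sketch


def bin_index(angle):
    """Map an angle in degrees (any range) onto a bin number 0..11."""
    return int((angle + 180.0) % 360.0 // BIN_WIDTH)


def build_rama_table(observed_phi_psi, pseudocount=1.0):
    """Score every (phi, psi) bin as -ln(smoothed observed frequency)."""
    n_bins = 360 // BIN_WIDTH
    counts = Counter((bin_index(phi), bin_index(psi))
                     for phi, psi in observed_phi_psi)
    total = len(observed_phi_psi) + pseudocount * n_bins * n_bins
    return {(i, j): -math.log((counts[(i, j)] + pseudocount) / total)
            for i in range(n_bins) for j in range(n_bins)}


def rama_score(table, phi, psi):
    """Lower scores mean the angle pair is seen more often."""
    return table[(bin_index(phi), bin_index(psi))]


# Invented observations: many helix-like pairs, fewer sheet-like pairs,
# and a couple of rare ones.
observed = [(-60, -45)] * 50 + [(-120, 130)] * 30 + [(60, 45)] * 2
table = build_rama_table(observed)
```

A conformation in a well-populated bin is rewarded with a low score and one in an empty bin is penalized, with no force calculation involved.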
1002 00:41:43,906 --> 00:41:46,030 OK, and the same will be true for the rotamers. 1003 00:41:46,030 --> 00:41:47,571 So we said that, for the side chains, 1004 00:41:47,571 --> 00:41:49,920 there are certain angles that we prefer over others 1005 00:41:49,920 --> 00:41:51,410 because that's what we observe in the database. 1006 00:41:51,410 --> 00:41:53,118 Again, we're not going to try to get them 1007 00:41:53,118 --> 00:41:55,500 by making sure that there's repulsion between these two 1008 00:41:55,500 --> 00:41:56,910 atoms when they're eclipsed. 1009 00:41:56,910 --> 00:41:58,910 We're going to get there simply by saying 1010 00:41:58,910 --> 00:42:01,050 the potential energy is lower when you're 1011 00:42:01,050 --> 00:42:02,830 in one of these staggered conformations 1012 00:42:02,830 --> 00:42:04,705 than when you're in one of the eclipsed conformations. 1013 00:42:07,770 --> 00:42:09,640 OK, now, the place where the difference 1014 00:42:09,640 --> 00:42:12,670 between the statistician and the physicist is most dramatic 1015 00:42:12,670 --> 00:42:16,690 comes when we look at the solvation terms. 1016 00:42:16,690 --> 00:42:18,909 So a lot of what goes on in protein structure-- 1017 00:42:18,909 --> 00:42:20,700 determines protein structure, I should say, 1018 00:42:20,700 --> 00:42:23,930 is the interaction of the protein with water. 1019 00:42:23,930 --> 00:42:27,850 It's bathed in a bath of 55 molar water molecules, 1020 00:42:27,850 --> 00:42:28,516 highly polar. 1021 00:42:28,516 --> 00:42:30,640 They normally are hydrogen bonding with each other. 1022 00:42:30,640 --> 00:42:32,600 When the protein sits in there, the protein 1023 00:42:32,600 --> 00:42:34,620 has to start hydrogen bonding with them. 1024 00:42:34,620 --> 00:42:36,680 And where do we find hydrophobic residues 1025 00:42:36,680 --> 00:42:39,430 in a protein structure, with your hands? 1026 00:42:39,430 --> 00:42:40,750 Outside or inside? 1027 00:42:40,750 --> 00:42:41,582 Inside, right?
1028 00:42:41,582 --> 00:42:44,040 So the hydrophobic residues are all going to be buried inside. 1029 00:42:44,040 --> 00:42:45,662 Why is that? 1030 00:42:45,662 --> 00:42:47,330 Well, it's actually really, really hard 1031 00:42:47,330 --> 00:42:49,163 to describe in terms of fundamental physical 1032 00:42:49,163 --> 00:42:50,270 principles. 1033 00:42:50,270 --> 00:42:52,940 In fact, it's really hard to describe the structure of water 1034 00:42:52,940 --> 00:42:54,540 by fundamental physical principles. 1035 00:42:54,540 --> 00:42:56,340 Simulations that try to get water to freeze 1036 00:42:56,340 --> 00:42:58,270 were only successful a few years ago. 1037 00:42:58,270 --> 00:43:00,300 So we've tried to simulate water using 1038 00:43:00,300 --> 00:43:01,390 basic physical principles. 1039 00:43:01,390 --> 00:43:03,566 It's very hard to get it to form ice 1040 00:43:03,566 --> 00:43:05,190 when you lower the temperature, so it's 1041 00:43:05,190 --> 00:43:06,981 going to be even harder, then, to represent 1042 00:43:06,981 --> 00:43:10,370 how a complicated protein structure immersed in the water 1043 00:43:10,370 --> 00:43:14,750 actually interacts with those water molecules. 1044 00:43:14,750 --> 00:43:17,200 So you've got all these water molecules interacting 1045 00:43:17,200 --> 00:43:19,300 with polar residues or non-polar residues. 1046 00:43:19,300 --> 00:43:21,936 The physicist really struggles to represent those. 1047 00:43:21,936 --> 00:43:23,310 And just to show you why that is, 1048 00:43:23,310 --> 00:43:24,570 let me show you, again, a little movie. 1049 00:43:24,570 --> 00:43:26,770 Unfortunately, no new age music with this one. 1050 00:43:26,770 --> 00:43:27,310 I apologize. 1051 00:43:51,820 --> 00:43:54,940 So what's shown here is a sphere immersed 1052 00:43:54,940 --> 00:43:56,310 in a bunch of water molecules. 1053 00:43:56,310 --> 00:43:58,310 The red is the oxygen.
1054 00:43:58,310 --> 00:44:01,415 The little white parts are the hydrogens. 1055 00:44:01,415 --> 00:44:02,790 You can see them wiggling around. 1056 00:44:02,790 --> 00:44:05,900 And what's the fundamental feature that you observe? 1057 00:44:05,900 --> 00:44:07,800 All right, they're forming almost a cage 1058 00:44:07,800 --> 00:44:09,420 around this hydrophobic molecule. 1059 00:44:09,420 --> 00:44:10,170 Why is that? 1060 00:44:14,170 --> 00:44:14,670 Yeah? 1061 00:44:14,670 --> 00:44:16,336 AUDIENCE: It's hard for them to interact 1062 00:44:16,336 --> 00:44:18,170 with a non-polar residue. 1063 00:44:18,170 --> 00:44:19,819 PROFESSOR: Right, so it's hard for them 1064 00:44:19,819 --> 00:44:21,360 to interact with a non-polar residue. 1065 00:44:21,360 --> 00:44:23,332 So the water molecules want to minimize 1066 00:44:23,332 --> 00:44:24,290 their potential energy. 1067 00:44:24,290 --> 00:44:26,070 They're going to do that by forming 1068 00:44:26,070 --> 00:44:28,070 hydrogen bonds with something. 1069 00:44:28,070 --> 00:44:31,310 In bulk solvent, they form it with other water molecules. 1070 00:44:31,310 --> 00:44:35,250 Here, they can't form any hydrogen bonds with a sphere, 1071 00:44:35,250 --> 00:44:39,440 so they have to do this complicated dance 1072 00:44:39,440 --> 00:44:42,050 to try to form hydrogen bonds with each other 1073 00:44:42,050 --> 00:44:44,680 with this thing stuck in the middle of them. 1074 00:44:44,680 --> 00:44:47,540 And this is, at its heart, the fundamental driving force 1075 00:44:47,540 --> 00:44:49,150 behind the hydrophobic effect, that 1076 00:44:49,150 --> 00:44:51,524 which causes the hydrophobic residues to be buried inside 1077 00:44:51,524 --> 00:44:52,870 of the protein. 1078 00:44:52,870 --> 00:44:55,720 Very, very hard, as I said, to simulate 1079 00:44:55,720 --> 00:44:57,770 using fundamental physical forces. 1080 00:44:57,770 --> 00:44:59,350 So what does the statistician do?
1081 00:45:02,740 --> 00:45:06,730 The statistician has a mixture of experimental observation 1082 00:45:06,730 --> 00:45:10,210 and statistics at their benefit, so we 1083 00:45:10,210 --> 00:45:13,360 can measure how hydrophobic any molecule is. 1084 00:45:13,360 --> 00:45:17,190 We can take carbons and drop them into non-polar solvents, 1085 00:45:17,190 --> 00:45:20,900 into polar solvents, and determine what fraction of time 1086 00:45:20,900 --> 00:45:23,134 a molecule will spend in a polar environment 1087 00:45:23,134 --> 00:45:25,050 versus a non-polar environment, and from that, 1088 00:45:25,050 --> 00:45:28,350 get a free energy for the transfer of any atom 1089 00:45:28,350 --> 00:45:32,750 from a hydrophobic environment to a hydrophilic environment. 1090 00:45:32,750 --> 00:45:38,280 That can give us this delta G ref, shown over here. 1091 00:45:38,280 --> 00:45:42,460 OK, now, in a protein, that molecule 1092 00:45:42,460 --> 00:45:46,087 is not fully solvent exposed even when it's on the surface, 1093 00:45:46,087 --> 00:45:47,670 because water molecules trying to come 1094 00:45:47,670 --> 00:45:49,420 at it from this direction can't get to it, 1095 00:45:49,420 --> 00:45:51,240 from this direction can't get to it. 1096 00:45:51,240 --> 00:45:54,230 So the transfer energy for this carbon 1097 00:45:54,230 --> 00:45:57,130 to go from fully solvent exposed to buried 1098 00:45:57,130 --> 00:46:01,140 is different from the isolated carbon. 1099 00:46:01,140 --> 00:46:02,730 And so the statistician says, OK, I'll 1100 00:46:02,730 --> 00:46:04,438 come up with a function to describe that. 1101 00:46:04,438 --> 00:46:07,910 I will describe what else is near this atom 1102 00:46:07,910 --> 00:46:09,486 in the rest of the protein structure. 1103 00:46:09,486 --> 00:46:11,110 That's what the term on the right does.
1104 00:46:11,110 --> 00:46:14,507 It's a sum over all other neighboring atoms 1105 00:46:14,507 --> 00:46:16,590 and describes the volume of the neighboring group. 1106 00:46:16,590 --> 00:46:18,830 Is the thing next to it really big or really small? 1107 00:46:18,830 --> 00:46:20,030 Usually not described, necessarily, 1108 00:46:20,030 --> 00:46:20,946 at the level of atoms. 1109 00:46:20,946 --> 00:46:22,510 It might be side chains depending 1110 00:46:22,510 --> 00:46:24,180 on which program is doing it. 1111 00:46:24,180 --> 00:46:26,530 But I have some measure of the volume of the neighbors. 1112 00:46:26,530 --> 00:46:28,488 If that volume is really large, then this thing 1113 00:46:28,488 --> 00:46:30,420 is already in a hydrophobic environment 1114 00:46:30,420 --> 00:46:32,150 even when it's taking water because it's 1115 00:46:32,150 --> 00:46:34,130 surrounded by bulky things. 1116 00:46:34,130 --> 00:46:35,700 If the neighbors are small, then it's 1117 00:46:35,700 --> 00:46:38,620 a more hydrophilic environment when it's taking in water, 1118 00:46:38,620 --> 00:46:40,870 and that's going to modulate this free energy. 1119 00:46:43,600 --> 00:46:46,810 Is this function clear? 1120 00:46:46,810 --> 00:46:51,440 OK, so by combining this observation from small molecule 1121 00:46:51,440 --> 00:46:54,257 transfer experiments and these observations 1122 00:46:54,257 --> 00:46:55,840 based on the structure of the protein, 1123 00:46:55,840 --> 00:46:58,600 we can get an approximation for the hydrophobic effect. 1124 00:46:58,600 --> 00:47:00,310 How expensive is it to have this piece 1125 00:47:00,310 --> 00:47:03,644 of the protein in solvent versus in the hydrophobic core? 1126 00:47:03,644 --> 00:47:05,810 And again, we never had to do any quantum mechanical 1127 00:47:05,810 --> 00:47:06,510 calculations. 
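The shape of that solvation function, a reference transfer energy reduced by a sum over neighbor volumes, can be sketched as follows. This is a loose illustration and not the functional form used by any specific program. The Gaussian distance falloff, the `density_scale` and `length_scale` constants, and the name `solvation_energy` are all assumptions made for the example.

```python
import math


def solvation_energy(dg_ref, neighbors, density_scale=0.01, length_scale=3.5):
    """Toy solvation term: dG = dG_ref - sum_j f(r_ij) * V_j.

    dg_ref:    reference transfer free energy for the fully exposed atom
    neighbors: list of (distance in angstroms, neighbor volume) pairs
    f(r) is taken here to be dg_ref * density_scale * exp(-(r/length_scale)^2),
    a hypothetical choice. With these arbitrary constants a very crowded atom
    could have its term change sign; a real parameterization would prevent that.
    """
    shielding = sum(dg_ref * density_scale * math.exp(-(r / length_scale) ** 2) * volume
                    for r, volume in neighbors)
    return dg_ref - shielding


# A polar atom (favorable in water, so negative dg_ref) in three settings.
exposed = solvation_energy(-5.0, [])
small_neighbors = solvation_energy(-5.0, [(4.0, 10.0)] * 3)
bulky_neighbors = solvation_energy(-5.0, [(4.0, 30.0)] * 3)
```

With no neighbors the atom keeps its full reference energy, and bulkier neighbors strip away more of it, which is the modulation described above. No water molecules appear anywhere in the calculation.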
1128 00:47:06,510 --> 00:47:08,354 We never had to actually explicitly compute 1129 00:47:08,354 --> 00:47:10,270 the interaction of this molecule with solvent. 1130 00:47:10,270 --> 00:47:13,164 We don't need any water in the structure. 1131 00:47:13,164 --> 00:47:15,080 It's simply the geometry of the protein that's 1132 00:47:15,080 --> 00:47:17,340 going to give us a good approximation to the energy 1133 00:47:17,340 --> 00:47:18,820 function. 1134 00:47:18,820 --> 00:47:22,950 All right, so you can look through all the details 1135 00:47:22,950 --> 00:47:26,200 of these online in the Rosetta documentation 1136 00:47:26,200 --> 00:47:28,430 that we provided to get a better sense of what 1137 00:47:28,430 --> 00:47:30,300 all these functions are, but you can 1138 00:47:30,300 --> 00:47:32,480 see there are a lot of terms. 1139 00:47:32,480 --> 00:47:34,011 It's increasingly incremental. 1140 00:47:34,011 --> 00:47:35,760 You find something wrong with your models. 1141 00:47:35,760 --> 00:47:37,650 You add a term to try to account for that. 1142 00:47:37,650 --> 00:47:40,690 Again, not driven necessarily by the physical forces. 1143 00:47:40,690 --> 00:47:43,470 OK, so what have we seen so far? 1144 00:47:43,470 --> 00:47:45,180 We've seen the motivation for this unit, 1145 00:47:45,180 --> 00:47:46,710 to begin with protein structures, 1146 00:47:46,710 --> 00:47:48,085 that the protein structure really 1147 00:47:48,085 --> 00:47:50,700 helps us understand the biological molecules that we're 1148 00:47:50,700 --> 00:47:51,755 looking at. 1149 00:47:51,755 --> 00:47:54,130 These structures are going to influence our understanding 1150 00:47:54,130 --> 00:47:56,760 of all biology, so we need to be good at predicting 1151 00:47:56,760 --> 00:47:58,830 these protein structures or solving them 1152 00:47:58,830 --> 00:48:02,160 when we have experimental data. 
1153 00:48:02,160 --> 00:48:03,862 The computational methods that we're 1154 00:48:03,862 --> 00:48:05,320 going to use-- we're going to focus 1155 00:48:05,320 --> 00:48:07,652 on solving protein structures de novo, predicting them, 1156 00:48:07,652 --> 00:48:09,110 but those same techniques are going 1157 00:48:09,110 --> 00:48:10,740 to underlie the methods that are used 1158 00:48:10,740 --> 00:48:13,264 to solve x-ray crystallography and NMR. 1159 00:48:13,264 --> 00:48:15,430 And fundamentally then, we have these two approaches 1160 00:48:15,430 --> 00:48:17,630 to describing the potential energy. 1161 00:48:17,630 --> 00:48:21,310 That's the statistician and the physicist's approach. 1162 00:48:21,310 --> 00:48:23,400 And remember, the key simplifications 1163 00:48:23,400 --> 00:48:27,330 of the statistician are that we used a fixed geometry. 1164 00:48:27,330 --> 00:48:28,960 We're not trying to figure out the XYZ 1165 00:48:28,960 --> 00:48:30,230 coordinates of every atom. 1166 00:48:30,230 --> 00:48:33,800 We're simply trying to figure out the bond angles. 1167 00:48:33,800 --> 00:48:35,440 We're going to use rotamers, so we're 1168 00:48:35,440 --> 00:48:37,320 going to turn our continuous choices often 1169 00:48:37,320 --> 00:48:38,190 into discrete ones. 1170 00:48:38,190 --> 00:48:40,190 And we're going to derive statistical potentials 1171 00:48:40,190 --> 00:48:42,950 to represent the potential energy, which may or may not 1172 00:48:42,950 --> 00:48:44,994 have a clear physical basis. 1173 00:48:44,994 --> 00:48:47,410 All right, so let's start with a little thought experiment 1174 00:48:47,410 --> 00:48:49,200 as we try to get into some of these prediction algorithms. 1175 00:48:49,200 --> 00:48:50,620 So I have a sequence. 1176 00:48:50,620 --> 00:48:53,700 It's about, I don't know, 100 amino acids long, 1177 00:48:53,700 --> 00:48:55,270 and here are two protein structures.
1178 00:48:55,270 --> 00:48:56,980 One is predominantly alpha helical. 1179 00:48:56,980 --> 00:48:58,620 One is predominantly beta sheet. 1180 00:48:58,620 --> 00:49:00,859 How could I tell-- this is not a rhetorical question. 1181 00:49:00,859 --> 00:49:02,150 I want you to think for a second. 1182 00:49:02,150 --> 00:49:05,117 How could I tell whether the sequence prefers 1183 00:49:05,117 --> 00:49:07,450 the structure on the top or the structure on the bottom? 1184 00:49:11,550 --> 00:49:13,650 So we have, actually, a lot of the tools in place. 1185 00:49:13,650 --> 00:49:14,530 Yes, in the back. 1186 00:49:14,530 --> 00:49:19,480 AUDIENCE: Can you, based on previously known sequences, 1187 00:49:19,480 --> 00:49:22,945 know which sequence is predominant in which 1188 00:49:22,945 --> 00:49:23,950 [INAUDIBLE]? 1189 00:49:23,950 --> 00:49:26,100 PROFESSOR: OK, so the answer was we 1190 00:49:26,100 --> 00:49:28,395 could look at previously known sequences. 1191 00:49:28,395 --> 00:49:30,270 We can look for homology, and that's actually 1192 00:49:30,270 --> 00:49:31,770 going to be a very powerful tool. 1193 00:49:31,770 --> 00:49:35,570 So if there is a homologue in the database that is closely 1194 00:49:35,570 --> 00:49:39,390 related to this protein, and it has a known structure, 1195 00:49:39,390 --> 00:49:41,810 then problem solved. 1196 00:49:41,810 --> 00:49:43,340 What if there isn't? 1197 00:49:43,340 --> 00:49:45,400 What's my next step? 1198 00:49:45,400 --> 00:49:46,315 Yes? 1199 00:49:46,315 --> 00:49:49,780 AUDIENCE: What if you start with a description 1200 00:49:49,780 --> 00:49:55,720 of the secondary structure, say the helices and the sheet, 1201 00:49:55,720 --> 00:50:00,175 and you counted how often a particular amino acid showed up 1202 00:50:00,175 --> 00:50:02,650 in each of those structures? 1203 00:50:02,650 --> 00:50:05,850 Could you then compute maybe a likelihood 1204 00:50:05,850 --> 00:50:07,702 across a stretch of amino acids?
1205 00:50:07,702 --> 00:50:08,410 PROFESSOR: Great. 1206 00:50:08,410 --> 00:50:10,430 So that answer was what if I looked 1207 00:50:10,430 --> 00:50:12,880 at these alpha helices and beta sheets 1208 00:50:12,880 --> 00:50:15,850 and computed how often certain amino acids occur 1209 00:50:15,850 --> 00:50:18,050 in alpha helices versus beta sheets, 1210 00:50:18,050 --> 00:50:20,932 and then I looked in my protein structure 1211 00:50:20,932 --> 00:50:23,140 and checked whether I have the right amino acids that 1212 00:50:23,140 --> 00:50:25,348 are more favorable in alpha helices or beta sheets. 1213 00:50:25,348 --> 00:50:28,220 And we'll see that's an approach that's been used successfully. 1214 00:50:28,220 --> 00:50:30,420 That's secondary structure prediction. 1215 00:50:30,420 --> 00:50:31,910 OK, other ideas. 1216 00:50:31,910 --> 00:50:33,300 Yep? 1217 00:50:33,300 --> 00:50:37,670 AUDIENCE: So if you have the position of the 3D structure, 1218 00:50:37,670 --> 00:50:39,670 you can feed your sequence through the structure 1219 00:50:39,670 --> 00:50:43,100 and then put it through your energy function, 1220 00:50:43,100 --> 00:50:46,285 see which one is the lower [INAUDIBLE]. 1221 00:50:46,285 --> 00:50:47,160 PROFESSOR: Excellent. 1222 00:50:47,160 --> 00:50:51,280 So another thing I can do is, if I have these two structures, I 1223 00:50:51,280 --> 00:50:53,450 have their precise three-dimensional structures, 1224 00:50:53,450 --> 00:50:57,010 I could try to put my sequence onto that structure, 1225 00:50:57,010 --> 00:50:59,590 actually put the right side chains for my sequence 1226 00:50:59,590 --> 00:51:02,189 into that backbone conformation. 1227 00:51:02,189 --> 00:51:03,230 And then what would I do? 1228 00:51:03,230 --> 00:51:05,710 I would actually measure the potential energy 1229 00:51:05,710 --> 00:51:09,170 of the protein in the top structure and the potential energy 1230 00:51:09,170 --> 00:51:11,110 of the protein in the bottom structure.
1231 00:51:11,110 --> 00:51:12,630 If the potential energy is higher, 1232 00:51:12,630 --> 00:51:15,360 is that the favorable structure or the unfavorable structure? 1233 00:51:15,360 --> 00:51:16,990 Favorable? 1234 00:51:16,990 --> 00:51:18,074 Unfavorable? 1235 00:51:18,074 --> 00:51:19,240 Right, it's the unfavorable. 1236 00:51:19,240 --> 00:51:21,460 So I want the lower free energy structure. 1237 00:51:21,460 --> 00:51:25,480 OK, so let's think about-- that's correct, 1238 00:51:25,480 --> 00:51:26,759 and that's where we're headed. 1239 00:51:26,759 --> 00:51:28,800 But what are going to be some of the complexities 1240 00:51:28,800 --> 00:51:31,020 of that approach? 1241 00:51:31,020 --> 00:51:33,980 So first of all, what about these side chains? 1242 00:51:33,980 --> 00:51:36,330 I have to now take a backbone structure that 1243 00:51:36,330 --> 00:51:38,399 had some other amino acid sequence on it, 1244 00:51:38,399 --> 00:51:40,190 and I have to put these new side chains on. 1245 00:51:40,190 --> 00:51:41,140 Right? 1246 00:51:41,140 --> 00:51:45,030 If I put those on in the wrong way-- let's say, 1247 00:51:45,030 --> 00:51:47,224 this is the true one-- let's say one of these 1248 00:51:47,224 --> 00:51:48,140 is the true structure. 1249 00:51:48,140 --> 00:51:49,730 Let's begin with a simplification. 1250 00:51:49,730 --> 00:51:52,809 All right, so let's say your fiendish labmate has actually 1251 00:51:52,809 --> 00:51:54,350 solved the structure of your protein, 1252 00:51:54,350 --> 00:51:56,090 but refuses to tell you what the answer is. 1253 00:51:56,090 --> 00:51:59,350 AUDIENCE: [LAUGHTER] 1254 00:51:59,350 --> 00:52:01,810 PROFESSOR: And she actually has solved two structures, 1255 00:52:01,810 --> 00:52:03,500 neither one of which she's going to give you the sequence to. 1256 00:52:03,500 --> 00:52:05,020 But she's giving you the coordinates for both of them. 1257 00:52:05,020 --> 00:52:06,330 They're the same length.
1258 00:52:06,330 --> 00:52:09,090 And so she asks you, ha, you took 791. 1259 00:52:09,090 --> 00:52:10,350 You can figure this out. 1260 00:52:10,350 --> 00:52:13,581 Tell me whether your sequence is actually 1261 00:52:13,581 --> 00:52:15,080 in this structure or that structure. 1262 00:52:15,080 --> 00:52:17,010 She says one of them is exactly right. 1263 00:52:17,010 --> 00:52:18,510 You just don't know which one. 1264 00:52:18,510 --> 00:52:20,510 OK, so she gives you the backbone coordinates, 1265 00:52:20,510 --> 00:52:21,080 so you go. 1266 00:52:21,080 --> 00:52:23,320 You put your amino acid sequence, say, 1267 00:52:23,320 --> 00:52:28,070 with Swiss [? PDB. ?] You add to the backbone all the right side 1268 00:52:28,070 --> 00:52:28,570 chains. 1269 00:52:28,570 --> 00:52:30,486 But now, you have to make a bunch of decisions 1270 00:52:30,486 --> 00:52:32,990 for these side chain conformations. 1271 00:52:32,990 --> 00:52:35,710 If you make the wrong decision, what happens? 1272 00:52:35,710 --> 00:52:39,350 Well, you stick this atom close to where some other atom is. 1273 00:52:39,350 --> 00:52:41,820 Now, you've got an optimization problem, right? 1274 00:52:41,820 --> 00:52:43,410 You believe that one of these backbone 1275 00:52:43,410 --> 00:52:44,826 coordinates is correct, but you've 1276 00:52:44,826 --> 00:52:47,970 got a very highly coupled optimization problem. 1277 00:52:47,970 --> 00:52:51,560 You need to figure out the right rotations for every single side 1278 00:52:51,560 --> 00:52:54,040 chain on this protein, and you can't do it one by one. 1279 00:52:54,040 --> 00:52:57,026 You can't take a greedy approach because if I put this side 1280 00:52:57,026 --> 00:52:59,400 chain here, and I put this side chain here, they collide, 1281 00:52:59,400 --> 00:53:01,230 but if this was wrong and supposed to be over there, 1282 00:53:01,230 --> 00:53:02,740 then maybe this is the right conformation.
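Why a one-by-one choice can go wrong is easy to see on a tiny made-up case. The sketch below compares a greedy rotamer assignment with an exhaustive search over two side chains that each have two rotamers. All energies are invented for the illustration, and `total_energy`, `greedy_packing`, and `exhaustive_packing` are hypothetical helper names, not functions from any packing library.

```python
from itertools import product


def total_energy(choice, self_energy, pair_energy):
    """Sum per-rotamer energies plus pairwise clash penalties for a choice."""
    energy = sum(self_energy[i][rot] for i, rot in enumerate(choice))
    for i in range(len(choice)):
        for j in range(i + 1, len(choice)):
            energy += pair_energy.get((i, choice[i], j, choice[j]), 0.0)
    return energy


def greedy_packing(self_energy, pair_energy):
    """Fix side chains one at a time, never revisiting earlier decisions."""
    choice = []
    for energies in self_energy:
        best = min(range(len(energies)),
                   key=lambda rot: total_energy(choice + [rot], self_energy, pair_energy))
        choice.append(best)
    return tuple(choice)


def exhaustive_packing(self_energy, pair_energy):
    """Try every combination; only feasible for very small problems."""
    options = [range(len(energies)) for energies in self_energy]
    return min(product(*options),
               key=lambda c: total_energy(c, self_energy, pair_energy))


# Two side chains, two rotamers each. Rotamer 0 of each is best in isolation,
# but the two rotamer-0 choices collide with each other.
self_energy = [[0.0, 1.0], [0.0, 3.0]]
pair_energy = {(0, 0, 1, 0): 10.0}  # (residue i, rotamer, residue j, rotamer): penalty

greedy = greedy_packing(self_energy, pair_energy)
best = exhaustive_packing(self_energy, pair_energy)
```

Here the greedy pass locks the first side chain into its locally best rotamer and is then forced into a poor choice for the second, while the exhaustive search finds a lower total energy by moving the first side chain. The exhaustive search grows exponentially with the number of side chains, which is why the real problem is expensive.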
1283 00:53:02,740 --> 00:53:04,640 So I have a coupled problem, so it turns out 1284 00:53:04,640 --> 00:53:08,550 to be a computationally expensive thing to compute. 1285 00:53:08,550 --> 00:53:11,380 So we're going to look at what to do if we know the backbone 1286 00:53:11,380 --> 00:53:14,080 conformation, but we don't know the side chain conformation. 1287 00:53:14,080 --> 00:53:15,996 We can try to solve that optimization problem, 1288 00:53:15,996 --> 00:53:18,190 and you'll actually do that in your problem set. 1289 00:53:18,190 --> 00:53:20,000 Now, what if the backbone conformation 1290 00:53:20,000 --> 00:53:21,570 isn't exactly correct? 1291 00:53:21,570 --> 00:53:23,470 So let's say you do what was first suggested, 1292 00:53:23,470 --> 00:53:25,630 and you search the sequence database. 1293 00:53:25,630 --> 00:53:28,660 You take this sequence, and you find that it actually 1294 00:53:28,660 --> 00:53:31,360 has two homologs, two things with sequence 1295 00:53:31,360 --> 00:53:32,210 similarity. 1296 00:53:32,210 --> 00:53:34,550 There are two proteins with 20% sequence identity 1297 00:53:34,550 --> 00:53:37,210 that have completely different structures. 1298 00:53:37,210 --> 00:53:38,770 This one has 20% sequence identity, 1299 00:53:38,770 --> 00:53:40,960 and this one has 20% sequence identity. 1300 00:53:40,960 --> 00:53:43,970 So you have no way of deciding which one's which, right? 1301 00:53:43,970 --> 00:53:47,190 And neither one is going to be the right protein structure. 1302 00:53:47,190 --> 00:53:49,786 So you know that by putting the side chains onto these protein 1303 00:53:49,786 --> 00:53:52,410 structures, you do have to solve those problems with side chain 1304 00:53:52,410 --> 00:53:54,430 optimization, but what, obviously, is the other thing 1305 00:53:54,430 --> 00:53:56,221 that you're going to have to solve?
1306 00:53:57,112 --> 00:53:59,320 All right, you're going to need to solve the backbone 1307 00:53:59,320 --> 00:54:01,710 optimization problem, and this becomes even more 1308 00:54:01,710 --> 00:54:05,040 coupled because when I move this backbone, 1309 00:54:05,040 --> 00:54:06,860 then the side chains move with it. 1310 00:54:06,860 --> 00:54:09,110 So now, I've got a very, very complicated optimization 1311 00:54:09,110 --> 00:54:09,985 problem to deal with. 1312 00:54:09,985 --> 00:54:13,790 The search space is enormous, and even if I discretize it, 1313 00:54:13,790 --> 00:54:15,560 it's still very, very large. 1314 00:54:15,560 --> 00:54:17,160 In fact, there's something famous 1315 00:54:17,160 --> 00:54:18,530 called the Levinthal Paradox. 1316 00:54:18,530 --> 00:54:20,540 Of course, Cy Levinthal, who was once 1317 00:54:20,540 --> 00:54:23,180 upon a time a professor here and then moved to Columbia-- 1318 00:54:23,180 --> 00:54:26,380 he did a back of the envelope calculation 1319 00:54:26,380 --> 00:54:29,110 for extremely simple models of protein structure. 1320 00:54:29,110 --> 00:54:31,560 If you imagine the proteins were to randomly search over 1321 00:54:31,560 --> 00:54:34,710 all possible conformations with very rapid switching 1322 00:54:34,710 --> 00:54:36,180 between possible conformations, it 1323 00:54:36,180 --> 00:54:38,060 would take basically the lifetime 1324 00:54:38,060 --> 00:54:41,240 of the universe for a protein to ever fold. 1325 00:54:41,240 --> 00:54:43,030 So proteins don't do random searches 1326 00:54:43,030 --> 00:54:45,590 over all possible conformations, and they can check out 1327 00:54:45,590 --> 00:54:47,426 conformations incredibly rapidly. 1328 00:54:47,426 --> 00:54:49,050 So we certainly can't do that, so we'll 1329 00:54:49,050 --> 00:54:52,224 look at the optimization techniques.
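The scale of that back-of-the-envelope argument can be reproduced in a few lines. This is a rough sketch under stated assumptions: the chain length, the number of conformations per residue, and the sampling rate below are illustrative round numbers chosen here, not figures quoted in the lecture.

```python
# Illustrative assumptions (not figures from the lecture).
RESIDUES = 100                   # length of a small protein
CONFORMATIONS_PER_RESIDUE = 3    # very coarse discretization of each residue
SAMPLES_PER_SECOND = 1e13        # assumed rate of switching between conformations
SECONDS_PER_YEAR = 3.15e7
AGE_OF_UNIVERSE_YEARS = 1.4e10   # approximate


def exhaustive_search_years(residues=RESIDUES,
                            per_residue=CONFORMATIONS_PER_RESIDUE,
                            rate=SAMPLES_PER_SECOND):
    """Years needed to visit every conformation once at the given rate."""
    total_conformations = per_residue ** residues
    return total_conformations / rate / SECONDS_PER_YEAR


if __name__ == "__main__":
    years = exhaustive_search_years()
    print(f"search time: {years:.2e} years")
    print(f"ratio to age of universe: {years / AGE_OF_UNIVERSE_YEARS:.2e}")
```

Even with only three states per residue, the count of conformations grows as a power of the chain length, so the random search time exceeds the age of the universe by many orders of magnitude for any reasonable sampling rate.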
1330 00:54:52,224 --> 00:54:55,920 All right, so we discussed how to use energy optimization 1331 00:54:55,920 --> 00:54:59,610 functions to try to decide which one's correct, 1332 00:54:59,610 --> 00:55:03,107 and that even if the structure is the correct one, 1333 00:55:03,107 --> 00:55:04,940 we have the side chain optimization problem. 1334 00:55:04,940 --> 00:55:06,940 If the structure's the incorrect one, we've got two problems. 1335 00:55:06,940 --> 00:55:08,570 We've got the backbone conformation and the side 1336 00:55:08,570 --> 00:55:09,270 chain. 1337 00:55:09,270 --> 00:55:12,470 This is frequently called fold recognition or threading. 1338 00:55:12,470 --> 00:55:14,460 This choice of, you've got a protein structure. 1339 00:55:14,460 --> 00:55:16,780 You want to decide if your sequence matches 1340 00:55:16,780 --> 00:55:18,120 this one or that one. 1341 00:55:18,120 --> 00:55:19,620 There are a couple of other problems 1342 00:55:19,620 --> 00:55:21,150 that we're going to look at. 1343 00:55:21,150 --> 00:55:26,260 So this was already raised by one of the students, the idea 1344 00:55:26,260 --> 00:55:28,292 that we try to predict the secondary structure 1345 00:55:28,292 --> 00:55:30,500 of this protein, so we'll look at secondary structure 1346 00:55:30,500 --> 00:55:31,860 prediction algorithms. 1347 00:55:31,860 --> 00:55:36,110 This was a very early area of computational effort 1348 00:55:36,110 --> 00:55:38,220 in structural biology, and we'll see 1349 00:55:38,220 --> 00:55:41,574 that the early methods are remarkably good. 1350 00:55:41,574 --> 00:55:42,990 We can look for domain structures, 1351 00:55:42,990 --> 00:55:44,540 and this is really a sequence problem.
1352 00:55:44,540 --> 00:55:46,081 So we can look through our sequences, 1353 00:55:46,081 --> 00:55:48,680 and rather than looking for sequence identity or similarity 1354 00:55:48,680 --> 00:55:50,080 with known structures, we can see 1355 00:55:50,080 --> 00:55:51,690 whether there are certain patterns, 1356 00:55:51,690 --> 00:55:53,273 like the hidden Markov models that you 1357 00:55:53,273 --> 00:55:54,980 looked at in a previous lecture, that 1358 00:55:54,980 --> 00:55:58,660 can allow us to recognize the domain structure of a protein 1359 00:55:58,660 --> 00:56:02,120 even without an identical sequence in the database. 1360 00:56:02,120 --> 00:56:04,874 So we won't go over that kind of analysis anymore, 1361 00:56:04,874 --> 00:56:06,290 and then we'll spend a good amount 1362 00:56:06,290 --> 00:56:08,510 of time looking at ways of solving novel structures. 1363 00:56:08,510 --> 00:56:10,810 So if you don't have a fiendish friend who solved 1364 00:56:10,810 --> 00:56:13,110 your structure for you, and there is no homologue 1365 00:56:13,110 --> 00:56:14,930 in the database, all is not lost. 1366 00:56:14,930 --> 00:56:18,424 You actually can now predict novel structures of proteins 1367 00:56:18,424 --> 00:56:19,465 simply from the sequence. 1368 00:56:22,085 --> 00:56:25,260 All right, so a little history as to the prediction 1369 00:56:25,260 --> 00:56:26,320 of protein structure. 1370 00:56:26,320 --> 00:56:28,995 It really starts with Linus Pauling, 1371 00:56:28,995 --> 00:56:32,010 who went on to win the Nobel Prize for this work. 1372 00:56:32,010 --> 00:56:35,940 And this is in the era-- this paper was published in 1951. 1373 00:56:35,940 --> 00:56:40,960 This was what computers looked like in 1951, 1374 00:56:40,960 --> 00:56:42,960 and that thing probably has a lot less computing 1375 00:56:42,960 --> 00:56:45,640 power than your iPhone or your Android. 
1376 00:56:45,640 --> 00:56:48,140 So Linus Pauling did not solve the structure 1377 00:56:48,140 --> 00:56:51,040 of the alpha helix, predict that alpha helices existed, 1378 00:56:51,040 --> 00:56:51,785 using computers. 1379 00:56:51,785 --> 00:56:54,680 He actually did it entirely with paper models. 1380 00:56:54,680 --> 00:56:58,220 And in fact, he solved this-- he got the key insights 1381 00:56:58,220 --> 00:57:01,780 for the alpha helix when he was lying sick in bed. 1382 00:57:01,780 --> 00:57:05,600 That's a very productive sick leave, you might imagine. 1383 00:57:05,600 --> 00:57:08,840 He was using paper models, but it 1384 00:57:08,840 --> 00:57:10,330 wasn't all done while lying in bed. 1385 00:57:10,330 --> 00:57:12,460 So he and others, the field as a whole, 1386 00:57:12,460 --> 00:57:15,520 had spent a lot of time observing small molecule 1387 00:57:15,520 --> 00:57:17,490 distances, so they had some idea what 1388 00:57:17,490 --> 00:57:18,929 to expect in protein structures. 1389 00:57:18,929 --> 00:57:20,970 They didn't know the three-dimensional structure, 1390 00:57:20,970 --> 00:57:22,511 but they knew a lot of the parameters 1391 00:57:22,511 --> 00:57:23,940 about how far apart things were. 1392 00:57:23,940 --> 00:57:25,542 And they also knew that hydrogen bonds 1393 00:57:25,542 --> 00:57:27,500 were going to be extremely favorable in protein 1394 00:57:27,500 --> 00:57:28,360 structures. 1395 00:57:28,360 --> 00:57:31,320 And so he looked for a repeating structure 1396 00:57:31,320 --> 00:57:33,940 that would maximize the number of hydrogen bonds 1397 00:57:33,940 --> 00:57:35,930 that occur within the protein backbone chain. 1398 00:57:39,710 --> 00:57:43,610 And he knew, also, the backbone-- that the amide bonds 1399 00:57:43,610 --> 00:57:44,860 would be planar and so on.
1400 00:57:44,860 --> 00:57:47,810 So there were a lot of principles that underlay this, 1401 00:57:47,810 --> 00:57:50,290 but it was really a tour de force 1402 00:57:50,290 --> 00:57:53,010 of just thinking rather than computing. 1403 00:57:53,010 --> 00:57:54,990 Another really important contribution early on 1404 00:57:54,990 --> 00:58:00,150 was made by Ramachandran, who was at Madras University, 1405 00:58:00,150 --> 00:58:02,180 and his insight had to do with the fact 1406 00:58:02,180 --> 00:58:03,940 that not all backbone conformations were 1407 00:58:03,940 --> 00:58:04,690 equally favorable. 1408 00:58:04,690 --> 00:58:06,606 So remember, we have these two rotatable bonds 1409 00:58:06,606 --> 00:58:07,510 in the backbone. 1410 00:58:07,510 --> 00:58:10,200 We have the phi angle and the psi angle. 1411 00:58:10,200 --> 00:58:14,150 And this plot shows that there'll 1412 00:58:14,150 --> 00:58:16,250 be certain conformations of phi and psi angles 1413 00:58:16,250 --> 00:58:19,430 that are observed within these dashed lines, 1414 00:58:19,430 --> 00:58:22,320 and then the other conformations, which are almost 1415 00:58:22,320 --> 00:58:23,947 never observed. 1416 00:58:23,947 --> 00:58:25,280 Now, how did he figure that out? 1417 00:58:25,280 --> 00:58:27,240 Once again, it wasn't with computation. 1418 00:58:27,240 --> 00:58:29,350 It was simply with paper models and figuring out 1419 00:58:29,350 --> 00:58:32,677 what the distances would be, and then carefully reasoning over 1420 00:58:32,677 --> 00:58:33,760 those possible structures. 1421 00:58:33,760 --> 00:58:36,620 So you can get very far in this field, initially, back 1422 00:58:36,620 --> 00:58:39,450 then, by simple hard thought.
1423 00:58:39,450 --> 00:58:40,909 OK, so with these two observations, 1424 00:58:40,909 --> 00:58:42,366 we knew that there were going to be 1425 00:58:42,366 --> 00:58:44,260 certain kinds of regular secondary structure 1426 00:58:44,260 --> 00:58:46,520 and that not all backbone conformations 1427 00:58:46,520 --> 00:58:49,040 were equally favorable. 1428 00:58:49,040 --> 00:58:51,740 OK, but now, we want to advance to actually predicting structures 1429 00:58:51,740 --> 00:58:54,670 of particular proteins, not just saying that proteins in general 1430 00:58:54,670 --> 00:58:55,810 will contain alpha helices. 1431 00:58:55,810 --> 00:58:57,480 So how do we go about doing that? 1432 00:58:57,480 --> 00:58:59,090 So the first advances here were 1433 00:58:59,090 --> 00:59:01,990 trying to predict the structure of alpha helices, 1434 00:59:01,990 --> 00:59:04,910 and this paper in the 1960s introduced 1435 00:59:04,910 --> 00:59:07,770 the concept of a helical wheel. 1436 00:59:07,770 --> 00:59:09,870 Now, the idea here, if you'll imagine 1437 00:59:09,870 --> 00:59:13,430 that this eraser is an alpha helix, 1438 00:59:13,430 --> 00:59:16,820 I'm going to look down the backbone of the alpha helix. 1439 00:59:16,820 --> 00:59:18,440 And I'll see that the side chains 1440 00:59:18,440 --> 00:59:20,150 emerge at regular positions. 1441 00:59:20,150 --> 00:59:21,990 There's going to be a 100 degree rotation 1442 00:59:21,990 --> 00:59:25,680 between each sequential residue in the backbone 1443 00:59:25,680 --> 00:59:27,200 as it goes around the helix. 1444 00:59:27,200 --> 00:59:30,580 It's going to be displaced and rotated by 100 degrees, 1445 00:59:30,580 --> 00:59:32,510 and I could plot, on a piece of paper, 1446 00:59:32,510 --> 00:59:36,749 the helical projection, which is shown here. 1447 00:59:36,749 --> 00:59:38,040 So here's the first amino acid. 1448 00:59:38,040 --> 00:59:39,770 100 degrees later, the second.
1449 00:59:39,770 --> 00:59:41,440 100 degrees later, the third. 1450 00:59:41,440 --> 00:59:45,720 And I can ask whether the residues on that backbone 1451 00:59:45,720 --> 00:59:48,250 have a sequence that puts all the hydrophobics 1452 00:59:48,250 --> 00:59:52,410 and hydrophilics on the same side, as in this case, 1453 00:59:52,410 --> 00:59:53,784 or on different sides. 1454 00:59:53,784 --> 00:59:55,200 Now, what difference does it make? 1455 00:59:55,200 --> 00:59:56,741 Well, if I have an alpha helix that's 1456 00:59:56,741 --> 01:00:00,430 lying on the surface of a protein, 1457 01:00:00,430 --> 01:00:03,540 this could have one side that's solvent exposed and one 1458 01:00:03,540 --> 01:00:05,350 side that's protected. 1459 01:00:05,350 --> 01:00:08,040 So we would expect that some of these alpha helices lying 1460 01:00:08,040 --> 01:00:10,100 on the surface would be amphipathic. 1461 01:00:10,100 --> 01:00:13,440 Half of them would be hydrophobic, 1462 01:00:13,440 --> 01:00:15,220 and half of them would be hydrophilic. 1463 01:00:15,220 --> 01:00:17,340 And purely, as someone suggested from the pattern 1464 01:00:17,340 --> 01:00:20,220 of the amino acids, and here the hydrophobicity of the pattern 1465 01:00:20,220 --> 01:00:23,330 of the amino acids, we could make reasonable predictions 1466 01:00:23,330 --> 01:00:27,000 of whether this protein forms a particular kind of alpha helix, 1467 01:00:27,000 --> 01:00:29,080 an amphipathic alpha helix. 1468 01:00:29,080 --> 01:00:32,334 Now, is that going to help us for all alpha helices? 1469 01:00:32,334 --> 01:00:34,500 Obviously not, because I can have alpha helices that 1470 01:00:34,500 --> 01:00:36,440 are totally solvent exposed, and I 1471 01:00:36,440 --> 01:00:39,575 can have alpha helices that are totally protected. 1472 01:00:39,575 --> 01:00:41,950 So this pattern will occur in some alpha helices, but not 1473 01:00:41,950 --> 01:00:42,720 all.
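The helical wheel projection described above is simple enough to compute directly. The following is a minimal sketch: the 100 degree step per residue comes from the lecture, while the `HYDROPHOBIC` residue set, the clustering score, and both example sequences are assumptions made here for illustration. The score is the length of the average unit vector of the hydrophobic positions on the wheel, so it is near 1 when they share a face and near 0 when they are spread evenly around the helix.

```python
import math

DEGREES_PER_RESIDUE = 100.0          # rotation between sequential residues
HYDROPHOBIC = set("AILMFVW")         # crude classification (assumption)


def wheel_angles(sequence):
    """Angle in degrees at which each residue emerges, looking down the helix."""
    return [(i * DEGREES_PER_RESIDUE) % 360.0 for i in range(len(sequence))]


def hydrophobic_face_score(sequence):
    """Length of the mean unit vector over hydrophobic positions (0 to 1)."""
    angles = [math.radians(angle)
              for angle, residue in zip(wheel_angles(sequence), sequence)
              if residue in HYDROPHOBIC]
    if not angles:
        return 0.0
    x = sum(math.cos(a) for a in angles) / len(angles)
    y = sum(math.sin(a) for a in angles) / len(angles)
    return math.hypot(x, y)


if __name__ == "__main__":
    # Hypothetical sequences: the first places leucines at i, i+3, i+4, i+7, ...
    # so they land on one face; the second alternates and spreads them out.
    for name, seq in [("one-sided", "LKELLKKLLEELKKLLEK"),
                      ("alternating", "LKLKLKLKLKLKLKLKLK")]:
        print(name, round(hydrophobic_face_score(seq), 2))
```

A high score only suggests that a segment would be amphipathic if it were helical. As noted above, fully exposed or fully buried helices will not show this pattern, so a low score does not rule out a helix.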
1474 01:00:42,720 --> 01:00:44,520 So another idea that was raised here 1475 01:00:44,520 --> 01:00:47,151 and was used early on with great success 1476 01:00:47,151 --> 01:00:49,400 was to actually figure out whether certain amino acids 1477 01:00:49,400 --> 01:00:51,860 have a particular alpha helical propensity. 1478 01:00:51,860 --> 01:00:54,270 Do they occur more frequently in alpha helices? 1479 01:00:54,270 --> 01:00:55,340 At the time, it was also thought maybe 1480 01:00:55,340 --> 01:00:57,131 you could find propensities for beta sheets 1481 01:00:57,131 --> 01:00:58,260 and other structures. 1482 01:00:58,260 --> 01:01:02,600 So compute the statistics for every amino acid, 1483 01:01:02,600 --> 01:01:04,450 shown as a row here. 1484 01:01:04,450 --> 01:01:06,160 How often is it observed in the database? 1485 01:01:06,160 --> 01:01:09,150 How often does it occur in an alpha helix? 1486 01:01:09,150 --> 01:01:11,450 And how often does it occur in beta sheet or in a coil? 1487 01:01:11,450 --> 01:01:14,110 And from these, then, we would compute probabilities 1488 01:01:14,110 --> 01:01:17,510 and use, perhaps, Bayesian statistics to compute 1489 01:01:17,510 --> 01:01:20,524 the posterior expectation for having 1490 01:01:20,524 --> 01:01:21,940 a certain sequence in an alpha helix. 1491 01:01:21,940 --> 01:01:25,130 They didn't quite use Bayesian statistics here. 1492 01:01:25,130 --> 01:01:28,254 They came up with a rather ad hoc approach, 1493 01:01:28,254 --> 01:01:29,670 and when you read it in hindsight, 1494 01:01:29,670 --> 01:01:30,930 it seems kind of crazy. 1495 01:01:30,930 --> 01:01:31,850 But actually, you have to remember 1496 01:01:31,850 --> 01:01:32,940 when this was being done. 1497 01:01:32,940 --> 01:01:36,330 This is being done before a big influx of mathematicians 1498 01:01:36,330 --> 01:01:38,120 into structural biology. 1499 01:01:38,120 --> 01:01:41,849 This is 1974, and they used more physical reasoning.
1500 01:01:41,849 --> 01:01:43,640 They knew something about how alpha helices 1501 01:01:43,640 --> 01:01:44,960 formed from chemistry. 1502 01:01:44,960 --> 01:01:46,390 They knew that, typically, there's 1503 01:01:46,390 --> 01:01:49,610 a nucleation event, where a small piece of helix forms initially, 1504 01:01:49,610 --> 01:01:51,257 and then that extends. 1505 01:01:51,257 --> 01:01:53,090 They knew that there were these propensities 1506 01:01:53,090 --> 01:01:55,280 for certain amino acids to form alpha helices, 1507 01:01:55,280 --> 01:01:58,820 and other amino acids, which tended to break the helix. 1508 01:01:58,820 --> 01:02:01,080 And they came up with an ad hoc algorithm 1509 01:02:01,080 --> 01:02:02,980 that counted how often you had strong helix 1510 01:02:02,980 --> 01:02:05,300 formers, how often you had breakers. 1511 01:02:05,300 --> 01:02:08,180 You can see all the details in the references. 1512 01:02:08,180 --> 01:02:11,700 The amazing thing is, with this very ad hoc thing and a very, 1513 01:02:11,700 --> 01:02:13,464 very small database of protein structures, 1514 01:02:13,464 --> 01:02:15,380 you could look at the total number of residues 1515 01:02:15,380 --> 01:02:17,380 that they're looking at over all the structures, 1516 01:02:17,380 --> 01:02:20,320 there's 2,473 residues, not structures. 1517 01:02:20,320 --> 01:02:22,680 And now, we have many, many more times 1518 01:02:22,680 --> 01:02:25,160 than that of structures of proteins. 1519 01:02:25,160 --> 01:02:27,820 Even with that, in 1974, they were 1520 01:02:27,820 --> 01:02:32,420 able to achieve 60% accuracy in predicting 1521 01:02:32,420 --> 01:02:34,080 the secondary structure of proteins, 1522 01:02:34,080 --> 01:02:36,320 so it's really an astounding accomplishment.
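The counting step behind a helix propensity table can be sketched as follows. This is an illustration under stated assumptions, not the published 1974 algorithm: it covers only the frequency statistics, not the nucleation and extension rules, and the two annotated chains are made-up toy data. The propensity of an amino acid is taken here as the fraction of its occurrences found in helix, divided by the fraction of all residues found in helix, so values above 1 mark helix formers and values below 1 mark residues that are under-represented in helices.

```python
from collections import Counter


def helix_propensities(sequences, states, helix_symbol="H"):
    """Return {amino acid: helix propensity} from annotated chains.

    sequences: amino acid strings.
    states: matching per-residue secondary structure strings
            (helix_symbol for helix, anything else for non-helix).
    """
    total = Counter()
    in_helix = Counter()
    for seq, ss in zip(sequences, states):
        if len(seq) != len(ss):
            raise ValueError("sequence and annotation lengths differ")
        for residue, state in zip(seq, ss):
            total[residue] += 1
            if state == helix_symbol:
                in_helix[residue] += 1

    helix_count = sum(in_helix.values())
    if helix_count == 0:
        raise ValueError("no helical residues in the data set")
    overall_fraction = helix_count / sum(total.values())

    return {residue: (in_helix[residue] / total[residue]) / overall_fraction
            for residue in total}


if __name__ == "__main__":
    # Hypothetical toy data, far too small for real statistics.
    toy_sequences = ["AAEELLGGPP", "GAPLE"]
    toy_states = ["HHHHHHCCCC", "CHCHH"]
    table = helix_propensities(toy_sequences, toy_states)
    for residue in sorted(table, key=table.get, reverse=True):
        print(residue, round(table[residue], 2))
```

With a data set as small as the one mentioned in the lecture, these ratios are noisy for rare amino acids, which is one reason to treat any such table as a rough guide.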
1523 01:02:36,320 --> 01:02:38,570 And to put that in perspective, there 1524 01:02:38,570 --> 01:02:41,750 was an evaluation of a whole bunch of secondary structure 1525 01:02:41,750 --> 01:02:44,220 prediction algorithms done about a decade ago, 1526 01:02:44,220 --> 01:02:46,740 and things haven't changed that much since then, where 1527 01:02:46,740 --> 01:02:50,820 between 1974 and 2003, almost 30 years, 1528 01:02:50,820 --> 01:02:55,200 they went from 60% accuracy to 76% accuracy. 1529 01:02:55,200 --> 01:02:57,905 OK, well, that's not bad, but it's not a lot 1530 01:02:57,905 --> 01:02:59,530 for-- you'd expect maybe over 30 years, 1531 01:02:59,530 --> 01:03:00,200 you could do a lot better. 1532 01:03:00,200 --> 01:03:02,110 So the simple approach really captured 1533 01:03:02,110 --> 01:03:06,696 the fundamentals of predicting secondary structure. 1534 01:03:06,696 --> 01:03:08,570 There's a lot of work that's been done since, 1535 01:03:08,570 --> 01:03:10,361 and I encourage you to look in the textbook 1536 01:03:10,361 --> 01:03:13,020 if you're interested, to look at all the newer algorithms that 1537 01:03:13,020 --> 01:03:15,270 have tried to solve the secondary structure prediction 1538 01:03:15,270 --> 01:03:17,310 problem. 1539 01:03:17,310 --> 01:03:19,204 OK. 1540 01:03:19,204 --> 01:03:21,370 All right, so secondary structure prediction, then-- 1541 01:03:21,370 --> 01:03:23,536 you can look in the textbook for the modern methods, 1542 01:03:23,536 --> 01:03:26,050 but the fundamental ideas were laid down by Chou and Fasman 1543 01:03:26,050 --> 01:03:28,500 in the 1974 paper. 1544 01:03:28,500 --> 01:03:31,420 We've already said that looking at the kinds of approaches 1545 01:03:31,420 --> 01:03:33,240 that we discussed earlier in the course 1546 01:03:33,240 --> 01:03:34,984 can help you solve domain structures.
1547 01:03:34,984 --> 01:03:37,150 I would like to focus on, at the end of this lecture 1548 01:03:37,150 --> 01:03:39,733 and the beginning of the next lecture, how to actually 1549 01:03:39,733 --> 01:03:43,457 solve novel structures from purely amino acid sequence, 1550 01:03:43,457 --> 01:03:45,040 and we're going to go back to the idea 1551 01:03:45,040 --> 01:03:46,790 that there is a potential energy function. 1552 01:03:46,790 --> 01:03:50,390 We now have both the CHARMM approach and the Rosetta 1553 01:03:50,390 --> 01:03:53,930 approach to protein structure, and so there 1554 01:03:53,930 --> 01:03:55,450 is some protein folding landscape. 1555 01:03:55,450 --> 01:03:56,701 There's an energy function. 1556 01:03:56,701 --> 01:03:58,200 If you have different conformations, 1557 01:03:58,200 --> 01:04:00,580 you'll be at different positions in the landscape, 1558 01:04:00,580 --> 01:04:02,710 and we'd like to figure out how to go 1559 01:04:02,710 --> 01:04:06,310 from some starting conformation that may be arbitrary and find 1560 01:04:06,310 --> 01:04:08,080 our way to the minimum energy structure. 1561 01:04:14,489 --> 01:04:18,620 All right, so there are going to be three fundamental things 1562 01:04:18,620 --> 01:04:21,210 that we'll talk about in the next lecture. 1563 01:04:21,210 --> 01:04:23,840 We're going to talk about energy minimization, 1564 01:04:23,840 --> 01:04:26,300 how to use these potential energy functions that we 1565 01:04:26,300 --> 01:04:29,250 started off with to go from approximate structures 1566 01:04:29,250 --> 01:04:30,640 to the refined structure. 1567 01:04:30,640 --> 01:04:32,380 That's the thought problem I gave you. 1568 01:04:32,380 --> 01:04:35,120 You have the structure, but you have the wrong side chains. 1569 01:04:35,120 --> 01:04:36,470 Could you minimize them? 1570 01:04:36,470 --> 01:04:38,280 And so that's making small changes.
1571 01:04:38,280 --> 01:04:39,960 We'll discuss molecular dynamics, 1572 01:04:39,960 --> 01:04:44,552 which actually tries to simulate all the forces on a protein 1573 01:04:44,552 --> 01:04:46,510 and to actually carry out a physical simulation 1574 01:04:46,510 --> 01:04:47,279 of the process. 1575 01:04:47,279 --> 01:04:48,820 That's the CHARMM approach, and we'll 1576 01:04:48,820 --> 01:04:51,589 see some interesting variants on that. 1577 01:04:51,589 --> 01:04:53,630 And then we'll look at simulated annealing, which 1578 01:04:53,630 --> 01:04:56,005 is an optimization technique that's actually quite broad, 1579 01:04:56,005 --> 01:04:57,490 but can be applied here, to search 1580 01:04:57,490 --> 01:04:59,770 over large, large conformational spaces, 1581 01:04:59,770 --> 01:05:01,860 much further than a protein would actually evolve 1582 01:05:01,860 --> 01:05:03,560 in a molecular dynamics simulation that's 1583 01:05:03,560 --> 01:05:04,855 simulating protein function. 1584 01:05:04,855 --> 01:05:07,230 You allow the protein, now, to jump between conformations 1585 01:05:07,230 --> 01:05:09,790 that it has no real potential to transfer 1586 01:05:09,790 --> 01:05:14,220 between at normal room temperature in water, 1587 01:05:14,220 --> 01:05:16,580 but can be done, obviously, easily in the computer. 1588 01:05:16,580 --> 01:05:17,480 So I'll stop here. 1589 01:05:17,480 --> 01:05:20,010 Any questions before we close?
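As a preview of the simulated annealing idea, here is a minimal sketch on a one-dimensional toy landscape. Everything in it is an assumption chosen for illustration: the `toy_energy` function stands in for a real potential energy function, and the Metropolis acceptance rule, the geometric cooling schedule, and all parameter values are one common way to set this up rather than the method used in the course. At high temperature the search accepts uphill moves and can cross barriers; as the temperature drops it settles into a low-energy basin.

```python
import math
import random


def toy_energy(x):
    """Rugged 1-D landscape: a broad bowl with several local minima."""
    return 0.1 * x * x + math.sin(3.0 * x)


def simulated_annealing(energy, x0, steps=5000, t_start=2.0, t_end=0.01,
                        step_size=0.5, seed=0):
    """Return (best_x, best_energy) found by Metropolis-style annealing."""
    rng = random.Random(seed)           # seeded so runs are repeatable
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    for k in range(steps):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (k / max(steps - 1, 1))
        candidate = x + rng.uniform(-step_size, step_size)
        candidate_e = energy(candidate)
        delta = candidate_e - e
        # Always accept downhill moves; accept uphill moves with
        # probability exp(-delta / t), which shrinks as t falls.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            x, e = candidate, candidate_e
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e


if __name__ == "__main__":
    start = 4.0
    best_x, best_e = simulated_annealing(toy_energy, start)
    print("start energy:", round(toy_energy(start), 3))
    print("best found  :", round(best_x, 3), round(best_e, 3))
```

The sketch keeps the best conformation seen so far, so the reported energy can never be worse than the starting point. Whether it reaches the global minimum depends on the cooling schedule and the number of steps.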