1 00:00:00,060 --> 00:00:01,780 The following content is provided 2 00:00:01,780 --> 00:00:04,019 under a Creative Commons license. 3 00:00:04,019 --> 00:00:06,870 Your support will help MIT OpenCourseWare continue 4 00:00:06,870 --> 00:00:10,730 to offer high quality educational resources for free. 5 00:00:10,730 --> 00:00:13,340 To make a donation or view additional materials 6 00:00:13,340 --> 00:00:17,217 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:17,217 --> 00:00:17,842 at ocw.mit.edu. 8 00:00:26,460 --> 00:00:29,190 PROFESSOR: Welcome back, everyone. 9 00:00:29,190 --> 00:00:30,934 I hope you had a good break. 10 00:00:30,934 --> 00:00:32,600 Hopefully you also remember a little bit 11 00:00:32,600 --> 00:00:34,050 about what we did last time. 12 00:00:34,050 --> 00:00:35,680 So if you'll recall, last time we 13 00:00:35,680 --> 00:00:37,130 did an introduction to protein structure. 14 00:00:37,130 --> 00:00:38,790 We talked a little bit about some of the issues 15 00:00:38,790 --> 00:00:40,220 in predicting protein structure. 16 00:00:40,220 --> 00:00:42,280 Now we're going to go into that in more detail. 17 00:00:42,280 --> 00:00:45,050 And last time, we'd broken down the structure prediction 18 00:00:45,050 --> 00:00:47,200 problem into a couple of sub-problems. 19 00:00:47,200 --> 00:00:49,930 So there was a problem of secondary structure prediction, 20 00:00:49,930 --> 00:00:51,680 which we discussed a little bit last time. 21 00:00:51,680 --> 00:00:53,910 And remember that the early algorithms developed 22 00:00:53,910 --> 00:00:57,480 in the '70s get about 60% accuracy, and decades 23 00:00:57,480 --> 00:00:59,507 of research has only marginally improved that. 
24 00:00:59,507 --> 00:01:01,340 But we're going to see that some of the work 25 00:01:01,340 --> 00:01:03,750 on domain structure recognition and predicting 26 00:01:03,750 --> 00:01:05,489 novel three-dimensional structures 27 00:01:05,489 --> 00:01:07,330 has really advanced very dramatically 28 00:01:07,330 --> 00:01:09,072 in the last few years. 29 00:01:09,072 --> 00:01:10,780 Now, the other thing I hope you'll recall 30 00:01:10,780 --> 00:01:13,790 is that we had this dichotomy between two approaches 31 00:01:13,790 --> 00:01:16,330 to the energetics of protein structure. 32 00:01:16,330 --> 00:01:19,250 We had the physicist's approach and we 33 00:01:19,250 --> 00:01:20,901 had the statistician's approach, right? 34 00:01:20,901 --> 00:01:23,400 Now, what were some of the key differences between these two 35 00:01:23,400 --> 00:01:24,810 approaches? 36 00:01:24,810 --> 00:01:26,640 Anyone want to volunteer a difference 37 00:01:26,640 --> 00:01:29,047 between the statistical approach to parametrizing 38 00:01:29,047 --> 00:01:30,130 the energy of a structure? 39 00:01:30,130 --> 00:01:32,130 So we're trying to come up with an equation that 40 00:01:32,130 --> 00:01:34,380 will convert coordinates into energy, right? 41 00:01:34,380 --> 00:01:36,810 And what were some of the differences between the physics 42 00:01:36,810 --> 00:01:38,393 approach and the statistical approach? 43 00:01:41,266 --> 00:01:41,890 Any volunteers? 44 00:01:41,890 --> 00:01:42,230 Yes. 45 00:01:42,230 --> 00:01:43,518 AUDIENCE: I think the statistical approach didn't 46 00:01:43,518 --> 00:01:45,611 change the phi and psi angles, right? 47 00:01:45,611 --> 00:01:49,000 It just changed other variables. 48 00:01:49,000 --> 00:01:50,290 PROFESSOR: So you're close. 49 00:01:50,290 --> 00:01:50,480 Right. 50 00:01:50,480 --> 00:01:52,865 So the statistical-- or maybe you said the right thing, 51 00:01:52,865 --> 00:01:53,000 actually. 
52 00:01:53,000 --> 00:01:54,870 So the statistical approach keeps a lot 53 00:01:54,870 --> 00:01:57,930 of the pieces of the protein rigid, whereas the physics approach 54 00:01:57,930 --> 00:02:00,000 allows all the atoms to move independently. 55 00:02:00,000 --> 00:02:01,500 So one of the key differences, then, 56 00:02:01,500 --> 00:02:04,390 is that in the physics approach, two atoms that 57 00:02:04,390 --> 00:02:07,210 are bonded to each other still move apart based 58 00:02:07,210 --> 00:02:09,479 on a spring function. 59 00:02:09,479 --> 00:02:12,820 It's a very stiff spring, but the atoms move independently. 60 00:02:12,820 --> 00:02:14,330 In the statistical approach, we just 61 00:02:14,330 --> 00:02:15,700 fix the distance between them. 62 00:02:15,700 --> 00:02:18,580 Similarly for a tetrahedrally coordinated atom, 63 00:02:18,580 --> 00:02:22,529 in the physics approach those angles can deform. 64 00:02:22,529 --> 00:02:24,320 In the statistical approach, they're fixed. 65 00:02:24,320 --> 00:02:24,830 Right? 66 00:02:24,830 --> 00:02:26,246 So in the statistical approach, we 67 00:02:26,246 --> 00:02:29,590 have more or less fixed geometry. 68 00:02:29,590 --> 00:02:32,550 In the physics approach, every atom moves independently. 69 00:02:32,550 --> 00:02:34,750 Anyone else remember another key difference? 70 00:02:34,750 --> 00:02:37,220 Where do the energy functions come from? 71 00:02:46,420 --> 00:02:46,920 Volunteers? 72 00:02:46,920 --> 00:02:47,310 All right. 73 00:02:47,310 --> 00:02:48,940 So in the physics approach, they're all 74 00:02:48,940 --> 00:02:52,170 derived as much as possible from physical principles, 75 00:02:52,170 --> 00:02:52,970 you might imagine. 76 00:02:52,970 --> 00:02:54,594 Whereas in the statistical approach, 77 00:02:54,594 --> 00:02:57,260 we're trying to recreate what we see in nature, even if we don't 78 00:02:57,260 --> 00:02:59,560 have a good physical grounding for it. 
79 00:02:59,560 --> 00:03:01,080 So this is most dramatic in trying 80 00:03:01,080 --> 00:03:02,950 to predict the solvation free energies. 81 00:03:02,950 --> 00:03:03,450 Right? 82 00:03:03,450 --> 00:03:07,050 How much does it cost you if you put a hydrophobic atom 83 00:03:07,050 --> 00:03:08,950 into a polar environment? 84 00:03:08,950 --> 00:03:09,450 Right? 85 00:03:09,450 --> 00:03:11,070 So in the physics approach, you actually 86 00:03:11,070 --> 00:03:11,925 have to have water molecules. 87 00:03:11,925 --> 00:03:13,341 They have to interact with matter. 88 00:03:13,341 --> 00:03:15,310 That turns out to be really, really hard to do. 89 00:03:15,310 --> 00:03:18,000 In the statistical approach, we come up with an approximation. 90 00:03:18,000 --> 00:03:20,275 How much solvent accessible surface area 91 00:03:20,275 --> 00:03:23,590 is there on the polar atom when it's free? 92 00:03:23,590 --> 00:03:25,630 When it's in the protein structure? 93 00:03:25,630 --> 00:03:30,110 And then we scale the transfer energies by that amount. 94 00:03:30,110 --> 00:03:32,730 OK, so these are then the main differences. 95 00:03:35,284 --> 00:03:36,480 Gotta be careful here. 96 00:03:39,560 --> 00:03:42,100 So we've got fixed geometry in the statistical approach. 97 00:03:42,100 --> 00:03:43,900 We often use discrete rotamers. 98 00:03:43,900 --> 00:03:44,400 Remember? 99 00:03:44,400 --> 00:03:48,160 The side-chain angles, in principle, can rotate freely. 100 00:03:48,160 --> 00:03:49,860 But only a few conformations 101 00:03:49,860 --> 00:03:53,140 are typically observed, so we often restrict ourselves 102 00:03:53,140 --> 00:03:56,080 to the most commonly observed combinations of the chi angles. 103 00:03:56,080 --> 00:03:57,900 And then we have the statistical potential 104 00:03:57,900 --> 00:03:59,690 that depends on the frequency at which we 105 00:03:59,690 --> 00:04:01,005 observe things in the database. 
106 00:04:01,005 --> 00:04:03,130 And that could be the frequency at which we observe 107 00:04:03,130 --> 00:04:05,569 particular atoms at precise distances. 108 00:04:05,569 --> 00:04:07,610 It could be the fraction of time that something's 109 00:04:07,610 --> 00:04:11,509 solvent accessible versus not. 110 00:04:11,509 --> 00:04:13,550 And the other thing that we talked about a little 111 00:04:13,550 --> 00:04:15,260 bit last time was this thought problem. 112 00:04:15,260 --> 00:04:16,682 If I have a protein sequence and I 113 00:04:16,682 --> 00:04:18,640 have two potential structures, how 114 00:04:18,640 --> 00:04:20,250 could I use these potential energies-- 115 00:04:20,250 --> 00:04:22,630 whether they're derived from the physics approach 116 00:04:22,630 --> 00:04:24,090 or from the statistical approach-- 117 00:04:24,090 --> 00:04:27,110 how could I use these potential energies to decide which 118 00:04:27,110 --> 00:04:29,940 of the two structures is correct? 119 00:04:29,940 --> 00:04:32,737 So one possibility is that I have two structures. 120 00:04:32,737 --> 00:04:35,070 One of them is truly the structure and the other is not. 121 00:04:35,070 --> 00:04:35,570 Right? 122 00:04:35,570 --> 00:04:37,470 Your fiendish lab mate knows the structure 123 00:04:37,470 --> 00:04:39,220 but refuses to tell you. 124 00:04:39,220 --> 00:04:41,632 So in that case, what would I do? 125 00:04:41,632 --> 00:04:43,590 I know that one of these structures is correct. 126 00:04:43,590 --> 00:04:44,400 I don't know which one. 127 00:04:44,400 --> 00:04:46,180 How could I use the potential energy function 128 00:04:46,180 --> 00:04:47,490 to decide which one's correct? 129 00:04:54,240 --> 00:04:56,485 What's going to be true of the correct structure? 130 00:04:56,485 --> 00:04:57,730 AUDIENCE: Minimal energy. 131 00:04:57,730 --> 00:04:58,810 PROFESSOR: It's going to have lower energy. 132 00:04:58,810 --> 00:04:59,830 So is that sufficient? 
133 00:04:59,830 --> 00:05:00,200 No. 134 00:05:00,200 --> 00:05:00,400 Right? 135 00:05:00,400 --> 00:05:02,280 There's a subtlety we have to face here. 136 00:05:02,280 --> 00:05:06,900 So if I just plug my protein sequence onto one of these two 137 00:05:06,900 --> 00:05:09,932 structures and compute the free energy, 138 00:05:09,932 --> 00:05:11,640 there's no guarantee that the correct one 139 00:05:11,640 --> 00:05:12,806 will have lower free energy. 140 00:05:12,806 --> 00:05:15,010 Why? 141 00:05:15,010 --> 00:05:19,180 What decision do I have to make when I put a protein 142 00:05:19,180 --> 00:05:20,990 sequence onto a backbone structure? 143 00:05:24,750 --> 00:05:25,265 Yes. 144 00:05:25,265 --> 00:05:26,890 AUDIENCE: How to orient the side chain. 145 00:05:26,890 --> 00:05:27,215 PROFESSOR: Exactly. 146 00:05:27,215 --> 00:05:29,254 I need to decide how to orient the side chains. 147 00:05:29,254 --> 00:05:30,670 If I orient the side chains wrong, 148 00:05:30,670 --> 00:05:32,674 then I'll have side chains literally overlapping 149 00:05:32,674 --> 00:05:33,340 with each other. 150 00:05:33,340 --> 00:05:35,532 That'll have incredibly high energy, right? 151 00:05:35,532 --> 00:05:36,990 So there's no guarantee that simply 152 00:05:36,990 --> 00:05:39,360 having the right structure will give you 153 00:05:39,360 --> 00:05:41,640 the minimal free energy until you correctly 154 00:05:41,640 --> 00:05:43,299 place all the side chains. 155 00:05:43,299 --> 00:05:44,590 OK, but that's the simple case. 156 00:05:44,590 --> 00:05:46,090 Now, that's in the case where you've 157 00:05:46,090 --> 00:05:49,280 got this fiendish friend who knows the correct structure. 158 00:05:49,280 --> 00:05:51,680 But of course, in the general domain recognition problem, 159 00:05:51,680 --> 00:05:53,180 we don't know the correct structure. 160 00:05:53,180 --> 00:05:54,260 We have homologues. 
161 00:05:54,260 --> 00:05:56,600 So we have some sequence, and we believe 162 00:05:56,600 --> 00:05:59,535 that it's either homologous to Protein A or to Protein B, 163 00:05:59,535 --> 00:06:01,380 and I want to decide which one's correct. 164 00:06:01,380 --> 00:06:03,500 So in both cases, the structure's wrong. 165 00:06:03,500 --> 00:06:05,410 It's this question of how wrong it is, right? 166 00:06:05,410 --> 00:06:06,530 So now the problem actually becomes 167 00:06:06,530 --> 00:06:08,960 harder, because not only do I need to get the right side 168 00:06:08,960 --> 00:06:11,020 chain conformations, but I need to get the right backbone 169 00:06:11,020 --> 00:06:11,600 conformation. 170 00:06:11,600 --> 00:06:14,130 It's going to be close to one of these structures, perhaps, 171 00:06:14,130 --> 00:06:17,320 but it's never going to be identical. 172 00:06:17,320 --> 00:06:19,440 So both of these situations are examples 173 00:06:19,440 --> 00:06:21,230 where we have to do some kind of refinement 174 00:06:21,230 --> 00:06:22,606 of an initial starting structure. 175 00:06:22,606 --> 00:06:24,771 And what we're going to talk about for the next part 176 00:06:24,771 --> 00:06:26,680 of the lecture are alternative strategies 177 00:06:26,680 --> 00:06:28,950 for refining a partially correct structure. 178 00:06:28,950 --> 00:06:31,050 And we're going to look at three strategies. 179 00:06:31,050 --> 00:06:34,034 The simplest one is called energy minimization. 180 00:06:34,034 --> 00:06:35,950 Then we're going to look at molecular dynamics 181 00:06:35,950 --> 00:06:38,690 and simulated annealing. 182 00:06:38,690 --> 00:06:40,910 So energy minimization starts with this principle 183 00:06:40,910 --> 00:06:43,409 that we talked about last time, I remember, that came up here, 184 00:06:43,409 --> 00:06:46,300 that a stable structure has to be a minimum of free energy. 185 00:06:46,300 --> 00:06:46,800 Right? 
186 00:06:46,800 --> 00:06:49,852 Because if it's not, then there are forces acting on the atoms 187 00:06:49,852 --> 00:06:51,310 that are going to drive it away 188 00:06:51,310 --> 00:06:53,460 from that structure to some other structure. 189 00:06:53,460 --> 00:06:55,730 Now, the fact that it is a minimum of free energy 190 00:06:55,730 --> 00:06:58,690 does not guarantee that it is the minimum of free energy. 191 00:06:58,690 --> 00:07:02,220 So it's possible that there are other energetic minima. 192 00:07:02,220 --> 00:07:02,760 Right? 193 00:07:02,760 --> 00:07:05,120 The protein structure, if it's stable, 194 00:07:05,120 --> 00:07:08,180 is at the very least a local energetic minimum. 195 00:07:08,180 --> 00:07:10,330 It may also be the global free energy minimum. 196 00:07:10,330 --> 00:07:12,410 We just don't know the answer to that. 197 00:07:12,410 --> 00:07:14,110 Now, this was a big area of debate 198 00:07:14,110 --> 00:07:16,810 in the early days of the protein structure field, 199 00:07:16,810 --> 00:07:19,350 whether proteins could fold spontaneously. 200 00:07:19,350 --> 00:07:22,230 If they did, then it meant that they were at least 201 00:07:22,230 --> 00:07:24,290 apparently global free energy minima. 202 00:07:24,290 --> 00:07:26,111 Chris Anfinsen actually won the Nobel Prize 203 00:07:26,111 --> 00:07:27,860 for demonstrating that some proteins could 204 00:07:27,860 --> 00:07:29,950 fold independently outside of the cell. 205 00:07:29,950 --> 00:07:32,930 So at least some proteins had all the structural information 206 00:07:32,930 --> 00:07:35,005 implicit in their sequence, right? 207 00:07:35,005 --> 00:07:37,380 And that seems to imply that they are global free energy 208 00:07:37,380 --> 00:07:38,140 minima. 209 00:07:38,140 --> 00:07:40,280 But there are other proteins, we now know, 210 00:07:40,280 --> 00:07:42,030 where the most commonly observed structure 211 00:07:42,030 --> 00:07:44,770 is only a local free energy minimum. 
212 00:07:44,770 --> 00:07:47,420 And it's got very high energetic barriers that prevent it 213 00:07:47,420 --> 00:07:50,640 from actually getting to the global free energy minimum. 214 00:07:50,640 --> 00:07:52,382 But regardless of the case, if we 215 00:07:52,382 --> 00:07:53,840 have an initial starting structure, 216 00:07:53,840 --> 00:07:56,580 we could try to find the nearest local free energy minimum, 217 00:07:56,580 --> 00:07:59,134 and perhaps that is the stable structure. 218 00:07:59,134 --> 00:08:00,550 So in our context, we were talking 219 00:08:00,550 --> 00:08:03,890 about packing the side chains on the surface of the protein 220 00:08:03,890 --> 00:08:06,640 that we believe might be the right structure. 221 00:08:06,640 --> 00:08:08,680 So imagine that this is the true structure 222 00:08:08,680 --> 00:08:10,280 and we've got the side chain, and it's 223 00:08:10,280 --> 00:08:13,320 making the dashed green lines represent hydrogen bonds. 224 00:08:13,320 --> 00:08:15,630 It's making a series of hydrogen bonds 225 00:08:15,630 --> 00:08:17,910 from this nitrogen and this oxygen 226 00:08:17,910 --> 00:08:20,100 to pieces of the rest of the protein. 227 00:08:20,100 --> 00:08:22,480 Now, we get the crude backbone structure. 228 00:08:22,480 --> 00:08:23,820 We pop in our side chains. 229 00:08:23,820 --> 00:08:26,240 We don't necessarily-- in fact, we almost never-- 230 00:08:26,240 --> 00:08:28,820 will choose randomly to have the right conformation 231 00:08:28,820 --> 00:08:30,660 to pick up all these hydrogen bonds. 232 00:08:30,660 --> 00:08:32,610 So we'll start off with some structure that 233 00:08:32,610 --> 00:08:34,210 looks like this, where it's rotated, 234 00:08:34,210 --> 00:08:37,080 so that instead of seeing both the nitrogen and the oxygen, 235 00:08:37,080 --> 00:08:39,600 you can only see the profile. 
236 00:08:39,600 --> 00:08:43,970 And so the question is whether we can get from one to the other 237 00:08:43,970 --> 00:08:47,232 by following the energetic minima. 238 00:08:47,232 --> 00:08:48,190 So that's the question. 239 00:08:48,190 --> 00:08:49,564 How would we go about doing this? 240 00:08:49,564 --> 00:08:51,700 Well, we have this function that tells us 241 00:08:51,700 --> 00:08:54,167 the potential energy for every XYZ coordinate of the atom. 242 00:08:54,167 --> 00:08:55,750 That's what we talked about last time, 243 00:08:55,750 --> 00:08:57,280 and you can go back and look at your notes 244 00:08:57,280 --> 00:08:58,400 for those two approaches. 245 00:08:58,400 --> 00:09:00,727 So how could we minimize this free energy? 246 00:09:00,727 --> 00:09:02,560 Well, it's no different from other functions 247 00:09:02,560 --> 00:09:03,950 that we want to minimize, right? 248 00:09:03,950 --> 00:09:05,158 We take the first derivative. 249 00:09:05,158 --> 00:09:07,424 We look for places where the first derivative is zero. 250 00:09:07,424 --> 00:09:09,840 The one difference is that we can't write out analytically 251 00:09:09,840 --> 00:09:11,850 what this function looks like and choose 252 00:09:11,850 --> 00:09:16,130 directions and locations in space that are the minima. 253 00:09:16,130 --> 00:09:18,280 So we're going to have to take an approach that 254 00:09:18,280 --> 00:09:22,010 has a series of perturbations to a structure that try to improve 255 00:09:22,010 --> 00:09:25,209 the free energy systematically. 256 00:09:25,209 --> 00:09:27,750 The simplest to understand is this gradient descent approach, 257 00:09:27,750 --> 00:09:30,810 which says that I have some initial coordinates that I 258 00:09:30,810 --> 00:09:35,120 choose and I take a step in the direction 259 00:09:35,120 --> 00:09:39,100 of the first derivative of the function. 260 00:09:39,100 --> 00:09:40,267 So what does that look like? 
261 00:09:40,267 --> 00:09:41,516 So here are two possibilities. 262 00:09:41,516 --> 00:09:42,670 I've got this function. 263 00:09:42,670 --> 00:09:47,647 If I start off at x equals 2, this minus some epsilon, 264 00:09:47,647 --> 00:09:49,480 some small value times the first derivative, 265 00:09:49,480 --> 00:09:51,195 is going to point me to the left. 266 00:09:51,195 --> 00:09:53,230 And I'm going to take steps to the left 267 00:09:53,230 --> 00:09:57,390 until this function, f prime, the first derivative, is zero. 268 00:09:57,390 --> 00:09:59,100 Then I'm going to stop moving. 269 00:09:59,100 --> 00:10:01,990 So I move from my initial coordinate a little bit 270 00:10:01,990 --> 00:10:04,360 each time to the left until I get to the minimum. 271 00:10:04,360 --> 00:10:06,320 And similarly, if I start off on the right, 272 00:10:06,320 --> 00:10:08,170 I'll move a little bit further to the right 273 00:10:08,170 --> 00:10:10,187 each time until the first derivative is zero. 274 00:10:10,187 --> 00:10:11,270 So that looks pretty good. 275 00:10:11,270 --> 00:10:13,210 It can take a lot of steps, though. 276 00:10:13,210 --> 00:10:16,100 And it's not actually guaranteed to have great convergence 277 00:10:16,100 --> 00:10:16,600 properties. 278 00:10:16,600 --> 00:10:18,849 Because of the number of steps you might have to take, 279 00:10:18,849 --> 00:10:20,770 it might take quite a long time. 280 00:10:20,770 --> 00:10:22,220 So that's the first derivative, in 281 00:10:22,220 --> 00:10:24,499 a simple one-dimensional case. 282 00:10:24,499 --> 00:10:26,415 We're dealing with a multi-dimensional vector, 283 00:10:26,415 --> 00:10:27,810 so instead of doing the first derivative 284 00:10:27,810 --> 00:10:29,268 we use the gradient, which is a set 285 00:10:29,268 --> 00:10:31,230 of partial first derivatives. 
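[Editor's sketch] The one-dimensional update described above, x minus a small epsilon times the first derivative, can be written out as follows. This is a minimal illustration, not the function on the lecture slide: the toy energy f(x) = (x - 1)^2 (x - 4)^2, the step size, and the names `f_prime` and `gradient_descent` are all assumptions chosen so that a start at x = 2 walks left and a start further to the right walks right.

```python
def f_prime(x):
    """Derivative of the assumed toy energy f(x) = (x - 1)**2 * (x - 4)**2."""
    return 2.0 * (x - 1.0) * (x - 4.0) * (2.0 * x - 5.0)


def gradient_descent(x0, epsilon=0.01, tol=1e-8, max_steps=10_000):
    """Repeat x <- x - epsilon * f'(x) until the first derivative is ~zero."""
    x = x0
    for _ in range(max_steps):
        slope = f_prime(x)
        if abs(slope) < tol:  # first derivative is (numerically) zero: stop moving
            break
        x = x - epsilon * slope  # small step downhill, against the slope
    return x


# Two starting points on either side of the barrier at x = 2.5
left_minimum = gradient_descent(2.0)   # slope is positive here, so we step left
right_minimum = gradient_descent(3.0)  # slope is negative here, so we step right
print(left_minimum, right_minimum)
```

Each start settles into whichever minimum is downhill from it, which is why the choice of initial coordinates matters for this method.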
286 00:10:31,230 --> 00:10:34,060 And I think one thing that's useful to point out here 287 00:10:34,060 --> 00:10:37,350 is that, of course, the force is the negative of the gradient 288 00:10:37,350 --> 00:10:38,762 of the potential energy. 289 00:10:38,762 --> 00:10:40,220 So when we do gradient descent, you 290 00:10:40,220 --> 00:10:42,247 can think of it from a physical perspective 291 00:10:42,247 --> 00:10:44,205 as always moving in the direction of the force. 292 00:10:46,770 --> 00:10:47,850 So I have some structure. 293 00:10:47,850 --> 00:10:50,100 It's not the true native structure, 294 00:10:50,100 --> 00:10:52,680 but I take incremental steps in the direction of the force 295 00:10:52,680 --> 00:10:54,690 and I move towards some local minimum. 296 00:10:59,059 --> 00:11:01,350 And we've done this in the case of a continuous energy, 297 00:11:01,350 --> 00:11:03,516 but you can actually also do this for discrete ones. 298 00:11:03,516 --> 00:11:05,370 Now, the critical point was that you're not 299 00:11:05,370 --> 00:11:08,840 guaranteed to get to the correct energetic structure. 300 00:11:08,840 --> 00:11:12,820 So in the case that I showed you before where we had the side 301 00:11:12,820 --> 00:11:16,300 chain side-on, if you actually do the minimization there, 302 00:11:16,300 --> 00:11:19,280 you actually end up with the side chain rotated 180 degrees 303 00:11:19,280 --> 00:11:20,500 from where it's supposed to be. 304 00:11:20,500 --> 00:11:22,390 So it eliminates all the steric clashes, 305 00:11:22,390 --> 00:11:25,260 but it doesn't actually pick up all the hydrogen bonds. 306 00:11:25,260 --> 00:11:28,700 So this is an example of a local energetic minimum that's 307 00:11:28,700 --> 00:11:31,450 not the global energetic minimum. 308 00:11:31,450 --> 00:11:32,610 Any questions on that? 309 00:11:35,570 --> 00:11:36,590 Yes. 
310 00:11:36,590 --> 00:11:38,850 AUDIENCE: Where do all these n-dimensional equations 311 00:11:38,850 --> 00:11:39,522 come from? 312 00:11:39,522 --> 00:11:40,980 PROFESSOR: Where do what come from? 313 00:11:40,980 --> 00:11:43,014 AUDIENCE: The n-dimensional equations. 314 00:11:43,014 --> 00:11:45,180 PROFESSOR: So these are the equations for the energy 315 00:11:45,180 --> 00:11:48,280 in terms of every single atom in the protein 316 00:11:48,280 --> 00:11:50,590 if you're allowing the atoms to move, or in terms 317 00:11:50,590 --> 00:11:52,060 of every rotatable bond, if you're 318 00:11:52,060 --> 00:11:54,170 allowing only bonds to rotate. 319 00:11:54,170 --> 00:11:57,230 So the question was, where do the multi-dimensional equations 320 00:11:57,230 --> 00:11:58,804 come from. 321 00:11:58,804 --> 00:11:59,470 Other questions? 322 00:12:03,063 --> 00:12:04,470 OK. 323 00:12:04,470 --> 00:12:06,450 All right, so that's the simplest approach. 324 00:12:06,450 --> 00:12:07,970 Literally minimize the energy. 325 00:12:07,970 --> 00:12:10,303 But we said it has this problem that it's not guaranteed 326 00:12:10,303 --> 00:12:12,100 to find the global free energy minimum. 327 00:12:12,100 --> 00:12:14,550 Another approach is molecular dynamics. 328 00:12:14,550 --> 00:12:16,210 So this actually attempts to simulate 329 00:12:16,210 --> 00:12:19,320 what's going on in a protein structure in vitro, 330 00:12:19,320 --> 00:12:22,576 by simulating the force in every atom and the velocity. 331 00:12:22,576 --> 00:12:24,450 Previously, there was no measure of velocity. 332 00:12:24,450 --> 00:12:24,890 Right? 333 00:12:24,890 --> 00:12:25,973 All the atoms were static. 334 00:12:25,973 --> 00:12:27,840 We looked at what the gradient of the energy 335 00:12:27,840 --> 00:12:29,860 was and we move by some arbitrary step 336 00:12:29,860 --> 00:12:31,900 function in the direction of the force. 
337 00:12:31,900 --> 00:12:33,020 Now we're actually going to have velocities 338 00:12:33,020 --> 00:12:34,269 associated with all the atoms. 339 00:12:34,269 --> 00:12:36,210 They're going to be moving around in space. 340 00:12:36,210 --> 00:12:39,157 And we'll have the coordinate at any time t 341 00:12:39,157 --> 00:12:40,990 is going to be determined by the coordinates 342 00:12:40,990 --> 00:12:44,160 of the previous time, t of i minus 1 343 00:12:44,160 --> 00:12:46,045 plus a velocity times the time step. 344 00:12:46,045 --> 00:12:47,920 And the velocities are going to be determined 345 00:12:47,920 --> 00:12:49,510 by the forces, which are determined 346 00:12:49,510 --> 00:12:51,700 by the gradient of the potential energy. 347 00:12:51,700 --> 00:12:52,200 Right? 348 00:12:52,200 --> 00:12:54,950 So we start off, always, with that potential energy function, 349 00:12:54,950 --> 00:12:58,054 which is either from the physics approach 350 00:12:58,054 --> 00:12:59,220 or the statistical approach. 351 00:12:59,220 --> 00:13:00,980 That gives us velocities, eventually 352 00:13:00,980 --> 00:13:02,245 giving us the coordinates. 353 00:13:02,245 --> 00:13:03,620 So we start off with the protein. 354 00:13:03,620 --> 00:13:05,245 There are some serious questions of how 355 00:13:05,245 --> 00:13:06,944 you equilibrate the atoms. 356 00:13:06,944 --> 00:13:09,110 So you start off with a completely static structure. 357 00:13:09,110 --> 00:13:10,542 You want to apply forces to it. 358 00:13:10,542 --> 00:13:12,000 There are some subtleties as to how 359 00:13:12,000 --> 00:13:14,208 you go about doing that, but then you actually end up 360 00:13:14,208 --> 00:13:16,650 simulating the motion of all the atoms. 361 00:13:16,650 --> 00:13:19,946 And just to give you a sense of what that looks like, 362 00:13:19,946 --> 00:13:21,360 I'll show you a quick movie. 363 00:13:29,020 --> 00:13:33,916 So this is the simulation of the folding of a protein structure. 
364 00:13:33,916 --> 00:13:35,540 And the backbone is mostly highlighted. 365 00:13:35,540 --> 00:13:37,740 Most of the side chains are not being shown. 366 00:13:37,740 --> 00:13:41,490 Actually, in bold, but you can see the stick figures. 367 00:13:41,490 --> 00:13:44,760 And slowly it's accumulating its three-dimensional structure. 368 00:13:44,760 --> 00:13:47,044 [VIDEO PLAYBACK] 369 00:14:30,196 --> 00:14:32,180 [LAUGHTER] 370 00:15:02,247 --> 00:15:03,080 [END VIDEO PLAYBACK] 371 00:15:03,080 --> 00:15:04,955 PROFESSOR: OK, I think you get the idea here. 372 00:15:09,080 --> 00:15:10,875 Oh, it won't let me give up. 373 00:15:10,875 --> 00:15:12,220 OK, here we go. 374 00:15:12,220 --> 00:15:14,530 OK, so these are the equations that 375 00:15:14,530 --> 00:15:17,710 are governing the motion in an example like that. 376 00:15:17,710 --> 00:15:24,540 Now, the advantage of this is we're actually 377 00:15:24,540 --> 00:15:26,480 simulating the protein folding. 378 00:15:26,480 --> 00:15:28,690 So if we do it correctly, we should always 379 00:15:28,690 --> 00:15:29,610 get the right answer. 380 00:15:29,610 --> 00:15:32,770 Of course, that's not what happens in reality. 381 00:15:32,770 --> 00:15:35,610 Probably the biggest problem is just computational speed. 382 00:15:35,610 --> 00:15:39,000 So these simulations-- even very, very 383 00:15:39,000 --> 00:15:40,890 short ones like the one I showed you-- 384 00:15:40,890 --> 00:15:43,790 so how long does it take a protein to fold in vitro? 385 00:15:43,790 --> 00:15:46,137 A long folding might take a millisecond, 386 00:15:46,137 --> 00:15:47,720 and for a very small protein like that 387 00:15:47,720 --> 00:15:49,544 it might be orders of magnitude faster. 388 00:15:49,544 --> 00:15:50,960 But to actually compute that could 389 00:15:50,960 --> 00:15:53,720 take many, many, many days. 390 00:15:53,720 --> 00:15:56,800 So a lot of computing resources going into this. 
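[Editor's sketch] The update rule described for molecular dynamics (new coordinate equals previous coordinate plus velocity times the time step, with velocities driven by forces from the potential energy) can be illustrated on a single atom held by one spring-like bond. This is a toy under stated assumptions, not a protein force field: the harmonic potential U = 0.5 k x^2, the unit mass, the time step, and the names `force` and `simulate` are hypothetical choices for illustration, and the velocity-first update order is one common variant.

```python
import math


def force(x, k=1.0):
    """Force from an assumed harmonic bond: F = -dU/dx with U = 0.5 * k * x**2."""
    return -k * x


def simulate(x0, v0, mass=1.0, dt=0.01, n_steps=1000):
    """Advance one coordinate and one velocity through n_steps time steps."""
    x, v = x0, v0
    trajectory = [x]
    for _ in range(n_steps):
        v = v + (force(x) / mass) * dt  # the force changes the velocity
        x = x + v * dt                  # the velocity changes the coordinate
        trajectory.append(x)
    return x, v, trajectory


x_final, v_final, trajectory = simulate(x0=1.0, v0=0.0)
energy = 0.5 * v_final**2 + 0.5 * x_final**2  # kinetic + potential (mass = k = 1)
print(x_final, math.cos(10.0), energy)  # compare with the exact solution cos(t)
```

Even this one-atom example needs a thousand steps to cover a short stretch of simulated time, which hints at why full protein simulations with explicit water are so expensive.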
391 00:15:56,800 --> 00:15:58,700 Also, if we want to accurately represent 392 00:15:58,700 --> 00:16:01,580 solvation-- the interaction of the protein with water, which 393 00:16:01,580 --> 00:16:04,112 is what causes the hydrophobic collapse, as we saw-- then 394 00:16:04,112 --> 00:16:06,570 you actually would have to have water in those simulations. 395 00:16:06,570 --> 00:16:08,944 And each water molecule adds a lot of degrees of freedom, 396 00:16:08,944 --> 00:16:12,000 so that increases the computational cost, as well. 397 00:16:12,000 --> 00:16:15,040 So all of these things determine the radius of convergence. 398 00:16:15,040 --> 00:16:17,620 How far away can you be from the true structure 399 00:16:17,620 --> 00:16:19,214 and still get there? 400 00:16:19,214 --> 00:16:20,630 For very small proteins like this, 401 00:16:20,630 --> 00:16:22,213 with a lot of computational resources, 402 00:16:22,213 --> 00:16:26,330 you can get from an unfolded protein to the folded state. 403 00:16:26,330 --> 00:16:28,170 We'll see some important advances that 404 00:16:28,170 --> 00:16:30,840 allow us to get around this, but in most cases 405 00:16:30,840 --> 00:16:32,985 we only can do relatively local changes. 406 00:16:35,990 --> 00:16:40,450 So that brings us to our third approach for refining protein 407 00:16:40,450 --> 00:16:42,920 structures, which is called simulated annealing. 408 00:16:42,920 --> 00:16:44,900 And the inspiration for this name 409 00:16:44,900 --> 00:16:47,670 comes from metallurgy and how to get 410 00:16:47,670 --> 00:16:50,490 the best atomic structure in a metal. 411 00:16:50,490 --> 00:16:53,090 I don't know if any of you have ever done any metalworking. 412 00:16:53,090 --> 00:16:54,376 Anyone? 413 00:16:54,376 --> 00:16:56,410 Oh, OK, well one person. 414 00:16:56,410 --> 00:16:57,700 That's better than most years. 
415 00:16:57,700 --> 00:17:01,920 I have not, but I understand that in metallurgy-- 416 00:17:01,920 --> 00:17:04,469 and you can correct me if I'm wrong-- that by repeatedly 417 00:17:04,469 --> 00:17:06,010 raising and lowering the temperature, 418 00:17:06,010 --> 00:17:08,119 you can get better metal structures. 419 00:17:08,119 --> 00:17:09,680 Is that reasonably accurate? 420 00:17:09,680 --> 00:17:10,274 OK. 421 00:17:10,274 --> 00:17:12,440 You can talk to one of your fellow students for more 422 00:17:12,440 --> 00:17:13,730 details if you're interested. 423 00:17:13,730 --> 00:17:15,450 So this similar idea is going to be 424 00:17:15,450 --> 00:17:18,490 used in this computational approach. 425 00:17:18,490 --> 00:17:21,589 We're going to try to find the most probable conformation 426 00:17:21,589 --> 00:17:24,420 of atoms by trying to get out of some local minima 427 00:17:24,420 --> 00:17:27,069 by raising the energy of the system 428 00:17:27,069 --> 00:17:28,870 and then changing the temperatures, 429 00:17:28,870 --> 00:17:31,120 or raising and lowering it according to some heating 430 00:17:31,120 --> 00:17:33,484 and cooling schedule to get the atoms into their most 431 00:17:33,484 --> 00:17:35,650 probable conformation, the most stable conformation. 432 00:17:38,605 --> 00:17:40,230 And this goes back to this idea that we 433 00:17:40,230 --> 00:17:41,710 started with the local minima. 434 00:17:41,710 --> 00:17:43,730 If we're just doing energy minimization, 435 00:17:43,730 --> 00:17:46,520 we're not going to be able to get from this minimum 436 00:17:46,520 --> 00:17:48,830 to this minimum, because these energetic barriers are 437 00:17:48,830 --> 00:17:49,367 in the way. 438 00:17:49,367 --> 00:17:51,200 So we need to raise the energy of the system 439 00:17:51,200 --> 00:17:53,600 to jump over these energetic barriers 440 00:17:53,600 --> 00:17:57,410 before we can get to the global free energy minimum. 
441 00:17:57,410 --> 00:18:00,120 But if we just move at very high temperature all the time, 442 00:18:00,120 --> 00:18:02,582 we will sample the entire energetic space 443 00:18:02,582 --> 00:18:04,040 but it's going to take a long time. 444 00:18:04,040 --> 00:18:05,285 We're going to be sampling a lot of conformations 445 00:18:05,285 --> 00:18:07,270 that are low probability, as well. 446 00:18:07,270 --> 00:18:08,910 So this approach allows us to balance 447 00:18:08,910 --> 00:18:11,535 the need for speed and the need to be at high temperature 448 00:18:11,535 --> 00:18:13,410 where we can overcome some of these barriers. 449 00:18:22,870 --> 00:18:25,120 So one thing that I want to stress here 450 00:18:25,120 --> 00:18:27,640 is that we've made a physical analogy to this metallurgy 451 00:18:27,640 --> 00:18:28,140 process. 452 00:18:28,140 --> 00:18:30,514 We're talking about raising the temperature of the system 453 00:18:30,514 --> 00:18:32,410 and letting the atoms evolve under forces, 454 00:18:32,410 --> 00:18:34,200 but it's in no way meant to simulate 455 00:18:34,200 --> 00:18:36,040 what's going on in protein folding. 456 00:18:36,040 --> 00:18:37,669 So molecular dynamics would try to say, 457 00:18:37,669 --> 00:18:39,710 this is what's actually happening to this protein 458 00:18:39,710 --> 00:18:41,690 as it folds in water. 459 00:18:41,690 --> 00:18:44,250 Simulated annealing is using high temperature 460 00:18:44,250 --> 00:18:46,635 to search over spaces and then low temperature. 461 00:18:46,635 --> 00:18:49,010 But these temperatures are much, much higher than the protein 462 00:18:49,010 --> 00:18:51,620 would ever encounter, so it's not a simulation. 463 00:18:51,620 --> 00:18:54,678 It's a search strategy.
464 00:18:54,678 --> 00:18:58,992 OK, so the key to this-- and I'll 465 00:18:58,992 --> 00:19:00,700 tell you the full algorithm in a second-- 466 00:19:00,700 --> 00:19:02,040 but at various steps in the algorithm 467 00:19:02,040 --> 00:19:03,706 we're trying to make decisions about how 468 00:19:03,706 --> 00:19:05,650 to move from our current set of coordinates 469 00:19:05,650 --> 00:19:07,630 to some alternative set of coordinates. 470 00:19:07,630 --> 00:19:11,030 Now, that new set of coordinates we're going to call the test state. 471 00:19:11,030 --> 00:19:13,569 And we're going to decide whether the new state is 472 00:19:13,569 --> 00:19:15,360 more or less probable than the current one. 473 00:19:15,360 --> 00:19:15,540 Right? 474 00:19:15,540 --> 00:19:17,706 If it's lower in energy, then what's it going to be? 475 00:19:17,706 --> 00:19:19,700 It's going to be more probable, right? 476 00:19:19,700 --> 00:19:21,710 And so in this algorithm, we're always 477 00:19:21,710 --> 00:19:24,160 going to accept those states that are lower in free energy 478 00:19:24,160 --> 00:19:25,970 than our current state. 479 00:19:25,970 --> 00:19:28,097 What happens when the state is higher 480 00:19:28,097 --> 00:19:29,680 in free energy than our current state? 481 00:19:29,680 --> 00:19:32,100 So it turns out we are going to accept it probabilistically. 482 00:19:32,100 --> 00:19:34,599 Sometimes it's going to move up in energy and sometimes not, 483 00:19:34,599 --> 00:19:36,630 and that is going to allow us to go 484 00:19:36,630 --> 00:19:38,730 over some of those energetic barriers 485 00:19:38,730 --> 00:19:42,100 and try to get to new energetic states that would not 486 00:19:42,100 --> 00:19:44,470 be accessible to pure minimization. 487 00:19:44,470 --> 00:19:47,440 So the form of this is the Boltzmann equation, right?
488 00:19:47,440 --> 00:19:49,625 The probability of some test state compared 489 00:19:49,625 --> 00:19:51,250 to the probability of a reference state 490 00:19:51,250 --> 00:19:55,300 is going to be the ratio of these two Boltzmann equations-- 491 00:19:55,300 --> 00:19:57,650 the energy of the test state over the energy 492 00:19:57,650 --> 00:19:58,670 of the current state. 493 00:19:58,670 --> 00:20:01,915 So it's e to the minus difference in energy over kT. 494 00:20:01,915 --> 00:20:03,790 And we'll come back to where this temperature 495 00:20:03,790 --> 00:20:05,150 term comes from in a second. 496 00:20:07,770 --> 00:20:10,190 OK, so here's the full algorithm. 497 00:20:10,190 --> 00:20:12,740 We will either iterate for a fixed number of steps 498 00:20:12,740 --> 00:20:14,060 or until convergence. 499 00:20:14,060 --> 00:20:16,280 We'll see we don't always converge. 500 00:20:16,280 --> 00:20:18,910 We have some initial conformation. 501 00:20:18,910 --> 00:20:20,630 Our current conformation will be state n, 502 00:20:20,630 --> 00:20:22,200 and we can compute its energy 503 00:20:22,200 --> 00:20:23,770 from those potential energy functions 504 00:20:23,770 --> 00:20:26,124 that we discussed in the last meeting. 505 00:20:26,124 --> 00:20:28,290 We're going to choose a neighboring state at random. 506 00:20:28,290 --> 00:20:30,420 So what does neighboring mean? 507 00:20:30,420 --> 00:20:32,750 So if I'm defining this in terms of XYZ coordinates, 508 00:20:32,750 --> 00:20:34,410 for every atom I've got a set of XYZ 509 00:20:34,410 --> 00:20:37,090 coordinates, and I'm going to change a few of them 510 00:20:37,090 --> 00:20:38,090 by a small amount. 511 00:20:38,090 --> 00:20:38,240 Right? 512 00:20:38,240 --> 00:20:39,823 If I change them all by large amounts, 513 00:20:39,823 --> 00:20:41,520 I have a completely different structure. 514 00:20:41,520 --> 00:20:43,228 So I'm going to make small perturbations.
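A minimal sketch of what "change a few of them by a small amount" could look like for XYZ coordinates. The function name, the number of atoms moved, and the maximum shift are all illustrative assumptions, not values from the lecture.

```python
import random

def random_neighbor(coords, n_atoms_to_move=2, max_shift=0.1, rng=random):
    """Return a copy of coords with a few atoms nudged by a small random amount.

    coords: list of (x, y, z) tuples, one per atom.
    n_atoms_to_move and max_shift (same units as coords) are hypothetical
    tuning knobs chosen for illustration only.
    """
    new_coords = list(coords)
    # Pick a small random subset of atoms; all the others stay where they are.
    chosen = rng.sample(range(len(coords)), k=min(n_atoms_to_move, len(coords)))
    for i in chosen:
        x, y, z = coords[i]
        new_coords[i] = (
            x + rng.uniform(-max_shift, max_shift),
            y + rng.uniform(-max_shift, max_shift),
            z + rng.uniform(-max_shift, max_shift),
        )
    return new_coords
```

Keeping both the subset and the shift small is what preserves continuity between the current state and the test state.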
515 00:20:43,228 --> 00:20:47,780 And if I'm doing this with fixed backbone angles 516 00:20:47,780 --> 00:20:49,620 and just rotating the side chains, then what 517 00:20:49,620 --> 00:20:52,580 would a neighboring state be? 518 00:20:52,580 --> 00:20:53,140 Any thoughts? 519 00:20:59,530 --> 00:21:01,490 What would a neighboring state be? 520 00:21:01,490 --> 00:21:03,740 Anyone? 521 00:21:03,740 --> 00:21:05,615 Change a few of the side chain angles, right? 522 00:21:05,615 --> 00:21:07,698 So we don't want to globally change the structure. 523 00:21:07,698 --> 00:21:09,770 We want some continuity between the current state 524 00:21:09,770 --> 00:21:11,230 and the next state. 525 00:21:11,230 --> 00:21:13,350 So we're going to choose an adjacent state 526 00:21:13,350 --> 00:21:15,550 in that sense, in the state space. 527 00:21:15,550 --> 00:21:17,180 And then here are the rules. 528 00:21:17,180 --> 00:21:19,340 If the new state has an energy that's 529 00:21:19,340 --> 00:21:23,100 lower than the current state, we simply accept the new state. 530 00:21:23,100 --> 00:21:24,850 If not, this is where it gets interesting. 531 00:21:24,850 --> 00:21:26,430 Then, we accept that higher energy 532 00:21:26,430 --> 00:21:28,151 with a probability that's associated 533 00:21:28,151 --> 00:21:29,650 with the difference in the energies. 534 00:21:29,650 --> 00:21:31,380 So if the difference is very, very large, 535 00:21:31,380 --> 00:21:32,900 there's a low probability we'll accept. 536 00:21:32,900 --> 00:21:34,733 If the difference is only slightly higher, then 537 00:21:34,733 --> 00:21:36,979 there's a higher probability that we accept. 538 00:21:36,979 --> 00:21:39,270 If we reject it, we just drop back to our current state 539 00:21:39,270 --> 00:21:41,140 and we look for a new test state. 540 00:21:41,140 --> 00:21:41,780 OK? 541 00:21:41,780 --> 00:21:43,250 Any questions on how we do this? 542 00:21:47,360 --> 00:21:48,410 Question, yes.
543 00:21:48,410 --> 00:21:51,690 AUDIENCE: How far away do we search for neighbors? 544 00:21:51,690 --> 00:21:53,720 PROFESSOR: That's the art of this process, 545 00:21:53,720 --> 00:21:55,330 so I can't give you a straight answer. 546 00:21:55,330 --> 00:21:58,957 Different approaches will use different thresholds. 547 00:21:58,957 --> 00:21:59,790 Any other questions? 548 00:22:04,300 --> 00:22:06,746 OK, so the key thing I want you to realize, 549 00:22:06,746 --> 00:22:08,120 then, is there's this distinction 550 00:22:08,120 --> 00:22:09,500 between the minimization approach 551 00:22:09,500 --> 00:22:10,940 and the simulated annealing approach. 552 00:22:10,940 --> 00:22:13,080 Minimization can only go from state one 553 00:22:13,080 --> 00:22:15,639 to the local free energy minimum, 554 00:22:15,639 --> 00:22:17,680 whereas the simulated annealing has the potential 555 00:22:17,680 --> 00:22:19,350 to go much further afield, and potentially 556 00:22:19,350 --> 00:22:21,058 to get to the global free energy minimum. 557 00:22:21,058 --> 00:22:22,990 But it's not guaranteed to find it. 558 00:22:22,990 --> 00:22:26,120 OK, so let's say we start in state one 559 00:22:26,120 --> 00:22:28,039 and our neighbor state was state two. 560 00:22:28,039 --> 00:22:30,080 So we'd accept that with 100% probability, right? 561 00:22:30,080 --> 00:22:31,464 Because it's lower in energy. 562 00:22:31,464 --> 00:22:33,380 Then let's say the neighboring state turns out 563 00:22:33,380 --> 00:22:35,694 to be state three. That's higher in energy, 564 00:22:35,694 --> 00:22:37,610 so there's a probability that we'll accept it, 565 00:22:37,610 --> 00:22:39,359 based on the difference between the energy 566 00:22:39,359 --> 00:22:40,612 of state two and state three. 567 00:22:40,612 --> 00:22:42,320 Similarly from state three to state four, 568 00:22:42,320 --> 00:22:44,570 so we might drop back to state two. 569 00:22:44,570 --> 00:22:45,780 We might go up.
570 00:22:45,780 --> 00:22:48,110 And then we can eventually get over the hump this way 571 00:22:48,110 --> 00:22:49,330 with some probability. 572 00:22:49,330 --> 00:22:51,610 It's a sum of each of those steps. 573 00:22:51,610 --> 00:22:52,110 OK? 574 00:22:58,550 --> 00:23:01,744 OK, so if this is our function for deciding 575 00:23:01,744 --> 00:23:03,160 whether to accept a new state, how 576 00:23:03,160 --> 00:23:06,070 does temperature affect our decisions? 577 00:23:06,070 --> 00:23:10,962 What happens when the temperature is very, very high, 578 00:23:10,962 --> 00:23:12,420 if you look at that equation? 579 00:23:12,420 --> 00:23:14,630 So it's e to the minus delta-- 580 00:23:14,630 --> 00:23:17,010 the difference in the energy-- over kT. 581 00:23:17,010 --> 00:23:19,200 So if T is very, very large, then 582 00:23:19,200 --> 00:23:22,159 what happens to that exponent? 583 00:23:22,159 --> 00:23:22,950 It approaches zero. 584 00:23:22,950 --> 00:23:27,240 So e to the minus zero is going to be approximately 1, right? 585 00:23:27,240 --> 00:23:29,570 So at very high temperatures, we almost always 586 00:23:29,570 --> 00:23:31,180 take the high energy state. 587 00:23:31,180 --> 00:23:33,980 So that's what allows us to climb those energetic hills. 588 00:23:33,980 --> 00:23:35,380 If I have a very high temperature 589 00:23:35,380 --> 00:23:36,838 in my simulated annealing, then I'm 590 00:23:36,838 --> 00:23:39,094 always going over those barriers. 591 00:23:39,094 --> 00:23:40,510 So conversely, what happens, then, 592 00:23:40,510 --> 00:23:44,187 when I set the temperature very low? 593 00:23:44,187 --> 00:23:45,895 Then there's a very, very low probability 594 00:23:45,895 --> 00:23:48,640 of accepting those changes, right? 595 00:23:48,640 --> 00:23:51,350 So if I have a very low temperature-- temperature 596 00:23:51,350 --> 00:23:54,230 approximately zero-- then I'll never go uphill. 597 00:23:54,230 --> 00:23:56,440 Almost never go uphill.
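The temperature dependence described above can be checked numerically. This is a small sketch of the acceptance rule, e to the minus delta E over kT for uphill moves and 1 for downhill moves; the function name and the sample kT values are illustrative assumptions, and kT is taken to be positive and in the same units as the energy.

```python
import math

def acceptance_probability(delta_e, kT):
    """Metropolis acceptance probability for an energy change delta_e.

    delta_e = E(test state) - E(current state); kT must be positive and
    expressed in the same units as the energy.
    """
    if delta_e <= 0:
        return 1.0  # downhill (or equal-energy) moves are always accepted
    return math.exp(-delta_e / kT)  # uphill moves are accepted only sometimes

# The same uphill step of one energy unit at a high, a moderate,
# and a very low temperature.
for kT in (100.0, 1.0, 0.01):
    print(kT, acceptance_probability(1.0, kT))
```

At the largest kT the printed probability is close to 1, and at the smallest it is vanishingly small, which matches the two limits discussed in the lecture.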
598 00:23:56,440 --> 00:23:58,990 So we have a lot of control over how much of the space 599 00:23:58,990 --> 00:24:03,657 this algorithm explores by how we set the temperature. 600 00:24:03,657 --> 00:24:06,240 So this is again a little bit of the art of simulated annealing-- 601 00:24:06,240 --> 00:24:08,510 deciding exactly what annealing schedule to use, 602 00:24:08,510 --> 00:24:10,405 what temperature program you use. 603 00:24:10,405 --> 00:24:12,490 Do you start off high and go linearly down? 604 00:24:12,490 --> 00:24:14,090 Do you use some other, more complicated function 605 00:24:14,090 --> 00:24:15,173 to decide the temperature? 606 00:24:15,173 --> 00:24:17,180 We won't go into exactly how to choose these. 607 00:24:17,180 --> 00:24:19,360 [INAUDIBLE] you could track some of these things 608 00:24:19,360 --> 00:24:22,510 down from the references that are in the notes. 609 00:24:22,510 --> 00:24:23,484 So we have this choice. 610 00:24:23,484 --> 00:24:24,900 But the basic idea is, we're going 611 00:24:24,900 --> 00:24:26,233 to start at higher temperatures. 612 00:24:26,233 --> 00:24:28,207 We're going to explore most of the space. 613 00:24:28,207 --> 00:24:29,790 And then, as we lower the temperature, 614 00:24:29,790 --> 00:24:32,466 we freeze ourselves into the most probable conformations. 615 00:24:35,980 --> 00:24:38,740 Now, there's nothing that restricts simulated annealing 616 00:24:38,740 --> 00:24:40,420 to protein structure. 617 00:24:40,420 --> 00:24:42,120 This approach is actually quite general. 618 00:24:42,120 --> 00:24:44,490 It's called the Metropolis-Hastings algorithm. 619 00:24:44,490 --> 00:24:47,375 It's often used in cases where there's no energy whatsoever 620 00:24:47,375 --> 00:24:50,350 and it's thought of purely in probabilistic terms.
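To make the pieces concrete (random neighbor, probabilistic acceptance, cooling schedule), here is a toy sketch of simulated annealing on a one-dimensional energy function. Everything specific in it is an assumption for illustration: the double-well energy, the linear cooling schedule, the step size, and the step count are not taken from the lecture or from any protein force field.

```python
import math
import random

def energy(x):
    """Toy 1D energy: a shallow well near x = +1 and a deeper one near x = -1."""
    return (x * x - 1.0) ** 2 + 0.3 * x

def simulated_annealing(x0, n_steps=20000, t_start=2.0, t_end=0.01,
                        step=0.2, seed=0):
    """Anneal from x0 and return (final_x, final_e, best_x, best_e)."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    for i in range(n_steps):
        # Linear cooling schedule from t_start down to t_end (kT units).
        kT = t_start + (t_end - t_start) * i / (n_steps - 1)
        # Neighboring state: a small random displacement of the current one.
        x_test = x + rng.uniform(-step, step)
        e_test = energy(x_test)
        delta = e_test - e
        # Always accept downhill; accept uphill with probability exp(-delta/kT).
        if delta <= 0 or rng.random() < math.exp(-delta / kT):
            x, e = x_test, e_test
            if e < best_e:
                best_x, best_e = x, e
        # On rejection we simply stay in the current state and try again.
    return x, e, best_x, best_e

# Start in the shallow well; plain downhill minimization would stay there.
final_x, final_e, best_x, best_e = simulated_annealing(x0=1.0)
print(final_x, final_e, best_x, best_e)
```

Starting from the shallow well at x = +1, the early high-temperature steps let the walker cross the barrier near x = 0, and the late low-temperature steps behave almost like pure minimization. As the lecture notes, nothing guarantees the final state is the global minimum.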
621 00:24:50,350 --> 00:24:53,700 So if I have some probabilistic function-- some probability 622 00:24:53,700 --> 00:24:57,580 of being in some state S-- I can choose a neighboring state 623 00:24:57,580 --> 00:24:59,220 at random. 624 00:24:59,220 --> 00:25:01,060 Then I can compute an acceptance ratio, 625 00:25:01,060 --> 00:25:03,690 which is the probability of being in state S 626 00:25:03,690 --> 00:25:06,570 test over the probability of being in the current state. 627 00:25:06,570 --> 00:25:08,870 This is what we did in terms of the Boltzmann equation, 628 00:25:08,870 --> 00:25:11,078 but if I have some other formulation for the probabilities 629 00:25:11,078 --> 00:25:12,560 I'll just use that. 630 00:25:12,560 --> 00:25:15,860 And then, just like in our protein folding example, 631 00:25:15,860 --> 00:25:18,770 if this acceptance ratio is greater than 1, 632 00:25:18,770 --> 00:25:20,050 we accept the new state. 633 00:25:20,050 --> 00:25:21,980 If it's less than 1, then we accept it 634 00:25:21,980 --> 00:25:24,740 with a probabilistic statement. 635 00:25:24,740 --> 00:25:26,924 And so this is a very general approach. 636 00:25:26,924 --> 00:25:28,840 I think you might see it in your problem sets. 637 00:25:28,840 --> 00:25:30,881 We certainly have done this on past exams-- asked 638 00:25:30,881 --> 00:25:34,250 you to apply this algorithm to other probabilistic settings. 639 00:25:34,250 --> 00:25:37,490 So it's a very, very general way to search or sample 640 00:25:37,490 --> 00:25:41,030 across a probabilistic landscape. 641 00:25:41,030 --> 00:25:44,020 OK, so we've seen these three separate approaches, 642 00:25:44,020 --> 00:25:46,330 starting with an approximate structure 643 00:25:46,330 --> 00:25:48,370 and trying to get to the correct structure. 644 00:25:48,370 --> 00:25:50,170 We have energy minimization, which 645 00:25:50,170 --> 00:25:53,130 will move towards the local minimum-energy conformation.
646 00:25:53,130 --> 00:25:55,230 So it's very fast compared to the other two, 647 00:25:55,230 --> 00:25:57,470 but it's restricted to local changes. 648 00:25:57,470 --> 00:25:59,220 We have molecular dynamics, which actually 649 00:25:59,220 --> 00:26:01,880 tries to simulate the biological process. 650 00:26:01,880 --> 00:26:03,590 Computationally very intensive. 651 00:26:03,590 --> 00:26:05,131 And then we have simulated annealing, 652 00:26:05,131 --> 00:26:07,030 which tries to shortcut the route to some 653 00:26:07,030 --> 00:26:08,980 of these global free energy minima 654 00:26:08,980 --> 00:26:11,930 by raising the temperature, pretending we're at this very 655 00:26:11,930 --> 00:26:13,930 high temperature so we can sample all the space, 656 00:26:13,930 --> 00:26:17,170 and then cooling down so we trap a high probability 657 00:26:17,170 --> 00:26:18,680 conformation. 658 00:26:18,680 --> 00:26:20,715 Any questions on any of these three approaches? 659 00:26:25,090 --> 00:26:27,260 OK. 660 00:26:27,260 --> 00:26:29,230 All right, so I'm going to go through now some 661 00:26:29,230 --> 00:26:32,150 of the approaches that have already 662 00:26:32,150 --> 00:26:35,029 been used to try to solve protein structures. 663 00:26:35,029 --> 00:26:36,320 We started off with a sequence. 664 00:26:36,320 --> 00:26:38,800 We'd like to figure out what the structure is. 665 00:26:38,800 --> 00:26:42,090 And this field has had a tremendous advance, 666 00:26:42,090 --> 00:26:44,990 because in 1995 a group got together and came up 667 00:26:44,990 --> 00:26:47,550 with an objective way of evaluating 668 00:26:47,550 --> 00:26:49,945 whether these methods were working.
669 00:26:49,945 --> 00:26:51,570 So lots of people have proposed methods 670 00:26:51,570 --> 00:26:53,390 for predicting protein structure, 671 00:26:53,390 --> 00:26:57,540 and what the CASP group did in '95 was they said, 672 00:26:57,540 --> 00:27:01,030 we will collect structures from crystallographers, 673 00:27:01,030 --> 00:27:04,430 NMR spectroscopists, that they have not yet 674 00:27:04,430 --> 00:27:06,280 published but they know they're likely to be 675 00:27:06,280 --> 00:27:11,020 able to get within the time scale of this project. 676 00:27:11,020 --> 00:27:13,440 We will send out those sequences to the modelers. 677 00:27:13,440 --> 00:27:15,750 The modelers will attempt to predict the structure, 678 00:27:15,750 --> 00:27:16,990 and then at the end of the competition 679 00:27:16,990 --> 00:27:18,065 we'll go back to the crystallographers 680 00:27:18,065 --> 00:27:20,489 and the spectroscopists and say, OK, give us a structure 681 00:27:20,489 --> 00:27:22,280 and now we'll compare the predicted answers 682 00:27:22,280 --> 00:27:22,870 to the real ones. 683 00:27:22,870 --> 00:27:24,860 So no one knows what the answer is 684 00:27:24,860 --> 00:27:28,260 until all the submissions are there, 685 00:27:28,260 --> 00:27:30,630 and then you can see objectively which of the approaches 686 00:27:30,630 --> 00:27:32,435 did the best. 687 00:27:32,435 --> 00:27:34,310 And one of the approaches that consistently 688 00:27:34,310 --> 00:27:36,601 has done very well, which we'll look at in some detail, 689 00:27:36,601 --> 00:27:38,510 is this approach called Rosetta. 690 00:27:38,510 --> 00:27:43,410 So you can look at the details online. 691 00:27:43,410 --> 00:27:46,740 They split this modeling problem into two types. 692 00:27:46,740 --> 00:27:48,450 There are ones for which you can come up 693 00:27:48,450 --> 00:27:50,135 with a reasonable homology model.
694 00:27:50,135 --> 00:27:52,010 This can be very, very low sequence homology, 695 00:27:52,010 --> 00:27:54,343 but there's something in the database of known structures 696 00:27:54,343 --> 00:27:57,400 that's similar in sequence to the query. 697 00:27:57,400 --> 00:28:00,850 And then ones where it's completely de novo. 698 00:28:00,850 --> 00:28:03,769 So how do they go about predicting these structures? 699 00:28:03,769 --> 00:28:06,060 So if there's homology, you can imagine the first thing 700 00:28:06,060 --> 00:28:08,860 you want to do is align your sequence to the sequence 701 00:28:08,860 --> 00:28:11,000 of the protein that has a known structure. 702 00:28:11,000 --> 00:28:14,930 Now, if it's high homology this is not a hard problem, right? 703 00:28:14,930 --> 00:28:16,410 We just need to do a few tweaks. 704 00:28:16,410 --> 00:28:19,170 But we get to places-- what's called the Twilight 705 00:28:19,170 --> 00:28:22,490 Zone, in fact-- where there's a high probability that you're 706 00:28:22,490 --> 00:28:25,410 wrong, that your sequence alignments could be to entirely 707 00:28:25,410 --> 00:28:26,420 the wrong structure. 708 00:28:26,420 --> 00:28:28,602 And that's where things get interesting. 709 00:28:28,602 --> 00:28:30,310 So they've got high sequence similarity-- 710 00:28:30,310 --> 00:28:32,450 greater than 50% sequence similarity-- that 711 00:28:32,450 --> 00:28:34,660 are considered relatively easy problems. 712 00:28:34,660 --> 00:28:38,120 These medium problems that are 20% to 50% sequence similarity. 713 00:28:38,120 --> 00:28:40,770 And then very low sequence similarity problems-- less 714 00:28:40,770 --> 00:28:42,680 than 20% sequence similarity. 715 00:28:46,560 --> 00:28:49,452 OK, so you've already seen in this course methods 716 00:28:49,452 --> 00:28:50,910 for doing sequence alignment, so we 717 00:28:50,910 --> 00:28:53,790 don't have to go into that in any detail.
718 00:28:53,790 --> 00:28:56,139 But there are a lot of different specific approaches 719 00:28:56,139 --> 00:28:57,430 for how to do those alignments. 720 00:28:57,430 --> 00:29:00,830 You could do anything from BLAST to highly sophisticated Markov 721 00:29:00,830 --> 00:29:03,770 models to try to decide what's most similar to your protein 722 00:29:03,770 --> 00:29:04,270 structure. 723 00:29:04,270 --> 00:29:06,353 And one of the important things that Rosetta found 724 00:29:06,353 --> 00:29:08,090 was not to rely on any single method 725 00:29:08,090 --> 00:29:10,730 but to try a bunch of different alignment approaches 726 00:29:10,730 --> 00:29:12,160 and then follow through with many 727 00:29:12,160 --> 00:29:14,030 of the different alignments. 728 00:29:14,030 --> 00:29:15,570 And then we get this problem of how 729 00:29:15,570 --> 00:29:17,840 do you refine the models, which is what we've already 730 00:29:17,840 --> 00:29:21,090 started to talk about. 731 00:29:21,090 --> 00:29:22,820 So in the general refinement procedure, 732 00:29:22,820 --> 00:29:25,170 when you have a protein that's in relatively good shape 733 00:29:25,170 --> 00:29:28,140 they apply random perturbations to the backbone torsion angles. 734 00:29:28,140 --> 00:29:29,890 So this is again the statistical approach, 735 00:29:29,890 --> 00:29:31,389 not allowing every atom to move. 736 00:29:31,389 --> 00:29:35,411 They're just rotating a certain number of the rotatable side 737 00:29:35,411 --> 00:29:35,910 chains. 738 00:29:35,910 --> 00:29:38,370 So we've got the phi and psi angles in the backbone, 739 00:29:38,370 --> 00:29:41,270 and some of the side chains. 740 00:29:41,270 --> 00:29:43,882 They do what's called rotamer optimization of the side chain. 741 00:29:43,882 --> 00:29:44,840 So what does that mean?
742 00:29:44,840 --> 00:29:47,180 Remember that we could allow the side 743 00:29:47,180 --> 00:29:48,980 chains to rotate freely, but very, very 744 00:29:48,980 --> 00:29:51,170 few of those rotations are frequently observed. 745 00:29:51,170 --> 00:29:53,400 So we're going to choose, as discrete choices, 746 00:29:53,400 --> 00:29:56,025 among the best possible rotamers, rotational isomers. 747 00:29:58,940 --> 00:30:02,240 And then once we've found a nearly optimal side chain 748 00:30:02,240 --> 00:30:05,130 conformation from those highly probable ones, 749 00:30:05,130 --> 00:30:07,814 then we allow more continuous optimization 750 00:30:07,814 --> 00:30:08,605 of the side chains. 751 00:30:14,080 --> 00:30:16,357 So when you have a very, very high sequence homology 752 00:30:16,357 --> 00:30:18,190 template, you don't need to do a lot of work 753 00:30:18,190 --> 00:30:19,260 on most of the structure. 754 00:30:19,260 --> 00:30:19,960 Right? 755 00:30:19,960 --> 00:30:21,010 Most of it's going to be correct. 756 00:30:21,010 --> 00:30:22,740 So we're going to focus on those places 757 00:30:22,740 --> 00:30:24,370 where the alignment is poor. 758 00:30:24,370 --> 00:30:26,884 That seems pretty intuitive. 759 00:30:26,884 --> 00:30:28,550 Things get a little bit more interesting 760 00:30:28,550 --> 00:30:32,040 when you've got these medium sequence similarity templates. 761 00:30:32,040 --> 00:30:34,429 So here, even your basic alignment might not be right. 762 00:30:34,429 --> 00:30:36,470 So they actually proceed with multiple alignments 763 00:30:36,470 --> 00:30:40,330 and carry them through the refinement process. 764 00:30:40,330 --> 00:30:42,925 And then, how do you decide which one's the best? 765 00:30:42,925 --> 00:30:44,750 You use the potential energy function. 766 00:30:44,750 --> 00:30:44,950 Right? 767 00:30:44,950 --> 00:30:46,491 So you've already taken a whole bunch 768 00:30:46,491 --> 00:30:48,450 of starting conformations.
769 00:30:48,450 --> 00:30:50,620 We've taken them through this refinement procedure. 770 00:30:50,620 --> 00:30:52,510 You now believe that those energies represent 771 00:30:52,510 --> 00:30:54,770 the probability that the structure is correct, 772 00:30:54,770 --> 00:30:57,020 so you're going to choose which of those conformations 773 00:30:57,020 --> 00:30:58,750 to use based on the energy. 774 00:31:02,050 --> 00:31:06,350 OK, in these medium sequence similarity templates, 775 00:31:06,350 --> 00:31:09,120 the refinement doesn't do the entire protein structure, 776 00:31:09,120 --> 00:31:10,750 but it focuses on particular regions. 777 00:31:10,750 --> 00:31:12,920 So places where there are gaps, insertions, 778 00:31:12,920 --> 00:31:14,300 and deletions in the alignment. 779 00:31:14,300 --> 00:31:14,800 Right? 780 00:31:14,800 --> 00:31:16,508 So your alignment is uncertain, so that's 781 00:31:16,508 --> 00:31:18,259 where you need to refine the structure. 782 00:31:18,259 --> 00:31:20,175 Places that were loops in the starting models, 783 00:31:20,175 --> 00:31:22,040 so they weren't highly constrained. 784 00:31:22,040 --> 00:31:23,540 So it's plausible that they're going 785 00:31:23,540 --> 00:31:25,780 to be different in the starting structure 786 00:31:25,780 --> 00:31:29,945 from some homologous protein and in the final structure. 787 00:31:29,945 --> 00:31:32,320 And then, regions where the sequence conservation is low. 788 00:31:32,320 --> 00:31:35,440 So even if there is a reasonably good alignment, 789 00:31:35,440 --> 00:31:36,940 there's some probability that things 790 00:31:36,940 --> 00:31:40,619 have changed during evolution. 791 00:31:40,619 --> 00:31:42,660 Now, when they do a refinement, how do they do that? 792 00:31:42,660 --> 00:31:45,170 In these places that we've just outlined, 793 00:31:45,170 --> 00:31:48,300 they don't simply randomly perturb all of the angles.
794 00:31:48,300 --> 00:31:51,240 But actually, they take a segment of the protein, 795 00:31:51,240 --> 00:31:53,380 and exactly how long those segments are 796 00:31:53,380 --> 00:31:56,610 has changed over the course of the Rosetta algorithm's 797 00:31:56,610 --> 00:31:57,560 refinement. 798 00:31:57,560 --> 00:32:01,130 But say something on the order of three to six amino acids. 799 00:32:01,130 --> 00:32:03,525 And you look in the database for proteins 800 00:32:03,525 --> 00:32:06,250 that have known structure that contain the same amino acid 801 00:32:06,250 --> 00:32:06,750 sequence. 802 00:32:06,750 --> 00:32:09,220 So it could be a completely unrelated protein structure, 803 00:32:09,220 --> 00:32:11,960 but you develop a peptide library 804 00:32:11,960 --> 00:32:14,036 for all of those short sequences for all 805 00:32:14,036 --> 00:32:15,410 the different possible structures 806 00:32:15,410 --> 00:32:16,150 that they've adopted. 807 00:32:16,150 --> 00:32:18,066 So you know that those are at least structures 808 00:32:18,066 --> 00:32:20,626 that are consistent with that local sequence, 809 00:32:20,626 --> 00:32:22,000 although they might be completely 810 00:32:22,000 --> 00:32:23,630 wrong for this individual protein. 811 00:32:23,630 --> 00:32:26,810 So you pop in all of those alternative possible 812 00:32:26,810 --> 00:32:28,841 structures.
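A rough sketch of the peptide-library idea: index every short sequence window in a set of known structures by its backbone torsions, then copy one of those torsion sets into the model. The data layout (a sequence string plus one (phi, psi) pair per residue), the function names, and the toy angles are hypothetical stand-ins; this is not Rosetta's actual fragment format or API.

```python
from collections import defaultdict

def build_fragment_library(known_structures, k=3):
    """Map each length-k sequence window to the backbone torsions seen for it.

    known_structures: list of (sequence, torsions) pairs, where torsions is a
    list of (phi, psi) tuples, one per residue. This format is an assumption
    made for illustration.
    """
    library = defaultdict(list)
    for sequence, torsions in known_structures:
        for start in range(len(sequence) - k + 1):
            window = sequence[start:start + k]
            library[window].append(torsions[start:start + k])
    return library

def insert_fragment(model_torsions, position, fragment):
    """Return a copy of the model with the fragment's torsions placed at position."""
    updated = list(model_torsions)
    updated[position:position + len(fragment)] = fragment
    return updated

# Two unrelated toy "proteins" that happen to share the tripeptide ACD.
toy_db = [
    ("ACDEF", [(-60, -45), (-60, -45), (-120, 130), (-120, 130), (-60, -45)]),
    ("GACDK", [(-70, 140), (-120, 130), (-120, 130), (-60, -45), (-60, -45)]),
]
library = build_fragment_library(toy_db, k=3)

# Every local structure ever seen for ACD is a candidate to try in the model.
model = [(0.0, 0.0)] * 5
candidates = [insert_fragment(model, 1, frag) for frag in library["ACD"]]
```

Each candidate would then go through local optimization and be scored with the energy function, as the lecture goes on to describe.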
813 00:32:28,841 --> 00:32:30,645 So OK, we replace the torsion angles 814 00:32:30,645 --> 00:32:32,410 with those of peptides of known structure, 815 00:32:32,410 --> 00:32:35,177 and then we do a local optimization using 816 00:32:35,177 --> 00:32:37,010 the kinds of minimization algorithms we just 817 00:32:37,010 --> 00:32:39,370 talked about to see whether there is a structure that's 818 00:32:39,370 --> 00:32:41,659 roughly compatible with that little peptide 819 00:32:41,659 --> 00:32:43,450 that you took from the database that's also 820 00:32:43,450 --> 00:32:45,600 consistent with the rest of the structure. 821 00:32:45,600 --> 00:32:49,050 And after you've done that, then you do a global refinement. 822 00:32:49,050 --> 00:32:50,175 Questions on that approach? 823 00:32:55,710 --> 00:32:57,750 OK, so does this work? 824 00:32:57,750 --> 00:33:00,770 One of the best competitors in this CASP competition. 825 00:33:00,770 --> 00:33:04,230 So here are examples where the native structure's in blue. 826 00:33:04,230 --> 00:33:06,910 The best model they produced was in red, 827 00:33:06,910 --> 00:33:09,880 and the best template-- that's the homologous protein-- 828 00:33:09,880 --> 00:33:11,320 is in green. 829 00:33:11,320 --> 00:33:13,960 And you can see that they agree remarkably well. 830 00:33:13,960 --> 00:33:15,580 OK? 831 00:33:15,580 --> 00:33:18,310 So this is very impressive, especially 832 00:33:18,310 --> 00:33:20,240 compared to some of the other algorithms. 833 00:33:20,240 --> 00:33:21,740 But again, it's focusing on proteins 834 00:33:21,740 --> 00:33:24,380 where there's at least some decent homology to start with. 835 00:33:27,660 --> 00:33:30,270 If you look here at the center of these proteins, 836 00:33:30,270 --> 00:33:32,850 you can see the original structure, I believe, is blue, 837 00:33:32,850 --> 00:33:34,090 and their model's in red.
838 00:33:34,090 --> 00:33:36,800 You can see they also get the side chain conformations more 839 00:33:36,800 --> 00:33:38,660 or less correct, which is quite remarkable. 840 00:33:43,135 --> 00:33:44,510 Now, what gets really interesting 841 00:33:44,510 --> 00:33:45,710 is when they work on these proteins that 842 00:33:45,710 --> 00:33:47,140 have very low sequence homologies. 843 00:33:47,140 --> 00:33:50,120 So we're talking about 20% sequence similarity or less. 844 00:33:50,120 --> 00:33:53,035 So quite often, you'll actually have globally the wrong 845 00:33:53,035 --> 00:33:55,830 fold-- at 20% sequence similarity. 846 00:33:55,830 --> 00:33:56,830 So what do they do here? 847 00:33:56,830 --> 00:33:59,036 They start by saying, OK, we have no guarantee 848 00:33:59,036 --> 00:34:00,910 that our templates are even remotely correct. 849 00:34:00,910 --> 00:34:02,370 So they're going to start with a lot of templates 850 00:34:02,370 --> 00:34:04,820 and they're going to refine all of these in parallel 851 00:34:04,820 --> 00:34:08,198 in hopes that some of them come out right at the other end. 852 00:34:08,198 --> 00:34:10,489 And these are what they call more aggressive refinement 853 00:34:10,489 --> 00:34:11,010 strategies. 854 00:34:11,010 --> 00:34:14,736 So before, where did we focus our refinement energies? 855 00:34:14,736 --> 00:34:17,150 We focused on places that were poorly constrained, 856 00:34:17,150 --> 00:34:20,761 either by evolution or regions of the structure that 857 00:34:20,761 --> 00:34:22,219 weren't well-constrained, or places 858 00:34:22,219 --> 00:34:23,552 where the alignment wasn't good. 859 00:34:23,552 --> 00:34:26,480 Here, they actually go after the relatively well-defined 860 00:34:26,480 --> 00:34:28,279 secondary structure elements, as well.
861 00:34:28,279 --> 00:34:29,820 And so they will allow something that 862 00:34:29,820 --> 00:34:33,480 was a clear alpha helix in all of the templates 863 00:34:33,480 --> 00:34:35,879 to change some of the structure by taking peptides out 864 00:34:35,879 --> 00:34:37,670 of the database that have other structures. 865 00:34:37,670 --> 00:34:38,170 OK? 866 00:34:38,170 --> 00:34:41,380 So you take a very, very aggressive approach 867 00:34:41,380 --> 00:34:42,567 to the refinement. 868 00:34:42,567 --> 00:34:44,900 You rebuild the secondary structure elements, as well as 869 00:34:44,900 --> 00:34:47,389 these gaps, insertions, loops, and regions 870 00:34:47,389 --> 00:34:48,764 with low sequence conservation. 871 00:34:48,764 --> 00:34:50,389 And I think the really remarkable thing 872 00:34:50,389 --> 00:34:51,763 is that this approach also works. 873 00:34:51,763 --> 00:34:55,239 It doesn't work quite as well, but here's a side 874 00:34:55,239 --> 00:34:58,570 by side comparison of a native structure and the best model. 875 00:34:58,570 --> 00:35:01,010 So this is the hidden structure that was only 876 00:35:01,010 --> 00:35:03,740 known to the crystallographer, or the spectroscopist, 877 00:35:03,740 --> 00:35:06,700 who agreed to participate in this CASP competition. 878 00:35:06,700 --> 00:35:08,244 And here is the model they submitted 879 00:35:08,244 --> 00:35:09,660 blind without knowing what it was. 880 00:35:09,660 --> 00:35:11,493 And you can see again and again that there's 881 00:35:11,493 --> 00:35:14,350 a pretty good global similarity between the structures 882 00:35:14,350 --> 00:35:17,320 that they propose and the actual ones. 883 00:35:17,320 --> 00:35:17,900 Not always. 884 00:35:17,900 --> 00:35:20,520 I mean, here's an example where the good parts are highlighted 885 00:35:20,520 --> 00:35:22,470 and the not-so-good parts are shown in white 886 00:35:22,470 --> 00:35:24,030 so you can barely see them. 
887 00:35:24,030 --> 00:35:25,830 [LAUGHTER] 888 00:35:25,830 --> 00:35:27,672 PROFESSOR: But even so, give them that. 889 00:35:27,672 --> 00:35:28,630 Give them their credit. 890 00:35:28,630 --> 00:35:32,820 It's a remarkably good agreement. 891 00:35:32,820 --> 00:35:36,542 Now, we've looked at cases where there's very high sequence 892 00:35:36,542 --> 00:35:39,000 similarity, where there's medium sequence similarity, where 893 00:35:39,000 --> 00:35:40,250 there's low sequence similarity. 894 00:35:40,250 --> 00:35:42,583 But the hardest category are ones where there's actually 895 00:35:42,583 --> 00:35:45,949 nothing in the structural database that's a detectable 896 00:35:45,949 --> 00:35:47,490 homologue to the protein of interest. 897 00:35:47,490 --> 00:35:48,906 So how do you go about doing that? 898 00:35:48,906 --> 00:35:50,310 That's the de novo case. 899 00:35:50,310 --> 00:35:52,930 So in that case, they take the following strategy. 900 00:35:52,930 --> 00:35:56,900 They do a Monte Carlo search for backbone angles. 901 00:35:56,900 --> 00:35:59,360 So specifically, they take short regions-- 902 00:35:59,360 --> 00:36:01,102 and again, the exact length 903 00:36:01,102 --> 00:36:03,060 changes in different versions of the algorithm, 904 00:36:03,060 --> 00:36:06,670 but it's anywhere from three to nine amino acids in the backbone. 905 00:36:06,670 --> 00:36:10,079 They find similar peptides in the database 906 00:36:10,079 --> 00:36:10,870 of known structure. 907 00:36:10,870 --> 00:36:13,220 They take the backbone conformations 908 00:36:13,220 --> 00:36:14,490 from the database. 909 00:36:14,490 --> 00:36:17,020 They set the angles to match those. 910 00:36:17,020 --> 00:36:18,930 And then, they use those Metropolis criteria 911 00:36:18,930 --> 00:36:20,310 that we looked at in simulated annealing. 912 00:36:20,310 --> 00:36:20,520 Right?
913 00:36:20,520 --> 00:36:22,200 The relative probability of the states, 914 00:36:22,200 --> 00:36:23,658 determined by the Boltzmann energy, 915 00:36:23,658 --> 00:36:25,930 to decide whether to accept or not. 916 00:36:25,930 --> 00:36:27,906 If it's lower energy, what happens? 917 00:36:27,906 --> 00:36:29,160 Do you accept? 918 00:36:29,160 --> 00:36:30,650 Do you not accept? 919 00:36:30,650 --> 00:36:31,420 AUDIENCE: Accept. 920 00:36:31,420 --> 00:36:32,336 PROFESSOR: You accept. 921 00:36:32,336 --> 00:36:34,260 And if it's high energy, how do you decide? 922 00:36:34,260 --> 00:36:35,136 AUDIENCE: [INAUDIBLE] 923 00:36:35,136 --> 00:36:36,635 PROFESSOR: [INAUDIBLE], probability. 924 00:36:36,635 --> 00:36:37,170 Very good. 925 00:36:37,170 --> 00:36:41,680 OK, so they do a fixed number of Monte Carlo steps-- 36,000. 926 00:36:41,680 --> 00:36:43,800 And then they repeat this entire process 927 00:36:43,800 --> 00:36:46,260 to get 2,000 final structures. 928 00:36:46,260 --> 00:36:46,900 OK? 929 00:36:46,900 --> 00:36:48,983 Because they really have very, very low confidence 930 00:36:48,983 --> 00:36:51,740 in any individual one of these structures. 931 00:36:51,740 --> 00:36:53,240 OK, now you've got 2,000 structures, 932 00:36:53,240 --> 00:36:54,614 but you're allowed to submit one. 933 00:36:54,614 --> 00:36:55,900 So what do you do? 934 00:36:55,900 --> 00:36:57,674 So they cluster them to try to see 935 00:36:57,674 --> 00:36:59,590 whether there are common patterns that emerge, 936 00:36:59,590 --> 00:37:00,964 and then they refine the clusters 937 00:37:00,964 --> 00:37:03,910 and they submit each cluster as a potential solution 938 00:37:03,910 --> 00:37:06,930 to this problem. 939 00:37:06,930 --> 00:37:09,460 OK, questions on the Rosetta approach? 940 00:37:09,460 --> 00:37:11,137 Yes. 
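A minimal sketch of the accept/reject rule described above, assuming a generic energy function and move proposal. The names `metropolis_accept`, `monte_carlo_search`, `energy`, `propose`, and `kT` are illustrative placeholders, not Rosetta's actual scoring function or fragment-insertion code; only the default step count of 36,000 comes from the lecture.

```python
import math
import random


def metropolis_accept(e_old, e_new, kT=1.0, rng=random):
    """Metropolis criterion: always accept a move that lowers the energy;
    accept an uphill move with probability exp(-(e_new - e_old) / kT)."""
    if e_new <= e_old:
        return True
    return rng.random() < math.exp(-(e_new - e_old) / kT)


def monte_carlo_search(angles, energy, propose, n_steps=36000, kT=1.0, rng=random):
    """Run a fixed number of Monte Carlo steps over a list of backbone angles.

    `energy` maps a list of angles to a number and `propose` returns a
    modified copy of the angles (for example, by copying angles from a
    short database fragment). Both are hypothetical stand-ins here.
    """
    current = list(angles)
    e_current = energy(current)
    for _ in range(n_steps):
        candidate = propose(current, rng)
        e_candidate = energy(candidate)
        if metropolis_accept(e_current, e_candidate, kT, rng):
            current, e_current = candidate, e_candidate
    return current, e_current
```

Repeating `monte_carlo_search` from many independent starting points would give the large pool of final structures (2,000 in the lecture) that are then clustered.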
941 00:37:11,137 --> 00:37:13,678 AUDIENCE: Can you mention again why the short region of three 942 00:37:13,678 --> 00:37:16,300 to nine amino acids, and whether [INAUDIBLE]. 943 00:37:19,919 --> 00:37:21,460 PROFESSOR: So the question is, what's 944 00:37:21,460 --> 00:37:24,890 the motivation for taking these short regions 945 00:37:24,890 --> 00:37:27,710 from the structural database? 946 00:37:27,710 --> 00:37:29,255 Ultimately, this is a modeling choice 947 00:37:29,255 --> 00:37:30,880 that they made that seems to work well. 948 00:37:30,880 --> 00:37:32,150 So it's an empirical choice. 949 00:37:32,150 --> 00:37:34,680 But what possibly motivated them, you might ask, right? 950 00:37:34,680 --> 00:37:37,080 So, the thought has been in this field for a long time, 951 00:37:37,080 --> 00:37:39,120 and it's still, I think, unproven, 952 00:37:39,120 --> 00:37:42,040 that certain sequences will have a certain propensity 953 00:37:42,040 --> 00:37:43,050 to certain structures. 954 00:37:43,050 --> 00:37:44,990 We saw this in the secondary structure prediction 955 00:37:44,990 --> 00:37:47,156 algorithms, that there were certain amino acids that 956 00:37:47,156 --> 00:37:49,450 occurred much more frequently in alpha helixes. 957 00:37:49,450 --> 00:37:53,740 So it could be that there are certain structures that 958 00:37:53,740 --> 00:37:56,480 are very likely to occur for short peptides, 959 00:37:56,480 --> 00:37:58,280 and other ones that almost never occur. 960 00:37:58,280 --> 00:38:01,410 And so if you had a large enough database of protein structures, 961 00:38:01,410 --> 00:38:03,580 then that would be a sensible sampling approach. 962 00:38:03,580 --> 00:38:06,012 Now, in practice, could you have gotten some good answer 963 00:38:06,012 --> 00:38:06,970 in some other approach? 964 00:38:06,970 --> 00:38:07,553 We don't know. 965 00:38:07,553 --> 00:38:09,480 This is what actually worked well. 
966 00:38:09,480 --> 00:38:12,380 So there's no real theoretical justification for it 967 00:38:12,380 --> 00:38:14,090 other than that crude observation 968 00:38:14,090 --> 00:38:17,030 that there is some information content that's local, 969 00:38:17,030 --> 00:38:20,110 and then a lot of information content that's global. 970 00:38:20,110 --> 00:38:20,957 Yes? 971 00:38:20,957 --> 00:38:23,040 AUDIENCE: So when you're doing a de novo approach, 972 00:38:23,040 --> 00:38:25,510 is it general that you come up with a bunch 973 00:38:25,510 --> 00:38:27,980 of different clusters as your answer, 974 00:38:27,980 --> 00:38:29,956 whereas with the homology approach, 975 00:38:29,956 --> 00:38:32,255 you are more confident of structure answer? 976 00:38:32,255 --> 00:38:34,630 PROFESSOR: So the question was, if you're doing a de novo 977 00:38:34,630 --> 00:38:36,080 approach, is it generally the case 978 00:38:36,080 --> 00:38:38,080 that you have lots of individual, 979 00:38:38,080 --> 00:38:40,320 or clusters of structures, whereas in homology you 980 00:38:40,320 --> 00:38:40,820 tend not to. 981 00:38:40,820 --> 00:38:41,670 And yes, that's correct. 982 00:38:41,670 --> 00:38:43,294 So in the de novo, there are frequently 983 00:38:43,294 --> 00:38:45,610 going to be multiple solutions that 984 00:38:45,610 --> 00:38:48,080 look equally plausible to you, whereas the homology tends 985 00:38:48,080 --> 00:38:51,210 to drive you to certain classes. 986 00:38:51,210 --> 00:38:51,840 Good questions. 987 00:38:51,840 --> 00:38:52,673 Any other questions? 988 00:39:01,290 --> 00:39:03,340 All right, so that was CASP. 989 00:39:03,340 --> 00:39:08,050 One was in 1995, which seems like an eon ago. 990 00:39:08,050 --> 00:39:10,100 So how have things improved over the course 991 00:39:10,100 --> 00:39:12,067 of the last decade or two?
992 00:39:12,067 --> 00:39:14,400 So there was an interesting paper that came out recently 993 00:39:14,400 --> 00:39:17,240 that just looked at the differences between CASP 10, 994 00:39:17,240 --> 00:39:19,230 one of the most recent ones, and CASP 5. 995 00:39:19,230 --> 00:39:21,280 They're every two years, so that's a decade. 996 00:39:21,280 --> 00:39:23,200 So how have things improved or not 997 00:39:23,200 --> 00:39:25,820 over the last decade in this challenge? 998 00:39:25,820 --> 00:39:30,420 So in this chart, the y-axis is the percent 999 00:39:30,420 --> 00:39:34,160 of the residues that were modeled 1000 00:39:34,160 --> 00:39:35,671 and that were not in the template. 1001 00:39:35,671 --> 00:39:36,170 OK? 1002 00:39:36,170 --> 00:39:37,670 So I've got some template. 1003 00:39:37,670 --> 00:39:41,420 Some fraction of the amino acids have no match in the template. 1004 00:39:41,420 --> 00:39:44,030 How many of those do I get correct? 1005 00:39:44,030 --> 00:39:45,782 As a function of target difficulty, 1006 00:39:45,782 --> 00:39:47,990 they have their own definition for target difficulty. 1007 00:39:47,990 --> 00:39:49,830 You can look in the actual paper to find out 1008 00:39:49,830 --> 00:39:51,870 what is in the CASP competition, but it's 1009 00:39:51,870 --> 00:39:54,945 a combination of structural and sequence data. 1010 00:39:54,945 --> 00:39:56,320 So let's just take it that they 1011 00:39:56,320 --> 00:39:57,440 made some reasonable choices here. 1012 00:39:57,440 --> 00:39:58,400 They actually put a lot of effort 1013 00:39:58,400 --> 00:40:00,550 into coming up with criteria for evaluation. 1014 00:40:00,550 --> 00:40:04,120 Every point in this diagram represents some submitted 1015 00:40:04,120 --> 00:40:06,580 structure. 1016 00:40:06,580 --> 00:40:09,440 The CASP5, a decade ago, are the triangles.
1017 00:40:09,440 --> 00:40:14,000 CASP 9, two years ago, were the squares, 1018 00:40:14,000 --> 00:40:16,100 and the CASP10 are the circles. 1019 00:40:16,100 --> 00:40:20,015 And then they have trend lines for CASP9 1020 00:40:20,015 --> 00:40:23,760 and CASP10 are shown here-- these two lines. 1021 00:40:23,760 --> 00:40:27,350 And you can see that they do better for the easier 1022 00:40:27,350 --> 00:40:29,640 structures and worse for the harder structures, which 1023 00:40:29,640 --> 00:40:33,610 is what you'd expect, whereas CASP5 was pretty much 1024 00:40:33,610 --> 00:40:36,940 flat across all of them and did about as well even 1025 00:40:36,940 --> 00:40:39,230 on the easy structures as these ones are 1026 00:40:39,230 --> 00:40:40,740 doing on the hard structures. 1027 00:40:40,740 --> 00:40:43,802 So in terms of the fraction of the protein that they don't 1028 00:40:43,802 --> 00:40:46,010 have a template for that they're able to get correct, 1029 00:40:46,010 --> 00:40:48,770 they're doing much, much better in the later CASPs 1030 00:40:48,770 --> 00:40:50,070 than they did a decade earlier. 1031 00:40:50,070 --> 00:40:51,540 So that's kind of encouraging. 1032 00:40:51,540 --> 00:40:54,270 Unfortunately, the story isn't always that straightforward. 1033 00:40:54,270 --> 00:40:59,040 So this chart is, again, target difficulty on the x-axis. 1034 00:40:59,040 --> 00:41:02,420 The y-axis is what they call the Global Distance Test, 1035 00:41:02,420 --> 00:41:05,300 and it's a measure of model accuracy. 1036 00:41:05,300 --> 00:41:08,770 It's the percent of the carbon alpha atoms in the predictions 1037 00:41:08,770 --> 00:41:11,470 that are close-- and they have a precise definition of close 1038 00:41:11,470 --> 00:41:14,120 that you can look up-- that are close to the true structure.
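As a rough illustration of the Global Distance Test idea, the sketch below counts the percent of carbon alpha atoms that lie within a distance cutoff of their true positions. It assumes the model has already been superimposed on the native structure, which the real procedure does not assume (it searches over superpositions), and it averages over cutoffs of 1, 2, 4, and 8 angstroms as in the commonly cited GDT_TS variant. The function names `fraction_within` and `gdt_ts_like` are made up for this example.

```python
import math


def fraction_within(model_ca, native_ca, cutoff):
    """Percent of C-alpha atoms in the model within `cutoff` angstroms of
    the matching atom in the native structure (coordinates pre-aligned)."""
    if len(model_ca) != len(native_ca) or not native_ca:
        raise ValueError("need two non-empty coordinate lists of equal length")
    close = sum(1 for m, n in zip(model_ca, native_ca) if math.dist(m, n) <= cutoff)
    return 100.0 * close / len(native_ca)


def gdt_ts_like(model_ca, native_ca, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Average the within-cutoff percentages over several cutoffs.
    A simplified stand-in for GDT_TS, not the official CASP calculation."""
    return sum(fraction_within(model_ca, native_ca, c) for c in cutoffs) / len(cutoffs)
```

On this scale a perfect model scores 100 and a model with no atoms near their true positions scores close to 0.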
1039 00:41:14,120 --> 00:41:17,900 So for a perfect model, it would be up here in the 90% to 100% 1040 00:41:17,900 --> 00:41:21,090 range, and then random models would be down here. 1041 00:41:21,090 --> 00:41:24,760 You can see a lot of them are close to random. 1042 00:41:24,760 --> 00:41:26,670 But more important here are the trend lines. 1043 00:41:26,670 --> 00:41:28,850 So the trend line for CASP10, the most recent one 1044 00:41:28,850 --> 00:41:30,910 in this report, is black. 1045 00:41:30,910 --> 00:41:35,610 And for CASP5, it's this yellow one, 1046 00:41:35,610 --> 00:41:39,110 which is not that different from the black. 1047 00:41:39,110 --> 00:41:43,070 So what this shows is that, over the course of a decade, 1048 00:41:43,070 --> 00:41:45,270 the actual prediction accuracy overall 1049 00:41:45,270 --> 00:41:48,770 has not improved that much. 1050 00:41:48,770 --> 00:41:50,530 It's a little bit shocking. 1051 00:41:50,530 --> 00:41:54,350 So they tried in this paper to figure out, why is that? 1052 00:41:54,350 --> 00:41:56,930 I mean, the percentage of the amino acids that you're 1053 00:41:56,930 --> 00:41:59,790 getting correct is going up, but overall accuracy has not. 1054 00:41:59,790 --> 00:42:01,850 And so they make some claims that it 1055 00:42:01,850 --> 00:42:05,350 could be that target difficulty is not really a fair measure, 1056 00:42:05,350 --> 00:42:12,640 because a lot of the proteins that are being submitted 1057 00:42:12,640 --> 00:42:16,150 are now actually much harder in a different sense, in that 1058 00:42:16,150 --> 00:42:18,750 they're not single domain proteins initially. 1059 00:42:18,750 --> 00:42:20,830 So in CASP5, a lot of them were proteins 1060 00:42:20,830 --> 00:42:22,750 that had independent structures.
1061 00:42:22,750 --> 00:42:24,980 By the time of CASP10, a lot of the proteins 1062 00:42:24,980 --> 00:42:26,550 that are being submitted are more 1063 00:42:26,550 --> 00:42:28,450 interesting structural problems in that their folding 1064 00:42:28,450 --> 00:42:30,783 is contingent on interactions with lots of other things. 1065 00:42:30,783 --> 00:42:32,420 So maybe all the information you need 1066 00:42:32,420 --> 00:42:35,065 is not contained entirely in the sequence of the peptide 1067 00:42:35,065 --> 00:42:36,884 that you've been given to test but depends 1068 00:42:36,884 --> 00:42:38,925 more on the interactions of it with its partners. 1069 00:42:42,784 --> 00:42:44,200 So those were for homology models. 1070 00:42:44,200 --> 00:42:46,580 These are the free modeling results. 1071 00:42:46,580 --> 00:42:49,380 So in free modeling, there's no homology to look at, 1072 00:42:49,380 --> 00:42:52,780 so they don't have a measure of difficulty except for length. 1073 00:42:52,780 --> 00:42:55,170 They're using, again, that Global Distance Test. 1074 00:42:55,170 --> 00:42:56,650 So up here are perfect models. 1075 00:42:56,650 --> 00:42:59,420 Down here are nearly random models. 1076 00:42:59,420 --> 00:43:01,260 CASP10 is in red. 1077 00:43:01,260 --> 00:43:03,260 CASP5, a decade earlier, is in green. 1078 00:43:03,260 --> 00:43:06,900 And you can see the trend lines are very, very similar. 1079 00:43:06,900 --> 00:43:10,370 And CASP9, which is the dashed line here, 1080 00:43:10,370 --> 00:43:13,236 looks almost identical to CASP5. 1081 00:43:13,236 --> 00:43:14,860 So again, this is not very encouraging. 1082 00:43:14,860 --> 00:43:17,250 It says that the accuracy of the models 1083 00:43:17,250 --> 00:43:19,925 has not improved very much over the last decade. 1084 00:43:19,925 --> 00:43:21,550 And then, they do point out that if you 1085 00:43:21,550 --> 00:43:26,390 focus on the short structures, then it's kind of interesting.
1086 00:43:26,390 --> 00:43:30,400 So in CASP5, which are the triangles, only one of these 1087 00:43:30,400 --> 00:43:32,880 was above 60%. 1088 00:43:32,880 --> 00:43:37,080 CASP9, they had 5 out of 11 were pretty good. 1089 00:43:37,080 --> 00:43:40,906 But then you get to CASP10 and now only three 1090 00:43:40,906 --> 00:43:41,780 are greater than 60%. 1091 00:43:41,780 --> 00:43:43,620 So it's been fluctuating quite a lot. 1092 00:43:43,620 --> 00:43:47,045 So modeling de novo is still a very, very hard problem. 1093 00:43:47,045 --> 00:43:48,670 And they have a whole bunch of theories 1094 00:43:48,670 --> 00:43:50,360 as to why that could be. 1095 00:43:50,360 --> 00:43:51,740 They proposed, as I already said, 1096 00:43:51,740 --> 00:43:54,820 that maybe the models that they're trying to solve 1097 00:43:54,820 --> 00:43:57,880 have gotten harder in ways that are not easy to assess. 1098 00:43:57,880 --> 00:44:00,600 A lot of the proteins that previously wouldn't have had 1099 00:44:00,600 --> 00:44:03,080 a homologue now already do, because there has been a decade 1100 00:44:03,080 --> 00:44:05,980 of structural work trying to fill in missing domain 1101 00:44:05,980 --> 00:44:08,330 structures. 1102 00:44:08,330 --> 00:44:11,580 And that these targets tend to have more irregularity. 1103 00:44:11,580 --> 00:44:13,209 They tend to be part of larger proteins. 1104 00:44:13,209 --> 00:44:14,875 So again, there's not enough information 1105 00:44:14,875 --> 00:44:16,580 in the sequence of what you're given 1106 00:44:16,580 --> 00:44:17,746 to make the full prediction. 1107 00:44:20,115 --> 00:44:20,615 Questions? 1108 00:44:26,330 --> 00:44:28,660 So what we've seen so far has been the Rosetta approach 1109 00:44:28,660 --> 00:44:29,910 to solving protein structures. 1110 00:44:29,910 --> 00:44:32,150 And it really is, throw everything at it. 1111 00:44:32,150 --> 00:44:33,460 Any trick that you've got.
1112 00:44:33,460 --> 00:44:34,740 Let's look into the databases. 1113 00:44:34,740 --> 00:44:37,160 Let's take homologous proteins. 1114 00:44:37,160 --> 00:44:37,660 Right? 1115 00:44:37,660 --> 00:44:41,697 So we have these high, medium, low level homologues. 1116 00:44:41,697 --> 00:44:43,280 And even when we're doing a homologue, 1117 00:44:43,280 --> 00:44:45,490 we don't restrict ourselves to that protein structure. 1118 00:44:45,490 --> 00:44:47,531 But for certain parts, we'll go into the database 1119 00:44:47,531 --> 00:44:50,050 and find the structures of peptides of length three 1120 00:44:50,050 --> 00:44:50,860 to nine. 1121 00:44:50,860 --> 00:44:53,480 Pull those out of the [? betas. ?] Plug those in. 1122 00:44:53,480 --> 00:44:56,860 Our potential energy functions are a grab bag of information, 1123 00:44:56,860 --> 00:44:59,381 some of which has strong physical principles, some of which 1124 00:44:59,381 --> 00:45:01,130 is just curve fitting to make sure that we 1125 00:45:01,130 --> 00:45:03,880 keep the hydrophobics inside and hydrophilics outside. 1126 00:45:03,880 --> 00:45:06,790 So we throw any information that we have at the problem, 1127 00:45:06,790 --> 00:45:11,120 whereas our physicist has disdain for that approach. 1128 00:45:11,120 --> 00:45:11,790 He says, no, no. 1129 00:45:11,790 --> 00:45:13,510 We're going to do this purely by the book. 1130 00:45:13,510 --> 00:45:16,560 All of our equations are going to have some physical grounding 1131 00:45:16,560 --> 00:45:17,244 to them. 1132 00:45:17,244 --> 00:45:19,160 We're not going to start with homology models. 1133 00:45:19,160 --> 00:45:21,160 We're going to try to do the simulation that I showed you 1134 00:45:21,160 --> 00:45:23,090 a little movie of for every single protein we 1135 00:45:23,090 --> 00:45:26,780 want to know the structure of. 1136 00:45:26,780 --> 00:45:28,790 Now, why is that problem hard?
1137 00:45:28,790 --> 00:45:33,320 It's because these potential energy landscapes 1138 00:45:33,320 --> 00:45:34,451 are incredibly complex. 1139 00:45:34,451 --> 00:45:34,950 Right? 1140 00:45:34,950 --> 00:45:36,030 They're very rugged. 1141 00:45:36,030 --> 00:45:38,800 Trying to get from any current position to any other position 1142 00:45:38,800 --> 00:45:42,010 requires going over many, many minima. 1143 00:45:42,010 --> 00:45:44,640 So the reason it's hard to do, then, 1144 00:45:44,640 --> 00:45:47,377 is it's primarily a computing power issue. 1145 00:45:47,377 --> 00:45:48,960 There's just not enough computer power 1146 00:45:48,960 --> 00:45:50,251 to solve all of these problems. 1147 00:45:50,251 --> 00:45:52,464 So what one group, DE Shaw, did was they said, 1148 00:45:52,464 --> 00:45:54,130 well, we can solve that by just spending 1149 00:45:54,130 --> 00:45:58,900 a lot of money, which fortunately they had. 1150 00:45:58,900 --> 00:46:01,290 So they designed hardware that actually 1151 00:46:01,290 --> 00:46:06,300 solves individual components of the potential energy function 1152 00:46:06,300 --> 00:46:08,450 in hardware rather than in software. 1153 00:46:08,450 --> 00:46:11,760 So they have a chip that they call Anton that actually 1154 00:46:11,760 --> 00:46:15,463 has parts of it that solve the electrostatic function, the van 1155 00:46:15,463 --> 00:46:17,480 der Waals function. 1156 00:46:17,480 --> 00:46:20,100 And so in these chips, rather than in software, 1157 00:46:20,100 --> 00:46:22,180 you are doing as fast as you conceivably 1158 00:46:22,180 --> 00:46:24,190 can to solve the energy terms. 1159 00:46:24,190 --> 00:46:26,890 And that allows you to sample much, much more space. 1160 00:46:26,890 --> 00:46:29,710 Run your simulations for much, much longer 1161 00:46:29,710 --> 00:46:31,260 in terms of real time. 1162 00:46:31,260 --> 00:46:32,460 And they do remarkably well.
1163 00:46:32,460 --> 00:46:34,890 So here are some pictures from a paper of theirs-- 1164 00:46:34,890 --> 00:46:37,457 a couple of years ago now-- with the predicted 1165 00:46:37,457 --> 00:46:38,540 and the actual structures. 1166 00:46:38,540 --> 00:46:40,331 I don't even remember which color is which, 1167 00:46:40,331 --> 00:46:41,960 but you can see it doesn't much matter. 1168 00:46:41,960 --> 00:46:45,990 They get them down to very, very high resolution. 1169 00:46:45,990 --> 00:46:50,350 Now, what do you notice about all these structures? 1170 00:46:50,350 --> 00:46:51,750 AUDIENCE: They're small. 1171 00:46:51,750 --> 00:46:53,740 PROFESSOR: They're small, right? 1172 00:46:53,740 --> 00:46:55,490 So obviously there's a reason for that. 1173 00:46:55,490 --> 00:46:57,910 That's what you can do in reasonable compute time, 1174 00:46:57,910 --> 00:47:01,392 even with high-end computing that's special purpose. 1175 00:47:01,392 --> 00:47:02,850 So we're still not in a state where 1176 00:47:02,850 --> 00:47:04,850 they can fold any arbitrary structure. 1177 00:47:04,850 --> 00:47:07,370 What else do you notice about them? 1178 00:47:07,370 --> 00:47:08,412 Yeah, in the back. 1179 00:47:08,412 --> 00:47:09,396 AUDIENCE: They have very well-defined 1180 00:47:09,396 --> 00:47:09,890 secondary structures. 1181 00:47:09,890 --> 00:47:11,190 PROFESSOR: They have very well-defined 1182 00:47:11,190 --> 00:47:12,064 secondary structures. 1183 00:47:12,064 --> 00:47:14,152 And they're specifically what, mostly? 1184 00:47:14,152 --> 00:47:15,153 AUDIENCE: Alpha helixes. 1185 00:47:15,153 --> 00:47:16,485 PROFESSOR: Alpha helixes, right. 1186 00:47:16,485 --> 00:47:19,170 And it turns out that a lot more information is encoded locally 1187 00:47:19,170 --> 00:47:21,480 in an alpha helix than in a beta sheet, which 1188 00:47:21,480 --> 00:47:24,760 is going to be contingent on what that piece of protein 1189 00:47:24,760 --> 00:47:25,480 comes up against.
1190 00:47:25,480 --> 00:47:25,700 Right? 1191 00:47:25,700 --> 00:47:27,240 Whereas in the alpha helix, we saw 1192 00:47:27,240 --> 00:47:30,000 that you can get 60% accuracy with very crude algorithms, 1193 00:47:30,000 --> 00:47:30,590 right? 1194 00:47:30,590 --> 00:47:34,675 So we're going to do best with these physics approaches 1195 00:47:34,675 --> 00:47:37,660 when we have small proteins that are largely alpha helical. 1196 00:47:37,660 --> 00:47:41,300 But in later papers-- well here's even an example. 1197 00:47:41,300 --> 00:47:43,582 Here's one that has a certain amount of beta sheet. 1198 00:47:43,582 --> 00:47:45,790 And the structures are going to get larger with time. 1199 00:47:45,790 --> 00:47:47,160 So it's not an inherent problem. 1200 00:47:47,160 --> 00:47:49,820 It's just a question of how fast the hardware is 1201 00:47:49,820 --> 00:47:52,450 today versus tomorrow. 1202 00:47:52,450 --> 00:47:54,860 OK, a third approach. 1203 00:47:54,860 --> 00:47:56,620 So we had the statistical approach. 1204 00:47:56,620 --> 00:47:58,120 We have the physics approach. 1205 00:47:58,120 --> 00:48:00,310 The third approach, that I won't go into detail 1206 00:48:00,310 --> 00:48:02,910 but you can play around with it yourselves, 1207 00:48:02,910 --> 00:48:05,870 is a game where we have humans who 1208 00:48:05,870 --> 00:48:08,530 try to identify the right structure, 1209 00:48:08,530 --> 00:48:12,360 just as humans do very well in other kinds of pattern 1210 00:48:12,360 --> 00:48:13,680 recognition problems. 1211 00:48:13,680 --> 00:48:18,560 So you can try this video game where you're given structures 1212 00:48:18,560 --> 00:48:21,040 to try to solve and say, oh, should I make that helical? 1213 00:48:21,040 --> 00:48:22,790 Should I rotate that side chain? 1214 00:48:22,790 --> 00:48:24,300 So give it a try.
1215 00:48:24,300 --> 00:48:28,480 Just Google FoldIT, and you can find out 1216 00:48:28,480 --> 00:48:32,991 whether you can be the best gamers and beat the hardware. 1217 00:48:32,991 --> 00:48:33,490 All right. 1218 00:48:36,200 --> 00:48:37,950 So so far we've been talking about solving 1219 00:48:37,950 --> 00:48:40,090 the structures of individual proteins. 1220 00:48:40,090 --> 00:48:43,210 We've seen there is some success in this field. 1221 00:48:43,210 --> 00:48:45,660 It's improved a lot in some ways. 1222 00:48:45,660 --> 00:48:48,820 Between CASP1 and CASP5 I think there's been huge improvements. 1223 00:48:48,820 --> 00:48:51,410 Between CASP5 and CASP10, maybe the problems have gotten hard. 1224 00:48:51,410 --> 00:48:52,460 Maybe there have been no improvements. 1225 00:48:52,460 --> 00:48:54,390 We'll leave that for others to decide. 1226 00:48:54,390 --> 00:48:56,729 What I'd like to look at in the end of this lecture 1227 00:48:56,729 --> 00:48:58,270 and the beginning of the next lecture 1228 00:48:58,270 --> 00:49:00,480 are problems of proteins interacting with each other, 1229 00:49:00,480 --> 00:49:02,063 and can we predict those interactions? 1230 00:49:02,063 --> 00:49:04,956 And that'll, then, lead us towards even larger systems 1231 00:49:04,956 --> 00:49:05,830 and network problems. 1232 00:49:08,596 --> 00:49:09,970 So we're going to break this down 1233 00:49:09,970 --> 00:49:12,680 to three separate prediction problems. 1234 00:49:12,680 --> 00:49:15,550 The first of these is predicting the effect of a point mutation 1235 00:49:15,550 --> 00:49:17,120 on the stability of a known complex. 1236 00:49:17,120 --> 00:49:19,500 So in some ways, you might think this is an easy problem. 1237 00:49:19,500 --> 00:49:20,440 I've got two proteins. 1238 00:49:20,440 --> 00:49:21,398 I know their structure. 1239 00:49:21,398 --> 00:49:22,350 I know they interact.
1240 00:49:22,350 --> 00:49:24,560 I want to predict whether a mutation stabilizes 1241 00:49:24,560 --> 00:49:27,120 that interaction or makes it fall apart. 1242 00:49:27,120 --> 00:49:29,210 That's the first of the problems. 1243 00:49:29,210 --> 00:49:30,960 We can try to predict the structure 1244 00:49:30,960 --> 00:49:33,450 of particular complexes, and we can then 1245 00:49:33,450 --> 00:49:36,060 try to generalize that and try to predict every protein that 1246 00:49:36,060 --> 00:49:38,690 interacts with every other protein. 1247 00:49:38,690 --> 00:49:42,550 We'll see how we do on all of those. 1248 00:49:42,550 --> 00:49:45,020 So we'll go into one of these competition papers, which 1249 00:49:45,020 --> 00:49:46,710 are very good at evaluating the fields. 1250 00:49:46,710 --> 00:49:50,580 This competition paper looked at what I call the simple problem. 1251 00:49:50,580 --> 00:49:53,000 So you've got two proteins of known structure. 1252 00:49:53,000 --> 00:49:55,480 The authors of the paper, who issued the challenge, 1253 00:49:55,480 --> 00:49:58,690 knew the answer for the effect of every possible mutation 1254 00:49:58,690 --> 00:50:01,210 at a whole bunch of positions along these proteins 1255 00:50:01,210 --> 00:50:05,380 on the-- well, an approximation to the free energy of binding. 1256 00:50:05,380 --> 00:50:07,610 So they challenged the competitors 1257 00:50:07,610 --> 00:50:09,610 to try to figure out, we give you the structure, 1258 00:50:09,610 --> 00:50:12,740 we tell you all the positions we've mutated, 1259 00:50:12,740 --> 00:50:15,450 and you tell us whether those mutations made the complex more 1260 00:50:15,450 --> 00:50:17,900 stable or made the complex less stable. 1261 00:50:21,270 --> 00:50:24,250 Now specifically, they had two separate protein structures. 1262 00:50:24,250 --> 00:50:26,770 They mutated 53 positions in one. 1263 00:50:26,770 --> 00:50:28,490 45 positions in another. 
1264 00:50:28,490 --> 00:50:30,790 They didn't directly measure the free energy of binding 1265 00:50:30,790 --> 00:50:32,850 for every possible complex, but they used a high throughput 1266 00:50:32,850 --> 00:50:33,350 assay. 1267 00:50:33,350 --> 00:50:34,890 We won't go into the details, but it 1268 00:50:34,890 --> 00:50:37,410 should track, more or less, with the free energy. 1269 00:50:37,410 --> 00:50:42,290 So things that seem to be more stable interactors here probably 1270 00:50:42,290 --> 00:50:45,370 are lower free energy complexes. 1271 00:50:45,370 --> 00:50:49,230 OK, so how would you go about trying to solve this? 1272 00:50:49,230 --> 00:50:51,294 So using these potential energy functions 1273 00:50:51,294 --> 00:50:52,710 that we've already seen, you could 1274 00:50:52,710 --> 00:50:57,000 try to plug in the mutation into the structure. 1275 00:50:57,000 --> 00:51:00,190 And what would you have to do then 1276 00:51:00,190 --> 00:51:02,730 in order to evaluate the energy? 1277 00:51:02,730 --> 00:51:06,170 Before you evaluate the energy. 1278 00:51:06,170 --> 00:51:08,170 So I've got known structure. 1279 00:51:08,170 --> 00:51:13,390 I say, position 23 I'm mutating from phenylalanine to alanine. 1280 00:51:13,390 --> 00:51:14,930 I'll say alanine to phenylalanine. 1281 00:51:14,930 --> 00:51:15,980 Make it a little more interesting. 1282 00:51:15,980 --> 00:51:16,480 OK? 1283 00:51:16,480 --> 00:51:18,380 So I'm now stuck on this big side chain. 1284 00:51:18,380 --> 00:51:20,310 So what do I need to do before I can evaluate the structure 1285 00:51:20,310 --> 00:51:20,810 energy? 1286 00:51:20,810 --> 00:51:22,856 AUDIENCE: Make sure there's no clashes. 1287 00:51:22,856 --> 00:51:24,480 PROFESSOR: Make sure no clashes, right?
1288 00:51:24,480 --> 00:51:25,380 So I have to do one of those methods 1289 00:51:25,380 --> 00:51:28,230 that we already described for optimizing the side chain 1290 00:51:28,230 --> 00:51:29,850 conformation, and then I can decide, 1291 00:51:29,850 --> 00:51:32,200 based on the free energy, whether it's an improvement 1292 00:51:32,200 --> 00:51:33,870 or makes things worse. 1293 00:51:33,870 --> 00:51:36,380 OK, so let's see how they do. 1294 00:51:36,380 --> 00:51:39,284 So here's an example of a solution. 1295 00:51:39,284 --> 00:51:41,700 The submitter, the person who has the algorithm for making 1296 00:51:41,700 --> 00:51:44,455 a prediction, decides on some cutoff in their energy 1297 00:51:44,455 --> 00:51:45,830 function, whether they think this 1298 00:51:45,830 --> 00:51:47,890 is improving things or making things worse. 1299 00:51:47,890 --> 00:51:49,340 So they decide on the color. 1300 00:51:49,340 --> 00:51:51,220 Each one of these dots represents 1301 00:51:51,220 --> 00:51:52,095 a different mutation. 1302 00:51:55,010 --> 00:51:58,420 On the y-axis is the actual change in binding, 1303 00:51:58,420 --> 00:51:59,910 the observed change in binding. 1304 00:51:59,910 --> 00:52:01,660 So things above zero are improved binding. 1305 00:52:01,660 --> 00:52:04,010 Below zero are worse binding. 1306 00:52:04,010 --> 00:52:07,607 And here are the predictions on the submitter scale. 1307 00:52:07,607 --> 00:52:09,690 And here the submitter said that everything in red 1308 00:52:09,690 --> 00:52:12,940 should be worse and everything green should be better. 1309 00:52:12,940 --> 00:52:15,210 And you can see that there's some trend.
1310 00:52:15,210 --> 00:52:18,530 They're doing reasonably well in predicting all these red guys 1311 00:52:18,530 --> 00:52:20,930 as being bad, but they're not doing so well 1312 00:52:20,930 --> 00:52:23,780 in the neutral ones, clearly, and certainly not doing 1313 00:52:23,780 --> 00:52:26,707 that well in the improved ones. 1314 00:52:26,707 --> 00:52:29,290 Now, is this one of the better submitters or one of the worst? 1315 00:52:29,290 --> 00:52:30,420 You'd hope that this is one of the worst, 1316 00:52:30,420 --> 00:52:32,337 but in fact this is one of the top submitters. 1317 00:52:32,337 --> 00:52:33,794 In fact, not just the top submitter 1318 00:52:33,794 --> 00:52:35,410 but top submitter looking at mutations 1319 00:52:35,410 --> 00:52:37,300 that are right at the interface where you'd think 1320 00:52:37,300 --> 00:52:38,450 they'd do the best, right? 1321 00:52:38,450 --> 00:52:41,090 So if there's some mutation on the backside of the protein, 1322 00:52:41,090 --> 00:52:42,320 there's less structural information 1323 00:52:42,320 --> 00:52:44,340 about what that's going to be doing in the complex. 1324 00:52:44,340 --> 00:52:45,800 There could be some surprising results. 1325 00:52:45,800 --> 00:52:47,466 But here, these are amino acid mutations 1326 00:52:47,466 --> 00:52:50,450 right at the interface. 1327 00:52:50,450 --> 00:52:52,650 So here's an example of the top performer. 1328 00:52:52,650 --> 00:52:54,290 This is the graph I just showed you, 1329 00:52:54,290 --> 00:52:55,748 focusing only at the [? residues ?] 1330 00:52:55,748 --> 00:52:57,484 of the interface, and all sites. 1331 00:52:57,484 --> 00:52:58,650 And here's an average group. 1332 00:52:58,650 --> 00:53:00,030 And you can see the average groups are really 1333 00:53:00,030 --> 00:53:01,430 doing rather abysmally. 1334 00:53:04,330 --> 00:53:08,270 So this blue cluster that's almost entirely below zero 1335 00:53:08,270 --> 00:53:09,777 were supposed to be neutral. 
1336 00:53:09,777 --> 00:53:11,860 And these green ones were supposed to be improved, 1337 00:53:11,860 --> 00:53:14,690 and they're almost entirely below zero. 1338 00:53:14,690 --> 00:53:16,650 This is not an encouraging story. 1339 00:53:16,650 --> 00:53:19,140 So how do we evaluate objectively 1340 00:53:19,140 --> 00:53:21,060 whether they're really doing well? 1341 00:53:21,060 --> 00:53:23,720 So we need some sort of baseline measure. 1342 00:53:23,720 --> 00:53:26,240 What is the sort of baseline algorithm 1343 00:53:26,240 --> 00:53:29,360 you could use to predict whether a mutation is improving 1344 00:53:29,360 --> 00:53:31,347 or hurting this interface? 1345 00:53:31,347 --> 00:53:32,846 So all of their algorithms are going 1346 00:53:32,846 --> 00:53:34,630 to use some kind of energy function. 1347 00:53:34,630 --> 00:53:37,005 What have we already seen in earlier parts of this course 1348 00:53:37,005 --> 00:53:38,130 that we could use? 1349 00:53:38,130 --> 00:53:40,930 Well, we could use the substitution matrices, right? 1350 00:53:40,930 --> 00:53:42,580 We have the BLOSUM substitution matrix 1351 00:53:42,580 --> 00:53:45,520 that tells us how surprised we should 1352 00:53:45,520 --> 00:53:47,750 be when we see, in evolution, that Amino Acid A turns 1353 00:53:47,750 --> 00:53:51,170 into Amino Acid B. So we could use, 1354 00:53:51,170 --> 00:53:52,950 in this case, the BLOSUM matrix. 1355 00:53:52,950 --> 00:53:54,645 That gives us for each mutation a score. 1356 00:53:54,645 --> 00:53:57,900 It ranges from minus 4 to 11. 1357 00:53:57,900 --> 00:54:00,090 And we can rank every mutation based 1358 00:54:00,090 --> 00:54:02,840 on the BLOSUM matrix for the substitution 1359 00:54:02,840 --> 00:54:06,212 and say, OK, at some value in this range things should 1360 00:54:06,212 --> 00:54:07,670 be getting better or getting worse. 
1361 00:54:10,810 --> 00:54:13,426 So here's an area under the curve plot 1362 00:54:13,426 --> 00:54:15,050 where we've plotted the false positive 1363 00:54:15,050 --> 00:54:18,040 and true positive rates as I change 1364 00:54:18,040 --> 00:54:19,800 my threshold for that BLOSUM matrix. 1365 00:54:19,800 --> 00:54:24,100 So I compute what the mutation's BLOSUM score is, 1366 00:54:24,100 --> 00:54:27,400 and then I say, OK, is a value of 11 bad or is it good? 1367 00:54:27,400 --> 00:54:28,684 Is a value of 10 bad or good? 1368 00:54:28,684 --> 00:54:30,100 That's what this curve represents. 1369 00:54:30,100 --> 00:54:33,950 As I vary that threshold, how many do I get right 1370 00:54:33,950 --> 00:54:36,260 and how many do I get wrong? 1371 00:54:36,260 --> 00:54:38,680 If I'm doing the decisions at random, 1372 00:54:38,680 --> 00:54:41,290 then I'll be getting roughly equal true positives 1373 00:54:41,290 --> 00:54:42,750 and false positives. 1374 00:54:42,750 --> 00:54:45,630 They do slightly better than random using this matrix. 1375 00:54:45,630 --> 00:54:49,020 Now, the best algorithm at predicting that uses energies 1376 00:54:49,020 --> 00:54:51,140 only does marginally better. 1377 00:54:51,140 --> 00:54:54,440 So this is the best algorithm at predicting. 1378 00:54:54,440 --> 00:54:58,220 This is this baseline algorithm using just the BLOSUM matrix. 1379 00:54:58,220 --> 00:55:02,270 You can see that the green curve predicting beneficial mutations 1380 00:55:02,270 --> 00:55:03,350 is really hard. 1381 00:55:03,350 --> 00:55:05,200 They don't do much better than random. 1382 00:55:05,200 --> 00:55:07,430 And for the deleterious mutations, 1383 00:55:07,430 --> 00:55:10,090 they do somewhat better. 1384 00:55:10,090 --> 00:55:12,520 So we could make these plots for every single one 1385 00:55:12,520 --> 00:55:14,260 of the algorithms, but a little easier 1386 00:55:14,260 --> 00:55:17,380 is to just compute the area under the curve. 
1387 00:55:17,380 --> 00:55:19,030 So how much of the area? 1388 00:55:19,030 --> 00:55:22,040 If I were doing perfectly, I would get 100% true positives 1389 00:55:22,040 --> 00:55:23,410 and no false positives, right? 1390 00:55:23,410 --> 00:55:25,205 So my line would go straight up and across 1391 00:55:25,205 --> 00:55:27,260 and the area under the curve would be one. 1392 00:55:27,260 --> 00:55:30,200 And if I'm doing terribly, I'll get no true positives 1393 00:55:30,200 --> 00:55:32,105 and all false positives. 1394 00:55:32,105 --> 00:55:34,284 I'd be flatlining and my area would be zero. 1395 00:55:34,284 --> 00:55:35,700 So the area under the curve, which 1396 00:55:35,700 --> 00:55:37,330 is normalized between zero and one, 1397 00:55:37,330 --> 00:55:39,830 will give me a sense of how well these algorithms are doing. 1398 00:55:39,830 --> 00:55:44,160 So this plot-- focus first on the black dots-- shows 1399 00:55:44,160 --> 00:55:46,880 for each one of these algorithms what the area under the curve 1400 00:55:46,880 --> 00:55:50,410 is for beneficial and deleterious mutations. 1401 00:55:50,410 --> 00:55:53,230 Beneficial on the x-axis, deleterious mutations 1402 00:55:53,230 --> 00:55:54,650 on the y-axis. 1403 00:55:54,650 --> 00:55:56,600 The BLOSUM matrix is here. 1404 00:55:56,600 --> 00:56:00,440 So good algorithms should be above that and to the right. 1405 00:56:00,440 --> 00:56:03,059 They should have a better area under the curve. 1406 00:56:03,059 --> 00:56:04,850 And you can see the perfect algorithm would 1407 00:56:04,850 --> 00:56:06,260 have been all the way up here. 1408 00:56:06,260 --> 00:56:08,880 None of the black dots are even remotely close. 1409 00:56:08,880 --> 00:56:11,910 The G21, which we'll talk about a little bit in a minute, 1410 00:56:11,910 --> 00:56:15,890 is somewhat better than the BLOSUM matrix, but not a lot. 
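The threshold sweep and the area under the curve described above can be sketched in a few lines. This is only an illustrative sketch, not the evaluation code used in the study: the function names `roc_points` and `auc` and the toy scores are assumptions made up for this example. The scores stand in for any per-mutation number, such as a BLOSUM score or a predicted energy, where a higher score means "predicted positive."

```python
def roc_points(scores, labels):
    """Sweep a threshold over the scores and return (FPR, TPR) pairs.

    scores: one number per mutation (higher = predicted positive).
    labels: True where the mutation really is a positive.
    """
    pos = sum(1 for y in labels if y)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative")
    points = [(0.0, 0.0)]  # threshold above every score: nothing called positive
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Trapezoid-rule area under a curve given as (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy data, invented for illustration only.
toy_scores = [3, 2, 1, -1, -2, -4]
toy_labels = [True, True, False, True, False, False]
toy_auc = auc(roc_points(toy_scores, toy_labels))  # 8/9 for this toy set
```

A predictor that ranks every positive above every negative gives an area of one, a predictor that ranks them exactly backwards gives zero, and random guessing lands near one half.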
1411 00:56:19,640 --> 00:56:24,270 Now, I'm going to ignore the second round in much detail, 1412 00:56:24,270 --> 00:56:26,530 because this is a case where people weren't doing 1413 00:56:26,530 --> 00:56:28,650 so well in the first round so they went out and gave them 1414 00:56:28,650 --> 00:56:30,850 some of the information about mutations at all the positions. 1415 00:56:30,850 --> 00:56:32,320 And that really changes the nature of the problem, 1416 00:56:32,320 --> 00:56:33,730 because then you have a tremendous amount 1417 00:56:33,730 --> 00:56:35,570 of information about which positions are important 1418 00:56:35,570 --> 00:56:37,430 and how much of a difference those mutations are making. 1419 00:56:37,430 --> 00:56:39,200 So we'll ignore the second round, 1420 00:56:39,200 --> 00:56:42,300 which I think is an overly generous way of comparing 1421 00:56:42,300 --> 00:56:43,700 these algorithms. 1422 00:56:43,700 --> 00:56:46,060 OK, so what did the authors of this paper observe? 1423 00:56:46,060 --> 00:56:48,060 They observed that the best algorithms were only 1424 00:56:48,060 --> 00:56:50,410 doing marginally better than random choice. 1425 00:56:50,410 --> 00:56:53,230 So three times better. 1426 00:56:53,230 --> 00:56:57,080 And that there seemed to be a particular problem looking 1427 00:56:57,080 --> 00:56:59,535 at mutations that affect polar positions. 
1428 00:57:02,370 --> 00:57:05,510 One of the things that I think was particularly interesting 1429 00:57:05,510 --> 00:57:09,150 and quite relevant when we think about these things 1430 00:57:09,150 --> 00:57:11,724 in a thermodynamic context is that the algorithms that 1431 00:57:11,724 --> 00:57:13,890 did better-- none of them could be really considered 1432 00:57:13,890 --> 00:57:16,690 to do really well-- but the algorithms that did better 1433 00:57:16,690 --> 00:57:19,440 didn't just focus on the energetic change 1434 00:57:19,440 --> 00:57:22,490 between forming the native complex over here 1435 00:57:22,490 --> 00:57:24,990 and forming this mutant complex indicated by the star. 1436 00:57:24,990 --> 00:57:27,120 But they also focused on the effect of the mutation 1437 00:57:27,120 --> 00:57:29,590 on the stability of the mutated protein. 1438 00:57:29,590 --> 00:57:31,480 So there's an equilibrium not just 1439 00:57:31,480 --> 00:57:33,580 between the free proteins and the complex, 1440 00:57:33,580 --> 00:57:35,830 but also between the free proteins that 1441 00:57:35,830 --> 00:57:38,882 are folded and the free proteins that are unfolded. 1442 00:57:38,882 --> 00:57:40,590 And some of these mutations are affecting 1443 00:57:40,590 --> 00:57:42,506 the energy of the folded state, and so they're 1444 00:57:42,506 --> 00:57:45,550 driving things to the left, to the unfolded. 1445 00:57:45,550 --> 00:57:47,560 And if you don't include that, then you actually 1446 00:57:47,560 --> 00:57:48,268 get into trouble. 1447 00:57:48,268 --> 00:57:50,370 And I've put a link here to some lecture notes 1448 00:57:50,370 --> 00:57:52,580 from a different course that I teach where you can look up 1449 00:57:52,580 --> 00:57:54,705 some details and more sophisticated approaches that 1450 00:57:54,705 --> 00:57:58,340 actually do take into account a lot of the unfolded states. 
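To see why folding stability leaks into the measured binding energy, here is a minimal sketch of the coupled equilibrium. It is not taken from the paper or from the linked lecture notes. It assumes a simple two-state model in which the mutated protein is either folded or unfolded, only the folded form can bind, and the partner protein is always folded. The function name `apparent_binding_dg` and the example numbers are hypothetical.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)


def apparent_binding_dg(dg_bind, dg_fold, temp_k=298.15):
    """Observed binding free energy when folding and binding are coupled.

    dg_bind: binding free energy of the folded protein (kcal/mol).
    dg_fold: folding free energy, unfolded -> folded (kcal/mol);
             negative means the folded state is favored.
    Two-state assumption: K_app = K_bind * K_fold / (1 + K_fold), so
    dG_app = dG_bind + RT * ln(1 + exp(dG_fold / RT)).
    """
    rt = R_KCAL * temp_k
    return dg_bind + rt * math.log1p(math.exp(dg_fold / rt))


# Hypothetical numbers: same interface energy, different stability.
stable = apparent_binding_dg(-10.0, -5.0)      # essentially -10 kcal/mol
destabilized = apparent_binding_dg(-10.0, 1.0)  # noticeably weaker binding
```

Under these assumptions a mutation that leaves the interface untouched but destabilizes the fold still shows up as worse binding, which is the effect an interface-only energy function would miss.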
1451 00:58:02,570 --> 00:58:06,760 So the best approaches-- best of a bad lot-- 1452 00:58:06,760 --> 00:58:10,150 consider the effects of mutations on stability. 1453 00:58:10,150 --> 00:58:13,860 They also model packing, electrostatics, and solvation. 1454 00:58:13,860 --> 00:58:15,530 But the actual algorithms that they used 1455 00:58:15,530 --> 00:58:17,095 were a whole mishmash of approaches. 1456 00:58:17,095 --> 00:58:19,680 So there didn't seem to emerge a common pattern in what they 1457 00:58:19,680 --> 00:58:21,929 were doing, and I thought I would take you through one 1458 00:58:21,929 --> 00:58:24,080 of these to see what actually they were doing. 1459 00:58:24,080 --> 00:58:28,080 So the best one was this machine learning approach, G21. 1460 00:58:28,080 --> 00:58:30,090 So this is how they solved the problem. 1461 00:58:30,090 --> 00:58:33,150 First of all, they dug through the literature 1462 00:58:33,150 --> 00:58:36,940 and found 930 cases where they could associate a mutation 1463 00:58:36,940 --> 00:58:38,660 with a change in energy. 1464 00:58:38,660 --> 00:58:41,441 These had nothing to do with the proteins under consideration. 1465 00:58:41,441 --> 00:58:43,190 They were completely different structures. 1466 00:58:43,190 --> 00:58:44,815 But they were cases where they actually 1467 00:58:44,815 --> 00:58:47,849 had energetic information for each mutation. 1468 00:58:47,849 --> 00:58:49,390 Then they go through and try to predict 1469 00:58:49,390 --> 00:58:51,870 what the structural change will be in the protein, 1470 00:58:51,870 --> 00:58:55,669 using somebody else's algorithm, FoldX. 
1471 00:58:55,669 --> 00:58:57,710 And now, they describe each mutant, not just with 1472 00:58:57,710 --> 00:58:59,210 a single energy-- we have focused, 1473 00:58:59,210 --> 00:59:02,080 for example, on PyRosetta, which you'll use in process-- 1474 00:59:02,080 --> 00:59:04,782 but they actually had 85 different features 1475 00:59:04,782 --> 00:59:06,490 from a whole bunch of different programs. 1476 00:59:06,490 --> 00:59:07,820 So they're taking a pretty agnostic view. 1477 00:59:07,820 --> 00:59:10,361 They're saying, we don't know which of these energy functions 1478 00:59:10,361 --> 00:59:13,380 is the best, so let's let the machine learning decide. 1479 00:59:13,380 --> 00:59:16,360 So every single mutation that's posed to them as a problem, 1480 00:59:16,360 --> 00:59:18,500 they have 85 different parameters 1481 00:59:18,500 --> 00:59:22,560 as to whether it's improving things or not. 1482 00:59:22,560 --> 00:59:26,290 And then, they had their database of 930 mutations. 1483 00:59:26,290 --> 00:59:28,195 For each one of those they had 85 parameters. 1484 00:59:31,510 --> 00:59:33,030 So those are labeled training data. 1485 00:59:33,030 --> 00:59:35,900 They know whether things are getting better or worse. 1486 00:59:35,900 --> 00:59:40,790 They actually don't even rely on a single machine learning 1487 00:59:40,790 --> 00:59:41,310 method. 1488 00:59:41,310 --> 00:59:43,360 They actually used five different approaches. 1489 00:59:43,360 --> 00:59:47,419 We'll discuss Bayesian nets later in this course. 1490 00:59:47,419 --> 00:59:49,210 Most of these others we won't cover at all, 1491 00:59:49,210 --> 00:59:51,970 but they used a lot of different computational approaches 1492 00:59:51,970 --> 00:59:55,480 to try to decide how to go from those 85 parameters 1493 00:59:55,480 --> 00:59:59,471 to a prediction of whether the structures improved or not. 
1494 01:00:03,620 --> 01:00:05,770 So this actually shows the complexity 1495 01:00:05,770 --> 01:00:08,310 of this apparently simple problem, right? 1496 01:00:08,310 --> 01:00:11,620 Here's a case where I have two proteins of known structure. 1497 01:00:11,620 --> 01:00:14,790 I'm making very specific point mutations, 1498 01:00:14,790 --> 01:00:19,802 and even so I do only marginally better than random. 1499 01:00:19,802 --> 01:00:22,010 And even throwing at it all the best machine learning 1500 01:00:22,010 --> 01:00:22,560 techniques. 1501 01:00:22,560 --> 01:00:24,840 So there's clearly a lot in protein structure 1502 01:00:24,840 --> 01:00:28,014 that we don't yet have parametrized in these energy 1503 01:00:28,014 --> 01:00:28,514 functions. 1504 01:00:35,270 --> 01:00:36,910 So maybe some of these other problems 1505 01:00:36,910 --> 01:00:38,820 are actually not as hard as we thought. 1506 01:00:38,820 --> 01:00:41,150 Maybe instead of trying to be very precise in terms 1507 01:00:41,150 --> 01:00:43,380 of the energetic change for a single mutation 1508 01:00:43,380 --> 01:00:47,020 at an interface, we'd do better trying to predict rather 1509 01:00:47,020 --> 01:00:49,250 crude parameters of which two proteins interact 1510 01:00:49,250 --> 01:00:49,917 with each other. 1511 01:00:49,917 --> 01:00:51,333 So that's what we're going to look 1512 01:00:51,333 --> 01:00:52,810 at in the next part of the course. 1513 01:00:52,810 --> 01:00:55,590 We're going to look at whether we 1514 01:00:55,590 --> 01:00:58,210 can use structural data to predict which two proteins will 1515 01:00:58,210 --> 01:01:00,060 interact. 1516 01:01:00,060 --> 01:01:02,990 So here we've got a problem, which is a docking problem. 1517 01:01:02,990 --> 01:01:04,190 I've got two proteins. 1518 01:01:04,190 --> 01:01:06,106 Say they're of known structure, but I've never 1519 01:01:06,106 --> 01:01:07,960 seen them interact with each other. 
1520 01:01:07,960 --> 01:01:09,230 So how do they come together? 1521 01:01:09,230 --> 01:01:12,350 Which faces of the proteins are interacting with each other? 1522 01:01:12,350 --> 01:01:14,590 That's called a docking problem. 1523 01:01:14,590 --> 01:01:17,490 And if I wanted to try to systematically figure out 1524 01:01:17,490 --> 01:01:20,011 whether Protein A and Protein B interact with each other, 1525 01:01:20,011 --> 01:01:22,510 I would have to do a search over all possible conformations, 1526 01:01:22,510 --> 01:01:23,409 right? 1527 01:01:23,409 --> 01:01:24,950 Then I could use the energy functions 1528 01:01:24,950 --> 01:01:27,700 to try to predict which one has the lowest energy. 1529 01:01:27,700 --> 01:01:29,860 But it actually would be a computationally very 1530 01:01:29,860 --> 01:01:31,890 inefficient way to do things. 1531 01:01:31,890 --> 01:01:34,700 So we could imagine we wanted to solve this problem. 1532 01:01:34,700 --> 01:01:36,240 For each potential partner, we could 1533 01:01:36,240 --> 01:01:39,046 evaluate all relative positions and orientations. 1534 01:01:39,046 --> 01:01:41,420 Then, when they come together we can't just rely on that, 1535 01:01:41,420 --> 01:01:42,870 but as we've seen several times now we're 1536 01:01:42,870 --> 01:01:45,161 going to have to do local conformational changes to see 1537 01:01:45,161 --> 01:01:47,610 how they fit together for each possible docking. 1538 01:01:47,610 --> 01:01:49,110 And then, once we've done that, we 1539 01:01:49,110 --> 01:01:50,860 can say, OK, which of these has the lowest 1540 01:01:50,860 --> 01:01:52,010 energy of interaction? 1541 01:01:52,010 --> 01:01:54,640 So that, obviously, is going to be too computationally 1542 01:01:54,640 --> 01:01:56,055 intensive to do on a large scale. 
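A rough count makes the cost of this brute-force search concrete. The sketch below is only back-of-the-envelope arithmetic: the function name `docking_evaluations` and every number fed into it are hypothetical, chosen to illustrate the scaling rather than to describe any real docking program.

```python
def docking_evaluations(n_proteins, n_translations, n_rotations, n_refinements):
    """Count energy evaluations for exhaustive all-against-all docking.

    n_proteins:     proteins of known structure to test pairwise.
    n_translations: relative positions sampled per pair.
    n_rotations:    relative orientations sampled per position.
    n_refinements:  local conformational moves tried per docked pose.
    """
    n_pairs = n_proteins * (n_proteins - 1) // 2
    poses_per_pair = n_translations * n_rotations
    return n_pairs * poses_per_pair * n_refinements


# Hypothetical sizes: 1,000 proteins, 10^4 positions, 10^4 orientations,
# 10 refinement moves per pose.
# 499,500 pairs x 10^8 poses x 10 refinements = 4.995e14 energy evaluations.
total = docking_evaluations(1000, 10**4, 10**4, 10)
```

The count grows with the square of the number of proteins and multiplies through every sampling choice, so a single pair can be tractable while an all-against-all search is not.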
1543 01:01:56,055 --> 01:01:57,430 It could work very well if you've 1544 01:01:57,430 --> 01:01:59,179 got a particular pair of proteins that you 1545 01:01:59,179 --> 01:01:59,960 need to study. 1546 01:01:59,960 --> 01:02:01,710 But on a big scale, if we wanted to predict 1547 01:02:01,710 --> 01:02:03,960 all possible interactions, we wouldn't really 1548 01:02:03,960 --> 01:02:06,060 be able to get very far. 1549 01:02:06,060 --> 01:02:09,960 So what people typically do is use other kinds of information 1550 01:02:09,960 --> 01:02:11,446 to reduce the search space. 1551 01:02:11,446 --> 01:02:13,070 And what we'll see in the next lecture, 1552 01:02:13,070 --> 01:02:16,390 then, are different ways to approach this problem. 1553 01:02:16,390 --> 01:02:19,899 Now, one question we should ask is, what role 1554 01:02:19,899 --> 01:02:21,440 is structural homology going to play? 1555 01:02:21,440 --> 01:02:23,820 Should I expect that any two proteins that 1556 01:02:23,820 --> 01:02:27,710 interact with each other-- let's say 1557 01:02:27,710 --> 01:02:29,962 that I have Protein A and I know its interactors. 1558 01:02:34,390 --> 01:02:38,380 So I've got A known to interact with B. Right? 1559 01:02:38,380 --> 01:02:41,190 So I know this interface. 1560 01:02:41,190 --> 01:02:44,990 And now I have protein C, and I'm not 1561 01:02:44,990 --> 01:02:47,920 sure if it interacts or not. 1562 01:02:47,920 --> 01:02:52,130 Should I expect the interface of C, that touches A, 1563 01:02:52,130 --> 01:02:53,380 to match the interface of B? 1564 01:02:53,380 --> 01:02:57,450 Should these be homologous? 1565 01:02:57,450 --> 01:02:59,090 And if not precisely homologous, then 1566 01:02:59,090 --> 01:03:00,871 are there properties that we can expect 1567 01:03:00,871 --> 01:03:02,370 that should be similar between them? 1568 01:03:03,280 --> 01:03:05,210 So different approaches we can take. 
1569 01:03:05,210 --> 01:03:06,670 And there are certainly cases where 1570 01:03:06,670 --> 01:03:10,880 you have proteins that interact with a common target that 1571 01:03:10,880 --> 01:03:13,250 have no overall structural similarity to each other 1572 01:03:13,250 --> 01:03:15,150 but do have local structural similarity. 1573 01:03:15,150 --> 01:03:17,530 So here's an example of subtilisin, 1574 01:03:17,530 --> 01:03:20,530 which is shown in light gray, and pieces of it 1575 01:03:20,530 --> 01:03:22,645 that interact with the target are shown in red. 1576 01:03:22,645 --> 01:03:25,020 So here are two proteins that are relatively structurally 1577 01:03:25,020 --> 01:03:26,940 homologous-- they interact at the same region. 1578 01:03:26,940 --> 01:03:28,679 That's not too surprising. 1579 01:03:28,679 --> 01:03:30,220 But here's a subtilisin inhibitor that 1580 01:03:30,220 --> 01:03:33,310 has no global structural similarity to these two 1581 01:03:33,310 --> 01:03:36,740 proteins, and yet its interactions with subtilisin 1582 01:03:36,740 --> 01:03:37,850 are quite similar. 1583 01:03:37,850 --> 01:03:41,380 So we might expect, even if C and B don't look globally 1584 01:03:41,380 --> 01:03:42,880 anything like each other, they might 1585 01:03:42,880 --> 01:03:44,470 have this local similarity. 1586 01:03:50,130 --> 01:03:52,510 OK, actually I think we'd like to turn back your exams. 1587 01:03:52,510 --> 01:03:54,630 So maybe I'll stop here. 1588 01:03:54,630 --> 01:03:56,300 We'll return the exams in the class, 1589 01:03:56,300 --> 01:03:59,340 and then we'll pick up at this point in the next lecture.