The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: OK. So we've been talking about predicting the structure of proteins. At the end of the last lecture we started to talk a little bit about predicting interactions, and that's going to be the focus of today's lecture. We identified a couple of different possible prediction challenges. One was quantitative prediction of what happens when you make specific mutations in a known protein complex. We also talked about trying to predict the structure of, say, just a pair of proteins, and then trying to do that on the global scale for all known proteins.

And so last time, if you recall, we thought that initially maybe this would be a simple problem. We have proteins of known structure that form a complex, and the structure of the complex is also known. We want to predict the change in affinity when a specific mutation is made. In principle, this should be easy, because we have all those different formulations for the potential energy function. If we figure out the local structural changes due to the insertion or deletion of some side chain, then we should be able to predict the change in the potential energy, and therefore the change in the energy of the complex.

But in fact, it turned out to be very, very hard to do that. In this plot, the black circles are the prediction algorithms for this problem, compared to simply using a substitution matrix-- the BLOSUM substitution matrix-- with performance measured as the area under the curve for separating beneficial mutations from deleterious mutations. You can see that very, very few of the black dots get far away from that really simple default model. A lot of them do worse.
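To make that baseline concrete, here is a minimal sketch of scoring mutations with BLOSUM62 and computing the area under the ROC curve for separating beneficial from deleterious mutations. The mutation list and labels are invented for illustration, only the few BLOSUM62 entries needed here are hard-coded, and the AUC is computed through the rank-sum (Mann-Whitney) identity rather than with an external library.

```python
# Minimal sketch: BLOSUM62 as a baseline predictor of mutation effect,
# evaluated by area under the ROC curve (AUC). The mutation data below
# are invented; only the BLOSUM62 entries needed here are included.

BLOSUM62 = {  # a few symmetric entries from the standard matrix
    ("W", "A"): -3, ("R", "K"): 2, ("Y", "F"): 3,
    ("D", "A"): -2, ("L", "I"): 2, ("G", "P"): -2,
}

def blosum_score(wt, mut):
    return BLOSUM62.get((wt, mut), BLOSUM62.get((mut, wt), 0))

# (wild-type residue, mutant residue, 1 if the mutation improved binding)
mutations = [
    ("R", "K", 1), ("L", "I", 1), ("Y", "F", 1),
    ("W", "A", 0), ("D", "A", 0), ("G", "P", 0),
]

def auc(scored):
    """AUC = P(score of a random positive > score of a random negative)."""
    pos = [s for s, y in scored if y == 1]
    neg = [s for s, y in scored if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scored = [(blosum_score(wt, mut), label) for wt, mut, label in mutations]
print("BLOSUM baseline AUC:", auc(scored))  # 1.0 on this toy data
```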
So OK, maybe that's not such a simple problem, because it requires a highly quantitative prediction. Maybe we'll do better just trying to predict which proteins interact at all. And that's going to be the focus of today's lecture.

Now, that also has a problem, right? Because even if I know the structures of two proteins, I don't necessarily know which surfaces of those proteins interact. So I have to solve the docking problem of which part of protein A interacts with which part of protein B. That's just the beginning of my problem, and then I have to make a series of subsequent decisions. For any potential partner of my protein, I need to solve the docking problem-- the relative position and orientation. Now, in this little cartoon it's shown as a completely static protein approaching another static protein, where the only thing changing is the relative coordinates. But of course there will be local changes in conformation, perhaps even global ones. So we need to be able to make some estimate of what those structural rearrangements will be when the two proteins interact. And only after we've come up with our best estimate of the structural rearrangements can we estimate the interaction energy and decide whether it's better than some threshold.

OK. So one of the problems that's pretty obvious from this is that this kind of approach, if we do it rigorously through all the steps, would be extremely slow. Another problem that's perhaps a little bit less obvious is that it's going to be very prone to false positives. And why do you think that might be? What am I not taking into account here?

AUDIENCE: Are you not taking into account the desolvation [INAUDIBLE]?

PROFESSOR: So one answer is that I'm not taking account of the desolvation, but in fact, I can do that. Right?
Some of the potential energy functions we looked at-- the statistician's version rather than the physicist's-- make it pretty easy to incorporate the desolvation. Any other thoughts as to what I'm not taking into account? What other proteins should I be considering when I'm considering an interaction problem?

I've isolated, in this case, two proteins. I'm asking, in a universe where these are the only two proteins that exist, will they have a favorable interaction energy? What I really need to know is whether that interaction energy is more favorable than all the competing interactions they could have. So even if I find something that's potentially a good interaction, it may not be the best possible interaction. And if I then consider the concentration of this protein and the concentrations of all the other molecules out there that have a higher affinity, it could turn out that this is actually a rather poor interaction partner for my protein. So we have that false positive problem.

OK. But let's focus on the computational efficiency problem, because that's at least one where we can come up with some nice algorithms. What we want to do is limit our search space. If I have a query protein and I want to ask what it interacts with, instead of doing the pairwise comparison of this protein against every other protein in the database, and doing very precise structural calculations on all of those, maybe there's some way I can prefilter the set of proteins it might interact with. And that's what we're going to look at. So we're going to try to efficiently choose potential partners before we do any structural comparison. And then, once we have those partners, we're going to avoid doing detailed calculations until we have a relatively high degree of confidence, by other criteria, that these proteins could interact.
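As a rough illustration of that strategy, here is a minimal sketch of a screening cascade: cheap checks run over the whole database, and the expensive structural calculation is invoked only for candidates that survive them. The function names (quick_check, interface_check, refine_and_score) and the score cutoff are hypothetical placeholders, not part of either paper's actual implementation; the point is only the control flow.

```python
# Minimal sketch of a prefiltering cascade for interaction prediction.
# The check functions and threshold are hypothetical placeholders; the
# point is the control flow: cheap filters first, expensive structural
# refinement last, and only for the few survivors.

def predict_partners(query, database,
                     quick_check, interface_check, refine_and_score,
                     score_cutoff=-5.0):
    candidates = []
    for target in database:
        if not quick_check(query, target):        # e.g. sequence/fold class
            continue                              # reject cheaply
        if not interface_check(query, target):    # e.g. interface match,
            continue                              # conservation, hotspots
        candidates.append(target)

    predictions = []
    for target in candidates:                     # expensive step, few calls
        energy = refine_and_score(query, target)  # flexible refinement
        if energy < score_cutoff:                 # keep favorable complexes
            predictions.append((target, energy))
    return sorted(predictions, key=lambda te: te[1])
```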
We're going to look at two papers that describe algorithms for solving this problem; they're both uploaded to the website. The first one is called PRISM, and it actually uses structural calculations. Then we'll look at PrePPI, which handles everything without ever explicitly calculating the structures.

OK. So what does PRISM do? Well, it's based on the notion that there are a limited number of architectures through which proteins can interact. If we can identify those architectures, then we can try to figure out whether a protein is a potential partner of another one before we do the detailed, costly calculations. In addition, within those architectures not all amino acids are equal; some contribute more to the energy than others. By identifying those critical residues, we can once again focus our computational effort on the complexes that are most likely to be important.

So it has these two components. First, a rigid-body structural comparison: the two proteins do not change their own coordinates, they're just brought together in different relative orientations. Then, once the proteins have passed a series of checks, we allow flexible refinement, using the kinds of energies we looked at in the previous lectures, to decide how high an affinity this complex could have. The critical thing is that we make some of these early decisions, after the rigid-body comparison, using structural similarity, evolutionary conservation, and particularly these regions called hotspots. These are sites where most of the free energy of interaction at an interface comes from; it's not, as I said, uniformly distributed.

So I showed you this slide last time. It shows chymotrypsin in light gray and its interaction with some protein partners.
These two partners share some global similarity with each other, whereas this third partner is quite different from either of them globally. But you can see that at the interface, it's actually quite similar. And so this gives you hope that even if you can't find a direct homologue-- say you were trying to figure out what this protein in yellow interacts with, you searched the database, and you couldn't find any structural homologue of the whole protein-- if you could look for homologues of just the regions that form the interface, you might be able to figure out that it interacts with the same protein as these other two.

OK. So what about this idea of hotspots? This was an idea first developed in 1995 in this paper by Clackson and Wells, where they were looking at the interaction of a cell surface receptor with its ligand. They did systematic mutagenesis across the surface of the interface to see, when any single amino acid is mutated to alanine, how much it affects the energy of interaction. What they found was highly non-uniform. This lower curve shows the change in free energy when you mutate particular individual amino acids to alanine. You can see there are big losses of free energy at some positions, while at others there's almost no change in the free energy of binding. In a few places you actually gain binding energy by mutating a side chain to alanine.

So in this particular case, and it has held up over many, many cases since, the free energy of binding is not uniform across the surface but is concentrated in what have been called hotspots. Here is a structure of human growth hormone and its receptor. In red are the few amino acids that contribute very large amounts-- more than one and a half kcal per mole-- to the energy of interaction. And it doesn't correspond to any simple structural parameter.
It's not the amino acids with the biggest surface area, for example, or anything like that. So it's not trivial to figure out what these regions are, although there are some prediction algorithms. These studies, and subsequent ones, have indicated that roughly 10% of the amino acids at the interface make the biggest contribution. There are some trends, but none of them are hard rules. Hotspots tend to be rich in three amino acids: tryptophan, arginine, and tyrosine. As you might imagine, these are regions of the protein that are highly complementary, so there will be a patch on one protein that's a hotspot matching up with a patch on the other protein that's also a hotspot. And it's an interesting note that around the regions where the hotspots occur, there are other amino acids that exclude solvent from the interface; they call that an O-ring. So these are some of the features that tend to occur at protein interfaces.

So in the PRISM algorithm, they do the following. They start off with a template-- two proteins that are known to interact-- and they define the interface simply by close approach of amino acids in one chain to amino acids in the other. In this case, the regions of the proteins that interact are shown as balls. Then they isolate the interfacial residues and ignore the rest of the protein, because, as we said, the parts that interact could be homologous in different proteins even if the global structures are not, right? So the structural similarity calculations are done purely on the interface residues and not on the entire structure.

Then, with that template, you can look at lots of proteins and see whether they have any structural match to the pieces that interact. Here they've identified this protein, ASPP2, which has structural homology to I kappa B at the interface, although globally it's quite different.
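As a concrete illustration of that interface definition, here is a minimal sketch that marks a residue as interfacial if any of its atoms comes within a distance cutoff of an atom in the other chain. The data layout (residues as lists of 3D atom coordinates), the function name, and the 5 angstrom cutoff are assumptions for the example, not PRISM's exact parameters.

```python
import math

# Minimal sketch: define interfacial residues by close approach between
# chains. Each chain is a dict {residue_id: [(x, y, z), ...atom coords]}.
# The 5.0 angstrom cutoff is an illustrative choice, not PRISM's value.

def _min_dist(atoms_a, atoms_b):
    return min(math.dist(a, b) for a in atoms_a for b in atoms_b)

def interface_residues(chain_a, chain_b, cutoff=5.0):
    """Return the residue ids of chain_a and chain_b that face each other."""
    iface_a, iface_b = set(), set()
    for ra, atoms_a in chain_a.items():
        for rb, atoms_b in chain_b.items():
            if _min_dist(atoms_a, atoms_b) <= cutoff:
                iface_a.add(ra)
                iface_b.add(rb)
    return iface_a, iface_b

# Toy example with made-up coordinates:
chain_a = {("A", 10): [(0.0, 0.0, 0.0)], ("A", 11): [(30.0, 0.0, 0.0)]}
chain_b = {("B", 55): [(3.0, 0.0, 0.0)], ("B", 56): [(60.0, 0.0, 0.0)]}
print(interface_residues(chain_a, chain_b))  # ({('A', 10)}, {('B', 55)})
```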
Now, once they have this potential partner for NF kappa B, this ASPP2, they test whether there's a good structural match, specifically whether the match is good in the regions predicted to be hotspots-- they have an algorithm for predicting hotspots-- and whether there is sequence conservation at those hotspots. Only then do they do the flexible refinement of the type we looked at in the previous lecture-- energy minimization and other approaches-- to figure out the best possible structure of this complex and what its free energy would be.

So here's their description of the procedure. They have template proteins and targets. They do a structural alignment and ask whether it passes some thresholds; these are very, very fast calculations. Only if a pair passes these fast checks do you do more detailed calculations, and finally, only if it passes those do you do the very computationally expensive refinement.

One critical thing to remember about this algorithm is that it doesn't require the template and the query to be perfectly matched in structure. In fact, the elements of structure at the interface could come from different parts of the chain; they don't take the chain order into account. So if I had a beta sheet structure in one protein that looks like this, in my query those two strands could be very indirectly connected. I don't care that there's a huge gap or insertion; I just care that locally, at the interface, one protein looks a lot like the other.

There was a question in the back.

AUDIENCE: How do you search a database for 3D structures? Are you just looking at all the [INAUDIBLE]?

PROFESSOR: That's right. So the question was, how do you search a database for 3D structure? You do structural similarity comparisons that are based on the 3D coordinates. The simplest way to do it, but not the most efficient, is to find the rigid-body superposition that minimizes the root mean squared deviation, which was a metric we gave in one of the previous lectures.
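For reference, here is a minimal sketch of that superposition-plus-RMSD calculation using the standard Kabsch algorithm with NumPy. It assumes the two structures come as equal-length arrays of already-corresponding coordinates, which is the pre-aligned case rather than a full database search; the function name is ours.

```python
import numpy as np

# Minimal sketch: optimal rigid-body superposition (Kabsch algorithm)
# followed by RMSD. Assumes X and Y are (N, 3) coordinate arrays whose
# rows already correspond to each other.

def superpose_rmsd(X, Y):
    Xc = X - X.mean(axis=0)              # center both coordinate sets
    Yc = Y - Y.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc.T @ Yc)  # SVD of the covariance matrix
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])           # avoid an improper rotation
    R = Vt.T @ D @ U.T                   # optimal rotation of Xc onto Yc
    diff = Xc @ R.T - Yc
    return np.sqrt((diff ** 2).sum() / len(X))

# Toy example: Y is X rotated 90 degrees about z, so the RMSD is ~0.
X = np.array([[1.0, 0, 0], [0, 1.0, 0], [0, 0, 1.0], [1.0, 1.0, 1.0]])
Rz = np.array([[0, -1.0, 0], [1.0, 0, 0], [0, 0, 1.0]])
Y = X @ Rz.T
print(round(superpose_rmsd(X, Y), 6))  # 0.0
```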
There are faster things you can do as well. You could imagine looking at certain global features of the elements of secondary structure, and so on, and there's been a lot of work making those algorithms very fast. Other questions? Good question.

So they give an example in their paper: starting from this known complex of a cyclin-dependent kinase, the cyclin, and p27, the inhibitor, and then looking for structural matches. They identify this potential structural match, refine it, and get an interaction energy. They try another one that has no global structural similarity; again, once it passes all the checks, you compute the refinement and the energy. And similarly on this side. So from this initial complex, where we had two proteins known to interact in the PDB, they can make predictions that these other proteins are likely to interact, even though, again, at the global level there's very little similarity. Is that clear?

OK. So the advantage of this approach is that it eventually does do the structural refinements that allow us to find the best match between two potentially interacting proteins. But that's also its weakness, because it takes a lot of computational time. This other approach, called PrePPI, never actually does the structural refinements of the type we talked about in the previous lecture. So how does it figure out whether two proteins are likely to interact?

This is their schematic, and we'll go through the steps. You start off with two query proteins that you want to know whether they interact. You do a sequence similarity search against a database of known structures, so you find sequence homologues of those proteins with known structure.
They call those homology models, MA and MB. Now they look through the database for all the structural homologues-- not sequence homologues, but structural homologues-- of MA and MB. So they get a series of structural neighbors that they call NA1 through NAn and NB1 through NBn. These are the neighbors of those homology models. And they ask whether any neighbor in the first set and any neighbor in the second set are known to interact. That known interaction can then serve as a model for the interaction of the queries, right? So far so good.

Then they do a sequence alignment of MA and MB, which are the known structures found by sequence similarity to the queries, against the two proteins that are known to interact. So now they've got a potential model for the interaction of the queries, built from two proteins of known structure whose structural neighbors are known to interact. OK? So it's two steps removed from the actual interaction.

Now, while their figure says they do a structural superposition, that's not in fact what they do. If you look at it carefully, it's a sequence analysis, and I'll take you through the steps in a second. So they mean "structural" in a rather loose way. They're only doing sequence comparisons here; they never actually build a homology model for the queries.

OK. So this figure comes from the supplement, where, for some mysterious reason, they've changed all the nomenclature. Things that were previously called NA and NB are now called TA and TB. Take what you get.

So this is a pair of interacting proteins where the structure of the complex is known-- these are the structural neighbors of MA and MB, which you don't know whether they interact or not. They identify the interacting residues in this structure; that's what's represented by the black lines connecting the blue dots.
So these are the interacting residues from the two template proteins, the neighbors NA and NB. And they ask whether the amino acids in MA and MB are also good matches for this interface, and they have a number of criteria for doing that.

They come up with five measures. The first is the structural similarity between MA and MB and their neighbors NA and NB. Then they ask how many, and what fraction, of the amino acids at this interface can be aligned. This is a sequence-based alignment of MA against the template-- here called TA, though it was previously called NA, just to make life complicated. The interacting residues are all the blue ones in the structure of the TA-TB complex, and they ask what number and what fraction of those amino acids are aligned in the sequence alignment. In this case, I guess, it's four pairs of amino acids-- one, two, three, and four, indicated by these four lines-- that are both interacting in the structure of the complex and can be aligned to sequences in MA and MB.

And then they use other algorithms, based primarily on machine learning on protein interfaces, to decide whether the amino acids that would sit at those positions in the interface are residues that typically occur at interfaces. This is the kind of statistics I showed you before from those older papers-- that roughly 10% of the amino acids are in hotspots, and that certain kinds of amino acids predominate there. So there are a number of algorithms, and they list a bunch, that they use to come up with a score for whether these residues are, in fact, statistically likely to be good matches.
So they have these criteria, and they decide that some fraction of the amino acids at this interface in MA and MB are likely to be reasonable ones to sit at the interface. With all that done, they then combine all of these different scores with a Bayesian classifier-- we'll talk a little bit later in this lecture, and probably in the next one as well, about what a Bayesian classifier is. They plug in all the scores derived from these proteins to decide whether the two query proteins are likely to interact.

The advantage of this approach is that it's extremely fast. Everything we've talked about involves very quick calculations; even the structural alignments are fast, and the sequence alignments certainly are. So you can get through the whole database very quickly. They've actually computed the potential interaction partners for every pair of proteins in various genomes based solely on these alignments.

The disadvantage-- so what's the disadvantage of this method?

AUDIENCE: You can't get a de novo interaction?

PROFESSOR: We can't get any de novo interaction. If there are no neighboring structures that interact, it will never come up with a prediction. So that's an important point. And the other problem is that because it doesn't do the structural refinement-- it has given up on that slow calculation-- it also loses a lot of potential specificity. All the conformational changes that can occur will be invisible to an algorithm like this.

So we have these two competing approaches. Yes, question in the back.

AUDIENCE: Couldn't this method actually be used as an input to, say, a refinement step, for example?

PROFESSOR: The question was, could you use this kind of approach as an input to the refinement step? And absolutely one could. Is there another question back there? Other questions? All right.
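To give a flavor of that final step, here is a minimal sketch of a naive Bayes combination of several independent evidence scores into a posterior probability of interaction. The feature names, likelihood ratios, and prior are invented for illustration; PrePPI's actual classifier is trained on its own features and reference sets, so this is only the general idea of multiplying independent evidence onto a prior.

```python
# Minimal sketch of combining independent evidence with a naive Bayes
# classifier. The likelihood ratios and prior below are invented for
# illustration; a real classifier would learn them from training data.

def posterior_interaction(likelihood_ratios, prior=0.001):
    """Combine per-feature likelihood ratios P(score|interact)/P(score|not)
    under a naive (independence) assumption, starting from a prior."""
    odds = prior / (1.0 - prior)
    for lr in likelihood_ratios:
        odds *= lr                      # independence: odds multiply
    return odds / (1.0 + odds)          # convert odds back to probability

# Hypothetical evidence for one candidate pair: structural similarity,
# fraction of interface residues aligned, and interface residue propensity.
evidence = {"structural_similarity": 40.0,
            "interface_alignment":   25.0,
            "residue_propensity":     8.0}

p = posterior_interaction(evidence.values())
print(f"P(interact | evidence) = {p:.3f}")   # ~0.889 with these numbers
```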
So we're going to take a slight turn here in the lecture, move away from purely computational approaches, and look at how interaction measurements are actually made. One of the big changes of the last decade or so is that we've gone from an era when interactions were measured pairwise to interactions being measured in bulk, through high-throughput measurements. And we'll see that this leads to some statistical problems which eventually bring us back to computational issues as well.

If you want to measure all the proteins that interact in an organism, that turns out, obviously, to be very difficult. One big advance that has helped is the idea of tagging proteins and using mass spectrometry to figure out what they interact with. In these two sets of papers, which were some of the early ones done in yeast, they took one protein at a time and attached a tag to it. I'll talk about exactly what those tags are, but they are labels that allow you to attach the protein to a solid support. By attaching it to a solid support, you can then purify any proteins that stuck to protein one here. After you purify them, you run them out on a gel, cut the bands out, and determine the identity of the interacting proteins by mass spec. This sounds very labor intensive, but it's still a lot faster than anything that came before it. And with this approach, they were able to go through entire genomes-- proteomes, I should say-- and find interacting partners for very large fractions of all the proteins there.

So with this approach, what kinds of proteins do you think are likely to be false positives? Any thoughts? Yes.

AUDIENCE: Proteins stuck on the column that have nothing to do with the interaction [INAUDIBLE].

PROFESSOR: Exactly. So one thing that can be quite problematic is proteins that stick to the column regardless of which protein you put there.
And we'll see an approach to getting rid of that. Other kinds of problems? A variant of that? Thoughts?

What about proteins that tend to stick to other proteins non-specifically, right? Those are going to be quite problematic too. And what are the likely false negatives in an approach like this-- the proteins that really do interact with the blue one but aren't picked up? Yes.

AUDIENCE: Weak interaction partners [INAUDIBLE].

PROFESSOR: Weak interaction partners-- particularly complexes with short half-lives. Because you do a lot of washing, so detection is going to depend on the half-life. Very good. What else? Yeah.

AUDIENCE: Maybe something that interacts in the tag region?

PROFESSOR: Something that interacts in the tag region, right. Something that interacts right around here would be lost, because the tag would sterically interfere. Very good. Anything else? What about the concentration of proteins-- how does that influence whether they show up here?

All right. So if I have a protein at very high concentration, it may interact even though naturally it doesn't-- they never see each other, they're in different compartments-- but when you [INAUDIBLE] and do this, it shows up. Low abundance proteins, on the other hand, are going to be quite problematic, because there will be very little of them in these complexes compared to the high abundance proteins. They won't be detected by this method; they'll never make it to the mass spec, and so on. So we've got both false positives and false negatives in these approaches.

Now, one of the things that came up was proteins that stick non-specifically to the column. There was a clever approach in one of these early papers, which got widely adopted, to avoid that. It's called tandem affinity purification, or TAP-tagging. The idea is the following. We have some gene.
We use homologous recombination-- this was done in yeast, where that's easy-- to insert a sequence that codes for the following: a piece of protein with no known function that acts as a spacer, followed by this calmodulin-binding peptide, followed by a protease recognition site, and then protein A. Once this protein gets expressed-- and it gets expressed at its native levels, because you're inserting this into the genome; it's not on an exogenous promoter, it's in its normal position-- whatever that protein was, it now has all these pieces at its C terminus.

So how does that help? In the purification, we start with IgG, which binds protein A. So that's what attaches us to the solid support. And attached to the solid support will also be all those things that are nonspecific binders. If I have some nonspecific binder that just likes my solid support, it will be there too. If I just acid-washed everything off the column and ran my gels with that, or boiled it off in SDS, I would get the nonspecific proteins as well. But what they do instead is cleave here with a very specific protease that recognizes this site. It's called the tobacco etch virus (TEV) protease, and it has a very long recognition sequence, so you can make sure it doesn't cut anywhere in any other protein. So now, instead of eluting non-specifically with acid or detergent, you elute specifically with TEV, and this part of the protein falls off.

Then you do a second purification that relies on this other piece of the protein: you pull out only the things you want-- the ones carrying the CBP, the calmodulin-binding peptide-- using a different kind of solid support that has calmodulin attached to it. And so through this process you can get rid of a lot of nonspecific binders. It doesn't help you with the false negatives, right?
You've made the wash conditions even harsher, so you're going to lose more proteins, but you'll pick up fewer false positives. And finally, the last elution step actually uses EGTA, which is a chelating agent. The interaction between CBP and calmodulin depends on calcium, and EGTA pulls the calcium out of that interaction. So it's again a very specific way of eluting, rather than a nonspecific one like heat, salt, acid, or detergent.

So that has been one technology-- affinity purification followed by mass spec-- that's given us a lot of information about protein-protein interactions. A competing technology that has also contributed quite a lot is the yeast two-hybrid. In this approach, you have a reporter gene that normally is not transcribed. Upstream of it there is a designed DNA binding site, a DNA-binding protein, and your bait protein fused to it. You want to find every protein that can interact with this bait. The prey is attached to an activation domain. If the two proteins don't interact, the activation domain never gets recruited to the reporter, and there's no transcription. But if the green protein and the blue protein do interact, then the activation domain gets recruited to the promoter, transcription is turned on, and you get a signal.

So what are some of the advantages of this approach? It doesn't require you to purify anything, so it should be much more sensitive to low abundance proteins. That's definitely an advantage. It will also pick up a lot of transient interactions; you may not get continuous activation, but you'll get transient activation, and if you've set the conditions up properly, you can detect it. But it has its own biases-- none of these techniques is going to be perfect. It's going to be biased against proteins that don't express well. This is, as the name implies, typically done in yeast.
So if you have human proteins that you express in yeast, or plant proteins that you express in yeast, there can be some proteins that just will not express well in that organism. What else can be a problem? Some proteins don't do well in the nucleus, right? So if you're interested in interactions involving membrane proteins, it's going to be very hard to get them to the nucleus, and therefore you'll never pick up those interactions.

OK. So we've got these two different technologies-- affinity capture mass spec and the two-hybrid. Questions on those technologies? Yes.

AUDIENCE: Could another control for the mass spec purification be just to subtract out everything that elutes non-specifically?

PROFESSOR: The question was, could you subtract out anything that's nonspecific. And yes, if you've got what you might call frequent flyers-- proteins that show up in every single purification-- then you can simply ignore them, and that is often done. That will help you with things that bind the support very non-specifically. What's more of a problem are proteins that have some affinity for your protein X but are not really highly specific for it; they tend to bind to certain kinds of patches. Those are harder to figure out, because they won't stick to everything. Good question. Other questions?

All right. So we've got these different technologies, and we know there are problems with each approach. What we'd really like to be able to do is compute the probability that two proteins interact, based on the data. So now we're turning back to more mathematical, computational approaches.

Let's just consider one experiment first, and let's talk about a gold standard. What's a gold standard? It's a set of protein pairs that we have extremely high confidence interact, because they were analyzed by some other technology.
Not two-hybrid, not affinity capture mass spec, but much more direct measurements-- physical measurements, maybe structural work. There are a number of criteria that go into it. So we have this gold standard data set, where we know the proteins definitely interact, and we have our experiment. Clearly, anything in the overlap we can count as true positives, right? We detected it, and it's in the database of gold standards. And things that are in the gold standard that we missed are obviously false negatives: we report them as non-interacting, but in fact they do interact.

The question is, how much of the rest is true positive-- everything that's detected in the experiment but for which we have no information in the database? That could be for one of two reasons, right? It could be that they really don't interact, or it could be that no one has measured it; the whole point of the experiment is to find new things. So is there any way to estimate what fraction of the things unique to this experiment are true positives, and what fraction are false positives? That's what we'd like to figure out.

Now, if we just had one experiment, that would be very challenging. But what happens when we have two experiments? Say we have two affinity capture mass spec experiments, or maybe an affinity capture mass spec experiment and a two-hybrid. Now let's think about the overlap of those two experiments with the gold standard. I've got this region of overlap between experiment 1 and experiment 2, and then this region that overlaps among all three-- experiment 1, experiment 2, and the gold standard. Those are clearly true positives, right? They're high confidence, because I picked them up in both experiments and they're in the gold standard. What about all the things in what I've labeled here region II?
773 00:32:38,920 --> 00:32:41,670 Well, if we believe that these two experiments are 774 00:32:41,670 --> 00:32:45,540 independent of each other in a rigorous way-- 775 00:32:45,540 --> 00:32:47,300 so let's say one's a two-hybrid and one's 776 00:32:47,300 --> 00:32:49,807 an affinity capture mass spec, there's no particular reason 777 00:32:49,807 --> 00:32:51,390 that the false positives for one would 778 00:32:51,390 --> 00:32:54,200 be false positives in the other. 779 00:32:54,200 --> 00:32:56,890 In that case, I can call this region 2 780 00:32:56,890 --> 00:32:58,185 my consensus true positives. 781 00:32:58,185 --> 00:33:00,820 I have a very high confidence that these 782 00:33:00,820 --> 00:33:02,660 are true interactors. 783 00:33:02,660 --> 00:33:05,180 Everyone buy that? 784 00:33:05,180 --> 00:33:06,380 Seem reasonable? 785 00:33:06,380 --> 00:33:06,880 OK. 786 00:33:06,880 --> 00:33:11,240 So here's where the trick comes in. 787 00:33:11,240 --> 00:33:14,040 What fraction of all these consensus true positives 788 00:33:14,040 --> 00:33:16,930 are picked up in the gold standard? 789 00:33:16,930 --> 00:33:19,210 This ratio, right? 790 00:33:19,210 --> 00:33:21,430 Region 1 over region 2. 791 00:33:21,430 --> 00:33:21,960 OK. 792 00:33:21,960 --> 00:33:26,380 So now I've got this region of things that are picked up-- 793 00:33:26,380 --> 00:33:28,870 the true positives from this experiment, then 794 00:33:28,870 --> 00:33:31,130 the gold standard. 795 00:33:31,130 --> 00:33:33,740 And then I've got this region that's unique to experiment 2 796 00:33:33,740 --> 00:33:35,698 and it's going to be some mix of true positives 797 00:33:35,698 --> 00:33:36,700 and false positives. 798 00:33:36,700 --> 00:33:41,110 And the authors of this paper that are cited here 799 00:33:41,110 --> 00:33:42,500 make the following argument. 800 00:33:42,500 --> 00:33:49,300 We're going to assume that the ratio of I to II 801 00:33:49,300 --> 00:33:51,420 is the same as the ratio of III to IV. 802 00:33:56,030 --> 00:33:59,744 So the fraction of consensus true positives 803 00:33:59,744 --> 00:34:01,910 that are picked-- these are independent experiments. 804 00:34:01,910 --> 00:34:03,730 So the fraction of true positives 805 00:34:03,730 --> 00:34:05,480 that are picked up in the gold standard 806 00:34:05,480 --> 00:34:07,250 is going to be constant, whether they're in the consensus 807 00:34:07,250 --> 00:34:07,890 or not. 808 00:34:07,890 --> 00:34:09,636 So the fraction at ratio of I to II 809 00:34:09,636 --> 00:34:11,719 is going to be the same as the ratio of III to IV. 810 00:34:11,719 --> 00:34:15,010 So by that then, I can figure out how much of this region 811 00:34:15,010 --> 00:34:16,840 consists of true positives and how much 812 00:34:16,840 --> 00:34:18,590 consists of false positives. 813 00:34:18,590 --> 00:34:21,840 Everyone buy that? 814 00:34:21,840 --> 00:34:23,053 Yeah. 815 00:34:23,053 --> 00:34:25,490 AUDIENCE: Can I check-- are we not saying 816 00:34:25,490 --> 00:34:30,729 that the gold standard represents all true positives? 817 00:34:30,729 --> 00:34:31,520 PROFESSOR: Correct. 818 00:34:31,520 --> 00:34:35,670 Well, we're saying that the gold standard consists of things 819 00:34:35,670 --> 00:34:37,412 that we know to interact-- 820 00:34:37,412 --> 00:34:38,830 AUDIENCE: But there may be more. 821 00:34:38,830 --> 00:34:40,060 PROFESSOR: But there may be more. 822 00:34:40,060 --> 00:34:42,518 And the goal of our experiment is to find those other ones. 
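To make the bookkeeping concrete, here is a minimal sketch in Python of the estimate that this argument gives. It assumes, as above, that true positives land in the gold standard at the same rate whether or not they are in the consensus; the region counts below are invented purely to show the arithmetic, not taken from the actual study.

```python
# Regions of the Venn diagram (counts are hypothetical, for illustration only):
#   I   consensus pairs (both experiments) that are also in the gold standard
#   II  consensus pairs that are not in the gold standard
#   III pairs unique to this experiment that are in the gold standard
#   IV  pairs unique to this experiment with no gold-standard annotation
#       (an unknown mix of true and false positives)

def split_unannotated(n_I, n_II, n_III, n_IV):
    """Estimate the true/false split of region IV, assuming
    I / II equals III / (true part of IV)."""
    est_true = min(n_IV, n_III * n_II / n_I)  # rearrange the ratio assumption
    est_false = n_IV - est_true
    return est_true, est_false

# Hypothetical counts, roughly the scale discussed in the lecture:
print(split_unannotated(n_I=300, n_II=1500, n_III=200, n_IV=16000))
# -> (1000.0, 15000.0): most of the unannotated region is estimated to be false.
```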
823 00:34:45,290 --> 00:34:45,790 All right. 824 00:34:45,790 --> 00:34:48,940 So if you accept that premise, which seems plausible, 825 00:34:48,940 --> 00:34:51,489 then you can compute what fraction of all the things 826 00:34:51,489 --> 00:34:53,530 that are picked up in each of these experiments 827 00:34:53,530 --> 00:34:56,889 are likely to be true positives. 828 00:34:56,889 --> 00:34:58,080 So drum roll please. 829 00:34:58,080 --> 00:34:59,987 It turns out that the number's not that high. 830 00:35:03,120 --> 00:35:06,520 So the fraction of things in the consensus 831 00:35:06,520 --> 00:35:09,459 was 347 out of almost 2000. 832 00:35:09,459 --> 00:35:11,500 And if you do the math then, what you end up with 833 00:35:11,500 --> 00:35:15,240 is that the true fraction in this region, 834 00:35:15,240 --> 00:35:20,040 for which we have no data, is 1,123 out of-- 835 00:35:20,040 --> 00:35:25,606 and the false piece in this is going to be almost 15,000. 836 00:35:25,606 --> 00:35:27,480 And they went ahead and did this for a number 837 00:35:27,480 --> 00:35:29,670 of different experiments and computed 838 00:35:29,670 --> 00:35:34,230 the fraction of derived false positives for these data-- 839 00:35:34,230 --> 00:35:36,320 might be a little bit hard to see on this screen. 840 00:35:36,320 --> 00:35:40,120 But the numbers range from 50% false positives 841 00:35:40,120 --> 00:35:45,410 to, in some cases, over 90% false positives. 842 00:35:45,410 --> 00:35:47,770 That's a little disturbing, right? 843 00:35:47,770 --> 00:35:51,224 So these technologies are good at picking up interactions, 844 00:35:51,224 --> 00:35:52,890 but there's reason to be very skeptical. 845 00:35:55,290 --> 00:35:55,790 OK. 846 00:35:55,790 --> 00:35:57,960 So now we've got a serious problem, 847 00:35:57,960 --> 00:35:59,670 because how are we going to figure out 848 00:35:59,670 --> 00:36:01,890 which of these interactions to trust when we know 849 00:36:01,890 --> 00:36:07,340 that a very, very large fraction of them are false positives? 850 00:36:07,340 --> 00:36:08,270 So what could you do? 851 00:36:08,270 --> 00:36:11,570 Well, you could take only the little bit of overlap. 852 00:36:11,570 --> 00:36:17,061 You could say, I have that Venn diagram-- method 1, method 2. 853 00:36:17,061 --> 00:36:18,560 They did agree on a bunch of things. 854 00:36:18,560 --> 00:36:21,330 So I could take only those. 855 00:36:21,330 --> 00:36:22,886 That obviously throws away a lot. 856 00:36:22,886 --> 00:36:25,510 Someone else suggested we could throw away the sticky proteins, 857 00:36:25,510 --> 00:36:26,010 right? 858 00:36:26,010 --> 00:36:27,980 So maybe there are nonspecific proteins 859 00:36:27,980 --> 00:36:29,604 that don't show up in every experiment, 860 00:36:29,604 --> 00:36:31,819 but they show up in a very, very large fraction 861 00:36:31,819 --> 00:36:32,610 of all experiments. 862 00:36:32,610 --> 00:36:34,290 Maybe I toss those out. 863 00:36:34,290 --> 00:36:36,542 That's another possibility. 864 00:36:36,542 --> 00:36:38,250 But what we really want to do is actually 865 00:36:38,250 --> 00:36:40,460 come up with a probability estimate. 866 00:36:40,460 --> 00:36:41,960 To not have to make a hard decision, 867 00:36:41,960 --> 00:36:43,918 but come up with an estimate of the probability 868 00:36:43,918 --> 00:36:45,790 that things interact based on all the data. 869 00:36:45,790 --> 00:36:49,117 So how do we go about doing that? 
870 00:36:49,117 --> 00:36:51,700 So first of all, what happens if you just require a consensus?
871 00:36:51,700 --> 00:36:55,840 So this plot shows accuracy and coverage
872 00:36:55,840 --> 00:36:59,990 of the gold standard for individual experiments
873 00:36:59,990 --> 00:37:05,750 with different thresholds for deciding what's interacting,
874 00:37:05,750 --> 00:37:07,650 different cutoffs and things.
875 00:37:07,650 --> 00:37:10,360 So the individual experiments are shown here.
876 00:37:10,360 --> 00:37:12,400 And then if you require two methods
877 00:37:12,400 --> 00:37:14,990 to pick something up, or three methods to pick something up,
878 00:37:14,990 --> 00:37:17,010 you can get better and better in your accuracy.
879 00:37:17,010 --> 00:37:18,350 This is a log-log plot.
880 00:37:18,350 --> 00:37:20,710 So if you require three methods to agree
881 00:37:20,710 --> 00:37:22,750 before you call something a true positive,
882 00:37:22,750 --> 00:37:25,000 you can get up to-- I'm not sure exactly what this is,
883 00:37:25,000 --> 00:37:26,790 but 80%, 90% possibly.
884 00:37:26,790 --> 00:37:27,560 Right?
885 00:37:27,560 --> 00:37:29,590 But look at where you are on the y-axis.
886 00:37:29,590 --> 00:37:31,760 You'd only get less than 1%
887 00:37:31,760 --> 00:37:33,750 coverage of the gold standard.
888 00:37:33,750 --> 00:37:35,404 So that's not a great approach.
889 00:37:35,404 --> 00:37:37,070 So what we really want to do, as I said,
890 00:37:37,070 --> 00:37:39,780 is to try to estimate the probability that proteins
891 00:37:39,780 --> 00:37:43,610 interact given all of our available data.
892 00:37:43,610 --> 00:37:47,700 And the data could be specific experiments.
893 00:37:47,700 --> 00:37:49,700 Say the two different mass spec experiments
894 00:37:49,700 --> 00:37:50,930 we just referred to.
895 00:37:50,930 --> 00:37:52,630 Or as we'll see a little bit later
896 00:37:52,630 --> 00:37:55,230 in this lecture and possibly the next one, other kinds
897 00:37:55,230 --> 00:37:58,090 of external data that are not direct physical measurements
898 00:37:58,090 --> 00:38:00,690 of interaction, but might give us confidence
899 00:38:00,690 --> 00:38:03,257 that things interact based on similarity in annotation,
900 00:38:03,257 --> 00:38:05,090 or similarity in gene expression, and so on.
901 00:38:05,090 --> 00:38:07,040 And we'll get into details of that.
902 00:38:07,040 --> 00:38:07,540 OK.
903 00:38:07,540 --> 00:38:09,350 So to do this, we need to have a little bit
904 00:38:09,350 --> 00:38:11,365 of a refresher on Bayesian statistics.
905 00:38:14,780 --> 00:38:16,640 So I want to measure the probability
906 00:38:16,640 --> 00:38:21,340 that an interaction is true given the available data.
907 00:38:21,340 --> 00:38:22,410 Right?
908 00:38:22,410 --> 00:38:26,130 And I can estimate that based on the probability of observing
909 00:38:26,130 --> 00:38:30,090 the data for things that I know to be true
910 00:38:30,090 --> 00:38:31,330 and these prior estimates.
911 00:38:31,330 --> 00:38:34,960 So what's the prior probability that an interaction is true
912 00:38:34,960 --> 00:38:37,710 and the prior probability of observing a particular data
913 00:38:37,710 --> 00:38:38,340 set.
914 00:38:38,340 --> 00:38:40,427 Now, this by itself isn't really that helpful.
915 00:38:40,427 --> 00:38:42,760 I haven't told you yet how to calculate any of the terms
916 00:38:42,760 --> 00:38:43,890 on the right.
917 00:38:43,890 --> 00:38:45,340 But bear with me.
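For reference, the relationship just described, written in symbols with D standing for all of the observed data, is Bayes' rule:

$$ P(\mathrm{true} \mid D) \;=\; \frac{P(D \mid \mathrm{true})\, P(\mathrm{true})}{P(D)} $$

The two terms in the numerator are the probability of observing these data for an interaction known to be true and the prior probability that an interaction is true; the denominator is the prior probability of observing that particular data configuration at all.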
918 00:38:45,340 --> 00:38:48,100 If I want to decide the likelihood 919 00:38:48,100 --> 00:38:51,660 that a protein interacts-- how likely is it? 920 00:38:51,660 --> 00:38:53,610 Is it more likely that it interacts or not? 921 00:38:53,610 --> 00:38:54,676 I can compute this ratio. 922 00:38:54,676 --> 00:38:56,175 The probability that the interaction 923 00:38:56,175 --> 00:38:58,205 is true given the data over the probability 924 00:38:58,205 --> 00:39:00,330 an interaction is false given the data. 925 00:39:00,330 --> 00:39:03,330 That's the likelihood ratio. 926 00:39:03,330 --> 00:39:07,440 So by this formula, I then cancel out this probability 927 00:39:07,440 --> 00:39:10,030 of the data, the prior probability of the data. 928 00:39:10,030 --> 00:39:13,200 And if I had a way of calculating this, 929 00:39:13,200 --> 00:39:15,510 and we'll get to it in a second, then if it's 930 00:39:15,510 --> 00:39:18,030 more likely than not to be a true interaction, 931 00:39:18,030 --> 00:39:20,405 I can call it an interaction, right, if it's less likely. 932 00:39:20,405 --> 00:39:21,863 So if this ratio is greater than 1, 933 00:39:21,863 --> 00:39:23,300 I accept it as a true interaction. 934 00:39:23,300 --> 00:39:27,520 If this ratio is less than 1, then I reject it. 935 00:39:27,520 --> 00:39:28,020 OK. 936 00:39:28,020 --> 00:39:29,600 So now our challenge is to figure out 937 00:39:29,600 --> 00:39:31,800 how to compute these terms. 938 00:39:31,800 --> 00:39:34,400 One more thing to note is if all I want to do 939 00:39:34,400 --> 00:39:38,900 is be able to rank every interaction by this likelihood 940 00:39:38,900 --> 00:39:41,940 ratio, rather than coming up with a hard threshold, 941 00:39:41,940 --> 00:39:44,360 then I actually don't need all these terms. 942 00:39:44,360 --> 00:39:46,640 So this is the likelihood ratio. 943 00:39:46,640 --> 00:39:48,930 I can convert it to a log space. 944 00:39:48,930 --> 00:39:51,260 So it's going to be the sum of these two terms. 945 00:39:51,260 --> 00:39:53,180 And if I'm simply ranking everything 946 00:39:53,180 --> 00:39:56,160 by this log likelihood ratio, this term 947 00:39:56,160 --> 00:39:58,440 is the same for every interaction. 948 00:39:58,440 --> 00:40:01,220 It's just composed of prior probabilities. 949 00:40:01,220 --> 00:40:04,800 So it's not going to affect the ranking at all. 950 00:40:04,800 --> 00:40:05,740 Any questions on that? 951 00:40:05,740 --> 00:40:07,760 Is that clear? 952 00:40:07,760 --> 00:40:10,042 Good. 953 00:40:10,042 --> 00:40:12,250 So if I just want to come up with a ranking function, 954 00:40:12,250 --> 00:40:14,680 all I need to do-- all-- I need to do 955 00:40:14,680 --> 00:40:16,930 is to be able to estimate the probability of observing 956 00:40:16,930 --> 00:40:19,630 data for true interactions and the probability of observing 957 00:40:19,630 --> 00:40:21,699 that set of data for false interactions. 958 00:40:21,699 --> 00:40:22,490 Everybody buy that? 959 00:40:26,250 --> 00:40:27,360 Yes, please. 960 00:40:27,360 --> 00:40:29,440 AUDIENCE: When you say that prior probability is 961 00:40:29,440 --> 00:40:30,960 the same for all interactions, we're 962 00:40:30,960 --> 00:40:34,459 saying we're assuming the same prior probability for all, 963 00:40:34,459 --> 00:40:36,874 or is this [INAUDIBLE]? 964 00:40:36,874 --> 00:40:38,485 PROFESSOR: That's its definition. 
965 00:40:38,485 --> 00:40:41,070 We mean, what is the prior probability that proteins
966 00:40:41,070 --> 00:40:42,740 interact versus the prior probability that they don't?
967 00:40:42,740 --> 00:40:46,959 So it's independent of the proteins that we're looking at.
968 00:40:46,959 --> 00:40:47,625 Other questions?
969 00:40:51,320 --> 00:40:51,820 All right.
970 00:40:51,820 --> 00:40:54,282 So we need a way of computing this piece
971 00:40:54,282 --> 00:40:55,990 of all the things we've looked at before.
972 00:40:55,990 --> 00:40:58,560 So how do we get an estimate of the probability of observing
973 00:40:58,560 --> 00:41:00,550 a particular configuration of the data?
974 00:41:00,550 --> 00:41:02,530 Meaning, I detect it in experiment 1
975 00:41:02,530 --> 00:41:06,290 and not in experiment 2, but in experiment 3.
976 00:41:06,290 --> 00:41:09,710 What's the probability of that given it's a true interaction?
977 00:41:09,710 --> 00:41:11,880 So that's what we're going to dive into right now.
978 00:41:11,880 --> 00:41:12,430 OK.
979 00:41:12,430 --> 00:41:15,240 So one thing we could do to make life simpler,
980 00:41:15,240 --> 00:41:18,480 and then we'll remove this simplification later,
981 00:41:18,480 --> 00:41:21,050 but let's, for the time being, assume that all of my data
982 00:41:21,050 --> 00:41:23,890 are independent.
983 00:41:23,890 --> 00:41:26,470 So the two-hybrid is going to have completely different
984 00:41:26,470 --> 00:41:29,467 mistakes than the affinity capture mass spec.
985 00:41:29,467 --> 00:41:31,050 So those two data sets are going to be
986 00:41:31,050 --> 00:41:32,910 completely independent of each other.
987 00:41:32,910 --> 00:41:38,870 So I can write this as a product of a particular observation--
988 00:41:38,870 --> 00:41:40,470 a particular mass spec experiment
989 00:41:40,470 --> 00:41:43,542 and a particular two-hybrid experiment for true interactions
990 00:41:43,542 --> 00:41:44,500 and false interactions.
991 00:41:44,500 --> 00:41:46,580 So it's the product of the probability
992 00:41:46,580 --> 00:41:50,320 that a particular experiment would detect an interaction
993 00:41:50,320 --> 00:41:52,970 if the interaction is true over the probability
994 00:41:52,970 --> 00:41:55,630 that that particular experiment would detect it
995 00:41:55,630 --> 00:41:57,640 if there was no interaction.
996 00:41:57,640 --> 00:42:01,500 I'm just going to multiply all of those probabilities.
997 00:42:01,500 --> 00:42:02,100 Yes.
998 00:42:02,100 --> 00:42:03,900 AUDIENCE: [INAUDIBLE].
999 00:42:03,900 --> 00:42:07,961 This is one interaction pair?
1000 00:42:07,961 --> 00:42:08,960 PROFESSOR: That's right.
1001 00:42:08,960 --> 00:42:10,390 AUDIENCE: And you take the product
1002 00:42:10,390 --> 00:42:12,995 over all the interaction pairs within one
1003 00:42:12,995 --> 00:42:14,240 run of the experiment.
1004 00:42:14,240 --> 00:42:17,425 Is that correct?
1005 00:42:17,425 --> 00:42:18,800 PROFESSOR: If I want to determine
1006 00:42:18,800 --> 00:42:22,740 whether a particular interaction pair--
1007 00:42:22,740 --> 00:42:25,390 I want to compute this log likelihood
1008 00:42:25,390 --> 00:42:27,150 ratio, or this, actually, ranking ratio,
1009 00:42:27,150 --> 00:42:28,920 because I've thrown away the priors.
1010 00:42:28,920 --> 00:42:31,378 I want to compute this ranking ratio for a particular pair.
1011 00:42:31,378 --> 00:42:33,020 So I've got protein A and protein B.
1012 00:42:33,020 --> 00:42:34,940 And I want to determine whether I believe
1013 00:42:34,940 --> 00:42:36,775 it to be more likely to interact or not,
1014 00:42:36,775 --> 00:42:38,400 and rank it with all the others, right?
1015 00:42:38,400 --> 00:42:41,390 So I'm doing this for a pair of proteins now.
1016 00:42:41,390 --> 00:42:42,620 So far so good?
1017 00:42:42,620 --> 00:42:44,050 Now, for that pair of proteins, I
1018 00:42:44,050 --> 00:42:47,140 have a series of observations, or lack of observations, right?
1019 00:42:47,140 --> 00:42:49,327 I have a whole bunch of experiments.
1020 00:42:49,327 --> 00:42:51,160 This experiment detected it, that experiment
1021 00:42:51,160 --> 00:42:53,566 didn't detect it, this one did.
1022 00:42:53,566 --> 00:42:55,440 So what's the probability that these proteins--
1023 00:42:55,440 --> 00:42:58,660 that A and B really interact given that I got yes, no, yes
1024 00:42:58,660 --> 00:42:59,670 in my experiments?
1025 00:42:59,670 --> 00:43:02,226 And then for a new protein pair, it might be no, no, yes,
1026 00:43:02,226 --> 00:43:04,725 and I want to figure out the probability for that pair.
1027 00:43:04,725 --> 00:43:07,695 AUDIENCE: So is the scale of the big letter M,
1028 00:43:07,695 --> 00:43:10,830 is it on the order of like 10 experiments, 100 experiments,
1029 00:43:10,830 --> 00:43:12,067 or thousands of experiments?
1030 00:43:12,067 --> 00:43:12,650 PROFESSOR: Ah.
1031 00:43:12,650 --> 00:43:14,825 So the question is, what's the scale of this.
1032 00:43:14,825 --> 00:43:17,200 So obviously, that's going to depend on what kind of data
1033 00:43:17,200 --> 00:43:19,870 I bring in, but in these cases, it's small.
1034 00:43:19,870 --> 00:43:22,300 So we have a handful of these high throughput experiments
1035 00:43:22,300 --> 00:43:25,190 over entire genomes and proteomes.
1036 00:43:25,190 --> 00:43:26,416 So there's not going to be a lot.
1037 00:43:26,416 --> 00:43:27,790 So in some of these early papers,
1038 00:43:27,790 --> 00:43:29,530 there were four interaction experiments
1039 00:43:29,530 --> 00:43:30,614 that they were looking at.
1040 00:43:30,614 --> 00:43:32,488 Now the numbers might be a little bit bigger,
1041 00:43:32,488 --> 00:43:33,800 but not significantly greater.
1042 00:43:38,130 --> 00:43:38,630 All right.
1043 00:43:38,630 --> 00:43:42,070 So now to compute this, we need a set of gold standards.
1044 00:43:42,070 --> 00:43:44,790 But now we don't just need gold standard positive interactions,
1045 00:43:44,790 --> 00:43:46,498 proteins that we know really do interact.
1046 00:43:46,498 --> 00:43:50,130 We also need proteins that we know really don't interact.
1047 00:43:50,130 --> 00:43:53,420 Because I want to compute the probability of an observation
1048 00:43:53,420 --> 00:43:55,595 given that some interaction is definitely wrong.
1049 00:43:58,970 --> 00:44:01,820 So precisely how I compute these terms
1050 00:44:01,820 --> 00:44:03,550 is going to depend on the kinds of data.
1051 00:44:03,550 --> 00:44:05,220 The experiments I've just been talking about,
1052 00:44:05,220 --> 00:44:06,860 these high throughput mass spec ones,
1053 00:44:06,860 --> 00:44:10,110 were the ones where we looked at the ratio of the consensus
1054 00:44:10,110 --> 00:44:14,330 true positives and estimated that 96% of all the data
1055 00:44:14,330 --> 00:44:15,710 were possibly in error.
1056 00:44:15,710 --> 00:44:18,030 The details of how to do those calculations are here.
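As a concrete illustration of the ranking just described, here is a minimal sketch in Python. It assumes we have already estimated, from the gold-standard positive and negative sets, how often each experiment detects a true interaction and how often it reports a false one; the experiment names and probabilities below are made up.

```python
import math

# Hypothetical per-experiment detection probabilities, estimated from
# gold-standard positive and negative pairs (numbers are invented):
#   (P(detected | true interaction), P(detected | false interaction))
detection_probs = {
    "ms_1":       (0.30, 0.02),   # affinity capture mass spec, screen 1
    "ms_2":       (0.25, 0.03),   # affinity capture mass spec, screen 2
    "two_hybrid": (0.20, 0.05),
}

def log_likelihood_ratio(observations):
    """Score one protein pair from a dict like {"ms_1": True, "two_hybrid": False}.
    Assumes the experiments make independent errors, so the per-experiment
    ratios multiply (their logs add)."""
    llr = 0.0
    for expt, detected in observations.items():
        p_true, p_false = detection_probs[expt]
        if detected:
            llr += math.log(p_true / p_false)
        else:
            llr += math.log((1 - p_true) / (1 - p_false))
    return llr

# Example: seen in both mass spec screens but not in the two-hybrid.
print(log_likelihood_ratio({"ms_1": True, "ms_2": True, "two_hybrid": False}))
```

A positive score means the observations are more consistent with a true interaction than a false one, and pairs can simply be ranked by this score without ever fixing a hard threshold.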
1057 00:44:18,030 --> 00:44:22,100 I leave you to look that up if you're interested. 1058 00:44:22,100 --> 00:44:23,510 But now what we're going to do is 1059 00:44:23,510 --> 00:44:26,650 we're going to see how, if we were to rank interactions 1060 00:44:26,650 --> 00:44:32,150 based on this term, we can avoid having 1061 00:44:32,150 --> 00:44:33,920 to throw out most of our data. 1062 00:44:33,920 --> 00:44:36,637 So we said if we require all the experiments to agree, 1063 00:44:36,637 --> 00:44:38,470 we're going to have very, very low coverage. 1064 00:44:38,470 --> 00:44:40,450 Now we're instead going to rank everything 1065 00:44:40,450 --> 00:44:42,420 based on this likelihood ratio, or something 1066 00:44:42,420 --> 00:44:44,175 derived from the likelihood ratio. 1067 00:44:44,175 --> 00:44:45,800 So in this paper where they were simply 1068 00:44:45,800 --> 00:44:47,550 looking at the protein-protein interaction 1069 00:44:47,550 --> 00:44:51,570 data sets to compute these interactions, 1070 00:44:51,570 --> 00:44:55,800 they ranked everything based on that ranking function we just 1071 00:44:55,800 --> 00:44:56,840 described. 1072 00:44:56,840 --> 00:44:59,610 And then as you vary your threshold, 1073 00:44:59,610 --> 00:45:01,860 you can figure out how many true positives you have 1074 00:45:01,860 --> 00:45:05,180 and how many false positives you have in the gold standard. 1075 00:45:05,180 --> 00:45:07,270 True interactors and false interactors. 1076 00:45:07,270 --> 00:45:09,110 And you can compute this curve, right? 1077 00:45:09,110 --> 00:45:12,800 For any particular value of that ranking ratio, 1078 00:45:12,800 --> 00:45:16,240 what's my sensitivity and what's my specificity? 1079 00:45:16,240 --> 00:45:18,270 Are you clear what this plot means? 1080 00:45:21,770 --> 00:45:23,886 And here they've plotted the values 1081 00:45:23,886 --> 00:45:25,010 for individual experiments. 1082 00:45:25,010 --> 00:45:29,950 And this is the value for an independent database 1083 00:45:29,950 --> 00:45:33,425 of gold standard interactions. 1084 00:45:33,425 --> 00:45:34,800 And so now, where do they come up 1085 00:45:34,800 --> 00:45:37,160 with their true positives and their false positives? 1086 00:45:37,160 --> 00:45:39,660 A lot of this is going to depend on how representative those 1087 00:45:39,660 --> 00:45:40,160 are. 1088 00:45:40,160 --> 00:45:42,480 And all these numbers are subject to revision 1089 00:45:42,480 --> 00:45:45,280 if you decide that the true positives and false positives 1090 00:45:45,280 --> 00:45:47,860 that people are using are not accurate enough. 1091 00:45:47,860 --> 00:45:52,320 So they used two well annotated databases of interactions. 1092 00:45:52,320 --> 00:45:54,140 One from MIPS and one from SGD. 1093 00:45:54,140 --> 00:45:56,270 And you can play those off against each other 1094 00:45:56,270 --> 00:45:58,119 as the database of true positives. 1095 00:45:58,119 --> 00:45:59,660 In some ways, that's the easier thing 1096 00:45:59,660 --> 00:46:02,650 because people like to report that proteins interact. 1097 00:46:02,650 --> 00:46:05,210 They tend not to like to report the proteins don't interact. 1098 00:46:05,210 --> 00:46:07,560 You don't see a lot of nature papers saying protein 1099 00:46:07,560 --> 00:46:10,152 x doesn't interact with protein y. 1100 00:46:10,152 --> 00:46:11,860 So how are you going to figure out, then, 1101 00:46:11,860 --> 00:46:13,950 what are your true negatives? 
1102 00:46:13,950 --> 00:46:17,467 So the strategies that they used-- well,
1103 00:46:17,467 --> 00:46:19,800 one possibility is they're annotated to be in complexes,
1104 00:46:19,800 --> 00:46:22,110 and those complexes are different from each other.
1105 00:46:22,110 --> 00:46:23,320 That's not bad, right?
1106 00:46:23,320 --> 00:46:26,030 But it's not a guarantee either.
1107 00:46:26,030 --> 00:46:27,760 Or this is a little bit better.
1108 00:46:27,760 --> 00:46:31,296 They're annotated to be in different parts of the cell.
1109 00:46:31,296 --> 00:46:33,440 Of course, if those annotations aren't perfect,
1110 00:46:33,440 --> 00:46:35,910 or the proteins are present at low concentrations, you could still be wrong.
1111 00:46:35,910 --> 00:46:37,840 Or that they have anti-correlated gene
1112 00:46:37,840 --> 00:46:38,340 expression.
1113 00:46:38,340 --> 00:46:39,580 I kind of like this one.
1114 00:46:39,580 --> 00:46:42,164 So it's one thing to be not correlated, but if you're
1115 00:46:42,164 --> 00:46:43,830 anti-correlated, that seems pretty suggestive
1116 00:46:43,830 --> 00:46:47,032 that these two proteins are never in a complex together.
1117 00:46:47,032 --> 00:46:49,240 Again, it's no guarantee because, as we'll talk about
1118 00:46:49,240 --> 00:46:51,850 in some detail later, RNA levels are not
1119 00:46:51,850 --> 00:46:53,511 very good predictors of protein levels.
1120 00:46:53,511 --> 00:46:55,260 But if you apply enough of these criteria,
1121 00:46:55,260 --> 00:46:56,840 you can come up with a set of proteins
1122 00:46:56,840 --> 00:46:58,430 that you have fairly high confidence really
1123 00:46:58,430 --> 00:46:59,200 don't interact.
1124 00:46:59,200 --> 00:47:01,750 You combine that with the databases of proteins
1125 00:47:01,750 --> 00:47:04,420 with very high confidence that they do interact,
1126 00:47:04,420 --> 00:47:06,880 and you can get the true positives and false positives
1127 00:47:06,880 --> 00:47:08,213 that you need for this analysis.
1128 00:47:13,071 --> 00:47:13,570 All right.
1129 00:47:13,570 --> 00:47:16,290 So that's a way of combining some information.
1130 00:47:16,290 --> 00:47:18,320 We're going to see a generalization of that
1131 00:47:18,320 --> 00:47:19,584 called Bayesian networks.
1132 00:47:19,584 --> 00:47:21,250 We've mentioned these already in at least
1133 00:47:21,250 --> 00:47:22,624 two different contexts, and they'll
1134 00:47:22,624 --> 00:47:26,120 come up again later in the course as well.
1135 00:47:26,120 --> 00:47:28,190 So these are very general methods
1136 00:47:28,190 --> 00:47:31,580 for reasoning probabilistically.
1137 00:47:31,580 --> 00:47:33,470 We will see them in the context here
1138 00:47:33,470 --> 00:47:34,810 of predicting interactions.
1139 00:47:34,810 --> 00:47:37,060 We'll see them later in the context of gene regulation
1140 00:47:37,060 --> 00:47:38,070 and signaling as well.
1141 00:47:41,530 --> 00:47:44,410 What we fundamentally need for a Bayesian network
1142 00:47:44,410 --> 00:47:47,320 is a graphical structure that represents our understanding
1143 00:47:47,320 --> 00:47:50,499 of the relationship between causes and effects,
1144 00:47:50,499 --> 00:47:52,040 and a set of probabilities that allow
1145 00:47:52,040 --> 00:47:54,830 us to compute things on this network.
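As a very stripped-down illustration of those two ingredients, a Bayesian network can be represented as a graph plus one conditional probability table per node; the structure and numbers below are invented, just to make the idea concrete.

```python
# Each node lists its parents and a conditional probability table (CPT)
# giving P(node = True | parent values). Structure and numbers are invented.
network = {
    "interacts": {"parents": [], "cpt": {(): 0.01}},
    "ms_hit":    {"parents": ["interacts"], "cpt": {(True,): 0.30, (False,): 0.02}},
    "y2h_hit":   {"parents": ["interacts"], "cpt": {(True,): 0.20, (False,): 0.05}},
}

def prob_true(node, assignment):
    """P(node = True | the values of its parents given in `assignment`)."""
    parent_values = tuple(assignment[p] for p in network[node]["parents"])
    return network[node]["cpt"][parent_values]

# Chance that a truly interacting pair is picked up in the mass spec screen:
print(prob_true("ms_hit", {"interacts": True}))   # 0.3
```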
1146 00:47:54,830 --> 00:47:58,060 We'll show you examples where those networks are derived 1147 00:47:58,060 --> 00:48:01,490 from our prior understanding of the problem, 1148 00:48:01,490 --> 00:48:03,490 but also ones where the structure of the network 1149 00:48:03,490 --> 00:48:04,730 is learned from the data. 1150 00:48:07,250 --> 00:48:11,280 And we're going to see two primary contexts. 1151 00:48:11,280 --> 00:48:14,471 First we have this question of whether proteins interact. 1152 00:48:14,471 --> 00:48:16,220 That's what we've just been talking about. 1153 00:48:16,220 --> 00:48:19,070 So here are four experiments, the in vitro pulldown 1154 00:48:19,070 --> 00:48:22,230 experiments and yeast two-hybrid experiments, 1155 00:48:22,230 --> 00:48:25,220 that give us relatively independent information 1156 00:48:25,220 --> 00:48:27,195 about whether proteins interact. 1157 00:48:27,195 --> 00:48:28,820 And we're going to look at a paper that 1158 00:48:28,820 --> 00:48:31,590 used those data with a Bayesian network 1159 00:48:31,590 --> 00:48:34,270 to compute the probability that two proteins really do interact 1160 00:48:34,270 --> 00:48:36,295 based on the combination of all the data, 1161 00:48:36,295 --> 00:48:38,420 rather than throwing out anything that doesn't fall 1162 00:48:38,420 --> 00:48:41,050 in the overlap, which could be a very, very small number. 1163 00:48:41,050 --> 00:48:42,550 And then later on we'll see examples 1164 00:48:42,550 --> 00:48:45,360 of using Bayesian networks to understand biological networks. 1165 00:48:45,360 --> 00:48:47,980 So this might be a set of transcription factors 1166 00:48:47,980 --> 00:48:51,460 that are regulating a set of differentially expressed genes. 1167 00:48:51,460 --> 00:48:53,280 And the structure of the graphical network 1168 00:48:53,280 --> 00:48:55,280 for a Bayesian network has a lot of similarities 1169 00:48:55,280 --> 00:48:57,580 to the way we normally think about transcriptional 1170 00:48:57,580 --> 00:48:58,850 regulatory networks. 1171 00:48:58,850 --> 00:49:01,390 So there's sort of a natural way of transferring 1172 00:49:01,390 --> 00:49:05,460 our regulatory problem into a graphical network problem. 1173 00:49:05,460 --> 00:49:08,210 But we're going to focus on these prediction 1174 00:49:08,210 --> 00:49:10,800 problems for protein-protein interactions first. 1175 00:49:10,800 --> 00:49:17,070 Now, if I just want to compute the probability of detecting 1176 00:49:17,070 --> 00:49:19,320 an interaction in various experiments, given that it's 1177 00:49:19,320 --> 00:49:21,260 true or false, I could explicitly 1178 00:49:21,260 --> 00:49:23,090 compute that probability. 1179 00:49:23,090 --> 00:49:26,010 And we saw examples of that just now. 1180 00:49:26,010 --> 00:49:28,080 But some of these Bayesian network problems 1181 00:49:28,080 --> 00:49:30,980 become much, much too large to do that. 1182 00:49:30,980 --> 00:49:35,580 This is a little tiny piece of a Bayesian network 1183 00:49:35,580 --> 00:49:37,270 that is supposed to represent I believe 1184 00:49:37,270 --> 00:49:40,480 it's transcriptional regulatory network. 1185 00:49:40,480 --> 00:49:43,730 You could never possibly write down all of the terms 1186 00:49:43,730 --> 00:49:47,250 in this probability, where every node could, in principle depend 1187 00:49:47,250 --> 00:49:48,820 on every other node in the network. 1188 00:49:48,820 --> 00:49:52,080 It would just be a ridiculously large problem. 
1189 00:49:52,080 --> 00:49:55,890 In fact, how large would it be if I've got N binary variables, 1190 00:49:55,890 --> 00:49:58,410 my gene is on or off, my interaction is true or false, 1191 00:49:58,410 --> 00:50:01,130 I have 2 to the N possible states? 1192 00:50:01,130 --> 00:50:01,940 Right? 1193 00:50:01,940 --> 00:50:04,274 And the only constraint I have, in principle, 1194 00:50:04,274 --> 00:50:06,440 is that all the probabilities have to add up to one. 1195 00:50:06,440 --> 00:50:08,740 So I have 2 to the N minus 1. 1196 00:50:08,740 --> 00:50:13,770 2 to the N minus 1 possible variables that I need to set. 1197 00:50:13,770 --> 00:50:16,530 So that's a ridiculously large number in most contexts. 1198 00:50:16,530 --> 00:50:19,670 So how do Bayesian networks help us solve this problem? 1199 00:50:19,670 --> 00:50:21,455 Well, we represent our understanding 1200 00:50:21,455 --> 00:50:23,940 of the problem in a graphical structure 1201 00:50:23,940 --> 00:50:26,310 where we have causes and effects. 1202 00:50:26,310 --> 00:50:28,980 And there'll be a direct arrow from a cause to an effect. 1203 00:50:28,980 --> 00:50:30,636 I don't always know the cause. 1204 00:50:30,636 --> 00:50:32,010 So in our context, we were trying 1205 00:50:32,010 --> 00:50:34,560 to figure out whether two proteins interact. 1206 00:50:34,560 --> 00:50:36,495 What do we measure? 1207 00:50:36,495 --> 00:50:38,120 We actually don't measure interactions. 1208 00:50:38,120 --> 00:50:40,610 We measure the result of a particular experiment, which 1209 00:50:40,610 --> 00:50:43,170 is a combination of whether interacted 1210 00:50:43,170 --> 00:50:45,650 and all sorts of noise that we've just discussed. 1211 00:50:45,650 --> 00:50:49,430 So the effects that we observe are detected in experiment one 1212 00:50:49,430 --> 00:50:51,260 or detected in experiment two. 1213 00:50:51,260 --> 00:50:54,450 The cause is, did it interact or not? 1214 00:50:54,450 --> 00:50:56,770 So the cause is hidden, the effects are observed. 1215 00:50:59,959 --> 00:51:01,750 Now, in the case we were looking at before, 1216 00:51:01,750 --> 00:51:03,166 we treated all these probabilities 1217 00:51:03,166 --> 00:51:04,042 as being independent. 1218 00:51:04,042 --> 00:51:06,000 But we might know something about the structure 1219 00:51:06,000 --> 00:51:08,360 of our experiments, the kinds of experiments we're doing, 1220 00:51:08,360 --> 00:51:10,401 that might lead us to have a different structure. 1221 00:51:10,401 --> 00:51:14,900 So we could have an interaction that gives rise 1222 00:51:14,900 --> 00:51:16,529 to all different kinds of data. 1223 00:51:16,529 --> 00:51:18,570 But depending on whether the protein's a membrane 1224 00:51:18,570 --> 00:51:20,180 protein or highly expressed, it might 1225 00:51:20,180 --> 00:51:22,590 influence the results of certain experiments 1226 00:51:22,590 --> 00:51:26,370 and not influence the results of others, right? 1227 00:51:26,370 --> 00:51:28,200 So like a two-hybrid would be very biased 1228 00:51:28,200 --> 00:51:29,330 by which one of these? 1229 00:51:32,320 --> 00:51:33,550 The membrane, right? 1230 00:51:33,550 --> 00:51:35,810 And then the affinity capture mass spec 1231 00:51:35,810 --> 00:51:37,310 could be very influenced by proteins 1232 00:51:37,310 --> 00:51:41,409 that are expressed at very high levels or very low levels. 
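For reference, the general statement behind this is that a Bayesian network factors the joint distribution over variables $x_1, \ldots, x_N$ according to the graph,

$$ P(x_1, \ldots, x_N) \;=\; \prod_{i=1}^{N} P\big(x_i \mid \mathrm{Pa}(x_i)\big), $$

where $\mathrm{Pa}(x_i)$ denotes the parents of node $x_i$. For binary variables this replaces the $2^N - 1$ entries of the full joint table with one small conditional table per node, $\sum_i 2^{|\mathrm{Pa}(x_i)|}$ parameters in total.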
1233 00:51:41,409 --> 00:51:43,700 If we assume that all the observations are independent,
1234 00:51:43,700 --> 00:51:44,705 then we multiply probabilities.
1235 00:51:44,705 --> 00:51:46,329 And we'll go into more detail, but this
1236 00:51:46,329 --> 00:51:48,530 is what we're looking at up until now.
1237 00:51:48,530 --> 00:51:52,425 In cases where we believe that all the observations are not
1238 00:51:52,425 --> 00:51:53,800 independent, then we're not going
1239 00:51:53,800 --> 00:51:54,880 to simply multiply things.
1240 00:51:54,880 --> 00:51:56,380 We'll see there's a more precise way
1241 00:51:56,380 --> 00:51:59,424 of computing the probabilities.
1242 00:51:59,424 --> 00:52:01,590 Now in this case, I've drawn the graphical structure
1243 00:52:01,590 --> 00:52:04,280 because I believe that I know what's going on.
1244 00:52:04,280 --> 00:52:06,280 But in the more general case that we'll look at,
1245 00:52:06,280 --> 00:52:08,363 we'll actually derive the structure from the data.
1246 00:52:11,110 --> 00:52:13,520 One of the nice things about Bayesian networks
1247 00:52:13,520 --> 00:52:15,850 is that it removes the need to have all 2 to the N
1248 00:52:15,850 --> 00:52:19,350 minus 1 possible parameters, because it tells us there
1249 00:52:19,350 --> 00:52:21,550 are certain independence conditions.
1250 00:52:21,550 --> 00:52:25,470 So a node is independent of its ancestors given its parents.
1251 00:52:25,470 --> 00:52:26,795 What does that mean?
1252 00:52:26,795 --> 00:52:28,920 If I'm trying to reason about the expression of one
1253 00:52:28,920 --> 00:52:32,120 of the genes down here, and I know that this transcription
1254 00:52:32,120 --> 00:52:35,717 factor is on, I don't really care
1255 00:52:35,717 --> 00:52:37,800 what the probability is that any particular parent
1256 00:52:37,800 --> 00:52:39,690 of that transcription factor is on, right?
1257 00:52:39,690 --> 00:52:42,670 So I don't need to know anything about transcription factor B1
1258 00:52:42,670 --> 00:52:43,880 if I know the state of B2.
1259 00:52:43,880 --> 00:52:45,990 If this is on, then that's the only thing
1260 00:52:45,990 --> 00:52:49,450 that's going to affect whether it's turning on these genes,
1261 00:52:49,450 --> 00:52:52,370 regardless of what the activation state of its parent
1262 00:52:52,370 --> 00:52:53,600 was.
1263 00:52:53,600 --> 00:52:54,470 Is that clear?
1264 00:52:54,470 --> 00:52:55,367 Yes.
1265 00:52:55,367 --> 00:52:57,020 AUDIENCE: The slide's saying TF B1.
1266 00:52:57,020 --> 00:52:59,720 [INAUDIBLE] TF B2?
1267 00:52:59,720 --> 00:53:00,522 It says TF A1.
1268 00:53:00,522 --> 00:53:01,480 PROFESSOR: Yeah, sorry.
1269 00:53:01,480 --> 00:53:02,854 That should say TF B1.
1270 00:53:07,070 --> 00:53:07,570 Thank you.
1271 00:53:11,590 --> 00:53:12,090 OK.
1272 00:53:12,090 --> 00:53:13,670 So we'll do a little example.
1273 00:53:13,670 --> 00:53:16,130 It's admission season both for graduate school
1274 00:53:16,130 --> 00:53:16,880 and undergraduate.
1275 00:53:16,880 --> 00:53:19,350 So let's do a little toy example where
1276 00:53:19,350 --> 00:53:21,600 we're going to get rid of the admissions committees
1277 00:53:21,600 --> 00:53:22,974 and just do automated admissions.
1278 00:53:26,630 --> 00:53:29,659 So we're going to collect various data about students,
1279 00:53:29,659 --> 00:53:31,700 and then we're going to build a Bayesian network.
1280 00:53:31,700 --> 00:53:33,850 And that network is going to decide
1281 00:53:33,850 --> 00:53:36,649 whether to admit students, in this simplified version.
1282 00:53:36,649 --> 00:53:38,940 And the only information that will go into our decision
1283 00:53:38,940 --> 00:53:44,380 will be the grades on the transcript and the GREs.
1284 00:53:44,380 --> 00:53:46,740 Hopefully that's not the case.
1285 00:53:46,740 --> 00:53:49,070 And we believe that certain things
1286 00:53:49,070 --> 00:53:51,880 influence your grades and your GREs.
1287 00:53:51,880 --> 00:53:53,410 Whether or not the student is smart
1288 00:53:53,410 --> 00:53:54,850 certainly should have some influence,
1289 00:53:54,850 --> 00:53:56,683 but also the grade inflation at their school
1290 00:53:56,683 --> 00:53:59,610 will have some influence.
1291 00:53:59,610 --> 00:54:02,580 So a prediction problem in a Bayesian network
1292 00:54:02,580 --> 00:54:06,180 is going from the causes to the effects.
1293 00:54:06,180 --> 00:54:09,010 So if I want to predict whether a student's admitted,
1294 00:54:09,010 --> 00:54:10,870 I only need to look upstream.
1295 00:54:10,870 --> 00:54:15,110 So we want to predict-- we observe the things on the top.
1296 00:54:15,110 --> 00:54:16,570 Say, grades and GREs, and we want
1297 00:54:16,570 --> 00:54:19,260 to predict whether this student should be admitted or not.
1298 00:54:19,260 --> 00:54:21,780 There's another problem called an inference problem, which
1299 00:54:21,780 --> 00:54:23,970 is when we observe the effect and we
1300 00:54:23,970 --> 00:54:28,460 want to make inferences about the causes.
1301 00:54:28,460 --> 00:54:30,950 So an example of that would be, you apply for an internship
1302 00:54:30,950 --> 00:54:34,190 and they say, oh, she's a student at MIT.
1303 00:54:34,190 --> 00:54:35,080 I bet she's smart.
1304 00:54:35,080 --> 00:54:35,580 Right?
1305 00:54:35,580 --> 00:54:38,964 They're doing an inference problem.
1306 00:54:38,964 --> 00:54:41,630 We'll leave it for you to decide whether you and your colleagues
1307 00:54:41,630 --> 00:54:44,500 are as smart as everyone thinks, but hopefully you are.
1308 00:54:44,500 --> 00:54:45,010 OK.
1309 00:54:45,010 --> 00:54:46,540 So we've got these two different kinds of problems.
1310 00:54:46,540 --> 00:54:48,581 We've got prediction problems from top to bottom,
1311 00:54:48,581 --> 00:54:50,850 and inference problems from bottom to top.
1312 00:54:54,741 --> 00:54:56,990 And we're going to talk about conditional probability.
1313 00:54:56,990 --> 00:54:59,156 So if I've got some very small piece of this network
1314 00:54:59,156 --> 00:55:02,030 with just two nodes, I could write out
1315 00:55:02,030 --> 00:55:06,130 all the possible probabilities for any pair of those nodes.
1316 00:55:06,130 --> 00:55:09,600 So the probability that a student is not smart
1317 00:55:09,600 --> 00:55:12,300 and that student has low grades, the probability
1318 00:55:12,300 --> 00:55:14,970 that the student is not smart and the student has
1319 00:55:14,970 --> 00:55:18,090 good grades, and so on, for all possible pairwise combinations.
1320 00:55:18,090 --> 00:55:20,510 Or I could write this as a conditional probability, which
1321 00:55:20,510 --> 00:55:22,860 tends to be an easier way to think about the problem.
1322 00:55:22,860 --> 00:55:25,430 What's the conditional probability of a student
1323 00:55:25,430 --> 00:55:29,920 being smart given that they've got good grades
1324 00:55:29,920 --> 00:55:33,230 or given that they have bad grades?
1325 00:55:33,230 --> 00:55:35,800 They have the same information.
1326 00:55:35,800 --> 00:55:38,020 For this one, I need additional information
1327 00:55:38,020 --> 00:55:41,845 about the total probability of students being smart or not.
1328 00:55:41,845 --> 00:55:43,720 And the total number of variables, as I said,
1329 00:55:43,720 --> 00:55:44,845 in either case is the same.
1330 00:55:44,845 --> 00:55:46,690 So these are completely interchangeable,
1331 00:55:46,690 --> 00:55:49,495 but it's a lot easier to reason with conditional probabilities
1332 00:55:49,495 --> 00:55:51,120 than with the joint probability tables.
1333 00:55:51,120 --> 00:55:54,067 Those we'll see in a second.
1334 00:55:54,067 --> 00:55:56,400 So as I've said, you don't need a full probability table
1335 00:55:56,400 --> 00:55:57,358 for a Bayesian network.
1336 00:55:57,358 --> 00:55:59,430 You don't need 2 to the N minus 1 variables.
1337 00:55:59,430 --> 00:56:00,929 And the fundamental reason for that
1338 00:56:00,929 --> 00:56:02,470 is that the joint probability is only
1339 00:56:02,470 --> 00:56:04,430 going to depend on the parents.
1340 00:56:04,430 --> 00:56:08,224 So in this toy example, the GRE scores over here
1341 00:56:08,224 --> 00:56:09,765 are not dependent on grade inflation.
1342 00:56:15,310 --> 00:56:17,580 Now, that all hopefully makes sense.
1343 00:56:17,580 --> 00:56:18,080 Questions?
1344 00:56:21,220 --> 00:56:23,152 Bayesian networks get a little murky next,
1345 00:56:23,152 --> 00:56:25,110 so I'm going to try to give you some intuition-- oh, yes.
1346 00:56:25,110 --> 00:56:26,358 Question, please.
1347 00:56:26,358 --> 00:56:29,844 AUDIENCE: You said that the parents don't affect
1348 00:56:29,844 --> 00:56:33,828 their children, but if grade inflation affects the grades,
1349 00:56:33,828 --> 00:56:37,314 how does that influence-- will that
1350 00:56:37,314 --> 00:56:39,320 influence the grade [INAUDIBLE]?
1351 00:56:39,320 --> 00:56:41,932 PROFESSOR: Sorry, can you say the question again?
1352 00:56:41,932 --> 00:56:43,390 AUDIENCE: I guess I'm just confused
1353 00:56:43,390 --> 00:56:46,140 by this particular example.
1354 00:56:46,140 --> 00:56:48,020 What do you mean by the joint probability?
1355 00:56:48,020 --> 00:56:49,430 The joint probability of what?
1356 00:56:52,840 --> 00:56:55,040 PROFESSOR: So if I want to figure out
1357 00:56:55,040 --> 00:56:57,910 the probability of some particular configuration of all
1358 00:56:57,910 --> 00:57:01,130 the nodes in my network, I don't necessarily
1359 00:57:01,130 --> 00:57:03,670 need to consider all possibilities.
1360 00:57:03,670 --> 00:57:05,970 Because for example, if I want to consider
1361 00:57:05,970 --> 00:57:07,560 all of the joint probability samples
1362 00:57:07,560 --> 00:57:11,080 with settings for the GREs, whether the student had
1363 00:57:11,080 --> 00:57:13,220 good GRE scores or not, that's not
1364 00:57:13,220 --> 00:57:18,350 going to be influenced by the student's school's grade
1365 00:57:18,350 --> 00:57:21,338 inflation policies.
1366 00:57:21,338 --> 00:57:23,730 AUDIENCE: But wouldn't the grades be influenced by the--
1367 00:57:23,730 --> 00:57:25,188 PROFESSOR: But the grades would be.
1368 00:57:25,188 --> 00:57:25,910 That's right.
1369 00:57:25,910 --> 00:57:27,920 So some of the variables I can remove 1370 00:57:27,920 --> 00:57:30,507 and others-- some of the joint probability statements 1371 00:57:30,507 --> 00:57:32,340 I don't need to worry about and others I do. 1372 00:57:32,340 --> 00:57:33,840 And which ones I need to consider 1373 00:57:33,840 --> 00:57:35,381 is determined by the graph structure. 1374 00:57:37,790 --> 00:57:38,300 Yes. 1375 00:57:38,300 --> 00:57:40,450 AUDIENCE: How is the graph structure determined? 1376 00:57:40,450 --> 00:57:41,033 PROFESSOR: OK. 1377 00:57:41,033 --> 00:57:43,010 So how is the graph structure determined? 1378 00:57:43,010 --> 00:57:45,050 So it's determined in one of two ways. 1379 00:57:45,050 --> 00:57:48,650 I can draw it in advance because I believe that I know something 1380 00:57:48,650 --> 00:57:51,630 about my setting, I believe that these data are independent. 1381 00:57:51,630 --> 00:57:55,090 Then it has that structure like this. 1382 00:57:55,090 --> 00:57:57,275 Cause and a bunch of independent effects. 1383 00:58:01,560 --> 00:58:06,510 Or perhaps I claim to know that actually two of these things 1384 00:58:06,510 --> 00:58:10,360 have a common parent as well. 1385 00:58:10,360 --> 00:58:12,320 In some cases I know. 1386 00:58:12,320 --> 00:58:14,580 We'll also talk about how to learn the structure 1387 00:58:14,580 --> 00:58:16,770 from the data, which is the more common setting 1388 00:58:16,770 --> 00:58:17,964 in regulatory networks. 1389 00:58:17,964 --> 00:58:19,380 So in these kinds of problems when 1390 00:58:19,380 --> 00:58:21,180 trying to decide how to integrate 1391 00:58:21,180 --> 00:58:23,829 different proteomic data sets, typically people 1392 00:58:23,829 --> 00:58:25,870 make arbitrary decisions about what the structure 1393 00:58:25,870 --> 00:58:28,640 is based on their knowledge of the system. 1394 00:58:28,640 --> 00:58:31,750 But if you're trying to figure out de novo which proteins 1395 00:58:31,750 --> 00:58:34,126 interact with which, which proteins regulate which genes, 1396 00:58:34,126 --> 00:58:35,791 then you have to learn it from the data. 1397 00:58:35,791 --> 00:58:37,974 And we'll talk about how to do that in a second. 1398 00:58:37,974 --> 00:58:38,640 Great questions. 1399 00:58:38,640 --> 00:58:39,780 Any other questions? 1400 00:58:39,780 --> 00:58:41,413 Anything in the quiet half of the room? 1401 00:58:46,510 --> 00:58:47,010 OK. 1402 00:58:47,010 --> 00:58:49,590 So as I said, this part of it, I think 1403 00:58:49,590 --> 00:58:51,170 you can usually come up with cases 1404 00:58:51,170 --> 00:58:53,296 that give you fairly good intuition. 1405 00:58:53,296 --> 00:58:55,670 One of the things that is true in these Bayesian networks 1406 00:58:55,670 --> 00:58:58,370 which most people find a little bit surprising at first 1407 00:58:58,370 --> 00:59:00,950 is something called explaining away. 1408 00:59:00,950 --> 00:59:04,050 So let's look at this Bayesian network. 1409 00:59:04,050 --> 00:59:06,180 I go outside and I detect that things 1410 00:59:06,180 --> 00:59:08,827 are slippery on the grass. 1411 00:59:08,827 --> 00:59:10,410 So that could be for a lot of reasons, 1412 00:59:10,410 --> 00:59:13,251 but one possible reason is that the grass is wet. 1413 00:59:13,251 --> 00:59:13,750 OK. 1414 00:59:13,750 --> 00:59:15,541 What are the causes of the grass being wet? 1415 00:59:15,541 --> 00:59:17,409 Well, it could have rained or the sprinklers 1416 00:59:17,409 --> 00:59:18,200 might have been on. 
1417 00:59:20,720 --> 00:59:23,300 And depending on this as an example-- so
1418 00:59:23,300 --> 00:59:26,320 a lot of the Bayesian network formalism was developed at UCLA
1419 00:59:26,320 --> 00:59:29,050 by Judea Pearl and colleagues.
1420 00:59:29,050 --> 00:59:32,090 And of course, in California it doesn't rain that often.
1421 00:59:32,090 --> 00:59:34,890 So there the season is a strong determiner of these things.
1422 00:59:34,890 --> 00:59:36,700 Not so much around here.
1423 00:59:36,700 --> 00:59:38,955 So in this example that they like to use,
1424 00:59:38,955 --> 00:59:40,330 does the probability that it's
1425 00:59:40,330 --> 00:59:44,946 raining depend on whether the sprinkler is on or not?
1426 00:59:44,946 --> 00:59:46,890 Now, the answer should be no, right?
1427 00:59:46,890 --> 00:59:52,310 I mean, in reality, when you think about-- there's
1428 00:59:52,310 --> 00:59:54,800 no causal relationship between the sprinkler being on
1429 00:59:54,800 --> 00:59:56,390 and the rain.
1430 00:59:56,390 --> 00:59:59,650 But in fact, when we're reasoning over these networks,
1431 00:59:59,650 --> 01:00:00,820 our belief about one actually is influenced by the other.
1432 01:00:03,620 --> 01:00:07,470 In a probabilistic model, if I know that it's raining,
1433 01:00:07,470 --> 01:00:09,774 and I know the grass is wet, then what
1434 01:00:09,774 --> 01:00:11,440 do I think about the sprinkler being on?
1435 01:00:11,440 --> 01:00:13,100 Do I think it's just as likely?
1436 01:00:13,100 --> 01:00:14,600 No, I think it's less likely, right?
1437 01:00:14,600 --> 01:00:17,058 If I go outside and see the grass is wet, there are clouds,
1438 01:00:17,058 --> 01:00:20,510 the rain is coming down, is the sprinkler
1439 01:00:20,510 --> 01:00:21,890 likely to be on or not?
1440 01:00:21,890 --> 01:00:23,930 It's likely to be off, right?
1441 01:00:23,930 --> 01:00:27,380 So there's no causal relationship,
1442 01:00:27,380 --> 01:00:29,880 but there's the probabilistic relationship through the graph
1443 01:00:29,880 --> 01:00:30,380 structure.
1444 01:00:30,380 --> 01:00:32,070 And that's called explaining away.
1445 01:00:32,070 --> 01:00:34,820 And you can take a whole course on how to understand
1446 01:00:34,820 --> 01:00:37,530 which relationships you can detect and which not.
1447 01:00:37,530 --> 01:00:40,410 This is not the place to try to go into that,
1448 01:00:40,410 --> 01:00:42,510 but I hope you'll be familiar with this problem.
1449 01:00:42,510 --> 01:00:44,490 And I'll try to give you a toy example that
1450 01:00:44,490 --> 01:00:47,180 makes it a little bit more obvious in terms
1451 01:00:47,180 --> 01:00:50,210 of the equations where this comes from.
1452 01:00:50,210 --> 01:00:57,560 So imagine this very silly game that we play, where we toss coins.
1453 01:00:57,560 --> 01:00:59,820 We toss a coin twice.
1454 01:00:59,820 --> 01:01:02,360 And if it turns up heads both times, you get a point.
1455 01:01:02,360 --> 01:01:04,631 If it turns up tails both times, you get a point.
1456 01:01:04,631 --> 01:01:07,256 But if one's a head and one's a tail, you don't get any points.
1457 01:01:10,260 --> 01:01:15,330 Now, does the probability that I tossed a head on the first time
1458 01:01:15,330 --> 01:01:19,330 depend on whether I toss a tail on the second time?
1459 01:01:19,330 --> 01:01:21,230 So causally, obviously not, right?
1460 01:01:21,230 --> 01:01:24,280 First of all, it happened earlier in time.
1461 01:01:24,280 --> 01:01:28,590 And secondly, the coin tosses are completely independent.
1462 01:01:28,590 --> 01:01:31,210 But what happens when I know the outcome? 1463 01:01:31,210 --> 01:01:34,790 What if I know what score you got? 1464 01:01:34,790 --> 01:01:40,422 So if I know your score, then is the probability 1465 01:01:40,422 --> 01:01:42,130 that I tossed the heads on the first time 1466 01:01:42,130 --> 01:01:44,100 independent of whether I got a tail on the second time? 1467 01:01:44,100 --> 01:01:44,850 What do you think? 1468 01:01:44,850 --> 01:01:47,400 How many people think it is independent then? 1469 01:01:47,400 --> 01:01:49,410 How many people think it's not independent. 1470 01:01:49,410 --> 01:01:49,910 Very good. 1471 01:01:49,910 --> 01:01:51,300 It's not independent. 1472 01:01:51,300 --> 01:01:54,520 And obviously, here's the math to prove it, 1473 01:01:54,520 --> 01:01:56,970 but your intuition does the same thing. 1474 01:01:56,970 --> 01:02:00,450 So what's the probability that I tossed a head 1475 01:02:00,450 --> 01:02:02,510 on the second time given that I got a one, 1476 01:02:02,510 --> 01:02:08,270 I scored, and I tossed a tail on the first time? 1477 01:02:08,270 --> 01:02:10,880 Obviously, it's zero, right? 1478 01:02:10,880 --> 01:02:14,570 So here's the probability of getting 1479 01:02:14,570 --> 01:02:17,810 a head in the first time and scoring one, 1480 01:02:17,810 --> 01:02:20,430 and tails on the second time is exactly zero. 1481 01:02:20,430 --> 01:02:22,270 So that's called explaining away. 1482 01:02:22,270 --> 01:02:27,050 You can reduce your belief in certain parents 1483 01:02:27,050 --> 01:02:30,490 based on what you know about the children. 1484 01:02:30,490 --> 01:02:32,760 Think of this coin toss example or the rain 1485 01:02:32,760 --> 01:02:35,620 in California and the sprinklers. 1486 01:02:35,620 --> 01:02:36,120 All right. 1487 01:02:36,120 --> 01:02:37,495 So as this come up several times, 1488 01:02:37,495 --> 01:02:39,690 how do we obtain the Bayesian network structure? 1489 01:02:39,690 --> 01:02:41,680 There are two problems that we need to be able to solve. 1490 01:02:41,680 --> 01:02:43,540 We need to be able to learn the structure, 1491 01:02:43,540 --> 01:02:48,000 and we need to be able to learn these probability tables. 1492 01:02:48,000 --> 01:02:50,540 If we know structure, how do we get the probabilities? 1493 01:02:50,540 --> 01:02:53,350 Well, we need to identify some objective function we're 1494 01:02:53,350 --> 01:02:56,367 going to try to optimize, and then choose values 1495 01:02:56,367 --> 01:02:57,950 for all probability distributions that 1496 01:02:57,950 --> 01:02:59,390 optimize that objective function. 1497 01:02:59,390 --> 01:03:00,550 And that's the kind of thing we've 1498 01:03:00,550 --> 01:03:02,310 been doing all along, just like in the Gibbs sampler. 1499 01:03:02,310 --> 01:03:04,740 We need some objective function or protein structure. 1500 01:03:04,740 --> 01:03:06,690 We need some objective function that we're 1501 01:03:06,690 --> 01:03:07,731 going to try to optimize. 1502 01:03:07,731 --> 01:03:10,090 So there are two common ones that are used a lot. 1503 01:03:10,090 --> 01:03:14,380 There's maximum likelihood and the maximum posterior. 
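Written in symbols, with D for the training data, theta for the full set of conditional probability parameters, and G for the network structure, those two objectives are

$$ \hat{\theta}_{\mathrm{ML}} = \arg\max_{\theta} P(D \mid \theta, G), \qquad \hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} \frac{P(D \mid \theta, G)\, P(\theta \mid G)}{P(D \mid G)}. $$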
1504 01:03:14,380 --> 01:03:18,176 So maximum likelihood is defined as the set of param-- theta 1505 01:03:18,176 --> 01:03:20,550 is all the parameters, all the probability distributions, 1506 01:03:20,550 --> 01:03:23,090 the probability of getting a score of one given that you had 1507 01:03:23,090 --> 01:03:25,070 heads and tails, whatever it may be. 1508 01:03:25,070 --> 01:03:26,810 The probability of getting admitted 1509 01:03:26,810 --> 01:03:29,650 given that you had certain GREs and certain grades. 1510 01:03:29,650 --> 01:03:34,010 So we want to find the set of parameters, 1511 01:03:34,010 --> 01:03:35,800 all those probability distributions, that 1512 01:03:35,800 --> 01:03:36,730 maximize this. 1513 01:03:36,730 --> 01:03:39,870 The probability of the data, our training data, 1514 01:03:39,870 --> 01:03:42,920 given those parameters. 1515 01:03:42,920 --> 01:03:44,580 That's a pretty obvious one. 1516 01:03:44,580 --> 01:03:47,620 And the maximum posterior includes some of our beliefs 1517 01:03:47,620 --> 01:03:49,610 about the prior probability of the data 1518 01:03:49,610 --> 01:03:52,740 and the prior probability of the parameters. 1519 01:03:52,740 --> 01:03:54,250 This is a little bit less intuitive 1520 01:03:54,250 --> 01:03:55,440 because you have to ask, well, where 1521 01:03:55,440 --> 01:03:56,565 do those numbers come from? 1522 01:03:56,565 --> 01:04:00,170 And that, again, is a whole course unto itself. 1523 01:04:00,170 --> 01:04:00,670 OK. 1524 01:04:00,670 --> 01:04:02,110 Now, how do you find these parameters? 1525 01:04:02,110 --> 01:04:03,850 Again, it's the kinds of search problems 1526 01:04:03,850 --> 01:04:06,620 that we've looked at before, various kinds of hill climbing. 1527 01:04:06,620 --> 01:04:09,690 So gradient descent, expectation maximization, 1528 01:04:09,690 --> 01:04:12,080 Gibbs sampling, which you've looked at explicitly. 1529 01:04:12,080 --> 01:04:14,140 And again, the full details of how to do that 1530 01:04:14,140 --> 01:04:15,850 are outside of our scope today. 1531 01:04:15,850 --> 01:04:16,350 OK. 1532 01:04:16,350 --> 01:04:19,490 So in our example of this coin toss game, 1533 01:04:19,490 --> 01:04:21,740 we would use one of these two functions 1534 01:04:21,740 --> 01:04:25,800 to try to decide what's the probability of getting 1535 01:04:25,800 --> 01:04:29,000 heads or tails for any given score. 1536 01:04:29,000 --> 01:04:30,931 That's what the kinds of parameters are. 1537 01:04:34,302 --> 01:04:35,760 Now, the structure problem actually 1538 01:04:35,760 --> 01:04:37,260 turns out to be really, really hard, 1539 01:04:37,260 --> 01:04:40,600 because there are a very exponentially large number 1540 01:04:40,600 --> 01:04:42,890 of potential structures to draw from. 1541 01:04:42,890 --> 01:04:47,410 And unless you've got some prior knowledge, 1542 01:04:47,410 --> 01:04:51,070 it can be impossible, depending on how much data you have, 1543 01:04:51,070 --> 01:04:53,060 to actually build this structure. 1544 01:04:53,060 --> 01:04:56,697 So there are many algorithms that have been proposed. 1545 01:04:56,697 --> 01:04:58,280 And a lot of our settings, we're going 1546 01:04:58,280 --> 01:04:59,950 to use some kind of prior knowledge 1547 01:04:59,950 --> 01:05:01,422 to reduce the search space. 
1548 01:05:01,422 --> 01:05:03,880 So if we're trying to talk about transcriptional regulatory 1549 01:05:03,880 --> 01:05:06,240 networks, it's very common to assume that there are only 1550 01:05:06,240 --> 01:05:10,325 some kinds of nodes that can be causes and other kinds of nodes 1551 01:05:10,325 --> 01:05:11,450 that can be effects, right? 1552 01:05:11,450 --> 01:05:13,550 So gene expression would be the effect, 1553 01:05:13,550 --> 01:05:15,780 and then you would limit your causes 1554 01:05:15,780 --> 01:05:17,270 to only be transcription factors, 1555 01:05:17,270 --> 01:05:19,519 or maybe signaling molecules or something 1556 01:05:19,519 --> 01:05:22,040 like that, and not allow all 20,000 genes to be causes 1557 01:05:22,040 --> 01:05:26,800 and all 20,000 genes to be effects. 1558 01:05:26,800 --> 01:05:29,301 So there are a lot of resources to learn more 1559 01:05:29,301 --> 01:05:30,300 about Bayesian networks. 1560 01:05:30,300 --> 01:05:33,530 As I said, you can have whole courses on this. 1561 01:05:33,530 --> 01:05:36,299 I think there are a lot of good tutorials at this website. 1562 01:05:36,299 --> 01:05:38,590 I've also put in the notes a little toy example for you 1563 01:05:38,590 --> 01:05:42,880 to work through all the probabilities, which I think, 1564 01:05:42,880 --> 01:05:45,290 in the interest of time, we won't go through in detail. 1565 01:05:49,530 --> 01:05:50,030 All right. 1566 01:05:50,030 --> 01:05:52,490 So to motivate what we're going to do in the next lecture, 1567 01:05:52,490 --> 01:05:54,440 I just want to talk about other kinds of data 1568 01:05:54,440 --> 01:05:57,140 that you could bring to bear on this problem of predicting 1569 01:05:57,140 --> 01:05:58,404 which proteins interact. 1570 01:05:58,404 --> 01:05:59,820 We'll see, then, how that gets fed 1571 01:05:59,820 --> 01:06:01,856 into an interaction Bayesian network 1572 01:06:01,856 --> 01:06:02,855 to make the predictions. 1573 01:06:05,660 --> 01:06:08,084 So we've talked about affinity capture and two-hybrid, 1574 01:06:08,084 --> 01:06:09,500 but what other kinds of data could 1575 01:06:09,500 --> 01:06:12,130 we use to predict the probability of interaction? 1576 01:06:12,130 --> 01:06:14,930 Well, one thing you could use would be gene expression data. 1577 01:06:14,930 --> 01:06:17,430 And the idea is that if two proteins interact, 1578 01:06:17,430 --> 01:06:20,339 they should be present in the cell at the same time, right? 1579 01:06:20,339 --> 01:06:21,880 So we talked about this a little bit. 1580 01:06:21,880 --> 01:06:23,380 If they're anti-correlated, it seems 1581 01:06:23,380 --> 01:06:24,670 very unlikely they interact. 1582 01:06:24,670 --> 01:06:26,950 What about if they're correlated, but not perfectly 1583 01:06:26,950 --> 01:06:27,970 correlated? 1584 01:06:27,970 --> 01:06:33,290 So here's a plot that shows a histogram of proteins that 1585 01:06:33,290 --> 01:06:37,090 are known to interact and proteins that are known not to interact. 1586 01:06:37,090 --> 01:06:40,010 So the empty circles are known interacting proteins, 1587 01:06:40,010 --> 01:06:42,810 the dark circles are non-interacting proteins, 1588 01:06:42,810 --> 01:06:46,680 and the other ones are based on the experimental data. 1589 01:06:46,680 --> 01:06:49,020 And the distance here is the difference 1590 01:06:49,020 --> 01:06:50,300 between expression profiles.
1591 01:06:50,300 --> 01:06:52,930 And we'll talk in a coming lecture about exactly how to compute 1592 01:06:52,930 --> 01:06:55,380 distance between expression profiles. 1593 01:06:55,380 --> 01:06:57,700 But the further to the right it is, the less similar 1594 01:06:57,700 --> 01:07:01,004 the expression profiles are across large data sets. 1595 01:07:01,004 --> 01:07:02,420 So what you see is that the interacting 1596 01:07:02,420 --> 01:07:04,740 proteins tend to be shifted more to the left, with more 1597 01:07:04,740 --> 01:07:08,560 similar expression profiles than the non-interacting ones. 1598 01:07:08,560 --> 01:07:11,637 But what do you notice about this? 1599 01:07:11,637 --> 01:07:13,220 There's no way to draw a line and say, 1600 01:07:13,220 --> 01:07:15,450 everything to the right of this is in one class 1601 01:07:15,450 --> 01:07:17,670 and everything to the left is in another, right? 1602 01:07:17,670 --> 01:07:20,260 So by itself, it's not going to get us very far. 1603 01:07:20,260 --> 01:07:22,560 There are plenty of non-interacting proteins 1604 01:07:22,560 --> 01:07:25,440 that have very highly correlated gene expression and plenty 1605 01:07:25,440 --> 01:07:27,206 of interacting proteins that have poorly 1606 01:07:27,206 --> 01:07:28,330 correlated gene expression. 1607 01:07:28,330 --> 01:07:31,520 So it's a trend, not a rule. 1608 01:07:31,520 --> 01:07:33,290 Now, what about evolution? 1609 01:07:33,290 --> 01:07:37,930 So if I look over many, many organisms, I might expect what? 1610 01:07:37,930 --> 01:07:40,140 The proteins that interact with each other 1611 01:07:40,140 --> 01:07:43,800 are going to appear in the same species, right? 1612 01:07:43,800 --> 01:07:45,490 So let's look at these two cases. 1613 01:07:45,490 --> 01:07:48,530 We've got a bunch of-- eight different genomes. 1614 01:07:48,530 --> 01:07:52,290 And I've got gene 1 and gene 2, which I suspect might interact, 1615 01:07:52,290 --> 01:07:55,344 and gene 3 and gene 4, which I suspect might interact. 1616 01:07:55,344 --> 01:07:56,760 Now, looking at these two patterns 1617 01:07:56,760 --> 01:07:59,850 of evolution, for which pair do we have more confidence 1618 01:07:59,850 --> 01:08:00,800 that they actually interact? 1619 01:08:00,800 --> 01:08:03,049 The red one or the green one? 1620 01:08:03,049 --> 01:08:05,340 So what do we notice about the difference between them? 1621 01:08:05,340 --> 01:08:11,060 What's true of the red one compared to the green one? 1622 01:08:11,060 --> 01:08:11,827 Yeah. 1623 01:08:11,827 --> 01:08:14,160 AUDIENCE: The red one is only in one branch of the tree. 1624 01:08:14,160 --> 01:08:16,368 PROFESSOR: The red one is only in one branch of the tree 1625 01:08:16,368 --> 01:08:17,990 and the green one is scattered across. 1626 01:08:17,990 --> 01:08:19,439 So let's take a vote. 1627 01:08:19,439 --> 01:08:21,100 Do we believe that the red one is 1628 01:08:21,100 --> 01:08:23,189 better evidence of interaction or the green one 1629 01:08:23,189 --> 01:08:25,270 is better evidence of interaction? 1630 01:08:25,270 --> 01:08:27,319 Red? 1631 01:08:27,319 --> 01:08:29,520 Green? 1632 01:08:29,520 --> 01:08:33,399 Can I have an advocate of green? 1633 01:08:33,399 --> 01:08:36,020 Someone explain their rationale? 1634 01:08:36,020 --> 01:08:38,050 Anyone on the quiet side of the room? 1635 01:08:38,050 --> 01:08:40,525 All right, Ed.
1636 01:08:40,525 --> 01:08:43,990 AUDIENCE: Because red is only on one branch of the tree, 1637 01:08:43,990 --> 01:08:47,455 I'd expect that they're naturally more 1638 01:08:47,455 --> 01:08:50,425 correlated with each other. 1639 01:08:50,425 --> 01:08:53,395 They have less-- they appear together 1640 01:08:53,395 --> 01:09:00,517 in [INAUDIBLE] so I'd expect [INAUDIBLE]. 1641 01:09:00,517 --> 01:09:01,100 PROFESSOR: OK. 1642 01:09:01,100 --> 01:09:04,615 So the argument is that red only occurs in one part of the tree. 1643 01:09:04,615 --> 01:09:08,090 And so there could be a very simple explanation 1644 01:09:08,090 --> 01:09:10,560 for all the reds being in one part of the tree and not the other, 1645 01:09:10,560 --> 01:09:13,090 which would be a single gain or loss event. 1646 01:09:13,090 --> 01:09:13,590 Right? 1647 01:09:13,590 --> 01:09:16,359 Somewhere early on, perhaps here, 1648 01:09:16,359 --> 01:09:18,430 I gain those two proteins. 1649 01:09:18,430 --> 01:09:20,810 And then they're inherited throughout that part of the tree, 1650 01:09:20,810 --> 01:09:23,640 the way most genes get inherited. 1651 01:09:23,640 --> 01:09:26,050 Whereas here, we've got independent events 1652 01:09:26,050 --> 01:09:27,700 of gain and loss. 1653 01:09:27,700 --> 01:09:30,729 And at each one of these independent events, 1654 01:09:30,729 --> 01:09:32,520 we're getting them moving jointly, either 1655 01:09:32,520 --> 01:09:34,254 into or out of the genome. 1656 01:09:34,254 --> 01:09:35,670 So there's more evidence for green 1657 01:09:35,670 --> 01:09:38,819 to be interacting than red. 1658 01:09:38,819 --> 01:09:41,090 Everyone buy that? 1659 01:09:41,090 --> 01:09:44,510 Even some of the advocates of red? 1660 01:09:44,510 --> 01:09:46,979 Questions? 1661 01:09:46,979 --> 01:09:47,800 Yes. 1662 01:09:47,800 --> 01:09:51,768 AUDIENCE: Could there be a way of either objectively 1663 01:09:51,768 --> 01:09:58,216 or mathematically [INAUDIBLE] that way, 1664 01:09:58,216 --> 01:10:01,460 or is it just the reasoning [INAUDIBLE]? 1665 01:10:01,460 --> 01:10:03,430 PROFESSOR: One can do the statistics on it 1666 01:10:03,430 --> 01:10:04,509 with known ones, right? 1667 01:10:04,509 --> 01:10:06,050 I think that's probably the best way. 1668 01:10:06,050 --> 01:10:09,620 And we'll actually see that in one of these papers that 1669 01:10:09,620 --> 01:10:12,030 uses-- well, actually, now I don't 1670 01:10:12,030 --> 01:10:14,290 recall whether they use this co-evolution. 1671 01:10:14,290 --> 01:10:15,790 But yes, there are plenty of papers 1672 01:10:15,790 --> 01:10:17,190 that have actually done the statistics on that. 1673 01:10:17,190 --> 01:10:17,981 So it is supported. 1674 01:10:21,380 --> 01:10:23,520 And a related kind of question is 1675 01:10:23,520 --> 01:10:26,114 what's called the Rosetta Stone approach. 1676 01:10:26,114 --> 01:10:27,530 Unfortunately, the term Rosetta 1677 01:10:27,530 --> 01:10:29,930 gets used far too much in computational biology. 1678 01:10:29,930 --> 01:10:32,360 So this has nothing to do with the other Rosetta 1679 01:10:32,360 --> 01:10:35,230 that we've been talking about. 1680 01:10:35,230 --> 01:10:37,750 And this has to do with how often you 1681 01:10:37,750 --> 01:10:42,905 find the same pair of genes in the same genome versus split up 1682 01:10:42,905 --> 01:10:44,360 in different genomes. 1683 01:10:44,360 --> 01:10:45,870 OK.
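Here is a minimal sketch, with invented gene names and made-up numbers, of how the two evidence types just discussed might be scored for a candidate pair: a correlation between expression profiles, and a co-occurrence score between phylogenetic presence/absence profiles. The naive co-occurrence score deliberately ignores the tree, which is exactly why it cannot distinguish the red pattern (one ancient joint gain) from the green one (repeated joint gains and losses); tree-aware methods count independent events instead.

```python
import math

# --- 1) Expression similarity: Pearson correlation across conditions ---
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

expr_gene1 = [0.2, 1.5, 3.1, 0.9, 2.2, 0.1]   # made-up expression, 6 conditions
expr_gene2 = [0.4, 1.2, 2.8, 1.1, 2.0, 0.3]
print("expression correlation:", round(pearson(expr_gene1, expr_gene2), 3))

# --- 2) Phylogenetic profiles: presence (1) / absence (0) across 8 genomes ---
def cooccurrence(p, q):
    both = sum(a and b for a, b in zip(p, q))
    either = sum(a or b for a, b in zip(p, q))
    return both / either          # Jaccard similarity of the two profiles

profile_red_1 = [1, 1, 1, 1, 0, 0, 0, 0]      # together, but only in one clade
profile_red_2 = [1, 1, 1, 1, 0, 0, 0, 0]
profile_green_3 = [1, 0, 1, 0, 0, 1, 0, 1]    # together, scattered across the tree
profile_green_4 = [1, 0, 1, 0, 0, 1, 0, 1]

print("red co-occurrence:  ", cooccurrence(profile_red_1, profile_red_2))      # 1.0
print("green co-occurrence:", cooccurrence(profile_green_3, profile_green_4))  # 1.0
# Both pairs score 1.0 here even though the green pattern is stronger evidence;
# a tree-aware score would credit the green pair for multiple independent joint
# gain/loss events and the red pair for only one.
```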
1684 01:10:45,870 --> 01:10:48,380 So what we're going to look at next time, then, 1685 01:10:48,380 --> 01:10:52,430 is an approach that combines these kinds of data 1686 01:10:52,430 --> 01:10:55,100 with the physical protein interaction measurements 1687 01:10:55,100 --> 01:10:58,685 from the two-hybrid and the affinity capture mass spec, and that 1688 01:10:58,685 --> 01:11:00,560 actually uses the Bayesian networks we talked 1689 01:11:00,560 --> 01:11:03,490 about this time to predict whether two proteins are 1690 01:11:03,490 --> 01:11:05,940 likely to interact based on all of the available data: 1691 01:11:05,940 --> 01:11:09,950 the evolutionary arguments, the essentiality arguments, 1692 01:11:09,950 --> 01:11:12,780 and then the interaction data. 1693 01:11:12,780 --> 01:11:15,670 Any final questions? 1694 01:11:15,670 --> 01:11:18,120 OK, see you next time.
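As a preview of that combination step, here is a minimal naive-Bayes-style sketch of the general idea: treat each evidence source as roughly independent given whether the pair interacts, and multiply likelihood ratios into the prior odds. Every number below is invented for illustration; in practice the likelihood ratios are estimated from gold-standard sets of interacting and non-interacting pairs.

```python
# All numbers below are invented for illustration only.
prior_odds = 1 / 600   # assumed prior odds that a random protein pair interacts

# Hypothetical likelihood ratios P(feature | interacting) / P(feature | not)
likelihood_ratios = {
    "two_hybrid_positive": 50.0,
    "affinity_capture_positive": 100.0,
    "correlated_expression": 3.0,
    "similar_phylogenetic_profile": 4.0,
}

posterior_odds = prior_odds
for feature, lr in likelihood_ratios.items():
    posterior_odds *= lr          # naive (conditional independence) assumption

posterior_prob = posterior_odds / (1 + posterior_odds)
print(f"posterior probability of interaction ~= {posterior_prob:.3f}")
```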