The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: We're going to finish up a little bit from last time on gene regulatory networks, see how the different methods that we looked at compared, and then we'll dive into protein interaction networks. Were there any questions from last time? OK. Very good.

So recall that we started off with the DREAM challenge, in which they provided unlabeled data representing gene expression measurements, either for a completely synthetic case, in silico data, or for three different actual experiments: one in E. coli, one in S. cerevisiae, and one in S. aureus. For some of those, it was straight expression data under different conditions; in other cases, there were actual knock-down experiments or other kinds of perturbations.
Then they gave that data out to the community and asked people to use whatever methods they wanted to try to rediscover the gene regulatory networks automatically. With some preliminary analysis, we saw that there were a couple of main clusters of kinds of analyses that all had similar properties across these data sets. There were the Bayesian networks, which we've now discussed in two separate contexts. Then we looked at regression-based techniques and mutual-information-based techniques. And there were a bunch of other kinds of approaches, some of which actually combine multiple predictors from different kinds of algorithms together. They evaluated how well each of these did on all the different data sets.

So first, the results on the in silico data, which they're showing as an area under the precision-recall curve. Obviously, higher numbers are better here. In this first group over here are the regression-based techniques, then mutual information, correlation, and Bayesian networks, and then "other": things that didn't fall into any of those particular categories.
"Meta" were techniques that use more than one class of prediction and then develop their own prediction based on those individual techniques. Then they defined something that they call the community prediction, in which they combine data from many of the different techniques together with their own algorithms to come up with what they call the "wisdom of the crowds." And R represents a random collection of other predictions.

You can see that on these in silico data, the performances don't differ dramatically from one another. Within each class, if you look at the best performer, they're all sort of in the same league, though obviously some of the classes do consistently better. Now, their point in their analysis is about the wisdom of the crowds: that taking all these data together, even including some of the bad ones, is beneficial. That's not the main thing I wanted to get out of these data for our purposes. But notice that for these in silico data, the area under the curve is about 30-something percent.
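As a rough sketch of how this kind of scoring works (not code from the challenge itself): the area under the precision-recall curve for a ranked list of predicted regulatory edges can be approximated by average precision. The transcription factors, targets, and gold-standard network below are invented for illustration.

```python
# Sketch: scoring a ranked list of predicted regulatory edges against a
# gold-standard network using average precision, a common estimate of
# the area under the precision-recall curve (AUPR).

def aupr(ranked_edges, true_edges):
    """Average precision: each true edge found at rank k contributes
    the precision at that rank; the sum is normalized by the number
    of true edges."""
    tp = 0
    area = 0.0
    for k, edge in enumerate(ranked_edges, start=1):
        if edge in true_edges:
            tp += 1
            area += tp / k  # precision at this recall step
    return area / len(true_edges)

# Hypothetical predictions, highest-confidence edge first
predictions = [("TF1", "geneA"), ("TF2", "geneB"), ("TF1", "geneC"),
               ("TF3", "geneD"), ("TF2", "geneE")]
gold = {("TF1", "geneA"), ("TF1", "geneC"), ("TF2", "geneE")}

print(round(aupr(predictions, gold), 3))  # → 0.756
```

A perfect ranking (all true edges first) gives 1.0, which is why the single-digit percentages on the real data discussed below are so striking.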
Now this is the first real experimental data we'll look at: the E. coli data. And notice the change of scale: the best performer is achieving less than 10% of the optimal result. So you can see that the real data are much, much harder than the in silico data. And here the performance varies quite a lot. You can see that the Bayesian networks are struggling compared to some of the other techniques; the best of those doesn't really get close to the best of some of these other approaches.

What they did next was to take some of the predictions from their community predictions, built off of all these other data, and actually test them. So they built regulatory networks for E. coli and for S. aureus, and then they did some experiments to test them. I think the results overall are kind of encouraging, in the sense that if you focus on the top pie chart here, for about half of all the things that they tested, they could get some support. In some cases, it was very strong support; in other cases, it wasn't quite as good.
So the glass is half empty or half full. But one of the interesting things is that the data are quite variable over the different predictions that they make. Each one of these circles represents a regulator and the things that they claim are targets of that regulator. Things in blue were confirmed by their experiments, and the things in blue with black outlines are the controls, which they knew would be right. You can see that for PurR, they do very well. For some of these others, they do mediocre. But there are some, which they're honest enough to admit, that they do very poorly on: they didn't get any of their predictions right for this regulator. This probably reflects the kind of data that they had, in terms of what conditions were being tested.

So, so far, things look reasonable. I think the real shocker of this paper does not appear in the abstract or the title, but it is in one of the main figures, if you pay attention. These were the results for the in silico data, where everything looked pretty good. With the change of scale for E. coli, there's some variation.
But you can still make arguments for the methods. These, though, are the results for Saccharomyces cerevisiae. This is the organism, yeast, on which most of the gene regulatory algorithms were originally developed. People actually built careers off of saying how great their algorithms were at reconstructing these regulatory networks. And when we look at these completely blinded data, where people don't know what they're looking for, you can see that the actual results are rather terrible. The area under the curve is in the single digits of percentage, and it doesn't seem to matter what algorithm they're using: they're all doing very badly. And the community predictions are no better, in some cases worse, than the individual ones. So this is really a stunning result. It's there in the data, and if you dig into the supplement, they actually explain what's going on, I think, pretty clearly.

Remember that all of these predictions are being made by looking for a transcriptional regulator whose own expression increases or decreases.
And that change in its own expression is predictive of its targets. The hypothesis is that when you have more of an activator, you'll have more of its targets coming on; when you have less of an activator, you'll have less of the targets. And you look through all the data, whether by Bayesian networks or regression, to find those kinds of relationships.

Now, what if those relationships don't actually exist in the data? That's what this chart shows. The green curves are pairs of genes that have no known regulatory relationship with each other; they're measuring the correlation, across all the data sets, between such pairs. The purple are pairs that are targets of the same transcription factor, and the orange are pairs where one is the activator or repressor of the other. In the in silico data, there's a very nice spread between the green, the orange, and the purple. The co-regulated genes are very highly correlated with each other.
The ones in parent-child relationships, a regulator and its target, have a pretty good correlation, much different from the distribution that you see for the things that are not interacting. On these data, the algorithms do their best. Then you look at the E. coli data, and you can see that in E. coli the curves are much closer to each other, but there's still some spread. But when you look at yeast, again, where a lot of these algorithms were developed, you can see there's almost no difference among the correlations for things that have no relationship to each other, things that are co-regulated by the same regulatory protein, and those parent-child relationships. They're all quite similar. And it doesn't matter whether you use correlation analysis or mutual information. Over here, in this right-hand panel, they've blown up the bottom part of the curve, and you can see how similar these are. So again, this is the mutual information spread for the in silico data, for E. coli, and then for yeast. OK.
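The two pairwise statistics being compared here can be sketched concretely. This is not the paper's code; the expression profiles of the hypothetical regulator and target below are invented, and the mutual information estimator is the crudest possible (equal-width binning).

```python
# Sketch: Pearson correlation and binned mutual information between the
# expression profiles of two genes across conditions.

import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def mutual_information(x, y, bins=2):
    """MI in bits, with equal-width binning of each profile."""
    def binned(v):
        lo, hi = min(v), max(v)
        return [min(int((a - lo) / (hi - lo) * bins), bins - 1) for a in v]
    bx, by = binned(x), binned(y)
    n = len(x)
    mi = 0.0
    for i in range(bins):
        for j in range(bins):
            pxy = sum(1 for a, b in zip(bx, by) if a == i and b == j) / n
            px, py = bx.count(i) / n, by.count(j) / n
            if pxy > 0:
                mi += pxy * math.log2(pxy / (px * py))
    return mi

# A regulator and a hypothetical target that tracks it closely
tf     = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
target = [1.1, 2.2, 2.9, 4.3, 5.1, 5.8]
print(round(pearson(tf, target), 2))         # close to 1.0
print(mutual_information(tf, target) > 0.5)  # strongly dependent
```

The point of the figure is that in the yeast data, these statistics for true regulator-target pairs look almost identical to those for unrelated pairs, so no method built on them can separate the classes.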
So what I think we can say about expression analysis is that expression data are very, very powerful for some things and rather poor for other applications. They're very powerful for classification and clustering; we saw that earlier. Now, what those clusters mean, that's the inference problem they're trying to solve now. And the expression data are not sufficient to figure out what the regulatory proteins are that are causing those sets of genes to be co-expressed, at least not in yeast. And I think there's every expectation that if you did the same thing in humans, you would get the same result.

So the critical question is: if you do want to build models of how regulation is taking place in organisms, what do you do? The answer is that you need some other kind of data. One thing you might ask, if we go back to the core of the analysis, is: what's wrong? Why is it that these gene expression levels cannot be used to predict the regulatory networks? And it comes down to whether mRNA levels are predictive of protein levels.
A couple of groups have looked into this. One of the earlier studies was this one, from 2009, where they used microarray data and looked at mRNA expression levels versus protein levels. And what do you see? You see that there is a trend: the R squared is around 0.2, but there's a huge spread. For any position on the x-axis, a particular level of mRNA, you can have a thousand-fold variation in the protein levels.

A lot of people saw this and said, well, we know there are problems with microarrays; they're not really great at quantifying mRNA levels, particularly low ones. So maybe this will all get better if we use RNA-Seq. That turns out not to be the case. There was a very careful study published in 2012, where the group used microarray data, RNA-Seq data, and a number of different ways of calling the proteomics data. You might say, well, maybe some of the problem is that you're not doing a very good job of inferring protein levels from mass spec data. So they tried a whole bunch of different ways of calling the mass spec data.
You should focus on the numbers in these columns for the average and the best correlations between the RNA data, in the columns, and the proteomic data, in the rows. And you can see that in the best-case scenario, you can get these up to a 0.54 correlation, which is still pretty weak.

So what's going on? What we've been focusing on is the idea that RNA levels are going to be very well correlated with protein levels, and I think a lot of the literature is based on hypotheses that are almost identical. But in reality, of course, there are a lot of processes involved. There's the process of translation, which has a rate associated with it and regulatory steps associated with it. And then there are degradation pathways: the RNA gets degraded at some rate, and the protein gets degraded at some rate. Sometimes those rates are regulated, sometimes they're not; sometimes it depends on the sequence. So what would happen if you actually measured what's going on?
That was done in this 2011 paper, where the group used a labeling technique for proteins to [INAUDIBLE] and measure steady-state levels of proteins, then labeled the proteins at specific times and saw how much newly synthesized protein there was at various times. Similarly for RNA, they used a technology that allowed them to separate newly synthesized transcripts from the bulk RNA. Once you have those data, you can find out what the spread is in the half-lives of proteins and in the abundance of proteins.

If you focus on the left-hand side, these are the measured half-lives for various RNAs, in blue, and proteins, in red. If you look at the spread in the red ones, you've got at least three orders of magnitude of range in protein half-lives. That's really at the heart of why RNA levels are so poorly predictive of protein levels: there's such a range in protein stability. The RNAs also spread over probably one or two orders of magnitude in stability. And then here are the abundances.
You can see that the range of abundance of proteins, in average copies per cell, is extremely large, from 100 to 10^8 copies per cell. Now, if you plot the protein half-lives against the RNA half-lives, you can see there's no correlation. These are completely independent processes that determine whether an RNA is degraded or a protein is degraded.

So when you try to figure out the relationship between RNA levels and protein levels, you really have to resort to a set of differential equations to map out what all the rates are. If you know all those rates, then you can estimate what the relationships will be. And they did exactly that. These charts show what they inferred to be the contribution of each of these components to protein levels. On the left-hand side are the results from the cells for which they had the most data, where they built a model on the same cells from which they collected the data. In these cells, the RNA levels account for about 40% of the variance in protein levels.
The biggest thing that affects the abundance of proteins is the rate of translation. They then took the model built from one set of cells and tried to use it to predict outcomes in another set of cells, in replicate, and the results are quite similar. They also did it for an entirely different cell type. In all of these cases, the precise amounts vary, but you can see that the red bars, which represent the amount of information contributed by the RNA, account for less than about half of what you can get from the other sources. So this gets back to why it's so hard to infer regulatory networks solely from RNA levels.

This is the plot they get when they compare protein levels and RNA levels experimentally. Again, you see that big spread, with an R squared of about 0.4, which at the time they were very proud of; they write several times in the article that this is the best anyone has seen to date. But if you incorporate all these other pieces of information about RNA stability and protein stability, you can actually get a very, very good correlation.
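The kind of rate model being described can be sketched as a pair of linear differential equations per gene: dR/dt = v_tx − k_rdeg·R for the mRNA and dP/dt = k_tl·R − k_pdeg·P for the protein. All the rate constants below are invented; the point is that two genes with identical mRNA dynamics can have very different protein levels once translation and degradation rates differ.

```python
# Sketch: one-mRNA, one-protein rate model with synthesis and degradation,
# integrated by simple Euler steps to (near) steady state.

def simulate(v_tx, k_rdeg, k_tl, k_pdeg, t_end=200.0, dt=0.01):
    """Euler integration of dR/dt = v_tx - k_rdeg*R,
                            dP/dt = k_tl*R - k_pdeg*P."""
    R = P = 0.0
    for _ in range(int(t_end / dt)):
        R += dt * (v_tx - k_rdeg * R)
        P += dt * (k_tl * R - k_pdeg * P)
    return R, P

# Two genes with identical mRNA dynamics but different protein turnover
R1, P1 = simulate(v_tx=2.0, k_rdeg=0.2, k_tl=1.0, k_pdeg=0.5)
R2, P2 = simulate(v_tx=2.0, k_rdeg=0.2, k_tl=1.0, k_pdeg=0.05)

print(round(R1, 1), round(R2, 1))  # same mRNA steady state in both
print(round(P1, 1), round(P2, 1))  # ~10x difference in protein level
```

At steady state, R* = v_tx/k_rdeg and P* = k_tl·R*/k_pdeg, so without knowing the per-gene translation and degradation rates, the mRNA level alone under-determines the protein level.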
So once you know the variation in protein stability and RNA stability for each and every protein and RNA, then you can do a good job of predicting protein levels from RNA levels. But without all those data, you can't. Any questions on this?

So what are we going to do then? We really have two primary options. We can try to explicitly model all of these regulatory steps, include them in our predictive models, and build up gene regulatory networks and protein models that actually incorporate all those different kinds of data; we'll see that in just a minute. The other thing we can try to do is, rather than focus on what's downstream of RNA synthesis, the protein levels, focus on what's upstream of RNA synthesis and look at what the production of RNAs, which RNAs are getting turned on and off, tells us about the signaling pathways and the transcription factors.
That's going to be the topic of one of the upcoming lectures, in which Professor Gifford will look at variations in epigenomic data and use them to identify sequences that reveal which regulatory proteins are bound under certain conditions and not others. Questions? Yeah?

AUDIENCE: In a typical experiment, for how many mRNAs or proteins can the rate constants be estimated?

PROFESSOR: The question was how many rate constants you can estimate in a typical experiment. I should say, first of all, these are not typical experiments. Very few people do this kind of analysis; it's actually very time consuming and very expensive. In this one, I'll get the numbers roughly wrong, but it was thousands, some decent fraction of the proteome, but not the entire one. Most of the data sets in papers you'll read do not include any analysis of stability rates or degradation rates. They only look at the bulk abundance of the RNAs. Other questions? OK.
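To make the rate-constant estimation concrete, here is a toy illustration (not from any of the papers discussed) of how a single degradation rate constant, and hence a half-life, could be fit from a pulse-labeling time course, assuming simple exponential decay. The time points and measurements are invented.

```python
# Sketch: fit a first-order decay rate k from labeled-molecule levels
# over time, via least squares on log(level) = log(L0) - k*t.
# Half-life then follows as t_half = ln(2) / k.

import math

def fit_decay_rate(times, levels):
    """Slope of a least-squares line through (t, log level); returns k."""
    logs = [math.log(v) for v in levels]
    n = len(times)
    mt, ml = sum(times) / n, sum(logs) / n
    slope = (sum((t - mt) * (l - ml) for t, l in zip(times, logs))
             / sum((t - mt) ** 2 for t in times))
    return -slope

# Labeled protein remaining at 0, 2, 4, 8 hours (synthetic, true k = 0.1/h)
times = [0.0, 2.0, 4.0, 8.0]
levels = [math.exp(-0.1 * t) for t in times]

k = fit_decay_rate(times, levels)
print(round(k, 3))                # 0.1 per hour
print(round(math.log(2) / k, 1))  # half-life of about 6.9 hours
```

Doing this genome-wide is exactly what makes these experiments so expensive: you need a full time course per molecule, for thousands of mRNAs and proteins.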
So in that upcoming lecture, we're going to actually try to go backwards. We're going to say, we see these changes in RNA; what does that tell us about which regulatory regions of the genome were active or not? And then you can go upstream from that and try to figure out the signaling pathways. So if I know the changes in RNA, I can deduce, as we'll see in that upcoming lecture, the sequences, the identity of the DNA-binding proteins. And then I can try to figure out what the signaling pathways were that drove those changes in gene expression.

Later in this lecture, we'll talk about the network modeling problem: assuming you knew these transcription factors, what could you do to infer this network? But before we get to that, I'd like to talk about an interesting modeling approach that tries to take into account all these degradation pathways, looks specifically at each kind of regulation as an explicit step in the model, and sees how that copes with some of these issues. So this is work from Josh Stuart. One of the first papers is here; we'll look at some later ones as well.
423 00:17:20,010 --> 00:17:22,720 And the idea here is to explicitly, as I said, 424 00:17:22,720 --> 00:17:25,130 deal with many, many different steps in regulation 425 00:17:25,130 --> 00:17:28,860 and try to be quite specific about what kinds of data 426 00:17:28,860 --> 00:17:32,750 are informing about what step in the process. 427 00:17:32,750 --> 00:17:35,120 So we measure the things in the bottom 428 00:17:35,120 --> 00:17:39,300 here-- arrays that tell us how many copies of a gene 429 00:17:39,300 --> 00:17:41,282 there are in the genome, especially in cancer. 430 00:17:41,282 --> 00:17:42,740 And you can get big changes of what 431 00:17:42,740 --> 00:17:45,510 are called copy number, amplifications, or deletions 432 00:17:45,510 --> 00:17:46,970 of large chunks of chromosomes. 433 00:17:46,970 --> 00:17:49,250 You need to take that into account. 434 00:17:49,250 --> 00:17:52,120 All the RNA-Seq and microarrays that we were talking about 435 00:17:52,120 --> 00:17:53,620 in measuring transcription levels-- 436 00:17:53,620 --> 00:17:54,870 what do they actually tell us? 437 00:17:54,870 --> 00:17:56,590 Well, they give us some information 438 00:17:56,590 --> 00:17:58,370 about what they're directly connected to. 439 00:17:58,370 --> 00:18:01,195 So the transcriptomic data tells something about the expression 440 00:18:01,195 --> 00:18:01,930 state. 441 00:18:01,930 --> 00:18:04,380 But notice they have explicitly separated the expression 442 00:18:04,380 --> 00:18:07,370 state of the RNA from the protein level. 443 00:18:07,370 --> 00:18:08,870 And they separated the protein level 444 00:18:08,870 --> 00:18:10,560 from the protein activity. 445 00:18:10,560 --> 00:18:12,657 And they have these little black boxes in here 446 00:18:12,657 --> 00:18:14,740 that represent the different kinds of regulations. 
447 00:18:14,740 --> 00:18:18,504 So however many copies of a gene you have in the genome, 448 00:18:18,504 --> 00:18:20,920 there's some regulatory event, transcriptional regulation, 449 00:18:20,920 --> 00:18:22,545 that determines how much expression you 450 00:18:22,545 --> 00:18:24,219 get at the mRNA level. 451 00:18:24,219 --> 00:18:25,760 There's another regulatory event here 452 00:18:25,760 --> 00:18:28,100 that determines at what rate those RNAs are 453 00:18:28,100 --> 00:18:29,890 turned into proteins. 454 00:18:29,890 --> 00:18:31,674 And there are other regulatory steps here 455 00:18:31,674 --> 00:18:33,340 that have to do with signaling pathways, 456 00:18:33,340 --> 00:18:35,589 for example, that determine whether those proteins are 457 00:18:35,589 --> 00:18:36,650 active or not. 458 00:18:36,650 --> 00:18:39,108 So we're going to treat each of those as separate variables 459 00:18:39,108 --> 00:18:41,490 in our model that are going to be 460 00:18:41,490 --> 00:18:43,410 connected by these black boxes. 461 00:18:46,254 --> 00:18:47,920 So they call their algorithm "Paradigm," 462 00:18:47,920 --> 00:18:49,590 and they developed it in the context 463 00:18:49,590 --> 00:18:51,059 of looking at cancer data. 464 00:18:51,059 --> 00:18:53,600 In cancer data, the two primary kinds of information they had 465 00:18:53,600 --> 00:18:58,100 were the RNA levels from either microarray or RNA-Seq and then 466 00:18:58,100 --> 00:18:59,710 these copy number variations, again, 467 00:18:59,710 --> 00:19:01,990 representing amplifications or deletions 468 00:19:01,990 --> 00:19:03,820 of chunks of the genome. 469 00:19:03,820 --> 00:19:05,780 And what they're trying to infer from that 470 00:19:05,780 --> 00:19:07,870 is how active different components are of 471 00:19:07,870 --> 00:19:10,706 known signaling pathways. 
472 00:19:10,706 --> 00:19:12,580 Now the approach that they used that involved 473 00:19:12,580 --> 00:19:14,413 all of those little black boxes is something 474 00:19:14,413 --> 00:19:16,060 called a factor graph. 475 00:19:16,060 --> 00:19:18,855 And factor graphs can be thought of in the same context 476 00:19:18,855 --> 00:19:19,730 as Bayesian networks. 477 00:19:19,730 --> 00:19:23,540 In fact, Bayesian networks are a type of factor graph. 478 00:19:23,540 --> 00:19:25,670 So if I have a Bayesian network that 479 00:19:25,670 --> 00:19:27,730 represents these three variables, where they're 480 00:19:27,730 --> 00:19:29,770 directly connected by edges, in a factor graph, 481 00:19:29,770 --> 00:19:32,700 there would be this extra kind of node-- this black box 482 00:19:32,700 --> 00:19:35,960 or red box-- that's the factor that's going to connect them. 483 00:19:35,960 --> 00:19:37,467 So what do these things do? 484 00:19:37,467 --> 00:19:39,050 Well, again, they're bipartite graphs. 485 00:19:39,050 --> 00:19:40,800 They always have these two different kinds 486 00:19:40,800 --> 00:19:43,696 of nodes-- the random variables and the factors. 487 00:19:43,696 --> 00:19:45,820 And the reason they're called factor graphs is they 488 00:19:45,820 --> 00:19:48,460 describe how the global function-- in our case, 489 00:19:48,460 --> 00:19:51,120 it's going to be the global probability distribution-- 490 00:19:51,120 --> 00:19:53,390 can be broken down into factorable components. 491 00:19:53,390 --> 00:19:56,330 These components can then be combined in a product to recover 492 00:19:56,330 --> 00:20:02,090 the global probability function. 
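To make that factorization concrete, here is a minimal Python sketch -- the three binary variables and the factor tables are invented for illustration, not taken from the paper. The global (unnormalized) function is just the product of the local factors, and normalizing it over all assignments gives the joint probability:

```python
from itertools import product

# A tiny hypothetical factor graph over three binary variables.
# Each factor touches only a subset of the variables (its scope);
# the global function is the product of all the factors.
factors = [
    (("x1", "x2"), lambda a: 2.0 if a["x1"] == a["x2"] else 1.0),   # fA(x1, x2)
    (("x2", "x3"), lambda a: 3.0 if a["x2"] and a["x3"] else 1.0),  # fB(x2, x3)
    (("x3",),      lambda a: 0.5 if a["x3"] else 1.0),              # fC(x3)
]

def global_function(assignment):
    """Product of every factor evaluated at one full assignment."""
    value = 1.0
    for _scope, f in factors:
        value *= f(assignment)
    return value

# Summing over all 2^3 assignments gives the normalizing constant,
# which turns the product of factors into a probability distribution.
Z = sum(global_function(dict(zip(("x1", "x2", "x3"), vals)))
        for vals in product((0, 1), repeat=3))
```

Dividing `global_function` by `Z` then gives the joint probability of any full assignment.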
493 00:20:02,090 --> 00:20:04,386 So if I have some global function over all 494 00:20:04,386 --> 00:20:06,760 the variables, you can think of this again, specifically, 495 00:20:06,760 --> 00:20:09,218 as the probability function-- the joint probability for all 496 00:20:09,218 --> 00:20:11,910 the variables in my system-- I want to be able to divide it 497 00:20:11,910 --> 00:20:14,230 into a product of individual terms, 498 00:20:14,230 --> 00:20:16,900 where I don't have all the variables in each of these f's. 499 00:20:16,900 --> 00:20:19,382 They're just some subset of variables. 500 00:20:19,382 --> 00:20:23,150 And each of these represents one of these terms 501 00:20:23,150 --> 00:20:25,990 in that global product. 502 00:20:25,990 --> 00:20:27,837 The only things that are in this function 503 00:20:27,837 --> 00:20:29,670 are things to which it's directly connected. 504 00:20:29,670 --> 00:20:32,540 So these edges exist solely between a factor 505 00:20:32,540 --> 00:20:35,500 and the variables that are terms in that equation. 506 00:20:35,500 --> 00:20:36,450 Is that clear? 507 00:20:43,710 --> 00:20:45,130 So in this context, the variables 508 00:20:45,130 --> 00:20:46,464 are going to be nodes. 509 00:20:46,464 --> 00:20:47,880 And their allowed values are going 510 00:20:47,880 --> 00:20:51,770 to be whether they're activated or not activated. 511 00:20:51,770 --> 00:20:54,044 The factors are going to describe the relationships 512 00:20:54,044 --> 00:20:54,960 among those variables. 513 00:20:54,960 --> 00:20:57,970 We previously saw those as being cases of regulation. 514 00:20:57,970 --> 00:20:59,689 Is the RNA turned into protein? 515 00:20:59,689 --> 00:21:00,730 Is the protein activated? 516 00:21:04,282 --> 00:21:05,740 And what we'd like to be able to do is 517 00:21:05,740 --> 00:21:07,410 compute marginal probabilities. 
518 00:21:07,410 --> 00:21:09,650 So we've got some big network that 519 00:21:09,650 --> 00:21:12,580 represents our understanding of all the signaling pathways 520 00:21:12,580 --> 00:21:15,610 and all the transcriptional regulatory networks in a cancer 521 00:21:15,610 --> 00:21:16,110 cell. 522 00:21:16,110 --> 00:21:18,830 And we want to ask about a particular pathway 523 00:21:18,830 --> 00:21:21,980 or a particular protein, what's the probability 524 00:21:21,980 --> 00:21:24,040 that this protein or this pathway is activated, 525 00:21:24,040 --> 00:21:27,669 marginalized over all the other variables? 526 00:21:27,669 --> 00:21:28,460 So that's our goal. 527 00:21:28,460 --> 00:21:30,220 Our goal is to find a way to compute 528 00:21:30,220 --> 00:21:32,844 these marginal probabilities efficiently. 529 00:21:32,844 --> 00:21:34,260 And how do you compute a marginal? 530 00:21:34,260 --> 00:21:37,430 Well, obviously you need to sum over all the configurations 531 00:21:37,430 --> 00:21:41,190 of all the variables that have your particular variable 532 00:21:41,190 --> 00:21:41,820 at its value. 533 00:21:41,820 --> 00:21:44,390 So if I want to know if MYC and MAX are active, 534 00:21:44,390 --> 00:21:47,300 I set MYC and MAX equal to active. 535 00:21:47,300 --> 00:21:49,200 And then I sum over all the configurations 536 00:21:49,200 --> 00:21:50,882 that are consistent with that. 537 00:21:50,882 --> 00:21:52,590 And in general, that would be hard to do. 538 00:21:52,590 --> 00:21:54,600 But the factor graph gives us an efficient way 539 00:21:54,600 --> 00:21:55,897 of figuring out how to do that. 540 00:21:55,897 --> 00:21:56,980 I'll show you in a second. 541 00:22:00,210 --> 00:22:01,610 So I have some global function. 542 00:22:01,610 --> 00:22:04,220 In this case, this little factor graph over here, 543 00:22:04,220 --> 00:22:07,450 this is the global function. 
544 00:22:07,450 --> 00:22:10,850 Now remember, these represent the factors, 545 00:22:10,850 --> 00:22:12,330 and they only have edges to things 546 00:22:12,330 --> 00:22:14,850 that are terms in their equations. 547 00:22:14,850 --> 00:22:18,850 So this one over here is a function of x3 and x5. 548 00:22:18,850 --> 00:22:24,920 And so it has edges to x3 and x5, and so on for all of them. 549 00:22:24,920 --> 00:22:27,730 And if I want to explicitly compute the marginal 550 00:22:27,730 --> 00:22:29,590 with respect to a particular variable, 551 00:22:29,590 --> 00:22:32,160 so the marginal with respect to x1 552 00:22:32,160 --> 00:22:36,550 set equal to a, so I'd have this function with x1 553 00:22:36,550 --> 00:22:40,900 equal to a times the sum over all possible states of x2, 554 00:22:40,900 --> 00:22:45,137 the sum over all possible states of x3, x4, and x5. 555 00:22:45,137 --> 00:22:45,720 Is that clear? 556 00:22:45,720 --> 00:22:47,428 That's just the definition of a marginal. 557 00:22:51,190 --> 00:22:54,850 They introduced a notation in factor graphs that's called 558 00:22:54,850 --> 00:22:55,985 a "not-sum." 559 00:22:55,985 --> 00:22:59,679 It's a rather terrible name-- the not-sum, or summary. 560 00:22:59,679 --> 00:23:01,220 So I like this term, summary, better. 561 00:23:01,220 --> 00:23:02,870 The summary over all the variables. 562 00:23:02,870 --> 00:23:05,540 So if I want to figure out the summary for x1, 563 00:23:05,540 --> 00:23:07,970 that's the sum over all the other variables 564 00:23:07,970 --> 00:23:09,600 of all their possible states when 565 00:23:09,600 --> 00:23:13,290 I set x1 equal to a, in this case. 566 00:23:13,290 --> 00:23:14,510 So it's purely a definition. 
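Written as code, the summary is just a brute-force enumeration over the other variables. This is a sketch with an invented global function g, purely to pin down the definition (the five-variable example on the slides works the same way):

```python
from itertools import product

def summary(g, variables, fixed_var, fixed_value):
    """The 'not-sum' for fixed_var: sum g over every configuration of
    the other (binary) variables, holding fixed_var at fixed_value."""
    others = [v for v in variables if v != fixed_var]
    total = 0.0
    for vals in product((0, 1), repeat=len(others)):
        assignment = dict(zip(others, vals))
        assignment[fixed_var] = fixed_value
        total += g(assignment)
    return total

# An invented global function over x1..x3, for illustration only.
g = lambda a: (1 + a["x1"]) * (1 + a["x2"] * a["x3"])

# Unnormalized marginal for x1 = 1: the summary for x1.
marginal_x1 = summary(g, ["x1", "x2", "x3"], "x1", 1)
```

Note that this enumeration is exponential in the number of variables, which is exactly the blow-up the tree-based computation in the lecture is designed to avoid.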
567 00:23:14,510 --> 00:23:19,140 So then I can rewrite-- and you can work this through by hand 568 00:23:19,140 --> 00:23:21,340 after class-- but I can rewrite this, 569 00:23:21,340 --> 00:23:24,220 which is this intuitive way of thinking of the marginal, 570 00:23:24,220 --> 00:23:27,840 in terms of these not-sums, where each one of these 571 00:23:27,840 --> 00:23:30,940 is over all the other variables that are not 572 00:23:30,940 --> 00:23:33,380 the one that's in the brackets. 573 00:23:33,380 --> 00:23:35,310 So that's just the definition. 574 00:23:35,310 --> 00:23:37,112 OK, this hasn't really helped us very much, 575 00:23:37,112 --> 00:23:38,570 if we don't have some efficient way 576 00:23:38,570 --> 00:23:39,820 of computing these marginals. 577 00:23:39,820 --> 00:23:41,470 And that's what the factor graph does. 578 00:23:41,470 --> 00:23:43,650 So we've got some factor graph. 579 00:23:43,650 --> 00:23:46,760 We have this representation, either 580 00:23:46,760 --> 00:23:50,020 in terms of graph or equation, of how the global function can 581 00:23:50,020 --> 00:23:51,340 be partitioned. 582 00:23:51,340 --> 00:23:54,280 Now if I take any one of these factor graphs, 583 00:23:54,280 --> 00:23:56,210 and I want to compute a marginal over a node, 584 00:23:56,210 --> 00:24:00,180 I can re-draw the factor graph so that variable of interest 585 00:24:00,180 --> 00:24:01,620 is the root node. 586 00:24:01,620 --> 00:24:02,120 Right? 587 00:24:02,120 --> 00:24:04,380 Everyone see that these two representations 588 00:24:04,380 --> 00:24:06,206 are completely equivalent? 589 00:24:06,206 --> 00:24:08,722 I've just yanked x1 up to the top. 590 00:24:08,722 --> 00:24:10,055 So now this is a tree structure. 591 00:24:12,482 --> 00:24:14,190 So this is that factor graph that we just 592 00:24:14,190 --> 00:24:15,546 saw drawn as a tree. 
593 00:24:15,546 --> 00:24:17,670 And this is what's called an expression tree, which 594 00:24:17,670 --> 00:24:19,470 is going to tell us how to compute 595 00:24:19,470 --> 00:24:22,575 the marginal over the structure of the graph. 596 00:24:25,330 --> 00:24:28,880 So this is just copied from the previous picture. 597 00:24:28,880 --> 00:24:32,650 And now we're going to come up with a program for computing 598 00:24:32,650 --> 00:24:35,930 these marginals, using this tree structure. 599 00:24:35,930 --> 00:24:39,520 So first I'm going to compute that summary function-- 600 00:24:39,520 --> 00:24:43,430 the sum over all states of the other variables for everything 601 00:24:43,430 --> 00:24:46,300 below this point, starting with the lowest point in the graph. 602 00:24:46,300 --> 00:24:48,870 And we can compute the summary function there. 603 00:24:48,870 --> 00:24:55,590 And that's this term, the summary for x3 of just this fE. 604 00:24:55,590 --> 00:25:00,440 I do the same thing for fD, the summary for it. 605 00:25:00,440 --> 00:25:02,720 And then I go up a level in the tree, 606 00:25:02,720 --> 00:25:06,327 and I multiply the summary for everything below it. 607 00:25:06,327 --> 00:25:08,410 So I'm going to compute the product of the summary 608 00:25:08,410 --> 00:25:09,510 functions. 609 00:25:09,510 --> 00:25:12,010 And I always compute the summary with respect to the parent. 610 00:25:12,010 --> 00:25:15,550 So here the parent was x3, for both of these. 611 00:25:15,550 --> 00:25:19,240 So these are summaries with respect to x3. 612 00:25:19,240 --> 00:25:20,260 Here who's the parent? 613 00:25:20,260 --> 00:25:20,760 x1. 614 00:25:20,760 --> 00:25:23,045 And so the summary is to x1. 615 00:25:27,280 --> 00:25:28,100 Yes? 616 00:25:28,100 --> 00:25:29,558 AUDIENCE: Are there directed edges? 617 00:25:29,558 --> 00:25:33,860 In the sense that in f, in the example on the right, 618 00:25:33,860 --> 00:25:37,557 is fD just relating how x4 relates to x3? 
619 00:25:37,557 --> 00:25:38,890 PROFESSOR: That's exactly right. 620 00:25:38,890 --> 00:25:44,210 So the edges represent which factor you're related to. 621 00:25:44,210 --> 00:25:47,040 So that's why I can redraw it in any way. 622 00:25:47,040 --> 00:25:49,872 I'm always going to go from the leaves up. 623 00:25:49,872 --> 00:25:54,274 I don't have to worry about any directed edges in the graph. 624 00:25:54,274 --> 00:25:54,940 Other questions? 625 00:25:58,540 --> 00:26:01,080 So what this does is it gives us a way 626 00:26:01,080 --> 00:26:04,310 to efficiently, over a complicated graph structure, 627 00:26:04,310 --> 00:26:07,390 compute marginals. 628 00:26:07,390 --> 00:26:09,580 And they're typically thought of in terms 629 00:26:09,580 --> 00:26:12,430 of messages that are being sent from the bottom of the graph up 630 00:26:12,430 --> 00:26:13,160 to the top. 631 00:26:13,160 --> 00:26:15,451 And you can have a rule for computing these marginals. 632 00:26:15,451 --> 00:26:17,260 And the rule is as follows. 633 00:26:17,260 --> 00:26:19,580 Each vertex waits for the messages 634 00:26:19,580 --> 00:26:21,270 from all of its children, until it 635 00:26:21,270 --> 00:26:23,820 gets its-- the messages are accumulating their way up 636 00:26:23,820 --> 00:26:24,564 the graph. 637 00:26:24,564 --> 00:26:25,980 And every node is waiting until it 638 00:26:25,980 --> 00:26:29,900 hears from all of its progeny about what's going on. 639 00:26:29,900 --> 00:26:33,790 And then it sends the signal up above it to its parent, 640 00:26:33,790 --> 00:26:35,110 based on the following rules. 641 00:26:35,110 --> 00:26:38,380 A variable node just takes the product of the children. 642 00:26:38,380 --> 00:26:41,120 And a factor node-- one of those little black boxes-- 643 00:26:41,120 --> 00:26:44,017 computes the summary for the children 644 00:26:44,017 --> 00:26:45,350 and sends that up to the parent. 
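Those two rules can be sketched recursively. The chain-shaped graph below (x1 -- fA -- x2 -- fB -- x3, rooted at the query variable x1) and its factor tables are invented for illustration; each entry of a message is the summary of everything below, evaluated at one value of the parent:

```python
from itertools import product

def factor_message(scope, table, child_msgs, parent):
    """A factor node: sum out every child variable with respect to the
    parent, weighting each term by the messages arriving from below."""
    msg = {0: 0.0, 1: 0.0}
    children = [v for v in scope if v != parent]
    for pv in (0, 1):
        for vals in product((0, 1), repeat=len(children)):
            a = dict(zip(children, vals))
            a[parent] = pv
            w = table[tuple(a[v] for v in scope)]
            for c in children:
                w *= child_msgs[c][a[c]]
            msg[pv] += w
    return msg

def variable_message(child_msgs):
    """A variable node: just the product of its children's messages."""
    msg = {0: 1.0, 1: 1.0}
    for m in child_msgs.values():
        for v in (0, 1):
            msg[v] *= m[v]
    return msg

# Hypothetical chain x1 -- fA -- x2 -- fB -- x3, rooted at x1.
fA = {(x1, x2): 2.0 if x1 == x2 else 1.0 for x1 in (0, 1) for x2 in (0, 1)}
fB = {(x2, x3): 1.0 + x2 * x3 for x2 in (0, 1) for x3 in (0, 1)}

leaf = {0: 1.0, 1: 1.0}  # message sent up by the leaf variable x3
m_fB_to_x2 = factor_message(("x2", "x3"), fB, {"x3": leaf}, "x2")
m_x2_to_fA = variable_message({"fB": m_fB_to_x2})
m_fA_to_x1 = factor_message(("x1", "x2"), fA, {"x2": m_x2_to_fA}, "x1")
# m_fA_to_x1[a] is the unnormalized marginal for x1 = a.
```

Because each message sums out only the variables below it, the work grows with the sizes of the individual factors rather than with the full joint table.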
645 00:26:45,350 --> 00:26:47,540 And it's the summary with respect to the parent, 646 00:26:47,540 --> 00:26:50,924 just like in the examples before. 647 00:26:50,924 --> 00:26:53,090 So this is a formula for computing single marginals. 648 00:26:53,090 --> 00:26:55,780 Now it turns out-- I'm not going to go into details of this. 649 00:26:55,780 --> 00:26:57,080 It's kind of complicated. 650 00:26:57,080 --> 00:27:00,780 But you actually can, based on this core idea, 651 00:27:00,780 --> 00:27:04,950 come up with an efficient way of computing all of the marginals 652 00:27:04,950 --> 00:27:06,450 without having to do this separately 653 00:27:06,450 --> 00:27:07,325 for every single one. 654 00:27:07,325 --> 00:27:09,370 And that's called a message passing algorithm. 655 00:27:09,370 --> 00:27:10,870 And if you're really interested, you 656 00:27:10,870 --> 00:27:14,730 can look into the citation for how that's done. 657 00:27:14,730 --> 00:27:18,960 So the core idea is that we can take a representation 658 00:27:18,960 --> 00:27:22,185 of our belief of how this global function-- in our case, 659 00:27:22,185 --> 00:27:24,560 it's going to be the joint probability-- factors in terms 660 00:27:24,560 --> 00:27:26,670 of particular biological processes. 661 00:27:26,670 --> 00:27:30,040 We can encode what we know about the regulation in that factor 662 00:27:30,040 --> 00:27:31,457 graph, the structure of the graph. 663 00:27:31,457 --> 00:27:33,081 And then we could have an efficient way 664 00:27:33,081 --> 00:27:35,110 of computing the marginals, which will tell us, 665 00:27:35,110 --> 00:27:37,010 given the data, what's the probability 666 00:27:37,010 --> 00:27:38,960 that this particular pathway is active? 667 00:27:41,510 --> 00:27:44,010 So in this particular case, in this paradigm model, 668 00:27:44,010 --> 00:27:45,930 the variables can take three states-- 669 00:27:45,930 --> 00:27:49,130 activated, deactivated, or unchanged. 
670 00:27:49,130 --> 00:27:51,800 And this is, in a tumor setting, for example, 671 00:27:51,800 --> 00:27:55,024 you might say the tumor is just like the wild type cell, 672 00:27:55,024 --> 00:27:57,440 or the tumor has activation with respect to the wild type, 673 00:27:57,440 --> 00:27:59,648 or it has a repression with respect to the wild type. 674 00:28:03,010 --> 00:28:06,510 Again, this is the structure of the factor graph that they're 675 00:28:06,510 --> 00:28:09,450 using and the different kinds of information that they have. 676 00:28:09,450 --> 00:28:11,230 The primary experimental data are just 677 00:28:11,230 --> 00:28:15,320 these arrays that tell us about SNPs and copy number variation 678 00:28:15,320 --> 00:28:18,310 and then arrays or RNA-Seq to tell us about the transcript 679 00:28:18,310 --> 00:28:19,910 levels. 680 00:28:19,910 --> 00:28:21,290 But now they can encode all sorts 681 00:28:21,290 --> 00:28:23,310 of rather complicated biological functions 682 00:28:23,310 --> 00:28:25,510 in the graph structure itself. 683 00:28:25,510 --> 00:28:28,750 So transcription regulation is shown here. 684 00:28:28,750 --> 00:28:31,016 Why is the edge from activity to here? 685 00:28:34,984 --> 00:28:36,990 Because we don't want to just infer 686 00:28:36,990 --> 00:28:40,990 that if there's more of the protein, there's more activity. 687 00:28:40,990 --> 00:28:42,760 So we're actually explicitly computing 688 00:28:42,760 --> 00:28:45,040 the activity of each protein. 689 00:28:45,040 --> 00:28:48,440 So if an RNA gets transcribed, it's 690 00:28:48,440 --> 00:28:51,400 because some transcription factor was active. 691 00:28:51,400 --> 00:28:53,570 And the transcription factor might not 692 00:28:53,570 --> 00:28:56,590 be active, even if the levels of the transcription factor 693 00:28:56,590 --> 00:28:57,790 are high. 
694 00:28:57,790 --> 00:29:00,322 That's one of the pieces that's not 695 00:29:00,322 --> 00:29:02,530 encoded in all of those things that were in the DREAM 696 00:29:02,530 --> 00:29:05,560 challenge, that are really critical for representing 697 00:29:05,560 --> 00:29:07,240 the regulatory structure. 698 00:29:07,240 --> 00:29:09,140 Similarly, protein activation-- I 699 00:29:09,140 --> 00:29:11,810 can have protein that goes from being present to being active. 700 00:29:11,810 --> 00:29:14,350 So think of a kinase, that itself 701 00:29:14,350 --> 00:29:16,855 needs to be phosphorylated to be active. 702 00:29:16,855 --> 00:29:18,230 So that would be that transition. 703 00:29:18,230 --> 00:29:20,060 Some other kinase comes in. 704 00:29:20,060 --> 00:29:22,240 And if that other kinase 1 is active, 705 00:29:22,240 --> 00:29:24,210 then it can phosphorylate kinase 2 706 00:29:24,210 --> 00:29:26,392 and make that one active. 707 00:29:26,392 --> 00:29:27,850 And so it's pretty straightforward. 708 00:29:27,850 --> 00:29:30,179 You can also represent the formation of a complex. 709 00:29:30,179 --> 00:29:32,220 So the fact that all the proteins are in the cell 710 00:29:32,220 --> 00:29:34,678 doesn't necessarily mean they're forming an active complex. 711 00:29:34,678 --> 00:29:37,590 So the next step then can be here. 712 00:29:37,590 --> 00:29:39,190 Only when I have all of them, would I 713 00:29:39,190 --> 00:29:40,470 have activity of the complex. 714 00:29:40,470 --> 00:29:43,730 We'll talk about how AND-like connections are formed. 715 00:29:43,730 --> 00:29:46,060 And then they also can incorporate OR. 716 00:29:46,060 --> 00:29:47,160 So what does that mean? 717 00:29:47,160 --> 00:29:50,290 So if I know that all members of the gene family 718 00:29:50,290 --> 00:29:52,790 can do something, I might want to explicitly represent 719 00:29:52,790 --> 00:29:57,230 that gene family as an element to the graph-- a variable. 
720 00:29:57,230 --> 00:29:59,400 Is any member of this family active? 721 00:29:59,400 --> 00:30:01,140 And so that would be done this way, where 722 00:30:01,140 --> 00:30:03,340 if you have an OR-like function here, then 723 00:30:03,340 --> 00:30:07,110 this factor would make this gene active if any of the parents 724 00:30:07,110 --> 00:30:07,720 are active. 725 00:30:11,290 --> 00:30:13,250 So there, they give a toy example, 726 00:30:13,250 --> 00:30:15,630 where they're trying to figure out if the P53 pathway is 727 00:30:15,630 --> 00:30:18,820 active, so MDM2 is an inhibitor of P53. 728 00:30:18,820 --> 00:30:21,430 P53 can be an activator of apoptosis-related genes. 729 00:30:21,430 --> 00:30:24,260 And so, separately for MDM2 and for P53, 730 00:30:24,260 --> 00:30:27,390 they have the factor graphs that show the relationship 731 00:30:27,390 --> 00:30:29,560 between copy number variation and transcript 732 00:30:29,560 --> 00:30:32,230 level and protein level and activity. 733 00:30:32,230 --> 00:30:33,610 And those relate to each other. 734 00:30:33,610 --> 00:30:35,750 And then those relate to the apoptotic pathway. 735 00:30:39,035 --> 00:30:40,910 So what they want to do then is take the data 736 00:30:40,910 --> 00:30:43,300 that they have, in terms of these pathways, 737 00:30:43,300 --> 00:30:45,322 and they want to compute the likelihood ratios. 738 00:30:45,322 --> 00:30:46,780 What's the probability of observing 739 00:30:46,780 --> 00:30:52,250 the data, given a hypothesis that this pathway is active 740 00:30:52,250 --> 00:30:54,250 and all my other settings of the parameters? 741 00:30:54,250 --> 00:30:55,708 And compare that to the probability 742 00:30:55,708 --> 00:30:58,902 of the data, given that that pathway is not active. 743 00:30:58,902 --> 00:31:00,610 So these are the kinds of likelihood ratios 744 00:31:00,610 --> 00:31:02,526 we've been seeing now in a couple of lectures. 
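The comparison at the end is a log-likelihood ratio. A toy sketch with invented numbers -- in the real pipeline the two likelihoods would come from marginalizing the factor graph with the pathway's activity variable clamped to each hypothesis:

```python
import math

# Hypothetical likelihoods of the observed tumor data under the
# two hypotheses (these numbers are made up for illustration).
p_data_given_active = 0.012
p_data_given_inactive = 0.003

# Positive values favor "pathway active"; negative favor "not active".
log_likelihood_ratio = math.log(p_data_given_active / p_data_given_inactive)
```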
745 00:31:04,631 --> 00:31:07,130 So now it gets into the details of how you actually do this. 746 00:31:07,130 --> 00:31:09,400 So there's a lot of manual steps involved here. 747 00:31:09,400 --> 00:31:12,840 So if I want to encode a regulatory pathway as a factor 748 00:31:12,840 --> 00:31:18,052 graph, it's currently done in a manual way or semi-manual way. 749 00:31:18,052 --> 00:31:19,510 You convert what's in the databases 750 00:31:19,510 --> 00:31:20,968 into the structure of a factor graph. 751 00:31:20,968 --> 00:31:23,420 And you make a series of decisions 752 00:31:23,420 --> 00:31:25,100 about exactly how you want to do that. 753 00:31:25,100 --> 00:31:26,891 You can argue with the particular decisions 754 00:31:26,891 --> 00:31:29,620 they made, but they're reasonable ones. 755 00:31:29,620 --> 00:31:31,390 People could do things differently. 756 00:31:31,390 --> 00:31:37,757 So they convert the regulatory networks into graphs. 757 00:31:37,757 --> 00:31:39,840 And then they have to define some of the functions 758 00:31:39,840 --> 00:31:41,050 on this graph. 759 00:31:41,050 --> 00:31:44,810 So they define the expected state of a variable, 760 00:31:44,810 --> 00:31:47,260 based on the state of its parents. 761 00:31:47,260 --> 00:31:50,879 And they take a majority vote of the parents. 762 00:31:50,879 --> 00:31:53,420 So a parent that's connected by a positive edge, meaning it's 763 00:31:53,420 --> 00:31:55,860 an activator, if the parent is active, 764 00:31:55,860 --> 00:31:59,020 then it contributes a plus 1 to the child. 765 00:31:59,020 --> 00:32:01,360 If it's connected by a repressive edge, 766 00:32:01,360 --> 00:32:04,200 then the parent being active would make a vote of minus 1 767 00:32:04,200 --> 00:32:05,040 for the child. 768 00:32:05,040 --> 00:32:10,580 And you take the majority vote of all those votes. 769 00:32:10,580 --> 00:32:11,990 So that's what this says. 
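That voting rule is compact enough to write out. A sketch, using the activated/unchanged/repressed states encoded as +1/0/-1 (the encoding is assumed here for illustration, not quoted from the paper):

```python
def expected_child_state(parent_states, edge_signs):
    """Majority vote of the parents.  An active parent (+1) connected by
    an activating edge (+1) votes +1 for the child; connected by a
    repressive edge (-1) it votes -1.  Unchanged parents (0) abstain.
    Returns +1 (active), -1 (repressed), or 0 (tie / unchanged)."""
    vote = sum(state * sign for state, sign in zip(parent_states, edge_signs))
    return (vote > 0) - (vote < 0)

# Two activating parents outvote one active repressor of the child.
expected_child_state([+1, +1, +1], [+1, +1, -1])
```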
770 00:32:11,990 --> 00:32:14,950 But the nice thing is that you can also incorporate logic. 771 00:32:14,950 --> 00:32:17,000 So for example, when we said, is any member 772 00:32:17,000 --> 00:32:18,250 of this pathway active? 773 00:32:18,250 --> 00:32:20,630 And you have a family member node. 774 00:32:20,630 --> 00:32:23,570 So that can be done with an OR function. 775 00:32:23,570 --> 00:32:25,700 And there, it's these same factors 776 00:32:25,700 --> 00:32:28,480 that will determine-- so some of these edges 777 00:32:28,480 --> 00:32:29,960 are going to get labeled "maximum" 778 00:32:29,960 --> 00:32:33,130 or "minimum," that tell you what's 779 00:32:33,130 --> 00:32:35,420 the expected value of the child, based on the parent. 780 00:32:35,420 --> 00:32:38,129 So if it's an OR, then if any of the parents are active, 781 00:32:38,129 --> 00:32:39,170 then the child is active. 782 00:32:39,170 --> 00:32:40,753 And if it's AND, you need all of them. 783 00:32:43,150 --> 00:32:45,906 So you could have described all of these networks 784 00:32:45,906 --> 00:32:46,780 by Bayesian networks. 785 00:32:46,780 --> 00:32:48,700 But the advantage of a factor graph 786 00:32:48,700 --> 00:32:50,580 is that you're explicitly able to include 787 00:32:50,580 --> 00:32:54,386 all these steps to describe this regulation in an intuitive way. 788 00:32:54,386 --> 00:32:55,760 So you can go back to your models 789 00:32:55,760 --> 00:32:57,730 and understand what you've done, and change it 790 00:32:57,730 --> 00:32:59,820 in an obvious way. 791 00:32:59,820 --> 00:33:01,830 Now critically, we're not trying to learn 792 00:33:01,830 --> 00:33:03,750 the structure of the graph from the data. 793 00:33:03,750 --> 00:33:05,870 We're imposing the structure of the graph. 
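With the same +1/0/-1 encoding, the "maximum" and "minimum" edge labels reduce to one-liners -- again a sketch of the idea, not the paper's exact implementation:

```python
def or_factor(parent_states):
    """OR-like ("maximum") factor: the child is active if any parent is."""
    return max(parent_states)

def and_factor(parent_states):
    """AND-like ("minimum") factor: the child is active only if every
    parent is; one repressed parent (-1) drags the whole complex down."""
    return min(parent_states)
```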
794 00:33:05,870 --> 00:33:07,900 We still need to learn a lot of variables, 795 00:33:07,900 --> 00:33:10,550 and that's done using expectation maximization, 796 00:33:10,550 --> 00:33:13,620 as we saw in the Bayesian networks. 797 00:33:13,620 --> 00:33:15,300 And then, again, it's a factor graph, 798 00:33:15,300 --> 00:33:17,970 which primarily means that we can factor the global function 799 00:33:17,970 --> 00:33:20,550 into all of these factor nodes. 800 00:33:20,550 --> 00:33:23,850 So the total probability is normalized, 801 00:33:23,850 --> 00:33:27,500 but it's the product of these factors which 802 00:33:27,500 --> 00:33:29,720 have to do with just the variables that are connected 803 00:33:29,720 --> 00:33:33,830 to that factor node in the graph. 804 00:33:33,830 --> 00:33:35,230 And this notation that you'll see 805 00:33:35,230 --> 00:33:37,410 if you look through this, this notation 806 00:33:37,410 --> 00:33:39,060 means the setting of all the variables 807 00:33:39,060 --> 00:33:40,650 consistent with something. 808 00:33:40,650 --> 00:33:44,020 So let's see that-- here we go. 809 00:33:44,020 --> 00:33:47,220 So this here, this is the setting of all the variables 810 00:33:47,220 --> 00:33:49,800 X, consistent with the data that we have-- so the data 811 00:33:49,800 --> 00:33:53,870 being the arrays, the RNA-Seq, if you had it. 812 00:33:53,870 --> 00:33:56,630 And so we want to compute the marginal probability 813 00:33:56,630 --> 00:33:59,680 of some particular variable being at a particular setting, 814 00:33:59,680 --> 00:34:02,320 given the fully specified factor graph. 815 00:34:02,320 --> 00:34:06,490 And we just take the product of all of these marginals. 816 00:34:06,490 --> 00:34:07,550 Is that clear? 817 00:34:07,550 --> 00:34:10,050 Consistent with all the settings where 818 00:34:10,050 --> 00:34:15,639 that variable is set to x equals a. 819 00:34:15,639 --> 00:34:16,139 Questions? 820 00:34:18,850 --> 00:34:19,350 OK. 
821 00:34:19,350 --> 00:34:21,099 And we can compute the likelihood function 822 00:34:21,099 --> 00:34:21,870 in the same way. 823 00:34:21,870 --> 00:34:24,400 So then what actually happens when you try to do this? 824 00:34:24,400 --> 00:34:28,469 So they give an example here in this more recent paper, where 825 00:34:28,469 --> 00:34:30,250 it's basically a toy example. 826 00:34:30,250 --> 00:34:32,469 But they're modeling all of these different states 827 00:34:32,469 --> 00:34:33,320 in the cells. 828 00:34:33,320 --> 00:34:35,550 So G are the number of genomic copies, 829 00:34:35,550 --> 00:34:38,060 T, the level of transcripts. 830 00:34:38,060 --> 00:34:41,061 Those are connected by a factor to what you actually measure. 831 00:34:41,061 --> 00:34:42,810 So there is some true change in the number 832 00:34:42,810 --> 00:34:44,050 of copies in the cell. 833 00:34:44,050 --> 00:34:46,370 And then there's what appears in your array. 834 00:34:46,370 --> 00:34:49,650 There's some true number of copies of RNA in the cell. 835 00:34:49,650 --> 00:34:52,550 And then there's what you get out of your RNA-Seq. 836 00:34:52,550 --> 00:34:54,300 So that's what these factors represent-- 837 00:34:54,300 --> 00:34:55,799 and then these are regulatory terms. 838 00:34:55,799 --> 00:34:59,460 So how much transcript you get depends on these two variables, 839 00:34:59,460 --> 00:35:03,390 the epigenetic state of the promoter 840 00:35:03,390 --> 00:35:05,880 and the regulatory proteins that interact with it. 841 00:35:05,880 --> 00:35:07,730 How much transcript gets turned into protein 842 00:35:07,730 --> 00:35:10,480 depends on regulatory proteins. 843 00:35:10,480 --> 00:35:12,910 And those are determined by upstream signaling events. 844 00:35:12,910 --> 00:35:14,410 And how much protein becomes active, 845 00:35:14,410 --> 00:35:16,860 again, is determined by the upstream signaling events. 
846 00:35:16,860 --> 00:35:21,757 And then those can have effects on downstream pathways as well. 847 00:35:21,757 --> 00:35:24,090 So then in this toy example, they're looking at MYC/MAX. 848 00:35:24,090 --> 00:35:28,390 They're trying to figure out whether it's active or not. 849 00:35:28,390 --> 00:35:30,170 So we've got this pathway. 850 00:35:30,170 --> 00:35:32,470 PAK2 represses MYC/MAX. 851 00:35:32,470 --> 00:35:38,030 MYC/MAX activates these two genes and represses this one. 852 00:35:38,030 --> 00:35:39,710 And so if these were the data that we 853 00:35:39,710 --> 00:35:42,830 had coming from copy number variation, DNA methylation, 854 00:35:42,830 --> 00:35:47,940 and RNA expression, then I'd see that the following states 855 00:35:47,940 --> 00:35:53,240 of the downstream genes-- this one's active. 856 00:35:53,240 --> 00:35:55,220 This one's repressed. 857 00:35:55,220 --> 00:35:55,970 This one's active. 858 00:35:55,970 --> 00:35:57,130 This one's repressed. 859 00:35:57,130 --> 00:36:00,330 They infer that MYC/MAX is active. 860 00:36:00,330 --> 00:36:03,510 Oh, but what about the fact that this one should also 861 00:36:03,510 --> 00:36:05,030 be activated? 862 00:36:05,030 --> 00:36:06,720 That can be explained away by the fact 863 00:36:06,720 --> 00:36:11,940 that there's a difference in the epigenetic state between ENO1 864 00:36:11,940 --> 00:36:14,880 and the other two. 865 00:36:14,880 --> 00:36:18,360 And then the belief propagation allows 866 00:36:18,360 --> 00:36:21,010 us to transfer that information upward through the graph 867 00:36:21,010 --> 00:36:25,030 to figure out, OK, so now we've decided that MYC/MAX is active. 868 00:36:25,030 --> 00:36:28,430 And that gives us information about the state of the proteins 869 00:36:28,430 --> 00:36:33,030 upstream of it and the activity then of PAK2, 870 00:36:33,030 --> 00:36:34,915 which is a repressor of MYC/MAX. 
871 00:36:39,220 --> 00:36:41,650 Questions on the factor graphs specifically 872 00:36:41,650 --> 00:36:43,559 or anything that's come up until now? 873 00:36:48,740 --> 00:36:52,810 So this has all been reasoning on known pathways. 874 00:36:52,810 --> 00:36:56,150 One of the big promises of these systematic approaches 875 00:36:56,150 --> 00:36:58,390 is the hope that we can discover new pathways. 876 00:36:58,390 --> 00:37:01,330 Can we discover things we don't already know about? 877 00:37:01,330 --> 00:37:04,040 And for this, we're going to look at interactome graphs, 878 00:37:04,040 --> 00:37:06,251 so graphs that are built primarily 879 00:37:06,251 --> 00:37:08,250 from high throughput protein-protein interaction 880 00:37:08,250 --> 00:37:10,083 data, but could also be built, as we'll see, 881 00:37:10,083 --> 00:37:14,514 from other kinds of large-scale connections. 882 00:37:14,514 --> 00:37:16,430 And we're going to look at what the underlying 883 00:37:16,430 --> 00:37:17,971 structure of these networks could be. 884 00:37:17,971 --> 00:37:19,686 And so they could arise from a graph 885 00:37:19,686 --> 00:37:21,310 where you put an edge between two nodes 886 00:37:21,310 --> 00:37:25,730 if they're co-expressed, if they have high mutual information. 887 00:37:25,730 --> 00:37:27,620 That's what we saw in, say, ARACNE, 888 00:37:27,620 --> 00:37:31,100 which we talked about a lecture ago. 889 00:37:31,100 --> 00:37:35,050 Or if, say, two-hybrid and affinity capture mass spec 890 00:37:35,050 --> 00:37:37,210 indicated direct physical interaction 891 00:37:37,210 --> 00:37:39,020 or say a high throughput genetic screen 892 00:37:39,020 --> 00:37:40,394 indicated a genetic interaction. 893 00:37:40,394 --> 00:37:42,310 These are going to be very, very large graphs. 
894 00:37:42,310 --> 00:37:44,860 And we're going to look at some of the algorithmic problems 895 00:37:44,860 --> 00:37:46,500 that we have dealing with huge graphs 896 00:37:46,500 --> 00:37:49,690 and how to compress the information down so we get 897 00:37:49,690 --> 00:37:52,795 some piece of the network that's quite interpretable. 898 00:37:52,795 --> 00:37:54,420 And we'll look at various kinds of ways 899 00:37:54,420 --> 00:37:59,490 of analyzing these graphs that are listed here. 900 00:37:59,490 --> 00:38:03,654 So one of the advantages of dealing with data in the graph 901 00:38:03,654 --> 00:38:06,070 formulation is that we can leverage the fact that computer 902 00:38:06,070 --> 00:38:08,760 science has dealt with large graphs for quite a while 903 00:38:08,760 --> 00:38:11,580 now, often in the context of telecommunications. 904 00:38:11,580 --> 00:38:14,460 Now big data, Facebook, Google-- they're always 905 00:38:14,460 --> 00:38:16,260 dealing with things in a graph formulation. 906 00:38:16,260 --> 00:38:20,430 So there are a lot of algorithms that we can take advantage of. 907 00:38:20,430 --> 00:38:23,270 We're going to look at how to use quick distance 908 00:38:23,270 --> 00:38:24,459 calculations on graphs. 909 00:38:24,459 --> 00:38:26,500 And we'll look at that specifically in an example 910 00:38:26,500 --> 00:38:29,570 of how to find the kinase-target relationships. 911 00:38:29,570 --> 00:38:31,630 Then we'll look at how to cluster large graphs 912 00:38:31,630 --> 00:38:33,640 to find subgraphs that either represent 913 00:38:33,640 --> 00:38:35,150 an interesting topological feature 914 00:38:35,150 --> 00:38:37,160 of the inherent structure of the graph 915 00:38:37,160 --> 00:38:40,830 or perhaps to represent active pieces of the network. 
916 00:38:40,830 --> 00:38:43,006 And then we'll look at other kinds of optimization 917 00:38:43,006 --> 00:38:45,380 techniques to help us find the part of the network that's 918 00:38:45,380 --> 00:38:50,390 most relevant to our particular experimental setting. 919 00:38:50,390 --> 00:38:54,080 So let's start with ostensibly a simple problem. 920 00:38:54,080 --> 00:38:57,140 I know a lot about-- I have a lot of protein phosphorylation 921 00:38:57,140 --> 00:38:57,640 data. 922 00:38:57,640 --> 00:38:59,600 I'd like to figure out what kinase 923 00:38:59,600 --> 00:39:03,060 it was that phosphorylated a particular protein. 924 00:39:03,060 --> 00:39:05,270 So let's say I have this protein that's 925 00:39:05,270 --> 00:39:08,560 involved in cancer signaling, Rad50. 926 00:39:08,560 --> 00:39:10,750 And I know it's phosphorylated at these two sites. 927 00:39:10,750 --> 00:39:12,660 And I have the sequences of those sites. 928 00:39:12,660 --> 00:39:14,970 So what kinds of tools do we have at our disposal 929 00:39:14,970 --> 00:39:16,900 if I have a set of sequences that I believe 930 00:39:16,900 --> 00:39:18,570 are phosphorylated, that would help 931 00:39:18,570 --> 00:39:21,481 me try to figure out what kinase did the phosphorylation? 932 00:39:21,481 --> 00:39:21,980 Any ideas? 933 00:39:26,910 --> 00:39:29,740 So if I know the specificity of the kinases, what could I do? 934 00:39:32,320 --> 00:39:34,260 I could look for a sequence match 935 00:39:34,260 --> 00:39:36,440 between the specificity of the kinase 936 00:39:36,440 --> 00:39:38,417 and the sequence of the protein, right? 937 00:39:38,417 --> 00:39:40,250 In the same way that we can look for a match 938 00:39:40,250 --> 00:39:42,790 between the specificity of a transcription factor 939 00:39:42,790 --> 00:39:46,330 and the region of the genome to which it binds. 
940 00:39:46,330 --> 00:39:49,470 So if I have a library of specificity motifs 941 00:39:49,470 --> 00:39:51,580 for different kinases, where every position here 942 00:39:51,580 --> 00:39:53,859 represents a piece of the recognition element, 943 00:39:53,859 --> 00:39:56,150 and the height of the letters represents the information 944 00:39:56,150 --> 00:39:57,900 content, I can scan those. 945 00:39:57,900 --> 00:40:00,310 And I can see what family of kinases 946 00:40:00,310 --> 00:40:03,440 are most likely to be responsible for phosphorylating 947 00:40:03,440 --> 00:40:04,482 these sites. 948 00:40:04,482 --> 00:40:06,190 But again, those are families of kinases. 949 00:40:06,190 --> 00:40:07,564 There are many individual members 950 00:40:07,564 --> 00:40:08,860 of each of those families. 951 00:40:08,860 --> 00:40:10,380 So how do I find the specific member 952 00:40:10,380 --> 00:40:12,330 of that family that's most likely to carry out 953 00:40:12,330 --> 00:40:13,620 the regulation? 954 00:40:13,620 --> 00:40:15,120 So here's what happens in this paper. 955 00:40:15,120 --> 00:40:17,244 It's called NetworKIN. And they say, well, 956 00:40:17,244 --> 00:40:18,570 let's use the graph properties. 957 00:40:18,570 --> 00:40:23,290 Let's try to figure out which proteins are physically linked 958 00:40:23,290 --> 00:40:26,390 relatively closely in the network to the target. 959 00:40:26,390 --> 00:40:29,080 So in this case, they've got Rad50 over here. 960 00:40:29,080 --> 00:40:33,620 And they're trying to figure out which kinase is regulating it. 961 00:40:33,620 --> 00:40:35,980 So here are two kinases that have similar specificity. 962 00:40:35,980 --> 00:40:37,669 But this one's directly connected 963 00:40:37,669 --> 00:40:39,210 in the interaction network, so it's 964 00:40:39,210 --> 00:40:41,870 more likely to be responsible. 
965 00:40:41,870 --> 00:40:44,430 And here's the member of the kinase family that 966 00:40:44,430 --> 00:40:47,010 seems to be consistent with the sequence being phosphorylated 967 00:40:47,010 --> 00:40:48,130 over here. 968 00:40:48,130 --> 00:40:50,910 It's not directly connected, but it's relatively close. 969 00:40:50,910 --> 00:40:53,530 And so that's also a highly probable member, 970 00:40:53,530 --> 00:40:56,110 compared to one that's more distantly related. 971 00:40:56,110 --> 00:40:58,870 So in general, if I've got a set of kinases 972 00:40:58,870 --> 00:41:02,410 that are all equally good sequence matches to the target 973 00:41:02,410 --> 00:41:05,560 sequence, represented by these dashed lines, but one of them 974 00:41:05,560 --> 00:41:08,780 is physically linked as well, perhaps directly and perhaps 975 00:41:08,780 --> 00:41:10,610 indirectly, I have higher confidence 976 00:41:10,610 --> 00:41:13,360 in this kinase because of its physical links 977 00:41:13,360 --> 00:41:15,857 than I do in these. 978 00:41:15,857 --> 00:41:18,190 So that's fine if you want to look at things one by one. 979 00:41:18,190 --> 00:41:19,690 But if you want to look at this at a global scale, 980 00:41:19,690 --> 00:41:21,110 we need very efficient algorithms 981 00:41:21,110 --> 00:41:23,980 for figuring out what the distance is in this interaction 982 00:41:23,980 --> 00:41:29,190 network between any kinase and any target. 983 00:41:29,190 --> 00:41:31,590 So how do you go about efficiently computing distances? 984 00:41:31,590 --> 00:41:34,030 Well that's where converting things into a graph structure 985 00:41:34,030 --> 00:41:35,180 is helpful. 986 00:41:35,180 --> 00:41:37,070 So when we talk about graphs here, 987 00:41:37,070 --> 00:41:40,654 we mean sets of vertices and the edges that connect them. 988 00:41:40,654 --> 00:41:42,820 The vertices, in our case, are going to be proteins. 
989 00:41:42,820 --> 00:41:44,290 The edges are going to perhaps represent 990 00:41:44,290 --> 00:41:46,789 physical interactions or some of these other kinds of graphs 991 00:41:46,789 --> 00:41:49,520 we talked about. 992 00:41:49,520 --> 00:41:52,049 These graphs can be directed, or they can be undirected. 993 00:41:52,049 --> 00:41:53,090 Undirected would be what? 994 00:41:53,090 --> 00:41:54,950 For example, say, two-hybrid. 995 00:41:54,950 --> 00:41:57,150 I don't know which one's doing what to which. 996 00:41:57,150 --> 00:41:59,280 I just know that two proteins can come together. 997 00:41:59,280 --> 00:42:01,260 Whereas a directed edge might be this kinase 998 00:42:01,260 --> 00:42:02,460 phosphorylates this target. 999 00:42:02,460 --> 00:42:05,130 And so it's a directed edge. 1000 00:42:05,130 --> 00:42:07,091 I can have weights associated with these edges. 1001 00:42:07,091 --> 00:42:08,590 We'll see in a second how we can use 1002 00:42:08,590 --> 00:42:11,240 that to encode our confidence that the edge represents 1003 00:42:11,240 --> 00:42:14,680 a true physical interaction. 1004 00:42:14,680 --> 00:42:17,580 We can also talk about the degree, the number of edges 1005 00:42:17,580 --> 00:42:20,770 that come into a node or leave a node. 1006 00:42:20,770 --> 00:42:22,740 And for our purposes, it's rather important 1007 00:42:22,740 --> 00:42:25,170 to talk about the path, the set of vertices 1008 00:42:25,170 --> 00:42:27,840 that can get me from one node to another node, 1009 00:42:27,840 --> 00:42:31,476 without ever retracing my steps. 1010 00:42:31,476 --> 00:42:34,100 And we're going to talk about path length, 1011 00:42:34,100 --> 00:42:35,600 so if my graph is unweighted, that's 1012 00:42:35,600 --> 00:42:39,000 just the number of edges along the path. 1013 00:42:39,000 --> 00:42:40,990 But if my graph has edge weights, 1014 00:42:40,990 --> 00:42:43,675 it's going to be the sum of the edge weights along that path. 
1015 00:42:43,675 --> 00:42:44,290 Is that clear? 1016 00:42:48,040 --> 00:42:50,640 And then we're going to use an adjacency matrix 1017 00:42:50,640 --> 00:42:51,640 to represent the graphs. 1018 00:42:51,640 --> 00:42:53,400 So I have two completely equivalent formulations 1019 00:42:53,400 --> 00:42:53,941 of the graph. 1020 00:42:53,941 --> 00:42:56,190 One is the picture on the left-hand side, 1021 00:42:56,190 --> 00:42:59,310 and the other one is the matrix on the right-hand side, where 1022 00:42:59,310 --> 00:43:02,820 a 1 between any row and column represents 1023 00:43:02,820 --> 00:43:03,820 the presence of an edge. 1024 00:43:03,820 --> 00:43:10,500 So the only edge connecting node 1 goes to node 2. 1025 00:43:10,500 --> 00:43:13,610 Whereas, node 2 is connected both to node 1 and to node 3. 1026 00:43:13,610 --> 00:43:14,720 Hopefully, that agrees. 1027 00:43:14,720 --> 00:43:15,365 OK. 1028 00:43:15,365 --> 00:43:16,170 Is that clear? 1029 00:43:20,537 --> 00:43:22,370 And if I have a weighted graph, then instead 1030 00:43:22,370 --> 00:43:23,910 of putting zeros or ones in the matrix, 1031 00:43:23,910 --> 00:43:25,868 I'll put the actual edge weights in the matrix. 1032 00:43:28,620 --> 00:43:31,950 So there are algorithms that exist for efficiently finding 1033 00:43:31,950 --> 00:43:35,740 shortest paths in large graphs. 1034 00:43:35,740 --> 00:43:37,910 So we can very rapidly, for example, 1035 00:43:37,910 --> 00:43:40,130 compute the shortest path between any two nodes, 1036 00:43:40,130 --> 00:43:43,720 based solely on that adjacency matrix. 1037 00:43:43,720 --> 00:43:46,170 Now why are we going to look at weighted graphs? 1038 00:43:46,170 --> 00:43:48,650 Because that gives us the way to encode our confidence 1039 00:43:48,650 --> 00:43:50,040 in the underlying data. 
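As a sketch of how an adjacency matrix supports those fast path computations, here is breadth-first search on the toy three-node graph just described (node 1 connected to node 2, node 2 connected to nodes 1 and 3); the matrix is zero-indexed and purely illustrative.

```python
from collections import deque

# Adjacency matrix for a toy undirected graph: node 1 -- node 2 -- node 3
# (rows/columns are 0-indexed here; a 1 marks the presence of an edge).
A = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]

def shortest_path_length(adj, src, dst):
    """Breadth-first search: fewest edges from src to dst in an
    unweighted graph, or None if dst is unreachable."""
    n = len(adj)
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        if u == dst:
            return dist[u]
        for v in range(n):
            if adj[u][v] and v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return None

print(shortest_path_length(A, 0, 2))  # node 1 to node 3, via node 2 -> 2
```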
1040 00:43:50,040 --> 00:43:54,350 So because the total distance in the network 1041 00:43:54,350 --> 00:43:57,230 is the sum of the edge weights, if I set my edge weights 1042 00:43:57,230 --> 00:44:01,150 to be negative log of a probability, 1043 00:44:01,150 --> 00:44:03,302 then if I sum all the edge weights, 1044 00:44:03,302 --> 00:44:05,385 I'm taking the product of all those probabilities. 1045 00:44:08,560 --> 00:44:10,800 And so the shortest path is going 1046 00:44:10,800 --> 00:44:14,390 to be the most probable path as well, because it's 1047 00:44:14,390 --> 00:44:18,900 going to be the minimum of the sum of the negative log. 1048 00:44:18,900 --> 00:44:21,475 So it's going to be the maximum of the joint probability. 1049 00:44:21,475 --> 00:44:24,270 Is that clear? 1050 00:44:24,270 --> 00:44:24,770 OK. 1051 00:44:24,770 --> 00:44:25,470 Very good. 1052 00:44:25,470 --> 00:44:30,200 So by encoding our network as a weighted graph, where the edge 1053 00:44:30,200 --> 00:44:31,932 weights are minus log of the probability, 1054 00:44:31,932 --> 00:44:34,140 then when I use these standard algorithms for finding 1055 00:44:34,140 --> 00:44:35,806 the shortest path between any two nodes, 1056 00:44:35,806 --> 00:44:38,870 I'm also getting the most probable path between these two 1057 00:44:38,870 --> 00:44:41,160 proteins. 1058 00:44:41,160 --> 00:44:44,210 So where do these edge weights come from? 1059 00:44:44,210 --> 00:44:47,090 So if my network consists say of affinity capture mass 1060 00:44:47,090 --> 00:44:48,530 spec and two hybrid interactions, 1061 00:44:48,530 --> 00:44:51,700 how would I compute the edge weights for that network? 1062 00:45:01,524 --> 00:45:03,190 We actually explicitly talked about this 1063 00:45:03,190 --> 00:45:04,468 just a lecture or two ago. 1064 00:45:08,950 --> 00:45:10,960 So I have all this affinity capture mass spec, 1065 00:45:10,960 --> 00:45:12,150 two hybrid data. 
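A minimal sketch of that trick, with made-up confidence values: give each edge the weight minus log of its probability and run a standard shortest-path algorithm (Dijkstra here), and the minimum-weight path is also the maximum-probability path.

```python
import heapq
import math

# Hypothetical interaction confidences: edge -> probability it's real.
# Edge weight = -log(p), so minimizing the weight sum maximizes the
# product of the probabilities along the path.
edges = {
    ("A", "B"): 0.9, ("B", "C"): 0.9,  # confident two-step route
    ("A", "C"): 0.5,                   # less confident direct edge
}

graph = {}
for (u, v), p in edges.items():
    w = -math.log(p)
    graph.setdefault(u, []).append((v, w))
    graph.setdefault(v, []).append((u, w))  # undirected

def most_probable_path(src, dst):
    """Dijkstra on -log(p) weights; returns (path probability, path)."""
    heap = [(0.0, src, [src])]
    seen = set()
    while heap:
        cost, u, path = heapq.heappop(heap)
        if u == dst:
            return math.exp(-cost), path
        if u in seen:
            continue
        seen.add(u)
        for v, w in graph.get(u, []):
            if v not in seen:
                heapq.heappush(heap, (cost + w, v, path + [v]))
    return 0.0, []

prob, path = most_probable_path("A", "C")
print(path, prob)  # A-B-C wins: 0.9 * 0.9 = 0.81 beats the direct 0.5 edge
```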
1066 00:45:12,150 --> 00:45:13,620 And I want to assign a probability 1067 00:45:13,620 --> 00:45:18,350 to every edge that tells me how confident I am that it's real. 1068 00:45:18,350 --> 00:45:21,050 So we already saw that in the context of this paper 1069 00:45:21,050 --> 00:45:23,970 where we use Bayesian networks and gold standards to compute 1070 00:45:23,970 --> 00:45:26,345 the probability for every single edge in the interactome. 1071 00:45:28,800 --> 00:45:29,300 OK. 1072 00:45:29,300 --> 00:45:32,520 So that works pretty well if you can define the gold standards. 1073 00:45:32,520 --> 00:45:35,630 It turns out that that has not been the most popular way 1074 00:45:35,630 --> 00:45:38,270 of dealing with mammalian data. 1075 00:45:38,270 --> 00:45:40,160 It works pretty well for yeast, but it's not 1076 00:45:40,160 --> 00:45:42,400 what's used primarily in mammalian data. 1077 00:45:42,400 --> 00:45:45,830 So in mammalian data, the databases are much larger. 1078 00:45:45,830 --> 00:45:48,280 The number of gold standards is smaller. 1079 00:45:48,280 --> 00:45:51,270 People rely on more ad hoc methods. 1080 00:45:51,270 --> 00:45:54,370 One of the big advances, technically, for the field 1081 00:45:54,370 --> 00:45:57,060 was the development of a common way for all these databases 1082 00:45:57,060 --> 00:45:59,580 of protein-protein interactions to report their data, 1083 00:45:59,580 --> 00:46:01,150 to be able to interchange them. 1084 00:46:01,150 --> 00:46:05,580 There are standards called PSICQUIC and PSISCORE that 1085 00:46:05,580 --> 00:46:09,485 allow a client to pull information 1086 00:46:09,485 --> 00:46:11,610 from all the different databases of protein-protein 1087 00:46:11,610 --> 00:46:12,720 interactions. 
1088 00:46:12,720 --> 00:46:15,810 And because you can get all the data in a common format 1089 00:46:15,810 --> 00:46:18,500 where it's traceable back to the underlying experiment, 1090 00:46:18,500 --> 00:46:21,510 then you can start computing confidence scores 1091 00:46:21,510 --> 00:46:23,110 based on these properties, what we 1092 00:46:23,110 --> 00:46:26,450 know about where the data came from in a high throughput way. 1093 00:46:26,450 --> 00:46:28,510 Different people have different approaches 1094 00:46:28,510 --> 00:46:30,620 to computing those scores. 1095 00:46:30,620 --> 00:46:32,380 So there's a common format for that 1096 00:46:32,380 --> 00:46:34,790 as well, which is this PSISCORE where 1097 00:46:34,790 --> 00:46:38,150 you can build your interaction database from whichever 1098 00:46:38,150 --> 00:46:40,390 one of these underlying databases you want, 1099 00:46:40,390 --> 00:46:41,740 filter it however you want. 1100 00:46:41,740 --> 00:46:45,780 And then send your database to one of these scoring servers. 1101 00:46:45,780 --> 00:46:47,690 And they'll send you back the scores 1102 00:46:47,690 --> 00:46:50,130 according to their algorithm. 1103 00:46:50,130 --> 00:46:52,940 One that I kind of like is the MIscore algorithm. 1104 00:46:52,940 --> 00:46:54,509 It digs down into the underlying data 1105 00:46:54,509 --> 00:46:56,050 of what kind of experiments were done 1106 00:46:56,050 --> 00:46:58,139 and how many experiments were done. 1107 00:46:58,139 --> 00:47:00,180 Again, they make all sorts of arbitrary decisions 1108 00:47:00,180 --> 00:47:01,013 in how they do that. 1109 00:47:01,013 --> 00:47:03,400 But the arbitrary decisions seem reasonable 1110 00:47:03,400 --> 00:47:05,790 in the absence of any other data. 
1111 00:47:05,790 --> 00:47:10,260 So their scores are based on these three kinds of terms-- 1112 00:47:10,260 --> 00:47:12,180 how many publications there are associated 1113 00:47:12,180 --> 00:47:17,130 with any interaction, what experimental method was used, 1114 00:47:17,130 --> 00:47:19,434 and then also, if there's an annotation in the database 1115 00:47:19,434 --> 00:47:21,725 saying that we know that this is a genetic interaction, 1116 00:47:21,725 --> 00:47:23,559 or we know that it's a physical interaction. 1117 00:47:23,559 --> 00:47:25,599 And then they put weights on all of these things. 1118 00:47:25,599 --> 00:47:27,280 So people can argue about what the best 1119 00:47:27,280 --> 00:47:28,924 way of approaching this is. 1120 00:47:28,924 --> 00:47:30,590 The fundamental point is that we can now 1121 00:47:30,590 --> 00:47:33,030 have a very, very large database of known 1122 00:47:33,030 --> 00:47:34,630 interactions, with weights. 1123 00:47:34,630 --> 00:47:37,360 So by last count, there are about 250,000 1124 00:47:37,360 --> 00:47:40,352 protein-protein interactions for humans in these databases. 1125 00:47:40,352 --> 00:47:41,810 So you have that giant interactome. 1126 00:47:41,810 --> 00:47:44,380 It's got all these scores associated with it. 1127 00:47:44,380 --> 00:47:46,390 And now we can dive into that and say, 1128 00:47:46,390 --> 00:47:51,880 these data are largely unbiased by our prior notions 1129 00:47:51,880 --> 00:47:53,750 about what's important. 1130 00:47:53,750 --> 00:47:55,560 They're built up from high throughput data. 1131 00:47:55,560 --> 00:47:57,910 So unlike the carefully curated pathways 1132 00:47:57,910 --> 00:48:00,077 that are what everybody's been studying for decades, 1133 00:48:00,077 --> 00:48:01,993 there might be information here about pathways 1134 00:48:01,993 --> 00:48:02,830 no one knows about. 1135 00:48:02,830 --> 00:48:05,250 Can we find those pathways in different contexts? 
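The real MIscore has its own published normalizations, so purely as an illustration of combining those three kinds of terms, here is a made-up weighted score; all the per-method values, type values, and weights below are hypothetical, not MIscore's actual numbers.

```python
import math

# Sketch of an MIscore-style confidence: a publication-count term,
# an experimental-method term, and an interaction-type term, each
# in [0, 1], combined with weights. All numbers are invented for
# illustration; the real MIscore defines its own normalized values.
METHOD_SCORE = {"affinity capture-MS": 0.8, "two-hybrid": 0.6}
TYPE_SCORE = {"physical": 1.0, "genetic": 0.5}

def confidence(n_publications, methods, itype,
               w_pub=0.5, w_method=0.3, w_type=0.2):
    # Saturating publication term: more papers help, capped at 1.
    pub_term = min(math.log(1 + n_publications) / math.log(1 + 7), 1.0)
    # Credit the most reliable method reported for this edge.
    method_term = max(METHOD_SCORE.get(m, 0.3) for m in methods)
    type_term = TYPE_SCORE.get(itype, 0.3)
    total = w_pub + w_method + w_type
    return (w_pub * pub_term + w_method * method_term
            + w_type * type_term) / total

print(confidence(3, ["two-hybrid", "affinity capture-MS"], "physical"))
```

A well-replicated physical interaction scores higher than a single-publication genetic one, which is the qualitative behavior the lecture describes.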
1136 00:48:05,250 --> 00:48:07,050 What can we learn from that? 1137 00:48:07,050 --> 00:48:09,127 So one early thing people can do is just 1138 00:48:09,127 --> 00:48:10,710 try to find pieces of the network that 1139 00:48:10,710 --> 00:48:12,168 seem to be modular, where there are 1140 00:48:12,168 --> 00:48:15,650 more interactions among the components of that module 1141 00:48:15,650 --> 00:48:17,990 than there are to other pieces of the network. 1142 00:48:17,990 --> 00:48:20,700 And you can find those modules in two different ways. 1143 00:48:20,700 --> 00:48:24,240 One is just based on the underlying network. 1144 00:48:24,240 --> 00:48:27,540 And one is based on the network, plus some external data 1145 00:48:27,540 --> 00:48:28,290 you have. 1146 00:48:28,290 --> 00:48:29,920 So one would be to say, are there 1147 00:48:29,920 --> 00:48:32,790 proteins that fundamentally interact with each other 1148 00:48:32,790 --> 00:48:34,531 under all possible settings? 1149 00:48:34,531 --> 00:48:36,780 And then we would say, in my particular patient sample 1150 00:48:36,780 --> 00:48:40,290 or my disease or my microorganism, 1151 00:48:40,290 --> 00:48:42,370 which proteins seem to be functioning 1152 00:48:42,370 --> 00:48:44,810 in this particular condition? 1153 00:48:44,810 --> 00:48:47,230 So one is the topological model. 1154 00:48:47,230 --> 00:48:48,740 That's just the network itself. 1155 00:48:48,740 --> 00:48:51,534 And one is the functional model, where I layer on the information 1156 00:48:51,534 --> 00:48:53,950 that the dark nodes are active in my particular condition. 1157 00:48:56,590 --> 00:49:00,000 So an early use of this kind of approach 1158 00:49:00,000 --> 00:49:02,995 was to try to annotate nodes-- for a large fraction 1159 00:49:02,995 --> 00:49:05,130 of even well-studied genomes, we don't know 1160 00:49:05,130 --> 00:49:07,500 the function of many of those genes. 
1161 00:49:07,500 --> 00:49:09,490 So what if I use the structure of the network 1162 00:49:09,490 --> 00:49:13,060 to infer that if some protein is close to another protein 1163 00:49:13,060 --> 00:49:14,750 in this interaction network, it is 1164 00:49:14,750 --> 00:49:16,670 likely to have similar function? 1165 00:49:16,670 --> 00:49:19,280 And statistically, that's definitely true. 1166 00:49:19,280 --> 00:49:24,410 So this graph shows, for things where we know the function, 1167 00:49:24,410 --> 00:49:26,590 the semantic similarity on the y-axis, 1168 00:49:26,590 --> 00:49:28,427 the distance in the network on the x-axis, 1169 00:49:28,427 --> 00:49:30,510 things that are close to each other in the network 1170 00:49:30,510 --> 00:49:32,930 of interactions, are also more likely to be 1171 00:49:32,930 --> 00:49:35,645 similar in terms of function. 1172 00:49:35,645 --> 00:49:37,020 So how do we go about doing that? 1173 00:49:37,020 --> 00:49:38,520 So let's say we have got this graph. 1174 00:49:38,520 --> 00:49:40,850 We've got some unknown node labeled u. 1175 00:49:40,850 --> 00:49:43,810 And we've got two known nodes in black. 1176 00:49:43,810 --> 00:49:46,250 And we want to systematically deduce for every example 1177 00:49:46,250 --> 00:49:50,170 like this, every u, what its annotation should be. 1178 00:49:50,170 --> 00:49:52,710 So I could just look at its neighbors, 1179 00:49:52,710 --> 00:49:54,949 and depending on how I set the window around it, 1180 00:49:54,949 --> 00:49:56,490 do I look at the immediate neighbors? 1181 00:49:56,490 --> 00:49:57,410 Do I go two out? 1182 00:49:57,410 --> 00:49:58,470 Do I go three out? 1183 00:49:58,470 --> 00:50:00,430 I could get different answers. 1184 00:50:00,430 --> 00:50:03,070 So if I set K equal to 1, I've got the unknown node, 1185 00:50:03,070 --> 00:50:04,710 but all the neighbors are also unknown. 1186 00:50:04,710 --> 00:50:07,570 If I go two steps out, then I pick up two knowns. 
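A sketch of that K-step neighborhood idea on a hypothetical graph: with K equal to 1, the unknown node u sees only unlabeled neighbors; with K equal to 2, it picks up two annotated nodes and can take a majority vote. The graph, labels, and vote rule here are illustrative, not any particular paper's exact method.

```python
from collections import Counter, deque

# Toy graph: unknown node "u" with unlabeled neighbors "a" and "b",
# and labeled nodes "c" and "d" two hops away. All hypothetical.
graph = {
    "u": ["a", "b"],
    "a": ["u", "c"],
    "b": ["u", "d"],
    "c": ["a"], "d": ["b"],
}
labels = {"c": "kinase", "d": "kinase"}  # only some nodes are annotated

def annotate(node, k):
    """Majority label among annotated nodes within k hops of `node`,
    or None if no labeled node is that close."""
    dist = {node: 0}
    q = deque([node])
    votes = Counter()
    while q:
        u = q.popleft()
        if u in labels and u != node:
            votes[labels[u]] += 1
        for v in graph.get(u, []):
            if v not in dist and dist[u] + 1 <= k:
                dist[v] = dist[u] + 1
                q.append(v)
    return votes.most_common(1)[0][0] if votes else None

print(annotate("u", 1))  # immediate neighbors are unlabeled -> None
print(annotate("u", 2))  # two hops out picks up the two known nodes
```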
1187 00:50:10,380 --> 00:50:13,940 Now there's a fundamental assumption going on here 1188 00:50:13,940 --> 00:50:17,300 that the node has the same function as its neighbors, 1189 00:50:17,300 --> 00:50:20,430 which is fine when the neighbors are homogeneous. 1190 00:50:20,430 --> 00:50:23,970 But what do you do when the neighbors are heterogeneous? 1191 00:50:23,970 --> 00:50:27,440 So in this case, I've got two unknowns u and v. 1192 00:50:27,440 --> 00:50:30,260 And if I just were to take the K nearest neighbors, 1193 00:50:30,260 --> 00:50:32,390 they would have the same neighborhood, right? 1194 00:50:32,390 --> 00:50:34,760 But I might have a prior expectation that u is more like 1195 00:50:34,760 --> 00:50:39,460 the black nodes, and v is more like the grey nodes. 1196 00:50:39,460 --> 00:50:42,270 So how do you choose the best annotation? 1197 00:50:42,270 --> 00:50:45,290 The K nearest neighbors is OK, but it's not optimal. 1198 00:50:45,290 --> 00:50:48,530 So here's one approach, which says the following. 1199 00:50:48,530 --> 00:50:51,750 I'm going to go through for every function, 1200 00:50:51,750 --> 00:50:54,094 every annotation in my database, separately. 1201 00:50:54,094 --> 00:50:56,260 And for each annotation, I'll set all the nodes that 1202 00:50:56,260 --> 00:50:59,180 have that annotation to plus 1 and every node 1203 00:50:59,180 --> 00:51:01,680 that doesn't have that annotation, either it's unknown 1204 00:51:01,680 --> 00:51:04,890 or it's got some other annotation, to minus 1. 1205 00:51:04,890 --> 00:51:06,570 And then for every unknown, I'm going 1206 00:51:06,570 --> 00:51:09,960 to try to find the setting which is going 1207 00:51:09,960 --> 00:51:12,550 to maximize the sum of products. 1208 00:51:12,550 --> 00:51:15,570 So we're going to take the sum of the products of u 1209 00:51:15,570 --> 00:51:18,200 and all of its neighbors. 
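As a sketch of that sum-of-products rule on a hypothetical neighborhood (two neighbors carrying the annotation, one not), the unknown node simply takes the sign that maximizes the sum of products with its neighbors' current plus/minus 1 states.

```python
# Sum-of-products rule for one annotation: known nodes carry +1
# (annotated) or -1 (not); an unknown node is assigned the sign that
# maximizes the sum over its neighbors of s_u * s_v. Since the sum is
# linear in s_u, that's just the sign of the neighbors' total.
# Toy graph and states are hypothetical.
graph = {
    "u": ["a", "b", "c"],
    "a": ["u"], "b": ["u"], "c": ["u"],
}
state = {"a": +1, "b": +1, "c": -1, "u": +1}  # u's start is arbitrary

def local_update(node):
    """Best +/-1 setting for `node` given its neighbors' states."""
    score = sum(state[v] for v in graph[node])
    return +1 if score >= 0 else -1

state["u"] = local_update("u")
print(state["u"])  # two +1 neighbors outvote one -1 neighbor -> +1
```

Repeating this update node by node is exactly the local optimization discussed next, which is why it can get stuck short of the global optimum.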
1210 00:51:18,200 --> 00:51:21,742 So in this setting, if I set u to plus 1, 1211 00:51:21,742 --> 00:51:23,908 then I do better than if I set it to minus 1, right? 1212 00:51:27,260 --> 00:51:29,870 Because I'll get plus 1 plus 1 minus 1. 1213 00:51:29,870 --> 00:51:32,760 So that will be better than setting it to minus 1. 1214 00:51:32,760 --> 00:51:33,260 Yes. 1215 00:51:33,260 --> 00:51:35,627 AUDIENCE: Are we ignoring all the edge weights? 1216 00:51:35,627 --> 00:51:37,960 PROFESSOR: In this case, we're ignoring the edge weights. 1217 00:51:37,960 --> 00:51:40,090 We'll come back to using the edge weights later. 1218 00:51:40,090 --> 00:51:42,052 But this was done with an unweighted graph. 1219 00:51:42,052 --> 00:51:44,052 AUDIENCE: [INAUDIBLE] [? nearest neighborhood ?] 1220 00:51:44,052 --> 00:51:45,489 they're using it then? 1221 00:51:45,489 --> 00:51:47,780 PROFESSOR: So here they're using the nearest neighbors. 1222 00:51:47,780 --> 00:51:50,004 That's right, with no cutoff, right? 1223 00:51:50,004 --> 00:51:50,795 So any interaction. 1224 00:51:56,800 --> 00:51:59,982 So then we could iterate this to convergence. 1225 00:51:59,982 --> 00:52:01,190 That's one problem with this. 1226 00:52:01,190 --> 00:52:02,840 But maybe a more fundamental problem 1227 00:52:02,840 --> 00:52:05,880 is that you're never going to get the best overall solution 1228 00:52:05,880 --> 00:52:08,730 by this local optimization procedure. 1229 00:52:08,730 --> 00:52:10,950 So consider a setting like this. 1230 00:52:10,950 --> 00:52:13,700 Remember, I'm trying to maximize the sum 1231 00:52:13,700 --> 00:52:16,950 of the product of the settings for neighbors. 1232 00:52:16,950 --> 00:52:21,330 So how could I ever-- it seems plausible that all A, B, and C 1233 00:52:21,330 --> 00:52:24,280 here, should have the red annotation, right? 1234 00:52:24,280 --> 00:52:27,000 But if I set C to red, that doesn't help me. 
1235 00:52:27,000 --> 00:52:29,250 If I set A to red, that doesn't help me. 1236 00:52:29,250 --> 00:52:32,190 If I set B to red, it makes things worse. 1237 00:52:32,190 --> 00:52:34,820 So no local change is going to get me where I want to go. 1238 00:52:37,374 --> 00:52:38,540 So let's think for a second. 1239 00:52:38,540 --> 00:52:40,340 What algorithms have we already seen 1240 00:52:40,340 --> 00:52:42,540 that could help us get to the right answer? 1241 00:52:42,540 --> 00:52:45,320 We can't get here by local optimization. 1242 00:52:45,320 --> 00:52:48,170 We need to find the global minimum, not the local minimum. 1243 00:52:48,170 --> 00:52:49,670 So what algorithms have we seen that 1244 00:52:49,670 --> 00:52:51,140 help us find that global minimum? 1245 00:52:54,612 --> 00:52:58,180 Yeah, exactly-- simulated annealing. 1246 00:52:58,180 --> 00:53:01,040 So the simulated annealing version in this setting 1247 00:53:01,040 --> 00:53:02,700 is as follows. 1248 00:53:02,700 --> 00:53:04,280 I initialize the graph. 1249 00:53:04,280 --> 00:53:06,850 I pick a neighboring node, v, that I'm going to add. 1250 00:53:06,850 --> 00:53:09,830 Say we'll turn one of these red. 1251 00:53:09,830 --> 00:53:16,370 I check the value of that sum of the products for this new one. 1252 00:53:16,370 --> 00:53:19,864 And if it's improving things, I keep it. 1253 00:53:19,864 --> 00:53:21,905 But the critical thing is, if it doesn't improve, 1254 00:53:21,905 --> 00:53:23,530 if it makes things worse, I still 1255 00:53:23,530 --> 00:53:24,780 keep it with some probability. 1256 00:53:24,780 --> 00:53:27,480 It's based on how bad things have gotten. 1257 00:53:27,480 --> 00:53:29,295 And by doing this, we can climb the hill 1258 00:53:29,295 --> 00:53:33,630 and get over to some global optimum. 1259 00:53:33,630 --> 00:53:35,660 So we saw simulated annealing before. 1260 00:53:35,660 --> 00:53:36,490 In what context? 
1261 00:53:36,490 --> 00:53:38,386 In the side chain placement problem. 1262 00:53:38,386 --> 00:53:39,510 Here we're seeing it again. 1263 00:53:39,510 --> 00:53:40,370 It's quite broad. 1264 00:53:40,370 --> 00:53:42,299 Any time you've got a local optimization that 1265 00:53:42,299 --> 00:53:43,840 doesn't get you where you need to go, 1266 00:53:43,840 --> 00:53:45,114 you need global optimization. 1267 00:53:45,114 --> 00:53:46,530 You can think simulated annealing. 1268 00:53:46,530 --> 00:53:49,761 It's quite often a plausible way to go. 1269 00:53:49,761 --> 00:53:50,260 All right. 1270 00:53:50,260 --> 00:53:53,092 So this is one approach for annotation. 1271 00:53:53,092 --> 00:53:55,050 We also wanted to see whether we could discover 1272 00:53:55,050 --> 00:53:56,890 inherent structure in these graphs. 1273 00:53:56,890 --> 00:53:58,600 So often, we'll be interested in trying 1274 00:53:58,600 --> 00:54:00,600 to find clusters in a graph. 1275 00:54:00,600 --> 00:54:03,680 Some graphs have obvious structures in them. 1276 00:54:03,680 --> 00:54:05,940 Other graphs, it's a little less obvious. 1277 00:54:05,940 --> 00:54:07,780 What algorithms exist for trying to do this? 1278 00:54:07,780 --> 00:54:10,521 We're going to look at two relatively straightforward 1279 00:54:10,521 --> 00:54:11,020 ways. 1280 00:54:11,020 --> 00:54:13,010 One is called edge betweenness clustering 1281 00:54:13,010 --> 00:54:16,730 and the other one is a Markov process. 1282 00:54:16,730 --> 00:54:19,160 Edge betweenness, I think, is the most intuitive. 1283 00:54:19,160 --> 00:54:25,860 So I look at each edge, and I ask 1284 00:54:25,860 --> 00:54:28,370 for all pairs of nodes in the graph, 1285 00:54:28,370 --> 00:54:30,360 does the shortest path between those nodes 1286 00:54:30,360 --> 00:54:31,395 pass through this edge? 
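A minimal sketch of that annealing loop in Python, on the two-color (+1/-1) annotation objective described above. The graph, node names, temperature schedule, and step count are all invented for illustration:

```python
import math
import random

def anneal_labels(nodes, edges, steps=5000, t0=2.0, cooling=0.999, seed=0):
    """Simulated annealing over +/-1 node labels, maximizing the
    sum over edges of s[u] * s[v] (agreement between neighbors)."""
    rng = random.Random(seed)
    nbrs = {v: [] for v in nodes}
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    s = {v: rng.choice([-1, 1]) for v in nodes}
    score = sum(s[u] * s[v] for u, v in edges)
    best_score = score
    t = t0
    for _ in range(steps):
        v = rng.choice(nodes)
        # Flipping v only changes the terms on edges incident to v.
        delta = -2 * s[v] * sum(s[w] for w in nbrs[v])
        # Always accept improvements; accept a worse move with
        # probability exp(delta / t), which is what lets us climb
        # out of the local optima that defeat greedy flipping.
        if delta >= 0 or rng.random() < math.exp(delta / t):
            s[v] = -s[v]
            score += delta
            best_score = max(best_score, score)
        t *= cooling  # cool the temperature
    return best_score

# Path graph A - B - C: the optimum is all nodes the same color,
# which makes both edge terms +1, for a score of 2.
print(anneal_labels(["A", "B", "C"], [("A", "B"), ("B", "C")]))
```

With the temperature decaying toward zero, late moves become effectively greedy, so the loop settles into whatever basin it has climbed into.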
1287 00:54:35,270 --> 00:54:38,790 So if I look at this edge, very few shortest paths 1288 00:54:38,790 --> 00:54:40,240 go through this edge, right? 1289 00:54:40,240 --> 00:54:42,640 Just the shortest path for those two nodes. 1290 00:54:42,640 --> 00:54:47,759 But if I look at this edge, all of the shortest paths 1291 00:54:47,759 --> 00:54:50,050 between any node on this side and any node on this side 1292 00:54:50,050 --> 00:54:51,325 have to pass through there. 1293 00:54:51,325 --> 00:54:55,090 So that has a high betweenness. 1294 00:54:55,090 --> 00:54:58,750 So if I want a cluster, I can go through my graph. 1295 00:54:58,750 --> 00:55:01,400 I can compute betweenness. 1296 00:55:01,400 --> 00:55:03,470 I take the edge that has the highest betweenness, 1297 00:55:03,470 --> 00:55:05,330 and I remove it from my graph. 1298 00:55:05,330 --> 00:55:07,720 And then I repeat. 1299 00:55:07,720 --> 00:55:09,960 And I'll be slowly breaking my graph down 1300 00:55:09,960 --> 00:55:14,050 into chunks that are relatively more connected internally 1301 00:55:14,050 --> 00:55:15,890 than they are to things in other pieces. 1302 00:55:19,430 --> 00:55:20,360 Any questions? 1303 00:55:20,360 --> 00:55:21,860 So that's an entire edge betweenness 1304 00:55:21,860 --> 00:55:22,860 clustering algorithm. 1305 00:55:22,860 --> 00:55:23,840 Pretty straightforward. 1306 00:55:27,480 --> 00:55:32,590 Now an alternative is a Markov clustering method. 1307 00:55:32,590 --> 00:55:34,430 And the Markov clustering method is 1308 00:55:34,430 --> 00:55:37,780 based on the idea of random walks in the graph. 1309 00:55:37,780 --> 00:55:41,070 So again, let's try to develop some intuition here. 
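That loop can be sketched in plain Python for an unweighted graph. As a simplification, this counts one BFS shortest path per node pair (the full Girvan-Newman method counts all shortest paths), and the two-triangle example graph is invented:

```python
from collections import deque, defaultdict

def one_shortest_path(adj, s, t):
    """One BFS shortest path from s to t (graph assumed connected)."""
    prev = {s: None}
    q = deque([s])
    while q:
        u = q.popleft()
        if u == t:
            path = []
            while u is not None:
                path.append(u)
                u = prev[u]
            return path[::-1]
        for w in adj[u]:
            if w not in prev:
                prev[w] = u
                q.append(w)

def betweenness_split(nodes, edges):
    """Repeatedly remove the highest-betweenness edge until the
    graph falls into two connected components, then return them."""
    edges = [frozenset(e) for e in edges]
    while True:
        adj = defaultdict(list)
        for e in edges:
            u, v = tuple(e)
            adj[u].append(v)
            adj[v].append(u)
        # Check whether the graph has already split in two.
        seen = {nodes[0]}
        q = deque([nodes[0]])
        while q:
            u = q.popleft()
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    q.append(w)
        if len(seen) < len(nodes):
            return sorted(seen), sorted(v for v in nodes if v not in seen)
        # Count, for every node pair, the edges on its shortest path.
        count = defaultdict(int)
        for i, s in enumerate(nodes):
            for t in nodes[i + 1:]:
                path = one_shortest_path(adj, s, t)
                for a, b in zip(path, path[1:]):
                    count[frozenset((a, b))] += 1
        edges.remove(max(edges, key=lambda e: count[e]))

# Two triangles joined by the bridge c-d: every cross-pair shortest
# path uses the bridge, so it has the highest betweenness and goes first.
left, right = betweenness_split(
    list("abcdef"),
    [("a", "b"), ("b", "c"), ("c", "a"),
     ("d", "e"), ("e", "f"), ("f", "d"), ("c", "d")])
print(left, right)
```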
1310 00:55:41,070 --> 00:55:42,610 If I start at some node over here, 1311 00:55:42,610 --> 00:55:46,220 and I randomly wander across this graph, 1312 00:55:46,220 --> 00:55:48,991 I'm more likely to stay on the left-hand side 1313 00:55:48,991 --> 00:55:51,490 than I am to move all the way across to the right-hand side, 1314 00:55:51,490 --> 00:55:54,060 correct? 1315 00:55:54,060 --> 00:55:56,090 So can I formalize that and actually come up 1316 00:55:56,090 --> 00:55:58,500 with a measure of how often any node will visit 1317 00:55:58,500 --> 00:56:01,410 any other and then use that to cluster the graph? 1318 00:56:05,990 --> 00:56:07,720 So remember our adjacency matrix, 1319 00:56:07,720 --> 00:56:12,020 which just represented which nodes were connected to which. 1320 00:56:12,020 --> 00:56:16,280 And what happens if I multiply the adjacency matrix by itself? 1321 00:56:16,280 --> 00:56:19,290 So I raise it to some power. 1322 00:56:19,290 --> 00:56:23,750 Well, if I multiply the adjacency matrix by itself 1323 00:56:23,750 --> 00:56:27,770 just once, the squared adjacency matrix has the property 1324 00:56:27,770 --> 00:56:30,760 that it tells me how many paths of length 2 1325 00:56:30,760 --> 00:56:33,150 exist between any two nodes. 1326 00:56:33,150 --> 00:56:36,160 So the adjacency matrix told me how many paths of length 1 1327 00:56:36,160 --> 00:56:36,851 exist. 1328 00:56:36,851 --> 00:56:37,350 Right? 1329 00:56:37,350 --> 00:56:38,704 You're directly connected. 1330 00:56:38,704 --> 00:56:40,120 If I square the adjacency matrix, 1331 00:56:40,120 --> 00:56:43,510 it tells me how many paths of length 2 exist. 1332 00:56:43,510 --> 00:56:46,790 The N-th power tells me how many paths of length N exist. 1333 00:56:46,790 --> 00:56:48,150 So let's see if that works. 1334 00:56:48,150 --> 00:56:49,710 This claims that there are exactly 1335 00:56:49,710 --> 00:56:53,334 two paths that connect node 2 to node 2. 
1336 00:56:53,334 --> 00:56:54,375 What are those two paths? 1337 00:56:59,060 --> 00:57:00,180 Connect node 2 to node 2. 1338 00:57:00,180 --> 00:57:01,930 I go here, and I go back. 1339 00:57:01,930 --> 00:57:06,030 That's a path of length 2, and this is a path of length 2. 1340 00:57:06,030 --> 00:57:08,200 And there are zero paths of length 2 1341 00:57:08,200 --> 00:57:13,210 that connect node 2 to node 3, because 1, 2. 1342 00:57:13,210 --> 00:57:15,110 I'm not back at 3. 1343 00:57:15,110 --> 00:57:18,390 So in general, A to the N equals m 1344 00:57:18,390 --> 00:57:23,220 if there exist exactly m paths of length N between those two 1345 00:57:23,220 --> 00:57:24,020 nodes. 1346 00:57:24,020 --> 00:57:25,160 So how does this help me? 1347 00:57:25,160 --> 00:57:28,830 Well, you take that idea of the N-th power of the adjacency 1348 00:57:28,830 --> 00:57:32,360 matrix and convert it to a transition probability matrix, 1349 00:57:32,360 --> 00:57:34,510 simply by normalizing. 1350 00:57:34,510 --> 00:57:36,969 So if I were to do a random walk in this graph, 1351 00:57:36,969 --> 00:57:39,010 what's the probability that I'll move from node i 1352 00:57:39,010 --> 00:57:41,420 to node j in a certain number of steps? 1353 00:57:41,420 --> 00:57:43,330 That's what I want to compute. 1354 00:57:43,330 --> 00:57:45,779 So I need to have a stochastic matrix, 1355 00:57:45,779 --> 00:57:47,195 where the sum of the probabilities 1356 00:57:47,195 --> 00:57:50,426 for any transition is 1. 1357 00:57:50,426 --> 00:57:51,550 I have to end up somewhere. 1358 00:57:51,550 --> 00:57:53,370 I either end up back in myself, or I end up 1359 00:57:53,370 --> 00:57:54,203 at some other node. 1360 00:57:54,203 --> 00:57:56,810 I'm just going to take that adjacency matrix 1361 00:57:56,810 --> 00:57:59,370 and normalize the columns. 1362 00:57:59,370 --> 00:58:03,140 And then that gives me the stochastic matrix. 
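The path-counting claim is easy to check numerically. A small sketch, where the 1 - 2 - 3 path graph mirrors the example (0-based indexing, so node 2 is row/column 1):

```python
def matmul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Adjacency matrix for the path graph 1 - 2 - 3.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]

A2 = matmul(A, A)  # A2[i][j] = number of length-2 paths i -> j
print(A2[1][1])    # node 2 back to node 2: out-and-back two ways
print(A2[1][2])    # node 2 to node 3 in exactly two steps: none

# Column-normalizing A instead gives the one-step transition matrix,
# where each column sums to 1.
T = [[A[i][j] / sum(A[k][j] for k in range(3)) for j in range(3)]
     for i in range(3)]
```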
1363 00:58:03,140 --> 00:58:05,460 And then I can exponentiate the stochastic matrix 1364 00:58:05,460 --> 00:58:08,290 to figure out my probability of moving from any node 1365 00:58:08,290 --> 00:58:11,594 to any other in a certain number of steps. 1366 00:58:11,594 --> 00:58:12,510 Any questions on that? 1367 00:58:15,201 --> 00:58:15,700 OK. 1368 00:58:18,790 --> 00:58:23,830 So if we simply keep multiplying this stochastic matrix, 1369 00:58:23,830 --> 00:58:26,450 we'll get the probabilities for increasing numbers of moves. 1370 00:58:26,450 --> 00:58:28,700 But it doesn't give us sharp partitions of the matrix. 1371 00:58:28,700 --> 00:58:31,034 So to do a Markov clustering, we alternate exponentiation 1372 00:58:31,034 --> 00:58:32,950 of this matrix with what's called an inflation 1373 00:58:32,950 --> 00:58:35,720 operator, which is the following. 1374 00:58:38,500 --> 00:58:43,930 This inflation operator takes the r-th power 1375 00:58:43,930 --> 00:58:48,100 of each entry of the matrix and puts in the denominator 1376 00:58:48,100 --> 00:58:51,945 the sum of the r-th powers of the entries in that column. 1377 00:58:51,945 --> 00:58:52,820 So here's an example. 1378 00:58:52,820 --> 00:58:57,275 Let's say I've got two probabilities-- 0.9 and 0.1. 1379 00:58:57,275 --> 00:59:01,416 When I inflate it, I square the numerator, 1380 00:59:01,416 --> 00:59:03,290 and I square each element of the denominator. 1381 00:59:03,290 --> 00:59:09,210 Now I've gone from 0.9 to roughly 0.99 and 0.1 to roughly 0.01. 1382 00:59:09,210 --> 00:59:11,380 So this inflation operator exaggerates 1383 00:59:11,380 --> 00:59:14,314 all my probabilities and makes the higher probabilities more 1384 00:59:14,314 --> 00:59:16,480 probable and makes the lower probabilities even less 1385 00:59:16,480 --> 00:59:18,910 probable. 
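In code, the inflation step for one column is just an entrywise power followed by renormalization. A sketch reproducing the 0.9/0.1 example (the exact inflated values come out near 0.99 and 0.01):

```python
def inflate_column(col, r=2):
    """Raise each entry to the r-th power, then renormalize so the
    column still sums to 1 -- the Markov clustering inflation step."""
    powered = [p ** r for p in col]
    z = sum(powered)
    return [p / z for p in powered]

print(inflate_column([0.9, 0.1]))  # roughly [0.988, 0.012]
```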
1386 00:59:18,910 --> 00:59:20,430 So I take this matrix that 1387 00:59:20,430 --> 00:59:22,280 represents random walks in my graph, 1388 00:59:22,280 --> 00:59:24,510 and I exaggerate it with the inflation operator. 1389 00:59:24,510 --> 00:59:27,310 And that takes the basic clustering, 1390 00:59:27,310 --> 00:59:30,990 and it makes it more compact. 1391 00:59:30,990 --> 00:59:35,540 So the algorithm for this Markov clustering is as follows. 1392 00:59:35,540 --> 00:59:37,220 I start with a graph. 1393 00:59:37,220 --> 00:59:38,370 I add loops to the graph. 1394 00:59:38,370 --> 00:59:39,237 Why do I add loops? 1395 00:59:39,237 --> 00:59:41,820 Because I need some probability that I stay in the same place, 1396 00:59:41,820 --> 00:59:42,910 right? 1397 00:59:42,910 --> 00:59:44,425 And in a normal adjacency matrix, 1398 00:59:44,425 --> 00:59:45,800 you can't stay in the same place. 1399 00:59:45,800 --> 00:59:47,734 You have to go somewhere. 1400 00:59:47,734 --> 00:59:48,400 So I add a loop. 1401 00:59:48,400 --> 00:59:51,510 So there's always a self loop. 1402 00:59:51,510 --> 00:59:56,680 Then I set the inflation parameter to some value. 1403 00:59:56,680 --> 01:00:01,176 M_1 is the matrix of random walks in the original graph. 1404 01:00:01,176 --> 01:00:02,720 I multiply that. 1405 01:00:02,720 --> 01:00:05,110 I inflate it. 1406 01:00:05,110 --> 01:00:07,550 And then I find the difference. 1407 01:00:07,550 --> 01:00:11,480 And I do that until the difference in this 1408 01:00:11,480 --> 01:00:15,000 matrix between iterations gets below some value. 1409 01:00:15,000 --> 01:00:17,770 And what I end up with then are relatively sharp partitions 1410 01:00:17,770 --> 01:00:20,734 of the overall structure. 1411 01:00:20,734 --> 01:00:24,710 So I'll show you an example of how that works. 1412 01:00:24,710 --> 01:00:26,260 So in this case, the authors were 1413 01:00:26,260 --> 01:00:32,210 using a matrix where the nodes represented proteins. 
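Putting the pieces together, a compact sketch of the whole loop in plain Python: self-loops, column normalization, then alternating expansion and inflation until the matrix stops changing. The two-triangle toy graph and the attractor-based cluster read-out are illustrative choices, not the exact implementation from any particular paper:

```python
def mcl(adj, r=2, max_iter=100, tol=1e-8):
    """Markov clustering sketch on an adjacency matrix (list of rows)."""
    n = len(adj)
    # Add self-loops so a walker can stay put, then column-normalize.
    M = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    for j in range(n):
        z = sum(M[i][j] for i in range(n))
        for i in range(n):
            M[i][j] /= z
    for _ in range(max_iter):
        # Expansion: square the matrix (longer random walks).
        E = [[sum(M[i][k] * M[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: entrywise r-th power, renormalize each column.
        for j in range(n):
            z = sum(E[i][j] ** r for i in range(n))
            for i in range(n):
                E[i][j] = E[i][j] ** r / z
        diff = max(abs(E[i][j] - M[i][j]) for i in range(n) for j in range(n))
        M = E
        if diff < tol:
            break
    # Read out clusters: group columns by their heaviest row (attractor).
    clusters = {}
    for j in range(n):
        attractor = max(range(n), key=lambda i: M[i][j])
        clusters.setdefault(attractor, []).append(j)
    return sorted(sorted(c) for c in clusters.values())

# Two triangles (0,1,2) and (3,4,5) joined by the bridge edge 2-3.
A = [[0, 1, 1, 0, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [1, 1, 0, 1, 0, 0],
     [0, 0, 1, 0, 1, 1],
     [0, 0, 0, 1, 0, 1],
     [0, 0, 0, 1, 1, 0]]
print(mcl(A))
```

On this graph the walk probabilities concentrate within each triangle, so the partition recovers the two triangles.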
1414 01:00:32,210 --> 01:00:34,225 The edges represented BLAST hits. 1415 01:00:34,225 --> 01:00:35,990 And what they wanted to do was find 1416 01:00:35,990 --> 01:00:39,610 families of proteins that had sequence 1417 01:00:39,610 --> 01:00:40,770 similarity to each other. 1418 01:00:40,770 --> 01:00:44,152 But they didn't want it to be entirely dominated by domains. 1419 01:00:44,152 --> 01:00:46,610 So they figured that this graph structure would be helpful, 1420 01:00:46,610 --> 01:00:48,630 because you'd get-- for any protein, 1421 01:00:48,630 --> 01:00:53,262 there'd be edges, not just to things 1422 01:00:53,262 --> 01:00:55,470 that had similar common domains, but also 1423 01:00:55,470 --> 01:00:59,670 edges connecting it to other proteins as well. 1424 01:00:59,670 --> 01:01:03,750 So in the original graph, the edges are these BLAST values. 1425 01:01:03,750 --> 01:01:05,980 They come up with the transition matrix. 1426 01:01:05,980 --> 01:01:08,170 They convert it into the Markov matrix, 1427 01:01:08,170 --> 01:01:10,540 and they carry out that exponentiation. 1428 01:01:10,540 --> 01:01:12,680 And what they end up with are clusters 1429 01:01:12,680 --> 01:01:17,190 where any individual domain can appear in multiple clusters. 1430 01:01:17,190 --> 01:01:20,060 The clusters are determined not just by the highest BLAST hit, 1431 01:01:20,060 --> 01:01:22,760 but by the whole network property of what other proteins 1432 01:01:22,760 --> 01:01:24,940 they're connected to. 1433 01:01:24,940 --> 01:01:28,300 And it's also been done with a network, where the underlying 1434 01:01:28,300 --> 01:01:30,120 network represents gene expression, 1435 01:01:30,120 --> 01:01:33,330 and edges between two genes represent the degree 1436 01:01:33,330 --> 01:01:37,480 of correlation of the expression across a very large data 1437 01:01:37,480 --> 01:01:39,980 set for 61 mouse tissues. 
1438 01:01:39,980 --> 01:01:42,050 And once again, you take the overall graph, 1439 01:01:42,050 --> 01:01:44,140 and you can break it down into clusters, 1440 01:01:44,140 --> 01:01:46,540 where you can find functional annotations 1441 01:01:46,540 --> 01:01:47,570 for specific clusters. 1442 01:01:50,320 --> 01:01:54,210 Any questions then on the Markov clustering? 1443 01:01:54,210 --> 01:01:55,930 So these are two separate ways of looking 1444 01:01:55,930 --> 01:01:57,962 at the underlying structure of a graph. 1445 01:01:57,962 --> 01:02:00,170 We had the edge betweenness clustering and the Markov 1446 01:02:00,170 --> 01:02:00,862 clustering. 1447 01:02:00,862 --> 01:02:03,070 Now when you do this, you have to make some decision: 1448 01:02:03,070 --> 01:02:04,500 I've found this cluster. 1449 01:02:04,500 --> 01:02:06,260 Now how do I decide what it's doing? 1450 01:02:06,260 --> 01:02:08,350 So you need to do some sort of annotation. 1451 01:02:08,350 --> 01:02:09,940 So once I have a cluster, how am I 1452 01:02:09,940 --> 01:02:13,840 going to assign a function to that cluster? 1453 01:02:13,840 --> 01:02:16,220 So one thing I could do would be to look 1454 01:02:16,220 --> 01:02:18,590 at things that already have an annotation. 1455 01:02:18,590 --> 01:02:19,680 So I got some cluster. 1456 01:02:19,680 --> 01:02:21,110 Maybe two members of this cluster 1457 01:02:21,110 --> 01:02:23,540 have an annotation and two members of this one. 1458 01:02:23,540 --> 01:02:25,110 And that's fine. 1459 01:02:25,110 --> 01:02:26,910 But what do I do when a cluster has 1460 01:02:26,910 --> 01:02:29,840 a whole bunch of different annotations? 1461 01:02:29,840 --> 01:02:31,885 So I could be arbitrary. 1462 01:02:31,885 --> 01:02:33,930 I could just take the one that's the most common. 1463 01:02:33,930 --> 01:02:36,471 But a nicer way to do it is with the hypergeometric distribution 1464 01:02:36,471 --> 01:02:38,620 that you saw in the earlier part of the semester. 
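That enrichment test is easy to sketch: with N genes in the graph, K of them carrying an annotation, and k of the n genes in a cluster carrying it, the hypergeometric tail gives the probability of seeing an overlap at least that large by chance. All the counts below are invented for illustration:

```python
from math import comb

def enrichment_pvalue(k, n, K, N):
    """P(X >= k) for a hypergeometric draw: at least k annotated
    genes in a cluster of n, when K of the N genes in the whole
    graph carry the annotation."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / comb(N, n)

# 4 of the 5 genes in a cluster share "DNA repair", which annotates
# only 20 of the 1,000 genes in the network -- a tiny p-value, so the
# cluster gets that annotation.
print(enrichment_pvalue(4, 5, 20, 1000))
```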
1465 01:02:43,950 --> 01:02:46,790 So these are all ways of clustering the underlying graph 1466 01:02:46,790 --> 01:02:48,420 without any reference to specific data 1467 01:02:48,420 --> 01:02:50,540 for a particular condition that you're interested in. 1468 01:02:50,540 --> 01:02:51,915 A slightly harder problem is when 1469 01:02:51,915 --> 01:02:53,680 I do have those specific data, and I'd 1470 01:02:53,680 --> 01:02:55,620 like to find a piece of the network that's 1471 01:02:55,620 --> 01:02:57,732 most relevant to those specific data. 1472 01:02:57,732 --> 01:02:59,690 So it could be different in different settings. 1473 01:02:59,690 --> 01:03:01,190 Maybe the part of the network that's 1474 01:03:01,190 --> 01:03:02,869 relevant in the cancer setting is not 1475 01:03:02,869 --> 01:03:05,160 the part of the network that's relevant in the diabetes 1476 01:03:05,160 --> 01:03:07,640 setting. 1477 01:03:07,640 --> 01:03:10,160 So one way to think about this is that I have the network, 1478 01:03:10,160 --> 01:03:12,000 and I paint onto it my expression 1479 01:03:12,000 --> 01:03:13,620 data or my proteomic data. 1480 01:03:13,620 --> 01:03:16,850 And then I want to find chunks of the network that 1481 01:03:16,850 --> 01:03:18,640 are enriched in activity. 1482 01:03:18,640 --> 01:03:21,600 So this is sometimes called the active subgraph problem. 1483 01:03:21,600 --> 01:03:23,530 And how do we find the active subgraph? 1484 01:03:23,530 --> 01:03:25,510 Well, it's not that different from the problem 1485 01:03:25,510 --> 01:03:27,190 that we just looked at. 1486 01:03:27,190 --> 01:03:31,030 So if I want to figure out a piece of the network that's 1487 01:03:31,030 --> 01:03:33,790 active, I could just take the things that are immediately 1488 01:03:33,790 --> 01:03:35,160 connected to each other. 1489 01:03:35,160 --> 01:03:37,040 That doesn't give me the global picture. 
1490 01:03:37,040 --> 01:03:38,870 So instead why don't I try to find 1491 01:03:38,870 --> 01:03:40,350 larger chunks of the network where 1492 01:03:40,350 --> 01:03:43,300 I can include some nodes for which I do not 1493 01:03:43,300 --> 01:03:45,240 have specific data? 1494 01:03:45,240 --> 01:03:46,877 And one way that's been done for that 1495 01:03:46,877 --> 01:03:48,710 is, again, the simulated annealing approach. 1496 01:03:48,710 --> 01:03:51,880 So you can try to find pieces of the network that 1497 01:03:51,880 --> 01:03:54,700 maximize the probability that all 1498 01:03:54,700 --> 01:03:57,790 the things in the subnetwork are active. 1499 01:04:00,302 --> 01:04:01,760 Another formulation of this problem 1500 01:04:01,760 --> 01:04:04,126 is something that's called the Steiner tree problem. 1501 01:04:04,126 --> 01:04:06,000 And in the Steiner tree, I want to find trees 1502 01:04:06,000 --> 01:04:10,060 in the network that consist of all the nodes that are active, 1503 01:04:10,060 --> 01:04:15,494 plus some nodes that are not, for which I have no data. 1504 01:04:15,494 --> 01:04:17,160 And those nodes for which I have no data 1505 01:04:17,160 --> 01:04:18,712 are called Steiner nodes. 1506 01:04:18,712 --> 01:04:20,920 And this was a problem that was looked at extensively 1507 01:04:20,920 --> 01:04:21,836 in telecommunications. 1508 01:04:21,836 --> 01:04:25,090 So if I want to wire up a bunch of buildings-- 1509 01:04:25,090 --> 01:04:29,130 back when people used wires-- say to give telephone service, 1510 01:04:29,130 --> 01:04:31,552 so I need to figure out what the minimum cost is 1511 01:04:31,552 --> 01:04:32,510 for wiring them all up. 1512 01:04:32,510 --> 01:04:35,580 And sometimes, that involves sticking a pole in the ground, 1513 01:04:35,580 --> 01:04:37,850 then having everybody communicate to that pole. 
1514 01:04:37,850 --> 01:04:43,000 So if I've got paying customers over here, 1515 01:04:43,000 --> 01:04:45,110 and I want to wire them to each other, 1516 01:04:45,110 --> 01:04:51,160 I could run wires between everybody. 1517 01:04:51,160 --> 01:04:52,330 But I don't have to. 1518 01:04:52,330 --> 01:04:55,290 If I stick a pole over here, then I don't need this wire, 1519 01:04:55,290 --> 01:04:58,194 and I don't need this wire, and I don't need this wire. 1520 01:04:58,194 --> 01:04:59,860 So this is what's called a Steiner node. 1521 01:05:06,940 --> 01:05:11,470 And so in graph theory, there are pretty efficient algorithms 1522 01:05:11,470 --> 01:05:16,030 for finding a Steiner graph-- the Steiner tree-- the smallest 1523 01:05:16,030 --> 01:05:17,600 tree that connects all of the nodes. 1524 01:05:17,600 --> 01:05:20,640 Now the problem in our setting is that we don't necessarily 1525 01:05:20,640 --> 01:05:22,371 want to connect every node, because we're 1526 01:05:22,371 --> 01:05:24,120 going to have in our data some things that 1527 01:05:24,120 --> 01:05:25,640 are false positives. 1528 01:05:25,640 --> 01:05:27,660 And if we connect too many things in our graph, 1529 01:05:27,660 --> 01:05:31,050 we end up with what are lovingly called "hairballs." 1530 01:05:31,050 --> 01:05:33,100 So I'll give you a specific example of that. 1531 01:05:33,100 --> 01:05:34,891 Here's some data that we were working with. 1532 01:05:34,891 --> 01:05:37,830 We had a relatively small number of experimental hits that 1533 01:05:37,830 --> 01:05:39,460 were detected as changing in a cancer 1534 01:05:39,460 --> 01:05:42,240 setting and the interactome graph. 
1535 01:05:42,240 --> 01:05:46,399 And if you simply look for the shortest path, 1536 01:05:46,399 --> 01:05:48,190 I should say, between the experimental hits 1537 01:05:48,190 --> 01:05:49,640 across the interactome, you end up 1538 01:05:49,640 --> 01:05:53,590 with something that looks very similar to the interactome. 1539 01:05:53,590 --> 01:05:56,286 So you start off with a relatively small set of nodes, 1540 01:05:56,286 --> 01:05:57,910 and you try to find the subnetwork that 1541 01:05:57,910 --> 01:05:59,060 includes everything. 1542 01:05:59,060 --> 01:06:02,290 And you get a giant graph. 1543 01:06:02,290 --> 01:06:04,060 And it's very hard to figure out what 1544 01:06:04,060 --> 01:06:06,179 to do with a graph that's this big. 1545 01:06:06,179 --> 01:06:07,970 I mean, there may be some information here, 1546 01:06:07,970 --> 01:06:09,939 but you've taken a relatively simple problem 1547 01:06:09,939 --> 01:06:12,230 to try to understand the relationship among these hits. 1548 01:06:12,230 --> 01:06:13,896 And you've turned it into a problem that 1549 01:06:13,896 --> 01:06:18,070 now involves hundreds and hundreds of nodes. 1550 01:06:18,070 --> 01:06:20,530 So these kinds of problems arise, as I said, 1551 01:06:20,530 --> 01:06:22,400 in part, because of noise in the data. 1552 01:06:22,400 --> 01:06:25,060 So some of these hits are not real. 1553 01:06:25,060 --> 01:06:26,740 And incorporating those, obviously, 1554 01:06:26,740 --> 01:06:30,470 makes me take very long paths in the interactome, 1555 01:06:30,470 --> 01:06:33,160 but also arises because of the noise in the interactome-- 1556 01:06:33,160 --> 01:06:35,910 both false positives and false negatives. 1557 01:06:35,910 --> 01:06:38,710 So I have two proteins that I'm trying to connect, 1558 01:06:38,710 --> 01:06:40,710 and there's a false positive in the interactome. 1559 01:06:40,710 --> 01:06:42,582 It's going to draw a line between them. 
1560 01:06:42,582 --> 01:06:44,540 If there's a false negative in the interactome, 1561 01:06:44,540 --> 01:06:47,780 maybe these things really do interact, but there's no edge. 1562 01:06:47,780 --> 01:06:49,720 If I force the algorithm to find a connection, 1563 01:06:49,720 --> 01:06:51,720 it probably can, because most of the interactome 1564 01:06:51,720 --> 01:06:54,400 is one giant connected component. 1565 01:06:54,400 --> 01:06:56,630 But it could be a very, very long path. 1566 01:06:56,630 --> 01:06:58,619 It goes through many other proteins. 1567 01:06:58,619 --> 01:07:00,910 And so in the process of trying to connect all my data, 1568 01:07:00,910 --> 01:07:02,555 I can get extremely large graphs. 1569 01:07:05,230 --> 01:07:07,304 So to avoid having giant networks-- 1570 01:07:07,304 --> 01:07:08,970 so on this projector, unfortunately, you 1571 01:07:08,970 --> 01:07:10,190 can't see this very well. 1572 01:07:10,190 --> 01:07:13,947 But there are a lot of edges among all the nodes here. 1573 01:07:13,947 --> 01:07:15,280 Most of you have your computers. 1574 01:07:15,280 --> 01:07:16,321 You can look at it there. 1575 01:07:16,321 --> 01:07:20,150 So in a Steiner tree approach, if my data 1576 01:07:20,150 --> 01:07:24,220 are the ones that are yellow, they're called terminals. 1577 01:07:24,220 --> 01:07:26,450 And the grey ones, I have no data. 1578 01:07:26,450 --> 01:07:30,770 And if I ask to try to solve the Steiner tree problem, 1579 01:07:30,770 --> 01:07:33,625 it's going to have to find a way to connect this node up 1580 01:07:33,625 --> 01:07:34,750 to the rest of the network. 1581 01:07:37,852 --> 01:07:39,310 But if this one's a false positive, 1582 01:07:39,310 --> 01:07:41,760 that's not the desired outcome. 
1583 01:07:41,760 --> 01:07:43,620 So there are optimization techniques 1584 01:07:43,620 --> 01:07:46,110 that actually allow me to tell the algorithm that it's 1585 01:07:46,110 --> 01:07:49,615 OK to leave out some of the data to get a more compact network. 1586 01:07:52,497 --> 01:07:54,330 So one of those approaches is called the prize 1587 01:07:54,330 --> 01:07:56,130 collecting Steiner tree problem. 1588 01:07:56,130 --> 01:07:58,580 And the idea here is the following. 1589 01:07:58,580 --> 01:08:01,370 For every node for which I have experimental data, 1590 01:08:01,370 --> 01:08:05,410 I associate with that node a prize. 1591 01:08:05,410 --> 01:08:07,590 The prize is larger, the more confident 1592 01:08:07,590 --> 01:08:10,640 I am that that node is relevant in the experiment. 1593 01:08:10,640 --> 01:08:12,900 And for every edge, I take the edge weight, 1594 01:08:12,900 --> 01:08:15,400 and I convert it into a cost. 1595 01:08:15,400 --> 01:08:19,439 If I have a high confidence edge, there's a low cost. 1596 01:08:19,439 --> 01:08:20,810 It's cheap. 1597 01:08:20,810 --> 01:08:24,520 Low confidence edges are going to be very expensive. 1598 01:08:24,520 --> 01:08:26,210 And now I ask the algorithm to try 1599 01:08:26,210 --> 01:08:28,790 to connect up all the things it can. 1600 01:08:28,790 --> 01:08:31,540 Every time it includes a node, it keeps 1601 01:08:31,540 --> 01:08:35,040 the prize, but it had to add an edge, so it pays the cost. 1602 01:08:35,040 --> 01:08:37,250 So there's a trade-off for every node. 1603 01:08:37,250 --> 01:08:41,260 So if the algorithm wants to include this node, 1604 01:08:41,260 --> 01:08:44,642 then it's going to pay the price for all the edges, 1605 01:08:44,642 --> 01:08:45,850 but it gets to keep the node's prize. 1606 01:08:45,850 --> 01:08:47,766 So the optimization function is the following. 1607 01:08:47,766 --> 01:08:53,220 For every vertex that's not in the tree, there's a penalty. 
1608 01:08:53,220 --> 01:08:55,319 And for every edge in the tree, there's a cost. 1609 01:08:55,319 --> 01:08:57,810 And you want to minimize the sum of these two terms. 1610 01:08:57,810 --> 01:09:01,282 You want to minimize the edge costs you pay for. 1611 01:09:01,282 --> 01:09:02,740 And you want to minimize the number 1612 01:09:02,740 --> 01:09:04,705 of prizes you leave behind. 1613 01:09:04,705 --> 01:09:05,695 Is that clear? 1614 01:09:12,140 --> 01:09:15,439 So then the algorithm can, depending on the optimization 1615 01:09:15,439 --> 01:09:19,492 terms, figure out: is it more of a benefit to include this node, 1616 01:09:19,492 --> 01:09:21,950 keep the prize, and pay all the edge costs, or the opposite? 1617 01:09:21,950 --> 01:09:23,044 Throw it out. 1618 01:09:23,044 --> 01:09:24,710 You don't get to keep the prize, but you 1619 01:09:24,710 --> 01:09:26,560 don't have to pay the edge costs. 1620 01:09:26,560 --> 01:09:28,810 And so that turns these very, very large networks 1621 01:09:28,810 --> 01:09:30,350 into relatively compact ones. 1622 01:09:30,350 --> 01:09:32,910 Now solving this problem is actually rather computationally 1623 01:09:32,910 --> 01:09:33,899 challenging. 1624 01:09:33,899 --> 01:09:36,415 You can do it with integer linear programming, 1625 01:09:36,415 --> 01:09:38,579 but it takes a huge amount of memory. 1626 01:09:38,579 --> 01:09:40,620 There's also a message passing approach. 1627 01:09:40,620 --> 01:09:42,800 If you're interested in the underlying algorithms, 1628 01:09:42,800 --> 01:09:45,990 you can look at some of these papers. 1629 01:09:45,990 --> 01:09:47,750 So what happens when you actually do this? 1630 01:09:47,750 --> 01:09:49,920 So that hairball that I showed you before 1631 01:09:49,920 --> 01:09:53,160 consisted of a very small initial data set. 
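Since the exact solvers are heavyweight, a brute-force sketch over a tiny graph makes the objective concrete. The graph, prizes, and costs below are invented, with one low-prize node behind an expensive edge standing in for a false positive:

```python
from itertools import combinations

def is_tree(es):
    """True if the edge set forms a single connected acyclic graph."""
    vs = {v for e in es for v in e}
    if len(es) != len(vs) - 1:
        return False
    seen, stack = set(), [next(iter(vs))]
    while stack:
        u = stack.pop()
        if u in seen:
            continue
        seen.add(u)
        for a, b in es:
            if a == u:
                stack.append(b)
            elif b == u:
                stack.append(a)
    return seen == vs

def pcst_brute_force(nodes, edges, cost, prize):
    """Minimize (prizes of excluded nodes) + (costs of used edges)
    over all candidate trees -- feasible only for toy graphs; the
    real problem is solved with ILP or message passing."""
    best_obj, best_tree = sum(prize.values()), ()  # empty tree baseline
    for r in range(1, len(edges) + 1):
        for es in combinations(edges, r):
            if not is_tree(es):
                continue
            vs = {v for e in es for v in e}
            obj = (sum(p for v, p in prize.items() if v not in vs)
                   + sum(cost[e] for e in es))
            if obj < best_obj:
                best_obj, best_tree = obj, es
    return best_obj, best_tree

# Three confident hits A, B, C around a Steiner node S (no prize),
# plus a dubious hit D reachable only through an expensive edge.
nodes = ["A", "B", "C", "S", "D"]
edges = [("A", "S"), ("B", "S"), ("C", "S"), ("C", "D")]
cost = {("A", "S"): 1, ("B", "S"): 1, ("C", "S"): 1, ("C", "D"): 5}
prize = {"A": 10, "B": 10, "C": 10, "D": 0.5}
obj, tree = pcst_brute_force(nodes, edges, cost, prize)
print(obj, tree)  # forfeiting D's small prize beats paying for its edge
```

The trade-off is visible directly: including D would cost 5 to save a prize of 0.5, so the optimal tree is the three-edge star that leaves D out.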
1632 01:09:53,160 --> 01:09:55,960 If you do a shortest path search across the network, 1633 01:09:55,960 --> 01:09:59,110 you get thousands of edges shown here. 1634 01:09:59,110 --> 01:10:02,640 But the prize collecting Steiner tree solution to this problem 1635 01:10:02,640 --> 01:10:07,210 is actually extremely compact, and it consists of subnetworks. 1636 01:10:07,210 --> 01:10:08,590 You can cluster it automatically. 1637 01:10:08,590 --> 01:10:10,620 This was clustered by hand, but you get more or less 1638 01:10:10,620 --> 01:10:11,160 the same results. 1639 01:10:11,160 --> 01:10:12,409 It's just not quite as pretty. 1640 01:10:12,409 --> 01:10:16,109 If you cluster by hand or by say, edge betweenness, then 1641 01:10:16,109 --> 01:10:17,650 you get subnetworks that are enriched 1642 01:10:17,650 --> 01:10:19,910 in various reasonable cellular processes. 1643 01:10:19,910 --> 01:10:22,030 This was a network built from cancer data. 1644 01:10:22,030 --> 01:10:25,150 And you can see things that are highly relevant to cancer-- DNA 1645 01:10:25,150 --> 01:10:29,250 damage, cell cycle, and so on. 1646 01:10:29,250 --> 01:10:30,750 And the really nice thing about this 1647 01:10:30,750 --> 01:10:32,400 then is it gives you a very focused way 1648 01:10:32,400 --> 01:10:34,030 to then go and do experiments. 1649 01:10:34,030 --> 01:10:35,570 So you can take the networks that come out of it. 1650 01:10:35,570 --> 01:10:37,486 And now you're not operating on a network that 1651 01:10:37,486 --> 01:10:39,510 consists of tens of thousands of edges. 1652 01:10:39,510 --> 01:10:41,790 You're working on a network that consists 1653 01:10:41,790 --> 01:10:43,800 of very small sets of proteins. 
1654 01:10:43,800 --> 01:10:45,760 So in this particular case, we actually 1655 01:10:45,760 --> 01:10:48,220 were able to go in and test a number of the nodes that 1656 01:10:48,220 --> 01:10:50,910 were not detected by the experimental data, 1657 01:10:50,910 --> 01:10:53,390 but were inferred by the algorithm-- the Steiner 1658 01:10:53,390 --> 01:10:56,610 nodes, which had no direct experimental data. 1659 01:10:56,610 --> 01:11:00,380 We tested whether blocking the activities of these nodes 1660 01:11:00,380 --> 01:11:02,950 had any effect on the growth of these tumor cells. 1661 01:11:02,950 --> 01:11:04,334 We showed that nodes that were 1662 01:11:04,334 --> 01:11:06,250 very central to the network, that were included 1663 01:11:06,250 --> 01:11:08,710 in the prize collecting Steiner tree solution, 1664 01:11:08,710 --> 01:11:11,830 had a high probability of being cancer targets, 1665 01:11:11,830 --> 01:11:14,319 whereas the ones that were just slightly more removed 1666 01:11:14,319 --> 01:11:15,610 were much lower in probability. 1667 01:11:18,750 --> 01:11:22,650 So one of the advantages of these large interaction graphs 1668 01:11:22,650 --> 01:11:24,630 is they give us a natural way to integrate 1669 01:11:24,630 --> 01:11:26,970 many different kinds of data. 1670 01:11:26,970 --> 01:11:31,310 So we already saw that the protein levels and the mRNA 1671 01:11:31,310 --> 01:11:35,440 levels agreed very poorly with each other. 1672 01:11:35,440 --> 01:11:37,254 And we talked about the fact that one thing 1673 01:11:37,254 --> 01:11:38,670 you could do with those data would 1674 01:11:38,670 --> 01:11:41,940 be to try to find the connections between not 1675 01:11:41,940 --> 01:11:44,161 the RNAs and the proteins, but the connections 1676 01:11:44,161 --> 01:11:45,660 between the RNAs and the things that 1677 01:11:45,660 --> 01:11:48,040 drove the expression of the RNA. 
1678 01:11:48,040 --> 01:11:50,790 And so as I said, we'll see in one of Professor Gifford's 1679 01:11:50,790 --> 01:11:52,790 lectures precisely how to do that. 1680 01:11:52,790 --> 01:11:57,120 But once you are able to do that, you take epigenetic data, 1681 01:11:57,120 --> 01:12:02,300 look at the regions that are regulatory around the sites 1682 01:12:02,300 --> 01:12:04,240 of genes that are changing in transcription. 1683 01:12:04,240 --> 01:12:06,380 You can infer DNA binding proteins. 1684 01:12:06,380 --> 01:12:07,880 And then you can pile all those data 1685 01:12:07,880 --> 01:12:09,420 onto an interaction graph, where you've 1686 01:12:09,420 --> 01:12:10,628 got different kinds of edges. 1687 01:12:10,628 --> 01:12:13,010 So you've got RNA nodes that represent the transcript 1688 01:12:13,010 --> 01:12:13,700 levels. 1689 01:12:13,700 --> 01:12:15,200 You've got the transcription factors 1690 01:12:15,200 --> 01:12:16,880 that you infer from the epigenetic data. 1691 01:12:16,880 --> 01:12:18,750 And then you've got the protein-protein interaction 1692 01:12:18,750 --> 01:12:20,208 data that came from the two hybrid 1693 01:12:20,208 --> 01:12:21,820 and the affinity capture mass spec. 1694 01:12:21,820 --> 01:12:23,695 And now you can put all those different kinds 1695 01:12:23,695 --> 01:12:25,860 of data in the same graph. 1696 01:12:25,860 --> 01:12:27,910 And even though there's no correlation 1697 01:12:27,910 --> 01:12:31,520 between what happens in an RNA and what happens at the protein 1698 01:12:31,520 --> 01:12:33,704 level-- or very low correlation-- 1699 01:12:33,704 --> 01:12:35,120 there's this physical process that 1700 01:12:35,120 --> 01:12:37,030 links that RNA up to the signaling 1701 01:12:37,030 --> 01:12:38,155 pathways that are above it. 1702 01:12:38,155 --> 01:12:40,600 And by using the prize collecting Steiner tree 1703 01:12:40,600 --> 01:12:42,230 approaches, you can rediscover those connections. 
1704 01:12:45,444 --> 01:12:46,860 And these kinds of networks can be 1705 01:12:46,860 --> 01:12:49,250 very valuable for other kinds of data that don't agree. 1706 01:12:49,250 --> 01:12:53,330 So it's not unique to transcript data and proteome data. 1707 01:12:53,330 --> 01:12:55,580 It turns out there are many different kinds of omic data that, 1708 01:12:55,580 --> 01:12:58,420 when looked at individually, give you very different views 1709 01:12:58,420 --> 01:12:59,800 of what's going on in a cell. 1710 01:12:59,800 --> 01:13:05,200 So if you take knockout data: which genes, when knocked out, 1711 01:13:05,200 --> 01:13:06,200 affect the phenotype? 1712 01:13:06,200 --> 01:13:09,845 And which genes, in the same condition, 1713 01:13:09,845 --> 01:13:10,720 change in expression? 1714 01:13:10,720 --> 01:13:12,678 Those give you two completely different answers 1715 01:13:12,678 --> 01:13:15,930 about which genes are important in a particular setting. 1716 01:13:15,930 --> 01:13:19,671 So here we're looking at which genes are differentially 1717 01:13:19,671 --> 01:13:21,670 expressed when you put cells under a whole bunch 1718 01:13:21,670 --> 01:13:23,810 of these different conditions. 1719 01:13:23,810 --> 01:13:25,900 And which genes, when knocked out, 1720 01:13:25,900 --> 01:13:28,445 affect viability in that condition. 1721 01:13:28,445 --> 01:13:30,445 And then the right-hand column shows the overlap 1722 01:13:30,445 --> 01:13:32,010 in the number of genes. 1723 01:13:32,010 --> 01:13:33,995 And you can see the overlap is small. 1724 01:13:33,995 --> 01:13:35,370 In fact, it's less than you would 1725 01:13:35,370 --> 01:13:39,190 expect by chance for most of these. 1726 01:13:39,190 --> 01:13:42,900 So just to drill that home, if I do two separate experiments 1727 01:13:42,900 --> 01:13:45,580 on exactly the same experimental system, 1728 01:13:45,580 --> 01:13:48,116 say yeast responding to DNA damage.
1729 01:13:48,116 --> 01:13:49,490 And in one case, I read out which 1730 01:13:49,490 --> 01:13:51,652 genes are important by looking at RNA levels. 1731 01:13:51,652 --> 01:13:53,110 And in the other one, I read out which 1732 01:13:53,110 --> 01:13:55,484 genes are important by knocking every gene out and seeing 1733 01:13:55,484 --> 01:13:56,700 whether it affects viability. 1734 01:13:56,700 --> 01:13:59,580 We'll get two completely different sets of genes. 1735 01:13:59,580 --> 01:14:03,700 And we'll also have two completely different sets 1736 01:14:03,700 --> 01:14:05,750 of gene ontology categories. 1737 01:14:05,750 --> 01:14:07,710 But there is some underlying biological process 1738 01:14:07,710 --> 01:14:10,284 that gives rise to that, right? 1739 01:14:10,284 --> 01:14:11,700 And one of the reasons for this is 1740 01:14:11,700 --> 01:14:15,030 different assays are measuring different things. 1741 01:14:15,030 --> 01:14:18,250 So it turns out, if you look-- at least in yeast-- 1742 01:14:18,250 --> 01:14:21,190 over 156 different experiments, for which there's 1743 01:14:21,190 --> 01:14:24,280 both transcriptional data and genetic data, 1744 01:14:24,280 --> 01:14:26,100 the things that come out in genetic screens 1745 01:14:26,100 --> 01:14:27,880 seem to be master regulators: 1746 01:14:27,880 --> 01:14:30,637 things that, when knocked out, have a big effect on phenotype. 1747 01:14:30,637 --> 01:14:32,470 Whereas the things that change in expression 1748 01:14:32,470 --> 01:14:35,030 tend to be effector molecules.
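The claim that an overlap can be smaller than chance is easy to quantify under the standard null model, where the two hit lists are drawn independently from the genome. A minimal sketch; the gene counts below are invented for illustration, not the actual numbers from the slide:

```python
from math import comb

def expected_overlap(N, n1, n2):
    """Expected size of the intersection of two gene sets of sizes n1
    and n2 drawn independently at random from a genome of N genes."""
    return n1 * n2 / N

def overlap_tail(N, n1, n2, k):
    """Hypergeometric P(overlap >= k) under that same null model."""
    return sum(comb(n1, i) * comb(N - n1, n2 - i)
               for i in range(k, min(n1, n2) + 1)) / comb(N, n2)

# ~6,000 yeast genes, 300 differentially expressed genes, 200 genetic
# hits: about 10 genes would overlap by chance alone, so observing only
# a handful of shared genes is a depletion relative to the null.
print(expected_overlap(6000, 300, 200))  # 10.0
```

A one-sided test like `overlap_tail` (or its complement, for depletion) is the usual way such gene-set overlaps are scored.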
1749 01:14:35,030 --> 01:14:37,094 And so in, say, the DNA damage case, 1750 01:14:37,094 --> 01:14:38,510 the proteins that were knocked out 1751 01:14:38,510 --> 01:14:39,926 and have a big effect on phenotype 1752 01:14:39,926 --> 01:14:43,056 are ones that detect DNA damage and signal to the nucleus 1753 01:14:43,056 --> 01:14:44,680 that there has been DNA damage, 1754 01:14:44,680 --> 01:14:47,520 which then goes on to block the cell cycle 1755 01:14:47,520 --> 01:14:50,780 and initiate the DNA damage repair response. 1756 01:14:50,780 --> 01:14:52,690 Those things show up as genetic hits, 1757 01:14:52,690 --> 01:14:55,049 but they don't show up as differentially expressed. 1758 01:14:55,049 --> 01:14:57,340 The things that do show up as differentially expressed 1759 01:14:57,340 --> 01:14:58,160 are the repair enzymes. 1760 01:14:58,160 --> 01:14:59,330 Those, when you knock them out, don't 1761 01:14:59,330 --> 01:15:01,370 have a big effect on phenotype, because they're 1762 01:15:01,370 --> 01:15:03,364 highly redundant. 1763 01:15:03,364 --> 01:15:05,030 But there are these underlying pathways. 1764 01:15:05,030 --> 01:15:07,520 And so the idea is, well, you could reconstruct these by, 1765 01:15:07,520 --> 01:15:09,050 again, using the epigenetic data, 1766 01:15:09,050 --> 01:15:11,010 the kind of thing Professor Gifford 1767 01:15:11,010 --> 01:15:13,000 will talk about in upcoming lectures, 1768 01:15:13,000 --> 01:15:15,590 to infer the transcription factors, and then the network 1769 01:15:15,590 --> 01:15:19,370 properties, to try to build up a full network of how those 1770 01:15:19,370 --> 01:15:21,150 relate to upstream signaling pathways 1771 01:15:21,150 --> 01:15:23,290 that would then include some of the genetic hits. 1772 01:15:27,490 --> 01:15:32,030 I think I'll skip to the punchline here.
1773 01:15:49,130 --> 01:15:51,870 So we've looked at a number of different modeling approaches 1774 01:15:51,870 --> 01:15:54,400 for these large interactomes. 1775 01:15:54,400 --> 01:15:57,670 We've also looked at ways of identifying 1776 01:15:57,670 --> 01:15:59,670 transcriptional regulatory networks using 1777 01:15:59,670 --> 01:16:02,252 mutual information, regression, Bayesian networks. 1778 01:16:02,252 --> 01:16:03,960 And how do all these things fit together? 1779 01:16:03,960 --> 01:16:05,590 And when would you want to use one of these techniques, 1780 01:16:05,590 --> 01:16:07,214 and when would you want to use another? 1781 01:16:07,214 --> 01:16:10,017 So I like to think about the problem along these two axes. 1782 01:16:10,017 --> 01:16:11,600 On one dimension, we're thinking about 1783 01:16:11,600 --> 01:16:13,440 whether we have systems of known components or unknown 1784 01:16:13,440 --> 01:16:14,260 components. 1785 01:16:14,260 --> 01:16:15,759 And on the other, whether we want 1786 01:16:15,759 --> 01:16:17,490 to identify physical relationships 1787 01:16:17,490 --> 01:16:19,450 or statistical relationships. 1788 01:16:19,450 --> 01:16:21,830 So clustering, regression, mutual information-- those 1789 01:16:21,830 --> 01:16:23,510 are very, very powerful for looking 1790 01:16:23,510 --> 01:16:26,430 at the entire genome, the entire proteome. 1791 01:16:26,430 --> 01:16:28,800 What they give you are statistical relationships. 1792 01:16:28,800 --> 01:16:30,880 There's no guarantee of a functional link, right? 1793 01:16:30,880 --> 01:16:34,200 We saw that in the prediction that postprandial laughter 1794 01:16:34,200 --> 01:16:36,700 predicts breast cancer outcome; there's 1795 01:16:36,700 --> 01:16:38,760 no causal link between those. 1796 01:16:38,760 --> 01:16:40,260 Ultimately, you can find some reason 1797 01:16:40,260 --> 01:16:42,040 why it's not totally random.
1798 01:16:42,040 --> 01:16:43,960 But it's not as if that's going to lead you 1799 01:16:43,960 --> 01:16:46,290 to new drug targets. 1800 01:16:46,290 --> 01:16:49,740 But those can be run in a completely hypothesis-free way, 1801 01:16:49,740 --> 01:16:52,630 with no external data. 1802 01:16:52,630 --> 01:16:55,764 Bayesian networks are somewhat more causal. 1803 01:16:55,764 --> 01:16:57,430 But depending on how much data you have, 1804 01:16:57,430 --> 01:16:58,850 they may not be perfectly causal. 1805 01:16:58,850 --> 01:17:01,257 You need a lot of intervention data. 1806 01:17:01,257 --> 01:17:03,340 We also saw that they did not perform particularly 1807 01:17:03,340 --> 01:17:06,010 well in discovering gene regulatory networks 1808 01:17:06,010 --> 01:17:07,464 in the DREAM challenge. 1809 01:17:07,464 --> 01:17:09,130 These interactome models that we've just 1810 01:17:09,130 --> 01:17:11,990 been talking about work very well across giant omic data 1811 01:17:11,990 --> 01:17:12,490 sets. 1812 01:17:15,510 --> 01:17:17,530 And they require this external data. 1813 01:17:17,530 --> 01:17:18,700 They need the interactome. 1814 01:17:18,700 --> 01:17:20,060 So they work well in organisms for which 1815 01:17:20,060 --> 01:17:21,680 you have all that interactome data. 1816 01:17:21,680 --> 01:17:25,310 They're not going to work in an organism for which you don't. 1817 01:17:25,310 --> 01:17:26,960 What they give you at the end, though, 1818 01:17:26,960 --> 01:17:30,409 is a graph that tells you relationships 1819 01:17:30,409 --> 01:17:31,200 among the proteins. 1820 01:17:31,200 --> 01:17:32,700 But it doesn't tell you what's going 1821 01:17:32,700 --> 01:17:35,040 to happen if you start to perturb those networks.
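As a reminder of what a purely statistical relationship means here, this is a minimal sketch of the mutual-information score that the relevance-network family of methods computes for every gene pair. The expression profiles are toy binary vectors for hypothetical genes; real pipelines first discretize continuous expression into bins:

```python
from collections import Counter
from math import log2

def mutual_information(x, y):
    """I(X;Y) = sum over (a,b) of p(a,b) * log2( p(a,b) / (p(a)p(b)) )
    for two discretized expression profiles of equal length."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * log2((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

tf     = [0, 0, 1, 1, 0, 1, 0, 1]  # hypothetical transcription factor
target = [0, 0, 1, 1, 0, 1, 0, 1]  # perfectly coupled to the TF
other  = [0, 1, 0, 1, 0, 0, 1, 1]  # statistically independent of the TF

print(mutual_information(tf, target))  # 1.0 (bits)
print(mutual_information(tf, other))   # 0.0 (bits)
```

Methods in this family rank all gene pairs by such a score and then prune indirect edges; as stressed above, a high score guarantees only a statistical link, not a causal one.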
1822 01:17:35,040 --> 01:17:39,570 So if I give you the active subgraph that 1823 01:17:39,570 --> 01:17:42,440 has all the proteins and genes that are changing expression 1824 01:17:42,440 --> 01:17:45,329 in my tumor sample, now the question is, OK, 1825 01:17:45,329 --> 01:17:47,120 should you inhibit the nodes in that graph? 1826 01:17:47,120 --> 01:17:49,220 Or should you activate the nodes in that graph? 1827 01:17:49,220 --> 01:17:51,160 And the interactome model doesn't tell you 1828 01:17:51,160 --> 01:17:52,260 the answer to that. 1829 01:17:52,260 --> 01:17:54,676 And so what you're going to hear about in the next lecture 1830 01:17:54,676 --> 01:17:56,750 from Professor Lauffenburger are models 1831 01:17:56,750 --> 01:17:58,100 that live up in this space. 1832 01:17:58,100 --> 01:18:00,906 Once you've defined a relatively small piece of the network, 1833 01:18:00,906 --> 01:18:02,780 you can use other kinds of approaches-- logic-based 1834 01:18:02,780 --> 01:18:06,580 models, differential equation-based models, decision 1835 01:18:06,580 --> 01:18:09,410 trees, and other techniques that will actually 1836 01:18:09,410 --> 01:18:10,910 make very quantitative predictions. 1837 01:18:10,910 --> 01:18:13,640 What happens if I inhibit a particular node? 1838 01:18:13,640 --> 01:18:16,267 Does it activate the process, or does it repress the process? 1839 01:18:16,267 --> 01:18:17,850 And so what you could think about then 1840 01:18:17,850 --> 01:18:20,420 is going from a completely unbiased view of what's 1841 01:18:20,420 --> 01:18:24,680 going on in a cell: collect all the various kinds of omic data, 1842 01:18:24,680 --> 01:18:26,370 and go through these kinds of modeling 1843 01:18:26,370 --> 01:18:28,949 approaches to identify a subnetwork that's of interest. 1844 01:18:28,949 --> 01:18:31,240 And then use the techniques that we'll [? be hearing ?]
1845 01:18:31,240 --> 01:18:34,219 about in the next lecture to figure out quantitatively 1846 01:18:34,219 --> 01:18:36,510 what would happen if I were to inhibit individual nodes 1847 01:18:36,510 --> 01:18:40,500 or inhibit combinations of nodes or activate, and so on. 1848 01:18:40,500 --> 01:18:44,300 Any questions on anything we've talked about so far? 1849 01:18:44,300 --> 01:18:45,578 Yes. 1850 01:18:45,578 --> 01:18:48,392 AUDIENCE: Can you say again the fundamental difference 1851 01:18:48,392 --> 01:18:51,242 between why you get those two different results if you're 1852 01:18:51,242 --> 01:18:56,277 just reading out the gene expression versus the proteins? 1853 01:18:56,277 --> 01:18:57,110 PROFESSOR: Oh, sure. 1854 01:18:57,110 --> 01:18:57,610 Right. 1855 01:18:57,610 --> 01:19:01,204 So we talked about the fact that if you look at genetic hits, 1856 01:19:01,204 --> 01:19:02,870 and you look at differential expression, 1857 01:19:02,870 --> 01:19:05,536 you get two completely different views of what's going on in cells. 1858 01:19:05,536 --> 01:19:06,470 So why is that? 1859 01:19:06,470 --> 01:19:09,060 So the genetic hits tend to hit master regulators, things 1860 01:19:09,060 --> 01:19:10,720 where knocking out a single gene 1861 01:19:10,720 --> 01:19:13,097 has a global effect on the response. 1862 01:19:13,097 --> 01:19:14,555 So in the case of DNA damage, those 1863 01:19:14,555 --> 01:19:17,380 are things that detect the DNA damage. 1864 01:19:17,380 --> 01:19:20,530 Those genes often tend not to change very much 1865 01:19:20,530 --> 01:19:22,630 in expression. 1866 01:19:22,630 --> 01:19:24,650 So transcription factors are very low abundance. 1867 01:19:24,650 --> 01:19:25,760 They usually don't change very much. 1868 01:19:25,760 --> 01:19:27,820 A lot of signaling proteins are kept at a constant level, 1869 01:19:27,820 --> 01:19:29,980 and they're regulated post-transcriptionally.
1870 01:19:29,980 --> 01:19:32,350 So those don't show up in the differential expression. 1871 01:19:32,350 --> 01:19:35,160 The things that are changing in expression-- 1872 01:19:35,160 --> 01:19:40,110 say the response regulators, the DNA damage response-- 1873 01:19:40,110 --> 01:19:41,510 those often are redundant. 1874 01:19:41,510 --> 01:19:44,660 So one good analogy is to think about a smoke detector. 1875 01:19:44,660 --> 01:19:46,370 A smoke detector is on all the time. 1876 01:19:46,370 --> 01:19:48,139 You don't wait until there's a fire. 1877 01:19:48,139 --> 01:19:50,180 So that's not going to be changing in expression, 1878 01:19:50,180 --> 01:19:51,390 if you will. 1879 01:19:51,390 --> 01:19:54,330 But if you knock it out, you've got a big problem. 1880 01:19:54,330 --> 01:19:56,410 The effectors, say the sprinklers-- 1881 01:19:56,410 --> 01:19:58,540 the sprinklers only come on when there's a fire. 1882 01:19:58,540 --> 01:20:00,140 So that's like the response genes. 1883 01:20:00,140 --> 01:20:02,002 They come on only in certain circumstances, 1884 01:20:02,002 --> 01:20:03,210 but they're highly redundant. 1885 01:20:03,210 --> 01:20:04,835 Any room will have multiple sprinklers, 1886 01:20:04,835 --> 01:20:06,860 so if one gets damaged or is blocked, 1887 01:20:06,860 --> 01:20:08,190 you still get a response. 1888 01:20:08,190 --> 01:20:10,550 So that's why you get this discrepancy between the two 1889 01:20:10,550 --> 01:20:11,551 different kinds of data. 1890 01:20:11,551 --> 01:20:12,924 But again, in both cases, there's 1891 01:20:12,924 --> 01:20:15,290 an underlying physical process that gives rise to both. 1892 01:20:15,290 --> 01:20:17,040 And if you do this properly, you can 1893 01:20:17,040 --> 01:20:19,664 detect that with these interactome models. 1894 01:20:19,664 --> 01:20:20,330 Other questions? 1895 01:20:22,720 --> 01:20:23,220 OK. 1896 01:20:23,220 --> 01:20:25,000 Very good.