The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: So you'll recall last time we were working on protein-protein interactions. We're going to do a little bit to finish that up, with a topic that will be a good transition to the study of gene regulatory networks.

The precise things we're going to discuss today: we're going to start off with Bayesian networks for protein-protein interaction prediction. And then we're going to get into gene expression data, at several different levels. We'll talk about some basic questions of how to compare two expression vectors for a gene, that is, distance metrics. We'll talk about how to cluster gene expression data, and the idea of identifying signatures, sets of genes that might be predictive of some biological property, for example, susceptibility to a disease. And then we'll talk about a number of different ways that people have developed to try to identify gene regulatory networks. That often goes by the name of modules. I don't particularly like that name, but that's what you'll find in the literature. And we're going to focus on a few of these that have recently been compared head to head, using both synthetic and real data, and we'll see some of the results from that head-to-head comparison.

So let's just launch into it. Remember, last time we had started this unit looking at structural predictions for proteins, and we started talking about how to predict protein-protein interactions. Last time we talked about both computational methods and also experimental data that could give us information about protein-protein interactions, ostensibly measuring direct interactions. But we saw that there were possibly very, very high error rates.
So we needed ways of integrating lots of different kinds of data in a probabilistic framework, so we could predict, for any pair of proteins, the probability that they interact, not just the fact that they were detected in one assay or the other. And we started to talk about Bayesian networks in this context. They're useful, as we'll see today, both for predicting protein-protein interactions and also for the gene regulatory network problem.

So Bayesian networks are a tool for reasoning probabilistically. That's their fundamental purpose. And we saw that they consist of a graph, the network, and then the probabilities attached to each edge, the conditional probability tables. And we can learn these from the data, either in a completely objective way, where we learn both the structure and the probabilities, or where we impose the structure initially and then simply learn the probability tables. And we had nodes that represented the variables. They could be hidden nodes, where we don't know what the true answer is, or observed nodes, where we do. So in our case, we're trying to predict protein-protein interactions. There's some hidden variable that represents whether proteins A and B truly interact. We don't know that answer. But we do know whether that interaction was detected in experiment one, two, three, or four. Those are the effects, the observed nodes. And so we want to reason backwards from the observations to the hidden causes.

So last time we talked about the high throughput experiments that directly measure protein-protein interactions. We talked about yeast two-hybrid and affinity capture mass spec, here listed as pull-downs. And those could be used to predict protein-protein interactions by themselves. But we want to find out what other kinds of data we can use to amplify these results, to give us independent information about whether two proteins interact.
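Before turning to those other kinds of data, here is a minimal sketch of the backwards reasoning just described, assuming the assays are independent given the hidden state (the naive Bayes assumption). The prior and the per-assay detection probabilities below are invented for illustration; in practice they would be estimated from gold-standard data.

```python
# Hypothetical sketch: infer P(interact | observations) for one protein pair.

prior_interact = 0.01  # assumed prior probability that a random pair interacts

# (P(detected | interact), P(detected | no interact)) per assay; invented values
assays = {
    "y2h_1": (0.30, 0.01),
    "y2h_2": (0.25, 0.02),
    "pulldown": (0.40, 0.03),
}

observations = {"y2h_1": True, "y2h_2": False, "pulldown": True}

# Multiply likelihoods across assays (the naive independence assumption)
like_pos, like_neg = 1.0, 1.0
for name, (p_pos, p_neg) in assays.items():
    if observations[name]:
        like_pos *= p_pos
        like_neg *= p_neg
    else:
        like_pos *= 1.0 - p_pos
        like_neg *= 1.0 - p_neg

# Bayes' rule: reason backwards from the observed detections to the hidden cause
posterior = (like_pos * prior_interact) / (
    like_pos * prior_interact + like_neg * (1.0 - prior_interact)
)
print(f"P(interact | data) = {posterior:.3f}")
```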
And one thing you could look at is whether the expression of the two genes that you think might interact is similar. So if you look over many, many different conditions, you might expect that two proteins that interact with each other would be expressed under similar conditions. Certainly if you saw two proteins that had exactly opposite expression patterns, you would be very unlikely to believe that they interacted. So the question is, how much is true at the other end of the spectrum? If things are very highly correlated, do they have a high probability of interaction?

So this graph is a histogram, for proteins that are known to interact, proteins that were shown in these high throughput experiments to interact, and proteins that are known not to interact, of how similar the expression is. On the far right are things that have extremely different expression patterns, a high distance. And we'll talk specifically about what distance is in just a minute. But these are very dissimilar expression patterns; these are very similar ones. So what do you see from this plot we looked at last time? We saw that the interacting proteins are shifted a bit to the left. So the interacting ones have a higher probability of having similar expression patterns than the ones that don't interact. But we couldn't draw any cutoff and say that everything with this level of expression similarity is guaranteed to interact. There's no way to divide these. So this will be useful in a probabilistic setting, but by itself it would not be highly predictive.

We also talked about evolutionary patterns, and we discussed whether the red or the green patterns here would be more predictive. And which one was it, anyone remember? How many people thought the red was more predictive? How many the green? Right, the greens win. And we talked about coevolution in other ways.
So the paper that, I think, was one of the first to do this really nicely, to try to predict protein-protein interactions using Bayesian networks, is this one from Mark Gerstein's lab. And they start off, as we talked about previously: we need some gold standard interactions, where we know two proteins really do interact or don't. So they built their gold standard data set. The positive training data they took from a database called MIPS, which is a hand-curated database that digs into the literature quite deeply to find out whether two proteins interact or not. And the negative data they took were protein pairs identified as being localized to different parts of the cell. This was done in yeast, where there is pretty good data on subcellular localization for a lot of proteins.

So these are the data that went into their prediction. These were the experiments we've already talked about: the affinity capture mass spec and the yeast two-hybrid. And the other kinds of data they used were expression correlation, which we just talked about; annotations, whether proteins had the same annotation for function; and essentiality. In yeast, it's pretty easy to go through every gene in the genome, knock it out, and determine whether that kills the cell or not. So they can label every gene in yeast as to whether it's essential for survival or not.

And you can see here the number of interactions that were involved. They decided to break this down into two separate prediction problems. One was an experimental problem: using the four different large scale protein-protein interaction data sets in yeast to predict interactions. The other used these other kinds of data, which were less direct. And they used slightly different kinds of Bayesian networks for the two. For this one, they used a naive Bayes. And what's the underlying assumption of the naive Bayes?
The underlying assumption is that all the data are independent. So we looked at this previously. We discussed how, if you're trying to identify the likelihood ratio and use it to rank things, you primarily need to focus on this term, because this term will be the same for every pair of proteins that you're examining. Yes?

AUDIENCE: Could you state again whether in a naive Bayes, all data are dependent or independent?

PROFESSOR: Independent.

AUDIENCE: OK.

PROFESSOR: OK. So let's actually look at some of their data. In this table, they're looking at the likelihood ratio that two proteins interact, based on whether the two proteins are essential: both essential, one essential and one nonessential, or both nonessential. So that's what these codes here mean. EE, both essential; NN, both nonessential; EN, one and the other. And so they've computed, for all those protein pairs, how many in their gold standard are EE, how many are EN, and how many are NN.

So here are the numbers for the EE. Of the roughly 2,000 gold standard positives, just over 1,000 are EE. So that comes out to a probability of both being essential, given that you're a positive, that you're in the gold standard, of roughly 50%, right? And you can compute something similar for the negatives, the ones that definitely don't interact. So the probability of both being essential, given that it's a negative, is about 15%, 14%. And so then the likelihood ratio comes out to just under four. So there's a fourfold increase in the probability that something is interacting, given that both proteins are essential, than not.

And this is the table of all of the terms, for all of the different things that they were considering that were not direct experiments. So this is the essentiality. This is expression correlation, with various values for the threshold, how similar the expression had to be.
And these are the terms from the databases for annotation. And then for each of these, we get a likelihood ratio of how predictive it is. So it's kind of informative to look at some of these numbers. We already saw that essentiality is pretty weak: the fact that two genes are both essential only gives you a slightly increased chance that they're interacting than not. But if two genes have extremely high expression correlation, then they're more than a hundredfold more likely to interact than not. And the numbers for the annotations are significantly less than that.

So this is a naive Bayes: we're going to multiply all those likelihood ratios together. Now, for the experimental data, they said, well, these are probably not all independent. The probability that you pick something up in one two-hybrid experiment is probably highly correlated with the probability that you pick it up in another two-hybrid experiment. And one would hope that there's some correlation between the things you're identifying in two-hybrid and in affinity capture mass spec, although we'll see whether or not that's the case.

So they used what they refer to as a fully connected Bayes. And what do we mean by that? Remember, this was the naive Bayes, where everything is independent, so the probability of some observation is the product of all the individual probabilities. But in a fully connected Bayes, we don't have that independence assumption. So you need to explicitly compute the probability of an interaction based on all the possible combinations of outcomes in those experiments. That's not that much harder. We simply have a table now, where these columns represent each of the experimental data types, the affinity capture mass spec and the two-hybrids. Ones indicate that the interaction was detected; zeros, that it was not.
And then we simply look again in our gold standard and ask: for protein pairs with whatever detection pattern is shown here, say detected in all of the experiments except Ito, how many of the gold positives do we get, and how many of the gold negatives? And then we can compute the probabilities.

Now, it's important to look at some of the numbers in these tables and dig in, because you'll see the numbers here are really, really small. So they have to be interpreted with caution, and some of these things might not hold up with much larger data sets. You might imagine that the things that are experimentally detected in all of the high-throughput assays would be the most confident. That doesn't turn out to be the case. These are sorted by the log likelihood ratio, and the best one is not 1, 1, 1. It's up there, but it's not at the top of the pack. And that's probably just the statistics of small numbers; if the databases were larger and the experiments were larger, it probably would work out that way.

So, any questions about how they formulated this problem as a Bayesian network, or how they implemented it? OK.
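To make the two formulations concrete, here is an illustrative sketch of computing likelihood ratios from gold-standard counts. All of the counts are invented, not the paper's numbers, though the essentiality counts are chosen to echo the roughly 50% versus 14% split discussed above.

```python
# Invented gold-standard sizes
n_pos, n_neg = 2000, 500000

def likelihood_ratio(pos_count, neg_count):
    # LR = P(value | gold positive) / P(value | gold negative)
    return (pos_count / n_pos) / (neg_count / n_neg)

# Naive Bayes: one likelihood ratio per feature value; the per-feature
# ratios for a pair are then multiplied together.
essentiality_counts = {  # value -> (gold positives, gold negatives); invented
    "EE": (1000, 70000),
    "EN": (750, 230000),
    "NN": (250, 200000),
}
print(f"LR(EE) = {likelihood_ratio(*essentiality_counts['EE']):.2f}")  # ~3.6

# Fully connected Bayes: one likelihood ratio per joint detection pattern
# across the four experiments, so correlations between assays are captured
# rather than assumed away. Keys are (exp1, exp2, exp3, ito); counts invented,
# but chosen so that 1,1,1,1 is not the top-ranked pattern, as in the lecture.
joint_counts = {
    (1, 1, 1, 0): (16, 4),
    (1, 1, 1, 1): (20, 12),
}
for pattern, (pos, neg) in joint_counts.items():
    print(pattern, f"LR = {likelihood_ratio(pos, neg):.0f}")
```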
So the results, then. Once we have these likelihood ratios, we can try to choose a threshold for deciding what we're going to consider to be a true interaction and not. So here they've plotted, for different likelihood ratio thresholds, how many of the true positives you get right versus how many you get wrong: the number of true positives over the number of false positives. And you can arbitrarily decide, OK, I want to get more right than wrong. Not a bad way to decide things. So your passing grade here is 50%. If I draw a horizontal line, wanting to get more right than wrong, you'll see that all of the individual signals that they were using, essentiality, database annotation, and so on, fall below that. So individually, they predict more wrong than right. But if you combine the data using this Bayesian network, then you can choose a likelihood threshold where you do get more right than wrong, and you can set your threshold wherever you want. Similarly, for the direct experimental data, you do better by combining (these are the light pink lines) than you would with any of the individual data sets. So this shows the utility of combining the data and reasoning from the data probabilistically. Any questions?

So we'll return to Bayesian networks in a bit, in the context of discovering gene regulatory networks. We now want to move to gene expression data. And the primary reason to be so interested in gene expression data is simply that there's a huge amount of it out there. Just a short time ago we passed the million mark in the number of expression data sets that have been collected in the databases. There's much less of any other kind of high throughput data; if you look at proteomics or high-throughput genetic screens, there are tiny numbers compared to gene expression data. So obviously, techniques for analyzing gene expression data are going to play a very important role for a long time to come.

Some of what I'm going to discuss today is covered in your textbook; I encourage you to look at section 16.2. The fundamental thing that we're interested in doing is seeing how much biological knowledge we can infer from gene expression data. We might imagine that genes that are coexpressed under particular sets of conditions have functional similarity and reflect common regulatory mechanisms, and our goal, then, is to discover those mechanisms. So fundamental to this: any time we have a pair of genes and we look at their gene expression data, we want to decide how similar they are. So let's imagine that we had these data for four genes. It's a time series experiment, and we're looking at the different expression levels. And we want some quantitative measure to decide which two genes are most similar.
Well, it turns out to be a lot more subtle than we might think. At first glance, it seems pretty obvious that these two are the most similar. But it really depends on what kind of similarity you're asking about. We can describe the expression data for any gene as simply a multi-dimensional vector, where this is the set of expression values we detected for the first gene across all the different experimental conditions, and so on for the second. And what would be the most intuitive way of describing the distance between two multi-dimensional vectors? It would simply be the Euclidean distance, right? So that's perfectly reasonable. We can decide that the distance between two gene expression profiles is simply the square root of the sum of the squared differences. So we'll take the sum over all the experimental conditions that we've looked at (maybe it's a time series, maybe it's different perturbations), look at the difference in expression of gene A and gene B in each condition k, and then evaluating this will tell us how similar two genes are in their expression profiles.

Well, that's a specific example of a distance metric. It turns out that there's a formal definition of a distance metric. Distances have the following properties. They're never negative. They are equal to zero under exactly one condition: the two data points are the same. And they're symmetric: the distance from A to B is the same as the distance from B to A. Now, to be a true distance, you also have to satisfy the triangle inequality, that the distance from x to z is less than or equal to the sum of the distances through a third point. But we will find out that we don't actually need that for similarity measures.
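A minimal sketch of that Euclidean distance, d(A, B) = sqrt(sum over k of (A_k - B_k)^2), with invented toy expression vectors:

```python
import math

def euclidean_distance(a, b):
    """Square root of the sum of squared differences across conditions k."""
    return math.sqrt(sum((ak - bk) ** 2 for ak, bk in zip(a, b)))

# Toy expression vectors over five conditions (invented values)
gene_a = [2.0, 4.0, 6.0, 4.0, 2.0]
gene_b = [2.1, 3.9, 6.2, 4.1, 1.8]
print(euclidean_distance(gene_a, gene_b))  # small value: similar profiles
```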
So we can have either a true distance metric for comparing gene expression data sets, or similarity measures as well.

So let's go back to the simple example. We decided that the red and the blue genes were nearly identical in terms of Euclidean distance. But that's not always exactly what we care about. In biological settings, frequently the absolute level of gene expression is on some arbitrary scale. Certainly with expression arrays it was completely arbitrary; it had to do with fluorescence properties and how well probes hybridized to each other. But even with mRNA, how do we really know that 1,000 copies is fundamentally different from 1,200 copies of an RNA in the cell? We don't. So we might be more interested in distance metrics that capture not just the similarity of these two, but the fact that these two are also quite similar to this one, in terms of the trajectory of the plot.

So can we come up with measures that capture this one as well? A very common one is the Pearson correlation. In Pearson correlation, we're going to look at not just the expression of a gene across conditions, but the z-score of that gene. So we'll take the data for all of the genes in a particular condition, and we'll compute the z-score from the difference between the expression of a particular gene and the average expression across the whole data set, normalized by the standard deviation. Yes?

AUDIENCE: [INAUDIBLE] square there?

PROFESSOR: Yes, you're right, there should be a square there. Thank you.

So then to compute the Pearson correlation between two genes, A and B, we take the product of the z-score for A and the z-score for B, summed over all the experiments.
And these values, as we'll see in a second, are going to range from plus 1, which would be a perfect correlation, to minus 1, which would be a perfect anti-correlation. And then we're going to define the distance as 1 minus this value. So things that are perfectly correlated would have a distance of zero, and things that are anti-correlated would have a large distance.

So if we take a look at these two: obviously, by Euclidean distance they'd be quite different from each other. But once we've converted the expression values into z-scores over here, you can see that this one is the most negative of all of them, and this is the lowest one in all of these; this one's the highest. And similarly for the red one, lowest to highest. So the z-scores track very well. And when I take the product, the signs of the z-scores for A and B are always the same. So when I sum the product of the z-scores, I get a large number, and the normalization guarantees that it comes out to one. And so the red and blue here will have a very high correlation coefficient; in this case, it's going to be a correlation coefficient of 1. Whereas compared to this one, which is relatively flat, the correlation coefficient will be approximately zero. Any questions on that?

So what about, say, this pair, with nearly opposite patterns? Their z-scores are going to have almost the opposite sign every single time, and so that sum is going to be a large negative value. So these will be highly anti-correlated, with a correlation coefficient of minus 1.

OK. So we have these two different ways of computing distance measures. We can compute the Euclidean distance, which would make the red and blue the same, but treat the green one as being completely different. Or we have the correlation, which would group all of these together as being similar.
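Here is a sketch of that correlation-based distance, standardizing each gene's vector to z-scores and then taking d = 1 - r. The profiles are invented, with the red one just a scaled copy of the blue one:

```python
import statistics

def zscores(values):
    """Standardize a vector: subtract the mean, divide by the standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

def pearson_distance(a, b):
    """d = 1 - r, where r is the mean product of z-scores across experiments."""
    za, zb = zscores(a), zscores(b)
    r = sum(x * y for x, y in zip(za, zb)) / len(za)
    return 1.0 - r

blue = [1.0, 2.0, 3.0, 2.0, 1.0]      # invented profile
red = [10.0, 20.0, 30.0, 20.0, 10.0]  # same shape, 10x the scale
print(pearson_distance(blue, red))                 # ~0.0: perfectly correlated
print(pearson_distance(blue, [-v for v in blue]))  # ~2.0: anti-correlated
```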
Which one you want is going to depend on your setting. If you look in your textbook, you'll see a lot of other definitions of distance as well.

Now, what if you're missing a particular data point? This used to be a lot more of a problem with arrays than it is with RNA-seq. With arrays, you'd often have dirt on the array that would literally cover up spots. But you have a bunch of choices. The most extreme would be to just ignore that row or column of your matrix across all data sets. That's usually not what we want to do. You could put in some arbitrary small value. But frequently we will do what's called imputing, where we'll try to identify the genes that have the most similar expression, and replace the missing value with a value from the ones that we do know.

So, distance metrics: pretty straightforward. Now we want to use these distance metrics to actually cluster the data. And what's the idea here? That if we look across enough data sets, we might find certain groups of genes that function similarly across all those data sets, and that might be revealing as to their biological function. So this is an example of an unsupervised learning problem. We don't know what the classes are before we go in; we don't even know how many there are. We want to learn that from the data. This is a very large area of machine learning, and we're just going to scrape the surface. Some of you may be familiar with the fact that these kinds of machine learning algorithms are used widely outside of biology. They're used by Netflix to tell you what movie to choose next, by Amazon to try to sell you new products, and by all the advertisers who send pop-up ads to your computer.

But in our biological setting, we have our gene expression data, collected possibly over very large numbers of conditions, and we want to find groups of genes that have some similarity. This is a figure from one of the very early papers that established how people present these data.
So you'll almost always see the same kind of presentation. Typically you'll get a heat map, where genes are rows, and the different experiments (here time points, but they could be different perturbations) are the columns. Genes that go up in expression are red, and genes that go down in expression are green. And apologies to anyone who's colorblind, but that's just what the convention has become.

OK, so then why cluster? If we cluster across the rows, then we'll get sets of genes that, hopefully, if we do this properly, behave similarly across different subsets of the experiments. And those might represent similar functions. And if we cluster the columns, then we get different experiments that show similar responses. In this case, that might be different times that are similar; hopefully those are ones that are close to each other. But if we have lots of different patients, as we'll see in a second, they might represent patients who have a similar version of a disease.

And in fact, the clustering of genes does work. Even in this very early paper, they were able to identify a bunch of subsets of genes that showed similar expression at different time points, and that turned out to be enriched in different categories. These ones were enriched in cholesterol biosynthesis, whereas these were enriched in wound healing, and so on.

So how do you actually do clustering? This kind of clustering is called hierarchical, and it's pretty straightforward. There are two versions of hierarchical clustering, what's called agglomerative and divisive. In agglomerative clustering, you start off with each data point in its own cluster. Then you search for the most similar data point, and you group those together. And you keep doing that iteratively, building up larger and larger clusters.

So we've discussed how to compare individual genes.
You should be able, right now, if I gave you the vector of expression for a single gene, to find the other genes in the data set that are most similar, by, say, Euclidean distance or Pearson correlation, or what have you. But once you've grouped two genes together, how do you decide whether a third gene is similar to those two? Now we have to make some choices, and there are a number of different choices that are commonly made.

So let's say these are our data. We've got these two clusters, Y and Z, and each circle represents a data point in those clusters. So we've got four genes in each cluster. Now we want to decide on a distance measure to compare cluster Y to cluster Z. So what could we do? What are some possibilities? What might you do?

AUDIENCE: We could take the average of all points.

PROFESSOR: You could take the average of all points, right. What else could you do? Only a limited number of possibilities.

AUDIENCE: Centroid?

PROFESSOR: Yeah, so centroid, you could take some sort of average, right. Any other possibilities?

AUDIENCE: You can pick a representative from each set [INAUDIBLE].

PROFESSOR: So you could pick a representative, right? How would you decide in advance what that would be, though? So maybe you have a way, maybe not. And what other possibilities are there? Yeah?

AUDIENCE: Measure all the distances [INAUDIBLE] to all the nodes in the other.

PROFESSOR: Right. So you could do all to all. What else could you do? You could take the minimum of all those values, or you could take the maximum of all those values. And we'll see that all of those are things that people do. So in this clustering field, there are already rather uninformative terms for some of these kinds of decisions.
So it's called single linkage when you decide that the distance between two clusters is based on the minimum distance between any member of cluster Y and any member of cluster Z. Complete linkage takes the maximum distance. And then the extremely unfortunately named Unweighted Pair Group Method using Centroids (UPGMC, I won't try to say that very often) takes the centroid, which was an early suggestion from the class. And UPGMA, the Unweighted Pair Group Method with Arithmetic Mean, takes the average of all the distances. All suggestions that people have made.

So when would you use one versus the other? Well, a priori, you don't necessarily know, but it's good to know how they'll behave. So what do you imagine is going to happen if you use single linkage versus complete linkage? Remember, single linkage is the minimum distance, and complete linkage is the maximum distance. So what's going to happen in this case, if I use the minimum distance? Which two groups will I combine?

AUDIENCE: The blue and the red.

PROFESSOR: The blue and the red, right? Whereas if I use the maximum distance, then I'll combine the green and the red. So it's important to recognize, then, that single linkage has this property of chaining together clusters based on points that are near each other, whereas complete linkage is resistant to grouping things together if they have outliers. So they'll behave differently. Now, if your data are compact, and you really do have tight clusters, it's not going to matter too much which you use. But in most biological settings, we're dealing with much noisier data, so you actually will get different results based on this choice. And as far as I know, there's no really principled way, if you have no prior knowledge, to figure out which to use.

Now, all of these hierarchical clusterings come with what's called a dendrogram, and you'll see these at the top of all the clustering figures.
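As a sketch of how these linkage criteria are typically run in practice, assuming SciPy is available (the expression matrix below is invented): 'single' is the minimum rule, 'complete' the maximum, 'centroid' is UPGMC, and 'average' is UPGMA. The scipy.cluster.hierarchy.dendrogram function can then draw the tree for any of these.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy expression matrix: rows are genes, columns are conditions (invented data,
# three groups of ten genes centered at different expression levels)
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(loc, 0.3, size=(10, 6)) for loc in (0.0, 2.0, 4.0)])

for method in ("single", "complete", "centroid", "average"):
    tree = linkage(data, method=method, metric="euclidean")
    # Cutting the resulting dendrogram at distance 3 yields flat clusters
    labels = fcluster(tree, t=3.0, criterion="distance")
    print(method, labels)
```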
And this dendrogram represents the process by which the data were clustered. The things that are most similar are most tightly connected in the dendrogram. So for these two data points, one and two, you have to go up very little on the y-axis to get from one to two, whereas if you want to go from one to 16, you have to traverse the entire dendrogram. So the distance between two samples is how far vertically you have to go to connect them.

Now, the good thing about the dendrogram is that you can then understand the clustering of the data. I can cut this dendrogram at any particular distance and get clean divisions among my data. So if I cut here, at this distance level, then I have two groups: one small, consisting of these data, and one large, consisting of these. Whereas if I cut down here, I have more groups in my data. So it doesn't require me to know in advance how many groups I have; I can look at the dendrogram and infer it.

The one risk is that you always get a dendrogram that's hierarchical, regardless of whether the data were hierarchical or not. So it's more a reflection of how you did your clustering than of any fundamental structure of the data. The fact that you get a hierarchical dendrogram means really nothing about your data; it's simply a tool that you can use to try to divide the data up into different groups.

Any questions on the hierarchical clustering? Yes?

AUDIENCE: If each data point is its own cluster, then won't that be consistent across, like, single linkage, complete linkage-- like, why would you cluster? Does that question make sense? Like, if you cut it down below, then haven't you minimized-- don't you successively minimize the variance, I guess, up to your clusters, by--

PROFESSOR: So if I cut it at the lowest level, everybody is their own cluster. That's true. Right. I'm interested in finding out whether there are genes that behave similarly across the data sets.
Or--

AUDIENCE: My question is, how would you go about determining how many clusters you want?

PROFESSOR: Oh, OK. So we'll come to that in a second. For hierarchical clustering, you don't actually have any objective way of doing that. But we'll talk about other methods right now where it's a little bit clearer. Fundamentally, though, there aren't a lot of good ways of knowing a priori what the right number of clusters is. But we'll look at some measures in a second that help.

So hierarchical clustering, as your question implies, doesn't really tell you how many clusters there are. Another approach is to decide in advance how many clusters you expect, and then see whether you can get the data to group into that number or not. An example of that is something called k-means clustering. The nice thing about it is that it does give you sharp divisions. But again, if you choose k incorrectly, as we'll see in a second, you will nevertheless still get k clusters. So k refers to the number of clusters that you tell the algorithm you expect to get; you specify that in advance. And then you try to find a set of clusters that minimizes the distance between each point and the center of the cluster it's assigned to. Is that clear? That's what these equations represent. The center of each cluster, the centroid, is just the average of the coordinates over all the members of that cluster. And we're trying to find the set of clusters, C, that minimizes the sum of the squares of the distances between each member of a cluster and its centroid.

Any questions on how we're doing this? OK. All right. So what's the actual algorithm? It's remarkably simple. I choose an initial set of random positions, and then I have a simple loop that I repeat until convergence. For every point, I assign it to the nearest centroid.
So if my starting centroids are these circles, I look at every data point and I ask, how close is it to each one of these? That's what the boundaries defined by these lines are. Everything above this line belongs to this centroid. Everything over here belongs to this centroid. So I divide the data up by which centroid each point is closest to, and I assign it to that centroid. That's step one. And in step two, I compute new centroids. That's what these triangles represent. So after I did that partitioning, it turns out that most of the things that were assigned to the triangular cluster live over here. So the centroid moves from being here to here. And I iterate this process. That's the entire K-means clustering algorithm.
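[Editor's note: in code, that loop is only a few lines. A minimal NumPy sketch of the two steps just described--assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. The initialization scheme and convergence test are simple illustrative choices.]

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # initialize centroids at k randomly chosen data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # step 1: assign each point to the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 2: move each centroid to the mean of its assigned points
        new_centroids = np.array(
            [X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
             for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # converged: the centroids (and the objective) stopped changing
        centroids = new_centroids
    return centroids, labels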
So here's an example where I generated data from three Gaussians. I chose initial points, which are the circles. I follow that protocol: here's the first step; it computes new centroids, the triangles; the second step; and then it converges--the distance stops changing.

Now this question's already come up: what happens if you choose the wrong K? Say I believe there are three clusters, and really that's not the case. What's going to happen? So in this data set there really were five clusters. Here, they're clustered correctly. What if I told the algorithm to do K-means clustering with a K of three? It would still find a way to come up with three clusters. So now it's grouped these two things, which were clearly generated from different Gaussians, together. It's grouped these two, which were generated from different Gaussians, together, and so on. All right. So K-means clustering will do what you tell it to do, regardless of whether that's the right answer or not. And if you tell it there are more clusters than really are there, then it'll start chopping up well-defined clusters into sub-clusters. So here it split this elongated one into two sub-clusters. It split this one arbitrarily into two, just so it gets the final number that we asked for.

Then how do you know what to do? Well, as I said, there's no guaranteed way to know. But one thing you can do is make this kind of plot, which shows, for different values of K on the x-axis, the sum of the distances within the clusters--the distance to the centroid within each cluster--on the y-axis. As I increase K, while I'm still correctly partitioning my data--when there really are more subgroups than I've so far defined--I'll see big drops. So when I go from saying there are two clusters to saying there are three, in that case I get a big drop in the distance between members of the clusters, because I'm no longer putting a data point over here in the same cluster as a data point over there. But once I go beyond the correct number, which was five, you see that the benefits really start to trail off. So there's an inflection point here, an elbow--sometimes this is called an elbow plot. After I go past the right number, I get less and less benefit from each additional cluster. So this gives us an empirical way of choosing an approximately correct value for K.
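[Editor's note: a sketch of that elbow heuristic, reusing the kmeans() function sketched above; the five-Gaussian data set is made up so that the bend lands at K = 5.]

import numpy as np

def within_cluster_ss(X, centroids, labels):
    # sum over clusters of squared distances from members to their centroid
    return sum(((X[labels == j] - c) ** 2).sum() for j, c in enumerate(centroids))

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=m, scale=0.3, size=(50, 2))
               for m in [(0, 0), (3, 0), (0, 3), (3, 3), (6, 1.5)]])

for k in range(1, 9):
    centroids, labels = kmeans(X, k)
    print(k, round(within_cluster_ss(X, centroids, labels), 1))
# the objective drops sharply up to k = 5 and only slowly afterwards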
Any questions on K-means? Yes?

AUDIENCE: Does K-means recapitulate the clusters that you would get if you cut off your dendrogram from hierarchical clustering at a certain level?

PROFESSOR: Not necessarily.

AUDIENCE: OK. But maybe. I don't know. It sort of seems to me as if you picked a level where you have a certain number of clusters, that that's similar, at least by centroid, by using the center?

PROFESSOR: Yeah, I think because of the way that you do it, you're not even guaranteed to have a level where you have exactly the right number. Other questions? Yes?

AUDIENCE: Could you just very quickly go over how you initialize where the starting points are, and the break-ups?

PROFESSOR: All right, so the question is, how do you initialize the starting points? In fact, you have to make some arbitrary decisions about how to initialize them. They're usually chosen at random, and you will get different results depending on how you do that. So when you run it, it's non-deterministic in that sense. And you often want to initialize multiple times and make sure you get similar results. Very good question. And in fact, that was not a setup. But what happens if you choose pathologically bad initial conditions? You have the potential to converge to the right answer, but you're not guaranteed to converge to the right answer. So here's an example where--I guess there really are three clusters in the data. I chose [INAUDIBLE] three, but I stuck all my initial coordinates down in the lower right-hand corner. And then when I do the clustering, if things go well, I get the right answer. But we're not guaranteed.

But one thing we are guaranteed is that we always get convergence. The algorithm will converge, because at each step it's either reducing the objective function or leaving it the same. So we're guaranteed convergence. But, as we've seen previously in other settings, we may end up in a local minimum rather than the global optimum. And the way to fix that would be to initialize again, with new starting positions.
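[Editor's note: a sketch of that multiple-restart safeguard, reusing the kmeans() and within_cluster_ss() functions sketched above: run from several random initializations and keep the solution with the lowest objective.]

def kmeans_restarts(X, k, n_restarts=10):
    best = None
    for seed in range(n_restarts):
        centroids, labels = kmeans(X, k, seed=seed)
        score = within_cluster_ss(X, centroids, labels)
        if best is None or score < best[0]:
            best = (score, centroids, labels)  # keep the lowest objective seen
    return best[1], best[2]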
Other questions?

What about a setting like this, where we've got two well-defined clusters and somebody who lives straight in the middle? What's the algorithm going to do? Well, sometimes it'll put that point in one cluster, and sometimes it'll end up in the other. So an alternative to K-means clustering, which has to make one or the other arbitrary decision, is something called fuzzy K-means, which can actually give a point partial membership in both clusters.

It's very similar in structure to K-means, with one important difference, which is a membership variable that tells you, for every data point, how much it belongs to cluster one, cluster two, cluster three, and so on.

So in both algorithms, we start off by choosing initial points as the cluster means, and looping. Previously, we would make a hard assignment of each data point x sub i to a single cluster. Here, we're going to calculate the probability that each data point belongs to each cluster. And that's where you get the fuzziness, because a point can have a non-unit, nonzero probability of belonging to any of the clusters. And whereas in K-means we recalculated the mean value by just averaging over everybody in that cluster, in fuzzy K-means we don't have everybody in the cluster--everybody belongs partially to the cluster. So we're going to take a weighted average.

So here are the details of how you do that. In K-means, we were minimizing this function: we were trying to find the cluster memberships that would minimize the distance of every member of a cluster to the centroid of that cluster. Here it looks almost the same, except we now have this new variable, mu, which is the membership--the membership of point j in cluster i. So I'm trying to minimize a very similar function. But now, if all my mus are one, then what do I get? K-means, right? But as soon as the mus are allowed to vary from one--they can be between zero and one--then points can contribute more or less. So that point that was stuck in the middle of the two clusters, if it had a mu of 0.5 for each, it would contribute half to each, and then both centroids would move a little bit towards the middle.
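[Editor's note: a minimal sketch of those two fuzzy updates--the standard fuzzy c-means membership formula and the membership-weighted centroid--where m > 1 is the usual "fuzzifier" exponent (as m approaches 1 you recover hard K-means). The initialization is illustrative.]

import numpy as np

def fuzzy_kmeans(X, k, m=2.0, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # distances from every point to every centroid (small floor avoids /0)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) + 1e-12
        # membership mu of each point in each cluster; rows sum to 1
        inv = d ** (-2.0 / (m - 1.0))
        mu = inv / inv.sum(axis=1, keepdims=True)        # shape (n_points, k)
        # membership-weighted average replaces the hard per-cluster mean
        w = mu ** m
        centroids = (w.T @ X) / w.sum(axis=0)[:, None]
    return centroids, mu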
So what's the result of K-means--I'm sorry, fuzzy K-means clustering? We still get K clusters. But now every gene, every object that we're clustering, has a partial membership. So here's an example of that, where they did fuzzy K-means clustering with six different clusters. But now every profile, every gene, has a color associated with it that represents this mu value, going from zero to one with these rainbow colors. The things that are reddish, or pink, are the high-confidence things that are very strongly in that cluster, and only that cluster. Whereas the things that are more towards the yellow end of the spectrum are partially in this cluster and partially in other clusters. Questions? Any questions?

So K-means we've defined in terms of Euclidean distance. And that has clear advantages, in terms of computing things very easily. But it has some disadvantages as well. One of the disadvantages is that because we're using the squared distance, outliers have a very big effect, because I'm squaring the difference between vectors. That may not be the worst thing. But Euclidean distances also restrict us to things for which we can compute a centroid--we have to have data for which you can actually compute the mean value of all members of a cluster. Sometimes you want to cluster things for which we only have qualitative data, where instead of having a distance measure, we have a similarity. This doesn't come up quite as often in--well, it certainly doesn't come up in gene expression data or RNA-seq. But you can imagine more qualitative data, where you ask people about similarity between different things, or behavioral features--where you know the similarity between two objects, but you have no way of calculating the average object. One setting that you might [INAUDIBLE] have looked at: if you're trying to cluster, say, sequence motifs that you've computed with the EM algorithm. What's the average sequence motif? That doesn't necessarily represent any true object, right?
999 00:41:06,380 --> 00:41:08,430 You might be better off-- you can calculate it. 1000 00:41:08,430 --> 00:41:09,650 But it doesn't mean anything. 1001 00:41:09,650 --> 00:41:11,140 You might be better off calculating 1002 00:41:11,140 --> 00:41:15,550 using rather than the average motif, the most 1003 00:41:15,550 --> 00:41:17,870 central of the motifs that you actually observed. 1004 00:41:17,870 --> 00:41:20,362 So that would be called a medoid, or an exemplar. 1005 00:41:20,362 --> 00:41:22,820 It's a member of your cluster that's closest to the middle, 1006 00:41:22,820 --> 00:41:28,130 even if it it's not smack dab in the middle. 1007 00:41:28,130 --> 00:41:31,690 So instead of K-means, we can just think, well, K-medoids. 1008 00:41:31,690 --> 00:41:35,090 So in K-means, we actually computed a centroid. 1009 00:41:35,090 --> 00:41:37,910 And in medoids, we'll choose the existing data 1010 00:41:37,910 --> 00:41:41,730 point that's most central. 1011 00:41:41,730 --> 00:41:42,770 So what does that mean? 1012 00:41:53,200 --> 00:41:57,715 If these are my data, the true mean is somewhere over here. 1013 00:42:00,335 --> 00:42:01,460 But this one is the medoid. 1014 00:42:05,600 --> 00:42:08,410 It's an exemplar that's close to the central point. 1015 00:42:08,410 --> 00:42:11,952 But if there actually isn't anything here, 1016 00:42:11,952 --> 00:42:12,660 then there isn't. 1017 00:42:12,660 --> 00:42:14,618 So we're going to use the thing that's closest. 1018 00:42:14,618 --> 00:42:16,680 So if these were all sequence motifs, 1019 00:42:16,680 --> 00:42:18,550 rather than using some sequence motif that 1020 00:42:18,550 --> 00:42:20,600 doesn't exist as the center of your cluster, 1021 00:42:20,600 --> 00:42:22,960 you would use a sequence motif that actually does exist, 1022 00:42:22,960 --> 00:42:24,222 and it's close to the center. 1023 00:42:29,870 --> 00:42:32,020 So it's a simple variation on the K-means. 1024 00:42:35,180 --> 00:42:39,540 Instead choosing K points in arbitrary space 1025 00:42:39,540 --> 00:42:41,600 as our starting positions, we're going 1026 00:42:41,600 --> 00:42:46,510 to choose K examples from the data as our starting medoids. 1027 00:42:46,510 --> 00:42:49,180 And then we're going to place each point in the cluster that 1028 00:42:49,180 --> 00:42:52,310 has the closest medoid, rather than median. 1029 00:42:52,310 --> 00:42:54,130 And then when we do the update step, 1030 00:42:54,130 --> 00:42:56,840 instead of choosing the average position to represent 1031 00:42:56,840 --> 00:42:59,592 the cluster, we'll choose the medoid. 1032 00:42:59,592 --> 00:43:03,860 The exemplar that's closest to the middle. 1033 00:43:03,860 --> 00:43:06,495 Any questions on this? 1034 00:43:06,495 --> 00:43:06,995 Yes? 1035 00:43:06,995 --> 00:43:08,453 AUDIENCE: So if you use the medoid, 1036 00:43:08,453 --> 00:43:10,306 do you lose the guaranteed convergence? 1037 00:43:10,306 --> 00:43:11,681 Because I can picture a situation 1038 00:43:11,681 --> 00:43:13,835 where you're sort of oscillating because now 1039 00:43:13,835 --> 00:43:15,034 you have a discrete stack. 1040 00:43:15,034 --> 00:43:16,794 PROFESSOR: That's a good question. 1041 00:43:16,794 --> 00:43:17,710 That's probably right. 1042 00:43:17,710 --> 00:43:18,835 Actually, I should think about that. 1043 00:43:18,835 --> 00:43:19,510 I"m not sure. 1044 00:43:22,570 --> 00:43:25,034 Yeah, that's probably right. 1045 00:43:25,034 --> 00:43:25,700 Other questions? 1046 00:43:30,701 --> 00:43:31,200 OK. 
Any questions on this? Yes?

AUDIENCE: So if you use the medoid, do you lose the guaranteed convergence? Because I can picture a situation where you're sort of oscillating, because now you have a discrete set.

PROFESSOR: That's a good question. That's probably right. Actually, I should think about that. I'm not sure. Yeah, that's probably right. Other questions? OK.

There are a lot of other techniques for clustering. Your textbook talks about self-organizing maps, which were quite popular at one point. And there's also a nice technique called affinity propagation, which is a little bit outside the scope of this course, but has proved quite useful for clustering. OK.

So why bother to do all this clustering? Our goal is to try to find some biological information, not just to find groups of genes. So what can you do with these things? Well, one thing that was identified early on is that if I could find sets of genes that behave similarly, maybe those could be used in a predictive way--to predict outcomes for patients, or some biological function. So we're going to look at that first.

One of the early papers in this field did clustering of microarrays for patients who had B-cell lymphoma. The patients had different kinds of B-cell lymphomas. So they took their data and they clustered it. Again, each row represents a gene, and each column represents a patient. With this projector it's a little bit hard to see, but when you look at the notes separately, you'll be able to see that in the dendrogram there's a nice, sharp division between two large groups of patients. And it turns out that when you look at the pathologists' annotations for these patients--which were completely independent of the gene expression data--almost all the patients in the left-hand group had one kind of lymphoma, and all the patients in the right-hand group had a different kind of lymphoma. This got people very excited, because it suggested that purely molecular features might be at least as good as pathological studies. So maybe you could completely automate the identification of different tumor types.
Now, the next thing that got people even more excited was the idea that maybe you could use these patterns not just to recapitulate what a pathologist would find, but to go beyond it, and actually make predictions about the patients. So in these plots--I don't know if we've seen these before in the class--on the x-axis is survival time, and on the y-axis is the fraction of patients in a particular group who survived that long. So as the patients die, obviously the curve drops down. Each one of these drops represents the death of a patient, or the loss of the patient from the study for other reasons.

Let's start with the one in the middle. This is what the clinicians would have decided. Here are patients that they defined by clinical standards as likely to do well, versus patients whom they defined by clinical standards as likely to do poorly. And you can see there's a big difference in the plots for the low-clinical-risk patients at the top and the high-clinical-risk patients at the bottom. On the left-hand side is what you get when you use purely gene expression data to cluster the patients into groups, which turn out to be high risk or low risk. And you can see that it's a little bit more statistically significant for the clinical risk, but it's pretty good over here, too.
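[Editor's note: a sketch of how such a survival comparison is typically done, assuming the expression data live in a pandas DataFrame and that the lifelines survival-analysis package is available for the log-rank test. Every variable and function name here is hypothetical, not from the study discussed.]

import numpy as np
from lifelines.statistics import logrank_test

def compare_signature_groups(expr, signature_genes, surv_time, event_observed):
    """expr: DataFrame (patients x genes); surv_time, event_observed: arrays."""
    score = expr[signature_genes].mean(axis=1)    # per-patient signature score
    high = (score > score.median()).to_numpy()    # split at the median score
    result = logrank_test(surv_time[high], surv_time[~high],
                          event_observed_A=event_observed[high],
                          event_observed_B=event_observed[~high])
    return result.p_value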
Now the really impressive thing is this: what if you take the patients that the clinicians define as low clinical risk, and then you look at their gene expression data? Could you separate out, among those allegedly low-clinical-risk patients, the ones who are actually at high risk? Maybe then they would be diverted to more aggressive therapy than patients who really and truly are low risk. And what they were able to show, with just barely statistical significance, is that even among the clinically defined low-risk patients, there is--based on these gene signatures--the ability to distinguish patients who are going to do better from patients who are going to do worse.

So this was over a decade ago, and it really set off a frenzy of people looking for gene signatures for all sorts of things that might be highly predictive. Now, the fact that something is correlated doesn't, of course, prove any causality. So one of the questions is: if I find a gene signature that is predictive of an outcome in one of these studies, can I use it to go backwards and actually define a therapy? In the ideal setting, I would have these gene signatures; I'd discover that they are clinically associated with outcome; I could dig in and discover what makes the patients who do worse do worse, and go and treat that. So is that the case or not?

Let me show you some data from a breast cancer data set. Here's a breast cancer data set--again the same kind of plot, where we've got the survival statistic on the y-axis and the number of years on the x-axis. And based on a gene signature, this group has defined a group that does better and a group that does worse. The p-value is significant, and the hazard ratio--the death rate versus the control group--is approximately two. OK. So does this lead us to any mechanistic insight into breast cancer? Well, it turns out that in this case, the gene signature was defined based on postprandial laughter--after-dinner humor. Here's a gene set that defines something that has absolutely nothing to do with breast cancer, and it's predicting the outcome of breast cancer patients. Which led to the somewhat-of-a-joke line that they were testing whether laughter really is the best medicine. OK. So they went on--they tried other gene sets.
Here's a gene set that's not even defined in humans. It's the homologs of genes that are associated with social defeat in mice. And once again, you get a statistically significant p-value and good hazard ratios. So what's going on? Well, these plots are not from a study that's actually trying to predict an outcome in breast cancer. It's a study that shows that most randomly selected sets of genes in the genome will give a result that's correlated with patient outcome in breast cancer. Yes?

AUDIENCE: I'm a little confused. In the previous graph, could you just explain what is the black and what is the red? Is that individuals or groups?

PROFESSOR: So the black are people that have the gene set signature--who have high levels of the genes that are defined in this gene set. And the red are ones that have low levels, or the other way around. But it's dividing all patients into two groups, based on whether they have a particular level of expression of this gene set, and then following those patients over time: do they do better or worse? And similarly for all these plots.

They had another one, which is a little less amusing: localization of skin fibroblasts. The real critical point is this. Here, they computed--under the expectation that all genes are independent of each other--the probability that a gene signature is correlated with outcome, for gene sets that were chosen at random, or chosen from a database of gene signatures that people have identified as being associated with pathways. And you get a very, very large fraction. So this is the log of the p-value, so more negative values are more significant.
A huge fraction of all gene sets that you pull at random from the genome, or that you pull from a compendium of known pathways, are going to be associated with outcome in this breast cancer data set. So it's not just well-annotated cancer pathways that are associated. It's gene sets associated, as we've seen, with laughter, or social defeat in mice, and so on--all sorts of crazy things that have no mechanistic link to breast cancer.

Let's take a second for that to sink in. I pull genes at random from the genome. I divide patients based on whether they have high levels of expression of that random set of genes or low levels of expression of it. And I'm extremely likely to be able to predict the outcome in breast cancer. That should be rather disturbing, right?

And it turns out--before we get to the answer--that this is not unique to breast cancer. They went through a whole bunch of data sets in the literature. Each row is a different previously published study, where someone had claimed to identify a signature for a particular kind of disease or outcome. And they took their random gene sets and asked how well the random gene sets did in predicting the outcome in those patients. These yellow plots represent the probability distribution for the random gene sets--again, on this projector it's hard to see, but there's a highlight on the left-hand side at where the best 5% of the random gene sets fall. This blue line is the nominal level of statistical significance. It turns out that a few of these studies didn't even reach a normal level of statistical significance, let alone a comparison against random gene sets. But for most of these, you don't do better than a good fraction of the randomly selected gene sets.
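[Editor's note: that control is easy to sketch--draw many random gene sets of the same size as a published signature and ask how often they, too, "predict" outcome. If the real signature doesn't beat this null distribution, its association with outcome carries no special meaning. This reuses the hypothetical compare_signature_groups() helper sketched earlier.]

import numpy as np

def random_signature_null(expr, sig_size, surv_time, event_observed, n_draws=1000):
    rng = np.random.default_rng(0)
    genes = expr.columns.to_numpy()
    pvals = []
    for _ in range(n_draws):
        draw = rng.choice(genes, size=sig_size, replace=False)
        pvals.append(compare_signature_groups(expr, list(draw),
                                              surv_time, event_observed))
    return np.array(pvals)

# fraction of random signatures "significant" at p < 0.05; in the breast
# cancer data discussed here, that fraction was strikingly large:
# frac = (random_signature_null(expr, 50, surv_time, events) < 0.05).mean()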
So how could this be? Well, it turns out there is an answer to why this happens, and it's really quite fascinating. Here, they're using the hazard ratio, which is the death rate for the patients who have the signature over that of the control group. So a high hazard ratio means the signature is strongly associated with a bad outcome. And they've plotted that against the correlation of the genes in the gene signature with the expression of a gene called PCNA, Proliferating Cell Nuclear Antigen. And it turns out a very, very large fraction of the genome is coexpressed.

So genes are not expressed like completely independent random variables. There are lots of genes that show very similar expression levels across all the data sets. Now, PCNA is a gene that's been known by pathologists for a long time as having higher levels in the most aggressive tumors. And a very, very large fraction of the genome is coexpressed with PCNA. So high levels of a randomly selected set of genes are going to be a very good predictor of tumor outcome, because high levels of randomly selected genes also mean a very high probability of having a high level of PCNA, which is a tumor marker.

So we have to proceed with a lot of caution. We can find things that are highly correlated with outcome, and those could have good value as prognostic indicators. But there are going to be a lot of possible sets of genes that have that property--they're good predictors of outcome--and many of them will have absolutely nothing to do, causally, with the process of the disease. So at the very least, it means don't start a drug company over every set of genes that you identify as associated with outcome. But in the worst-case scenario, it also means that those predictions will break down under settings that we haven't yet examined. And that's the real fear: you have a gene set signature that you think is highly predictive of outcome, but it's only because you looked at a particular set of patients. You look at a different set of patients, and that correlation will break down.
So this is an area of research that's still quite in flux, in terms of how much utility there will be in identifying gene set signatures in this completely objective way. And what we'll see in the course of this lecture and the next one is that it's probably going to be much more useful to incorporate other kinds of information that will constrain us to be more mechanistic. Any questions? All right.

So now we're going to really get into the meat of the identification of gene modules. And we're going to try to see how much we can learn about regulatory structure from the gene expression data. So we're going to move up from the pure expression data--say, these genes at the bottom--to try to figure out what set of transcription factors were driving them, and maybe what signaling pathways lie upstream of those transcription factors and turn them on. And the fundamental difference between clustering, which is what we've been looking at until now, and these modules, as people like to call them, is that you can have a whole bunch of genes--we've just seen this--that are correlated with each other without being causally linked to each other. So we'd like to figure out which genes are actually functionally related, and not just statistically related.

And the paper that's going to serve as our organizing principle for the rest of this lecture, maybe bleeding into the next lecture, is this recently published paper called the DREAM5 challenge. This, like some of the other challenges we've seen before, is a case where the organizers have data sets for which they know the answer--what the regulatory structure is. They send out the data, people try to make the best predictions they can, and then the organizers unseal the answers to let people know how well they did. And so you can get a relatively objective view of how well different kinds of approaches work. So this is the overall structure of the challenge.
They had four different kinds of data. Three are real data sets from different organisms: E. coli, yeast, and Staphylococcus aureus. And the fourth one, the one at the top here, is completely synthetic data that they generated. And you get a sense of the scale of the data sets: how many genes are involved, how many potential regulators. In some cases, they've given you specific information on the perturbations--knockouts, antibiotics, toxins--and, again, the number of conditions being looked at, the number of arrays. They provide the data in a way that makes it very hard for the groups doing the analysis to trace it back to particular genes, because you don't want people to use external data to make their predictions. So everyone makes their predictions. As part of this challenge, the organizers also made their own meta-predictions, based on the individual predictions from the different groups; we'll take a look at that in a second. And then they score how well everyone did.

Now, we'll get into the details of the scoring a little bit later. But what they found, at the highest level, is that different kinds of methods behaved similarly. The main groups of methods were these regression-based techniques--we'll talk about those in a second; Bayesian networks, which we've already discussed in a different context; a hodgepodge of different kinds of things; and then mutual information and correlation. So we're going to look at each of these main categories of prediction methods.

We're going to start with the Bayesian networks, which we just finished talking about in a completely different context. Here, instead of trying to predict whether an interaction is real based on the experimental data, we're going to try to predict whether a particular protein is involved in regulating a set of genes, based on the expression data.
So in this context--let's say I have cancer data sets, and I want to decide whether p53 is activated in those tumors. This is a known pathway for p53. So if I told you the pathway, how might you figure out whether p53 is active from gene expression data? I tell you this pathway, I give you this expression data--what's a simple thing that you could do right away to decide whether you think p53 is active or not? p53 is a transcriptional activator, so it should be turning on its target genes when it's on. So what's an obvious thing to do?

AUDIENCE: Check the expression levels of the targets.

PROFESSOR: Thank you. Right, so we could check the expression levels of the targets and compute some simple statistics. OK, well, that could work. But of course there could be other transcriptional regulators that regulate a similar set of genes. So that's not a guarantee that p53 is on--it might be some other transcriptional regulator.

We could also look at the pathways that activate p53 and ask whether those genes are on. So we've got in this pathway a bunch of kinases--ATM, CHK1, and so on--that activate p53. Now, if we had proteomic data, we could actually look at whether those proteins are phosphorylated. But we have much, much less proteomic data, and most of these settings only have gene expression data. But you can look at: is that gene expressed? Has the expression of one of these activating proteins gone up? And you can then try to make an inference: if there's more of these activating proteins, then maybe p53 is active, and therefore it's turning on its targets. That's one step removed, though. Just the fact that there's a lot of ATM mRNA around doesn't mean that there's a lot of the ATM protein, and it certainly doesn't mean that ATM is phosphorylated and turning on its target. So again, we don't have a guarantee there.
We could look more specifically at whether the genes are differentially expressed. The fact that they're on may not be as informative as if they were uniquely on in this tumor and not in control cells from the same patient. So that can be informative. But again, changes in gene expression are not uniquely related to changes in protein level. So we're going to have to proceed with a bit of caution.

So the first step we're going to take in this direction is to try to build a Bayesian network. That's going to give us a way to reason probabilistically over all of these kinds of data, which by themselves are not great guarantees that we're getting the right answer--just like in the protein-protein interaction prediction problem, where individually, coexpression wasn't all that great and essentiality wasn't all that great, but taken together they could be quite helpful.

So we want to compute the probability that the p53 pathway is active, given the data. And the only data we're going to have in this setting is gene expression data. So we're going to assume that for the targets of a transcription factor to be active, the transcription factor itself has to be expressed at a higher level. That's a commonly used restriction in analyzing these kinds of data.

So we're going to try to compute the probability that p53 is activated, given the data. How would I compute the probability that, given that some transcription factor is on, I see expression from its target genes? How would I do this? I would just go into the data and count, in the same way that we did in our previous setting. We could just look over all the experiments and tabulate: when one of the targets is up in expression, how often is the transcription factor that's potentially activating it up? And how often does each of the possible combinations occur? And then we can use Bayesian statistics to compute the probability that a transcription factor is up--activated--given that I've seen the gene expression data. Is that clear? Good.
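[Editor's note: a sketch of that counting argument--binarize the expression data across experiments, tabulate the conditional frequencies, and invert with Bayes' rule. The binarization rule and variable names are illustrative.]

import numpy as np

def p_tf_given_targets(tf_up, targets_up):
    """tf_up: (n_experiments,) bool; targets_up: (n_experiments,) bool
    (e.g. 'most targets above their median expression' in that experiment)."""
    p_tf = tf_up.mean()                          # P(TF up)
    p_t_given_tf = targets_up[tf_up].mean()      # P(targets up | TF up)
    p_t_given_not = targets_up[~tf_up].mean()    # P(targets up | TF down)
    # Bayes' rule: P(TF up | targets up)
    num = p_t_given_tf * p_tf
    return num / (num + p_t_given_not * (1 - p_tf))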
Now, we don't want to include just the downstream factors, because there may be multiple transcription factors that are equally likely to be driving expression of that set of genes. We want to include the upstream regulators as well. And here we're going to take advantage of one of the properties of Bayesian nets that we looked at: explaining away. You'll remember this example, where we decided that if I see that the grass is wet, and I know that it's raining, then I can consider it less likely that the sprinklers were on--even though there's no causal relationship between them. So if I see that a set of targets of transcription factor A are on, and I have evidence that the pathway upstream of A is on, that reduces my inferred probability that transcription factor B is responsible. So that's one of the nice things about Bayesian networks: it gives us a way of reasoning automatically over all the data, and not just the downstream targets.
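[Editor's note: explaining away is easy to verify numerically on that rain/sprinkler/wet-grass example. The sketch below enumerates the joint distribution by brute force; the probability values are made up for illustration.]

from itertools import product

p_rain, p_sprinkler = 0.2, 0.1
p_wet = {(True, True): 0.99, (True, False): 0.9,
         (False, True): 0.9, (False, False): 0.01}

def joint(r, s, w):
    pr = p_rain if r else 1 - p_rain
    ps = p_sprinkler if s else 1 - p_sprinkler
    pw = p_wet[(r, s)] if w else 1 - p_wet[(r, s)]
    return pr * ps * pw

def prob_sprinkler(given_rain=None):
    """P(sprinkler on | grass wet [, rain state]) by enumerating the joint."""
    num = den = 0.0
    for r, s in product([True, False], repeat=2):
        if given_rain is not None and r != given_rain:
            continue
        p = joint(r, s, True)          # grass is observed wet
        den += p
        if s:
            num += p
    return num / den

print(round(prob_sprinkler(), 3))                  # P(S | W)    ~ 0.352
print(round(prob_sprinkler(given_rain=True), 3))   # P(S | W, R) ~ 0.109
# observing rain "explains away" the wet grass, dropping P(sprinkler)
# back toward its prior of 0.1, even though rain and sprinkler are
# a priori independent.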
And Bayesian networks can have multiple layers. We can have one transcription factor turning on another one, which turns on another one, with as many layers as necessary. But one thing we can't have is cycles. So we can't have a transcription factor at the bottom of the network going back and activating things at the top. That's a fundamental limitation of Bayesian networks.

We've already talked about the fact that with Bayesian networks there are two problems we have to solve. We have to be able to define the structure, if we don't know it a priori. Here, we don't know it a priori, so we're going to have to learn the structure of the network. And then, given the structure, we're going to have to learn all the probabilities -- the conditional probability tables that relate each variable to its parents.

And then just two more small points. If I give you only expression data, without any interventions -- just the observations -- then I can't decide what is a cause and what is an effect. Here this was shown in the context of proteomics, but the same is true for gene expression data. If I have two variables, x and y, that are highly correlated, it could be that x activates y, or it could be that y activates x. But if I perturb the system, and block the activity of one of these two genes or proteins, then I can start to tell the difference. In this case, if you inhibit x, you don't see any activation of y -- that's the yellow, all down here. But if you inhibit y, you still see the full range of activity of x. So that implies that x is the activator of y. And so in these settings, if you want to learn a Bayesian network from data, you need more than just a compendium of gene expression data. If you want to get the directions correct, you need perturbations, where someone has actually inhibited particular genes or proteins.

Now, in a lot of these Bayesian networks, we're not going to try to include every possible gene and every possible protein, either because we don't have measurements of it, or because we need a compact network. So there will often be cases where the true regulator in some causal chain is missing from our data. Imagine this is the true causal chain: x activates y, which then activates z and w. But either because we don't have data on y, or because we left it out to make our model more compact, it's not in the model. We can still pick up the relationships between x and z, and between x and w. But the data will be much noisier, because we're missing that information in the conditional probability tables relating x to y, and then y to its two targets.
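Going back to the point about perturbations: here is a small simulation (an assumed toy model, not real data) of why interventions can orient an edge that observational correlation cannot. Clamping the true cause collapses the effect's range; clamping the effect leaves the cause untouched:

```python
import numpy as np

# Assumed ground truth: x activates y. Observationally they are just correlated.
rng = np.random.default_rng(1)
n = 1000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=0.3, size=n)

print("observational corr(x, y):", round(np.corrcoef(x, y)[0, 1], 2))  # ~1.0, symmetric

# Intervene on x (inhibit it): y loses its activity range, down to the noise floor.
y_after = 2.0 * np.zeros(n) + rng.normal(scale=0.3, size=n)
print("std of y after inhibiting x:", round(y_after.std(), 2))         # ~0.3

# Intervene on y: x is generated upstream, so it keeps its full range.
x_after = rng.normal(size=n)
print("std of x after inhibiting y:", round(x_after.std(), 2))         # ~1.0
```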
So Bayesian networks we've now seen quite a lot of, and we have some idea of how to transfer them from one domain to the domain of gene expression data. The next approach we want to look at is a regression-based approach.

The regression-based approaches are founded on a simple idea, which is that the expression of a gene is going to be some function of the expression levels of its regulators. We're actually going to try to come up with a formula that relates the activity levels of the transcription factors to the activity level of the target. In this cartoon, I've got a gene that's on under one condition and off under some other conditions. What transforms it from off to on is the introduction of more of these transcription factors binding to the promoter. So in general, I have some predicted level of expression for the gene -- call it the predicted level y -- and it's some function, unspecified at this point, of all the expression levels of the transcription factors that regulate that gene.

Just to keep the nomenclature straight: x sub g is the expression of gene g, and capital X sub TF(g) is the set of transcription factors that I believe regulate that gene. So x_g = f(X_TF(g)) plus a noise term, where f is an arbitrary function. We include the noise term because this is the observed gene expression, not some sort of platonic view of the true gene expression.

Now frequently we'll assume a specific function -- the simplest one you can imagine, which is a linear function. So the expression of any particular gene is going to be a linear function, a sum, of the expression of all of its regulators, where each regulator has associated with it a coefficient beta. And that beta coefficient tells us how much a particular regulator influences that gene.
So, say, p53 might have a very large value, and some other transcriptional regulator might have a small value, representing their relative influence. Now, I don't know the beta values in advance, so that's one of the things I need to learn. I want to find a setting that tells me what the beta values are for every possible transcription factor. If the algorithm sets a beta value to zero, what does that tell me about that transcriptional regulator? No influence, right. And the higher the value, the greater the influence.

OK, so how do we discover these? The approach is to come up with some objective function that we're going to try to optimize. An obvious objective function is the difference between the observed expression value for each gene and the expected one, based on that linear function. We're going to choose the set of beta parameters that minimizes the difference between the observed and the expected -- that minimizes the sum of the squares. So we minimize the residual sum of squares error between the predicted and the observed.

This is a relatively standard regression problem, just in a different setting. Now, one of the problems with a standard regression problem is that we'll typically get a lot of very small values of beta. We won't get all zeros or all ones, meaning the algorithm is 100% certain that these are the drivers and these are not -- we'll get small values for many, many transcription factors. And OK, that could represent the reality. But the bad thing is that those beta values are going to be unstable: small changes in the training data will give you big changes in which transcription factors have which values. So that's not a desirable setting. There's a whole field built up around trying to come up with better solutions.
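Here is a minimal sketch of that least-squares fit on synthetic data, using numpy's lstsq to minimize the residual sum of squares (all names and numbers here are illustrative, not from any real data set):

```python
import numpy as np

# X holds TF expression (rows = experiments, columns = candidate regulators);
# y is the target gene's expression. In the assumed true model, only
# regulators 0 and 3 actually influence the target.
rng = np.random.default_rng(2)
n_exp, n_tf = 100, 8
X = rng.normal(size=(n_exp, n_tf))
beta_true = np.zeros(n_tf)
beta_true[0], beta_true[3] = 1.5, -0.8
y = X @ beta_true + rng.normal(scale=0.2, size=n_exp)

# Choose the betas that minimize the residual sum of squares ||y - X beta||^2.
beta_hat, rss, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_hat, 2))   # near 1.5 and -0.8 for TFs 0 and 3, small elsewhere
```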
I've given you some references here. One of them is a paper that did well in the DREAM challenge. The other is a very good textbook, The Elements of Statistical Learning. And there are various techniques that allow you to limit the number of betas that are non-zero. By doing that, you get more robust predictions -- at a cost, right, because there could be a lot of transcription factors that really do have small influences. But we'll trade that off, by getting more accurate predictions for the regulators that have the big influences.
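One standard technique of this kind is the lasso, which adds an L1 penalty to the residual sum of squares so that most betas are driven exactly to zero. A minimal sketch, assuming scikit-learn is available (the alpha value is an arbitrary illustrative choice; larger alpha means fewer non-zero regulators):

```python
import numpy as np
from sklearn.linear_model import Lasso   # assumes scikit-learn is installed

# Same synthetic setup as the least-squares sketch above.
rng = np.random.default_rng(3)
n_exp, n_tf = 100, 8
X = rng.normal(size=(n_exp, n_tf))
beta_true = np.zeros(n_tf)
beta_true[0], beta_true[3] = 1.5, -0.8
y = X @ beta_true + rng.normal(scale=0.2, size=n_exp)

# The L1 penalty trades a little fit for a sparse, more stable set of betas.
model = Lasso(alpha=0.1).fit(X, y)
print(np.round(model.coef_, 2))                       # exact zeros except TFs 0 and 3
print("non-zero regulators:", np.flatnonzero(model.coef_))
```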
Are there any questions on regression?

So the last of the methods we're examining uses mutual information. We've already seen mutual information in the course. Information content is related to the probability of observing some symbol in an alphabet. In most languages, the probability of observing particular letters is quite variable: Es are very common in English, other letters are less common, as anyone who plays Hangman or watches Wheel of Fortune knows. And we defined the entropy as the sum, over all possible outcomes, of the probability of observing each value times its information content. We can define it in the discrete case or in the continuous case. The critical quantity is the mutual information between two variables: the difference between the sum of the entropies of those variables taken independently and their joint entropy. High mutual information means that one variable gives me significant knowledge of what the other variable is doing -- it reduces my uncertainty. That's the critical idea.

OK. So we looked at correlation before. There can be settings where you have very low correlation between two variables, but high mutual information. Consider these two genes -- protein A and protein B -- where the blue dots show the relationship between them. You can see that there's a lot of information in these two variables: knowing the value of A gives me high confidence in the value of B. But there's no linear relationship that describes them. So if I use mutual information, I can capture situations like this that I can't capture with correlation.

And these kinds of situations actually occur. For example, in a feed-forward loop: say we've got a regulator A, and it directly activates B. It also directly activates C, but C inhibits B. So you've got the path on the left-hand side pressing the accelerator, and the path on the right-hand side pressing the brake. That's called an incoherent feed-forward loop. Under different settings you can get different kinds of results, of which this is one example. You can get much more complicated behavior -- there are papers that have mapped out these behaviors across many parameter settings -- including switches in the behavior. But in a lot of these settings, you will have high mutual information between two variables even if there's no linear correlation between them.

A well-publicized algorithm that uses mutual information to infer gene regulatory networks is called ARACNe. They go through and compute the mutual information between all pairs of genes in their data set. Now, one question you have with mutual information is: what defines a significant level of mutual information? An obvious way to figure out what's significant is to do randomizations, and that's what they did. They shuffled the expression data and computed the mutual information among pairs of genes where there shouldn't be any relationship, because the data had been shuffled. Then you can decide whether the observed mutual information is significantly greater than what you get from the randomized data.
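To make both points concrete -- strong dependence with near-zero linear correlation, and a permutation null for significance -- here is a small sketch using a plug-in, histogram-based MI estimate (a deliberate simplification; ARACNe itself uses a more careful estimator, and all data here are synthetic):

```python
import numpy as np

def mutual_info(x, y, bins=8):
    """Plug-in MI estimate (in bits) for two continuous vectors, after binning."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return (pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum()

rng = np.random.default_rng(4)
a = rng.uniform(-1, 1, 2000)
b = a ** 2 + rng.normal(scale=0.05, size=a.size)   # strong dependence, ~zero correlation

mi_obs = mutual_info(a, b)
# Permutation null: shuffling one vector destroys any real relationship,
# so the null distribution tells us what MI to expect by chance alone.
null = [mutual_info(a, rng.permutation(b)) for _ in range(200)]
print("corr:", round(np.corrcoef(a, b)[0, 1], 2))                 # ~0
print("MI:", round(mi_obs, 2),
      " null 95th pct:", round(np.percentile(null, 95), 2))       # MI >> null
```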
Now, the other thing that happens with mutual information is that indirect effects still produce high mutual information. Consider the set of genes shown here. You've got G2, which is actually a regulator of G1 and G3. So G2 is going to have high mutual information with G1, and with G3. Now, what about G1 and G3? They're going to behave very similarly as well, so there will be a high degree of mutual information between G1 and G3. If I just rely on mutual information, I can't tell what's a regulator and what's a fellow target at the same level of regulation -- both being affected by something above them. I can't tell the difference between those two. So they use what's called the data processing inequality: the true regulatory interactions should have higher mutual information than the edge that's just between two common targets of the same parent. And so they drop from their network those edges which are the lowest of the three in a triangle.
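Here is the pruning step in miniature (the MI values are toy numbers, not from the paper): for every fully connected triangle of genes, the data processing inequality says to drop the weakest of the three edges:

```python
import itertools

# Toy MI values: G2 is the real regulator of G1 and G3, so the G1-G3 edge
# (two common targets of the same parent) is the indirect, weakest one.
mi = {
    frozenset(("G1", "G2")): 0.9,
    frozenset(("G2", "G3")): 0.8,
    frozenset(("G1", "G3")): 0.5,
}
edges = set(mi)
for tri in itertools.combinations(("G1", "G2", "G3"), 3):
    tri_edges = [frozenset(p) for p in itertools.combinations(tri, 2)]
    if all(e in edges for e in tri_edges):
        edges.discard(min(tri_edges, key=mi.get))   # remove the weakest leg

print(sorted(tuple(sorted(e)) for e in edges))      # G1-G3 has been pruned
```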
So that was the original ARACNe algorithm. They then modified it a little, to try to be more specific about the regulators being picked up, and they called this approach MINDy. The core idea is that in addition to the transcription factors, you might have another protein -- a modulator -- that turns a transcription factor on or off. So if I look over different concentrations of the transcription factor, I might find that there are some cases where this other protein turns it on, and other cases where it turns it off.

Consider these two data sets, looking at different concentrations of a particular transcription factor and different expression levels of its target. In one case -- the blue dots -- the modulator isn't present at all, or is present at its lowest possible level. In the red case, it's present at a high level. You can see that when the modulator is present only at low levels, there's no relationship between the target and its transcription factor. But when the modulator is present at a high level, there's a linear response of the target to its transcription factor. So this modulator seems to be a necessary component.

They went through and defined a whole set of scenarios like this, and then systematically searched the data for these modulators. They started with the expression data set -- genes in rows, experiments in columns. They do a set of filtering steps to remove things that would be problematic for the analysis: they had to start with a list of candidate modulators and transcription factors, and they removed the ones where there isn't enough variation, and so on. Then, for every modulator and transcription factor pair, they examine the cases where the modulator is present at its highest level and where it's present at its lowest level. Let's say that when the modulator is present at a high level, there's high mutual information between the transcription factor and the target, and when the modulator is absent, there's no mutual information -- that's the setting we looked at before. That would suggest that the modulator is an activator, a positive modulator. You can have the opposite situation, where there's mutual information between the transcription factor and its target when the modulator is at low levels, and nothing when the modulator is at a high level. That would suggest the modulator is a negative regulator. And then there are scenarios where the mutual information between transcription factor and target is either uniformly high or uniformly low, so the modulator doesn't seem to be doing anything. So we break it down into these categories.
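A minimal sketch of that conditional-MI comparison -- an illustration in the spirit of MINDy, not the published algorithm; the generative model below is assumed for demonstration:

```python
import numpy as np

def mutual_info(x, y, bins=8):
    """Plug-in MI estimate for two vectors after binning (as in the earlier sketch)."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return (pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum()

rng = np.random.default_rng(5)
n = 3000
mod = rng.uniform(0, 1, n)                 # candidate modulator expression
tf = rng.uniform(0, 1, n)                  # transcription factor expression
# Assumed toy model: the TF only drives the target when the modulator is high.
target = np.where(mod > 0.5, 2 * tf, 0.0) + rng.normal(scale=0.1, size=n)

lo = mod < np.quantile(mod, 0.25)          # modulator-lowest samples
hi = mod > np.quantile(mod, 0.75)          # modulator-highest samples
print("MI(TF, target) | modulator low :", round(mutual_info(tf[lo], target[lo]), 2))
print("MI(TF, target) | modulator high:", round(mutual_info(tf[hi], target[hi]), 2))
# A large high-minus-low MI difference flags the modulator as a positive one.
```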
And you can look at all the different categories in their supplemental tables. One thing that's kind of interesting is that they assume that regardless of how high the transcription factor goes, you'll always see an increase in the expression of the target. So there's no saturation, which is an unnatural assumption for these data sets.

OK. So I think I'll close with this example from their experiment, and then in the next lecture we'll look at how these different methods fare against each other in the DREAM challenge. They specifically wanted to find regulators of MYC. Here's data for a particular candidate modulator, STK38. Here's the set of tumors where STK38 expression is lowest, and a set of tumors where STK38 expression is highest, each sorted by the expression level of MYC. On the left-hand side, you'll see there's no particular relationship between the expression level of MYC and its targets. On the right-hand side, there is a relationship between the expression level of MYC and the targets. So apparently -- at least at this level of mutual information -- having higher levels of STK38 causes the relationship to appear. That would be an example of an activator.

OK. So this technique has a lot of advantages. It allows you to search rapidly over very large data sets, to find potential target-transcription factor relationships, and also potential modulators. It has some limitations. The key limitation is that the signal has to be present in the expression data set. So for a protein like p53, which we know is activated by other processes such as phosphorylation, or NF-kappaB, which is also regulated post-translationally, you might not get any signal. There has to be a case where the transcription factor itself is changing in expression. It also won't work if the modulator is always highly correlated with its target, for some other biological reason.
If the modulator is always on, for its own reasons, whenever the target is, then you'll never be able to divide the data in this way.

One of the other things I think is problematic with these networks is that you get such large networks that they're very hard to interpret. In this case, this is the nearest neighbors of just one node in an ARACNe network. And this is a mutual information network of microRNA modulators that has a quarter of a million interactions. In these data sets, you often end up selecting a very, very large fraction of all the potential modulators. So of all the candidate transcription factors and modulators, it comes up with an answer that roughly 10% to 20% of them are regulating any particular gene, which seems awfully high.

OK. Any questions on the methods we've seen so far? OK. So when we come back on Thursday, we'll take a look, head to head, at how these different methods perform on both synthetic and real data sets.