1 00:00:01,000 --> 00:00:04,000 Today is my last class with you. Aw, I'm sorry, too. You guys are 2 00:00:04,000 --> 00:00:08,000 a lot of fun. This has actually been the most interactive 7.01 3 00:00:08,000 --> 00:00:12,000 I've ever had. Usually there are a couple of people who perk up and 4 00:00:12,000 --> 00:00:16,000 say things, but you guys are great because all sorts of people are 5 00:00:16,000 --> 00:00:20,000 willing to contribute. So, I've had a wonderful time and 6 00:00:20,000 --> 00:00:24,000 it certainly seems like you guys have learned a lot. 7 00:00:24,000 --> 00:00:28,000 What I'd like to do for my last lecture is pick up again a little 8 00:00:28,000 --> 00:00:32,000 bit like I did with genomics and try to give you a sense of where 9 00:00:32,000 --> 00:00:36,000 things are going. I always like doing this because I 10 00:00:36,000 --> 00:00:40,000 get to talk about things that are in none of the textbooks that, 11 00:00:40,000 --> 00:00:44,000 well, I mean, it's just stuff that many people working in the field 12 00:00:44,000 --> 00:00:48,000 don't necessarily know. And that's what's so much fun about 13 00:00:48,000 --> 00:00:52,000 teaching introductory biology is because it only takes a semester for 14 00:00:52,000 --> 00:00:56,000 you guys to get up to the point of at least being able to understand 15 00:00:56,000 --> 00:01:01,000 what's getting done on the cutting-edge. 16 00:01:01,000 --> 00:01:05,000 Even if you might not yet be able to go off and practice it, 17 00:01:05,000 --> 00:01:09,000 you might need a little more experience for that, 18 00:01:09,000 --> 00:01:13,000 but you'd be surprised, it's not that much more. 19 00:01:13,000 --> 00:01:17,000 Take maybe Project Lab and you'll be able to start doing it already. 20 00:01:17,000 --> 00:01:21,000 It's really wonderful that it's possible to grasp what's going on. 
21 00:01:21,000 --> 00:01:25,000 And, in many ways, you guys may have an advantage in grasping what's 22 00:01:25,000 --> 00:01:29,000 going on because, as I've already hinted, 23 00:01:29,000 --> 00:01:33,000 biology's undergoing this remarkable transformation from being a purely 24 00:01:33,000 --> 00:01:37,000 laboratory-based science where each individual works on his or her own 25 00:01:37,000 --> 00:01:41,000 project to being an information-based science that 26 00:01:41,000 --> 00:01:45,000 involves an integration of vast amounts of data across the whole 27 00:01:45,000 --> 00:01:50,000 world and trying to learn things from this tremendous dataset. 28 00:01:50,000 --> 00:01:52,000 And, in that sense, I think the new students coming into 29 00:01:52,000 --> 00:01:55,000 the field have a distinct advantage over those who have been in it. 30 00:01:55,000 --> 00:01:58,000 And certainly the students who know mathematical and physical and 31 00:01:58,000 --> 00:02:01,000 chemical and other sorts of things, and aren't scared to write computer 32 00:02:01,000 --> 00:02:04,000 code when they need to write computer code have a really 33 00:02:04,000 --> 00:02:07,000 great advantage. So, anyway, all that by way of 34 00:02:07,000 --> 00:02:11,000 introduction. I want to talk about two subjects today of great interest 35 00:02:11,000 --> 00:02:14,000 to me. One is DNA variation and one is RNA variation. 36 00:02:14,000 --> 00:02:18,000 The variation of DNA sequence between individuals within a 37 00:02:18,000 --> 00:02:21,000 population, and in particular our population, and the other is RNA 38 00:02:21,000 --> 00:02:25,000 variation, the variation in RNA expression between different cell 39 00:02:25,000 --> 00:02:28,000 types, different tissues. And the work I'm going to talk 40 00:02:28,000 --> 00:02:32,000 about today is work that I, and my colleagues, have all been 41 00:02:32,000 --> 00:02:36,000 involved in. And it's stuff I know and love. 
42 00:02:36,000 --> 00:02:40,000 So, feel free to ask questions about it. I may know the answers, 43 00:02:40,000 --> 00:02:44,000 but what's reasonably fun about these lectures is if I don't know 44 00:02:44,000 --> 00:02:48,000 the answers it's probably the case that the answers aren't known. 45 00:02:48,000 --> 00:02:52,000 So, that's good fun because it's stuff I really do know well, 46 00:02:52,000 --> 00:02:56,000 and I love. So, anyway, here's some DNA sequence. It's pretty boring. 47 00:02:56,000 --> 00:03:00,000 This is a chunk of sequence from, let's say, the human genome. 48 00:03:00,000 --> 00:03:04,000 How much does this differ between any two individuals? 49 00:03:04,000 --> 00:03:09,000 If I were to sequence any two chromosomes, any two copies of the 50 00:03:09,000 --> 00:03:14,000 chromosome from an individual in this class or two individuals on 51 00:03:14,000 --> 00:03:19,000 this planet, how much would they differ? The answer is that much. 52 00:03:19,000 --> 00:03:25,000 That's the average amount of difference between any two people on 53 00:03:25,000 --> 00:03:30,000 this planet. Not a lot. If you counted up, it is on average 54 00:03:30,000 --> 00:03:35,000 one nucleotide difference out of 1,000 nucleotides on average, somewhat 55 00:03:35,000 --> 00:03:41,000 less than one part in 1,000 or better than 99.9% identity 56 00:03:41,000 --> 00:03:46,000 between any two individuals. Now, that is a very small amount, 57 00:03:46,000 --> 00:03:51,000 not just in absolute terms, 99.9% identity is a lot, 58 00:03:51,000 --> 00:03:57,000 but in comparative terms with other species. If I take two chimpanzees 59 00:03:57,000 --> 00:04:02,000 in Africa, on average they will differ by about twice as much as any 60 00:04:02,000 --> 00:04:07,000 two random humans. 
And if I take two orangutans in 61 00:04:07,000 --> 00:04:12,000 Southeast Asia, they will on average differ by about 62 00:04:12,000 --> 00:04:17,000 eight times as much as any two humans on this planet. 63 00:04:17,000 --> 00:04:21,000 You guys think the orangutans all look the same. 64 00:04:21,000 --> 00:04:26,000 They think you all look the same, and they're right. So, why is this? 65 00:04:26,000 --> 00:04:31,000 Why are humans amongst mammalian 66 00:04:31,000 --> 00:04:36,000 species relatively limited in the amount of variation? 67 00:04:36,000 --> 00:04:40,000 Well, it's a direct result of our population history. 68 00:04:40,000 --> 00:04:45,000 It turns out that the amount of variation that can be sustained in a 69 00:04:45,000 --> 00:04:50,000 population depends on two things. At equilibrium, if a population has 70 00:04:50,000 --> 00:04:55,000 constant size N for a very long time and a certain mutation rate, 71 00:04:55,000 --> 00:04:59,000 Mu, you can just write a piece of arithmetic that says, 72 00:04:59,000 --> 00:05:04,000 well, mutations are always arising due to new mutations in the 73 00:05:04,000 --> 00:05:09,000 population and mutations are being lost by genetic drift, 74 00:05:09,000 --> 00:05:14,000 just by random sampling from generation to generation. 75 00:05:14,000 --> 00:05:17,000 And those two processes, the creation of new mutations and 76 00:05:17,000 --> 00:05:21,000 the loss of mutations just due to random sampling in each generation, 77 00:05:21,000 --> 00:05:25,000 sets up an equilibrium, and the equilibrium defines an equation 78 00:05:25,000 --> 00:05:29,000 there, Pi equals one over one plus the reciprocal of four N Mu, an 79 00:05:29,000 --> 00:05:33,000 equation you have no need to memorize whatsoever and possibly 80 00:05:33,000 --> 00:05:36,000 even no need to write down. 
The important point is the concept, 81 00:05:36,000 --> 00:05:40,000 that if you know the number of organisms in the population and you 82 00:05:40,000 --> 00:05:43,000 know the mutation rate, those set up the bounds of mutation 83 00:05:43,000 --> 00:05:47,000 and drift, and you can write down how polymorphic, 84 00:05:47,000 --> 00:05:51,000 how heterozygous random individuals should be at equilibrium. 85 00:05:51,000 --> 00:05:54,000 That is if the population has been at size N for a very long time. 86 00:05:54,000 --> 00:05:58,000 Well, the expected amount of heterozygosity for the 87 00:05:58,000 --> 00:06:02,000 human population -- Sorry. For a population of size 10,000 88 00:06:02,000 --> 00:06:06,000 would be about one nucleotide in 1300. We have exactly the amount of 89 00:06:06,000 --> 00:06:11,000 heterozygosity you would expect for a population of about 10,000 90 00:06:11,000 --> 00:06:15,000 individuals. Yeah, but wait, we're not a population of 10,000 91 00:06:15,000 --> 00:06:20,000 individuals. Why do we have the heterozygosity you would expect from 92 00:06:20,000 --> 00:06:25,000 a population of 10,000 individuals? We're six billion. 93 00:06:25,000 --> 00:06:31,000 It's a reflection of our history. 94 00:06:31,000 --> 00:06:35,000 Because remember I said that was the statement about what the 95 00:06:35,000 --> 00:06:38,000 population heterozygosity should be at equilibrium? 96 00:06:38,000 --> 00:06:42,000 We haven't been six billion people except very recently. 97 00:06:42,000 --> 00:06:45,000 The human population has undergone an exponential expansion. 98 00:06:45,000 --> 00:06:49,000 It used to be a relatively small size, and then it very recently 99 00:06:49,000 --> 00:06:52,000 underwent this huge exponential expansion. If you actually write 100 00:06:52,000 --> 00:06:56,000 down the equations, the amount of variation in our 101 00:06:56,000 --> 00:07:00,000 population was determined by that constant size for a very long time. 
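The equilibrium relation described above, Pi = 1 / (1 + 1/(4 N Mu)), can be checked numerically. The following is a minimal sketch, not part of the lecture: the function names are hypothetical, and it assumes the mutation rate of about two times ten to the minus eighth per nucleotide per generation that the lecture quotes a little later.

```python
def equilibrium_heterozygosity(n, mu):
    """Pi = 1 / (1 + 1/(4*N*mu)), written as 4*N*mu / (1 + 4*N*mu)."""
    theta = 4.0 * n * mu
    return theta / (1.0 + theta)


def implied_population_size(pi, mu):
    """Invert the same relation: the constant size N that would give Pi."""
    return pi / (4.0 * mu * (1.0 - pi))


MU = 2e-8  # per nucleotide per generation (value quoted in the lecture)

pi_10k = equilibrium_heterozygosity(10_000, MU)
print(f"N = 10,000 gives about one difference per {1 / pi_10k:,.0f} nucleotides")

n_implied = implied_population_size(1 / 1300, MU)
print(f"One difference per 1,300 nucleotides implies N of about {n_implied:,.0f}")
```

Run in either direction, the numbers land close to the lecture's figures: a population of 10,000 gives roughly one difference in 1,300 nucleotides, and the observed heterozygosity implies a size near 10,000.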
102 00:07:00,000 --> 00:07:03,000 And then a rapid exponential expansion that's basically taken 103 00:07:03,000 --> 00:07:07,000 place in a mere 3,000 generations, it's much too rapid 104 00:07:07,000 --> 00:07:11,000 to have any effect on the real variation in our population. 105 00:07:11,000 --> 00:07:15,000 What do I mean by that? What's the mutation rate per nucleotide in the 106 00:07:15,000 --> 00:07:18,000 human genome? It's on the order of two times ten to the minus eighth 107 00:07:18,000 --> 00:07:22,000 per generation. In a mere 3,000 generations, 108 00:07:22,000 --> 00:07:26,000 a tiny mutation rate like two times ten to the minus eighth is not going 109 00:07:26,000 --> 00:07:30,000 to be able to build up much more variation. 110 00:07:30,000 --> 00:07:32,000 So you might as well ignore the last 100,000 years or so. 111 00:07:32,000 --> 00:07:34,000 They're irrelevant to how much variation we have. 112 00:07:34,000 --> 00:07:36,000 The variation we have was set by our ancestral population size. 113 00:07:36,000 --> 00:07:38,000 Now, don't get me wrong. Eventually it will equilibrate. 114 00:07:38,000 --> 00:07:40,000 A couple million years from now we will have a much higher variation in 115 00:07:40,000 --> 00:07:42,000 the human population as a function of our size, but the population 116 00:07:42,000 --> 00:07:44,000 variation we have today is set by the fact that humans derive from a 117 00:07:44,000 --> 00:07:47,000 founding population of about 10,000 individuals or so. 118 00:07:47,000 --> 00:07:52,000 And that means that the variation that you see in the human population 119 00:07:52,000 --> 00:07:57,000 is mostly ancestral variation, the variation that we all walked 120 00:07:57,000 --> 00:08:03,000 around with in Africa. And, in fact, that makes a 121 00:08:03,000 --> 00:08:08,000 prediction. 
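The 3,000-generation argument above is back-of-the-envelope arithmetic, and it can be written out as a rough sketch. The factor of two (counting both chromosomes being compared) and the comparison against one difference in 1,300 nucleotides are simplifying assumptions made here, not a population-genetic model from the lecture.

```python
MU = 2e-8            # mutations per nucleotide per generation (from the lecture)
GENERATIONS = 3_000  # rough length of the expansion (from the lecture)

# Expected new mutations per nucleotide along one lineage during the expansion.
new_per_lineage = MU * GENERATIONS

# Comparing two chromosomes means two lineages (simplifying assumption).
new_between_two = 2 * new_per_lineage

# Pairwise difference quoted earlier: about one nucleotide in 1,300.
observed_pairwise = 1 / 1300

print(f"New differences per nucleotide since the expansion: {new_between_two:.1e}")
print(f"Observed pairwise difference per nucleotide:        {observed_pairwise:.1e}")
print(f"Rough share that could be recent: {new_between_two / observed_pairwise:.0%}")
```

Under these assumptions the recent contribution is a small fraction of the total, which is the sense in which most variation is ancestral.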
That would say that if most of the variation in the human 122 00:08:08,000 --> 00:08:13,000 population is from the ancestral African founding population then if 123 00:08:13,000 --> 00:08:19,000 I go to any two villages around this world, in Japan or in Sweden or in 124 00:08:19,000 --> 00:08:24,000 Nigeria, the variants that I see will largely be identical. 125 00:08:24,000 --> 00:08:30,000 And that prediction has been well satisfied. 126 00:08:30,000 --> 00:08:34,000 Because when you go and look and you collect variation in Japan or Sweden 127 00:08:34,000 --> 00:08:38,000 or Africa and you compare it, 90% of the variants are common 128 00:08:38,000 --> 00:08:42,000 across the entire world. Most variation is common ancestral 129 00:08:42,000 --> 00:08:46,000 variation around the world, and only a minority of the variants 130 00:08:46,000 --> 00:08:50,000 are new local mutations restricted to individual populations. 131 00:08:50,000 --> 00:08:54,000 This is so contrary to what people think because there's a natural 132 00:08:54,000 --> 00:08:58,000 tendency to kind of xenophobia, to imagine that world populations 133 00:08:58,000 --> 00:09:02,000 are very different in their genetic background. 134 00:09:02,000 --> 00:09:05,000 But, in point of fact, they're extremely similar. 135 00:09:05,000 --> 00:09:09,000 So, anyway, there's a limited amount of variation. 136 00:09:09,000 --> 00:09:13,000 That's why we have such little variation in the human population. 137 00:09:13,000 --> 00:09:17,000 Now, that variation, humans have a low rate of genetic variation. 138 00:09:17,000 --> 00:09:20,000 Most of the variants that are out there are common genetic 139 00:09:20,000 --> 00:09:24,000 variants, not rare variants. 
If I take your genome and I find a 140 00:09:24,000 --> 00:09:28,000 site of genetic variation at the point of heterozygosity in your 141 00:09:28,000 --> 00:09:32,000 genome, what's the probability that somebody else in this class also is 142 00:09:32,000 --> 00:09:36,000 heterozygous for that spot? It turns out that the odds are about 143 00:09:36,000 --> 00:09:40,000 95% that someone else in this class will also share that variant. 144 00:09:40,000 --> 00:09:44,000 So that the variants are not mostly rare, they're mostly common. 145 00:09:44,000 --> 00:09:48,000 And it turns out that some of this common variation, 146 00:09:48,000 --> 00:09:52,000 that is most of this variation is likely to be important in the risk 147 00:09:52,000 --> 00:09:56,000 of human genetic diseases. So human geneticists have gotten 148 00:09:56,000 --> 00:10:00,000 very excited about the following paradigm. 149 00:10:00,000 --> 00:10:03,000 If there's only a limited amount of genetic variation in the human 150 00:10:03,000 --> 00:10:06,000 population, actually, if you do the arithmetic, 151 00:10:06,000 --> 00:10:09,000 there are only about ten million sites of common variation in the 152 00:10:09,000 --> 00:10:12,000 human population, where common might be defined as 153 00:10:12,000 --> 00:10:15,000 more than about 1% in the population. There are only ten million sites. 154 00:10:15,000 --> 00:10:18,000 Folks are saying, well, why not enumerate them all? 155 00:10:18,000 --> 00:10:22,000 Let's just know them all, and then let's test each one for its 156 00:10:22,000 --> 00:10:25,000 risk of, say, conferring susceptibility to diabetes or heart 157 00:10:25,000 --> 00:10:28,000 disease or whatever? After all, ten million is not as 158 00:10:28,000 --> 00:10:32,000 big a number as it used to be. We now have the whole sequence of 159 00:10:32,000 --> 00:10:36,000 the human genome. 
Why not layer on the sequence of 160 00:10:36,000 --> 00:10:40,000 the human genome all common human genetic polymorphism? 161 00:10:40,000 --> 00:10:44,000 Now, that's a fairly outrageous idea but could be a very useful one. 162 00:10:44,000 --> 00:10:48,000 Some of these variants are important, by the way. 163 00:10:48,000 --> 00:10:52,000 We know that there are two nucleotides that vary in the gene 164 00:10:52,000 --> 00:10:56,000 apolipoprotein E on chromosome number 19. Apolipoprotein E is also 165 00:10:56,000 --> 00:11:00,000 an apolipoprotein like we talked about before with familial 166 00:11:00,000 --> 00:11:04,000 hypercholesterolemia. But, in fact, it turns out that 167 00:11:04,000 --> 00:11:08,000 apolipoprotein E is expressed in the brain. And it turns out, 168 00:11:08,000 --> 00:11:13,000 amongst other tissues, that it comes in three variants, 169 00:11:13,000 --> 00:11:18,000 the spelling T-T, T-C and C-C at those two particular spots. 170 00:11:18,000 --> 00:11:22,000 And if you happen to be homozygous for the E4 variant, 171 00:11:22,000 --> 00:11:27,000 homozygous for the E4 variant, you have about a 60% to 70% lifetime 172 00:11:27,000 --> 00:11:32,000 risk of Alzheimer's disease. In this class 13 of you are 173 00:11:32,000 --> 00:11:37,000 homozygous for E4 and have a high lifetime risk of Alzheimer's. 174 00:11:37,000 --> 00:11:42,000 And it would be fairly trivial to go across the street to anybody's 175 00:11:42,000 --> 00:11:47,000 lab and test that. Now, I don't particularly recommend 176 00:11:47,000 --> 00:11:52,000 it, and I haven't tested myself for this variant because there happens 177 00:11:52,000 --> 00:11:57,000 to be no particular therapy available today to delay the onset 178 00:11:57,000 --> 00:12:01,000 of Alzheimer's disease. And, therefore, 179 00:12:01,000 --> 00:12:05,000 I don't recommend finding out about that. 
But a number of 180 00:12:05,000 --> 00:12:08,000 pharmaceutical companies, knowing that this is a very 181 00:12:08,000 --> 00:12:11,000 important gene in the pathogenesis of Alzheimer's disease, 182 00:12:11,000 --> 00:12:15,000 are working on drugs to try to delay the pathogenesis using this 183 00:12:15,000 --> 00:12:18,000 information. And it may be the case that five or ten years from now 184 00:12:18,000 --> 00:12:21,000 people will begin to offer drugs that will delay the onset of 185 00:12:21,000 --> 00:12:25,000 Alzheimer's disease by delaying the interaction of apolipoprotein E with 186 00:12:25,000 --> 00:12:29,000 a target protein called tau, etc. So, this is an example of where a 187 00:12:29,000 --> 00:12:33,000 common variant in the population points us to the basis of a common 188 00:12:33,000 --> 00:12:37,000 disease and has important therapeutic implications. 189 00:12:37,000 --> 00:12:41,000 There are some other ones, for example. 5% of you carry a 190 00:12:41,000 --> 00:12:45,000 particular variant in your factor V gene, which is in the clotting cascade. 191 00:12:45,000 --> 00:12:49,000 It's called the Leiden variant. Those 5% of you are going to account 192 00:12:49,000 --> 00:12:53,000 for 50% of the admissions to emergency rooms for deep venous 193 00:12:53,000 --> 00:12:57,000 clots, for example. A much higher risk of deep venous 194 00:12:57,000 --> 00:13:02,000 clots. And, in particular, 195 00:13:02,000 --> 00:13:06,000 there are significant issues if you have that variant and you are a 196 00:13:06,000 --> 00:13:11,000 woman taking birth control pills. Some of you are at higher 197 00:13:11,000 --> 00:13:16,000 risk for diabetes, type 2 adult onset diabetes. 198 00:13:16,000 --> 00:13:20,000 There's a particular variant in the population that increases your risk 199 00:13:20,000 --> 00:13:25,000 for type 2 diabetes by about 30%. 
85% of you have the high-risk 200 00:13:25,000 --> 00:13:30,000 factor, so you might as well figure you do. 201 00:13:30,000 --> 00:13:35,000 15% of you have a lower risk, et cetera. And one I'm particularly 202 00:13:35,000 --> 00:13:40,000 interested in here, it turns out that HIV virus gets 203 00:13:40,000 --> 00:13:46,000 into cells with a co-receptor encoded by a gene called CCR5. 204 00:13:46,000 --> 00:13:51,000 Well, it turns out that if we go across the European population, 205 00:13:51,000 --> 00:13:57,000 10% of all chromosomes of European ancestry have a deletion 206 00:13:57,000 --> 00:14:02,000 within the CCR5 gene. If 10% of all chromosomes have that 207 00:14:02,000 --> 00:14:06,000 deletion then 10% times 10%, 1% of all individuals are homozygous 208 00:14:06,000 --> 00:14:10,000 for that deletion. Those individuals are essentially 209 00:14:10,000 --> 00:14:15,000 immune to infection from HIV. They are not susceptible. It's not 210 00:14:15,000 --> 00:14:19,000 through immunity, it's through lack of a receptor. 211 00:14:19,000 --> 00:14:23,000 Yes? You certainly can. It's not hard. It's a specific known variant. 212 00:14:23,000 --> 00:14:28,000 You could test for it. Absolutely. 213 00:14:28,000 --> 00:14:31,000 Now, of course, that only helps the 1% of people who 214 00:14:31,000 --> 00:14:34,000 have that variant. But what it did do was point to the 215 00:14:34,000 --> 00:14:37,000 pharmaceutical industry that the interaction between the virus and 216 00:14:37,000 --> 00:14:41,000 that variant is essential. And now companies are developing 217 00:14:41,000 --> 00:14:44,000 drugs to block the interaction with that particular protein. 218 00:14:44,000 --> 00:14:48,000 And that tells you that it's an important protein. Yes? 219 00:14:48,000 --> 00:14:56,000 Over the whole world? 220 00:14:56,000 --> 00:15:00,000 I just specified European population for that one. 
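The "10% times 10%" step in the CCR5 example above is the standard Hardy-Weinberg calculation. Here is a minimal sketch of it; the function name is hypothetical, and it assumes random mating, which the lecture's arithmetic also takes for granted.

```python
def genotype_frequencies(q):
    """Hardy-Weinberg genotype frequencies for an allele at frequency q.

    Assumes random mating; p is the frequency of the other allele.
    """
    p = 1.0 - q
    return {
        "no_deletion": p * p,          # both chromosomes intact
        "carrier": 2.0 * p * q,        # one chromosome carries the deletion
        "homozygous_deletion": q * q,  # both chromosomes carry the deletion
    }


# 10% of chromosomes of European ancestry carry the CCR5 deletion (from the lecture).
for genotype, freq in genotype_frequencies(0.10).items():
    print(f"{genotype}: {freq:.1%}")
```

The same function shows something the lecture does not spell out: under these assumptions about 18% of individuals carry one copy of the deletion, far more than the 1% who are homozygous.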
221 00:15:00,000 --> 00:15:03,000 That one, interestingly, is not found at as high a frequency 222 00:15:03,000 --> 00:15:06,000 outside of Europe, and no one knows why, 223 00:15:06,000 --> 00:15:09,000 whether that might have been due to an ancient selective event or a 224 00:15:09,000 --> 00:15:13,000 genetic drift. By contrast, the apolipoprotein E 225 00:15:13,000 --> 00:15:16,000 variant, at that frequency of about 3% of people being homozygous and 226 00:15:16,000 --> 00:15:19,000 being at risk for Alzheimer's, is about the same frequency 227 00:15:19,000 --> 00:15:23,000 everywhere in the world. So, there's a little bit of 228 00:15:23,000 --> 00:15:26,000 population variation in frequency. Now, the HIV variant is found 229 00:15:26,000 --> 00:15:30,000 elsewhere but at considerably lower frequencies there. 230 00:15:30,000 --> 00:15:33,000 And that's an interesting question as to what causes that variation. 231 00:15:33,000 --> 00:15:36,000 So the notion would be, I've given you a couple of interesting examples, 232 00:15:36,000 --> 00:15:40,000 but, look, there's only ten million variants. Just write them all down. 233 00:15:40,000 --> 00:15:43,000 Make one big Excel spreadsheet with ten million variants along the top 234 00:15:43,000 --> 00:15:47,000 and all the diseases along the rows, and let's just fill in the matrix 235 00:15:47,000 --> 00:15:50,000 and then we'll really, you know, this is the way people 236 00:15:50,000 --> 00:15:54,000 think in a post-genomic era. Now, could you do something like 237 00:15:54,000 --> 00:15:57,000 that? You would have to enumerate all of the single nucleotide 238 00:15:57,000 --> 00:16:01,000 polymorphisms, or SNPs we call them, 239 00:16:01,000 --> 00:16:05,000 single nucleotide polymorphisms. Now, to give you an idea of the 240 00:16:05,000 --> 00:16:09,000 magnitude of this problem, as recently as 1998, the number of 241 00:16:09,000 --> 00:16:13,000 SNPs that were known in the human genome was a couple hundred. 
242 00:16:13,000 --> 00:16:17,000 But then a project has taken off. In 1998 an initial SNP map of the 243 00:16:17,000 --> 00:16:21,000 human genome was built here at MIT that had about 4,000 244 00:16:21,000 --> 00:16:25,000 of these variants. Then within the next year or so an 245 00:16:25,000 --> 00:16:29,000 international consortium was organized here and elsewhere to 246 00:16:29,000 --> 00:16:34,000 begin to collect more of these genetic variants. 247 00:16:34,000 --> 00:16:38,000 The goal was going to be to find 300,000 of them within a period of two 248 00:16:38,000 --> 00:16:42,000 years. In fact, that goal was blown away and within 249 00:16:42,000 --> 00:16:46,000 three years two million of the SNPs in the human population were found. 250 00:16:46,000 --> 00:16:51,000 And as of today, if you go on the Web, you'll find the database with 251 00:16:51,000 --> 00:16:55,000 about 7.8 million of the roughly ten million SNPs in the human population 252 00:16:55,000 --> 00:17:00,000 already known. Now, that isn't all ten million. 253 00:17:00,000 --> 00:17:03,000 And it takes a while to collect the last ones, you know, 254 00:17:03,000 --> 00:17:07,000 collecting the last ones is hard, but we're already over the hump of 255 00:17:07,000 --> 00:17:10,000 knowing the majority of common variation in the human population. 256 00:17:10,000 --> 00:17:14,000 Not just a sequence of the genome, but a database that already contains 257 00:17:14,000 --> 00:17:17,000 more than half of all common variation in the population. 258 00:17:17,000 --> 00:17:21,000 So, we could start building that Excel spreadsheet. 
259 00:17:21,000 --> 00:17:24,000 Now, it turns out that it's even a little bit better than that because 260 00:17:24,000 --> 00:17:28,000 if we look at many chromosomes in the population, 261 00:17:28,000 --> 00:17:31,000 here are chromosomes in the population, it turns out that the 262 00:17:31,000 --> 00:17:35,000 common variants on each of those chromosomes tend to be correlated 263 00:17:35,000 --> 00:17:38,000 with each other. If I know your genotype at one 264 00:17:38,000 --> 00:17:41,000 variant, like over at this locus, I know your genotype at the next 265 00:17:41,000 --> 00:17:45,000 locus with reasonably high probability. There's a lot of local 266 00:17:45,000 --> 00:17:48,000 correlation. So, instead of looking like a scattered 267 00:17:48,000 --> 00:17:51,000 picture like that, it's more like this. 268 00:17:51,000 --> 00:17:55,000 If I know that you're red, red, red you're probably red, 269 00:17:55,000 --> 00:17:58,000 red, red over here. In other words, these variations occur in blocks 270 00:17:58,000 --> 00:18:01,000 that we called haplotypes. Here's real data. 271 00:18:01,000 --> 00:18:04,000 Across 111 kilobases of DNA there's a bunch of variants, 272 00:18:04,000 --> 00:18:08,000 but it turns out that the variants come in two basic flavors. 273 00:18:08,000 --> 00:18:11,000 98% of all chromosomes are either this, this, this, 274 00:18:11,000 --> 00:18:14,000 this, this or this, this, this, this, this. 275 00:18:14,000 --> 00:18:18,000 Then there tend to be sites of recombination that are actually 276 00:18:18,000 --> 00:18:21,000 hotspots of recombination where most of the recombination in the 277 00:18:21,000 --> 00:18:24,000 population is concentrated. And you get a couple of 278 00:18:24,000 --> 00:18:28,000 possibilities here. So, the human genome can kind of be 279 00:18:28,000 --> 00:18:31,000 broken up into these haplotypes. 
Blocks that might be 20, 280 00:18:31,000 --> 00:18:35,000 30, 40, sometimes 100 kilobases long in which within the block you tend 281 00:18:35,000 --> 00:18:39,000 to have a small number of haplotypes, or flavors as you might think of 282 00:18:39,000 --> 00:18:43,000 them, that define most of the chromosomes in the population. 283 00:18:43,000 --> 00:18:46,000 So, in fact, I don't actually need to know all the variants. 284 00:18:46,000 --> 00:18:50,000 If they're so well correlated within a block, 285 00:18:50,000 --> 00:18:54,000 if I knew this block structure I would be able to pick a small number 286 00:18:54,000 --> 00:18:58,000 of SNPs that would serve as a proxy for that entire block of inheritance 287 00:18:58,000 --> 00:19:01,000 in the population. So, what you might want to do is 288 00:19:01,000 --> 00:19:04,000 determine that entire haplotype block structure of how they're 289 00:19:04,000 --> 00:19:08,000 related to each other, and pick out tag SNPs. 290 00:19:08,000 --> 00:19:11,000 And it turns out that in theory, a mere 300,000 or so of them would 291 00:19:11,000 --> 00:19:14,000 suffice to proxy for most of the genome. So, you might want to 292 00:19:14,000 --> 00:19:18,000 declare an international project, an international haplotype map 293 00:19:18,000 --> 00:19:21,000 project to create a haplotype map of the human genome. 294 00:19:21,000 --> 00:19:24,000 And indeed, such a project was declared about a year and a half ago 295 00:19:24,000 --> 00:19:28,000 through some instigation of scientists at a number of places, 296 00:19:28,000 --> 00:19:31,000 including here. And this is a $100 million project 297 00:19:31,000 --> 00:19:35,000 involving six different countries. 
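The idea of picking a few tag SNPs to stand in for a whole block can be illustrated with a toy example. This is a sketch under stated assumptions, not the method of the haplotype map project: the haplotype data are made up, the correlation measure is the squared correlation r² between two SNPs, and the 0.8 threshold and the greedy selection are choices made here for illustration.

```python
# Toy haplotypes (hypothetical data): each row is one chromosome,
# each column one SNP, coded 0 or 1 for its two alleles.
HAPLOTYPES = [
    (0, 0, 0, 0, 1),
    (0, 0, 0, 0, 0),
    (1, 1, 1, 0, 1),
    (1, 1, 1, 1, 0),
    (0, 0, 0, 1, 1),
    (1, 1, 1, 1, 0),
]


def r_squared(haps, i, j):
    """Squared correlation between SNP i and SNP j across the chromosomes."""
    n = len(haps)
    p_i = sum(h[i] for h in haps) / n
    p_j = sum(h[j] for h in haps) / n
    p_ij = sum(h[i] * h[j] for h in haps) / n
    d = p_ij - p_i * p_j  # covariance between the two sites
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    return 0.0 if denom == 0 else d * d / denom


def pick_tag_snps(haps, threshold=0.8):
    """Greedy selection: repeatedly pick the SNP that proxies for the most
    not-yet-covered SNPs (r^2 >= threshold), until every SNP is covered."""
    untagged = set(range(len(haps[0])))
    tags = []
    while untagged:
        best = max(
            sorted(untagged),
            key=lambda s: sum(r_squared(haps, s, t) >= threshold for t in untagged),
        )
        tags.append(best)
        covered = {t for t in untagged if r_squared(haps, best, t) >= threshold}
        untagged -= covered | {best}
    return tags


print(pick_tag_snps(HAPLOTYPES))
```

In this toy data the first three SNPs always travel together, so one of them serves as a proxy for all three, while the last two are only weakly correlated with anything else and each needs its own tag.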
And, it is already more than 298 00:19:35,000 --> 00:19:39,000 halfway done with the task, and it's very likely that by the 299 00:19:39,000 --> 00:19:42,000 middle of next year, we will have a pretty good haplotype 300 00:19:42,000 --> 00:19:46,000 map, not just knowing all the variation, but knowing the 301 00:19:46,000 --> 00:19:50,000 correlation between that variation, being able to break up the genome 302 00:19:50,000 --> 00:19:53,000 into these blocks. By the next time I teach 7.01, 303 00:19:53,000 --> 00:19:57,000 I should be able to show a haplotype map of the whole human genome 304 00:19:57,000 --> 00:20:01,000 already. That will allow you to start undertaking systematic studies 305 00:20:01,000 --> 00:20:05,000 of inheritance for different diseases across populations. 306 00:20:05,000 --> 00:20:08,000 And in fact, people are already doing things like that. 307 00:20:08,000 --> 00:20:12,000 Here's an example of a study done here at MIT like this, 308 00:20:12,000 --> 00:20:15,000 where to study inflammatory bowel disease, there was evidence that 309 00:20:15,000 --> 00:20:19,000 there might be a particular region of the genome that contained it, 310 00:20:19,000 --> 00:20:22,000 and haplotypes were determined across this, and blah, 311 00:20:22,000 --> 00:20:26,000 blah, blah, blah, blah, blah, blah. And this red haplotype 312 00:20:26,000 --> 00:20:29,000 here turns out to confer high risk, about a two-and-a-half-fold or higher 313 00:20:29,000 --> 00:20:33,000 risk of inflammatory bowel disease. 314 00:20:33,000 --> 00:20:36,000 And it sits over some genes involved in immune responses, 315 00:20:36,000 --> 00:20:40,000 certain cytokine genes and all that. And, things like this have been 316 00:20:40,000 --> 00:20:44,000 done for type 2 diabetes, schizophrenia, cardiovascular 317 00:20:44,000 --> 00:20:47,000 disease, just right now at the moment, a dozen or two examples. 
318 00:20:47,000 --> 00:20:51,000 But I think we're set for an explosion in this kind of work. 319 00:20:51,000 --> 00:20:55,000 In addition, you can use this information to do things beyond 320 00:20:55,000 --> 00:20:59,000 medical genetics. You can use it for history and 321 00:20:59,000 --> 00:21:03,000 anthropology as well. It turns out rather interestingly, 322 00:21:03,000 --> 00:21:07,000 that since the human population originated in Africa and spread out 323 00:21:07,000 --> 00:21:12,000 from Africa all the way around the world arriving at different places 324 00:21:12,000 --> 00:21:17,000 in different times, you can trace those migrations by 325 00:21:17,000 --> 00:21:21,000 virtue of rare genetic variants that arose along the way, 326 00:21:21,000 --> 00:21:26,000 and let you, like a trail of breadcrumbs, see the migrations. 327 00:21:26,000 --> 00:21:30,000 So, for example, there are certain rare genetic variants that we can 328 00:21:30,000 --> 00:21:35,000 see in a South American Indian tribe, and we can actually see that they 329 00:21:35,000 --> 00:21:40,000 came along this route because we can see that residual of that. 330 00:21:40,000 --> 00:21:45,000 In fact, we can do things with this like take a look at Native American 331 00:21:45,000 --> 00:21:50,000 individuals and determine that they cluster into three distinct genetic 332 00:21:50,000 --> 00:21:55,000 groups that represent three distinct migrations over the land bridge. 333 00:21:55,000 --> 00:22:00,000 And, you can assign them to these different migrations. 334 00:22:00,000 --> 00:22:03,000 You can do this on the basis of mitochondrial genotype, 335 00:22:03,000 --> 00:22:06,000 etc. 
You can also, for example, determine when people talk about the 336 00:22:06,000 --> 00:22:09,000 out of Africa migration, there's now increasing evidence that 337 00:22:09,000 --> 00:22:13,000 there really were two, one that went this way over the land, 338 00:22:13,000 --> 00:22:16,000 and one that went this way following along the coast into southeast Asia. 339 00:22:16,000 --> 00:22:19,000 And, it looks like we're now beginning to get enough evidence of 340 00:22:19,000 --> 00:22:22,000 these two separate migrations by virtue of the genetic breadcrumbs 341 00:22:22,000 --> 00:22:26,000 that they have left along the way. 342 00:22:26,000 --> 00:22:30,000 So, it's really a very fascinating thing of how much you can 343 00:22:30,000 --> 00:22:34,000 reconstruct from looking at genetic variation, both the common variation 344 00:22:34,000 --> 00:22:38,000 that allows us to recognize medical risk, and the rare genetic variation 345 00:22:38,000 --> 00:22:43,000 that provides much more individual trails of things. 346 00:22:43,000 --> 00:22:47,000 None of this is perfect yet. There's lots to learn. But I think 347 00:22:47,000 --> 00:22:51,000 anthropologists are finding that the existing human population has a 348 00:22:51,000 --> 00:22:55,000 tremendous amount of its own history embedded in the pattern of genetic 349 00:22:55,000 --> 00:23:00,000 variation across the world. You can do other things. 350 00:23:00,000 --> 00:23:04,000 I won't spend much time on this. Well, I'll take a moment on this, 351 00:23:04,000 --> 00:23:09,000 right? There's some very interesting work of a post-doctoral 352 00:23:09,000 --> 00:23:13,000 fellow here at MIT named Pardis Sabeti who has been trying to ask, 353 00:23:13,000 --> 00:23:18,000 can we see in the genetic variation in the population, 354 00:23:18,000 --> 00:23:22,000 signatures, patterns of ancient selection, or even recent selection 355 00:23:22,000 --> 00:23:27,000 in the human population? 
Now, hang onto your seats, 356 00:23:27,000 --> 00:23:32,000 because this will get just slightly tricky. 357 00:23:32,000 --> 00:23:35,000 But, hang on. It's only a couple of slides. Here was her idea. 358 00:23:35,000 --> 00:23:39,000 You see, when a mutation arises in the population, 359 00:23:39,000 --> 00:23:43,000 it usually dies out, right? Any new mutation just 360 00:23:43,000 --> 00:23:47,000 typically dies out. But, sometimes by chance it drifts 361 00:23:47,000 --> 00:23:50,000 up to a high frequency. Random events happen. But it 362 00:23:50,000 --> 00:23:54,000 usually takes a long time to do that. If some random mutation happens, 363 00:23:54,000 --> 00:23:58,000 and it happens to drift up to high frequency with no selection on it, 364 00:23:58,000 --> 00:24:02,000 then on average it takes a long time to do so. 365 00:24:02,000 --> 00:24:05,000 If you want, I could write a stochastic differential equation 366 00:24:05,000 --> 00:24:09,000 that would say that, but just take your gut feeling that 367 00:24:09,000 --> 00:24:12,000 if something has no selection on it and it's a rare event that'll drift 368 00:24:12,000 --> 00:24:16,000 up, when it drifts up it's kind of a slow process. If it was a slow process, 369 00:24:16,000 --> 00:24:20,000 then over the course of time that it took to drift to high frequency, 370 00:24:20,000 --> 00:24:23,000 a lot of genetic recombination would have had to have occurred over many 371 00:24:23,000 --> 00:24:27,000 generations. And the correlation between the genotype at that spot 372 00:24:27,000 --> 00:24:31,000 and genotypes at other loci would break down. 373 00:24:31,000 --> 00:24:34,000 And there would only be short-range correlation. So, 374 00:24:34,000 --> 00:24:38,000 in other words, the amount of correlation between knowing the 375 00:24:38,000 --> 00:24:41,000 genotype here and the genotype here, maybe allele A here and a C here, 376 00:24:41,000 --> 00:24:45,000 that is an indication of time. It's a clock almost.
It's like 377 00:24:45,000 --> 00:24:49,000 radioactive decay, right, that genetic recombination 378 00:24:49,000 --> 00:24:52,000 scrambles up the correlations. And, if something's old, the 379 00:24:52,000 --> 00:24:56,000 correlations go over short distances. But suppose that something happened. 380 00:24:56,000 --> 00:25:00,000 Some mutation happened that was very advantageous. 381 00:25:00,000 --> 00:25:03,000 Then, it would have risen to high frequency quickly because it was 382 00:25:03,000 --> 00:25:07,000 under selection. If it did so quickly, 383 00:25:07,000 --> 00:25:11,000 then the long-range correlations would not have had time to break 384 00:25:11,000 --> 00:25:15,000 down, and we'd have a smoking gun. A smoking gun would be that there 385 00:25:15,000 --> 00:25:18,000 would be a long-range correlation around that locus, 386 00:25:18,000 --> 00:25:22,000 much longer than you would expect across the genome. 387 00:25:22,000 --> 00:25:26,000 Things even out at this distance would show correlation with that, 388 00:25:26,000 --> 00:25:30,000 indicating that this was a recent event. 389 00:25:30,000 --> 00:25:34,000 So, we just measure across the genome, and look for this telltale 390 00:25:34,000 --> 00:25:39,000 sign of common variants that have very long range correlation that 391 00:25:39,000 --> 00:25:44,000 indicate that they're very recent. So, a plot of the allele frequency, 392 00:25:44,000 --> 00:25:49,000 common variants, sorry, if something has a common high frequency and 393 00:25:49,000 --> 00:25:54,000 long-range correlation, you wouldn't expect that by chance. 394 00:25:54,000 --> 00:25:58,000 So, something that was common in its 395 00:25:58,000 --> 00:26:02,000 frequency and had long-range correlation would be a signature of 396 00:26:02,000 --> 00:26:06,000 positive selection.
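The "clock" analogy above can be made concrete with a small sketch. It assumes the textbook approximation that the correlation between two loci shrinks by a factor of (1 - r) each generation, where r is the chance of recombination between them per generation. The value of r and the generation counts below are illustrative assumptions, not numbers from the lecture, and this is not the statistical test used in the work described.

```python
import math

def correlation_remaining(r, generations):
    """Fraction of the original correlation between two loci that survives
    after `generations`, assuming it decays by (1 - r) per generation."""
    return (1.0 - r) ** generations

def generations_until(r, fraction_left):
    """Number of generations for the correlation to fall to `fraction_left`."""
    return math.log(fraction_left) / math.log(1.0 - r)

# Assumed value: r = 0.001 per generation between the variant and a distant marker.
R = 0.001

# A variant that rose quickly (assumed 400 generations ago) versus one that
# drifted up slowly (assumed 10,000 generations ago).
young = correlation_remaining(R, 400)
old = correlation_remaining(R, 10_000)
half_life = generations_until(R, 0.5)

print(f"young variant keeps {young:.2f} of its long-range correlation")
print(f"old variant keeps {old:.6f} of its long-range correlation")
print(f"half-life of the correlation: about {half_life:.0f} generations")
```

Under these assumptions the young variant still carries roughly two thirds of its long-range correlation while the old one has essentially none, which is the contrast the smoking-gun argument relies on.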
So anyway, Pardis had this idea, 397 00:26:06,000 --> 00:26:09,000 and she tried it out with some interesting mutations, 398 00:26:09,000 --> 00:26:13,000 some mutations that confer resistance to malaria, 399 00:26:13,000 --> 00:26:17,000 one well-known mutation causing resistance to malaria called G6PD 400 00:26:17,000 --> 00:26:21,000 and another one that she herself had proposed as a mutation causing 401 00:26:21,000 --> 00:26:24,000 resistance to malaria, variants in the CD40 ligand gene. 402 00:26:24,000 --> 00:26:28,000 And to make a long story short, both the known and her newly 403 00:26:28,000 --> 00:26:32,000 predicted variant showed this telltale property of having a high 404 00:26:32,000 --> 00:26:36,000 frequency and very long range correlation. 405 00:26:36,000 --> 00:26:40,000 Well that's very interesting because she was able to show that each of 406 00:26:40,000 --> 00:26:44,000 these mutations probably was the result of positive selection. 407 00:26:44,000 --> 00:26:49,000 But what you could do in principle is test every variant in the human 408 00:26:49,000 --> 00:26:53,000 genome this way: take any variant, look at its frequency, and compare 409 00:26:53,000 --> 00:26:58,000 it to the long range correlation around it, and test every single 410 00:26:58,000 --> 00:27:02,000 variant in the human population to see which ones might be the result 411 00:27:02,000 --> 00:27:06,000 of positive selection. Now, when she proposed this, 412 00:27:06,000 --> 00:27:09,000 this was about a year and a half ago or two years ago, 413 00:27:09,000 --> 00:27:12,000 this was a pretty nutty idea because you would need all the variants in 414 00:27:12,000 --> 00:27:15,000 the human population, and you would need all this 415 00:27:15,000 --> 00:27:18,000 correlation information.
But in fact, as I say, that 416 00:27:18,000 --> 00:27:21,000 information's almost upon us, and I believe that this experiment, 417 00:27:21,000 --> 00:27:24,000 this analysis to look for all strong positive selection in the human 418 00:27:24,000 --> 00:27:27,000 genome will in fact be done in the course of the next 12 months. 419 00:27:27,000 --> 00:27:30,000 So, I'm hoping by next year I can actually report on a genome-wide 420 00:27:30,000 --> 00:27:33,000 search for all the signatures of positive selection. 421 00:27:33,000 --> 00:27:36,000 Now, this doesn't detect all positive selection. 422 00:27:36,000 --> 00:27:39,000 It will detect sufficiently strong positive selection going back pretty 423 00:27:39,000 --> 00:27:42,000 much only over the last 10,000 years. When you do the 424 00:27:42,000 --> 00:27:45,000 arithmetic, that's how much power you have. Of course, 425 00:27:45,000 --> 00:27:48,000 10,000 years has been a pretty interesting time for the human 426 00:27:48,000 --> 00:27:52,000 population, right? The time of civilization and 427 00:27:52,000 --> 00:27:55,000 population density, and infectious diseases, 428 00:27:55,000 --> 00:27:58,000 and all that, and I think we'll have an interesting window into 429 00:27:58,000 --> 00:28:02,000 the change in diet. All of that should come out of 430 00:28:02,000 --> 00:28:06,000 something like this. So, there's a lot of really cool 431 00:28:06,000 --> 00:28:10,000 information in DNA variation to be had. All right, 432 00:28:10,000 --> 00:28:14,000 that's one half. The other half of what I would like to talk about is 433 00:28:14,000 --> 00:28:18,000 totally different. It's not about inherited DNA 434 00:28:18,000 --> 00:28:22,000 variation. It's about somatic differences between tissues in RNA 435 00:28:22,000 --> 00:28:26,000 variation. So, let's shift gears. 436 00:28:26,000 --> 00:28:30,000 RNA variation: let me start by giving you an example here.
437 00:28:30,000 --> 00:28:36,000 These are cells from two different patients with acute leukemia. 438 00:28:36,000 --> 00:28:43,000 Can you spot the difference between these? Yep? More like bunches of 439 00:28:43,000 --> 00:28:49,000 grapes and all that. Yeah, it turns out that's just a 440 00:28:49,000 --> 00:28:56,000 reflection of the field of view you have if you move over 441 00:28:56,000 --> 00:29:02,000 to look like that. But I mean, that's good. 442 00:29:02,000 --> 00:29:07,000 It's just that it turns out that that isn't actually a distinction 443 00:29:07,000 --> 00:29:12,000 when you look at more fields. Anything else? Yep? White blood 444 00:29:12,000 --> 00:29:16,000 cells look different. They look broken. There's more of 445 00:29:16,000 --> 00:29:21,000 them in this field of view. But you look at 100 fields of view 446 00:29:21,000 --> 00:29:26,000 and it turns out that's not it either. Well, the reason you're having 447 00:29:26,000 --> 00:29:31,000 trouble spotting any difference is that highly trained pathologists 448 00:29:31,000 --> 00:29:35,000 can't find any difference either. I generally agree there's no 449 00:29:35,000 --> 00:29:39,000 difference between these two if you look at enough fields of view. 450 00:29:39,000 --> 00:29:43,000 But you can convince yourself if you look that you see things there. 451 00:29:43,000 --> 00:29:46,000 But these actually are two very different kinds of leukemia. 452 00:29:46,000 --> 00:29:50,000 And, these patients have to be treated very differently. 453 00:29:50,000 --> 00:29:54,000 But, pathologists cannot determine which leukemia it is just by looking 454 00:29:54,000 --> 00:29:57,000 at the microscope, it turns out.
This is the work of this man, 455 00:29:57,000 --> 00:30:01,000 Sidney Farber, namesake of the Dana-Farber Cancer Institute here in 456 00:30:01,000 --> 00:30:05,000 Boston, who in the 1950s began noticing that patients with 457 00:30:05,000 --> 00:30:08,000 leukemias, some of them seemed different in the way they responded 458 00:30:08,000 --> 00:30:12,000 to a certain treatment, and he said, look, I think there's 459 00:30:12,000 --> 00:30:16,000 some underlying classification of these leukemias, 460 00:30:16,000 --> 00:30:19,000 but I can't get any reliable way to tell it in the microscope. 461 00:30:19,000 --> 00:30:23,000 And he put many years into working this out, first by noticing certain 462 00:30:23,000 --> 00:30:27,000 differences in enzymes in the cells, and then people noticed certain 463 00:30:27,000 --> 00:30:31,000 things in cell surface markers, and some chromosomal rearrangements. 464 00:30:31,000 --> 00:30:34,000 And nowadays, there are a bunch of tests that can be done by a 465 00:30:34,000 --> 00:30:38,000 pathologist when a patient comes in with acute leukemia to determine 466 00:30:38,000 --> 00:30:42,000 whether they have AML or ALL. But it turns out that you can't do 467 00:30:42,000 --> 00:30:46,000 it by looking. You have to do some kind of 468 00:30:46,000 --> 00:30:50,000 immunohistochemical test of some sort in order to do that. 469 00:30:50,000 --> 00:30:54,000 So this is a triumph of diagnosis. After 40 years of work, we can now 470 00:30:54,000 --> 00:30:58,000 correctly classify patients as AML or ALL. And they get the 471 00:30:58,000 --> 00:31:02,000 appropriate treatment. And if they don't get the right 472 00:31:02,000 --> 00:31:06,000 treatment, they have a much higher chance of dying. 473 00:31:06,000 --> 00:31:10,000 And if they do get the right treatment, they have a much higher 474 00:31:10,000 --> 00:31:14,000 chance of living. So, this is great. 475 00:31:14,000 --> 00:31:18,000 There's only one problem with the story.
It took 40 years, 476 00:31:18,000 --> 00:31:22,000 40 years to sort this out. That's a long time. Couldn't we do 477 00:31:22,000 --> 00:31:26,000 better? Surely these cells know what they are. 478 00:31:26,000 --> 00:31:30,000 Surely we could just ask them what they are. Well, here's the idea. 479 00:31:30,000 --> 00:31:33,000 Suppose we could ask each cell, please tell us every gene that you 480 00:31:33,000 --> 00:31:37,000 have turned on, and the level to which you have that 481 00:31:37,000 --> 00:31:40,000 gene expressed. In other words, 482 00:31:40,000 --> 00:31:44,000 let us summarize each cell, each tumor by a description of its 483 00:31:44,000 --> 00:31:47,000 complete pattern of gene expression across the 22,000 genes in the human genome. 484 00:31:47,000 --> 00:31:51,000 Let's write down the level of expression, X1 up to X22,000, 485 00:31:51,000 --> 00:31:54,000 for each of the 22,000 genes of the genome. So, 486 00:31:54,000 --> 00:31:58,000 every tumor becomes a point in 22,000-dimensional space, right? 487 00:31:58,000 --> 00:32:01,000 Now clearly, if we had every tumor described as a point in 488 00:32:01,000 --> 00:32:05,000 22,000-dimensional space, we ought to be able to sort out which tumors are 489 00:32:05,000 --> 00:32:09,000 similar to each other, right? Well, it turns out you can 490 00:32:09,000 --> 00:32:13,000 do that now. These are gene chips, one of several technologies by which 491 00:32:13,000 --> 00:32:17,000 on a piece of glass are put little spots, each of which contains a 492 00:32:17,000 --> 00:32:21,000 piece of DNA, a unique DNA sequence. Actually, many copies of that DNA 493 00:32:21,000 --> 00:32:25,000 sequence are there. Each of these is a 25 base long DNA 494 00:32:25,000 --> 00:32:29,000 sequence, and I can design this so whatever DNA sequence you 495 00:32:29,000 --> 00:32:32,000 want is in each spot.
The way that's done is with the same 496 00:32:32,000 --> 00:32:36,000 photolithographic techniques that are used to make microprocessors. 497 00:32:36,000 --> 00:32:40,000 People have worked out a chemistry where through a mask, 498 00:32:40,000 --> 00:32:44,000 you shine a light, photodeprotect certain pixels; the pixels that are 499 00:32:44,000 --> 00:32:48,000 photodeprotected you can chemically attach an A, then re-protect the 500 00:32:48,000 --> 00:32:52,000 surface. Use a light. Chemically photodeprotect certain 501 00:32:52,000 --> 00:32:56,000 spots. Wash on a C. And in this fashion, 502 00:32:56,000 --> 00:33:00,000 since you can randomly address the spots by light, 503 00:33:00,000 --> 00:33:04,000 and then chemically add bases to whatever spots are deprotected, 504 00:33:04,000 --> 00:33:08,000 you can simultaneously construct hundreds of thousands of spots each 505 00:33:08,000 --> 00:33:12,000 containing its own unique specified oligonucleotide sequence. 506 00:33:12,000 --> 00:33:16,000 And you can get them in little plastic chips. 507 00:33:16,000 --> 00:33:20,000 And then if you want, all you do is you take a tumor. 508 00:33:20,000 --> 00:33:24,000 You grind it up. You prepare RNA. You fluorescently label the RNA 509 00:33:24,000 --> 00:33:28,000 with some appropriate fluorescent dye. You squirt it into the chip. 510 00:33:28,000 --> 00:33:31,000 You wash it back and forth. You rock it back and forth, 511 00:33:31,000 --> 00:33:35,000 wash it out, and stick it in a laser scanner. And it'll see how much 512 00:33:35,000 --> 00:33:38,000 fluorescence is stuck to each spot. And bingo: you get a readout of the 513 00:33:38,000 --> 00:33:42,000 level of gene expression. I guess each spot, you should 514 00:33:42,000 --> 00:33:45,000 design it so that this spot has an oligonucleotide complementary to 515 00:33:45,000 --> 00:33:49,000 gene number one. 
And the next one, 516 00:33:49,000 --> 00:33:53,000 an oligonucleotide matching by Crick-Watson base pairing 517 00:33:53,000 --> 00:33:56,000 complementary to gene number two and gene number three. 518 00:33:56,000 --> 00:34:00,000 So, if I knew all the genes in the genome, I could make a detector spot 519 00:34:00,000 --> 00:34:03,000 for each gene in the genome. And of course we know essentially 520 00:34:03,000 --> 00:34:07,000 all the genes in the genome. So you can make those detector 521 00:34:07,000 --> 00:34:10,000 spots and you can buy them. So, you can now get a readout of 522 00:34:10,000 --> 00:34:13,000 all the, I mean, this is like so cool because when I 523 00:34:13,000 --> 00:34:17,000 started teaching 7.01, which wasn't that long ago because I 524 00:34:17,000 --> 00:34:20,000 ain't (sic) that old still, the way people did an analysis of 525 00:34:20,000 --> 00:34:23,000 gene expression is they used primitive technologies where they 526 00:34:23,000 --> 00:34:27,000 would analyze one gene at a time, certain things called northern blots 527 00:34:27,000 --> 00:34:30,000 and things like that, right? And, you know, 528 00:34:30,000 --> 00:34:34,000 you'd put in a lot of work and you get the expression level of a gene, 529 00:34:34,000 --> 00:34:37,000 whereas now you can get the expression of all the genes 530 00:34:37,000 --> 00:34:41,000 simultaneously, and it's pretty mind boggling that 531 00:34:41,000 --> 00:34:44,000 you can do that. How do you analyze data like that? 532 00:34:44,000 --> 00:34:48,000 So, we still use northern blots. It's true. So, 533 00:34:48,000 --> 00:34:51,000 every tumor becomes a vector, and we get a vector corresponding to 534 00:34:51,000 --> 00:34:55,000 each tumor. So, this line here is the first tumor, 535 00:34:55,000 --> 00:34:59,000 the second tumor, the third tumor, the fourth tumor. 536 00:34:59,000 --> 00:35:02,000 The columns here correspond to genes.
There are 22,000 537 00:35:02,000 --> 00:35:06,000 columns in this matrix, and I've shown a certain subset of 538 00:35:06,000 --> 00:35:10,000 the columns because these genes here have the interesting property that 539 00:35:10,000 --> 00:35:14,000 they tend to be high red in the ALL tumors, and they tend to be low blue 540 00:35:14,000 --> 00:35:18,000 in the AML tumors, whereas these genes here have the 541 00:35:18,000 --> 00:35:22,000 opposite property. They tend to be low blue in the ALL 542 00:35:22,000 --> 00:35:26,000 tumors and high red in the AML tumors. These genes do a pretty 543 00:35:26,000 --> 00:35:30,000 good job of telling apart these tumors. 544 00:35:30,000 --> 00:35:35,000 So, here's a new tumor. Patient came in. We analyzed the 545 00:35:35,000 --> 00:35:40,000 RNA, squirted it on the chip. Can somebody classify that? Louder? 546 00:35:40,000 --> 00:35:45,000 AML. Next? Next? Congratulations, you're 547 00:35:45,000 --> 00:35:50,000 pathologists. Very good. That's right, you can do that. 548 00:35:50,000 --> 00:35:56,000 It works. And in fact, in the study that was done that was 549 00:35:56,000 --> 00:36:01,000 published about this, the computer was able to get it 550 00:36:01,000 --> 00:36:05,000 right 100% of the time. Not bad. So now you say, 551 00:36:05,000 --> 00:36:09,000 wait, wait, wait, but you're cheating. 552 00:36:09,000 --> 00:36:12,000 You're giving it a whole bunch of knowns. Once I have a whole bunch 553 00:36:12,000 --> 00:36:15,000 of knowns it's not so hard to classify a new tumor. 554 00:36:15,000 --> 00:36:19,000 What Sidney Farber did was he discovered in the first place that 555 00:36:19,000 --> 00:36:22,000 there existed two subtypes. Surely that's harder than 556 00:36:22,000 --> 00:36:26,000 classifying when you're given a bunch of knowns. And 557 00:36:26,000 --> 00:36:29,000 that's true.
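Classifying a new tumor from a set of knowns can be sketched in a few lines. This is a minimal nearest-centroid classifier standing in for whatever procedure the published study actually used, and the four "marker genes" and their expression values are invented toy numbers, not data from the lecture.

```python
def centroid(samples):
    """Average expression vector of a list of equal-length samples."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]

def squared_distance(a, b):
    """Squared Euclidean distance between two expression vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def classify(sample, centroids):
    """Return the label whose centroid lies closest to the sample."""
    return min(centroids, key=lambda label: squared_distance(sample, centroids[label]))

# Toy knowns (assumed values): genes 1-2 run high in ALL, genes 3-4 run high in AML.
known = {
    "ALL": [[9.0, 8.5, 1.0, 1.5], [8.0, 9.5, 2.0, 1.0]],
    "AML": [[1.0, 2.0, 9.0, 8.0], [2.0, 1.5, 8.5, 9.5]],
}
centroids = {label: centroid(rows) for label, rows in known.items()}

new_tumor = [1.5, 1.0, 9.0, 9.0]  # hypothetical patient profile
label = classify(new_tumor, centroids)
print(label)  # prints AML
```

The same idea scales to thousands of genes: each class is summarized by its average profile, and a new tumor is assigned to the class it sits nearest to.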
So, suppose instead, 558 00:36:29,000 --> 00:36:33,000 I didn't tell you in advance which were AML's and which were ALL's, 559 00:36:33,000 --> 00:36:37,000 and I just gave you vectors corresponding to a large number of 560 00:36:37,000 --> 00:36:41,000 tumors, do you think you would be able to sort out that they actually 561 00:36:41,000 --> 00:36:49,000 fell into two clusters? 562 00:36:49,000 --> 00:36:53,000 Could you by computer tell that there's one class and the other 563 00:36:53,000 --> 00:36:57,000 class? Turns out that you can. Now, I've made it a little easier 564 00:36:57,000 --> 00:37:02,000 by not listing most of the 22,000 columns here. 565 00:37:02,000 --> 00:37:06,000 But think about it. Every tumor is a point in 566 00:37:06,000 --> 00:37:10,000 22,000-dimensional space. If some of the tumors are similar, 567 00:37:10,000 --> 00:37:14,000 what can you say about those points in 22,000 dimensional space? 568 00:37:14,000 --> 00:37:18,000 They're going to be clumped together. They're near each other. 569 00:37:18,000 --> 00:37:22,000 So, just plot every tumor as a point in 22,000 dimensional space, 570 00:37:22,000 --> 00:37:26,000 and your question is, do the points tend to lie in two clumps up in 571 00:37:26,000 --> 00:37:30,000 22,000-dimensional space? And there's simple arithmetic you 572 00:37:30,000 --> 00:37:34,000 can learn using linear algebra to get some separating hyperplane and 573 00:37:34,000 --> 00:37:38,000 ask, do tumors lie on one side or the other? And, 574 00:37:38,000 --> 00:37:42,000 it turns out that procedures like that will quickly tell you that 575 00:37:42,000 --> 00:37:46,000 these tumors clump into two very clear clumps. They're not randomly 576 00:37:46,000 --> 00:37:50,000 distributed.
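Finding the two clumps without being told the labels can be sketched with a tiny two-cluster k-means. This is offered as one simple stand-in for the clustering idea, not as the procedure used in the study, and the unlabeled profiles are invented toy values over four genes rather than 22,000.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two expression vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise average of a non-empty list of points."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]

def two_means(points, iterations=20):
    """Split points into two clusters; returns a 0/1 assignment per point."""
    # Deterministic start: the first point, and the point farthest from it.
    centers = [points[0], max(points, key=lambda p: squared_distance(p, points[0]))]
    assignments = []
    for _ in range(iterations):
        groups = [[], []]
        assignments = []
        for p in points:
            nearer_first = squared_distance(p, centers[0]) <= squared_distance(p, centers[1])
            k = 0 if nearer_first else 1
            groups[k].append(p)
            assignments.append(k)
        new_centers = [mean(g) if g else centers[k] for k, g in enumerate(groups)]
        if new_centers == centers:
            break  # assignments have stopped changing
        centers = new_centers
    return assignments

# Unlabeled toy tumors (assumed values), deliberately interleaved.
tumors = [
    [9.0, 8.5, 1.0, 1.5],
    [1.0, 2.0, 9.0, 8.0],
    [8.0, 9.5, 2.0, 1.0],
    [2.0, 1.5, 8.5, 9.5],
    [8.5, 9.0, 1.5, 1.0],
]
clusters = two_means(tumors)
print(clusters)  # prints [0, 1, 0, 1, 0]
```

The boundary between the two cluster centers is a hyperplane halfway between them, so asking which center a tumor is nearer to is the same as asking which side of that hyperplane it lies on.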
And so, if you get these tumors, 577 00:37:50,000 --> 00:37:54,000 and you do gene expression on them and put the data into a computer, 578 00:37:54,000 --> 00:37:58,000 the amount of time it takes the computer to discover that there were 579 00:37:58,000 --> 00:38:02,000 actually two types of acute leukemia is about three seconds marked down 580 00:38:02,000 --> 00:38:06,000 from 40 years. That's good. So, you can reproduce the discovery 581 00:38:06,000 --> 00:38:10,000 of AML and ALL in three seconds. Now you know what the pathologists 582 00:38:10,000 --> 00:38:14,000 say about this. They say, oh, give me a break. 583 00:38:14,000 --> 00:38:18,000 It's shooting fish in a barrel. We know there was a distinction. 584 00:38:18,000 --> 00:38:22,000 Big deal that the computer can find the distinction. 585 00:38:22,000 --> 00:38:26,000 We knew that there was a distinction there. I know the computer didn't 586 00:38:26,000 --> 00:38:30,000 know it and all that. Tell us something we don't know. 587 00:38:30,000 --> 00:38:35,000 That's a fair question. So it turns out that you can ask 588 00:38:35,000 --> 00:38:40,000 some more questions. You can say, suppose I take now 589 00:38:40,000 --> 00:38:45,000 just the ALL's. Are they a homogeneous class, 590 00:38:45,000 --> 00:38:50,000 or do they fall into two classes? It turns out that extending this 591 00:38:50,000 --> 00:38:55,000 work, folks here were able to show that we can further split that ALL 592 00:38:55,000 --> 00:39:00,000 class. There was a hint that you might be able to do so because 593 00:39:00,000 --> 00:39:06,000 there are some ALL patients who have disruptions of a gene called MLL. 594 00:39:06,000 --> 00:39:09,000 And this tends to be a little more common in infants, 595 00:39:09,000 --> 00:39:13,000 and tends to be associated with a poor prognosis.
596 00:39:13,000 --> 00:39:16,000 But it was really very unclear whether this was simply one of a 597 00:39:16,000 --> 00:39:20,000 zillion factoids about some leukemia patients, or whether this was a 598 00:39:20,000 --> 00:39:24,000 fundamental distinction. So, what happened was folks took a 599 00:39:24,000 --> 00:39:27,000 lot of ALL patients, got their expression profiles, 600 00:39:27,000 --> 00:39:31,000 and lo and behold it turned out that ALL itself broke into two very 601 00:39:31,000 --> 00:39:34,000 different clusters. This is an artist's rendition of a 602 00:39:34,000 --> 00:39:38,000 22,000 dimensional space. We can't afford a 22,000 603 00:39:38,000 --> 00:39:42,000 dimensional projector here, so we're just using two dimensions. 604 00:39:42,000 --> 00:39:46,000 But, the two forms of ALL were quite distinct from each other, 605 00:39:46,000 --> 00:39:50,000 and so actually ALL itself should be split up into two classes, 606 00:39:50,000 --> 00:39:54,000 ALL plus and minus, or ALL one and two, or MLL and ALL. 607 00:39:54,000 --> 00:39:58,000 And it turns out that these forms are quite different. 608 00:39:58,000 --> 00:40:02,000 They have different outcomes and should be treated differently. 609 00:40:02,000 --> 00:40:07,000 It also turns out that a particularly good distinction 610 00:40:07,000 --> 00:40:12,000 between these two subtypes of ALL is found by looking at this particular 611 00:40:12,000 --> 00:40:17,000 gene called the FLT3 kinase. The FLT3 kinase gene, whatever 612 00:40:17,000 --> 00:40:23,000 that is, was of great interest because people know that they can 613 00:40:23,000 --> 00:40:28,000 make inhibitors against certain kinases. And so, 614 00:40:28,000 --> 00:40:33,000 it turned out that there is an inhibitor against FLT3 kinases, 615 00:40:33,000 --> 00:40:39,000 against this FLT3 kinase gene product.
616 00:40:39,000 --> 00:40:44,000 If you treat cells with that inhibitor, cells of this type die, 617 00:40:44,000 --> 00:40:49,000 and cells of this type are not affected. So in fact, 618 00:40:49,000 --> 00:40:54,000 there's a potential drug use of FLT3 kinase inhibitors in the MLL class of 619 00:40:54,000 --> 00:41:00,000 these leukemias, and folks are trying some clinical 620 00:41:00,000 --> 00:41:05,000 trials now. So, not only did the analysis of the 621 00:41:05,000 --> 00:41:09,000 gene expression point to two important sub-types of leukemias, 622 00:41:09,000 --> 00:41:14,000 but the analysis of the gene expression even suggested potential 623 00:41:14,000 --> 00:41:19,000 targets for therapy. So, I'll give you a bunch more 624 00:41:19,000 --> 00:41:23,000 examples. I have a bunch more examples like that there. 625 00:41:23,000 --> 00:41:28,000 They are examples of taking lymphomas and showing that they can 626 00:41:28,000 --> 00:41:33,000 be split into two different categories, examples of splitting 627 00:41:33,000 --> 00:41:38,000 breast cancers into several categories, colon cancers. 628 00:41:38,000 --> 00:41:42,000 Basically what's going on right now is an attempt to reclassify cancers 629 00:41:42,000 --> 00:41:47,000 based not on what they look like in the microscope, 630 00:41:47,000 --> 00:41:51,000 and based not on what organ in the body they affect, 631 00:41:51,000 --> 00:41:56,000 but based on, molecularly, what their description is, because 632 00:41:56,000 --> 00:42:01,000 the molecular description, as Bob talked to you about with CML 633 00:42:01,000 --> 00:42:05,000 and with Gleevec, turns out to be a tremendously 634 00:42:05,000 --> 00:42:10,000 powerful way of classifying cancers because you're able to see what is 635 00:42:10,000 --> 00:42:15,000 the molecular defect and can make a molecular targeted therapy.
636 00:42:15,000 --> 00:42:20,000 So, these sorts of tools are quite cool, and I've got to say, 637 00:42:20,000 --> 00:42:25,000 in the last year we've begun using these expression tools not just to 638 00:42:25,000 --> 00:42:30,000 classify cancers, but to classify drugs. 639 00:42:30,000 --> 00:42:34,000 We've begun an interesting and somewhat crazy project to take all 640 00:42:34,000 --> 00:42:38,000 the FDA approved drugs, put them onto cell types, 641 00:42:38,000 --> 00:42:42,000 and see what they do, that is, get a signature, a fingerprint, 642 00:42:42,000 --> 00:42:46,000 a gene expression description of the action of a drug. 643 00:42:46,000 --> 00:42:50,000 And then we hope, here's the nutty idea, 644 00:42:50,000 --> 00:42:54,000 that we can look up in the computer which drugs do which things and 645 00:42:54,000 --> 00:42:58,000 might be useful for which diseases, because we'd put the diseases and 646 00:42:58,000 --> 00:43:02,000 the drugs on an equal footing. All of them would be described in 647 00:43:02,000 --> 00:43:06,000 terms of their gene expression patterns. So, 648 00:43:06,000 --> 00:43:10,000 I'll tell you one interesting example, OK? This is an interesting 649 00:43:10,000 --> 00:43:14,000 enough example. I don't even have slides for it yet. 650 00:43:14,000 --> 00:43:18,000 It turns out that these patients with ALL that I've been talking 651 00:43:18,000 --> 00:43:23,000 about, some of the patients with ALL will respond to the drug 652 00:43:23,000 --> 00:43:27,000 dexamethasone. Some won't. If you take patients 653 00:43:27,000 --> 00:43:31,000 who respond to dexamethasone, and patients who are resistant to 654 00:43:31,000 --> 00:43:35,000 dexamethasone, and you get their gene expression 655 00:43:35,000 --> 00:43:40,000 patterns, you can ask are there some genes that explain the difference? 
656 00:43:40,000 --> 00:43:44,000 And you can get a certain gene signature, a list of, 657 00:43:44,000 --> 00:43:48,000 say, a dozen or so genes that do a pretty good job of classifying who's 658 00:43:48,000 --> 00:43:53,000 sensitive and who's resistant. Then you can go to this database I 659 00:43:53,000 --> 00:43:57,000 was telling you about of the action of many drugs and say, 660 00:43:57,000 --> 00:44:01,000 do we see any drugs whose effect would be to produce a signature 661 00:44:01,000 --> 00:44:06,000 of sensitivity? If we found a drug X, 662 00:44:06,000 --> 00:44:10,000 which when we put it on cells turned on those genes that correlate with 663 00:44:10,000 --> 00:44:14,000 being sensitive to dexamethasone, you could hallucinate the following 664 00:44:14,000 --> 00:44:18,000 really happy possibility that when you added that drug together with 665 00:44:18,000 --> 00:44:22,000 dexamethasone, you might be able to treat resistant 666 00:44:22,000 --> 00:44:26,000 patients because that drug could make them sensitive to dexamethasone, 667 00:44:26,000 --> 00:44:30,000 and that you could find that drug just by looking it up in 668 00:44:30,000 --> 00:44:35,000 a computer database. So, we tried it and we hit a drug. 669 00:44:35,000 --> 00:44:40,000 There was a certain drug that came up on the screen, 670 00:44:40,000 --> 00:44:45,000 yes? That's very much in the idea too. We found a drug that produced 671 00:44:45,000 --> 00:44:49,000 the signature of sensitivity, and tested it in vitro. In vitro, 672 00:44:49,000 --> 00:44:54,000 if you take cells that are resistant and you add dexamethasone, 673 00:44:54,000 --> 00:44:59,000 nothing happens because they're resistant. If you add drug X, 674 00:44:59,000 --> 00:45:04,000 nothing happens. But if you add both drug X plus dexamethasone, 675 00:45:04,000 --> 00:45:08,000 the cells drop dead. It's now going into clinical trials 676 00:45:08,000 --> 00:45:12,000 in human patients.
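The "look it up in a computer database" step can be sketched as a simple scoring exercise. Everything here is an illustrative assumption: the gene names, the drug names, the expression changes, and the scoring rule (sum of changes in genes that should go up, minus changes in genes that should go down). It is not the actual database or scoring method used in the project described.

```python
def signature_score(drug_profile, up_genes, down_genes):
    """Higher score means the drug pushes expression toward the signature:
    raising the genes in `up_genes` and lowering those in `down_genes`."""
    up = sum(drug_profile.get(gene, 0.0) for gene in up_genes)
    down = sum(drug_profile.get(gene, 0.0) for gene in down_genes)
    return up - down

def rank_drugs(database, up_genes, down_genes):
    """Drug names sorted from best to worst match to the signature."""
    return sorted(
        database,
        key=lambda name: signature_score(database[name], up_genes, down_genes),
        reverse=True,
    )

# Hypothetical signature of sensitivity: two genes higher, one gene lower.
sensitive_up = ["gene1", "gene2"]
sensitive_down = ["gene3"]

# Hypothetical database: change in expression each drug induces per gene.
database = {
    "drug_A": {"gene1": 2.0, "gene2": 1.5, "gene3": -1.0},
    "drug_B": {"gene1": -0.5, "gene2": 0.2, "gene3": 1.0},
    "drug_C": {"gene1": 0.1},
}

ranking = rank_drugs(database, sensitive_up, sensitive_down)
print(ranking)  # prints ['drug_A', 'drug_C', 'drug_B']
```

A top-ranked drug in a search like this is only a candidate; as in the story above, it still has to be tested on resistant cells together with dexamethasone.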
It turns out drug X is already an 677 00:45:12,000 --> 00:45:15,000 FDA-approved drug, so it can be tested in human 678 00:45:15,000 --> 00:45:19,000 patients right away, so it's going to be tested. 679 00:45:19,000 --> 00:45:22,000 So, the gene expression pattern was able to tell us to use a drug which 680 00:45:22,000 --> 00:45:26,000 actually had nothing to do with cancer uses in a cancer setting 681 00:45:26,000 --> 00:45:30,000 because it might do something helpful. 682 00:45:30,000 --> 00:45:33,000 Now, what's the point of all this? We can turn up the lights because I 683 00:45:33,000 --> 00:45:37,000 think I'm going to stop the slides there. The point of all of this, 684 00:45:37,000 --> 00:45:41,000 which I've made before and will make again, 685 00:45:41,000 --> 00:45:45,000 because you are the generation that's going to really live this, 686 00:45:45,000 --> 00:45:48,000 is that biology is becoming information. Now, 687 00:45:48,000 --> 00:45:52,000 don't get me wrong. It's not stopping being 688 00:45:52,000 --> 00:45:56,000 biochemistry. It's going to be biochemistry. It's not stopping 689 00:45:56,000 --> 00:46:00,000 being molecular biology. It's not stopping any of the things 690 00:46:00,000 --> 00:46:03,000 it was before. But it is also becoming 691 00:46:03,000 --> 00:46:07,000 information, that for the first time we're entering a world where we can 692 00:46:07,000 --> 00:46:11,000 collect vast amounts of information: all the genetic variants in a 693 00:46:11,000 --> 00:46:15,000 patient, all of the gene expression pattern in a cell, 694 00:46:15,000 --> 00:46:18,000 or all of the gene expression pattern induced by a drug, 695 00:46:18,000 --> 00:46:22,000 and that whatever question you're asking will be informed by being 696 00:46:22,000 --> 00:46:26,000 able to access that whole database. In no way does it decrease the role 697 00:46:26,000 --> 00:46:30,000 of the individual smart scientist working on his or her problem.
698 00:46:30,000 --> 00:46:32,000 To the contrary, the goal is to empower the 699 00:46:32,000 --> 00:46:35,000 individual smart scientist so that you have all of that information at 700 00:46:35,000 --> 00:46:38,000 your fingertips. There are databases scattered 701 00:46:38,000 --> 00:46:41,000 around the web that have sequences from different species, 702 00:46:41,000 --> 00:46:44,000 variations from the human population, all of these drug databases, 703 00:46:44,000 --> 00:46:47,000 etc., etc., etc., etc. It's a time of tremendous ferment, 704 00:46:47,000 --> 00:46:50,000 a little bit of chaos. You talk to people in the field, 705 00:46:50,000 --> 00:46:53,000 they say, we're getting deluged by data. We're getting crushed by the 706 00:46:53,000 --> 00:46:56,000 amount of data. I don't know what to do with all 707 00:46:56,000 --> 00:46:59,000 the data. There's only one solution for a 708 00:46:59,000 --> 00:47:02,000 field in that condition, and that is young scientists because 709 00:47:02,000 --> 00:47:05,000 the young scientists who come into the field are the ones who take for 710 00:47:05,000 --> 00:47:08,000 granted, of course we're going to have all these data. 711 00:47:08,000 --> 00:47:11,000 We love having all these data. This is just great, couldn't be 712 00:47:11,000 --> 00:47:14,000 happier to have all these data. We're not put off by it in the 713 00:47:14,000 --> 00:47:17,000 least. That's what's going on.
That's what's so important about 714 00:47:17,000 --> 00:47:20,000 your generation, and that's why I think it's really 715 00:47:20,000 --> 00:47:23,000 important that even though it's 7.01 and we're supposed to be teaching 716 00:47:23,000 --> 00:47:26,000 you the basics, it's important that you see this 717 00:47:26,000 --> 00:47:29,000 stuff because this is the change that's going on, 718 00:47:29,000 --> 00:47:32,000 and we're counting on this very much to drive a revolution in health, 719 00:47:32,000 --> 00:47:35,000 a revolution in biomedical research, and we're counting on you guys very 720 00:47:35,000 --> 00:47:39,000 much to drive that revolution. It has been a pleasure to teach you 721 00:47:39,000 --> 00:47:43,000 this term. I hope many of you will stay in touch, 722 00:47:43,000 --> 00:47:48,000 and some of you will go into biology, and even those of you who don't will 723 00:47:48,000 --> 00:47:53,000 know lots about it and enjoy it. Thank you very much. [APPLAUSE]