1 00:00:01,000 --> 00:00:06,000 Good morning. Welcome back. So, the Red Sox won, it's pretty 2 00:00:06,000 --> 00:00:13,000 convincing, yeah, very good. Yay Red Sox. 3 00:00:13,000 --> 00:00:20,000 So, as you can also tell, I have something of a cold, 4 00:00:20,000 --> 00:00:27,000 so I'll see if I, if my voice makes it through, but what I wanted to do 5 00:00:27,000 --> 00:00:34,000 today, if the voice allows, was to talk about genomics. 6 00:00:34,000 --> 00:00:38,000 Now, this is a little bit different than what we normally do in the 7 00:00:38,000 --> 00:00:42,000 class because, I work on genomics, 8 00:00:42,000 --> 00:00:46,000 it's something I'm extremely interested in. 9 00:00:46,000 --> 00:00:50,000 And so, what I wanted to do today, and I'll do it one more time before 10 00:00:50,000 --> 00:00:54,000 the end of the term, is to talk about research that's 11 00:00:54,000 --> 00:00:58,000 going on in genomics, give you a sense of what's really 12 00:00:58,000 --> 00:01:02,000 going on. I can assure you that what I say is not going to be in the 13 00:01:02,000 --> 00:01:05,000 text book, or any other text book. And, I'm not entirely sure how this 14 00:01:05,000 --> 00:01:08,000 might appear on an exam, so don't ask, because I'm really 15 00:01:08,000 --> 00:01:12,000 just going to talk about research that's going on today. 16 00:01:12,000 --> 00:01:15,000 And part of the purpose in doing that is to a, show you that it's 17 00:01:15,000 --> 00:01:18,000 possible for you to understand the kind of research that's going on in 18 00:01:18,000 --> 00:01:21,000 this field, and b, to excite you about what's going on 19 00:01:21,000 --> 00:01:25,000 in this field. So each year I pick different 20 00:01:25,000 --> 00:01:28,000 things to talk about, and I've picked a few things, 21 00:01:28,000 --> 00:01:32,000 and we'll see. 
So feel free to interrupt and to ask 22 00:01:32,000 --> 00:01:36,000 questions, and all of that, but this is very much more, sort of 23 00:01:36,000 --> 00:01:40,000 the edge of genomics, including stuff that's going on, 24 00:01:40,000 --> 00:01:44,000 you know, right now as we speak. So, we'll fire away. 25 00:01:44,000 --> 00:01:48,000 So a little introductory stuff. I call this, we can actually keep 26 00:01:48,000 --> 00:01:52,000 the lights up, I think people, 27 00:01:52,000 --> 00:01:56,000 can people read that? Yeah, it's fine, good, 28 00:01:56,000 --> 00:02:00,000 so we'll leave the lights up and I can see people. 29 00:02:00,000 --> 00:02:04,000 So, I think the thing that sets apart this revolution of biology 30 00:02:04,000 --> 00:02:08,000 that we're looking through right now, is the transformation of biology, 31 00:02:08,000 --> 00:02:12,000 not just from being the study of living organisms, 32 00:02:12,000 --> 00:02:16,000 to the study of chemicals and enzymes, to the study of molecules, 33 00:02:16,000 --> 00:02:20,000 but to the study of biology as information. That is what's 34 00:02:20,000 --> 00:02:24,000 distinctive about this decade, is the idea that the information 35 00:02:24,000 --> 00:02:28,000 sciences have begun to merge with biology, or biology merged with 36 00:02:28,000 --> 00:02:32,000 information sciences, and that it's having a profound 37 00:02:32,000 --> 00:02:36,000 effect on driving biomedicine. In both of the two talks I'll give, 38 00:02:36,000 --> 00:02:40,000 this one and near the end of the term, that will be the common theme, 39 00:02:40,000 --> 00:02:44,000 because I think that's the most important thing that's going on 40 00:02:44,000 --> 00:02:48,000 right now. 
Now, just to remind you, 41 00:02:48,000 --> 00:02:52,000 of course, the idea that biology is about information is an old one, 42 00:02:52,000 --> 00:02:56,000 it goes back to my hero, Gregor Mendel, with the recognition that 43 00:02:56,000 --> 00:03:00,000 information was passed from parent to offspring, according to rules. 44 00:03:00,000 --> 00:03:04,000 And, as you know, the history of biology in the 20th 45 00:03:04,000 --> 00:03:08,000 century can be read as the development of biology as information. 46 00:03:08,000 --> 00:03:12,000 The first quarter of the 20th century was the development of the 47 00:03:12,000 --> 00:03:16,000 idea that the information lives in chromosomes. The next quarter of 48 00:03:16,000 --> 00:03:20,000 the 20th century, the idea that the information of the 49 00:03:20,000 --> 00:03:24,000 chromosomes resides in the DNA double-helix, and that information 50 00:03:24,000 --> 00:03:28,000 was contained in this molecule, and somehow in its sequence, and 51 00:03:28,000 --> 00:03:31,000 you know all of this. And the next quarter of the 20th 52 00:03:31,000 --> 00:03:35,000 century, basically from 1950 to 1975, understanding how it is that the 53 00:03:35,000 --> 00:03:39,000 cell reads out that information, from DNA to RNA to protein, how it 54 00:03:39,000 --> 00:03:43,000 uses a genetic code to translate RNAs into proteins, 55 00:03:43,000 --> 00:03:46,000 and the development of the tools of recombinant DNA that made it 56 00:03:46,000 --> 00:03:50,000 possible for us to read out the information that the cell reads out. 57 00:03:50,000 --> 00:03:54,000 So that brought us ¾ of the way through the 20th century, 58 00:03:54,000 --> 00:03:58,000 with the ability to read out genetic information, at least in little ways, 59 00:03:58,000 --> 00:04:02,000 but they were little ways. You could write a PhD thesis, 60 00:04:02,000 --> 00:04:07,000 around that time, for sequencing 200 letters of DNA. 
61 00:04:07,000 --> 00:04:12,000 That would be, you know, considered an amazingly exciting PhD 62 00:04:12,000 --> 00:04:17,000 thesis. The next quarter of the 20th century, the last quarter of 63 00:04:17,000 --> 00:04:22,000 the 20th century, was characterized by a voracious 64 00:04:22,000 --> 00:04:27,000 appetite to read as much of this information as possible. 65 00:04:27,000 --> 00:04:32,000 It started, first, with trying to read out the sequence of individual 66 00:04:32,000 --> 00:04:37,000 genes, then sets of genes, then genomes of small organisms, 67 00:04:37,000 --> 00:04:41,000 bacteria, medium-sized organisms. And then, you know, 68 00:04:41,000 --> 00:04:45,000 in a wonderful closure to the 20th century, the reading out of the 69 00:04:45,000 --> 00:04:48,000 nearly complete genetic information of the human being in the closing 70 00:04:48,000 --> 00:04:52,000 weeks of the 20th century. When you remember that, that Mendel 71 00:04:52,000 --> 00:04:55,000 was rediscovered in January of 1900, that's when the papers rediscovering 72 00:04:55,000 --> 00:04:59,000 Mendel came out, and you figure you've got perfect 73 00:04:59,000 --> 00:05:02,000 bookends from the rediscovery of Mendel in January of 1900, 74 00:05:02,000 --> 00:05:06,000 to the sequencing of the human genome in around 2000. 75 00:05:06,000 --> 00:05:09,000 You realize what a century can do. It's not bad, as centuries go, you 76 00:05:09,000 --> 00:05:12,000 know, to accomplish all that, and it gives, you know, as students, 77 00:05:12,000 --> 00:05:15,000 you get a point estimate in time of what science knows, 78 00:05:15,000 --> 00:05:18,000 but you guys aren't old enough yet and haven't lived long enough yet, 79 00:05:18,000 --> 00:05:22,000 to measure the derivative, and see how rapidly it's changing. 
80 00:05:22,000 --> 00:05:25,000 But just look at what happened over the course of that century, 81 00:05:25,000 --> 00:05:28,000 and then just project forward to what that can mean for 82 00:05:28,000 --> 00:05:32,000 the next century. So what that's done is it's brought 83 00:05:32,000 --> 00:05:36,000 us to the next picture. I have a picture in my head, 84 00:05:36,000 --> 00:05:40,000 of biology as a vast library of information, a library of 85 00:05:40,000 --> 00:05:44,000 information in which evolution has been taking patient notes. 86 00:05:44,000 --> 00:05:48,000 Evolution is a very good experimentalist, 87 00:05:48,000 --> 00:05:52,000 and it's a very patient note taker. Its notes, of course, are written 88 00:05:52,000 --> 00:05:56,000 in the genomes, and every day evolution wakes up, 89 00:05:56,000 --> 00:06:00,000 changes a few nucleotides, sees how the organism works, 90 00:06:00,000 --> 00:06:04,000 if it was an improvement, evolution keeps the notes, 91 00:06:04,000 --> 00:06:08,000 if it was disadvantageous, evolution discards the notes. 92 00:06:08,000 --> 00:06:11,000 That, by the way, for those of you working in labs, 93 00:06:11,000 --> 00:06:14,000 is no longer considered appropriate laboratory practice. 94 00:06:14,000 --> 00:06:17,000 You're obliged to keep your laboratory notes from failed 95 00:06:17,000 --> 00:06:20,000 experiments, as well, but evolution got into this before 96 00:06:20,000 --> 00:06:23,000 those rules were codified, and so it discards the notes from 97 00:06:23,000 --> 00:06:26,000 unsuccessful experiments, and keeps the notes from the 98 00:06:26,000 --> 00:06:29,000 successful experiments. But nonetheless, we have all the 99 00:06:29,000 --> 00:06:32,000 notes from the successful experiments, and we can learn a 100 00:06:32,000 --> 00:06:35,000 tremendous amount from it. There's a volume on the shelf 101 00:06:35,000 --> 00:06:38,000 corresponding to each species on the planet. 
There's a volume on the 102 00:06:38,000 --> 00:06:41,000 shelf corresponding to each individual within each species, 103 00:06:41,000 --> 00:06:44,000 to each tissue within each individual within each species, 104 00:06:44,000 --> 00:06:47,000 and there's information there about the DNA sequence, 105 00:06:47,000 --> 00:06:50,000 about the RNA readouts, about the protein expression levels, 106 00:06:50,000 --> 00:06:53,000 and in principle, even if not yet in practice, we can pull down any 107 00:06:53,000 --> 00:06:56,000 volume we want, and interrogate it, 108 00:06:56,000 --> 00:06:59,000 and compare it for related species, for individuals within a species, 109 00:06:59,000 --> 00:07:02,000 some of whom might have a disease, some of whom might not, for 110 00:07:02,000 --> 00:07:06,000 different kinds of tissues treated in different ways. 111 00:07:06,000 --> 00:07:09,000 That is, I think, going to be a tremendous theme of 112 00:07:09,000 --> 00:07:12,000 biology going forward, and that's why it's a particular 113 00:07:12,000 --> 00:07:16,000 pleasure to teach biology at MIT, where you guys understand what that 114 00:07:16,000 --> 00:07:19,000 could mean, that fusion could mean. Now, this idea of extracting 115 00:07:19,000 --> 00:07:23,000 genomic information in large-scale, is a relatively new one. In the 116 00:07:23,000 --> 00:07:26,000 mid-1980's, the scientific community began debating what was a pretty 117 00:07:26,000 --> 00:07:30,000 radical idea, sequencing the human genome. 118 00:07:30,000 --> 00:07:33,000 This was floated in a couple of places, in 1984 at one meeting, 119 00:07:33,000 --> 00:07:37,000 somebody raised the idea, you've got to realize that sequencing itself, 120 00:07:37,000 --> 00:07:41,000 that sequencing DNA, only came from the late 70's, 121 00:07:41,000 --> 00:07:45,000 so within six, seven years of being able to sequence anything, 122 00:07:45,000 --> 00:07:49,000 people were now saying, let's sequence everything. 
123 00:07:49,000 --> 00:07:52,000 That was a reasonably audacious thing to do, and it was 124 00:07:52,000 --> 00:07:56,000 controversial. There were many people who felt 125 00:07:56,000 --> 00:08:00,000 that the human genome project was a terrible idea, 126 00:08:00,000 --> 00:08:04,000 and with good reason, because the initial version of the 127 00:08:04,000 --> 00:08:08,000 human genome project was, kind of, a blunderbuss approach. 128 00:08:08,000 --> 00:08:11,000 It was, let's immediately mount a massive factory and start sequencing 129 00:08:11,000 --> 00:08:15,000 the human genome with the just horrible technologies of the 130 00:08:15,000 --> 00:08:19,000 mid-80's, with radioactive sequencing gels, 131 00:08:19,000 --> 00:08:22,000 and you know, lots and lots of people doing stuff. 132 00:08:22,000 --> 00:08:26,000 And so, you know, many people in science were, were concerned that an 133 00:08:26,000 --> 00:08:30,000 entire generation of students would need to be chained to the 134 00:08:30,000 --> 00:08:33,000 bench, sequencing DNA. Sydney Brenner, 135 00:08:33,000 --> 00:08:37,000 a great molecular biologist, proposed the whole thing be done at 136 00:08:37,000 --> 00:08:41,000 institutions [LAUGHTER], because you know, people could be 137 00:08:41,000 --> 00:08:45,000 sentenced to, 20 million bases, with time off for accuracy, or 138 00:08:45,000 --> 00:08:48,000 things like that [LAUGHTER]. And so what happened was, the 139 00:08:48,000 --> 00:08:52,000 scientific community came together well, in its best form. 140 00:08:52,000 --> 00:08:56,000 A group was put together by the National Academy of Sciences, 141 00:08:56,000 --> 00:09:00,000 who said, well look, this is a really good idea, 142 00:09:00,000 --> 00:09:04,000 but we also need a carefully thought-through program to do it. 
143 00:09:04,000 --> 00:09:07,000 We need intermediate goals that will get us things that will advance the 144 00:09:07,000 --> 00:09:10,000 science along the way, we need to improve the technologies, 145 00:09:10,000 --> 00:09:13,000 and laid out a plan. The goals of that plan, to develop a genetic map, 146 00:09:13,000 --> 00:09:16,000 a map showing the locations of DNA polymorphisms, 147 00:09:16,000 --> 00:09:19,000 sites of variation, genetic markers, just like Sturtevant 148 00:09:19,000 --> 00:09:22,000 did with fruit flies, but to do it with humans, 149 00:09:22,000 --> 00:09:25,000 and with DNA sequence differences, to be used to trace inheritance. 150 00:09:25,000 --> 00:09:28,000 That, that genetic map could be used to map human diseases, 151 00:09:28,000 --> 00:09:31,000 and if all you accomplished was, got a genetic map of the human being, 152 00:09:31,000 --> 00:09:34,000 that would be a good thing. Then you could get a physical map of 153 00:09:34,000 --> 00:09:38,000 the human being, all the pieces of DNA overlapping 154 00:09:38,000 --> 00:09:41,000 each other, so that you would know if you had a genetic marker linked 155 00:09:41,000 --> 00:09:44,000 to cystic fibrosis, you would be able to get the piece 156 00:09:44,000 --> 00:09:48,000 of DNA that contains the gene. Then, if we managed to pull that 157 00:09:48,000 --> 00:09:51,000 off, we could get a sequence of the human genome, all three billion 158 00:09:51,000 --> 00:09:54,000 nucleotides, on the web, so that you could go to just any 159 00:09:54,000 --> 00:09:58,000 place on the genome, double-click, and up would pop the 160 00:09:58,000 --> 00:10:01,000 sequence. 
Now, you guys of course, 161 00:10:01,000 --> 00:10:04,000 don't laugh at that, but about eight years ago, when I would give talks 162 00:10:04,000 --> 00:10:07,000 about this, I would speak about, oh you'll be able to go double-click 163 00:10:07,000 --> 00:10:10,000 and up will pop the sequence, and of course, everybody thought 164 00:10:10,000 --> 00:10:13,000 that was really funny, and that, that was something people 165 00:10:13,000 --> 00:10:16,000 laughed at. But of course, you can just do that today, if 166 00:10:16,000 --> 00:10:19,000 anybody has a wireless you can just double-click, and up will pop the 167 00:10:19,000 --> 00:10:22,000 sequence. And then, of course, a complete inventory of 168 00:10:22,000 --> 00:10:25,000 all the genes within that sequence. And, very importantly, and from 169 00:10:25,000 --> 00:10:28,000 the very beginning, the notion that all this information 170 00:10:28,000 --> 00:10:31,000 should be completely, freely available to anybody, 171 00:10:31,000 --> 00:10:34,000 regardless of where they were, whether in academia, or industry, 172 00:10:34,000 --> 00:10:37,000 in first world, third world countries, that everybody should 173 00:10:37,000 --> 00:10:40,000 have free and unrestricted access to that information. 174 00:10:40,000 --> 00:10:43,000 So a plan was laid out, I won't go into the details here, 175 00:10:43,000 --> 00:10:46,000 but the plan was laid out that involved work constructing genetic 176 00:10:46,000 --> 00:10:49,000 maps, physical maps, sequence maps, in the human, 177 00:10:49,000 --> 00:10:53,000 the mouse, and some model organisms, including bacteria, yeast, fruit 178 00:10:53,000 --> 00:10:56,000 flies, worms. And, quite remarkably, it largely went 179 00:10:56,000 --> 00:11:00,000 according to plan, over the course of about 15 years. 180 00:11:00,000 --> 00:11:03,000 A lot of people in the scientific community came together and took up 181 00:11:03,000 --> 00:11:06,000 different tasks. 
I should say, with some pride, 182 00:11:06,000 --> 00:11:09,000 that MIT was by far, one of the leading contributors to this effort, 183 00:11:09,000 --> 00:11:13,000 having been involved in essentially every stage of this, 184 00:11:13,000 --> 00:11:16,000 the genetic mapping of human and mouse, the physical mapping of human 185 00:11:16,000 --> 00:11:19,000 and mouse, and the sequencing of human and mouse, 186 00:11:19,000 --> 00:11:23,000 and having been the leading contributor to the latter, 187 00:11:23,000 --> 00:11:26,000 and it's not an accident because MIT's a marvelous environment in 188 00:11:26,000 --> 00:11:30,000 which to undertake this kind of research. 189 00:11:30,000 --> 00:11:33,000 It involved changing the way we do biology. Back in the mid-80's, 190 00:11:33,000 --> 00:11:37,000 when we sequenced DNA, we did it with radioactivity, 191 00:11:37,000 --> 00:11:40,000 remember I taught you how to sequence using radioactive label of 192 00:11:40,000 --> 00:11:44,000 a gel, and all that. That's how we did it, 193 00:11:44,000 --> 00:11:48,000 stood behind this plastic shield, and you loaded the gels. Of course, 194 00:11:48,000 --> 00:11:51,000 now it's done in a highly automated fashion. This is the production 195 00:11:51,000 --> 00:11:55,000 floor at the Broad Institute, which is here at MIT, where robots 196 00:11:55,000 --> 00:11:59,000 prepare all the DNA samples, so E. coli's grown up, and then you 197 00:11:59,000 --> 00:12:02,000 have to crack open the cells, purify the DNA, purify the plasmid, 198 00:12:02,000 --> 00:12:06,000 do a sequencing reaction, etc., etc. it's all done robotically there, 199 00:12:06,000 --> 00:12:10,000 and this is capable of processing, and does process, in a given day, 200 00:12:10,000 --> 00:12:13,000 about 200,000 samples per day. They then go, and this is all 201 00:12:13,000 --> 00:12:17,000 equipment designed by people here at MIT, and then commercially built for 202 00:12:17,000 --> 00:12:21,000 us. 
They then go to the back room where, actually, 203 00:12:21,000 --> 00:12:24,000 these are the previous generation of DNA sequencers, 204 00:12:24,000 --> 00:12:28,000 commercial detectors, those capillary detectors that have 205 00:12:28,000 --> 00:12:32,000 little lasers on them, there's a whole farm of them that 206 00:12:32,000 --> 00:12:36,000 sit there, and are able to get data out. 207 00:12:36,000 --> 00:12:39,000 In the course of a single day, we can now generate about 40 billion 208 00:12:39,000 --> 00:12:43,000 bases, I'm sorry, in the course of a single year we 209 00:12:43,000 --> 00:12:46,000 can generate about 40 billion bases of DNA sequence. 210 00:12:46,000 --> 00:12:50,000 The genome project itself, was a collaboration involving 20 211 00:12:50,000 --> 00:12:53,000 different groups around the world, groups in the United States, United 212 00:12:53,000 --> 00:12:57,000 Kingdom, France, Germany, and Japan, 213 00:12:57,000 --> 00:13:01,000 and China. They were of different sizes, they used different 214 00:13:01,000 --> 00:13:04,000 approaches, but everybody was committed to one common cause of 215 00:13:04,000 --> 00:13:08,000 producing this information, and making it freely available, 216 00:13:08,000 --> 00:13:11,000 and everybody worked together. And for the rest of my life, 217 00:13:11,000 --> 00:13:15,000 when it comes to Friday, at 11 o'clock, I will always think genome 218 00:13:15,000 --> 00:13:19,000 project, because we had a weekly conference call of all the groups in 219 00:13:19,000 --> 00:13:23,000 the world working on this Fridays, at eleven, and it was a fascinating 220 00:13:23,000 --> 00:13:26,000 experience, there were many, many years of that. 
So a draft 221 00:13:26,000 --> 00:13:30,000 sequence, a rough draft sequence of the human genome, 222 00:13:30,000 --> 00:13:34,000 was published in the year, in February of 2001, it was 223 00:13:34,000 --> 00:13:38,000 announced with some fanfare in June of 2000, but the real scientific 224 00:13:38,000 --> 00:13:42,000 paper came out in February of 2001. 225 00:13:42,000 --> 00:13:45,000 This was not a perfect sequence of the human genome, 226 00:13:45,000 --> 00:13:48,000 by any means. We discovered about 90% of the sequence of the human 227 00:13:48,000 --> 00:13:51,000 genome. It still had about 150,000 gaps in it, it had errors. But, 228 00:13:51,000 --> 00:13:54,000 it still did have 90% of the sequence of the human genome. 229 00:13:54,000 --> 00:13:57,000 For the next three years, people worked very hard, 230 00:13:57,000 --> 00:14:00,000 and, as of last April, a finished sequence of the human 231 00:14:00,000 --> 00:14:03,000 genome was produced, and was published a couple weeks ago, 232 00:14:03,000 --> 00:14:06,000 and it contains, our best guess, about 99. 233 00:14:06,000 --> 00:14:09,000 % of the human genome, and it still has about 343 gaps, 234 00:14:09,000 --> 00:14:12,000 they're, we know what they are, we know where they are, but they're 235 00:14:12,000 --> 00:14:16,000 not sequenceable with current technology. 236 00:14:16,000 --> 00:14:19,000 That's the “finished human genome”. What is it like? Well, this is a 237 00:14:19,000 --> 00:14:23,000 picture of the genome, do we have a pointer, yes, 238 00:14:23,000 --> 00:14:27,000 I see here we do have a pointer. This is your genome here, this is 239 00:14:27,000 --> 00:14:31,000 chromosome number 11, and I'll call attention to some 240 00:14:31,000 --> 00:14:34,000 interesting bits. 
So these colored lines here, 241 00:14:34,000 --> 00:14:38,000 represent genes, or gene-predictions, based on both, 242 00:14:38,000 --> 00:14:42,000 sequencing of the DNA, and mapping them back to the genome, 243 00:14:42,000 --> 00:14:46,000 as well as computer programs that analyze the genome. 244 00:14:46,000 --> 00:14:49,000 And, right here, you have a big pileup of lots of 245 00:14:49,000 --> 00:14:52,000 genes, very few genes over here. Lots of genes, few genes. Notice 246 00:14:52,000 --> 00:14:55,000 the places where there are lots of genes, match up with these 247 00:14:55,000 --> 00:14:58,000 light-grey bands, which are the light-grey bands of 248 00:14:58,000 --> 00:15:01,000 the microscope, on chromosomes. The places with 249 00:15:01,000 --> 00:15:04,000 very few genes match up with the dark bands in the chromosome. 250 00:15:04,000 --> 00:15:08,000 Do you know why that is, that the gene-rich regions are these 251 00:15:08,000 --> 00:15:12,000 light bands, and the gene-poor regions are the chromosome dark 252 00:15:12,000 --> 00:15:16,000 bands? Me neither. Nobody has a clue. It's really, 253 00:15:16,000 --> 00:15:20,000 it's really just one of these things. We had no reason to expect that 254 00:15:20,000 --> 00:15:24,000 we'd see these striking patterns, and other genomes, E. coli, doesn't 255 00:15:24,000 --> 00:15:28,000 have this dense, urban cluster, and these big, 256 00:15:28,000 --> 00:15:32,000 rural plains that are gene-poor. This is very weird, and it's 257 00:15:32,000 --> 00:15:35,000 distinctive to mammals. You'll also notice that the 258 00:15:35,000 --> 00:15:38,000 gene-rich regions, here, are rich in G's and C's, 259 00:15:38,000 --> 00:15:41,000 they have different distributions of some repeat elements, 260 00:15:41,000 --> 00:15:43,000 it's all sorts of weirdness that comes from just looking at the 261 00:15:43,000 --> 00:15:46,000 genome. 
The biggest weirdness was the number of genes, 262 00:15:46,000 --> 00:15:49,000 the count of genes is, our best guess, about 263 00:15:49,000 --> 00:15:52,000 22,000 genes, if I had to pick a number today, it would be our count of 264 00:15:52,000 --> 00:15:55,000 genes, and of course, that's down from the 265 00:15:55,000 --> 00:15:58,000 100,000 that was in some textbooks, and it's down from even 30 to 266 00:15:58,000 --> 00:16:01,000 40,000 that was in the genome paper of February, 2001. 267 00:16:01,000 --> 00:16:04,000 Our best guess is that it's really just about that range. 268 00:16:04,000 --> 00:16:07,000 Genes, themselves, are very interesting. 269 00:16:07,000 --> 00:16:11,000 When you look at, you know, if we only have 22,000 genes we know 270 00:16:11,000 --> 00:16:15,000 of, how do we manage to run a human being with so few genes? 271 00:16:15,000 --> 00:16:19,000 It is, by the way, probably fewer genes than the mustard weed, 272 00:16:19,000 --> 00:16:22,000 or Arabidopsis thaliana. So, what do we do? Well, humans, one 273 00:16:22,000 --> 00:16:26,000 thing we may take comfort in, is that we, although we only have 274 00:16:26,000 --> 00:16:30,000 about 22,000 genes, there's a lot of alternative 275 00:16:30,000 --> 00:16:34,000 splicing, on average the typical gene, on average, 276 00:16:34,000 --> 00:16:38,000 has about two alternative splice products. 277 00:16:38,000 --> 00:16:41,000 Some have many, some have few, but probably, 278 00:16:41,000 --> 00:16:45,000 when you're all done, those 22,000 genes may encode 70-80,000 279 00:16:45,000 --> 00:16:48,000 different proteins, and it could be more than that 280 00:16:48,000 --> 00:16:52,000 because we don't know all the alternative splice products, 281 00:16:52,000 --> 00:16:55,000 and what they do. 
But, if you ask, humans get credit for being really 282 00:16:55,000 --> 00:16:59,000 inventive or creative, for having lots of new genes that 283 00:16:59,000 --> 00:17:03,000 make us human, the answer is, no. 284 00:17:03,000 --> 00:17:06,000 Not only are humans not different in their gene complement from other 285 00:17:06,000 --> 00:17:10,000 mammals, mammals, as a group, really haven't invented 286 00:17:10,000 --> 00:17:14,000 that much, when you get down to it. Most of the recognizable sub-domains 287 00:17:14,000 --> 00:17:18,000 of proteins, proteins are built up of sub-domains, 288 00:17:18,000 --> 00:17:22,000 recognizable sequences that have certain motifs that fold up in 289 00:17:22,000 --> 00:17:26,000 certain ways, or carry out certain enzymatic functions. 290 00:17:26,000 --> 00:17:30,000 And it looks like our genomes, our genes, are mixed-and-matched 291 00:17:30,000 --> 00:17:34,000 combinations of many domains that were invented a long time ago, 292 00:17:34,000 --> 00:17:38,000 in invertebrates and before, and that most of evolutionary innovation 293 00:17:38,000 --> 00:17:42,000 in the more complex, multi-cellular animals, 294 00:17:42,000 --> 00:17:46,000 has simply been mixing-and-matching these domains in new ways, 295 00:17:46,000 --> 00:17:50,000 to get slightly different functions. 296 00:17:50,000 --> 00:17:54,000 You don't get a lot of points for creativity, but it does seem to work. 297 00:17:54,000 --> 00:17:58,000 By far, the most derivative of all, and what characterizes our genome 298 00:17:58,000 --> 00:18:02,000 tremendously is, when a gene works, 299 00:18:02,000 --> 00:18:07,000 make extra copies of it, and let it diverge slightly, 300 00:18:07,000 --> 00:18:11,000 and take up new functions. Really, your genome is just characterized by 301 00:18:11,000 --> 00:18:15,000 large expansions of families, immunoglobulin-like genes, 302 00:18:15,000 --> 00:18:20,000 intermediate filament proteins holding together the cytoskeleton. 
303 00:18:20,000 --> 00:18:23,000 There are 111 different keratin-like genes in your genome. 304 00:18:23,000 --> 00:18:26,000 They're all different, they do different things, 305 00:18:26,000 --> 00:18:29,000 but they all came from one gene that was copied, copied, 306 00:18:29,000 --> 00:18:33,000 copied, at random, randomly duplicated, and then diverged to 307 00:18:33,000 --> 00:18:36,000 take up new functions. Growth factors, flies and worms 308 00:18:36,000 --> 00:18:39,000 managed to get by just fine, thank you, with two growth factors 309 00:18:39,000 --> 00:18:43,000 of the TGF beta-class, whatever that is. You have 42 310 00:18:43,000 --> 00:18:46,000 growth factors of this TGF beta-class, all of which help 311 00:18:46,000 --> 00:18:50,000 communicate, cells communicate, in different ways. 312 00:18:50,000 --> 00:18:53,000 And then, of course, all the olfactory receptors. 313 00:18:53,000 --> 00:18:57,000 In your genome, you have about 1,000 genes for olfactory, for smell 314 00:18:57,000 --> 00:19:00,000 receptors. This is what Richard Axel and Linda Buck won a Nobel 315 00:19:00,000 --> 00:19:04,000 Prize for this year, was their work on the olfactory 316 00:19:04,000 --> 00:19:07,000 receptors. Sad to say though, out of all your olfactory receptor 317 00:19:07,000 --> 00:19:11,000 genes, most of them are broken. They're mostly pseudogenes. 318 00:19:11,000 --> 00:19:15,000 It's not true in dogs and mice, who keep their olfactory receptor 319 00:19:15,000 --> 00:19:18,000 genes in pretty fine-working order, but it's very clear that in primates 320 00:19:18,000 --> 00:19:22,000 with color vision, our olfactory receptor genes have 321 00:19:22,000 --> 00:19:25,000 been going to seed. They've been piling up mutations, 322 00:19:25,000 --> 00:19:28,000 and there's no selective pressure to keep many of them. 
323 00:19:28,000 --> 00:19:32,000 And, in fact, we've now shown, in a paper that will come out soon, 324 00:19:32,000 --> 00:19:35,000 that this process is accelerating dramatically in the last 7 million 325 00:19:35,000 --> 00:19:38,000 years since we diverged from chimps. And so, humans have almost 326 00:19:38,000 --> 00:19:42,000 completely lost interest in smell, that's not totally true, some of 327 00:19:42,000 --> 00:19:45,000 these olfactory receptors surely matter for various processes, 328 00:19:45,000 --> 00:19:48,000 but most of them are probably irrelevant right now. 329 00:19:48,000 --> 00:19:52,000 And so, anyway, that's the nature of the genes there. 330 00:19:52,000 --> 00:19:56,000 Anyway, another interesting fact that's worth mentioning about your 331 00:19:56,000 --> 00:20:01,000 genome is half of your genome consists of transposable elements, 332 00:20:01,000 --> 00:20:05,000 elements that simply duplicate themselves, and hop around the 333 00:20:05,000 --> 00:20:10,000 genome. Elements that are like viruses, they make a copy, 334 00:20:10,000 --> 00:20:14,000 sometimes in RNA, the RNA is copied back into DNA and slammed elsewhere 335 00:20:14,000 --> 00:20:19,000 in your genome. These elements, 336 00:20:19,000 --> 00:20:24,000 well the, there are four classes. 337 00:20:24,000 --> 00:20:27,000 Alu elements, LINE elements, retrovirus-like elements, all these 338 00:20:27,000 --> 00:20:30,000 go through RNA intermediates, and use reverse transcription. 339 00:20:30,000 --> 00:20:34,000 And then there's certain DNA transposons, that go through a DNA 340 00:20:34,000 --> 00:20:37,000 intermediate. The number of copies of the Alu element, 341 00:20:37,000 --> 00:20:40,000 the Alu element that's hopped around your genome, 342 00:20:40,000 --> 00:20:44,000 you have about a million, you have a million fossils of this 343 00:20:44,000 --> 00:20:47,000 element. 
You say, why is it there, and the answer is, 344 00:20:47,000 --> 00:20:50,000 because it's there. Because anything that knows how to make a 345 00:20:50,000 --> 00:20:54,000 copy of itself, and insert itself in its genome, 346 00:20:54,000 --> 00:20:57,000 you can't get rid of. You can consider it, 347 00:20:57,000 --> 00:21:00,000 if you wish, an infection, but half of your genome consists of 348 00:21:00,000 --> 00:21:03,000 an infection, with these kinds of transposable elements. 349 00:21:03,000 --> 00:21:12,000 Now that's it, yes? 350 00:21:12,000 --> 00:21:16,000 Well, it's very interesting, what's the effect? Well, they do, 351 00:21:16,000 --> 00:21:20,000 some of them are transcribed and, it's very interesting. 352 00:21:20,000 --> 00:21:24,000 Sometimes it's bad, one of them will hop into a gene and mutate it, 353 00:21:24,000 --> 00:21:28,000 and that's bad, that person will have a lethal mutation, 354 00:21:28,000 --> 00:21:32,000 but the genome has probably begun to use them, and count on their being 355 00:21:32,000 --> 00:21:36,000 there. So, when a bunch, when a transposable element goes in, 356 00:21:36,000 --> 00:21:40,000 and creates a spacing, if you, for example, if an engineering 357 00:21:40,000 --> 00:21:44,000 committee came in and cleaned up the genome by getting rid of all the 358 00:21:44,000 --> 00:21:48,000 transposable elements, it would surely not work. 359 00:21:48,000 --> 00:21:51,000 Because we have evolutionarily come to count on the spacing there. 360 00:21:51,000 --> 00:21:55,000 It's sort of like, if in some very, some very messy attic, you put a cup 361 00:21:55,000 --> 00:21:58,000 of coffee down on top of a stack of papers, those papers may be utterly 362 00:21:58,000 --> 00:22:02,000 irrelevant, but now they're holding up that cup of coffee that you put 363 00:22:02,000 --> 00:22:06,000 down on it. 
And if you were to just, poof, magically get rid of them, 364 00:22:06,000 --> 00:22:10,000 the cup of coffee would come crashing to the ground. 365 00:22:10,000 --> 00:22:13,000 So, you know, they're just there, 366 00:22:13,000 --> 00:22:17,000 taking up space. Now sometimes, even more than that, a few of them 367 00:22:17,000 --> 00:22:20,000 have actually been co-opted into being human genes. 368 00:22:20,000 --> 00:22:24,000 We know that a few of these transposable elements have mutated 369 00:22:24,000 --> 00:22:27,000 into being our genes that do something for us. 370 00:22:27,000 --> 00:22:31,000 And others of them may do things in affecting the general neighborhood 371 00:22:31,000 --> 00:22:35,000 with regard to transcription, and so, instead of thinking of them as 372 00:22:35,000 --> 00:22:38,000 parasites, think of them as symbionts, genomic symbionts, 373 00:22:38,000 --> 00:22:42,000 which take some advantage of us, and we may, you know, have worked 374 00:22:42,000 --> 00:22:46,000 out a compromise to take some advantage of them. 375 00:22:46,000 --> 00:22:49,000 Every time a copy is made of these, and it hops in the genome, some 376 00:22:49,000 --> 00:22:53,000 mutations may happen in the master element, but when it lands in the 377 00:22:53,000 --> 00:22:56,000 new place, we have a record of that hop. And if you reconstruct the 378 00:22:56,000 --> 00:23:00,000 sequence of the million Alu elements, you can see which ones are 379 00:23:00,000 --> 00:23:04,000 very close relatives of each other, and had to have hopped recently, and 380 00:23:04,000 --> 00:23:08,000 which ones are somewhat more distant relatives. 381 00:23:08,000 --> 00:23:11,000 And you can build an evolutionary tree connecting all of the repeat 382 00:23:11,000 --> 00:23:14,000 elements that have hopped around your genome, and thereby attach a 383 00:23:14,000 --> 00:23:17,000 date to each of them, as to when they hopped. 
384 00:23:17,000 --> 00:23:20,000 So it really is a fossil record, and you can figure out how many of 385 00:23:20,000 --> 00:23:23,000 them have been hopping at different times over history. 386 00:23:23,000 --> 00:23:27,000 And we can even make a plot of that, this is long ago, 387 00:23:27,000 --> 00:23:30,000 sometime here, some 30 million years ago, there was a huge explosion 388 00:23:30,000 --> 00:23:33,000 in transposition of transposons in our genome. 389 00:23:33,000 --> 00:23:36,000 We don't know why that happened, but it's very interesting, it does 390 00:23:36,000 --> 00:23:40,000 correspond to very interesting periods of primate evolution. 391 00:23:40,000 --> 00:23:43,000 And then, interestingly, there's been a huge crash, 392 00:23:43,000 --> 00:23:46,000 and transposition has dropped dramatically. We have no clue why 393 00:23:46,000 --> 00:23:49,000 this is, but we have a whole fossil record here of the rate of 394 00:23:49,000 --> 00:23:52,000 transposition of different kinds of repeat elements around our genome, 395 00:23:52,000 --> 00:23:55,000 and people are now starting to try to figure out what in the world this 396 00:23:55,000 --> 00:23:58,000 means. So all this is sort of there, inherent in the sequence, 397 00:23:58,000 --> 00:24:01,000 and if you want the sequence, as I say, you can go to the web and 398 00:24:01,000 --> 00:24:04,000 pull all this stuff now. So how do we understand the 399 00:24:04,000 --> 00:24:07,000 sequence? Well, I've told you a little bit about it, 400 00:24:07,000 --> 00:24:10,000 from the simple things that we've done, but there's a lot more that 401 00:24:10,000 --> 00:24:13,000 needs to be learned about the sequence, so what I really want to 402 00:24:13,000 --> 00:24:16,000 turn to, is how we're extracting information out of this sequence. 
403 00:24:16,000 --> 00:24:19,000 So, DNA sequence is long and boring, it's only marginally more 404 00:24:19,000 --> 00:24:23,000 interesting than reading your hard disk, because it has four letters, 405 00:24:23,000 --> 00:24:27,000 instead of ones and zeros, but it's, you know, well, it's pretty really 406 00:24:27,000 --> 00:24:30,000 boring if you take a look at it. How do you attach meaning to all 407 00:24:30,000 --> 00:24:34,000 this stuff? One of the most powerful ways is by comparison with 408 00:24:34,000 --> 00:24:38,000 other genomes. And so, comparing the human genome 409 00:24:38,000 --> 00:24:42,000 to the mouse genome is very informative in many ways. 410 00:24:42,000 --> 00:24:45,000 So, as soon as the human genome was far along, a portion of the 411 00:24:45,000 --> 00:24:49,000 international consortium, set to work getting a sequence of 412 00:24:49,000 --> 00:24:52,000 the mouse genome. And that was published in December 413 00:24:52,000 --> 00:24:56,000 of 2002. We have a nice map of the mouse genome, with all these things, 414 00:24:56,000 --> 00:24:59,000 it, too, shows these gene-rich regions, gene-poor regions, 415 00:24:59,000 --> 00:25:03,000 all sorts of funny things. And if we look closely at a portion of the 416 00:25:03,000 --> 00:25:06,000 human genome over here, I've picked about a million bases of 417 00:25:06,000 --> 00:25:10,000 the human genome, and we take any little spot in that 418 00:25:10,000 --> 00:25:14,000 million bases of the human genome, let's say over here. 419 00:25:14,000 --> 00:25:17,000 And we take half the DNA sequence corresponding to this spot, 420 00:25:17,000 --> 00:25:20,000 and we run it in the computer against the mouse genome, 421 00:25:20,000 --> 00:25:23,000 and ask where in the mouse genome do we get the best match for this, 422 00:25:23,000 --> 00:25:26,000 the best match to this is here. Now let's do it for this piece, 423 00:25:26,000 --> 00:25:29,000 here. 
The best match anywhere in the mouse genome lands in the same 424 00:25:29,000 --> 00:25:32,000 million bases here of the mouse genome. In fact, 425 00:25:32,000 --> 00:25:36,000 for every single sequence that we pull out from this million bases in 426 00:25:36,000 --> 00:25:39,000 the human genome, the best match is in this million 427 00:25:39,000 --> 00:25:42,000 bases of the mouse genome. That's very interesting. Why is 428 00:25:42,000 --> 00:25:45,000 that? Sorry? No, people do know. 429 00:25:45,000 --> 00:25:49,000 It, it was a good try, though. [LAUGHTER]. This million bases in 430 00:25:49,000 --> 00:25:52,000 the mouse genome, and this million bases in the human 431 00:25:52,000 --> 00:25:56,000 genome, represent the evolutionary descendants of a common million 432 00:25:56,000 --> 00:25:59,000 bases that occurred in our common ancestor 75 million years ago. 433 00:25:59,000 --> 00:26:03,000 This is clear evidence of evolution here, 434 00:26:03,000 --> 00:26:06,000 because we can see that this is a segment of DNA from our common 435 00:26:06,000 --> 00:26:10,000 ancestor that really hasn't undergone much rearrangement, 436 00:26:10,000 --> 00:26:14,000 and we can just line up the sequences and see. 437 00:26:14,000 --> 00:26:17,000 In fact, we can build a whole map across the mouse genome like this. 438 00:26:17,000 --> 00:26:20,000 For any bit of the mouse genome, I don't know, here's a bit on mouse 439 00:26:20,000 --> 00:26:24,000 chromosome 17, this whole stretch corresponds to a 440 00:26:24,000 --> 00:26:27,000 portion of human chromosome number eight. This stretch here, 441 00:26:27,000 --> 00:26:30,000 I don't know, this green color here on chromosome number six, 442 00:26:30,000 --> 00:26:34,000 corresponds to chromosome four in the human. 
And so, 443 00:26:34,000 --> 00:26:37,000 we can build a look-up table that says, for any portion of the human 444 00:26:37,000 --> 00:26:40,000 genome, what's the corresponding portion of the mouse genome that 445 00:26:40,000 --> 00:26:44,000 came from the same ancestor, has basically the same complement of 446 00:26:44,000 --> 00:26:47,000 genes in it. And there's only about 330 such 447 00:26:47,000 --> 00:26:50,000 regions that we need to cut-and-paste the human genome order 448 00:26:50,000 --> 00:26:53,000 to the mouse genome order, roughly speaking. There's a lot of 449 00:26:53,000 --> 00:26:56,000 little local rearrangements, but at this gross level. So now, 450 00:26:56,000 --> 00:26:59,000 if we go back more closely and we look at this, and we say, 451 00:26:59,000 --> 00:27:03,000 OK, so now we look at this region, we now know these two regions 452 00:27:03,000 --> 00:27:06,000 descend from a common ancestor, if we do a careful evolutionary 453 00:27:06,000 --> 00:27:09,000 analysis, lining up all the sequences, and see how 454 00:27:09,000 --> 00:27:12,000 well-preserved the sequences are, some are much better preserved than 455 00:27:12,000 --> 00:27:16,000 others. Evolution has been much more 456 00:27:16,000 --> 00:27:20,000 lovingly conserving some sequences than others, and so, 457 00:27:20,000 --> 00:27:24,000 so let's now zoom-in on a gene, this is a gene that goes by the name, 458 00:27:24,000 --> 00:27:28,000 PPAR-gamma, I'm fond of this gene but, it doesn't matter. If we look, I've 459 00:27:28,000 --> 00:27:32,000 indicated all the regions here, in which there's a heightened degree 460 00:27:32,000 --> 00:27:36,000 of conservation. The sequence is well-conserved here, 461 00:27:36,000 --> 00:27:40,000 here, here, here, here, here, here, and here, 462 00:27:40,000 --> 00:27:44,000 here, here, here, here, here. 
These correspond to the exons of 463 00:27:44,000 --> 00:27:48,000 the PPAR-gamma gene, they encode the protein of the gene, 464 00:27:48,000 --> 00:27:52,000 then the splicing goes like this, OK? These things here do not correspond 465 00:27:52,000 --> 00:27:56,000 to the exons. People have no idea what they are, 466 00:27:56,000 --> 00:28:00,000 in fact, this is not supposed to be here. The official textbook picture 467 00:28:00,000 --> 00:28:04,000 says, the vast majority of what matters for a gene, 468 00:28:04,000 --> 00:28:09,000 what evolution should preserve, is the exons plus the promoter. 469 00:28:09,000 --> 00:28:13,000 Here's the promoter. But in fact, what we found is that 470 00:28:13,000 --> 00:28:17,000 an awful lot more is being preserved. In fact, across the genome, 471 00:28:17,000 --> 00:28:21,000 our best estimate is there are about 500,000 conserved elements across 472 00:28:21,000 --> 00:28:26,000 the genome, and only 1/3 of them are protein-coding exons. 473 00:28:26,000 --> 00:28:30,000 That means 2/3 of the stuff evolution has been interested in, 474 00:28:30,000 --> 00:28:34,000 is not protein-coding exons, and the truth is, we do not know what it is, 475 00:28:34,000 --> 00:28:38,000 this was a very radical finding, when this mouse paper came out, 476 00:28:38,000 --> 00:28:43,000 about a year and a half, about two years ago now. 477 00:28:43,000 --> 00:28:47,000 What it must be, I think, but we're guessing, 478 00:28:47,000 --> 00:28:51,000 are regulatory signals, the structural elements in chromosomes, 479 00:28:51,000 --> 00:28:56,000 RNA genes, but there's an awful lot more of it than we had imagined. 480 00:28:56,000 --> 00:28:59,000 And now we're in this fascinating situation, 481 00:28:59,000 --> 00:29:03,000 where computational analysis has told us what's on evolution's mind, 482 00:29:03,000 --> 00:29:07,000 and now we have to go to the lab and figure out what in the world it does. 
483 00:29:07,000 --> 00:29:10,000 But there's no doubt that it must do something, because evolution has 484 00:29:10,000 --> 00:29:14,000 preserved it quite well. Now, I oversimplified greatly in 485 00:29:14,000 --> 00:29:18,000 this discussion, let me first say, 486 00:29:18,000 --> 00:29:21,000 and I'll come back to that. We do know, if we take some of those 487 00:29:21,000 --> 00:29:24,000 elements, here's one, there's a 481 base-pair element 488 00:29:24,000 --> 00:29:27,000 that's 84% identical between human and mouse. You could write yourself 489 00:29:27,000 --> 00:29:31,000 a little statistical model to say that's way unusual to have something 490 00:29:31,000 --> 00:29:34,000 that's so well preserved. When Eddy Rubin and his colleagues 491 00:29:34,000 --> 00:29:37,000 from Berkeley made a knockout mouse that deleted that segment, 492 00:29:37,000 --> 00:29:40,000 this knockout mouse lost regulation of three different genes in the 493 00:29:40,000 --> 00:29:43,000 neighborhood, saying that this must be a regulatory sequence that 494 00:29:43,000 --> 00:29:47,000 affects multiple genes in the neighborhood. That, 495 00:29:47,000 --> 00:29:50,000 that's one, with about 300,000 such elements to go, in order to 496 00:29:50,000 --> 00:29:54,000 attach meaning to them. So doing this entirely by knocking 497 00:29:54,000 --> 00:29:58,000 out mice will be a slow process, one's going to need other ways to be 498 00:29:58,000 --> 00:30:02,000 able to attach meaning, but there's no doubt. Now, 499 00:30:02,000 --> 00:30:06,000 there's some other interesting papers where people have knocked 500 00:30:06,000 --> 00:30:10,000 some of these things out, and they've seen no effect on the 501 00:30:10,000 --> 00:30:14,000 mouse. They get a totally viable mouse. Can you conclude from that, 502 00:30:14,000 --> 00:30:18,000 that they have no function? Why not? The knockout mouse is viable. 
503 00:30:18,000 --> 00:30:22,000 Could be redundant, it could even not be redundant, 504 00:30:22,000 --> 00:30:26,000 but yes, it could be redundant, but you couldn't knock out both of 505 00:30:26,000 --> 00:30:29,000 two things. It turns out, suppose knocking it 506 00:30:29,000 --> 00:30:33,000 out affected the mouse's viability by one part in ten to the third, 507 00:30:33,000 --> 00:30:37,000 it was only 99.9% as fertile, would you be able to see that in the 508 00:30:37,000 --> 00:30:41,000 laboratory? No. Would that matter to evolution? 509 00:30:41,000 --> 00:30:44,000 It would be lethal, in an evolutionary sense. 510 00:30:44,000 --> 00:30:48,000 Such a mutation could never propagate through a population. 511 00:30:48,000 --> 00:30:52,000 One part in ten to the third is massive selection against, from 512 00:30:52,000 --> 00:30:56,000 an evolutionary point of view, but almost undetectable in a 513 00:30:56,000 --> 00:31:00,000 laboratory batch. Evolution has a far more sensitive 514 00:31:00,000 --> 00:31:04,000 assay than we do. Now, I won't go into detail, 515 00:31:04,000 --> 00:31:09,000 but for the mathematically inclined here, showing that there really were 516 00:31:09,000 --> 00:31:13,000 about 5% of the human genome under evolutionary selection, it was 517 00:31:13,000 --> 00:31:18,000 a complicated affair, because with only two genomes, 518 00:31:18,000 --> 00:31:23,000 what we really had to do, and if this doesn't make sense, ignore it. 519 00:31:23,000 --> 00:31:26,000 We looked at the background distribution of conservation of the 520 00:31:26,000 --> 00:31:29,000 genome in unimportant elements, in those repeat elements that we 521 00:31:29,000 --> 00:31:32,000 knew to be functionally broken. 
We looked at the overall 522 00:31:32,000 --> 00:31:35,000 conservation of the genome, and found that the overall genome 523 00:31:35,000 --> 00:31:38,000 has this rightward tail, by subtracting the distributions we 524 00:31:38,000 --> 00:31:41,000 were able to see how much excess conservation there was. 525 00:31:41,000 --> 00:31:44,000 That's because we only had two genomes, we had to draw inferences. 526 00:31:44,000 --> 00:31:47,000 If we had more genomes, like the mouse and the rat, 527 00:31:47,000 --> 00:31:50,000 and the dog and the-this-and-the-that, 528 00:31:50,000 --> 00:31:54,000 we would be able to extract signal from noise. 529 00:31:54,000 --> 00:31:57,000 We would be able to see right away, which bits were well-conserved, and 530 00:31:57,000 --> 00:32:01,000 we wouldn't have to do this as a sensitive statistical analysis. 531 00:32:01,000 --> 00:32:05,000 So, in fact, we need more mammalian genomes, so, so right now there's 532 00:32:05,000 --> 00:32:09,000 been a sequence of the rat genome in the past year or so, 533 00:32:09,000 --> 00:32:12,000 there's a sequence of the dog genome, we're writing up that paper now, 534 00:32:12,000 --> 00:32:16,000 but it's on the web already. There's a sequence of the chimpanzee 535 00:32:16,000 --> 00:32:20,000 genome we're writing up a paper on that, in collaboration with our 536 00:32:20,000 --> 00:32:24,000 friends in the genome-sequencing community. 537 00:32:24,000 --> 00:32:27,000 We're currently sequencing a variety of other organisms, 538 00:32:27,000 --> 00:32:30,000 as well. And if you had enough organisms, you ought to be able to 539 00:32:30,000 --> 00:32:34,000 just line it up and say, what has evolution preserved, 540 00:32:34,000 --> 00:32:37,000 and figure out exactly which nucleotides matter, 541 00:32:37,000 --> 00:32:40,000 and which nucleotides don't, are allowed to drift freely, at the 542 00:32:40,000 --> 00:32:44,000 background rate. How far could you go with this? 
543 00:32:44,000 --> 00:32:47,000 Well, we decided to try an interesting experiment. 544 00:32:47,000 --> 00:32:50,000 We said, since mammals are very big, then we're going to need a lot of 545 00:32:50,000 --> 00:32:54,000 genome sequences, how about we try a small organism, 546 00:32:54,000 --> 00:32:58,000 like yeast? What if we were to try to do this, 547 00:32:58,000 --> 00:33:02,000 this kind of evolutionary, genomic analysis on something like the yeast 548 00:33:02,000 --> 00:33:06,000 genome? And so, this is work that I'll describe, 549 00:33:06,000 --> 00:33:10,000 that was between a bunch of people here at MIT who do genome-sequencing, 550 00:33:10,000 --> 00:33:14,000 and a student in computer science, Manolis Kellis, who was a PhD student in 551 00:33:14,000 --> 00:33:18,000 computer science, he now just joined the faculty here 552 00:33:18,000 --> 00:33:21,000 at MIT in computer science. But it was a really great example of 553 00:33:21,000 --> 00:33:25,000 how biology and computer science could come together. 554 00:33:25,000 --> 00:33:28,000 So, the genome-sequencing folks sequenced three related species, 555 00:33:28,000 --> 00:33:32,000 related to our friend, the baker's yeast, Saccharomyces cerevisiae, 556 00:33:32,000 --> 00:33:35,000 workhorse of geneticists. These three different species are 557 00:33:35,000 --> 00:33:39,000 separated by different evolutionary distances, from Saccharomyces 558 00:33:39,000 --> 00:33:42,000 cerevisiae. 
When you line up their genomes, just like with human and 559 00:33:42,000 --> 00:33:46,000 mouse, you find the genes occur largely in the same order, 560 00:33:46,000 --> 00:33:49,000 and it's not hard to pick out, oh there's this gene there, there, 561 00:33:49,000 --> 00:33:53,000 it's all lined up, you've got these evolutionary segments, 562 00:33:53,000 --> 00:33:56,000 and very few rearrangements have occurred across these species, 563 00:33:56,000 --> 00:34:00,000 despite the fact that they're about 20 million years apart in history. 564 00:34:00,000 --> 00:34:05,000 But here's an interesting thing. When the yeast genome, 565 00:34:05,000 --> 00:34:11,000 Saccharomyces cerevisiae, was first published in 1995, 566 00:34:11,000 --> 00:34:16,000 the paper describing it reported 6,200 genes. Now, how did they know 567 00:34:16,000 --> 00:34:22,000 there were 6,200 genes? They ran a computer program looking 568 00:34:22,000 --> 00:34:28,000 for open reading frames. Any open reading frame, consecutive 569 00:34:28,000 --> 00:34:34,000 codons without a stop, that was sufficiently long was called a gene. 570 00:34:34,000 --> 00:34:37,000 But statistically, you could, by chance, 571 00:34:37,000 --> 00:34:41,000 just have a long stretch of codons without a stop codon. 572 00:34:41,000 --> 00:34:44,000 And so, if they saw 100 codons in a row, without a stop, 573 00:34:44,000 --> 00:34:48,000 they called it a gene, but it might just be chance. 574 00:34:48,000 --> 00:34:52,000 And they knew that, of course, they wrote that in the paper, but 575 00:34:52,000 --> 00:34:55,000 for many years, people then had 576 00:34:55,000 --> 00:34:59,000 6,200 open reading frames, which were the yeast's genes. 577 00:34:59,000 --> 00:35:02,000 Could evolution now tell us which ones of them were real and which 578 00:35:02,000 --> 00:35:06,000 weren't? Well, it turns out that evolution was 579 00:35:06,000 --> 00:35:10,000 tremendously powerful in doing that. 
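The open-reading-frame rule described above (a long enough run of codons with no stop is called a gene) can be sketched in a few lines. This is only an illustration, not the program the yeast consortium ran: the function names are hypothetical, it scans the three forward frames only, it does not require a start codon, and the chance estimate assumes all 64 codons are equally likely, which real DNA does not obey.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def open_reading_frames(seq, min_codons=100):
    """Yield (frame, start, end) for each stop-free run of >= min_codons codons.

    Simplified: forward strand only, no start-codon requirement.
    `end` is the index of the stop codon (or the end of the scanned frame).
    """
    seq = seq.upper()
    for frame in range(3):
        run_start = frame
        pos = frame
        while pos + 3 <= len(seq):
            if seq[pos:pos + 3] in STOP_CODONS:
                if (pos - run_start) // 3 >= min_codons:
                    yield (frame, run_start, pos)
                run_start = pos + 3  # next run begins after the stop
            pos += 3
        if (pos - run_start) // 3 >= min_codons:
            yield (frame, run_start, pos)


def chance_of_stop_free_run(n_codons):
    """Probability that n random codons contain no stop (uniform-codon assumption)."""
    return (61 / 64) ** n_codons


if __name__ == "__main__":
    toy = "ATG" + "GCT" * 5 + "TAA" + "GCT" * 2
    print(list(open_reading_frames(toy, min_codons=3)))
    print(chance_of_stop_free_run(100))
```

Under that uniform assumption a 100-codon stop-free run arises by chance a bit under 1% of the time per starting point, which is small but, over a whole genome and six reading frames, enough to produce spurious "genes".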
580 00:35:10,000 --> 00:35:14,000 If you take something that's a well-known gene that has been 581 00:35:14,000 --> 00:35:19,000 extensively studied by yeast geneticists, you line it up across 582 00:35:19,000 --> 00:35:23,000 all four species, you almost never see deletions. 583 00:35:23,000 --> 00:35:28,000 And when you do see the deletions, here in grey, they're always a 584 00:35:28,000 --> 00:35:33,000 multiple of three. Why are they a multiple of three? 585 00:35:33,000 --> 00:35:37,000 They preserve the reading frame. By contrast, if I take some clear, 586 00:35:37,000 --> 00:35:42,000 intergenic DNA, that's not protein-coding, 587 00:35:42,000 --> 00:35:47,000 and I compare it across these four species, I see lots and lots of 588 00:35:47,000 --> 00:35:52,000 frame shifting deletions that occur. 589 00:35:52,000 --> 00:35:54,000 Evolution tolerates frame shifting deletions there, and if I just write down 590 00:35:54,000 --> 00:35:57,000 the rates, frame shifting deletions are 75x more common in intergenic 591 00:35:57,000 --> 00:36:00,000 DNA than in genic DNA. This provides a very powerful test. 592 00:36:00,000 --> 00:36:03,000 Run this test across the genome, looking for the density of frame 593 00:36:03,000 --> 00:36:06,000 shifting deletions, any place that doesn't tolerate 594 00:36:06,000 --> 00:36:09,000 frame shifting deletions is probably a real gene, anything that does 595 00:36:09,000 --> 00:36:12,000 tolerate it is probably not. When you sorted through all this, 596 00:36:12,000 --> 00:36:15,000 it turned out that 528 of the official yeast genes were clearly 597 00:36:15,000 --> 00:36:18,000 not real, not real genes. They were just chock-a-block full 598 00:36:18,000 --> 00:36:22,000 of these frame shifting deletions. And, and a bunch of others could be 599 00:36:22,000 --> 00:36:26,000 confirmed. 
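The frame-shift test just described can be illustrated with a toy calculation: look at the gaps in a multiple alignment and ask what fraction of them have a length that is not a multiple of three. This is a minimal sketch under stated assumptions, not the actual analysis: the alignment is given as equal-length strings with `-` for gaps, the function names are made up for this example, and the lecture does not give the cutoff that was used to call a region "not a gene".

```python
import re


def gap_lengths(aligned_row):
    """Return the length of every run of '-' in one row of an alignment."""
    return [len(m.group()) for m in re.finditer(r"-+", aligned_row)]


def frameshift_fraction(aligned_rows):
    """Fraction of gaps whose length is not a multiple of three.

    A value near 0 is what a protein-coding region is expected to show;
    a high value suggests the reading frame is not being preserved.
    Returns 0.0 when the alignment has no gaps at all.
    """
    gaps = [n for row in aligned_rows for n in gap_lengths(row)]
    if not gaps:
        return 0.0
    return sum(1 for n in gaps if n % 3 != 0) / len(gaps)


if __name__ == "__main__":
    coding_like = ["ATGGCTGCT", "ATG---GCT"]      # one 3-base gap
    intergenic_like = ["ATGGCTGCT", "AT-GCT--T"]  # gaps of length 1 and 2
    print(frameshift_fraction(coding_like))
    print(frameshift_fraction(intergenic_like))
```

In practice one would compute this per candidate open reading frame across all four species and compare it with the rate seen in known genes and in known intergenic DNA.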
So the yeast gene count, and I won't tell you all the 600 00:36:26,000 --> 00:36:30,000 experimental and other evidence that shows this is right, 601 00:36:30,000 --> 00:36:34,000 but the yeast genome has now been revised downward to 5, 602 00:36:34,000 --> 00:36:38,000 00 genes, and we have great confidence that almost all of those 603 00:36:38,000 --> 00:36:42,000 are real genes, there are 20 whose origins 604 00:36:42,000 --> 00:36:46,000 we're not sure of, and new genes could be found in this 605 00:36:46,000 --> 00:36:50,000 way. Here's a really audacious thing. 606 00:36:50,000 --> 00:36:51,000 This graduate student in computer science said, I think, 607 00:36:51,000 --> 00:36:53,000 based on these other species, there was a mistake made in the 608 00:36:53,000 --> 00:36:55,000 sequencing of the first yeast, and that the reason these things are 609 00:36:55,000 --> 00:36:57,000 called two separate genes, is that somebody made a sequencing 610 00:36:57,000 --> 00:36:58,000 error that got a stop codon here, but I think these are really part of 611 00:36:58,000 --> 00:37:00,000 one gene. And so, somebody went back and re-sequenced 612 00:37:00,000 --> 00:37:02,000 some of these, and sure enough, 613 00:37:02,000 --> 00:37:04,000 he had correctly predicted that there had been a mistake made at 614 00:37:04,000 --> 00:37:06,000 that letter, and that these were in fact, a single gene. 615 00:37:06,000 --> 00:37:11,000 The computational analysis was incredibly powerful in this regard, 616 00:37:11,000 --> 00:37:17,000 it could go further than this, you could ask, could I also figure out 617 00:37:17,000 --> 00:37:23,000 the way genes are regulated in this fashion, could I work out the 618 00:37:23,000 --> 00:37:29,000 intergenic signals in the promoter regions? 
Remember that the lac 619 00:37:29,000 --> 00:37:35,000 repressor binds to a certain operator site, well, all of these regulatory 620 00:37:35,000 --> 00:37:41,000 proteins bind to different sequences, could we figure out what the 621 00:37:41,000 --> 00:37:46,000 sequences were, computationally? Well, if we look closely at an 622 00:37:46,000 --> 00:37:50,000 intergenic region, here's one where there's two genes being transcribed 623 00:37:50,000 --> 00:37:54,000 in opposite directions, gal-1 and gal-10, both involved in 624 00:37:54,000 --> 00:37:58,000 galactose metabolism, and there's a particular protein, 625 00:37:58,000 --> 00:38:03,000 a transcription factor here, called Gal-4, in this region, 626 00:38:03,000 --> 00:38:07,000 and it has a particular sequence that it likes, 627 00:38:07,000 --> 00:38:11,000 CCG, 11 bases, GGC. So, that Gal-4 site, we see, 628 00:38:11,000 --> 00:38:16,000 is very well preserved across all of the species. 629 00:38:16,000 --> 00:38:20,000 So, a known regulatory sequence is well-preserved, 630 00:38:20,000 --> 00:38:24,000 now let's look at that closely. This Gal-4 binding site is a measly, 631 00:38:24,000 --> 00:38:29,000 crummy, six nucleotides of information. At random, 632 00:38:29,000 --> 00:38:33,000 it's going to occur in many places in the yeast genome, 633 00:38:33,000 --> 00:38:38,000 but not be a real, important Gal-4 site, right? Some of them matter, some of 634 00:38:38,000 --> 00:38:42,000 them don't. How do we figure out which of these occurrences are real 635 00:38:42,000 --> 00:38:46,000 Gal-4 sites? Well, if we look across all four species, what we find is that 636 00:38:46,000 --> 00:38:51,000 those occurrences that occur in promoter regions, 637 00:38:51,000 --> 00:38:55,000 are much more likely to be conserved by evolution than those 638 00:38:55,000 --> 00:39:00,000 that don't. So there's a special property here, 639 00:39:00,000 --> 00:39:04,000 conservation of the motif in the promoter regions. 
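To make the idea of a "conserved occurrence" concrete, here is a small sketch that finds a gapped motif in each species' sequence and keeps only the positions where it appears in all of them. The pattern is written exactly as quoted in the lecture (CCG, any 11 bases, GGC) and is not checked against the literature; the function names are hypothetical, and the sketch assumes the sequences are already aligned without gaps so that equal positions correspond. The chance estimate assumes equally frequent bases on a single strand.

```python
import re

# Motif as quoted in the lecture: CCG, then any 11 bases, then GGC.
# The lookahead lets overlapping occurrences be reported as well.
MOTIF = re.compile(r"(?=(CCG[ACGT]{11}GGC))")


def motif_positions(seq):
    """Start positions of every occurrence of the motif in one sequence."""
    return [m.start() for m in MOTIF.finditer(seq.upper())]


def conserved_positions(aligned_seqs):
    """Positions where the motif is present in every aligned sequence.

    Assumes an ungapped alignment, so index i means the same site in each species.
    """
    position_sets = [set(motif_positions(s)) for s in aligned_seqs]
    return sorted(set.intersection(*position_sets))


def expected_chance_hits(genome_length, fixed_bases=6):
    """Expected random occurrences of a motif with `fixed_bases` specified letters."""
    return genome_length / 4 ** fixed_bases


if __name__ == "__main__":
    a = "TT" + "CCG" + "A" * 11 + "GGC" + "TT"
    b = "TT" + "CCG" + "ACGTACGTACG" + "GGC" + "TT"
    c = "TT" + "CCG" + "A" * 11 + "GGA" + "TT"   # motif lost in this species
    print(conserved_positions([a, b]))
    print(conserved_positions([a, b, c]))
```

With only six specified letters, `expected_chance_hits` gives one random match every 4,096 bases, which is why a single-genome search is mostly noise and the cross-species filter is needed.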
640 00:39:04,000 --> 00:39:08,000 In fact, this particular sequence is four times more likely to be 641 00:39:08,000 --> 00:39:12,000 preserved when it occurs in a promoter region, 642 00:39:12,000 --> 00:39:16,000 than when it occurs in a coding region. And for a typical control 643 00:39:16,000 --> 00:39:20,000 sequence, the opposite is true. Since genes, since coding sequences 644 00:39:20,000 --> 00:39:24,000 are better preserved in general, for a randomly chosen sequence, I 645 00:39:24,000 --> 00:39:28,000 don't know, ATGGCAT, it's more likely to be preserved in 646 00:39:28,000 --> 00:39:32,000 coding regions than non-coding regions. 647 00:39:32,000 --> 00:39:35,000 So this Gal-4 motif has a very funky property that, 648 00:39:35,000 --> 00:39:38,000 on average, it's 12x more likely than background, 649 00:39:38,000 --> 00:39:41,000 to be preserved when it occurs in a promoter. Now, 650 00:39:41,000 --> 00:39:44,000 that's a test you can apply to another motif, and another motif. 651 00:39:44,000 --> 00:39:47,000 In fact, you could, by computer, test all possible motifs, and ask 652 00:39:47,000 --> 00:39:50,000 which ones have that property? Make a scatter plot, most motifs 653 00:39:50,000 --> 00:39:53,000 are better conserved when they occur in coding regions, 654 00:39:53,000 --> 00:39:56,000 than when they occur in promoter regions, some however, 655 00:39:56,000 --> 00:40:00,000 are better preserved in promoter regions than in coding regions. 656 00:40:00,000 --> 00:40:04,000 Our friend, Gal-4, is up there, but there are a lot 657 00:40:04,000 --> 00:40:09,000 more things like it, that are better preserved by 658 00:40:09,000 --> 00:40:14,000 evolution in promoters. You can make a list of them. 
You 659 00:40:14,000 --> 00:40:19,000 can get about 72 well-conserved, regulatory motifs and it turns out 660 00:40:19,000 --> 00:40:24,000 that 20 years of yeast work produced knowledge about things like the 661 00:40:24,000 --> 00:40:29,000 Gal-4 site, and other sites. Almost all the known regulatory 662 00:40:29,000 --> 00:40:34,000 sites that had been discovered over the course of 20 years of 663 00:40:34,000 --> 00:40:39,000 experimental work appear on this list that falls out of the computer 664 00:40:39,000 --> 00:40:44,000 analysis of evolutionary comparison of genomes. 665 00:40:44,000 --> 00:40:48,000 You can actually go a step further, I'll hesitate to tell you, but I'll 666 00:40:48,000 --> 00:40:53,000 try anyway. If you wanted to find out, without knowing in advance, 667 00:40:53,000 --> 00:40:57,000 what these motifs were doing, what their biological function was, 668 00:40:57,000 --> 00:41:02,000 you can do that informationally, too. It turns out that if I take my 669 00:41:02,000 --> 00:41:06,000 motif, Gal-4, and I ask, which genes does it occur in front 670 00:41:06,000 --> 00:41:11,000 of? Well, across Saccharomyces cerevisiae, you find this crummy 671 00:41:11,000 --> 00:41:15,000 little motif in many, many places because, as I said, 672 00:41:15,000 --> 00:41:20,000 most of it's just noise. But if I ask, which genes have this 673 00:41:20,000 --> 00:41:24,000 motif in all four species, these genes, there's a huge overlap 674 00:41:24,000 --> 00:41:28,000 with a class of genes involved in carbohydrate metabolism. 675 00:41:28,000 --> 00:41:33,000 So, if I didn't know in advance that the Gal-4 motif was involved in 676 00:41:33,000 --> 00:41:37,000 regulating genes in carbohydrate metabolism, I could tell, 677 00:41:37,000 --> 00:41:41,000 just from the fact that the genes that have conserved it, 678 00:41:41,000 --> 00:41:46,000 are genes involved in carbohydrate metabolism. 
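One common way to decide whether an overlap like this is "huge" rather than accidental is a hypergeometric tail probability: given how many genes there are in total, how many belong to the functional category, and how many carry the conserved motif, how likely is an overlap at least this large by chance? The lecture does not say which statistic was actually used, so treat this as an assumed stand-in; the function name and all numbers in the example are hypothetical.

```python
from math import comb


def overlap_p_value(total_genes, category_size, motif_genes, overlap):
    """P(overlap >= observed) if motif genes were drawn at random.

    total_genes   : number of genes in the genome
    category_size : genes in the functional category (e.g. carbohydrate metabolism)
    motif_genes   : genes with a conserved copy of the motif
    overlap       : genes that are in both sets
    """
    largest_possible = min(category_size, motif_genes)
    favourable = sum(
        comb(category_size, k) * comb(total_genes - category_size, motif_genes - k)
        for k in range(overlap, largest_possible + 1)
    )
    return favourable / comb(total_genes, motif_genes)


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    print(overlap_p_value(total_genes=5000, category_size=100,
                          motif_genes=40, overlap=10))
```

A very small value says the motif's conserved targets are concentrated in that category far more than random sampling would explain, which is the kind of signal that lets one guess a motif's function without prior knowledge.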
679 00:41:46,000 --> 00:41:50,000 You can do that using all sorts of tricks, expression of genes, 680 00:41:50,000 --> 00:41:54,000 protein mass spec, blah, blah, blah, and the short answer is, for 681 00:41:54,000 --> 00:41:58,000 almost all of those motifs that you can find in the computer, 682 00:41:58,000 --> 00:42:02,000 by consulting public databases of sets of genes that are co-expressed, 683 00:42:02,000 --> 00:42:06,000 or have similar properties and all that, the computer can also offer 684 00:42:06,000 --> 00:42:10,000 you a pretty good hypothesis about what that motif is associated with. 685 00:42:10,000 --> 00:42:14,000 You can even go a step further than that. You can begin to look at 686 00:42:14,000 --> 00:42:18,000 pairs of motifs, you can say, if I have a certain 687 00:42:18,000 --> 00:42:23,000 regulatory sequence, number one, and a second regulatory 688 00:42:23,000 --> 00:42:27,000 sequence, number two, do they tend to be preserved in 689 00:42:27,000 --> 00:42:31,000 front of the same genes as each other? Is their conservation 690 00:42:31,000 --> 00:42:36,000 correlated? And you can build a map of which pairs 691 00:42:36,000 --> 00:42:40,000 do: when this guy's conserved, this guy tends to be 692 00:42:40,000 --> 00:42:44,000 conserved. And you can say, oh those proteins must be talking to 693 00:42:44,000 --> 00:42:48,000 each other, and you can read that off from the patterns of evolution, 694 00:42:48,000 --> 00:42:52,000 as well. There are two regulators, one called Sterile 12, 695 00:42:52,000 --> 00:42:57,000 one called Tec1. This computational analysis shows that they tend to 696 00:42:57,000 --> 00:43:01,000 co-occur in a conserved fashion, far more often than you'd expect by 697 00:43:01,000 --> 00:43:05,000 chance. And when you do the analysis, you find that those genes 698 00:43:05,000 --> 00:43:09,000 that just have a conserved Sterile 12, those genes tend to 699 00:43:09,000 --> 00:43:13,000 be involved in mating. 
Genes that just have a conserved 700 00:43:13,000 --> 00:43:16,000 instance of Tec1 tend to be involved in the budding of the yeast, 701 00:43:16,000 --> 00:43:20,000 and those genes that have conserved occurrences of both tend to be 702 00:43:20,000 --> 00:43:23,000 involved in filamentation. Now all that can be read out, 703 00:43:23,000 --> 00:43:26,000 which is way cool, this is not the way we used to do biology. 704 00:43:26,000 --> 00:43:30,000 Now don't get me wrong, there's a ton of experiments that underlie 705 00:43:30,000 --> 00:43:33,000 creating these databases, and there's a ton of experiments 706 00:43:33,000 --> 00:43:36,000 that have to be done to check any of these things. But what we have is 707 00:43:36,000 --> 00:43:40,000 one of the most powerful hypothesis generators that's ever 708 00:43:40,000 --> 00:43:44,000 been seen here. Evolution, by telling us what to 709 00:43:44,000 --> 00:43:48,000 focus on, is giving us, on a silver platter, hundreds of 710 00:43:48,000 --> 00:43:52,000 hypotheses about who's interacting with whom, and sending us back to 711 00:43:52,000 --> 00:43:56,000 the lab then, to test these hypotheses. Now, 712 00:43:56,000 --> 00:44:00,000 what are the implications of all of this for the human genome? 713 00:44:00,000 --> 00:44:04,000 Could we do this for the human genome? Well, 714 00:44:04,000 --> 00:44:08,000 these species, Saccharomyces cerevisiae, S. 715 00:44:08,000 --> 00:44:12,000 paradoxus, S. mikatae and S. bayanus, are they a good model for 716 00:44:12,000 --> 00:44:15,000 mammals? Well it turns out that their 717 00:44:15,000 --> 00:44:19,000 evolutionary distance from each other is the same as the distance of 718 00:44:19,000 --> 00:44:23,000 human to lemur, to dog, to mouse. 719 00:44:23,000 --> 00:44:27,000 So they were chosen with a purpose. Those are actually fairly good 720 00:44:27,000 --> 00:44:30,000 models for the human. 
So could we do exactly the same 721 00:44:30,000 --> 00:44:34,000 analysis for the human, for the entire human genome? 722 00:44:34,000 --> 00:44:38,000 If we had, human, lemur, dog, and mouse, or basically four 723 00:44:38,000 --> 00:44:42,000 species, human, mouse, rat, and dog. 724 00:44:42,000 --> 00:44:46,000 Well, there's one little fly in the ointment. The human genome is 20x 725 00:44:46,000 --> 00:44:50,000 bigger than the yeast genome. If I want to analyze the whole 726 00:44:50,000 --> 00:44:54,000 human genome, I have a problem of signal-to-noise. 727 00:44:54,000 --> 00:44:58,000 The genome is 20x bigger, I've got 20x as much noise to get 728 00:44:58,000 --> 00:45:03,000 rid of. I won't walk you through it, but I need more evolutionary 729 00:45:03,000 --> 00:45:07,000 information to get rid of all that noise. And, you can do a simple 730 00:45:07,000 --> 00:45:11,000 calculation that says, my evolutionary tree needs to be 731 00:45:11,000 --> 00:45:15,000 bigger, its branch length needs to be bigger by about the natural log 732 00:45:15,000 --> 00:45:20,000 of 20, to get rid of 20 fold more noise. 733 00:45:20,000 --> 00:45:24,000 And that would mean I'd need more species, I'd need about 16 species, 734 00:45:24,000 --> 00:45:28,000 or something like that to be able to do that. But if I built an 735 00:45:28,000 --> 00:45:32,000 evolutionary tree that had a branch length of four, 736 00:45:32,000 --> 00:45:36,000 that is, four substitutions per base across this evolutionary tree, 737 00:45:36,000 --> 00:45:40,000 as indicated by these colored lines here, I should have enough power to 738 00:45:40,000 --> 00:45:44,000 analyze the entire human genome, the way we just did the yeast genome. 739 00:45:44,000 --> 00:45:48,000 So we currently have human, chimp, mouse, rat, dog.
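The "natural log of 20" step can be checked with a few lines of arithmetic. The sketch below assumes a simple neutral model, which the lecture does not spell out: a base under no selection survives unchanged across a tree of total branch length T (in substitutions per base) with probability about e^(-T). Under that assumption, cutting chance conservation 20-fold takes ln(20) extra branch length. The function names are hypothetical.

```python
import math

def neutral_conservation_prob(branch_length: float) -> float:
    """Chance that a neutral base looks conserved across the whole tree.

    Assumed model: substitutions arrive as a Poisson process, so the
    probability of zero substitutions over total branch length T is exp(-T).
    """
    return math.exp(-branch_length)

def extra_branch_length(noise_fold: float) -> float:
    """Extra branch length needed to cut chance conservation by noise_fold."""
    return math.log(noise_fold)

extra = extra_branch_length(20)
print(f"extra branch length for 20x less noise: {extra:.2f} substitutions per base")

# Sanity check: adding that much branch length lowers the noise exactly 20-fold.
base = 1.0  # hypothetical starting branch length
ratio = neutral_conservation_prob(base) / neutral_conservation_prob(base + extra)
print(f"noise reduction: {ratio:.1f}x")
```

The answer comes out at roughly three additional substitutions per base, whatever starting branch length is plugged in.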
As of this 740 00:45:48,000 --> 00:45:52,000 fall, during in fact, right at the beginning of this term, 741 00:45:52,000 --> 00:45:56,000 the National Institutes of Health signed off on the sequencing of 742 00:45:56,000 --> 00:46:00,000 these additional eight mammals. These mammals are now in process, 743 00:46:00,000 --> 00:46:04,000 and in fact, the elephant is done, and the armadillo is in process, 744 00:46:04,000 --> 00:46:08,000 and the tree shrew, I think, is being caught at the moment. 745 00:46:08,000 --> 00:46:12,000 [LAUGHTER]. The ten-, don't talk about the tree 746 00:46:12,000 --> 00:46:18,000 shrews. The tenrec is actually being tested right now, 747 00:46:18,000 --> 00:46:24,000 etc, and all this is going on right now, as we speak, 748 00:46:24,000 --> 00:46:29,000 and I think that by next summer, we should have much of, and by 749 00:46:29,000 --> 00:46:35,000 certainly, by a year from now, we should have all this information 750 00:46:35,000 --> 00:46:41,000 to do such an analysis. That said, we're of course, 751 00:46:41,000 --> 00:46:47,000 very impatient people, you could just take the human, 752 00:46:47,000 --> 00:46:51,000 the mouse, the rat, and the dog. And I said that's not enough if you 753 00:46:51,000 --> 00:46:55,000 wanted to analyze the whole genome, but suppose you just wanted to 754 00:46:55,000 --> 00:46:59,000 analyze a portion of the genome, maybe about a yeast-size piece of 755 00:46:59,000 --> 00:47:03,000 the genome, well let's see, at 20,000 genes, I don't know, 756 00:47:03,000 --> 00:47:06,000 suppose I take, I don't know, two kilobases around each of 20,000 757 00:47:06,000 --> 00:47:10,000 genes, well that's you know, 40 megabases of DNA, it's only a 758 00:47:10,000 --> 00:47:14,000 couple-fold more than yeast. Maybe, if I just focus on a limited 759 00:47:14,000 --> 00:47:18,000 region around each promoter, I could start reading out these 760 00:47:18,000 --> 00:47:22,000 regulatory signals, with just four species.
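The back-of-the-envelope numbers above are easy to reproduce. In this sketch the gene count and the two-kilobase window come from the lecture; the yeast genome size of about 12 megabases is an outside figure supplied for the comparison, and the names are hypothetical.

```python
GENES = 20_000                  # gene count quoted in the lecture
WINDOW_BP = 2_000               # two kilobases around each gene
YEAST_GENOME_BP = 12_000_000    # assumed: roughly 12 megabases, not stated in the lecture

def promoter_region_bp(n_genes: int, window_bp: int) -> int:
    """Total sequence covered when one fixed window is taken per gene."""
    return n_genes * window_bp

total_bp = promoter_region_bp(GENES, WINDOW_BP)
ratio = total_bp / YEAST_GENOME_BP

print(f"promoter-proximal sequence: {total_bp / 1e6:.0f} megabases")
print(f"relative to yeast: {ratio:.1f}x")
```

With those inputs the search space is 40 megabases, a few-fold larger than yeast, which is why four species might already give usable signal for promoter regions.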
761 00:47:22,000 --> 00:47:26,000 So in fact, the post-doctorate fellow is, has been working on this 762 00:47:26,000 --> 00:47:30,000 problem over the summer, and a little bit, too, through the 763 00:47:30,000 --> 00:47:34,000 spring and summer, together with Manolis Kellis, 764 00:47:34,000 --> 00:47:38,000 who's now in the computer science department. And I think we have a 765 00:47:38,000 --> 00:47:42,000 preliminary list for the human genome that's fallen out over the 766 00:47:42,000 --> 00:47:46,000 course of the past couple of months, and we're in the process, right now, 767 00:47:46,000 --> 00:47:50,000 of finishing up a paper that we're hoping to get submitted by Friday, 768 00:47:50,000 --> 00:47:54,000 with a preliminary list of regulatory signals in the human 769 00:47:54,000 --> 00:47:58,000 genome, read out from evolution of human, mouse, rat, and dog. 770 00:47:58,000 --> 00:48:01,000 It won't be everything, we don't have full power to pick up 771 00:48:01,000 --> 00:48:04,000 all possible signals, but we're picking up a lot of the 772 00:48:04,000 --> 00:48:08,000 signals, we're picking up a very large fraction of previously 773 00:48:08,000 --> 00:48:11,000 discovered signals, and lots more new signals, 774 00:48:11,000 --> 00:48:14,000 as well, are falling out of that analysis. So anyway, 775 00:48:14,000 --> 00:48:18,000 I can assure you that that's not in the textbooks because, 776 00:48:18,000 --> 00:48:21,000 actually, it hasn't been submitted yet. This other stuff I've 777 00:48:21,000 --> 00:48:25,000 described about the yeast analysis, this, you do want to look it up, 778 00:48:25,000 --> 00:48:28,000 there's a paper in Nature about a year and change ago, 779 00:48:28,000 --> 00:48:32,000 Kellis et al. describes this yeast work. This is what's going on.
780 00:48:32,000 --> 00:48:36,000 This is what's fun about teaching at MIT, is I can tell you this stuff, 781 00:48:36,000 --> 00:48:41,000 and you guys have a sense for the convergence that's going on in our 782 00:48:41,000 --> 00:48:45,000 field. Much of what I've tried to make the biology, 783 00:48:45,000 --> 00:48:50,000 you know, in making the biology clear, I've talked about how the 784 00:48:50,000 --> 00:48:54,000 different directions, genetics, biochemistry, 785 00:48:54,000 --> 00:48:59,000 have converged together. What we're really seeing now is 786 00:48:59,000 --> 00:49:03,000 information sciences converging with that as well, and I've got to say, 787 00:49:03,000 --> 00:49:08,000 it's a tremendous amount of fun. See you on Monday, good 788 00:49:08,000 --> 00:49:13,000 luck on the quiz.