1 00:00:00,000 --> 00:00:08,000 So, the topic for today's lecture, as you can see up there, is molecular -- 2 00:00:08,000 --> 00:00:16,000 evolution, and ecology. 3 00:00:16,000 --> 00:00:23,000 And what I mean by this, 4 00:00:23,000 --> 00:00:28,000 it's basically the study or what we try to figure out in molecular 5 00:00:28,000 --> 00:00:34,000 evolution and ecology is what genes or gene sequences can tell us about 6 00:00:34,000 --> 00:00:39,000 the evolution and ultimately also the ecology of organisms 7 00:00:39,000 --> 00:00:44,000 in the environment. And it's particularly relevant for 8 00:00:44,000 --> 00:00:48,000 thinking about microorganisms, prokaryotes in the environment. 9 00:00:48,000 --> 00:00:53,000 And I hope I can actually convince you today of that. 10 00:00:53,000 --> 00:00:57,000 This is interesting. The topics that I want to cover 11 00:00:57,000 --> 00:01:02,000 today are: first of all, I want to review a little bit what 12 00:01:02,000 --> 00:01:06,000 we know about life on Earth, sort of give an overview of the 13 00:01:06,000 --> 00:01:11,000 evolution of life on Earth. Then, I want to go into a specific 14 00:01:11,000 --> 00:01:15,000 topic that's of particular relevance for the evolution of eukaryotes. 15 00:01:15,000 --> 00:01:19,000 That's the endosymbiosis theory. And then I'll explain how we can 16 00:01:19,000 --> 00:01:23,000 use gene sequences to actually reconstruct events that have 17 00:01:23,000 --> 00:01:27,000 happened a very, very long time ago. 18 00:01:27,000 --> 00:01:31,000 OK, so we'll look at what we call molecular phylogenies, 19 00:01:31,000 --> 00:01:35,000 with the use of gene sequences to reconstruct the evolutionary history 20 00:01:35,000 --> 00:01:40,000 of organisms on Earth. Derived from that, we'll look at 21 00:01:40,000 --> 00:01:44,000 what we call the tree of life. 
That's sort of the big picture 22 00:01:44,000 --> 00:01:49,000 overview of the evolutionary relationships of all organisms on 23 00:01:49,000 --> 00:01:53,000 the planet. And then finally, I'll introduce you to a topic called 24 00:01:53,000 --> 00:01:59,000 molecular ecology. Again, that's how we can use gene 25 00:01:59,000 --> 00:02:05,000 sequences to learn something about the diversity of microorganisms in 26 00:02:05,000 --> 00:02:11,000 the environment. That leads us then, next time, when I come back on 27 00:02:11,000 --> 00:02:18,000 Monday, into this big topic of environmental genomics, 28 00:02:18,000 --> 00:02:24,000 how we can actually expand this analysis to learn much more about 29 00:02:24,000 --> 00:02:30,000 organisms in the environment. So, first of all, let's look at 30 00:02:30,000 --> 00:02:38,000 life on Earth. Does anybody know how old we think 31 00:02:38,000 --> 00:02:48,000 Earth is? Say again? Yeah, 4.5 to 4.6, I have in my 32 00:02:48,000 --> 00:02:58,000 notes 4.6. So, Earth is thought to have originated 33 00:02:58,000 --> 00:03:08,000 about 4.6 billion years ago. When did the first solid rocks 34 00:03:08,000 --> 00:03:19,000 appear on Earth? So, when was the surface kind of 35 00:03:19,000 --> 00:03:30,000 solidified? Anybody know? About 3.9 billion years ago, OK? 36 00:03:30,000 --> 00:03:40,000 And when do we think life started to develop on the planet? 37 00:03:40,000 --> 00:03:50,000 Any ideas? Take a guess. Two? One? 3.5 billion years ago, 38 00:03:50,000 --> 00:04:00,000 OK? So, this is really remarkable. 39 00:04:00,000 --> 00:04:04,000 We think it didn't, I mean, of course it took a long 40 00:04:04,000 --> 00:04:09,000 time because we're talking about millions of years and hundreds of 41 00:04:09,000 --> 00:04:13,000 millions of years, but still, if you look at the big 42 00:04:13,000 --> 00:04:18,000 picture, it didn't actually take life that long to evolve on the 43 00:04:18,000 --> 00:04:23,000 planet. 
So, why do we think that is the case? What's the evidence for 44 00:04:23,000 --> 00:04:27,000 that? Well, when we look into sedimentary rocks, 45 00:04:27,000 --> 00:04:32,000 so old rocks that arose from sediments, what you find around this 46 00:04:32,000 --> 00:04:37,000 time, you find that chemicals start to appear, organic molecules that 47 00:04:37,000 --> 00:04:42,000 really resemble organic molecules in modern life. 48 00:04:42,000 --> 00:04:47,000 So, we have sort of chemical tracers, or chemical fossils. 49 00:04:47,000 --> 00:05:01,000 So, tracers that indicate the 50 00:05:01,000 --> 00:05:09,000 presence of organisms. But what we also find is so-called 51 00:05:09,000 --> 00:05:17,000 micro-fossils, and I have a picture of that here 52 00:05:17,000 --> 00:05:25,000 where when you actually take rocks and actually slice them into very, 53 00:05:25,000 --> 00:05:33,000 very thin slices, you can put them under specific microscopes. 54 00:05:33,000 --> 00:05:37,000 And what you then find is that many rocks that are very, 55 00:05:37,000 --> 00:05:42,000 very old, have those kinds of inclusions in them. 56 00:05:42,000 --> 00:05:47,000 And these things really resemble very much modern prokaryotic cells, 57 00:05:47,000 --> 00:05:52,000 modern bacterial cells, for example. And so, those micro-fossils are 58 00:05:52,000 --> 00:05:57,000 generally taken as an indication, also, that life was already present 59 00:05:57,000 --> 00:06:02,000 during those times. Now, when we take a quick sort of 60 00:06:02,000 --> 00:06:08,000 overview of the evolution of life on the planet, again this graph here 61 00:06:08,000 --> 00:06:13,000 summarizes sort of the last 4. billion years or so when life 62 00:06:13,000 --> 00:06:19,000 originated. 
We see that there was a period of chemical evolution, 63 00:06:19,000 --> 00:06:24,000 and then somewhere here in that region, it's, of course, not really well 64 00:06:24,000 --> 00:06:30,000 understood when that exactly happened, the origin of life is placed. 65 00:06:30,000 --> 00:06:34,000 But I want to alert you to a couple of really, really critical steps 66 00:06:34,000 --> 00:06:39,000 here that are shown on this graph which we'll actually talk more about. 67 00:06:39,000 --> 00:06:44,000 It is thought that life very early on split into three major 68 00:06:44,000 --> 00:06:49,000 lineages: the bacteria, the archaea, and what is called here the 69 00:06:49,000 --> 00:06:54,000 nuclear line. And I'll come back to that in a minute or so. 70 00:06:54,000 --> 00:06:59,000 Then, a further major event which you may remember is that oxygenic 71 00:06:59,000 --> 00:07:04,000 photosynthesis actually evolved -- -- which means that cyanobacteria 72 00:07:04,000 --> 00:07:08,000 evolved that started to produce oxygen as a byproduct of 73 00:07:08,000 --> 00:07:12,000 photosynthesis. And that really fundamentally 74 00:07:12,000 --> 00:07:16,000 changed the chemistry of the Earth. It actually became an oxidizing 75 00:07:16,000 --> 00:07:20,000 atmosphere. And what you see here is, once the oxygen concentration 76 00:07:20,000 --> 00:07:24,000 went over a certain level, it allowed the development of an 77 00:07:24,000 --> 00:07:28,000 ozone shield. Now, what does that mean? 78 00:07:28,000 --> 00:07:33,000 What was the critical significance of the presence of an ozone shield? 79 00:07:33,000 --> 00:07:38,000 Does anybody know? What does it block out? Anybody remember that? 80 00:07:38,000 --> 00:07:43,000 What's the big significance of the ozone hole over Antarctica for 81 00:07:43,000 --> 00:07:48,000 example? 
It allows UV radiation to hit the Earth's surface, 82 00:07:48,000 --> 00:07:53,000 and in fact if there were no ozone, the UV radiation would be so strong 83 00:07:53,000 --> 00:07:59,000 that there would be no life possible on land. 84 00:07:59,000 --> 00:08:03,000 So, once the ozone shield actually developed, organisms could conquer, 85 00:08:03,000 --> 00:08:08,000 basically, the land surface and settle on the land surface. 86 00:08:08,000 --> 00:08:13,000 And this, then, is thought to be at least correlated with the 87 00:08:13,000 --> 00:08:18,000 development of endosymbiosis. And I'll explain what I mean by 88 00:08:18,000 --> 00:08:22,000 that. But it basically led to the origin of modern eukaryotes, 89 00:08:22,000 --> 00:08:27,000 so your ancestors essentially. But there was still a long time, 90 00:08:27,000 --> 00:08:33,000 obviously, until humans appeared. We have here the origin of animals 91 00:08:33,000 --> 00:08:39,000 and metazoans, and then the age of the dinosaurs is 92 00:08:39,000 --> 00:08:45,000 already a very small blip here on this graph. And humans don't even 93 00:08:45,000 --> 00:08:51,000 get featured on that because we are so recent. So, 94 00:08:51,000 --> 00:08:57,000 but what I want to show you here is that three major lineages 95 00:08:57,000 --> 00:09:05,000 evolved early on. These are the bacteria, 96 00:09:05,000 --> 00:09:15,000 archaea, and what we call a nuclear lineage. And the significance of 97 00:09:15,000 --> 00:09:25,000 this nuclear lineage is that it basically combined with bacteria to 98 00:09:25,000 --> 00:09:35,000 form the modern eukaryotic cell. So, the eukarya, or eukaryotes 99 00:09:35,000 --> 00:09:50,000 they're also called. And it was this combination that we 100 00:09:50,000 --> 00:10:02,000 call the endosymbiosis event. 
I want to explain this a little bit 101 00:10:02,000 --> 00:10:07,000 more, and then I'll show you finally why we actually know that those 102 00:10:07,000 --> 00:10:12,000 things are very likely to have occurred a long time ago. 103 00:10:12,000 --> 00:10:17,000 Yes? It means the bacteria and the nuclear lineages combined to form a 104 00:10:17,000 --> 00:10:22,000 eukaryote, OK? And I'm actually going to explain 105 00:10:22,000 --> 00:10:27,000 this on the slide here. So, if you have any more questions 106 00:10:27,000 --> 00:10:32,000 after that, please let me know. So, again, this shows you this early 107 00:10:32,000 --> 00:10:38,000 evolution, this early split into archaea, bacteria, 108 00:10:38,000 --> 00:10:44,000 and this sort of nuclear line. It is thought that this nuclear 109 00:10:44,000 --> 00:10:50,000 line, these were single-celled organisms that increased in cell 110 00:10:50,000 --> 00:10:56,000 size, and then developed or partitioned the DNA into a nucleus, 111 00:10:56,000 --> 00:11:02,000 basically. So exactly how you find it in modern eukaryotic cells. 112 00:11:02,000 --> 00:11:07,000 But then what happened is the cell took up a bacterial cell, 113 00:11:07,000 --> 00:11:12,000 and over time this bacterial cell became a symbiont. 114 00:11:12,000 --> 00:11:17,000 In fact it became the mitochondrion. And so what this mitochondrion now 115 00:11:17,000 --> 00:11:22,000 does in the modern eukaryotic cell, as you all know, is it really took 116 00:11:22,000 --> 00:11:27,000 over the energy metabolism. So, the proto-eukaryotic cell took 117 00:11:27,000 --> 00:11:33,000 up a heterotrophic bacterium that formed the mitochondria. 118 00:11:33,000 --> 00:11:37,000 And this ultimately then gave rise to protozoa and to modern-day 119 00:11:37,000 --> 00:11:42,000 animals. But there was a secondary symbiotic event. 
120 00:11:42,000 --> 00:11:46,000 This cell, once it had taken up a heterotrophic bacterium, 121 00:11:46,000 --> 00:11:51,000 it took up an autotrophic bacterium, a cyanobacterium, an oxygenic 122 00:11:51,000 --> 00:11:55,000 photosynthesizer. And this actually led to the 123 00:11:55,000 --> 00:12:00,000 development of modern algae and modern plants. 124 00:12:00,000 --> 00:12:08,000 So what we can say is that mitochondria are ancient 125 00:12:08,000 --> 00:12:24,000 heterotrophic bacteria -- 126 00:12:24,000 --> 00:12:36,000 And the chloroplasts are ancient cyanobacteria, 127 00:12:36,000 --> 00:12:48,000 so, oxygenic, photosynthetic bacteria. And these obviously have 128 00:12:48,000 --> 00:13:00,000 coevolved to then form animals and finally your plants. 129 00:13:00,000 --> 00:13:06,000 So now, obviously we are talking here about events that happened a 130 00:13:06,000 --> 00:13:13,000 very, very long time ago. And so, the big question is really 131 00:13:13,000 --> 00:13:19,000 how do we really know this? But this takes me to the third 132 00:13:19,000 --> 00:13:26,000 topic, which is that of molecular evolution. So, we can state 133 00:13:26,000 --> 00:13:34,000 the problem again, and that is, very simply put, 134 00:13:34,000 --> 00:13:42,000 evolution is incredibly slow, OK? And therefore, its processes 135 00:13:42,000 --> 00:14:01,000 are not directly observable. 136 00:14:01,000 --> 00:14:05,000 And we need to actually use inference techniques to reconstruct 137 00:14:05,000 --> 00:14:10,000 evolutionary processes. Now, what do we use when we want to 138 00:14:10,000 --> 00:14:15,000 reconstruct the evolutionary history of animals and plants usually? 139 00:14:15,000 --> 00:14:20,000 Anybody? Fossils. Exactly. So you take a shovel, 140 00:14:20,000 --> 00:14:25,000 essentially, and dig down into the different layers. 
141 00:14:25,000 --> 00:14:30,000 And there are different techniques with which you can actually determine the 142 00:14:30,000 --> 00:14:34,000 age of different sedimentary rocks, for example. And then you can 143 00:14:34,000 --> 00:14:38,000 reconstruct, if you're lucky and you find enough fossils of a 144 00:14:38,000 --> 00:14:42,000 particular lineage, you can reconstruct the evolution 145 00:14:42,000 --> 00:14:45,000 of the lineage. I'm sure you all have seen the 146 00:14:45,000 --> 00:14:49,000 example of the horse, for example, where we have actually 147 00:14:49,000 --> 00:14:53,000 quite good evidence of what ancient horses looked like. 148 00:14:53,000 --> 00:14:57,000 And we can reconstruct the sequence of events that led to the evolution 149 00:14:57,000 --> 00:14:59,000 of modern-day horses. Now, you can imagine, 150 00:14:59,000 --> 00:14:59,000 though, that when we talk about such ancient events like these there really is no fossil record. OK, so what people have figured out, then, and that was really a stroke of genius that came about in the late 60s, is that DNA molecules can act as evolutionary chronometers. 151 00:15:00,000 --> 00:15:44,000 OK, now what do I mean by that? 152 00:15:44,000 --> 00:15:48,000 I mean that you can take DNA sequences or gene sequences from 153 00:15:48,000 --> 00:15:53,000 different kinds of organisms. Based on those gene sequences you 154 00:15:53,000 --> 00:15:58,000 can reconstruct their relationships to each other. You can determine 155 00:15:58,000 --> 00:16:02,000 whether two organisms are closely related or whether they are only 156 00:16:02,000 --> 00:16:14,000 very distantly related. And the underlying mechanism of that 157 00:16:14,000 --> 00:16:33,000 is that mutations happen with a certain probability all the time. 158 00:16:33,000 --> 00:16:41,000 So, the idea is that as time passes, DNA molecules will change. 
159 00:16:41,000 --> 00:16:50,000 So they will accumulate, actually, mutations, 160 00:16:50,000 --> 00:16:59,000 and the idea is that the amount of change in a particular DNA 161 00:16:59,000 --> 00:17:08,000 sequence is proportional to the time of separate evolution of two 162 00:17:08,000 --> 00:17:17,000 different lineages or two different organisms. 163 00:17:17,000 --> 00:17:26,000 So, the amount is more or less proportional -- 164 00:17:26,000 --> 00:17:38,000 -- to time since the last 165 00:17:38,000 --> 00:17:54,000 common ancestor. 166 00:17:54,000 --> 00:18:05,000 So, let me explain how this is actually done. 167 00:18:05,000 --> 00:18:16,000 What you really need in order to do this, is you need genes that are 168 00:18:16,000 --> 00:18:27,000 related to each other, OK? So, genes, they need to be 169 00:18:27,000 --> 00:18:34,000 universally distributed. That means all organisms that you 170 00:18:34,000 --> 00:18:37,000 want to compare need to have this type of gene. And, 171 00:18:37,000 --> 00:18:41,000 those genes need to have conserved function. 172 00:18:41,000 --> 00:18:52,000 And these genes, 173 00:18:52,000 --> 00:18:57,000 we can then compare to each other, and I will explain how this is 174 00:18:57,000 --> 00:19:02,000 actually done. Any questions so far? 175 00:19:02,000 --> 00:19:06,000 OK, so the example that I actually want to bring is the 16S 176 00:19:06,000 --> 00:19:26,000 ribosomal RNA genes. 177 00:19:26,000 --> 00:19:35,000 We oftentimes abbreviate this rRNA. Now, does anybody remember what the 178 00:19:35,000 --> 00:19:44,000 ribosomal RNAs are and do? What's the ribosome? Yes? 179 00:19:44,000 --> 00:19:53,000 Right, and what does it do? Exactly, it's the location where 180 00:19:53,000 --> 00:20:02,000 messenger RNA is translated into protein. 181 00:20:02,000 --> 00:20:06,000 Now, the ribosomal RNAs are an integral part of the ribosome. 
182 00:20:06,000 --> 00:20:10,000 They play both a catalytic role as well as a structural role in the 183 00:20:10,000 --> 00:20:14,000 ribosome. And so, fundamentally, because this is such 184 00:20:14,000 --> 00:20:18,000 a fundamental organelle, all living organisms possess it. 185 00:20:18,000 --> 00:20:22,000 So, all organisms have it. So this allows us to use these genes to 186 00:20:22,000 --> 00:20:26,000 really compare all living organisms to each other. 187 00:20:26,000 --> 00:20:30,000 OK, so this is a very important point. 188 00:20:30,000 --> 00:20:34,000 I wanted to show you a, OK, if it wakes up. There we go. 189 00:20:34,000 --> 00:20:39,000 An example of these ribosomal RNA genes, now this is actually, 190 00:20:39,000 --> 00:20:43,000 what you see here is a secondary structure of the actual RNA, 191 00:20:43,000 --> 00:20:48,000 the ribosomal RNA. Now, these molecules have a secondary structure 192 00:20:48,000 --> 00:20:52,000 because they play a catalytic and structural role. 193 00:20:52,000 --> 00:20:57,000 And so, the really amazing thing is when you look at the structure, 194 00:20:57,000 --> 00:21:01,000 the structure determines really the function of those molecules in 195 00:21:01,000 --> 00:21:06,000 different organisms. And then look at this. 196 00:21:06,000 --> 00:21:10,000 We have here a bacterium, and here an archaeon. Now, 197 00:21:10,000 --> 00:21:14,000 if you think back to the first couple of slides, 198 00:21:14,000 --> 00:21:18,000 what I showed you is that those organisms have not shared a common 199 00:21:18,000 --> 00:21:22,000 evolutionary history for about four, or so, billion years, or 3 billion 200 00:21:22,000 --> 00:21:26,000 years, excuse me. But, if you just glance very 201 00:21:26,000 --> 00:21:30,000 quickly at the structures, you see that they look very similar 202 00:21:30,000 --> 00:21:34,000 to each other. 
So, there's an indication that the 203 00:21:34,000 --> 00:21:38,000 function of those molecules is really very highly conserved. 204 00:21:38,000 --> 00:21:42,000 However, when you actually look at the sequences in detail, 205 00:21:42,000 --> 00:21:46,000 what you'll find is that there are different regions. 206 00:21:46,000 --> 00:21:50,000 And I've given some examples here denoted by A, B, 207 00:21:50,000 --> 00:21:54,000 C in those molecules. And these different regions of the 208 00:21:54,000 --> 00:21:58,000 molecules are really the key to its usefulness in figuring out the 209 00:21:58,000 --> 00:22:02,000 evolution and ecology of many organisms. 210 00:22:02,000 --> 00:22:06,000 The regions here denoted by A are sequence 211 00:22:06,000 --> 00:22:10,000 stretches that are the same in all living organisms. 212 00:22:10,000 --> 00:22:14,000 So they are universally conserved, which means that if you get a 213 00:22:14,000 --> 00:22:19,000 mutation in a gene in that particular region, 214 00:22:19,000 --> 00:22:23,000 you are dead. OK, that's why it's conserved essentially. 215 00:22:23,000 --> 00:22:27,000 Then we have those regions B where the length is conserved, 216 00:22:27,000 --> 00:22:32,000 but the sequence is not. So, there are sequence changes 217 00:22:32,000 --> 00:22:36,000 allowed, but the length needs to be conserved. And then there are the 218 00:22:36,000 --> 00:22:40,000 regions C where neither length nor sequence is actually conserved, 219 00:22:40,000 --> 00:22:44,000 and where we get a lot of variation. So, let me write this down. We 220 00:22:44,000 --> 00:22:49,000 have three types of sequence stretches. 221 00:22:49,000 --> 00:23:05,000 We have A, what I called the 222 00:23:05,000 --> 00:23:16,000 universally conserved sequences. We have B where length, but not 223 00:23:16,000 --> 00:23:27,000 sequence is conserved. And, we have C where neither length 224 00:23:27,000 --> 00:23:42,000 nor sequence is actually conserved. 
225 00:23:42,000 --> 00:23:48,000 And the first two stretches, the first two types of sequence 226 00:23:48,000 --> 00:23:55,000 stretches, are very important in figuring out the phylogeny or the 227 00:23:55,000 --> 00:24:01,000 evolutionary relationships amongst organisms. Whereas the sequence 228 00:24:01,000 --> 00:24:08,000 stretches C, because they vary so dramatically, 229 00:24:08,000 --> 00:24:15,000 are very important in identifying organisms. 230 00:24:15,000 --> 00:24:19,000 And we'll talk more about this actually next time. 231 00:24:19,000 --> 00:24:24,000 So what can we actually now do with those sequences? 232 00:24:24,000 --> 00:24:29,000 Well, the first step is we need to generate an alignment. 233 00:24:29,000 --> 00:24:51,000 OK, and this is actually shown here, 234 00:24:51,000 --> 00:24:55,000 where each row denotes a gene from a particular organism. 235 00:24:55,000 --> 00:25:00,000 OK, so these are all abbreviated here. 236 00:25:00,000 --> 00:25:04,000 These actually aren't ribosomal RNA genes, but other genes. 237 00:25:04,000 --> 00:25:09,000 And what you will see here is that we can recognize those three 238 00:25:09,000 --> 00:25:13,000 different regions that I've pointed out before. You have the regions A 239 00:25:13,000 --> 00:25:18,000 which tell you which nucleotides line up with each other, 240 00:25:18,000 --> 00:25:22,000 so you use this sort of as an anchor because the sequences never vary 241 00:25:22,000 --> 00:25:27,000 amongst organisms. And then the sequence regions B 242 00:25:27,000 --> 00:25:31,000 where you line up sequences that vary, or stretches that vary in 243 00:25:31,000 --> 00:25:36,000 sequence but not in length. Now, why is this important? 
244 00:25:36,000 --> 00:25:41,000 It's important because you have in each column nucleotides that 245 00:25:41,000 --> 00:25:47,000 have originated from a common ancestral nucleotide, 246 00:25:47,000 --> 00:25:52,000 and whose variation over time you can actually monitor. 247 00:25:52,000 --> 00:25:58,000 Is everybody with that? Any questions? OK, great. 248 00:25:58,000 --> 00:26:02,000 The second step, then, is the calculation of a 249 00:26:02,000 --> 00:26:16,000 similarity. 250 00:26:16,000 --> 00:26:20,000 And this is shown here. Again, we have a very simplified 251 00:26:20,000 --> 00:26:24,000 alignment now of four different organisms. Here, 252 00:26:24,000 --> 00:26:29,000 we have the sequences that we want to compare. And what you'll see is 253 00:26:29,000 --> 00:26:33,000 that they're overall very similar, but there are different sort of 254 00:26:33,000 --> 00:26:38,000 nucleotides. And so, what we simply do is for 255 00:26:38,000 --> 00:26:43,000 each pair of sequences, we calculate the sequence similarity 256 00:26:43,000 --> 00:26:48,000 value. So, what you see is that you have 12 nucleotides, 257 00:26:48,000 --> 00:26:52,000 and the first pair differs in three nucleotides. OK, 258 00:26:52,000 --> 00:26:57,000 so that tells us, or it's called actually a distance 259 00:26:57,000 --> 00:27:01,000 here, I'm sorry. Let me write this down here. 260 00:27:01,000 --> 00:27:15,000 It's simply one minus the similarity, 261 00:27:15,000 --> 00:27:21,000 of course, but so basically a quarter of the nucleotides differ, 262 00:27:21,000 --> 00:27:27,000 whereas between A and C, a third of the nucleotides 263 00:27:27,000 --> 00:27:33,000 differ, and so on. OK, so you do this for each pair of 264 00:27:33,000 --> 00:27:40,000 sequences, excuse me. The third step, 265 00:27:40,000 --> 00:27:49,000 then, is to calculate the correction for multiple mutations affecting the 266 00:27:49,000 --> 00:28:08,000 same nucleotides. 
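The distance calculation from the second step can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not code from the lecture: the function name `observed_distance` and the two 12-nucleotide example sequences are hypothetical, chosen so that they differ at three positions like the first pair on the slide.

```python
def observed_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned positions at which two sequences differ.

    This is one minus the sequence similarity described in the lecture.
    Both sequences must already be aligned to the same length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)


# Hypothetical aligned sequences: 12 nucleotides, 3 mismatches.
org_a = "CGUAGACCUGAC"
org_b = "CCUAGAGCUGGC"

print(observed_distance(org_a, org_b))  # 0.25
```

Repeating this for every pair of sequences gives the table of observed distances that the later steps start from.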
267 00:28:08,000 --> 00:28:12,000 Now, you can imagine that over time there's a probability that a 268 00:28:12,000 --> 00:28:16,000 particular nucleotide mutates, say, twice. So, in the first 269 00:28:16,000 --> 00:28:20,000 instance it may change from an A to a G, but then it changes to a C. 270 00:28:20,000 --> 00:28:24,000 But when you look at the modern-day sequences, you don't know that this 271 00:28:24,000 --> 00:28:28,000 actually happened. And so there are ways to 272 00:28:28,000 --> 00:28:32,000 statistically estimate what the likelihood is that a sequence 273 00:28:32,000 --> 00:28:37,000 actually contains such multiple events. 274 00:28:37,000 --> 00:28:41,000 OK, and this we call a corrected evolutionary distance 275 00:28:41,000 --> 00:28:46,000 then. And what you will note is that the corrected evolutionary 276 00:28:46,000 --> 00:28:51,000 distance is invariably larger than the actual observed one. 277 00:28:51,000 --> 00:28:56,000 Now, what can we do with those distances? We can constrain them 278 00:28:56,000 --> 00:29:01,000 into a best fit tree of relationships. 279 00:29:01,000 --> 00:29:07,000 So, we can draw what we call a best fit tree. 280 00:29:07,000 --> 00:29:14,000 That's shown here. We have our four organisms, 281 00:29:14,000 --> 00:29:20,000 but when you look at those branches of the tree what you'll see is that 282 00:29:20,000 --> 00:29:27,000 they add up roughly to the corrected evolutionary distance here. 283 00:29:27,000 --> 00:29:32,000 So, between A and B we have 0. 3 and 0.08, which roughly gives you 284 00:29:32,000 --> 00:29:37,000 0.3 here, OK, whereas between A and C the tree is constrained such that we 285 00:29:37,000 --> 00:29:42,000 have 0.31, and here 0. 5, and so overall you roughly get 286 00:29:42,000 --> 00:29:48,000 the distance here that we have calculated. 
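The lecture does not name the statistical correction behind the corrected evolutionary distance. One common textbook choice is the Jukes-Cantor model, and the sketch below assumes it purely for illustration; the function name `jukes_cantor_distance` is hypothetical. Under that assumption an observed fraction of differing sites p is corrected to d = -(3/4) ln(1 - (4/3) p).

```python
import math


def jukes_cantor_distance(p: float) -> float:
    """Corrected distance under the (assumed) Jukes-Cantor model.

    p is the observed fraction of differing sites. The formula is only
    defined for p below 0.75, the level expected for unrelated sequences.
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


# Observed distances mentioned in the lecture: a quarter and a third.
for p in (0.25, 1 / 3):
    print(f"observed {p:.2f} -> corrected {jukes_cantor_distance(p):.2f}")
```

With these inputs the corrected values come out at about 0.30 and 0.44, so the correction always pushes the distance upward, which matches the remark that the corrected distance is larger than the observed one.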
And so what this means 287 00:29:48,000 --> 00:29:53,000 is that you have ordered the organisms by their calculated evolutionary 288 00:29:53,000 --> 00:29:58,000 distance. And so you have now obtained, actually, 289 00:29:58,000 --> 00:30:04,000 a very intuitive picture of the relationship of organisms to each 290 00:30:04,000 --> 00:30:09,000 other where A and B are obviously the most closely related ones, 291 00:30:09,000 --> 00:30:15,000 and A and D are the most distantly related. 292 00:30:15,000 --> 00:30:23,000 Is everybody with it? Any questions? OK, now, 293 00:30:23,000 --> 00:30:31,000 this best fit tree is what we call a phylogeny. 294 00:30:31,000 --> 00:30:52,000 Now, excuse me, 295 00:30:52,000 --> 00:31:00,000 these techniques really revolutionized the study of 296 00:31:00,000 --> 00:31:08,000 evolutionary relationships, and one of the things that they 297 00:31:08,000 --> 00:31:16,000 allowed us to do is to construct universal phylogenetic trees or what 298 00:31:16,000 --> 00:31:23,000 we can also call the tree of life. And I will show you this on the next 299 00:31:23,000 --> 00:31:30,000 slide, and then I want to make a few general statements about this. 300 00:31:30,000 --> 00:31:37,000 So first of all, when you analyze all known organisms, 301 00:31:37,000 --> 00:31:45,000 and obviously that would be a big task, but representatives of all 302 00:31:45,000 --> 00:31:52,000 known organisms, what you'll find is that, 303 00:31:52,000 --> 00:32:00,000 indeed, we have three major lineages: the bacteria, 304 00:32:00,000 --> 00:32:07,000 the archaea, and the eukarya. OK, so we have what we call three 305 00:32:07,000 --> 00:32:15,000 domains of life: the archaea, bacteria, and the eukarya. 306 00:32:15,000 --> 00:32:20,000 So, this really is the evidence that life really split very, 307 00:32:20,000 --> 00:32:26,000 very early on into those three lineages that I showed you before. 
308 00:32:26,000 --> 00:32:32,000 Interestingly, two of those major domains here are 309 00:32:32,000 --> 00:32:39,000 prokaryotic, OK? So, two of the domains are 310 00:32:39,000 --> 00:32:46,000 prokaryotes. Moreover, if you actually look at the types of 311 00:32:46,000 --> 00:32:53,000 organisms that are on here, you'll notice that even on the 312 00:32:53,000 --> 00:33:00,000 eukaryotic side of the tree, most of the organisms here are 313 00:33:00,000 --> 00:33:07,000 actually microbial. So, single-celled organisms, and 314 00:33:07,000 --> 00:33:14,000 that means that most of the life on the planet is microbial. 315 00:33:14,000 --> 00:33:21,000 The vast diversity of organisms on the planet are microorganisms. 316 00:33:21,000 --> 00:33:29,000 So, we can say that most life is microbial. 317 00:33:29,000 --> 00:33:34,000 And when you, then, look at analyses of mitochondria 318 00:33:34,000 --> 00:33:39,000 and chloroplasts, which all have their own genetic machinery, 319 00:33:39,000 --> 00:33:44,000 and therefore also their own ribosomes, you'll see that the 320 00:33:44,000 --> 00:33:49,000 mitochondrion, OK, and the chloroplasts both tree 321 00:33:49,000 --> 00:33:54,000 within the bacteria. So, we really have an amazing 322 00:33:54,000 --> 00:33:59,000 confirmation of this endosymbiont theory, which was actually developed in 323 00:33:59,000 --> 00:34:04,000 the absence of gene sequences by some Russian scientists in the early 324 00:34:04,000 --> 00:34:13,000 20th century. So, we have that mitochondria and 325 00:34:13,000 --> 00:34:27,000 chloroplasts tree within bacteria, and this really supports the 326 00:34:27,000 --> 00:34:36,000 endosymbiont theory. So really, you could say eukaryotes 327 00:34:36,000 --> 00:34:42,000 are really just walking, and swimming, and flying incubators 328 00:34:42,000 --> 00:34:48,000 for bacteria, right? So, just hosts for microorganisms. 
329 00:34:48,000 --> 00:34:54,000 OK, so basically you can, what you should take home from this is the 330 00:34:54,000 --> 00:35:00,000 three domains of life. Two are prokaryotic, and even more 331 00:35:00,000 --> 00:35:06,000 so most of the diversity that we find is actually microbial, 332 00:35:06,000 --> 00:35:12,000 and then finally the endosymbiont theory is actually confirmed by 333 00:35:12,000 --> 00:35:17,000 those phylogenies. Now, what I want to cover in the 334 00:35:17,000 --> 00:35:22,000 remaining time is how we can actually now use those 335 00:35:22,000 --> 00:35:27,000 sequences to learn something about organisms in the environment. 336 00:35:27,000 --> 00:35:32,000 That's the topic of molecular ecology. 337 00:35:32,000 --> 00:35:43,000 To introduce this, 338 00:35:43,000 --> 00:35:47,000 I just want to show you a couple of slides that really sort of capture 339 00:35:47,000 --> 00:35:51,000 what the big problem is that we're facing here. Now, 340 00:35:51,000 --> 00:35:55,000 when we look at the abundance of prokaryotic cells in different types 341 00:35:55,000 --> 00:35:59,000 of environments, what we see is that there is an 342 00:35:59,000 --> 00:36:04,000 enormous number of different prokaryotes out there. 343 00:36:04,000 --> 00:36:08,000 This summarizes, here, different types of 344 00:36:08,000 --> 00:36:12,000 environments. We have the marine environment, freshwater environment, 345 00:36:12,000 --> 00:36:16,000 sediment and soils, subsurface sediments and animal guts. 346 00:36:16,000 --> 00:36:20,000 And this number here gives you the average number of prokaryotic 347 00:36:20,000 --> 00:36:24,000 cells either per milliliter or per gram. And here we have the total 348 00:36:24,000 --> 00:36:28,000 number of cells obtained by multiplying the average number with 349 00:36:28,000 --> 00:36:33,000 the total volume of the particular environment. 
350 00:36:33,000 --> 00:36:37,000 So what you can see is that in the marine environment, 351 00:36:37,000 --> 00:36:41,000 we have on average half a million cells per milliliter of water, 352 00:36:41,000 --> 00:36:45,000 OK? In freshwater, we have about a million cells. 353 00:36:45,000 --> 00:36:49,000 What is that telling you? There's a ton of prokaryotes out 354 00:36:49,000 --> 00:36:53,000 there. When you go swimming, you take a little gulp of water: 355 00:36:53,000 --> 00:36:57,000 you've probably eaten several million prokaryotes, 356 00:36:57,000 --> 00:37:01,000 but it's nothing to worry about because what this also tells us is 357 00:37:01,000 --> 00:37:05,000 that very, very few prokaryotes out there are really pathogens, because 358 00:37:05,000 --> 00:37:09,000 otherwise you'd be sick all the time. 359 00:37:09,000 --> 00:37:15,000 Now, in sediments and soils, in as little as a gram you have five 360 00:37:15,000 --> 00:37:22,000 times 10^9 prokaryotic cells. Almost 5 billion prokaryotic cells are out 361 00:37:22,000 --> 00:37:29,000 there, and even in very, very deep sediments that reach down 362 00:37:29,000 --> 00:37:36,000 to 3,000 m, you have a substantial number of prokaryotic cells. 363 00:37:36,000 --> 00:37:40,000 Well, and here's your guts, 10^5 times 10^6 gives you 10^11 per 364 00:37:40,000 --> 00:37:45,000 gram. So again, you're just a walking incubator for 365 00:37:45,000 --> 00:37:50,000 a very complex microbial community. Here's the global abundance. You 366 00:37:50,000 --> 00:37:55,000 see that deep subsurface sediments and the marine environment are 367 00:37:55,000 --> 00:38:00,000 probably, in terms of numbers at least, the most important 368 00:38:00,000 --> 00:38:05,000 microbial environments. Now, faced with this enormous 369 00:38:05,000 --> 00:38:09,000 abundance of prokaryotes out there, a very important question is how many 370 00:38:09,000 --> 00:38:14,000 of them are out there? 
Or, how diverse are prokaryotes in 371 00:38:14,000 --> 00:38:18,000 the environment? That's important if you want to 372 00:38:18,000 --> 00:38:23,000 figure out their function in the environment, and want to understand 373 00:38:23,000 --> 00:38:27,000 also their evolution. And what I want to show you here is 374 00:38:27,000 --> 00:38:32,000 that we've gone through an amazing development in our understanding of 375 00:38:32,000 --> 00:38:36,000 prokaryotic diversity in the environment over the last 376 00:38:36,000 --> 00:38:42,000 10 to 15 years or so. Who knows about E. 377 00:38:42,000 --> 00:38:48,000 O. Wilson here? One person? So, he wrote a very famous book on 378 00:38:48,000 --> 00:38:54,000 biodiversity, which was published in 1988, where he tried to summarize, 379 00:38:54,000 --> 00:39:00,000 really, how diverse the known organisms are on the planet, and also 380 00:39:00,000 --> 00:39:06,000 tried to extrapolate to the total diversity. 381 00:39:06,000 --> 00:39:10,000 And what you see is that he came up with about 1.4 million different 382 00:39:10,000 --> 00:39:14,000 species here, mostly dominated by insects. That's the big section 383 00:39:14,000 --> 00:39:19,000 here on this pie chart. The plants: very important. 384 00:39:19,000 --> 00:39:23,000 And if you look, the prokaryotes feature with about 3, 385 00:39:23,000 --> 00:39:27,000 00 different species. So, in 1988 we thought there were very 386 00:39:27,000 --> 00:39:32,000 few prokaryotic species out there. If you look about 10 years into the 387 00:39:32,000 --> 00:39:36,000 future and take the assessment here, and this just exemplifies how the 388 00:39:36,000 --> 00:39:41,000 thinking has changed, you see that we think now that there 389 00:39:41,000 --> 00:39:45,000 are about 11 million different species out there, 390 00:39:45,000 --> 00:39:50,000 and that the vast majority of them are prokaryotic, 391 00:39:50,000 --> 00:39:54,000 OK, 10 million. 
So, this big part of the pie chart is 392 00:39:54,000 --> 00:39:59,000 really the prokaryotic diversity. Now, what really has changed is 393 00:39:59,000 --> 00:40:03,000 that we've actually started to use molecular techniques to determine 394 00:40:03,000 --> 00:40:08,000 the diversity of prokaryotes in the environment. 395 00:40:08,000 --> 00:40:18,000 So molecular ecology is really the use of gene sequences 396 00:40:18,000 --> 00:40:29,000 obtained directly from the environment -- 397 00:40:29,000 --> 00:40:42,000 -- to learn about the 398 00:40:42,000 --> 00:40:54,000 prokaryotic -- 399 00:40:54,000 --> 00:40:58,000 -- diversity out there. Now, this slide just quickly 400 00:40:58,000 --> 00:41:03,000 summarizes this. Basically, the idea is that you go 401 00:41:03,000 --> 00:41:08,000 out into the environment and collect either water or soil samples that, 402 00:41:08,000 --> 00:41:13,000 as I just showed you, invariably contain a lot of different 403 00:41:13,000 --> 00:41:17,000 prokaryotic cells. You then lyse the cells and purify 404 00:41:17,000 --> 00:41:22,000 their DNA, so that you end up with a mixture of DNA that 405 00:41:22,000 --> 00:41:27,000 represents the organisms out there, and then you can use universal PCR 406 00:41:27,000 --> 00:41:32,000 primers to actually amplify ribosomal RNA genes from all the 407 00:41:32,000 --> 00:41:37,000 organisms that are present in your samples. 408 00:41:37,000 --> 00:41:42,000 Now, why can you use universal PCR primers? Well, 409 00:41:42,000 --> 00:41:48,000 they target the regions number A that I showed you before. 410 00:41:48,000 --> 00:41:53,000 Those regions in the genes are invariant amongst all organisms. 411 00:41:53,000 --> 00:41:59,000 You guys all remember how PCR works, right? We covered this. 412 00:41:59,000 --> 00:42:04,000 OK? Yes? No? Who doesn't? You don't? All right, 413 00:42:04,000 --> 00:42:09,000 come to the board. Just kidding. OK, you should look it up. 
I don't 414 00:42:09,000 --> 00:42:15,000 have time to cover this, unfortunately, but basically it's a 415 00:42:15,000 --> 00:42:20,000 technique that allows you to amplify specific types of genes a million- to a 416 00:42:20,000 --> 00:42:25,000 billion-fold. And once you have done this, what you can do is 417 00:42:25,000 --> 00:42:31,000 purify the genes on gels, and then separate them by cloning 418 00:42:31,000 --> 00:42:36,000 them into individual plasmids. Those plasmids are 419 00:42:36,000 --> 00:42:41,000 inserted into E. coli cells, and the E. 420 00:42:41,000 --> 00:42:46,000 coli cells are then individually grown up so that each culture 421 00:42:46,000 --> 00:42:50,000 contains only a single plasmid, and you can then sequence these 422 00:42:50,000 --> 00:42:55,000 ribosomal DNAs or ribosomal RNA genes from those clones. 423 00:42:55,000 --> 00:43:00,000 And so, you have obtained a library of the ribosomal RNA genes 424 00:43:00,000 --> 00:43:08,000 from the environment. So, we use environmental ribosomal 425 00:43:08,000 --> 00:43:18,000 RNA gene libraries from which we then can actually compare how many 426 00:43:18,000 --> 00:43:28,000 different types of genes are out there. 427 00:43:28,000 --> 00:43:32,000 So let me show you an example of this. What we have done recently, 428 00:43:32,000 --> 00:43:37,000 we've gone out in one of the first really comprehensive samplings of 429 00:43:37,000 --> 00:43:42,000 coastal bacterioplankton, which means the bacteria that are 430 00:43:42,000 --> 00:43:47,000 present free-living in ocean water. And so, we've done this, we've 431 00:43:47,000 --> 00:43:52,000 collected all those clones, and then basically we constructed 432 00:43:52,000 --> 00:43:57,000 those phylogenetic trees that I showed you before that really allow 433 00:43:57,000 --> 00:44:02,000 us to see how many different types are out there, and how closely related 434 00:44:02,000 --> 00:44:07,000 they are to one another. 
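To make "how many different types are out there" concrete, here is a minimal sketch of one simple way to count sequence types in a clone library: greedy grouping by pairwise identity. This is not the phylogenetic tree-building described in the lecture, only a simpler stand-in for illustration. The sequences are made-up toy fragments assumed to be already aligned to equal length, and the function names and the identity threshold are assumptions chosen for this example.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def count_types(clones: list[str], threshold: float) -> int:
    """Greedy grouping: a clone joins the first representative it matches
    at >= threshold identity; otherwise it founds a new sequence type."""
    representatives: list[str] = []
    for seq in clones:
        if not any(identity(seq, rep) >= threshold for rep in representatives):
            representatives.append(seq)
    return len(representatives)


# HYPOTHETICAL toy clone library (made-up 20-base aligned fragments).
library = [
    "ACGTACGTACGTACGTACGT",
    "ACGTACGTACGTACGTACGA",  # one mismatch relative to the first clone
    "TTGTACCTAGGTACGAACGT",  # five mismatches relative to the first clone
    "ACGTACGTACGTACGTACGT",  # identical to the first clone
]

print(count_types(library, threshold=0.9))  # near-identical clones grouped
print(count_types(library, threshold=1.0))  # only exact duplicates grouped
```

The count depends on the threshold: a looser cutoff merges close relatives into one type, while requiring exact identity separates them, which is why any reported number of "types" is tied to the cutoff used.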
And what we found is that in this 435 00:44:07,000 --> 00:44:12,000 environment that you think might be very simple because it's just the 436 00:44:12,000 --> 00:44:17,000 water column, right? Not much structure in there. 437 00:44:17,000 --> 00:44:22,000 We found over 1500 bacterial 16S ribosomal RNA sequences to occur, 438 00:44:22,000 --> 00:44:27,000 so an enormous diversity of prokaryotes, of bacteria, in that 439 00:44:27,000 --> 00:44:32,000 particular environment. And the important point is that when 440 00:44:32,000 --> 00:44:36,000 you actually look at a collection of such studies like the one I just showed you, 441 00:44:36,000 --> 00:44:40,000 what you find is that the vast majority of microorganisms in the 442 00:44:40,000 --> 00:44:44,000 environment have never been cultured. So traditionally what we do, of 443 00:44:44,000 --> 00:44:49,000 course, to learn about microorganisms, when you grow E. 444 00:44:49,000 --> 00:44:53,000 coli or so, is you throw them onto culture plates. 445 00:44:53,000 --> 00:44:57,000 You make lots of different cells, and that allows you to study some of 446 00:44:57,000 --> 00:45:02,000 their properties. But when you look, 447 00:45:02,000 --> 00:45:06,000 for example, at results from the ocean, this summarizes now coastal 448 00:45:06,000 --> 00:45:10,000 and open ocean environments, again, the bacterioplankton is 449 00:45:10,000 --> 00:45:15,000 those free-floating bacterial cells in the water. 450 00:45:15,000 --> 00:45:19,000 And you compare this to what we've actually been able to culture from 451 00:45:19,000 --> 00:45:23,000 those environments. What you see is that you have some 452 00:45:23,000 --> 00:45:27,000 dominant groups here. They all have funny names, 453 00:45:27,000 --> 00:45:32,000 most of them, because they're just clones in clone libraries. 454 00:45:32,000 --> 00:45:36,000 But these are the dominant groups that show up in clone libraries. 
455 00:45:36,000 --> 00:45:40,000 Here's their relative representation in different clone 456 00:45:40,000 --> 00:45:44,000 libraries from a variety of environments. And so here you have 457 00:45:44,000 --> 00:45:48,000 one very important one, the SAR11 group, or this one, 458 00:45:48,000 --> 00:45:53,000 the SAR86, that always show up in clone libraries. 459 00:45:53,000 --> 00:45:57,000 But we've never seen them in culture, so the important point to realize 460 00:45:57,000 --> 00:46:01,000 here is that whenever we go out, 461 00:46:01,000 --> 00:46:05,000 we find a great diversity of bacteria out there, 462 00:46:05,000 --> 00:46:10,000 but we have no idea what they actually do. 463 00:46:10,000 --> 00:46:14,000 And this is one of the big questions that we need to answer to understand, 464 00:46:14,000 --> 00:46:18,000 really, how the planet actually works. What are those uncultured 465 00:46:18,000 --> 00:46:22,000 microorganisms out in the environment really doing, 466 00:46:22,000 --> 00:46:26,000 and what is their importance? And we'll talk about this next time. 467 00:46:26,000 --> 00:46:30,000 We're going to talk about environmental genomics because 468 00:46:30,000 --> 00:46:34,000 essentially what we can do now is, we have techniques available that 469 00:46:34,000 --> 00:46:38,000 allow us to isolate at least large fragments of the genomes, 470 00:46:38,000 --> 00:46:42,000 sequence those, and look at what kinds of genes they have present. 471 00:46:42,000 --> 00:46:46,000 And that allows us, then, to infer some of their 472 00:46:46,000 --> 00:46:51,000 function in the biogeochemical cycles in the environment. 473 00:46:51,000 --> 00:46:55,000 OK, so with this I'm going to close today unless you have 474 00:46:55,000 --> 00:46:58,000 any more questions.