1 00:00:00,000 --> 00:00:04,000 So what I want to do today is recap a little bit what we talked about 2 00:00:04,000 --> 00:00:08,000 last time, reiterate some of the important points, 3 00:00:08,000 --> 00:00:13,000 and then show you how we can learn something about microorganisms in 4 00:00:13,000 --> 00:00:17,000 the environment by talking about in-situ identification of 5 00:00:17,000 --> 00:00:21,000 microorganisms as well as genomics. We'll first talk about genomics in 6 00:00:21,000 --> 00:00:26,000 general and then talk about some applications of genomics to 7 00:00:26,000 --> 00:00:30,000 environmental microbiology because I think some of the most 8 00:00:30,000 --> 00:00:35,000 exciting new developments are in that area, actually. 9 00:00:35,000 --> 00:00:44,000 So last time we talked about molecular evolution and ecology. 10 00:00:44,000 --> 00:00:53,000 And just to recap, some of the main points were that we can actually use 11 00:00:53,000 --> 00:01:02,000 genes or gene sequences for a couple of very important questions that we 12 00:01:02,000 --> 00:01:15,000 want to explore. The first one was that gene sequences act 13 00:01:15,000 --> 00:01:33,000 as evolutionary chronometers. 14 00:01:33,000 --> 00:01:40,000 Now, what do I mean by that? Basically what we said last time 15 00:01:40,000 --> 00:01:48,000 was that each gene, each sequence in the genome 16 00:01:48,000 --> 00:01:55,000 accumulates mutations with a certain probability. So what we mean is 17 00:01:55,000 --> 00:02:03,000 that all genes accumulate mutations over time. 18 00:02:03,000 --> 00:02:07,000 Now, these of course are the mutations that do not kill the 19 00:02:07,000 --> 00:02:12,000 organisms, so not the deleterious mutations, but these are mutations 20 00:02:12,000 --> 00:02:16,000 that are either slightly deleterious, or don't matter, 21 00:02:16,000 --> 00:02:21,000 or are beneficial mutations.
OK, and what this entails is that 22 00:02:21,000 --> 00:02:25,000 each gene accumulates mutations with a certain probability over time. 23 00:02:25,000 --> 00:02:30,000 It basically means that two organisms that come from species 24 00:02:30,000 --> 00:02:34,000 that are relatively closely related to each other have gene sequences 25 00:02:34,000 --> 00:02:39,000 that will be much more similar to each other than genes from an 26 00:02:39,000 --> 00:02:44,000 organism that comes from a species that's much more distantly related. 27 00:02:44,000 --> 00:02:48,000 So, in practical terms what this means is your genes are much, 28 00:02:48,000 --> 00:02:52,000 much more similar to those of a monkey than they are to those of a crocodile, 29 00:02:52,000 --> 00:02:56,000 for example. And we can take advantage of that by applying some 30 00:02:56,000 --> 00:03:00,000 algorithms, some mathematical modeling essentially, 31 00:03:00,000 --> 00:03:04,000 to constrain these relationships in those phylogenetic trees that we 32 00:03:04,000 --> 00:03:09,000 talked about last time. And I also mentioned that the 33 00:03:09,000 --> 00:03:14,000 ribosomal RNA genes are particularly important for that process. 34 00:03:14,000 --> 00:03:19,000 In principle, you could do it with any protein-coding gene or any 35 00:03:19,000 --> 00:03:24,000 other kind of gene in the genome, but we use the ribosomal RNA genes 36 00:03:24,000 --> 00:03:29,000 in particular because all organisms have them. They're part of a 37 00:03:29,000 --> 00:03:34,000 handful of genes that are what we called universally distributed 38 00:03:34,000 --> 00:03:38,000 last time. And what this allows us to do is 39 00:03:38,000 --> 00:03:42,000 then construct phylogenetic relationships for all living 40 00:03:42,000 --> 00:03:46,000 organisms.
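The idea that closely related species have more similar gene sequences can be illustrated with a very small sketch. The sequences below are made up for illustration (they are not real ribosomal RNA data), the function name `percent_identity` is hypothetical, and real phylogenetic tree construction uses more elaborate models than a plain match count.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned positions at which two equal-length sequences match."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return matches / len(seq_a)


# Made-up aligned fragments, for illustration only.
human = "ACGTTGCAAGGCTA"
monkey = "ACGTTGCAAGGCTT"
crocodile = "ACGATGCTAGCCTT"

human_vs_monkey = percent_identity(human, monkey)
human_vs_crocodile = percent_identity(human, crocodile)

# The closer relative shows the higher identity in this toy data set.
print(f"human vs monkey:    {human_vs_monkey:.2f}")
print(f"human vs crocodile: {human_vs_crocodile:.2f}")
```

A tree-building algorithm would start from many such pairwise comparisons; this sketch only shows the raw similarity measure.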
And I just want to remind you of the tree of life that 41 00:03:46,000 --> 00:03:50,000 I showed you last time where we can really explore the relationships 42 00:03:50,000 --> 00:03:54,000 amongst all living organisms. And some of the important points 43 00:03:54,000 --> 00:03:58,000 there that we made were, for example, that the tree of life 44 00:03:58,000 --> 00:04:02,000 supports the endosymbiont theory, that when you actually look at 45 00:04:02,000 --> 00:04:06,000 where the mitochondria and the chloroplasts sit on the tree, they fall 46 00:04:06,000 --> 00:04:10,000 within the bacteria. Now, there was a question that 47 00:04:10,000 --> 00:04:15,000 somebody asked in the online survey: can the mitochondria and 48 00:04:15,000 --> 00:04:21,000 chloroplasts still live outside of the eukaryotic cell? 49 00:04:21,000 --> 00:04:26,000 And the answer is no, they can't anymore because over 50 00:04:26,000 --> 00:04:31,000 evolutionary time the two organisms have become so integrated that the 51 00:04:31,000 --> 00:04:37,000 mitochondria and chloroplasts both lost their ability to live outside 52 00:04:37,000 --> 00:04:42,000 of the eukaryotic host cell. Another important point that we made 53 00:04:42,000 --> 00:04:46,000 last time that I want to reiterate here is that gene sequences, 54 00:04:46,000 --> 00:04:51,000 when we go into the environment and obtain them directly from the 55 00:04:51,000 --> 00:04:55,000 environment, act as a proxy for microbial diversity in 56 00:04:55,000 --> 00:05:08,000 the environment. So, the number of genes recovered 57 00:05:08,000 --> 00:05:30,000 directly from the environment is a measure of diversity.
58 00:05:30,000 --> 00:05:34,000 And we said that this actually plays a very, very important role in the 59 00:05:34,000 --> 00:05:39,000 analysis of microbial communities, and I showed you the example here 60 00:05:39,000 --> 00:05:43,000 where we went and took some ocean water and basically applied this 61 00:05:43,000 --> 00:05:48,000 technique that I outlined last time where we can actually amplify 62 00:05:48,000 --> 00:05:53,000 ribosomal RNA genes from environmental samples, 63 00:05:53,000 --> 00:05:57,000 clone them, determine the sequence, and then construct phylogenetic 64 00:05:57,000 --> 00:06:01,000 trees. And what you see here is a tree 65 00:06:01,000 --> 00:06:05,000 where we summarize the major groups that we found in the sample, and it's 66 00:06:05,000 --> 00:06:09,000 only for two of those groups that we show the entire set of 67 00:06:09,000 --> 00:06:13,000 sequences that we actually obtained because there were so many of them 68 00:06:13,000 --> 00:06:16,000 out there. And what we basically found was that over 1500 bacterial 69 00:06:16,000 --> 00:06:20,000 16S ribosomal RNA gene sequences coexist in this environment. 70 00:06:20,000 --> 00:06:24,000 And what we said also last time was that analyses like these have 71 00:06:24,000 --> 00:06:28,000 really taught us that microorganisms are the most diverse organisms 72 00:06:28,000 --> 00:06:32,000 on the planet. So, most diversity is amongst the 73 00:06:32,000 --> 00:06:36,000 microorganisms, and one of the big questions now is 74 00:06:36,000 --> 00:06:40,000 what are all those microorganisms doing in the environment? 75 00:06:40,000 --> 00:06:44,000 And so, today what I want to do with you is basically explore this 76 00:06:44,000 --> 00:06:48,000 question of how we can actually figure out what those microorganisms 77 00:06:48,000 --> 00:06:52,000 are all doing in environmental samples.
78 00:06:52,000 --> 00:07:09,000 So we can say we are exploring the 79 00:07:09,000 --> 00:07:21,000 function of microbes in the environment. First, 80 00:07:21,000 --> 00:07:33,000 I want to cover how we can actually identify them in the environment. 81 00:07:33,000 --> 00:07:38,000 And I want to show you one specific example, and then I want to talk 82 00:07:38,000 --> 00:07:44,000 about genomics in general, and then basically end with an 83 00:07:44,000 --> 00:07:50,000 application of genomics to environmental questions. 84 00:07:50,000 --> 00:07:56,000 So, let's first talk about the in-situ identification 85 00:07:56,000 --> 00:08:14,000 of microorganisms. 86 00:08:14,000 --> 00:08:18,000 And the basic problem that I alluded to already before is that most 87 00:08:18,000 --> 00:08:23,000 microbes are only known 88 00:08:23,000 --> 00:08:33,000 -- from 16S ribosomal 89 00:08:33,000 --> 00:08:48,000 RNA clone libraries. 90 00:08:48,000 --> 00:09:03,000 And we basically want to search and identify them in the environment. 91 00:09:03,000 --> 00:09:07,000 OK, and I'll show you a specific example of that later on. 92 00:09:07,000 --> 00:09:12,000 Now last time, we said that the ribosomal RNA sequences, really like 93 00:09:12,000 --> 00:09:17,000 all gene sequences, in fact, consist of several 94 00:09:17,000 --> 00:09:22,000 types of nucleotide stretches that can be 95 00:09:22,000 --> 00:09:26,000 found. We said there are the A type stretches and B type stretches that are very 96 00:09:26,000 --> 00:09:31,000 important for construction of phylogenetic relationships, 97 00:09:31,000 --> 00:09:36,000 because we can align them and look for changes in the nucleotide 98 00:09:36,000 --> 00:09:41,000 sequences because they are the same length and only differ by mutations, 99 00:09:41,000 --> 00:09:46,000 that is, single base pair changes.
100 00:09:46,000 --> 00:09:50,000 But then there are also those C type stretches, if you remember, 101 00:09:50,000 --> 00:09:55,000 and those we said vary at much faster rates because they are not 102 00:09:55,000 --> 00:10:00,000 functionally constrained in those genes. 103 00:10:00,000 --> 00:10:08,000 OK, so they can actually also accumulate length changes. 104 00:10:08,000 --> 00:10:17,000 And, it's these C type stretches that we can use sort of as 105 00:10:17,000 --> 00:10:25,000 diagnostic sequence stretches for microorganisms. 106 00:10:25,000 --> 00:10:34,000 So, what we can say is we identify organisms by the C type stretches, 107 00:10:34,000 --> 00:10:47,000 C type sequence stretches. And we call those signature 108 00:10:47,000 --> 00:11:03,000 sequences. OK, and they allow the differentiation 109 00:11:03,000 --> 00:11:16,000 of closely related organisms, -- because they vary at very fast 110 00:11:16,000 --> 00:11:24,000 rates between organisms. And the way we do this is that we 111 00:11:24,000 --> 00:11:32,000 construct so-called phylogenetic probes. I should probably 112 00:11:32,000 --> 00:11:38,000 write this over here. Now what are those phylogenetic 113 00:11:38,000 --> 00:11:42,000 probes? They're basically short pieces of DNA that have a 114 00:11:42,000 --> 00:11:46,000 fluorescent molecule attached to them. 115 00:11:46,000 --> 00:12:03,000 -- DNA molecules that are roughly 20 116 00:12:03,000 --> 00:12:11,000 nucleotides in length, and they carry a fluorescent molecule. 117 00:12:11,000 --> 00:12:19,000 Now these short, single-stranded stretches of DNA 118 00:12:19,000 --> 00:12:27,000 are complementary to those C type 119 00:12:27,000 --> 00:12:45,000 sequence stretches -- 120 00:12:45,000 --> 00:12:52,000 -- in the ribosomal RNA.
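As a rough sketch of what "complementary to a signature sequence" means, the snippet below derives a 20-nucleotide DNA probe from an RNA signature stretch and checks whether it would pair perfectly with a given ribosomal RNA. The sequences and the function names `design_probe` and `probe_binds` are invented for illustration; real probe design also considers melting temperature, accessibility of the target site, and tolerated mismatches, none of which is modeled here.

```python
RNA_TO_DNA_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "U": "A"}
DNA_TO_RNA_COMPLEMENT = {"A": "U", "C": "G", "G": "C", "T": "A"}


def design_probe(rrna_signature: str) -> str:
    """Return the DNA probe (written 5'->3') complementary to an rRNA stretch."""
    return "".join(RNA_TO_DNA_COMPLEMENT[base] for base in reversed(rrna_signature))


def probe_binds(probe: str, rrna: str) -> bool:
    """True if the rRNA contains a stretch perfectly complementary to the probe."""
    target = "".join(DNA_TO_RNA_COMPLEMENT[base] for base in reversed(probe))
    return target in rrna


# Hypothetical 20-nucleotide signature stretch (not a real 16S sequence).
signature = "GCUAGGCUAACGUUAGCCAU"
probe = design_probe(signature)

target_rrna = "AAG" + signature + "GG"          # organism carrying the signature
other_rrna = "AAGGCUAGGCUUACGUUAGCCAUGG"        # close relative, one base differs

print(probe, probe_binds(probe, target_rrna), probe_binds(probe, other_rrna))
```

In this toy case a single base difference is enough to reject the relative, which mirrors why fast-evolving stretches are useful for telling closely related organisms apart.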
And so basically what we can do is 121 00:12:52,000 --> 00:12:59,000 we can collect microbial cells from the environment -- 122 00:12:59,000 --> 00:13:17,000 -- make them permeable -- 123 00:13:17,000 --> 00:13:33,000 -- and then basically mix them with 124 00:13:33,000 --> 00:13:48,000 those phylogenetic probes. 125 00:13:48,000 --> 00:13:52,000 And these probes will then permeate into the cell and bind to their 126 00:13:52,000 --> 00:14:15,000 complementary sequences. 127 00:14:15,000 --> 00:14:17,000 Then we wash away the unbound probe -- 128 00:14:17,000 --> 00:14:33,000 -- and we can view it in a 129 00:14:33,000 --> 00:14:53,000 microscope under UV light. 130 00:14:53,000 --> 00:14:57,000 Let me show you an example of this. What you see here is basically a 131 00:14:57,000 --> 00:15:01,000 light micrograph. So this is what you see basically 132 00:15:01,000 --> 00:15:06,000 when you collect microbial cells from the environment under the 133 00:15:06,000 --> 00:15:10,000 microscope. Most bacteria look the same, so you cannot actually 134 00:15:10,000 --> 00:15:15,000 differentiate them all by just looking at them. 135 00:15:15,000 --> 00:15:19,000 But then these cells were fixed and permeabilized and then basically 136 00:15:19,000 --> 00:15:24,000 mixed with two different phylogenetic probes that identified 137 00:15:24,000 --> 00:15:28,000 two different types of organisms. One was labeled with a red Fluor, 138 00:15:28,000 --> 00:15:33,000 the other one with a green Fluor. 139 00:15:33,000 --> 00:15:42,000 And what you see is that you can now differentiate those two organisms. 140 00:15:42,000 --> 00:15:52,000 Now, why is this especially interesting? Well here's just a 141 00:15:52,000 --> 00:16:02,000 specific example where people were looking for bacteria capable of 142 00:16:02,000 --> 00:16:12,000 nitrogen oxidation. These are bacteria that are very 143 00:16:12,000 --> 00:16:22,000 important in, for example, sewage treatment. 
And it was known 144 00:16:22,000 --> 00:16:32,000 that there were two different types out there, one that oxidizes 145 00:16:32,000 --> 00:16:41,000 ammonia to nitrite, -- and a second one that 146 00:16:41,000 --> 00:16:47,000 oxidizes nitrite to nitrate. And by doing this type of analysis 147 00:16:47,000 --> 00:16:53,000 what people basically learned is that those two organisms live in 148 00:16:53,000 --> 00:17:00,000 very, very close proximity at all times. 149 00:17:00,000 --> 00:17:04,000 So the organisms that oxidize ammonia to nitrite are really 150 00:17:04,000 --> 00:17:08,000 attached to, and oftentimes even surrounded by, the organisms that take 151 00:17:08,000 --> 00:17:12,000 the nitrite to nitrate. So, what you have is a very close 152 00:17:12,000 --> 00:17:16,000 cooperation between two different types of microorganisms, 153 00:17:16,000 --> 00:17:21,000 and the transfer of one of the substrates that's a product of the 154 00:17:21,000 --> 00:17:25,000 metabolism of one of the organisms to another one: an extremely 155 00:17:25,000 --> 00:17:29,000 efficient process that really is very important to take into 156 00:17:29,000 --> 00:17:33,000 consideration when you want to understand processes like sewage 157 00:17:33,000 --> 00:17:37,000 treatment, but also nitrogen biogeochemistry in 158 00:17:37,000 --> 00:17:45,000 the environment. Any questions? 159 00:17:45,000 --> 00:17:55,000 OK, so for the remainder of the lecture I want to talk 160 00:17:55,000 --> 00:18:03,000 about genomics, -- and then in particular also its 161 00:18:03,000 --> 00:18:07,000 application to questions of environmental microbiology and 162 00:18:07,000 --> 00:18:12,000 environmental science.
So first, what I want to do is give 163 00:18:12,000 --> 00:18:16,000 you a little bit of the definition of genomics, and then cover how it 164 00:18:16,000 --> 00:18:21,000 is actually possible that we can sequence entire genomes, 165 00:18:21,000 --> 00:18:25,000 and I want to give you some highlights of what we have found by 166 00:18:25,000 --> 00:18:30,000 comparing different genomes to each other. 167 00:18:30,000 --> 00:18:36,000 And then I want to talk about this field of environmental genomics 168 00:18:36,000 --> 00:18:43,000 where we can use genomic techniques to actually learn something about 169 00:18:43,000 --> 00:18:49,000 the function of different uncultured microorganisms in the environment. 170 00:18:49,000 --> 00:18:56,000 So first, our definition: it's basically 171 00:18:56,000 --> 00:19:03,000 to sequence, -- interpret, and compare whole 172 00:19:03,000 --> 00:19:11,000 genomes. And as you will see the comparison part actually plays an 173 00:19:11,000 --> 00:19:18,000 increasingly important role because we have now actually genome 174 00:19:18,000 --> 00:19:26,000 sequences available from almost all, or from at least some of the major 175 00:19:26,000 --> 00:19:32,000 groups of life. So this, again, 176 00:19:32,000 --> 00:19:36,000 is a different kind of representation of the tree of life. 177 00:19:36,000 --> 00:19:40,000 You have bacteria, archaea, and eukarya again. 178 00:19:40,000 --> 00:19:44,000 And as you can see, we have a lot of representatives. 179 00:19:44,000 --> 00:19:49,000 In fact, this doesn't even come close to the diversity that we have 180 00:19:49,000 --> 00:19:53,000 now sequenced, as there are well over a hundred bacterial genome sequences now, 181 00:19:53,000 --> 00:19:57,000 several archaeal genomes, and increasingly also 182 00:19:57,000 --> 00:20:02,000 eukaryotic genomes. Now, genomes, so how is this done? 183 00:20:02,000 --> 00:20:08,000 How can we actually sequence genomes?
Well, 184 00:20:08,000 --> 00:20:13,000 on the face of it we use very large facilities where you have sequencing 185 00:20:13,000 --> 00:20:19,000 machines present. There's one very important one at 186 00:20:19,000 --> 00:20:24,000 MIT, actually at the Broad Institute, and here you see all those really 187 00:20:24,000 --> 00:20:30,000 industrial scale production lines actually. 188 00:20:30,000 --> 00:20:40,000 But the basic problem is that genomes are large. 189 00:20:40,000 --> 00:20:50,000 E. coli, for example, has roughly 4.4 million base pairs, 190 00:20:50,000 --> 00:21:00,000 and the human genome is much, much larger. 191 00:21:00,000 --> 00:21:08,000 It has about 3 billion base pairs. OK, so genomes are very, very large. 192 00:21:08,000 --> 00:21:22,000 But a single sequencing reaction-- 193 00:21:22,000 --> 00:21:29,000 -- gives you only roughly 500-1,000 nucleotides or base pairs. So 194 00:21:29,000 --> 00:21:36,000 how is it that we can actually sequence entire genomes? 195 00:21:36,000 --> 00:21:43,000 I'm going to walk you through this, and there is some variation on the 196 00:21:43,000 --> 00:21:50,000 theme, but this is a major approach that's still used in some 197 00:21:50,000 --> 00:21:57,000 of the sequencing facilities. Now, you start out by extracting 198 00:21:57,000 --> 00:22:04,000 genomic DNA from organisms, and then you use restriction enzymes 199 00:22:04,000 --> 00:22:11,000 to cut the DNA into relatively large pieces of DNA, so about 160 200 00:22:11,000 --> 00:22:17,000 kilobase pairs long on average. This is shown here. 201 00:22:17,000 --> 00:22:22,000 Kilo means a thousand, so 160,000 base pairs long. 202 00:22:22,000 --> 00:22:28,000 These pieces are then cloned into specific cloning vectors that are 203 00:22:28,000 --> 00:22:43,000 called BAC vectors. 204 00:22:43,000 --> 00:23:01,000 So these are for cloning large pieces of DNA, and BAC stands for Bacterial 205 00:23:01,000 --> 00:23:17,000 Artificial Chromosome.
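To get a feel for the scale of the problem, here is a small back-of-the-envelope calculation using the genome and BAC sizes quoted in the lecture. The read length of 1,000 base pairs is an assumption taken as the upper end of a single sequencing reaction, and the helper `pieces_needed` is a hypothetical name. The counts are bare minimums with no overlap between pieces; an actual project needs several-fold redundancy so that pieces overlap.

```python
import math

E_COLI_BP = 4_400_000        # from the lecture: roughly 4.4 million base pairs
HUMAN_BP = 3_000_000_000     # from the lecture: about 3 billion base pairs
BAC_INSERT_BP = 160_000      # from the lecture: about 160 kilobase pairs per BAC
READ_BP = 1_000              # assumption: upper end of one sequencing reaction


def pieces_needed(genome_bp: int, piece_bp: int) -> int:
    """Minimum number of non-overlapping pieces needed to span a genome once."""
    return math.ceil(genome_bp / piece_bp)


for name, size in [("E. coli", E_COLI_BP), ("human", HUMAN_BP)]:
    bacs = pieces_needed(size, BAC_INSERT_BP)
    reads = pieces_needed(size, READ_BP)
    print(f"{name}: at least {bacs} BAC inserts, at least {reads} reads")
# E. coli: at least 28 BAC inserts, at least 4400 reads
# human: at least 18750 BAC inserts, at least 3000000 reads
```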
And what they basically are, 206 00:23:17,000 --> 00:23:31,000 are plasmids, very special plasmids that can carry large pieces of 207 00:23:31,000 --> 00:23:40,000 genome, or large genome fragments. So, by cloning into those BAC 208 00:23:40,000 --> 00:23:45,000 vectors, what you do is you basically divide up the genome, 209 00:23:45,000 --> 00:23:50,000 and then step number three is mostly done for eukaryotic genomes 210 00:23:50,000 --> 00:23:55,000 because they are so much larger. You can actually analyze 211 00:23:55,000 --> 00:24:00,000 the fragments, and map them onto genome maps where 212 00:24:00,000 --> 00:24:05,000 you know the location of different restriction fragments and different 213 00:24:05,000 --> 00:24:10,000 genes, actually. For bacteria, this step is mostly 214 00:24:10,000 --> 00:24:15,000 skipped, actually. What you do with each one of those 215 00:24:15,000 --> 00:24:20,000 BACs, is you cut them further up into 1 kilobase pair fragments, 216 00:24:20,000 --> 00:24:25,000 so much smaller fragments. And 217 00:24:25,000 --> 00:24:30,000 these are then cloned into normal plasmid vectors. 218 00:24:30,000 --> 00:24:35,000 And so you generate what are called shotgun clones. 219 00:24:35,000 --> 00:24:40,000 So, these are then cloned into E. coli, you go through the same type 220 00:24:40,000 --> 00:24:45,000 of steps that we discussed before already with environmental clone 221 00:24:45,000 --> 00:24:50,000 libraries. And you can actually determine the sequence of each one 222 00:24:50,000 --> 00:24:55,000 of those pieces of DNA. And what you will then get 223 00:24:55,000 --> 00:25:00,000 is small fragments of overlapping DNA sequences. 224 00:25:00,000 --> 00:25:04,000 That is shown here. You'll find overlaps, 225 00:25:04,000 --> 00:25:09,000 basically, with which you piece together the whole genome.
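Piecing overlapping fragments back together can be sketched with a toy greedy procedure: repeatedly merge the two fragments that share the longest overlap. This is only an illustration under simplifying assumptions (error-free reads, one strand, no repeats, no read contained inside another), with made-up sequences and hypothetical function names; it is not the algorithm used by production assemblers.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if too short)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of fragments with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Made-up "genome" and three overlapping shotgun reads taken from it.
genome = "ATGGCGTACGTTAGCCTAGGATC"
reads = [genome[12:23], genome[0:10], genome[6:16]]

assembled = greedy_assemble(reads)
print(assembled)
```

With these three reads the procedure recovers the original string as a single contig; with repeats or sequencing errors a greedy merge like this can go wrong, which is one reason real projects first map reads to individual BACs.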
And so, 226 00:25:09,000 --> 00:25:14,000 first, to assemble, you piece together the genome fragments that 227 00:25:14,000 --> 00:25:19,000 are present in the BACs, and then finally you piece together 228 00:25:19,000 --> 00:25:24,000 the entire genome from those large sequence pieces, 229 00:25:24,000 --> 00:25:29,000 and you get a so-called draft genome sequence. 230 00:25:29,000 --> 00:25:39,000 The next step in this analysis, then, is that you do so-called 231 00:25:39,000 --> 00:25:49,000 genome annotation. And the first very important step 232 00:25:49,000 --> 00:26:00,000 is that you translate the gene sequences into amino acids. 233 00:26:00,000 --> 00:26:05,000 So, the nucleotide sequences into amino acids. Particularly in 234 00:26:05,000 --> 00:26:10,000 prokaryotes, this step can be done right away -- 235 00:26:10,000 --> 00:26:31,000 -- and what this allows you to do, 236 00:26:31,000 --> 00:26:37,000 is you can look for what we call open reading frames, 237 00:26:37,000 --> 00:26:44,000 or ORFs. And what you look for is a start codon and a stop codon that 238 00:26:44,000 --> 00:26:50,000 basically bracket or frame a stretch of amino acids encoded by 239 00:26:50,000 --> 00:26:57,000 the nucleotides. So you look for ORFs. 240 00:26:57,000 --> 00:27:13,000 And these are your putative genes. 241 00:27:13,000 --> 00:27:26,000 The next step that you can do, 242 00:27:26,000 --> 00:27:32,000 then, is you can go to databases and now you compare your ORFs to 243 00:27:32,000 --> 00:27:39,000 information that is present in the databases. So basically, 244 00:27:39,000 --> 00:27:45,000 you query the database and ask, is a gene sequence present that is 245 00:27:45,000 --> 00:27:52,000 statistically significantly similar to the one that I have, and that 246 00:27:52,000 --> 00:27:58,000 allows me to say something about the function of this particular gene?
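The search for open reading frames described above can be sketched as a scan for a start codon followed, in the same reading frame, by a stop codon. The snippet is a minimal illustration that assumes the standard start codon ATG and the three standard stop codons, looks at one strand only, and uses a made-up sequence; the function name `find_orfs` and the `min_codons` threshold are hypothetical, and real annotation pipelines also scan the reverse strand and apply much longer length cutoffs.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 3) -> list[tuple[int, int]]:
    """Return (start, end) index pairs of ORFs in all three frames of one strand.

    `end` is exclusive and includes the stop codon; `min_codons` counts the
    codons from the start codon up to, but not including, the stop codon.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START:
                start = i                      # open a candidate reading frame
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # close it and keep scanning
    return sorted(orfs)


# Made-up sequence containing one short ORF.
dna = "CCATGAAACCCGGGTAACC"
orfs = find_orfs(dna)
for start, end in orfs:
    print(start, end, dna[start:end])
```

Each ORF found this way is only a putative gene; assigning a function still requires the database comparison the lecture goes on to describe.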
247 00:27:58,000 --> 00:28:05,000 So function can then be identified by comparison with databases. 248 00:28:05,000 --> 00:28:29,000 Any questions? 249 00:28:29,000 --> 00:28:34,000 OK, so that allows you, then, to basically say something 250 00:28:34,000 --> 00:28:39,000 about the different genes that you have found in the genome, 251 00:28:39,000 --> 00:28:44,000 but to give you an impression of how new this field really is and how 252 00:28:44,000 --> 00:28:49,000 little we still know about the diversity of genes and organisms, 253 00:28:49,000 --> 00:28:54,000 on average when we sequence a new bacterial genome we find that about 30% 254 00:28:54,000 --> 00:28:59,000 of the genes, or a third of the genes, have no known functional 255 00:28:59,000 --> 00:29:05,000 analog in the databases. OK, so there's a lot to learn about 256 00:29:05,000 --> 00:29:11,000 the diversity of life and about the functional diversity of life. 257 00:29:11,000 --> 00:29:18,000 In eukaryotes, there are some little twists, 258 00:29:18,000 --> 00:29:24,000 as you all know. And basically, that is that genes 259 00:29:24,000 --> 00:29:31,000 of course consist of introns and exons, right? 260 00:29:31,000 --> 00:29:35,000 And so it's basically relatively difficult to directly identify those 261 00:29:35,000 --> 00:29:40,000 open reading frames. And what you have to do is that you 262 00:29:40,000 --> 00:29:45,000 have to actually oftentimes, so let's write this down. 263 00:29:45,000 --> 00:30:01,000 And what people oftentimes do, 264 00:30:01,000 --> 00:30:12,000 then, is that they search for matching sequences in so-called cDNA 265 00:30:12,000 --> 00:30:24,000 libraries. Now what are cDNA libraries? Let me just show you 266 00:30:24,000 --> 00:30:32,000 this on the next slide. Skip this.
Basically what you can do is you can 267 00:30:32,000 --> 00:30:38,000 isolate messenger RNA from cells and then convert the messenger RNA by 268 00:30:38,000 --> 00:30:43,000 a process called reverse transcription, using a viral enzyme 269 00:30:43,000 --> 00:30:49,000 that copies RNA into DNA, so you can convert it into DNA 270 00:30:49,000 --> 00:30:54,000 fragments. And you can then clone those DNA fragments into plasmids, 271 00:30:54,000 --> 00:31:00,000 sequence those, and then basically see what are the pieces that are 272 00:31:00,000 --> 00:31:06,000 actually, what are the introns in the genes? 273 00:31:06,000 --> 00:31:11,000 What are the pieces that are excised when the messenger RNA is actually 274 00:31:11,000 --> 00:31:29,000 created from the genome? 275 00:31:29,000 --> 00:31:33,000 And so, let me just cover now a few of the major insights that people 276 00:31:33,000 --> 00:31:37,000 have come up with. Of course, it's a rapidly growing 277 00:31:37,000 --> 00:31:41,000 field and a lot of excitement is coming out. 278 00:31:41,000 --> 00:31:58,000 And I first want to talk about 279 00:31:58,000 --> 00:32:09,000 bacteria and archaea -- 280 00:32:09,000 --> 00:32:13,000 -- and then say a few words also about eukaryotes or eukarya. 281 00:32:13,000 --> 00:32:17,000 First of all, what we learned about bacteria and archaea 282 00:32:17,000 --> 00:32:21,000 is that their genomes are very compact. 283 00:32:21,000 --> 00:32:35,000 Whenever they have pieces of DNA 284 00:32:35,000 --> 00:32:43,000 that are not frequently used, they're actually lost from the 285 00:32:43,000 --> 00:32:51,000 genome. OK, so they lose genes, I should say, relatively easily, and 286 00:32:51,000 --> 00:32:59,000 we can see this in that the genome size is correlated to metabolic 287 00:32:59,000 --> 00:33:12,000 diversity.
288 00:33:12,000 --> 00:33:23,000 So, for example, we have Mycoplasma genitalium and 289 00:33:23,000 --> 00:33:37,000 Streptomyces -- 290 00:33:37,000 --> 00:33:42,000 coelicolor, which are two very different bacteria. The first one is an 291 00:33:42,000 --> 00:34:01,000 obligate intracellular parasite. 292 00:34:01,000 --> 00:34:08,000 OK, so, which means it's actually bathed in a nutrient solution in the 293 00:34:08,000 --> 00:34:16,000 eukaryotic cells that it invades. It doesn't have to make amino acids. 294 00:34:16,000 --> 00:34:23,000 It just gets them from the host cell. And it turns out it has a very 295 00:34:23,000 --> 00:34:31,000 small genome, so only 0.58 megabase pairs, so 580,000 296 00:34:31,000 --> 00:34:37,000 base pairs, and only 517 genes. And interestingly, 297 00:34:37,000 --> 00:34:41,000 actually people are now using this organism to try and ask, 298 00:34:41,000 --> 00:34:46,000 well, what's the minimum number of genes that an organism can actually 299 00:34:46,000 --> 00:34:50,000 live with? And so, they are deleting in a 300 00:34:50,000 --> 00:34:55,000 stepwise fashion the different genes in this organism, 301 00:34:55,000 --> 00:34:59,000 and it turns out that you need about 200 to 300 genes minimum in order to 302 00:34:59,000 --> 00:35:03,000 make the thing survive. On the other hand, 303 00:35:03,000 --> 00:35:15,000 Streptomyces is a soil bacterium -- 304 00:35:15,000 --> 00:35:20,000 -- has a very complex lifestyle, can degrade a lot of environmental 305 00:35:20,000 --> 00:35:26,000 substrates, and it has a very big genome, one of the biggest bacterial 306 00:35:26,000 --> 00:35:31,000 genomes. And so, those two organisms basically span 307 00:35:31,000 --> 00:35:37,000 pretty much the range of bacterial genome sizes. 308 00:35:37,000 --> 00:35:41,000 And so, it's thought that Streptomyces has about 7,846 genes.
309 00:35:41,000 --> 00:35:57,000 Now, we also have a very large 310 00:35:57,000 --> 00:36:09,000 genetic diversity -- 311 00:36:09,000 --> 00:36:23,000 -- between species. And typically what you find is that 312 00:36:23,000 --> 00:36:38,000 roughly 15 to 30% of genes are unique to a specific species. 313 00:36:38,000 --> 00:36:44,000 And that's really because bacteria and archaea have the capability to 314 00:36:44,000 --> 00:36:50,000 carry out a lot of chemical reactions that eukaryotes, 315 00:36:50,000 --> 00:36:56,000 for example, cannot. There are about 20 million known 316 00:36:56,000 --> 00:37:02,000 organic substances, organic chemicals, and almost all of 317 00:37:02,000 --> 00:37:07,000 them are biodegradable by bacteria. Even the minutest compound, if it 318 00:37:07,000 --> 00:37:12,000 were not biodegradable by bacteria, would build up in the environment, 319 00:37:12,000 --> 00:37:16,000 OK? So, even if it were just a cofactor that some organism produces, because 320 00:37:16,000 --> 00:37:21,000 we have such a long period of time of evolution on this planet and 321 00:37:21,000 --> 00:37:26,000 evolutionary history, you probably would be able to dig it 322 00:37:26,000 --> 00:37:32,000 up in your backyard. One of the other very important and 323 00:37:32,000 --> 00:37:39,000 interesting insights that has come out of comparing genomes for 324 00:37:39,000 --> 00:37:46,000 microorganisms is that lateral gene transfer is a very important process 325 00:37:46,000 --> 00:38:07,000 amongst microorganisms. 326 00:38:07,000 --> 00:38:11,000 Now what do I mean by lateral gene transfer? It basically means that 327 00:38:11,000 --> 00:38:16,000 we find evidence among bacterial genomes that they have actually 328 00:38:16,000 --> 00:38:20,000 taken genes from completely unrelated organisms.
329 00:38:20,000 --> 00:38:25,000 And I just want to show you one example here, that of 330 00:38:25,000 --> 00:38:38,000 Thermotoga maritima -- 331 00:38:38,000 --> 00:38:47,000 -- which lives in hot springs. This is a very interesting 332 00:38:47,000 --> 00:38:56,000 bacterium that lives in hot water of around 80°C and thrives only in 333 00:38:56,000 --> 00:39:05,000 those kinds of environments. And it coexists there with many 334 00:39:05,000 --> 00:39:14,000 archaea. And when people sequenced the genome of Thermotoga maritima 335 00:39:14,000 --> 00:39:23,000 what they found was that about 25% of the genes have their closest 336 00:39:23,000 --> 00:39:32,000 relatives in archaeal genomes. So roughly 25% of genes in 337 00:39:32,000 --> 00:39:39,000 Thermotoga are of archaeal origin. And how can we actually figure 338 00:39:39,000 --> 00:39:44,000 something like that out? Well, the most important technique 339 00:39:44,000 --> 00:39:49,000 is, again, phylogenetic tree construction. And so when you have, 340 00:39:49,000 --> 00:39:54,000 for example, gene A, well let me draw this, actually, 341 00:39:54,000 --> 00:40:10,000 on a new board. 342 00:40:10,000 --> 00:40:15,000 So you're comparing, say, three organisms, 343 00:40:15,000 --> 00:40:21,000 organism A, B, and C and you compare gene one with gene two. 344 00:40:21,000 --> 00:40:27,000 And you notice that most genes adhere to this pattern, 345 00:40:27,000 --> 00:40:33,000 but that every now and then there's a gene that gives you this 346 00:40:33,000 --> 00:40:38,000 type of pattern. What you can then conclude is that 347 00:40:38,000 --> 00:40:43,000 this gene, C, has not coevolved with the other genes in the genome of 348 00:40:43,000 --> 00:40:48,000 these organisms but was actually transferred into it from another 349 00:40:48,000 --> 00:40:53,000 source. And I don't have time to go actually into the mechanisms.
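The logic of spotting a transferred gene, namely that most genes agree on who the closest relative is while an occasional gene disagrees, can be sketched in a few lines. As a deliberate simplification, the snippet uses plain sequence identity to pick the nearest relative instead of building phylogenetic trees, which is the technique the lecture actually names. The organisms, gene names, sequences, and function names are all invented for illustration.

```python
from collections import Counter


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two aligned, equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def closest_relative(copies: dict[str, str], focal: str) -> str:
    """Organism whose copy of the gene is most similar to the focal organism's copy."""
    others = [org for org in copies if org != focal]
    return max(others, key=lambda org: identity(copies[focal], copies[org]))


def flag_transfer_candidates(genes: dict[str, dict[str, str]], focal: str) -> list[str]:
    """Genes whose nearest relative differs from the one most genes agree on."""
    nearest = {name: closest_relative(copies, focal) for name, copies in genes.items()}
    usual, _ = Counter(nearest.values()).most_common(1)[0]
    return [name for name, org in nearest.items() if org != usual]


# Made-up aligned gene copies from three hypothetical organisms A, B, and C.
genes = {
    "gene1": {"A": "ACGTACGTAC", "B": "ACGTACGTAA", "C": "ACTTACCTGA"},
    "gene2": {"A": "GGCATTGACC", "B": "GGCATTGATC", "C": "GGAATCGTTC"},
    "gene3": {"A": "TTGACCGATA", "B": "TTGACCGTTA", "C": "TAGTCCGATT"},
    "gene4": {"A": "CAGGTTACGA", "B": "CTGATTGCGT", "C": "CAGGTTACGT"},
}

candidates = flag_transfer_candidates(genes, focal="A")
print(candidates)
```

In this toy data three genes place organism A next to B, while the fourth places it next to C, so only the fourth is flagged as a candidate for lateral transfer.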
350 00:40:53,000 --> 00:40:58,000 If you're interested, I teach a graduate class that undergraduates 351 00:40:58,000 --> 00:41:03,000 actually take in our department, environmental microbiology, where we 352 00:41:03,000 --> 00:41:08,000 discuss a lot of the mechanisms. It's basically that a lot of viruses can 353 00:41:08,000 --> 00:41:13,000 effect gene transfer, but also plasmids and transposons. 354 00:41:13,000 --> 00:41:18,000 But for bacteria, again, you should remember that new function 355 00:41:18,000 --> 00:41:23,000 oftentimes actually arises by lateral gene transfer. 356 00:41:23,000 --> 00:41:28,000 And one of the interesting things is that lateral gene transfer is 357 00:41:28,000 --> 00:41:34,000 actually very important in the evolution of pathogenic bacteria. 358 00:41:34,000 --> 00:41:48,000 So, the so-called virulence genes, 359 00:41:48,000 --> 00:41:57,000 which are the genes that basically affect pathogenesis. Do 360 00:41:57,000 --> 00:42:13,000 you have a question? Among pathogenic bacteria, 361 00:42:13,000 --> 00:42:35,000 often arise by lateral gene transfer. OK. Any questions? 362 00:42:35,000 --> 00:42:43,000 OK, now for eukarya, I just want to make the point that 363 00:42:43,000 --> 00:42:52,000 their genomes are generally orders of magnitude larger -- 364 00:42:52,000 --> 00:43:16,000 OK, and that the exons, 365 00:43:16,000 --> 00:43:23,000 so the stretches that really encode the proteins that make up the 366 00:43:23,000 --> 00:43:30,000 organism, the exons are typically only a few percent 367 00:43:30,000 --> 00:43:37,000 of the genome. That's particularly so in higher 368 00:43:37,000 --> 00:43:44,000 eukaryotes. Yeasts, for example, have a much more 369 00:43:44,000 --> 00:43:50,000 compact genome also. We, for example, are full of DNA 370 00:43:50,000 --> 00:43:57,000 that people still have a very hard time figuring out what that actually 371 00:43:57,000 --> 00:44:04,000 does.
But it seems that the majority of the genome, 372 00:44:04,000 --> 00:44:10,000 so-called repeated sequences -- -- many of which seem to be ancient 373 00:44:10,000 --> 00:44:15,000 retroviruses that have inserted themselves into the genome and have 374 00:44:15,000 --> 00:44:21,000 since then lost actually their function. OK, 375 00:44:21,000 --> 00:44:26,000 so the remaining time I want to just give you an example of how we can 376 00:44:26,000 --> 00:44:31,000 now use these techniques that I outlined before to learn something 377 00:44:31,000 --> 00:44:37,000 about microorganisms in the environment. 378 00:44:37,000 --> 00:44:47,000 It's called environmental genomics. 379 00:44:47,000 --> 00:45:04,000 Basically, the way this all started 380 00:45:04,000 --> 00:45:08,000 was by going into the environment and extracting nucleic acids and 381 00:45:08,000 --> 00:45:13,000 treating them exactly the same way as if you had a single genome. 382 00:45:13,000 --> 00:45:18,000 But, again, remember, we have a very large mixture of microorganisms 383 00:45:18,000 --> 00:45:22,000 present in the environment. And where this was mostly done was 384 00:45:22,000 --> 00:45:27,000 in the ocean, actually. And what people did, was they 385 00:45:27,000 --> 00:45:32,000 constructed those BAC clones directly from the environment and 386 00:45:32,000 --> 00:45:36,000 then looked amongst those BAC clones for specific 16S ribosomal 387 00:45:36,000 --> 00:45:41,000 RNA genes. Remember, this is the marker that we 388 00:45:41,000 --> 00:45:45,000 have for microorganisms in the environment. We know the diversity 389 00:45:45,000 --> 00:45:49,000 of microorganisms through those types of genes, 390 00:45:49,000 --> 00:45:53,000 and we have a lot of the data available. And so, 391 00:45:53,000 --> 00:45:57,000 in order to link a specific function of such an organism that we only 392 00:45:57,000 --> 00:46:01,000 know from the 16S ribosomal RNA genes. 
393 00:46:01,000 --> 00:46:06,000 So, to ask the question of what function this organism might be carrying 394 00:46:06,000 --> 00:46:12,000 out in the environment, it's very useful to sequence BAC 395 00:46:12,000 --> 00:46:17,000 clones that have 16S ribosomal RNA genes on them, 396 00:46:17,000 --> 00:46:23,000 and determine what kinds of protein coding genes are on there that might 397 00:46:23,000 --> 00:46:28,000 reveal some of the function of the organism in the environment. 398 00:46:28,000 --> 00:46:34,000 And one example that I want to show you is that of the proteorhodopsin. 399 00:46:34,000 --> 00:46:45,000 So basically, the initial task was to sequence BAC clones containing 400 00:46:45,000 --> 00:46:57,000 ribosomal RNA genes, and look for other genes that might 401 00:46:57,000 --> 00:47:15,000 reveal some of the function. 402 00:47:15,000 --> 00:47:18,000 So, you don't want to look for all the genes that encode proteins that 403 00:47:18,000 --> 00:47:22,000 are important to the cell cycle and things like that, 404 00:47:22,000 --> 00:47:25,000 but really sort of metabolic genes that might tell you something about 405 00:47:25,000 --> 00:47:29,000 the type of metabolism that this organism carries out 406 00:47:29,000 --> 00:47:33,000 in the environment. And so, the first example that 407 00:47:33,000 --> 00:47:39,000 turned out to be really, really important is that people 408 00:47:39,000 --> 00:47:44,000 found rhodopsin genes on one of those BAC fragments, 409 00:47:44,000 --> 00:47:50,000 and it turns out this rhodopsin catalyzes or these rhodopsin genes 410 00:47:50,000 --> 00:47:55,000 produce a protein that inserts itself into the bacterial membrane, 411 00:47:55,000 --> 00:48:01,000 and it's a photoreceptor that when it's hit by light, 412 00:48:01,000 --> 00:48:06,000 it actually becomes a proton pump. 
So, it expels protons from the cell 413 00:48:06,000 --> 00:48:11,000 interior to the outside, and you already know that this is 414 00:48:11,000 --> 00:48:16,000 important in energy generation in all living cells. 415 00:48:16,000 --> 00:48:20,000 So proton gradients across membranes basically give the cells sort of a 416 00:48:20,000 --> 00:48:25,000 battery status that can be exploited by ATPase molecules or ATPase 417 00:48:25,000 --> 00:48:30,000 proteins that equalize the proton gradient and effect ATP 418 00:48:30,000 --> 00:48:35,000 synthesis in doing so. Now, why is this so important? 419 00:48:35,000 --> 00:48:40,000 Well, it turned out that this type of protein is present in almost all 420 00:48:40,000 --> 00:48:45,000 microbial cells that were previously thought to be heterotrophs alone in 421 00:48:45,000 --> 00:48:49,000 the ocean in the parts of the ocean that receive enough light. 422 00:48:49,000 --> 00:48:54,000 And what this means is that our estimates of the global carbon 423 00:48:54,000 --> 00:48:59,000 budget of the ocean were basically wrong because most microorganisms in 424 00:48:59,000 --> 00:49:12,000 the ocean have this. 425 00:49:12,000 --> 00:49:31,000 So most prokaryotes in the ocean have a light-driven proton pump 426 00:49:31,000 --> 00:49:52,000 which is called proteorhodopsin. And it basically allows them to gain 427 00:49:52,000 --> 00:50:06,000 energy from sunlight. And there's an increasing number of 428 00:50:06,000 --> 00:50:12,000 such examples now where we are learning to interpret environmental 429 00:50:12,000 --> 00:50:18,000 communities, and the function of environmental microbial communities 430 00:50:18,000 --> 00:50:23,000 through those genomic approaches. And it reveals basically an 431 00:50:23,000 --> 00:50:29,000 enormous diversity of organisms out there. 
And what we also are 432 00:50:29,000 --> 00:50:35,000 learning to do now is to assemble entire genomes from those samples by 433 00:50:35,000 --> 00:50:40,000 applying genomic techniques. And this is an example here where 434 00:50:40,000 --> 00:50:44,000 you see, this was published last year, where people went out and 435 00:50:44,000 --> 00:50:49,000 basically were able to piece together from pieces of genes 436 00:50:49,000 --> 00:50:53,000 obtained from the environment, entire genomes or fragments of 437 00:50:53,000 --> 00:50:58,000 entire genomes. And that's shown here. 438 00:50:58,000 --> 00:51:02,000 Those are contiguous sequences. OK, so if you have any questions 439 00:51:02,000 --> 00:51:07,000 let me know by e-mail, or if you're interested in pursuing 440 00:51:07,000 --> 00:51:11,000 this further I also teach another class in civil and environmental 441 00:51:11,000 --> 00:51:14,000 engineering.