1 00:00:01,000 --> 00:00:08,000 Good morning. Good morning. 2 00:00:08,000 --> 00:00:14,000 So, I'd like to pick up where we left off last time and just finish 3 00:00:14,000 --> 00:00:20,000 off translation and then step back and look at how this central dogma 4 00:00:20,000 --> 00:00:26,000 of DNA is replicated into DNA, is read into RNA, and is translated 5 00:00:26,000 --> 00:00:30,000 into protein. Or, actually, as Francis Crick 6 00:00:30,000 --> 00:00:33,000 really put it, all information flow from nucleic 7 00:00:33,000 --> 00:00:37,000 acid to protein. How that varies amongst organisms. 8 00:00:37,000 --> 00:00:40,000 Because first we're going through it and looking at the absolutely 9 00:00:40,000 --> 00:00:43,000 common features, DNA replication, so it's five prime 10 00:00:43,000 --> 00:00:46,000 to three prime, et cetera, et cetera, 11 00:00:46,000 --> 00:00:50,000 transcription, translation. But in a moment I'd like to turn to 12 00:00:50,000 --> 00:00:53,000 the variations between different kinds of organisms. 13 00:00:53,000 --> 00:00:56,000 But let me briefly finish up, if I may, the bit about translation 14 00:00:56,000 --> 00:01:00,000 in general so we can look at its variation. 15 00:01:00,000 --> 00:01:06,000 As we talked about last time, we have a messenger RNA that has 16 00:01:06,000 --> 00:01:13,000 been transcribed from a specific region of the chromosome starting at 17 00:01:13,000 --> 00:01:20,000 a promoter and going to some stop of transcription. 18 00:01:20,000 --> 00:01:27,000 And that messenger RNA will include some particular sequence, 19 00:01:27,000 --> 00:01:34,000 and I'll copy one here, A-U-A-C-G-A-U-G-A-A-G-A-G-G-C-C-C, 20 00:01:34,000 --> 00:01:41,000 et cetera, et cetera, et cetera, out to a UAG. 21 00:01:41,000 --> 00:01:45,000 And this is the direction five prime to three prime. 
22 00:01:45,000 --> 00:01:49,000 We'll remember that all nucleic acid polymerization goes five prime 23 00:01:49,000 --> 00:01:53,000 to three prime. So, what happens is the cell begins 24 00:01:53,000 --> 00:01:57,000 scanning this message. And it does that by this message 25 00:01:57,000 --> 00:02:01,000 being exported into the cytoplasm of the cell. The ribosome coming along 26 00:02:01,000 --> 00:02:05,000 and glomming onto this message and scanning on for the 27 00:02:05,000 --> 00:02:09,000 place to start. It looks, it looks, 28 00:02:09,000 --> 00:02:13,000 it looks, it looks, and it finds the first AUG. Footnote, 29 00:02:13,000 --> 00:02:18,000 this isn't 100% true. There are occasional messages that start their 30 00:02:18,000 --> 00:02:22,000 translation not at an AUG, and there are even occasional, 31 00:02:22,000 --> 00:02:27,000 there are even more messages that don't quite start at the first AUG 32 00:02:27,000 --> 00:02:31,000 because the ribosome is really is looking for something a little bit 33 00:02:31,000 --> 00:02:36,000 special, but to a first order approximation. 34 00:02:36,000 --> 00:02:40,000 Good enough for the textbooks. It goes along to the first AUG. 35 00:02:40,000 --> 00:02:44,000 In reality it's a little more subtle than that. 36 00:02:44,000 --> 00:02:48,000 But it starts at the first AUG. And what it does is it builds a 37 00:02:48,000 --> 00:02:52,000 protein that corresponds to it according to a three letter genetic 38 00:02:52,000 --> 00:02:56,000 code. And you all know the lookup table. It's in your book. 39 00:02:56,000 --> 00:03:00,000 AUG, always the first amino acid put in. A methionine. 40 00:03:00,000 --> 00:03:04,000 Then AAG. Lysine, I think. Then arginine. 41 00:03:04,000 --> 00:03:08,000 Then a proline. Now, I mean this is this particular sequence. 42 00:03:08,000 --> 00:03:12,000 Any other sequence would be different. Et cetera. 
43 00:03:12,000 --> 00:03:17,000 How does it accomplish this matching between three letters of 44 00:03:17,000 --> 00:03:21,000 the genetic code? Oh, and when it gets to UAG, 45 00:03:21,000 --> 00:03:25,000 that is one of the three signals for stop, don't put in any more amino 46 00:03:25,000 --> 00:03:30,000 acids. There are three such stop signals. 47 00:03:30,000 --> 00:03:37,000 AUG, sorry, UAG, UGG and U, oops, what did I just do 48 00:03:37,000 --> 00:03:44,000 here? Let's get that right. UAG, UGG and UGA. Those are the 49 00:03:44,000 --> 00:03:51,000 three stop codons. So, how many total codons are there? 50 00:03:51,000 --> 00:03:59,000 64 codons. Three of them spell stop. 61 of them spell 51 00:03:59,000 --> 00:04:05,000 specific amino acids. And how many amino acids are there? 52 00:04:05,000 --> 00:04:11,000 20. So, the average redundancy is three. Some are specified by 53 00:04:11,000 --> 00:04:16,000 multiple codons. The most extreme is some amino 54 00:04:16,000 --> 00:04:21,000 acids are specified by as many as six codons. Did I, 55 00:04:21,000 --> 00:04:27,000 oh, thank you. Come back down. Of course. U-A, so it's UAG, right? 56 00:04:27,000 --> 00:04:32,000 Sorry, UAA and UGA and UAG. 57 00:04:32,000 --> 00:04:37,000 Thank you. Very good. All right. So, now, how does it accomplish 58 00:04:37,000 --> 00:04:42,000 this feat of taking amino acids, of taking nucleotide sequence, RNA 59 00:04:42,000 --> 00:04:47,000 sequence and converting it into the sequence of amino acids? 60 00:04:47,000 --> 00:04:52,000 As I mentioned last time, there was lots of original somewhat 61 00:04:52,000 --> 00:04:57,000 nutty thinking about some looping codes that would make the RNA fold 62 00:04:57,000 --> 00:05:03,000 up in such a way to bind the amino acids and all that. 
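The codon arithmetic above (64 codons, three stops, 61 sense codons, 20 amino acids) can be checked with a short Python sketch. The variable names are chosen for this illustration only. The stop set used is the corrected one from the end of the discussion: UAA, UAG and UGA.

```python
from itertools import product

BASES = "UCAG"
STOP_CODONS = {"UAA", "UAG", "UGA"}
AMINO_ACIDS = 20

# Every three-letter word over a four-letter alphabet: 4 ** 3 = 64.
all_codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]
sense_codons = [codon for codon in all_codons if codon not in STOP_CODONS]

# Average number of codons per amino acid.
average_redundancy = len(sense_codons) / AMINO_ACIDS

print(len(all_codons), len(sense_codons), round(average_redundancy, 2))  # 64 61 3.05
```

So "three" is a rounded average of 61 / 20. Individual amino acids range from one codon up to the six mentioned above.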
63 00:05:03,000 --> 00:05:12,000 But, as Francis Crick thought up, there had to be some kind of an 64 00:05:12,000 --> 00:05:21,000 adapter molecule that would take the RNA sequence and would somehow 65 00:05:21,000 --> 00:05:30,000 connect it up to the correct amino acid, and that was UAC. 66 00:05:30,000 --> 00:05:35,000 A particular transfer RNA molecule. And the tRNA molecule is an adapter 67 00:05:35,000 --> 00:05:40,000 sequence that has three nucleotides here that match up to the three 68 00:05:40,000 --> 00:05:45,000 nucleotides of the codon that we're trying to translate, 69 00:05:45,000 --> 00:05:50,000 and it has the appropriate amino acid that's been stuck on the end of 70 00:05:50,000 --> 00:05:55,000 it. And how does it get there? How does the right tRNA, the tRNA 71 00:05:55,000 --> 00:06:00,000 to match this codon have the right amino acid put on it? 72 00:06:00,000 --> 00:06:03,000 There's a dedicated enzyme that recognizes that tRNA and puts on 73 00:06:03,000 --> 00:06:07,000 that amino acid. It's aminoacyl-tRNA synthetase. 74 00:06:07,000 --> 00:06:11,000 It sticks the right amino on the right transfer RNA. 75 00:06:11,000 --> 00:06:15,000 So, that's how it accomplishes the physical recognition of these three 76 00:06:15,000 --> 00:06:18,000 bases and has the right amino acid attached to it. 77 00:06:18,000 --> 00:06:22,000 There's an enzymatic machinery that has all of these tRNAs floating 78 00:06:22,000 --> 00:06:26,000 around in the cell which can be used for this translation here. 79 00:06:26,000 --> 00:06:30,000 How does this actually happen physically? 80 00:06:30,000 --> 00:06:35,000 It happens in this vast machine called the ribosome. 
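The adapter idea, three nucleotides on the tRNA matching the three nucleotides of the codon, comes down to base pairing between antiparallel strands. The following Python sketch assumes strict Watson-Crick pairing (A with U, G with C) and ignores the looser pairing that real tRNAs allow at the third position. The function name `anticodon` is invented for this illustration.

```python
# Watson-Crick pairing for RNA.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def anticodon(codon):
    """Return the matching anticodon written 5' to 3'.

    The two strands pair antiparallel, so the anticodon is the
    reverse complement of the codon.
    """
    return "".join(COMPLEMENT[base] for base in reversed(codon))

codon = "AUG"
five_to_three = anticodon(codon)        # written 5' to 3'
three_to_five = five_to_three[::-1]     # same molecule, read 3' to 5'

print(five_to_three, three_to_five)  # CAU UAC
```

Read 3' to 5', so that it lines up under the codon AUG, the anticodon is UAC.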
81 00:06:35,000 --> 00:06:41,000 In the ribosome, if we have, say, our codon here and we have a 82 00:06:41,000 --> 00:06:47,000 tRNA that, well, we'll put that actually in the 83 00:06:47,000 --> 00:06:53,000 ribosome that, say, has the first amino acid here, 84 00:06:53,000 --> 00:06:59,000 methionine, there's a cavity for this guy and there's a cavity 85 00:06:59,000 --> 00:07:05,000 for the next guy. And other tRNAs come into the cell 86 00:07:05,000 --> 00:07:11,000 carrying their next amino acid. Maybe it will be here a lysine that 87 00:07:11,000 --> 00:07:17,000 matches up with the codon and the anti-codon. And when the right tRNA 88 00:07:17,000 --> 00:07:23,000 fits in the next cavity over, the ribosome itself catalyzes a 89 00:07:23,000 --> 00:07:30,000 peptide bond between these amino acids. 90 00:07:30,000 --> 00:07:35,000 Then it chugs over by one, it translocates by one moving this 91 00:07:35,000 --> 00:07:40,000 bit of the complex to the left, and the peptide chain continues to 92 00:07:40,000 --> 00:07:45,000 grow out this end as each new codon is moved into position, 93 00:07:45,000 --> 00:07:50,000 a tRNA comes in bringing the right amino acid until finally a stop 94 00:07:50,000 --> 00:07:55,000 codon is hit. And what happens when you hit a stop codon? 95 00:07:55,000 --> 00:08:00,000 It stops. And is there a tRNA for a stop? 96 00:08:00,000 --> 00:08:02,000 It turns out there's not. There actually isn't. There's some 97 00:08:02,000 --> 00:08:05,000 other factor. There's a protein factor that helps recognize the 98 00:08:05,000 --> 00:08:08,000 stops. So, that just continues to chug on. Those of you who are 99 00:08:08,000 --> 00:08:11,000 computer scientists or mathematicians will recognize this 100 00:08:11,000 --> 00:08:14,000 is a two-tape Turing machine. It is the smallest two-tape Turing 101 00:08:14,000 --> 00:08:17,000 machine that I know to exist. 
If you don't know what that means, 102 00:08:17,000 --> 00:08:20,000 you can forget about that comment. In any case, but some of you know 103 00:08:20,000 --> 00:08:23,000 what that is. So, that's how it proceeds. 104 00:08:23,000 --> 00:08:26,000 That is your basic protein translation. 105 00:08:26,000 --> 00:08:28,000 And, I must say, what I really love about this was 106 00:08:28,000 --> 00:08:31,000 that Francis Crick kind of figured out what had to happen just on first 107 00:08:31,000 --> 00:08:34,000 principles and was able to think through it much more clearly and 108 00:08:34,000 --> 00:08:37,000 direct people to know what to look for in the laboratory. 109 00:08:37,000 --> 00:08:40,000 And if people had not had the clarity of thinking that Crick 110 00:08:40,000 --> 00:08:43,000 provided by saying, look, there's got to be this kind of 111 00:08:43,000 --> 00:08:46,000 adapter, I don't think they would have found it as quickly. 112 00:08:46,000 --> 00:08:49,000 But once he said this is what you've got to look for, 113 00:08:49,000 --> 00:08:52,000 golly, it was there. You can't do that very often, 114 00:08:52,000 --> 00:08:55,000 but Francis Crick seemed to have a very good track record of doing 115 00:08:55,000 --> 00:08:58,000 those things. OK. So, that was just finishing off 116 00:08:58,000 --> 00:09:02,000 translation. Now what I'd like to do is turn to 117 00:09:02,000 --> 00:09:08,000 variations on the theme as the major issue for today. 118 00:09:08,000 --> 00:09:14,000 How does this central dogma, DNA replicates, is transcribed into 119 00:09:14,000 --> 00:09:20,000 RNA and is translated into protein, vary amongst the different kinds of 120 00:09:20,000 --> 00:09:26,000 organisms that we might be interested in? 121 00:09:26,000 --> 00:09:32,000 The kinds of organisms we might be interested in, 122 00:09:32,000 --> 00:09:39,000 eukaryotes, prokaryotes, viruses. Sample eukaryote, 123 00:09:39,000 --> 00:09:46,000 MIT undergraduate. Prokaryote, E. 
coli. And virus, many possible 124 00:09:46,000 --> 00:09:53,000 viruses. The eukaryotes' big nucleated cells. 125 00:09:53,000 --> 00:10:00,000 So, in here we're going to have our nucleated cells. 126 00:10:00,000 --> 00:10:04,000 DNA living in there. In our prokaryotes we have no 127 00:10:04,000 --> 00:10:09,000 distinct nucleus. The DNA is not in a distinct 128 00:10:09,000 --> 00:10:14,000 nucleus, although it's not entirely freely floating around. 129 00:10:14,000 --> 00:10:19,000 It tends to be clustered together. In the virus the nucleic acid 130 00:10:19,000 --> 00:10:24,000 resides in some kind of a capsid, some kind of a, it could be a 131 00:10:24,000 --> 00:10:29,000 protein capsid. There are some of them that have 132 00:10:29,000 --> 00:10:34,000 lipid capsids with lipid particles around them, but some kind of a coat 133 00:10:34,000 --> 00:10:39,000 around nucleic acid there. Do they all do exactly the same 134 00:10:39,000 --> 00:10:44,000 things with regard to DNA replication, RNA transcription and 135 00:10:44,000 --> 00:10:49,000 protein translation? Well, not entirely. So, 136 00:10:49,000 --> 00:10:54,000 as a way, in a way to reinforce what we know about these, 137 00:10:54,000 --> 00:11:00,000 let's look at how they differ. DNA replication. Eukaryotes. 138 00:11:00,000 --> 00:11:06,000 What's the structure of one of your chromosomes? Is it a long line, 139 00:11:06,000 --> 00:11:12,000 a long linear molecule, or is it a circular molecule? 140 00:11:12,000 --> 00:11:18,000 How many of you have linear chromosomes? How many of you have 141 00:11:18,000 --> 00:11:24,000 circular chromosomes? I heard there were some people with 142 00:11:24,000 --> 00:11:30,000 circular. And how many of you are unsure about your chromosomes? 143 00:11:30,000 --> 00:11:38,000 OK. That's good. Well, then I'm pleased to inform 144 00:11:38,000 --> 00:11:47,000 you that you have long linear chromosomes. 
Every human chromosome 145 00:11:47,000 --> 00:11:55,000 is a long double-stranded molecule of DNA. Linear double-stranded DNA. 146 00:11:55,000 --> 00:12:02,000 They can be extremely long. You have 23 chromosomes, 147 00:12:02,000 --> 00:12:08,000 and together they make up three billion nucleotides of DNA. 148 00:12:08,000 --> 00:12:13,000 A typical chromosome could be 150 million bases long as an average 149 00:12:13,000 --> 00:12:19,000 size for a chromosome. And it's a single connected 150 00:12:19,000 --> 00:12:24,000 molecule. 150 million bases long in the human is a typical chromosome. 151 00:12:24,000 --> 00:12:30,000 One tricky little bit about replicating DNA. 152 00:12:30,000 --> 00:12:34,000 Let's just think back to our little model of replicating DNA. 153 00:12:34,000 --> 00:12:38,000 Let's come to the chromosome end here. It's five prime to three 154 00:12:38,000 --> 00:12:42,000 prime. Five prime to three prime. We're going to start replicating. 155 00:12:42,000 --> 00:12:47,000 We're getting to the end of chromosome number one. 156 00:12:47,000 --> 00:12:51,000 We've got a primer here, and the primer is going to be used 157 00:12:51,000 --> 00:12:55,000 to extend, extend, extend. We get right to the end. 158 00:12:55,000 --> 00:13:00,000 That's good. Tell me how we're going to replicate back. 159 00:13:00,000 --> 00:13:04,000 We need a little primer to start it, right? And where's that primer 160 00:13:04,000 --> 00:13:09,000 going to land? Maybe over here it will start 161 00:13:09,000 --> 00:13:14,000 replicating back. Oh, boy, we haven't done this 162 00:13:14,000 --> 00:13:19,000 figure. So, what do we have to do there? So, we need to primer a 163 00:13:19,000 --> 00:13:24,000 little further back. OK. But, you know what, 164 00:13:24,000 --> 00:13:29,000 the chance that we're going to get that right at the end, 165 00:13:29,000 --> 00:13:34,000 that we're going to get a primer exactly at the end is pretty low. 
166 00:13:34,000 --> 00:13:37,000 And if we don't have a primer exactly at the end, 167 00:13:37,000 --> 00:13:41,000 what's going to be wrong with that copy of the chromosome? 168 00:13:41,000 --> 00:13:45,000 Too short. Now, big deal. So, it's short by maybe 20 bases. 169 00:13:45,000 --> 00:13:49,000 But that's just this cell division. What about next cell division? It 170 00:13:49,000 --> 00:13:53,000 will be short on average by a little bit, and then the next cell division 171 00:13:53,000 --> 00:13:57,000 and the next cell division. It's actually pretty tricky to 172 00:13:57,000 --> 00:14:01,000 replicate a linear chromosome on the lagging strand, 173 00:14:01,000 --> 00:14:05,000 unless you can land the primer in exactly the right place, 174 00:14:05,000 --> 00:14:09,000 which doesn't happen. So, a special little solution is 175 00:14:09,000 --> 00:14:15,000 used. The ends of chromosomes here are called telomeres, 176 00:14:15,000 --> 00:14:21,000 telo meaning end. These telomeres have very specific structures. 177 00:14:21,000 --> 00:14:27,000 In the human they repeat, T-T-A-G-G-G, again and 178 00:14:27,000 --> 00:14:32,000 again and again. At the end of the chromosome there's 179 00:14:32,000 --> 00:14:36,000 a special enzyme that will come along and add some extra telomere to 180 00:14:36,000 --> 00:14:41,000 the chromosome. That, sorry? Did I say leading 181 00:14:41,000 --> 00:14:46,000 strand? It's the, oh, yeah, sorry. It's the lagging, 182 00:14:46,000 --> 00:14:50,000 sorry. It's the leading strand. No, no, no, this is the lagging strand. 183 00:14:50,000 --> 00:14:55,000 This is the leading strand because it's running along happily not 184 00:14:55,000 --> 00:15:00,000 having to make a primer. The Okazaki fragment should be here. 185 00:15:00,000 --> 00:15:04,000 I'll stick by that. We'll debate it later. 186 00:15:04,000 --> 00:15:08,000 Anyway, they, we get the point. 
But it's lagging because you've got 187 00:15:08,000 --> 00:15:12,000 the Okazaki fragments there. So, anyway, we have a problem of 188 00:15:12,000 --> 00:15:16,000 replication. And the way the cell solves it is the actual replication 189 00:15:16,000 --> 00:15:20,000 is shorter, but since it manages to stick some repeat at the end of the 190 00:15:20,000 --> 00:15:24,000 chromosome it adds back some more T-T-A-G-G-G, T-T-A-G-G-G, 191 00:15:24,000 --> 00:15:29,000 T-T-A-G-G-G, and it keeps dynamically adding more. 192 00:15:29,000 --> 00:15:33,000 What do you think would happen if you didn't, or what's the enzyme 193 00:15:33,000 --> 00:15:37,000 that adds telomeres? Telomerase. Telomerase adds that. 194 00:15:37,000 --> 00:15:41,000 What cells do you think need to have active telomerase? 195 00:15:41,000 --> 00:15:45,000 Rapidly dividing cells would need to have telomerase. 196 00:15:45,000 --> 00:15:49,000 Cells that are not rapidly dividing, cells that have stopped dividing can 197 00:15:49,000 --> 00:15:53,000 shut off their telomerase. But if a cell is going to go 198 00:15:53,000 --> 00:15:57,000 through lots and lots of cell divisions it's got to, 199 00:15:57,000 --> 00:16:01,000 it's got to tidy up its telomeres each time because they're 200 00:16:01,000 --> 00:16:06,000 getting too short. You've got to have an enzyme that's 201 00:16:06,000 --> 00:16:10,000 adding back ends of chromosomes. What cells do you think 202 00:16:10,000 --> 00:16:14,000 particularly care about having telomerase on them? 203 00:16:14,000 --> 00:16:19,000 Cancers. It turns out that this is not a trivial point. 204 00:16:19,000 --> 00:16:23,000 More than 90% of cancers turn on actively the telomerase gene, 205 00:16:23,000 --> 00:16:27,000 which would be shut off in normal cells because the cell is 206 00:16:27,000 --> 00:16:32,000 not dividing anymore. 
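The end-replication problem and the telomerase fix can be sketched as simple bookkeeping on telomere length. Everything numeric here is an assumption for illustration, apart from the TTAGGG repeat and the loss of "maybe 20 bases" per division mentioned above: the starting length of 6,000 bases, the 50 divisions, and the four repeats added per division are hypothetical values, and the function names are invented.

```python
TELOMERE_REPEAT = "TTAGGG"  # human telomere repeat, as given in the lecture

def divide(telomere_length, loss_per_division=20, repeats_added=0):
    """One cell division: lose a few bases at the end, then let
    telomerase (if active) add back whole repeats."""
    shortened = max(telomere_length - loss_per_division, 0)
    return shortened + repeats_added * len(TELOMERE_REPEAT)

def simulate(start_length, divisions, repeats_added=0):
    """Telomere length after a number of divisions."""
    length = start_length
    for _ in range(divisions):
        length = divide(length, repeats_added=repeats_added)
    return length

START = 6000      # hypothetical starting telomere length in bases
DIVISIONS = 50    # hypothetical number of cell divisions

print(simulate(START, DIVISIONS))                   # 5000: telomerase off
print(simulate(START, DIVISIONS, repeats_added=4))  # 6200: telomerase on
```

With telomerase off the end shrinks steadily. With it on, the added repeats (24 bases per division in this toy setup) more than cover the 20 lost, which is the situation a rapidly dividing cell needs.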
Part of becoming a cancer is having 207 00:16:32,000 --> 00:16:36,000 to turn on this repair mechanism for the ends, this extension mechanism 208 00:16:36,000 --> 00:16:40,000 for the ends of your chromosomes. And so, various people are trying 209 00:16:40,000 --> 00:16:44,000 to make drugs to inhibit cancers by inhibiting this telomerase enzyme. 210 00:16:44,000 --> 00:16:49,000 So, understanding just your linear replication of chromosomes is a kind 211 00:16:49,000 --> 00:16:53,000 of useful thing even in dealing with things like cancer. 212 00:16:53,000 --> 00:16:57,000 Genome sizes. I mentioned, how big was the human genome? 213 00:16:57,000 --> 00:17:02,000 Three times ten to the ninth bases. The mouse genome? 214 00:17:02,000 --> 00:17:06,000 It's almost as big, about 2.7 times ten to the ninth 215 00:17:06,000 --> 00:17:11,000 bases, 2.7 billion bases. The elephant genome? I actually 216 00:17:11,000 --> 00:17:15,000 just found this out last week because we just finished sequencing 217 00:17:15,000 --> 00:17:20,000 elephant DNA, and I can now tell you I think it's 3. 218 00:17:20,000 --> 00:17:25,000 . The dog is 2. times ten to the ninth. 219 00:17:25,000 --> 00:17:29,000 Anyway, it's about, for most mammals it's pretty close 220 00:17:29,000 --> 00:17:33,000 to three billion bases. And there is some fluctuation. 221 00:17:33,000 --> 00:17:37,000 Some are a little bigger. Some are a little smaller. 222 00:17:37,000 --> 00:17:41,000 It doesn't scale with the size of the animal, though, 223 00:17:41,000 --> 00:17:45,000 because the dog has a smaller genome, for example, than the mouse does, 224 00:17:45,000 --> 00:17:48,000 but the elephant is a bit bigger than us. And check in later in the 225 00:17:48,000 --> 00:17:52,000 term, I'll tell you about the aardvark. We should know in a 226 00:17:52,000 --> 00:17:56,000 little while. But here are, for example, fruit flies. The fruit 227 00:17:56,000 --> 00:18:00,000 fly, it has a genome of two times ten to the eighth. 
228 00:18:00,000 --> 00:18:04,000 I'm giving, I'm being quite approximate. In fact, 229 00:18:04,000 --> 00:18:08,000 I'll make it, I'll give you 1.5 times ten to the eighth. 150 230 00:18:08,000 --> 00:18:12,000 million bases. Yeast, by contrast, 231 00:18:12,000 --> 00:18:17,000 has a genome of 1.2 times ten to the seventh. So, that's 12 million, 232 00:18:17,000 --> 00:18:21,000 150 million give or take, and about three billion, 233 00:18:21,000 --> 00:18:25,000 so 3,000 million. So, genome sizes can vary quite 234 00:18:25,000 --> 00:18:30,000 dramatically amongst different eukaryotes. 235 00:18:30,000 --> 00:18:36,000 Now, what about prokaryotes? How do the prokaryotes differ? 236 00:18:36,000 --> 00:18:43,000 Prokaryotes differ because their genomes are typically not linear 237 00:18:43,000 --> 00:18:49,000 chromosomes. The typical prokaryotic chromosome is a 238 00:18:49,000 --> 00:18:56,000 double-stranded circle. It's a double-stranded circular DNA. 239 00:18:56,000 --> 00:19:02,000 Now, the double-stranded circular 240 00:19:02,000 --> 00:19:07,000 DNA doesn't have this problem of telomeres. You just keep 241 00:19:07,000 --> 00:19:12,000 replicating around and you get to the end. So, there you have a much 242 00:19:12,000 --> 00:19:17,000 simpler replication system than having to worry about your ends of 243 00:19:17,000 --> 00:19:22,000 chromosomes. You also have much smaller genomes. 244 00:19:22,000 --> 00:19:27,000 The typical prokaryotic genome size, it's on the order of a few million 245 00:19:27,000 --> 00:19:31,000 bases. E. coli, 4.6 million bases. There are, for example, 246 00:19:31,000 --> 00:19:35,000 mycobacteria, such as the mycobacteria that cause 247 00:19:35,000 --> 00:19:39,000 tuberculosis or leprosy, have on the order of, well, 248 00:19:39,000 --> 00:19:43,000 actually, not quite them, but other mycobacteria have on the order of 249 00:19:43,000 --> 00:19:46,000 about a million bases or so. Mycoplasma, M. 
genitalium has 250 00:19:46,000 --> 00:19:50,000 actually slightly less than a million bases. 251 00:19:50,000 --> 00:19:54,000 So, these are basically several million bases. 252 00:19:54,000 --> 00:19:58,000 So, there's a huge variation in genome size. 253 00:19:58,000 --> 00:20:02,000 Your genome is about a thousand times bigger than E. 254 00:20:02,000 --> 00:20:07,000 coli's genome. Now, you do actually have one circular chromosome. 255 00:20:07,000 --> 00:20:11,000 Do you know what it is? I speak about the 23 pairs of human 256 00:20:11,000 --> 00:20:16,000 chromosomes. There's actually one more human chromosome. 257 00:20:16,000 --> 00:20:20,000 The mitochondria have their own chromosome. It's a circle. 258 00:20:20,000 --> 00:20:25,000 That's very odd that you would have one chromosome that's a circle that 259 00:20:25,000 --> 00:20:30,000 looks like a bacterial chromosome. Do you know why that is? 260 00:20:30,000 --> 00:20:34,000 The mitochondria arose as a symbiotic bacterium that became a 261 00:20:34,000 --> 00:20:38,000 symbiont of eukaryotic cells about 1. billion years ago. 262 00:20:38,000 --> 00:20:42,000 It was a bacterium taken up into another cell, and that's how 263 00:20:42,000 --> 00:20:46,000 eukaryotes evolved. And we can even see that little 264 00:20:46,000 --> 00:20:50,000 signature of it having been a prokaryote from the fact that it's 265 00:20:50,000 --> 00:20:54,000 got one of these circular prokaryotic looking chromosomes. 266 00:20:54,000 --> 00:20:58,000 Now, it, because it's living in your cells, has thrown out all sorts 267 00:20:58,000 --> 00:21:02,000 of genes that it doesn't need anymore because the main, 268 00:21:02,000 --> 00:21:06,000 the nucleus supplies most of the proteins. 269 00:21:06,000 --> 00:21:10,000 So, your mitochondrial genome is a circle that's a mere 16, 270 00:21:10,000 --> 00:21:14,000 00 bases long. 
It's a very small circle encoding a very limited 271 00:21:14,000 --> 00:21:18,000 number of genes, but it's, in fact, 272 00:21:18,000 --> 00:21:22,000 the residue of the bacterial symbiont that led to the formation 273 00:21:22,000 --> 00:21:26,000 of euks. Now, viruses, what do viruses have? 274 00:21:26,000 --> 00:21:30,000 Do they have double-stranded linear chromosomes? Which is it? 275 00:21:30,000 --> 00:21:38,000 Is it double-stranded linear DNA or is it double-stranded circular DNA? 276 00:21:38,000 --> 00:21:46,000 Circular DNA. So, who votes for linear? Who votes for circular? 277 00:21:46,000 --> 00:21:54,000 Who's undecided? Ah, the undecided are very large here. So, 278 00:21:54,000 --> 00:22:01,000 the answer is both. Some viruses have double-stranded 279 00:22:01,000 --> 00:22:07,000 linear DNA. Some viruses have double-stranded circular DNA. 280 00:22:07,000 --> 00:22:14,000 It's worse than that, though. Some viruses have single-stranded 281 00:22:14,000 --> 00:22:20,000 linear, circular DNA. Ha? How does that work? 282 00:22:20,000 --> 00:22:26,000 Some viruses actually infect the cell injecting DNA, 283 00:22:26,000 --> 00:22:32,000 and it's just single-stranded. As soon as it gets into the cell, 284 00:22:32,000 --> 00:22:36,000 however, it's replicated to make a double-stranded DNA which can then 285 00:22:36,000 --> 00:22:41,000 be transcribed, et cetera, et cetera, 286 00:22:41,000 --> 00:22:46,000 et cetera. But it travels around as a single-stranded piece of DNA. 287 00:22:46,000 --> 00:22:50,000 And it's actually weirder than that. Some viruses, 288 00:22:50,000 --> 00:22:55,000 viruses being very small can experiment with all 289 00:22:55,000 --> 00:23:02,000 sorts of things. Some viruses actually consist not of 290 00:23:02,000 --> 00:23:10,000 DNA at all but of RNA, single-stranded RNA. How does it do 291 00:23:10,000 --> 00:23:18,000 that? 
So, in other words, in the capsid there's a single 292 00:23:18,000 --> 00:23:26,000 strand of RNA. When it gets into the cell, 293 00:23:26,000 --> 00:23:32,000 what does it do? Sorry? It creates DNA. 294 00:23:32,000 --> 00:23:36,000 How does it create DNA? From the RNA. How's it going to do 295 00:23:36,000 --> 00:23:41,000 that? Well, how is it going to turn itself into DNA? 296 00:23:41,000 --> 00:23:46,000 It needs an enzyme to do that? Reverse transcriptase. You would 297 00:23:46,000 --> 00:23:50,000 like to reverse the transcription process, and you would like to name 298 00:23:50,000 --> 00:23:55,000 that reverse transcriptase. And where are you going to get this 299 00:23:55,000 --> 00:24:00,000 reverse transcriptase from? Laying around. Laying around where? 300 00:24:00,000 --> 00:24:04,000 I mean the cell is just sitting there with reverse transcriptase 301 00:24:04,000 --> 00:24:08,000 waiting to obligingly reverse transcribe this virus? 302 00:24:08,000 --> 00:24:12,000 Your RNA. So make it how? With ribosomes. So, in other words, 303 00:24:12,000 --> 00:24:17,000 if I'm an RNA, why don't I encode the sequence for 304 00:24:17,000 --> 00:24:21,000 reverse transcriptase and actually translate myself. 305 00:24:21,000 --> 00:24:25,000 So, if you were really clever, you might decide to put in the 306 00:24:25,000 --> 00:24:30,000 genetic code for reverse transcriptase. 307 00:24:30,000 --> 00:24:36,000 And when that message gets into the cell, it will first act as an mRNA, 308 00:24:36,000 --> 00:24:42,000 a messenger RNA, translate, make, here's the reverse transcriptase 309 00:24:42,000 --> 00:24:48,000 enzyme, which is then going to go, and it's going to reverse transcribe 310 00:24:48,000 --> 00:24:54,000 this thing into, say, DNA. So, wow. 311 00:24:54,000 --> 00:25:00,000 Now, that's a good one. This is a plus strand virus. 312 00:25:00,000 --> 00:25:05,000 It encodes its own reverse transcriptase in its instructions. 
313 00:25:05,000 --> 00:25:10,000 There actually are minus-strand viruses that don't, 314 00:25:10,000 --> 00:25:15,000 but what they do is instead in their own code, in their own package bring 315 00:25:15,000 --> 00:25:20,000 along reverse transcriptase. So, either you can encode your own 316 00:25:20,000 --> 00:25:25,000 reverse transcriptase or in the package you can include your own 317 00:25:25,000 --> 00:25:30,000 reverse transcriptase. Do you know any viruses? 318 00:25:30,000 --> 00:25:36,000 And then the reverse transcriptase is then used to transcribe the DNA, 319 00:25:36,000 --> 00:25:43,000 the RNA into DNA, and eventually into a double-stranded DNA which, 320 00:25:43,000 --> 00:25:49,000 in some of the viruses, can then be slammed into and inserted into your 321 00:25:49,000 --> 00:25:56,000 own chromosomes. So, a DNA copy of the virus can be 322 00:25:56,000 --> 00:26:03,000 installed into your own chromosomes, which is somewhat insidious. 323 00:26:03,000 --> 00:26:08,000 What viruses do you know that do this? HIV. More generally 324 00:26:08,000 --> 00:26:13,000 retroviruses are the class of these viruses that can, 325 00:26:13,000 --> 00:26:19,000 in fact, run this replication process from RNA to DNA and install 326 00:26:19,000 --> 00:26:24,000 DNA copies of them in your genome. And how do you then get the DNA 327 00:26:24,000 --> 00:26:30,000 copy out of your genome? You don't. 328 00:26:30,000 --> 00:26:33,000 It doesn't come out. Retroviral insertions don't come 329 00:26:33,000 --> 00:26:37,000 out. That's one of the issues in dealing with AIDS is once this DNA 330 00:26:37,000 --> 00:26:40,000 copy is in a cell it's not coming out. We have no way to remove it. 
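The RNA to single-stranded DNA to double-stranded DNA path just described can be sketched at the level of sequence only. This is a toy model that assumes plain base pairing and says nothing about how reverse transcriptase actually works as an enzyme. The function names and the short example RNA are made up for this illustration.

```python
RNA_TO_DNA = {"A": "T", "U": "A", "G": "C", "C": "G"}
DNA_TO_DNA = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_transcribe(rna):
    """First DNA strand, written 5' to 3'.

    It is complementary and antiparallel to the RNA template,
    so the template is read backwards."""
    return "".join(RNA_TO_DNA[base] for base in reversed(rna))

def second_strand(dna):
    """Complementary DNA strand, written 5' to 3'."""
    return "".join(DNA_TO_DNA[base] for base in reversed(dna))

viral_rna = "AUGAAG"                     # hypothetical fragment
first = reverse_transcribe(viral_rna)    # DNA copy of the RNA
second = second_strand(first)            # completes the double strand

print(first, second)  # CTTCAT ATGAAG
```

The second strand carries the same sequence as the original RNA with T in place of U, which is why the double-stranded DNA copy can later be transcribed back into viral RNA.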
331 00:26:40,000 --> 00:26:44,000 We have to make sure that the virus is shut down by other mechanisms 332 00:26:44,000 --> 00:26:47,000 that might inhibit its products, et cetera, but once it's stuck a DNA 333 00:26:47,000 --> 00:26:51,000 copy into your chromosomes, you know, there's no way of getting 334 00:26:51,000 --> 00:26:55,000 it out. So, if we had to try to inhibit the 335 00:26:55,000 --> 00:27:00,000 action of the AIDS virus, we might wish to make inhibitors of 336 00:27:00,000 --> 00:27:05,000 this aspect of replication, inhibitors of reverse transcription. 337 00:27:05,000 --> 00:27:10,000 And, of course, as probably many of you know, some of the important AIDS 338 00:27:10,000 --> 00:27:15,000 drugs are reverse transcriptase inhibitors, very important to 339 00:27:15,000 --> 00:27:20,000 limiting the replication of the AIDS virus. And there are many other 340 00:27:20,000 --> 00:27:25,000 kinds of weirdnesses. Viruses pretty much explore, 341 00:27:25,000 --> 00:27:30,000 everything you possibly can do, viruses come up with ways to do. 342 00:27:30,000 --> 00:27:35,000 Let's take now the process of transcription. 343 00:27:35,000 --> 00:27:40,000 We have replication up there. Let's look at transcription. And 344 00:27:40,000 --> 00:27:45,000 this time let's start with prokaryotes. For the simple aspect 345 00:27:45,000 --> 00:27:50,000 of transcribing genes, the prokaryotic genome looks just 346 00:27:50,000 --> 00:27:55,000 like the simple model I gave you. There is some kind of a promoter 347 00:27:55,000 --> 00:28:00,000 that tells RNA polymerase to come sit down here. 348 00:28:00,000 --> 00:28:07,000 RNA polymerase hops on, RNA polymerase begins to copy in RNA, 349 00:28:07,000 --> 00:28:15,000 and eventually it hits the signal that says to terminate transcription. 350 00:28:15,000 --> 00:28:22,000 OK. This is not a stop codon which is about translation. 351 00:28:22,000 --> 00:28:30,000 This is a termination of transcription. 
352 00:28:30,000 --> 00:28:35,000 And this RNA then goes off. A perfectly happy thing, a 353 00:28:35,000 --> 00:28:40,000 messenger RNA, mRNA. So, there's nothing weird 354 00:28:40,000 --> 00:28:46,000 about proks compared to the simple description that we gave before. 355 00:28:46,000 --> 00:28:51,000 But eukaryotes are different. There are some funny things that 356 00:28:51,000 --> 00:28:57,000 happen in the eukaryote. Well, first off it starts the same. 357 00:28:57,000 --> 00:29:03,000 There's a promoter. RNA polymerase sits down there, 358 00:29:03,000 --> 00:29:09,000 it starts transcribing, it makes an mRNA, it hits the transcriptional 359 00:29:09,000 --> 00:29:16,000 termination signal, it stops, and then this RNA gets 360 00:29:16,000 --> 00:29:22,000 processed in interesting ways. The first thing that happens is 361 00:29:22,000 --> 00:29:29,000 three modifications happen. The first one is at the five prime 362 00:29:29,000 --> 00:29:35,000 end, remember five prime to three prime, a funny modification is put 363 00:29:35,000 --> 00:29:41,000 on. It's a, if the message, say, were A-U-C-U-G-G-C et cetera, 364 00:29:41,000 --> 00:29:47,000 a G triphosphate is put on backwards. It's actually a methyl G 365 00:29:47,000 --> 00:29:53,000 triphosphate that is put on backwards, so going in the other direction. 366 00:29:53,000 --> 00:30:00,000 You have the triphosphate bond there, a methyl G. 367 00:30:00,000 --> 00:30:04,000 And the only thing that you should care about there, 368 00:30:04,000 --> 00:30:09,000 I don't care if you know the structure, is that there's a funny 369 00:30:09,000 --> 00:30:13,000 cap. This thing is called a cap that is put on this message. 370 00:30:13,000 --> 00:30:18,000 And that cap is very important to signaling to the cell this is a 371 00:30:18,000 --> 00:30:23,000 messenger RNA to be dealt with in a certain way, to get the ribosome to 372 00:30:23,000 --> 00:30:27,000 hop on, to get this thing processed properly, et cetera.
373 00:30:27,000 --> 00:30:32,000 At the other end of the message a long string of As is added 374 00:30:32,000 --> 00:30:37,000 to messenger RNAs. This long string of As is called, 375 00:30:37,000 --> 00:30:41,000 very sensibly, a poly A tail. The poly A tail is added to the 376 00:30:41,000 --> 00:30:46,000 messenger RNA, and very often, 377 00:30:46,000 --> 00:30:51,000 I mean it's, if you wanted to purify messenger RNAs from your own human 378 00:30:51,000 --> 00:30:55,000 cells, you can actually use poly T as a reagent because it turns out, 379 00:30:55,000 --> 00:31:00,000 because messenger RNAs have a poly A tail, they'll bind to 380 00:31:00,000 --> 00:31:04,000 and stick to poly T. So, people actually purify messenger 381 00:31:04,000 --> 00:31:08,000 RNAs by binding them to poly T and they get the poly A tail. 382 00:31:08,000 --> 00:31:11,000 But it is broadly believed that the reason for this poly A tail is not 383 00:31:11,000 --> 00:31:15,000 to make things convenient for molecular biologists to purify 384 00:31:15,000 --> 00:31:18,000 messages. To the contrary, it has an important function for the 385 00:31:18,000 --> 00:31:22,000 cell. And it turns out that this is important in regulating the 386 00:31:22,000 --> 00:31:25,000 stability of messages. If, in fact, you don't have a poly 387 00:31:25,000 --> 00:31:29,000 A tail, if you contrive to make the same message without the poly A tail, 388 00:31:29,000 --> 00:31:33,000 the message will be degraded rather rapidly. 389 00:31:33,000 --> 00:31:36,000 And the lengths of the poly A tails control aspects of the degradation, 390 00:31:36,000 --> 00:31:39,000 et cetera. So, in a complex eukaryotic cell, 391 00:31:39,000 --> 00:31:43,000 already it has to attach a little signal at the front, 392 00:31:43,000 --> 00:31:46,000 some signals at the back that say process me in a certain way, 393 00:31:46,000 --> 00:31:49,000 et cetera, don't degrade me yet.
You could even imagine that this 394 00:31:49,000 --> 00:31:53,000 poly A tail could serve as a little bit of a clock for how long that 395 00:31:53,000 --> 00:31:56,000 message sticks around. It's not quite that simple but 396 00:31:56,000 --> 00:31:59,000 there are ways to do it. But all of these pale in comparison 397 00:31:59,000 --> 00:32:03,000 to the third way in which eukaryotic messages differ from prokaryotic 398 00:32:03,000 --> 00:32:10,000 messages. These small modifications are, 399 00:32:10,000 --> 00:32:21,000 as I say, small. The most striking way in which they differ is that 400 00:32:21,000 --> 00:32:33,000 only a small portion often of the gene, here's my gene, 401 00:32:33,000 --> 00:32:44,000 matters for the protein that is made. So, my mRNA gets made. 402 00:32:44,000 --> 00:32:54,000 It includes the whole long sequence. And then the cell comes along and 403 00:32:54,000 --> 00:33:05,000 splices this message together. So, this is the immature RNA. 404 00:33:05,000 --> 00:33:13,000 It is processed by clipping out this, clipping out this, 405 00:33:13,000 --> 00:33:22,000 clipping out this. And what you get is a splice where the mature message 406 00:33:22,000 --> 00:33:30,000 throws this stuff out, splices between here and here, 407 00:33:30,000 --> 00:33:39,000 splices here, splices here, splices here, and you get a 408 00:33:39,000 --> 00:33:47,000 much shorter mRNA. And this is a mature mRNA. 409 00:33:47,000 --> 00:33:53,000 This splicing is a remarkable phenomenon. In fact, 410 00:33:53,000 --> 00:34:00,000 it was discovered by Phil Sharp here, for which he won a Nobel prize. 411 00:34:00,000 --> 00:34:04,000 This splicing is a very complex operation. First off, 412 00:34:04,000 --> 00:34:08,000 how does, well, actually, what accomplishes splicing? 413 00:34:08,000 --> 00:34:12,000 It should be splicase, right? But it turns out it's not a single 414 00:34:12,000 --> 00:34:17,000 enzyme. It's a big body of stuff. 
So, instead it's the spliceosome, OK. 415 00:34:17,000 --> 00:34:21,000 Everything is either ase or some or something like that. 416 00:34:21,000 --> 00:34:25,000 So, it turns out it's the spliceosome that does that. 417 00:34:25,000 --> 00:34:30,000 It's just wonderful how all those names work out. The spliceosome. 418 00:34:30,000 --> 00:34:36,000 The spliceosome comes along and splices it. How does the spliceosome 419 00:34:36,000 --> 00:34:42,000 know how to do this? Well, there are kind of codes. 420 00:34:42,000 --> 00:34:48,000 It turns out that there is some information encoded along in these 421 00:34:48,000 --> 00:34:54,000 messages. It turns out that there are, you know, slight biases. 422 00:34:54,000 --> 00:35:00,000 Typically the sequence just after where the splice starts here is a GU 423 00:35:00,000 --> 00:35:06,000 and the sequence here is an AG, but that's obviously not enough 424 00:35:06,000 --> 00:35:10,000 information, right? It's not enough bases of information 425 00:35:10,000 --> 00:35:14,000 to get this right. And so there are a few more 426 00:35:14,000 --> 00:35:18,000 preferences for what bases to use, but the truth is we don't fully know. 427 00:35:18,000 --> 00:35:21,000 Our best picture right now involves some cellular factors helping 428 00:35:21,000 --> 00:35:25,000 recognize the parts that are supposed to stay in, some sequences 429 00:35:25,000 --> 00:35:29,000 here. But the truth is we don't have the simple codes. 430 00:35:29,000 --> 00:35:33,000 Because if we had the simple codes, we'd be able to take a long stretch 431 00:35:33,000 --> 00:35:37,000 of DNA and figure out exactly where the splices go based on just 432 00:35:37,000 --> 00:35:42,000 computer analysis. And we can't do that so well. 433 00:35:42,000 --> 00:35:46,000 These bits that stay in are called exons. The bits that go out are 434 00:35:46,000 --> 00:35:51,000 called introns.
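[Editor's sketch] The clipping-out step can be pictured as deleting marked intervals from a string and joining what is left. This is a minimal toy that assumes the intron positions are already known, which, as the lecture stresses, is the hard part and cannot be read off from GU and AG alone. The sequences and the `splice` helper are made up for illustration.

```python
# Toy sketch of splicing: remove known introns and join the exons.
# All sequences and names here are illustrative assumptions.
def splice(pre_mrna, introns):
    """Remove introns given as (start, end) half-open index pairs."""
    pieces = []
    position = 0
    for start, end in sorted(introns):
        pieces.append(pre_mrna[position:start])  # keep the exon before this intron
        position = end                           # skip over the intron
    pieces.append(pre_mrna[position:])           # keep the final exon
    return "".join(pieces)


# Build an immature RNA from alternating exons and introns, tracking coordinates.
segments = [
    ("exon", "AUGAAG"),
    ("intron", "GUAAGUCCUAG"),   # starts with GU, ends with AG
    ("exon", "AGGCCC"),
    ("intron", "GUGAGCUUCAG"),   # starts with GU, ends with AG
    ("exon", "GAAUAG"),
]
pre_mrna = ""
intron_coords = []
for kind, seq in segments:
    if kind == "intron":
        intron_coords.append((len(pre_mrna), len(pre_mrna) + len(seq)))
    pre_mrna += seq

mature_mrna = splice(pre_mrna, intron_coords)
print(len(pre_mrna), "bases before splicing")
print(len(mature_mrna), "bases after splicing:", mature_mrna)
```

Here the toy introns begin with GU and end with AG, matching the bias mentioned above, but the code never uses that fact to find them. A real predictor would need far more signal than those two dinucleotides.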
This is the source of extraordinary 435 00:35:51,000 --> 00:35:55,000 confusion for students because you might think that the bits that are 436 00:35:55,000 --> 00:36:00,000 excised are the exons, but they're not. 437 00:36:00,000 --> 00:36:04,000 The bits that stay in are the exons. Why are they called exons if they 438 00:36:04,000 --> 00:36:08,000 stay in and ex is a prefix meaning out? Well, because the introns are 439 00:36:08,000 --> 00:36:12,000 named because they're intervening sequences. Once the introns, 440 00:36:12,000 --> 00:36:17,000 the intervening sequences were named as intervening sequences or introns, 441 00:36:17,000 --> 00:36:21,000 you were stuck then having to name the things that stay in as exons. 442 00:36:21,000 --> 00:36:25,000 This was all done by a Harvard professor, don't blame me. 443 00:36:25,000 --> 00:36:30,000 In any case, a good friend Harvard professor. 444 00:36:30,000 --> 00:36:37,000 But, nonetheless, I'm not sure that this was the best 445 00:36:37,000 --> 00:36:44,000 way to name them. But you're stuck with it. 446 00:36:44,000 --> 00:36:52,000 So, for a typical human gene, typical human gene, the length of 447 00:36:52,000 --> 00:36:59,000 the gene itself might be 30,000 bases. But the mature RNA, 448 00:36:59,000 --> 00:37:07,000 the mature mRNA might be one and a half, 1,500 bases. 449 00:37:07,000 --> 00:37:11,000 That's remarkable. Out of 30,000 letters in the 450 00:37:11,000 --> 00:37:15,000 initial transcript that is made, the gene starts at the promoter, and 451 00:37:15,000 --> 00:37:19,000 the transcription will stop 30,000 bases away. The cell goes 452 00:37:19,000 --> 00:37:24,000 through the trouble of making an RNA 30,000 bases long, 453 00:37:24,000 --> 00:37:28,000 and then it trims it down by throwing out 28,500 454 00:37:28,000 --> 00:37:33,000 of the bases, keeping only 1,500 bases at the end.
455 00:37:33,000 --> 00:37:37,000 Now, this may seem profligate but it ain't nothing compared to some 456 00:37:37,000 --> 00:37:42,000 extreme cases. The clotting factor gene, 457 00:37:42,000 --> 00:37:47,000 the factor 8 gene, the gene that is mutated in individuals with 458 00:37:47,000 --> 00:37:52,000 hemophilia, that gene is 200,000 bases long, and it gets spliced 459 00:37:52,000 --> 00:37:57,000 down to a mere 10,000 bases. 190,000 bases are thrown 460 00:37:57,000 --> 00:38:02,000 away. But that's nothing compared to 461 00:38:02,000 --> 00:38:08,000 Duchenne muscular dystrophy. The Duchenne muscular dystrophy gene is 462 00:38:08,000 --> 00:38:13,000 the all time winner. That gene makes an immature initial 463 00:38:13,000 --> 00:38:19,000 RNA of 2 million bases. RNA polymerase hops on at the 464 00:38:19,000 --> 00:38:24,000 promoter and it gets off at the end of the Boston Marathon on here 2 465 00:38:24,000 --> 00:38:30,000 million bases later having made an RNA 2 million bases long. 466 00:38:30,000 --> 00:38:36,000 Calculate the speed of RNA polymerase and you'll find out that 467 00:38:36,000 --> 00:38:42,000 it's at it for hours. It hops on and it stays on for 468 00:38:42,000 --> 00:38:48,000 hours until it gets to the other end. And then for all its troubles this 469 00:38:48,000 --> 00:38:54,000 gene is spliced down to 16, 00 bases in the mature message. 470 00:38:54,000 --> 00:39:00,000 Yup? How would it increase the chance of mutations? Yup. 471 00:39:00,000 --> 00:39:05,000 So, splicing mutations could be a problem. Some diseases could arise 472 00:39:05,000 --> 00:39:10,000 from errors in splicing. Do you think that happens? 473 00:39:10,000 --> 00:39:15,000 Sure does. There could be mutations that create, 474 00:39:15,000 --> 00:39:20,000 that change a splicing, or mutations that create a new 475 00:39:20,000 --> 00:39:25,000 splicing, and all of that could screw up the gene. 476 00:39:25,000 --> 00:39:30,000 Why do this?
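[Editor's sketch] The arithmetic behind these gene sizes is easy to check. The lengths below are the ones quoted in the lecture (30,000 to 1,500 bases for a typical gene; 200,000 minus the 190,000 thrown away for factor 8; a 2 million base initial RNA). The polymerase speed is NOT given in the lecture: `ASSUMED_BASES_PER_SECOND` is a placeholder assumption, so substitute whatever rate you trust.

```python
# Back-of-the-envelope sketch; the elongation rate is an assumed placeholder.
def fraction_kept(gene_length, mature_length):
    """Fraction of the initial transcript that survives splicing."""
    return mature_length / gene_length


def transcription_hours(transcript_length, bases_per_second):
    """Hours needed to transcribe a gene at a constant elongation rate."""
    return transcript_length / bases_per_second / 3600


typical_fraction = fraction_kept(30_000, 1_500)
factor8_fraction = fraction_kept(200_000, 10_000)

ASSUMED_BASES_PER_SECOND = 40  # assumption for illustration, not from the lecture
long_gene_hours = transcription_hours(2_000_000, ASSUMED_BASES_PER_SECOND)

print(f"typical gene keeps {typical_fraction:.0%} of its transcript")
print(f"factor 8 gene keeps {factor8_fraction:.0%} of its transcript")
print(f"2 million bases at {ASSUMED_BASES_PER_SECOND} bases/s: about {long_gene_hours:.1f} hours")
```

Under that assumed rate the polymerase is on the gene for roughly half a day, which is consistent with the claim that it is "at it for hours". A different rate changes the number but not the order of magnitude.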
What in the world is going on? 477 00:39:30,000 --> 00:39:35,000 Just think about the energetic cost. I mean count up the ATPs involved 478 00:39:35,000 --> 00:39:40,000 in synthesizing a nucleotide, and then the ATPs involved in adding 479 00:39:40,000 --> 00:39:45,000 nucleotides up. You know, think about this totally 480 00:39:45,000 --> 00:39:50,000 wasted energy. What is the point? 481 00:39:50,000 --> 00:39:55,000 I might be able to encode multiple proteins with the same gene. 482 00:39:55,000 --> 00:40:00,000 How would I do that? Ooh, wouldn't that be clever? 483 00:40:00,000 --> 00:40:05,000 I might be able to take a single gene and make a mix and match 484 00:40:05,000 --> 00:40:10,000 product. It might be, do you mean like one type of cell 485 00:40:10,000 --> 00:40:15,000 might splice up that message one way to produce a certain protein, 486 00:40:15,000 --> 00:40:20,000 but a different cell type might splice the same gene another way to 487 00:40:20,000 --> 00:40:25,000 produce a different protein? Ooh. So, you're proposing, if I 488 00:40:25,000 --> 00:40:30,000 understand you correctly, alternative splicing. 489 00:40:30,000 --> 00:40:34,000 Alternative splicing could create multiple proteins, 490 00:40:34,000 --> 00:40:38,000 multiple distinct proteins. It might be, for example, that you 491 00:40:38,000 --> 00:40:42,000 might make one protein that has a cytoplasmic tail and another protein 492 00:40:42,000 --> 00:40:46,000 that doesn't have a cytoplasmic tail or a different tail or, 493 00:40:46,000 --> 00:40:50,000 or, this is true. This actually happens. It's very clever. 494 00:40:50,000 --> 00:40:54,000 Anything that can happen does happen somewhere, 495 00:40:54,000 --> 00:40:58,000 and it's fairly regularly used. A typical gene in the human being 496 00:40:58,000 --> 00:41:02,000 has at least two alternative splice forms, on average.
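[Editor's sketch] The mix-and-match idea can be illustrated by treating some exons as optional and listing every message that results. This is a minimal toy under a strong assumption (each optional exon is kept or skipped independently); real splice choices are regulated and are not this free. The exon labels and the `isoforms` helper are hypothetical.

```python
# Toy sketch of alternative splicing: enumerate messages from optional exons.
# Exon labels and the independence assumption are illustrative only.
from itertools import product


def isoforms(exons, optional):
    """List every exon combination when the exons at `optional` indices may be skipped."""
    choices = [(True, False) if i in optional else (True,) for i in range(len(exons))]
    forms = []
    for keep in product(*choices):
        forms.append("-".join(exon for exon, kept in zip(exons, keep) if kept))
    return forms


exons = ["E1", "E2", "E3_cytoplasmic_tail"]
forms = isoforms(exons, optional={2})   # only the tail exon is optional

for form in forms:
    print(form)
```

With one optional exon the gene yields two messages, one with the tail and one without, as in the cytoplasmic tail example. In this toy model k independently optional exons give 2**k forms.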
497 00:41:02,000 --> 00:41:05,000 Most, many don't, but there are some that have large 498 00:41:05,000 --> 00:41:08,000 numbers. The most extreme is there's a gene known 499 00:41:08,000 --> 00:41:11,000 in Drosophila that has more than a thousand alternative splice forms. 500 00:41:11,000 --> 00:41:15,000 How does it know, how does the cell know whether to splice it one way in 501 00:41:15,000 --> 00:41:18,000 the liver and one way in a heart or something? We don't fully know but 502 00:41:18,000 --> 00:41:21,000 there's machinery and signals people are trying to work out for that. 503 00:41:21,000 --> 00:41:25,000 Now, I don't want to confuse you too much about it. 504 00:41:25,000 --> 00:41:28,000 You know, mostly, when we give you a gene, you should think about it 505 00:41:28,000 --> 00:41:31,000 spliced out introns, exons. But the truth is it is more 506 00:41:31,000 --> 00:41:35,000 complicated than that. There can be alternative splicing 507 00:41:35,000 --> 00:41:39,000 that allows genes to be used in multiple ways. 508 00:41:39,000 --> 00:41:43,000 Sometimes they don't make multiple proteins. They may splice into 509 00:41:43,000 --> 00:41:46,000 portions of the mRNA that are not translated, but, 510 00:41:46,000 --> 00:41:50,000 there is that, but, boy, it's a huge amount of overhead 511 00:41:50,000 --> 00:41:54,000 here just to do that. Is it justified? Yes? 512 00:41:54,000 --> 00:41:58,000 That is by computer if I just gave you the sequence? Not quite. 513 00:41:58,000 --> 00:42:01,000 Almost. Maybe. Sort of. It turns out that the 514 00:42:01,000 --> 00:42:04,000 computer programs for automatically recognizing the matter of the human 515 00:42:04,000 --> 00:42:07,000 genome are sort of, they're mediocre, not very good.
516 00:42:07,000 --> 00:42:09,000 We have some idea of the signals, and various people have been trying to 517 00:42:09,000 --> 00:42:12,000 write better and better algorithms for doing that, 518 00:42:12,000 --> 00:42:15,000 but the cell knows what it's doing and we don't fully know, 519 00:42:15,000 --> 00:42:18,000 as evidenced by the fact that we can't write a clean computer program 520 00:42:18,000 --> 00:42:21,000 to do it yet. We need to get information from the cell or from 521 00:42:21,000 --> 00:42:24,000 evolution or various other things like that, and that's the ultimate 522 00:42:24,000 --> 00:42:27,000 test. If we knew what we were talking about we'd just be able to 523 00:42:27,000 --> 00:42:30,000 write a computer program and splice it out. 524 00:42:30,000 --> 00:42:34,000 And we don't. There's another reason why people think these big 525 00:42:34,000 --> 00:42:38,000 introns and exons, these big introns are helpful, 526 00:42:38,000 --> 00:42:42,000 and that is an evolutionary reason. The evolutionary reason is a little 527 00:42:42,000 --> 00:42:47,000 bit harder to follow, but let me try it on you. 528 00:42:47,000 --> 00:42:51,000 Suppose a random event happens and a chromosome breaks, 529 00:42:51,000 --> 00:42:55,000 that happens, and suppose a random breakage sticks one part of a 530 00:42:55,000 --> 00:43:00,000 chromosome to some other part of the chromosome. 531 00:43:00,000 --> 00:43:04,000 If it lands smack dab in the middle of the coding sequence of a gene 532 00:43:04,000 --> 00:43:09,000 that's bad news. But it turns out that if it lands 533 00:43:09,000 --> 00:43:13,000 in the introns of two different genes and sticks them together it 534 00:43:13,000 --> 00:43:18,000 could make a new gene that would still work.
By having a random 535 00:43:18,000 --> 00:43:23,000 break between two genes in their introns and slamming them together, 536 00:43:23,000 --> 00:43:27,000 you could make a gene that had a bunch of exons from one gene and a 537 00:43:27,000 --> 00:43:32,000 bunch of exons from another gene. And this intervening sequence in the 538 00:43:32,000 --> 00:43:38,000 middle and it would get spliced up. Evolution might like that because 539 00:43:38,000 --> 00:43:44,000 it would be a very easy way for evolution to build new genes that 540 00:43:44,000 --> 00:43:49,000 had a portion of one protein and a portion of another protein. 541 00:43:49,000 --> 00:43:55,000 This kind of mix and match domain swapping could be very useful. 542 00:43:55,000 --> 00:44:01,000 And when we look across genomes, we see lots and lots of examples of 543 00:44:01,000 --> 00:44:07,000 genes that have a similar first half but different second halves. 544 00:44:07,000 --> 00:44:10,000 Or have some portion in the middle, a domain that we recognize, that we 545 00:44:10,000 --> 00:44:13,000 see in multiple proteins. And so, in fact, an argument for 546 00:44:13,000 --> 00:44:17,000 why we have all of this intronic DNA, one that's impossible to prove but 547 00:44:17,000 --> 00:44:20,000 is an argument is that from an evolution point of view, 548 00:44:20,000 --> 00:44:24,000 this allows a great deal of evolutionary innovation. 549 00:44:24,000 --> 00:44:27,000 You have to be careful that you say those organisms that have this extra 550 00:44:27,000 --> 00:44:31,000 space are able to mix and match and create more new kinds of combination 551 00:44:31,000 --> 00:44:34,000 proteins, et cetera, et cetera, and therefore survived 552 00:44:34,000 --> 00:44:38,000 better, et cetera, et cetera, et cetera. 553 00:44:38,000 --> 00:44:41,000 Why don't bacteria have this? Sorry? They're not as complicated. 
554 00:44:41,000 --> 00:44:45,000 That's one thought: we can take a 555 00:44:45,000 --> 00:44:48,000 sort of condescending attitude to these bacteria. They're not very, 556 00:44:48,000 --> 00:44:52,000 they're just not so complicated. There's another point of view which 557 00:44:52,000 --> 00:44:56,000 is bacteria are far more sophisticated than we are because 558 00:44:56,000 --> 00:45:00,000 they're under incredibly rigorous evolutionary selection. 559 00:45:00,000 --> 00:45:04,000 You might argue that if I'm a bacteria, can I really afford all 560 00:45:04,000 --> 00:45:08,000 this extra DNA? Now, the metabolic cost of all that 561 00:45:08,000 --> 00:45:12,000 extra DNA is huge to a bacteria which competes on replication. 562 00:45:12,000 --> 00:45:16,000 It's got to divide every 20 minutes, and trying to put in all these extra 563 00:45:16,000 --> 00:45:20,000 bases would be very costly. So, you might imagine, just to, 564 00:45:20,000 --> 00:45:24,000 you know, stand things on their head, that early life all had introns and 565 00:45:24,000 --> 00:45:28,000 bacteria, in the process of competing to be more and more 566 00:45:28,000 --> 00:45:31,000 efficient, got rid of their introns. There's actually a large camp of 567 00:45:31,000 --> 00:45:34,000 people who think it went that way, that early life evolved with introns, 568 00:45:34,000 --> 00:45:37,000 and then bacteria, in the pressure to compete, 569 00:45:37,000 --> 00:45:41,000 got rid of them. And there's some evidence to support that. 570 00:45:41,000 --> 00:45:44,000 Bacteria don't have introns. Small eukaryotes like yeast that 571 00:45:44,000 --> 00:45:47,000 sort of do compete on replication have some introns, 572 00:45:47,000 --> 00:45:50,000 but a small number. There are only about 250 introns in 573 00:45:50,000 --> 00:45:53,000 yeast. Only about 5% of the genes have an intron and they're small. 574 00:45:53,000 --> 00:45:57,000 Bigger eukaryotes have bigger introns.
And the bigger you get, 575 00:45:57,000 --> 00:46:00,000 often on average the bigger the genome sizes are the more 576 00:46:00,000 --> 00:46:03,000 you can tolerate it. And so I actually think, 577 00:46:03,000 --> 00:46:07,000 I actually probably favor this notion that introns were the 578 00:46:07,000 --> 00:46:10,000 original state and they've been gotten rid of. 579 00:46:10,000 --> 00:46:14,000 And the more pressure you're under to replicate rapidly the less you 580 00:46:14,000 --> 00:46:17,000 can tolerate this interesting and complicated innovation. 581 00:46:17,000 --> 00:46:21,000 Anyway, that's another way that things differ. 582 00:46:21,000 --> 00:46:24,000 And then, finally, viruses can do it either way. 583 00:46:24,000 --> 00:46:28,000 Viruses, depending on whether they are prokaryotic viruses or 584 00:46:28,000 --> 00:46:31,000 eukaryotic viruses, are able to replicate, 585 00:46:31,000 --> 00:46:35,000 are able to either do or don't have splicing. 586 00:46:35,000 --> 00:46:41,000 Last topic. Translation. Here eukaryotes are relatively 587 00:46:41,000 --> 00:46:48,000 simple. You get a message, you get a gene, you get an mRNA. 588 00:46:48,000 --> 00:46:54,000 The mRNA goes to the ribosome. Here's a ribosome. 589 00:46:54,000 --> 00:47:01,000 The ribosome goes to the mRNA, actually, and it starts turning out 590 00:47:01,000 --> 00:47:09,000 one protein as it chugs along. Prokaryotes differ in an interesting 591 00:47:09,000 --> 00:47:18,000 way. I get a promoter here that is transcribed into my mRNA, 592 00:47:18,000 --> 00:47:27,000 but it turns out that the mRNA can encode multiple independent proteins, 593 00:47:27,000 --> 00:47:36,000 protein one, protein two, protein three on the same mRNA. 594 00:47:36,000 --> 00:47:40,000 And a ribosome will hop on here and synthesize this one. 
595 00:47:40,000 --> 00:47:44,000 A ribosome will hop on here and synthesize this one, 596 00:47:44,000 --> 00:47:48,000 and a ribosome will hop on here and synthesize that one. 597 00:47:48,000 --> 00:47:52,000 And you have what is called a polycistronic message. 598 00:47:52,000 --> 00:47:56,000 Poly, many. Cistronic, cistrons were an old name for coding 599 00:47:56,000 --> 00:48:00,000 regions of genes here. Polycistronic messages. 600 00:48:00,000 --> 00:48:03,000 Why would you want to do that, have a single mRNA that encodes 601 00:48:03,000 --> 00:48:06,000 multiple distinct proteins, each starting with its own ribosome 602 00:48:06,000 --> 00:48:09,000 start site there? Efficiency. Maybe, 603 00:48:09,000 --> 00:48:12,000 in fact, these would be, how about, oh, this would be clever, 604 00:48:12,000 --> 00:48:15,000 make them multiple steps in a biochemical pathway? 605 00:48:15,000 --> 00:48:18,000 Have them coded on a single messenger so then you'd only have to 606 00:48:18,000 --> 00:48:21,000 worry about regulating that once. If you have the regulatory 607 00:48:21,000 --> 00:48:24,000 machinery to turn on, you'll make all the enzymes for the 608 00:48:24,000 --> 00:48:27,000 pathway. And that's exactly what bacteria do. They tend to put all 609 00:48:27,000 --> 00:48:30,000 the enzymes for a pathway on a single message so when they want to 610 00:48:30,000 --> 00:48:33,000 call up, let's digest hexose this morning, they have a whole thing 611 00:48:33,000 --> 00:48:37,000 that will let them be able to do that, polycistronic. 612 00:48:37,000 --> 00:48:41,000 That's because they're small genomes. They're pressed for space. 613 00:48:41,000 --> 00:48:46,000 And, because of that, they have to slam a lot into a single unit. 614 00:48:46,000 --> 00:48:50,000 And this single unit that has multiple genes encoded in a single 615 00:48:50,000 --> 00:48:55,000 message is called an operon, and we'll talk more about that.
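[Editor's sketch] A message that carries several independent coding regions can be pictured by scanning one mRNA for more than one start-to-stop stretch. This is a minimal toy: it treats every AUG outside a coding region as a start and ignores how a real bacterial ribosome chooses its start site. The message, the spacers, and the `coding_regions` helper are all made up for illustration.

```python
# Toy sketch: find several start-to-stop coding regions on one message.
# The sequence and the start-site rule are illustrative assumptions.
STOP_CODONS = {"UAA", "UAG", "UGA"}


def coding_regions(mrna):
    """Scan 5' to 3'; each AUG opens a region read codon by codon up to a stop."""
    regions = []
    i = 0
    while i <= len(mrna) - 3:
        if mrna[i:i + 3] == "AUG":
            j = i
            while j <= len(mrna) - 3 and mrna[j:j + 3] not in STOP_CODONS:
                j += 3                       # step one codon at a time
            if j <= len(mrna) - 3:           # an in-frame stop codon was found
                regions.append(mrna[i:j + 3])
                i = j + 3                    # resume scanning after the stop
                continue
        i += 1
    return regions


genes = ["AUGAAAUAA", "AUGCCCGGGUAG", "AUGUUUUGA"]   # protein one, two, three
message = "GGCC" + genes[0] + "CCGGCC" + genes[1] + "GGCCGG" + genes[2]

found = coding_regions(message)
for number, region in enumerate(found, start=1):
    print("protein", number, "is encoded by", region)
```

One promoter and one message yield three separate coding regions here, so switching that single message on or off regulates all three together, which is the point of the operon.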
616 00:48:55,000 --> 00:49:00,000 Last of all viruses. Viruses. Viruses have very little room. 617 00:49:00,000 --> 00:49:05,000 Their genomes can be tiny. A typical virus might have a genome 618 00:49:05,000 --> 00:49:10,000 of 5,000 bases to 10,000 bases to, in some cases, 619 00:49:10,000 --> 00:49:15,000 200,000 bases, but it hasn't got a lot of room. It wants to pack a lot 620 00:49:15,000 --> 00:49:21,000 of protein-coding information in. And some viruses have come up with 621 00:49:21,000 --> 00:49:26,000 the most extraordinary way of doing that. Some viruses have gone to the 622 00:49:26,000 --> 00:49:32,000 extreme of having RNAs that get made from them that have a sequence -- 623 00:49:32,000 --> 00:49:39,000 I'm just going to pick up in the middle of the sequence here. 624 00:49:39,000 --> 00:49:46,000 A-C-U-A-C-U-A-C-U-A-C-U. You might decide to read the sequence like 625 00:49:46,000 --> 00:49:53,000 this, that those are the codons, and you'd get a certain protein. 626 00:49:53,000 --> 00:50:00,000 But I might also decide to read that sequence C-U-A-C-U-A-C-U-A. 627 00:50:00,000 --> 00:50:04,000 And, of course, I'm giving this in a repeating form 628 00:50:04,000 --> 00:50:08,000 because it's easy to note. I could give you any sequence and I 629 00:50:08,000 --> 00:50:12,000 could read it in this reading frame, I could read it in this reading 630 00:50:12,000 --> 00:50:17,000 frame, or I could read it as U-A-C-U-A-C-U-A-C. 631 00:50:17,000 --> 00:50:21,000 In other words, there are three reading frames that, 632 00:50:21,000 --> 00:50:25,000 in principle, you could translate a protein from. In a typical 633 00:50:25,000 --> 00:50:30,000 prokaryotic gene or eukaryotic gene only one of those is used. 634 00:50:30,000 --> 00:50:34,000 You start at the first AUG and that sets the reading frame.
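[Editor's sketch] The three reading frames of the repeating example can be written out directly. The sequence is the one from the lecture; the helper name is hypothetical, and the codon table holds only the three codons that occur in this sequence, taken from the standard genetic code.

```python
# Sketch of the three reading frames of the lecture's repeating sequence.
# Only the three codons needed here are included in the table.
CODON_TABLE = {"ACU": "Thr", "CUA": "Leu", "UAC": "Tyr"}


def codons_in_frame(rna, frame):
    """Split an RNA into complete codons starting at offset 0, 1, or 2."""
    return [rna[i:i + 3] for i in range(frame, len(rna) - 2, 3)]


rna = "ACUACUACUACU"
frames = {frame: codons_in_frame(rna, frame) for frame in range(3)}
peptides = {frame: [CODON_TABLE[c] for c in codons] for frame, codons in frames.items()}

for frame in range(3):
    print("frame", frame, frames[frame], "->", "-".join(peptides[frame]))
```

The same letters read as threonines in one frame, leucines in the next, and tyrosines in the third, which is why a message that uses more than one frame packs in more than one protein.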
635 00:50:34,000 --> 00:50:39,000 But some viruses are so pressed for space and are so clever and are so 636 00:50:39,000 --> 00:50:43,000 efficient that they make messages that have tricks that they actually 637 00:50:43,000 --> 00:50:48,000 use two or, in some cases, all three reading frames, which is 638 00:50:48,000 --> 00:50:53,000 an extraordinary packing of information density into a simple 639 00:50:53,000 --> 00:50:57,000 message. So, the basic point. We have a simple model. DNA is 640 00:50:57,000 --> 00:51:02,000 replicated. Transcribed into RNA. 641 00:51:02,000 --> 00:51:06,000 Translated into protein. But there are a lot of important 642 00:51:06,000 --> 00:51:11,000 variations between eukaryotes, prokaryotes and viruses. And 643 00:51:11,000 --> 00:51:15,000 understanding them can be useful for treating cancer, 644 00:51:15,000 --> 00:51:20,000 for treating AIDS, and for treating viral and bacterial 645 00:51:20,000 --> 00:51:25,000 infections. Next time.