1 00:00:01,000 --> 00:00:05,000 Good morning. Good morning. 2 00:00:05,000 --> 00:00:10,000 I don't know about you, but I can't take too many more nights like this. 3 00:00:10,000 --> 00:00:15,000 I confess, I haven't gotten a thing done for so many nights in a row now, 4 00:00:15,000 --> 00:00:20,000 but what a game! How many of you saw the game? Excellent. 5 00:00:20,000 --> 00:00:25,000 Very good, very good. You have your priorities straight in 6 00:00:25,000 --> 00:00:30,000 the world. Very good. Well, if it's possible to get your 7 00:00:30,000 --> 00:00:35,000 minds off Curt Schilling last night, and off more importantly tonight. 8 00:00:35,000 --> 00:00:40,000 Perhaps we can spend a bit of time this morning in the meanwhile with 9 00:00:40,000 --> 00:00:45,000 whatever spare neurons you have talking about recombinant DNA for a 10 00:00:45,000 --> 00:00:50,000 bit, OK? What we talked about last time was different ways to clone 11 00:00:50,000 --> 00:00:55,000 your gene based on its properties. We started off with cloning by 12 00:00:55,000 --> 00:01:00,000 complementation, right, the idea that if you took a 13 00:01:00,000 --> 00:01:05,000 library of clones, you would be able to put it into 14 00:01:05,000 --> 00:01:10,000 bacteria and select a bacterium whose phenotype had been restored by 15 00:01:10,000 --> 00:01:14,000 virtue of having the plasmid. You would complement the defect. 16 00:01:14,000 --> 00:01:18,000 You'd find the clone you wanted because it complemented the defect. 17 00:01:18,000 --> 00:01:22,000 That's great if you can put it into an organism that has a defect. 18 00:01:22,000 --> 00:01:25,000 You can do it with bacteria. You can do that with yeast. 
19 00:01:25,000 --> 00:01:29,000 It's harder to do with large organisms because you can't inject 20 00:01:29,000 --> 00:01:33,000 enough of them with different clones to be able to make that practical 21 00:01:33,000 --> 00:01:37,000 unless you're working in cell culture or some very small, 22 00:01:37,000 --> 00:01:41,000 fast growing organism. We talked about being able to use a 23 00:01:41,000 --> 00:01:45,000 protein sequence, reverse translating that protein 24 00:01:45,000 --> 00:01:50,000 sequence in the computer from amino acid sequence to nucleotide sequence, 25 00:01:50,000 --> 00:01:54,000 and using the nucleotide sequence to design a probe to hybridize back to 26 00:01:54,000 --> 00:01:59,000 the genome. That works fine if you have a protein sequence. 27 00:01:59,000 --> 00:02:02,000 But the last topic we talked about that I wanted to just touch on again 28 00:02:02,000 --> 00:02:05,000 this morning was suppose you were trying to clone the gene that causes 29 00:02:05,000 --> 00:02:08,000 a certain human disease, and you have no idea what the 30 00:02:08,000 --> 00:02:11,000 protein was. Then, you can't use its amino acid 31 00:02:11,000 --> 00:02:15,000 sequence because you don't have the protein. What can you possibly do 32 00:02:15,000 --> 00:02:18,000 when all you know is that you have a gene which causes a genetic defect 33 00:02:18,000 --> 00:02:21,000 that causes a disease? And I said you could clone it using 34 00:02:21,000 --> 00:02:24,000 the ideas of genetic mapping, position, the things that Sturtevant 35 00:02:24,000 --> 00:02:28,000 developed. And, I touched on it briefly, 36 00:02:28,000 --> 00:02:31,000 and I want to just touch on it a bit more because some people had some 37 00:02:31,000 --> 00:02:34,000 questions about it. And I've set up a very simple 38 00:02:34,000 --> 00:02:38,000 example to show you. Suppose that, to make it easy, 39 00:02:38,000 --> 00:02:41,000 we're working in a fruit fly first. 
We're working in Drosophila, and 40 00:02:41,000 --> 00:02:44,000 suppose that the true picture of the underlying chromosome is like this. 41 00:02:44,000 --> 00:02:48,000 There's a locus that could either have a mutant allele M or the wild 42 00:02:48,000 --> 00:02:51,000 type allele plus. There's a bunch of other loci along 43 00:02:51,000 --> 00:02:54,000 the chromosome. And, let's suppose we know all of 44 00:02:54,000 --> 00:02:58,000 where they are and all that. And, they have two alternative 45 00:02:58,000 --> 00:03:01,000 alleles. At this locus the alleles are orange 46 00:03:01,000 --> 00:03:05,000 or pink. At this locus I'll call the alleles orange or pink. 47 00:03:05,000 --> 00:03:08,000 Now, these are different loci. These are different alleles. I've 48 00:03:08,000 --> 00:03:11,000 just called them orange and pink in both cases so I don't have a rainbow 49 00:03:11,000 --> 00:03:15,000 of colors up here to confuse us. But all I mean is there's two 50 00:03:15,000 --> 00:03:18,000 possible alleles here, two alleles here, two alleles here. 51 00:03:18,000 --> 00:03:21,000 This is the disease gene we're interested in, 52 00:03:21,000 --> 00:03:25,000 and these are passive markers. These are other markers along the 53 00:03:25,000 --> 00:03:28,000 chromosome. If we were to set up a cross between 54 00:03:28,000 --> 00:03:32,000 heterozygotes, a heterozygote here, 55 00:03:32,000 --> 00:03:36,000 and a heterozygote here, and it were the case that on the 56 00:03:36,000 --> 00:03:40,000 chromosome bearing the mutant allele, it happened that at these three 57 00:03:40,000 --> 00:03:44,000 markers we had orange alleles. I don't know what they are, but 58 00:03:44,000 --> 00:03:48,000 whatever these orange alleles are, they might be a visible phenotype, 59 00:03:48,000 --> 00:03:52,000 forked or yellow or bristled. They could be a DNA sequence 60 00:03:52,000 --> 00:03:56,000 difference.
They could be whatever you want, but let's suppose the M 61 00:03:56,000 --> 00:04:00,000 chromosome has a set of alleles that are different in each location than 62 00:04:00,000 --> 00:04:03,000 the plus chromosome. Then, when we look at the offspring 63 00:04:03,000 --> 00:04:07,000 that come out of this cross, let's only, for the sake of 64 00:04:07,000 --> 00:04:11,000 simplicity, look at those offspring who are homozygous mutants. 65 00:04:11,000 --> 00:04:15,000 Well, in general, if there's been no crossover here, 66 00:04:15,000 --> 00:04:18,000 then the M chromosome will have orange, orange, 67 00:04:18,000 --> 00:04:22,000 orange, orange, orange, orange. If there's been a crossover, 68 00:04:22,000 --> 00:04:26,000 however, it could go orange, orange, pink on one of those 69 00:04:26,000 --> 00:04:30,000 chromosomes. Or if there's been crossovers like this, 70 00:04:30,000 --> 00:04:34,000 it could go orange, orange, pink on one chromosome, and orange, 71 00:04:34,000 --> 00:04:37,000 pink, pink on the other chromosome. It could even, 72 00:04:37,000 --> 00:04:41,000 in the extreme, have had crossovers very close to 73 00:04:41,000 --> 00:04:44,000 the gene maybe here, and even maybe here. And you've got 74 00:04:44,000 --> 00:04:47,000 orange, pink, pink, and pink, pink, pink. 75 00:04:47,000 --> 00:04:51,000 But if we look at the many segregates, you know from genetic 76 00:04:51,000 --> 00:04:54,000 mapping that the closer the locus is to the disease gene, 77 00:04:54,000 --> 00:04:57,000 the more strongly correlated the inheritance will be, 78 00:04:57,000 --> 00:05:01,000 the tighter the linkage will be. This is nothing more than linkage 79 00:05:01,000 --> 00:05:04,000 mapping. But now, suppose we were doing 80 00:05:04,000 --> 00:05:08,000 linkage mapping, but for the sake of argument the 81 00:05:08,000 --> 00:05:12,000 whole genome had already been sequenced. 
Suppose the genome had 82 00:05:12,000 --> 00:05:15,000 been sequenced in a cross, and the whole genome of the fruit 83 00:05:15,000 --> 00:05:19,000 fly had been sequenced which it has been sequenced. 84 00:05:19,000 --> 00:05:22,000 And, we looked at a cross and we looked at the mutants. 85 00:05:22,000 --> 00:05:26,000 And what we did was we tried different positions along the genome. 86 00:05:26,000 --> 00:05:30,000 And at each position, we had some genetic marker. 87 00:05:30,000 --> 00:05:34,000 And that genetic marker might be as simple as the fact that at that 88 00:05:34,000 --> 00:05:38,000 position, maybe there is an A in the DNA sequence on one of the 89 00:05:38,000 --> 00:05:42,000 chromosomes, and maybe I don't know a G in the other sequence. 90 00:05:42,000 --> 00:05:47,000 And over here, this marker might be, there's a T in some particular 91 00:05:47,000 --> 00:05:51,000 position, and there's a C in some particular position. 92 00:05:51,000 --> 00:05:55,000 If we could assay that, if we could tell, we could look 93 00:05:55,000 --> 00:06:00,000 whether this spelling variation is closely correlated with the mutant. 94 00:06:00,000 --> 00:06:03,000 And this spelling variation is closely correlated with the 95 00:06:03,000 --> 00:06:07,000 inheritance of the mutant allele. And we could just try up and down 96 00:06:07,000 --> 00:06:11,000 the genome, different sites of spelling difference as if they were 97 00:06:11,000 --> 00:06:15,000 genetic markers in our cross because they are genetic markers in our 98 00:06:15,000 --> 00:06:18,000 cross, and see which one is most tightly correlated. 99 00:06:18,000 --> 00:06:22,000 The minute we get any genetic sequence difference, 100 00:06:22,000 --> 00:06:26,000 that shows co-inheritance linkage in this cross, we know that this spot 101 00:06:26,000 --> 00:06:30,000 in the genome must be nearby our mutation. 102 00:06:30,000 --> 00:06:33,000 So, we'll try one closer, and we'll try one on the other side. 
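The marker scan described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in any real cross: the marker names and the observed chromosomes are invented, and each M-bearing chromosome from a homozygous mutant offspring is written as a string with one character per marker, 'O' for the orange allele that started out on the mutant chromosome and 'P' for the pink allele from the plus chromosome.

```python
# Hypothetical marker names, listed in their order along the chromosome.
MARKERS = ["mk1", "mk2", "mk3"]

def recombinant_fraction(chromosomes, marker_index):
    """Fraction of M-bearing chromosomes that carry 'P' at this marker,
    i.e. that had a crossover between the marker and the mutant locus."""
    recombinants = sum(1 for c in chromosomes if c[marker_index] == "P")
    return recombinants / len(chromosomes)

def rank_markers(chromosomes, markers=MARKERS):
    """Return (marker, fraction) pairs, most tightly linked marker first."""
    scores = [(m, recombinant_fraction(chromosomes, i))
              for i, m in enumerate(markers)]
    return sorted(scores, key=lambda pair: pair[1])

# Invented observations: one string per M-bearing chromosome.
observed = ["OOO", "OOO", "OOP", "OPP", "OOO", "PPP", "OOP", "OOO"]

if __name__ == "__main__":
    for marker, frac in rank_markers(observed):
        print(f"{marker}: {frac:.3f} recombinant")
```

The marker with the smallest recombinant fraction is the one to move closer from; with real data one would then test further markers on either side of it.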
103 00:06:33,000 --> 00:06:37,000 And, what you do is you test sites of genetic variation, 104 00:06:37,000 --> 00:06:40,000 first to find one that shows any co-inheritance. 105 00:06:40,000 --> 00:06:44,000 And once you've got that, you try ones closer, and closer, 106 00:06:44,000 --> 00:06:48,000 and closer. Last time I talked about the process of, 107 00:06:48,000 --> 00:06:51,000 if you had one of those markers you could use it to isolate the next 108 00:06:51,000 --> 00:06:55,000 clone and the next clone and the next clone. But you know what I 109 00:06:55,000 --> 00:06:59,000 realized? That's so old-fashioned. We might as well deal with the fact 110 00:06:59,000 --> 00:07:02,000 we have a sequence of the genome. No more would you ever isolate the 111 00:07:02,000 --> 00:07:05,000 next clone and the next clone and the next clone. 112 00:07:05,000 --> 00:07:08,000 You just look it up in the computer. So, even if you have the whole 113 00:07:08,000 --> 00:07:11,000 sequence of the genome, you still have to figure out what part of 114 00:07:11,000 --> 00:07:14,000 it was co-inherited along with this disease, and that's the way you do 115 00:07:14,000 --> 00:07:17,000 it, OK? Genetic mapping, just as Sturtevant invented it, 116 00:07:17,000 --> 00:07:20,000 can be applied if you have a whole sequence of the genome, 117 00:07:20,000 --> 00:07:23,000 and enough sites of variation. And, I've drawn it for a fruit fly 118 00:07:23,000 --> 00:07:26,000 cross, but this could equally well be cystic fibrosis. 119 00:07:26,000 --> 00:07:29,000 The only difference is if we're doing this in human families and 120 00:07:29,000 --> 00:07:33,000 it's cystic fibrosis we don't have as many offspring. 121 00:07:33,000 --> 00:07:36,000 So, we have to pool data from many families.
And, 122 00:07:36,000 --> 00:07:40,000 we can't arrange it so that every family has exactly the same orange 123 00:07:40,000 --> 00:07:43,000 alleles up here and pink alleles down there, but computers can deal 124 00:07:43,000 --> 00:07:47,000 with that. They can still figure out the correlation across many 125 00:07:47,000 --> 00:07:50,000 families, and you find the spot in the genome where for many, 126 00:07:50,000 --> 00:07:54,000 many, many families the kids who all got the disease show correlated 127 00:07:54,000 --> 00:07:57,000 inheritance with this marker. And that eventually pins you down 128 00:07:57,000 --> 00:08:01,000 to a region of the genome. It pins you down to those genetic 129 00:08:01,000 --> 00:08:04,000 markers that show the absolute tightest correlation, 130 00:08:04,000 --> 00:08:08,000 tight correlation, and that's where you look. 131 00:08:08,000 --> 00:08:12,000 And in that fashion, people went from being able to map the 132 00:08:12,000 --> 00:08:16,000 location of Huntington's Disease in 1984 to, by now, 133 00:08:16,000 --> 00:08:20,000 mapping the locations of more than 1,000 different human genetic diseases 134 00:08:20,000 --> 00:08:25,000 where people didn't know the protein in advance. They did it entirely 135 00:08:25,000 --> 00:08:29,000 based on this positional mapping. So, Sturtevant's idea, which I like 136 00:08:29,000 --> 00:08:33,000 so much, has played itself out so beautifully now in the area of 137 00:08:33,000 --> 00:08:38,000 modern molecular medicine. OK. So, onward. 138 00:08:38,000 --> 00:08:43,000 I want to talk about a few other variations on the theme rather 139 00:08:43,000 --> 00:08:48,000 quickly, and then I think I want to talk about how you analyze your 140 00:08:48,000 --> 00:08:53,000 clones. First, variations on cloning, 141 00:08:53,000 --> 00:08:58,000 I should just at least mention it. We talked about cloning in an 142 00:08:58,000 --> 00:09:04,000 autonomously replicating plasmid in a bacterium.
143 00:09:04,000 --> 00:09:07,000 So, you go to a bacterium. They have some autonomously 144 00:09:07,000 --> 00:09:10,000 replicating pieces of DNA. They are circles. You can clone 145 00:09:10,000 --> 00:09:13,000 in them, and you can typically, these things are on the order of, 146 00:09:13,000 --> 00:09:17,000 I don't know, 1,000 to 2,000 to 5,000 bases can be readily cloned in 147 00:09:17,000 --> 00:09:20,000 these plasmids. You can do more, 148 00:09:20,000 --> 00:09:23,000 but that's a typical kind of number is the insert size, 149 00:09:23,000 --> 00:09:27,000 typically. But we in the lab go up to much higher numbers 150 00:09:27,000 --> 00:09:31,000 like 10,000 sometimes. You can also, if you wanted to study 151 00:09:31,000 --> 00:09:36,000 yeast, it turns out yeast happily have plasmids as well, 152 00:09:36,000 --> 00:09:41,000 and you can do a similar sort of thing for yeast. 153 00:09:41,000 --> 00:09:47,000 It turns out that instead of using plasmids, you can use bacterial 154 00:09:47,000 --> 00:09:52,000 viruses. These bacterial viruses have all different shapes as we've 155 00:09:52,000 --> 00:09:57,000 talked about, circular or linear, and they can typically hold, oh, 156 00:09:57,000 --> 00:10:02,000 15,000-40,000. Some of these viruses are quite big. 157 00:10:02,000 --> 00:10:06,000 The bacteriophage lambda tends to carry a lot of stuff. 158 00:10:06,000 --> 00:10:11,000 And, it can replicate. So, you could do the same thing to 159 00:10:11,000 --> 00:10:15,000 that. You can even use viruses that infect mammalian cells and there are 160 00:10:15,000 --> 00:10:19,000 all sorts of viruses now that people clone in again, 161 00:10:19,000 --> 00:10:24,000 linear or circular. I don't know, for mammalian cells, 162 00:10:24,000 --> 00:10:28,000 you often, the viruses like 1,000-5,000. You can even make artificial 163 00:10:28,000 --> 00:10:33,000 whole chromosomes now. You can do this in yeast.
164 00:10:33,000 --> 00:10:39,000 Artificial chromosomes are called YACs. They have all the little 165 00:10:39,000 --> 00:10:44,000 machinery, little telomeres on them, little centromeres. They have a 166 00:10:44,000 --> 00:10:50,000 selectable marker, and then you can clone into it your 167 00:10:50,000 --> 00:10:55,000 piece of DNA. And these can take up to a million bases of DNA. 168 00:10:55,000 --> 00:11:01,000 So, if you wanted, there are bacterial artificial chromosomes. 169 00:11:01,000 --> 00:11:05,000 They're called BACs if they're in bacteria. And recently, 170 00:11:05,000 --> 00:11:09,000 people have developed artificial chromosome systems for mammalian 171 00:11:09,000 --> 00:11:13,000 cells, and specifically human cells. And they're called unfortunately 172 00:11:13,000 --> 00:11:17,000 MACs and HACs and things like that. Basically, any molecule that can 173 00:11:17,000 --> 00:11:21,000 replicate in any system, some smart molecular biologist will 174 00:11:21,000 --> 00:11:25,000 come along and say, how do I use that for my purpose, 175 00:11:25,000 --> 00:11:30,000 to stick my DNA in it, and get it to replicate in this organism? 176 00:11:30,000 --> 00:11:36,000 And so, if something's not on this list, it will be soon, 177 00:11:36,000 --> 00:11:43,000 OK? Now, here's another thing. This is cloning chunks of DNA. 178 00:11:43,000 --> 00:11:50,000 Just to have the piece of DNA in a library, but suppose we want to do 179 00:11:50,000 --> 00:11:57,000 more than just have the DNA sitting there in the bacterium, 180 00:11:57,000 --> 00:12:04,000 suppose what I'd really like to do is take a bacterium, 181 00:12:04,000 --> 00:12:10,000 E coli, and put it to work for us. Maybe what I'd like to do is take a 182 00:12:10,000 --> 00:12:14,000 plasmid and insert in that plasmid the gene for human insulin. 
183 00:12:14,000 --> 00:12:19,000 So, I'm going to take the DNA locus corresponding to human insulin, 184 00:12:19,000 --> 00:12:23,000 clone it into my plasmid. Maybe I'll have isolated it from my 185 00:12:23,000 --> 00:12:28,000 library because, let's see, insulin's protein 186 00:12:28,000 --> 00:12:32,000 sequence is known so I could reverse translate it to a nucleotide 187 00:12:32,000 --> 00:12:36,000 sequence. So, I could probe a library. 188 00:12:36,000 --> 00:12:40,000 So, I could find the clone that has insulin. Now what I'd like to do is 189 00:12:40,000 --> 00:12:43,000 persuade this bacteria not just to carry the DNA but to make insulin 190 00:12:43,000 --> 00:12:47,000 for me. Would that be useful? Yeah, how did people used to get 191 00:12:47,000 --> 00:12:51,000 insulin? Cadavers, dead bodies; it would be much easier 192 00:12:51,000 --> 00:12:54,000 to get them from a fermenter, right, to get insulin from a 193 00:12:54,000 --> 00:12:58,000 fermenter, if you could just ask E coli to make it. 194 00:12:58,000 --> 00:13:02,000 So, if we put it into E coli, will it make insulin for us? 195 00:13:02,000 --> 00:13:09,000 Here's the human locus, DNA for insulin. Will it make 196 00:13:09,000 --> 00:13:17,000 insulin? Let's see, how do you make a protein? 197 00:13:17,000 --> 00:13:24,000 You've got to start by making RNA, right? You've got to transcribe the 198 00:13:24,000 --> 00:13:32,000 gene. Will E coli transcribe this gene? 199 00:13:32,000 --> 00:13:36,000 Well, why? It's got a promoter, right? It's got the insulin 200 00:13:36,000 --> 00:13:41,000 promoter. There we go. The insulin promoter is here. 201 00:13:41,000 --> 00:13:45,000 So, E coli will come along to the insulin promoter and start making 202 00:13:45,000 --> 00:13:50,000 RNA? No, it turns out that promoters in humans and promoters in 203 00:13:50,000 --> 00:13:55,000 bacteria are sufficiently different. They don't work across species. 
204 00:13:55,000 --> 00:14:00,000 They won't recognize the human promoter. Too bad. Any ideas? 205 00:14:00,000 --> 00:14:05,000 Yep? Stick a bacterial promoter there. Good, you're acting like a 206 00:14:05,000 --> 00:14:10,000 good molecular biology designer here. Let's put a bacterial promoter here. 207 00:14:10,000 --> 00:14:15,000 It will recognize its own promoter. That's great. Then, let's put the 208 00:14:15,000 --> 00:14:21,000 DNA for the human insulin gene here. And now, maybe we'll put the Lac 209 00:14:21,000 --> 00:14:26,000 operon, and when it has lactose it'll start making RNA from the 210 00:14:26,000 --> 00:14:32,000 human insulin gene. And it'll start translating it. 211 00:14:32,000 --> 00:14:38,000 And, we get insulin. Any problems? Well, 212 00:14:38,000 --> 00:14:44,000 will it make any, for starters? What's another aspect of mammalian 213 00:14:44,000 --> 00:14:50,000 genes that's different from bacterial genes? 214 00:14:50,000 --> 00:14:56,000 Processing, what kind of processing with the RNA? And the splicing, 215 00:14:56,000 --> 00:15:02,000 ooh, the insulin gene has introns that have to be spliced out. 216 00:15:02,000 --> 00:15:06,000 So, this is going to make some RNA, insulin RNA, and it needs to be 217 00:15:06,000 --> 00:15:10,000 processed like this. Will bacteria carry on our splicing 218 00:15:10,000 --> 00:15:14,000 for us? They don't do splicing. Yep? Well, that's a very 219 00:15:14,000 --> 00:15:18,000 interesting question because we haven't. But, 220 00:15:18,000 --> 00:15:22,000 what do you propose? You see, I've just taken a piece of 221 00:15:22,000 --> 00:15:26,000 human DNA from the human genome, which encodes the introns and the 222 00:15:26,000 --> 00:15:30,000 exons. But, you seem to have a solution to our problem, 223 00:15:30,000 --> 00:15:35,000 and what would that be? So, instead of making a library of 224 00:15:35,000 --> 00:15:42,000 genomic DNA, what you're suggesting is a radical idea. 
225 00:15:42,000 --> 00:15:50,000 Let's instead take human RNA. Here's some human RNA, lots of 226 00:15:50,000 --> 00:15:57,000 human RNA, a big collection of human RNA. What was at the end of the 227 00:15:57,000 --> 00:16:02,000 human RNA: a poly(A) tail. And what I understand you to be 228 00:16:02,000 --> 00:16:06,000 suggesting is if we take human mRNAs, a whole collection of them, 229 00:16:06,000 --> 00:16:10,000 you want me to turn these mRNAs back into DNA and clone them instead of 230 00:16:10,000 --> 00:16:14,000 using the chromosomal DNA. How do I turn an RNA back to DNA? 231 00:16:14,000 --> 00:16:18,000 Is that possible? What do you use: reverse transcriptase. 232 00:16:18,000 --> 00:16:22,000 We have to give it a primer. So remember, five prime to three 233 00:16:22,000 --> 00:16:26,000 prime, we'd like to put a primer going over here. 234 00:16:26,000 --> 00:16:30,000 Any ideas for a good primer? Poly(T), isn't that convenient? 235 00:16:30,000 --> 00:16:35,000 One of the reasons that mammalian messages have poly(A) tails is so 236 00:16:35,000 --> 00:16:41,000 that we are able to reverse transcribe them using poly(T) 237 00:16:41,000 --> 00:16:46,000 primers. No, that's actually not true. So, we use reverse 238 00:16:46,000 --> 00:16:52,000 transcriptase. And what we can do is we'll copy 239 00:16:52,000 --> 00:16:58,000 this RNA into a strand of DNA. There we go. 240 00:16:58,000 --> 00:17:03,000 Then what we'll do, next step, is we'll take the DNA, 241 00:17:03,000 --> 00:17:09,000 and we'll copy back into a second strand of DNA. 242 00:17:09,000 --> 00:17:15,000 And now, we have double-stranded DNA whose sequence matches the 243 00:17:15,000 --> 00:17:21,000 already-processed mRNAs. Sorry? So, the sequences would 244 00:17:21,000 --> 00:17:27,000 match the mRNAs. 
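The two copying steps just described can be modeled on strings as a rough sketch. Everything here is an assumption made for illustration: the message is invented (it is not the insulin sequence), all sequences are written 5' to 3', and the poly(T) primer is represented only by a check that the mRNA ends in a run of A's.

```python
# Base-pairing tables for the two copying steps.
RNA_TO_DNA = {"A": "T", "U": "A", "G": "C", "C": "G"}
DNA_TO_DNA = {"A": "T", "T": "A", "G": "C", "C": "G"}

def first_strand(mrna, primer_len=6):
    """Reverse-transcribe an mRNA primed by poly(T) on its poly(A) tail.
    Returns the first cDNA strand, 5'->3' (complementary to the mRNA)."""
    if not mrna.endswith("A" * primer_len):
        raise ValueError("no poly(A) tail for the poly(T) primer to anneal to")
    # Complement each base, reading the mRNA backwards so the result is 5'->3'.
    return "".join(RNA_TO_DNA[base] for base in reversed(mrna))

def second_strand(cdna):
    """Copy the first strand back into DNA. The result carries the same
    sequence as the mRNA, with T in place of U."""
    return "".join(DNA_TO_DNA[base] for base in reversed(cdna))

# Invented, already-spliced message with a short poly(A) tail.
mrna = "AUGGCCUUGUGGUAA" + "A" * 8
cdna1 = first_strand(mrna)
cdna2 = second_strand(cdna1)

if __name__ == "__main__":
    print(cdna1)
    print(cdna2)
```

The second strand reproduces the spliced message, which is the point of the approach: the double-stranded product contains no introns because the cell removed them before the RNA was collected.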
So what you could do is instead of 245 00:17:27,000 --> 00:17:32,000 taking human DNA from the nucleus, you could take RNAs, 246 00:17:32,000 --> 00:17:38,000 turn them back into DNA by reverse transcriptase, 247 00:17:38,000 --> 00:17:43,000 and make a library now that consists of zillions of inserts, 248 00:17:43,000 --> 00:17:49,000 each of which has what's called a cDNA, a copied DNA, 249 00:17:49,000 --> 00:17:54,000 copied back from the RNA. The great advantage of this is that 250 00:17:54,000 --> 00:18:00,000 the human cell has already done the splicing, and so there 251 00:18:00,000 --> 00:18:05,000 are no introns left. Now, when you stick it in a 252 00:18:05,000 --> 00:18:09,000 bacterium, the bacterium is able to express this. It's able, 253 00:18:09,000 --> 00:18:13,000 if you give it its own bacterial promoter, to make an RNA. 254 00:18:13,000 --> 00:18:17,000 And if you don't ask the bacteria to have to splice, 255 00:18:17,000 --> 00:18:21,000 if you just give it a pre-spliced piece of DNA that doesn't need 256 00:18:21,000 --> 00:18:25,000 splicing, it can translate that DNA. Now, notice we used all of our 257 00:18:25,000 --> 00:18:29,000 tricks. You had to know about reverse transcriptase, 258 00:18:29,000 --> 00:18:34,000 poly(A) tails, structures of genes, introns, exons, yes, question? 259 00:18:34,000 --> 00:18:38,000 It doesn't. You do this in the test tube. You purify human mRNA in the 260 00:18:38,000 --> 00:18:42,000 test tube. You take that mRNA in a test tube, add reverse transcriptase, 261 00:18:42,000 --> 00:18:47,000 add poly(T), make this reaction of RNA to DNA in the test tube go back. 262 00:18:47,000 --> 00:18:51,000 Where does it come from? Viruses that copy themselves back for a 263 00:18:51,000 --> 00:18:56,000 living, right? So, again, every single thing we're 264 00:18:56,000 --> 00:19:00,000 using comes from some living organism that does this 265 00:19:00,000 --> 00:19:04,000 kind of stuff. 
And, when I teach you about the 266 00:19:04,000 --> 00:19:08,000 facts of how viruses replicate or what the structure of mRNAs look 267 00:19:08,000 --> 00:19:11,000 like or whatever, it's because every bit of knowledge 268 00:19:11,000 --> 00:19:14,000 we get about the way biology works turns into an incredibly powerful 269 00:19:14,000 --> 00:19:18,000 tool as it's turning out for us to actually be able to further study 270 00:19:18,000 --> 00:19:21,000 biology. So, great. So, where does reverse 271 00:19:21,000 --> 00:19:24,000 transcriptase come from now? Originally they come from viruses 272 00:19:24,000 --> 00:19:28,000 that turn themselves back from RNA to DNA. Now, how do you get reverse 273 00:19:28,000 --> 00:19:32,000 transcriptase? Catalog, right, 274 00:19:32,000 --> 00:19:38,000 very good. All right, so this is called, finally, 275 00:19:38,000 --> 00:19:44,000 a cDNA library. And, if you had made a cDNA library, 276 00:19:44,000 --> 00:19:49,000 you would be able to screen the cDNA library to find the gene for insulin. 277 00:19:49,000 --> 00:19:55,000 Is this useful? This happens to be, 278 00:19:55,000 --> 00:20:01,000 for example, one of the consequences of this was the biotechnology 279 00:20:01,000 --> 00:20:06,000 industry. OK, so if you have any doubts about 280 00:20:06,000 --> 00:20:10,000 the usefulness of understanding these abstract things about E coli 281 00:20:10,000 --> 00:20:14,000 and bacteria and stuff like that, one of the consequences was 282 00:20:14,000 --> 00:20:18,000 Genentech, Biogen, and Amgen, and if you just simply 283 00:20:18,000 --> 00:20:22,000 walk around Kendall Square, within a mile of this place you will 284 00:20:22,000 --> 00:20:26,000 see laid out before you the consequences of this ability, 285 00:20:26,000 --> 00:20:30,000 OK? It's transforming Cambridge. Yes? 286 00:20:30,000 --> 00:20:36,000 And the world. Yeah. Indeed. 
287 00:20:36,000 --> 00:20:43,000 It might be that producing large amounts of insulin was bad for the 288 00:20:43,000 --> 00:20:50,000 bacteria because there would be so much protein it would clump and kill 289 00:20:50,000 --> 00:20:57,000 the bacteria. It might be that insulin, for various reasons, 290 00:20:57,000 --> 00:21:04,000 might not fold appropriately in the bacterial environment. 291 00:21:04,000 --> 00:21:07,000 And, this is why the biotechnology industry has lots of smart people 292 00:21:07,000 --> 00:21:10,000 working in it because you're totally, 100% right. You might decide that 293 00:21:10,000 --> 00:21:13,000 instead of cloning it in bacteria it's better to clone it in some 294 00:21:13,000 --> 00:21:16,000 insect cell in culture which, in fact, people like to work with, 295 00:21:16,000 --> 00:21:19,000 or some other cell, or a mammalian cell. And so, 296 00:21:19,000 --> 00:21:23,000 I simplify by saying put it in coli, but in fact they might test six 297 00:21:23,000 --> 00:21:26,000 different cell lines, six different host possibilities. 298 00:21:26,000 --> 00:21:29,000 They might have to take the insulin out and refold it in vitro 299 00:21:29,000 --> 00:21:33,000 and things like that. You're totally right. 300 00:21:33,000 --> 00:21:37,000 This is actually something that requires work to do it right, 301 00:21:37,000 --> 00:21:42,000 just like building an airplane requires work. 302 00:21:42,000 --> 00:21:47,000 I could tell you Bernoulli's principles, but then Boeing does 303 00:21:47,000 --> 00:21:51,000 more than just write down Bernoulli's principles. 304 00:21:51,000 --> 00:21:56,000 OK, so onward. Now, I'd like to turn next to analyzing your clone. 305 00:21:56,000 --> 00:22:00,000 Analyzing the clone, so suppose we have, maybe it's by positional 306 00:22:00,000 --> 00:22:05,000 cloning, maybe it's by cDNA cloning, but one way or the other we've got 307 00:22:05,000 --> 00:22:10,000 us a clone that we're very interested in.
308 00:22:10,000 --> 00:22:14,000 Maybe it has the insulin gene. Maybe it has the Huntington's 309 00:22:14,000 --> 00:22:18,000 disease gene. Whatever it is, we're going to want to study it. 310 00:22:18,000 --> 00:22:22,000 And at the moment, I haven't told you how I would even read its DNA 311 00:22:22,000 --> 00:22:26,000 sequence or analyze its DNA. So, the first step is, of course, 312 00:22:26,000 --> 00:22:31,000 I have to purify the plasmid. And, it turns out that that can be done. 313 00:22:31,000 --> 00:22:34,000 There are simple biochemical techniques, as I mentioned in a 314 00:22:34,000 --> 00:22:37,000 previous lecture, that allow you to grow up a lot of 315 00:22:37,000 --> 00:22:40,000 the bacteria, crack them open, and the plasmid being a little 316 00:22:40,000 --> 00:22:43,000 circle, and being a little more tightly super-coiled and wound up 317 00:22:43,000 --> 00:22:46,000 has somewhat different physical properties. And you can use those 318 00:22:46,000 --> 00:22:50,000 to purify the plasmid. So, plasmid preps are not hard to 319 00:22:50,000 --> 00:22:53,000 do. You can get a fairly pure collection of the plasmid. 320 00:22:53,000 --> 00:22:56,000 Now, suppose I've done this for, oh, I don't know, let's take my 321 00:22:56,000 --> 00:23:00,000 first example, orange mutants. Suppose I tried to rescue bacteria 322 00:23:00,000 --> 00:23:04,000 that were orange minus, and suppose I found that 50 323 00:23:04,000 --> 00:23:08,000 different plasmids rescued my orange mutant because I transformed a lot 324 00:23:08,000 --> 00:23:12,000 of plasmids in, I plated it, and 50 colonies grew up. 325 00:23:12,000 --> 00:23:16,000 Are they all the same thing or are they different? 326 00:23:16,000 --> 00:23:20,000 Is there any quickie way to take a look at these 50 plasmids and see if 327 00:23:20,000 --> 00:23:24,000 they're identical or fairly close, or obviously different? 
Well, I'd 328 00:23:24,000 --> 00:23:28,000 like to have some way to take the DNA from the plasmid and analyze it 329 00:23:28,000 --> 00:23:32,000 kind of easily. I might want to see, 330 00:23:32,000 --> 00:23:37,000 like, how big is the insert? Right, that'd be one way, 331 00:23:37,000 --> 00:23:43,000 if they had different-sized inserts, they couldn't be the same thing. 332 00:23:43,000 --> 00:23:49,000 So, maybe what I could do is, how did I clone this? I used EcoRI sites I 333 00:23:49,000 --> 00:23:55,000 recall. So, I have EcoRI sites here. Suppose I were to take this DNA, 334 00:23:55,000 --> 00:24:02,000 and I were to now cut the DNA from the plasmid with EcoRI. 335 00:24:02,000 --> 00:24:09,000 Then, what I would get is two separate molecules. 336 00:24:09,000 --> 00:24:16,000 I would get the vector and the insert. How could I see how big 337 00:24:16,000 --> 00:24:24,000 they were? Gels, gel electrophoresis is the way to do 338 00:24:24,000 --> 00:24:29,000 that. So, I take a gel. A gel is a slab of gelatin, 339 00:24:29,000 --> 00:24:33,000 Jell-O, OK, and normally it's laid flat, but I'm going to do it 340 00:24:33,000 --> 00:24:37,000 vertically here. I load into the top of it here a 341 00:24:37,000 --> 00:24:41,000 little bit of my DNA, this whole mixture. I take the 342 00:24:41,000 --> 00:24:45,000 plasmid. I cut it. I put it in here. DNA's positive 343 00:24:45,000 --> 00:24:49,000 charge or negative charge? Negative. So, where should I put 344 00:24:49,000 --> 00:24:53,000 the positive pole? On the bottom, well done. 345 00:24:53,000 --> 00:24:57,000 That's often not done, and to the detriment of the experiment. 346 00:24:57,000 --> 00:25:01,000 If you put the positive pole here, it goes the wrong way, and everybody 347 00:25:01,000 --> 00:25:05,000 has to do that at least once.
So, what'll happen is the DNA 348 00:25:05,000 --> 00:25:11,000 fragments move through, and the smaller fragments move 349 00:25:11,000 --> 00:25:16,000 faster than the big fragments, right? If something's little, it'll 350 00:25:16,000 --> 00:25:22,000 move fast. If something's big, it moves slowly: little, big. 351 00:25:22,000 --> 00:25:27,000 Smaller moves faster because it wiggles through the little pores in 352 00:25:27,000 --> 00:25:33,000 the gel better. So, suppose I were to do this for a 353 00:25:33,000 --> 00:25:39,000 bunch of plasmids, and what I saw was this. 354 00:25:39,000 --> 00:25:47,000 First order, what do you guess? Sorry? Top row's probably the 355 00:25:47,000 --> 00:25:55,000 plasmid vector. This is probably the vector, 356 00:25:55,000 --> 00:26:03,000 and what do I know about the inserts? At least two inserts, 357 00:26:03,000 --> 00:26:09,000 at least two distinct inserts. Now, if I wanted to be sure that was 358 00:26:09,000 --> 00:26:13,000 the vector, maybe what I could do is take another lane, 359 00:26:13,000 --> 00:26:17,000 and run a known amount of the vector, take the vector alone, and I could 360 00:26:17,000 --> 00:26:21,000 check that the vector alone runs over here. And maybe I might take 361 00:26:21,000 --> 00:26:25,000 some other known molecules. These would be called molecular 362 00:26:25,000 --> 00:26:29,000 weight standards. So, if I run some knowns in one of 363 00:26:29,000 --> 00:26:33,000 the lanes of the gel, I can even measure and say, 364 00:26:33,000 --> 00:26:37,000 ah-ha, the insert is somewhere between the size of this one and the 365 00:26:37,000 --> 00:26:40,000 size of that one. And so, I get a little ruler that I 366 00:26:40,000 --> 00:26:43,000 can put on the gel. So, in fact, the first thing 367 00:26:43,000 --> 00:26:46,000 you would do is digest your clone that way.
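The "little ruler" from the molecular weight standards can be turned into a number. Here is a minimal sketch in Python, assuming (the lecture does not say this) that migration distance is roughly linear in the logarithm of fragment size between two neighboring standards, which is the usual working approximation for gels. The function name `estimate_size` and the ladder values are hypothetical, chosen only for illustration.

```python
import math


def estimate_size(distance, standards):
    """Estimate a fragment's size from how far its band migrated.

    standards: list of (size_in_bp, distance_migrated) pairs read off the
    lane that holds the known molecules. Interpolation is done on
    log10(size), an assumed approximation rather than an exact law.
    """
    # Sort by distance, so the biggest (slowest) standard comes first.
    ladder = sorted(standards, key=lambda pair: pair[1])
    for (size_near, d_near), (size_far, d_far) in zip(ladder, ladder[1:]):
        if d_near <= distance <= d_far:
            fraction = (distance - d_near) / (d_far - d_near)
            log_near = math.log10(size_near)
            log_far = math.log10(size_far)
            return 10 ** (log_near + fraction * (log_far - log_near))
    raise ValueError("band lies outside the range of the standards")


# Hypothetical ladder: sizes in base pairs, distances in millimeters.
ladder_lane = [(10_000, 10.0), (1_000, 30.0), (100, 50.0)]
insert_size = estimate_size(20.0, ladder_lane)
print(f"estimated insert size: about {insert_size:.0f} bp")
```

A band that falls outside the ladder raises an error instead of extrapolating, which matches the point made above: the standards only tell you that the insert is somewhere between this one and that one.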
368 00:26:46,000 --> 00:26:49,000 Now, does the fact that these guys have exactly the same, 369 00:26:49,000 --> 00:26:52,000 apparently, size on the gel mean that they're the exact same piece of 370 00:26:52,000 --> 00:26:55,000 DNA? No, because you can't even actually tell it's exactly the same. 371 00:26:55,000 --> 00:26:59,000 There's a limit to how precisely you can measure it. 372 00:26:59,000 --> 00:27:04,000 So, what else could you do? You could try another restriction 373 00:27:04,000 --> 00:27:10,000 enzyme. It turns out that since there are so many restriction 374 00:27:10,000 --> 00:27:15,000 enzymes in the catalog, if I take a piece of DNA, 375 00:27:15,000 --> 00:27:21,000 maybe that Eco fragment, I could try cutting it with HindIII. 376 00:27:21,000 --> 00:27:26,000 And when I cut it with HindIII, I'm going to get three distinct 377 00:27:26,000 --> 00:27:32,000 lengths. I could try cutting it with, oh, I don't know, 378 00:27:32,000 --> 00:27:37,000 pick another enzyme, BamHI. When I cut it with BamHI, 379 00:27:37,000 --> 00:27:43,000 I'll get some other lengths. And how do I get these lengths? 380 00:27:43,000 --> 00:27:48,000 By running them out on a gel and looking at their sizes. 381 00:27:48,000 --> 00:27:54,000 What if I added both HindIII and BamHI to my test tube? 382 00:27:54,000 --> 00:28:00,000 I'd cut at both sites. So, I'd cut here, here, 383 00:28:00,000 --> 00:28:06,000 here, here, here. So, this is cut with HindIII, 384 00:28:06,000 --> 00:28:12,000 here cut with BamHI, here cut with both, and I could measure these 385 00:28:12,000 --> 00:28:19,000 lengths. So, suppose I gave you this as a computer problem: 386 00:28:19,000 --> 00:28:25,000 I have a string and it's an unknown string, and I cut it at two places 387 00:28:25,000 --> 00:28:31,000 and I get these lengths, X1, X2, X3.
And then I take that same string and 388 00:28:31,000 --> 00:28:35,000 I cut it at other positions, and Y1, Y2, and Y3 are the lengths that 389 00:28:35,000 --> 00:28:39,000 result. And then suppose I now cut it at both sets of sites, 390 00:28:39,000 --> 00:28:43,000 and I measure it, and I get Z1, Z2, Z3, Z4, Z5. If I gave you all 391 00:28:43,000 --> 00:28:48,000 those numbers, could you figure out where the sites 392 00:28:48,000 --> 00:28:52,000 must be? Probably. It turns out to be a reasonably 393 00:28:52,000 --> 00:28:56,000 doable computer problem, although it can get a little hard in 394 00:28:56,000 --> 00:29:00,000 places. And you could try a third enzyme and 395 00:29:00,000 --> 00:29:03,000 a fourth enzyme, and it's a cute exercise to write 396 00:29:03,000 --> 00:29:07,000 yourself a little piece of code that will figure out where the sites are 397 00:29:07,000 --> 00:29:10,000 based on the lengths. The reason it occasionally gets 398 00:29:10,000 --> 00:29:13,000 funny is, what if Z3 and Z4 are exactly the same length and they run on top 399 00:29:13,000 --> 00:29:16,000 of each other in the gel, and there are other special cases. 400 00:29:16,000 --> 00:29:20,000 But you can kind of reconstruct where those restriction sites must 401 00:29:20,000 --> 00:29:23,000 be just by writing a good piece of code that'll put these pieces 402 00:29:23,000 --> 00:29:26,000 together. This is called restriction mapping, 403 00:29:26,000 --> 00:29:30,000 and it's great fun. Everybody likes to do this once. 404 00:29:30,000 --> 00:29:33,000 But, it's only a limited amount of information, right, 405 00:29:33,000 --> 00:29:36,000 because you get where the sites are, and I guess if I gave you ten clones 406 00:29:36,000 --> 00:29:40,000 and they all had exactly the same restriction maps, 407 00:29:40,000 --> 00:29:43,000 the exact same positions of these restriction sites, 408 00:29:43,000 --> 00:29:47,000 you'd feel pretty confident they were the same clone.
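The "little piece of code" for restriction mapping might look something like the following. This is only a brute-force sketch under simplifying assumptions that go beyond the lecture: the DNA is a linear fragment, every length is measured exactly with no gel error, and all fragments are seen. The function names and the example numbers are hypothetical. It tries every order of the X fragments and every order of the Y fragments, and keeps the arrangements whose combined cuts reproduce the Z lengths.

```python
from itertools import accumulate, permutations


def cut_sites(fragment_order):
    """Cut positions implied by laying fragments end to end."""
    # The last running total is the end of the DNA, not a cut site.
    return tuple(accumulate(fragment_order))[:-1]


def fragments_from_sites(sites, total_length):
    """Fragment lengths produced by cutting at every listed site."""
    edges = [0, *sorted(set(sites)), total_length]
    return [right - left for left, right in zip(edges, edges[1:])]


def restriction_maps(single_a, single_b, double):
    """Return every (sites_a, sites_b) pair consistent with all three digests."""
    total = sum(double)
    if sum(single_a) != total or sum(single_b) != total:
        raise ValueError("all three digests must add up to the same length")
    target = sorted(double)
    maps = set()
    for order_a in set(permutations(single_a)):
        sites_a = cut_sites(order_a)
        for order_b in set(permutations(single_b)):
            sites_b = cut_sites(order_b)
            if sorted(fragments_from_sites(sites_a + sites_b, total)) == target:
                maps.add((sites_a, sites_b))
    return sorted(maps)


# Hypothetical lengths (say, in kilobases) for a 10 kb fragment.
x_lengths = [2, 5, 3]        # first enzyme alone
y_lengths = [4, 5, 1]        # second enzyme alone
z_lengths = [2, 2, 3, 2, 1]  # both enzymes together
for sites_a, sites_b in restriction_maps(x_lengths, y_lengths, z_lengths):
    print("first enzyme cuts at", sites_a, "second enzyme cuts at", sites_b)
```

The answer is never unique: the mirror image of any map gives the same lengths, so the sketch returns all consistent maps and leaves the choice to further evidence, such as a third enzyme. The search grows factorially with the number of fragments, which is one reason the problem can get hard in places.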
409 00:29:47,000 --> 00:29:50,000 But you still wouldn't really know much about the clone other than it 410 00:29:50,000 --> 00:29:53,000 had two HindIII sites and two BamHI sites, and here's where they were. 411 00:29:53,000 --> 00:29:57,000 What do you really want to know about this clone? Its 412 00:29:57,000 --> 00:30:02,000 DNA sequence, right? Let's not settle for anything less 413 00:30:02,000 --> 00:30:10,000 than the exact nucleotide sequence of the clone. So, 414 00:30:10,000 --> 00:30:18,000 really the last key topic is sequencing DNA. 415 00:30:18,000 --> 00:30:26,000 How are you going to sequence DNA? Well, suppose I give you some 416 00:30:26,000 --> 00:30:34,000 double strand of DNA, five prime to three prime, 417 00:30:34,000 --> 00:30:42,000 five prime, three prime, double stranded DNA. 418 00:30:42,000 --> 00:30:47,000 Let me heat it up. What happens when I heat up DNA? 419 00:30:47,000 --> 00:30:52,000 It melts: the hydrogen bonds, the non-covalent hydrogen bonds here, 420 00:30:52,000 --> 00:30:57,000 break, and I get my two strands separated. Now, 421 00:30:57,000 --> 00:31:02,000 what I'd like to do is start reading out this DNA sequence. 422 00:31:02,000 --> 00:31:08,000 So, I'm going to make me a primer. Now, golly, here's a primer. 423 00:31:08,000 --> 00:31:14,000 You're going to ask me, how did I even know what primer to 424 00:31:14,000 --> 00:31:21,000 use if I don't know the DNA sequence? How can I make a primer? 425 00:31:21,000 --> 00:31:27,000 Hold that question. Make sure I remember to come back and answer 426 00:31:27,000 --> 00:31:34,000 that, OK? But for the moment, grant me that I have a primer here. 427 00:31:34,000 --> 00:31:39,000 What I'd like to do is add DNA polymerase. So, 428 00:31:39,000 --> 00:31:45,000 let's add some DNA polymerase.
And, I'd like to add nucleotide 429 00:31:45,000 --> 00:31:50,000 triphosphates, dNTPs, the dATP, 430 00:31:50,000 --> 00:31:56,000 dCTP, the dGTP, dTTP, and if I add DNA polymerase and I 431 00:31:56,000 --> 00:32:02,000 add my nucleotides, what does Arthur Kornberg tell us 432 00:32:02,000 --> 00:32:07,000 will happen? It'll start polymerizing, 433 00:32:07,000 --> 00:32:13,000 right? And, it'll stop there. So, the polymerase knows the bases, 434 00:32:13,000 --> 00:32:18,000 right? It knows what base to put in because polymerase is very smart. 435 00:32:18,000 --> 00:32:24,000 So, the bases get put in correctly. The only problem is, how do we get 436 00:32:24,000 --> 00:32:30,000 polymerase to tell us what it just did? 437 00:32:30,000 --> 00:32:37,000 Here's a cute trick. This is, by the way, 438 00:32:37,000 --> 00:32:44,000 a cute trick that won the Nobel Prize. So, suppose my primer is 439 00:32:44,000 --> 00:32:51,000 like this: five prime, T, A, A, T, T, C, T, and the 440 00:32:51,000 --> 00:32:58,000 template strand here, A, T, T, A, A, G, A, now let's keep 441 00:32:58,000 --> 00:33:05,000 going, A, T, G, C, C, A, A, T, G, 442 00:33:05,000 --> 00:33:14,000 G, A, T, T, A, five prime. So, there's my primer. 443 00:33:14,000 --> 00:33:26,000 There's my template. I'm going to start adding. Well, let's 444 00:33:26,000 --> 00:33:37,000 add our polymerase. Let's add our dNTPs, 445 00:33:37,000 --> 00:33:47,000 polymerase, dATP, dCTP, dTTP, dGTP, and then I want to add a 446 00:33:47,000 --> 00:33:58,000 special extra good old ingredient into this. 447 00:33:58,000 --> 00:34:07,000 The special extra ingredient I want to add is a defective T, 448 00:34:07,000 --> 00:34:17,000 a defective dTTP. What do I mean by defective? I mean chemically 449 00:34:17,000 --> 00:34:27,000 modified in such a way that it can't be extended, that you 450 00:34:27,000 --> 00:34:36,000 can't extend past it. So now, let's follow my reaction. 
451 00:34:36,000 --> 00:34:44,000 I'm going to start with, I'm just going to write them down here, 452 00:34:44,000 --> 00:34:52,000 T, A, A, T, T, C, T. What's the next base I'm going to put in? 453 00:34:52,000 --> 00:35:00,000 T, OK? Is that a defective T or a good T? I don't know. 454 00:35:00,000 --> 00:35:05,000 It could be. Maybe it's a defective T, which I'll mark with a little star 455 00:35:05,000 --> 00:35:11,000 there, OK? If so, what happens to my polymerase? 456 00:35:11,000 --> 00:35:17,000 It stops. It can't go any further. It can't go any further because the 457 00:35:17,000 --> 00:35:22,000 T's defective. But what if it wasn't a defective T? 458 00:35:22,000 --> 00:35:28,000 What if it was a good T? Then what goes on? The polymerase 459 00:35:28,000 --> 00:35:34,000 will put in, keep going guys. A, C, G, G, and what does it put in 460 00:35:34,000 --> 00:35:39,000 now? T, right? Now, 461 00:35:39,000 --> 00:35:45,000 is that a defective T? Maybe. We don't know. 462 00:35:45,000 --> 00:35:50,000 If it is a defective T, it stops there. Otherwise, 463 00:35:50,000 --> 00:35:56,000 polymerase goes here, and the next base is what? 464 00:35:56,000 --> 00:36:02,000 T, and is that a defective T? Maybe. 465 00:36:02,000 --> 00:36:09,000 And, if it's not a defective T, then polymerase goes on, puts in an 466 00:36:09,000 --> 00:36:16,000 A, puts in a C, a C, and then a T. 467 00:36:16,000 --> 00:36:23,000 And maybe that's defective. All right, which of these 468 00:36:23,000 --> 00:36:31,000 possibilities is what polymerase does when I throw it in? 469 00:36:31,000 --> 00:36:35,000 Well, all of them. There's a lot of molecules there. 470 00:36:35,000 --> 00:36:39,000 Some of the molecules, by chance, happen to install a defective T, 471 00:36:39,000 --> 00:36:43,000 and they grind to a halt here. Sometimes, a good T's put in and the 472 00:36:43,000 --> 00:36:47,000 molecules stop here.
Sometimes they stop here, 473 00:36:47,000 --> 00:36:51,000 and if I start with a big collection of primers in a lot of my template 474 00:36:51,000 --> 00:36:55,000 DNA, I'm going to get this whole collection of different molecules of 475 00:36:55,000 --> 00:36:59,000 different lengths. What lengths do I get? 476 00:36:59,000 --> 00:37:04,000 The lengths correspond precisely to the positions of the Ts. 477 00:37:04,000 --> 00:37:10,000 I get a series of molecules whose lengths perfectly match the 478 00:37:10,000 --> 00:37:16,000 positions of Ts. Well, first off, 479 00:37:16,000 --> 00:37:23,000 how do I measure their lengths? Run a gel, bingo, run a gel. 480 00:37:23,000 --> 00:37:30,000 So, if I could run a gel that could separate nucleotides based on length 481 00:37:30,000 --> 00:37:37,000 that two next to each other, another one up there, I'd see a 482 00:37:37,000 --> 00:37:44,000 small molecule, length one, two, 483 00:37:44,000 --> 00:37:51,000 three, four, five, three, six, eight, I'd see one of 484 00:37:51,000 --> 00:37:58,000 length eight. I'd see one of length, what's the next one, 485 00:37:58,000 --> 00:38:05,000 13, eight, nine, ten, 13, 14, so eight, nine, 486 00:38:05,000 --> 00:38:12,000 ten, 11, 12, 13, 14, 15, what's that, 13, 14, 15, 16, 487 00:38:12,000 --> 00:38:18,000 17, 18. OK, those would be the positions at 488 00:38:18,000 --> 00:38:22,000 which I would see this T. So, I'd need to have a special kind 489 00:38:22,000 --> 00:38:26,000 of gel that's so accurate that it can separate single nucleotides, 490 00:38:26,000 --> 00:38:31,000 right, that the lengths, but that can be done. 491 00:38:31,000 --> 00:38:36,000 There's acrylamide, the polymer that will do that. 492 00:38:36,000 --> 00:38:41,000 That'll tell me the exact lengths of the T's. What else do I do? 493 00:38:41,000 --> 00:38:46,000 Well, let's obviously do it from the other bases. 494 00:38:46,000 --> 00:38:52,000 Let's try defective A, defective C, defective G. 
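The defective-nucleotide trick can be sketched in a few lines of Python, using the primer and template written out above. This is an idealized illustration, not a model of the real chemistry: it assumes every possible stopping point is populated, that the gel resolves single nucleotides perfectly, and the function names are made up. Each reaction yields the lengths at which its defective base could have been installed, and sorting all the lengths across the four reactions spells out the new strand.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

PRIMER = "TAATTCT"                        # written 5' to 3'
TEMPLATE_3_TO_5 = "ATTAAGAATGCCAATGGATTA"  # template as read out, ending at its 5' end


def termination_lengths(primer, template_3_to_5, defective_base):
    """Lengths of the molecules made when one base has a defective version."""
    # The primer has to pair with the start of the template.
    for primer_base, template_base in zip(primer, template_3_to_5):
        if COMPLEMENT[template_base] != primer_base:
            raise ValueError("primer does not match the template")
    lengths = []
    for position in range(len(primer), len(template_3_to_5)):
        added_base = COMPLEMENT[template_3_to_5[position]]
        if added_base == defective_base:
            # Some molecules get the defective copy here and stop.
            lengths.append(position + 1)
    return lengths


def read_sequence(lanes):
    """Read the added bases off the four lanes, shortest fragment first."""
    base_at_length = {
        length: base for base, lengths in lanes.items() for length in lengths
    }
    return "".join(base_at_length[length] for length in sorted(base_at_length))


lanes = {base: termination_lengths(PRIMER, TEMPLATE_3_TO_5, base) for base in "ACGT"}
for base, lengths in lanes.items():
    print(f"defective {base}: lengths {lengths}")
print("sequence added after the primer:", read_sequence(lanes))
```

With the full template as written, the defective-T reaction gives lengths 8, 13, 14, and 18, the ones counted out above, plus one more at 21 because the template continues a few bases further.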
495 00:38:52,000 --> 00:38:57,000 Let's see, if I got it right, which we'll try, it ought to end up 496 00:38:57,000 --> 00:39:02,000 looking something like that. And if not, you get the picture, 497 00:39:02,000 --> 00:39:07,000 that this ought to match up as to which columns have which lengths. 498 00:39:07,000 --> 00:39:13,000 OK, I think I got it right. That tells me the lengths of the 499 00:39:13,000 --> 00:39:18,000 molecules. So, I could read off the sequence. 500 00:39:18,000 --> 00:39:23,000 The sequence of that molecule ought to be, starting over there, 501 00:39:23,000 --> 00:39:29,000 the sequence of what I've added in, ought to be something like T, A, C, 502 00:39:29,000 --> 00:39:34,000 G, G, T, T, A, C, C, T, yep, it worked. 503 00:39:34,000 --> 00:39:39,000 It's exactly right. Bingo. I can now read the sequence. 504 00:39:39,000 --> 00:39:44,000 Fred Sanger, a brilliant scientist, thought up this method of just 505 00:39:44,000 --> 00:39:49,000 exploiting E. coli's own polymerase or other organisms' own polymerases. 506 00:39:49,000 --> 00:39:54,000 So, polymerase does the copying, and all the chemistry that had to be done was thinking up 507 00:39:54,000 --> 00:40:00,000 a defective nucleotide that could not be extended. 508 00:40:00,000 --> 00:40:10,000 It could obviously be inserted. It can't be extended. So, one 509 00:40:10,000 --> 00:40:20,000 question is, what's a defective nucleotide? Well, 510 00:40:20,000 --> 00:40:30,000 you will recall that our nucleotide in the sugar phosphate chain 511 00:40:30,000 --> 00:40:37,000 is sitting like this. Let's see, hanging off the one prime 512 00:40:37,000 --> 00:40:42,000 carbon is the base. This is the one prime carbon, 513 00:40:42,000 --> 00:40:47,000 the two prime carbon, the three prime carbon, the four prime carbon, 514 00:40:47,000 --> 00:40:52,000 the five prime carbon. What do we know in DNA at the two prime carbon?
515 00:40:52,000 --> 00:40:57,000 Normally in ribose there would be a hydroxyl here, 516 00:40:57,000 --> 00:41:02,000 right? But in deoxyribose, there's just a hydrogen. 517 00:41:02,000 --> 00:41:11,000 So, if this is deoxyribose, so a dNTP really means a two prime 518 00:41:11,000 --> 00:41:20,000 deoxyribose, where do I now attach my next base in the sugar phosphate 519 00:41:20,000 --> 00:41:30,000 chain? The three prime end, and what do I attach it to: the OH. 520 00:41:30,000 --> 00:41:35,000 What do you think would happen if there's no OH there? 521 00:41:35,000 --> 00:41:40,000 You're stuck. All you've got to do is take off that hydroxyl. 522 00:41:40,000 --> 00:41:45,000 No hydroxyl group. If you made nucleotides that don't have that 523 00:41:45,000 --> 00:41:50,000 hydroxyl group, they can't be extended. 524 00:41:50,000 --> 00:41:55,000 So, instead of these being just deoxy at the two prime position, 525 00:41:55,000 --> 00:42:00,000 they are dideoxy, deoxy at two positions. 526 00:42:00,000 --> 00:42:04,000 They are two prime, three prime, dideoxynucleotides. 527 00:42:04,000 --> 00:42:09,000 That's it. Now, if you needed to get two prime three prime 528 00:42:09,000 --> 00:42:13,000 dideoxynucleotides, they're in the catalog of course, 529 00:42:13,000 --> 00:42:18,000 right, because Fred Sanger had to make them himself and all that, 530 00:42:18,000 --> 00:42:23,000 but you can just buy them now. And so, you can do the sequencing. 531 00:42:23,000 --> 00:42:28,000 A few other little details here, though, guys. 532 00:42:28,000 --> 00:42:32,000 How do we see the DNA in the gel? One possibility would be staining 533 00:42:32,000 --> 00:42:37,000 it. There are some dyes like ethidium bromide, 534 00:42:37,000 --> 00:42:42,000 and for doing your restriction mapping, using a dye that sticks to 535 00:42:42,000 --> 00:42:47,000 DNA like ethidium bromide does is pretty good.
And then you put it 536 00:42:47,000 --> 00:42:52,000 under fluorescent light and you look. For sequencing, 537 00:42:52,000 --> 00:42:57,000 the amount of DNA is so little that it's hard to see with a dye by the 538 00:42:57,000 --> 00:43:02,000 naked eye, which is what you do with restriction map. So, sorry? 539 00:43:02,000 --> 00:43:06,000 So, the first thing people did was radioactive. What they did was they 540 00:43:06,000 --> 00:43:10,000 took a primer, made it radioactive, 541 00:43:10,000 --> 00:43:14,000 and you did this whole sequencing reaction with radioactive primer. 542 00:43:14,000 --> 00:43:18,000 Then, when you run the gel, you take your gel and you expose it for 543 00:43:18,000 --> 00:43:22,000 some number of hours, eight hours maybe, a piece of x-ray 544 00:43:22,000 --> 00:43:26,000 film, develop the x-ray film, and you'll see that picture. So, 545 00:43:26,000 --> 00:43:30,000 one solution that you could do to visualize is using radioactive 546 00:43:30,000 --> 00:43:37,000 nucleotides. So, we got the defective nucleotide. 547 00:43:37,000 --> 00:43:45,000 We now need to visualize our DNA. Let's visualize the sequence. 548 00:43:45,000 --> 00:43:54,000 One possibility: radioactive. The second possibility, someone 549 00:43:54,000 --> 00:44:03,000 already mentioned it, a fluorescent dye. 550 00:44:03,000 --> 00:44:10,000 Now, here, a fluorescent dye could be put on, and you can't read it 551 00:44:10,000 --> 00:44:17,000 with your eye, but lasers are very good at reading. 552 00:44:17,000 --> 00:44:24,000 So, you might run a whole gel here and have lasers scan it. 553 00:44:24,000 --> 00:44:31,000 But, you can actually do better than that. Suppose I put my 554 00:44:31,000 --> 00:44:39,000 fluorescent dye on my dideoxynucleotides. 
555 00:44:39,000 --> 00:44:45,000 Suppose I put it on my dideoxynucleotides, 556 00:44:45,000 --> 00:44:52,000 and suppose I even had enough chemistry at my disposal that I 557 00:44:52,000 --> 00:44:59,000 could put a different color of fluorescent dye on each 558 00:44:59,000 --> 00:45:05,000 of my nucleotides. Then, whenever the dideoxy is put in 559 00:45:05,000 --> 00:45:09,000 to terminate the chain, it carries with it its own color. 560 00:45:09,000 --> 00:45:14,000 Wouldn't that be cool? And, that's what's done. Not just can you buy 561 00:45:14,000 --> 00:45:18,000 dideoxynucleotides now, but you can buy the four different 562 00:45:18,000 --> 00:45:23,000 dideoxynucleotides each with its own dye attached to it. 563 00:45:23,000 --> 00:45:27,000 So, there are di-dideoxies I guess, sorry, but it's different di's, 564 00:45:27,000 --> 00:45:32,000 right? They're dye-dideoxies. So, you could do that. 565 00:45:32,000 --> 00:45:36,000 And then what you get would be that in this column you get this color. 566 00:45:36,000 --> 00:45:40,000 And in this column, you'd get this color. And in this column you'd get 567 00:45:40,000 --> 00:45:44,000 this color, etc. I'm not worrying about where they 568 00:45:44,000 --> 00:45:48,000 are here. And they'd all be different colors and it would be 569 00:45:48,000 --> 00:45:52,000 very pretty. You know what? Why do we need to run separate 570 00:45:52,000 --> 00:45:56,000 lanes anymore? If we've got a laser, 571 00:45:56,000 --> 00:46:00,000 we can have the laser scan it and tell the colors apart. Stick 572 00:46:00,000 --> 00:46:05,000 it in one lane. In fact, what's done is stick it in 573 00:46:05,000 --> 00:46:13,000 a capillary tube, throw in all four at the same time 574 00:46:13,000 --> 00:46:20,000 now, and as these fragments come by, each has its own color. And all we 575 00:46:20,000 --> 00:46:28,000 need is a laser scanner capable of sitting right here. Here's 576 00:46:28,000 --> 00:46:34,000 my laser scanner.
And the laser scanner, 577 00:46:34,000 --> 00:46:40,000 positive here, negative here, as the DNA flows by through this 578 00:46:40,000 --> 00:46:46,000 polymer, the laser scanner reads off which colors just went by. 579 00:46:46,000 --> 00:46:51,000 And it goes A color, C color, T color, G color. That's it. So, 580 00:46:51,000 --> 00:46:57,000 there are actually machines now that have 96 different capillaries. 581 00:46:57,000 --> 00:47:04,000 These are called capillary tubes. And you can have 96 of them with 582 00:47:04,000 --> 00:47:12,000 laser scanning across, and in each column now, 583 00:47:12,000 --> 00:47:20,000 it turns out that you can read almost 1,000 letters, 584 00:47:20,000 --> 00:47:28,000 1,000 bases per column per capillary times about 100 capillaries. 585 00:47:28,000 --> 00:47:36,000 Or in other words, you can read out about 10^5 bases of information. 586 00:47:36,000 --> 00:47:40,000 You can read out 10^5 bases of information in about two hours. 587 00:47:40,000 --> 00:47:45,000 Of course, you can do that ten times a day. So, 588 00:47:45,000 --> 00:47:50,000 you can actually read out 10^6 or about a million bases of information 589 00:47:50,000 --> 00:47:55,000 per machine per day. And here at MIT, we have 100 of these machines. So, 590 00:47:55,000 --> 00:48:00,000 we actually can read out a little shy of 100 million letters of DNA 591 00:48:00,000 --> 00:48:05,000 sequence per day, which I mean is a lot. 592 00:48:05,000 --> 00:48:11,000 We read about 40 billion letters per year here at MIT, 593 00:48:11,000 --> 00:48:17,000 and this is how we do it. How much does a machine cost? 594 00:48:17,000 --> 00:48:23,000 List, or do you want a deal? They list for $300,000, 595 00:48:23,000 --> 00:48:29,000 but if you buy in bulk, I can do better. [LAUGHTER] We buy 596 00:48:29,000 --> 00:48:34,000 it in bulk, by the way. So now, how are we going to get our 597 00:48:34,000 --> 00:48:38,000 primer there?
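As a quick check on the throughput numbers quoted above, here is the arithmetic as a small Python sketch. The figures are the ones from the lecture (about 1,000 bases per capillary, 96 capillaries, ten runs a day, 100 machines); running every day of the year is an added assumption, and the function name is hypothetical.

```python
def sequencing_throughput(bases_per_capillary=1_000, capillaries=96,
                          runs_per_day=10, machines=100, days_per_year=365):
    """Multiply out the quoted figures; days_per_year is an assumption."""
    per_run = bases_per_capillary * capillaries
    per_machine_per_day = per_run * runs_per_day
    per_day_all_machines = per_machine_per_day * machines
    per_year = per_day_all_machines * days_per_year
    return {
        "per_run": per_run,
        "per_machine_per_day": per_machine_per_day,
        "per_day_all_machines": per_day_all_machines,
        "per_year": per_year,
    }


totals = sequencing_throughput()
for name, value in totals.items():
    print(f"{name}: {value:,} bases")
```

With exactly 96 capillaries the daily total comes out a little shy of 100 million, as stated, and the yearly total lands somewhat under the rounded figure of about 40 billion.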
That was the only little bit we were missing: where 598 00:48:38,000 --> 00:48:42,000 did our primer come from? The last little detail: here's my 599 00:48:42,000 --> 00:48:47,000 vector, remember, and I want to sequence this insert. 600 00:48:47,000 --> 00:48:51,000 How am I going to get a primer in the insert? I don't know what its 601 00:48:51,000 --> 00:48:56,000 sequence is. How do I even start this? Sorry? Well, 602 00:48:56,000 --> 00:49:00,000 but that won't tell me what the sequence is that I have to, 603 00:49:00,000 --> 00:49:05,000 I mean, I was looking to try to get a primer that matches the insert. 604 00:49:05,000 --> 00:49:09,000 And I don't know what the insert is. So, how am I going to get a primer? 605 00:49:09,000 --> 00:49:13,000 Oh, I know the vector. The vector is well known. 606 00:49:13,000 --> 00:49:17,000 Its sequence is in the catalog. Let me instead just use a primer 607 00:49:17,000 --> 00:49:21,000 that happens to sit in the vector, and I'll match to a known sequence 608 00:49:21,000 --> 00:49:25,000 to start with, and then I'll sequence into my 609 00:49:25,000 --> 00:49:29,000 unknown territory. So, this is how you get the initial 610 00:49:29,000 --> 00:49:33,000 primer: you arrange that your initial primer is sitting in known 611 00:49:33,000 --> 00:49:37,000 vector sequence. All right, so you can now sequence 612 00:49:37,000 --> 00:49:40,000 DNA. I've got to say, I've taught this course for a little 613 00:49:40,000 --> 00:49:44,000 more than a decade, and being able to say, 614 00:49:44,000 --> 00:49:47,000 now we can routinely sequence about a million letters per machine, 615 00:49:47,000 --> 00:49:50,000 and 100 million letters per day, and things like this was not 616 00:49:50,000 --> 00:49:53,000 routinely the case. When we started teaching this 617 00:49:53,000 --> 00:49:58,000 course, I was describing what we di