1 00:00:06,594 --> 00:00:07,550 ERIC LANDER: Good morning. 2 00:00:07,550 --> 00:00:08,800 Good morning. 3 00:00:11,870 --> 00:00:16,120 So we've been talking about recombinant DNA. 4 00:00:18,910 --> 00:00:22,520 And really what it does to our picture here-- 5 00:00:22,520 --> 00:00:25,080 function, gene, protein-- 6 00:00:25,080 --> 00:00:28,790 is for the first time take something that's a theoretical 7 00:00:28,790 --> 00:00:32,130 relationship and make it operational. 8 00:00:32,130 --> 00:00:42,160 Being able to go from a function like the ability to 9 00:00:42,160 --> 00:00:46,550 make your own arginine, to a specific gene, to a specific 10 00:00:46,550 --> 00:00:50,250 protein, and to be able to connect those up. 11 00:00:50,250 --> 00:00:53,340 In principle, by the time we're done with recombinant 12 00:00:53,340 --> 00:00:56,310 DNA, one should be able to go from any vertex of that 13 00:00:56,310 --> 00:00:59,190 triangle to any other vertex of that triangle. 14 00:00:59,190 --> 00:01:01,680 Given a function, find the genes. 15 00:01:01,680 --> 00:01:03,290 Given a gene, find the proteins. 16 00:01:03,290 --> 00:01:04,890 Given a protein, find the genes. 17 00:01:04,890 --> 00:01:07,700 Given a protein, find the function. 18 00:01:07,700 --> 00:01:11,630 That's really the goal of recombinant DNA is to be able 19 00:01:11,630 --> 00:01:14,750 to start at any vertex and reach any other 20 00:01:14,750 --> 00:01:16,670 vertex of that triangle. 21 00:01:16,670 --> 00:01:17,910 We're not there yet. 22 00:01:17,910 --> 00:01:20,590 But we will be in the next couple of days, to the point 23 00:01:20,590 --> 00:01:24,750 where we can move freely about this whole picture. 24 00:01:24,750 --> 00:01:29,450 So we've talked about DNA sequencing. 25 00:01:29,450 --> 00:01:31,840 And I want to pick up a little bit with DNA sequencing. 
26 00:01:31,840 --> 00:01:35,570 And I'll probably end with DNA sequencing again, as I tell 27 00:01:35,570 --> 00:01:37,455 you where things stand today. 28 00:01:40,510 --> 00:01:43,860 How did we use DNA sequencing? 29 00:01:43,860 --> 00:01:45,780 Well, we found us a clone. 30 00:01:45,780 --> 00:01:49,620 Maybe our clone was a clone that conferred the ability to 31 00:01:49,620 --> 00:01:50,810 grow without arginine. 32 00:01:50,810 --> 00:01:53,420 That was cloning by complementation. 33 00:01:53,420 --> 00:01:57,160 Maybe it was a clone that encoded beta globin. 34 00:01:57,160 --> 00:02:01,200 There we found an antibody that recognized beta globin. 35 00:02:01,200 --> 00:02:05,170 We made a cDNA library with an appropriate promoter. 36 00:02:05,170 --> 00:02:08,440 And we asked E. coli to produce those human proteins 37 00:02:08,440 --> 00:02:10,259 from those cDNAs. 38 00:02:10,259 --> 00:02:13,560 And then we use our antibody to recognize which clone. 39 00:02:13,560 --> 00:02:16,430 One way or the other, we found ourselves a clone. 40 00:02:19,860 --> 00:02:21,870 The clone had this vector. 41 00:02:21,870 --> 00:02:23,520 It had this insert. 42 00:02:23,520 --> 00:02:25,750 The insert is now of interest to us. 43 00:02:25,750 --> 00:02:27,410 We wish to sequence it. 44 00:02:27,410 --> 00:02:32,460 And we talked before about taking that piece of DNA and 45 00:02:32,460 --> 00:02:33,710 subjecting it to sequencing. 46 00:02:36,820 --> 00:02:40,140 We started with a primer. 47 00:02:40,140 --> 00:02:41,765 From that primer, we extended. 48 00:02:45,200 --> 00:02:52,550 And we hit a point, let's say, opposite an A or maybe 49 00:02:52,550 --> 00:03:02,950 opposite the next A or opposite the A after that or 50 00:03:02,950 --> 00:03:05,300 opposite the A after that. 
51 00:03:05,300 --> 00:03:10,450 And we had this clever trick, for which Fred Sanger actually 52 00:03:10,450 --> 00:03:15,830 won a Nobel Prize, of using a defective version of the 53 00:03:15,830 --> 00:03:20,930 nucleotide T that would stop at that point. 54 00:03:20,930 --> 00:03:24,980 But remember, we didn't use only defective T. We used a 55 00:03:24,980 --> 00:03:30,820 mixture of good T and defective T. Let's say 56 00:03:30,820 --> 00:03:33,370 defective T was 1% of the whole mixture, whenever we 57 00:03:33,370 --> 00:03:36,420 encountered a defective T and put it in, the chain would 58 00:03:36,420 --> 00:03:38,510 stop because it couldn't be extended. 59 00:03:38,510 --> 00:03:41,160 Whenever we put in a good T, it would keep going. 60 00:03:41,160 --> 00:03:42,840 And so since we're making-- 61 00:03:42,840 --> 00:03:46,050 Of course, we have millions and millions of copies of our 62 00:03:46,050 --> 00:03:48,000 template sitting there in the test tube. 63 00:03:48,000 --> 00:03:50,500 We're always working-- when I talk about "a" molecule, and I 64 00:03:50,500 --> 00:03:53,110 draw a picture of a molecule, we should always know that 65 00:03:53,110 --> 00:03:55,210 there's millions of copies of those things. 66 00:03:55,210 --> 00:03:56,370 Some of them are stopping here. 67 00:03:56,370 --> 00:03:57,410 Some of them are going here, et cetera, 68 00:03:57,410 --> 00:03:58,900 et cetera, et cetera. 69 00:03:58,900 --> 00:04:02,400 And we end up with a large collection. 70 00:04:02,400 --> 00:04:06,610 And if we separate it on a gel, we can detect the lengths 71 00:04:06,610 --> 00:04:07,860 of those fragments. 72 00:04:13,250 --> 00:04:17,360 If we attach a fluorescent dye, then those fragments are 73 00:04:17,360 --> 00:04:18,959 fluorescently labeled. 74 00:04:18,959 --> 00:04:24,660 We could all run them in the same lane 75 00:04:24,660 --> 00:04:25,910 with different colors. 
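The chain-termination idea above can be sketched in a few lines. This is a toy model, not a description of any real instrument: it assumes an error-free template written in the order the polymerase copies it, a made-up primer length of 20, and the 1% defective-T fraction used in the lecture. The function names are hypothetical.

```python
def ddT_fragment_lengths(template, primer_len=0):
    """Lengths of chains that can end in a defective (dideoxy) T.

    `template` is the stretch copied after the primer, in copying order.
    A T goes in opposite every A, so every A is a possible stopping point.
    """
    return [primer_len + i + 1 for i, base in enumerate(template) if base == "A"]


def termination_fraction(k, dd_fraction=0.01):
    """Fraction of all chains that stop exactly at the k-th A (k starts at 1).

    A chain stops there only if it picked a good T at the first k-1 A's
    and a defective T at the k-th one.
    """
    return dd_fraction * (1 - dd_fraction) ** (k - 1)


lengths = ddT_fragment_lengths("GATTACA", primer_len=20)
print(lengths)
print(termination_fraction(1), termination_fraction(100))
```

Because only a small fraction of chains stop at each A, chains ending at the 100th A are still reasonably abundant, which is why a mixture of good and defective T gives a whole ladder of fragment lengths rather than just the shortest one.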
76 00:04:28,530 --> 00:04:32,060 And we could put our little fluorescence detector and see 77 00:04:32,060 --> 00:04:32,990 what goes by. 78 00:04:32,990 --> 00:04:36,340 And the traces you would get, the actual pictures that 79 00:04:36,340 --> 00:04:41,356 emerge from this, look something like this. 80 00:04:55,920 --> 00:04:59,300 And you'd see colored traces like that emerging from that 81 00:04:59,300 --> 00:05:00,840 electrophoretic detector. 82 00:05:00,840 --> 00:05:03,770 And you could read off the sequence by the colors-- 83 00:05:03,770 --> 00:05:07,190 really very gorgeous, very simple, beautiful technology, 84 00:05:07,190 --> 00:05:09,890 all this taking place in a little capillary tube. 85 00:05:13,030 --> 00:05:15,840 You remember how we made a defective T, right? 86 00:05:15,840 --> 00:05:19,630 How did we make a defective T? 87 00:05:19,630 --> 00:05:20,270 Sorry? 88 00:05:20,270 --> 00:05:21,840 STUDENT: Dideoxy. 89 00:05:21,840 --> 00:05:22,930 ERIC LANDER: To dideoxy. 90 00:05:22,930 --> 00:05:25,520 Because remember, we need that 3-prime hydroxyl in 91 00:05:25,520 --> 00:05:26,870 order to extend it. 92 00:05:26,870 --> 00:05:29,590 No 3-prime hydroxyl, no extension. 93 00:05:29,590 --> 00:05:32,150 So if you make that deoxy at the 3-prime 94 00:05:32,150 --> 00:05:34,220 position, you can't extend. 95 00:05:34,220 --> 00:05:36,780 And since it was originally 2-prime deoxy, it's now 96 00:05:36,780 --> 00:05:38,740 2-prime, 3-prime dideoxy. 97 00:05:38,740 --> 00:05:41,440 That's all it takes in order to block 98 00:05:41,440 --> 00:05:43,360 the extension reaction. 99 00:05:43,360 --> 00:05:46,380 And they sell the stuff, and you can use it. 100 00:05:46,380 --> 00:05:47,215 All right. 101 00:05:47,215 --> 00:05:49,712 I had another question for you guys. 102 00:05:49,712 --> 00:05:51,460 Where did the primer come from? 103 00:05:55,550 --> 00:05:57,520 Sorry? 104 00:05:57,520 --> 00:05:57,930 Sorry?
105 00:05:57,930 --> 00:05:58,340 STUDENT: We put it in. 106 00:05:58,340 --> 00:05:58,970 ERIC LANDER: We put it in. 107 00:05:58,970 --> 00:06:01,605 How did we know what to put in? 108 00:06:01,605 --> 00:06:03,000 STUDENT: [INAUDIBLE] catalog. 109 00:06:03,000 --> 00:06:04,950 ERIC LANDER: Well, the catalog's very smart, but it 110 00:06:04,950 --> 00:06:06,440 doesn't tell us what we need. 111 00:06:06,440 --> 00:06:08,352 How do we know what sequence it is? 112 00:06:12,510 --> 00:06:15,920 Actually, it turns out that so many different sequences might 113 00:06:15,920 --> 00:06:18,910 be needed for different purposes in molecular biology 114 00:06:18,910 --> 00:06:22,160 that they don't stock them all in the catalog because you 115 00:06:22,160 --> 00:06:23,950 might order any sequence. 116 00:06:23,950 --> 00:06:26,610 So it really turns out that if you want to order a specific 117 00:06:26,610 --> 00:06:31,320 20-letter sequence, you go on the web, type it in, and then 118 00:06:31,320 --> 00:06:33,080 the machine will make it for you, and you get it the next 119 00:06:33,080 --> 00:06:34,980 day, it turns out. 120 00:06:34,980 --> 00:06:36,790 But they don't actually put it in the catalog because it 121 00:06:36,790 --> 00:06:38,740 would be too big an inventory. 122 00:06:38,740 --> 00:06:41,440 Although ones that people use a lot they keep in the 123 00:06:41,440 --> 00:06:44,604 catalog, otherwise they just make it on the spot for you. 124 00:06:44,604 --> 00:06:46,030 But how do we know what sequence 125 00:06:46,030 --> 00:06:46,870 we're supposed to use? 126 00:06:46,870 --> 00:06:50,990 Here's my clone, let's say arginine, the ARG1 gene. 127 00:06:50,990 --> 00:06:53,200 How did I know what sequence to start with? 128 00:06:56,672 --> 00:06:59,120 You let me get away with just put a primer there. 129 00:06:59,120 --> 00:07:00,370 And what does it match? 130 00:07:07,216 --> 00:07:08,683 Yeah? 
131 00:07:08,683 --> 00:07:12,110 STUDENT: Could you use the EcoR1 [INAUDIBLE]? 132 00:07:12,110 --> 00:07:13,360 ERIC LANDER: I did use EcoR1. 133 00:07:16,746 --> 00:07:18,730 STUDENT: [INAUDIBLE] 134 00:07:18,730 --> 00:07:23,778 ERIC LANDER: EcoR1 site here, GAATTC-- 135 00:07:23,778 --> 00:07:25,028 GAATTC-- 136 00:07:27,090 --> 00:07:31,090 I'm going to cut in this site. 137 00:07:31,090 --> 00:07:34,130 I'm sorry, I'm going to cut in this site 138 00:07:34,130 --> 00:07:35,715 like that, let's say. 139 00:07:35,715 --> 00:07:37,430 Yeah, like that. 140 00:07:37,430 --> 00:07:43,210 And this fragment here-- 141 00:07:43,210 --> 00:07:44,590 sorry, like that-- 142 00:07:44,590 --> 00:07:50,840 the fragment here that I'm going to start sequencing from 143 00:07:50,840 --> 00:07:53,470 it starts with a G, right? 144 00:07:53,470 --> 00:07:57,100 Because it's opposite that C. 145 00:07:57,100 --> 00:08:00,240 So because I cut with EcoR1, I'm pretty sure it starts with 146 00:08:00,240 --> 00:08:05,970 a G. It's not a very big primer to use, though. 147 00:08:05,970 --> 00:08:08,080 I could start with a C, but I don't think that's going to 148 00:08:08,080 --> 00:08:09,800 have enough binding energy to do it. 149 00:08:18,249 --> 00:08:23,230 STUDENT: Doesn't a bacteria already replicate that DNA? 150 00:08:23,230 --> 00:08:24,260 ERIC LANDER: It replicates that DNA just fine. 151 00:08:24,260 --> 00:08:26,232 STUDENT: So there's already a primer in there. 152 00:08:26,232 --> 00:08:27,220 Why do you [INAUDIBLE]? 153 00:08:27,220 --> 00:08:28,100 ERIC LANDER: Well, because I've now 154 00:08:28,100 --> 00:08:29,836 purified out my insert. 155 00:08:29,836 --> 00:08:31,410 And I'm going to subject it to sequencing. 156 00:08:31,410 --> 00:08:33,630 And I need to start with a primer. 157 00:08:33,630 --> 00:08:34,630 I've got that fragment. 158 00:08:34,630 --> 00:08:37,710 I need to know how that fragment starts. 
159 00:08:37,710 --> 00:08:41,366 And I don't know how that fragment starts. 160 00:08:41,366 --> 00:08:43,843 STUDENT: Is there a place you could cut in a little bit 161 00:08:43,843 --> 00:08:44,159 after that? 162 00:08:44,159 --> 00:08:45,610 ERIC LANDER: Oh! 163 00:08:45,610 --> 00:08:51,640 What if I was really smart and put in a different restriction 164 00:08:51,640 --> 00:08:54,520 site back here? 165 00:08:54,520 --> 00:08:59,070 And I use that restriction enzyme, and I cut there? 166 00:08:59,070 --> 00:09:01,330 Then what would you be able to tell me? 167 00:09:01,330 --> 00:09:03,380 Well, then the fragment would start with a 168 00:09:03,380 --> 00:09:05,900 known vector sequence. 169 00:09:05,900 --> 00:09:07,250 Bingo. 170 00:09:07,250 --> 00:09:08,290 That'll work. 171 00:09:08,290 --> 00:09:08,660 Good. 172 00:09:08,660 --> 00:09:10,360 Good engineering. 173 00:09:10,360 --> 00:09:12,800 Since I don't know what the sequence is of the thing I'm 174 00:09:12,800 --> 00:09:17,180 reading, I'd better back up a little bit and use a known 175 00:09:17,180 --> 00:09:18,640 sequence from the vector. 176 00:09:18,640 --> 00:09:20,600 And then I keep going. 177 00:09:20,600 --> 00:09:22,850 This is all just to give you a sense of all the 178 00:09:22,850 --> 00:09:24,430 tricks you can do. 179 00:09:24,430 --> 00:09:25,530 So now it's easy. 180 00:09:25,530 --> 00:09:28,700 Now, in fact if that's the vector I use a lot, all the 181 00:09:28,700 --> 00:09:30,540 time, then bingo! 182 00:09:30,540 --> 00:09:31,740 I can go to the catalog because they 183 00:09:31,740 --> 00:09:33,330 will stock that one. 184 00:09:33,330 --> 00:09:34,400 And I'll use it. 185 00:09:34,400 --> 00:09:39,210 And I can use my green primer there to get going. 186 00:09:39,210 --> 00:09:39,970 All right. 187 00:09:39,970 --> 00:09:44,130 Now here's the problem. 
188 00:09:44,130 --> 00:09:49,380 As I start sequencing, coming down this capillary tube are 189 00:09:49,380 --> 00:09:52,970 fragments of different lengths. 190 00:09:52,970 --> 00:09:58,860 The speed of migration depends on the logarithm of the length 191 00:09:58,860 --> 00:10:00,110 of the fragment. 192 00:10:02,250 --> 00:10:03,940 Big fragments go slower. 193 00:10:03,940 --> 00:10:07,110 It's inversely proportional to the log of the 194 00:10:07,110 --> 00:10:08,140 length of the fragment. 195 00:10:08,140 --> 00:10:10,900 So big fragments go slower. 196 00:10:10,900 --> 00:10:12,590 And as they get bigger and bigger and bigger, they go 197 00:10:12,590 --> 00:10:13,720 slower and slower and slower. 198 00:10:13,720 --> 00:10:20,190 But the difference between log of 1,000 and log of 1,001 is 199 00:10:20,190 --> 00:10:22,220 pretty small. 200 00:10:22,220 --> 00:10:25,890 And then log of 1,001 and 1,002, very small. 201 00:10:25,890 --> 00:10:29,430 Actually, those peaks over there start bunching up, and I 202 00:10:29,430 --> 00:10:31,760 can't tell them apart. 203 00:10:31,760 --> 00:10:35,270 So I can actually only go with this electrophoretic process 204 00:10:35,270 --> 00:10:39,170 maybe 1,000 letters before the peaks get too bunched up 205 00:10:39,170 --> 00:10:42,630 because the different speeds are not so different anymore. 206 00:10:42,630 --> 00:10:51,940 So I can only read 1,000 bases, let's say. 207 00:10:51,940 --> 00:10:55,420 In practice, we would tend to read 700, 800 bases because it 208 00:10:55,420 --> 00:10:56,520 started getting scruffy. 209 00:10:56,520 --> 00:10:59,190 But let's make it round, and we'll say 1,000 bases. 210 00:10:59,190 --> 00:11:02,850 Now, suppose this fragment that I got that's the ARG1 211 00:11:02,850 --> 00:11:06,970 gene is 3,000 bases.
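The bunching-up argument is just arithmetic on logarithms, and a short sketch makes it concrete. It takes the lecture's simplified model at face value (migration depends on the log of fragment length) and only compares how far apart neighboring lengths are on a log scale; `log_gap` is a hypothetical helper name.

```python
import math


def log_gap(n):
    """Difference between log(n + 1) and log(n): how far apart two
    fragments differing by one base sit on a logarithmic scale."""
    return math.log(n + 1) - math.log(n)


for n in (10, 100, 1000):
    print(n, log_gap(n))
```

The gap shrinks roughly like 1/n, so one extra base matters about a hundred times less at length 1,000 than at length 10, which is the reason the peaks stop being distinguishable.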
212 00:11:06,970 --> 00:11:09,830 Well, we've got our clever trick that you've introduced 213 00:11:09,830 --> 00:11:15,920 here of cutting back over here at some previous site, using 214 00:11:15,920 --> 00:11:18,090 our primer here and reading. 215 00:11:21,760 --> 00:11:25,290 But it kind of dies at about 1,000 bases. 216 00:11:25,290 --> 00:11:26,540 I can't read any further. 217 00:11:29,460 --> 00:11:30,710 What do I do? 218 00:11:33,260 --> 00:11:34,510 STUDENT: [INAUDIBLE] 219 00:11:47,470 --> 00:11:48,750 ERIC LANDER: What if there's not a perfect 220 00:11:48,750 --> 00:11:52,260 restriction site there? 221 00:11:52,260 --> 00:11:53,770 But you're on the right track. 222 00:11:53,770 --> 00:11:55,676 Keep going. 223 00:11:55,676 --> 00:12:01,190 Once I've sequenced the first 1,000 bases, what do I know? 224 00:12:01,190 --> 00:12:04,870 The sequence of the first 1,000 bases. 225 00:12:04,870 --> 00:12:07,364 What primer could I use then? 226 00:12:07,364 --> 00:12:08,360 STUDENT: [INAUDIBLE]. 227 00:12:08,360 --> 00:12:10,410 ERIC LANDER: I can just make a new primer based on those 228 00:12:10,410 --> 00:12:11,830 bases, right? 229 00:12:11,830 --> 00:12:13,730 So I don't even need my restriction site anymore. 230 00:12:13,730 --> 00:12:14,680 You've got it exactly right. 231 00:12:14,680 --> 00:12:15,920 I use my knowledge. 232 00:12:15,920 --> 00:12:18,910 And then I could use a new primer, and I 233 00:12:18,910 --> 00:12:23,130 could go more bases. 234 00:12:23,130 --> 00:12:31,360 Then I could use a new primer and go more bases. 235 00:12:31,360 --> 00:12:36,550 Then a new primer, and go more bases. 236 00:12:36,550 --> 00:12:38,060 And I can do what's called "primer 237 00:12:38,060 --> 00:12:41,400 walking" along the clone. 238 00:12:41,400 --> 00:12:43,170 Will that work? 239 00:12:43,170 --> 00:12:43,940 You bet. 240 00:12:43,940 --> 00:12:45,190 That works just fine. 241 00:12:47,820 --> 00:12:51,100 It's also very slow. 
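Primer walking can be sketched as a loop in which each round depends on the result of the previous one. This is a simulation under loud assumptions: the hidden insert is a plain string, one "sequencing reaction" is just a slice of at most 1,000 bases, and the next primer is simply the last 20 bases read so far. `primer_walk` and its parameters are hypothetical names.

```python
def primer_walk(insert, read_len=1000, primer_len=20):
    """Simulate primer walking along `insert`.

    Returns the reconstructed sequence and the list of primers that had
    to be ordered along the way. The first read is primed from known
    vector sequence, so it needs no custom primer.
    """
    known = ""
    primers = []
    while len(known) < len(insert):
        if known:
            # Design the next primer from the end of what has been read.
            primers.append(known[-primer_len:])
        # Stand-in for one sequencing reaction starting after the primer.
        known += insert[len(known):len(known) + read_len]
    return known, primers


sequence, primers = primer_walk("ACGT" * 750)  # a 3,000-base toy insert
print(len(sequence), len(primers))
```

Each pass through the loop has to wait for the previous pass to finish, which is the serial bottleneck described next.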
242 00:12:51,100 --> 00:12:56,050 Because I had to get my first bases, analyze them, order a 243 00:12:56,050 --> 00:12:58,760 new primer, and the next day set up my next reaction. 244 00:12:58,760 --> 00:13:01,130 Take a couple days, get my next reaction. 245 00:13:01,130 --> 00:13:03,110 Take a couple days, next reaction. 246 00:13:03,110 --> 00:13:06,550 Imagine sequencing the human genome like this. 247 00:13:06,550 --> 00:13:11,720 This could take a long time if I do it in serial. 248 00:13:11,720 --> 00:13:13,150 So what else could I do? 249 00:13:13,150 --> 00:13:13,930 This works, by the way. 250 00:13:13,930 --> 00:13:15,000 This totally works. 251 00:13:15,000 --> 00:13:16,110 It's a good procedure, and it's 252 00:13:16,110 --> 00:13:17,470 used for certain purposes. 253 00:13:17,470 --> 00:13:18,720 But what else could I do? 254 00:13:24,790 --> 00:13:26,320 Here's the cool thing. 255 00:13:26,320 --> 00:13:28,220 We've got biology. 256 00:13:28,220 --> 00:13:30,490 But you guys, being MIT students, you've also got 257 00:13:30,490 --> 00:13:33,260 computer science and other tricks available. 258 00:13:33,260 --> 00:13:34,510 Here's a cool trick. 259 00:13:37,160 --> 00:13:39,820 I have my clone, 3,000 bases. 260 00:13:39,820 --> 00:13:41,690 I like my clone. 261 00:13:41,690 --> 00:13:47,030 I'm going to take my clone, my fragment, of 3,000 bases, and 262 00:13:47,030 --> 00:13:51,180 maybe instead of sequencing it, I'm going to shred it up 263 00:13:51,180 --> 00:13:54,060 into a lot of smaller pieces. 264 00:13:54,060 --> 00:13:59,180 Suppose I shred this up into fragments of, 265 00:13:59,180 --> 00:14:01,600 say, size 800 bases. 266 00:14:01,600 --> 00:14:03,300 This was 3,000 bases. 267 00:14:03,300 --> 00:14:04,420 Now let me shred it up. 268 00:14:04,420 --> 00:14:06,750 I'll just take 800 as a number. 269 00:14:06,750 --> 00:14:09,740 Now remember, I had a lot of copies of this clone, right? 
270 00:14:09,740 --> 00:14:12,795 So I'm going to get shreds like this and shreds like that 271 00:14:12,795 --> 00:14:14,060 and a shred like that. 272 00:14:14,060 --> 00:14:15,130 It's not just one copy. 273 00:14:15,130 --> 00:14:18,130 I've got a lot of copies of this piece of DNA there. 274 00:14:18,130 --> 00:14:19,300 I'm going to shred it up. 275 00:14:19,300 --> 00:14:21,600 And there are ways to shred it up by forcing it through a 276 00:14:21,600 --> 00:14:25,210 needle or treating it meanly, or things like that. 277 00:14:25,210 --> 00:14:27,550 I'm going to get lots of little fragments. 278 00:14:27,550 --> 00:14:33,540 And what I could do is I could clone all of those little 279 00:14:33,540 --> 00:14:35,990 subfragments. 280 00:14:35,990 --> 00:14:39,040 I take my big fragment, and what I'm going to do is I'm 281 00:14:39,040 --> 00:14:46,400 going to make a new library of subfragments. 282 00:14:46,400 --> 00:14:48,230 Got it? 283 00:14:48,230 --> 00:14:51,340 Now I have a whole lot of subfragments, each taken from 284 00:14:51,340 --> 00:14:52,580 my 3,000 bases. 285 00:14:52,580 --> 00:14:53,830 And they're all kind of smallish. 286 00:14:56,910 --> 00:14:58,720 What could I do to all of those little subfragments? 287 00:15:01,910 --> 00:15:04,520 They're all living in a vector. 288 00:15:04,520 --> 00:15:07,060 Each is in its own vector. 289 00:15:07,060 --> 00:15:07,570 I've spread them out. 290 00:15:07,570 --> 00:15:08,630 I've made a library. 291 00:15:08,630 --> 00:15:10,080 They're each in their own bacteria. 292 00:15:10,080 --> 00:15:11,370 They're each in a vector. 293 00:15:11,370 --> 00:15:13,240 That vector has a known sequence at its 294 00:15:13,240 --> 00:15:16,530 end, the green primer. 295 00:15:16,530 --> 00:15:22,430 Couldn't I just sequence a lot of different random subclones 296 00:15:22,430 --> 00:15:24,155 and paste them together? 297 00:15:24,155 --> 00:15:27,440 See, that's where it pays to have computer science as well. 
298 00:15:27,440 --> 00:15:34,380 Because what I could do is by subcloning these into 299 00:15:34,380 --> 00:15:36,380 individual little random pieces-- 300 00:15:36,380 --> 00:15:38,210 I have no idea how they've been broken up. 301 00:15:38,210 --> 00:15:38,610 I don't care-- 302 00:15:38,610 --> 00:15:41,730 I just take the total DNA, shred it up into pieces, 303 00:15:41,730 --> 00:15:44,460 subclone it into a vector. 304 00:15:44,460 --> 00:15:47,610 And now because it's in the vector, I could read this one 305 00:15:47,610 --> 00:15:49,760 and that one and this one and that one and this 306 00:15:49,760 --> 00:15:50,510 one and that one. 307 00:15:50,510 --> 00:15:53,980 Do I actually know which ones I'm reading? 308 00:15:53,980 --> 00:15:54,330 No. 309 00:15:54,330 --> 00:15:57,130 It's totally random. 310 00:15:57,130 --> 00:16:00,500 I take my 3,000 bases, shred it up into lots of smaller 311 00:16:00,500 --> 00:16:03,800 pieces, and I just read a lot of them. 312 00:16:03,800 --> 00:16:06,090 When I get a whole lot of these pieces, maybe 800 313 00:16:06,090 --> 00:16:09,410 letters each, what do I do? 314 00:16:09,410 --> 00:16:13,330 I write me a piece of code that looks for overlaps 315 00:16:13,330 --> 00:16:16,020 between them and starts pasting it together. 316 00:16:18,540 --> 00:16:20,180 And that's called assembly. 317 00:16:20,180 --> 00:16:23,250 You assemble the sequence out of its little pieces. 318 00:16:23,250 --> 00:16:26,510 And so you can assemble things. 319 00:16:26,510 --> 00:16:31,301 And this gets referred to as shotgun sequencing. 320 00:16:31,301 --> 00:16:33,940 That's what it's really called, because it's like you 321 00:16:33,940 --> 00:16:35,630 shoot it out of the end of the shotgun or something. 322 00:16:35,630 --> 00:16:37,160 It's broken up into a lot of pieces.
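The "piece of code that looks for overlaps" can be sketched as a greedy merge. This is one simple way to do it, not necessarily what any production assembler does, and it assumes things real data does not offer: error-free reads, all from the same strand, no repeated regions, and no read wholly contained inside another. `overlap`, `greedy_assemble`, and `min_len` are hypothetical names.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of `a` that equals a prefix of `b`,
    or 0 if that length is below `min_len`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of pieces with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # nothing left to paste together
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


reads = ["AGCCTAGGA", "ATGGCGTACG", "TACGTTAGCC"]  # arbitrary order
print(greedy_assemble(reads))
```

The order of the reads does not matter, which is what lets all the sequencing reactions run at the same time and leaves the ordering to the computer.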
323 00:16:37,160 --> 00:16:41,310 It's just a shotgun, random approach where I take 324 00:16:41,310 --> 00:16:43,920 individual random clones, and I assemble them. 325 00:16:43,920 --> 00:16:44,660 Any questions about that? 326 00:16:44,660 --> 00:16:46,440 It's really a way to do it. 327 00:16:46,440 --> 00:16:49,650 And the big difference there is you can do it in parallel. 328 00:16:49,650 --> 00:16:52,820 Rather than doing it one step at a time, which sounds so 329 00:16:52,820 --> 00:16:54,990 logical but takes so much time-- 330 00:16:54,990 --> 00:16:55,960 easier. 331 00:16:55,960 --> 00:17:01,220 Just shred it up, read lots of them all the same afternoon, 332 00:17:01,220 --> 00:17:03,810 and then assemble them by computer. 333 00:17:03,810 --> 00:17:06,099 So that's nice. 334 00:17:06,099 --> 00:17:06,849 So I do that. 335 00:17:06,849 --> 00:17:08,150 I get my clone. 336 00:17:08,150 --> 00:17:10,579 I'm going to now do my computer assembly of it. 337 00:17:10,579 --> 00:17:14,780 And I'm going to get my 3,000 bases. 338 00:17:14,780 --> 00:17:20,820 How do I analyze my clone, my clone sequence? 339 00:17:24,319 --> 00:17:29,818 Now I have 3,000 letters in order, nicely done. 340 00:17:29,818 --> 00:17:31,230 What do I do with it? 341 00:17:36,460 --> 00:17:39,680 That clone, let's say, is able to confer the ability to grow 342 00:17:39,680 --> 00:17:41,270 without arginine. 343 00:17:41,270 --> 00:17:45,120 It encodes some enzyme that lets you make arginine. 344 00:17:45,120 --> 00:17:46,260 I've got 3,000 letters. 345 00:17:46,260 --> 00:17:48,680 How do I tell what it's doing? 346 00:17:48,680 --> 00:17:50,040 What do I look for? 347 00:17:50,040 --> 00:17:50,757 Yep? 348 00:17:50,757 --> 00:17:52,705 STUDENT: Compare it with something that doesn't have 349 00:17:52,705 --> 00:17:54,660 [INAUDIBLE]? 350 00:17:54,660 --> 00:17:56,700 ERIC LANDER: Well, so tell me what I'm looking for? 351 00:17:56,700 --> 00:17:59,172 I'm looking for a gene? 
352 00:17:59,172 --> 00:18:02,136 STUDENT: Yes, [INAUDIBLE]. 353 00:18:02,136 --> 00:18:02,630 ERIC LANDER: Yeah. 354 00:18:02,630 --> 00:18:05,320 So what about that gene? 355 00:18:05,320 --> 00:18:06,860 What's distinctive about genes? 356 00:18:06,860 --> 00:18:09,391 How do I recognize a gene? 357 00:18:09,391 --> 00:18:10,792 It's tricky. 358 00:18:10,792 --> 00:18:13,127 STUDENT: The sequence? 359 00:18:13,127 --> 00:18:13,594 ERIC LANDER: How can I just 360 00:18:13,594 --> 00:18:14,630 recognize it from the sequence? 361 00:18:14,630 --> 00:18:16,938 Can I tell that something is a gene? 362 00:18:16,938 --> 00:18:18,460 STUDENT: Start codon. 363 00:18:18,460 --> 00:18:20,660 ERIC LANDER: I could look for a start codon, ATG. 364 00:18:20,660 --> 00:18:24,970 Do you think that'll happen just by chance, though? 365 00:18:24,970 --> 00:18:28,390 There'll be a lot of ATGs running around, because you've got two 366 00:18:28,390 --> 00:18:30,120 strands, three reading frames. 367 00:18:30,120 --> 00:18:32,390 It'll happen pretty often. 368 00:18:32,390 --> 00:18:33,570 But that's a start. 369 00:18:33,570 --> 00:18:35,362 What happens after the start codon? 370 00:18:35,362 --> 00:18:36,580 STUDENT: There's a stop codon. 371 00:18:36,580 --> 00:18:38,450 ERIC LANDER: There's a stop codon, at some point. 372 00:18:38,450 --> 00:18:39,660 And what's in between the start codon 373 00:18:39,660 --> 00:18:41,768 and the stop codon? 374 00:18:41,768 --> 00:18:44,000 STUDENT: [INAUDIBLE]. 375 00:18:44,000 --> 00:18:47,310 ERIC LANDER: Well, no stop codons. 376 00:18:47,310 --> 00:18:52,220 A gene should look like ATG and a whole lot of codons 377 00:18:52,220 --> 00:18:55,360 without any stops in the reading frame, until 378 00:18:55,360 --> 00:18:57,390 you get to a stop. 379 00:18:57,390 --> 00:19:01,320 That's called an open reading frame, a long stretch without 380 00:19:01,320 --> 00:19:02,780 stop codons.
381 00:19:02,780 --> 00:19:04,610 So I could look for an open reading frame. 382 00:19:18,560 --> 00:19:21,020 So by an open reading frame, I mean a long stretch that 383 00:19:21,020 --> 00:19:24,370 starts with an ATG and then goes on and on 384 00:19:24,370 --> 00:19:25,570 and on and on with-- 385 00:19:25,570 --> 00:19:26,630 How frequent are stops? 386 00:19:26,630 --> 00:19:28,500 There are three stops out of 64. 387 00:19:28,500 --> 00:19:31,340 One codon in 20 is a stop, on average. 388 00:19:31,340 --> 00:19:35,030 So if I go 20 codons, I might see a stop, on average. 389 00:19:35,030 --> 00:19:40,990 But suppose I run for 100 codons, and there's no stop-- 390 00:19:40,990 --> 00:19:43,460 without a stop codon. 391 00:19:43,460 --> 00:19:48,130 That's pretty impressive, isn't it, if I can read 100 392 00:19:48,130 --> 00:19:50,370 codons in a row, and I never see a stop, 393 00:19:50,370 --> 00:19:51,760 that's pretty unusual. 394 00:19:51,760 --> 00:19:53,110 So I say, that's an open reading frame. 395 00:19:56,370 --> 00:19:58,960 That's one way to recognize the gene in there. 396 00:19:58,960 --> 00:20:01,770 The problem is introns. 397 00:20:01,770 --> 00:20:04,760 What happens if there's an intron? 398 00:20:04,760 --> 00:20:05,900 Yikes. 399 00:20:05,900 --> 00:20:08,120 Then it'll be spliced there. 400 00:20:08,120 --> 00:20:10,490 That'll be spliced out, but I won't initially know that, 401 00:20:10,490 --> 00:20:11,350 reading the sequence. 402 00:20:11,350 --> 00:20:14,730 And there could be stop codons in the intron because it's not 403 00:20:14,730 --> 00:20:17,510 part of the final message. 404 00:20:17,510 --> 00:20:18,780 So I'm in trouble. 405 00:20:18,780 --> 00:20:22,050 So happily, in yeast, which has very small introns, not 406 00:20:22,050 --> 00:20:25,140 very many introns, I can actually almost get away by 407 00:20:25,140 --> 00:20:26,860 looking for open reading frames.
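The open-reading-frame search can be sketched as a scan of three frames on each of the two strands. This is a minimal version under stated assumptions: uppercase DNA with only A, C, G, T, the standard three stop codons, no introns (so it suits the yeast case rather than the human one), and a uniform-random model for the chance calculation. `find_orfs` and `min_codons` are hypothetical names.

```python
STOPS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=100):
    """Return (strand, start, end, n_codons) for each ATG...stop stretch
    with at least `min_codons` codons before the stop.

    Coordinates on the "-" strand refer to the reverse complement.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG since the last stop in this frame
                elif start is not None and codon in STOPS:
                    n = (i - start) // 3
                    if n >= min_codons:
                        orfs.append((strand, start, i + 3, n))
                    start = None
    return orfs


# Chance that 100 random codons in a row contain no stop (61 of 64 are not stops).
p_no_stop = (61 / 64) ** 100
print(p_no_stop)

toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
print(find_orfs(toy, min_codons=3))
```

The probability comes out below one percent, which is why a stop-free run of 100 codons is taken as a sign of a gene rather than chance.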
408 00:20:26,860 --> 00:20:31,332 In human DNA, this is kind of lousy. 409 00:20:31,332 --> 00:20:33,760 Well, it's really problematic because 410 00:20:33,760 --> 00:20:35,420 there'll be too many introns. 411 00:20:35,420 --> 00:20:38,070 There, other tricks that get used-- 412 00:20:38,070 --> 00:20:42,090 I could make cDNA and compare it to cDNAs, which have 413 00:20:42,090 --> 00:20:44,210 already spliced everything out, and look for the open 414 00:20:44,210 --> 00:20:45,390 reading frame. 415 00:20:45,390 --> 00:20:47,330 Other tricks that get used-- 416 00:20:47,330 --> 00:20:49,590 I can compare it to the database of everything 417 00:20:49,590 --> 00:20:52,830 everybody has ever sequenced before and start looking for 418 00:20:52,830 --> 00:20:54,010 similarities. 419 00:20:54,010 --> 00:20:56,480 And today there are massive databases. 420 00:20:56,480 --> 00:21:01,490 So many years ago, a postdoctoral fellow in my lab 421 00:21:01,490 --> 00:21:04,640 cloned a gene related to a human disease. 422 00:21:04,640 --> 00:21:07,000 And she didn't know what the gene did. 423 00:21:07,000 --> 00:21:09,630 And she found it. 424 00:21:09,630 --> 00:21:11,935 And it had exons, but she didn't know 425 00:21:11,935 --> 00:21:14,250 where they were yet. 426 00:21:14,250 --> 00:21:17,610 But she just took the whole sequence and said, this 427 00:21:17,610 --> 00:21:21,430 sequence here, is it similar to anything that's ever been 428 00:21:21,430 --> 00:21:22,820 seen before? 429 00:21:22,820 --> 00:21:25,580 This was a gene that was in people who had a really severe 430 00:21:25,580 --> 00:21:28,820 form of dwarfism with twisted bones and things, called 431 00:21:28,820 --> 00:21:30,370 diastrophic dysplasia. 432 00:21:30,370 --> 00:21:33,060 And she put it against the computer database. 
433 00:21:33,060 --> 00:21:36,910 And the computer came back and said, the sequence you just 434 00:21:36,910 --> 00:21:42,120 gave me has a whole lot of patches that look just like 435 00:21:42,120 --> 00:21:46,680 sulfate transporters in a fungus. 436 00:21:46,680 --> 00:21:51,030 She instantly knew what her gene did. 437 00:21:51,030 --> 00:21:53,830 Because it turns out that bones have a lot of sulfated 438 00:21:53,830 --> 00:21:56,780 proteoglycans, et cetera, et cetera, whatever those are. 439 00:21:56,780 --> 00:21:59,370 And she instantly knew, because my sequence was 440 00:21:59,370 --> 00:22:01,190 similar to something-- it's a human sequence similar to 441 00:22:01,190 --> 00:22:04,430 something in a fungus that does sulfate transport, I've 442 00:22:04,430 --> 00:22:05,800 probably got a sulfate transporter. 443 00:22:05,800 --> 00:22:07,700 That's probably the basis of my disease. 444 00:22:07,700 --> 00:22:10,900 She took her cells from her patients, added sulfate, found 445 00:22:10,900 --> 00:22:14,060 that the cells couldn't take up sulfate very well, and 446 00:22:14,060 --> 00:22:16,490 bingo-- had found the cause of her disease. 447 00:22:16,490 --> 00:22:18,670 One of the most powerful ways-- 448 00:22:18,670 --> 00:22:21,120 it's sort of Google, of course, right?-- 449 00:22:21,120 --> 00:22:22,480 it's Google before Google. 450 00:22:22,480 --> 00:22:24,700 You take your sequence and you Google it against all other 451 00:22:24,700 --> 00:22:27,440 sequences and see what it's like. 452 00:22:27,440 --> 00:22:30,540 And by googling all of life's sequences against each other, 453 00:22:30,540 --> 00:22:33,740 if somebody else has already solved your problem for you, 454 00:22:33,740 --> 00:22:36,350 you can find out about your problem. 455 00:22:36,350 --> 00:22:39,340 And it's just this wonderful network effect that is so 456 00:22:39,340 --> 00:22:41,940 characteristic of information technologies.
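Asking whether a sequence is "similar to anything that's ever been seen before" needs some way to score how alike two strings are when letters may be changed, inserted, or deleted. One textbook way to do that is edit distance by dynamic programming, sketched below. It is offered as an illustration of the idea only: real database searches use faster heuristics and biologically motivated scoring rather than this exact calculation, and `edit_distance` is a hypothetical name.

```python
def edit_distance(a, b):
    """Minimum number of single-letter substitutions, insertions, and
    deletions needed to turn string `a` into string `b`."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = cur
    return prev[-1]


print(edit_distance("GATTACA", "GATACA"))   # one deletion
print(edit_distance("GATTACA", "GACTACA"))  # one substitution
```

A small distance relative to the sequence length is the kind of signal that says two sequences are probably related.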
457 00:22:41,940 --> 00:22:47,320 So anyway, you can do that by searching databases. 458 00:22:47,320 --> 00:22:48,815 And we will not, in this class. 459 00:22:51,650 --> 00:22:53,830 So you can write code for looking for open reading 460 00:22:53,830 --> 00:22:57,150 frames, you can search databases for similarities 461 00:22:57,150 --> 00:23:00,610 across organisms or within organisms, 462 00:23:00,610 --> 00:23:01,750 or things like that. 463 00:23:01,750 --> 00:23:04,850 We won't here, but there are at MIT some great courses on 464 00:23:04,850 --> 00:23:08,550 computational biology that, for example, you can write 465 00:23:08,550 --> 00:23:10,620 algorithms for detecting these sorts of things. 466 00:23:10,620 --> 00:23:11,490 It's an interesting question. 467 00:23:11,490 --> 00:23:14,340 How do you write an algorithm for comparing two strings 468 00:23:14,340 --> 00:23:17,200 which might have insertions and deletions and changes and 469 00:23:17,200 --> 00:23:17,780 things like that? 470 00:23:17,780 --> 00:23:20,820 There's a whole rich field of computational mathematics 471 00:23:20,820 --> 00:23:23,060 associated with genome comparisons. 472 00:23:23,060 --> 00:23:23,910 All right. 473 00:23:23,910 --> 00:23:26,240 So we've got it. 474 00:23:26,240 --> 00:23:27,348 Bingo! 475 00:23:27,348 --> 00:23:30,910 Now, here's our next problem. 476 00:23:30,910 --> 00:23:38,010 Our next problem, we cloned the gene for 477 00:23:38,010 --> 00:23:41,730 beta globin from you. 478 00:23:41,730 --> 00:23:42,770 You were kind enough. 479 00:23:42,770 --> 00:23:46,020 You signed an informed consent allowing us to take some DNA, 480 00:23:46,020 --> 00:23:47,370 prepare a library. 481 00:23:47,370 --> 00:23:49,220 We made our antibody, we found your beta 482 00:23:49,220 --> 00:23:51,360 globin gene, et cetera. 483 00:23:51,360 --> 00:23:55,310 Now we're going to conduct a study of beta globin in a 484 00:23:55,310 --> 00:23:58,010 larger population. 
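The question raised above, how to compare two strings that might differ by insertions, deletions, and changes, has a classic dynamic-programming answer called edit distance. The sketch below is one standard textbook formulation, offered as an illustration rather than as what any particular course or genome tool uses; the function name and the unit cost for every operation are assumptions.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string a into string b (unit cost for each)."""
    # prev[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,                      # delete char_a
                curr[j - 1] + 1,                  # insert char_b
                prev[j - 1] + substitution_cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


print(edit_distance("GATTACA", "GATACA"))   # differs by one deletion
print(edit_distance("GATTACA", "GACTACA"))  # differs by one substitution
```

Biological alignment methods build on the same table-filling idea but typically use different scores for matches, mismatches, and gaps.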
485 00:23:58,010 --> 00:24:01,740 Maybe we're going to ask multiple people in the class, 486 00:24:01,740 --> 00:24:04,420 would they be willing to sign an informed consent to have 487 00:24:04,420 --> 00:24:06,740 their beta globin gene sequenced? 488 00:24:06,740 --> 00:24:07,740 It's an interesting gene. 489 00:24:07,740 --> 00:24:10,320 There are variants in it that confer risk of sickle cell. 490 00:24:10,320 --> 00:24:13,040 There are variants in it that confer risks of other things. 491 00:24:13,040 --> 00:24:14,790 There are fascinating things about that gene. 492 00:24:14,790 --> 00:24:17,110 Maybe we'd like to see the beta globin 493 00:24:17,110 --> 00:24:18,130 sequence of many people. 494 00:24:18,130 --> 00:24:20,380 How do we get the beta globin sequence of a second person? 495 00:24:24,430 --> 00:24:25,690 Well, how'd we get the beta globin sequence 496 00:24:25,690 --> 00:24:27,890 from the first person? 497 00:24:27,890 --> 00:24:34,520 Took DNA, cut it up, cloned it in our vector, spread it out 498 00:24:34,520 --> 00:24:37,230 on the plate, washed over the antibody-- 499 00:24:37,230 --> 00:24:39,200 actually, we took cDNA-- 500 00:24:39,200 --> 00:24:42,400 washed it over, et cetera, et cetera, et cetera. 501 00:24:42,400 --> 00:24:43,890 It's a lot of work. 502 00:24:43,890 --> 00:24:46,410 If we wanted to do 100 people in this class, do we have to 503 00:24:46,410 --> 00:24:50,820 get DNA from each of you, prepare a library, maybe a 504 00:24:50,820 --> 00:24:54,880 cDNA library even, and do the same exact process to discover 505 00:24:54,880 --> 00:24:57,200 your beta globin gene? 506 00:24:57,200 --> 00:25:00,000 Or is there any way where after we've done your beta 507 00:25:00,000 --> 00:25:03,740 globin gene, we could now do everybody's beta globin gene a 508 00:25:03,740 --> 00:25:04,990 lot easier? 509 00:25:07,440 --> 00:25:09,440 STUDENT: [INAUDIBLE] 510 00:25:09,440 --> 00:25:11,040 ERIC LANDER: Actually, I do know the sequence.
511 00:25:11,040 --> 00:25:12,930 That's the thing that's different is having found it 512 00:25:12,930 --> 00:25:15,970 once, I know the whole sequence. 513 00:25:15,970 --> 00:25:19,640 The question is, can I use the sequence to save me the 514 00:25:19,640 --> 00:25:24,060 trouble of making an entire library of everything? 515 00:25:24,060 --> 00:25:27,130 How can I use the knowledge I've just gained to make it so 516 00:25:27,130 --> 00:25:28,560 much easier? 517 00:25:28,560 --> 00:25:32,510 Well, the answer occurred to a chemist working at Cetus 518 00:25:32,510 --> 00:25:35,590 Corporation in the mid-1980s. 519 00:25:35,590 --> 00:25:38,580 He was driving along, and he was thinking about this very 520 00:25:38,580 --> 00:25:42,410 problem and thinking about sequencing and how they do it. 521 00:25:42,410 --> 00:25:45,650 And he had the following thought. 522 00:25:45,650 --> 00:25:49,630 His following thought was, suppose we've got the whole 523 00:25:49,630 --> 00:25:51,250 human genome. 524 00:25:51,250 --> 00:25:53,160 There's a whole human genome. 525 00:25:53,160 --> 00:25:56,380 And I'm just going to melt it for you, for a second, into 526 00:25:56,380 --> 00:25:58,016 two strands. 527 00:25:58,016 --> 00:25:59,670 And suppose we've already discovered 528 00:25:59,670 --> 00:26:01,780 the beta globin gene. 529 00:26:01,780 --> 00:26:05,590 The beta globin gene is right over here, it turns out. 530 00:26:05,590 --> 00:26:06,840 That's beta globin. 531 00:26:10,540 --> 00:26:11,790 Well, we know this whole sequence. 532 00:26:16,950 --> 00:26:18,100 This is total DNA. 533 00:26:18,100 --> 00:26:19,640 I haven't done anything right now. 534 00:26:19,640 --> 00:26:21,140 This is the whole human genome that runs 535 00:26:21,140 --> 00:26:22,960 over 3 billion bases. 536 00:26:22,960 --> 00:26:24,413 But it's the whole genome. 537 00:26:24,413 --> 00:26:28,010 I know the sequence, right?
538 00:26:28,010 --> 00:26:32,500 I could make a primer to that part of the sequence. 539 00:26:32,500 --> 00:26:34,590 And suppose I just make a primer to that part of the 540 00:26:34,590 --> 00:26:37,990 sequence, throw it into your total DNA, and I add 541 00:26:37,990 --> 00:26:40,410 polymerase and nucleotides. 542 00:26:40,410 --> 00:26:47,850 Maybe it'll start copying. 543 00:26:47,850 --> 00:26:48,970 At some point, it'll fall off. 544 00:26:48,970 --> 00:26:57,630 But notice, I've made an extra copy of beta globin-- 545 00:26:57,630 --> 00:27:01,210 of course, mixed into the whole, total human genome. 546 00:27:01,210 --> 00:27:04,130 But there's a little bit extra beta globin now. 547 00:27:04,130 --> 00:27:06,600 Suppose I also made a primer over here. 548 00:27:13,070 --> 00:27:14,320 I'd get that. 549 00:27:16,320 --> 00:27:18,810 I'd now have two double strands of beta globin, 550 00:27:18,810 --> 00:27:20,080 whereas before, I only had one. 551 00:27:23,930 --> 00:27:25,180 Let's call this step 1. 552 00:27:28,190 --> 00:27:31,030 What do you think step 1 should be followed by? 553 00:27:31,030 --> 00:27:32,030 STUDENT: Step 2. 554 00:27:32,030 --> 00:27:33,010 ERIC LANDER: Step 2. 555 00:27:33,010 --> 00:27:34,490 Very good. 556 00:27:34,490 --> 00:27:35,140 Excellent. 557 00:27:35,140 --> 00:27:38,190 You guys have learned induction. 558 00:27:38,190 --> 00:27:59,230 So let me melt the DNA and now throw back my primer. 559 00:28:03,240 --> 00:28:07,440 Actually, if you'll allow me, I'm going to make the two 560 00:28:07,440 --> 00:28:10,230 primers different colors. 561 00:28:10,230 --> 00:28:13,470 Let's make that one a different color. 562 00:28:13,470 --> 00:28:14,720 There we go. 563 00:28:20,270 --> 00:28:27,240 So now what will happen is this primer goes here. 564 00:28:27,240 --> 00:28:30,280 This green primer goes here. 565 00:28:34,220 --> 00:28:37,420 This guy sits down here. 566 00:28:42,060 --> 00:28:44,306 And this guy goes like that. 
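The two-primer setup being drawn here can be sketched in code. This is a simplified illustration under stated assumptions: the primers are taken as the first and last few bases of the known sequence (the second one reverse-complemented, since it binds the opposite strand), binding requires an exact match, and the "genome" and target strings are short toy sequences invented for the example. The function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def design_primers(known_sequence, length):
    """Pick one primer at each end of the known sequence.

    The forward primer copies the start of the top strand; the reverse
    primer is the reverse complement of the end, so it binds the top
    strand and is extended back toward the forward primer.
    """
    forward = known_sequence[:length]
    reverse = reverse_complement(known_sequence[-length:])
    return forward, reverse


def in_silico_pcr(template, forward, reverse):
    """Return the stretch of template bounded by the two primers,
    or None if either primer site is missing (exact matches only)."""
    start = template.find(forward)
    if start == -1:
        return None
    reverse_site = reverse_complement(reverse)  # as it reads on the top strand
    end = template.find(reverse_site, start + len(forward))
    if end == -1:
        return None
    return template[start:end + len(reverse_site)]


# Toy sequences for illustration only.
target = "ATGGTGCATCTGACTCCTGAGGAGAAGTCT"
genome = "CCGGTTAACC" + target + "GGCCAATTGG"

forward, reverse = design_primers(target, 8)
amplicon = in_silico_pcr(genome, forward, reverse)
print(forward, reverse)
print(amplicon == target)
```

The point of the sketch is only that knowing the sequence once is enough to specify both ends of the region to be copied.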
567 00:28:44,306 --> 00:28:45,950 Well, that didn't come out very good. 568 00:28:45,950 --> 00:28:47,945 I'll just draw it a little more clearly here. 569 00:28:50,560 --> 00:28:58,160 What happens is this guy will start copying this way. 570 00:28:58,160 --> 00:29:02,600 This guy starts copying this way. 571 00:29:02,600 --> 00:29:06,470 This guy starts copying this way. 572 00:29:06,470 --> 00:29:11,860 This guy starts copying this way. 573 00:29:11,860 --> 00:29:16,910 Now, after step 2, how many copies of 574 00:29:16,910 --> 00:29:19,670 beta globin do I have? 575 00:29:19,670 --> 00:29:20,920 Four copies. 576 00:29:24,500 --> 00:29:26,424 What's the next step? 577 00:29:26,424 --> 00:29:27,300 STUDENT: Step 3. 578 00:29:27,300 --> 00:29:28,190 ERIC LANDER: Step 3. 579 00:29:28,190 --> 00:29:28,940 Very good. 580 00:29:28,940 --> 00:29:30,810 No putting anything over on you. 581 00:29:30,810 --> 00:29:33,900 And after we melt the DNA and we add back the primers, how 582 00:29:33,900 --> 00:29:36,031 many copies of beta globin will we now have? 583 00:29:38,917 --> 00:29:39,400 STUDENT: Eight. 584 00:29:39,400 --> 00:29:41,700 ERIC LANDER: Eight, because it's doubling every time. 585 00:29:41,700 --> 00:29:44,240 Step 4? 586 00:29:44,240 --> 00:29:46,050 Step 10? 587 00:29:46,050 --> 00:29:47,910 2 to the 10th. 588 00:29:47,910 --> 00:29:48,860 2 to the 10th, because we're doubling. 589 00:29:48,860 --> 00:29:50,270 2 to the 10th. 590 00:29:50,270 --> 00:29:51,520 Step 20? 591 00:29:53,990 --> 00:29:58,790 2 to the 20th, which is about a million. 592 00:29:58,790 --> 00:30:01,800 Step 30? 593 00:30:01,800 --> 00:30:04,030 2 to the 30th is about a billion. 594 00:30:07,410 --> 00:30:08,660 Oh. 595 00:30:11,810 --> 00:30:20,480 After 30 steps, I have 2 to the 30th copies, which is 596 00:30:20,480 --> 00:30:24,080 about a billion copies.
597 00:30:24,080 --> 00:30:26,970 And at that point, the majority of the DNA in my tube 598 00:30:26,970 --> 00:30:28,470 is beta globin. 599 00:30:28,470 --> 00:30:30,855 The rest of the human genome is still there, but beta 600 00:30:30,855 --> 00:30:33,635 globin started out being one 601 00:30:33,635 --> 00:30:37,190 one-hundred-millionth of the genome. 602 00:30:37,190 --> 00:30:40,590 And I've just amplified it a billionfold. 603 00:30:40,590 --> 00:30:45,080 So it's now 90% of what's in the tube. 604 00:30:45,080 --> 00:30:47,070 Pretty cool. 605 00:30:47,070 --> 00:30:48,400 This is like a chain reaction. 606 00:30:48,400 --> 00:30:49,890 You do it once, you do it again, you do it 607 00:30:49,890 --> 00:30:50,770 again, you do it again. 608 00:30:50,770 --> 00:30:54,280 You just throw in polymerase, and you run a chain reaction. 609 00:30:54,280 --> 00:31:10,610 This therefore is called the polymerase chain reaction, or 610 00:31:10,610 --> 00:31:12,862 as it is universally known, PCR. 611 00:31:16,040 --> 00:31:17,670 That's PCR. 612 00:31:17,670 --> 00:31:21,020 That's the polymerase chain reaction. 613 00:31:21,020 --> 00:31:23,830 Kary Mullis, who invented this thing, won a Nobel Prize in 614 00:31:23,830 --> 00:31:24,890 chemistry for it. 615 00:31:24,890 --> 00:31:26,930 Because notice what he's just done. 616 00:31:26,930 --> 00:31:31,400 He's cloned your beta globin gene without cloning. 617 00:31:31,400 --> 00:31:33,050 It's cloning without cloning. 618 00:31:33,050 --> 00:31:34,900 I didn't need any vectors, I didn't need any bacteria, I 619 00:31:34,900 --> 00:31:35,730 didn't need no nothing. 620 00:31:35,730 --> 00:31:39,960 All I needed was the sequence that I got once by cloning, 621 00:31:39,960 --> 00:31:41,890 and then I'm off to the races. 622 00:31:41,890 --> 00:31:43,603 I throw in two primers-- 623 00:31:43,603 --> 00:31:46,370 choop-choop-choop-choop-choop-- 624 00:31:46,370 --> 00:31:47,750 bingo! 
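The doubling arithmetic, and the claim that beta globin ends up at roughly 90% of the tube, can be checked with a few lines. This is an idealized sketch: it assumes perfect doubling on every cycle and that none of the remaining genome is amplified. The starting fraction of one one-hundred-millionth is the figure quoted above; the function names are made up for the example.

```python
def copies_after(cycles, starting_copies=1):
    """Copies of the target after a number of perfect doubling cycles."""
    return starting_copies * 2 ** cycles


def target_fraction(cycles, initial_fraction):
    """Fraction of the DNA in the tube that is target after amplification,
    assuming only the target is copied and the rest stays constant."""
    amplified_target = initial_fraction * 2 ** cycles
    background = 1.0 - initial_fraction
    return amplified_target / (amplified_target + background)


for n in (1, 2, 3, 10, 20, 30):
    print(n, copies_after(n))

# One part in a hundred million, amplified through 30 cycles.
print(round(target_fraction(30, 1e-8), 3))
```

Under these assumptions the target is amplified to a bit more than ten times the mass of everything else, so it makes up about ten parts in eleven, in line with the "90%" figure.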
625 00:31:47,750 --> 00:31:49,880 Where do my primers come from? 626 00:31:49,880 --> 00:31:51,020 They're not in the catalogs. 627 00:31:51,020 --> 00:31:52,140 We don't keep all that inventory. 628 00:31:52,140 --> 00:31:54,440 You just type them in, and they come to you the next day 629 00:31:54,440 --> 00:31:56,640 by an automatic synthesis machine. 630 00:31:56,640 --> 00:31:59,370 So anyplace in the human genome or the yeast genome or 631 00:31:59,370 --> 00:32:02,930 any other thing that you want to PCR, just give me the two 632 00:32:02,930 --> 00:32:04,620 primers and PCR it out. 633 00:32:04,620 --> 00:32:05,360 How do we do this? 634 00:32:05,360 --> 00:32:07,750 There are a couple of the details that I have to worry 635 00:32:07,750 --> 00:32:09,000 about here. 636 00:32:11,290 --> 00:32:14,990 Cooking details here for the recipe. 637 00:32:14,990 --> 00:32:18,415 What I have to do is I have to take my test tube, and I have 638 00:32:18,415 --> 00:32:24,290 to heat it up so high that the double helix melts and comes 639 00:32:24,290 --> 00:32:26,960 apart, so that the primers can get in there. 640 00:32:26,960 --> 00:32:32,120 So I have to heat to 97 degrees. 641 00:32:35,230 --> 00:32:38,360 Then it comes apart. 642 00:32:38,360 --> 00:32:42,410 I cool it down, I add my polymerase. 643 00:32:42,410 --> 00:32:45,990 I cool it, I add polymerase and nucleotides. 644 00:32:54,690 --> 00:32:56,270 And then it does its extension. 645 00:32:56,270 --> 00:32:58,630 Then I heat it up to 97. 646 00:32:58,630 --> 00:33:00,320 Now the problem is when I heat it up to 97, you know what 647 00:33:00,320 --> 00:33:02,256 happens to my polymerase? 648 00:33:02,256 --> 00:33:02,700 STUDENT: Denatured. 649 00:33:02,700 --> 00:33:03,490 ERIC LANDER: It gets denatured, 650 00:33:03,490 --> 00:33:04,660 and it doesn't come. 651 00:33:04,660 --> 00:33:05,820 It's ruined.
652 00:33:05,820 --> 00:33:08,250 So what I have to do is I have to pop open my tube-- 653 00:33:08,250 --> 00:33:09,690 Sorry, you had a question? 654 00:33:09,690 --> 00:33:10,940 STUDENT: [INAUDIBLE] 655 00:33:13,110 --> 00:33:14,120 ERIC LANDER: Why do I heat it up? 656 00:33:14,120 --> 00:33:16,320 There are ways you might be able to avoid it. 657 00:33:16,320 --> 00:33:18,530 But the traditional technique is you heat it up. 658 00:33:18,530 --> 00:33:20,230 But you're right, there might be other solutions. 659 00:33:20,230 --> 00:33:22,110 But now I'm going to heat it up. 660 00:33:22,110 --> 00:33:24,280 And my polymerase gets denatured. 661 00:33:24,280 --> 00:33:26,880 So I have to pop open the test tube, throw in some more 662 00:33:26,880 --> 00:33:30,750 polymerase, let it do its work, heat it up again, pop 663 00:33:30,750 --> 00:33:32,750 open the test tube, put some more polymerase in. 664 00:33:32,750 --> 00:33:33,780 And it gets really boring. 665 00:33:33,780 --> 00:33:35,820 Every one of these 30 steps, I have to keep adding 666 00:33:35,820 --> 00:33:37,070 polymerase. 667 00:33:41,780 --> 00:33:44,180 So the engineers in you will say, why don't we just design 668 00:33:44,180 --> 00:33:47,430 a polymerase that doesn't denature at 97 degrees? 669 00:33:47,430 --> 00:33:49,800 So we should go to an expert and say, please make us a 670 00:33:49,800 --> 00:33:52,980 polymerase that doesn't denature at 97 degrees, even 671 00:33:52,980 --> 00:33:55,630 that doesn't mind being boiled. 672 00:33:55,630 --> 00:33:59,220 So what expert do we go to? 673 00:33:59,220 --> 00:33:59,980 STUDENT: Bacteria. 674 00:33:59,980 --> 00:34:00,960 ERIC LANDER: Bacteria. 675 00:34:00,960 --> 00:34:03,650 What bacteria do you think has a DNA polymerase that doesn't 676 00:34:03,650 --> 00:34:05,058 mind being boiled? 677 00:34:05,058 --> 00:34:06,730 STUDENT: [INAUDIBLE].
678 00:34:06,730 --> 00:34:08,350 ERIC LANDER: Bacteria that live in, say, 679 00:34:08,350 --> 00:34:10,070 geysers, hot springs. 680 00:34:10,070 --> 00:34:11,340 So you go to a hot spring. 681 00:34:11,340 --> 00:34:14,370 You go to a geyser, you go to Yellowstone, and you fish out 682 00:34:14,370 --> 00:34:16,920 some water, and you see what's growing there, and you find 683 00:34:16,920 --> 00:34:19,719 the bacteria growing there that has the 684 00:34:19,719 --> 00:34:25,380 name Thermus aquaticus. 685 00:34:31,070 --> 00:34:34,300 And you purify DNA from Thermus aquaticus. 686 00:34:34,300 --> 00:34:38,900 Thermus aquaticus just goes by the name TAQ, T-A-Q. You purify 687 00:34:38,900 --> 00:34:40,929 TAQ polymerase. 688 00:34:40,929 --> 00:34:44,060 And now, no problem. 689 00:34:44,060 --> 00:34:46,610 You just use TAQ polymerase, throw it in your test tube. 690 00:34:46,610 --> 00:34:48,290 And you go heat, cool, heat, cool, heat, cool, heat, cool, 691 00:34:48,290 --> 00:34:51,000 heat, cool, and you're all done. 692 00:34:51,000 --> 00:34:52,690 You just put it on a little heating block. 693 00:34:52,690 --> 00:34:54,650 And the heating block automatically goes hot, cold, 694 00:34:54,650 --> 00:34:56,969 hot, cold, hot, cold, hot, cold. 695 00:34:56,969 --> 00:34:58,840 And that's called the thermocycler. 696 00:34:58,840 --> 00:35:00,490 The thermocycler does it. 697 00:35:00,490 --> 00:35:02,670 And of course, nowadays, do you have to yourself 698 00:35:02,670 --> 00:35:05,740 personally go to the hot spring and risk your life 699 00:35:05,740 --> 00:35:07,740 fishing out the bacteria? 700 00:35:07,740 --> 00:35:08,640 No. 701 00:35:08,640 --> 00:35:09,950 Because it's in--? 702 00:35:09,950 --> 00:35:10,380 STUDENT: The catalog. 703 00:35:10,380 --> 00:35:11,350 ERIC LANDER: The catalog. 704 00:35:11,350 --> 00:35:12,620 Exactly. 705 00:35:12,620 --> 00:35:13,240 Very good. 706 00:35:13,240 --> 00:35:16,740 TAQ polymerase is in the catalog.
707 00:35:16,740 --> 00:35:18,470 All right. 708 00:35:18,470 --> 00:35:19,310 I'll tell you a story. 709 00:35:19,310 --> 00:35:22,330 The statute of limitations has already expired, so it's OK. 710 00:35:22,330 --> 00:35:25,060 TAQ polymerase used to be very expensive. 711 00:35:25,060 --> 00:35:29,000 So we needed a lot of it in our lab. 712 00:35:29,000 --> 00:35:33,100 And we couldn't afford all of it. 713 00:35:33,100 --> 00:35:39,360 So what we did was we just looked up the sequence of the 714 00:35:39,360 --> 00:35:44,100 TAQ polymerase in Thermus aquaticus, got primers, used 715 00:35:44,100 --> 00:35:49,640 PCR to get the gene for TAQ polymerase, and then expressed 716 00:35:49,640 --> 00:35:51,940 it to make a lot of TAQ polymerase. 717 00:35:51,940 --> 00:35:54,500 So it's kind of cool. 718 00:35:54,500 --> 00:35:57,040 Anyway, that was about 15 years ago. 719 00:35:57,040 --> 00:35:59,830 We produced in a few days what was then worth about $4 720 00:35:59,830 --> 00:36:01,610 million worth of TAQ polymerase. 721 00:36:01,610 --> 00:36:02,402 [LAUGHTER] 722 00:36:02,402 --> 00:36:05,030 That was why we went to the trouble of doing it. 723 00:36:07,810 --> 00:36:09,900 We didn't end up getting in any real trouble about it. 724 00:36:09,900 --> 00:36:10,250 OK. 725 00:36:10,250 --> 00:36:13,770 So now, why is this stuff cool? 726 00:36:13,770 --> 00:36:19,290 This stuff is cool because you're able to amplify tiny 727 00:36:19,290 --> 00:36:20,240 amounts of DNA. 728 00:36:20,240 --> 00:36:22,720 So if I want to purify any human gene now, and I know its 729 00:36:22,720 --> 00:36:25,480 sequence initially, PCR it out. 730 00:36:25,480 --> 00:36:27,730 No problem. 731 00:36:27,730 --> 00:36:32,300 Suppose a patient presents with a bacterial infection. 732 00:36:32,300 --> 00:36:34,730 And you're a physician. 
733 00:36:34,730 --> 00:36:37,530 And you suspect that there might be a specific bacterial 734 00:36:37,530 --> 00:36:40,960 infection or maybe a specific viral infection. 735 00:36:40,960 --> 00:36:43,115 So applications of PCR. 736 00:36:50,260 --> 00:36:52,180 Application of PCR? 737 00:36:52,180 --> 00:36:55,780 Well, resequencing a known gene, yes. 738 00:36:55,780 --> 00:36:58,160 Resequencing beta globin. 739 00:36:58,160 --> 00:37:00,325 But infectious disease. 740 00:37:08,340 --> 00:37:09,910 I have a patient. 741 00:37:09,910 --> 00:37:11,450 I think there might be a bacteria, there might be a 742 00:37:11,450 --> 00:37:13,230 virus in the blood. 743 00:37:13,230 --> 00:37:14,480 What do I do? 744 00:37:16,650 --> 00:37:19,700 Make primers, do PCR. 745 00:37:19,700 --> 00:37:21,570 I have a detection technique. 746 00:37:21,570 --> 00:37:25,530 I can detect the presence of a viral infection 747 00:37:25,530 --> 00:37:27,320 or a bacterial infection. 748 00:37:27,320 --> 00:37:31,540 For example, HIV testing can be done by PCR. 749 00:37:31,540 --> 00:37:33,170 Because it doesn't take very much there in 750 00:37:33,170 --> 00:37:35,900 order to detect it. 751 00:37:35,900 --> 00:37:37,620 Water contamination. 752 00:37:37,620 --> 00:37:40,930 You can test for bugs that shouldn't be in the water, by 753 00:37:40,930 --> 00:37:43,070 PCR because you don't need much. 754 00:37:43,070 --> 00:37:45,160 How little do you need? 755 00:37:45,160 --> 00:37:49,540 Suppose I take a tube of DNA, and I start diluting it and 756 00:37:49,540 --> 00:37:52,330 diluting it and diluting it. 757 00:37:52,330 --> 00:37:55,660 How far down do you think I can go and still PCR back up, 758 00:37:55,660 --> 00:37:58,670 say, beta globin? 759 00:37:58,670 --> 00:38:01,360 Suppose I dilute it so that there's only like 1,000 copies 760 00:38:01,360 --> 00:38:04,040 of beta globin left, on average. 761 00:38:04,040 --> 00:38:06,590 Can I still PCR it? 
762 00:38:06,590 --> 00:38:09,100 100 copies? 763 00:38:09,100 --> 00:38:11,260 10 copies? 764 00:38:11,260 --> 00:38:13,920 Suppose I dilute it so on average there's only one copy 765 00:38:13,920 --> 00:38:16,280 of the beta globin molecule there. 766 00:38:16,280 --> 00:38:19,810 Can I PCR it? 767 00:38:19,810 --> 00:38:22,410 How can I prove that? 768 00:38:22,410 --> 00:38:23,590 An easy way to prove that-- 769 00:38:23,590 --> 00:38:25,930 I could do it statistically by just diluting it so that on 770 00:38:25,930 --> 00:38:27,380 average there's only one beta globin. 771 00:38:27,380 --> 00:38:29,950 But how can I get one copy of beta globin 772 00:38:29,950 --> 00:38:31,260 packaged up very nicely? 773 00:38:34,410 --> 00:38:35,310 STUDENT: Order it. 774 00:38:35,310 --> 00:38:36,146 ERIC LANDER: Sorry? 775 00:38:36,146 --> 00:38:37,394 STUDENT: Order it. 776 00:38:37,394 --> 00:38:39,550 ERIC LANDER: No. 777 00:38:39,550 --> 00:38:40,800 Can't order that. 778 00:38:45,040 --> 00:38:47,660 Where do you know that there's just exactly one copy of the 779 00:38:47,660 --> 00:38:49,950 beta globin gene? 780 00:38:49,950 --> 00:38:53,510 One human sperm. 781 00:38:53,510 --> 00:38:55,570 Suppose with a micromanipulator, I purify a 782 00:38:55,570 --> 00:38:58,450 single human sperm, one sperm. 783 00:39:03,420 --> 00:39:03,895 It's haploid. 784 00:39:03,895 --> 00:39:05,940 It's got exactly one beta globin. 785 00:39:05,940 --> 00:39:10,020 Throw it in a test tube, crack it open, do PCR, it works. 786 00:39:10,020 --> 00:39:13,230 That's how I demonstrate that a single copy is enough. 787 00:39:13,230 --> 00:39:14,240 I can make that work. 788 00:39:14,240 --> 00:39:15,550 Pretty impressive. 789 00:39:15,550 --> 00:39:18,130 Not only that, that I can do it from a 790 00:39:18,130 --> 00:39:19,380 single copy in a sperm. 
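The statistical version of the experiment, diluting until there is one copy on average, can be made concrete. The sketch below assumes the number of copies landing in a tube follows a Poisson distribution, which is a common model for this kind of limiting dilution but is an assumption here, not something stated in the lecture. It shows why "one on average" is not the same as "exactly one", and why a single haploid sperm is the cleaner demonstration.

```python
import math


def poisson_pmf(k, mean):
    """Probability of exactly k copies when the average is `mean`,
    under a Poisson model."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


mean_copies = 1.0  # diluted so there is one copy per tube on average

p_zero = poisson_pmf(0, mean_copies)
p_one = poisson_pmf(1, mean_copies)
p_two_or_more = 1.0 - p_zero - p_one

print(f"no copies:      {p_zero:.3f}")
print(f"exactly one:    {p_one:.3f}")
print(f"two or more:    {p_two_or_more:.3f}")
```

Under this model, roughly a third of the tubes would contain no template at all, about a third exactly one copy, and the rest more than one, so a successful PCR from a diluted tube is only indirect evidence that one copy suffices.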
791 00:39:25,390 --> 00:39:27,576 If someone is doing in vitro fertilization-- 792 00:39:32,350 --> 00:39:34,070 remember, in vitro fertilization won the Nobel 793 00:39:34,070 --> 00:39:37,396 Prize this semester-- 794 00:39:37,396 --> 00:39:38,810 in vitro. 795 00:39:38,810 --> 00:39:42,960 Suppose a couple has a 1 in 4 chance of having a baby with 796 00:39:42,960 --> 00:39:45,520 some terrible lethal disorder. 797 00:39:45,520 --> 00:39:49,150 The couple might use in vitro fertilization to make multiple 798 00:39:49,150 --> 00:39:52,300 independent embryos. 799 00:39:52,300 --> 00:39:56,540 The doctor, then, is deciding which embryo should we implant 800 00:39:56,540 --> 00:39:58,750 back in mom? 801 00:39:58,750 --> 00:40:03,660 Well, how could they tell at this eight-cell stage which 802 00:40:03,660 --> 00:40:07,030 embryo carries the genetic disease? 803 00:40:07,030 --> 00:40:08,930 Suppose this genetic disease they already knew the 804 00:40:08,930 --> 00:40:10,702 molecular mutation causing it. 805 00:40:13,600 --> 00:40:14,850 Pull off a cell. 806 00:40:19,400 --> 00:40:25,200 Pull off one cell, pull off one cell, pull off one cell. 807 00:40:25,200 --> 00:40:27,760 This is at the eight-cell stage, let's say. 808 00:40:27,760 --> 00:40:30,620 If I pull off one of those cells at the eight-cell stage, 809 00:40:30,620 --> 00:40:33,180 does that mean the baby doesn't have an ear or an arm 810 00:40:33,180 --> 00:40:34,860 or something? 811 00:40:34,860 --> 00:40:35,640 No, it doesn't. 812 00:40:35,640 --> 00:40:37,350 Because at that stage, none of the cells have 813 00:40:37,350 --> 00:40:39,030 taken up any identity. 814 00:40:39,030 --> 00:40:39,830 It regulates. 815 00:40:39,830 --> 00:40:40,630 There's no problem. 816 00:40:40,630 --> 00:40:43,550 It turns out you can pull off an individual cell, and it has 817 00:40:43,550 --> 00:40:45,960 no impact on the embryo. 818 00:40:45,960 --> 00:40:47,210 And I do PCR. 
819 00:40:49,710 --> 00:40:53,330 And I can figure out that that one carries the severe genetic 820 00:40:53,330 --> 00:40:55,090 disease that's going to cause the baby 821 00:40:55,090 --> 00:40:56,630 to die at five months. 822 00:40:56,630 --> 00:40:59,120 And the couple says, we're not going to implant that one. 823 00:40:59,120 --> 00:41:00,580 We'll implant the other ones. 824 00:41:00,580 --> 00:41:02,580 That's called preimplantation diagnostics. 825 00:41:13,830 --> 00:41:17,310 Or suppose somebody's being treated for cancer. 826 00:41:17,310 --> 00:41:22,840 And the cancer cells that had previously been there are no 827 00:41:22,840 --> 00:41:25,070 longer detectable. 828 00:41:25,070 --> 00:41:27,820 The drug therapy has apparently killed this blood 829 00:41:27,820 --> 00:41:29,130 cancer that somebody has. 830 00:41:29,130 --> 00:41:31,620 Maybe they have a cancer of the blood. 831 00:41:31,620 --> 00:41:34,670 Now what I'm going to do is monitor that patient every 832 00:41:34,670 --> 00:41:38,250 several months by getting a blood sample and seeing if the 833 00:41:38,250 --> 00:41:40,110 distinct mutations that were present in 834 00:41:40,110 --> 00:41:41,830 the cancer were there. 835 00:41:41,830 --> 00:41:44,750 And I can begin to see if that's coming back, if those 836 00:41:44,750 --> 00:41:46,840 cells are now recurring. 837 00:41:46,840 --> 00:41:49,020 It's an incredibly sensitive technique. 838 00:41:49,020 --> 00:41:51,420 And then of course, where does this stuff get used that you 839 00:41:51,420 --> 00:41:55,280 guys all surely know about? 840 00:41:55,280 --> 00:41:57,040 Forensics. 841 00:41:57,040 --> 00:41:59,120 CSI and all that kind of stuff. 842 00:42:02,660 --> 00:42:04,900 If I lick an envelope-- 843 00:42:04,900 --> 00:42:08,100 I used to say, licking a stamp, but stamps just peel 844 00:42:08,100 --> 00:42:08,700 off these days. 845 00:42:08,700 --> 00:42:10,610 But people still do lick an envelope sometimes.
846 00:42:10,610 --> 00:42:14,590 If you lick an envelope, more than enough DNA comes off when 847 00:42:14,590 --> 00:42:18,720 you lick the envelope that you can use PCR to determine who 848 00:42:18,720 --> 00:42:21,680 did the licking. 849 00:42:21,680 --> 00:42:22,930 ERIC LANDER: You can. 850 00:42:22,930 --> 00:42:24,590 It works. 851 00:42:24,590 --> 00:42:25,270 All right. 852 00:42:25,270 --> 00:42:26,520 So that's PCR.