1
00:00:01,000 --> 00:00:05,000
Good morning.
Good morning.

2
00:00:05,000 --> 00:00:10,000
I don't know about you, but I can't
take too many more nights like this.

3
00:00:10,000 --> 00:00:15,000
I confess, I haven't gotten a thing
done for so many nights in a row now,

4
00:00:15,000 --> 00:00:20,000
but what a game! How many of
you saw the game? Excellent.

5
00:00:20,000 --> 00:00:25,000
Very good, very good. You have
your priorities straight in

6
00:00:25,000 --> 00:00:30,000
the world. Very good. Well,
if it's possible to get your

7
00:00:30,000 --> 00:00:35,000
minds off Curt Schilling last night,
and off more importantly tonight.

8
00:00:35,000 --> 00:00:40,000
Perhaps we can spend a bit of time
this morning in the meanwhile with

9
00:00:40,000 --> 00:00:45,000
whatever spare neurons you have
talking about recombinant DNA for a

10
00:00:45,000 --> 00:00:50,000
bit, OK? What we talked about last
time was different ways to clone

11
00:00:50,000 --> 00:00:55,000
your gene based on its properties.
We started off with cloning by

12
00:00:55,000 --> 00:01:00,000
complementation, right,
the idea that if you took a

13
00:01:00,000 --> 00:01:05,000
library of clones, you
would be able to put it into

14
00:01:05,000 --> 00:01:10,000
bacteria and select a bacterium
whose phenotype had been restored by

15
00:01:10,000 --> 00:01:14,000
virtue of having the plasmid.
You would complement the defect.

16
00:01:14,000 --> 00:01:18,000
You'd find the clone you wanted
because it complemented the defect.

17
00:01:18,000 --> 00:01:22,000
That's great if you can put it
into an organism that has a defect.

18
00:01:22,000 --> 00:01:25,000
You can do it with bacteria.
You can do that with yeast.

19
00:01:25,000 --> 00:01:29,000
It's harder to do with large
organisms because you can't inject

20
00:01:29,000 --> 00:01:33,000
enough of them with different clones
to be able to make that practical

21
00:01:33,000 --> 00:01:37,000
unless you're working in cell
culture or some very small,

22
00:01:37,000 --> 00:01:41,000
fast growing organism. We
talked about being able to use a

23
00:01:41,000 --> 00:01:45,000
protein sequence, reverse
translating that protein

24
00:01:45,000 --> 00:01:50,000
sequence in the computer from amino
acid sequence to nucleotide sequence,

25
00:01:50,000 --> 00:01:54,000
and using the nucleotide sequence to
design a probe to hybridize back to

26
00:01:54,000 --> 00:01:59,000
the genome. That works fine
if you have a protein sequence.

27
00:01:59,000 --> 00:02:02,000
But the last topic we talked about
that I wanted to just touch on again

28
00:02:02,000 --> 00:02:05,000
this morning was suppose you were
trying to clone the gene that causes

29
00:02:05,000 --> 00:02:08,000
a certain human disease,
and you have no idea what the

30
00:02:08,000 --> 00:02:11,000
protein was. Then, you
can't use its amino acid

31
00:02:11,000 --> 00:02:15,000
sequence because you don't have the
protein. What can you possibly do

32
00:02:15,000 --> 00:02:18,000
when all you know is that you have
a gene which causes a genetic defect

33
00:02:18,000 --> 00:02:21,000
that causes a disease? And I
said you could clone it using

34
00:02:21,000 --> 00:02:24,000
the ideas of genetic mapping,
position, the things that Sturtevant

35
00:02:24,000 --> 00:02:28,000
developed. And, I
touched on it briefly,

36
00:02:28,000 --> 00:02:31,000
and I want to just touch on it a
bit more because some people had some

37
00:02:31,000 --> 00:02:34,000
questions about it. And
I've set up a very simple

38
00:02:34,000 --> 00:02:38,000
example to show you. Suppose
that, to make it easy,

39
00:02:38,000 --> 00:02:41,000
we're working in a fruit fly
first. We're working in Drosophila, and

40
00:02:41,000 --> 00:02:44,000
suppose that the true picture of the
underlying chromosome is like this.

41
00:02:44,000 --> 00:02:48,000
There's a locus that could either
have a mutant allele M or the wild

42
00:02:48,000 --> 00:02:51,000
type allele plus. There's
a bunch of other loci along

43
00:02:51,000 --> 00:02:54,000
the chromosome. And, let's
suppose we know all of

44
00:02:54,000 --> 00:02:58,000
where they are and all that.
And, they have two alternative

45
00:02:58,000 --> 00:03:01,000
alleles. At this locus
the alleles are orange

46
00:03:01,000 --> 00:03:05,000
or pink. At this locus I'll
call the alleles orange or pink.

47
00:03:05,000 --> 00:03:08,000
Now, these are different loci.
These are different alleles. I've

48
00:03:08,000 --> 00:03:11,000
just called them orange and pink in
both cases so I don't have a rainbow

49
00:03:11,000 --> 00:03:15,000
of colors up here to confuse
us. But all I mean is there's two

50
00:03:15,000 --> 00:03:18,000
possible alleles here, two
alleles here, two alleles here.

51
00:03:18,000 --> 00:03:21,000
This is the disease
gene we're interested in,

52
00:03:21,000 --> 00:03:25,000
and these are passive markers.
These are other markers along the

53
00:03:25,000 --> 00:03:28,000
chromosome. If we were
to set up a cross between

54
00:03:28,000 --> 00:03:32,000
heterozygotes, a
heterozygote here,

55
00:03:32,000 --> 00:03:36,000
and a heterozygote here, and
it were the case that on the

56
00:03:36,000 --> 00:03:40,000
chromosome bearing the mutant allele,
it happened that at these three

57
00:03:40,000 --> 00:03:44,000
markers we had orange alleles.
I don't know what they are, but

58
00:03:44,000 --> 00:03:48,000
whatever these orange alleles are,
they might be a visible phenotype,

59
00:03:48,000 --> 00:03:52,000
forked or yellow or bristled.
They could be a DNA sequence

60
00:03:52,000 --> 00:03:56,000
difference. They could be whatever
you want, but let's suppose the M

61
00:03:56,000 --> 00:04:00,000
chromosome has a set of alleles that
are different in each location than

62
00:04:00,000 --> 00:04:03,000
the plus chromosome. Then,
when we look at the offspring

63
00:04:03,000 --> 00:04:07,000
that come out of this cross,
let's only, for the sake of

64
00:04:07,000 --> 00:04:11,000
simplicity, look at those offspring
who are homozygous mutants.

65
00:04:11,000 --> 00:04:15,000
Well, in general, if there's
been no crossover here,

66
00:04:15,000 --> 00:04:18,000
then the M chromosome
will have orange, orange,

67
00:04:18,000 --> 00:04:22,000
orange, orange, orange, orange.
If there's been a crossover,

68
00:04:22,000 --> 00:04:26,000
however, it could go orange,
orange, pink on one of those

69
00:04:26,000 --> 00:04:30,000
chromosomes. Or if there's
been crossovers like this,

70
00:04:30,000 --> 00:04:34,000
it could go orange, orange, pink
on one chromosome, and orange,

71
00:04:34,000 --> 00:04:37,000
pink, pink on the other
chromosome. It could even,

72
00:04:37,000 --> 00:04:41,000
in the extreme, have had
crossovers very close to

73
00:04:41,000 --> 00:04:44,000
the gene maybe here, and even
maybe here. And you've got

74
00:04:44,000 --> 00:04:47,000
orange, pink, pink,
and pink, pink, pink.

75
00:04:47,000 --> 00:04:51,000
But if we look at the many
segregants, you know from genetic

76
00:04:51,000 --> 00:04:54,000
mapping that the closer the
locus is to the disease gene,

77
00:04:54,000 --> 00:04:57,000
the more strongly correlated
the inheritance will be,

78
00:04:57,000 --> 00:05:01,000
the tighter the linkage will be.
This is nothing more than linkage

79
00:05:01,000 --> 00:05:04,000
mapping. But now,
suppose we were doing

80
00:05:04,000 --> 00:05:08,000
linkage mapping, but for
the sake of argument the

81
00:05:08,000 --> 00:05:12,000
whole genome had already been
sequenced. Suppose the genome had

82
00:05:12,000 --> 00:05:15,000
been sequenced in a cross, and
the whole genome of the fruit

83
00:05:15,000 --> 00:05:19,000
fly had been sequenced
which it has been sequenced.

84
00:05:19,000 --> 00:05:22,000
And, we looked at a cross
and we looked at the mutants.

85
00:05:22,000 --> 00:05:26,000
And what we did was we tried
different positions along the genome.

86
00:05:26,000 --> 00:05:30,000
And at each position, we
had some genetic marker.

87
00:05:30,000 --> 00:05:34,000
And that genetic marker might be
as simple as the fact that at that

88
00:05:34,000 --> 00:05:38,000
position, maybe there is an A
in the DNA sequence on one of the

89
00:05:38,000 --> 00:05:42,000
chromosomes, and maybe I don't
know a G in the other sequence.

90
00:05:42,000 --> 00:05:47,000
And over here, this marker might
be, there's a T in some particular

91
00:05:47,000 --> 00:05:51,000
position, and there's a C
in some particular position.

92
00:05:51,000 --> 00:05:55,000
If we could assay that, if
we could tell, we could look

93
00:05:55,000 --> 00:06:00,000
whether this spelling variation is
closely correlated with the mutant.

94
00:06:00,000 --> 00:06:03,000
And this spelling variation
is closely correlated with the

95
00:06:03,000 --> 00:06:07,000
inheritance of the mutant allele.
And we could just try up and down

96
00:06:07,000 --> 00:06:11,000
the genome, different sites of
spelling difference as if they were

97
00:06:11,000 --> 00:06:15,000
genetic markers in our cross because
they are genetic markers in our

98
00:06:15,000 --> 00:06:18,000
cross, and see which one
is most tightly correlated.

99
00:06:18,000 --> 00:06:22,000
The minute we get any
genetic sequence difference,

100
00:06:22,000 --> 00:06:26,000
that shows co-inheritance linkage
in this cross, we know that this spot

101
00:06:26,000 --> 00:06:30,000
in the genome must be
nearby our mutation.

102
00:06:30,000 --> 00:06:33,000
So, we'll try one closer, and
we'll try one on the other side.

103
00:06:33,000 --> 00:06:37,000
And, what you do is you test
sites of genetic variation,

104
00:06:37,000 --> 00:06:40,000
first to find one that
shows any co-inheritance.

105
00:06:40,000 --> 00:06:44,000
And once you've got that, you
try ones closer, and closer,

106
00:06:44,000 --> 00:06:48,000
and closer. Last time I
talked about the process of,

107
00:06:48,000 --> 00:06:51,000
if you had one of those markers
you could use it to isolate the next

108
00:06:51,000 --> 00:06:55,000
clone and the next clone and the
next clone. But you know what I

109
00:06:55,000 --> 00:06:59,000
realized? That's so old fashioned.
We might as well deal with the fact

110
00:06:59,000 --> 00:07:02,000
we have a sequence of the genome.
No more would you ever isolate the

111
00:07:02,000 --> 00:07:05,000
next clone and the next
clone and the next clone.

112
00:07:05,000 --> 00:07:08,000
You just look it up in the computer.
So, even if you have the whole

113
00:07:08,000 --> 00:07:11,000
sequence of the genome, we
have to figure out what part of

114
00:07:11,000 --> 00:07:14,000
it was co-inherited along with this
disease, and that's the way you do

115
00:07:14,000 --> 00:07:17,000
it, OK? Genetic mapping, just
as Sturtevant invented it,

116
00:07:17,000 --> 00:07:20,000
can be applied if you have a
whole sequence of the genome,

117
00:07:20,000 --> 00:07:23,000
and enough sites of variation.
And, I've drawn it for a fruit fly

118
00:07:23,000 --> 00:07:26,000
cross, but this could equally
well be cystic fibrosis.

119
00:07:26,000 --> 00:07:29,000
The only difference is if we're
doing this in human families and

120
00:07:29,000 --> 00:07:33,000
it's cystic fibrosis we
don't have as many offspring.

121
00:07:33,000 --> 00:07:36,000
So, we have to pool data
from many families. And,

122
00:07:36,000 --> 00:07:40,000
we can't arrange it so that every
family has exactly the same orange

123
00:07:40,000 --> 00:07:43,000
alleles up here and pink alleles
down there, but computers can deal

124
00:07:43,000 --> 00:07:47,000
with that. They can still figure
out the correlation across many

125
00:07:47,000 --> 00:07:50,000
families, and you find the spot
in the genome where for many,

126
00:07:50,000 --> 00:07:54,000
many, many families the kids who
all got the disease show correlated

127
00:07:54,000 --> 00:07:57,000
inheritance with this marker.
And that eventually pins you down

128
00:07:57,000 --> 00:08:01,000
to a region of the genome. It
pins you down to those genetic

129
00:08:01,000 --> 00:08:04,000
markers that show the
absolute tightest correlation,

130
00:08:04,000 --> 00:08:08,000
tight correlation, and
that's where you look.

131
00:08:08,000 --> 00:08:12,000
And in that fashion, people
went from being able to map the

132
00:08:12,000 --> 00:08:16,000
location of Huntington's
Disease in 1984 to, by now,

133
00:08:16,000 --> 00:08:20,000
mapping the locations of more than
1,000 different human genetic diseases

134
00:08:20,000 --> 00:08:25,000
where people didn't know the protein
in advance. They did it entirely

135
00:08:25,000 --> 00:08:29,000
based on this positional mapping.
So, Sturtevant's idea, which I like

136
00:08:29,000 --> 00:08:33,000
so much, has played itself out
so beautifully now in the area of

137
00:08:33,000 --> 00:08:38,000
modern molecular
medicine. OK. So, onward.

138
00:08:38,000 --> 00:08:43,000
I want to talk about a few other
variations on the theme rather

139
00:08:43,000 --> 00:08:48,000
quickly, and then I think I want
to talk about how you analyze your

140
00:08:48,000 --> 00:08:53,000
clones. First,
variations on cloning,

141
00:08:53,000 --> 00:08:58,000
I should just at least mention
it. We talked about cloning in an

142
00:08:58,000 --> 00:09:04,000
autonomously replicating
plasmid in a bacteria.

143
00:09:04,000 --> 00:09:07,000
So, you go to a bacteria.
They have some autonomously

144
00:09:07,000 --> 00:09:10,000
replicating pieces of DNA.
They are circles. You can clone

145
00:09:10,000 --> 00:09:13,000
in them, and you can typically,
these things are on the order of,

146
00:09:13,000 --> 00:09:17,000
I don't know, 1,000 to 2,000 to
5,000 bases can be readily cloned in

147
00:09:17,000 --> 00:09:20,000
these plasmids.
You can do more,

148
00:09:20,000 --> 00:09:23,000
but that's a typical kind
of number is the insert size,

149
00:09:23,000 --> 00:09:27,000
typically. But we in the lab
go up to much higher numbers

150
00:09:27,000 --> 00:09:31,000
like 10,000 sometimes. You can
also, if you wanted to study

151
00:09:31,000 --> 00:09:36,000
yeast, it turns out yeast
happily have plasmids as well,

152
00:09:36,000 --> 00:09:41,000
and you can do a similar
sort of thing for yeast.

153
00:09:41,000 --> 00:09:47,000
It turns out that instead of using
plasmids, you can use bacterial

154
00:09:47,000 --> 00:09:52,000
viruses. These bacterial viruses
have all different shapes as we've

155
00:09:52,000 --> 00:09:57,000
talked about, circular or linear,
and they can typically hold, oh,

156
00:09:57,000 --> 00:10:02,000
15,000-40,000. Some of these
viruses are quite big.

157
00:10:02,000 --> 00:10:06,000
The bacteriophage lambda
tends to carry a lot of stuff.

158
00:10:06,000 --> 00:10:11,000
And, it can replicate. So,
you could do the same thing to

159
00:10:11,000 --> 00:10:15,000
that. You can even use viruses that
infect mammalian cells and there are

160
00:10:15,000 --> 00:10:19,000
all sorts of viruses now
that people clone in again,

161
00:10:19,000 --> 00:10:24,000
linear or circular. I don't
know, for mammalian cells,

162
00:10:24,000 --> 00:10:28,000
you often, the viruses like 1,000-5,000.
You can even make artificial

163
00:10:28,000 --> 00:10:33,000
whole chromosomes now.
You can do this in yeast.

164
00:10:33,000 --> 00:10:39,000
Artificial chromosomes are called
YACs. They have all the little

165
00:10:39,000 --> 00:10:44,000
machinery, little telomeres on
them, little centromeres. They have a

166
00:10:44,000 --> 00:10:50,000
selectable marker, and then
you can clone into it your

167
00:10:50,000 --> 00:10:55,000
piece of DNA. And these can take
up to a million bases of DNA.

168
00:10:55,000 --> 00:11:01,000
So, if you wanted, there are
bacterial artificial chromosomes.

169
00:11:01,000 --> 00:11:05,000
They're called BACs if they're
in bacteria. And recently,

170
00:11:05,000 --> 00:11:09,000
people have developed artificial
chromosome systems for mammalian

171
00:11:09,000 --> 00:11:13,000
cells, and specifically human cells.
And they're called unfortunately

172
00:11:13,000 --> 00:11:17,000
MACs and HACs and things like that.
Basically, any molecule that can

173
00:11:17,000 --> 00:11:21,000
replicate in any system, some
smart molecular biologist will

174
00:11:21,000 --> 00:11:25,000
come along and say, how do
I use that for my purpose,

175
00:11:25,000 --> 00:11:30,000
to stick my DNA in it, and get
it to replicate in this organism?

176
00:11:30,000 --> 00:11:36,000
And so, if something's not
on this list, it will be soon,

177
00:11:36,000 --> 00:11:43,000
OK? Now, here's another thing.
This is cloning chunks of DNA.

178
00:11:43,000 --> 00:11:50,000
Just to have the piece of DNA in a
library, but suppose we want to do

179
00:11:50,000 --> 00:11:57,000
more than just have the DNA
sitting there in the bacterium,

180
00:11:57,000 --> 00:12:04,000
suppose what I'd really like
to do is take a bacterium,

181
00:12:04,000 --> 00:12:10,000
E. coli, and put it to work for us.
Maybe what I'd like to do is take a

182
00:12:10,000 --> 00:12:14,000
plasmid and insert in that
plasmid the gene for human insulin.

183
00:12:14,000 --> 00:12:19,000
So, I'm going to take the DNA locus
corresponding to human insulin,

184
00:12:19,000 --> 00:12:23,000
clone it into my plasmid. Maybe
I'll have isolated it from my

185
00:12:23,000 --> 00:12:28,000
library because, let's
see, insulin's protein

186
00:12:28,000 --> 00:12:32,000
sequence is known so I could
reverse translate it to a nucleotide

187
00:12:32,000 --> 00:12:36,000
sequence. So, I
could probe a library.

188
00:12:36,000 --> 00:12:40,000
So, I could find the clone that has
insulin. Now what I'd like to do is

189
00:12:40,000 --> 00:12:43,000
persuade these bacteria not just to
carry the DNA but to make insulin

190
00:12:43,000 --> 00:12:47,000
for me. Would that be useful?
Yeah, how did people used to get

191
00:12:47,000 --> 00:12:51,000
insulin? Cadavers, dead
bodies; it would be much easier

192
00:12:51,000 --> 00:12:54,000
to get them from a fermenter,
right, to get insulin from a

193
00:12:54,000 --> 00:12:58,000
fermenter, if you could
just ask E. coli to make it.

194
00:12:58,000 --> 00:13:02,000
So, if we put it into E. coli,
will it make insulin for us?

195
00:13:02,000 --> 00:13:09,000
Here's the human locus, DNA
for insulin. Will it make

196
00:13:09,000 --> 00:13:17,000
insulin? Let's see, how
do you make a protein?

197
00:13:17,000 --> 00:13:24,000
You've got to start by making RNA,
right? You've got to transcribe the

198
00:13:24,000 --> 00:13:32,000
gene. Will E. coli
transcribe this gene?

199
00:13:32,000 --> 00:13:36,000
Well, why? It's got a promoter,
right? It's got the insulin

200
00:13:36,000 --> 00:13:41,000
promoter. There we go. The
insulin promoter is here.

201
00:13:41,000 --> 00:13:45,000
So, E. coli will come along to the
insulin promoter and start making

202
00:13:45,000 --> 00:13:50,000
RNA? No, it turns out that
promoters in humans and promoters in

203
00:13:50,000 --> 00:13:55,000
bacteria are sufficiently different.
They don't work across species.

204
00:13:55,000 --> 00:14:00,000
They won't recognize the human
promoter. Too bad. Any ideas?

205
00:14:00,000 --> 00:14:05,000
Yep? Stick a bacterial promoter
there. Good, you're acting like a

206
00:14:05,000 --> 00:14:10,000
good molecular biology designer here.
Let's put a bacterial promoter here.

207
00:14:10,000 --> 00:14:15,000
It will recognize its own promoter.
That's great. Then, let's put the

208
00:14:15,000 --> 00:14:21,000
DNA for the human insulin gene here.
And now, maybe we'll put the Lac

209
00:14:21,000 --> 00:14:26,000
operon, and when it has lactose
it'll start making RNA from the

210
00:14:26,000 --> 00:14:32,000
human insulin gene. And
it'll start translating it.

211
00:14:32,000 --> 00:14:38,000
And, we get insulin.
Any problems? Well,

212
00:14:38,000 --> 00:14:44,000
will it make any, for starters?
What's another aspect of mammalian

213
00:14:44,000 --> 00:14:50,000
genes that's different
from bacterial genes?

214
00:14:50,000 --> 00:14:56,000
Processing, what kind of processing
with the RNA? And the splicing,

215
00:14:56,000 --> 00:15:02,000
ooh, the insulin gene has introns
that have to be spliced out.

216
00:15:02,000 --> 00:15:06,000
So, this is going to make some
RNA, insulin RNA, and it needs to be

217
00:15:06,000 --> 00:15:10,000
processed like this. Will
bacteria carry on our splicing

218
00:15:10,000 --> 00:15:14,000
for us? They don't do splicing.
Yep? Well, that's a very

219
00:15:14,000 --> 00:15:18,000
interesting question
because we haven't. But,

220
00:15:18,000 --> 00:15:22,000
what do you propose? You see,
I've just taken a piece of

221
00:15:22,000 --> 00:15:26,000
human DNA from the human genome,
which encodes the introns and the

222
00:15:26,000 --> 00:15:30,000
exons. But, you seem to have
a solution to our problem,

223
00:15:30,000 --> 00:15:35,000
and what would that be? So,
instead of making a library of

224
00:15:35,000 --> 00:15:42,000
genomic DNA, what you're
suggesting is a radical idea.

225
00:15:42,000 --> 00:15:50,000
Let's instead take human RNA.
Here's some human RNA, lots of

226
00:15:50,000 --> 00:15:57,000
human RNA, a big collection of
human RNA. What was at the end of the

227
00:15:57,000 --> 00:16:02,000
human RNA? A poly(A) tail.
And what I understand you to be

228
00:16:02,000 --> 00:16:06,000
suggesting is if we take human
mRNAs, a whole collection of them,

229
00:16:06,000 --> 00:16:10,000
you want me to turn these mRNAs back
into DNA and clone them instead of

230
00:16:10,000 --> 00:16:14,000
using the chromosomal DNA. How
do I turn an RNA back to DNA?

231
00:16:14,000 --> 00:16:18,000
Is that possible? What do you
use? Reverse transcriptase.

232
00:16:18,000 --> 00:16:22,000
We have to give it a primer.
So remember, five prime to three

233
00:16:22,000 --> 00:16:26,000
prime, we'd like to put
a primer going over here.

234
00:16:26,000 --> 00:16:30,000
Any ideas for a good primer?
Poly(T), isn't that convenient?

235
00:16:30,000 --> 00:16:35,000
One of the reasons that mammalian
messages have poly(A) tails is so

236
00:16:35,000 --> 00:16:41,000
that we are able to reverse
transcribe them using poly(T)

237
00:16:41,000 --> 00:16:46,000
primers. No, that's actually
not true. So, we use reverse

238
00:16:46,000 --> 00:16:52,000
transcriptase. And what
we can do is we'll copy

239
00:16:52,000 --> 00:16:58,000
this RNA into a strand
of DNA. There we go.

240
00:16:58,000 --> 00:17:03,000
Then what we'll do, next
step, is we'll take the DNA,

241
00:17:03,000 --> 00:17:09,000
and we'll copy back into
a second strand of DNA.

242
00:17:09,000 --> 00:17:15,000
And now, we have double-stranded
DNA whose sequence matches the

243
00:17:15,000 --> 00:17:21,000
already-processed mRNAs.
Sorry? So, the sequences would

244
00:17:21,000 --> 00:17:27,000
match the mRNAs. So what
you could do is instead of

245
00:17:27,000 --> 00:17:32,000
taking human DNA from the
nucleus, you could take RNAs,

246
00:17:32,000 --> 00:17:38,000
turn them back into DNA
by reverse transcriptase,

247
00:17:38,000 --> 00:17:43,000
and make a library now that
consists of zillions of inserts,

248
00:17:43,000 --> 00:17:49,000
each of which has what's
called a cDNA, a copied DNA,

249
00:17:49,000 --> 00:17:54,000
copied back from the RNA. The
great advantage of this is that

250
00:17:54,000 --> 00:18:00,000
the human cell has already
done the splicing, and so there

251
00:18:00,000 --> 00:18:05,000
are no introns left. Now,
when you stick it in a

252
00:18:05,000 --> 00:18:09,000
bacterium, the bacterium is
able to express this. It's able,

253
00:18:09,000 --> 00:18:13,000
if you give it its own bacterial
promoter, to make an RNA.

254
00:18:13,000 --> 00:18:17,000
And if you don't ask the
bacteria to have to splice,

255
00:18:17,000 --> 00:18:21,000
if you just give it a pre-spliced
piece of DNA that doesn't need

256
00:18:21,000 --> 00:18:25,000
splicing, it can translate that
DNA. Now, notice we used all of our

257
00:18:25,000 --> 00:18:29,000
tricks. You had to know
about reverse transcriptase,

258
00:18:29,000 --> 00:18:34,000
poly(A) tails, structures of genes,
introns, exons, yes, question?

259
00:18:34,000 --> 00:18:38,000
It doesn't. You do this in the test
tube. You purify human mRNA in the

260
00:18:38,000 --> 00:18:42,000
test tube. You take that mRNA in a
test tube, add reverse transcriptase,

261
00:18:42,000 --> 00:18:47,000
add poly(T), make this reaction of
RNA to DNA in the test tube go back.

262
00:18:47,000 --> 00:18:51,000
Where does it come from? Viruses
that copy themselves back for a

263
00:18:51,000 --> 00:18:56,000
living, right? So, again,
every single thing we're

264
00:18:56,000 --> 00:19:00,000
using comes from some living
organism that does this

265
00:19:00,000 --> 00:19:04,000
kind of stuff. And, when
I teach you about the

266
00:19:04,000 --> 00:19:08,000
facts of how viruses replicate or
what the structure of mRNAs looks

267
00:19:08,000 --> 00:19:11,000
like or whatever, it's
because every bit of knowledge

268
00:19:11,000 --> 00:19:14,000
we get about the way biology works
turns into an incredibly powerful

269
00:19:14,000 --> 00:19:18,000
tool as it's turning out for us to
actually be able to further study

270
00:19:18,000 --> 00:19:21,000
biology. So, great.
So, where does reverse

271
00:19:21,000 --> 00:19:24,000
transcriptase come from now?
Originally they come from viruses

272
00:19:24,000 --> 00:19:28,000
that turn themselves back from RNA
to DNA. Now, how do you get reverse

273
00:19:28,000 --> 00:19:32,000
transcriptase?
Catalog, right,

274
00:19:32,000 --> 00:19:38,000
very good. All right, so
this is called, finally,

275
00:19:38,000 --> 00:19:44,000
a cDNA library. And, if
you had made a cDNA library,

276
00:19:44,000 --> 00:19:49,000
you would be able to screen the cDNA
library to find the gene for insulin.

277
00:19:49,000 --> 00:19:55,000
Is this useful?
This happens to be,

278
00:19:55,000 --> 00:20:01,000
for example, one of the consequences
of this was the biotechnology

279
00:20:01,000 --> 00:20:06,000
industry. OK, so if you
have any doubts about

280
00:20:06,000 --> 00:20:10,000
the usefulness of understanding
these abstract things about E. coli

281
00:20:10,000 --> 00:20:14,000
and bacteria and stuff like
that, one of the consequences was

282
00:20:14,000 --> 00:20:18,000
Genentech, Biogen, and
Amgen, and if you just simply

283
00:20:18,000 --> 00:20:22,000
walk around Kendall Square, within
a mile of this place you will

284
00:20:22,000 --> 00:20:26,000
see laid out before you the
consequences of this ability,

285
00:20:26,000 --> 00:20:30,000
OK? It's transforming
Cambridge. Yes?

286
00:20:30,000 --> 00:20:36,000
And the world.
Yeah. Indeed.

287
00:20:36,000 --> 00:20:43,000
It might be that producing large
amounts of insulin was bad for the

288
00:20:43,000 --> 00:20:50,000
bacteria because there would be so
much protein it would clump and kill

289
00:20:50,000 --> 00:20:57,000
the bacteria. It might be that
insulin, for various reasons,

290
00:20:57,000 --> 00:21:04,000
might not fold appropriately
in the bacterial environment.

291
00:21:04,000 --> 00:21:07,000
And, this is why the biotechnology
industry has lots of smart people

292
00:21:07,000 --> 00:21:10,000
working in it because you're totally,
100% right. You might decide that

293
00:21:10,000 --> 00:21:13,000
instead of cloning it in bacteria
it's better to clone it in some

294
00:21:13,000 --> 00:21:16,000
insect cell in culture which, in
fact, people like to work with,

295
00:21:16,000 --> 00:21:19,000
or some other cell, or
a mammalian cell. And so,

296
00:21:19,000 --> 00:21:23,000
I simplify by saying put it in
E. coli, but in fact they might test six

297
00:21:23,000 --> 00:21:26,000
different cell lines, six
different host possibilities.

298
00:21:26,000 --> 00:21:29,000
They might have to take the
insulin out and refold it in vitro

299
00:21:29,000 --> 00:21:33,000
and things like that.
You're totally right.

300
00:21:33,000 --> 00:21:37,000
This is actually something that
requires work to do it right,

301
00:21:37,000 --> 00:21:42,000
just like building an
airplane requires work.

302
00:21:42,000 --> 00:21:47,000
I could tell you Bernoulli's
principles, but then Boeing does

303
00:21:47,000 --> 00:21:51,000
more than just write down
Bernoulli's principles.

304
00:21:51,000 --> 00:21:56,000
OK, so onward. Now, I'd like to
turn next to analyzing your clone.

305
00:21:56,000 --> 00:22:00,000
Analyzing the clone, so suppose
we have, maybe it's by positional

306
00:22:00,000 --> 00:22:05,000
cloning, maybe it's by cDNA cloning,
but one way or the other we've got

307
00:22:05,000 --> 00:22:10,000
us a clone that we're
very interested in.

308
00:22:10,000 --> 00:22:14,000
Maybe it has the insulin gene.
Maybe it has the Huntington's

309
00:22:14,000 --> 00:22:18,000
disease gene. Whatever it is,
we're going to want to study it.

310
00:22:18,000 --> 00:22:22,000
And at the moment, I haven't told
you how I would even read its DNA

311
00:22:22,000 --> 00:22:26,000
sequence or analyze its DNA.
So, the first step is, of course,

312
00:22:26,000 --> 00:22:31,000
I have to purify the plasmid. And,
it turns out that that can be done.

313
00:22:31,000 --> 00:22:34,000
There are simple biochemical
techniques, as I mentioned in a

314
00:22:34,000 --> 00:22:37,000
previous lecture, that
allow you to grow up a lot of

315
00:22:37,000 --> 00:22:40,000
the bacteria, crack them open,
and the plasmid being a little

316
00:22:40,000 --> 00:22:43,000
circle, and being a little more
tightly super-coiled and wound up

317
00:22:43,000 --> 00:22:46,000
has somewhat different physical
properties. And you can use those

318
00:22:46,000 --> 00:22:50,000
to purify the plasmid. So,
plasmid preps are not hard to

319
00:22:50,000 --> 00:22:53,000
do. You can get a fairly pure
collection of the plasmid.

320
00:22:53,000 --> 00:22:56,000
Now, suppose I've done this for,
oh, I don't know, let's take my

321
00:22:56,000 --> 00:23:00,000
first example, orange mutants.
Suppose I tried to rescue bacteria

322
00:23:00,000 --> 00:23:04,000
that were orange minus,
and suppose I found that 50

323
00:23:04,000 --> 00:23:08,000
different plasmids rescued my orange
mutant because I transformed a lot

324
00:23:08,000 --> 00:23:12,000
of plasmids in, I plated
it, and 50 colonies grew up.

325
00:23:12,000 --> 00:23:16,000
Are they all the same
thing or are they different?

326
00:23:16,000 --> 00:23:20,000
Is there any quickie way to take a
look at these 50 plasmids and see if

327
00:23:20,000 --> 00:23:24,000
they're identical or fairly close,
or obviously different? Well, I'd

328
00:23:24,000 --> 00:23:28,000
like some way to take the
DNA from the plasmid and analyze it

329
00:23:28,000 --> 00:23:32,000
kind of easily. I
might want to see,

330
00:23:32,000 --> 00:23:37,000
like, how big is the insert?
Right, that'd be one way,

331
00:23:37,000 --> 00:23:43,000
if they had different sized inserts
then they couldn't be the same thing.

332
00:23:43,000 --> 00:23:49,000
So, maybe what I could do is, how did
I clone this? I used EcoRI sites I

333
00:23:49,000 --> 00:23:55,000
recall. So, I have EcoRI sites here.
Suppose I were to take this DNA,

334
00:23:55,000 --> 00:24:02,000
and I were to now cut the DNA
from the plasmid with EcoRI.

335
00:24:02,000 --> 00:24:09,000
Then, what I would get
is two separate molecules.

336
00:24:09,000 --> 00:24:16,000
I would get the vector and the
insert. How could I see how big

337
00:24:16,000 --> 00:24:24,000
they were? Gels, gel
electrophoresis is the way to do

338
00:24:24,000 --> 00:24:29,000
that. So, I take a gel. A
gel is a slab of gelatin,

339
00:24:29,000 --> 00:24:33,000
Jell-O, OK, and normally it's
laid flat, but I'm going to do it

340
00:24:33,000 --> 00:24:37,000
vertically here. I load
into the top of it here a

341
00:24:37,000 --> 00:24:41,000
little bit of my DNA, this
whole mixture. I take the

342
00:24:41,000 --> 00:24:45,000
plasmid. I cut it. I put
it in here. DNA's positive

343
00:24:45,000 --> 00:24:49,000
charge or negative charge?
Negative. So, where should I put

344
00:24:49,000 --> 00:24:53,000
the positive pole? On
the bottom, well done.

345
00:24:53,000 --> 00:24:57,000
That's often not done, and to
the detriment of the experiment.

346
00:24:57,000 --> 00:25:01,000
If you put the positive pole here,
it goes the wrong way, and everybody

347
00:25:01,000 --> 00:25:05,000
has to do that at least once.
So, what'll happen is the DNA

348
00:25:05,000 --> 00:25:11,000
fragments move through, and
the smaller fragments move

349
00:25:11,000 --> 00:25:16,000
faster than the big fragments,
right? If something's little, it'll

350
00:25:16,000 --> 00:25:22,000
move fast. If something's big,
it moves slowly: little, big.

351
00:25:22,000 --> 00:25:27,000
Smaller moves faster because it
wiggles through the little pores in

352
00:25:27,000 --> 00:25:33,000
the gel better. So, suppose
I were to do this for a

353
00:25:33,000 --> 00:25:39,000
bunch of plasmids, and
what I saw was this.

354
00:25:39,000 --> 00:25:47,000
First order, what do you guess?
Sorry? Top row's probably the

355
00:25:47,000 --> 00:25:55,000
plasmid vector. This
is probably the vector,

356
00:25:55,000 --> 00:26:03,000
and what do I know about the
inserts? At least two inserts,

357
00:26:03,000 --> 00:26:09,000
at least two distinct inserts.
Now, if I wanted to be sure that was

358
00:26:09,000 --> 00:26:13,000
the vector, maybe what I
could do is take another row,

359
00:26:13,000 --> 00:26:17,000
and run a known amount of the vector,
take the vector alone and I could

360
00:26:17,000 --> 00:26:21,000
check that the vector alone runs
over here. And maybe I might take

361
00:26:21,000 --> 00:26:25,000
some other known molecules.
These would be called molecular

362
00:26:25,000 --> 00:26:29,000
weight standards. So, if
I run some knowns in one of

363
00:26:29,000 --> 00:26:33,000
the lanes of the gel, I
can even measure and say,

364
00:26:33,000 --> 00:26:37,000
ah-ha, the insert is somewhere
between the size of this one and the

365
00:26:37,000 --> 00:26:40,000
size of that one. And so,
I get a little ruler that I

366
00:26:40,000 --> 00:26:43,000
can put on the gel. So, in
fact, that's the first thing

367
00:26:43,000 --> 00:26:46,000
you would do is you
digest your clone that way.
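
The "ruler" idea can be sketched in code. This is a minimal sketch, assuming that migration distance is roughly linear in the logarithm of fragment size between two neighboring standards (a common approximation, not something stated in the lecture). The ladder distances and sizes below are hypothetical illustration values, and `estimate_size` is a made-up helper name.

```python
import math

def estimate_size(distance_mm, standards):
    """Estimate fragment size (bp) from how far its band ran.

    standards: list of (distance_mm, size_bp) pairs for the known ladder.
    Interpolates log10(size) linearly between the two bracketing standards.
    """
    points = sorted(standards)
    for (d1, s1), (d2, s2) in zip(points, points[1:]):
        if d1 <= distance_mm <= d2:
            frac = (distance_mm - d1) / (d2 - d1)
            log_size = math.log10(s1) + frac * (math.log10(s2) - math.log10(s1))
            return 10 ** log_size
    raise ValueError("band lies outside the range of the standards")

# Hypothetical ladder: small fragments run farther than big ones.
ladder = [(10.0, 10_000), (20.0, 1_000), (30.0, 100)]
insert_size = estimate_size(15.0, ladder)  # band halfway between 10 kb and 1 kb
print(round(insert_size))
```

Two inserts whose estimates agree within the measurement error still need not be the same DNA, which is why the lecture moves on to further digests.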

368
00:26:46,000 --> 00:26:49,000
Now, does the fact that these
guys have exactly the same,

369
00:26:49,000 --> 00:26:52,000
apparently, size on the gel mean
that they're the exact same piece of

370
00:26:52,000 --> 00:26:55,000
DNA? No, because you can't even
actually tell it's exactly the same.

371
00:26:55,000 --> 00:26:59,000
There's a limit to how
precisely you can measure it.

372
00:26:59,000 --> 00:27:04,000
So, what else could you do? You
could try another restriction

373
00:27:04,000 --> 00:27:10,000
enzyme. It turns out that since
there are so many restriction

374
00:27:10,000 --> 00:27:15,000
enzymes in the catalog,
if I take a piece of DNA,

375
00:27:15,000 --> 00:27:21,000
maybe that Eco fragment, I could
try cutting it with HindIII.

376
00:27:21,000 --> 00:27:26,000
And when I cut it with HindIII,
I'm going to get three distinct

377
00:27:26,000 --> 00:27:32,000
lengths. I could try cutting
it with, oh, I don't know,

378
00:27:32,000 --> 00:27:37,000
pick another enzyme, BamHI.
When I cut it with BamHI,

379
00:27:37,000 --> 00:27:43,000
I'll get some other lengths.
And how do I get these lengths? By

380
00:27:43,000 --> 00:27:48,000
running them out
on a gel and looking at their sizes.

381
00:27:48,000 --> 00:27:54,000
What if I added both HindIII
and BamHI to my test tube?

382
00:27:54,000 --> 00:28:00,000
I'd cut at both sites.
So, I'd cut here, here,

383
00:28:00,000 --> 00:28:06,000
here, here, here. So,
this is cut with HindIII,

384
00:28:06,000 --> 00:28:12,000
here cut with BamHI, here cut
with both and I could measure these

385
00:28:12,000 --> 00:28:19,000
lengths. So, suppose I gave
you this as a computer problem,

386
00:28:19,000 --> 00:28:25,000
I have a string and it's an unknown
string, and I cut it at two places

387
00:28:25,000 --> 00:28:31,000
and I get these lengths, X1, X2, X3.
And then I take that same string and

388
00:28:31,000 --> 00:28:35,000
I cut it at other positions, Y1,
Y2, and Y3 are the lengths that

389
00:28:35,000 --> 00:28:39,000
result. And then suppose I now
cut it at both of the sites,

390
00:28:39,000 --> 00:28:43,000
and I measure it, and I get Z1,
Z2, Z3, Z4, Z5. If I gave you all

391
00:28:43,000 --> 00:28:48,000
those numbers, could you
figure out where the sites

392
00:28:48,000 --> 00:28:52,000
must be? Probably. It
turns out to be a reasonably

393
00:28:52,000 --> 00:28:56,000
doable computer problem, although
it can get a little hard in

394
00:28:56,000 --> 00:29:00,000
places. And you could
try a third enzyme and

395
00:29:00,000 --> 00:29:03,000
a fourth enzyme, and it's
a cute exercise to write

396
00:29:03,000 --> 00:29:07,000
yourself a little piece of code that
will figure out where the sites are

397
00:29:07,000 --> 00:29:10,000
based on the lengths. The
reason it occasionally gets

398
00:29:10,000 --> 00:29:13,000
funny is, what if Z3 and Z4 are exactly
the same length and they run on top

399
00:29:13,000 --> 00:29:16,000
of each other in the gel,
and there are special cases.

400
00:29:16,000 --> 00:29:20,000
But you can kind of reconstruct
where those restriction sites must

401
00:29:20,000 --> 00:29:23,000
be just by writing a good piece
of code that'll put these pieces

402
00:29:23,000 --> 00:29:26,000
together. This is called
restriction mapping,

403
00:29:26,000 --> 00:29:30,000
and it's great fun. Everybody
likes to do this once.
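
The computer problem described above can be sketched as a brute-force search. This is a minimal sketch, assuming a linear piece of DNA, exact fragment lengths, and only a handful of sites per enzyme; the function names and the example lengths are hypothetical. It tries every ordering of the X fragments and every ordering of the Y fragments, and keeps the orderings whose combined cuts give back the Z lengths.

```python
from itertools import accumulate, permutations

def cut_sites(fragments):
    """Cut positions implied by laying the fragments end to end."""
    return list(accumulate(fragments))[:-1]

def double_digest_maps(x, y, z):
    """Return every (sites_a, sites_b) consistent with the three digests.

    x: fragment lengths from enzyme A alone
    y: fragment lengths from enzyme B alone
    z: fragment lengths from cutting with both enzymes
    """
    total = sum(x)
    if sum(y) != total or sum(z) != total:
        raise ValueError("all three digests must add up to the same length")
    target = sorted(z)
    solutions = []
    for order_a in set(permutations(x)):
        sites_a = cut_sites(order_a)
        for order_b in set(permutations(y)):
            sites_b = cut_sites(order_b)
            cuts = sorted(set([0, total] + sites_a + sites_b))
            pieces = sorted(b - a for a, b in zip(cuts, cuts[1:]))
            if pieces == target:
                solutions.append((sites_a, sites_b))
    return sorted(solutions)

# Hypothetical lengths for a 20-unit fragment.
maps = double_digest_maps(x=[3, 8, 9], y=[6, 9, 5], z=[3, 3, 5, 4, 5])
for sites_a, sites_b in maps:
    print("enzyme A sites:", sites_a, "enzyme B sites:", sites_b)
```

Every answer comes with its mirror image, because fragment lengths alone cannot say which end of the DNA is which. Repeated lengths, like the two 3s and two 5s here, are the kind of special case that can leave more than one map standing.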

404
00:29:30,000 --> 00:29:33,000
But, it's only a limited
amount of information, right,

405
00:29:33,000 --> 00:29:36,000
because you get where the sites are,
and I guess if I gave you ten clones

406
00:29:36,000 --> 00:29:40,000
and they all had exactly
the same restriction maps,

407
00:29:40,000 --> 00:29:43,000
the exact same positions
of these restriction sites,

408
00:29:43,000 --> 00:29:47,000
you'd feel pretty confident
they were the same clone.

409
00:29:47,000 --> 00:29:50,000
But you still wouldn't really know
much about the clone other than it

410
00:29:50,000 --> 00:29:53,000
had two HindIII sites and two BamHI
sites, and here's where they were.

411
00:29:53,000 --> 00:29:57,000
What do you really want to
know about this clone? Its

412
00:29:57,000 --> 00:30:02,000
DNA sequence, right? Let's
not settle for anything less

413
00:30:02,000 --> 00:30:10,000
than the exact nucleotide
sequence of the clone. So,

414
00:30:10,000 --> 00:30:18,000
that's really the last key
topic is sequencing DNA.

415
00:30:18,000 --> 00:30:26,000
How are you going to sequence
DNA? Well, suppose I give you some

416
00:30:26,000 --> 00:30:34,000
double strand of DNA,
five prime to three prime,

417
00:30:34,000 --> 00:30:42,000
five prime, three prime,
double stranded DNA.

418
00:30:42,000 --> 00:30:47,000
Let me heat it up. What
happens when I heat up DNA?

419
00:30:47,000 --> 00:30:52,000
It melts the hydrogen bonds, the
non-covalent hydrogen bonds here

420
00:30:52,000 --> 00:30:57,000
break, and I got my two
strands separated. Now,

421
00:30:57,000 --> 00:31:02,000
what I'd like to do is I want to
start reading out this DNA sequence.

422
00:31:02,000 --> 00:31:08,000
So, I'm going to make me a primer.
Now, golly, here's a primer.

423
00:31:08,000 --> 00:31:14,000
You're going to ask me, how
did I even know what primer to

424
00:31:14,000 --> 00:31:21,000
use if I don't know the DNA
sequence? How can I make a primer?

425
00:31:21,000 --> 00:31:27,000
Hold that question. Make sure I
remember to come back and answer

426
00:31:27,000 --> 00:31:34,000
that, OK? But for the moment,
grant me that I have a primer here.

427
00:31:34,000 --> 00:31:39,000
What I'd like to do is
add DNA polymerase. So,

428
00:31:39,000 --> 00:31:45,000
let's add some DNA polymerase.
And, I'd like to add nucleotide

429
00:31:45,000 --> 00:31:50,000
triphosphates,
dNTPs, the dATP,

430
00:31:50,000 --> 00:31:56,000
dCTP, the dGTP, dTTP, and
if I add DNA polymerase and I

431
00:31:56,000 --> 00:32:02,000
add my nucleotides, what
does Arthur Kornberg tell us

432
00:32:02,000 --> 00:32:07,000
will happen? It'll
start polymerizing,

433
00:32:07,000 --> 00:32:13,000
right? And, it'll stop there. So,
the polymerase knows the bases,

434
00:32:13,000 --> 00:32:18,000
right? It knows what base to put
in because polymerase is very smart.

435
00:32:18,000 --> 00:32:24,000
So, the bases get put in correctly.
The only problem is, how do we get

436
00:32:24,000 --> 00:32:30,000
polymerase to tell
us what it just did?

437
00:32:30,000 --> 00:32:37,000
Here's a cute trick.
This is, by the way,

438
00:32:37,000 --> 00:32:44,000
a cute trick that won the Nobel
Prize. So, suppose my primer is

439
00:32:44,000 --> 00:32:51,000
like this: five prime, T,
A, A, T, T, C, T, and the

440
00:32:51,000 --> 00:32:58,000
template strand here, A, T,
T, A, A, G, A, now let's keep

441
00:32:58,000 --> 00:33:05,000
going, A, T, G,
C, C, A, A, T, G,

442
00:33:05,000 --> 00:33:14,000
G, A, T, T, A, five prime.
So, there's my primer.

443
00:33:14,000 --> 00:33:26,000
There's my template. I'm going
to start adding. Well, let's

444
00:33:26,000 --> 00:33:37,000
add our polymerase.
Let's add our dNTPs,

445
00:33:37,000 --> 00:33:47,000
polymerase, dATP, dCTP, dTTP,
dGTP, and then I want to add a

446
00:33:47,000 --> 00:33:58,000
special extra good old
ingredient into this.

447
00:33:58,000 --> 00:34:07,000
The special extra ingredient
I want to add is a defective T,

448
00:34:07,000 --> 00:34:17,000
a defective dTTP. What do I mean
by defective? I mean chemically

449
00:34:17,000 --> 00:34:27,000
modified in such a way that
it can't be extended, that you

450
00:34:27,000 --> 00:34:36,000
can't extend past it. So
now, let's follow my reaction.

451
00:34:36,000 --> 00:34:44,000
I'm going to start with, I'm just
going to write them down here,

452
00:34:44,000 --> 00:34:52,000
T, A, A, T, T, C, T. What's the
next base I'm going to put in?

453
00:34:52,000 --> 00:35:00,000
T, OK? Is that a defective
T or a good T? I don't know.

454
00:35:00,000 --> 00:35:05,000
It could be. Maybe it's a defective
T, which I'll put a little star

455
00:35:05,000 --> 00:35:11,000
there, OK? If so, what
happens to my polymerase?

456
00:35:11,000 --> 00:35:17,000
It stops. It can't go any further.
It can't go any further because the

457
00:35:17,000 --> 00:35:22,000
T's defective. But what
if it wasn't a defective T?

458
00:35:22,000 --> 00:35:28,000
What if it was a good T? Then
what goes on? The polymerase

459
00:35:28,000 --> 00:35:34,000
will put in, keep going guys. A,
C, G, G, and what does it put in

460
00:35:34,000 --> 00:35:39,000
now? T,
right? Now,

461
00:35:39,000 --> 00:35:45,000
is that a defective T?
Maybe. We don't know.

462
00:35:45,000 --> 00:35:50,000
If it is a defective T,
it stops there. Otherwise,

463
00:35:50,000 --> 00:35:56,000
polymerase goes here, and
the next base is what?

464
00:35:56,000 --> 00:36:02,000
T, and is that a
defective T? Maybe.

465
00:36:02,000 --> 00:36:09,000
And, if it's not a defective T,
then polymerase goes on, puts in an

466
00:36:09,000 --> 00:36:16,000
A, puts in a C,
a C, and then a T.

467
00:36:16,000 --> 00:36:23,000
And maybe that's defective.
All right, which of these

468
00:36:23,000 --> 00:36:31,000
possibilities is what polymerase
does when I throw it in?

469
00:36:31,000 --> 00:36:35,000
Well, all of them. There's
a lot of molecules there.

470
00:36:35,000 --> 00:36:39,000
Some of the molecules, by chance,
happen to install a defective T,

471
00:36:39,000 --> 00:36:43,000
and they grind to a halt here.
Sometimes, a good T's put in and the

472
00:36:43,000 --> 00:36:47,000
molecules stop here.
Sometimes they stop here,

473
00:36:47,000 --> 00:36:51,000
and if I start with a big collection
of primers and a lot of my template

474
00:36:51,000 --> 00:36:55,000
DNA, I'm going to get this whole
collection of different molecules of

475
00:36:55,000 --> 00:36:59,000
different lengths.
What lengths do I get?

476
00:36:59,000 --> 00:37:04,000
The lengths correspond precisely
to the positions of the Ts.
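
The "all of them" answer can be checked with a small simulation. This is a minimal sketch using the primer and template from the board; the 30% chance that any given T is the defective one is an assumed number chosen for illustration, and `extend_once` and `product_lengths` are hypothetical helper names.

```python
import random

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
PRIMER = "TAATTCT"                       # 5' to 3', as written in the lecture
TEMPLATE_3_TO_5 = "ATTAAGAATGCCAATGGATTA"  # 3' to 5', as written in the lecture

def extend_once(rng, dd_base="T", p_defective=0.3):
    """Extend one primer molecule and return the final strand length."""
    length = len(PRIMER)
    for template_base in TEMPLATE_3_TO_5[len(PRIMER):]:
        length += 1
        added = COMPLEMENT[template_base]
        # A defective (chain-terminating) base stops the polymerase here.
        if added == dd_base and rng.random() < p_defective:
            break
    return length

def product_lengths(n_molecules=10_000, seed=0):
    """Count how many molecules end at each length."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_molecules):
        n = extend_once(rng)
        counts[n] = counts.get(n, 0) + 1
    return dict(sorted(counts.items()))

print(product_lengths())
```

The only lengths that show up are 8, 13, 14, 18, and 21, one for each T that gets added. In this sketch, molecules that never pick up a defective T also end at 21, because the template happens to end on a T position.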

477
00:37:04,000 --> 00:37:10,000
I get a series of molecules
whose lengths perfectly match the

478
00:37:10,000 --> 00:37:16,000
positions of Ts.
Well, first off,

479
00:37:16,000 --> 00:37:23,000
how do I measure their lengths?
Run a gel, bingo, run a gel.

480
00:37:23,000 --> 00:37:30,000
So, if I could run a gel that could
separate nucleotides based on length

481
00:37:30,000 --> 00:37:37,000
that two next to each other,
another one up there, I'd see a

482
00:37:37,000 --> 00:37:44,000
small molecule,
length one, two,

483
00:37:44,000 --> 00:37:51,000
three, four, five, three,
six, eight, I'd see one of

484
00:37:51,000 --> 00:37:58,000
length eight. I'd see one of
length, what's the next one,

485
00:37:58,000 --> 00:38:05,000
13, eight, nine, ten,
13, 14, so eight, nine,

486
00:38:05,000 --> 00:38:12,000
ten, 11, 12, 13, 14, 15,
what's that, 13, 14, 15, 16,

487
00:38:12,000 --> 00:38:18,000
17, 18. OK, those would
be the positions at

488
00:38:18,000 --> 00:38:22,000
which I would see this T. So,
I'd need to have a special kind

489
00:38:22,000 --> 00:38:26,000
of gel that's so accurate that it
can separate single nucleotides,

490
00:38:26,000 --> 00:38:31,000
right, that the lengths,
but that can be done.

491
00:38:31,000 --> 00:38:36,000
There's acrylamide, a
polymer that will do that.

492
00:38:36,000 --> 00:38:41,000
That'll tell me the exact lengths
of the T's. What else do I do?

493
00:38:41,000 --> 00:38:46,000
Well, let's obviously do
it for the other bases.

494
00:38:46,000 --> 00:38:52,000
Let's try defective A,
defective C, defective G.

495
00:38:52,000 --> 00:38:57,000
Let's see, if I got it right,
which we'll try, it ought to end up

496
00:38:57,000 --> 00:39:02,000
looking something like that.
And if not, you get the picture,

497
00:39:02,000 --> 00:39:07,000
that this ought to match up as to
which columns have which lengths.

498
00:39:07,000 --> 00:39:13,000
OK, I think I got it right.
That tells me the lengths of the

499
00:39:13,000 --> 00:39:18,000
molecules. So, I could
read off the sequence.

500
00:39:18,000 --> 00:39:23,000
The sequence of that molecule
ought to be, starting over there,

501
00:39:23,000 --> 00:39:29,000
the sequence of what I've added in,
ought to be something like T, A, C,

502
00:39:29,000 --> 00:39:34,000
G, G, T, T, A, C,
C, T, yep, it worked.

503
00:39:34,000 --> 00:39:39,000
It's exactly right. Bingo.
I can now read the sequence.
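
Reading the four lanes can also be sketched in code. This is a minimal sketch that uses the primer and template from the board and assumes an idealized gel in which every possible termination product is visible; the function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
PRIMER = "TAATTCT"                       # 5' to 3', from the lecture
TEMPLATE_3_TO_5 = "ATTAAGAATGCCAATGGATTA"  # 3' to 5', from the lecture

def extension(primer, template_3_to_5):
    """Bases the polymerase adds after the primer, read 5' to 3'."""
    downstream = template_3_to_5[len(primer):]
    return "".join(COMPLEMENT[base] for base in downstream)

def lane_lengths(primer, template_3_to_5, dd_base):
    """Lengths of the products that end in the defective version of dd_base."""
    added = extension(primer, template_3_to_5)
    return [len(primer) + i for i, base in enumerate(added, start=1)
            if base == dd_base]

def read_gel(lanes):
    """Read bands from shortest to longest across all four lanes."""
    bands = sorted((length, base)
                   for base, lengths in lanes.items()
                   for length in lengths)
    return "".join(base for _, base in bands)

lanes = {base: lane_lengths(PRIMER, TEMPLATE_3_TO_5, base) for base in "ACGT"}
print(lanes["T"])       # [8, 13, 14, 18, 21]
print(read_gel(lanes))  # TACGGTTACCTAAT
```

The T lane gives the lengths 8, 13, 14, and 18 counted out on the board, plus 21 for the last T of the template, and the shortest-to-longest readout starts with the same T, A, C, G, G, T, T, A, C, C, T.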

504
00:39:39,000 --> 00:39:44,000
Fred Sanger, a brilliant scientist,
thought up this method of just

505
00:39:44,000 --> 00:39:49,000
exploiting E. coli's own polymerase
or other organisms' own polymerases.

506
00:39:49,000 --> 00:39:54,000
So, copying and all the chemistry
that had to be done was thinking up

507
00:39:54,000 --> 00:40:00,000
a defective nucleotide
that could not be extended.

508
00:40:00,000 --> 00:40:10,000
It could obviously be inserted.
It can't be extended. So, one

509
00:40:10,000 --> 00:40:20,000
question is, what's a
defective nucleotide? Well,

510
00:40:20,000 --> 00:40:30,000
you will recall that our nucleotide
in the sugar phosphate chain

511
00:40:30,000 --> 00:40:37,000
is sitting like this. Let's
see, hanging off the one prime

512
00:40:37,000 --> 00:40:42,000
carbon is the base. This
is the one prime carbon,

513
00:40:42,000 --> 00:40:47,000
the two prime carbon, the three
prime carbon, the four prime carbon,

514
00:40:47,000 --> 00:40:52,000
the five prime carbon. What do we
know in DNA at the two prime carbon?

515
00:40:52,000 --> 00:40:57,000
Normally in ribose there
would be a hydroxyl here,

516
00:40:57,000 --> 00:41:02,000
right? But in deoxyribose,
there's just a hydrogen.

517
00:41:02,000 --> 00:41:11,000
So, if this is deoxyribose, so
a dNTP really means a two prime

518
00:41:11,000 --> 00:41:20,000
deoxyribose, where do I now attach
my next base in the sugar phosphate

519
00:41:20,000 --> 00:41:30,000
chain? Three prime end, and
what do I attach it to: the OH.

520
00:41:30,000 --> 00:41:35,000
What do you think would
happen if there's no OH there?

521
00:41:35,000 --> 00:41:40,000
You're stuck. All you've got
to do is take off that hydroxyl.

522
00:41:40,000 --> 00:41:45,000
No hydroxyl group. If you made
nucleotides that don't have that

523
00:41:45,000 --> 00:41:50,000
hydroxyl group, they
can't be extended.

524
00:41:50,000 --> 00:41:55,000
So, instead of these being just
deoxy at the two prime position,

525
00:41:55,000 --> 00:42:00,000
they are dideoxy,
deoxy at two positions.

526
00:42:00,000 --> 00:42:04,000
They are two prime, three
prime, dideoxynucleotides.

527
00:42:04,000 --> 00:42:09,000
That's it. Now, if you needed
to get two prime three prime

528
00:42:09,000 --> 00:42:13,000
dideoxynucleotides, they're
in the catalog of course,

529
00:42:13,000 --> 00:42:18,000
right, because Fred Sanger had
to make them himself and all that,

530
00:42:18,000 --> 00:42:23,000
but you can just buy them now.
And so, you can do the sequence.

531
00:42:23,000 --> 00:42:28,000
A few other little
details here, though, guys.

532
00:42:28,000 --> 00:42:32,000
How do we see the DNA in the gel?
One possibility would be staining

533
00:42:32,000 --> 00:42:37,000
it. There are some dyes
like ethidium bromide,

534
00:42:37,000 --> 00:42:42,000
and for doing your restriction
mapping, using a dye that sticks to

535
00:42:42,000 --> 00:42:47,000
DNA like ethidium bromide does is
pretty good. And then you put it

536
00:42:47,000 --> 00:42:52,000
under fluorescent light and
you look. For sequencing,

537
00:42:52,000 --> 00:42:57,000
the amount of DNA is so little that
it's hard to see with a dye by the

538
00:42:57,000 --> 00:43:02,000
naked eye, which is what you do
with restriction map. So, sorry?

539
00:43:02,000 --> 00:43:06,000
So, the first thing people did was
radioactive. What they did was they

540
00:43:06,000 --> 00:43:10,000
took a primer,
made it radioactive,

541
00:43:10,000 --> 00:43:14,000
and you did this whole sequencing
reaction with radioactive primer.

542
00:43:14,000 --> 00:43:18,000
Then, when you run the gel, you
take your gel and you expose it for

543
00:43:18,000 --> 00:43:22,000
some number of hours, eight
hours maybe, to a piece of x-ray

544
00:43:22,000 --> 00:43:26,000
film, develop the x-ray film,
and you'll see that picture. So,

545
00:43:26,000 --> 00:43:30,000
one solution that you could do
to visualize is using radioactive

546
00:43:30,000 --> 00:43:37,000
nucleotides. So, we got
the defective nucleotide.

547
00:43:37,000 --> 00:43:45,000
We now need to visualize our
DNA. Let's visualize the sequence.

548
00:43:45,000 --> 00:43:54,000
One possibility: radioactive.
The second possibility, someone

549
00:43:54,000 --> 00:44:03,000
already mentioned
it, a fluorescent dye.

550
00:44:03,000 --> 00:44:10,000
Now, here, a fluorescent dye could
be put on, and you can't read it

551
00:44:10,000 --> 00:44:17,000
with your eye, but lasers
are very good at reading.

552
00:44:17,000 --> 00:44:24,000
So, you might run a whole gel
here and have lasers scan it.

553
00:44:24,000 --> 00:44:31,000
But, you can actually do better
than that. Suppose I put my

554
00:44:31,000 --> 00:44:39,000
fluorescent dye on
my dideoxynucleotides.

555
00:44:39,000 --> 00:44:45,000
Suppose I put it on
my dideoxynucleotides,

556
00:44:45,000 --> 00:44:52,000
and suppose I even had enough
chemistry at my disposal that I

557
00:44:52,000 --> 00:44:59,000
could put a different color
of fluorescent dye on each

558
00:44:59,000 --> 00:45:05,000
of my nucleotides. Then,
whenever the dideoxy is put in

559
00:45:05,000 --> 00:45:09,000
to terminate the chain, it
carries with it its own color.

560
00:45:09,000 --> 00:45:14,000
Wouldn't that be cool? And, that's
what's done. Not just can you buy

561
00:45:14,000 --> 00:45:18,000
dideoxynucleotides now, but
you can buy the four different

562
00:45:18,000 --> 00:45:23,000
dideoxynucleotides each with
its own dye attached to it.

563
00:45:23,000 --> 00:45:27,000
So, there are di-dideoxies I guess,
sorry, but it's different di's,

564
00:45:27,000 --> 00:45:32,000
right? They're dye-dideoxies.
So, you could do that.

565
00:45:32,000 --> 00:45:36,000
And then what you get would be that
in this column you get this color.

566
00:45:36,000 --> 00:45:40,000
And in this column, you'd get this
color. And in this column you'd get

567
00:45:40,000 --> 00:45:44,000
this color, etc. I'm not
worrying about where they

568
00:45:44,000 --> 00:45:48,000
are here. And they'd all be
different colors and it would be

569
00:45:48,000 --> 00:45:52,000
very pretty. You know what?
Why do we need to run separate

570
00:45:52,000 --> 00:45:56,000
lanes anymore?
If we got a laser,

571
00:45:56,000 --> 00:46:00,000
we can have the laser scan it
and tell them apart. Stick

572
00:46:00,000 --> 00:46:05,000
it in one lane. In fact,
what's done is stick it in

573
00:46:05,000 --> 00:46:13,000
a capillary tube, throw in
all four at the same time

574
00:46:13,000 --> 00:46:20,000
now, and as these fragments come by,
each has its own color. And all we

575
00:46:20,000 --> 00:46:28,000
need is a laser scanner capable
of sitting right here. Here's

576
00:46:28,000 --> 00:46:34,000
my laser scanner.
And the laser scanner,

577
00:46:34,000 --> 00:46:40,000
positive here, negative here,
as the DNA flows by through this

578
00:46:40,000 --> 00:46:46,000
polymer, the laser scanner reads
off which colors just went by.

579
00:46:46,000 --> 00:46:51,000
And it goes A color, C color,
T color, G color. That's it. So,

580
00:46:51,000 --> 00:46:57,000
there are actually machines now
that have 96 different capillaries.

581
00:46:57,000 --> 00:47:04,000
These are called capillary tubes.
And you can have 96 of them with

582
00:47:04,000 --> 00:47:12,000
laser scanning across,
and in each column now,

583
00:47:12,000 --> 00:47:20,000
it turns out that you can
read almost 1,000 letters,

584
00:47:20,000 --> 00:47:28,000
1,000 bases per column per capillary
times about 100 capillaries.

585
00:47:28,000 --> 00:47:36,000
Or in other words, you can read
out about 10^5 bases of information.

586
00:47:36,000 --> 00:47:40,000
You can read out 10^5 bases of
information in about two hours.

587
00:47:40,000 --> 00:47:45,000
Of course, you can do
that ten times a day. So,

588
00:47:45,000 --> 00:47:50,000
you can actually read out 10^6 or
about a million bases of information

589
00:47:50,000 --> 00:47:55,000
per machine. And here at MIT, we
have 100 of these machines. So,

590
00:47:55,000 --> 00:48:00,000
we actually can read out a little
shy of 100 million letters of DNA

591
00:48:00,000 --> 00:48:05,000
sequence per day,
which I mean is a lot.
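
The throughput figures can be checked with a few lines of arithmetic. This is a minimal sketch using the numbers quoted in the lecture; running every machine all 365 days of the year is an assumption, not something the lecture states.

```python
BASES_PER_CAPILLARY = 1_000   # "almost 1,000 letters" per capillary
CAPILLARIES = 96              # capillaries per machine
RUNS_PER_DAY = 10             # a run takes about two hours
MACHINES = 100                # machines quoted for the sequencing center
DAYS_PER_YEAR = 365           # assumption: no downtime

per_run = BASES_PER_CAPILLARY * CAPILLARIES   # about 10^5 bases
per_machine_day = per_run * RUNS_PER_DAY      # about 10^6 bases
per_day = per_machine_day * MACHINES          # a little shy of 10^8 bases
per_year = per_day * DAYS_PER_YEAR            # tens of billions of bases

print(f"{per_run:,} bases per run")
print(f"{per_machine_day:,} bases per machine per day")
print(f"{per_day:,} bases per day")
print(f"{per_year:,} bases per year")
```

With 96 capillaries the daily total comes to 96 million, which is the "little shy of 100 million" figure, and the yearly total lands at about 35 billion, the same order as the roughly 40 billion quoted.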

592
00:48:05,000 --> 00:48:11,000
We read about 40 billion
letters per year here at MIT,

593
00:48:11,000 --> 00:48:17,000
and this is how we do it.
How much does a machine cost?

594
00:48:17,000 --> 00:48:23,000
List, or do you want a
deal? They list for $300,000,

595
00:48:23,000 --> 00:48:29,000
but if you buy in bulk, I
can do better. [LAUGHTER] We buy

596
00:48:29,000 --> 00:48:34,000
it in bulk, by the way. So
now, how are we going to get our

597
00:48:34,000 --> 00:48:38,000
primer there? That was the only
little bit we were missing is where

598
00:48:38,000 --> 00:48:42,000
did our primer come from? The
last little detail: here's my

599
00:48:42,000 --> 00:48:47,000
vector, remember, and I
want to sequence this insert.

600
00:48:47,000 --> 00:48:51,000
How am I going to get a primer in
the insert? I don't know what its

601
00:48:51,000 --> 00:48:56,000
sequence is. How do I even
start this? Sorry? Well,

602
00:48:56,000 --> 00:49:00,000
but that won't tell me what
the sequence is that I have to,

603
00:49:00,000 --> 00:49:05,000
I mean, I was looking to try to get
a primer that matches the insert.

604
00:49:05,000 --> 00:49:09,000
And I don't know what the insert is.
So, how am I going to get a primer?

605
00:49:09,000 --> 00:49:13,000
Oh, I know the vector.
The vector is well known.

606
00:49:13,000 --> 00:49:17,000
Its sequence is in the catalog.
Let me instead just use a primer

607
00:49:17,000 --> 00:49:21,000
that happens to sit in the vector,
and I'll match to a known sequence

608
00:49:21,000 --> 00:49:25,000
to start with, and then
I'll sequence into my

609
00:49:25,000 --> 00:49:29,000
unknown territory. So, this
is how you get the initial

610
00:49:29,000 --> 00:49:33,000
primer: you arrange that your
initial primer is sitting in known

611
00:49:33,000 --> 00:49:37,000
vector sequence. All right,
so you can now sequence

612
00:49:37,000 --> 00:49:40,000
DNA. I've got to say, I've
taught this course for a little

613
00:49:40,000 --> 00:49:44,000
more than a decade,
and being able to say,

614
00:49:44,000 --> 00:49:47,000
now we can routinely sequence
about a million letters per machine,

615
00:49:47,000 --> 00:49:50,000
and 100 million letters per
day, and things like this was not

616
00:49:50,000 --> 00:49:53,000
routinely the case. When
we started teaching this

617
00:49:53,000 --> 00:49:58,000
course, I was
describing what we di