1
00:00:00,060 --> 00:00:01,780
The following
content is provided

2
00:00:01,780 --> 00:00:04,019
under a Creative
Commons license.

3
00:00:04,019 --> 00:00:06,870
Your support will help MIT
OpenCourseWare continue

4
00:00:06,870 --> 00:00:10,730
to offer high quality
educational resources for free.

5
00:00:10,730 --> 00:00:13,330
To make a donation or
view additional materials

6
00:00:13,330 --> 00:00:17,217
from hundreds of MIT courses,
visit MIT OpenCourseWare

7
00:00:17,217 --> 00:00:17,842
at ocw.mit.edu.

8
00:00:25,625 --> 00:00:26,500
PROFESSOR: All right.

9
00:00:26,500 --> 00:00:29,460
So today, we're going
to briefly review

10
00:00:29,460 --> 00:00:32,750
classical sequencing and
next-gen or second-gen

11
00:00:32,750 --> 00:00:36,520
sequencing, which sort of
provides a lot of the data

12
00:00:36,520 --> 00:00:42,660
that the analytical methods
we'll be talking about work on.

13
00:00:42,660 --> 00:00:46,600
And we'll then introduce
local alignment

14
00:00:46,600 --> 00:00:51,350
a la BLAST and some of
the statistics associated

15
00:00:51,350 --> 00:00:52,830
with that.

16
00:00:52,830 --> 00:00:56,680
So just a few brief
items on topic one.

17
00:00:56,680 --> 00:00:57,230
All right.

18
00:00:57,230 --> 00:01:00,830
So today, we're going to
talk about sequencing first.

19
00:01:00,830 --> 00:01:03,350
Conventional-- or Sanger
sequencing-- then next-gen

20
00:01:03,350 --> 00:01:07,020
or second-gen
sequencing briefly.

21
00:01:07,020 --> 00:01:10,280
And then talk about
local alignments.

22
00:01:10,280 --> 00:01:15,540
So background for the sequencing
part, the Metzger review

23
00:01:15,540 --> 00:01:17,410
covers everything you'll need.

24
00:01:17,410 --> 00:01:21,180
And for the alignment-- we'll
talk about local alignment

25
00:01:21,180 --> 00:01:23,110
today, global
alignment on Tuesday--

26
00:01:23,110 --> 00:01:27,310
then chapters four and five of
the text cover it pretty well.

27
00:01:27,310 --> 00:01:28,590
So here's the text.

28
00:01:28,590 --> 00:01:31,090
If you haven't decided
whether to get it or not,

29
00:01:31,090 --> 00:01:32,050
I'll have it up here.

30
00:01:32,050 --> 00:01:35,650
You can come flip
through it after class.

31
00:01:40,500 --> 00:01:44,450
Sequencing is mostly
done at the level of DNA.

32
00:01:44,450 --> 00:01:50,202
Whether the original
material was RNA or not,

33
00:01:50,202 --> 00:01:52,410
usually convert to DNA and
sequence at the DNA level.

34
00:01:52,410 --> 00:01:57,142
So we'll often think about
DNA as sort of a string.

35
00:01:57,142 --> 00:01:59,100
But it's important to
remember that it actually

36
00:01:59,100 --> 00:02:04,230
has a three dimensional
structure as shown here.

37
00:02:04,230 --> 00:02:08,210
And often, it's helpful to
think of it in sort of a two

38
00:02:08,210 --> 00:02:09,710
dimensional
representation where you

39
00:02:09,710 --> 00:02:14,170
think about the bases and
their hydrogen bonding

40
00:02:14,170 --> 00:02:18,320
and so forth as
shown in the middle.

41
00:02:18,320 --> 00:02:20,530
My mouse is not working
today for some reason,

42
00:02:20,530 --> 00:02:24,410
but hopefully, we won't need it.

43
00:02:24,410 --> 00:02:29,400
So the chemistry of sequencing
is very closely related

44
00:02:29,400 --> 00:02:33,970
to the chemistry of
the individual bases.

45
00:02:33,970 --> 00:02:36,900
And there are really
three main types

46
00:02:36,900 --> 00:02:41,390
that are going to
be relevant here.

47
00:02:41,390 --> 00:02:44,170
Ribonucleotides,
deoxyribonucleotides,

48
00:02:44,170 --> 00:02:48,370
then for Sanger sequencing,
dideoxyribonucleotides.

49
00:02:48,370 --> 00:02:52,200
So who can tell me which
of these structures

50
00:02:52,200 --> 00:02:54,860
corresponds to which
of those names?

51
00:02:54,860 --> 00:02:59,160
And also, please let
me know your name

52
00:02:59,160 --> 00:03:01,860
and I'll attempt
to remember some

53
00:03:01,860 --> 00:03:04,580
of your names toward the end
of the semester probably.

54
00:03:04,580 --> 00:03:06,770
So, which are which?

55
00:03:10,470 --> 00:03:11,386
Yes, what's your name?

56
00:03:11,386 --> 00:03:12,814
AUDIENCE: I'm Simona.

57
00:03:12,814 --> 00:03:15,947
So the ribonucleotide
is the top right.

58
00:03:15,947 --> 00:03:18,566
The deoxy is the one below it.

59
00:03:18,566 --> 00:03:20,190
And the dideoxy is
the one to the left.

60
00:03:20,190 --> 00:03:21,820
PROFESSOR: OK, so
that is correct.

61
00:03:21,820 --> 00:03:25,660
So one way to keep
these things in mind

62
00:03:25,660 --> 00:03:28,230
is the numbering of the bases.

63
00:03:28,230 --> 00:03:33,950
So the carbons in the ribo
sugar are numbered one,

64
00:03:33,950 --> 00:03:37,260
so carbon 1 is the one
where the base is attached.

65
00:03:37,260 --> 00:03:42,560
Two is here, which has an OH
in RNA and just an H in DNA.

66
00:03:42,560 --> 00:03:45,720
And then three is
very important.

67
00:03:45,720 --> 00:03:47,650
Four, and then five.

68
00:03:47,650 --> 00:03:49,700
So five connects
to the phosphates,

69
00:03:49,700 --> 00:03:52,941
which then will connect the
base to the sugar phosphate

70
00:03:52,941 --> 00:03:53,440
backbone.

71
00:03:53,440 --> 00:03:58,720
And three is where you extend.

72
00:03:58,720 --> 00:04:01,480
That's where you're going to
add the next base in a growing

73
00:04:01,480 --> 00:04:01,980
chain.

74
00:04:01,980 --> 00:04:08,810
And so what will happen if you
give DNA polymerase a template

75
00:04:08,810 --> 00:04:11,185
and some dideoxy nucleotides?

76
00:04:14,955 --> 00:04:17,329
It won't be able to extend
because there's no 3-prime OH.

77
00:04:17,329 --> 00:04:19,030
And all the chemistry
requires the OH.

78
00:04:19,030 --> 00:04:22,770
And so that's the basis
of classical or Sanger

79
00:04:22,770 --> 00:04:27,770
sequencing, which Fred
Sanger got the Nobel

80
00:04:27,770 --> 00:04:31,400
Prize for in the 1980s-- I think
it was developed in the '70s--

81
00:04:31,400 --> 00:04:33,940
and it's really
the basis of most

82
00:04:33,940 --> 00:04:36,190
of the sequencing, or
pretty much all the DNA

83
00:04:36,190 --> 00:04:38,860
sequencing up until
the early 2000s

84
00:04:38,860 --> 00:04:42,450
before some newer
technologies came about.

85
00:04:42,450 --> 00:04:45,110
And it takes advantage
of this special property

86
00:04:45,110 --> 00:04:51,860
of dideoxy nucleotides that they
terminate the growing chain.

87
00:04:51,860 --> 00:04:53,980
So imagine we have
a template DNA.

88
00:04:53,980 --> 00:04:56,540
So this is the
molecule whose sequence

89
00:04:56,540 --> 00:04:59,290
we want to determine
shown there in black.

90
00:04:59,290 --> 00:05:00,750
We then have a primer.

91
00:05:00,750 --> 00:05:03,470
And notice the primer's
written in 5-prime to 3-prime

92
00:05:03,470 --> 00:05:04,240
direction.

93
00:05:04,240 --> 00:05:09,360
The ends would be
primer sequences

94
00:05:09,360 --> 00:05:11,829
and then primer complimentary
sequences in the template.

95
00:05:11,829 --> 00:05:13,870
So you typically will have
your template cloned--

96
00:05:13,870 --> 00:05:15,661
this is in conventional
sequencing-- cloned

97
00:05:15,661 --> 00:05:19,470
into some vector
like a phage vector

98
00:05:19,470 --> 00:05:22,080
for sequencing so you know
the flanking sequences.

99
00:05:22,080 --> 00:05:25,450
And then you do four
sequencing reactions

100
00:05:25,450 --> 00:05:28,070
in conventional
Sanger sequencing.

101
00:05:28,070 --> 00:05:31,540
And I know some of you have
probably had this before.

102
00:05:31,540 --> 00:05:34,620
So let's take the first
chemical reaction.

103
00:05:34,620 --> 00:05:38,340
The one here with a DDGTP.

104
00:05:38,340 --> 00:05:41,527
So what would you
put in that reaction?

105
00:05:41,527 --> 00:05:43,360
What are all the
components of that reaction

106
00:05:43,360 --> 00:05:45,590
if you wanted to do
conventional sequencing

107
00:05:45,590 --> 00:05:49,316
on, say, an acrylonitrile?

108
00:05:49,316 --> 00:05:51,760
Anyone?

109
00:05:51,760 --> 00:05:55,090
What do you need and
what does it accomplish?

110
00:05:55,090 --> 00:05:56,416
Yeah, what's your name?

111
00:05:56,416 --> 00:05:57,412
AUDIENCE: I'm Tim.

112
00:05:57,412 --> 00:05:57,660
PROFESSOR: Tim?

113
00:05:57,660 --> 00:05:58,306
Oh yeah, I know you, Tim.

114
00:05:58,306 --> 00:05:58,906
OK, go ahead.

115
00:05:58,906 --> 00:06:02,890
AUDIENCE: So you need
the four nucleotides--

116
00:06:02,890 --> 00:06:04,384
the deoxynucleotides.

117
00:06:04,384 --> 00:06:08,368
You will need the
dideoxy P nucleotides.

118
00:06:08,368 --> 00:06:11,356
In addition, you need all
the other [INAUDIBLE].

119
00:06:11,356 --> 00:06:12,352
You need polymerase.

120
00:06:12,352 --> 00:06:17,830
Generally, you need a buffer
of some sort, [INAUDIBLE],

121
00:06:17,830 --> 00:06:20,654
to [INAUDIBLE].

122
00:06:20,654 --> 00:06:22,070
PROFESSOR: Yeah,
primary template.

123
00:06:22,070 --> 00:06:22,570
Yeah.

124
00:06:22,570 --> 00:06:24,470
Great.

125
00:06:24,470 --> 00:06:24,970
That's good.

126
00:06:24,970 --> 00:06:28,090
It sounds like Tim could
actually do this experiment.

127
00:06:33,750 --> 00:06:35,747
And what ratio would you put in?

128
00:06:35,747 --> 00:06:37,330
So you said you're
going to put in all

129
00:06:37,330 --> 00:06:40,660
four conventional
deoxynucleotides and then one

130
00:06:40,660 --> 00:06:41,770
dideoxynucleotide.

131
00:06:41,770 --> 00:06:46,120
So let's say dideoxy G
just for simplicity here.

132
00:06:46,120 --> 00:06:49,740
So in what ratio would you
put the dideoxynucleotide

133
00:06:49,740 --> 00:06:51,539
compared to the
conventional nucleotides?

134
00:06:51,539 --> 00:06:53,080
AUDIENCE: To lower
the concentration.

135
00:06:53,080 --> 00:06:53,788
PROFESSOR: Lower?

136
00:06:53,788 --> 00:06:55,764
Like how much lower?

137
00:06:55,764 --> 00:06:56,930
AUDIENCE: Like, a lot lower.

138
00:06:56,930 --> 00:06:58,595
PROFESSOR: Like maybe 1%?

139
00:06:58,595 --> 00:06:59,220
AUDIENCE: Yeah.

140
00:06:59,220 --> 00:07:00,511
PROFESSOR: Something like that.

141
00:07:00,511 --> 00:07:01,900
You want to put it a lot lower.

142
00:07:01,900 --> 00:07:05,858
And why is that so important?

143
00:07:05,858 --> 00:07:11,325
AUDIENCE: Because you want the
thing to be able to progress.

144
00:07:11,325 --> 00:07:15,052
Because you need enough of the
ribonucleotide concentration so

145
00:07:15,052 --> 00:07:19,277
that [INAUDIBLE]
every [INAUDIBLE]

146
00:07:19,277 --> 00:07:23,982
equivalent or excess and you're
going to terminate [INAUDIBLE].

147
00:07:23,982 --> 00:07:24,690
PROFESSOR: Right.

148
00:07:24,690 --> 00:07:29,440
So if you put equamolar
deoxy G and dideoxy G,

149
00:07:29,440 --> 00:07:32,050
then it's going to be a
50% chance of terminating

150
00:07:32,050 --> 00:07:33,810
every time you hit
a C in the template.

151
00:07:33,810 --> 00:07:35,480
So you're going to
have half as much

152
00:07:35,480 --> 00:07:38,010
of the material at the second
G, and a quarter as much as

153
00:07:38,010 --> 00:07:41,120
the third, and you're going to
have vanishingly small amounts.

154
00:07:41,120 --> 00:07:42,911
So you're only going
to be able to sequence

155
00:07:42,911 --> 00:07:45,790
the first few C's
in the template.

156
00:07:45,790 --> 00:07:46,290
Exactly.

157
00:07:46,290 --> 00:07:47,456
So that's a very good point.

158
00:07:50,170 --> 00:07:53,520
So now let's imagine you do
these four separate reactions.

159
00:07:53,520 --> 00:07:55,950
You typically would
have radiolabeled primer

160
00:07:55,950 --> 00:07:57,520
so you can see your DNA.

161
00:07:57,520 --> 00:08:00,430
And then you would run
it on some sort of gel.

162
00:08:00,430 --> 00:08:03,670
This is obviously not a real
gel, but an idealized version.

163
00:08:03,670 --> 00:08:10,260
And then in the lane
where you put dideoxy G,

164
00:08:10,260 --> 00:08:12,160
you would see the
smallest products.

165
00:08:12,160 --> 00:08:14,900
So you read these guys
from the bottom up.

166
00:08:14,900 --> 00:08:18,580
And in this lane there
is a very small product

167
00:08:18,580 --> 00:08:21,896
that's just one base longer
than the primer here.

168
00:08:21,896 --> 00:08:24,520
And that's because there was a
C there and it terminated there.

169
00:08:24,520 --> 00:08:27,820
And then the next C appears
several bases later.

170
00:08:27,820 --> 00:08:30,010
So you have sort of a gap here.

171
00:08:30,010 --> 00:08:33,330
And so you can see that the
first base in the template

172
00:08:33,330 --> 00:08:35,929
would be a complement
of T, or C.

173
00:08:35,929 --> 00:08:44,400
And the second base
would be, you can see,

174
00:08:44,400 --> 00:08:46,749
the next smallest product
in this dideoxy T lane,

175
00:08:46,749 --> 00:08:48,290
therefore it would
be A. And you just

176
00:08:48,290 --> 00:08:50,970
sort of snake your
way up through the gel

177
00:08:50,970 --> 00:08:52,540
and read out the sequence.

178
00:08:52,540 --> 00:08:54,760
And this works well.

179
00:08:54,760 --> 00:08:58,200
So what does it actually
look like in practice?

180
00:08:58,200 --> 00:09:00,440
Here are some actual
sequencing gels.

181
00:09:00,440 --> 00:09:03,390
So you run four lanes.

182
00:09:03,390 --> 00:09:10,540
And on big polyacrylamide
gels like this.

183
00:09:10,540 --> 00:09:12,237
Torbin, you ever
run one of these?

184
00:09:12,237 --> 00:09:12,820
AUDIENCE: Yes.

185
00:09:12,820 --> 00:09:14,080
PROFESSOR: Yes?

186
00:09:14,080 --> 00:09:17,140
They're a big pain to cast.

187
00:09:17,140 --> 00:09:19,090
Run for several hours, I think.

188
00:09:19,090 --> 00:09:22,360
And you get these
banding patterns.

189
00:09:22,360 --> 00:09:27,080
And what limits the
sequence read length?

190
00:09:27,080 --> 00:09:31,310
So we normally call
the sequence generated

191
00:09:31,310 --> 00:09:34,560
from one run of a
sequencer as a read.

192
00:09:34,560 --> 00:09:38,640
So that one attempt to sequence
the template is called a read.

193
00:09:38,640 --> 00:09:40,530
And you can see
it's relatively easy

194
00:09:40,530 --> 00:09:42,710
to read the sequence
toward the bottom,

195
00:09:42,710 --> 00:09:46,060
and then it gets
harder as you go up.

196
00:09:46,060 --> 00:09:49,510
And so that's really what
fundamentally limits the read

197
00:09:49,510 --> 00:09:53,170
length, is that the bands get
closer and closer together.

198
00:09:53,170 --> 00:09:58,210
So they'll run inversely
proportional to size

199
00:09:58,210 --> 00:10:00,360
with the small ones
running faster.

200
00:10:00,360 --> 00:10:05,170
But then the difference between
a 20 base product and a 21

201
00:10:05,170 --> 00:10:06,050
might be significant.

202
00:10:06,050 --> 00:10:09,670
But the difference between a
500 base product and a 501 base

203
00:10:09,670 --> 00:10:11,450
product is going
to be very small.

204
00:10:11,450 --> 00:10:16,430
And so you basically can't
order the lanes anymore.

205
00:10:16,430 --> 00:10:19,330
And therefore, that's sort of
what fundamentally limits it.

206
00:10:19,330 --> 00:10:20,780
All right.

207
00:10:20,780 --> 00:10:24,410
So here we had to run
four lanes of a gel.

208
00:10:24,410 --> 00:10:26,190
Can anyone think of
a more efficient way

209
00:10:26,190 --> 00:10:29,107
of doing Sanger sequencing?

210
00:10:29,107 --> 00:10:30,690
Is there any way to
do it in one lane?

211
00:10:34,468 --> 00:10:35,426
Yeah, what's your name?

212
00:10:35,426 --> 00:10:36,338
AUDIENCE: Adrian.

213
00:10:36,338 --> 00:10:39,812
You can use four different
types of the entities.

214
00:10:39,812 --> 00:10:41,590
Maybe like four
different colors.

215
00:10:41,590 --> 00:10:42,310
AUDIENCE: Four different colors.

216
00:10:42,310 --> 00:10:44,670
OK, so instead of using radio
labeling on the primary,

217
00:10:44,670 --> 00:10:50,400
you use fluorophore on your
dideoxy entities, for example.

218
00:10:50,400 --> 00:10:53,980
And then you can run them.

219
00:10:53,980 --> 00:10:57,170
Depending where that
strand terminated,

220
00:10:57,170 --> 00:10:58,680
it'll be a different color.

221
00:10:58,680 --> 00:11:00,221
And you can run them
all in one lane.

222
00:11:00,221 --> 00:11:02,150
OK, so that looks like that.

223
00:11:02,150 --> 00:11:04,990
And so this was an
important development

224
00:11:04,990 --> 00:11:08,610
called terminator
sequencing in the '90s.

225
00:11:08,610 --> 00:11:14,940
That was the basis of the ABI
3700 machine, which was really

226
00:11:14,940 --> 00:11:18,200
the workhorse of
genome sequencing

227
00:11:18,200 --> 00:11:19,910
in the late '90s
and early 2000s.

228
00:11:19,910 --> 00:11:22,430
Really what enabled the
human genome to be sequenced.

229
00:11:22,430 --> 00:11:25,960
And so one of the other
innovations in this technology

230
00:11:25,960 --> 00:11:31,034
was that instead of having a
big gel, they shrunk the gel.

231
00:11:31,034 --> 00:11:32,950
And then they just had
a reader at the bottom.

232
00:11:32,950 --> 00:11:38,475
So the gel was shrunk to as thin
as these little capillaries.

233
00:11:38,475 --> 00:11:40,100
I don't know if you
can see these guys.

234
00:11:40,100 --> 00:11:43,500
But basically it's like
a little thread here.

235
00:11:43,500 --> 00:11:46,040
And so each one of these
is effectively-- oops!

236
00:11:46,040 --> 00:11:47,320
Oh no.

237
00:11:47,320 --> 00:11:50,140
No worries, this
is not valuable.

238
00:11:50,140 --> 00:11:54,450
Ancient technology that I
got for free from somebody.

239
00:11:54,450 --> 00:12:00,702
So the DNA would be
loaded at the top.

240
00:12:00,702 --> 00:12:02,160
There would be a
little gel in each

241
00:12:02,160 --> 00:12:04,520
of these-- it's called
capillary sequencing.

242
00:12:04,520 --> 00:12:06,060
And then it would
run out the bottom

243
00:12:06,060 --> 00:12:07,570
and there would be
a detector which

244
00:12:07,570 --> 00:12:09,240
would detect the
four different flours

245
00:12:09,240 --> 00:12:10,740
and read out the sequence.

246
00:12:10,740 --> 00:12:17,010
So this basically condensed the
volume needed for sequencing.

247
00:12:19,840 --> 00:12:23,630
Any questions about
conventional sequencing?

248
00:12:23,630 --> 00:12:24,130
Yes?

249
00:12:24,130 --> 00:12:26,378
AUDIENCE: Where
are the [INAUDIBLE]

250
00:12:26,378 --> 00:12:28,738
where you'd put the
fluorescent flags?

251
00:12:28,738 --> 00:12:31,294
Like the topic from
the [INAUDIBLE]?

252
00:12:31,294 --> 00:12:32,960
PROFESSOR: Yeah,
that's a good question.

253
00:12:32,960 --> 00:12:35,390
I don't actually remember.

254
00:12:35,390 --> 00:12:39,480
I think there are different
options available.

255
00:12:39,480 --> 00:12:41,460
And sometimes with some
of these reactions,

256
00:12:41,460 --> 00:12:45,940
you need to use modified
polymerases that

257
00:12:45,940 --> 00:12:48,250
can tolerate these
modified nucleotides.

258
00:12:48,250 --> 00:12:49,923
Yeah, so I don't remember that.

259
00:12:49,923 --> 00:12:51,120
It's a good question.

260
00:12:51,120 --> 00:12:52,690
I can look that up.

261
00:12:52,690 --> 00:12:56,650
So how long can a
conventional sequencer go?

262
00:12:56,650 --> 00:12:57,920
What's the read length?

263
00:12:57,920 --> 00:13:01,030
Anyone know?

264
00:13:01,030 --> 00:13:04,060
It's about, say, 600 or so.

265
00:13:04,060 --> 00:13:07,220
And so that's reasonably long.

266
00:13:07,220 --> 00:13:11,360
How long is a typical
mammalian mRNA?

267
00:13:11,360 --> 00:13:12,730
Maybe two, three kb?

268
00:13:12,730 --> 00:13:16,090
So you have in a typical
exon, maybe 150 bases or so.

269
00:13:16,090 --> 00:13:17,920
So you have a chunk.

270
00:13:17,920 --> 00:13:19,680
You don't generally
get full length cDNA.

271
00:13:19,680 --> 00:13:22,540
But you get a chunk
of a cDNA that's say,

272
00:13:22,540 --> 00:13:25,200
three, four exons in length.

273
00:13:25,200 --> 00:13:28,340
And that is actually
generally sufficient

274
00:13:28,340 --> 00:13:32,660
to uniquely identify the gene
locus that that read came from.

275
00:13:32,660 --> 00:13:35,130
And so that was the
basis of EST sequencing--

276
00:13:35,130 --> 00:13:38,060
so-called Expressed
Sequence Tag sequencing.

277
00:13:38,060 --> 00:13:43,510
And millions of these 600 base
chunks of cDNA were generated

278
00:13:43,510 --> 00:13:48,891
and they have been quite
useful over the years.

279
00:13:48,891 --> 00:13:49,390
All right.

280
00:13:49,390 --> 00:13:51,210
So what is next-gen sequencing?

281
00:13:51,210 --> 00:13:58,980
So in next-gen sequencing, you
only read one base at a time.

282
00:13:58,980 --> 00:14:04,370
So it's often a
little bit slower.

283
00:14:04,370 --> 00:14:06,260
But it's really
massively parallel.

284
00:14:06,260 --> 00:14:08,430
And that's the big advantage.

285
00:14:08,430 --> 00:14:12,000
And it's orders of
magnitude cheaper

286
00:14:12,000 --> 00:14:14,139
per base than
conventional sequencing.

287
00:14:14,139 --> 00:14:15,555
Like when it first
came out it, it

288
00:14:15,555 --> 00:14:17,360
was maybe two orders
of magnitude cheaper.

289
00:14:17,360 --> 00:14:21,480
And now it's probably another
four orders of magnitude.

290
00:14:21,480 --> 00:14:24,910
So it really blows away
conventional sequencing

291
00:14:24,910 --> 00:14:30,600
if the output that
you care about

292
00:14:30,600 --> 00:14:34,960
is mostly proportional to
number of bases sequence.

293
00:14:34,960 --> 00:14:37,910
If the output is proportional
to the quality of the assembly

294
00:14:37,910 --> 00:14:39,740
or something, then
there are applications

295
00:14:39,740 --> 00:14:42,430
where conventional
sequencing still

296
00:14:42,430 --> 00:14:45,547
is very useful Because the
next-gen sequencing tends

297
00:14:45,547 --> 00:14:46,130
to be shorter.

298
00:14:46,130 --> 00:14:51,870
But in terms of just volume, it
generates much, much more bases

299
00:14:51,870 --> 00:14:53,390
in one reaction.

300
00:14:53,390 --> 00:14:58,700
And so the basic ideas are
that you have your template DNA

301
00:14:58,700 --> 00:14:59,670
molecules.

302
00:14:59,670 --> 00:15:04,960
Now typically, tens of thousands
for technologies like PacBio

303
00:15:04,960 --> 00:15:10,560
or hundreds of millions for
technologies like Illumina that

304
00:15:10,560 --> 00:15:14,890
are immobilized on
some sort of surface--

305
00:15:14,890 --> 00:15:17,960
typically a flow
cell-- and there

306
00:15:17,960 --> 00:15:19,690
are either single
molecule methods

307
00:15:19,690 --> 00:15:21,731
where you have a single
molecule of your template

308
00:15:21,731 --> 00:15:24,950
or there are methods that
locally amplify your template

309
00:15:24,950 --> 00:15:28,290
and produce, say, hundreds
of identical copies

310
00:15:28,290 --> 00:15:29,440
in little clusters.

311
00:15:29,440 --> 00:15:32,400
And then you use
modified nucleotides,

312
00:15:32,400 --> 00:15:34,440
often with
fluorophores attached,

313
00:15:34,440 --> 00:15:41,990
to interrogate the next base at
each of your template molecules

314
00:15:41,990 --> 00:15:44,800
for hundreds and hundreds
of millions of them.

315
00:15:44,800 --> 00:15:46,800
And so there are several
different technologies.

316
00:15:46,800 --> 00:15:48,133
We won't talk about all of them.

317
00:15:48,133 --> 00:15:49,640
We'll just talk
about two or three

318
00:15:49,640 --> 00:15:52,430
that are interesting
and widely used.

319
00:15:52,430 --> 00:15:55,170
And they differ depending
on the DNA template, what

320
00:15:55,170 --> 00:15:57,100
types of modified
nucleotides are used,

321
00:15:57,100 --> 00:16:00,210
and to some extent, in the
imaging and the image analysis,

322
00:16:00,210 --> 00:16:03,460
which differs for single
molecule methods, for example,

323
00:16:03,460 --> 00:16:08,770
compared to the ones
that sequence a cluster.

324
00:16:08,770 --> 00:16:12,150
So there's a table in
the Metzger review.

325
00:16:12,150 --> 00:16:15,970
And so I've just told you
that next-gen sequencing

326
00:16:15,970 --> 00:16:16,610
is so cheap.

327
00:16:16,610 --> 00:16:18,980
But then you see how
much these machines cost

328
00:16:18,980 --> 00:16:21,790
and you could buy lots of
other interesting things

329
00:16:21,790 --> 00:16:24,670
with that kind of money.

330
00:16:24,670 --> 00:16:27,330
And I also want to emphasize
that that's not even

331
00:16:27,330 --> 00:16:28,060
the full cost.

332
00:16:28,060 --> 00:16:33,100
So if you were to buy an
Illumina GA2-- this would

333
00:16:33,100 --> 00:16:35,010
be like a couple years
ago when the GA2 was

334
00:16:35,010 --> 00:16:38,580
the state of the art-- for
half a million dollars,

335
00:16:38,580 --> 00:16:41,070
the reagents to run
that thing, if you're

336
00:16:41,070 --> 00:16:43,420
going to run it continuously
throughout the year,

337
00:16:43,420 --> 00:16:47,120
the reagents to run it
would be over a million.

338
00:16:47,120 --> 00:16:50,470
So this actually
underestimates the cost.

339
00:16:50,470 --> 00:16:54,170
However, the cost per
base is super, super low.

340
00:16:54,170 --> 00:16:58,830
Because they generate
so much data at once.

341
00:16:58,830 --> 00:17:01,410
All right, So we'll talk
about a couple of these.

342
00:17:01,410 --> 00:17:04,010
The first next-gen
sequencing technology

343
00:17:04,010 --> 00:17:08,050
to be published and
still used today

344
00:17:08,050 --> 00:17:13,869
was from 454--
now Roche-- and it

345
00:17:13,869 --> 00:17:16,979
was based on what's
called emulsion PCR.

346
00:17:16,979 --> 00:17:19,020
So they have these little
beads, the little beads

347
00:17:19,020 --> 00:17:23,589
have adapter DNA molecules
covalently attached.

348
00:17:23,589 --> 00:17:29,020
You incubate the beads
with DNA, and you actually

349
00:17:29,020 --> 00:17:29,760
make an emulsion.

350
00:17:29,760 --> 00:17:31,740
So it's an oil water emulsion.

351
00:17:31,740 --> 00:17:34,780
So each bead, which
is hydrophilic,

352
00:17:34,780 --> 00:17:38,040
is in the little bubble
of water inside oil.

353
00:17:38,040 --> 00:17:41,030
And the reason for that
is so that you do it

354
00:17:41,030 --> 00:17:42,520
at a template
concentration that's

355
00:17:42,520 --> 00:17:44,950
low enough that only a
single molecule of template

356
00:17:44,950 --> 00:17:46,630
is associated with each bead.

357
00:17:46,630 --> 00:17:49,570
So the oil then
provides a barrier

358
00:17:49,570 --> 00:17:53,701
so that the DNA can't get
transferred from one bead

359
00:17:53,701 --> 00:17:54,200
to another.

360
00:17:54,200 --> 00:17:57,010
So each bead will have a
unique template molecule.

361
00:17:57,010 --> 00:18:00,300
You do sort of a local
PCR-like reaction

362
00:18:00,300 --> 00:18:05,110
to amplify that DNA
molecule on the bead,

363
00:18:05,110 --> 00:18:10,732
and then you do sequencing
one base at a time using

364
00:18:10,732 --> 00:18:12,190
a luciferase based
method that I'll

365
00:18:12,190 --> 00:18:14,400
show you on the next slide.

366
00:18:14,400 --> 00:18:18,410
So Illumina technology differs
in that instead of an emulsion,

367
00:18:18,410 --> 00:18:20,730
you're doing it on the
surface of a flow cell.

368
00:18:20,730 --> 00:18:24,450
Again, you start with a
single molecule of template.

369
00:18:24,450 --> 00:18:26,000
Your flow cell has
these two types

370
00:18:26,000 --> 00:18:28,610
of adapters covalently attached.

371
00:18:28,610 --> 00:18:33,080
The template anneals to
one of these adapters.

372
00:18:33,080 --> 00:18:38,180
You extend the adapter molecule
with dNTPs and polymerase.

373
00:18:38,180 --> 00:18:42,620
Now you have the complement of
your template, your denature.

374
00:18:42,620 --> 00:18:44,340
Now you have the
inverse complement

375
00:18:44,340 --> 00:18:46,580
of your template
molecule covalently

376
00:18:46,580 --> 00:18:48,054
attached to the cell surface.

377
00:18:48,054 --> 00:18:50,220
And then at the other end
there's the other adapter.

378
00:18:50,220 --> 00:18:51,636
And so what you
could do is what's

379
00:18:51,636 --> 00:18:56,140
called bridge amplification
where that now complement

380
00:18:56,140 --> 00:19:00,060
of the template molecule
will bridge over

381
00:19:00,060 --> 00:19:02,300
hybridized to the other
adapter, and then you

382
00:19:02,300 --> 00:19:03,870
can extend that adapter.

383
00:19:03,870 --> 00:19:06,336
And now you've regenerated
your original template.

384
00:19:06,336 --> 00:19:08,210
And so now you have the
complementary strand,

385
00:19:08,210 --> 00:19:10,280
and the original
strand, your denature.

386
00:19:10,280 --> 00:19:14,140
And then each of those molecules
can undergo subsequent rounds

387
00:19:14,140 --> 00:19:18,390
of bridge amplification to
make clusters of typically

388
00:19:18,390 --> 00:19:22,197
several hundred
thousand molecules.

389
00:19:22,197 --> 00:19:22,780
Is that clear?

390
00:19:22,780 --> 00:19:23,280
Question.

391
00:19:23,280 --> 00:19:24,486
Yeah, what's your name?

392
00:19:24,486 --> 00:19:26,458
AUDIENCE: Stephanie.

393
00:19:26,458 --> 00:19:30,737
How do they get the adapters
onto the template molecules?

394
00:19:30,737 --> 00:19:32,320
PROFESSOR: How do
you get the adapters

395
00:19:32,320 --> 00:19:34,520
onto the template molecules?

396
00:19:34,520 --> 00:19:38,280
So that's typically
by DNA ligation.

397
00:19:38,280 --> 00:19:43,130
So we may cover
that in later steps.

398
00:19:43,130 --> 00:19:43,856
It depends.

399
00:19:43,856 --> 00:19:45,230
There's a few
different protocol.

400
00:19:45,230 --> 00:19:47,480
So for example, if you're
sequencing microRNAs,

401
00:19:47,480 --> 00:19:50,370
you typically would
isolate the small RNAs

402
00:19:50,370 --> 00:19:52,850
and use RNA litigation
to get the adapters on.

403
00:19:52,850 --> 00:19:55,770
And then you would do
an RT step to get DNA.

404
00:19:55,770 --> 00:19:59,400
With most other applications
like RNA-seq or genome

405
00:19:59,400 --> 00:20:02,950
sequencing-- so with RNA-seq,
you're starting from mRNA,

406
00:20:02,950 --> 00:20:06,850
you typically will isolate
total RNA, do poly(A) selection,

407
00:20:06,850 --> 00:20:08,740
you fragment your RNA
to reduce the effects

408
00:20:08,740 --> 00:20:11,450
of secondary structure,
you random prime with,

409
00:20:11,450 --> 00:20:13,690
like, random hexamers RT enzyme.

410
00:20:13,690 --> 00:20:18,550
So that'll make little bits
of cDNA 200 bases long.

411
00:20:18,550 --> 00:20:20,200
You use second strand synthesis.

412
00:20:20,200 --> 00:20:23,730
Now you have double
stranded cDNA fragments.

413
00:20:23,730 --> 00:20:28,310
And then you do, like, blunt end
ligation to add the adapters.

414
00:20:28,310 --> 00:20:31,086
And then you denature so
you have single strand.

415
00:20:31,086 --> 00:20:32,946
AUDIENCE: I guess my
question is how do you

416
00:20:32,946 --> 00:20:35,550
make sure that the two
ends sandwiching the DNA

417
00:20:35,550 --> 00:20:38,540
are different as opposed to--

418
00:20:38,540 --> 00:20:40,570
PROFESSOR: That the
two ends are different.

419
00:20:40,570 --> 00:20:43,670
Yeah, that's a good question.

420
00:20:47,730 --> 00:20:51,260
I'll post some stuff about--
It's a good question.

421
00:20:51,260 --> 00:20:53,210
I don't want to sweep
it under the rug.

422
00:20:53,210 --> 00:20:55,600
But I kind of want to move on.

423
00:20:55,600 --> 00:20:59,500
And I'll post a
little bit about that.

424
00:20:59,500 --> 00:21:03,880
All right so we
did 454 Illumina.

425
00:21:03,880 --> 00:21:06,050
Helicos is sort of like
Illumina sequencing

426
00:21:06,050 --> 00:21:08,360
except single molecule.

427
00:21:08,360 --> 00:21:11,670
So you have your
template covalently

428
00:21:11,670 --> 00:21:13,560
attached to your substrate.

429
00:21:13,560 --> 00:21:18,590
You just anneal primer and
just start sequencing it

430
00:21:18,590 --> 00:21:24,680
And there's major pros and cons
of single molecule sequencing,

431
00:21:24,680 --> 00:21:26,080
which we can talk about.

432
00:21:26,080 --> 00:21:29,770
And then the PacBio
technology is fundamentally

433
00:21:29,770 --> 00:21:33,980
different in that the
template is not actually

434
00:21:33,980 --> 00:21:35,890
covalently attached
to the surface.

435
00:21:35,890 --> 00:21:38,960
The DNA polymerase is covalently
attached to the surface

436
00:21:38,960 --> 00:21:43,260
and the template is sort of
threaded into the polymerase.

437
00:21:43,260 --> 00:21:46,350
And this is a phage polymerase
that's highly processive

438
00:21:46,350 --> 00:21:47,940
and strand displacing.

439
00:21:47,940 --> 00:21:50,640
And the template is often
a circular molecule.

440
00:21:50,640 --> 00:21:54,060
And so you can
actually read around

441
00:21:54,060 --> 00:21:56,480
the template multiple
times, which turns out

442
00:21:56,480 --> 00:21:59,795
to be really useful in PacBio
because the error rate is

443
00:21:59,795 --> 00:22:01,850
quite high for the sequencing.

444
00:22:01,850 --> 00:22:05,530
So in the top, in
the 454, you're

445
00:22:05,530 --> 00:22:08,040
measuring luciferase
activity-- light.

446
00:22:08,040 --> 00:22:09,924
In Illumina, you're
measuring fluorescence.

447
00:22:09,924 --> 00:22:11,590
Four different
fluorescent tags, sort of

448
00:22:11,590 --> 00:22:15,030
like the four different tags
we saw in Sanger sequencing.

449
00:22:15,030 --> 00:22:18,520
Helicose, it's single
tag one base at a time.

450
00:22:18,520 --> 00:22:21,350
And in PacBio, you actually
have a fluorescently

451
00:22:21,350 --> 00:22:26,620
labeled dNTP that has
the label on-- it's

452
00:22:26,620 --> 00:22:28,520
actually
hexaphosphate-- it's got

453
00:22:28,520 --> 00:22:31,020
the label on the
sixth phosphate.

454
00:22:31,020 --> 00:22:33,530
So the dNTP is labeled.

455
00:22:33,530 --> 00:22:36,560
It enters the active site
of the DNA polymerase.

456
00:22:36,560 --> 00:22:38,530
And the residence
time is much longer

457
00:22:38,530 --> 00:22:41,600
if the base is actually
going to get incorporated

458
00:22:41,600 --> 00:22:42,900
into that growing chain.

459
00:22:42,900 --> 00:22:46,960
And so you measure how much time
you have a fluorescent signal.

460
00:22:46,960 --> 00:22:48,830
And if it's long, that
means that that base

461
00:22:48,830 --> 00:22:51,360
must have incorporated
into the DNA.

462
00:22:51,360 --> 00:22:56,450
But then, the extension
reaction itself

463
00:22:56,450 --> 00:22:58,390
will cleave off the
last five phosphates

464
00:22:58,390 --> 00:23:00,640
and the fluorophore tag.

465
00:23:00,640 --> 00:23:03,960
And so you'll
regenerate native DNA.

466
00:23:03,960 --> 00:23:05,390
So that's another difference.

467
00:23:05,390 --> 00:23:07,315
Whereas in Illumina
sequencing, as we'll see,

468
00:23:07,315 --> 00:23:09,190
there's this reversible
terminator chemistry.

469
00:23:09,190 --> 00:23:12,330
So the DNA is not native
that you're synthesizing.

470
00:23:14,990 --> 00:23:17,800
So this is just a
little bit more on 454.

471
00:23:17,800 --> 00:23:19,300
Just some pretty pictures.

472
00:23:19,300 --> 00:23:22,030
I think I described that before.

473
00:23:22,030 --> 00:23:30,080
The key chemistry here is that
you add one dNTP at a time.

474
00:23:30,080 --> 00:23:33,060
So only a subset of the wells--
perhaps a quarter of them--

475
00:23:33,060 --> 00:23:36,740
that have that next base,
the complementary base

476
00:23:36,740 --> 00:23:40,020
free-- as the next
one after the primer--

477
00:23:40,020 --> 00:23:42,710
will undergo synthesis.

478
00:23:42,710 --> 00:23:45,920
And when they undergo synthesis,
you release pyrophosphate.

479
00:23:45,920 --> 00:23:48,580
And they have these
enzymes attached

480
00:23:48,580 --> 00:23:51,410
to these little micro beads--
the orange beads-- sulfurylase

481
00:23:51,410 --> 00:23:54,680
and luciferase, that use
pyrophosphate to basically

482
00:23:54,680 --> 00:23:55,650
generate light.

483
00:23:55,650 --> 00:23:58,760
And so then you have one of
these beads in each well.

484
00:23:58,760 --> 00:24:04,600
You look at which wells
lit up when we added dCTP.

485
00:24:04,600 --> 00:24:08,030
And they must have had G as
the next base and so forth.

486
00:24:08,030 --> 00:24:10,019
And there's no termination here.

487
00:24:10,019 --> 00:24:11,810
The only termination
is because you're only

488
00:24:11,810 --> 00:24:13,096
adding one base at a time.

489
00:24:13,096 --> 00:24:14,970
So if you have a single
gene in the template,

490
00:24:14,970 --> 00:24:15,803
you'll add one base.

491
00:24:15,803 --> 00:24:19,170
But if you have two Gs in the
template, you'll add two Cs.

492
00:24:19,170 --> 00:24:21,350
And in principle, you'll
get twice as much light.

493
00:24:21,350 --> 00:24:24,500
But then you have to sort of
do some analysis after the fact

494
00:24:24,500 --> 00:24:26,260
to say, OK how much
light do we have?

495
00:24:26,260 --> 00:24:28,322
And was that one G,
two G, and so forth.

496
00:24:28,322 --> 00:24:29,780
And the amount of
light is supposed

497
00:24:29,780 --> 00:24:33,440
to be linear up to
about five or six Gs.

498
00:24:33,440 --> 00:24:38,230
But that's still a
more error-prone step.

499
00:24:38,230 --> 00:24:41,420
And the most common
type of error in 454

500
00:24:41,420 --> 00:24:45,700
is actually insertions
and deletions.

501
00:24:45,700 --> 00:24:48,970
Whereas in Illumina
sequencing, it's substitutions.

502
00:24:48,970 --> 00:24:51,280
David actually encouraged
me to talk more

503
00:24:51,280 --> 00:24:55,380
about sequencing errors
and quality scores.

504
00:24:55,380 --> 00:24:57,510
And I need to do a little
bit more background.

505
00:24:57,510 --> 00:25:03,580
But I may add that a little
bit later in the semester.

506
00:25:03,580 --> 00:25:05,810
OK, so in Illumina
sequencing, you

507
00:25:05,810 --> 00:25:09,180
add all four dNTPs
at the same time.

508
00:25:09,180 --> 00:25:10,920
But they're non-native.

509
00:25:10,920 --> 00:25:13,540
They have two major
modifications.

510
00:25:13,540 --> 00:25:16,490
So one is that they're
three prime blocked.

511
00:25:16,490 --> 00:25:18,525
That means that
the OH is not free,

512
00:25:18,525 --> 00:25:20,400
I'll show the chemical
structure in a moment.

513
00:25:20,400 --> 00:25:22,025
So you can't extend
more than one base.

514
00:25:22,025 --> 00:25:24,710
You incorporate that one
base, and the polymerase

515
00:25:24,710 --> 00:25:25,880
can't do anything more.

516
00:25:25,880 --> 00:25:29,670
And they're also tagged
with four different fluors.

517
00:25:29,670 --> 00:25:31,910
So you add all
four dNTPs at once.

518
00:25:31,910 --> 00:25:34,590
You let the polymerase
incorporate them.

519
00:25:34,590 --> 00:25:37,690
And then you image
the whole flow cell

520
00:25:37,690 --> 00:25:39,150
using two lasers
and two filters.

521
00:25:39,150 --> 00:25:41,330
So basically, to
image the four fluors.

522
00:25:41,330 --> 00:25:44,220
So you have to sort of take
four different pictures

523
00:25:44,220 --> 00:25:47,200
of each portion of the flow
cell and then the camera moves

524
00:25:47,200 --> 00:25:49,360
and you scan the whole cell.

525
00:25:49,360 --> 00:25:55,750
And so then, those clusters that
incorporated a C, let's say,

526
00:25:55,750 --> 00:25:58,710
they will show up in the
green channel as spots.

527
00:25:58,710 --> 00:26:00,460
And those incorporated
in A, and so forth.

528
00:26:00,460 --> 00:26:02,660
So you basically have these
clusters, each of them

529
00:26:02,660 --> 00:26:07,340
represents a distinct
template, and you

530
00:26:07,340 --> 00:26:08,630
read one base at a time.

531
00:26:08,630 --> 00:26:11,080
So, first you read the
first base after the primer.

532
00:26:11,080 --> 00:26:13,990
So it's sequencing
downwards into the template.

533
00:26:13,990 --> 00:26:15,599
And you read the
first base so you

534
00:26:15,599 --> 00:26:17,640
know what the first base
of all your clusters is.

535
00:26:17,640 --> 00:26:20,720
And then you reverse
the termination.

536
00:26:20,720 --> 00:26:22,870
You cleave off
that chemical group

537
00:26:22,870 --> 00:26:26,150
that was blocking the 3-prime
OH so now it can extend again.

538
00:26:26,150 --> 00:26:28,240
And then you add the
four dNTPs again,

539
00:26:28,240 --> 00:26:31,339
do another round of extension,
and then image again,

540
00:26:31,339 --> 00:26:31,880
and so forth.

541
00:26:31,880 --> 00:26:33,171
And so it takes a little while.

542
00:26:33,171 --> 00:26:35,050
Each round of imaging
takes about an hour.

543
00:26:35,050 --> 00:26:39,560
So if you want to do 100 base
single and Illumina sequencing,

544
00:26:39,560 --> 00:26:43,110
it'll be running on the machine
for about four days or so.

545
00:26:43,110 --> 00:26:45,240
Plus the time you have to
build the clusters, which

546
00:26:45,240 --> 00:26:47,650
might be several hours
on the day before.

547
00:26:50,330 --> 00:26:51,240
So what is this?

548
00:26:51,240 --> 00:26:55,570
So actually the whole idea
of blocking termination--

549
00:26:55,570 --> 00:26:59,300
basically Sanger's idea--
is carried over here

550
00:26:59,300 --> 00:27:01,140
in Illumina sequencing
with a little twist.

551
00:27:01,140 --> 00:27:03,760
And that's that you can
reverse the termination.

552
00:27:03,760 --> 00:27:05,830
So if you look down
here at the bottom,

553
00:27:05,830 --> 00:27:09,030
these are two different
3-prime terminators.

554
00:27:09,030 --> 00:27:11,330
Remember your base counting.

555
00:27:11,330 --> 00:27:14,130
Base one, two, three.

556
00:27:14,130 --> 00:27:16,930
So this was the
3-prime OH, now it's

557
00:27:16,930 --> 00:27:19,740
got this methyl [INAUDIBLE],
or whatever that is.

558
00:27:19,740 --> 00:27:22,770
I'm not much of a chemist,
so you can look that one up.

559
00:27:22,770 --> 00:27:24,400
And then here's another version.

560
00:27:24,400 --> 00:27:25,816
And this is sort
of chemistry that

561
00:27:25,816 --> 00:27:27,670
can cleave this off
when you're done.

562
00:27:27,670 --> 00:27:31,060
And then this whole thing
here, hanging off the base,

563
00:27:31,060 --> 00:27:34,356
is the fluor.

564
00:27:34,356 --> 00:27:36,460
And you cleave that off as well.

565
00:27:36,460 --> 00:27:39,235
So you add this big complicated
thing, you image it,

566
00:27:39,235 --> 00:27:40,610
and then you cleave
off the fluor

567
00:27:40,610 --> 00:27:43,330
and cleave off
the 3-prime block.

568
00:27:45,776 --> 00:27:47,400
These are some actual
sequencing images

569
00:27:47,400 --> 00:27:49,560
you would image in
the four channels.

570
00:27:49,560 --> 00:27:51,240
They're actually
black and white.

571
00:27:51,240 --> 00:27:53,080
These are pseudocode.

572
00:27:53,080 --> 00:27:56,330
And then you can merge
those and you can see then

573
00:27:56,330 --> 00:27:58,560
all the clusters
on the flow cell.

574
00:27:58,560 --> 00:28:02,070
So this is from a GA2 with
the recommended cluster

575
00:28:02,070 --> 00:28:04,960
density back in the day,
like a few years ago.

576
00:28:04,960 --> 00:28:06,877
And nowadays, the image
now since the software

577
00:28:06,877 --> 00:28:08,709
has gotten a lot better,
so you can actually

578
00:28:08,709 --> 00:28:10,840
load the clusters more
densely and therefore

579
00:28:10,840 --> 00:28:14,787
get more sequence
out of the same area.

580
00:28:14,787 --> 00:28:16,370
But imagine just
millions and millions

581
00:28:16,370 --> 00:28:18,820
of these little
clusters like this.

582
00:28:18,820 --> 00:28:21,570
Notice the clusters are
not all the same size.

583
00:28:24,390 --> 00:28:25,890
Basically, you're
doing PCR in situ,

584
00:28:25,890 --> 00:28:30,970
and so some molecules are easier
to amplify by PCR than others.

585
00:28:30,970 --> 00:28:33,605
And that probably accounts
for these variations in size.

586
00:28:37,430 --> 00:28:39,800
So what is the
current throughput?

587
00:28:39,800 --> 00:28:43,860
These data are accurate as
of about, maybe, last year.

588
00:28:43,860 --> 00:28:47,540
So the HiSeq 2000 instrument
is the most high performance,

589
00:28:47,540 --> 00:28:49,010
widely used instrument.

590
00:28:49,010 --> 00:28:51,840
Now there's a 2500, but I
think it's roughly similar.

591
00:28:51,840 --> 00:28:52,820
You have one flow cell.

592
00:28:52,820 --> 00:28:55,220
So a flow cell looks sort
of like a glass slide,

593
00:28:55,220 --> 00:28:57,770
except that it has
these tunnels carved

594
00:28:57,770 --> 00:29:01,420
in it like eight little
tubes inside the glass slide.

595
00:29:01,420 --> 00:29:05,600
And on the surfaces
of those tubes

596
00:29:05,600 --> 00:29:10,544
is where the adapters
are covalently attached.

597
00:29:10,544 --> 00:29:11,960
And so you have
eight lanes and so

598
00:29:11,960 --> 00:29:15,120
you can sequence eight different
things in those eight lanes.

599
00:29:15,120 --> 00:29:20,300
You could do yeast genome
in one and fly RNA-seq

600
00:29:20,300 --> 00:29:23,850
in another, and so forth.

601
00:29:23,850 --> 00:29:27,850
And these days, a single
lane will produce something

602
00:29:27,850 --> 00:29:30,440
like 200 million reads.

603
00:29:30,440 --> 00:29:34,250
And this is typically routine
to get 200 million reads

604
00:29:34,250 --> 00:29:34,790
from a lane.

605
00:29:34,790 --> 00:29:36,770
Sometimes you can get more.

606
00:29:36,770 --> 00:29:38,200
You can do up to 100 bases.

607
00:29:38,200 --> 00:29:40,420
You can do 150 these
days on a MiSeq,

608
00:29:40,420 --> 00:29:41,780
which is a miniature version.

609
00:29:41,780 --> 00:29:46,000
You can do maybe 300 or more.

610
00:29:46,000 --> 00:29:48,290
And so that's a whole
lot of sequence.

611
00:29:48,290 --> 00:29:53,510
So that's 160 billion bases of
sequence from a single lane.

612
00:29:53,510 --> 00:29:56,890
And that will cost you-- that
single lane-- maybe $2,000

613
00:29:56,890 --> 00:29:59,870
to $3,000, depending
where you're doing it.

614
00:29:59,870 --> 00:30:03,000
And the cost doesn't
include the capital cost,

615
00:30:03,000 --> 00:30:05,920
that's just the reagent
cost for running that.

616
00:30:05,920 --> 00:30:11,240
So 160 billion-- the
human genome is 3 billion,

617
00:30:11,240 --> 00:30:14,850
so you've now sequenced the
human genome over many times

618
00:30:14,850 --> 00:30:16,440
there.

619
00:30:16,440 --> 00:30:17,520
You can do more.

620
00:30:17,520 --> 00:30:19,070
So you can do
paired-end sequencing,

621
00:30:19,070 --> 00:30:21,380
where you sequence both
ends of your template.

622
00:30:21,380 --> 00:30:24,224
And that'll basically double
the amount of sequence you get.

623
00:30:24,224 --> 00:30:25,640
And you can also,
on this machine,

624
00:30:25,640 --> 00:30:26,840
do two flow cells at once.

625
00:30:26,840 --> 00:30:28,620
So you can actually
double it beyond that.

626
00:30:28,620 --> 00:30:33,796
And so for many applications,
160 billion bases is overkill.

627
00:30:33,796 --> 00:30:34,795
It's more than you need.

628
00:30:34,795 --> 00:30:36,836
Imagine you're doing
bacterial genome sequencing.

629
00:30:36,836 --> 00:30:39,150
Bacterial genome might
be five megabases or so.

630
00:30:39,150 --> 00:30:40,570
This is complete overkill.

631
00:30:40,570 --> 00:30:45,090
So you can do bar coding
where you add little six

632
00:30:45,090 --> 00:30:50,720
base tags to
different libraries,

633
00:30:50,720 --> 00:30:53,660
and then mix them together,
introduce them to the machine,

634
00:30:53,660 --> 00:30:56,370
sequence the tags
first or second,

635
00:30:56,370 --> 00:30:57,730
and then sequence the templates.

636
00:30:57,730 --> 00:31:00,980
And then you effectively
sort them out later.

637
00:31:00,980 --> 00:31:05,280
And then do many
samples in one lane.

638
00:31:05,280 --> 00:31:07,380
And that's what people
most commonly do.

639
00:31:11,590 --> 00:31:13,894
So, questions about
next-gen sequencing?

640
00:31:13,894 --> 00:31:15,060
There's a lot more to learn.

641
00:31:15,060 --> 00:31:16,393
I'm happy to talk about it more.

642
00:31:16,393 --> 00:31:18,170
It's very relevant
to this class.

643
00:31:18,170 --> 00:31:20,680
But I'm sure it'll come up
later in David's sections,

644
00:31:20,680 --> 00:31:23,610
so I don't want to take
too much time on it.

645
00:31:29,960 --> 00:31:34,620
So, now once you generate
reads from an Illumina

646
00:31:34,620 --> 00:31:36,570
instrument or some
other instrument,

647
00:31:36,570 --> 00:31:38,350
you'll want to align
them to the genome

648
00:31:38,350 --> 00:31:39,970
to determine, for
example, if you're

649
00:31:39,970 --> 00:31:44,590
doing RNA-seq mapping
reads that come from mRNA,

650
00:31:44,590 --> 00:31:47,360
you'll want to know what
genes they came from.

651
00:31:47,360 --> 00:31:50,900
So you need to map those
reads back to the genome.

652
00:31:50,900 --> 00:31:55,690
What are some other reasons you
might want to align sequences?

653
00:31:55,690 --> 00:31:57,950
Just in general, why
is aligning sequences--

654
00:31:57,950 --> 00:32:03,220
meaning, matching them up
and finding individual bases

655
00:32:03,220 --> 00:32:06,390
or amino acid residues that
match-- why is that useful?

656
00:32:06,390 --> 00:32:07,150
Diego?

657
00:32:07,150 --> 00:32:09,150
AUDIENCE: You can
assemble them if you want.

658
00:32:09,150 --> 00:32:09,820
PROFESSOR: You
can assemble them?

659
00:32:09,820 --> 00:32:10,040
Yes.

660
00:32:10,040 --> 00:32:11,706
So if you're doing
genome sequencing,

661
00:32:11,706 --> 00:32:13,330
if you align them to
each other and you

662
00:32:13,330 --> 00:32:16,570
find a whole stack that
sort of align this way,

663
00:32:16,570 --> 00:32:19,700
you can then assemble
and infer the existence

664
00:32:19,700 --> 00:32:20,780
of a longer sequence.

665
00:32:20,780 --> 00:32:21,880
That's a good point.

666
00:32:21,880 --> 00:32:22,904
Yes, your name?

667
00:32:22,904 --> 00:32:23,810
AUDIENCE: Julianne.

668
00:32:23,810 --> 00:32:25,409
Looking at homologs.

669
00:32:25,409 --> 00:32:26,700
PROFESSOR: Looking at homologs.

670
00:32:26,700 --> 00:32:27,200
Right.

671
00:32:27,200 --> 00:32:31,320
So if you, for example, are
doing disease gene mapping,

672
00:32:31,320 --> 00:32:35,150
you've identified a human
gene of unknown function

673
00:32:35,150 --> 00:32:38,230
that's associated
with a disease.

674
00:32:38,230 --> 00:32:41,580
Then you might want to search
it against, say, the mouse

675
00:32:41,580 --> 00:32:43,900
database and find
a homolog in mouse

676
00:32:43,900 --> 00:32:48,422
and then that might be what you
would want to study further.

677
00:32:48,422 --> 00:32:50,255
You might want to then
knock it out in mouse

678
00:32:50,255 --> 00:32:52,350
or mutate it or something.

679
00:32:52,350 --> 00:32:54,570
So those are some good reasons.

680
00:32:54,570 --> 00:32:56,310
There's others.

681
00:32:56,310 --> 00:32:59,460
So we're going to first talk
about local alignment, which

682
00:32:59,460 --> 00:33:01,110
is a type of alignment
where you want

683
00:33:01,110 --> 00:33:05,670
to find shorter stretches
of high similarity.

684
00:33:05,670 --> 00:33:08,980
You don't require alignment
of the entire sequence.

685
00:33:12,390 --> 00:33:15,767
So there are certain
situations where

686
00:33:15,767 --> 00:33:16,850
you might want to do that.

687
00:33:16,850 --> 00:33:19,250
So here's an example.

688
00:33:19,250 --> 00:33:22,020
You are studying a recently
discovered human non-coding

689
00:33:22,020 --> 00:33:22,520
RNA.

690
00:33:22,520 --> 00:33:25,330
As you can see, it's 45 bases.

691
00:33:25,330 --> 00:33:29,250
You want to see if
there's a mouse homolog.

692
00:33:29,250 --> 00:33:32,390
You run it through NCBI
BLAST, which as we said

693
00:33:32,390 --> 00:33:35,050
is sort of the Google search
engine of mathematics--

694
00:33:35,050 --> 00:33:38,890
and you're going get a chance
to do it on pump set one,

695
00:33:38,890 --> 00:33:42,120
and you get a hit
that looks like this.

696
00:33:42,120 --> 00:33:45,005
So notice, this is
sort of BLAST notation.

697
00:33:45,005 --> 00:33:46,500
It says Q at the top.

698
00:33:46,500 --> 00:33:48,800
Q is for "query," that's
the sequence you put in.

699
00:33:48,800 --> 00:33:50,896
S is "subject,"
that's the database

700
00:33:50,896 --> 00:33:52,020
you were searching against.

701
00:33:52,020 --> 00:33:54,480
You have coordinates,
so 1 to 45.

702
00:33:54,480 --> 00:33:58,130
And then, in the subject, it
happened to be base 403 to 447

703
00:33:58,130 --> 00:34:00,440
in some mouse
chromosome or something.

704
00:34:00,440 --> 00:34:04,197
And you can see that
it's got some matching.

705
00:34:04,197 --> 00:34:05,530
But it also has some mismatches.

706
00:34:05,530 --> 00:34:09,699
So in all, there are 40
matches and five mismatches

707
00:34:09,699 --> 00:34:11,130
in the alignment.

708
00:34:11,130 --> 00:34:12,854
So is that significant?

709
00:34:15,370 --> 00:34:18,750
Remember, the mouse genome
is 2.7 billion bases long.

710
00:34:18,750 --> 00:34:19,550
It's big.

711
00:34:19,550 --> 00:34:22,919
So would you get a match
this good by chance?

712
00:34:25,570 --> 00:34:28,390
So the question is really,
should you trust this?

713
00:34:28,390 --> 00:34:30,980
Is this something you
can confidently say,

714
00:34:30,980 --> 00:34:34,220
yes mouse is a
homolog, and that's it?

715
00:34:34,220 --> 00:34:36,170
Or should you just
be like, well,

716
00:34:36,170 --> 00:34:38,059
that's not better
than I get by chance

717
00:34:38,059 --> 00:34:39,770
so I have no
evidence of anything?

718
00:34:39,770 --> 00:34:42,370
Or is it sort of
somewhere in between?

719
00:34:42,370 --> 00:34:46,350
And how would you tell?

720
00:34:46,350 --> 00:34:47,701
Yeah, what's your name?

721
00:34:47,701 --> 00:34:49,174
AUDIENCE: Chris.

722
00:34:49,174 --> 00:34:51,629
You would want to
figure out a scoring

723
00:34:51,629 --> 00:34:53,593
function for the alignment.

724
00:34:53,593 --> 00:34:55,557
And then, with that
scoring function,

725
00:34:55,557 --> 00:34:59,607
you would find whether or not
you have a significant match.

726
00:34:59,607 --> 00:35:00,190
PROFESSOR: OK.

727
00:35:00,190 --> 00:35:02,470
So Chris says you want to
define a scoring system

728
00:35:02,470 --> 00:35:05,430
and then use the
scoring system to define

729
00:35:05,430 --> 00:35:07,400
statistical significance.

730
00:35:07,400 --> 00:35:08,900
Do want to suggest
a scoring system?

731
00:35:12,370 --> 00:35:15,570
What's the simplest
one you can think of?

732
00:35:15,570 --> 00:35:20,460
AUDIENCE: Just if there's a
match, you add a certain score.

733
00:35:20,460 --> 00:35:25,230
If it's a mismatch, you
subtract a certain score.

734
00:35:25,230 --> 00:35:27,440
PROFESSOR: So let's do
that scoring system.

735
00:35:31,170 --> 00:35:34,520
So the notation that's
often used is Sii.

736
00:35:34,520 --> 00:35:36,870
So that would be a match
between nucleotide i

737
00:35:36,870 --> 00:35:38,660
and then another
copy of nucleotide i.

738
00:35:38,660 --> 00:35:41,890
We'll call that 1,
plus 1 for a match.

739
00:35:41,890 --> 00:35:46,195
And sij, where i
and j are different,

740
00:35:46,195 --> 00:35:47,570
we'll give that
a negative score.

741
00:35:47,570 --> 00:35:48,470
Minus 1.

742
00:35:48,470 --> 00:35:51,890
So this is i not equal to j.

743
00:35:51,890 --> 00:35:54,630
So that's a scoring matrix.

744
00:35:54,630 --> 00:35:57,630
It's a four by four matrix
with 1 on the diagonal

745
00:35:57,630 --> 00:36:00,540
and minus 1 everywhere else.

746
00:36:00,540 --> 00:36:02,500
And this is commonly
used for DNA.

747
00:36:02,500 --> 00:36:04,240
And then there's a
few other variations

748
00:36:04,240 --> 00:36:07,600
on this that are also used.

749
00:36:07,600 --> 00:36:09,012
So good, a scoring system.

750
00:36:09,012 --> 00:36:10,970
So then, how are we going
to do the statistics?

751
00:36:10,970 --> 00:36:12,169
Any ideas?

752
00:36:12,169 --> 00:36:13,585
How do we know
what's significant?

753
00:36:20,435 --> 00:36:22,348
AUDIENCE: The higher
score would probably

754
00:36:22,348 --> 00:36:28,156
be a little more significant
than a lower score.

755
00:36:28,156 --> 00:36:29,620
But the scale, I'm not sure--

756
00:36:29,620 --> 00:36:31,450
PROFESSOR: The scale
is not so obvious.

757
00:36:31,450 --> 00:36:32,498
Yes, question?

758
00:36:32,498 --> 00:36:33,664
AUDIENCE: My name is Andrea.

759
00:36:33,664 --> 00:36:37,600
So if you shuffled the RNA,
like permute the sequence,

760
00:36:37,600 --> 00:36:39,814
then we'll get the
[INAUDIBLE] genome

761
00:36:39,814 --> 00:36:42,028
you get with that
shuffled sequence.

762
00:36:42,028 --> 00:36:43,996
And the score is about
the same as you'd

763
00:36:43,996 --> 00:36:46,948
get with the non-shuffled
sequence [INAUDIBLE]

764
00:36:46,948 --> 00:36:48,920
about very significant scores.

765
00:36:48,920 --> 00:36:51,274
PROFESSOR: Yeah, so
that's a good idea.

766
00:36:51,274 --> 00:36:52,940
BLAST, as it turns
out-- is pretty fast.

767
00:36:52,940 --> 00:36:54,870
So you could shuffle
your RNA molecule,

768
00:36:54,870 --> 00:36:57,330
randomly permute the
nucleotides many times,

769
00:36:57,330 --> 00:36:59,190
maybe even like
1,000 times, search

770
00:36:59,190 --> 00:37:00,782
each one against
the mouse genome,

771
00:37:00,782 --> 00:37:02,490
and get a distribution
of what's the best

772
00:37:02,490 --> 00:37:05,505
score-- the top score-- that
you get against a genome,

773
00:37:05,505 --> 00:37:06,880
look at that
distribution and say

774
00:37:06,880 --> 00:37:08,420
whether the score
of the actual one

775
00:37:08,420 --> 00:37:10,560
is significantly higher
than that distribution

776
00:37:10,560 --> 00:37:12,476
or just falls in the
middle of that somewhere.

777
00:37:12,476 --> 00:37:15,680
And that's reasonable.

778
00:37:15,680 --> 00:37:18,380
You can certainly do that, and
it's not a bad thing to do.

779
00:37:18,380 --> 00:37:20,590
But it turns out there is
an analytical theory here

780
00:37:20,590 --> 00:37:21,720
that you can use.

781
00:37:21,720 --> 00:37:25,200
And so that you can determine
significance more quickly

782
00:37:25,200 --> 00:37:28,910
without doing so
much computation.

783
00:37:28,910 --> 00:37:30,890
And that's what
we'll talk about.

784
00:37:30,890 --> 00:37:33,489
But another issue, before
we get to the statistics,

785
00:37:33,489 --> 00:37:35,280
is how do you actually
find that alignment?

786
00:37:35,280 --> 00:37:39,420
How do you find the top scoring
match in a mouse genome?

787
00:37:39,420 --> 00:37:43,830
So let's suppose
this guy is your RNA.

788
00:37:43,830 --> 00:37:45,970
OK, of course, we're
using T's, but that's

789
00:37:45,970 --> 00:37:48,990
just because you usually
sequences it at the DNA level.

790
00:37:48,990 --> 00:37:50,300
But imagine this is your RNA.

791
00:37:50,300 --> 00:37:51,160
It's very short.

792
00:37:51,160 --> 00:37:54,390
This is like 10 or so, I think.

793
00:37:54,390 --> 00:37:56,365
And this is your database.

794
00:37:56,365 --> 00:38:01,490
But it goes on a
few billion more.

795
00:38:01,490 --> 00:38:02,980
Several more blackboards.

796
00:38:02,980 --> 00:38:08,410
And I want to come up
with an algorithm that

797
00:38:08,410 --> 00:38:12,730
will find the highest scoring
segment of this query sequence

798
00:38:12,730 --> 00:38:17,520
against this database.

799
00:38:17,520 --> 00:38:18,730
Any ideas?

800
00:38:18,730 --> 00:38:20,530
So this would be like
our first algorithm.

801
00:38:20,530 --> 00:38:25,640
And it's not terribly
hard, so that's

802
00:38:25,640 --> 00:38:28,100
why it's a good
one to start with.

803
00:38:28,100 --> 00:38:30,200
Not totally obvious either.

804
00:38:30,200 --> 00:38:33,200
Who can think of an algorithm
or something, some operation

805
00:38:33,200 --> 00:38:36,010
that we can do on
this sequence compared

806
00:38:36,010 --> 00:38:37,950
to this sequence--
in some way-- that

807
00:38:37,950 --> 00:38:41,270
will help us find the
highest scoring match?

808
00:38:44,410 --> 00:38:45,740
I'm sorry.

809
00:38:45,740 --> 00:38:46,427
Yeah?

810
00:38:46,427 --> 00:38:49,410
AUDIENCE: You have to consider
insertion and deletion.

811
00:38:49,410 --> 00:38:50,390
PROFESSOR: Yeah, OK.

812
00:38:50,390 --> 00:38:52,210
So we're going to
keep it simple.

813
00:38:52,210 --> 00:38:53,210
That's true, in general.

814
00:38:53,210 --> 00:38:54,585
But we're going
to keep it simple

815
00:38:54,585 --> 00:38:57,390
and just say no
insertions and deletions.

816
00:38:57,390 --> 00:39:00,900
So we're going to look for
an ungapped local alignment.

817
00:39:00,900 --> 00:39:03,480
So that's the
algorithm that I want.

818
00:39:03,480 --> 00:39:04,940
First, no gaps.

819
00:39:04,940 --> 00:39:07,950
And then we'll do
gaps on Tuesday.

820
00:39:07,950 --> 00:39:08,920
Tim?

821
00:39:08,920 --> 00:39:11,860
AUDIENCE: You could just
compare your [INAUDIBLE]

822
00:39:11,860 --> 00:39:16,270
to [INAUDIBLE] all
across the database

823
00:39:16,270 --> 00:39:21,170
and turn off all the
[INAUDIBLE] on that [INAUDIBLE],

824
00:39:21,170 --> 00:39:23,520
and then figure out [INAUDIBLE].

825
00:39:23,520 --> 00:39:25,760
PROFESSOR: Yeah, OK.

826
00:39:25,760 --> 00:39:28,490
Pretty much.

827
00:39:28,490 --> 00:39:29,920
I mean, that's
pretty much right.

828
00:39:29,920 --> 00:39:32,800
Although it's not quite as
much of a description as

829
00:39:32,800 --> 00:39:35,049
you would need if you want
to actually code that.

830
00:39:35,049 --> 00:39:36,590
Like, how would you
actually do that?

831
00:39:36,590 --> 00:39:38,810
So, I want a
description that is sort

832
00:39:38,810 --> 00:39:40,730
of more at the
level of pseudocode.

833
00:39:40,730 --> 00:39:43,670
Like, here's how you would
actually organize your code.

834
00:39:43,670 --> 00:39:48,560
So, let's say you
entertain the hypothesis

835
00:39:48,560 --> 00:39:53,690
that the alignment can be
in different registers.

836
00:39:53,690 --> 00:39:58,090
The alignment can correspond to
base one of the query and base

837
00:39:58,090 --> 00:39:59,820
one of the subject.

838
00:39:59,820 --> 00:40:01,500
Or it could be shifted.

839
00:40:01,500 --> 00:40:04,200
It could be an alignment
where base 1 of the query

840
00:40:04,200 --> 00:40:05,576
matches these two, and so forth.

841
00:40:05,576 --> 00:40:07,200
So there's sort of
different registers.

842
00:40:07,200 --> 00:40:09,280
So let's just consider
one register first.

843
00:40:09,280 --> 00:40:11,610
The one where base 1 matches.

844
00:40:11,610 --> 00:40:14,140
So let's just look
at the matches

845
00:40:14,140 --> 00:40:16,470
between corresponding bases.

846
00:40:16,470 --> 00:40:22,150
I'm just going to make these
little angle bracket guys here.

847
00:40:22,150 --> 00:40:25,565
Hopefully I won't
make any mistakes.

848
00:40:28,680 --> 00:40:30,100
I'm going to take this.

849
00:40:30,100 --> 00:40:32,460
This is sort of implementing
Tim's idea here.

850
00:40:32,460 --> 00:40:34,850
And then I'm going to
look for each of these--

851
00:40:34,850 --> 00:40:36,330
so consider it going down here.

852
00:40:36,330 --> 00:40:39,450
Now we're sort of looking
at an alignment here.

853
00:40:39,450 --> 00:40:42,000
Is this a match or a mismatch?

854
00:40:42,000 --> 00:40:43,410
That's a mismatch.

855
00:40:43,410 --> 00:40:44,720
That's a match.

856
00:40:44,720 --> 00:40:46,750
That's a mismatch.

857
00:40:46,750 --> 00:40:47,790
That's a mismatch.

858
00:40:47,790 --> 00:40:48,860
That's a match.

859
00:40:48,860 --> 00:40:53,020
Match Match.

860
00:40:53,020 --> 00:40:53,570
Mismatch.

861
00:40:53,570 --> 00:40:55,180
Mismatch.

862
00:40:55,180 --> 00:40:57,760
Mismatch.

863
00:40:57,760 --> 00:41:01,670
So where is the
top scoring match

864
00:41:01,670 --> 00:41:04,750
between the query
and the subject?

865
00:41:04,750 --> 00:41:05,250
Tim?

866
00:41:11,770 --> 00:41:12,270
Anyone?

867
00:41:14,835 --> 00:41:16,280
AUDIENCE: 6, 7, 8.

868
00:41:16,280 --> 00:41:17,170
PROFESSOR: 6, 7, 8.

869
00:41:17,170 --> 00:41:18,111
Good.

870
00:41:18,111 --> 00:41:18,610
Oh--

871
00:41:18,610 --> 00:41:19,860
AUDIENCE: 5, 6, 7.

872
00:41:19,860 --> 00:41:21,130
PROFESSOR: 5, 6, 7.

873
00:41:21,130 --> 00:41:21,630
Right.

874
00:41:21,630 --> 00:41:22,870
Right here.

875
00:41:22,870 --> 00:41:26,444
You can see there's
three in a row.

876
00:41:26,444 --> 00:41:27,360
Well, what about this?

877
00:41:27,360 --> 00:41:28,900
Why can't we add
this to the match?

878
00:41:32,690 --> 00:41:36,510
What's the reason why
it's not 2, 3, 4, 5, 6, 7?

879
00:41:40,358 --> 00:41:42,290
AUDIENCE: Because the
score for that is lower.

880
00:41:42,290 --> 00:41:43,240
PROFESSOR: Because the
score for that is lower.

881
00:41:43,240 --> 00:41:43,740
Right.

882
00:41:43,740 --> 00:41:45,059
We defined top scoring segment.

883
00:41:45,059 --> 00:41:46,600
You sum up the scores
across the map.

884
00:41:46,600 --> 00:41:48,810
So you can have
mismatches in there,

885
00:41:48,810 --> 00:41:53,800
but this will have a score of 3.

886
00:41:53,800 --> 00:41:56,550
And if you wanted to
add these three bases,

887
00:41:56,550 --> 00:41:59,220
you would be adding
negative 2 and plus 1,

888
00:41:59,220 --> 00:42:01,000
so it would reduce your score.

889
00:42:01,000 --> 00:42:02,340
So that would be worse.

890
00:42:02,340 --> 00:42:06,390
Any ideas on how to do this in
an automatic, algorithmic way?

891
00:42:10,191 --> 00:42:10,690
Yeah?

892
00:42:10,690 --> 00:42:11,680
What's your name?

893
00:42:11,680 --> 00:42:12,670
AUDIENCE: Simon.

894
00:42:12,670 --> 00:42:18,620
So if you keep shifting the
entire database, [INAUDIBLE].

895
00:42:18,620 --> 00:42:21,290
PROFESSOR: OK so you
keep shifting it over,

896
00:42:21,290 --> 00:42:24,050
and you generate
one of these lines.

897
00:42:24,050 --> 00:42:27,890
But imagine my query was
like 1,000 or something.

898
00:42:27,890 --> 00:42:29,500
And my database
is like a billion.

899
00:42:33,229 --> 00:42:34,270
How do I look along here?

900
00:42:34,270 --> 00:42:37,000
And here it was obvious what
the top scoring match is.

901
00:42:37,000 --> 00:42:43,760
But if I had two
matches here, then we

902
00:42:43,760 --> 00:42:46,620
would've actually had
a longer match here.

903
00:42:46,620 --> 00:42:54,610
So in general, how do I
find that that top match?

904
00:42:54,610 --> 00:42:58,780
For each of those
registers, if you will,

905
00:42:58,780 --> 00:43:03,240
you'll have a thousand
long diagonal here

906
00:43:03,240 --> 00:43:06,520
with 1's and minus 1's on it.

907
00:43:06,520 --> 00:43:10,495
How do I process those scores
to find the top scoring segment?

908
00:43:16,409 --> 00:43:17,700
What's an algorithm to do that?

909
00:43:26,916 --> 00:43:28,290
It's kind of
intuitively obvious,

910
00:43:28,290 --> 00:43:31,190
but I want to do something
with, you define a variable

911
00:43:31,190 --> 00:43:34,597
and you update it, and you
add to it, and subtract.

912
00:43:34,597 --> 00:43:35,430
Something like that.

913
00:43:35,430 --> 00:43:37,700
But, like a computer
could actually handle.

914
00:43:41,001 --> 00:43:41,500
Yeah?

915
00:43:41,500 --> 00:43:42,291
What was your name?

916
00:43:42,291 --> 00:43:42,890
Julianne?

917
00:43:42,890 --> 00:43:47,246
AUDIENCE: Could you keep track
of what the highest total

918
00:43:47,246 --> 00:43:50,634
score is, and then you keep
going down the diagonal,

919
00:43:50,634 --> 00:43:53,197
And then you update it?

920
00:43:53,197 --> 00:43:53,780
PROFESSOR: OK.

921
00:43:53,780 --> 00:43:58,262
You keep track of what the
highest total score was?

922
00:43:58,262 --> 00:43:59,188
AUDIENCE: Yeah.

923
00:43:59,188 --> 00:44:01,040
The highest test score.

924
00:44:01,040 --> 00:44:03,320
PROFESSOR: The
highest segment score?

925
00:44:03,320 --> 00:44:05,440
OK.

926
00:44:05,440 --> 00:44:07,950
I'm going to put this up here.

927
00:44:07,950 --> 00:44:14,240
And we'll define max s.

928
00:44:14,240 --> 00:44:17,480
That's the highest segment
score we've achieved to date.

929
00:44:17,480 --> 00:44:21,404
And we'll initialize
that to zero, let's say.

930
00:44:21,404 --> 00:44:22,820
Because if you had
all mismatches,

931
00:44:22,820 --> 00:44:25,030
zero would be the
correct answer.

932
00:44:25,030 --> 00:44:28,320
If your query was A's
and your subject was T's.

933
00:44:28,320 --> 00:44:32,443
And then what do you do?

934
00:44:32,443 --> 00:44:37,126
AUDIENCE: As you go down the
diagonal, you keep track of--

935
00:44:37,126 --> 00:44:39,621
PROFESSOR: Keep track of what?

936
00:44:39,621 --> 00:44:45,609
AUDIENCE: So you
look at 1 in 1 first.

937
00:44:45,609 --> 00:44:48,603
And then you go 1 in 2, and
you find a score of zero.

938
00:44:48,603 --> 00:44:50,100
But that's higher
than negative 1.

939
00:44:54,110 --> 00:44:56,960
PROFESSOR: But the score of the
maximum segment at that point,

940
00:44:56,960 --> 00:44:59,870
after base 2, is not zero.

941
00:44:59,870 --> 00:45:01,970
It's actually 1.

942
00:45:01,970 --> 00:45:05,732
Because you could have a
segment of one base alignment.

943
00:45:05,732 --> 00:45:06,940
The cumulative score is zero.

944
00:45:09,105 --> 00:45:11,230
I think you're onto something
here that may be also

945
00:45:11,230 --> 00:45:14,640
something useful
to keep track of.

946
00:45:14,640 --> 00:45:18,280
Let's do the cumulative score
and then you tell me more.

947
00:45:18,280 --> 00:45:20,380
We'll define cumulative
score variable.

948
00:45:20,380 --> 00:45:21,930
We'll initialize that to zero.

949
00:45:21,930 --> 00:45:26,131
And then we'll have some for
loops that, as some of you

950
00:45:26,131 --> 00:45:29,840
have said, you want to
loop through the subject.

951
00:45:29,840 --> 00:45:33,120
All the possible
registers of the subject.

952
00:45:33,120 --> 00:45:35,780
So that would be
maybe j equals 1

953
00:45:35,780 --> 00:45:39,550
to subject length
minus query length.

954
00:45:39,550 --> 00:45:40,420
Something like that.

955
00:45:40,420 --> 00:45:42,290
Don't worry too much about this.

956
00:45:42,290 --> 00:45:44,200
Again, this is not
real code, obviously.

957
00:45:44,200 --> 00:45:44,960
It's pseudocode.

958
00:45:44,960 --> 00:45:47,870
So then this will be,
say, 1 to query language.

959
00:45:47,870 --> 00:45:50,255
And so this will be
going along our diagonal.

960
00:45:50,255 --> 00:45:52,130
And we're going to plot
the cumulative score.

961
00:45:54,690 --> 00:45:59,250
So here you would you have an
update where cumulative score

962
00:45:59,250 --> 00:46:07,760
plus equals the score
of query position

963
00:46:07,760 --> 00:46:13,714
i matched against
subject position j.

964
00:46:13,714 --> 00:46:14,380
And update that.

965
00:46:14,380 --> 00:46:15,740
So that's just cumulative score.

966
00:46:15,740 --> 00:46:16,823
So what will it look like?

967
00:46:16,823 --> 00:46:19,350
So in this case, I'll
just use this down here.

968
00:46:26,360 --> 00:46:35,390
So you have zero, 1,
2, minus 1, minus 2.

969
00:46:35,390 --> 00:46:39,460
So you'll start at position
zero in the sequence.

970
00:46:39,460 --> 00:46:43,400
At position 1 you're
down here at minus 1

971
00:46:43,400 --> 00:46:44,540
because it was a mismatch.

972
00:46:44,540 --> 00:46:49,550
Then at position 2, as you
said, we're back up to zero.

973
00:46:49,550 --> 00:46:51,130
And then what happens?

974
00:46:51,130 --> 00:46:54,500
Go down to minus
1, down to minus 2.

975
00:46:54,500 --> 00:47:02,719
Then we go up three times in a
row until we're up here to 1.

976
00:47:02,719 --> 00:47:04,010
And then we go down after that.

977
00:47:07,380 --> 00:47:11,780
So where is your highest scoring
match in this cumulative score

978
00:47:11,780 --> 00:47:12,280
plot?

979
00:47:16,030 --> 00:47:17,610
People said it was from 5 to 7.

980
00:47:17,610 --> 00:47:18,733
Yeah, question?

981
00:47:18,733 --> 00:47:20,816
AUDIENCE: So would it be
from like a local minimum

982
00:47:20,816 --> 00:47:22,290
to a local maximum?

983
00:47:22,290 --> 00:47:23,330
PROFESSOR: Yeah.

984
00:47:23,330 --> 00:47:24,690
Exactly.

985
00:47:24,690 --> 00:47:26,458
So, what do you want
to keep track of?

986
00:47:26,458 --> 00:47:29,916
AUDIENCE: You want to keep track
of the minimum and the maximum.

987
00:47:29,916 --> 00:47:34,870
And look for the range which
you maximize to different--

988
00:47:34,870 --> 00:47:37,111
PROFESSOR: Yeah, so
this is now sort of more

989
00:47:37,111 --> 00:47:38,610
what I was looking
for in terms of--

990
00:47:38,610 --> 00:47:43,100
so this was the local minimum,
and that's the local maximum.

991
00:47:43,100 --> 00:47:45,240
This is the score.

992
00:47:45,240 --> 00:47:48,100
That's your mass s there.

993
00:47:48,100 --> 00:47:50,910
And you also want to keep
track of where that happened

994
00:47:50,910 --> 00:47:52,970
in both the query
and the subject.

995
00:47:52,970 --> 00:47:53,910
Does that make sense?

996
00:47:53,910 --> 00:47:56,340
So you would keep track of
this running cumulative score

997
00:47:56,340 --> 00:47:57,200
variable.

998
00:47:57,200 --> 00:48:01,920
You keep track of
the last minimum.

999
00:48:01,920 --> 00:48:05,960
The minimum that
you've achieved so far.

1000
00:48:05,960 --> 00:48:17,650
And so that would then
be down here to minus 2.

1001
00:48:17,650 --> 00:48:20,280
And then when your cumulative
score got up to plus 1,

1002
00:48:20,280 --> 00:48:22,760
you always take that
cumulative score,

1003
00:48:22,760 --> 00:48:26,160
minus the last minimum
cumulative score.

1004
00:48:26,160 --> 00:48:28,230
That gives you a
potential candidate

1005
00:48:28,230 --> 00:48:29,370
for a high scoring segment.

1006
00:48:29,370 --> 00:48:33,060
And if that is bigger than
your current max high scoring

1007
00:48:33,060 --> 00:48:36,107
segment, then you update it
and you would update this.

1008
00:48:36,107 --> 00:48:37,690
And then you would
also have variables

1009
00:48:37,690 --> 00:48:41,210
that would store where you are.

1010
00:48:41,210 --> 00:48:44,419
And also, where did
that last minimum occur.

1011
00:48:44,419 --> 00:48:45,710
So I'm not spelling it all out.

1012
00:48:45,710 --> 00:48:47,543
I'm not going to give
you all the variables.

1013
00:48:47,543 --> 00:48:51,745
But this is an algorithm that
would find the maximum score.

1014
00:48:51,745 --> 00:48:52,370
Yeah, question?

1015
00:48:52,370 --> 00:48:53,745
AUDIENCE: So you're
keeping track

1016
00:48:53,745 --> 00:48:56,500
of the global maximum,
local minimum,

1017
00:48:56,500 --> 00:49:00,530
so that you can accept the most
recent local minimum following

1018
00:49:00,530 --> 00:49:02,575
the global maximum?

1019
00:49:02,575 --> 00:49:04,200
PROFESSOR: I'm not
sure I got all that.

1020
00:49:04,200 --> 00:49:07,800
But you're keeping track
of the cumulative score.

1021
00:49:07,800 --> 00:49:12,680
The minimum that that
cumulative score ever got to.

1022
00:49:12,680 --> 00:49:17,556
And the maximum
difference, the maximum

1023
00:49:17,556 --> 00:49:21,940
that you ever in the
past have gone up.

1024
00:49:21,940 --> 00:49:27,060
Where you've had a
net increment upwards.

1025
00:49:27,060 --> 00:49:27,970
Like here.

1026
00:49:27,970 --> 00:49:30,400
So this variable
here, this max s,

1027
00:49:30,400 --> 00:49:32,000
it would be initialized to zero.

1028
00:49:32,000 --> 00:49:36,780
When you got to here, your last
minimum score would be minus 1.

1029
00:49:36,780 --> 00:49:38,500
Your cumulative
score would be zero.

1030
00:49:38,500 --> 00:49:40,916
You would take the difference
of those, and you'd be like,

1031
00:49:40,916 --> 00:49:43,204
oh I've got a high scoring
segment of score one.

1032
00:49:43,204 --> 00:49:44,370
So I'm going to update that.

1033
00:49:44,370 --> 00:49:48,690
So now, that variable
is now 1 at this point.

1034
00:49:48,690 --> 00:49:51,300
Then you're going down, so
you're not getting anything.

1035
00:49:51,300 --> 00:49:53,520
You're just lowering this
minimum cumulative score

1036
00:49:53,520 --> 00:49:55,070
down to minus 2 here.

1037
00:49:55,070 --> 00:49:58,057
And then when you
get to here, now you

1038
00:49:58,057 --> 00:50:00,140
check the cumulative score
minus the last minimum.

1039
00:50:00,140 --> 00:50:00,880
It's 1.

1040
00:50:00,880 --> 00:50:02,300
That's a tie.

1041
00:50:02,300 --> 00:50:04,550
We won't keep track of ties.

1042
00:50:04,550 --> 00:50:07,820
Now at here, that
difference is 2.

1043
00:50:07,820 --> 00:50:10,180
So now we've got a new record.

1044
00:50:10,180 --> 00:50:14,090
So now we update this maximum
score to 2 in the locations.

1045
00:50:14,090 --> 00:50:16,830
And then we get here, now
it's 3, and we update that.

1046
00:50:16,830 --> 00:50:17,825
Does that make sense?

1047
00:50:17,825 --> 00:50:20,116
AUDIENCE: Imagine the first
dip-- instead of going down

1048
00:50:20,116 --> 00:50:22,760
to negative 1, it went
down to negative 3.

1049
00:50:22,760 --> 00:50:23,731
PROFESSOR: Negative 3?

1050
00:50:23,731 --> 00:50:25,284
AUDIENCE: That first dip.

1051
00:50:25,284 --> 00:50:26,200
PROFESSOR: Right here?

1052
00:50:26,200 --> 00:50:28,160
So we started back a little bit.

1053
00:50:28,160 --> 00:50:30,265
So back here, like this?

1054
00:50:30,265 --> 00:50:31,250
AUDIENCE: No.

1055
00:50:31,250 --> 00:50:32,500
PROFESSOR: Down to negative 3?

1056
00:50:32,500 --> 00:50:33,550
AUDIENCE: No.

1057
00:50:33,550 --> 00:50:35,341
PROFESSOR: But how do
we get to negative 3?

1058
00:50:35,341 --> 00:50:36,970
Because our scoring is this way.

1059
00:50:36,970 --> 00:50:38,580
You want this dip to minus 3?

1060
00:50:38,580 --> 00:50:39,584
AUDIENCE: No

1061
00:50:39,584 --> 00:50:40,750
PROFESSOR: This one minus 3?

1062
00:50:40,750 --> 00:50:42,367
Imagine we are at minus 3 here?

1063
00:50:42,367 --> 00:50:42,992
AUDIENCE: Yeah.

1064
00:50:42,992 --> 00:50:45,347
Imagine it dipped to minus 3.

1065
00:50:45,347 --> 00:50:49,770
And then the next one dipped to
higher than that, to minus 2.

1066
00:50:49,770 --> 00:50:51,714
And then it went up to 1.

1067
00:50:51,714 --> 00:50:54,630
And so, would the difference
you look at be negative 2 to 1,

1068
00:50:54,630 --> 00:50:56,100
or negative 3 to 1?

1069
00:50:56,100 --> 00:50:57,710
PROFESSOR: Like that, right?

1070
00:50:57,710 --> 00:51:02,250
So, minus 3, let's
say, minus 2, 1.

1071
00:51:02,250 --> 00:51:04,010
Something like that.

1072
00:51:04,010 --> 00:51:05,856
What do people think?

1073
00:51:05,856 --> 00:51:07,175
Anyone want to--?

1074
00:51:07,175 --> 00:51:08,480
AUDIENCE: Minus 3 to 1.

1075
00:51:08,480 --> 00:51:10,100
PROFESSOR: Minus 3 to 1.

1076
00:51:10,100 --> 00:51:13,051
It's the minimum
that you ever got to.

1077
00:51:13,051 --> 00:51:16,120
This might be a stronger match,
but this is a higher scoring

1078
00:51:16,120 --> 00:51:16,620
match.

1079
00:51:16,620 --> 00:51:18,620
And we said we want
higher scoring.

1080
00:51:18,620 --> 00:51:21,716
So you would count that.

1081
00:51:21,716 --> 00:51:23,966
AUDIENCE: So you keep track
of both the global minimum

1082
00:51:23,966 --> 00:51:25,490
and the global
maximum, and you take

1083
00:51:25,490 --> 00:51:27,322
the difference between them.

1084
00:51:27,322 --> 00:51:29,280
PROFESSOR: You keep track
of the global minimum

1085
00:51:29,280 --> 00:51:31,134
and the current
cumulative score,

1086
00:51:31,134 --> 00:51:32,300
and you take the difference.

1087
00:51:35,186 --> 00:51:38,080
AUDIENCE: The global maximum--

1088
00:51:38,080 --> 00:51:40,210
PROFESSOR: It's not
necessarily global maximum

1089
00:51:40,210 --> 00:51:43,390
because we could be
well below zero here.

1090
00:51:43,390 --> 00:51:44,389
We could do like this.

1091
00:51:49,550 --> 00:51:50,880
From here to here.

1092
00:51:50,880 --> 00:51:53,440
So this is not the
global maximum.

1093
00:51:53,440 --> 00:51:57,270
This just happens
to be, we went up

1094
00:51:57,270 --> 00:51:58,510
a lot since our last minimum.

1095
00:51:58,510 --> 00:52:00,950
So that's your high
scoring segment.

1096
00:52:00,950 --> 00:52:02,582
Does that make sense?

1097
00:52:04,961 --> 00:52:06,460
I haven't completely
spelled it out.

1098
00:52:06,460 --> 00:52:12,300
But I think you guys have given
enough ideas here that there;s

1099
00:52:12,300 --> 00:52:14,100
sort of the core
of an algorithm.

1100
00:52:14,100 --> 00:52:21,230
And I encourage you to think
this through afterwards

1101
00:52:21,230 --> 00:52:24,090
and let me know if
there are questions.

1102
00:52:24,090 --> 00:52:27,700
And we could add an
optional homework

1103
00:52:27,700 --> 00:52:29,700
where I ask you to do
this, that we've sometimes

1104
00:52:29,700 --> 00:52:30,740
had in the past.

1105
00:52:30,740 --> 00:52:33,090
It is a useful thing to look at.

1106
00:52:33,090 --> 00:52:35,530
This is not exactly how
the BLAST algorithm works.

1107
00:52:35,530 --> 00:52:38,330
It uses some tricks
for faster speed.

1108
00:52:38,330 --> 00:52:40,900
But this is sort of
morally equivalent to BLAST

1109
00:52:40,900 --> 00:52:46,340
in the sense that it has the
same order of magnitude running

1110
00:52:46,340 --> 00:52:48,780
time.

1111
00:52:48,780 --> 00:52:54,960
So this algorithm-- what is the
running time in Big-O notation?

1112
00:52:54,960 --> 00:52:58,800
So just for those who
are non-CS people, when

1113
00:52:58,800 --> 00:53:02,840
you use this Big-O notation,
then you're asking,

1114
00:53:02,840 --> 00:53:05,010
how does the running
time increase

1115
00:53:05,010 --> 00:53:07,560
in the size of the input?

1116
00:53:07,560 --> 00:53:08,870
And so what is the input?

1117
00:53:08,870 --> 00:53:10,440
So we have two inputs.

1118
00:53:10,440 --> 00:53:14,070
We have a query of length.

1119
00:53:14,070 --> 00:53:17,670
And let's say
subject of length n.

1120
00:53:17,670 --> 00:53:21,171
So clearly, if those are bigger,
it'll take longer to run.

1121
00:53:21,171 --> 00:53:22,920
But when you compare
different algorithms,

1122
00:53:22,920 --> 00:53:26,971
you want to know how the run
time depends on those lengths.

1123
00:53:26,971 --> 00:53:27,470
Yes.

1124
00:53:27,470 --> 00:53:28,356
What's your name?

1125
00:53:28,356 --> 00:53:29,308
AUDIENCE: Sally.

1126
00:53:29,308 --> 00:53:31,509
m times n.

1127
00:53:31,509 --> 00:53:33,050
PROFESSOR: So with
this, this is what

1128
00:53:33,050 --> 00:53:35,940
you would call an
order mn algorithm.

1129
00:53:35,940 --> 00:53:36,900
And why is that?

1130
00:53:36,900 --> 00:53:38,220
How can you see that?

1131
00:53:38,220 --> 00:53:40,806
AUDIENCE: You have two for
loops And for each length,

1132
00:53:40,806 --> 00:53:42,639
essentially, you're
going through everything

1133
00:53:42,639 --> 00:53:45,094
in the query.

1134
00:53:45,094 --> 00:53:48,040
And then, for everything that
you go through in the query,

1135
00:53:48,040 --> 00:53:49,022
you would [INAUDIBLE].

1136
00:53:52,970 --> 00:53:55,270
PROFESSOR: Right.

1137
00:53:55,270 --> 00:53:59,800
In this second for loop here,
you're going through the query.

1138
00:53:59,800 --> 00:54:02,420
And you're doing that
nested inside of a for loop

1139
00:54:02,420 --> 00:54:05,112
that's basically the
length of the subject.

1140
00:54:05,112 --> 00:54:06,570
And eventually
you're going to have

1141
00:54:06,570 --> 00:54:07,986
to compare every
base in the query

1142
00:54:07,986 --> 00:54:09,296
to every base in the subject.

1143
00:54:09,296 --> 00:54:10,420
There's no way around that.

1144
00:54:10,420 --> 00:54:13,270
And that takes
some unit of time.

1145
00:54:13,270 --> 00:54:15,900
And so the actual time will
be proportional to that.

1146
00:54:15,900 --> 00:54:18,220
So the bigger n gets
and m gets, it's

1147
00:54:18,220 --> 00:54:20,150
just proportional
to the product.

1148
00:54:20,150 --> 00:54:21,460
Does that make sense?

1149
00:54:21,460 --> 00:54:24,010
Or another way to
think about it is,

1150
00:54:24,010 --> 00:54:26,594
you're clearly going to have to
do something on this diagonal.

1151
00:54:26,594 --> 00:54:28,468
And then you're going
to have to do something

1152
00:54:28,468 --> 00:54:30,530
on this diagonal, and
this one, and this one.

1153
00:54:30,530 --> 00:54:34,280
And actually, you have to
also check these ones here.

1154
00:54:34,280 --> 00:54:39,380
And in the end, the total
number of computations

1155
00:54:39,380 --> 00:54:41,900
there is going to
be this times that.

1156
00:54:41,900 --> 00:54:43,680
You're basically doing
a rectangle's worth

1157
00:54:43,680 --> 00:54:45,606
of computations.

1158
00:54:45,606 --> 00:54:47,430
Does that makes sense?

1159
00:54:47,430 --> 00:54:49,120
So that's not bad, right?

1160
00:54:49,120 --> 00:54:50,940
It could be worse.

1161
00:54:50,940 --> 00:54:54,790
It could be, like, mn squared
or something like that.

1162
00:54:57,550 --> 00:55:01,030
So that's basically
why BLAST is fast.

1163
00:55:01,030 --> 00:55:03,190
So what do these things
look like, in general?

1164
00:55:03,190 --> 00:55:07,180
And what is the condition on
our score for this algorithm

1165
00:55:07,180 --> 00:55:07,680
to work?

1166
00:55:12,460 --> 00:55:16,260
What if I gave a score
of plus 1 for a match,

1167
00:55:16,260 --> 00:55:17,820
and zero for a mismatch?

1168
00:55:17,820 --> 00:55:20,340
Could we do this?

1169
00:55:20,340 --> 00:55:21,990
Joe, you're shaking your head.

1170
00:55:21,990 --> 00:55:23,490
AUDIENCE: It would
just be going up.

1171
00:55:23,490 --> 00:55:24,156
PROFESSOR: Yeah.

1172
00:55:24,156 --> 00:55:26,311
The problem is, it might
be flat for a while,

1173
00:55:26,311 --> 00:55:27,560
but eventually it would go up.

1174
00:55:27,560 --> 00:55:29,143
And it would just
go up and up and up.

1175
00:55:29,143 --> 00:55:30,650
And so your highest
scoring segment

1176
00:55:30,650 --> 00:55:32,430
would, most of the
time, be something

1177
00:55:32,430 --> 00:55:34,040
that started very
near the beginning

1178
00:55:34,040 --> 00:55:36,720
and ended very near the end.

1179
00:55:36,720 --> 00:55:37,790
So that doesn't work.

1180
00:55:37,790 --> 00:55:40,015
So you have to have
a net negative drift.

1181
00:55:40,015 --> 00:55:42,140
And the way that's formalized
is the expected score

1182
00:55:42,140 --> 00:55:43,980
has to be negative.

1183
00:55:43,980 --> 00:55:46,970
So why is the expected score
negative in this scoring system

1184
00:55:46,970 --> 00:55:50,379
that has plus 1 for a match,
and minus 1 for a mismatch?

1185
00:55:50,379 --> 00:55:51,170
Why does that work?

1186
00:55:54,507 --> 00:55:56,840
AUDIENCE: It should be wrong
three quarters of the time.

1187
00:55:56,840 --> 00:55:57,590
PROFESSOR: Yeah.

1188
00:55:57,590 --> 00:55:59,673
You'll have a mismatch
three quarters of the time.

1189
00:55:59,673 --> 00:56:04,040
So on average, you
tend to drift down.

1190
00:56:04,040 --> 00:56:06,440
And then you have these
little excursions upwards,

1191
00:56:06,440 --> 00:56:09,410
and those are your
high scoring segments.

1192
00:56:09,410 --> 00:56:13,216
Any questions about that?

1193
00:56:13,216 --> 00:56:14,178
AUDIENCE: Question.

1194
00:56:14,178 --> 00:56:17,795
Is there something
better than m times n?

1195
00:56:17,795 --> 00:56:19,920
PROFESSOR: We've got some
computer scientists here.

1196
00:56:19,920 --> 00:56:22,194
David?

1197
00:56:22,194 --> 00:56:23,110
Better than m times n?

1198
00:56:23,110 --> 00:56:24,544
I don't think so,
because you have

1199
00:56:24,544 --> 00:56:25,710
to do all those comparisons.

1200
00:56:25,710 --> 00:56:29,530
And so there's no way around
that, so I don't think so.

1201
00:56:29,530 --> 00:56:31,070
All right.

1202
00:56:31,070 --> 00:56:33,180
But the constant--
you can do better

1203
00:56:33,180 --> 00:56:35,076
on the constant than
this algorithm thing.

1204
00:56:35,076 --> 00:56:36,955
AUDIENCE: With
multiple queries--

1205
00:56:36,955 --> 00:56:38,580
PROFESSOR: With
multiple queries, yeah.

1206
00:56:38,580 --> 00:56:41,720
Then you can maybe
do some hashing

1207
00:56:41,720 --> 00:56:45,720
or find some-- to speed it up.

1208
00:56:45,720 --> 00:56:48,560
OK, so what about the
statistics of this?

1209
00:56:48,560 --> 00:56:54,790
So it turns out that Karlin and
Altschul developed some theory

1210
00:56:54,790 --> 00:56:57,410
for just exactly this problem.

1211
00:56:57,410 --> 00:57:00,210
For searching a query sequence.

1212
00:57:00,210 --> 00:57:04,240
It can be nucleotide or protein
as long as you have integer

1213
00:57:04,240 --> 00:57:09,700
scores and the average-- or the
expected-- score is negative,

1214
00:57:09,700 --> 00:57:13,520
then this theory
tells you how often

1215
00:57:13,520 --> 00:57:17,890
the highest score of all--
across the entire query

1216
00:57:17,890 --> 00:57:20,430
database comparison--
exceeds a cut

1217
00:57:20,430 --> 00:57:25,300
off x using a local alignment
algorithm such as BLAST.

1218
00:57:25,300 --> 00:57:30,380
And it turns out
that these scores

1219
00:57:30,380 --> 00:57:34,240
follow what's called an extreme
value or Gumbel distribution.

1220
00:57:34,240 --> 00:57:39,340
And it has this kind of
double exponential form here.

1221
00:57:39,340 --> 00:57:40,620
So x is some cut off.

1222
00:57:40,620 --> 00:57:43,760
So usually x would be the
score that you actually

1223
00:57:43,760 --> 00:57:46,530
observed when you searched your
query against the database.

1224
00:57:46,530 --> 00:57:47,980
That's the one you care about.

1225
00:57:47,980 --> 00:57:49,870
And then you want
to know, what's

1226
00:57:49,870 --> 00:57:53,980
the probability we would've
seen something higher than that?

1227
00:57:53,980 --> 00:58:00,150
Or you might do x is one less
than the score you observed.

1228
00:58:00,150 --> 00:58:04,190
So what's the chance we
observed something the same,

1229
00:58:04,190 --> 00:58:06,775
as good as this, or better?

1230
00:58:06,775 --> 00:58:07,650
Does that make sense?

1231
00:58:07,650 --> 00:58:11,100
And so this is going to
be your P value then.

1232
00:58:11,100 --> 00:58:13,740
So the probability
of S. The score

1233
00:58:13,740 --> 00:58:16,699
of the highest segment
under a model where

1234
00:58:16,699 --> 00:58:18,740
you have a random query
against a random database

1235
00:58:18,740 --> 00:58:22,500
of the same length is 1
minus e to the minus KMN

1236
00:58:22,500 --> 00:58:24,790
e to the minus lambda x.

1237
00:58:24,790 --> 00:58:30,030
Where M and N are the lengths
of the query and the database.

1238
00:58:30,030 --> 00:58:32,360
x is the score.

1239
00:58:32,360 --> 00:58:35,860
And then K and lambda are
two positive parameters

1240
00:58:35,860 --> 00:58:39,760
that depend actually on the
details of your score matrix

1241
00:58:39,760 --> 00:58:42,990
and the composition
of your sequences.

1242
00:58:42,990 --> 00:58:47,190
And it turns out that lambda
is really the one that matters.

1243
00:58:47,190 --> 00:58:48,940
And you can see
that because lambda

1244
00:58:48,940 --> 00:58:52,279
is up there in that
exponent multiplying x.

1245
00:58:52,279 --> 00:58:53,820
So if you double
lambda, that'll have

1246
00:58:53,820 --> 00:58:56,660
a big effect on the answer.

1247
00:58:56,660 --> 00:58:58,550
And K, it turns
out, you can mostly

1248
00:58:58,550 --> 00:59:01,650
ignore it for most purposes.

1249
00:59:01,650 --> 00:59:04,840
So as a formula, what
does this thing look like?

1250
00:59:04,840 --> 00:59:07,244
It looks like that.

1251
00:59:07,244 --> 00:59:08,160
Kind of a funny shape.

1252
00:59:08,160 --> 00:59:10,210
It sort of looks like
an umlauf a little bit,

1253
00:59:10,210 --> 00:59:15,540
but then has a different shape
on the right than the left.

1254
00:59:15,540 --> 00:59:18,130
And how do you
calculate this lambda?

1255
00:59:18,130 --> 00:59:23,530
So I said that lambda is
sort of the key to all this

1256
00:59:23,530 --> 00:59:31,630
because of its uniquely
important place

1257
00:59:31,630 --> 00:59:36,300
in that formula,
multiplying the score.

1258
00:59:36,300 --> 00:59:39,230
So it turns out that lambda is
the unique positive solution

1259
00:59:39,230 --> 00:59:41,250
to this equation here.

1260
00:59:41,250 --> 00:59:44,990
So now it actually depends
on the scoring matrix.

1261
00:59:44,990 --> 00:59:46,710
So you see there's sij there.

1262
00:59:46,710 --> 00:59:49,190
It depends on the
composition of your query.

1263
00:59:49,190 --> 00:59:50,540
That's the pi's.

1264
00:59:50,540 --> 00:59:55,820
The composition of your
subject, that's the rj's.

1265
00:59:55,820 --> 00:59:59,020
You sum over the i
and j equal to each

1266
00:59:59,020 --> 01:00:01,340
of the four nucleotides.

1267
01:00:01,340 --> 01:00:03,110
And that sum has to be 1.

1268
01:00:03,110 --> 01:00:08,830
So there's a unique positive
solution to this equation.

1269
01:00:08,830 --> 01:00:11,210
So how would we solve
an equation like this?

1270
01:00:11,210 --> 01:00:12,980
First of all, what
kind of equation

1271
01:00:12,980 --> 01:00:17,460
is this, given that we're
going to set the sij,

1272
01:00:17,460 --> 01:00:20,710
and we're going to just
measure the pi and the rj?

1273
01:00:20,710 --> 01:00:23,470
So those are all
known constants,

1274
01:00:23,470 --> 01:00:25,890
and lambda is what we're
trying to solve for here.

1275
01:00:25,890 --> 01:00:28,791
So what kind of an
equation is this in lambda?

1276
01:00:35,850 --> 01:00:36,350
Linear?

1277
01:00:36,350 --> 01:00:38,550
Quadratic?

1278
01:00:38,550 --> 01:00:40,890
Hyperbolic?

1279
01:00:40,890 --> 01:00:43,750
Anybody know what this is?

1280
01:00:43,750 --> 01:00:47,660
So this is called a
transcendental equation

1281
01:00:47,660 --> 01:00:51,430
because you have
different powers.

1282
01:00:55,200 --> 01:00:59,245
That sounds kind of unpleasant.

1283
01:00:59,245 --> 01:01:01,440
You don't take a class in
transcendental equations

1284
01:01:01,440 --> 01:01:01,940
probably.

1285
01:01:01,940 --> 01:01:08,270
So in general, they're not
possible to solve analytically

1286
01:01:08,270 --> 01:01:10,560
when they get complicated.

1287
01:01:10,560 --> 01:01:13,380
But in simple cases, you
can solve them analytically.

1288
01:01:13,380 --> 01:01:17,420
And in fact, let's just do one.

1289
01:01:17,420 --> 01:01:19,450
So let's take the
simplest case, which

1290
01:01:19,450 --> 01:01:25,230
would be that all the
pi's are a quarter.

1291
01:01:25,230 --> 01:01:29,020
All the ri's are a quarter.

1292
01:01:29,020 --> 01:01:31,990
And we'll use the scoring system
that we came up with before,

1293
01:01:31,990 --> 01:01:37,710
where sii is 1,
and sij is minus 1.

1294
01:01:37,710 --> 01:01:41,640
If i does not equal j.

1295
01:01:41,640 --> 01:01:50,250
And so when we plug those in to
that sum there, what do we get?

1296
01:01:50,250 --> 01:01:56,190
We'll get four terms that are
one quarter, times one quarter,

1297
01:01:56,190 --> 01:01:59,710
times e to the lambda.

1298
01:01:59,710 --> 01:02:03,817
There's four possible
types of matches, right?

1299
01:02:03,817 --> 01:02:05,900
They have probability one
quarter times a quarter.

1300
01:02:05,900 --> 01:02:07,710
That's pi and rj.

1301
01:02:07,710 --> 01:02:12,450
And the e to the lambda
sii is just e to the lambda

1302
01:02:12,450 --> 01:02:14,130
because sii is 1.

1303
01:02:14,130 --> 01:02:19,680
And then there's 12 terms that
are one quarter, one quarter,

1304
01:02:19,680 --> 01:02:21,270
e to the minus lambda.

1305
01:02:21,270 --> 01:02:23,150
Because there's
the minus 1 score.

1306
01:02:23,150 --> 01:02:26,610
And that has to equal 1.

1307
01:02:26,610 --> 01:02:32,210
So cancel this, we'll
multiply through by 4, maybe.

1308
01:02:32,210 --> 01:02:39,720
So now we get e to
the lambda plus 3.

1309
01:02:39,720 --> 01:02:43,050
e to the minus lambda equals 1.

1310
01:02:43,050 --> 01:02:45,050
It's still a
transcendental equation,

1311
01:02:45,050 --> 01:02:47,710
but it's looking
a little simpler.

1312
01:02:47,710 --> 01:02:49,630
Any ideas how to
solve this for lambda?

1313
01:02:52,140 --> 01:02:53,228
Sally?

1314
01:02:53,228 --> 01:02:54,875
AUDIENCE: Wouldn't the 1 be 4?

1315
01:02:54,875 --> 01:02:55,750
PROFESSOR: I'm sorry.

1316
01:02:55,750 --> 01:02:56,160
4.

1317
01:02:56,160 --> 01:02:56,670
Thank you.

1318
01:03:06,730 --> 01:03:07,778
Yeah, what's your name?

1319
01:03:07,778 --> 01:03:10,772
AUDIENCE: [INAUDIBLE] I think
[INAUDIBLE] quadratic equation.

1320
01:03:10,772 --> 01:03:14,764
If you multiply both sides by
[INAUDIBLE] then [INAUDIBLE].

1321
01:03:16,672 --> 01:03:18,130
PROFESSOR: OK, so
the claim is this

1322
01:03:18,130 --> 01:03:19,460
is basically a
quadratic equation.

1323
01:03:19,460 --> 01:03:21,376
So you multiply both
sides by e to the lambda.

1324
01:03:21,376 --> 01:03:24,697
So then you get e to
the 2 lambda plus 3.

1325
01:03:24,697 --> 01:03:26,530
And then it's going to
move this over and do

1326
01:03:26,530 --> 01:03:31,580
minus 4 e to the
lambda equals zero.

1327
01:03:31,580 --> 01:03:33,340
Is that good?

1328
01:03:33,340 --> 01:03:35,607
So how is it quadratic?

1329
01:03:35,607 --> 01:03:37,190
What do you actually
do to solve this?

1330
01:03:37,190 --> 01:03:38,356
AUDIENCE: Well, [INAUDIBLE].

1331
01:03:42,280 --> 01:03:45,000
PROFESSOR: Change the variable,
x equals e to the lambda.

1332
01:03:45,000 --> 01:03:46,330
Then it's quadratic in x.

1333
01:03:46,330 --> 01:03:46,990
Solve for x.

1334
01:03:46,990 --> 01:03:48,440
We all know how to solve
quadratic equations.

1335
01:03:48,440 --> 01:03:49,760
And then substitute
that for lambda.

1336
01:03:49,760 --> 01:03:50,800
OK, everyone got that?

1337
01:03:55,845 --> 01:03:58,120
If you use 16 different
scores to represent

1338
01:03:58,120 --> 01:04:00,250
all the different types
of matches and mismatches,

1339
01:04:00,250 --> 01:04:03,060
this will be very unpleasant.

1340
01:04:03,060 --> 01:04:04,940
It's not unsolvable,
it's just that you

1341
01:04:04,940 --> 01:04:08,520
have to use computational
numerical methods to solve it.

1342
01:04:08,520 --> 01:04:10,184
But in simple cases
where you just

1343
01:04:10,184 --> 01:04:11,850
have a couple different
types of scores,

1344
01:04:11,850 --> 01:04:15,530
it will often be a
quadratic equation.

1345
01:04:15,530 --> 01:04:16,080
All right.

1346
01:04:16,080 --> 01:04:20,140
So let's suppose that we have
a particular scoring system--

1347
01:04:20,140 --> 01:04:24,610
particular pi's, rj's-- and
we have a value of lambda that

1348
01:04:24,610 --> 01:04:25,280
satisfies those.

1349
01:04:25,280 --> 01:04:28,370
So we've solved this
quadratic equation for lambda.

1350
01:04:28,370 --> 01:04:30,240
I think we get lambda
equals natural log

1351
01:04:30,240 --> 01:04:32,690
3, something like that.

1352
01:04:32,690 --> 01:04:34,660
Remember, it's a unique
positive solution.

1353
01:04:34,660 --> 01:04:35,370
Quadratic equations
are two solutions,

1354
01:04:35,370 --> 01:04:37,286
but there's going to be
just one positive one.

1355
01:04:37,286 --> 01:04:41,230
And then we have that value.

1356
01:04:41,230 --> 01:04:43,960
It satisfies this equation.

1357
01:04:43,960 --> 01:04:47,200
So then, what if we
double the scores?

1358
01:04:47,200 --> 01:04:50,340
Instead of plus 1 minus
1, we use plus 2 minus 2?

1359
01:04:53,020 --> 01:04:55,850
What would then happen?

1360
01:04:55,850 --> 01:04:58,440
You can see that the
original version of lambda

1361
01:04:58,440 --> 01:05:01,860
wouldn't necessarily still
satisfy this equation.

1362
01:05:01,860 --> 01:05:04,080
But if you think
about it a little bit,

1363
01:05:04,080 --> 01:05:08,370
you can figure out what
new value of lambda

1364
01:05:08,370 --> 01:05:09,692
would satisfy this equation.

1365
01:05:14,860 --> 01:05:19,365
We've solved for the lambda
that solves with these scores.

1366
01:05:23,502 --> 01:05:24,960
Now we're going to
have new scores.

1367
01:05:24,960 --> 01:05:28,530
sii prime equals 2.

1368
01:05:28,530 --> 01:05:34,080
sij prime equals minus 2.

1369
01:05:34,080 --> 01:05:35,666
What is lambda prime?

1370
01:05:35,666 --> 01:05:37,290
The lambda that goes
with these scores?

1371
01:05:39,850 --> 01:05:40,860
Yeah, go ahead.

1372
01:05:40,860 --> 01:05:42,217
AUDIENCE: Half of the original?

1373
01:05:42,217 --> 01:05:43,550
PROFESSOR: Half of the original?

1374
01:05:43,550 --> 01:05:44,050
Right.

1375
01:05:44,050 --> 01:05:46,974
So you're saying that lambda
prime equals lambda over 2.

1376
01:05:46,974 --> 01:05:47,640
And why is that?

1377
01:05:47,640 --> 01:05:50,965
Can you explain?

1378
01:05:50,965 --> 01:05:52,506
AUDIENCE: Because
of the [INAUDIBLE].

1379
01:05:59,530 --> 01:06:03,980
PROFESSOR: Yeah, if you think
about these terms in the sum,

1380
01:06:03,980 --> 01:06:05,940
the s part is all doubling.

1381
01:06:05,940 --> 01:06:09,380
So if you cut the lambda apart,
and the product will equal what

1382
01:06:09,380 --> 01:06:10,650
it did before.

1383
01:06:10,650 --> 01:06:13,370
And we haven't changed the pi's
and rj's, so all those terms

1384
01:06:13,370 --> 01:06:14,590
will be the same.

1385
01:06:14,590 --> 01:06:17,400
So therefore, it will still
satisfy that equation.

1386
01:06:17,400 --> 01:06:19,900
So that's another way
of thinking about it.

1387
01:06:19,900 --> 01:06:21,210
Yes, you're correct.

1388
01:06:21,210 --> 01:06:23,770
So if you double
the scores, lambda

1389
01:06:23,770 --> 01:06:27,090
will be reduced
by a factor of 2.

1390
01:06:27,090 --> 01:06:28,880
So what does that
tell us about lambda?

1391
01:06:28,880 --> 01:06:32,994
What is it?

1392
01:06:32,994 --> 01:06:34,443
What is its meaning?

1393
01:06:40,250 --> 01:06:41,416
Yeah, go ahead, Jeff.

1394
01:06:41,416 --> 01:06:42,874
AUDIENCE: Scale of
the distribution

1395
01:06:42,874 --> 01:06:45,808
to the expectant score?

1396
01:06:45,808 --> 01:06:47,104
Or the range score?

1397
01:06:47,104 --> 01:06:47,770
PROFESSOR: Yeah.

1398
01:06:47,770 --> 01:06:50,240
It basically scales the scores.

1399
01:06:50,240 --> 01:06:55,040
So we can have the
same equation here used

1400
01:06:55,040 --> 01:06:56,200
with arbitrary scoring.

1401
01:06:56,200 --> 01:06:56,960
It just scales it.

1402
01:06:56,960 --> 01:06:59,410
You can see the way it appears
as a multiplicative factor

1403
01:06:59,410 --> 01:07:00,326
in front of the score.

1404
01:07:00,326 --> 01:07:03,050
So if you double all
the scores, will that

1405
01:07:03,050 --> 01:07:04,880
change what the highest
scoring segment is?

1406
01:07:07,560 --> 01:07:09,620
No, it won't change
it because you'll

1407
01:07:09,620 --> 01:07:11,890
have this cumulative thing.

1408
01:07:11,890 --> 01:07:14,450
It just changes how
you label the y-axis.

1409
01:07:14,450 --> 01:07:19,505
It'll make it bigger, but it
won't change what that is.

1410
01:07:19,505 --> 01:07:21,210
And if you look
at this equation,

1411
01:07:21,210 --> 01:07:23,650
it won't change the
statistical significance.

1412
01:07:23,650 --> 01:07:27,620
The x will double in value,
because all the matches are now

1413
01:07:27,620 --> 01:07:30,360
worth twice as much as
what they were before.

1414
01:07:30,360 --> 01:07:33,110
But lambda will be half
as big, and so the product

1415
01:07:33,110 --> 01:07:35,360
will be the same and therefore,
the final probability

1416
01:07:35,360 --> 01:07:36,150
will be the same.

1417
01:07:36,150 --> 01:07:38,850
So it's just a scaling factor
for using different scoring

1418
01:07:38,850 --> 01:07:39,350
systems.

1419
01:07:44,281 --> 01:07:45,030
Everyone got that?

1420
01:07:47,820 --> 01:07:48,320
All right.

1421
01:07:48,320 --> 01:07:51,820
So what scoring matrix
should we use for DNA?

1422
01:07:51,820 --> 01:07:53,450
How about this one?

1423
01:07:53,450 --> 01:07:56,010
So this is now a
slight generalization.

1424
01:07:56,010 --> 01:08:00,530
So we're going to keep
1 for the matches.

1425
01:08:00,530 --> 01:08:03,450
You don't lose any generality
by choosing 1 here for matches,

1426
01:08:03,450 --> 01:08:07,500
because if you use 2,
then lambda is just

1427
01:08:07,500 --> 01:08:09,990
going to be reduced
to compensate.

1428
01:08:09,990 --> 01:08:11,360
So 1 for matches.

1429
01:08:11,360 --> 01:08:15,590
And then we're going to
use m for mismatches.

1430
01:08:15,590 --> 01:08:21,569
And m must be negative in
order to satisfy this condition

1431
01:08:21,569 --> 01:08:23,910
for this theory to work,
that the average score

1432
01:08:23,910 --> 01:08:24,790
has to be negative.

1433
01:08:24,790 --> 01:08:27,540
Clearly, you have to have
some negative scores.

1434
01:08:27,540 --> 01:08:30,990
And the question then
is, should we use minus 1

1435
01:08:30,990 --> 01:08:31,970
like we used before?

1436
01:08:31,970 --> 01:08:37,580
Or should we use like minus 2
or minus 5, or something else?

1437
01:08:37,580 --> 01:08:38,857
Any thoughts on this?

1438
01:08:43,230 --> 01:08:43,979
Or does it matter?

1439
01:08:43,979 --> 01:08:44,890
Maybe it doesn't matter.

1440
01:08:44,890 --> 01:08:45,848
Yeah, what's your name?

1441
01:08:45,848 --> 01:08:46,823
AUDIENCE: [INAUDIBLE].

1442
01:08:46,823 --> 01:08:49,769
Would it make sense to
not use [INAUDIBLE],

1443
01:08:49,769 --> 01:08:52,229
because [INAUDIBLE].

1444
01:08:52,229 --> 01:08:53,340
PROFESSOR: Yeah, OK.

1445
01:08:53,340 --> 01:08:56,580
So you want to use a more
complicated scoring system.

1446
01:08:56,580 --> 01:08:59,069
What particular
mismatches would you

1447
01:08:59,069 --> 01:09:03,080
want to penalize more and less?

1448
01:09:03,080 --> 01:09:09,424
AUDIENCE: [INAUDIBLE]
I think [INAUDIBLE]

1449
01:09:09,424 --> 01:09:10,888
needs to be [INAUDIBLE].

1450
01:09:21,160 --> 01:09:23,886
PROFESSOR: Yeah, you are
correct in your intuition.

1451
01:09:23,886 --> 01:09:25,260
Maybe one of the
biologists wants

1452
01:09:25,260 --> 01:09:27,240
to offer a suggestion here.

1453
01:09:27,240 --> 01:09:28,218
Yeah, go ahead.

1454
01:09:28,218 --> 01:09:30,652
AUDIENCE: So it's a mismatch
between purine and pyrimidine

1455
01:09:30,652 --> 01:09:31,152
[INAUDIBLE].

1456
01:09:35,349 --> 01:09:37,640
PROFESSOR: OK so now we've
got purines and pyrimidines.

1457
01:09:37,640 --> 01:09:39,840
So everyone
remember, the purines

1458
01:09:39,840 --> 01:09:45,930
are A and G. The
pyrimidines are C and T.

1459
01:09:45,930 --> 01:09:50,370
And the idea is that
this should be penalized,

1460
01:09:50,370 --> 01:09:54,010
or this should be penalized
less than changing

1461
01:09:54,010 --> 01:09:55,760
a purine to a pyrimidine.

1462
01:09:55,760 --> 01:09:58,142
And why does that makes sense?

1463
01:09:58,142 --> 01:09:59,950
AUDIENCE: Well,
structurally they're--

1464
01:09:59,950 --> 01:10:02,260
PROFESSOR: Structurally, purines
are more similar to each other

1465
01:10:02,260 --> 01:10:03,468
than they are to pyrimidines.

1466
01:10:03,468 --> 01:10:04,820
And?

1467
01:10:04,820 --> 01:10:07,997
More importantly, I think.

1468
01:10:07,997 --> 01:10:08,538
In evolution?

1469
01:10:12,452 --> 01:10:13,368
AUDIENCE: [INAUDIBLE].

1470
01:10:16,059 --> 01:10:17,684
PROFESSOR: I'm sorry,
can you speak up?

1471
01:10:17,684 --> 01:10:20,044
AUDIENCE: C to C mutations
happen spontaneously

1472
01:10:20,044 --> 01:10:21,215
in [INAUDIBLE] chemistry.

1473
01:10:21,215 --> 01:10:21,840
PROFESSOR: Yes.

1474
01:10:21,840 --> 01:10:24,810
So C to C mutations
happen spontaneously.

1475
01:10:24,810 --> 01:10:27,440
So basically, it's
easier because they

1476
01:10:27,440 --> 01:10:29,620
look more similar structurally.

1477
01:10:29,620 --> 01:10:31,980
The DNA polymerase is more
likely to make a mistake

1478
01:10:31,980 --> 01:10:33,970
and substitute another purine.

1479
01:10:38,110 --> 01:10:43,940
The rate of purine,
purine or pyrimidine,

1480
01:10:43,940 --> 01:10:46,470
pyrimidine to
transversions which

1481
01:10:46,470 --> 01:10:48,970
switch the type is
about three to one,

1482
01:10:48,970 --> 01:10:50,480
or two to one in
different systems.

1483
01:10:50,480 --> 01:10:51,860
So yeah, that's a good idea.

1484
01:10:51,860 --> 01:10:54,820
But for simplicity, just
to keep the math simple,

1485
01:10:54,820 --> 01:10:57,139
we're just going to go
with one mismatch penalty.

1486
01:10:57,139 --> 01:10:58,180
But that is a good point.

1487
01:10:58,180 --> 01:11:00,470
In practice, you
might want to do that.

1488
01:11:00,470 --> 01:11:02,820
So now, I'm saying
I'm going to limit you

1489
01:11:02,820 --> 01:11:04,340
to one mismatch penalty.

1490
01:11:04,340 --> 01:11:07,710
But I'm going to let you
choose any value you want.

1491
01:11:07,710 --> 01:11:10,050
So what value should you choose?

1492
01:11:10,050 --> 01:11:11,959
Or does it matter?

1493
01:11:15,791 --> 01:11:18,414
Or maybe different applications?

1494
01:11:18,414 --> 01:11:18,944
Tim, yeah?

1495
01:11:18,944 --> 01:11:20,402
AUDIENCE: I've just
got a question.

1496
01:11:20,402 --> 01:11:24,378
Does it depend on pi and ri?

1497
01:11:24,378 --> 01:11:27,857
For example, we could
use all these numbers.

1498
01:11:27,857 --> 01:11:30,342
But if the overall
wants to be negative,

1499
01:11:30,342 --> 01:11:33,324
then you couldn't
use negative .1.

1500
01:11:37,407 --> 01:11:38,990
PROFESSOR: Right,
that's a good point.

1501
01:11:38,990 --> 01:11:42,030
You can't make it too weak.

1502
01:11:42,030 --> 01:11:45,230
It may depend on what your
expected fraction of matches

1503
01:11:45,230 --> 01:11:49,860
is, which actually
depends on pi and ri.

1504
01:11:49,860 --> 01:11:55,210
So if you have very biased
sequences, like very AT rich,

1505
01:11:55,210 --> 01:11:57,830
your expected fraction of
matches is actually higher.

1506
01:11:57,830 --> 01:11:59,145
When you're researching
an AT rich sequence

1507
01:11:59,145 --> 01:12:00,370
against another
AT rich sequence,

1508
01:12:00,370 --> 01:12:01,869
it's actually higher
than a quarter.

1509
01:12:04,300 --> 01:12:06,480
So even minus one might
not be sufficient there.

1510
01:12:06,480 --> 01:12:08,260
You might need to go
down more negative.

1511
01:12:08,260 --> 01:12:11,180
So you may need to use a
higher negative value just

1512
01:12:11,180 --> 01:12:14,050
to make sure that the
expected value is negative.

1513
01:12:14,050 --> 01:12:15,200
That's true.

1514
01:12:15,200 --> 01:12:19,830
And yeah, you may want to adjust
it based on the composition.

1515
01:12:19,830 --> 01:12:22,090
So let's just do a bit more.

1516
01:12:22,090 --> 01:12:25,480
So it turns out that the
Karlin and Altschul theory,

1517
01:12:25,480 --> 01:12:31,190
in addition to telling you what
the p value is of your match--

1518
01:12:31,190 --> 01:12:34,280
the statistical significance--
it also tells you

1519
01:12:34,280 --> 01:12:37,640
what the matches will
look like in terms

1520
01:12:37,640 --> 01:12:41,460
of what fraction of
identity they will have.

1521
01:12:41,460 --> 01:12:45,290
And this is the so-called
target frequency equation.

1522
01:12:45,290 --> 01:12:48,940
The theory says that if
I search a query with one

1523
01:12:48,940 --> 01:12:52,450
particular composition, p,
subject meta-composition r--

1524
01:12:52,450 --> 01:12:54,840
here, I've just assumed
they're the same, both p just

1525
01:12:54,840 --> 01:12:58,250
for simplicity-- with a
scoring matrix sij, which

1526
01:12:58,250 --> 01:13:00,070
has a corresponding of lambda.

1527
01:13:00,070 --> 01:13:03,080
Then, when I take those
very high scoring matches--

1528
01:13:03,080 --> 01:13:05,780
the ones that are
statistically significant--

1529
01:13:05,780 --> 01:13:10,650
and I look at those
alignments of those matches,

1530
01:13:10,650 --> 01:13:15,650
I will get values qij,
given by this formula.

1531
01:13:15,650 --> 01:13:17,110
So look at the formula.

1532
01:13:17,110 --> 01:13:17,790
So it's qij.

1533
01:13:17,790 --> 01:13:19,970
So pipj e to the lambda sij.

1534
01:13:19,970 --> 01:13:22,620
So it's basically
the expected chance

1535
01:13:22,620 --> 01:13:28,970
that you would have base
i matching based j just

1536
01:13:28,970 --> 01:13:29,470
by chance.

1537
01:13:29,470 --> 01:13:30,840
That's pipj.

1538
01:13:30,840 --> 01:13:33,290
But then weighted by
e to the lambda sij.

1539
01:13:33,290 --> 01:13:36,560
So we notice for a match,
s will be positive,

1540
01:13:36,560 --> 01:13:38,060
so e to the lambda
will be positive.

1541
01:13:38,060 --> 01:13:39,710
So that will be bigger than 1.

1542
01:13:39,710 --> 01:13:41,340
And you'll have more
matches and you'll

1543
01:13:41,340 --> 01:13:43,370
have correspondingly
less mismatches

1544
01:13:43,370 --> 01:13:45,320
because the mismatch
has a negative.

1545
01:13:45,320 --> 01:13:47,240
So get the target value score.

1546
01:13:47,240 --> 01:13:52,510
And that also tells you that
the so-called natural scores

1547
01:13:52,510 --> 01:13:56,540
are actually determined
by the fraction of matches

1548
01:13:56,540 --> 01:13:59,500
that you want in your
high scoring segments.

1549
01:13:59,500 --> 01:14:02,830
If we want 90% matches,
we just set qii

1550
01:14:02,830 --> 01:14:06,180
to be 0.9, and use
this equation here.

1551
01:14:09,440 --> 01:14:11,060
Solve for sij.

1552
01:14:14,740 --> 01:14:18,620
For example, if you want to
find regions with R% identities.

1553
01:14:18,620 --> 01:14:22,220
Little r is just the
r as a proportion.

1554
01:14:22,220 --> 01:14:24,550
qii is going to be r over 4.

1555
01:14:24,550 --> 01:14:27,110
This assumes unbiased
base composition.

1556
01:14:27,110 --> 01:14:28,920
A quarter of the
matches are acgt.

1557
01:14:32,060 --> 01:14:35,700
Qij, then, is 1 minus r over 12.

1558
01:14:35,700 --> 01:14:38,750
1 minus r is a fraction
of non-matching positions.

1559
01:14:38,750 --> 01:14:41,360
They're 12 different types.

1560
01:14:41,360 --> 01:14:45,180
Set sii equal to 1, that's
what we said we normally do.

1561
01:14:45,180 --> 01:14:48,560
And then you do a little
bit of algebra here.

1562
01:14:48,560 --> 01:14:50,410
m is sij.

1563
01:14:50,410 --> 01:14:54,730
And you sort of plug in
this equation twice here.

1564
01:14:54,730 --> 01:14:56,740
And you get this equation.

1565
01:14:56,740 --> 01:15:01,350
So it says that m equals
log of 4 1 minus r over 3

1566
01:15:01,350 --> 01:15:03,010
over log 4 r.

1567
01:15:03,010 --> 01:15:06,240
And for this to be
true, this assumes

1568
01:15:06,240 --> 01:15:08,780
that both the query
and the database

1569
01:15:08,780 --> 01:15:11,450
have uniform composition
of a quarter,

1570
01:15:11,450 --> 01:15:16,300
and that r is between
a quarter and 1.

1571
01:15:16,300 --> 01:15:19,294
The proportion of matches in
your high scoring segment--

1572
01:15:19,294 --> 01:15:20,960
you want it to be
bigger than a quarter.

1573
01:15:20,960 --> 01:15:23,014
A quarter is what you
would see by chance.

1574
01:15:23,014 --> 01:15:25,430
There's something wrong with
your scoring system if you're

1575
01:15:25,430 --> 01:15:27,390
considering those
to be significant.

1576
01:15:27,390 --> 01:15:30,280
So it's something above 25%.

1577
01:15:30,280 --> 01:15:35,090
And so it's just
simple algebra--

1578
01:15:35,090 --> 01:15:41,180
you can check my work at
home-- to solve for m here.

1579
01:15:41,180 --> 01:15:44,310
And then this equation
then tells you

1580
01:15:44,310 --> 01:15:49,530
that if I want to find
75% identical matches

1581
01:15:49,530 --> 01:15:51,340
in a nucleotide
search, I should use

1582
01:15:51,340 --> 01:15:53,640
a mismatch penalty of minus 1.

1583
01:15:53,640 --> 01:15:57,160
And if I want 99%
identical matches,

1584
01:15:57,160 --> 01:16:00,200
I should use a
penalty of minus 3.

1585
01:16:00,200 --> 01:16:02,440
Not minus 5, but minus 3.

1586
01:16:02,440 --> 01:16:09,070
And I want you to think
about, does that make sense?

1587
01:16:09,070 --> 01:16:11,640
Does that not make sense?

1588
01:16:11,640 --> 01:16:14,400
Because I'm going to ask you
at the beginning of class

1589
01:16:14,400 --> 01:16:17,950
on Tuesday to
explain and comment

1590
01:16:17,950 --> 01:16:23,470
on this particular phenomenon
of how when you want higher

1591
01:16:23,470 --> 01:16:31,340
percent identities, you want a
more negative mismatch score.

1592
01:16:31,340 --> 01:16:35,410
Any last questions?

1593
01:16:35,410 --> 01:16:36,960
Comments?