The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: OK, so welcome back to computational systems biology, I'm David Gifford. I'm delighted to be with you here today. And today we're going to be talking about a topic that is central to modern high-throughput biology, which is understanding how to do short read alignment, sometimes called read mapping.

Now it's very important to me that you understand what I'm about to say today, and so I'm hopeful that you'll be uninhibited to raise your hand and ask questions about the fine points in today's lecture if you have any, because I'd be totally delighted to answer any questions, and we have enough time today that we can spend time looking at one aspect of this problem and understand it thoroughly.

An associated topic is the question of library complexity. How many people have heard of sequencing libraries before? Let's see a show of hands. OK, how many people have heard of read alignment before, read mapping? OK, great, fantastic.

Let's start with what we're going to be talking about today. We're going to first begin talking about what a sequencing library is and what we mean by library complexity. We'll then turn to what has been called a full-text minute-space index, sometimes called a Burrows-Wheeler Transform index, a BWT index, or an FM index, but this is at the center of most modern computational biology algorithms for processing high-throughput sequencing data. And then we'll turn to how to use that type of index for read alignment.

So let's start now with what a sequencing library is. Let's just say that you have a DNA sample; we'll be talking about various ways of producing said samples throughout the term. But we're going to assume that we have a bunch of different DNA molecules.
And I'll illustrate the different molecules here in different colors. And we have three different types of molecules here. Some molecules are duplicated, because as you know, typically, we're preparing DNA from an experiment where there are many cells and we can get copies of DNA from those cells, or the DNA could be amplified using PCR or some other technique.

So we have this collection of molecules, and to make a library, we're going to process it. And one of the things that we'll do when we process the library is we'll put sequencing adapters on. These are short DNA sequences that we put onto the ends of the molecules to enable them to have defined sequences at the ends, which permits sequencing.

Now, if somebody hands you a tube of DNA like this, there are a couple questions you could ask. You could check the DNA concentration to find out how much DNA is there, you could run a gel to look at the size of the fragments that you're sequencing. We'll be returning to that later, but these are typically called the insert sizes of the library that you're sequencing, the total length of the DNA excluding the adapters.

But we could also ask questions about how complex this library is, because it's possible to run experiments where you produce libraries that are not very complex, where they don't have very many different types of molecules. Now that typically is a failure of the experiment. So an important part of quality control is characterizing the library complexity, where we want to figure out, here, that the complexity is equal to 3. There are three different types of molecules.

And we sample these molecules. And when we sample them, we get a bunch of DNA sequence reads. And typically the number of reads that we get is larger than the complexity of the library. Here we have a total of 12 different reads. And when we sequence a library, we're sampling from it. And so the probability that we get any one particular molecule is going to be, roughly speaking, equal to 1 over c, which is the complexity.
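A minimal sketch of that sampling picture, assuming a library of c equally likely unique molecules and n reads drawn uniformly at random (the molecule labels and function name are made up for illustration):

```python
import random
from collections import Counter

def sample_reads(c, n, seed=0):
    """Draw n reads uniformly at random from a library of c unique molecules."""
    rng = random.Random(seed)
    molecules = [f"molecule_{i}" for i in range(c)]     # hypothetical labels
    reads = [rng.choice(molecules) for _ in range(n)]   # each molecule has probability 1/c
    return Counter(reads)                               # how many reads came from each molecule

counts = sample_reads(c=3, n=12)
print(counts)
print("distinct molecules observed:", len(counts))
```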
And thus, we could use the binomial distribution to figure out the likelihood that we had exactly four of these type one molecules. However, as n, the number of sequencing reads, grows to be very large, typical numbers are a hundred million different reads, the binomial becomes cumbersome to work with. And so we typically are going to characterize this kind of selection process with a different kind of distribution. So one idea is to use a Poisson, where we say that the rate of sequencing is going to be n over c.

And we can see that here, shown on the slide above, is the same process where we have the ligation of the adapters. We have a library and we have reads coming from the library. We have a characterized library complexity here; there are four different types of molecules. And the modeling approach is that, assuming that we have c different unique molecules, the probability that we'll get any one of them when we're doing the sequencing is 1 over c. And if we do n sequencing reads, we can find out the probability that we'll get a certain number of each type of molecule. Let's just stick with the first one to start, OK?

Now part of the challenge in analyzing sequencing data is that you don't see what you don't sequence. So things that actually occur 0 times in your sequencing data still may be present in the library. And what we would like to do is, from the observed sequencing data, estimate the library complexity. So we have all of the sequencing data, we just don't know how many different molecules there are over here.

So one way to deal with this is to say, let us suppose that we make a histogram of the number of times we see distinct molecules, and we're going to say that we can observe molecules that are sequenced, or appear, l times up through r times. So we actually can create a version of the distribution that characterizes just a part of what we're seeing. So if we do this, we can build a Poisson model and we can estimate lambda from what we can observe.
We don't get to observe things we don't see. So for sure, we know we can't observe the things that are sequenced 0 times. But for the things that are sequenced at least one time, we can build an estimate of lambda. And from that estimate of lambda, we can build an estimate of c.

So one way to look at this is that if we look at the total number of unique molecules that we sequence, which is equal to m, then the probability that we observe between l and r occurrences of a given individual sequence, times c, is going to be equal to the total number of unique molecules that we observe. Another way to look at this is the very bottom equation, where we note that if we take the total complexity of the library and we multiply it by 1 minus the probability that we don't observe certain molecules, that will give an estimate of the total number of unique molecules that we do see. And thus we can manipulate that to come up with an estimate of the complexity.

Are there any questions about the details of this so far? OK, so this is a very simple model for estimating the complexity of a library, based upon looking at the distribution of reads that we actually observe, for quality control purposes.
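A minimal sketch of that estimator, assuming the simplest window where every molecule sequenced at least once is observable (l = 1, r unbounded), so that C · (1 − P(X = 0)) = m under a Poisson(lambda) model; the function name is illustrative:

```python
import math

def estimate_complexity(read_counts):
    """Estimate library complexity C from per-molecule read counts (zeros unobservable).

    Model: each molecule's read count ~ Poisson(lambda); we only see molecules with
    count >= 1, so the observed counts follow a zero-truncated Poisson.
    """
    counts = [k for k in read_counts if k > 0]
    m = len(counts)                       # number of distinct molecules observed
    mean_obs = sum(counts) / m            # mean of the zero-truncated Poisson

    # Solve mean_obs = lam / (1 - exp(-lam)) for lam by bisection (LHS is increasing).
    lo, hi = 1e-9, max(mean_obs, 1.0) * 10
    for _ in range(200):
        lam = (lo + hi) / 2
        if lam / (1 - math.exp(-lam)) < mean_obs:
            lo = lam
        else:
            hi = lam
    lam = (lo + hi) / 2

    return m / (1 - math.exp(-lam))       # C_hat = m / P(molecule seen at least once)

# Toy usage: 12 reads spread over 3 observed molecules, e.g. counts 5, 4, 3.
print(estimate_complexity([5, 4, 3]))
```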
And let us suppose that we apply this to 1,000 Genomes data, which is public data on human. And suppose we want to test whether this model works or not. So what we're going to do is we're going to estimate the library complexity from 10 percent of the sequencing reads, so we'll pick 10 percent of the reads of an individual at random, we'll estimate the complexity of the library, and then we'll also take all of the reads of the individual and estimate the complexity. And if our estimator is pretty good, we should get about the same number from 10 percent of the reads and from all of the reads. Will people go along with that? Think that seems reasonable?

OK, so we do that. And this is what we get. And it's hard to see the diagonal line here, but there's a big oops here. And the big oops is that if we estimate the library complexity from just 10 percent of the reads, it's grossly underestimating the number of unique molecules we actually have. In fact, it's off by typically a factor of two or more. So for some reason, even though we're examining millions of reads in this subsample, we're not getting a good estimate of the complexity of the library.

Does anybody have any idea what could be going wrong here? Why is it that this very simple model that is attempting to estimate how many different molecules we have here, based upon what we observe, is broken? Any ideas at all? And please say your name first.

AUDIENCE: I'm Chris.

PROFESSOR: Hi Chris.

AUDIENCE: Is it because repeated sequences, so there could be a short sequence at the end of one molecule that's the beginning of another one, middle of another one, so [INAUDIBLE].

PROFESSOR: Chris, you're on the right track, OK? Because what we have assumed at the outset was that all of these molecules occur with equal probability. Right? What would happen if in fact there are four copies of this purple one and only two copies of the other molecules? Then the probability of sampling this one is going to be twice as high as the probability of sampling one of these.

If there's non-uniformity in the original population, that's going to mess up our model big time. And that could happen from repeated sequences or other kinds of duplicated things, or it could be that there's unequal amplification. It might be that PCR really loves a particular molecule, right, and amplifies that one a lot, and doesn't amplify another one that's difficult to amplify. So somewhere in our experimental protocol pipeline, it could be that there's non-uniformity, and thus we're getting a skewness to our distribution here in our library.

So the other thing that's true is, in a Poisson, lambda, which is equal to the mean, is also equal to the variance.
And so our Poisson has only one knob we could turn to fit the distribution.

So coming back to this, we talked about the idea that the library complexity still may be four, but then there may be different numbers of molecules of each type. And here's an idea for you, right? The idea is this. Imagine that the top distributions are the number of each type of molecule that are present. And it might be that our original assumption was that it was like the very top, that typically there are two copies of each molecule in the original sequencing library, and that's a fairly tight distribution. But it could be, in fact, that the number of molecules of each type is very dispersed.

And so if we look at each one of those plots at the top, the first four, those are going to be our guesses about the distribution of the number of copies of a molecule in the original library. And we don't know what that is, right? That's something we can't directly observe, but imagine that we took that distribution and used it for lambda in our Poisson distribution down below for sampling. So we have one distribution over the number of each type of molecule we have, and we have the Poisson for sampling from that, and we put those two together.

And when we do that, we have the Poisson distribution at the top, the gamma distribution is what we'll use for representing the number of different species over here and their relative copy number. And when we actually put those together as shown, we wind up with what's called the negative binomial distribution, which is a more flexible distribution; it has two parameters. And that negative binomial distribution can be used, once again, to estimate our library complexity. And when we do so, we have lambda be the same, but k is a new parameter. It measures sort of the variance or dispersion of this original sequencing library.
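A small sketch of that mixing idea, assuming we draw each molecule's rate from a gamma distribution and then its read count from a Poisson with that rate; the resulting counts are negative-binomially distributed and overdispersed (variance well above the mean):

```python
import math
import random
import statistics

def gamma_poisson_counts(n_molecules, mean_depth, k, seed=0):
    """Simulate per-molecule read counts under a gamma-Poisson (negative binomial) model.

    k is the gamma shape (dispersion) parameter; a smaller k means a more skewed library.
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n_molecules):
        lam = rng.gammavariate(k, mean_depth / k)   # per-molecule rate, mean = mean_depth
        # Poisson draw via inversion of the CDF (fine for modest rates).
        u, p, x = rng.random(), math.exp(-lam), 0
        cdf = p
        while u > cdf:
            x += 1
            p *= lam / x
            cdf += p
        counts.append(x)
    return counts

counts = gamma_poisson_counts(n_molecules=10000, mean_depth=2.0, k=0.5)
print("mean:", statistics.mean(counts), "variance:", statistics.variance(counts))
```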
And then when we fit this negative binomial distribution to that 1,000 Genomes data, it's going to be, hopefully, better.

Let's start with a smaller example. If we have a library that's artificial, with a known million unique molecules, and we subsample it to give 100,000 reads, you can see that with different dispersions here on the left, k with different values from 0.1 to 20, the Poisson begins to grossly underestimate the complexity of the library as the dispersion gets larger, whereas the negative binomial, otherwise known as the GP or gamma-Poisson, does a much better job.

And furthermore, when we look at this in the context of the 1,000 Genomes data, you can see when we fit this how much better we are doing. Almost all those points are almost exactly on the line, which means you can take a small sampling run and figure out from that sampling run how complex your library is.

And that allows us to tell something very important, which is what is the marginal value of extra sequencing. So for example, if somebody comes to you and they say, well, I ran my experiment and all I could afford was 50 million reads. Do you think I should sequence more? Is there more information in my experimental DNA preparation? It's easy to tell now, right? Because you can actually analyze the distribution of the reads that they got, and you can go back and you could estimate the marginal value of additional sequencing. And the way you do that is you go back to the distribution that you fit, this negative binomial, and ask, if you have r more reads, how many more unique molecules are you going to get?

And the answer is that you can see, if you imagine that this is artificial data, but if you imagine that you had a complexity of 10 to the 6 molecules, the number of sequencing reads is on the x-axis, the number of observed distinct molecules is on the y-axis, and as you increase the sequencing depth, you get more and more of the library back. However, the important thing to note is that the more skewed the library is, the less benefit you get, right? So if you look at the various values of k, as k gets larger, the sort of the skewness of the library increases, and you can see that you get fewer unique molecules as you increase the sequencing depth.
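A rough sketch of that depth-versus-yield calculation, assuming the standard zero probabilities for the two models, Poisson: P(0) = exp(-n/C); gamma-Poisson with dispersion k and mean n/C: P(0) = (k/(k + n/C))^k. This is a sketch, not the lecture's exact code:

```python
import math

def expected_distinct(C, n, k=None):
    """Expected number of distinct molecules seen after n reads from a library of C molecules.

    k=None uses a Poisson model; otherwise a gamma-Poisson (negative binomial) with dispersion k.
    """
    mu = n / C                                   # mean reads per molecule
    if k is None:
        p_missed = math.exp(-mu)                 # Poisson P(count = 0)
    else:
        p_missed = (k / (k + mu)) ** k           # negative binomial P(count = 0)
    return C * (1 - p_missed)

C = 10**6
for n in (10**5, 10**6, 10**7):
    print(n, round(expected_distinct(C, n)), round(expected_distinct(C, n, k=0.5)))
```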
Now I mention this to you because it's important to think in a principled way about analyzing sequencing data. If somebody drops 200 million reads on your desk and says, can you help me with these, it's good to start with some fundamental questions, like just how complex is the original library, and do you think that these data are really good or not, OK? Furthermore, this is an introduction to the idea that certain kinds of very simplistic models, like Poisson models of sequencing data, can be wrong because they're not adequately taking into account the overdispersion of the original sequencing count data.

OK, so that's all there is about library complexity. Let's move on now to questions of how to deal with these reads once we have them.

So the fundamental challenge is this. I hand you a genome like human: 3 times 10 to the ninth bases. This will be in FASTA format, let's say. I hand you reads. And this will be, we'll have, say, 200 base pairs times 2 times 10 to the eighth different reads. And this will be in FASTQ format. The Q means that it's like FASTA, except that a quality score is associated with each particular base position. And the PHRED score, which is typically used for these sorts of qualities, is minus 10 times log base 10 of the probability of an error. Right, so a PHRED score of 10 means that there's a 1 in 10 chance that the base is an error, a PHRED score of 20 means it's 1 in 100, and so forth.

And then the goal is, today, if I give you these data on a hard drive, your job would be to produce a SAM file, a Sequence Alignment and Mapping file, which tells us where all these reads map in the genome.
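As a quick worked example of that score, here is a minimal sketch converting between PHRED qualities and error probabilities; the ASCII offset-33 decoding shown is the common Sanger/Illumina 1.8+ convention, so treat it as an assumption about the particular FASTQ files:

```python
import math

def phred_to_error_prob(q):
    """Q = -10 * log10(p_error)  =>  p_error = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    return -10 * math.log10(p)

def decode_quality_string(qual):
    """Decode a FASTQ quality string, assuming the ASCII offset-33 encoding."""
    return [ord(ch) - 33 for ch in qual]

print(phred_to_error_prob(10), phred_to_error_prob(20))   # 0.1, 0.01
print(decode_quality_string("II5?"))                      # [40, 40, 20, 30]
```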
And more pictorially, the idea is that there are many different reasons why we want to do this mapping. So one might be to do genotyping. You and I differ in our genomes by about one base in a thousand. So if I sequence your genome and I map it back, or align it, to the human reference genome, I'm going to find differences between your genome and the human reference genome. And you can see how this is done at the very top, where we have the aligned reads and there's a G, let's say, in the sample DNA, and there's a C in the reference. But in order to figure out where the differences are, we have to take those short reads and align them to the genome.

Another kind of experimental protocol uses DNA fragments that are representative of some kind of biological process. So here the DNA being produced is mapped back to the genome to look for areas of enrichment, or what are sometimes called peaks. And there we want to actually do exactly the same process, but the post-processing once the alignment is complete is different.

So both of these share the goal of taking hundreds of millions of short reads and aligning them to a very large genome. And you heard about Smith-Waterman from Professor Berg, and as you can tell, that really isn't going to work, because its time complexity is not going to be admissible for hundreds of millions of reads. So we need to come up with a different way of approaching this problem.

So finding this alignment is really a performance bottleneck for many computational biology problems today. And we have to talk a little bit about what we mean by a good alignment, because we're going to assume, of course, fewer mismatches are better. And we're going to try and align to high quality bases as opposed to low quality bases, and note that all we have in our input data are quality scores for the reads.
So we begin with an assumption that the genome is the truth, and when we are aligning, we are going to be more permissive of mismatches in read locations that have a higher likelihood of being wrong.

So is everybody OK with the setup so far? You understand what the problem is? Yes, all the way in the back row, my back row consultants, you're good on that? See, the back row is always the people I call on for consulting advice, right? So yeah. You're all good back there? Good, I like that, good, that's good, I like that, OK. All right.

So now I'm going to talk to you about one of the most amazing transforms I have seen. It's called the Burrows-Wheeler Transform. And it is a transform that we will do to the original genome that allows us to do this lookup very, very quickly. And it's worth understanding.

So here's the basic idea behind the Burrows-Wheeler Transform. We take the original string that we want to use as our target, that we're going to look things up in, OK, so this is going to be the dictionary we're looking things up in, and it's going to be the genome sequence. And you can see the sequence on the left hand side, ACA, ACG, and the dollar sign represents the end of string terminator. OK.

Now here's what we're going to do. We take all possible rotations of this string, OK? And we're going to sort them. And the result of sorting all the rotations is shown in the next block of characters. And you can see that the end of string character has the lowest sorting order, followed by A, C, and G, and that all the strings are ordered lexically by all of their letters.

So once again, we take the original input string, we do all of the possible rotations of it, and then we sort them and wind up with this Burrows-Wheeler Matrix, as it's called in this slide, OK? And we take the last column of that matrix and that is the Burrows-Wheeler Transform.
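Here is a minimal sketch of exactly that construction, building the Burrows-Wheeler Matrix by sorting all rotations and taking the last column. This is fine for a toy string; real aligners build the same transform far more efficiently:

```python
def rotations(t):
    """All cyclic rotations of t (t must already end with the '$' terminator)."""
    tt = t * 2
    return [tt[i:i + len(t)] for i in range(len(t))]

def bwm(t):
    """The Burrows-Wheeler Matrix: the rotations, sorted lexically ('$' sorts lowest)."""
    return sorted(rotations(t))

def bwt_via_bwm(t):
    """The Burrows-Wheeler Transform: the last column of the matrix, read top to bottom."""
    return "".join(row[-1] for row in bwm(t))

t = "acaacg$"
for row in bwm(t):
    print(row)
print("BWT:", bwt_via_bwm(t))   # expected: gc$aaac
```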
Now you might say, what on earth is going on here? Why would you want to take a string, or even an entire genome? We actually do this on entire genomes, OK? Consider all the rotations of it, sort them, and then take the last column of that matrix. What could that be doing, OK?

Here's a bit of intuition for you. The intuition is that that Burrows-Wheeler Matrix is representing all of the suffixes of t. OK, so all the red things are suffixes of t in the matrix. And when we are going to be matching a read, we're going to be matching it from its end going towards the beginning of it, so we'll be matching suffixes of it. And I'm going to show you a very neat way of using this transform to do matching very efficiently.

But before I do that, I want you to observe that it's not complicated. All we do is we take all the possible rotations and we sort them and we come up with this transform. Yes.

AUDIENCE: What are you sorting them based on?

PROFESSOR: OK, what was your name again?

AUDIENCE: I'm Samona.

PROFESSOR: Samona. What are we sorting them based upon? We're just sorting them alphabetically.

AUDIENCE: OK.

PROFESSOR: So you can see that if dollar sign is the lowest alphabetical character, that if you consider each one a word, that they're sorted alphabetically, OK? So we have seven characters in each row and we sort them alphabetically. Or lexically. Good question.

Any other questions like that? This is a great time to ask questions, because what's going to happen is that in about the next three minutes, if you lose your attention span of about 10 seconds, you're going to look up and you'll say, what just happened? Yes.

AUDIENCE: Could you explain the suffixes of t?

PROFESSOR: The suffixes of t? Sure. Let's talk about the suffixes of t. They're all of the things that end t. So a suffix of t would be G, or CG, or ACG, or AACG, or CAACG, or the entire string t.
Those are all of the endings of t. And if you look over on the right, you can see all of those suffixes in red. So one way to think about this is that it's sorting all of the suffixes of t in that matrix. Because the rotations are exposing the suffixes, right, is what's going on. Does that make sense to you?

Now keep me honest here in a minute, OK, you'll help me out? Yes. Your name first?

AUDIENCE: [INAUDIBLE].

PROFESSOR: [INAUDIBLE].

AUDIENCE: What is dollar sign?

PROFESSOR: Dollar sign is the end of string character, which has the lowest lexical sorting order. So it's marking the end of t. That's how we know that we're at the end of t. Good question. Yes.

AUDIENCE: Can you sort them non-alphabetically, just different ways to sort them [INAUDIBLE] algorithm?

PROFESSOR: The question is, can you sort them non-alphabetically. You can sort them any way, as long as it's consistent, OK. But let's stick with alphabetical, lexical order today. It's really simple and it's all you need. Yes.

AUDIENCE: [INAUDIBLE] in red is the suffixes in the last colored group on the right?

PROFESSOR: No, no.

AUDIENCE: What's in red?

PROFESSOR: What's in red are all the suffixes of t on the very far left. OK?

AUDIENCE: On the right, last column group?

PROFESSOR: The right last column group. That last column in red, that is the Burrows-Wheeler Transform, read from top to bottom. OK? And I know you're looking at that and saying, how could that possibly be useful? We've taken our genome. We've shifted it all around. We've sorted it, we take this last thing. It looks like junk to me, right? But you're going to find out that all of the information in the genome is contained in that last string in a very handy way. Hard to believe but true.
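A small sketch of that suffix intuition: because t ends with the unique '$' terminator, sorting the rotations orders the rows exactly by their suffixes, so the same transform can be read off a suffix array (a sketch, with illustrative function names):

```python
def suffix_array(t):
    """Start offsets of the suffixes of t in lexically sorted order."""
    return [offset for _, offset in sorted((t[i:], i) for i in range(len(t)))]

def bwt_via_sa(t):
    """BWT(t) is the character just before each sorted suffix (the terminator for offset 0)."""
    return "".join(t[i - 1] if i > 0 else "$" for i in suffix_array(t))

t = "acaacg$"
print(suffix_array(t))          # sorted suffix start positions: [6, 2, 0, 3, 1, 4, 5]
print(bwt_via_sa(t))            # matches the last column: gc$aaac
```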
Hard to believe but true. Yes. Prepare to be amazed, all right? These are all great questions. Any other questions of this sort? OK.

So, I'm going to make a very important observation here that is going to be crucial for your understanding. So I have reproduced the matrix down on this blackboard. What? That's usually there under that board, you know that. You guys haven't checked this classroom before, have you? No. It's always there. It's such a handy transform.

So this is the same matrix as the matrix you see on the right. I'm going to make a very, very important assertion right now. OK? The very important assertion is that if you consider that this is the first a in the last column, that is the same textual occurrence in the string as the first a in the first column. And this is the second a in the last column, that's the same as the second a in the first column. And you're going to say, what does he mean by that?

OK, do the following thought experiment. Look at the matrix. OK? And in your mind, shift it left and put all the characters on the right hand side. OK? When you do that, what will happen is that these things will be used to sort the occurrences of a on the right hand side. Once again, if you shift this whole thing left and these pop over to the right, then the occurrences of these a's will be sorted by these rows from here over. But these are alphabetical. And therefore they're going to sort in alphabetical order. And therefore these a's will sort in the same order here as they are over there.

So that means that when we do this rotation, this textual occurrence of a will have the same rank in the first column and in the last column.
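Here is a minimal sketch that checks this rank property on the toy string, assuming the matrix is built from sorted suffixes as above: the i-th occurrence of a character in the last column and the i-th occurrence in the first column turn out to be the same position of the original text:

```python
def column_positions(t):
    """For each row of the Burrows-Wheeler Matrix, the text offsets of its
    first-column and last-column characters (t must end with '$')."""
    n = len(t)
    sa = [off for _, off in sorted((t[i:], i) for i in range(n))]   # row -> rotation start
    first = [(t[i], i) for i in sa]                        # (char, text offset), first column
    last = [(t[(i - 1) % n], (i - 1) % n) for i in sa]     # (char, text offset), last column
    return first, last

def rank_order_offsets(column, ch):
    """Text offsets of ch in a column, read top to bottom (i.e., in rank order)."""
    return [off for c, off in column if c == ch]

first, last = column_positions("acaacg$")
print(rank_order_offsets(first, "a"))   # [2, 0, 3]
print(rank_order_offsets(last, "a"))    # [2, 0, 3] -- same occurrences, in the same order
```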
And you can see I've annotated the various bases here with their ranks. This is the first g, the first c, the first end-of-string character, first a, second a, third a, second c. And correspondingly, I have the same annotations over here, and thus the third a here is the same lexical occurrence as the third a on the left in the string, the same text occurrence.

Now I'm going to let that sink in for a second, and then when somebody asks a question, I'm going to explain it again, because it's a little bit counterintuitive. But the very important thing is, if we think about textual occurrences of characters in that string t and we put them in this framework, the rank allows us to identify identical textual occurrences of a character.

Would somebody like to ask a question? Yes. Say your name and the question, please.

AUDIENCE: Dan.

PROFESSOR: Dan.

AUDIENCE: So in your original string, though, those won't correspond to the same order in the transformed string. So like the a's in the original string, in their order, they don't correspond numerically to the transformed string.

PROFESSOR: That's correct. Is that OK? The comment was that the order in BWT, the transform, is not the same as the order in the original string. And all I'm saying is that in this particular matrix form, the order in the last column is the same as the order in the first column for a particular character. And furthermore, that these are matching textual occurrences, right? Now if I look at a2 here, we know that c comes after it, then a, then a, and c and g, right? Right.

OK, so did that answer your question, that they're not exactly the same?

AUDIENCE: Yes. I don't understand how they're useful yet.

PROFESSOR: You don't understand how it's useful yet. OK. Well, maybe we better get to the useful part and then you can-- OK.

So let us suppose that we want to, from this, reconstruct the original string. Does anybody have any ideas about how to do that? OK.
Let me ask a different question. If we look at this g1, right? And then this is the same textual occurrence, right? And we know that this g1 comes right before the end-of-string terminator, right? So if we look at the first row, we always know what the last character was in the original string. The last character is g1, right? Fair enough? OK.

Where does g1 occur over here? Right over here, right? What's the character before g1? c2. Where is c2 over here? What's the character before c2? a3. What's the character before a3? a1. Uh oh. Let me just cheat a little bit here: a1 a3 c2 g1 $. So we're at a1, right? What's the character before a1? c1, right? What's the character before c1? a2. And what's the character before a2? That's the end of string.

Is that the original string that we had? Magic. OK? Yes.

AUDIENCE: Wouldn't it be simpler to look at-- to just remove the dollar sign [INAUDIBLE], or do you mean reconstruct from only the transformed?

PROFESSOR: We're only using this. This is all we have. Because I actually didn't use any of these characters. I was only doing the matching so we would go to the right row. Right? I didn't use any of this.

And so, but do people understand what's going on here? If anybody has any questions, now is a great time to raise your hand and say-- here we go. We have a customer. Say your name and the question, please.

AUDIENCE: My name is Eric.

PROFESSOR: Thanks, Eric.

AUDIENCE: Can you illustrate how you would do this without using any of the elements to the left of the box?

PROFESSOR: Absolutely, Eric. I'm so glad you asked that question. That's the next thing we're going to talk about.
OK, but before I get there, I want to make sure, are you comfortable doing it with all the stuff on the left hand side? You're happy about that? OK. If anyone was unhappy about that, now would be the time to say, I'm unhappy, help me. How about the details? Everybody's happy? Yes.

AUDIENCE: So, you have your original string in the first place, though, so why do you want to create another string of the same length? Like, how does this help you match your read?

PROFESSOR: How does this help you match your read? How does this help you match your read, was the question. What was your name?

AUDIENCE: Dan.

PROFESSOR: That's right, Dan. That's a great question. I'm so glad you asked it. First we'll get to Eric's question and then we'll get to yours. Because I know if I don't give you a good answer that you're going to be very mad, right? OK?

Let's talk about the question of how to do this without the other things. So we're going to create something called the last-to-first function that maps a character in the last row-- column, I should say-- to the first column. And there is the function right there. It's called LF. You give it a row number. The rows are zero-origined. And you give it a character, and it tells you what the corresponding place is in the first column.

And it has two components. The first is Occ, which tells you how many characters are smaller than that character lexically. So it tells you where, for example, the a's start, the c's start, or the g's start. So in this case, for example, Occ of c is 4. That is, the c's start at 0, 1, 2, 3, the fourth row. OK? And then count tells you the rank of c minus 1. So it's going to essentially count how many times c occurs before the c at the row you're pointing at. In this case, the answer is 1, and you add 1, and that gets you to 5, which is this row.
So this c2 maps here to c2, as we already discussed. So this LF function is a way to map from the last column to the first column. And we need to have two components. So we need to know Occ, which is very trivial to compute. There are only five elements, one for each base and one for the end of string terminator, which is actually zero. So it will only have integers. And count, which is going to tell us the rank in the BWT transform, and we'll talk about how to do that presently. OK.

So did that answer your question, how to do this without the rest of the matrix? Eric?

AUDIENCE: Could you show us step by step on the blackboard how you would reconstruct it?

PROFESSOR: How do we reconstruct it?

AUDIENCE: Yeah.

PROFESSOR: You mean something like this? Is this what you're suggesting?

AUDIENCE: Somehow I get a feeling that the first column doesn't help us in understanding how the algorithm works only using the last column.

PROFESSOR: OK. Your comment, Eric, is that you feel like the first column doesn't help us understand how the algorithm works, only using the last column, right? OK.

AUDIENCE: [INAUDIBLE] going back to the first column of data.

PROFESSOR: OK. Well, let's compute the LF function of the character and the row for each one of these things, OK? And that might help you, all right? Because that's the central part of being able to reverse this transform.

So this is, to be more clear, I'll make it more explicit. This is LF of i and BWT of i, where i goes from 0 to 6. So what is that value for this one right here? Anybody know? Well, it would be Occ of g, which is 6, right? Plus count of 0 and g, which is going to be 0. Or I can look right over here and see that in fact it's 6, right? Because this occurrence of g1 is right here.

So this LF value is-- it's 6, 4, 0, a1 is in 1, a2 is in 2, a3 is in 3, c2 is in 5. So this is the LF function: 6 4 0 1 2 3 5. And I don't need any of this to compute it. Because it simply is equal to, going back one slide, it's equal to Occ of c plus count. So it's going to be equal to where that particular character starts on the left hand side, plus its rank minus 1.

And so these are the values for LF. This is what I need to be able to take this string and recompute the original string. If I can compute this, I don't need any of that. And to compute this, I need two things. I need Occ and I need count.
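A minimal sketch of those two pieces and the resulting LF map for the toy example, assuming the BWT string gc$aaac computed earlier (the function names are illustrative; a real FM index would precompute rank checkpoints rather than rescan the string):

```python
def occ_table(bwt):
    """Occ[c]: the row where c's block begins in the sorted first column."""
    counts = {}
    for ch in bwt:
        counts[ch] = counts.get(ch, 0) + 1
    occ, total = {}, 0
    for ch in sorted(counts):          # '$' sorts before a, c, g, t
        occ[ch] = total
        total += counts[ch]
    return occ

def count(bwt, i, ch):
    """How many times ch appears in bwt[0:i], i.e. the rank of bwt[i] minus 1."""
    return bwt[:i].count(ch)

def lf(bwt, occ, i):
    """Last-to-first map: the first-column row holding the same text occurrence."""
    ch = bwt[i]
    return occ[ch] + count(bwt, i, ch)

bwt = "gc$aaac"
occ = occ_table(bwt)
print(occ)                                          # {'$': 0, 'a': 1, 'c': 4, 'g': 6}
print([lf(bwt, occ, i) for i in range(len(bwt))])   # [6, 4, 0, 1, 2, 3, 5]
```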
806 00:40:28,710 --> 00:40:31,670 And I don't need any of this to compute it. 807 00:40:31,670 --> 00:40:36,920 Because it simply is equal to, going back one slide, 808 00:40:36,920 --> 00:40:39,450 it's equal to Occ of c plus count. 809 00:40:39,450 --> 00:40:41,960 So it' going to be equal to where that particular character 810 00:40:41,960 --> 00:40:47,730 starts on the left hand side and its rank minus 1. 811 00:40:50,330 --> 00:40:53,100 And so these are the values for LF. 812 00:40:53,100 --> 00:40:56,530 This is what I need to be able to take this string 813 00:40:56,530 --> 00:40:59,630 and recompute the original string. 814 00:40:59,630 --> 00:41:04,170 If I can compute this, I don't need any of that. 815 00:41:04,170 --> 00:41:07,000 And to compute this, I need two things. 816 00:41:07,000 --> 00:41:10,240 I need Occ and I need count. 817 00:41:13,000 --> 00:41:13,990 All right? 818 00:41:13,990 --> 00:41:18,310 Now, I can tell you're not quite completely satisfied yet. 819 00:41:18,310 --> 00:41:20,020 So maybe you can ask me another question 820 00:41:20,020 --> 00:41:21,655 and it would be very helpful to me. 821 00:41:21,655 --> 00:41:27,530 AUDIENCE: How did you get G1's last and first functions 822 00:41:27,530 --> 00:41:29,720 score being 6? 823 00:41:29,720 --> 00:41:30,770 PROFESSOR: OK. 824 00:41:30,770 --> 00:41:31,810 Let's take that apart. 825 00:41:40,260 --> 00:41:50,040 We want to know what LF of 6 and where was that G1? 826 00:41:50,040 --> 00:41:51,590 G1 is 1 and 0, right? 827 00:41:51,590 --> 00:41:52,270 Sorry. 828 00:41:52,270 --> 00:41:58,490 LF of 1 and g is equal to, right? 829 00:41:58,490 --> 00:42:00,640 Is that g and 1 or 0? 830 00:42:00,640 --> 00:42:02,940 Oop, sorry it's in 0. 831 00:42:02,940 --> 00:42:05,720 So this is what you like me to compute, right? 832 00:42:05,720 --> 00:42:09,160 OK what's Occ of g? 833 00:42:09,160 --> 00:42:11,140 It's how many characters are less than g 834 00:42:11,140 --> 00:42:12,227 in the original string? 835 00:42:19,050 --> 00:42:20,156 I'll give you a clue. 836 00:42:20,156 --> 00:42:22,220 It's 1, 2, 3, 4, 5, 6. 837 00:42:22,220 --> 00:42:23,136 AUDIENCE: [INAUDIBLE]. 838 00:42:27,139 --> 00:42:29,430 PROFESSOR: No, it's how many characters are less than g 839 00:42:29,430 --> 00:42:30,435 in the original string. 840 00:42:30,435 --> 00:42:32,560 How many things are going to distort underneath it? 841 00:42:32,560 --> 00:42:36,270 Where do the g's begin in the sorted version? 842 00:42:36,270 --> 00:42:39,380 The g's begin in row 6. 843 00:42:39,380 --> 00:42:40,840 OK? 844 00:42:40,840 --> 00:42:45,560 So OCC of g is 6. 845 00:42:45,560 --> 00:42:48,265 Is that-- are you getting hung up on that point? 846 00:42:48,265 --> 00:42:48,848 AUDIENCE: Yes. 847 00:42:48,848 --> 00:42:51,812 How do you know that without ever referencing back 848 00:42:51,812 --> 00:42:54,422 to the first 5 columns? 849 00:42:54,422 --> 00:42:57,486 PROFESSOR: Because when we build the index we remember. 850 00:42:57,486 --> 00:42:59,790 AUDIENCE: Oh, OK. 851 00:42:59,790 --> 00:43:02,630 PROFESSOR: So we have to, I haven't told you this, 852 00:43:02,630 --> 00:43:05,380 but I need to compute, I need to remember 853 00:43:05,380 --> 00:43:08,830 ways to compute Occ and count really quickly. 854 00:43:08,830 --> 00:43:12,250 Occ is represented by four values only. 855 00:43:12,250 --> 00:43:14,870 They're where the a's start, the c's start, the t's start, 856 00:43:14,870 --> 00:43:15,790 and the g's start. 
857 00:43:15,790 --> 00:43:17,937 That's all I need to know, four integers. 858 00:43:17,937 --> 00:43:18,437 OK? 859 00:43:21,360 --> 00:43:22,560 Are you happy with that? 860 00:43:25,070 --> 00:43:27,140 Occ, the a's start at 1. 861 00:43:27,140 --> 00:43:30,040 The c's start at 4 and the g's start at 6. 862 00:43:30,040 --> 00:43:30,540 OK? 863 00:43:30,540 --> 00:43:32,880 That's all I need to remember. 864 00:43:32,880 --> 00:43:35,100 But I precompute that. 865 00:43:35,100 --> 00:43:35,600 OK? 866 00:43:35,600 --> 00:43:36,575 Remember it. 867 00:43:36,575 --> 00:43:37,967 Are you happy with that no? 868 00:43:37,967 --> 00:43:38,550 AUDIENCE: Yes. 869 00:43:38,550 --> 00:43:39,270 PROFESSOR: OK. 870 00:43:39,270 --> 00:43:53,471 And then this business over here of count of zero and g. 871 00:43:53,471 --> 00:43:53,970 Right? 872 00:43:56,900 --> 00:43:59,500 Which is how many g's, what's the rank 873 00:43:59,500 --> 00:44:01,000 of this g in the right hand side? 874 00:44:05,160 --> 00:44:06,320 1. 875 00:44:06,320 --> 00:44:08,170 Minus 1 is 0. 876 00:44:08,170 --> 00:44:10,290 So that's 0. 877 00:44:12,810 --> 00:44:14,115 That's how we computed it. 878 00:44:14,115 --> 00:44:14,615 OK? 879 00:44:17,302 --> 00:44:18,760 These are great questions because I 880 00:44:18,760 --> 00:44:20,460 think they're foundational. 881 00:44:20,460 --> 00:44:22,248 Yeah. 882 00:44:22,248 --> 00:44:24,244 AUDIENCE: Occ and count are both precomputed 883 00:44:24,244 --> 00:44:30,731 as you're building on this last column [INAUDIBLE]. 884 00:44:30,731 --> 00:44:33,000 PROFESSOR: They are precomputed and well, 885 00:44:33,000 --> 00:44:36,890 I have not told you, see, you're sort of ahead of things 886 00:44:36,890 --> 00:44:39,849 a little bit in that I'd hoped to suspend disbelief 887 00:44:39,849 --> 00:44:42,140 and that I could actually build these very efficiently. 888 00:44:42,140 --> 00:44:43,764 But yes, they're built at the same time 889 00:44:43,764 --> 00:44:45,090 as the index is built. 890 00:44:45,090 --> 00:44:45,590 OK? 891 00:44:45,590 --> 00:44:47,490 But yes. 892 00:44:47,490 --> 00:44:50,857 OK and if I have Occ and count and I have the string, 893 00:44:50,857 --> 00:44:51,690 then I can get this. 894 00:44:51,690 --> 00:44:53,130 But I can still need to get to Dan's question 895 00:44:53,130 --> 00:44:55,140 because he has been very patient over there. 896 00:44:55,140 --> 00:44:56,840 He wants to know how I can use this 897 00:44:56,840 --> 00:44:59,880 to find sequence region of genome. 898 00:44:59,880 --> 00:45:01,530 And he's just being so good over there. 899 00:45:01,530 --> 00:45:02,780 I really appreciate that, Dan. 900 00:45:02,780 --> 00:45:05,620 Thanks so much for that. 901 00:45:05,620 --> 00:45:07,170 Are you happier? 902 00:45:07,170 --> 00:45:07,930 OK you're good. 903 00:45:07,930 --> 00:45:08,690 All right. 904 00:45:08,690 --> 00:45:12,600 So and this is what we just did. 905 00:45:12,600 --> 00:45:16,580 The walk left algorithm actually inverts the BWT 906 00:45:16,580 --> 00:45:19,340 by using this function to walk left 907 00:45:19,340 --> 00:45:22,020 and using that Python code up there, 908 00:45:22,020 --> 00:45:24,440 you can actually reconstruct the original string 909 00:45:24,440 --> 00:45:27,310 you started with. 910 00:45:27,310 --> 00:45:30,390 So it's just very simple. 911 00:45:30,390 --> 00:45:33,790 And we went through it on the board. 912 00:45:33,790 --> 00:45:37,117 Are there any questions about it at all? 
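For reference, a small sketch of that walk-left inversion, reusing bwt and lf from the sketch above; this is a paraphrase of the idea, not the code shown on the slide.

    def walk_left(bwt, lf):
        """Invert the BWT: start at row 0, whose suffix begins with '$', and
        repeatedly hop to the row of the character to its left in the text."""
        row, text = 0, "$"
        while bwt[row] != "$":
            text = bwt[row] + text         # prepend the preceding character
            row = lf(row, bwt[row])        # move to the row that starts with it
        return text

    print(walk_left(bwt, lf))              # "acaacg$"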
913 00:45:37,117 --> 00:45:37,700 AUDIENCE: Yes. 914 00:45:37,700 --> 00:45:42,155 Can you actually do this with any column? 915 00:45:42,155 --> 00:45:45,900 Like, why are you using the last column instead of like, 916 00:45:45,900 --> 00:45:53,490 why can't you just change it like the [INAUDIBLE], 917 00:45:53,490 --> 00:45:56,052 the equation and make it work for-- 918 00:45:56,052 --> 00:45:57,760 PROFESSOR: Because a very important thing 919 00:45:57,760 --> 00:46:00,200 is that this is actually a very important property, right? 920 00:46:00,200 --> 00:46:03,194 Which is that all the suffixes are sorted here. 921 00:46:03,194 --> 00:46:04,610 And if we didn't do that though, I 922 00:46:04,610 --> 00:46:08,300 couldn't answer Dan's question. 923 00:46:08,300 --> 00:46:09,650 And he'd be very upset. 924 00:46:09,650 --> 00:46:14,200 So please be respectful of his interest here. 925 00:46:14,200 --> 00:46:17,247 Now, the thing is, the beauty of this 926 00:46:17,247 --> 00:46:19,080 is that I have all these suffixes sorted 927 00:46:19,080 --> 00:46:21,210 and what you're about to see is the most amazing thing, which 928 00:46:21,210 --> 00:46:22,420 is that we're going to snap our fingers 929 00:46:22,420 --> 00:46:24,045 and, bang, we can map 200 million reads 930 00:46:24,045 --> 00:46:24,800 in no time at all. 931 00:46:27,630 --> 00:46:29,560 You like that? 932 00:46:29,560 --> 00:46:30,315 You're laughing. 933 00:46:30,315 --> 00:46:32,475 Oh, oh, that's not a good sign. 934 00:46:35,190 --> 00:46:37,579 Let's press ahead fearlessly, OK? 935 00:46:37,579 --> 00:46:39,870 And talk about how we're going to use this to map reads. 936 00:46:39,870 --> 00:46:41,720 So we're going to figure out how to use 937 00:46:41,720 --> 00:46:44,780 this index and this transform to rapidly align 938 00:46:44,780 --> 00:46:45,982 reads to a reference genome. 939 00:46:45,982 --> 00:46:48,190 And we're not talking about one read or 10 reads or 1 940 00:46:48,190 --> 00:46:48,770 million reads. 941 00:46:48,770 --> 00:46:50,853 We're talking about hundreds of millions of reads. 942 00:46:50,853 --> 00:46:55,270 So it has to be very efficient indeed, OK? 943 00:46:55,270 --> 00:46:58,720 So here's the essential idea. 944 00:46:58,720 --> 00:47:01,710 There's the core algorithm on the slide. 945 00:47:01,710 --> 00:47:06,510 Which is that what we do is we take the original query that we 946 00:47:06,510 --> 00:47:09,200 have, the read that we're trying to match, 947 00:47:09,200 --> 00:47:10,950 and we're going to process it backwards 948 00:47:10,950 --> 00:47:14,700 from the end of the read forwards. 949 00:47:14,700 --> 00:47:19,050 And we begin by considering all possible suffixes from row zero 950 00:47:19,050 --> 00:47:21,305 to, in this case, it would be row seven. 951 00:47:24,620 --> 00:47:28,760 Which is the length of the entire transform. 952 00:47:28,760 --> 00:47:36,120 And we iterate and in each iteration 953 00:47:36,120 --> 00:47:41,320 we consider suffixes that match the query. 954 00:47:41,320 --> 00:47:43,810 So in the first step, right here, 955 00:47:43,810 --> 00:47:49,060 let me see if I can get my pointer working, there we are. 956 00:47:49,060 --> 00:47:52,790 So in the first step here, we're matching this c. 957 00:47:52,790 --> 00:47:54,450 OK?
958 00:47:54,450 --> 00:47:57,070 And we compute the LF of the top, which 959 00:47:57,070 --> 00:48:01,020 is this row and of the bottom, which is down here pointing off 960 00:48:01,020 --> 00:48:05,010 the end, and that takes us to the first d here 961 00:48:05,010 --> 00:48:06,740 and to this point. 962 00:48:06,740 --> 00:48:09,620 Here are the two c's that could be 963 00:48:09,620 --> 00:48:14,090 possible matches to our query, which ends in a c. 964 00:48:16,880 --> 00:48:21,350 We then say, oh, the next character we have to match 965 00:48:21,350 --> 00:48:23,150 is an a. 966 00:48:23,150 --> 00:48:27,820 So we look here at the a we need to match, 967 00:48:27,820 --> 00:48:33,740 and starting from this row, which is row four, 968 00:48:33,740 --> 00:48:38,650 and this row, which is row six, we compute the LF of each one 969 00:48:38,650 --> 00:48:45,320 of these to figure out what rows in a precedes these c's. 970 00:48:49,530 --> 00:48:53,530 And the way we compute the LF is that we use the character a 971 00:48:53,530 --> 00:48:57,280 to be able to figure out which rows have the a preceding 972 00:48:57,280 --> 00:48:59,630 the c. 973 00:48:59,630 --> 00:49:02,820 You can see, when we compute those LF functions, 974 00:49:02,820 --> 00:49:06,530 what we wind up with are these rows where 975 00:49:06,530 --> 00:49:09,150 we have a followed by c. 976 00:49:09,150 --> 00:49:13,530 So we're beginning to match the end of our read, 977 00:49:13,530 --> 00:49:16,970 as we go from right to left. 978 00:49:16,970 --> 00:49:19,220 We then compute the same thing once again, 979 00:49:19,220 --> 00:49:21,880 considering the first a and ask what 980 00:49:21,880 --> 00:49:32,520 rows are going to allow us to put this a in front of the ac 981 00:49:32,520 --> 00:49:37,170 to form our entire read. 982 00:49:37,170 --> 00:49:41,600 And we compute the LF once again of these things. 983 00:49:41,600 --> 00:49:43,230 And you can see that here it takes us 984 00:49:43,230 --> 00:49:48,770 to this specific row aac. 985 00:49:48,770 --> 00:49:54,150 So that row represents a suffix that 986 00:49:54,150 --> 00:49:58,590 is matching our query exactly. 987 00:49:58,590 --> 00:50:04,570 So we iterate this loop to be able to match 988 00:50:04,570 --> 00:50:08,850 a read against the index. 989 00:50:08,850 --> 00:50:13,210 And we're using the LF function to do it. 990 00:50:13,210 --> 00:50:16,640 And it's a really beautiful algorithm. 991 00:50:16,640 --> 00:50:19,382 And remember, we only have the transform. 992 00:50:19,382 --> 00:50:20,965 We don't have the rest of this matrix. 993 00:50:23,890 --> 00:50:29,966 So before I press ahead and talk about other details, 994 00:50:29,966 --> 00:50:32,340 I think it's important to observe a couple of things that 995 00:50:32,340 --> 00:50:35,480 are a little bit counterintuitive about this. 996 00:50:35,480 --> 00:50:37,230 One counterintuitive aspect of it 997 00:50:37,230 --> 00:50:42,820 is, that when I'm over here for example, and for example 998 00:50:42,820 --> 00:50:48,000 when I'm computing the LF here, I'm computing the LF of row two 999 00:50:48,000 --> 00:50:50,470 with respect to a. 1000 00:50:50,470 --> 00:50:53,060 But there's a dollar sign there. 1001 00:50:53,060 --> 00:50:54,260 Right? 
1002 00:50:54,260 --> 00:50:58,880 So I'm using this to the LF function, 1003 00:50:58,880 --> 00:51:07,060 to tell me where a suffix would be that actually follows 1004 00:51:07,060 --> 00:51:10,660 my constraint of having to have an a be the prefix of ac, 1005 00:51:10,660 --> 00:51:13,710 where I am right now. 1006 00:51:13,710 --> 00:51:16,750 This code is actually not fake code. 1007 00:51:16,750 --> 00:51:19,970 It's the actual code that's in a matcher, 1008 00:51:19,970 --> 00:51:24,230 for matching a read against the index. 1009 00:51:26,830 --> 00:51:28,830 Now let me just stop right here for a second 1010 00:51:28,830 --> 00:51:30,371 and see if there any other questions. 1011 00:51:30,371 --> 00:51:32,640 Dan is getting now his answer to his question, right? 1012 00:51:32,640 --> 00:51:35,200 About how you actually use this for matching reads. 1013 00:51:35,200 --> 00:51:37,800 You do this once for every read. 1014 00:51:37,800 --> 00:51:41,660 And it is linear time. 1015 00:51:41,660 --> 00:51:42,510 Right? 1016 00:51:42,510 --> 00:51:44,750 It's the length of the read itself 1017 00:51:44,750 --> 00:51:48,410 is all the time it takes to match in a huge genome. 1018 00:51:48,410 --> 00:51:51,120 So once we've built the index of the genome, 1019 00:51:51,120 --> 00:51:53,530 in fact, most of the time when you're 1020 00:51:53,530 --> 00:51:56,130 doing this sort of mapping, you don't build the index. 1021 00:51:56,130 --> 00:51:59,620 You download the index off of a website. 1022 00:51:59,620 --> 00:52:02,760 And so you don't have to pay for the time to build this index. 1023 00:52:02,760 --> 00:52:05,330 You just download the index and you take your reads 1024 00:52:05,330 --> 00:52:07,720 and the time to match all of your sequencing 1025 00:52:07,720 --> 00:52:10,930 reads against a certain build of whatever genome you're using 1026 00:52:10,930 --> 00:52:15,730 is simply linear in the number of bases you have. 1027 00:52:15,730 --> 00:52:16,960 Questions? 1028 00:52:16,960 --> 00:52:17,460 Yes. 1029 00:52:17,460 --> 00:52:19,805 And say your name and the question, please. 1030 00:52:19,805 --> 00:52:21,470 AUDIENCE: How did [INAUDIBLE] come up 1031 00:52:21,470 --> 00:52:22,595 with intuition [INAUDIBLE]? 1032 00:52:22,595 --> 00:52:24,882 It seems like they just pulled it out of a hat. 1033 00:52:24,882 --> 00:52:27,090 PROFESSOR: You know, I asked Mike that the other day. 1034 00:52:27,090 --> 00:52:29,470 I saw him in a meeting and he sort of surprised 1035 00:52:29,470 --> 00:52:33,450 at how this has taken off. 1036 00:52:33,450 --> 00:52:39,217 And he told me some other interesting facts 1037 00:52:39,217 --> 00:52:41,050 about this, which you probably could deduce. 1038 00:52:41,050 --> 00:52:43,420 Which is that if you only want to match 1039 00:52:43,420 --> 00:52:45,250 reads that are four long, you only 1040 00:52:45,250 --> 00:52:48,250 have to sort this matrix by the first four characters. 1041 00:52:50,810 --> 00:52:54,260 But there are other little tricks you can play here. 1042 00:52:54,260 --> 00:52:55,500 Any other questions? 1043 00:52:55,500 --> 00:52:56,290 Yes. 1044 00:52:56,290 --> 00:52:57,300 AUDIENCE: I'm Deborah. 1045 00:52:57,300 --> 00:53:00,370 What is the FM index? 1046 00:53:00,370 --> 00:53:04,070 PROFESSOR: What is the FM index. 
1047 00:53:04,070 --> 00:53:05,504 Well, the guys who thought this up 1048 00:53:05,504 --> 00:53:06,920 have the last initials of F and M, 1049 00:53:06,920 --> 00:53:09,720 but that's not what it stands for, 1050 00:53:09,720 --> 00:53:11,360 contrary to popular opinion. 1051 00:53:11,360 --> 00:53:16,900 It stands for full-text minute space. 1052 00:53:16,900 --> 00:53:18,090 That's what they claim. 1053 00:53:18,090 --> 00:53:21,200 So if you hear people talking about full-text minute space 1054 00:53:21,200 --> 00:53:24,800 indices, or FM indices, the FM index 1055 00:53:24,800 --> 00:53:30,060 is actually the part that was being asked over here, the Occ 1056 00:53:30,060 --> 00:53:33,830 part and the LF part, how you actually compute those quickly. 1057 00:53:33,830 --> 00:53:37,910 That was what F and M contributed to this but, generically, when 1058 00:53:37,910 --> 00:53:40,780 we're talking about this style of indexing, 1059 00:53:40,780 --> 00:53:44,420 it's called FM indexing or you might hear, I'm using a BWT. 1060 00:53:44,420 --> 00:53:46,460 Some people will say that. 1061 00:53:46,460 --> 00:53:50,350 But that's what FM stands for. 1062 00:53:50,350 --> 00:53:53,120 Does that answer your question? 1063 00:53:53,120 --> 00:53:53,620 Excellent. 1064 00:53:53,620 --> 00:53:55,010 OK. 1065 00:53:55,010 --> 00:53:55,771 All right. 1066 00:53:55,771 --> 00:53:57,270 Any-- these are all great questions. 1067 00:53:57,270 --> 00:53:59,184 Yes. 1068 00:53:59,184 --> 00:54:00,100 AUDIENCE: [INAUDIBLE]. 1069 00:54:08,780 --> 00:54:10,750 PROFESSOR: Oh, you don't know that a 1070 00:54:10,750 --> 00:54:15,220 and c are there, except that remember, 1071 00:54:15,220 --> 00:54:20,440 if you look at the way that this is working, 1072 00:54:20,440 --> 00:54:24,660 you're not actually reconstructing strings, 1073 00:54:24,660 --> 00:54:27,081 you're only trying to find them. 1074 00:54:27,081 --> 00:54:27,580 Right? 1075 00:54:27,580 --> 00:54:29,230 And so at the end, top and bottom 1076 00:54:29,230 --> 00:54:33,760 are going to point to the row that contains the suffix where 1077 00:54:33,760 --> 00:54:35,310 your original read was. 1078 00:54:35,310 --> 00:54:37,310 And now your next question is going to be, 1079 00:54:37,310 --> 00:54:38,910 where is that in the genome? 1080 00:54:38,910 --> 00:54:40,110 This doesn't do me any good. 1081 00:54:40,110 --> 00:54:43,300 I mean, the number 1 doesn't help me out here, 1082 00:54:43,300 --> 00:54:46,530 doesn't mean anything. 1083 00:54:46,530 --> 00:54:48,410 Not good, right? 1084 00:54:48,410 --> 00:54:52,230 So where is it in the genome is the next question. 1085 00:54:52,230 --> 00:54:55,390 So we'll get to that in a second. 1086 00:54:55,390 --> 00:54:57,700 What happens if you give me a read that 1087 00:54:57,700 --> 00:55:00,761 doesn't match anywhere in this index? 1088 00:55:00,761 --> 00:55:03,010 Well if you give me a read that doesn't match anywhere 1089 00:55:03,010 --> 00:55:05,720 in this index, what happens is the top and bottom 1090 00:55:05,720 --> 00:55:08,010 become the same. 1091 00:55:08,010 --> 00:55:12,710 So once top and bottom become the same, it's a failed lookup. 1092 00:55:12,710 --> 00:55:13,210 All right? 1093 00:55:16,000 --> 00:55:22,290 And that's because the suffix doesn't exist in the index. 1094 00:55:22,290 --> 00:55:25,120 And once top and bottom become the same, 1095 00:55:25,120 --> 00:55:27,980 they remain the same throughout that loop. 1096 00:55:27,980 --> 00:55:29,070 Yes.
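Here is a sketch of the matching loop just described, again reusing bwt and OCC from the earlier sketch, and again a paraphrase rather than the actual matcher code. It returns the range [top, bottom) of rows whose suffixes begin with the query; top equal to bottom is the failed lookup just mentioned.

    def backward_search(query, bwt, occ):
        """Backward (right-to-left) exact matching against the FM index."""
        top, bottom = 0, len(bwt)
        for c in reversed(query):
            top = occ[c] + bwt[:top].count(c)        # naive rank stands in for count
            bottom = occ[c] + bwt[:bottom].count(c)
            if top == bottom:                        # no suffix matches: failed lookup
                break
        return top, bottom

    print(backward_search("aac", bwt, OCC))          # (1, 2): one hit, in row 1
    print(backward_search("gg", bwt, OCC))           # top == bottom: no match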
1097 00:55:29,070 --> 00:55:30,519 AUDIENCE: I'm Sally. 1098 00:55:30,519 --> 00:55:32,451 My main question is that this doesn't 1099 00:55:32,451 --> 00:55:34,866 provide any leeway for errors. 1100 00:55:34,866 --> 00:55:38,807 You kind of have to be able to present all of your rates. 1101 00:55:38,807 --> 00:55:40,640 PROFESSOR: Sally, you're absolutely correct. 1102 00:55:40,640 --> 00:55:42,140 I'm so glad you asked that question. 1103 00:55:42,140 --> 00:55:44,960 Your observation is it does not provide 1104 00:55:44,960 --> 00:55:46,880 any leeway for mismatches. 1105 00:55:46,880 --> 00:55:49,420 And so unlike all the other algorithms we study, 1106 00:55:49,420 --> 00:55:51,360 which had these very nice matrices and ability 1107 00:55:51,360 --> 00:55:54,400 to assign weights to mismatches, this 1108 00:55:54,400 --> 00:55:56,235 is only doing exact matching. 1109 00:55:56,235 --> 00:55:57,860 And so what you need help understanding 1110 00:55:57,860 --> 00:56:00,320 is, how we can deal with mismatches 1111 00:56:00,320 --> 00:56:01,570 in the presence of this. 1112 00:56:01,570 --> 00:56:05,390 And I'll get to that in less than 10 minutes. 1113 00:56:05,390 --> 00:56:10,570 And it won't be quite as elegant as what you saw from Professor 1114 00:56:10,570 --> 00:56:12,530 Berg but it's what everybody does. 1115 00:56:12,530 --> 00:56:18,250 So that's my only excuse for it OK? 1116 00:56:18,250 --> 00:56:19,240 Yes. 1117 00:56:19,240 --> 00:56:22,890 AUDIENCE: What is the bottom set of arrows doing? 1118 00:56:22,890 --> 00:56:24,260 What's its significance? 1119 00:56:24,260 --> 00:56:26,176 PROFESSOR: The significance of top and bottom, 1120 00:56:26,176 --> 00:56:27,470 that's a great question. 1121 00:56:27,470 --> 00:56:29,270 What the significance of top and bottom? 1122 00:56:29,270 --> 00:56:33,890 Top and bottom bracket in that original matrix, 1123 00:56:33,890 --> 00:56:38,080 the suffixes that are matching the original query. 1124 00:56:38,080 --> 00:56:39,000 OK? 1125 00:56:39,000 --> 00:56:42,795 And so between top and bottom minus 1 1126 00:56:42,795 --> 00:56:46,000 are all of the rows that have a suffix that 1127 00:56:46,000 --> 00:56:48,870 match our original query. 1128 00:56:48,870 --> 00:56:53,250 And if top equals bottom, there are no matching suffixes. 1129 00:56:53,250 --> 00:56:55,104 But assuming there are matching suffixes, 1130 00:56:55,104 --> 00:56:57,020 those are rows that contain a matching suffix. 1131 00:56:57,020 --> 00:57:00,480 And as we progress along, top and bottom change 1132 00:57:00,480 --> 00:57:03,410 as the suffixes change as we expand the suffix 1133 00:57:03,410 --> 00:57:05,790 to contain more and more bases. 1134 00:57:05,790 --> 00:57:07,720 OK? 1135 00:57:07,720 --> 00:57:09,150 OK. 1136 00:57:09,150 --> 00:57:11,380 Any other questions? 1137 00:57:11,380 --> 00:57:12,140 OK. 1138 00:57:12,140 --> 00:57:17,410 So now back to the question over here, which is that OK, 1139 00:57:17,410 --> 00:57:22,970 I know that we've matched, and I know that we have this hit. 1140 00:57:22,970 --> 00:57:27,460 The question is, where is it in the Genome because the fact 1141 00:57:27,460 --> 00:57:31,510 that it matched row one of my BWT matrix 1142 00:57:31,510 --> 00:57:34,640 does me absolutely no good at all. 1143 00:57:34,640 --> 00:57:35,224 Right? 1144 00:57:35,224 --> 00:57:37,390 Anybody have any ideas about how we could figure out 1145 00:57:37,390 --> 00:57:39,010 where it is the genome? 
1146 00:57:39,010 --> 00:57:42,850 Given what I've told you so far, which is that you have the BWT, 1147 00:57:42,850 --> 00:57:45,055 you have Occ, and you have count, 1148 00:57:45,055 --> 00:57:46,555 and you can compute the LF function. 1149 00:57:49,620 --> 00:57:52,400 Any ideas? 1150 00:57:52,400 --> 00:57:53,750 Doesn't matter how slow it is. 1151 00:58:01,110 --> 00:58:04,330 OK well how could I figure out what 1152 00:58:04,330 --> 00:58:06,470 came before aac in the genome? 1153 00:58:06,470 --> 00:58:07,330 Yes. 1154 00:58:07,330 --> 00:58:09,063 AUDIENCE: So, for, like at the beginning, 1155 00:58:09,063 --> 00:58:13,082 we rebuilt this whole string, starting at the end. 1156 00:58:13,082 --> 00:58:15,814 You could rebuild the whole string starting at, 1157 00:58:15,814 --> 00:58:19,150 rebuild the whole genome starting at the 1 and see-- 1158 00:58:19,150 --> 00:58:22,090 PROFESSOR: You could rebuild the entire genome that 1159 00:58:22,090 --> 00:58:24,911 prepends or occurs before aac, right? 1160 00:58:24,911 --> 00:58:25,660 AUDIENCE: Exactly. 1161 00:58:25,660 --> 00:58:26,451 PROFESSOR: Exactly. 1162 00:58:26,451 --> 00:58:27,530 So that's what we can do. 1163 00:58:27,530 --> 00:58:32,730 We could actually do our walk left algorithm. 1164 00:58:32,730 --> 00:58:35,230 We can walk left from there, find out 1165 00:58:35,230 --> 00:58:38,020 that we go two steps until we hit dollar sign, 1166 00:58:38,020 --> 00:58:40,530 and therefore, the offset is two, 1167 00:58:40,530 --> 00:58:43,070 where it occurs in the genome. 1168 00:58:43,070 --> 00:58:47,137 So we can give a match position by walking left. 1169 00:58:47,137 --> 00:58:48,720 Does everybody see that, that we could 1170 00:58:48,720 --> 00:58:52,160 walk left to figure where it is? 1171 00:58:52,160 --> 00:58:53,700 It's not fast, but it works. 1172 00:58:53,700 --> 00:58:54,870 Yes. 1173 00:58:54,870 --> 00:58:55,640 AUDIENCE: I'm Ted. 1174 00:58:55,640 --> 00:58:56,431 PROFESSOR: Hi, Ted. 1175 00:58:56,431 --> 00:59:02,186 AUDIENCE: So now our function first has to take the read 1176 00:59:02,186 --> 00:59:06,210 and it has to align it and the same, 1177 00:59:06,210 --> 00:59:13,028 where the position is built into the end of the function, 1178 00:59:13,028 --> 00:59:18,873 the speed of the function is now dependent on position as well. 1179 00:59:18,873 --> 00:59:19,779 Is that right? 1180 00:59:19,779 --> 00:59:22,050 Because the longer it takes, 1181 00:59:22,050 --> 00:59:23,710 PROFESSOR: I was being a little bit 1182 00:59:23,710 --> 00:59:25,720 glib find if it matches or not this linear time. 1183 00:59:25,720 --> 00:59:26,840 Now you're saying, hey, wait a minute. 1184 00:59:26,840 --> 00:59:28,548 I want to know where it is in the genome. 1185 00:59:28,548 --> 00:59:29,760 That's a big bonus, right? 1186 00:59:29,760 --> 00:59:31,500 And so you'd like to know where that is? 1187 00:59:31,500 --> 00:59:32,280 Yes. 1188 00:59:32,280 --> 00:59:34,320 But I still can do that in linear time. 1189 00:59:34,320 --> 00:59:36,236 And we'll show you how to do that in a second. 1190 00:59:36,236 --> 00:59:37,970 This is not linear time. 1191 00:59:37,970 --> 00:59:39,467 This actually needs to walk back, 1192 00:59:39,467 --> 00:59:41,550 order the length of genome for every single query. 1193 00:59:41,550 --> 00:59:43,660 That's not good. 1194 00:59:43,660 --> 00:59:45,350 All right? 
1195 00:59:45,350 --> 00:59:45,850 Well, 1196 00:59:45,850 --> 00:59:49,320 what we could do is, we could store what's called a suffix 1197 00:59:49,320 --> 00:59:51,970 array with each row and say, where in the genome 1198 00:59:51,970 --> 00:59:54,070 that position is. 1199 00:59:54,070 --> 00:59:56,780 Where that row starts. 1200 00:59:56,780 --> 00:59:58,370 And then make a simple look-up, 1201 00:59:58,370 --> 01:00:00,827 so that when you actually have a hit in row one, ah, OK, 1202 01:00:00,827 --> 01:00:02,410 it starts at position two of the genome. 1203 01:00:04,940 --> 01:00:07,810 But then the problem with that is 1204 01:00:07,810 --> 01:00:13,170 that it actually takes a lot of space. 1205 01:00:13,170 --> 01:00:15,900 And we want to have compact indices. 1206 01:00:15,900 --> 01:00:18,800 So the trick is what we do is, instead of storing 1207 01:00:18,800 --> 01:00:21,280 that entire suffix array, we store 1208 01:00:21,280 --> 01:00:24,630 every so many rows, like every 25 rows. 1209 01:00:24,630 --> 01:00:26,370 And all we do is, we walk left until we 1210 01:00:26,370 --> 01:00:29,250 hit a row that actually has the value, 1211 01:00:29,250 --> 01:00:32,860 and then we add how many times we walked left, plus the value 1212 01:00:32,860 --> 01:00:36,380 and we know where we are in the genome. 1213 01:00:36,380 --> 01:00:40,120 So we can sample the suffix array, 1214 01:00:40,120 --> 01:00:41,780 and by sampling the suffix array, 1215 01:00:41,780 --> 01:00:46,472 we cut down our storage hugely and it's still 1216 01:00:46,472 --> 01:00:47,180 pretty efficient. 1217 01:00:47,180 --> 01:00:48,596 Because what we can do is, we just 1218 01:00:48,596 --> 01:00:51,775 walk left until we hit a sampled suffix array location 1219 01:00:51,775 --> 01:00:55,441 and then add the two numbers together. 1220 01:00:55,441 --> 01:00:55,940 All right? 1221 01:00:58,880 --> 01:01:01,250 So that's how it's done. 1222 01:01:01,250 --> 01:01:03,500 OK? 1223 01:01:03,500 --> 01:01:08,750 So that's how we actually do the alignment 1224 01:01:08,750 --> 01:01:10,320 and figure out where things are. 1225 01:01:10,320 --> 01:01:12,440 The one thing I haven't told you about 1226 01:01:12,440 --> 01:01:16,170 is how to compute count efficiently. 1227 01:01:16,170 --> 01:01:18,160 Now remember what count does. 1228 01:01:18,160 --> 01:01:22,450 Count is a function-- but this is putting it 1229 01:01:22,450 --> 01:01:25,610 all together where we're matching this query, 1230 01:01:25,610 --> 01:01:26,980 we do the steps. 1231 01:01:26,980 --> 01:01:28,400 We get the match. 1232 01:01:28,400 --> 01:01:30,817 Then we do walk left once and then 1233 01:01:30,817 --> 01:01:33,150 we look at the suffix array to figure out where we are, right? 1234 01:01:36,360 --> 01:01:40,880 The business about count is that what we need to do 1235 01:01:40,880 --> 01:01:44,930 is to figure out the rank of a particular base 1236 01:01:44,930 --> 01:01:47,401 in a position in the transform. 1237 01:01:47,401 --> 01:01:49,650 And one way to do that is to go backwards through the whole 1238 01:01:49,650 --> 01:01:53,100 transform, counting how many g's occur before this one, 1239 01:01:53,100 --> 01:01:56,950 and that's very expensive, to compute the rank of this 1240 01:01:56,950 --> 01:01:57,577 particular g. 1241 01:01:57,577 --> 01:01:59,660 Remember the rank is simply the number of g's that 1242 01:01:59,660 --> 01:02:03,110 occur before this one in the BWT. 1243 01:02:03,110 --> 01:02:05,300 Very simple metric.
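Before moving on to count, here is a sketch of that sampled suffix array lookup on the same toy example, reusing bwt and lf from the earlier sketches. Which rows keep their suffix array values is chosen arbitrarily here; a real index samples every so many rows as just described.

    # The full suffix array of "acaacg$" would be [6, 2, 0, 3, 1, 4, 5];
    # keep only a subset of rows to save space.
    sa_sample = {0: 6, 2: 0, 4: 1, 6: 5}

    def locate(row):
        """Walk left until we land on a row whose suffix array value was kept,
        then add the number of steps we walked."""
        steps = 0
        while row not in sa_sample:
            row = lf(row, bwt[row])
            steps += 1
        return sa_sample[row] + steps

    print(locate(1))    # 2: the hit for "aac" starts at offset 2 of "acaacg"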
1244 01:02:05,300 --> 01:02:07,250 So instead of doing that, what we 1245 01:02:07,250 --> 01:02:12,500 can do is, we can build a data structure that every once 1246 01:02:12,500 --> 01:02:15,000 in a while, counts how many a's, c's, g's, 1247 01:02:15,000 --> 01:02:19,000 and t's have occurred before now in the BWT. 1248 01:02:19,000 --> 01:02:22,400 And so we're going to sample this with these checkpoints, 1249 01:02:22,400 --> 01:02:26,040 and then when you want to compute count at any point, 1250 01:02:26,040 --> 01:02:29,240 you can go to the nearest checkpoint, wherever that is, 1251 01:02:29,240 --> 01:02:31,340 and make an adjustment by counting 1252 01:02:31,340 --> 01:02:35,240 the number of characters between you and that checkpoint. 1253 01:02:35,240 --> 01:02:36,810 Very straightforward. 1254 01:02:36,810 --> 01:02:38,380 All right. 1255 01:02:38,380 --> 01:02:43,840 So this, coming back to the question that was asked, yes, 1256 01:02:43,840 --> 01:02:46,470 you need to build this checkpointing 1257 01:02:46,470 --> 01:02:49,510 mechanism at the same time you build the index, as well as 1258 01:02:49,510 --> 01:02:51,880 the sampling of the suffix array. 1259 01:02:51,880 --> 01:02:54,960 So a full index consists of the transform 1260 01:02:54,960 --> 01:02:58,650 itself, which is the genome transformed into its BWT. 1261 01:03:01,250 --> 01:03:04,620 And they literally take the entire genome and do this. 1262 01:03:04,620 --> 01:03:08,910 Typically they'll put dollar signs between the chromosomes. 1263 01:03:08,910 --> 01:03:10,620 So they'll transform the whole thing. 1264 01:03:10,620 --> 01:03:12,850 It takes a sampling of the suffix array 1265 01:03:12,850 --> 01:03:18,550 we just saw and it takes the checkpointing of the LF 1266 01:03:18,550 --> 01:03:21,650 function to make it constant time. 1267 01:03:21,650 --> 01:03:26,090 And that's what is inside of an FM index. 1268 01:03:26,090 --> 01:03:28,550 OK? 1269 01:03:28,550 --> 01:03:32,450 Now it's small, which is one of the nice things, 1270 01:03:32,450 --> 01:03:35,560 compared to things like suffix trees, 1271 01:03:35,560 --> 01:03:39,610 suffix arrays, or even other kinds of hash structures 1272 01:03:39,610 --> 01:03:43,040 for looking for seeds, it really is not even twice 1273 01:03:43,040 --> 01:03:44,810 the size of the genome. 1274 01:03:44,810 --> 01:03:49,667 So it's a very compact index that is very, very efficient. 1275 01:03:49,667 --> 01:03:51,250 And so it's a wonderful data structure 1276 01:03:51,250 --> 01:03:53,530 for doing what we're doing, except we have not 1277 01:03:53,530 --> 01:03:56,850 dealt with mismatches yet, right? 1278 01:03:56,850 --> 01:03:59,072 And so once again, I want to put a plug 1279 01:03:59,072 --> 01:04:02,250 in for BWA, which is really a marvelous aligner. 1280 01:04:02,250 --> 01:04:05,360 And we'll talk about it tomorrow in recitation 1281 01:04:05,360 --> 01:04:08,140 if you want to know all of the details of what it actually 1282 01:04:08,140 --> 01:04:11,350 takes to make this work in practice. 1283 01:04:11,350 --> 01:04:17,160 Now, it finds exact matches quickly, 1284 01:04:17,160 --> 01:04:21,300 but it doesn't really have any allowances for mismatches. 1285 01:04:21,300 --> 01:04:25,260 And the way that Bowtie and other aligners deal with this, 1286 01:04:25,260 --> 01:04:26,910 and they're all pretty consistent, 1287 01:04:26,910 --> 01:04:31,130 is in the following way, which is that they do backtracking. 1288 01:04:31,130 --> 01:04:33,210 Here's the idea.
1289 01:04:33,210 --> 01:04:39,440 You try and match something or match a read 1290 01:04:39,440 --> 01:04:42,272 and you get to a particular point in the read, 1291 01:04:42,272 --> 01:04:43,480 and you can't go any further. 1292 01:04:43,480 --> 01:04:46,770 Top is equal to bottom. 1293 01:04:46,770 --> 01:04:48,750 So you know that there's no suffix 1294 01:04:48,750 --> 01:04:52,370 in the genome that matches your query. 1295 01:04:52,370 --> 01:04:54,830 So what do you do? 1296 01:04:54,830 --> 01:04:58,130 Well, what you can do is you can try all of the different bases 1297 01:04:58,130 --> 01:05:00,110 at that position besides the one you 1298 01:05:00,110 --> 01:05:02,110 tried to see whether it matches or not. 1299 01:05:02,110 --> 01:05:04,450 I can see the horror coming over people. 1300 01:05:04,450 --> 01:05:06,940 Oh, no, not backtracking, not that. 1301 01:05:06,940 --> 01:05:12,640 But sometimes it actually works. 1302 01:05:12,640 --> 01:05:15,760 And just to give you order of magnitude idea about how this 1303 01:05:15,760 --> 01:05:18,370 works in practice, when reads don't match, 1304 01:05:18,370 --> 01:05:22,390 they limit backtracking to about 125 times in these aligners. 1305 01:05:22,390 --> 01:05:25,070 so they try pretty hard to actually match things. 1306 01:05:25,070 --> 01:05:29,260 And yes, it is true that even with this backtracking, 1307 01:05:29,260 --> 01:05:32,450 it's still a great approach. 1308 01:05:32,450 --> 01:05:35,700 And sometimes the first thing you try doesn't work, 1309 01:05:35,700 --> 01:05:38,840 and you have to backtrack, trying multiple bases 1310 01:05:38,840 --> 01:05:41,560 at that location until you get one that matches. 1311 01:05:41,560 --> 01:05:43,210 And then you can proceed. 1312 01:05:43,210 --> 01:05:46,670 OK And you eventually wind up at the alignment 1313 01:05:46,670 --> 01:05:49,030 you see in the lower right hand corner, where you're 1314 01:05:49,030 --> 01:05:54,170 substituting a g for an a, an a for a g, excuse me, 1315 01:05:54,170 --> 01:05:56,870 to make it go forward. 1316 01:05:56,870 --> 01:05:59,120 Do people understand the essential idea 1317 01:05:59,120 --> 01:06:04,370 of this idea of backtracking? 1318 01:06:04,370 --> 01:06:09,540 Does anybody have any comments or questions about it? 1319 01:06:09,540 --> 01:06:12,490 Like ew, or ideas? 1320 01:06:12,490 --> 01:06:12,990 Yes. 1321 01:06:12,990 --> 01:06:14,340 AUDIENCE: What about gaps? 1322 01:06:14,340 --> 01:06:15,465 PROFESSOR: What about gaps? 1323 01:06:19,340 --> 01:06:22,680 BWA, I believe, processes gaps. 1324 01:06:22,680 --> 01:06:27,720 But gaps are much, much less likely than missed bases. 1325 01:06:27,720 --> 01:06:31,740 The other thing is that if you're doing a sequencing 1326 01:06:31,740 --> 01:06:35,850 library, and you have a read that actually has a gap in it, 1327 01:06:35,850 --> 01:06:39,460 it's probably the case you have another read that doesn't. 1328 01:06:39,460 --> 01:06:41,460 For the same sequence. 1329 01:06:41,460 --> 01:06:45,100 So it is less important to process gaps 1330 01:06:45,100 --> 01:06:48,590 than it is to process differences. 1331 01:06:48,590 --> 01:06:52,920 The reason is that differences mean 1332 01:06:52,920 --> 01:06:56,070 that it might be a difference of an allele. 1333 01:06:56,070 --> 01:06:58,320 In other words, it might be that your base 1334 01:06:58,320 --> 01:07:00,860 is different than the reference genome. 1335 01:07:00,860 --> 01:07:03,484 Indels are also possible. 
1336 01:07:03,484 --> 01:07:04,900 And there are different strategies 1337 01:07:04,900 --> 01:07:05,816 of dealing with those. 1338 01:07:05,816 --> 01:07:09,189 That would be a great question for Hang tomorrow about gaps. 1339 01:07:09,189 --> 01:07:11,230 Because he can tell you in practice what they do. 1340 01:07:11,230 --> 01:07:12,660 And we'll get into a little bit of that 1341 01:07:12,660 --> 01:07:13,790 at the end of today's lecture. 1342 01:07:13,790 --> 01:07:14,289 Yes. 1343 01:07:14,289 --> 01:07:15,534 Question? 1344 01:07:15,534 --> 01:07:16,358 AUDIENCE: I'm Levy. 1345 01:07:16,358 --> 01:07:17,090 PROFESSOR: Hi, Levy. 1346 01:07:17,090 --> 01:07:17,690 AUDIENCE: How do you make sure when 1347 01:07:17,690 --> 01:07:18,430 you're backtracking that you end up 1348 01:07:18,430 --> 01:07:19,638 with the best possible match? 1349 01:07:19,638 --> 01:07:22,037 Do you just go down the first-- 1350 01:07:22,037 --> 01:07:23,620 PROFESSOR: The questions is how do you 1351 01:07:23,620 --> 01:07:26,690 guarantee you wind up with the best possible match? 1352 01:07:26,690 --> 01:07:28,991 The short answer is that you don't. 1353 01:07:28,991 --> 01:07:30,490 There's a longer answer, which we're 1354 01:07:30,490 --> 01:07:34,200 about to get to, about how we try to approximate that. 1355 01:07:34,200 --> 01:07:38,280 And what judgment you would use to get to what we would think 1356 01:07:38,280 --> 01:07:40,220 is a practically good match. 1357 01:07:40,220 --> 01:07:41,100 OK? 1358 01:07:41,100 --> 01:07:42,870 But in terms of theoretically optimal, 1359 01:07:42,870 --> 01:07:45,632 the answer is, it doesn't attempt to do that. 1360 01:07:45,632 --> 01:07:46,590 That's a good question. 1361 01:07:46,590 --> 01:07:48,000 Yes. 1362 01:07:48,000 --> 01:07:50,375 AUDIENCE: In practice, does this backtracking method 1363 01:07:50,375 --> 01:07:54,660 at the same time as you're computing the matches or-- 1364 01:07:54,660 --> 01:07:55,590 PROFESSOR: Yes. 1365 01:07:55,590 --> 01:07:57,506 So what's happening is, you remember that loop 1366 01:07:57,506 --> 01:07:59,324 where we're going around, where we 1367 01:07:59,324 --> 01:08:00,990 were moving the top and bottom pointers. 1368 01:08:00,990 --> 01:08:03,130 If you get to a point where they come together, 1369 01:08:03,130 --> 01:08:07,300 then you would at that point, begin backtracking 1370 01:08:07,300 --> 01:08:09,280 and try different bases. 1371 01:08:09,280 --> 01:08:11,380 And if you look, I posted the BWA paper 1372 01:08:11,380 --> 01:08:12,400 on the Stellar website. 1373 01:08:12,400 --> 01:08:13,983 And if you look in one of the figures, 1374 01:08:13,983 --> 01:08:15,940 the algorithm is there, And you'll actually 1375 01:08:15,940 --> 01:08:19,210 see, if you can deconvolute what's going on, 1376 01:08:19,210 --> 01:08:23,479 that inside the loop, it's actually doing exactly that. 1377 01:08:23,479 --> 01:08:23,979 Yes. 1378 01:08:23,979 --> 01:08:24,689 Question. 1379 01:08:24,689 --> 01:08:27,581 AUDIENCE: In practice, is the number of errors is small, 1380 01:08:27,581 --> 01:08:29,991 would it make sense just to use [INAUDIBLE]? 1381 01:08:33,257 --> 01:08:34,840 PROFESSOR: We're going to get to that. 1382 01:08:34,840 --> 01:08:38,370 I think the question was, if the number of errors is small, 1383 01:08:38,370 --> 01:08:41,467 would it be good to actually use a different algorithm, 1384 01:08:41,467 --> 01:08:42,800 drop into a different algorithm? 
1385 01:08:42,800 --> 01:08:47,170 So there are algorithms that do FM index-assisted 1386 01:08:47,170 --> 01:08:48,870 Smith-Waterman, for example. 1387 01:08:48,870 --> 01:08:50,569 Where you get to the neighborhood 1388 01:08:50,569 --> 01:08:54,510 by a fast technique and then you do a full search, 1389 01:08:54,510 --> 01:08:58,384 using a more in-depth, principled methodology, right? 1390 01:08:58,384 --> 01:09:00,550 And so there's some papers I have in the slides here 1391 01:09:00,550 --> 01:09:04,420 that I referenced that do exactly that. 1392 01:09:04,420 --> 01:09:06,920 These are all great questions. 1393 01:09:06,920 --> 01:09:07,850 OK. 1394 01:09:07,850 --> 01:09:09,000 Yes. 1395 01:09:09,000 --> 01:09:11,350 AUDIENCE: If you're-- If you only decide to backtrack 1396 01:09:11,350 --> 01:09:13,580 a certain number of times, like 100 times, 1397 01:09:13,580 --> 01:09:16,600 then wouldn't like the alignment be biased towards the end 1398 01:09:16,600 --> 01:09:18,103 of the short read? 1399 01:09:18,103 --> 01:09:20,850 PROFESSOR: I am so glad you asked this question. 1400 01:09:20,850 --> 01:09:24,070 The question is, and what was your name again? 1401 01:09:24,070 --> 01:09:24,736 AUDIENCE: Kevin. 1402 01:09:24,736 --> 01:09:25,444 PROFESSOR: Kevin. 1403 01:09:25,444 --> 01:09:28,510 Kevin asks, gee, if you're matching from the right 1404 01:09:28,510 --> 01:09:30,500 to the left, and you're doing backtracking, 1405 01:09:30,500 --> 01:09:32,520 isn't this going to be biased towards 1406 01:09:32,520 --> 01:09:35,029 the right end of the read, in some sense, right? 1407 01:09:35,029 --> 01:09:37,710 Because if the right end of the read doesn't match, 1408 01:09:37,710 --> 01:09:39,810 then you're going to give up, right? 1409 01:09:39,810 --> 01:09:42,399 In fact, what we know is the left end of the read 1410 01:09:42,399 --> 01:09:43,750 is the better end of the read. 1411 01:09:43,750 --> 01:09:48,569 Because sequences are done five prime to three prime and thus 1412 01:09:48,569 --> 01:09:50,960 typically, the highest quality scores or the best quality 1413 01:09:50,960 --> 01:09:55,130 scores are in the left hand side of the read. 1414 01:09:55,130 --> 01:09:59,450 So do you have any idea about how you would cope with that? 1415 01:09:59,450 --> 01:10:01,324 AUDIENCE: You could just reverse one of them. 1416 01:10:01,324 --> 01:10:02,168 But you'd reverse-- 1417 01:10:02,168 --> 01:10:02,950 PROFESSOR: Exactly what they do. 1418 01:10:02,950 --> 01:10:05,450 They take the entire genome and they reverse it and then 1419 01:10:05,450 --> 01:10:08,756 they index that. 1420 01:10:08,756 --> 01:10:11,006 And so, when they create what's called a mirror index, 1421 01:10:11,006 --> 01:10:12,390 they just reverse the entire genome, 1422 01:10:12,390 --> 01:10:13,889 and now you can match left to right, 1423 01:10:13,889 --> 01:10:15,600 as opposed to right to left. 1424 01:10:15,600 --> 01:10:17,530 Pretty cool, huh? 1425 01:10:17,530 --> 01:10:18,030 Yeah. 1426 01:10:21,090 --> 01:10:26,445 So backtracking, just note that there are different alignments 1427 01:10:26,445 --> 01:10:29,030 that can occur across different backtracking paths. 1428 01:10:29,030 --> 01:10:31,850 And this is not optimal in any sense.
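Here is a deliberately simplified sketch of that backtracking idea, reusing bwt and OCC from earlier. It is not BWA's or Bowtie's actual algorithm: it tries alternative bases in a fixed order with a small mismatch budget, rather than choosing the backtrack position by quality score as discussed next.

    def search_with_mismatches(query, bwt, occ, max_mm=1):
        """Toy backtracking search: return (top, bottom, mismatches) for the
        first alignment found with at most max_mm substitutions, else None."""
        def step(i, top, bottom, mm):
            if i < 0:                                  # consumed the whole query
                return (top, bottom, mm)
            others = tuple(b for b in "acgt" if b != query[i])
            for c in (query[i],) + others:             # try the read's base first
                if c != query[i] and mm == max_mm:     # mismatch budget exhausted
                    continue
                if c not in occ:                       # base absent from this toy text
                    continue
                t = occ[c] + bwt[:top].count(c)
                b = occ[c] + bwt[:bottom].count(c)
                if t < b:                              # some suffixes still match
                    hit = step(i - 1, t, b, mm + (c != query[i]))
                    if hit:
                        return hit
            return None                                # dead end: caller backtracks
        return step(len(query) - 1, 0, len(bwt), 0)

    print(search_with_mismatches("aag", bwt, OCC))     # (3, 4, 1): one substitution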
1429 01:10:31,850 --> 01:10:34,060 And to your question about how you actually 1430 01:10:34,060 --> 01:10:38,220 go about picking a backtracking strategy, 1431 01:10:38,220 --> 01:10:40,290 assuming we're matching from right to left again 1432 01:10:40,290 --> 01:10:46,510 for a moment, what you can do is, if you hit a mismatch, 1433 01:10:46,510 --> 01:10:50,577 you backtrack to the lowest quality base position, 1434 01:10:50,577 --> 01:10:51,660 according to PHRED scores. 1435 01:10:51,660 --> 01:10:53,350 We talked about PHRED scores earlier, 1436 01:10:53,350 --> 01:10:56,270 which are shown here on the slide. 1437 01:10:56,270 --> 01:10:58,840 And you backtrack there and you try a different base 1438 01:10:58,840 --> 01:11:00,770 and you move forward from there. 1439 01:11:00,770 --> 01:11:02,890 So you're assuming that the read, which 1440 01:11:02,890 --> 01:11:06,220 is the query, which has associated quality scores, 1441 01:11:06,220 --> 01:11:10,190 is most suspect where the quality score is the lowest. 1442 01:11:10,190 --> 01:11:15,020 So you backtrack to the right to the leftmost lowest quality 1443 01:11:15,020 --> 01:11:16,710 score. 1444 01:11:16,710 --> 01:11:21,160 Now it's a very simple approach. 1445 01:11:21,160 --> 01:11:22,040 Right? 1446 01:11:22,040 --> 01:11:24,270 And we talked a little bit about the idea 1447 01:11:24,270 --> 01:11:28,080 that you don't necessarily want to match from the right side 1448 01:11:28,080 --> 01:11:30,880 and thus, typically the parameters to algorithms 1449 01:11:30,880 --> 01:11:33,320 like this include, how many mismatches 1450 01:11:33,320 --> 01:11:35,906 are allowed in the first L bases on the left end, the sum 1451 01:11:35,906 --> 01:11:38,030 of the mismatch qualities you're going to tolerate, 1452 01:11:38,030 --> 01:11:38,580 and so forth. 1453 01:11:38,580 --> 01:11:41,246 And you'll find that these aligners come with a lot of switches 1454 01:11:41,246 --> 01:11:42,120 that you can set. 1455 01:11:42,120 --> 01:11:44,025 And you can consult with your colleagues 1456 01:11:44,025 --> 01:11:45,900 about how to set switches, because it depends 1457 01:11:45,900 --> 01:11:47,941 upon the particular type of data you're aligning, 1458 01:11:47,941 --> 01:11:50,560 the length of the reads and so forth. 1459 01:11:50,560 --> 01:11:55,780 But suffice it to say, when you're doing this, 1460 01:11:55,780 --> 01:11:58,760 typically we create these mirror indices that actually reverse 1461 01:11:58,760 --> 01:12:01,300 the entire genome and then index it. 1462 01:12:01,300 --> 01:12:05,180 So we can match either right to left or left to right. 1463 01:12:05,180 --> 01:12:09,650 And so for example, if you have a mirror index, 1464 01:12:09,650 --> 01:12:13,480 and you only tolerate up to two errors, 1465 01:12:13,480 --> 01:12:14,946 then you know that either you're 1466 01:12:14,946 --> 01:12:16,320 going to get the first half right 1467 01:12:16,320 --> 01:12:19,450 in one index or in the mirror index. 1468 01:12:19,450 --> 01:12:22,060 And so you can use both indices in parallel, the forward 1469 01:12:22,060 --> 01:12:24,060 and the reverse index of the genome, 1470 01:12:24,060 --> 01:12:26,770 and then get pretty far into the read 1471 01:12:26,770 --> 01:12:29,990 before you have to start backtracking. 1472 01:12:29,990 --> 01:12:32,120 There are all these sorts of techniques, 1473 01:12:32,120 --> 01:12:34,380 shall we say, to actually overcome some of 1474 01:12:34,380 --> 01:12:36,900 the limitations of backtracking.
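And a sketch of what building the mirror index amounts to: reverse the genome, then compute its BWT. This uses the naive sort-all-rotations construction, which is fine for a toy example; as noted below, real tools build the transform with much cleverer methods that avoid sorting everything.

    def bwt_of(text):
        """Naive BWT: sort all rotations of text (which must end in '$')
        and take the last column."""
        rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
        return "".join(rotation[-1] for rotation in rotations)

    genome = "acaacg"                              # toy example genome
    forward_index = bwt_of(genome + "$")           # "gc$aaac", as used above
    mirror_index = bwt_of(genome[::-1] + "$")      # BWT of the reversed genome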
1475 01:12:36,900 --> 01:12:41,020 Any questions about backtracking at all? 1476 01:12:41,020 --> 01:12:41,650 Yes. 1477 01:12:41,650 --> 01:12:44,145 AUDIENCE: Is it trivial knowing the BWT 1478 01:12:44,145 --> 01:12:47,638 originally to find the mirror BWT? 1479 01:12:47,638 --> 01:12:48,636 Like for example, 1480 01:12:48,636 --> 01:12:50,632 PROFESSOR: No, it's not trivial. 1481 01:12:50,632 --> 01:12:52,628 AUDIENCE: So it's not like a simple matrix 1482 01:12:52,628 --> 01:12:54,352 transforming [INAUDIBLE]. 1483 01:12:54,352 --> 01:12:54,935 PROFESSOR: No. 1484 01:12:54,935 --> 01:12:55,957 Not to my knowledge. 1485 01:12:55,957 --> 01:12:58,373 I think you start with the original genome, you reverse it 1486 01:12:58,373 --> 01:13:00,100 and then you compute the BWT with that. 1487 01:13:00,100 --> 01:13:00,600 Right? 1488 01:13:00,600 --> 01:13:02,832 That's pretty easy to do. 1489 01:13:02,832 --> 01:13:04,290 And Hang was explaining to me today 1490 01:13:04,290 --> 01:13:07,170 his new ways of computing the BWT, which 1491 01:13:07,170 --> 01:13:09,880 don't actually involve sorting the entire thing. 1492 01:13:09,880 --> 01:13:12,685 There are insertion ways of computing the BWT that are very 1493 01:13:12,685 --> 01:13:15,060 [INAUDIBLE], and you could ask him this question tomorrow 1494 01:13:15,060 --> 01:13:16,850 if you care to come. 1495 01:13:16,850 --> 01:13:20,650 All right, just to give you an idea of how complex things are, 1496 01:13:20,650 --> 01:13:23,830 to build an index like this takes, 1497 01:13:23,830 --> 01:13:25,770 for the entire human genome, we're 1498 01:13:25,770 --> 01:13:29,540 talking five hours of compute time 1499 01:13:29,540 --> 01:13:32,220 to compute an index, to give you an order of magnitude 1500 01:13:32,220 --> 01:13:36,095 of the time to compute the BWT, 1501 01:13:36,095 --> 01:13:39,780 the LF checkpoints, and the suffix array sampling. 1502 01:13:39,780 --> 01:13:41,070 Something like that. 1503 01:13:41,070 --> 01:13:42,620 So it's really not too bad to compute 1504 01:13:42,620 --> 01:13:46,050 the index of the entire genome. 1505 01:13:46,050 --> 01:13:49,620 And to do searches, you know, we're talking about, 1506 01:13:49,620 --> 01:13:51,070 like on a four processor machine, 1507 01:13:51,070 --> 01:13:52,444 we're talking about maybe upwards 1508 01:13:52,444 --> 01:13:55,560 of 100 million reads per hour to map. 1509 01:13:55,560 --> 01:13:57,400 So if you have 200 million reads and you 1510 01:13:57,400 --> 01:14:00,090 want to map them to a genome, or align them as it's sometimes 1511 01:14:00,090 --> 01:14:03,970 called, it's going to take you a couple hours to do it. 1512 01:14:03,970 --> 01:14:07,400 So this is sort of the order of magnitude of the time required 1513 01:14:07,400 --> 01:14:10,730 to do these sorts of functions. 1514 01:14:10,730 --> 01:14:12,550 And there are a couple fine points 1515 01:14:12,550 --> 01:14:14,656 I wanted to end with today. 1516 01:14:14,656 --> 01:14:16,780 The first is we haven't talked at all about paired-end 1517 01:14:16,780 --> 01:14:17,990 read alignment. 1518 01:14:23,190 --> 01:14:27,950 In paired read alignment, you get, for each molecule, 1519 01:14:27,950 --> 01:14:29,180 you get two reads. 1520 01:14:29,180 --> 01:14:32,360 One starting at the five prime end on one side, 1521 01:14:32,360 --> 01:14:35,840 and one starting from the five prime end on the other side.
1522 01:14:35,840 --> 01:14:38,540 So typical read lengths might be 100 base pairs on the left 1523 01:14:38,540 --> 01:14:41,490 and 100 base pairs on the right. 1524 01:14:41,490 --> 01:14:45,340 What is called the insert size is the total size 1525 01:14:45,340 --> 01:14:49,260 of the molecule from five prime end to five prime end. 1526 01:14:49,260 --> 01:14:52,230 And the stuff in the middle is not observed. 1527 01:14:52,230 --> 01:14:54,600 We actually don't know what it is. 1528 01:14:54,600 --> 01:14:57,610 And we also don't know how long it is. 1529 01:14:57,610 --> 01:15:00,460 Now when these libraries are prepared, 1530 01:15:00,460 --> 01:15:02,520 size selection is done, so we get 1531 01:15:02,520 --> 01:15:05,380 a rough idea of what it should be. 1532 01:15:05,380 --> 01:15:06,970 We can actually compute by looking 1533 01:15:06,970 --> 01:15:10,982 at where things align on the genome, what it actually is. 1534 01:15:10,982 --> 01:15:12,190 But we don't know absolutely. 1535 01:15:14,700 --> 01:15:18,010 If we were able to strictly control 1536 01:15:18,010 --> 01:15:20,430 the length of the unobserved part, which 1537 01:15:20,430 --> 01:15:25,370 is almost impossible to do, then we would get molecular rulers. 1538 01:15:25,370 --> 01:15:28,100 And we would know exactly down to the base, 1539 01:15:28,100 --> 01:15:30,480 whether or not there were indels between the left read 1540 01:15:30,480 --> 01:15:32,354 and the right read when we did the alignment. 1541 01:15:32,354 --> 01:15:35,790 We actually don't have that today. 1542 01:15:35,790 --> 01:15:37,550 The sequencing instrument actually 1543 01:15:37,550 --> 01:15:39,970 identifies the read pairs in its output. 1544 01:15:39,970 --> 01:15:41,660 That's the only way to do this. 1545 01:15:41,660 --> 01:15:44,220 So when you get an output file, like a FASTQ file, 1546 01:15:44,220 --> 01:15:46,920 from a sequencing instrument, it will tell you, 1547 01:15:46,920 --> 01:15:48,860 for a given molecule, here's the left read 1548 01:15:48,860 --> 01:15:50,314 and here's the right read. 1549 01:15:50,314 --> 01:15:52,480 Although left and right are really sort of misnomers 1550 01:15:52,480 --> 01:15:55,060 because there really is no left and right, right? 1551 01:15:55,060 --> 01:15:57,018 This is one end and then this is the other end. 1552 01:15:59,056 --> 01:16:00,680 Typical ways of processing these paired 1553 01:16:00,680 --> 01:16:03,950 reads, first you align left and right reads. 1554 01:16:03,950 --> 01:16:05,935 And they could really only be oriented 1555 01:16:05,935 --> 01:16:07,560 with respect to a genome sequence where 1556 01:16:07,560 --> 01:16:09,976 you say that one has a lower coordinate than the other one 1557 01:16:09,976 --> 01:16:12,680 when you're actually doing the alignment. 1558 01:16:12,680 --> 01:16:15,380 And if one read fails to align uniquely, then what you can do 1559 01:16:15,380 --> 01:16:18,100 is, you know what neighborhood you're in because you know, 1560 01:16:18,100 --> 01:16:21,780 roughly speaking, what the insert size is, 1561 01:16:21,780 --> 01:16:24,420 so you can do Smith-Waterman to actually try and locate 1562 01:16:24,420 --> 01:16:26,620 the other read in that neighborhood. 1563 01:16:26,620 --> 01:16:30,970 Or you can tolerate multiply mapped reads.
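As a small illustration of the insert-size bookkeeping described above, here is a hypothetical helper; the coordinate convention and the expected range are invented for the example, since in practice the range is estimated from the data after size selection.

    def observed_insert_size(left_start, right_start, right_read_len):
        """Distance from one 5' end to the other, measured on the genome
        (0-based coordinates, assuming left_start < right_start)."""
        return (right_start + right_read_len) - left_start

    def looks_properly_paired(left_start, right_start, right_read_len,
                              expected_min=200, expected_max=500):
        """Crude consistency check against the library's rough size selection."""
        size = observed_insert_size(left_start, right_start, right_read_len)
        return expected_min <= size <= expected_max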
1564 01:16:30,970 --> 01:16:34,220 One thing that I did not mention to you explicitly, 1565 01:16:34,220 --> 01:16:36,380 is that when you match the entire query, 1566 01:16:36,380 --> 01:16:39,390 and top and bottom are more than one away from each other, 1567 01:16:39,390 --> 01:16:41,460 that means you've got many places in the genome 1568 01:16:41,460 --> 01:16:43,384 that things map. 1569 01:16:43,384 --> 01:16:45,300 And thus you may report all of those locations 1570 01:16:45,300 --> 01:16:49,030 or I might report the first one. 1571 01:16:49,030 --> 01:16:54,700 So that's one bit of insight into how to do a map paired 1572 01:16:54,700 --> 01:16:55,950 reads. 1573 01:16:55,950 --> 01:16:57,790 And these are becoming very important 1574 01:16:57,790 --> 01:16:59,900 because as sequencing costs go down, 1575 01:16:59,900 --> 01:17:03,180 people are doing more and more paired 1576 01:17:03,180 --> 01:17:05,240 and sequencing because they give you 1577 01:17:05,240 --> 01:17:08,790 much more information about the original library you created 1578 01:17:08,790 --> 01:17:11,609 and for certain protocols can allow you to localize events 1579 01:17:11,609 --> 01:17:13,025 in the genome far more accurately. 1580 01:17:16,500 --> 01:17:18,930 Final piece of advice on considerations 1581 01:17:18,930 --> 01:17:20,580 for read alignment. 1582 01:17:20,580 --> 01:17:24,780 We talked about the idea that some reads will map or align 1583 01:17:24,780 --> 01:17:30,390 uniquely to the genome and some will multimap. 1584 01:17:30,390 --> 01:17:34,740 You know that the genome is roughly 50% repeat sequence. 1585 01:17:34,740 --> 01:17:38,600 And thus it's likely that if you have a particular read 1586 01:17:38,600 --> 01:17:41,030 molecule, there's a reasonable chance 1587 01:17:41,030 --> 01:17:44,029 that it will map to multiple locations. 1588 01:17:44,029 --> 01:17:45,070 Is there a question here? 1589 01:17:45,070 --> 01:17:45,570 No. 1590 01:17:45,570 --> 01:17:47,317 OK. 1591 01:17:47,317 --> 01:17:49,900 You have to figure out what your desired mismatch tolerance is 1592 01:17:49,900 --> 01:17:53,620 when you're doing alignment and set the parameters 1593 01:17:53,620 --> 01:17:56,610 to your aligner carefully, after reading the documentation 1594 01:17:56,610 --> 01:17:58,940 thoroughly, because as you could tell, 1595 01:17:58,940 --> 01:18:01,300 there's no beautiful matrix formulation 1596 01:18:01,300 --> 01:18:05,420 like there is with a well established basis 1597 01:18:05,420 --> 01:18:09,169 in the literature, rather it's more ad hoc. 1598 01:18:09,169 --> 01:18:10,960 And you need to figure out what the desired 1599 01:18:10,960 --> 01:18:14,520 processing is for paired reads. 1600 01:18:14,520 --> 01:18:16,450 So what we've talked about today is 1601 01:18:16,450 --> 01:18:19,640 we started off talking about library complexity and the idea 1602 01:18:19,640 --> 01:18:23,910 that when we get a bunch of reads from a sequencer, 1603 01:18:23,910 --> 01:18:26,270 we can use that collection of reads to estimate 1604 01:18:26,270 --> 01:18:28,570 the complexity of our original library 1605 01:18:28,570 --> 01:18:30,210 and whether or not something went 1606 01:18:30,210 --> 01:18:33,520 wrong in the biological processing that we were doing. 1607 01:18:33,520 --> 01:18:35,500 Assuming it's a good set of reads, 1608 01:18:35,500 --> 01:18:38,260 we need to figure out where they align to the genome. 
1609 01:18:38,260 --> 01:18:41,390 So we talked about this idea of creating a full text minute 1610 01:18:41,390 --> 01:18:44,800 size index, which involves a Burrows-Wheeler transform. 1611 01:18:44,800 --> 01:18:47,650 And we saw how we can compute that and throw away 1612 01:18:47,650 --> 01:18:49,760 almost everything else except for the BWT 1613 01:18:49,760 --> 01:18:55,340 itself, the suffix array checkpoints, and the FM index 1614 01:18:55,340 --> 01:18:59,600 checkpoints to be able to reconstruct this 1615 01:18:59,600 --> 01:19:02,520 at a relatively modest increase in size over the genome 1616 01:19:02,520 --> 01:19:05,100 itself and do this very, very rapid matching, 1617 01:19:05,100 --> 01:19:10,470 albeit with more problematic matching of mismatches. 1618 01:19:10,470 --> 01:19:12,830 And then we turned to the question 1619 01:19:12,830 --> 01:19:15,930 of how to deal with those mismatches with backtracking 1620 01:19:15,930 --> 01:19:20,900 and some fine points on paired end alignment. 1621 01:19:20,900 --> 01:19:26,680 So that is the end of today's lecture. 1622 01:19:26,680 --> 01:19:29,230 On Tuesday of next week, we'll talk 1623 01:19:29,230 --> 01:19:33,840 about how to actually construct a reference genome, which 1624 01:19:33,840 --> 01:19:36,320 is a really neat thing to be able to do, take 1625 01:19:36,320 --> 01:19:40,630 a whole bunch of reads, put the puzzle back together again. 1626 01:19:40,630 --> 01:19:43,440 I would encourage you to make sure you understand 1627 01:19:43,440 --> 01:19:45,360 how this indexing strategy works. 1628 01:19:45,360 --> 01:19:46,250 Look at the slides. 1629 01:19:46,250 --> 01:19:47,844 Feel free to ask any of us. 1630 01:19:47,844 --> 01:19:49,260 Thanks so much for your attention. 1631 01:19:49,260 --> 01:19:50,330 Welcome back. 1632 01:19:50,330 --> 01:19:51,800 Have a great weekend. 1633 01:19:51,800 --> 01:19:54,370 We'll see you next Tuesday.