The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: OK, so welcome back to computational systems biology, I'm David Gifford. I'm delighted to be with you here today. And today we're going to be talking about a topic that is central to modern high-throughput biology, which is understanding how to do short read alignment, sometimes called read mapping.

Now it's very important to me that you understand what I'm about to say today, and so I'm hopeful that you'll be uninhibited to raise your hand and ask questions about the fine points in today's lecture if you have any, because I'd be totally delighted to answer any questions, and we have enough time today that we can spend time looking at one aspect of this problem and understand it thoroughly.

An associated topic is the question of library complexity. How many people have heard of sequencing libraries before? Let's see a show of hands. OK, how many people have heard of read alignment before, read mapping? OK, great, fantastic.

Let's start with what we're going to be talking about today. We're going to first begin talking about what a sequencing library is and what we mean by library complexity. We'll then turn to what has been called a full-text minute-space index, sometimes called a Burrows-Wheeler Transform index, a BWT index, or an FM index, but this is at the center of most modern computational biology algorithms for processing high-throughput sequencing data. And then we'll turn to how to use that type of index for read alignment.

So let's start now with what a sequencing library is. Let's just say that you have a DNA sample; we'll be talking about various ways of producing said samples throughout the term. But we're going to assume that we have a bunch of different DNA molecules.
And I'll illustrate the different molecules here in different colors. And we have three different types of molecules here. Some molecules are duplicated, because as you know, typically, we're preparing DNA from an experiment where there are many cells and we can get copies of DNA from those cells, or the DNA could be amplified using PCR or some other technique.

So we have this collection of molecules, and to make a library, we're going to process it. And one of the things that we'll do when we process the library is we'll put sequencing adapters on. These are short DNA sequences that we put onto the ends of the molecules to enable them to have defined sequences at the ends, which permits sequencing.

Now, if somebody hands you a tube of DNA like this, there are a couple questions you could ask. You could check the DNA concentration to find out how much DNA is there, you could run a gel to look at the size of the fragments that you're sequencing. We'll be returning to that later, but these are typically called the insert sizes of the library that you're sequencing, the total length of the DNA excluding the adapters.

But we could also ask questions about how complex this library is, because it's possible to run experiments where you produce libraries that are not very complex, where they don't have very many different types of molecules. Now that typically is a failure of the experiment. So an important part of quality control is characterizing the library complexity, where we want to figure out, here, that the complexity is equal to 3. There are three different types of molecules.

And we sample these molecules. And when we sample them, we get a bunch of DNA sequence reads. And typically the number of reads that we get is larger than the complexity of the library. Here we have a total of 12 different reads. And when we sequence a library, we're sampling from it. And so the probability that we get any one particular molecule is going to be, roughly speaking, equal to 1 over c, which is the complexity.
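A minimal sketch of that sampling picture, assuming a library of c equally likely unique molecules and n reads drawn uniformly at random (the molecule labels and function name are made up for illustration):

```python
import random
from collections import Counter

def sample_reads(c, n, seed=0):
    """Draw n reads uniformly at random from a library of c unique molecules."""
    rng = random.Random(seed)
    molecules = [f"molecule_{i}" for i in range(c)]     # hypothetical labels
    reads = [rng.choice(molecules) for _ in range(n)]   # each molecule has probability 1/c
    return Counter(reads)                               # how many reads came from each molecule

counts = sample_reads(c=3, n=12)
print(counts)
print("distinct molecules observed:", len(counts))
```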
And thus, we could use the binomial distribution to figure out the likelihood that we had exactly four of these type one molecules. However, as n, the number of sequencing reads, grows to be very large, typical numbers are a hundred million different reads, the binomial becomes cumbersome to work with. And so we typically are going to characterize this kind of selection process with a different kind of distribution. So one idea is to use a Poisson, where we say that the rate of sequencing is going to be n over c.

And we can see that here, shown on the slide above, is the same process where we have the ligation of the adapters. We have a library and we have reads coming from the library. We have a characterized library complexity here; there are four different types of molecules. And the modeling approach is that, assuming that we have c different unique molecules, the probability that we'll get any one of them when we're doing the sequencing is 1 over c. And if we do n sequencing reads, we can find out the probability that we'll get a certain number of each type of molecule. Let's just stick with the first one to start, OK?

Now part of the challenge in analyzing sequencing data is that you don't see what you don't sequence. So things that actually occur 0 times in your sequencing data still may be present in the library. And what we would like to do is, from the observed sequencing data, estimate the library complexity. So we have all of the sequencing data, we just don't know how many different molecules there are over here.

So one way to deal with this is to say, let us suppose that we make a histogram of the number of times we see distinct molecules, and we're going to say that we can observe molecules that are sequenced, or appear, l times up through r times. So we actually can create a version of the distribution that characterizes just a part of what we're seeing. So if we do this, we can build a Poisson model and we can estimate lambda from what we can observe.
We don't get to observe things we don't see. So for sure, we know we can't observe the things that are sequenced 0 times. But for the things that are sequenced at least one time, we can build an estimate of lambda. And from that estimate of lambda, we can build an estimate of c.

So one way to look at this is that if we look at the total number of unique molecules that we sequence, which is equal to m, then the probability that we observe between l and r occurrences of a given individual sequence, times c, is going to be equal to the total number of unique molecules that we observe. Another way to look at this is the very bottom equation, where we note that if we take the total complexity of the library and we multiply it by 1 minus the probability that we don't observe certain molecules, that will give an estimate of the total number of unique molecules that we do see. And thus we can manipulate that to come up with an estimate of the complexity.

Are there any questions about the details of this so far? OK, so this is a very simple model for estimating the complexity of a library, based upon looking at the distribution of reads that we actually observe, for quality control purposes.
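A minimal sketch of that estimator, assuming the simplest window where every molecule sequenced at least once is observable (l = 1, r unbounded), so that C · (1 − P(X = 0)) = m under a Poisson(lambda) model; the function name is illustrative:

```python
import math

def estimate_complexity(read_counts):
    """Estimate library complexity C from per-molecule read counts (zeros unobservable).

    Model: each molecule's read count ~ Poisson(lambda); we only see molecules with
    count >= 1, so the observed counts follow a zero-truncated Poisson.
    """
    counts = [k for k in read_counts if k > 0]
    m = len(counts)                       # number of distinct molecules observed
    mean_obs = sum(counts) / m            # mean of the zero-truncated Poisson

    # Solve mean_obs = lam / (1 - exp(-lam)) for lam by bisection (LHS is increasing).
    lo, hi = 1e-9, max(mean_obs, 1.0) * 10
    for _ in range(200):
        lam = (lo + hi) / 2
        if lam / (1 - math.exp(-lam)) < mean_obs:
            lo = lam
        else:
            hi = lam
    lam = (lo + hi) / 2

    return m / (1 - math.exp(-lam))       # C_hat = m / P(molecule seen at least once)

# Toy usage: 12 reads spread over 3 observed molecules, e.g. counts 5, 4, 3.
print(estimate_complexity([5, 4, 3]))
```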
And let us suppose that we apply this to 1,000 Genomes data, which is public data on human. And suppose we want to test whether this model works or not. So what we're going to do is we're going to estimate the library complexity from 10 percent of the sequencing reads, so we'll pick 10 percent of the reads of an individual at random, we'll estimate the complexity of the library, and then we'll also take all of the reads of the individual and estimate the complexity. And if our estimator is pretty good, we should get about the same number from 10 percent of the reads and from all of the reads. Will people go along with that? Think that seems reasonable?

OK, so we do that. And this is what we get. And it's hard to see the diagonal line here, but there's a big oops here. And the big oops is that if we estimate the library complexity from just 10 percent of the reads, it's grossly underestimating the number of unique molecules we actually have. In fact, it's off by typically a factor of two or more. So for some reason, even though we're examining millions of reads in this subsample, we're not getting a good estimate of the complexity of the library.

Does anybody have any idea what could be going wrong here? Why is it that this very simple model that is attempting to estimate how many different molecules we have here, based upon what we observe, is broken? Any ideas at all? And please say your name first.

AUDIENCE: I'm Chris.

PROFESSOR: Hi Chris.

AUDIENCE: Is it because repeated sequences, so there could be a short sequence at the end of one molecule that's the beginning of another one, middle of another one, so [INAUDIBLE].

PROFESSOR: Chris, you're on the right track, OK? Because what we have assumed at the outset was that all of these molecules occur with equal probability. Right? What would happen if in fact there are four copies of this purple one and only two copies of the other molecules? Then the probability of sampling this one is going to be twice as high as the probability of sampling one of these.

If there's non-uniformity in the original population, that's going to mess up our model big time. And that could happen from repeated sequences or other kinds of duplicated things, or it could be that there's unequal amplification. It might be that PCR really loves a particular molecule, right, and amplifies that one a lot, and doesn't amplify another one that's difficult to amplify. So somewhere in our experimental protocol pipeline, it could be that there's non-uniformity, and thus we're getting a skewness to our distribution here in our library.

So the other thing that's true is, in a Poisson, lambda, which is equal to the mean, is also equal to the variance.
And so our Poisson has only one knob we could turn to fit the distribution.

So coming back to this, we talked about the idea that the library complexity still may be four, but then there may be different numbers of molecules of each type. And here's an idea for you, right? The idea is this. Imagine that the top distributions are the number of each type of molecule that are present. And it might be that our original assumption was that it was like the very top, that typically there are two copies of each molecule in the original sequencing library, and that's a fairly tight distribution. But it could be, in fact, that the number of molecules of each type is very dispersed.

And so if we look at each one of those plots at the top, the first four, those are going to be our guesses about the distribution of the number of copies of a molecule in the original library. And we don't know what that is, right? That's something we can't directly observe, but imagine that we took that distribution and used it for lambda in our Poisson distribution down below for sampling. So we have one distribution over the number of each type of molecule we have, and we have the Poisson for sampling from that, and we put those two together.

And when we do that, we have the Poisson distribution at the top, the gamma distribution is what we'll use for representing the number of different species over here and their relative copy number. And when we actually put those together as shown, we wind up with what's called the negative binomial distribution, which is a more flexible distribution; it has two parameters. And that negative binomial distribution can be used, once again, to estimate our library complexity. And when we do so, we have lambda be the same, but k is a new parameter. It measures sort of the variance or dispersion of this original sequencing library.
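A small sketch of that mixing idea, assuming we draw each molecule's rate from a gamma distribution and then its read count from a Poisson with that rate; the resulting counts are negative-binomially distributed and overdispersed (variance well above the mean):

```python
import math
import random
import statistics

def gamma_poisson_counts(n_molecules, mean_depth, k, seed=0):
    """Simulate per-molecule read counts under a gamma-Poisson (negative binomial) model.

    k is the gamma shape (dispersion) parameter; a smaller k means a more skewed library.
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n_molecules):
        lam = rng.gammavariate(k, mean_depth / k)   # per-molecule rate, mean = mean_depth
        # Poisson draw via inversion of the CDF (fine for modest rates).
        u, p, x = rng.random(), math.exp(-lam), 0
        cdf = p
        while u > cdf:
            x += 1
            p *= lam / x
            cdf += p
        counts.append(x)
    return counts

counts = gamma_poisson_counts(n_molecules=10000, mean_depth=2.0, k=0.5)
print("mean:", statistics.mean(counts), "variance:", statistics.variance(counts))
```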
And then when we fit this negative binomial distribution to that 1,000 Genomes data, it's going to be, hopefully, better.

Let's start with a smaller example. If we have a library that's artificial, with a known million unique molecules, and we subsample it to give 100,000 reads, you can see that with different dispersions here on the left, k with different values from 0.1 to 20, the Poisson begins to grossly underestimate the complexity of the library as the dispersion gets larger, whereas the negative binomial, otherwise known as the GP or gamma-Poisson, does a much better job.

And furthermore, when we look at this in the context of the 1,000 Genomes data, you can see when we fit this how much better we are doing. Almost all those points are almost exactly on the line, which means you can take a small sampling run and figure out from that sampling run how complex your library is.

And that allows us to tell something very important, which is what is the marginal value of extra sequencing. So for example, if somebody comes to you and they say, well, I ran my experiment and all I could afford was 50 million reads. Do you think I should sequence more? Is there more information in my experimental DNA preparation? It's easy to tell now, right? Because you can actually analyze the distribution of the reads that they got, and you can go back and you could estimate the marginal value of additional sequencing. And the way you do that is you go back to the distribution that you fit, this negative binomial, and ask, if you have r more reads, how many more unique molecules are you going to get?

And the answer is that you can see, if you imagine that this is artificial data, but if you imagine that you had a complexity of 10 to the 6 molecules, the number of sequencing reads is on the x-axis, the number of observed distinct molecules is on the y-axis, and as you increase the sequencing depth, you get more and more of the library back. However, the important thing to note is that the more skewed the library is, the less benefit you get, right? So if you look at the various values of k, as k gets larger, the sort of the skewness of the library increases, and you can see that you get fewer unique molecules as you increase the sequencing depth.
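A rough sketch of that depth-versus-yield calculation, assuming the standard zero probabilities for the two models, Poisson: P(0) = exp(-n/C); gamma-Poisson with dispersion k and mean n/C: P(0) = (k/(k + n/C))^k. This is a sketch, not the lecture's exact code:

```python
import math

def expected_distinct(C, n, k=None):
    """Expected number of distinct molecules seen after n reads from a library of C molecules.

    k=None uses a Poisson model; otherwise a gamma-Poisson (negative binomial) with dispersion k.
    """
    mu = n / C                                   # mean reads per molecule
    if k is None:
        p_missed = math.exp(-mu)                 # Poisson P(count = 0)
    else:
        p_missed = (k / (k + mu)) ** k           # negative binomial P(count = 0)
    return C * (1 - p_missed)

C = 10**6
for n in (10**5, 10**6, 10**7):
    print(n, round(expected_distinct(C, n)), round(expected_distinct(C, n, k=0.5)))
```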
Now I mention this to you because it's important to think in a principled way about analyzing sequencing data. If somebody drops 200 million reads on your desk and says, can you help me with these, it's good to start with some fundamental questions, like just how complex is the original library, and do you think that these data are really good or not, OK? Furthermore, this is an introduction to the idea that certain kinds of very simplistic models, like Poisson models of sequencing data, can be wrong because they're not adequately taking into account the overdispersion of the original sequencing count data.

OK, so that's all there is about library complexity. Let's move on now to questions of how to deal with these reads once we have them.

So the fundamental challenge is this. I hand you a genome like human: 3 times 10 to the ninth bases. This will be in FASTA format, let's say. I hand you reads. And this will be, we'll have, say, 200 base pairs times 2 times 10 to the eighth different reads. And this will be in FASTQ format. The Q means that it's like FASTA, except that a quality score is associated with each particular base position. And the PHRED score, which is typically used for these sorts of qualities, is minus 10 times log base 10 of the probability of an error. Right, so a PHRED score of 10 means that there's a 1 in 10 chance that the base is an error, a PHRED score of 20 means it's 1 in 100, and so forth.

And then the goal is, today, if I give you these data on a hard drive, your job would be to produce a SAM file, a Sequence Alignment and Mapping file, which tells us where all these reads map in the genome.
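As a quick worked example of that score, here is a minimal sketch converting between PHRED qualities and error probabilities; the ASCII offset-33 decoding shown is the common Sanger/Illumina 1.8+ convention, so treat it as an assumption about the particular FASTQ files:

```python
import math

def phred_to_error_prob(q):
    """Q = -10 * log10(p_error)  =>  p_error = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    return -10 * math.log10(p)

def decode_quality_string(qual):
    """Decode a FASTQ quality string, assuming the ASCII offset-33 encoding."""
    return [ord(ch) - 33 for ch in qual]

print(phred_to_error_prob(10), phred_to_error_prob(20))   # 0.1, 0.01
print(decode_quality_string("II5?"))                      # [40, 40, 20, 30]
```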
And more pictorially, the idea is that there are many different reasons why we want to do this mapping. So one might be to do genotyping. You and I differ in our genomes by about one base in a thousand. So if I sequence your genome and I map it back, or align it, to the human reference genome, I'm going to find differences between your genome and the human reference genome. And you can see how this is done at the very top, where we have the aligned reads and there's a G, let's say, in the sample DNA, and there's a C in the reference. But in order to figure out where the differences are, we have to take those short reads and align them to the genome.

Another kind of experimental protocol uses DNA fragments that are representative of some kind of biological process. So here the DNA being produced is mapped back to the genome to look for areas of enrichment, or what are sometimes called peaks. And there we want to actually do exactly the same process, but the post-processing once the alignment is complete is different.

So both of these share the goal of taking hundreds of millions of short reads and aligning them to a very large genome. And you heard about Smith-Waterman from Professor Berg, and as you can tell, that really isn't going to work, because its time complexity is not going to be admissible for hundreds of millions of reads. So we need to come up with a different way of approaching this problem.

So finding this alignment is really a performance bottleneck for many computational biology problems today. And we have to talk a little bit about what we mean by a good alignment, because we're going to assume, of course, fewer mismatches are better. And we're going to try and align to high quality bases as opposed to low quality bases, and note that all we have in our input data are quality scores for the reads.
So we begin with an assumption that the genome is the truth, and when we are aligning, we are going to be more permissive of mismatches in read locations that have a higher likelihood of being wrong.

So is everybody OK with the setup so far? You understand what the problem is? Yes, all the way in the back row, my back row consultants, you're good on that? See, the back row is always the people I call on for consulting advice, right? So yeah. You're all good back there? Good, I like that, good, that's good, I like that, OK. All right.

So now I'm going to talk to you about one of the most amazing transforms I have seen. It's called the Burrows-Wheeler Transform. And it is a transform that we will do to the original genome that allows us to do this lookup very, very quickly. And it's worth understanding.

So here's the basic idea behind the Burrows-Wheeler Transform. We take the original string that we want to use as our target, that we're going to look things up in, OK, so this is going to be the dictionary we're looking things up in, and it's going to be the genome sequence. And you can see the sequence on the left hand side, ACA, ACG, and the dollar sign represents the end of string terminator. OK.

Now here's what we're going to do. We take all possible rotations of this string, OK? And we're going to sort them. And the result of sorting all the rotations is shown in the next block of characters. And you can see that the end of string character has the lowest sorting order, followed by A, C, and G, and that all the strings are ordered lexically by all of their letters.

So once again, we take the original input string, we do all of the possible rotations of it, and then we sort them and wind up with this Burrows-Wheeler Matrix, as it's called in this slide, OK? And we take the last column of that matrix and that is the Burrows-Wheeler Transform.
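Here is a minimal sketch of exactly that construction, building the Burrows-Wheeler Matrix by sorting all rotations and taking the last column. This is fine for a toy string; real aligners build the same transform far more efficiently:

```python
def rotations(t):
    """All cyclic rotations of t (t must already end with the '$' terminator)."""
    tt = t * 2
    return [tt[i:i + len(t)] for i in range(len(t))]

def bwm(t):
    """The Burrows-Wheeler Matrix: the rotations, sorted lexically ('$' sorts lowest)."""
    return sorted(rotations(t))

def bwt_via_bwm(t):
    """The Burrows-Wheeler Transform: the last column of the matrix, read top to bottom."""
    return "".join(row[-1] for row in bwm(t))

t = "acaacg$"
for row in bwm(t):
    print(row)
print("BWT:", bwt_via_bwm(t))   # expected: gc$aaac
```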
Now you might say, what on earth is going on here? Why would you want to take a string, or even an entire genome? We actually do this on entire genomes, OK? Consider all the rotations of it, sort them, and then take the last column of that matrix. What could that be doing, OK?

Here's a bit of intuition for you. The intuition is that that Burrows-Wheeler Matrix is representing all of the suffixes of t. OK, so all the red things are suffixes of t in the matrix. And when we are going to be matching a read, we're going to be matching it from its end going towards the beginning of it, so we'll be matching suffixes of it. And I'm going to show you a very neat way of using this transform to do matching very efficiently.

But before I do that, I want you to observe that it's not complicated. All we do is we take all the possible rotations and we sort them and we come up with this transform. Yes.

AUDIENCE: What are you sorting them based on?

PROFESSOR: OK, what was your name again?

AUDIENCE: I'm Samona.

PROFESSOR: Samona. What are we sorting them based upon? We're just sorting them alphabetically.

AUDIENCE: OK.

PROFESSOR: So you can see that if dollar sign is the lowest alphabetical character, that if you consider each one a word, that they're sorted alphabetically, OK? So we have seven characters in each row and we sort them alphabetically. Or lexically. Good question.

Any other questions like that? This is a great time to ask questions, because what's going to happen is that in about the next three minutes, if you lose your attention span of about 10 seconds, you're going to look up and you'll say, what just happened? Yes.

AUDIENCE: Could you explain the suffixes of t?

PROFESSOR: The suffixes of t? Sure. Let's talk about the suffixes of t. They're all of the things that end t. So a suffix of t would be G, or CG, or ACG, or AACG, or CAACG, or the entire string t.
Those are all of the endings of t. And if you look over on the right, you can see all of those suffixes in red. So one way to think about this is that it's sorting all of the suffixes of t in that matrix. Because the rotations are exposing the suffixes, right, is what's going on. Does that make sense to you?

Now keep me honest here in a minute, OK, you'll help me out? Yes. Your name first?

AUDIENCE: [INAUDIBLE].

PROFESSOR: [INAUDIBLE].

AUDIENCE: What is dollar sign?

PROFESSOR: Dollar sign is the end of string character, which has the lowest lexical sorting order. So it's marking the end of t. That's how we know that we're at the end of t. Good question. Yes.

AUDIENCE: Can you sort them non-alphabetically, just different ways to sort them [INAUDIBLE] algorithm?

PROFESSOR: The question is, can you sort them non-alphabetically. You can sort them any way, as long as it's consistent, OK. But let's stick with alphabetical, lexical order today. It's really simple and it's all you need. Yes.

AUDIENCE: [INAUDIBLE] in red is the suffixes in the last colored group on the right?

PROFESSOR: No, no.

AUDIENCE: What's in red?

PROFESSOR: What's in red are all the suffixes of t on the very far left. OK?

AUDIENCE: On the right, last column group?

PROFESSOR: The right last column group. That last column in red, that is the Burrows-Wheeler Transform, read from top to bottom. OK? And I know you're looking at that and saying, how could that possibly be useful? We've taken our genome. We've shifted it all around. We've sorted it, we take this last thing. It looks like junk to me, right? But you're going to find out that all of the information in the genome is contained in that last string in a very handy way. Hard to believe but true.
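A small sketch of that suffix intuition: because t ends with the unique '$' terminator, sorting the rotations orders the rows exactly by their suffixes, so the same transform can be read off a suffix array (a sketch, with illustrative function names):

```python
def suffix_array(t):
    """Start offsets of the suffixes of t in lexically sorted order."""
    return [offset for _, offset in sorted((t[i:], i) for i in range(len(t)))]

def bwt_via_sa(t):
    """BWT(t) is the character just before each sorted suffix (the terminator for offset 0)."""
    return "".join(t[i - 1] if i > 0 else "$" for i in suffix_array(t))

t = "acaacg$"
print(suffix_array(t))          # sorted suffix start positions: [6, 2, 0, 3, 1, 4, 5]
print(bwt_via_sa(t))            # matches the last column: gc$aaac
```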
Hard to believe but true. Yes. Prepare to be amazed, all right? These are all great questions. Any other questions of this sort? OK.

So, I'm going to make a very important observation here that is going to be crucial for your understanding. So I have reproduced the matrix down on this blackboard. What? That's usually there under that board, you know that. You guys haven't checked this classroom before, have you? No. It's always there. It's such a handy transform.

So this is the same matrix as the matrix you see on the right. I'm going to make a very, very important assertion right now. OK? The very important assertion is that if you consider that this is the first a in the last column, that is the same textual occurrence in the string as the first a in the first column. And this is the second a in the last column, that's the same as the second a in the first column. And you're going to say, what does he mean by that?

OK, do the following thought experiment. Look at the matrix. OK? And in your mind, shift it left and put all the characters on the right hand side. OK? When you do that, what will happen is that these things will be used to sort the occurrences of a on the right hand side. Once again, if you shift this whole thing left and these pop over to the right, then the occurrences of these a's will be sorted by these rows from here over. But these are alphabetical. And therefore they're going to sort in alphabetical order. And therefore these a's will sort in the same order here as they are over there.

So that means that when we do this rotation, this textual occurrence of a will have the same rank in the first column and in the last column.
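Here is a minimal sketch that checks this rank property on the toy string, assuming the matrix is built from sorted suffixes as above: the i-th occurrence of a character in the last column and the i-th occurrence in the first column turn out to be the same position of the original text:

```python
def column_positions(t):
    """For each row of the Burrows-Wheeler Matrix, the text offsets of its
    first-column and last-column characters (t must end with '$')."""
    n = len(t)
    sa = [off for _, off in sorted((t[i:], i) for i in range(n))]   # row -> rotation start
    first = [(t[i], i) for i in sa]                        # (char, text offset), first column
    last = [(t[(i - 1) % n], (i - 1) % n) for i in sa]     # (char, text offset), last column
    return first, last

def rank_order_offsets(column, ch):
    """Text offsets of ch in a column, read top to bottom (i.e., in rank order)."""
    return [off for c, off in column if c == ch]

first, last = column_positions("acaacg$")
print(rank_order_offsets(first, "a"))   # [2, 0, 3]
print(rank_order_offsets(last, "a"))    # [2, 0, 3] -- same occurrences, in the same order
```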
And you can see I've annotated the various bases here with their ranks. This is the first g, the first c, the first end-of-string character, first a, second a, third a, second c. And correspondingly, I have the same annotations over here, and thus the third a here is the same lexical occurrence as the third a on the left in the string, the same text occurrence.

Now I'm going to let that sink in for a second, and then when somebody asks a question, I'm going to explain it again, because it's a little bit counterintuitive. But the very important thing is, if we think about textual occurrences of characters in that string t and we put them in this framework, the rank allows us to identify identical textual occurrences of a character.

Would somebody like to ask a question? Yes. Say your name and the question, please.

AUDIENCE: Dan.

PROFESSOR: Dan.

AUDIENCE: So in your original string, though, those won't correspond to the same order in the transformed string. So like the a's in the original string, in their order, they don't correspond numerically to the transformed string.

PROFESSOR: That's correct. Is that OK? The comment was that the order in BWT, the transform, is not the same as the order in the original string. And all I'm saying is that in this particular matrix form, the order in the last column is the same as the order in the first column for a particular character. And furthermore, that these are matching textual occurrences, right? Now if I look at a2 here, we know that c comes after it, then a, then a, and c and g, right? Right.

OK, so did that answer your question, that they're not exactly the same?

AUDIENCE: Yes. I don't understand how they're useful yet.

PROFESSOR: You don't understand how it's useful yet. OK. Well, maybe we better get to the useful part and then you can-- OK.

So let us suppose that we want to, from this, reconstruct the original string. Does anybody have any ideas about how to do that? OK.
Let me ask a different question. If we look at this g1, right? And then this is the same textual occurrence, right? And we know that this g1 comes right before the end-of-string terminator, right? So if we look at the first row, we always know what the last character was in the original string. The last character is g1, right? Fair enough? OK.

Where does g1 occur over here? Right over here, right? What's the character before g1? c2. Where is c2 over here? What's the character before c2? a3. What's the character before a3? a1. Uh oh. Let me just cheat a little bit here: a1 a3 c2 g1 $. So we're at a1, right? What's the character before a1? c1, right? What's the character before c1? a2. And what's the character before a2? That's the end of string.

Is that the original string that we had? Magic. OK? Yes.

AUDIENCE: Wouldn't it be simpler to look at-- to just remove the dollar sign [INAUDIBLE], or do you mean reconstruct from only the transformed?

PROFESSOR: We're only using this. This is all we have. Because I actually didn't use any of these characters. I was only doing the matching so we would go to the right row. Right? I didn't use any of this.

And so, but do people understand what's going on here? If anybody has any questions, now is a great time to raise your hand and say-- here we go. We have a customer. Say your name and the question, please.

AUDIENCE: My name is Eric.

PROFESSOR: Thanks, Eric.

AUDIENCE: Can you illustrate how you would do this without using any of the elements to the left of the box?

PROFESSOR: Absolutely, Eric. I'm so glad you asked that question. That's the next thing we're going to talk about.
OK, but before I get there, I want to make sure, are you comfortable doing it with all the stuff on the left hand side? You're happy about that? OK. If anyone was unhappy about that, now would be the time to say, I'm unhappy, help me. How about the details? Everybody's happy? Yes.

AUDIENCE: So, you have your original string in the first place, though, so why do you want to create another string of the same length? Like, how does this help you match your read?

PROFESSOR: How does this help you match your read? How does this help you match your read, was the question. What was your name?

AUDIENCE: Dan.

PROFESSOR: That's right, Dan. That's a great question. I'm so glad you asked it. First we'll get to Eric's question and then we'll get to yours. Because I know if I don't give you a good answer that you're going to be very mad, right? OK?

Let's talk about the question of how to do this without the other things. So we're going to create something called the last-to-first function that maps a character in the last row-- column, I should say-- to the first column. And there is the function right there. It's called LF. You give it a row number. The rows are zero-origined. And you give it a character, and it tells you what the corresponding place is in the first column.

And it has two components. The first is Occ, which tells you how many characters are smaller than that character lexically. So it tells you where, for example, the a's start, the c's start, or the g's start. So in this case, for example, Occ of c is 4. That is, the c's start at 0, 1, 2, 3, the fourth row. OK? And then count tells you the rank of c minus 1. So it's going to essentially count how many times c occurs before the c at the row you're pointing at. In this case, the answer is 1, and you add 1, and that gets you to 5, which is this row.
So this c2 maps here to c2, as we already discussed. So this LF function is a way to map from the last column to the first column. And we need to have two components. So we need to know Occ, which is very trivial to compute. There are only five elements, one for each base and one for the end of string terminator, which is actually zero. So it will only have integers. And count, which is going to tell us the rank in the BWT transform, and we'll talk about how to do that presently. OK.

So did that answer your question, how to do this without the rest of the matrix? Eric?

AUDIENCE: Could you show us step by step on the blackboard how you would reconstruct it?

PROFESSOR: How do we reconstruct it?

AUDIENCE: Yeah.

PROFESSOR: You mean something like this? Is this what you're suggesting?

AUDIENCE: Somehow I get a feeling that the first column doesn't help us in understanding how the algorithm works only using the last column.

PROFESSOR: OK. Your comment, Eric, is that you feel like the first column doesn't help us understand how the algorithm works, only using the last column, right? OK.

AUDIENCE: [INAUDIBLE] going back to the first column of data.

PROFESSOR: OK. Well, let's compute the LF function of the character and the row for each one of these things, OK? And that might help you, all right? Because that's the central part of being able to reverse this transform.

So this is, to be more clear, I'll make it more explicit. This is LF of i and BWT of i, where i goes from 0 to 6. So what is that value for this one right here? Anybody know? Well, it would be Occ of g, which is 6, right? Plus count of 0 and g, which is going to be 0. Or I can look right over here and see that in fact it's 6, right? Because this occurrence of g1 is right here.

So this LF value is-- it's 6, 4, 0, a1 is in 1, a2 is in 2, a3 is in 3, c2 is in 5. So this is the LF function: 6 4 0 1 2 3 5. And I don't need any of this to compute it. Because it simply is equal to, going back one slide, it's equal to Occ of c plus count. So it's going to be equal to where that particular character starts on the left hand side, plus its rank minus 1.

And so these are the values for LF. This is what I need to be able to take this string and recompute the original string. If I can compute this, I don't need any of that. And to compute this, I need two things. I need Occ and I need count.
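A minimal sketch of those two pieces and the resulting LF map for the toy example, assuming the BWT string gc$aaac computed earlier (the function names are illustrative; a real FM index would precompute rank checkpoints rather than rescan the string):

```python
def occ_table(bwt):
    """Occ[c]: the row where c's block begins in the sorted first column."""
    counts = {}
    for ch in bwt:
        counts[ch] = counts.get(ch, 0) + 1
    occ, total = {}, 0
    for ch in sorted(counts):          # '$' sorts before a, c, g, t
        occ[ch] = total
        total += counts[ch]
    return occ

def count(bwt, i, ch):
    """How many times ch appears in bwt[0:i], i.e. the rank of bwt[i] minus 1."""
    return bwt[:i].count(ch)

def lf(bwt, occ, i):
    """Last-to-first map: the first-column row holding the same text occurrence."""
    ch = bwt[i]
    return occ[ch] + count(bwt, i, ch)

bwt = "gc$aaac"
occ = occ_table(bwt)
print(occ)                                          # {'$': 0, 'a': 1, 'c': 4, 'g': 6}
print([lf(bwt, occ, i) for i in range(len(bwt))])   # [6, 4, 0, 1, 2, 3, 5]
```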
806 00:40:28,710 --> 00:40:31,670 And I don't need any of this to compute it. 807 00:40:31,670 --> 00:40:36,920 Because it simply is equal to, going back one slide, 808 00:40:36,920 --> 00:40:39,450 it's equal to Occ of c plus count. 809 00:40:39,450 --> 00:40:41,960 So it' going to be equal to where that particular character 810 00:40:41,960 --> 00:40:47,730 starts on the left hand side and its rank minus 1. 811 00:40:50,330 --> 00:40:53,100 And so these are the values for LF. 812 00:40:53,100 --> 00:40:56,530 This is what I need to be able to take this string 813 00:40:56,530 --> 00:40:59,630 and recompute the original string. 814 00:40:59,630 --> 00:41:04,170 If I can compute this, I don't need any of that. 815 00:41:04,170 --> 00:41:07,000 And to compute this, I need two things. 816 00:41:07,000 --> 00:41:10,240 I need Occ and I need count. 817 00:41:13,000 --> 00:41:13,990 All right? 818 00:41:13,990 --> 00:41:18,310 Now, I can tell you're not quite completely satisfied yet. 819 00:41:18,310 --> 00:41:20,020 So maybe you can ask me another question 820 00:41:20,020 --> 00:41:21,655 and it would be very helpful to me. 821 00:41:21,655 --> 00:41:27,530 AUDIENCE: How did you get G1's last and first functions 822 00:41:27,530 --> 00:41:29,720 score being 6? 823 00:41:29,720 --> 00:41:30,770 PROFESSOR: OK. 824 00:41:30,770 --> 00:41:31,810 Let's take that apart. 825 00:41:40,260 --> 00:41:50,040 We want to know what LF of 6 and where was that G1? 826 00:41:50,040 --> 00:41:51,590 G1 is 1 and 0, right? 827 00:41:51,590 --> 00:41:52,270 Sorry. 828 00:41:52,270 --> 00:41:58,490 LF of 1 and g is equal to, right? 829 00:41:58,490 --> 00:42:00,640 Is that g and 1 or 0? 830 00:42:00,640 --> 00:42:02,940 Oop, sorry it's in 0. 831 00:42:02,940 --> 00:42:05,720 So this is what you like me to compute, right? 832 00:42:05,720 --> 00:42:09,160 OK what's Occ of g? 833 00:42:09,160 --> 00:42:11,140 It's how many characters are less than g 834 00:42:11,140 --> 00:42:12,227 in the original string? 835 00:42:19,050 --> 00:42:20,156 I'll give you a clue. 836 00:42:20,156 --> 00:42:22,220 It's 1, 2, 3, 4, 5, 6. 837 00:42:22,220 --> 00:42:23,136 AUDIENCE: [INAUDIBLE]. 838 00:42:27,139 --> 00:42:29,430 PROFESSOR: No, it's how many characters are less than g 839 00:42:29,430 --> 00:42:30,435 in the original string. 840 00:42:30,435 --> 00:42:32,560 How many things are going to distort underneath it? 841 00:42:32,560 --> 00:42:36,270 Where do the g's begin in the sorted version? 842 00:42:36,270 --> 00:42:39,380 The g's begin in row 6. 843 00:42:39,380 --> 00:42:40,840 OK? 844 00:42:40,840 --> 00:42:45,560 So OCC of g is 6. 845 00:42:45,560 --> 00:42:48,265 Is that-- are you getting hung up on that point? 846 00:42:48,265 --> 00:42:48,848 AUDIENCE: Yes. 847 00:42:48,848 --> 00:42:51,812 How do you know that without ever referencing back 848 00:42:51,812 --> 00:42:54,422 to the first 5 columns? 849 00:42:54,422 --> 00:42:57,486 PROFESSOR: Because when we build the index we remember. 850 00:42:57,486 --> 00:42:59,790 AUDIENCE: Oh, OK. 851 00:42:59,790 --> 00:43:02,630 PROFESSOR: So we have to, I haven't told you this, 852 00:43:02,630 --> 00:43:05,380 but I need to compute, I need to remember 853 00:43:05,380 --> 00:43:08,830 ways to compute Occ and count really quickly. 854 00:43:08,830 --> 00:43:12,250 Occ is represented by four values only. 855 00:43:12,250 --> 00:43:14,870 They're where the a's start, the c's start, the t's start, 856 00:43:14,870 --> 00:43:15,790 and the g's start. 
857 00:43:15,790 --> 00:43:17,937 That's all I need to know, four integers. 858 00:43:17,937 --> 00:43:18,437 OK? 859 00:43:21,360 --> 00:43:22,560 Are you happy with that? 860 00:43:25,070 --> 00:43:27,140 Occ, the a's start at 1. 861 00:43:27,140 --> 00:43:30,040 The c's start at 4 and the g's start at 6. 862 00:43:30,040 --> 00:43:30,540 OK? 863 00:43:30,540 --> 00:43:32,880 That's all I need to remember. 864 00:43:32,880 --> 00:43:35,100 But I precompute that. 865 00:43:35,100 --> 00:43:35,600 OK? 866 00:43:35,600 --> 00:43:36,575 Remember it. 867 00:43:36,575 --> 00:43:37,967 Are you happy with that no? 868 00:43:37,967 --> 00:43:38,550 AUDIENCE: Yes. 869 00:43:38,550 --> 00:43:39,270 PROFESSOR: OK. 870 00:43:39,270 --> 00:43:53,471 And then this business over here of count of zero and g. 871 00:43:53,471 --> 00:43:53,970 Right? 872 00:43:56,900 --> 00:43:59,500 Which is how many g's, what's the rank 873 00:43:59,500 --> 00:44:01,000 of this g in the right hand side? 874 00:44:05,160 --> 00:44:06,320 1. 875 00:44:06,320 --> 00:44:08,170 Minus 1 is 0. 876 00:44:08,170 --> 00:44:10,290 So that's 0. 877 00:44:12,810 --> 00:44:14,115 That's how we computed it. 878 00:44:14,115 --> 00:44:14,615 OK? 879 00:44:17,302 --> 00:44:18,760 These are great questions because I 880 00:44:18,760 --> 00:44:20,460 think they're foundational. 881 00:44:20,460 --> 00:44:22,248 Yeah. 882 00:44:22,248 --> 00:44:24,244 AUDIENCE: Occ and count are both precomputed 883 00:44:24,244 --> 00:44:30,731 as you're building on this last column [INAUDIBLE]. 884 00:44:30,731 --> 00:44:33,000 PROFESSOR: They are precomputed and well, 885 00:44:33,000 --> 00:44:36,890 I have not told you, see, you're sort of ahead of things 886 00:44:36,890 --> 00:44:39,849 a little bit in that I'd hoped to suspend disbelief 887 00:44:39,849 --> 00:44:42,140 and that I could actually build these very efficiently. 888 00:44:42,140 --> 00:44:43,764 But yes, they're built at the same time 889 00:44:43,764 --> 00:44:45,090 as the index is built. 890 00:44:45,090 --> 00:44:45,590 OK? 891 00:44:45,590 --> 00:44:47,490 But yes. 892 00:44:47,490 --> 00:44:50,857 OK and if I have Occ and count and I have the string, 893 00:44:50,857 --> 00:44:51,690 then I can get this. 894 00:44:51,690 --> 00:44:53,130 But I can still need to get to Dan's question 895 00:44:53,130 --> 00:44:55,140 because he has been very patient over there. 896 00:44:55,140 --> 00:44:56,840 He wants to know how I can use this 897 00:44:56,840 --> 00:44:59,880 to find sequence region of genome. 898 00:44:59,880 --> 00:45:01,530 And he's just being so good over there. 899 00:45:01,530 --> 00:45:02,780 I really appreciate that, Dan. 900 00:45:02,780 --> 00:45:05,620 Thanks so much for that. 901 00:45:05,620 --> 00:45:07,170 Are you happier? 902 00:45:07,170 --> 00:45:07,930 OK you're good. 903 00:45:07,930 --> 00:45:08,690 All right. 904 00:45:08,690 --> 00:45:12,600 So and this is what we just did. 905 00:45:12,600 --> 00:45:16,580 The walk left algorithm actually inverts the BWT 906 00:45:16,580 --> 00:45:19,340 by using this function to walk left 907 00:45:19,340 --> 00:45:22,020 and using that Python code up there, 908 00:45:22,020 --> 00:45:24,440 you can actually reconstruct the original string 909 00:45:24,440 --> 00:45:27,310 you started with. 910 00:45:27,310 --> 00:45:30,390 So it's just very simple. 911 00:45:30,390 --> 00:45:33,790 And we went through it on the board. 912 00:45:33,790 --> 00:45:37,117 Are there any questions about it at all? 
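For reference, a small sketch of that walk-left inversion, reusing bwt and lf from the sketch above; this is a paraphrase of the idea, not the code shown on the slide.

    def walk_left(bwt, lf):
        """Invert the BWT: start at row 0, whose suffix begins with '$', and
        repeatedly hop to the row of the character to its left in the text."""
        row, text = 0, "$"
        while bwt[row] != "$":
            text = bwt[row] + text         # prepend the preceding character
            row = lf(row, bwt[row])        # move to the row that starts with it
        return text

    print(walk_left(bwt, lf))              # "acaacg$"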
913 00:45:37,117 --> 00:45:37,700 AUDIENCE: Yes. 914 00:45:37,700 --> 00:45:42,155 Can you actually do this with any column? 915 00:45:42,155 --> 00:45:45,900 Like, why are you using the last column instead of like, 916 00:45:45,900 --> 00:45:53,490 why can't you just change it like the [INAUDIBLE], 917 00:45:53,490 --> 00:45:56,052 the equation and make it work for-- 918 00:45:56,052 --> 00:45:57,760 PROFESSOR: Because a very important thing 919 00:45:57,760 --> 00:46:00,200 is that this is actually a very important property, right? 920 00:46:00,200 --> 00:46:03,194 Which is that all the suffixes are sorted here. 921 00:46:03,194 --> 00:46:04,610 And if we didn't do that though, I 922 00:46:04,610 --> 00:46:08,300 couldn't answer Dan's question. 923 00:46:08,300 --> 00:46:09,650 And he'd be very upset. 924 00:46:09,650 --> 00:46:14,200 So please be respectful of his interest here. 925 00:46:14,200 --> 00:46:17,247 Now, the thing is, the beauty of this 926 00:46:17,247 --> 00:46:19,080 is that I have all these suffixes sorted 927 00:46:19,080 --> 00:46:21,210 and what you're about to see is the most amazing thing, which 928 00:46:21,210 --> 00:46:22,420 is that we're going to snap our fingers 929 00:46:22,420 --> 00:46:24,045 and, bang, we can map 200 million reads 930 00:46:24,045 --> 00:46:24,800 in no time at all. 931 00:46:27,630 --> 00:46:29,560 You like that? 932 00:46:29,560 --> 00:46:30,315 You're laughing. 933 00:46:30,315 --> 00:46:32,475 Oh, oh, that's not a good sign. 934 00:46:35,190 --> 00:46:37,579 Let's press ahead fearlessly, OK? 935 00:46:37,579 --> 00:46:39,870 And talk about how we're going to use this to map reads. 936 00:46:39,870 --> 00:46:41,720 So we're going to figure out how to use 937 00:46:41,720 --> 00:46:44,780 this index and this transform to rapidly align 938 00:46:44,780 --> 00:46:45,982 reads to a reference genome. 939 00:46:45,982 --> 00:46:48,190 And we're not talking about one read or 10 reads or 1 940 00:46:48,190 --> 00:46:48,770 million reads. 941 00:46:48,770 --> 00:46:50,853 We're talking about hundreds of millions of reads. 942 00:46:50,853 --> 00:46:55,270 So it has to be very efficient indeed, OK? 943 00:46:55,270 --> 00:46:58,720 So here's the essential idea. 944 00:46:58,720 --> 00:47:01,710 There's the core algorithm on the slide. 945 00:47:01,710 --> 00:47:06,510 Which is that what we do is we take the original query that we 946 00:47:06,510 --> 00:47:09,200 have, the read that we're trying to match, 947 00:47:09,200 --> 00:47:10,950 and we're going to process it backwards 948 00:47:10,950 --> 00:47:14,700 from the end of the read forwards. 949 00:47:14,700 --> 00:47:19,050 And we begin by considering all possible suffixes from row zero 950 00:47:19,050 --> 00:47:21,305 to, in this case, it would be row seven. 951 00:47:24,620 --> 00:47:28,760 Which is the length of the entire transform. 952 00:47:28,760 --> 00:47:36,120 And we iterate and in each iteration 953 00:47:36,120 --> 00:47:41,320 we consider suffixes that match the query. 954 00:47:41,320 --> 00:47:43,810 So in the first step, right here, 955 00:47:43,810 --> 00:47:49,060 let me see if I can get my pointer working, there we are. 956 00:47:49,060 --> 00:47:52,790 So in the first step here, we're matching this c. 957 00:47:52,790 --> 00:47:54,450 OK?
958 00:47:54,450 --> 00:47:57,070 And we compute the LF of the top, which 959 00:47:57,070 --> 00:48:01,020 is this row and of the bottom, which is down here pointing off 960 00:48:01,020 --> 00:48:05,010 the end, and that takes us to the first d here 961 00:48:05,010 --> 00:48:06,740 and to this point. 962 00:48:06,740 --> 00:48:09,620 Here are the two c's that could be 963 00:48:09,620 --> 00:48:14,090 possible matches to our query, which ends in a c. 964 00:48:16,880 --> 00:48:21,350 We then say, oh, the next character we have to match 965 00:48:21,350 --> 00:48:23,150 is an a. 966 00:48:23,150 --> 00:48:27,820 So we look here at the a we need to match, 967 00:48:27,820 --> 00:48:33,740 and starting from this row, which is row four, 968 00:48:33,740 --> 00:48:38,650 and this row, which is row six, we compute the LF of each one 969 00:48:38,650 --> 00:48:45,320 of these to figure out what rows in a precedes these c's. 970 00:48:49,530 --> 00:48:53,530 And the way we compute the LF is that we use the character a 971 00:48:53,530 --> 00:48:57,280 to be able to figure out which rows have the a preceding 972 00:48:57,280 --> 00:48:59,630 the c. 973 00:48:59,630 --> 00:49:02,820 You can see, when we compute those LF functions, 974 00:49:02,820 --> 00:49:06,530 what we wind up with are these rows where 975 00:49:06,530 --> 00:49:09,150 we have a followed by c. 976 00:49:09,150 --> 00:49:13,530 So we're beginning to match the end of our read, 977 00:49:13,530 --> 00:49:16,970 as we go from right to left. 978 00:49:16,970 --> 00:49:19,220 We then compute the same thing once again, 979 00:49:19,220 --> 00:49:21,880 considering the first a and ask what 980 00:49:21,880 --> 00:49:32,520 rows are going to allow us to put this a in front of the ac 981 00:49:32,520 --> 00:49:37,170 to form our entire read. 982 00:49:37,170 --> 00:49:41,600 And we compute the LF once again of these things. 983 00:49:41,600 --> 00:49:43,230 And you can see that here it takes us 984 00:49:43,230 --> 00:49:48,770 to this specific row aac. 985 00:49:48,770 --> 00:49:54,150 So that row represents a suffix that 986 00:49:54,150 --> 00:49:58,590 is matching our query exactly. 987 00:49:58,590 --> 00:50:04,570 So we iterate this loop to be able to match 988 00:50:04,570 --> 00:50:08,850 a read against the index. 989 00:50:08,850 --> 00:50:13,210 And we're using the LF function to do it. 990 00:50:13,210 --> 00:50:16,640 And it's a really beautiful algorithm. 991 00:50:16,640 --> 00:50:19,382 And remember, we only have the transform. 992 00:50:19,382 --> 00:50:20,965 We don't have the rest of this matrix. 993 00:50:23,890 --> 00:50:29,966 So before I press ahead and talk about other details, 994 00:50:29,966 --> 00:50:32,340 I think it's important to observe a couple of things that 995 00:50:32,340 --> 00:50:35,480 are a little bit counterintuitive about this. 996 00:50:35,480 --> 00:50:37,230 One counterintuitive aspect of it 997 00:50:37,230 --> 00:50:42,820 is, that when I'm over here for example, and for example 998 00:50:42,820 --> 00:50:48,000 when I'm computing the LF here, I'm computing the LF of row two 999 00:50:48,000 --> 00:50:50,470 with respect to a. 1000 00:50:50,470 --> 00:50:53,060 But there's a dollar sign there. 1001 00:50:53,060 --> 00:50:54,260 Right? 
1002 00:50:54,260 --> 00:50:58,880 So I'm using this to the LF function, 1003 00:50:58,880 --> 00:51:07,060 to tell me where a suffix would be that actually follows 1004 00:51:07,060 --> 00:51:10,660 my constraint of having to have an a be the prefix of ac, 1005 00:51:10,660 --> 00:51:13,710 where I am right now. 1006 00:51:13,710 --> 00:51:16,750 This code is actually not fake code. 1007 00:51:16,750 --> 00:51:19,970 It's the actual code that's in a matcher, 1008 00:51:19,970 --> 00:51:24,230 for matching a read against the index. 1009 00:51:26,830 --> 00:51:28,830 Now let me just stop right here for a second 1010 00:51:28,830 --> 00:51:30,371 and see if there any other questions. 1011 00:51:30,371 --> 00:51:32,640 Dan is getting now his answer to his question, right? 1012 00:51:32,640 --> 00:51:35,200 About how you actually use this for matching reads. 1013 00:51:35,200 --> 00:51:37,800 You do this once for every read. 1014 00:51:37,800 --> 00:51:41,660 And it is linear time. 1015 00:51:41,660 --> 00:51:42,510 Right? 1016 00:51:42,510 --> 00:51:44,750 It's the length of the read itself 1017 00:51:44,750 --> 00:51:48,410 is all the time it takes to match in a huge genome. 1018 00:51:48,410 --> 00:51:51,120 So once we've built the index of the genome, 1019 00:51:51,120 --> 00:51:53,530 in fact, most of the time when you're 1020 00:51:53,530 --> 00:51:56,130 doing this sort of mapping, you don't build the index. 1021 00:51:56,130 --> 00:51:59,620 You download the index off of a website. 1022 00:51:59,620 --> 00:52:02,760 And so you don't have to pay for the time to build this index. 1023 00:52:02,760 --> 00:52:05,330 You just download the index and you take your reads 1024 00:52:05,330 --> 00:52:07,720 and the time to match all of your sequencing 1025 00:52:07,720 --> 00:52:10,930 reads against a certain build of whatever genome you're using 1026 00:52:10,930 --> 00:52:15,730 is simply linear in the number of bases you have. 1027 00:52:15,730 --> 00:52:16,960 Questions? 1028 00:52:16,960 --> 00:52:17,460 Yes. 1029 00:52:17,460 --> 00:52:19,805 And say your name and the question, please. 1030 00:52:19,805 --> 00:52:21,470 AUDIENCE: How did [INAUDIBLE] come up 1031 00:52:21,470 --> 00:52:22,595 with intuition [INAUDIBLE]? 1032 00:52:22,595 --> 00:52:24,882 It seems like they just pulled it out of a hat. 1033 00:52:24,882 --> 00:52:27,090 PROFESSOR: You know, I asked Mike that the other day. 1034 00:52:27,090 --> 00:52:29,470 I saw him in a meeting and he sort of surprised 1035 00:52:29,470 --> 00:52:33,450 at how this has taken off. 1036 00:52:33,450 --> 00:52:39,217 And he told me some other interesting facts 1037 00:52:39,217 --> 00:52:41,050 about this, which you probably could deduce. 1038 00:52:41,050 --> 00:52:43,420 Which is that if you only want to match 1039 00:52:43,420 --> 00:52:45,250 reads that are four long, you only 1040 00:52:45,250 --> 00:52:48,250 have to sort this matrix by the first four characters. 1041 00:52:50,810 --> 00:52:54,260 But there are other little tricks you can play here. 1042 00:52:54,260 --> 00:52:55,500 Any other questions? 1043 00:52:55,500 --> 00:52:56,290 Yes. 1044 00:52:56,290 --> 00:52:57,300 AUDIENCE: I'm Deborah. 1045 00:52:57,300 --> 00:53:00,370 What is the FM index? 1046 00:53:00,370 --> 00:53:04,070 PROFESSOR: What is the FM index. 
1047 00:53:04,070 --> 00:53:05,504 Well, the guys who thought this up 1048 00:53:05,504 --> 00:53:06,920 have the last initials of F and M, 1049 00:53:06,920 --> 00:53:09,720 but that's not what it stands for, 1050 00:53:09,720 --> 00:53:11,360 contrary to popular opinion. 1051 00:53:11,360 --> 00:53:16,900 It stands for full-text minute space. 1052 00:53:16,900 --> 00:53:18,090 That's what they claim. 1053 00:53:18,090 --> 00:53:21,200 So if you hear people talking about full-text minute space 1054 00:53:21,200 --> 00:53:24,800 indices, or FM indices, the FM index 1055 00:53:24,800 --> 00:53:30,060 is actually the part that was being asked over here, the Occ 1056 00:53:30,060 --> 00:53:33,830 part and the LF part, how you actually compute those quickly. 1057 00:53:33,830 --> 00:53:37,910 That was what F and M contributed to this but, generically, when 1058 00:53:37,910 --> 00:53:40,780 we're talking about this style of indexing, 1059 00:53:40,780 --> 00:53:44,420 it's called FM indexing or you might hear, I'm using a BWT. 1060 00:53:44,420 --> 00:53:46,460 Some people will say that. 1061 00:53:46,460 --> 00:53:50,350 But that's what FM stands for. 1062 00:53:50,350 --> 00:53:53,120 Does that answer your question? 1063 00:53:53,120 --> 00:53:53,620 Excellent. 1064 00:53:53,620 --> 00:53:55,010 OK. 1065 00:53:55,010 --> 00:53:55,771 All right. 1066 00:53:55,771 --> 00:53:57,270 Any-- these are all great questions. 1067 00:53:57,270 --> 00:53:59,184 Yes. 1068 00:53:59,184 --> 00:54:00,100 AUDIENCE: [INAUDIBLE]. 1069 00:54:08,780 --> 00:54:10,750 PROFESSOR: Oh, you don't know that a 1070 00:54:10,750 --> 00:54:15,220 and c are there, except that remember, 1071 00:54:15,220 --> 00:54:20,440 if you look at the way that this is working, 1072 00:54:20,440 --> 00:54:24,660 you're not actually reconstructing strings, 1073 00:54:24,660 --> 00:54:27,081 you're only trying to find them. 1074 00:54:27,081 --> 00:54:27,580 Right? 1075 00:54:27,580 --> 00:54:29,230 And so at the end, top and bottom 1076 00:54:29,230 --> 00:54:33,760 are going to point to the row that contains the suffix where 1077 00:54:33,760 --> 00:54:35,310 your original read was. 1078 00:54:35,310 --> 00:54:37,310 And now your next question is going to be, 1079 00:54:37,310 --> 00:54:38,910 where is that in the genome? 1080 00:54:38,910 --> 00:54:40,110 This doesn't do me any good. 1081 00:54:40,110 --> 00:54:43,300 I mean, the number 1 doesn't help me out here, 1082 00:54:43,300 --> 00:54:46,530 doesn't mean anything. 1083 00:54:46,530 --> 00:54:48,410 Not good, right? 1084 00:54:48,410 --> 00:54:52,230 So where is it in the genome is the next question. 1085 00:54:52,230 --> 00:54:55,390 So we'll get to that in a second. 1086 00:54:55,390 --> 00:54:57,700 What happens if you give me a read that 1087 00:54:57,700 --> 00:55:00,761 doesn't match anywhere in this index? 1088 00:55:00,761 --> 00:55:03,010 Well if you give me a read that doesn't match anywhere 1089 00:55:03,010 --> 00:55:05,720 in this index, what happens is the top and bottom 1090 00:55:05,720 --> 00:55:08,010 become the same. 1091 00:55:08,010 --> 00:55:12,710 So once top and bottom become the same, it's a failed lookup. 1092 00:55:12,710 --> 00:55:13,210 All right? 1093 00:55:16,000 --> 00:55:22,290 And that's because the suffix doesn't exist in the index. 1094 00:55:22,290 --> 00:55:25,120 And once top and bottom become the same, 1095 00:55:25,120 --> 00:55:27,980 they remain the same throughout that loop. 1096 00:55:27,980 --> 00:55:29,070 Yes.
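Here is a sketch of the matching loop just described, again reusing bwt and OCC from the earlier sketch, and again a paraphrase rather than the actual matcher code. It returns the range [top, bottom) of rows whose suffixes begin with the query; top equal to bottom is the failed lookup just mentioned.

    def backward_search(query, bwt, occ):
        """Backward (right-to-left) exact matching against the FM index."""
        top, bottom = 0, len(bwt)
        for c in reversed(query):
            top = occ[c] + bwt[:top].count(c)        # naive rank stands in for count
            bottom = occ[c] + bwt[:bottom].count(c)
            if top == bottom:                        # no suffix matches: failed lookup
                break
        return top, bottom

    print(backward_search("aac", bwt, OCC))          # (1, 2): one hit, in row 1
    print(backward_search("gg", bwt, OCC))           # top == bottom: no match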
1097 00:55:29,070 --> 00:55:30,519 AUDIENCE: I'm Sally. 1098 00:55:30,519 --> 00:55:32,451 My main question is that this doesn't 1099 00:55:32,451 --> 00:55:34,866 provide any leeway for errors. 1100 00:55:34,866 --> 00:55:38,807 You kind of have to be able to present all of your rates. 1101 00:55:38,807 --> 00:55:40,640 PROFESSOR: Sally, you're absolutely correct. 1102 00:55:40,640 --> 00:55:42,140 I'm so glad you asked that question. 1103 00:55:42,140 --> 00:55:44,960 Your observation is it does not provide 1104 00:55:44,960 --> 00:55:46,880 any leeway for mismatches. 1105 00:55:46,880 --> 00:55:49,420 And so unlike all the other algorithms we study, 1106 00:55:49,420 --> 00:55:51,360 which had these very nice matrices and ability 1107 00:55:51,360 --> 00:55:54,400 to assign weights to mismatches, this 1108 00:55:54,400 --> 00:55:56,235 is only doing exact matching. 1109 00:55:56,235 --> 00:55:57,860 And so what you need help understanding 1110 00:55:57,860 --> 00:56:00,320 is, how we can deal with mismatches 1111 00:56:00,320 --> 00:56:01,570 in the presence of this. 1112 00:56:01,570 --> 00:56:05,390 And I'll get to that in less than 10 minutes. 1113 00:56:05,390 --> 00:56:10,570 And it won't be quite as elegant as what you saw from Professor 1114 00:56:10,570 --> 00:56:12,530 Berg but it's what everybody does. 1115 00:56:12,530 --> 00:56:18,250 So that's my only excuse for it OK? 1116 00:56:18,250 --> 00:56:19,240 Yes. 1117 00:56:19,240 --> 00:56:22,890 AUDIENCE: What is the bottom set of arrows doing? 1118 00:56:22,890 --> 00:56:24,260 What's its significance? 1119 00:56:24,260 --> 00:56:26,176 PROFESSOR: The significance of top and bottom, 1120 00:56:26,176 --> 00:56:27,470 that's a great question. 1121 00:56:27,470 --> 00:56:29,270 What the significance of top and bottom? 1122 00:56:29,270 --> 00:56:33,890 Top and bottom bracket in that original matrix, 1123 00:56:33,890 --> 00:56:38,080 the suffixes that are matching the original query. 1124 00:56:38,080 --> 00:56:39,000 OK? 1125 00:56:39,000 --> 00:56:42,795 And so between top and bottom minus 1 1126 00:56:42,795 --> 00:56:46,000 are all of the rows that have a suffix that 1127 00:56:46,000 --> 00:56:48,870 match our original query. 1128 00:56:48,870 --> 00:56:53,250 And if top equals bottom, there are no matching suffixes. 1129 00:56:53,250 --> 00:56:55,104 But assuming there are matching suffixes, 1130 00:56:55,104 --> 00:56:57,020 those are rows that contain a matching suffix. 1131 00:56:57,020 --> 00:57:00,480 And as we progress along, top and bottom change 1132 00:57:00,480 --> 00:57:03,410 as the suffixes change as we expand the suffix 1133 00:57:03,410 --> 00:57:05,790 to contain more and more bases. 1134 00:57:05,790 --> 00:57:07,720 OK? 1135 00:57:07,720 --> 00:57:09,150 OK. 1136 00:57:09,150 --> 00:57:11,380 Any other questions? 1137 00:57:11,380 --> 00:57:12,140 OK. 1138 00:57:12,140 --> 00:57:17,410 So now back to the question over here, which is that OK, 1139 00:57:17,410 --> 00:57:22,970 I know that we've matched, and I know that we have this hit. 1140 00:57:22,970 --> 00:57:27,460 The question is, where is it in the Genome because the fact 1141 00:57:27,460 --> 00:57:31,510 that it matched row one of my BWT matrix 1142 00:57:31,510 --> 00:57:34,640 does me absolutely no good at all. 1143 00:57:34,640 --> 00:57:35,224 Right? 1144 00:57:35,224 --> 00:57:37,390 Anybody have any ideas about how we could figure out 1145 00:57:37,390 --> 00:57:39,010 where it is the genome? 
1146 00:57:39,010 --> 00:57:42,850 Given what I've told you so far, which is that you have the BWT, 1147 00:57:42,850 --> 00:57:45,055 you have Occ, and you have count, 1148 00:57:45,055 --> 00:57:46,555 and you can compute the LF function. 1149 00:57:49,620 --> 00:57:52,400 Any ideas? 1150 00:57:52,400 --> 00:57:53,750 Doesn't matter how slow it is. 1151 00:58:01,110 --> 00:58:04,330 OK well how could I figure out what 1152 00:58:04,330 --> 00:58:06,470 came before aac in the genome? 1153 00:58:06,470 --> 00:58:07,330 Yes. 1154 00:58:07,330 --> 00:58:09,063 AUDIENCE: So, for, like at the beginning, 1155 00:58:09,063 --> 00:58:13,082 we rebuilt this whole string, starting at the end. 1156 00:58:13,082 --> 00:58:15,814 You could rebuild the whole string starting at, 1157 00:58:15,814 --> 00:58:19,150 rebuild the whole genome starting at the 1 and see-- 1158 00:58:19,150 --> 00:58:22,090 PROFESSOR: You could rebuild the entire genome that 1159 00:58:22,090 --> 00:58:24,911 prepends or occurs before aac, right? 1160 00:58:24,911 --> 00:58:25,660 AUDIENCE: Exactly. 1161 00:58:25,660 --> 00:58:26,451 PROFESSOR: Exactly. 1162 00:58:26,451 --> 00:58:27,530 So that's what we can do. 1163 00:58:27,530 --> 00:58:32,730 We could actually do our walk left algorithm. 1164 00:58:32,730 --> 00:58:35,230 We can walk left from there, find out 1165 00:58:35,230 --> 00:58:38,020 that we go two steps until we hit dollar sign, 1166 00:58:38,020 --> 00:58:40,530 and therefore, the offset is two, 1167 00:58:40,530 --> 00:58:43,070 where it occurs in the genome. 1168 00:58:43,070 --> 00:58:47,137 So we can give a match position by walking left. 1169 00:58:47,137 --> 00:58:48,720 Does everybody see that, that we could 1170 00:58:48,720 --> 00:58:52,160 walk left to figure where it is? 1171 00:58:52,160 --> 00:58:53,700 It's not fast, but it works. 1172 00:58:53,700 --> 00:58:54,870 Yes. 1173 00:58:54,870 --> 00:58:55,640 AUDIENCE: I'm Ted. 1174 00:58:55,640 --> 00:58:56,431 PROFESSOR: Hi, Ted. 1175 00:58:56,431 --> 00:59:02,186 AUDIENCE: So now our function first has to take the read 1176 00:59:02,186 --> 00:59:06,210 and it has to align it and the same, 1177 00:59:06,210 --> 00:59:13,028 where the position is built into the end of the function, 1178 00:59:13,028 --> 00:59:18,873 the speed of the function is now dependent on position as well. 1179 00:59:18,873 --> 00:59:19,779 Is that right? 1180 00:59:19,779 --> 00:59:22,050 Because the longer it takes, 1181 00:59:22,050 --> 00:59:23,710 PROFESSOR: I was being a little bit 1182 00:59:23,710 --> 00:59:25,720 glib find if it matches or not this linear time. 1183 00:59:25,720 --> 00:59:26,840 Now you're saying, hey, wait a minute. 1184 00:59:26,840 --> 00:59:28,548 I want to know where it is in the genome. 1185 00:59:28,548 --> 00:59:29,760 That's a big bonus, right? 1186 00:59:29,760 --> 00:59:31,500 And so you'd like to know where that is? 1187 00:59:31,500 --> 00:59:32,280 Yes. 1188 00:59:32,280 --> 00:59:34,320 But I still can do that in linear time. 1189 00:59:34,320 --> 00:59:36,236 And we'll show you how to do that in a second. 1190 00:59:36,236 --> 00:59:37,970 This is not linear time. 1191 00:59:37,970 --> 00:59:39,467 This actually needs to walk back, 1192 00:59:39,467 --> 00:59:41,550 order the length of genome for every single query. 1193 00:59:41,550 --> 00:59:43,660 That's not good. 1194 00:59:43,660 --> 00:59:45,350 All right? 
1195 00:59:45,350 --> 00:59:45,850 Well, 1196 00:59:45,850 --> 00:59:49,320 what we could do is, we could store what's called a suffix 1197 00:59:49,320 --> 00:59:51,970 array with each row and say, where in the genome 1198 00:59:51,970 --> 00:59:54,070 that position is. 1199 00:59:54,070 --> 00:59:56,780 Where that row starts. 1200 00:59:56,780 --> 00:59:58,370 And then make a simple look-up, 1201 00:59:58,370 --> 01:00:00,827 so that when you actually have a hit in row one, ah, OK, 1202 01:00:00,827 --> 01:00:02,410 it starts at position two of the genome. 1203 01:00:04,940 --> 01:00:07,810 But then the problem with that is 1204 01:00:07,810 --> 01:00:13,170 that it actually takes a lot of space. 1205 01:00:13,170 --> 01:00:15,900 And we want to have compact indices. 1206 01:00:15,900 --> 01:00:18,800 So the trick is what we do is, instead of storing 1207 01:00:18,800 --> 01:00:21,280 that entire suffix array, we store 1208 01:00:21,280 --> 01:00:24,630 every so many rows, like every 25 rows. 1209 01:00:24,630 --> 01:00:26,370 And all we do is, we walk left until we 1210 01:00:26,370 --> 01:00:29,250 hit a row that actually has the value, 1211 01:00:29,250 --> 01:00:32,860 and then we add how many times we walked left, plus the value 1212 01:00:32,860 --> 01:00:36,380 and we know where we are in the genome. 1213 01:00:36,380 --> 01:00:40,120 So we can sample the suffix array, 1214 01:00:40,120 --> 01:00:41,780 and by sampling the suffix array, 1215 01:00:41,780 --> 01:00:46,472 we cut down our storage hugely and it's still 1216 01:00:46,472 --> 01:00:47,180 pretty efficient. 1217 01:00:47,180 --> 01:00:48,596 Because what we can do is, we just 1218 01:00:48,596 --> 01:00:51,775 walk left until we hit a sampled suffix array location 1219 01:00:51,775 --> 01:00:55,441 and then add the two numbers together. 1220 01:00:55,441 --> 01:00:55,940 All right? 1221 01:00:58,880 --> 01:01:01,250 So that's how it's done. 1222 01:01:01,250 --> 01:01:03,500 OK? 1223 01:01:03,500 --> 01:01:08,750 So that's how we actually do the alignment 1224 01:01:08,750 --> 01:01:10,320 and figure out where things are. 1225 01:01:10,320 --> 01:01:12,440 The one thing I haven't told you about 1226 01:01:12,440 --> 01:01:16,170 is how to compute count efficiently. 1227 01:01:16,170 --> 01:01:18,160 Now remember what count does. 1228 01:01:18,160 --> 01:01:22,450 Count is a function-- but this is putting it 1229 01:01:22,450 --> 01:01:25,610 all together where we're matching this query, 1230 01:01:25,610 --> 01:01:26,980 we do the steps. 1231 01:01:26,980 --> 01:01:28,400 We get the match. 1232 01:01:28,400 --> 01:01:30,817 Then we do walk left once and then 1233 01:01:30,817 --> 01:01:33,150 we look at the suffix array to figure out where we are, right? 1234 01:01:36,360 --> 01:01:40,880 The business about count is that what we need to do 1235 01:01:40,880 --> 01:01:44,930 is to figure out the rank of a particular base 1236 01:01:44,930 --> 01:01:47,401 in a position in the transform. 1237 01:01:47,401 --> 01:01:49,650 And one way to do that is to go backwards through the whole 1238 01:01:49,650 --> 01:01:53,100 transform, counting how many g's occur before this one, 1239 01:01:53,100 --> 01:01:56,950 and that's very expensive, to compute the rank of this 1240 01:01:56,950 --> 01:01:57,577 particular g. 1241 01:01:57,577 --> 01:01:59,660 Remember the rank is simply the number of g's that 1242 01:01:59,660 --> 01:02:03,110 occur before this one in the BWT. 1243 01:02:03,110 --> 01:02:05,300 Very simple metric.
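Before moving on to count, here is a sketch of that sampled suffix array lookup on the same toy example, reusing bwt and lf from the earlier sketches. Which rows keep their suffix array values is chosen arbitrarily here; a real index samples every so many rows as just described.

    # The full suffix array of "acaacg$" would be [6, 2, 0, 3, 1, 4, 5];
    # keep only a subset of rows to save space.
    sa_sample = {0: 6, 2: 0, 4: 1, 6: 5}

    def locate(row):
        """Walk left until we land on a row whose suffix array value was kept,
        then add the number of steps we walked."""
        steps = 0
        while row not in sa_sample:
            row = lf(row, bwt[row])
            steps += 1
        return sa_sample[row] + steps

    print(locate(1))    # 2: the hit for "aac" starts at offset 2 of "acaacg"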
1244 01:02:05,300 --> 01:02:07,250 So instead of doing that, what we 1245 01:02:07,250 --> 01:02:12,500 can do is, we can build a data structure that every once 1246 01:02:12,500 --> 01:02:15,000 in a while, counts how many a's, c's, g's, 1247 01:02:15,000 --> 01:02:19,000 and t's have occurred before now in the BWT. 1248 01:02:19,000 --> 01:02:22,400 And so we're going to sample this with these checkpoints, 1249 01:02:22,400 --> 01:02:26,040 and then when you want to compute count at any point, 1250 01:02:26,040 --> 01:02:29,240 you can go to the nearest checkpoint, wherever that is, 1251 01:02:29,240 --> 01:02:31,340 and make an adjustment by counting 1252 01:02:31,340 --> 01:02:35,240 the number of characters between you and that checkpoint. 1253 01:02:35,240 --> 01:02:36,810 Very straightforward. 1254 01:02:36,810 --> 01:02:38,380 All right. 1255 01:02:38,380 --> 01:02:43,840 So this, coming back to the question that was asked, yes, 1256 01:02:43,840 --> 01:02:46,470 you need to build this checkpointing 1257 01:02:46,470 --> 01:02:49,510 mechanism at the same time you build the index, as well as 1258 01:02:49,510 --> 01:02:51,880 the sampling of the suffix array. 1259 01:02:51,880 --> 01:02:54,960 So a full index consists of the transform 1260 01:02:54,960 --> 01:02:58,650 itself, which is the genome transformed into its BWT. 1261 01:03:01,250 --> 01:03:04,620 And they literally take the entire genome and do this. 1262 01:03:04,620 --> 01:03:08,910 Typically they'll put dollar signs between the chromosomes. 1263 01:03:08,910 --> 01:03:10,620 So they'll transform the whole thing. 1264 01:03:10,620 --> 01:03:12,850 It takes a sampling of the suffix array 1265 01:03:12,850 --> 01:03:18,550 we just saw and it takes the checkpointing of the LF 1266 01:03:18,550 --> 01:03:21,650 function to make it constant time. 1267 01:03:21,650 --> 01:03:26,090 And that's what is inside of an FM index. 1268 01:03:26,090 --> 01:03:28,550 OK? 1269 01:03:28,550 --> 01:03:32,450 Now it's small, which is one of the nice things, 1270 01:03:32,450 --> 01:03:35,560 compared to things like suffix trees, 1271 01:03:35,560 --> 01:03:39,610 suffix arrays, or even other kinds of hash structures 1272 01:03:39,610 --> 01:03:43,040 for looking for seeds, it really is not even twice 1273 01:03:43,040 --> 01:03:44,810 the size of the genome. 1274 01:03:44,810 --> 01:03:49,667 So it's a very compact index that is very, very efficient. 1275 01:03:49,667 --> 01:03:51,250 And so it's a wonderful data structure 1276 01:03:51,250 --> 01:03:53,530 for doing what we're doing, except we have not 1277 01:03:53,530 --> 01:03:56,850 dealt with mismatches yet, right? 1278 01:03:56,850 --> 01:03:59,072 And so once again, I want to put a plug 1279 01:03:59,072 --> 01:04:02,250 in for BWA, which is really a marvelous aligner. 1280 01:04:02,250 --> 01:04:05,360 And we'll talk about it tomorrow in recitation 1281 01:04:05,360 --> 01:04:08,140 if you want to know all of the details of what it actually 1282 01:04:08,140 --> 01:04:11,350 takes to make this work in practice. 1283 01:04:11,350 --> 01:04:17,160 Now, it finds exact matches quickly, 1284 01:04:17,160 --> 01:04:21,300 but it doesn't really have any allowances for mismatches. 1285 01:04:21,300 --> 01:04:25,260 And the way that Bowtie and other aligners deal with this, 1286 01:04:25,260 --> 01:04:26,910 and they're all pretty consistent, 1287 01:04:26,910 --> 01:04:31,130 is in the following way, which is that they do backtracking. 1288 01:04:31,130 --> 01:04:33,210 Here's the idea.
1289 01:04:33,210 --> 01:04:39,440 You try and match something or match a read 1290 01:04:39,440 --> 01:04:42,272 and you get to a particular point in the read, 1291 01:04:42,272 --> 01:04:43,480 and you can't go any further. 1292 01:04:43,480 --> 01:04:46,770 Top is equal to bottom. 1293 01:04:46,770 --> 01:04:48,750 So you know that there's no suffix 1294 01:04:48,750 --> 01:04:52,370 in the genome that matches your query. 1295 01:04:52,370 --> 01:04:54,830 So what do you do? 1296 01:04:54,830 --> 01:04:58,130 Well, what you can do is you can try all of the different bases 1297 01:04:58,130 --> 01:05:00,110 at that position besides the one you 1298 01:05:00,110 --> 01:05:02,110 tried to see whether it matches or not. 1299 01:05:02,110 --> 01:05:04,450 I can see the horror coming over people. 1300 01:05:04,450 --> 01:05:06,940 Oh, no, not backtracking, not that. 1301 01:05:06,940 --> 01:05:12,640 But sometimes it actually works. 1302 01:05:12,640 --> 01:05:15,760 And just to give you order of magnitude idea about how this 1303 01:05:15,760 --> 01:05:18,370 works in practice, when reads don't match, 1304 01:05:18,370 --> 01:05:22,390 they limit backtracking to about 125 times in these aligners. 1305 01:05:22,390 --> 01:05:25,070 so they try pretty hard to actually match things. 1306 01:05:25,070 --> 01:05:29,260 And yes, it is true that even with this backtracking, 1307 01:05:29,260 --> 01:05:32,450 it's still a great approach. 1308 01:05:32,450 --> 01:05:35,700 And sometimes the first thing you try doesn't work, 1309 01:05:35,700 --> 01:05:38,840 and you have to backtrack, trying multiple bases 1310 01:05:38,840 --> 01:05:41,560 at that location until you get one that matches. 1311 01:05:41,560 --> 01:05:43,210 And then you can proceed. 1312 01:05:43,210 --> 01:05:46,670 OK And you eventually wind up at the alignment 1313 01:05:46,670 --> 01:05:49,030 you see in the lower right hand corner, where you're 1314 01:05:49,030 --> 01:05:54,170 substituting a g for an a, an a for a g, excuse me, 1315 01:05:54,170 --> 01:05:56,870 to make it go forward. 1316 01:05:56,870 --> 01:05:59,120 Do people understand the essential idea 1317 01:05:59,120 --> 01:06:04,370 of this idea of backtracking? 1318 01:06:04,370 --> 01:06:09,540 Does anybody have any comments or questions about it? 1319 01:06:09,540 --> 01:06:12,490 Like ew, or ideas? 1320 01:06:12,490 --> 01:06:12,990 Yes. 1321 01:06:12,990 --> 01:06:14,340 AUDIENCE: What about gaps? 1322 01:06:14,340 --> 01:06:15,465 PROFESSOR: What about gaps? 1323 01:06:19,340 --> 01:06:22,680 BWA, I believe, processes gaps. 1324 01:06:22,680 --> 01:06:27,720 But gaps are much, much less likely than missed bases. 1325 01:06:27,720 --> 01:06:31,740 The other thing is that if you're doing a sequencing 1326 01:06:31,740 --> 01:06:35,850 library, and you have a read that actually has a gap in it, 1327 01:06:35,850 --> 01:06:39,460 it's probably the case you have another read that doesn't. 1328 01:06:39,460 --> 01:06:41,460 For the same sequence. 1329 01:06:41,460 --> 01:06:45,100 So it is less important to process gaps 1330 01:06:45,100 --> 01:06:48,590 than it is to process differences. 1331 01:06:48,590 --> 01:06:52,920 The reason is that differences mean 1332 01:06:52,920 --> 01:06:56,070 that it might be a difference of an allele. 1333 01:06:56,070 --> 01:06:58,320 In other words, it might be that your base 1334 01:06:58,320 --> 01:07:00,860 is different than the reference genome. 1335 01:07:00,860 --> 01:07:03,484 Indels are also possible. 
1336 01:07:03,484 --> 01:07:04,900 And there are different strategies 1337 01:07:04,900 --> 01:07:05,816 of dealing with those. 1338 01:07:05,816 --> 01:07:09,189 That would be a great question for Hang tomorrow about gaps. 1339 01:07:09,189 --> 01:07:11,230 Because he can tell you in practice what they do. 1340 01:07:11,230 --> 01:07:12,660 And we'll get into a little bit of that 1341 01:07:12,660 --> 01:07:13,790 at the end of today's lecture. 1342 01:07:13,790 --> 01:07:14,289 Yes. 1343 01:07:14,289 --> 01:07:15,534 Question? 1344 01:07:15,534 --> 01:07:16,358 AUDIENCE: I'm Levy. 1345 01:07:16,358 --> 01:07:17,090 PROFESSOR: Hi, Levy. 1346 01:07:17,090 --> 01:07:17,690 AUDIENCE: How do you make sure when 1347 01:07:17,690 --> 01:07:18,430 you're backtracking that you end up 1348 01:07:18,430 --> 01:07:19,638 with the best possible match? 1349 01:07:19,638 --> 01:07:22,037 Do you just go down the first-- 1350 01:07:22,037 --> 01:07:23,620 PROFESSOR: The questions is how do you 1351 01:07:23,620 --> 01:07:26,690 guarantee you wind up with the best possible match? 1352 01:07:26,690 --> 01:07:28,991 The short answer is that you don't. 1353 01:07:28,991 --> 01:07:30,490 There's a longer answer, which we're 1354 01:07:30,490 --> 01:07:34,200 about to get to, about how we try to approximate that. 1355 01:07:34,200 --> 01:07:38,280 And what judgment you would use to get to what we would think 1356 01:07:38,280 --> 01:07:40,220 is a practically good match. 1357 01:07:40,220 --> 01:07:41,100 OK? 1358 01:07:41,100 --> 01:07:42,870 But in terms of theoretically optimal, 1359 01:07:42,870 --> 01:07:45,632 the answer is, it doesn't attempt to do that. 1360 01:07:45,632 --> 01:07:46,590 That's a good question. 1361 01:07:46,590 --> 01:07:48,000 Yes. 1362 01:07:48,000 --> 01:07:50,375 AUDIENCE: In practice, does this backtracking method 1363 01:07:50,375 --> 01:07:54,660 at the same time as you're computing the matches or-- 1364 01:07:54,660 --> 01:07:55,590 PROFESSOR: Yes. 1365 01:07:55,590 --> 01:07:57,506 So what's happening is, you remember that loop 1366 01:07:57,506 --> 01:07:59,324 where we're going around, where we 1367 01:07:59,324 --> 01:08:00,990 were moving the top and bottom pointers. 1368 01:08:00,990 --> 01:08:03,130 If you get to a point where they come together, 1369 01:08:03,130 --> 01:08:07,300 then you would at that point, begin backtracking 1370 01:08:07,300 --> 01:08:09,280 and try different bases. 1371 01:08:09,280 --> 01:08:11,380 And if you look, I posted the BWA paper 1372 01:08:11,380 --> 01:08:12,400 on the Stellar website. 1373 01:08:12,400 --> 01:08:13,983 And if you look in one of the figures, 1374 01:08:13,983 --> 01:08:15,940 the algorithm is there, And you'll actually 1375 01:08:15,940 --> 01:08:19,210 see, if you can deconvolute what's going on, 1376 01:08:19,210 --> 01:08:23,479 that inside the loop, it's actually doing exactly that. 1377 01:08:23,479 --> 01:08:23,979 Yes. 1378 01:08:23,979 --> 01:08:24,689 Question. 1379 01:08:24,689 --> 01:08:27,581 AUDIENCE: In practice, is the number of errors is small, 1380 01:08:27,581 --> 01:08:29,991 would it make sense just to use [INAUDIBLE]? 1381 01:08:33,257 --> 01:08:34,840 PROFESSOR: We're going to get to that. 1382 01:08:34,840 --> 01:08:38,370 I think the question was, if the number of errors is small, 1383 01:08:38,370 --> 01:08:41,467 would it be good to actually use a different algorithm, 1384 01:08:41,467 --> 01:08:42,800 drop into a different algorithm? 
1385 01:08:42,800 --> 01:08:47,170 So there are algorithms that do FM index-assisted 1386 01:08:47,170 --> 01:08:48,870 Smith-Waterman, for example. 1387 01:08:48,870 --> 01:08:50,569 Where you get to the neighborhood 1388 01:08:50,569 --> 01:08:54,510 by a fast technique and then you do a full search, 1389 01:08:54,510 --> 01:08:58,384 using a more in-depth, principled methodology, right? 1390 01:08:58,384 --> 01:09:00,550 And so there's some papers I have in the slides here 1391 01:09:00,550 --> 01:09:04,420 that I referenced that do exactly that. 1392 01:09:04,420 --> 01:09:06,920 These are all great questions. 1393 01:09:06,920 --> 01:09:07,850 OK. 1394 01:09:07,850 --> 01:09:09,000 Yes. 1395 01:09:09,000 --> 01:09:11,350 AUDIENCE: If you're-- If you only decide to backtrack 1396 01:09:11,350 --> 01:09:13,580 a certain number of times, like 100 times, 1397 01:09:13,580 --> 01:09:16,600 then wouldn't like the alignment be biased towards the end 1398 01:09:16,600 --> 01:09:18,103 of the short read? 1399 01:09:18,103 --> 01:09:20,850 PROFESSOR: I am so glad you asked this question. 1400 01:09:20,850 --> 01:09:24,070 The question is, and what was your name again? 1401 01:09:24,070 --> 01:09:24,736 AUDIENCE: Kevin. 1402 01:09:24,736 --> 01:09:25,444 PROFESSOR: Kevin. 1403 01:09:25,444 --> 01:09:28,510 Kevin asks, gee, if you're matching from the right 1404 01:09:28,510 --> 01:09:30,500 to the left, and you're doing backtracking, 1405 01:09:30,500 --> 01:09:32,520 isn't this going to be biased towards 1406 01:09:32,520 --> 01:09:35,029 the right end of the read, in some sense, right? 1407 01:09:35,029 --> 01:09:37,710 Because if the right end of the read doesn't match, 1408 01:09:37,710 --> 01:09:39,810 then you're going to give up, right? 1409 01:09:39,810 --> 01:09:42,399 In fact, what we know is the left end of the read 1410 01:09:42,399 --> 01:09:43,750 is the better end of the read. 1411 01:09:43,750 --> 01:09:48,569 Because sequences are done five prime to three prime and thus 1412 01:09:48,569 --> 01:09:50,960 typically, the highest quality scores or the best quality 1413 01:09:50,960 --> 01:09:55,130 scores are in the left hand side of the read. 1414 01:09:55,130 --> 01:09:59,450 So do you have any idea about how you would cope with that? 1415 01:09:59,450 --> 01:10:01,324 AUDIENCE: You could just reverse one of them. 1416 01:10:01,324 --> 01:10:02,168 But you'd reverse-- 1417 01:10:02,168 --> 01:10:02,950 PROFESSOR: Exactly what they do. 1418 01:10:02,950 --> 01:10:05,450 They take the entire genome and they reverse it and then 1419 01:10:05,450 --> 01:10:08,756 they index that. 1420 01:10:08,756 --> 01:10:11,006 And so, when they create what's called a mirror index, 1421 01:10:11,006 --> 01:10:12,390 they just reverse the entire genome, 1422 01:10:12,390 --> 01:10:13,889 and now you can match left to right, 1423 01:10:13,889 --> 01:10:15,600 as opposed to right to left. 1424 01:10:15,600 --> 01:10:17,530 Pretty cool, huh? 1425 01:10:17,530 --> 01:10:18,030 Yeah. 1426 01:10:21,090 --> 01:10:26,445 So backtracking, just note that there are different alignments 1427 01:10:26,445 --> 01:10:29,030 that can occur across different backtracking paths. 1428 01:10:29,030 --> 01:10:31,850 And this is not optimal in any sense.
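Here is a deliberately simplified sketch of that backtracking idea, reusing bwt and OCC from earlier. It is not BWA's or Bowtie's actual algorithm: it tries alternative bases in a fixed order with a small mismatch budget, rather than choosing the backtrack position by quality score as discussed next.

    def search_with_mismatches(query, bwt, occ, max_mm=1):
        """Toy backtracking search: return (top, bottom, mismatches) for the
        first alignment found with at most max_mm substitutions, else None."""
        def step(i, top, bottom, mm):
            if i < 0:                                  # consumed the whole query
                return (top, bottom, mm)
            others = tuple(b for b in "acgt" if b != query[i])
            for c in (query[i],) + others:             # try the read's base first
                if c != query[i] and mm == max_mm:     # mismatch budget exhausted
                    continue
                if c not in occ:                       # base absent from this toy text
                    continue
                t = occ[c] + bwt[:top].count(c)
                b = occ[c] + bwt[:bottom].count(c)
                if t < b:                              # some suffixes still match
                    hit = step(i - 1, t, b, mm + (c != query[i]))
                    if hit:
                        return hit
            return None                                # dead end: caller backtracks
        return step(len(query) - 1, 0, len(bwt), 0)

    print(search_with_mismatches("aag", bwt, OCC))     # (3, 4, 1): one substitution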
1429 01:10:31,850 --> 01:10:34,060 And to your question about how you actually 1430 01:10:34,060 --> 01:10:38,220 go about picking a backtracking strategy, 1431 01:10:38,220 --> 01:10:40,290 assuming we're matching from right to left again 1432 01:10:40,290 --> 01:10:46,510 for a moment, what you can do is, if you hit a mismatch, 1433 01:10:46,510 --> 01:10:50,577 you backtrack to the lowest quality base position, 1434 01:10:50,577 --> 01:10:51,660 according to PHRED scores. 1435 01:10:51,660 --> 01:10:53,350 We talked about PHRED scores earlier, 1436 01:10:53,350 --> 01:10:56,270 which are shown here on the slide. 1437 01:10:56,270 --> 01:10:58,840 And you backtrack there and you try a different base 1438 01:10:58,840 --> 01:11:00,770 and you move forward from there. 1439 01:11:00,770 --> 01:11:02,890 So you're assuming that the read, which 1440 01:11:02,890 --> 01:11:06,220 is the query, which has associated quality scores, 1441 01:11:06,220 --> 01:11:10,190 is most suspect where the quality score is the lowest. 1442 01:11:10,190 --> 01:11:15,020 So you backtrack to the right to the leftmost lowest quality 1443 01:11:15,020 --> 01:11:16,710 score. 1444 01:11:16,710 --> 01:11:21,160 Now it's a very simple approach. 1445 01:11:21,160 --> 01:11:22,040 Right? 1446 01:11:22,040 --> 01:11:24,270 And we talked a little bit about the idea 1447 01:11:24,270 --> 01:11:28,080 that you don't necessarily want to match from the right side 1448 01:11:28,080 --> 01:11:30,880 and thus, typically the parameters to algorithms 1449 01:11:30,880 --> 01:11:33,320 like this include, how many mismatches 1450 01:11:33,320 --> 01:11:35,906 are allowed in the first L bases on the left end, the sum 1451 01:11:35,906 --> 01:11:38,030 of the mismatch qualities you're going to tolerate, 1452 01:11:38,030 --> 01:11:38,580 and so forth. 1453 01:11:38,580 --> 01:11:41,246 And you'll find that these aligners come with a lot of switches 1454 01:11:41,246 --> 01:11:42,120 that you can set. 1455 01:11:42,120 --> 01:11:44,025 And you can consult with your colleagues 1456 01:11:44,025 --> 01:11:45,900 about how to set switches, because it depends 1457 01:11:45,900 --> 01:11:47,941 upon the particular type of data you're aligning, 1458 01:11:47,941 --> 01:11:50,560 the length of the reads and so forth. 1459 01:11:50,560 --> 01:11:55,780 But suffice it to say, when you're doing this, 1460 01:11:55,780 --> 01:11:58,760 typically we create these mirror indices that actually reverse 1461 01:11:58,760 --> 01:12:01,300 the entire genome and then index it. 1462 01:12:01,300 --> 01:12:05,180 So we can match either right to left or left to right. 1463 01:12:05,180 --> 01:12:09,650 And so for example, if you have a mirror index, 1464 01:12:09,650 --> 01:12:13,480 and you only tolerate up to two errors, 1465 01:12:13,480 --> 01:12:14,946 then you know that either you're 1466 01:12:14,946 --> 01:12:16,320 going to get the first half right 1467 01:12:16,320 --> 01:12:19,450 in one index or in the mirror index. 1468 01:12:19,450 --> 01:12:22,060 And so you can use both indices in parallel, the forward 1469 01:12:22,060 --> 01:12:24,060 and the reverse index of the genome, 1470 01:12:24,060 --> 01:12:26,770 and then get pretty far into the read 1471 01:12:26,770 --> 01:12:29,990 before you have to start backtracking. 1472 01:12:29,990 --> 01:12:32,120 There are all these sorts of techniques, 1473 01:12:32,120 --> 01:12:34,380 shall we say, to actually overcome some of 1474 01:12:34,380 --> 01:12:36,900 the limitations of backtracking.
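And a sketch of what building the mirror index amounts to: reverse the genome, then compute its BWT. This uses the naive sort-all-rotations construction, which is fine for a toy example; as noted below, real tools build the transform with much cleverer methods that avoid sorting everything.

    def bwt_of(text):
        """Naive BWT: sort all rotations of text (which must end in '$')
        and take the last column."""
        rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
        return "".join(rotation[-1] for rotation in rotations)

    genome = "acaacg"                              # toy example genome
    forward_index = bwt_of(genome + "$")           # "gc$aaac", as used above
    mirror_index = bwt_of(genome[::-1] + "$")      # BWT of the reversed genome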
1475 01:12:36,900 --> 01:12:41,020 Any questions about backtracking at all? 1476 01:12:41,020 --> 01:12:41,650 Yes. 1477 01:12:41,650 --> 01:12:44,145 AUDIENCE: Is it trivial knowing the BWT 1478 01:12:44,145 --> 01:12:47,638 originally to find the mirror BWT? 1479 01:12:47,638 --> 01:12:48,636 Like for example, 1480 01:12:48,636 --> 01:12:50,632 PROFESSOR: No, it's not trivial. 1481 01:12:50,632 --> 01:12:52,628 AUDIENCE: So it's not like a simple matrix 1482 01:12:52,628 --> 01:12:54,352 transforming [INAUDIBLE]. 1483 01:12:54,352 --> 01:12:54,935 PROFESSOR: No. 1484 01:12:54,935 --> 01:12:55,957 Not to my knowledge. 1485 01:12:55,957 --> 01:12:58,373 I think you start with the original genome, you reverse it 1486 01:12:58,373 --> 01:13:00,100 and then you compute the BWT with that. 1487 01:13:00,100 --> 01:13:00,600 Right? 1488 01:13:00,600 --> 01:13:02,832 That's pretty easy to do. 1489 01:13:02,832 --> 01:13:04,290 And Hang was explaining to me today 1490 01:13:04,290 --> 01:13:07,170 his new ways of computing the BWT, which 1491 01:13:07,170 --> 01:13:09,880 don't actually involve sorting the entire thing. 1492 01:13:09,880 --> 01:13:12,685 There are insertion ways of computing the BWT that are very 1493 01:13:12,685 --> 01:13:15,060 [INAUDIBLE], and you could ask him this question tomorrow 1494 01:13:15,060 --> 01:13:16,850 if you care to come. 1495 01:13:16,850 --> 01:13:20,650 All right, just to give you an idea of how complex things are, 1496 01:13:20,650 --> 01:13:23,830 to build an index like this takes, 1497 01:13:23,830 --> 01:13:25,770 for the entire human genome, we're 1498 01:13:25,770 --> 01:13:29,540 talking five hours of compute time 1499 01:13:29,540 --> 01:13:32,220 to compute an index, to give you an order of magnitude 1500 01:13:32,220 --> 01:13:36,095 of the time to compute the BWT, 1501 01:13:36,095 --> 01:13:39,780 the LF checkpoints, and the suffix array sampling. 1502 01:13:39,780 --> 01:13:41,070 Something like that. 1503 01:13:41,070 --> 01:13:42,620 So it's really not too bad to compute 1504 01:13:42,620 --> 01:13:46,050 the index of the entire genome. 1505 01:13:46,050 --> 01:13:49,620 And to do searches, you know, we're talking about, 1506 01:13:49,620 --> 01:13:51,070 like on a four processor machine, 1507 01:13:51,070 --> 01:13:52,444 we're talking about maybe upwards 1508 01:13:52,444 --> 01:13:55,560 of 100 million reads per hour to map. 1509 01:13:55,560 --> 01:13:57,400 So if you have 200 million reads and you 1510 01:13:57,400 --> 01:14:00,090 want to map them to a genome, or align them as it's sometimes 1511 01:14:00,090 --> 01:14:03,970 called, it's going to take you a couple hours to do it. 1512 01:14:03,970 --> 01:14:07,400 So this is sort of the order of magnitude of the time required 1513 01:14:07,400 --> 01:14:10,730 to do these sorts of functions. 1514 01:14:10,730 --> 01:14:12,550 And there are a couple fine points 1515 01:14:12,550 --> 01:14:14,656 I wanted to end with today. 1516 01:14:14,656 --> 01:14:16,780 The first is we haven't talked at all about paired-end 1517 01:14:16,780 --> 01:14:17,990 read alignment. 1518 01:14:23,190 --> 01:14:27,950 In paired read alignment, you get, for each molecule, 1519 01:14:27,950 --> 01:14:29,180 you get two reads. 1520 01:14:29,180 --> 01:14:32,360 One starting at the five prime end on one side, 1521 01:14:32,360 --> 01:14:35,840 and one starting from the five prime end on the other side.
1522 01:14:35,840 --> 01:14:38,540 So typical read lengths might be 100 base pairs on the left 1523 01:14:38,540 --> 01:14:41,490 and 100 base pairs on the right. 1524 01:14:41,490 --> 01:14:45,340 What is called the insert size is the total size 1525 01:14:45,340 --> 01:14:49,260 of the molecule from five prime end to five prime end. 1526 01:14:49,260 --> 01:14:52,230 And the stuff in the middle is not observed. 1527 01:14:52,230 --> 01:14:54,600 We actually don't know what it is. 1528 01:14:54,600 --> 01:14:57,610 And we also don't know how long it is. 1529 01:14:57,610 --> 01:15:00,460 Now when these libraries are prepared, 1530 01:15:00,460 --> 01:15:02,520 size selection is done, so we get 1531 01:15:02,520 --> 01:15:05,380 a rough idea of what it should be. 1532 01:15:05,380 --> 01:15:06,970 We can actually compute by looking 1533 01:15:06,970 --> 01:15:10,982 at where things align on the genome, what it actually is. 1534 01:15:10,982 --> 01:15:12,190 But we don't know absolutely. 1535 01:15:14,700 --> 01:15:18,010 If we were able to strictly control 1536 01:15:18,010 --> 01:15:20,430 the length of the unobserved part, which 1537 01:15:20,430 --> 01:15:25,370 is almost impossible to do, then we would get molecular rulers. 1538 01:15:25,370 --> 01:15:28,100 And we would know exactly down to the base, 1539 01:15:28,100 --> 01:15:30,480 whether or not there were indels between the left read 1540 01:15:30,480 --> 01:15:32,354 and the right read when we did the alignment. 1541 01:15:32,354 --> 01:15:35,790 We actually don't have that today. 1542 01:15:35,790 --> 01:15:37,550 The sequencing instrument actually 1543 01:15:37,550 --> 01:15:39,970 identifies the read pairs in its output. 1544 01:15:39,970 --> 01:15:41,660 That's the only way to do this. 1545 01:15:41,660 --> 01:15:44,220 So when you get an output file, like a FASTQ file, 1546 01:15:44,220 --> 01:15:46,920 from a sequencing instrument, it will tell you, 1547 01:15:46,920 --> 01:15:48,860 for a given molecule, here's the left read 1548 01:15:48,860 --> 01:15:50,314 and here's the right read. 1549 01:15:50,314 --> 01:15:52,480 Although left and right are really sort of misnomers 1550 01:15:52,480 --> 01:15:55,060 because there really is no left and right, right? 1551 01:15:55,060 --> 01:15:57,018 This is one end and then this is the other end. 1552 01:15:59,056 --> 01:16:00,680 Typical ways of processing these paired 1553 01:16:00,680 --> 01:16:03,950 reads, first you align left and right reads. 1554 01:16:03,950 --> 01:16:05,935 And they could really only be oriented 1555 01:16:05,935 --> 01:16:07,560 with respect to a genome sequence where 1556 01:16:07,560 --> 01:16:09,976 you say that one has a lower coordinate than the other one 1557 01:16:09,976 --> 01:16:12,680 when you're actually doing the alignment. 1558 01:16:12,680 --> 01:16:15,380 And if one read fails to align uniquely, then what you can do 1559 01:16:15,380 --> 01:16:18,100 is, you know what neighborhood you're in because you know, 1560 01:16:18,100 --> 01:16:21,780 roughly speaking, what the insert size is, 1561 01:16:21,780 --> 01:16:24,420 so you can do Smith-Waterman to actually try and locate 1562 01:16:24,420 --> 01:16:26,620 the other read in that neighborhood. 1563 01:16:26,620 --> 01:16:30,970 Or you can tolerate multiply mapped reads.
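As a small illustration of the insert-size bookkeeping described above, here is a hypothetical helper; the coordinate convention and the expected range are invented for the example, since in practice the range is estimated from the data after size selection.

    def observed_insert_size(left_start, right_start, right_read_len):
        """Distance from one 5' end to the other, measured on the genome
        (0-based coordinates, assuming left_start < right_start)."""
        return (right_start + right_read_len) - left_start

    def looks_properly_paired(left_start, right_start, right_read_len,
                              expected_min=200, expected_max=500):
        """Crude consistency check against the library's rough size selection."""
        size = observed_insert_size(left_start, right_start, right_read_len)
        return expected_min <= size <= expected_max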
1564 01:16:30,970 --> 01:16:34,220 One thing that I did not mention to you explicitly, 1565 01:16:34,220 --> 01:16:36,380 is that when you match the entire query, 1566 01:16:36,380 --> 01:16:39,390 and top and bottom are more than one away from each other, 1567 01:16:39,390 --> 01:16:41,460 that means you've got many places in the genome 1568 01:16:41,460 --> 01:16:43,384 that things map. 1569 01:16:43,384 --> 01:16:45,300 And thus you may report all of those locations 1570 01:16:45,300 --> 01:16:49,030 or I might report the first one. 1571 01:16:49,030 --> 01:16:54,700 So that's one bit of insight into how to do a map paired 1572 01:16:54,700 --> 01:16:55,950 reads. 1573 01:16:55,950 --> 01:16:57,790 And these are becoming very important 1574 01:16:57,790 --> 01:16:59,900 because as sequencing costs go down, 1575 01:16:59,900 --> 01:17:03,180 people are doing more and more paired 1576 01:17:03,180 --> 01:17:05,240 and sequencing because they give you 1577 01:17:05,240 --> 01:17:08,790 much more information about the original library you created 1578 01:17:08,790 --> 01:17:11,609 and for certain protocols can allow you to localize events 1579 01:17:11,609 --> 01:17:13,025 in the genome far more accurately. 1580 01:17:16,500 --> 01:17:18,930 Final piece of advice on considerations 1581 01:17:18,930 --> 01:17:20,580 for read alignment. 1582 01:17:20,580 --> 01:17:24,780 We talked about the idea that some reads will map or align 1583 01:17:24,780 --> 01:17:30,390 uniquely to the genome and some will multimap. 1584 01:17:30,390 --> 01:17:34,740 You know that the genome is roughly 50% repeat sequence. 1585 01:17:34,740 --> 01:17:38,600 And thus it's likely that if you have a particular read 1586 01:17:38,600 --> 01:17:41,030 molecule, there's a reasonable chance 1587 01:17:41,030 --> 01:17:44,029 that it will map to multiple locations. 1588 01:17:44,029 --> 01:17:45,070 Is there a question here? 1589 01:17:45,070 --> 01:17:45,570 No. 1590 01:17:45,570 --> 01:17:47,317 OK. 1591 01:17:47,317 --> 01:17:49,900 You have to figure out what your desired mismatch tolerance is 1592 01:17:49,900 --> 01:17:53,620 when you're doing alignment and set the parameters 1593 01:17:53,620 --> 01:17:56,610 to your aligner carefully, after reading the documentation 1594 01:17:56,610 --> 01:17:58,940 thoroughly, because as you could tell, 1595 01:17:58,940 --> 01:18:01,300 there's no beautiful matrix formulation 1596 01:18:01,300 --> 01:18:05,420 like there is with a well established basis 1597 01:18:05,420 --> 01:18:09,169 in the literature, rather it's more ad hoc. 1598 01:18:09,169 --> 01:18:10,960 And you need to figure out what the desired 1599 01:18:10,960 --> 01:18:14,520 processing is for paired reads. 1600 01:18:14,520 --> 01:18:16,450 So what we've talked about today is 1601 01:18:16,450 --> 01:18:19,640 we started off talking about library complexity and the idea 1602 01:18:19,640 --> 01:18:23,910 that when we get a bunch of reads from a sequencer, 1603 01:18:23,910 --> 01:18:26,270 we can use that collection of reads to estimate 1604 01:18:26,270 --> 01:18:28,570 the complexity of our original library 1605 01:18:28,570 --> 01:18:30,210 and whether or not something went 1606 01:18:30,210 --> 01:18:33,520 wrong in the biological processing that we were doing. 1607 01:18:33,520 --> 01:18:35,500 Assuming it's a good set of reads, 1608 01:18:35,500 --> 01:18:38,260 we need to figure out where they align to the genome. 
1609 01:18:38,260 --> 01:18:41,390 So we talked about this idea of creating a full text minute 1610 01:18:41,390 --> 01:18:44,800 size index, which involves a Burrows-Wheeler transform. 1611 01:18:44,800 --> 01:18:47,650 And we saw how we can compute that and throw away 1612 01:18:47,650 --> 01:18:49,760 almost everything else except for the BWT 1613 01:18:49,760 --> 01:18:55,340 itself, the suffix array checkpoints, and the FM index 1614 01:18:55,340 --> 01:18:59,600 checkpoints to be able to reconstruct this 1615 01:18:59,600 --> 01:19:02,520 at a relatively modest increase in size over the genome 1616 01:19:02,520 --> 01:19:05,100 itself and do this very, very rapid matching, 1617 01:19:05,100 --> 01:19:10,470 albeit with more problematic matching of mismatches. 1618 01:19:10,470 --> 01:19:12,830 And then we turned to the question 1619 01:19:12,830 --> 01:19:15,930 of how to deal with those mismatches with backtracking 1620 01:19:15,930 --> 01:19:20,900 and some fine points on paired end alignment. 1621 01:19:20,900 --> 01:19:26,680 So that is the end of today's lecture. 1622 01:19:26,680 --> 01:19:29,230 On Tuesday of next week, we'll talk 1623 01:19:29,230 --> 01:19:33,840 about how to actually construct a reference genome, which 1624 01:19:33,840 --> 01:19:36,320 is a really neat thing to be able to do, take 1625 01:19:36,320 --> 01:19:40,630 a whole bunch of reads, put the puzzle back together again. 1626 01:19:40,630 --> 01:19:43,440 I would encourage you to make sure you understand 1627 01:19:43,440 --> 01:19:45,360 how this indexing strategy works. 1628 01:19:45,360 --> 01:19:46,250 Look at the slides. 1629 01:19:46,250 --> 01:19:47,844 Feel free to ask any of us. 1630 01:19:47,844 --> 01:19:49,260 Thanks so much for your attention. 1631 01:19:49,260 --> 01:19:50,330 Welcome back. 1632 01:19:50,330 --> 01:19:51,800 Have a great weekend. 1633 01:19:51,800 --> 01:19:54,370 We'll see you next Tuesday.