1 00:00:00,070 --> 00:00:01,780 The following content is provided 2 00:00:01,780 --> 00:00:04,019 under a Creative Commons license. 3 00:00:04,019 --> 00:00:06,870 Your support will help MIT OpenCourseWare continue 4 00:00:06,870 --> 00:00:10,730 to offer high quality educational resources for free. 5 00:00:10,730 --> 00:00:13,330 To make a donation or view additional materials 6 00:00:13,330 --> 00:00:17,215 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:17,215 --> 00:00:17,840 at ocw.mit.edu. 8 00:00:27,790 --> 00:00:33,820 PROFESSOR: Well, welcome back to computational systems biology. 9 00:00:33,820 --> 00:00:36,980 We're back here today talking about genome assembly. 10 00:00:36,980 --> 00:00:43,150 How many people have ever assembled a genome before? 11 00:00:43,150 --> 00:00:44,050 In your spare time? 12 00:00:44,050 --> 00:00:46,740 Anybody done any genome assembly here? 13 00:00:46,740 --> 00:00:48,330 One person? 14 00:00:48,330 --> 00:00:50,590 I think genome assembly is a fascinating topic. 15 00:00:50,590 --> 00:00:54,830 And as you know, it's at the bedrock of all modern biology. 16 00:00:54,830 --> 00:00:59,750 We rely upon genome references for almost everything in terms 17 00:00:59,750 --> 00:01:03,690 of studying evolution, looking at the structure of genes, 18 00:01:03,690 --> 00:01:07,350 regulation of genes, differences between individuals. 19 00:01:07,350 --> 00:01:11,970 So it's really a very fundamental concept. 20 00:01:11,970 --> 00:01:14,220 And we're going to talk today about two different ways 21 00:01:14,220 --> 00:01:15,450 of assembling genomes. 22 00:01:15,450 --> 00:01:18,290 And I think one of the takeaway messages from today's lecture 23 00:01:18,290 --> 00:01:21,200 is going to be that genome assembly is more 24 00:01:21,200 --> 00:01:23,740 of an art, in some sense, than a science. 
25 00:01:23,740 --> 00:01:25,320 And one has to always be a little bit 26 00:01:25,320 --> 00:01:28,310 suspicious of a genome assembly given 27 00:01:28,310 --> 00:01:30,300 what you're about to learn today. 28 00:01:30,300 --> 00:01:33,920 And, of course, genome assembly is becoming even more complex 29 00:01:33,920 --> 00:01:37,700 because it used to be that assembling the human genome 30 00:01:37,700 --> 00:01:41,740 was the big task scientifically in front of the community. 31 00:01:41,740 --> 00:01:44,230 But now there are billions of genomes waiting 32 00:01:44,230 --> 00:01:48,020 to be sequenced-- all the individuals in the world-- 33 00:01:48,020 --> 00:01:49,330 and to try and interpret them. 34 00:01:49,330 --> 00:01:50,996 And now you can get your genome sequenced 35 00:01:50,996 --> 00:01:52,290 for between $5,000 and $10,000. 36 00:01:52,290 --> 00:01:57,220 How many people here are tempted to get their genome sequenced? 37 00:01:57,220 --> 00:02:00,461 OK, I see about five hands-- six hands. 38 00:02:00,461 --> 00:02:00,960 Great. 39 00:02:00,960 --> 00:02:07,950 So let's look at the science behind genome assembly. 40 00:02:07,950 --> 00:02:10,289 The basic concept is that we're going 41 00:02:10,289 --> 00:02:15,089 to collect some sequence reads from the genome. 42 00:02:15,089 --> 00:02:16,630 And we're going to assemble them into 43 00:02:16,630 --> 00:02:20,870 what are called contigs, for contiguous segments. 44 00:02:20,870 --> 00:02:22,827 And these represent uninterrupted portions 45 00:02:22,827 --> 00:02:24,660 of the genome that are completely covered by 46 00:02:24,660 --> 00:02:26,485 reads that we believe are contiguous. 47 00:02:29,290 --> 00:02:33,730 These contigs then will be paired together in scaffolds. 48 00:02:33,730 --> 00:02:36,360 And scaffolds are like contigs except that there are missing 49 00:02:36,360 --> 00:02:39,560 parts between the contigs in a scaffold. 
50 00:02:39,560 --> 00:02:42,360 We don't know what those parts are. 51 00:02:42,360 --> 00:02:44,710 But we're able to actually glue them together 52 00:02:44,710 --> 00:02:47,750 by using read pairs that allow us to jump over the missing 53 00:02:47,750 --> 00:02:50,870 parts because we have read both ends of a molecule. 54 00:02:50,870 --> 00:02:54,190 But we don't know what's in the middle. 55 00:02:54,190 --> 00:02:56,730 And then oftentimes we have physical mapping technologies 56 00:02:56,730 --> 00:03:00,240 where we actually can go back and assign scaffolds 57 00:03:00,240 --> 00:03:03,040 to physical locations on chromosomes 58 00:03:03,040 --> 00:03:08,260 by using PCR sequences like sequence-tagged sites 59 00:03:08,260 --> 00:03:12,980 that physically locate a particular sequence 60 00:03:12,980 --> 00:03:16,655 identity to a physical location on a particular chromosome. 61 00:03:16,655 --> 00:03:19,804 And that provides us with a total genome map. 62 00:03:19,804 --> 00:03:21,220 So today we're going to be talking 63 00:03:21,220 --> 00:03:25,440 about how to go from a hard drive full of sequence 64 00:03:25,440 --> 00:03:28,505 reads all the way down to a set of scaffolds 65 00:03:28,505 --> 00:03:31,820 that include assembled contigs. 66 00:03:31,820 --> 00:03:35,560 And the way to think about this once again 67 00:03:35,560 --> 00:03:38,470 is that we start with conceptually a single copy 68 00:03:38,470 --> 00:03:39,230 of the genome. 69 00:03:39,230 --> 00:03:42,250 We amplify this. 70 00:03:42,250 --> 00:03:47,020 And in order to sequence it on contemporary instruments, 71 00:03:47,020 --> 00:03:48,470 we have to fragment it. 72 00:03:48,470 --> 00:03:52,020 Now for those of you who were in last Friday's recitation, 73 00:03:52,020 --> 00:03:54,729 you heard Heng Li talking about the idea that sequence reads 74 00:03:54,729 --> 00:03:55,520 are getting longer. 
75 00:03:55,520 --> 00:03:57,490 In fact, sequence reads up to 10 to 15 76 00:03:57,490 --> 00:03:59,630 kilobases are now possible. 77 00:03:59,630 --> 00:04:01,687 And sequence reads even longer than that 78 00:04:01,687 --> 00:04:03,520 are going to be possible, which will greatly 79 00:04:03,520 --> 00:04:05,636 simplify the assembly process. 80 00:04:05,636 --> 00:04:07,510 But for now we're talking about the challenge 81 00:04:07,510 --> 00:04:11,245 of assembling short reads-- say 100 base pair reads off 82 00:04:11,245 --> 00:04:14,400 of contemporary sequencing instruments. 83 00:04:14,400 --> 00:04:18,529 So we take the fragmented reads and the notion 84 00:04:18,529 --> 00:04:20,250 is that we know that they're going 85 00:04:20,250 --> 00:04:22,696 to line up like a puzzle. 86 00:04:22,696 --> 00:04:24,070 And all we have to do is line the 87 00:04:24,070 --> 00:04:27,180 reads up to recover the red sequence at the bottom-- 88 00:04:27,180 --> 00:04:31,424 the original genome sequence. 89 00:04:31,424 --> 00:04:33,840 And I should add that many of the illustrations in today's 90 00:04:33,840 --> 00:04:34,964 lecture are from Ben Langmead. 91 00:04:34,964 --> 00:04:39,840 He was kind enough to allow me to use them for today's talk. 92 00:04:39,840 --> 00:04:44,860 So the goal is to come up with that red sequence at the bottom 93 00:04:44,860 --> 00:04:48,400 from the original set of reads but, of course, 94 00:04:48,400 --> 00:04:50,690 the read set that we're talking about 95 00:04:50,690 --> 00:04:53,680 is perhaps 200 million reads or even a billion 96 00:04:53,680 --> 00:04:55,970 reads as we'll see. 97 00:04:55,970 --> 00:04:59,420 And so it's quite a tough task to put pieces together given 98 00:04:59,420 --> 00:05:02,170 that we really don't know where they came from. 99 00:05:02,170 --> 00:05:03,750 And we don't know where they align 100 00:05:03,750 --> 00:05:08,512 because we don't have the red part to guide us. 
101 00:05:08,512 --> 00:05:09,970 Now today we're going to be talking 102 00:05:09,970 --> 00:05:12,280 about what's called de novo assembly. 103 00:05:12,280 --> 00:05:14,490 That means starting from scratch. 104 00:05:14,490 --> 00:05:18,030 You hand me your set of reads for your favorite organism. 105 00:05:18,030 --> 00:05:20,572 And we're going to assemble it today. 106 00:05:20,572 --> 00:05:22,030 That's different than what's called 107 00:05:22,030 --> 00:05:24,670 reference-guided assembly because, 108 00:05:24,670 --> 00:05:27,490 for example, if you're going to re-sequence me or you, 109 00:05:27,490 --> 00:05:29,660 there is a reference human genome. 110 00:05:29,660 --> 00:05:33,900 And it would be a simple matter to take the reads from you or I 111 00:05:33,900 --> 00:05:35,930 and map them back onto the reference genome 112 00:05:35,930 --> 00:05:40,050 as a guide to trying to reassemble our genomes. 113 00:05:40,050 --> 00:05:42,070 However, as you can tell, if there's 114 00:05:42,070 --> 00:05:44,280 a large structural variation between the reference 115 00:05:44,280 --> 00:05:48,500 genome and our genomes, that process can fail. 116 00:05:48,500 --> 00:05:53,230 So we're going to be talking today about de novo assembly. 117 00:05:53,230 --> 00:05:57,430 And in the process of de novo assembly, 118 00:05:57,430 --> 00:05:59,840 oftentimes we talk about coverage, 119 00:05:59,840 --> 00:06:03,570 which is on average how many sequencing bases do 120 00:06:03,570 --> 00:06:06,390 we have for every base of the genome. 121 00:06:06,390 --> 00:06:10,540 Here we have for this little illustrative example 122 00:06:10,540 --> 00:06:14,050 coverage of about 7x. 123 00:06:14,050 --> 00:06:18,220 Now, at the origin of the Human Genome Project, 124 00:06:18,220 --> 00:06:20,710 some calculations were done about how much coverage 125 00:06:20,710 --> 00:06:23,670 was required to cover the human genome. 
126 00:06:23,670 --> 00:06:28,980 And we talked last time about library complexity. 127 00:06:28,980 --> 00:06:30,670 This is a slightly different idea, 128 00:06:30,670 --> 00:06:33,020 which is we want to estimate the probability that a base is 129 00:06:33,020 --> 00:06:34,830 uncovered. 130 00:06:34,830 --> 00:06:37,810 So if we have the genome size as G and the number of reads 131 00:06:37,810 --> 00:06:40,090 as N and L is the length of a read, 132 00:06:40,090 --> 00:06:44,030 then N times L is the total number of bases that we have. 133 00:06:44,030 --> 00:06:47,120 And that divided by the genome size is the average coverage 134 00:06:47,120 --> 00:06:49,020 of a base, lambda. 135 00:06:49,020 --> 00:06:52,240 And the probability that a base is not covered 136 00:06:52,240 --> 00:06:55,250 is the probability we're going to observe 137 00:06:55,250 --> 00:06:59,090 zero reads at that base, which is e to the minus 138 00:06:59,090 --> 00:07:04,330 lambda, roughly speaking, if we use a Poisson approximation. 139 00:07:04,330 --> 00:07:07,670 And therefore, the number of uncovered bases we will have 140 00:07:07,670 --> 00:07:12,630 is going to be roughly G times e to the minus lambda. 141 00:07:12,630 --> 00:07:15,640 The next calculation can be thought of intuitively 142 00:07:15,640 --> 00:07:19,900 in the following way, which is if we have N reads, if there's 143 00:07:19,900 --> 00:07:21,540 going to be a gap after a read, there 144 00:07:21,540 --> 00:07:23,669 has to be an uncovered base after it. 145 00:07:23,669 --> 00:07:26,210 And so the number of gaps we're going to have in our assembly 146 00:07:26,210 --> 00:07:30,360 is roughly N times e to the minus lambda. 147 00:07:30,360 --> 00:07:33,280 So this is a back-of-the-envelope calculation. 
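The back-of-the-envelope estimate above can be sketched in a few lines of Python. This is only an illustration, not code from the lecture: the function name `lander_waterman` and the example numbers (a 3-gigabase genome and 210 million reads of 100 bases, chosen to give 7x coverage) are assumptions.

```python
import math

def lander_waterman(genome_size, num_reads, read_length):
    """Poisson estimate of coverage, uncovered bases, and gaps.

    Symbols follow the lecture: G = genome_size, N = num_reads,
    L = read_length, and lambda = N * L / G is the average coverage.
    """
    coverage = num_reads * read_length / genome_size  # lambda
    p_uncovered = math.exp(-coverage)                 # P(zero reads at a base)
    return {
        "coverage": coverage,
        "p_uncovered": p_uncovered,
        "uncovered_bases": genome_size * p_uncovered,  # roughly G * e^-lambda
        "gaps": num_reads * p_uncovered,               # roughly N * e^-lambda
    }

# Hypothetical example values, not taken from the lecture slides.
stats = lander_waterman(genome_size=3_000_000_000,
                        num_reads=210_000_000,
                        read_length=100)
for name, value in stats.items():
    print(f"{name}: {value:.6g}")
```

Because the estimate assumes reads land uniformly at random, a skewed library would be expected to need more reads than this predicts.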
148 00:07:33,280 --> 00:07:38,290 And now if we take some of our 1000 Genomes data, which 149 00:07:38,290 --> 00:07:42,820 we previously used, and ask how well this approximation works, 150 00:07:42,820 --> 00:07:47,400 we see something like this where the x-axis is 151 00:07:47,400 --> 00:07:50,340 the total number of reads and the genome coverage in bases 152 00:07:50,340 --> 00:07:52,141 is shown on the y-axis. 153 00:07:52,141 --> 00:07:54,265 And these are all different sequencing experiments. 154 00:07:56,840 --> 00:08:00,150 So you can see there the roughly green outline, 155 00:08:00,150 --> 00:08:04,630 which follows approximately what we saw before 156 00:08:04,630 --> 00:08:06,137 in this Lander-Waterman rule. 157 00:08:06,137 --> 00:08:07,720 Could somebody tell me what they think 158 00:08:07,720 --> 00:08:10,290 is going on with the red lines that actually don't match up 159 00:08:10,290 --> 00:08:11,200 with that green line? 160 00:08:15,210 --> 00:08:17,970 Anybody have any ideas about why we 161 00:08:17,970 --> 00:08:20,430 need more reads out of those libraries 162 00:08:20,430 --> 00:08:22,005 to get better coverage? 163 00:08:24,700 --> 00:08:25,503 Yes? 164 00:08:25,503 --> 00:08:27,044 AUDIENCE: There is probably some bias 165 00:08:27,044 --> 00:08:28,211 when you're amplifying them? 166 00:08:28,211 --> 00:08:29,585 PROFESSOR: Yeah, there's probably 167 00:08:29,585 --> 00:08:32,179 skew in the original libraries we talked about last time. 168 00:08:32,179 --> 00:08:34,289 In fact, we talked about last time 169 00:08:34,289 --> 00:08:37,530 why the Poisson was not a great approximation 170 00:08:37,530 --> 00:08:39,120 for looking at libraries. 171 00:08:39,120 --> 00:08:41,530 And in fact, we might want to fit something 172 00:08:41,530 --> 00:08:44,850 like a negative binomial in this particular case. 173 00:08:47,520 --> 00:08:50,110 So we've got our read set. 
174 00:08:50,110 --> 00:08:52,270 And we can also talk about coverage 175 00:08:52,270 --> 00:08:55,940 at a particular base, which is different than average coverage, 176 00:08:55,940 --> 00:08:58,600 just to be clear that there are two different kinds of coverage 177 00:08:58,600 --> 00:09:00,510 that one can think about. 178 00:09:00,510 --> 00:09:05,590 Here we see coverage at T of level six. 179 00:09:08,140 --> 00:09:12,310 And the other thing that we need to be cognizant of 180 00:09:12,310 --> 00:09:16,700 is that there are two reasons that we might-- 181 00:09:16,700 --> 00:09:19,130 two common reasons why we might actually see 182 00:09:19,130 --> 00:09:23,310 reads that overlap but don't agree at all positions. 183 00:09:23,310 --> 00:09:24,790 The obvious reason is that there's 184 00:09:24,790 --> 00:09:26,330 an error in one of the reads. 185 00:09:26,330 --> 00:09:27,890 We get quality scores and so forth. 186 00:09:27,890 --> 00:09:30,610 And that can help us decide which is the truth. 187 00:09:30,610 --> 00:09:34,330 But the other possibility is that as you know, 188 00:09:34,330 --> 00:09:36,800 you have one of each of your chromosomes 189 00:09:36,800 --> 00:09:39,077 from your mom and one from your dad. 190 00:09:39,077 --> 00:09:40,660 And there could be allelic differences 191 00:09:40,660 --> 00:09:41,743 between these chromosomes. 192 00:09:41,743 --> 00:09:44,810 So when we're doing assembly, oftentimes we'll 193 00:09:44,810 --> 00:09:47,530 find that these allelic differences are 194 00:09:47,530 --> 00:09:53,230 going to pop up in terms of non-concordance of our reads. 195 00:09:53,230 --> 00:09:55,140 And we'll have to ultimately decide 196 00:09:55,140 --> 00:09:59,510 if we want to make a single haploid approximation 197 00:09:59,510 --> 00:10:03,130 of a human genome or we want to attempt 198 00:10:03,130 --> 00:10:09,240 to assemble a diploid genome. 
199 00:10:09,240 --> 00:10:13,030 And if we're going to do a diploid genome, 200 00:10:13,030 --> 00:10:15,250 then we have to be quite careful and use 201 00:10:15,250 --> 00:10:18,560 somewhat different assembly techniques. 202 00:10:18,560 --> 00:10:20,760 But the common reference genome is haploid. 203 00:10:20,760 --> 00:10:24,500 It's only considering one chromosomal sequence. 204 00:10:24,500 --> 00:10:27,500 Is that clear to everybody? 205 00:10:27,500 --> 00:10:29,900 OK, great. 206 00:10:29,900 --> 00:10:33,980 So we're going to talk about two general approaches to assembly 207 00:10:33,980 --> 00:10:34,480 today. 208 00:10:34,480 --> 00:10:38,990 We're going to talk about overlap layout consensus 209 00:10:38,990 --> 00:10:43,520 assemblers as exemplified by a string graph assembler. 210 00:10:43,520 --> 00:10:45,820 And we're also going to talk about De Bruijn graph 211 00:10:45,820 --> 00:10:48,020 assemblers today. 212 00:10:48,020 --> 00:10:52,850 Now, overlap layout consensus assemblers 213 00:10:52,850 --> 00:10:55,760 were the first ones that were used in the Human Genome 214 00:10:55,760 --> 00:10:58,940 Project because reads were longer back then. 215 00:10:58,940 --> 00:11:02,330 However, as the number of reads has increased, 216 00:11:02,330 --> 00:11:05,720 those assemblers are more difficult to utilize 217 00:11:05,720 --> 00:11:09,525 in part because of the need to find overlaps between reads, 218 00:11:09,525 --> 00:11:12,239 as we'll see in a moment. 219 00:11:12,239 --> 00:11:13,780 Whereas De Bruijn graph assemblers 220 00:11:13,780 --> 00:11:15,910 are somewhat more efficient. 221 00:11:15,910 --> 00:11:18,750 But they lose certain kinds of information. 222 00:11:18,750 --> 00:11:22,550 So let's begin with these overlap layout consensus 223 00:11:22,550 --> 00:11:24,830 assemblers. 
224 00:11:24,830 --> 00:11:30,720 And we're going to talk about three steps to build contigs 225 00:11:30,720 --> 00:11:33,000 and the scaffolding step can be thought 226 00:11:33,000 --> 00:11:36,490 of as similar between either the overlap layout consensus 227 00:11:36,490 --> 00:11:38,735 assemblers or De Bruijn graph-based assemblers. 228 00:11:42,770 --> 00:11:45,220 So we're going to first build an overlap graph. 229 00:11:45,220 --> 00:11:47,420 What's an overlap graph? 230 00:11:47,420 --> 00:11:49,870 The essential idea is that when we 231 00:11:49,870 --> 00:11:52,280 take our collection of reads, we look 232 00:11:52,280 --> 00:11:55,060 for overlaps between the suffix of one read 233 00:11:55,060 --> 00:11:58,130 and the prefix of another read. 234 00:11:58,130 --> 00:12:00,040 And if we think of all of our reads, 235 00:12:00,040 --> 00:12:04,130 we want to build a graph that describes all such overlaps. 236 00:12:04,130 --> 00:12:06,090 And just to be clear, I'm not going 237 00:12:06,090 --> 00:12:09,750 to be talking today about the reverse complement 238 00:12:09,750 --> 00:12:11,580 of these reads. 239 00:12:11,580 --> 00:12:14,400 Actual assemblers have to represent that. 240 00:12:14,400 --> 00:12:16,700 But it just duplicates all the nodes and edges. 241 00:12:16,700 --> 00:12:18,241 So we're going to try and keep things 242 00:12:18,241 --> 00:12:21,330 uncluttered by-- that's OK. 243 00:12:21,330 --> 00:12:23,030 Thank you. 244 00:12:23,030 --> 00:12:26,150 We're going to try and keep things uncluttered 245 00:12:26,150 --> 00:12:28,520 by not considering those today. 246 00:12:32,250 --> 00:12:36,480 Now, one of the challenges is how 247 00:12:36,480 --> 00:12:38,929 to construct those overlaps. 248 00:12:38,929 --> 00:12:40,970 And we're going to be talking about graphs a lot. 249 00:12:40,970 --> 00:12:44,212 So I thought it was worthwhile just to review terminology. 
250 00:12:44,212 --> 00:12:46,670 We're going to represent overlap graphs as directed graphs, 251 00:12:46,670 --> 00:12:48,980 which consist of a set of vertices, which 252 00:12:48,980 --> 00:12:52,020 are the objects represented by the circles, and the edges, which 253 00:12:52,020 --> 00:12:55,500 are the lines, and a directed edge goes from one vertex 254 00:12:55,500 --> 00:12:56,000 to another. 255 00:12:58,670 --> 00:13:02,460 And there's also an equivalent representation 256 00:13:02,460 --> 00:13:06,540 in notational form on the lower part of the right of the slide 257 00:13:06,540 --> 00:13:08,790 as well as a graphical representation. 258 00:13:08,790 --> 00:13:11,330 We're going to be using the graphical representations 259 00:13:11,330 --> 00:13:13,710 of these directed graphs today. 260 00:13:18,660 --> 00:13:23,070 So the overlap graph is simply a representation 261 00:13:23,070 --> 00:13:25,520 of the overlap between reads. 262 00:13:25,520 --> 00:13:30,340 And we pick a minimum length of overlap at times. 263 00:13:30,340 --> 00:13:34,640 But for the next few slides, I'm simply 264 00:13:34,640 --> 00:13:39,780 going to represent each node as an individual read. 265 00:13:39,780 --> 00:13:42,310 And the edges will be annotated with the amount 266 00:13:42,310 --> 00:13:45,190 of overlap between the reads. 267 00:13:45,190 --> 00:13:48,660 So if I hand you a set of reads, all we need to do 268 00:13:48,660 --> 00:13:51,130 is to compute this overlap graph. 269 00:13:51,130 --> 00:13:53,930 We'll talk about how to do that in a moment. 270 00:13:53,930 --> 00:13:58,500 And you'll see graphically then what comes out 271 00:13:58,500 --> 00:14:00,520 of the process of computing the overlap graph. 272 00:14:03,500 --> 00:14:08,700 Now, it's possible that overlap graphs 273 00:14:08,700 --> 00:14:14,250 are cyclic because there are circular chromosomes. 
274 00:14:14,250 --> 00:14:18,020 And as we'll see, it's also possible to get a cyclic graph 275 00:14:18,020 --> 00:14:21,600 out of a linear chromosome if in fact there 276 00:14:21,600 --> 00:14:25,450 are repetitive structures in the chromosome that 277 00:14:25,450 --> 00:14:28,130 cause a graph to cycle back on itself. 278 00:14:30,940 --> 00:14:37,770 So how to find overlaps in efficient time 279 00:14:37,770 --> 00:14:39,230 is a key problem. 280 00:14:39,230 --> 00:14:41,990 And one of the reasons that people have shied away 281 00:14:41,990 --> 00:14:44,430 from using these types of assemblers 282 00:14:44,430 --> 00:14:48,130 is because the cost of computing overlaps 283 00:14:48,130 --> 00:14:50,630 has been thought to be N-squared where N is the number of reads 284 00:14:50,630 --> 00:14:54,040 because you have to compare all the reads to one another. 285 00:14:54,040 --> 00:14:58,490 However, a really clever algorithm 286 00:14:58,490 --> 00:15:01,630 was devised that used the technology 287 00:15:01,630 --> 00:15:04,060 we talked about last time. 288 00:15:04,060 --> 00:15:09,330 You recall the idea of the FM index and the Burrows-Wheeler 289 00:15:09,330 --> 00:15:16,240 transform allowed us to index a genome and then to look up 290 00:15:16,240 --> 00:15:21,110 reads in time proportional to the length of the read. 291 00:15:21,110 --> 00:15:22,730 So here's the essential idea. 292 00:15:22,730 --> 00:15:26,540 What we're going to do is we're going to take all of the reads 293 00:15:26,540 --> 00:15:28,480 that we collect. 294 00:15:28,480 --> 00:15:29,730 And we're going to index them. 295 00:15:32,890 --> 00:15:36,480 And we can do that in roughly N log N time. 296 00:15:36,480 --> 00:15:39,640 And after we've indexed all of the reads, 297 00:15:39,640 --> 00:15:42,497 then we can use that same index to find overlaps very, 298 00:15:42,497 --> 00:15:43,205 very efficiently. 
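For contrast with the index-based approach, the naive all-against-all overlap computation mentioned above can be sketched as follows. This is a minimal illustration of the quadratic method, not the FM-index algorithm; the function names `overlap` and `overlap_graph` and the three example reads are hypothetical.

```python
from itertools import permutations

def overlap(a, b, min_length):
    """Length of the longest suffix of a that is also a prefix of b.

    Returns 0 when the longest such match is shorter than min_length.
    """
    for n in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def overlap_graph(reads, min_length):
    """Naive overlap graph: compares every ordered pair of distinct reads.

    Returns a dict mapping a directed edge (a, b) to its overlap length.
    Reverse complements are ignored, as in the lecture.
    """
    reads = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    edges = {}
    for a, b in permutations(reads, 2):
        n = overlap(a, b, min_length)
        if n > 0:
            edges[(a, b)] = n
    return edges

# Hypothetical reads of length 5 taken from the string "ATTACAG".
for edge, n in overlap_graph(["ATTAC", "TTACA", "TACAG"], 3).items():
    print(edge, n)
```

The nested comparison over all ordered pairs is what makes this quadratic in the number of reads, which is the cost the indexed approach is meant to avoid.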
299 00:15:45,940 --> 00:15:49,580 And you can conceptualize this as simply looking at a read 300 00:15:49,580 --> 00:15:52,930 that you have in your hand and looking it up in the index. 301 00:15:52,930 --> 00:15:54,960 And you'll find all the places that the suffix 302 00:15:54,960 --> 00:15:57,450 or prefix of that read matches. 303 00:15:57,450 --> 00:16:00,690 And you can trace back till you find all the places it matches 304 00:16:00,690 --> 00:16:03,450 where they hit an end of a read. 305 00:16:03,450 --> 00:16:06,030 And those all correspond to edges in the graph. 306 00:16:06,030 --> 00:16:08,640 And it turns out that this is so clever 307 00:16:08,640 --> 00:16:12,450 that it eliminates redundant edges. 308 00:16:12,450 --> 00:16:16,190 So, for example, if I have reads that 309 00:16:16,190 --> 00:16:21,850 look like this where I have read one overlaps with read 310 00:16:21,850 --> 00:16:27,050 two which overlaps with read three. 311 00:16:27,050 --> 00:16:29,055 And read one and read three also overlap. 312 00:16:32,003 --> 00:16:39,910 An unreduced graph would have a representation like this. 313 00:16:39,910 --> 00:16:44,370 But it turns out that we don't have 314 00:16:44,370 --> 00:16:50,630 to do that because we can simply reduce our graph to this 315 00:16:50,630 --> 00:16:54,640 because we know that read one and read three-- 316 00:16:54,640 --> 00:16:56,700 Actually, this is the graph that we 317 00:16:56,700 --> 00:16:59,300 would have that would be unreduced. 318 00:16:59,300 --> 00:17:03,650 We can reduce the graph to eliminate this transitive edge 319 00:17:03,650 --> 00:17:09,040 and simply represent it in this fashion. 320 00:17:09,040 --> 00:17:11,960 So when we use these indices, we eliminate 321 00:17:11,960 --> 00:17:14,595 these transitive edges as we'll see momentarily. 322 00:17:19,160 --> 00:17:22,260 So here's an example graph. 323 00:17:22,260 --> 00:17:25,599 The sequence is shown on the bottom. 
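The read one, read two, read three example of a transitive edge can be sketched as below. Note that this is a simple after-the-fact filter, assumed here for illustration; it is not the index-based method from the lecture, which avoids emitting such edges in the first place. The function name `remove_transitive_edges` and the labels `r1`, `r2`, `r3` are hypothetical.

```python
def remove_transitive_edges(edges):
    """Drop every edge (a, c) for which some b has edges (a, b) and (b, c).

    `edges` maps a directed edge (source, target) to its overlap length.
    Only two-step shortcuts are removed, and all checks are made against
    the original edge set, so this is a rough sketch that may remove too
    much on graphs containing cycles.
    """
    successors = {}
    for a, b in edges:
        successors.setdefault(a, set()).add(b)

    reduced = dict(edges)
    for a, c in edges:
        for b in successors.get(a, ()):
            if b != c and c in successors.get(b, ()):
                reduced.pop((a, c), None)  # (a, c) is implied by a -> b -> c
                break
    return reduced

# Unreduced graph: r1 -> r2 -> r3, plus the shortcut r1 -> r3.
unreduced = {("r1", "r2"): 4, ("r2", "r3"): 4, ("r1", "r3"): 3}
print(remove_transitive_edges(unreduced))
```

After the reduction only the two chained edges remain, which matches the reduced drawing described above.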
324 00:17:25,599 --> 00:17:30,660 The reads are of length seven bases. 325 00:17:30,660 --> 00:17:36,810 And we're going to consider all overlaps of minimum size three. 326 00:17:36,810 --> 00:17:39,440 And the edge label is the actual length 327 00:17:39,440 --> 00:17:42,870 of the overlap between the reads. 328 00:17:42,870 --> 00:17:46,660 And you can see at the outset 329 00:17:46,660 --> 00:17:50,310 that these overlap graphs are not necessarily simple. 330 00:17:50,310 --> 00:17:52,430 Tracing a path through the graph that 331 00:17:52,430 --> 00:17:57,030 represents the original string is not completely 332 00:17:57,030 --> 00:17:58,690 and totally straightforward. 333 00:17:58,690 --> 00:18:03,970 So we need to come up with a way to articulate our metrics 334 00:18:03,970 --> 00:18:08,060 for how to trace a path through the graph to reconstruct a genome. 335 00:18:10,940 --> 00:18:15,310 And that comes to the question of layout, 336 00:18:15,310 --> 00:18:19,760 which is how do we formulate the problem of tracing 337 00:18:19,760 --> 00:18:25,240 a path through an overlap graph? 338 00:18:25,240 --> 00:18:29,050 So we'll first start with the idea of the shortest 339 00:18:29,050 --> 00:18:32,170 common superstring. 340 00:18:32,170 --> 00:18:38,240 The shortest common superstring of a set of strings S 341 00:18:38,240 --> 00:18:42,600 is the shortest string that contains all the strings in S 342 00:18:42,600 --> 00:18:48,710 as substrings for a particular length of substring. 343 00:18:48,710 --> 00:18:53,540 So, for example, if we didn't have 344 00:18:53,540 --> 00:18:56,014 the constraint of shortest, then just 345 00:18:56,014 --> 00:18:58,430 finding a string that contains all the substrings is easy. 346 00:18:58,430 --> 00:19:01,540 You just put them all together. 
347 00:19:01,540 --> 00:19:04,300 But if we want the shortest, then we 348 00:19:04,300 --> 00:19:10,450 need to be more thoughtful in terms of the way 349 00:19:10,450 --> 00:19:14,210 that we compute this shortest common superstring. 350 00:19:14,210 --> 00:19:16,530 And here is an example of the shortest common superstring 351 00:19:16,530 --> 00:19:22,700 for the substrings that I have shown you up there. 352 00:19:22,700 --> 00:19:25,060 So one way to think about the assembly problem 353 00:19:25,060 --> 00:19:28,480 is that we're trying to compute the shortest common superstring 354 00:19:28,480 --> 00:19:31,950 of all the reads that we have. 355 00:19:31,950 --> 00:19:35,420 And that will be the most efficient representation 356 00:19:35,420 --> 00:19:37,015 of those reads in a linear sequence. 357 00:19:40,020 --> 00:19:47,350 Now, we can describe this problem 358 00:19:47,350 --> 00:19:49,807 in terms of an overlap graph. 359 00:19:49,807 --> 00:19:51,390 And if you think about the way that we 360 00:19:51,390 --> 00:19:55,390 would solve this in an overlap graph, for the shortest string, 361 00:19:55,390 --> 00:19:58,680 we want the maximum amount of overlap. 362 00:19:58,680 --> 00:20:02,800 So we want to trace a path through the overlap graph that 363 00:20:02,800 --> 00:20:06,960 gives us the largest amount of overlap, 364 00:20:06,960 --> 00:20:08,920 which gives us the shortest string. 365 00:20:08,920 --> 00:20:10,000 Right? 366 00:20:10,000 --> 00:20:14,680 So if we simply negate the overlaps, 367 00:20:14,680 --> 00:20:20,340 we want to minimize the total cost of the path through the graph. 368 00:20:20,340 --> 00:20:22,130 Now, it turns out that this problem 369 00:20:22,130 --> 00:20:24,690 is known to be a very hard computational problem. 370 00:20:24,690 --> 00:20:27,600 It's in the class of problems called NP-hard 371 00:20:27,600 --> 00:20:30,296 because it corresponds to the traveling salesman problem. 
372 00:20:30,296 --> 00:20:31,670 And when you think about the fact 373 00:20:31,670 --> 00:20:34,180 that we're going to have hundreds of millions of reads, 374 00:20:34,180 --> 00:20:36,870 this is not really going to be tractable. 375 00:20:36,870 --> 00:20:39,740 If we got rid of the weights, and we simply 376 00:20:39,740 --> 00:20:42,309 wanted to find a path through the graph that visits every node once, 377 00:20:42,309 --> 00:20:44,100 that's called the Hamiltonian Path problem. 378 00:20:44,100 --> 00:20:46,880 That's also NP-complete. 379 00:20:46,880 --> 00:20:50,020 So the shortest common superstring is a way 380 00:20:50,020 --> 00:20:53,390 to think about assembling. 381 00:20:53,390 --> 00:20:57,070 But we can't really necessarily optimize metrics 382 00:20:57,070 --> 00:20:59,640 because it's going to be intractable. 383 00:20:59,640 --> 00:21:05,165 So we think about ways of doing this that are greedier. 384 00:21:05,165 --> 00:21:07,540 So here's an example of how we would compute the shortest 385 00:21:07,540 --> 00:21:11,460 common superstring starting with the first string. 386 00:21:11,460 --> 00:21:14,970 And each step along the way is a concatenation 387 00:21:14,970 --> 00:21:19,150 of strings or a collapsing of strings that 388 00:21:19,150 --> 00:21:23,470 works towards building the shortest common superstring. 389 00:21:23,470 --> 00:21:29,490 And we get the input string and the output string. 390 00:21:29,490 --> 00:21:32,620 So we could articulate our assembly problem 391 00:21:32,620 --> 00:21:37,380 as a greedy SCS algorithm to try and put all the 392 00:21:37,380 --> 00:21:40,590 reads together to come up with a superstring. 393 00:21:40,590 --> 00:21:49,800 And let me just describe to you-- this will give us an intuition 394 00:21:49,800 --> 00:21:52,960 into what goes wrong with assembly in a moment. 
395 00:21:52,960 --> 00:21:55,580 But we do know there are some bounds on this-- 396 00:21:55,580 --> 00:21:58,330 that if we actually did the greedy algorithm, then 397 00:21:58,330 --> 00:22:01,940 the assembly that we got would be at most about two and a half times 398 00:22:01,940 --> 00:22:05,970 longer than the true shortest common superstring. 399 00:22:05,970 --> 00:22:08,340 That isn't really very much comfort to us. 400 00:22:08,340 --> 00:22:10,590 So we're going to have to come up with different, more 401 00:22:10,590 --> 00:22:12,714 heuristic ways of approaching the assembly problem. 402 00:22:15,870 --> 00:22:17,710 Here is another example. 403 00:22:17,710 --> 00:22:20,480 Now, this is the one that I want to show you 404 00:22:20,480 --> 00:22:23,790 where we start with a string at the top 405 00:22:23,790 --> 00:22:27,670 where we're going to be looking for minimum overlaps of three 406 00:22:27,670 --> 00:22:32,530 and these are reads of length six. 407 00:22:32,530 --> 00:22:36,360 And when we do this greedy algorithm, 408 00:22:36,360 --> 00:22:40,620 we come up with a string, which is shorter 409 00:22:40,620 --> 00:22:44,581 than the original string we started with. 410 00:22:44,581 --> 00:22:46,080 Can somebody see what happened here? 411 00:22:46,080 --> 00:22:49,760 Why are we missing part of the original string? 412 00:22:53,156 --> 00:22:53,656 Yes? 413 00:22:53,656 --> 00:22:55,239 AUDIENCE: The reads were short enough. 414 00:22:55,239 --> 00:22:59,410 And they repeated enough that we never found out 415 00:22:59,410 --> 00:23:02,510 that it was of the length that it actually was. 416 00:23:02,510 --> 00:23:06,530 And so we just kind of [INAUDIBLE] did it [INAUDIBLE]. 
417 00:23:06,530 --> 00:23:09,660 PROFESSOR: So the point was that the reads were 418 00:23:09,660 --> 00:23:14,120 too short to be able to unambiguously identify 419 00:23:14,120 --> 00:23:15,740 the number of repeats of "long" that we 420 00:23:15,740 --> 00:23:18,860 had in the original sequence. 421 00:23:18,860 --> 00:23:20,090 That's absolutely correct. 422 00:23:20,090 --> 00:23:24,650 So we're not able to disambiguate what was going on. 423 00:23:24,650 --> 00:23:29,832 And perhaps if we went back to our graph formalism 424 00:23:29,832 --> 00:23:31,290 we could solve this problem, right? 425 00:23:31,290 --> 00:23:34,640 Because here we have our graph and the overlaps 426 00:23:34,640 --> 00:23:38,410 are written in on the edges as the number 427 00:23:38,410 --> 00:23:40,844 of bases that each one of these reads overlaps. 428 00:23:40,844 --> 00:23:43,010 And all we need to do is to trace through this graph 429 00:23:43,010 --> 00:23:45,450 to find the original string. 430 00:23:45,450 --> 00:23:50,360 So here is one tracing, which gives 431 00:23:50,360 --> 00:23:53,980 a total overlap of 39, which actually faithfully reproduces 432 00:23:53,980 --> 00:23:57,680 the original string, right? 433 00:23:57,680 --> 00:24:02,150 However, that's not the best tracing. 434 00:24:02,150 --> 00:24:05,810 A better tracing through this graph or path through the graph 435 00:24:05,810 --> 00:24:09,900 would be this, which gives us more overlap 436 00:24:09,900 --> 00:24:11,580 and gives us a shorter string. 437 00:24:11,580 --> 00:24:13,480 But as we know, even though it's better 438 00:24:13,480 --> 00:24:16,700 according to this metric, it isn't really optimum 439 00:24:16,700 --> 00:24:19,560 because it gives us the wrong answer. 440 00:24:19,560 --> 00:24:23,070 It's better but wrong.
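To make the collapsed-repeat effect concrete, here is a minimal Python sketch of a greedy shortest common superstring merge. The input string `a_long_long_long_time`, the helper names `overlap` and `greedy_scs`, and the tie-breaking order are assumptions chosen for illustration and are not taken from the slides; the read length of six and minimum overlap of three follow the numbers mentioned above.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the longest such overlap is shorter than min_len.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_scs(reads, min_len):
    """Greedy superstring: repeatedly merge the pair with the largest overlap.

    Assumes equal-length reads; ties go to the first pair found.
    """
    strings = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(strings):
            for j, b in enumerate(strings):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left that overlaps by at least min_len
        merged = strings[best_i] + strings[best_j][best_n:]
        strings = [s for k, s in enumerate(strings)
                   if k not in (best_i, best_j)] + [merged]
    return "".join(strings)


original = "a_long_long_long_time"  # hypothetical example sequence
reads = [original[i:i + 6] for i in range(len(original) - 6 + 1)]
assembled = greedy_scs(reads, min_len=3)
print(original, len(original))
print(assembled, len(assembled))
```

Every read is still a substring of the assembled result, so the result is a valid superstring, yet one copy of the repeat has been lost because no six-character read spans the repeat and anchors on both sides.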
441 00:24:23,070 --> 00:24:25,940 So we're going to have to take into account other things when 442 00:24:25,940 --> 00:24:29,620 we do our assembly and our tracing of this graph 443 00:24:29,620 --> 00:24:33,190 to be able to come up with the best possible assembly. 444 00:24:35,730 --> 00:24:40,800 So if we increase the read length 445 00:24:40,800 --> 00:24:44,450 as was pointed out to span appropriately, 446 00:24:44,450 --> 00:24:49,820 we will be able to reconstruct the original sequence. 447 00:24:49,820 --> 00:24:53,414 And the point of this example is that we 448 00:24:53,414 --> 00:24:55,830 need to consider this when we're thinking about recovering 449 00:24:55,830 --> 00:24:58,570 repeat structures in genomes. 450 00:24:58,570 --> 00:25:04,850 So if we don't have long enough reads, 451 00:25:04,850 --> 00:25:09,620 in this case reads of length 8, we're 452 00:25:09,620 --> 00:25:12,600 not going to be able to recover the original repeat structure. 453 00:25:15,390 --> 00:25:20,270 And if we look at this, repeats are really 454 00:25:20,270 --> 00:25:22,570 the bane of assemblers in some sense. 455 00:25:22,570 --> 00:25:26,740 And as you know, roughly 50% of the human genome 456 00:25:26,740 --> 00:25:30,110 is repetitive content. 457 00:25:30,110 --> 00:25:34,620 So we need to be very, very careful in terms of the way 458 00:25:34,620 --> 00:25:39,079 that we utilize reads to be able to recover the best 459 00:25:39,079 --> 00:25:40,620 approximation of our genome sequence. 460 00:25:44,140 --> 00:25:47,680 So here's another example where we 461 00:25:47,680 --> 00:25:52,880 look at l is the minimum overlap length and k 462 00:25:52,880 --> 00:25:55,360 is the length of the reads.
463 00:25:55,360 --> 00:25:57,500 And you can see the sequence that we're 464 00:25:57,500 --> 00:25:59,333 trying to recover-- It was the best of times 465 00:25:59,333 --> 00:26:03,800 it was the worst of times and the output 466 00:26:03,800 --> 00:26:07,680 from our greedy SCS assembler. 467 00:26:07,680 --> 00:26:10,620 And as you can see, we need to get up 468 00:26:10,620 --> 00:26:15,080 to a read length of 13 characters for us 469 00:26:15,080 --> 00:26:18,475 to be able to properly assemble that original sentence. 470 00:26:21,980 --> 00:26:25,390 So the essential message here is that unless you 471 00:26:25,390 --> 00:26:29,860 have reads that are long enough to span repeats, 472 00:26:29,860 --> 00:26:31,370 you're not going to be able to recover 473 00:26:31,370 --> 00:26:35,480 the original sequence exactly. 474 00:26:35,480 --> 00:26:41,410 And this can be also thought of in the following example. 475 00:26:41,410 --> 00:26:45,650 Imagine you have repeats that are tandem repeats out 476 00:26:45,650 --> 00:26:46,931 at the end of a sequence. 477 00:26:46,931 --> 00:26:48,430 And we're using the English language 478 00:26:48,430 --> 00:26:51,604 here because it's easier to see than if I put up 479 00:26:51,604 --> 00:26:52,770 a bunch of genomic sequence. 480 00:26:52,770 --> 00:26:56,330 But, of course, the principles are the same. 481 00:26:56,330 --> 00:26:58,110 You can see that unless we have reads 482 00:26:58,110 --> 00:27:01,680 that actually are anchored in unique sequence 483 00:27:01,680 --> 00:27:04,560 and span out towards a repetitive sequence, 484 00:27:04,560 --> 00:27:08,535 we can't really tell how many times the word 485 00:27:08,535 --> 00:27:09,285 bells is repeated. 486 00:27:12,000 --> 00:27:14,010 Another possibility is that we can 487 00:27:14,010 --> 00:27:16,500 actually come in from both sides.
488 00:27:16,500 --> 00:27:20,450 And if we can anchor our reads in unique sequence on both 489 00:27:20,450 --> 00:27:24,000 the left and the right side of a repetitive element, 490 00:27:24,000 --> 00:27:26,940 then we can figure out how many copies of something like bells 491 00:27:26,940 --> 00:27:28,740 is present. 492 00:27:28,740 --> 00:27:30,860 But in the absence of that, we really can't do it. 493 00:27:30,860 --> 00:27:34,020 In fact, we wind up with a structure that looks like this. 494 00:27:34,020 --> 00:27:39,900 We wind up with-- there it is-- a structure where 495 00:27:39,900 --> 00:27:41,820 we have-- let's just say that there 496 00:27:41,820 --> 00:27:45,300 are four different stretches of genome 497 00:27:45,300 --> 00:27:47,430 in disparate parts of chromosomes 498 00:27:47,430 --> 00:27:49,910 and a repeat sequence in the middle. 499 00:27:49,910 --> 00:27:53,340 The blue parts of the chromosomes 500 00:27:53,340 --> 00:27:54,700 are unique sequence. 501 00:27:54,700 --> 00:27:58,450 And the red parts are repetitive sequences. 502 00:27:58,450 --> 00:28:02,770 What will happen is that if the reads aren't long enough, 503 00:28:02,770 --> 00:28:06,580 we'll be able to find out in each one of the four locations 504 00:28:06,580 --> 00:28:10,310 that we've gone from unique sequence to repeat sequence. 505 00:28:10,310 --> 00:28:12,000 And then we will get lost in the middle 506 00:28:12,000 --> 00:28:16,030 of this identical repeated sequence. 507 00:28:16,030 --> 00:28:18,059 And then on the right-hand side we'll 508 00:28:18,059 --> 00:28:20,100 once again transition back from repeated sequence 509 00:28:20,100 --> 00:28:21,470 to unique sequence. 510 00:28:21,470 --> 00:28:24,250 But we won't know how to put things together in the middle. 511 00:28:24,250 --> 00:28:24,750 Right? 512 00:28:24,750 --> 00:28:26,208 We won't be able to figure out what 513 00:28:26,208 --> 00:28:31,150 the path is through these repetitive elements.
514 00:28:31,150 --> 00:28:36,500 So that's the essential point I'd like to make about repeats. 515 00:28:36,500 --> 00:28:39,330 And we can now turn to the question of layout 516 00:28:39,330 --> 00:28:44,760 and how to process an overlap graph towards making contigs. 517 00:28:44,760 --> 00:28:48,650 This is the actual layout graph. 518 00:28:48,650 --> 00:28:51,940 When we think about that sentence up there. 519 00:28:51,940 --> 00:28:55,270 And we say the minimum overlap length is four characters. 520 00:28:55,270 --> 00:28:58,750 And we have seven-character reads out of the sequence. 521 00:28:58,750 --> 00:29:04,470 You can see it's a pretty messy graph. 522 00:29:04,470 --> 00:29:09,220 If we clean up the graph by removing the redundant edges, 523 00:29:09,220 --> 00:29:14,430 the edges like this that span over 524 00:29:14,430 --> 00:29:17,450 reads and are implied by other reads, 525 00:29:17,450 --> 00:29:20,690 we can remove edges that are transitive over one 526 00:29:20,690 --> 00:29:22,210 read or two reads. 527 00:29:22,210 --> 00:29:28,440 Now, my presentation is going to talk 528 00:29:28,440 --> 00:29:29,980 about how to remove these edges. 529 00:29:29,980 --> 00:29:31,960 However, as I said at the outset, 530 00:29:31,960 --> 00:29:37,110 if you use the algorithm by Simpson et al., 531 00:29:37,110 --> 00:29:39,400 you actually don't generate these transitive edges 532 00:29:39,400 --> 00:29:41,440 in the first place. 533 00:29:41,440 --> 00:29:43,760 But assuming that you didn't use that algorithm 534 00:29:43,760 --> 00:29:45,410 and you did generate them, you want 535 00:29:45,410 --> 00:29:49,480 to get rid of these transitive edges like so. 536 00:29:49,480 --> 00:29:53,190 And it starts getting somewhat simpler 537 00:29:53,190 --> 00:29:54,870 as you begin simplifying the graph, 538 00:29:54,870 --> 00:29:58,030 removing these transitive edges. 539 00:29:58,030 --> 00:30:02,370 And then we can remove edges that skip two nodes.
540 00:30:02,370 --> 00:30:05,210 So here's what happens after you remove the single transitive 541 00:30:05,210 --> 00:30:06,140 edges in this graph. 542 00:30:06,140 --> 00:30:07,055 Yes? 543 00:30:07,055 --> 00:30:10,464 AUDIENCE: So it seems that the transitive and verbal edges 544 00:30:10,464 --> 00:30:12,755 gave us a little bit more information about the genome. 545 00:30:12,755 --> 00:30:18,124 Do we lose some useful ordering principles by-- 546 00:30:18,124 --> 00:30:20,040 PROFESSOR: They provide redundant information. 547 00:30:20,040 --> 00:30:22,666 They don't really provide any additional information. 548 00:30:22,666 --> 00:30:25,165 It's the same linear sequence that's implied by those edges. 549 00:30:28,012 --> 00:30:28,845 Any other questions? 550 00:30:32,640 --> 00:30:37,820 So we can then remove edges that span two nodes. 551 00:30:37,820 --> 00:30:41,450 And we get an even simpler graph like this. 552 00:30:41,450 --> 00:30:43,900 Now this is beginning to look more tractable because we 553 00:30:43,900 --> 00:30:48,270 can look at this and we can output contigs that 554 00:30:48,270 --> 00:30:51,150 correspond to linear portions of the graph, which 555 00:30:51,150 --> 00:30:53,370 should be linear sequence. 556 00:30:53,370 --> 00:31:00,590 And when we do that what we wind up with are two contigs. 557 00:31:00,590 --> 00:31:03,400 And there's just a bit of problem in the middle, which 558 00:31:03,400 --> 00:31:06,980 is that we're unable to resolve the bit in the middle 559 00:31:06,980 --> 00:31:10,350 and as a consequence, we know that that 560 00:31:10,350 --> 00:31:15,490 is the number of terms that are in that original sentence 561 00:31:15,490 --> 00:31:17,550 because we didn't have a read long 562 00:31:17,550 --> 00:31:21,670 enough to be able to resolve that. 
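The transitive-edge removal described above can be sketched in a few lines of Python. This is a simplified illustration, not the method of any particular assembler: the function names `overlap`, `overlap_graph`, and `remove_transitive` are hypothetical, the all-pairs overlap computation is quadratic, and the toy text `abcdefghij` is chosen to be repeat-free so the expected result is easy to see. The read length of seven and minimum overlap of four follow the numbers mentioned above.

```python
def overlap(a, b, min_len):
    """Longest suffix of a matching a prefix of b, or 0 if shorter than min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def overlap_graph(reads, min_len):
    """Map each directed edge (a, b) to its overlap length."""
    graph = {}
    for a in reads:
        for b in reads:
            if a != b:
                n = overlap(a, b, min_len)
                if n > 0:
                    graph[(a, b)] = n
    return graph


def remove_transitive(edges):
    """Drop any edge a->c that is implied by a pair of edges a->b and b->c."""
    nodes = {node for edge in edges for node in edge}
    return {
        (a, c): n
        for (a, c), n in edges.items()
        if not any((a, b) in edges and (b, c) in edges for b in nodes)
    }


text = "abcdefghij"  # hypothetical repeat-free sequence
reads = [text[i:i + 7] for i in range(len(text) - 7 + 1)]
graph = overlap_graph(reads, min_len=4)
reduced = remove_transitive(graph)
for (a, b), n in sorted(reduced.items()):
    print(a, "->", b, "overlap", n)
```

On a repeat-free input the reduced graph is a single chain, which is the kind of linear stretch that gets emitted as a contig. Because each edge is checked against the full original edge set, edges that skip one node and edges that skip two nodes are removed in the same pass.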
563 00:31:21,670 --> 00:31:24,390 The other problem that we can have 564 00:31:24,390 --> 00:31:26,430 in doing this kind of layout is that when 565 00:31:26,430 --> 00:31:30,120 there are portions of the genome that occur 566 00:31:30,120 --> 00:31:34,200 or sequences in the genome that occur multiple times, when we 567 00:31:34,200 --> 00:31:37,308 actually do this layout, we may find 568 00:31:37,308 --> 00:31:38,807 that the portions of the genome that 569 00:31:38,807 --> 00:31:42,990 occur in two disparate locations line up with one another. 570 00:31:42,990 --> 00:31:45,470 And it may be that as you exit the portion that's 571 00:31:45,470 --> 00:31:48,540 shared you get a mismatched base. 572 00:31:48,540 --> 00:31:51,110 So that mismatch could be because you 573 00:31:51,110 --> 00:31:53,400 have disparate parts of the genome that actually 574 00:31:53,400 --> 00:31:54,944 have very similar sequence. 575 00:31:54,944 --> 00:31:56,360 Or it could be that you had a read 576 00:31:56,360 --> 00:31:59,400 error at the end of your read. 577 00:31:59,400 --> 00:32:03,010 And it's difficult to tell the two apart except by the amount 578 00:32:03,010 --> 00:32:05,190 of coverage that you have. 579 00:32:05,190 --> 00:32:06,690 We'll talk about how to prune graphs 580 00:32:06,690 --> 00:32:09,930 like this in a few moments. 581 00:32:09,930 --> 00:32:14,420 But in any event, assuming that we have pruned the graph, 582 00:32:14,420 --> 00:32:16,154 we have done our overlap. 583 00:32:16,154 --> 00:32:17,070 We've done our layout. 584 00:32:17,070 --> 00:32:19,950 We've found our paths through the graph for our contigs. 585 00:32:19,950 --> 00:32:22,500 And then what we find is that for each contig, 586 00:32:22,500 --> 00:32:24,810 we have many reads. 587 00:32:24,810 --> 00:32:26,990 And we're going to take those reads. 588 00:32:26,990 --> 00:32:28,910 And we're going to look at them.
589 00:32:28,910 --> 00:32:30,950 And as you recall, we could either 590 00:32:30,950 --> 00:32:34,100 have errors causing disagreement among the reads. 591 00:32:34,100 --> 00:32:37,870 We could have allelic differences between mom and dad 592 00:32:37,870 --> 00:32:41,270 causing those errors, well, not really errors-- differences. 593 00:32:41,270 --> 00:32:43,810 And then we can take a consensus to come up 594 00:32:43,810 --> 00:32:48,130 with what the haploid genome is. 595 00:32:48,130 --> 00:32:54,610 So that's the essential idea of an overlap-layout-consensus 596 00:32:54,610 --> 00:32:55,480 assembler. 597 00:32:55,480 --> 00:32:58,590 We compute the overlap graph. 598 00:32:58,590 --> 00:33:01,170 During the layout phase we actually simplify the graph. 599 00:33:01,170 --> 00:33:02,710 And we find paths through it. 600 00:33:02,710 --> 00:33:05,230 And during the consensus phase, we take our reads, 601 00:33:05,230 --> 00:33:11,280 and we build a consensus sequence of the genome. 602 00:33:11,280 --> 00:33:17,890 And as I said, this graph building can be slow. 603 00:33:17,890 --> 00:33:19,746 Although, we'll talk about how slow 604 00:33:19,746 --> 00:33:21,830 it is here in just a moment. 605 00:33:21,830 --> 00:33:25,440 And the challenge is that modern sequencing data sets 606 00:33:25,440 --> 00:33:28,820 are hundreds of millions of reads. 607 00:33:28,820 --> 00:33:33,730 So let's talk about a contemporary overlap-based 608 00:33:33,730 --> 00:33:36,680 assembler-- something called the String Graph Assembler, 609 00:33:36,680 --> 00:33:39,905 which is done over at the Sanger in the UK. 610 00:33:39,905 --> 00:33:42,030 And there are three separate steps it goes through. 611 00:33:42,030 --> 00:33:45,110 The first step is it tries to correct reads.
612 00:33:45,110 --> 00:33:47,070 And the way it does this is it actually 613 00:33:47,070 --> 00:33:50,340 looks at all the k-mers that occur in 614 00:33:50,340 --> 00:33:53,470 reads-- it tries to find sequences that are very, very 615 00:33:53,470 --> 00:33:57,090 rare and find sequences that are nearby 616 00:33:57,090 --> 00:33:59,780 in sequence space that aren't as rare. 617 00:33:59,780 --> 00:34:04,520 And it can correct bases that it believes are sequencing errors. 618 00:34:04,520 --> 00:34:06,090 The next step is assembly once it 619 00:34:06,090 --> 00:34:09,030 has taken all these reads and corrected them. 620 00:34:09,030 --> 00:34:10,850 It indexes all the reads as I suggested 621 00:34:10,850 --> 00:34:14,330 earlier using an FM index. 622 00:34:14,330 --> 00:34:18,790 And then it can find the overlap from that FM index directly. 623 00:34:18,790 --> 00:34:22,170 And part of the assembly process is throwing away 624 00:34:22,170 --> 00:34:23,639 duplicate reads and throwing away 625 00:34:23,639 --> 00:34:26,250 reads that have low quality scores. 626 00:34:26,250 --> 00:34:28,449 So that's the filtering step. 627 00:34:28,449 --> 00:34:34,860 It then has the set of contigs that it has generated. 628 00:34:34,860 --> 00:34:36,500 And it does something quite interesting 629 00:34:36,500 --> 00:34:39,429 to find the scaffolds, which is that it takes the contigs it's 630 00:34:39,429 --> 00:34:42,860 assembled in terms of linear sequence. 631 00:34:42,860 --> 00:34:45,790 And it completely re-indexes them once again using 632 00:34:45,790 --> 00:34:47,394 an FM index. 633 00:34:47,394 --> 00:34:49,643 And then it takes all the reads that you started with. 634 00:34:49,643 --> 00:34:53,770 And it maps them back onto the contigs.
635 00:34:53,770 --> 00:34:57,100 And by mapping the paired reads back on to the contigs, 636 00:34:57,100 --> 00:35:00,640 it can actually figure out what contigs 637 00:35:00,640 --> 00:35:03,890 should be formed into scaffolds where there are holes that 638 00:35:03,890 --> 00:35:07,960 are bridged by these longer reads. 639 00:35:07,960 --> 00:35:11,420 So it's using the FM index both for correction to find out 640 00:35:11,420 --> 00:35:14,720 nearby k-mers, for assembly to find overlaps, 641 00:35:14,720 --> 00:35:17,290 and for scaffolding to put things together. 642 00:35:17,290 --> 00:35:22,119 And it does its indexing three different times. 643 00:35:22,119 --> 00:35:24,160 And just to give you an idea of how long it takes 644 00:35:24,160 --> 00:35:30,300 for a human-sized genome, it's actually 645 00:35:30,300 --> 00:35:33,300 quite expensive in terms of CPU time. 646 00:35:33,300 --> 00:35:36,150 It takes many days of elapsed time 647 00:35:36,150 --> 00:35:41,610 to assemble an entire human genome right now. 648 00:35:41,610 --> 00:35:45,860 And it's thousands of CPU hours to actually put 649 00:35:45,860 --> 00:35:47,930 a genome together starting from scratch. 650 00:35:50,590 --> 00:35:55,440 OK, so that's the essential idea of an overlap-based assembler. 651 00:35:55,440 --> 00:35:58,780 Are there any questions at all about overlap-based assemblers? 652 00:35:58,780 --> 00:35:59,643 Yeah? 653 00:35:59,643 --> 00:36:01,958 AUDIENCE: So in the case of an error, 654 00:36:01,958 --> 00:36:05,210 it's obvious how you would call that. 655 00:36:05,210 --> 00:36:08,284 But in an allelic difference, hypothetically, there 656 00:36:08,284 --> 00:36:11,067 would be 50% of the reads would have one and 50% of the reads 657 00:36:11,067 --> 00:36:11,430 would have another. 658 00:36:11,430 --> 00:36:11,914 PROFESSOR: That's correct.
659 00:36:11,914 --> 00:36:14,590 AUDIENCE: So in that case does it assemble-- do you 660 00:36:14,590 --> 00:36:17,400 just bias towards whichever ones weren't easily amplified? 661 00:36:17,400 --> 00:36:21,960 Or do you assemble two sequences? 662 00:36:21,960 --> 00:36:24,540 PROFESSOR: Most assemblers produce a single sequence. 663 00:36:24,540 --> 00:36:29,750 And I don't know how SGA decides between the different alleles 664 00:36:29,750 --> 00:36:32,710 because I don't recall what the paper said they did. 665 00:36:32,710 --> 00:36:34,240 But they have to essentially flip 666 00:36:34,240 --> 00:36:37,860 a coin to come up with a haploid sequence. 667 00:36:37,860 --> 00:36:38,735 Yes? 668 00:36:38,735 --> 00:36:40,675 AUDIENCE: You said there were three different times that you 669 00:36:40,675 --> 00:36:41,175 index. 670 00:36:41,175 --> 00:36:44,046 What are the three? 671 00:36:44,046 --> 00:36:45,420 PROFESSOR: Yeah, the question was 672 00:36:45,420 --> 00:36:47,580 I said there are three different times they indexed. 673 00:36:47,580 --> 00:36:55,620 They indexed at the outset to find errors. 674 00:36:55,620 --> 00:37:02,900 They indexed the second time to do the overlap computation. 675 00:37:02,900 --> 00:37:06,070 And they indexed the third time to realign 676 00:37:06,070 --> 00:37:09,740 all the original reads to the contigs they have to figure out 677 00:37:09,740 --> 00:37:11,695 which contigs to put together into scaffolds. 678 00:37:15,328 --> 00:37:16,260 Right? 679 00:37:16,260 --> 00:37:19,430 But they have this essential foundational platform, 680 00:37:19,430 --> 00:37:21,490 which is the FM index. 681 00:37:21,490 --> 00:37:23,450 And so they use that over and over again 682 00:37:23,450 --> 00:37:24,700 to be able to do the assembly. 683 00:37:28,045 --> 00:37:29,295 These are all great questions. 684 00:37:32,340 --> 00:37:35,295 All right, any other questions about overlap-based assemblers?
685 00:37:39,080 --> 00:37:41,550 And you can see that if you think about how much coverage 686 00:37:41,550 --> 00:37:43,800 they get out of an assembler like this, it's actually, 687 00:37:43,800 --> 00:37:45,841 we'll compare all the assemblers at the very end. 688 00:37:45,841 --> 00:37:50,730 But if you look at the number of bases of autosomes 689 00:37:50,730 --> 00:37:57,000 and the X chromosome covered by an assembly, 690 00:37:57,000 --> 00:37:59,950 you can consider that as a function 691 00:37:59,950 --> 00:38:04,180 of the minimum alignment length to a reference genome. 692 00:38:04,180 --> 00:38:07,830 And as the minimum alignment length goes up, 693 00:38:07,830 --> 00:38:10,270 that means you have to match longer and longer portions 694 00:38:10,270 --> 00:38:16,590 of the reference genome for your assembly contig to count. 695 00:38:16,590 --> 00:38:19,126 You can see that the number of bases drops somewhat. 696 00:38:19,126 --> 00:38:20,500 And here they're showing that they 697 00:38:20,500 --> 00:38:24,630 do better than another assembler called SOAPdenovo. 698 00:38:24,630 --> 00:38:27,930 But they do get a fairly good coverage. 699 00:38:27,930 --> 00:38:31,250 On the other hand, they don't get coverage 700 00:38:31,250 --> 00:38:34,950 anywhere near as good as Lander-Waterman might suggest 701 00:38:34,950 --> 00:38:36,637 because the coverage should suggest 702 00:38:36,637 --> 00:38:38,470 that the probability of an uncovered base using 703 00:38:38,470 --> 00:38:42,700 Lander-Waterman would be roughly e to the minus 40th-- something 704 00:38:42,700 --> 00:38:43,200 like that. 705 00:38:43,200 --> 00:38:46,600 And e to the minus 40th is like 4 times 10 to the minus 18. 706 00:38:46,600 --> 00:38:48,754 So they're not anywhere near what 707 00:38:48,754 --> 00:38:50,670 we would think the Lander-Waterman bound would 708 00:38:50,670 --> 00:38:52,350 be for assembly.
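As a quick check of that arithmetic, the Lander-Waterman estimate for the probability that a given base is left uncovered at coverage c is e to the minus c. The short Python sketch below evaluates it at c = 40, the value implied by the figure quoted above; the genome size of three billion bases and the names `prob_base_uncovered` and `genome_size` are assumptions added only to show the scale.

```python
import math


def prob_base_uncovered(coverage):
    """Lander-Waterman estimate: P(a given base is covered by no read) = exp(-c)."""
    return math.exp(-coverage)


coverage = 40                 # fold coverage implied by "e to the minus 40th"
genome_size = 3_000_000_000   # assumed approximate human genome size in bases

p = prob_base_uncovered(coverage)
expected_uncovered = p * genome_size

print(f"P(uncovered base) = {p:.2e}")
print(f"Expected uncovered bases = {expected_uncovered:.2e}")
```

Under this idealized model, which assumes reads land uniformly at random and ignores repeats, far fewer than one base of the genome would be expected to go uncovered. That is why the observed gaps point to repeats and assembly limits rather than to a shortage of reads.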
709 00:38:55,690 --> 00:38:59,810 So we've talked about these overlap-based assemblers. 710 00:38:59,810 --> 00:39:02,950 Now I'm going to turn to De Bruijn graph assemblers. 711 00:39:02,950 --> 00:39:05,490 How many people have heard of De Bruijn graphs before? 712 00:39:05,490 --> 00:39:07,560 Anybody? 713 00:39:07,560 --> 00:39:11,210 One person? 714 00:39:11,210 --> 00:39:15,860 So before we talk about De Bruijn graphs themselves, 715 00:39:15,860 --> 00:39:17,640 let's just talk terminology. 716 00:39:17,640 --> 00:39:23,415 So when I'm using terms we're all 717 00:39:23,415 --> 00:39:27,550 on the same page where we were talking about k-mers where 718 00:39:27,550 --> 00:39:32,080 the word mer is from the Greek "part." 719 00:39:32,080 --> 00:39:34,380 And we talk about 4-mers of an original sequence 720 00:39:34,380 --> 00:39:37,900 as a sequence that's four bases long. 721 00:39:37,900 --> 00:39:40,570 And we can think about all of the 3-mers 722 00:39:40,570 --> 00:39:43,080 of an original sequence. 723 00:39:43,080 --> 00:39:45,790 So we talk a lot about k-mers. 724 00:39:45,790 --> 00:39:51,620 And a k minus 1-mer is a substring of length k minus 1 725 00:39:51,620 --> 00:39:52,855 obviously from a k-mer. 726 00:39:55,560 --> 00:39:58,670 So if we think about the collection of reads-- 727 00:39:58,670 --> 00:40:03,790 here these are our super-simple economy sequencers 728 00:40:03,790 --> 00:40:06,360 producing reads of only length three, which 729 00:40:06,360 --> 00:40:07,300 is pretty desperate. 730 00:40:07,300 --> 00:40:09,670 But at any rate we'll go with that for the time being. 731 00:40:09,670 --> 00:40:12,750 And we think about each one of these reads 732 00:40:12,750 --> 00:40:19,040 as having a left k minus 1-mer and a right k minus 1-mer. 733 00:40:19,040 --> 00:40:22,960 We split them into two halves that way. 734 00:40:22,960 --> 00:40:33,320 And we're going to build a graph that is as follows. 
735 00:40:33,320 --> 00:40:36,420 We're going to take all of the k minus 1-mers-- in this case 736 00:40:36,420 --> 00:40:38,030 the 2-mers. 737 00:40:38,030 --> 00:40:40,850 And for each read, we're going to draw 738 00:40:40,850 --> 00:40:46,340 an edge between its left 2-mer and its right 2-mer. 739 00:40:46,340 --> 00:40:49,690 OK, once again, for each read, these sort 740 00:40:49,690 --> 00:40:51,890 of anemic, three-base-pair reads, 741 00:40:51,890 --> 00:40:54,140 we're going to draw an edge between its left 2-mer 742 00:40:54,140 --> 00:40:55,000 and its right 2-mer. 743 00:40:55,000 --> 00:40:58,020 And they overlap in one base. 744 00:40:58,020 --> 00:41:02,100 So all of the graphs that are De Bruijn graphs, the edges 745 00:41:02,100 --> 00:41:05,860 represent an overlap of one base. 746 00:41:05,860 --> 00:41:07,330 OK? 747 00:41:07,330 --> 00:41:10,390 So if you look at the graph at the bottom, 748 00:41:10,390 --> 00:41:15,140 that represents the overlaps present 749 00:41:15,140 --> 00:41:17,650 in the original sequence. 750 00:41:17,650 --> 00:41:21,260 You note that we have AA as one of the 2-mers. 751 00:41:21,260 --> 00:41:26,310 And its left half and right half obviously overlap by one base. 752 00:41:26,310 --> 00:41:30,170 The triple-A read has AA as its left 2-mer 753 00:41:30,170 --> 00:41:33,910 and AA as its right 2-mer-- they overlap at one base. 754 00:41:33,910 --> 00:41:38,330 And that's why we have that circular edge from AA to itself. 755 00:41:38,330 --> 00:41:42,830 And the next edge from AA to AB comes 756 00:41:42,830 --> 00:41:47,200 from the next read-- the AAB read. 757 00:41:50,060 --> 00:41:56,640 So each edge then represents an overlap of one base. 758 00:41:56,640 --> 00:41:59,330 And therefore, each edge represents 759 00:41:59,330 --> 00:42:01,950 a unique k-mer sequence.
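The construction just described can be sketched directly in Python: each 3-mer read contributes one directed edge from its left 2-mer to its right 2-mer, and a node is reused whenever that 2-mer has been seen before. The read list below is an assumed reconstruction of the slide example (the reads of a sequence like AAABBBA), and the function name `de_bruijn_edges` is hypothetical.

```python
def de_bruijn_edges(kmers):
    """One directed edge per k-mer: (left k-1-mer) -> (right k-1-mer).

    Edges are kept in a list, so a repeated k-mer produces a repeated edge.
    """
    return [(kmer[:-1], kmer[1:]) for kmer in kmers]


# Assumed 3-mer reads, in order, for the sequence AAABBBA.
reads = ["AAA", "AAB", "ABB", "BBB", "BBA"]
edges = de_bruijn_edges(reads)
nodes = {node for edge in edges for node in edge}

for left, right in edges:
    print(left, "->", right)
print("nodes:", sorted(nodes))
```

Five reads give five edges but only four distinct nodes, because AA and BB are each reused. The AAA read produces the self-loop on AA, and BBB produces the self-loop on BB.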
760 00:42:01,950 --> 00:42:04,690 So the way to think about this graph 761 00:42:04,690 --> 00:42:08,970 is that all of the edges represent the original reads. 762 00:42:08,970 --> 00:42:13,340 And we have represented the k minus 1 words as the nodes. 763 00:42:13,340 --> 00:42:16,150 OK? 764 00:42:16,150 --> 00:42:21,550 So we can take this graph then and generalize this idea. 765 00:42:21,550 --> 00:42:27,440 And if we look at how the graph changes 766 00:42:27,440 --> 00:42:29,850 as we add more structure, here you 767 00:42:29,850 --> 00:42:31,840 see that we've added an extra b. 768 00:42:31,840 --> 00:42:35,530 And we get another edge in the graph back to the same node. 769 00:42:35,530 --> 00:42:37,030 So when we're building these graphs, 770 00:42:37,030 --> 00:42:40,245 if possible, we reuse a node that already exists. 771 00:42:42,900 --> 00:42:46,320 Now the way to think about coming back 772 00:42:46,320 --> 00:42:48,360 to the original sequence is finding a path 773 00:42:48,360 --> 00:42:52,890 through this graph and emitting sequence as we trace the path. 774 00:42:52,890 --> 00:42:54,680 And we would like to have a path that 775 00:42:54,680 --> 00:42:58,130 traverses all of the nodes. 776 00:42:58,130 --> 00:43:01,180 And so we have some definitions here, 777 00:43:01,180 --> 00:43:07,390 which is that a node is balanced if its indegree equals 778 00:43:07,390 --> 00:43:09,400 its outdegree. 779 00:43:09,400 --> 00:43:13,960 And you can see that not all the nodes are balanced in 780 00:43:13,960 --> 00:43:16,660 the graph in the lower, right-hand corner. 781 00:43:16,660 --> 00:43:18,720 And it's connected if all the components 782 00:43:18,720 --> 00:43:20,970 or nodes can be reached. 783 00:43:20,970 --> 00:43:25,650 And an Eulerian walk visits each edge exactly once, 784 00:43:25,650 --> 00:43:30,690 which is what we would like to actually take a De Bruijn graph 785 00:43:30,690 --> 00:43:34,750 and emit a genome sequence.
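One way to find such a walk is Hierholzer's algorithm, sketched below in Python. This is a generic textbook approach swapped in for illustration, not something named in the lecture. The edge list is an assumed small example (the 3-mers of AAABBBA), the names `eulerian_walk` and `spell` are hypothetical, and the sketch assumes the graph is connected and actually has an Eulerian walk.

```python
from collections import defaultdict


def eulerian_walk(edges):
    """Return the nodes of a walk that uses every edge exactly once.

    Assumes the graph is connected and such a walk exists.
    """
    out_edges = defaultdict(list)
    indegree = defaultdict(int)
    outdegree = defaultdict(int)
    for u, v in edges:
        out_edges[u].append(v)
        outdegree[u] += 1
        indegree[v] += 1
    nodes = set(indegree) | set(outdegree)

    # Start at the node with one more outgoing than incoming edge, if any.
    start = next((n for n in nodes if outdegree[n] - indegree[n] == 1),
                 edges[0][0])

    stack, walk = [start], []
    while stack:
        u = stack[-1]
        if out_edges[u]:
            stack.append(out_edges[u].pop())  # follow an unused edge
        else:
            walk.append(stack.pop())          # dead end: record and back up
    walk.reverse()
    return walk


def spell(walk):
    """Emit sequence: the first node, then the last character of each later node."""
    return walk[0] + "".join(node[-1] for node in walk[1:])


# Assumed De Bruijn edges for the 3-mers AAA, AAB, ABB, BBB, BBA.
edges = [("AA", "AA"), ("AA", "AB"), ("AB", "BB"), ("BB", "BB"), ("BB", "BA")]
walk = eulerian_walk(edges)
sequence = spell(walk)
print(walk)
print(sequence)
```

In this small graph AA has one more outgoing edge than incoming and BA has one more incoming than outgoing, so the walk must start at AA and end at BA, and only one walk is possible.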
786 00:43:34,750 --> 00:43:37,130 Now, not all graphs have these walks. 787 00:43:40,010 --> 00:43:42,290 And graphs that do are Eulerian. 788 00:43:42,290 --> 00:43:47,520 And we won't distinguish different types 789 00:43:47,520 --> 00:43:50,430 of these graphs. 790 00:43:50,430 --> 00:43:54,379 And if a graph has two semi-balanced nodes 791 00:43:54,379 --> 00:43:56,170 and all the rest of the nodes are balanced, 792 00:43:56,170 --> 00:43:59,650 then it will have a walk through it. 793 00:43:59,650 --> 00:44:04,990 So if we think about our original graph, 794 00:44:04,990 --> 00:44:09,360 there are two arguments for it having such a walk. 795 00:44:09,360 --> 00:44:14,120 The first argument is that we show the walk. 796 00:44:14,120 --> 00:44:18,730 And the second is that we have two semi-balanced nodes 797 00:44:18,730 --> 00:44:20,355 and the rest of the nodes are balanced. 798 00:44:23,620 --> 00:44:26,340 So the reason that we care about this 799 00:44:26,340 --> 00:44:30,100 is that we want to study cases where this goes wrong. 800 00:44:33,760 --> 00:44:37,800 So to build a De Bruijn graph of a genome, 801 00:44:37,800 --> 00:44:42,120 we're going to take our original sequence reads. 802 00:44:42,120 --> 00:44:45,570 And we're going to take all the k-mers that 803 00:44:45,570 --> 00:44:48,570 occur in those reads. 804 00:44:48,570 --> 00:44:53,230 And we're going to add edges to a De Bruijn graph 805 00:44:53,230 --> 00:44:54,310 based upon those k-mers. 806 00:44:54,310 --> 00:45:04,530 So if we have a read like this, and we consider 807 00:45:04,530 --> 00:45:07,660 a k-mer in the read, we're going to add an edge 808 00:45:07,660 --> 00:45:11,070 in the graph between the left k minus 1-mer 809 00:45:11,070 --> 00:45:14,250 and the right k minus 1-mer. 810 00:45:14,250 --> 00:45:18,440 And we'll do that for every single k-mer in the read. 811 00:45:18,440 --> 00:45:22,230 Now note what this does is it destroys some information.
812 00:45:22,230 --> 00:45:26,230 It destroys information about the ordering 813 00:45:26,230 --> 00:45:30,230 of certain of the k-mers in this read, just destroying their read 814 00:45:30,230 --> 00:45:34,400 contiguity in order to make some simplifying assumptions 815 00:45:34,400 --> 00:45:39,360 to represent the sequence ordering 816 00:45:39,360 --> 00:45:43,940 of these k minus 1-mers in the graph. 817 00:45:43,940 --> 00:45:48,910 So we build the graph in this way 818 00:45:48,910 --> 00:45:56,120 and if I were to build the graph like this, 819 00:45:56,120 --> 00:45:59,370 what is the minimum sequence overlap for two 820 00:45:59,370 --> 00:46:02,420 reads to actually share an edge in the resulting graph? 821 00:46:05,200 --> 00:46:08,720 Can anybody see how long the sequence 822 00:46:08,720 --> 00:46:10,210 must be in the second read for it 823 00:46:10,210 --> 00:46:13,270 to actually overlap at an edge with the first read? 824 00:46:20,570 --> 00:46:23,000 Well, if this second read also has a k-mer, right? 825 00:46:25,684 --> 00:46:28,100 It's going to produce another structure just like this one 826 00:46:28,100 --> 00:46:30,480 if these two do overlap. 827 00:46:30,480 --> 00:46:34,180 And thus the edge produced by this read and the edge 828 00:46:34,180 --> 00:46:39,520 by this read will overlap like this. 829 00:46:39,520 --> 00:46:47,750 And thus all of the nodes that came from this part of read one 830 00:46:47,750 --> 00:46:49,470 will feed into this graph. 831 00:46:49,470 --> 00:46:51,110 And then all the nodes that come out 832 00:46:51,110 --> 00:46:54,640 of this k-mer from the purple read will come out 833 00:46:54,640 --> 00:46:57,030 of it like so, right? 834 00:46:57,030 --> 00:46:59,320 And thus when we're tracing the graph, 835 00:46:59,320 --> 00:47:01,410 the idea is that the graph will be connected.
836 00:47:01,410 --> 00:47:03,500 And we'll be able to come between these reads 837 00:47:03,500 --> 00:47:05,912 and reconstruct the sequence that 838 00:47:05,912 --> 00:47:07,120 was suggested by the overlap. 839 00:47:10,360 --> 00:47:15,120 The thing, however, you should note in this-- yes, question? 840 00:47:15,120 --> 00:47:18,960 AUDIENCE: So you're picking two k minus 1 841 00:47:18,960 --> 00:47:22,970 reads there-- are those from different reads? 842 00:47:22,970 --> 00:47:24,464 Or from the white read? 843 00:47:24,464 --> 00:47:26,130 PROFESSOR: No, it's from the white read. 844 00:47:26,130 --> 00:47:30,550 These are the two k minus 1-mers that came out of this read. 845 00:47:30,550 --> 00:47:32,185 So they actually overlap. 846 00:47:32,185 --> 00:47:34,560 AUDIENCE: Yeah, but then you were 847 00:47:34,560 --> 00:47:38,401 talking about how the one was purple in that case. 848 00:47:38,401 --> 00:47:40,900 PROFESSOR: Right, well, this is the same sequence, let's say. 849 00:47:40,900 --> 00:47:44,420 This is the same, exact sequence down here. 850 00:47:44,420 --> 00:47:47,990 So if it's the same, exact sequence, 851 00:47:47,990 --> 00:47:50,930 it will have the same k minus 1-mers. 852 00:47:50,930 --> 00:47:54,230 And when we build the graph, if a node already exists, 853 00:47:54,230 --> 00:47:55,100 we reuse it. 854 00:47:58,430 --> 00:48:01,280 And thus if we reuse the nodes that 855 00:48:01,280 --> 00:48:04,850 were created when we built the graph nodes 856 00:48:04,850 --> 00:48:09,450 and edges for the white read, then when the purple read comes 857 00:48:09,450 --> 00:48:11,700 along, we're going to put another edge here 858 00:48:11,700 --> 00:48:14,370 between these two k minus 1-mers because they are contained here 859 00:48:14,370 --> 00:48:16,594 as well. 860 00:48:16,594 --> 00:48:18,260 So these are identical sequences to this 861 00:48:18,260 --> 00:48:21,430 because these two reads overlap.
862 00:48:21,430 --> 00:48:23,800 And this part is the same sequence as that part. 863 00:48:23,800 --> 00:48:26,280 AUDIENCE: Yeah, so why do you need k minus 1-mers 864 00:48:26,280 --> 00:48:30,250 if you have overlapped k? 865 00:48:30,250 --> 00:48:31,990 PROFESSOR: Because the way we're finding 866 00:48:31,990 --> 00:48:35,260 these overlaps is through the graph. 867 00:48:35,260 --> 00:48:39,654 And we're not indexing things of size k, right? 868 00:48:39,654 --> 00:48:41,320 We're indexing things of size k minus 1. 869 00:48:47,130 --> 00:48:55,502 And each edge represents a sequence of length k 870 00:48:55,502 --> 00:48:57,460 because we know this sequence and this sequence 871 00:48:57,460 --> 00:48:59,320 are overlapped by one base. 872 00:49:02,150 --> 00:49:04,940 So when we find an edge that's the same between the white 873 00:49:04,940 --> 00:49:07,150 and the purple read, we know that they're 874 00:49:07,150 --> 00:49:08,370 overlapping by k bases. 875 00:49:11,040 --> 00:49:13,019 Is that making sense to you? 876 00:49:13,019 --> 00:49:13,560 AUDIENCE: No. 877 00:49:13,560 --> 00:49:17,928 PROFESSOR: No, OK, so let's try it again. 878 00:49:17,928 --> 00:49:19,136 AUDIENCE: You can keep going. 879 00:49:19,136 --> 00:49:20,094 PROFESSOR: No, it's OK. 880 00:49:24,140 --> 00:49:27,580 Let's just start with the purple read for a moment 881 00:49:27,580 --> 00:49:29,570 because I think if you have a question, 882 00:49:29,570 --> 00:49:31,500 other people may have a question. 883 00:49:31,500 --> 00:49:38,220 So we have this sequence, which is this sequence right here, 884 00:49:38,220 --> 00:49:39,050 right? 885 00:49:39,050 --> 00:49:41,600 And then we have this sequence, which 886 00:49:41,600 --> 00:49:44,070 is the sequence right here. 887 00:49:44,070 --> 00:49:47,170 They overlap by one base. 888 00:49:47,170 --> 00:49:50,100 And so we put an edge between them like this in the graph. 889 00:49:50,100 --> 00:49:50,600 OK?
890 00:49:53,380 --> 00:49:56,235 AUDIENCE: Don't they overlap by more than one base? 891 00:49:56,235 --> 00:50:00,805 They can only contain one base from each k-mer. 892 00:50:00,805 --> 00:50:01,680 PROFESSOR: I'm sorry. 893 00:50:01,680 --> 00:50:02,665 That's what I meant. 894 00:50:02,665 --> 00:50:03,165 Yeah. 895 00:50:06,210 --> 00:50:12,070 And then the same thing is true down here. 896 00:50:15,730 --> 00:50:20,969 And so we will find this k minus 1-mer and this k minus 1-mer. 897 00:50:20,969 --> 00:50:21,885 And then they overlap. 898 00:50:39,350 --> 00:50:46,920 For genome assembly, we record the forward and reverse 899 00:50:46,920 --> 00:50:48,980 complement reads in twin nodes. 900 00:50:48,980 --> 00:50:51,920 And we're not going to show those because it just 901 00:50:51,920 --> 00:50:53,540 complicates our graphs without really 902 00:50:53,540 --> 00:50:56,690 adding any illustrative power. 903 00:50:56,690 --> 00:50:59,854 And we always choose k to be odd so that a node can't 904 00:50:59,854 --> 00:51:01,145 be its own reverse complement. 905 00:51:07,210 --> 00:51:19,200 And here is the graph growing if we think about k equals 5. 906 00:51:19,200 --> 00:51:21,360 So we have reads of length five. 907 00:51:21,360 --> 00:51:25,910 And we are adding sequences to the graph. 908 00:51:25,910 --> 00:51:28,290 And you note that the graph is acyclic 909 00:51:28,290 --> 00:51:30,390 until we get to the repeated sequence. 910 00:51:30,390 --> 00:51:34,000 And when we get to the second "long," the sequence comes back around 911 00:51:34,000 --> 00:51:37,060 and begins looping back on itself. 912 00:51:37,060 --> 00:51:42,660 And if we consider the last part of this De Bruijn graph 913 00:51:42,660 --> 00:51:47,170 construction, then we wind up with the finished graph 914 00:51:47,170 --> 00:51:48,860 on the right-hand side.
915 00:51:48,860 --> 00:51:51,930 And you can see the multiplicity of the edges 916 00:51:51,930 --> 00:51:53,640 corresponds to the number of times 917 00:51:53,640 --> 00:51:55,665 "long" is repeated in this graph. 918 00:51:58,550 --> 00:52:03,370 So once again, repeats are causing the circular structure, 919 00:52:03,370 --> 00:52:06,460 which only could be resolved if we had sufficiently long reads, 920 00:52:06,460 --> 00:52:08,465 which we don't have in this particular case. 921 00:52:12,830 --> 00:52:15,920 However, if we consider perfect sequencing 922 00:52:15,920 --> 00:52:18,800 we always have a path through the graph. 923 00:52:18,800 --> 00:52:25,750 And the reason is that the leftmost part 924 00:52:25,750 --> 00:52:31,040 of the genome, so to speak, is going to be semi-balanced. 925 00:52:31,040 --> 00:52:33,370 And the rightmost part is going to be semi-balanced. 926 00:52:33,370 --> 00:52:38,350 And all the parts in between are going to be balanced. 927 00:52:38,350 --> 00:52:42,190 So the k minus 1-mer on the very left end is semi-balanced 928 00:52:42,190 --> 00:52:45,650 and the k minus 1-mer on the right is semi-balanced. 929 00:52:45,650 --> 00:52:50,220 And all the nodes in between are balanced. 930 00:52:50,220 --> 00:52:55,470 Now, this does not allow for errors of course. 931 00:52:55,470 --> 00:53:04,320 And we talk about following this Eulerian walk 932 00:53:04,320 --> 00:53:06,349 to find the original sequence. 933 00:53:06,349 --> 00:53:08,640 But the question we can ask ourselves is whether or not 934 00:53:08,640 --> 00:53:10,430 this walk always really corresponds 935 00:53:10,430 --> 00:53:13,250 to the original genome sequence. 936 00:53:13,250 --> 00:53:16,010 It turns out I can show you this example, which 937 00:53:16,010 --> 00:53:20,200 is we have this graph for this sequence. 938 00:53:20,200 --> 00:53:25,040 And there are two different walks through this graph.
939 00:53:25,040 --> 00:53:31,070 And the two different walks produce 940 00:53:31,070 --> 00:53:32,630 two different sequences. 941 00:53:32,630 --> 00:53:35,080 And they depend upon which way you 942 00:53:35,080 --> 00:53:39,580 start walking from the node AB. 943 00:53:39,580 --> 00:53:43,620 So once again, here we have seen that even when 944 00:53:43,620 --> 00:53:48,480 we have a path through the graph, the path may not be unique. 945 00:53:48,480 --> 00:53:51,290 It may not be able to generate the original sequence 946 00:53:51,290 --> 00:53:52,170 that we started with. 947 00:53:55,750 --> 00:54:01,670 So the other problem we can have when 948 00:54:01,670 --> 00:54:04,960 we are building a graph like this is that gaps in coverage 949 00:54:04,960 --> 00:54:07,810 can create holes in the graph. 950 00:54:07,810 --> 00:54:11,970 So if we omit certain of our reads, 951 00:54:11,970 --> 00:54:16,640 we'll come up with a graph that is broken into two parts. 952 00:54:16,640 --> 00:54:18,660 And this corresponds to the idea that we're 953 00:54:18,660 --> 00:54:22,660 going to create two different contigs that are contiguous 954 00:54:22,660 --> 00:54:25,720 sequence but will be unable to fill in the middle part. 955 00:54:25,720 --> 00:54:26,220 OK? 956 00:54:29,370 --> 00:54:39,140 So we also can have differences in coverage of a graph 957 00:54:39,140 --> 00:54:44,480 when we have extra reads at particular locations 958 00:54:44,480 --> 00:54:45,710 in the genome. 959 00:54:45,710 --> 00:54:51,030 And that causes the degrees on the individual nodes to vary 960 00:54:51,030 --> 00:54:56,560 and causes us to not be able to rely upon the indegree 961 00:54:56,560 --> 00:55:00,629 and outdegree as an absolute metric for how 962 00:55:00,629 --> 00:55:02,045 to trace a path through the graph.
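To make the earlier point about non-unique walks concrete, here is a small sketch that enumerates every Eulerian walk of a toy De Bruijn graph and spells out the sequence each one gives. The string `ZABCDABEFABY` with k = 3 is an assumed stand-in for the slide's example (it has a repeated node `AB`), and the function name `eulerian_walk_sequences` is made up for this sketch. The brute-force backtracking is only sensible for tiny graphs.

```python
from collections import Counter


def eulerian_walk_sequences(seq, k):
    """Return every distinct sequence spelled by an Eulerian walk of seq's De Bruijn graph."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    # Edge (left (k-1)-mer, right (k-1)-mer) -> number of unused copies.
    remaining = Counter((kmer[:-1], kmer[1:]) for kmer in kmers)
    total_edges = len(kmers)
    start = seq[:k - 1]          # assumption: we know where the sequence starts
    found = set()

    def extend(node, spelled, used):
        if used == total_edges:  # every edge used exactly once
            found.add(spelled)
            return
        for edge in list(remaining):
            left, right = edge
            if left == node and remaining[edge] > 0:
                remaining[edge] -= 1
                extend(right, spelled + right[-1], used + 1)
                remaining[edge] += 1   # backtrack

    extend(start, start, 0)
    return sorted(found)


walks = eulerian_walk_sequences("ZABCDABEFABY", k=3)
for walk in walks:
    print(walk)
```

Both reconstructions use exactly the same k-mers, so nothing in the graph alone says which one was the original sequence.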
963 00:55:07,470 --> 00:55:12,544 And the other thing is that if you have differences 964 00:55:12,544 --> 00:55:14,460 between the chromosomes, which we talked about 965 00:55:14,460 --> 00:55:20,450 last time in our overlap layout consensus assembler, 966 00:55:20,450 --> 00:55:25,150 it also can cause graphs to split apart 967 00:55:25,150 --> 00:55:27,120 and to have subgraphs that correspond 968 00:55:27,120 --> 00:55:31,020 to one allele versus the other allele, which is present 969 00:55:31,020 --> 00:55:33,760 perhaps in the main graph. 970 00:55:33,760 --> 00:55:40,000 All right, so it's actually the case 971 00:55:40,000 --> 00:55:45,710 that these graphs are attractive for a very important reason, 972 00:55:45,710 --> 00:55:48,180 which is they're extraordinarily efficient to build. 973 00:55:48,180 --> 00:55:51,600 That is, in order to build a graph like this, 974 00:55:51,600 --> 00:55:54,420 you need to take each one of these k minus 1-mers 975 00:55:54,420 --> 00:55:57,517 and actually find the node, which you can do by hashing, 976 00:55:57,517 --> 00:55:59,100 and then put the edges into the graph. 977 00:55:59,100 --> 00:56:02,870 And so you find that you need to put in an edge and two nodes 978 00:56:02,870 --> 00:56:05,210 for each k-mer. 979 00:56:05,210 --> 00:56:08,810 And if you have a hash map that encoded these nodes and edges, 980 00:56:08,810 --> 00:56:11,460 it's constant time work. 981 00:56:11,460 --> 00:56:14,660 So you wind up with a graph which 982 00:56:14,660 --> 00:56:18,330 costs order of the number of reads to build. 983 00:56:18,330 --> 00:56:21,620 So it's a linear time graph construction problem. 984 00:56:21,620 --> 00:56:26,510 Recall that our last overlap construction, 985 00:56:26,510 --> 00:56:30,130 we thought we could get down to N log N. 986 00:56:30,130 --> 00:56:34,430 And here is an example of sub-setting part of the lambda 987 00:56:34,430 --> 00:56:38,840 phage genome using a De Bruijn graph assembler.
988 00:56:38,840 --> 00:56:40,850 And you can see that roughly the time required 989 00:56:40,850 --> 00:56:43,085 to assemble parts of the genome is 990 00:56:43,085 --> 00:56:45,460 linear in the amount of genome sequence that you give it. 991 00:56:50,460 --> 00:56:59,290 So these assemblers were favored early on in the days 992 00:56:59,290 --> 00:57:03,040 of short-read assembly in part because they were so efficient. 993 00:57:03,040 --> 00:57:05,395 And typically in some of the projects, 994 00:57:05,395 --> 00:57:06,747 you have very high coverage. 995 00:57:06,747 --> 00:57:08,580 And so you wind up with graphs that actually 996 00:57:08,580 --> 00:57:11,980 have a huge number of edges between nodes. 997 00:57:11,980 --> 00:57:15,300 And this can be summarised in terms 998 00:57:15,300 --> 00:57:17,530 of a graph that simply annotates the edges 999 00:57:17,530 --> 00:57:21,750 with the number of instances. 1000 00:57:21,750 --> 00:57:25,920 And so you have a weighted graph on the right-hand side, which 1001 00:57:25,920 --> 00:57:31,660 is easier in some sense to trace because we can now 1002 00:57:31,660 --> 00:57:36,745 begin to eliminate low-coverage edges as potential anomalies. 1003 00:57:40,740 --> 00:57:43,780 But the essential idea is to trace these graphs to produce 1004 00:57:43,780 --> 00:57:45,840 the ultimate genome sequence. 1005 00:57:45,840 --> 00:57:48,480 And in order to do so, we may need 1006 00:57:48,480 --> 00:57:51,550 to do some error correction. 1007 00:57:51,550 --> 00:57:56,570 So we talked earlier about the idea that if we have an error, 1008 00:57:56,570 --> 00:58:00,060 we're going to actually produce a portion of the graph that 1009 00:58:00,060 --> 00:58:03,540 hangs off into outer space. 1010 00:58:03,540 --> 00:58:07,910 And we can cut these dead-end tips of the graph 1011 00:58:07,910 --> 00:58:13,060 off if they are low coverage because they presumably 1012 00:58:13,060 --> 00:58:15,740 correspond to errors. 
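As a rough illustration of the weighted graph and of dropping low-coverage edges, the sketch below counts how often each edge occurs across a set of reads and then keeps only edges seen at least `min_count` times. The reads, the threshold, and the function names `weighted_edges` and `drop_low_coverage` are all made up for this example. Real assemblers also look at graph topology (tips, bubbles) rather than relying on a bare count threshold, so treat this as a simplification.

```python
from collections import Counter


def weighted_edges(reads, k):
    """Collapse repeated edges into a single edge annotated with its count."""
    weights = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            weights[(kmer[:-1], kmer[1:])] += 1
    return weights


def drop_low_coverage(weights, min_count):
    """Keep only edges supported by at least min_count k-mer occurrences."""
    return {edge: count for edge, count in weights.items() if count >= min_count}


# Hypothetical reads: three agree, one carries a single-base error (T -> G).
reads = ["ACGTA", "ACGTA", "ACGTA", "ACGGA"]
weights = weighted_edges(reads, k=3)
kept = drop_low_coverage(weights, min_count=2)
for edge, count in sorted(kept.items()):
    print(edge, count)
```

Here the erroneous read contributes two edges with a count of one, which form a low-coverage dead end that the threshold removes.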
1013 00:58:15,740 --> 00:58:19,540 If we get an error in the middle of a read, 1014 00:58:19,540 --> 00:58:22,860 we can wind up with a so-called bubble in the graph, which 1015 00:58:22,860 --> 00:58:25,300 once again is low coverage. 1016 00:58:25,300 --> 00:58:31,800 And we can get rid of these bubbles in a similar fashion. 1017 00:58:31,800 --> 00:58:36,650 And it's also possible to get chimeric edges of the graph. 1018 00:58:36,650 --> 00:58:39,730 And those can be caused by errors as well. 1019 00:58:39,730 --> 00:58:43,591 And we can clip those edges. 1020 00:58:43,591 --> 00:58:45,590 So there are different kinds of error correction 1021 00:58:45,590 --> 00:58:46,548 we can do in the graph. 1022 00:58:46,548 --> 00:58:48,280 These are all quite heuristic. 1023 00:58:48,280 --> 00:58:51,030 Each assembler has its own set of heuristics 1024 00:58:51,030 --> 00:58:54,370 for how to deal with graph anomalies 1025 00:58:54,370 --> 00:59:01,890 and how to eliminate edges in the graph to permit assembly. 1026 00:59:01,890 --> 00:59:05,880 But these are getting rid of dead-end tips 1027 00:59:05,880 --> 00:59:08,630 and popping bubbles and getting rid of chimeric edges 1028 00:59:08,630 --> 00:59:11,240 are important things to consider for any assembler. 1029 00:59:14,800 --> 00:59:18,160 So the limitations of these graphs 1030 00:59:18,160 --> 00:59:20,770 are the idea that we're immediately 1031 00:59:20,770 --> 00:59:26,640 splitting these reads into this k-mer representation, which 1032 00:59:26,640 --> 00:59:29,210 is destroying information. 1033 00:59:29,210 --> 00:59:33,460 And in order to overcome this, one 1034 00:59:33,460 --> 00:59:36,170 of the things that people have done in these De Bruijn graph 1035 00:59:36,170 --> 00:59:42,520 assemblers is to take the original reads 1036 00:59:42,520 --> 00:59:45,380 and to map them back on to the graph. 
1037 00:59:45,380 --> 00:59:46,955 So when you're attempting to trace 1038 00:59:46,955 --> 00:59:48,580 the path through the graph, what you do 1039 00:59:48,580 --> 00:59:49,871 is you take the original reads. 1040 00:59:49,871 --> 00:59:52,102 You thread them through the graph. 1041 00:59:52,102 --> 00:59:53,560 And you know that the original read 1042 00:59:53,560 --> 00:59:56,390 represents contiguous genome sequence. 1043 00:59:56,390 --> 00:59:58,470 So it provides you with a path through the graph 1044 00:59:58,470 --> 00:59:59,520 that you know is good. 1045 01:00:02,430 --> 01:00:05,100 People have been doing this in part 1046 01:00:05,100 --> 01:00:08,380 because they didn't want to go to the full overlap graph 1047 01:00:08,380 --> 01:00:10,890 implementation because of the cost. 1048 01:00:10,890 --> 01:00:15,140 But I think that these overlap graph implementations now 1049 01:00:15,140 --> 01:00:18,230 are sufficiently sophisticated that I personally 1050 01:00:18,230 --> 01:00:20,710 would use them instead of a De Bruijn graph assembler. 1051 01:00:23,610 --> 01:00:29,580 And so the trade-off really centers around 1052 01:00:29,580 --> 01:00:36,070 speed and space versus accuracy. 1053 01:00:36,070 --> 01:00:42,400 So we can look at some example assemblers 1054 01:00:42,400 --> 01:00:45,210 and look at their performance. 1055 01:00:45,210 --> 01:00:48,444 But before I do that and we leave De Bruijn graphs, 1056 01:00:48,444 --> 01:00:50,610 are there any other questions about De Bruijn graph 1057 01:00:50,610 --> 01:00:51,110 assemblers? 1058 01:00:53,570 --> 01:00:54,480 AUDIENCE: I have one. 1059 01:00:54,480 --> 01:00:55,563 PROFESSOR: Yeah, question. 1060 01:00:55,563 --> 01:00:57,340 AUDIENCE: How long is k typically? 1061 01:00:57,340 --> 01:00:59,090 PROFESSOR: We're going to talk about that.
1062 01:00:59,090 --> 01:01:04,260 The k typically is somewhere around 60-- something 1063 01:01:04,260 --> 01:01:08,834 like that-- somewhere in that neighborhood. 1064 01:01:08,834 --> 01:01:10,500 It's actually-- it has to be odd, right? 1065 01:01:10,500 --> 01:01:14,340 So 61, 57-- something like that. 1066 01:01:14,340 --> 01:01:15,050 Good question. 1067 01:01:15,050 --> 01:01:17,585 Any other questions about De Bruijn graph assemblers? 1068 01:01:25,150 --> 01:01:32,670 So once again returning to our overall architecture, 1069 01:01:32,670 --> 01:01:34,230 we have these reads. 1070 01:01:34,230 --> 01:01:37,310 We need to produce contigs. 1071 01:01:37,310 --> 01:01:40,430 In the case of overlap graphs, we're 1072 01:01:40,430 --> 01:01:42,130 going to trace the overlap graphs. 1073 01:01:42,130 --> 01:01:43,825 In the case of De Bruijn graphs, we're 1074 01:01:43,825 --> 01:01:45,283 going to trace the De Bruijn graph. 1075 01:01:45,283 --> 01:01:49,980 For scaffolding, we can use the read pairs to put scaffolds 1076 01:01:49,980 --> 01:01:52,340 back together again. 1077 01:01:52,340 --> 01:01:57,540 And here is some comparison of the performance 1078 01:01:57,540 --> 01:01:59,290 of these various assemblers. 1079 01:01:59,290 --> 01:02:05,640 So the first assembler-- SGA-- is an overlap layout 1080 01:02:05,640 --> 01:02:08,280 consensus-style assembler. 1081 01:02:08,280 --> 01:02:11,160 Velvet, ABySS, and SOAPdenovo are all De Bruijn 1082 01:02:11,160 --> 01:02:12,274 graph-based assemblers. 1083 01:02:12,274 --> 01:02:13,940 So these are all contemporary assemblers 1084 01:02:13,940 --> 01:02:17,190 that people use for assembling genomes.
1085 01:02:17,190 --> 01:02:18,870 An important metric for assemblers 1086 01:02:18,870 --> 01:02:22,560 is something called N50, which is the size of a contig 1087 01:02:22,560 --> 01:02:28,740 or scaffold such that 50% of the bases 1088 01:02:28,740 --> 01:02:31,600 are present in scaffolds of that length or larger. 1089 01:02:31,600 --> 01:02:36,340 So, for example, for SGA, they say that scaffold N50 size 1090 01:02:36,340 --> 01:02:41,390 is 26.3 kilobases, which means that in scaffolds 1091 01:02:41,390 --> 01:02:44,790 of length 26.3 kilobases or larger, 1092 01:02:44,790 --> 01:02:47,760 half of the bases of the assembly lie. 1093 01:02:47,760 --> 01:02:52,700 So the larger the N50 is, the larger the scaffolds 1094 01:02:52,700 --> 01:02:54,670 are that cover things. 1095 01:02:54,670 --> 01:02:58,090 And you want larger and larger scaffolds or contigs 1096 01:02:58,090 --> 01:03:01,290 so that you have fewer gaps in your assembly. 1097 01:03:01,290 --> 01:03:07,350 So the N50 number is a principal comparison metric 1098 01:03:07,350 --> 01:03:10,510 when one is thinking about assemblers. 1099 01:03:10,510 --> 01:03:16,540 So in this particular case, for SGA the overlap metric 1100 01:03:16,540 --> 01:03:21,480 was that the reads had to overlap by at least 75 bases 1101 01:03:21,480 --> 01:03:23,300 or more. 1102 01:03:23,300 --> 01:03:25,100 And these were 100-base pair reads. 1103 01:03:25,100 --> 01:03:26,690 You can see the details on the read 1104 01:03:26,690 --> 01:03:29,260 data on the bottom line there. 1105 01:03:29,260 --> 01:03:32,790 So as long as the reads overlap by 75 bases, 1106 01:03:32,790 --> 01:03:35,840 they were put together in the graph. 1107 01:03:35,840 --> 01:03:37,650 And the De Bruijn graph assemblers 1108 01:03:37,650 --> 01:03:43,590 each had their own optimum number for k.
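Since N50 comes up in every assembler comparison, here is a short sketch of how it can be computed from a list of contig or scaffold lengths. The function name `n50` and the lengths in the example are hypothetical and are not taken from the comparison table. The idea is to sort the lengths from largest to smallest and walk down the list until the running total reaches half of all assembled bases.

```python
def n50(lengths):
    """Return the length L such that pieces of length >= L hold at least half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:   # reached half of the assembled bases
            return length
    return 0                       # empty assembly


# Hypothetical scaffold lengths in kilobases.
scaffold_lengths = [80, 70, 50, 40, 30, 20]
print(n50(scaffold_lengths))
```

In this made-up assembly the two largest scaffolds already hold more than half of the 290 kilobases, so the N50 is the length of the second one.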
1109 01:03:43,590 --> 01:03:46,050 And the way that you tune these parameters is you 1110 01:03:46,050 --> 01:03:50,040 run the assembler on a range of k values. 1111 01:03:50,040 --> 01:03:56,290 And you see which k value produced the assembly 1112 01:03:56,290 --> 01:03:59,280 with the highest N50. 1113 01:03:59,280 --> 01:04:02,060 And you pick that k. 1114 01:04:02,060 --> 01:04:04,120 Can anybody think of a reason why 1115 01:04:04,120 --> 01:04:06,290 it is that although these are all 1116 01:04:06,290 --> 01:04:09,360 roughly in the same ballpark, different assemblers might have 1117 01:04:09,360 --> 01:04:15,245 different k values given that the underlying technology is 1118 01:04:15,245 --> 01:04:15,828 quite similar? 1119 01:04:22,400 --> 01:04:24,630 Any guesses about what is going on here? 1120 01:04:32,990 --> 01:04:35,480 Well, we know that the differences in the assemblers 1121 01:04:35,480 --> 01:04:38,560 really are rooted in the way that they are processing 1122 01:04:38,560 --> 01:04:41,500 the graphs and the way that they are simplifying them. 1123 01:04:41,500 --> 01:04:43,970 And therefore, one has to imagine 1124 01:04:43,970 --> 01:04:47,590 that the differences lie in the post-processing of the graph 1125 01:04:47,590 --> 01:04:52,780 once it's built and that certain assemblers like larger k 1126 01:04:52,780 --> 01:04:53,330 values. 1127 01:04:53,330 --> 01:04:57,590 Whereas other ones can tolerate smaller k values. 1128 01:04:57,590 --> 01:05:00,750 And you can see if we look at the running statistics 1129 01:05:00,750 --> 01:05:06,820 for these, that the performance of SGA 1130 01:05:06,820 --> 01:05:09,120 if you look at the reference bases covered 1131 01:05:09,120 --> 01:05:11,790 by contigs greater than one kilobase 1132 01:05:11,790 --> 01:05:14,470 is roughly comparable to all the other assemblers. 1133 01:05:14,470 --> 01:05:18,270 But its mismatch performance is much better. 
1134 01:05:18,270 --> 01:05:22,550 That is the other assemblers are producing-- well, 1135 01:05:22,550 --> 01:05:24,920 I take it back except for SOAPdenovo. 1136 01:05:24,920 --> 01:05:27,830 But it does quite a good job at correcting 1137 01:05:27,830 --> 01:05:29,830 reads and coming up with the correct sequence. 1138 01:05:32,730 --> 01:05:36,030 The last lines however tell the story about running time, which 1139 01:05:36,030 --> 01:05:38,940 is that the overlap consensus assembler is taking 1140 01:05:38,940 --> 01:05:43,340 41 hours of CPU time for C. elegans genome assembly. 1141 01:05:43,340 --> 01:05:47,760 Whereas the other assemblers, the De Bruijn assemblers, 1142 01:05:47,760 --> 01:05:48,840 are running much faster. 1143 01:05:52,270 --> 01:05:58,790 So the thing that I wanted to emphasize today 1144 01:05:58,790 --> 01:06:03,750 was that once you have the final graph whether it 1145 01:06:03,750 --> 01:06:07,130 be an overlap graph or a De Bruijn graph, 1146 01:06:07,130 --> 01:06:12,110 which represents possible ways of putting back together 1147 01:06:12,110 --> 01:06:14,910 again the jigsaw puzzle, it still 1148 01:06:14,910 --> 01:06:18,530 is an art to be able to build an assembler that 1149 01:06:18,530 --> 01:06:21,530 uses appropriate heuristics to trace the graph 1150 01:06:21,530 --> 01:06:24,850 to come up with a genome sequence. 1151 01:06:24,850 --> 01:06:27,560 And I think another lesson is that repeats 1152 01:06:27,560 --> 01:06:29,740 are very problematic. 1153 01:06:29,740 --> 01:06:34,300 With short reads, we really cannot resolve repeats exactly.
1154 01:06:34,300 --> 01:06:39,020 As a consequence, when we think about any reference genome 1155 01:06:39,020 --> 01:06:42,840 that we're dealing with, if we consider 1156 01:06:42,840 --> 01:06:45,844 the size of the reads that were used to assemble that genome, 1157 01:06:45,844 --> 01:06:47,260 then we need to be mindful of what 1158 01:06:47,260 --> 01:06:49,070 that tells us about whether or not 1159 01:06:49,070 --> 01:06:51,540 the repeat structure that we're observing in the genome 1160 01:06:51,540 --> 01:06:53,970 is really an accurate rendition of what's 1161 01:06:53,970 --> 01:06:57,170 going on in the genome itself. 1162 01:06:57,170 --> 01:06:59,840 And finally, I think that we've talked today 1163 01:06:59,840 --> 01:07:03,470 about the problem of assembling genomes 1164 01:07:03,470 --> 01:07:07,070 from a set of reads that represent 1165 01:07:07,070 --> 01:07:13,150 a uniform, single individual albeit with possibilities 1166 01:07:13,150 --> 01:07:15,250 of differences of alleles between mom 1167 01:07:15,250 --> 01:07:18,810 and dad in a diploid organism. 1168 01:07:18,810 --> 01:07:21,200 However, environmental sequencing 1169 01:07:21,200 --> 01:07:25,410 where one takes up sea water or other samples 1170 01:07:25,410 --> 01:07:27,330 and sequences all the organisms in it 1171 01:07:27,330 --> 01:07:31,650 and then attempts to assemble those organisms de novo 1172 01:07:31,650 --> 01:07:33,260 admits the possibility that there 1173 01:07:33,260 --> 01:07:37,290 are many different genomes that you're considering. 1174 01:07:37,290 --> 01:07:39,340 And that, of course, creates a whole new set 1175 01:07:39,340 --> 01:07:40,920 of research problems, which I think 1176 01:07:40,920 --> 01:07:45,130 are unsolved in part because of the read lengths 1177 01:07:45,130 --> 01:07:47,800 that we're currently dealing with. 1178 01:07:47,800 --> 01:07:50,460 Are there any final questions about assembly? 1179 01:07:54,360 --> 01:07:54,940 OK, great.
1180 01:07:54,940 --> 01:07:57,310 Well, we will see you then on Thursday 1181 01:07:57,310 --> 01:08:01,040 where we will talk about ChIP-seq and IDR analysis. 1182 01:08:01,040 --> 01:08:02,550 Until then, have a great Wednesday. 1183 01:08:02,550 --> 01:08:04,470 Thank you very much.