1 00:00:00,060 --> 00:00:01,780 The following content is provided 2 00:00:01,780 --> 00:00:04,019 under a Creative Commons license. 3 00:00:04,019 --> 00:00:06,870 Your support will help MIT OpenCourseWare continue 4 00:00:06,870 --> 00:00:10,730 to offer high quality educational resources for free. 5 00:00:10,730 --> 00:00:13,330 To make a donation or view additional materials 6 00:00:13,330 --> 00:00:17,217 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:17,217 --> 00:00:17,842 at ocw.mit.edu. 8 00:00:25,625 --> 00:00:26,500 PROFESSOR: All right. 9 00:00:26,500 --> 00:00:29,460 So today, we're going to briefly review 10 00:00:29,460 --> 00:00:32,750 classical sequencing and next-gen or second-gen 11 00:00:32,750 --> 00:00:36,520 sequencing, which sort of provides a lot of the data 12 00:00:36,520 --> 00:00:42,660 that the analytical methods we'll be talking about work on. 13 00:00:42,660 --> 00:00:46,600 And we'll then introduce local alignment 14 00:00:46,600 --> 00:00:51,350 a la BLAST and some of the statistics associated 15 00:00:51,350 --> 00:00:52,830 with that. 16 00:00:52,830 --> 00:00:56,680 So just a few brief items on topic one. 17 00:00:56,680 --> 00:00:57,230 All right. 18 00:00:57,230 --> 00:01:00,830 So today, we're going to talk about sequencing first. 19 00:01:00,830 --> 00:01:03,350 Conventional-- or Sanger sequencing-- then next-gen 20 00:01:03,350 --> 00:01:07,020 or second-gen sequencing briefly. 21 00:01:07,020 --> 00:01:10,280 And then talk about local alignments. 22 00:01:10,280 --> 00:01:15,540 So background for the sequencing part, the Metzger review 23 00:01:15,540 --> 00:01:17,410 covers everything you'll need. 24 00:01:17,410 --> 00:01:21,180 And for the alignment-- we'll talk about local alignment 25 00:01:21,180 --> 00:01:23,110 today, global alignment on Tuesday-- 26 00:01:23,110 --> 00:01:27,310 then chapters four and five of the text cover it pretty well. 27 00:01:27,310 --> 00:01:28,590 So here's the text. 28 00:01:28,590 --> 00:01:31,090 If you haven't decided whether to get it or not, 29 00:01:31,090 --> 00:01:32,050 I'll have it up here. 30 00:01:32,050 --> 00:01:35,650 You can come flip through it after class. 31 00:01:40,500 --> 00:01:44,450 Sequencing is mostly done at the level of DNA. 32 00:01:44,450 --> 00:01:50,202 Whether the original material was RNA or not, 33 00:01:50,202 --> 00:01:52,410 usually convert to DNA and sequence at the DNA level. 34 00:01:52,410 --> 00:01:57,142 So we'll often think about DNA as sort of a string. 35 00:01:57,142 --> 00:01:59,100 But it's important to remember that it actually 36 00:01:59,100 --> 00:02:04,230 has a three dimensional structure as shown here. 37 00:02:04,230 --> 00:02:08,210 And often, it's helpful to think of it in sort of a two 38 00:02:08,210 --> 00:02:09,710 dimensional representation where you 39 00:02:09,710 --> 00:02:14,170 think about the bases and their hydrogen bonding 40 00:02:14,170 --> 00:02:18,320 and so forth as shown in the middle. 41 00:02:18,320 --> 00:02:20,530 My mouse is not working today for some reason, 42 00:02:20,530 --> 00:02:24,410 but hopefully, we won't need it. 43 00:02:24,410 --> 00:02:29,400 So the chemistry of sequencing is very closely related 44 00:02:29,400 --> 00:02:33,970 to the chemistry of the individual bases. 45 00:02:33,970 --> 00:02:36,900 And there are really three main types 46 00:02:36,900 --> 00:02:41,390 that are going to be relevant here. 47 00:02:41,390 --> 00:02:44,170 Ribonucleotides, deoxyribonucleotides, 48 00:02:44,170 --> 00:02:48,370 then for Sanger sequencing, dideoxyribonucleotides. 49 00:02:48,370 --> 00:02:52,200 So who can tell me which of these structures 50 00:02:52,200 --> 00:02:54,860 corresponds to which of those names? 51 00:02:54,860 --> 00:02:59,160 And also, please let me know your name 52 00:02:59,160 --> 00:03:01,860 and I'll attempt to remember some 53 00:03:01,860 --> 00:03:04,580 of your names toward the end of the semester probably. 54 00:03:04,580 --> 00:03:06,770 So, which are which? 55 00:03:10,470 --> 00:03:11,386 Yes, what's your name? 56 00:03:11,386 --> 00:03:12,814 AUDIENCE: I'm Simona. 57 00:03:12,814 --> 00:03:15,947 So the ribonucleotide is the top right. 58 00:03:15,947 --> 00:03:18,566 The deoxy is the one below it. 59 00:03:18,566 --> 00:03:20,190 And the dideoxy is the one to the left. 60 00:03:20,190 --> 00:03:21,820 PROFESSOR: OK, so that is correct. 61 00:03:21,820 --> 00:03:25,660 So one way to keep these things in mind 62 00:03:25,660 --> 00:03:28,230 is the numbering of the bases. 63 00:03:28,230 --> 00:03:33,950 So the carbons in the ribo sugar are numbered one, 64 00:03:33,950 --> 00:03:37,260 so carbon 1 is the one where the base is attached. 65 00:03:37,260 --> 00:03:42,560 Two is here, which has an OH in RNA and just an H in DNA. 66 00:03:42,560 --> 00:03:45,720 And then three is very important. 67 00:03:45,720 --> 00:03:47,650 Four, and then five. 68 00:03:47,650 --> 00:03:49,700 So five connects to the phosphates, 69 00:03:49,700 --> 00:03:52,941 which then will connect the base to the sugar phosphate 70 00:03:52,941 --> 00:03:53,440 backbone. 71 00:03:53,440 --> 00:03:58,720 And three is where you extend. 72 00:03:58,720 --> 00:04:01,480 That's where you're going to add the next base in a growing 73 00:04:01,480 --> 00:04:01,980 chain. 74 00:04:01,980 --> 00:04:08,810 And so what will happen if you give DNA polymerase a template 75 00:04:08,810 --> 00:04:11,185 and some dideoxy nucleotides? 76 00:04:14,955 --> 00:04:17,329 It won't be able to extend because there's no 3-prime OH. 77 00:04:17,329 --> 00:04:19,030 And all the chemistry requires the OH. 78 00:04:19,030 --> 00:04:22,770 And so that's the basis of classical or Sanger 79 00:04:22,770 --> 00:04:27,770 sequencing, which Fred Sanger got the Nobel 80 00:04:27,770 --> 00:04:31,400 Prize for in the 1980s-- I think it was developed in the '70s-- 81 00:04:31,400 --> 00:04:33,940 and it's really the basis of most 82 00:04:33,940 --> 00:04:36,190 of the sequencing, or pretty much all the DNA 83 00:04:36,190 --> 00:04:38,860 sequencing up until the early 2000s 84 00:04:38,860 --> 00:04:42,450 before some newer technologies came about. 85 00:04:42,450 --> 00:04:45,110 And it takes advantage of this special property 86 00:04:45,110 --> 00:04:51,860 of dideoxy nucleotides that they terminate the growing chain. 87 00:04:51,860 --> 00:04:53,980 So imagine we have a template DNA. 88 00:04:53,980 --> 00:04:56,540 So this is the molecule whose sequence 89 00:04:56,540 --> 00:04:59,290 we want to determine shown there in black. 90 00:04:59,290 --> 00:05:00,750 We then have a primer. 91 00:05:00,750 --> 00:05:03,470 And notice the primer's written in 5-prime to 3-prime 92 00:05:03,470 --> 00:05:04,240 direction. 93 00:05:04,240 --> 00:05:09,360 The ends would be primer sequences 94 00:05:09,360 --> 00:05:11,829 and then primer complimentary sequences in the template. 95 00:05:11,829 --> 00:05:13,870 So you typically will have your template cloned-- 96 00:05:13,870 --> 00:05:15,661 this is in conventional sequencing-- cloned 97 00:05:15,661 --> 00:05:19,470 into some vector like a phage vector 98 00:05:19,470 --> 00:05:22,080 for sequencing so you know the flanking sequences. 99 00:05:22,080 --> 00:05:25,450 And then you do four sequencing reactions 100 00:05:25,450 --> 00:05:28,070 in conventional Sanger sequencing. 101 00:05:28,070 --> 00:05:31,540 And I know some of you have probably had this before. 102 00:05:31,540 --> 00:05:34,620 So let's take the first chemical reaction. 103 00:05:34,620 --> 00:05:38,340 The one here with a DDGTP. 104 00:05:38,340 --> 00:05:41,527 So what would you put in that reaction? 105 00:05:41,527 --> 00:05:43,360 What are all the components of that reaction 106 00:05:43,360 --> 00:05:45,590 if you wanted to do conventional sequencing 107 00:05:45,590 --> 00:05:49,316 on, say, an acrylonitrile? 108 00:05:49,316 --> 00:05:51,760 Anyone? 109 00:05:51,760 --> 00:05:55,090 What do you need and what does it accomplish? 110 00:05:55,090 --> 00:05:56,416 Yeah, what's your name? 111 00:05:56,416 --> 00:05:57,412 AUDIENCE: I'm Tim. 112 00:05:57,412 --> 00:05:57,660 PROFESSOR: Tim? 113 00:05:57,660 --> 00:05:58,306 Oh yeah, I know you, Tim. 114 00:05:58,306 --> 00:05:58,906 OK, go ahead. 115 00:05:58,906 --> 00:06:02,890 AUDIENCE: So you need the four nucleotides-- 116 00:06:02,890 --> 00:06:04,384 the deoxynucleotides. 117 00:06:04,384 --> 00:06:08,368 You will need the dideoxy P nucleotides. 118 00:06:08,368 --> 00:06:11,356 In addition, you need all the other [INAUDIBLE]. 119 00:06:11,356 --> 00:06:12,352 You need polymerase. 120 00:06:12,352 --> 00:06:17,830 Generally, you need a buffer of some sort, [INAUDIBLE], 121 00:06:17,830 --> 00:06:20,654 to [INAUDIBLE]. 122 00:06:20,654 --> 00:06:22,070 PROFESSOR: Yeah, primary template. 123 00:06:22,070 --> 00:06:22,570 Yeah. 124 00:06:22,570 --> 00:06:24,470 Great. 125 00:06:24,470 --> 00:06:24,970 That's good. 126 00:06:24,970 --> 00:06:28,090 It sounds like Tim could actually do this experiment. 127 00:06:33,750 --> 00:06:35,747 And what ratio would you put in? 128 00:06:35,747 --> 00:06:37,330 So you said you're going to put in all 129 00:06:37,330 --> 00:06:40,660 four conventional deoxynucleotides and then one 130 00:06:40,660 --> 00:06:41,770 dideoxynucleotide. 131 00:06:41,770 --> 00:06:46,120 So let's say dideoxy G just for simplicity here. 132 00:06:46,120 --> 00:06:49,740 So in what ratio would you put the dideoxynucleotide 133 00:06:49,740 --> 00:06:51,539 compared to the conventional nucleotides? 134 00:06:51,539 --> 00:06:53,080 AUDIENCE: To lower the concentration. 135 00:06:53,080 --> 00:06:53,788 PROFESSOR: Lower? 136 00:06:53,788 --> 00:06:55,764 Like how much lower? 137 00:06:55,764 --> 00:06:56,930 AUDIENCE: Like, a lot lower. 138 00:06:56,930 --> 00:06:58,595 PROFESSOR: Like maybe 1%? 139 00:06:58,595 --> 00:06:59,220 AUDIENCE: Yeah. 140 00:06:59,220 --> 00:07:00,511 PROFESSOR: Something like that. 141 00:07:00,511 --> 00:07:01,900 You want to put it a lot lower. 142 00:07:01,900 --> 00:07:05,858 And why is that so important? 143 00:07:05,858 --> 00:07:11,325 AUDIENCE: Because you want the thing to be able to progress. 144 00:07:11,325 --> 00:07:15,052 Because you need enough of the ribonucleotide concentration so 145 00:07:15,052 --> 00:07:19,277 that [INAUDIBLE] every [INAUDIBLE] 146 00:07:19,277 --> 00:07:23,982 equivalent or excess and you're going to terminate [INAUDIBLE]. 147 00:07:23,982 --> 00:07:24,690 PROFESSOR: Right. 148 00:07:24,690 --> 00:07:29,440 So if you put equamolar deoxy G and dideoxy G, 149 00:07:29,440 --> 00:07:32,050 then it's going to be a 50% chance of terminating 150 00:07:32,050 --> 00:07:33,810 every time you hit a C in the template. 151 00:07:33,810 --> 00:07:35,480 So you're going to have half as much 152 00:07:35,480 --> 00:07:38,010 of the material at the second G, and a quarter as much as 153 00:07:38,010 --> 00:07:41,120 the third, and you're going to have vanishingly small amounts. 154 00:07:41,120 --> 00:07:42,911 So you're only going to be able to sequence 155 00:07:42,911 --> 00:07:45,790 the first few C's in the template. 156 00:07:45,790 --> 00:07:46,290 Exactly. 157 00:07:46,290 --> 00:07:47,456 So that's a very good point. 158 00:07:50,170 --> 00:07:53,520 So now let's imagine you do these four separate reactions. 159 00:07:53,520 --> 00:07:55,950 You typically would have radiolabeled primer 160 00:07:55,950 --> 00:07:57,520 so you can see your DNA. 161 00:07:57,520 --> 00:08:00,430 And then you would run it on some sort of gel. 162 00:08:00,430 --> 00:08:03,670 This is obviously not a real gel, but an idealized version. 163 00:08:03,670 --> 00:08:10,260 And then in the lane where you put dideoxy G, 164 00:08:10,260 --> 00:08:12,160 you would see the smallest products. 165 00:08:12,160 --> 00:08:14,900 So you read these guys from the bottom up. 166 00:08:14,900 --> 00:08:18,580 And in this lane there is a very small product 167 00:08:18,580 --> 00:08:21,896 that's just one base longer than the primer here. 168 00:08:21,896 --> 00:08:24,520 And that's because there was a C there and it terminated there. 169 00:08:24,520 --> 00:08:27,820 And then the next C appears several bases later. 170 00:08:27,820 --> 00:08:30,010 So you have sort of a gap here. 171 00:08:30,010 --> 00:08:33,330 And so you can see that the first base in the template 172 00:08:33,330 --> 00:08:35,929 would be a complement of T, or C. 173 00:08:35,929 --> 00:08:44,400 And the second base would be, you can see, 174 00:08:44,400 --> 00:08:46,749 the next smallest product in this dideoxy T lane, 175 00:08:46,749 --> 00:08:48,290 therefore it would be A. And you just 176 00:08:48,290 --> 00:08:50,970 sort of snake your way up through the gel 177 00:08:50,970 --> 00:08:52,540 and read out the sequence. 178 00:08:52,540 --> 00:08:54,760 And this works well. 179 00:08:54,760 --> 00:08:58,200 So what does it actually look like in practice? 180 00:08:58,200 --> 00:09:00,440 Here are some actual sequencing gels. 181 00:09:00,440 --> 00:09:03,390 So you run four lanes. 182 00:09:03,390 --> 00:09:10,540 And on big polyacrylamide gels like this. 183 00:09:10,540 --> 00:09:12,237 Torbin, you ever run one of these? 184 00:09:12,237 --> 00:09:12,820 AUDIENCE: Yes. 185 00:09:12,820 --> 00:09:14,080 PROFESSOR: Yes? 186 00:09:14,080 --> 00:09:17,140 They're a big pain to cast. 187 00:09:17,140 --> 00:09:19,090 Run for several hours, I think. 188 00:09:19,090 --> 00:09:22,360 And you get these banding patterns. 189 00:09:22,360 --> 00:09:27,080 And what limits the sequence read length? 190 00:09:27,080 --> 00:09:31,310 So we normally call the sequence generated 191 00:09:31,310 --> 00:09:34,560 from one run of a sequencer as a read. 192 00:09:34,560 --> 00:09:38,640 So that one attempt to sequence the template is called a read. 193 00:09:38,640 --> 00:09:40,530 And you can see it's relatively easy 194 00:09:40,530 --> 00:09:42,710 to read the sequence toward the bottom, 195 00:09:42,710 --> 00:09:46,060 and then it gets harder as you go up. 196 00:09:46,060 --> 00:09:49,510 And so that's really what fundamentally limits the read 197 00:09:49,510 --> 00:09:53,170 length, is that the bands get closer and closer together. 198 00:09:53,170 --> 00:09:58,210 So they'll run inversely proportional to size 199 00:09:58,210 --> 00:10:00,360 with the small ones running faster. 200 00:10:00,360 --> 00:10:05,170 But then the difference between a 20 base product and a 21 201 00:10:05,170 --> 00:10:06,050 might be significant. 202 00:10:06,050 --> 00:10:09,670 But the difference between a 500 base product and a 501 base 203 00:10:09,670 --> 00:10:11,450 product is going to be very small. 204 00:10:11,450 --> 00:10:16,430 And so you basically can't order the lanes anymore. 205 00:10:16,430 --> 00:10:19,330 And therefore, that's sort of what fundamentally limits it. 206 00:10:19,330 --> 00:10:20,780 All right. 207 00:10:20,780 --> 00:10:24,410 So here we had to run four lanes of a gel. 208 00:10:24,410 --> 00:10:26,190 Can anyone think of a more efficient way 209 00:10:26,190 --> 00:10:29,107 of doing Sanger sequencing? 210 00:10:29,107 --> 00:10:30,690 Is there any way to do it in one lane? 211 00:10:34,468 --> 00:10:35,426 Yeah, what's your name? 212 00:10:35,426 --> 00:10:36,338 AUDIENCE: Adrian. 213 00:10:36,338 --> 00:10:39,812 You can use four different types of the entities. 214 00:10:39,812 --> 00:10:41,590 Maybe like four different colors. 215 00:10:41,590 --> 00:10:42,310 AUDIENCE: Four different colors. 216 00:10:42,310 --> 00:10:44,670 OK, so instead of using radio labeling on the primary, 217 00:10:44,670 --> 00:10:50,400 you use fluorophore on your dideoxy entities, for example. 218 00:10:50,400 --> 00:10:53,980 And then you can run them. 219 00:10:53,980 --> 00:10:57,170 Depending where that strand terminated, 220 00:10:57,170 --> 00:10:58,680 it'll be a different color. 221 00:10:58,680 --> 00:11:00,221 And you can run them all in one lane. 222 00:11:00,221 --> 00:11:02,150 OK, so that looks like that. 223 00:11:02,150 --> 00:11:04,990 And so this was an important development 224 00:11:04,990 --> 00:11:08,610 called terminator sequencing in the '90s. 225 00:11:08,610 --> 00:11:14,940 That was the basis of the ABI 3700 machine, which was really 226 00:11:14,940 --> 00:11:18,200 the workhorse of genome sequencing 227 00:11:18,200 --> 00:11:19,910 in the late '90s and early 2000s. 228 00:11:19,910 --> 00:11:22,430 Really what enabled the human genome to be sequenced. 229 00:11:22,430 --> 00:11:25,960 And so one of the other innovations in this technology 230 00:11:25,960 --> 00:11:31,034 was that instead of having a big gel, they shrunk the gel. 231 00:11:31,034 --> 00:11:32,950 And then they just had a reader at the bottom. 232 00:11:32,950 --> 00:11:38,475 So the gel was shrunk to as thin as these little capillaries. 233 00:11:38,475 --> 00:11:40,100 I don't know if you can see these guys. 234 00:11:40,100 --> 00:11:43,500 But basically it's like a little thread here. 235 00:11:43,500 --> 00:11:46,040 And so each one of these is effectively-- oops! 236 00:11:46,040 --> 00:11:47,320 Oh no. 237 00:11:47,320 --> 00:11:50,140 No worries, this is not valuable. 238 00:11:50,140 --> 00:11:54,450 Ancient technology that I got for free from somebody. 239 00:11:54,450 --> 00:12:00,702 So the DNA would be loaded at the top. 240 00:12:00,702 --> 00:12:02,160 There would be a little gel in each 241 00:12:02,160 --> 00:12:04,520 of these-- it's called capillary sequencing. 242 00:12:04,520 --> 00:12:06,060 And then it would run out the bottom 243 00:12:06,060 --> 00:12:07,570 and there would be a detector which 244 00:12:07,570 --> 00:12:09,240 would detect the four different flours 245 00:12:09,240 --> 00:12:10,740 and read out the sequence. 246 00:12:10,740 --> 00:12:17,010 So this basically condensed the volume needed for sequencing. 247 00:12:19,840 --> 00:12:23,630 Any questions about conventional sequencing? 248 00:12:23,630 --> 00:12:24,130 Yes? 249 00:12:24,130 --> 00:12:26,378 AUDIENCE: Where are the [INAUDIBLE] 250 00:12:26,378 --> 00:12:28,738 where you'd put the fluorescent flags? 251 00:12:28,738 --> 00:12:31,294 Like the topic from the [INAUDIBLE]? 252 00:12:31,294 --> 00:12:32,960 PROFESSOR: Yeah, that's a good question. 253 00:12:32,960 --> 00:12:35,390 I don't actually remember. 254 00:12:35,390 --> 00:12:39,480 I think there are different options available. 255 00:12:39,480 --> 00:12:41,460 And sometimes with some of these reactions, 256 00:12:41,460 --> 00:12:45,940 you need to use modified polymerases that 257 00:12:45,940 --> 00:12:48,250 can tolerate these modified nucleotides. 258 00:12:48,250 --> 00:12:49,923 Yeah, so I don't remember that. 259 00:12:49,923 --> 00:12:51,120 It's a good question. 260 00:12:51,120 --> 00:12:52,690 I can look that up. 261 00:12:52,690 --> 00:12:56,650 So how long can a conventional sequencer go? 262 00:12:56,650 --> 00:12:57,920 What's the read length? 263 00:12:57,920 --> 00:13:01,030 Anyone know? 264 00:13:01,030 --> 00:13:04,060 It's about, say, 600 or so. 265 00:13:04,060 --> 00:13:07,220 And so that's reasonably long. 266 00:13:07,220 --> 00:13:11,360 How long is a typical mammalian mRNA? 267 00:13:11,360 --> 00:13:12,730 Maybe two, three kb? 268 00:13:12,730 --> 00:13:16,090 So you have in a typical exon, maybe 150 bases or so. 269 00:13:16,090 --> 00:13:17,920 So you have a chunk. 270 00:13:17,920 --> 00:13:19,680 You don't generally get full length cDNA. 271 00:13:19,680 --> 00:13:22,540 But you get a chunk of a cDNA that's say, 272 00:13:22,540 --> 00:13:25,200 three, four exons in length. 273 00:13:25,200 --> 00:13:28,340 And that is actually generally sufficient 274 00:13:28,340 --> 00:13:32,660 to uniquely identify the gene locus that that read came from. 275 00:13:32,660 --> 00:13:35,130 And so that was the basis of EST sequencing-- 276 00:13:35,130 --> 00:13:38,060 so-called Expressed Sequence Tag sequencing. 277 00:13:38,060 --> 00:13:43,510 And millions of these 600 base chunks of cDNA were generated 278 00:13:43,510 --> 00:13:48,891 and they have been quite useful over the years. 279 00:13:48,891 --> 00:13:49,390 All right. 280 00:13:49,390 --> 00:13:51,210 So what is next-gen sequencing? 281 00:13:51,210 --> 00:13:58,980 So in next-gen sequencing, you only read one base at a time. 282 00:13:58,980 --> 00:14:04,370 So it's often a little bit slower. 283 00:14:04,370 --> 00:14:06,260 But it's really massively parallel. 284 00:14:06,260 --> 00:14:08,430 And that's the big advantage. 285 00:14:08,430 --> 00:14:12,000 And it's orders of magnitude cheaper 286 00:14:12,000 --> 00:14:14,139 per base than conventional sequencing. 287 00:14:14,139 --> 00:14:15,555 Like when it first came out it, it 288 00:14:15,555 --> 00:14:17,360 was maybe two orders of magnitude cheaper. 289 00:14:17,360 --> 00:14:21,480 And now it's probably another four orders of magnitude. 290 00:14:21,480 --> 00:14:24,910 So it really blows away conventional sequencing 291 00:14:24,910 --> 00:14:30,600 if the output that you care about 292 00:14:30,600 --> 00:14:34,960 is mostly proportional to number of bases sequence. 293 00:14:34,960 --> 00:14:37,910 If the output is proportional to the quality of the assembly 294 00:14:37,910 --> 00:14:39,740 or something, then there are applications 295 00:14:39,740 --> 00:14:42,430 where conventional sequencing still 296 00:14:42,430 --> 00:14:45,547 is very useful Because the next-gen sequencing tends 297 00:14:45,547 --> 00:14:46,130 to be shorter. 298 00:14:46,130 --> 00:14:51,870 But in terms of just volume, it generates much, much more bases 299 00:14:51,870 --> 00:14:53,390 in one reaction. 300 00:14:53,390 --> 00:14:58,700 And so the basic ideas are that you have your template DNA 301 00:14:58,700 --> 00:14:59,670 molecules. 302 00:14:59,670 --> 00:15:04,960 Now typically, tens of thousands for technologies like PacBio 303 00:15:04,960 --> 00:15:10,560 or hundreds of millions for technologies like Illumina that 304 00:15:10,560 --> 00:15:14,890 are immobilized on some sort of surface-- 305 00:15:14,890 --> 00:15:17,960 typically a flow cell-- and there 306 00:15:17,960 --> 00:15:19,690 are either single molecule methods 307 00:15:19,690 --> 00:15:21,731 where you have a single molecule of your template 308 00:15:21,731 --> 00:15:24,950 or there are methods that locally amplify your template 309 00:15:24,950 --> 00:15:28,290 and produce, say, hundreds of identical copies 310 00:15:28,290 --> 00:15:29,440 in little clusters. 311 00:15:29,440 --> 00:15:32,400 And then you use modified nucleotides, 312 00:15:32,400 --> 00:15:34,440 often with fluorophores attached, 313 00:15:34,440 --> 00:15:41,990 to interrogate the next base at each of your template molecules 314 00:15:41,990 --> 00:15:44,800 for hundreds and hundreds of millions of them. 315 00:15:44,800 --> 00:15:46,800 And so there are several different technologies. 316 00:15:46,800 --> 00:15:48,133 We won't talk about all of them. 317 00:15:48,133 --> 00:15:49,640 We'll just talk about two or three 318 00:15:49,640 --> 00:15:52,430 that are interesting and widely used. 319 00:15:52,430 --> 00:15:55,170 And they differ depending on the DNA template, what 320 00:15:55,170 --> 00:15:57,100 types of modified nucleotides are used, 321 00:15:57,100 --> 00:16:00,210 and to some extent, in the imaging and the image analysis, 322 00:16:00,210 --> 00:16:03,460 which differs for single molecule methods, for example, 323 00:16:03,460 --> 00:16:08,770 compared to the ones that sequence a cluster. 324 00:16:08,770 --> 00:16:12,150 So there's a table in the Metzger review. 325 00:16:12,150 --> 00:16:15,970 And so I've just told you that next-gen sequencing 326 00:16:15,970 --> 00:16:16,610 is so cheap. 327 00:16:16,610 --> 00:16:18,980 But then you see how much these machines cost 328 00:16:18,980 --> 00:16:21,790 and you could buy lots of other interesting things 329 00:16:21,790 --> 00:16:24,670 with that kind of money. 330 00:16:24,670 --> 00:16:27,330 And I also want to emphasize that that's not even 331 00:16:27,330 --> 00:16:28,060 the full cost. 332 00:16:28,060 --> 00:16:33,100 So if you were to buy an Illumina GA2-- this would 333 00:16:33,100 --> 00:16:35,010 be like a couple years ago when the GA2 was 334 00:16:35,010 --> 00:16:38,580 the state of the art-- for half a million dollars, 335 00:16:38,580 --> 00:16:41,070 the reagents to run that thing, if you're 336 00:16:41,070 --> 00:16:43,420 going to run it continuously throughout the year, 337 00:16:43,420 --> 00:16:47,120 the reagents to run it would be over a million. 338 00:16:47,120 --> 00:16:50,470 So this actually underestimates the cost. 339 00:16:50,470 --> 00:16:54,170 However, the cost per base is super, super low. 340 00:16:54,170 --> 00:16:58,830 Because they generate so much data at once. 341 00:16:58,830 --> 00:17:01,410 All right, So we'll talk about a couple of these. 342 00:17:01,410 --> 00:17:04,010 The first next-gen sequencing technology 343 00:17:04,010 --> 00:17:08,050 to be published and still used today 344 00:17:08,050 --> 00:17:13,869 was from 454-- now Roche-- and it 345 00:17:13,869 --> 00:17:16,979 was based on what's called emulsion PCR. 346 00:17:16,979 --> 00:17:19,020 So they have these little beads, the little beads 347 00:17:19,020 --> 00:17:23,589 have adapter DNA molecules covalently attached. 348 00:17:23,589 --> 00:17:29,020 You incubate the beads with DNA, and you actually 349 00:17:29,020 --> 00:17:29,760 make an emulsion. 350 00:17:29,760 --> 00:17:31,740 So it's an oil water emulsion. 351 00:17:31,740 --> 00:17:34,780 So each bead, which is hydrophilic, 352 00:17:34,780 --> 00:17:38,040 is in the little bubble of water inside oil. 353 00:17:38,040 --> 00:17:41,030 And the reason for that is so that you do it 354 00:17:41,030 --> 00:17:42,520 at a template concentration that's 355 00:17:42,520 --> 00:17:44,950 low enough that only a single molecule of template 356 00:17:44,950 --> 00:17:46,630 is associated with each bead. 357 00:17:46,630 --> 00:17:49,570 So the oil then provides a barrier 358 00:17:49,570 --> 00:17:53,701 so that the DNA can't get transferred from one bead 359 00:17:53,701 --> 00:17:54,200 to another. 360 00:17:54,200 --> 00:17:57,010 So each bead will have a unique template molecule. 361 00:17:57,010 --> 00:18:00,300 You do sort of a local PCR-like reaction 362 00:18:00,300 --> 00:18:05,110 to amplify that DNA molecule on the bead, 363 00:18:05,110 --> 00:18:10,732 and then you do sequencing one base at a time using 364 00:18:10,732 --> 00:18:12,190 a luciferase based method that I'll 365 00:18:12,190 --> 00:18:14,400 show you on the next slide. 366 00:18:14,400 --> 00:18:18,410 So Illumina technology differs in that instead of an emulsion, 367 00:18:18,410 --> 00:18:20,730 you're doing it on the surface of a flow cell. 368 00:18:20,730 --> 00:18:24,450 Again, you start with a single molecule of template. 369 00:18:24,450 --> 00:18:26,000 Your flow cell has these two types 370 00:18:26,000 --> 00:18:28,610 of adapters covalently attached. 371 00:18:28,610 --> 00:18:33,080 The template anneals to one of these adapters. 372 00:18:33,080 --> 00:18:38,180 You extend the adapter molecule with dNTPs and polymerase. 373 00:18:38,180 --> 00:18:42,620 Now you have the complement of your template, your denature. 374 00:18:42,620 --> 00:18:44,340 Now you have the inverse complement 375 00:18:44,340 --> 00:18:46,580 of your template molecule covalently 376 00:18:46,580 --> 00:18:48,054 attached to the cell surface. 377 00:18:48,054 --> 00:18:50,220 And then at the other end there's the other adapter. 378 00:18:50,220 --> 00:18:51,636 And so what you could do is what's 379 00:18:51,636 --> 00:18:56,140 called bridge amplification where that now complement 380 00:18:56,140 --> 00:19:00,060 of the template molecule will bridge over 381 00:19:00,060 --> 00:19:02,300 hybridized to the other adapter, and then you 382 00:19:02,300 --> 00:19:03,870 can extend that adapter. 383 00:19:03,870 --> 00:19:06,336 And now you've regenerated your original template. 384 00:19:06,336 --> 00:19:08,210 And so now you have the complementary strand, 385 00:19:08,210 --> 00:19:10,280 and the original strand, your denature. 386 00:19:10,280 --> 00:19:14,140 And then each of those molecules can undergo subsequent rounds 387 00:19:14,140 --> 00:19:18,390 of bridge amplification to make clusters of typically 388 00:19:18,390 --> 00:19:22,197 several hundred thousand molecules. 389 00:19:22,197 --> 00:19:22,780 Is that clear? 390 00:19:22,780 --> 00:19:23,280 Question. 391 00:19:23,280 --> 00:19:24,486 Yeah, what's your name? 392 00:19:24,486 --> 00:19:26,458 AUDIENCE: Stephanie. 393 00:19:26,458 --> 00:19:30,737 How do they get the adapters onto the template molecules? 394 00:19:30,737 --> 00:19:32,320 PROFESSOR: How do you get the adapters 395 00:19:32,320 --> 00:19:34,520 onto the template molecules? 396 00:19:34,520 --> 00:19:38,280 So that's typically by DNA ligation. 397 00:19:38,280 --> 00:19:43,130 So we may cover that in later steps. 398 00:19:43,130 --> 00:19:43,856 It depends. 399 00:19:43,856 --> 00:19:45,230 There's a few different protocol. 400 00:19:45,230 --> 00:19:47,480 So for example, if you're sequencing microRNAs, 401 00:19:47,480 --> 00:19:50,370 you typically would isolate the small RNAs 402 00:19:50,370 --> 00:19:52,850 and use RNA litigation to get the adapters on. 403 00:19:52,850 --> 00:19:55,770 And then you would do an RT step to get DNA. 404 00:19:55,770 --> 00:19:59,400 With most other applications like RNA-seq or genome 405 00:19:59,400 --> 00:20:02,950 sequencing-- so with RNA-seq, you're starting from mRNA, 406 00:20:02,950 --> 00:20:06,850 you typically will isolate total RNA, do poly(A) selection, 407 00:20:06,850 --> 00:20:08,740 you fragment your RNA to reduce the effects 408 00:20:08,740 --> 00:20:11,450 of secondary structure, you random prime with, 409 00:20:11,450 --> 00:20:13,690 like, random hexamers RT enzyme. 410 00:20:13,690 --> 00:20:18,550 So that'll make little bits of cDNA 200 bases long. 411 00:20:18,550 --> 00:20:20,200 You use second strand synthesis. 412 00:20:20,200 --> 00:20:23,730 Now you have double stranded cDNA fragments. 413 00:20:23,730 --> 00:20:28,310 And then you do, like, blunt end ligation to add the adapters. 414 00:20:28,310 --> 00:20:31,086 And then you denature so you have single strand. 415 00:20:31,086 --> 00:20:32,946 AUDIENCE: I guess my question is how do you 416 00:20:32,946 --> 00:20:35,550 make sure that the two ends sandwiching the DNA 417 00:20:35,550 --> 00:20:38,540 are different as opposed to-- 418 00:20:38,540 --> 00:20:40,570 PROFESSOR: That the two ends are different. 419 00:20:40,570 --> 00:20:43,670 Yeah, that's a good question. 420 00:20:47,730 --> 00:20:51,260 I'll post some stuff about-- It's a good question. 421 00:20:51,260 --> 00:20:53,210 I don't want to sweep it under the rug. 422 00:20:53,210 --> 00:20:55,600 But I kind of want to move on. 423 00:20:55,600 --> 00:20:59,500 And I'll post a little bit about that. 424 00:20:59,500 --> 00:21:03,880 All right so we did 454 Illumina. 425 00:21:03,880 --> 00:21:06,050 Helicos is sort of like Illumina sequencing 426 00:21:06,050 --> 00:21:08,360 except single molecule. 427 00:21:08,360 --> 00:21:11,670 So you have your template covalently 428 00:21:11,670 --> 00:21:13,560 attached to your substrate. 429 00:21:13,560 --> 00:21:18,590 You just anneal primer and just start sequencing it 430 00:21:18,590 --> 00:21:24,680 And there's major pros and cons of single molecule sequencing, 431 00:21:24,680 --> 00:21:26,080 which we can talk about. 432 00:21:26,080 --> 00:21:29,770 And then the PacBio technology is fundamentally 433 00:21:29,770 --> 00:21:33,980 different in that the template is not actually 434 00:21:33,980 --> 00:21:35,890 covalently attached to the surface. 435 00:21:35,890 --> 00:21:38,960 The DNA polymerase is covalently attached to the surface 436 00:21:38,960 --> 00:21:43,260 and the template is sort of threaded into the polymerase. 437 00:21:43,260 --> 00:21:46,350 And this is a phage polymerase that's highly processive 438 00:21:46,350 --> 00:21:47,940 and strand displacing. 439 00:21:47,940 --> 00:21:50,640 And the template is often a circular molecule. 440 00:21:50,640 --> 00:21:54,060 And so you can actually read around 441 00:21:54,060 --> 00:21:56,480 the template multiple times, which turns out 442 00:21:56,480 --> 00:21:59,795 to be really useful in PacBio because the error rate is 443 00:21:59,795 --> 00:22:01,850 quite high for the sequencing. 444 00:22:01,850 --> 00:22:05,530 So in the top, in the 454, you're 445 00:22:05,530 --> 00:22:08,040 measuring luciferase activity-- light. 446 00:22:08,040 --> 00:22:09,924 In Illumina, you're measuring fluorescence. 447 00:22:09,924 --> 00:22:11,590 Four different fluorescent tags, sort of 448 00:22:11,590 --> 00:22:15,030 like the four different tags we saw in Sanger sequencing. 449 00:22:15,030 --> 00:22:18,520 Helicose, it's single tag one base at a time. 450 00:22:18,520 --> 00:22:21,350 And in PacBio, you actually have a fluorescently 451 00:22:21,350 --> 00:22:26,620 labeled dNTP that has the label on-- it's 452 00:22:26,620 --> 00:22:28,520 actually hexaphosphate-- it's got 453 00:22:28,520 --> 00:22:31,020 the label on the sixth phosphate. 454 00:22:31,020 --> 00:22:33,530 So the dNTP is labeled. 455 00:22:33,530 --> 00:22:36,560 It enters the active site of the DNA polymerase. 456 00:22:36,560 --> 00:22:38,530 And the residence time is much longer 457 00:22:38,530 --> 00:22:41,600 if the base is actually going to get incorporated 458 00:22:41,600 --> 00:22:42,900 into that growing chain. 459 00:22:42,900 --> 00:22:46,960 And so you measure how much time you have a fluorescent signal. 460 00:22:46,960 --> 00:22:48,830 And if it's long, that means that that base 461 00:22:48,830 --> 00:22:51,360 must have incorporated into the DNA. 462 00:22:51,360 --> 00:22:56,450 But then, the extension reaction itself 463 00:22:56,450 --> 00:22:58,390 will cleave off the last five phosphates 464 00:22:58,390 --> 00:23:00,640 and the fluorophore tag. 465 00:23:00,640 --> 00:23:03,960 And so you'll regenerate native DNA. 466 00:23:03,960 --> 00:23:05,390 So that's another difference. 467 00:23:05,390 --> 00:23:07,315 Whereas in Illumina sequencing, as we'll see, 468 00:23:07,315 --> 00:23:09,190 there's this reversible terminator chemistry. 469 00:23:09,190 --> 00:23:12,330 So the DNA is not native that you're synthesizing. 470 00:23:14,990 --> 00:23:17,800 So this is just a little bit more on 454. 471 00:23:17,800 --> 00:23:19,300 Just some pretty pictures. 472 00:23:19,300 --> 00:23:22,030 I think I described that before. 473 00:23:22,030 --> 00:23:30,080 The key chemistry here is that you add one dNTP at a time. 474 00:23:30,080 --> 00:23:33,060 So only a subset of the wells-- perhaps a quarter of them-- 475 00:23:33,060 --> 00:23:36,740 that have that next base, the complementary base 476 00:23:36,740 --> 00:23:40,020 free-- as the next one after the primer-- 477 00:23:40,020 --> 00:23:42,710 will undergo synthesis. 478 00:23:42,710 --> 00:23:45,920 And when they undergo synthesis, you release pyrophosphate. 479 00:23:45,920 --> 00:23:48,580 And they have these enzymes attached 480 00:23:48,580 --> 00:23:51,410 to these little micro beads-- the orange beads-- sulfurylase 481 00:23:51,410 --> 00:23:54,680 and luciferase, that use pyrophosphate to basically 482 00:23:54,680 --> 00:23:55,650 generate light. 483 00:23:55,650 --> 00:23:58,760 And so then you have one of these beads in each well. 484 00:23:58,760 --> 00:24:04,600 You look at which wells lit up when we added dCTP. 485 00:24:04,600 --> 00:24:08,030 And they must have had G as the next base and so forth. 486 00:24:08,030 --> 00:24:10,019 And there's no termination here. 487 00:24:10,019 --> 00:24:11,810 The only termination is because you're only 488 00:24:11,810 --> 00:24:13,096 adding one base at a time. 489 00:24:13,096 --> 00:24:14,970 So if you have a single gene in the template, 490 00:24:14,970 --> 00:24:15,803 you'll add one base. 491 00:24:15,803 --> 00:24:19,170 But if you have two Gs in the template, you'll add two Cs. 492 00:24:19,170 --> 00:24:21,350 And in principle, you'll get twice as much light. 493 00:24:21,350 --> 00:24:24,500 But then you have to sort of do some analysis after the fact 494 00:24:24,500 --> 00:24:26,260 to say, OK how much light do we have? 495 00:24:26,260 --> 00:24:28,322 And was that one G, two G, and so forth. 496 00:24:28,322 --> 00:24:29,780 And the amount of light is supposed 497 00:24:29,780 --> 00:24:33,440 to be linear up to about five or six Gs. 498 00:24:33,440 --> 00:24:38,230 But that's still a more error-prone step. 499 00:24:38,230 --> 00:24:41,420 And the most common type of error in 454 500 00:24:41,420 --> 00:24:45,700 is actually insertions and deletions. 501 00:24:45,700 --> 00:24:48,970 Whereas in Illumina sequencing, it's substitutions. 502 00:24:48,970 --> 00:24:51,280 David actually encouraged me to talk more 503 00:24:51,280 --> 00:24:55,380 about sequencing errors and quality scores. 504 00:24:55,380 --> 00:24:57,510 And I need to do a little bit more background. 505 00:24:57,510 --> 00:25:03,580 But I may add that a little bit later in the semester. 506 00:25:03,580 --> 00:25:05,810 OK, so in Illumina sequencing, you 507 00:25:05,810 --> 00:25:09,180 add all four dNTPs at the same time. 508 00:25:09,180 --> 00:25:10,920 But they're non-native. 509 00:25:10,920 --> 00:25:13,540 They have two major modifications. 510 00:25:13,540 --> 00:25:16,490 So one is that they're three prime blocked. 511 00:25:16,490 --> 00:25:18,525 That means that the OH is not free, 512 00:25:18,525 --> 00:25:20,400 I'll show the chemical structure in a moment. 513 00:25:20,400 --> 00:25:22,025 So you can't extend more than one base. 514 00:25:22,025 --> 00:25:24,710 You incorporate that one base, and the polymerase 515 00:25:24,710 --> 00:25:25,880 can't do anything more. 516 00:25:25,880 --> 00:25:29,670 And they're also tagged with four different fluors. 517 00:25:29,670 --> 00:25:31,910 So you add all four dNTPs at once. 518 00:25:31,910 --> 00:25:34,590 You let the polymerase incorporate them. 519 00:25:34,590 --> 00:25:37,690 And then you image the whole flow cell 520 00:25:37,690 --> 00:25:39,150 using two lasers and two filters. 521 00:25:39,150 --> 00:25:41,330 So basically, to image the four fluors. 522 00:25:41,330 --> 00:25:44,220 So you have to sort of take four different pictures 523 00:25:44,220 --> 00:25:47,200 of each portion of the flow cell and then the camera moves 524 00:25:47,200 --> 00:25:49,360 and you scan the whole cell. 525 00:25:49,360 --> 00:25:55,750 And so then, those clusters that incorporated a C, let's say, 526 00:25:55,750 --> 00:25:58,710 they will show up in the green channel as spots. 527 00:25:58,710 --> 00:26:00,460 And those incorporated in A, and so forth. 528 00:26:00,460 --> 00:26:02,660 So you basically have these clusters, each of them 529 00:26:02,660 --> 00:26:07,340 represents a distinct template, and you 530 00:26:07,340 --> 00:26:08,630 read one base at a time. 531 00:26:08,630 --> 00:26:11,080 So, first you read the first base after the primer. 532 00:26:11,080 --> 00:26:13,990 So it's sequencing downwards into the template. 533 00:26:13,990 --> 00:26:15,599 And you read the first base so you 534 00:26:15,599 --> 00:26:17,640 know what the first base of all your clusters is. 535 00:26:17,640 --> 00:26:20,720 And then you reverse the termination. 536 00:26:20,720 --> 00:26:22,870 You cleave off that chemical group 537 00:26:22,870 --> 00:26:26,150 that was blocking the 3-prime OH so now it can extend again. 538 00:26:26,150 --> 00:26:28,240 And then you add the four dNTPs again, 539 00:26:28,240 --> 00:26:31,339 do another round of extension, and then image again, 540 00:26:31,339 --> 00:26:31,880 and so forth. 541 00:26:31,880 --> 00:26:33,171 And so it takes a little while. 542 00:26:33,171 --> 00:26:35,050 Each round of imaging takes about an hour. 543 00:26:35,050 --> 00:26:39,560 So if you want to do 100 base single and Illumina sequencing, 544 00:26:39,560 --> 00:26:43,110 it'll be running on the machine for about four days or so. 545 00:26:43,110 --> 00:26:45,240 Plus the time you have to build the clusters, which 546 00:26:45,240 --> 00:26:47,650 might be several hours on the day before. 547 00:26:50,330 --> 00:26:51,240 So what is this? 548 00:26:51,240 --> 00:26:55,570 So actually the whole idea of blocking termination-- 549 00:26:55,570 --> 00:26:59,300 basically Sanger's idea-- is carried over here 550 00:26:59,300 --> 00:27:01,140 in Illumina sequencing with a little twist. 551 00:27:01,140 --> 00:27:03,760 And that's that you can reverse the termination. 552 00:27:03,760 --> 00:27:05,830 So if you look down here at the bottom, 553 00:27:05,830 --> 00:27:09,030 these are two different 3-prime terminators. 554 00:27:09,030 --> 00:27:11,330 Remember your base counting. 555 00:27:11,330 --> 00:27:14,130 Base one, two, three. 556 00:27:14,130 --> 00:27:16,930 So this was the 3-prime OH, now it's 557 00:27:16,930 --> 00:27:19,740 got this methyl [INAUDIBLE], or whatever that is. 558 00:27:19,740 --> 00:27:22,770 I'm not much of a chemist, so you can look that one up. 559 00:27:22,770 --> 00:27:24,400 And then here's another version. 560 00:27:24,400 --> 00:27:25,816 And this is sort of chemistry that 561 00:27:25,816 --> 00:27:27,670 can cleave this off when you're done. 562 00:27:27,670 --> 00:27:31,060 And then this whole thing here, hanging off the base, 563 00:27:31,060 --> 00:27:34,356 is the fluor. 564 00:27:34,356 --> 00:27:36,460 And you cleave that off as well. 565 00:27:36,460 --> 00:27:39,235 So you add this big complicated thing, you image it, 566 00:27:39,235 --> 00:27:40,610 and then you cleave off the fluor 567 00:27:40,610 --> 00:27:43,330 and cleave off the 3-prime block. 568 00:27:45,776 --> 00:27:47,400 These are some actual sequencing images 569 00:27:47,400 --> 00:27:49,560 you would image in the four channels. 570 00:27:49,560 --> 00:27:51,240 They're actually black and white. 571 00:27:51,240 --> 00:27:53,080 These are pseudocode. 572 00:27:53,080 --> 00:27:56,330 And then you can merge those and you can see then 573 00:27:56,330 --> 00:27:58,560 all the clusters on the flow cell. 574 00:27:58,560 --> 00:28:02,070 So this is from a GA2 with the recommended cluster 575 00:28:02,070 --> 00:28:04,960 density back in the day, like a few years ago. 576 00:28:04,960 --> 00:28:06,877 And nowadays, the image now since the software 577 00:28:06,877 --> 00:28:08,709 has gotten a lot better, so you can actually 578 00:28:08,709 --> 00:28:10,840 load the clusters more densely and therefore 579 00:28:10,840 --> 00:28:14,787 get more sequence out of the same area. 580 00:28:14,787 --> 00:28:16,370 But imagine just millions and millions 581 00:28:16,370 --> 00:28:18,820 of these little clusters like this. 582 00:28:18,820 --> 00:28:21,570 Notice the clusters are not all the same size. 583 00:28:24,390 --> 00:28:25,890 Basically, you're doing PCR in situ, 584 00:28:25,890 --> 00:28:30,970 and so some molecules are easier to amplify by PCR than others. 585 00:28:30,970 --> 00:28:33,605 And that probably accounts for these variations in size. 586 00:28:37,430 --> 00:28:39,800 So what is the current throughput? 587 00:28:39,800 --> 00:28:43,860 These data are accurate as of about, maybe, last year. 588 00:28:43,860 --> 00:28:47,540 So the HiSeq 2000 instrument is the most high performance, 589 00:28:47,540 --> 00:28:49,010 widely used instrument. 590 00:28:49,010 --> 00:28:51,840 Now there's a 2500, but I think it's roughly similar. 591 00:28:51,840 --> 00:28:52,820 You have one flow cell. 592 00:28:52,820 --> 00:28:55,220 So a flow cell looks sort of like a glass slide, 593 00:28:55,220 --> 00:28:57,770 except that it has these tunnels carved 594 00:28:57,770 --> 00:29:01,420 in it like eight little tubes inside the glass slide. 595 00:29:01,420 --> 00:29:05,600 And on the surfaces of those tubes 596 00:29:05,600 --> 00:29:10,544 is where the adapters are covalently attached. 597 00:29:10,544 --> 00:29:11,960 And so you have eight lanes and so 598 00:29:11,960 --> 00:29:15,120 you can sequence eight different things in those eight lanes. 599 00:29:15,120 --> 00:29:20,300 You could do yeast genome in one and fly RNA-seq 600 00:29:20,300 --> 00:29:23,850 in another, and so forth. 601 00:29:23,850 --> 00:29:27,850 And these days, a single lane will produce something 602 00:29:27,850 --> 00:29:30,440 like 200 million reads. 603 00:29:30,440 --> 00:29:34,250 And this is typically routine to get 200 million reads 604 00:29:34,250 --> 00:29:34,790 from a lane. 605 00:29:34,790 --> 00:29:36,770 Sometimes you can get more. 606 00:29:36,770 --> 00:29:38,200 You can do up to 100 bases. 607 00:29:38,200 --> 00:29:40,420 You can do 150 these days on a MiSeq, 608 00:29:40,420 --> 00:29:41,780 which is a miniature version. 609 00:29:41,780 --> 00:29:46,000 You can do maybe 300 or more. 610 00:29:46,000 --> 00:29:48,290 And so that's a whole lot of sequence. 611 00:29:48,290 --> 00:29:53,510 So that's 160 billion bases of sequence from a single lane. 612 00:29:53,510 --> 00:29:56,890 And that will cost you-- that single lane-- maybe $2,000 613 00:29:56,890 --> 00:29:59,870 to $3,000, depending where you're doing it. 614 00:29:59,870 --> 00:30:03,000 And the cost doesn't include the capital cost, 615 00:30:03,000 --> 00:30:05,920 that's just the reagent cost for running that. 616 00:30:05,920 --> 00:30:11,240 So 160 billion-- the human genome is 3 billion, 617 00:30:11,240 --> 00:30:14,850 so you've now sequenced the human genome over many times 618 00:30:14,850 --> 00:30:16,440 there. 619 00:30:16,440 --> 00:30:17,520 You can do more. 620 00:30:17,520 --> 00:30:19,070 So you can do paired-end sequencing, 621 00:30:19,070 --> 00:30:21,380 where you sequence both ends of your template. 622 00:30:21,380 --> 00:30:24,224 And that'll basically double the amount of sequence you get. 623 00:30:24,224 --> 00:30:25,640 And you can also, on this machine, 624 00:30:25,640 --> 00:30:26,840 do two flow cells at once. 625 00:30:26,840 --> 00:30:28,620 So you can actually double it beyond that. 626 00:30:28,620 --> 00:30:33,796 And so for many applications, 160 billion bases is overkill. 627 00:30:33,796 --> 00:30:34,795 It's more than you need. 628 00:30:34,795 --> 00:30:36,836 Imagine you're doing bacterial genome sequencing. 629 00:30:36,836 --> 00:30:39,150 Bacterial genome might be five megabases or so. 630 00:30:39,150 --> 00:30:40,570 This is complete overkill. 631 00:30:40,570 --> 00:30:45,090 So you can do bar coding where you add little six 632 00:30:45,090 --> 00:30:50,720 base tags to different libraries, 633 00:30:50,720 --> 00:30:53,660 and then mix them together, introduce them to the machine, 634 00:30:53,660 --> 00:30:56,370 sequence the tags first or second, 635 00:30:56,370 --> 00:30:57,730 and then sequence the templates. 636 00:30:57,730 --> 00:31:00,980 And then you effectively sort them out later. 637 00:31:00,980 --> 00:31:05,280 And then do many samples in one lane. 638 00:31:05,280 --> 00:31:07,380 And that's what people most commonly do. 639 00:31:11,590 --> 00:31:13,894 So, questions about next-gen sequencing? 640 00:31:13,894 --> 00:31:15,060 There's a lot more to learn. 641 00:31:15,060 --> 00:31:16,393 I'm happy to talk about it more. 642 00:31:16,393 --> 00:31:18,170 It's very relevant to this class. 643 00:31:18,170 --> 00:31:20,680 But I'm sure it'll come up later in David's sections, 644 00:31:20,680 --> 00:31:23,610 so I don't want to take too much time on it. 645 00:31:29,960 --> 00:31:34,620 So, now once you generate reads from an Illumina 646 00:31:34,620 --> 00:31:36,570 instrument or some other instrument, 647 00:31:36,570 --> 00:31:38,350 you'll want to align them to the genome 648 00:31:38,350 --> 00:31:39,970 to determine, for example, if you're 649 00:31:39,970 --> 00:31:44,590 doing RNA-seq mapping reads that come from mRNA, 650 00:31:44,590 --> 00:31:47,360 you'll want to know what genes they came from. 651 00:31:47,360 --> 00:31:50,900 So you need to map those reads back to the genome. 652 00:31:50,900 --> 00:31:55,690 What are some other reasons you might want to align sequences? 653 00:31:55,690 --> 00:31:57,950 Just in general, why is aligning sequences-- 654 00:31:57,950 --> 00:32:03,220 meaning, matching them up and finding individual bases 655 00:32:03,220 --> 00:32:06,390 or amino acid residues that match-- why is that useful? 656 00:32:06,390 --> 00:32:07,150 Diego? 657 00:32:07,150 --> 00:32:09,150 AUDIENCE: You can assemble them if you want. 658 00:32:09,150 --> 00:32:09,820 PROFESSOR: You can assemble them? 659 00:32:09,820 --> 00:32:10,040 Yes. 660 00:32:10,040 --> 00:32:11,706 So if you're doing genome sequencing, 661 00:32:11,706 --> 00:32:13,330 if you align them to each other and you 662 00:32:13,330 --> 00:32:16,570 find a whole stack that sort of align this way, 663 00:32:16,570 --> 00:32:19,700 you can then assemble and infer the existence 664 00:32:19,700 --> 00:32:20,780 of a longer sequence. 665 00:32:20,780 --> 00:32:21,880 That's a good point. 666 00:32:21,880 --> 00:32:22,904 Yes, your name? 667 00:32:22,904 --> 00:32:23,810 AUDIENCE: Julianne. 668 00:32:23,810 --> 00:32:25,409 Looking at homologs. 669 00:32:25,409 --> 00:32:26,700 PROFESSOR: Looking at homologs. 670 00:32:26,700 --> 00:32:27,200 Right. 671 00:32:27,200 --> 00:32:31,320 So if you, for example, are doing disease gene mapping, 672 00:32:31,320 --> 00:32:35,150 you've identified a human gene of unknown function 673 00:32:35,150 --> 00:32:38,230 that's associated with a disease. 674 00:32:38,230 --> 00:32:41,580 Then you might want to search it against, say, the mouse 675 00:32:41,580 --> 00:32:43,900 database and find a homolog in mouse 676 00:32:43,900 --> 00:32:48,422 and then that might be what you would want to study further. 677 00:32:48,422 --> 00:32:50,255 You might want to then knock it out in mouse 678 00:32:50,255 --> 00:32:52,350 or mutate it or something. 679 00:32:52,350 --> 00:32:54,570 So those are some good reasons. 680 00:32:54,570 --> 00:32:56,310 There's others. 681 00:32:56,310 --> 00:32:59,460 So we're going to first talk about local alignment, which 682 00:32:59,460 --> 00:33:01,110 is a type of alignment where you want 683 00:33:01,110 --> 00:33:05,670 to find shorter stretches of high similarity. 684 00:33:05,670 --> 00:33:08,980 You don't require alignment of the entire sequence. 685 00:33:12,390 --> 00:33:15,767 So there are certain situations where 686 00:33:15,767 --> 00:33:16,850 you might want to do that. 687 00:33:16,850 --> 00:33:19,250 So here's an example. 688 00:33:19,250 --> 00:33:22,020 You are studying a recently discovered human non-coding 689 00:33:22,020 --> 00:33:22,520 RNA. 690 00:33:22,520 --> 00:33:25,330 As you can see, it's 45 bases. 691 00:33:25,330 --> 00:33:29,250 You want to see if there's a mouse homolog. 692 00:33:29,250 --> 00:33:32,390 You run it through NCBI BLAST, which as we said 693 00:33:32,390 --> 00:33:35,050 is sort of the Google search engine of mathematics-- 694 00:33:35,050 --> 00:33:38,890 and you're going get a chance to do it on pump set one, 695 00:33:38,890 --> 00:33:42,120 and you get a hit that looks like this. 696 00:33:42,120 --> 00:33:45,005 So notice, this is sort of BLAST notation. 697 00:33:45,005 --> 00:33:46,500 It says Q at the top. 698 00:33:46,500 --> 00:33:48,800 Q is for "query," that's the sequence you put in. 699 00:33:48,800 --> 00:33:50,896 S is "subject," that's the database 700 00:33:50,896 --> 00:33:52,020 you were searching against. 701 00:33:52,020 --> 00:33:54,480 You have coordinates, so 1 to 45. 702 00:33:54,480 --> 00:33:58,130 And then, in the subject, it happened to be base 403 to 447 703 00:33:58,130 --> 00:34:00,440 in some mouse chromosome or something. 704 00:34:00,440 --> 00:34:04,197 And you can see that it's got some matching. 705 00:34:04,197 --> 00:34:05,530 But it also has some mismatches. 706 00:34:05,530 --> 00:34:09,699 So in all, there are 40 matches and five mismatches 707 00:34:09,699 --> 00:34:11,130 in the alignment. 708 00:34:11,130 --> 00:34:12,854 So is that significant? 709 00:34:15,370 --> 00:34:18,750 Remember, the mouse genome is 2.7 billion bases long. 710 00:34:18,750 --> 00:34:19,550 It's big. 711 00:34:19,550 --> 00:34:22,919 So would you get a match this good by chance? 712 00:34:25,570 --> 00:34:28,390 So the question is really, should you trust this? 713 00:34:28,390 --> 00:34:30,980 Is this something you can confidently say, 714 00:34:30,980 --> 00:34:34,220 yes mouse is a homolog, and that's it? 715 00:34:34,220 --> 00:34:36,170 Or should you just be like, well, 716 00:34:36,170 --> 00:34:38,059 that's not better than I get by chance 717 00:34:38,059 --> 00:34:39,770 so I have no evidence of anything? 718 00:34:39,770 --> 00:34:42,370 Or is it sort of somewhere in between? 719 00:34:42,370 --> 00:34:46,350 And how would you tell? 720 00:34:46,350 --> 00:34:47,701 Yeah, what's your name? 721 00:34:47,701 --> 00:34:49,174 AUDIENCE: Chris. 722 00:34:49,174 --> 00:34:51,629 You would want to figure out a scoring 723 00:34:51,629 --> 00:34:53,593 function for the alignment. 724 00:34:53,593 --> 00:34:55,557 And then, with that scoring function, 725 00:34:55,557 --> 00:34:59,607 you would find whether or not you have a significant match. 726 00:34:59,607 --> 00:35:00,190 PROFESSOR: OK. 727 00:35:00,190 --> 00:35:02,470 So Chris says you want to define a scoring system 728 00:35:02,470 --> 00:35:05,430 and then use the scoring system to define 729 00:35:05,430 --> 00:35:07,400 statistical significance. 730 00:35:07,400 --> 00:35:08,900 Do want to suggest a scoring system? 731 00:35:12,370 --> 00:35:15,570 What's the simplest one you can think of? 732 00:35:15,570 --> 00:35:20,460 AUDIENCE: Just if there's a match, you add a certain score. 733 00:35:20,460 --> 00:35:25,230 If it's a mismatch, you subtract a certain score. 734 00:35:25,230 --> 00:35:27,440 PROFESSOR: So let's do that scoring system. 735 00:35:31,170 --> 00:35:34,520 So the notation that's often used is Sii. 736 00:35:34,520 --> 00:35:36,870 So that would be a match between nucleotide i 737 00:35:36,870 --> 00:35:38,660 and then another copy of nucleotide i. 738 00:35:38,660 --> 00:35:41,890 We'll call that 1, plus 1 for a match. 739 00:35:41,890 --> 00:35:46,195 And sij, where i and j are different, 740 00:35:46,195 --> 00:35:47,570 we'll give that a negative score. 741 00:35:47,570 --> 00:35:48,470 Minus 1. 742 00:35:48,470 --> 00:35:51,890 So this is i not equal to j. 743 00:35:51,890 --> 00:35:54,630 So that's a scoring matrix. 744 00:35:54,630 --> 00:35:57,630 It's a four by four matrix with 1 on the diagonal 745 00:35:57,630 --> 00:36:00,540 and minus 1 everywhere else. 746 00:36:00,540 --> 00:36:02,500 And this is commonly used for DNA. 747 00:36:02,500 --> 00:36:04,240 And then there's a few other variations 748 00:36:04,240 --> 00:36:07,600 on this that are also used. 749 00:36:07,600 --> 00:36:09,012 So good, a scoring system. 750 00:36:09,012 --> 00:36:10,970 So then, how are we going to do the statistics? 751 00:36:10,970 --> 00:36:12,169 Any ideas? 752 00:36:12,169 --> 00:36:13,585 How do we know what's significant? 753 00:36:20,435 --> 00:36:22,348 AUDIENCE: The higher score would probably 754 00:36:22,348 --> 00:36:28,156 be a little more significant than a lower score. 755 00:36:28,156 --> 00:36:29,620 But the scale, I'm not sure-- 756 00:36:29,620 --> 00:36:31,450 PROFESSOR: The scale is not so obvious. 757 00:36:31,450 --> 00:36:32,498 Yes, question? 758 00:36:32,498 --> 00:36:33,664 AUDIENCE: My name is Andrea. 759 00:36:33,664 --> 00:36:37,600 So if you shuffled the RNA, like permute the sequence, 760 00:36:37,600 --> 00:36:39,814 then we'll get the [INAUDIBLE] genome 761 00:36:39,814 --> 00:36:42,028 you get with that shuffled sequence. 762 00:36:42,028 --> 00:36:43,996 And the score is about the same as you'd 763 00:36:43,996 --> 00:36:46,948 get with the non-shuffled sequence [INAUDIBLE] 764 00:36:46,948 --> 00:36:48,920 about very significant scores. 765 00:36:48,920 --> 00:36:51,274 PROFESSOR: Yeah, so that's a good idea. 766 00:36:51,274 --> 00:36:52,940 BLAST, as it turns out-- is pretty fast. 767 00:36:52,940 --> 00:36:54,870 So you could shuffle your RNA molecule, 768 00:36:54,870 --> 00:36:57,330 randomly permute the nucleotides many times, 769 00:36:57,330 --> 00:36:59,190 maybe even like 1,000 times, search 770 00:36:59,190 --> 00:37:00,782 each one against the mouse genome, 771 00:37:00,782 --> 00:37:02,490 and get a distribution of what's the best 772 00:37:02,490 --> 00:37:05,505 score-- the top score-- that you get against a genome, 773 00:37:05,505 --> 00:37:06,880 look at that distribution and say 774 00:37:06,880 --> 00:37:08,420 whether the score of the actual one 775 00:37:08,420 --> 00:37:10,560 is significantly higher than that distribution 776 00:37:10,560 --> 00:37:12,476 or just falls in the middle of that somewhere. 777 00:37:12,476 --> 00:37:15,680 And that's reasonable. 778 00:37:15,680 --> 00:37:18,380 You can certainly do that, and it's not a bad thing to do. 779 00:37:18,380 --> 00:37:20,590 But it turns out there is an analytical theory here 780 00:37:20,590 --> 00:37:21,720 that you can use. 781 00:37:21,720 --> 00:37:25,200 And so that you can determine significance more quickly 782 00:37:25,200 --> 00:37:28,910 without doing so much computation. 783 00:37:28,910 --> 00:37:30,890 And that's what we'll talk about. 784 00:37:30,890 --> 00:37:33,489 But another issue, before we get to the statistics, 785 00:37:33,489 --> 00:37:35,280 is how do you actually find that alignment? 786 00:37:35,280 --> 00:37:39,420 How do you find the top scoring match in a mouse genome? 787 00:37:39,420 --> 00:37:43,830 So let's suppose this guy is your RNA. 788 00:37:43,830 --> 00:37:45,970 OK, of course, we're using T's, but that's 789 00:37:45,970 --> 00:37:48,990 just because you usually sequences it at the DNA level. 790 00:37:48,990 --> 00:37:50,300 But imagine this is your RNA. 791 00:37:50,300 --> 00:37:51,160 It's very short. 792 00:37:51,160 --> 00:37:54,390 This is like 10 or so, I think. 793 00:37:54,390 --> 00:37:56,365 And this is your database. 794 00:37:56,365 --> 00:38:01,490 But it goes on a few billion more. 795 00:38:01,490 --> 00:38:02,980 Several more blackboards. 796 00:38:02,980 --> 00:38:08,410 And I want to come up with an algorithm that 797 00:38:08,410 --> 00:38:12,730 will find the highest scoring segment of this query sequence 798 00:38:12,730 --> 00:38:17,520 against this database. 799 00:38:17,520 --> 00:38:18,730 Any ideas? 800 00:38:18,730 --> 00:38:20,530 So this would be like our first algorithm. 801 00:38:20,530 --> 00:38:25,640 And it's not terribly hard, so that's 802 00:38:25,640 --> 00:38:28,100 why it's a good one to start with. 803 00:38:28,100 --> 00:38:30,200 Not totally obvious either. 804 00:38:30,200 --> 00:38:33,200 Who can think of an algorithm or something, some operation 805 00:38:33,200 --> 00:38:36,010 that we can do on this sequence compared 806 00:38:36,010 --> 00:38:37,950 to this sequence-- in some way-- that 807 00:38:37,950 --> 00:38:41,270 will help us find the highest scoring match? 808 00:38:44,410 --> 00:38:45,740 I'm sorry. 809 00:38:45,740 --> 00:38:46,427 Yeah? 810 00:38:46,427 --> 00:38:49,410 AUDIENCE: You have to consider insertion and deletion. 811 00:38:49,410 --> 00:38:50,390 PROFESSOR: Yeah, OK. 812 00:38:50,390 --> 00:38:52,210 So we're going to keep it simple. 813 00:38:52,210 --> 00:38:53,210 That's true, in general. 814 00:38:53,210 --> 00:38:54,585 But we're going to keep it simple 815 00:38:54,585 --> 00:38:57,390 and just say no insertions and deletions. 816 00:38:57,390 --> 00:39:00,900 So we're going to look for an ungapped local alignment. 817 00:39:00,900 --> 00:39:03,480 So that's the algorithm that I want. 818 00:39:03,480 --> 00:39:04,940 First, no gaps. 819 00:39:04,940 --> 00:39:07,950 And then we'll do gaps on Tuesday. 820 00:39:07,950 --> 00:39:08,920 Tim? 821 00:39:08,920 --> 00:39:11,860 AUDIENCE: You could just compare your [INAUDIBLE] 822 00:39:11,860 --> 00:39:16,270 to [INAUDIBLE] all across the database 823 00:39:16,270 --> 00:39:21,170 and turn off all the [INAUDIBLE] on that [INAUDIBLE], 824 00:39:21,170 --> 00:39:23,520 and then figure out [INAUDIBLE]. 825 00:39:23,520 --> 00:39:25,760 PROFESSOR: Yeah, OK. 826 00:39:25,760 --> 00:39:28,490 Pretty much. 827 00:39:28,490 --> 00:39:29,920 I mean, that's pretty much right. 828 00:39:29,920 --> 00:39:32,800 Although it's not quite as much of a description as 829 00:39:32,800 --> 00:39:35,049 you would need if you want to actually code that. 830 00:39:35,049 --> 00:39:36,590 Like, how would you actually do that? 831 00:39:36,590 --> 00:39:38,810 So, I want a description that is sort 832 00:39:38,810 --> 00:39:40,730 of more at the level of pseudocode. 833 00:39:40,730 --> 00:39:43,670 Like, here's how you would actually organize your code. 834 00:39:43,670 --> 00:39:48,560 So, let's say you entertain the hypothesis 835 00:39:48,560 --> 00:39:53,690 that the alignment can be in different registers. 836 00:39:53,690 --> 00:39:58,090 The alignment can correspond to base one of the query and base 837 00:39:58,090 --> 00:39:59,820 one of the subject. 838 00:39:59,820 --> 00:40:01,500 Or it could be shifted. 839 00:40:01,500 --> 00:40:04,200 It could be an alignment where base 1 of the query 840 00:40:04,200 --> 00:40:05,576 matches these two, and so forth. 841 00:40:05,576 --> 00:40:07,200 So there's sort of different registers. 842 00:40:07,200 --> 00:40:09,280 So let's just consider one register first. 843 00:40:09,280 --> 00:40:11,610 The one where base 1 matches. 844 00:40:11,610 --> 00:40:14,140 So let's just look at the matches 845 00:40:14,140 --> 00:40:16,470 between corresponding bases. 846 00:40:16,470 --> 00:40:22,150 I'm just going to make these little angle bracket guys here. 847 00:40:22,150 --> 00:40:25,565 Hopefully I won't make any mistakes. 848 00:40:28,680 --> 00:40:30,100 I'm going to take this. 849 00:40:30,100 --> 00:40:32,460 This is sort of implementing Tim's idea here. 850 00:40:32,460 --> 00:40:34,850 And then I'm going to look for each of these-- 851 00:40:34,850 --> 00:40:36,330 so consider it going down here. 852 00:40:36,330 --> 00:40:39,450 Now we're sort of looking at an alignment here. 853 00:40:39,450 --> 00:40:42,000 Is this a match or a mismatch? 854 00:40:42,000 --> 00:40:43,410 That's a mismatch. 855 00:40:43,410 --> 00:40:44,720 That's a match. 856 00:40:44,720 --> 00:40:46,750 That's a mismatch. 857 00:40:46,750 --> 00:40:47,790 That's a mismatch. 858 00:40:47,790 --> 00:40:48,860 That's a match. 859 00:40:48,860 --> 00:40:53,020 Match Match. 860 00:40:53,020 --> 00:40:53,570 Mismatch. 861 00:40:53,570 --> 00:40:55,180 Mismatch. 862 00:40:55,180 --> 00:40:57,760 Mismatch. 863 00:40:57,760 --> 00:41:01,670 So where is the top scoring match 864 00:41:01,670 --> 00:41:04,750 between the query and the subject? 865 00:41:04,750 --> 00:41:05,250 Tim? 866 00:41:11,770 --> 00:41:12,270 Anyone? 867 00:41:14,835 --> 00:41:16,280 AUDIENCE: 6, 7, 8. 868 00:41:16,280 --> 00:41:17,170 PROFESSOR: 6, 7, 8. 869 00:41:17,170 --> 00:41:18,111 Good. 870 00:41:18,111 --> 00:41:18,610 Oh-- 871 00:41:18,610 --> 00:41:19,860 AUDIENCE: 5, 6, 7. 872 00:41:19,860 --> 00:41:21,130 PROFESSOR: 5, 6, 7. 873 00:41:21,130 --> 00:41:21,630 Right. 874 00:41:21,630 --> 00:41:22,870 Right here. 875 00:41:22,870 --> 00:41:26,444 You can see there's three in a row. 876 00:41:26,444 --> 00:41:27,360 Well, what about this? 877 00:41:27,360 --> 00:41:28,900 Why can't we add this to the match? 878 00:41:32,690 --> 00:41:36,510 What's the reason why it's not 2, 3, 4, 5, 6, 7? 879 00:41:40,358 --> 00:41:42,290 AUDIENCE: Because the score for that is lower. 880 00:41:42,290 --> 00:41:43,240 PROFESSOR: Because the score for that is lower. 881 00:41:43,240 --> 00:41:43,740 Right. 882 00:41:43,740 --> 00:41:45,059 We defined top scoring segment. 883 00:41:45,059 --> 00:41:46,600 You sum up the scores across the map. 884 00:41:46,600 --> 00:41:48,810 So you can have mismatches in there, 885 00:41:48,810 --> 00:41:53,800 but this will have a score of 3. 886 00:41:53,800 --> 00:41:56,550 And if you wanted to add these three bases, 887 00:41:56,550 --> 00:41:59,220 you would be adding negative 2 and plus 1, 888 00:41:59,220 --> 00:42:01,000 so it would reduce your score. 889 00:42:01,000 --> 00:42:02,340 So that would be worse. 890 00:42:02,340 --> 00:42:06,390 Any ideas on how to do this in an automatic, algorithmic way? 891 00:42:10,191 --> 00:42:10,690 Yeah? 892 00:42:10,690 --> 00:42:11,680 What's your name? 893 00:42:11,680 --> 00:42:12,670 AUDIENCE: Simon. 894 00:42:12,670 --> 00:42:18,620 So if you keep shifting the entire database, [INAUDIBLE]. 895 00:42:18,620 --> 00:42:21,290 PROFESSOR: OK so you keep shifting it over, 896 00:42:21,290 --> 00:42:24,050 and you generate one of these lines. 897 00:42:24,050 --> 00:42:27,890 But imagine my query was like 1,000 or something. 898 00:42:27,890 --> 00:42:29,500 And my database is like a billion. 899 00:42:33,229 --> 00:42:34,270 How do I look along here? 900 00:42:34,270 --> 00:42:37,000 And here it was obvious what the top scoring match is. 901 00:42:37,000 --> 00:42:43,760 But if I had two matches here, then we 902 00:42:43,760 --> 00:42:46,620 would've actually had a longer match here. 903 00:42:46,620 --> 00:42:54,610 So in general, how do I find that that top match? 904 00:42:54,610 --> 00:42:58,780 For each of those registers, if you will, 905 00:42:58,780 --> 00:43:03,240 you'll have a thousand long diagonal here 906 00:43:03,240 --> 00:43:06,520 with 1's and minus 1's on it. 907 00:43:06,520 --> 00:43:10,495 How do I process those scores to find the top scoring segment? 908 00:43:16,409 --> 00:43:17,700 What's an algorithm to do that? 909 00:43:26,916 --> 00:43:28,290 It's kind of intuitively obvious, 910 00:43:28,290 --> 00:43:31,190 but I want to do something with, you define a variable 911 00:43:31,190 --> 00:43:34,597 and you update it, and you add to it, and subtract. 912 00:43:34,597 --> 00:43:35,430 Something like that. 913 00:43:35,430 --> 00:43:37,700 But, like a computer could actually handle. 914 00:43:41,001 --> 00:43:41,500 Yeah? 915 00:43:41,500 --> 00:43:42,291 What was your name? 916 00:43:42,291 --> 00:43:42,890 Julianne? 917 00:43:42,890 --> 00:43:47,246 AUDIENCE: Could you keep track of what the highest total 918 00:43:47,246 --> 00:43:50,634 score is, and then you keep going down the diagonal, 919 00:43:50,634 --> 00:43:53,197 And then you update it? 920 00:43:53,197 --> 00:43:53,780 PROFESSOR: OK. 921 00:43:53,780 --> 00:43:58,262 You keep track of what the highest total score was? 922 00:43:58,262 --> 00:43:59,188 AUDIENCE: Yeah. 923 00:43:59,188 --> 00:44:01,040 The highest test score. 924 00:44:01,040 --> 00:44:03,320 PROFESSOR: The highest segment score? 925 00:44:03,320 --> 00:44:05,440 OK. 926 00:44:05,440 --> 00:44:07,950 I'm going to put this up here. 927 00:44:07,950 --> 00:44:14,240 And we'll define max s. 928 00:44:14,240 --> 00:44:17,480 That's the highest segment score we've achieved to date. 929 00:44:17,480 --> 00:44:21,404 And we'll initialize that to zero, let's say. 930 00:44:21,404 --> 00:44:22,820 Because if you had all mismatches, 931 00:44:22,820 --> 00:44:25,030 zero would be the correct answer. 932 00:44:25,030 --> 00:44:28,320 If your query was A's and your subject was T's. 933 00:44:28,320 --> 00:44:32,443 And then what do you do? 934 00:44:32,443 --> 00:44:37,126 AUDIENCE: As you go down the diagonal, you keep track of-- 935 00:44:37,126 --> 00:44:39,621 PROFESSOR: Keep track of what? 936 00:44:39,621 --> 00:44:45,609 AUDIENCE: So you look at 1 in 1 first. 937 00:44:45,609 --> 00:44:48,603 And then you go 1 in 2, and you find a score of zero. 938 00:44:48,603 --> 00:44:50,100 But that's higher than negative 1. 939 00:44:54,110 --> 00:44:56,960 PROFESSOR: But the score of the maximum segment at that point, 940 00:44:56,960 --> 00:44:59,870 after base 2, is not zero. 941 00:44:59,870 --> 00:45:01,970 It's actually 1. 942 00:45:01,970 --> 00:45:05,732 Because you could have a segment of one base alignment. 943 00:45:05,732 --> 00:45:06,940 The cumulative score is zero. 944 00:45:09,105 --> 00:45:11,230 I think you're onto something here that may be also 945 00:45:11,230 --> 00:45:14,640 something useful to keep track of. 946 00:45:14,640 --> 00:45:18,280 Let's do the cumulative score and then you tell me more. 947 00:45:18,280 --> 00:45:20,380 We'll define cumulative score variable. 948 00:45:20,380 --> 00:45:21,930 We'll initialize that to zero. 949 00:45:21,930 --> 00:45:26,131 And then we'll have some for loops that, as some of you 950 00:45:26,131 --> 00:45:29,840 have said, you want to loop through the subject. 951 00:45:29,840 --> 00:45:33,120 All the possible registers of the subject. 952 00:45:33,120 --> 00:45:35,780 So that would be maybe j equals 1 953 00:45:35,780 --> 00:45:39,550 to subject length minus query length. 954 00:45:39,550 --> 00:45:40,420 Something like that. 955 00:45:40,420 --> 00:45:42,290 Don't worry too much about this. 956 00:45:42,290 --> 00:45:44,200 Again, this is not real code, obviously. 957 00:45:44,200 --> 00:45:44,960 It's pseudocode. 958 00:45:44,960 --> 00:45:47,870 So then this will be, say, 1 to query language. 959 00:45:47,870 --> 00:45:50,255 And so this will be going along our diagonal. 960 00:45:50,255 --> 00:45:52,130 And we're going to plot the cumulative score. 961 00:45:54,690 --> 00:45:59,250 So here you would you have an update where cumulative score 962 00:45:59,250 --> 00:46:07,760 plus equals the score of query position 963 00:46:07,760 --> 00:46:13,714 i matched against subject position j. 964 00:46:13,714 --> 00:46:14,380 And update that. 965 00:46:14,380 --> 00:46:15,740 So that's just cumulative score. 966 00:46:15,740 --> 00:46:16,823 So what will it look like? 967 00:46:16,823 --> 00:46:19,350 So in this case, I'll just use this down here. 968 00:46:26,360 --> 00:46:35,390 So you have zero, 1, 2, minus 1, minus 2. 969 00:46:35,390 --> 00:46:39,460 So you'll start at position zero in the sequence. 970 00:46:39,460 --> 00:46:43,400 At position 1 you're down here at minus 1 971 00:46:43,400 --> 00:46:44,540 because it was a mismatch. 972 00:46:44,540 --> 00:46:49,550 Then at position 2, as you said, we're back up to zero. 973 00:46:49,550 --> 00:46:51,130 And then what happens? 974 00:46:51,130 --> 00:46:54,500 Go down to minus 1, down to minus 2. 975 00:46:54,500 --> 00:47:02,719 Then we go up three times in a row until we're up here to 1. 976 00:47:02,719 --> 00:47:04,010 And then we go down after that. 977 00:47:07,380 --> 00:47:11,780 So where is your highest scoring match in this cumulative score 978 00:47:11,780 --> 00:47:12,280 plot? 979 00:47:16,030 --> 00:47:17,610 People said it was from 5 to 7. 980 00:47:17,610 --> 00:47:18,733 Yeah, question? 981 00:47:18,733 --> 00:47:20,816 AUDIENCE: So would it be from like a local minimum 982 00:47:20,816 --> 00:47:22,290 to a local maximum? 983 00:47:22,290 --> 00:47:23,330 PROFESSOR: Yeah. 984 00:47:23,330 --> 00:47:24,690 Exactly. 985 00:47:24,690 --> 00:47:26,458 So, what do you want to keep track of? 986 00:47:26,458 --> 00:47:29,916 AUDIENCE: You want to keep track of the minimum and the maximum. 987 00:47:29,916 --> 00:47:34,870 And look for the range which you maximize to different-- 988 00:47:34,870 --> 00:47:37,111 PROFESSOR: Yeah, so this is now sort of more 989 00:47:37,111 --> 00:47:38,610 what I was looking for in terms of-- 990 00:47:38,610 --> 00:47:43,100 so this was the local minimum, and that's the local maximum. 991 00:47:43,100 --> 00:47:45,240 This is the score. 992 00:47:45,240 --> 00:47:48,100 That's your mass s there. 993 00:47:48,100 --> 00:47:50,910 And you also want to keep track of where that happened 994 00:47:50,910 --> 00:47:52,970 in both the query and the subject. 995 00:47:52,970 --> 00:47:53,910 Does that make sense? 996 00:47:53,910 --> 00:47:56,340 So you would keep track of this running cumulative score 997 00:47:56,340 --> 00:47:57,200 variable. 998 00:47:57,200 --> 00:48:01,920 You keep track of the last minimum. 999 00:48:01,920 --> 00:48:05,960 The minimum that you've achieved so far. 1000 00:48:05,960 --> 00:48:17,650 And so that would then be down here to minus 2. 1001 00:48:17,650 --> 00:48:20,280 And then when your cumulative score got up to plus 1, 1002 00:48:20,280 --> 00:48:22,760 you always take that cumulative score, 1003 00:48:22,760 --> 00:48:26,160 minus the last minimum cumulative score. 1004 00:48:26,160 --> 00:48:28,230 That gives you a potential candidate 1005 00:48:28,230 --> 00:48:29,370 for a high scoring segment. 1006 00:48:29,370 --> 00:48:33,060 And if that is bigger than your current max high scoring 1007 00:48:33,060 --> 00:48:36,107 segment, then you update it and you would update this. 1008 00:48:36,107 --> 00:48:37,690 And then you would also have variables 1009 00:48:37,690 --> 00:48:41,210 that would store where you are. 1010 00:48:41,210 --> 00:48:44,419 And also, where did that last minimum occur. 1011 00:48:44,419 --> 00:48:45,710 So I'm not spelling it all out. 1012 00:48:45,710 --> 00:48:47,543 I'm not going to give you all the variables. 1013 00:48:47,543 --> 00:48:51,745 But this is an algorithm that would find the maximum score. 1014 00:48:51,745 --> 00:48:52,370 Yeah, question? 1015 00:48:52,370 --> 00:48:53,745 AUDIENCE: So you're keeping track 1016 00:48:53,745 --> 00:48:56,500 of the global maximum, local minimum, 1017 00:48:56,500 --> 00:49:00,530 so that you can accept the most recent local minimum following 1018 00:49:00,530 --> 00:49:02,575 the global maximum? 1019 00:49:02,575 --> 00:49:04,200 PROFESSOR: I'm not sure I got all that. 1020 00:49:04,200 --> 00:49:07,800 But you're keeping track of the cumulative score. 1021 00:49:07,800 --> 00:49:12,680 The minimum that that cumulative score ever got to. 1022 00:49:12,680 --> 00:49:17,556 And the maximum difference, the maximum 1023 00:49:17,556 --> 00:49:21,940 that you ever in the past have gone up. 1024 00:49:21,940 --> 00:49:27,060 Where you've had a net increment upwards. 1025 00:49:27,060 --> 00:49:27,970 Like here. 1026 00:49:27,970 --> 00:49:30,400 So this variable here, this max s, 1027 00:49:30,400 --> 00:49:32,000 it would be initialized to zero. 1028 00:49:32,000 --> 00:49:36,780 When you got to here, your last minimum score would be minus 1. 1029 00:49:36,780 --> 00:49:38,500 Your cumulative score would be zero. 1030 00:49:38,500 --> 00:49:40,916 You would take the difference of those, and you'd be like, 1031 00:49:40,916 --> 00:49:43,204 oh I've got a high scoring segment of score one. 1032 00:49:43,204 --> 00:49:44,370 So I'm going to update that. 1033 00:49:44,370 --> 00:49:48,690 So now, that variable is now 1 at this point. 1034 00:49:48,690 --> 00:49:51,300 Then you're going down, so you're not getting anything. 1035 00:49:51,300 --> 00:49:53,520 You're just lowering this minimum cumulative score 1036 00:49:53,520 --> 00:49:55,070 down to minus 2 here. 1037 00:49:55,070 --> 00:49:58,057 And then when you get to here, now you 1038 00:49:58,057 --> 00:50:00,140 check the cumulative score minus the last minimum. 1039 00:50:00,140 --> 00:50:00,880 It's 1. 1040 00:50:00,880 --> 00:50:02,300 That's a tie. 1041 00:50:02,300 --> 00:50:04,550 We won't keep track of ties. 1042 00:50:04,550 --> 00:50:07,820 Now at here, that difference is 2. 1043 00:50:07,820 --> 00:50:10,180 So now we've got a new record. 1044 00:50:10,180 --> 00:50:14,090 So now we update this maximum score to 2 in the locations. 1045 00:50:14,090 --> 00:50:16,830 And then we get here, now it's 3, and we update that. 1046 00:50:16,830 --> 00:50:17,825 Does that make sense? 1047 00:50:17,825 --> 00:50:20,116 AUDIENCE: Imagine the first dip-- instead of going down 1048 00:50:20,116 --> 00:50:22,760 to negative 1, it went down to negative 3. 1049 00:50:22,760 --> 00:50:23,731 PROFESSOR: Negative 3? 1050 00:50:23,731 --> 00:50:25,284 AUDIENCE: That first dip. 1051 00:50:25,284 --> 00:50:26,200 PROFESSOR: Right here? 1052 00:50:26,200 --> 00:50:28,160 So we started back a little bit. 1053 00:50:28,160 --> 00:50:30,265 So back here, like this? 1054 00:50:30,265 --> 00:50:31,250 AUDIENCE: No. 1055 00:50:31,250 --> 00:50:32,500 PROFESSOR: Down to negative 3? 1056 00:50:32,500 --> 00:50:33,550 AUDIENCE: No. 1057 00:50:33,550 --> 00:50:35,341 PROFESSOR: But how do we get to negative 3? 1058 00:50:35,341 --> 00:50:36,970 Because our scoring is this way. 1059 00:50:36,970 --> 00:50:38,580 You want this dip to minus 3? 1060 00:50:38,580 --> 00:50:39,584 AUDIENCE: No 1061 00:50:39,584 --> 00:50:40,750 PROFESSOR: This one minus 3? 1062 00:50:40,750 --> 00:50:42,367 Imagine we are at minus 3 here? 1063 00:50:42,367 --> 00:50:42,992 AUDIENCE: Yeah. 1064 00:50:42,992 --> 00:50:45,347 Imagine it dipped to minus 3. 1065 00:50:45,347 --> 00:50:49,770 And then the next one dipped to higher than that, to minus 2. 1066 00:50:49,770 --> 00:50:51,714 And then it went up to 1. 1067 00:50:51,714 --> 00:50:54,630 And so, would the difference you look at be negative 2 to 1, 1068 00:50:54,630 --> 00:50:56,100 or negative 3 to 1? 1069 00:50:56,100 --> 00:50:57,710 PROFESSOR: Like that, right? 1070 00:50:57,710 --> 00:51:02,250 So, minus 3, let's say, minus 2, 1. 1071 00:51:02,250 --> 00:51:04,010 Something like that. 1072 00:51:04,010 --> 00:51:05,856 What do people think? 1073 00:51:05,856 --> 00:51:07,175 Anyone want to--? 1074 00:51:07,175 --> 00:51:08,480 AUDIENCE: Minus 3 to 1. 1075 00:51:08,480 --> 00:51:10,100 PROFESSOR: Minus 3 to 1. 1076 00:51:10,100 --> 00:51:13,051 It's the minimum that you ever got to. 1077 00:51:13,051 --> 00:51:16,120 This might be a stronger match, but this is a higher scoring 1078 00:51:16,120 --> 00:51:16,620 match. 1079 00:51:16,620 --> 00:51:18,620 And we said we want higher scoring. 1080 00:51:18,620 --> 00:51:21,716 So you would count that. 1081 00:51:21,716 --> 00:51:23,966 AUDIENCE: So you keep track of both the global minimum 1082 00:51:23,966 --> 00:51:25,490 and the global maximum, and you take 1083 00:51:25,490 --> 00:51:27,322 the difference between them. 1084 00:51:27,322 --> 00:51:29,280 PROFESSOR: You keep track of the global minimum 1085 00:51:29,280 --> 00:51:31,134 and the current cumulative score, 1086 00:51:31,134 --> 00:51:32,300 and you take the difference. 1087 00:51:35,186 --> 00:51:38,080 AUDIENCE: The global maximum-- 1088 00:51:38,080 --> 00:51:40,210 PROFESSOR: It's not necessarily global maximum 1089 00:51:40,210 --> 00:51:43,390 because we could be well below zero here. 1090 00:51:43,390 --> 00:51:44,389 We could do like this. 1091 00:51:49,550 --> 00:51:50,880 From here to here. 1092 00:51:50,880 --> 00:51:53,440 So this is not the global maximum. 1093 00:51:53,440 --> 00:51:57,270 This just happens to be, we went up 1094 00:51:57,270 --> 00:51:58,510 a lot since our last minimum. 1095 00:51:58,510 --> 00:52:00,950 So that's your high scoring segment. 1096 00:52:00,950 --> 00:52:02,582 Does that make sense? 1097 00:52:04,961 --> 00:52:06,460 I haven't completely spelled it out. 1098 00:52:06,460 --> 00:52:12,300 But I think you guys have given enough ideas here that there;s 1099 00:52:12,300 --> 00:52:14,100 sort of the core of an algorithm. 1100 00:52:14,100 --> 00:52:21,230 And I encourage you to think this through afterwards 1101 00:52:21,230 --> 00:52:24,090 and let me know if there are questions. 1102 00:52:24,090 --> 00:52:27,700 And we could add an optional homework 1103 00:52:27,700 --> 00:52:29,700 where I ask you to do this, that we've sometimes 1104 00:52:29,700 --> 00:52:30,740 had in the past. 1105 00:52:30,740 --> 00:52:33,090 It is a useful thing to look at. 1106 00:52:33,090 --> 00:52:35,530 This is not exactly how the BLAST algorithm works. 1107 00:52:35,530 --> 00:52:38,330 It uses some tricks for faster speed. 1108 00:52:38,330 --> 00:52:40,900 But this is sort of morally equivalent to BLAST 1109 00:52:40,900 --> 00:52:46,340 in the sense that it has the same order of magnitude running 1110 00:52:46,340 --> 00:52:48,780 time. 1111 00:52:48,780 --> 00:52:54,960 So this algorithm-- what is the running time in Big-O notation? 1112 00:52:54,960 --> 00:52:58,800 So just for those who are non-CS people, when 1113 00:52:58,800 --> 00:53:02,840 you use this Big-O notation, then you're asking, 1114 00:53:02,840 --> 00:53:05,010 how does the running time increase 1115 00:53:05,010 --> 00:53:07,560 in the size of the input? 1116 00:53:07,560 --> 00:53:08,870 And so what is the input? 1117 00:53:08,870 --> 00:53:10,440 So we have two inputs. 1118 00:53:10,440 --> 00:53:14,070 We have a query of length. 1119 00:53:14,070 --> 00:53:17,670 And let's say subject of length n. 1120 00:53:17,670 --> 00:53:21,171 So clearly, if those are bigger, it'll take longer to run. 1121 00:53:21,171 --> 00:53:22,920 But when you compare different algorithms, 1122 00:53:22,920 --> 00:53:26,971 you want to know how the run time depends on those lengths. 1123 00:53:26,971 --> 00:53:27,470 Yes. 1124 00:53:27,470 --> 00:53:28,356 What's your name? 1125 00:53:28,356 --> 00:53:29,308 AUDIENCE: Sally. 1126 00:53:29,308 --> 00:53:31,509 m times n. 1127 00:53:31,509 --> 00:53:33,050 PROFESSOR: So with this, this is what 1128 00:53:33,050 --> 00:53:35,940 you would call an order mn algorithm. 1129 00:53:35,940 --> 00:53:36,900 And why is that? 1130 00:53:36,900 --> 00:53:38,220 How can you see that? 1131 00:53:38,220 --> 00:53:40,806 AUDIENCE: You have two for loops And for each length, 1132 00:53:40,806 --> 00:53:42,639 essentially, you're going through everything 1133 00:53:42,639 --> 00:53:45,094 in the query. 1134 00:53:45,094 --> 00:53:48,040 And then, for everything that you go through in the query, 1135 00:53:48,040 --> 00:53:49,022 you would [INAUDIBLE]. 1136 00:53:52,970 --> 00:53:55,270 PROFESSOR: Right. 1137 00:53:55,270 --> 00:53:59,800 In this second for loop here, you're going through the query. 1138 00:53:59,800 --> 00:54:02,420 And you're doing that nested inside of a for loop 1139 00:54:02,420 --> 00:54:05,112 that's basically the length of the subject. 1140 00:54:05,112 --> 00:54:06,570 And eventually you're going to have 1141 00:54:06,570 --> 00:54:07,986 to compare every base in the query 1142 00:54:07,986 --> 00:54:09,296 to every base in the subject. 1143 00:54:09,296 --> 00:54:10,420 There's no way around that. 1144 00:54:10,420 --> 00:54:13,270 And that takes some unit of time. 1145 00:54:13,270 --> 00:54:15,900 And so the actual time will be proportional to that. 1146 00:54:15,900 --> 00:54:18,220 So the bigger n gets and m gets, it's 1147 00:54:18,220 --> 00:54:20,150 just proportional to the product. 1148 00:54:20,150 --> 00:54:21,460 Does that make sense? 1149 00:54:21,460 --> 00:54:24,010 Or another way to think about it is, 1150 00:54:24,010 --> 00:54:26,594 you're clearly going to have to do something on this diagonal. 1151 00:54:26,594 --> 00:54:28,468 And then you're going to have to do something 1152 00:54:28,468 --> 00:54:30,530 on this diagonal, and this one, and this one. 1153 00:54:30,530 --> 00:54:34,280 And actually, you have to also check these ones here. 1154 00:54:34,280 --> 00:54:39,380 And in the end, the total number of computations 1155 00:54:39,380 --> 00:54:41,900 there is going to be this times that. 1156 00:54:41,900 --> 00:54:43,680 You're basically doing a rectangle's worth 1157 00:54:43,680 --> 00:54:45,606 of computations. 1158 00:54:45,606 --> 00:54:47,430 Does that makes sense? 1159 00:54:47,430 --> 00:54:49,120 So that's not bad, right? 1160 00:54:49,120 --> 00:54:50,940 It could be worse. 1161 00:54:50,940 --> 00:54:54,790 It could be, like, mn squared or something like that. 1162 00:54:57,550 --> 00:55:01,030 So that's basically why BLAST is fast. 1163 00:55:01,030 --> 00:55:03,190 So what do these things look like, in general? 1164 00:55:03,190 --> 00:55:07,180 And what is the condition on our score for this algorithm 1165 00:55:07,180 --> 00:55:07,680 to work? 1166 00:55:12,460 --> 00:55:16,260 What if I gave a score of plus 1 for a match, 1167 00:55:16,260 --> 00:55:17,820 and zero for a mismatch? 1168 00:55:17,820 --> 00:55:20,340 Could we do this? 1169 00:55:20,340 --> 00:55:21,990 Joe, you're shaking your head. 1170 00:55:21,990 --> 00:55:23,490 AUDIENCE: It would just be going up. 1171 00:55:23,490 --> 00:55:24,156 PROFESSOR: Yeah. 1172 00:55:24,156 --> 00:55:26,311 The problem is, it might be flat for a while, 1173 00:55:26,311 --> 00:55:27,560 but eventually it would go up. 1174 00:55:27,560 --> 00:55:29,143 And it would just go up and up and up. 1175 00:55:29,143 --> 00:55:30,650 And so your highest scoring segment 1176 00:55:30,650 --> 00:55:32,430 would, most of the time, be something 1177 00:55:32,430 --> 00:55:34,040 that started very near the beginning 1178 00:55:34,040 --> 00:55:36,720 and ended very near the end. 1179 00:55:36,720 --> 00:55:37,790 So that doesn't work. 1180 00:55:37,790 --> 00:55:40,015 So you have to have a net negative drift. 1181 00:55:40,015 --> 00:55:42,140 And the way that's formalized is the expected score 1182 00:55:42,140 --> 00:55:43,980 has to be negative. 1183 00:55:43,980 --> 00:55:46,970 So why is the expected score negative in this scoring system 1184 00:55:46,970 --> 00:55:50,379 that has plus 1 for a match, and minus 1 for a mismatch? 1185 00:55:50,379 --> 00:55:51,170 Why does that work? 1186 00:55:54,507 --> 00:55:56,840 AUDIENCE: It should be wrong three quarters of the time. 1187 00:55:56,840 --> 00:55:57,590 PROFESSOR: Yeah. 1188 00:55:57,590 --> 00:55:59,673 You'll have a mismatch three quarters of the time. 1189 00:55:59,673 --> 00:56:04,040 So on average, you tend to drift down. 1190 00:56:04,040 --> 00:56:06,440 And then you have these little excursions upwards, 1191 00:56:06,440 --> 00:56:09,410 and those are your high scoring segments. 1192 00:56:09,410 --> 00:56:13,216 Any questions about that? 1193 00:56:13,216 --> 00:56:14,178 AUDIENCE: Question. 1194 00:56:14,178 --> 00:56:17,795 Is there something better than m times n? 1195 00:56:17,795 --> 00:56:19,920 PROFESSOR: We've got some computer scientists here. 1196 00:56:19,920 --> 00:56:22,194 David? 1197 00:56:22,194 --> 00:56:23,110 Better than m times n? 1198 00:56:23,110 --> 00:56:24,544 I don't think so, because you have 1199 00:56:24,544 --> 00:56:25,710 to do all those comparisons. 1200 00:56:25,710 --> 00:56:29,530 And so there's no way around that, so I don't think so. 1201 00:56:29,530 --> 00:56:31,070 All right. 1202 00:56:31,070 --> 00:56:33,180 But the constant-- you can do better 1203 00:56:33,180 --> 00:56:35,076 on the constant than this algorithm thing. 1204 00:56:35,076 --> 00:56:36,955 AUDIENCE: With multiple queries-- 1205 00:56:36,955 --> 00:56:38,580 PROFESSOR: With multiple queries, yeah. 1206 00:56:38,580 --> 00:56:41,720 Then you can maybe do some hashing 1207 00:56:41,720 --> 00:56:45,720 or find some-- to speed it up. 1208 00:56:45,720 --> 00:56:48,560 OK, so what about the statistics of this? 1209 00:56:48,560 --> 00:56:54,790 So it turns out that Karlin and Altschul developed some theory 1210 00:56:54,790 --> 00:56:57,410 for just exactly this problem. 1211 00:56:57,410 --> 00:57:00,210 For searching a query sequence. 1212 00:57:00,210 --> 00:57:04,240 It can be nucleotide or protein as long as you have integer 1213 00:57:04,240 --> 00:57:09,700 scores and the average-- or the expected-- score is negative, 1214 00:57:09,700 --> 00:57:13,520 then this theory tells you how often 1215 00:57:13,520 --> 00:57:17,890 the highest score of all-- across the entire query 1216 00:57:17,890 --> 00:57:20,430 database comparison-- exceeds a cut 1217 00:57:20,430 --> 00:57:25,300 off x using a local alignment algorithm such as BLAST. 1218 00:57:25,300 --> 00:57:30,380 And it turns out that these scores 1219 00:57:30,380 --> 00:57:34,240 follow what's called an extreme value or Gumbel distribution. 1220 00:57:34,240 --> 00:57:39,340 And it has this kind of double exponential form here. 1221 00:57:39,340 --> 00:57:40,620 So x is some cut off. 1222 00:57:40,620 --> 00:57:43,760 So usually x would be the score that you actually 1223 00:57:43,760 --> 00:57:46,530 observed when you searched your query against the database. 1224 00:57:46,530 --> 00:57:47,980 That's the one you care about. 1225 00:57:47,980 --> 00:57:49,870 And then you want to know, what's 1226 00:57:49,870 --> 00:57:53,980 the probability we would've seen something higher than that? 1227 00:57:53,980 --> 00:58:00,150 Or you might do x is one less than the score you observed. 1228 00:58:00,150 --> 00:58:04,190 So what's the chance we observed something the same, 1229 00:58:04,190 --> 00:58:06,775 as good as this, or better? 1230 00:58:06,775 --> 00:58:07,650 Does that make sense? 1231 00:58:07,650 --> 00:58:11,100 And so this is going to be your P value then. 1232 00:58:11,100 --> 00:58:13,740 So the probability of S. The score 1233 00:58:13,740 --> 00:58:16,699 of the highest segment under a model where 1234 00:58:16,699 --> 00:58:18,740 you have a random query against a random database 1235 00:58:18,740 --> 00:58:22,500 of the same length is 1 minus e to the minus KMN 1236 00:58:22,500 --> 00:58:24,790 e to the minus lambda x. 1237 00:58:24,790 --> 00:58:30,030 Where M and N are the lengths of the query and the database. 1238 00:58:30,030 --> 00:58:32,360 x is the score. 1239 00:58:32,360 --> 00:58:35,860 And then K and lambda are two positive parameters 1240 00:58:35,860 --> 00:58:39,760 that depend actually on the details of your score matrix 1241 00:58:39,760 --> 00:58:42,990 and the composition of your sequences. 1242 00:58:42,990 --> 00:58:47,190 And it turns out that lambda is really the one that matters. 1243 00:58:47,190 --> 00:58:48,940 And you can see that because lambda 1244 00:58:48,940 --> 00:58:52,279 is up there in that exponent multiplying x. 1245 00:58:52,279 --> 00:58:53,820 So if you double lambda, that'll have 1246 00:58:53,820 --> 00:58:56,660 a big effect on the answer. 1247 00:58:56,660 --> 00:58:58,550 And K, it turns out, you can mostly 1248 00:58:58,550 --> 00:59:01,650 ignore it for most purposes. 1249 00:59:01,650 --> 00:59:04,840 So as a formula, what does this thing look like? 1250 00:59:04,840 --> 00:59:07,244 It looks like that. 1251 00:59:07,244 --> 00:59:08,160 Kind of a funny shape. 1252 00:59:08,160 --> 00:59:10,210 It sort of looks like an umlauf a little bit, 1253 00:59:10,210 --> 00:59:15,540 but then has a different shape on the right than the left. 1254 00:59:15,540 --> 00:59:18,130 And how do you calculate this lambda? 1255 00:59:18,130 --> 00:59:23,530 So I said that lambda is sort of the key to all this 1256 00:59:23,530 --> 00:59:31,630 because of its uniquely important place 1257 00:59:31,630 --> 00:59:36,300 in that formula, multiplying the score. 1258 00:59:36,300 --> 00:59:39,230 So it turns out that lambda is the unique positive solution 1259 00:59:39,230 --> 00:59:41,250 to this equation here. 1260 00:59:41,250 --> 00:59:44,990 So now it actually depends on the scoring matrix. 1261 00:59:44,990 --> 00:59:46,710 So you see there's sij there. 1262 00:59:46,710 --> 00:59:49,190 It depends on the composition of your query. 1263 00:59:49,190 --> 00:59:50,540 That's the pi's. 1264 00:59:50,540 --> 00:59:55,820 The composition of your subject, that's the rj's. 1265 00:59:55,820 --> 00:59:59,020 You sum over the i and j equal to each 1266 00:59:59,020 --> 01:00:01,340 of the four nucleotides. 1267 01:00:01,340 --> 01:00:03,110 And that sum has to be 1. 1268 01:00:03,110 --> 01:00:08,830 So there's a unique positive solution to this equation. 1269 01:00:08,830 --> 01:00:11,210 So how would we solve an equation like this? 1270 01:00:11,210 --> 01:00:12,980 First of all, what kind of equation 1271 01:00:12,980 --> 01:00:17,460 is this, given that we're going to set the sij, 1272 01:00:17,460 --> 01:00:20,710 and we're going to just measure the pi and the rj? 1273 01:00:20,710 --> 01:00:23,470 So those are all known constants, 1274 01:00:23,470 --> 01:00:25,890 and lambda is what we're trying to solve for here. 1275 01:00:25,890 --> 01:00:28,791 So what kind of an equation is this in lambda? 1276 01:00:35,850 --> 01:00:36,350 Linear? 1277 01:00:36,350 --> 01:00:38,550 Quadratic? 1278 01:00:38,550 --> 01:00:40,890 Hyperbolic? 1279 01:00:40,890 --> 01:00:43,750 Anybody know what this is? 1280 01:00:43,750 --> 01:00:47,660 So this is called a transcendental equation 1281 01:00:47,660 --> 01:00:51,430 because you have different powers. 1282 01:00:55,200 --> 01:00:59,245 That sounds kind of unpleasant. 1283 01:00:59,245 --> 01:01:01,440 You don't take a class in transcendental equations 1284 01:01:01,440 --> 01:01:01,940 probably. 1285 01:01:01,940 --> 01:01:08,270 So in general, they're not possible to solve analytically 1286 01:01:08,270 --> 01:01:10,560 when they get complicated. 1287 01:01:10,560 --> 01:01:13,380 But in simple cases, you can solve them analytically. 1288 01:01:13,380 --> 01:01:17,420 And in fact, let's just do one. 1289 01:01:17,420 --> 01:01:19,450 So let's take the simplest case, which 1290 01:01:19,450 --> 01:01:25,230 would be that all the pi's are a quarter. 1291 01:01:25,230 --> 01:01:29,020 All the ri's are a quarter. 1292 01:01:29,020 --> 01:01:31,990 And we'll use the scoring system that we came up with before, 1293 01:01:31,990 --> 01:01:37,710 where sii is 1, and sij is minus 1. 1294 01:01:37,710 --> 01:01:41,640 If i does not equal j. 1295 01:01:41,640 --> 01:01:50,250 And so when we plug those in to that sum there, what do we get? 1296 01:01:50,250 --> 01:01:56,190 We'll get four terms that are one quarter, times one quarter, 1297 01:01:56,190 --> 01:01:59,710 times e to the lambda. 1298 01:01:59,710 --> 01:02:03,817 There's four possible types of matches, right? 1299 01:02:03,817 --> 01:02:05,900 They have probability one quarter times a quarter. 1300 01:02:05,900 --> 01:02:07,710 That's pi and rj. 1301 01:02:07,710 --> 01:02:12,450 And the e to the lambda sii is just e to the lambda 1302 01:02:12,450 --> 01:02:14,130 because sii is 1. 1303 01:02:14,130 --> 01:02:19,680 And then there's 12 terms that are one quarter, one quarter, 1304 01:02:19,680 --> 01:02:21,270 e to the minus lambda. 1305 01:02:21,270 --> 01:02:23,150 Because there's the minus 1 score. 1306 01:02:23,150 --> 01:02:26,610 And that has to equal 1. 1307 01:02:26,610 --> 01:02:32,210 So cancel this, we'll multiply through by 4, maybe. 1308 01:02:32,210 --> 01:02:39,720 So now we get e to the lambda plus 3. 1309 01:02:39,720 --> 01:02:43,050 e to the minus lambda equals 1. 1310 01:02:43,050 --> 01:02:45,050 It's still a transcendental equation, 1311 01:02:45,050 --> 01:02:47,710 but it's looking a little simpler. 1312 01:02:47,710 --> 01:02:49,630 Any ideas how to solve this for lambda? 1313 01:02:52,140 --> 01:02:53,228 Sally? 1314 01:02:53,228 --> 01:02:54,875 AUDIENCE: Wouldn't the 1 be 4? 1315 01:02:54,875 --> 01:02:55,750 PROFESSOR: I'm sorry. 1316 01:02:55,750 --> 01:02:56,160 4. 1317 01:02:56,160 --> 01:02:56,670 Thank you. 1318 01:03:06,730 --> 01:03:07,778 Yeah, what's your name? 1319 01:03:07,778 --> 01:03:10,772 AUDIENCE: [INAUDIBLE] I think [INAUDIBLE] quadratic equation. 1320 01:03:10,772 --> 01:03:14,764 If you multiply both sides by [INAUDIBLE] then [INAUDIBLE]. 1321 01:03:16,672 --> 01:03:18,130 PROFESSOR: OK, so the claim is this 1322 01:03:18,130 --> 01:03:19,460 is basically a quadratic equation. 1323 01:03:19,460 --> 01:03:21,376 So you multiply both sides by e to the lambda. 1324 01:03:21,376 --> 01:03:24,697 So then you get e to the 2 lambda plus 3. 1325 01:03:24,697 --> 01:03:26,530 And then it's going to move this over and do 1326 01:03:26,530 --> 01:03:31,580 minus 4 e to the lambda equals zero. 1327 01:03:31,580 --> 01:03:33,340 Is that good? 1328 01:03:33,340 --> 01:03:35,607 So how is it quadratic? 1329 01:03:35,607 --> 01:03:37,190 What do you actually do to solve this? 1330 01:03:37,190 --> 01:03:38,356 AUDIENCE: Well, [INAUDIBLE]. 1331 01:03:42,280 --> 01:03:45,000 PROFESSOR: Change the variable, x equals e to the lambda. 1332 01:03:45,000 --> 01:03:46,330 Then it's quadratic in x. 1333 01:03:46,330 --> 01:03:46,990 Solve for x. 1334 01:03:46,990 --> 01:03:48,440 We all know how to solve quadratic equations. 1335 01:03:48,440 --> 01:03:49,760 And then substitute that for lambda. 1336 01:03:49,760 --> 01:03:50,800 OK, everyone got that? 1337 01:03:55,845 --> 01:03:58,120 If you use 16 different scores to represent 1338 01:03:58,120 --> 01:04:00,250 all the different types of matches and mismatches, 1339 01:04:00,250 --> 01:04:03,060 this will be very unpleasant. 1340 01:04:03,060 --> 01:04:04,940 It's not unsolvable, it's just that you 1341 01:04:04,940 --> 01:04:08,520 have to use computational numerical methods to solve it. 1342 01:04:08,520 --> 01:04:10,184 But in simple cases where you just 1343 01:04:10,184 --> 01:04:11,850 have a couple different types of scores, 1344 01:04:11,850 --> 01:04:15,530 it will often be a quadratic equation. 1345 01:04:15,530 --> 01:04:16,080 All right. 1346 01:04:16,080 --> 01:04:20,140 So let's suppose that we have a particular scoring system-- 1347 01:04:20,140 --> 01:04:24,610 particular pi's, rj's-- and we have a value of lambda that 1348 01:04:24,610 --> 01:04:25,280 satisfies those. 1349 01:04:25,280 --> 01:04:28,370 So we've solved this quadratic equation for lambda. 1350 01:04:28,370 --> 01:04:30,240 I think we get lambda equals natural log 1351 01:04:30,240 --> 01:04:32,690 3, something like that. 1352 01:04:32,690 --> 01:04:34,660 Remember, it's a unique positive solution. 1353 01:04:34,660 --> 01:04:35,370 Quadratic equations are two solutions, 1354 01:04:35,370 --> 01:04:37,286 but there's going to be just one positive one. 1355 01:04:37,286 --> 01:04:41,230 And then we have that value. 1356 01:04:41,230 --> 01:04:43,960 It satisfies this equation. 1357 01:04:43,960 --> 01:04:47,200 So then, what if we double the scores? 1358 01:04:47,200 --> 01:04:50,340 Instead of plus 1 minus 1, we use plus 2 minus 2? 1359 01:04:53,020 --> 01:04:55,850 What would then happen? 1360 01:04:55,850 --> 01:04:58,440 You can see that the original version of lambda 1361 01:04:58,440 --> 01:05:01,860 wouldn't necessarily still satisfy this equation. 1362 01:05:01,860 --> 01:05:04,080 But if you think about it a little bit, 1363 01:05:04,080 --> 01:05:08,370 you can figure out what new value of lambda 1364 01:05:08,370 --> 01:05:09,692 would satisfy this equation. 1365 01:05:14,860 --> 01:05:19,365 We've solved for the lambda that solves with these scores. 1366 01:05:23,502 --> 01:05:24,960 Now we're going to have new scores. 1367 01:05:24,960 --> 01:05:28,530 sii prime equals 2. 1368 01:05:28,530 --> 01:05:34,080 sij prime equals minus 2. 1369 01:05:34,080 --> 01:05:35,666 What is lambda prime? 1370 01:05:35,666 --> 01:05:37,290 The lambda that goes with these scores? 1371 01:05:39,850 --> 01:05:40,860 Yeah, go ahead. 1372 01:05:40,860 --> 01:05:42,217 AUDIENCE: Half of the original? 1373 01:05:42,217 --> 01:05:43,550 PROFESSOR: Half of the original? 1374 01:05:43,550 --> 01:05:44,050 Right. 1375 01:05:44,050 --> 01:05:46,974 So you're saying that lambda prime equals lambda over 2. 1376 01:05:46,974 --> 01:05:47,640 And why is that? 1377 01:05:47,640 --> 01:05:50,965 Can you explain? 1378 01:05:50,965 --> 01:05:52,506 AUDIENCE: Because of the [INAUDIBLE]. 1379 01:05:59,530 --> 01:06:03,980 PROFESSOR: Yeah, if you think about these terms in the sum, 1380 01:06:03,980 --> 01:06:05,940 the s part is all doubling. 1381 01:06:05,940 --> 01:06:09,380 So if you cut the lambda apart, and the product will equal what 1382 01:06:09,380 --> 01:06:10,650 it did before. 1383 01:06:10,650 --> 01:06:13,370 And we haven't changed the pi's and rj's, so all those terms 1384 01:06:13,370 --> 01:06:14,590 will be the same. 1385 01:06:14,590 --> 01:06:17,400 So therefore, it will still satisfy that equation. 1386 01:06:17,400 --> 01:06:19,900 So that's another way of thinking about it. 1387 01:06:19,900 --> 01:06:21,210 Yes, you're correct. 1388 01:06:21,210 --> 01:06:23,770 So if you double the scores, lambda 1389 01:06:23,770 --> 01:06:27,090 will be reduced by a factor of 2. 1390 01:06:27,090 --> 01:06:28,880 So what does that tell us about lambda? 1391 01:06:28,880 --> 01:06:32,994 What is it? 1392 01:06:32,994 --> 01:06:34,443 What is its meaning? 1393 01:06:40,250 --> 01:06:41,416 Yeah, go ahead, Jeff. 1394 01:06:41,416 --> 01:06:42,874 AUDIENCE: Scale of the distribution 1395 01:06:42,874 --> 01:06:45,808 to the expectant score? 1396 01:06:45,808 --> 01:06:47,104 Or the range score? 1397 01:06:47,104 --> 01:06:47,770 PROFESSOR: Yeah. 1398 01:06:47,770 --> 01:06:50,240 It basically scales the scores. 1399 01:06:50,240 --> 01:06:55,040 So we can have the same equation here used 1400 01:06:55,040 --> 01:06:56,200 with arbitrary scoring. 1401 01:06:56,200 --> 01:06:56,960 It just scales it. 1402 01:06:56,960 --> 01:06:59,410 You can see the way it appears as a multiplicative factor 1403 01:06:59,410 --> 01:07:00,326 in front of the score. 1404 01:07:00,326 --> 01:07:03,050 So if you double all the scores, will that 1405 01:07:03,050 --> 01:07:04,880 change what the highest scoring segment is? 1406 01:07:07,560 --> 01:07:09,620 No, it won't change it because you'll 1407 01:07:09,620 --> 01:07:11,890 have this cumulative thing. 1408 01:07:11,890 --> 01:07:14,450 It just changes how you label the y-axis. 1409 01:07:14,450 --> 01:07:19,505 It'll make it bigger, but it won't change what that is. 1410 01:07:19,505 --> 01:07:21,210 And if you look at this equation, 1411 01:07:21,210 --> 01:07:23,650 it won't change the statistical significance. 1412 01:07:23,650 --> 01:07:27,620 The x will double in value, because all the matches are now 1413 01:07:27,620 --> 01:07:30,360 worth twice as much as what they were before. 1414 01:07:30,360 --> 01:07:33,110 But lambda will be half as big, and so the product 1415 01:07:33,110 --> 01:07:35,360 will be the same and therefore, the final probability 1416 01:07:35,360 --> 01:07:36,150 will be the same. 1417 01:07:36,150 --> 01:07:38,850 So it's just a scaling factor for using different scoring 1418 01:07:38,850 --> 01:07:39,350 systems. 1419 01:07:44,281 --> 01:07:45,030 Everyone got that? 1420 01:07:47,820 --> 01:07:48,320 All right. 1421 01:07:48,320 --> 01:07:51,820 So what scoring matrix should we use for DNA? 1422 01:07:51,820 --> 01:07:53,450 How about this one? 1423 01:07:53,450 --> 01:07:56,010 So this is now a slight generalization. 1424 01:07:56,010 --> 01:08:00,530 So we're going to keep 1 for the matches. 1425 01:08:00,530 --> 01:08:03,450 You don't lose any generality by choosing 1 here for matches, 1426 01:08:03,450 --> 01:08:07,500 because if you use 2, then lambda is just 1427 01:08:07,500 --> 01:08:09,990 going to be reduced to compensate. 1428 01:08:09,990 --> 01:08:11,360 So 1 for matches. 1429 01:08:11,360 --> 01:08:15,590 And then we're going to use m for mismatches. 1430 01:08:15,590 --> 01:08:21,569 And m must be negative in order to satisfy this condition 1431 01:08:21,569 --> 01:08:23,910 for this theory to work, that the average score 1432 01:08:23,910 --> 01:08:24,790 has to be negative. 1433 01:08:24,790 --> 01:08:27,540 Clearly, you have to have some negative scores. 1434 01:08:27,540 --> 01:08:30,990 And the question then is, should we use minus 1 1435 01:08:30,990 --> 01:08:31,970 like we used before? 1436 01:08:31,970 --> 01:08:37,580 Or should we use like minus 2 or minus 5, or something else? 1437 01:08:37,580 --> 01:08:38,857 Any thoughts on this? 1438 01:08:43,230 --> 01:08:43,979 Or does it matter? 1439 01:08:43,979 --> 01:08:44,890 Maybe it doesn't matter. 1440 01:08:44,890 --> 01:08:45,848 Yeah, what's your name? 1441 01:08:45,848 --> 01:08:46,823 AUDIENCE: [INAUDIBLE]. 1442 01:08:46,823 --> 01:08:49,769 Would it make sense to not use [INAUDIBLE], 1443 01:08:49,769 --> 01:08:52,229 because [INAUDIBLE]. 1444 01:08:52,229 --> 01:08:53,340 PROFESSOR: Yeah, OK. 1445 01:08:53,340 --> 01:08:56,580 So you want to use a more complicated scoring system. 1446 01:08:56,580 --> 01:08:59,069 What particular mismatches would you 1447 01:08:59,069 --> 01:09:03,080 want to penalize more and less? 1448 01:09:03,080 --> 01:09:09,424 AUDIENCE: [INAUDIBLE] I think [INAUDIBLE] 1449 01:09:09,424 --> 01:09:10,888 needs to be [INAUDIBLE]. 1450 01:09:21,160 --> 01:09:23,886 PROFESSOR: Yeah, you are correct in your intuition. 1451 01:09:23,886 --> 01:09:25,260 Maybe one of the biologists wants 1452 01:09:25,260 --> 01:09:27,240 to offer a suggestion here. 1453 01:09:27,240 --> 01:09:28,218 Yeah, go ahead. 1454 01:09:28,218 --> 01:09:30,652 AUDIENCE: So it's a mismatch between purine and pyrimidine 1455 01:09:30,652 --> 01:09:31,152 [INAUDIBLE]. 1456 01:09:35,349 --> 01:09:37,640 PROFESSOR: OK so now we've got purines and pyrimidines. 1457 01:09:37,640 --> 01:09:39,840 So everyone remember, the purines 1458 01:09:39,840 --> 01:09:45,930 are A and G. The pyrimidines are C and T. 1459 01:09:45,930 --> 01:09:50,370 And the idea is that this should be penalized, 1460 01:09:50,370 --> 01:09:54,010 or this should be penalized less than changing 1461 01:09:54,010 --> 01:09:55,760 a purine to a pyrimidine. 1462 01:09:55,760 --> 01:09:58,142 And why does that makes sense? 1463 01:09:58,142 --> 01:09:59,950 AUDIENCE: Well, structurally they're-- 1464 01:09:59,950 --> 01:10:02,260 PROFESSOR: Structurally, purines are more similar to each other 1465 01:10:02,260 --> 01:10:03,468 than they are to pyrimidines. 1466 01:10:03,468 --> 01:10:04,820 And? 1467 01:10:04,820 --> 01:10:07,997 More importantly, I think. 1468 01:10:07,997 --> 01:10:08,538 In evolution? 1469 01:10:12,452 --> 01:10:13,368 AUDIENCE: [INAUDIBLE]. 1470 01:10:16,059 --> 01:10:17,684 PROFESSOR: I'm sorry, can you speak up? 1471 01:10:17,684 --> 01:10:20,044 AUDIENCE: C to C mutations happen spontaneously 1472 01:10:20,044 --> 01:10:21,215 in [INAUDIBLE] chemistry. 1473 01:10:21,215 --> 01:10:21,840 PROFESSOR: Yes. 1474 01:10:21,840 --> 01:10:24,810 So C to C mutations happen spontaneously. 1475 01:10:24,810 --> 01:10:27,440 So basically, it's easier because they 1476 01:10:27,440 --> 01:10:29,620 look more similar structurally. 1477 01:10:29,620 --> 01:10:31,980 The DNA polymerase is more likely to make a mistake 1478 01:10:31,980 --> 01:10:33,970 and substitute another purine. 1479 01:10:38,110 --> 01:10:43,940 The rate of purine, purine or pyrimidine, 1480 01:10:43,940 --> 01:10:46,470 pyrimidine to transversions which 1481 01:10:46,470 --> 01:10:48,970 switch the type is about three to one, 1482 01:10:48,970 --> 01:10:50,480 or two to one in different systems. 1483 01:10:50,480 --> 01:10:51,860 So yeah, that's a good idea. 1484 01:10:51,860 --> 01:10:54,820 But for simplicity, just to keep the math simple, 1485 01:10:54,820 --> 01:10:57,139 we're just going to go with one mismatch penalty. 1486 01:10:57,139 --> 01:10:58,180 But that is a good point. 1487 01:10:58,180 --> 01:11:00,470 In practice, you might want to do that. 1488 01:11:00,470 --> 01:11:02,820 So now, I'm saying I'm going to limit you 1489 01:11:02,820 --> 01:11:04,340 to one mismatch penalty. 1490 01:11:04,340 --> 01:11:07,710 But I'm going to let you choose any value you want. 1491 01:11:07,710 --> 01:11:10,050 So what value should you choose? 1492 01:11:10,050 --> 01:11:11,959 Or does it matter? 1493 01:11:15,791 --> 01:11:18,414 Or maybe different applications? 1494 01:11:18,414 --> 01:11:18,944 Tim, yeah? 1495 01:11:18,944 --> 01:11:20,402 AUDIENCE: I've just got a question. 1496 01:11:20,402 --> 01:11:24,378 Does it depend on pi and ri? 1497 01:11:24,378 --> 01:11:27,857 For example, we could use all these numbers. 1498 01:11:27,857 --> 01:11:30,342 But if the overall wants to be negative, 1499 01:11:30,342 --> 01:11:33,324 then you couldn't use negative .1. 1500 01:11:37,407 --> 01:11:38,990 PROFESSOR: Right, that's a good point. 1501 01:11:38,990 --> 01:11:42,030 You can't make it too weak. 1502 01:11:42,030 --> 01:11:45,230 It may depend on what your expected fraction of matches 1503 01:11:45,230 --> 01:11:49,860 is, which actually depends on pi and ri. 1504 01:11:49,860 --> 01:11:55,210 So if you have very biased sequences, like very AT rich, 1505 01:11:55,210 --> 01:11:57,830 your expected fraction of matches is actually higher. 1506 01:11:57,830 --> 01:11:59,145 When you're researching an AT rich sequence 1507 01:11:59,145 --> 01:12:00,370 against another AT rich sequence, 1508 01:12:00,370 --> 01:12:01,869 it's actually higher than a quarter. 1509 01:12:04,300 --> 01:12:06,480 So even minus one might not be sufficient there. 1510 01:12:06,480 --> 01:12:08,260 You might need to go down more negative. 1511 01:12:08,260 --> 01:12:11,180 So you may need to use a higher negative value just 1512 01:12:11,180 --> 01:12:14,050 to make sure that the expected value is negative. 1513 01:12:14,050 --> 01:12:15,200 That's true. 1514 01:12:15,200 --> 01:12:19,830 And yeah, you may want to adjust it based on the composition. 1515 01:12:19,830 --> 01:12:22,090 So let's just do a bit more. 1516 01:12:22,090 --> 01:12:25,480 So it turns out that the Karlin and Altschul theory, 1517 01:12:25,480 --> 01:12:31,190 in addition to telling you what the p value is of your match-- 1518 01:12:31,190 --> 01:12:34,280 the statistical significance-- it also tells you 1519 01:12:34,280 --> 01:12:37,640 what the matches will look like in terms 1520 01:12:37,640 --> 01:12:41,460 of what fraction of identity they will have. 1521 01:12:41,460 --> 01:12:45,290 And this is the so-called target frequency equation. 1522 01:12:45,290 --> 01:12:48,940 The theory says that if I search a query with one 1523 01:12:48,940 --> 01:12:52,450 particular composition, p, subject meta-composition r-- 1524 01:12:52,450 --> 01:12:54,840 here, I've just assumed they're the same, both p just 1525 01:12:54,840 --> 01:12:58,250 for simplicity-- with a scoring matrix sij, which 1526 01:12:58,250 --> 01:13:00,070 has a corresponding of lambda. 1527 01:13:00,070 --> 01:13:03,080 Then, when I take those very high scoring matches-- 1528 01:13:03,080 --> 01:13:05,780 the ones that are statistically significant-- 1529 01:13:05,780 --> 01:13:10,650 and I look at those alignments of those matches, 1530 01:13:10,650 --> 01:13:15,650 I will get values qij, given by this formula. 1531 01:13:15,650 --> 01:13:17,110 So look at the formula. 1532 01:13:17,110 --> 01:13:17,790 So it's qij. 1533 01:13:17,790 --> 01:13:19,970 So pipj e to the lambda sij. 1534 01:13:19,970 --> 01:13:22,620 So it's basically the expected chance 1535 01:13:22,620 --> 01:13:28,970 that you would have base i matching based j just 1536 01:13:28,970 --> 01:13:29,470 by chance. 1537 01:13:29,470 --> 01:13:30,840 That's pipj. 1538 01:13:30,840 --> 01:13:33,290 But then weighted by e to the lambda sij. 1539 01:13:33,290 --> 01:13:36,560 So we notice for a match, s will be positive, 1540 01:13:36,560 --> 01:13:38,060 so e to the lambda will be positive. 1541 01:13:38,060 --> 01:13:39,710 So that will be bigger than 1. 1542 01:13:39,710 --> 01:13:41,340 And you'll have more matches and you'll 1543 01:13:41,340 --> 01:13:43,370 have correspondingly less mismatches 1544 01:13:43,370 --> 01:13:45,320 because the mismatch has a negative. 1545 01:13:45,320 --> 01:13:47,240 So get the target value score. 1546 01:13:47,240 --> 01:13:52,510 And that also tells you that the so-called natural scores 1547 01:13:52,510 --> 01:13:56,540 are actually determined by the fraction of matches 1548 01:13:56,540 --> 01:13:59,500 that you want in your high scoring segments. 1549 01:13:59,500 --> 01:14:02,830 If we want 90% matches, we just set qii 1550 01:14:02,830 --> 01:14:06,180 to be 0.9, and use this equation here. 1551 01:14:09,440 --> 01:14:11,060 Solve for sij. 1552 01:14:14,740 --> 01:14:18,620 For example, if you want to find regions with R% identities. 1553 01:14:18,620 --> 01:14:22,220 Little r is just the r as a proportion. 1554 01:14:22,220 --> 01:14:24,550 qii is going to be r over 4. 1555 01:14:24,550 --> 01:14:27,110 This assumes unbiased base composition. 1556 01:14:27,110 --> 01:14:28,920 A quarter of the matches are acgt. 1557 01:14:32,060 --> 01:14:35,700 Qij, then, is 1 minus r over 12. 1558 01:14:35,700 --> 01:14:38,750 1 minus r is a fraction of non-matching positions. 1559 01:14:38,750 --> 01:14:41,360 They're 12 different types. 1560 01:14:41,360 --> 01:14:45,180 Set sii equal to 1, that's what we said we normally do. 1561 01:14:45,180 --> 01:14:48,560 And then you do a little bit of algebra here. 1562 01:14:48,560 --> 01:14:50,410 m is sij. 1563 01:14:50,410 --> 01:14:54,730 And you sort of plug in this equation twice here. 1564 01:14:54,730 --> 01:14:56,740 And you get this equation. 1565 01:14:56,740 --> 01:15:01,350 So it says that m equals log of 4 1 minus r over 3 1566 01:15:01,350 --> 01:15:03,010 over log 4 r. 1567 01:15:03,010 --> 01:15:06,240 And for this to be true, this assumes 1568 01:15:06,240 --> 01:15:08,780 that both the query and the database 1569 01:15:08,780 --> 01:15:11,450 have uniform composition of a quarter, 1570 01:15:11,450 --> 01:15:16,300 and that r is between a quarter and 1. 1571 01:15:16,300 --> 01:15:19,294 The proportion of matches in your high scoring segment-- 1572 01:15:19,294 --> 01:15:20,960 you want it to be bigger than a quarter. 1573 01:15:20,960 --> 01:15:23,014 A quarter is what you would see by chance. 1574 01:15:23,014 --> 01:15:25,430 There's something wrong with your scoring system if you're 1575 01:15:25,430 --> 01:15:27,390 considering those to be significant. 1576 01:15:27,390 --> 01:15:30,280 So it's something above 25%. 1577 01:15:30,280 --> 01:15:35,090 And so it's just simple algebra-- 1578 01:15:35,090 --> 01:15:41,180 you can check my work at home-- to solve for m here. 1579 01:15:41,180 --> 01:15:44,310 And then this equation then tells you 1580 01:15:44,310 --> 01:15:49,530 that if I want to find 75% identical matches 1581 01:15:49,530 --> 01:15:51,340 in a nucleotide search, I should use 1582 01:15:51,340 --> 01:15:53,640 a mismatch penalty of minus 1. 1583 01:15:53,640 --> 01:15:57,160 And if I want 99% identical matches, 1584 01:15:57,160 --> 01:16:00,200 I should use a penalty of minus 3. 1585 01:16:00,200 --> 01:16:02,440 Not minus 5, but minus 3. 1586 01:16:02,440 --> 01:16:09,070 And I want you to think about, does that make sense? 1587 01:16:09,070 --> 01:16:11,640 Does that not make sense? 1588 01:16:11,640 --> 01:16:14,400 Because I'm going to ask you at the beginning of class 1589 01:16:14,400 --> 01:16:17,950 on Tuesday to explain and comment 1590 01:16:17,950 --> 01:16:23,470 on this particular phenomenon of how when you want higher 1591 01:16:23,470 --> 01:16:31,340 percent identities, you want a more negative mismatch score. 1592 01:16:31,340 --> 01:16:35,410 Any last questions? 1593 01:16:35,410 --> 01:16:36,960 Comments?