1 00:00:00,070 --> 00:00:01,780 The following content is provided 2 00:00:01,780 --> 00:00:04,019 under a Creative Commons license. 3 00:00:04,019 --> 00:00:06,870 Your support will help MIT OpenCourseWare continue 4 00:00:06,870 --> 00:00:10,730 to offer high quality educational resources for free. 5 00:00:10,730 --> 00:00:13,340 To make a donation or view additional materials 6 00:00:13,340 --> 00:00:14,960 from hundreds of MIT courses, visit 7 00:00:14,960 --> 00:00:16,210 mitopencourseware@ocw.mit.edu. 8 00:00:26,430 --> 00:00:28,630 PROFESSOR: All right, so let's get started. 9 00:00:28,630 --> 00:00:34,800 So today we're going to review local alignment, 10 00:00:34,800 --> 00:00:39,270 we talked about last time, and introduce global alignment, 11 00:00:39,270 --> 00:00:42,810 also talking about issues related 12 00:00:42,810 --> 00:00:45,340 to protein sequences, which include 13 00:00:45,340 --> 00:00:49,670 some more interesting scoring matrices. 14 00:00:49,670 --> 00:00:54,980 So just some info on topic one, which we are still in. 15 00:00:54,980 --> 00:00:57,082 So I will have an overview slide. 16 00:00:57,082 --> 00:00:58,707 It'll have a blue background, and there 17 00:00:58,707 --> 00:01:00,665 will be a review slide with a purple background 18 00:01:00,665 --> 00:01:01,870 in every lecture. 19 00:01:01,870 --> 00:01:06,440 So last time we talked about local alignment and some 20 00:01:06,440 --> 00:01:08,990 of the statistics associated with that, 21 00:01:08,990 --> 00:01:13,210 and also a little bit about sequencing, technologies, 22 00:01:13,210 --> 00:01:16,890 both conventional Sanger DNA sequencing 23 00:01:16,890 --> 00:01:20,070 as well as second generation sequencing. 24 00:01:20,070 --> 00:01:22,680 And at the beginning of the local alignment section, 25 00:01:22,680 --> 00:01:25,900 we introduced a simple BLAST-like algorithm, 26 00:01:25,900 --> 00:01:28,750 and then we talked about statistics, target frequencies, 27 00:01:28,750 --> 00:01:30,940 mismatched penalties, that sort of thing. 28 00:01:33,630 --> 00:01:36,660 So there were a couple questions at the end which I just 29 00:01:36,660 --> 00:01:38,260 wanted to briefly answer. 30 00:01:38,260 --> 00:01:45,500 So I believe it was Joe asked about how the dye is attached 31 00:01:45,500 --> 00:01:50,070 to the DNTP in dye terminator sequencing. 32 00:01:50,070 --> 00:01:51,530 And it appears that it's attached 33 00:01:51,530 --> 00:01:54,950 to the base, sort of the backside of the base, not 34 00:01:54,950 --> 00:01:57,562 the Watson-Crick face, obviously. 35 00:01:57,562 --> 00:01:59,520 That seems to be the common way that it's done. 36 00:01:59,520 --> 00:02:01,140 And then there was another question 37 00:02:01,140 --> 00:02:02,690 from somebody in the back. 38 00:02:02,690 --> 00:02:08,520 I don't remember who asked about when you're making libraries, 39 00:02:08,520 --> 00:02:11,480 how do you make sure that each of your insert sequences 40 00:02:11,480 --> 00:02:14,310 has the two different adapters, one adaptor on one side 41 00:02:14,310 --> 00:02:16,610 and the other adapter on the other side? 42 00:02:16,610 --> 00:02:20,040 And there are at least three ways to do this. 43 00:02:20,040 --> 00:02:22,650 So simplest is in RNA ligation, when 44 00:02:22,650 --> 00:02:25,235 you take advantage of the different chemistry at the five 45 00:02:25,235 --> 00:02:27,604 prime and three prime ends of the small RNA 46 00:02:27,604 --> 00:02:28,770 that you're trying to clone. 47 00:02:28,770 --> 00:02:32,920 So you just use the phosphate and NLH 48 00:02:32,920 --> 00:02:34,850 to ligate two different adapters. 49 00:02:34,850 --> 00:02:39,450 Another more complicated way occurs in ribosome footprint 50 00:02:39,450 --> 00:02:43,640 profiling, which is a method for mapping the precise locations 51 00:02:43,640 --> 00:02:47,410 of ribosomes along mRNAs, and involves polyA 52 00:02:47,410 --> 00:02:53,040 tailing, and then introducing the adapters 53 00:02:53,040 --> 00:02:57,070 together, the two adapters, with a polyT primer 54 00:02:57,070 --> 00:02:58,580 that primes off the polyA tail. 55 00:02:58,580 --> 00:03:01,150 And then you circularize, and then you PCR off the circles. 56 00:03:01,150 --> 00:03:02,566 And it's a little bit complicated, 57 00:03:02,566 --> 00:03:04,690 but you can look it up in the reference that's 58 00:03:04,690 --> 00:03:07,390 up here on the slide. 59 00:03:07,390 --> 00:03:08,850 It's working now. 60 00:03:08,850 --> 00:03:11,180 And then, finally, the way that's 61 00:03:11,180 --> 00:03:14,100 actually most commonly used for protocols 62 00:03:14,100 --> 00:03:17,780 like RNA seq and genomic DNA sequencing 63 00:03:17,780 --> 00:03:24,930 is that after you make your double strand DNA, 64 00:03:24,930 --> 00:03:27,560 there's an enzyme that adds a single A to the three 65 00:03:27,560 --> 00:03:29,170 prime end of each strand. 66 00:03:29,170 --> 00:03:32,160 So now you have a symmetrical molecule. 67 00:03:32,160 --> 00:03:38,780 But then you add these funny Y-shaped adapters that have 68 00:03:38,780 --> 00:03:42,000 and overhanging T on, say, the red guy here. 69 00:03:42,000 --> 00:03:47,150 And so what will happen is that each of these Y's can 70 00:03:47,150 --> 00:03:48,250 be ligated here. 71 00:03:48,250 --> 00:03:51,870 But each of the inserts, independent of which 72 00:03:51,870 --> 00:03:55,520 strand it is, will have a red adapter at the five prime end 73 00:03:55,520 --> 00:03:59,000 and a blue adaptor at the three prime end. 74 00:03:59,000 --> 00:04:02,470 Any questions about this or about 75 00:04:02,470 --> 00:04:08,084 sequencing technologies before we go to local alignments? 76 00:04:08,084 --> 00:04:08,980 OK, good. 77 00:04:08,980 --> 00:04:12,890 It was a good question, and that's the answer. 78 00:04:12,890 --> 00:04:16,690 So we motivated our discussion of local alignments 79 00:04:16,690 --> 00:04:18,750 last time by talking about this example, 80 00:04:18,750 --> 00:04:22,289 where you have a non-coding RNA that you found, inhuman. 81 00:04:22,289 --> 00:04:24,580 You BLAST it against mouse, and you get this alignment. 82 00:04:24,580 --> 00:04:25,430 Is this significant? 83 00:04:25,430 --> 00:04:29,900 So is this really likely to be a homologous sequence? 84 00:04:29,900 --> 00:04:32,822 And how do you find the alignments? 85 00:04:32,822 --> 00:04:35,120 And so we said that, well, there's 86 00:04:35,120 --> 00:04:38,790 this theory that's exact, at least 87 00:04:38,790 --> 00:04:42,060 exact in the asymptotic sense for large query and database 88 00:04:42,060 --> 00:04:47,620 sizes that tells us the statistical significance 89 00:04:47,620 --> 00:04:52,010 of the highest scoring ungapped local alignment. 90 00:04:52,010 --> 00:04:54,760 And it's given by this formula here, 91 00:04:54,760 --> 00:04:58,370 which is the extreme value or Gumbel distribution. 92 00:04:58,370 --> 00:05:00,580 And then we talked about the constraints 93 00:05:00,580 --> 00:05:02,624 or the expected score has to be negative, 94 00:05:02,624 --> 00:05:04,915 but positive scores have to be possible for this theory 95 00:05:04,915 --> 00:05:05,840 to work. 96 00:05:05,840 --> 00:05:08,570 And we also talked about an algorithm. 97 00:05:08,570 --> 00:05:11,720 But if you remember, the algorithms was very simple. 98 00:05:11,720 --> 00:05:15,595 It involved-- this is zero-- keeping 99 00:05:15,595 --> 00:05:17,200 track of the cumulative score. 100 00:05:17,200 --> 00:05:21,360 So we have a mismatch and a match, mismatch, mismatch, 101 00:05:21,360 --> 00:05:24,071 mismatch, match, match, match. 102 00:05:24,071 --> 00:05:25,820 That is a high scoring segment, et cetera. 103 00:05:25,820 --> 00:05:29,010 So you keep track of the lowest point you've ever been to 104 00:05:29,010 --> 00:05:30,219 as well as the current score. 105 00:05:30,219 --> 00:05:32,385 And when the current score exceeds that lowest point 106 00:05:32,385 --> 00:05:36,850 you've ever been to by more than we've ever seen before, 107 00:05:36,850 --> 00:05:41,890 more than this, then that's your high scoring segment. 108 00:05:41,890 --> 00:05:47,600 Now it turns out, if this is not intuitive to you, 109 00:05:47,600 --> 00:05:52,420 there's another algorithm which I find, 110 00:05:52,420 --> 00:05:53,580 personally, more intuitive. 111 00:05:53,580 --> 00:05:55,670 So I just want to tell you about that one as well. 112 00:05:55,670 --> 00:05:57,086 And it's basically the same thing, 113 00:05:57,086 --> 00:06:01,780 except whenever you go negative, you reset to zero. 114 00:06:01,780 --> 00:06:05,000 So here, we were going to go negative, 115 00:06:05,000 --> 00:06:08,560 so we just reset to zero. 116 00:06:08,560 --> 00:06:10,170 That was on this mismatch here. 117 00:06:10,170 --> 00:06:11,460 Then we have a match. 118 00:06:11,460 --> 00:06:12,480 Now we're at plus 1. 119 00:06:12,480 --> 00:06:13,380 That's fine. 120 00:06:13,380 --> 00:06:14,950 Now we have a mismatch. 121 00:06:14,950 --> 00:06:16,220 Now we're down to zero. 122 00:06:16,220 --> 00:06:17,780 We don't need to do anything. 123 00:06:17,780 --> 00:06:19,820 Now we have another mismatch. 124 00:06:19,820 --> 00:06:21,850 Here, we're still at zero. 125 00:06:21,850 --> 00:06:23,680 Remember, we just stay at zero. 126 00:06:23,680 --> 00:06:25,990 We were going to go negative, but we stayed at zero. 127 00:06:25,990 --> 00:06:28,310 Another mismatch, we still stay at zero. 128 00:06:28,310 --> 00:06:33,290 And now we have these three matches in a row. 129 00:06:33,290 --> 00:06:34,920 My line is not staying very flat. 130 00:06:34,920 --> 00:06:38,970 But this should've been here flat at zero. 131 00:06:38,970 --> 00:06:44,700 The point is that now the highest scoring segment 132 00:06:44,700 --> 00:06:46,920 is the highest point you ever reach. 133 00:06:46,920 --> 00:06:49,220 So it's very simple. 134 00:06:49,220 --> 00:06:52,140 So this is actually slightly easier to implement. 135 00:06:52,140 --> 00:06:53,980 And that's sort of a little trick. 136 00:06:53,980 --> 00:06:59,860 So for local alignments, you can often reset to zero. 137 00:06:59,860 --> 00:07:01,320 Any questions about that? 138 00:07:04,390 --> 00:07:08,830 So well, we talked about computational efficiency, 139 00:07:08,830 --> 00:07:12,384 this big O notation, where you consider 140 00:07:12,384 --> 00:07:13,925 the number of individual computations 141 00:07:13,925 --> 00:07:16,840 that are required to run an algorithm as a function 142 00:07:16,840 --> 00:07:18,660 of the size of the input, basically 143 00:07:18,660 --> 00:07:20,550 the number of units in the problem base 144 00:07:20,550 --> 00:07:24,360 pairs, amino acid residues, whatever. 145 00:07:24,360 --> 00:07:28,360 So computer scientists look at the asymptotic worst case 146 00:07:28,360 --> 00:07:29,660 running time. 147 00:07:29,660 --> 00:07:32,870 That's either because they're pessimistic, 148 00:07:32,870 --> 00:07:36,000 or perhaps because they want to guarantee things. 149 00:07:36,000 --> 00:07:38,430 They want to say, it's not going to be worse than this. 150 00:07:38,430 --> 00:07:39,850 Maybe it'll be faster, and then you'll be happy. 151 00:07:39,850 --> 00:07:41,488 But I can guarantee you, it's not 152 00:07:41,488 --> 00:07:42,654 going to be worse than this. 153 00:07:42,654 --> 00:07:48,340 And so in this case, the algorithm we talked about 154 00:07:48,340 --> 00:07:52,100 was order n times n, where that's 155 00:07:52,100 --> 00:07:55,170 the lengths of the two sequences. 156 00:07:55,170 --> 00:07:58,190 So toward the end last time, we get 157 00:07:58,190 --> 00:08:01,160 we talked about this lambda parameter 158 00:08:01,160 --> 00:08:05,060 and said that lambda is the unique positive solution 159 00:08:05,060 --> 00:08:11,370 to this equation here, where sij are the scores 160 00:08:11,370 --> 00:08:16,110 and pi and rj are the nucleotide frequencies. 161 00:08:16,110 --> 00:08:18,740 And then there's this target frequency formula 162 00:08:18,740 --> 00:08:23,550 that comes up that says that if you use a scoring system sij 163 00:08:23,550 --> 00:08:26,380 to apply to sequences, and then you pull out 164 00:08:26,380 --> 00:08:28,410 just the high scoring segments, the ones that 165 00:08:28,410 --> 00:08:31,620 are unusually high scoring, they will have 166 00:08:31,620 --> 00:08:35,390 a frequency of matching nucleotides 167 00:08:35,390 --> 00:08:39,330 qij that's given by the product of the frequencies 168 00:08:39,330 --> 00:08:43,370 in the two sequences basically weighted by e to the lambda 169 00:08:43,370 --> 00:08:44,030 sij. 170 00:08:44,030 --> 00:08:48,440 So matches will occur more strongly, because that 171 00:08:48,440 --> 00:08:51,020 has a positive work, and mismatches less strongly. 172 00:08:51,020 --> 00:08:54,000 And that then gives rise to this notion 173 00:08:54,000 --> 00:08:57,270 that there's an optimal mismatch penalty, if you just 174 00:08:57,270 --> 00:08:59,800 consider scoring systems that have plus 1 for a match 175 00:08:59,800 --> 00:09:02,790 and m for a mismatch, some negative number, that's 176 00:09:02,790 --> 00:09:05,780 given by this equation here, and here I've 177 00:09:05,780 --> 00:09:07,750 worked out a couple of values. 178 00:09:07,750 --> 00:09:14,760 So the theory says that to find matches that are 99% identical, 179 00:09:14,760 --> 00:09:16,740 you should use a mismatched score of minus 3, 180 00:09:16,740 --> 00:09:21,610 but for 75% identical, you should use minus 1. 181 00:09:21,610 --> 00:09:24,560 And I asked you to think about does that make sense, 182 00:09:24,560 --> 00:09:29,390 or how is that true? 183 00:09:29,390 --> 00:09:41,140 So y is minus 3 better than minus 1 184 00:09:41,140 --> 00:09:44,380 for finding nearly identical matches. 185 00:09:44,380 --> 00:09:46,542 Anyone have an idea or thought on this? 186 00:09:46,542 --> 00:09:48,000 There's some thoughts on the slide. 187 00:09:48,000 --> 00:09:50,990 But can anyone intuitively explain why this is true? 188 00:09:57,862 --> 00:09:58,820 Yeah, what's your name? 189 00:09:58,820 --> 00:09:59,445 AUDIENCE: Eric. 190 00:09:59,445 --> 00:10:00,700 PROFESSOR: Yeah, go ahead. 191 00:10:00,700 --> 00:10:03,050 AUDIENCE: With a mismatch penalty of minus 3, 192 00:10:03,050 --> 00:10:05,793 you actually need more steps climbing back up 193 00:10:05,793 --> 00:10:09,230 to get back to some local maximum. 194 00:10:09,230 --> 00:10:14,386 And therefore, you do require a longer stretch [INAUDIBLE] 195 00:10:14,386 --> 00:10:19,732 matches in order to get a significant hit. 196 00:10:19,732 --> 00:10:24,373 That's my guess as to why a score of m 197 00:10:24,373 --> 00:10:28,830 equals minus 3, a penalty, [INAUDIBLE] 198 00:10:28,830 --> 00:10:30,437 why it would be better at finding 199 00:10:30,437 --> 00:10:35,110 the higher identity matches. 200 00:10:35,110 --> 00:10:39,380 PROFESSOR: OK, because the minus 3 makes 201 00:10:39,380 --> 00:10:42,120 you go down faster, so it takes longer to recover, 202 00:10:42,120 --> 00:10:46,350 so you can only find nearly identical things 203 00:10:46,350 --> 00:10:48,150 with that kind of scoring system. 204 00:10:48,150 --> 00:10:49,080 Is that your point? 205 00:10:49,080 --> 00:10:50,322 OK, that's a good point. 206 00:10:52,990 --> 00:10:55,640 So yeah, when would you want to use 207 00:10:55,640 --> 00:10:58,852 a mismatched penalty of minus 1? 208 00:10:58,852 --> 00:11:02,756 AUDIENCE: When you're trying to look for things that are 209 00:11:02,756 --> 00:11:07,148 [INAUDIBLE], but maybe not so close. 210 00:11:07,148 --> 00:11:09,588 Well, when you're looking for a [INAUDIBLE], 211 00:11:09,588 --> 00:11:13,004 you're looking for [INAUDIBLE]. 212 00:11:13,004 --> 00:11:15,629 That kind of situation. 213 00:11:15,629 --> 00:11:17,170 PROFESSOR: And so let's say I'm using 214 00:11:17,170 --> 00:11:20,960 a mismatch penalty of minus 2. 215 00:11:20,960 --> 00:11:26,980 Can I find regions that are 66% identical? 216 00:11:35,746 --> 00:11:36,977 AUDIENCE: Probably. 217 00:11:36,977 --> 00:11:37,810 PROFESSOR: Probably. 218 00:11:37,810 --> 00:11:41,730 AUDIENCE: But not guaranteed. 219 00:11:41,730 --> 00:11:44,480 PROFESSOR: Anyone else have a comment on that? 220 00:11:44,480 --> 00:11:45,600 Match is plus 1. 221 00:11:45,600 --> 00:11:49,830 Mismatch is minus 2 regions of 66% identity. 222 00:11:58,920 --> 00:12:00,020 Yeah, with Levi, yeah. 223 00:12:00,020 --> 00:12:03,116 AUDIENCE: No, since your score will just be zero. 224 00:12:03,116 --> 00:12:04,070 PROFESSOR: Yeah. 225 00:12:04,070 --> 00:12:05,000 That's correct. 226 00:12:05,000 --> 00:12:07,255 So Levi's comment is your score will be zero. 227 00:12:11,970 --> 00:12:18,360 Well, I'll just say plus for match, plus, plus, minus, plus, 228 00:12:18,360 --> 00:12:19,300 plus, minus. 229 00:12:19,300 --> 00:12:22,270 I mean, it'll be interspersed. 230 00:12:22,270 --> 00:12:24,470 It doesn't have to be like this for every triplet. 231 00:12:24,470 --> 00:12:27,680 But but on average, you'll have two matches for every match. 232 00:12:27,680 --> 00:12:29,480 That's what 66% identity means. 233 00:12:29,480 --> 00:12:33,320 And so these will score a total of plus 2. 234 00:12:33,320 --> 00:12:35,505 And this will score minus 2. 235 00:12:35,505 --> 00:12:39,940 And so you basically will never rise much above zero. 236 00:12:39,940 --> 00:12:46,090 And so you you can't really use that mismatch penalty. 237 00:12:46,090 --> 00:12:47,040 There's a limit. 238 00:12:47,040 --> 00:12:50,779 66% is sort of at a point where you can no longer see. 239 00:12:50,779 --> 00:12:52,320 You could potentially see things that 240 00:12:52,320 --> 00:12:54,820 are 75% identical if they were incredibly 241 00:12:54,820 --> 00:12:56,690 long with that kind of mismatch penalty. 242 00:12:56,690 --> 00:13:01,360 But you just can't see anything below 2/3% identity 243 00:13:01,360 --> 00:13:02,150 with minus 2. 244 00:13:02,150 --> 00:13:05,900 So to find those low things, you have to use the lower. 245 00:13:05,900 --> 00:13:07,840 You have to go down to minus 1 if you 246 00:13:07,840 --> 00:13:12,710 want to find the really weak matches. 247 00:13:12,710 --> 00:13:14,950 But they will have to be correspondingly very long 248 00:13:14,950 --> 00:13:18,350 in order to achieve statistical significance. 249 00:13:18,350 --> 00:13:22,340 So correspondingly, the reason why it's better 250 00:13:22,340 --> 00:13:25,080 to use a harsher mismatch penalty of minus 3 251 00:13:25,080 --> 00:13:26,970 to find the nearly identical regions 252 00:13:26,970 --> 00:13:31,160 is that, in this equation, when you 253 00:13:31,160 --> 00:13:37,230 go from having a plus 1, minus 1 scoring system to plus 1, 254 00:13:37,230 --> 00:13:39,880 minus 3, lambda will change. 255 00:13:39,880 --> 00:13:41,730 This equation will no longer satisfied 256 00:13:41,730 --> 00:13:47,440 so that a new value of lambda will be relevant. 257 00:13:47,440 --> 00:13:50,640 And that value will be larger. 258 00:13:50,640 --> 00:13:52,580 That's not totally obvious from this equation 259 00:13:52,580 --> 00:13:54,740 because you sort of have one term, which 260 00:13:54,740 --> 00:13:57,950 is either the minus lambda in one term or either plus lambda. 261 00:13:57,950 --> 00:14:01,090 But it turns out that that making the mismatch 262 00:14:01,090 --> 00:14:04,630 penalty more negative will lead to a solution that's 263 00:14:04,630 --> 00:14:06,470 a bigger value of lambda. 264 00:14:06,470 --> 00:14:11,300 So that means that the same score, x, 265 00:14:11,300 --> 00:14:20,480 will lead to a larger negative exponent here. 266 00:14:20,480 --> 00:14:20,980 and? 267 00:14:20,980 --> 00:14:23,574 How will that affect the p-value? 268 00:14:23,574 --> 00:14:24,740 Anyone take us through this? 269 00:14:24,740 --> 00:14:26,281 It's a little bit convoluted with all 270 00:14:26,281 --> 00:14:28,430 these negative exponentials and stuff, 271 00:14:28,430 --> 00:14:32,690 but can someone explain to us how that affects the p-value? 272 00:14:32,690 --> 00:14:33,280 Same x. 273 00:14:33,280 --> 00:14:34,610 We're going to increase lambda. 274 00:14:34,610 --> 00:14:35,776 What happens to the p-value? 275 00:14:41,581 --> 00:14:44,300 This gets bigger, more negative. 276 00:14:44,300 --> 00:14:49,550 That means this e to the minus thing gets closer to 0. 277 00:14:49,550 --> 00:14:52,110 That means that this is inside an exponential. 278 00:14:52,110 --> 00:14:54,940 As that thing gets closer to 0, this whole term 279 00:14:54,940 --> 00:14:58,840 here gets closer to 1. 280 00:14:58,840 --> 00:15:00,550 Therefore you're subtracting it from 1. 281 00:15:00,550 --> 00:15:03,580 Therefore the p-value gets smaller, closer to 0, 282 00:15:03,580 --> 00:15:04,985 more significant. 283 00:15:04,985 --> 00:15:05,860 Does that made sense? 284 00:15:05,860 --> 00:15:07,740 So it's good to just work through 285 00:15:07,740 --> 00:15:10,850 how this equation works. 286 00:15:10,850 --> 00:15:13,070 So that's all I wanted to say about mismatch 287 00:15:13,070 --> 00:15:15,250 penalties for DNA. 288 00:15:15,250 --> 00:15:18,150 Any questions about that? 289 00:15:20,647 --> 00:15:22,480 So how do you actually use this in practice? 290 00:15:22,480 --> 00:15:24,010 So if you just Google "BLAST end," 291 00:15:24,010 --> 00:15:25,610 you'll get to this website. 292 00:15:25,610 --> 00:15:30,440 It's been set up at NCBI for about 20 years or so. 293 00:15:30,440 --> 00:15:32,650 And of course, it's gone through various iterations 294 00:15:32,650 --> 00:15:33,983 and improvements over the years. 295 00:15:33,983 --> 00:15:37,280 And if you look down at the bottom, 296 00:15:37,280 --> 00:15:40,780 there is a place where you can click and set 297 00:15:40,780 --> 00:15:42,790 the algorithm parameters. 298 00:15:42,790 --> 00:15:45,510 And there are a number of parameters you can set. 299 00:15:48,340 --> 00:15:50,630 Some of them affect the speed. 300 00:15:50,630 --> 00:15:54,050 But we're focused here mostly on parameters 301 00:15:54,050 --> 00:15:57,100 that will affect the quality, the nature of the alignments 302 00:15:57,100 --> 00:15:57,810 that you find. 303 00:15:57,810 --> 00:16:01,730 And so here, you can set not arbitrary penalties, 304 00:16:01,730 --> 00:16:07,480 but you can set within a range of standard mismatch penalties. 305 00:16:07,480 --> 00:16:10,200 You can do 1 minus 1, 1 minus 2, et cetera. 306 00:16:14,940 --> 00:16:19,490 So what about sequences that code for protein? 307 00:16:19,490 --> 00:16:21,230 So exons, for example. 308 00:16:21,230 --> 00:16:27,340 So you can search them with a nucleotide search, like BLAST. 309 00:16:27,340 --> 00:16:31,070 But it can often be the case that you'll 310 00:16:31,070 --> 00:16:35,590 do better if you first translate your exon 311 00:16:35,590 --> 00:16:37,960 into the corresponding amino acid sequence using 312 00:16:37,960 --> 00:16:41,510 a genetic code and then search that peptide. 313 00:16:41,510 --> 00:16:44,440 Now you may or may not know the reading frame of your exon 314 00:16:44,440 --> 00:16:46,550 a priori, or even know that it is an exon, 315 00:16:46,550 --> 00:16:52,025 so BLAST automatically will do this translation for you. 316 00:16:52,025 --> 00:16:53,650 So for example, with this DNA sequence, 317 00:16:53,650 --> 00:16:55,850 it'll translate in all three of the reading frames, 318 00:16:55,850 --> 00:16:59,020 leading to essentially this bag of peptides 319 00:16:59,020 --> 00:17:02,680 here, where sometimes you'll hit a stop code on, like right 320 00:17:02,680 --> 00:17:03,230 here. 321 00:17:03,230 --> 00:17:04,979 And then it just treats it as, OK, there's 322 00:17:04,979 --> 00:17:06,750 a little PR dipeptide there. 323 00:17:06,750 --> 00:17:09,869 And then there's a longer peptide here, [INAUDIBLE], 324 00:17:09,869 --> 00:17:10,849 and so forth. 325 00:17:10,849 --> 00:17:14,609 So it just makes these bags of peptides for each reading frame 326 00:17:14,609 --> 00:17:18,000 and searches all those peptides against some target, which 327 00:17:18,000 --> 00:17:21,770 can be approaching database or a DNA database, 328 00:17:21,770 --> 00:17:23,849 again, translated in all the reading frames. 329 00:17:23,849 --> 00:17:27,890 So the folks at NCBI have made all these different flavors 330 00:17:27,890 --> 00:17:30,060 of BLAST available. 331 00:17:30,060 --> 00:17:31,550 So BLASTP is for proteins. 332 00:17:31,550 --> 00:17:32,430 N is for nucleotides. 333 00:17:32,430 --> 00:17:36,260 And then the translating ones are called things like BLASTX 334 00:17:36,260 --> 00:17:38,830 for a nucleotide query against a protein database. 335 00:17:38,830 --> 00:17:41,870 TBLASTN for a protein query against a nucleotide database, 336 00:17:41,870 --> 00:17:44,140 which gets translated in all frames, or TBLASTX, 337 00:17:44,140 --> 00:17:47,440 where you translate both the nucleotide sequences 338 00:17:47,440 --> 00:17:49,430 in all frames. 339 00:17:49,430 --> 00:17:51,790 And then there's a number of other versions of BLAST 340 00:17:51,790 --> 00:17:54,590 which we probably won't discuss but that are well 341 00:17:54,590 --> 00:17:59,110 described in the textbook and other accessible 342 00:17:59,110 --> 00:18:03,150 online sources. 343 00:18:03,150 --> 00:18:04,370 Let me ask you this. 344 00:18:04,370 --> 00:18:05,990 So remember ESTs. 345 00:18:05,990 --> 00:18:10,610 So ESTs are segments of cDNAs that typically correspond 346 00:18:10,610 --> 00:18:18,690 to one ABI 3700 Sanger sequenc off of that cDNA, 347 00:18:18,690 --> 00:18:22,270 so one read, like 600 bases or so. 348 00:18:22,270 --> 00:18:28,980 So let's say you have some ESTs from chimp. 349 00:18:28,980 --> 00:18:32,070 And you don't have the chimp genome yet. 350 00:18:32,070 --> 00:18:35,360 So you're going to search them against human. 351 00:18:35,360 --> 00:18:37,250 What would you do? 352 00:18:40,420 --> 00:18:43,510 Would you use a translating search? 353 00:18:43,510 --> 00:18:48,711 Or would you use a BLASTN search? 354 00:18:48,711 --> 00:18:49,460 Or does it matter? 355 00:18:56,760 --> 00:19:01,780 Chimp is a 98% identical human, very high. 356 00:19:05,540 --> 00:19:08,000 Any ideas? 357 00:19:08,000 --> 00:19:08,668 Yeah, Tim. 358 00:19:08,668 --> 00:19:11,058 AUDIENCE: You could use a translating search, 359 00:19:11,058 --> 00:19:18,610 because you know that the cDNAs are at least coding for RNAs. 360 00:19:18,610 --> 00:19:26,450 And so if you just use a nucleotide search, 361 00:19:26,450 --> 00:19:30,860 then you're not going to have functional significance 362 00:19:30,860 --> 00:19:32,820 in terms of the alignment. 363 00:19:32,820 --> 00:19:35,790 But if it's going for a protein, then-- 364 00:19:35,790 --> 00:19:38,380 PROFESSOR: You mean you won't know whether it is the protein 365 00:19:38,380 --> 00:19:41,390 coding part of the cDNA or not? 366 00:19:41,390 --> 00:19:43,856 AUDIENCE: So I just mean that if you're 367 00:19:43,856 --> 00:19:45,397 looking between chimp and human, then 368 00:19:45,397 --> 00:19:47,538 you're expecting some sort of mismatch. 369 00:19:47,538 --> 00:19:51,673 But it's possible that it could be a functional mismatch. 370 00:19:51,673 --> 00:19:55,398 Then you know that the cDNA is maybe coding for a protein. 371 00:19:55,398 --> 00:19:58,302 Therefore, if the mismatch is between two 372 00:19:58,302 --> 00:20:01,980 similar amino acids, then that would be picked up 373 00:20:01,980 --> 00:20:05,181 by a translating search, but it would 374 00:20:05,181 --> 00:20:07,014 be skewed against it in a nucleotide search. 375 00:20:07,014 --> 00:20:09,190 PROFESSOR: OK, fair enough. 376 00:20:09,190 --> 00:20:11,770 But if you assume that the two genomes are, 377 00:20:11,770 --> 00:20:15,480 let's say, 97% identical, even in a non-coding region, which 378 00:20:15,480 --> 00:20:16,230 they're very high. 379 00:20:16,230 --> 00:20:18,390 I don't remember the exact percent, but very high. 380 00:20:18,390 --> 00:20:24,250 Then if you're searching 600 nucleotides against the genome, 381 00:20:24,250 --> 00:20:25,870 even if it's 95% identical, you'll 382 00:20:25,870 --> 00:20:27,990 easily find that under either. 383 00:20:27,990 --> 00:20:33,660 So either answer is correct, BLASTN or BLASTX. 384 00:20:33,660 --> 00:20:38,650 And the UTRs could only be found by-- if it happened that this 385 00:20:38,650 --> 00:20:40,270 was a sequence from a three prime UTR, 386 00:20:40,270 --> 00:20:44,100 you could only find that by BLASTN typically. 387 00:20:44,100 --> 00:20:49,451 What if it's a human EST against the mouse genome? 388 00:20:52,160 --> 00:20:54,570 So mouse exons are about 80% identical 389 00:20:54,570 --> 00:20:58,671 to human exons at the nucleotide level, typically. 390 00:20:58,671 --> 00:20:59,310 Any ideas? 391 00:21:03,175 --> 00:21:04,550 What kind of search would you do? 392 00:21:04,550 --> 00:21:07,290 BLASTN, BLASTX, or something else? 393 00:21:10,454 --> 00:21:12,880 TBLASTX. 394 00:21:12,880 --> 00:21:13,631 Yeah, go ahead. 395 00:21:13,631 --> 00:21:15,790 AUDIENCE: I have another question. 396 00:21:15,790 --> 00:21:17,672 What exactly is the kind of question 397 00:21:17,672 --> 00:21:19,912 we're trying to answer by doing this BLAST search? 398 00:21:19,912 --> 00:21:21,370 PROFESSOR: Oh, well, I was assuming 399 00:21:21,370 --> 00:21:27,410 you're just trying to find the closest homologous cDNA 400 00:21:27,410 --> 00:21:33,600 or exons in the genome-- exons, I guess, yeah, the exons. 401 00:21:33,600 --> 00:21:36,755 of the homologous gene. 402 00:21:36,755 --> 00:21:38,085 Yeah, that's a good question. 403 00:21:38,085 --> 00:21:39,210 Exons of a homologous gene. 404 00:21:39,210 --> 00:21:43,410 We've got a human EST going against the mouse genome. 405 00:21:43,410 --> 00:21:44,750 When do we do? 406 00:21:44,750 --> 00:21:49,025 AUDIENCE: I suggest BLASTP because-- 407 00:21:49,025 --> 00:21:51,900 PROFESSOR: Well, BLASTP, that's protein. 408 00:21:51,900 --> 00:21:54,650 This is a nucleotide sequence against nucleotide. 409 00:21:54,650 --> 00:21:59,969 So we can do BLASTN or TBLASTX, let's say. 410 00:21:59,969 --> 00:22:01,009 AUDIENCE: TBLASTX. 411 00:22:01,009 --> 00:22:01,800 PROFESSOR: TBLASTX. 412 00:22:01,800 --> 00:22:05,190 You translate your EST, translate the genome, 413 00:22:05,190 --> 00:22:07,090 search those peptides. 414 00:22:07,090 --> 00:22:10,518 TBLASTX, why? 415 00:22:10,518 --> 00:22:13,003 AUDIENCE: The nucleotide sequences 416 00:22:13,003 --> 00:22:17,476 may be only about 80% similarity, 417 00:22:17,476 --> 00:22:21,452 but the protein sequences functionally, 418 00:22:21,452 --> 00:22:23,937 due to the functional constraints, 419 00:22:23,937 --> 00:22:26,980 you might actually get higher similarities there. 420 00:22:26,980 --> 00:22:27,760 PROFESSOR: Yeah. 421 00:22:27,760 --> 00:22:28,800 It's exactly right. 422 00:22:28,800 --> 00:22:33,241 So they are about, on average, about 80% identical. 423 00:22:33,241 --> 00:22:33,990 It varies by gene. 424 00:22:33,990 --> 00:22:37,070 But a lot of those variations that occur 425 00:22:37,070 --> 00:22:39,227 are at the third side of the codon that 426 00:22:39,227 --> 00:22:41,060 don't effect the amino acid, because there's 427 00:22:41,060 --> 00:22:42,750 a lot of constraint on protein sequence. 428 00:22:42,750 --> 00:22:45,460 And so you'll do better, in general, 429 00:22:45,460 --> 00:22:48,650 with a translating search than with a nucleotide search. 430 00:22:48,650 --> 00:22:50,260 Although, they both may work. 431 00:22:50,260 --> 00:22:53,250 But you may find a more complete match 432 00:22:53,250 --> 00:22:54,900 with a translating search. 433 00:22:54,900 --> 00:22:55,400 That's good. 434 00:22:55,400 --> 00:22:56,237 Everyone got that? 435 00:22:56,237 --> 00:22:56,736 Sally, yeah. 436 00:22:56,736 --> 00:22:59,027 AUDIENCE: Is there a reason why you wouldn't use BLASTX 437 00:22:59,027 --> 00:23:01,214 and instead you use TBLASTX? 438 00:23:01,214 --> 00:23:02,880 PROFESSOR: Yeah, I just gave the example 439 00:23:02,880 --> 00:23:04,213 of searching against the genome. 440 00:23:04,213 --> 00:23:08,642 But you could search against the mouse proteome as well. 441 00:23:08,642 --> 00:23:09,600 You might or might not. 442 00:23:09,600 --> 00:23:11,230 It depends how well annotated that genome is. 443 00:23:11,230 --> 00:23:12,530 Mouse is pretty well annotated. 444 00:23:12,530 --> 00:23:14,321 Almost all the proteins are probably known. 445 00:23:14,321 --> 00:23:15,660 So you probably get it. 446 00:23:15,660 --> 00:23:21,477 But if you were searching against some more obscure 447 00:23:21,477 --> 00:23:23,310 organism, the chameleon genome or something, 448 00:23:23,310 --> 00:23:25,427 and it wasn't well annotated, then you 449 00:23:25,427 --> 00:23:28,010 might do better with searching against the genome, because you 450 00:23:28,010 --> 00:23:31,200 could find a new x on there. 451 00:23:31,200 --> 00:23:32,173 OK, good. 452 00:23:32,173 --> 00:23:33,173 Question yeah, go ahead. 453 00:23:33,173 --> 00:23:36,027 AUDIENCE: So when we do these translations, these nucleotide, 454 00:23:36,027 --> 00:23:38,287 amino acid things, do we get all frames? 455 00:23:38,287 --> 00:23:39,620 Do the algorithms to all frames? 456 00:23:39,620 --> 00:23:41,990 PROFESSOR: Yeah, all six frames. 457 00:23:41,990 --> 00:23:44,220 So three frames on the plus strand, and three frames 458 00:23:44,220 --> 00:23:45,639 on the reverse strand. 459 00:23:45,639 --> 00:23:46,138 Yeah. 460 00:23:49,870 --> 00:23:51,660 All right, great. 461 00:23:51,660 --> 00:23:55,360 So that's the end of local alignment, for the moment. 462 00:23:55,360 --> 00:24:00,245 And we're going to now move on to global alignment using 463 00:24:00,245 --> 00:24:00,870 two algorithms. 464 00:24:04,134 --> 00:24:06,050 For global alignment, Needleman-Wunch-Sellers, 465 00:24:06,050 --> 00:24:09,870 and then for gapped local alignment, Smith-Waterman. 466 00:24:09,870 --> 00:24:11,930 And toward the end, we're going to introduce 467 00:24:11,930 --> 00:24:17,210 the concept of amino acid substitution matrices. 468 00:24:17,210 --> 00:24:20,500 So the background for today, the textbook 469 00:24:20,500 --> 00:24:22,860 does a pretty good job on these topics, 470 00:24:22,860 --> 00:24:25,660 especially the pages indicated are 471 00:24:25,660 --> 00:24:29,970 good for introducing the PAM series of matrices. 472 00:24:29,970 --> 00:24:34,190 We'll talked a little bit today and a little bit next time. 473 00:24:34,190 --> 00:24:36,850 So why would we align protein sequences? 474 00:24:36,850 --> 00:24:44,390 So the most obvious reason is to find homologues that we might, 475 00:24:44,390 --> 00:24:47,360 then, want to investigate, or we might, for example, 476 00:24:47,360 --> 00:24:51,590 if you have a human protein and you find homologous mouse 477 00:24:51,590 --> 00:24:54,510 protein, and that mouse protein has known function from 478 00:24:54,510 --> 00:24:58,010 a knockout or from some biochemical studies, 479 00:24:58,010 --> 00:25:02,274 for example, then you can guess that the human protein will 480 00:25:02,274 --> 00:25:03,190 have similar function. 481 00:25:03,190 --> 00:25:05,340 So we often use this type of inference 482 00:25:05,340 --> 00:25:11,320 that sequence similarity implies similarity in function and/or 483 00:25:11,320 --> 00:25:13,280 structure. 484 00:25:13,280 --> 00:25:14,380 So how true is this? 485 00:25:14,380 --> 00:25:17,990 So it turns out, from a wide body 486 00:25:17,990 --> 00:25:21,530 of literature, that this inference that sequence 487 00:25:21,530 --> 00:25:26,660 similarity implies functional and structural similarity 488 00:25:26,660 --> 00:25:30,820 is almost always true when the sequence similarity is 489 00:25:30,820 --> 00:25:33,710 more than about 30% identity over the whole length 490 00:25:33,710 --> 00:25:37,830 of a protein, over 300, 400 amino acids. 491 00:25:37,830 --> 00:25:40,470 That's a good inference. 492 00:25:40,470 --> 00:25:44,110 Below that, sort of in the 20% to 30% 493 00:25:44,110 --> 00:25:45,650 sequence similarity, that's often 494 00:25:45,650 --> 00:25:47,450 referred to as the Twilight Zone, 495 00:25:47,450 --> 00:25:49,570 where sometimes it's a good inference, 496 00:25:49,570 --> 00:25:51,780 and sometimes it's not. 497 00:25:51,780 --> 00:25:56,410 So you need to be a little bit careful. 498 00:25:56,410 --> 00:26:00,030 And below that, it's deeper into the Twilight Zone, 499 00:26:00,030 --> 00:26:02,500 where most of the time you probably shouldn't trust it. 500 00:26:02,500 --> 00:26:06,500 But occasionally, you can see these very remote homologies. 501 00:26:06,500 --> 00:26:09,820 You might want to have additional information 502 00:26:09,820 --> 00:26:12,420 to support that kind of inference. 503 00:26:12,420 --> 00:26:15,740 And I want to just point out that the converse is just not 504 00:26:15,740 --> 00:26:16,950 true in biology. 505 00:26:16,950 --> 00:26:19,780 So structural similarity does not 506 00:26:19,780 --> 00:26:23,850 imply sequence similarity or even 507 00:26:23,850 --> 00:26:28,550 derivation from a common ancestor. 508 00:26:28,550 --> 00:26:34,070 So you may think, well, every protein 509 00:26:34,070 --> 00:26:39,030 has a really complex, elaborate three dimensional structure, 510 00:26:39,030 --> 00:26:43,844 and there's no way that could ever evolve twice. 511 00:26:43,844 --> 00:26:46,260 And it's true that probably that exact structure can never 512 00:26:46,260 --> 00:26:46,801 evolve twice. 513 00:26:46,801 --> 00:26:49,850 But a very similar structure, a similar fold even, 514 00:26:49,850 --> 00:26:53,745 in terms of the topology of alpha helices 515 00:26:53,745 --> 00:26:55,820 and beta strands, which Professor Frank will 516 00:26:55,820 --> 00:26:59,430 talk about later in the course, the identical fold 517 00:26:59,430 --> 00:27:01,080 can involve more than once. 518 00:27:01,080 --> 00:27:03,450 It's not that hard to evolve a pattern 519 00:27:03,450 --> 00:27:05,960 of alpha helices and beta strands. 520 00:27:05,960 --> 00:27:09,480 And so this point about structural similarity 521 00:27:09,480 --> 00:27:11,950 not implying sequence similarity, 522 00:27:11,950 --> 00:27:17,120 the way I think about it is like this, like here 523 00:27:17,120 --> 00:27:19,260 are two organisms. 524 00:27:19,260 --> 00:27:20,982 This is a hummingbird, you've all seen. 525 00:27:20,982 --> 00:27:22,440 And some of you may have seen this. 526 00:27:22,440 --> 00:27:26,030 This is a hawk moth, which is an insect that 527 00:27:26,030 --> 00:27:29,640 is roughly two inches long, beats its wings very fast, 528 00:27:29,640 --> 00:27:32,770 has a long tongue that sips nectar from flowers. 529 00:27:32,770 --> 00:27:35,870 So it basically occupies the same ecological niche 530 00:27:35,870 --> 00:27:38,109 as a hummingbird, and looks very, very similar 531 00:27:38,109 --> 00:27:39,400 to a hummingbird at a distance. 532 00:27:39,400 --> 00:27:41,890 From 10 or more feet, you often can't tell. 533 00:27:44,859 --> 00:27:46,400 This is an insect, and that's a bird. 534 00:27:46,400 --> 00:27:50,210 The last common ancestor was something 535 00:27:50,210 --> 00:27:53,220 that probably lived 500 million years ago, 536 00:27:53,220 --> 00:27:55,770 and certainly didn't have wings, and may not 537 00:27:55,770 --> 00:27:58,020 have had legs or eyes. 538 00:27:58,020 --> 00:27:59,630 And yet, they've independently evolved 539 00:27:59,630 --> 00:28:01,130 eyes and wings and all these things. 540 00:28:01,130 --> 00:28:04,520 So when there's selective pressure 541 00:28:04,520 --> 00:28:10,890 to evolve something, either a morphology or a protein 542 00:28:10,890 --> 00:28:13,320 structure, for example, evolution 543 00:28:13,320 --> 00:28:16,530 is flexible enough that it can evolve it many, many times. 544 00:28:16,530 --> 00:28:20,380 So here's an example from the protein structure world. 545 00:28:20,380 --> 00:28:23,520 This is homophilous iron binding protein. 546 00:28:23,520 --> 00:28:27,570 This is just the iron coordination center. 547 00:28:27,570 --> 00:28:31,600 And this is now a eukaryotic protein called lactoferrin. 548 00:28:31,600 --> 00:28:34,220 Turns out these guys are homologous. 549 00:28:34,220 --> 00:28:38,810 But eukaryotes and bacteria diverged 2 million years 550 00:28:38,810 --> 00:28:44,770 ago or so, so their ancestry is very, very ancient. 551 00:28:44,770 --> 00:28:48,160 And yet, you can see that in this iron coordination center, 552 00:28:48,160 --> 00:28:51,235 you have a tyrosine pointing into the iron here. 553 00:28:51,235 --> 00:28:54,810 And you have a histidine up here, and so forth. 554 00:28:54,810 --> 00:28:59,240 So the geometry has been highly conserved. 555 00:28:59,240 --> 00:29:00,460 It's not perfectly conserved. 556 00:29:00,460 --> 00:29:02,840 Like, here you have a carboxylate. 557 00:29:02,840 --> 00:29:04,090 And here you have a phosphate. 558 00:29:04,090 --> 00:29:05,715 So there's been a little bit of change. 559 00:29:05,715 --> 00:29:14,160 But overall, this way of coordinating iron 560 00:29:14,160 --> 00:29:16,720 has basically evolved independently. 561 00:29:16,720 --> 00:29:18,620 So although these are homologous, 562 00:29:18,620 --> 00:29:22,920 the last common ancestor bound anions-- 563 00:29:22,920 --> 00:29:24,760 that's known from [INAUDIBLE] construction. 564 00:29:24,760 --> 00:29:29,530 So they independently evolved the ability to bind cations, 565 00:29:29,530 --> 00:29:31,430 like iron. 566 00:29:31,430 --> 00:29:34,300 And here is actually my favorite example. 567 00:29:34,300 --> 00:29:39,450 So here's a protein called ribosome recycling factor. 568 00:29:39,450 --> 00:29:40,340 And that's its shape. 569 00:29:40,340 --> 00:29:42,680 So it's a very unusual shaped protein 570 00:29:42,680 --> 00:29:45,214 that's kind of shaped like an L. 571 00:29:45,214 --> 00:29:47,440 Does this remind anyone of anything, 572 00:29:47,440 --> 00:29:49,150 this particular shape? 573 00:29:49,150 --> 00:29:55,610 Have you seen this in another biomolecule at some point? 574 00:29:55,610 --> 00:29:57,080 AUDIENCE: [INAUDIBLE]. 575 00:29:57,080 --> 00:29:58,663 PROFESSOR: Something like [INAUDIBLE]. 576 00:29:58,663 --> 00:29:59,990 OK, could be. 577 00:29:59,990 --> 00:30:00,780 Any other guesses? 578 00:30:05,037 --> 00:30:06,670 How about this? 579 00:30:06,670 --> 00:30:08,360 That's a tRNA. 580 00:30:08,360 --> 00:30:13,100 So the 3D structure of tRNA is almost identical, 581 00:30:13,100 --> 00:30:16,640 both in terms of the overall shape 582 00:30:16,640 --> 00:30:21,000 and in terms of the geometry. 583 00:30:21,000 --> 00:30:24,190 Sorry, I'm having issues with my animations here. 584 00:30:24,190 --> 00:30:30,900 The geometry of these, they're both about 70 angstroms long. 585 00:30:30,900 --> 00:30:31,650 So why is that? 586 00:30:31,650 --> 00:30:34,930 Why would this protein evolve to have the same three 587 00:30:34,930 --> 00:30:36,890 dimensional shape as a tRNA? 588 00:30:42,710 --> 00:30:44,254 Any ideas? 589 00:30:44,254 --> 00:30:46,190 AUDIENCE: [INAUDIBLE]. 590 00:30:46,190 --> 00:30:47,160 PROFESSOR: [INAUDIBLE]. 591 00:30:47,160 --> 00:30:47,940 Right, exactly. 592 00:30:47,940 --> 00:30:50,430 It fits into the ribosome, and it's 593 00:30:50,430 --> 00:30:55,000 involved, when the ribosome is stalled, 594 00:30:55,000 --> 00:30:56,920 and basically releasing the ribosome. 595 00:30:56,920 --> 00:31:00,210 So it's mimicking a tRNA in terms of structure. 596 00:31:00,210 --> 00:31:03,950 And so the point about this is that, if you 597 00:31:03,950 --> 00:31:06,430 were to take a bunch of biomolecules 598 00:31:06,430 --> 00:31:09,040 and match them up using a structure comparison 599 00:31:09,040 --> 00:31:12,410 algorithm to find similar ones-- these two are clearly similar. 600 00:31:12,410 --> 00:31:15,520 And yet, they probably never had a common ancestor right, 601 00:31:15,520 --> 00:31:18,250 because one's an RNA in one's a protein. 602 00:31:23,220 --> 00:31:23,860 OK. 603 00:31:23,860 --> 00:31:26,100 So now what we're going to talk about a few different types 604 00:31:26,100 --> 00:31:26,683 of alignments. 605 00:31:26,683 --> 00:31:29,280 So we talked about local alignments, 606 00:31:29,280 --> 00:31:32,120 where you don't try to align the entire sequence 607 00:31:32,120 --> 00:31:33,790 of your query or your database. 608 00:31:33,790 --> 00:31:38,020 You just find smaller regions of high similarity. 609 00:31:38,020 --> 00:31:41,810 Global alignment, where you try to align the two 610 00:31:41,810 --> 00:31:43,380 proteins from end to end, you assume 611 00:31:43,380 --> 00:31:46,530 that these two proteins are homologous, 612 00:31:46,530 --> 00:31:51,210 and actually that they haven't had major insertions 613 00:31:51,210 --> 00:31:55,322 or rearrangements of their sequence. 614 00:31:55,322 --> 00:31:56,780 And then semi-global, which is sort 615 00:31:56,780 --> 00:31:59,256 of a little twist on global. 616 00:31:59,256 --> 00:32:01,630 And we'll talk about a few different scoring systems-- so 617 00:32:01,630 --> 00:32:04,150 ungapped, which we've been talking about until now, 618 00:32:04,150 --> 00:32:07,950 and then we'll introduce gaps of two types that 619 00:32:07,950 --> 00:32:10,140 are called linear and affine. 620 00:32:10,140 --> 00:32:14,880 And the nomenclature is a little bit confusing, as you'll see. 621 00:32:14,880 --> 00:32:18,010 They're both linear, in a sense. 622 00:32:18,010 --> 00:32:23,100 So a common way to represent sequence alignments, especially 623 00:32:23,100 --> 00:32:26,370 in the protein alignment-- you can do it for protein or DNA-- 624 00:32:26,370 --> 00:32:29,790 is what's called a dot matrix. 625 00:32:29,790 --> 00:32:31,170 Now we've got two proteins. 626 00:32:31,170 --> 00:32:34,960 They might be 500 amino acids long each, let's say. 627 00:32:34,960 --> 00:32:37,990 You write sequence one along the x-axis, 628 00:32:37,990 --> 00:32:40,400 sequence two along the y-axis. 629 00:32:40,400 --> 00:32:44,152 And then you make a dot in this matrix whenever 630 00:32:44,152 --> 00:32:46,360 they have identical residues, although probably there 631 00:32:46,360 --> 00:32:47,830 would be a lot more dots in this. 632 00:32:47,830 --> 00:32:51,360 So let's say, whenever you have three residues in a row that 633 00:32:51,360 --> 00:32:53,910 are identical-- OK, that's going to occur fairly rarely, 634 00:32:53,910 --> 00:32:56,170 since there's 20 amino acids. 635 00:32:56,170 --> 00:32:57,290 And you make that dot. 636 00:32:57,290 --> 00:33:01,490 And for these two proteins, you don't 637 00:33:01,490 --> 00:33:02,980 get any off diagonal dots. 638 00:33:02,980 --> 00:33:06,470 You just get these three diagonal lines here. 639 00:33:06,470 --> 00:33:11,360 So what does that tell you about the history of these two 640 00:33:11,360 --> 00:33:12,890 proteins? 641 00:33:12,890 --> 00:33:14,300 What's that right there? 642 00:33:17,054 --> 00:33:17,891 Sally. 643 00:33:17,891 --> 00:33:19,349 AUDIENCE: An insertion or deletion. 644 00:33:19,349 --> 00:33:20,849 PROFESSOR: An insertion or deletion. 645 00:33:20,849 --> 00:33:22,232 An insertion in which protein? 646 00:33:22,232 --> 00:33:23,190 AUDIENCE: Sequence two. 647 00:33:23,190 --> 00:33:24,356 PROFESSOR: Or a deletion in? 648 00:33:24,356 --> 00:33:24,978 AUDIENCE: One. 649 00:33:24,978 --> 00:33:25,806 PROFESSOR: OK. 650 00:33:25,806 --> 00:33:27,490 Everyone got that? 651 00:33:27,490 --> 00:33:28,350 OK, good. 652 00:33:28,350 --> 00:33:32,010 There's extra sequence in sequence two here 653 00:33:32,010 --> 00:33:33,362 that's not in sequence one. 654 00:33:33,362 --> 00:33:35,570 You don't know whether it's an insertion or deletion. 655 00:33:35,570 --> 00:33:37,880 It could be either one, based on this information. 656 00:33:37,880 --> 00:33:41,030 Sometimes you can figure that out from other information. 657 00:33:41,030 --> 00:33:44,050 So sometimes you call that an indel-- insertion or deletion. 658 00:33:44,050 --> 00:33:48,860 And then, what is this down here? 659 00:33:48,860 --> 00:33:50,850 Someone else? 660 00:33:50,850 --> 00:33:53,880 Insertion, I heard, insertion in sequence one or deletion 661 00:33:53,880 --> 00:33:55,010 in sequence two. 662 00:33:55,010 --> 00:33:58,100 OK, good. 663 00:33:58,100 --> 00:33:59,810 All right, so what type of alignment 664 00:33:59,810 --> 00:34:03,080 would be most appropriate for this pair sequences, 665 00:34:03,080 --> 00:34:04,345 a local or a global? 666 00:34:10,662 --> 00:34:12,646 AUDIENCE: I would do global, because they're 667 00:34:12,646 --> 00:34:14,964 very, very similar. [INAUDIBLE]. 668 00:34:14,964 --> 00:34:15,630 PROFESSOR: Yeah. 669 00:34:15,630 --> 00:34:19,150 They are quite similar across their entire lengths, 670 00:34:19,150 --> 00:34:21,850 just with these two major indels. 671 00:34:21,850 --> 00:34:24,920 So that's sort of the classical case where 672 00:34:24,920 --> 00:34:27,060 you want to do the global alignment. 673 00:34:27,060 --> 00:34:27,960 All right. 674 00:34:27,960 --> 00:34:31,337 So what about these two proteins? 675 00:34:31,337 --> 00:34:32,920 Based on this dot matrix, what can you 676 00:34:32,920 --> 00:34:35,570 say about the relation between these two, 677 00:34:35,570 --> 00:34:38,179 and what type of alignment would you 678 00:34:38,179 --> 00:34:42,652 want to use when comparing these two proteins? 679 00:34:42,652 --> 00:34:43,610 Yeah, what's your name? 680 00:34:43,610 --> 00:34:44,101 AUDIENCE: Sonia. 681 00:34:44,101 --> 00:34:45,083 PROFESSOR: Go ahead, Sonia. 682 00:34:45,083 --> 00:34:47,541 AUDIENCE: It looks like they've got similar domains, maybe. 683 00:34:47,541 --> 00:34:50,560 So local alignment might be better. 684 00:34:50,560 --> 00:34:52,810 PROFESSOR: And why wouldn't you do a global alignment? 685 00:34:52,810 --> 00:34:55,970 AUDIENCE: Local, because the local alignment might actually 686 00:34:55,970 --> 00:34:57,972 find those domains and tell you what they are. 687 00:34:57,972 --> 00:34:59,930 PROFESSOR: So a local alignment should at least 688 00:34:59,930 --> 00:35:03,700 find these two guys here. 689 00:35:03,700 --> 00:35:06,900 And why do these two parallel diagonal 690 00:35:06,900 --> 00:35:08,689 lines, what does that tell you? 691 00:35:08,689 --> 00:35:10,934 AUDIENCE: That the two different proteins 692 00:35:10,934 --> 00:35:13,679 have similar sequences, just in different parts of the protein, 693 00:35:13,679 --> 00:35:16,180 different areas relative to the start. 694 00:35:16,180 --> 00:35:16,960 PROFESSOR: Right. 695 00:35:16,960 --> 00:35:18,134 Yeah, go ahead. 696 00:35:18,134 --> 00:35:19,800 AUDIENCE: Doesn't it just basically mean 697 00:35:19,800 --> 00:35:22,315 that there's a section in sequence two 698 00:35:22,315 --> 00:35:23,880 that's in sequence one twice? 699 00:35:23,880 --> 00:35:25,100 PROFESSOR: Yeah, exactly. 700 00:35:25,100 --> 00:35:31,830 So this segment of sequence two, here-- sorry, having trouble, 701 00:35:31,830 --> 00:35:35,680 there we go, that apart-- is present twice in sequence one. 702 00:35:35,680 --> 00:35:39,590 It's present once from about here over to here, 703 00:35:39,590 --> 00:35:41,970 and then it's present once from here over to here. 704 00:35:41,970 --> 00:35:43,380 So it's repeated. 705 00:35:43,380 --> 00:35:48,580 So repeats and things like that will 706 00:35:48,580 --> 00:35:50,080 confuse your global alignment. 707 00:35:50,080 --> 00:35:55,140 The global alignment needs to align each residue-- or trying 708 00:35:55,140 --> 00:35:56,750 to align each residue in protein one 709 00:35:56,750 --> 00:35:58,200 to each residue in protein two. 710 00:35:58,200 --> 00:35:59,660 And here, it's ambiguous. 711 00:35:59,660 --> 00:36:01,920 It's not clear which part of sequence one 712 00:36:01,920 --> 00:36:03,670 to align to that part of sequence two. 713 00:36:03,670 --> 00:36:04,800 So it'll get confused. 714 00:36:04,800 --> 00:36:06,170 It'll choose one or the other. 715 00:36:06,170 --> 00:36:07,830 But that may be wrong, and that really 716 00:36:07,830 --> 00:36:09,730 doesn't capture what actually happens. 717 00:36:09,730 --> 00:36:12,460 So yeah, so here a local alignment 718 00:36:12,460 --> 00:36:13,461 would be more suitable. 719 00:36:13,461 --> 00:36:13,960 Good. 720 00:36:16,630 --> 00:36:20,080 So let's talk now about gaps, again, which 721 00:36:20,080 --> 00:36:22,710 can be called indels. 722 00:36:22,710 --> 00:36:26,100 In protein sequence alignments, or DNA, which many of you 723 00:36:26,100 --> 00:36:29,850 have probably seen, you often use maybe just 724 00:36:29,850 --> 00:36:32,150 a dash to represent a gap. 725 00:36:32,150 --> 00:36:33,870 So in this alignment here, you can 726 00:36:33,870 --> 00:36:36,180 see that's kind of a reasonable alignment, right? 727 00:36:36,180 --> 00:36:38,960 You've got pretty good matching on both sides. 728 00:36:38,960 --> 00:36:44,750 But there's nothing in the second sequence that 729 00:36:44,750 --> 00:36:46,960 matches the RG in the first sequence. 730 00:36:46,960 --> 00:36:50,200 So that would be a reasonable alignment of those two. 731 00:36:50,200 --> 00:36:53,880 And so what's often used is what's 732 00:36:53,880 --> 00:36:55,420 called a linear gap penalty. 733 00:36:55,420 --> 00:36:59,500 So if you have end gaps, like in this case two, 734 00:36:59,500 --> 00:37:02,580 you assign a gap penalty A, let's say. 735 00:37:02,580 --> 00:37:06,330 And A is a negative number. 736 00:37:06,330 --> 00:37:10,950 And then you can just run the same kinds of algorithms, where 737 00:37:10,950 --> 00:37:13,846 you add up matches, penalize mismatches, 738 00:37:13,846 --> 00:37:15,470 but then you have an additional penalty 739 00:37:15,470 --> 00:37:19,540 you apply when you introduce a gap. 740 00:37:19,540 --> 00:37:22,080 And typically, the gap penalty is 741 00:37:22,080 --> 00:37:26,656 more severe than your average mismatch. 742 00:37:26,656 --> 00:37:28,030 But there's really no theory that 743 00:37:28,030 --> 00:37:31,190 says exactly how the gap penalty should be chosen. 744 00:37:31,190 --> 00:37:33,935 But empirically, in cases where you 745 00:37:33,935 --> 00:37:35,560 should know the answer, where you have, 746 00:37:35,560 --> 00:37:37,260 for example, a structural alignment, 747 00:37:37,260 --> 00:37:40,382 you can often find that a gap penalty that's 748 00:37:40,382 --> 00:37:42,090 bigger than your average mismatch penalty 749 00:37:42,090 --> 00:37:46,226 is usually the right thing to do. 750 00:37:46,226 --> 00:37:47,100 So why would that be? 751 00:37:47,100 --> 00:37:51,400 Why would a gap penalty-- why would you want to set it 752 00:37:51,400 --> 00:37:53,740 larger than a typical mismatch? 753 00:37:53,740 --> 00:37:54,480 Any ideas? 754 00:37:54,480 --> 00:37:55,162 Yeah, what's your name? 755 00:37:55,162 --> 00:37:55,995 AUDIENCE: I'm Chris. 756 00:37:55,995 --> 00:37:57,050 PROFESSOR: Chris. 757 00:37:57,050 --> 00:38:03,970 AUDIENCE: Because having mutations that shift the frame 758 00:38:03,970 --> 00:38:07,730 or that one insert would have insertions or deletions 759 00:38:07,730 --> 00:38:14,239 is far more uncommon than just having changing [INAUDIBLE]. 760 00:38:14,239 --> 00:38:16,780 PROFESSOR: Mutations that create insertions and deletions are 761 00:38:16,780 --> 00:38:19,650 less common than those that introduce substitutions 762 00:38:19,650 --> 00:38:20,800 of residues. 763 00:38:20,800 --> 00:38:22,780 Everyone got that? 764 00:38:22,780 --> 00:38:23,730 That's true. 765 00:38:23,730 --> 00:38:27,630 And do you know by what factor? 766 00:38:27,630 --> 00:38:29,940 AUDIENCE: Oh, I couldn't give you a number. 767 00:38:29,940 --> 00:38:32,795 PROFESSOR: So I mean, this varies by organism, 768 00:38:32,795 --> 00:38:36,910 and It varies by what type of insertion you're looking at. 769 00:38:36,910 --> 00:38:39,450 But even at the single nucleotide level, 770 00:38:39,450 --> 00:38:42,160 having insertions is about an order 771 00:38:42,160 --> 00:38:44,700 of magnitude less common than having 772 00:38:44,700 --> 00:38:46,450 a substitution in those lineages. 773 00:38:46,450 --> 00:38:49,320 And here, in order to get an amino acid insertion, 774 00:38:49,320 --> 00:38:52,960 you actually have to have a triplet insertion, three or six 775 00:38:52,960 --> 00:38:56,680 or some multiple of three into the exon. 776 00:38:56,680 --> 00:38:58,520 And that's quite a bit less common. 777 00:38:58,520 --> 00:39:00,050 So they occur less commonly. 778 00:39:00,050 --> 00:39:03,460 A mutation occurs less commonly, and therefore the mutation 779 00:39:03,460 --> 00:39:08,250 is actually accepted by evolution even less commonly. 780 00:39:08,250 --> 00:39:13,850 And an alternative is a so-called affine gap penalty, 781 00:39:13,850 --> 00:39:18,440 which is defined as G plus n lambda. 782 00:39:18,440 --> 00:39:21,590 So n is the number of gaps, and then G 783 00:39:21,590 --> 00:39:24,030 is what's called a gap opening penalty. 784 00:39:24,030 --> 00:39:27,860 So the idea here is that basically the gaps 785 00:39:27,860 --> 00:39:30,930 tend to cluster. 786 00:39:30,930 --> 00:39:37,210 So having an insertion is a rare thing. 787 00:39:37,210 --> 00:39:39,650 You penalize that with G. But then, 788 00:39:39,650 --> 00:39:41,910 if you're going to have an insertion, 789 00:39:41,910 --> 00:39:44,250 sometimes you'll have a big insertion of two or three 790 00:39:44,250 --> 00:39:47,000 or four codons. 791 00:39:47,000 --> 00:39:51,840 A four codon insertion should not be penalized twice as much 792 00:39:51,840 --> 00:39:57,690 as a two codon insertion, because only one gap actually 793 00:39:57,690 --> 00:39:58,190 occurred. 794 00:39:58,190 --> 00:39:59,815 And when you have this insertion event, 795 00:39:59,815 --> 00:40:04,530 it can any variety of sizes. 796 00:40:04,530 --> 00:40:06,520 You still penalize more for a bigger gap 797 00:40:06,520 --> 00:40:10,910 than for a smaller gap, but it's no longer linear. 798 00:40:10,910 --> 00:40:12,530 I mean, it's still a linear function, 799 00:40:12,530 --> 00:40:14,830 just with this constant thing added. 800 00:40:14,830 --> 00:40:18,350 So these are the two common types of gap penalties 801 00:40:18,350 --> 00:40:20,000 that you'll see in the literature. 802 00:40:20,000 --> 00:40:22,570 The affine works a little bit better, 803 00:40:22,570 --> 00:40:27,180 but it is a little bit more complicated to implement. 804 00:40:27,180 --> 00:40:31,860 So sometimes you'll see both of them used in practice. 805 00:40:31,860 --> 00:40:35,350 And then, of course, by changing your definition of gamma, 806 00:40:35,350 --> 00:40:37,550 you could have a G plus n minus 1. 807 00:40:37,550 --> 00:40:42,750 So that first gap would be G, and then all 808 00:40:42,750 --> 00:40:44,660 the subsequent gaps would gamma. 809 00:40:44,660 --> 00:40:49,000 So you're not going to have to double score something. 810 00:40:49,000 --> 00:40:49,580 All right. 811 00:40:49,580 --> 00:40:50,080 OK. 812 00:40:53,440 --> 00:40:54,440 You've got two proteins. 813 00:40:54,440 --> 00:40:59,620 How do you actually find the optimal global alignment? 814 00:40:59,620 --> 00:41:03,270 Any ideas on how to do this? 815 00:41:03,270 --> 00:41:06,450 So we can write one sequence down one axis, one down 816 00:41:06,450 --> 00:41:07,550 the other axis. 817 00:41:07,550 --> 00:41:09,650 We can make this dot plot. 818 00:41:09,650 --> 00:41:13,330 The dot plot can give us some ideas about what's going on. 819 00:41:13,330 --> 00:41:16,840 But how do we actually find the optimal one 820 00:41:16,840 --> 00:41:20,880 where we want to start from the beginning? 821 00:41:20,880 --> 00:41:23,750 In the end, we're going to write the two sequences 822 00:41:23,750 --> 00:41:24,960 one above the other. 823 00:41:24,960 --> 00:41:28,845 And if the first residue or the first sequence is n, 824 00:41:28,845 --> 00:41:31,090 and maybe we'll align it to here, 825 00:41:31,090 --> 00:41:35,580 then we have to write the entire sequence here 826 00:41:35,580 --> 00:41:37,950 all the way down to the end. 827 00:41:37,950 --> 00:41:40,830 And below it has to be either a residue 828 00:41:40,830 --> 00:41:43,090 in sequence two or a gap. 829 00:41:43,090 --> 00:41:45,050 And again, we can have gaps up here. 830 00:41:45,050 --> 00:41:47,307 So you have to do something. 831 00:41:47,307 --> 00:41:49,890 You have to make it all the way from the beginning to the end. 832 00:41:49,890 --> 00:41:52,450 And we're just going to sum the scores of all the matching 833 00:41:52,450 --> 00:41:55,970 residues, of all the mismatching residues, and of all the gaps. 834 00:41:55,970 --> 00:41:58,130 How do we find that alignment? 835 00:42:02,570 --> 00:42:03,070 Chris. 836 00:42:03,070 --> 00:42:07,116 AUDIENCE: Well, since we're using dynamic programming, 837 00:42:07,116 --> 00:42:09,266 I'm guessing that you're going to have to fill out 838 00:42:09,266 --> 00:42:13,280 a matrix of some sort and backtrack. 839 00:42:13,280 --> 00:42:16,610 PROFESSOR: And so when you see the term dynamic programming, 840 00:42:16,610 --> 00:42:18,890 what does that mean to you? 841 00:42:18,890 --> 00:42:23,490 AUDIENCE: You're going to find solutions to sub problems 842 00:42:23,490 --> 00:42:26,190 until you find a smaller solution. 843 00:42:26,190 --> 00:42:27,956 Then you'd backtrack through what 844 00:42:27,956 --> 00:42:31,400 you've solved so far to find the global sequence. 845 00:42:31,400 --> 00:42:32,170 PROFESSOR: Good. 846 00:42:32,170 --> 00:42:33,628 That's a good way of describing it. 847 00:42:33,628 --> 00:42:35,770 So what smaller problems are you going 848 00:42:35,770 --> 00:42:37,916 to break this large problem into? 849 00:42:37,916 --> 00:42:39,415 AUDIENCE: The smaller sub-sequences. 850 00:42:43,875 --> 00:42:45,500 PROFESSOR: Which smaller sub-sequences? 851 00:42:49,285 --> 00:42:49,830 Anyone else? 852 00:42:49,830 --> 00:42:52,690 You are definitely on the right track here. 853 00:42:52,690 --> 00:42:54,780 Go ahead. 854 00:42:54,780 --> 00:42:57,720 AUDIENCE: I mean, it says at the top there, 855 00:42:57,720 --> 00:43:00,660 one sequence across the top and one down the side. 856 00:43:00,660 --> 00:43:03,600 You could start with just the gap versus the sequence 857 00:43:03,600 --> 00:43:08,070 and say your gap will increase as you move across. 858 00:43:08,070 --> 00:43:11,620 Basically, each cell there could be filled out with information 859 00:43:11,620 --> 00:43:14,010 from some of its neighbors. 860 00:43:14,010 --> 00:43:15,925 So you want to make sure that you fill out 861 00:43:15,925 --> 00:43:19,430 old cells in some order so that we 862 00:43:19,430 --> 00:43:21,658 can proceed to the next level with what we've 863 00:43:21,658 --> 00:43:23,570 [? written down. ?] 864 00:43:23,570 --> 00:43:25,260 PROFESSOR: So if you had precisely 865 00:43:25,260 --> 00:43:30,270 to find a sub problem where you could see what the answer is, 866 00:43:30,270 --> 00:43:34,060 and then a slightly larger sub problem whose solution would 867 00:43:34,060 --> 00:43:38,459 build on the solution that first one, where would you start? 868 00:43:38,459 --> 00:43:40,125 What would be your smallest sub problem? 869 00:43:40,125 --> 00:43:42,336 AUDIENCE: I'd start with the top row, 870 00:43:42,336 --> 00:43:45,240 because you could just the gap versus gap, 871 00:43:45,240 --> 00:43:49,112 and then move in the row, because you don't need anything 872 00:43:49,112 --> 00:43:51,540 above that. 873 00:43:51,540 --> 00:43:56,370 PROFESSOR: And then what's the smallest actual problem 874 00:43:56,370 --> 00:44:01,460 where you actually have parts of the protein aligned? 875 00:44:01,460 --> 00:44:04,250 AUDIENCE: One row in column two, basically. 876 00:44:04,250 --> 00:44:06,725 If it's a match, you have some score. 877 00:44:06,725 --> 00:44:10,685 And if it's a mismatch, you have some other score. 878 00:44:10,685 --> 00:44:16,130 And you want the best possible one in each block. 879 00:44:16,130 --> 00:44:19,360 PROFESSOR: Yeah, OK. 880 00:44:19,360 --> 00:44:20,440 Yeah. 881 00:44:20,440 --> 00:44:21,770 That's good. 882 00:44:21,770 --> 00:44:26,290 So just to generalize this-- hopefully this 883 00:44:26,290 --> 00:44:32,480 is blank-- in general, you could think about we've got, 884 00:44:32,480 --> 00:44:37,720 let's say, 1 to n here, and a sequence 1 to n here. 885 00:44:37,720 --> 00:44:42,710 You could think about a position i here and a position j here. 886 00:44:42,710 --> 00:44:48,790 And we could say finding the optimal global alignment, 887 00:44:48,790 --> 00:44:49,839 that's a big problem. 888 00:44:49,839 --> 00:44:50,630 That's complicated. 889 00:44:50,630 --> 00:44:54,920 But finding an alignment of just the sequence from 1 890 00:44:54,920 --> 00:44:57,440 to i in the first protein against the sequence 891 00:44:57,440 --> 00:44:59,650 from 1 to j in the second protein, 892 00:44:59,650 --> 00:45:01,310 that could be pretty easy. 893 00:45:01,310 --> 00:45:05,150 If i is 2 and j is 2, you've got a dipeptide 894 00:45:05,150 --> 00:45:06,265 against a dipeptide. 895 00:45:06,265 --> 00:45:07,890 You could actually try all combinations 896 00:45:07,890 --> 00:45:10,790 and get the optimal alignment there. 897 00:45:10,790 --> 00:45:15,800 And so the idea, then, is if you can record those optimal scores 898 00:45:15,800 --> 00:45:20,010 here in this matrix, then you could build out, for example, 899 00:45:20,010 --> 00:45:24,000 like this, and find the optimal alignments 900 00:45:24,000 --> 00:45:28,280 of increasingly bigger sub problems 901 00:45:28,280 --> 00:45:32,626 where you add another residue in each direction, for example. 902 00:45:32,626 --> 00:45:34,710 Does that makes sense to you? 903 00:45:34,710 --> 00:45:37,260 The idea of a dynamic programming algorithm 904 00:45:37,260 --> 00:45:39,925 is it's a form of recursive optimization. 905 00:45:39,925 --> 00:45:43,420 So you first optimize something small, 906 00:45:43,420 --> 00:45:46,040 and then you optimize something bigger 907 00:45:46,040 --> 00:45:49,560 using the solution you got from that smaller piece. 908 00:45:49,560 --> 00:45:52,480 And the way that that's done for protein sequences 909 00:45:52,480 --> 00:45:58,050 in Neeleman-Wunsch is to, as we were saying, 910 00:45:58,050 --> 00:46:02,000 first consider that there might be a gap in one aligning 911 00:46:02,000 --> 00:46:04,310 to a residue in the other. 912 00:46:04,310 --> 00:46:08,260 So we need to put these gaps down across the top 913 00:46:08,260 --> 00:46:10,090 and down the side. 914 00:46:13,140 --> 00:46:16,580 This is a linear gap penalty, for example. 915 00:46:16,580 --> 00:46:19,900 And so here would be how you start. 916 00:46:19,900 --> 00:46:24,170 And this is a gap penalty, obviously, of minus 8. 917 00:46:24,170 --> 00:46:28,060 So if you're the optimal solution that 918 00:46:28,060 --> 00:46:36,800 begins with this V in the top sequence aligned 919 00:46:36,800 --> 00:46:42,826 to this gap in the vertical sequence, 920 00:46:42,826 --> 00:46:44,450 there's one gap there, so it's minus 8. 921 00:46:44,450 --> 00:46:49,690 And then if you want to start with two gaps against this V 922 00:46:49,690 --> 00:46:51,600 and D, then it's minus 16. 923 00:46:51,600 --> 00:46:54,310 So that's how you would start it. 924 00:46:54,310 --> 00:46:58,040 So you start with these problems where there's no options. 925 00:46:58,040 --> 00:47:01,220 If you have two gaps against two residues, that's minus 16. 926 00:47:01,220 --> 00:47:04,780 By our scoring system, it's unambiguous. 927 00:47:04,780 --> 00:47:08,230 So you just can fill those in. 928 00:47:08,230 --> 00:47:10,070 And then you can start thinking about, 929 00:47:10,070 --> 00:47:16,920 what do we put right here? 930 00:47:16,920 --> 00:47:21,140 What score should we put right there? 931 00:47:21,140 --> 00:47:24,870 Remember, we're defining the entries in this matrix 932 00:47:24,870 --> 00:47:31,740 as the optimal score of the sub-sequence of the top protein 933 00:47:31,740 --> 00:47:35,445 up to position i against the vertical protein up 934 00:47:35,445 --> 00:47:36,440 to position j. 935 00:47:36,440 --> 00:47:38,990 So that would be the top protein position 936 00:47:38,990 --> 00:47:42,580 one up to the vertical protein position one. 937 00:47:42,580 --> 00:47:45,150 What score would that be? 938 00:47:45,150 --> 00:47:48,835 What's the optical alignment there 939 00:47:48,835 --> 00:47:50,640 that ends position one both sequences? 940 00:47:53,010 --> 00:47:54,510 It'll depend on your scoring system. 941 00:47:54,510 --> 00:47:57,450 But for a reasonable scoring system, that's a match. 942 00:47:57,450 --> 00:47:59,220 That's going to get some positive score. 943 00:47:59,220 --> 00:48:02,650 That's going to be better than anything involving a gap in one 944 00:48:02,650 --> 00:48:04,990 against a gap in the other or something crazy like that. 945 00:48:04,990 --> 00:48:08,680 So that's going to get whatever your VV match score is. 946 00:48:11,860 --> 00:48:15,140 This is your Sij from your scoring matrix 947 00:48:15,140 --> 00:48:18,390 for your different amino acids. 948 00:48:18,390 --> 00:48:26,030 And then, basically the way that this is done 949 00:48:26,030 --> 00:48:35,470 is to consider that when you're matching that position 950 00:48:35,470 --> 00:48:39,170 one against position one, you might have come from a gap 951 00:48:39,170 --> 00:48:42,250 before in one sequence or a gap in the other sequence, 952 00:48:42,250 --> 00:48:45,190 or from a match position in the other sequence. 953 00:48:45,190 --> 00:48:47,830 And that leads to these three arrows. 954 00:48:47,830 --> 00:48:52,630 I think it gets clear if I write up the whole algorithm here. 955 00:48:52,630 --> 00:48:57,010 So Sij is the score of the optimal alignment ending 956 00:48:57,010 --> 00:49:00,630 at position i in sequence one and position j in sequence two. 957 00:49:00,630 --> 00:49:04,470 Requires that we know what's above, to the left, 958 00:49:04,470 --> 00:49:06,280 and diagonally above. 959 00:49:06,280 --> 00:49:12,570 And you solve it from the top and left down 960 00:49:12,570 --> 00:49:15,810 to the bottom and right, which is often 961 00:49:15,810 --> 00:49:19,490 called dynamic programming. 962 00:49:19,490 --> 00:49:23,380 And let's just look at what the recursion is. 963 00:49:23,380 --> 00:49:27,780 So Needleman and Wunsch basically 964 00:49:27,780 --> 00:49:32,470 observed that you could find this optimal global alignment 965 00:49:32,470 --> 00:49:38,390 score by filling in the matrix by at each point taking 966 00:49:38,390 --> 00:49:42,500 the maximum of these three scores here. 967 00:49:42,500 --> 00:49:44,860 So you take the maximum of the score 968 00:49:44,860 --> 00:49:51,660 that you had above and to the left, so diagonally above, plus 969 00:49:51,660 --> 00:49:55,090 sigma of xi yj. 970 00:49:55,090 --> 00:49:57,420 Sigma, in this case, is the scoring matrix 971 00:49:57,420 --> 00:50:00,040 that you're using that's 20 by 20 that 972 00:50:00,040 --> 00:50:04,990 scores each amino acid against each other amino acid residue. 973 00:50:04,990 --> 00:50:07,970 You add that score if you're going to move diagonally 974 00:50:07,970 --> 00:50:09,720 to whatever the optimal score was there, 975 00:50:09,720 --> 00:50:14,710 or if you're moving to the right or down, you're 976 00:50:14,710 --> 00:50:17,290 adding a gap in one sequence or the other. 977 00:50:17,290 --> 00:50:20,200 So you have to add A, which is this gap penalty, which 978 00:50:20,200 --> 00:50:23,150 is a negative number, to whatever the optimal alignment 979 00:50:23,150 --> 00:50:25,890 was before. 980 00:50:25,890 --> 00:50:29,330 I think it's maybe easier if we do an example here. 981 00:50:29,330 --> 00:50:34,050 So here is the PAM250 scoring matrix. 982 00:50:34,050 --> 00:50:37,866 So this was actually developed by Dayhoff back in the '70s. 983 00:50:37,866 --> 00:50:39,240 This might be an updated version, 984 00:50:39,240 --> 00:50:43,150 but it's more or less the same as the original. 985 00:50:43,150 --> 00:50:46,110 Notice, it's a triangular matrix. 986 00:50:46,110 --> 00:50:48,508 Why is that? 987 00:50:48,508 --> 00:50:50,215 AUDIENCE: It's symmetric. 988 00:50:50,215 --> 00:50:51,590 PROFESSOR: It's symmetric, right. 989 00:50:51,590 --> 00:50:52,530 So it has a diagonal. 990 00:50:52,530 --> 00:50:54,780 But then everything below the diagonal, 991 00:50:54,780 --> 00:50:56,942 it would be mirrored above the diagonal, 992 00:50:56,942 --> 00:50:57,900 because it's symmetric. 993 00:50:57,900 --> 00:51:01,660 Because you don't know when you see a valine matched 994 00:51:01,660 --> 00:51:05,760 to a leucine, it's the same as a leucine matched to a valine, 995 00:51:05,760 --> 00:51:09,160 because it's a symmetrical definition of scoring. 996 00:51:09,160 --> 00:51:12,360 And here are two relevant scores. 997 00:51:12,360 --> 00:51:18,080 So notice that VV has a score of plus 4 in this matrix. 998 00:51:18,080 --> 00:51:23,780 And over here, VD has a score of minus 2. 999 00:51:23,780 --> 00:51:25,420 So I'll just write those down. 1000 00:51:32,220 --> 00:51:36,814 Anyone notice anything else interesting about this matrix? 1001 00:51:36,814 --> 00:51:39,480 We haven't said exactly where it comes from, but we're going to. 1002 00:51:39,480 --> 00:51:40,250 Yeah, what's your name? 1003 00:51:40,250 --> 00:51:41,170 AUDIENCE: Michael. 1004 00:51:41,170 --> 00:51:42,090 PROFESSOR: Go ahead. 1005 00:51:42,090 --> 00:51:43,770 AUDIENCE: Not all the diagonal values are the same. 1006 00:51:43,770 --> 00:51:45,519 PROFESSOR: Not all diagonals are the same. 1007 00:51:45,519 --> 00:51:51,810 In fact, there's a pretty big range, from 2 up to 17, 1008 00:51:51,810 --> 00:51:52,850 so a big range. 1009 00:51:52,850 --> 00:51:55,420 And anything else? 1010 00:51:55,420 --> 00:51:56,800 OK, I'm sorry, go ahead. 1011 00:51:56,800 --> 00:51:57,672 What's your name? 1012 00:51:57,672 --> 00:51:58,380 AUDIENCE: Tagius. 1013 00:51:58,380 --> 00:51:58,786 PROFESSOR: Tagius, yeah. 1014 00:51:58,786 --> 00:51:58,990 Go ahead. 1015 00:51:58,990 --> 00:52:00,448 AUDIENCE: There are positive values 1016 00:52:00,448 --> 00:52:01,850 for things that are not the same? 1017 00:52:01,850 --> 00:52:02,610 PROFESSOR: Yeah. 1018 00:52:02,610 --> 00:52:05,420 So all the diagonal terms are positive. 1019 00:52:05,420 --> 00:52:09,440 So a match of any particular residue type 1020 00:52:09,440 --> 00:52:11,500 to its identical residue is always 1021 00:52:11,500 --> 00:52:13,790 scored positively, but with varying scores. 1022 00:52:13,790 --> 00:52:16,310 And there are also some positive scores in the off diagonal. 1023 00:52:16,310 --> 00:52:19,290 And where are those positive scores occurring? 1024 00:52:19,290 --> 00:52:22,670 Notice they tend to be to nearby residues. 1025 00:52:22,670 --> 00:52:27,710 And notice the order of residues is not alphabetical. 1026 00:52:27,710 --> 00:52:31,550 So someone who knows a lot about amino acids, 1027 00:52:31,550 --> 00:52:36,670 what can you see about these scores? 1028 00:52:36,670 --> 00:52:38,390 Yeah, go ahead. 1029 00:52:38,390 --> 00:52:40,840 AUDIENCE: I think these amino acids [INAUDIBLE] 1030 00:52:40,840 --> 00:52:42,800 based on their [INAUDIBLE]. 1031 00:52:47,704 --> 00:52:49,870 PROFESSOR: So the comment was that the residues have 1032 00:52:49,870 --> 00:52:53,339 been grouped by similar chemistry of their side chains. 1033 00:52:53,339 --> 00:52:54,380 And that's exactly right. 1034 00:52:54,380 --> 00:52:58,205 So the basic residues, histidine, arginine, 1035 00:52:58,205 --> 00:53:00,160 and lysine, are all together. 1036 00:53:00,160 --> 00:53:02,520 The acidic residues, aspartate and glutamate, 1037 00:53:02,520 --> 00:53:06,330 are here, along with asparagine and glutamine. 1038 00:53:06,330 --> 00:53:13,380 And notice that D to E has a positive score here. 1039 00:53:13,380 --> 00:53:15,450 It's 3. 1040 00:53:15,450 --> 00:53:19,650 It's almost as good as D to D or E to E, which are plus 4. 1041 00:53:19,650 --> 00:53:23,770 So recognizing that you can often substitute in evolution 1042 00:53:23,770 --> 00:53:25,880 an aspartate for a glutamate. 1043 00:53:25,880 --> 00:53:30,530 So yeah, so it basically, to some extent, 1044 00:53:30,530 --> 00:53:34,420 is scoring for similar chemistry. 1045 00:53:34,420 --> 00:53:38,430 But that doesn't explain why, on the diagonal, 1046 00:53:38,430 --> 00:53:41,510 you have such a large range of values. 1047 00:53:41,510 --> 00:53:44,530 Why is a tryptophan more like a tryptophan 1048 00:53:44,530 --> 00:53:47,490 than a serine is like a serine. 1049 00:53:47,490 --> 00:53:48,832 Tim, you want to comment? 1050 00:53:48,832 --> 00:53:52,720 AUDIENCE: Perhaps it's because tryptophans occur very rarely 1051 00:53:52,720 --> 00:53:55,636 in all proteins [INAUDIBLE]. 1052 00:53:55,636 --> 00:54:00,010 So if you've got two [INAUDIBLE], 1053 00:54:00,010 --> 00:54:03,412 that's a lot rarer of an occurrence and [INAUDIBLE]. 1054 00:54:05,350 --> 00:54:07,850 PROFESSOR: So Tim's point was that tryptophans occur rarely, 1055 00:54:07,850 --> 00:54:10,960 so when you see two tryptophans aligned, 1056 00:54:10,960 --> 00:54:12,682 you should take note of it. 1057 00:54:12,682 --> 00:54:13,890 It can anchor your alignment. 1058 00:54:13,890 --> 00:54:15,731 You can be more confident in that. 1059 00:54:15,731 --> 00:54:16,230 Sally. 1060 00:54:16,230 --> 00:54:18,700 AUDIENCE: Well, tryptophans are also incredibly bulky, 1061 00:54:18,700 --> 00:54:24,134 and also have the ability to make electric interactions, 1062 00:54:24,134 --> 00:54:25,616 electro-static interactions. 1063 00:54:25,616 --> 00:54:27,098 PROFESSOR: Not really electro-static, you would say, 1064 00:54:27,098 --> 00:54:27,598 more-- 1065 00:54:27,598 --> 00:54:28,491 [INTERPOSING VOICES] 1066 00:54:28,491 --> 00:54:29,074 AUDIENCE: Yes. 1067 00:54:29,074 --> 00:54:31,544 But they do have a lot of abilities 1068 00:54:31,544 --> 00:54:34,014 to interact with other side chains. 1069 00:54:34,014 --> 00:54:37,966 And cysteines contribute very, very strongly 1070 00:54:37,966 --> 00:54:41,918 to the three dimensional structure of the protein. 1071 00:54:41,918 --> 00:54:42,906 PROFESSOR: Why is that? 1072 00:54:42,906 --> 00:54:45,870 AUDIENCE: Well, because they can form [INAUDIBLE]. 1073 00:54:45,870 --> 00:54:47,181 PROFESSOR: OK. 1074 00:54:47,181 --> 00:54:47,680 Yeah. 1075 00:54:47,680 --> 00:54:50,950 So maybe you don't put your tryptophans and your cysteines 1076 00:54:50,950 --> 00:54:54,000 into your protein by chance, or you only 1077 00:54:54,000 --> 00:54:55,630 put them when you want them, when 1078 00:54:55,630 --> 00:54:57,880 there's enough space for a tryptophan. 1079 00:54:57,880 --> 00:54:59,650 And when you substitute something smaller, 1080 00:54:59,650 --> 00:55:00,550 it leaves a gap. 1081 00:55:00,550 --> 00:55:04,430 It leaves a 3D spatial gap. 1082 00:55:04,430 --> 00:55:06,100 And so you don't want that. 1083 00:55:06,100 --> 00:55:07,880 You don't get good packing. 1084 00:55:07,880 --> 00:55:10,147 When you have cysteines, they form disulfide bonds. 1085 00:55:10,147 --> 00:55:12,230 If you change it to something that's non-cysteine, 1086 00:55:12,230 --> 00:55:13,271 it can form that anymore. 1087 00:55:13,271 --> 00:55:15,270 That could be disruptive to the overall fold. 1088 00:55:15,270 --> 00:55:19,110 So those ones tend to be more conserved in protein sequence 1089 00:55:19,110 --> 00:55:21,170 alignments, absolutely. 1090 00:55:21,170 --> 00:55:24,320 Whereas, for example, if you look at these hydrophobics, 1091 00:55:24,320 --> 00:55:29,240 the MILV group down here, they all 1092 00:55:29,240 --> 00:55:32,920 have positive scores relative to each other. 1093 00:55:32,920 --> 00:55:35,740 And that says that, basically, most the time when those 1094 00:55:35,740 --> 00:55:37,360 are used-- I mean, there are sometimes 1095 00:55:37,360 --> 00:55:38,740 when it went really matters. 1096 00:55:38,740 --> 00:55:42,270 But a lot of time, if you just want a transmembrane segment, 1097 00:55:42,270 --> 00:55:46,990 you can often substitute any one of those at several positions 1098 00:55:46,990 --> 00:55:50,749 and it'll work equally well as a transmembrane segment. 1099 00:55:50,749 --> 00:55:52,040 So these are not random at all. 1100 00:55:52,040 --> 00:55:54,300 There's some patterns here. 1101 00:55:54,300 --> 00:55:57,010 So let's go back to this algorithm. 1102 00:55:57,010 --> 00:56:00,150 So now, if we're going to implement this recursion, 1103 00:56:00,150 --> 00:56:04,320 so we fill on the top row and the left column, 1104 00:56:04,320 --> 00:56:07,340 and then we need to fill in this first. 1105 00:56:07,340 --> 00:56:10,280 I would argue the first interesting place in the matrix 1106 00:56:10,280 --> 00:56:11,620 is right here. 1107 00:56:11,620 --> 00:56:17,460 And we consider adding a gap here. 1108 00:56:17,460 --> 00:56:21,020 When you move vertically or horizontally, 1109 00:56:21,020 --> 00:56:24,705 you're not adding a match or adding a match. 1110 00:56:24,705 --> 00:56:28,799 So from this position, this is sort of the beginning point. 1111 00:56:28,799 --> 00:56:31,090 It doesn't actually correspond to a particular position 1112 00:56:31,090 --> 00:56:31,780 in the protein. 1113 00:56:31,780 --> 00:56:33,580 We're going to add now the score for VV. 1114 00:56:33,580 --> 00:56:36,960 And we said that VV, you look it up in that PAM matrix, 1115 00:56:36,960 --> 00:56:37,900 and it's plus 4. 1116 00:56:37,900 --> 00:56:39,670 So we're going to add 4 there to 0. 1117 00:56:39,670 --> 00:56:43,120 And so that's clearly bigger than minus 16, 1118 00:56:43,120 --> 00:56:45,717 which is what you get from coming above or coming 1119 00:56:45,717 --> 00:56:46,300 from the left. 1120 00:56:46,300 --> 00:56:48,360 So you put in the 4. 1121 00:56:48,360 --> 00:56:52,700 And then you also, in addition to putting that 4 there, 1122 00:56:52,700 --> 00:56:55,060 you also keep the arrow. 1123 00:56:55,060 --> 00:56:56,560 So there's that red arrow. 1124 00:56:56,560 --> 00:56:59,242 We remember where we came from in this algorithm. 1125 00:56:59,242 --> 00:57:01,450 Because someone said something about backtracking-- I 1126 00:57:01,450 --> 00:57:04,620 think Chris-- so that's going to be relevant later. 1127 00:57:04,620 --> 00:57:07,400 So we basically get rid of those two dotted arrows 1128 00:57:07,400 --> 00:57:09,850 and just keep that red arrow as well as the score. 1129 00:57:09,850 --> 00:57:14,880 And then we fill in the next position here. 1130 00:57:14,880 --> 00:57:18,860 And so to fill in this, now we're 1131 00:57:18,860 --> 00:57:23,480 considering going to the second position in sequence one, 1132 00:57:23,480 --> 00:57:27,410 but we're still only at the first position in sequence two. 1133 00:57:27,410 --> 00:57:31,830 So if we match V to V, then we'd have to add, 1134 00:57:31,830 --> 00:57:38,370 basically, a gap in one of the sequences. 1135 00:57:38,370 --> 00:57:40,260 Basically it would be a gap in sequence two. 1136 00:57:40,260 --> 00:57:42,050 And that's going to be minus 8. 1137 00:57:42,050 --> 00:57:45,610 So you take 4, and then plus minus 8, so it's negative 4. 1138 00:57:45,610 --> 00:57:50,900 Or you could do minus 8 and then plus negative 2, 1139 00:57:50,900 --> 00:57:53,960 if you want to start from a gap and then 1140 00:57:53,960 --> 00:57:58,300 add a DV mismatch there, because minus 2 1141 00:57:58,300 --> 00:58:01,630 was the score for a DV mismatch. 1142 00:58:01,630 --> 00:58:04,769 Or again, you can start from a gap and then add another gap. 1143 00:58:04,769 --> 00:58:05,810 OK, does that make sense? 1144 00:58:05,810 --> 00:58:10,420 So what is the maximum going to be? 1145 00:58:10,420 --> 00:58:12,290 Negative 4. 1146 00:58:12,290 --> 00:58:14,610 And the arrow is going to be horizontal, 1147 00:58:14,610 --> 00:58:17,910 because we got some bonus points for that VV match, 1148 00:58:17,910 --> 00:58:19,850 and now it's carrying over. 1149 00:58:19,850 --> 00:58:21,100 We're negative, but that's OK. 1150 00:58:21,100 --> 00:58:24,180 We're going to just keep the maximum, whatever it is. 1151 00:58:24,180 --> 00:58:27,670 All right, so it's minus 4, and the horizontal arrow. 1152 00:58:27,670 --> 00:58:30,880 And then here's the entire matrix filled out. 1153 00:58:30,880 --> 00:58:34,690 And you'll have a chance to do this for yourself 1154 00:58:34,690 --> 00:58:36,810 on problem set one. 1155 00:58:36,810 --> 00:58:39,022 And I've also filled in arrows. 1156 00:58:39,022 --> 00:58:40,480 I haven't filled in all the arrows, 1157 00:58:40,480 --> 00:58:42,260 because it gets kind of cluttered. 1158 00:58:42,260 --> 00:58:45,770 But all the relevant arrows here are filled in, as well as 1159 00:58:45,770 --> 00:58:47,480 some irrelevant arrows. 1160 00:58:47,480 --> 00:58:53,920 And so then, once I fill this in, 1161 00:58:53,920 --> 00:58:56,100 what do I do with this information? 1162 00:58:56,100 --> 00:58:59,532 How do I get an actual alignment out of this matrix? 1163 00:59:03,780 --> 00:59:04,322 Any ideas? 1164 00:59:04,322 --> 00:59:05,280 Yeah, what's your name? 1165 00:59:05,280 --> 00:59:06,196 AUDIENCE: [INAUDIBLE]. 1166 00:59:14,749 --> 00:59:17,290 PROFESSOR: Yeah, so what he said is start at the bottom right 1167 00:59:17,290 --> 00:59:20,390 corner and go backwards following 1168 00:59:20,390 --> 00:59:21,790 the red arrows in reverse. 1169 00:59:21,790 --> 00:59:23,719 Is that right? 1170 00:59:23,719 --> 00:59:25,010 So why the bottom right corner? 1171 00:59:25,010 --> 00:59:26,303 What's special about that? 1172 00:59:26,303 --> 00:59:27,219 AUDIENCE: [INAUDIBLE]. 1173 00:59:31,233 --> 00:59:34,180 PROFESSOR: Yeah. 1174 00:59:34,180 --> 00:59:35,910 It's a score of the optimal alignment 1175 00:59:35,910 --> 00:59:38,550 of the entire sequence one against the entire sequence 1176 00:59:38,550 --> 00:59:39,930 two. 1177 00:59:39,930 --> 00:59:41,270 So that's the answer. 1178 00:59:41,270 --> 00:59:43,790 That's what we define as the optimal global alignment. 1179 00:59:43,790 --> 00:59:47,710 And then you want to know how you got there. 1180 00:59:47,710 --> 00:59:49,290 And so how did we get there? 1181 00:59:49,290 --> 00:59:54,760 So the fact that there's this red arrow here, 1182 00:59:54,760 --> 00:59:59,476 what does that red arrow correspond to specifically? 1183 00:59:59,476 --> 01:00:00,392 AUDIENCE: [INAUDIBLE]. 1184 01:00:03,180 --> 01:00:03,950 PROFESSOR: Right. 1185 01:00:03,950 --> 01:00:07,630 In this particular case, for this particular red arrow, 1186 01:00:07,630 --> 01:00:09,260 remember the diagonals are matches. 1187 01:00:09,260 --> 01:00:10,380 So what match is that? 1188 01:00:10,380 --> 01:00:11,334 AUDIENCE: [INAUDIBLE]. 1189 01:00:11,334 --> 01:00:13,250 PROFESSOR: Yeah, that's a Y to Y match, right? 1190 01:00:13,250 --> 01:00:14,166 Can everyone see that? 1191 01:00:14,166 --> 01:00:16,340 We added Y to Y, which was plus 10, 1192 01:00:16,340 --> 01:00:19,370 to whatever this 13 was and got 23. 1193 01:00:19,370 --> 01:00:20,960 OK, so now we go back to here. 1194 01:00:20,960 --> 01:00:22,310 And then how do we get here? 1195 01:00:22,310 --> 01:00:27,900 We came from up here by going this diagonal arrow. 1196 01:00:27,900 --> 01:00:28,940 What is that? 1197 01:00:28,940 --> 01:00:30,110 What match was that? 1198 01:00:30,110 --> 01:00:32,016 That's a cysteine-cysteine match. 1199 01:00:32,016 --> 01:00:33,390 And then how do we get to this 1? 1200 01:00:36,090 --> 01:00:37,870 We came vertically. 1201 01:00:37,870 --> 01:00:40,440 And so what does that mean? 1202 01:00:40,440 --> 01:00:41,765 AUDIENCE: [INAUDIBLE]. 1203 01:00:41,765 --> 01:00:43,765 PROFESSOR: We inserted a gap, in which sequence? 1204 01:00:47,090 --> 01:00:48,455 The first one. 1205 01:00:48,455 --> 01:00:49,080 The second one? 1206 01:00:52,130 --> 01:00:55,230 What do people think? 1207 01:00:55,230 --> 01:00:55,954 Moving down. 1208 01:00:55,954 --> 01:00:56,870 AUDIENCE: [INAUDIBLE]. 1209 01:01:01,220 --> 01:01:03,670 PROFESSOR: Yeah, the top one. 1210 01:01:03,670 --> 01:01:05,560 And so that got us to here. 1211 01:01:05,560 --> 01:01:09,220 Here's a match, plus 2 for having a serine-serine match. 1212 01:01:09,220 --> 01:01:13,450 Here's a plus 3 for having a D to E mismatch. 1213 01:01:13,450 --> 01:01:16,020 But remember, that's those are chemically similar, 1214 01:01:16,020 --> 01:01:17,340 so they get a positive score. 1215 01:01:17,340 --> 01:01:19,871 And then this is the V to V. 1216 01:01:19,871 --> 01:01:20,940 So can you see? 1217 01:01:20,940 --> 01:01:24,080 I think I have the optimal alignment written 1218 01:01:24,080 --> 01:01:28,160 somewhere here, hopefully, there. 1219 01:01:28,160 --> 01:01:29,580 That's called the trace back. 1220 01:01:29,580 --> 01:01:34,650 And then that is the alignment. 1221 01:01:34,650 --> 01:01:37,500 OK, we align the Y to the Y, the C to the C. 1222 01:01:37,500 --> 01:01:43,500 Then we have basically a gap in this top sequence-- that's 1223 01:01:43,500 --> 01:01:48,550 that purple dash there-- that's corresponding to that L. 1224 01:01:48,550 --> 01:01:50,902 And you can see why we wanted to put that gap in there, 1225 01:01:50,902 --> 01:01:52,360 because we want these S's to match, 1226 01:01:52,360 --> 01:01:53,710 and we want the C's to match. 1227 01:01:53,710 --> 01:01:55,270 And the only way to connect those 1228 01:01:55,270 --> 01:01:57,880 is to have a gap in the purple. 1229 01:01:57,880 --> 01:02:00,640 And the purple was shorter than the green sequence 1230 01:02:00,640 --> 01:02:02,290 anyway, so we kind of knew that there 1231 01:02:02,290 --> 01:02:05,030 was going to be a gap somewhere. 1232 01:02:05,030 --> 01:02:06,530 And good. 1233 01:02:06,530 --> 01:02:09,150 And that's the optimal alignment. 1234 01:02:13,470 --> 01:02:17,610 So that's just some philosophy on Needlemen-Wunsch alignments. 1235 01:02:22,960 --> 01:02:24,890 So what is semi-global alignment? 1236 01:02:24,890 --> 01:02:26,400 You don't see that that commonly. 1237 01:02:26,400 --> 01:02:27,180 It's not that big a deal. 1238 01:02:27,180 --> 01:02:28,929 I don't want to spend too much time on it. 1239 01:02:28,929 --> 01:02:33,870 But it is actually reasonable a lot of times 1240 01:02:33,870 --> 01:02:36,440 that, let's say you have a protein that 1241 01:02:36,440 --> 01:02:40,580 has a particular enzymatic activity, 1242 01:02:40,580 --> 01:02:48,760 and you may find that the whole, the bulk of the protein 1243 01:02:48,760 --> 01:02:51,150 is well conserved across species. 1244 01:02:51,150 --> 01:02:53,620 But then at the N and C termini, there's 1245 01:02:53,620 --> 01:02:55,490 a little bit of flutter. 1246 01:02:55,490 --> 01:02:58,510 You can add a few residues or delete a few residues, and not 1247 01:02:58,510 --> 01:03:00,570 much matters at the N and C termini. 1248 01:03:00,570 --> 01:03:03,657 Or it may matter not for the structure but for, 1249 01:03:03,657 --> 01:03:06,240 you know, you're adding a single peptide so it'll be secreted, 1250 01:03:06,240 --> 01:03:07,989 or you're adding some localization signal. 1251 01:03:07,989 --> 01:03:09,640 You're adding some little thing that 1252 01:03:09,640 --> 01:03:11,760 isn't necessarily conserved. 1253 01:03:11,760 --> 01:03:15,870 And so a semi-global alignment, where 1254 01:03:15,870 --> 01:03:18,820 you use the same algorithm, except 1255 01:03:18,820 --> 01:03:23,530 that you initialize the edges of the dynamic programming matrix 1256 01:03:23,530 --> 01:03:29,570 to 0, instead of the minus 8, minus 16 whole gap, 1257 01:03:29,570 --> 01:03:30,500 and go to 0. 1258 01:03:30,500 --> 01:03:32,920 So we're not going to penalize for gaps of the edges. 1259 01:03:32,920 --> 01:03:35,940 And then, instead of requiring the trace back 1260 01:03:35,940 --> 01:03:38,200 to begin at the bottom right, Smn, 1261 01:03:38,200 --> 01:03:40,150 you allow it to begin at the highest 1262 01:03:40,150 --> 01:03:42,970 score in the bottom row or the rightmost column. 1263 01:03:42,970 --> 01:03:46,390 And when you do the trace back as before, 1264 01:03:46,390 --> 01:03:51,950 these two changes basically find the optimal global alignment 1265 01:03:51,950 --> 01:03:56,420 but allowing arbitrary numbers of gaps at the ends 1266 01:03:56,420 --> 01:03:58,570 and just finding the core match. 1267 01:03:58,570 --> 01:04:01,110 It has to go basically to the end of one 1268 01:04:01,110 --> 01:04:02,740 or the other sequences, but then you 1269 01:04:02,740 --> 01:04:05,353 can have other residues hanging off the end 1270 01:04:05,353 --> 01:04:09,180 on the other sequence, if you want, with no penalty. 1271 01:04:09,180 --> 01:04:11,920 And this sometimes will give a better answer, 1272 01:04:11,920 --> 01:04:14,170 so it's worth knowing about. 1273 01:04:14,170 --> 01:04:17,510 And it's quite easy to implement. 1274 01:04:17,510 --> 01:04:20,560 Now what about gapped local alignments? 1275 01:04:20,560 --> 01:04:22,264 So what if you have two proteins? 1276 01:04:22,264 --> 01:04:23,680 Do you remember those two proteins 1277 01:04:23,680 --> 01:04:26,400 where we had the two diagonal? 1278 01:04:26,400 --> 01:04:29,425 I guess they were diagonal lines. 1279 01:04:29,425 --> 01:04:30,150 How where they? 1280 01:04:30,150 --> 01:04:30,605 Something like that. 1281 01:04:30,605 --> 01:04:31,979 Anyway, diagonal lines like that. 1282 01:04:31,979 --> 01:04:37,730 So where in this protein on the vertical, there is a sequence 1283 01:04:37,730 --> 01:04:44,249 here that matches two segments of the horizontal protein. 1284 01:04:44,249 --> 01:04:46,790 So for those two, you don't want to do this global alignment. 1285 01:04:46,790 --> 01:04:47,640 It'll get confused. 1286 01:04:47,640 --> 01:04:51,220 It doesn't know whether to match this guy to this 1287 01:04:51,220 --> 01:04:53,400 or this other one to the sequence. 1288 01:04:53,400 --> 01:04:55,790 So you want to use a local alignment. 1289 01:04:55,790 --> 01:04:58,710 So how do we modify this Needleman-Wunsch algorithm 1290 01:04:58,710 --> 01:05:00,520 to do local alignment? 1291 01:05:05,460 --> 01:05:07,068 Any ideas? 1292 01:05:07,068 --> 01:05:09,044 It's not super hard. 1293 01:05:17,950 --> 01:05:19,790 Yeah, go ahead. 1294 01:05:19,790 --> 01:05:23,180 AUDIENCE: If the score is going to go negative, 1295 01:05:23,180 --> 01:05:26,810 instead of putting a negative score, you just put 0, 1296 01:05:26,810 --> 01:05:36,380 and you start from where you get the highest total score, rather 1297 01:05:36,380 --> 01:05:40,540 than the last column or last row. 1298 01:05:40,540 --> 01:05:43,060 Start your traceback from the highest score. 1299 01:05:43,060 --> 01:05:46,470 PROFESSOR: So whenever you're going negative, you reset to 0. 1300 01:05:46,470 --> 01:05:49,380 Now what does that remind you of? 1301 01:05:49,380 --> 01:05:52,800 That's the same trick we did write previously 1302 01:05:52,800 --> 01:05:55,570 with ungapped local alignment. 1303 01:05:55,570 --> 01:05:57,090 So you reset to 0. 1304 01:05:57,090 --> 01:06:00,160 And that's as a no penalty, because if you're 1305 01:06:00,160 --> 01:06:03,770 going negative, it's better just to throw that stuff away 1306 01:06:03,770 --> 01:06:05,220 and start over. 1307 01:06:05,220 --> 01:06:07,380 We can do that because we're doing local alignment. 1308 01:06:07,380 --> 01:06:09,030 We don't have to align the whole thing. 1309 01:06:09,030 --> 01:06:11,410 So that's allowed. 1310 01:06:11,410 --> 01:06:16,600 And then, rather than going to the bottom right corner, 1311 01:06:16,600 --> 01:06:18,220 you can be anywhere in the matrix. 1312 01:06:18,220 --> 01:06:21,360 You look for that highest score and then do the traceback. 1313 01:06:21,360 --> 01:06:22,910 That's exactly right. 1314 01:06:22,910 --> 01:06:27,600 So it's not that different. 1315 01:06:27,600 --> 01:06:29,920 There are a few constraints, though, 1316 01:06:29,920 --> 01:06:31,100 now on the scoring system. 1317 01:06:31,100 --> 01:06:34,250 So if you think about the Needleman-Wunsch algorithm, 1318 01:06:34,250 --> 01:06:38,150 we could actually use a matrix that had all positive scores. 1319 01:06:38,150 --> 01:06:40,860 You could take the PAM250 matrix. 1320 01:06:40,860 --> 01:06:42,900 And let's say the most negative score 1321 01:06:42,900 --> 01:06:46,280 there is, I don't know, like minus 10 or something, 1322 01:06:46,280 --> 01:06:49,620 and you could just add 10, or even add 20 to all those score. 1323 01:06:49,620 --> 01:06:50,960 So they're all positive now. 1324 01:06:50,960 --> 01:06:52,640 And you could still run that algorithm. 1325 01:06:52,640 --> 01:06:56,890 And it would still produce more or less sensible results. 1326 01:06:56,890 --> 01:07:00,330 I mean, they t wouldn't be as good as the real PAM250, 1327 01:07:00,330 --> 01:07:03,460 but you would still get a coherent alignment out 1328 01:07:03,460 --> 01:07:04,470 of the other end. 1329 01:07:04,470 --> 01:07:06,920 But that is no longer true when you 1330 01:07:06,920 --> 01:07:09,300 talk about the Smith-Waterman algorithm, 1331 01:07:09,300 --> 01:07:12,540 for the same reason that an ungapped local alignment, 1332 01:07:12,540 --> 01:07:15,807 we had to require that the expected score be negative, 1333 01:07:15,807 --> 01:07:17,640 because you have to have this negative drift 1334 01:07:17,640 --> 01:07:20,771 to find small regions that go in the positive. 1335 01:07:20,771 --> 01:07:23,270 So if you have this rule, this kind of permissive that says, 1336 01:07:23,270 --> 01:07:25,300 whenever we go negative we can just reset to 0, 1337 01:07:25,300 --> 01:07:29,420 then you have to have this negative drift in order 1338 01:07:29,420 --> 01:07:32,700 for positive scoring stuff to be unusual. 1339 01:07:37,400 --> 01:07:39,726 All right, so that's another constraint there. 1340 01:07:39,726 --> 01:07:42,100 You have to have negative values for mismatches-- I mean, 1341 01:07:42,100 --> 01:07:43,230 not all mismatches. 1342 01:07:43,230 --> 01:07:46,810 But if you took two random residues in alignment, 1343 01:07:46,810 --> 01:07:49,511 the average score has to be negative. 1344 01:07:49,511 --> 01:07:51,680 I should probably rephrase that, but more or less. 1345 01:07:51,680 --> 01:07:54,840 And here's an example of Smith-Waterman. 1346 01:07:54,840 --> 01:07:59,940 So you right zeroes down the left side and across the top. 1347 01:07:59,940 --> 01:08:02,690 And that's because, remember, if you go negative, 1348 01:08:02,690 --> 01:08:03,530 you reset to 0. 1349 01:08:03,530 --> 01:08:04,730 So we're doing that. 1350 01:08:04,730 --> 01:08:07,840 And then you take the maximum of four things. 1351 01:08:07,840 --> 01:08:12,150 So coming from the diagonal and adding the score of the match, 1352 01:08:12,150 --> 01:08:14,410 that's the same as before. 1353 01:08:14,410 --> 01:08:17,700 Coming from the left and adding a gap in one sequence, 1354 01:08:17,700 --> 01:08:19,859 coming from above and adding a gap 1355 01:08:19,859 --> 01:08:22,979 in the other sequence, or 0. 1356 01:08:22,979 --> 01:08:26,620 This "or 0" business allows us to reset to 0 1357 01:08:26,620 --> 01:08:30,550 if we ever go negative. 1358 01:08:30,550 --> 01:08:33,770 And when you have a 0, you still keep track of these arrows. 1359 01:08:33,770 --> 01:08:36,710 But when you have a 0, there's no arrow. 1360 01:08:36,710 --> 01:08:37,600 You're starting it. 1361 01:08:37,600 --> 01:08:39,350 You're starting the alignment right there. 1362 01:08:43,010 --> 01:08:45,930 So that's Smith-Waterman. 1363 01:08:45,930 --> 01:08:46,970 It's helpful. 1364 01:08:46,970 --> 01:08:48,569 I think on problem set one, you'll 1365 01:08:48,569 --> 01:08:50,943 have some experience thinking about both Needleman-Wunsch 1366 01:08:50,943 --> 01:08:52,055 and Smith-Waterman. 1367 01:08:52,055 --> 01:08:53,930 They sort of behave a little bit differently, 1368 01:08:53,930 --> 01:08:55,649 but they're highly related. 1369 01:08:55,649 --> 01:08:59,109 So it's important to understand how they're similar, 1370 01:08:59,109 --> 01:09:00,710 how they're different. 1371 01:09:00,710 --> 01:09:05,149 And what I want to focus on for the remainder of this lecture 1372 01:09:05,149 --> 01:09:09,479 is just introducing the concept of amino acid similarity 1373 01:09:09,479 --> 01:09:10,250 matrices. 1374 01:09:10,250 --> 01:09:12,990 We saw that PAM matrix, but where does it come from? 1375 01:09:12,990 --> 01:09:15,170 And what does it mean? 1376 01:09:15,170 --> 01:09:19,810 And does it work well or not, and are there alternatives? 1377 01:09:19,810 --> 01:09:22,740 So we could use this identity matrix. 1378 01:09:22,740 --> 01:09:25,749 But as we've heard, there are a number 1379 01:09:25,749 --> 01:09:28,770 of reasons why that may not be optimal. 1380 01:09:28,770 --> 01:09:32,100 For example, the cysteines, we should surely score them more, 1381 01:09:32,100 --> 01:09:34,250 because they're often involved in disulfide bonds, 1382 01:09:34,250 --> 01:09:38,160 and those have major structural effects on the protein 1383 01:09:38,160 --> 01:09:41,649 and are likely to be conserved more than your average leucine 1384 01:09:41,649 --> 01:09:44,779 or alanine or whatever. 1385 01:09:44,779 --> 01:09:47,729 So clearly, scoring system should 1386 01:09:47,729 --> 01:09:50,990 favor matching identical or related amino acids, penalize 1387 01:09:50,990 --> 01:09:54,210 poor matches and for gaps. 1388 01:09:54,210 --> 01:09:56,985 And there's also an argument that 1389 01:09:56,985 --> 01:10:00,890 can be made that it should have to do with how often one 1390 01:10:00,890 --> 01:10:04,820 residue is substituted for another during evolution. 1391 01:10:04,820 --> 01:10:07,060 So that commonly substituted thing 1392 01:10:07,060 --> 01:10:10,270 should have either positive scores or less negative 1393 01:10:10,270 --> 01:10:14,250 scores than rarely substituted things. 1394 01:10:14,250 --> 01:10:18,280 And perhaps not totally obvious, but it 1395 01:10:18,280 --> 01:10:20,860 is if you think about it for a while, 1396 01:10:20,860 --> 01:10:30,630 is that any scoring system that you dream up carries with it 1397 01:10:30,630 --> 01:10:34,130 an implicit model of molecular evolution 1398 01:10:34,130 --> 01:10:35,780 for how often things are going to be 1399 01:10:35,780 --> 01:10:37,510 substituted for each other. 1400 01:10:37,510 --> 01:10:40,550 So it's going to turn out that the score is roughly 1401 01:10:40,550 --> 01:10:42,920 proportional to a [INAUDIBLE] score 1402 01:10:42,920 --> 01:10:46,930 for the occurrence of that pair of residues, 1403 01:10:46,930 --> 01:10:48,430 divided by how often it would occur 1404 01:10:48,430 --> 01:10:50,370 by chance, something like that. 1405 01:10:50,370 --> 01:10:54,710 And so that if you assign positive scores to things, 1406 01:10:54,710 --> 01:10:57,090 to certain pairs of residues, you're 1407 01:10:57,090 --> 01:11:00,240 basically implying that those things will commonly 1408 01:11:00,240 --> 01:11:02,860 interchange during evolution. 1409 01:11:02,860 --> 01:11:12,930 And so if you want to have realistic, useful scores, 1410 01:11:12,930 --> 01:11:15,840 it helps to think about what the implicit evolutionary model is 1411 01:11:15,840 --> 01:11:21,740 and whether that is a realistic model for how proteins evolve. 1412 01:11:21,740 --> 01:11:25,420 So I'm going to come to Dayhoff. 1413 01:11:25,420 --> 01:11:29,510 And so unlike later matrices, she 1414 01:11:29,510 --> 01:11:33,590 had an explicit evolutionary model, 1415 01:11:33,590 --> 01:11:38,800 like an actual mathematical model, for how proteins evolve. 1416 01:11:38,800 --> 01:11:42,550 And the idea was that there are going 1417 01:11:42,550 --> 01:11:44,280 to be alignments of some proteins. 1418 01:11:44,280 --> 01:11:45,970 And keep in mind, this was in 1978. 1419 01:11:45,970 --> 01:11:49,389 So the protein database probably had like 1,000 proteins in it, 1420 01:11:49,389 --> 01:11:49,930 or something. 1421 01:11:49,930 --> 01:11:50,929 It was very, very small. 1422 01:11:53,850 --> 01:11:57,790 But there were some alignments that were obvious. 1423 01:11:57,790 --> 01:12:02,390 If you see two protein segments of 50 residues long 1424 01:12:02,390 --> 01:12:07,720 that are 85 identical, there's no way that occurred by chance. 1425 01:12:07,720 --> 01:12:10,040 You don't even need to do statistics on that. 1426 01:12:10,040 --> 01:12:10,890 So you're sure. 1427 01:12:10,890 --> 01:12:13,750 So she took these very high confidence protein sequence 1428 01:12:13,750 --> 01:12:18,020 alignments, and she calculated the actual residue residue 1429 01:12:18,020 --> 01:12:21,730 substitution frequencies, how often we have a valine in one 1430 01:12:21,730 --> 01:12:23,860 sequence as a substitute for a leucine. 1431 01:12:23,860 --> 01:12:26,490 And it's actually assumed it's symmetrical. 1432 01:12:26,490 --> 01:12:28,750 Again, you don't know the direction. 1433 01:12:28,750 --> 01:12:33,540 And calculated these substitution frequencies. 1434 01:12:33,540 --> 01:12:38,620 Basically estimated what she called 1435 01:12:38,620 --> 01:12:42,540 a PAM1 one matrix, which is a matrix that 1436 01:12:42,540 --> 01:12:46,100 implies 1% divergence between proteins. 1437 01:12:46,100 --> 01:12:49,100 So there's, on average, only a 1% chance 1438 01:12:49,100 --> 01:12:51,730 any given residue will change. 1439 01:12:51,730 --> 01:12:54,420 And the real alignments had greater divergence than that. 1440 01:12:54,420 --> 01:12:56,310 They had something like 15% divergence. 1441 01:12:56,310 --> 01:12:58,530 But you can look at those frequencies 1442 01:12:58,530 --> 01:13:00,130 and reduce them by a factor of 15, 1443 01:13:00,130 --> 01:13:03,270 and you'll get not exactly 15 but something like 15. 1444 01:13:03,270 --> 01:13:05,150 And you'll get something where there's 1445 01:13:05,150 --> 01:13:09,240 a 1% chance of substitution. 1446 01:13:09,240 --> 01:13:13,400 And then once you have that model for what 1% sequence 1447 01:13:13,400 --> 01:13:14,970 substitution looks like, turns out 1448 01:13:14,970 --> 01:13:19,170 you can just represent that as a matrix 1449 01:13:19,170 --> 01:13:24,540 and multiply it up to get a matrix that describes what 1450 01:13:24,540 --> 01:13:33,080 5% sequence substitution looks like, or 10% or 50% or 250%. 1451 01:13:33,080 --> 01:13:36,470 So that PAM250 matrix that we talked about before, that's 1452 01:13:36,470 --> 01:13:44,940 a model for what 250% amino acid substitution looks like. 1453 01:13:44,940 --> 01:13:46,270 How does that even make sense? 1454 01:13:46,270 --> 01:13:49,981 How can you have more than 100%? 1455 01:13:49,981 --> 01:13:55,460 Is anyone with me on this? 1456 01:13:55,460 --> 01:13:56,390 Tim, yeah. 1457 01:13:56,390 --> 01:13:57,764 AUDIENCE: Because it can go back. 1458 01:13:57,764 --> 01:14:00,235 So it's more likely, in some cases, that you revert 1459 01:14:00,235 --> 01:14:02,833 rather than [INAUDIBLE]. 1460 01:14:02,833 --> 01:14:04,300 PROFESSOR: Right. 1461 01:14:04,300 --> 01:14:08,580 So a PAM10 matrix means, on average, 10% of the residues 1462 01:14:08,580 --> 01:14:09,980 have changed. 1463 01:14:09,980 --> 01:14:11,960 But a few of those residues might 1464 01:14:11,960 --> 01:14:15,550 have actually-- so maybe about 90% won't have changed at all. 1465 01:14:15,550 --> 01:14:17,130 Some will have changed once, but some 1466 01:14:17,130 --> 01:14:19,810 might have even changed twice, even at 10%. 1467 01:14:19,810 --> 01:14:23,240 And when you get to 250%, on average, 1468 01:14:23,240 --> 01:14:26,010 every residue has changed 2 and 1/2 times. 1469 01:14:26,010 --> 01:14:28,640 But again, a few residues might have remained the same. 1470 01:14:28,640 --> 01:14:30,740 And some residues that change-- for example, 1471 01:14:30,740 --> 01:14:35,360 if you had an isoleucine that mutated to a valine, 1472 01:14:35,360 --> 01:14:39,440 it might have actually changed back already in that time. 1473 01:14:39,440 --> 01:14:43,729 So it basically accounts for all those sorts of things. 1474 01:14:43,729 --> 01:14:45,645 And if you have commonly substituted residues, 1475 01:14:45,645 --> 01:14:49,490 you get that type of evolution happening. 1476 01:14:49,490 --> 01:14:49,990 All right. 1477 01:14:49,990 --> 01:14:51,930 So she took these protein sequence alignments-- 1478 01:14:51,930 --> 01:14:53,240 it looks something like this-- and calculated 1479 01:14:53,240 --> 01:14:54,460 these statistics. 1480 01:14:54,460 --> 01:14:57,170 Again, I don't want to go through this in detail 1481 01:14:57,170 --> 01:15:00,390 during lecture, because it's very well described 1482 01:15:00,390 --> 01:15:02,830 in the text. 1483 01:15:02,830 --> 01:15:05,930 But what I do want to do is introduce this concept 1484 01:15:05,930 --> 01:15:07,630 of a Markov chain, because it's sort 1485 01:15:07,630 --> 01:15:12,230 of what is underlying these Dayhoff matrices. 1486 01:15:12,230 --> 01:15:13,410 So let's think about it. 1487 01:15:13,410 --> 01:15:16,800 We'll do more on this next time. 1488 01:15:16,800 --> 01:15:19,900 But imagine that you were able to sequence 1489 01:15:19,900 --> 01:15:22,310 the genomes of cartoon characters 1490 01:15:22,310 --> 01:15:24,960 with some newly developed technology 1491 01:15:24,960 --> 01:15:28,680 and you chose to analyze the complicated genetics 1492 01:15:28,680 --> 01:15:30,610 of the Simpson lineage. 1493 01:15:30,610 --> 01:15:32,860 I'm assuming you all know these people. 1494 01:15:32,860 --> 01:15:39,350 This is Grandpa and Homer eating doughnut and his son, Bart. 1495 01:15:39,350 --> 01:15:41,710 So imagine this is Grandpa's genome 1496 01:15:41,710 --> 01:15:45,480 at the apolipoprotein A locus. 1497 01:15:45,480 --> 01:15:53,730 And a mutation occurred that he then passed on to Homer. 1498 01:15:53,730 --> 01:15:59,730 So this mutation occurred in the germ line, passed on to Homer. 1499 01:15:59,730 --> 01:16:04,140 And then when Homer passed on his genes to Bart, 1500 01:16:04,140 --> 01:16:08,816 another mutation occurred here, changing this AT pair 1501 01:16:08,816 --> 01:16:11,750 to a GC pair in Bart. 1502 01:16:11,750 --> 01:16:17,757 So this, I would argue, is a type of Markov chain. 1503 01:16:17,757 --> 01:16:18,840 So what is a Markov chain? 1504 01:16:18,840 --> 01:16:21,500 So it's just a stochastic process. 1505 01:16:21,500 --> 01:16:24,310 So a stochastic process is a random process, 1506 01:16:24,310 --> 01:16:25,780 is sort of the general meaning. 1507 01:16:25,780 --> 01:16:28,880 But here we're going to be dealing 1508 01:16:28,880 --> 01:16:32,880 with discrete stochastic processes, which is just 1509 01:16:32,880 --> 01:16:35,620 a sequence of random variables. 1510 01:16:35,620 --> 01:16:41,030 So X1 here is a random variable that represents, for example, 1511 01:16:41,030 --> 01:16:43,470 the genome of an individual, or it 1512 01:16:43,470 --> 01:16:46,000 could represent the genotype, in this case, 1513 01:16:46,000 --> 01:16:48,693 at a particular position, maybe whether it's an A, 1514 01:16:48,693 --> 01:16:53,400 C, G, or T at one particular position in the genome. 1515 01:16:53,400 --> 01:16:57,230 And now the index here-- one, two, three, and so forth-- 1516 01:16:57,230 --> 01:16:58,950 is going to represent time. 1517 01:16:58,950 --> 01:17:05,460 So X1 might be the genotype in Grandpa Simpson 1518 01:17:05,460 --> 01:17:07,360 at a particular position. 1519 01:17:07,360 --> 01:17:11,340 And X2 might be the genotype of Homer Simpson. 1520 01:17:11,340 --> 01:17:14,450 And X3 would be the genotype in the next generation, which 1521 01:17:14,450 --> 01:17:17,370 would be Bart Simpson. 1522 01:17:17,370 --> 01:17:20,120 And what a Markov chain is is it's 1523 01:17:20,120 --> 01:17:22,770 a particular type of stochastic process 1524 01:17:22,770 --> 01:17:26,950 that arises commonly in natural sciences, really, 1525 01:17:26,950 --> 01:17:29,320 and other places all over the place. 1526 01:17:29,320 --> 01:17:32,260 So it's a good one to know that has 1527 01:17:32,260 --> 01:17:33,740 what's called the Markov property. 1528 01:17:33,740 --> 01:17:37,240 And that says that the probability 1529 01:17:37,240 --> 01:17:40,700 that the next random variable, or 1530 01:17:40,700 --> 01:17:43,760 the genotype at the next generation, if you will-- so 1531 01:17:43,760 --> 01:17:48,340 Xn plus 1 equals some value, j, which 1532 01:17:48,340 --> 01:17:50,090 could be any of the possible values, 1533 01:17:50,090 --> 01:17:55,480 say any of the four bases, conditional on the values of X1 1534 01:17:55,480 --> 01:17:58,610 through Xn, that is the entire history of the process up 1535 01:17:58,610 --> 01:18:03,040 to that time, is equal to the conditional probability 1536 01:18:03,040 --> 01:18:09,300 that Xn plus 1 equals j given only that little xn equals 1537 01:18:09,300 --> 01:18:10,580 some particular value. 1538 01:18:10,580 --> 01:18:14,500 So basically what it says that if I tell you what 1539 01:18:14,500 --> 01:18:18,570 Homer's genotype was at this locus, 1540 01:18:18,570 --> 01:18:21,950 and I tell you what Grandpa Simpson's genotype was at that 1541 01:18:21,950 --> 01:18:24,480 locus, you can just ignore Grandpa Simpson's. 1542 01:18:24,480 --> 01:18:25,310 That's irrelevant. 1543 01:18:25,310 --> 01:18:27,400 It only matters what Homer's genotype 1544 01:18:27,400 --> 01:18:31,226 was for the purpose of predicting Bart's genotype. 1545 01:18:31,226 --> 01:18:32,660 Does that make sense? 1546 01:18:32,660 --> 01:18:36,870 So it really doesn't matter whether that base 1547 01:18:36,870 --> 01:18:41,450 in Homer's genome was the same as it was in Grandpa Simpson's 1548 01:18:41,450 --> 01:18:44,650 genome, or whether it was a mutation that's 1549 01:18:44,650 --> 01:18:47,040 specific to Homer, because Homer is 1550 01:18:47,040 --> 01:18:49,350 the one who passes on DNA to Bart. 1551 01:18:49,350 --> 01:18:50,250 Does that make sense? 1552 01:18:50,250 --> 01:18:54,975 So you only look back one generation. 1553 01:18:54,975 --> 01:18:56,470 It's a type of memoryless process, 1554 01:18:56,470 --> 01:18:59,160 that you only remember the last generation. 1555 01:18:59,160 --> 01:19:01,650 That's the only thing that's relevant. 1556 01:19:01,650 --> 01:19:03,710 And so to understand Markov chains, 1557 01:19:03,710 --> 01:19:07,100 it's very important that you all review 1558 01:19:07,100 --> 01:19:09,389 your conditional probability. 1559 01:19:09,389 --> 01:19:11,680 So we're going to do a little bit more on Markov chains 1560 01:19:11,680 --> 01:19:12,730 next time. 1561 01:19:12,730 --> 01:19:18,220 P A given B, what does that mean? 1562 01:19:18,220 --> 01:19:20,950 If you don't remember, look it up in the Probability 1563 01:19:20,950 --> 01:19:25,340 and Statistics, because that's sort of essential to Markov 1564 01:19:25,340 --> 01:19:25,840 chains. 1565 01:19:25,840 --> 01:19:29,210 So next time we're going to talk about comparative genomics, 1566 01:19:29,210 --> 01:19:31,700 which will involve some applications of some 1567 01:19:31,700 --> 01:19:35,390 of the alignment methods that we've been talking about. 1568 01:19:35,390 --> 01:19:41,320 And I may post some examples of interesting comparative genomic 1569 01:19:41,320 --> 01:19:44,162 research papers, which are going to be optional reading. 1570 01:19:44,162 --> 01:19:46,745 You may get a little more out of the lecture if you read them, 1571 01:19:46,745 --> 01:19:49,400 but it's not essential.