1 00:00:00,060 --> 00:00:01,780 The following content is provided 2 00:00:01,780 --> 00:00:04,019 under a Creative Commons license. 3 00:00:04,019 --> 00:00:06,870 Your support will help MIT OpenCourseWare continue 4 00:00:06,870 --> 00:00:10,730 to offer high quality educational resources for free. 5 00:00:10,730 --> 00:00:13,330 To make a donation or view additional materials 6 00:00:13,330 --> 00:00:17,217 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:17,217 --> 00:00:17,842 at ocw.mit.edu. 8 00:00:27,170 --> 00:00:29,310 PROFESSOR: Why don't we get started? 9 00:00:29,310 --> 00:00:32,540 So today we're going to talk about comparative genomics. 10 00:00:32,540 --> 00:00:36,910 And first, a brief review of what we did last time. 11 00:00:36,910 --> 00:00:39,790 So last time we talked about global and local alignment 12 00:00:39,790 --> 00:00:43,420 of protein sequences, including the Needleman-Wunsch 13 00:00:43,420 --> 00:00:45,570 and Smith-Waterman algorithms. 14 00:00:45,570 --> 00:00:49,650 And we talked about gap penalties a little bit 15 00:00:49,650 --> 00:00:55,620 and started to introduce the PAM series of matrices which 16 00:00:55,620 --> 00:00:59,020 are well described in the text. 17 00:00:59,020 --> 00:01:01,510 So what I wanted to do is just briefly 18 00:01:01,510 --> 00:01:05,850 go over what I started to talk about at the end, about 19 00:01:05,850 --> 00:01:07,780 Markov models of evolution. 20 00:01:07,780 --> 00:01:11,380 Because they're relevant, not only for the PAM series, 21 00:01:11,380 --> 00:01:16,140 but also for some other topics in the course. 22 00:01:16,140 --> 00:01:19,710 A short unit on molecular evolution 23 00:01:19,710 --> 00:01:22,030 we're going to do today. 24 00:01:22,030 --> 00:01:24,540 And then they also introduce hidden Markov models 25 00:01:24,540 --> 00:01:27,660 that will come up later in the course. 
26 00:01:27,660 --> 00:01:32,860 So the example that we gave of a Markov model was DNA sequence 27 00:01:32,860 --> 00:01:37,350 evolution in successive generations where 28 00:01:37,350 --> 00:01:43,090 the observation here is that the base at a particular position 29 00:01:43,090 --> 00:01:53,640 at generation n+1 here depends on the base at that position 30 00:01:53,640 --> 00:01:56,780 at generation n. 31 00:01:56,780 --> 00:02:00,919 But conditional on knowing the base at generation n, 32 00:02:00,919 --> 00:02:02,460 you don't learn anything from knowing 33 00:02:02,460 --> 00:02:05,730 what that base was at generation n-1. 34 00:02:05,730 --> 00:02:09,310 That's the essence of the Markov property. 35 00:02:09,310 --> 00:02:16,140 So here's the formal definition, as we saw before. 36 00:02:16,140 --> 00:02:18,650 Any questions on this? 37 00:02:18,650 --> 00:02:22,710 And I asked you to review your conditional probability 38 00:02:22,710 --> 00:02:27,190 if it was rusty, because that's very relevant. 39 00:02:31,100 --> 00:02:39,080 OK so in this example you might, if you had a random variable x 40 00:02:39,080 --> 00:02:42,530 that represented the genotype at a particular locus, 41 00:02:42,530 --> 00:02:45,640 let's say the apolipoprotein locus, 42 00:02:45,640 --> 00:02:48,320 and it had alleles A and a, then you 43 00:02:48,320 --> 00:02:52,450 might write something like the probability 44 00:02:52,450 --> 00:02:57,790 that Bart's genotype is aa homozygous given 45 00:02:57,790 --> 00:03:00,900 his grandfather's genotype and his dad's genotype 46 00:03:00,900 --> 00:03:05,080 is equal to just the conditional probability given his father's 47 00:03:05,080 --> 00:03:05,920 genotype. 
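The slide with the formal definition is not reproduced in this transcript. A standard way to write the Markov property, assuming S_n denotes the base at generation n (the notation used later in the lecture), is sketched below. The second equation restates the genotype example in the same form; the subscripts on X are chosen here for illustration and are not taken from the slide.

```latex
% Markov property: given the present state, the earlier history adds no information.
P(S_{n+1} = s_{n+1} \mid S_n = s_n,\ S_{n-1} = s_{n-1},\ \ldots,\ S_1 = s_1)
  = P(S_{n+1} = s_{n+1} \mid S_n = s_n)

% Genotype example: X is the genotype at the locus, with alleles A and a.
P(X_{\mathrm{Bart}} = aa \mid X_{\mathrm{grandfather}},\ X_{\mathrm{dad}})
  = P(X_{\mathrm{Bart}} = aa \mid X_{\mathrm{dad}})
```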
48 00:03:05,920 --> 00:03:07,628 So those are the sorts of things that you 49 00:03:07,628 --> 00:03:09,980 can do with Markov chains. 50 00:03:09,980 --> 00:03:13,740 So when you're working with Markov chains 51 00:03:13,740 --> 00:03:16,720 matrices are extremely useful. 52 00:03:16,720 --> 00:03:20,660 So another thing that will be helpful in this part 53 00:03:20,660 --> 00:03:23,550 of the course and then again in Professor Fraenkel's 54 00:03:23,550 --> 00:03:25,400 part, where he's talking-- he'll use also 55 00:03:25,400 --> 00:03:28,640 some ideas from linear algebra-- is 56 00:03:28,640 --> 00:03:36,340 to review your basics of matrices and vector 57 00:03:36,340 --> 00:03:37,510 multiplication. 58 00:03:37,510 --> 00:03:42,020 OK so, if you now make a model of molecular evolution 59 00:03:42,020 --> 00:03:46,470 where sn is-- so s is this variable that 60 00:03:46,470 --> 00:03:49,160 represents a particular base in the genome 61 00:03:49,160 --> 00:03:50,820 and n is the generation. 62 00:03:50,820 --> 00:03:53,350 And then to describe the evolution 63 00:03:53,350 --> 00:03:55,730 of this base over time, we're going 64 00:03:55,730 --> 00:04:00,010 to imagine that its evolution is described by a Markov chain. 65 00:04:00,010 --> 00:04:03,640 And a Markov chain can be described by, in this case, a 4 66 00:04:03,640 --> 00:04:07,040 by 4 matrix, since there are four possible nucleotides 67 00:04:07,040 --> 00:04:13,020 at generation i, for example, and four possible at generation 68 00:04:13,020 --> 00:04:13,680 i plus one. 69 00:04:13,680 --> 00:04:17,040 And you simply need to specify what 70 00:04:17,040 --> 00:04:19,950 the conditional probability that the base will 71 00:04:19,950 --> 00:04:24,430 be, of any possible base, at the next generation, given what 72 00:04:24,430 --> 00:04:26,820 it is at the current generation. 73 00:04:26,820 --> 00:04:29,930 So here's the matrix up here. 
74 00:04:29,930 --> 00:04:31,860 And it describes, for example, the probability 75 00:04:31,860 --> 00:04:36,810 of going from a c to an a. 76 00:04:36,810 --> 00:04:40,740 So then in general you might know 77 00:04:40,740 --> 00:04:44,310 that that base is a g at the first generation. 78 00:04:44,310 --> 00:04:47,360 But in general you won't necessarily 79 00:04:47,360 --> 00:04:50,840 know what base it is if you're modeling events that 80 00:04:50,840 --> 00:04:52,200 may happen in the future. 81 00:04:52,200 --> 00:04:54,050 So the most general way of describing 82 00:04:54,050 --> 00:04:57,100 what's happening at that base is a vector 83 00:04:57,100 --> 00:05:01,175 of probabilities of the four possible bases-- 84 00:05:01,175 --> 00:05:08,990 so qa, qc, qg, qt, with those probabilities summing up to 1. 85 00:05:08,990 --> 00:05:12,490 And so then it turns out that with this notation 86 00:05:12,490 --> 00:05:17,960 that the content of the vector at generation n plus 1 87 00:05:17,960 --> 00:05:22,360 is equal to simply the vector at generation n 88 00:05:22,360 --> 00:05:26,170 multiplied on the right by the matrix, 89 00:05:26,170 --> 00:05:34,390 just using the standard vector matrix multiplication. 90 00:05:36,980 --> 00:05:41,110 So for example, if we have vectors 91 00:05:41,110 --> 00:05:42,690 with four things in them, and we have 92 00:05:42,690 --> 00:05:48,770 a 4 by 4 matrix, then to get this term here in this vector 93 00:05:48,770 --> 00:05:51,860 you multiply-- you basically take the dot product 94 00:05:51,860 --> 00:05:55,530 of this vector times this first column. 95 00:05:55,530 --> 00:05:58,430 The vector times the first column 96 00:05:58,430 --> 00:06:00,630 will give you that entry. 97 00:06:00,630 --> 00:06:05,640 And this times this column will give you 98 00:06:05,640 --> 00:06:08,350 that entry in the vector, and so forth. 
99 00:06:08,350 --> 00:06:10,870 And you can see that the way this makes sense, 100 00:06:10,870 --> 00:06:16,470 the way the matrix is defined, that first column tells you 101 00:06:16,470 --> 00:06:20,410 the probability that you'll have an a at the next generation, 102 00:06:20,410 --> 00:06:22,470 conditional on each of the four bases 103 00:06:22,470 --> 00:06:23,595 at the previous generation. 104 00:06:23,595 --> 00:06:26,280 And so you just multiply by the probabilities of those four 105 00:06:26,280 --> 00:06:29,952 bases times the appropriate conditional probability here. 106 00:06:29,952 --> 00:06:31,410 And those are all the ways that you 107 00:06:31,410 --> 00:06:34,060 can be an a at generation n plus 1. 108 00:06:37,870 --> 00:06:43,640 And so it's also true that if you want to go further in time, 109 00:06:43,640 --> 00:06:46,820 so from generation n to generation n 110 00:06:46,820 --> 00:06:53,200 plus k-- k is some integer-- then this just corresponds 111 00:06:53,200 --> 00:06:57,800 to sequential multiplication by the matrix k-- 112 00:06:57,800 --> 00:06:59,690 I'm sorry, by the matrix p. 113 00:06:59,690 --> 00:07:09,160 So qn plus 1 equals q times p. 114 00:07:09,160 --> 00:07:16,440 And then qn plus 2 will equal q-- I'm sorry. 115 00:07:16,440 --> 00:07:21,560 That's a really bad q, but-- qn plus 1 times p, 116 00:07:21,560 --> 00:07:26,150 which will equal q times p squared, where p squared means 117 00:07:26,150 --> 00:07:30,200 matrix multiplication, again using the standard rules 118 00:07:30,200 --> 00:07:35,010 of matrix multiplication that you can look up. 119 00:07:35,010 --> 00:07:37,430 So one of the things you might think about here 120 00:07:37,430 --> 00:07:39,605 is what happens after a long time? 121 00:07:39,605 --> 00:07:47,360 If you start from some vector q-- for example, q is 0010. 122 00:07:47,360 --> 00:07:51,590 That is, it's 100% chance of g. 
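The update rule described above (the vector at generation n plus 1 is the vector at generation n times the matrix, and k generations means k such multiplications) can be sketched in a few lines of Python. This is only an illustration: the numbers in P are made up and are not the matrix on the slide, and the helper names step and evolve are hypothetical.

```python
# Bases in the order used for the probability vector (qa, qc, qg, qt).
BASES = "ACGT"

# Hypothetical transition matrix: row = current base, column = next base.
# The values are illustrative only; each row sums to 1.
P = [
    [0.97, 0.01, 0.01, 0.01],
    [0.01, 0.97, 0.01, 0.01],
    [0.01, 0.01, 0.97, 0.01],
    [0.01, 0.01, 0.01, 0.97],
]


def step(q, P):
    """One generation: return q * P.

    Entry j of the result is the dot product of q with column j of P,
    i.e. the sum over current bases i of q[i] * P[i][j].
    """
    return [sum(q[i] * P[i][j] for i in range(len(q))) for j in range(len(P[0]))]


def evolve(q, P, k):
    """k generations: multiply by P k times, which equals q * P^k."""
    for _ in range(k):
        q = step(q, P)
    return q


q0 = [0.0, 0.0, 1.0, 0.0]  # 100% chance of G, the example from the lecture
q1 = step(q0, P)           # distribution after one generation
q2 = evolve(q0, P, 2)      # distribution after two generations

for base, prob in zip(BASES, q2):
    print(base, prob)
```

Starting from a known base, one step simply reads off the corresponding row of P; further steps mix the rows according to the current probabilities.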
123 00:07:51,590 --> 00:07:54,970 What would happen if you run this matrix 124 00:07:54,970 --> 00:07:57,310 on that over a long period of time. 125 00:07:57,310 --> 00:08:01,950 And we'll come back to that question a little bit later. 126 00:08:01,950 --> 00:08:05,300 So thinking about the Dayhoff matrices-- and again, 127 00:08:05,300 --> 00:08:07,150 I'm not going to go into detail here, 128 00:08:07,150 --> 00:08:10,110 because it's well described in the text. 129 00:08:10,110 --> 00:08:15,790 Dayhoff looked at these highly identical alignments, these 85% 130 00:08:15,790 --> 00:08:20,080 identical alignments, and calculated the mutability 131 00:08:20,080 --> 00:08:25,250 of each residue and these mutation probabilities 132 00:08:25,250 --> 00:08:28,270 for how often each residue changes into each other one 133 00:08:28,270 --> 00:08:32,360 and then scaled them so that on average the chance of mutating 134 00:08:32,360 --> 00:08:37,840 is 1% and then took these probabilities, 135 00:08:37,840 --> 00:08:41,650 these frequencies, of mutation m, a, b, 136 00:08:41,650 --> 00:08:49,470 divided by the frequency of the residue b, took the log, 137 00:08:49,470 --> 00:08:55,060 and then just multiplied by two just for scaling purposes, 138 00:08:55,060 --> 00:08:58,780 and came up with a-- and then rounded to the nearest integer, 139 00:08:58,780 --> 00:09:01,330 again for practical purposes. 140 00:09:01,330 --> 00:09:06,800 And that's how she came up with her PAM 1 matrix. 141 00:09:06,800 --> 00:09:10,830 And then you can use matrix multiplication 142 00:09:10,830 --> 00:09:15,585 to derive all the successive PAM series. 143 00:09:15,585 --> 00:09:20,450 Just multiply the PAM1 matrix times itself to get the PAM2 144 00:09:20,450 --> 00:09:23,430 and recalculate the scores. 145 00:09:23,430 --> 00:09:29,265 So if you actually use PAM matrices in practice 146 00:09:29,265 --> 00:09:31,010 there are some issues. 
147 00:09:31,010 --> 00:09:34,110 And these are also well described in the text. 148 00:09:34,110 --> 00:09:36,730 And the fundamental problem seems 149 00:09:36,730 --> 00:09:44,730 to be that the way the proteins evolve over 150 00:09:44,730 --> 00:09:47,630 short periods of time and the way they evolve over 151 00:09:47,630 --> 00:09:50,900 long periods of time is somewhat different. 152 00:09:50,900 --> 00:09:55,020 And basically this model, this Markov model of evolution, 153 00:09:55,020 --> 00:10:03,120 is not quite right, that things don't-- what you see in short 154 00:10:03,120 --> 00:10:05,870 periods of time-- it does not match long periods of time. 155 00:10:05,870 --> 00:10:07,190 And why is that? 156 00:10:07,190 --> 00:10:08,700 A number of possible reasons. 157 00:10:08,700 --> 00:10:12,620 But keep in mind that in addition to proteins 158 00:10:12,620 --> 00:10:18,400 simply changing their amino acid sequence, 159 00:10:18,400 --> 00:10:20,310 other things can happen in evolution. 160 00:10:20,310 --> 00:10:22,350 You can have insertions and deletions 161 00:10:22,350 --> 00:10:24,790 that are not captured by this Markov model. 162 00:10:24,790 --> 00:10:28,940 And you can also have birth and death of proteins. 163 00:10:28,940 --> 00:10:32,350 A protein can evolve according to this model for millions 164 00:10:32,350 --> 00:10:33,070 of years. 165 00:10:33,070 --> 00:10:38,290 And then it can become unneeded, and just be lost, for example. 166 00:10:38,290 --> 00:10:41,620 So real protein evolution is more complicated. 167 00:10:41,620 --> 00:10:49,040 And so about 20 years ago or so Henikoff and Henikoff 168 00:10:49,040 --> 00:10:53,770 decided to develop a new type of matrix. 169 00:10:53,770 --> 00:10:56,320 And the way they did it was to identify these things 170 00:10:56,320 --> 00:11:00,860 called blocks, which are regions of reasonably high similarity, 171 00:11:00,860 --> 00:11:03,310 but not as high as Dayhoff required. 
172 00:11:03,310 --> 00:11:06,050 So there were many more-- Dayhoff was working in the '70s. 173 00:11:06,050 --> 00:11:07,300 They were working in the '90s. 174 00:11:07,300 --> 00:11:09,290 So there were many more proteins available. 175 00:11:09,290 --> 00:11:14,220 And they could identify, with confidence, basically 176 00:11:14,220 --> 00:11:16,050 a much larger data set, including 177 00:11:16,050 --> 00:11:20,471 more distantly related, but still confidently alignable, 178 00:11:20,471 --> 00:11:21,220 protein sequences. 179 00:11:21,220 --> 00:11:23,640 And they derived new parameters. 180 00:11:23,640 --> 00:11:28,910 And in the end this matrix they came up with called BLOSUM62 181 00:11:28,910 --> 00:11:31,550 seems to work well in a variety of contexts 182 00:11:31,550 --> 00:11:35,880 when comparing moderately distantly related 183 00:11:35,880 --> 00:11:40,280 proteins or quite distantly related proteins. 184 00:11:40,280 --> 00:11:42,150 If you're comparing very similar proteins 185 00:11:42,150 --> 00:11:43,700 it almost doesn't matter. 186 00:11:43,700 --> 00:11:45,420 Any reasonable matrix will probably 187 00:11:45,420 --> 00:11:46,503 give you the right answer. 188 00:11:46,503 --> 00:11:49,150 But when you're comparing the more distant ones, 189 00:11:49,150 --> 00:11:52,730 that's where it becomes challenging. 190 00:11:52,730 --> 00:11:56,530 And so this is the BLOSUM62 matrix here. 191 00:11:56,530 --> 00:12:05,330 And you can see it's similar to the PAM matrices in that-- I 192 00:12:05,330 --> 00:12:08,160 think we showed PAM 250 last time-- in that you have 193 00:12:08,160 --> 00:12:10,230 a diagonal with all positive numbers. 194 00:12:10,230 --> 00:12:14,390 And it's also similar in that, for example, tryptophan 195 00:12:14,390 --> 00:12:18,690 down here has a higher positive score than others. 196 00:12:18,690 --> 00:12:19,510 It's plus 9. 197 00:12:19,510 --> 00:12:21,550 And cysteine is also one of the higher ones. 
198 00:12:21,550 --> 00:12:27,050 But those are less extreme. 199 00:12:27,050 --> 00:12:29,610 And basically, maybe over short periods of evolutionary time, 200 00:12:29,610 --> 00:12:30,901 you don't change your cysteine. 201 00:12:30,901 --> 00:12:34,330 But over longer periods there is some rewiring 202 00:12:34,330 --> 00:12:37,370 of disulfide bonding, and so cysteines can change. 203 00:12:37,370 --> 00:12:40,220 Something like that may be going on. 204 00:12:43,330 --> 00:12:47,900 So we've just talked about pairwise sequence alignments. 205 00:12:47,900 --> 00:12:49,930 But in practice you often have, especially 206 00:12:49,930 --> 00:12:51,930 these days you often have, many proteins though. 207 00:12:51,930 --> 00:12:55,850 So you want to align three or five or 10 different proteins 208 00:12:55,850 --> 00:12:58,730 together to find out which residues 209 00:12:58,730 --> 00:13:01,590 are most conserved, for example. 210 00:13:01,590 --> 00:13:05,190 And so basically the principles are 211 00:13:05,190 --> 00:13:06,710 similar to pairwise alignment. 212 00:13:06,710 --> 00:13:10,550 But now you want to find alignments 213 00:13:10,550 --> 00:13:13,110 that bring the greatest number of single characters 214 00:13:13,110 --> 00:13:13,820 into register. 215 00:13:13,820 --> 00:13:15,630 So if you're aligning three proteins, 216 00:13:15,630 --> 00:13:17,940 you really want to have columns where all three are 217 00:13:17,940 --> 00:13:20,270 the same residue, or very similar residues. 218 00:13:20,270 --> 00:13:23,050 And you need to then define scoring systems, 219 00:13:23,050 --> 00:13:27,180 define gap penalties, and so forth. 220 00:13:27,180 --> 00:13:30,070 This is also reasonably well described in the text. 221 00:13:30,070 --> 00:13:31,880 I just wanted to make one comment 222 00:13:31,880 --> 00:13:36,070 about the sort of computational complexity of multiple sequence 223 00:13:36,070 --> 00:13:36,820 alignment. 
224 00:13:36,820 --> 00:13:41,090 So if you think about pairwise sequence alignment, 225 00:13:41,090 --> 00:13:47,580 say with Needleman-Wunsch or Smith-Waterman, 226 00:13:47,580 --> 00:13:49,510 with a sequence of length-- let's 227 00:13:49,510 --> 00:13:53,030 say you're aligning one protein of sequence length n 228 00:13:53,030 --> 00:13:57,690 to another of length n, what is the computational complexity 229 00:13:57,690 --> 00:14:02,949 of that calculation in using this big O notation that we've 230 00:14:02,949 --> 00:14:03,490 talked about? 231 00:14:06,430 --> 00:14:10,130 Let's just say standard gap penalties, 232 00:14:10,130 --> 00:14:11,740 linear gap penalties. 233 00:14:11,740 --> 00:14:14,330 Anyone? 234 00:14:14,330 --> 00:14:15,310 Or does it matter? 235 00:14:15,310 --> 00:14:16,069 Yeah, go ahead. 236 00:14:16,069 --> 00:14:16,860 STUDENT: n squared. 237 00:14:16,860 --> 00:14:18,360 PROFESSOR: It's n squared. 238 00:14:18,360 --> 00:14:22,770 So even though this has gaps, with local-- with ungapped 239 00:14:22,770 --> 00:14:25,760 it was also n squared, or n times n, 240 00:14:25,760 --> 00:14:30,190 So why is it that gaps don't make it worse? 241 00:14:30,190 --> 00:14:30,690 Or do they? 242 00:14:35,746 --> 00:14:36,620 Any thoughts on that? 243 00:14:36,620 --> 00:14:40,841 STUDENT: You put a constant number of gaps in the sequence. 244 00:14:40,841 --> 00:14:44,538 So it's just stating the essence of the complexity 245 00:14:44,538 --> 00:14:45,970 should still be n squared. 246 00:14:45,970 --> 00:14:48,230 PROFESSOR: You put a constant number of gaps? 247 00:14:48,230 --> 00:14:53,867 The-- I mean, yeah-- let's just hear a few different comments. 248 00:14:53,867 --> 00:14:55,200 And then we'll try to summarize. 249 00:14:55,200 --> 00:14:56,284 Go ahead. 250 00:14:56,284 --> 00:14:57,950 STUDENT: So we're still only filling out 251 00:14:57,950 --> 00:15:01,530 an n by n matrix at any given time. 
252 00:15:01,530 --> 00:15:04,300 PROFESSOR: You're still filling out an n by n matrix, right. 253 00:15:04,300 --> 00:15:07,119 There happen to be a few more things. 254 00:15:07,119 --> 00:15:08,910 The recursion is slightly more complicated. 255 00:15:08,910 --> 00:15:10,285 But there's a few more things you 256 00:15:10,285 --> 00:15:11,770 have to calculate to fill in each. 257 00:15:11,770 --> 00:15:13,710 But it's like three things, or four things. 258 00:15:13,710 --> 00:15:18,260 It's not-- so it doesn't grow with the size. 259 00:15:18,260 --> 00:15:22,127 So it's just still n squared, but with a larger constant. 260 00:15:22,127 --> 00:15:23,410 OK, good. 261 00:15:23,410 --> 00:15:25,351 And then if you did affine gap penalty, 262 00:15:25,351 --> 00:15:26,850 remember where you had a gap opening 263 00:15:26,850 --> 00:15:30,451 penalty and a gap extension, what then? 264 00:15:30,451 --> 00:15:31,450 Does that make it worse? 265 00:15:31,450 --> 00:15:35,736 Or is it still n squared? 266 00:15:35,736 --> 00:15:38,230 STUDENT: I think it's still n squared. 267 00:15:38,230 --> 00:15:40,595 PROFESSOR: Why is that? 268 00:15:40,595 --> 00:15:45,930 STUDENT: Computing the affine gap penalty is no more than o 269 00:15:45,930 --> 00:15:47,981 of n, right? 270 00:15:47,981 --> 00:15:49,730 PROFESSOR: Yeah, basically with the affine 271 00:15:49,730 --> 00:15:56,330 you have to keep track of two things at each place. 272 00:15:56,330 --> 00:15:56,970 So yeah, it is. 273 00:15:56,970 --> 00:15:57,511 You're right. 274 00:15:57,511 --> 00:15:58,525 It's still n squared. 275 00:15:58,525 --> 00:16:02,320 It's just you got to keep track of two numbers in each place 276 00:16:02,320 --> 00:16:03,070 there. 277 00:16:03,070 --> 00:16:03,660 OK, good. 278 00:16:03,660 --> 00:16:10,440 And so what about when we go to three proteins? 
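The point about pairwise complexity can be seen in a minimal sketch of Needleman-Wunsch scoring with a linear gap penalty. The scoring values (match +1, mismatch -1, gap -2) and the function name are assumptions made for this example; a real protein alignment would use a substitution matrix such as BLOSUM62 instead of a simple match/mismatch score.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score of sequences a and b with a linear gap penalty.

    Fills an (len(a)+1) x (len(b)+1) table. Each cell takes the best of
    three candidates, so the work per cell is constant and the total work
    grows as n * m (n squared when both sequences have length n).
    """
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]

    # First column and first row: a prefix aligned entirely against gaps.
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                F[i - 1][j] + gap,    # a[i-1] against a gap
                F[i][j - 1] + gap,    # b[j-1] against a gap
            )
    return F[n][m]


print(needleman_wunsch_score("GATTACA", "GATTACA"))
```

Allowing gaps adds two candidates per cell but no extra loop, which is why the cost stays n squared with a larger constant. The affine version keeps a small fixed number of extra values per cell, so the same argument applies.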
279 00:16:10,440 --> 00:16:12,800 So how would you generalize, let's say, 280 00:16:12,800 --> 00:16:15,780 the Needleman-Wunsch algorithm to align three proteins? 281 00:16:20,730 --> 00:16:22,450 Any ideas? 282 00:16:22,450 --> 00:16:30,530 What structure would you use, or what-- 283 00:16:30,530 --> 00:16:32,960 analogous to a matrix-- yeah, in the back. 284 00:16:32,960 --> 00:16:36,140 STUDENT: Another way to do this would be have a 3D matrix. 285 00:16:36,140 --> 00:16:40,930 PROFESSOR: OK, a 3D matrix, like a cube. 286 00:16:40,930 --> 00:16:45,176 And can everyone visualize that? 287 00:16:45,176 --> 00:16:47,430 So yeah, basically you could have 288 00:16:47,430 --> 00:16:51,350 a version of Needleman-Wunsch that was on a cube. 289 00:16:51,350 --> 00:16:54,850 And it started in the 0, 0, 0 corner 290 00:16:54,850 --> 00:17:00,440 and went down to the n, n, n corner, filling in in 3D. 291 00:17:00,440 --> 00:17:03,950 OK so what kind of computational complexity 292 00:17:03,950 --> 00:17:08,575 do you think that algorithm would have? 293 00:17:08,575 --> 00:17:09,549 STUDENT: n cubed? 294 00:17:09,549 --> 00:17:11,010 PROFESSOR: n cubed. 295 00:17:11,010 --> 00:17:12,180 Yeah, makes sense. 296 00:17:12,180 --> 00:17:15,020 There would be a similar number, a few operations 297 00:17:15,020 --> 00:17:17,390 to fill in each element in the cube. 298 00:17:17,390 --> 00:17:18,710 And there's n cubed. 299 00:17:18,710 --> 00:17:23,700 So the way that the problem grows with n is as n cubed. 300 00:17:23,700 --> 00:17:27,493 And what about in general, if you have k sequences? 301 00:17:27,493 --> 00:17:28,326 STUDENT: n to the k? 302 00:17:28,326 --> 00:17:30,510 PROFESSOR: n to the k. 303 00:17:30,510 --> 00:17:32,165 So is this practical? 304 00:17:35,000 --> 00:17:40,070 With three proteins and modern computers you could do it. 305 00:17:40,070 --> 00:17:43,060 You could implement Needleman-Wunsch on a cube. 
306 00:17:43,060 --> 00:17:47,390 But what about with 20 proteins? 307 00:17:47,390 --> 00:17:50,000 Is that practical? 308 00:17:50,000 --> 00:17:51,950 So it's really not. 309 00:17:51,950 --> 00:17:55,290 So if proteins are 500 residues long and there's 310 00:17:55,290 --> 00:17:57,945 500 to the 20th, right. 311 00:17:57,945 --> 00:17:58,820 It starts to explode. 312 00:17:58,820 --> 00:18:01,419 So that approach really only works 313 00:18:01,419 --> 00:18:03,710 in two dimensions and a little bit in three dimensions. 314 00:18:03,710 --> 00:18:05,320 And it becomes impractical. 315 00:18:05,320 --> 00:18:07,540 So you need to use a variety of shortcuts. 316 00:18:07,540 --> 00:18:12,360 And so this is, again, described pretty well 317 00:18:12,360 --> 00:18:14,800 in chapter six of the text. 318 00:18:14,800 --> 00:18:19,690 And a commonly used-- if you're looking for a default 319 00:18:19,690 --> 00:18:23,540 multiple sequence aligner, CLUSTALW is a common one. 320 00:18:23,540 --> 00:18:25,830 There's a web interface if you just 321 00:18:25,830 --> 00:18:27,620 need to do one or two alignments. 322 00:18:27,620 --> 00:18:28,310 That works fine. 323 00:18:28,310 --> 00:18:30,990 You can also download a version called CLUSTALX 324 00:18:30,990 --> 00:18:32,490 and run it locally. 325 00:18:32,490 --> 00:18:35,640 And it does a lot of things with pairwise alignments 326 00:18:35,640 --> 00:18:37,490 and then combining the pairwise alignments. 327 00:18:37,490 --> 00:18:40,337 It aligns the two closest things first 328 00:18:40,337 --> 00:18:42,420 and then brings in the next closest, and so forth. 329 00:18:42,420 --> 00:18:45,590 And it does a lot of tricks that are-- they're 330 00:18:45,590 --> 00:18:46,650 basically heuristics. 
331 00:18:46,650 --> 00:18:49,150 They're things that usually work, 332 00:18:49,150 --> 00:18:51,540 give you a reasonable answer, but don't necessarily 333 00:18:51,540 --> 00:18:57,030 guarantee that you will find the optimal alignment if you were 334 00:18:57,030 --> 00:19:00,570 to do it on a 20 dimensional cube, for example. 335 00:19:00,570 --> 00:19:02,280 So they work reasonably well in practice. 336 00:19:02,280 --> 00:19:04,980 And then there's a variety of other algorithms. 337 00:19:04,980 --> 00:19:07,420 OK, good. 338 00:19:07,420 --> 00:19:13,650 So that's a review of what we've mostly been talking about. 339 00:19:13,650 --> 00:19:17,770 And now I want to introduce a couple of new topics. 340 00:19:17,770 --> 00:19:22,530 So we're going to briefly talk a little bit more 341 00:19:22,530 --> 00:19:26,290 about Markov models of sequence evolution. 342 00:19:26,290 --> 00:19:31,240 And these are closely related to some classic evolutionary 343 00:19:31,240 --> 00:19:33,640 theory from Jukes-Cantor and Kimura. 344 00:19:33,640 --> 00:19:36,680 So we'll just briefly mention that. 345 00:19:36,680 --> 00:19:40,600 And we'll talk a little bit about different types 346 00:19:40,600 --> 00:19:46,680 of selection that sequences can undergo-- so neutral, negative, 347 00:19:46,680 --> 00:19:50,090 and positive-- and how you might distinguish 348 00:19:50,090 --> 00:19:54,770 among those for protein coding sequences. 349 00:19:54,770 --> 00:19:58,540 And this will basically serve as an intro 350 00:19:58,540 --> 00:20:04,080 into the main topic today, which is comparative genomics. 351 00:20:04,080 --> 00:20:06,980 And comparative genomics-- it's not really a field, exactly. 352 00:20:06,980 --> 00:20:09,830 It's more of an approach. 
353 00:20:09,830 --> 00:20:15,070 But I wanted to give you some actual concrete examples 354 00:20:15,070 --> 00:20:18,070 of computational biology research, successful research 355 00:20:18,070 --> 00:20:22,740 that has led to various types of insights into gene regulation, 356 00:20:22,740 --> 00:20:26,440 in this case, mostly to emphasize 357 00:20:26,440 --> 00:20:31,747 that computational biology is not just a bag of tools. 358 00:20:31,747 --> 00:20:33,330 We've mostly been talking about tools. 359 00:20:33,330 --> 00:20:36,090 We introduced tools for local alignment 360 00:20:36,090 --> 00:20:38,340 and multiple alignment and statistics and so forth. 361 00:20:38,340 --> 00:20:41,050 But really it's a living, breathing field 362 00:20:41,050 --> 00:20:42,580 with active research. 363 00:20:42,580 --> 00:20:45,900 And even using-- comparative genomics 364 00:20:45,900 --> 00:20:48,530 is one of my favorite areas within this field. 365 00:20:48,530 --> 00:20:51,240 Because it's very powerful. 366 00:20:51,240 --> 00:20:55,300 And you can often use very simple ideas. 367 00:20:55,300 --> 00:20:58,460 And simple algorithms can sometimes 368 00:20:58,460 --> 00:21:01,170 give you a really interesting biological result, 369 00:21:01,170 --> 00:21:04,297 if you have the right sequences and ask the question 370 00:21:04,297 --> 00:21:04,880 the right way. 371 00:21:04,880 --> 00:21:10,480 So I have posted a dozen of my favorite comparative genomics 372 00:21:10,480 --> 00:21:14,200 papers in a special section on the website. 373 00:21:14,200 --> 00:21:16,750 Obviously I'm not asking you to read all of these. 
374 00:21:16,750 --> 00:21:22,990 But I'm going to give you a few insights and approaches that 375 00:21:22,990 --> 00:21:26,022 were used in each of these papers here, 376 00:21:26,022 --> 00:21:27,980 just to give you a flavor of some of the things 377 00:21:27,980 --> 00:21:31,440 that you can do with comparative genomics, 378 00:21:31,440 --> 00:21:34,634 in the hopes that this might inspire some of your projects. 379 00:21:34,634 --> 00:21:36,050 So hopefully you're going to start 380 00:21:36,050 --> 00:21:39,310 thinking about finding teammates and thinking about projects. 381 00:21:39,310 --> 00:21:43,105 And this will hopefully help in that direction. 382 00:21:43,105 --> 00:21:45,730 Of course, they don't have to be comparative genomics projects. 383 00:21:45,730 --> 00:21:48,460 You could do anything in computational biology 384 00:21:48,460 --> 00:21:50,150 or systems biology in this class. 385 00:21:50,150 --> 00:21:54,890 But that's just one area to start thinking about. 386 00:21:58,040 --> 00:21:59,697 Yeah, I'll also-- I'm sorry, I think 387 00:21:59,697 --> 00:22:00,780 I haven't posted this yet. 388 00:22:00,780 --> 00:22:03,750 But I will also post this review by Sabeti 389 00:22:03,750 --> 00:22:09,020 that has a good discussion of positive selection a little bit 390 00:22:09,020 --> 00:22:10,690 later. 391 00:22:10,690 --> 00:22:12,930 Again, not required. 392 00:22:12,930 --> 00:22:19,070 All right, so let's go back to this question 393 00:22:19,070 --> 00:22:20,640 that I posed earlier. 394 00:22:20,640 --> 00:22:25,880 We have a Markov model of DNA sequence evolution. 395 00:22:25,880 --> 00:22:31,600 And we-- sn is the base at generation n. 396 00:22:31,600 --> 00:22:34,120 And then what happens after a long time? 
397 00:22:36,670 --> 00:22:40,720 If you take any vector-- q, to start with, 398 00:22:40,720 --> 00:22:43,070 might be a known base, for example-- 399 00:22:43,070 --> 00:22:45,720 and apply that matrix many times, 400 00:22:45,720 --> 00:22:48,140 what happens as n goes to infinity. 401 00:22:48,140 --> 00:22:52,530 And so it turns out that there's fairly classical theory 402 00:22:52,530 --> 00:22:54,836 here that gives us an answer. 403 00:22:54,836 --> 00:22:56,460 This is not all the theory that exists, 404 00:22:56,460 --> 00:23:00,150 but this describes the typical case. 405 00:23:00,150 --> 00:23:04,950 So the theory says that if all of the elements in the matrix 406 00:23:04,950 --> 00:23:10,420 are greater than 0, and then of course 407 00:23:10,420 --> 00:23:17,200 all of the-- pij's, when you sum over j, they have to equal 1. 408 00:23:17,200 --> 00:23:20,190 That's just for it to be a well-defined Markov chain. 409 00:23:20,190 --> 00:23:22,800 Because you're going from i to j. 410 00:23:22,800 --> 00:23:26,856 And so from any base you have to go-- 411 00:23:26,856 --> 00:23:28,980 the probability of going to one of those four bases 412 00:23:28,980 --> 00:23:30,550 has to sum to 1. 413 00:23:30,550 --> 00:23:34,660 And so if those conditions hold, then there 414 00:23:34,660 --> 00:23:40,063 is a unique vector r such that r equals r times p. 415 00:23:42,600 --> 00:23:47,950 And the limit of q times p to the n 416 00:23:47,950 --> 00:23:50,964 equals r, independent of what q was. 417 00:23:50,964 --> 00:23:52,630 So basically, wherever you were starting 418 00:23:52,630 --> 00:23:54,088 from-- you could have been starting 419 00:23:54,088 --> 00:23:57,595 from 100% g, or 50% a, 50% g, or 100% 420 00:23:57,595 --> 00:24:02,010 c-- you apply this matrix many, many times, 421 00:24:02,010 --> 00:24:05,800 you will eventually approach this vector r. 422 00:24:05,800 --> 00:24:10,130 And the theory doesn't say what r is, exactly. 
423 00:24:10,130 --> 00:24:13,010 But it says that r equals r times p. 424 00:24:13,010 --> 00:24:18,590 And that turns out to basically implicitly define what r is. 425 00:24:18,590 --> 00:24:22,800 That is, you can solve for r using that equation. 426 00:24:22,800 --> 00:24:27,660 And r, for this reason, because the matrix doesn't move r, 427 00:24:27,660 --> 00:24:30,412 r is called the stationary distribution. 428 00:24:30,412 --> 00:24:32,620 And it's often also called the limiting distribution, 429 00:24:32,620 --> 00:24:34,050 for obvious reasons. 430 00:24:34,050 --> 00:24:37,450 And if you want to read more, like where this theory comes 431 00:24:37,450 --> 00:24:42,080 from, here's a reasonable reference. 432 00:24:42,080 --> 00:24:46,642 So any questions about this theory? 433 00:24:46,642 --> 00:24:48,100 All the elements in the matrix have 434 00:24:48,100 --> 00:24:50,550 to be strictly greater than 1-- I'm sorry, 435 00:24:50,550 --> 00:24:52,680 strictly greater than 0. 436 00:24:52,680 --> 00:24:55,370 Otherwise, really no conditions. 437 00:24:58,142 --> 00:24:59,370 All right, question? 438 00:24:59,370 --> 00:25:00,612 Yeah, go ahead. 439 00:25:00,612 --> 00:25:04,200 STUDENT: Does the [INAUDIBLE] distribution ever change, 440 00:25:04,200 --> 00:25:08,325 based on the sequence, or are we assuming that it doesn't? 441 00:25:08,325 --> 00:25:10,380 PROFESSOR: The theory says it only depends on p. 442 00:25:10,380 --> 00:25:11,925 It doesn't depend on q. 443 00:25:11,925 --> 00:25:16,150 So it depends on the model of how the changes happen, 444 00:25:16,150 --> 00:25:19,580 the conditional probability of what the base will 445 00:25:19,580 --> 00:25:21,177 be at the next generation given what 446 00:25:21,177 --> 00:25:22,510 it is at the current generation. 447 00:25:22,510 --> 00:25:24,140 It doesn't depend where you start. 
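The claim that the limit depends only on p and not on the starting vector q can be checked numerically. The sketch below repeatedly applies a matrix until the vector stops changing. The matrix values are made up for illustration (all entries greater than 0, each row summing to 1), and the helper names step and stationary are hypothetical; this is one simple way to approximate r, not the only one.

```python
# Hypothetical transition matrix over A, C, G, T (row = current, column = next).
# All entries are > 0 and each row sums to 1, as the theory requires.
P = [
    [0.90, 0.02, 0.06, 0.02],
    [0.03, 0.90, 0.02, 0.05],
    [0.06, 0.02, 0.90, 0.02],
    [0.02, 0.06, 0.02, 0.90],
]


def step(q, P):
    """Return q * P (one generation)."""
    return [sum(q[i] * P[i][j] for i in range(len(q))) for j in range(len(P[0]))]


def stationary(q, P, tol=1e-12, max_iter=100000):
    """Apply P until successive vectors differ by less than tol in every entry."""
    for _ in range(max_iter):
        nxt = step(q, P)
        if max(abs(x - y) for x, y in zip(q, nxt)) < tol:
            return nxt
        q = nxt
    return q


r_from_g = stationary([0.0, 0.0, 1.0, 0.0], P)   # start at 100% G
r_from_ag = stationary([0.5, 0.0, 0.5, 0.0], P)  # start at 50% A, 50% G
r_from_c = stationary([0.0, 1.0, 0.0, 0.0], P)   # start at 100% C

print(r_from_g)
print(r_from_ag)
print(r_from_c)
```

The three starting points give the same vector up to numerical tolerance, and applying P once more leaves it unchanged, which is the property r = r times p.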
448 00:25:24,140 --> 00:25:28,590 q is what your starting point is, 449 00:25:28,590 --> 00:25:31,876 what base you're initially at. 450 00:25:31,876 --> 00:25:33,160 Does that make sense? 451 00:25:35,860 --> 00:25:37,930 And this is obviously a very simplified case, 452 00:25:37,930 --> 00:25:39,930 where we're just modeling evolution of one base, 453 00:25:39,930 --> 00:25:43,750 and we're not thinking about whether the rates vary 454 00:25:43,750 --> 00:25:46,260 at different positions or within-- this 455 00:25:46,260 --> 00:25:47,530 is the simplest case. 456 00:25:47,530 --> 00:25:49,613 But it's important to understand the simplest case 457 00:25:49,613 --> 00:25:53,950 before you start to generalize that. 458 00:25:53,950 --> 00:25:55,930 OK, so let's do some examples here. 459 00:25:55,930 --> 00:25:58,350 So here are some matrices. 460 00:25:58,350 --> 00:26:02,200 So it turns out the math is a lot easier if you limit 461 00:26:02,200 --> 00:26:06,000 yourself to a two-letter alphabet instead of four. 462 00:26:06,000 --> 00:26:08,180 So that's what I've done here. 463 00:26:08,180 --> 00:26:12,680 So let's look at these matrices and think about what they mean. 464 00:26:12,680 --> 00:26:14,110 So we have two-letter alphabet. 465 00:26:14,110 --> 00:26:15,050 R is purine. 466 00:26:15,050 --> 00:26:17,600 Y is pyrimidine. 467 00:26:17,600 --> 00:26:20,470 These matrices describe the conditional probability 468 00:26:20,470 --> 00:26:26,010 that, at the next generation, you'll be, for example-- oops, 469 00:26:26,010 --> 00:26:27,510 here we go. 470 00:26:27,510 --> 00:26:31,400 That, for example, if you start at purine, 471 00:26:31,400 --> 00:26:33,930 that you'll remain purine at the next generation. 472 00:26:33,930 --> 00:26:36,790 That would be 1 minus P. And the probability that you'll 473 00:26:36,790 --> 00:26:39,930 change to pyrimidine is P. 
And the probability of pyrimidine 474 00:26:39,930 --> 00:26:43,730 will remain as a pyrimidine is 1 minus P. 475 00:26:43,730 --> 00:26:48,776 So what is the stationary distribution of this matrix? 476 00:26:48,776 --> 00:26:54,260 OK, so if p is small, this describes a typical model, 477 00:26:54,260 --> 00:26:58,110 where most of the time you remain-- 478 00:26:58,110 --> 00:27:00,440 DNA replication and repair is faithful. 479 00:27:00,440 --> 00:27:04,000 You maintain the same base. 480 00:27:04,000 --> 00:27:07,040 But occasionally a mutation happens with probability p. 481 00:27:09,600 --> 00:27:13,746 Anyone want to guess what the stationary distribution is 482 00:27:13,746 --> 00:27:18,470 or describe a strategy for finding it? 483 00:27:18,470 --> 00:27:20,790 Like what do we know about this distribution? 484 00:27:28,500 --> 00:27:30,530 Or imagine you start with a purine 485 00:27:30,530 --> 00:27:32,720 and then you apply this matrix many times 486 00:27:32,720 --> 00:27:36,751 to that vector that's 1 comma 0, what will happen? 487 00:27:39,460 --> 00:27:41,060 Yeah, Levi. 488 00:27:41,060 --> 00:27:44,739 STUDENT: Probably 50-50 because any other that way you skew it 489 00:27:44,739 --> 00:27:46,280 it would be pushed towards the center 490 00:27:46,280 --> 00:27:49,720 because there's more [INAUDIBLE] the other. 491 00:27:49,720 --> 00:27:51,320 PROFESSOR: OK, everyone get that? 492 00:27:51,320 --> 00:27:54,350 So Levi's comment was that it's probably 50-50. 493 00:27:54,350 --> 00:27:58,020 Because mutation probabilities are symmetrical. 494 00:27:58,020 --> 00:28:01,250 Purine-pyrimidine and pyrimidine-purine are the same. 495 00:28:01,250 --> 00:28:04,380 So if you were to start with say, lots of purine, 496 00:28:04,380 --> 00:28:06,925 then there will be more mutation toward pyrimidine 497 00:28:06,925 --> 00:28:08,340 in a given generation. 
498 00:28:08,340 --> 00:28:11,990 So if you think about this is your population of R 499 00:28:11,990 --> 00:28:16,410 and that's your population of Y, then if this is bigger 500 00:28:16,410 --> 00:28:21,550 than that, you'll tend to push it more that way. 501 00:28:21,550 --> 00:28:24,130 And there will be less mutation coming this way, 502 00:28:24,130 --> 00:28:25,650 until they're equal. 503 00:28:25,650 --> 00:28:28,315 And then you'll have equal flux going both directions. 504 00:28:28,315 --> 00:28:29,940 So that's a good way to think about it. 505 00:28:29,940 --> 00:28:32,300 And that's correct. 506 00:28:32,300 --> 00:28:37,430 Can you think of how would you show that? 507 00:28:37,430 --> 00:28:42,240 What's a way of solving for the stationary distribution? 508 00:28:46,088 --> 00:28:47,050 Anyone? 509 00:28:47,050 --> 00:28:51,172 So remember, we'll just get back one. 510 00:28:51,172 --> 00:28:54,140 The theory says that R equals RP. 511 00:28:54,140 --> 00:28:56,141 That's the key. 512 00:28:56,141 --> 00:28:56,640 R equals RP. 513 00:29:04,710 --> 00:29:05,520 So what is R? 514 00:29:05,520 --> 00:29:10,520 Well we don't know R. So we let that be a general vector. 515 00:29:10,520 --> 00:29:13,280 So notice there's only one free parameter. 516 00:29:13,280 --> 00:29:15,440 Because the two components have to sum to 1. 517 00:29:15,440 --> 00:29:20,140 It's a frequency vector, so x and 1 minus x. 518 00:29:20,140 --> 00:29:23,160 And we just multiply this times the matrix. 519 00:29:23,160 --> 00:29:27,610 So you take x comma 1 minus x. 520 00:29:27,610 --> 00:29:29,710 And you multiply it by this matrix. 521 00:29:29,710 --> 00:29:36,190 The matrix is 1 minus P P. I'm using too much space here. 522 00:29:36,190 --> 00:29:41,370 I'll just make it a little smaller-- P 1 minus P. 523 00:29:41,370 --> 00:29:46,110 And that's going to equal R. 
And so 524 00:29:46,110 --> 00:29:51,900 we'll get x times 1 minus P plus-- remember, 525 00:29:51,900 --> 00:29:54,700 it's dot product of this times this column, right? 526 00:29:54,700 --> 00:30:00,240 So x times 1 minus P plus 1 minus x times P. 527 00:30:00,240 --> 00:30:02,630 That's the first component. 528 00:30:02,630 --> 00:30:06,950 And the second component will be xp 529 00:30:06,950 --> 00:30:11,440 plus 1 minus x times 1 minus p. 530 00:30:16,370 --> 00:30:18,170 OK, everyone got that? 531 00:30:18,170 --> 00:30:19,340 So now what do we do? 532 00:30:24,320 --> 00:30:24,905 STUDENT: r. 533 00:30:24,905 --> 00:30:25,863 PROFESSOR: What's that? 534 00:30:25,863 --> 00:30:27,970 STUDENT: Make that equal to the initial r. 535 00:30:27,970 --> 00:30:30,053 PROFESSOR: Yeah, make that equal to the initial r. 536 00:30:30,053 --> 00:30:33,804 So it's two equations and-- well, you really 537 00:30:33,804 --> 00:30:34,970 only need one equation here. 538 00:30:34,970 --> 00:30:36,940 Because we've already simplified it. 539 00:30:36,940 --> 00:30:38,980 In general there will be two equations. 540 00:30:38,980 --> 00:30:40,480 There will be one equation that says 541 00:30:40,480 --> 00:30:42,334 that the components of the vector sum to 1. 542 00:30:42,334 --> 00:30:44,500 And there will be another equation coming from here. 543 00:30:44,500 --> 00:30:47,040 But we can just use either one, either term. 544 00:30:47,040 --> 00:30:49,070 So we know that the first component 545 00:30:49,070 --> 00:30:51,780 of a vector-- if this vector is equal to that vector, then 546 00:30:51,780 --> 00:30:53,654 the first components have to be equal, right? 547 00:30:53,654 --> 00:30:59,030 So x equals x times-- times what? 548 00:30:59,030 --> 00:31:04,980 Times 1 minus p, just combining these two. 549 00:31:04,980 --> 00:31:12,780 And then plus what are all the-- I'm sorry, that's 1 minus p-- 1 550 00:31:12,780 --> 00:31:13,430 minus p here. 
551 00:31:13,430 --> 00:31:16,090 And then there's another term here, minus another p. 552 00:31:19,020 --> 00:31:20,620 And then there's a term that's just p. 553 00:31:23,780 --> 00:31:26,330 And so then what do you do? 554 00:31:26,330 --> 00:31:28,862 You just solve for x. 555 00:31:32,510 --> 00:31:34,500 And I think when you work this out 556 00:31:34,500 --> 00:31:43,590 you'll get two p x equals p, so x equals 1/2. 557 00:31:46,470 --> 00:31:48,870 Right, everyone got that? 558 00:31:51,750 --> 00:31:54,670 OK, so yeah. 559 00:31:54,670 --> 00:31:57,380 So if x is 1/2, then the vector is 1/2 comma 1/2, 560 00:31:57,380 --> 00:31:59,250 which is the unbiased one. 561 00:31:59,250 --> 00:32:01,710 All right, what about this next matrix, 562 00:32:01,710 --> 00:32:06,620 right below-- 1 minus p, 1 minus q. 563 00:32:06,620 --> 00:32:14,610 p and q are two positive numbers that are different. 564 00:32:14,610 --> 00:32:16,740 So now there's actually a different probability 565 00:32:16,740 --> 00:32:20,850 of mutating purine to pyrimidine and pyrimidine to purine. 566 00:32:20,850 --> 00:32:23,460 So Levi, can we apply your approach 567 00:32:23,460 --> 00:32:26,701 to see what the answer is? 568 00:32:26,701 --> 00:32:27,655 STUDENT: Not exactly. 569 00:32:27,655 --> 00:32:29,563 PROFESSOR: Not exactly? 570 00:32:29,563 --> 00:32:32,350 OK, yeah, it's not as obvious. 571 00:32:32,350 --> 00:32:34,200 It's not symmetrical anymore. 572 00:32:34,200 --> 00:32:37,530 But can anyone guess what the answer might be? 573 00:32:37,530 --> 00:32:39,985 Yeah, go ahead Diego. 574 00:32:39,985 --> 00:32:41,985 STUDENT: It'll go either all the way to one side 575 00:32:41,985 --> 00:32:44,780 or the other, depending on q and p. 576 00:32:44,780 --> 00:32:46,833 PROFESSOR: All the way to one side or all the way 577 00:32:46,833 --> 00:32:47,374 to the other? 578 00:32:47,374 --> 00:32:50,526 So meaning it'll be all purine or all pyrimidine again.
579 00:32:50,526 --> 00:32:52,236 STUDENT: Yeah, depending on which-- 580 00:32:52,236 --> 00:32:53,841 PROFESSOR: Which is bigger? 581 00:32:53,841 --> 00:32:58,600 OK, anyone else have an alternative theory? 582 00:32:58,600 --> 00:32:59,440 Yeah, go ahead. 583 00:32:59,440 --> 00:33:00,729 What was your name again? 584 00:33:00,729 --> 00:33:01,395 STUDENT: Daniel. 585 00:33:01,395 --> 00:33:02,455 PROFESSOR: Sorry, Daniel? 586 00:33:02,455 --> 00:33:03,300 STUDENT: Daniel, yeah. 587 00:33:03,300 --> 00:33:04,049 PROFESSOR: Daniel. 588 00:33:04,049 --> 00:33:05,040 OK, go ahead. 589 00:33:05,040 --> 00:33:10,908 STUDENT: It'll reach some intermediate equilibrium 590 00:33:10,908 --> 00:33:14,226 once they balance each other out. 591 00:33:14,226 --> 00:33:20,500 And that would be exactly-- I'm not sure-- some ratio of q 592 00:33:20,500 --> 00:33:21,383 to p. 593 00:33:21,383 --> 00:33:23,700 PROFESSOR: OK. 594 00:33:23,700 --> 00:33:27,384 How many people think that might happen? 595 00:33:27,384 --> 00:33:28,050 OK, some people. 596 00:33:28,050 --> 00:33:30,920 OK Daniel has maybe slightly more supporters. 597 00:33:30,920 --> 00:33:32,470 So let's see. 598 00:33:32,470 --> 00:33:33,950 So how are we going to solve this? 599 00:33:36,730 --> 00:33:40,442 How do we figure out what the stationary distribution is? 600 00:33:40,442 --> 00:33:42,740 You just use that same approach. 601 00:33:42,740 --> 00:33:50,070 So you can do-- you have x 1 minus x 602 00:33:50,070 --> 00:33:59,035 times that matrix, which is got the 1 minus p p q 1 minus q. 603 00:33:59,035 --> 00:34:03,650 OK, and so now you'll get x 1 minus p. 604 00:34:03,650 --> 00:34:06,920 Anyway, go through the same operations. 605 00:34:06,920 --> 00:34:08,080 Solve for x. 606 00:34:08,080 --> 00:34:12,540 And you will get-- I think I put the answer on the slide here. 607 00:34:12,540 --> 00:34:15,100 You will get q over p plus q. 
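The "same operations" for the two-parameter matrix can be written out in a few lines. This is a sketch of the algebra under the convention used above: the row vector is (x, 1 - x) with x the purine fraction, the purine row of the matrix is (1 - p, p), and the pyrimidine row is (q, 1 - q). The matrix symbol P is notation added here.

```latex
\begin{aligned}
r &= (x,\; 1-x), \qquad
P = \begin{pmatrix} 1-p & p \\ q & 1-q \end{pmatrix}, \qquad r = rP \\[4pt]
\text{first component:}\quad x &= x(1-p) + (1-x)\,q \\
x &= x - xp + q - xq \\
x\,(p+q) &= q \\
x &= \frac{q}{p+q}, \qquad 1-x = \frac{p}{p+q}
\end{aligned}
```

Setting q equal to p recovers the symmetric case, 2px = p, so x = 1/2. The third line can also be rearranged to xp = (1 - x)q, meaning the flux out of purine equals the flux into purine.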
608 00:34:15,100 --> 00:34:18,409 So as Daniel predicted, some ratio involving q's and p's. 609 00:34:18,409 --> 00:34:20,719 And does this make sense? 610 00:34:20,719 --> 00:34:23,590 Seeing what the answer is, can you 611 00:34:23,590 --> 00:34:25,300 rationalize why that's true? 612 00:34:28,216 --> 00:34:31,132 STUDENT: It's like a kind of equilibrium. 613 00:34:31,132 --> 00:34:34,725 You have one kind of force pushing 614 00:34:34,725 --> 00:34:36,704 one way and another different one 615 00:34:36,704 --> 00:34:38,321 in this case pushing the other. 616 00:34:38,321 --> 00:34:40,320 PROFESSOR: Yeah, that's basically the same idea. 617 00:34:40,320 --> 00:34:42,449 And so they have to be in balance. 618 00:34:42,449 --> 00:34:50,190 So the one that loses less, where the mutation rate is lower, 619 00:34:50,190 --> 00:34:54,770 will end up being bigger, so that the amount that flows out 620 00:34:54,770 --> 00:34:57,870 will be the same as the amount that flows in. 621 00:34:57,870 --> 00:35:01,560 You can apply Levi's idea of thinking 622 00:35:01,560 --> 00:35:03,850 about how much flux is going in each way. 623 00:35:03,850 --> 00:35:06,380 So there's going to be some flux p in one direction, q 624 00:35:06,380 --> 00:35:07,700 in the other direction. 625 00:35:07,700 --> 00:35:16,440 And you want x times p to equal 1 minus x, times q. 626 00:35:16,440 --> 00:35:19,320 And this is the value of x that works. 627 00:35:23,590 --> 00:35:26,545 OK, good? 628 00:35:26,545 --> 00:35:28,760 What about this guy down here? 629 00:35:28,760 --> 00:35:31,510 So this is a very special matrix called the identity matrix. 630 00:35:31,510 --> 00:35:33,570 And what kind of model of evolution is this? 631 00:35:35,771 --> 00:35:36,979 STUDENT: There's no mutation. 632 00:35:36,979 --> 00:35:38,312 PROFESSOR: There's no evolution. 633 00:35:38,312 --> 00:35:42,610 This is like a perfect replication and repair system. 634 00:35:42,610 --> 00:35:44,160 The base never changes.
635 00:35:44,160 --> 00:35:47,805 So what's a stationary distribution? 636 00:35:47,805 --> 00:35:49,753 STUDENT: It's all-- 637 00:35:49,753 --> 00:35:50,727 PROFESSOR: What's that? 638 00:35:50,727 --> 00:35:52,190 STUDENT: It'll just stay where it is. 639 00:35:52,190 --> 00:35:53,156 PROFESSOR: It'll stay where it is. 640 00:35:53,156 --> 00:35:54,148 That's right. 641 00:35:54,148 --> 00:35:57,179 So any vector is stationary for this matrix. 642 00:35:57,179 --> 00:35:58,720 Remember that the theory said there's 643 00:35:58,720 --> 00:36:01,610 a unique stationary distribution. 644 00:36:01,610 --> 00:36:05,845 This seems to be inconsistent. 645 00:36:05,845 --> 00:36:07,250 Why is it not inconsistent? 646 00:36:07,250 --> 00:36:08,670 Sally? 647 00:36:08,670 --> 00:36:14,470 STUDENT: We defined all of the variables to be greater than 0. 648 00:36:14,470 --> 00:36:17,500 So when you have anything that's [INAUDIBLE] that is equal to 0. 649 00:36:17,500 --> 00:36:18,820 PROFESSOR: Right, so a condition of the theorem 650 00:36:18,820 --> 00:36:21,130 is that all the entries be strictly greater than 0. 651 00:36:21,130 --> 00:36:22,110 And this is why. 652 00:36:22,110 --> 00:36:25,869 If you have 0s in there, then crazy things can happen. 653 00:36:25,869 --> 00:36:28,410 Wherever you start, that's where you end up with this matrix. 654 00:36:28,410 --> 00:36:29,870 So every vector is stationary. 655 00:36:29,870 --> 00:36:34,260 And what about this crazy matrix over here, matrix q? 656 00:36:36,980 --> 00:36:39,630 What does it do? 657 00:36:39,630 --> 00:36:40,280 Joe. 658 00:36:40,280 --> 00:36:42,120 STUDENT: It's going to swap them back and forth. 659 00:36:42,120 --> 00:36:43,786 PROFESSOR: It swaps them back and forth. 660 00:36:43,786 --> 00:36:47,860 So this is like a hypermutable organism 661 00:36:47,860 --> 00:36:50,110 that has such a high mutation rate that it always 662 00:36:50,110 --> 00:36:53,056 mutates every base to the other kind.
663 00:36:53,056 --> 00:36:54,430 It's never happy with its genome. 664 00:36:54,430 --> 00:36:56,940 It always wants to switch it, get something better. 665 00:36:56,940 --> 00:37:00,540 And so what can you say about the stationary distribution 666 00:37:00,540 --> 00:37:01,415 for this matrix? 667 00:37:04,578 --> 00:37:05,544 Jeff? 668 00:37:05,544 --> 00:37:07,086 STUDENT: There isn't going to be one. 669 00:37:07,086 --> 00:37:08,710 PROFESSOR: There isn't going to be one? 670 00:37:08,710 --> 00:37:09,350 Anyone else? 671 00:37:09,350 --> 00:37:12,334 STUDENT: Well, actually, I guess 1, 1, like 0.5, 0.5. 672 00:37:12,334 --> 00:37:14,000 PROFESSOR: 0.5, 0.5 would be stationary. 673 00:37:14,000 --> 00:37:14,900 Because you're-- 674 00:37:14,900 --> 00:37:16,700 STUDENT: But you won't converge to it. 675 00:37:16,700 --> 00:37:18,366 PROFESSOR: But you won't converge to it. 676 00:37:18,366 --> 00:37:20,411 That's right. It's stationary, but not limiting. 677 00:37:20,411 --> 00:37:21,910 And again, the theory doesn't apply. 678 00:37:21,910 --> 00:37:23,600 Because there are some 0s in this matrix. 679 00:37:23,600 --> 00:37:25,058 But you can still think about that. 680 00:37:25,058 --> 00:37:26,450 OK, everyone got that? 681 00:37:26,450 --> 00:37:27,894 All right, good. 682 00:37:30,730 --> 00:37:33,060 OK so let's talk now about Jukes-Cantor. 683 00:37:33,060 --> 00:37:36,700 So Jukes-Cantor is very much a Markov model 684 00:37:36,700 --> 00:37:38,690 of DNA sequence evolution. 685 00:37:38,690 --> 00:37:41,480 And it simply has-- now we've got four bases. 686 00:37:41,480 --> 00:37:46,330 It's got probability alpha of mutating from each base 687 00:37:46,330 --> 00:37:47,240 to any other base. 688 00:37:47,240 --> 00:37:51,040 And so the overall mutation rate, or probability 689 00:37:51,040 --> 00:37:54,850 of substitution, at one generation is three alpha.
690 00:37:54,850 --> 00:37:59,240 Because from the base G there's an alpha probability to mutate 691 00:37:59,240 --> 00:38:02,430 to A, an alpha probability to C, an alpha to T, 692 00:38:02,430 --> 00:38:03,610 so that's three alpha. 693 00:38:03,610 --> 00:38:09,910 And you can basically write a recursion 694 00:38:09,910 --> 00:38:12,490 that describes what's going on here. 695 00:38:12,490 --> 00:38:16,500 So if you start with a G at time 0, 696 00:38:16,500 --> 00:38:19,405 the probability of a G at time 1 is 1 minus 3 alpha. 697 00:38:19,405 --> 00:38:21,910 It's the probability that you didn't mutate. 698 00:38:21,910 --> 00:38:24,050 But then, at generation two, you have 699 00:38:24,050 --> 00:38:27,430 to consider two cases really. 700 00:38:27,430 --> 00:38:32,660 First of all, if you didn't mutate, that's PG1. 701 00:38:32,660 --> 00:38:35,290 Then you have a 1 minus 3 alpha probability 702 00:38:35,290 --> 00:38:39,490 of not mutating again, so remaining G. 703 00:38:39,490 --> 00:38:40,650 But you might have mutated. 704 00:38:43,750 --> 00:38:46,350 With probability 1 minus PG 1 you mutated. 705 00:38:46,350 --> 00:38:50,250 And then whatever you were-- might 706 00:38:50,250 --> 00:38:52,530 be a C-- you have an alpha probability 707 00:38:52,530 --> 00:38:58,224 of mutating back to G. Does that make sense? 708 00:38:58,224 --> 00:39:02,600 Everyone clear why there's a 3 in one place and only a 1 alpha 709 00:39:02,600 --> 00:39:03,494 in the other? 710 00:39:06,180 --> 00:39:10,370 All right, so you can actually solve this recursion. 711 00:39:10,370 --> 00:39:15,430 And you get this expression here, P G of t 712 00:39:15,430 --> 00:39:21,560 equals 1/4 plus 3/4 E to the minus 4 alpha t.
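The recursion and its solution can be compared directly. This is a rough sketch under stated assumptions: a G stays G with probability 1 minus 3 alpha, any non-G base returns to G with probability alpha, and the value of alpha and the number of generations are made up for illustration. The function names are hypothetical. Stepping the recursion one generation at a time gives exactly 1/4 + 3/4 (1 - 4 alpha)^t, which is close to the exponential expression when alpha is small.

```python
import math


def p_g_recursion(alpha, t):
    """P(base is G at generation t | G at generation 0), by stepping the recursion."""
    p = 1.0
    for _ in range(t):
        # stay G with prob (1 - 3*alpha), or come back to G from a non-G base with prob alpha
        p = (1 - 3 * alpha) * p + alpha * (1 - p)
    return p


def p_g_closed_form(alpha, t):
    """The expression quoted in the lecture: 1/4 + 3/4 * exp(-4*alpha*t)."""
    return 0.25 + 0.75 * math.exp(-4 * alpha * t)


alpha = 1e-3        # made-up per-generation substitution probability per target base
generations = 500   # made-up number of generations

stepped = p_g_recursion(alpha, generations)
closed = p_g_closed_form(alpha, generations)
print(stepped, closed)
```

The two printed values should differ only in the third or fourth decimal place here, and both head toward 1/4 as the number of generations grows.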
713 00:39:21,560 --> 00:39:27,480 OK so what does that tell you about-- we 714 00:39:27,480 --> 00:39:29,690 know from our previous discussion 715 00:39:29,690 --> 00:39:33,710 what the stationary distribution of this Markov chain 716 00:39:33,710 --> 00:39:35,220 is going to be. 717 00:39:35,220 --> 00:39:35,910 What will it be? 718 00:39:40,950 --> 00:39:44,048 What's the stationary distribution? 719 00:39:44,048 --> 00:39:46,342 STUDENT: 1/4 of each. 720 00:39:46,342 --> 00:39:47,300 PROFESSOR: 1/4 of each. 721 00:39:47,300 --> 00:39:48,790 And why, Daniel, is that? 722 00:39:48,790 --> 00:39:51,370 STUDENT: Because the probability of them moving to any base 723 00:39:51,370 --> 00:39:52,134 is the same? 724 00:39:52,134 --> 00:39:53,925 PROFESSOR: Right, it's totally symmetrical. 725 00:39:53,925 --> 00:39:56,582 So that has to be the answer by symmetry. 726 00:39:56,582 --> 00:39:57,540 And you could solve it. 727 00:39:57,540 --> 00:40:00,790 You could use this same approach with defining 728 00:40:00,790 --> 00:40:06,000 a value-- the theory applies if alpha is greater than 0 729 00:40:06,000 --> 00:40:09,079 and less than 1-- or less than-- I think 730 00:40:09,079 --> 00:40:10,870 it has to be less than a quarter, actually, 731 00:40:10,870 --> 00:40:12,130 or something like that. 732 00:40:12,130 --> 00:40:17,749 And you can apply the theory. 733 00:40:17,749 --> 00:40:19,540 So there will be a stationary distribution. 734 00:40:19,540 --> 00:40:20,920 You can set up a vector. 735 00:40:20,920 --> 00:40:25,970 Now you have to have four terms in it and multiplication. 736 00:40:25,970 --> 00:40:29,250 And then you'll get a system of basically four 737 00:40:29,250 --> 00:40:31,385 equations and four unknowns. 738 00:40:31,385 --> 00:40:36,100 And you can solve that system using linear algebra 739 00:40:36,100 --> 00:40:37,650 and get the answer. 740 00:40:37,650 --> 00:40:42,440 And yeah, the answer will be 1/4, as you guessed. 
741 00:40:42,440 --> 00:40:46,350 And so what this Jukes-Cantor expression 742 00:40:46,350 --> 00:40:51,540 tells you is how quickly does it get to that equilibrium. 743 00:40:51,540 --> 00:40:55,070 We're thinking about G. You can start at 100% G. 744 00:40:55,070 --> 00:40:57,860 And it will then approach 1/4. 745 00:40:57,860 --> 00:40:59,672 You can see 1/4 is clearly what's 746 00:40:59,672 --> 00:41:00,880 going to happen in the limit. 747 00:41:00,880 --> 00:41:05,720 Because as t gets big that second term is going to 0. 748 00:41:05,720 --> 00:41:08,110 And so what does the distribution look like? 749 00:41:08,110 --> 00:41:10,810 How rapidly do you approach 1/4? 750 00:41:15,080 --> 00:41:18,160 You approach it exponentially. 751 00:41:18,160 --> 00:41:20,390 So you start at 1 here. 752 00:41:20,390 --> 00:41:21,590 And this is 0. 753 00:41:21,590 --> 00:41:23,530 This is 1/4. 754 00:41:23,530 --> 00:41:24,560 You'll start here. 755 00:41:24,560 --> 00:41:28,197 And you'll go like that. 756 00:41:28,197 --> 00:41:29,530 You go rapidly at the beginning. 757 00:41:29,530 --> 00:41:34,990 And then you get just very gradual approach 1/4. 758 00:41:34,990 --> 00:41:41,000 So you can do a little bit more algebra with this expression. 759 00:41:41,000 --> 00:41:44,710 And here's where the really useful part comes in. 760 00:41:44,710 --> 00:41:48,520 And you can show that K, which we'll 761 00:41:48,520 --> 00:41:51,340 define as the true number of substitutions that 762 00:41:51,340 --> 00:41:57,190 have occurred at this particular base that we're considering, 763 00:41:57,190 --> 00:42:04,070 is related to D, where D is the fraction of positions that 764 00:42:04,070 --> 00:42:07,940 differ when you just take say the parental sequence 765 00:42:07,940 --> 00:42:11,124 and the daughter sequence, the eventual sequence 766 00:42:11,124 --> 00:42:11,790 that you get to. 767 00:42:11,790 --> 00:42:13,250 You just match those two. 
768 00:42:13,250 --> 00:42:14,840 And you count up the differences. 769 00:42:14,840 --> 00:42:17,825 That's D. And then K is the actual number 770 00:42:17,825 --> 00:42:20,300 of substitutions that have occurred. 771 00:42:20,300 --> 00:42:24,580 And those are related by this equation, K equals minus 3/4, 772 00:42:24,580 --> 00:42:29,520 natural log, 1 minus 4/3 d. 773 00:42:29,520 --> 00:42:31,220 So let's try to think about, first 774 00:42:31,220 --> 00:42:36,520 of all, what is the shape of that curve? 775 00:42:36,520 --> 00:42:38,150 What does that look like? 776 00:42:45,990 --> 00:42:47,030 Here's 0. 777 00:42:47,030 --> 00:42:49,710 I'll put 1 over here. 778 00:42:49,710 --> 00:42:54,110 So we all know that log-- if it was just simply 779 00:42:54,110 --> 00:42:57,110 log of something between 0 and 1, 780 00:42:57,110 --> 00:43:03,700 it would look like what-- look like that. 781 00:43:03,700 --> 00:43:09,090 Starts from negative infinity and comes up to 0 at 1. 782 00:43:09,090 --> 00:43:13,340 But it's actually not log of D. It's log of 1 minus D, 783 00:43:13,340 --> 00:43:18,560 or 1 minus a constant times D. So that will flip it. 784 00:43:18,560 --> 00:43:22,433 So the minus infinity will be there. 785 00:43:22,433 --> 00:43:26,310 It will come in like that. 786 00:43:26,310 --> 00:43:32,050 And then we also have minus 3/4. 787 00:43:32,050 --> 00:43:34,390 There's a minus in front of this whole thing. 788 00:43:34,390 --> 00:43:38,301 So all these logs are of numbers that are less than 1. 789 00:43:38,301 --> 00:43:39,300 So they're all negative. 790 00:43:39,300 --> 00:43:41,450 But then it'll get flipped. 791 00:43:41,450 --> 00:43:46,570 So it'll actually look like that. 792 00:43:46,570 --> 00:43:50,810 And it will go to infinity where? 793 00:44:03,280 --> 00:44:05,000 Where does this go to infinity? 794 00:44:05,000 --> 00:44:09,330 So if this is now K is on this axis. 795 00:44:09,330 --> 00:44:11,020 And yeah, sorry if that wasn't clear. 
796 00:44:11,020 --> 00:44:13,120 D is here. 797 00:44:13,120 --> 00:44:15,520 So this is just again, this is if we 798 00:44:15,520 --> 00:44:17,780 did log of D it would look like this. 799 00:44:17,780 --> 00:44:20,080 If we do log of 1 minus something times D, 800 00:44:20,080 --> 00:44:22,240 that'll flip it. 801 00:44:22,240 --> 00:44:25,660 And then if we do minus that, it'll flip it again that way. 802 00:44:25,660 --> 00:44:29,815 OK so now K, as a function of D, is going to look like this. 803 00:44:32,380 --> 00:44:38,870 Sometimes people like to put-- anyway, 804 00:44:38,870 --> 00:44:40,520 but let's just think about this. 805 00:44:40,520 --> 00:44:43,510 So it's going to go up to infinity somewhere. 806 00:44:43,510 --> 00:44:44,410 And where is that? 807 00:44:44,410 --> 00:44:45,495 STUDENT: 3/4. 808 00:44:45,495 --> 00:44:46,120 PROFESSOR: 3/4. 809 00:44:48,950 --> 00:44:52,020 So does that make sense? 810 00:44:52,020 --> 00:44:54,050 Can someone tell us what's going on 811 00:44:54,050 --> 00:44:58,022 and what is the use of this whole thing here? 812 00:44:58,022 --> 00:44:59,516 Yeah, in the back. 813 00:44:59,516 --> 00:45:00,260 What's your name? 814 00:45:00,260 --> 00:45:01,010 STUDENT: Julianne. 815 00:45:01,010 --> 00:45:02,006 PROFESSOR: Yeah, Julianne. 816 00:45:02,006 --> 00:45:02,506 Go ahead. 817 00:45:02,506 --> 00:45:03,500 STUDENT: [INAUDIBLE] 0. 818 00:45:03,500 --> 00:45:09,476 So that part would give you negative infinity. 819 00:45:09,476 --> 00:45:12,970 And so you just solve for D in there. 820 00:45:12,970 --> 00:45:17,260 PROFESSOR: OK, so when D is 3/4 you'll get 1 minus 1. 821 00:45:17,260 --> 00:45:17,795 You get 0. 822 00:45:17,795 --> 00:45:18,870 That'll be negative infinity. 823 00:45:18,870 --> 00:45:20,280 And then there's a minus in front, 824 00:45:20,280 --> 00:45:21,529 so it'll be positive infinity. 825 00:45:21,529 --> 00:45:22,280 So that's true.
826 00:45:22,280 --> 00:45:24,620 And does that intuitively make sense to you? 827 00:45:32,460 --> 00:45:33,630 We have a sequence. 828 00:45:33,630 --> 00:45:37,200 It's evolving randomly, according to this model. 829 00:45:37,200 --> 00:45:39,080 And then we have that ancestral sequence. 830 00:45:39,080 --> 00:45:42,461 And then we have a modern descendant of that sequence, 831 00:45:42,461 --> 00:45:44,960 millions of generations-- or maybe thousands of generations, 832 00:45:44,960 --> 00:45:47,020 or some large number of generations away. 833 00:45:47,020 --> 00:45:48,567 We line up those two sequences. 834 00:45:48,567 --> 00:45:50,650 We count how many matches and how many mismatches. 835 00:45:50,650 --> 00:45:52,665 What's the fraction of mismatches, 836 00:45:52,665 --> 00:45:53,820 of differences we have? 837 00:45:56,610 --> 00:46:00,990 Basically if that-- let's look at a different case. 838 00:46:00,990 --> 00:46:06,410 What if d is very small? 839 00:46:06,410 --> 00:46:08,420 What if it's like 1%. 840 00:46:08,420 --> 00:46:09,170 Then what happens? 841 00:46:14,220 --> 00:46:19,270 If d is small, turns out k is pretty much like d. 842 00:46:19,270 --> 00:46:22,960 It grows linearly with d in the beginning. 843 00:46:22,960 --> 00:46:25,460 So does that make sense? 844 00:46:27,862 --> 00:46:28,570 That makes sense. 845 00:46:28,570 --> 00:46:31,510 Because k is the true number of substitutions that happen. 846 00:46:31,510 --> 00:46:34,265 When you go one generation, the true number 847 00:46:34,265 --> 00:46:36,050 of substitutions and the measured number 848 00:46:36,050 --> 00:46:37,480 of substitutions is the same. 849 00:46:37,480 --> 00:46:39,720 Because there's no back mutations. 
850 00:46:39,720 --> 00:46:42,374 But when you go further, there's an increasing chance 851 00:46:42,374 --> 00:46:44,040 of a back-- there's an increasing chance 852 00:46:44,040 --> 00:46:46,002 of a mutation, therefore increasing chance 853 00:46:46,002 --> 00:46:47,460 that you also have a back mutation. 854 00:46:47,460 --> 00:46:49,610 And so this is what happens at long time. 855 00:46:49,610 --> 00:46:54,870 So basically this is linear here and then goes up like that. 856 00:46:54,870 --> 00:46:58,010 And so what this allows you to do-- 857 00:46:58,010 --> 00:47:00,840 d is something that you can measure. 858 00:47:00,840 --> 00:47:04,400 And then k is something that you want to know. 859 00:47:04,400 --> 00:47:10,020 The point is, if I measure the difference between human 860 00:47:10,020 --> 00:47:14,190 and chimp sequence, it might be only 1% different. 861 00:47:14,190 --> 00:47:17,420 And if I have an idea of mutation rate per generation, 862 00:47:17,420 --> 00:47:19,710 I can figure out how many generations apart, 863 00:47:19,710 --> 00:47:24,330 or how much time has passed, since humans split from chimp. 864 00:47:24,330 --> 00:47:29,880 But if I go to mouse, where the average base might be-- there 865 00:47:29,880 --> 00:47:34,985 might be only a 50% matching-- if that's true, 866 00:47:34,985 --> 00:47:36,610 there have been a lot of changes there. 867 00:47:36,610 --> 00:47:39,270 There will be a lot of bases that have changed once, 868 00:47:39,270 --> 00:47:41,460 as well as a lot that may have changed twice, 869 00:47:41,460 --> 00:47:43,540 and may have actually changed back. 870 00:47:43,540 --> 00:47:46,920 And so let's say human and mouse are 50% identical. 871 00:47:46,920 --> 00:47:50,040 That 50% identical-- I can't just 872 00:47:50,040 --> 00:47:53,080 compare it to let's say the 1% with chimp 873 00:47:53,080 --> 00:47:56,680 and say it's 50 times longer.
874 00:47:56,680 --> 00:47:58,850 That 50% will be an underestimate 875 00:47:58,850 --> 00:48:01,700 of the true difference. 876 00:48:01,700 --> 00:48:05,180 Because there's been some back mutations as well. 877 00:48:05,180 --> 00:48:06,730 And so you have to use this formula 878 00:48:06,730 --> 00:48:10,110 to figure out what the true evolutionary time is, 879 00:48:10,110 --> 00:48:12,600 the true number of changes that happened. 880 00:48:12,600 --> 00:48:13,826 Yeah, go ahead. 881 00:48:13,826 --> 00:48:17,277 STUDENT: Does simple count refer to just the difference 882 00:48:17,277 --> 00:48:20,235 in the amount of mutations? 883 00:48:20,235 --> 00:48:21,714 Or what's-- 884 00:48:21,714 --> 00:48:25,233 PROFESSOR: The simple count is what you actually observe. 885 00:48:25,233 --> 00:48:29,570 So you have a stretch of sequence-- 886 00:48:29,570 --> 00:48:33,520 let's say the beta globin genomic locus in human. 887 00:48:33,520 --> 00:48:36,540 You line it up to the beta globin locus in chimp. 888 00:48:36,540 --> 00:48:38,770 You count what fraction of positions differ? 889 00:48:38,770 --> 00:48:40,060 What fractions are different? 890 00:48:40,060 --> 00:48:40,580 That's d. 891 00:48:43,190 --> 00:48:47,410 And then k is-- actually, it's slightly complicated here. 892 00:48:47,410 --> 00:48:50,360 Because if this is human and that's chimp, 893 00:48:50,360 --> 00:48:54,990 then k is more like-- because you don't actually 894 00:48:54,990 --> 00:48:56,190 observe the ancestor. 895 00:48:56,190 --> 00:48:57,440 You observe chimp. 896 00:48:57,440 --> 00:49:00,460 So you have to go back to the ancestor and then forward. 897 00:49:00,460 --> 00:49:03,860 So that's the relevant number of generations. 898 00:49:03,860 --> 00:49:06,060 And so k will tell you how many changes 899 00:49:06,060 --> 00:49:09,540 must have occurred to give you that observed 900 00:49:09,540 --> 00:49:11,280 fraction of differences. 
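To make the correction concrete, here is a small sketch of the formula K = -3/4 ln(1 - 4/3 d) as a function. The function name is hypothetical, and the inputs 0.01 and 0.5 are just the illustrative figures used above, not measured values.

```python
import math


def jukes_cantor_distance(d):
    """Estimated substitutions per site, K, from the observed fraction of differing sites, d."""
    if not 0.0 <= d < 0.75:
        # At d = 3/4 the log argument reaches 0 and K goes to infinity.
        raise ValueError("d must satisfy 0 <= d < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * d / 3.0)


k_small = jukes_cantor_distance(0.01)  # the 1% example
k_large = jukes_cantor_distance(0.50)  # the 50% example

print(k_small, k_large, k_large / k_small)
```

For d = 0.01 the estimate is about 0.0101, so K and d nearly coincide. For d = 0.5 it is about 0.824, so the corrected distance is roughly 80 times the 1% case rather than 50 times.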
901 00:49:11,280 --> 00:49:13,360 And for short distances, it's linear. 902 00:49:13,360 --> 00:49:16,360 And then for long, it's logarithmic, basically. 903 00:49:19,020 --> 00:49:19,892 Yeah, question. 904 00:49:19,892 --> 00:49:23,695 STUDENT: So I'm guessing all of [INAUDIBLE] that selection 905 00:49:23,695 --> 00:49:24,936 is absent. 906 00:49:24,936 --> 00:49:25,936 PROFESSOR: Right, right. 907 00:49:25,936 --> 00:49:27,373 This is ignoring selection. 908 00:49:27,373 --> 00:49:28,331 That's a good point. 909 00:49:32,170 --> 00:49:34,020 So think about this. 910 00:49:34,020 --> 00:49:37,040 And let me know if other questions come up. 911 00:49:37,040 --> 00:49:39,740 So this actually came up the other day 912 00:49:39,740 --> 00:49:42,660 when we were talking about DNA substitution models. 913 00:49:42,660 --> 00:49:45,890 So Kimura and others have observed 914 00:49:45,890 --> 00:49:49,440 that transitions occur much more often than transversions, 915 00:49:49,440 --> 00:49:51,240 maybe two to three times as often, 916 00:49:51,240 --> 00:49:54,120 and so proposed a matrix like this. 917 00:49:54,120 --> 00:49:56,290 And now you can use what you know 918 00:49:56,290 --> 00:49:58,380 about stationary distributions to solve 919 00:49:58,380 --> 00:50:03,800 for the limiting or stationary distribution of this matrix. 920 00:50:03,800 --> 00:50:06,635 And actually, you will find it's still symmetrical. 921 00:50:06,635 --> 00:50:08,260 It's a little bit more complicated now, 922 00:50:08,260 --> 00:50:11,690 but you'll still get that 1/4, 1/4.
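The claim that a transition/transversion matrix still converges to 1/4 for each base can be checked numerically. The sketch below assumes a Kimura-style two-parameter matrix with a per-generation transition probability `alpha` and transversion probability `beta`; the specific values 0.02 and 0.01 are made up for illustration and are not taken from the slide.

```python
BASES = "ACGT"
# Transitions are purine<->purine (A<->G) and pyrimidine<->pyrimidine (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def kimura_matrix(alpha, beta):
    """Build a 4x4 one-generation transition matrix (rows: from, columns: to)."""
    matrix = []
    for x in BASES:
        row = []
        for y in BASES:
            if x == y:
                row.append(1 - alpha - 2 * beta)  # no change
            elif (x, y) in TRANSITIONS:
                row.append(alpha)                 # transition
            else:
                row.append(beta)                  # transversion
        matrix.append(row)
    return matrix

def evolve(dist, matrix, generations):
    """Repeatedly apply the matrix to a probability distribution over bases."""
    for _ in range(generations):
        dist = [sum(dist[i] * matrix[i][j] for i in range(4)) for j in range(4)]
    return dist

# Start from a site that is certainly 'A' and run for many generations.
P = kimura_matrix(alpha=0.02, beta=0.01)  # hypothetical rates
limit = evolve([1.0, 0.0, 0.0, 0.0], P, generations=2000)
print([round(p, 4) for p in limit])
```

Running the chain many times, rather than solving the linear system, is the same "just simulate it" idea mentioned for the larger dinucleotide models.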
923 00:50:11,690 --> 00:50:13,760 But then more recently others have 924 00:50:13,760 --> 00:50:16,670 observed that really, dinucleotides 925 00:50:16,670 --> 00:50:19,810 matter in terms of mutation rates, 926 00:50:19,810 --> 00:50:22,600 particularly in vertebrates. So what's 927 00:50:22,600 --> 00:50:26,210 special about vertebrates is that they have methylation 928 00:50:26,210 --> 00:50:30,372 machinery that methylates CpG dinucleotides on the C. 929 00:50:30,372 --> 00:50:33,470 And that makes those C's hypermutable. 930 00:50:33,470 --> 00:50:36,570 They mutate at about 10 times the rate of any other base. 931 00:50:36,570 --> 00:50:40,142 And so you can give a higher mutation rate to C, 932 00:50:40,142 --> 00:50:41,600 but that doesn't really capture it. 933 00:50:41,600 --> 00:50:44,060 It's really a higher mutation rate of C's that are next 934 00:50:44,060 --> 00:50:45,960 to G's. 935 00:50:45,960 --> 00:50:48,880 And so you can define a model that's 936 00:50:48,880 --> 00:50:51,134 16 by 16, which has dinucleotide mutation rates. 937 00:50:51,134 --> 00:50:52,550 And that's actually a better model 938 00:50:52,550 --> 00:50:54,130 of DNA sequence evolution. 939 00:50:54,130 --> 00:50:57,059 And it's just the math gets a little hairier 940 00:50:57,059 --> 00:50:59,100 if you want to calculate stationary distribution. 941 00:50:59,100 --> 00:51:01,090 But again, it can be done. 942 00:51:01,090 --> 00:51:03,980 And it's actually pretty easy to simulate. 943 00:51:06,490 --> 00:51:08,490 Knowing that it will converge to the stationary, 944 00:51:08,490 --> 00:51:10,520 you can just run the thing many times. 945 00:51:10,520 --> 00:51:14,420 And you'll get to the answer.
946 00:51:14,420 --> 00:51:17,170 And there's even been strand-specific models 947 00:51:17,170 --> 00:51:20,510 proposed, where there are some differences between how 948 00:51:20,510 --> 00:51:23,797 the repair machinery treats the two DNA strands that 949 00:51:23,797 --> 00:51:25,630 are related to transcription coupled repair. 950 00:51:25,630 --> 00:51:27,560 So you actually get some asymmetries there. 951 00:51:27,560 --> 00:51:31,840 And this is a reasonably rich area. 952 00:51:31,840 --> 00:51:35,780 And you can look at some of these references. 953 00:51:35,780 --> 00:51:38,580 All right, so one more topic, while we're on 954 00:51:38,580 --> 00:51:41,470 evolution-- this is very classical. 955 00:51:41,470 --> 00:51:45,180 But I just wanted to make sure that everyone has seen it. 956 00:51:45,180 --> 00:51:50,000 If you are looking specifically at protein coding sequences, 957 00:51:50,000 --> 00:51:55,130 exons, and you know the reading frame, you can just align them. 958 00:51:55,130 --> 00:51:57,840 And then you can look at two different types 959 00:51:57,840 --> 00:51:59,500 of substitutions. 960 00:51:59,500 --> 00:52:03,320 You can look at what are called the nonsynonymous 961 00:52:03,320 --> 00:52:08,850 substitutions, so changes to the codons that change 962 00:52:08,850 --> 00:52:12,830 the underlying amino acid, the encoded amino acid. 963 00:52:12,830 --> 00:52:15,380 And you define often a term that's 964 00:52:15,380 --> 00:52:19,790 either called Ka or dN, depending who you read, 965 00:52:19,790 --> 00:52:24,260 that is the fraction of nonsynonymous substitutions 966 00:52:24,260 --> 00:52:27,090 divided by nonsynonymous sites. 967 00:52:27,090 --> 00:52:31,060 And in this case let's do synonymous first. 968 00:52:31,060 --> 00:52:32,900 So you can also look at the other changes. 
969 00:52:32,900 --> 00:52:35,050 So these are now synonymous changes 970 00:52:35,050 --> 00:52:36,920 which are base changes to triplets 971 00:52:36,920 --> 00:52:40,210 that do not change the encoded amino acid. 972 00:52:40,210 --> 00:52:42,970 So in this case, there are three of those. 973 00:52:42,970 --> 00:52:47,300 And a lot of evolutionary approaches 974 00:52:47,300 --> 00:52:50,220 are just based on calculating these two numbers. 975 00:52:50,220 --> 00:52:52,040 You count synonymous changes. 976 00:52:52,040 --> 00:52:54,100 You divide by synonymous sites, count 977 00:52:54,100 --> 00:52:57,480 non-synonymous substitutions, divide by non-synonymous sites. 978 00:52:57,480 --> 00:53:00,040 And so what do we mean by synonymous site? 979 00:53:00,040 --> 00:53:07,010 Well if you have only amino acids that are fourfold, 980 00:53:07,010 --> 00:53:09,420 that have fourfold degenerate codons, 981 00:53:09,420 --> 00:53:13,070 which is all of them are like that in this case, 982 00:53:13,070 --> 00:53:19,410 then for example GG-- or let's see what's up here. 983 00:53:19,410 --> 00:53:23,250 Yeah, CC anything codes for proline. 984 00:53:23,250 --> 00:53:24,320 Do we have any of those? 985 00:53:24,320 --> 00:53:26,320 Actually, these are not all fourfold degenerate. 986 00:53:26,320 --> 00:53:27,150 I apologize. 987 00:53:27,150 --> 00:53:30,880 But glycine, for example-- so GG anything is glycine. 988 00:53:30,880 --> 00:53:36,476 So in this triplet, this triplet here, 989 00:53:36,476 --> 00:53:39,140 there's one synonymous site. 990 00:53:39,140 --> 00:53:40,640 The third site is a synonymous site. 991 00:53:40,640 --> 00:53:44,330 You can change that without changing the amino acid. 992 00:53:44,330 --> 00:53:46,580 But the other two are non-synonymous.
993 00:53:46,580 --> 00:53:50,100 So to a first approximation, you 994 00:53:50,100 --> 00:53:51,650 take non-synonymous substitutions 995 00:53:51,650 --> 00:53:54,030 and divide by the number of codons-- I'm sorry, 996 00:53:54,030 --> 00:53:55,740 the number of codons times 2, since there 997 00:53:55,740 --> 00:53:58,390 are two non-synonymous positions in each codon. 998 00:53:58,390 --> 00:54:00,620 And you take synonymous substitutions, 999 00:54:00,620 --> 00:54:01,989 divide by the number of codons. 1000 00:54:01,989 --> 00:54:03,030 OK, does that make sense? 1001 00:54:03,030 --> 00:54:04,590 One per codon. 1002 00:54:04,590 --> 00:54:10,650 OK and so what do you then do with this? 1003 00:54:10,650 --> 00:54:12,950 You can correct this value using-- basically 1004 00:54:12,950 --> 00:54:15,910 this is the Jukes-Cantor correction 1005 00:54:15,910 --> 00:54:20,870 that we just calculated, this minus 3/4 log of 1 minus 4/3 d. 1006 00:54:20,870 --> 00:54:24,570 That applies to codon evolution as well as individual base 1007 00:54:24,570 --> 00:54:25,510 evolution. 1008 00:54:25,510 --> 00:54:28,430 And what people often do with this 1009 00:54:28,430 --> 00:54:33,350 is they calculate Ka and Ks for a whole gene. 1010 00:54:33,350 --> 00:54:37,280 Let's say you have alignments of all human genes 1011 00:54:37,280 --> 00:54:40,110 to their orthologs in mouse-- that 1012 00:54:40,110 --> 00:54:42,500 is, the corresponding homologous gene in mouse. 1013 00:54:42,500 --> 00:54:45,280 And you calculate Ka/Ks. 1014 00:54:45,280 --> 00:54:47,700 And then you can look at those genes 1015 00:54:47,700 --> 00:54:51,190 where this ratio is significantly less than 1, 1016 00:54:51,190 --> 00:54:53,959 or around 1, or greater than 1. 1017 00:54:53,959 --> 00:54:55,500 And that actually tells you something 1018 00:54:55,500 --> 00:54:59,130 about how that-- the type of selection 1019 00:54:59,130 --> 00:55:03,230 that that gene is experiencing.
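The first-approximation Ka and Ks calculation described above can be sketched as follows. This is a rough illustration only: it assumes the standard genetic code, two gap-free in-frame aligned sequences, one synonymous and two non-synonymous sites per codon (as in the fourfold-degenerate simplification), and it classifies each differing position by swapping that single base into the first sequence's codon. The function names and the toy sequences are invented for the example; real tools count sites and substitution paths more carefully.

```python
import math
from itertools import product

# Standard genetic code, codons enumerated in TCAG order for each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def jukes_cantor(d):
    """Jukes-Cantor correction: k = -(3/4) ln(1 - (4/3) d)."""
    return -0.75 * math.log(1 - 4.0 * d / 3.0)

def crude_ka_ks(seq1, seq2):
    """Return (Ka, Ks) using the 2-nonsynonymous / 1-synonymous sites-per-codon shortcut."""
    if len(seq1) != len(seq2) or len(seq1) % 3 != 0:
        raise ValueError("sequences must be aligned, equal length, and in frame")
    n_codons = len(seq1) // 3
    syn = nonsyn = 0
    for i in range(0, len(seq1), 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        for pos in range(3):
            if c1[pos] != c2[pos]:
                # Apply only this one base change to codon 1 and see
                # whether the encoded amino acid changes.
                mutated = c1[:pos] + c2[pos] + c1[pos + 1:]
                if CODON_TABLE[mutated] == CODON_TABLE[c1]:
                    syn += 1
                else:
                    nonsyn += 1
    ka = jukes_cantor(nonsyn / (2 * n_codons))  # two non-synonymous sites per codon
    ks = jukes_cantor(syn / n_codons)           # one synonymous site per codon
    return ka, ks

# Toy alignment (hypothetical): glycine and proline codons only.
ka, ks = crude_ka_ks("GGTGGCGGAGGGCCTCCC", "GGCGGTGGAGCGCCTCCA")
print(f"Ka = {ka:.3f}, Ks = {ks:.3f}, Ka/Ks = {ka / ks:.3f}")
```

In this toy case there are three synonymous changes and one non-synonymous change across six codons, so the ratio comes out well below 1.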
1020 00:55:03,230 --> 00:55:06,090 So what would you expect to see-- 1021 00:55:06,090 --> 00:55:08,800 or if I told you we've got two genes 1022 00:55:08,800 --> 00:55:12,640 and the Ka/Ks ratio is much less than 1. 1023 00:55:12,640 --> 00:55:15,270 It's like 0.2. 1024 00:55:15,270 --> 00:55:16,700 What would that tell you? 1025 00:55:16,700 --> 00:55:20,320 Or what could you infer about the selection 1026 00:55:20,320 --> 00:55:22,294 that's happening to that gene? 1027 00:55:27,140 --> 00:55:30,100 Ka/Ks is much less than 1. 1028 00:55:30,100 --> 00:55:31,600 Any ideas? 1029 00:55:31,600 --> 00:55:32,515 Julianne, yeah. 1030 00:55:32,515 --> 00:55:34,495 STUDENT: The protein sequence is important-- 1031 00:55:34,495 --> 00:55:35,649 or the amino acid sequence. 1032 00:55:35,649 --> 00:55:36,690 PROFESSOR: Yeah, exactly. 1033 00:55:36,690 --> 00:55:39,340 The amino acid sequence is important. 1034 00:55:39,340 --> 00:55:43,010 Because you assume that those synonymous 1035 00:55:43,010 --> 00:55:44,694 sites and non-synonymous sites-- they're 1036 00:55:44,694 --> 00:55:46,360 going to mutate at the same rate, right? 1037 00:55:46,360 --> 00:55:49,640 The mutation processes don't know about protein coding. 1038 00:55:49,640 --> 00:55:54,430 So what you're seeing is an absence, a loss, 1039 00:55:54,430 --> 00:55:56,020 of the non-synonymous changes. 1040 00:55:56,020 --> 00:55:57,860 80% of those non-synonymous changes 1041 00:55:57,860 --> 00:55:59,570 have been kicked out by evolution. 1042 00:55:59,570 --> 00:56:01,580 You're only seeing 20%. 1043 00:56:01,580 --> 00:56:04,810 And you're using, assuming the non-synonymous are 1044 00:56:04,810 --> 00:56:08,014 neutral-- I'm sorry. 1045 00:56:08,014 --> 00:56:09,930 I seem to have trouble with these words today. 1046 00:56:09,930 --> 00:56:13,120 But you assume that the synonymous ones are neutral. 1047 00:56:13,120 --> 00:56:15,067 And then that calibrates everything.
1048 00:56:15,067 --> 00:56:17,400 And then you see that the non-synonymous are much lower. 1049 00:56:17,400 --> 00:56:19,323 Therefore you must have lost-- these ones must 1050 00:56:19,323 --> 00:56:20,740 have been kicked out by evolution. 1051 00:56:20,740 --> 00:56:22,690 So the amino acid sequence is important. 1052 00:56:22,690 --> 00:56:25,860 And it's optimal in some sense. 1053 00:56:25,860 --> 00:56:29,160 The protein works-- the organism does not want to change it. 1054 00:56:29,160 --> 00:56:32,080 Or changes to that protein sequence 1055 00:56:32,080 --> 00:56:34,160 make the protein worse. 1056 00:56:34,160 --> 00:56:35,530 And so you don't see them. 1057 00:56:35,530 --> 00:56:37,750 And that's what you see for most protein coding 1058 00:56:37,750 --> 00:56:41,970 genes in the genome-- a Ka/Ks ratio that's well below one. 1059 00:56:41,970 --> 00:56:44,700 It says we care what the protein is. 1060 00:56:44,700 --> 00:56:45,940 And it's pretty good already. 1061 00:56:45,940 --> 00:56:48,150 And we don't want to change it. 1062 00:56:48,150 --> 00:56:50,110 All right, what about a gene that 1063 00:56:50,110 --> 00:56:54,180 has a Ka/Ks ratio of around 1? 1064 00:56:54,180 --> 00:56:58,063 Anyone have an idea what would that tell you about that gene? 1065 00:57:01,930 --> 00:57:03,330 There are some-- Daniel? 1066 00:57:03,330 --> 00:57:06,620 STUDENT: The sequence is-- it doesn't particularly matter. 1067 00:57:06,620 --> 00:57:12,088 Maybe it's a non-coding, non-regulatory patch of DNA. 1068 00:57:12,088 --> 00:57:14,665 I assume there must be something. 1069 00:57:14,665 --> 00:57:16,880 PROFESSOR: Yeah, so it could be that it's not really 1070 00:57:16,880 --> 00:57:17,921 protein coding after all. 1071 00:57:17,921 --> 00:57:18,860 It's non-coding. 1072 00:57:18,860 --> 00:57:22,210 Then this whole triplet thing we were doing to it is arbitrary. 1073 00:57:22,210 --> 00:57:26,060 So you don't expect any particular distribution. 
1074 00:57:26,060 --> 00:57:26,700 That's true. 1075 00:57:26,700 --> 00:57:28,125 Any other possibilities? 1076 00:57:28,125 --> 00:57:29,340 Yeah, Tim. 1077 00:57:29,340 --> 00:57:32,752 STUDENT: Could be that there are opposite forces that 1078 00:57:32,752 --> 00:57:33,664 are equilibrating. 1079 00:57:33,664 --> 00:57:35,570 For example, we're taking the unit of the gene. 1080 00:57:35,570 --> 00:57:39,200 But maybe in one half of the gene there's 1081 00:57:39,200 --> 00:57:41,777 a strong selective pressure for non-synonymous 1082 00:57:41,777 --> 00:57:44,222 and in the other half it's strong selective pressure 1083 00:57:44,222 --> 00:57:45,690 for synonymous. 1084 00:57:45,690 --> 00:57:47,982 Alternatively, it could be in the same part of the gene, 1085 00:57:47,982 --> 00:57:49,856 but it's involved in two different processes. 1086 00:57:49,856 --> 00:57:50,734 It's pleiotropic. 1087 00:57:50,734 --> 00:57:54,478 So in one process it's selecting this one thing. 1088 00:57:54,478 --> 00:57:56,865 PROFESSOR: Yeah, or one period of time, 1089 00:57:56,865 --> 00:57:58,990 if you're looking at 10 million years of evolution, 1090 00:57:58,990 --> 00:58:01,186 it could have been for this first five million years it was 1091 00:58:01,186 --> 00:58:03,560 under negative selection, and then it was under positive. 1092 00:58:03,560 --> 00:58:04,950 And it averages out. 1093 00:58:04,950 --> 00:58:09,740 Yes, all those things are possible, but kind of unusual. 1094 00:58:09,740 --> 00:58:12,800 And so maybe if you saw that the-- 1095 00:58:12,800 --> 00:58:14,791 if you plotted Ka/Ks along the gene 1096 00:58:14,791 --> 00:58:17,290 and you saw that it was high in one area and low in another, 1097 00:58:17,290 --> 00:58:18,460 then that would tell you that you probably 1098 00:58:18,460 --> 00:58:20,100 shouldn't be taking the average across the gene. 1099 00:58:20,100 --> 00:58:22,120 And that would be a good thing to look for.
1100 00:58:22,120 --> 00:58:25,517 But what if-- again, so we said if Ka/Ks is near 1 1101 00:58:25,517 --> 00:58:28,100 it could be that it's not really a protein coding gene at all. 1102 00:58:28,100 --> 00:58:29,330 That's certainly possible. 1103 00:58:29,330 --> 00:58:31,470 It could also be though that it's a pseudogene. 1104 00:58:34,020 --> 00:58:37,080 Or it's a gene that is no longer needed by the organism. 1105 00:58:37,080 --> 00:58:39,380 It still codes for protein, but the organism just 1106 00:58:39,380 --> 00:58:41,060 could care less about its function. 1107 00:58:41,060 --> 00:58:43,920 It's something that maybe evolved in some other time. 1108 00:58:43,920 --> 00:58:49,630 It helps you adapt to when the temperature gets 1109 00:58:49,630 --> 00:58:50,720 below minus 20. 1110 00:58:50,720 --> 00:58:52,810 But it never gets below minus 20 anymore. 1111 00:58:52,810 --> 00:58:57,250 And so there's no selection on it, or something like that. 1112 00:58:57,250 --> 00:59:02,020 So neutral indicates-- this is called neutral evolution. 1113 00:59:02,020 --> 00:59:07,720 And then what about a gene which has a Ka/Ks ratio significantly 1114 00:59:07,720 --> 00:59:10,980 greater than 1? 1115 00:59:10,980 --> 00:59:14,940 Any thoughts on what that might mean and what kind of genes 1116 00:59:14,940 --> 00:59:19,054 might happen to-- yes, what's your name? 1117 00:59:19,054 --> 00:59:19,720 STUDENT: Simona. 1118 00:59:19,720 --> 00:59:20,065 PROFESSOR: Simona, go ahead. 1119 00:59:20,065 --> 00:59:22,585 STUDENT: It might be a gene that's selected against, 1120 00:59:22,585 --> 00:59:25,736 so something that's detrimental to the cell or the organism. 1121 00:59:25,736 --> 00:59:28,950 PROFESSOR: It's detrimental-- so the existing protein is 1122 00:59:28,950 --> 00:59:31,350 bad for you, so you want to change it. 1123 00:59:31,350 --> 00:59:33,920 So it's better to change it to something else. 1124 00:59:33,920 --> 00:59:34,480 That's true. 
1125 00:59:34,480 --> 00:59:37,065 Can you think of an example where that might be the case? 1126 00:59:37,065 --> 00:59:39,490 STUDENT: A gene that produces a toxin. 1127 00:59:39,490 --> 00:59:42,470 PROFESSOR: A gene that produces toxin. 1128 00:59:42,470 --> 00:59:44,160 You might just lose the gene completely 1129 00:59:44,160 --> 00:59:47,110 if it produced a toxin. 1130 00:59:47,110 --> 00:59:49,770 Any other examples you can think of or other people? 1131 00:59:53,880 --> 00:59:55,080 Yeah, Jeff. 1132 00:59:55,080 --> 00:59:58,440 STUDENT: Maybe a pigment that makes 1133 00:59:58,440 --> 01:00:03,090 the organism more susceptible to being eaten by a predator. 1134 01:00:03,090 --> 01:00:07,305 PROFESSOR: OK, yeah if it was a polar organism 1135 01:00:07,305 --> 01:00:09,964 and it happened to have this gene that made the fur dark 1136 01:00:09,964 --> 01:00:12,380 and it showed up against the snow, or something like that. 1137 01:00:12,380 --> 01:00:13,421 And you can imagine that. 1138 01:00:13,421 --> 01:00:17,090 Or a very common case is, for example, 1139 01:00:17,090 --> 01:00:21,190 a receptor that's used by a virus to enter the cell. 1140 01:00:21,190 --> 01:00:24,095 It probably had some other purpose. 1141 01:00:24,095 --> 01:00:28,030 But if the virus is very virulent, 1142 01:00:28,030 --> 01:00:30,490 you really just want to change that receptor 1143 01:00:30,490 --> 01:00:32,980 so that the virus can't attack it anymore. 1144 01:00:32,980 --> 01:00:35,620 So you see this kind of thing is much rarer. 1145 01:00:35,620 --> 01:00:38,060 It's only less than 1% of genes probably 1146 01:00:38,060 --> 01:00:41,160 are under positive selection, depending on how you measure it 1147 01:00:41,160 --> 01:00:42,670 and what time period you look at. 1148 01:00:42,670 --> 01:00:46,910 But it tends to be really recent, really strong selection 1149 01:00:46,910 --> 01:00:48,880 for changing the protein sequence. 
1150 01:00:48,880 --> 01:00:52,315 And the most common-- well, probably the most common-- 1151 01:00:52,315 --> 01:00:57,040 is these immune arms races between a host and a pathogen. 1152 01:00:57,040 --> 01:00:59,154 But there are other cases too. 1153 01:00:59,154 --> 01:01:00,570 You can have very strong selection 1154 01:01:00,570 --> 01:01:03,740 where-- well, I don't want to-- basically where 1155 01:01:03,740 --> 01:01:07,140 a protein is maladapted, like the organism moves from a very 1156 01:01:07,140 --> 01:01:09,140 cold environment to a very warm environment. 1157 01:01:09,140 --> 01:01:10,920 And you just need to change a lot of stuff 1158 01:01:10,920 --> 01:01:12,920 to make those proteins better adapted. 1159 01:01:12,920 --> 01:01:16,056 Occasionally you can get positive selection there. 1160 01:01:16,056 --> 01:01:17,550 Yeah, go ahead. 1161 01:01:17,550 --> 01:01:20,670 STUDENT: So the situation where Ka/Ks is 1-- 1162 01:01:20,670 --> 01:01:26,019 could it be possible that the mRNA is under selection? 1163 01:01:26,019 --> 01:01:28,060 PROFESSOR: Yeah, so that basically we have always 1164 01:01:28,060 --> 01:01:31,092 been implicitly assuming that the synonymous substitution 1165 01:01:31,092 --> 01:01:31,800 rate was neutral. 1166 01:01:31,800 --> 01:01:34,621 But it could actually be it's not neutral. 1167 01:01:34,621 --> 01:01:36,120 That's under negative selection too. 1168 01:01:36,120 --> 01:01:37,660 And it happens that they balance. 1169 01:01:37,660 --> 01:01:38,537 That's also possible. 1170 01:01:38,537 --> 01:01:40,120 So for that, to assess that, you might 1171 01:01:40,120 --> 01:01:44,410 want to compare the synonymous substitution rate of that gene 1172 01:01:44,410 --> 01:01:45,759 to neighboring genes.
1173 01:01:45,759 --> 01:01:47,300 And if you find it's much lower, that 1174 01:01:47,300 --> 01:01:50,700 could indicate that the coding sequences-- 1175 01:01:50,700 --> 01:01:54,510 the third base of codons is under selection-- 1176 01:01:54,510 --> 01:01:56,050 could be for splicing, maybe. 1177 01:01:56,050 --> 01:01:59,070 It could be for RNA secondary structure, translation, 1178 01:01:59,070 --> 01:02:01,240 different other-- that's a good point. 1179 01:02:01,240 --> 01:02:05,180 So yeah, you guys have already poked holes in this. 1180 01:02:05,180 --> 01:02:06,470 This is a method. 1181 01:02:06,470 --> 01:02:07,740 It gives you something. 1182 01:02:07,740 --> 01:02:09,200 You'll see it used. 1183 01:02:09,200 --> 01:02:10,600 It gives you some inferences. 1184 01:02:10,600 --> 01:02:14,170 But there are cases where it doesn't fully work. 1185 01:02:14,170 --> 01:02:16,570 OK, good. 1186 01:02:16,570 --> 01:02:18,400 So in the remaining time I wanted 1187 01:02:18,400 --> 01:02:24,170 to do some examples of comparative genomics. 1188 01:02:24,170 --> 01:02:27,670 So as I mentioned before, these are 1189 01:02:27,670 --> 01:02:31,210 chosen to just give you some examples of types of things 1190 01:02:31,210 --> 01:02:32,860 you can learn about gene regulation 1191 01:02:32,860 --> 01:02:35,980 by comparing genomes again, often by using really 1192 01:02:35,980 --> 01:02:37,770 simple methods, just blasting all 1193 01:02:37,770 --> 01:02:42,090 the genes against each other or things like this. 1194 01:02:45,000 --> 01:02:49,570 And also, if you do choose to read some of these papers, 1195 01:02:49,570 --> 01:02:51,780 it can give you some experience looking 1196 01:02:51,780 --> 01:02:55,420 at this literature in regulatory genomics. 
1197 01:02:55,420 --> 01:03:01,600 So the papers I've chosen-- we'll start with Bejerano et al 1198 01:03:01,600 --> 01:03:07,570 from 2002, who basically sought to identify regulatory elements 1199 01:03:07,570 --> 01:03:10,490 that are things that are under evolutionary constraint. 1200 01:03:10,490 --> 01:03:13,360 That's all he was trying to find. 1201 01:03:13,360 --> 01:03:15,650 Didn't know what their functions were. 1202 01:03:15,650 --> 01:03:19,142 But they turned out to be interesting nonetheless, 1203 01:03:19,142 --> 01:03:20,600 which is maybe a little surprising. 1204 01:03:20,600 --> 01:03:27,270 And then this other work from Eddy Rubin's lab and others-- 1205 01:03:27,270 --> 01:03:29,290 Steve Brenner's lab-- actually characterized 1206 01:03:29,290 --> 01:03:31,590 some of these extremely conserved regions 1207 01:03:31,590 --> 01:03:33,610 and assessed their function. 1208 01:03:33,610 --> 01:03:35,740 And then Bejerano came back a few years later 1209 01:03:35,740 --> 01:03:39,700 and actually had a paper about where these extremely conserved 1210 01:03:39,700 --> 01:03:42,080 regions actually came from. 1211 01:03:42,080 --> 01:03:43,550 So we'll talk about those. 1212 01:03:43,550 --> 01:03:45,910 Then we'll look at some papers that 1213 01:03:45,910 --> 01:03:50,640 have to do with inferring the regulatory targets 1214 01:03:50,640 --> 01:03:52,104 of a trans-acting factor. 1215 01:03:52,104 --> 01:03:53,770 And the factors that we'll consider here 1216 01:03:53,770 --> 01:03:58,050 will be microRNAs, mostly, either trying 1217 01:03:58,050 --> 01:04:00,100 to understand what the rules are for microRNA 1218 01:04:00,100 --> 01:04:02,660 targeting in these Lewis et al papers, 1219 01:04:02,660 --> 01:04:06,000 or trying to identify the regulatory targets 1220 01:04:06,000 --> 01:04:07,750 in the genome.
1221 01:04:07,750 --> 01:04:10,800 And then, time permitting, we'll talk about a few other examples 1222 01:04:10,800 --> 01:04:13,740 of slightly more exotic things. 1223 01:04:13,740 --> 01:04:18,750 Graveley identified a pair-- or pairs-- 1224 01:04:18,750 --> 01:04:21,400 of interacting regulatory elements 1225 01:04:21,400 --> 01:04:26,391 through a clever comparative genomic approach. 1226 01:04:26,391 --> 01:04:28,640 And then I'll talk about these two examples at the end 1227 01:04:28,640 --> 01:04:32,840 if there's time, where a new class of trans-acting factors 1228 01:04:32,840 --> 01:04:38,420 was inferred from the locations of the encoded genes 1229 01:04:38,420 --> 01:04:39,500 in the genome. 1230 01:04:39,500 --> 01:04:43,330 And also an inference was made about the functions 1231 01:04:43,330 --> 01:04:45,800 of some repetitive elements from, again, 1232 01:04:45,800 --> 01:04:49,320 looking at the matching between these elements 1233 01:04:49,320 --> 01:04:51,610 and another genome. 1234 01:04:51,610 --> 01:04:54,350 All right, so first example-- Bejerano 1235 01:04:54,350 --> 01:04:55,530 "Ultraconserved elements." 1236 01:04:55,530 --> 01:04:58,840 So they defined, in a fairly arbitrary way, ultraconserved 1237 01:04:58,840 --> 01:05:00,410 elements as unusually long segments 1238 01:05:00,410 --> 01:05:03,530 that are 100% identical between human, mouse, and rat. 1239 01:05:03,530 --> 01:05:05,840 This was in 2000-- I'm sorry, I might 1240 01:05:05,840 --> 01:05:07,800 have the wrong-- it's either 2004 or 2002. 1241 01:05:07,800 --> 01:05:10,240 I forget. 1242 01:05:10,240 --> 01:05:12,652 This was basically when the first three mammalian genomes 1243 01:05:12,652 --> 01:05:14,860 had been sequenced, which were human, mouse, and rat. 1244 01:05:14,860 --> 01:05:17,510 And there were whole genome alignments.
1245 01:05:17,510 --> 01:05:20,110 So they basically said let's try to use these whole genome 1246 01:05:20,110 --> 01:05:22,390 alignments to find what's the most 1247 01:05:22,390 --> 01:05:25,150 conserved thing in mammals. 1248 01:05:25,150 --> 01:05:28,460 So they wanted to see if there's anything 100% conserved. 1249 01:05:28,460 --> 01:05:31,720 And so they did statistics to say 1250 01:05:31,720 --> 01:05:37,030 what's an unusually long region of 100% identity. 1251 01:05:37,030 --> 01:05:41,280 Any ideas how you would do that calculation, what kind 1252 01:05:41,280 --> 01:05:43,160 of statistics you would use? 1253 01:05:43,160 --> 01:05:44,856 They used a really simple approach. 1254 01:05:48,830 --> 01:05:51,600 What they did was they took one megabase segments 1255 01:05:51,600 --> 01:05:54,590 of the genome, assuming it might vary across the genome. 1256 01:05:54,590 --> 01:05:56,986 They took ancestral repetitive elements-- so repetitive 1257 01:05:56,986 --> 01:05:58,360 elements that were inserted, that 1258 01:05:58,360 --> 01:06:00,440 were present in mouse, rat, and human-- 1259 01:06:00,440 --> 01:06:02,981 and assumed that they were neutrally evolving, 1260 01:06:02,981 --> 01:06:04,230 they were not under selection. 1261 01:06:04,230 --> 01:06:06,460 And then therefore you could look at the number of differences 1262 01:06:06,460 --> 01:06:09,010 and get an idea what the background rate of mutation is. 1263 01:06:09,010 --> 01:06:09,760 And they use that. 1264 01:06:09,760 --> 01:06:12,787 And they found that that rate was-- 1265 01:06:12,787 --> 01:06:15,370 this is from their supplementary data-- that was never greater 1266 01:06:15,370 --> 01:06:18,420 than 0.68. 1267 01:06:18,420 --> 01:06:26,270 And so they just said well, if we have a probability of-- I'm 1268 01:06:26,270 --> 01:06:27,620 sorry. 1269 01:06:27,620 --> 01:06:28,470 One is heads.
1270 01:06:28,470 --> 01:06:31,370 So if they're all three the same-- yeah, 1271 01:06:31,370 --> 01:06:35,384 so if we have a probability of 0.7 of heads, 1272 01:06:35,384 --> 01:06:37,050 meaning that they're all three the same, 1273 01:06:37,050 --> 01:06:39,960 then the chance that you have 200 heads in a row 1274 01:06:39,960 --> 01:06:47,570 would be 1 minus P P to the 200, just like [INAUDIBLE] trials. 1275 01:06:47,570 --> 01:06:50,070 And you can just multiply that times the size of the genome. 1276 01:06:50,070 --> 01:06:52,420 And you say it's extremely unlikely that you'll ever 1277 01:06:52,420 --> 01:06:57,970 see anything where there's 200 identical nucleotides in a row. 1278 01:06:57,970 --> 01:07:01,630 So that's what they defined as an ultraconserved element. 1279 01:07:01,630 --> 01:07:04,500 So it all seems very silly for now, 1280 01:07:04,500 --> 01:07:06,880 until you actually get to what they find. 1281 01:07:06,880 --> 01:07:09,080 So they looked at where are these elements 1282 01:07:09,080 --> 01:07:10,210 around the genome. 1283 01:07:10,210 --> 01:07:13,310 They found about 100 overlapped exons of known protein coding 1284 01:07:13,310 --> 01:07:16,850 genes, 100 are in introns, and the remainder 1285 01:07:16,850 --> 01:07:19,030 are in intergenic regions. 1286 01:07:19,030 --> 01:07:21,810 So then they looked at well what kind of genes 1287 01:07:21,810 --> 01:07:26,460 contain exons with overlapping-- or contain 1288 01:07:26,460 --> 01:07:28,420 ultraconserved elements that overlap exons? 1289 01:07:28,420 --> 01:07:29,430 Those are type 1 genes. 1290 01:07:29,430 --> 01:07:32,320 And what kind of genes are next to 1291 01:07:32,320 --> 01:07:33,970 the intergenic ultraconserved elements, 1292 01:07:33,970 --> 01:07:37,300 to try to get some clues about the function of these elements. 1293 01:07:37,300 --> 01:07:42,670 And so they did this early gene ontology analysis. 
1294 01:07:42,670 --> 01:07:46,890 And what they found was that the ultraconserved elements that 1295 01:07:46,890 --> 01:07:51,660 overlapped exons tended to fall in genes 1296 01:07:51,660 --> 01:07:56,160 that encoded RNA-binding proteins, particularly splicing 1297 01:07:56,160 --> 01:08:02,094 factors, by an order of magnitude more frequent. 1298 01:08:02,094 --> 01:08:03,760 And then the type 2 genes, the ones that 1299 01:08:03,760 --> 01:08:07,390 were next to these intergenic ultraconserved regions, 1300 01:08:07,390 --> 01:08:09,820 tended to be transcription factors. 1301 01:08:09,820 --> 01:08:11,930 In particular, homeobox transcription factors 1302 01:08:11,930 --> 01:08:15,500 were the most enriched class. 1303 01:08:15,500 --> 01:08:18,270 So this gave them some clues about what might be going on. 1304 01:08:18,270 --> 01:08:20,550 Particularly the second class was followed up 1305 01:08:20,550 --> 01:08:25,029 by Eddy Rubin's lab at Berkeley. 1306 01:08:25,029 --> 01:08:29,595 And they tested 167 extremely conserved sequences. 1307 01:08:29,595 --> 01:08:31,720 So some of them were these ultraconserved elements. 1308 01:08:31,720 --> 01:08:33,553 And some of them were just highly conserved, 1309 01:08:33,553 --> 01:08:36,540 but not quite 100% conserved. 1310 01:08:36,540 --> 01:08:39,609 And they had an assay where they have a reporter. 1311 01:08:39,609 --> 01:08:44,020 It's a lacZ with a-- you take a minimal promoter, fuse 1312 01:08:44,020 --> 01:08:45,930 it to lacZ, and then you take your element 1313 01:08:45,930 --> 01:08:48,590 of interest and fuse it upstream. 1314 01:08:48,590 --> 01:08:51,810 And then you do staining of whole mount embryos. 1315 01:08:51,810 --> 01:08:55,200 And you say what pattern of gene expression 1316 01:08:55,200 --> 01:08:57,319 does this element drive, or does it 1317 01:08:57,319 --> 01:08:59,380 drive a pattern of gene expression?
1318 01:08:59,380 --> 01:09:03,319 And so 45% of the time it drove a particular pattern 1319 01:09:03,319 --> 01:09:05,560 of gene expression. 1320 01:09:05,560 --> 01:09:07,990 So it functioned as an enhancer. 1321 01:09:07,990 --> 01:09:14,210 And these are the types of patterns that they saw. 1322 01:09:14,210 --> 01:09:16,120 So they saw often forebrain, sometimes 1323 01:09:16,120 --> 01:09:19,689 midbrain, neural tube, limb, et cetera. 1324 01:09:19,689 --> 01:09:24,029 So many of these things are enhancers 1325 01:09:24,029 --> 01:09:27,710 that drive particular developmental patterns of gene 1326 01:09:27,710 --> 01:09:28,729 expression. 1327 01:09:28,729 --> 01:09:31,410 So that turned out to be actually-- that was a pretty good way 1328 01:09:31,410 --> 01:09:34,779 to identify developmental enhancers. 1329 01:09:34,779 --> 01:09:37,180 So they wondered, is there anything special 1330 01:09:37,180 --> 01:09:39,200 about these ultraconserved regions, these 100% 1331 01:09:39,200 --> 01:09:42,359 identical regions, versus others that are 95% identical. 1332 01:09:42,359 --> 01:09:44,090 And so they tested a bunch of each. 1333 01:09:44,090 --> 01:09:47,120 And they found absolutely no difference there. 1334 01:09:47,120 --> 01:09:49,270 They drive similar types of expression. 1335 01:09:49,270 --> 01:09:52,950 And you can even find individual instances of them 1336 01:09:52,950 --> 01:09:57,520 that drive pretty much exactly the same pattern of expression. 1337 01:09:57,520 --> 01:09:59,510 So this whole 100% identical thing 1338 01:09:59,510 --> 01:10:03,020 was just a purely-- it was purely arbitrary. 1339 01:10:03,020 --> 01:10:06,530 But still, it's useful. 1340 01:10:06,530 --> 01:10:11,270 These things are among the most interesting enhancers 1341 01:10:11,270 --> 01:10:13,220 that have been identified. 1342 01:10:13,220 --> 01:10:17,520 So what about the-- oh yeah, so where did they come from?
1343 01:10:17,520 --> 01:10:22,940 OK, so this is totally from left field. 1344 01:10:22,940 --> 01:10:26,390 Bejerano was looking at some of these ultraconserved elements, 1345 01:10:26,390 --> 01:10:30,280 probably just blasting them against different genomes 1346 01:10:30,280 --> 01:10:35,370 as they came out, and noticed something very, very strange. 1347 01:10:35,370 --> 01:10:37,750 And that was there had recently been 1348 01:10:37,750 --> 01:10:40,120 some sequencing from coelacanth. 1349 01:10:40,120 --> 01:10:44,340 So for those of you who aren't fish experts, 1350 01:10:44,340 --> 01:10:48,980 this is a lobed fin fish, where they found fossils 1351 01:10:48,980 --> 01:10:51,160 from dating back to 400 million years. 1352 01:10:51,160 --> 01:10:53,930 And they noticed that these fossils-- the morphology never 1353 01:10:53,930 --> 01:10:54,430 changed. 1354 01:10:54,430 --> 01:10:57,769 From 400 million, 300 million years, you could see this fish. 1355 01:10:57,769 --> 01:10:58,810 It was exactly like this. 1356 01:10:58,810 --> 01:11:00,030 And it has lobed fins. 1357 01:11:00,030 --> 01:11:01,613 That was why they're interested in it. 1358 01:11:01,613 --> 01:11:03,750 Because the fins-- they have a round structure. 1359 01:11:03,750 --> 01:11:05,750 They look almost like limbs, like maybe this guy 1360 01:11:05,750 --> 01:11:08,041 could have evolved into something that would eventually 1361 01:11:08,041 --> 01:11:10,374 live on land. 1362 01:11:10,374 --> 01:11:12,040 Anyway, but they thought it was extinct. 1363 01:11:12,040 --> 01:11:16,270 And then somebody caught one. 1364 01:11:16,270 --> 01:11:20,150 In the '70s, in the West Indian Ocean, from deep water fishing, 1365 01:11:20,150 --> 01:11:21,940 they pulled one up, and it looked exactly 1366 01:11:21,940 --> 01:11:25,400 like these fossils from 400 million years before. 
1367 01:11:25,400 --> 01:11:27,520 And so then of course somebody took some DNA 1368 01:11:27,520 --> 01:11:28,840 and did some sequencing. 1369 01:11:28,840 --> 01:11:34,090 And what Bejerano noticed is that this one megabase or so 1370 01:11:34,090 --> 01:11:38,520 coelacanth sequence had a very common repeat in it that 1371 01:11:38,520 --> 01:11:44,170 was around 500 bases or so, that looked like a SINE element. 1372 01:11:44,170 --> 01:11:46,980 SINE elements-- short, interspersed nuclear element, 1373 01:11:46,980 --> 01:11:50,437 like Alus, if you're familiar with those, so 1374 01:11:50,437 --> 01:11:51,770 some sort of repetitive element. 1375 01:11:51,770 --> 01:11:53,310 And this repetitive element was very 1376 01:11:53,310 --> 01:11:59,220 similar to these ultraconserved enhancers in mammals. 1377 01:11:59,220 --> 01:12:00,700 So something that we normally think 1378 01:12:00,700 --> 01:12:03,770 of as the least conserved of all, 1379 01:12:03,770 --> 01:12:06,540 like a repetitive element that inserts itself randomly 1380 01:12:06,540 --> 01:12:09,280 in the genome, had become-- some of these elements 1381 01:12:09,280 --> 01:12:12,090 had become among the most conserved sequences 1382 01:12:12,090 --> 01:12:15,120 later in evolution. 1383 01:12:15,120 --> 01:12:22,240 So how does that make any sense at all? 1384 01:12:22,240 --> 01:12:25,950 Anyone have a theory on that? 1385 01:12:25,950 --> 01:12:27,845 I can tell you how they interpreted it. 1386 01:12:31,650 --> 01:12:36,150 So their theory-- here's some text from their-- anyway, 1387 01:12:36,150 --> 01:12:38,130 you can look at the paper for the details here. 1388 01:12:38,130 --> 01:12:44,870 But their theory is basically that once you have a repetitive 1389 01:12:44,870 --> 01:12:46,970 element-- initially it's a parasitic element, 1390 01:12:46,970 --> 01:12:49,110 inserts itself randomly in the genome, 1391 01:12:49,110 --> 01:12:51,340 doesn't actually do anything. 
1392 01:12:51,340 --> 01:12:54,790 But once you have hundreds of them, by chance 1393 01:12:54,790 --> 01:12:57,580 there will be perhaps a set of genes 1394 01:12:57,580 --> 01:13:00,400 that have this element next to them, 1395 01:13:00,400 --> 01:13:03,220 where you'd like to control them coordinately. 1396 01:13:03,220 --> 01:13:06,700 You'd like to turn all those genes on or all those genes off 1397 01:13:06,700 --> 01:13:09,084 in a particular circumstance-- a stress response, 1398 01:13:09,084 --> 01:13:10,750 during development, something like that. 1399 01:13:10,750 --> 01:13:14,470 And so then it's relatively easy to evolve a transcription 1400 01:13:14,470 --> 01:13:16,360 factor, for example, that will bind 1401 01:13:16,360 --> 01:13:18,280 to some sequence in that element. 1402 01:13:18,280 --> 01:13:20,420 And then it'll turn on all those genes. 1403 01:13:20,420 --> 01:13:22,256 Of course, it'll turn on all the genes 1404 01:13:22,256 --> 01:13:23,630 that have the elements near them. 1405 01:13:23,630 --> 01:13:25,190 So it'll probably turn on some extra genes 1406 01:13:25,190 --> 01:13:26,030 that you don't want. 1407 01:13:26,030 --> 01:13:30,490 But you can then-- selection will then tune these elements. 1408 01:13:30,490 --> 01:13:37,450 It gives you a quick way of generating a large-scale gene 1409 01:13:37,450 --> 01:13:38,489 expression response. 1410 01:13:38,489 --> 01:13:40,655 Because you've got so many of these things scattered 1411 01:13:40,655 --> 01:13:41,660 across the genome. 1412 01:13:41,660 --> 01:13:45,660 And so this-- that's as good an explanation as we have, 1413 01:13:45,660 --> 01:13:49,560 I would say, for what is going on here. 1414 01:13:49,560 --> 01:13:52,150 And there's been some theories about this. 
1415 01:13:52,150 --> 01:13:55,490 And they point out that actually something 1416 01:13:55,490 --> 01:13:58,560 like 50% of our genome actually comes from transposons, 1417 01:13:58,560 --> 01:14:01,080 if you go back far enough. 1418 01:14:01,080 --> 01:14:03,260 Some are recent, some are ancient. 1419 01:14:03,260 --> 01:14:06,200 And that maybe a lot of the regulatory elements-- not just 1420 01:14:06,200 --> 01:14:08,280 these ultraconserved enhancers, but others-- 1421 01:14:08,280 --> 01:14:11,660 may have evolved in this way. 1422 01:14:11,660 --> 01:14:14,480 So basically you insert a bunch of random junk throughout. 1423 01:14:14,480 --> 01:14:17,740 And then the fact that it's all identical, 1424 01:14:17,740 --> 01:14:20,600 because it derived from a common source, 1425 01:14:20,600 --> 01:14:23,980 you use-- that fact actually turns it 1426 01:14:23,980 --> 01:14:27,622 into something that's useful, a useful regulatory element. 1427 01:14:27,622 --> 01:14:29,330 All right, just wanted to throw that out. 1428 01:14:29,330 --> 01:14:31,970 So what about the exonic ultraconserved elements? 1429 01:14:31,970 --> 01:14:32,830 So here's one. 1430 01:14:32,830 --> 01:14:36,095 This is a 618-nucleotide region 1431 01:14:36,095 --> 01:14:38,220 that's 100% identical between human, mouse, and rat. 1432 01:14:38,220 --> 01:14:40,790 It's one of the longest in the genome. 1433 01:14:40,790 --> 01:14:41,790 And where is it? 1434 01:14:41,790 --> 01:14:46,420 It's in a splicing factor gene called SRp20. 1435 01:14:46,420 --> 01:14:51,750 And it's actually not in the protein coding part. 1436 01:14:51,750 --> 01:14:56,720 It's in an essentially non-coding exon of this splicing factor. 1437 01:14:56,720 --> 01:14:58,750 So it's this yellow exon here. 1438 01:14:58,750 --> 01:15:01,460 And what you'll notice is there's this little red thing 1439 01:15:01,460 --> 01:15:02,240 here. 1440 01:15:02,240 --> 01:15:04,960 That's a stop codon. 
1441 01:15:04,960 --> 01:15:07,550 So this gene is spliced-- produces 1442 01:15:07,550 --> 01:15:08,820 two different isoforms. 1443 01:15:08,820 --> 01:15:10,780 The full length is the blue, when you just 1444 01:15:10,780 --> 01:15:11,930 use all the blue exons. 1445 01:15:11,930 --> 01:15:13,850 But when you include this yellow exon, 1446 01:15:13,850 --> 01:15:16,280 there's a premature termination codon that you hit. 1447 01:15:16,280 --> 01:15:18,300 So you don't make full-length protein. 1448 01:15:18,300 --> 01:15:23,600 Instead, that mRNA is degraded in a pathway called 1449 01:15:23,600 --> 01:15:26,590 nonsense-mediated mRNA decay. 1450 01:15:26,590 --> 01:15:28,470 So the purpose of this exon appears 1451 01:15:28,470 --> 01:15:34,020 to be so that this gene can regulate expression 1452 01:15:34,020 --> 01:15:36,480 of the protein at the level of splicing. 1453 01:15:36,480 --> 01:15:39,500 And others have shown that this protein, the protein product, 1454 01:15:39,500 --> 01:15:42,255 actually binds to that exon and promotes 1455 01:15:42,255 --> 01:15:43,870 the splicing of that exon. 1456 01:15:43,870 --> 01:15:47,410 So it's basically a form of negative autoregulation. 1457 01:15:47,410 --> 01:15:50,220 The gene-- when the protein gets high, 1458 01:15:50,220 --> 01:15:54,210 it comes back and shifts the splicing of its own transcripts 1459 01:15:54,210 --> 01:15:56,850 to produce a non-functional form of the message 1460 01:15:56,850 --> 01:15:58,310 and reduce the protein expression. 1461 01:15:58,310 --> 01:16:01,400 So the theory is that this helps to keep this splicing 1462 01:16:01,400 --> 01:16:03,780 factor at a constant level throughout time 1463 01:16:03,780 --> 01:16:05,527 and between different cells, which 1464 01:16:05,527 --> 01:16:06,860 might be important for splicing. 1465 01:16:06,860 --> 01:16:09,495 But that's only a theory. 1466 01:16:09,495 --> 01:16:10,620 It could be something else. 
1467 01:16:10,620 --> 01:16:14,740 And it does not explain why you need 600 nucleotides perfectly 1468 01:16:14,740 --> 01:16:16,940 conserved in order to have this function. 1469 01:16:16,940 --> 01:16:19,680 So I think these exonic ones are still fairly 1470 01:16:19,680 --> 01:16:22,020 mysterious and worth investigating. 1471 01:16:26,160 --> 01:16:29,510 A couple examples from microRNAs-- 1472 01:16:29,510 --> 01:16:31,780 so just a brief review on microRNAs. 1473 01:16:31,780 --> 01:16:35,410 They are these small, non-coding RNAs, 1474 01:16:35,410 --> 01:16:38,180 typically 20 to 22 nucleotides or so. 1475 01:16:38,180 --> 01:16:40,730 They have a characteristic RNA secondary structure 1476 01:16:40,730 --> 01:16:44,100 in their precursor, often called miRNAs. 1477 01:16:44,100 --> 01:16:48,510 And they're produced from primary transcripts typically, 1478 01:16:48,510 --> 01:16:50,450 or introns of protein-coding genes, 1479 01:16:50,450 --> 01:16:53,090 which are then processed in the nucleus by an enzyme called 1480 01:16:53,090 --> 01:16:57,940 Drosha into a hairpin structure, like so. 1481 01:16:57,940 --> 01:17:00,770 And then that is exported to the cytoplasm, 1482 01:17:00,770 --> 01:17:03,340 where it's further processed by an enzyme called Dicer 1483 01:17:03,340 --> 01:17:07,990 to produce the mature microRNA, which enters the RISC complex, 1484 01:17:07,990 --> 01:17:12,090 and which then pairs the microRNA with mRNA targets, 1485 01:17:12,090 --> 01:17:13,410 usually in the 3'-UTR. 1486 01:17:13,410 --> 01:17:15,760 And that either inhibits their translation 1487 01:17:15,760 --> 01:17:19,710 or triggers the decay of those messages. 1488 01:17:19,710 --> 01:17:25,480 So microRNAs can do-- they can be really important. 1489 01:17:25,480 --> 01:17:27,640 Weird animation-- but for example, 1490 01:17:27,640 --> 01:17:32,800 this bantam microRNA in flies inhibits a proapoptotic gene 1491 01:17:32,800 --> 01:17:33,510 hid. 
1492 01:17:33,510 --> 01:17:38,690 If you delete bantam, apoptosis goes crazy. 1493 01:17:38,690 --> 01:17:41,655 And you can see this is a normal fly. 1494 01:17:41,655 --> 01:17:44,030 There's a little fly in there with red eyes and so forth. 1495 01:17:44,030 --> 01:17:46,150 In this guy there's just a sack of mush. 1496 01:17:46,150 --> 01:17:48,670 All the cells-- most of the cells actually died. 1497 01:17:48,670 --> 01:17:51,270 So microRNAs play important roles 1498 01:17:51,270 --> 01:17:53,540 in developmental pathways. 1499 01:17:53,540 --> 01:17:58,400 And so we wanted to figure out the rules for their targeting. 1500 01:17:58,400 --> 01:18:01,720 And so this was an early study from Ben Lewis, 1501 01:18:01,720 --> 01:18:08,570 where he looked for conserved instances of segments, 1502 01:18:08,570 --> 01:18:12,630 short oligonucleotides, that match perfectly 1503 01:18:12,630 --> 01:18:15,560 to different parts of the microRNA, 1504 01:18:15,560 --> 01:18:18,440 using again these human, mouse, rat alignments, 1505 01:18:18,440 --> 01:18:21,090 which were what was available at the time. 
1506 01:18:21,090 --> 01:18:26,360 And what he found was that if you took the set of microRNAs 1507 01:18:26,360 --> 01:18:31,940 which were known, and you identified targets of these 1508 01:18:31,940 --> 01:18:35,000 defined as 7-mers that are perfectly conserved 1509 01:18:35,000 --> 01:18:38,180 in 3'-UTRs of mammalian messages, 1510 01:18:38,180 --> 01:18:41,260 and then you looked at how many you got and you compared that 1511 01:18:41,260 --> 01:18:45,965 to the number of targets of shuffled microRNA-- 1512 01:18:45,965 --> 01:18:47,840 so where you take the whole set of microRNAs, 1513 01:18:47,840 --> 01:18:50,540 randomly permute their sequences so you generate random stuff, 1514 01:18:50,540 --> 01:18:52,900 look at how many conserved targets they have-- 1515 01:18:52,900 --> 01:18:56,760 that there was a significant signal above background, 1516 01:18:56,760 --> 01:18:59,820 in the sense of real conserved targets, 1517 01:18:59,820 --> 01:19:03,490 specifically only for the 5'-end of the microRNA. 1518 01:19:03,490 --> 01:19:07,700 Especially, bases 2 to 8 of the microRNA gave a signal. 1519 01:19:07,700 --> 01:19:10,500 And no other positions in the microRNA 1520 01:19:10,500 --> 01:19:13,060 gave a significant signal above background. 1521 01:19:13,060 --> 01:19:17,683 And so that led to the inference that the 5'-end of the microRNA 1522 01:19:17,683 --> 01:19:24,200 is what matters, specifically these bases. 1523 01:19:24,200 --> 01:19:27,900 And then later, alignments of actually 1524 01:19:27,900 --> 01:19:31,550 paralogous microRNA genes, shown here-- 1525 01:19:31,550 --> 01:19:34,380 so these are different let-7 genes. 1526 01:19:34,380 --> 01:19:37,870 You can actually see that the 5'-end of the microRNA, 1527 01:19:37,870 --> 01:19:39,800 which the microRNA's shown here in blue-- 1528 01:19:39,800 --> 01:19:40,870 this is the fold-back. 
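[Editorial sketch, not part of the lecture] The conserved 7-mer analysis described above (count perfectly conserved matches to a window of the microRNA, then compare against shuffled microRNAs) can be sketched roughly as follows. This is an illustration under stated assumptions, not the published pipeline: the function names, the toy microRNA string, the toy UTR alignments, and the number of shuffles are all hypothetical.

```python
import random

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def revcomp(rna):
    """Reverse complement of an RNA string."""
    return "".join(COMPLEMENT[b] for b in reversed(rna))


def target_site(mirna, start=2, k=7):
    """k-mer in the mRNA that would pair with miRNA bases start..start+k-1
    (1-based); start=2, k=7 covers bases 2 to 8."""
    return revcomp(mirna[start - 1:start - 1 + k])


def conserved_hits(site, aligned_utrs):
    """Number of UTR alignments in which the site occurs at the same
    column in every species row (perfect conservation)."""
    hits = 0
    for rows in aligned_utrs:
        width = len(rows[0]) - len(site) + 1
        if any(all(r[i:i + len(site)] == site for r in rows)
               for i in range(width)):
            hits += 1
    return hits


def signal_and_background(mirna, aligned_utrs, start=2, n_shuffles=50, seed=0):
    """Conserved hits for the real miRNA window, and the mean over shuffled
    versions of the same miRNA (same base composition, random order)."""
    rng = random.Random(seed)
    real = conserved_hits(target_site(mirna, start), aligned_utrs)
    total = 0
    for _ in range(n_shuffles):
        bases = list(mirna)
        rng.shuffle(bases)
        total += conserved_hits(target_site("".join(bases), start), aligned_utrs)
    return real, total / n_shuffles


# Toy inputs (hypothetical): one miRNA-like string, two 3-species UTR alignments.
mirna = "UGAGGUAGUAGGUUGUAUAGUU"
utrs = [["AAACUACCUCAA", "AAACUACCUCAA", "AAACUACCUCAA"],
        ["AAACUACCUCAA", "AAACUACGUCAA", "AAACUACCUCAA"]]
print(target_site(mirna))           # site complementary to bases 2 to 8
print(conserved_hits(target_site(mirna), utrs))
```

Sliding `start` along the microRNA and repeating the comparison is the step that, in the analysis described above, singles out the 5'-end windows as the only ones with signal above the shuffled background.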
1529 01:19:40,870 --> 01:19:45,850 So you get conservation of the microRNA and of the other arm 1530 01:19:45,850 --> 01:19:48,350 of the fold-back, which is complementary. 1531 01:19:48,350 --> 01:19:51,280 Little conservation of the loop, but the most conserved part 1532 01:19:51,280 --> 01:19:54,745 of the microRNA is the very 5'-end, consistent with that 1533 01:19:54,745 --> 01:19:55,245 idea. 1534 01:19:58,815 --> 01:20:00,690 Just one more example, because it's so cool-- 1535 01:20:00,690 --> 01:20:05,470 so this is the Dscam gene in Drosophila. 1536 01:20:05,470 --> 01:20:11,630 And this gene has four different alternatively spliced regions 1537 01:20:11,630 --> 01:20:14,560 which are each spliced by mutually exclusive splicing. 1538 01:20:14,560 --> 01:20:17,200 So there are actually 12 copies of exon 4 1539 01:20:17,200 --> 01:20:19,670 and 48 different copies of exon 6. 1540 01:20:19,670 --> 01:20:22,996 And messages from this gene only ever contain 1541 01:20:22,996 --> 01:20:25,290 one of those particular exons. 1542 01:20:25,290 --> 01:20:31,470 And so Brent Graveley asked how does this gene get spliced 1543 01:20:31,470 --> 01:20:32,750 in a mutually exclusive way? 1544 01:20:32,750 --> 01:20:36,310 How do you only choose one of those 48 different versions 1545 01:20:36,310 --> 01:20:37,020 of exon 6? 1546 01:20:37,020 --> 01:20:44,500 And so what he did was did some sequencing from various fly 1547 01:20:44,500 --> 01:20:48,935 and other insect species of this locus, did some alignments. 1548 01:20:48,935 --> 01:20:53,830 And he noticed that there was this very conserved sequence 1549 01:20:53,830 --> 01:20:58,630 just downstream of exon 5, right upstream of this cluster. 
1550 01:20:58,630 --> 01:21:01,130 And then, looking more carefully, 1551 01:21:01,130 --> 01:21:06,000 he saw that there is another sequence, just immediately 1552 01:21:06,000 --> 01:21:07,950 upstream of each of the alternative exons, 1553 01:21:07,950 --> 01:21:11,860 that was very similar between all those exons, 1554 01:21:11,860 --> 01:21:15,420 and also conserved across the insects. 1555 01:21:15,420 --> 01:21:17,950 And then he stared at these for a while, 1556 01:21:17,950 --> 01:21:21,510 and recognized that actually this sequence up 1557 01:21:21,510 --> 01:21:25,980 at the 5'-end is-- its consensus is perfectly complementary 1558 01:21:25,980 --> 01:21:30,860 to the sequence that's found upstream of all of the other 1559 01:21:30,860 --> 01:21:31,360 exons. 1560 01:21:31,360 --> 01:21:33,560 And so what that suggested, immediately, 1561 01:21:33,560 --> 01:21:37,940 is that splicing requires the pairing 1562 01:21:37,940 --> 01:21:40,710 of this sequence from exon 5 to one 1563 01:21:40,710 --> 01:21:42,140 of those downstream sequences. 1564 01:21:42,140 --> 01:21:44,560 And then you'll splice to the next exon that's 1565 01:21:44,560 --> 01:21:48,420 immediately downstream and skip out all of the others. 1566 01:21:48,420 --> 01:21:51,677 And that's been subsequently confirmed, 1567 01:21:51,677 --> 01:21:52,760 that that's the mechanism. 1568 01:21:52,760 --> 01:21:56,124 So this just shows you that to figure this out 1569 01:21:56,124 --> 01:21:58,540 by molecular genetics would have been extremely difficult. 1570 01:21:58,540 --> 01:22:00,680 But sometimes comparative genomics, 1571 01:22:00,680 --> 01:22:04,010 when you ask the right question, you get a really clear-- 1572 01:22:04,010 --> 01:22:08,550 you can actually get mechanistic insights from sequences. 1573 01:22:08,550 --> 01:22:10,220 So that's it. 
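[Editorial sketch, not part of the lecture] The observation above, that the conserved sequence near exon 5 is complementary to the sequence upstream of each alternative exon, amounts to an antiparallel base-pairing check. A minimal sketch, assuming equal-length sequences written as DNA; the function names and the toy sequences are hypothetical and are not the real Dscam sequences.

```python
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(dna):
    """Reverse complement of a DNA string."""
    return "".join(PAIR[b] for b in reversed(dna))


def pairing_fraction(docking, selector):
    """Fraction of positions at which the docking sequence could base-pair
    with the selector when the two strands are aligned antiparallel.
    Both sequences are assumed to have the same length."""
    partner = reverse_complement(selector)
    matches = sum(a == b for a, b in zip(docking, partner))
    return matches / len(docking)


# Toy sequences (hypothetical): one docking site and two candidate selectors.
docking = "GGATCCA"
selectors = {"exon_6.1": "TGGATCC",   # perfect reverse complement
             "exon_6.2": "TGGTTCC"}   # one mismatch
for name, seq in selectors.items():
    print(name, round(pairing_fraction(docking, seq), 3))
```

Scoring each upstream sequence against the single docking sequence in this way is a rough stand-in for the comparison of consensus sequences described above; it ignores G-U wobble pairs and any gaps.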
1574 01:22:10,220 --> 01:22:14,750 And I'm actually passing the baton over to David, 1575 01:22:14,750 --> 01:22:19,030 who will be-- take over next week.