1 00:00:00,070 --> 00:00:01,780 The following content is provided 2 00:00:01,780 --> 00:00:04,030 under a Creative Commons license. 3 00:00:04,030 --> 00:00:06,870 Your support will help MIT OpenCourseWare continue 4 00:00:06,870 --> 00:00:10,730 to offer high quality educational resources for free. 5 00:00:10,730 --> 00:00:13,330 To make a donation or view additional materials 6 00:00:13,330 --> 00:00:17,217 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:17,217 --> 00:00:17,842 at ocw.mit.edu. 8 00:00:26,830 --> 00:00:27,959 PROFESSOR: All right. 9 00:00:27,959 --> 00:00:29,250 We should probably get started. 10 00:00:33,230 --> 00:00:37,960 So RNA plays important regulatory and catalytic roles 11 00:00:37,960 --> 00:00:40,860 in biology, and so it's important to understand 12 00:00:40,860 --> 00:00:41,450 its function. 13 00:00:41,450 --> 00:00:45,462 And so that's going to be the main theme of today's lecture. 14 00:00:45,462 --> 00:00:46,920 But before we get to that, I wanted 15 00:00:46,920 --> 00:00:51,840 to briefly review what we went over last time. 16 00:00:51,840 --> 00:00:55,160 So we talked about hidden Markov models, some 17 00:00:55,160 --> 00:00:59,670 of the terminology, thinking of them as generative models, 18 00:00:59,670 --> 00:01:03,160 terminology of the different types of parameters, 19 00:01:03,160 --> 00:01:06,050 the initiation probabilities and transition probabilities 20 00:01:06,050 --> 00:01:07,320 and so forth. 21 00:01:07,320 --> 00:01:11,200 And Viterbi algorithm, just sort of the core algorithm 22 00:01:11,200 --> 00:01:16,077 used whenever you apply HMMs. 23 00:01:16,077 --> 00:01:18,160 Essentially, you always use the Viterbi algorithm. 24 00:01:18,160 --> 00:01:22,410 And then we gave as an example the CpG Island HMM, 25 00:01:22,410 --> 00:01:25,740 which is admittedly a bit of a toy example. 
26 00:01:25,740 --> 00:01:28,397 It's not really used in practice, 27 00:01:28,397 --> 00:01:29,730 but it illustrates the principles. 28 00:01:29,730 --> 00:01:32,650 And then today we're going to talk about a couple 29 00:01:32,650 --> 00:01:35,246 of real world HMMs. 30 00:01:35,246 --> 00:01:36,620 But before we get to that, I just 31 00:01:36,620 --> 00:01:38,300 wanted to-- sort of toward the end, 32 00:01:38,300 --> 00:01:40,930 we talked about the computational complexity 33 00:01:40,930 --> 00:01:45,920 of the algorithm, and concluded that if you have a k-state 34 00:01:45,920 --> 00:01:51,050 HMM run on a sequence of length L, it's order k squared L. 35 00:01:51,050 --> 00:01:55,660 And this diagram is helpful to many people in sort 36 00:01:55,660 --> 00:01:56,910 of thinking about that. 37 00:01:56,910 --> 00:02:00,520 So you can have transitions from any state-- 38 00:02:00,520 --> 00:02:03,000 for example, from this state-- to any of the other five 39 00:02:03,000 --> 00:02:04,754 states, in this five-state HMM. 40 00:02:04,754 --> 00:02:06,170 And when you're doing the Viterbi, 41 00:02:06,170 --> 00:02:10,750 you have to maximize over the five possible input transitions 42 00:02:10,750 --> 00:02:11,440 into each state. 43 00:02:11,440 --> 00:02:14,435 And so the full set of computations 44 00:02:14,435 --> 00:02:19,070 that you have to do from going from position i to i plus 1 45 00:02:19,070 --> 00:02:20,025 is k squared. 46 00:02:20,025 --> 00:02:20,900 Does that make sense? 47 00:02:20,900 --> 00:02:23,890 And then there's L different transitions you have to do, 48 00:02:23,890 --> 00:02:26,670 so it's k squared L. 49 00:02:26,670 --> 00:02:30,670 Any questions about that? 50 00:02:30,670 --> 00:02:31,860 OK. 51 00:02:31,860 --> 00:02:35,900 All right, and so the example that we gave is shown here.
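The order k squared L count can be read off a small implementation. What follows is a minimal sketch, not code from the lecture: the function name `viterbi`, the two state names, and every probability are made-up assumptions, chosen only so that the answer is easy to see.

```python
import math

def viterbi(obs, states, log_init, log_trans, log_emit):
    """Return the most probable hidden-state path for obs (log-space Viterbi)."""
    # score[s] = best log-probability of any path that ends in state s so far
    score = {s: log_init[s] + log_emit[s][obs[0]] for s in states}
    backptrs = []
    for symbol in obs[1:]:                      # L - 1 steps along the sequence
        new_score, ptr = {}, {}
        for cur in states:                      # k destination states ...
            # ... each maximized over k source states: k^2 work per position
            best_prev = max(states, key=lambda prev: score[prev] + log_trans[prev][cur])
            new_score[cur] = (score[best_prev] + log_trans[best_prev][cur]
                              + log_emit[cur][symbol])
            ptr[cur] = best_prev
        backptrs.append(ptr)
        score = new_score
    # Only at the end do we commit: pick the best final state and trace back.
    path = [max(states, key=lambda s: score[s])]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return path[::-1]

# Hypothetical toy parameters (assumptions, not the lecture's CpG island values).
states = ["genome", "island"]
log_init = {s: math.log(0.5) for s in states}
log_trans = {a: {b: math.log(0.9 if a == b else 0.1) for b in states} for a in states}
emit = {"genome": {"A": 0.49, "T": 0.49, "C": 0.01, "G": 0.01},
        "island": {"A": 0.01, "T": 0.01, "C": 0.49, "G": 0.49}}
log_emit = {s: {x: math.log(p) for x, p in emit[s].items()} for s in states}

path = viterbi("AATTCCGGAATT", states, log_init, log_trans, log_emit)
print(path)
```

With these exaggerated emission probabilities the C/G stretch in the middle should come out labeled as island. The two nested loops over `states` inside the loop over positions are where the k squared L comes from.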
52 00:02:35,900 --> 00:02:40,870 And what we did was to take an example sort of where you could 53 00:02:40,870 --> 00:02:44,870 sort of see the answer-- not immediately see it, 54 00:02:44,870 --> 00:02:49,900 but if we're thinking about it a little, figure out the answer. 55 00:02:49,900 --> 00:02:53,010 And then we talked about how the Viterbi algorithm actually 56 00:02:53,010 --> 00:02:57,940 works, and why it makes the transitions at the right place. 57 00:03:00,960 --> 00:03:05,080 It seems to intuitively like it would make a transition later, 58 00:03:05,080 --> 00:03:07,180 but actually transitions at the right place. 59 00:03:07,180 --> 00:03:09,000 And one way to think about that is 60 00:03:09,000 --> 00:03:16,540 that these are not hard and fast decisions because you're 61 00:03:16,540 --> 00:03:18,540 optimizing two different paths. 62 00:03:18,540 --> 00:03:21,840 At every state, you're considering two possibilities. 63 00:03:21,840 --> 00:03:26,140 And so you explore the possibility of-- the first time 64 00:03:26,140 --> 00:03:29,710 you hit a c, you explore the possibility of transitioning 65 00:03:29,710 --> 00:03:32,080 from genome to island, but you're not 66 00:03:32,080 --> 00:03:35,510 confirming whether you're going to do that yet until you get 67 00:03:35,510 --> 00:03:38,410 to the end and see whether that path ends up having a higher 68 00:03:38,410 --> 00:03:42,060 probability at the end of the sequence than the alternative. 69 00:03:42,060 --> 00:03:45,570 So that's sort of one way of thinking about that. 70 00:03:45,570 --> 00:03:48,920 Any questions about this sort of thing, 71 00:03:48,920 --> 00:03:53,470 how to understand when a transition will be made? 72 00:03:53,470 --> 00:03:56,760 And I want to emphasize, for this simple HMM, 73 00:03:56,760 --> 00:03:58,840 we talked about you can kind of see 74 00:03:58,840 --> 00:04:00,220 what the answer's going to be. 
75 00:04:00,220 --> 00:04:05,080 But if you have any HMM, any sort of interesting real world 76 00:04:05,080 --> 00:04:07,550 HMM with multiple states, there's 77 00:04:07,550 --> 00:04:09,600 no way you're going to be able to see it. 78 00:04:09,600 --> 00:04:11,558 Maybe you could guess what the answer might be, 79 00:04:11,558 --> 00:04:14,290 but you're not going to be able to be confident of what that 80 00:04:14,290 --> 00:04:16,942 is, which is why you have to actually implement it. 81 00:04:19,840 --> 00:04:21,940 All right, good. 82 00:04:21,940 --> 00:04:23,840 Let's talk about a couple of real world HMMs. 83 00:04:23,840 --> 00:04:26,730 So I mentioned gene finding. 84 00:04:26,730 --> 00:04:30,560 That's been a popular application of HMMs, 85 00:04:30,560 --> 00:04:32,100 both in prokaryotes and eukaryotes. 86 00:04:32,100 --> 00:04:34,400 There's some examples discussed in the text. 87 00:04:34,400 --> 00:04:39,610 Another very popular application are so-called profile HMMs. 88 00:04:39,610 --> 00:04:43,440 And so this is a hidden Markov model 89 00:04:43,440 --> 00:04:47,570 that's made based on a multiple alignment of proteins which 90 00:04:47,570 --> 00:04:49,887 have a related function or share a common domain. 91 00:04:49,887 --> 00:04:51,470 For example, there's a database called 92 00:04:51,470 --> 00:04:56,280 Pfam, which includes profile HMMs for hundreds 93 00:04:56,280 --> 00:04:58,740 of different types of protein domains. 94 00:04:58,740 --> 00:05:03,210 And so once you have many dozens or hundreds or thousands 95 00:05:03,210 --> 00:05:05,020 of examples of a protein domain, you 96 00:05:05,020 --> 00:05:07,270 can learn lots of things about it-- 97 00:05:07,270 --> 00:05:11,110 not just what the frequencies of each residue 98 00:05:11,110 --> 00:05:12,910 are in each position, but how likely 99 00:05:12,910 --> 00:05:15,460 you are to have an insertion at each position. 
100 00:05:15,460 --> 00:05:17,330 And if you do have an insertion, what 101 00:05:17,330 --> 00:05:20,200 types of amino acid residues are likely to be inserted 102 00:05:20,200 --> 00:05:21,970 in that position, and how often you 103 00:05:21,970 --> 00:05:25,060 are likely to have a deletion at each position 104 00:05:25,060 --> 00:05:26,240 in the multiple alignment. 105 00:05:26,240 --> 00:05:30,460 And so the challenge then is to take a query protein 106 00:05:30,460 --> 00:05:35,080 and to thread it through all of these profile HMMs and ask, 107 00:05:35,080 --> 00:05:37,660 does it have a significant match to any of them? 108 00:05:37,660 --> 00:05:40,280 And so that's basically how Pfam works. 109 00:05:40,280 --> 00:05:42,580 And the nice thing about HMMs is that they allow you 110 00:05:42,580 --> 00:05:46,040 to-- if you want to have the same probability 111 00:05:46,040 --> 00:05:49,880 of an insertion at each position in your multiple alignment, 112 00:05:49,880 --> 00:05:50,550 you can do that. 113 00:05:50,550 --> 00:05:53,390 But if you have enough data to observe that there's 114 00:05:53,390 --> 00:05:58,630 a five-fold higher likelihood of having an insertion at position 115 00:05:58,630 --> 00:06:02,000 three in a multiple alignment than there is at position two, 116 00:06:02,000 --> 00:06:03,010 you can put that in. 117 00:06:03,010 --> 00:06:06,310 You just change those probabilities. 118 00:06:06,310 --> 00:06:09,230 So in this HMM, each of the hidden states 119 00:06:09,230 --> 00:06:14,860 is either an M state, which is a match state, or an I state, 120 00:06:14,860 --> 00:06:15,850 or an insert state. 121 00:06:15,850 --> 00:06:19,520 And so those will emit actual amino acid residues. 122 00:06:19,520 --> 00:06:21,240 Or it could be a delete state, which 123 00:06:21,240 --> 00:06:25,190 is thought of as emitting a dash, a placeholder 124 00:06:25,190 --> 00:06:27,810 in the multiple alignment. 
125 00:06:27,810 --> 00:06:30,650 So these are also widely used. 126 00:06:30,650 --> 00:06:36,140 And then one of my favorite examples-- it's fairly simple, 127 00:06:36,140 --> 00:06:38,530 but it turns out to be quite useful-- 128 00:06:38,530 --> 00:06:42,820 is the so-called TMHMM for prediction 129 00:06:42,820 --> 00:06:45,200 of transmembrane helices in proteins. 130 00:06:45,200 --> 00:06:50,280 So we know that many, especially eukaryotic proteins, 131 00:06:50,280 --> 00:06:52,500 are embedded in membranes. 132 00:06:52,500 --> 00:06:59,400 And there's one famous family of seven transmembrane helix 133 00:06:59,400 --> 00:07:01,400 proteins, and there are others that 134 00:07:01,400 --> 00:07:04,030 have one or a few transmembrane helices. 135 00:07:04,030 --> 00:07:08,180 And knowing that a protein has at least one transmembrane 136 00:07:08,180 --> 00:07:10,735 helix is very useful in terms of predicting its function. 137 00:07:10,735 --> 00:07:12,600 You predict its localization. 138 00:07:12,600 --> 00:07:15,000 And knowing that it's a seven transmembrane helix protein 139 00:07:15,000 --> 00:07:16,380 is also useful. 140 00:07:16,380 --> 00:07:20,250 And so you want to predict whether the protein has 141 00:07:20,250 --> 00:07:23,785 transmembrane helices and what their orientation is. 142 00:07:23,785 --> 00:07:25,660 That is, proteins can have their N-terminus 143 00:07:25,660 --> 00:07:29,030 either inside the cell or outside the cell. 144 00:07:29,030 --> 00:07:33,260 And then of course, where exactly those helices are. 145 00:07:33,260 --> 00:07:38,240 And this program has about a 97% accuracy, 146 00:07:38,240 --> 00:07:41,650 according to the author. So it works very well.
147 00:07:41,650 --> 00:07:44,930 So what properties do you think-- 148 00:07:44,930 --> 00:07:46,950 we said before that you have to have 149 00:07:46,950 --> 00:07:49,310 strongly different emission probabilities 150 00:07:49,310 --> 00:07:52,080 in the different hidden states to have a chance of being 151 00:07:52,080 --> 00:07:54,580 able to predict things accurately. 152 00:07:54,580 --> 00:07:56,570 So what properties do you think are captured 153 00:07:56,570 --> 00:08:00,210 in a model of transmembrane helices? 154 00:08:00,210 --> 00:08:01,780 What types of emission probabilities 155 00:08:01,780 --> 00:08:05,956 would you want to have for the different states in this model? 156 00:08:05,956 --> 00:08:06,456 Anyone? 157 00:08:09,830 --> 00:08:13,690 So for this protein, what kind of residues 158 00:08:13,690 --> 00:08:15,730 would you have in here? 159 00:08:15,730 --> 00:08:16,970 Oops, sorry. 160 00:08:16,970 --> 00:08:19,300 I'm having trouble with this thing. 161 00:08:19,300 --> 00:08:21,550 All right, here in the middle of the membrane, 162 00:08:21,550 --> 00:08:24,187 what kind of residues are you going to see there? 163 00:08:24,187 --> 00:08:25,062 AUDIENCE: [INAUDIBLE] 164 00:08:25,062 --> 00:08:26,200 PROFESSOR: Those are going to be hydrophobic. 165 00:08:26,200 --> 00:08:26,700 Exactly. 166 00:08:26,700 --> 00:08:30,450 And what about right where the helix emerges 167 00:08:30,450 --> 00:08:33,310 from the membrane? 168 00:08:33,310 --> 00:08:35,564 [INAUDIBLE] charged residues there to kind of anchor 169 00:08:35,564 --> 00:08:38,276 it and prevent it from sliding back into the membrane. 170 00:08:38,276 --> 00:08:43,860 And then in general, both on the exterior and interior, 171 00:08:43,860 --> 00:08:46,520 you'll tend to have more hydrophilic residues. 172 00:08:46,520 --> 00:08:50,680 So that's sort of the basis of TMHMM. 173 00:08:50,680 --> 00:08:52,390 So this is the structure.
174 00:08:52,390 --> 00:08:56,680 And you'll notice that these are not exactly the hidden states 175 00:08:56,680 --> 00:09:00,160 that correspond to individual amino acid residues. 176 00:09:00,160 --> 00:09:03,100 These are like meta states, just to illustrate 177 00:09:03,100 --> 00:09:04,710 the overall structure. 178 00:09:04,710 --> 00:09:07,880 I'll show you the actual states on the next slide. 179 00:09:07,880 --> 00:09:10,450 But these were the types of states 180 00:09:10,450 --> 00:09:14,020 that the author, Anders Krogh, decided to model. 181 00:09:14,020 --> 00:09:20,000 So he has sort of a-- focuses here on the helix core. 182 00:09:20,000 --> 00:09:23,710 There's also a cytoplasmic cap and a non-cytoplasmic cap. 183 00:09:23,710 --> 00:09:25,660 Oops, didn't mean that. 184 00:09:25,660 --> 00:09:31,020 And then there's sort of a globular domain on each side-- 185 00:09:31,020 --> 00:09:32,720 both on the cytoplasmic side, or you 186 00:09:32,720 --> 00:09:35,600 could have one on the non-cytoplasmic side. 187 00:09:35,600 --> 00:09:40,480 OK, so there's going to be different compositions in each 188 00:09:40,480 --> 00:09:41,460 of these regions. 189 00:09:41,460 --> 00:09:44,740 Now one of the things we talked about with HMMs 190 00:09:44,740 --> 00:09:48,169 is that if you were-- now let's think about the helix core. 191 00:09:48,169 --> 00:09:49,710 The simplest model you might think of 192 00:09:49,710 --> 00:09:53,120 would be to have sort of a helix state, 193 00:09:53,120 --> 00:09:56,520 and then to allow that state to recur to itself. 194 00:09:56,520 --> 00:09:59,670 OK, so this type of thing where you then have some transition 195 00:09:59,670 --> 00:10:04,350 to some sort of cap state after, this would allow you 196 00:10:04,350 --> 00:10:10,590 to model helices of any length. 197 00:10:10,590 --> 00:10:13,200 But now how long are transmembrane helices? 198 00:10:13,200 --> 00:10:15,610 What does that distribution look like?
199 00:10:15,610 --> 00:10:18,800 Anyone have an idea? 200 00:10:18,800 --> 00:10:20,611 There's a certain physical dimension. 201 00:10:20,611 --> 00:10:21,110 [INAUDIBLE] 202 00:10:24,800 --> 00:10:27,530 It takes a certain number residues to get across here, 203 00:10:27,530 --> 00:10:30,590 and then that number is about 20-ish. 204 00:10:30,590 --> 00:10:32,420 So transmembrane helices tend to be sort of 205 00:10:32,420 --> 00:10:34,960 on the order of 20 plus or minus a few. 206 00:10:34,960 --> 00:10:37,580 And so it's totally unrealistic to have a transmembrane helix 207 00:10:37,580 --> 00:10:40,430 that's, like, five residues long. 208 00:10:40,430 --> 00:10:44,980 So if you run this algorithm in generative mode, 209 00:10:44,980 --> 00:10:49,275 what distribution of helix lengths will you produce? 210 00:10:52,697 --> 00:10:54,280 We're running in generative mode where 211 00:10:54,280 --> 00:10:56,460 we're going to let, remember, to generate 212 00:10:56,460 --> 00:10:58,350 a series of hidden states and then 213 00:10:58,350 --> 00:11:02,750 associated amino acid sequences. 214 00:11:02,750 --> 00:11:05,995 It's coming from some, let's say-- I don't know. 215 00:11:05,995 --> 00:11:09,650 What kind of states are there here? [INAUDIBLE] plasmic. 216 00:11:09,650 --> 00:11:14,980 Let's say goes into helix, hangs out here. 217 00:11:14,980 --> 00:11:17,886 I'm sorry, is there an answer to this question? 218 00:11:17,886 --> 00:11:19,130 Anyone? 219 00:11:19,130 --> 00:11:21,470 I don't know how long-- if I let it run, 220 00:11:21,470 --> 00:11:22,770 it'll generate a random number. 221 00:11:22,770 --> 00:11:25,710 It depends on what this probability is here. 222 00:11:25,710 --> 00:11:27,940 Let's call this probability p, and then this 223 00:11:27,940 --> 00:11:30,190 would be 1 minus p. 224 00:11:30,190 --> 00:11:34,420 OK, so obviously if 1 minus p is bigger, 225 00:11:34,420 --> 00:11:36,595 it'll tend to produce longer helices. 
226 00:11:36,595 --> 00:11:37,970 But in general, what is the shape 227 00:11:37,970 --> 00:11:42,940 of the distribution there of consecutive helical states 228 00:11:42,940 --> 00:11:45,778 that this model will generate? 229 00:11:45,778 --> 00:11:47,150 AUDIENCE: Binomial. 230 00:11:47,150 --> 00:11:48,050 PROFESSOR: Binomial. 231 00:11:48,050 --> 00:11:49,970 OK, can you explain why? 232 00:11:49,970 --> 00:11:55,290 AUDIENCE: Because the helix would 233 00:11:55,290 --> 00:11:59,900 have to have probable-- the helix of length n 234 00:11:59,900 --> 00:12:05,466 would occur 1 minus p to the n power. 235 00:12:05,466 --> 00:12:08,730 PROFESSOR: OK, so a helix of length 10 with a probability 236 00:12:08,730 --> 00:12:12,580 of then, say, let's call it L, for the length of the helix, 237 00:12:12,580 --> 00:12:19,197 equals n is 1 minus p to the n, right? 238 00:12:19,197 --> 00:12:19,905 Is that binomial? 239 00:12:22,770 --> 00:12:24,916 Someone else? 240 00:12:24,916 --> 00:12:25,880 AUDIENCE: Yeah. 241 00:12:25,880 --> 00:12:27,564 Is it a negative binomial? 242 00:12:27,564 --> 00:12:28,772 PROFESSOR: Negative binomial. 243 00:12:28,772 --> 00:12:29,736 OK. 244 00:12:29,736 --> 00:12:33,110 AUDIENCE: [INAUDIBLE] states and a helix state before moving out 245 00:12:33,110 --> 00:12:34,384 [INAUDIBLE]. 246 00:12:34,384 --> 00:12:35,050 PROFESSOR: Yeah. 247 00:12:35,050 --> 00:12:37,630 So the distribution is going to be like that. 248 00:12:37,630 --> 00:12:43,620 You have to stay in here for n and then leave. 249 00:12:43,620 --> 00:12:49,242 So this is the simplest-- you can 250 00:12:49,242 --> 00:12:51,450 have special cases of binomial and negative binomial. 251 00:12:51,450 --> 00:12:52,950 But in general, this distribution 252 00:12:52,950 --> 00:12:55,180 is called the geometric distribution. 253 00:12:55,180 --> 00:12:58,430 Or a continuous version would be the exponential distribution. 
254 00:12:58,430 --> 00:13:01,880 So what is the shape of this distribution? 255 00:13:01,880 --> 00:13:11,140 If I were to plot n down here on this axis, and the probability 256 00:13:11,140 --> 00:13:14,760 that L equals n on this axis, what kind of shape-- 257 00:13:14,760 --> 00:13:17,920 could someone draw in the air? 258 00:13:17,920 --> 00:13:20,586 So you had up and then down? 259 00:13:20,586 --> 00:13:24,950 OK, so actually, it's going to be just down. 260 00:13:31,210 --> 00:13:32,550 Like that, right? 261 00:13:32,550 --> 00:13:36,806 Because as n increases, this goes down 262 00:13:36,806 --> 00:13:38,180 because 1 minus p is less than 1. 263 00:13:38,180 --> 00:13:40,275 So it just steadily goes down. 264 00:13:40,275 --> 00:13:42,025 And what is the mean of this distribution? 265 00:13:47,150 --> 00:13:48,608 Anyone remember this? 266 00:13:51,524 --> 00:13:53,940 Yeah, so there's sort of two versions of this 267 00:13:53,940 --> 00:13:55,340 that you'll see. 268 00:13:55,340 --> 00:14:03,120 One of them is 1 minus p to the n minus 1, times p, and one of them 269 00:14:03,120 --> 00:14:03,620 is this. 270 00:14:03,620 --> 00:14:08,990 And so this is the number of failures before a success, 271 00:14:08,990 --> 00:14:09,970 if you will. 272 00:14:09,970 --> 00:14:11,650 Success is leaving the helix. 273 00:14:11,650 --> 00:14:14,800 And this is the number of trials till the first success. 274 00:14:14,800 --> 00:14:17,190 So one of them has a mean that's 1/p, 275 00:14:17,190 --> 00:14:22,030 and the other has a mean that's 1 minus p over p. 276 00:14:22,030 --> 00:14:26,510 So usually, p is small, and so those are about the same. 277 00:14:26,510 --> 00:14:27,150 So 1/p. 278 00:14:27,150 --> 00:14:29,050 You could think that 1/p is roughly right.
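Written out, the two versions of the geometric distribution mentioned here look as follows. In this sketch p is the probability of leaving the helix state at each step and 1 - p is the self-transition probability, as in the lecture; the symbol F for the failures-before-success count is an assumption introduced only for this note.

```latex
\begin{align*}
P(L = n) &= (1-p)^{\,n-1}\,p, & n &= 1, 2, \dots, & \mathbb{E}[L] &= \frac{1}{p},\\
P(F = n) &= (1-p)^{\,n}\,p,   & n &= 0, 1, \dots, & \mathbb{E}[F] &= \frac{1-p}{p}.
\end{align*}
```

Here L counts every step spent in the helix state, including the last one before the exit, and F = L - 1 counts only the self-transitions. Both probabilities shrink by a factor of 1 - p each time n goes up by one, which is why the curve only goes down, and the two means differ by exactly 1, which is negligible when p is small.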
279 00:14:29,050 --> 00:14:32,520 And so if we were to model transmembrane helices, 280 00:14:32,520 --> 00:14:34,360 and if transmembrane helices are about-- 281 00:14:34,360 --> 00:14:37,130 I said about 20 residues long-- you 282 00:14:37,130 --> 00:14:46,157 would set p to what value to get the right mean? 283 00:14:49,900 --> 00:14:51,770 AUDIENCE: 0.05. 284 00:14:51,770 --> 00:14:54,250 PROFESSOR: Yeah. 285 00:14:54,250 --> 00:14:55,170 0.05. 286 00:14:55,170 --> 00:15:01,010 1/20, so that 1 over that will be about 20, right? 287 00:15:01,010 --> 00:15:04,860 And then 1 minus p would, of course, be 0.95. 288 00:15:04,860 --> 00:15:07,335 So if I were to do that, I would get a distribution 289 00:15:07,335 --> 00:15:10,780 that looks about like this with a mean of 20. 290 00:15:10,780 --> 00:15:14,910 But if I were to then look at real transmembrane helices 291 00:15:14,910 --> 00:15:16,930 and look at their distribution, I 292 00:15:16,930 --> 00:15:21,220 would see something totally different. 293 00:15:21,220 --> 00:15:23,000 It would probably look like that. 294 00:15:23,000 --> 00:15:25,600 It would have a mean around 20. 295 00:15:25,600 --> 00:15:30,360 But the probability of anything less than 15 would be 0. 296 00:15:30,360 --> 00:15:31,130 That's too short. 297 00:15:31,130 --> 00:15:35,500 It can't go across the membrane. 298 00:15:35,500 --> 00:15:37,680 And then again, you don't have ones that are 40. 299 00:15:37,680 --> 00:15:40,180 They don't kind of wiggle around in there and then come out. 300 00:15:40,180 --> 00:15:43,010 They tend to just go straight across. 301 00:15:43,010 --> 00:15:46,330 So there's a problem here.
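To put rough numbers on that mismatch, here is a small sketch that evaluates the single self-looping helix state with p = 0.05. The function name `geometric_pmf` and the cutoffs of 15 and 40 residues are assumptions taken from the round figures in the discussion above, not measured values.

```python
def geometric_pmf(n, p):
    """P(L = n) for the trials-until-first-exit form, n = 1, 2, ..."""
    return (1 - p) ** (n - 1) * p

p = 0.05  # exit probability chosen so that the mean helix length is 1/p = 20

# Mean length; the sum is truncated where the remaining tail is negligible.
mean_length = sum(n * geometric_pmf(n, p) for n in range(1, 5000))

# Probability mass the one-state model puts on helices too short to span a membrane.
prob_shorter_than_15 = sum(geometric_pmf(n, p) for n in range(1, 15))

# Probability mass on implausibly long helices (40 residues or more).
prob_40_or_longer = (1 - p) ** 39

print(f"mean length        = {mean_length:.2f}")
print(f"P(L < 15)          = {prob_shorter_than_15:.3f}")
print(f"P(L >= 40)         = {prob_40_or_longer:.3f}")
```

Under these assumptions roughly half of all generated helices are shorter than 15 residues and more than a tenth are 40 or longer, even though the mean is right. That is the problem the multi-state design is meant to fix.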
302 00:15:46,330 --> 00:15:51,020 You can see that if you want to make a more accurate model, 303 00:15:51,020 --> 00:15:54,340 you want to not only get the right emission probabilities 304 00:15:54,340 --> 00:15:57,100 with the right probabilities of hydrophobics and hydrophilics 305 00:15:57,100 --> 00:15:58,240 and the different states, but you also 306 00:15:58,240 --> 00:15:59,600 want to get the length right. 307 00:15:59,600 --> 00:16:04,470 And so the trick that-- well, actually, yeah. 308 00:16:04,470 --> 00:16:07,190 Can anyone think of tricks to get the right length 309 00:16:07,190 --> 00:16:09,270 distribution here? 310 00:16:09,270 --> 00:16:11,290 How do we do better than this? 311 00:16:11,290 --> 00:16:14,770 Basically, hidden Markov models where 312 00:16:14,770 --> 00:16:16,660 you have a state that will recur to itself, 313 00:16:16,660 --> 00:16:18,580 it will always be a geometric distribution. 314 00:16:18,580 --> 00:16:22,010 The only choice you have is what is that probability. 315 00:16:22,010 --> 00:16:24,360 And so you can get any mean you want, 316 00:16:24,360 --> 00:16:26,330 but you always get this shape. 317 00:16:26,330 --> 00:16:29,380 So if you want a more general shape, 318 00:16:29,380 --> 00:16:32,805 what are some tricks that you could do? 319 00:16:32,805 --> 00:16:36,094 How could you change the model? 320 00:16:36,094 --> 00:16:37,078 any ideas? 321 00:16:37,078 --> 00:16:38,554 Yeah, go ahead. 322 00:16:38,554 --> 00:16:40,623 AUDIENCE: [INAUDIBLE] have multiple helix states. 323 00:16:40,623 --> 00:16:41,998 PROFESSOR: Multiple helix states. 324 00:16:41,998 --> 00:16:42,498 OK. 325 00:16:42,498 --> 00:16:43,474 How many? 326 00:16:46,426 --> 00:16:49,880 AUDIENCE: Proportional to the length we want, [INAUDIBLE]. 327 00:16:49,880 --> 00:16:53,030 PROFESSOR: Like one for each possible length. 328 00:16:53,030 --> 00:16:55,010 AUDIENCE: It'd be less than one length. 
329 00:16:55,010 --> 00:16:56,495 PROFESSOR: Or less than one. 330 00:16:56,495 --> 00:16:56,994 OK. 331 00:16:56,994 --> 00:16:58,970 So you could have something like-- I mean, 332 00:16:58,970 --> 00:17:02,940 let's say you have like this. 333 00:17:02,940 --> 00:17:08,450 Helix begin-- or, helix 1, helix 2. 334 00:17:08,450 --> 00:17:11,290 You allow each of these to recur to themselves. 335 00:17:14,260 --> 00:17:15,220 What does that get you? 336 00:17:18,702 --> 00:17:20,910 This actually gets you something a little bit better. 337 00:17:20,910 --> 00:17:26,780 It gives you a little bit of-- it's more like that. 338 00:17:26,780 --> 00:17:28,860 So that's better. 339 00:17:28,860 --> 00:17:32,330 But if I want to get the exact distribution, then actually 340 00:17:32,330 --> 00:17:37,870 one-- so this is the solution that the authors actually used. 341 00:17:37,870 --> 00:17:44,490 They made essentially 25 different helix states, 342 00:17:44,490 --> 00:17:49,190 and then they allowed various different transitions here. 343 00:17:49,190 --> 00:17:52,810 So it's a little arbitrary here, but they 344 00:17:52,810 --> 00:17:58,100 have this special state three that can kind of take a jump. 345 00:17:58,100 --> 00:18:00,600 So it can just continue on to four, 346 00:18:00,600 --> 00:18:03,860 and that'll make your maximum length helix core. 347 00:18:03,860 --> 00:18:07,180 Or it can skip one, go to five, and that'll 348 00:18:07,180 --> 00:18:10,006 make a helix core that's one residue shorter than that, 349 00:18:10,006 --> 00:18:11,380 or it can skip two, and so forth. 350 00:18:11,380 --> 00:18:13,150 And you can set any probabilities 351 00:18:13,150 --> 00:18:14,410 you want on these transitions. 352 00:18:14,410 --> 00:18:18,120 And so you can fit basically an arbitrary distribution 353 00:18:18,120 --> 00:18:20,450 within a fixed range of lengths that's 354 00:18:20,450 --> 00:18:22,450 determined by how many states you have.
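The skip-transition trick can be sketched in a few lines. This is an illustration under stated assumptions, not the actual TMHMM architecture or its trained parameters: the names `build_core_chain` and `sample_length`, the choice of 25 states with the branch at state 3, and the target length distribution are all hypothetical.

```python
import random
from collections import Counter

def build_core_chain(max_len, branch, target):
    """Transition table for a left-to-right helix core with skip transitions.

    States are numbered 1..max_len. Every state steps to the next one, except
    state `branch`, which may jump ahead. `target` maps core length -> probability.
    """
    assert abs(sum(target.values()) - 1.0) < 1e-9
    assert all(branch + 1 <= length <= max_len for length in target)
    trans = {s: {s + 1: 1.0} for s in range(1, max_len)}
    trans[max_len] = {"exit": 1.0}
    # Skipping j states right after the branch gives a core of length max_len - j,
    # so the desired length probabilities go directly onto the jump transitions.
    trans[branch] = {branch + 1 + (max_len - length): prob
                     for length, prob in target.items()}
    return trans

def sample_length(trans, rng):
    """Run the chain in generative mode and count the core states visited."""
    state, length = 1, 0
    while state != "exit":
        length += 1
        next_states, probs = zip(*trans[state].items())
        state = rng.choices(next_states, weights=probs)[0]
    return length

# Hypothetical target distribution over helix-core lengths (made-up numbers).
target = {25: 0.1, 22: 0.3, 20: 0.4, 18: 0.2}
trans = build_core_chain(max_len=25, branch=3, target=target)
rng = random.Random(0)
lengths = Counter(sample_length(trans, rng) for _ in range(2000))
print(sorted(lengths.items()))
```

Because all the randomness sits on the jumps out of one state, any distribution over lengths between `branch + 1` and `max_len` can be encoded. The shortest reachable core in this sketch is four states, which matches the question raised about the minimum helix length in the diagram.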
355 00:18:22,450 --> 00:18:26,270 OK, so they really wanted to get the length distribution right, 356 00:18:26,270 --> 00:18:27,561 and that's what they did. 357 00:18:27,561 --> 00:18:28,560 What's the cost of this? 358 00:18:28,560 --> 00:18:29,570 What's the downside? 359 00:18:29,570 --> 00:18:30,904 Simona? 360 00:18:30,904 --> 00:18:32,320 AUDIENCE: I was just going to ask, 361 00:18:32,320 --> 00:18:34,930 it looks like from this your minimum helix 362 00:18:34,930 --> 00:18:36,500 length could be four. 363 00:18:36,500 --> 00:18:37,262 PROFESSOR: Yeah. 364 00:18:37,262 --> 00:18:38,220 That's a good question. 365 00:18:42,262 --> 00:18:44,470 Well, we don't know what the probabilities-- they say 366 00:18:44,470 --> 00:18:45,080 said on that. 367 00:18:45,080 --> 00:18:47,624 Well, did they really mean that? 368 00:18:47,624 --> 00:18:50,040 And also, that's only the core, and maybe these cap things 369 00:18:50,040 --> 00:18:52,780 can be-- yeah, that seems a little short to me. 370 00:18:52,780 --> 00:18:55,840 So yeah, I agree. 371 00:18:55,840 --> 00:18:56,430 I'm not sure. 372 00:18:56,430 --> 00:18:58,346 It could just be for the sake of illustration, 373 00:18:58,346 --> 00:19:01,660 but they don't actually use those. 374 00:19:01,660 --> 00:19:05,869 But anyway, I'll probably have to read the paper. 375 00:19:05,869 --> 00:19:07,535 I haven't read this paper for many years 376 00:19:07,535 --> 00:19:09,493 so I don't remember exactly the answer to that. 377 00:19:09,493 --> 00:19:13,410 But I have a citation. 378 00:19:13,410 --> 00:19:15,095 You can look it up if you're curious. 
379 00:19:15,095 --> 00:19:16,970 But the main point I wanted to make with this 380 00:19:16,970 --> 00:19:20,020 is just that by setting an arbitrary number of states 381 00:19:20,020 --> 00:19:22,062 and putting in possible transitions between them, 382 00:19:22,062 --> 00:19:24,270 you can actually construct any length distribution 383 00:19:24,270 --> 00:19:24,790 you want. 384 00:19:24,790 --> 00:19:27,392 But there is a downside, and what is that downside? 385 00:19:27,392 --> 00:19:28,600 AUDIENCE: Computational cost. 386 00:19:28,600 --> 00:19:30,930 PROFESSOR: Yeah, the computational cost. 387 00:19:30,930 --> 00:19:32,490 Instead of having one helix state, 388 00:19:32,490 --> 00:19:34,580 now we've got 25 or something. 389 00:19:34,580 --> 00:19:38,660 So and the time goes up by the square of the number of states, 390 00:19:38,660 --> 00:19:41,300 so it's going to run slower. 391 00:19:41,300 --> 00:19:45,480 And you also have to estimate all these parameters. 392 00:19:45,480 --> 00:19:54,520 OK, so here's an example of the output of the TMHMM 393 00:19:54,520 --> 00:20:00,180 program for a mouse chloride channel gene, CLC6. 394 00:20:00,180 --> 00:20:02,810 So the program predicts that there 395 00:20:02,810 --> 00:20:05,820 are seven transmembrane helices, as shown 396 00:20:05,820 --> 00:20:07,790 by these little red blocks here. 397 00:20:07,790 --> 00:20:11,040 You can see they're all about the same-- about 20 or so-- 398 00:20:11,040 --> 00:20:17,530 and that the protein starts outside and ends inside. 399 00:20:17,530 --> 00:20:21,850 So let's say you were going to do some experiments 400 00:20:21,850 --> 00:20:25,737 on this protein to test this prediction.
401 00:20:25,737 --> 00:20:27,570 So one of the types of experiments people do 402 00:20:27,570 --> 00:20:30,150 is they put some sort of modifiable 403 00:20:30,150 --> 00:20:35,670 or modified residue into one of the spaces 404 00:20:35,670 --> 00:20:37,530 between the transmembrane helices. 405 00:20:37,530 --> 00:20:41,190 And then you can test, by modifying this cell 406 00:20:41,190 --> 00:20:45,030 with something that's a non-permeable chemical, 407 00:20:45,030 --> 00:20:46,820 can you modify that protein? 408 00:20:46,820 --> 00:20:52,150 So only if that stretch is on the outside of the cell 409 00:20:52,150 --> 00:20:53,800 will you be able to modify it. 410 00:20:53,800 --> 00:20:57,961 So that's a way of testing the topology. 411 00:20:57,961 --> 00:20:59,960 So if you were doing those types of experiments, 412 00:20:59,960 --> 00:21:02,440 you might actually-- like maybe you're not sure 413 00:21:02,440 --> 00:21:06,110 if every transmembrane helix is correct. 414 00:21:06,110 --> 00:21:08,250 There could be some where the boundaries were 415 00:21:08,250 --> 00:21:10,450 a little off, or even a wrong helix. 416 00:21:10,450 --> 00:21:12,300 And so one of the things that you often 417 00:21:12,300 --> 00:21:14,940 want with a prediction is not only 418 00:21:14,940 --> 00:21:18,490 to know what is the optimal or most likely prediction, 419 00:21:18,490 --> 00:21:21,770 but also how confident is the algorithm in each 420 00:21:21,770 --> 00:21:23,390 of the parts of its prediction. 421 00:21:23,390 --> 00:21:28,310 How confident is it in the location of transmembrane helix 422 00:21:28,310 --> 00:21:33,710 three or the probability that actually there 423 00:21:33,710 --> 00:21:36,250 is a transmembrane helix three. 424 00:21:36,250 --> 00:21:42,280 And so the way that this program does that is using something 425 00:21:42,280 --> 00:21:45,380 called the forward-backward algorithm.
426 00:21:45,380 --> 00:21:48,070 So those of you who read the Rabiner tutorial, 427 00:21:48,070 --> 00:21:49,660 it's described pretty well there. 428 00:21:49,660 --> 00:21:52,480 The basic idea is that I mentioned 429 00:21:52,480 --> 00:21:59,630 that this P(O)-- the probability of the observable sequence 430 00:21:59,630 --> 00:22:02,680 summing over all possible HMM structures 431 00:22:02,680 --> 00:22:05,550 or all possible sequences of hidden states-- 432 00:22:05,550 --> 00:22:06,950 that is possible to calculate. 433 00:22:06,950 --> 00:22:08,580 And the way that you do it is you 434 00:22:08,580 --> 00:22:11,630 run an algorithm that's similar to the Viterbi, 435 00:22:11,630 --> 00:22:14,340 but instead of taking the maximum entering 436 00:22:14,340 --> 00:22:17,910 each hidden state at intermediate positions, 437 00:22:17,910 --> 00:22:19,907 you sum those inputs. 438 00:22:19,907 --> 00:22:21,490 So you just do the sum at every point. 439 00:22:21,490 --> 00:22:25,720 And it turns out that will calculate the sum of the two 440 00:22:25,720 --> 00:22:28,220 values at the end-- or the k values at the end 441 00:22:28,220 --> 00:22:33,580 will be equal to the sum of the probabilities of generating 442 00:22:33,580 --> 00:22:37,232 the observable sequence over all possible sequences 443 00:22:37,232 --> 00:22:37,940 of hidden states. 444 00:22:37,940 --> 00:22:39,420 OK, so that's useful. 445 00:22:39,420 --> 00:22:41,310 And then you can also run it backwards. 446 00:22:41,310 --> 00:22:44,750 There's no reason it has to be only going in one direction. 447 00:22:44,750 --> 00:22:48,420 And so what you do is you run these sort of summing versions 448 00:22:48,420 --> 00:22:56,330 of the Viterbi in both the forward direction 449 00:22:56,330 --> 00:23:01,350 and also run one in the backward direction.
450 00:23:01,350 --> 00:23:04,870 And then you take a particular position here-- 451 00:23:04,870 --> 00:23:09,365 like let's say this is your helix state, for example. 452 00:23:09,365 --> 00:23:11,150 And we're interested in this position 453 00:23:11,150 --> 00:23:13,240 somewhere in the middle of the protein. 454 00:23:13,240 --> 00:23:14,590 Is that a helix or not? 455 00:23:14,590 --> 00:23:18,480 And so basically you take the value 456 00:23:18,480 --> 00:23:20,800 that you get here from the forward 457 00:23:20,800 --> 00:23:22,797 in your forward algorithm and the value 458 00:23:22,797 --> 00:23:24,630 that you get here in the backward algorithm, 459 00:23:24,630 --> 00:23:28,710 and multiply those two together, and divide by this P(O). 460 00:23:28,710 --> 00:23:31,310 And that gives you the probability. 461 00:23:31,310 --> 00:23:35,350 So that ends up being a way of calculating 462 00:23:35,350 --> 00:23:37,900 the sum of all the parses that go 463 00:23:37,900 --> 00:23:42,580 through this particular position i in the sequence 464 00:23:42,580 --> 00:23:43,740 in that particular state. 465 00:23:46,320 --> 00:23:50,880 I mean, I realize that may not have been totally clear, 466 00:23:50,880 --> 00:23:56,590 and I don't want to take more time to totally go into it, 467 00:23:56,590 --> 00:23:59,274 but it is pretty well described in Rabiner. 468 00:23:59,274 --> 00:24:00,690 And I'll just give you an example. 469 00:24:00,690 --> 00:24:03,504 So if you're motivated, please take a look at that. 470 00:24:03,504 --> 00:24:04,920 And if you have further questions, 471 00:24:04,920 --> 00:24:09,640 I'd be happy to discuss during office hours next week. 472 00:24:09,640 --> 00:24:12,930 And this is what it looks like for this particular protein. 473 00:24:12,930 --> 00:24:15,690 So you get something called the posterior probability, which 474 00:24:15,690 --> 00:24:21,410 is the sum of the probabilities of all the parses.
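A minimal sketch of the forward-backward calculation described here, assuming a small HMM given as plain Python lists. The function name, the parameter layout, and the two-state toy model are illustrative assumptions, not the program discussed in the lecture. The forward pass is the Viterbi recursion with the max replaced by a sum, the backward pass is the same idea run from the end, and the posterior at each position is forward times backward divided by P(O).

```python
def forward_backward(init, trans, emit, obs):
    """Posterior probability of each hidden state at each position.

    init[k]     : initiation probability of state k
    trans[j][k] : transition probability from state j to state k
    emit[k][s]  : probability that state k emits symbol s
    obs         : observed sequence as a list of symbol indices
    """
    n_states = len(init)
    length = len(obs)
    states = range(n_states)

    # Forward pass: sum over incoming transitions (Viterbi would take the max).
    fwd = [[0.0] * n_states for _ in range(length)]
    for k in states:
        fwd[0][k] = init[k] * emit[k][obs[0]]
    for i in range(1, length):
        for k in states:
            total = sum(fwd[i - 1][j] * trans[j][k] for j in states)
            fwd[i][k] = total * emit[k][obs[i]]

    # Backward pass: the same summing recursion, started at the last position.
    bwd = [[1.0] * n_states for _ in range(length)]
    for i in range(length - 2, -1, -1):
        for j in states:
            bwd[i][j] = sum(
                trans[j][k] * emit[k][obs[i + 1]] * bwd[i + 1][k] for k in states
            )

    # P(O): the sum of the k forward values at the final position.
    p_obs = sum(fwd[length - 1])

    # Posterior: forward value times backward value, divided by P(O).
    posterior = [
        [fwd[i][k] * bwd[i][k] / p_obs for k in states] for i in range(length)
    ]
    return posterior, p_obs


# Hypothetical two-state toy model (made-up numbers, for illustration only).
toy_init = [0.6, 0.4]
toy_trans = [[0.7, 0.3], [0.2, 0.8]]
toy_emit = [[0.9, 0.1], [0.3, 0.7]]
toy_obs = [0, 1, 1, 0]

toy_posterior, toy_p_obs = forward_backward(toy_init, toy_trans, toy_emit, toy_obs)
for position, row in enumerate(toy_posterior):
    print(position, [round(p, 3) for p in row])
```

This sketch multiplies raw probabilities, so it would underflow on a protein-length sequence; a practical version would rescale each position or work in log space.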
475 00:24:21,410 --> 00:24:25,270 And they've plotted it for the particular state that 476 00:24:25,270 --> 00:24:28,830 is in the Viterbi path, that is in the optimal parse-- 477 00:24:28,830 --> 00:24:31,111 so for example, in blue here. 478 00:24:31,111 --> 00:24:33,610 Well, actually, they've done it for all the different states 479 00:24:33,610 --> 00:24:34,390 here. 480 00:24:34,390 --> 00:24:37,670 So blue is the probability that you're outside. 481 00:24:37,670 --> 00:24:40,770 OK, so it's very, very confident that the N-terminus 482 00:24:40,770 --> 00:24:42,690 of the protein is outside the cell. 483 00:24:42,690 --> 00:24:44,440 It's very, very confident in the locations 484 00:24:44,440 --> 00:24:48,330 of transmembrane helices one and two. 485 00:24:48,330 --> 00:24:51,330 It actually more often than not thinks 486 00:24:51,330 --> 00:24:54,584 there's actually a third helix right here, 487 00:24:54,584 --> 00:24:56,500 but that didn't make it in the optimal parse. 488 00:24:56,500 --> 00:24:58,458 That actually occurs in the majority of parses, 489 00:24:58,458 --> 00:25:00,650 but not in the optimal. 490 00:25:00,650 --> 00:25:03,820 And it's probably because it would then cause other things 491 00:25:03,820 --> 00:25:08,550 to be flipped later on if you had a transmembrane helix there. 492 00:25:08,550 --> 00:25:11,800 It's not sure whether there's a helix there or not, 493 00:25:11,800 --> 00:25:13,450 but then it's confident in this one. 494 00:25:13,450 --> 00:25:15,380 OK, so this gives you an idea. 495 00:25:15,380 --> 00:25:19,990 Now if you wanted to do some sort of test of the prediction, 496 00:25:19,990 --> 00:25:23,860 you want to test probably first the higher confidence 497 00:25:23,860 --> 00:25:27,700 predictions, so you might do something right here.
498 00:25:27,700 --> 00:25:30,170 Or if maybe from experience you know 499 00:25:30,170 --> 00:25:32,587 that when it has a probability that's that high, 500 00:25:32,587 --> 00:25:34,670 it's always right, so there's no point testing it. 501 00:25:34,670 --> 00:25:38,580 So you should test one of these kind of less confident regions. 502 00:25:38,580 --> 00:25:41,450 So this actually makes the prediction much more useful 503 00:25:41,450 --> 00:25:44,010 to have some degree of confidence assigned 504 00:25:44,010 --> 00:25:46,500 to each part of the prediction. 505 00:25:51,760 --> 00:25:55,750 So for the remainder of today, I want 506 00:25:55,750 --> 00:26:01,870 to turn to the topic of RNA secondary structure. 507 00:26:01,870 --> 00:26:03,710 So at the beginning, I will sort of 508 00:26:03,710 --> 00:26:05,390 get through some nomenclature. 509 00:26:05,390 --> 00:26:10,450 And then to motivate the topic, give some biological examples 510 00:26:10,450 --> 00:26:11,590 of RNA structure. 511 00:26:11,590 --> 00:26:15,370 Gives me an excuse to show some pretty pictures of structure. 512 00:26:15,370 --> 00:26:18,180 And then we'll talk about two approaches which 513 00:26:18,180 --> 00:26:21,470 are two of the most widely used approaches toward predicting 514 00:26:21,470 --> 00:26:21,970 structure. 515 00:26:21,970 --> 00:26:25,600 So using evolution to predict structure 516 00:26:25,600 --> 00:26:31,210 by method of co-variations, which works well when you 517 00:26:31,210 --> 00:26:33,080 have many homologous sequences. 518 00:26:33,080 --> 00:26:35,770 And then using sort of first principles 519 00:26:35,770 --> 00:26:38,761 thermodynamics to predict secondary structure 520 00:26:38,761 --> 00:26:40,510 by energy minimization where obviously you 521 00:26:40,510 --> 00:26:45,100 don't need to have a homologous sequence present. 
522 00:26:45,100 --> 00:26:47,710 And the Nature Biotechnology primer 523 00:26:47,710 --> 00:26:51,000 on RNA folding that I recommended 524 00:26:51,000 --> 00:26:57,312 is a good intro to the energy minimization approach. 525 00:26:57,312 --> 00:26:58,770 So what is RNA secondary structure? 526 00:26:58,770 --> 00:27:03,520 So you all know that RNAs, like proteins, 527 00:27:03,520 --> 00:27:08,690 have a three-dimensional tertiary fold structure that, 528 00:27:08,690 --> 00:27:11,860 in many cases, determines their function. 529 00:27:11,860 --> 00:27:14,890 But there's also sort of a simpler representation 530 00:27:14,890 --> 00:27:19,610 of this structure where you just describe which pairs of bases 531 00:27:19,610 --> 00:27:21,230 are hydrogen bonded to one another. 532 00:27:21,230 --> 00:27:25,010 OK, and so for RNA-- so it's a famous example 533 00:27:25,010 --> 00:27:27,760 of an RNA structure, this sort of clover leaf structure 534 00:27:27,760 --> 00:27:29,970 that all tRNAs have. 535 00:27:29,970 --> 00:27:34,260 The secondary structure of the tRNA is the set of base pairs. 536 00:27:34,260 --> 00:27:37,310 So it's this base pair here between the first base 537 00:27:37,310 --> 00:27:40,495 and this one toward the end, and then 538 00:27:40,495 --> 00:27:42,180 base right here, and so forth. 539 00:27:42,180 --> 00:27:44,460 And so if you specify all those base pairs, 540 00:27:44,460 --> 00:27:50,140 then you can then draw a picture like this, which gives you 541 00:27:50,140 --> 00:27:54,990 a good idea of what parts of the RNA molecule are accessible. 542 00:27:54,990 --> 00:27:56,700 So for example, it won't tell you 543 00:27:56,700 --> 00:28:00,134 where the anticodon loop is, which 544 00:28:00,134 --> 00:28:01,800 is sort of the business end of the tRNA. 545 00:28:01,800 --> 00:28:04,460 But it narrows it down to three possibilities. 546 00:28:04,460 --> 00:28:09,710 You might consider that, or that, or down here.
547 00:28:09,710 --> 00:28:11,875 It's unlikely to be something in here 548 00:28:11,875 --> 00:28:13,500 because these bases are already paired. 549 00:28:13,500 --> 00:28:15,280 They can't pair to message. 550 00:28:15,280 --> 00:28:19,570 So it gives you sort of a first approximation toward the 3D 551 00:28:19,570 --> 00:28:21,550 structure, and so it's quite useful. 552 00:28:21,550 --> 00:28:24,110 So how do we represent secondary structure? 553 00:28:24,110 --> 00:28:28,650 So there's a few different common representations 554 00:28:28,650 --> 00:28:29,620 that you'll see. 555 00:28:29,620 --> 00:28:33,530 So one is-- and this is sort of a computer-friendly but not 556 00:28:33,530 --> 00:28:36,230 terribly human-friendly representation, 557 00:28:36,230 --> 00:28:38,470 I would say-- is this sort of dot 558 00:28:38,470 --> 00:28:40,560 in parentheses notation here. 559 00:28:40,560 --> 00:28:46,040 So the dot is an unpaired base and the parenthesis 560 00:28:46,040 --> 00:28:48,070 is a paired base. 561 00:28:48,070 --> 00:28:53,500 And how do you know-- chalk is sort of non-uniformly 562 00:28:53,500 --> 00:28:59,400 distributed here-- so if you have a structure like this 563 00:28:59,400 --> 00:29:01,700 and you have these three parentheses, what 564 00:29:01,700 --> 00:29:02,670 are they paired to? 565 00:29:02,670 --> 00:29:06,350 Well, you don't know yet until you get further down. 566 00:29:06,350 --> 00:29:08,370 And then each left parenthesis has 567 00:29:08,370 --> 00:29:10,520 to have a right parenthesis somewhere. 568 00:29:10,520 --> 00:29:13,650 So now if we see this, then we know 569 00:29:13,650 --> 00:29:16,310 that there are two unpaired bases here, 570 00:29:16,310 --> 00:29:18,050 and then there's going to be three 571 00:29:18,050 --> 00:29:21,545 in a row that are paired-- these guys. 572 00:29:21,545 --> 00:29:24,310 We don't know what they're paired to yet. 
573 00:29:24,310 --> 00:29:26,940 Then there's going to be a five-base loop, maybe 574 00:29:26,940 --> 00:29:30,240 a little pentagon type thing. 575 00:29:30,240 --> 00:29:34,550 Two, three, four-- oops-- four, five. 576 00:29:34,550 --> 00:29:38,670 And these would be the right parentheses 577 00:29:38,670 --> 00:29:47,010 that pair with the left parentheses over here. 578 00:29:47,010 --> 00:29:48,690 I should probably draw this coming out 579 00:29:48,690 --> 00:29:51,930 to make it clearer that it's not paired. 580 00:29:51,930 --> 00:29:54,680 So this notation you can convert to this. 581 00:29:54,680 --> 00:29:59,370 So after a while, it's relatively easy to do this, 582 00:29:59,370 --> 00:30:02,960 except when they're super long. 583 00:30:02,960 --> 00:30:05,522 So that's what the left part of that would look like. 584 00:30:05,522 --> 00:30:06,730 So what about the right part? 585 00:30:06,730 --> 00:30:11,710 So the right part, we have something like one, two, three, 586 00:30:11,710 --> 00:30:15,900 four, bunch of dots, and then we have two, and then a dot, 587 00:30:15,900 --> 00:30:17,189 and then two. 588 00:30:17,189 --> 00:30:18,480 What does that thing look like? 589 00:30:18,480 --> 00:30:22,008 So that's going to look like four bases here in a stem. 590 00:30:25,470 --> 00:30:29,230 Big loop, and then there's going to be two bases that 591 00:30:29,230 --> 00:30:31,600 are paired, and then a bulge, and then 592 00:30:31,600 --> 00:30:35,010 two more that are paired. 593 00:30:35,010 --> 00:30:38,290 These things happen in real structures. 594 00:30:38,290 --> 00:30:40,200 OK and then the arced notation is 595 00:30:40,200 --> 00:30:41,500 a little more human-friendly. 596 00:30:41,500 --> 00:30:44,890 It actually draws an arc between each pair 597 00:30:44,890 --> 00:30:47,340 of bases that are hydrogen bonded.
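The conversion from dot-and-parentheses notation to a list of base pairs can be sketched with a stack: each left parenthesis waits on the stack until its matching right parenthesis shows up. This is a minimal illustration; the function name is hypothetical, and the two example strings are assumed reconstructions of the board drawings (the size of the "big loop" is a guess).

```python
def parse_dot_bracket(structure):
    """Return the sorted list of (i, j) base pairs (0-based positions)."""
    pairs = []
    open_positions = []  # left parentheses still waiting for a partner
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            open_positions.append(pos)
        elif symbol == ")":
            if not open_positions:
                raise ValueError(f"unmatched ')' at position {pos}")
            # The most recent unmatched '(' pairs with this ')'.
            pairs.append((open_positions.pop(), pos))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    if open_positions:
        raise ValueError(f"unmatched '(' at position {open_positions[-1]}")
    return sorted(pairs)


# Two unpaired bases, a three-pair stem, and a five-base loop.
left_part = "..(((.....)))"
print(parse_dot_bracket(left_part))

# A four-pair stem closed in two steps with a one-base bulge between them.
right_part = "((((......)).))"
print(parse_dot_bracket(right_part))
```

Each pair returned here is one arc in the arc notation, so the same list can drive either drawing.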
598 00:30:47,340 --> 00:30:50,880 So I'm sure you can imagine what those structures would 599 00:30:50,880 --> 00:30:51,380 look like. 600 00:30:53,900 --> 00:30:56,744 And it turns out that the arcs are very important. 601 00:30:56,744 --> 00:30:58,410 Like whether those arcs cross each other 602 00:30:58,410 --> 00:31:02,050 or not is sort of a fundamental classification of RNA 603 00:31:02,050 --> 00:31:05,480 secondary structures, into the ones that are tractable 604 00:31:05,480 --> 00:31:07,150 and the ones that are really difficult. 605 00:31:09,670 --> 00:31:11,710 So pretty pictures of RNA. 606 00:31:11,710 --> 00:31:15,480 So this is a lower resolution cryo-EM structure 607 00:31:15,480 --> 00:31:17,280 of the bacterial ribosomes. 608 00:31:17,280 --> 00:31:20,380 Remember, ribosomes have two sub-units-- a large sub-unit, 609 00:31:20,380 --> 00:31:23,160 50S, and a small sub-unit, 30S. 610 00:31:23,160 --> 00:31:26,100 And if you crack it open-- OK, so you basically split. 611 00:31:26,100 --> 00:31:30,330 You sort of break the ribosome like that, and you look inside, 612 00:31:30,330 --> 00:31:32,760 they're full of tRNAs. 613 00:31:32,760 --> 00:31:36,430 So there are three pockets that are normally 614 00:31:36,430 --> 00:31:37,900 distinguished within ribosomes. 615 00:31:37,900 --> 00:31:40,020 The A site-- this is the site where 616 00:31:40,020 --> 00:31:43,390 the tRNA enters that's going to add 617 00:31:43,390 --> 00:31:45,710 a new amino acid to the growing peptide chain. 618 00:31:45,710 --> 00:31:49,370 The P site, which is this tRNA will have it 619 00:31:49,370 --> 00:31:52,720 [INAUDIBLE] with the actual growing peptide. 620 00:31:52,720 --> 00:31:56,730 And then the exit tunnel where this tRNA will eventually-- 621 00:31:56,730 --> 00:32:00,910 the exit, the E site, which is the one that 622 00:32:00,910 --> 00:32:02,982 was added a couple of residues ago. 
623 00:32:05,810 --> 00:32:10,139 So people often think of RNA structure 624 00:32:10,139 --> 00:32:11,930 just in terms of these secondary structures 625 00:32:11,930 --> 00:32:16,420 because they're much easier to generate 626 00:32:16,420 --> 00:32:20,910 than tertiary structures, and they give you-- like for tRNA, 627 00:32:20,910 --> 00:32:25,850 it gives you some pretty good information about how it works. 628 00:32:25,850 --> 00:32:31,030 But for a large and complex structure like the ribosome, 629 00:32:31,030 --> 00:32:33,390 it turns out that RNA is actually not 630 00:32:33,390 --> 00:32:36,030 bad at building complex structures. 631 00:32:36,030 --> 00:32:38,230 I would say it's not as good as protein, 632 00:32:38,230 --> 00:32:43,130 but it is capable of constructing something 633 00:32:43,130 --> 00:32:44,356 like a long tube. 634 00:32:44,356 --> 00:32:45,730 And in fact, in the ribosome, you 635 00:32:45,730 --> 00:32:49,580 find such a long tube right here. 636 00:32:49,580 --> 00:32:55,430 That is where the peptide that's been synthesized 637 00:32:55,430 --> 00:32:57,610 exits the ribosome. 638 00:32:57,610 --> 00:33:01,480 And you'll notice it's not a large cavity in which 639 00:33:01,480 --> 00:33:02,920 the protein might start folding. 640 00:33:02,920 --> 00:33:09,210 It's a skinny tube that is thin enough that the polypeptide has 641 00:33:09,210 --> 00:33:13,380 to remain linear, cannot start folding back on itself. 642 00:33:13,380 --> 00:33:16,120 So you sort of extrude the protein 643 00:33:16,120 --> 00:33:18,590 in a linear, unfolded conformation, 644 00:33:18,590 --> 00:33:21,240 and let it fold outside of the ribosome. 645 00:33:21,240 --> 00:33:24,550 If it could fold inside that, that might clog it up. 646 00:33:24,550 --> 00:33:31,440 That's probably one reason why it's not designed that way. 647 00:33:31,440 --> 00:33:34,210 I'm sure that was tried by evolution and rejected.
648 00:33:34,210 --> 00:33:37,020 So if you look at the ribosome-- now remember, 649 00:33:37,020 --> 00:33:40,500 the ribosome is composed of both RNA and protein-- 650 00:33:40,500 --> 00:33:45,560 you'll see that it's much more of one than the other. 651 00:33:45,560 --> 00:33:51,795 And so it's really much more of the fettuccine, which 652 00:33:51,795 --> 00:33:56,120 is the RNA part, than the linguini of the protein. 653 00:33:56,120 --> 00:33:58,440 And if you also look at the distribution 654 00:33:58,440 --> 00:34:00,240 of the proteins on the ribosome, you'll 655 00:34:00,240 --> 00:34:03,070 see that they're not in the core. 656 00:34:03,070 --> 00:34:05,010 They're kind of decorated around the edges. 657 00:34:05,010 --> 00:34:08,040 It really looks like something that was originally made out 658 00:34:08,040 --> 00:34:12,341 of RNA, and then you sort of added proteins as accessories 659 00:34:12,341 --> 00:34:12,840 later. 660 00:34:12,840 --> 00:34:14,256 And that's probably what happened. 661 00:34:17,670 --> 00:34:19,409 This is based on the structures that 662 00:34:19,409 --> 00:34:22,050 were solved a few years ago. 663 00:34:22,050 --> 00:34:27,130 If you then look at where the nearest proteins are 664 00:34:27,130 --> 00:34:29,449 to the active site-- actual catalytic site-- remember, 665 00:34:29,449 --> 00:34:35,050 the ribosome catalyzes the addition of an amino acid 666 00:34:35,050 --> 00:34:38,300 to a growing peptide, so peptide bond formation-- 667 00:34:38,300 --> 00:34:42,250 you'll find that the nearest proteins are around 668 00:34:42,250 --> 00:34:46,290 18 to 20 angstroms away. 669 00:34:46,290 --> 00:34:48,489 And this is too far to do any chemistry, 670 00:34:48,489 --> 00:34:54,370 so the active site residues or molecules 671 00:34:54,370 --> 00:34:56,330 need to be within a few angstroms 672 00:34:56,330 --> 00:34:58,060 to do any useful chemistry.
673 00:34:58,060 --> 00:35:02,430 And so this basically proves that the ribosome 674 00:35:02,430 --> 00:35:03,160 is a ribozyme. 675 00:35:03,160 --> 00:35:05,100 That is, it's an RNA enzyme. 676 00:35:05,100 --> 00:35:06,030 RNAs is [INAUDIBLE]. 677 00:35:11,540 --> 00:35:17,620 So here is the structure of a ribosome. 678 00:35:17,620 --> 00:35:20,160 It's very kind of beautiful, and it's impressive 679 00:35:20,160 --> 00:35:23,500 that somebody can actually solve the structure of something 680 00:35:23,500 --> 00:35:24,450 this big. 681 00:35:24,450 --> 00:35:27,450 But what is actually the practical use 682 00:35:27,450 --> 00:35:28,870 of this structure? 683 00:35:28,870 --> 00:35:33,170 Turns out there's quite an important practical application 684 00:35:33,170 --> 00:35:35,130 of knowing the structure. 685 00:35:35,130 --> 00:35:38,330 Any ideas? 686 00:35:38,330 --> 00:35:39,332 AUDIENCE: Antibiotics. 687 00:35:39,332 --> 00:35:40,290 PROFESSOR: Antibiotics. 688 00:35:40,290 --> 00:35:40,790 Exactly. 689 00:35:40,790 --> 00:35:47,640 So many antibiotics work by taking advantage of differences 690 00:35:47,640 --> 00:35:49,980 between the prokaryotic ribosome structure 691 00:35:49,980 --> 00:35:51,610 and eukaryotic ribosome structure. 692 00:35:51,610 --> 00:35:55,330 So if you can make a small molecule-- 693 00:35:55,330 --> 00:35:58,980 these are some examples-- that will inhibit 694 00:35:58,980 --> 00:36:01,625 prokaryotic ribosomes but hopefully not 695 00:36:01,625 --> 00:36:03,000 inhibit eukaryotic ribosomes, then 696 00:36:03,000 --> 00:36:06,710 you can kill bacteria that might be infecting you. 697 00:36:11,920 --> 00:36:14,550 So non-coding RNA. 698 00:36:14,550 --> 00:36:17,260 So there's many different families of non-coding RNAs, 699 00:36:17,260 --> 00:36:19,202 and I'm going to list some in a moment.
700 00:36:19,202 --> 00:36:20,660 And I'm going to actually challenge 701 00:36:20,660 --> 00:36:22,404 you, see if you can come up with any more 702 00:36:22,404 --> 00:36:23,570 families of non-coding RNAs. 703 00:36:23,570 --> 00:36:27,550 But they're receiving increasing interest, 704 00:36:27,550 --> 00:36:32,350 I would say, ever since microRNAs were discovered. 705 00:36:32,350 --> 00:36:34,940 Sort of a boom in looking at different types 706 00:36:34,940 --> 00:36:36,240 of non-coding RNAs. 707 00:36:36,240 --> 00:36:40,690 lincRNA is also important and interesting, as well as many 708 00:36:40,690 --> 00:36:47,240 of the classical RNAs like tRNAs and rRNAs and snoRNAs. 709 00:36:47,240 --> 00:36:50,489 There may be new aspects of their regulation and function 710 00:36:50,489 --> 00:36:51,530 that will be interesting. 711 00:36:51,530 --> 00:36:55,230 And so when you're studying an ncRNA, 712 00:36:55,230 --> 00:36:58,910 it's very, very helpful to know its structure. 713 00:36:58,910 --> 00:37:02,600 If it's going to base pair in trans with some other RNA-- 714 00:37:02,600 --> 00:37:07,870 as tRNAs do, as microRNAs do, for example, or snRNAs 715 00:37:07,870 --> 00:37:10,240 and snoRNAs-- then you want to know 716 00:37:10,240 --> 00:37:12,110 which parts of the molecule are free 717 00:37:12,110 --> 00:37:15,340 and which are internally base paired. 718 00:37:15,340 --> 00:37:20,960 And if you want to predict ncRNA genes in a genome, 719 00:37:20,960 --> 00:37:23,500 you may want to look for regions that 720 00:37:23,500 --> 00:37:28,120 are under selection for conservation of RNA structure, 721 00:37:28,120 --> 00:37:30,750 for conservation of the potential 722 00:37:30,750 --> 00:37:32,960 to base pair at some distance.
723 00:37:32,960 --> 00:37:34,780 If you see that, it's much more likely 724 00:37:34,780 --> 00:37:38,430 that that region of the genome encodes a non-coding RNA 725 00:37:38,430 --> 00:37:43,379 than it codes, for example-- there's a coding exon 726 00:37:43,379 --> 00:37:45,170 or that it's a transcription factor binding 727 00:37:45,170 --> 00:37:48,050 site or something like that that functions at the DNA level. 728 00:37:48,050 --> 00:37:54,030 So having this notion of structure-- 729 00:37:54,030 --> 00:37:59,610 even just secondary structure-- is helpful for that application 730 00:37:59,610 --> 00:38:01,530 as well, and predicting functions as well, 731 00:38:01,530 --> 00:38:02,740 as I mentioned. 732 00:38:02,740 --> 00:38:05,760 So co-variation. 733 00:38:05,760 --> 00:38:08,830 So let's take a look at these sequences. 734 00:38:08,830 --> 00:38:15,110 So imagine you've discovered a new class of mini-microRNAs. 735 00:38:15,110 --> 00:38:19,640 They're only eight bases long, and you've sequenced five 736 00:38:19,640 --> 00:38:24,870 homologues from your five favorite mammals. 737 00:38:24,870 --> 00:38:28,430 And these are the sequences that you get. 738 00:38:28,430 --> 00:38:30,560 And you know that they're homologous 739 00:38:30,560 --> 00:38:32,560 by synteny, they're in the same place 740 00:38:32,560 --> 00:38:35,630 in the genome, and they seem to have the same function. 741 00:38:35,630 --> 00:38:39,040 What could you say about their secondary structure 742 00:38:39,040 --> 00:38:42,400 based on this multiple alignment? 743 00:38:42,400 --> 00:38:45,644 You have to stare at it a little bit to see the pattern. 744 00:38:45,644 --> 00:38:46,636 There's a pattern here. 745 00:38:50,108 --> 00:38:51,596 Any ideas? 746 00:38:51,596 --> 00:38:55,564 Anyone have a guess about what the structure is? 747 00:39:01,020 --> 00:39:01,710 Yeah, go ahead.
748 00:39:01,710 --> 00:39:04,520 AUDIENCE: There's a two base pair stem, and then 749 00:39:04,520 --> 00:39:08,060 a four base loop. 750 00:39:08,060 --> 00:39:10,020 PROFESSOR: Two base pair stem, four base loop, 751 00:39:10,020 --> 00:39:11,472 and the other half of the stem. 752 00:39:11,472 --> 00:39:13,410 So how do you know that? 753 00:39:13,410 --> 00:39:17,275 AUDIENCE: So if you look at the first two 754 00:39:17,275 --> 00:39:22,400 and last two bases of each sequence, 755 00:39:22,400 --> 00:39:24,790 the first and the eighth nucleotide 756 00:39:24,790 --> 00:39:28,812 can pair with each other, and so can the second and the seventh. 757 00:39:28,812 --> 00:39:29,966 PROFESSOR: Yeah. 758 00:39:29,966 --> 00:39:30,716 Everyone see that? 759 00:39:30,716 --> 00:39:34,700 So in the first column you have AUACG, 760 00:39:34,700 --> 00:39:36,155 and that's complementary to UAUGC. 761 00:39:38,966 --> 00:39:40,090 Each base is complementary. 762 00:39:40,090 --> 00:39:44,580 And the second position is CAGGU complementary to GUCUA. 763 00:39:48,200 --> 00:39:50,196 There's one slight exception there. 764 00:39:50,196 --> 00:39:51,070 AUDIENCE: [INAUDIBLE] 765 00:39:51,070 --> 00:39:52,050 PROFESSOR: Yeah. 766 00:39:52,050 --> 00:39:56,445 Well, it turns out that in RNA-- although the Watson-Crick 767 00:39:56,445 --> 00:39:59,580 pairs GC and AU are the most stable-- GU pairs 768 00:39:59,580 --> 00:40:02,850 are only a little bit less stable than AU pairs, 769 00:40:02,850 --> 00:40:07,050 and they occur in natural RNA molecules. 770 00:40:07,050 --> 00:40:09,920 So GU is allowed in RNA even though you would never 771 00:40:09,920 --> 00:40:11,695 see that in DNA. 772 00:40:11,695 --> 00:40:13,740 OK, so everyone see that? 773 00:40:13,740 --> 00:40:18,200 So the structure is-- I think I have it here. 774 00:40:23,770 --> 00:40:28,030 This would be co-variation. You're changing the bases, 775 00:40:28,030 --> 00:40:29,850 but preserving the ability to pair.
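The check done by eye on this alignment can be sketched in a few lines. The column strings below are the ones read out in the lecture (columns 1, 2, 7 and 8 of the five sequences); the function names and the choice to treat G-U as an allowed pair are illustrative assumptions that follow the wobble-pair remark above.

```python
# Pairs allowed in RNA: Watson-Crick pairs plus the G-U wobble pair.
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def columns_can_pair(col_i, col_j, allow_wobble=True):
    """True if every sequence has pairable bases at the two columns."""
    allowed = WATSON_CRICK | WOBBLE if allow_wobble else WATSON_CRICK
    return all((x, y) in allowed for x, y in zip(col_i, col_j))


# Columns as read out in the lecture, one letter per homologous sequence.
col1, col8 = "AUACG", "UAUGC"
col2, col7 = "CAGGU", "GUCUA"

print(columns_can_pair(col1, col8))                      # positions 1 and 8
print(columns_can_pair(col2, col7))                      # positions 2 and 7
print(columns_can_pair(col2, col7, allow_wobble=False))  # strict Watson-Crick
print(columns_can_pair(col1, col2))                      # adjacent columns
```

With wobble pairs allowed, both stem column pairs pass in all five sequences; without them, the 2-7 pair fails on the single G-U, which is the slight exception noted in the lecture.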
776 00:40:29,850 --> 00:40:32,570 So when one base changes-- when the first base changes from A 777 00:40:32,570 --> 00:40:35,091 to U, the last base changes from U to A 778 00:40:35,091 --> 00:40:36,532 in order to preserve that pairing. 779 00:40:36,532 --> 00:40:38,740 You wouldn't know that if you just had two sequences, 780 00:40:38,740 --> 00:40:41,080 but once you get several sequences, 781 00:40:41,080 --> 00:40:43,990 it can be pretty compelling and allow 782 00:40:43,990 --> 00:40:47,050 you to make a pretty strong inference that that 783 00:40:47,050 --> 00:40:50,509 is the structure of that molecule. 784 00:40:50,509 --> 00:40:51,550 So how would you do this? 785 00:40:51,550 --> 00:40:53,900 So imagine you had a more realistic example where 786 00:40:53,900 --> 00:40:57,010 you've got a non-coding RNA that's 100 or a few hundred 787 00:40:57,010 --> 00:41:01,520 bases long, and you might have a multiple alignment of 50 788 00:41:01,520 --> 00:41:03,710 homologous sequences. 789 00:41:03,710 --> 00:41:05,720 You want something, you're not going 790 00:41:05,720 --> 00:41:07,810 to be able to see it by eye. 791 00:41:07,810 --> 00:41:13,130 You need sort of a more objective criterion. 792 00:41:13,130 --> 00:41:15,740 So one method that's commonly used 793 00:41:15,740 --> 00:41:21,020 is this statistic called mutual information. 794 00:41:21,020 --> 00:41:26,190 So if you look in your multiple alignment-- 795 00:41:26,190 --> 00:41:27,816 I'll just draw this here. 796 00:41:33,655 --> 00:41:34,655 You have many sequences. 797 00:41:37,760 --> 00:41:41,370 You consider every pair of columns-- 798 00:41:41,370 --> 00:41:44,760 this is a multiple alignment, so this column and this column-- 799 00:41:44,760 --> 00:41:46,980 and you calculate what we're going 800 00:41:46,980 --> 00:41:49,796 to call-- what are we going to call it? f ix. 801 00:41:52,770 --> 00:41:54,860 That would be the frequency of a nucleotide x.
802 00:41:54,860 --> 00:41:57,610 You're in column i, so you just count how many A's, C's, G's, 803 00:41:57,610 --> 00:41:58,610 and T's there are. 804 00:41:58,610 --> 00:42:04,875 And similarly, f jy for all the possible values 805 00:42:04,875 --> 00:42:06,849 of x and all the possible values of y. 806 00:42:06,849 --> 00:42:08,890 So these are the base frequencies in each column. 807 00:42:08,890 --> 00:42:14,400 And then you calculate the dinucleotide frequencies xy 808 00:42:14,400 --> 00:42:17,490 at each pair of columns. 809 00:42:17,490 --> 00:42:22,470 So in this column, you say if there's an A here and a C here, 810 00:42:22,470 --> 00:42:24,460 and then there's another AC down here, 811 00:42:24,460 --> 00:42:27,702 and there's a total of one, two, three, four, five, six, 812 00:42:27,702 --> 00:42:37,470 seven sequences, then f AC ij is 2/7. 813 00:42:37,470 --> 00:42:40,620 So you just calculate the frequency of each dinucleotide. 814 00:42:40,620 --> 00:42:43,670 These are no longer consecutive dinucleotides in a sequence 815 00:42:43,670 --> 00:42:44,700 necessarily. 816 00:42:44,700 --> 00:42:47,770 They can be at arbitrary spacing. 817 00:42:47,770 --> 00:42:49,640 OK, so you calculate those and then 818 00:42:49,640 --> 00:42:54,552 you throw them into this formula, 819 00:42:54,552 --> 00:42:55,510 and out comes a number. 820 00:42:55,510 --> 00:42:58,660 So what does this formula remind you of? 821 00:42:58,660 --> 00:43:01,396 Have you seen a similar formula before? 822 00:43:05,380 --> 00:43:06,376 AUDIENCE: [INAUDIBLE] 823 00:43:06,376 --> 00:43:09,902 PROFESSOR: Someone said [INAUDIBLE] Yeah, go ahead. 824 00:43:09,902 --> 00:43:12,360 AUDIENCE: It reminds me of the Shannon entropy [INAUDIBLE]. 825 00:43:12,360 --> 00:43:14,834 PROFESSOR: Yeah, it looks like Shannon entropy, 826 00:43:14,834 --> 00:43:17,120 but there's a log of a ratio in there, 827 00:43:17,120 --> 00:43:19,010 so it's not exactly Shannon entropy.
828 00:43:19,010 --> 00:43:23,590 So what other formula has a log of a ratio in it? 829 00:43:23,590 --> 00:43:24,586 AUDIENCE: [INAUDIBLE] 830 00:43:24,586 --> 00:43:25,419 PROFESSOR: Relative. 831 00:43:25,419 --> 00:43:28,570 So it actually looks like relative entropy. 832 00:43:28,570 --> 00:43:31,200 So relative entropy of what versus what? 833 00:43:39,270 --> 00:43:43,900 Who can sort of say more precisely if it's-- we'll say 834 00:43:43,900 --> 00:43:47,140 it's relative entropy of something versus a p versus q. 835 00:43:47,140 --> 00:43:49,650 And what is p and what is q? 836 00:43:49,650 --> 00:43:50,950 Yeah, in the back. 837 00:43:50,950 --> 00:43:54,819 AUDIENCE: Is it relative entropy of co-occurrence 838 00:43:54,819 --> 00:43:56,220 versus independent occurrence? 839 00:43:56,220 --> 00:43:57,630 PROFESSOR: Good. 840 00:43:57,630 --> 00:43:59,910 Yeah. co-occurence-- everyone get that? 841 00:43:59,910 --> 00:44:05,520 Co-occurrence of a pair of nucleotide xy at positions ij. 842 00:44:05,520 --> 00:44:08,680 Versus q is an independent occurrence. 843 00:44:08,680 --> 00:44:12,130 So if x and y occurred independently, 844 00:44:12,130 --> 00:44:17,270 they would have this frequency. 845 00:44:17,270 --> 00:44:20,030 So if you think about it, you calculate the frequency 846 00:44:20,030 --> 00:44:23,420 of each base at each column in the multiple alignment. 847 00:44:23,420 --> 00:44:25,900 And this is like your null hypothesis. 848 00:44:25,900 --> 00:44:28,810 You're going to assume, what if they're evolving independently? 849 00:44:28,810 --> 00:44:35,060 So if it's not a folded RNA-- or if it's a folded RNA 850 00:44:35,060 --> 00:44:37,380 but those two columns don't happen to interact-- 851 00:44:37,380 --> 00:44:40,470 there's no reason to suspect that those bases would 852 00:44:40,470 --> 00:44:42,420 have any relationship to each other. 
853 00:44:42,420 --> 00:44:45,540 So this is like your expected value 854 00:44:45,540 --> 00:44:50,490 of the frequency of xy in position ij. 855 00:44:50,490 --> 00:44:53,060 And then this p is your observed value. 856 00:44:53,060 --> 00:44:56,040 So you're taking relative entropy of basically observed 857 00:44:56,040 --> 00:44:58,040 over expected. 858 00:44:58,040 --> 00:45:04,170 And so relative entropy has-- I haven't proved this, 859 00:45:04,170 --> 00:45:06,910 but it's non-negative. 860 00:45:06,910 --> 00:45:10,310 It can be 0, and then it goes up to some maximum, 861 00:45:10,310 --> 00:45:12,580 a positive value, but it's never negative. 862 00:45:12,580 --> 00:45:20,900 And what would it be if, in fact, p were equal to q? 863 00:45:20,900 --> 00:45:22,804 What would this formula give? 864 00:45:26,980 --> 00:45:29,900 This is where we're saying suppose. 865 00:45:29,900 --> 00:45:30,490 Suppose this. 866 00:45:30,490 --> 00:45:33,160 In general, this won't be true, but suppose 867 00:45:33,160 --> 00:45:35,920 it was equal to that. 868 00:45:35,920 --> 00:45:40,726 We've got MI ij equals summation of what? 869 00:45:48,170 --> 00:45:52,120 That log of this, which is equal to this, 870 00:45:52,120 --> 00:46:05,196 so it's fx i fy j over the same thing-- 871 00:46:05,196 --> 00:46:12,280 hope you can see that-- log of-- log of 1 is 0, right? 872 00:46:12,280 --> 00:46:14,360 So it's just 0. 873 00:46:14,360 --> 00:46:19,580 So if the nucleotides of the two columns 874 00:46:19,580 --> 00:46:24,160 occur completely independently, mutual information is 0. 875 00:46:24,160 --> 00:46:27,810 And that's one reason it's called mutual information. 876 00:46:27,810 --> 00:46:29,040 There's no information. 877 00:46:29,040 --> 00:46:30,640 Knowing what's in column i gives you 878 00:46:30,640 --> 00:46:33,620 no information about column j.
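The formula on the slide is not in the transcript, so the following is a reconstruction from the spoken description (observed pair frequency against the product of the single-column frequencies, log base 2), with superscripts for the columns as an assumed notation:

```latex
\[
  MI_{ij} \;=\; \sum_{x,y} f^{(i,j)}_{xy}\,
      \log_2 \frac{f^{(i,j)}_{xy}}{f^{(i)}_{x}\, f^{(j)}_{y}}
\]
% Independent columns: f^{(i,j)}_{xy} = f^{(i)}_{x} f^{(j)}_{y} for every x, y,
% so each ratio is 1, each log is 0, and the whole sum vanishes:
\[
  MI_{ij} \;=\; \sum_{x,y} f^{(i)}_{x} f^{(j)}_{y}\,
      \log_2 \frac{f^{(i)}_{x} f^{(j)}_{y}}{f^{(i)}_{x} f^{(j)}_{y}}
  \;=\; \sum_{x,y} f^{(i)}_{x} f^{(j)}_{y} \cdot \log_2 1 \;=\; 0
\]
```

Written this way it is the relative entropy of the observed joint distribution p against the independence model q, which is why it can never be negative.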
879 00:46:33,620 --> 00:46:36,590 So remember, relative entropies are measures of information, 880 00:46:36,590 --> 00:46:39,870 not entropy. 881 00:46:39,870 --> 00:46:45,400 And what is the maximum value that the mutual information 882 00:46:45,400 --> 00:46:47,080 could have? 883 00:46:47,080 --> 00:46:47,860 Any ideas on that? 884 00:46:53,810 --> 00:46:54,360 Any guesses? 885 00:47:03,613 --> 00:47:04,587 Joe, yeah. 886 00:47:04,587 --> 00:47:08,970 AUDIENCE: You could have log base 2 log over f sub x, 887 00:47:08,970 --> 00:47:09,750 f sub y. 888 00:47:13,674 --> 00:47:14,340 PROFESSOR: Of 1? 889 00:47:14,340 --> 00:47:17,060 OK, so you're saying if one of the particular dinucleotides 890 00:47:17,060 --> 00:47:18,930 had a frequency of 1? 891 00:47:18,930 --> 00:47:19,555 AUDIENCE: Yeah. 892 00:47:19,555 --> 00:47:23,000 So if they're always the same whenever there's-- like an A, 893 00:47:23,000 --> 00:47:24,750 there's always going to be a T. 894 00:47:24,750 --> 00:47:25,458 PROFESSOR: Right. 895 00:47:25,458 --> 00:47:31,370 So whenever there's an A, there's always a G or a T. 896 00:47:31,370 --> 00:47:34,010 AUDIENCE: So then you'd get a 1 in the numerator, 897 00:47:34,010 --> 00:47:40,573 and their relative probabilities in the bottom, which 898 00:47:40,573 --> 00:47:44,240 would be maximized if they were all even. 899 00:47:44,240 --> 00:47:45,697 PROFESSOR: If they were all? 900 00:47:45,697 --> 00:47:46,530 [INTERPOSING VOICES] 901 00:47:46,530 --> 00:47:47,480 PROFESSOR: If they were uniform. 902 00:47:47,480 --> 00:47:47,980 Yeah. 903 00:47:47,980 --> 00:47:49,110 So did everyone get that? 904 00:47:49,110 --> 00:47:59,560 So the maximum occurs if fx i and j-- they're both uniform, 905 00:47:59,560 --> 00:48:03,790 so they're a quarter for every base at both positions. 906 00:48:03,790 --> 00:48:08,770 That's the maximum entropy in the background distribution.
907 00:48:08,770 --> 00:48:26,720 But then if fx y ij equals 1/4, for example, x equals y-- 908 00:48:26,720 --> 00:48:28,870 or in our case, we're not interested in that. 909 00:48:28,870 --> 00:48:34,530 We're interested in x equals complement of y. 910 00:48:34,530 --> 00:48:36,784 C of y is going to be the complement of y. 911 00:48:41,890 --> 00:48:50,000 And 0 otherwise for x not equal complement of y. 912 00:48:50,000 --> 00:48:58,714 OK, so for example, if we have only the dinucleotides AT, 913 00:48:58,714 --> 00:49:04,400 CG, GC, and TA occur, and each of them 914 00:49:04,400 --> 00:49:10,470 occurs with a frequency of 1/4, then 915 00:49:10,470 --> 00:49:13,670 you'll have four terms in the sum because, remember, 916 00:49:13,670 --> 00:49:15,500 the 0 log 0 is 0. 917 00:49:15,500 --> 00:49:18,940 So you'll have four terms in the sum, and each of them 918 00:49:18,940 --> 00:49:29,580 will look like 1/4 log 1/4 over a 1/4 times 1/4. 919 00:49:29,580 --> 00:49:33,510 And so this will be 4, so log 2 of 4 is 2. 920 00:49:33,510 --> 00:49:39,000 And so you have four terms that are each 1/4 times 2. 921 00:49:39,000 --> 00:49:41,822 And so you'll get 2. 922 00:49:44,690 --> 00:49:46,719 Well, this is not a sum. 923 00:49:46,719 --> 00:49:47,760 These are the four terms. 924 00:49:47,760 --> 00:49:53,140 These are the individual nonzero terms in that sum. 925 00:49:53,140 --> 00:49:54,030 Does that make sense? 926 00:49:54,030 --> 00:49:55,340 Everyone get this? 927 00:49:57,880 --> 00:50:02,640 So that's why this is a useful measure of co-variation. 928 00:50:05,950 --> 00:50:08,850 If what's in one column really strongly 929 00:50:08,850 --> 00:50:11,710 influences what's in the other column, 930 00:50:11,710 --> 00:50:14,110 and there's a lot of variation in the two columns, 931 00:50:14,110 --> 00:50:17,420 and so you can really see that co-variation well, 932 00:50:17,420 --> 00:50:19,975 then mutual information is maximized.
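[Editor's note: the board work described above can be written out as follows. This is a reconstruction from the spoken description, not a copy of the slide; the symbol M_{ij} and the superscript notation for column indices are assumptions made here for readability.]

```latex
% Mutual information between alignment columns i and j (in bits):
M_{ij} \;=\; \sum_{x,y} f^{(i,j)}_{xy}\,
        \log_2 \frac{f^{(i,j)}_{xy}}{f^{(i)}_{x}\, f^{(j)}_{y}}

% Independent columns: f^{(i,j)}_{xy} = f^{(i)}_{x} f^{(j)}_{y}, so every term is
% f^{(i,j)}_{xy} \log_2 1 = 0 and therefore M_{ij} = 0.

% Perfect complementarity with uniform columns: only AT, CG, GC, TA occur,
% each with frequency 1/4, and every single-base frequency is 1/4:
M_{ij} \;=\; 4 \times \frac{1}{4}\,
        \log_2 \frac{1/4}{(1/4)(1/4)}
      \;=\; 4 \times \frac{1}{4} \times \log_2 4
      \;=\; 4 \times \frac{1}{4} \times 2 \;=\; 2 \text{ bits}
```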
933 00:50:22,760 --> 00:50:24,390 And that's basically what we just said, 934 00:50:24,390 --> 00:50:28,390 is written down here. 935 00:50:28,390 --> 00:50:31,400 So it's maximal. 936 00:50:31,400 --> 00:50:32,900 They don't have to be complementary. 937 00:50:32,900 --> 00:50:36,750 It would achieve this maximum of 2 if they are complementary, 938 00:50:36,750 --> 00:50:39,950 but it would be also if they had some other very specific 939 00:50:39,950 --> 00:50:42,690 relationship between the nucleotides. 940 00:50:42,690 --> 00:50:45,960 So if you're going to use this, the way you would use it 941 00:50:45,960 --> 00:50:48,200 is take your multiple alignment, calculate 942 00:50:48,200 --> 00:50:50,800 the mutual information of each pair of columns-- 943 00:50:50,800 --> 00:50:53,320 so you actually have to make a table, i versus j, 944 00:50:53,320 --> 00:50:55,130 all possible pairs of columns-- and then 945 00:50:55,130 --> 00:50:57,310 you're going to look for the really high values. 946 00:50:57,310 --> 00:51:01,970 And then when you find those high values, when 947 00:51:01,970 --> 00:51:04,570 you look at what actual bases are tending to occur together, 948 00:51:04,570 --> 00:51:07,070 you'll want to see that they're bases 949 00:51:07,070 --> 00:51:09,540 that are complementary to one another. 950 00:51:09,540 --> 00:51:11,400 And another thing that you'd want to see 951 00:51:11,400 --> 00:51:15,120 is you'd want to see that consecutive positions in one 952 00:51:15,120 --> 00:51:17,770 part of the alignment are co-varying 953 00:51:17,770 --> 00:51:21,770 with consecutive positions in another part of the alignment 954 00:51:21,770 --> 00:51:24,990 in the right way, in this sort of inverse complementary way 955 00:51:24,990 --> 00:51:27,295 that RNA likes to pair. 956 00:51:27,295 --> 00:51:28,170 Does that make sense? 
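[Editor's note: the procedure just described (take the alignment, compute mutual information for every pair of columns, then look for the high values) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the lecturer's code: the names `mutual_information`, `mi_table`, and `toy_alignment` are hypothetical, the alignment is invented for the example, and gap characters are treated like any other symbol.]

```python
from collections import Counter
from itertools import combinations
from math import log2


def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_i)
    f_i = Counter(col_i)                 # single-base counts in column i
    f_j = Counter(col_j)                 # single-base counts in column j
    f_ij = Counter(zip(col_i, col_j))    # observed pair counts
    mi = 0.0
    # Pairs that never occur contribute 0 (the "0 log 0 is 0" convention),
    # so summing over observed pairs only is sufficient.
    for (x, y), count in f_ij.items():
        p_xy = count / n
        mi += p_xy * log2(p_xy / ((f_i[x] / n) * (f_j[y] / n)))
    return mi


def mi_table(alignment):
    """Upper-triangle table {(i, j): MI} over all pairs of columns, i < j."""
    columns = list(zip(*alignment))
    return {
        (i, j): mutual_information(columns[i], columns[j])
        for i, j in combinations(range(len(columns)), 2)
    }


# Hypothetical toy alignment: the first and last columns are always
# complementary, the three middle columns never vary.
toy_alignment = ["GAAAC", "CAAAG", "AAAAU", "UAAAA"]
table = mi_table(toy_alignment)
best_pair = max(table, key=table.get)
```

In this toy case the first and last columns reach the 2-bit maximum discussed above, while any pair involving an invariant column scores 0, which is the "too close" situation that comes up later in the lecture. With a real alignment you would then check that the high-scoring pairs are actually complementary bases and that they line up in consecutive, inverse-complementary blocks.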
957 00:51:28,170 --> 00:51:35,890 So in a sort of nested way in your multiple alignment, 958 00:51:35,890 --> 00:51:38,450 if you saw that this one co-varied with that, 959 00:51:38,450 --> 00:51:41,860 and then you also saw that the next base co-varied 960 00:51:41,860 --> 00:51:44,340 with the base right before this one, 961 00:51:44,340 --> 00:51:46,960 and this one co-varies with that one, 962 00:51:46,960 --> 00:51:48,620 that starts to look like a stem. 963 00:51:48,620 --> 00:51:51,320 It's much more likely that you have a three-base stem 964 00:51:51,320 --> 00:51:55,184 than that you just have some isolated base 965 00:51:55,184 --> 00:51:56,600 pair out in the middle of nowhere. 966 00:51:56,600 --> 00:51:59,020 It turns out it takes a few bases to make 967 00:51:59,020 --> 00:52:01,920 a good thermodynamically stable stem, 968 00:52:01,920 --> 00:52:04,286 and so you want to look for blocks of these things. 969 00:52:04,286 --> 00:52:08,194 And so this works pretty well. 970 00:52:08,194 --> 00:52:10,110 Yeah, actually, one point I want to make first 971 00:52:10,110 --> 00:52:13,300 is that mutual information is nice 972 00:52:13,300 --> 00:52:16,544 because it's kind of a useful concept 973 00:52:16,544 --> 00:52:18,835 and it also relates to some of the entropy and relative 974 00:52:18,835 --> 00:52:21,293 entropy that we've been talking about in the course before. 975 00:52:21,293 --> 00:52:24,190 But it's not the only statistic that would work in practice. 976 00:52:24,190 --> 00:52:27,720 You can use any measure of basically non-independence 977 00:52:27,720 --> 00:52:29,500 between distributions. 978 00:52:29,500 --> 00:52:31,240 A chi square statistic would probably 979 00:52:31,240 --> 00:52:34,730 work equally well in practice. 980 00:52:34,730 --> 00:52:37,770 And so here is a multiple alignment 981 00:52:37,770 --> 00:52:39,410 of a bunch of sequences. 
982 00:52:39,410 --> 00:52:45,410 And what I've done is put boxes around columns 983 00:52:45,410 --> 00:52:48,510 that have significant 984 00:52:48,510 --> 00:52:52,230 mutual information with other sets of columns. 985 00:52:52,230 --> 00:52:57,470 So for example, this set of columns here at the left-- the 986 00:52:57,470 --> 00:53:01,660 far left-- has significant mutual information 987 00:53:01,660 --> 00:53:03,660 with the ones at the far right. 988 00:53:03,660 --> 00:53:06,510 And these ones, these four positions 989 00:53:06,510 --> 00:53:08,850 co-vary with these four, and so forth. 990 00:53:08,850 --> 00:53:11,450 So can you tell, based on looking 991 00:53:11,450 --> 00:53:13,240 at this pattern of co-variation, what 992 00:53:13,240 --> 00:53:14,860 the structure is going to be? 993 00:53:22,440 --> 00:53:25,400 OK, let's say we start up here. 994 00:53:25,400 --> 00:53:29,200 The first is going to pair with the last, 995 00:53:29,200 --> 00:53:30,404 with something at the end. 996 00:53:30,404 --> 00:53:31,820 Then we're going to have something 997 00:53:31,820 --> 00:53:36,150 here in the middle that pairs with something else nearby. 998 00:53:36,150 --> 00:53:38,690 Then we have something here that pairs 999 00:53:38,690 --> 00:53:42,060 with something else nearby, then we have another like that. 1000 00:53:44,475 --> 00:53:45,350 Does that make sense? 1001 00:53:45,350 --> 00:53:49,190 So that there's these three pairs of columns 1002 00:53:49,190 --> 00:53:52,490 in the middle-- these two, these two, and these two-- and then 1003 00:53:52,490 --> 00:53:55,220 they're surrounded by this thing, 1004 00:53:55,220 --> 00:53:57,160 the first pairing with the last. 1005 00:53:57,160 --> 00:53:59,160 And so it's a clover leaf, so that's tRNA. 1006 00:54:05,056 --> 00:54:05,556 Yeah? 1007 00:54:08,250 --> 00:54:14,381 AUDIENCE: So with that previous slide, this table here, 1008 00:54:14,381 --> 00:54:17,470 you could create a co-variation matrix. 
1009 00:54:17,470 --> 00:54:19,155 How would that-- or, and it could be-- 1010 00:54:19,155 --> 00:54:21,113 PROFESSOR: How does that co-variation matrix-- 1011 00:54:21,113 --> 00:54:23,580 how do you convert it to this representation? 1012 00:54:23,580 --> 00:54:27,100 AUDIENCE: I'm just wondering how this would go up. 1013 00:54:27,100 --> 00:54:29,480 Like let's say you took the co-variation matrix-- 1014 00:54:29,480 --> 00:54:30,271 PROFESSOR: Oh, what would it look like? 1015 00:54:30,271 --> 00:54:31,173 AUDIENCE: --and visualized it as a heat map-- 1016 00:54:31,173 --> 00:54:32,530 PROFESSOR: In the co-variation matrix. 1017 00:54:32,530 --> 00:54:33,155 AUDIENCE: Yeah. 1018 00:54:33,155 --> 00:54:37,554 What would it look like in this particular example? 1019 00:54:37,554 --> 00:54:39,220 PROFESSOR: Yeah, that's a good question. 1020 00:54:39,220 --> 00:54:40,200 OK, let's do that. 1021 00:54:42,690 --> 00:54:44,190 I haven't thought about that before, 1022 00:54:44,190 --> 00:54:47,560 so you'll have to help me on this. 1023 00:54:47,560 --> 00:54:52,224 So here's the beginning. 1024 00:54:52,224 --> 00:54:53,890 We're going to write the sequence from 1 1025 00:54:53,890 --> 00:54:57,790 to n in both dimensions. 1026 00:54:57,790 --> 00:55:02,330 And so here's the beginning, and it co-varies with the end. 1027 00:55:02,330 --> 00:55:06,670 So this first would have a co-variation with the last, 1028 00:55:06,670 --> 00:55:08,760 and then the second would co-vary with the second 1029 00:55:08,760 --> 00:55:10,150 to last, and so forth. 1030 00:55:10,150 --> 00:55:13,730 So you get a little diagonal down here. 1031 00:55:13,730 --> 00:55:17,210 That's this top stem here. 1032 00:55:17,210 --> 00:55:18,980 And then what about the second stem? 1033 00:55:18,980 --> 00:55:21,894 So then you have something down here 1034 00:55:21,894 --> 00:55:24,310 that's going to co-vary with something kind of nearby it.
1035 00:55:29,720 --> 00:55:32,300 So block two is going to co-vary with block three. 1036 00:55:32,300 --> 00:55:35,283 And again, it's going to be this inverse complementary kind 1037 00:55:35,283 --> 00:55:38,230 of thing like that. 1038 00:55:38,230 --> 00:55:43,910 It's symmetrical, so you get this with that. 1039 00:55:43,910 --> 00:55:47,100 But you only have to do one half, 1040 00:55:47,100 --> 00:55:49,770 so you can just do this upper half here. 1041 00:55:49,770 --> 00:55:50,660 So you get that. 1042 00:55:50,660 --> 00:55:55,046 So it would look something like that. 1043 00:55:55,046 --> 00:55:57,426 AUDIENCE: So with the diagonal line orthogonal 1044 00:55:57,426 --> 00:56:01,890 to the diagonal of the matrix-- 1045 00:56:01,890 --> 00:56:05,730 PROFESSOR: Yeah, that's because they're inverse complementary. 1046 00:56:05,730 --> 00:56:08,130 AUDIENCE: OK. 1047 00:56:08,130 --> 00:56:10,050 PROFESSOR: That make sense? 1048 00:56:10,050 --> 00:56:12,450 Good question. 1049 00:56:12,450 --> 00:56:14,187 But we'll see an example like that later 1050 00:56:14,187 --> 00:56:15,270 actually, as it turns out. 1051 00:56:17,910 --> 00:56:22,180 All right, so here's my question for you. 1052 00:56:22,180 --> 00:56:25,390 You're studying this non-coding RNA. 1053 00:56:25,390 --> 00:56:26,810 It has some length. 1054 00:56:26,810 --> 00:56:29,190 You have some number of sequences. 1055 00:56:29,190 --> 00:56:32,910 They might have some structure. 1056 00:56:32,910 --> 00:56:35,850 Is this method going to work for you, or is it not? 1057 00:56:35,850 --> 00:56:40,060 What is required for it to work? 1058 00:56:40,060 --> 00:56:45,160 For example, would I want to isolate 1059 00:56:45,160 --> 00:56:48,820 this gene-- this non-coding RNA gene-- 1060 00:56:48,820 --> 00:56:52,680 just from primates, from like human, gorilla, 1061 00:56:52,680 --> 00:56:57,770 chimp, orangutan, and do that alignment? 1062 00:56:57,770 --> 00:56:59,290 Or would I want to go further?
1063 00:56:59,290 --> 00:57:05,840 Would I want to go back to the rodents and dog, horse-- 1064 00:57:05,840 --> 00:57:06,925 how far do you want to go? 1065 00:57:06,925 --> 00:57:07,590 Yeah, question. 1066 00:57:07,590 --> 00:57:10,662 AUDIENCE: I think we need a very strong sequence alignment 1067 00:57:10,662 --> 00:57:14,106 for this, so we cannot go very far, 1068 00:57:14,106 --> 00:57:17,058 because if you don't have a high percentage homology, 1069 00:57:17,058 --> 00:57:19,518 then you will see all sorts of false positives. 1070 00:57:19,518 --> 00:57:20,502 PROFESSOR: Absolutely. 1071 00:57:20,502 --> 00:57:23,590 So if you go too far, your alignment will suffer, 1072 00:57:23,590 --> 00:57:25,097 and you need an alignment in order 1073 00:57:25,097 --> 00:57:26,680 to identify the corresponding columns. 1074 00:57:26,680 --> 00:57:30,580 So that puts an upper limit on how far you can go. 1075 00:57:30,580 --> 00:57:32,790 But excellent point. 1076 00:57:32,790 --> 00:57:33,890 Is there a lower limit? 1077 00:57:33,890 --> 00:57:35,515 Do you want to go as close as possible, 1078 00:57:35,515 --> 00:57:40,200 like this example I gave with human, chimp, orangutan? 1079 00:57:40,200 --> 00:57:42,750 Or is that too close? 1080 00:57:42,750 --> 00:57:44,030 Why is too close bad? 1081 00:57:44,030 --> 00:57:44,842 Tim? 1082 00:57:44,842 --> 00:57:46,770 AUDIENCE: Maybe if you're too close, 1083 00:57:46,770 --> 00:57:49,180 then the sequence is having to [INAUDIBLE] 1084 00:57:49,180 --> 00:57:51,108 to give you enough information [INAUDIBLE]. 1085 00:57:51,108 --> 00:57:52,149 PROFESSOR: Yeah, exactly. 1086 00:57:52,149 --> 00:57:53,040 They're all the same. 1087 00:57:53,040 --> 00:57:57,880 Actually, you'll get 1 times 1 over 1 1088 00:57:57,880 --> 00:58:00,210 in that mutual information statistic, which log of that 1089 00:58:00,210 --> 00:58:01,440 is going to be 0.
1090 00:58:01,440 --> 00:58:04,860 There's zero mutual information if they're all the same. 1091 00:58:04,860 --> 00:58:09,400 So there has to be some variation, 1092 00:58:09,400 --> 00:58:12,230 and the structure has to be conserved. 1093 00:58:12,230 --> 00:58:13,180 That's key. 1094 00:58:13,180 --> 00:58:17,340 You have to assume that the structure is well conserved 1095 00:58:17,340 --> 00:58:20,147 and you have to have a good alignment 1096 00:58:20,147 --> 00:58:21,605 and there has to be some variation, 1097 00:58:21,605 --> 00:58:22,854 a certain amount of variation. 1098 00:58:22,854 --> 00:58:26,620 Those are basically the three keys. 1099 00:58:26,620 --> 00:58:29,170 Secondary structure is more highly conserved than sequence. 1100 00:58:29,170 --> 00:58:31,710 Sufficient divergence so that you have these variations, 1101 00:58:31,710 --> 00:58:35,060 and a sufficient number of homologues to get good 1102 00:58:35,060 --> 00:58:40,340 statistics, and not so far that your alignment is bad. 1103 00:58:40,340 --> 00:58:41,050 Sorry about that. 1104 00:58:41,050 --> 00:58:41,550 Sally? 1105 00:58:44,201 --> 00:58:45,742 AUDIENCE: It seems like another thing 1106 00:58:45,742 --> 00:58:50,030 that we assume here is that you can project it onto a plane 1107 00:58:50,030 --> 00:58:52,590 and it will lie flat. 1108 00:58:52,590 --> 00:58:55,270 So if you have some very important, weird folding 1109 00:58:55,270 --> 00:58:58,611 that allows you to, say, crisscross the rainbow thing. 1110 00:58:58,611 --> 00:59:00,277 PROFESSOR: Yeah, crisscross the rainbow. 1111 00:59:00,277 --> 00:59:01,684 Yeah, very good question.
1112 00:59:08,420 --> 00:59:10,400 So in the example of tRNA, if you 1113 00:59:10,400 --> 00:59:12,880 were to do that arc diagram for tRNA, 1114 00:59:12,880 --> 00:59:14,664 it would look like another big arc-- 1115 00:59:14,664 --> 00:59:16,330 that's the first and last-- and then you 1116 00:59:16,330 --> 00:59:19,460 have these three nested arcs. 1117 00:59:19,460 --> 00:59:20,405 Nothing crisscrossing. 1118 00:59:24,410 --> 00:59:32,340 What if I saw-- [INAUDIBLE]-- two blocks of sequence that 1119 00:59:32,340 --> 00:59:33,793 have a relationship like that? 1120 00:59:33,793 --> 00:59:34,779 Is that OK? 1121 00:59:43,160 --> 00:59:46,611 With this method, the co-variation, that's OK. 1122 00:59:46,611 --> 00:59:47,730 There's no problem there. 1123 00:59:47,730 --> 00:59:51,034 What does this structure look like? 1124 00:59:51,034 --> 00:59:57,870 So [INAUDIBLE] you have a stem, then you have a loop, 1125 00:59:57,870 --> 00:59:58,550 and then a stem. 1126 00:59:58,550 --> 01:00:01,640 So this is 1 pairs with 3. 1127 01:00:01,640 --> 01:00:02,510 That's 1. 1128 01:00:02,510 --> 01:00:03,610 That's 3. 1129 01:00:03,610 --> 01:00:06,350 Then you've got 2 up here, but 2 pairs with 4. 1130 01:00:06,350 --> 01:00:09,340 So here's 4 over here, so 4 is going 1131 01:00:09,340 --> 01:00:12,620 to have to come back up here and pair with 2. 1132 01:00:15,270 --> 01:00:16,940 This is 2 over here. 1133 01:00:16,940 --> 01:00:20,750 So that is called a pseudoknot. 1134 01:00:20,750 --> 01:00:22,920 It's not really a knot because this thing doesn't 1135 01:00:22,920 --> 01:00:25,850 go through the loop, but it kind of 1136 01:00:25,850 --> 01:00:27,800 behaves like a knot in some ways. 1137 01:00:27,800 --> 01:00:31,290 And so do these actually occur in natural RNAs? 1138 01:00:31,290 --> 01:00:32,780 Yes, Tim is nodding. 1139 01:00:32,780 --> 01:00:34,580 And are they important? 
1140 01:00:34,580 --> 01:00:37,090 Can you give me an example where they are important 1141 01:00:37,090 --> 01:00:38,289 biologically? 1142 01:00:38,289 --> 01:00:39,726 AUDIENCE: [INAUDIBLE] 1143 01:00:39,726 --> 01:00:41,163 [INTERPOSING VOICES] 1144 01:00:41,163 --> 01:00:42,850 PROFESSOR: Riboswitches. 1145 01:00:42,850 --> 01:00:44,516 We're going to come to what riboswitches 1146 01:00:44,516 --> 01:00:49,390 are in a moment for those not familiar. 1147 01:00:49,390 --> 01:00:51,050 And I think I have an example later 1148 01:00:51,050 --> 01:00:52,450 of a pseudoknot that's important. 1149 01:00:52,450 --> 01:00:53,533 So that's a good question. 1150 01:00:58,190 --> 01:01:00,290 I think I should have added to this list the point 1151 01:01:00,290 --> 01:01:03,203 that you made in the back that they 1152 01:01:03,203 --> 01:01:06,020 have to be close enough that you can get a good alignment. 1153 01:01:06,020 --> 01:01:07,502 I should add that to this list. 1154 01:01:07,502 --> 01:01:08,002 Thanks. 1155 01:01:08,002 --> 01:01:09,650 It's a good point. 1156 01:01:09,650 --> 01:01:11,730 All right, so classes of non-coding RNAs. 1157 01:01:11,730 --> 01:01:14,340 As promised, my favorites listed here. 1158 01:01:17,070 --> 01:01:19,540 Everyone knows tRNAs, rRNAs. 1159 01:01:19,540 --> 01:01:22,530 You can think of UTRs as being non-coding RNAs. 1160 01:01:22,530 --> 01:01:24,270 They often have structure that can 1161 01:01:24,270 --> 01:01:26,230 be involved in regulating the message. 1162 01:01:26,230 --> 01:01:28,160 snRNAs are involved in splicing. 1163 01:01:28,160 --> 01:01:31,490 snoRNAs-- small nucleolar RNAs-- are 1164 01:01:31,490 --> 01:01:33,870 involved in directing modification 1165 01:01:33,870 --> 01:01:39,519 of other RNAs, such as ribosomal RNAs and snRNAs, for example. 1166 01:01:39,519 --> 01:01:41,310 Terminators of transcription in prokaryotes 1167 01:01:41,310 --> 01:01:43,460 are like little stem loop structures.
1168 01:01:43,460 --> 01:01:45,200 RNaseP is an important enzyme. 1169 01:01:45,200 --> 01:01:51,590 SRP is involved in targeting proteins with signal peptides 1170 01:01:51,590 --> 01:01:54,290 to the export machinery. 1171 01:01:54,290 --> 01:01:55,730 We won't go into tmRNA. 1172 01:01:55,730 --> 01:01:57,440 microRNAs and lincRNAs, you probably 1173 01:01:57,440 --> 01:01:59,400 know, and riboswitches. 1174 01:01:59,400 --> 01:02:03,394 So Tim, can you tell us what a riboswitch is? 1175 01:02:03,394 --> 01:02:06,810 AUDIENCE: A riboswitch is any RNA structure 1176 01:02:06,810 --> 01:02:10,226 that changes conformation according 1177 01:02:10,226 --> 01:02:16,550 to some stimulus [INAUDIBLE] or something in the cell. 1178 01:02:16,550 --> 01:02:20,020 It could be an ion, critical changes in the structure. 1179 01:02:20,020 --> 01:02:22,317 [INAUDIBLE] 1180 01:02:22,317 --> 01:02:23,650 PROFESSOR: Yeah, that was great. 1181 01:02:23,650 --> 01:02:25,922 So just for those that may not have heard, 1182 01:02:25,922 --> 01:02:26,880 I'll just say it again. 1183 01:02:26,880 --> 01:02:31,480 So a riboswitch is any RNA that can 1184 01:02:31,480 --> 01:02:34,750 have multiple conformations, and changes conformation 1185 01:02:34,750 --> 01:02:41,325 in response to some stimulus-- temperature, binding 1186 01:02:41,325 --> 01:02:45,190 of some ligand, small molecules, something like that, et cetera. 1187 01:02:45,190 --> 01:02:49,360 And often, one of those structures 1188 01:02:49,360 --> 01:02:51,970 will block a particular regulatory element. 1189 01:02:51,970 --> 01:02:53,600 I'll show an example in a moment. 1190 01:02:53,600 --> 01:02:55,940 And so when it's in one conformation, 1191 01:02:55,940 --> 01:02:57,180 the gene will be repressed. 1192 01:02:57,180 --> 01:02:58,555 And when it's in the other, it'll 1193 01:02:58,555 --> 01:03:02,560 be on.
So it's a way of using RNA's secondary structure 1194 01:03:02,560 --> 01:03:04,630 to sense what's going on in the cell 1195 01:03:04,630 --> 01:03:06,590 and to appropriately regulate gene expression. 1196 01:03:09,027 --> 01:03:11,610 All right, so now we're going to talk about a second approach. 1197 01:03:11,610 --> 01:03:12,860 So this would be the approach. 1198 01:03:12,860 --> 01:03:14,670 You've got some RNA. 1199 01:03:14,670 --> 01:03:18,561 It may not do something, and maybe you can't find any 1200 01:03:18,561 --> 01:03:19,060 homologues. 1201 01:03:19,060 --> 01:03:21,940 It might be some newly evolved species-specific RNA, 1202 01:03:21,940 --> 01:03:24,250 or you're studying some obscure species 1203 01:03:24,250 --> 01:03:27,100 where you don't have a lot of genomic sequence around. 1204 01:03:27,100 --> 01:03:29,200 So you want to use the first-principles approach, 1205 01:03:29,200 --> 01:03:31,545 the energy minimization approach. 1206 01:03:31,545 --> 01:03:32,920 Or maybe you have the homologues, 1207 01:03:32,920 --> 01:03:34,820 but you don't trust your alignment. 1208 01:03:34,820 --> 01:03:36,680 You want a second opinion on what 1209 01:03:36,680 --> 01:03:38,130 the structure is going to be. 1210 01:03:38,130 --> 01:03:44,870 So just in the way that protein folding-- 1211 01:03:44,870 --> 01:03:46,540 you could think of an equilibrium model 1212 01:03:46,540 --> 01:03:49,400 where it's determined by folding free energy, 1213 01:03:49,400 --> 01:03:52,200 and enthalpy will favor base pairing. 1214 01:03:52,200 --> 01:03:55,810 You gain some enthalpy when you form a hydrogen bond, 1215 01:03:55,810 --> 01:03:58,770 and entropy will tend to favor unfolding. 1216 01:03:58,770 --> 01:04:02,590 So an RNA molecule that's linear has 1217 01:04:02,590 --> 01:04:04,210 all this conformational flexibility, 1218 01:04:04,210 --> 01:04:06,072 and you lose some of that when you form a stem. 1219 01:04:06,072 --> 01:04:06,780 It forms a helix.
1220 01:04:06,780 --> 01:04:09,460 Those things don't have as much flexibility. 1221 01:04:09,460 --> 01:04:13,670 And even the nucleotides in the loop are a little bit 1222 01:04:13,670 --> 01:04:16,340 conformationally-- they're not as flexible 1223 01:04:16,340 --> 01:04:18,600 as they were when it was linear. 1224 01:04:18,600 --> 01:04:20,790 So that means that at high temperatures, 1225 01:04:20,790 --> 01:04:24,930 it'll favor unfolding. 1226 01:04:24,930 --> 01:04:29,480 So the earliest approaches were approaches 1227 01:04:29,480 --> 01:04:34,300 that sought to maximize the number of base pairs. 1228 01:04:34,300 --> 01:04:37,710 So they basically ignore entropy and focus on the enthalpy 1229 01:04:37,710 --> 01:04:39,530 that you gain from forming base pairs. 1230 01:04:39,530 --> 01:04:43,730 And so Ruth Nussinov described the first algorithm 1231 01:04:43,730 --> 01:04:47,780 to figure out what is the maximum number of base pairs 1232 01:04:47,780 --> 01:04:51,160 that you can form in an RNA. 1233 01:04:51,160 --> 01:04:57,750 And so a way to think about this is 1234 01:04:57,750 --> 01:04:59,225 imagine you've got this sequence. 1235 01:05:06,444 --> 01:05:08,110 What is the largest number of base pairs 1236 01:05:08,110 --> 01:05:09,690 I can form with this sequence? 1237 01:05:15,090 --> 01:05:17,405 I could just draw all possible base pairs. 1238 01:05:17,405 --> 01:05:19,780 That A can pair with that T. This A can pair with that T. 1239 01:05:19,780 --> 01:05:21,660 They can't both pair simultaneously, right? 1240 01:05:21,660 --> 01:05:27,460 And this C can pair with that G. So if we don't allow crossing, 1241 01:05:27,460 --> 01:05:30,610 which-- coming back to Sally's point-- 1242 01:05:30,610 --> 01:05:32,470 this would cross this, right? 1243 01:05:32,470 --> 01:05:34,320 So we're not going to allow that.
1244 01:05:34,320 --> 01:05:38,780 So the best you could do would be to have this A pair with this T 1245 01:05:38,780 --> 01:05:41,790 and this C pair with this G and form this little structure. 1246 01:05:45,500 --> 01:05:49,324 This is not realistic because RNA loops can't be one base. 1247 01:05:49,324 --> 01:05:50,490 The minimum is about three. 1248 01:05:50,490 --> 01:05:52,810 But just for the sake of argument, 1249 01:05:52,810 --> 01:05:55,720 you can list all these out, but imagine now 1250 01:05:55,720 --> 01:05:59,140 you've got 100 bases here. 1251 01:05:59,140 --> 01:06:02,490 Every base will on average potentially 1252 01:06:02,490 --> 01:06:07,700 be able to pair with 24 or 25 other bases. 1253 01:06:07,700 --> 01:06:12,190 So you're just going to have just an incredible mishmash 1254 01:06:12,190 --> 01:06:16,960 of possible lines all crisscrossing. 1255 01:06:16,960 --> 01:06:22,697 So how do you figure out how to maximize that pairing? 1256 01:06:27,231 --> 01:06:27,730 Any ideas? 1257 01:06:33,208 --> 01:06:34,950 Don, yeah? 1258 01:06:34,950 --> 01:06:37,512 AUDIENCE: You look for sections of homology. 1259 01:06:37,512 --> 01:06:39,400 PROFESSOR: We're not using homology. 1260 01:06:39,400 --> 01:06:41,300 We're doing [INAUDIBLE] 1261 01:06:41,300 --> 01:06:44,190 AUDIENCE: I'm sorry, not homology, but sections where-- 1262 01:06:44,190 --> 01:06:44,602 PROFESSOR: Complementary? 1263 01:06:44,602 --> 01:06:45,426 AUDIENCE: Complementary. 1264 01:06:45,426 --> 01:06:46,967 Yeah, that's the word I was thinking. 1265 01:06:46,967 --> 01:06:48,670 PROFESSOR: The blocks are complementary. 1266 01:06:48,670 --> 01:06:51,470 AUDIENCE: And then so-- 1267 01:06:51,470 --> 01:06:54,172 PROFESSOR: You could blast the sequence against the inverse 1268 01:06:54,172 --> 01:06:56,990 complement of itself and look for little blocks. 1269 01:06:56,990 --> 01:06:59,410 You could do that.
1270 01:06:59,410 --> 01:07:00,970 That's not what people generally do, 1271 01:07:00,970 --> 01:07:03,770 mostly because the blocks of complementarity in real RNA 1272 01:07:03,770 --> 01:07:05,680 structures are really short. 1273 01:07:05,680 --> 01:07:07,710 They can be two, three, four bases. 1274 01:07:07,710 --> 01:07:08,625 Sally, yeah? 1275 01:07:08,625 --> 01:07:11,000 AUDIENCE: Could you use [INAUDIBLE] approach 1276 01:07:11,000 --> 01:07:16,110 where you just start with a very small case and build up? 1277 01:07:16,110 --> 01:07:18,630 PROFESSOR: So we've seen that work for protein sequence 1278 01:07:18,630 --> 01:07:19,130 alignment. 1279 01:07:19,130 --> 01:07:22,750 We've seen it work for the Viterbi algorithm. 1280 01:07:22,750 --> 01:07:27,710 So that is sort of the go-to approach in bioinformatics, 1281 01:07:27,710 --> 01:07:29,950 is to use some sort of dynamic programming. 1282 01:07:29,950 --> 01:07:32,790 Now this one for RNA secondary structure 1283 01:07:32,790 --> 01:07:35,440 that Nussinov came up with is a little bit 1284 01:07:35,440 --> 01:07:36,800 different than the others. 1285 01:07:36,800 --> 01:07:39,860 So you'll see it has a kind of different flavor. 1286 01:07:39,860 --> 01:07:42,482 It turns out to be actually it's a little hard 1287 01:07:42,482 --> 01:07:44,190 to get your head around at the beginning, 1288 01:07:44,190 --> 01:07:47,720 but it's actually easier to do by hand. 1289 01:07:47,720 --> 01:07:49,380 So let's take a look at that. 1290 01:07:49,380 --> 01:07:53,020 OK, so recursive maximization of base pairing. 1291 01:07:53,020 --> 01:07:55,290 Now the thing about base pairing that's 1292 01:07:55,290 --> 01:07:56,780 different from these other problems 1293 01:07:56,780 --> 01:07:59,500 is that the first base in the sequence 1294 01:07:59,500 --> 01:08:02,980 can base pair with the last. 1295 01:08:02,980 --> 01:08:05,220 How do you chop up a sequence?
1296 01:08:05,220 --> 01:08:08,870 Remember with Needleman-Wunsch and with Viterbi 1297 01:08:08,870 --> 01:08:11,146 we go from the beginning to the end, 1298 01:08:11,146 --> 01:08:12,270 and that's a logical order. 1299 01:08:12,270 --> 01:08:16,560 But with base pairing, that's actually not a logical order. 1300 01:08:16,560 --> 01:08:19,350 You can't really do it that way. 1301 01:08:19,350 --> 01:08:24,540 So instead, you go from the inside out. 1302 01:08:24,540 --> 01:08:26,640 You start in the middle of a sequence 1303 01:08:26,640 --> 01:08:30,990 and work your way outwards in both directions. 1304 01:08:30,990 --> 01:08:40,890 Or another way to think about it is you start by writing 1305 01:08:40,890 --> 01:08:45,920 the sequence from 1 to n on both axes, 1306 01:08:45,920 --> 01:08:52,399 and then actually we'll see that we initialize the diagonal all 1307 01:08:52,399 --> 01:08:53,584 to 0's. 1308 01:08:53,584 --> 01:08:58,229 And then we think about these positions here next. 1309 01:09:02,620 --> 01:09:06,109 So 1 versus 2. 1310 01:09:06,109 --> 01:09:08,100 Could 1 pair with 2? 1311 01:09:08,100 --> 01:09:09,439 And could 2 pair with 3? 1312 01:09:09,439 --> 01:09:12,087 Those are like little bits of possible RNA 1313 01:09:12,087 --> 01:09:12,920 secondary structure. 1314 01:09:12,920 --> 01:09:14,340 Again, we're ignoring this fact that loops 1315 01:09:14,340 --> 01:09:15,464 have to be a certain minimum length. 1316 01:09:15,464 --> 01:09:17,800 This is sort of a simplified case. 1317 01:09:17,800 --> 01:09:19,600 And then you build outwards. 1318 01:09:19,600 --> 01:09:27,220 So you conclude that base 4 here could pair with base 5, 1319 01:09:27,220 --> 01:09:30,090 so we're going to put a 1 there.
1320 01:09:30,090 --> 01:09:33,630 And then we're going to build outward 1321 01:09:33,630 --> 01:09:35,590 from that toward the beginning of the sequence 1322 01:09:35,590 --> 01:09:38,930 and toward the end, adding additional base pairs 1323 01:09:38,930 --> 01:09:40,210 when we can. 1324 01:09:40,210 --> 01:09:42,200 That's basically the way the [INAUDIBLE] works. 1325 01:09:42,200 --> 01:09:47,740 And so that's one key idea, that we 1326 01:09:47,740 --> 01:09:50,890 go from sort of close sequences, work 1327 01:09:50,890 --> 01:09:53,120 outward, to faraway sequences. 1328 01:09:53,120 --> 01:09:57,540 And the second key idea is that, 1329 01:09:57,540 --> 01:10:00,620 as you add more bases on the outside of what you've 1330 01:10:00,620 --> 01:10:05,920 already got, the optimal structure in that larger 1331 01:10:05,920 --> 01:10:08,430 portion of sequence space is related 1332 01:10:08,430 --> 01:10:13,100 to the optimal structures of smaller portions of it 1333 01:10:13,100 --> 01:10:14,810 in one of four different ways. 1334 01:10:14,810 --> 01:10:17,470 And these are the four ways. 1335 01:10:17,470 --> 01:10:21,370 So let's look at these. 1336 01:10:23,940 --> 01:10:29,830 So the first one is probably the simplest 1337 01:10:29,830 --> 01:10:38,270 where if you're doing this, you're here somewhere, 1338 01:10:38,270 --> 01:10:44,050 meaning you've compared sequences from position, 1339 01:10:44,050 --> 01:10:49,680 let's say, i minus 1 to j minus 1 here. 1340 01:10:49,680 --> 01:10:53,430 And then we're going to consider adding-- actually, 1341 01:10:53,430 --> 01:10:56,700 it depends how you number your sequence. 1342 01:10:56,700 --> 01:10:58,460 Let me see how this is done. 1343 01:10:58,460 --> 01:10:59,530 Sorry. i plus 1. 1344 01:11:03,360 --> 01:11:06,482 i plus 1 to j minus 1. 1345 01:11:06,482 --> 01:11:08,690 We figured out what the optimal structure is in here, 1346 01:11:08,690 --> 01:11:09,920 let's suppose.
1347 01:11:09,920 --> 01:11:12,370 And now we're going to consider adding one more 1348 01:11:12,370 --> 01:11:13,750 base on either end. 1349 01:11:13,750 --> 01:11:19,780 We're going to add j down here, and we're 1350 01:11:19,780 --> 01:11:22,190 going to ask if it pairs with i. 1351 01:11:22,190 --> 01:11:25,020 And if so, we're going to take whatever the optimal structure 1352 01:11:25,020 --> 01:11:27,612 was in here and we're going to add one base pair, 1353 01:11:27,612 --> 01:11:29,320 and we're going to add plus 1 because now 1354 01:11:29,320 --> 01:11:30,810 it's got one additional. 1355 01:11:30,810 --> 01:11:32,040 We're counting base pairs. 1356 01:11:32,040 --> 01:11:36,230 So that's that first case there. 1357 01:11:36,230 --> 01:11:39,940 And then the second case is you could also consider just 1358 01:11:39,940 --> 01:11:43,270 adding one unpaired base onto whatever structure you had, 1359 01:11:43,270 --> 01:11:45,380 and then you don't add one. 1360 01:11:45,380 --> 01:11:47,484 And you could go in either direction. 1361 01:11:47,484 --> 01:11:49,900 You can go sort of toward the beginning of the sequence 1362 01:11:49,900 --> 01:11:52,580 or toward the end of the sequence. 1363 01:11:52,580 --> 01:11:54,890 And then the third one is the tricky one, 1364 01:11:54,890 --> 01:11:57,830 is what's called a bifurcation. 1365 01:11:57,830 --> 01:12:02,840 You could consider that actually i and j 1366 01:12:02,840 --> 01:12:05,840 are both paired, but not with each other. 1367 01:12:05,840 --> 01:12:09,280 That i pairs with something that was inside here 1368 01:12:09,280 --> 01:12:11,280 and j pairs with something that was inside here. 1369 01:12:11,280 --> 01:12:15,760 So your optimal parse from i to j, if you will, 1370 01:12:15,760 --> 01:12:18,650 is not going to come from the optimal parse from i plus 1 1371 01:12:18,650 --> 01:12:19,500 to j minus 1.
1372 01:12:19,500 --> 01:12:23,160 It's going to come from rethinking this and doing 1373 01:12:23,160 --> 01:12:25,690 the optimal parse from here to here and from here to here, 1374 01:12:25,690 --> 01:12:29,060 and combining those two. 1375 01:12:29,060 --> 01:12:32,590 So you're probably confused by now, 1376 01:12:32,590 --> 01:12:35,252 so let me try to do an example. 1377 01:12:46,545 --> 01:12:49,320 And then I have an analogy that will confuse you further. 1378 01:12:49,320 --> 01:12:51,200 So ask me for that one. 1379 01:13:00,630 --> 01:13:02,350 This was the simplest one I could come up 1380 01:13:02,350 --> 01:13:04,220 with that has this property. 1381 01:13:04,220 --> 01:13:11,510 OK, so we said before that if you were doing the optimal 1382 01:13:11,510 --> 01:13:18,080 from 1 to 5, that it would be the AC pairing with the GT. 1383 01:13:18,080 --> 01:13:19,450 We do that one. 1384 01:13:19,450 --> 01:13:24,800 And now if you notice, this guy is kind of a similar sequence. 1385 01:13:24,800 --> 01:13:27,060 I just added a T at the beginning and an A at the end. 1386 01:13:27,060 --> 01:13:33,910 And so you can probably imagine that the best structure of this 1387 01:13:33,910 --> 01:13:36,470 is here, those three. 1388 01:13:36,470 --> 01:13:39,644 You've got three pairs of this sub-sequence here. 1389 01:13:39,644 --> 01:13:41,560 That's as good as you can do with seven bases. 1390 01:13:41,560 --> 01:13:43,170 You can only get three pairs. 1391 01:13:43,170 --> 01:13:45,003 And this is as good as you can do with five, 1392 01:13:45,003 --> 01:13:47,050 so these are clearly optimal. 1393 01:13:47,050 --> 01:13:53,900 So the issue comes that if you're starting from somewhere 1394 01:13:53,900 --> 01:13:58,669 in the middle here-- let's say you are-- let's see, 1395 01:13:58,669 --> 01:13:59,960 so how would you be doing this? 1396 01:14:02,610 --> 01:14:03,659 You start here. 
1397 01:14:03,659 --> 01:14:05,950 Let's suppose the first two you consider are these two. 1398 01:14:05,950 --> 01:14:08,520 You consider pairing that T with that A. 1399 01:14:08,520 --> 01:14:12,900 You can see this is not going to go well. 1400 01:14:12,900 --> 01:14:17,640 You might end up with that as your optimal substructure 1401 01:14:17,640 --> 01:14:18,410 of this region. 1402 01:14:18,410 --> 01:14:20,285 Remember, you're working from the inside out, 1403 01:14:20,285 --> 01:14:24,760 so you're going from here to here, and you end up with that. 1404 01:14:27,790 --> 01:14:29,270 And what do you do here? 1405 01:14:29,270 --> 01:14:30,770 You don't have a G to pair the C to, 1406 01:14:30,770 --> 01:14:33,880 so you add another unpaired base. 1407 01:14:33,880 --> 01:14:36,140 Now you've got this optimal substructure 1408 01:14:36,140 --> 01:14:38,680 of a sequence that's almost the whole sequence. 1409 01:14:38,680 --> 01:14:40,590 It's just missing the first and last bases, 1410 01:14:40,590 --> 01:14:43,500 but it only has three base pairs. 1411 01:14:43,500 --> 01:14:46,410 So when you go to add this, you can say, 1412 01:14:46,410 --> 01:14:49,560 oh, I can't add any more base pairs, so I've only got three. 1413 01:14:49,560 --> 01:14:52,280 But you should consider that we've already 1414 01:14:52,280 --> 01:14:54,570 solved the optimal structure of that, 1415 01:14:54,570 --> 01:14:57,120 and we had two nice pairs here. 1416 01:14:57,120 --> 01:15:00,480 We had that pair and that pair, and we already 1417 01:15:00,480 --> 01:15:04,380 solved the substructure of the optimal structure 1418 01:15:04,380 --> 01:15:06,700 of this portion here, and you had those three pairs. 1419 01:15:06,700 --> 01:15:09,770 And so you can combine those two and all of a sudden 1420 01:15:09,770 --> 01:15:12,680 you can do much better. 1421 01:15:12,680 --> 01:15:16,215 So that's what that bifurcation thing is about. 
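The four cases walked through above can be collected into a single recursion. The notation here is an assumption introduced for this note, not taken from the slide: $\gamma(i,j)$ is the maximum number of base pairs in the stretch from position $i$ to position $j$, and $\delta(i,j)$ is 1 if bases $i$ and $j$ can pair and 0 otherwise.

```latex
\gamma(i,j) = \max
\begin{cases}
\gamma(i+1,\,j-1) + \delta(i,j) & \text{$i$ pairs with $j$} \\
\gamma(i+1,\,j)                 & \text{$i$ is left unpaired} \\
\gamma(i,\,j-1)                 & \text{$j$ is left unpaired} \\
\max\limits_{i<k<j}\bigl[\gamma(i,k) + \gamma(k+1,\,j)\bigr] & \text{bifurcation at $k$}
\end{cases}
\qquad
\gamma(i,i) = 0,\quad \gamma(i,\,i-1) = 0 .
```

The last case is the one in the worked example: the best answer for the whole stretch is the sum of two independently solved pieces, even though neither end base pairs with the other.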
1422 01:15:20,650 --> 01:15:23,030 So this is the recursion working out, 1423 01:15:23,030 --> 01:15:25,920 and you can see that's the base pairing one. 1424 01:15:25,920 --> 01:15:29,470 You can add one, or you can just add an unpaired base 1425 01:15:29,470 --> 01:15:30,610 and you don't add anything. 1426 01:15:30,610 --> 01:15:33,220 Or you consider all the possible locations 1427 01:15:33,220 --> 01:15:36,150 of bifurcations in-between the two positions you're adding, 1428 01:15:36,150 --> 01:15:39,040 i and j, and you consider all the possible pairs. 1429 01:15:39,040 --> 01:15:43,204 And you just sum up each pair and go-- I'm sorry, 1430 01:15:43,204 --> 01:15:44,120 you don't sum them up. 1431 01:15:44,120 --> 01:15:48,740 You consider them all, and then you take the maximum. 1432 01:15:48,740 --> 01:15:54,570 All right, so the algorithm is to take an n by n matrix, 1433 01:15:54,570 --> 01:15:58,810 initialize the diagonal to 0, and initialize the sub-diagonal 1434 01:15:58,810 --> 01:16:00,379 to 0 also. 1435 01:16:00,379 --> 01:16:01,920 Just don't think too much about that. 1436 01:16:01,920 --> 01:16:02,760 Just do it. 1437 01:16:02,760 --> 01:16:07,040 And then fill in this matrix recursively 1438 01:16:07,040 --> 01:16:09,520 from the diagonal up and to the right. 1439 01:16:09,520 --> 01:16:12,760 And it actually doesn't matter what order you fill it in 1440 01:16:12,760 --> 01:16:14,730 as long as you're kind of working your way up 1441 01:16:14,730 --> 01:16:15,355 and to the right. 1442 01:16:15,355 --> 01:16:17,590 You have to have the thing to the left and the thing 1443 01:16:17,590 --> 01:16:21,500 below already filled in if you're going to fill in a box. 1444 01:16:21,500 --> 01:16:24,210 And then you keep track of the optimal score, which 1445 01:16:24,210 --> 01:16:25,980 is going to be the number of base pairs. 1446 01:16:25,980 --> 01:16:28,970 And then you also keep track of how you got there.
1447 01:16:28,970 --> 01:16:32,789 What base pair did you add so that you can trace back? 1448 01:16:32,789 --> 01:16:34,580 And then when you get up to the upper right 1449 01:16:34,580 --> 01:16:39,010 corner of this matrix, you then trace back. 1450 01:16:39,010 --> 01:16:42,190 So here is a partially filled-in version of this matrix. 1451 01:16:42,190 --> 01:16:44,820 This is from that Nature Biotechnology review. 1452 01:16:44,820 --> 01:16:48,534 And the 0's are filled in. 1453 01:16:48,534 --> 01:16:50,200 So here's what I want you to do at home, 1454 01:16:50,200 --> 01:16:54,110 is print out, photocopy or whatever-- make this matrix, 1455 01:16:54,110 --> 01:16:56,260 or make a bigger version of it perhaps-- 1456 01:16:56,260 --> 01:17:00,580 and look at the sequence and fill in this matrix, 1457 01:17:00,580 --> 01:17:05,284 and fill in the little arrows every time you add a base pair. 1458 01:17:05,284 --> 01:17:06,450 It's actually not that hard. 1459 01:17:06,450 --> 01:17:09,150 There are no bifurcations in this; that's the tricky one. 1460 01:17:09,150 --> 01:17:09,936 Ignore that one. 1461 01:17:09,936 --> 01:17:11,310 You'll just be adding base pairs. 1462 01:17:11,310 --> 01:17:12,340 It'll be pretty easy. 1463 01:17:12,340 --> 01:17:15,470 And then you can reconstruct the structure. 1464 01:17:15,470 --> 01:17:16,835 So here it is filled in. 1465 01:17:16,835 --> 01:17:18,960 And the answer is given, so you can check yourself. 1466 01:17:18,960 --> 01:17:21,000 But do it without looking at the answer. 1467 01:17:21,000 --> 01:17:24,160 And then you go to the upper right corner. 1468 01:17:24,160 --> 01:17:26,000 That gives you the optimal structure 1469 01:17:26,000 --> 01:17:28,250 from the beginning of the sequence to the end-- which, 1470 01:17:28,250 --> 01:17:30,080 of course, was our goal all along.
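For anyone who wants to check the hand-filled matrix, here is a minimal Python sketch of the fill and traceback as described, under the same simplification used in the lecture (no minimum loop size). The function names `can_pair`, `nussinov` and `dot_bracket` are made up for this sketch, and the pairing rule (Watson-Crick pairs only, with T treated as U) is an assumption. Positions are 0-based here, unlike the 1-to-n numbering on the board.

```python
def can_pair(a, b):
    # Assumption: only Watson-Crick pairs count (A-U and G-C); T is read as U.
    a = a.upper().replace("T", "U")
    b = b.upper().replace("T", "U")
    return {a, b} in ({"A", "U"}, {"G", "C"})


def nussinov(seq):
    """Return (max number of base pairs, list of 0-based pairs (i, j))."""
    n = len(seq)
    if n == 0:
        return 0, []
    # The whole matrix starts at 0, which covers the diagonal and the
    # sub-diagonal (the cell [i+1][j-1] that is read when j == i + 1).
    score = [[0] * n for _ in range(n)]
    for span in range(1, n):              # work outward from the diagonal
        for i in range(n - span):
            j = i + span
            best = max(
                score[i + 1][j - 1] + (1 if can_pair(seq[i], seq[j]) else 0),
                score[i + 1][j],          # i left unpaired
                score[i][j - 1],          # j left unpaired
            )
            for k in range(i + 1, j):     # bifurcation points
                best = max(best, score[i][k] + score[k + 1][j])
            score[i][j] = best

    # Trace back from the upper right corner, re-deriving which case was used.
    pairs, stack = [], [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if i >= j:
            continue
        if score[i][j] == score[i + 1][j]:
            stack.append((i + 1, j))
        elif score[i][j] == score[i][j - 1]:
            stack.append((i, j - 1))
        elif can_pair(seq[i], seq[j]) and score[i][j] == score[i + 1][j - 1] + 1:
            pairs.append((i, j))
            stack.append((i + 1, j - 1))
        else:
            for k in range(i + 1, j):
                if score[i][j] == score[i][k] + score[k + 1][j]:
                    stack.append((i, k))
                    stack.append((k + 1, j))
                    break
    return score[0][n - 1], sorted(pairs)


def dot_bracket(n, pairs):
    # Render pairs as matching parentheses, unpaired bases as dots.
    out = ["."] * n
    for i, j in pairs:
        out[i], out[j] = "(", ")"
    return "".join(out)


count, pairs = nussinov("GGGAAACCC")
print(count, dot_bracket(9, pairs))   # 3 (((...)))
```

Ties between equally good structures are broken arbitrarily by the order of the checks in the traceback, so a hand-filled matrix can legitimately trace back to a different structure with the same number of pairs.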
1471 01:17:30,080 --> 01:17:32,590 And then you trace back and you can 1472 01:17:32,590 --> 01:17:38,410 see whenever you're moving diagonally here, 1473 01:17:38,410 --> 01:17:40,440 you're adding a base pair. 1474 01:17:40,440 --> 01:17:42,880 Remember, you add one on each end, 1475 01:17:42,880 --> 01:17:45,590 and so you're moving diagonally and adding the base pair, 1476 01:17:45,590 --> 01:17:47,940 and you get this little structure here. 1477 01:17:52,270 --> 01:17:55,427 So computational complexity of the algorithm. 1478 01:17:55,427 --> 01:17:57,510 You could think about this but I'll just tell you. 1479 01:17:57,510 --> 01:17:59,415 It's memory n squared because you've 1480 01:17:59,415 --> 01:18:01,970 got to fill in this matrix, so square 1481 01:18:01,970 --> 01:18:03,220 of the length of the sequence. 1482 01:18:03,220 --> 01:18:06,100 Time n cubed. 1483 01:18:06,100 --> 01:18:07,210 This is bad now. 1484 01:18:07,210 --> 01:18:08,700 And why is it n cubed? 1485 01:18:08,700 --> 01:18:11,657 It's n cubed because you have to fill in a matrix that's n by n. 1486 01:18:11,657 --> 01:18:13,490 And then when you do that maximization step, 1487 01:18:13,490 --> 01:18:16,310 that check for bifurcations, that's sort of order n, 1488 01:18:16,310 --> 01:18:16,930 as well. 1489 01:18:16,930 --> 01:18:19,517 So n cubed-- so this means that RNA folding is slow. 1490 01:18:19,517 --> 01:18:21,100 And in fact, some of the servers won't 1491 01:18:21,100 --> 01:18:23,058 allow you to fold anything more than a thousand 1492 01:18:23,058 --> 01:18:27,530 bases because it'll take forever or something like that. 1493 01:18:27,530 --> 01:18:30,300 And it cannot handle pseudoknots. 1494 01:18:30,300 --> 01:18:32,420 If you think through the recursion, 1495 01:18:32,420 --> 01:18:34,220 pseudoknots will be a problem.
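One way to see why pseudoknots are a problem: the recursion only ever nests one pair inside another or places two solved pieces side by side, so it cannot produce two pairs that cross each other. The Python sketch below checks a list of base pairs for such crossings. The function name `has_pseudoknot` is made up for this note, and it assumes each position appears in at most one pair.

```python
def has_pseudoknot(pairs):
    """Return True if any two pairs (i, j) and (k, l) cross: i < k < j < l."""
    # Normalise each pair to (smaller, larger) and sort by the left position.
    ordered = sorted((min(p), max(p)) for p in pairs)
    for a, (i, j) in enumerate(ordered):
        for k, l in ordered[a + 1:]:
            # k opens inside (i, j) but closes outside it: a crossing.
            if i < k < j < l:
                return True
    return False


print(has_pseudoknot([(0, 8), (1, 7), (2, 6)]))  # nested stem
print(has_pseudoknot([(0, 1), (2, 3)]))          # two stems side by side
print(has_pseudoknot([(0, 5), (3, 8)]))          # crossing pairs
```

Only the last of the three examples contains a crossing, which is exactly the kind of structure the matrix recursion has no case for.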
1496 01:18:37,170 --> 01:18:40,810 I'm going to just show you-- yeah, 1497 01:18:40,810 --> 01:18:44,910 I'll get to this-- that these are from the viruses. 1498 01:18:44,910 --> 01:18:49,010 Real viruses, some of them have pseudoknots 1499 01:18:49,010 --> 01:18:51,782 like these ones shown here, and some even 1500 01:18:51,782 --> 01:18:53,990 have these kissing loops, which is another type where 1501 01:18:53,990 --> 01:18:57,550 the two stem loops, the loops interact. 1502 01:18:57,550 --> 01:18:59,840 And the pseudoknots in particular 1503 01:18:59,840 --> 01:19:01,790 are important in the viral life cycle. 1504 01:19:01,790 --> 01:19:03,900 They can actually cause programmed ribosomal frame 1505 01:19:03,900 --> 01:19:05,750 shifting. 1506 01:19:05,750 --> 01:19:07,500 When the ribosome hits one of these things, 1507 01:19:07,500 --> 01:19:10,332 normally it just denatures RNA secondary structure. 1508 01:19:10,332 --> 01:19:12,040 When it hits a pseudoknot, it'll actually 1509 01:19:12,040 --> 01:19:15,420 get knocked back by one and will start 1510 01:19:15,420 --> 01:19:16,980 translating in a different frame. 1511 01:19:16,980 --> 01:19:18,670 And that's actually useful to the virus 1512 01:19:18,670 --> 01:19:20,540 to do that under certain circumstances. 1513 01:19:20,540 --> 01:19:23,940 That's how HIV makes the replicative polymerase, 1514 01:19:23,940 --> 01:19:30,870 is by doing a frame shift on the ribosome using a pseudoknot. 1515 01:19:30,870 --> 01:19:33,790 So these things are important. 1516 01:19:33,790 --> 01:19:40,510 And there's fancier methods that use 1517 01:19:40,510 --> 01:19:43,010 more sophisticated thermodynamic models where 1518 01:19:43,010 --> 01:19:46,270 GC counts more than AU.
1519 01:19:46,270 --> 01:19:48,810 And I won't go into the details, but I just 1520 01:19:48,810 --> 01:19:51,540 wanted to show you some pretty pictures here 1521 01:19:51,540 --> 01:19:55,110 that the Zuker algorithm-- this is 1522 01:19:55,110 --> 01:19:59,800 a real world RNA folding algorithm-- calculates not only 1523 01:19:59,800 --> 01:20:03,610 the minimum energy fold, but also sub-optimal folds, 1524 01:20:03,610 --> 01:20:05,990 and the probabilities of particular base pairs, 1525 01:20:05,990 --> 01:20:10,800 summing over all the possible structures that RNA could form, 1526 01:20:10,800 --> 01:20:14,370 weighted by their free energy. 1527 01:20:14,370 --> 01:20:16,180 So it's the full partition function. 1528 01:20:16,180 --> 01:20:17,614 It's not perfectly accurate. 1529 01:20:17,614 --> 01:20:19,280 It gets about 70% of base pairs correct, 1530 01:20:19,280 --> 01:20:20,988 which means it usually gets things right, 1531 01:20:20,988 --> 01:20:23,230 but occasionally totally wrong. 1532 01:20:23,230 --> 01:20:27,560 And there's a website for the Mfold server, which is actually 1533 01:20:27,560 --> 01:20:30,370 one of the most beautiful websites in bioinformatics, 1534 01:20:30,370 --> 01:20:31,510 I would say. 1535 01:20:31,510 --> 01:20:34,140 And also if you want to run it locally, 1536 01:20:34,140 --> 01:20:36,480 you should download the Vienna RNAfold package, 1537 01:20:36,480 --> 01:20:38,880 which has a very similar algorithm. 1538 01:20:38,880 --> 01:20:41,590 And I just wanted to show you one or two examples. 1539 01:20:41,590 --> 01:20:43,990 So this is the U5 snRNA. 1540 01:20:43,990 --> 01:20:45,480 This is the output of Mfold. 1541 01:20:45,480 --> 01:20:47,500 It predicts this structure.
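As a rough guide to what "summing over all the possible structures, weighted by their free energy" means, the standard Boltzmann weighting can be written as below. This is a general statistical-mechanics sketch, not a description of how any particular program computes it, and the symbols are assumptions introduced here: $S$ ranges over the secondary structures the sequence could form, $\Delta G(S)$ is the free energy of structure $S$, $R$ is the gas constant and $T$ the temperature.

```latex
Z = \sum_{S} e^{-\Delta G(S)/RT},
\qquad
P(i,j) = \frac{1}{Z} \sum_{S \,\ni\, (i,j)} e^{-\Delta G(S)/RT} .
```

Here $Z$ is the partition function and $P(i,j)$ is the probability that bases $i$ and $j$ are paired. A structure with lower free energy contributes a larger weight, so a pair that appears in many low-energy structures gets a probability near 1, while a pair found only in high-energy structures gets a probability near 0.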
1542 01:20:47,500 --> 01:20:50,710 And then this is what's called the energy dot plot, which 1543 01:20:50,710 --> 01:20:55,260 shows the bases in the optimal structure down below here 1544 01:20:55,260 --> 01:20:58,030 and then sort of these suboptimal structures here. 1545 01:20:58,030 --> 01:21:00,180 And you can see there's no ambiguity. 1546 01:21:00,180 --> 01:21:02,850 It's totally confident in this structure. 1547 01:21:02,850 --> 01:21:07,420 Then I ran the lysine riboswitch through this program, 1548 01:21:07,420 --> 01:21:09,840 and I got this. 1549 01:21:09,840 --> 01:21:12,060 I got the minimum free energy structure 1550 01:21:12,060 --> 01:21:13,020 down in the lower left. 1551 01:21:13,020 --> 01:21:15,630 And then you see there's a lot of other colored dots. 1552 01:21:15,630 --> 01:21:17,450 Those are from the suboptimal structures. 1553 01:21:17,450 --> 01:21:20,850 So it looks like this thing has multiple structures, which 1554 01:21:20,850 --> 01:21:21,950 of course it does. 1555 01:21:21,950 --> 01:21:28,050 So the way that this one works is, in the absence of lysine, 1556 01:21:28,050 --> 01:21:31,810 it forms this structure where the ribosome binding 1557 01:21:31,810 --> 01:21:34,750 sequence-- this is prokaryotic-- is exposed. 1558 01:21:34,750 --> 01:21:37,710 And so the ribosome can enter and translate 1559 01:21:37,710 --> 01:21:40,520 these lysine biosynthetic enzymes. 1560 01:21:40,520 --> 01:21:43,630 But then when lysine accumulates to a certain level, 1561 01:21:43,630 --> 01:21:47,900 it can interact with the RNA and shift its structure 1562 01:21:47,900 --> 01:21:50,600 so that you now form this stem, which 1563 01:21:50,600 --> 01:21:52,520 sequesters the ribosome binding sequence 1564 01:21:52,520 --> 01:21:54,640 and blocks lysine biosynthesis. 1565 01:21:54,640 --> 01:21:56,980 So a very clever system.
1566 01:21:56,980 --> 01:22:00,040 And it turns out that there's dozens 1567 01:22:00,040 --> 01:22:02,030 of these things in bacterial genomes, 1568 01:22:02,030 --> 01:22:04,267 and they control a lot of metabolism. 1569 01:22:04,267 --> 01:22:05,350 So they're very important. 1570 01:22:05,350 --> 01:22:07,590 And there may be some in eukaryotes, too, 1571 01:22:07,590 --> 01:22:09,077 and that would be good. 1572 01:22:09,077 --> 01:22:10,910 If anyone's looking for a project, not happy 1573 01:22:10,910 --> 01:22:12,451 with their current project, you might 1574 01:22:12,451 --> 01:22:15,780 think about looking for more riboswitches. 1575 01:22:15,780 --> 01:22:18,810 So I'm going to have to end there. 1576 01:22:18,810 --> 01:22:21,440 And thank you guys for your attention, 1577 01:22:21,440 --> 01:22:24,500 and good luck on the midterm.