The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation, or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: Any questions from last time about Gibbs sampling? No? So at the end, we introduced this concept of relative entropy. So I just wanted to briefly review this, and make sure it's clear to everyone.

So the relative entropy is a measure of distance between probability distributions. It can be written different ways, often with this D(p, q) notation. And as you'll see, it's the mean bit score: if you're scoring a motif with a foreground model, Pk, and a background model, qk, it's the average log-odds score under the motif model.

And I asked you to show that under the special case where qk is 1 over 4 to the w -- that is, uniform background -- the relative entropy of the motif ends up being simply 2w minus H of p. Did anyone have a chance to do this? It's pretty simple -- has anyone done this? Can anyone show this? Want me to do it briefly? How many would like to actually see this derivation? It's very, very quick. Few people, OK. So I'll just do that really quick.

So summation Pk log Pk over qk equals -- so you rewrite it as a difference; the log of a quotient is the difference of the logs. So summation Pk log Pk plus summation Pk log qk. OK, and then the special case that we're dealing with here is that qk is equal to a quarter, if we're dealing with the simplest case of a one-base motif. And so you recognize that that first term is minus H of p, right? H of p is defined as minus that, so it's minus H of p. And this here, that's just a quarter. Log 2 of a quarter is minus 2. You can take the minus 2 outside of the sum, so you're ending up with minus 2 -- I'm sorry. How come Sally didn't correct me? Usually she catches these things.
So that's a minus there, right, because we're taking the difference. And so then we have a minus 2 that we're pulling out from this, and you're left with summation Pk. And summation Pk sums to 1. So that's just 1. And so this equals minus minus 2, or 2 minus H of p.

And there are many other results of this type that can be shown in information theory. Often there are simple results you can get just by splitting the expression into different terms and summing.

So another result that I mentioned earlier, without showing, is that if you have a motif, say, of length 2, the information content of that motif model can be broken into the information content of each position, if your model is such that the positions are independent. So you would have, in that case -- let's just take the entropy of a model on dinucleotides. That would be minus summation Pi Pj log Pi Pj, if you have a model where the two are independent, and this sum would be taken over both i and j.

And so if you want to show that this is equal to -- I claim that this is equal to the-- Anyway, if you have different positions, in general -- this would be the more general term, where you have two different compositions at the two positions of the motif -- then you can show that it's equal to basically the sum of the entropies at the two positions. OK, you do the same thing. You separate out the log of the product in terms of the sum of the logs, and then you do properties of summations until you get the answer. OK, so this is your homework, and obviously it won't be graded. But we'll check in next Thursday and see if anyone has questions with that.

So what is the use of relative entropy? The main use in bioinformatics is that it's a measure that takes into account non-uniform backgrounds. The standard definition of information basically works when the background is uniform, but falls apart when it's non-uniform.
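For reference, here is the board derivation above, plus the homework identity, written out. This is just the algebra described in the lecture, using base-2 logs, with q uniform over the 4 to the w possible w-mers in the first result, and two independent positions with compositions p and q in the second:

```latex
% Relative entropy with a uniform background, q_k = (1/4)^w:
D(p \| q) = \sum_k p_k \log_2 \frac{p_k}{q_k}
          = \sum_k p_k \log_2 p_k - \sum_k p_k \log_2 q_k
          = -H(p) - \log_2\!\left(4^{-w}\right) \sum_k p_k
          = 2w - H(p)

% Homework: entropy of an independent two-position model is additive
% (uses \sum_i p_i = \sum_j q_j = 1):
H = -\sum_{i,j} p_i q_j \log_2 (p_i q_j)
  = -\sum_{i,j} p_i q_j \left(\log_2 p_i + \log_2 q_j\right)
  = -\sum_i p_i \log_2 p_i - \sum_j q_j \log_2 q_j
  = H(p) + H(q)
```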
So if you have a very biased genome, like this one shown here, which is 75% A+T, then the information content of this motif -- which is P(C) equals 1 -- using the standard method would be two bits. But then that would predict, using the formula, that the motif occurs once every 2 to the information content bases. That would be 2 to the 2, which would be 4 bases, and that's clearly incorrect in this case. But the relative entropy -- if you do it, there will be four terms, but three of them just have a 0. And then one of them has a 1, so it's 1 times log of 1 over 1/8 in this case, and that will be equal to 3. And so the relative entropy clearly gives you a more sensible answer. It's a good measure for non-uniform backgrounds.

Questions about relative entropy?

All right, so then we said you can use a weight matrix, or a position-specific probability matrix, for a motif like this five-prime splice site motif, assuming independence between positions. But if that's not true, then a natural generalization would be an inhomogeneous Markov model. So now we're going to say that the base at position k depends on the base at position k minus 1, but not on anything before that. And so the probability of generating a particular sequence, S1 to S9, is now given by this expression here, where for every base after the first, you have a conditional probability. This is the conditional probability of seeing the base S2 at position minus 2, given that you saw S1 at position minus 3, and so forth. And again, you can take the log for convenience, if you like.

So I actually implemented both of these models. So just for thinking about it, if you want to implement this, you have parameters -- these conditional probability parameters -- and you estimate them as shown here. So remember, the conditional probability of A given B is the joint probability divided by the probability of B. And so in this case, that would be the joint probability of seeing C A at minus 3, minus 2, divided by the probability of seeing C at minus 3.
You could use the ratio of the frequencies or, in this case, the counts, because the normalization constant will cancel. Is that clear?

So I actually implemented both the weight matrix model and a first-order Markov model of five-prime splice sites, and scored some genomic sequence. And what you can see here -- the units are in 1/10th-bit units -- is that they both are partially successful in separating real five-prime splice sites, shown in black, from the background, shown in light bars. But in both cases, it's not a perfect separation. There's some overlap here. And if you zoom in there, you can see that the Markov model is a little bit better. It has a tighter tail on the left. So it's generally separating the true sites from the decoys a little bit better. Not dramatically better, but slightly better. Yes, question?

AUDIENCE: From the previous slide, could you clarify what the letter R and the letter S are?

PROFESSOR: Yes, sorry about that. R would be the odds ratio -- so it's the ratio of the probability of generating that sequence under the foreground model -- the plus model, we're calling it -- divided by the probability under the background, or minus, model. And then, I think I pointed out last time that when you take products of probabilities, they tend to get very small. This can cause computational problems. And so if you just take the log, you convert it into a sum. And so we'll often use score, or S, for the log of the odds ratio. Sorry, should have marked that more clearly.

So Markov models can improve performance when there is dependence, and when you have enough data to estimate the increased number of parameters. And it doesn't just have to be dependence on the previous base -- you can have a model where the probability of the next base depends on the two previous bases. That would be called a second-order Markov model, or in general, a kth-order Markov model. Sometimes, these dependencies actually occur in practice.
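A minimal sketch of how the two models just described can be trained from aligned sites and used to score a candidate sequence. The function names and the uniform 0.25 background are illustrative assumptions, not the implementation behind the slides, and for brevity it uses raw frequencies with no pseudocounts, so a zero count would make log2 fail (pseudocounts, discussed shortly, are the standard fix):

```python
from math import log2

def train_wmm(sites):
    # Weight matrix / position-specific probability matrix:
    # P(base) estimated independently at each of the w positions.
    w = len(sites[0])
    return [{b: sum(s[i] == b for s in sites) / len(sites) for b in "ACGT"}
            for i in range(w)]

def train_markov1(sites):
    # Inhomogeneous first-order conditionals for positions 1..w-1:
    # P(base at i | base at i-1) = count(dinucleotide) / count(previous base).
    w = len(sites[0])
    cond = []
    for i in range(1, w):
        table = {}
        for prev in "ACGT":
            n_prev = sum(s[i - 1] == prev for s in sites)
            table[prev] = {b: (sum(s[i - 1] == prev and s[i] == b
                                   for s in sites) / n_prev) if n_prev else 0.25
                           for b in "ACGT"}
        cond.append(table)
    return cond

def score_markov1(seq, first_pos, cond, q=0.25):
    # Log-odds score S: log2 P(seq | motif) - log2 P(seq | background),
    # with a background that emits every base with probability q.
    s = log2(first_pos[seq[0]] / q)
    for i in range(1, len(seq)):
        s += log2(cond[i - 1][seq[i - 1]][seq[i]] / q)
    return s
```

Here first_pos would be the first column of the weight matrix, since the first position has no predecessor, which is why that position only needs four parameters.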
With five-prime splice sites, it's a nice example, because there's probably a couple thousand of them in the human genome, and we know them very well, so you can make quite complex models and have enough data to train them. But in general, if you're thinking about modeling a transcription factor binding site, or something, often you might have dozens or, at best, hundreds of examples, typically. And so you might not have enough to train some of the larger models. So how many parameters do you need to fit a kth-order Markov model? So question first, yeah?

AUDIENCE: [INAUDIBLE] If you're comparing the first-order Markov models with WMM, what is WMM?

PROFESSOR: Weight matrix model, or position-specific probability matrix. Just a model with independence between the positions.

Coming back to this case. So let's suppose you are thinking about making a kth-order Markov model, because you do some statistical tests and you find there's some dependence between sets of positions in your motif. How many parameters would there be? So if you have an independence model -- a weight matrix, or position-specific probability matrix -- there are four parameters at each position, the probabilities of the four bases. Really there are only three free parameters, because the fourth is determined -- they sum to 1 -- but let's just think about it as four. Four parameters times the width of the motif.

So if I now go to a first-order Markov model, now there are more parameters, because I have these conditional probabilities at each position. So how many parameters are there? For a first-order Markov? How many do I need to estimate? Yeah, Kevin?

AUDIENCE: I think it would be 16 at each position.

PROFESSOR: Yeah, 16 at each position, except the first position, which has four. OK, and what about a second-order Markov model, where you condition on the two previous positions? 64, right?
Because you have two previous bases you're conditioning on -- that's 16 possible contexts times 4. And so in general, the formula is 4 to the k plus 1. This is really the issue -- if you have only 100 sequences, and you need to estimate 64 parameters at each position, you don't have enough data to estimate those. So you shouldn't use such a high-order model.

All right, so let's think about this -- what could happen if you don't have enough data to estimate parameters, and how can you get around that? So let's just take a very simple example. So suppose you were studying a new transcription factor. You had done some sort of pull-down assay, followed by, say, conventional sequencing, and identified 10 sequences that bind to that transcription factor. And these are the 10 sequences, and you align them. You see there is sort of a pattern there -- there's usually an A at the first position, and usually a C at the second, and so forth. And so you consider making a weight matrix model. Then you tally up -- there's eight A's, one C, one G, and no T's at the first position.

So how confident can you be that T is not compatible with binding of this transcription factor? Who thinks you can be very confident? Most of you are shaking your heads. So if you're not confident, why are you not confident? I think -- wait, were you shaking your head? What's the problem here? It's just too small a sample, right? Maybe T occurs rarely. So suppose that T occurs at a frequency of 10% in natural sequences, and we just have a random sample of those. What's the probability we wouldn't see any T's in a sample of size 10? Anyone have an idea? Anyone have a ballpark number on this? Yeah, Simona?

AUDIENCE: 0.9 to the 10th.

PROFESSOR: 0.9 to the 10th, OK. And what is that?

AUDIENCE: 0.9 is the probability that you grab one and don't see a T, and then you do that 10 times.

PROFESSOR: Yeah, exactly.
In general it's a binomial thing, but it works out to be 0.9 to the 10th. And that's roughly -- this is like a Poisson. There's a mean of 1, so it's roughly e to the minus 1, so about a 35% chance that you don't see any T's. So we really shouldn't be confident. T probably doesn't have a frequency of 0.5, but it could easily have a frequency of 10%, or even 5%, or even 15%, and you might have just not seen it. So you don't want to assign a probability of 0 to T. But what value should you assign for something you haven't seen? Sally?

So it turns out there is a principled way to do this, called pseudocounts. So basically, if you use maximum likelihood estimation -- maximum likelihood, it turns out, is equal to the observed frequency. But if you assume that the true frequency is unknown, but was sampled from all possible, reasonable frequencies -- so that's a Dirichlet distribution -- then you can calculate what the posterior distribution is in a Bayesian framework: given that you observed, for example, zero T's, what's the distribution of that parameter, the frequency of T? And it turns out it's equivalent to adding a single count to each of your bins.

I'm not going to go through the derivation, because it takes time, but it is well described in the appendix of a book called Biological Sequence Analysis, published about 10, 15 years ago by a number of leaders in the field -- Durbin, Eddy, Krogh, and Mitchison. And there's also a derivation of this in the probability and statistics primer.

So basically, you just do this posterior calculation, and it turns out to be equivalent to adding 1 count. So when you add 1 count -- and then, of course, you re-normalize, and then you get a frequency -- what it effectively does is reduce the frequency of the things that you observe most commonly, and boost up the things that you don't see, so that you actually end up assigning a probability of 0.07 to T.
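Both calculations just described fit in a few lines; this is a sketch assuming the uniform add-one (Laplace) pseudocount from the lecture:

```python
from math import exp

# Chance of seeing zero T's in 10 draws when the true T frequency is 10%:
p_no_T = 0.9 ** 10        # about 0.349
poisson = exp(-1)         # Poisson approximation, mean 1: about 0.368 (~35%)

# Add-one pseudocounts for the observed column: 8 A, 1 C, 1 G, 0 T.
counts = {"A": 8, "C": 1, "G": 1, "T": 0}
total = sum(counts.values()) + 4              # one pseudocount per base
probs = {b: (n + 1) / total for b, n in counts.items()}
# probs["T"] is 1/14, about 0.07, matching the value quoted above.
```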
Now, if you had a larger sample -- so let's imagine instead of 8, 1, 1, 0, it was 80, 10, 10, 0 -- you still add a single count. So you can see, in that case, you're only going to be adding a very small amount, close to 1%, for T. So as you get more data, it converges to the maximum likelihood estimate. But it does something more reasonable, more open-minded, in a case where you're really limited in terms of data.

So the limitation -- you always want to be aware, when you're considering going to a more complex model to get better predictability, of how much data you have, and whether you have enough to accurately estimate the parameters. And if you don't, you either simplify the model, or, if you can't simplify it anymore, consider using pseudocounts. Sometimes you'll see smaller pseudocounts added -- like instead of 1, 1, 1, 1, you might see a quarter: one pseudocount distributed across the four bins. There are arguments pro and con, which I won't go into.

So for the remainder of today, I want to introduce hidden Markov models. We'll talk about some of the terminology, some applications, and the Viterbi algorithm -- which is a core algorithm when using HMMs to predict things -- and then we'll give a couple of examples. So we'll talk about the CpG island HMM, which is about the simplest HMM I could think of, which is good for illustrating the mechanics of HMMs. And then a couple later, probably coming into the next lecture, some examples of real-world HMMs, like one that predicts transmembrane helices.

So some background reading for today's lecture is posted on the course website: there's a Nature Biotechnology primer on HMMs, and there's a little bit in the textbook. But really, if you want to understand the guts of HMMs, you should read the Rabiner tutorial, which is really pretty well done. For Thursday's lecture, I will post another of these Nature Biotechnology primers, on RNA folding.
This one actually has a little bit more content -- it probably takes a little bit longer to absorb than some of the others, but it's still a good introduction to the topic. And then it turns out the text has a pretty good section on RNA folding, so take a look at chapter 11.

So hidden Markov models can be thought of as a general approach for modeling sequence labeling problems. You have sequences -- they might be genomic sequences, protein sequences, RNA sequences. And these sequences have features -- promoters; they may have domains, et cetera, linear motifs. And you want to label those features in an unknown sequence. So a classical example would be gene finding. You have a genomic sequence; some parts are, say, exons, some are introns. You want to be able to label them; it's not known. But you might have a training set of known exons and introns, and you might learn what the sequence composition of each of those labels looks like, and then make a model that ties it all together.

And what HMMs allow you to do, though, is to have transition probabilities between the different states. You can model states, you can model the length of different types of states to some extent -- as we'll see -- and you can model which states need to follow other states. They're relatively easy to design; you can just simply draw a graph. It can even have cycles in it; that's OK. And they've been described as the LEGOs of computational sequence analysis.

They were developed originally in electrical engineering four or five decades ago for applications in voice recognition, and they're still used in voice recognition. So when you are calling up some large corporation, and instead of a person answering the phone, some computer answers the phone and attempts to recognize your voice, it could well be an HMM on the other end, which is either correctly recognizing what you're saying, or not. So you can thank them or blame them, as you wish.
410 00:22:32,990 --> 00:22:37,870 All right, so Markov Model example-- we did this before, 411 00:22:37,870 --> 00:22:42,330 imagine the genotype at a particular locus, 412 00:22:42,330 --> 00:22:48,070 and successive generations is thought of as a Markov chain. 413 00:22:48,070 --> 00:22:51,030 Bart's genotype depends on Homer's, but is conditionally 414 00:22:51,030 --> 00:22:55,090 independent of Grandpa Simpson's, given Homer's. 415 00:22:55,090 --> 00:22:56,990 So now what's a hidden Markov model? 416 00:22:56,990 --> 00:23:01,830 So imagine that our DNA sequencer is not 417 00:23:01,830 --> 00:23:03,350 working that week, we can't actually 418 00:23:03,350 --> 00:23:05,910 go in and measure the genotype. 419 00:23:05,910 --> 00:23:10,260 But instead, we're going to observe some phenotype that's 420 00:23:10,260 --> 00:23:12,470 dependent on genotype. 421 00:23:12,470 --> 00:23:17,680 But it's not dependent in a deterministic way, 422 00:23:17,680 --> 00:23:20,180 it's dependent in a more complex way, because there's 423 00:23:20,180 --> 00:23:23,390 an impact of environment, as well, let's say. 424 00:23:23,390 --> 00:23:28,710 So we're imagining that your genotype at the apolipoprotein 425 00:23:28,710 --> 00:23:31,140 locus is correlated with cholesterol, 426 00:23:31,140 --> 00:23:32,970 but doesn't completely predict it. 427 00:23:32,970 --> 00:23:36,690 So you're homozygous, you tend to have higher LDL cholesterol 428 00:23:36,690 --> 00:23:38,200 than you are heterozygous. 429 00:23:38,200 --> 00:23:39,990 But there's a distribution depending 430 00:23:39,990 --> 00:23:43,690 on how many doughnuts you eat, or something like that. 431 00:23:43,690 --> 00:23:47,160 Imagine that we observe that grandpa 432 00:23:47,160 --> 00:23:51,560 had low cholesterol, 150, Homer had high cholesterol, 433 00:23:51,560 --> 00:23:56,020 and Bart's cholesterol is intermediate. 434 00:23:56,020 --> 00:23:59,550 Now if we had just observed Bart's cholesterol, 435 00:23:59,550 --> 00:24:04,170 we would say, well, it could go either way. 436 00:24:04,170 --> 00:24:07,720 It could be homozygous or heterozygous. 437 00:24:07,720 --> 00:24:09,750 You would just look at the population frequency 438 00:24:09,750 --> 00:24:13,450 of those two, and would use that to guess. 439 00:24:13,450 --> 00:24:16,600 But remember, we know his father's cholesterol, which 440 00:24:16,600 --> 00:24:19,990 was 250, makes it much more likely 441 00:24:19,990 --> 00:24:25,714 that his father was homozygous, and then that, in turn, biases 442 00:24:25,714 --> 00:24:27,380 the distribution [? of it. ?] So that'll 443 00:24:27,380 --> 00:24:30,560 make it a little bit more likely that Bart, himself, is 444 00:24:30,560 --> 00:24:32,100 homozygous, if you didn't know. 445 00:24:32,100 --> 00:24:37,590 So this is the basic idea-- you have some observable phenotype, 446 00:24:37,590 --> 00:24:41,080 if you will, that depends, in a probabilistic way, 447 00:24:41,080 --> 00:24:42,600 on something hidden. 448 00:24:42,600 --> 00:24:48,180 And that hidden thing has some dependent structure to it. 449 00:24:48,180 --> 00:24:50,720 And you want to, then, predict those hidden states 450 00:24:50,720 --> 00:24:52,040 from the observable data. 451 00:24:52,040 --> 00:24:54,640 So we'll give some more examples coming up. 452 00:24:54,640 --> 00:24:57,670 And the way to think about these models, or at least a handy way 453 00:24:57,670 --> 00:25:01,110 to think about them, is as generative models. 
And so this is from the Rabiner tutorial -- you imagine an HMM being used to generate observable sequences. So there are these hidden states -- think of them as genotypes -- and observables -- think of them as the cholesterol levels. So the way that it works is: you choose an initial state from one of your possible hidden states, according to some initial distribution, and you set the time variable equal to 1. In this case it's t, which will, in our case, often be the position in the sequence. And then you choose an observed value, according to some probability distribution that depends on what that hidden state was. And then you transition to a new state, and then you emit another one.

So we'll do an example. Let's say bacterial gene finding is our application, and we're going to model a bacterial gene -- these are protein-coding genes, so it's got to have a start codon, it's got to have an open reading frame, and then it's got to have a stop codon. So how many different states do we need in our HMM? What should our states be? Anyone? Do you want to make that -- Tim?

AUDIENCE: Maybe you need four states, because the start state, the orf state, the stop state, and the non-genic state.

PROFESSOR: OK -- start, orf, stop, and then intergenic, or non-genic. OK, now remember, these are the hidden states, so what are they going to emit? They emit observable data; what's that observable data going to be? Sequence, OK. And how many bases of sequence should each of them emit?

AUDIENCE: Well, I guess we don't know.

PROFESSOR: You have a choice. You're the model builder; you can do anything you want. 1, 5, 10 -- any number of bases you want. And they can emit different things, if you want. This is generative; you can do anything you want -- there will be consequences later, but for now -- I'm going to call this -- go ahead.

AUDIENCE: You could start with the start and the stop states maybe being three.
PROFESSOR: Three, OK. So this is going to emit three nucleotides. How about this state? What should this emit?

AUDIENCE: Any number.

PROFESSOR: Any number? Yeah, OK -- Sally?

AUDIENCE: If you let it emit one number, and then add a self-cycle, then that would work.

PROFESSOR: So Sally wants to have this state emit one nucleotide, but she wants it to have a chance of returning to itself, so that then we can have strings of N's to represent intergenic sequence. Does that make sense? And these, I agree -- three is a good choice here. If you had this one emit three as well, then your genes would have to be a multiple of three apart from each other, which isn't realistic. You would miss out on some genes that way. So this has to be able to emit arbitrary numbers. So you could have it emit an arbitrary number, but it's going to turn out to make the Viterbi algorithm easier if it just emits one and recurs, as Sally suggested.

And then we have our orf state. So how about here? What should we do here?

AUDIENCE: It can be three, and then you put the circle [INAUDIBLE].

PROFESSOR: So I'm going to change the name to codon, because it's going to emit one codon -- three nucleotides -- and then recur to itself. And now, what transitions should we allow between states?

AUDIENCE: So start to orf, orf to stop, then stop to N, and then N to start.

PROFESSOR: Any others? Yeah?

AUDIENCE: N could go to stop, as well.

PROFESSOR: I'm sorry, N could go to stop?

AUDIENCE: Yeah, so that the gene [INAUDIBLE].

PROFESSOR: OK, so that's a question. We're thinking of a gene on the plus strand; a gene could well be on the opposite strand. And so we should probably make a model of where you would hit stop on the other strand, which would emit a triplet of the reverse complement of the stop codon, [INAUDIBLE] et cetera. That's true -- excellent point.
And then you would traverse this whole circle in the opposite direction. But it wouldn't be the same state. It would be stop -- because it would emit different things. So you'd have minus stop -- stop, minus strand. And then you'd have some other states there. And I'm not going to draw those, but that's a point. And you could have a teeny one-codon gene if you want, but it's probably not worth it.

All right, everyone have an idea about this HMM? So this is a model you have to specify in order for it to actually generate sequence. This model will actually generate annotations and sequence. You have to specify where to start -- so you have to have some probability of starting, whether the first base that you're going to generate is going to be intergenic, or start, or codon, et cetera. And you might give it a high probability of this, and then it'll generate a label -- so, for example, let's say N. And then it'll generate a base, let's say G. And then you look at these probabilities -- the transition probability here, versus this -- and you either generate another N, or you generate a start. And let's say you go to start; then you'll generate three bases, so A T G. And then you would go to the codon state, you would emit some other triplet, and so forth.

So this is a model that will generate strings of annotations with associated bases. It still doesn't predict gene structure yet, but at least it generates gene structures.

All right, so for the sake of illustrating the Viterbi algorithm, we're going to use a simpler HMM than that. So this one only has two states, and its purpose is to predict CpG islands in a vertebrate genome. So what are CpG islands? Anyone remember? What is a CpG island? Anyone heard of this? I'm sure some of you have. Well, the definition here is going to be regions of high C+G content, and relatively high abundance of CpG dinucleotides, which are unmethylated. So what is the p here?
The p means that the CG we're talking about is C followed by G along the particular DNA strand, just to distinguish it from C base-paired with G. We're not talking about a base pair here; we're talking about C and G following each other along the strand. So this dinucleotide is rare in vertebrate genomes, because CpG is the site of a methylase, and methylation of the C is mutagenic -- it leads to a much higher rate of mutation. So CpGs often mutate away, except for the ones that are necessary. But there are certain regions, often near promoters, that are unmethylated, and therefore CpGs can accumulate to higher frequencies. And so you can actually look for these regions and use them to predict where promoters are. That's one application.

So they have higher CpG dinucleotide content, and also higher C and G content. The background of the human genome is only about 40% C+G, so it's a bit AT-rich, and so you see these patches of, say, 50% to 60% C+G that are often associated with promoters -- with the promoters of roughly half of human genes.

So we're going to -- I always drop that little clicker thing. Here it is. We're going to make a model of these, and then run it to predict promoters in the genome. So here's our model. We have two states: we have a genome state -- this sort of generic position in the genome -- and then we have an island state. We have the simplest possible transitions: you can go genome to genome, genome to island, island to genome, or island to island. So now you can generate islands of arbitrary size, interspersed with genomic regions of arbitrary size. And then each of those hidden states is going to emit a single base. So a CpG island in this model is a stretch of I states in a row, flanked by G states, if you will. Everyone clear on this setup? Good.

So here, in order to fully specify the model, you need to say what all the parameters are. And there are really three classes of parameters.
There are initiation probabilities -- so the green here is the notation used in the Rabiner tutorial, except they call them pi j's. So here, I'm going to say there's a 99% chance you start in the generic genome state, and a 1% chance you start in an island state, because islands are not that common.

And then you need to specify transition probabilities. So there are four possible transitions you could make, and you need to assign probabilities to them. So if the average length of an island were 1,000 bases, then a reasonable value for the I-to-I transition would be 0.999. You have a 99.9% chance of emitting another island base, and a 0.1% chance of leaving that island state. If you just run that in this generative mode, it would generate a variety of lengths of islands, but on average, they'd be about one kb long, because the probability of terminating is one in 1,000. And then if we imagine that those one-kb islands are interspersed with genomic regions that are about, say, 100 kilobases long on average, then you would get this five-nines probability for P G G, and 10 to the minus fifth as the probability of going from genome to island. That would generate widely spaced islands, on average 100 kb apart and about one kb in length. Is that making sense?

And now, the third type of probability we need to specify are called emission probabilities, which are the b j of k in Rabiner notation. And this is where the predictive power is going to come in. There has to be a difference in the emissions if you're going to have any ability to predict these features, and so we're going to imagine that the genome is 40% C+G, and islands are 60% C+G. So it's base composition that we're modeling. We're not doing the dinucleotides here; that would make it more complicated. We're just looking for patches of high G+C content.

So now we've fully specified our model.
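In code, the fully specified model and the choose-emit-transition recipe from the Rabiner description fit in a few lines. This is a sketch, not the course's implementation, and the per-base emission probabilities are one assumed way of realizing "40% versus 60% C+G":

```python
import random

# Two-state CpG-island HMM with the parameters given above.
states = ["G", "I"]                        # G = genome, I = island
init = {"G": 0.99, "I": 0.01}              # initiation probabilities
trans = {"G": {"G": 0.99999, "I": 0.00001},
         "I": {"G": 0.001, "I": 0.999}}
emit = {"G": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},   # 40% C+G
        "I": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}}   # 60% C+G

def draw(dist):
    # Sample one outcome from a {value: probability} dict.
    return random.choices(list(dist), weights=list(dist.values()))[0]

def generate(n):
    # Run the model generatively: hidden labels and observed bases.
    h = draw(init)
    labels, seq = [], []
    for _ in range(n):
        labels.append(h)
        seq.append(draw(emit[h]))
        h = draw(trans[h])
    return "".join(labels), "".join(seq)
```

Run generatively for long enough, this produces islands averaging about one kb, spaced about 100 kb apart, as described above.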
The problem here is that the model is written as the hidden generating the observable, and the problem we're faced with, in practice, is that we have the observable sequence, and we want to go back to the hidden. So we need to reverse the conditioning that's in the model. So when you see this type of problem, how do you reverse conditioning? In general, what's a good way to do it? You see P of A given B, but the model you have is written as P of B given A. What do you do? What's that?

AUDIENCE: Bayes' theorem.

PROFESSOR: Yeah, Bayes' theorem. Right, let's do Bayes' theorem here. Remember the definition of conditional probability. If we have P of A given B -- so this might be the hidden states, given the observables -- we want to write that in terms of P of B given A. So what do we do first? How do we derive Bayes' rule? You first write the definition of conditional probability -- right, that's just the definition. And now what do I do? Split the top part into what?

AUDIENCE: P of A times P of B given [INAUDIBLE].

PROFESSOR: P of B given A. That's just another way of writing the joint probability P of A and B, using the definition of conditional probability again. So now it's written the other way. That's basically the idea. So this is the simple form. And like I said, I don't usually call it a theorem, because it's so simple -- it's something you can derive in 30 seconds. It should maybe be called a rule, or something.

There is a more general form. The simple form is for where you have two states -- basically, B or not B -- that you're dealing with. And there's this more general form that's shown on the slide, which is for when you have many states. And it's basically the same idea; it's just that we've rewritten this term, this P of B, and split it up into all the possible states. The slide starts from P of B given A and goes the other way -- anyway, you rewrite the bottom term as a sum over all the possible cases.

All right, so how does that apply to HMMs?
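For reference before applying it, here is the board derivation and the slide's many-state form, written out:

```latex
% Definition of conditional probability, then factor the joint the other way:
P(A \mid B) = \frac{P(A, B)}{P(B)} = \frac{P(A)\,P(B \mid A)}{P(B)}

% General form from the slide: expand P(B) over all possible states A_1, ..., A_n:
P(A_i \mid B) = \frac{P(A_i)\,P(B \mid A_i)}{\sum_{j=1}^{n} P(A_j)\,P(B \mid A_j)}
```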
734 00:39:37,510 --> 00:39:44,670 So with HMMs, we're interested in the joint probability 735 00:39:44,670 --> 00:39:47,960 of a set of hidden states, and a set of observable states. 736 00:39:47,960 --> 00:39:52,440 So H, capital H, is going to be a vector that 737 00:39:52,440 --> 00:39:58,415 specifies the particular hidden state-- for instance, island 738 00:39:58,415 --> 00:40:03,060 or genome-- at position 1, that's H1 all the way to H N. 739 00:40:03,060 --> 00:40:07,760 So little h's are specific values for those hidden state. 740 00:40:07,760 --> 00:40:11,400 And then, big O is a vector that describes 741 00:40:11,400 --> 00:40:14,440 the different bases in the genome. 742 00:40:14,440 --> 00:40:16,410 So O1 is the first base in the genome, 743 00:40:16,410 --> 00:40:24,310 up to O N. One can imagine comparing two H vectors, 744 00:40:24,310 --> 00:40:29,090 one of which, H versus the H primes, what's the what's 745 00:40:29,090 --> 00:40:33,010 the probability of this hidden state versus that? 746 00:40:33,010 --> 00:40:36,500 You could compare them in terms of their joint probabilities 747 00:40:36,500 --> 00:40:38,450 with this model, and perhaps favor 748 00:40:38,450 --> 00:40:42,770 those that have higher probabilities. 749 00:40:42,770 --> 00:40:43,810 Yeah? 750 00:40:43,810 --> 00:40:45,780 AUDIENCE: [INAUDIBLE] of the two capital 751 00:40:45,780 --> 00:40:49,674 H's have any different notation? 752 00:40:49,674 --> 00:40:52,090 Like the second one being H prime, or something like that? 753 00:40:52,090 --> 00:40:53,256 Or are they supposed to be-- 754 00:40:53,256 --> 00:40:55,607 PROFESSOR: With the H, in this case, 755 00:40:55,607 --> 00:40:57,940 these are probability statements about random variables. 756 00:40:57,940 --> 00:41:00,120 So H is a random variable, which could 757 00:41:00,120 --> 00:41:02,860 assume any possible sequence of hidden states. 758 00:41:02,860 --> 00:41:08,230 The little h's are specific values. 759 00:41:08,230 --> 00:41:11,190 So for instance, imagine comparing 760 00:41:11,190 --> 00:41:19,910 what's the probability of H equals genome, genome, 761 00:41:19,910 --> 00:41:23,160 genome, versus the probability that H 762 00:41:23,160 --> 00:41:26,210 equals genome, genome, island. 763 00:41:26,210 --> 00:41:28,400 So the little h's, or the little h primes, 764 00:41:28,400 --> 00:41:30,430 are specific instances. 765 00:41:30,430 --> 00:41:33,010 The H's is a random variable, unknown. 766 00:41:33,010 --> 00:41:34,211 Does that help? 767 00:41:41,280 --> 00:41:43,840 OK, so how do we apply Bayes' rule? 768 00:41:43,840 --> 00:41:46,920 So what we're interested in here is the probability 769 00:41:46,920 --> 00:41:50,380 that H, this unknown variable that represents hidden states, 770 00:41:50,380 --> 00:41:53,330 that it equals a particular set of hidden states, 771 00:41:53,330 --> 00:41:57,040 little h1 to h N, given the observables, little 772 00:41:57,040 --> 00:42:01,200 o1 to little oN, which is the actual sequence that we see. 773 00:42:01,200 --> 00:42:05,210 And we can write that using definition 774 00:42:05,210 --> 00:42:07,710 of conditional probability as the joint probability 775 00:42:07,710 --> 00:42:10,370 patient of H and O, over the probability of O. 776 00:42:10,370 --> 00:42:13,430 And then Bayes' rule, we just apply conditional probability 777 00:42:13,430 --> 00:42:13,930 again. 778 00:42:13,930 --> 00:42:20,960 It's P H times P O, given H, over P O. 
779 00:42:20,960 --> 00:42:25,590 So it turns out that this P O-- so what is P O 780 00:42:25,590 --> 00:42:30,000 equals O1 to O N in this model? 781 00:42:30,000 --> 00:42:33,720 Well, the model specifies how to generate the hidden states, 782 00:42:33,720 --> 00:42:37,300 and how the observables are generated from those hidden 783 00:42:37,300 --> 00:42:39,335 states. 784 00:42:39,335 --> 00:42:46,310 So P O is actually defined as the sum of P O comma H 785 00:42:46,310 --> 00:42:51,540 equals the first possible hidden state, plus the same term 786 00:42:51,540 --> 00:42:52,470 for the second. 787 00:42:52,470 --> 00:42:56,330 You have to sum over all the possible outcomes 788 00:42:56,330 --> 00:42:58,790 of the hidden states, every possible thing. 789 00:42:58,790 --> 00:43:00,650 So if we have a sequence of length three, 790 00:43:00,650 --> 00:43:04,120 you have to sum over the possibility 791 00:43:04,120 --> 00:43:11,850 that H might be G G G or G G I, or G I G, or G I I, 792 00:43:11,850 --> 00:43:15,400 or I G G, et cetera. 793 00:43:19,760 --> 00:43:23,417 You have to sum over eight possibilities here. 794 00:43:23,417 --> 00:43:25,000 And if the sequence is a million long, 795 00:43:25,000 --> 00:43:28,500 you have to sum over 2 to the one millionth possibilities. 796 00:43:28,500 --> 00:43:31,610 That sounds complicated to calculate. 797 00:43:31,610 --> 00:43:34,080 So it turns out that there's actually a trick, 798 00:43:34,080 --> 00:43:35,830 and you can calculate it. 799 00:43:35,830 --> 00:43:37,950 But you don't have to. 800 00:43:37,950 --> 00:43:40,410 That's one of the good things: 801 00:43:40,410 --> 00:43:42,550 we can just treat it as a constant. 802 00:43:42,550 --> 00:43:47,780 So notice that the denominator here is independent of the H's. 803 00:43:47,780 --> 00:43:50,320 So we'll just treat that as a constant, an unknown constant. 804 00:43:50,320 --> 00:43:55,720 And what we're interested in is which possible value of H 805 00:43:55,720 --> 00:43:56,900 has a higher probability. 806 00:43:56,900 --> 00:44:04,080 So we're just going to try to maximize P H equals H1 to H N-- 807 00:44:04,080 --> 00:44:06,590 find the optimal sequence of hidden states 808 00:44:06,590 --> 00:44:09,310 that optimizes that joint probability, 809 00:44:09,310 --> 00:44:12,600 the joint probability with the observable values, 810 00:44:12,600 --> 00:44:19,360 O1 to O N. Is that making sense? 811 00:44:19,360 --> 00:44:22,350 Basically, we want to find the sequence of hidden states, 812 00:44:22,350 --> 00:44:24,540 we'll call it H opt. 813 00:44:24,540 --> 00:44:27,030 So now H opt here is a particular vector. 814 00:44:27,030 --> 00:44:29,460 Capital H, by itself, is a random vector. 815 00:44:29,460 --> 00:44:33,110 This is now a particular vector of hidden states, 816 00:44:33,110 --> 00:44:35,430 H1 opt through H N opt. 817 00:44:35,430 --> 00:44:41,350 And it's defined as the vector of hidden states 818 00:44:41,350 --> 00:44:46,330 that maximizes the joint probability with O equals 819 00:44:46,330 --> 00:44:53,220 O1 to O N, where that's the observed sequence that we're 820 00:44:53,220 --> 00:44:55,160 dealing with. 821 00:44:55,160 --> 00:44:59,450 So now what I'm telling you is if we 822 00:44:59,450 --> 00:45:01,960 can find the vector of hidden states 823 00:45:01,960 --> 00:45:03,830 that maximizes the joint probability, 824 00:45:03,830 --> 00:45:08,520 then that will also maximize the conditional probability of H 825 00:45:08,520 --> 00:45:14,680 given O.
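For completeness, the brute-force version of that sum can be written directly. This hypothetical sketch reuses the joint_prob helper from the earlier sketch and is only feasible for toy lengths, which is exactly the point; it also makes explicit why the denominator can be ignored when ranking hidden-state sequences.

```python
from itertools import product

def marginal_prob(obs, states, init, trans, emit):
    """P(O = obs): sum the joint probability over every possible hidden path.
    For K states and length N there are K**N terms (2**N with genome/island),
    so this brute force is for toy examples only; the forward algorithm is
    the 'trick' that avoids it.  Key point for what follows: this quantity
    does not depend on which h we are scoring, so for finding the best h
    it is a constant we can drop -- argmax_h P(h | obs) = argmax_h P(h, obs)."""
    return sum(joint_prob(h, obs, init, trans, emit)
               for h in product(states, repeat=len(obs)))
```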
And here the language of linguistics 826 00:45:14,680 --> 00:45:18,420 is often used: it's called the optimal parse of the sequence. 827 00:45:18,420 --> 00:45:21,980 You'll see that sometimes, I might say that. 828 00:45:21,980 --> 00:45:28,510 So the solution is to define these variables, 829 00:45:28,510 --> 00:45:35,610 R I of H, which are defined as the probability 830 00:45:35,610 --> 00:45:37,730 of the optimal parse of the subsequence from one 831 00:45:37,730 --> 00:45:40,980 to I-- not the whole long sequence, 832 00:45:40,980 --> 00:45:43,380 but a little piece of it from the beginning 833 00:45:43,380 --> 00:45:47,620 to a particular place in the middle, that ends in state H. 834 00:45:47,620 --> 00:45:48,640 And so first, 835 00:45:48,640 --> 00:45:52,580 we calculate R 1 of state 1, the probability 836 00:45:52,580 --> 00:45:54,320 of generating the first base 837 00:45:54,320 --> 00:45:56,420 and ending in hidden state 1. 838 00:45:56,420 --> 00:45:59,040 And then we would do the hidden state 2, 839 00:45:59,040 --> 00:46:02,110 and then we basically have to figure out a way, a recursion, 840 00:46:02,110 --> 00:46:07,800 for getting the probabilities of the optimal parses ending 841 00:46:07,800 --> 00:46:10,850 at each of the states at position 2, 842 00:46:10,850 --> 00:46:12,334 given the values at position 1. 843 00:46:12,334 --> 00:46:14,000 And then we go, work our way all the way 844 00:46:14,000 --> 00:46:16,250 down to the end of the sequence. 845 00:46:16,250 --> 00:46:18,110 And then we'll figure out which is better, 846 00:46:18,110 --> 00:46:19,940 and then we'll backtrack to figure out 847 00:46:19,940 --> 00:46:23,460 what that optimal parse was. 848 00:46:23,460 --> 00:46:25,000 We'll do an example on the board, 849 00:46:25,000 --> 00:46:31,180 this is unlikely to be completely clear at this point. 850 00:46:31,180 --> 00:46:32,800 But don't worry. 851 00:46:32,800 --> 00:46:35,770 So why is this called the Viterbi algorithm? 852 00:46:35,770 --> 00:46:38,320 Well, this is the guy who figured it out. 853 00:46:38,320 --> 00:46:40,590 He was actually an MIT alum. 854 00:46:40,590 --> 00:46:44,122 He did his bachelor's and master's in double E, 855 00:46:44,122 --> 00:46:48,220 I don't know, quite a while ago, '50s or '60s. 856 00:46:48,220 --> 00:46:51,210 And later went on to found Qualcomm, 857 00:46:51,210 --> 00:46:57,060 and is now a big philanthropist, who apparently supports USC. 858 00:46:57,060 --> 00:47:00,540 I don't know why he lost his loyalty to MIT, 859 00:47:00,540 --> 00:47:03,870 but maybe he'll come back and give us a seminar. 860 00:47:03,870 --> 00:47:06,810 I actually met him once. 861 00:47:06,810 --> 00:47:08,840 Let's talk about his algorithm a little more. 862 00:47:08,840 --> 00:47:14,690 So what I want to do is I want to take a particular HMM, 863 00:47:14,690 --> 00:47:18,480 so we'll take our CpG island HMM, 864 00:47:18,480 --> 00:47:21,370 and then we'll go through the actual Viterbi 865 00:47:21,370 --> 00:47:26,140 algorithm on the board for a particular sequence. 866 00:47:26,140 --> 00:47:29,660 And you'll see that it's actually pretty simple. 867 00:47:29,660 --> 00:47:31,200 But then you'll also see that it's 868 00:47:31,200 --> 00:47:33,770 not totally obvious why it works.
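In symbols, the R variables and their recursion can be written as follows; this is a standard statement of the Viterbi recursion, reconstructed here (with pi for initial, a for transition, and e for emission probabilities) rather than copied from the slide:

$$R_1(h) = \pi_h \, e_h(o_1), \qquad R_i(h) = e_h(o_i)\,\max_{h'}\big[\,R_{i-1}(h')\,a_{h'h}\,\big], \qquad P(H^{\mathrm{opt}}, O) = \max_h R_N(h).$$

Each max also records which h' achieved it; those are the arrows that get circled in the board example below, and following them backward from the best final state recovers the optimal parse.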
869 00:47:36,500 --> 00:47:38,210 The mechanics of it are not that bad, 870 00:47:38,210 --> 00:47:41,690 but really understanding how it 871 00:47:41,690 --> 00:47:44,430 is able to come up with the optimal parses, that's 872 00:47:44,430 --> 00:47:49,270 the more subtle part. 873 00:47:49,270 --> 00:47:53,360 So let's suppose we have a sequence, 874 00:47:53,360 --> 00:47:58,620 A C G. Can anyone tell me what the optimal parse 875 00:47:58,620 --> 00:48:01,394 of this sequence is, without doing Viterbi? 876 00:48:04,240 --> 00:48:07,050 With this particular model, these initiation 877 00:48:07,050 --> 00:48:11,740 probabilities, transitions, and emissions? 878 00:48:11,740 --> 00:48:16,570 Do you know what it's going to be in advance? 879 00:48:16,570 --> 00:48:17,910 Any guesses? 880 00:48:17,910 --> 00:48:19,879 AUDIENCE: How about genome, island, island. 881 00:48:19,879 --> 00:48:21,295 PROFESSOR: Genome, island, island. 882 00:48:21,295 --> 00:48:24,220 Because, you're saying, that way the emissions 883 00:48:24,220 --> 00:48:26,010 will be optimized, right? 884 00:48:26,010 --> 00:48:28,400 Because you'll emit the C's and G's. 885 00:48:28,400 --> 00:48:29,680 OK, that's a reasonable guess. 886 00:48:29,680 --> 00:48:31,496 Sally's shaking her head, though. 887 00:48:31,496 --> 00:48:33,137 AUDIENCE: The transitional probability 888 00:48:33,137 --> 00:48:36,059 from being in the genome is very, very small, 889 00:48:36,059 --> 00:48:38,210 and so it's more likely that it'll either only 890 00:48:38,210 --> 00:48:39,960 be in the genome or only be in the island. 891 00:48:39,960 --> 00:48:41,585 PROFESSOR: So the transition from going 892 00:48:41,585 --> 00:48:44,220 from a genome to island or island to genome is very small, 893 00:48:44,220 --> 00:48:46,220 and so she's saying you're going to pay a bigger 894 00:48:46,220 --> 00:48:48,640 penalty for making that transition in there, that 895 00:48:48,640 --> 00:48:50,630 may not be offset by the emissions. 896 00:48:50,630 --> 00:48:51,780 Right, is that your point? 897 00:48:51,780 --> 00:48:53,820 Yeah, question? 898 00:48:53,820 --> 00:48:56,985 AUDIENCE: Check here-- when we're talking about the optimal 899 00:48:56,985 --> 00:49:01,700 parse, we're saying let's maximize the probability 900 00:49:01,700 --> 00:49:03,700 of that letter-- 901 00:49:03,700 --> 00:49:06,188 PROFESSOR: The joint probability. 902 00:49:06,188 --> 00:49:08,658 AUDIENCE: Sorry, the joint probability of that letter-- 903 00:49:08,658 --> 00:49:10,866 PROFESSOR: Of that [INAUDIBLE] state and that letter. 904 00:49:10,866 --> 00:49:12,116 AUDIENCE: OK, so that means-- 905 00:49:12,116 --> 00:49:14,092 PROFESSOR: Or that set of [INAUDIBLE] states 906 00:49:14,092 --> 00:49:15,670 and that set of bases. 907 00:49:15,670 --> 00:49:16,070 AUDIENCE: So when we're computing 908 00:49:16,070 --> 00:49:17,486 across this three-letter thing, we 909 00:49:17,486 --> 00:49:21,846 have to say the probability of the letter, then let's 910 00:49:21,846 --> 00:49:24,266 multiply it by the probability of the transition 911 00:49:24,266 --> 00:49:28,622 to the next letter, and then multiply it again 912 00:49:28,622 --> 00:49:31,410 [INAUDIBLE] and that letter. 913 00:49:31,410 --> 00:49:33,020 PROFESSOR: So let's do this. 914 00:49:33,020 --> 00:49:36,400 If A C G is our sequence-- I'm just 915 00:49:36,400 --> 00:49:38,290 going to space it out a little bit.
916 00:49:38,290 --> 00:49:43,970 Here's our A at position one, here's our C at position two, 917 00:49:43,970 --> 00:49:49,100 and our G at position three. 918 00:49:49,100 --> 00:49:51,440 And then we have our hidden states. 919 00:49:51,440 --> 00:49:58,100 And so we'll write genome first, and then we have island here. 920 00:50:01,370 --> 00:50:05,700 And so what is the optimal parse of the sequence from base 921 00:50:05,700 --> 00:50:10,920 one to base one, that ends in genome? 922 00:50:10,920 --> 00:50:12,800 It's just the one that starts in genome, 923 00:50:12,800 --> 00:50:15,430 because it doesn't go-- right, so it's just genome. 924 00:50:15,430 --> 00:50:17,920 And what is its probability? 925 00:50:17,920 --> 00:50:19,890 That's how this thing is defined here. 926 00:50:19,890 --> 00:50:24,870 This is this R I H thing I was talking about. 927 00:50:24,870 --> 00:50:28,230 This is H, and this is I here. 928 00:50:31,130 --> 00:50:35,130 So the probability of the optimal parse 929 00:50:35,130 --> 00:50:38,350 of the sequence, up to position one, that ends in genome, 930 00:50:38,350 --> 00:50:40,200 is just the one that starts in genome, 931 00:50:40,200 --> 00:50:41,651 and then emits that base. 932 00:50:41,651 --> 00:50:43,650 So what's the probability of starting in genome? 933 00:50:46,310 --> 00:50:49,270 It's five nines, right? 934 00:50:49,270 --> 00:50:54,130 So that's the initial probability-- 9 9 9 9 9. 935 00:50:54,130 --> 00:50:58,040 And then what's the probability of emitting an A, given 936 00:50:58,040 --> 00:51:00,810 that we're in the genome state? 937 00:51:00,810 --> 00:51:02,380 0.3. 938 00:51:02,380 --> 00:51:06,360 So I claim that this is the value 939 00:51:06,360 --> 00:51:11,370 of R1 of genome, of the genome state. 940 00:51:11,370 --> 00:51:12,590 OK, that's the optimal parse. 941 00:51:12,590 --> 00:51:14,006 There's only one parse, so there's 942 00:51:14,006 --> 00:51:17,080 nothing-- it is what it is. 943 00:51:17,080 --> 00:51:20,190 You start here, there's no transitions-- we started here-- 944 00:51:20,190 --> 00:51:22,930 and then you emit an A. 945 00:51:22,930 --> 00:51:26,070 What's the probability of the optimal parse 946 00:51:26,070 --> 00:51:29,149 ending in island at position one of the sequence? 947 00:51:29,149 --> 00:51:29,690 Someone else? 948 00:51:29,690 --> 00:51:31,391 Yeah, question? 949 00:51:31,391 --> 00:51:33,796 AUDIENCE: Why are we using the transition probability? 950 00:51:33,796 --> 00:51:37,171 PROFESSOR: This is the initial-- Oh, I'm sorry. 951 00:51:37,171 --> 00:51:37,670 Correct. 952 00:51:37,670 --> 00:51:40,286 Thank you, thank you-- what was your name? 953 00:51:40,286 --> 00:51:41,036 AUDIENCE: Deborah. 954 00:51:41,036 --> 00:51:41,910 PROFESSOR: Deborah, OK, thanks Deborah. 955 00:51:41,910 --> 00:51:43,451 It should be the initial probability, 956 00:51:43,451 --> 00:51:45,740 which is 0.99, good. 957 00:51:45,740 --> 00:51:47,160 Initial probability. 958 00:51:47,160 --> 00:51:47,910 What about island? 959 00:51:47,910 --> 00:51:50,062 Deborah, you want to take this one? 960 00:51:50,062 --> 00:51:51,970 AUDIENCE: 0.01. 961 00:51:51,970 --> 00:51:53,470 PROFESSOR: 0.01 to be in island. 962 00:51:53,470 --> 00:51:58,930 And what about the emission probability? 963 00:51:58,930 --> 00:52:00,840 We have to start in island, and then 964 00:52:00,840 --> 00:52:03,617 emit an A, with probability what? 965 00:52:03,617 --> 00:52:04,200 AUDIENCE: 0.2.
966 00:52:04,200 --> 00:52:06,810 PROFESSOR: 0.2, yeah, it's up on the screen. 967 00:52:06,810 --> 00:52:09,080 Should be, hopefully. 968 00:52:09,080 --> 00:52:10,900 Yeah, so 0.01 times 0.2-- 0.002. 969 00:52:10,900 --> 00:52:12,400 So who's winning so far? 970 00:52:12,400 --> 00:52:15,640 If the sequences ended at position one? 971 00:52:15,640 --> 00:52:16,690 Genome, genome's winning. 972 00:52:16,690 --> 00:52:18,190 This is a lot bigger than that, it's 973 00:52:18,190 --> 00:52:21,170 about 150 times bigger, right? 974 00:52:21,170 --> 00:52:25,210 Now what do we do when we go-- we said we have to do recursion, right? 975 00:52:25,210 --> 00:52:27,290 We have to figure out the probability 976 00:52:27,290 --> 00:52:30,840 of the optimal parse ending at position two in each 977 00:52:30,840 --> 00:52:34,480 of these states, given the optimal parse ending 978 00:52:34,480 --> 00:52:36,280 at position one. 979 00:52:36,280 --> 00:52:37,420 How do we figure that out? 980 00:52:43,594 --> 00:52:45,260 What are we going to write here, or what 981 00:52:45,260 --> 00:52:51,310 do we have to compare to figure out what to put here? 982 00:52:51,310 --> 00:52:53,200 There are two possible parses ending in genome 983 00:52:53,200 --> 00:52:54,880 at position two-- there's the one that 984 00:52:54,880 --> 00:52:56,770 started in genome at position one, 985 00:52:56,770 --> 00:52:58,700 and there's the one that started in island. 986 00:52:58,700 --> 00:53:02,680 So you have to compare this to this. 987 00:53:02,680 --> 00:53:06,200 So you compare what the probability of this parse 988 00:53:06,200 --> 00:53:09,110 was times the transition probability, 989 00:53:09,110 --> 00:53:12,740 and then the emission in that state. 990 00:53:12,740 --> 00:53:13,930 So what would that be? 991 00:53:16,780 --> 00:53:19,730 What would this be, if we stay in genome? 992 00:53:19,730 --> 00:53:22,380 What's the transition? 993 00:53:22,380 --> 00:53:25,010 Now we've got our five nines, yeah, good. 994 00:53:25,010 --> 00:53:26,430 So five nines. 995 00:53:26,430 --> 00:53:28,074 And the emission is what? 996 00:53:28,074 --> 00:53:30,490 AUDIENCE: Genome at 0.2. 997 00:53:30,490 --> 00:53:32,680 PROFESSOR: 0.2, right. 998 00:53:32,680 --> 00:53:35,860 And times this. 999 00:53:35,860 --> 00:53:37,070 And what about this one? 1000 00:53:37,070 --> 00:53:38,720 What are we going to multiply when 1001 00:53:38,720 --> 00:53:41,356 we consider this island to genome transition? 1002 00:53:41,356 --> 00:53:48,132 AUDIENCE: [INAUDIBLE] 0.01? 1003 00:53:48,132 --> 00:53:50,760 Because the genome is still 0.2. 1004 00:53:50,760 --> 00:53:53,096 PROFESSOR: It's still 0.2. 1005 00:53:53,096 --> 00:53:54,720 So we take the maximum of these, right? 1006 00:53:54,720 --> 00:53:56,678 We're doing optimal parse, highest probability. 1007 00:53:56,678 --> 00:53:58,920 So which one of these two turns out bigger? 1008 00:53:58,920 --> 00:54:00,450 Clearly, the top one is bigger. 1009 00:54:00,450 --> 00:54:01,590 This one is already bigger than this, 1010 00:54:01,590 --> 00:54:03,214 and we're multiplying by the same factors, 1011 00:54:03,214 --> 00:54:11,370 so clearly the answer here is 0.99 times 1012 00:54:11,370 --> 00:54:18,530 0.3 times-- my nines are going to have 1013 00:54:18,530 --> 00:54:24,250 to get really skinny here-- 0.99999 and 0.2. 1014 00:54:24,250 --> 00:54:25,150 That's the winner. 1015 00:54:25,150 --> 00:54:27,620 And the other thing we do, besides recording this number, 1016 00:54:27,620 --> 00:54:30,654 is we circle this arrow.
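A quick numeric check of the board computation so far, using the parameter values quoted in the lecture; the T emissions and the exact dictionary layout are assumptions filled in by symmetry, so treat this as a sketch rather than the slide's model.

```python
# Parameters as quoted in the lecture (T emissions inferred by symmetry).
init  = {"G": 0.99, "I": 0.01}
trans = {"G": {"G": 0.99999, "I": 0.00001},   # five nines; 10^-5 to island
         "I": {"G": 0.001,   "I": 0.999}}     # 10^-3 back; 0.999 to stay
emit  = {"G": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
         "I": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}}

# Position 1 (the base A): only one parse ends in each state.
R1 = {s: init[s] * emit[s]["A"] for s in "GI"}
# -> {'G': 0.297, 'I': 0.002}; genome is winning by about a factor of 150.

# Position 2 (the base C): max over the two possible predecessors,
# then multiply by the emission in the state we end in.
R2_G = max(R1["G"] * trans["G"]["G"], R1["I"] * trans["I"]["G"]) * emit["G"]["C"]
R2_I = max(R1["G"] * trans["G"]["I"], R1["I"] * trans["I"]["I"]) * emit["I"]["C"]
print(R1, R2_G, R2_I)  # stay-in-genome and island-to-island win the two maxes
```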
1017 00:54:30,654 --> 00:54:31,570 Does this ring a bell? 1018 00:54:31,570 --> 00:54:33,350 This is sort of like Needleman-Wunsch 1019 00:54:33,350 --> 00:54:36,710 or Smith-Waterman, where you don't just 1020 00:54:36,710 --> 00:54:38,330 record what's the best score, but you 1021 00:54:38,330 --> 00:54:39,454 remember how you got there. 1022 00:54:39,454 --> 00:54:41,280 We're going to need that later. 1023 00:54:41,280 --> 00:54:42,540 And what about here? 1024 00:54:42,540 --> 00:54:46,400 What's the optimal parse-- what's 1025 00:54:46,400 --> 00:54:48,340 the probability of the optimal parse ending 1026 00:54:48,340 --> 00:54:50,936 in island at position two? 1027 00:54:50,936 --> 00:54:52,602 Or what do I have to do to calculate it? 1028 00:55:00,314 --> 00:55:01,278 Sorry? 1029 00:55:01,278 --> 00:55:05,640 AUDIENCE: You have to calculate the [INAUDIBLE]. 1030 00:55:05,640 --> 00:55:09,290 PROFESSOR: Right, you consider going genome to island here, 1031 00:55:09,290 --> 00:55:10,630 and island to island. 1032 00:55:10,630 --> 00:55:12,510 And who's going to win that race? 1033 00:55:12,510 --> 00:55:13,412 Do you have an idea? 1034 00:55:16,860 --> 00:55:19,120 Genome to island had a head start, 1035 00:55:19,120 --> 00:55:23,590 but it pays a penalty for the transition. 1036 00:55:23,590 --> 00:55:27,510 The transition is pretty small, that's 10 to the minus fifth. 1037 00:55:27,510 --> 00:55:30,966 And what about this one? 1038 00:55:30,966 --> 00:55:33,150 This has a much higher transition probability 1039 00:55:33,150 --> 00:55:35,780 of 0.999. 1040 00:55:35,780 --> 00:55:38,090 And so even though you were starting from something bigger-- 1041 00:55:38,090 --> 00:55:40,550 this is about 150 times smaller than this-- 1042 00:55:40,550 --> 00:55:43,120 this is being multiplied by 10 to the minus fifth, 1043 00:55:43,120 --> 00:55:46,010 and this is being multiplied by something that's around 1. 1044 00:55:46,010 --> 00:55:49,060 So this one will win, island to island will win. 1045 00:55:49,060 --> 00:55:50,930 Everyone agree on that? 1046 00:55:50,930 --> 00:55:53,940 And what will that value be? 1047 00:55:53,940 --> 00:55:59,150 So it's whatever it was before times 1048 00:55:59,150 --> 00:56:02,034 the transition, which is what? 1049 00:56:02,034 --> 00:56:02,950 From island to island? 1050 00:56:05,590 --> 00:56:07,040 0.999. 1051 00:56:07,040 --> 00:56:08,250 Times the emission which is? 1052 00:56:11,870 --> 00:56:13,132 0.3. 1053 00:56:13,132 --> 00:56:14,590 Island is more likely to emit 1054 00:56:14,590 --> 00:56:17,270 a C. Everyone clear on that? 1055 00:56:20,850 --> 00:56:24,370 And then, we're not done until we circle this arrow here. 1056 00:56:24,370 --> 00:56:27,230 That was the winner, the winner was coming from island, 1057 00:56:27,230 --> 00:56:28,830 remaining on island. 1058 00:56:28,830 --> 00:56:32,292 And then we keep going like this. 1059 00:56:32,292 --> 00:56:33,750 Do you want me to do one more base? 1060 00:56:33,750 --> 00:56:35,583 How many people want me to do one more base, 1061 00:56:35,583 --> 00:56:37,696 and how many people want me to stop this? 1062 00:56:37,696 --> 00:56:39,970 I'll do one more base, but you guys 1063 00:56:39,970 --> 00:56:42,750 will have to help me a little bit.
1064 00:56:42,750 --> 00:56:44,960 Who is going to win-- now we want 1065 00:56:44,960 --> 00:56:49,120 the probability of the optimal parse ending in G, 1066 00:56:49,120 --> 00:56:54,300 ending at position three, which is a G, and ending in genome, 1067 00:56:54,300 --> 00:56:56,150 or ending in island. 1068 00:56:56,150 --> 00:57:02,840 So for ending in genome, where is that one going to come from? 1069 00:57:02,840 --> 00:57:04,730 Which is going to win? 1070 00:57:04,730 --> 00:57:08,530 This one, or this one? 1071 00:57:08,530 --> 00:57:09,770 AUDIENCE: Stay in genome. 1072 00:57:09,770 --> 00:57:11,769 PROFESSOR: Yeah, stay in genome is going to win. 1073 00:57:11,769 --> 00:57:13,920 This one is already bigger than this, 1074 00:57:13,920 --> 00:57:16,860 and the transition probability here-- this 1075 00:57:16,860 --> 00:57:19,860 is a 10 to the minus 3 transition probability. 1076 00:57:19,860 --> 00:57:22,400 And this is a probability that's near one, 1077 00:57:22,400 --> 00:57:25,320 so the transitions are going to dominate here. 1078 00:57:25,320 --> 00:57:29,720 And so you're going to have this term-- I'm 1079 00:57:29,720 --> 00:57:38,120 going to call that R 2 of G. That's this notation here. 1080 00:57:38,120 --> 00:57:40,920 And times the transition probability, 1081 00:57:40,920 --> 00:57:44,645 genome to genome, which is all these nines here. 1082 00:57:47,345 --> 00:57:48,720 And then the emission probability 1083 00:57:48,720 --> 00:57:51,540 of a G in the genome state is 0.2. 1084 00:57:51,540 --> 00:57:55,840 And who's going to win here for the optimal parse, ending 1085 00:57:55,840 --> 00:57:58,035 at position three, in the island state? 1086 00:58:02,990 --> 00:58:06,640 Is it going to be this guy, island to island, or this one, 1087 00:58:06,640 --> 00:58:09,150 changing from genome to island? 1088 00:58:09,150 --> 00:58:11,880 Island to island, because, again, the transition 1089 00:58:11,880 --> 00:58:14,210 probability is prohibitive-- that's a 10 1090 00:58:14,210 --> 00:58:16,740 to the minus fifth penalty there. 1091 00:58:16,740 --> 00:58:18,480 So you're going to stay in island. 1092 00:58:18,480 --> 00:58:21,670 So this one won here, and this one won here. 1093 00:58:21,670 --> 00:58:29,610 And so this term here is R 2 of island 1094 00:58:29,610 --> 00:58:32,480 times the transition probability, island to island, 1095 00:58:32,480 --> 00:58:36,810 which is 0.999 times the emission probability, which 1096 00:58:36,810 --> 00:58:41,230 is 0.3 of a G in the island state. 1097 00:58:41,230 --> 00:58:45,440 Everyone clear on that? 1098 00:58:45,440 --> 00:58:53,640 Now, if we went out another 20 bases, what's going to happen? 1099 00:58:53,640 --> 00:58:54,931 Probably not a lot. 1100 00:58:54,931 --> 00:58:58,040 Probably the same kind of stuff that's happening. 1101 00:58:58,040 --> 00:59:01,050 That seems kind of boring, but when would 1102 00:59:01,050 --> 00:59:04,590 we actually get a crossover? 1103 00:59:04,590 --> 00:59:05,722 What would it take? 1104 00:59:05,722 --> 00:59:08,934 To push you over and cause you to transition from one 1105 00:59:08,934 --> 00:59:09,578 to the other? 1106 00:59:12,952 --> 00:59:16,675 AUDIENCE: The odds slowly stacked against you for long enough? 1107 00:59:16,675 --> 00:59:18,550 PROFESSOR: Yeah, that's a good way to put it. 1108 00:59:18,550 --> 00:59:19,880 So let me give you an example. 1109 00:59:19,880 --> 00:59:27,690 This is the Viterbi algorithm, written out mathematically.
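As a complement to the slide, here is a minimal log-space sketch of the whole algorithm; this is an assumption-laden illustration (function name, data layout), not the slide's pseudocode. Logs are used so that the long sequences discussed next don't underflow to zero.

```python
from math import log

def viterbi(obs, states, init, trans, emit):
    """Log-space Viterbi: R[i][s] is the log probability of the optimal
    parse of obs[0..i] that ends in state s.  Runtime is O(K^2 L): for
    each of the L positions, every (state, predecessor) pair is tried."""
    # Initialization: R_1(h) = pi_h * e_h(o_1), in logs.
    R = [{s: log(init[s]) + log(emit[s][obs[0]]) for s in states}]
    back = [{}]                                  # back[i][s] = best predecessor
    for i in range(1, len(obs)):
        R.append({})
        back.append({})
        for s in states:
            # Best way to arrive in s: max over R_{i-1}(s') + log a_{s'->s}.
            prev = max(states, key=lambda p: R[i - 1][p] + log(trans[p][s]))
            R[i][s] = R[i - 1][prev] + log(trans[prev][s]) + log(emit[s][obs[i]])
            back[i][s] = prev                    # "circle the arrow" we came from
    # Termination: take the larger final entry, then follow the arrows back.
    state = max(states, key=lambda s: R[-1][s])
    path = [state]
    for i in range(len(obs) - 1, 0, -1):
        state = back[i][state]
        path.append(state)
    return "".join(reversed(path))

# With init/trans/emit from the sketch above, viterbi("ACGT" * 10000, "GI",
# init, trans, emit) comes out all G's, matching the example discussed next.
```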
1110 00:59:27,690 --> 00:59:29,750 We can go over this in a moment, but I just 1111 00:59:29,750 --> 00:59:33,020 want to try to stay with the intuition here. 1112 00:59:33,020 --> 00:59:38,460 We did that, now I want to do this. 1113 00:59:38,460 --> 00:59:45,760 Suppose your sequence is A C G T, repeating 10,000 times. 1114 00:59:45,760 --> 00:59:51,310 Can anyone figure out what the optimal parse of that sequence 1115 00:59:51,310 --> 00:59:54,704 would be, without doing Viterbi in their head? 1116 00:59:59,970 --> 01:00:01,510 Start and stay in genome. 1117 01:00:01,510 --> 01:00:03,070 Can you explain why? 1118 01:00:03,070 --> 01:00:06,111 AUDIENCE: Because it's equal to the-- 1119 01:00:06,111 --> 01:00:09,640 what are the [? widths? ?] Because it's 1120 01:00:09,640 --> 01:00:13,080 homogeneous in composition, as opposed 1121 01:00:13,080 --> 01:00:19,238 to enriched for C and G, and it just repeats without pattern. 1122 01:00:19,238 --> 01:00:22,094 Or it repeats throughout without [? concentrating the C's and G's ?] 1123 01:00:22,094 --> 01:00:22,930 anywhere. 1124 01:00:22,930 --> 01:00:25,555 PROFESSOR: Right, so the unit of the repeat, this A C G T unit, 1125 01:00:25,555 --> 01:00:27,020 is not biased for either one. 1126 01:00:27,020 --> 01:00:32,120 So there will be two 0.3 emissions and two 0.2 emissions, 1127 01:00:32,120 --> 01:00:34,100 whether you go through those in G G G G 1128 01:00:34,100 --> 01:00:37,200 or in I I I I. Does that make sense? 1129 01:00:37,200 --> 01:00:39,670 So the emissions will be the same, if you're all in genome, 1130 01:00:39,670 --> 01:00:41,150 or if you're all in island. 1131 01:00:41,150 --> 01:00:46,360 And the initial probabilities favor genome, 1132 01:00:46,360 --> 01:00:49,080 and the transitions also favor staying in genome. 1133 01:00:49,080 --> 01:00:51,433 Right, so all genome. 1134 01:00:51,433 --> 01:00:52,710 Can everyone see that? 1135 01:00:56,419 --> 01:00:58,377 So do you want to take a stab at this next one? 1136 01:01:04,710 --> 01:01:05,770 This one's harder. 1137 01:01:05,770 --> 01:01:08,500 Let me ask you, in the optimal parse, 1138 01:01:08,500 --> 01:01:12,729 what state is it going to end in? 1139 01:01:12,729 --> 01:01:13,437 AUDIENCE: Genome. 1140 01:01:13,437 --> 01:01:16,270 PROFESSOR: Genome, you've got a run of [? 1,000 ?] T's. 1141 01:01:16,270 --> 01:01:20,330 And in genome, the emissions favor emitting T's. 1142 01:01:20,330 --> 01:01:23,450 So clearly, it's going to end in genome. 1143 01:01:23,450 --> 01:01:27,610 And then, what about those runs of C's and G's in the middle 1144 01:01:27,610 --> 01:01:28,215 there? 1145 01:01:28,215 --> 01:01:32,560 Are any of those long enough to trigger a transition to island? 1146 01:01:32,560 --> 01:01:34,102 What was your name again? 1147 01:01:34,102 --> 01:01:34,810 AUDIENCE: Daniel. 1148 01:01:34,810 --> 01:01:37,210 PROFESSOR: Daniel, so you're shaking your head. 1149 01:01:37,210 --> 01:01:39,010 You think they're not long enough. 1150 01:01:39,010 --> 01:01:44,230 So you think the winner's going to be genome all the way? 1151 01:01:44,230 --> 01:01:45,778 Who thinks they're long enough? 1152 01:01:45,778 --> 01:01:48,218 Or maybe some of them are? 1153 01:01:48,218 --> 01:01:49,908 Go ahead, what was your name? 1154 01:01:49,908 --> 01:01:50,658 AUDIENCE: Michael. 1155 01:01:50,658 --> 01:01:51,699 PROFESSOR: Michael, yeah.
1156 01:01:51,699 --> 01:01:54,946 AUDIENCE: The ones at length 80 and 60 are long enough, 1157 01:01:54,946 --> 01:01:56,890 but the one at length 20 is not. 1158 01:01:56,890 --> 01:01:59,352 PROFESSOR: OK, and why do you say that? 1159 01:01:59,352 --> 01:02:03,860 AUDIENCE: Just looking at powers of 3 over 2, 1160 01:02:03,860 --> 01:02:08,320 3 times 10 to the 3 isn't enough to overcome the difference 1161 01:02:08,320 --> 01:02:11,750 in transition probabilities between the island and genome. 1162 01:02:11,750 --> 01:02:17,630 But 3 times 10 to the 10 and 1 times 10 to the 14 1163 01:02:17,630 --> 01:02:25,680 is, over the length of those sequences. 1164 01:02:25,680 --> 01:02:28,585 The difference in probability of making that 1165 01:02:28,585 --> 01:02:31,177 switch once at the beginning [INAUDIBLE]. 1166 01:02:31,177 --> 01:02:33,510 PROFESSOR: OK, did everyone get what Michael was saying? 1167 01:02:33,510 --> 01:02:36,500 So, Michael, can you explain why powers of 1.5 1168 01:02:36,500 --> 01:02:38,130 are relevant here? 1169 01:02:38,130 --> 01:02:48,620 AUDIENCE: Oh, that's the ratio of emission probability 1170 01:02:48,620 --> 01:02:52,545 for the C's and the G's 1171 01:02:52,545 --> 01:02:54,920 between island and genome. 1172 01:02:54,920 --> 01:02:58,016 So in island it's 0.3, and in genome it's 1173 01:02:58,016 --> 01:02:58,890 0.2 over [INAUDIBLE]. 1174 01:02:58,890 --> 01:03:01,390 PROFESSOR: Right, so when you're going through a run of C's, 1175 01:03:01,390 --> 01:03:04,830 if you're in the island state, you get a power of 1.5, 1176 01:03:04,830 --> 01:03:08,520 in terms of emissions at each position. 1177 01:03:08,520 --> 01:03:09,780 What about the transitions? 1178 01:03:09,780 --> 01:03:12,240 You're sort of glossing over those. 1179 01:03:12,240 --> 01:03:13,670 Why is that? 1180 01:03:13,670 --> 01:03:16,950 AUDIENCE: Because that only has to happen once 1181 01:03:16,950 --> 01:03:18,450 at the beginning. 1182 01:03:18,450 --> 01:03:26,020 So the ratio between the transition probabilities 1183 01:03:26,020 --> 01:03:30,310 is really high, but as long as the compounded 1184 01:03:30,310 --> 01:03:32,345 ratio of the emission probability 1185 01:03:32,345 --> 01:03:35,091 is high enough over a [INAUDIBLE] 1186 01:03:35,091 --> 01:03:39,755 of sequences, that as long as that compound emission 1187 01:03:39,755 --> 01:03:41,965 is greater than that one-off ratio at the beginning, 1188 01:03:41,965 --> 01:03:51,790 then the island is more [INAUDIBLE]. 1189 01:03:51,790 --> 01:03:56,560 PROFESSOR: Yeah, so if you think about the transitions, I to I, 1190 01:03:56,560 --> 01:04:01,290 or G to G, as being close to 1-- so if you think of them as 1, 1191 01:04:01,290 --> 01:04:03,920 then you can ignore them, and only focus on the cases 1192 01:04:03,920 --> 01:04:06,710 where it transitions from G to I, and I to G. 1193 01:04:06,710 --> 01:04:11,010 So you say that 60 and 80 are long enough. 1194 01:04:11,010 --> 01:04:15,880 So your prediction is that the optimal parse 1195 01:04:15,880 --> 01:04:30,460 is G 1,000, I 80, G another 2,020-- 1196 01:04:30,460 --> 01:04:37,370 you said that one wasn't going to-- and then I 60, G 1,000. 1197 01:04:37,370 --> 01:04:38,990 Michael, is that what you're saying? 1198 01:04:38,990 --> 01:04:41,560 Can you read this? 1199 01:04:41,560 --> 01:04:43,520 AUDIENCE: Yeah.
1200 01:04:43,520 --> 01:04:48,920 PROFESSOR: OK, so why do you say that 10 to the 10th 1201 01:04:48,920 --> 01:04:53,139 is enough to flip the switch, and 10 to the 3rd is not? 1202 01:04:53,139 --> 01:04:56,492 AUDIENCE: If I remember the numbers from the previous slide 1203 01:04:56,492 --> 01:04:57,930 correctly-- 1204 01:04:57,930 --> 01:05:01,410 PROFESSOR: A couple of slides back? 1205 01:05:01,410 --> 01:05:05,165 AUDIENCE: So if you look at the ratio of the probability 1206 01:05:05,165 --> 01:05:08,395 of staying in the genome, and the probability of going from 1207 01:05:08,395 --> 01:05:10,135 the genome to the island, it's-- 1208 01:05:10,135 --> 01:05:11,520 PROFESSOR: 10 to the 5th. 1209 01:05:11,520 --> 01:05:12,770 AUDIENCE: Yeah, 10 to the 5th. 1210 01:05:12,770 --> 01:05:19,001 So whatever happens going over the next [? run ?] of sequences 1211 01:05:19,001 --> 01:05:23,474 has to overcome the difference in ratio for the switch 1212 01:05:23,474 --> 01:05:25,322 to become more likely. 1213 01:05:25,322 --> 01:05:26,030 PROFESSOR: Right. 1214 01:05:26,030 --> 01:05:31,820 So if everyone agrees that we're going to start in genome, 1215 01:05:31,820 --> 01:05:35,430 we've got a run of 1,000 A's, and genome is favored anyway-- 1216 01:05:35,430 --> 01:05:38,830 so that's clear, we're going to be in genome at the beginning 1217 01:05:38,830 --> 01:05:41,590 for the first 1,000, and be in genome at the end, 1218 01:05:41,590 --> 01:05:43,440 then if you're going to go to island, 1219 01:05:43,440 --> 01:05:46,050 you have to pay two penalties, basically. 1220 01:05:46,050 --> 01:05:48,460 You pay the penalty of starting in island, which 1221 01:05:48,460 --> 01:05:51,657 is 10 to the minus 5th-- this is maybe a slightly different way 1222 01:05:51,657 --> 01:05:53,240 than you were thinking about it, but I 1223 01:05:53,240 --> 01:05:56,300 think it's equivalent-- 10 to the minus 5th to switch to island, 1224 01:05:56,300 --> 01:05:59,770 and then you pay a penalty coming back, 10 to the minus 3. 1225 01:05:59,770 --> 01:06:02,000 And all the other transitions are near 1. 1226 01:06:02,000 --> 01:06:04,150 So it's like a 10 to the minus 8 penalty 1227 01:06:04,150 --> 01:06:07,360 for going from genome to island and back. 1228 01:06:07,360 --> 01:06:10,970 And so if the emissions are greater than 10 1229 01:06:10,970 --> 01:06:15,523 to the 8th-- favor island by a factor of 10 to the 8th-- 1230 01:06:15,523 --> 01:06:18,070 it'll be worth doing that. 1231 01:06:18,070 --> 01:06:19,134 Does that make sense? 1232 01:06:19,134 --> 01:06:21,554 AUDIENCE: I forgot about the penalty of [INAUDIBLE], 1233 01:06:21,554 --> 01:06:23,567 but it's still the [INAUDIBLE]. 1234 01:06:23,567 --> 01:06:24,942 PROFESSOR: Yeah, it's still true. 1235 01:06:24,942 --> 01:06:25,910 Everyone see that? 1236 01:06:25,910 --> 01:06:28,930 You have to pay a penalty of 10 to the 8th 1237 01:06:28,930 --> 01:06:31,215 to go from genome to island and back. 1238 01:06:31,215 --> 01:06:32,840 But the emissions can make up for that. 1239 01:06:32,840 --> 01:06:36,130 Even though it seems small, it seems like 60 bases is not 1240 01:06:36,130 --> 01:06:41,110 enough-- it's multiplicative, and it adds up. 1241 01:06:41,110 --> 01:06:41,880 Sally?
1242 01:06:41,880 --> 01:06:44,880 AUDIENCE: So it seems like the [INAUDIBLE] 1243 01:06:44,880 --> 01:06:50,380 to me is going to return a lagging answer, 1244 01:06:50,380 --> 01:06:53,046 because we're not going to actually switch 1245 01:06:53,046 --> 01:06:57,380 away from genome in our HMM until we hit the point where we should 1246 01:06:57,380 --> 01:07:00,970 [? tip, ?] which would be about 60 G's into the run of 80. 1247 01:07:00,970 --> 01:07:03,057 PROFESSOR: So you're saying it's not actually 1248 01:07:03,057 --> 01:07:05,492 going to predict the right thing? 1249 01:07:05,492 --> 01:07:09,388 AUDIENCE: Do you have to rerun the [INAUDIBLE] processing 1250 01:07:09,388 --> 01:07:11,830 to get it actually in line to the correct thing? 1251 01:07:11,830 --> 01:07:15,060 PROFESSOR: What do people think about this? 1252 01:07:15,060 --> 01:07:16,681 Yeah, comment? 1253 01:07:16,681 --> 01:07:18,180 AUDIENCE: That's not quite the case, 1254 01:07:18,180 --> 01:07:22,210 because you [? pack it ?] or you [? stack it ?] both 1255 01:07:22,210 --> 01:07:24,900 in the genome and island possibilities, 1256 01:07:24,900 --> 01:07:27,647 and your transition is the penalty. 1257 01:07:27,647 --> 01:07:31,270 So it's the highest impact penalty. 1258 01:07:31,270 --> 01:07:36,830 So when you go island to island to island in that string of 80, 1259 01:07:36,830 --> 01:07:39,440 the transition will only be valid 1260 01:07:39,440 --> 01:07:41,392 starting at the first one. 1261 01:07:44,308 --> 01:07:44,808 [INAUDIBLE] 1262 01:07:50,176 --> 01:07:53,766 PROFESSOR: OK, we're at position 1,000. 1263 01:07:53,766 --> 01:07:55,390 I think you're on the right track here. 1264 01:07:55,390 --> 01:07:58,722 So I'm going to claim that the Viterbi will transition 1265 01:07:58,722 --> 01:08:00,430 at the right place, because it's actually 1266 01:08:00,430 --> 01:08:03,680 proven to generate the optimal parse. 1267 01:08:03,680 --> 01:08:09,030 So I'm right, but I totally get your intuition. 1268 01:08:09,030 --> 01:08:11,787 This is the key thing-- most people's intuition, 1269 01:08:11,787 --> 01:08:13,870 my intuition, everyone's intuition when they first 1270 01:08:13,870 --> 01:08:17,090 hear about this is that it seems like you don't transition 1271 01:08:17,090 --> 01:08:17,590 soon enough. 1272 01:08:17,590 --> 01:08:19,950 It seems like you have to look into the future 1273 01:08:19,950 --> 01:08:21,886 to know to transition at that place, right? 1274 01:08:21,886 --> 01:08:23,760 And obviously you can't look into the future, 1275 01:08:23,760 --> 01:08:26,430 it's a recursion. 1276 01:08:26,430 --> 01:08:28,410 How does it work? 1277 01:08:28,410 --> 01:08:31,680 Clearly, this is going to be the winner. 1278 01:08:31,680 --> 01:08:37,014 So let's go to position 1,001, that's the first C. 1279 01:08:37,014 --> 01:08:43,160 And this guy is going to come from here, 1280 01:08:43,160 --> 01:08:47,620 this guy is the winner overall-- G 1,000 is clearly the winner. 1281 01:08:47,620 --> 01:08:50,189 But what about this guy? 1282 01:08:50,189 --> 01:08:52,840 Where's it coming from? 1283 01:08:52,840 --> 01:08:57,040 G 1,000, it's coming from there. 1284 01:08:57,040 --> 01:09:03,770 And in fact, the previous guy came from G 1,000. 1285 01:09:03,770 --> 01:09:08,240 I 1,000 came from G 999, and so forth. 1286 01:09:08,240 --> 01:09:10,710 Now, here's the interesting question. 1287 01:09:10,710 --> 01:09:16,040 What happens at 1,002?
1288 01:09:16,040 --> 01:09:23,729 Sally, I want you to tell me what happens at 1,002. 1289 01:09:23,729 --> 01:09:25,040 Who wins here? 1290 01:09:25,040 --> 01:09:26,000 AUDIENCE: Genome. 1291 01:09:26,000 --> 01:09:28,040 PROFESSOR: Genome. 1292 01:09:28,040 --> 01:09:28,958 Who wins here? 1293 01:09:28,958 --> 01:09:29,666 AUDIENCE: Island. 1294 01:09:29,666 --> 01:09:31,380 PROFESSOR: Island. 1295 01:09:31,380 --> 01:09:34,600 It had been transitioning late-- genome has got a head start, 1296 01:09:34,600 --> 01:09:36,700 so the best way to be in island 1297 01:09:36,700 --> 01:09:38,910 is to have been in genome as long as possible, 1298 01:09:38,910 --> 01:09:41,640 up until position 1,000. 1299 01:09:41,640 --> 01:09:43,450 And that was still true at 1,001. 1300 01:09:43,450 --> 01:09:45,370 It's no longer true after that. 1301 01:09:45,370 --> 01:09:48,819 It was actually better to have transitioned back here 1302 01:09:48,819 --> 01:09:52,075 to get that one extra emission, that one power of 1.5 1303 01:09:52,075 --> 01:09:54,120 from emitting that C in the island state. 1304 01:09:54,120 --> 01:09:56,453 If you're going to be in island anyway-- 1305 01:09:56,453 --> 01:09:59,780 this is much lower than this, at this point. 1306 01:09:59,780 --> 01:10:02,317 It's about 10 to the 5th lower. 1307 01:10:02,317 --> 01:10:03,650 But that's OK, we still keep it. 1308 01:10:03,650 --> 01:10:05,140 It's the best that ends in island. 1309 01:10:05,140 --> 01:10:07,710 Do you see what I'm saying? 1310 01:10:07,710 --> 01:10:11,840 OK, there were all these-- island always 1311 01:10:11,840 --> 01:10:14,334 had to come from genome at the latest possible time, up 1312 01:10:14,334 --> 01:10:16,250 until this point, and now it's actually better 1313 01:10:16,250 --> 01:10:18,780 to have made that transition there, and then stay in island. 1314 01:10:18,780 --> 01:10:21,460 So you can see island is going to win for a while, 1315 01:10:21,460 --> 01:10:24,460 and then it'll flip back. 1316 01:10:24,460 --> 01:10:30,080 And the question is going to be down here at 1,060, 1317 01:10:30,080 --> 01:10:32,960 going to 1,061. 1318 01:10:32,960 --> 01:10:33,850 Who's bigger here? 1319 01:10:33,850 --> 01:10:37,180 This guy was perhaps-- well, we don't even 1320 01:10:37,180 --> 01:10:38,850 know exactly how we got here. 1321 01:10:38,850 --> 01:10:47,410 But you can see that this parse here that stays in island 1322 01:10:47,410 --> 01:10:48,590 is going to be optimal. 1323 01:10:48,590 --> 01:10:51,230 And the question is, would it beat just staying in genome? 1324 01:10:51,230 --> 01:10:54,170 And the answer is yes, because the 10 to the 10th 1325 01:10:54,170 --> 01:10:56,950 it gained in emissions overcomes the 10 to the 8th penalty 1326 01:10:56,950 --> 01:10:59,921 that it paid. 1327 01:10:59,921 --> 01:11:01,170 Now what do you do at the end? 1328 01:11:01,170 --> 01:11:03,900 How do you actually find the optimal parse overall? 1329 01:11:07,050 --> 01:11:21,690 I go out to position whatever it is, 4,160. 1330 01:11:21,690 --> 01:11:24,000 I've got a probability here, probability here, 1331 01:11:24,000 --> 01:11:26,730 what do I do with those? 1332 01:11:26,730 --> 01:11:29,500 Right, but what do I do first? 1333 01:11:29,500 --> 01:11:31,500 You pick the bigger one, whichever one's bigger. 1334 01:11:31,500 --> 01:11:35,810 We decided that this one is going to be bigger, right? 1335 01:11:35,810 --> 01:11:38,560 And then remember all the arrows that I circled?
1336 01:11:38,560 --> 01:11:42,910 You just backtrack through and figure out what it was. 1337 01:11:42,910 --> 01:11:45,382 Does that make sense? 1338 01:11:45,382 --> 01:11:46,590 That's the Viterbi algorithm. 1339 01:11:46,590 --> 01:11:53,350 We'll do a little bit more on this next time, 1340 01:11:53,350 --> 01:11:55,060 and definitely field questions. 1341 01:11:55,060 --> 01:11:57,500 It's a little bit tricky to get your head around. 1342 01:12:01,450 --> 01:12:03,960 It's a dynamic programming algorithm, like Needleman-Wunsch 1343 01:12:03,960 --> 01:12:06,710 or Smith-Waterman, but a little bit different. 1344 01:12:09,400 --> 01:12:12,230 The runtime-- what is the runtime, 1345 01:12:12,230 --> 01:12:15,540 for those who were sleeping and didn't notice 1346 01:12:15,540 --> 01:12:17,860 that little thing I flashed up there? 1347 01:12:17,860 --> 01:12:21,030 Or, if you read it, can you explain where it comes from? 1348 01:12:21,030 --> 01:12:24,969 How does the runtime depend on the number of hidden states 1349 01:12:24,969 --> 01:12:26,260 and the length of the sequence? 1350 01:12:31,140 --> 01:12:37,590 I've got K states, sequence of length L, what is the runtime? 1351 01:12:37,590 --> 01:12:39,822 So I'm going to put this up here. 1352 01:12:39,822 --> 01:12:41,298 This might help. 1353 01:12:47,210 --> 01:12:50,750 So when you look at the recursion like this, 1354 01:12:50,750 --> 01:12:53,640 when you want to think about the runtime-- 1355 01:12:53,640 --> 01:12:57,191 forget about initialization and termination, that's not 1356 01:12:57,191 --> 01:12:57,690 [INAUDIBLE]. 1357 01:12:57,690 --> 01:13:01,550 It's what you do on the typical intermediate state that 1358 01:13:01,550 --> 01:13:02,680 determines the runtime. 1359 01:13:02,680 --> 01:13:04,760 That's what grows with sequence length. 1360 01:13:04,760 --> 01:13:09,460 So what do you have to do at each-- 1361 01:13:09,460 --> 01:13:11,969 to go from position I to position I plus 1? 1362 01:13:11,969 --> 01:13:12,885 How many calculations? 1363 01:13:16,208 --> 01:13:19,694 AUDIENCE: You have to do N calculations for 33. 1364 01:13:19,694 --> 01:13:21,700 Is that right? 1365 01:13:21,700 --> 01:13:23,530 PROFESSOR: Yeah, so 33. 1366 01:13:23,530 --> 01:13:25,600 Yeah, the notation is a little bit 1367 01:13:25,600 --> 01:13:28,460 different, but how many-- let me ask you this, 1368 01:13:28,460 --> 01:13:31,010 how many transitions do you have to consider? 1369 01:13:31,010 --> 01:13:34,110 If I have an HMM with K hidden states? 1370 01:13:34,110 --> 01:13:35,405 AUDIENCE: K squared. 1371 01:13:35,405 --> 01:13:37,780 PROFESSOR: K squared, right? 1372 01:13:37,780 --> 01:13:40,155 So you're going to have to do K squared calculations, 1373 01:13:40,155 --> 01:13:42,790 basically, to go from position I to I plus 1. 1374 01:13:42,790 --> 01:13:44,870 So what is the overall dependence 1375 01:13:44,870 --> 01:13:48,460 on K and L, the length of the sequence? 1376 01:13:52,780 --> 01:13:55,560 OK, it's K squared L. It's linear in the sequence. 1377 01:13:55,560 --> 01:13:57,540 So is this good or bad? 1378 01:13:57,540 --> 01:13:58,062 Yes, Sally? 1379 01:13:58,062 --> 01:14:00,437 AUDIENCE: Doesn't this assume that the graph is complete? 1380 01:14:00,437 --> 01:14:03,916 And if you don't actually have [INAUDIBLE] 1381 01:14:03,916 --> 01:14:05,442 you can get a little faster? 1382 01:14:05,442 --> 01:14:06,900 PROFESSOR: Yeah, it's a good point.
1383 01:14:06,900 --> 01:14:09,390 So this is the worst case, or this 1384 01:14:09,390 --> 01:14:12,670 is in the case where you can transition from any state 1385 01:14:12,670 --> 01:14:13,500 to any other state. 1386 01:14:13,500 --> 01:14:18,410 If you remember, the gene-finding HMM-- 1387 01:14:18,410 --> 01:14:20,890 I might have erased it, I think I erased it-- 1388 01:14:20,890 --> 01:14:23,080 if you can see this subtle-- 1389 01:14:23,080 --> 01:14:26,530 No, remember Tim designed an HMM for gene-finding 1390 01:14:26,530 --> 01:14:30,160 here, which only had some of the arrows allowed. 1391 01:14:30,160 --> 01:14:33,010 So if that's true, if there's a bunch of zero probabilities 1392 01:14:33,010 --> 01:14:34,870 for transitions, then you can ignore those, 1393 01:14:34,870 --> 01:14:36,200 and it actually speeds it up. 1394 01:14:36,200 --> 01:14:37,660 That's true. 1395 01:14:37,660 --> 01:14:38,680 It's a good point. 1396 01:14:38,680 --> 01:14:39,430 Everyone got this? 1397 01:14:39,430 --> 01:14:42,360 So this is the worst case. 1398 01:14:42,360 --> 01:14:45,320 K squared L-- is this good or bad? 1399 01:14:45,320 --> 01:14:47,290 Fast or slow? 1400 01:14:47,290 --> 01:14:49,340 Slow? 1401 01:14:49,340 --> 01:14:52,600 I mean, it depends on the structure of your HMM. 1402 01:14:52,600 --> 01:14:56,650 For a simple HMM, like the CpG island HMM, 1403 01:14:56,650 --> 01:15:00,210 this is like blindingly fast. 1404 01:15:00,210 --> 01:15:01,850 K squared is 4, right? 1405 01:15:01,850 --> 01:15:03,740 So it takes the same order of magnitude 1406 01:15:03,740 --> 01:15:05,510 as just reading the sequence. 1407 01:15:05,510 --> 01:15:07,750 So it'll be super, super fast. 1408 01:15:07,750 --> 01:15:10,750 If you make a really complicated HMM, it can be slower. 1409 01:15:10,750 --> 01:15:14,930 But the point is that for genomic sequence analysis, 1410 01:15:14,930 --> 01:15:16,250 L is big. 1411 01:15:16,250 --> 01:15:19,646 So as long as you keep K small, it'll run fast. 1412 01:15:19,646 --> 01:15:22,310 It's much better than sequence comparison, where you end up 1413 01:15:22,310 --> 01:15:24,274 with these L squared types of things. 1414 01:15:24,274 --> 01:15:25,940 So it's faster than sequence comparison. 1415 01:15:25,940 --> 01:15:27,410 So that's really one of the reasons 1416 01:15:27,410 --> 01:15:30,640 why Viterbi is so popular, is it's super fast. 1417 01:15:30,640 --> 01:15:33,020 So in the last couple minutes, I just 1418 01:15:33,020 --> 01:15:35,500 want to say a few things about the midterm. 1419 01:15:35,500 --> 01:15:38,490 You guys did remember there is a midterm, right? 1420 01:15:38,490 --> 01:15:44,440 So the midterm is a week from today, Tuesday, March 18. 1421 01:15:44,440 --> 01:15:46,820 For everybody, it's going to be here, 1422 01:15:46,820 --> 01:15:50,240 except for those who are in 6.874-- 1423 01:15:50,240 --> 01:15:54,730 and those people should go to 68-180. 1424 01:15:54,730 --> 01:15:57,200 And because they're going to be given extra time, 1425 01:15:57,200 --> 01:15:58,810 you should go there early. 1426 01:15:58,810 --> 01:16:00,670 Go there at 12:40. 1427 01:16:00,670 --> 01:16:03,200 Everyone else, who's not in 6.874, 1428 01:16:03,200 --> 01:16:06,850 should come here to the regular class by 1:00 PM, 1429 01:16:06,850 --> 01:16:09,500 just so you have a chance to get set up and everything.
1430 01:16:09,500 --> 01:16:12,380 And then the exam will start promptly at 1:05, 1431 01:16:12,380 --> 01:16:16,820 and will end at 2:25, an hour and 20 minutes. 1432 01:16:16,820 --> 01:16:20,337 It is closed book, open notes. 1433 01:16:20,337 --> 01:16:22,420 So don't bring your textbook, but you can bring up 1434 01:16:22,420 --> 01:16:24,440 to two pages-- they can be double sided 1435 01:16:24,440 --> 01:16:25,700 if you want-- of notes. 1436 01:16:25,700 --> 01:16:27,010 So why do we do this? 1437 01:16:27,010 --> 01:16:32,950 Well, we think that the act of going through the lectures 1438 01:16:32,950 --> 01:16:35,590 and textbook, or whatever other notes you have, 1439 01:16:35,590 --> 01:16:41,480 and deciding what's most important, may be helpful. 1440 01:16:41,480 --> 01:16:44,320 And so this, hopefully, will be a useful studying exercise: 1441 01:16:44,320 --> 01:16:46,403 figure out what's most important and write it down 1442 01:16:46,403 --> 01:16:49,340 on a piece of paper if you are likely to forget it-- maybe 1443 01:16:49,340 --> 01:16:50,970 complicated equations, things like 1444 01:16:50,970 --> 01:16:53,520 that, you might want to write down. 1445 01:16:53,520 --> 01:16:55,570 No calculators or other electronic aids. 1446 01:16:55,570 --> 01:16:57,050 But you won't need them. 1447 01:16:57,050 --> 01:17:03,400 If you get an answer that's e squared over 17 factorial, 1448 01:17:03,400 --> 01:17:05,710 you're not asked to convert that into a decimal. 1449 01:17:05,710 --> 01:17:08,389 Just leave it like that. 1450 01:17:08,389 --> 01:17:09,430 So what should you study? 1451 01:17:09,430 --> 01:17:12,610 So you should study your lecture notes, readings, and tutorials, 1452 01:17:12,610 --> 01:17:14,240 and past exams. 1453 01:17:14,240 --> 01:17:17,050 Past exams have been posted to the course website. 1454 01:17:17,050 --> 01:17:18,960 p-sets as well. 1455 01:17:18,960 --> 01:17:21,780 The midterm exams from past years are posted. 1456 01:17:21,780 --> 01:17:25,110 And there's some variation in topics from year to year, 1457 01:17:25,110 --> 01:17:28,570 so if you're reading through a midterm from a past year, 1458 01:17:28,570 --> 01:17:33,000 and you run across an unfamiliar phrase or concept, 1459 01:17:33,000 --> 01:17:35,270 you have to ask yourself, was I just 1460 01:17:35,270 --> 01:17:37,450 dozing off when that was discussed? 1461 01:17:37,450 --> 01:17:39,320 Or was that not discussed this year? 1462 01:17:39,320 --> 01:17:41,730 And act appropriately. 1463 01:17:41,730 --> 01:17:44,800 The content of the midterm will be all the lectures, 1464 01:17:44,800 --> 01:17:48,752 all the topics up through today-- hidden Markov models. 1465 01:17:48,752 --> 01:17:50,960 And I'll do a little bit more on hidden Markov models 1466 01:17:50,960 --> 01:17:51,840 on Thursday. 1467 01:17:51,840 --> 01:17:57,060 That part could be on the exam, but the next major topic-- RNA 1468 01:17:57,060 --> 01:18:00,880 secondary structure-- will not be on the exam. 1469 01:18:00,880 --> 01:18:02,627 It'll be on a p-set in the future. 1470 01:18:02,627 --> 01:18:03,960 Any questions about the midterm? 1471 01:18:10,060 --> 01:18:13,341 And the TAs will be doing some review stuff in sections 1472 01:18:13,341 --> 01:18:13,840 this week. 1473 01:18:17,670 --> 01:18:19,420 OK, thank you. 1474 01:18:19,420 --> 01:18:21,420 See you on Thursday.