1 00:00:00,060 --> 00:00:01,476 NARRATOR: The following content is 2 00:00:01,476 --> 00:00:04,019 provided under a Creative Commons license. 3 00:00:04,019 --> 00:00:06,870 Your support will help MIT OpenCourseWare continue 4 00:00:06,870 --> 00:00:10,730 to offer high-quality educational resources for free. 5 00:00:10,730 --> 00:00:13,330 To make a donation, or view additional materials 6 00:00:13,330 --> 00:00:17,236 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:17,236 --> 00:00:17,861 at ocw.mit.edu. 8 00:00:27,000 --> 00:00:33,690 PROFESSOR: All right, we should get started. 9 00:00:33,690 --> 00:00:36,600 So it's good to be back. 10 00:00:36,600 --> 00:00:41,920 We'll be discussing DNA sequence motifs. 11 00:00:41,920 --> 00:00:44,120 Oh yeah, we were, if you're wondering, 12 00:00:44,120 --> 00:00:48,060 yes, the instructors were at the awards on Sunday. 13 00:00:48,060 --> 00:00:48,830 It was great. 14 00:00:48,830 --> 00:00:52,420 The pizza was delicious. 15 00:00:52,420 --> 00:00:56,560 So today, we're going to be talking about DNA and protein 16 00:00:56,560 --> 00:01:01,110 sequence motifs, which are essentially the building 17 00:01:01,110 --> 00:01:07,770 blocks of regulatory information, in a sense. 18 00:01:07,770 --> 00:01:11,940 Before we get started, I wanted to just see 19 00:01:11,940 --> 00:01:15,540 if there are any questions about material 20 00:01:15,540 --> 00:01:19,100 that Professor Gifford covered from the past couple days? 21 00:01:19,100 --> 00:01:21,490 No guarantees I'll be able to answer them, 22 00:01:21,490 --> 00:01:26,760 but just general things related to transcriptome analysis, 23 00:01:26,760 --> 00:01:28,990 or PCA? 24 00:01:28,990 --> 00:01:29,730 Anything? 25 00:01:29,730 --> 00:01:33,020 Hopefully, you all got the email that he sent out about, 26 00:01:33,020 --> 00:01:36,790 basically, what you're expected to get. 
27 00:01:36,790 --> 00:01:42,070 So at the level of the document that's posted, 28 00:01:42,070 --> 00:01:43,570 that's sort of what we're expecting. 29 00:01:43,570 --> 00:01:45,200 So if you haven't had linear algebra, 30 00:01:45,200 --> 00:01:47,190 that should still be accessible-- 31 00:01:47,190 --> 00:01:50,080 not necessarily all the derivations. 32 00:01:50,080 --> 00:01:53,630 Any questions about that? 33 00:01:53,630 --> 00:01:57,210 OK, so as a reminder, team projects, 34 00:01:57,210 --> 00:02:01,150 your aims are due soon. 35 00:02:01,150 --> 00:02:02,740 We'll post a slightly-- there's been 36 00:02:02,740 --> 00:02:05,160 a request for more detailed information on what we'd 37 00:02:05,160 --> 00:02:09,660 like in the aims, so we'll post something more detailed 38 00:02:09,660 --> 00:02:13,500 on the website this evening, and probably 39 00:02:13,500 --> 00:02:15,860 extend the deadline a day or two, just 40 00:02:15,860 --> 00:02:19,270 to give you a little bit more time on the aims. 41 00:02:19,270 --> 00:02:22,070 So after you submit your aims-- this 42 00:02:22,070 --> 00:02:24,420 is students who are taking the project 43 00:02:24,420 --> 00:02:29,650 component of the course-- then your team 44 00:02:29,650 --> 00:02:32,760 will be assigned to one of the three instructors 45 00:02:32,760 --> 00:02:38,170 as a mentor/advisor, and we will schedule a time 46 00:02:38,170 --> 00:02:40,680 to meet with you in the next week or two 47 00:02:40,680 --> 00:02:43,500 to discuss your aims, just to assess 48 00:02:43,500 --> 00:02:48,140 the feasibility of the project and so forth, before you launch 49 00:02:48,140 --> 00:02:50,570 into it. 50 00:02:50,570 --> 00:02:56,421 All right-- any questions from past lectures? 51 00:02:56,421 --> 00:02:57,920 All right, today we're going to talk 52 00:02:57,920 --> 00:03:02,280 about modeling and discovery of sequence motifs. 
53 00:03:02,280 --> 00:03:05,500 We'll give an example of a particular algorithm that's 54 00:03:05,500 --> 00:03:08,870 used in motif finding called the Gibbs Sampling Algorithm. 55 00:03:08,870 --> 00:03:12,170 It's not the only algorithm, it's not even necessarily 56 00:03:12,170 --> 00:03:13,280 the best algorithm. 57 00:03:13,280 --> 00:03:15,000 It's pretty good. 58 00:03:15,000 --> 00:03:17,330 It works in many cases. 59 00:03:17,330 --> 00:03:18,470 It's an early algorithm. 60 00:03:18,470 --> 00:03:21,700 But it's interesting to talk about 61 00:03:21,700 --> 00:03:24,490 because it illustrates the problem in general, 62 00:03:24,490 --> 00:03:26,800 and also it's an example of a stochastic algorithm-- 63 00:03:26,800 --> 00:03:30,930 an algorithm where what it does is determined 64 00:03:30,930 --> 00:03:32,410 at random, to some extent. 65 00:03:32,410 --> 00:03:36,337 And yet still often converges to a particular answer. 66 00:03:36,337 --> 00:03:38,170 So it's interesting from that point of view. 67 00:03:38,170 --> 00:03:40,420 And we'll talk about a few other types 68 00:03:40,420 --> 00:03:42,480 of motif finding algorithms. 69 00:03:42,480 --> 00:03:46,476 And we'll do a little bit on statistical entropy 70 00:03:46,476 --> 00:03:47,850 and information content, which is 71 00:03:47,850 --> 00:03:50,540 a handy way of describing motifs. 72 00:03:50,540 --> 00:03:54,470 And talk a little bit about parameter estimation, 73 00:03:54,470 --> 00:03:58,250 as well, which is critical when you have a motif 74 00:03:58,250 --> 00:04:00,440 and you want to build a model of it 75 00:04:00,440 --> 00:04:05,380 to then discover additional instances of that motif. 76 00:04:05,380 --> 00:04:09,480 So some reading for today-- I posted some nature 77 00:04:09,480 --> 00:04:13,780 biotechnology primers on motifs and motif discovery, 78 00:04:13,780 --> 00:04:16,589 which are pretty easy reading. 
79 00:04:16,589 --> 00:04:19,390 The textbook, chapter 6, also has some good information 80 00:04:19,390 --> 00:04:22,200 on motifs, I encourage you to look at that. 81 00:04:22,200 --> 00:04:25,180 And I've also posted the original paper 82 00:04:25,180 --> 00:04:28,010 by Bailey and Elkan on the MEME algorithm, which 83 00:04:28,010 --> 00:04:30,560 is kind of related to the Gibbs Sampling Algorithm, 84 00:04:30,560 --> 00:04:35,720 but uses expectation maximization. 85 00:04:35,720 --> 00:04:40,340 And so it's a really nice paper-- take a look at that. 86 00:04:40,340 --> 00:04:44,370 And I'll also post the original Gibbs Sampler paper later 87 00:04:44,370 --> 00:04:45,280 today. 88 00:04:45,280 --> 00:04:47,520 And then on Tuesday, we're going to be talking 89 00:04:47,520 --> 00:04:50,160 about Markov and hidden Markov models. 90 00:04:50,160 --> 00:04:56,030 And so take a look at the primer on HMMs, 91 00:04:56,030 --> 00:04:57,980 as well as there is some information 92 00:04:57,980 --> 00:05:00,480 on HMMs in the text. 93 00:05:00,480 --> 00:05:02,410 It's not really a distinct section, 94 00:05:02,410 --> 00:05:04,900 it's kind of scattered throughout the text. 95 00:05:04,900 --> 00:05:10,086 So the best approach is to look in the index for HMMs, 96 00:05:10,086 --> 00:05:16,310 and read the relevant parts that you're interested in. 97 00:05:16,310 --> 00:05:19,100 And if you really want to understand 98 00:05:19,100 --> 00:05:21,430 the mechanics of HMMs, and how to actually implement 99 00:05:21,430 --> 00:05:25,680 one in depth, then I strongly recommend this Rabiner tutorial 100 00:05:25,680 --> 00:05:28,050 on HMMs, which is posted. 101 00:05:28,050 --> 00:05:29,920 So everyone please, please read that. 102 00:05:29,920 --> 00:05:36,030 I will use the same notation, to the extent possible, 103 00:05:36,030 --> 00:05:40,310 as the Rabiner paper when talking about some 104 00:05:40,310 --> 00:05:42,440 of the algorithms used in HMMs in lecture.
105 00:05:42,440 --> 00:05:45,420 So it should synergize well. 106 00:05:48,280 --> 00:05:50,775 So what is a sequence motif? 107 00:05:53,335 --> 00:05:54,710 In general, it's a pattern that's 108 00:05:54,710 --> 00:05:58,300 common to a set of DNA, RNA, or protein sequences, 109 00:05:58,300 --> 00:06:01,920 that share a biological property. 110 00:06:01,920 --> 00:06:06,410 So for example, all of the binding sites of the Myc 111 00:06:06,410 --> 00:06:09,680 transcription factor-- there's probably a pattern 112 00:06:09,680 --> 00:06:13,500 that they share, and you call that the motif for Myc. 113 00:06:13,500 --> 00:06:19,280 Can you give some examples of where you might get DNA motifs? 114 00:06:19,280 --> 00:06:21,215 Or protein motifs? 115 00:06:21,215 --> 00:06:23,970 Anyone have another example of a type of motif 116 00:06:23,970 --> 00:06:25,740 that would be interesting? 117 00:06:25,740 --> 00:06:27,765 What about one that's defined on function? 118 00:06:27,765 --> 00:06:28,390 Yeah, go ahead. 119 00:06:28,390 --> 00:06:29,557 What's your name? 120 00:06:29,557 --> 00:06:30,704 AUDIENCE: Dan. [INAUDIBLE] 121 00:06:30,704 --> 00:06:31,370 PROFESSOR: Yeah. 122 00:06:31,370 --> 00:06:37,560 So each kinase typically has a certain sequence motif 123 00:06:37,560 --> 00:06:40,655 that determines which proteins it phosphorylates. 124 00:06:40,655 --> 00:06:41,790 Right. 125 00:06:41,790 --> 00:06:42,812 Other examples? 126 00:06:42,812 --> 00:06:45,270 Yeah, so in that case, you might determine it functionally. 127 00:06:45,270 --> 00:06:47,644 You might purify that protein, incubate it 128 00:06:47,644 --> 00:06:50,060 with a pool of peptides, and see what gets phosphorylated, 129 00:06:50,060 --> 00:06:51,450 for example. 130 00:06:51,450 --> 00:06:52,316 Yeah, in the back? 131 00:06:52,316 --> 00:06:55,184 AUDIENCE: I'm [INAUDIBLE], and promonocytes. 132 00:06:55,184 --> 00:06:56,600 PROFESSOR: What was the first one?
133 00:06:56,600 --> 00:06:57,076 AUDIENCE: Promonocytes? 134 00:06:57,076 --> 00:06:58,028 Oh, that one? 135 00:06:58,028 --> 00:06:59,456 Oh, that was my name. 136 00:06:59,456 --> 00:07:01,850 PROFESSOR: Yeah, OK. 137 00:07:01,850 --> 00:07:03,540 And as to promoter motifs, sir? 138 00:07:03,540 --> 00:07:05,510 Some examples? 139 00:07:05,510 --> 00:07:11,613 AUDIENCE: Like, [INAUDIBLE] in transcription mining site. 140 00:07:11,613 --> 00:07:12,196 PROFESSOR: Ah. 141 00:07:12,196 --> 00:07:13,188 Yeah. 142 00:07:13,188 --> 00:07:15,172 And so you would identify those how? 143 00:07:15,172 --> 00:07:17,156 AUDIENCE: By looking at sequences 144 00:07:17,156 --> 00:07:20,132 upstream of [INAUDIBLE], and seeing 145 00:07:20,132 --> 00:07:23,092 what different sequences have in common? 146 00:07:23,092 --> 00:07:23,800 PROFESSOR: Right. 147 00:07:23,800 --> 00:07:27,521 So I think there's at least three ways-- OK, 148 00:07:27,521 --> 00:07:29,020 four ways I can think of identifying 149 00:07:29,020 --> 00:07:29,730 those types of motifs. 150 00:07:29,730 --> 00:07:31,563 That's probably one of the most common types 151 00:07:31,563 --> 00:07:34,600 of motifs encountered in molecular biology. 152 00:07:34,600 --> 00:07:38,160 So one way, you take a bunch of genes, 153 00:07:38,160 --> 00:07:40,490 where you've identified the transcription start site. 154 00:07:40,490 --> 00:07:44,409 You just look for patterns-- short sub-sequences 155 00:07:44,409 --> 00:07:45,450 that they have in common. 156 00:07:45,450 --> 00:07:49,120 That might give you the TATA box, for example. 157 00:07:49,120 --> 00:07:52,150 Another way would be, what about comparative genomics? 158 00:07:52,150 --> 00:07:54,040 You take each individual one, look 159 00:07:54,040 --> 00:07:56,720 to see which parts of that promoter are conserved. 160 00:07:56,720 --> 00:08:01,090 That can also help you refine your motifs. 
161 00:08:01,090 --> 00:08:03,100 Protein binding, you could do ChIP-Seq, 162 00:08:03,100 --> 00:08:05,520 that could give you motifs. 163 00:08:05,520 --> 00:08:07,160 And what about a functional readout? 164 00:08:07,160 --> 00:08:10,080 You clone a bunch of random sequences 165 00:08:10,080 --> 00:08:12,490 upstream of a luciferase reporter, see 166 00:08:12,490 --> 00:08:15,580 which ones actually drive expression, for example. 167 00:08:15,580 --> 00:08:16,780 So, that would be another. 168 00:08:16,780 --> 00:08:19,460 Yeah, absolutely, so there's a bunch of different ways 169 00:08:19,460 --> 00:08:21,570 to define them. 170 00:08:21,570 --> 00:08:23,780 In terms of when we talk about motifs, 171 00:08:23,780 --> 00:08:28,660 there are several different models of increasing resolution 172 00:08:28,660 --> 00:08:30,440 that people use. 173 00:08:30,440 --> 00:08:33,960 So people often talk about the consensus sequence-- 174 00:08:33,960 --> 00:08:38,510 so you say the TATA box, which, of course, 175 00:08:38,510 --> 00:08:41,659 describes the actual motif-- T-A-T-A-A-A, 176 00:08:41,659 --> 00:08:43,470 something like that. 177 00:08:43,470 --> 00:08:46,690 But that's really just the consensus 178 00:08:46,690 --> 00:08:49,210 of a bunch of TATA box motifs. 179 00:08:49,210 --> 00:08:51,490 You rarely find the perfect consensus 180 00:08:51,490 --> 00:08:54,820 in real promoters-- the real, naturally occurring ones 181 00:08:54,820 --> 00:08:58,080 are usually one or two mismatches away. 182 00:08:58,080 --> 00:09:00,084 So that doesn't fully capture it. 183 00:09:00,084 --> 00:09:02,000 So sometimes you'll have a regular expression.
184 00:09:02,000 --> 00:09:06,710 So an example would be if you were describing mammalian 185 00:09:06,710 --> 00:09:12,300 5 prime splice sites, you might describe the motif as GT, A 186 00:09:12,300 --> 00:09:20,960 or G, AGT, or sometimes abbreviated as GTR AGT, where 187 00:09:20,960 --> 00:09:24,026 R is shorthand for either purine nucleotide-- either A 188 00:09:24,026 --> 00:09:29,710 or G. In some motifs you could have GT, NN, GT, or something 189 00:09:29,710 --> 00:09:32,280 like that. 190 00:09:32,280 --> 00:09:35,170 Those can be captured, often, by regular expressions 191 00:09:35,170 --> 00:09:38,950 in a scripting language like Python or Perl. 192 00:09:38,950 --> 00:09:41,740 Another very common description of motifs 193 00:09:41,740 --> 00:09:43,980 would be a weight matrix. 194 00:09:43,980 --> 00:09:47,990 So you'll see a matrix where the width of the matrix 195 00:09:47,990 --> 00:09:50,730 is the number of bases in the motif. 196 00:09:50,730 --> 00:09:53,270 And then there are four rows, which 197 00:09:53,270 --> 00:09:55,910 are the four bases-- we'll see that in a moment. 198 00:09:55,910 --> 00:09:59,280 Sometimes these are described as position-specific probability 199 00:09:59,280 --> 00:10:01,990 matrices, or position-specific score matrices. 200 00:10:01,990 --> 00:10:03,550 We'll come to that in a moment. 201 00:10:03,550 --> 00:10:04,890 And then there are more complicated models. 202 00:10:04,890 --> 00:10:06,360 So it's increasingly becoming clear 203 00:10:06,360 --> 00:10:10,340 that the simple weight matrix is too limited-- it doesn't 204 00:10:10,340 --> 00:10:15,410 capture all the information that's present in motifs. 205 00:10:15,410 --> 00:10:17,980 So we talked about where do motifs come from. 206 00:10:17,980 --> 00:10:19,690 These are just some examples. 207 00:10:19,690 --> 00:10:22,740 I think I talked about all of these, 208 00:10:22,740 --> 00:10:25,250 except for in vitro binding.
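[The GTRAGT regular expression mentioned above can be sketched in Python; the example sequence here is invented for illustration.]

```python
import re

# The GTRAGT consensus: GT, then R (= A or G), then AGT.
# The character class [AG] plays the role of the IUPAC code R.
five_prime_ss = re.compile(r"GT[AG]AGT")

seq = "CAGGTAAGTTTCAGGTGAGT"  # invented genomic snippet for illustration
matches = [m.start() for m in five_prime_ss.finditer(seq)]
print(matches)  # start positions of candidate 5 prime splice sites
```

A regular expression is all-or-nothing: a sequence either matches GTRAGT or it doesn't, with no notion of a better or worse match, which is exactly the limitation the weight-matrix description addresses.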
209 00:10:25,250 --> 00:10:28,010 So in addition to doing a ChIP-seq, where you're 210 00:10:28,010 --> 00:10:30,480 looking at the binding of the endogenous protein, 211 00:10:30,480 --> 00:10:33,020 you could also make recombinant protein-- incubate that 212 00:10:33,020 --> 00:10:35,690 with a random pool of DNA molecules, 213 00:10:35,690 --> 00:10:38,750 pull down, and see what binds to it, for example. 214 00:10:42,010 --> 00:10:44,200 So why are they important? 215 00:10:44,200 --> 00:10:46,310 They're important for obvious reasons-- 216 00:10:46,310 --> 00:10:49,280 that they can identify proteins that 217 00:10:49,280 --> 00:10:52,390 have a specific biological property of interest. 218 00:10:52,390 --> 00:10:54,480 For example, being phosphorylated by a particular 219 00:10:54,480 --> 00:10:54,980 kinase. 220 00:10:54,980 --> 00:10:57,460 Or promoters that have a particular property. 221 00:10:57,460 --> 00:10:59,480 That is, that they're likely to be regulated 222 00:10:59,480 --> 00:11:03,510 by a particular transcription factor, et cetera. 223 00:11:03,510 --> 00:11:08,020 And ultimately, if you're very interested 224 00:11:08,020 --> 00:11:12,160 in the regulation of a particular gene, 225 00:11:12,160 --> 00:11:17,180 knowing what motifs are upstream and how strong the evidence is 226 00:11:17,180 --> 00:11:19,480 for each particular transcription factor that 227 00:11:19,480 --> 00:11:22,040 might or might not bind there, can 228 00:11:22,040 --> 00:11:27,527 be very useful in understanding the regulation of that gene. 229 00:11:27,527 --> 00:11:29,610 And they're also going to be important for efforts 230 00:11:29,610 --> 00:11:30,780 to model gene expression.
231 00:11:30,780 --> 00:11:33,650 So, a goal of systems biology would 232 00:11:33,650 --> 00:11:42,100 be to predict, from a given starting point, if we introduce 233 00:11:42,100 --> 00:11:44,780 some perturbation-- for example, if we knock out 234 00:11:44,780 --> 00:11:46,870 or knock down a particular transcription factor, 235 00:11:46,870 --> 00:11:51,470 or over-express it, how will the system behave? 236 00:11:51,470 --> 00:11:53,220 So you'd really want to be able to predict 237 00:11:53,220 --> 00:11:57,620 how the occupancy of that transcription factor 238 00:11:57,620 --> 00:11:58,360 would change. 239 00:11:58,360 --> 00:12:01,720 You'd want to know, first, where it is at endogenous levels, 240 00:12:01,720 --> 00:12:04,210 and then how its occupancy at every promoter 241 00:12:04,210 --> 00:12:06,300 will change when you perturb its levels. 242 00:12:06,300 --> 00:12:09,460 And then, what effects that will have 243 00:12:09,460 --> 00:12:13,500 on expression of downstream genes. 244 00:12:13,500 --> 00:12:16,370 So these sorts of models all require 245 00:12:16,370 --> 00:12:19,940 really accurate descriptions of motifs. 246 00:12:19,940 --> 00:12:22,230 OK, so these are some examples of protein motifs. 247 00:12:22,230 --> 00:12:25,560 Anyone recognize this one? 248 00:12:25,560 --> 00:12:29,720 What motif is that? 249 00:12:29,720 --> 00:12:30,941 So it says X's. 250 00:12:30,941 --> 00:12:32,440 X's would be degenerate positions, 251 00:12:32,440 --> 00:12:34,510 and C's would be cysteines. 252 00:12:34,510 --> 00:12:36,965 And H's [INAUDIBLE]. 253 00:12:36,965 --> 00:12:38,438 What is this? 254 00:12:38,438 --> 00:12:39,911 What does this define? 255 00:12:39,911 --> 00:12:41,089 What protein has this? 256 00:12:41,089 --> 00:12:42,857 What can you predict about its function? 257 00:12:42,857 --> 00:12:43,839 AUDIENCE: Zinc finger. 258 00:12:43,839 --> 00:12:44,760 PROFESSOR: Zinc finger, right.
259 00:12:44,760 --> 00:12:47,360 So it's a motif commonly seen in genome-binding transcription 260 00:12:47,360 --> 00:12:52,400 factors, and it coordinates to zinc. 261 00:12:52,400 --> 00:12:55,460 What about this one? 262 00:12:55,460 --> 00:12:57,340 Any guesses on what this motif is? 263 00:12:57,340 --> 00:13:00,550 This is quite a short motif. 264 00:13:00,550 --> 00:13:01,884 Yeah? 265 00:13:01,884 --> 00:13:02,371 AUDIENCE: That's a phosphorylation. 266 00:13:02,371 --> 00:13:03,345 PROFESSOR: Phosphorylation site. 267 00:13:03,345 --> 00:13:03,845 Yeah. 268 00:13:03,845 --> 00:13:06,754 And how do you know that? 269 00:13:06,754 --> 00:13:10,039 AUDIENCE: The [INAUDIBLE] and the [INAUDIBLE] next to it 270 00:13:10,039 --> 00:13:12,600 means it's [INAUDIBLE]. 271 00:13:12,600 --> 00:13:15,121 PROFESSOR: OK, so you even know what kinase it is, yeah. 272 00:13:15,121 --> 00:13:15,620 Exactly. 273 00:13:15,620 --> 00:13:16,745 So that's sort of the view. 274 00:13:16,745 --> 00:13:18,922 So, serine, threonine, and tyrosine 275 00:13:18,922 --> 00:13:20,630 are the residues that get phosphorylated. 276 00:13:20,630 --> 00:13:23,350 And so if you see a motif with a serine in the middle, 277 00:13:23,350 --> 00:13:26,140 it's a good chance it's a phosphorylation site. 278 00:13:28,850 --> 00:13:32,720 Here are some-- you can think of them as DNA sequence motifs, 279 00:13:32,720 --> 00:13:34,950 because they occur in genes, but they, of course, 280 00:13:34,950 --> 00:13:37,100 function at the RNA level. 281 00:13:37,100 --> 00:13:40,310 These are the motifs that occur at the boundaries 282 00:13:40,310 --> 00:13:42,770 of mammalian introns. 283 00:13:42,770 --> 00:13:46,700 So this first one is the 5 prime splice site motif. 284 00:13:46,700 --> 00:13:49,800 So these would be the bases that occur at the last three 285 00:13:49,800 --> 00:13:51,390 bases of the exon. 286 00:13:51,390 --> 00:13:55,079 The first two of the intron here are almost always GT.
287 00:13:55,079 --> 00:13:57,370 And then you have this position that I mentioned here-- 288 00:13:57,370 --> 00:13:59,900 it's almost always A or G. 289 00:13:59,900 --> 00:14:02,660 And then some positions that are biased for A, biased for G, 290 00:14:02,660 --> 00:14:06,870 and then slightly biased for T. And that 291 00:14:06,870 --> 00:14:11,075 is what you see when you look at a whole bunch of five 292 00:14:11,075 --> 00:14:14,080 prime ends of mammalian introns-- they have this motif. 293 00:14:14,080 --> 00:14:17,710 So some will have better matches, or worse, 294 00:14:17,710 --> 00:14:19,980 to this particular pattern. 295 00:14:19,980 --> 00:14:22,530 And that's the average pattern that you see. 296 00:14:22,530 --> 00:14:25,290 And it turns out that in this case, 297 00:14:25,290 --> 00:14:29,240 the recognition of that site is not by a protein, per se, 298 00:14:29,240 --> 00:14:32,410 but it's by a ribonucleoprotein complex. 299 00:14:32,410 --> 00:14:35,160 So there's actually an RNA called U1 snRNA 300 00:14:35,160 --> 00:14:37,880 that base pairs with the five prime splice site. 301 00:14:37,880 --> 00:14:40,760 And its sequence, or part of its sequence, 302 00:14:40,760 --> 00:14:45,320 is perfectly complementary to the consensus five prime splice 303 00:14:45,320 --> 00:14:45,820 site. 304 00:14:45,820 --> 00:14:48,645 So we can understand why five prime splice sites have 305 00:14:48,645 --> 00:14:50,660 this motif-- they're evolving to have 306 00:14:50,660 --> 00:14:53,400 a certain degree of complementarity to U1, 307 00:14:53,400 --> 00:14:55,140 in order to get efficiently recognized 308 00:14:55,140 --> 00:14:58,250 by the splicing machinery. 309 00:14:58,250 --> 00:15:01,260 Then at the three prime end of introns, 310 00:15:01,260 --> 00:15:02,650 you see this motif here. 311 00:15:02,650 --> 00:15:05,450 So here's the last base of the intron, a G, and then 312 00:15:05,450 --> 00:15:06,610 an A before it.
313 00:15:06,610 --> 00:15:09,230 Almost all introns end with AG. 314 00:15:09,230 --> 00:15:12,000 Then you have a pyrimidine ahead of it. 315 00:15:12,000 --> 00:15:15,230 Then you have basically an irrelevant position here 316 00:15:15,230 --> 00:15:20,720 at minus four, which is not strongly conserved. 317 00:15:20,720 --> 00:15:23,924 And then a stretch of residues that are usually, 318 00:15:23,924 --> 00:15:26,340 but not always, pyrimidines-- called the pyrimidine tract. 319 00:15:26,340 --> 00:15:29,080 And in this case, the recognition 320 00:15:29,080 --> 00:15:31,910 is actually by proteins rather than RNA. 321 00:15:31,910 --> 00:15:33,790 And there are two proteins. 322 00:15:33,790 --> 00:15:36,280 One called U2AF65 that binds the pyrimidine tract, 323 00:15:36,280 --> 00:15:40,550 and one, U2AF35, that binds that last YAG motif. 324 00:15:40,550 --> 00:15:42,590 And then there's an upstream motif 325 00:15:42,590 --> 00:15:44,820 here, that's just upstream of the 3 prime splice site 326 00:15:44,820 --> 00:15:47,350 that is quite degenerate and hard to find, 327 00:15:47,350 --> 00:15:49,890 called the branch point motif. 328 00:15:49,890 --> 00:15:52,210 OK, so, let's take an example. 329 00:15:52,210 --> 00:15:55,650 So the five prime splice site is a nice example of a motif, 330 00:15:55,650 --> 00:15:59,010 because you can uniquely align them, right? 331 00:15:59,010 --> 00:16:01,530 You can sequence DNA, sequence genomes, 332 00:16:01,530 --> 00:16:03,790 align the cDNA to the genome; that tells you 333 00:16:03,790 --> 00:16:05,700 exactly where the splice junctions are. 334 00:16:05,700 --> 00:16:10,710 And you can take the exons that have a 5 prime splice site, 335 00:16:10,710 --> 00:16:14,680 and align the sequences at the exon/intron boundary 336 00:16:14,680 --> 00:16:16,630 and get a precise motif.
337 00:16:16,630 --> 00:16:19,255 And then you can tally up the frequencies of the bases, 338 00:16:19,255 --> 00:16:20,630 and make a table like this, which 339 00:16:20,630 --> 00:16:23,760 we would call a position-specific probability 340 00:16:23,760 --> 00:16:26,050 matrix. 341 00:16:26,050 --> 00:16:31,900 And what you could then do to predict additional, 342 00:16:31,900 --> 00:16:35,850 say, five prime splice-site motifs in other genes-- 343 00:16:35,850 --> 00:16:38,010 for example, genes where you didn't get good cDNA 344 00:16:38,010 --> 00:16:40,560 coverage, because let's say they're not expressed 345 00:16:40,560 --> 00:16:43,530 in the cells that you analyzed-- you could then 346 00:16:43,530 --> 00:16:48,450 make this odds ratio here. 347 00:16:48,450 --> 00:16:50,920 So here we have a candidate sequence. 348 00:16:50,920 --> 00:16:56,425 So the motif is nine positions, often numbered minus 3 349 00:16:56,425 --> 00:16:58,420 to minus 1, which would be the exonic parts of this. 350 00:16:58,420 --> 00:17:00,830 And then plus 1 to plus 6 would be the first six 351 00:17:00,830 --> 00:17:01,680 bases of the intron. 352 00:17:01,680 --> 00:17:03,760 That's just the convention that's used. 353 00:17:03,760 --> 00:17:06,690 I'm sure it's going to drive the computer scientists crazy 354 00:17:06,690 --> 00:17:09,036 because we're not starting at 0, but that's 355 00:17:09,036 --> 00:17:10,619 usually what's used in the literature. 356 00:17:10,619 --> 00:17:13,010 And so we have a nine-base motif.
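[The tallying step just described-- aligned sites in, position-specific probability matrix out-- can be sketched like this; the four example sites are invented, and a real matrix would be estimated from thousands of aligned splice sites.]

```python
# Invented 9-base aligned regions: last 3 exon bases + first 6 intron bases.
sites = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "TGGGTAAGT"]

bases = "ACGT"
n = len(sites)
width = len(sites[0])

# ppm[b][j] = fraction of sites with base b at motif position j
ppm = {b: [sum(s[j] == b for s in sites) / n for j in range(width)]
       for b in bases}

# Each column is a probability distribution over A, C, G, T.
for j in range(width):
    assert abs(sum(ppm[b][j] for b in bases) - 1.0) < 1e-9

print(ppm["G"][3], ppm["T"][4])  # the near-invariant GT at intron positions +1, +2
```

With more sites you would typically also add pseudocounts, so that a base never seen at a position gets a small nonzero probability rather than a hard zero.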
357 00:17:13,010 --> 00:17:16,050 And then we're going to calculate the probability 358 00:17:16,050 --> 00:17:19,970 of generating that particular sequence, S given 359 00:17:19,970 --> 00:17:23,020 plus-- meaning given our foreground, 360 00:17:23,020 --> 00:17:27,760 or motif model-- as the product of the probability 361 00:17:27,760 --> 00:17:33,150 of generating the first base in the sequence, S1, using the column 362 00:17:33,150 --> 00:17:36,290 probability in the minus 3 position. 363 00:17:36,290 --> 00:17:41,660 So if the first base is a C, for example, that would be 0.4. 364 00:17:41,660 --> 00:17:43,910 And then the probability of generating the second base 365 00:17:43,910 --> 00:17:46,670 in the sequence using the next column, and so forth. 366 00:17:49,540 --> 00:17:54,207 If you made a vector for each position that had a 1 367 00:17:54,207 --> 00:17:56,040 for the base that occurred at that position, 368 00:17:56,040 --> 00:17:58,180 and a 0 for the other bases, and then you 369 00:17:58,180 --> 00:18:02,680 just did the dot product of that with the matrix, you'd get this. 370 00:18:02,680 --> 00:18:04,130 So, we multiply probabilities. 371 00:18:04,130 --> 00:18:08,420 So that is assuming independence between positions. 372 00:18:08,420 --> 00:18:12,600 And so that's a key assumption-- weight matrices assume 373 00:18:12,600 --> 00:18:16,020 that each position in the motif contributes independently 374 00:18:16,020 --> 00:18:17,980 to the overall strength of that motif. 375 00:18:17,980 --> 00:18:20,774 And that may or may not be true. They don't assume 376 00:18:20,774 --> 00:18:22,190 that it's homogeneous-- that is, you 377 00:18:22,190 --> 00:18:28,327 usually have, in a typical case, different probabilities 378 00:18:28,327 --> 00:18:30,160 in different columns, so it's inhomogeneous-- 379 00:18:30,160 --> 00:18:31,570 but they do assume independence. 380 00:18:31,570 --> 00:18:35,770 And then you often want to use a background model.
381 00:18:35,770 --> 00:18:38,100 For example, if your genome composition 382 00:18:38,100 --> 00:18:40,690 is 25% of each of the nucleotides, 383 00:18:40,690 --> 00:18:43,550 you could just have a background probability that 384 00:18:43,550 --> 00:18:45,930 was equally likely for each of the four, 385 00:18:45,930 --> 00:18:51,300 and then calculate the probability, S given minus, 386 00:18:51,300 --> 00:18:53,700 of generating that particular [INAUDIBLE] 387 00:18:53,700 --> 00:18:56,880 under the background model, and take the ratio of those two. 388 00:18:56,880 --> 00:18:58,800 And the advantage of that is that then you 389 00:18:58,800 --> 00:19:02,890 can find sequences that are-- that ratio, 390 00:19:02,890 --> 00:19:07,117 it could be 100 times more like a 5 prime splice site 391 00:19:07,117 --> 00:19:08,700 than like background-- or 1,000 times. 392 00:19:08,700 --> 00:19:12,480 Or you have some sort of scaling on it. 393 00:19:12,480 --> 00:19:14,850 Whereas, if you just take the raw probability, 394 00:19:14,850 --> 00:19:17,661 it's going to be something that's on the order of 1/4 395 00:19:17,661 --> 00:19:18,160 to the 9th power. 396 00:19:18,160 --> 00:19:20,631 So some very, very small number that's a little hard to 397 00:19:20,631 --> 00:19:21,130 work with. 398 00:19:24,800 --> 00:19:27,410 So when people talk about motifs, 399 00:19:27,410 --> 00:19:30,770 they often use language like exact, or precise, 400 00:19:30,770 --> 00:19:34,310 versus degenerate, strong versus weak, good versus lousy, 401 00:19:34,310 --> 00:19:38,140 depending on the context, who's listening. 402 00:19:38,140 --> 00:19:41,830 So an example of these would be a restriction enzyme. 403 00:19:41,830 --> 00:19:43,920 You often say restriction enzymes have 404 00:19:43,920 --> 00:19:46,380 very precise sequence specificity, 405 00:19:46,380 --> 00:19:51,190 they only cut-- EcoRI only cuts at GAATTC.
406 00:19:51,190 --> 00:19:55,300 Whereas, a TATA binding protein is somewhat more degenerate. 407 00:19:55,300 --> 00:19:58,510 It'll bind to a range of things. 408 00:19:58,510 --> 00:20:01,819 So I use degenerate there, you could say it's a weaker motif. 409 00:20:01,819 --> 00:20:04,110 You'll often-- if you want to try to make this precise, 410 00:20:04,110 --> 00:20:06,840 then the language of entropy and information 411 00:20:06,840 --> 00:20:10,360 offers additional terminology, like high information 412 00:20:10,360 --> 00:20:13,630 content, low entropy, et cetera. 413 00:20:13,630 --> 00:20:19,230 So let's take a look at this as perhaps a more natural, or more 414 00:20:19,230 --> 00:20:22,530 precise way of describing what we mean, here. 415 00:20:22,530 --> 00:20:25,220 So imagine you have a motif. 416 00:20:25,220 --> 00:20:27,840 We're going to do a motif of length one-- 417 00:20:27,840 --> 00:20:31,200 just to keep the math super simple, but you'll 418 00:20:31,200 --> 00:20:33,440 see it easily generalizes. 419 00:20:33,440 --> 00:20:40,090 So you have probabilities of the four nucleotides that are Pk. 420 00:20:40,090 --> 00:20:42,500 And you have background probabilities, qk. 421 00:20:42,500 --> 00:20:45,690 And we're going to assume those are all uniform, 422 00:20:45,690 --> 00:20:47,250 they're all a quarter. 423 00:20:47,250 --> 00:20:51,920 So then the statistical, or Shannon, entropy 424 00:20:51,920 --> 00:20:57,520 of a probability distribution-- or vector of probabilities, 425 00:20:57,520 --> 00:21:01,250 if you will-- is defined here. 426 00:21:01,250 --> 00:21:10,590 So H of q, where q is a distribution or, in this case, 427 00:21:10,590 --> 00:21:19,460 vector, is defined as minus the summation of qk log qk, 428 00:21:19,460 --> 00:21:20,160 in general. 429 00:21:20,160 --> 00:21:22,230 And then if you wanted to be in units of bits, 430 00:21:22,230 --> 00:21:25,060 you'd use log base 2.
So how many people have seen this equation before? Like half, I'm going to go with. OK, good. So who can tell me, first of all-- is this a positive quantity, negative quantity, non-negative, or what? Yeah, go ahead.

AUDIENCE: Log q_k is always going to be negative, so you take the negative of the sum of all the negatives to get a positive number.

PROFESSOR: Right, so this, in general, is a non-negative quantity, because we have this minus sign here. We're taking logs of things that are between 0 and 1, so the logs are negative, right? OK. And then what would be the entropy if I say that the distribution q is this-- 0, 1, 0, 0-- meaning it's a motif that's 100% C? What is the entropy of that? What was your name?

AUDIENCE: William.

PROFESSOR: William.

AUDIENCE: The entropy would be 0, because the vector is deterministic-- the base is known with certainty.

PROFESSOR: Right. And if we do the math-- you'll get, for the C term, you'll have a sum.
You'll have three terms that are 0 log 0-- it might crash your calculator, I guess-- and then one term that is 1 log 1. And 1 log 1, that's easy: that's 0, right? 0 log 0, you could say, is undefined. But by continuity-- taking the limit of x log x as x gets small, using L'Hôpital's rule, which gives 0-- this is defined to be 0 in information theory. And 1 log 1 is always 0. So the whole thing comes out to be 0. It's deterministic. Entropy is a measure of uncertainty, and so that makes sense-- if you know what the base is, there's no uncertainty; the entropy is 0.

So what about this vector-- 1/4, 1/4, 1/4, 1/4-- 25% of each of the bases? What is H of q? Anyone? I'm going to make you show me why, so-- anyone want to attempt this? Levi?

AUDIENCE: I think it's 2.

PROFESSOR: 2, OK. Can you explain?

AUDIENCE: Because the log of the 1/4's is going to be negative 2, and you're multiplying that by 1/4, so you're getting 1/2 for each, and adding them up equals 2.
PROFESSOR: Right-- in total, there are going to be four terms that are 1/4 times log of 1/4. Log of 1/4 is minus 2; 1/4 times minus 2 is minus 1/2; four times minus 1/2 is minus 2; and then you change the sign, because of the minus in front. So that equals 2.

And what about this one? Anyone see that one? This is a coin flip, basically, right? It's either A or G. [INAUDIBLE]. Anyone? Levi, want to do this one again?

AUDIENCE: It's 1.

PROFESSOR: OK, and why?

AUDIENCE: Because you have two terms of 0 log 0, which is 0, and two terms of 1/2 times the log of 1/2-- the log of 1/2 is just negative 1-- so you have two halves.

PROFESSOR: Yeah. So two terms like that, and then two terms that turn out to be 0-- 0 log 0-- and a minus in front. So that will be 1. So a coin flip has one bit of information. That's basically what we mean: if you have a fair coin and you don't know the outcome, we're going to call that one bit.
And so a base that could be any of the four, equally likely, has twice as much uncertainty.

This is related to the Boltzmann entropy that you may be familiar with from statistical mechanics, which is the log of the number of states: if you have N states and they're all equally likely, the Shannon entropy turns out to be the log of the number of states. We saw that here-- four states, equally likely, comes out to log of 4, or 2 bits. And that's true in general. So you can think of this as a generalization of Boltzmann entropy, if you want to.

OK, so why did he call it entropy? It turns out that Shannon, who was developing this in the late '40s as a theory of communication, scratched his head a little bit. He talked to his friend John von Neumann-- none other than the von Neumann involved in inventing computers-- and he said, "My concern was what to call it. I thought of calling it information. But the word was overly used." So back in 1949, information was already overused. "And so I decided to call it uncertainty."
And then he discussed it with John von Neumann, who had a better idea. He said, "You should call it entropy. In the first place, your uncertainty function has already been used in statistical mechanics under that name"-- so it already has a name. "In the second place, and more important, nobody knows what entropy really is, so in a debate you will always have the advantage." So keep that in mind. After you've taken this class, just start throwing it around, and you will win a lot of debates.

So how is information related to entropy? The way we're going to define it here, which is how it's often defined, is that information is reduction in uncertainty. So suppose I'm dealing with an unknown DNA sequence-- the lambda phage genome, say-- and it has 25% of each base. If you tell me you're going to send me two bases, I have no idea-- they could be any pair of bases. My uncertainty is 2 bits per base, or 4 bits, before you tell me anything.
If you then tell me it's the TA motif, which is always T followed by A, then now my uncertainty is 0, so the amount of information you just gave me is 4 bits-- you reduced my uncertainty from 4 bits to 0. So we define the information at a particular position as the entropy before-- "before" meaning the background, which is sort of your null hypothesis-- minus the entropy after-- after you've told me that this is an instance of the motif, with a particular model.

So, in this case, the information is going to be the entropy before-- this is just H of q, this term here-- minus this term, which is H of p. If the background is uniform, we said H of q is 2 bits per position, and so the information content of the motif is just 2 minus the entropy of the motif model. In general, it turns out that if the positions in the motif are independent, then the information content of the motif is 2w minus H of the motif, where w is the width of the motif.

So for example-- we said the entropy of this motif is 2 bits, right? Therefore, the information content is what?
Let's say this is our p-- this is our motif model. What is its information content?

AUDIENCE: 0?

PROFESSOR: 0. Why is it 0? Yeah, back row.

AUDIENCE: Because the entropy of that is 2, and the entropy of the null hypothesis, so to say, is also 2. So 2 minus 2 is 0.

PROFESSOR: Right-- the entropy of the background is 2, and the entropy of this is also 2. So 2 minus 2 is 0. And what about this? Let's say this was our motif-- a motif that's either A or G. We said the entropy of this is 1 bit, so what is the information content of this motif?

AUDIENCE: 1.

PROFESSOR: 1, and why is it 1?

AUDIENCE: Background is 2, and entropy here is 1.

PROFESSOR: Background is 2, entropy is 1. OK? And what about if I tell you it's the EcoRI restriction site? So it's GAATTC, a precise six-base motif-- it has to be exactly those bases. What is the information content of that motif?
In the back?

AUDIENCE: It's 12.

PROFESSOR: 12-- 12 what?

AUDIENCE: 12 bits.

PROFESSOR: 12 bits, and why is that?

AUDIENCE: Because the background is 2 times 6-- six bases, and 2 bits for each. And all the bases are determined at the specific [INAUDIBLE] enzyme site, so the entropy of that is 0, and 12 minus 0 is 12.

PROFESSOR: Right, the entropy of that motif is 0. Imagine the 4,096 possible six-mers: one of them has probability 1, and all the others have 0. You're going to have that big sum, and it's going to come out to be 0, OK?

Why is this useful at all-- or is it? One of the reasons why it's useful-- sorry, that's on a later slide. Just hang with me, and it will be clear why it's useful in a few slides. But for now, we have a description of information content. So the EcoRI site has 12 bits of information, a completely random position has 0, a short four-cutter restriction enzyme would have 2 times 4, or 8, bits of information, and an eight-cutter 16 bits.
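The worked examples above fit in a few lines of Python. This sketch assumes a uniform 25% background and independent positions, so the information content is I = 2w − H(motif):

```python
import math

def entropy_bits(probs):
    # H = -sum p log2 p, with 0 log 0 taken as 0 by continuity
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_content(pwm):
    """I = 2w - H(motif) bits, assuming a uniform (25% each) background
    and independent positions; pwm is one probability vector per position."""
    w = len(pwm)
    return 2 * w - sum(entropy_bits(col) for col in pwm)

uniform = [0.25, 0.25, 0.25, 0.25]
a_or_g = [0.5, 0.0, 0.5, 0.0]                          # order A, C, G, T
base = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0],
        "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}
ecori = [base[b] for b in "GAATTC"]                    # each position certain

print(information_content([uniform]))   # random position: 0 bits
print(information_content([a_or_g]))    # A-or-G position: 1 bit
print(information_content(ecori))       # EcoRI site: 12 bits
```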
So you can see, as the restriction enzyme gets longer, there's more information content.

So let's talk about the motif-finding problem, and then we'll return to the usefulness of information content. Can everyone see the motif that's present in all these sequences? If anyone can't, please let me know. You probably can't. What about now? These are the same sequences, but I've aligned them. Can anyone see a motif?

AUDIENCE: GGG GGG.

PROFESSOR: Yeah, I heard some G's. Right-- so there's this motif over here. It's pretty weak, and pretty degenerate. There are definitely some exceptions, but you can see that a lot of the sequences have at least GGC, possibly with an A after that. So this is the problem we're dealing with. You have a bunch of promoters, and the transcription factor that binds may be fairly degenerate-- maybe because it likes to bind cooperatively with several of its buddies, and so it doesn't have to have a very strong instance of the motif present. And so it can be quite difficult to find.
So that's why there's a real bioinformatics challenge. Motif finding is not done by lining up sequences by hand and drawing boxes-- although that's how the first motif was found, the TATA box. That's why it's called the TATA box: someone just drew a box on a sequence alignment. But these days, most motifs require some sort of algorithm to find.

Like I said, it's essentially a local multiple alignment problem. You want a multiple alignment, but it doesn't have to be global-- it can be local, just over a sub-region.

There are basically at least three different general approaches to the problem of motif finding. One approach is the so-called enumerative, or dictionary, approach. In this approach you say, well, we're looking for a motif of length 6, because this is a leucine zipper transcription factor that we're modeling, and they usually have binding sites around 6, so we're going to guess 6. And we're going to enumerate all the six-mers-- there are 4,096 of them.
We're going to count up their occurrences in a set of promoters that, for example, are turned on when you over-express this factor, and look at those frequencies divided by the frequencies of those six-mers in some background set-- either random sequences, or promoters that didn't turn on, something like that. You have two classes, and you look for statistical enrichment.

This approach is fine-- there's nothing wrong with it, and people use it all the time. One of the downsides, though, is that you're doing a lot of statistical tests. You're essentially testing each six-mer-- 4,096 statistical tests-- so you have to adjust the statistical significance for the number of tests that you do, and that can reduce your power. That's one main drawback. The other is that maybe this protein binds a rather degenerate motif, and a precise six-mer is just too precise-- none of them will occur often enough. You really have to have a degenerate motif, something like CRYGY.
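A bare-bones sketch of this enumerative approach in Python. The toy "induced" and "control" promoter sequences, and the planted six-mer TGACTC, are made up for illustration; a real analysis would attach a significance test with multiple-testing correction rather than using a raw frequency ratio:

```python
from collections import Counter
from itertools import product

def kmer_counts(seqs, k):
    """Count occurrences of every k-mer across a set of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

def enrichment(foreground, background, k=6, pseudocount=1):
    """Rank all 4**k k-mers by their count ratio between two sequence sets.

    Pseudocounts avoid division by zero; in practice you would also
    correct p-values for the 4**k tests being performed.
    """
    fg, bg = kmer_counts(foreground, k), kmer_counts(background, k)
    ratios = {}
    for kmer in map("".join, product("ACGT", repeat=k)):
        ratios[kmer] = (fg[kmer] + pseudocount) / (bg[kmer] + pseudocount)
    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)

# Toy example: TGACTC is planted in each of the "induced" promoters
induced = ["AATGACTCGG", "CCTGACTCAT", "GGTGACTCTT"]
control = ["AACGTGTACG", "CCATGCGTAT", "GGCATATGTT"]
print(enrichment(induced, control)[0])  # top hit: ('TGACTC', 4.0)
```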
That's really the motif it binds to, and so you don't see it unless you use something more degenerate. So you can generalize this to use regular expressions, et cetera. It's a reasonable approach.

Another approach, which we'll talk about in a moment, is probabilistic optimization, where you wander around the space of possible motifs until you find one that looks strong. And then there are deterministic versions of this, like MEME. We're going to focus today on the second one, mostly because it's a little bit more mysterious and interesting as an algorithm. And it's also [INAUDIBLE].

So, if the motif landscape looked like this-- imagine all possible motifs, where you've somehow come up with a 2D lattice of the possible motif sequences, and the strength of each motif, or the degree to which that motif description corresponds to the true motif, is represented by the height here-- then there's basically one optimal motif, and the closer you get to it, the better the fit. Then our problem is going to be relatively easy.
But it's also possible that it looks something like this: there are a lot of decoy motifs, or weaker motifs that are only slightly enriched in the sequence space. And so you can easily get tripped up if you're wandering around randomly. We don't know a priori, and it's probably not as simple as the first example. That's one of the issues that motivates these stochastic algorithms.

So, just to put this in context: the Gibbs motif sampler that we're going to be talking about is a Monte Carlo algorithm. That just means it's an algorithm that does some random sampling somewhere in it, so that the outcome you get isn't necessarily deterministic. You run it at different times, and you'll actually get different outputs, which can be a little bit disconcerting and annoying at times. But it turns out to be useful in some cases. There's also a special case called a Las Vegas algorithm, where the algorithm knows when it has found the optimal answer. But in general, not-- in general, you don't know for sure.
So the Gibbs motif sampler is basically a model where you have a likelihood for generating a set of sequences, S. Imagine you have 40 sequences that are bacterial promoters, each 40 bases long, let's say. That's your S. What you do, then, is consider a model in which there is a particular instance of the motif you're trying to discover at a particular position in each one of those sequences-- not necessarily the same position, just some position in each sequence. And we're going to describe the composition of that motif by a weight matrix-- one of these matrices of width W, with four rows specifying the frequencies of the four nucleotides at each position.

The setup here is that you want to calculate, or think about, the probability of (S, A)-- S is the actual sequences, and A is basically a vector that specifies the location of the motif instance in each of those 40 sequences. You want to calculate that conditional on capital theta, which is our weight matrix. In this case, I think I made a motif of length 8, and it's shown there in red.
So there's going to be a weight matrix of length 8, and then there's going to be some sort of background frequency vector, which might be the background composition of the E. coli genome, for example. And so the probability of generating those sequences, together with those particular locations, is going to be proportional to this: basically, you use the little-theta background vector for all the positions except the ones inside the motif, starting at position A_k here. For those 8 positions you use the corresponding columns of the weight matrix, and then you go back to using the background probabilities. Question, yeah?

AUDIENCE: Is this for finding motifs based on other known motifs? Or is this--

PROFESSOR: No-- I'm sorry, I should have prefaced that. We're doing de novo motif finding. We're going to give the algorithm some sequences of a given length-- they can even be of variable lengths-- and we're going to give it a guess of what the length of the motif is. So we're going to say, we think it's 8.
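The generating model just described can be sketched as follows: background probabilities at every position outside a window of width W, and the weight matrix columns inside it. The width-3 weight matrix and its 0.7/0.1 values are invented for illustration:

```python
def sequence_likelihood(seq, start, pwm, background):
    """P(seq | motif instance starting at `start`): use the background
    frequency for every position outside the motif window, and the
    corresponding weight matrix column for positions inside it."""
    w = len(pwm)
    p = 1.0
    for i, base in enumerate(seq):
        if start <= i < start + w:
            p *= pwm[i - start][base]   # inside the motif window
        else:
            p *= background[base]       # outside: background model
    return p

background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
pwm = [  # toy width-3 motif favoring A, C, G
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]

# Placing the motif over the ACG at position 2 yields a much larger
# likelihood than placing it over the TT at position 0
print(sequence_likelihood("TTACGT", 2, pwm, background))
print(sequence_likelihood("TTACGT", 0, pwm, background))
```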
That could come from structural reasons. Or often you really have no idea, so you just guess-- a lot of times it's kind of short, so we'll go with 6 or 8, or you try different lengths. Totally de novo motif finding.

OK, so how does the algorithm work? You have N sequences of length L, and you've guessed that the motif has width W. You choose starting positions at random-- this is a vector of the starting position in each sequence, and we're going to choose completely random positions within the N sequences. They have to be at least W before the end, so that we get a whole motif-- that's just an accounting thing to make it simpler. Then you choose one of the sequences at random-- say, the first sequence. You make a weight matrix model of width W from the instances in the other sequences. So for example-- actually, I have slides on this, so we'll just do it with the slides; you'll see what this looks like in a moment. You have instances here in this sequence, here in this one, here.
You take all those, line them up, make a weight matrix out of them, and then you score the positions in sequence 1 for how well they match.

So, let me just do this. These are your motif instances-- again, totally random at the beginning. Then you build a weight matrix from those by lining them up and just counting frequencies. Then you pick a sequence at random-- and your weight matrix doesn't include that sequence, typically. And then you take your theta matrix and slide it along the sequence. You consider every sub-sequence of length W-- the one that goes from 1 to W, the one that goes from 2 to W plus 1, et cetera-- all the way along the sequence until you get to the end. And you calculate the probability of each, using the likelihood I gave you before. So it's basically the probability of generating the sequence where you use the background vector for all the positions except the particular motif instance you're considering, and you use the motif model for that. Does that make sense?
876 00:44:01,190 --> 00:44:06,810 So, if you happen to have a good-looking occurrence of the motif 877 00:44:06,810 --> 00:44:10,220 at this position, here, in the sequence, 878 00:44:10,220 --> 00:44:15,560 then you would get a higher likelihood. 879 00:44:15,560 --> 00:44:24,010 So for example, if the motif was, let's say it's 3 long, 880 00:44:24,010 --> 00:44:40,650 and it happened to favor ACG, then 881 00:44:40,650 --> 00:44:44,210 if you have a sequence here that has, 882 00:44:44,210 --> 00:44:46,364 let's say, TTT, that's going 883 00:44:46,364 --> 00:44:48,030 to have a low probability in this motif. 884 00:44:48,030 --> 00:44:50,380 It's going to be 0.1 cubed. 885 00:44:50,380 --> 00:44:54,152 And then if you have an occurrence of, say, ACT, 886 00:44:54,152 --> 00:44:55,860 that's going to have a higher probability. 887 00:44:55,860 --> 00:44:58,100 It's going to be 0.7 times 0.7 times 0.1. 888 00:44:58,100 --> 00:44:59,260 So, quite a bit higher. 889 00:44:59,260 --> 00:45:02,450 So you start, it'll be low for this triplet 890 00:45:02,450 --> 00:45:04,370 here-- so I'll put a low value here. 891 00:45:04,370 --> 00:45:07,190 TTA is also going to be low. 892 00:45:07,190 --> 00:45:09,350 TAC, also low. 893 00:45:09,350 --> 00:45:12,540 But ACT, that matches 2 out of 3 to the motif. 894 00:45:12,540 --> 00:45:15,040 It's going to be a lot better. 895 00:45:15,040 --> 00:45:17,855 And then the next triplet is going to be low again, et cetera. 896 00:45:17,855 --> 00:45:20,230 So you just slide this along and calculate probabilities. 897 00:45:23,130 --> 00:45:27,270 And then what you do is you sample from this distribution. 898 00:45:27,270 --> 00:45:31,500 These probabilities don't necessarily sum to 1. 899 00:45:31,500 --> 00:45:34,165 But you re-normalize them so that they do sum to 1, 900 00:45:34,165 --> 00:45:36,440 you just add them up, divide by the sum. 901 00:45:36,440 --> 00:45:37,590 Now they sum to 1.
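The renormalize-and-sample step just described could be written like this. A minimal sketch-- the function name and the use of Python's random module are my own choices, not from the lecture.

```python
import random

def sample_start(scores, rng=random):
    """Renormalize window scores so they sum to 1, then sample a
    start position from the resulting probability distribution."""
    total = sum(scores)
    probs = [s / total for s in scores]   # now they sum to 1
    r = rng.random()
    cumulative = 0.0
    for start, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return start
    return len(probs) - 1                 # guard against float rounding
```

High-scoring positions are sampled most often, but low-scoring ones still get picked occasionally-- which, as discussed below, is what lets the sampler escape bad assignments.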
902 00:45:37,590 --> 00:45:39,919 And now you sample those sites in that sequence, 903 00:45:39,919 --> 00:45:41,710 according to that probability distribution. 904 00:45:45,560 --> 00:45:47,500 Like I said, in this case you might end up 905 00:45:47,500 --> 00:45:49,729 sampling-- that's the highest probability site, 906 00:45:49,729 --> 00:45:50,770 so you might sample that. 907 00:45:50,770 --> 00:45:53,874 But you also might sample one of these other ones. 908 00:45:53,874 --> 00:45:55,540 It's unlikely you would sample this one, 909 00:45:55,540 --> 00:45:56,640 because that's very low. 910 00:45:56,640 --> 00:46:03,650 But you actually sometimes sample one that's not so great. 911 00:46:03,650 --> 00:46:06,790 So you sample a starting position in that sequence, 912 00:46:06,790 --> 00:46:09,490 and you basically-- wherever you had originally 913 00:46:09,490 --> 00:46:13,640 assigned it in sequence 1, now you move it to that new location. 914 00:46:13,640 --> 00:46:17,380 We've just changed the assignment 915 00:46:17,380 --> 00:46:20,950 of where we think the motif might be in that sequence. 916 00:46:20,950 --> 00:46:22,450 And then you choose another sequence 917 00:46:22,450 --> 00:46:24,440 at random from your list. 918 00:46:24,440 --> 00:46:26,450 Often you go through the sequences sequentially, 919 00:46:26,450 --> 00:46:29,570 and then you make a new weight matrix model. 920 00:46:29,570 --> 00:46:32,450 So how will that weight matrix model differ from the last one? 921 00:46:32,450 --> 00:46:35,730 Well it'll differ because the instance 922 00:46:35,730 --> 00:46:39,780 of the motif in sequence 1 is now at a new location, 923 00:46:39,780 --> 00:46:40,360 in general. 924 00:46:40,360 --> 00:46:42,776 I mean, you might have sampled the exact same location you 925 00:46:42,776 --> 00:46:44,777 started, but in general it'll move. 926 00:46:44,777 --> 00:46:46,860 And so now, you've got a slightly different weight 927 00:46:46,860 --> 00:46:48,250 matrix.
928 00:46:48,250 --> 00:46:52,200 Most of the data going into it, N minus 1, 929 00:46:52,200 --> 00:46:53,200 is going to be the same. 930 00:46:53,200 --> 00:46:55,040 But one of them is going to be different. 931 00:46:55,040 --> 00:46:57,400 So it'll change a little bit. 932 00:46:57,400 --> 00:46:59,690 You make a new weight matrix, and then you 933 00:46:59,690 --> 00:47:00,810 pick a new sequence. 934 00:47:00,810 --> 00:47:03,000 You slide that weight matrix along that sequence, 935 00:47:03,000 --> 00:47:05,541 you get this distribution, you sample from that distribution, 936 00:47:05,541 --> 00:47:07,980 and you keep going. 937 00:47:07,980 --> 00:47:12,800 Yeah, this was described by Lawrence in 1993, 938 00:47:12,800 --> 00:47:16,670 and I'll post that paper. 939 00:47:16,670 --> 00:47:19,210 OK, so you sample a position with that, 940 00:47:19,210 --> 00:47:20,890 and you update the location. 941 00:47:20,890 --> 00:47:23,130 So now we sampled that really high-probability one, 942 00:47:23,130 --> 00:47:26,520 so we moved the motif over to that new orange location, 943 00:47:26,520 --> 00:47:28,742 there. 944 00:47:28,742 --> 00:47:30,920 I don't know if these animations are helping at all. 945 00:47:30,920 --> 00:47:33,670 And then you update your weight matrix. 946 00:47:37,080 --> 00:47:40,240 And then you iterate until convergence. 947 00:47:40,240 --> 00:47:44,700 So you typically have a set of N sequences, 948 00:47:44,700 --> 00:47:46,259 you go through them once. 949 00:47:46,259 --> 00:47:48,800 You have a weight matrix, and then you go through them again. 950 00:47:48,800 --> 00:47:50,110 You go through a few times. 951 00:47:50,110 --> 00:47:51,960 And maybe at a certain point, you 952 00:47:51,960 --> 00:47:54,600 end up re-sampling the same sites as you 953 00:47:54,600 --> 00:47:57,910 did in the last iteration-- same exact sites. 954 00:47:57,910 --> 00:47:59,720 You've converged.
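Putting the pieces together, the whole loop-- random starts, a leave-one-out weight matrix, scoring, sampling, iterating-- might look like this. This is a sketch under stated assumptions, not the implementation from the lecture: the 0.5 pseudocount, the uniform background, and a fixed number of sweeps in place of a convergence test are all simplifications.

```python
import random

BASES = "ACGT"

def gibbs_motif(seqs, W, n_sweeps=200, rng=None):
    """One plain version of the Gibbs motif sampler described above.
    Returns the sampled start position for each sequence."""
    rng = rng or random.Random()
    # 1. Choose a completely random start in each sequence, at least
    #    W before the end so a whole motif fits.
    starts = [rng.randrange(len(s) - W + 1) for s in seqs]
    for _ in range(n_sweeps):
        for i in range(len(seqs)):
            # 2. Build a weight matrix from the instances in the
            #    other N-1 sequences (pseudocount avoids zeros).
            counts = [{b: 0.5 for b in BASES} for _ in range(W)]
            for k, (seq, s) in enumerate(zip(seqs, starts)):
                if k != i:
                    for j in range(W):
                        counts[j][seq[s + j]] += 1
            theta = [{b: col[b] / sum(col.values()) for b in BASES}
                     for col in counts]
            # 3. Score every width-W window of sequence i against
            #    the motif model relative to a uniform background.
            scores = []
            for start in range(len(seqs[i]) - W + 1):
                ratio = 1.0
                for j in range(W):
                    ratio *= theta[j][seqs[i][start + j]] / 0.25
                scores.append(ratio)
            # 4. Sample the new start in proportion to the scores.
            starts[i] = rng.choices(range(len(scores)),
                                    weights=scores)[0]
    return starts
```

In practice you would also track the theta matrix from sweep to sweep and stop once it stops changing, and-- as noted later in the lecture-- run the whole thing several times from different random starts and trust the motif that keeps coming back.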
955 00:47:59,720 --> 00:48:03,265 Or, you keep track of the theta matrices 956 00:48:03,265 --> 00:48:06,320 that you get after going through the whole set of sequences, 957 00:48:06,320 --> 00:48:08,460 and from one iteration to the next, 958 00:48:08,460 --> 00:48:11,545 the theta matrix hasn't really changed much. 959 00:48:11,545 --> 00:48:12,440 You've converged. 960 00:48:16,790 --> 00:48:20,430 So let's do an example of this. 961 00:48:20,430 --> 00:48:23,570 Here I made up a motif, and this is a representation 962 00:48:23,570 --> 00:48:26,650 where the four bases have these colors assigned to them. 963 00:48:26,650 --> 00:48:29,440 And you can see that this motif is quite strong. 964 00:48:29,440 --> 00:48:32,870 It really strongly prefers A at this position 965 00:48:32,870 --> 00:48:34,259 here, and et cetera. 966 00:48:34,259 --> 00:48:36,550 And I put it at the same position in all the sequences, 967 00:48:36,550 --> 00:48:40,090 just to make life simple. 968 00:48:40,090 --> 00:48:46,010 And then a former student in the lab, [INAUDIBLE], 969 00:48:46,010 --> 00:48:50,990 he implemented the Gibbs Sampler in Matlab, actually, 970 00:48:50,990 --> 00:48:53,730 and made a little video of what's going on. 971 00:48:53,730 --> 00:48:58,470 So the upper part shows the current weight matrix. 972 00:48:58,470 --> 00:49:01,590 Notice it's pretty random-looking 973 00:49:01,590 --> 00:49:02,550 at the beginning. 974 00:49:02,550 --> 00:49:10,740 And the right parts show where the motif 975 00:49:10,740 --> 00:49:13,874 is, or the position that we're currently considering. 976 00:49:13,874 --> 00:49:15,540 So this shows the position that was last 977 00:49:15,540 --> 00:49:19,800 sampled in the last round.
978 00:49:19,800 --> 00:49:21,570 And this shows the probability density 979 00:49:21,570 --> 00:49:24,560 along each sequence of what's the probability 980 00:49:24,560 --> 00:49:28,050 that the motif occurs at each particular place 981 00:49:28,050 --> 00:49:30,620 in the sequence. 982 00:49:30,620 --> 00:49:32,440 And that's what happens over time. 983 00:49:32,440 --> 00:49:35,410 So it's obviously very fast, so I'll run it again 984 00:49:35,410 --> 00:49:38,070 and maybe pause it partway. 985 00:49:38,070 --> 00:49:41,160 We're starting from a very random-looking motif. 986 00:49:44,480 --> 00:49:48,420 This is what you get after not too many iterations-- probably 987 00:49:48,420 --> 00:49:50,870 like 100 or so. 988 00:49:50,870 --> 00:49:55,730 And now you can see your motif-- your weight matrix is now quite 989 00:49:55,730 --> 00:49:59,670 biased, and now favors A at this position, and so forth. 990 00:49:59,670 --> 00:50:03,680 And the locations of your motif, most of them 991 00:50:03,680 --> 00:50:05,840 are around this position, around 6 or 7 992 00:50:05,840 --> 00:50:09,450 in the sequence-- that's where we put the motif in. 993 00:50:09,450 --> 00:50:11,020 But not all, some of them. 994 00:50:11,020 --> 00:50:14,410 And then you can see the probabilities-- white is high, 995 00:50:14,410 --> 00:50:17,010 black is low-- in some sequences, 996 00:50:17,010 --> 00:50:18,530 it's very, very confident, the motif 997 00:50:18,530 --> 00:50:21,090 is exactly at that position, like this first sequence here. 998 00:50:21,090 --> 00:50:23,980 And others, it's got some uncertainty 999 00:50:23,980 --> 00:50:26,460 about where the motif might be.
1000 00:50:26,460 --> 00:50:29,450 And then we let it run a little bit more, 1001 00:50:29,450 --> 00:50:33,010 and it eventually converges to being 1002 00:50:33,010 --> 00:50:37,213 very confident that the motif has the sequence, A C G T A G C 1003 00:50:37,213 --> 00:50:41,324 A, and that it occurs at that particular position 1004 00:50:41,324 --> 00:50:41,990 in the sequence. 1005 00:50:45,200 --> 00:50:49,630 So who can tell me why this actually works? 1006 00:50:49,630 --> 00:50:51,590 We're choosing positions at random, 1007 00:50:51,590 --> 00:50:55,280 updating a weight matrix, why does that actually 1008 00:50:55,280 --> 00:50:58,646 help you find the real motif that's in these sequences? 1009 00:51:01,840 --> 00:51:04,310 Any ideas? 1010 00:51:04,310 --> 00:51:06,774 Or who can make an argument that it shouldn't work? 1011 00:51:09,830 --> 00:51:10,330 Yeah? 1012 00:51:10,330 --> 00:51:11,473 What was your name again? 1013 00:51:11,473 --> 00:51:12,056 AUDIENCE: Dan. 1014 00:51:12,056 --> 00:51:13,347 PROFESSOR: Dan, yeah, go ahead. 1015 00:51:13,347 --> 00:51:18,678 AUDIENCE: So, couldn't it, sort of, in a certain situation 1016 00:51:18,678 --> 00:51:25,110 have different sub-motifs that are also sort of rich, 1017 00:51:25,110 --> 00:51:27,090 and because you're sampling randomly 1018 00:51:27,090 --> 00:51:31,960 you might be stuck inside of those boundaries 1019 00:51:31,960 --> 00:51:36,142 where you're searching your composition? 1020 00:51:36,142 --> 00:51:37,350 PROFESSOR: Yeah, that's good. 1021 00:51:37,350 --> 00:51:39,560 So Dan's point is that you can get 1022 00:51:39,560 --> 00:51:45,350 stuck in sub-optimal smaller or weaker motifs. 1023 00:51:45,350 --> 00:51:47,035 So that's certainly true. 1024 00:51:47,035 --> 00:51:49,160 So you're saying, maybe this example is artificial? 
1025 00:51:49,160 --> 00:51:51,326 Because I had started with totally random sequences, 1026 00:51:51,326 --> 00:51:54,160 and I put a pretty strong motif in a particular place, 1027 00:51:54,160 --> 00:51:58,080 so there were no-- it's more like that mountain, 1028 00:51:58,080 --> 00:52:01,780 that structure where there's just one motif to find. 1029 00:52:01,780 --> 00:52:03,580 So it's perhaps an easy case. 1030 00:52:03,580 --> 00:52:06,740 But still, what I want to know is how does this algorithm, 1031 00:52:06,740 --> 00:52:09,322 how did it actually find that motif? 1032 00:52:09,322 --> 00:52:15,290 He implemented exactly that algorithm that I described. 1033 00:52:15,290 --> 00:52:19,806 Why does it tend to go towards [INAUDIBLE]? 1034 00:52:19,806 --> 00:52:22,056 After a long time, remember it's a long time, 1035 00:52:22,056 --> 00:52:23,222 it's hundreds of iterations. 1036 00:52:23,222 --> 00:52:27,614 AUDIENCE: So you're covering a lot in the sequence, 1037 00:52:27,614 --> 00:52:34,214 just the random searching of the sequence, when you're-- 1038 00:52:34,214 --> 00:52:35,755 PROFESSOR: There are many iterations. 1039 00:52:35,755 --> 00:52:37,570 You're considering many possible locations 1040 00:52:37,570 --> 00:52:40,620 within the sequences, that's true. 1041 00:52:40,620 --> 00:52:47,238 But why does it eventually-- why does it converge to something? 1042 00:52:47,238 --> 00:52:48,935 AUDIENCE: I guess, because you're 1043 00:52:48,935 --> 00:52:52,806 seeing your motif more plainly than you're 1044 00:52:52,806 --> 00:52:56,694 seeing other random motifs. 1045 00:52:56,694 --> 00:52:59,610 So it will hit it more frequently-- randomly. 1046 00:52:59,610 --> 00:53:03,012 And therefore, converge [INAUDIBLE]. 1047 00:53:03,012 --> 00:53:04,990 PROFESSOR: Yeah, that's true. 1048 00:53:04,990 --> 00:53:10,760 Can someone give more of the intuition behind this? 1049 00:53:10,760 --> 00:53:11,734 Yeah?
1050 00:53:11,734 --> 00:53:13,216 AUDIENCE: I just have a question. 1051 00:53:13,216 --> 00:53:15,192 Is each iteration an independent test? 1052 00:53:15,192 --> 00:53:20,626 For example, if you iterate over the same sequence set 1053 00:53:20,626 --> 00:53:24,792 100 times, and you're updating your weight matrix each time, 1054 00:53:24,792 --> 00:53:29,485 does that mean that updating the weight matrix is also 1055 00:53:29,485 --> 00:53:31,481 taking into account the previous-- 1056 00:53:31,481 --> 00:53:34,475 that this is the same sample space? 1057 00:53:34,475 --> 00:53:35,937 PROFESSOR: Yeah, the weight matrix, 1058 00:53:35,937 --> 00:53:38,270 after you go through one iteration of all the sequences, 1059 00:53:38,270 --> 00:53:40,120 you have a weight matrix. 1060 00:53:40,120 --> 00:53:43,900 You carry that over, you don't start from scratch. 1061 00:53:43,900 --> 00:53:45,470 You bring that weight matrix back up, 1062 00:53:45,470 --> 00:53:49,150 and use that to score, let's say, that first sequence. 1063 00:53:49,150 --> 00:53:56,090 Yeah, the weight matrix just keeps moving around. 1064 00:53:56,090 --> 00:53:58,450 Moves a little bit every time you sample a sequence. 1065 00:53:58,450 --> 00:54:00,740 AUDIENCE: So you constantly get a strong [INAUDIBLE]. 1066 00:54:00,740 --> 00:54:02,206 PROFESSOR: Well, does it? 1067 00:54:02,206 --> 00:54:04,150 AUDIENCE: Well, I guess-- 1068 00:54:04,150 --> 00:54:07,070 PROFESSOR: Would it constantly get stronger? 1069 00:54:07,070 --> 00:54:11,190 What's to make it get stronger or weaker? 1070 00:54:11,190 --> 00:54:14,599 I mean, this is sort of-- you're on the right track. 1071 00:54:14,599 --> 00:54:18,008 AUDIENCE: If it is random, then there's 1072 00:54:18,008 --> 00:54:21,379 some probability that you're going to find this motif again, 1073 00:54:21,379 --> 00:54:22,878 at which point it will get stronger.
1074 00:54:22,878 --> 00:54:28,235 But, if it's-- given enough iterations, 1075 00:54:28,235 --> 00:54:35,044 it gets stronger as long as you hit different spots at random. 1076 00:54:35,044 --> 00:54:35,960 PROFESSOR: Yeah, yeah. 1077 00:54:35,960 --> 00:54:39,832 That's what I'm-- I think there was a comment. 1078 00:54:39,832 --> 00:54:40,800 Jacob, yeah? 1079 00:54:40,800 --> 00:54:43,731 AUDIENCE: Well, you can think about it as a random walk 1080 00:54:43,731 --> 00:54:44,956 through the landscape. 1081 00:54:44,956 --> 00:54:47,436 Eventually, it has a high probability 1082 00:54:47,436 --> 00:54:50,412 of taking that motif, and updating the [INAUDIBLE] 1083 00:54:50,412 --> 00:54:54,293 direction, just from the probability of [INAUDIBLE]. 1084 00:54:54,293 --> 00:54:54,876 PROFESSOR: OK. 1085 00:54:54,876 --> 00:54:57,356 AUDIENCE: And given the [INAUDIBLE]. 1086 00:55:00,332 --> 00:55:09,240 PROFESSOR: OK, let's say I had 100 sequences of length, 1087 00:55:09,240 --> 00:55:11,330 I don't know, 30. 1088 00:55:11,330 --> 00:55:14,130 And the width of the motif is 6. 1089 00:55:20,160 --> 00:55:22,310 So here's our sequences. 1090 00:55:22,310 --> 00:55:29,800 We choose random positions for the start position, 1091 00:55:29,800 --> 00:55:32,490 and let's say it was this example where 1092 00:55:32,490 --> 00:55:38,932 the real motif, I put it right here, in all the sequences. 1093 00:55:38,932 --> 00:55:39,890 That's where it starts. 1094 00:55:43,160 --> 00:55:43,785 Does this help? 1095 00:55:46,370 --> 00:55:49,630 So it's 30 and 6, so there's 25 possible start positions. 1096 00:55:49,630 --> 00:55:53,060 I did that to make it a little easier. 1097 00:55:53,060 --> 00:55:56,070 So what would happen in that first iteration? 1098 00:55:56,070 --> 00:55:57,920 What can you say about what the weight 1099 00:55:57,920 --> 00:55:59,730 matrix would look like? 1100 00:55:59,730 --> 00:56:06,315 It's going to be width W-- you know, columns 1, 2, 3, up to 6.
1101 00:56:09,790 --> 00:56:13,206 We're going to give it 100 positions at random. 1102 00:56:13,206 --> 00:56:16,150 The motif is here-- let's say it's 1103 00:56:16,150 --> 00:56:19,230 a very strong motif, that's a 12-bit motif. 1104 00:56:19,230 --> 00:56:24,390 So it's 100%-- it's EcoRI. 1105 00:56:24,390 --> 00:56:25,286 It's that. 1106 00:56:28,430 --> 00:56:31,570 What would that weight matrix look like, 1107 00:56:31,570 --> 00:56:33,670 in this first iteration, when you first 1108 00:56:33,670 --> 00:56:37,490 just sample the sites at random? 1109 00:56:37,490 --> 00:56:40,130 What kind of probabilities would it have? 1110 00:56:40,130 --> 00:56:42,110 AUDIENCE: [INAUDIBLE] 1111 00:56:42,110 --> 00:56:44,454 PROFESSOR: Equal? 1112 00:56:44,454 --> 00:56:47,940 OK-- perfectly equal? 1113 00:56:47,940 --> 00:56:48,851 AUDIENCE: Roughly. 1114 00:56:48,851 --> 00:56:49,434 PROFESSOR: OK. 1115 00:56:49,434 --> 00:56:50,430 Any box? 1116 00:56:58,398 --> 00:57:01,620 Are we likely to hit the actual motif, ever, 1117 00:57:01,620 --> 00:57:03,318 in that first iteration? 1118 00:57:03,318 --> 00:57:06,282 AUDIENCE: No, because you have a uniform probability 1119 00:57:06,282 --> 00:57:07,270 of sampling. 1120 00:57:07,270 --> 00:57:10,514 Well, uniform at each one of the 25 positions? 1121 00:57:10,514 --> 00:57:11,222 PROFESSOR: Right. 1122 00:57:11,222 --> 00:57:14,186 AUDIENCE: Right now, you're not sampling proportional 1123 00:57:14,186 --> 00:57:15,668 to the likelihood. 1124 00:57:15,668 --> 00:57:19,406 PROFESSOR: So the chance of hitting the motif in any given 1125 00:57:19,406 --> 00:57:20,114 sequence is what? 1126 00:57:20,114 --> 00:57:20,608 AUDIENCE: 1/25. 1127 00:57:20,608 --> 00:57:21,274 PROFESSOR: 1/25. 1128 00:57:21,274 --> 00:57:22,584 We have 100 sequences.
1129 00:57:22,584 --> 00:57:25,529 AUDIENCE: So that's four out of-- 1130 00:57:25,529 --> 00:57:27,862 PROFESSOR: So on average, I'll hit the motif four times, 1131 00:57:27,862 --> 00:57:28,530 right. 1132 00:57:28,530 --> 00:57:32,680 The other 96 positions will be essentially random, right? 1133 00:57:35,760 --> 00:57:40,080 So you initially said this was going to be uniform, right? 1134 00:57:40,080 --> 00:57:43,280 On average, 25% of each base, plus or minus 1135 00:57:43,280 --> 00:57:47,730 a little bit of sampling error-- could be 23, 24, 26. 1136 00:57:47,730 --> 00:57:52,210 But now, you pointed out that it's going to be four. 1137 00:57:52,210 --> 00:57:55,900 You're going to hit the motif four times, on average. 1138 00:57:55,900 --> 00:57:59,280 So, can you say anything more? 1139 00:57:59,280 --> 00:58:02,737 AUDIENCE: Could you maybe have a slight bias towards G 1140 00:58:02,737 --> 00:58:04,228 on the first position? 1141 00:58:04,228 --> 00:58:08,701 Slightly biased towards A on the second and third? 1142 00:58:08,701 --> 00:58:12,180 Slightly biased towards T on the fourth and fifth. 1143 00:58:12,180 --> 00:58:14,665 And slightly biased towards C in the sixth? 1144 00:58:14,665 --> 00:58:16,156 So it would be slightly biased-- 1145 00:58:16,156 --> 00:58:19,138 PROFESSOR: Right, so remind me of your name? 1146 00:58:19,138 --> 00:58:20,132 AUDIENCE: I'm Eric. 1147 00:58:20,132 --> 00:58:22,370 PROFESSOR: Eric, OK, so Eric says 1148 00:58:22,370 --> 00:58:24,640 that because four of the sequences 1149 00:58:24,640 --> 00:58:26,160 will have a G at the first position, 1150 00:58:26,160 --> 00:58:28,451 because those are the ones where you sampled the motif, 1151 00:58:28,451 --> 00:58:31,810 and the other 96 will have each of the four bases equally 1152 00:58:31,810 --> 00:58:36,760 likely, on average you have like 24%-- plus 4 for G, right?
1153 00:58:36,760 --> 00:58:38,640 Something like 28%-- this will be 1154 00:58:38,640 --> 00:58:42,860 28%, plus or minus a little bit. 1155 00:58:42,860 --> 00:58:50,520 And these other ones will be whatever that works out to be, 1156 00:58:50,520 --> 00:58:54,790 24 or something like that-- 24-ish, on average. 1157 00:58:54,790 --> 00:58:56,540 Again, it may not come out exactly 1158 00:58:56,540 --> 00:58:59,750 like-- G may not be number one, but it's more often 1159 00:58:59,750 --> 00:59:01,680 going to be number one than any other base. 1160 00:59:01,680 --> 00:59:06,370 And on average, it'll be more like 28% rather than 25%. 1161 00:59:06,370 --> 00:59:08,600 And similarly for position two, A 1162 00:59:08,600 --> 00:59:12,610 will be 28%, and three, and et cetera. 1163 00:59:12,610 --> 00:59:16,264 And then the sixth will be-- C will 1164 00:59:16,264 --> 00:59:17,430 have a little bit of a bias. 1165 00:59:17,430 --> 00:59:20,210 OK, so even in that first round, when 1166 00:59:20,210 --> 00:59:22,380 you're sampling that first sequence, 1167 00:59:22,380 --> 00:59:24,920 the matrix is going to be slightly biased 1168 00:59:24,920 --> 00:59:27,050 toward the motif-- depending how the sampling went. 1169 00:59:27,050 --> 00:59:30,100 You might not have hit any instances of the motif, right? 1170 00:59:30,100 --> 00:59:34,766 But often, it'll be a little bit-- 1171 00:59:34,766 --> 00:59:37,760 Is that enough of a bias to give you 1172 00:59:37,760 --> 00:59:42,994 a good chance of selecting the motif in that first sequence? 1173 00:59:42,994 --> 00:59:44,743 AUDIENCE: You mean in the first iteration? 1174 00:59:44,743 --> 00:59:48,155 PROFESSOR: Let's say the first random sequence that's sampled. 1175 00:59:48,155 --> 00:59:48,655 No. 1176 00:59:48,655 --> 00:59:50,122 You're shaking your head. 1177 00:59:50,122 --> 00:59:55,501 Not enough of a bias because-- it's 1178 00:59:55,501 --> 01:00:01,955 0.28 over 0.25 to the sixth power, right?
1179 01:00:01,955 --> 01:00:03,050 So it's like-- 1180 01:00:03,050 --> 01:00:05,228 AUDIENCE: The likelihood is still close to 1. 1181 01:00:05,228 --> 01:00:07,025 Like, that's [INAUDIBLE] ratio. 1182 01:00:07,025 --> 01:00:09,150 PROFESSOR: So it's something like 1.1 to the sixth, 1183 01:00:09,150 --> 01:00:10,400 or something like that. 1184 01:00:10,400 --> 01:00:13,810 So it might be close to 2, might be twice as likely. 1185 01:00:13,810 --> 01:00:16,750 But still, there's 25 positions. 1186 01:00:16,750 --> 01:00:18,230 Does that make sense? 1187 01:00:18,230 --> 01:00:21,790 So it's quite likely that you won't 1188 01:00:21,790 --> 01:00:25,020 sample the motif in that first-- you'll sample something else. 1189 01:00:25,020 --> 01:00:31,440 Which will take it away in some random direction. 1190 01:00:31,440 --> 01:00:34,462 So who can tell me how this actually ends up working? 1191 01:00:34,462 --> 01:00:36,170 Why does it actually converge eventually, 1192 01:00:36,170 --> 01:00:37,710 if you let it run long enough? 1193 01:00:52,622 --> 01:00:53,614 AUDIENCE: [INAUDIBLE]. 1194 01:00:58,078 --> 01:00:59,670 PROFESSOR: So the information content, 1195 01:00:59,670 --> 01:01:01,900 what will happen to that? 1196 01:01:01,900 --> 01:01:05,074 So the information content, if it was completely random-- 1197 01:01:05,074 --> 01:01:06,360 we said that would be uniform. 1198 01:01:06,360 --> 01:01:08,630 That would be zero information content, right? 1199 01:01:08,630 --> 01:01:12,330 This matrix, which has around 28% at six different positions, 1200 01:01:12,330 --> 01:01:16,190 will have an information content that's low, but non-zero. 1201 01:01:16,190 --> 01:01:19,700 It might end up being like 1 bit, or something. 1202 01:01:19,700 --> 01:01:23,599 And if you then sample motifs that are not the motif, 1203 01:01:23,599 --> 01:01:25,640 they will tend to reduce the information content, 1204 01:01:25,640 --> 01:01:27,910 tend to bring it back toward random.
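The arithmetic from the board-- 100 sequences, length 30, width 6, an EcoRI-strength motif planted in each-- can be written out explicitly. A small worked sketch; the variable names are mine.

```python
n_seqs, L, W = 100, 30, 6
positions = L - W + 1                  # 25 possible start positions
hits = n_seqs / positions              # on average, 4 sequences hit the motif

# First column of the initial weight matrix: the ~96 random picks
# contribute about a quarter of each base, and the 4 hits all
# contribute the motif's base (G, for EcoRI's first position).
p_motif_base = (hits + (n_seqs - hits) * 0.25) / n_seqs   # about 0.28
p_other_base = ((n_seqs - hits) * 0.25) / n_seqs          # about 0.24

# Likelihood ratio of the true motif site versus background across
# all six positions -- only about 2, spread over 25 candidate sites,
# so the true site is still unlikely to be sampled early on.
ratio = (p_motif_base / 0.25) ** W
```

So even with the motif planted in every sequence, the first-round weight matrix gives the true site only about twice the background likelihood, which is why the early sampling wanders.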
1205 01:01:27,910 --> 01:01:34,680 If you sample locations that have the motif, 1206 01:01:34,680 --> 01:01:36,900 what will that do to the information content? 1207 01:01:36,900 --> 01:01:38,162 Boost it. 1208 01:01:38,162 --> 01:01:40,620 So what would you expect if we were to plot the information 1209 01:01:40,620 --> 01:01:43,090 content over time, what would that look like? 1210 01:01:46,090 --> 01:01:47,590 AUDIENCE: It should trend upwards, 1211 01:01:47,590 --> 01:01:50,430 but it could fluctuate. 1212 01:01:50,430 --> 01:01:51,378 PROFESSOR: Yeah. 1213 01:01:51,378 --> 01:01:53,290 AUDIENCE: Over the number of iterations? 1214 01:01:59,929 --> 01:02:01,470 PROFESSOR: I think I blocked it here. 1215 01:02:01,470 --> 01:02:03,886 Let me see if I can-- Let's try this. 1216 01:02:03,886 --> 01:02:05,050 I think I plotted it. 1217 01:02:08,000 --> 01:02:09,951 OK, never mind. 1218 01:02:09,951 --> 01:02:11,450 I wanted to keep it very mysterious, 1219 01:02:11,450 --> 01:02:13,470 so you guys have to figure it out. 1220 01:02:13,470 --> 01:02:24,430 The answer is that it will-- basically what happens is you 1221 01:02:24,430 --> 01:02:26,670 start with a weight matrix like this. 1222 01:02:26,670 --> 01:02:31,390 A lot of times, because the bias for the motif is quite weak, 1223 01:02:31,390 --> 01:02:33,490 a lot of times you'll sample-- even 1224 01:02:33,490 --> 01:02:35,034 for a sequence, what matters is-- 1225 01:02:35,034 --> 01:02:37,450 like, if you had a sequence where the location, initially, 1226 01:02:37,450 --> 01:02:40,494 was not the motif, and then you sample another location that's 1227 01:02:40,494 --> 01:02:42,910 not the motif, that's not really going to change anything. 1228 01:02:42,910 --> 01:02:44,285 It'll change things a little bit, 1229 01:02:44,285 --> 01:02:46,280 but not in any particular direction. 
1230 01:02:46,280 --> 01:02:48,330 What really matters is when you get to a sequence 1231 01:02:48,330 --> 01:02:51,040 where you already had the motif, if you now sample one that's 1232 01:02:51,040 --> 01:02:54,000 not the motif, your information content will get weaker. 1233 01:02:54,000 --> 01:02:57,420 It will become more uniform. 1234 01:02:57,420 --> 01:03:00,410 But if you have a sequence where it wasn't the motif, 1235 01:03:00,410 --> 01:03:05,440 but now you happen to sample the motif, then it'll get stronger. 1236 01:03:05,440 --> 01:03:07,920 And when it gets stronger, it will then 1237 01:03:07,920 --> 01:03:11,820 be more likely to pick the motif in the next sequence, 1238 01:03:11,820 --> 01:03:13,584 and so on. 1239 01:03:13,584 --> 01:03:15,750 So basically what happens to the information content 1240 01:03:15,750 --> 01:03:19,440 is that over many iterations-- it starts near 0. 1241 01:03:19,440 --> 01:03:22,654 And can occasionally go up a little bit. 1242 01:03:22,654 --> 01:03:25,070 And then once it exceeds the threshold, it goes like that. 1243 01:03:27,630 --> 01:03:30,770 So what happens is it stumbles onto a few instances 1244 01:03:30,770 --> 01:03:33,230 of the motif that bias the weight matrix. 1245 01:03:33,230 --> 01:03:34,900 And if they don't bias it enough, 1246 01:03:34,900 --> 01:03:36,874 it'll just fall off that. 1247 01:03:36,874 --> 01:03:38,540 It's like trying to climb the mountain-- 1248 01:03:38,540 --> 01:03:40,480 but it's walking in a random direction. 1249 01:03:40,480 --> 01:03:42,710 So sometimes it will turn around and go back down. 1250 01:03:42,710 --> 01:03:47,270 But then when it gets high enough, it'll be obvious. 1251 01:03:47,270 --> 01:03:52,020 Once you have a, say, 20 times greater likelihood 1252 01:03:52,020 --> 01:03:55,010 of picking that motif than any other position, most 1253 01:03:55,010 --> 01:03:56,300 of the time you will pick it.
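The information content that keeps coming up here-- the single-number summary of how biased the weight matrix is-- can be computed directly. A minimal sketch, assuming the usual definition of 2 minus the Shannon entropy, in bits, summed over the columns.

```python
from math import log2

def information_content(theta):
    """Total information content of a weight matrix, in bits.

    theta: list of columns, each a dict mapping base -> probability.
    A uniform column contributes 0 bits; a column fixed on a single
    base contributes the maximum of 2 bits.
    """
    total = 0.0
    for column in theta:
        entropy = -sum(p * log2(p) for p in column.values() if p > 0)
        total += 2.0 - entropy
    return total
```

Plotting this value over the sweeps gives the trajectory described here: near zero while the sampler wanders, then a sharp rise once a few true instances bias the matrix enough to become self-reinforcing.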
1254 01:03:56,300 --> 01:03:59,576 And very soon, it'll be stronger. 1255 01:03:59,576 --> 01:04:01,200 And the next round, when it's stronger, 1256 01:04:01,200 --> 01:04:03,220 you'll have a greater bias for picking 1257 01:04:03,220 --> 01:04:04,770 the motif, and so forth. 1258 01:04:04,770 --> 01:04:06,501 Question? 1259 01:04:06,501 --> 01:04:08,370 AUDIENCE: For this specific example, 1260 01:04:08,370 --> 01:04:11,850 N is much greater than L minus W. 1261 01:04:11,850 --> 01:04:15,730 How true is that for practical examples? 1262 01:04:15,730 --> 01:04:18,380 PROFESSOR: That's a very good question. 1263 01:04:18,380 --> 01:04:23,950 There is sometimes-- depends on how commonly your motif occurs 1264 01:04:23,950 --> 01:04:27,840 in the genome, and how good your data is, really, 1265 01:04:27,840 --> 01:04:30,370 and what the source of your data is. 1266 01:04:30,370 --> 01:04:31,870 So sometimes it can be very limited, 1267 01:04:31,870 --> 01:04:37,190 sometimes-- If you do ChIP-Seq you might have 10,000 1268 01:04:37,190 --> 01:04:38,940 peaks that you're analyzing, or something. 1269 01:04:38,940 --> 01:04:40,870 So you could have a huge number. 1270 01:04:40,870 --> 01:04:43,500 But on the other hand, if you did some functional 1271 01:04:43,500 --> 01:04:48,320 assay that's quite laborious for a motif that drives luciferase, 1272 01:04:48,320 --> 01:04:50,770 or something, and you can only test a few, 1273 01:04:50,770 --> 01:04:51,940 you might only have 10. 1274 01:04:51,940 --> 01:04:55,980 So it varies all over the map. 1275 01:04:55,980 --> 01:04:57,740 So that's a good question. 1276 01:04:57,740 --> 01:05:00,130 We'll come back to that in a little bit. 1277 01:05:00,130 --> 01:05:01,200 Simona? 1278 01:05:01,200 --> 01:05:02,750 AUDIENCE: If you have a short motif, 1279 01:05:02,750 --> 01:05:04,270 does it make sense, then, to reduce 1280 01:05:04,270 --> 01:05:05,710 the number of sequences you have?
1281 01:05:05,710 --> 01:05:07,880 Because maybe it won't converge? 1282 01:05:07,880 --> 01:05:09,846 PROFESSOR: Reduce the number of sequences? 1283 01:05:09,846 --> 01:05:11,345 What do you people think about that? 1284 01:05:11,345 --> 01:05:12,830 Is that a good idea or a bad idea? 1285 01:05:16,790 --> 01:05:21,182 It's true that it might converge faster 1286 01:05:21,182 --> 01:05:22,640 with a smaller number of sequences, 1287 01:05:22,640 --> 01:05:25,080 but you also might not find it at all. 1288 01:05:25,080 --> 01:05:27,640 So generally you're losing information, 1289 01:05:27,640 --> 01:05:30,320 so you want to have more sequences 1290 01:05:30,320 --> 01:05:32,415 up to a certain point. 1291 01:05:32,415 --> 01:05:34,790 Let's just do a couple more examples, and I'll come back. 1292 01:05:34,790 --> 01:05:36,080 Those are both good questions. 1293 01:05:36,080 --> 01:05:37,329 OK, so here's this weak motif. 1294 01:05:37,329 --> 01:05:39,590 So this is the one where you guys couldn't see it 1295 01:05:39,590 --> 01:05:41,240 when I just put the sequences up. 1296 01:05:41,240 --> 01:05:43,880 You can only see it when it's aligned-- 1297 01:05:43,880 --> 01:05:47,190 it's this thing with GGC, here. 1298 01:05:47,190 --> 01:05:52,410 And here is, again, the Gibbs Sampler. 1299 01:05:52,410 --> 01:05:55,775 And what happened? 1300 01:06:00,980 --> 01:06:04,698 Who can summarize what happened here? 1301 01:06:12,926 --> 01:06:13,822 Yeah, David? 1302 01:06:13,822 --> 01:06:15,030 AUDIENCE: It didn't converge. 1303 01:06:15,030 --> 01:06:17,894 PROFESSOR: Yeah, it didn't quite converge. 1304 01:06:17,894 --> 01:06:21,130 The motif is usually on the right side, 1305 01:06:21,130 --> 01:06:24,890 and it found something that's like the motif. 1306 01:06:24,890 --> 01:06:29,330 But it's not quite right-- it's got that A, it's G A G C, 1307 01:06:29,330 --> 01:06:33,310 it should be G G C.
And so it sampled some other things, 1308 01:06:33,310 --> 01:06:35,050 and it got off track a little bit, 1309 01:06:35,050 --> 01:06:36,766 because probably by chance, there 1310 01:06:36,766 --> 01:06:39,140 were some things that looked a little bit like the motif, 1311 01:06:39,140 --> 01:06:41,090 and it was finding some instances of that, 1312 01:06:41,090 --> 01:06:42,920 and some instances of the real motif. 1313 01:06:42,920 --> 01:06:44,640 And yeah, it didn't quite converge. 1314 01:06:44,640 --> 01:06:50,220 And you can see these probability vectors here, 1315 01:06:50,220 --> 01:06:53,000 they have multiple white dots in many of the rows. 1316 01:06:53,000 --> 01:06:54,604 So it doesn't know, it's uncertain. 1317 01:06:54,604 --> 01:06:55,770 So it keeps bouncing around. 1318 01:06:55,770 --> 01:06:57,890 So it didn't really converge, it was too weak, 1319 01:06:57,890 --> 01:07:00,930 it was too challenging for the algorithm. 1320 01:07:04,130 --> 01:07:10,750 This is just a summary of how the Gibbs Sampler works. 1321 01:07:10,750 --> 01:07:15,390 It's not guaranteed to converge to the same motif every time. 1322 01:07:15,390 --> 01:07:19,390 So what you generally will want to do is run it several times, 1323 01:07:19,390 --> 01:07:22,890 and if nine out of 10 times you get the same motif, 1324 01:07:22,890 --> 01:07:25,400 you should trust that. 1325 01:07:25,400 --> 01:07:26,480 Go ahead. 1326 01:07:26,480 --> 01:07:33,062 AUDIENCE: Over here, are we optimizing for convergence 1327 01:07:33,062 --> 01:07:35,930 of the value of the information content? 1328 01:07:35,930 --> 01:07:37,600 PROFESSOR: No, the information content 1329 01:07:37,600 --> 01:07:41,880 is just describing-- it's just a handy single-number 1330 01:07:41,880 --> 01:07:44,550 description of how biased the weight matrix is. 1331 01:07:44,550 --> 01:07:49,200 So it's not actually directly being optimized. 
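To make the procedure concrete, here is a minimal sketch in Python of the Gibbs Sampler's basic site-sampling step, assuming one site per sequence and a uniform background. The function names, the pseudocount value, and the uniform-background assumption are my own simplifications, not code from the lecture.

```python
import random

def count_matrix(seqs, positions, W, pseudo=1.0):
    """Position-specific base frequencies (with pseudocounts)
    estimated from the currently chosen site in each sequence."""
    counts = [{b: pseudo for b in "ACGT"} for _ in range(W)]
    for seq, pos in zip(seqs, positions):
        for j in range(W):
            counts[j][seq[pos + j]] += 1
    return [{b: col[b] / sum(col.values()) for b in col} for col in counts]

def sample_site(seq, pwm, W, background=0.25):
    """One Gibbs step: sample a start position in `seq` in proportion
    to the motif-versus-background likelihood ratio at that position."""
    weights = []
    for i in range(len(seq) - W + 1):
        ratio = 1.0
        for j in range(W):
            ratio *= pwm[j][seq[i + j]] / background
        weights.append(ratio)
    return random.choices(range(len(weights)), weights=weights)[0]
```

In the full algorithm you would hold one sequence out, build the matrix from the sites in the rest, sample a new site for the held-out sequence, and cycle until the matrix stops changing. The more biased the matrix gets, the more the sampling favors real motif instances, which is the self-reinforcing behavior described above.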
1332 01:07:49,200 --> 01:07:53,700 But it turns out that this way of sampling 1333 01:07:53,700 --> 01:07:58,350 tends to increase information content. 1334 01:07:58,350 --> 01:08:01,985 It's sort of a self-reinforcing kind of a thing. 1335 01:08:04,366 --> 01:08:05,740 But it's not directly doing that. 1336 01:08:05,740 --> 01:08:10,110 However MEME, more or less, directly does that. 1337 01:08:10,110 --> 01:08:14,450 The problem with that is that, where do you start? 1338 01:08:14,450 --> 01:08:16,500 Imagine an algorithm like this, but where 1339 01:08:16,500 --> 01:08:18,292 you deterministically-- instead of sampling 1340 01:08:18,292 --> 01:08:20,583 from the positions in the sequence, where it might have 1341 01:08:20,583 --> 01:08:22,310 a motif in proportion to probabilities, 1342 01:08:22,310 --> 01:08:24,643 you just chose the one that had the highest probability. 1343 01:08:24,643 --> 01:08:26,800 That's more or less what MEME does. 1344 01:08:26,800 --> 01:08:29,960 And so what are the pros and cons 1345 01:08:29,960 --> 01:08:31,540 of that approach, versus this one? 1346 01:08:39,444 --> 01:08:40,432 Any ideas? 1347 01:08:50,312 --> 01:08:54,620 OK, one of the disadvantages is that the initial choice 1348 01:08:54,620 --> 01:08:58,420 of-- how you're initially seeding your matrix, 1349 01:08:58,420 --> 01:08:59,580 matters a lot. 1350 01:08:59,580 --> 01:09:05,290 That slight bias-- it might be that you had a slight bias, 1351 01:09:05,290 --> 01:09:08,660 and it didn't come out being G was number one. 1352 01:09:08,660 --> 01:09:10,850 It was actually-- T was number one, just because 1353 01:09:10,850 --> 01:09:14,960 of the quirks of the sampling. 1354 01:09:14,960 --> 01:09:18,880 So what would this be, 31 or something? 1355 01:09:18,880 --> 01:09:22,279 Anyway, it's higher than these other guys. 1356 01:09:22,279 --> 01:09:27,550 And so then you're always picking the highest. 
1357 01:09:27,550 --> 01:09:30,167 It'll become a self-fulfilling prophecy. 1358 01:09:30,167 --> 01:09:31,500 So that's the problem with MEME. 1359 01:09:31,500 --> 01:09:33,210 So the way that MEME gets around that, 1360 01:09:33,210 --> 01:09:35,380 is it uses multiple different seeding, 1361 01:09:35,380 --> 01:09:37,130 multiple different starting points, 1362 01:09:37,130 --> 01:09:39,569 and goes to the end with all of them. 1363 01:09:39,569 --> 01:09:42,399 And then it evaluates, how good a model did we get at the end? 1364 01:09:42,399 --> 01:09:44,819 And whichever was the best one, it takes that. 1365 01:09:44,819 --> 01:09:47,580 So it actually takes longer, but you only 1366 01:09:47,580 --> 01:09:50,330 need to run it once because it's deterministic. 1367 01:09:50,330 --> 01:09:53,334 You use a deterministic set of starting points, 1368 01:09:53,334 --> 01:09:54,750 you run a deterministic algorithm, 1369 01:09:54,750 --> 01:09:57,650 and then you evaluate. 1370 01:09:57,650 --> 01:10:01,400 The Gibbs, it can go off on a tangent, 1371 01:10:01,400 --> 01:10:03,590 but because it's sampling so randomly, 1372 01:10:03,590 --> 01:10:05,526 it often will fall off, then, and come back 1373 01:10:05,526 --> 01:10:06,900 to something that's more uniform. 1374 01:10:06,900 --> 01:10:08,640 And when it's a uniform matrix, it's 1375 01:10:08,640 --> 01:10:10,140 really sampling completely randomly, 1376 01:10:10,140 --> 01:10:12,560 exploring the space in an unbiased way. 1377 01:10:12,560 --> 01:10:13,724 Tim? 1378 01:10:13,724 --> 01:10:17,580 AUDIENCE: For genomes that have inherent biases that you know 1379 01:10:17,580 --> 01:10:23,160 going in, do you precalculate-- do you just recalculate 1380 01:10:23,160 --> 01:10:28,730 the weight matrix before, to [? affect those classes? ?] 
1381 01:10:28,730 --> 01:10:33,400 For example, if you had 80% AT content, 1382 01:10:33,400 --> 01:10:36,895 then you're not looking for-- you know, immediately, 1383 01:10:36,895 --> 01:10:41,386 that you're going to hit an A or a T off the first iteration. 1384 01:10:41,386 --> 01:10:43,237 So how do you deal with that? 1385 01:10:43,237 --> 01:10:44,278 PROFESSOR: Good question. 1386 01:10:48,800 --> 01:10:53,100 So these are some features that affect motif finding. 1387 01:10:53,100 --> 01:10:59,485 I think that we've now hit at least a few of these-- number 1388 01:10:59,485 --> 01:11:01,990 of sequences, length of sequences, information content, 1389 01:11:01,990 --> 01:11:07,140 and motif, and basically whether the background is biased 1390 01:11:07,140 --> 01:11:08,060 or not. 1391 01:11:08,060 --> 01:11:13,130 So, in general, higher information content motifs, 1392 01:11:13,130 --> 01:11:15,660 or lower information content, are easier to find-- 1393 01:11:15,660 --> 01:11:18,792 who thinks higher? 1394 01:11:18,792 --> 01:11:19,500 Who thinks lower? 1395 01:11:22,590 --> 01:11:26,055 Someone, can you explain? 1396 01:11:26,055 --> 01:11:27,013 AUDIENCE: I don't know. 1397 01:11:27,013 --> 01:11:27,638 I just guessed. 1398 01:11:27,638 --> 01:11:30,210 PROFESSOR: Just a guess? 1399 01:11:30,210 --> 01:11:31,636 OK, in back, can you explain? 1400 01:11:31,636 --> 01:11:32,135 Lower? 1401 01:11:32,135 --> 01:11:33,510 AUDIENCE: Low information content 1402 01:11:33,510 --> 01:11:35,991 is basically very uniform. 1403 01:11:35,991 --> 01:11:38,324 PROFESSOR: Low information means nearly uniform-- right, 1404 01:11:38,324 --> 01:11:41,186 those are very hard to find. 1405 01:11:41,186 --> 01:11:43,660 That's like that GGC one. 1406 01:11:43,660 --> 01:11:45,200 The high information content motif, 1407 01:11:45,200 --> 01:11:47,366 those are the very strong ones, like that first one. 1408 01:11:47,366 --> 01:11:48,700 Those are much easier to find. 
1409 01:11:48,700 --> 01:11:50,640 Because when you stumble on to them, 1410 01:11:50,640 --> 01:11:54,870 it biases the matrix more, and you rapidly converge to that. 1411 01:11:54,870 --> 01:11:57,150 OK, high information is easy to find. 1412 01:11:57,150 --> 01:12:01,200 So if I have one motif per sequence, 1413 01:12:01,200 --> 01:12:03,226 what about the length of the sequence? 1414 01:12:03,226 --> 01:12:05,190 Is longer or shorter better? 1415 01:12:07,930 --> 01:12:09,706 Is long better? 1416 01:12:09,706 --> 01:12:11,800 Who thinks shorter is better? 1417 01:12:11,800 --> 01:12:14,291 Shorter-- can you explain why short? 1418 01:12:14,291 --> 01:12:16,582 AUDIENCE: Shouldn't it be the smaller the search space, 1419 01:12:16,582 --> 01:12:18,842 the fewer the problems? 1420 01:12:18,842 --> 01:12:21,207 PROFESSOR: Exactly, the shorter the search space, 1421 01:12:21,207 --> 01:12:23,290 and your motif, there's less place for it to hide. 1422 01:12:23,290 --> 01:12:25,900 You're more likely to sample it. 1423 01:12:25,900 --> 01:12:27,510 Shorter is better. 1424 01:12:27,510 --> 01:12:30,830 If you think about-- if you have a motif like TATA, which 1425 01:12:30,830 --> 01:12:35,770 is typically 30 bases from the TSS, 1426 01:12:35,770 --> 01:12:40,320 if you happen to know that, and you give it plus 1 to minus 50, 1427 01:12:40,320 --> 01:12:41,820 you're giving it a small region, you 1428 01:12:41,820 --> 01:12:43,150 can easily find the TATA box. 1429 01:12:43,150 --> 01:12:47,890 If you give it plus 1 to minus 2,000 or something, 1430 01:12:47,890 --> 01:12:48,980 you may not find it. 1431 01:12:48,980 --> 01:12:52,680 It's diluted, essentially. 1432 01:12:52,680 --> 01:12:54,486 Number of sequences-- the more the better. 1433 01:12:54,486 --> 01:12:56,610 This is a little more subtle, as Simona was saying. 1434 01:12:56,610 --> 01:12:58,410 It affects convergence time, and so forth. 
1435 01:12:58,410 --> 01:13:01,870 But in general, the more the better. 1436 01:13:01,870 --> 01:13:06,520 And if you guessed the wrong length of your matrix, 1437 01:13:06,520 --> 01:13:09,370 that makes it worse than if you guess 1438 01:13:09,370 --> 01:13:11,310 the right length, in either direction. 1439 01:13:11,310 --> 01:13:15,690 For example, it's a six-base motif, you guess three. 1440 01:13:15,690 --> 01:13:18,670 The information content, even if it's a 12-bit motif, 1441 01:13:18,670 --> 01:13:21,256 there are only six bits that you could hope to find, 1442 01:13:21,256 --> 01:13:23,380 because you can only find three of those positions. 1443 01:13:23,380 --> 01:13:27,176 So clearly, effectively it's a smaller information 1444 01:13:27,176 --> 01:13:28,550 content, and much harder to find. 1445 01:13:28,550 --> 01:13:29,960 And vice versa. 1446 01:13:32,720 --> 01:13:36,390 Another thing that occurs in practice 1447 01:13:36,390 --> 01:13:39,350 is what's called shifted motifs. 1448 01:13:39,350 --> 01:13:44,740 Your motif is G A A T T C. Imagine in your first iteration 1449 01:13:44,740 --> 01:13:49,680 you happen to hit several of these sequences, starting here. 1450 01:13:49,680 --> 01:13:52,330 You hit the motif, but off by two 1451 01:13:52,330 --> 01:13:54,150 at several different places. 1452 01:13:54,150 --> 01:13:55,680 That'll bias the first position to be 1453 01:13:55,680 --> 01:13:59,230 A, and the second position to be T, and so forth. 1454 01:13:59,230 --> 01:14:03,090 And then you tend to find other shifted versions of that motif. 1455 01:14:03,090 --> 01:14:06,700 You may well converge to this-- A T T C N N, or something 1456 01:14:06,700 --> 01:14:09,480 like that-- which is not quite right. 1457 01:14:09,480 --> 01:14:12,950 It's close, you're very close, but not quite right. 1458 01:14:12,950 --> 01:14:16,060 And it's not as information rich as the real motif. 
1459 01:14:16,060 --> 01:14:19,480 Because it's got those two N's at the end, instead of G A. 1460 01:14:19,480 --> 01:14:21,290 So one thing that's done in practice 1461 01:14:21,290 --> 01:14:26,020 is a lot of times, every so often, the algorithm will say, 1462 01:14:26,020 --> 01:14:28,310 what would happen if we shifted all of our positions 1463 01:14:28,310 --> 01:14:30,990 over to the left by one or two? 1464 01:14:30,990 --> 01:14:32,410 Or to the right by one or two? 1465 01:14:32,410 --> 01:14:34,730 Would the information content go up? 1466 01:14:34,730 --> 01:14:37,540 If so, let's do that. 1467 01:14:37,540 --> 01:14:41,210 So basically, shifted versions of the motif become 1468 01:14:41,210 --> 01:14:43,630 local, near-optimal solutions. 1469 01:14:43,630 --> 01:14:46,296 So you have to avoid them. 1470 01:14:46,296 --> 01:14:47,670 And biased background composition 1471 01:14:47,670 --> 01:14:49,120 is very difficult to deal with. 1472 01:14:49,120 --> 01:14:54,650 So I will just give you one or two more 1473 01:14:54,650 --> 01:14:57,490 examples of that in a moment, and continue. 1474 01:14:57,490 --> 01:15:02,340 So in practice, I would say the Gibbs Sampler is sometimes 1475 01:15:02,340 --> 01:15:05,830 used, or AlignACE, which is a version of Gibbs Sampler. 1476 01:15:05,830 --> 01:15:08,530 But probably more often, people use 1477 01:15:08,530 --> 01:15:12,390 an algorithm called MEME, which is this EM algorithm, which, 1478 01:15:12,390 --> 01:15:14,550 like I said, is deterministic, so you always 1479 01:15:14,550 --> 01:15:17,600 get the same answer, which makes you feel good. 1480 01:15:17,600 --> 01:15:21,280 May or may not always be right, but you can try it out here 1481 01:15:21,280 --> 01:15:22,100 at this website. 
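The shift fix described above, periodically asking whether sliding all the current sites left or right raises the information content, can be sketched like this. This is a hypothetical sketch under my own naming and pseudocount choices, not the code of any particular motif finder.

```python
import math

def information_content(seqs, positions, W, pseudo=0.5):
    """Total information content (2 - H bits per column) of the
    alignment defined by the current site positions."""
    ic = 0.0
    for j in range(W):
        counts = {b: pseudo for b in "ACGT"}
        for seq, pos in zip(seqs, positions):
            counts[seq[pos + j]] += 1
        total = sum(counts.values())
        # entropy term: sum p*log2(p) is negative, so 2 + sum = 2 - H
        ic += 2.0 + sum((c / total) * math.log2(c / total)
                        for c in counts.values())
    return ic

def best_shift(seqs, positions, W, max_shift=2):
    """Shift move: try sliding every site left/right by up to
    `max_shift` and keep the offset that maximizes information
    content (offset 0 wins ties)."""
    best = (information_content(seqs, positions, W), 0)
    for d in range(-max_shift, max_shift + 1):
        if d == 0:
            continue
        shifted = [p + d for p in positions]
        if all(0 <= p <= len(s) - W for s, p in zip(seqs, shifted)):
            best = max(best, (information_content(seqs, shifted, W), d))
    return [p + best[1] for p in positions]
```

For instance, if G A A T T C is planted in every sequence but the current sites all start two bases too early, shifting by +2 lines up all six columns and the information content jumps, so the move is accepted.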
1482 01:15:22,100 --> 01:15:23,755 And actually, the Fraenkel Lab has 1483 01:15:23,755 --> 01:15:26,350 a very nice website called WebMotifs 1484 01:15:26,350 --> 01:15:30,700 that runs several different motif finders including, 1485 01:15:30,700 --> 01:15:33,080 like I said, MEME and AlignACE, which 1486 01:15:33,080 --> 01:15:35,420 is similar to Gibbs, as well as some others. 1487 01:15:35,420 --> 01:15:40,190 And it integrates the output, so that's often a handy thing 1488 01:15:40,190 --> 01:15:41,951 to use. 1489 01:15:41,951 --> 01:15:43,200 You can read about them there. 1490 01:15:43,200 --> 01:15:48,370 And then I just wanted to say a couple words-- 1491 01:15:48,370 --> 01:15:53,070 this is related to Tim's comment about the biased background. 1492 01:15:53,070 --> 01:15:56,130 How do you actually deal with that? 1493 01:15:56,130 --> 01:16:04,360 And this relates to the notion of the mean bit score of a motif. 1494 01:16:04,360 --> 01:16:10,040 So if I were to give you a motif model, P, and a background 1495 01:16:10,040 --> 01:16:14,780 model, q, then the natural scoring system, if you wanted 1496 01:16:14,780 --> 01:16:16,940 additive scores, instead of multiplicative, 1497 01:16:16,940 --> 01:16:18,310 you would just take the log. 1498 01:16:18,310 --> 01:16:23,050 So log P over q, I would argue, is a natural additive score. 1499 01:16:23,050 --> 01:16:25,600 And that's often what you'll see in a weight matrix-- you'll 1500 01:16:25,600 --> 01:16:30,116 see log probabilities, or logs of ratios of probabilities. 1501 01:16:30,116 --> 01:16:31,490 And so then you just add them up, 1502 01:16:31,490 --> 01:16:33,700 and it makes life a bit simpler. 1503 01:16:33,700 --> 01:16:36,350 And so then, if you were to calculate what's the mean bit 1504 01:16:36,350 --> 01:16:40,570 score-- if I had a bunch of instances of a motif, 1505 01:16:40,570 --> 01:16:46,580 it will be given by this formula that's here in the upper right. 
1506 01:16:46,580 --> 01:16:48,420 So that's your score. 1507 01:16:48,420 --> 01:16:51,710 And this is the mean, where you're averaging 1508 01:16:51,710 --> 01:16:58,150 with respect to the motif model probabilities. 1509 01:16:58,150 --> 01:17:03,130 So it turns out, then, that if qk, your background, 1510 01:17:03,130 --> 01:17:06,360 is uniform, motif of width w-- so its probability 1511 01:17:06,360 --> 01:17:10,420 of any w-mer is 1/4 to the w-- then 1512 01:17:10,420 --> 01:17:13,410 it's true that the mean bit-score is 1513 01:17:13,410 --> 01:17:17,110 2w minus the entropy of the motif, which 1514 01:17:17,110 --> 01:17:20,270 is the same as the information content of the motif, 1515 01:17:20,270 --> 01:17:22,760 using our previous definition. 1516 01:17:22,760 --> 01:17:27,380 So that's just a handy relationship. 1517 01:17:27,380 --> 01:17:32,340 And you can do a little algebra to show that, if you want. 1518 01:17:32,340 --> 01:17:43,150 So basically, summation Pk log Pk over qk-- this log, 1519 01:17:43,150 --> 01:17:44,630 you turn that into a difference-- 1520 01:17:44,630 --> 01:17:56,890 so that's summation Pk log Pk minus summation Pk log qk. 1521 01:17:56,890 --> 01:18:00,810 And then you can do some rearrangement, and sum them up, 1522 01:18:00,810 --> 01:18:02,920 and you'll get this formula. 1523 01:18:02,920 --> 01:18:06,480 I'll leave that as an exercise, and any questions on it, 1524 01:18:06,480 --> 01:18:08,790 we can do it next time. 1525 01:18:08,790 --> 01:18:12,250 So what I wanted to get to is sort of this big question 1526 01:18:12,250 --> 01:18:15,050 that I posed earlier-- what's the use of knowing 1527 01:18:15,050 --> 01:18:17,520 the information content of a motif? 
1528 01:18:17,520 --> 01:18:26,630 And the answer is that one use is that it's true, in general, 1529 01:18:26,630 --> 01:18:29,540 that a motif with n bits of information 1530 01:18:29,540 --> 01:18:34,140 will occur about once every 2 to the n bases of random sequence. 1531 01:18:34,140 --> 01:18:39,190 So we said a six-cutter restriction enzyme, EcoRI, 1532 01:18:39,190 --> 01:18:42,660 has an information content of 12 bits. 1533 01:18:42,660 --> 01:18:45,260 So by this rule, it should occur about once every 1534 01:18:45,260 --> 01:18:47,700 2 to the 12th bases of sequence. 1535 01:18:47,700 --> 01:18:50,060 And if you know your powers of 2, which you should all 1536 01:18:50,060 --> 01:18:54,370 commit to memory, that's about 4,000. 1537 01:18:54,370 --> 01:18:57,330 2 to the 12th is 4 to the sixth, is 4,096. 1538 01:18:57,330 --> 01:19:00,680 So it'll occur about once every 4 [? kb, ?] which 1539 01:19:00,680 --> 01:19:03,490 if you've ever cut E. coli DNA, you know is about right-- 1540 01:19:03,490 --> 01:19:07,610 your fragments come out to be about 4 [? kb. ?] 1541 01:19:07,610 --> 01:19:11,570 So this turns out to be strictly true for any motif 1542 01:19:11,570 --> 01:19:14,920 that you can represent by a regular expression, 1543 01:19:14,920 --> 01:19:16,610 like a precise motif, or something 1544 01:19:16,610 --> 01:19:21,660 where you have a degenerate R or Y or N in it-- still true. 1545 01:19:21,660 --> 01:19:23,770 And if you have a more general motif that's 1546 01:19:23,770 --> 01:19:27,490 described by a weight matrix, then you have to define a threshold, 1547 01:19:27,490 --> 01:19:32,960 and it's roughly true, but not exactly. 1548 01:19:32,960 --> 01:19:35,870 All right, so what do you do when the background 1549 01:19:35,870 --> 01:19:38,320 composition is biased, like Tim was saying? 1550 01:19:38,320 --> 01:19:41,100 What if it's 80% A plus T? 
1551 01:19:41,100 --> 01:19:47,310 So then, it turns out that this mean bit-score is a good way 1552 01:19:47,310 --> 01:19:49,830 to go. 1553 01:19:49,830 --> 01:19:52,460 So like I said, the mean bit-score 1554 01:19:52,460 --> 01:19:55,750 equals the information content in this special case, 1555 01:19:55,750 --> 01:19:58,720 where the background is uniform. 1556 01:19:58,720 --> 01:20:02,790 But if the background is not uniform, 1557 01:20:02,790 --> 01:20:05,950 then you can still calculate this mean bit-score, 1558 01:20:05,950 --> 01:20:08,181 and it'll still be meaningful. 1559 01:20:08,181 --> 01:20:09,680 But now it's called something else-- 1560 01:20:09,680 --> 01:20:12,730 it's called relative entropy. 1561 01:20:12,730 --> 01:20:14,860 Actually it has several names-- relative entropy, 1562 01:20:14,860 --> 01:20:17,270 Kullback-Leibler distance is another, 1563 01:20:17,270 --> 01:20:18,900 and information for discrimination-- 1564 01:20:18,900 --> 01:20:21,330 depending on whether you're reading the EE 1565 01:20:21,330 --> 01:20:24,410 literature, or statistics, or whatever. 1566 01:20:24,410 --> 01:20:26,110 And so it turns out that if you have 1567 01:20:26,110 --> 01:20:29,970 a very biased composition-- so here's one that's 75% A T, 1568 01:20:29,970 --> 01:20:33,020 probability of A and T are 3/8, C and G are 1/8. 1569 01:20:35,990 --> 01:20:40,760 If your motif is just C 100% of the time, 1570 01:20:40,760 --> 01:20:43,920 your information content, by the original formula 1571 01:20:43,920 --> 01:20:48,230 that I gave you, would be 2 bits. 1572 01:20:48,230 --> 01:20:53,560 However, the relative entropy will be 3 bits-- 1573 01:20:53,560 --> 01:20:56,620 if you just plug these numbers into this formula, 1574 01:20:56,620 --> 01:21:00,490 it will turn out to be 3 bits. 1575 01:21:00,490 --> 01:21:03,750 My question is, which one better describes 1576 01:21:03,750 --> 01:21:08,400 the frequency of C in the background sequence? 
1577 01:21:08,400 --> 01:21:10,345 Frequency of this motif-- the motif 1578 01:21:10,345 --> 01:21:14,050 is just a C. You can see that the relative entropy says 1579 01:21:14,050 --> 01:21:16,530 that actually, that's stronger than it appears. 1580 01:21:16,530 --> 01:21:18,600 Because it's a C, and that's a rare nucleotide, 1581 01:21:18,600 --> 01:21:20,225 it's actually stronger than it appears. 1582 01:21:20,225 --> 01:21:22,340 And so once every 2 to the 3rd bases is a better estimate 1583 01:21:22,340 --> 01:21:24,990 of its frequency than once every 2 squared. 1584 01:21:24,990 --> 01:21:25,840 So relative entropy. 1585 01:21:25,840 --> 01:21:27,660 So when you run a motif 1586 01:21:27,660 --> 01:21:29,970 finder on a sequence of biased composition, 1587 01:21:29,970 --> 01:21:31,930 you can say, what's the relative entropy 1588 01:21:31,930 --> 01:21:33,330 of this motif at the end? 1589 01:21:33,330 --> 01:21:36,750 And look at the ones that are strong. 1590 01:21:36,750 --> 01:21:40,580 We'll come back to this a little more next time. 1591 01:21:40,580 --> 01:21:42,700 Next time, we'll talk about hidden Markov models, 1592 01:21:42,700 --> 01:21:45,590 and please take a look at the readings. 1593 01:21:45,590 --> 01:21:48,840 And please, those who are doing projects, 1594 01:21:48,840 --> 01:21:50,760 look for more detailed instructions 1595 01:21:50,760 --> 01:21:54,070 to be posted tonight. 1596 01:21:54,070 --> 01:21:55,620 Thanks.
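The arithmetic in that last example is easy to check. Here is a small sketch (the helper name is my own, and treating EcoRI as six independent fixed bases is a shorthand, not code from the lecture) verifying that a fixed C is worth 2 bits against a uniform background but 3 bits against the 75% A+T background, and that a 12-bit motif lands about once per 4,096 bases of random sequence.

```python
import math

def relative_entropy(p, q):
    """D(p || q) in bits; terms with p[b] == 0 contribute nothing."""
    return sum(pb * math.log2(pb / q[b]) for b, pb in p.items() if pb > 0)

uniform = {b: 0.25 for b in "ACGT"}
at_rich = {"A": 3/8, "C": 1/8, "G": 1/8, "T": 3/8}
always_c = {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0}

# Against a uniform background this is the ordinary information
# content: 2 bits for a fixed base.
print(relative_entropy(always_c, uniform))   # 2.0
# Against the 75% A+T background, the same C is worth 3 bits,
# i.e., one occurrence per 2**3 = 8 bases instead of per 4.
print(relative_entropy(always_c, at_rich))   # 3.0

# EcoRI (G A A T T C): 6 fixed bases x 2 bits = 12 bits, so roughly
# one site per 2**12 = 4096 bases of random sequence.
print(2 ** (6 * relative_entropy(always_c, uniform)))  # 4096.0
```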