The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: So good afternoon, once again. And welcome back to Computational Systems Biology, lecture number seven. Today we're going to put to good use two things that we have learned. We've learned how to align high-throughput reads to genomes. And we've learned how to take a collection of high-throughput reads and assemble a genome. And today we're going to delve into the mysteries of transcriptional regulation.

So what I'd like to discuss with you today is a very important set of techniques that allows us to elucidate exactly how genes are regulated. It's probably the most important technique for allowing us to get at that question. I'm sure many of you are familiar with the idea of transcriptional regulators, which are proteins that bind in a sequence-specific way to the genome and act as molecular switches. We'll return to some aspects of these when we talk about proteomics later in the term. But suffice to say, these proteins have domains that interact in a sequence-specific way with the DNA bases in one of the grooves of DNA. They also typically contain an activation domain or repression domain that interacts with other proteins that can cause the genome to fold up, and that can also help recruit the RNA polymerase holoenzyme to actually turn on gene transcription.

So here we have a figure of a collection of Pit1 molecules interacting with the genome. And of course, there are many flavors of genomic regulators. It's estimated that humans have about 2,000 different proteins that act as these molecular switches, some as activators, some as repressors.
And we're going to be talking today, fundamentally, about how these molecules interact with the genome and control gene expression, through the analysis of where they actually interact with specific genomic loci.

So if we were to draw a picture of how we understand gene regulation in cartoon form: we have a gene here with its transcription start site, and we can imagine RNA molecules being produced off of this genomic template. We know that there are non-coding regions of a gene that permit the binding of these regulators. And my definition of a gene is all of the DNA that's required to make a specific transcript or protein. So that includes not only the coding parts of the gene but the non-coding, regulatory parts as well.

So we can imagine that out here there is a collection of regulators, perhaps just one, that bind to a sequence and in turn activate this gene, producing the RNA transcript, which is then in turn turned into another protein. This protein may undergo some sort of post-translational modification by a signaling pathway or other mechanism that activates it. So some regulators need to be activated before they can bind, and some do not. And this activated regulator can bind to another gene and cause another RNA to be expressed. And it may be, in this second context, that we need two different proteins to bind to activate the gene.

And so during the course of the term, we're going to be talking about many aspects of these regulatory networks, including things like what the regulatory code of a genome is, that is, where these binding sites are and how they're occupied. We'll return to that later on in today's lecture. We'll talk about the dynamics of binding of proteins, including how concentration dependent they are.
And we'll talk about combinatorial control, whether or not, for example, both of these have to be present or just one of them needs to be present for a gene to be transcribed, and how you can use these regulatory sequences to implement very complex computational functions. But suffice to say, the most important thing that we have to identify is the programming that underlies the genome, which includes these regulatory sequences and exactly how they're occupied by regulatory proteins.

Ideally, what we would like to be able to do is a single experiment that elucidated all of the regulatory sites in the genome and which ones were occupied by which proteins. But presently, that's technically not possible. So we'll begin today with a technique that allows us to consider one protein at a time and identify where it occupies the genome.

Now there are other kinds of proteins whose association with the genome we can identify that are not transcriptional regulators, per se. For example, we all know that chromatin is organized on spools called nucleosomes. These nucleosomes are composed of eight histones that have tails on them, and there can be covalent marks put on these tails. We'll return to this later on when we talk about epigenetics. But I did want to mention today that it's possible to identify where histones with specific marks are present in the genome and what genome sequences they are associated with. So we can look at sequence-specific regulators, and we can look at more general epigenetic marks, all using the same technology.

And this slide simply recapitulates what we just talked about over here on the left-hand side of the board. But we want to know, basically, what and where, in terms of these genomic regulators: what they are and where they are in the genome. And today we're going to assume a fairly simple model, which is that regulators that are proximal to a gene most probably regulate it.
Although we know in practice, actually, it appears that roughly one third of the regulators that are proximal to a gene actually skip it and regulate a gene further down the genome. It's not really understood very well how the genome folds in three-dimensional space to allow these distal regulatory interactions to occur. But I'll just point out to you that the simplistic model that proximal binding regulates proximal genes doesn't always hold, especially when we get into mammals and other higher organisms.

And as I mentioned, another aspect of this is that certain proteins may need to be modified to become active, and thus there are signaling pathways. You can imagine signaling pathways responding to environmental stimuli outside of cells. These signaling pathways can interact with transcriptional activators and modify what targets they seek in the genome. So these sorts of regulatory networks will be talked about specifically in a separate lecture later in the term, but they're extraordinarily important. And a foundational aspect of them is putting together the wiring diagram. And the wiring diagram has to do with where the regulators occupy the genome and what genes those regulators regulate.

In order to do that, we're going to utilize a technique today called ChIP-seq, which stands for chromatin immunoprecipitation followed by sequencing. And we can now reliably identify where regulators bind to the genome within roughly 10 base pairs. So the spatial resolution has gotten exceptionally good with high-throughput sequencing, as we'll see, which really makes this a fantastic era to be in, because this really wasn't possible 5 or even 10 years ago. And so we now have the tools to be able to take apart the regulatory occupancy of the genome and discern exactly where these proteins are binding.
Here is the way that this is done. I'll describe the protocol to you in general, and then I'm going to pause for a second and see if anybody has any questions about the specifics. The essential idea is that you have a collection of cells. Typically you need a lot of cells; we're talking 10 million cells. For certain kinds of marks, you can get down below a million or even to 100,000 cells. But to get robust signals, you need a lot of cells at present. And all these cells obviously have chromosomes inside of them with proteins that are occupying them.

The essential idea is that you take a flash photography picture of the cell while it's alive. You add a cross-linking agent, and that cross-linking creates bonds between the proteins and the genome, the DNA, where those proteins are sitting. You then isolate the chromatin material, and you wind up with pieces of DNA with proteins occupying them, all the proteins. So not just some of the proteins, but all the proteins are non-selectively cross-linked to the genome.

You then can take this extract and fragment it. Typically you fragment it by using sonication, which is mechanical energy, which causes the DNA to break at random locations. There are more modern techniques that we'll touch on at the end of today's lecture where you can enzymatically digest these fragments right down to where the protein is. But suffice to say, you get small fragments, which you then can immunopurify with an antibody to a protein of interest.

So one condition of using this technology is that either, A, you have a good antibody to a regulatory protein that you care about, or, B, you have tagged this protein such that it has a FLAG tag, myc tag, or some other epitope tag on it, which allows you to use an antibody or other purification methodology for that specific tag. So either you have a good antibody or you have a tag on the protein. One problem with tags on proteins is that they can render the proteins nonfunctional.
If they're nonfunctional, then, of course, they're not going to bind where they should bind. And one has to be careful about this, because if you introduce a tag and you have a couple of good copies of the protein that are untagged and one copy that is tagged, it's hard to tell whether or not the tagged version is actually doing what you think it does. So one has to be careful. But suffice to say, assuming that you have some way of immunopurifying this protein, you can then use the antibodies to purify just those fragments that have the protein of interest.

After you've purified the protein of interest, you can reverse the cross-linking and end up with a collection of fragments, which you then sequence using a high-throughput sequencing instrument. Now recall that, in the usually applied protocol, the fragmentation is random. Each molecule has two ends, but for each one you probably only sequence one end. Sequencing an end of these molecules gives you a sequence tag that is near where the binding event occurred, but not exactly at it. And we're going to take those tags, and if we have our genome, here represented as this short, horizontal chalk line, we'll take our reads, align them to the genome, and try to discern from those aligned reads where the original proteins were binding.

Now our job is to do the best possible alignment, or discovery, of where the proteins are binding, given this evidence. So we have a collection of evidence exhibited by the read sequences. The other thing that we will do is take the original population of molecules and sequence them as well, sometimes called the whole cell extract sequence, as a control. And we'll see why we need this control a little bit later on. But this is going to be the purified, so-called IP (for immunoprecipitate) fraction, which we'll sequence. And this will be the whole cell extract, which should not be enriched for any particular protein.
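To make the computational starting point concrete, here is a minimal sketch of how the aligned reads from the IP and whole cell extract channels might be reduced to strand-specific 5' tag positions, which is the raw evidence that the analysis below works from. This is an illustration only: the file names are hypothetical, and it assumes the reads have already been aligned and stored in sorted, indexed BAM files readable with pysam.

```python
# Minimal sketch: collect strand-specific 5' read positions from ChIP-seq
# alignments. BAM files are assumed sorted and indexed; names are hypothetical.
from collections import Counter
import pysam

def five_prime_positions(bam_path, chrom):
    """Return Counters of 5' read-start positions on the + and - strands."""
    plus, minus = Counter(), Counter()
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(chrom):
            if read.is_unmapped or read.is_secondary:
                continue
            if read.is_reverse:
                # For a minus-strand read, the 5' end is its rightmost base.
                minus[read.reference_end - 1] += 1
            else:
                plus[read.reference_start] += 1
    return plus, minus

# IP channel (immunoprecipitated fragments) and whole cell extract control.
ip_plus, ip_minus = five_prime_positions("ip_sample.bam", "chr10")
wce_plus, wce_minus = five_prime_positions("wce_control.bam", "chr10")
```

The whole cell extract counts are kept alongside the IP counts so that, later, enrichment can be judged against a non-specific background rather than against zero.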
Now before I go on, I'd be happy to entertain any questions about the details of this protocol, because it's really important that you feel comfortable with it before we talk about the computational analysis of the output. So if anybody has any questions, now would be a great time to ask. Yes.

AUDIENCE: I have more of a scientific question. This assumes that we know a transcription factor. Are there ways, methods, to figure out transcription factors so that you can design antibodies to bind to them?

PROFESSOR: So the question is, this assumes that we know the regulators that we're interested in ahead of time. And is there a de novo way of discovering heretofore unknown regulators that are binding to the genome? The answer to that question is sometimes, as is usually the case. Later in the term, we'll talk about other methodologies for looking at the regulatory occupancy of the genome that don't depend upon immunopurification, in which case we'll get an understanding of what's going on with the genome at the level of knowing which particular sequences are occupied without knowing what's sitting there. From the sequence, sometimes we can infer the family of the protein that is sitting there.

But in general, the holy grail of this, which has not really fully materialized, would be as follows: instead of purifying with an antibody and then sequencing, why not purify with a nucleic acid sequence and then do mass spec, to actually take the proteins off of the DNA, run them through mass spec, and figure out what's there. We and others have attempted this, and at times you get good results. Mass spec is improving greatly, but it's still a process fraught with noise. And there's a paper just published in Nature Methods late last year on something called the CRAPome. Have you heard of this paper before? It is all the junk you get when you run mass spec experiments.
And so when you run a mass spec experiment, you can just take all the stuff in the CRAPome out of it, and it actually helps you quite a bit. That gives you an idea of what the state of the art of mass spec is. It's a little bit noisy. But I think your question is great. I think we need to get there. We need to get to the place where we can take portions of the genome, run them through mass spec, and figure out what is populating them de novo, without having to know ahead of time. Any other questions? OK. Great.

So the figure on the slide up there also describes the ChIP-seq protocol. And I'll also say that some people believed this would never work. They actually thought that when you did the purification, you would just get so much background sequence that when you mapped it to the genome, you could never discern any signal whatsoever. And so there were lively debates about this until somebody actually made it work, and then the argument was over, because it made everything else completely and totally obsolete. So it wasn't good to be on the wrong side of that argument, I'll tell you that. I wasn't, but all right. I was on the right side.

But suffice to say, here's a close-up picture of Mr. Protein, or Ms. Protein, and what happens when there is breakage around that site, followed by sequencing. As you can see, the little black lines connecting the protein and the DNA are supposed to indicate contact sites. And the little yellow arrows are supposed to indicate breakage sites of the DNA, caused in this case by mechanical breakage through sonication. And you get reads from both strands of the DNA. Remember that a sequencing instrument always sequences from five prime to three prime. So you're going to get the reads on the red strand and on the blue strand, shown here. And when we do the mapping, we know which strand they're mapped on.
And the profile is shown in the lower plot, showing the density of mapped reads versus distance from where we believe the protein is sitting. The tag density refers to tags, or sequence tags, or reads, that are aligned, using the methodology we discussed two lectures ago, to the genome we assembled last lecture.

Now, the characteristic shape shown in this picture is not the same for all proteins. It is something that varies from protein to protein. And thus, one of the things that we'll want to do during our discovery of where these proteins are binding is always learn the shape of the read distribution that will come out of a particular binding event.

So just to show you some actual data, so you'll get a feel for what we're talking about, this is actual data from the Oct4 protein, which is an embryonic regulator, a pluripotency factor, binding to the mouse genome around the SOCS2 gene. You can see the two distinct peaks on the upper track, both the plus strand reads and the minus strand reads, shown in blue and red respectively. Each one of the black and white bars at the top (they probably can't be read from the back of the room) is 1,000 bases, to give you some idea of the scale of the genome that we're looking at here. You can see the SOCS2 gene below; the exons are the solid bars. And then you see the whole cell extract channel, which we talked about earlier. The whole cell extract channel is simply giving us a background set of reads that are nonspecific. And so you might have a set of, say, 10 or 20 million reads, something like that, for one of these experiments, which you map to the genome to get a picture that looks like this.

So now our job is to take the read sets that we see here, genome wide, and figure out every place that the Oct4 protein is binding to the genome. Now there are several ways that we could approach this question. One way to approach the question would be to simply ask, where are the peaks?
And so you hear this kind of exploration often described as peak finding. Where can you find the peaks? And where is the middle of each peak? Now the problem with this approach is that it works just fine when a peak represents a single binding event. So imagine that these two fingers here are binding events, and they're fairly far apart on the genome. As they come closer and closer together, what will happen is that, instead of having two peaks, we're going to wind up with one broad peak. And there's a lot of biology present in this kind of binding of the same protein proximal to itself. So we need to be able to resolve these sorts of events that occur underneath a single enrichment, or single peak, into two separate binding events.

And this is shown in the next couple of slides, where we look at what we would expect from a single event, in terms of a read enrichment profile once it's aligned to the genome. And we think about the possibility that there are two events, here shown in indiscernible gray and blue and red. And we note that each one of these will have its own specific profile. And then you can consider them to be added together to get the peak that we observe.

Now one of the reasons this additive property works is that, remember, we're working with a large population of cells, and regulators don't always occupy a site. Thus, what we're looking at in terms of the reads is the sum of all of the evidence from all of the cells. And so even though the proteins are close to one another, we often can find an additive effect between that proximal binding.

So how can we handle this? Well, we're going to do two key algorithmic things. We're going to model the spatial distribution of the reads that come out of a specific event, as I suggested earlier. And we're going to keep that model up to date.
That is, we can learn that model by first running our method using a common distribution of reads, identifying a bunch of events that we think are bindings of a single protein, taking those events and using them to build a better model of what the read distribution looks like, and then running the algorithm again. And with the better distribution, we can do a much better job of resolving multiple events out of single peaks.

The next thing we're going to do is model the genome at a single base pair level. So we're going to consider every single base as potentially being the center point of a protein binding event, and using that model, try to sort through how we could have observed the reads that we are presented with.

First, the spatial distribution that we build is going to look something like this. We consider a 400 base pair window, and we learn the distribution of reads, and we build an actual empirical distribution. We don't fit a parametric distribution to it; rather, we keep an exact distribution, or histogram, of what we observe, averaged over many, many events. So when we're fitting things, we have the best possible estimate. Yes.

AUDIENCE: I was just trying to remember the origin of the [INAUDIBLE] clearly. So within the protocol, are you sequencing with the proteins bound to these fragments?

PROFESSOR: No. See, you can't do that. You have to reverse the cross-linking from the DNA. And then there's a step here, which we omitted for simplicity, where we amplify the DNA.

AUDIENCE: So I was just wondering, if there's no protein, why doesn't the polymerase just read through the whole thing from one side? Why is there a peak? There seems to be a loss of signal right where the protein is bound.

PROFESSOR: Well, that depends upon how long the reads are and how hard you fragment. And in fact, that can occur.
But typically, we're using fairly short reads, like 35 base pair reads, and we're fragmenting the DNA to be, say, perhaps 200 to 300 base pairs long. So we're reading the first 35 base pairs of the DNA fragment, but we're not really reading all the way through. We could read all the way through if we wanted, but there really wouldn't be a point to that. The thing that we're observing here is where the five prime end of the read is, where the leftmost edge of the read is. So even though we might be reading all the way through, we're just using the left edge of it. Does that answer your question?

AUDIENCE: Yeah.

PROFESSOR: OK, great. Any other questions? Yes, at the back.

AUDIENCE: So to clarify, this distribution that's being shown up here, on both the positive and negative strands...

PROFESSOR: Yes.

AUDIENCE: This is the position of where the reads started, not the count of the number of times that particular base was seen in the sequencing result. Is that correct?

PROFESSOR: Let me repeat the question. What we're observing in the distribution is where the read starts, and not the number of times that base shows up in the sequencing data. It is the number of reads whose five prime position starts at that base. OK? So each read only gets one count. Does that help? OK.

The other thing that we're going to do, for simplicity, is assume that the plus and the minus strand distributions are symmetric. So we only learn one distribution, and then we flip it to get the minus strand. And that's shown here, where we can articulate this as an empirical distribution, where the probability of a read given a base position is described in terms of the distance between where we are considering that the binding event may have occurred and where the read is. So it's important for us to look at this in some detail, so that you're comfortable with it.
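As a concrete illustration of learning this empirical distribution, here is a minimal sketch that histograms the strand-oriented offsets of 5' read positions within a 400 base pair window around a set of putative single binding events, mirroring the minus strand so that only one symmetric profile is kept. The event list, the Counter inputs (from the earlier sketch), and the window size are assumptions for illustration, not the published implementation.

```python
# Minimal sketch: learn an empirical read profile p(offset | event) from a set
# of putative single-event locations. `events` is a hypothetical list of event
# coordinates; `plus`/`minus` are Counters of 5' read positions per strand.
from collections import Counter

def learn_read_profile(events, plus, minus, half_window=200):
    hist = Counter()
    for m in events:
        for offset in range(-half_window, half_window + 1):
            # Plus-strand reads are counted at their offset from the event.
            hist[offset] += plus.get(m + offset, 0)
            # Minus-strand reads are mirrored, so one symmetric profile is kept.
            hist[-offset] += minus.get(m + offset, 0)
    total = sum(hist.values()) or 1
    # Normalize to an empirical probability of seeing a read at each offset.
    return {d: hist[d] / total for d in range(-half_window, half_window + 1)}
```

In a first pass this profile can be seeded with a generic shape and then re-estimated from the events the method itself discovers, which is the "keep the model up to date" step described above.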
Here's our genome again. And let's assume that we have a binding event at base m along the genome, and we have a read here, r sub n, at some position. The probability that this read was caused by this binding event can be described as the probability of read n given that we're considering an event at location m.

Now of course, it could be that there are other possible locations that could have caused this read. So let us suppose that we model all of those positions along the genome as a vector pi. Each element of pi describes the probability, or the strength, of a binding event having occurred at a particular location. We can now describe the probability of read sub n given pi as equal to the summation, where m runs from 1 to big M, assuming that there are M bases in this genome, of P of r sub n given m, times pi sub m, like this. So we are mixing together here all of the positions along the genome to try to explain this read. So the probability of the read, given this vector pi, which describes all the possible events that could have created this read, is this formulation, which considers the probability of the read given each position times the probability that an event occurred at that position, subject to the constraint that all of the pi sub m's sum to 1. So we're just assigning probability mass along the genome, from whence all of the reads originally came.

So this is considering a single read, r sub n, and where that might have originated from. Yes.

AUDIENCE: Does the constraint basically state that only one event occurred in this fragment?

PROFESSOR: The question is, does this constraint imply that only one event occurred? No. The constraint is implying that we only have one unit of probability mass to assign along the genome, which will generate all of the reads. And thus, this vector describes the contribution of each base to the reads that we observe. So let us say, simplistically, that there are only two events in the genome, and we had a perfect solution.
Two points in the genome, m1 and m2, would have 0.5 as their values for pi, and all the other values in the genome would be 0. We're going to try to make this as sparse as possible, with as many zeroes as possible. So only at the places in the genome where protein is actually binding will pi sub m be nonzero. Does that make sense? These are great questions. Yes. And if people could say their names first, that would be great. Yes.

AUDIENCE: Sara. Just to clarify, the pi vector is completely empirical?

PROFESSOR: This distribution is completely empirical. Yes, that's right, Sara, completely empirical. I'll also say, just so you know, that there are many ways of doing this kind of discovery, as you might imagine. The way we're going to describe today is the one that was selected as part of the ENCODE 3 pipeline for the government's ENCODE project. So what I'm going to talk about today is the methodology that's being used for the next set of data for ENCODE 3, followed by IDR analysis, which is also part of ENCODE 3. So what you're hearing about today is a pipeline that will be published next year as part of the ENCODE project. This method has been published, but the third phase of the analysis of the Encyclopedia of DNA Elements (which is what ENCODE stands for) is utilizing it. OK, any other questions? Yes.

AUDIENCE: Does the shape of this binding event tell you anything about the topology of the actual protein?

PROFESSOR: It does, actually. And we'll return to that. The shape of this binding can tell you something about the class of protein, which is still an area of active research. But also note that that is a little bit confounded by the fact that when you have homotypic binding, which means you have these closely spaced binding events, you get these broader peaks.
And so there's a lot of research into what the shapes on the genome mean and what biological function or mechanism they might imply. Yes.

AUDIENCE: Can you explain pi one more time?

PROFESSOR: Yeah, explain pi one more time, sure. So pi is describing where there are binding events along the genome. For example, if we just had two binding events, m1 and m2, then pi of m1 would be equal to 0.5, and pi of m2 would be equal to 0.5, and all the other values of pi would be 0. So we're just describing with pi where the events are occurring. That gives us the locations of the events. OK? And you'll see in a moment why we articulated it that way. But that's what pi is doing. Does that answer your question? OK. Any other questions? Yes.

AUDIENCE: In cases when you have two peaks really close together, to the point that there's some sort of [INAUDIBLE] between, how do you constrain your pi?

PROFESSOR: How do you constrain the pi when you have closely spaced peaks?

AUDIENCE: Yeah.

PROFESSOR: Well, you don't constrain it. You actually want...

AUDIENCE: [INAUDIBLE]

PROFESSOR: Well, I'm going to show you some actual examples of this algorithm running. I'm going to give you an animation of the algorithm running, so you can actually watch what it does. And you'll see, first it's going to be something that isn't too pleasant, and then we'll fix it. But the key thing here is sparsity. What we want to do is enforce pi being as sparse as possible while explaining the data. One of the common problems in any approach to machine learning is that, with a suitably complex model, you can model anything, but it's not necessarily interpretable. So what we want to do here is make pi as simple as possible while still being able to explain the data that we see. However, if a single event cannot explain an observed read distribution at a particular point in the genome, we'll need to bring another event in.
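To connect this back to the formula on the board, here is a minimal sketch of the single-read mixture probability. It assumes `profile` is the empirical read distribution learned above and `pi` is a dictionary from candidate genome positions to weights that sum to 1; the names and data layout are illustrative only. Because the learned profile is zero outside its window, only positions within a few hundred bases of the read actually contribute to the sum.

```python
# Minimal sketch: P(r_n | pi) = sum over m of P(r_n | event at m) * pi_m.
# `profile` maps strand-oriented offsets (read 5' position minus event
# position, mirrored for minus-strand reads) to probabilities; `pi` maps
# candidate event positions to weights. Names are assumptions for illustration.
def read_probability(read_pos, strand, pi, profile, half_window=200):
    prob = 0.0
    for offset in range(-half_window, half_window + 1):
        m = read_pos - offset              # candidate event position
        weight = pi.get(m, 0.0)
        if weight == 0.0:
            continue                       # sparse pi: most positions are zero
        d = offset if strand == "+" else -offset
        prob += profile.get(d, 0.0) * weight
    return prob
```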
All right, so that is how to think about a single read. And now we can say that the probability of our entire read set, given pi, is quite simple. It's simply the product over all the reads. Sorry. Like so. So this is the probability of the entire read set. We had this previously, which is the probability of a single read, and we take the product of the probabilities for each individual read to get the likelihood of all the reads.

So now all we need to do to solve this problem is this. We need to say that pi hat is equal to the arg max over pi of P of R given pi, which gives us the maximum likelihood estimate for pi. Now it's easy to write that down: just find the setting for pi that maximizes the likelihood of the observed reads, and poof, you're done, because now you've come up with a pi that describes where the binding events are along the genome at single base pair resolution, assuming that pi is modeling every single base. Does everybody see that? Any questions about that? Yes.

AUDIENCE: [INAUDIBLE]

PROFESSOR: Little n is the number of reads. Big M is the number of bases in the genome. n is the number of reads.

AUDIENCE: The length of pi, right? Is that equal to the number of binding events? Or is it equal to just the number of [INAUDIBLE]?

PROFESSOR: The length of pi?

AUDIENCE: Yeah.

PROFESSOR: It's the number of bases in the genome, and hopefully the number of binding events is much, much, much smaller than that. A typical number of binding events is 5,000 to 30,000, across a genome that has 3 billion bases, something like that. So it's much, much smaller. OK?

So this is what we would like to solve.
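Collecting the spoken formulas in one place, the model as described on the board can be written as follows (the notation follows the lecture's definitions: r_n is a read, R is the full read set of N reads, and pi is the vector of event weights over the M bases):

```latex
\begin{align*}
P(r_n \mid \pi) &= \sum_{m=1}^{M} P(r_n \mid m)\,\pi_m
  && \text{(a single read as a mixture over event positions)}\\
P(R \mid \pi) &= \prod_{n=1}^{N} \sum_{m=1}^{M} P(r_n \mid m)\,\pi_m
  && \text{(likelihood of the whole read set)}\\
\hat{\pi} &= \operatorname*{arg\,max}_{\pi} P(R \mid \pi),
  \quad \text{subject to } \sum_{m=1}^{M} \pi_m = 1,\ \pi_m \ge 0
  && \text{(maximum likelihood estimate)}
\end{align*}
```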
And another way to look at this model, which may be somewhat more confusing, is as follows. We have these events spread along the genome, and we have the reads shown on the lower line, and the events are generating the reads. So this is called a generative model, because the derivation of the likelihood directly follows from the fact that, if we knew where the events were, we could exactly come up with the best solution for the assignment of pi, and thus for the likelihood. So this is what we have on the board over here. But we can't solve this directly. Yes. Question in the back.

AUDIENCE: Hi. My name is Eric. I have a little question. What is the definition of GPS here?

PROFESSOR: Oh, sorry. Yeah, GPS is the name of this algorithm. I told it to an editor and they hated it. But at any rate, it's called the genome positioning system.

[LAUGHTER]

Yeah, so you don't like it either, right, Eric? Oh, well. But yes, it locates proteins on the genome, right? Yeah. Good question.

So this is going to be, I think, our first introduction to the EM algorithm as a way of solving complex problems like this. And here is the insight we're going to use to solve this problem. Imagine I had the function g. So g is going to be this function that tells us which reads came from which events and where those events are in the genome. If you knew g exactly, this would be a really trivial problem to solve. It would tell you, for every single read, which particular binding event caused it.

And if we knew that, it would be great, because then we could say something like this. We could say, well, suppose we knew g. Then the number of reads created by a binding event at location m would simply be (it's not a very good summation sign here) a sum over all the reads, counting up the number of them where read n was caused by an event at location m. If we just summed up this number, we'd have the number of reads caused by an event at location m. Is everybody cool with that? No. Do you see the definition there of g on the overhead projector?
774 00:43:20,000 --> 00:43:25,120 So every time a read n comes from event m, 775 00:43:25,120 --> 00:43:28,400 that function is going to be equal to 1. 776 00:43:28,400 --> 00:43:31,790 So we just count the number of times it's 1. 777 00:43:31,790 --> 00:43:33,510 For a given location m, we find out 778 00:43:33,510 --> 00:43:36,980 how many reads came from that event-- could be 5, 779 00:43:36,980 --> 00:43:40,010 could be 10, could be 100. 780 00:43:40,010 --> 00:43:41,890 We don't know exactly how many were generated 781 00:43:41,890 --> 00:43:43,481 by that particular event, but there's 782 00:43:43,481 --> 00:43:44,480 going to be some number. 783 00:43:44,480 --> 00:43:49,170 If we have 10 million reads, and we have 100,000 events, 784 00:43:49,170 --> 00:43:50,750 we're going to get about 100 reads 785 00:43:50,750 --> 00:43:53,901 per event, something like that. 786 00:43:53,901 --> 00:43:55,775 You can give that kind of order of magnitude. 787 00:43:58,360 --> 00:43:59,014 Everybody OK? 788 00:43:59,014 --> 00:44:00,680 Any questions about the details of this? 789 00:44:00,680 --> 00:44:04,780 Because the next step is going to be causing leaps of faith, 790 00:44:04,780 --> 00:44:07,340 so I want to make sure that everybody's on firm ground 791 00:44:07,340 --> 00:44:09,110 before we take the leap together. 792 00:44:09,110 --> 00:44:09,610 Yes. 793 00:44:09,610 --> 00:44:13,960 AUDIENCE: So you haven't enforced sparsity yet? 794 00:44:13,960 --> 00:44:15,822 PROFESSOR: I'm not enforcing sparsity yet. 795 00:44:15,822 --> 00:44:17,905 You like sparsity. 796 00:44:17,905 --> 00:44:19,780 You're going to keep me to that, right? 797 00:44:19,780 --> 00:44:22,364 So later on, you'll hold me to it, all right? 798 00:44:22,364 --> 00:44:23,030 That's your job. 799 00:44:23,030 --> 00:44:23,610 What's your name? 800 00:44:23,610 --> 00:44:24,460 AUDIENCE: [INAUDIBLE] 801 00:44:24,460 --> 00:44:25,043 PROFESSOR: OK. 802 00:44:25,043 --> 00:44:26,310 That's your job, all right? 803 00:44:26,310 --> 00:44:27,930 Mr. Sparsity. 804 00:44:27,930 --> 00:44:28,700 I like that. 805 00:44:28,700 --> 00:44:29,445 He's been sparse. 806 00:44:29,445 --> 00:44:31,500 I like it. 807 00:44:31,500 --> 00:44:34,290 So if this is the number of reads assigned 808 00:44:34,290 --> 00:44:38,780 to a particular location, then we 809 00:44:38,780 --> 00:44:41,870 know that pi sub m will simply be 810 00:44:41,870 --> 00:44:55,680 n sub m over the summation of all of the reads assigned 811 00:44:55,680 --> 00:44:59,030 to all of the events. 812 00:44:59,030 --> 00:45:01,450 And some of these are going to be zero. 813 00:45:01,450 --> 00:45:03,470 A lot of them will be zero. 814 00:45:03,470 --> 00:45:09,440 So here, I don't know this assignment of reads to events. 815 00:45:09,440 --> 00:45:11,100 It's latent. 816 00:45:11,100 --> 00:45:15,420 It's something I have invented which I do not know. 817 00:45:15,420 --> 00:45:20,696 But if I did know it, I'd be able to compute pi directly, 818 00:45:20,696 --> 00:45:24,750 like so, because I'd be able to figure out 819 00:45:24,750 --> 00:45:28,060 for each location on the genome how many reads were there. 820 00:45:28,060 --> 00:45:30,060 And that's how much responsibility that location 821 00:45:30,060 --> 00:45:31,710 had for generating all of the reads. 822 00:45:35,660 --> 00:45:37,270 Is everybody with me so far? 823 00:45:37,270 --> 00:45:38,240 Any questions at all? 824 00:45:42,330 --> 00:45:43,420 OK.
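To make that counting concrete, here is a minimal sketch, assuming a toy genome and a hypothetical hard assignment g that in practice is never observed; the variable names are illustrative, not code from the lecture's GPS implementation.

```python
import numpy as np

# Hypothetical hard assignment: for each read n, g[n] is the genomic location m
# of the event that generated it. In reality this assignment is latent.
g = np.array([500, 500, 550, 500, 550, 500, 500, 550, 500, 550])

genome_length = 1400                # toy genome, matching the synthetic example later
N = np.zeros(genome_length)         # N[m] = number of reads caused by an event at m
for m in g:
    N[m] += 1                       # count reads whose generating event is at location m

pi = N / N.sum()                    # pi[m] = responsibility of location m for the reads
print(pi[500], pi[550])             # 0.6 and 0.4 here; every other location is 0
```

With a known g this really is trivial, which is exactly why the trick of guessing the assignment, re-estimating pi, and repeating is so appealing.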
825 00:45:43,420 --> 00:45:47,480 So remember, this is latent. 826 00:45:47,480 --> 00:45:49,570 We don't know what it is, but if we did know, 827 00:45:49,570 --> 00:45:52,010 we'd be in great shape. 828 00:45:52,010 --> 00:45:58,180 So our job now is to estimate g. 829 00:45:58,180 --> 00:46:03,750 And if we could estimate g, we can estimate pi. 830 00:46:03,750 --> 00:46:06,016 And once we get a better estimate for pi, 831 00:46:06,016 --> 00:46:07,390 I'd like to suggest to you we can 832 00:46:07,390 --> 00:46:10,800 get a better estimate for g. 833 00:46:10,800 --> 00:46:14,240 And we can go between these two steps. 834 00:46:14,240 --> 00:46:17,390 And so that's what the expectation, maximization 835 00:46:17,390 --> 00:46:20,880 algorithm does in the lower part here, which 836 00:46:20,880 --> 00:46:24,450 is that the left part is estimating gamma, which 837 00:46:24,450 --> 00:46:39,860 is our estimate for g, which is looking at the number of reads 838 00:46:39,860 --> 00:46:44,440 that we think are softly assigned to a location m 839 00:46:44,440 --> 00:46:46,690 over the total number of softly assigned 840 00:46:46,690 --> 00:46:49,710 reads from all locations. 841 00:46:49,710 --> 00:47:00,190 And that gives the fraction of a read at n assigned to event m. 842 00:47:00,190 --> 00:47:07,210 So we are computing an estimate of g. 843 00:47:07,210 --> 00:47:09,690 But to compute this estimate, we have to have pi. 844 00:47:09,690 --> 00:47:12,780 It's telling us where we believe that these reads are 845 00:47:12,780 --> 00:47:13,390 coming from. 846 00:47:16,090 --> 00:47:18,560 And once we have gamma, which is an estimate, 847 00:47:18,560 --> 00:47:24,140 we can compute pi, just in the same way we computed it here, 848 00:47:24,140 --> 00:47:26,790 if we knew g. 849 00:47:26,790 --> 00:47:33,500 So we compute the expectation of this latent function, which 850 00:47:33,500 --> 00:47:37,580 we don't know, and then we maximize 851 00:47:37,580 --> 00:47:41,700 pi to maximize our likelihood. 852 00:47:41,700 --> 00:47:45,940 And you can derive this directly by taking 853 00:47:45,940 --> 00:47:49,420 the log of that likelihood probability up there 854 00:47:49,420 --> 00:47:52,220 and maximizing it. 855 00:47:52,220 --> 00:47:57,662 But suffice to say, you wind up with this formulation 856 00:47:57,662 --> 00:47:58,495 in the EM framework. 857 00:48:01,050 --> 00:48:06,080 So I can go between these two different steps-- 858 00:48:06,080 --> 00:48:09,060 the E step, the M step, the E step, the M step. 859 00:48:09,060 --> 00:48:11,420 And each time as I go through it, 860 00:48:11,420 --> 00:48:15,860 I get a better and better approximation for pi, 861 00:48:15,860 --> 00:48:19,510 until I ultimately get it within some preset level 862 00:48:19,510 --> 00:48:21,840 of convergence. 863 00:48:21,840 --> 00:48:23,290 And then I'm finished. 864 00:48:23,290 --> 00:48:31,200 I actually can report my pi. 865 00:48:31,200 --> 00:48:34,600 But before I leave the EM algorithm, and this formulation 866 00:48:34,600 --> 00:48:36,730 of it, are there any questions at all 867 00:48:36,730 --> 00:48:40,807 about what's going on here? 868 00:48:40,807 --> 00:48:41,890 Yes, question in the back. 869 00:48:41,890 --> 00:48:43,552 AUDIENCE: So the solution to the EM algorithm 870 00:48:43,552 --> 00:48:44,740 doesn't always [INAUDIBLE]? 871 00:48:50,450 --> 00:48:52,960 PROFESSOR: It gives you a solution. 
872 00:48:56,560 --> 00:48:59,750 And it is the optimum solution given the starting conditions 873 00:48:59,750 --> 00:49:01,140 that we give it. 874 00:49:01,140 --> 00:49:02,650 AUDIENCE: But it depends on the starting conditions, right? 875 00:49:02,650 --> 00:49:04,858 PROFESSOR: It does depend on the starting conditions. 876 00:49:08,077 --> 00:49:08,910 Any other questions? 877 00:49:11,540 --> 00:49:12,260 Yes, in the back. 878 00:49:12,260 --> 00:49:13,759 AUDIENCE: Is this dependent upon you 879 00:49:13,759 --> 00:49:16,220 having discrete binding events? 880 00:49:16,220 --> 00:49:20,180 So if you have a protein that has more diffuse localization 881 00:49:20,180 --> 00:49:22,655 across the genome, for example, RNA Pol II, 882 00:49:22,655 --> 00:49:24,234 would this model break down? 883 00:49:24,234 --> 00:49:25,900 PROFESSOR: So the question is, does this 884 00:49:25,900 --> 00:49:27,858 depend upon you having discrete binding events? 885 00:49:27,858 --> 00:49:30,770 What biological assumptions are we making in this model about 886 00:49:30,770 --> 00:49:33,770 whether or not a protein is actually fixed and welded 887 00:49:33,770 --> 00:49:35,570 to the genome in a particular location 888 00:49:35,570 --> 00:49:37,432 or whether or not it's drifting around? 889 00:49:37,432 --> 00:49:39,140 And this might be of particular interest, 890 00:49:39,140 --> 00:49:41,139 for example, if we're looking at histones, which 891 00:49:41,139 --> 00:49:43,620 are known to slide up and down the genome, 892 00:49:43,620 --> 00:49:45,440 and we're immunoprecipitating them. 893 00:49:45,440 --> 00:49:48,490 What will this give us? 894 00:49:48,490 --> 00:49:50,380 We are making the assumption that we're 895 00:49:50,380 --> 00:49:52,880 dealing with proteins that have punctate binding properties, 896 00:49:52,880 --> 00:49:55,521 that actually are point binding proteins. 897 00:49:55,521 --> 00:49:58,020 And thus what we're going to get out of this, as you'll see, 898 00:49:58,020 --> 00:50:00,940 is going to be a single location. 899 00:50:00,940 --> 00:50:03,600 There are other methodologies, which I won't describe today, 900 00:50:03,600 --> 00:50:10,120 for essentially deconvolving the motion of a protein 901 00:50:10,120 --> 00:50:12,840 on the genome and coming up with its middle 902 00:50:12,840 --> 00:50:16,570 or mean position while allowing it to move. 903 00:50:16,570 --> 00:50:19,200 But today's algorithm is designed 904 00:50:19,200 --> 00:50:22,350 for point binding proteins. 905 00:50:22,350 --> 00:50:25,180 Good question, OK? 906 00:50:25,180 --> 00:50:26,730 All right. 907 00:50:26,730 --> 00:50:28,934 So just to be clear, what we're going to do 908 00:50:28,934 --> 00:50:30,600 is we're going to initialize everything, 909 00:50:30,600 --> 00:50:33,200 such that we'll begin by initializing pi 910 00:50:33,200 --> 00:50:36,550 to be 1 over the size of the genome. 911 00:50:36,550 --> 00:50:39,560 And at the very end of this, the strength of a binding event 912 00:50:39,560 --> 00:50:47,670 will be simply the number of reads assigned to that event 913 00:50:47,670 --> 00:50:52,460 by summing up our estimate of g. 
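The whole E-step/M-step loop fits in a few lines. The sketch below is only an illustration under stated assumptions: it substitutes a simple Gaussian shape for the empirically learned read distribution that GPS actually uses, and the read counts, spread, and iteration count are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
L = 1400                                          # toy genome length
# Synthetic 5' read positions from two planted events, at 500 and 550.
reads = np.concatenate([rng.normal(500, 20, 300),
                        rng.normal(550, 20, 200)])

def read_prob(read_pos, event_pos, sd=20.0):
    # Stand-in for the read distribution Pr(read position | event at m).
    return np.exp(-0.5 * ((read_pos - event_pos) / sd) ** 2)

positions = np.arange(L)                          # candidate event locations m
pi = np.full(L, 1.0 / L)                          # initialize pi to 1 over the genome size

for _ in range(50):
    # E-step: gamma[n, m] is the soft assignment of read n to an event at m.
    lik = read_prob(reads[:, None], positions[None, :])
    gamma = pi[None, :] * lik
    gamma /= gamma.sum(axis=1, keepdims=True)
    # M-step: pi[m] is the expected number of reads from m over the total reads.
    pi = gamma.sum(axis=0) / len(reads)

print(positions[np.argsort(pi)[-5:]])             # strongest locations cluster near 500 and 550
```

Run as is, this reproduces the behavior shown next: with no prior, the probability mass smears across many neighboring positions instead of collapsing onto two clean events.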
914 00:50:52,460 --> 00:50:56,340 And the nice thing about this is that because the number 915 00:50:56,340 --> 00:50:59,620 of reads assigned to an event, in some sense, 916 00:50:59,620 --> 00:51:02,165 is scaled relative to the total number of reads 917 00:51:02,165 --> 00:51:06,090 in the experiment, we can algorithmically take our genome 918 00:51:06,090 --> 00:51:08,920 and chop it up into independent pieces 919 00:51:08,920 --> 00:51:10,990 and process them all in parallel, 920 00:51:10,990 --> 00:51:14,760 which is what this methodology does. 921 00:51:14,760 --> 00:51:15,460 All right. 922 00:51:15,460 --> 00:51:16,960 So let's have a look at what happens 923 00:51:16,960 --> 00:51:19,360 when we run this algorithm. 924 00:51:19,360 --> 00:51:20,130 So here we are. 925 00:51:20,130 --> 00:51:21,210 This is synthetic data. 926 00:51:21,210 --> 00:51:24,930 I have events planted at 500 and 550 base pairs. 927 00:51:24,930 --> 00:51:29,620 The x-axis is between 0 and 1,400 base pairs. 928 00:51:29,620 --> 00:51:34,930 And the y-axis is pi from 0 to 1. 929 00:51:34,930 --> 00:51:38,728 So here we go. 930 00:51:38,728 --> 00:51:39,676 It's running. 931 00:51:44,420 --> 00:51:46,030 And this is one base pair resolution. 932 00:51:49,390 --> 00:51:54,320 And it is still running-- and stop. 933 00:51:54,320 --> 00:52:00,070 Well, we get sort of a fuzzy mess there, don't we? 934 00:52:00,070 --> 00:52:02,390 And the reason we're getting a fuzzy mess there 935 00:52:02,390 --> 00:52:06,820 is that we have got a lot of reads we've created, 936 00:52:06,820 --> 00:52:09,810 and there are a lot of different genome positions 937 00:52:09,810 --> 00:52:13,270 that are claiming some responsibility for this, which 938 00:52:13,270 --> 00:52:15,640 my friend, Mr. Sparsity, doesn't like, right? 939 00:52:15,640 --> 00:52:19,360 We need to clean this up somehow. 940 00:52:19,360 --> 00:52:22,260 And let's see what happens when we actually 941 00:52:22,260 --> 00:52:23,570 look at actual data. 942 00:52:23,570 --> 00:52:27,190 So here is that original data I showed you 943 00:52:27,190 --> 00:52:31,910 for the two Oct4 binding events next to the SOCS2 gene. 944 00:52:31,910 --> 00:52:36,000 And I'll run the algorithm on these data. 945 00:52:36,000 --> 00:52:40,530 And you can see it working away here. 946 00:52:40,530 --> 00:52:46,590 And once again, it's giving us a spread of events, 947 00:52:46,590 --> 00:52:48,330 and you can see it even beginning 948 00:52:48,330 --> 00:52:52,340 to fill in along where there's noise in the genome 949 00:52:52,340 --> 00:52:54,150 and stopped here. 950 00:52:54,150 --> 00:52:57,750 And it's assigning the probability mass of pi 951 00:52:57,750 --> 00:53:00,180 all over the place. 952 00:53:00,180 --> 00:53:02,280 It is not sparse at all. 953 00:53:05,680 --> 00:53:08,500 So does anybody have any suggestions 954 00:53:08,500 --> 00:53:10,960 for how to fix this? 955 00:53:10,960 --> 00:53:11,940 We've worked all night. 956 00:53:11,940 --> 00:53:13,530 We've got our algorithm running. 957 00:53:13,530 --> 00:53:15,363 We thought it was going to be totally great. 958 00:53:15,363 --> 00:53:20,640 When we run it, it gives us this unfortunate, smeary kind 959 00:53:20,640 --> 00:53:23,460 of result. 960 00:53:23,460 --> 00:53:25,250 What could we do? 961 00:53:25,250 --> 00:53:28,630 Any ideas at all? 962 00:53:28,630 --> 00:53:29,618 Yes. 963 00:53:29,618 --> 00:53:32,455 AUDIENCE: Add a prior where most of the reads are.
964 00:53:32,455 --> 00:53:33,990 PROFESSOR: Add a prior. 965 00:53:33,990 --> 00:53:37,355 You saw the words "no prior," a little tip-off. 966 00:53:37,355 --> 00:53:39,530 Yeah, absolutely, good point. 967 00:53:39,530 --> 00:53:40,174 Add a prior. 968 00:53:40,174 --> 00:53:41,590 So what we're going to do is we're 969 00:53:41,590 --> 00:53:44,330 going to try and add a prior that's going to prejudice pi 970 00:53:44,330 --> 00:53:48,540 to be 0 to create sparsity. 971 00:53:48,540 --> 00:53:51,690 So we like pi being 0. 972 00:53:51,690 --> 00:53:55,030 So we'll add what's called a negative Dirichlet prior, 973 00:53:55,030 --> 00:53:57,800 which looks like this, which is that the probability of pi sub m 974 00:53:57,800 --> 00:54:02,620 is proportional to 1 over pi sub m raised to the alpha power. 975 00:54:02,620 --> 00:54:06,350 And as pi gets smaller, that gets much, much bigger. 976 00:54:06,350 --> 00:54:11,490 And thus, if we add this prior, which 977 00:54:11,490 --> 00:54:13,820 happens to be a very convenient prior to use, 978 00:54:13,820 --> 00:54:18,720 we can force pi to be 0 in many cases. 979 00:54:18,720 --> 00:54:22,750 And when we take that prior into account and we compute 980 00:54:22,750 --> 00:54:29,200 the maximum a posteriori values for pi, 981 00:54:29,200 --> 00:54:31,880 the update rules are very similar, 982 00:54:31,880 --> 00:54:33,950 except that what's interesting is 983 00:54:33,950 --> 00:54:36,690 that the rule on the left for computing our estimate of g 984 00:54:36,690 --> 00:54:38,660 is identical. 985 00:54:38,660 --> 00:54:42,025 But on the right, what we do is we take the number of reads 986 00:54:42,025 --> 00:54:44,350 that we observe at a particular location 987 00:54:44,350 --> 00:54:46,750 and we subtract alpha from it, which 988 00:54:46,750 --> 00:54:51,180 was that exponent in our prior. 989 00:54:51,180 --> 00:54:54,720 And what that simply means is that if you don't have alpha 990 00:54:54,720 --> 00:54:56,640 reads, you're history. 991 00:54:56,640 --> 00:54:58,890 You're going to get eliminated. 992 00:54:58,890 --> 00:55:01,840 So this is going to do something called component elimination. 993 00:55:01,840 --> 00:55:03,380 If you don't have enough strength, 994 00:55:03,380 --> 00:55:06,040 you're going to get zapped to zero. 995 00:55:06,040 --> 00:55:08,374 Now you don't want to do this too aggressively early on. 996 00:55:08,374 --> 00:55:10,581 You want to let things sort of percolate for a while. 997 00:55:10,581 --> 00:55:13,160 So what this algorithm does is it doesn't eliminate 998 00:55:13,160 --> 00:55:17,950 all of the components that don't have alpha reads at the outset. 999 00:55:17,950 --> 00:55:20,430 It lets it run for a little while. 1000 00:55:20,430 --> 00:55:25,400 But it provides you with a way of ensuring that components 1001 00:55:25,400 --> 00:55:28,120 get eliminated and set to zero when they actually don't have 1002 00:55:28,120 --> 00:55:29,540 enough support in the read set. 1003 00:55:34,022 --> 00:55:35,522 So once again, all we're going to do 1004 00:55:35,522 --> 00:55:40,160 is we're going to add this prior on pi, which we will wind up 1005 00:55:40,160 --> 00:55:43,130 multiplying times that top equation. 1006 00:55:43,130 --> 00:55:45,830 We'll get the joint probability of r and pi in this case, 1007 00:55:45,830 --> 00:55:51,590 and then we're going to maximize it using this EM adaptation.
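Under that prior, only the M-step changes: subtract alpha from each location's expected read count and zero out anything that does not survive. A minimal sketch, with alpha set arbitrarily here (the GPS paper describes how it is really chosen):

```python
import numpy as np

def m_step_sparse(gamma, alpha=3.0):
    # gamma: (number of reads, L) soft assignments from the E-step.
    # MAP update under the prior P(pi_m) proportional to 1 / pi_m^alpha:
    # locations whose expected read count does not exceed alpha are eliminated.
    # (The real algorithm delays elimination for a few iterations; this sketch does not.)
    expected_counts = gamma.sum(axis=0)
    trimmed = np.maximum(expected_counts - alpha, 0.0)
    if trimmed.sum() == 0.0:
        return expected_counts / expected_counts.sum()   # degenerate fallback
    return trimmed / trimmed.sum()
```

Swapping this in for the plain M-step of the earlier sketch drives most entries of pi to exactly zero, which is the component elimination just described; the surviving components absorb the probability mass.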
1008 00:55:51,590 --> 00:55:53,310 And when we do so, let me show you 1009 00:55:53,310 --> 00:55:56,740 what happens to our first example that we had. 1010 00:56:04,240 --> 00:56:05,957 OK, that's much cleaner, right? 1011 00:56:05,957 --> 00:56:07,790 We actually only have two events popping out 1012 00:56:07,790 --> 00:56:11,370 of this at the right locations. 1013 00:56:11,370 --> 00:56:12,770 All right. 1014 00:56:12,770 --> 00:56:14,680 So that's looking good. 1015 00:56:14,680 --> 00:56:19,430 And the probabilities sum to 1, which we like. 1016 00:56:19,430 --> 00:56:24,000 And then we'll run this on the Oct4 data. 1017 00:56:27,300 --> 00:56:29,310 And you can see now, instead of getting 1018 00:56:29,310 --> 00:56:32,180 the mushing around those locations, what's 1019 00:56:32,180 --> 00:56:35,452 going to happen is that most of the probability mass 1020 00:56:35,452 --> 00:56:37,660 is going to be absorbed into just a couple components 1021 00:56:37,660 --> 00:56:38,870 around those binding events. 1022 00:56:43,490 --> 00:56:49,010 Now another way to test this is to ask whether or not, 1023 00:56:49,010 --> 00:56:53,260 if we looked at what we believe are closely spaced 1024 00:56:53,260 --> 00:56:59,650 homotypic events, in this case, for CTCF, with that prior, 1025 00:56:59,650 --> 00:57:02,170 we still can recover those two events being 1026 00:57:02,170 --> 00:57:04,460 next to each other in the genome. 1027 00:57:04,460 --> 00:57:11,930 And so if we run this, you can see it running along here. 1028 00:57:11,930 --> 00:57:16,640 Each one of these iterations, by the way, is an EM step. 1029 00:57:16,640 --> 00:57:18,160 And there you go. 1030 00:57:18,160 --> 00:57:20,790 And you can see that even from the sparse data 1031 00:57:20,790 --> 00:57:23,060 you see above-- that's the actual data being used, 1032 00:57:23,060 --> 00:57:27,950 all those little bars up there, read counts 1033 00:57:27,950 --> 00:57:31,230 for the five prime ends of the reads-- we can recover 1034 00:57:31,230 --> 00:57:36,110 the position of those binding events. 1035 00:57:36,110 --> 00:57:36,830 Question, yes. 1036 00:57:36,830 --> 00:57:37,705 AUDIENCE: [INAUDIBLE] 1037 00:57:40,036 --> 00:57:42,140 PROFESSOR: It depends upon the number of reads, 1038 00:57:42,140 --> 00:57:45,390 but it's somewhere around three or four. 1039 00:57:45,390 --> 00:57:48,110 I mean, if you look at the GPS paper, which I posted, 1040 00:57:48,110 --> 00:57:51,290 it tells you how alpha is set. 1041 00:57:51,290 --> 00:57:51,790 Yes. 1042 00:57:51,790 --> 00:57:53,750 AUDIENCE: Do you use whole cell extract data 1043 00:57:53,750 --> 00:57:56,230 when you are trying to compute? 1044 00:57:56,230 --> 00:57:58,480 PROFESSOR: Question, do we use whole cell extract data? 1045 00:57:58,480 --> 00:57:59,980 I told you all how important it was, 1046 00:57:59,980 --> 00:58:01,640 and I haven't mentioned it again. 1047 00:58:01,640 --> 00:58:03,550 We're about to get to that, because that's very important. 1048 00:58:03,550 --> 00:58:04,940 I'm glad you asked that question. 1049 00:58:04,940 --> 00:58:05,760 Yes. 1050 00:58:05,760 --> 00:58:08,175 AUDIENCE: It looks like, along the genome coordinates, 1051 00:58:08,175 --> 00:58:10,170 each event is only one base pair, 1052 00:58:10,170 --> 00:58:12,880 but binding is usually more than one base pair. 1053 00:58:12,880 --> 00:58:14,980 So is that the center of the binding? 1054 00:58:14,980 --> 00:58:16,050 PROFESSOR: Yes.
1055 00:58:16,050 --> 00:58:23,509 Well, it's the center of the read distribution function. 1056 00:58:23,509 --> 00:58:26,050 Whether or not it's the center of the binding of the protein, 1057 00:58:26,050 --> 00:58:27,100 I really can't tell. 1058 00:58:31,600 --> 00:58:33,960 OK, great. 1059 00:58:33,960 --> 00:58:37,410 And I'll just point out that a power of this method 1060 00:58:37,410 --> 00:58:40,690 is its ability to take apart piles of reads 1061 00:58:40,690 --> 00:58:45,560 like this into these so-called homotypic binding events, where 1062 00:58:45,560 --> 00:58:46,890 you can see where the motif is. 1063 00:58:46,890 --> 00:58:50,690 And you can see this method will take apart that pile of reads 1064 00:58:50,690 --> 00:58:54,510 into two independent events, whereas other popular methods, 1065 00:58:54,510 --> 00:58:57,610 of which there are many for analyzing 1066 00:58:57,610 --> 00:59:00,800 these types of ChIP-seq data, don't take apart 1067 00:59:00,800 --> 00:59:04,045 the event into multiple events, because most of them 1068 00:59:04,045 --> 00:59:06,020 make the assumption that one pile of reads 1069 00:59:06,020 --> 00:59:07,311 equals one binding event. 1070 00:59:10,960 --> 00:59:15,754 Now back to our question about whole cell extract. 1071 00:59:15,754 --> 00:59:17,920 What happens if you get to a part of the genome that 1072 00:59:17,920 --> 00:59:18,630 looks like this? 1073 00:59:18,630 --> 00:59:21,130 We're going to go back to our Oct4 data. 1074 00:59:21,130 --> 00:59:25,370 And we have our Oct4 IP track, which is above, 1075 00:59:25,370 --> 00:59:26,980 and then our whole cell extract track 1076 00:59:26,980 --> 00:59:28,146 is down there on the bottom. 1077 00:59:30,630 --> 00:59:31,530 What would you say? 1078 00:59:31,530 --> 00:59:33,920 Would you say those are really Oct4 binding events? 1079 00:59:33,920 --> 00:59:35,961 Or do you think something else is going on there? 1080 00:59:35,961 --> 00:59:37,710 If so, what might it be? 1081 00:59:42,180 --> 00:59:45,315 Any ideas about what's going on here? 1082 00:59:45,315 --> 00:59:45,815 Yes. 1083 00:59:45,815 --> 00:59:48,470 AUDIENCE: Regions with a lot of repeats in the genome? 1084 00:59:48,470 --> 00:59:50,920 PROFESSOR: Regions with repeats in the genome? 1085 00:59:50,920 --> 00:59:51,510 Yes. 1086 00:59:51,510 --> 00:59:56,830 In fact, you can see repeats annotated in green right there. 1087 00:59:56,830 --> 01:00:00,510 And as you recall from last time when we talked about assembly, 1088 01:00:00,510 --> 01:00:04,000 it's very difficult to actually get 1089 01:00:04,000 --> 01:00:08,200 the number of repeat elements correct. 1090 01:00:08,200 --> 01:00:10,820 And a consequence of that is if you underestimate 1091 01:00:10,820 --> 01:00:12,620 the number of repeat elements in the genome 1092 01:00:12,620 --> 01:00:16,290 and you collapse them, when you do sequencing of the genome 1093 01:00:16,290 --> 01:00:18,910 and you remap it, you're going to get regions of the genome 1094 01:00:18,910 --> 01:00:22,480 where a lot of reads pile up, because they're actually 1095 01:00:22,480 --> 01:00:24,890 coming from many different places in the genome, 1096 01:00:24,890 --> 01:00:28,240 but they only have one place to align. 1097 01:00:28,240 --> 01:00:31,692 And these towers of reads, as they're called, 1098 01:00:31,692 --> 01:00:35,270 can create all sorts of artifacts.
1099 01:00:35,270 --> 01:00:38,060 So in order to judge the significance of events, 1100 01:00:38,060 --> 01:00:40,100 what we'll do is a two-step process. 1101 01:00:40,100 --> 01:00:44,980 First, we will run our discovery pipeline 1102 01:00:44,980 --> 01:00:47,650 and discover where all the events are in the IP channel 1103 01:00:47,650 --> 01:00:50,190 without regard to the whole cell extract channel, 1104 01:00:50,190 --> 01:00:52,120 in this particular case. 1105 01:00:52,120 --> 01:00:53,930 And then we will filter the events 1106 01:00:53,930 --> 01:00:59,440 and say which of the events are real in view of the data 1107 01:00:59,440 --> 01:01:02,190 from the whole cell extract channel. 1108 01:01:02,190 --> 01:01:05,100 And so the way to represent this is 1109 01:01:05,100 --> 01:01:11,060 what is the likelihood we would have seen those IP reads 1110 01:01:11,060 --> 01:01:17,710 at random given the whole cell extract channel? 1111 01:01:17,710 --> 01:01:19,670 So how can we formulate this problem? 1112 01:01:19,670 --> 01:01:21,790 Well imagine that we take all of the reads 1113 01:01:21,790 --> 01:01:25,306 that we see in both channels and we add them together. 1114 01:01:25,306 --> 01:01:26,680 So we get a total number of reads 1115 01:01:26,680 --> 01:01:29,985 at a particular location in the genome. 1116 01:01:29,985 --> 01:01:31,985 If, in fact, they were going to occur at random, 1117 01:01:31,985 --> 01:01:34,109 we'd be doing a coin flip. 1118 01:01:34,109 --> 01:01:35,650 Some reads will go on the IP channel. 1119 01:01:35,650 --> 01:01:38,790 Some reads will go on the whole cell extract channel. 1120 01:01:38,790 --> 01:01:42,660 So now we can ask what's the probability we observed greater 1121 01:01:42,660 --> 01:01:45,080 than a certain number of reads on the IP channel 1122 01:01:45,080 --> 01:01:48,670 by chance, assuming a coin flipping model where 1123 01:01:48,670 --> 01:01:51,520 we've added the reads together. 1124 01:01:51,520 --> 01:01:54,670 Another way to view that is what's the chance that we have 1125 01:01:54,670 --> 01:01:57,500 observed less than a certain number of 1126 01:01:57,500 --> 01:02:00,160 reads in the whole cell extract channel 1127 01:02:00,160 --> 01:02:02,900 assuming a coin flipping model? 1128 01:02:02,900 --> 01:02:08,350 And that will give us a p-value under the null hypothesis 1129 01:02:08,350 --> 01:02:10,700 that, in fact, there is no binding event 1130 01:02:10,700 --> 01:02:12,740 and simply what's going on is that we 1131 01:02:12,740 --> 01:02:17,150 have reads being flipped between the two channels out 1132 01:02:17,150 --> 01:02:18,980 of a pool of the total number of reads. 1133 01:02:21,980 --> 01:02:25,125 And if we take that approach, we can formulate it 1134 01:02:25,125 --> 01:02:28,140 as the binomial in the following way. 1135 01:02:28,140 --> 01:02:34,390 We can compute a p-value for a given binding event using 1136 01:02:34,390 --> 01:02:37,670 the data from both the IP channel 1137 01:02:37,670 --> 01:02:43,140 and from the control channel. 1138 01:02:43,140 --> 01:02:47,740 And the way that we do this is that we look at the number of 1139 01:02:47,740 --> 01:02:49,850 reads assigned to an event, because we already 1140 01:02:49,850 --> 01:02:50,960 know that number. 1141 01:02:50,960 --> 01:02:53,050 We've computed that number. 1142 01:02:53,050 --> 01:02:57,880 And we take a similar window around the control channel.
1143 01:02:57,880 --> 01:02:59,840 We count the number of reads in it 1144 01:02:59,840 --> 01:03:02,900 and ask whether or not the reads in the IP channel 1145 01:03:02,900 --> 01:03:07,370 are sufficiently larger than the reads in the control 1146 01:03:07,370 --> 01:03:12,370 channel to reject the null hypothesis that, in fact, they 1147 01:03:12,370 --> 01:03:13,430 occurred at random. 1148 01:03:16,360 --> 01:03:18,600 So this is the way that we compute the p-values 1149 01:03:18,600 --> 01:03:19,100 for events. 1150 01:03:23,920 --> 01:03:27,670 So once we've computed the p-values for events, 1151 01:03:27,670 --> 01:03:31,790 we still have our multiple hypothesis correction problem, 1152 01:03:31,790 --> 01:03:36,890 which is that if we have 100,000 events, 1153 01:03:36,890 --> 01:03:44,060 we need to set our p-value cutoff at an appropriate value. 1154 01:03:44,060 --> 01:03:47,380 And I think that you may have heard 1155 01:03:47,380 --> 01:03:49,930 about this before in recitation. 1156 01:03:49,930 --> 01:03:55,900 But the way that this is done in this system 1157 01:03:55,900 --> 01:03:59,960 is to use a Benjamini-Hochberg correction. 1158 01:03:59,960 --> 01:04:04,800 And the essential idea is that we take all of the p-values 1159 01:04:04,800 --> 01:04:13,550 and we organize them from smallest to largest. 1160 01:04:13,550 --> 01:04:17,450 And we're going to take a set of the top events, 1161 01:04:17,450 --> 01:04:21,140 say 1 through 4, and call them significant. 1162 01:04:21,140 --> 01:04:25,860 And we call them significant under the constraint 1163 01:04:25,860 --> 01:04:32,800 that p-value sub i is less than or equal to i over n 1164 01:04:32,800 --> 01:04:35,680 times alpha, where n is the total number of events 1165 01:04:35,680 --> 01:04:38,580 and alpha is our desired false discovery rate. 1166 01:04:43,770 --> 01:04:46,520 So this allows us in a principled way 1167 01:04:46,520 --> 01:04:50,010 to select the events that we believe 1168 01:04:50,010 --> 01:04:53,240 are significant up to a false discovery rate, which might be, 1169 01:04:53,240 --> 01:04:55,430 for example, 0.05, a value that's typically 1170 01:04:55,430 --> 01:04:56,860 used for a false discovery rate. 1171 01:05:02,410 --> 01:05:06,650 So we've talked about how to discover events, 1172 01:05:06,650 --> 01:05:10,530 how to judge their significance individually, 1173 01:05:10,530 --> 01:05:15,734 how to take a ranked list of events from a given experiment 1174 01:05:15,734 --> 01:05:17,400 and determine which ones are significant 1175 01:05:17,400 --> 01:05:20,200 up to a desired false discovery rate. 1176 01:05:20,200 --> 01:05:22,890 The false discovery rate means that, let's say, 1177 01:05:22,890 --> 01:05:27,080 we have 1,000 events that come out. 1178 01:05:27,080 --> 01:05:28,990 If this false discovery rate was 0.1, 1179 01:05:28,990 --> 01:05:33,350 we would expect 100 of them to be false positives. 1180 01:05:33,350 --> 01:05:35,880 So it's the fraction of positives 1181 01:05:35,880 --> 01:05:39,551 that we detect that we think are going to be false. 1182 01:05:39,551 --> 01:05:41,342 And we can set that to be whatever we want. 1183 01:05:51,360 --> 01:05:54,395 Now let's talk about the analysis of these sort of data. 1184 01:05:54,395 --> 01:05:55,520 There's a question up here. 1185 01:05:55,520 --> 01:05:56,361 Yes.
1186 01:05:56,361 --> 01:05:59,127 AUDIENCE: So this randomness, does that correct for the fact 1187 01:05:59,127 --> 01:06:04,780 that in your previous slide the coin isn't necessarily fair? 1188 01:06:04,780 --> 01:06:06,880 PROFESSOR: Oh, the null hypothesis 1189 01:06:06,880 --> 01:06:12,780 assumed that the coin was fair, that reads could either occur. 1190 01:06:12,780 --> 01:06:15,970 That in this case, for example, that the reads were 1191 01:06:15,970 --> 01:06:19,090 either occurring in the control channel or in the IP channel 1192 01:06:19,090 --> 01:06:22,640 with equal likelihood, assuming because they were coming 1193 01:06:22,640 --> 01:06:27,097 from the same process, which was not related to the IP itself. 1194 01:06:27,097 --> 01:06:29,180 They're coming from some underlying noise process. 1195 01:06:31,925 --> 01:06:34,207 AUDIENCE: So typically you'd have some fixed number 1196 01:06:34,207 --> 01:06:40,564 of reads, and so in the IP channel, the bulk of your reads 1197 01:06:40,564 --> 01:06:42,683 was in your peaks. 1198 01:06:42,683 --> 01:06:45,130 Then you would have fewer in your-- 1199 01:06:45,130 --> 01:06:47,010 PROFESSOR: So the question is, how do you 1200 01:06:47,010 --> 01:06:48,990 normalize for the number of reads 1201 01:06:48,990 --> 01:06:50,210 and how they're being spent? 1202 01:06:50,210 --> 01:06:52,060 Because in the IP channel, you're 1203 01:06:52,060 --> 01:06:55,410 spending the reads on IP enriched events. 1204 01:06:55,410 --> 01:06:56,790 In the whole cell extract channel, 1205 01:06:56,790 --> 01:06:58,450 you're not spending the reads on those, 1206 01:06:58,450 --> 01:07:01,820 so you have more reads to go around for bad things, 1207 01:07:01,820 --> 01:07:03,650 so to speak. 1208 01:07:03,650 --> 01:07:07,520 So what this algorithm does-- which I'm glad 1209 01:07:07,520 --> 01:07:11,620 you brought it up-- is it takes regions of the genome that 1210 01:07:11,620 --> 01:07:15,960 are event free, and it matches them up against one another 1211 01:07:15,960 --> 01:07:17,500 and fits a model against them to be 1212 01:07:17,500 --> 01:07:22,900 able to scale the control reads so that the number of reads 1213 01:07:22,900 --> 01:07:24,840 in the IP and extract channel are 1214 01:07:24,840 --> 01:07:27,865 the same in the regions that are free of events. 1215 01:07:30,590 --> 01:07:34,260 And so it matches things, so it is 0.5. 1216 01:07:34,260 --> 01:07:34,760 OK? 1217 01:07:34,760 --> 01:07:38,560 AUDIENCE: Even if it wasn't 0.5, wouldn't this rank list 1218 01:07:38,560 --> 01:07:41,080 correct for that? 1219 01:07:41,080 --> 01:07:42,580 PROFESSOR: No, it would not. 1220 01:07:47,460 --> 01:07:53,180 Let us suppose that we made a horrible mistake 1221 01:07:53,180 --> 01:07:56,900 and that we let these events through 1222 01:07:56,900 --> 01:07:59,320 because our p was wrong. 1223 01:07:59,320 --> 01:08:01,260 They would have a very large number of reads, 1224 01:08:01,260 --> 01:08:02,718 and they would be very significant. 1225 01:08:04,990 --> 01:08:06,690 As a consequence, they would float up 1226 01:08:06,690 --> 01:08:09,260 to the top of this list as having the lowest 1227 01:08:09,260 --> 01:08:11,630 p-values, or the most significant, 1228 01:08:11,630 --> 01:08:13,580 and they would pop through this. 1229 01:08:13,580 --> 01:08:17,051 So we would report them as binding events, 1230 01:08:17,051 --> 01:08:18,009 which is not desirable. 1231 01:08:21,859 --> 01:08:24,339 These are all great questions.
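Pulling the significance pieces together, here is a hedged sketch of the whole filtering step: scale the control channel using event-free windows, compute a one-sided binomial p-value per event, and then apply the Benjamini-Hochberg rule across all events. The counts, window scheme, and function names are illustrative assumptions, not the exact procedure in the GPS paper.

```python
import numpy as np
from scipy.stats import binom

def control_scale(ip_window_counts, control_window_counts, has_event):
    # Fit a single scale factor from event-free regions so that the control
    # channel matches the IP channel where there are no binding events
    # (a crude stand-in for the model fitting described above).
    free = ~np.asarray(has_event)
    return np.asarray(ip_window_counts)[free].sum() / np.asarray(control_window_counts)[free].sum()

def event_pvalue(ip_count, control_count, scale=1.0):
    # Null hypothesis: after scaling, each read lands in either channel with
    # probability 0.5, so the observed IP count is a one-sided binomial tail.
    total = int(round(ip_count + control_count * scale))
    return binom.sf(ip_count - 1, total, 0.5)

def benjamini_hochberg(p_values, alpha=0.05):
    # Keep the largest prefix of sorted p-values with p_(i) <= (i / n) * alpha.
    p = np.asarray(p_values)
    n = len(p)
    order = np.argsort(p)
    passing = p[order] <= (np.arange(1, n + 1) / n) * alpha
    keep = np.zeros(n, dtype=bool)
    if passing.any():
        keep[order[:np.max(np.nonzero(passing)[0]) + 1]] = True
    return keep

# Hypothetical per-window counts used to fit the scale factor.
ip_windows = np.array([400, 30, 25, 500, 28])
control_windows = np.array([60, 55, 50, 70, 52])
windows_with_events = np.array([True, False, False, True, False])
scale = control_scale(ip_windows, control_windows, windows_with_events)

# Hypothetical events: (reads assigned to the event, reads in the matched control window).
events = [(85, 20), (40, 35), (120, 15), (22, 25)]
pvals = [event_pvalue(ip, ctrl, scale) for ip, ctrl in events]
print(np.round(pvals, 6))
print(benjamini_hochberg(pvals, alpha=0.05))
```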
1232 01:08:24,339 --> 01:08:25,464 Any other questions at all? 1233 01:08:29,979 --> 01:08:36,359 So we've talked about this. 1234 01:08:36,359 --> 01:08:40,779 The next question you might have would be, 1235 01:08:40,779 --> 01:08:42,649 you've done two replicates of the same chip. 1236 01:08:42,649 --> 01:08:45,180 You always want to do two of every experiment, if not three 1237 01:08:45,180 --> 01:08:45,680 or four. 1238 01:08:48,620 --> 01:08:51,565 And I know it's expensive, but you 1239 01:08:51,565 --> 01:08:53,939 don't know where you are if you only do one of something. 1240 01:08:53,939 --> 01:08:55,930 Trust me. 1241 01:08:55,930 --> 01:08:58,420 So assuming you've done two experiments that 1242 01:08:58,420 --> 01:09:01,120 were identical ChIP experiments, and you 1243 01:09:01,120 --> 01:09:03,729 would like to know whether or not 1244 01:09:03,729 --> 01:09:06,160 the results between those two experiments are concordant. 1245 01:09:09,880 --> 01:09:13,214 How might you go about determining this? 1246 01:09:13,214 --> 01:09:14,380 Does anybody have any ideas? 1247 01:09:24,770 --> 01:09:26,410 Let me suggest something to you. 1248 01:09:29,200 --> 01:09:39,189 Imagine that we create two lists-- one 1249 01:09:39,189 --> 01:09:42,640 for experiment x and one for experiment y. 1250 01:09:42,640 --> 01:09:44,250 And we're going to put in this list, 1251 01:09:44,250 --> 01:09:46,609 in rank order, the strongest events 1252 01:09:46,609 --> 01:09:49,450 and where they occur in the genome. 1253 01:09:49,450 --> 01:09:53,067 And let us suppose that we are able to match events 1254 01:09:53,067 --> 01:09:55,275 across experiments when they're in the same location. 1255 01:09:59,520 --> 01:10:08,920 So what we'll do is we'll say Event one occurs here. 1256 01:10:08,920 --> 01:10:11,010 Event 13 occurs here and here. 1257 01:10:11,010 --> 01:10:13,290 Event 9 occurs here and here. 1258 01:10:13,290 --> 01:10:15,220 Event 10 is matched. 1259 01:10:15,220 --> 01:10:17,270 Event 11 occurs here. 1260 01:10:17,270 --> 01:10:20,450 Event 57 occurs here. 1261 01:10:20,450 --> 01:10:24,290 And these are ranked in order of significance. 1262 01:10:24,290 --> 01:10:27,960 They're not completely concordant, 1263 01:10:27,960 --> 01:10:31,190 and they're ranked according to the significance 1264 01:10:31,190 --> 01:10:35,440 in this particular experiment and this particular experiment. 1265 01:10:35,440 --> 01:10:37,180 And we would like to know whether or not 1266 01:10:37,180 --> 01:10:40,980 we think these two experiments are 1267 01:10:40,980 --> 01:10:42,462 good replicates of one another. 1268 01:10:45,380 --> 01:10:49,870 Now the nice thing about converting this 1269 01:10:49,870 --> 01:10:54,980 into a rank test is that we are no longer 1270 01:10:54,980 --> 01:11:02,310 sensitive to the number of reads or any specific numeric values 1271 01:11:02,310 --> 01:11:04,070 represented by each one of these events. 1272 01:11:04,070 --> 01:11:07,530 We're simply ranking them in terms of their importance 1273 01:11:07,530 --> 01:11:13,950 and asking whether or not they appear to be quite similar. 1274 01:11:13,950 --> 01:11:19,640 So one way to do this is simply to take a rank correlation, 1275 01:11:19,640 --> 01:11:24,390 which is a correlation between the ranks of identical events 1276 01:11:24,390 --> 01:11:26,520 in the two lists. 
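As a quick illustration of that point, with made-up strength values rather than the ones on the slide: a relationship that is monotonic but far from linear gets a Pearson correlation below 1, while the Spearman rank correlation is exactly 1.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical event strengths for the same events in two replicate experiments.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = x ** 3                                   # monotonic in x, but far from linear

r_pearson, _ = pearsonr(x, y)
r_spearman, _ = spearmanr(x, y)
print(round(r_pearson, 2), r_spearman)       # Pearson below 1, Spearman exactly 1.0
```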
1277 01:11:26,520 --> 01:11:29,390 And so, for example, the lower right-hand corner 1278 01:11:29,390 --> 01:11:36,179 of this slide shows you an example of X and Y values 1279 01:11:36,179 --> 01:11:38,220 where, although there's not really a linear relationship 1280 01:11:38,220 --> 01:11:41,560 between them, the Pearson correlation coefficient 1281 01:11:41,560 --> 01:11:43,510 is 0.88. 1282 01:11:43,510 --> 01:11:45,260 If we look at the rank correlation, which 1283 01:11:45,260 --> 01:11:47,551 is defined by the equation on the left, which is really 1284 01:11:47,551 --> 01:11:51,290 simply the correlation between their rank values, 1285 01:11:51,290 --> 01:11:53,650 the Spearman correlation for rank is 1. 1286 01:11:53,650 --> 01:11:56,650 They're perfectly matched. 1287 01:11:56,650 --> 01:12:01,640 So if you want to consider whether or not 1288 01:12:01,640 --> 01:12:05,150 two experiments are very similar, 1289 01:12:05,150 --> 01:12:08,420 I'd suggest you consider rank based tests. 1290 01:12:11,820 --> 01:12:15,920 But we can do even more than this. 1291 01:12:15,920 --> 01:12:18,280 Imagine that we would like to know 1292 01:12:18,280 --> 01:12:23,760 at what point along this list things 1293 01:12:23,760 --> 01:12:27,830 are no longer concordant, that we would like to consider 1294 01:12:27,830 --> 01:12:31,460 all the events that are consistent among these two 1295 01:12:31,460 --> 01:12:32,860 experiments. 1296 01:12:32,860 --> 01:12:37,320 And we assume that there is a portion of the experiment 1297 01:12:37,320 --> 01:12:43,590 results that are concordant and part that is discordant. 1298 01:12:43,590 --> 01:12:47,960 And we want to learn where the boundary is. 1299 01:12:47,960 --> 01:12:50,300 So what we'll do is this. 1300 01:12:50,300 --> 01:12:56,580 We will look at the number of events that are concordant up 1301 01:12:56,580 --> 01:12:59,152 to a particular point. 1302 01:12:59,152 --> 01:13:01,610 Let's see here, make sure we get the parametrization right. 1303 01:13:07,550 --> 01:13:14,650 So if we have lists that are n long, psi of n of t 1304 01:13:14,650 --> 01:13:26,950 is the number of events that are paired 1305 01:13:26,950 --> 01:13:33,760 in the top n times t events. 1306 01:13:33,760 --> 01:13:37,060 So t is simply the fraction. 1307 01:13:37,060 --> 01:13:40,730 If t is 0.25, that would mean that in the top 25% 1308 01:13:40,730 --> 01:13:44,180 of the events, the number of events that are paired. 1309 01:13:47,020 --> 01:13:49,676 Actually, it's the fraction of events. 1310 01:13:49,676 --> 01:13:50,175 Sorry. 1311 01:13:54,900 --> 01:14:03,430 So assuming that we had perfect replicates, what we would see 1312 01:14:03,430 --> 01:14:09,430 would be that psi of n of t would look like this. 1313 01:14:22,931 --> 01:14:23,430 Yes? 1314 01:14:23,430 --> 01:14:25,440 AUDIENCE: What is t? 1315 01:14:25,440 --> 01:14:28,424 PROFESSOR: t is the fraction of the events 1316 01:14:28,424 --> 01:14:29,382 that we're considering. 1317 01:14:29,382 --> 01:14:33,600 So this t might be equal to 0.25. 1318 01:14:33,600 --> 01:14:38,060 If this was 10, then this would be-- oh, sorry. 1319 01:14:38,060 --> 01:14:40,570 I think this t equals 0.5. 1320 01:14:40,570 --> 01:14:44,180 If n was equal to 10, this would be the first five events. 1321 01:14:44,180 --> 01:14:45,337 So t is the fraction. 1322 01:14:45,337 --> 01:14:46,670 n is the total number of events.
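Here is a sketch of that correspondence curve for two hypothetical ranked lists whose top halves agree and whose bottoms are noise. Following the definition above, psi_n(t) is taken here as the fraction of all n events that are paired within the top n times t of both lists, so perfectly matched replicates would trace the diagonal.

```python
import numpy as np

def psi_curve(ranked_x, ranked_y, ts):
    # ranked_x, ranked_y: event identifiers listed from strongest to weakest.
    # For each fraction t, count the events appearing in the top n*t of both
    # lists and divide by n, the total number of events.
    n = len(ranked_x)
    values = []
    for t in ts:
        k = max(1, int(round(n * t)))
        shared = set(ranked_x[:k]) & set(ranked_y[:k])
        values.append(len(shared) / n)
    return np.array(values)

# Hypothetical replicates: concordant at the top, discordant after rank 6.
x = [1, 13, 9, 10, 11, 57, 3, 8, 22, 4]
y = [13, 1, 10, 9, 57, 11, 40, 99, 61, 7]
ts = np.linspace(0.1, 1.0, 10)
print(psi_curve(x, y, ts))   # rises with slope near 1, then flattens in the noise
```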
1323 01:14:51,640 --> 01:15:02,130 So if this is t, which is the fraction which 1324 01:15:02,130 --> 01:15:06,070 goes from 0 to 1, and this is psi of n of t, 1325 01:15:06,070 --> 01:15:08,345 this would be a perfectly matched set of replicates. 1326 01:15:11,630 --> 01:15:15,765 At 0.5, 0.5 of the events are matched, which is perfect. 1327 01:15:15,765 --> 01:15:17,015 Can't do any better than that. 1328 01:15:19,930 --> 01:15:26,070 So this is simply telling us, as we 1329 01:15:26,070 --> 01:15:30,570 take larger and larger fractions of our events, what 1330 01:15:30,570 --> 01:15:33,800 fraction of them are matched up to that point? 1331 01:15:33,800 --> 01:15:34,590 Question. 1332 01:15:34,590 --> 01:15:39,958 AUDIENCE: So this assumes that the rank order 1333 01:15:39,958 --> 01:15:42,418 is important for the correlation, right? 1334 01:15:42,418 --> 01:15:44,189 The higher rank they are, we expect 1335 01:15:44,189 --> 01:15:45,862 them to be better correlated. 1336 01:15:45,862 --> 01:15:49,397 For example, if in the middle they 1337 01:15:49,397 --> 01:15:53,388 were the best correlated, but not the top or bottom in terms 1338 01:15:53,388 --> 01:15:56,548 of the ranking, then you might have a weird looking-- 1339 01:15:56,548 --> 01:15:57,496 PROFESSOR: Yeah. 1340 01:15:57,496 --> 01:15:59,200 The question is, doesn't this assume 1341 01:15:59,200 --> 01:16:01,562 that we care most about the events 1342 01:16:01,562 --> 01:16:03,520 up here, as opposed to the events, for example, 1343 01:16:03,520 --> 01:16:04,750 in the middle. 1344 01:16:04,750 --> 01:16:09,550 And this analysis is done to consider 1345 01:16:09,550 --> 01:16:12,920 how many of the top events we take 1346 01:16:12,920 --> 01:16:16,510 that we think are consistent across replicates. 1347 01:16:16,510 --> 01:16:19,170 So it starts at the top as a consequence of that assumption. 1348 01:16:25,920 --> 01:16:28,440 This is the definition of psi n, which 1349 01:16:28,440 --> 01:16:30,210 is the fraction of the top events that 1350 01:16:30,210 --> 01:16:32,640 are paired in the top events. 1351 01:16:32,640 --> 01:16:34,590 It's roughly linear from the point 1352 01:16:34,590 --> 01:16:37,220 where events are no longer reproducible. 1353 01:16:37,220 --> 01:16:43,320 And psi prime of n is the derivative of that function. 1354 01:16:43,320 --> 01:16:46,150 And graphically, if we look at these, 1355 01:16:46,150 --> 01:16:48,430 we can see that there's perfect correspondence 1356 01:16:48,430 --> 01:16:51,700 on the left, up to some point. 1357 01:16:51,700 --> 01:16:56,780 And then the hypothesis of this irreproducible discovery rate 1358 01:16:56,780 --> 01:17:01,390 is at that point things stop being in correspondence 1359 01:17:01,390 --> 01:17:05,980 because you've hit the noise part of the event population, 1360 01:17:05,980 --> 01:17:07,900 and you get to junk. 1361 01:17:07,900 --> 01:17:11,710 And as soon as you get to junk, things fall apart. 1362 01:17:11,710 --> 01:17:13,760 And what this methodology does is 1363 01:17:13,760 --> 01:17:16,090 it attempts to fit the distribution to both 1364 01:17:16,090 --> 01:17:18,450 the correspondence and the non-correspondence parts 1365 01:17:18,450 --> 01:17:23,460 of the population, and you get to set a parameter, which 1366 01:17:23,460 --> 01:17:26,510 is the probability that you're willing to accept something 1367 01:17:26,510 --> 01:17:28,174 that's part of the junk population 1368 01:17:28,174 --> 01:17:29,715 of the non-correspondence population.
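One crude way to locate the point where the curve hits the junk, continuing the sketch above, is to look at the smoothed slope and flag where it first drops well below 1. This is only a finger-in-the-wind illustration; the actual IDR method fits a two-component mixture (reproducible plus irreproducible) to the paired ranks and thresholds on the probability of the irreproducible component.

```python
import numpy as np

def breakpoint_estimate(psi_values, ts, window=2, slope_cutoff=0.5):
    # Smoothed discrete slope of the correspondence curve: it stays near 1
    # while the replicates agree and drops toward 0 in the irreproducible tail.
    psi_values, ts = np.asarray(psi_values), np.asarray(ts)
    slopes = (psi_values[window:] - psi_values[:-window]) / (ts[window:] - ts[:-window])
    below = np.nonzero(slopes < slope_cutoff)[0]
    return ts[below[0]] if below.size else 1.0   # fraction where correspondence breaks down

psi_values = [0.0, 0.2, 0.2, 0.4, 0.4, 0.6, 0.6, 0.6, 0.6, 0.6]  # curve from the sketch above
ts = np.linspace(0.1, 1.0, 10)
print(breakpoint_estimate(psi_values, ts))   # about 0.6: beyond that, junk
```

In practice you would let the IDR machinery do this fit properly; the point here is only the shape of the argument.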
1369 01:17:32,500 --> 01:17:35,990 And the way this is used, for example, in the ENCODE project 1370 01:17:35,990 --> 01:17:40,550 is that all ChIP-seq experiments are done in parallel. 1371 01:17:40,550 --> 01:17:43,830 Event discovery is done independently on them. 1372 01:17:43,830 --> 01:17:46,600 All of the events from each independent experiment 1373 01:17:46,600 --> 01:17:48,970 are ranked from 1 to n. 1374 01:17:48,970 --> 01:17:51,580 And then IDR is done on the results 1375 01:17:51,580 --> 01:17:54,430 to figure out which of the events 1376 01:17:54,430 --> 01:17:57,277 are consistent across replicates. 1377 01:17:57,277 --> 01:17:59,360 And obviously, if you've had a very bad replicate, 1378 01:17:59,360 --> 01:18:02,856 you get a very small number of events, if any at all. 1379 01:18:02,856 --> 01:18:04,600 But if you have good replicates, then you 1380 01:18:04,600 --> 01:18:07,850 get a large number of events. 1381 01:18:07,850 --> 01:18:13,380 A secondary question is, imagine you had processing algorithm A 1382 01:18:13,380 --> 01:18:16,990 and processing algorithm B. It might be that processing 1383 01:18:16,990 --> 01:18:21,550 algorithm A typically gave you more reproducible 1384 01:18:21,550 --> 01:18:26,580 events than processing algorithm B, that whenever you ran A, 1385 01:18:26,580 --> 01:18:28,450 you could get much further down this curve 1386 01:18:28,450 --> 01:18:31,250 before things fell apart. 1387 01:18:31,250 --> 01:18:35,435 Does that mean that A is necessarily better, or not? 1388 01:18:35,435 --> 01:18:36,185 What do you think? 1389 01:18:39,634 --> 01:18:41,925 Would you always pick algorithm A if that was the case? 1390 01:18:44,460 --> 01:18:45,852 Any insight on this? 1391 01:18:45,852 --> 01:18:47,740 Yes? 1392 01:18:47,740 --> 01:18:50,100 AUDIENCE: No, probably, because there's 1393 01:18:50,100 --> 01:18:51,794 experimental considerations to it. 1394 01:18:51,794 --> 01:18:52,460 PROFESSOR: Yeah. 1395 01:18:52,460 --> 01:18:55,206 I would go with [INAUDIBLE] experimental considerations. 1396 01:18:55,206 --> 01:18:56,580 But the reason you might not pick 1397 01:18:56,580 --> 01:19:00,420 it is it might be that if algorithm A always gave you 1398 01:19:00,420 --> 01:19:03,930 the same wrong answers in the same order, 1399 01:19:03,930 --> 01:19:06,590 it would score perfectly on this, right? 1400 01:19:06,590 --> 01:19:08,380 It doesn't mean that it's right. 1401 01:19:08,380 --> 01:19:12,090 It just means it's consistent. 1402 01:19:12,090 --> 01:19:14,650 So for example, it could be calling all the towers 1403 01:19:14,650 --> 01:19:18,530 as events, and it could always give you, 1404 01:19:18,530 --> 01:19:22,090 for every experiment, the same results. 1405 01:19:22,090 --> 01:19:23,630 In fact, one interesting thing to do 1406 01:19:23,630 --> 01:19:26,940 is to run this kind of analysis against your experiment, 1407 01:19:26,940 --> 01:19:29,390 and an outlier shouldn't match your experiment. 1408 01:19:29,390 --> 01:19:31,492 It should fail. 1409 01:19:31,492 --> 01:19:33,200 It's always good to run negative controls 1410 01:19:33,200 --> 01:19:35,730 when you're doing data analysis. 1411 01:19:35,730 --> 01:19:39,500 So this IDR analysis simply allows 1412 01:19:39,500 --> 01:19:42,837 you to pick alpha, which is the probability of pairs 1413 01:19:42,837 --> 01:19:45,330 coming from the irreproducible part 1414 01:19:45,330 --> 01:19:48,435 of the mixture that you're willing to accept.
1415 01:19:48,435 --> 01:19:50,060 And the way this is used, as I said, 1416 01:19:50,060 --> 01:19:53,820 is that here you see different ways of analyzing 1417 01:19:53,820 --> 01:19:54,910 ChIP-seq data. 1418 01:19:54,910 --> 01:19:57,400 And the little vertical tick mark 1419 01:19:57,400 --> 01:19:58,920 indicates how many peaks they get 1420 01:19:58,920 --> 01:20:01,360 before they can't go any further because they 1421 01:20:01,360 --> 01:20:05,040 hit the IDR bound alpha. 1422 01:20:05,040 --> 01:20:07,340 And some methodologies can produce far more events 1423 01:20:07,340 --> 01:20:08,930 consistently than other methods can. 1424 01:20:12,320 --> 01:20:14,680 So that's just something for you to keep in mind. 1425 01:20:14,680 --> 01:20:18,840 And I will not have time to present all the rest of this. 1426 01:20:18,840 --> 01:20:21,180 But I did want to point out one thing to you, which 1427 01:20:21,180 --> 01:20:25,940 is that the methodology we talked about today 1428 01:20:25,940 --> 01:20:30,610 is able to resolve where things are 1429 01:20:30,610 --> 01:20:33,350 in the genome at exceptional spatial resolution, 1430 01:20:33,350 --> 01:20:36,450 within 20 base pairs genome wide. 1431 01:20:36,450 --> 01:20:39,780 And it's always run genome wide at single base pair resolution. 1432 01:20:39,780 --> 01:20:43,500 And it can do things like compute 1433 01:20:43,500 --> 01:20:47,030 for pairs of transcription factors 1434 01:20:47,030 --> 01:20:49,120 that have been profiled using ChIP-seq, 1435 01:20:49,120 --> 01:20:52,440 the spacing between them for spacings that are significant. 1436 01:20:52,440 --> 01:20:54,240 And so you can see that we're beginning 1437 01:20:54,240 --> 01:20:59,280 to understand the regulatory grammar of the genome in terms 1438 01:20:59,280 --> 01:21:01,640 of the way factors organize together 1439 01:21:01,640 --> 01:21:04,810 and interact to implement the combinatorial switches 1440 01:21:04,810 --> 01:21:07,520 that we've talked about. 1441 01:21:07,520 --> 01:21:08,450 That's it for today. 1442 01:21:08,450 --> 01:21:10,090 You guys have been totally great. 1443 01:21:10,090 --> 01:21:13,060 We'll see you on Tuesday for RNA-seq. 1444 01:21:13,060 --> 01:21:15,830 And have a great weekend until then.