PROFESSOR: OK. Well hello, everyone, and welcome back to Computational Systems Biology. I am David Gifford, and I am delighted to be back with you today. We're going to talk today about understanding transcription. Specifically, the way we're going to understand transcription is with a technique called RNA-seq. RNA-seq is a methodology for characterizing RNA molecules through next generation sequencing. We'll talk first about RNA-seq principles. We'll then talk about how to take the data we get from RNA-seq and analyze it using tools for characterizing differential gene expression and principal component analysis. And finally, we'll talk about single cell RNA-seq, which is a very important and growing area of scientific inquiry.

But first, let's talk about RNA-seq. How many people have heard of RNA-seq before? Fantastic. How many people have done it before? Some? Great.

So RNA-seq is fairly simple in concept. What we're going to do is isolate RNA species from a cell, or a collection of cells, in the desired condition. And note that we can choose which kinds of RNA molecules to isolate. We can isolate molecules before any selection, which would include precursor RNAs that have not been spliced yet, as well as non-coding RNAs. As you probably know, the study of non-coding RNAs is extraordinarily important. There are over 3,300 so-called long non-coding RNAs that have been characterized so far -- those are non-coding RNAs over 200 bases long. We'll be talking about those later on, when we talk about chromatin function in the genome.
And of course, there are the precursor messenger RNAs that are spliced into mature messenger RNAs, which are then translated into protein. The specifics of the RNA-seq protocol will give you various of these species, depending upon what kinds of purification methodologies you use.

But as you're aware, many isoforms are possible for most mammalian genes. This is a short summary, produced by the Burge Laboratory, of the different kinds of splicing events that can occur. The splicing events are often regulated by cis-regulatory sequences that live in the introns. These introns contain recognition sequences for splicing factors, and the splicing factors can be conditionally expressed. And so you get different combinations of these exons being glued together to produce variant forms of proteins. So we have to be mindful that the RNA molecules we're going to be observing typically are going to be reverse transcribed. So we'll see entire transcripts that came, perhaps, from distinct exonic locations in the genome.

The essential idea of RNA-seq is that we take the RNA molecules we care about -- in this case, we're going to purify the ones that have poly-A tails. We'll take those molecules, reverse transcribe them, sequence fragments of them, and then map the fragments to the genome. Now if you don't purify for poly-A tails, you get a lot of things mapping to the genome that are intronic, and so you get a lot of data that is very difficult to analyze. So typically, people who are looking for gene expression data will do poly-A purification. And when you do this -- when you sequence the result of reverse transcribing these RNA molecules and map the results to the genome -- what you find is data that looks like this. This is the SOX2 gene. This is typical expression data.
You can see all of the individual reads mapping to the genome, the blue on the plus strand, the pink on the minus strand. And our job is to take data like this and analyze it.

Now we can take these data and summarize them in various formats. One way is to simply count the number of times that we see a read at a particular base, like here, for the SMUG1 gene. And here you see something else going on: because the RNA in this sequencing experiment was poly-A selected, we're only seeing the reads that occur over the exonic sequences, more or less. There are a few intronic reads there, scattered about.

The other thing that we see, which is very important, is reads that are split across exons. Because the splicing event has occurred, the RNA molecule is a contiguous sequence of a collection of exons. And sometimes you'll get a read that spans an exon-exon boundary. And when we get this, you can see, in the bottom part of the slide that I'm showing you, these reads can map across the exons. Typically, in order to get good performance out of something like this, we want to use reads that are about 100 bases long. And we'll use a mapper that is capable of mapping both parts of a read, to account for split reads at these exon crossing events.

So that gives you an idea of the kinds of data that we have. And part of the challenge now is determining which particular isoforms of a gene are being expressed by evaluating these data. There are two principal ways of going about this. One way is to simply take our read data, use the ideas that we talked about in genome assembly, and assemble the reads into transcripts. That's typically done de novo, although there are reference guided assemblers. It has the benefit that you don't need a reference genome.
So sometimes, when you're working with an organism that does not have a well characterized reference genome, you will de novo assemble the transcripts from an RNA-seq experiment. But this approach also has problems with correctness, as we saw when we talked about assembly. The other approach, which is more typically used, is to map, or align, the reads to a reference genome and identify the different isoforms that are being expressed using constraints. And ultimately, the goal is to identify the isoforms and quantitate them so we can do further downstream analysis.

And you'll sometimes hear about two different metrics for expression in the literature. One is the number of reads per kilobase of transcript per million mapped reads, or RPKM. So you might have, for example, an RPKM of 1,000, which means that roughly one out of every thousand reads is mapping to each kilobase of a particular gene. It gives you a metric that's adjusted for the fact that longer genes will produce more reads. An alternative metric is fragments per kilobase per million, or FPKM. That's sometimes used when we're talking about paired end data, where you're considering that you're sequencing both ends of a fragment. So here we're talking about how many fragments we see for a particular gene, per 1,000 bases of the gene, per million fragments.

OK. So the essential idea, then, is to take the reads that we have -- the basket of reads -- align them to the genome, both to exons and to exon crossings, and determine, for a given gene, what isoforms we have and how they're being expressed. From here on in, I'm going to assume that we're talking about a single gene and its isoforms. And that's OK, because typically we can map reads uniquely to a gene. There are details, of course, when you have genes with paralogs that have identical sequences elsewhere in the genome, where this becomes more difficult.
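To make the RPKM and FPKM definitions concrete, here is a minimal sketch; the function names and example numbers are illustrative, not from the lecture.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count / (transcript_length_bp / 1e3) / (total_mapped_reads / 1e6)

def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Same computation with fragments (read pairs) in place of reads."""
    return fragment_count / (transcript_length_bp / 1e3) / (total_mapped_fragments / 1e6)

# 500 reads on a 2 kb transcript in a library of 10 million mapped reads:
# 500 / 2 / 10 = 25 RPKM.
print(rpkm(500, 2000, 10_000_000))  # 25.0
```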
So if we consider this once again: we're going to take our reads, map them to the genome, and look for all possible pairings of exons. What we would like to do is enumerate all of the isoforms that are possible given the data that we have. And we can use junction crossing reads and other methodologies to enumerate all the possible isoforms; a toy sketch of that enumeration appears at the end of this passage. But what we're going to assume is that we've enumerated all the isoforms, and we're going to number them 1 through n. So we have isoform 1, isoform 2, isoform 3, and so on, for a given gene. And what we want to compute for each isoform is its relative contribution to the read population that we see mapping to that gene.

In order to do that, we can use some constraints. So if I show you this picture, which suggests that we have possible splice events here for the event C in the middle -- if I told you that A, C, and E were very highly covered by reads, you might think that A, C, and E represented one isoform that was highly expressed. And so if we think about how to use our read coverage as a guide to determining isoform prevalence, we'll be in good shape. And really, that's the best evidence we have. We have two sources of evidence, right? We have our junction crossing reads, which tell us which exons are being spliced together; those help us both compute the set of possible isoforms and estimate their prevalence. The other thing we have is the reads that simply cover exons, and their relative prevalence can also help us compute the relative amounts of different isoforms.

So in order to do this, we can think about what reads tell us. And some reads will exclude certain isoforms.
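Here is the toy enumeration sketch referred to above. Everything in it is an illustrative assumption: real enumerators also use coverage, annotation, and exclusion evidence.

```python
from itertools import combinations

def candidate_isoforms(n_exons, junctions):
    """Enumerate exon subsets supported by observed splice junctions.

    `junctions` is a set of observed (exon_a, exon_b) junction crossing
    events, with exons numbered 0..n_exons-1.  A subset is kept only if
    every adjacent pair in it corresponds to an observed junction
    (single-exon subsets pass vacuously).
    """
    kept = []
    for size in range(1, n_exons + 1):
        for subset in combinations(range(n_exons), size):
            pairs = zip(subset, subset[1:])
            if all(p in junctions for p in pairs):
                kept.append(subset)
    return kept

# Junctions 0-1, 1-2, and 0-2 observed: both inclusion and skipping of exon 1
# are supported, so (0, 1, 2) and (0, 2) both survive.
print(candidate_isoforms(3, {(0, 1), (1, 2), (0, 2)}))
```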
So if we consider the reads that we have, we can think about reads that cross particular junctions as inclusion reads. For example, in this case, the top reads are indicating that the middle white exon is being included in a transcript, whereas the bottom reads are exclusion reads, indicating that the white exon in the middle is being spliced out.

So what we would like to do, then, is build a probabilistic model that takes into account what we know about what a read tells us. Because each read is a piece of evidence. And we're going to use those reads like detectives. We're going to go in and try to analyze all of the different reads we see for a gene, and use them to weight what we think is happening with the expression of the different isoforms in the pool of reads that we're observing. And in order to do so, we will have to build a function that describes the probability of seeing a read given the expression of a particular isoform. So the essential idea is this: for the next three slides, I want to build a model of the probability of seeing a read conditioned upon a particular isoform being expressed. All right?

So there are three cases to consider. One is that I can see a read that I know is incompatible with a given isoform. And therefore, the probability of seeing that read given the isoform is 0. And that's perfectly fine. This can happen either with a single ended read or with a paired end read. And it's a conditional probability: the probability of seeing read i given an isoform j can be 0.

Another possibility is that I have a transcript -- now recall that the transcript has been spliced, so what we're looking at is the entire sequence of the spliced isoform. And we see a read, and the read can land anywhere within that transcript. Let's assume, for the time being, we're looking at a single ended read.
Then the probability of seeing that read land in that transcript, at a particular base, is 1 over the length of the transcript. So this read is compatible with that transcript, and we can describe the probability in this fashion.

It's also possible for us to consider paired end reads. And if we have paired end reads, we can describe a probability function that has two essential components. The denominator is still the same: the likelihood of the read aligning at a particular position is going to be 1 over the length of the entire transcript. The numerator is different, though. We're going to compute the implied length of the paired end read and ask: what's the likelihood of seeing that? Recall that we don't know the exact length of the insert. When we're looking at paired end reads, we can only estimate how long the fragment is that we're sequencing. And so we're going to have a probabilistic interpretation of how long the piece of RNA is that we actually wound up sequencing the ends of. And that is placed in the numerator, which scales the 1 over l sub j. So this gives us a probability for seeing a read in a particular transcript that accounts for the fact that we have to map both ends to that transcript. OK?

So we have three possibilities that we've described: one, where a particular read is incompatible with an isoform, so the probability is 0; two, where we have a single end read mapping to an isoform, which gives simply 1 over the length of the isoform; and three, where we're mapping paired end reads to an isoform, which includes uncertainty about where the read maps, which is the 1 over l sub j, and also uncertainty about the length of the fragment itself, which is encoded in the F function, a distribution over fragment lengths.

OK. So once we have this structure, we can then estimate isoform expression.
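Here is a minimal sketch of the three cases just summarized. The Gaussian form for the fragment length distribution F, the 200-base mean, and all of the names are illustrative assumptions; the lecture only requires that F be some empirical distribution over insert sizes.

```python
import math

def fragment_length_pdf(length, mean=200.0, sd=20.0):
    """F: assumed Gaussian distribution over fragment (insert) lengths."""
    return math.exp(-0.5 * ((length - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def read_given_isoform(projection, isoform_length, paired=True):
    """P(read | isoform) for the three cases.

    `projection` gives the read's coordinates on the spliced transcript as
    (leftmost position, rightmost position), or None if the read cannot be
    placed on this isoform at all.
    """
    if projection is None:                  # case 1: incompatible read
        return 0.0
    if not paired:                          # case 2: single end, uniform over bases
        return 1.0 / isoform_length
    left, right = projection                # case 3: paired end
    implied_fragment = right - left         # implied insert length on this isoform
    return fragment_length_pdf(implied_fragment) / isoform_length

# Ends implying a 200-base insert are far more probable than ends implying
# a 400-base insert, which F heavily discounts.
print(read_given_isoform((100, 300), 1500))   # near the mode of F
print(read_given_isoform((100, 500), 1500))   # vanishingly small
```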
Now when we talked about ChIP-seq last time, we discussed the idea of estimating proportions. And the essential idea here is that if we want to compute the probability of a read given, in this case, a mixture of isoforms, that's simply going to be -- let's see, what variable did I use? Yeah -- the estimated concentration of each isoform times the probability of the read as seen in that isoform, summed over the isoforms:

P(R_i | psi) = sum over isoforms j of psi_j * P(R_i | isoform j)

So for an individual read, we can estimate its likelihood given this mixture. And then the product around the outside describes how to estimate the probability of the entire basket of reads that we see for that gene. And what we would like to do is pick psi, in this case, to maximize the likelihood of the observed reads. So what psi is going to give us is the fraction of each isoform that we see.

Are there any questions about the idea of isoform quantitation? Yes.

AUDIENCE: I'm a little lost. On the last full slide, you were describing these three cases for excluded, single, and paired reads. So are we computing the different probabilities for both ends landing in a transcript, or for just one, or --

PROFESSOR: It depends. The second and third cases depend upon whether we're analyzing single ended reads or paired end reads. And so we wouldn't use both of them at the same time. In other words, if you only have single ended data, you would use the second case that we showed. And if you had paired end data, you would use the third case that we showed.

AUDIENCE: OK.

PROFESSOR: Question, yes.

AUDIENCE: Sorry. I noticed, in the single end reads case -- could you explain the intuition behind that probability [INAUDIBLE]?

PROFESSOR: Sure. The intuition behind that probability is that we're asking -- so here is a transcript. And we're assuming -- what is the probability of a read, given that it came from this transcript? OK? What's the probability of observing a particular read?
And the probability of observing it aligning at a particular position is 1 over the length of this transcript. And so the probability actually includes the idea of alignment at a particular position in the transcript. OK? So obviously, the probability would be 1, if we assume it comes from here and don't consider position. But if we want to ask where it lines up in the transcript, it's going to be 1 over l sub j. OK?

AUDIENCE: So we assume that it's uniformly possible?

PROFESSOR: That it's uniformly possible, which gets us to a good point. I'm glad you asked that question. Sometimes we will have more reads at the three prime end of a transcript than the five prime end. Can anybody imagine why? Yes.

AUDIENCE: Because you're pulling on the poly-A tails.

PROFESSOR: Yeah. So this RNA actually was purified by the poly-A tail. We're pulling on this; we're purifying by it. And if there's breakage of these transcripts as they go through the biochemical processing, we can get shorter and shorter molecules. They all contain this bit, and the probability that they contain the whole transcript actually decreases. So oftentimes there is three prime bias in RNA-seq experiments one needs to be mindful of. But we're assuming, today, that there isn't such bias, and that it's equally probable that the read maps anyplace in this transcript. OK? Does that answer your question? Yes.

AUDIENCE: Sorry, I do have one more. Can you show us how that extends, then, to the paired end read, and where the probability distribution --

PROFESSOR: Right. So if we go to paired end reads, like this, this component is going to be where it aligns, right? And then the probability of the length of this entire molecule is what I had up there before, which is -- exactly, how did I do that?
So this is going to be -- this is this bit, which is the implied length of this. OK? So if I map the left and the right -- this component is where the left end maps. I take the left and the right ends of the read that I have from that transcript, and it has a particular length on this transcript. I'll call that length l sub j of R sub i. Now remember, that is not the same as the length in the genome. That's the length in this transcript as it is spliced. OK? F is going to be the probability distribution of the length of the fragments. So let's just say they're centered around 200 bases. If the implied length is exactly 200 bases, it's going to be quite likely. But imagine that when I map this fragment, the ends wound up being 400 bases apart. Then this distribution would tell us it's very unlikely that I would see a fragment that mapped here and mapped 400 bases up here, because my fragment length distribution, defined by F, is centered at 200 bases. So it's going to discount the probability of that. So this term -- the probability of the read given the transcript -- is the component of where it's aligning, times the likelihood that the implied fragment length agrees with what we think we have empirically. OK? Does that make sense? OK. Those are good questions.

OK. So given this framework, we can use EM or other machine learning frameworks to maximize psi, and to learn the fraction of expression of each different isoform from the observed data, given the functions that we have; a sketch of that procedure appears below.

And just to give you an idea, when this was done for myogenesis, a program called Cufflinks, which does this kind of isoform prevalence estimation, was able to identify a large number of transcripts. 70% of the reads were in previously annotated transcripts.
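Here is the EM sketch referred to above. It treats the per-read probabilities P(read | isoform) as precomputed inputs; the uniform initialization and the fixed iteration count are simplifying assumptions, and real tools such as Cufflinks add transcript length normalization and model selection on top of this core.

```python
def estimate_psi(read_probs, n_iters=100):
    """EM for isoform fractions psi.

    read_probs[i][j] = P(read i | isoform j), as in the earlier sketch.
    Returns psi, the estimated fraction of each isoform.
    """
    n_iso = len(read_probs[0])
    psi = [1.0 / n_iso] * n_iso                        # start uniform
    for _ in range(n_iters):
        counts = [0.0] * n_iso
        for row in read_probs:
            denom = sum(p * f for p, f in zip(row, psi))
            if denom == 0.0:
                continue                               # read fits no isoform
            for j in range(n_iso):
                counts[j] += psi[j] * row[j] / denom   # E-step: responsibilities
        total = sum(counts)
        psi = [c / total for c in counts]              # M-step: renormalize
    return psi

# Two isoforms: two reads unique to isoform 0, one unique to isoform 1,
# one compatible with both.  The shared read is split in proportion to psi,
# and the estimate converges to roughly [0.67, 0.33].
print(estimate_psi([[1e-3, 0.0], [1e-3, 0.0], [0.0, 1e-3], [1e-3, 1e-3]]))
```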
Cufflinks also found 643 new isoforms of genes in this single time series. I posted one of the papers that describes this technology on the Stellar site. But note that certain of the genes have light coverage. What we're seeing here is that for genes that are expressed in low copy numbers, it's obviously more difficult to get reads out of them. And I'm presuming, in this particular experiment -- although I can't recollect -- that the reason they don't see that many intronic reads is that they did poly-A purification.

OK. So we've talked about how to take our reads, build a probabilistic model, and estimate isoform prevalences. And we know how many reads are mapping to a given gene. The question now is whether or not we see differential expression of a gene in different conditions. So I'd like to turn to the analysis of differential expression, unless there are any final questions about the details of RNA-seq and isoform estimation. This is your chance to ask those hard questions. Yes.

AUDIENCE: OK. I have a really silly question. But can you explain, really quickly, what are isoforms?

PROFESSOR: What an isoform is?

AUDIENCE: Yeah.

PROFESSOR: Sure. An isoform is a particular splice variant of a gene. So a transcript of a gene that has a particular splicing pattern is called an isoform. Imagine we have three exons: one, two, and three. A transcript that has all three would be one isoform. Another variant that omits exon two would be a second isoform; just exons one and three would be an isoform. And each gene has a set of isoforms it exhibits. That depends upon how it's regulated, and whether any given splicing event is constitutive -- it always happens -- or whether it's regulated. And so in theory, a gene with n exons has how many potential isoforms? It's 2 to the n, because you can consider each exon as being included or omitted. All right. But that isn't typically the case -- in practice there are many fewer isoforms than that.
But in general, an isoform refers to a particular splice variant of a gene. Yes.

AUDIENCE: I just want to make sure I have everything correctly. When you're using single ended or paired end reads, you can get exclusion reads, right? So you can get that in both cases, whether or not you're --

PROFESSOR: Well, it depends. Once again, it's somewhat probabilistic where the reads actually hit. Because if all the reads only hit exons, and you didn't get any junctions, and none of your paired end reads crossed junctions, then you wouldn't actually have exclusion. All right?

AUDIENCE: But it's possible with both types of sequencing?

PROFESSOR: Yes, it's possible with both types of sequencing. In fact, oftentimes what people do is count junction crossing reads. If the reads in your sequencing are long enough, say, 100 base pairs at least, then a significant fraction of your reads will be exon crossing. And you'll be able to count the number of exon-exon junctions you have of different types. And that will give you an estimate of how much splicing is going on, and will help validate the kinds of conclusions that come out of programs like Cufflinks or MISO, which is another program from the Burge Laboratory that is used to estimate isoform prevalence. Yes.

AUDIENCE: So even given this information, you can't necessarily say which exons go with which exons, except for pairs, right? Because the reads, in general, aren't long enough to span an exon. And therefore we wouldn't know, for example, that a given transcript is exons one, five, and six. You could only know that exons five and six went together.

PROFESSOR: That is not strictly true if you have paired end reads and your fragments are long enough to span exons. But in general, you're correct.
And that's why modern sequencing technologies that are coming down the pike, which can do 25 kilobase reads, are so important for doing things just like that. Yes.

AUDIENCE: In your diagram of the read up there, are the boxes the actual genome or RNA sequence, and the line in between the artificial linker that you added when you prepped the library?

PROFESSOR: Ah, that's a good question. The question is: is this the linker, and are these the actual sequences? No. What I'm drawing here is that these are the bits that we actually get to see the sequence of. We sequence from both ends of a molecule. This is the part of the fragment that we haven't sequenced, because our reads aren't long enough. And so the entire fragment might be 300 bases long; if this is 100 and this is 100, then the unobserved part is 100 in the middle. OK? And the entire length of the molecule is called the insert length. And we get to choose how long these fragments are, up to a maximum size. Contemporary sequencers don't really like fragments over 1,000 bases, and the performance starts falling off when you get close to that number. So people typically operate in a more optimal range, with fragments that are a few hundred bases long.

Any other questions? OK.

So I wanted to briefly talk about hypothesis testing, because we're going to need it for determining when things are really differentially expressed. I'm just going to show you some data and ask you a few questions about it. Here are two different scatter plots of data. Well, actually, it's exactly the same data, but with two different fits to it. One fit uses two independent Gaussians fit to the data, from gene one and gene two. And the other fit uses two Gaussians that have a correlation structure between them.
And the question is whether the null hypothesis or the alternative hypothesis is more reasonable. Typically, when we say reasonable, we want to judge whether or not the alternative is significant. Significance typically asks: what's the chance that the data we saw occurred at random, given the null hypothesis? So what's the chance it was generated by the null hypothesis, versus the chance that it was generated by the alternative hypothesis? Now the problem is that alternative hypotheses typically are more complex. And a more complex model will always explain data better. So we need a principled way of asking: given that the alternative hypothesis is always going to do a better job, does it do such a better job that it can exclude the null hypothesis at a particular probability level? OK?

So here are two different models for these data. The null model, H0, is that they came from two independent Gaussians. The alternative model, H1, is that they came from two correlated Gaussians. And then we can ask whether or not H1 is sufficiently more likely to warrant our rejecting H0 and accepting H1. Now as I said, H1 is always going to fit the data better. So the probability of that collection of points, evaluated with the H1 model fit to the data, is always going to be superior. So we need a way to compare the probability of the data given H1 versus the probability of the data given H0 that allows us to judge the probability that the data arose at random under H0. In this particular case, the data support H1. And let's see why.

So this is a key idea here. How many people have heard of likelihood ratio statistics before? OK. About half the class. OK. So here's the idea. We're going to compute a test statistic. And the test statistic is going to be a function of the observed data.
It's 2 times the log of the probability of the observed data given H1, over the probability of the observed data given H0:

T = 2 log [ P(data | H1) / P(data | H0) ]

Now we know that the numerator is always going to have a higher value than the probability in the denominator. So the ratio is always going to be greater than 1, and the test statistic will always be greater than 0, since we're operating in the log domain. OK? The question is -- we know that H1 is always going to fit better, even when the data was generated from H0. But when is it sufficiently better for us to believe that H0 is not true and we should accept H1?

What we need is a distribution for this test statistic as it would occur if H0 were true. That distribution allows us to compute the probability that an observed value for the test statistic occurred, assuming that H0 is true. OK. And this depends upon the difference in the number of degrees of freedom between H1 and H0. How many degrees of freedom are there in H1, in this model up here? How many parameters do we get to pick?

AUDIENCE: Six.

PROFESSOR: Hmm?

AUDIENCE: Six.

PROFESSOR: Six?

AUDIENCE: Two means and four --

PROFESSOR: Two means and four covariance terms. And for H0?

AUDIENCE: Just four.

PROFESSOR: Four. So what's the difference in the number of degrees of freedom between H1 and H0? It's two. So the null distribution of the test statistic -- a chi-squared distribution -- is parametrized by the difference in the number of degrees of freedom.

And so what we see, then, is something that looks like this: a curve with the probability of the test statistic on the y-axis and the test statistic on the x-axis. As the test statistic gets larger and larger, the probability that it occurred with H0 being true gets smaller and smaller. So let us suppose that we took the data from our model that we observed, and we computed the test statistic at a particular value, called T observed.
So this is the actual value that we computed out of our likelihood ratio test. What we would like to ask is: what's the probability that our test statistic is greater than or equal to T observed, given that H0 is true? This means that we're going to consider the whole tail of this distribution, because we want to include the cases where the statistic is even greater than what we saw. And this gives us a way of computing the probability of seeing a test statistic at least this extreme if H0 were true. And that gives us our p-value. OK?

So this is a way of, in general, comparing two probabilistic models and analyzing the significance of adding extra degrees of freedom to the model. Typically, what we'll be doing in today's lecture is asking whether or not -- if we let the means change, for example, between two conditions -- we get a sufficient improvement in our ability to predict the data that our test statistic will allow us to reject the null hypothesis that the means are the same.

OK. I'm going to stop here and see if there are any questions at all about this.

AUDIENCE: Yes. Where did the degrees of freedom enter into the equation?

PROFESSOR: Where did the degrees of freedom enter into this? Great question. The chi-square tables that you look up are indexed by the number of degrees of freedom of difference. OK? And so whenever you compute a chi-squared p-value, you will compute it with the number of degrees of freedom of difference; there's a sketch of this computation just below.

Any other questions? OK.

So let's now turn to evaluating RNA-seq data once again. I'm going to describe a method called DESeq for determining differential expression. In our analysis, we're going to let i range over genes (or isoforms) and j range over experiments. There may be multiple experiments in the same condition that are replicates. And K sub ij is the number of counts observed for i and j. So that's the expression of gene or isoform i in experiment j.
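Before the normalization details, here is a minimal sketch of the likelihood ratio test just described. It assumes the two models' log-likelihoods have already been computed; scipy's chi-squared survival function plays the role of the chi-square table lookup.

```python
from scipy.stats import chi2

def lrt_pvalue(loglik_h1, loglik_h0, df_difference):
    """Likelihood ratio test: T = 2 * (logL(H1) - logL(H0)).

    Under H0, T is approximately chi-squared distributed with df equal
    to the difference in the number of free parameters (Wilks' theorem).
    """
    T = 2.0 * (loglik_h1 - loglik_h0)
    return chi2.sf(T, df_difference)   # tail probability P(T' >= T | H0)

# Example: H1 improves the log-likelihood by 5 while adding 2 parameters.
# T = 10, and chi2.sf(10, 2) = exp(-5), about 0.0067: reject H0 at the 1% level.
print(lrt_pvalue(-100.0, -105.0, 2))
```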
Now what we need to do, however, is normalize experiments against one another. The normalization factor s sub j is computed for a particular experiment, and it's used to normalize all of the values in that experiment. If all the experiments were completely identical -- the read depth was identical and everything was the same -- then all of the s sub j's would be 1. If you had an experiment that had exactly twice as many reads as the other experiments, its s sub j would be 2. So this scale factor is used to normalize things in our next slide, as we'll see.

And the essential idea is that we're going to take the median value of this ratio:

s_j = median over genes i of [ K_ij / (K_i1 * K_i2 * ... * K_im)^(1/m) ]

The reason that the denominator is a geometric mean is so that no one experiment dominates the average; they all have equal weight. The geometric mean is simply the product of the counts for a particular gene across all m experiments, taken to the 1/m power to bring it back to the scale of a single experiment. And that is the normalizing factor for the numerator, which is the number of counts for a particular gene. OK? So we're just doing median style normalization, where s sub j is a scale factor. Once again, if all the experiments were the same, s sub j would be 1. If one particular experiment had twice as many counts as another experiment, uniformly, its s sub j would be 2, just for that experiment.

Any questions about the scale factor? Yes.

AUDIENCE: Sorry, what is the term on the bottom -- in the denominator?

PROFESSOR: That's a normalizing term. That's the geometric mean of all the experiments put together. All right? Because it's the product over all m experiments, taking the m-th root makes it comparable to the counts of a single experiment.

Any other questions? Yes.
793 00:42:18,880 --> 00:42:21,350 AUDIENCE: Are we normalizing different experiments 794 00:42:21,350 --> 00:42:23,326 to each other or different replicates 795 00:42:23,326 --> 00:42:24,808 of a single experiment? 796 00:42:24,808 --> 00:42:27,050 PROFESSOR: In this particular case, each one of these 797 00:42:27,050 --> 00:42:28,840 is a different replicate. 798 00:42:28,840 --> 00:42:29,340 OK? 799 00:42:29,340 --> 00:42:32,500 So j is ranging over different replicates, 800 00:42:32,500 --> 00:42:34,550 not over conditions, right now. 801 00:42:34,550 --> 00:42:37,610 So each replicate, each experiment, 802 00:42:37,610 --> 00:42:39,710 gets its own normalizing factor. 803 00:42:39,710 --> 00:42:43,070 We'll see, in a moment, how to put those replicates together 804 00:42:43,070 --> 00:42:44,840 to build statistical strength. 805 00:42:44,840 --> 00:42:48,135 But since each replicate has its own read 806 00:42:48,135 --> 00:42:50,260 depth, we have to normalize each one independently. 807 00:42:52,780 --> 00:42:55,020 OK? 808 00:42:55,020 --> 00:43:03,200 So what we then do is we compute an expression for a condition. 809 00:43:03,200 --> 00:43:08,130 Now a condition, we're going to call p. 810 00:43:08,130 --> 00:43:23,220 And q sub ip is the normalized expression 811 00:43:23,220 --> 00:43:33,572 for gene slash isoform i in condition p. 812 00:43:36,350 --> 00:43:39,390 So a condition may have multiple replicates in it. 813 00:43:39,390 --> 00:43:41,520 So we're going to average, over all 814 00:43:41,520 --> 00:43:44,860 of the replicates, the normalized expression, 815 00:43:44,860 --> 00:43:45,890 as you can see here. 816 00:43:45,890 --> 00:43:47,890 So we're summing over all the replicates 817 00:43:47,890 --> 00:43:49,832 for a given condition. 818 00:43:49,832 --> 00:43:51,290 We're going to take each replicate, 819 00:43:51,290 --> 00:43:54,130 normalize it by the scale factor we just computed, 820 00:43:54,130 --> 00:44:02,100 and then compute the normalized expression for a gene 821 00:44:02,100 --> 00:44:05,510 or an isoform in that particular condition. 822 00:44:05,510 --> 00:44:09,750 Is that clear to everybody, what's going on here? 823 00:44:09,750 --> 00:44:14,800 Now I'm describing this to you because the key fact is 824 00:44:14,800 --> 00:44:18,290 the next line, which is that we compute 825 00:44:18,290 --> 00:44:22,150 the mean for a particular replicate 826 00:44:22,150 --> 00:44:25,910 by taking the normalized expression for a gene 827 00:44:25,910 --> 00:44:28,430 and then scaling it 828 00:44:28,430 --> 00:44:32,180 back up again for that particular replicate 829 00:44:32,180 --> 00:44:34,847 by multiplying by s sub j. 830 00:44:34,847 --> 00:44:36,430 But the most important thing is what's 831 00:44:36,430 --> 00:44:39,470 on the right hand side, which is that the variance is 832 00:44:39,470 --> 00:44:49,312 equal to the mean plus this function of the expression. 833 00:44:49,312 --> 00:44:50,770 And the reason this is important is 834 00:44:50,770 --> 00:44:53,980 that most other models for modeling expression data 835 00:44:53,980 --> 00:44:55,654 use Poisson models. 836 00:44:55,654 --> 00:44:57,820 And we've already seen, when we talked about library 837 00:44:57,820 --> 00:45:00,030 complexity, that Poisson models don't work that well 838 00:45:00,030 --> 00:45:01,440 all the time. 839 00:45:01,440 --> 00:45:04,330 So this is using a negative binomial function, once again.
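A small sketch of the per-condition quantities just defined, with made-up counts and size factors:

```python
# q_ip averages the size-factor-normalized counts over the replicates
# of condition p; mu_ij scales q_ip back up for replicate j.
import numpy as np

K = np.array([[100, 210],     # counts for two genes in two replicates of p
              [ 30,  64]])
s = np.array([1.0, 2.0])      # size factors s_j for those replicates

q_p = (K / s).mean(axis=1)    # q_ip: mean of K_ij / s_j over replicates in p
mu = np.outer(q_p, s)         # mu_ij = s_j * q_ip, per gene and replicate
print(q_p)                    # [102.5, 31.0]
print(mu)
```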
840 00:45:04,330 --> 00:45:06,330 We saw negative binomials before. 841 00:45:06,330 --> 00:45:09,260 We're modeling both the mean and the variance. 842 00:45:09,260 --> 00:45:13,630 And the variance is a function of the mean, 843 00:45:13,630 --> 00:45:16,290 with both a linear term and a non-linear term. 844 00:45:19,740 --> 00:45:22,550 So what's going to happen, then, is 845 00:45:22,550 --> 00:45:26,520 that we're going to use the negative binomial to compute 846 00:45:26,520 --> 00:45:31,060 the probability of observing the data in a given condition. 847 00:45:31,060 --> 00:45:34,267 And we can either combine conditions and ask, 848 00:45:34,267 --> 00:45:36,600 what's the probability of seeing the conditions combined 849 00:45:36,600 --> 00:45:41,850 with a single mean and variance, or we can separate them 850 00:45:41,850 --> 00:45:44,125 into a more complex H1 hypothesis and ask, 851 00:45:44,125 --> 00:45:46,500 what's the probability of seeing them with separate means 852 00:45:46,500 --> 00:45:49,190 and variances, and then do a test 853 00:45:49,190 --> 00:45:52,240 to see how significant the difference is to determine 854 00:45:52,240 --> 00:45:57,200 whether or not we can reject the hypothesis that the genes are 855 00:45:57,200 --> 00:45:59,280 expressed at the same level. 856 00:45:59,280 --> 00:46:02,290 And to give you an intuitive idea of what's going on, 857 00:46:02,290 --> 00:46:07,050 this is a plot from the paper showing the relationship 858 00:46:07,050 --> 00:46:10,930 between mean expression and variance. 859 00:46:10,930 --> 00:46:15,302 Recall, for Poisson, that variance is equal to mean. 860 00:46:15,302 --> 00:46:17,260 We only have one parameter to tune for Poisson, 861 00:46:17,260 --> 00:46:18,250 which is lambda. 862 00:46:18,250 --> 00:46:20,490 The purple line is Poisson. 863 00:46:20,490 --> 00:46:22,500 And you can check and see that the mean's 864 00:46:22,500 --> 00:46:24,660 equal to the variance in that case. 865 00:46:24,660 --> 00:46:29,930 What DESeq does is it fits the orange line to the observed 866 00:46:29,930 --> 00:46:31,690 data. 867 00:46:31,690 --> 00:46:33,366 Yes. 868 00:46:33,366 --> 00:46:37,310 AUDIENCE: I don't know where the v sub p comes from. 869 00:46:37,310 --> 00:46:38,296 Where is that, again? 870 00:46:38,296 --> 00:46:40,680 PROFESSOR: That's the function. 871 00:46:40,680 --> 00:46:44,470 V sub p, in that equation up there, 872 00:46:44,470 --> 00:46:46,350 is the function that we're fitting-- that's 873 00:46:46,350 --> 00:46:49,480 the solid orange line-- to the observed 874 00:46:49,480 --> 00:46:53,890 relationship between mean and variance in the data. 875 00:46:53,890 --> 00:46:54,810 OK? 876 00:46:54,810 --> 00:46:57,110 So DESeq fits that function. 877 00:47:00,720 --> 00:47:07,420 EdgeR is another technique that does not do this fit and instead uses 878 00:47:07,420 --> 00:47:11,460 the estimate shown by the dotted line, which isn't as good. 879 00:47:11,460 --> 00:47:13,710 So a lot of people use DESeq these days 880 00:47:13,710 --> 00:47:15,900 for doing differential expression analysis 881 00:47:15,900 --> 00:47:20,190 because it allows the variance to change 882 00:47:20,190 --> 00:47:22,060 as the mean increases. 883 00:47:22,060 --> 00:47:28,440 And this is the case for this type of count data. 884 00:47:28,440 --> 00:47:30,318 Before I go on though, I'll pause and see 885 00:47:30,318 --> 00:47:32,859 if there are any questions at all about what's going on here. 886 00:47:35,675 --> 00:47:36,175 Yes.
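Here is a hedged sketch of scoring counts under a negative binomial with a chosen mean and variance. The conversion to scipy's (n, p) parameterization assumes the variance exceeds the mean, and the quadratic variance term below is an arbitrary stand-in for the fitted function v.

```python
# Negative binomial log-probabilities for counts, given a mean/variance
# pair. scipy's nbinom uses (n, p); for mean m and variance v > m,
# p = m / v and n = m^2 / (v - m) recover exactly that mean and variance.
import numpy as np
from scipy.stats import nbinom

def nb_logpmf(k, mean, var):
    """Negative binomial log-probability with the given mean and variance."""
    p = mean / var
    n = mean**2 / (var - mean)
    return nbinom.logpmf(k, n, p)

mean = 100.0
var = mean + 0.02 * mean**2        # variance grows faster than the mean
print(nb_logpmf(np.array([80, 100, 140]), mean, var))
```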
887 00:47:36,175 --> 00:47:39,370 AUDIENCE: You said that mu sub p, then, is the [INAUDIBLE] 888 00:47:39,370 --> 00:47:40,795 to the data. 889 00:47:40,795 --> 00:47:41,704 I'm confused. 890 00:47:41,704 --> 00:47:44,120 How do we fit that function, or where does that come from? 891 00:47:44,120 --> 00:47:46,060 PROFESSOR: You mean the mu sub p? 892 00:47:46,060 --> 00:47:47,180 AUDIENCE: Yes. 893 00:47:47,180 --> 00:47:50,080 PROFESSOR: That function is fit. 894 00:47:50,080 --> 00:47:51,630 And the paper describes exactly how 895 00:47:51,630 --> 00:47:54,790 it's fit, which I posted on the Stellar site. 896 00:47:54,790 --> 00:47:58,770 But it's a nonlinear function of q, in this case. 897 00:48:04,630 --> 00:48:05,270 Good question. 898 00:48:05,270 --> 00:48:06,103 Any other questions? 899 00:48:10,160 --> 00:48:10,910 OK. 900 00:48:10,910 --> 00:48:13,850 So once again, we have two hypotheses, 901 00:48:13,850 --> 00:48:19,700 the null hypothesis, H0, that A and B are expressed identically, 902 00:48:19,700 --> 00:48:22,500 and H1, that A and B are differentially expressed. 903 00:48:22,500 --> 00:48:25,240 We can compute the number of degrees of freedom. 904 00:48:25,240 --> 00:48:27,690 And we can do a likelihood ratio test, if we'd like, 905 00:48:27,690 --> 00:48:31,000 to compute a p-value against H0. 906 00:48:31,000 --> 00:48:34,450 And our model in this case is the negative binomial model 907 00:48:34,450 --> 00:48:36,920 of the data, which fits the data better. 908 00:48:36,920 --> 00:48:39,349 And that's why DESeq does a better job 909 00:48:39,349 --> 00:48:40,390 than other methodologies. 910 00:48:40,390 --> 00:48:46,200 Because it provides a better approximation 911 00:48:46,200 --> 00:48:49,470 to the underlying noise. 912 00:48:49,470 --> 00:48:51,680 And the next slide shows what you get out 913 00:48:51,680 --> 00:48:57,310 of this kind of analysis, where the little red dots are 914 00:48:57,310 --> 00:49:00,160 the genes that have been called significant 915 00:49:00,160 --> 00:49:03,250 using Benjamini-Hochberg correction, which 916 00:49:03,250 --> 00:49:05,500 we talked about previously. 917 00:49:05,500 --> 00:49:16,750 And you can see how, as the mean increases, the log 2 918 00:49:16,750 --> 00:49:20,355 fold change required to be significant comes down. 919 00:49:23,882 --> 00:49:25,590 So oftentimes, you'll see plots like this 920 00:49:25,590 --> 00:49:30,575 in papers that describe how they actually computed what genes 921 00:49:30,575 --> 00:49:31,825 were differentially expressed. 922 00:49:34,520 --> 00:49:35,430 Any questions at all? 923 00:49:35,430 --> 00:49:36,668 Yes. 924 00:49:36,668 --> 00:49:40,660 AUDIENCE: Why is it that the significance is 925 00:49:40,660 --> 00:49:42,660 lower as the mean increases? 926 00:49:46,160 --> 00:49:50,685 PROFESSOR: Why the significance is lower? 927 00:49:50,685 --> 00:49:52,910 AUDIENCE: Or the threshold. 928 00:49:52,910 --> 00:49:55,390 PROFESSOR: Oh, because as you increase 929 00:49:55,390 --> 00:49:59,800 the number of observations, the law of large numbers 930 00:49:59,800 --> 00:50:01,840 is going to cause the sampling noise to get 931 00:50:01,840 --> 00:50:04,890 closer and closer to 0. 932 00:50:04,890 --> 00:50:09,047 And so you need less of a fold change difference 933 00:50:09,047 --> 00:50:11,380 to be significant as you get more and more observations. 934 00:50:16,590 --> 00:50:19,110 Any other questions? 935 00:50:19,110 --> 00:50:20,390 OK.
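A minimal sketch of the Benjamini-Hochberg procedure used to call the red points significant; this is the textbook step-up procedure on a plain array of p-values, not any particular package's implementation.

```python
# Benjamini-Hochberg: reject the k smallest p-values, where k is the
# largest rank whose sorted p-value stays under (rank / m) * FDR.
import numpy as np

def bh_reject(pvals, fdr=0.05):
    p = np.asarray(pvals)
    order = np.argsort(p)
    m = len(p)
    below = p[order] <= (np.arange(1, m + 1) / m) * fdr
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True     # reject all ranks up to k
    return reject

print(bh_reject([0.001, 0.009, 0.04, 0.20, 0.7]))
# [ True  True False False False]
```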
936 00:50:20,390 --> 00:50:29,120 So now we're going to delve into one other area. 937 00:50:29,120 --> 00:50:34,131 How many people have done hypergeometric tests before? 938 00:50:34,131 --> 00:50:34,630 OK. 939 00:50:34,630 --> 00:50:38,380 So we're going to talk about hypergeometric tests. 940 00:50:38,380 --> 00:50:50,365 So imagine that we have a universe, for simplicity, 941 00:50:50,365 --> 00:50:52,130 of 1,000 genes. 942 00:50:52,130 --> 00:50:52,630 OK? 943 00:50:52,630 --> 00:50:54,640 So we have this universe. 944 00:50:54,640 --> 00:51:01,230 And we have B, a set of genes, of which there are 30. 945 00:51:01,230 --> 00:51:06,010 And there's another set, A, of which there are 20. 946 00:51:06,010 --> 00:51:10,440 And the overlap between these two is a total of three genes. 947 00:51:10,440 --> 00:51:14,130 So it might be that A is the set of genes 948 00:51:14,130 --> 00:51:15,800 that are differentially expressed 949 00:51:15,800 --> 00:51:17,550 between two conditions. 950 00:51:17,550 --> 00:51:19,610 B is the set of genes that, you happen to know, 951 00:51:19,610 --> 00:51:21,290 have a particular annotation. 952 00:51:21,290 --> 00:51:25,380 For example, they're involved in stress response in the cell. 953 00:51:25,380 --> 00:51:27,260 And you'd like to know whether or not 954 00:51:27,260 --> 00:51:30,400 the genes that are differentially expressed 955 00:51:30,400 --> 00:51:34,170 have a significant component of stress response related genes 956 00:51:34,170 --> 00:51:37,170 or whether or not this occurred at random. 957 00:51:37,170 --> 00:51:39,760 OK? 958 00:51:39,760 --> 00:51:42,250 So we need to compute that. 959 00:51:44,900 --> 00:51:48,780 So how many ways could we choose B? 960 00:51:48,780 --> 00:51:57,020 Well, to set up notation-- this is n1, this is n2, this is big N, 961 00:51:57,020 --> 00:51:58,220 and this is k. 962 00:51:58,220 --> 00:52:00,090 All right? 963 00:52:00,090 --> 00:52:06,620 So the number of ways I can choose B is big N choose n2. 964 00:52:06,620 --> 00:52:09,501 That's the number of ways I can choose B. Is everybody with me 965 00:52:09,501 --> 00:52:10,000 on that? 966 00:52:12,510 --> 00:52:14,900 Yeah? 967 00:52:14,900 --> 00:52:15,400 OK. 968 00:52:17,980 --> 00:52:22,320 How many ways can I choose three elements out 969 00:52:22,320 --> 00:52:26,820 of A, these three that are going to overlap? 970 00:52:26,820 --> 00:52:31,650 Well, that's going to be n1 choose k. 971 00:52:31,650 --> 00:52:37,590 So that's how many ways I can choose these three elements. 972 00:52:37,590 --> 00:52:40,970 And then, how many ways could I choose the other elements of B? 973 00:52:40,970 --> 00:52:44,830 So once again, I'm figuring out how I can choose B. Well, 974 00:52:44,830 --> 00:52:46,230 how could I choose the rest of B? 975 00:52:46,230 --> 00:52:48,860 Well, how many elements do I have to choose, of B, here? 976 00:52:48,860 --> 00:52:52,286 Well, B has n2 elements. 977 00:52:52,286 --> 00:52:54,161 But I've already chosen k of them. 978 00:52:54,161 --> 00:52:54,660 Right? 979 00:52:57,170 --> 00:52:58,820 All right. 980 00:52:58,820 --> 00:53:01,400 Sorry, it's the other way around. 981 00:53:01,400 --> 00:53:07,550 The universe I can pick from is 1,000, 982 00:53:07,550 --> 00:53:15,140 which is all the elements, minus the elements of A that I've 983 00:53:15,140 --> 00:53:17,380 already chosen from to get those three.
984 00:53:19,960 --> 00:53:27,970 And then I need to pick the 27 things that don't overlap 985 00:53:27,970 --> 00:53:31,714 with A. So the 27 things that don't overlap with A 986 00:53:31,714 --> 00:53:34,420 would be n2 minus k. 987 00:53:38,120 --> 00:53:39,730 So this is the number of ways to choose 988 00:53:39,730 --> 00:53:43,620 B given this set of constraints. 989 00:53:43,620 --> 00:53:47,580 This is the number of ways to choose B given no constraints. 990 00:53:47,580 --> 00:53:50,910 So the probability that I have overlap 991 00:53:50,910 --> 00:53:55,530 of exactly k is equal to this ratio, which is 992 00:53:55,530 --> 00:53:57,620 the number of ways given that I have an overlap of k, 993 00:53:57,620 --> 00:53:59,410 divided by the number of ways 994 00:53:59,410 --> 00:54:00,600 with no constraints. 995 00:54:03,360 --> 00:54:04,410 All right? 996 00:54:04,410 --> 00:54:06,870 And typically, what I want to ask is, 997 00:54:06,870 --> 00:54:13,400 what is the probability that my observed overlap 998 00:54:13,400 --> 00:54:15,450 is greater than or equal to k. 999 00:54:15,450 --> 00:54:19,300 So in this case, the overlap would be three. 1000 00:54:19,300 --> 00:54:21,184 But I also would need to consider the fact 1001 00:54:21,184 --> 00:54:23,100 that I might have four, or five, or six, which 1002 00:54:23,100 --> 00:54:26,240 would be even more unlikely, but still significant. 1003 00:54:26,240 --> 00:54:28,870 So if you look at the exact computation, 1004 00:54:28,870 --> 00:54:31,760 the probability of three here is 0.017. 1005 00:54:31,760 --> 00:54:36,222 And the probability that I have three or more is 0.02. 1006 00:54:36,222 --> 00:54:37,680 So that's still pretty significant. 1007 00:54:37,680 --> 00:54:41,810 Unlikely that would occur by chance. 1008 00:54:41,810 --> 00:54:43,720 Right? 1009 00:54:43,720 --> 00:54:48,130 That I have three or more genes overlapping in this situation 1010 00:54:48,130 --> 00:54:53,682 could only happen two out of 100 times. 1011 00:54:53,682 --> 00:54:55,640 Does everybody understand what's going on here? 1012 00:55:00,250 --> 00:55:01,330 Any questions at all? 1013 00:55:01,330 --> 00:55:04,980 So you're all now hypergeometric whizzes, right? 1014 00:55:04,980 --> 00:55:06,760 All right? 1015 00:55:06,760 --> 00:55:09,010 Fantastic. 1016 00:55:09,010 --> 00:55:11,894 OK. 1017 00:55:11,894 --> 00:55:14,060 Now we're going to turn to a final kind of analysis. 1018 00:55:14,060 --> 00:55:15,610 How many people have heard of principal component analysis 1019 00:55:15,610 --> 00:55:16,110 before? 1020 00:55:16,110 --> 00:55:20,226 How many people know how to do principal component analysis? 1021 00:55:20,226 --> 00:55:21,280 A few. 1022 00:55:21,280 --> 00:55:22,000 OK. 1023 00:55:22,000 --> 00:55:23,030 Great. 1024 00:55:23,030 --> 00:55:23,530 Yes. 1025 00:55:23,530 --> 00:55:26,506 AUDIENCE: Sorry, could you just briefly 1026 00:55:26,506 --> 00:55:31,962 mention, again, where exactly do we use the hypergeometric test? 1027 00:55:31,962 --> 00:55:34,442 What kinds of questions are we asking when we do that? 1028 00:55:34,442 --> 00:55:36,441 PROFESSOR: Typically, they're overlap questions. 1029 00:55:36,441 --> 00:55:41,600 So you're asking-- you have a universe of objects, right-- 1030 00:55:41,600 --> 00:55:43,690 like, in this case, genes. 1031 00:55:43,690 --> 00:55:46,749 And you have a subset of 20 and a subset of 30. 1032 00:55:46,749 --> 00:55:49,040 Let's say these are the differentially expressed genes.
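For concreteness, the same overlap computation can be checked with scipy's hypergeometric distribution; the numbers follow the example above.

```python
# Overlap test for the worked example: N = 1,000 genes in the universe,
# |A| = 20, |B| = 30, observed overlap k = 3.
from scipy.stats import hypergeom

N, n1, n2, k = 1000, 20, 30, 3
rv = hypergeom(N, n1, n2)    # population N, n1 "marked" genes, n2 drawn
print(rv.pmf(k))             # P(overlap == 3), about 0.017
print(rv.sf(k - 1))          # P(overlap >= 3), about 0.02
```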
1033 00:55:49,040 --> 00:55:51,710 These are genes in the stress response pathway. 1034 00:55:51,710 --> 00:55:53,695 They overlap by three genes. 1035 00:55:53,695 --> 00:55:57,776 Does that actually occur at random or not? 1036 00:55:57,776 --> 00:55:59,270 All right? 1037 00:55:59,270 --> 00:56:05,330 If I told you that there are a much smaller number of genes 1038 00:56:05,330 --> 00:56:08,050 and the stress response genes were very much larger, 1039 00:56:08,050 --> 00:56:11,460 it could be much easier for that overlap to occur at random. 1040 00:56:15,610 --> 00:56:16,260 Good question. 1041 00:56:16,260 --> 00:56:17,093 Any other questions? 1042 00:56:20,450 --> 00:56:20,950 OK. 1043 00:56:20,950 --> 00:56:23,980 So the next part of lecture is entitled 1044 00:56:23,980 --> 00:56:26,260 "multivariate Gaussians are your friends." 1045 00:56:26,260 --> 00:56:26,760 OK? 1046 00:56:26,760 --> 00:56:27,750 They are friendly. 1047 00:56:27,750 --> 00:56:28,970 They're like a puppy dog. 1048 00:56:28,970 --> 00:56:32,670 They are just wonderful to play with and very friendly. 1049 00:56:32,670 --> 00:56:35,620 And the reason most people get a little turned off by them 1050 00:56:35,620 --> 00:56:38,430 is because they get this-- the first thing they're shown 1051 00:56:38,430 --> 00:56:40,950 is this very hairy looking exponential 1052 00:56:40,950 --> 00:56:43,020 which describes what they are. 1053 00:56:43,020 --> 00:56:45,980 And so I'm going to shy away from complicated looking 1054 00:56:45,980 --> 00:56:48,889 exponentials and give you the puppy dog, my favorite way 1055 00:56:48,889 --> 00:56:50,430 of looking at multivariate Gaussians. 1056 00:56:50,430 --> 00:56:50,929 OK? 1057 00:56:50,929 --> 00:56:54,967 Which, I think, is a great way to look at them. 1058 00:56:54,967 --> 00:56:57,300 And the reason we need to look at multivariate Gaussians 1059 00:56:57,300 --> 00:57:00,389 is that they help us understand what 1060 00:57:00,389 --> 00:57:02,680 is going on with principal component analysis in a very 1061 00:57:02,680 --> 00:57:03,921 straightforward way. 1062 00:57:03,921 --> 00:57:06,170 And the reason that we want to use principal component 1063 00:57:06,170 --> 00:57:09,310 analysis is that we're going to be 1064 00:57:09,310 --> 00:57:12,969 able to reveal hidden factors and structures in our data. 1065 00:57:12,969 --> 00:57:14,385 And they're also going to allow us 1066 00:57:14,385 --> 00:57:17,710 to reduce the dimensionality of the data. 1067 00:57:17,710 --> 00:57:20,910 And we'll see why in a moment. 1068 00:57:20,910 --> 00:57:27,420 But here is the friendly version of multivariate Gaussians. 1069 00:57:27,420 --> 00:57:34,490 And let me describe to you why I think this is so friendly. 1070 00:57:34,490 --> 00:57:38,830 So we're all familiar with unidimensional Gaussians 1071 00:57:38,830 --> 00:57:40,320 like this. 1072 00:57:40,320 --> 00:57:41,170 Centered at zero. 1073 00:57:43,800 --> 00:57:46,840 They have variance 1. 1074 00:57:46,840 --> 00:57:50,260 Just very friendly univariate Gaussians, right? 1075 00:57:50,260 --> 00:57:52,085 Everybody's familiar with those? 1076 00:57:52,085 --> 00:57:53,785 Normal distributions? 1077 00:57:53,785 --> 00:57:55,450 So let's suppose that we just take 1078 00:57:55,450 --> 00:57:58,710 a whole collection of those. 1079 00:57:58,710 --> 00:58:02,970 And we say that we have a vector z that 1080 00:58:02,970 --> 00:58:07,240 is sampled from a collection of univariate Gaussians. 
1081 00:58:07,240 --> 00:58:09,060 And it can be as long as we like. 1082 00:58:09,060 --> 00:58:09,560 OK? 1083 00:58:09,560 --> 00:58:12,520 But they're all sampled from the same distribution. 1084 00:58:12,520 --> 00:58:15,240 And what we're going to say is that our multivariate Gaussian, 1085 00:58:15,240 --> 00:58:22,910 x, is going to be a matrix times z plus a mean. 1086 00:58:22,910 --> 00:58:24,710 And so what this matrix is going to do 1087 00:58:24,710 --> 00:58:28,960 is it's going to take all of our univariate Gaussians 1088 00:58:28,960 --> 00:58:33,750 and combine them to produce our multivariate Gaussian. 1089 00:58:33,750 --> 00:58:35,150 All right? 1090 00:58:35,150 --> 00:58:38,430 So the structure of this matrix will 1091 00:58:38,430 --> 00:58:41,840 describe how these univariate Gaussians are 1092 00:58:41,840 --> 00:58:46,527 being mixed together to produce this multivariate distribution. 1093 00:58:46,527 --> 00:58:48,110 And you can imagine various structures 1094 00:58:48,110 --> 00:58:51,970 for this matrix A. Right? 1095 00:58:51,970 --> 00:58:58,990 And the covariance matrix, sigma, 1096 00:58:58,990 --> 00:59:05,040 which describes the structure of this multivariate Gaussian, 1097 00:59:05,040 --> 00:59:13,890 is shown on this slide to be equal to A A transpose. 1098 00:59:13,890 --> 00:59:20,100 And thus, if we knew this matrix A, which we may not know, 1099 00:59:20,100 --> 00:59:25,030 we'd be able to compute the covariance matrix directly. 1100 00:59:25,030 --> 00:59:27,240 OK? 1101 00:59:27,240 --> 00:59:30,570 Let me take that one more time. 1102 00:59:30,570 --> 00:59:32,750 We take a bunch of univariate Gaussians, 1103 00:59:32,750 --> 00:59:34,700 make a vector z out of them. 1104 00:59:34,700 --> 00:59:38,110 And just for clarity, right, we're 1105 00:59:38,110 --> 00:59:43,450 going to talk about matrices and vectors as rows across columns. 1106 00:59:43,450 --> 00:59:46,750 So this is n by 1. 1107 00:59:46,750 --> 00:59:48,610 This is n by n. 1108 00:59:48,610 --> 00:59:50,460 This is n by 1. 1109 00:59:50,460 --> 00:59:52,000 And this is n by 1. 1110 00:59:52,000 --> 00:59:52,940 OK? 1111 00:59:52,940 --> 00:59:55,790 That's the dimensionality of these various objects 1112 00:59:55,790 --> 00:59:58,620 we're dealing with here. 1113 00:59:58,620 --> 01:00:03,610 So we get this vector of univariate Gaussians. 1114 01:00:03,610 --> 01:00:06,670 We apply this matrix A to combine them together. 1115 01:00:06,670 --> 01:00:09,980 We get out our multivariate Gaussian offset by some mean. 1116 01:00:12,500 --> 01:00:14,090 Is everybody happy with that so far? 1117 01:00:16,441 --> 01:00:16,940 Yes? 1118 01:00:16,940 --> 01:00:19,760 No? 1119 01:00:19,760 --> 01:00:22,600 You're suspending disbelief for the next slide. 1120 01:00:22,600 --> 01:00:27,670 Is that-- OK. 1121 01:00:27,670 --> 01:00:36,020 Well, here's the next thing I'd like to say, 1122 01:00:36,020 --> 01:00:47,670 is that the variance of a vector-- we'll call it 1123 01:00:47,670 --> 01:00:58,320 v-- transposed, times x, which is a random variable, 1124 01:00:58,320 --> 01:01:01,070 is going to be equal to-- x is derived 1125 01:01:01,070 --> 01:01:08,130 from this distribution-- v transpose sigma 1126 01:01:08,130 --> 01:01:15,914 v. And the demonstration of that is on the top of this page.
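A short sketch of this construction with an arbitrary choice of A and mu: mixing independent standard normals through A yields samples whose empirical covariance approaches A A transpose.

```python
# Build a multivariate Gaussian as x = A z + mu from i.i.d. standard
# normals z, and check the covariance against A A^T.
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[2.0, 0.0],
              [1.0, 0.5]])
mu = np.array([1.0, -1.0])

z = rng.standard_normal((2, 100_000))   # each column is one draw of z
x = A @ z + mu[:, None]                 # x = A z + mu, column by column

print(A @ A.T)                          # theoretical covariance
print(np.cov(x))                        # empirical covariance, close to it
```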
1127 01:01:18,810 --> 01:01:34,650 So the variance of this vector-- sorry, the projection 1128 01:01:34,650 --> 01:01:37,230 of this random variable onto this vector-- 1129 01:01:37,230 --> 01:01:43,350 is going to give you the variance in that direction 1130 01:01:43,350 --> 01:01:47,990 as that product. 1131 01:01:47,990 --> 01:01:55,367 So what we would like to do is this. 1132 01:01:55,367 --> 01:02:08,065 We would like to find v sub i, which are vectors, 1133 01:02:08,065 --> 01:02:25,360 to maximize the variance of v sub i transpose x 1134 01:02:25,360 --> 01:02:31,960 such that v sub i transpose v sub i is equal to 1. 1135 01:02:31,960 --> 01:02:35,590 In other words, they're unit length vectors. 1136 01:02:35,590 --> 01:02:42,170 So these are going to be called the eigenvectors. 1137 01:02:42,170 --> 01:02:53,052 And if we think about the structure that we desire, 1138 01:02:53,052 --> 01:02:57,030 what we'll find is that they satisfy the constraint 1139 01:02:57,030 --> 01:03:00,540 that the covariance matrix times an eigenvector 1140 01:03:00,540 --> 01:03:03,740 is equal to the eigenvalue associated 1141 01:03:03,740 --> 01:03:06,420 with that vector times the vector itself. 1142 01:03:09,680 --> 01:03:13,050 And with a little manipulation-- if we multiply both sides by v 1143 01:03:13,050 --> 01:03:19,535 sub i transpose, we get v sub i transpose sigma v sub i 1144 01:03:19,535 --> 01:03:24,420 equals lambda sub i squared v sub i transpose v sub i-- 1145 01:03:24,420 --> 01:03:27,120 and since v sub i transpose v sub i 1146 01:03:27,120 --> 01:03:30,150 is equal to 1, v sub i transpose sigma v sub i 1147 01:03:30,150 --> 01:03:33,990 is equal to lambda sub i squared. 1148 01:03:33,990 --> 01:03:40,680 This, we saw up above, is equal to the variance 1149 01:03:40,680 --> 01:03:44,240 when x is projected in the direction of v. 1150 01:03:44,240 --> 01:03:48,620 And so lambda sub i squared is simply the variance associated 1151 01:03:48,620 --> 01:03:49,586 with that direction. 1152 01:03:53,480 --> 01:03:59,590 So the question then becomes, how do we find these things? 1153 01:03:59,590 --> 01:04:03,250 And how do we discover these magic eigenvectors 1154 01:04:03,250 --> 01:04:11,660 that are directions in which this multivariate Gaussian has 1155 01:04:11,660 --> 01:04:12,910 its variance maximized? 1156 01:04:15,740 --> 01:04:25,120 And we can do this by singular value decomposition. 1157 01:04:25,120 --> 01:04:29,340 So we can compute this covariance matrix. 1158 01:04:29,340 --> 01:04:34,270 So we compute sigma from the data. 1159 01:04:37,100 --> 01:04:49,020 And then we do a singular value decomposition 1160 01:04:49,020 --> 01:04:54,665 such that sigma is equal to U S U transpose. 1161 01:04:57,530 --> 01:04:59,840 And that's what the singular value decomposition does 1162 01:04:59,840 --> 01:05:02,250 for us: it decomposes the sigma 1163 01:05:02,250 --> 01:05:04,960 matrix into these components, where 1164 01:05:04,960 --> 01:05:16,460 S is a diagonal matrix that contains the eigenvalues 1165 01:05:16,460 --> 01:05:22,180 and the columns of U 1166 01:05:22,180 --> 01:05:24,225 are the eigenvectors. 1167 01:05:31,610 --> 01:05:33,460 So in doing a singular value decomposition, 1168 01:05:33,460 --> 01:05:39,920 we get the eigenvalues and the eigenvectors. 1169 01:05:39,920 --> 01:05:43,400 The other thing, you recall, was that sigma 1170 01:05:43,400 --> 01:05:46,860 was equal to A A transpose when we started off.
1171 01:05:46,860 --> 01:05:49,830 A was the matrix that we used to make our multivariate Gaussian 1172 01:05:49,830 --> 01:05:52,440 out of our univariate Gaussians. 1173 01:05:52,440 --> 01:05:56,680 And thus, what we can observe is that our multivariate Gaussian 1174 01:05:56,680 --> 01:06:07,690 x is equal to U S to the 1/2 times z plus a mean. 1175 01:06:07,690 --> 01:06:12,050 So here is what's going on when we make a multivariate Gaussian: 1176 01:06:12,050 --> 01:06:15,020 we're taking a bunch of univariate Gaussians, 1177 01:06:15,020 --> 01:06:19,400 we're scaling them, and we're rotating them. 1178 01:06:19,400 --> 01:06:20,660 OK? 1179 01:06:20,660 --> 01:06:22,350 And that makes a multivariate Gaussian. 1180 01:06:22,350 --> 01:06:25,350 And then we offset this whole thing by a mean. 1181 01:06:25,350 --> 01:06:29,440 Because the rotations happen around the origin. 1182 01:06:29,440 --> 01:06:32,180 So the way I think about multivariate Gaussians 1183 01:06:32,180 --> 01:06:35,860 is that it is a scaling and a rotation 1184 01:06:35,860 --> 01:06:37,900 of univariate Gaussians. 1185 01:06:37,900 --> 01:06:40,060 And implicit in that scaling and rotation 1186 01:06:40,060 --> 01:06:43,550 is the discovery of the major directions of variance 1187 01:06:43,550 --> 01:06:46,100 in the underlying data as represented 1188 01:06:46,100 --> 01:06:48,070 by the eigenvectors. 1189 01:06:48,070 --> 01:06:51,200 And the eigenvalues tell you how much of the variance 1190 01:06:51,200 --> 01:06:53,763 is accounted for in each one of those dimensions. 1191 01:06:56,870 --> 01:06:59,200 Are there any questions about that? 1192 01:06:59,200 --> 01:07:00,610 I hope there are. 1193 01:07:00,610 --> 01:07:01,860 Yes. 1194 01:07:01,860 --> 01:07:04,260 AUDIENCE: How do you compute sigma from the data? 1195 01:07:04,260 --> 01:07:06,884 Is it some magical process? 1196 01:07:06,884 --> 01:07:09,080 PROFESSOR: The sigma of the singular value decomposition? 1197 01:07:09,080 --> 01:07:09,621 AUDIENCE: No. 1198 01:07:09,621 --> 01:07:11,609 So-- 1199 01:07:11,609 --> 01:07:14,094 PROFESSOR: How do you compute sigma from the data? 1200 01:07:14,094 --> 01:07:15,886 AUDIENCE: Yeah, the first step. 1201 01:07:15,886 --> 01:07:24,330 PROFESSOR: That is shown in equation seven. 1202 01:07:31,340 --> 01:07:34,950 So you can compute the means. 1203 01:07:34,950 --> 01:07:38,567 And you know you have x, which are observed values. 1204 01:07:38,567 --> 01:07:39,900 So you compute that expectation. 1205 01:07:39,900 --> 01:07:41,851 And that is sigma. 1206 01:07:41,851 --> 01:07:42,350 OK? 1207 01:07:51,017 --> 01:07:51,600 Good question. 1208 01:07:51,600 --> 01:07:52,433 Any other questions? 1209 01:07:57,260 --> 01:07:57,760 OK. 1210 01:07:57,760 --> 01:08:01,180 So we have these eigenvectors and eigenvalues, 1211 01:08:01,180 --> 01:08:06,390 which represent the directions of maximum variance 1212 01:08:06,390 --> 01:08:08,570 in the underlying data. 1213 01:08:08,570 --> 01:08:15,190 And we can use these to organize data by projecting observations 1214 01:08:15,190 --> 01:08:19,630 onto these eigenvectors-- they're sometimes 1215 01:08:19,630 --> 01:08:24,250 called principal components-- which define dimensions 1216 01:08:24,250 --> 01:08:29,319 of variability that help us organize our underlying data. 1217 01:08:29,319 --> 01:08:31,089 We'll come back to that in a moment. 1218 01:08:36,000 --> 01:08:37,399 OK. 1219 01:08:37,399 --> 01:08:41,664 Any other questions about principal component analysis?
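Here is a sketch, on synthetic data, of the two steps just discussed: estimating sigma as the average of the centered outer products (the expectation the question was about), and then decomposing it into eigenvectors and eigenvalues. For a symmetric sigma, numpy's eigh gives exactly the U S U transpose factorization described here.

```python
# Estimate the covariance from samples, then eigendecompose it.
import numpy as np

rng = np.random.default_rng(1)
A_true = np.array([[2.0, 0.0],
                   [1.0, 0.5]])
X = rng.standard_normal((50_000, 2)) @ A_true.T   # rows are samples of x

mu = X.mean(axis=0)
centered = X - mu
Sigma = centered.T @ centered / len(X)   # E[(x - mu)(x - mu)^T] from data

S, U = np.linalg.eigh(Sigma)             # eigenvalues ascending, columns of U
v = U[:, -1]                             # direction of maximum variance
print(S[-1])                             # the variance along that direction
print(np.allclose(Sigma @ v, S[-1] * v))          # Sigma v = lambda v holds
print(np.allclose(U @ np.diag(S) @ U.T, Sigma))   # Sigma = U S U^T
```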
1220 01:08:41,664 --> 01:08:42,164 Yes. 1221 01:08:42,164 --> 01:08:44,370 AUDIENCE: [INAUDIBLE] e was expectation 1222 01:08:44,370 --> 01:08:46,729 when you were calculating the second? 1223 01:08:46,729 --> 01:08:50,191 PROFESSOR: Yes, e is the expectation, that's correct. 1224 01:08:50,191 --> 01:08:51,649 AUDIENCE: And also what that means. 1225 01:08:51,649 --> 01:08:57,689 PROFESSOR: It's the average, the expected value. 1226 01:08:57,689 --> 01:09:02,359 So in the case of computing sigma, 1227 01:09:02,359 --> 01:09:08,760 you would compute the expected value of that inner equation 1228 01:09:08,760 --> 01:09:10,792 across all the data points that you see. 1229 01:09:14,912 --> 01:09:16,620 So you'd sum up all the values and divide 1230 01:09:16,620 --> 01:09:18,161 by the number of things that you had. 1231 01:09:20,387 --> 01:09:21,220 Any other questions? 1232 01:09:24,030 --> 01:09:24,530 OK. 1233 01:09:27,220 --> 01:09:32,630 So just for calibration for next year, how many people 1234 01:09:32,630 --> 01:09:36,350 think they've got a general idea of what principal component 1235 01:09:36,350 --> 01:09:39,131 analysis is-- a general idea? 1236 01:09:39,131 --> 01:09:39,630 Uh-oh. 1237 01:09:42,439 --> 01:09:44,740 How many people who thought it was really interesting 1238 01:09:44,740 --> 01:09:48,930 were sort of completely baffled about halfway through? 1239 01:09:48,930 --> 01:09:51,000 OK. 1240 01:09:51,000 --> 01:09:51,779 All right. 1241 01:09:51,779 --> 01:09:53,760 Well, I think that recitation can 1242 01:09:53,760 --> 01:09:55,550 help with some of those questions. 1243 01:09:55,550 --> 01:09:58,840 But if anybody has a question they'd like to ask now-- No? 1244 01:09:58,840 --> 01:10:00,316 It's that far gone? 1245 01:10:05,570 --> 01:10:07,780 I mean, the thing with this sort of analysis 1246 01:10:07,780 --> 01:10:11,384 is that if your matrix algebra is a little rusty, then, 1247 01:10:11,384 --> 01:10:13,300 when you start looking at equations like that, 1248 01:10:13,300 --> 01:10:17,000 you can get a little lost sometimes. 1249 01:10:19,940 --> 01:10:20,440 All right. 1250 01:10:20,440 --> 01:10:23,650 Well, let's turn, then-- if there aren't any brave souls 1251 01:10:23,650 --> 01:10:27,220 who wish to ask a question, we'll 1252 01:10:27,220 --> 01:10:30,935 turn to single cell RNA-seq analysis. 1253 01:10:33,700 --> 01:10:37,500 So I'm a firm believer that single cell analysis 1254 01:10:37,500 --> 01:10:40,670 of biological samples is the next big frontier. 1255 01:10:40,670 --> 01:10:46,915 And it's being made possible through devices like this. 1256 01:10:46,915 --> 01:10:52,830 This is a Fluidigm C1 chip, which has 96 different reaction 1257 01:10:52,830 --> 01:10:55,070 wells, which allows you, in each well, 1258 01:10:55,070 --> 01:10:57,830 to process a single cell independently. 1259 01:10:57,830 --> 01:11:02,060 And the little channels are ways to get reagents into those wells 1260 01:11:02,060 --> 01:11:12,400 to do things like produce RNA-seq-ready materials. 1261 01:11:12,400 --> 01:11:16,340 And when you do single cell analysis, 1262 01:11:16,340 --> 01:11:21,770 you can take apart what's happening in a population. 1263 01:11:21,770 --> 01:11:29,660 So an early paper asked some fairly fundamental but simple 1264 01:11:29,660 --> 01:11:31,200 questions.
1265 01:11:31,200 --> 01:11:36,820 For example, if you take two 10,000 cell aliquots 1266 01:11:36,820 --> 01:11:41,880 of the same culture, and you profile them independently, 1267 01:11:41,880 --> 01:11:45,720 and you ask how well the expression values for each gene 1268 01:11:45,720 --> 01:11:49,220 agree between sample A and sample B, 1269 01:11:49,220 --> 01:11:54,270 you expect there to be a very good agreement between sample A 1270 01:11:54,270 --> 01:11:59,070 and sample B in these 10,000 cell cultures. 1271 01:11:59,070 --> 01:12:05,170 A second question is, now, if you take, say, 14 cells 1272 01:12:05,170 --> 01:12:09,060 from those cultures and you profile them independently, 1273 01:12:09,060 --> 01:12:12,550 and you ask, how well do they correlate 1274 01:12:12,550 --> 01:12:16,080 with what you saw in the 10,000 cell experiment, that 1275 01:12:16,080 --> 01:12:18,765 will tell you something about the population heterogeneity 1276 01:12:18,765 --> 01:12:20,560 that you're observing. 1277 01:12:20,560 --> 01:12:23,790 Because if they correlate perfectly with the 10,000 cell 1278 01:12:23,790 --> 01:12:26,570 experiment, then you really know that there's 1279 01:12:26,570 --> 01:12:28,990 no point in looking at individual cells in some sense, 1280 01:12:28,990 --> 01:12:30,000 because they're all the same. 1281 01:12:30,000 --> 01:12:31,000 Seen one, seen them all. 1282 01:12:31,000 --> 01:12:32,620 Right? 1283 01:12:32,620 --> 01:12:35,660 But if you find that each cell has 1284 01:12:35,660 --> 01:12:39,240 its own particular expression fingerprint 1285 01:12:39,240 --> 01:12:44,890 and what you're seeing in the 10,000 cell average experiment 1286 01:12:44,890 --> 01:12:47,825 wipes out those fingerprints, then you 1287 01:12:47,825 --> 01:12:50,350 know it's very important to analyze each cell individually. 1288 01:12:54,050 --> 01:13:01,350 So the analysis that was done asked exactly that question. 1289 01:13:01,350 --> 01:13:04,020 So here's what I'll show you in these plots. 1290 01:13:04,020 --> 01:13:11,700 So here is, on the upper left, the 10,000 cell experiment 1291 01:13:11,700 --> 01:13:13,350 versus the 10,000 cell experiment. 1292 01:13:13,350 --> 01:13:16,320 And as you can see, the correlation coefficient 1293 01:13:16,320 --> 01:13:24,570 is quite high-- 0.98-- and looks very, very good 1294 01:13:24,570 --> 01:13:27,680 for experiment one versus experiment two, 1295 01:13:27,680 --> 01:13:30,090 or rep one, rep two. 1296 01:13:30,090 --> 01:13:34,450 Here is a separate experiment which 1297 01:13:34,450 --> 01:13:37,540 is looking at two individual cells 1298 01:13:37,540 --> 01:13:40,320 and plotting, for each gene, 1299 01:13:40,320 --> 01:13:42,750 the expression in one cell versus the 1300 01:13:42,750 --> 01:13:44,620 expression in the other cell. 1301 01:13:44,620 --> 01:13:47,460 And you can see that the correlation coefficient's 0.54. 1302 01:13:47,460 --> 01:13:49,280 And there's actually a fairly wide spread. 1303 01:13:49,280 --> 01:13:54,040 In fact, there are genes that are expressed in one cell that 1304 01:13:54,040 --> 01:13:57,069 are not expressed in the other cell, and vice versa. 1305 01:13:57,069 --> 01:13:58,860 So the expression of these individual cells 1306 01:13:58,860 --> 01:14:02,050 is quite divergent.
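A toy sketch of the kind of comparison being described, with invented expression values standing in for real profiles; correlations are computed on log-transformed values, a common convention.

```python
# Pearson correlation between two expression profiles, on a log scale.
import numpy as np

bulk_rep1 = np.array([120.0, 30.0, 500.0, 8.0, 60.0])
bulk_rep2 = np.array([115.0, 33.0, 480.0, 9.0, 58.0])
cell_a    = np.array([300.0,  0.0, 420.0, 0.0, 95.0])  # dropouts in one cell
cell_b    = np.array([ 10.0, 70.0, 510.0, 2.0,  1.0])

def log_corr(x, y):
    lx, ly = np.log1p(x), np.log1p(y)    # log(1 + value) tolerates zeros
    return np.corrcoef(lx, ly)[0, 1]

print(log_corr(bulk_rep1, bulk_rep2))   # very high, like the 0.98 panel
print(log_corr(cell_a, cell_b))         # much lower, in the spirit of 0.54
```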
1307 01:14:02,050 --> 01:14:09,530 And the final panel, down here, shows 1308 01:14:09,530 --> 01:14:15,760 how a single cell average on the y-axis 1309 01:14:15,760 --> 01:14:17,930 relates to the 10,000 cell experiment. 1310 01:14:20,550 --> 01:14:23,840 But given the middle panel-- panel B there-- 1311 01:14:23,840 --> 01:14:27,720 which shows that two single cells don't really 1312 01:14:27,720 --> 01:14:30,060 relate that well to one another, it begs other questions. 1313 01:14:30,060 --> 01:14:33,530 For example, are the isoforms of the genes that 1314 01:14:33,530 --> 01:14:38,980 are being expressed the same in those distinct cells? 1315 01:14:38,980 --> 01:14:42,570 And so panel D shows isoforms that 1316 01:14:42,570 --> 01:14:47,240 are the same across each one of the single cells being 1317 01:14:47,240 --> 01:14:51,770 profiled, which is the solid bar at the top. 1318 01:14:51,770 --> 01:14:57,410 And the bottom couple of rows in figure D 1319 01:14:57,410 --> 01:14:59,630 are the 10,000 cell experiment average. 1320 01:15:02,140 --> 01:15:03,730 But panel E is the most interesting, 1321 01:15:03,730 --> 01:15:09,370 perhaps, which is that the isoforms for those four genes 1322 01:15:09,370 --> 01:15:11,920 are being differentially expressed 1323 01:15:11,920 --> 01:15:13,305 in different individual cells. 1324 01:15:16,680 --> 01:15:19,690 And that's further supported by taking two of those genes 1325 01:15:19,690 --> 01:15:27,570 and doing fluorescence in situ hybridization and microscopy, 1326 01:15:27,570 --> 01:15:29,810 and looking at the number of RNA molecules 1327 01:15:29,810 --> 01:15:34,000 for each one of those, and noting that it corresponds 1328 01:15:34,000 --> 01:15:37,860 to what's seen in the upper right hand panel. 1329 01:15:37,860 --> 01:15:42,220 So we see that in individual cells, 1330 01:15:42,220 --> 01:15:48,120 different isoforms are being expressed. 1331 01:15:48,120 --> 01:15:51,030 Now these cells were derived from bone marrow. 1332 01:15:51,030 --> 01:15:55,860 And they're exposed to lipopolysaccharide 1333 01:15:55,860 --> 01:15:59,420 to activate an immune response. 1334 01:15:59,420 --> 01:16:02,150 So they are clearly not all behaving exactly the same. 1335 01:16:04,730 --> 01:16:11,370 And to further elucidate this, the authors of this paper 1336 01:16:11,370 --> 01:16:18,020 took the gene expression values that they 1337 01:16:18,020 --> 01:16:21,650 saw for a given cell as a large vector 1338 01:16:21,650 --> 01:16:23,900 and computed the principal components, 1339 01:16:23,900 --> 01:16:29,290 and then projected the cells into the first and second 1340 01:16:29,290 --> 01:16:32,810 principal component or eigenvector space. 1341 01:16:32,810 --> 01:16:37,000 And as you can see, there is a distinct separation of three 1342 01:16:37,000 --> 01:16:40,880 of the cells from the rest of the cells, where 1343 01:16:40,880 --> 01:16:44,240 three of the cells, which correlate well 1344 01:16:44,240 --> 01:16:46,670 with principal component one, are thought 1345 01:16:46,670 --> 01:16:53,160 to be mature cells that express certain cell surface 1346 01:16:53,160 --> 01:16:56,390 proteins, whereas the ones on the left-- the maturing 1347 01:16:56,390 --> 01:17:00,810 cells-- the triangle depicted cells-- 1348 01:17:00,810 --> 01:17:04,680 express certain cytokines under the maturing legend 1349 01:17:04,680 --> 01:17:08,420 there, on the clustergram on the right-hand side.
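And a sketch of that projection on toy data: each cell's expression vector is treated as a point and projected onto the first two principal components; the module spiked into three "mature" cells is an artificial stand-in for a real expression program.

```python
# Project cells onto the first two principal components of a
# cell-by-gene matrix. The sign of each component is arbitrary.
import numpy as np

rng = np.random.default_rng(2)
X = rng.poisson(5.0, size=(18, 200)).astype(float)   # 18 cells x 200 genes
X[:3, :20] += 40.0           # three cells overexpress a 20-gene module

Xc = X - X.mean(axis=0)                  # center each gene
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs = Xc @ Vt[:2].T                      # coordinates on PC1 and PC2
print(pcs[:3, 0])                        # the three cells separate on PC1
print(pcs[3:, 0])
```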
1350 01:17:08,420 --> 01:17:12,420 And thus, the first principal component 1351 01:17:12,420 --> 01:17:16,600 was able to separate those two different broad classes 1352 01:17:16,600 --> 01:17:18,269 of cells. 1353 01:17:18,269 --> 01:17:20,560 So it looks like there are at least two different kinds 1354 01:17:20,560 --> 01:17:23,800 of cells in this population. 1355 01:17:23,800 --> 01:17:25,680 And then, the authors asked another question, 1356 01:17:25,680 --> 01:17:32,350 which is, can they take individual cell data 1357 01:17:32,350 --> 01:17:35,510 and look at the relationship between pairs of genes 1358 01:17:35,510 --> 01:17:38,860 to see which genes are co-expressed. 1359 01:17:38,860 --> 01:17:40,660 And the hypothesis is that genes that 1360 01:17:40,660 --> 01:17:44,030 are co-expressed in individual cells 1361 01:17:44,030 --> 01:17:47,560 make up individual regulatory circuits. 1362 01:17:47,560 --> 01:17:52,280 And so they hypothesized that the gene pairs IRF7 and IFIT1, 1363 01:17:52,280 --> 01:17:58,470 and STAT2 and IRF7, are all in an anti-viral regulatory circuit. 1364 01:17:58,470 --> 01:18:00,610 They then asked the question, if they knocked out 1365 01:18:00,610 --> 01:18:03,990 IRF7, which is the second panel on the right-hand side, 1366 01:18:03,990 --> 01:18:06,310 would they ablate downstream gene expression. 1367 01:18:06,310 --> 01:18:08,440 And they partially did. 1368 01:18:08,440 --> 01:18:12,330 And since STAT2 and IRF7 are both thought 1369 01:18:12,330 --> 01:18:14,340 to be regulators of the circuit and they're 1370 01:18:14,340 --> 01:18:16,234 both downstream of the interferon receptor, 1371 01:18:16,234 --> 01:18:18,650 they thought if they knocked out the interferon receptor, 1372 01:18:18,650 --> 01:18:20,691 they would ablate most of the anti-viral cluster, 1373 01:18:20,691 --> 01:18:23,250 which, in fact, they did. 1374 01:18:23,250 --> 01:18:28,420 So what this is suggesting is that, first, single cell 1375 01:18:28,420 --> 01:18:30,140 analysis is extraordinarily important 1376 01:18:30,140 --> 01:18:32,410 to understand what's going on in individual cells. 1377 01:18:32,410 --> 01:18:37,140 Because in a cell culture, the cells can be quite different. 1378 01:18:37,140 --> 01:18:41,910 And secondarily, it's possible, within the context 1379 01:18:41,910 --> 01:18:43,900 of individual single cell analysis, 1380 01:18:43,900 --> 01:18:47,170 to be able to pick out regulatory circuits that 1381 01:18:47,170 --> 01:18:49,100 wouldn't be as evident when you're 1382 01:18:49,100 --> 01:18:52,850 looking at cells en masse. 1383 01:18:52,850 --> 01:18:59,300 And finally-- I'll thank Mike for the next two slides-- 1384 01:18:59,300 --> 01:19:00,980 I wanted to point out that quality 1385 01:19:00,980 --> 01:19:04,640 metrics for single cell RNA-seq data are very important. 1386 01:19:04,640 --> 01:19:08,820 And we talked about library complexity earlier in the term. 1387 01:19:08,820 --> 01:19:13,786 And here, you can see that the coefficient of variation 1388 01:19:13,786 --> 01:19:15,660 of expression, which 1389 01:19:15,660 --> 01:19:19,570 is the standard deviation over the mean, 1390 01:19:19,570 --> 01:19:22,870 comes down as you get sufficient library complexity. 1391 01:19:22,870 --> 01:19:28,240 And furthermore, as library complexity increases, 1392 01:19:28,240 --> 01:19:30,610 mean expression increases.
1393 01:19:30,610 --> 01:19:34,580 And the cells that are in red were classified as bad 1394 01:19:34,580 --> 01:19:43,910 by microscopy, from the Fluidigm instrument processing step. 1395 01:19:43,910 --> 01:19:47,150 So I think you can see that single cell analysis is going 1396 01:19:47,150 --> 01:19:50,220 to be extraordinarily important and can reveal 1397 01:19:50,220 --> 01:19:55,250 a lot of information that is not present in these large batch 1398 01:19:55,250 --> 01:19:56,900 experiments. 1399 01:19:56,900 --> 01:19:59,770 And it's coming to a lab near you. 1400 01:19:59,770 --> 01:20:04,140 So on that note, I'll thank you very much for today. 1401 01:20:04,140 --> 01:20:06,280 And we'll see you later in the term. 1402 01:20:06,280 --> 01:20:09,010 And Professor Burge will return at the next lecture. 1403 01:20:09,010 --> 01:20:10,860 Thanks very much.