1 00:00:00,060 --> 00:00:01,780 The following content is provided 2 00:00:01,780 --> 00:00:04,019 under a Creative Commons license. 3 00:00:04,019 --> 00:00:06,870 Your support will help MIT OpenCourseWare continue 4 00:00:06,870 --> 00:00:10,730 to offer high quality educational resources for free. 5 00:00:10,730 --> 00:00:13,340 To make a donation or view additional materials 6 00:00:13,340 --> 00:00:17,236 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:17,236 --> 00:00:17,861 at ocw.mit.edu. 8 00:00:26,450 --> 00:00:28,080 PROFESSOR: So as you recall last time 9 00:00:28,080 --> 00:00:30,780 we talked about chromatin structure and chromatin 10 00:00:30,780 --> 00:00:32,000 regulation. 11 00:00:32,000 --> 00:00:35,390 And now we're going to move on to genetic analysis. 12 00:00:35,390 --> 00:00:38,160 But before we do that, I want us to touch on two points 13 00:00:38,160 --> 00:00:41,440 that we talked about briefly last time. 14 00:00:41,440 --> 00:00:44,880 One was 5C analysis. 15 00:00:44,880 --> 00:00:48,420 Who was it that brought up-- who was the 5C expert here? 16 00:00:48,420 --> 00:00:50,010 Anybody? 17 00:00:50,010 --> 00:00:51,790 No? 18 00:00:51,790 --> 00:00:52,920 Nobody wants to own 5C. 19 00:00:52,920 --> 00:00:53,500 OK. 20 00:00:53,500 --> 00:00:56,100 But as you recall, we talked about ChIA-PET 21 00:00:56,100 --> 00:01:00,090 as one way of analyzing any-to-any interactions in the way 22 00:01:00,090 --> 00:01:04,370 that the genome folds up and enhancers talk to promoters. 23 00:01:04,370 --> 00:01:06,114 And 5C is a very similar technique. 24 00:01:06,114 --> 00:01:07,530 I just wanted to show you the flow 25 00:01:07,530 --> 00:01:10,260 chart for how the protocol goes. 26 00:01:10,260 --> 00:01:12,950 There is a cross linking. 
27 00:01:12,950 --> 00:01:14,720 A digestion with a restriction enzyme 28 00:01:14,720 --> 00:01:18,040 step, followed by a proximity ligation step, 29 00:01:18,040 --> 00:01:21,900 which gives you molecules that had been brought together 30 00:01:21,900 --> 00:01:25,320 by an enhancer-promoter complex, or any other kind 31 00:01:25,320 --> 00:01:29,030 of distal protein-protein interaction. 32 00:01:29,030 --> 00:01:35,660 And then, what happens is that you design specific primers 33 00:01:35,660 --> 00:01:38,600 to detect those ligation events. 34 00:01:38,600 --> 00:01:41,150 And you sequence the result of what 35 00:01:41,150 --> 00:01:44,390 is known as ligation-mediated amplification. 36 00:01:44,390 --> 00:01:46,730 So those primers are only going to ligate 37 00:01:46,730 --> 00:01:49,910 if they're brought together at a particular junction, which 38 00:01:49,910 --> 00:01:53,230 is defined by the restriction sites lining up. 39 00:01:53,230 --> 00:01:58,500 So, 5C is a method of looking at which regions of the genome 40 00:01:58,500 --> 00:02:04,170 interact and can produce these sorts of results, 41 00:02:04,170 --> 00:02:05,680 showing which parts of the genome 42 00:02:05,680 --> 00:02:07,190 interact with one another. 43 00:02:07,190 --> 00:02:12,630 The key difference, I think, between ChIA-PET and 5C 44 00:02:12,630 --> 00:02:15,660 is that you actually have to have these primers designed 45 00:02:15,660 --> 00:02:19,050 and pick the particular locations you want to query. 46 00:02:19,050 --> 00:02:23,110 So the primers that you design represent query locations 47 00:02:23,110 --> 00:02:26,840 and you can then either apply the results to a microarray, 48 00:02:26,840 --> 00:02:31,710 or to high-throughput sequencing to detect these interactions. 49 00:02:31,710 --> 00:02:33,900 But the essential idea is the same. 
50 00:02:33,900 --> 00:02:36,540 You do proximity-based ligation 51 00:02:36,540 --> 00:02:40,440 to form molecules that contain components 52 00:02:40,440 --> 00:02:42,230 of two different pieces of the genome 53 00:02:42,230 --> 00:02:47,270 that have been brought together for some functional reason. 54 00:02:47,270 --> 00:02:50,580 The next thing I want to touch upon 55 00:02:50,580 --> 00:02:55,500 is this idea of the CpG dinucleotides 56 00:02:55,500 --> 00:02:59,930 that are connected by a phosphate bond. 57 00:02:59,930 --> 00:03:01,880 And you recall that I talked about the idea 58 00:03:01,880 --> 00:03:03,310 that they were symmetric. 59 00:03:03,310 --> 00:03:07,430 So you could have methyl groups on the cytosines in such a way 60 00:03:07,430 --> 00:03:10,810 that, because they could mirror one another, 61 00:03:10,810 --> 00:03:14,770 they could be transferred from one strand of DNA 62 00:03:14,770 --> 00:03:18,890 to the other strand of DNA, during cell replication 63 00:03:18,890 --> 00:03:21,530 by DNA methyltransferase. 64 00:03:21,530 --> 00:03:25,750 So it forms a more stable kind of mark and as you recall, 65 00:03:25,750 --> 00:03:27,750 DNA methylation was something that occurred 66 00:03:27,750 --> 00:03:32,210 in lowly expressed genes. And typically, in regions 67 00:03:32,210 --> 00:03:34,850 of the genome that are methylated, 68 00:03:34,850 --> 00:03:36,430 other histone marks are not present 69 00:03:36,430 --> 00:03:39,050 and the genes are turned off. 70 00:03:39,050 --> 00:03:39,680 OK. 71 00:03:39,680 --> 00:03:41,055 So those were the points I wanted 72 00:03:41,055 --> 00:03:44,380 to touch upon from last lecture. 73 00:03:44,380 --> 00:03:48,090 Now we're going to embark upon an adventure, 74 00:03:48,090 --> 00:03:54,670 looking for the answer to, where is missing heritability found? 75 00:03:54,670 --> 00:03:57,560 So it's a big open question now in genetics. 
76 00:03:57,560 --> 00:04:00,240 In human genetics, the problem is that we really 77 00:04:00,240 --> 00:04:03,830 can't find all the heritability. 78 00:04:03,830 --> 00:04:06,990 And as a point of introduction, the narrative 79 00:04:06,990 --> 00:04:10,760 arc for today's lecture is that, generally speaking, 80 00:04:10,760 --> 00:04:13,160 you're more like your relatives than random people 81 00:04:13,160 --> 00:04:14,480 on the planet. 82 00:04:14,480 --> 00:04:15,770 And why is this? 83 00:04:15,770 --> 00:04:20,529 Well obviously you contain components of your mom 84 00:04:20,529 --> 00:04:22,089 and dad's genomes. 85 00:04:22,089 --> 00:04:27,800 And they are providing you with components of your traits. 86 00:04:27,800 --> 00:04:30,580 And the heritability of a trait is 87 00:04:30,580 --> 00:04:34,310 defined by the fraction of phenotypic variance 88 00:04:34,310 --> 00:04:37,830 that can be explained by genetics. 89 00:04:37,830 --> 00:04:41,600 And we're going to talk today about computational models that 90 00:04:41,600 --> 00:04:44,250 can predict phenotype from genotype. 91 00:04:44,250 --> 00:04:46,480 And this is very important, obviously, 92 00:04:46,480 --> 00:04:51,380 for understanding the sources of various traits and phenotypes. 93 00:04:51,380 --> 00:04:54,760 As well as fields such as pharmacogenomics 94 00:04:54,760 --> 00:04:59,640 that try and predict the best therapy for a disease 95 00:04:59,640 --> 00:05:03,440 based upon your genetic makeup. 96 00:05:03,440 --> 00:05:09,110 So, individual loci in the genome 97 00:05:09,110 --> 00:05:12,140 that contribute to quantitative traits 98 00:05:12,140 --> 00:05:17,395 are called quantitative trait loci, or QTLs. 99 00:05:17,395 --> 00:05:19,520 So we're going to talk about how to discover them 100 00:05:19,520 --> 00:05:24,150 and how to build models of quantitative traits using QTLs. 
101 00:05:24,150 --> 00:05:27,490 And finally, as I said at the outset, 102 00:05:27,490 --> 00:05:29,980 our models are insufficient today. 103 00:05:29,980 --> 00:05:33,150 They really can't find all of the heritability. 104 00:05:33,150 --> 00:05:35,730 So we're going to go searching for this missing heritability 105 00:05:35,730 --> 00:05:39,460 and see where it might be found. 106 00:05:39,460 --> 00:05:44,320 Computationally, we're going to apply a variety of techniques 107 00:05:44,320 --> 00:05:45,830 to these problems. 108 00:05:45,830 --> 00:05:48,820 A preview is, we're going to build 109 00:05:48,820 --> 00:05:52,040 linear models of phenotype and we're 110 00:05:52,040 --> 00:05:56,124 going to use stepwise regression to learn these models using 111 00:05:56,124 --> 00:05:57,290 forward feature selection. 112 00:05:57,290 --> 00:05:59,210 And I'll talk about what that is when 113 00:05:59,210 --> 00:06:01,320 we get to that point of the lecture. 114 00:06:01,320 --> 00:06:04,310 We're going to derive test statistics for discovering 115 00:06:04,310 --> 00:06:08,530 which QTLs are significant and which QTLs are not, 116 00:06:08,530 --> 00:06:10,211 to include in our model. 117 00:06:10,211 --> 00:06:11,960 And finally, we're going to talk about how 118 00:06:11,960 --> 00:06:15,200 to measure narrow-sense heritability, broad-sense 119 00:06:15,200 --> 00:06:17,435 heritability, and environmental variance. 120 00:06:20,010 --> 00:06:21,990 OK. 121 00:06:21,990 --> 00:06:30,400 So, one great resource for traits that are fairly simple, 122 00:06:30,400 --> 00:06:35,600 that primarily are the result of a single gene mutation, 123 00:06:35,600 --> 00:06:40,040 or where a single gene mutation plays a dominant role, 124 00:06:40,040 --> 00:06:45,130 is something called Online Mendelian Inheritance in Man. 125 00:06:45,130 --> 00:06:46,770 And it's a resource. 126 00:06:46,770 --> 00:06:49,690 It has about 21,000 genes in it right now. 
127 00:06:49,690 --> 00:06:55,100 And it's a great way to explore what a human gene's function 128 00:06:55,100 --> 00:06:57,170 is in various diseases. 129 00:06:57,170 --> 00:06:58,500 And you can query by disease. 130 00:06:58,500 --> 00:07:00,060 You can query by gene. 131 00:07:00,060 --> 00:07:07,130 And it is a very carefully annotated and maintained 132 00:07:07,130 --> 00:07:10,190 collection that is worthy of study, 133 00:07:10,190 --> 00:07:13,460 if you're interested in particular disease genes. 134 00:07:13,460 --> 00:07:19,132 We're going to be looking at more complex analyses today. 135 00:07:19,132 --> 00:07:20,590 The analyses we're going to look at 136 00:07:20,590 --> 00:07:22,275 are where there are many genes that 137 00:07:22,275 --> 00:07:23,540 influence a particular trait. 138 00:07:23,540 --> 00:07:25,630 And we would like to come up with general methods 139 00:07:25,630 --> 00:07:30,902 for discovering how we can, de novo from experimental data, 140 00:07:30,902 --> 00:07:32,985 discover all the different genes that participate. 141 00:07:36,680 --> 00:07:39,910 Now just as a quick review of statistics, 142 00:07:39,910 --> 00:07:43,660 I think that we've talked before about means in class 143 00:07:43,660 --> 00:07:44,989 and variances. 144 00:07:44,989 --> 00:07:46,530 We're also going to talk a little bit 145 00:07:46,530 --> 00:07:48,732 about covariances today. 146 00:07:48,732 --> 00:07:50,190 But these are terms that you should 147 00:07:50,190 --> 00:07:54,210 be familiar with as we're looking today 148 00:07:54,210 --> 00:08:01,930 at some of our metrics for understanding heritability. 149 00:08:01,930 --> 00:08:05,960 Are there any questions about any of the statistical metrics that 150 00:08:05,960 --> 00:08:06,460 are up here? 151 00:08:09,348 --> 00:08:09,848 OK. 152 00:08:12,760 --> 00:08:16,125 So, a broad overview of genotype to phenotype. 
153 00:08:18,996 --> 00:08:20,620 So, we're primarily going to be working 154 00:08:20,620 --> 00:08:24,170 with complete genome sequences today, 155 00:08:24,170 --> 00:08:26,490 which will reveal all of the variants that 156 00:08:26,490 --> 00:08:28,930 are present in the genome. 157 00:08:28,930 --> 00:08:32,264 And it's also the case that you can subsample a genome 158 00:08:32,264 --> 00:08:35,070 and only observe certain variants. 159 00:08:35,070 --> 00:08:37,710 Typically that's done with microarrays 160 00:08:37,710 --> 00:08:41,010 that have probes that are specific to particular markers. 161 00:08:41,010 --> 00:08:42,929 The way those arrays are manufactured 162 00:08:42,929 --> 00:08:47,110 is that whole genome sequencing is done at the outset, and then 163 00:08:47,110 --> 00:08:50,570 high-prevalence variants, or at least 164 00:08:50,570 --> 00:08:52,190 common variants, which typically are 165 00:08:52,190 --> 00:08:55,410 at a frequency of at least 5% in the population 166 00:08:55,410 --> 00:08:58,390 are queried by using a microarray. 167 00:08:58,390 --> 00:09:01,790 But today we'll talk about complete genome sequence. 168 00:09:01,790 --> 00:09:03,290 An individual's phenotype, we'll say 169 00:09:03,290 --> 00:09:05,800 is defined by one or more traits. 170 00:09:05,800 --> 00:09:09,720 And a non-quantitative trait is something perhaps as simple as 171 00:09:09,720 --> 00:09:12,590 whether or not something is dead or alive. 172 00:09:12,590 --> 00:09:15,820 Or whether or not it can survive in a particular condition. 173 00:09:15,820 --> 00:09:19,920 Or its ability to produce a particular substance. 174 00:09:19,920 --> 00:09:22,370 A quantitative trait, on the other hand, 175 00:09:22,370 --> 00:09:25,274 is a continuous variable. 176 00:09:25,274 --> 00:09:26,815 Height, for example, of an individual 177 00:09:26,815 --> 00:09:28,840 is a quantitative trait. 
178 00:09:28,840 --> 00:09:32,610 As is growth rate, expression of a particular gene, 179 00:09:32,610 --> 00:09:34,490 and so forth. 180 00:09:34,490 --> 00:09:39,310 So we'll be focusing today on estimating quantitative traits. 181 00:09:39,310 --> 00:09:41,970 And as I said, a quantitative trait locus, 182 00:09:41,970 --> 00:09:45,240 is a marker that's associated with a quantitative trait 183 00:09:45,240 --> 00:09:47,160 and could be used to predict it. 184 00:09:47,160 --> 00:09:49,520 And you can sometimes hear about eQTLs, 185 00:09:49,520 --> 00:09:52,900 which are expression quantitative trait loci. 186 00:09:52,900 --> 00:09:55,540 And they're loci that are related to gene expression. 187 00:09:59,770 --> 00:10:07,469 So, let's begin then, with a very simple genetic model. 188 00:10:07,469 --> 00:10:09,510 It's going to be haploid, which means, of course, 189 00:10:09,510 --> 00:10:11,218 there's only one copy of each chromosome. 190 00:10:11,218 --> 00:10:12,607 Yeast is the model organism we're 191 00:10:12,607 --> 00:10:13,940 going to be talking about today. 192 00:10:13,940 --> 00:10:16,190 It's a haploid organism. 193 00:10:16,190 --> 00:10:18,020 And we have mom and dad up there. 194 00:10:18,020 --> 00:10:22,310 Mom on the left, dad on the right in two different colors. 195 00:10:22,310 --> 00:10:24,900 And you can see that mom and dad in this particular example, 196 00:10:24,900 --> 00:10:26,200 have n different genes. 197 00:10:26,200 --> 00:10:29,580 They're going to contribute to the F1 generation, to junior. 198 00:10:32,110 --> 00:10:37,700 And the relative colors, white for mom, black for dad, 199 00:10:37,700 --> 00:10:41,980 are going to be used to describe the alleles, 200 00:10:41,980 --> 00:10:44,860 or the allelic variants that are inherited 201 00:10:44,860 --> 00:10:48,436 by the child, the F1 generation. 
202 00:10:48,436 --> 00:10:50,309 And as I said, a specific phenotype 203 00:10:50,309 --> 00:10:52,350 might be alive or dead in a specific environment. 204 00:10:55,290 --> 00:11:03,110 And note that I have drawn the chromosomes to be disconnected. 205 00:11:03,110 --> 00:11:06,230 Which means that each one of those genes 206 00:11:06,230 --> 00:11:09,940 is going to be independently inherited. 207 00:11:09,940 --> 00:11:13,160 So the probability in the F1 generation 208 00:11:13,160 --> 00:11:16,020 that you're going to get one of those from mom or dad 209 00:11:16,020 --> 00:11:18,420 is going to be a coin flip. 210 00:11:18,420 --> 00:11:19,970 We're going to assume that they're 211 00:11:19,970 --> 00:11:23,400 far enough away that the probability of crossing over 212 00:11:23,400 --> 00:11:26,410 during meiosis is 0.5. 213 00:11:26,410 --> 00:11:28,530 And so we get a random assortment 214 00:11:28,530 --> 00:11:32,050 of alleles from mom and dad. 215 00:11:32,050 --> 00:11:33,130 OK? 216 00:11:33,130 --> 00:11:37,520 So let us say that you go off and do an experiment. 217 00:11:37,520 --> 00:11:44,610 And you have 32 individuals that you produce out of a cross. 218 00:11:44,610 --> 00:11:47,490 And you test them, OK. 219 00:11:47,490 --> 00:11:57,130 And two of them are resistant to a particular substance. 220 00:11:57,130 --> 00:12:00,155 How many genes do you think are involved in that resistance? 221 00:12:03,320 --> 00:12:07,885 Let's assume that mom is resistant and dad is not. 222 00:12:07,885 --> 00:12:08,385 OK. 223 00:12:11,560 --> 00:12:14,780 If you had two that were resistant out of 32, 224 00:12:14,780 --> 00:12:18,350 how many different genes do you think were involved? 225 00:12:18,350 --> 00:12:19,635 How do you estimate that? 226 00:12:26,420 --> 00:12:26,970 Any ideas? 227 00:12:33,667 --> 00:12:34,645 Yes? 228 00:12:34,645 --> 00:12:39,046 AUDIENCE: If you had 32 individuals 229 00:12:39,046 --> 00:12:44,930 and say half of them got it? 
230 00:12:44,930 --> 00:12:46,160 PROFESSOR: Two, let's say. 231 00:12:46,160 --> 00:12:51,760 One out of 16 is resistant. 232 00:12:51,760 --> 00:12:53,970 And mom is resistant. 233 00:12:53,970 --> 00:12:56,940 AUDIENCE: Because I was thinking that if it was half of them 234 00:12:56,940 --> 00:13:00,212 were resistant, then you would maybe guess one gene, 235 00:13:00,212 --> 00:13:01,170 or something like that. 236 00:13:01,170 --> 00:13:02,220 PROFESSOR: Very good. 237 00:13:02,220 --> 00:13:04,970 AUDIENCE: So then if only eight were 238 00:13:04,970 --> 00:13:09,720 resistant you might guess two genes, or something like that? 239 00:13:09,720 --> 00:13:11,960 PROFESSOR: Yeah. 240 00:13:11,960 --> 00:13:15,320 What you'd say is that if mom's resistant, then 241 00:13:15,320 --> 00:13:16,850 we're going to assume that you need 242 00:13:16,850 --> 00:13:20,030 to get the right number of genes from mom to be resistant. 243 00:13:20,030 --> 00:13:21,390 Right? 244 00:13:21,390 --> 00:13:25,369 And so, let's say that you had to get four genes from mom. 245 00:13:25,369 --> 00:13:27,410 What's the chance of getting four genes from mom? 246 00:13:30,236 --> 00:13:33,070 AUDIENCE: Half to the power of four. 247 00:13:33,070 --> 00:13:35,290 PROFESSOR: Yeah, which is one out of 16, right? 248 00:13:35,290 --> 00:13:39,200 So, if you, for example had two that were resistant out of 32, 249 00:13:39,200 --> 00:13:41,240 the chances are one in 16. 250 00:13:41,240 --> 00:13:41,960 Right? 251 00:13:41,960 --> 00:13:46,230 So you would naively think, and properly so, 252 00:13:46,230 --> 00:13:51,620 that you had to get four genes from mom to be resistant. 253 00:13:51,620 --> 00:13:54,440 So the way to think about these sorts 254 00:13:54,440 --> 00:13:57,450 of non-quantitative traits is that you 255 00:13:57,450 --> 00:14:00,630 can estimate the number of genes involved. 
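[Editor's sketch, not part of the lecture] The reasoning in this exchange can be written as a small calculation. If resistance requires inheriting all k of mom's relevant alleles, and each is an independent coin flip, then a fraction (1/2)^k of the F1s should be resistant, so k is roughly log base 2 of (F1s tested / F1s with the phenotype). The function name below is hypothetical, and the sketch assumes unlinked genes and that dad carries none of the required alleles, as in the professor's example.

```python
import math


def estimate_gene_count(n_tested, n_with_phenotype):
    """Rough number of unlinked genes behind an all-or-nothing trait.

    Assumes a haploid cross where the trait appears only when every
    required allele comes from one parent, each with probability 1/2.
    """
    if n_with_phenotype <= 0 or n_with_phenotype > n_tested:
        raise ValueError("need 0 < n_with_phenotype <= n_tested")
    # Fraction with the phenotype is (1/2)**k, so k = log2(tested / with).
    return math.log2(n_tested / n_with_phenotype)


if __name__ == "__main__":
    # The example from the lecture: 2 resistant F1s out of 32.
    print(estimate_gene_count(32, 2))
```

With real counts the ratio will rarely be an exact power of two, so the result is best read as a rough guide rather than a gene count.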
256 00:14:00,630 --> 00:14:02,930 That estimate is simply log base 2 of the number 257 00:14:02,930 --> 00:14:07,570 of F1s tested over the number of F1s with the phenotype. 258 00:14:07,570 --> 00:14:09,790 It tells you roughly how many genes 259 00:14:09,790 --> 00:14:16,160 are involved in providing a particular trait, 260 00:14:16,160 --> 00:14:18,530 assuming that the genes are unlinked. 261 00:14:18,530 --> 00:14:21,415 It's a coin flip, whether you get them or not. 262 00:14:21,415 --> 00:14:22,415 Does everybody see that? 263 00:14:25,090 --> 00:14:26,300 Yes? 264 00:14:26,300 --> 00:14:27,760 Any questions at all about that? 265 00:14:33,025 --> 00:14:33,775 About the details? 266 00:14:37,390 --> 00:14:38,940 OK. 267 00:14:38,940 --> 00:14:44,400 Let's talk now about quantitative traits then. 268 00:14:44,400 --> 00:14:47,905 We'll go back to our model and imagine 269 00:14:47,905 --> 00:14:50,850 that we have the same set-- actually 270 00:14:50,850 --> 00:14:53,110 it's going to be a different set of n genes. 271 00:14:53,110 --> 00:14:56,070 We're going to have a coin flip as to 272 00:14:56,070 --> 00:14:58,890 whether or not you're getting a mom gene or a dad gene. 273 00:14:58,890 --> 00:15:00,200 OK. 274 00:15:00,200 --> 00:15:05,826 And each gene in dad has an effect size of 1 over n. 275 00:15:05,826 --> 00:15:06,326 Yes? 276 00:15:06,326 --> 00:15:08,618 AUDIENCE: I just wanted to check. 277 00:15:08,618 --> 00:15:13,438 We're assuming that the parents are homozygous for the trait? 278 00:15:13,438 --> 00:15:14,402 Is that correct? 279 00:15:14,402 --> 00:15:16,040 PROFESSOR: Remember these are haploid. 280 00:15:16,040 --> 00:15:17,540 AUDIENCE: Oh, these are haploid. 281 00:15:17,540 --> 00:15:18,248 PROFESSOR: Right. 282 00:15:18,248 --> 00:15:23,300 So they only have one copy of all these genes. 283 00:15:23,300 --> 00:15:24,760 All right. 284 00:15:24,760 --> 00:15:25,774 Yes? 
285 00:15:25,774 --> 00:15:30,220 AUDIENCE: [INAUDIBLE] resistant and they're [INAUDIBLE]. 286 00:15:30,220 --> 00:15:32,030 That could still mean that dad has 287 00:15:32,030 --> 00:15:35,160 three of the four genes in principle. 288 00:15:35,160 --> 00:15:36,640 PROFESSOR: The previous slide? 289 00:15:36,640 --> 00:15:38,588 Is that what you're talking about? 290 00:15:38,588 --> 00:15:40,527 AUDIENCE: [INAUDIBLE] knew about it. 291 00:15:40,527 --> 00:15:42,360 So really what you mean is that dad does not 292 00:15:42,360 --> 00:15:44,930 have any of the genes that are involved with resistance. 293 00:15:44,930 --> 00:15:45,888 PROFESSOR: That's correct. 294 00:15:48,602 --> 00:15:50,560 I was saying that dad has to have all of gene-- 295 00:15:50,560 --> 00:15:52,670 that the child has to have all of the genes that 296 00:15:52,670 --> 00:15:54,290 are operative to create resistance. 297 00:15:54,290 --> 00:15:55,750 We're going to assume an AND model. 298 00:15:55,750 --> 00:15:58,120 He must have all the genes from mom. 299 00:15:58,120 --> 00:16:01,470 They're involved in the resistance pathway. 300 00:16:01,470 --> 00:16:04,720 And since only one out of 16 progeny 301 00:16:04,720 --> 00:16:08,370 has all those genes from mom, right, it 302 00:16:08,370 --> 00:16:11,390 appears that given the chance of inheriting something from mom 303 00:16:11,390 --> 00:16:16,040 is 1/2, that it's four genes you have to inherit from mom. 304 00:16:16,040 --> 00:16:19,787 Because the chance of inheriting all four is one out of 16. 305 00:16:19,787 --> 00:16:23,060 AUDIENCE: [INAUDIBLE] in which case-- 306 00:16:23,060 --> 00:16:27,630 PROFESSOR: No, I'm assuming the dad doesn't have any of those. 307 00:16:27,630 --> 00:16:29,660 But here we're asking, what is the difference 308 00:16:29,660 --> 00:16:32,630 in the number of genes between mom and dad? 
309 00:16:32,630 --> 00:16:35,520 So you're right, that the number we're computing 310 00:16:35,520 --> 00:16:39,360 is the relative number of genes different between mom and dad 311 00:16:39,360 --> 00:16:40,740 you require. 312 00:16:40,740 --> 00:16:43,080 And so it might be that dad's a reference 313 00:16:43,080 --> 00:16:45,790 and we're asking how many additional genes mom brought 314 00:16:45,790 --> 00:16:47,870 to the table to provide that resistance. 315 00:16:47,870 --> 00:16:49,560 But that's a good point. 316 00:16:49,560 --> 00:16:50,060 OK. 317 00:16:53,370 --> 00:16:54,410 OK. 318 00:16:54,410 --> 00:16:59,080 So, now let's look at this quantitative model. 319 00:16:59,080 --> 00:17:04,440 Let's assume that mom has a bunch of genes that contribute 320 00:17:04,440 --> 00:17:11,089 zero to an effect size and dad-- each gene 321 00:17:11,089 --> 00:17:14,640 that dad has produces an effect of 1 over n. 322 00:17:14,640 --> 00:17:18,560 So the total effect size here for dad is 1. 323 00:17:18,560 --> 00:17:22,700 So the effect of mom on this particular quantitative trait 324 00:17:22,700 --> 00:17:23,480 might be zero. 325 00:17:23,480 --> 00:17:25,920 It might be the amount of ethanol produced 326 00:17:25,920 --> 00:17:28,190 or some other quantitative value. 327 00:17:28,190 --> 00:17:31,930 And dad, on the other hand, since he has n genes, 328 00:17:31,930 --> 00:17:35,890 is going to produce one, because each gene contributes 329 00:17:35,890 --> 00:17:38,275 a little bit to this quantitative phenotype. 330 00:17:41,642 --> 00:17:43,050 Is everybody clear on that? 331 00:17:45,670 --> 00:17:51,550 So, the child is going to inherit genes 332 00:17:51,550 --> 00:17:56,290 through our coin flip between mom and dad, right. 333 00:17:56,290 --> 00:17:57,880 So the first fundamental question 334 00:17:57,880 --> 00:18:01,430 is, how many different levels are there 335 00:18:01,430 --> 00:18:04,440 in our quantitative phenotype, in our trait? 
336 00:18:08,360 --> 00:18:10,020 How many different levels can you have? 337 00:18:16,274 --> 00:18:16,940 AUDIENCE: N + 1? 338 00:18:16,940 --> 00:18:20,270 PROFESSOR: N + 1, right, because you can either inherit 339 00:18:20,270 --> 00:18:24,010 zero, or up to n genes from dad. 340 00:18:24,010 --> 00:18:27,940 And it gets you n plus 1 different levels. 341 00:18:27,940 --> 00:18:29,380 OK. 342 00:18:29,380 --> 00:18:32,410 So, what's the probability then-- well, 343 00:18:32,410 --> 00:18:33,660 I'll ask a different question. 344 00:18:33,660 --> 00:18:38,245 What's the expected value of the quantitative phenotype 345 00:18:38,245 --> 00:18:39,060 of a child? 346 00:18:43,620 --> 00:18:44,700 Just looking at this. 347 00:18:48,410 --> 00:18:52,860 If dad's one and mom's zero, and you have a collection of genes 348 00:18:52,860 --> 00:18:57,507 and you do a coin flip each time, 349 00:18:57,507 --> 00:18:59,340 you're going to get half your genes from mom 350 00:18:59,340 --> 00:19:01,410 and half your genes from dad. 351 00:19:01,410 --> 00:19:02,990 Right. 352 00:19:02,990 --> 00:19:11,310 And so the expected trait value is 0.5. 353 00:19:11,310 --> 00:19:13,420 So for these additive traits, you're 354 00:19:13,420 --> 00:19:17,770 going to be at the midpoint between mom and dad. 355 00:19:17,770 --> 00:19:19,880 Right. 356 00:19:19,880 --> 00:19:28,340 And what is the probability that you 357 00:19:28,340 --> 00:19:32,055 inherit x copies of dad's genes? 358 00:19:35,390 --> 00:19:44,260 Well, that's n choose x, times 1 minus 0.5 to the n minus 359 00:19:44,260 --> 00:19:47,310 x, times 0.5 to the x. 360 00:19:47,310 --> 00:19:50,080 A simple binomial. 361 00:19:50,080 --> 00:19:51,700 Right. 362 00:19:51,700 --> 00:19:54,980 So if you look at this, the probability 363 00:19:54,980 --> 00:19:58,499 distribution for the children 364 00:19:58,499 --> 00:20:00,040 is going to look something like this, 365 00:20:00,040 --> 00:20:04,240 where this is the mean, 0.5. 
366 00:20:04,240 --> 00:20:09,670 And the number of distinct values is going to be n plus 1. 367 00:20:09,670 --> 00:20:12,100 Right. 368 00:20:12,100 --> 00:20:18,860 So the expected value of x is 0.5, and it turns out 369 00:20:18,860 --> 00:20:26,900 that the variance, the expected value of x minus 0.5, which 370 00:20:26,900 --> 00:20:35,012 is the mean, squared, is going to be 0.25 over n. 371 00:20:35,012 --> 00:20:36,720 So I can show you this on the next slide. 372 00:20:39,550 --> 00:20:43,965 So you can see, this could be ethanol production, 373 00:20:43,965 --> 00:20:46,700 it could be growth rate, what have you. 374 00:20:46,700 --> 00:20:49,210 And you can see that the number of genes that you're 375 00:20:49,210 --> 00:20:54,720 going to get from dad follows this binomial distribution 376 00:20:54,720 --> 00:20:58,810 and gives you a spread of different phenotypes 377 00:20:58,810 --> 00:21:00,790 in the child's generation, depending 378 00:21:00,790 --> 00:21:03,005 upon how many copies of dad's genes that you inherit. 379 00:21:07,000 --> 00:21:08,840 But does this make sense to everybody? 380 00:21:08,840 --> 00:21:11,187 Now would be a great time to ask any questions 381 00:21:11,187 --> 00:21:12,270 about the details of this. 382 00:21:12,270 --> 00:21:13,152 Yes? 383 00:21:13,152 --> 00:21:15,462 AUDIENCE: Can you clarify what x is? 384 00:21:15,462 --> 00:21:17,780 Is x the fraction of genes inherited-- 385 00:21:17,780 --> 00:21:22,160 PROFESSOR: The number of genes you inherit from dad. 386 00:21:22,160 --> 00:21:23,790 The number of genes. 387 00:21:23,790 --> 00:21:27,565 So it would be zero, one, two, up to n. 388 00:21:27,565 --> 00:21:29,990 AUDIENCE: Shouldn't the expectation of n [INAUDIBLE] 389 00:21:29,990 --> 00:21:32,415 x be n/2? 390 00:21:32,415 --> 00:21:33,315 PROFESSOR: I'm sorry. 391 00:21:33,315 --> 00:21:35,810 It is supposed to be n/2. 
392 00:21:35,810 --> 00:21:41,320 But the last two expectations are 393 00:21:41,320 --> 00:21:44,847 some of the number of genes you've inherited from dad. 394 00:21:44,847 --> 00:21:46,835 Right, that's correct. 395 00:21:46,835 --> 00:21:48,326 Yeah, this slide's wrong. 396 00:21:55,232 --> 00:21:56,065 Any other questions? 397 00:21:59,540 --> 00:22:00,220 OK. 398 00:22:00,220 --> 00:22:06,075 So this is a very simple model but it tells us 399 00:22:06,075 --> 00:22:08,550 a couple of things, right. 400 00:22:08,550 --> 00:22:13,350 Which is that as n gets to be very large, 401 00:22:13,350 --> 00:22:16,125 the effect of each gene gets to be quite small. 402 00:22:18,650 --> 00:22:21,760 So something could be completely heritable, 403 00:22:21,760 --> 00:22:26,140 but if it's spread over, say 1,000 genes, 404 00:22:26,140 --> 00:22:28,840 then it will be very difficult to detect, 405 00:22:28,840 --> 00:22:31,550 because the effect of each gene would be quite small. 406 00:22:31,550 --> 00:22:37,300 And furthermore, the variance that you see in the offspring 407 00:22:37,300 --> 00:22:40,870 will be quite small as well, right, 408 00:22:40,870 --> 00:22:42,450 in terms of the phenotype. 409 00:22:42,450 --> 00:22:46,170 Because it's going to be 0.25/n in terms of the expected value. 410 00:22:46,170 --> 00:22:50,090 So as n gets larger, as the number of genes that 411 00:22:50,090 --> 00:22:52,820 contribute to that phenotype increases, 412 00:22:52,820 --> 00:22:54,855 the variance is going to go down linearly. 413 00:22:58,580 --> 00:22:59,080 OK. 414 00:22:59,080 --> 00:23:01,390 So we should just keep this in mind 415 00:23:01,390 --> 00:23:09,010 as we're looking at discovering these sorts of traits 416 00:23:09,010 --> 00:23:15,070 and the underlying QTLs that can be used to predict them. 
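[Editor's sketch, not part of the lecture] The additive haploid model described above can be checked with a short simulation. Each of n unlinked genes is a coin flip between mom (effect 0) and dad (effect 1/n), so a child's phenotype is the fraction of dad alleles it inherits. The function names and the seed are illustrative choices, not anything from the course materials.

```python
import random
import statistics


def simulate_offspring(n_genes, n_offspring, seed=0):
    """Simulate F1 phenotypes under the additive coin-flip model."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    phenotypes = []
    for _ in range(n_offspring):
        # Count how many of the n genes came from dad (each with p = 0.5).
        dad_alleles = sum(rng.random() < 0.5 for _ in range(n_genes))
        # Each dad allele adds 1/n to the trait; mom alleles add 0.
        phenotypes.append(dad_alleles / n_genes)
    return phenotypes


def theoretical_variance(n_genes):
    """Variance of the phenotype under this model: 0.25 / n."""
    return 0.25 / n_genes


if __name__ == "__main__":
    for n in (4, 10, 100):
        sample = simulate_offspring(n, 20000)
        print(n, statistics.mean(sample),
              statistics.pvariance(sample), theoretical_variance(n))
```

In this sketch the sample mean stays near 0.5 for any n, while the spread shrinks as n grows, which is the point made above about many small-effect genes being hard to detect.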
417 00:23:15,070 --> 00:23:19,460 And finally, I'd like to point out one other detail which 418 00:23:19,460 --> 00:23:21,920 is that, if genes are linked, that is, 419 00:23:21,920 --> 00:23:25,020 if they're in close proximity to one another in the genome 420 00:23:25,020 --> 00:23:26,620 and it makes it very unlikely there's 421 00:23:26,620 --> 00:23:29,340 going to be crossing over between them, 422 00:23:29,340 --> 00:23:31,876 then they're going to act as a unit. 423 00:23:31,876 --> 00:23:38,450 And if they act as a unit, then we'll get marker correlation. 424 00:23:38,450 --> 00:23:41,140 And you can also see, effectively, 425 00:23:41,140 --> 00:23:42,940 that the effect size of those two genes 426 00:23:42,940 --> 00:23:45,520 is going to be larger. 427 00:23:45,520 --> 00:23:48,500 And in more complicated models, we obviously 428 00:23:48,500 --> 00:23:52,930 wouldn't have the same effect size for each gene. 429 00:23:52,930 --> 00:23:55,250 The effect size might be quite large for some genes, 430 00:23:55,250 --> 00:23:56,780 might be quite small for some genes. 431 00:24:00,480 --> 00:24:04,880 And we'll see the effects of marker correlation 432 00:24:04,880 --> 00:24:08,150 in a little bit. 433 00:24:08,150 --> 00:24:12,409 So the way we're going to model this is we're going to-- this 434 00:24:12,409 --> 00:24:14,200 is a definition of the variables that we're 435 00:24:14,200 --> 00:24:17,160 going to be talking about today. 436 00:24:17,160 --> 00:24:21,745 And the essential idea is quite simple. 437 00:24:33,280 --> 00:24:38,140 So the phenotype of an individual-- so p sub 438 00:24:38,140 --> 00:24:40,620 i is the phenotype of an individual, 439 00:24:40,620 --> 00:24:45,170 is going to be equal to some function of their genotype 440 00:24:45,170 --> 00:24:47,285 plus an environmental component. 441 00:24:49,890 --> 00:24:57,170 This function is the critical thing that we want to discover. 
442 00:24:57,170 --> 00:25:00,470 This function, f, is mapping from the genotype 443 00:25:00,470 --> 00:25:05,019 of an individual to its phenotype. 444 00:25:05,019 --> 00:25:06,560 And the environmental component could 445 00:25:06,560 --> 00:25:14,320 be how well something is fed, how much sunlight it gets, 446 00:25:14,320 --> 00:25:18,910 things that can greatly influence things like growth 447 00:25:18,910 --> 00:25:22,550 but they're not described by genetics. 448 00:25:22,550 --> 00:25:25,630 But this function is going to encapsulate 449 00:25:25,630 --> 00:25:29,150 what we know about how the genetics 450 00:25:29,150 --> 00:25:33,145 of a particular individual influences a trait. 451 00:25:36,600 --> 00:25:46,710 And thus, if we consider a population of individuals, 452 00:25:46,710 --> 00:25:50,250 the phenotypic variance is going to be 453 00:25:50,250 --> 00:25:57,220 equal to the genotypic variance plus the environmental variance 454 00:25:57,220 --> 00:26:05,295 plus two times the covariance between the genotype 455 00:26:05,295 --> 00:26:06,086 and the environment. 456 00:26:08,590 --> 00:26:12,280 And we're going to assume, as most studies do, 457 00:26:12,280 --> 00:26:16,200 that there is no correlation between genotype 458 00:26:16,200 --> 00:26:17,230 and environment. 459 00:26:17,230 --> 00:26:19,800 So this term disappears. 460 00:26:19,800 --> 00:26:23,410 So what we're left with is that the observed phenotypic 461 00:26:23,410 --> 00:26:25,660 variance is equal to the genotypic variance 462 00:26:25,660 --> 00:26:27,955 plus the environmental variance. 463 00:26:31,190 --> 00:26:35,870 And what we would like to do is to come up with a function 464 00:26:35,870 --> 00:26:41,960 f, that best predicts the genotypic component 465 00:26:41,960 --> 00:26:44,340 of this equation. 466 00:26:44,340 --> 00:26:47,041 There's nothing we can do about environmental variance. 467 00:26:47,041 --> 00:26:47,540 Right.
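The variance decomposition stated above can be illustrated numerically. This is a minimal sketch on simulated data, assuming Gaussian genotypic values and Gaussian environmental components drawn independently; the helper names and the chosen standard deviations are hypothetical.

```python
import random


def pvar(xs):
    """Population variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def pcov(xs, ys):
    """Population covariance."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)
g = [rng.gauss(0.0, 1.0) for _ in range(5000)]  # genotypic values f(g_i)
e = [rng.gauss(0.0, 0.5) for _ in range(5000)]  # environmental components
p = [gi + ei for gi, ei in zip(g, e)]           # phenotypes p_i = f(g_i) + e_i

var_p = pvar(p)
decomposed = pvar(g) + pvar(e) + 2 * pcov(g, e)
print(var_p, decomposed)   # identical up to floating-point error
print(pcov(g, e))          # near zero because g and e were drawn independently
```

The identity holds exactly for any data; only the claim that the covariance term is negligible depends on the independence assumption.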
468 00:26:50,680 --> 00:26:53,894 But we can measure it. 469 00:26:53,894 --> 00:26:55,310 Does anybody have any ideas how we 470 00:26:55,310 --> 00:27:00,361 could measure environmental variance? 471 00:27:00,361 --> 00:27:00,860 Yes? 472 00:27:00,860 --> 00:27:02,360 AUDIENCE: Study populations in which 473 00:27:02,360 --> 00:27:04,588 there's some kind of controlled environment. 474 00:27:08,110 --> 00:27:11,610 So you study populations that one population 475 00:27:11,610 --> 00:27:13,610 is one with a homogeneous. 476 00:27:13,610 --> 00:27:16,332 And another one was a completely different one. 477 00:27:16,332 --> 00:27:17,040 PROFESSOR: Right. 478 00:27:17,040 --> 00:27:20,880 So what we could do is we could use controls. 479 00:27:20,880 --> 00:27:26,290 So typically what we could do is we could study in environments 480 00:27:26,290 --> 00:27:28,310 where we try and control the environment exactly 481 00:27:28,310 --> 00:27:33,670 to eliminate this as much as we possibly can, for example. 482 00:27:33,670 --> 00:27:35,600 As we'll see that we also can do things 483 00:27:35,600 --> 00:27:37,590 like study clones, where individuals 484 00:27:37,590 --> 00:27:40,770 have exactly the same genotype. 485 00:27:40,770 --> 00:27:43,340 And then, all of the variance that we observe-- 486 00:27:43,340 --> 00:27:47,100 if this term vanishes because the genotypes are identical, 487 00:27:47,100 --> 00:27:48,370 it is due to the environment. 488 00:27:50,900 --> 00:27:52,520 So typically, if you're doing things 489 00:27:52,520 --> 00:27:56,680 like studying humans, since cloning humans isn't really 490 00:27:56,680 --> 00:28:01,670 a good idea to actually measure environmental variance, 491 00:28:01,670 --> 00:28:12,110 right, what you could do is you can look at identical twins. 
492 00:28:12,110 --> 00:28:15,080 And identical twins give you a way 493 00:28:15,080 --> 00:28:17,640 to get at the question of how much environmental variance 494 00:28:17,640 --> 00:28:19,325 there is for a particular phenotype. 495 00:28:22,110 --> 00:28:28,560 So in sum, this replicates what I have here 496 00:28:28,560 --> 00:28:31,230 on the left-hand side of the board. 497 00:28:31,230 --> 00:28:33,177 And note that today we'll be talking 498 00:28:33,177 --> 00:28:35,010 about the idea of discovering this function, 499 00:28:35,010 --> 00:28:38,120 f, and how well we can discover f, 500 00:28:38,120 --> 00:28:39,620 which is really important, right. 501 00:28:39,620 --> 00:28:42,530 It's fundamental to be able to predict phenotype 502 00:28:42,530 --> 00:28:44,860 from genotype. 503 00:28:44,860 --> 00:28:51,250 It's an extraordinarily central question in genetics. 504 00:28:51,250 --> 00:28:55,510 And when we do the prediction, there are two kinds of-- oh, 505 00:28:55,510 --> 00:28:56,480 there's a question? 506 00:28:56,480 --> 00:28:58,146 AUDIENCE: Could you please explain again 507 00:28:58,146 --> 00:29:01,540 why the covariance drops out or it goes away. 508 00:29:01,540 --> 00:29:03,290 PROFESSOR: Yeah, the covariance drops out 509 00:29:03,290 --> 00:29:05,180 because we're going to assume that genotype 510 00:29:05,180 --> 00:29:06,513 and environment are independent. 511 00:29:09,030 --> 00:29:12,220 Now if they're not independent, it won't drop out. 512 00:29:12,220 --> 00:29:16,980 But making that assumption-- and of course, for human studies 513 00:29:16,980 --> 00:29:19,710 you can't really make that assumption completely, right? 514 00:29:19,710 --> 00:29:22,020 And one of the problems in doing these sorts of studies 515 00:29:22,020 --> 00:29:25,830 is that it's very, very easy to get confounded.
516 00:29:25,830 --> 00:29:28,250 Because when you're trying to decompose 517 00:29:28,250 --> 00:29:31,694 the observed variance in height, for example. 518 00:29:31,694 --> 00:29:35,750 You know, there's what mom and dad provided to an individual 519 00:29:35,750 --> 00:29:37,500 in terms of their height, and there's also 520 00:29:37,500 --> 00:29:40,060 how much junior ate, right. 521 00:29:40,060 --> 00:29:42,340 And whether he went to McDonald's a lot, or you know, 522 00:29:42,340 --> 00:29:44,876 was going to Whole Foods a lot. 523 00:29:44,876 --> 00:29:46,000 You know, who knows, right? 524 00:29:46,000 --> 00:29:48,650 But this component and this component, 525 00:29:48,650 --> 00:29:50,960 it's easy to get confounded between them 526 00:29:50,960 --> 00:29:55,330 and sometimes you can imagine that genotype 527 00:29:55,330 --> 00:29:58,670 is related to place of origin in the world. 528 00:29:58,670 --> 00:30:00,566 And that has a lot to do with environment. 529 00:30:00,566 --> 00:30:02,565 And so this term wouldn't necessarily disappear. 530 00:30:07,451 --> 00:30:07,950 OK. 531 00:30:07,950 --> 00:30:09,533 So there are two kinds of heritability 532 00:30:09,533 --> 00:30:11,050 I'd like to touch upon today. 533 00:30:11,050 --> 00:30:14,590 And it's important that you remember there are two kinds 534 00:30:14,590 --> 00:30:19,810 and one is extraordinarily difficult to recover 535 00:30:19,810 --> 00:30:24,990 and the other one is in some sense, a more constrained 536 00:30:24,990 --> 00:30:27,750 problem, because we're much better at building models 537 00:30:27,750 --> 00:30:30,430 for that kind of heritability estimate. 538 00:30:30,430 --> 00:30:33,810 The first is broad-sense heritability, 539 00:30:33,810 --> 00:30:38,560 which describes the upper bound for phenotypic prediction given 540 00:30:38,560 --> 00:30:39,980 an arbitrary model.
541 00:30:39,980 --> 00:30:43,430 So it's the total contribution to phenotypic variance 542 00:30:43,430 --> 00:30:45,740 from genetic causes. 543 00:30:45,740 --> 00:30:47,660 And we can estimate that, right. 544 00:30:47,660 --> 00:30:51,420 And we'll see how we can estimate it in a moment. 545 00:30:51,420 --> 00:30:55,030 And narrow-sense heritability is defined as, 546 00:30:55,030 --> 00:30:58,710 how much of the heritability can we describe 547 00:30:58,710 --> 00:31:03,750 when we restrict f to be a linear model. 548 00:31:03,750 --> 00:31:10,480 So when f is simply linear, as the sum of terms, 549 00:31:10,480 --> 00:31:15,000 that describes the maximum narrow-sense heritability we 550 00:31:15,000 --> 00:31:17,950 can recover in terms of the fraction of phenotypic 551 00:31:17,950 --> 00:31:19,950 variance we can capture in f. 552 00:31:22,690 --> 00:31:27,820 And it's very useful because it turns out 553 00:31:27,820 --> 00:31:32,257 that we can compute both broad-sense and narrow-sense 554 00:31:32,257 --> 00:31:33,840 heritability from first principles-- I 555 00:31:33,840 --> 00:31:36,590 mean from experiment. 556 00:31:36,590 --> 00:31:41,860 And the difference between them is part of our quest today. 557 00:31:41,860 --> 00:31:44,080 Our quest is, to answer the question, 558 00:31:44,080 --> 00:31:45,990 where is the missing heritability? 559 00:31:45,990 --> 00:31:52,050 Why can't we build an Oracle f that perfectly 560 00:31:52,050 --> 00:31:55,390 predicts phenotype from genotype? 561 00:31:58,270 --> 00:32:03,519 So on that line-- I just want to give you some caveats. 562 00:32:03,519 --> 00:32:06,060 One is that we're always talking about populations when we're 563 00:32:06,060 --> 00:32:07,870 talking about heritability because it's 564 00:32:07,870 --> 00:32:10,030 how we're going to estimate it. 
565 00:32:10,030 --> 00:32:13,580 And when you hear people talk about heritability, 566 00:32:13,580 --> 00:32:15,550 oftentimes they won't qualify it in terms 567 00:32:15,550 --> 00:32:18,030 of whether it's broad-sense or narrow-sense. 568 00:32:18,030 --> 00:32:20,170 And so you should ask them if you're 569 00:32:20,170 --> 00:32:23,310 engaged in a scientific discussion with them. 570 00:32:23,310 --> 00:32:26,750 And as we've already discussed, sometimes estimation 571 00:32:26,750 --> 00:32:30,730 is difficult because of matching environment and eliminating 572 00:32:30,730 --> 00:32:33,210 this term, the environmental term 573 00:32:33,210 --> 00:32:37,260 can be a challenge when you're out of the laboratory. 574 00:32:37,260 --> 00:32:40,330 Like when you're dealing with humans. 575 00:32:40,330 --> 00:32:43,930 So, let's talk about broad-sense heritability. 576 00:32:46,660 --> 00:32:52,980 Imagine that we measure environmental variance simply 577 00:32:52,980 --> 00:32:58,650 by looking at identical twins or clones, right. 578 00:32:58,650 --> 00:33:02,470 Because if we, for example, take a bunch of yeast 579 00:33:02,470 --> 00:33:04,990 that are genotypically identical. 580 00:33:04,990 --> 00:33:07,660 And we grow them up separately, and we 581 00:33:07,660 --> 00:33:13,540 measure a trait like how well they respond 582 00:33:13,540 --> 00:33:17,870 to a particular chemical or their growth rate, 583 00:33:17,870 --> 00:33:21,760 then the variance we see from individual to individual 584 00:33:21,760 --> 00:33:27,850 is simply environmental, because they're genetically identical. 585 00:33:27,850 --> 00:33:28,440 So 586 00:33:28,440 --> 00:33:31,510 we can, in that particular case, exactly 587 00:33:31,510 --> 00:33:33,800 quantify the environmental variance 588 00:33:33,800 --> 00:33:38,200 given that every individual is genetically identical.
589 00:33:38,200 --> 00:33:41,070 We simply measure all the growth rates 590 00:33:41,070 --> 00:33:42,450 and we compute the variance. 591 00:33:42,450 --> 00:33:45,160 And that's the environmental variance. 592 00:33:45,160 --> 00:33:47,900 OK? 593 00:33:47,900 --> 00:33:51,225 As I said for humans, the best we can do is identical twins. 594 00:33:53,910 --> 00:33:56,680 Monozygotic twins. 595 00:33:56,680 --> 00:34:01,320 You can go out and for pairs of twins that are identical, 596 00:34:01,320 --> 00:34:04,630 you can measure height or any other trait that you like 597 00:34:04,630 --> 00:34:07,370 and compute the variance. 598 00:34:07,370 --> 00:34:11,239 And then that is an estimate of the environmental component 599 00:34:11,239 --> 00:34:16,690 of that, because they should be genetically identical. 600 00:34:19,650 --> 00:34:23,300 And big H squared-- broad-sense is always 601 00:34:23,300 --> 00:34:25,940 capital H squared and narrow-sense is always 602 00:34:25,940 --> 00:34:27,420 little h squared. 603 00:34:27,420 --> 00:34:29,630 Big H squared, which is broad-sense 604 00:34:29,630 --> 00:34:32,270 heritability is very simple then. 605 00:34:32,270 --> 00:34:35,290 It's the phenotypic variance, minus the environmental 606 00:34:35,290 --> 00:34:37,449 variance, over the phenotypic variance. 607 00:34:37,449 --> 00:34:40,250 So it's the fraction of phenotypic variance 608 00:34:40,250 --> 00:34:44,587 that can be explained by genetic causes. 609 00:34:44,587 --> 00:34:46,515 Is that clear to everybody? 610 00:34:49,420 --> 00:34:52,030 Any questions at all about this? 611 00:34:55,030 --> 00:34:56,270 OK. 612 00:34:56,270 --> 00:35:01,030 So, for example, on the right-hand side 613 00:35:01,030 --> 00:35:04,540 here, those three purplish squares 614 00:35:04,540 --> 00:35:09,310 have three different populations, 615 00:35:09,310 --> 00:35:12,230 which are genotypically identical.
616 00:35:12,230 --> 00:35:15,250 They have the genotypes little a little a, big A little a, 617 00:35:15,250 --> 00:35:19,280 and big A big A. And each one has a variance of 1.0. 618 00:35:19,280 --> 00:35:24,240 So since they are genetically identical, 619 00:35:24,240 --> 00:35:29,340 we know that the environmental variance has to be 1.0. 620 00:35:29,340 --> 00:35:33,640 On the left-hand side, you see the genotypic variance. 621 00:35:33,640 --> 00:35:37,420 And that reminds us of where we started today. 622 00:35:37,420 --> 00:35:40,710 It depends on the number of alleles you get of big A, 623 00:35:40,710 --> 00:35:43,390 as to what the value is. 624 00:35:43,390 --> 00:35:46,110 And when you put all of that together, 625 00:35:46,110 --> 00:35:48,920 you get a total variance of 3. 626 00:35:48,920 --> 00:35:52,840 And so big H squared is simply the genotypic variance, 627 00:35:52,840 --> 00:35:55,740 which is 2, over the total phenotypic variance, which 628 00:35:55,740 --> 00:35:56,680 is 3. 629 00:35:56,680 --> 00:35:58,210 So big H squared is 2/3. 630 00:36:01,150 --> 00:36:04,274 And so that is a way of computing 631 00:36:04,274 --> 00:36:05,315 broad-sense heritability. 632 00:36:09,580 --> 00:36:15,530 Now, if we think about our models, 633 00:36:15,530 --> 00:36:18,110 we can see that narrow-sense heritability 634 00:36:18,110 --> 00:36:20,150 has some very nice properties. 635 00:36:20,150 --> 00:36:21,210 Right. 636 00:36:21,210 --> 00:36:30,175 That is, if we build an additive model of phenotype, 637 00:36:30,175 --> 00:36:31,675 to get at narrow-sense heritability. 638 00:36:31,675 --> 00:36:37,270 So if we were to constrain f here to be linear, 639 00:36:37,270 --> 00:36:40,880 it's simply going to be a very simple linear model.
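The broad-sense calculation from the slide example above (total phenotypic variance 3, environmental variance 1) can be sketched as follows. The function names and the clone growth measurements are hypothetical and serve only to show how the two quantities would be combined.

```python
from statistics import pvariance


def environmental_variance_from_clones(measurements):
    """Variance among genetically identical individuals is treated as environmental."""
    return pvariance(measurements)


def broad_sense_heritability(var_phenotypic, var_environmental):
    """H^2 = (Var_P - Var_E) / Var_P."""
    return (var_phenotypic - var_environmental) / var_phenotypic


# Numbers from the slide example: Var_P = 3, Var_E = 1.
H2_slide = broad_sense_heritability(var_phenotypic=3.0, var_environmental=1.0)
print(H2_slide)

# Hypothetical growth rates of one clonal yeast population grown separately.
clone_growth = [0.9, 1.1, 1.0, 1.2, 0.8]
var_e = environmental_variance_from_clones(clone_growth)
print(var_e)
```

With real data the phenotypic variance would come from the genetically diverse population and the environmental variance from clones or identical twins measured for the same trait.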
640 00:36:40,880 --> 00:36:46,050 For each particular QTL that we discover, 641 00:36:46,050 --> 00:36:49,640 we assign an effect size beta to it, 642 00:36:49,640 --> 00:36:52,900 or a coefficient that describes its deviation 643 00:36:52,900 --> 00:36:57,330 from the mean for that particular trait. 644 00:36:57,330 --> 00:37:00,880 And we have an offset, beta zero. 645 00:37:00,880 --> 00:37:03,550 So our simple linear model is going to take all the discovered 646 00:37:03,550 --> 00:37:06,270 QTLs that we have-- take each QTL 647 00:37:06,270 --> 00:37:10,120 and discover which allelic form it's in. 648 00:37:10,120 --> 00:37:14,620 Typically it's considered either in zero or one form. 649 00:37:14,620 --> 00:37:23,890 And then add a beta j, where beta j is the particular QTL's 650 00:37:23,890 --> 00:37:26,220 deviation from the mean value. 651 00:37:26,220 --> 00:37:29,594 Add them all together to compute the phenotype. 652 00:37:29,594 --> 00:37:30,950 OK. 653 00:37:30,950 --> 00:37:36,540 So, this is a very simple additive model 654 00:37:36,540 --> 00:37:39,500 and a consequence of this model is 655 00:37:39,500 --> 00:37:44,620 that if you think about an F1 or a child of two parents, 656 00:37:44,620 --> 00:37:50,740 as we said earlier, a child is going to inherit roughly half 657 00:37:50,740 --> 00:37:55,380 of the alleles from mom and half of the alleles from dad. 658 00:37:55,380 --> 00:37:58,320 And so for additive models like this, 659 00:37:58,320 --> 00:38:04,120 the expected value of the child's trait value 660 00:38:04,120 --> 00:38:08,087 is going to be the midpoint of mom and dad. 661 00:38:08,087 --> 00:38:10,170 And that can be derived directly from the equation 662 00:38:10,170 --> 00:38:13,740 above, because you're getting half of the QTLs 663 00:38:13,740 --> 00:38:15,750 from mom and half of the QTLs from dad.
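The additive model and the mid-parent property described above can be sketched in a few lines. This assumes a haploid 0/1 coding at each QTL and that the child takes each QTL from mom or dad with equal probability, independently; the effect sizes, the offset, and the parental genotypes are hypothetical.

```python
from itertools import product


def additive_phenotype(genotype, betas, beta0=0.0):
    """f(g) = beta0 + sum_j beta_j * g_j, with each g_j coded 0 or 1."""
    return beta0 + sum(b * g for b, g in zip(betas, genotype))


def expected_child_phenotype(mom, dad, betas, beta0=0.0):
    """Average f over every child that takes each QTL from mom or dad."""
    children = product(*zip(mom, dad))  # all 2**m equally likely allele choices
    values = [additive_phenotype(child, betas, beta0) for child in children]
    return sum(values) / len(values)


betas = [0.5, -1.0, 2.0]  # hypothetical effect sizes
beta0 = 1.0               # hypothetical offset
mom = (1, 0, 1)
dad = (0, 0, 1)

midparent = (additive_phenotype(mom, betas, beta0)
             + additive_phenotype(dad, betas, beta0)) / 2
child_mean = expected_child_phenotype(mom, dad, betas, beta0)
print(midparent, child_mean)
```

The two printed values agree because the model is a sum of per-QTL terms; a model with interaction terms between QTLs would not generally have this property.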
664 00:38:18,350 --> 00:38:20,390 So this was observed a long time ago, right, 665 00:38:20,390 --> 00:38:30,470 because if you did studies and you looked at the deviation 666 00:38:30,470 --> 00:38:35,040 from the midpoint of parents for human height. 667 00:38:35,040 --> 00:38:42,320 You can see that the children fall pretty 668 00:38:42,320 --> 00:38:49,360 close to mid-parent line, where the y-axis here 669 00:38:49,360 --> 00:38:54,980 is the height in inches and that suggests 670 00:38:54,980 --> 00:39:04,295 that much of human height can be modeled by a narrow-sense based 671 00:39:04,295 --> 00:39:05,160 heritability model. 672 00:39:09,660 --> 00:39:17,970 Now, once again, narrow-sense heritability 673 00:39:17,970 --> 00:39:20,270 is the fraction of phenotypic variance explained 674 00:39:20,270 --> 00:39:22,620 by an additive model. 675 00:39:22,620 --> 00:39:30,690 And we've talked before about the model itself. 676 00:39:30,690 --> 00:39:32,770 And little h squared is simply going 677 00:39:32,770 --> 00:39:36,100 to be the amount of variance explained 678 00:39:36,100 --> 00:39:40,540 by the additive model over the total phenotypic variance. 679 00:39:40,540 --> 00:39:45,920 And the additive variance is shown on the right-hand side. 680 00:39:45,920 --> 00:39:50,980 That equation boils down to, you take the phenotypic variance 681 00:39:50,980 --> 00:39:57,150 and you subtract off the variance that's environmental 682 00:39:57,150 --> 00:40:01,260 and that cannot be explained by the additive variance, 683 00:40:01,260 --> 00:40:03,885 and what you're left with is the additive variance. 684 00:40:09,230 --> 00:40:12,010 And once again, coming back to the question 685 00:40:12,010 --> 00:40:15,850 of missing heritability, if we observe 686 00:40:15,850 --> 00:40:19,540 that what we can estimate for little h squared 687 00:40:19,540 --> 00:40:23,370 is below what we expect, that gap 688 00:40:23,370 --> 00:40:24,615 has to be explained somehow. 
689 00:40:27,550 --> 00:40:32,150 Some typical values for theoretical h squared. 690 00:40:32,150 --> 00:40:33,800 So this is not measured h squared 691 00:40:33,800 --> 00:40:38,020 in terms of building a model and testing it like this. 692 00:40:38,020 --> 00:40:40,020 But what we can do is we can theoretically 693 00:40:40,020 --> 00:40:43,510 estimate what h squared should be, 694 00:40:43,510 --> 00:40:46,585 by looking at the fraction of identity between individuals. 695 00:40:50,630 --> 00:40:52,930 Morphological traits tend to have 696 00:40:52,930 --> 00:40:55,810 higher h squared than the fitness traits. 697 00:40:55,810 --> 00:41:01,320 So human height has a little h squared of about 0.8. 698 00:41:01,320 --> 00:41:05,300 And for those ranchers out there in the audience, 699 00:41:05,300 --> 00:41:08,120 you'll be happy to know that cattle yearly weight has 700 00:41:08,120 --> 00:41:10,770 heritability of about 0.35. 701 00:41:10,770 --> 00:41:14,610 Now, things like life history which are fitness traits 702 00:41:14,610 --> 00:41:17,980 are less heritable. 703 00:41:17,980 --> 00:41:21,980 Which would suggest that looking at how long your parents lived 704 00:41:21,980 --> 00:41:24,470 and trying to estimate how long you're going to live 705 00:41:24,470 --> 00:41:27,502 is not as productive as looking at how tall you 706 00:41:27,502 --> 00:41:28,710 are compared to your parents. 707 00:41:32,677 --> 00:41:34,260 And there's a complete table that I've 708 00:41:34,260 --> 00:41:37,080 included in the slides for you to look at, 709 00:41:37,080 --> 00:41:41,050 but it's too small to read on the screen. 710 00:41:41,050 --> 00:41:45,020 OK, so now we're going to turn to computational models 711 00:41:45,020 --> 00:41:49,660 and how we can discover a model that figures out 712 00:41:49,660 --> 00:41:54,870 where the QTLs are, and then assigns that function f to them 713 00:41:54,870 --> 00:41:56,870 so we can predict phenotype from genotype.
714 00:41:59,560 --> 00:42:04,190 And we're going to be taking our example from this paper 715 00:42:04,190 --> 00:42:07,170 by Bloom, et al, which I posted on the Stellar site. 716 00:42:07,170 --> 00:42:10,300 And it came out last year and it's 717 00:42:10,300 --> 00:42:15,520 a wonderful study in QTL analysis. 718 00:42:15,520 --> 00:42:20,600 And the setup for this study is quite simple. 719 00:42:20,600 --> 00:42:23,140 What they did was they took two different strains 720 00:42:23,140 --> 00:42:28,730 of yeast, RM and BY, and they crossed them 721 00:42:28,730 --> 00:42:35,720 and produced roughly 1,000 F1s. 722 00:42:35,720 --> 00:42:39,750 And RM and BY are very similar. 723 00:42:39,750 --> 00:42:44,710 There are about, I think it's about 35,000 SNPs 724 00:42:44,710 --> 00:42:46,954 between them. 725 00:42:46,954 --> 00:42:50,280 Only about 0.5% of their genomes are different. 726 00:42:50,280 --> 00:42:52,130 So they're really close. 727 00:42:55,150 --> 00:42:58,670 Just for point of reference, you know, the distance between me 728 00:42:58,670 --> 00:43:03,622 and you is something like one base for every thousand? 729 00:43:03,622 --> 00:43:04,455 Something like that. 730 00:43:07,800 --> 00:43:10,120 And then they assayed all those F1s. 731 00:43:10,120 --> 00:43:12,340 They genotyped them all. 732 00:43:12,340 --> 00:43:14,980 So to genotype them, what you do is 733 00:43:14,980 --> 00:43:16,720 you know what the parental genotypes are 734 00:43:16,720 --> 00:43:18,980 because they sequenced both parents. 735 00:43:18,980 --> 00:43:23,250 The mom and dad, so to speak, at 50x coverage. 736 00:43:23,250 --> 00:43:25,750 So they knew the genome sequences completely 737 00:43:25,750 --> 00:43:27,780 for both mom and dad.
738 00:43:27,780 --> 00:43:31,620 And then for each one of the 1,000 F1s 739 00:43:31,620 --> 00:43:35,170 they put them on a microarray and what 740 00:43:35,170 --> 00:43:38,030 is shown on the very bottom left is 741 00:43:38,030 --> 00:43:40,900 a result of genotyping an individual 742 00:43:40,900 --> 00:43:44,880 where they can see each chromosome 743 00:43:44,880 --> 00:43:47,331 and whether it came from mom or from dad. 744 00:43:47,331 --> 00:43:48,830 And you can't see it here, but there 745 00:43:48,830 --> 00:43:52,540 are 16 different chromosomes and the alternating purple and 746 00:43:52,540 --> 00:43:56,370 yellow colors show whether that particular part of the genome 747 00:43:56,370 --> 00:43:59,050 came from mom or from dad. 748 00:43:59,050 --> 00:44:04,530 So they know for each individual, its source. 749 00:44:04,530 --> 00:44:07,000 From the left or the right strain. 750 00:44:07,000 --> 00:44:08,200 OK. 751 00:44:08,200 --> 00:44:12,150 And they have a thousand different genetic makeups. 752 00:44:12,150 --> 00:44:17,010 And then they asked, for each one of those individuals, 753 00:44:17,010 --> 00:44:23,420 how well could they grow in 46 different conditions? 754 00:44:23,420 --> 00:44:26,610 So they exposed them to different sugars, 755 00:44:26,610 --> 00:44:32,130 to different unfavorable environments and so forth. 756 00:44:32,130 --> 00:44:36,212 And they measured growth rate as shown on the right-hand side. 757 00:44:36,212 --> 00:44:37,920 Or right in the middle, that little thing 758 00:44:37,920 --> 00:44:40,800 that looks like a bunch of little dots of various sizes. 759 00:44:40,800 --> 00:44:43,420 By measuring colony size, they could 760 00:44:43,420 --> 00:44:47,090 measure how well the yeast were growing. 761 00:44:47,090 --> 00:44:49,680 And so they had two different things, right.
762 00:44:49,680 --> 00:44:53,800 They had the exact genotype of each individual, 763 00:44:53,800 --> 00:44:55,520 and they also had how well it was 764 00:44:55,520 --> 00:44:59,010 growing in a particular condition. 765 00:44:59,010 --> 00:45:01,250 And so for each condition, they wanted 766 00:45:01,250 --> 00:45:04,280 to associate the genotype of the individual 767 00:45:04,280 --> 00:45:05,582 to how well it was growing. 768 00:45:05,582 --> 00:45:06,290 To its phenotype. 769 00:45:09,970 --> 00:45:14,740 Now, one fair question is, of these different conditions, 770 00:45:14,740 --> 00:45:18,260 how many of them were really independent? 771 00:45:18,260 --> 00:45:20,860 And so to analyze that, they looked 772 00:45:20,860 --> 00:45:22,710 at the correlation between growth rates 773 00:45:22,710 --> 00:45:26,350 across conditions to try and figure out whether or not 774 00:45:26,350 --> 00:45:32,270 they actually had 46 different traits they were measuring. 775 00:45:32,270 --> 00:45:35,330 So this is a correlation matrix that 776 00:45:35,330 --> 00:45:38,350 is too small to read on the screen. 777 00:45:38,350 --> 00:45:41,790 The colors are somewhat visible, where the blue colors 778 00:45:41,790 --> 00:45:44,310 are perfect correlation and the red colors 779 00:45:44,310 --> 00:45:46,650 are perfect anti-correlation. 780 00:45:46,650 --> 00:45:50,839 And you can see that in certain areas of this grid, 781 00:45:50,839 --> 00:45:52,380 things are more correlated, like what 782 00:45:52,380 --> 00:45:57,040 sugars the yeast liked to eat. 783 00:45:57,040 --> 00:46:01,459 But suffice to say, they had a large collection 784 00:46:01,459 --> 00:46:02,875 of traits they wanted to estimate. 785 00:46:06,200 --> 00:46:12,720 So, now we want to build a computational model. 
786 00:46:12,720 --> 00:46:15,030 So our next step is figuring out how 787 00:46:15,030 --> 00:46:17,240 to find those places in the genome that 788 00:46:17,240 --> 00:46:21,390 allow us to predict, how well, given a trait, 789 00:46:21,390 --> 00:46:23,390 the yeast would grow. 790 00:46:23,390 --> 00:46:26,650 The actual growth rate. 791 00:46:26,650 --> 00:46:44,110 So the key idea is this-- you have genetic markers, which 792 00:46:44,110 --> 00:46:46,670 are SNPs down the genome and you're 793 00:46:46,670 --> 00:46:50,180 going to test a particular marker. 794 00:46:50,180 --> 00:46:57,360 And if this is a particular trait, 795 00:46:57,360 --> 00:47:01,320 one possibility is that-- let's say 796 00:47:01,320 --> 00:47:04,900 that this marker could be either 0 or 1. 797 00:47:04,900 --> 00:47:08,310 Without loss of generality, it could 798 00:47:08,310 --> 00:47:10,660 be that here are all the individuals where 799 00:47:10,660 --> 00:47:12,700 the marker is zero. 800 00:47:12,700 --> 00:47:19,370 And here are all the individuals where the marker is 1. 801 00:47:19,370 --> 00:47:25,790 And really, fundamentally, whether an individual 802 00:47:25,790 --> 00:47:28,470 has a 0 or a 1 marker, it doesn't really 803 00:47:28,470 --> 00:47:33,380 change its growth rate very much. 804 00:47:33,380 --> 00:47:34,220 OK? 805 00:47:34,220 --> 00:47:36,900 It's more or less identical. 806 00:47:36,900 --> 00:47:50,800 It's also possible that this is best 807 00:47:50,800 --> 00:47:57,120 modeled by two different means for a given trait. 808 00:47:57,120 --> 00:48:03,780 That when the marker is 1, you're growing-- actually 809 00:48:03,780 --> 00:48:10,210 this is going to be the growth rate on the x-axis. 810 00:48:10,210 --> 00:48:11,890 The y-axis is the density. 811 00:48:11,890 --> 00:48:14,970 That you're growing much better when you have a 1 812 00:48:14,970 --> 00:48:18,970 in that marker position than a zero.
813 00:48:18,970 --> 00:48:22,240 And so we need to distinguish between these two cases 814 00:48:22,240 --> 00:48:25,350 when the marker is predictive of growth rate 815 00:48:25,350 --> 00:48:28,000 and when the marker is not predictive of growth rate. 816 00:48:30,630 --> 00:48:32,770 And we've talked about log likelihood tests before 817 00:48:32,770 --> 00:48:36,060 and you can see one on the very top. 818 00:48:36,060 --> 00:48:38,410 And you can see there's an additional degree of freedom 819 00:48:38,410 --> 00:48:41,117 that we have in the top prediction versus the bottom 820 00:48:41,117 --> 00:48:42,950 because we're using two different means that 821 00:48:42,950 --> 00:48:47,530 are conditioned upon the genotypic value 822 00:48:47,530 --> 00:48:48,738 at a particular marker. 823 00:48:53,230 --> 00:48:57,650 So we have a lot of different markers indeed. 824 00:48:57,650 --> 00:49:00,980 So we have-- let's see here, the exact number. 825 00:49:00,980 --> 00:49:06,770 I think it's about 13,000 markers they had in this study. 826 00:49:06,770 --> 00:49:07,280 No. 827 00:49:07,280 --> 00:49:12,620 11,623 different unique markers they found. 828 00:49:12,620 --> 00:49:15,500 That they could discover, that weren't linked together. 829 00:49:15,500 --> 00:49:18,210 We talked about linkage earlier on. 830 00:49:18,210 --> 00:49:23,260 So you've got over 11,000 markers. 831 00:49:23,260 --> 00:49:26,430 You're going to do a log likelihood 832 00:49:26,430 --> 00:49:29,065 test to compute this log odds score. 833 00:49:33,750 --> 00:49:36,180 Do we have to worry about multiple hypothesis correction 834 00:49:36,180 --> 00:49:38,640 here? 835 00:49:38,640 --> 00:49:41,240 Because you're testing over 11,000 836 00:49:41,240 --> 00:49:42,810 markers to see whether or not they're 837 00:49:42,810 --> 00:49:45,360 significant for one trait. 838 00:49:45,360 --> 00:49:45,860 Right.
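The single-marker test described above, one mean versus two genotype-specific means, can be sketched as a LOD score. This is a minimal sketch assuming Gaussian phenotypes with a shared variance estimated by maximum likelihood, which gives LOD = (n/2) log10(RSS_null / RSS_alt); it is a common textbook formulation and not necessarily the exact statistic on the slide or in the paper. The toy data are hypothetical.

```python
import math


def rss(values):
    """Residual sum of squares around the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def lod_score(marker, phenotype):
    """log10 likelihood ratio: two marker-specific means versus one shared mean."""
    group0 = [p for m, p in zip(marker, phenotype) if m == 0]
    group1 = [p for m, p in zip(marker, phenotype) if m == 1]
    if not group0 or not group1:
        return 0.0  # marker does not vary, so it carries no information
    rss_null = rss(phenotype)
    if rss_null == 0.0:
        return 0.0  # phenotype does not vary
    rss_alt = rss(group0) + rss(group1)
    if rss_alt == 0.0:
        return math.inf  # the marker separates the phenotype perfectly
    return (len(phenotype) / 2.0) * math.log10(rss_null / rss_alt)


marker = [0, 0, 0, 1, 1, 1]
informative = [1.0, 1.1, 0.9, 2.0, 2.1, 1.9]    # growth differs by marker
uninformative = [1.0, 2.0, 1.5, 1.0, 2.0, 1.5]  # same distribution in both groups

print(lod_score(marker, informative))
print(lod_score(marker, uninformative))
```

The extra degree of freedom mentioned in the lecture is the second mean: the alternative model can never fit worse than the null, so the score is non-negative and needs a threshold to be interpreted.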
839 00:49:52,920 --> 00:50:01,520 So one thing that we could do is imagine that what we did was 840 00:50:01,520 --> 00:50:05,970 we scrambled the association between phenotypes 841 00:50:05,970 --> 00:50:07,620 and individuals. 842 00:50:07,620 --> 00:50:11,136 So we just randomized it and we did that a thousand times. 843 00:50:11,136 --> 00:50:15,670 And each time we did it, we computed the distribution 844 00:50:15,670 --> 00:50:18,330 of these LOD scores. 845 00:50:18,330 --> 00:50:23,400 Because we have broken the association between phenotype 846 00:50:23,400 --> 00:50:26,410 and genotype, the LOD scores which 847 00:50:26,410 --> 00:50:29,680 we should be seeing if we did this randomization, 848 00:50:29,680 --> 00:50:33,360 should correspond to essentially noise. 849 00:50:33,360 --> 00:50:35,360 What we would see at random. 850 00:50:35,360 --> 00:50:39,750 So it's a null distribution we can look at. 851 00:50:39,750 --> 00:50:44,875 And so what we'll see is a distribution of LOD scores. 852 00:50:49,410 --> 00:50:51,440 This is the LOD. 853 00:50:51,440 --> 00:51:03,980 This is the probability from a null, a permutation test. 854 00:51:03,980 --> 00:51:07,970 And since we actually have done the randomization 855 00:51:07,970 --> 00:51:15,725 over all 11,000 markers, we can directly draw a line 856 00:51:15,725 --> 00:51:19,960 and ask what are the chances that a LOD score would 857 00:51:19,960 --> 00:51:25,500 be greater than or equal to a particular value at random? 858 00:51:25,500 --> 00:51:27,730 And we can pick an area inside this tail, 859 00:51:27,730 --> 00:51:29,660 let's say 0.05, because that's what 860 00:51:29,660 --> 00:51:32,590 the authors of this particular paper used 861 00:51:32,590 --> 00:51:37,780 and ask what value of a LOD score 862 00:51:37,780 --> 00:51:42,500 would be very unlikely to have by chance? 863 00:51:42,500 --> 00:51:48,330 It turns out in their first iteration, it was 2.63.
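The permutation procedure just described can be sketched as follows. This version records the largest LOD score across all markers in each shuffled data set and takes the upper 5% point of that distribution, which is one common way to get a genome-wide threshold; whether the paper computed its 2.63 cutoff in exactly this way is an assumption. The helper names and the toy data (32 individuals, 5 balanced markers) are hypothetical.

```python
import math
import random


def rss(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def lod_score(marker, phenotype):
    """(n/2) * log10(RSS_null / RSS_alt) for a 0/1 marker (Gaussian model)."""
    group0 = [p for m, p in zip(marker, phenotype) if m == 0]
    group1 = [p for m, p in zip(marker, phenotype) if m == 1]
    if not group0 or not group1:
        return 0.0
    rss_null = rss(phenotype)
    if rss_null == 0.0:
        return 0.0
    rss_alt = rss(group0) + rss(group1)
    if rss_alt == 0.0:
        return math.inf
    return (len(phenotype) / 2.0) * math.log10(rss_null / rss_alt)


def permutation_threshold(markers, phenotype, n_perm=1000, alpha=0.05, seed=0):
    """LOD value exceeded by the genome-wide maximum in about alpha of shuffles."""
    rng = random.Random(seed)
    shuffled = list(phenotype)
    null_max = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the phenotype-genotype association
        null_max.append(max(lod_score(m, shuffled) for m in markers))
    null_max.sort()
    n_exceed = round(alpha * n_perm)
    return null_max[max(0, n_perm - n_exceed - 1)]


n = 32
markers = [[(i >> j) & 1 for i in range(n)] for j in range(5)]
noise = [0.1 if ((i >> 2) & 1) == ((i >> 3) & 1) else -0.1 for i in range(n)]
phenotype = [3.0 * markers[0][i] + noise[i] for i in range(n)]  # marker 0 is causal

threshold = permutation_threshold(markers, phenotype, n_perm=200)
observed = lod_score(markers[0], phenotype)
print(threshold, observed)
```

Because each shuffle is scored across every marker at once, the resulting threshold already accounts for the number of markers tested.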
864 00:51:48,330 --> 00:51:53,500 That a lod score over 2.63 had a 0.05 chance 865 00:51:53,500 --> 00:51:59,430 or less of occurring in randomly permuted data. 866 00:51:59,430 --> 00:52:03,640 And since the permuted data contained all of the markers, 867 00:52:03,640 --> 00:52:07,700 we don't have to do any multiple hypothesis correction. 868 00:52:07,700 --> 00:52:10,410 So you can directly compare the statistic 869 00:52:10,410 --> 00:52:15,520 that you compute against a threshold 870 00:52:15,520 --> 00:52:21,330 and accept any marker or QTL that has a lod score greater, 871 00:52:21,330 --> 00:52:26,720 in this case, than 2.63 and put it in your model. 872 00:52:26,720 --> 00:52:30,200 And everything else you can reject. 873 00:52:30,200 --> 00:52:32,520 And so you start by building a model out 874 00:52:32,520 --> 00:52:34,870 of all of the markers that are significant 875 00:52:34,870 --> 00:52:36,670 at this particular level. 876 00:52:39,750 --> 00:52:44,080 You then assemble the model and you can now 877 00:52:44,080 --> 00:52:47,860 predict phenotype from genotype. 878 00:52:47,860 --> 00:52:50,710 But of course, you're going to make errors, right. 879 00:52:50,710 --> 00:52:53,490 For each individual, there's going to be an error. 880 00:52:53,490 --> 00:53:05,370 You're going to have a residual for each individual that 881 00:53:05,370 --> 00:53:15,880 is going to be the phenotype minus the value predicted from the genotype 882 00:53:15,880 --> 00:53:18,390 of the individual. 883 00:53:18,390 --> 00:53:22,200 So this is the error that you're making. 884 00:53:22,200 --> 00:53:27,060 So what these folks did was that you first 885 00:53:27,060 --> 00:53:34,150 look at predicting the phenotype directly, 886 00:53:34,150 --> 00:53:37,620 and you pick all the QTLs that are significant at that level. 887 00:53:37,620 --> 00:53:40,097 And then you compute the residuals 888 00:53:40,097 --> 00:53:41,680 and you try and predict the residuals.
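The permutation threshold described above can be sketched as follows. This is one common way to get a genome-wide cutoff, assumed here for illustration and not necessarily the exact procedure of the paper: shuffle the phenotypes, scan every marker, keep the largest lod score from that scan, and take the (1 - alpha) quantile of those maxima. Because each shuffle scans all markers, the cutoff already accounts for the number of tests. The data are simulated and the function names are hypothetical.

```python
import math
import random

def rss(values):
    """Residual sum of squares of the values around their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def lod_score(genotypes, phenotypes):
    """LOD for one 0/1 marker under a shared-variance Gaussian model."""
    rss_null = rss(phenotypes)
    groups = [[p for g, p in zip(genotypes, phenotypes) if g == a] for a in (0, 1)]
    rss_alt = sum(rss(group) for group in groups if group)
    if rss_null == 0.0:
        return 0.0
    if rss_alt == 0.0:
        return math.inf
    return (len(phenotypes) / 2.0) * math.log10(rss_null / rss_alt)

def permutation_threshold(markers, phenotypes, n_perm=1000, alpha=0.05, seed=0):
    """Genome-wide lod cutoff from phenotype shuffles (null distribution)."""
    rng = random.Random(seed)
    shuffled = list(phenotypes)
    null_max = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the genotype-phenotype association
        null_max.append(max(lod_score(m, shuffled) for m in markers))
    null_max.sort()
    # Roughly a fraction alpha of the null maxima lie above this value.
    return null_max[min(n_perm - 1, int((1.0 - alpha) * n_perm))]

# Simulated example: 30 unlinked markers, 60 individuals, pure-noise phenotype.
data_rng = random.Random(1)
markers = [[data_rng.randint(0, 1) for _ in range(60)] for _ in range(30)]
phenotypes = [data_rng.gauss(0.0, 1.0) for _ in range(60)]
threshold = permutation_threshold(markers, phenotypes, n_perm=200)
print(f"lod cutoff at alpha = 0.05: {threshold:.2f}")
```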
889 00:53:44,700 --> 00:53:49,370 And you try and find additional QTLs 890 00:53:49,370 --> 00:53:55,910 that are significant after you have picked the original ones. 891 00:53:55,910 --> 00:53:57,410 OK. 892 00:53:57,410 --> 00:54:02,190 So why might this produce more QTLs than the original pass? 893 00:54:09,040 --> 00:54:11,190 What do you think? 894 00:54:11,190 --> 00:54:14,930 Why is it that trying to predict the residuals is 895 00:54:14,930 --> 00:54:17,642 a good idea after you've tried to predict 896 00:54:17,642 --> 00:54:18,600 the phenotype directly? 897 00:54:23,530 --> 00:54:25,168 Any ideas about that? 898 00:54:34,060 --> 00:54:36,550 Well, what this is telling us, is 899 00:54:36,550 --> 00:54:39,640 that these QTLs we're going to predict now 900 00:54:39,640 --> 00:54:44,310 were not significant enough in the original pass, 901 00:54:44,310 --> 00:54:48,210 but when we're looking at what's left over, after we subtract 902 00:54:48,210 --> 00:54:50,660 off the effect of all the other QTLs, 903 00:54:50,660 --> 00:54:52,916 other things might pop up. 904 00:54:52,916 --> 00:54:57,300 But in some sense, they were obscured by the original QTLs. 905 00:54:57,300 --> 00:55:00,890 Once we subtract off their influence, 906 00:55:00,890 --> 00:55:04,500 we can see things that we didn't see before. 907 00:55:04,500 --> 00:55:07,470 And we start gathering up these additional QTLs 908 00:55:07,470 --> 00:55:10,340 to predict the residual components. 909 00:55:10,340 --> 00:55:13,290 And so they do this three times. 910 00:55:13,290 --> 00:55:15,650 So they predict the original set of QTLs 911 00:55:15,650 --> 00:55:20,390 and then they iterate three times on the residuals 912 00:55:20,390 --> 00:55:24,150 to find and fit a linear model that predicts a given 913 00:55:24,150 --> 00:55:28,230 trait from a collection of QTLs that they discover. 914 00:55:28,230 --> 00:55:29,830 Yes? 915 00:55:29,830 --> 00:55:30,496 AUDIENCE: Sorry.
916 00:55:30,496 --> 00:55:32,670 I'm still confused. 917 00:55:32,670 --> 00:55:40,328 The second round? [INAUDIBLE] done three additional times? 918 00:55:40,328 --> 00:55:41,810 Is that right? 919 00:55:41,810 --> 00:55:42,798 So the-- 920 00:55:42,798 --> 00:55:45,280 PROFESSOR: Yes. 921 00:55:45,280 --> 00:55:47,588 AUDIENCE: Is it done on the remainder of QTL 922 00:55:47,588 --> 00:55:50,444 or on the original list of every-- 923 00:55:50,444 --> 00:55:52,710 PROFESSOR: Each time you expand your model 924 00:55:52,710 --> 00:55:55,780 to include all the QTLs you've discovered up to that point. 925 00:55:55,780 --> 00:56:01,111 So initially, you discover a set of QTLs, call that set one. 926 00:56:01,111 --> 00:56:04,620 You then compute a model using set one 927 00:56:04,620 --> 00:56:07,555 and you discover the residuals. 928 00:56:07,555 --> 00:56:08,505 AUDIENCE: [INAUDIBLE]. 929 00:56:08,505 --> 00:56:09,296 PROFESSOR: Correct. 930 00:56:09,296 --> 00:56:10,890 Well, residual [INAUDIBLE] so you use 931 00:56:10,890 --> 00:56:13,425 set one to build a model, a phenotype. 932 00:56:16,280 --> 00:56:20,480 So set one is used here to compute this, right. 933 00:56:20,480 --> 00:56:21,740 And so set one is used. 934 00:56:21,740 --> 00:56:23,350 And then you compute what's left over 935 00:56:23,350 --> 00:56:26,980 after you've discovered the first set of QTLs. 936 00:56:26,980 --> 00:56:30,980 Now you say, we still have this left to go. 937 00:56:30,980 --> 00:56:32,380 Let's discover some more QTLs. 938 00:56:32,380 --> 00:56:35,950 And now you discover set two of QTLs. 939 00:56:35,950 --> 00:56:37,260 OK. 940 00:56:37,260 --> 00:56:41,520 And that set two then is used to build a model that has set one 941 00:56:41,520 --> 00:56:44,010 and set two in it. 942 00:56:44,010 --> 00:56:44,510 Right. 943 00:56:44,510 --> 00:56:46,200 And that residual is used to discover 944 00:56:46,200 --> 00:56:49,850 set three and so forth. 
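The set one, set two, set three procedure described above can be sketched as a loop. Two simplifications are assumed here and are not from the study: every marker over the cutoff is accepted (no peak picking among linked markers), and instead of refitting one joint linear model, each newly accepted marker's effect is subtracted from the current residuals. The cutoff is recomputed by permutation in every round, as stated in the lecture. All names and the simulated data are hypothetical.

```python
import math
import random

def rss(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def lod_score(genotypes, phenotypes):
    """LOD for one 0/1 marker under a shared-variance Gaussian model."""
    rss_null = rss(phenotypes)
    groups = [[p for g, p in zip(genotypes, phenotypes) if g == a] for a in (0, 1)]
    rss_alt = sum(rss(group) for group in groups if group)
    if rss_null == 0.0:
        return 0.0
    if rss_alt == 0.0:
        return math.inf
    return (len(phenotypes) / 2.0) * math.log10(rss_null / rss_alt)

def permutation_threshold(markers, values, n_perm, alpha, seed):
    """Quantile of the per-shuffle maximum lod score."""
    rng = random.Random(seed)
    shuffled = list(values)
    null_max = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null_max.append(max(lod_score(m, shuffled) for m in markers))
    null_max.sort()
    return null_max[min(n_perm - 1, int((1.0 - alpha) * n_perm))]

def remove_marker_effect(genotypes, values):
    """Subtract each allele-class mean, leaving what the marker cannot explain."""
    means = {}
    for allele in (0, 1):
        group = [v for g, v in zip(genotypes, values) if g == allele]
        means[allele] = sum(group) / len(group) if group else 0.0
    return [v - means[g] for g, v in zip(genotypes, values)]

def find_qtls(markers, phenotypes, rounds=4, n_perm=100, alpha=0.05, seed=0):
    """One pass on the phenotype, then further passes on the residuals."""
    residuals = list(phenotypes)
    selected = []
    for r in range(rounds):
        cutoff = permutation_threshold(markers, residuals, n_perm, alpha, seed + r)
        scores = [lod_score(m, residuals) for m in markers]
        new = [i for i, s in enumerate(scores) if s > cutoff and i not in selected]
        if not new:
            break  # nothing left above the cutoff
        for i in new:
            residuals = remove_marker_effect(markers[i], residuals)
        selected.extend(new)
    return selected

# Simulated cross: 200 segregants, 30 unlinked markers, three causal markers.
sim = random.Random(7)
markers = [[sim.randint(0, 1) for _ in range(200)] for _ in range(30)]
phenotypes = [2.0 * markers[3][j] + 1.0 * markers[17][j] + 0.6 * markers[8][j]
              + sim.gauss(0.0, 0.5) for j in range(200)]
found = find_qtls(markers, phenotypes)
print("markers in the model:", sorted(found))
```

In this simulation the weakest marker has a small share of the total variance but a much larger share of what is left after the two strong markers are removed, which is the reason for scanning the residuals.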
945 00:56:49,850 --> 00:56:52,790 So each time you're expanding the set of QTLs 946 00:56:52,790 --> 00:56:54,860 by what you've discovered in the residuals. 947 00:56:54,860 --> 00:56:57,140 Sort of in the trash bin so to speak. 948 00:56:57,140 --> 00:56:57,640 Yes? 949 00:56:57,640 --> 00:57:00,130 AUDIENCE: Each time you're doing this randomization 950 00:57:00,130 --> 00:57:01,130 to determine lod cutoff? 951 00:57:01,130 --> 00:57:02,213 PROFESSOR: That's correct. 952 00:57:02,213 --> 00:57:04,285 Each time you have to redo the randomization 953 00:57:04,285 --> 00:57:05,770 and get to the lod cutoff. 954 00:57:05,770 --> 00:57:07,502 AUDIENCE: But does that method actually 955 00:57:07,502 --> 00:57:10,581 work the way you expect it on the second pass, given that you 956 00:57:10,581 --> 00:57:12,205 have some false positives from the pass 957 00:57:12,205 --> 00:57:17,052 that you've now subtracted from your data? 958 00:57:17,052 --> 00:57:19,135 PROFESSOR: I'm not sure I understand the question. 959 00:57:19,135 --> 00:57:21,115 AUDIENCE: So the second time you do this randomization, 960 00:57:21,115 --> 00:57:22,739 and you again come up with a threshold, 961 00:57:22,739 --> 00:57:27,287 you say, oh, above here there are 5% false positives. 962 00:57:27,287 --> 00:57:27,995 PROFESSOR: Right. 963 00:57:27,995 --> 00:57:31,324 AUDIENCE: But could it be that that estimate is actually 964 00:57:31,324 --> 00:57:35,680 significantly wrong based on the fact that you've subtracted off 965 00:57:35,680 --> 00:57:39,068 false positives before you do that process? 966 00:57:41,630 --> 00:57:43,590 PROFESSOR: I mean, in some sense, what's 967 00:57:43,590 --> 00:57:46,450 your definition of a false positive? 968 00:57:46,450 --> 00:57:46,950 Right. 969 00:57:46,950 --> 00:57:50,080 I mean it gets down to that because we've 970 00:57:50,080 --> 00:57:53,020 discovered there's an association between that QTL 971 00:57:53,020 --> 00:57:54,750 and predicting phenotype.
972 00:57:54,750 --> 00:57:59,230 And in this particular world it's useful for doing that. 973 00:57:59,230 --> 00:58:01,860 So it's hard to call something a false positive in that sense, 974 00:58:01,860 --> 00:58:03,932 right. 975 00:58:03,932 --> 00:58:05,390 But you're right, you actually have 976 00:58:05,390 --> 00:58:09,327 to reset your threshold every time 977 00:58:09,327 --> 00:58:10,785 that you go through this iteration. 978 00:58:14,900 --> 00:58:15,650 Good question. 979 00:58:15,650 --> 00:58:16,316 Other questions? 980 00:58:19,710 --> 00:58:21,300 OK. 981 00:58:21,300 --> 00:58:25,160 So, let's see what happens when you do this. 982 00:58:25,160 --> 00:58:28,910 What happens is that if you look down the genome, 983 00:58:28,910 --> 00:58:30,670 you discover a collection. 984 00:58:30,670 --> 00:58:42,040 For example, this is growth in E6 berbamine. 985 00:58:42,040 --> 00:58:45,130 And you can see the significant locations 986 00:58:45,130 --> 00:58:48,590 in the genome, the numbers 1 through 16 of the chromosomes 987 00:58:48,590 --> 00:58:51,960 and the little red asterisks above the peaks 988 00:58:51,960 --> 00:58:54,010 indicate that that was a significant lod score. 989 00:58:54,010 --> 00:58:56,729 The y-axis is a lod score. 990 00:58:56,729 --> 00:58:58,520 And you can see the locations in the genome 991 00:58:58,520 --> 00:59:05,030 where we have found places that were associated with growth 992 00:59:05,030 --> 00:59:10,010 rate in that particular chemical. 993 00:59:10,010 --> 00:59:11,810 OK. 994 00:59:11,810 --> 00:59:15,890 Now, why is it, do you think, that in many of those places 995 00:59:15,890 --> 00:59:20,520 you see sort of a rise and fall that is somewhat gentle 996 00:59:20,520 --> 00:59:23,000 as opposed to having an impulse function 997 00:59:23,000 --> 00:59:24,400 right at that particular spot? 998 00:59:30,029 --> 00:59:31,445 AUDIENCE: Nearby SNPs are linked?
999 00:59:31,445 --> 00:59:33,153 PROFESSOR: Yeah, nearby SNPs are linked. 1000 00:59:33,153 --> 00:59:37,930 That as you come up to a place that is causal, 1001 00:59:37,930 --> 00:59:42,000 you get a lot of other things that are linked to that. 1002 00:59:42,000 --> 00:59:44,880 And the closer you get, the higher the correlation is. 1003 00:59:48,440 --> 00:59:52,910 So that is for 1,000 segregants in the top. 1004 00:59:52,910 --> 00:59:58,550 And what was discovered for that particular trait, 1005 00:59:58,550 --> 01:00:04,690 was 15 different loci that explained 1006 01:00:04,690 --> 01:00:10,090 78% of the phenotypic variance. 1007 01:00:10,090 --> 01:00:14,650 And in the bottom, the same procedure 1008 01:00:14,650 --> 01:00:20,040 was used, but was only used on 100 segregants. 1009 01:00:20,040 --> 01:00:23,350 And what you can see is that, in this particular case, 1010 01:00:23,350 --> 01:00:27,330 only two loci were discovered that explain 1011 01:00:27,330 --> 01:00:28,715 21% of the variance. 1012 01:00:31,450 --> 01:00:33,730 So the bottom study was grossly under powered. 1013 01:00:36,740 --> 01:00:41,240 Remember we talked about the problem of finding 1014 01:00:41,240 --> 01:00:45,070 QTLs that had small effect sizes. 1015 01:00:45,070 --> 01:00:47,230 And if you don't have enough individuals 1016 01:00:47,230 --> 01:00:49,830 you're going to be under-powered and you can't actually 1017 01:00:49,830 --> 01:00:51,210 identify all of the QTLs. 1018 01:00:54,080 --> 01:00:57,620 So this is a comparison of this. 1019 01:00:57,620 --> 01:01:00,950 And of course, one of the things that you don't know 1020 01:01:00,950 --> 01:01:05,010 is the environmental variance that you're fighting against. 1021 01:01:05,010 --> 01:01:07,160 Because the number of individuals 1022 01:01:07,160 --> 01:01:11,490 you need, depends both on the number of potential loci 1023 01:01:11,490 --> 01:01:12,890 that you have.
1024 01:01:12,890 --> 01:01:17,380 The more loci you have, the more individuals you need to fight 1025 01:01:17,380 --> 01:01:19,210 against the multiple hypotheses problem, 1026 01:01:19,210 --> 01:01:21,465 which is taken care of by this permutation implicitly. 1027 01:01:24,870 --> 01:01:28,220 And the more QTLs that contribute 1028 01:01:28,220 --> 01:01:31,320 to a particular trait, the smaller they might be. 1029 01:01:31,320 --> 01:01:33,080 And there you need more individuals 1030 01:01:33,080 --> 01:01:34,900 to provide adequate power for your test. 1031 01:01:39,810 --> 01:01:43,400 And out of this model, however, if you 1032 01:01:43,400 --> 01:01:48,090 look, for all the different traits, at the predicted 1033 01:01:48,090 --> 01:01:50,040 versus the observed phenotype, you 1034 01:01:50,040 --> 01:01:52,565 can see that the model does a reasonably good job. 1035 01:01:56,010 --> 01:02:03,990 So the interesting things that came out of the study 1036 01:02:03,990 --> 01:02:06,620 were that, first of all, it was possible to look 1037 01:02:06,620 --> 01:02:11,860 at the effect sizes of each QTL. 1038 01:02:11,860 --> 01:02:16,700 Now, the effect size in terms of fraction of variance explained 1039 01:02:16,700 --> 01:02:21,590 of a particular marker, is the square of its coefficient. 1040 01:02:21,590 --> 01:02:23,990 It's the beta squared. 1041 01:02:23,990 --> 01:02:29,350 So you can see here the histogram of effect sizes, 1042 01:02:29,350 --> 01:02:33,440 and you can see that most QTLs have very small effects 1043 01:02:33,440 --> 01:02:39,700 on phenotype where phenotype is scaled between 0 and 1 1044 01:02:39,700 --> 01:02:40,400 for this study. 1045 01:02:43,680 --> 01:02:49,790 So, most traits as described here 1046 01:02:49,790 --> 01:02:53,690 have between 5 and 29 different QTL loci in the genome 1047 01:02:53,690 --> 01:02:56,405 that are used to describe them, with a median of 12.
1048 01:03:00,240 --> 01:03:04,260 Now, the question the authors asked, 1049 01:03:04,260 --> 01:03:11,280 was if they looked at the theoretical h squared that they 1050 01:03:11,280 --> 01:03:16,560 computed for the F1s, how well did their model do? 1051 01:03:16,560 --> 01:03:18,680 And you can see that their model does very well. 1052 01:03:18,680 --> 01:03:22,630 That, in terms of looking at narrow sense heritability, 1053 01:03:22,630 --> 01:03:25,560 they can recover almost all of it, all the time. 1054 01:03:29,860 --> 01:03:35,526 However, the problem comes here. 1055 01:03:35,526 --> 01:03:37,150 Remember we talked about how to compute 1056 01:03:37,150 --> 01:03:44,430 broad-sense heritability by looking at clones 1057 01:03:44,430 --> 01:03:47,830 and computing environmental variance directly. 1058 01:03:47,830 --> 01:03:51,270 And so they were able to compute broad-sense heritability 1059 01:03:51,270 --> 01:03:55,080 and compare that to the narrow-sense heritability 1060 01:03:55,080 --> 01:03:57,684 that they were able to actually achieve in the study. 1061 01:03:57,684 --> 01:03:59,475 And you can see there are substantial gaps. 1062 01:04:02,430 --> 01:04:06,900 So what could be making up those gaps? 1063 01:04:06,900 --> 01:04:12,380 Why is it that this additive model can't explain growth rate 1064 01:04:12,380 --> 01:04:15,770 in a particular condition? 1065 01:04:15,770 --> 01:04:20,440 So, the next thing that we're going to discover 1066 01:04:20,440 --> 01:04:24,680 are some of the sources of this so-called missing heritability. 1067 01:04:24,680 --> 01:04:26,900 But before I give you some of the stock answers 1068 01:04:26,900 --> 01:04:29,830 that people in the field give, since this is part of our quest 1069 01:04:29,830 --> 01:04:33,370 today to actually look into missing heritability, 1070 01:04:33,370 --> 01:04:36,960 I'll put it to you, my panel of experts.
1071 01:04:36,960 --> 01:04:39,450 What could be causing this heritability to go missing? 1072 01:04:39,450 --> 01:04:46,390 Why can't this additive model predict growth rate accurately, 1073 01:04:46,390 --> 01:04:50,190 given it knows the genotype exactly? 1074 01:04:50,190 --> 01:04:51,010 Yes. 1075 01:04:51,010 --> 01:04:54,986 AUDIENCE: [INAUDIBLE] that you wouldn't 1076 01:04:54,986 --> 01:04:56,980 detect from looking at the DNA sequence. 1077 01:04:56,980 --> 01:04:59,080 PROFESSOR: So epigenetic factors-- are 1078 01:04:59,080 --> 01:05:00,980 you talking about protein factors or are you 1079 01:05:00,980 --> 01:05:02,355 talking about epigenetic effects? 1080 01:05:02,355 --> 01:05:04,117 AUDIENCE: More of the epigenetic marks. 1081 01:05:04,117 --> 01:05:05,450 PROFESSOR: Epigenetic marks, OK. 1082 01:05:05,450 --> 01:05:09,545 So it might be now, yeast doesn't have DNA methylation. 1083 01:05:12,490 --> 01:05:15,670 It does have chromatin modifications 1084 01:05:15,670 --> 01:05:18,020 in the form of histone marks. 1085 01:05:18,020 --> 01:05:20,900 So it might be that there's some histone marks that 1086 01:05:20,900 --> 01:05:24,750 are copied from generation to generation that are not 1087 01:05:24,750 --> 01:05:26,310 accounted for in our model, 1088 01:05:26,310 --> 01:05:28,330 right? 1089 01:05:28,330 --> 01:05:29,900 OK, that's one possibility. 1090 01:05:29,900 --> 01:05:30,440 Great. 1091 01:05:30,440 --> 01:05:30,940 Yes. 1092 01:05:30,940 --> 01:05:33,874 AUDIENCE: There could be more complex effects 1093 01:05:33,874 --> 01:05:37,297 so two separate genes may come out, other than just adding. 1094 01:05:37,297 --> 01:05:38,764 One could turn the other off. 1095 01:05:38,764 --> 01:05:43,170 So if one's on, it could [INAUDIBLE]. 1096 01:05:43,170 --> 01:05:43,910 PROFESSOR: Right. 1097 01:05:43,910 --> 01:05:47,000 So those are called epistatic effects, 1098 01:05:47,000 --> 01:05:48,250 or they're non-linear effects.
1099 01:05:48,250 --> 01:05:51,036 They're gene-gene interaction effects. 1100 01:05:51,036 --> 01:05:52,410 That's actually thought to be one 1101 01:05:52,410 --> 01:05:56,915 of the major issues in missing heritability. 1102 01:06:00,090 --> 01:06:02,570 What else could there be? 1103 01:06:02,570 --> 01:06:03,080 Yes. 1104 01:06:03,080 --> 01:06:03,996 AUDIENCE: [INAUDIBLE]. 1105 01:06:16,872 --> 01:06:17,580 PROFESSOR: Right. 1106 01:06:17,580 --> 01:06:21,390 So you're saying that there could be inherent noise that 1107 01:06:21,390 --> 01:06:24,250 would cause there to be fluctuations in colony size 1108 01:06:24,250 --> 01:06:26,550 that are unrelated to the genotype. 1109 01:06:26,550 --> 01:06:27,967 And, in fact, that's a good point. 1110 01:06:27,967 --> 01:06:29,508 And that's something that we're going 1111 01:06:29,508 --> 01:06:31,530 to take care of with the environmental variance. 1112 01:06:31,530 --> 01:06:34,580 So we're going to measure how well individuals 1113 01:06:34,580 --> 01:06:37,900 grow with exactly the same genotype in a given condition. 1114 01:06:37,900 --> 01:06:40,860 And so that kind of fluctuation would 1115 01:06:40,860 --> 01:06:43,200 appear in that variance term. 1116 01:06:43,200 --> 01:06:46,010 And we're going to get rid of that. 1117 01:06:46,010 --> 01:06:48,900 But that's a good thought and I think it's important and not 1118 01:06:48,900 --> 01:06:53,110 appreciated that there can be random fluctuations 1119 01:06:53,110 --> 01:06:55,710 in that term. 1120 01:06:55,710 --> 01:06:56,545 Any other ideas? 1121 01:07:02,130 --> 01:07:05,240 So we have epistasis. 1122 01:07:05,240 --> 01:07:06,510 We have epigenetics. 1123 01:07:06,510 --> 01:07:09,170 We've got two E's so far. 1124 01:07:09,170 --> 01:07:10,050 Anything else? 
1125 01:07:19,430 --> 01:07:27,870 How about if there are a lot of different loci 1126 01:07:27,870 --> 01:07:33,860 that are influencing a particular trait, 1127 01:07:33,860 --> 01:07:37,180 but the effect sizes are very small. 1128 01:07:37,180 --> 01:07:38,940 That we've captured, sort of the cream. 1129 01:07:38,940 --> 01:07:40,140 We've skimmed off the cream. 1130 01:07:40,140 --> 01:07:45,110 So we get 70% of the variance explained, 1131 01:07:45,110 --> 01:07:49,150 but the rest of the QTLs are small, 1132 01:07:49,150 --> 01:07:50,359 right, and we can't see them. 1133 01:07:50,359 --> 01:07:52,816 We can't see them because we don't have enough individuals. 1134 01:07:52,816 --> 01:07:54,030 We're underpowered, right. 1135 01:07:54,030 --> 01:07:58,284 We just-- more individuals more sequencing, right. 1136 01:07:58,284 --> 01:08:00,450 And that would be the only way to break through this 1137 01:08:00,450 --> 01:08:04,745 and be able to see these very small effects. 1138 01:08:07,410 --> 01:08:11,630 Because if the effects are small, in some sense, 1139 01:08:11,630 --> 01:08:13,540 we're hosed. 1140 01:08:13,540 --> 01:08:15,080 Right? 1141 01:08:15,080 --> 01:08:17,859 You just can't see them through the noise. 1142 01:08:17,859 --> 01:08:24,620 All those effects are going to show up down here 1143 01:08:24,620 --> 01:08:26,500 and we're going to reject them. 1144 01:08:30,700 --> 01:08:33,645 Anything else, people can think about? 1145 01:08:36,555 --> 01:08:38,010 Yes? 1146 01:08:38,010 --> 01:08:42,066 AUDIENCE: Could you content maybe the sum of some areas 1147 01:08:42,066 --> 01:08:48,674 that are-- sorry, the addition sum of those guys 1148 01:08:48,674 --> 01:08:50,365 that have low effects. 1149 01:08:50,365 --> 01:08:52,604 Or is that not detectable by any [INAUDIBLE]? 1150 01:08:52,604 --> 01:08:53,979 PROFESSOR: Well, that's certainly 1151 01:08:53,979 --> 01:08:57,524 what we're trying to do with residuals, right? 
1152 01:08:57,524 --> 01:08:59,149 This multi-round thing is that we 1153 01:08:59,149 --> 01:09:01,109 take all the things we can detect 1154 01:09:01,109 --> 01:09:03,510 that have an effect with a conservative cut off 1155 01:09:03,510 --> 01:09:05,356 and we get rid of them. 1156 01:09:05,356 --> 01:09:07,189 And then we say, oh, is there anything left? 1157 01:09:07,189 --> 01:09:10,660 You know, that's hiding, sort of behind that forest, right. 1158 01:09:10,660 --> 01:09:12,569 If we cut through the first line of trees, 1159 01:09:12,569 --> 01:09:16,600 can we get to another collection of informative QTLs? 1160 01:09:21,090 --> 01:09:21,760 Yeah. 1161 01:09:21,760 --> 01:09:23,242 AUDIENCE: I was wondering if this 1162 01:09:23,242 --> 01:09:24,724 could be an overestimate also. 1163 01:09:24,724 --> 01:09:26,926 Like, for example, if, when you throw out 1164 01:09:26,926 --> 01:09:28,676 the variance for environmental conditions, 1165 01:09:28,676 --> 01:09:32,628 the environmental conditions aren't as exact as we thought 1166 01:09:32,628 --> 01:09:36,860 they were between two yeast growing in the same setup. 1167 01:09:36,860 --> 01:09:37,568 PROFESSOR: Right. 1168 01:09:37,568 --> 01:09:41,026 AUDIENCE: Then maybe you would inappropriately 1169 01:09:41,026 --> 01:09:43,990 assign a variance to the environmental condition 1170 01:09:43,990 --> 01:09:51,400 whereas some that could be, in fact-- something 1171 01:09:51,400 --> 01:09:52,882 that wouldn't be explained by. 1172 01:09:52,882 --> 01:09:55,030 PROFESSOR: And probably the other way around. 1173 01:09:55,030 --> 01:09:57,000 The other way around would be that you thought 1174 01:09:57,000 --> 01:10:00,522 you had the conditions exactly duplicated, right. 1175 01:10:00,522 --> 01:10:02,230 But when you actually did something else, 1176 01:10:02,230 --> 01:10:06,240 they weren't exactly duplicated so you see bigger variance 1177 01:10:06,240 --> 01:10:07,190 in another experiment.
1178 01:10:07,190 --> 01:10:11,024 And it appears to be heritable in some sense. 1179 01:10:11,024 --> 01:10:13,190 But, in fact, it would just be that you misestimated 1180 01:10:13,190 --> 01:10:14,970 the environmental component. 1181 01:10:14,970 --> 01:10:16,870 So, there are a variety of things 1182 01:10:16,870 --> 01:10:19,000 that we can think about, right. 1183 01:10:19,000 --> 01:10:20,870 Incorrect heritability estimates. 1184 01:10:20,870 --> 01:10:23,605 We can think about rare variants. 1185 01:10:23,605 --> 01:10:25,230 Now in this particular study we're 1186 01:10:25,230 --> 01:10:27,240 looking at everything, right. 1187 01:10:27,240 --> 01:10:28,330 Nothing is hiding. 1188 01:10:28,330 --> 01:10:30,040 We've got 50x sequencing. 1189 01:10:30,040 --> 01:10:32,400 There are no variants hiding behind the bushes. 1190 01:10:32,400 --> 01:10:35,080 They are all there for us to look at. 1191 01:10:35,080 --> 01:10:37,290 Structural variants-- well in this particular case, 1192 01:10:37,290 --> 01:10:39,770 we know structural variants aren't present, 1193 01:10:39,770 --> 01:10:42,290 but as you know, many kinds of mammalian cells 1194 01:10:42,290 --> 01:10:45,470 exhibit structural variants and other kinds 1195 01:10:45,470 --> 01:10:50,480 of bizarre behaviors with their chromosomes. 1196 01:10:50,480 --> 01:10:52,280 Many common variants of low effect. 1197 01:10:52,280 --> 01:10:54,520 We just talked about that. 1198 01:10:54,520 --> 01:10:56,420 And epistasis was brought up. 1199 01:10:56,420 --> 01:10:58,140 And this does not include epigenetics, 1200 01:10:58,140 --> 01:10:59,720 I'll have to add that to the list. 1201 01:10:59,720 --> 01:11:01,490 It's a good point. 1202 01:11:01,490 --> 01:11:03,050 OK. 1203 01:11:03,050 --> 01:11:04,650 And then we talked about this idea 1204 01:11:04,650 --> 01:11:10,460 that epistasis is the case where we have nonlinear effects.
1205 01:11:10,460 --> 01:11:14,970 So a very simple example of this is 1206 01:11:14,970 --> 01:11:17,570 when you have little a and big B, and big A and big B 1207 01:11:17,570 --> 01:11:19,210 together, they both had an effect. 1208 01:11:19,210 --> 01:11:22,090 But little a, little b, have no effect. 1209 01:11:22,090 --> 01:11:25,110 And big A and big B have no effect by themselves. 1210 01:11:25,110 --> 01:11:26,780 So you have a pairwise interaction 1211 01:11:26,780 --> 01:11:28,750 between these terms. 1212 01:11:28,750 --> 01:11:29,940 Right. 1213 01:11:29,940 --> 01:11:33,670 So this is sort of the exclusive OR of two terms 1214 01:11:33,670 --> 01:11:36,420 and that non-linear effect can never 1215 01:11:36,420 --> 01:11:40,990 be captured when you're looking at terms one at a time. 1216 01:11:40,990 --> 01:11:42,650 OK. 1217 01:11:42,650 --> 01:11:46,200 Because looking one at a time looks 1218 01:11:46,200 --> 01:11:49,310 like it has no effect whatsoever. 1219 01:11:49,310 --> 01:11:52,060 And these effects, of course, could be more than pairwise, 1220 01:11:52,060 --> 01:11:54,075 if you have a complicated network or pathway. 1221 01:11:59,120 --> 01:12:03,780 Now, what the authors did to examine this, 1222 01:12:03,780 --> 01:12:07,610 is they looked at pairwise effects. 1223 01:12:07,610 --> 01:12:10,710 So they considered all pairs of markers 1224 01:12:10,710 --> 01:12:14,170 and asked whether or not, taken two at a time now, 1225 01:12:14,170 --> 01:12:20,930 they could predict a difference in trait mean. 1226 01:12:20,930 --> 01:12:22,547 But what's the problem with this? 1227 01:12:22,547 --> 01:12:24,130 How many markers did I say there were? 1228 01:12:28,140 --> 01:12:31,630 13,000, something like that. 1229 01:12:31,630 --> 01:12:35,160 All pairs of markers is a lot of pairs of markers. 1230 01:12:35,160 --> 01:12:36,600 Right.
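The exclusive-OR case mentioned above can be checked with a toy calculation. This sketch assumes a noise-free phenotype that is 1 when exactly one of two loci carries the "big" allele and 0 otherwise, with the four genotype classes equally common. The names are illustrative and the numbers are not from the study.

```python
from itertools import product

def phenotype(a, b):
    """Toy XOR phenotype: an effect only when exactly one locus is 'big'."""
    return 1.0 if a != b else 0.0

def mean(values):
    return sum(values) / len(values)

# 25 individuals in each of the four genotype classes (0 = little, 1 = big).
individuals = [(a, b) for a, b in product((0, 1), repeat=2) for _ in range(25)]

# One locus at a time: average phenotype for each allele of A, then of B.
marginal_a = {allele: mean([phenotype(a, b) for a, b in individuals if a == allele])
              for allele in (0, 1)}
marginal_b = {allele: mean([phenotype(a, b) for a, b in individuals if b == allele])
              for allele in (0, 1)}

# Two loci at a time: average phenotype for each genotype pair.
pair_means = {(a0, b0): mean([phenotype(a, b) for a, b in individuals
                              if (a, b) == (a0, b0)])
              for a0, b0 in product((0, 1), repeat=2)}

print("locus A alone:", marginal_a)
print("locus B alone:", marginal_b)
print("pairs:", pair_means)
```

Each single-locus comparison shows the same mean for both alleles, so a one-marker-at-a-time scan sees nothing, while the pairwise means separate cleanly.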
1231 01:12:36,600 --> 01:12:39,100 And what happens to your statistical power 1232 01:12:39,100 --> 01:12:42,320 when you get to that many markers? 1233 01:12:42,320 --> 01:12:43,590 You have a serious problem. 1234 01:12:43,590 --> 01:12:45,620 It goes right through the floor. 1235 01:12:45,620 --> 01:12:48,450 So you really are very under-powered to detect 1236 01:12:48,450 --> 01:12:50,820 these interactions. 1237 01:12:50,820 --> 01:12:53,250 The other thing they did was to try 1238 01:12:53,250 --> 01:12:55,450 to get things a little bit better as they said, 1239 01:12:55,450 --> 01:12:57,780 how about this. 1240 01:12:57,780 --> 01:13:02,610 If we know that a given QTL is always important for a trait 1241 01:13:02,610 --> 01:13:05,980 because we discovered it in our additive model. 1242 01:13:05,980 --> 01:13:07,690 We'll consider its pairwise interaction 1243 01:13:07,690 --> 01:13:10,980 with all the other possible variants. 1244 01:13:10,980 --> 01:13:13,870 So instead of now 13,000 squared, 1245 01:13:13,870 --> 01:13:17,430 it's only going to be like 22 different QTLs for a given 1246 01:13:17,430 --> 01:13:23,540 trait times 13,000 to reduce the space of search. 1247 01:13:23,540 --> 01:13:26,410 Obviously I got this explanation not completely clear. 1248 01:13:26,410 --> 01:13:27,880 So let me try one more time. 1249 01:13:27,880 --> 01:13:30,460 OK. 1250 01:13:30,460 --> 01:13:33,040 The naive way to go at looking at pairwise interactions 1251 01:13:33,040 --> 01:13:35,570 is consider all pairs and ask whether or not 1252 01:13:35,570 --> 01:13:38,860 all pairs have an influence on a particular trait value. 1253 01:13:38,860 --> 01:13:39,390 Right. 1254 01:13:39,390 --> 01:13:40,430 We've got that much? 1255 01:13:40,430 --> 01:13:41,330 OK. 1256 01:13:41,330 --> 01:13:44,100 Now let's suppose we don't want to look at all pairs.
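The size of the two search spaces mentioned above can be worked out directly. This sketch uses the 11,623 markers quoted earlier in the lecture (the "13,000" is the rounded figure) and the 22 QTLs used as the example count per trait. The variable names are illustrative.

```python
import math

n_markers = 11_623      # unique markers quoted for the study
n_known_qtls = 22       # example count of additive QTLs for one trait

# Naive scan: every unordered pair of markers.
all_pairs = math.comb(n_markers, 2)

# Restricted scan: each known QTL against every other marker.
restricted_pairs = n_known_qtls * (n_markers - 1)

print(f"all pairs:        {all_pairs:,}")
print(f"restricted pairs: {restricted_pairs:,}")
print(f"reduction factor: {all_pairs / restricted_pairs:.0f}x")
```

Fewer tests means a less extreme significance cutoff, which is where the restricted scan recovers some power.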
1257 01:13:44,100 --> 01:13:46,570 How could we pick one element of the pair 1258 01:13:46,570 --> 01:13:50,082 to be interesting, but smaller in number? 1259 01:13:50,082 --> 01:13:50,940 Right. 1260 01:13:50,940 --> 01:13:53,370 So what we'll do is, for a given trait, 1261 01:13:53,370 --> 01:13:57,610 we already know which QTLs are important for it 1262 01:13:57,610 --> 01:14:00,406 because we've built our model already. 1263 01:14:00,406 --> 01:14:02,280 So let's just say, for purpose of discussion, 1264 01:14:02,280 --> 01:14:06,150 there are 20 QTLs that are important for this trait. 1265 01:14:06,150 --> 01:14:09,010 We'll take each one of those 20 QTLs 1266 01:14:09,010 --> 01:14:11,870 and we'll examine whether or not it has a pairwise interaction 1267 01:14:11,870 --> 01:14:14,960 with all of the other variants. 1268 01:14:14,960 --> 01:14:18,220 And that will reduce our search space. 1269 01:14:18,220 --> 01:14:19,500 Is that better? 1270 01:14:19,500 --> 01:14:21,000 OK, good. 1271 01:14:21,000 --> 01:14:27,800 So, when they did that, they did find 1272 01:14:27,800 --> 01:14:30,830 some pairwise interactions. 1273 01:14:30,830 --> 01:14:35,940 24 of their 46 traits had pairwise interactions 1274 01:14:35,940 --> 01:14:36,935 and here is an example. 1275 01:14:40,200 --> 01:14:48,640 And you can see the dot plot, or the upper right-hand part 1276 01:14:48,640 --> 01:14:52,710 of this slide, how when you have BY BY, 1277 01:14:52,710 --> 01:14:55,920 you have a lower phenotypic value than 1278 01:14:55,920 --> 01:15:02,220 when you have just any RM component 1279 01:15:02,220 --> 01:15:04,590 on the right-hand side. 1280 01:15:04,590 --> 01:15:08,680 So those were two different SNPs 1281 01:15:08,680 --> 01:15:10,690 on chromosome 7 and chromosome 11 1282 01:15:10,690 --> 01:15:12,915 and showing how they interact with one another 1283 01:15:12,915 --> 01:15:16,760 in a non-linear way.
1284 01:15:16,760 --> 01:15:21,060 If they were linear, then as you added either a chromosome 7 1285 01:15:21,060 --> 01:15:24,560 or a chromosome 11 contribution it would go up a little bit. 1286 01:15:24,560 --> 01:15:33,410 Here, as soon as you add either contribution from RM, 1287 01:15:33,410 --> 01:15:38,710 it goes all the way up to have a mean of zero or higher. 1288 01:15:38,710 --> 01:15:44,300 In this particular case, 71% of the gap between broad-sense 1289 01:15:44,300 --> 01:15:50,360 and narrow-sense was explained by this one pair interaction. 1290 01:15:50,360 --> 01:15:54,010 So it is the case that pairwise interactions 1291 01:15:54,010 --> 01:15:56,235 can explain some of the missing heritability. 1292 01:15:59,540 --> 01:16:01,310 Can anybody think of anything else 1293 01:16:01,310 --> 01:16:02,893 that can explain missing heritability? 1294 01:16:08,760 --> 01:16:09,260 OK. 1295 01:16:11,840 --> 01:16:13,967 What's inherited? 1296 01:16:13,967 --> 01:16:15,550 Let's make a list of everything that's 1297 01:16:15,550 --> 01:16:21,660 inherited from the parental line to the F1s. 1298 01:16:21,660 --> 01:16:23,450 OK. 1299 01:16:23,450 --> 01:16:24,266 Yes. 1300 01:16:24,266 --> 01:16:27,172 AUDIENCE: I mean, because there's 1301 01:16:27,172 --> 01:16:29,164 a lot more things inherited. 1302 01:16:29,164 --> 01:16:31,071 The protein levels are inherited. 1303 01:16:31,071 --> 01:16:31,654 PROFESSOR: OK. 1304 01:16:31,654 --> 01:16:33,734 AUDIENCE: [INAUDIBLE] are inherited as well. 1305 01:16:33,734 --> 01:16:34,400 PROFESSOR: Good. 1306 01:16:34,400 --> 01:16:35,909 I like this line of thinking. 1307 01:16:35,909 --> 01:16:36,825 AUDIENCE: [INAUDIBLE]. 1308 01:16:36,825 --> 01:16:38,325 PROFESSOR: There are a lot of things 1309 01:16:38,325 --> 01:16:40,130 that are inherited, right? 1310 01:16:40,130 --> 01:16:43,690 So what's inherited? 1311 01:16:43,690 --> 01:16:47,790 Some proteins are probably inherited, right?
1312 01:16:47,790 --> 01:16:50,760 What is replicable from generation 1313 01:16:50,760 --> 01:16:54,200 to generation as a genetic material that's inherited? 1314 01:16:54,200 --> 01:16:56,835 Let's just talk about that for a moment. 1315 01:16:56,835 --> 01:16:58,710 Proteins are interesting, don't get me wrong. 1316 01:16:58,710 --> 01:17:01,360 I mean, prions and other things are very interesting. 1317 01:17:01,360 --> 01:17:03,410 But what else is inherited? 1318 01:17:08,370 --> 01:17:09,137 OK, yes? 1319 01:17:09,137 --> 01:17:10,053 AUDIENCE: [INAUDIBLE]. 1320 01:17:18,210 --> 01:17:21,230 PROFESSOR: So there are other genetic molecules. 1321 01:17:21,230 --> 01:17:24,730 Let's just take a really simple one-- mitochondria. 1322 01:17:24,730 --> 01:17:26,760 OK. 1323 01:17:26,760 --> 01:17:28,830 Mitochondria are inherited. 1324 01:17:28,830 --> 01:17:30,610 And it turns out that these two strains 1325 01:17:30,610 --> 01:17:35,830 can have different mitochondria. 1326 01:17:35,830 --> 01:17:37,120 What else can be inherited? 1327 01:17:40,790 --> 01:17:44,140 Well, we were doing these experiments with our colleagues 1328 01:17:44,140 --> 01:17:46,822 over at the Whitehead and for a long time 1329 01:17:46,822 --> 01:17:48,530 we couldn't figure out what was going on. 1330 01:17:48,530 --> 01:17:50,689 Because we would do the experiments on day one 1331 01:17:50,689 --> 01:17:52,730 and they'd come out a particular way and on day two 1332 01:17:52,730 --> 01:17:54,030 they'd come out a different way. 1333 01:17:54,030 --> 01:17:55,100 Right. 1334 01:17:55,100 --> 01:17:59,160 And we were doing them under very controlled conditions. 1335 01:17:59,160 --> 01:18:02,460 Until we figured out that everybody 1336 01:18:02,460 --> 01:18:06,230 uses S288C which is the genetic nomenclature 1337 01:18:06,230 --> 01:18:10,010 for the lab strain yeast, right. 1338 01:18:10,010 --> 01:18:12,112 It's the lab strain because it's very well behaved.
1339 01:18:12,112 --> 01:18:13,070 It's a very nice yeast. 1340 01:18:13,070 --> 01:18:14,210 It grows very well. 1341 01:18:14,210 --> 01:18:16,220 It's been selected for that, right. 1342 01:18:16,220 --> 01:18:19,750 And people always do genetic studies by taking S288C, 1343 01:18:19,750 --> 01:18:22,717 which is the lab yeast, which has been completely sequenced 1344 01:18:22,717 --> 01:18:24,800 and so you want to use it because you can download 1345 01:18:24,800 --> 01:18:29,490 the genome, and crossing it with a wild strain. 1346 01:18:29,490 --> 01:18:33,260 And wild strains come from the wild, right. 1347 01:18:33,260 --> 01:18:35,920 And they come either off of people 1348 01:18:35,920 --> 01:18:37,230 who have yeast infections, 1349 01:18:37,230 --> 01:18:40,190 I mean, human beings, or they come off of grapevines 1350 01:18:40,190 --> 01:18:42,120 or God knows where, right. 1351 01:18:42,120 --> 01:18:44,080 But they are not well behaved. 1352 01:18:44,080 --> 01:18:45,760 And why are they not well behaved? 1353 01:18:45,760 --> 01:18:48,830 What makes these yeast particularly rude? 1354 01:18:48,830 --> 01:18:51,380 Well, the thing that makes them particularly rude 1355 01:18:51,380 --> 01:18:54,120 is that they have things like viruses in them. 1356 01:18:54,120 --> 01:18:55,440 Oh, no. 1357 01:18:55,440 --> 01:18:56,170 OK. 1358 01:18:56,170 --> 01:18:58,470 Because what happens is that when 1359 01:18:58,470 --> 01:19:01,080 you take a yeast that has a virus in it, 1360 01:19:01,080 --> 01:19:04,460 and you cross it with a lab yeast, right, 1361 01:19:04,460 --> 01:19:06,120 all of the kids get the virus. 1362 01:19:08,940 --> 01:19:10,220 Yuck. 1363 01:19:10,220 --> 01:19:11,930 OK. 1364 01:19:11,930 --> 01:19:19,150 And it turns out that the so-called killer virus in yeast 1365 01:19:19,150 --> 01:19:23,950 interacts with various chromosomal changes.
1366 01:19:23,950 --> 01:19:25,590 And so now you have interactions-- 1367 01:19:25,590 --> 01:19:29,340 genetic interactions between a viral element 1368 01:19:29,340 --> 01:19:32,090 and the chromosome. 1369 01:19:32,090 --> 01:19:36,760 And so the phenotype you get out of particular deletions 1370 01:19:36,760 --> 01:19:41,860 in the yeast genome has to do with whether or not 1371 01:19:41,860 --> 01:19:44,220 it's infected with a particular virus. 1372 01:19:44,220 --> 01:19:50,500 It also has to do with which mitochondrial content it has. 1373 01:19:50,500 --> 01:19:51,970 And people didn't appreciate this 1374 01:19:51,970 --> 01:19:57,520 until recently because most of the past yeast studies for QTLs 1375 01:19:57,520 --> 01:20:03,010 were busy crossing lab strains with wild strains 1376 01:20:03,010 --> 01:20:06,960 and whether it was ethanol tolerance or growth in heat, 1377 01:20:06,960 --> 01:20:09,620 a lot of the studies came up with a gene 1378 01:20:09,620 --> 01:20:12,800 as a significant QTL, which was MKT1. 1379 01:20:12,800 --> 01:20:17,580 And people couldn't understand why MKT1 was so popular, right. 1380 01:20:17,580 --> 01:20:22,460 MKT1, maintenance of killer toxin one. 1381 01:20:22,460 --> 01:20:23,050 Yeah. 1382 01:20:23,050 --> 01:20:26,210 That's the viral thing that enables-- the chromosomal thing 1383 01:20:26,210 --> 01:20:28,130 that enables viral competence. 1384 01:20:28,130 --> 01:20:34,170 So, it turns out that if you look 1385 01:20:34,170 --> 01:20:37,440 at this-- in this particular case, 1386 01:20:37,440 --> 01:20:39,730 we're looking at yeast that don't 1387 01:20:39,730 --> 01:20:43,220 have the virus in the bottom little photograph there. 1388 01:20:43,220 --> 01:20:46,595 You can see they're all sort of, you know, 1389 01:20:46,595 --> 01:20:48,930 they're growing similarly. 1390 01:20:48,930 --> 01:20:53,680 And the yeast with the same genotype above-- those 1391 01:20:53,680 --> 01:20:56,164 are all in tetrads.
1392 01:20:56,164 --> 01:20:58,080 Two out of the four are growing, the other two 1393 01:20:58,080 --> 01:21:02,470 are not, because the other two have a particular deletion. 1394 01:21:02,470 --> 01:21:06,595 And if you look at the model-- a deletion-only model, 1395 01:21:06,595 --> 01:21:09,940 which only looks at the chromosomal complement, 1396 01:21:09,940 --> 01:21:14,660 doesn't predict the variance very well. 1397 01:21:14,660 --> 01:21:18,150 And if you look at the deletion and whether or not 1398 01:21:18,150 --> 01:21:21,250 you have the virus, you do better. 1399 01:21:21,250 --> 01:21:24,190 But you do even better, if you allow 1400 01:21:24,190 --> 01:21:26,930 for there to be a nonlinear interaction 1401 01:21:26,930 --> 01:21:29,040 between the chromosomal modification 1402 01:21:29,040 --> 01:21:31,520 and whether or not you have a virus. 1403 01:21:31,520 --> 01:21:36,120 And then you recover almost all of the missing heritability. 1404 01:21:36,120 --> 01:21:37,930 So I'll leave you with this thought, which 1405 01:21:37,930 --> 01:21:45,560 is that genetics is complicated and QTLs are great, but don't 1406 01:21:45,560 --> 01:21:49,400 forget that there are all sorts of genetic elements. 1407 01:21:49,400 --> 01:21:50,980 And on that note, next time we'll 1408 01:21:50,980 --> 01:21:52,490 talk about human genetics. 1409 01:21:52,490 --> 01:21:54,050 Have a great weekend until then. 1410 01:21:54,050 --> 01:21:54,730 We'll see you. 1411 01:21:54,730 --> 01:21:56,440 Take care.
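As a rough illustration of the three-way model comparison described above (deletion only, deletion plus virus, deletion plus virus plus their interaction), here is a minimal sketch on a made-up toy data set. The growth values are invented for illustration and are not the lecture's data; the assumption is simply that growth stops only when the deletion and the virus occur together. The helper names `solve` and `r_squared` are hypothetical, and the fit is plain ordinary least squares, which may differ from the method used in the actual study.

```python
def solve(a, b):
    """Solve the square linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda i: abs(m[i][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def r_squared(rows, y):
    """Fit y on the design rows by least squares; return fraction of variance explained."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    pred = [sum(b * x for b, x in zip(beta, r)) for r in rows]
    mean = sum(y) / len(y)
    sse = sum((yi - p) ** 2 for yi, p in zip(y, pred))
    sst = sum((yi - mean) ** 2 for yi in y)
    return 1 - sse / sst


# Toy data (assumption): (deletion, virus, growth).
# Growth is lost only when the deletion AND the virus are both present.
cells = [(0, 0, 1.0), (1, 0, 1.0), (0, 1, 1.0), (1, 1, 0.0)]
y = [g for _, _, g in cells]

deletion_only = [[1, d] for d, v, _ in cells]         # intercept + deletion
additive = [[1, d, v] for d, v, _ in cells]           # + virus, no interaction
interaction = [[1, d, v, d * v] for d, v, _ in cells] # + deletion-by-virus term

r2_deletion = r_squared(deletion_only, y)
r2_additive = r_squared(additive, y)
r2_interaction = r_squared(interaction, y)
print(r2_deletion, r2_additive, r2_interaction)
```

On this toy data the explained variance rises at each step, mirroring the qualitative ordering described in the lecture; the specific values are a property of the invented numbers only.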