1
00:00:00,060 --> 00:00:01,780
The following
content is provided

2
00:00:01,780 --> 00:00:04,019
under a Creative
Commons license.

3
00:00:04,019 --> 00:00:06,870
Your support will help MIT
OpenCourseWare continue

4
00:00:06,870 --> 00:00:10,730
to offer high quality
educational resources for free.

5
00:00:10,730 --> 00:00:13,340
To make a donation or
view additional materials

6
00:00:13,340 --> 00:00:17,236
from hundreds of MIT courses,
visit MIT OpenCourseWare

7
00:00:17,236 --> 00:00:17,861
at ocw.mit.edu.

8
00:00:26,450 --> 00:00:28,080
PROFESSOR: So as
you recall last time

9
00:00:28,080 --> 00:00:30,780
we talked about chromatin
structure and chromatin

10
00:00:30,780 --> 00:00:32,000
regulation.

11
00:00:32,000 --> 00:00:35,390
And now we're going to move
on to genetic analysis.

12
00:00:35,390 --> 00:00:38,160
But before we do that, I
want us to touch on two points

13
00:00:38,160 --> 00:00:41,440
that we talked about
briefly last time.

14
00:00:41,440 --> 00:00:44,880
One was 5C analysis.

15
00:00:44,880 --> 00:00:48,420
Who was it that brought up--
who was the 5C expert here?

16
00:00:48,420 --> 00:00:50,010
Anybody?

17
00:00:50,010 --> 00:00:51,790
No?

18
00:00:51,790 --> 00:00:52,920
Nobody wants to own 5C.

19
00:00:52,920 --> 00:00:53,500
OK.

20
00:00:53,500 --> 00:00:56,100
But as you recall, we
talked about ChIA-PET

21
00:00:56,100 --> 00:01:00,090
as one way of analyzing
any-to-any interactions in the way

22
00:01:00,090 --> 00:01:04,370
that the genome folds up and
enhancers talk to promoters.

23
00:01:04,370 --> 00:01:06,114
And 5C is a very
similar technique.

24
00:01:06,114 --> 00:01:07,530
I just wanted to
show you the flow

25
00:01:07,530 --> 00:01:10,260
chart for how the protocol goes.

26
00:01:10,260 --> 00:01:12,950
There is a cross linking.

27
00:01:12,950 --> 00:01:14,720
A digestion with a
restriction enzyme

28
00:01:14,720 --> 00:01:18,040
step, followed by a
proximity ligation step,

29
00:01:18,040 --> 00:01:21,900
which gives you molecules
that had been brought together

30
00:01:21,900 --> 00:01:25,320
by an enhancer, promoter
complex, or any other kind

31
00:01:25,320 --> 00:01:29,030
of distal protein-protein
interaction.

32
00:01:29,030 --> 00:01:35,660
And then, what happens is that
you design specific primers

33
00:01:35,660 --> 00:01:38,600
to detect those ligation events.

34
00:01:38,600 --> 00:01:41,150
And you sequence
the result of what

35
00:01:41,150 --> 00:01:44,390
is known as ligation
mediated amplification.

36
00:01:44,390 --> 00:01:46,730
So those primers are
only going to ligate

37
00:01:46,730 --> 00:01:49,910
if they're brought together at
a particular junction, which

38
00:01:49,910 --> 00:01:53,230
is defined by the
restriction sites lining up.

39
00:01:53,230 --> 00:01:58,500
So, 5C is a method of looking
at which regions of the genome

40
00:01:58,500 --> 00:02:04,170
interact and can produce
these sorts of results,

41
00:02:04,170 --> 00:02:05,680
showing which
parts of the genome

42
00:02:05,680 --> 00:02:07,190
interact with one another.

43
00:02:07,190 --> 00:02:12,630
The key difference, I think,
between ChIA-PET and 5C

44
00:02:12,630 --> 00:02:15,660
is that you actually have to
have these primers designed

45
00:02:15,660 --> 00:02:19,050
and pick the particular
locations you want to query.

46
00:02:19,050 --> 00:02:23,110
So the primers that you design
represent query locations

47
00:02:23,110 --> 00:02:26,840
and you can then either apply
the results to a microarray,

48
00:02:26,840 --> 00:02:31,710
or to high throughput sequencing
to detect these interactions.

49
00:02:31,710 --> 00:02:33,900
But the essential
idea is the same.

50
00:02:33,900 --> 00:02:36,540
Where you do proximity
based ligation

51
00:02:36,540 --> 00:02:40,440
to form molecules that
contain components

52
00:02:40,440 --> 00:02:42,230
of two different
pieces of the genome

53
00:02:42,230 --> 00:02:47,270
that have been brought together
for some functional reason.

54
00:02:47,270 --> 00:02:50,580
The next thing I
want to touch upon

55
00:02:50,580 --> 00:02:55,500
was this idea of the
CpG dinucleotides

56
00:02:55,500 --> 00:02:59,930
that are connected
by a phosphate bond.

57
00:02:59,930 --> 00:03:01,880
And you recall that I
talked about the idea

58
00:03:01,880 --> 00:03:03,310
that they were symmetric.

59
00:03:03,310 --> 00:03:07,430
So you could have methyl groups
on the cytosines in such a way

60
00:03:07,430 --> 00:03:10,810
that, because they could
mirror one another,

61
00:03:10,810 --> 00:03:14,770
they could be transferred
from one strand of DNA

62
00:03:14,770 --> 00:03:18,890
to the other strand of DNA,
during cell replication

63
00:03:18,890 --> 00:03:21,530
by DNA methyltransferase.

64
00:03:21,530 --> 00:03:25,750
So it forms a more stable kind
of mark and as you recall,

65
00:03:25,750 --> 00:03:27,750
DNA methylation was
something that occurred

66
00:03:27,750 --> 00:03:32,210
in lowly expressed genes
and typically in regions

67
00:03:32,210 --> 00:03:34,850
of the genome that
are methylated.

68
00:03:34,850 --> 00:03:36,430
Other histone marks
are not present

69
00:03:36,430 --> 00:03:39,050
and the genes are turned off.

70
00:03:39,050 --> 00:03:39,680
OK.

71
00:03:39,680 --> 00:03:41,055
So those were the
points I wanted

72
00:03:41,055 --> 00:03:44,380
to touch upon from last lecture.

73
00:03:44,380 --> 00:03:48,090
Now we're going to
embark upon an adventure,

74
00:03:48,090 --> 00:03:54,670
looking for the answer to, where
is missing heritability found?

75
00:03:54,670 --> 00:03:57,560
So it's a big open
question now in genetics.

76
00:03:57,560 --> 00:04:00,240
In human genetics,
which is that we really

77
00:04:00,240 --> 00:04:03,830
can't find all the heritability.

78
00:04:03,830 --> 00:04:06,990
And as a point of
introduction, the narrative

79
00:04:06,990 --> 00:04:10,760
arc for today's lecture is
that, generally speaking,

80
00:04:10,760 --> 00:04:13,160
you're more like your
relatives than random people

81
00:04:13,160 --> 00:04:14,480
on the planet.

82
00:04:14,480 --> 00:04:15,770
And why is this?

83
00:04:15,770 --> 00:04:20,529
Well obviously you contain
components of your mom

84
00:04:20,529 --> 00:04:22,089
and dad's genomes.

85
00:04:22,089 --> 00:04:27,800
And they are providing you
with components of your traits.

86
00:04:27,800 --> 00:04:30,580
And the heritability
of a trait is

87
00:04:30,580 --> 00:04:34,310
defined by the fraction
of phenotypic variance

88
00:04:34,310 --> 00:04:37,830
that can be explained
by genetics.
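As a hedged restatement of that definition in symbols: the notation below is an assumption, since the lecture has not introduced symbols at this point. Here $\sigma_p^2$ stands for the total phenotypic variance in a population and $\sigma_g^2$ for the part of it attributable to genotype. Defined this way, the ratio is what is usually called broad-sense heritability.

```latex
H^2 = \frac{\sigma_g^2}{\sigma_p^2}, \qquad 0 \le H^2 \le 1
```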

89
00:04:37,830 --> 00:04:41,600
And we're going to talk today
about computational models that

90
00:04:41,600 --> 00:04:44,250
can predict phenotype
from genotype.

91
00:04:44,250 --> 00:04:46,480
And this is very
important, obviously,

92
00:04:46,480 --> 00:04:51,380
for understanding the sources of
various traits and phenotypes.

93
00:04:51,380 --> 00:04:54,760
As well as fields such
as pharmacogenomics

94
00:04:54,760 --> 00:04:59,640
that try and predict the
best therapy for a disease

95
00:04:59,640 --> 00:05:03,440
based upon your genetic makeup.

96
00:05:03,440 --> 00:05:09,110
So, individual
loci in the genome

97
00:05:09,110 --> 00:05:12,140
that contribute to
quantitative traits

98
00:05:12,140 --> 00:05:17,395
are called quantitative
trait loci, or QTLs.

99
00:05:17,395 --> 00:05:19,520
So we're going to talk
about how to discover them

100
00:05:19,520 --> 00:05:24,150
and how to build models of
quantitative traits using QTLs.

101
00:05:24,150 --> 00:05:27,490
And finally, as I
said at the outset,

102
00:05:27,490 --> 00:05:29,980
our models are
insufficient today.

103
00:05:29,980 --> 00:05:33,150
They really can't find
all of the heritability.

104
00:05:33,150 --> 00:05:35,730
So we're going to go searching
for this missing heritability

105
00:05:35,730 --> 00:05:39,460
and see where it might be found.

106
00:05:39,460 --> 00:05:44,320
Computationally, we're going to
apply a variety of techniques

107
00:05:44,320 --> 00:05:45,830
to these problems.

108
00:05:45,830 --> 00:05:48,820
A preview is, we're
going to build

109
00:05:48,820 --> 00:05:52,040
linear models of
phenotype and we're

110
00:05:52,040 --> 00:05:56,124
going to use stepwise regression
to learn these models using

111
00:05:56,124 --> 00:05:57,290
a forward feature selection.

112
00:05:57,290 --> 00:05:59,210
And I'll talk about
what that is when

113
00:05:59,210 --> 00:06:01,320
we get to that point
of the lecture.

114
00:06:01,320 --> 00:06:04,310
We're going to derive test
statistics for discovering

115
00:06:04,310 --> 00:06:08,530
which QTLs are significant
and which QTLs are not,

116
00:06:08,530 --> 00:06:10,211
to include in our model.

117
00:06:10,211 --> 00:06:11,960
And finally, we're
going to talk about how

118
00:06:11,960 --> 00:06:15,200
to measure narrow sense
heritability and broad sense

119
00:06:15,200 --> 00:06:17,435
heritability and
environmental variance.

120
00:06:20,010 --> 00:06:21,990
OK.

121
00:06:21,990 --> 00:06:30,400
So, one great resource for
traits that are fairly simple.

122
00:06:30,400 --> 00:06:35,600
That primarily are the result
of a single gene mutation,

123
00:06:35,600 --> 00:06:40,040
or where a single gene
mutation plays a dominant role,

124
00:06:40,040 --> 00:06:45,130
is something called Online
Mendelian Inheritance in Man.

125
00:06:45,130 --> 00:06:46,770
And it's a resource.

126
00:06:46,770 --> 00:06:49,690
It has about 21,000
genes in it right now.

127
00:06:49,690 --> 00:06:55,100
And it's a great way to explore
what human gene function

128
00:06:55,100 --> 00:06:57,170
is in various diseases.

129
00:06:57,170 --> 00:06:58,500
And you can query by disease.

130
00:06:58,500 --> 00:07:00,060
You can query by gene.

131
00:07:00,060 --> 00:07:07,130
And it is a very carefully
annotated and maintained

132
00:07:07,130 --> 00:07:10,190
collection that is
worthy of study,

133
00:07:10,190 --> 00:07:13,460
if you're interested in
particular disease genes.

134
00:07:13,460 --> 00:07:19,132
We're going to be looking at
more complex analyses today.

135
00:07:19,132 --> 00:07:20,590
The analyses we're
going to look at

136
00:07:20,590 --> 00:07:22,275
are where there
are many genes that

137
00:07:22,275 --> 00:07:23,540
influence a particular trait.

138
00:07:23,540 --> 00:07:25,630
And we would like to come
up with general methods

139
00:07:25,630 --> 00:07:30,902
for discovering how we can de
novo from experimental data--

140
00:07:30,902 --> 00:07:32,985
discover all the different
genes that participate.

141
00:07:36,680 --> 00:07:39,910
Now just as a quick
review of statistics,

142
00:07:39,910 --> 00:07:43,660
I think that we've talked
before about means in class

143
00:07:43,660 --> 00:07:44,989
and variances.

144
00:07:44,989 --> 00:07:46,530
We're also going to
talk a little bit

145
00:07:46,530 --> 00:07:48,732
about covariances today.

146
00:07:48,732 --> 00:07:50,190
But these are terms
that you should

147
00:07:50,190 --> 00:07:54,210
be familiar with as
we're looking today

148
00:07:54,210 --> 00:08:01,930
at some of our metrics for
understanding heritability.

149
00:08:01,930 --> 00:08:05,960
Are there any questions about any
of the statistical metrics that

150
00:08:05,960 --> 00:08:06,460
are up here?

151
00:08:09,348 --> 00:08:09,848
OK.

152
00:08:12,760 --> 00:08:16,125
So, a broad overview of
genotype to phenotype.

153
00:08:18,996 --> 00:08:20,620
So, we're primarily
going to be working

154
00:08:20,620 --> 00:08:24,170
with complete genome
sequences today,

155
00:08:24,170 --> 00:08:26,490
which will reveal all
of the variants that

156
00:08:26,490 --> 00:08:28,930
are present in the genome.

157
00:08:28,930 --> 00:08:32,264
And it's also the case that
you can subsample a genome

158
00:08:32,264 --> 00:08:35,070
and only observe
certain variants.

159
00:08:35,070 --> 00:08:37,710
Typically that's
done with microarrays

160
00:08:37,710 --> 00:08:41,010
that have probes that are
specific to particular markers.

161
00:08:41,010 --> 00:08:42,929
The way those arrays
are manufactured

162
00:08:42,929 --> 00:08:47,110
is that whole genome sequencing
is done at the outset, and then

163
00:08:47,110 --> 00:08:50,570
high prevalence
variants, or at least

164
00:08:50,570 --> 00:08:52,190
common variants,
which typically are

165
00:08:52,190 --> 00:08:55,410
at a frequency of at
least 5% in the population

166
00:08:55,410 --> 00:08:58,390
are queried by
using a microarray.

167
00:08:58,390 --> 00:09:01,790
But today we'll talk about
complete genome sequence.

168
00:09:01,790 --> 00:09:03,290
An individual's
phenotype, we'll say

169
00:09:03,290 --> 00:09:05,800
is defined by one
or more traits.

170
00:09:05,800 --> 00:09:09,720
And a non-quantitative trait is
something perhaps as simple as

171
00:09:09,720 --> 00:09:12,590
whether or not something
is dead or alive.

172
00:09:12,590 --> 00:09:15,820
Or whether or not it can survive
in a particular condition.

173
00:09:15,820 --> 00:09:19,920
Or its ability to produce
a particular substance.

174
00:09:19,920 --> 00:09:22,370
A quantitative trait,
on the other hand,

175
00:09:22,370 --> 00:09:25,274
is a continuous variable.

176
00:09:25,274 --> 00:09:26,815
Height, for example,
of an individual

177
00:09:26,815 --> 00:09:28,840
is a quantitative trait.

178
00:09:28,840 --> 00:09:32,610
As is growth rate, expression
of a particular gene,

179
00:09:32,610 --> 00:09:34,490
and so forth.

180
00:09:34,490 --> 00:09:39,310
So we'll be focusing today on
estimating quantitative traits.

181
00:09:39,310 --> 00:09:41,970
And as I said, a
quantitative trait locus,

182
00:09:41,970 --> 00:09:45,240
is a marker that's associated
with a quantitative trait

183
00:09:45,240 --> 00:09:47,160
and could be used to predict it.

184
00:09:47,160 --> 00:09:49,520
And you can sometimes
hear about eQTLs,

185
00:09:49,520 --> 00:09:52,900
which are expression
quantitative trait loci.

186
00:09:52,900 --> 00:09:55,540
And they're loci that are
related to gene expression.

187
00:09:59,770 --> 00:10:07,469
So, let's begin then, with
a very simple genetic model.

188
00:10:07,469 --> 00:10:09,510
It's going to be haploid,
which means, of course,

189
00:10:09,510 --> 00:10:11,218
there's only one copy
of each chromosome.

190
00:10:11,218 --> 00:10:12,607
Yeast is the model
organism we're

191
00:10:12,607 --> 00:10:13,940
going to be talking about today.

192
00:10:13,940 --> 00:10:16,190
It's a haploid organism.

193
00:10:16,190 --> 00:10:18,020
And we have mom
and dad up there.

194
00:10:18,020 --> 00:10:22,310
Mom on the left, dad on the
right in two different colors.

195
00:10:22,310 --> 00:10:24,900
And you can see that mom and
dad in this particular example,

196
00:10:24,900 --> 00:10:26,200
have n different genes.

197
00:10:26,200 --> 00:10:29,580
They're going to contribute to
the F1 generation, to junior.

198
00:10:32,110 --> 00:10:37,700
And the relative colors,
white for mom, black for dad,

199
00:10:37,700 --> 00:10:41,980
are going to be used to
describe the alleles,

200
00:10:41,980 --> 00:10:44,860
or the allelic variants
that are inherited

201
00:10:44,860 --> 00:10:48,436
by the child, the F1 generation.

202
00:10:48,436 --> 00:10:50,309
And as I said, a
specific phenotype

203
00:10:50,309 --> 00:10:52,350
might be alive or dead in
a specific environment.

204
00:10:55,290 --> 00:11:03,110
And note that I have drawn the
chromosomes to be disconnected.

205
00:11:03,110 --> 00:11:06,230
Which means that each
one of those genes

206
00:11:06,230 --> 00:11:09,940
is going to be
independently inherited.

207
00:11:09,940 --> 00:11:13,160
So the probability
in the F1 generation

208
00:11:13,160 --> 00:11:16,020
that you're going to get
one of those from mom or dad

209
00:11:16,020 --> 00:11:18,420
is going to be a coin flip.

210
00:11:18,420 --> 00:11:19,970
We're going to
assume that they're

211
00:11:19,970 --> 00:11:23,400
far enough away that the
probability of crossing over

212
00:11:23,400 --> 00:11:26,410
during meiosis is 0.5.

213
00:11:26,410 --> 00:11:28,530
And so we get a
random assortment

214
00:11:28,530 --> 00:11:32,050
of alleles from mom and dad.

215
00:11:32,050 --> 00:11:33,130
OK?

216
00:11:33,130 --> 00:11:37,520
So let us say that you go
off and do an experiment.

217
00:11:37,520 --> 00:11:44,610
And you have 32 individuals
that you produce out of a cross.

218
00:11:44,610 --> 00:11:47,490
And you test them, OK.

219
00:11:47,490 --> 00:11:57,130
And two of them are resistant
to a particular substance.

220
00:11:57,130 --> 00:12:00,155
How many genes do you think are
involved in that resistance?

221
00:12:03,320 --> 00:12:07,885
Let's assume that mom is
resistant and dad is not.

222
00:12:07,885 --> 00:12:08,385
OK.

223
00:12:11,560 --> 00:12:14,780
If you had two that were
resistant out of 32,

224
00:12:14,780 --> 00:12:18,350
how many different genes
do you think were involved?

225
00:12:18,350 --> 00:12:19,635
How do you estimate that?

226
00:12:26,420 --> 00:12:26,970
Any ideas?

227
00:12:33,667 --> 00:12:34,645
Yes?

228
00:12:34,645 --> 00:12:39,046
AUDIENCE: If you
had 32 individuals

229
00:12:39,046 --> 00:12:44,930
and say half of them got it?

230
00:12:44,930 --> 00:12:46,160
PROFESSOR: Two, let's say.

231
00:12:46,160 --> 00:12:51,760
One out of 16 is resistant.

232
00:12:51,760 --> 00:12:53,970
And mom is resistant.

233
00:12:53,970 --> 00:12:56,940
AUDIENCE: Because I was thinking
that if it was half of them

234
00:12:56,940 --> 00:13:00,212
were resistant, then you
would maybe guess one gene,

235
00:13:00,212 --> 00:13:01,170
or something like that.

236
00:13:01,170 --> 00:13:02,220
PROFESSOR: Very good.

237
00:13:02,220 --> 00:13:04,970
AUDIENCE: So then
if only eight were

238
00:13:04,970 --> 00:13:09,720
resistant you might guess two
genes, or something like that?

239
00:13:09,720 --> 00:13:11,960
PROFESSOR: Yeah.

240
00:13:11,960 --> 00:13:15,320
What you say is, that
if mom's resistant, then

241
00:13:15,320 --> 00:13:16,850
we're going to
assume that you need

242
00:13:16,850 --> 00:13:20,030
to get the right number of
genes from mom to be resistant.

243
00:13:20,030 --> 00:13:21,390
Right?

244
00:13:21,390 --> 00:13:25,369
And so, let's say that you had
to get four genes from mom.

245
00:13:25,369 --> 00:13:27,410
What's the chance of
getting four genes from mom?

246
00:13:30,236 --> 00:13:33,070
AUDIENCE: Half to
the power of four.

247
00:13:33,070 --> 00:13:35,290
PROFESSOR: Yeah, which
is one out of 16, right?

248
00:13:35,290 --> 00:13:39,200
So, if you, for example had two
that were resistant out of 32,

249
00:13:39,200 --> 00:13:41,240
the chances are one in 16.

250
00:13:41,240 --> 00:13:41,960
Right?

251
00:13:41,960 --> 00:13:46,230
So you would naively
think, and properly so,

252
00:13:46,230 --> 00:13:51,620
that you had to get four
genes from mom to be resistant.

253
00:13:51,620 --> 00:13:54,440
So the way to think
about these sorts

254
00:13:54,440 --> 00:13:57,450
of non-quantitative
traits is that you

255
00:13:57,450 --> 00:14:00,630
can estimate the number
of genes involved.

256
00:14:00,630 --> 00:14:02,930
It simply is log
base 2 of the number

257
00:14:02,930 --> 00:14:07,570
of F1s tested over the number
of the F1s with the phenotype.

258
00:14:07,570 --> 00:14:09,790
It tells you roughly
how many genes

259
00:14:09,790 --> 00:14:16,160
are involved in providing
a particular trait,

260
00:14:16,160 --> 00:14:18,530
assuming that the
genes are unlinked.

261
00:14:18,530 --> 00:14:21,415
It's a coin flip, whether
you get them or not.
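A minimal Python sketch of the estimate just described, assuming unlinked genes, a 1/2 chance of inheriting each one from mom, and an AND model in which every gene must come from mom. The function name `estimate_gene_count` is hypothetical, introduced only for illustration.

```python
import math


def estimate_gene_count(num_tested: int, num_with_phenotype: int) -> float:
    """Estimate how many unlinked genes must all be inherited from one parent.

    If k genes are each inherited with probability 1/2, the fraction of
    offspring carrying all k is (1/2)**k, so k = log2(tested / with_phenotype).
    """
    if num_tested <= 0 or num_with_phenotype <= 0:
        raise ValueError("both counts must be positive")
    if num_with_phenotype > num_tested:
        raise ValueError("cannot have more offspring with the phenotype than tested")
    return math.log2(num_tested / num_with_phenotype)


# The example from the lecture: 2 resistant offspring out of 32 tested.
print(estimate_gene_count(32, 2))  # 4.0
```

With real counts the ratio is rarely an exact power of two, so the result is only a rough guide to the number of genes that differ between the parents.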

262
00:14:21,415 --> 00:14:22,415
Does everybody see that?

263
00:14:25,090 --> 00:14:26,300
Yes?

264
00:14:26,300 --> 00:14:27,760
Any questions at all about that?

265
00:14:33,025 --> 00:14:33,775
About the details?

266
00:14:37,390 --> 00:14:38,940
OK.

267
00:14:38,940 --> 00:14:44,400
Let's talk now about
quantitative traits then.

268
00:14:44,400 --> 00:14:47,905
We'll go back to our
model and imagine

269
00:14:47,905 --> 00:14:50,850
that we have the
same set-- actually

270
00:14:50,850 --> 00:14:53,110
it's going to be a
different set of n genes.

271
00:14:53,110 --> 00:14:56,070
We're going to have
a coin flip as to

272
00:14:56,070 --> 00:14:58,890
whether or not you're getting
a mom gene or a dad gene.

273
00:14:58,890 --> 00:15:00,200
OK.

274
00:15:00,200 --> 00:15:05,826
And each gene in dad has
an effect size of 1 over n.

275
00:15:05,826 --> 00:15:06,326
Yes?

276
00:15:06,326 --> 00:15:08,618
AUDIENCE: I just
wanted to check.

277
00:15:08,618 --> 00:15:13,438
We're assuming that the parents
are homozygous for the trait?

278
00:15:13,438 --> 00:15:14,402
Is that correct?

279
00:15:14,402 --> 00:15:16,040
PROFESSOR: Remember
these are haploid.

280
00:15:16,040 --> 00:15:17,540
AUDIENCE: Oh, these are haploid.

281
00:15:17,540 --> 00:15:18,248
PROFESSOR: Right.

282
00:15:18,248 --> 00:15:23,300
So they only have one
copy of all these genes.

283
00:15:23,300 --> 00:15:24,760
All right.

284
00:15:24,760 --> 00:15:25,774
Yes?

285
00:15:25,774 --> 00:15:30,220
AUDIENCE: [INAUDIBLE] resistant
and they're [INAUDIBLE].

286
00:15:30,220 --> 00:15:32,030
That could still
mean that dad has

287
00:15:32,030 --> 00:15:35,160
three of the four
genes in principle.

288
00:15:35,160 --> 00:15:36,640
PROFESSOR: The previous slide?

289
00:15:36,640 --> 00:15:38,588
Is that what
you're talking about?

290
00:15:38,588 --> 00:15:40,527
AUDIENCE: [INAUDIBLE]
knew about it.

291
00:15:40,527 --> 00:15:42,360
So really what you mean
is that dad does not

292
00:15:42,360 --> 00:15:44,930
have any of the genes that
are involved with resistance.

293
00:15:44,930 --> 00:15:45,888
PROFESSOR: That's correct.

294
00:15:48,602 --> 00:15:50,560
I was saying that dad
has to have all of gene--

295
00:15:50,560 --> 00:15:52,670
that the child has to
have all of the genes that

296
00:15:52,670 --> 00:15:54,290
are operative to
create resistance.

297
00:15:54,290 --> 00:15:55,750
We're going to
assume an AND model.

298
00:15:55,750 --> 00:15:58,120
He must have all
the genes from mom.

299
00:15:58,120 --> 00:16:01,470
They're involved in
the resistance pathway.

300
00:16:01,470 --> 00:16:04,720
And since only one
out of 16 progeny

301
00:16:04,720 --> 00:16:08,370
has all those genes
from mom, right, it

302
00:16:08,370 --> 00:16:11,390
appears that given the chance
of inheriting something from mom

303
00:16:11,390 --> 00:16:16,040
is 1/2, that it's four genes
you have to inherit from mom.

304
00:16:16,040 --> 00:16:19,787
Because the chance of inheriting
all four is one out of 16.

305
00:16:19,787 --> 00:16:23,060
AUDIENCE: [INAUDIBLE]
in which case--

306
00:16:23,060 --> 00:16:27,630
PROFESSOR: No, I'm assuming the
dad doesn't have any of those.

307
00:16:27,630 --> 00:16:29,660
But here we're asking,
what is the difference

308
00:16:29,660 --> 00:16:32,630
in the number of genes
between mom and dad?

309
00:16:32,630 --> 00:16:35,520
So you're right, that the
number we're computing

310
00:16:35,520 --> 00:16:39,360
is the relative number of genes
different between mom and dad

311
00:16:39,360 --> 00:16:40,740
you require.

312
00:16:40,740 --> 00:16:43,080
And so it might be
that dad's a reference

313
00:16:43,080 --> 00:16:45,790
and we're asking how many
additional genes mom brought

314
00:16:45,790 --> 00:16:47,870
to the table to provide
that resistance.

315
00:16:47,870 --> 00:16:49,560
But that's a good point.

316
00:16:49,560 --> 00:16:50,060
OK.

317
00:16:53,370 --> 00:16:54,410
OK.

318
00:16:54,410 --> 00:16:59,080
So, now let's look at
this quantitative model.

319
00:16:59,080 --> 00:17:04,440
Let's assume that mom has a
bunch of genes that contribute

320
00:17:04,440 --> 00:17:11,089
zero to an effect size
and dad-- each gene

321
00:17:11,089 --> 00:17:14,640
that dad has produces
an effect of 1 over n.

322
00:17:14,640 --> 00:17:18,560
So the total effect
size here for dad is 1.

323
00:17:18,560 --> 00:17:22,700
So the effect of mom on this
particular quantitative trait

324
00:17:22,700 --> 00:17:23,480
might be zero.

325
00:17:23,480 --> 00:17:25,920
It might be the amount
of ethanol produced

326
00:17:25,920 --> 00:17:28,190
or some other
quantitative value.

327
00:17:28,190 --> 00:17:31,930
And dad, on the other
hand, since he has n genes,

328
00:17:31,930 --> 00:17:35,890
is going to produce one,
because each gene contributes

329
00:17:35,890 --> 00:17:38,275
a little bit to this
quantitative phenotype.

330
00:17:41,642 --> 00:17:43,050
Is everybody clear on that?

331
00:17:45,670 --> 00:17:51,550
So, the child is
going to inherit genes

332
00:17:51,550 --> 00:17:56,290
through a coin flip between
mom and dad, right.

333
00:17:56,290 --> 00:17:57,880
So the first
fundamental question

334
00:17:57,880 --> 00:18:01,430
is, how many different
levels are there

335
00:18:01,430 --> 00:18:04,440
in our quantitative
phenotype in our trait?

336
00:18:08,360 --> 00:18:10,020
How many different
levels can you have?

337
00:18:16,274 --> 00:18:16,940
AUDIENCE: N + 1?

338
00:18:16,940 --> 00:18:20,270
PROFESSOR: N + 1, right,
because you can either inherit

339
00:18:20,270 --> 00:18:24,010
zero, or up to n genes from dad.

340
00:18:24,010 --> 00:18:27,940
And it gets you n plus
1 different levels.

341
00:18:27,940 --> 00:18:29,380
OK.

342
00:18:29,380 --> 00:18:32,410
So, what's the
probability then-- well,

343
00:18:32,410 --> 00:18:33,660
I'll ask a different question.

344
00:18:33,660 --> 00:18:38,245
What's the expected value of
the quantitative phenotype

345
00:18:38,245 --> 00:18:39,060
of a child?

346
00:18:43,620 --> 00:18:44,700
Just looking at this.

347
00:18:48,410 --> 00:18:52,860
If dad's one and mom's zero, and
you have a collection of genes

348
00:18:52,860 --> 00:18:57,507
and you do a coin
flip each time,

349
00:18:57,507 --> 00:18:59,340
you're going to get
half your genes from mom

350
00:18:59,340 --> 00:19:01,410
and half your genes from dad.

351
00:19:01,410 --> 00:19:02,990
Right.

352
00:19:02,990 --> 00:19:11,310
And so the expected
trait value is 0.5.

353
00:19:11,310 --> 00:19:13,420
So for these additive
traits, you're

354
00:19:13,420 --> 00:19:17,770
going to be at the midpoint
between mom and dad.

355
00:19:17,770 --> 00:19:19,880
Right.

356
00:19:19,880 --> 00:19:28,340
And what is the
probability that you

357
00:19:28,340 --> 00:19:32,055
inherit x copies of dad's genes?

358
00:19:35,390 --> 00:19:44,260
Well, that's n choose x, times
1 minus 0.5 to the n minus

359
00:19:44,260 --> 00:19:47,310
x times 0.5 to the x.

360
00:19:47,310 --> 00:19:50,080
A simple binomial.
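The binomial just stated can be written out as a short Python sketch. It assumes n unlinked genes, each inherited from dad with probability 0.5. The function name `prob_inherit` is hypothetical, used only for illustration.

```python
from math import comb


def prob_inherit(n: int, x: int, p: float = 0.5) -> float:
    """Probability of inheriting exactly x of n unlinked genes from dad.

    Binomial: C(n, x) * p**x * (1 - p)**(n - x), with p = 0.5 for a coin flip.
    """
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


n = 4
pmf = [prob_inherit(n, x) for x in range(n + 1)]
print(pmf)       # [0.0625, 0.25, 0.375, 0.25, 0.0625]
print(sum(pmf))  # 1.0
```

For n = 4 the last entry, 1/16, is the chance of inheriting all four genes from one parent, which matches the resistance example earlier in the lecture.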

361
00:19:50,080 --> 00:19:51,700
Right.

362
00:19:51,700 --> 00:19:54,980
So if you look at
this, the probability

363
00:19:54,980 --> 00:19:58,499
distribution
for the children

364
00:19:58,499 --> 00:20:00,040
is going to look
something like this,

365
00:20:00,040 --> 00:20:04,240
where this is the mean, 0.5.

366
00:20:04,240 --> 00:20:09,670
And the number of distinct
values is going to be n plus 1.

367
00:20:09,670 --> 00:20:12,100
Right.

368
00:20:12,100 --> 00:20:18,860
So the expected value of
x is 0.5 and turns out

369
00:20:18,860 --> 00:20:26,900
that the variance, the expected
value of x minus 0.5, which

370
00:20:26,900 --> 00:20:35,012
is the mean, squared, is
going to be 0.25 over n.

371
00:20:35,012 --> 00:20:36,720
So I can show you this
on the next slide.

372
00:20:39,550 --> 00:20:43,965
So you can see, this could
be ethanol production,

373
00:20:43,965 --> 00:20:46,700
it could be growth
rate, what have you.

374
00:20:46,700 --> 00:20:49,210
And you can see that the
number of genes that you're

375
00:20:49,210 --> 00:20:54,720
going to get from dad follows
this binomial distribution

376
00:20:54,720 --> 00:20:58,810
and gives you a spread
of different phenotypes

377
00:20:58,810 --> 00:21:00,790
in the child's
generation, depending

378
00:21:00,790 --> 00:21:03,005
upon how many copies of
dad's genes that you inherit.

379
00:21:07,000 --> 00:21:08,840
But does this make
sense to everybody?

380
00:21:08,840 --> 00:21:11,187
Now would be a great
time to ask any questions

381
00:21:11,187 --> 00:21:12,270
about the details of this.

382
00:21:12,270 --> 00:21:13,152
Yes?

383
00:21:13,152 --> 00:21:15,462
AUDIENCE: Can you
clarify what x is?

384
00:21:15,462 --> 00:21:17,780
Is x the fraction
of genes inherited--

385
00:21:17,780 --> 00:21:22,160
PROFESSOR: The number of
genes you inherit from dad.

386
00:21:22,160 --> 00:21:23,790
The number of genes.

387
00:21:23,790 --> 00:21:27,565
So it would be zero,
one, two, up to n.

388
00:21:27,565 --> 00:21:29,990
AUDIENCE: Shouldn't the
expectation of n [INAUDIBLE]

389
00:21:29,990 --> 00:21:32,415
x be n/2?

390
00:21:32,415 --> 00:21:33,315
PROFESSOR: I'm sorry.

391
00:21:33,315 --> 00:21:35,810
It is supposed to be n/2.

392
00:21:35,810 --> 00:21:41,320
But the last two
expectations are

393
00:21:41,320 --> 00:21:44,847
some of the number of genes
you've inherited from dad.

394
00:21:44,847 --> 00:21:46,835
Right, that's correct.

395
00:21:46,835 --> 00:21:48,326
Yeah, this slide's wrong.

396
00:21:55,232 --> 00:21:56,065
Any other questions?

397
00:21:59,540 --> 00:22:00,220
OK.

398
00:22:00,220 --> 00:22:06,075
So this is a very simple
model but it tells us

399
00:22:06,075 --> 00:22:08,550
a couple of things, right.

400
00:22:08,550 --> 00:22:13,350
Which is that as n
gets to be very large,

401
00:22:13,350 --> 00:22:16,125
the effect of each gene
gets to be quite small.

402
00:22:18,650 --> 00:22:21,760
So something could be
completely heritable,

403
00:22:21,760 --> 00:22:26,140
but if it's spread
over, say 1,000 genes,

404
00:22:26,140 --> 00:22:28,840
then it will be very
difficult to detect,

405
00:22:28,840 --> 00:22:31,550
because the effect of each
gene would be quite small.

406
00:22:31,550 --> 00:22:37,300
And furthermore, the variance
that you see in the offspring

407
00:22:37,300 --> 00:22:40,870
will be quite small
as well, right,

408
00:22:40,870 --> 00:22:42,450
in terms of the phenotype.

409
00:22:42,450 --> 00:22:46,170
Because it's going to be 0.25/n
in terms of the expected value.

410
00:22:46,170 --> 00:22:50,090
So as n gets larger,
the number of genes that

411
00:22:50,090 --> 00:22:52,820
contribute to that
phenotype increases,

412
00:22:52,820 --> 00:22:54,855
the variance is going
to go down linearly.
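A small Python check of the 0.25/n claim, computed exactly from the binomial distribution rather than by simulation. It assumes the model from the lecture: n unlinked genes, each dad allele adding 1/n to the trait and inherited with probability 0.5. The function names are hypothetical.

```python
from math import comb


def trait_distribution(n: int) -> list[tuple[float, float]]:
    """(trait value, probability) pairs for a child with n unlinked genes.

    Inheriting x of dad's genes gives trait value x / n, and x is
    binomial with success probability 0.5.
    """
    return [(x / n, comb(n, x) * 0.5 ** n) for x in range(n + 1)]


def trait_mean_and_variance(n: int) -> tuple[float, float]:
    """Mean and variance of the child's trait value under the model above."""
    dist = trait_distribution(n)
    mean = sum(value * prob for value, prob in dist)
    variance = sum((value - mean) ** 2 * prob for value, prob in dist)
    return mean, variance


for n in (4, 100, 1000):
    mean, variance = trait_mean_and_variance(n)
    # The two right-hand columns should agree up to floating-point rounding.
    print(n, mean, variance, 0.25 / n)
```

The mean stays at the midpoint between the parents while the variance shrinks in proportion to 1/n, which is why a fully heritable trait spread over many small-effect genes is hard to detect.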

413
00:22:58,580 --> 00:22:59,080
OK.

414
00:22:59,080 --> 00:23:01,390
So we should just
keep this in mind

415
00:23:01,390 --> 00:23:09,010
as we're looking at discovering
these sort of traits

416
00:23:09,010 --> 00:23:15,070
and the underlying QTLs that
can be used to predict them.

417
00:23:15,070 --> 00:23:19,460
And finally, I'd like to point
out one other detail which

418
00:23:19,460 --> 00:23:21,920
is that, if genes
are linked, that is,

419
00:23:21,920 --> 00:23:25,020
if they're in close proximity
to one another in the genome

420
00:23:25,020 --> 00:23:26,620
and it makes it very
unlikely there's

421
00:23:26,620 --> 00:23:29,340
going to be crossing
over between them,

422
00:23:29,340 --> 00:23:31,876
then they're going
to act as a unit.

423
00:23:31,876 --> 00:23:38,450
And if they act as a unit, then
we'll get marker correlation.

424
00:23:38,450 --> 00:23:41,140
And you can also
see, effectively,

425
00:23:41,140 --> 00:23:42,940
that the effect size
of those two genes

426
00:23:42,940 --> 00:23:45,520
is going to be larger.

427
00:23:45,520 --> 00:23:48,500
And in more complicated
models, we obviously

428
00:23:48,500 --> 00:23:52,930
wouldn't have the same
effect size for each gene.

429
00:23:52,930 --> 00:23:55,250
The effect size might be
quite large for some genes,

430
00:23:55,250 --> 00:23:56,780
might be quite small
for some genes.

431
00:24:00,480 --> 00:24:04,880
And we'll see the effects
of marker correlation

432
00:24:04,880 --> 00:24:08,150
in a little bit.

433
00:24:08,150 --> 00:24:12,409
So the way we're going to model
this is we're going to-- this

434
00:24:12,409 --> 00:24:14,200
is a definition of the
variables that we're

435
00:24:14,200 --> 00:24:17,160
going to be talking about today.

436
00:24:17,160 --> 00:24:21,745
And the essential
idea is quite simple.

437
00:24:33,280 --> 00:24:38,140
So the phenotype of an
individual-- so p sub

438
00:24:38,140 --> 00:24:40,620
i is the phenotype
of an individual,

439
00:24:40,620 --> 00:24:45,170
is going to be equal to some
function of their genotype

440
00:24:45,170 --> 00:24:47,285
plus an environmental component.

441
00:24:49,890 --> 00:24:57,170
This function is the critical
thing that we want to discover.

442
00:24:57,170 --> 00:25:00,470
This function, f, is
mapping from the genotype

443
00:25:00,470 --> 00:25:05,019
of an individual
to its phenotype.

444
00:25:05,019 --> 00:25:06,560
And the environmental
component could

445
00:25:06,560 --> 00:25:14,320
be how well something is fed,
how much sunlight it gets,

446
00:25:14,320 --> 00:25:18,910
things that can greatly
influence things like growth

447
00:25:18,910 --> 00:25:22,550
but they're not
described by genetics.

448
00:25:22,550 --> 00:25:25,630
But this function is
going to encapsulate

449
00:25:25,630 --> 00:25:29,150
what we know about
how the genetics

450
00:25:29,150 --> 00:25:33,145
of a particular individual
influences a trait.

451
00:25:36,600 --> 00:25:46,710
And thus, if we consider a
population of individuals,

452
00:25:46,710 --> 00:25:50,250
the phenotypic
variance is going to be

453
00:25:50,250 --> 00:25:57,220
equal to the genotypic variance
plus the environmental variance

454
00:25:57,220 --> 00:26:05,295
plus two times the covariance
between the genotype

455
00:26:05,295 --> 00:26:06,086
and the environment.

456
00:26:08,590 --> 00:26:12,280
And we're going to assume,
as most studies do,

457
00:26:12,280 --> 00:26:16,200
that there is no
correlation between genotype

458
00:26:16,200 --> 00:26:17,230
and environment.

459
00:26:17,230 --> 00:26:19,800
So this term disappears.

460
00:26:19,800 --> 00:26:23,410
So what we're left with is
that the observed phenotypic

461
00:26:23,410 --> 00:26:25,660
variance is equal to
the genotypic variance

462
00:26:25,660 --> 00:26:27,955
plus the environmental variance.
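
[Editor's sketch] Written with symbols, the relation stated above looks as follows. The notation (P_i, G_i, E_i and the sigma-squared terms) is assumed here for illustration; the lecture gives the relation in words and on the board.

```latex
\begin{align}
P_i &= f(G_i) + E_i \\
\sigma^2_P &= \sigma^2_G + \sigma^2_E + 2\,\mathrm{Cov}(G, E) \\
\sigma^2_P &= \sigma^2_G + \sigma^2_E
  \quad \text{when } \mathrm{Cov}(G, E) = 0
\end{align}
```

Here \sigma^2_G denotes the variance of f(G_i) across the population, and the last line holds only under the stated assumption that genotype and environment are uncorrelated.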

463
00:26:31,190 --> 00:26:35,870
And what we would like to do
is to come up with a function

464
00:26:35,870 --> 00:26:41,960
f, that best predicts
the genotypic component

465
00:26:41,960 --> 00:26:44,340
of this equation.

466
00:26:44,340 --> 00:26:47,041
There's nothing we can do
about environmental variance.

467
00:26:47,041 --> 00:26:47,540
Right.

468
00:26:50,680 --> 00:26:53,894
But we can measure it.

469
00:26:53,894 --> 00:26:55,310
Does anybody have
any ideas how we

470
00:26:55,310 --> 00:27:00,361
could measure
environmental variance?

471
00:27:00,361 --> 00:27:00,860
Yes?

472
00:27:00,860 --> 00:27:02,360
AUDIENCE: Study
populations in which

473
00:27:02,360 --> 00:27:04,588
there's some kind of
controlled environment.

474
00:27:08,110 --> 00:27:11,610
So you study populations
that one population

475
00:27:11,610 --> 00:27:13,610
is one with a homogeneous.

476
00:27:13,610 --> 00:27:16,332
And another one was a
completely different one.

477
00:27:16,332 --> 00:27:17,040
PROFESSOR: Right.

478
00:27:17,040 --> 00:27:20,880
So what we could do is
we could use controls.

479
00:27:20,880 --> 00:27:26,290
So typically what we could do is
we could study in environments

480
00:27:26,290 --> 00:27:28,310
where we try and control
the environment exactly

481
00:27:28,310 --> 00:27:33,670
to eliminate this as much as
we possibly can, for example.

482
00:27:33,670 --> 00:27:35,600
As we'll see that we
also can do things

483
00:27:35,600 --> 00:27:37,590
like study clones,
where individuals

484
00:27:37,590 --> 00:27:40,770
have exactly the same genotype.

485
00:27:40,770 --> 00:27:43,340
And then, all of the
variance that we observe--

486
00:27:43,340 --> 00:27:47,100
if this term vanishes because
the genotypes are identical,

487
00:27:47,100 --> 00:27:48,370
it is due to the environment.

488
00:27:50,900 --> 00:27:52,520
So typically, if
you're doing things

489
00:27:52,520 --> 00:27:56,680
like studying humans, since
cloning humans isn't really

490
00:27:56,680 --> 00:28:01,670
a good idea to actually
measure environmental variance,

491
00:28:01,670 --> 00:28:12,110
right, what you could do is you
can look at identical twins.

492
00:28:12,110 --> 00:28:15,080
And identical twins
give you a way

493
00:28:15,080 --> 00:28:17,640
to get at the question of
how much environment variance

494
00:28:17,640 --> 00:28:19,325
there is for a
particular phenotype.

495
00:28:22,110 --> 00:28:28,560
So in sum, this
replicates what I have here

496
00:28:28,560 --> 00:28:31,230
on the left-hand
side of the board.

497
00:28:31,230 --> 00:28:33,177
And note that today
we'll be talking

498
00:28:33,177 --> 00:28:35,010
about the idea of
discovering this function,

499
00:28:35,010 --> 00:28:38,120
f, and how well
we can discover f,

500
00:28:38,120 --> 00:28:39,620
which is really
important, right.

501
00:28:39,620 --> 00:28:42,530
It's fundamental to be
able to predict phenotype

502
00:28:42,530 --> 00:28:44,860
from genotype.

503
00:28:44,860 --> 00:28:51,250
It's an extraordinarily
central question in genetics.

504
00:28:51,250 --> 00:28:55,510
And when we do the prediction,
there are two kinds of-- oh,

505
00:28:55,510 --> 00:28:56,480
there's a question?

506
00:28:56,480 --> 00:28:58,146
AUDIENCE: Could you
please explain again

507
00:28:58,146 --> 00:29:01,540
why the co-variance drops
out or it goes away.

508
00:29:01,540 --> 00:29:03,290
PROFESSOR: Yeah, the
co-variance drops out

509
00:29:03,290 --> 00:29:05,180
because we're going to
assume that genotype

510
00:29:05,180 --> 00:29:06,513
and environment are independent.

511
00:29:09,030 --> 00:29:12,220
Now if they're not
independent, it won't drop out.

512
00:29:12,220 --> 00:29:16,980
But making that assumption--
and of course, for human studies

513
00:29:16,980 --> 00:29:19,710
you can't really make that
assumption completely, right?

514
00:29:19,710 --> 00:29:22,020
And one of the problems in
doing these sorts of studies

515
00:29:22,020 --> 00:29:25,830
is that it's very, very
easy to get confounded.

516
00:29:25,830 --> 00:29:28,250
Because when you're
trying to decompose

517
00:29:28,250 --> 00:29:31,694
the observed variance
in height, for example.

518
00:29:31,694 --> 00:29:35,750
You know, there's what mom and
dad provided to an individual

519
00:29:35,750 --> 00:29:37,500
in terms of their
height, and there's also

520
00:29:37,500 --> 00:29:40,060
how much junior ate, right.

521
00:29:40,060 --> 00:29:42,340
And whether he went to
McDonald's a lot, or you know,

522
00:29:42,340 --> 00:29:44,876
was going to Whole Foods a lot.

523
00:29:44,876 --> 00:29:46,000
You know, who knows, right?

524
00:29:46,000 --> 00:29:48,650
But this component
and this component,

525
00:29:48,650 --> 00:29:50,960
it's easy to get
confounded between them

526
00:29:50,960 --> 00:29:55,330
and sometimes you can
imagine that genotype

527
00:29:55,330 --> 00:29:58,670
is related to place of
origin in the world.

528
00:29:58,670 --> 00:30:00,566
And that has a lot to
do with environment.

529
00:30:00,566 --> 00:30:02,565
And so this term wouldn't
necessarily disappear.

530
00:30:07,451 --> 00:30:07,950
OK.

531
00:30:07,950 --> 00:30:09,533
So there are two
kinds of heritability

532
00:30:09,533 --> 00:30:11,050
I'd like to touch upon today.

533
00:30:11,050 --> 00:30:14,590
And it's important that you
remember there are two kinds

534
00:30:14,590 --> 00:30:19,810
and one is extraordinarily
difficult to recover

535
00:30:19,810 --> 00:30:24,990
and the other one is in some
sense, a more constrained

536
00:30:24,990 --> 00:30:27,750
problem, because we're much
better at building models

537
00:30:27,750 --> 00:30:30,430
for that kind of
heritability estimate.

538
00:30:30,430 --> 00:30:33,810
The first is broad-sense
heritability,

539
00:30:33,810 --> 00:30:38,560
which describes the upper bound
for phenotypic prediction given

540
00:30:38,560 --> 00:30:39,980
an arbitrary model.

541
00:30:39,980 --> 00:30:43,430
So it's the total contribution
to phenotypic variance

542
00:30:43,430 --> 00:30:45,740
from genetic causes.

543
00:30:45,740 --> 00:30:47,660
And we can estimate that, right.

544
00:30:47,660 --> 00:30:51,420
And we'll see how we can
estimate it in a moment.

545
00:30:51,420 --> 00:30:55,030
And narrow-sense
heritability is defined as,

546
00:30:55,030 --> 00:30:58,710
how much of the
heritability can we describe

547
00:30:58,710 --> 00:31:03,750
when we restrict f
to be a linear model.

548
00:31:03,750 --> 00:31:10,480
So when f is simply linear,
as the sum of terms,

549
00:31:10,480 --> 00:31:15,000
that describes the maximum
narrow-sense heritability we

550
00:31:15,000 --> 00:31:17,950
can recover in terms of
the fraction of phenotypic

551
00:31:17,950 --> 00:31:19,950
variance we can capture in f.

552
00:31:22,690 --> 00:31:27,820
And it's very useful
because it turns out

553
00:31:27,820 --> 00:31:32,257
that we can compute both
broad-sense and narrow-sense

554
00:31:32,257 --> 00:31:33,840
heritability from
first principles-- I

555
00:31:33,840 --> 00:31:36,590
mean from experiment.

556
00:31:36,590 --> 00:31:41,860
And the difference between them
is part of our quest today.

557
00:31:41,860 --> 00:31:44,080
Our quest is, to
answer the question,

558
00:31:44,080 --> 00:31:45,990
where is the missing
heritability?

559
00:31:45,990 --> 00:31:52,050
Why can't we build an
oracle f that perfectly

560
00:31:52,050 --> 00:31:55,390
predicts phenotype
from genotype?

561
00:31:58,270 --> 00:32:03,519
So on that line-- I just want
to give you some caveats.

562
00:32:03,519 --> 00:32:06,060
One is that we're always talking
about populations when we're

563
00:32:06,060 --> 00:32:07,870
talking about
heritability because it's

564
00:32:07,870 --> 00:32:10,030
how we're going to estimate it.

565
00:32:10,030 --> 00:32:13,580
And when you hear people
talk about heritability,

566
00:32:13,580 --> 00:32:15,550
oftentimes they won't
qualify it in terms

567
00:32:15,550 --> 00:32:18,030
of whether it's broad-sense
or narrow-sense.

568
00:32:18,030 --> 00:32:20,170
And so you should
ask them if you're

569
00:32:20,170 --> 00:32:23,310
engaged in a scientific
discussion with them.

570
00:32:23,310 --> 00:32:26,750
And as we've already
discussed, sometimes estimation

571
00:32:26,750 --> 00:32:30,730
is difficult because of matching
environment and eliminating

572
00:32:30,730 --> 00:32:33,210
this term, the
environmental term

573
00:32:33,210 --> 00:32:37,260
can be a challenge when
you're out of the laboratory.

574
00:32:37,260 --> 00:32:40,330
Like when you're
dealing with humans.

575
00:32:40,330 --> 00:32:43,930
So, let's talk about
broad-sense heritability.

576
00:32:46,660 --> 00:32:52,980
Imagine that we measure
environmental variance simply

577
00:32:52,980 --> 00:32:58,650
by looking at identical
twins or clones, right.

578
00:32:58,650 --> 00:33:02,470
Because if we, for example,
take a bunch of yeast

579
00:33:02,470 --> 00:33:04,990
that are genotypically
identical.

580
00:33:04,990 --> 00:33:07,660
And we grow them up
separately, and we

581
00:33:07,660 --> 00:33:13,540
measure a trait like
how well they respond

582
00:33:13,540 --> 00:33:17,870
to a particular chemical
or their growth rate,

583
00:33:17,870 --> 00:33:21,760
then the variance we see from
each individual to individual

584
00:33:21,760 --> 00:33:27,850
is simply environmental, because
they're genetically identical.

585
00:33:27,850 --> 00:33:28,440
So

586
00:33:28,440 --> 00:33:31,510
we can, in that
particular case, exactly

587
00:33:31,510 --> 00:33:33,800
quantify the
environmental variance

588
00:33:33,800 --> 00:33:38,200
given that every individual
is genetically identical.

589
00:33:38,200 --> 00:33:41,070
We simply measure
all the growth rates

590
00:33:41,070 --> 00:33:42,450
and we compute the variance.

591
00:33:42,450 --> 00:33:45,160
And that's the
environmental variance.

592
00:33:45,160 --> 00:33:47,900
OK?

593
00:33:47,900 --> 00:33:51,225
As I said for humans, the best
we can do is identical twins.

594
00:33:53,910 --> 00:33:56,680
Monozygotic twins.

595
00:33:56,680 --> 00:34:01,320
You can go out and for pairs
of twins that are identical,

596
00:34:01,320 --> 00:34:04,630
you can measure height or
any other trait that you like

597
00:34:04,630 --> 00:34:07,370
and compute the variance.

598
00:34:07,370 --> 00:34:11,239
And then that is an estimate
of the environmental component

599
00:34:11,239 --> 00:34:16,690
of that, because they should
be genetically identical.

600
00:34:19,650 --> 00:34:23,300
And big H squared--
broad-sense is always

601
00:34:23,300 --> 00:34:25,940
capital H squared and
narrow-sense is always

602
00:34:25,940 --> 00:34:27,420
little h squared.

603
00:34:27,420 --> 00:34:29,630
Big H squared,
which is broad-sense

604
00:34:29,630 --> 00:34:32,270
heritability is
very simple then.

605
00:34:32,270 --> 00:34:35,290
It's the phenotypic variance,
minus the environmental

606
00:34:35,290 --> 00:34:37,449
variance, over the
phenotypic variance.

607
00:34:37,449 --> 00:34:40,250
So it's the fraction of
phenotypic variance

608
00:34:40,250 --> 00:34:44,587
that can be explained
from genetic causes.

609
00:34:44,587 --> 00:34:46,515
Is that clear to everybody?

610
00:34:49,420 --> 00:34:52,030
Any questions at all about this?

611
00:34:55,030 --> 00:34:56,270
OK.

612
00:34:56,270 --> 00:35:01,030
So, for example, on the
right-hand side

613
00:35:01,030 --> 00:35:04,540
here, those three
purplish squares

614
00:35:04,540 --> 00:35:09,310
have three different
populations,

615
00:35:09,310 --> 00:35:12,230
which are genotypically
identical.

616
00:35:12,230 --> 00:35:15,250
Their genotypes are little a,
little a; big A, little a;

617
00:35:15,250 --> 00:35:19,280
and big A, big A. And each
one has a variance of 1.0.

618
00:35:19,280 --> 00:35:24,240
So since they are
genetically identical,

619
00:35:24,240 --> 00:35:29,340
we know that the environmental
variance has to be 1.0.

620
00:35:29,340 --> 00:35:33,640
On the left-hand side, you
see the genotypic variance.

621
00:35:33,640 --> 00:35:37,420
And that reminds us of
where we started today.

622
00:35:37,420 --> 00:35:40,710
It depends on the number of
alleles you get of big A,

623
00:35:40,710 --> 00:35:43,390
as to what the value is.

624
00:35:43,390 --> 00:35:46,110
And when you put all
of that together,

625
00:35:46,110 --> 00:35:48,920
you get a total variance of 3.

626
00:35:48,920 --> 00:35:52,840
And so big H squared is
simply the genotypic variance,

627
00:35:52,840 --> 00:35:55,740
which is 2, over the total
phenotypic variance, which

628
00:35:55,740 --> 00:35:56,680
is 3.

629
00:35:56,680 --> 00:35:58,210
So big H squared is 2/3.
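
[Editor's sketch] The arithmetic above can be written as a small function. The first call uses the numbers from the lecture example (total variance 3, environmental variance 1). The clone and population measurements in the second part are made-up values, included only to show how the two variances might be estimated from data.

```python
from statistics import pvariance

def broad_sense_heritability(var_phenotypic: float, var_environmental: float) -> float:
    """H^2 = (V_P - V_E) / V_P, the fraction of phenotypic variance
    attributable to genetic causes."""
    if var_phenotypic <= 0:
        raise ValueError("phenotypic variance must be positive")
    return (var_phenotypic - var_environmental) / var_phenotypic

# Lecture example: V_P = 3, V_E = 1, so V_G = 2 and H^2 = 2/3
h2_broad = broad_sense_heritability(3.0, 1.0)
print(h2_broad)

# Hypothetical data: clones estimate V_E, a mixed population estimates V_P
clone_growth = [1.0, 1.2, 0.8, 1.1, 0.9]       # genetically identical
population_growth = [0.5, 1.5, 1.0, 2.0, 0.0]  # genetically diverse
v_e = pvariance(clone_growth)
v_p = pvariance(population_growth)
h2_from_data = broad_sense_heritability(v_p, v_e)
print(h2_from_data)
```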

630
00:36:01,150 --> 00:36:04,274
And so that is a
way of computing

631
00:36:04,274 --> 00:36:05,315
broad-sense heritability.

632
00:36:09,580 --> 00:36:15,530
Now, if we think
about our models,

633
00:36:15,530 --> 00:36:18,110
we can see that
narrow-sense heritability

634
00:36:18,110 --> 00:36:20,150
has some very nice properties.

635
00:36:20,150 --> 00:36:21,210
Right.

636
00:36:21,210 --> 00:36:30,175
That is, if we build an
additive model of phenotype,

637
00:36:30,175 --> 00:36:31,675
to get at narrow-sense
heritability.

638
00:36:31,675 --> 00:36:37,270
So if we were to constrain
f here to be linear,

639
00:36:37,270 --> 00:36:40,880
it's simply going to be a
very simple linear model.

640
00:36:40,880 --> 00:36:46,050
For each particular
QTL that we discover,

641
00:36:46,050 --> 00:36:49,640
we assign an effect
size beta to it,

642
00:36:49,640 --> 00:36:52,900
or a coefficient that
describes its deviation

643
00:36:52,900 --> 00:36:57,330
from the mean for
that particular trait.

644
00:36:57,330 --> 00:37:00,880
And we have an
offset, beta zero.

645
00:37:00,880 --> 00:37:03,550
So our simple linear model is
going to take all the discovered

646
00:37:03,550 --> 00:37:06,270
QTLs that we have--
take each QTL

647
00:37:06,270 --> 00:37:10,120
and discover which
allelic form it's in.

648
00:37:10,120 --> 00:37:14,620
Typically it's considered
either in zero or one form.

649
00:37:14,620 --> 00:37:23,890
And then add a beta j, where
j is the particular QTL

650
00:37:23,890 --> 00:37:26,220
deviation from mean value.

651
00:37:26,220 --> 00:37:29,594
Add them all together to
compute the phenotype.
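
[Editor's sketch] The additive model just described is an offset plus one effect size per QTL that is in its "1" form. A minimal version might look like this; the function name and the example effect sizes are hypothetical.

```python
def predict_phenotype(genotype, betas, beta0=0.0):
    """Additive model: beta0 plus beta_j for every QTL j whose
    allelic form is 1 (genotype entries are 0 or 1)."""
    if len(genotype) != len(betas):
        raise ValueError("need exactly one effect size per QTL")
    return beta0 + sum(b * x for b, x in zip(betas, genotype))

# Three hypothetical QTLs; the individual carries form 1 at the first and third
prediction = predict_phenotype([1, 0, 1], betas=[0.5, -0.2, 1.0], beta0=2.0)
print(prediction)  # 3.5
```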

652
00:37:29,594 --> 00:37:30,950
OK.

653
00:37:30,950 --> 00:37:36,540
So, this is a very
simple additive model

654
00:37:36,540 --> 00:37:39,500
and a consequence
of this model is

655
00:37:39,500 --> 00:37:44,620
that if you think about an
F1 or a child of two parents,

656
00:37:44,620 --> 00:37:50,740
as we said earlier, a child is
going to inherit roughly half

657
00:37:50,740 --> 00:37:55,380
of the alleles from mom and
half of the alleles from dad.

658
00:37:55,380 --> 00:37:58,320
And so for additive
models like this,

659
00:37:58,320 --> 00:38:04,120
the expected value of
the child's trait value

660
00:38:04,120 --> 00:38:08,087
is going to be the
midpoint of mom and dad.

661
00:38:08,087 --> 00:38:10,170
And that can be derived
directly from the equation

662
00:38:10,170 --> 00:38:13,740
above, because you're
getting half of the QTLs

663
00:38:13,740 --> 00:38:15,750
from mom and half of
the QTLs from dad.
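
[Editor's sketch] The midpoint property can be verified by enumerating every possible child. This sketch assumes each parent carries a single 0/1 allele per QTL and that each QTL is inherited independently from mom or dad with probability 1/2, which is a simplification. All names and values are hypothetical.

```python
from itertools import product

def additive_value(genotype, betas, beta0=0.0):
    """Trait value under the additive model."""
    return beta0 + sum(b * x for b, x in zip(betas, genotype))

def expected_child_value(mom, dad, betas, beta0=0.0):
    """Average trait over all equally likely children, where each QTL
    comes from mom or dad with probability 1/2 (assumed, unlinked QTLs)."""
    values = []
    for choice in product((0, 1), repeat=len(betas)):  # 0 = mom, 1 = dad
        child = [dad[j] if c else mom[j] for j, c in enumerate(choice)]
        values.append(additive_value(child, betas, beta0))
    return sum(values) / len(values)

betas = [1.0, 0.5, 2.0]
mom, dad = [1, 0, 1], [0, 1, 1]
midparent = (additive_value(mom, betas) + additive_value(dad, betas)) / 2
child_mean = expected_child_value(mom, dad, betas)
print(midparent, child_mean)  # the two values agree
```

The agreement follows from linearity: the expected allele at each QTL is the average of the parents' alleles, and an additive model passes that average straight through.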

664
00:38:18,350 --> 00:38:20,390
So this was observed a
long time ago, right,

665
00:38:20,390 --> 00:38:30,470
because if you did studies and
you looked at the deviation

666
00:38:30,470 --> 00:38:35,040
from the midpoint of
parents for human height.

667
00:38:35,040 --> 00:38:42,320
You can see that the
children fall pretty

668
00:38:42,320 --> 00:38:49,360
close to the mid-parent line,
where the y-axis here

669
00:38:49,360 --> 00:38:54,980
is the height in inches
and that suggests

670
00:38:54,980 --> 00:39:04,295
that much of human height can be
modeled by a narrow-sense based

671
00:39:04,295 --> 00:39:05,160
heritability model.

672
00:39:09,660 --> 00:39:17,970
Now, once again,
narrow-sense heritability

673
00:39:17,970 --> 00:39:20,270
is the fraction of
phenotypic variance explained

674
00:39:20,270 --> 00:39:22,620
by an additive model.

675
00:39:22,620 --> 00:39:30,690
And we've talked before
about the model itself.

676
00:39:30,690 --> 00:39:32,770
And little h squared
is simply going

677
00:39:32,770 --> 00:39:36,100
to be the amount of
variance explained

678
00:39:36,100 --> 00:39:40,540
by the additive model over
the total phenotypic variance.

679
00:39:40,540 --> 00:39:45,920
And the additive variance is
shown on the right-hand side.

680
00:39:45,920 --> 00:39:50,980
That equation boils down to,
you take the phenotypic variance

681
00:39:50,980 --> 00:39:57,150
and you subtract off the
variance that's environmental

682
00:39:57,150 --> 00:40:01,260
and that cannot be explained
by the additive variance,

683
00:40:01,260 --> 00:40:03,885
and what you're left with
is the additive variance.
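
[Editor's sketch] A toy example shows how little h squared can fall below big H squared when genes interact. The two QTLs, their genotypic values, and the environmental variance of 1 are all assumed for illustration. The four genotypes are taken to be equally frequent, in which case the least-squares additive effects equal the differences in marginal means.

```python
def variance(values):
    """Population variance."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

# Hypothetical genotypic values: only the (1, 1) genotype raises the trait,
# so part of the genetic variance is an interaction, not additive.
genotypic_value = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 4.0}
var_environmental = 1.0  # assumed

genotypes = list(genotypic_value)
g = [genotypic_value[x] for x in genotypes]
mean_g = sum(g) / len(g)

def marginal_effect(locus):
    """Mean trait with allele 1 minus mean trait with allele 0 at one QTL."""
    ones = [genotypic_value[x] for x in genotypes if x[locus] == 1]
    zeros = [genotypic_value[x] for x in genotypes if x[locus] == 0]
    return sum(ones) / len(ones) - sum(zeros) / len(zeros)

betas = [marginal_effect(0), marginal_effect(1)]
beta0 = mean_g - 0.5 * sum(betas)  # each allele has frequency 1/2
additive_fit = [beta0 + betas[0] * x[0] + betas[1] * x[1] for x in genotypes]

var_g = variance(g)             # total genetic variance
var_a = variance(additive_fit)  # part captured by the additive model
var_p = var_g + var_environmental
big_h2 = var_g / var_p
little_h2 = var_a / var_p
print(big_h2, little_h2)
```

In this made-up case the phenotypic variance is 4, of which 1 is environmental and 1 is non-additive, leaving an additive variance of 2.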

684
00:40:09,230 --> 00:40:12,010
And once again, coming
back to the question

685
00:40:12,010 --> 00:40:15,850
of missing heritability,
if we observe

686
00:40:15,850 --> 00:40:19,540
that what we can estimate
for little h squared

687
00:40:19,540 --> 00:40:23,370
is below what we
expect, that gap

688
00:40:23,370 --> 00:40:24,615
has to be explained somehow.

689
00:40:27,550 --> 00:40:32,150
Some typical values for
theoretical h squared.

690
00:40:32,150 --> 00:40:33,800
So this is not
measured h squared

691
00:40:33,800 --> 00:40:38,020
in terms of building a model
and testing it like this.

692
00:40:38,020 --> 00:40:40,020
But what we can do is
we can theoretically

693
00:40:40,020 --> 00:40:43,510
estimate what h
squared should be,

694
00:40:43,510 --> 00:40:46,585
by looking at the fraction of
identity between individuals.

695
00:40:50,630 --> 00:40:52,930
Morphological
traits tend to have

696
00:40:52,930 --> 00:40:55,810
higher h squared than
the fitness traits.

697
00:40:55,810 --> 00:41:01,320
So human height has a little
h squared of about 0.8.

698
00:41:01,320 --> 00:41:05,300
And for those ranchers
out there in the audience,

699
00:41:05,300 --> 00:41:08,120
you'll be happy to know that
cattle yearling weight has a

700
00:41:08,120 --> 00:41:10,770
heritability of about 0.35.

701
00:41:10,770 --> 00:41:14,610
Now, things like life history
which are fitness traits

702
00:41:14,610 --> 00:41:17,980
are less heritable.

703
00:41:17,980 --> 00:41:21,980
Which would suggest that looking
at how long your parents lived

704
00:41:21,980 --> 00:41:24,470
and trying to estimate how
long you're going to live

705
00:41:24,470 --> 00:41:27,502
is not as productive as
looking at how tall you

706
00:41:27,502 --> 00:41:28,710
are compared to your parents.

707
00:41:32,677 --> 00:41:34,260
And there's a complete
table that I've

708
00:41:34,260 --> 00:41:37,080
included in the slides
for you to look at,

709
00:41:37,080 --> 00:41:41,050
but it's too small to
read on the screen.

710
00:41:41,050 --> 00:41:45,020
OK, so now we're going to
turn to computational models

711
00:41:45,020 --> 00:41:49,660
and how we can discover
a model that figures out

712
00:41:49,660 --> 00:41:54,870
where the QTLs are, and then
assigns that function f to them

713
00:41:54,870 --> 00:41:56,870
so we can predict
phenotype from genotype.

714
00:41:59,560 --> 00:42:04,190
And we're going to be taking
our example from this paper

715
00:42:04,190 --> 00:42:07,170
by Bloom, et al, which I
posted on the Stellar site.

716
00:42:07,170 --> 00:42:10,300
And it came out
last year and it's

717
00:42:10,300 --> 00:42:15,520
a wonderful study in QTL analysis.

718
00:42:15,520 --> 00:42:20,600
And the setup for this
study is quite simple.

719
00:42:20,600 --> 00:42:23,140
What they did was they
took two different strains

720
00:42:23,140 --> 00:42:28,730
of yeast, RM and BY,
and they crossed them

721
00:42:28,730 --> 00:42:35,720
and produced roughly 1,000 F1s.

722
00:42:35,720 --> 00:42:39,750
And RM and BY are very similar.

723
00:42:39,750 --> 00:42:44,710
There are about, I think
it's about 35,000 SNPs

724
00:42:44,710 --> 00:42:46,954
between them.

725
00:42:46,954 --> 00:42:50,280
Only about 0.5% of their
genomes are different.

726
00:42:50,280 --> 00:42:52,130
So they're really close.

727
00:42:55,150 --> 00:42:58,670
Just for point of reference, you
know, the distance between me

728
00:42:58,670 --> 00:43:03,622
and you is something like
one base for every thousand?

729
00:43:03,622 --> 00:43:04,455
Something like that.

730
00:43:07,800 --> 00:43:10,120
And then they assayed
all those F1s.

731
00:43:10,120 --> 00:43:12,340
They genotyped them all.

732
00:43:12,340 --> 00:43:14,980
So to genotype
them, what you do is

733
00:43:14,980 --> 00:43:16,720
you know what the
parental genotypes are

734
00:43:16,720 --> 00:43:18,980
because they sequenced
both parents.

735
00:43:18,980 --> 00:43:23,250
The mom and dad, so to
speak, at 50x coverage.

736
00:43:23,250 --> 00:43:25,750
So they knew the genome
sequences completely

737
00:43:25,750 --> 00:43:27,780
for both mom and dad.

738
00:43:27,780 --> 00:43:31,620
And then for each
one of the 1,000 F1s

739
00:43:31,620 --> 00:43:35,170
they put them on a
microarray and what

740
00:43:35,170 --> 00:43:38,030
is shown on the
very bottom left is

741
00:43:38,030 --> 00:43:40,900
a result of genotyping
an individual

742
00:43:40,900 --> 00:43:44,880
where they can see
each chromosome

743
00:43:44,880 --> 00:43:47,331
and whether it came
from mom or from dad.

744
00:43:47,331 --> 00:43:48,830
And you can't see
it here, but there

745
00:43:48,830 --> 00:43:52,540
are 16 different chromosomes
and the alternating purple and

746
00:43:52,540 --> 00:43:56,370
yellow colors show whether that
particular part of the genome

747
00:43:56,370 --> 00:43:59,050
came from mom or from dad.

748
00:43:59,050 --> 00:44:04,530
So they know for each
individual, its source.

749
00:44:04,530 --> 00:44:07,000
From the left or
the right strain.

750
00:44:07,000 --> 00:44:08,200
OK.

751
00:44:08,200 --> 00:44:12,150
And they have a thousand
different genetic makeups.

752
00:44:12,150 --> 00:44:17,010
And then they asked, for each
one of those individuals,

753
00:44:17,010 --> 00:44:23,420
how well could they grow
in 46 different conditions?

754
00:44:23,420 --> 00:44:26,610
So they exposed them
to different sugars,

755
00:44:26,610 --> 00:44:32,130
to different unfavorable
environments and so forth.

756
00:44:32,130 --> 00:44:36,212
And they measured growth rate
as shown on the right-hand side.

757
00:44:36,212 --> 00:44:37,920
Or right in the middle,
that little thing

758
00:44:37,920 --> 00:44:40,800
that looks like a bunch of
little dots of various sizes.

759
00:44:40,800 --> 00:44:43,420
By measuring colony
size, they could

760
00:44:43,420 --> 00:44:47,090
measure how well the
yeast were growing.

761
00:44:47,090 --> 00:44:49,680
And so they had two
different things, right.

762
00:44:49,680 --> 00:44:53,800
They had the exact genotype
of each individual,

763
00:44:53,800 --> 00:44:55,520
and they also had
how well it was

764
00:44:55,520 --> 00:44:59,010
growing in a
particular condition.

765
00:44:59,010 --> 00:45:01,250
And so for each
condition, they wanted

766
00:45:01,250 --> 00:45:04,280
to associate the genotype
of the individual

767
00:45:04,280 --> 00:45:05,582
to how well it was growing.

768
00:45:05,582 --> 00:45:06,290
To its phenotype.

769
00:45:09,970 --> 00:45:14,740
Now, one fair question is, of
these different conditions,

770
00:45:14,740 --> 00:45:18,260
how many of them were
really independent?

771
00:45:18,260 --> 00:45:20,860
And so to analyze
that, they looked

772
00:45:20,860 --> 00:45:22,710
at the correlation
between growth rates

773
00:45:22,710 --> 00:45:26,350
across conditions to try and
figure out whether or not

774
00:45:26,350 --> 00:45:32,270
they actually had 46 different
traits they were measuring.

775
00:45:32,270 --> 00:45:35,330
So this is a
correlation matrix that

776
00:45:35,330 --> 00:45:38,350
is too small to
read on the screen.

777
00:45:38,350 --> 00:45:41,790
The colors are somewhat
visible, where the blue colors

778
00:45:41,790 --> 00:45:44,310
are perfect correlation
and the red colors

779
00:45:44,310 --> 00:45:46,650
are perfect anti-correlation.

780
00:45:46,650 --> 00:45:50,839
And you can see that in
certain areas of this grid,

781
00:45:50,839 --> 00:45:52,380
things are more
correlated, like what

782
00:45:52,380 --> 00:45:57,040
sugars the yeast liked to eat.

783
00:45:57,040 --> 00:46:01,459
But suffice to say, they
had a large collection

784
00:46:01,459 --> 00:46:02,875
of traits they
wanted to estimate.
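
[Editor's sketch] A trait-by-trait correlation matrix like the one on the slide can be computed as follows. The condition names and growth values are invented for illustration and are not data from the paper.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_matrix(traits):
    """Map every pair of condition names to the correlation of growth
    across individuals (same individual order in every list)."""
    names = list(traits)
    return {(a, b): pearson(traits[a], traits[b]) for a in names for b in names}

# Hypothetical growth measurements for four individuals in three conditions
growth = {
    "glucose": [1.0, 2.0, 3.0, 4.0],
    "galactose": [2.0, 4.0, 6.0, 8.5],
    "heat": [4.0, 3.0, 2.0, 1.0],
}
corr = correlation_matrix(growth)
print(corr[("glucose", "galactose")], corr[("glucose", "heat")])
```

Pairs of conditions with correlation near 1 or -1 are effectively measuring the same underlying trait, which is the concern raised above about whether all 46 conditions are independent.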

785
00:46:06,200 --> 00:46:12,720
So, now we want to build
a computational model.

786
00:46:12,720 --> 00:46:15,030
So our next step
is figuring out how

787
00:46:15,030 --> 00:46:17,240
to find those places
in the genome that

788
00:46:17,240 --> 00:46:21,390
allows us to predict,
how well, given a trait,

789
00:46:21,390 --> 00:46:23,390
the yeast would grow.

790
00:46:23,390 --> 00:46:26,650
The actual growth rate.

791
00:46:26,650 --> 00:46:44,110
So the key idea is this-- you
have genetic markers, which

792
00:46:44,110 --> 00:46:46,670
are SNPs down the
genome and you're

793
00:46:46,670 --> 00:46:50,180
going to test a
particular marker.

794
00:46:50,180 --> 00:46:57,360
And if this is a
particular trait,

795
00:46:57,360 --> 00:47:01,320
one possibility is
that-- let's say

796
00:47:01,320 --> 00:47:04,900
that this marker could
be either 0 or 1.

797
00:47:04,900 --> 00:47:08,310
Without loss of
generality, it could

798
00:47:08,310 --> 00:47:10,660
be that here are all
the individuals where

799
00:47:10,660 --> 00:47:12,700
the marker is zero.

800
00:47:12,700 --> 00:47:19,370
And here are all the individuals
where the marker is 1.

801
00:47:19,370 --> 00:47:25,790
And really, fundamentally,
whether an individual

802
00:47:25,790 --> 00:47:28,470
has a 0 or a 1 marker,
it doesn't really

803
00:47:28,470 --> 00:47:33,380
change its growth
rate very much.

804
00:47:33,380 --> 00:47:34,220
OK?

805
00:47:34,220 --> 00:47:36,900
It's more or less identical.

806
00:47:36,900 --> 00:47:50,800
It's also possible
that this is best

807
00:47:50,800 --> 00:47:57,120
modeled by two different
means for a given trait.

808
00:47:57,120 --> 00:48:03,780
That when the marker is 1,
you're growing-- actually

809
00:48:03,780 --> 00:48:10,210
this is going to be the
growth rate on the x-axis.

810
00:48:10,210 --> 00:48:11,890
The y-axis is the density.

811
00:48:11,890 --> 00:48:14,970
That you're growing much
better when you have a 1

812
00:48:14,970 --> 00:48:18,970
in that marker
position than a zero.

813
00:48:18,970 --> 00:48:22,240
And so we need to distinguish
between these two cases

814
00:48:22,240 --> 00:48:25,350
when the marker is
predictive of growth rate

815
00:48:25,350 --> 00:48:28,000
and when the marker is not
predictive of growth rate.

816
00:48:30,630 --> 00:48:32,770
And we've talked about log
likelihood tests before

817
00:48:32,770 --> 00:48:36,060
and you can see one
on the very top.

818
00:48:36,060 --> 00:48:38,410
And you can see there's an
additional degree of freedom

819
00:48:38,410 --> 00:48:41,117
that we have in the top
prediction versus the bottom

820
00:48:41,117 --> 00:48:42,950
because we're using two
different means that

821
00:48:42,950 --> 00:48:47,530
are conditioned upon
the genotypic value

822
00:48:47,530 --> 00:48:48,738
at a particular marker.
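
[Editor's sketch] One common way to score a single marker is to compare a Gaussian model with one mean per marker allele against a Gaussian model with a single mean, and report the base-10 log of the likelihood ratio. This sketch assumes both models share one variance, in which case the score reduces to (n/2) log10(RSS_one_mean / RSS_two_means). The exact test on the slide may differ, and the data below are made up.

```python
from math import log10

def sum_sq(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def lod_score(marker, phenotype):
    """log10 likelihood ratio: two means (one per 0/1 marker allele,
    shared variance) versus a single mean for everyone."""
    group0 = [p for m, p in zip(marker, phenotype) if m == 0]
    group1 = [p for m, p in zip(marker, phenotype) if m == 1]
    if not group0 or not group1:
        return 0.0  # marker does not vary, so it cannot be informative
    rss_one_mean = sum_sq(phenotype)
    if rss_one_mean == 0.0:
        return 0.0  # phenotype does not vary
    rss_two_means = sum_sq(group0) + sum_sq(group1)
    if rss_two_means == 0.0:
        return float("inf")  # marker separates the phenotypes perfectly
    return max(0.0, (len(phenotype) / 2) * log10(rss_one_mean / rss_two_means))

growth = [1.0, 1.2, 0.8, 2.0, 2.2, 1.8]  # hypothetical growth rates
lod_linked = lod_score([0, 0, 0, 1, 1, 1], growth)    # marker tracks growth
lod_unlinked = lod_score([0, 1, 0, 1, 0, 1], growth)  # marker does not
print(lod_linked, lod_unlinked)
```

The extra degree of freedom mentioned above is the second mean: the two-mean model can never fit worse, so the score is only meaningful relative to what chance alone would produce.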

823
00:48:53,230 --> 00:48:57,650
So we have a lot of
different markers indeed.

824
00:48:57,650 --> 00:49:00,980
So we have-- let's see
here, the exact number.

825
00:49:00,980 --> 00:49:06,770
I think it's about 13,000
markers they had in this study.

826
00:49:06,770 --> 00:49:07,280
No.

827
00:49:07,280 --> 00:49:12,620
11,623 different unique
markers they found.

828
00:49:12,620 --> 00:49:15,500
That they could discover,
that weren't linked together.

829
00:49:15,500 --> 00:49:18,210
We talked about
linkage earlier on.

830
00:49:18,210 --> 00:49:23,260
So you've got over
11,000 markers.

831
00:49:23,260 --> 00:49:26,430
You're going to do
a log likelihood

832
00:49:26,430 --> 00:49:29,065
test to compute
this log odds score.

833
00:49:33,750 --> 00:49:36,180
Do we have to worry about
multiple hypothesis correction

834
00:49:36,180 --> 00:49:38,640
here?

835
00:49:38,640 --> 00:49:41,240
Because you're
testing over 11,000

836
00:49:41,240 --> 00:49:42,810
markers to see
whether or not they're

837
00:49:42,810 --> 00:49:45,360
significant for one trait.

838
00:49:45,360 --> 00:49:45,860
Right.

839
00:49:52,920 --> 00:50:01,520
So one thing that we could do
is imagine that what we did was

840
00:50:01,520 --> 00:50:05,970
we scrambled the association
between phenotypes

841
00:50:05,970 --> 00:50:07,620
and individuals.

842
00:50:07,620 --> 00:50:11,136
So we just randomized it and
we did that a thousand times.

843
00:50:11,136 --> 00:50:15,670
And each time we did it, we
computed the distribution

844
00:50:15,670 --> 00:50:18,330
of these lod scores.

845
00:50:18,330 --> 00:50:23,400
Because we have broken the
association between phenotype

846
00:50:23,400 --> 00:50:26,410
and genotype, the
lod scores which

847
00:50:26,410 --> 00:50:29,680
we should be seeing if we
did this randomization,

848
00:50:29,680 --> 00:50:33,360
should correspond to
essentially noise.

849
00:50:33,360 --> 00:50:35,360
It's what we would see at random.

850
00:50:35,360 --> 00:50:39,750
So it's a null distribution
we can look at.

851
00:50:39,750 --> 00:50:44,875
And so what we'll see is a
distribution of lod scores.

852
00:50:49,410 --> 00:50:51,440
This is the lod.

853
00:50:51,440 --> 00:51:03,980
This is the probability from
a null, a permutation test.

854
00:51:03,980 --> 00:51:07,970
And since we actually have
done the randomization

855
00:51:07,970 --> 00:51:15,725
over all 11,000 markers,
we can directly draw a line

856
00:51:15,725 --> 00:51:19,960
and ask what are the chances
that a lod score would

857
00:51:19,960 --> 00:51:25,500
be greater than or equal to
a particular value at random?

858
00:51:25,500 --> 00:51:27,730
And we can pick an
area inside this tail,

859
00:51:27,730 --> 00:51:29,660
let's say 0.05,
because that's what

860
00:51:29,660 --> 00:51:32,590
the authors of this
particular paper used

861
00:51:32,590 --> 00:51:37,780
and ask what value
of a lod score

862
00:51:37,780 --> 00:51:42,500
would be very unlikely
to have by chance?

863
00:51:42,500 --> 00:51:48,330
It turns out in their first
iteration, it was 2.63.

864
00:51:48,330 --> 00:51:53,500
That a lod score over
2.63 had a 0.05 chance

865
00:51:53,500 --> 00:51:59,430
or less of occurring in
randomly permuted data.

866
00:51:59,430 --> 00:52:03,640
And since the permuted data
contained all of the markers,

867
00:52:03,640 --> 00:52:07,700
we don't have to do any
multiple hypothesis correction.

868
00:52:07,700 --> 00:52:10,410
So you can directly
compare the statistic

869
00:52:10,410 --> 00:52:15,520
that you compute
against a threshold

870
00:52:15,520 --> 00:52:21,330
and accept any marker or QTL
that has a lod score greater,

871
00:52:21,330 --> 00:52:26,720
in this case than 2.63
and put it in your model.

872
00:52:26,720 --> 00:52:30,200
And everything else
you can reject.
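
A hedged sketch of the permutation idea, not the paper's exact procedure. It assumes the common genome-wide variant in which each shuffle records the maximum LOD across all markers, so the resulting cutoff already accounts for testing every marker. The names `permutation_lod_threshold` and `lod_score` and the shared-variance Gaussian model are assumptions of this sketch.

```python
import math
import random


def lod_score(phenotypes, genotypes):
    """log10 likelihood ratio: one mean per genotype vs a single shared mean."""
    n = len(phenotypes)
    grand_mean = sum(phenotypes) / n
    rss_null = sum((y - grand_mean) ** 2 for y in phenotypes)
    rss_marker = 0.0
    for allele in set(genotypes):
        group = [y for y, g in zip(phenotypes, genotypes) if g == allele]
        group_mean = sum(group) / len(group)
        rss_marker += sum((y - group_mean) ** 2 for y in group)
    return (n / 2) * math.log10(rss_null / rss_marker)


def permutation_lod_threshold(phenotypes, markers, n_perm=1000, alpha=0.05, seed=0):
    """LOD cutoff exceeded by roughly a fraction alpha of shuffled data sets.

    markers is a list of per-marker genotype lists, one entry per individual.
    """
    rng = random.Random(seed)
    shuffled = list(phenotypes)
    max_lods = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the phenotype-genotype association
        # Taking the maximum over all markers builds the multiple-testing
        # correction into the null distribution itself.
        max_lods.append(max(lod_score(shuffled, g) for g in markers))
    max_lods.sort()
    cutoff_index = min(n_perm - 1, int((1 - alpha) * n_perm))
    return max_lods[cutoff_index]
```

Any marker whose observed LOD exceeds the returned cutoff would go into the model; the cutoff depends on the data, so it has to be recomputed whenever the trait values change.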

873
00:52:30,200 --> 00:52:32,520
And so you start by
building a model out

874
00:52:32,520 --> 00:52:34,870
of all of the markers
that are significant

875
00:52:34,870 --> 00:52:36,670
at this particular level.

876
00:52:39,750 --> 00:52:44,080
You then assemble the
model and you can now

877
00:52:44,080 --> 00:52:47,860
predict phenotype from genotype.

878
00:52:47,860 --> 00:52:50,710
But of course, you're going
to make errors, right.

879
00:52:50,710 --> 00:52:53,490
For each individual, there's
going to be an error.

880
00:52:53,490 --> 00:53:05,370
You're going to have a residual
for each individual that

881
00:53:05,370 --> 00:53:15,880
is going to be the observed
phenotype minus the predicted phenotype

882
00:53:15,880 --> 00:53:18,390
of the individual.

883
00:53:18,390 --> 00:53:22,200
So this is the error
that you're making.

884
00:53:22,200 --> 00:53:27,060
So what these folks
did was that you first

885
00:53:27,060 --> 00:53:34,150
look at predicting the
phenotype directly,

886
00:53:34,150 --> 00:53:37,620
and you pick all the QTLs that
are significant at that level.

887
00:53:37,620 --> 00:53:40,097
And then you compute
the residuals

888
00:53:40,097 --> 00:53:41,680
and you try and
predict the residuals.

889
00:53:44,700 --> 00:53:49,370
And you try and
find additional QTLs

890
00:53:49,370 --> 00:53:55,910
that are significant after you
have picked the original ones.

891
00:53:55,910 --> 00:53:57,410
OK.

892
00:53:57,410 --> 00:54:02,190
So why might this produce more
QTLs than the original pass?

893
00:54:09,040 --> 00:54:11,190
What do you think?

894
00:54:11,190 --> 00:54:14,930
Why is it that trying to
predict the residuals is

895
00:54:14,930 --> 00:54:17,642
a good idea after
you've tried to predict

896
00:54:17,642 --> 00:54:18,600
the phenotype directly?

897
00:54:23,530 --> 00:54:25,168
Any ideas about that?

898
00:54:34,060 --> 00:54:36,550
Well, what this
is telling us, is

899
00:54:36,550 --> 00:54:39,640
that these QTLs we're
going to predict now

900
00:54:39,640 --> 00:54:44,310
were not significant enough
in the original pass,

901
00:54:44,310 --> 00:54:48,210
but when we're looking at what's
left over, after we subtract

902
00:54:48,210 --> 00:54:50,660
off the effect of
all the other QTLs,

903
00:54:50,660 --> 00:54:52,916
other things might pop up.

904
00:54:52,916 --> 00:54:57,300
But in some sense, they were
obscured by the original QTLs.

905
00:54:57,300 --> 00:55:00,890
Once we subtract
off their influence,

906
00:55:00,890 --> 00:55:04,500
we can see things that
we didn't see before.

907
00:55:04,500 --> 00:55:07,470
And we start gathering
up these additional QTLs

908
00:55:07,470 --> 00:55:10,340
to predict the
residual components.

909
00:55:10,340 --> 00:55:13,290
And so they do this three times.

910
00:55:13,290 --> 00:55:15,650
So they predict the
original set of QTLs

911
00:55:15,650 --> 00:55:20,390
and then they iterate
three times on the residuals

912
00:55:20,390 --> 00:55:24,150
to find and fit a linear
model that predicts a given

913
00:55:24,150 --> 00:55:28,230
trait from a collection of
QTLs that they discover.

914
00:55:28,230 --> 00:55:29,830
Yes?

915
00:55:29,830 --> 00:55:30,496
AUDIENCE: Sorry.

916
00:55:30,496 --> 00:55:32,670
I'm still confused.

917
00:55:32,670 --> 00:55:40,328
The second round? [INAUDIBLE]
done three additional times?

918
00:55:40,328 --> 00:55:41,810
Is that right?

919
00:55:41,810 --> 00:55:42,798
So the--

920
00:55:42,798 --> 00:55:45,280
PROFESSOR: Yes.

921
00:55:45,280 --> 00:55:47,588
AUDIENCE: Is it done
on the remainder of QTL

922
00:55:47,588 --> 00:55:50,444
or on the original
list of every--

923
00:55:50,444 --> 00:55:52,710
PROFESSOR: Each time
you expand your model

924
00:55:52,710 --> 00:55:55,780
to include all the QTLs you've
discovered up to that point.

925
00:55:55,780 --> 00:56:01,111
So initially, you discover a
set of QTLs, call that set one.

926
00:56:01,111 --> 00:56:04,620
You then compute a
model using set one

927
00:56:04,620 --> 00:56:07,555
and you discover the residuals.

928
00:56:07,555 --> 00:56:08,505
AUDIENCE: [INAUDIBLE].

929
00:56:08,505 --> 00:56:09,296
PROFESSOR: Correct.

930
00:56:09,296 --> 00:56:10,890
Well, residual
[INAUDIBLE] so you use

931
00:56:10,890 --> 00:56:13,425
set one to build a
model, a phenotype.

932
00:56:16,280 --> 00:56:20,480
So set one is used here
to compute this, right.

933
00:56:20,480 --> 00:56:21,740
And so set one is used.

934
00:56:21,740 --> 00:56:23,350
And then you compute
what's left over

935
00:56:23,350 --> 00:56:26,980
after you've discovered
the first set of QTLs.

936
00:56:26,980 --> 00:56:30,980
Now you say, we still
have this left to go.

937
00:56:30,980 --> 00:56:32,380
Let's discover some more QTLs.

938
00:56:32,380 --> 00:56:35,950
And now you discover
set two of QTLs.

939
00:56:35,950 --> 00:56:37,260
OK.

940
00:56:37,260 --> 00:56:41,520
And that set two then is used to
build a model that has set one

941
00:56:41,520 --> 00:56:44,010
and set two in it.

942
00:56:44,010 --> 00:56:44,510
Right.

943
00:56:44,510 --> 00:56:46,200
And that residual
is used to discover

944
00:56:46,200 --> 00:56:49,850
set three and so forth.

945
00:56:49,850 --> 00:56:52,790
So each time you're
expanding the set of QTLs

946
00:56:52,790 --> 00:56:54,860
by what you've discovered
in the residuals.

947
00:56:54,860 --> 00:56:57,140
Sort of in the trash
bin so to speak.
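
A simplified sketch of the residual iteration, under stated assumptions: markers are treated as unlinked, each selected marker's effect is removed by subtracting its genotype-class means from the current residual (a stagewise shortcut, not the joint refit of the full linear model described here), and the cutoff is passed in as a fixed `lod_threshold` even though the lecture says it is re-estimated by permutation in every round. All function names are hypothetical.

```python
import math


def lod_score(values, genotypes):
    """log10 likelihood ratio: one mean per genotype vs a single shared mean."""
    n = len(values)
    grand_mean = sum(values) / n
    rss_null = sum((v - grand_mean) ** 2 for v in values)
    rss_marker = 0.0
    for allele, mean in class_means(values, genotypes).items():
        rss_marker += sum((v - mean) ** 2
                          for v, g in zip(values, genotypes) if g == allele)
    if rss_marker == 0.0:  # perfect fit; avoid dividing by zero
        return float("inf") if rss_null > 0.0 else 0.0
    return (n / 2) * math.log10(rss_null / rss_marker)


def class_means(values, genotypes):
    """Mean of values within each genotype class."""
    means = {}
    for allele in set(genotypes):
        group = [v for v, g in zip(values, genotypes) if g == allele]
        means[allele] = sum(group) / len(group)
    return means


def iterative_qtl_search(phenotypes, markers, lod_threshold, n_rounds=4):
    """One pass on the phenotype, then further passes on what is left over."""
    residual = list(phenotypes)
    selected = []
    for _ in range(n_rounds):
        hits = [i for i, g in enumerate(markers)
                if i not in selected and lod_score(residual, g) > lod_threshold]
        if not hits:
            break
        for i in hits:
            # Subtract this marker's contribution from what remains.
            means = class_means(residual, markers[i])
            residual = [r - means[g] for r, g in zip(residual, markers[i])]
        selected.extend(hits)
    return selected, residual
```

With `n_rounds=4` the loop mirrors one original pass plus three passes on the residuals; a marker with a small effect can clear the cutoff in a later round once the variance contributed by stronger markers has been removed.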

948
00:56:57,140 --> 00:56:57,640
Yes?

949
00:56:57,640 --> 00:57:00,130
AUDIENCE: Each time you're
doing this randomization

950
00:57:00,130 --> 00:57:01,130
to determine lod cutoff?

951
00:57:01,130 --> 00:57:02,213
PROFESSOR: That's correct.

952
00:57:02,213 --> 00:57:04,285
Each time you have to
redo the randomization

953
00:57:04,285 --> 00:57:05,770
and get to the lod cutoff.

954
00:57:05,770 --> 00:57:07,502
AUDIENCE: But does
that method actually

955
00:57:07,502 --> 00:57:10,581
work the way you expect it on
the second pass, given that you

956
00:57:10,581 --> 00:57:12,205
have some false
positives from the pass

957
00:57:12,205 --> 00:57:17,052
that you've now
subtracted from your data?

958
00:57:17,052 --> 00:57:19,135
PROFESSOR: I'm not sure I
understand the question.

959
00:57:19,135 --> 00:57:21,115
AUDIENCE: So the second time
you do this randomization,

960
00:57:21,115 --> 00:57:22,739
and you again come
up with a threshold,

961
00:57:22,739 --> 00:57:27,287
you say, oh, above here
there are 5% false positives.

962
00:57:27,287 --> 00:57:27,995
PROFESSOR: Right.

963
00:57:27,995 --> 00:57:31,324
AUDIENCE: But could it be
that that estimate is actually

964
00:57:31,324 --> 00:57:35,680
significantly wrong based on the
fact that you've subtracted off

965
00:57:35,680 --> 00:57:39,068
false positives before
you do that process?

966
00:57:41,630 --> 00:57:43,590
PROFESSOR: I mean,
in some sense, what's

967
00:57:43,590 --> 00:57:46,450
your definition of
a false positive?

968
00:57:46,450 --> 00:57:46,950
Right.

969
00:57:46,950 --> 00:57:50,080
I mean it gets down
to that because we've

970
00:57:50,080 --> 00:57:53,020
discovered there's an
association between that QTL

971
00:57:53,020 --> 00:57:54,750
and predicting phenotype.

972
00:57:54,750 --> 00:57:59,230
And in this particular world
it's useful for doing that.

973
00:57:59,230 --> 00:58:01,860
So it's hard to call something
a false positive in that sense,

974
00:58:01,860 --> 00:58:03,932
right.

975
00:58:03,932 --> 00:58:05,390
But you're right,
you actually have

976
00:58:05,390 --> 00:58:09,327
to reset your
threshold every time

977
00:58:09,327 --> 00:58:10,785
that you go through
this iteration.

978
00:58:14,900 --> 00:58:15,650
Good question.

979
00:58:15,650 --> 00:58:16,316
Other questions?

980
00:58:19,710 --> 00:58:21,300
OK.

981
00:58:21,300 --> 00:58:25,160
So, let's see what
happens when you do this.

982
00:58:25,160 --> 00:58:28,910
What happens is that if
you look down the genome,

983
00:58:28,910 --> 00:58:30,670
you discover a collection.

984
00:58:30,670 --> 00:58:42,040
For example, this is
growth in E6 berbamine.

985
00:58:42,040 --> 00:58:45,130
And you can see the
significant locations

986
00:58:45,130 --> 00:58:48,590
in the genome, the numbers 1
through 16 of the chromosomes

987
00:58:48,590 --> 00:58:51,960
and the little red
asterisks above the peaks

988
00:58:51,960 --> 00:58:54,010
indicate that that was
a significant lod score.

989
00:58:54,010 --> 00:58:56,729
The y-axis is a lod score.

990
00:58:56,729 --> 00:58:58,520
And you can see the
locations in the genome

991
00:58:58,520 --> 00:59:05,030
where we have found places that
were associated with growth

992
00:59:05,030 --> 00:59:10,010
rate in that
particular chemical.

993
00:59:10,010 --> 00:59:11,810
OK.

994
00:59:11,810 --> 00:59:15,890
Now, why is it, do you think,
that in many of those places

995
00:59:15,890 --> 00:59:20,520
you see sort of a rise and
fall that is somewhat gentle

996
00:59:20,520 --> 00:59:23,000
as opposed to having
an impulse function

997
00:59:23,000 --> 00:59:24,400
right at that particular spot?

998
00:59:30,029 --> 00:59:31,445
AUDIENCE: Nearby
SNPs are linked?

999
00:59:31,445 --> 00:59:33,153
PROFESSOR: Yeah, nearby
SNPs are linked.

1000
00:59:33,153 --> 00:59:37,930
That as you come up to
a place that is causal,

1001
00:59:37,930 --> 00:59:42,000
you get a lot of other
things that are linked to that.

1002
00:59:42,000 --> 00:59:44,880
And the closer you get, the
higher the correlation is.

1003
00:59:48,440 --> 00:59:52,910
So that is for 1,000
segregants in the top.

1004
00:59:52,910 --> 00:59:58,550
And what was discovered
for that particular trait,

1005
00:59:58,550 --> 01:00:04,690
was 15 different
loci that explained

1006
01:00:04,690 --> 01:00:10,090
78% of the phenotypic variance.

1007
01:00:10,090 --> 01:00:14,650
And in the bottom,
the same procedure

1008
01:00:14,650 --> 01:00:20,040
was used, but was only
used on 100 segregants.

1009
01:00:20,040 --> 01:00:23,350
And what you can see is that,
in this particular case,

1010
01:00:23,350 --> 01:00:27,330
only two loci were
discovered that explain

1011
01:00:27,330 --> 01:00:28,715
21% of the variance.

1012
01:00:31,450 --> 01:00:33,730
So the bottom study was
grossly underpowered.

1013
01:00:36,740 --> 01:00:41,240
Remember we talked about
the problem of finding

1014
01:00:41,240 --> 01:00:45,070
QTLs that had
small effect sizes.

1015
01:00:45,070 --> 01:00:47,230
And if you don't have
enough individuals

1016
01:00:47,230 --> 01:00:49,830
you're going to be under-powered
and you can't actually

1017
01:00:49,830 --> 01:00:51,210
identify all of the QTLs.

1018
01:00:54,080 --> 01:00:57,620
So this is a comparison of this.

1019
01:00:57,620 --> 01:01:00,950
And of course, one of the
things that you don't know

1020
01:01:00,950 --> 01:01:05,010
is the environmental variance
that you're fighting against.

1021
01:01:05,010 --> 01:01:07,160
Because the number
of individuals

1022
01:01:07,160 --> 01:01:11,490
you need, depends both on
the number of potential loci

1023
01:01:11,490 --> 01:01:12,890
that you have.

1024
01:01:12,890 --> 01:01:17,380
The more loci you have, the more
individuals you need to fight

1025
01:01:17,380 --> 01:01:19,210
against the multiple
hypotheses problem,

1026
01:01:19,210 --> 01:01:21,465
which is taken care of by
this permutation implicitly.

1027
01:01:24,870 --> 01:01:28,220
And the more QTLs
that contribute

1028
01:01:28,220 --> 01:01:31,320
to a particular trait,
the smaller they might be.

1029
01:01:31,320 --> 01:01:33,080
And there you need
more individuals

1030
01:01:33,080 --> 01:01:34,900
to provide adequate
power for your test.

1031
01:01:39,810 --> 01:01:43,400
And out of this
model, however, if you

1032
01:01:43,400 --> 01:01:48,090
look, for all the different
traits, at the predicted phenotype

1033
01:01:48,090 --> 01:01:50,040
versus the observed
phenotype, you

1034
01:01:50,040 --> 01:01:52,565
can see that the model
does a reasonably good job.

1035
01:01:56,010 --> 01:02:03,990
So the interesting things
that came out of the study

1036
01:02:03,990 --> 01:02:06,620
were that, first of all,
it was possible to look

1037
01:02:06,620 --> 01:02:11,860
at the effect sizes of each QTL.

1038
01:02:11,860 --> 01:02:16,700
Now, the effect size in terms of
fraction of variance explained

1039
01:02:16,700 --> 01:02:21,590
of a particular marker, is
the square of its coefficient.

1040
01:02:21,590 --> 01:02:23,990
It's the beta squared.

1041
01:02:23,990 --> 01:02:29,350
So you can see here the
histogram of effect sizes,

1042
01:02:29,350 --> 01:02:33,440
and you can see that most
QTLs have very small effects

1043
01:02:33,440 --> 01:02:39,700
on phenotype where phenotype
is scaled between 0 and 1

1044
01:02:39,700 --> 01:02:40,400
for this study.

1045
01:02:43,680 --> 01:02:49,790
So, most traits
as described here

1046
01:02:49,790 --> 01:02:53,690
have between 5 and 29 different
QTL loci in the genome.

1047
01:02:53,690 --> 01:02:56,405
They're used to describe
them with a median of 12.

1048
01:03:00,240 --> 01:03:04,260
Now, the question
the authors asked,

1049
01:03:04,260 --> 01:03:11,280
was if they looked at the
theoretical h squared that they

1050
01:03:11,280 --> 01:03:16,560
computed for the F1s, how
well did their model do?

1051
01:03:16,560 --> 01:03:18,680
And you can see that their
model does very well.

1052
01:03:18,680 --> 01:03:22,630
That, in terms of looking at
narrow sense heritability,

1053
01:03:22,630 --> 01:03:25,560
they can recover almost
all of it, all the time.

1054
01:03:29,860 --> 01:03:35,526
However, the problem comes here.

1055
01:03:35,526 --> 01:03:37,150
Remember we talked
about how to compute

1056
01:03:37,150 --> 01:03:44,430
broad-sense heritability
by looking at clones

1057
01:03:44,430 --> 01:03:47,830
and computing environmental
variance directly.

1058
01:03:47,830 --> 01:03:51,270
And so they were able to
compute broad-sense heritability

1059
01:03:51,270 --> 01:03:55,080
and compare that to the
narrow-sense heritability

1060
01:03:55,080 --> 01:03:57,684
that they were able to
actually achieve in the study.

1061
01:03:57,684 --> 01:03:59,475
And you can see there
are substantial gaps.

1062
01:04:02,430 --> 01:04:06,900
So what could be
making up those gaps?

1063
01:04:06,900 --> 01:04:12,380
Why is it that this additive
model can't explain growth rate

1064
01:04:12,380 --> 01:04:15,770
in a particular condition?

1065
01:04:15,770 --> 01:04:20,440
So, the next thing that
we're going to discover

1066
01:04:20,440 --> 01:04:24,680
are some of the sources of this
so-called missing heritability.

1067
01:04:24,680 --> 01:04:26,900
But before I give you
some of the stock answers

1068
01:04:26,900 --> 01:04:29,830
that people in the field give,
since this is part of our quest

1069
01:04:29,830 --> 01:04:33,370
today to actually look
into missing heritability,

1070
01:04:33,370 --> 01:04:36,960
I'll put it to you,
my panel of experts.

1071
01:04:36,960 --> 01:04:39,450
What could be causing this
heritability to go missing?

1072
01:04:39,450 --> 01:04:46,390
Why can't this additive model
predict growth rate accurately,

1073
01:04:46,390 --> 01:04:50,190
given it knows the
genotype exactly?

1074
01:04:50,190 --> 01:04:51,010
Yes.

1075
01:04:51,010 --> 01:04:54,986
AUDIENCE: [INAUDIBLE]
that you wouldn't

1076
01:04:54,986 --> 01:04:56,980
detect from looking
at the DNA sequence.

1077
01:04:56,980 --> 01:04:59,080
PROFESSOR: So
epigenetic factors-- are
epidemic factors-- are

1078
01:04:59,080 --> 01:05:00,980
you talking about protein
factors or are you

1079
01:05:00,980 --> 01:05:02,355
talking about
epigenetic effects?

1080
01:05:02,355 --> 01:05:04,117
AUDIENCE: More of
the epigenetic marks.

1081
01:05:04,117 --> 01:05:05,450
PROFESSOR: Epigenetic marks, OK.

1082
01:05:05,450 --> 01:05:09,545
So it might be now, yeast
doesn't have DNA methylation.

1083
01:05:12,490 --> 01:05:15,670
It does have chromatin
modifications

1084
01:05:15,670 --> 01:05:18,020
in the form of histone marks.

1085
01:05:18,020 --> 01:05:20,900
So it might be that there's
some histone marks that

1086
01:05:20,900 --> 01:05:24,750
are copied from generation
to generation that are not

1087
01:05:24,750 --> 01:05:26,310
accounted for in our model.

1088
01:05:26,310 --> 01:05:28,330
right?

1089
01:05:28,330 --> 01:05:29,900
OK, that's one possibility.

1090
01:05:29,900 --> 01:05:30,440
Great.

1091
01:05:30,440 --> 01:05:30,940
Yes.

1092
01:05:30,940 --> 01:05:33,874
AUDIENCE: There could
be more complex effects

1093
01:05:33,874 --> 01:05:37,297
so two separate genes may come
out, other than just adding.

1094
01:05:37,297 --> 01:05:38,764
One could turn the other off.

1095
01:05:38,764 --> 01:05:43,170
So if one's on, it
could [INAUDIBLE].

1096
01:05:43,170 --> 01:05:43,910
PROFESSOR: Right.

1097
01:05:43,910 --> 01:05:47,000
So those are called
epistatic effects,

1098
01:05:47,000 --> 01:05:48,250
or they're non-linear effects.

1099
01:05:48,250 --> 01:05:51,036
They're gene-gene
interaction effects.

1100
01:05:51,036 --> 01:05:52,410
That's actually
thought to be one

1101
01:05:52,410 --> 01:05:56,915
of the major issues in
missing heritability.

1102
01:06:00,090 --> 01:06:02,570
What else could there be?

1103
01:06:02,570 --> 01:06:03,080
Yes.

1104
01:06:03,080 --> 01:06:03,996
AUDIENCE: [INAUDIBLE].

1105
01:06:16,872 --> 01:06:17,580
PROFESSOR: Right.

1106
01:06:17,580 --> 01:06:21,390
So you're saying that there
could be inherent noise that

1107
01:06:21,390 --> 01:06:24,250
would cause there to be
fluctuations in colony size

1108
01:06:24,250 --> 01:06:26,550
that are unrelated
to the genotype.

1109
01:06:26,550 --> 01:06:27,967
And, in fact,
that's a good point.

1110
01:06:27,967 --> 01:06:29,508
And that's something
that we're going

1111
01:06:29,508 --> 01:06:31,530
to take care of with the
environmental variance.

1112
01:06:31,530 --> 01:06:34,580
So we're going to measure
how well individuals

1113
01:06:34,580 --> 01:06:37,900
grow with exactly the same
genotype in a given condition.

1114
01:06:37,900 --> 01:06:40,860
And so that kind of
fluctuation would

1115
01:06:40,860 --> 01:06:43,200
appear in that variance term.

1116
01:06:43,200 --> 01:06:46,010
And we're going to
get rid of that.

1117
01:06:46,010 --> 01:06:48,900
But that's a good thought and
I think it's important and not

1118
01:06:48,900 --> 01:06:53,110
appreciated that there
can be random fluctuations

1119
01:06:53,110 --> 01:06:55,710
in that term.

1120
01:06:55,710 --> 01:06:56,545
Any other ideas?

1121
01:07:02,130 --> 01:07:05,240
So we have epistasis.

1122
01:07:05,240 --> 01:07:06,510
We have epigenetics.

1123
01:07:06,510 --> 01:07:09,170
We've got two E's so far.

1124
01:07:09,170 --> 01:07:10,050
Anything else?

1125
01:07:19,430 --> 01:07:27,870
How about if there are
a lot of different loci

1126
01:07:27,870 --> 01:07:33,860
that are influencing
a particular trait,

1127
01:07:33,860 --> 01:07:37,180
but the effect sizes
are very small.

1128
01:07:37,180 --> 01:07:38,940
That we've captured,
sort of the cream.

1129
01:07:38,940 --> 01:07:40,140
We've skimmed off the cream.

1130
01:07:40,140 --> 01:07:45,110
So we get 70% of the
variance explained,

1131
01:07:45,110 --> 01:07:49,150
but the rest of
the QTLs are small,

1132
01:07:49,150 --> 01:07:50,359
right, and we can't see them.

1133
01:07:50,359 --> 01:07:52,816
We can't see them because we
don't have enough individuals.

1134
01:07:52,816 --> 01:07:54,030
We're underpowered, right.

1135
01:07:54,030 --> 01:07:58,284
We just-- more individuals,
more sequencing, right.

1136
01:07:58,284 --> 01:08:00,450
And that would be the only
way to break through this

1137
01:08:00,450 --> 01:08:04,745
and be able to see these
very small effects.

1138
01:08:07,410 --> 01:08:11,630
Because if the effects
are small, in some sense,

1139
01:08:11,630 --> 01:08:13,540
we're hosed.

1140
01:08:13,540 --> 01:08:15,080
Right?

1141
01:08:15,080 --> 01:08:17,859
You just can't see
them through the noise.

1142
01:08:17,859 --> 01:08:24,620
All those effects are
going to show up down here

1143
01:08:24,620 --> 01:08:26,500
and we're going to reject them.

1144
01:08:30,700 --> 01:08:33,645
Anything else, people
can think about?

1145
01:08:36,555 --> 01:08:38,010
Yes?

1146
01:08:38,010 --> 01:08:42,066
AUDIENCE: Could you content
maybe the sum of some areas

1147
01:08:42,066 --> 01:08:48,674
that are-- sorry, the
addition sum of those guys

1148
01:08:48,674 --> 01:08:50,365
that have low effects.

1149
01:08:50,365 --> 01:08:52,604
Or is that not detectable
by any [INAUDIBLE]?

1150
01:08:52,604 --> 01:08:53,979
PROFESSOR: Well,
that's certainly

1151
01:08:53,979 --> 01:08:57,524
what we're trying to do
with residuals, right?

1152
01:08:57,524 --> 01:08:59,149
This multi-round
thing is that we
round thing is that we

1153
01:08:59,149 --> 01:09:01,109
take all the things
we can detect

1154
01:09:01,109 --> 01:09:03,510
that have an effect with
a conservative cut off

1155
01:09:03,510 --> 01:09:05,356
and we get rid of them.

1156
01:09:05,356 --> 01:09:07,189
And then we say, oh,
is there anything left?

1157
01:09:07,189 --> 01:09:10,660
You know, that's hiding, sort
of behind that forest, right.

1158
01:09:10,660 --> 01:09:12,569
If we cut through the
first line of trees,

1159
01:09:12,569 --> 01:09:16,600
can we get to another
collection of informative QTLs?

1160
01:09:21,090 --> 01:09:21,760
Yeah.

1161
01:09:21,760 --> 01:09:23,242
AUDIENCE: I was
wondering if this

1162
01:09:23,242 --> 01:09:24,724
could be an overestimate also.

1163
01:09:24,724 --> 01:09:26,926
Like, for example,
if, when you throw out

1164
01:09:26,926 --> 01:09:28,676
the variance for
environmental conditions,

1165
01:09:28,676 --> 01:09:32,628
the environmental conditions
aren't as exact as we thought

1166
01:09:32,628 --> 01:09:36,860
they were between two yeast
growing in the same setup.

1167
01:09:36,860 --> 01:09:37,568
PROFESSOR: Right.

1168
01:09:37,568 --> 01:09:41,026
AUDIENCE: Then maybe you
would inappropriately

1169
01:09:41,026 --> 01:09:43,990
assign a variance to the
environmental condition

1170
01:09:43,990 --> 01:09:51,400
whereas some that could
be, in fact-- something

1171
01:09:51,400 --> 01:09:52,882
that wouldn't be explained by.

1172
01:09:52,882 --> 01:09:55,030
PROFESSOR: And probably
the other way around.

1173
01:09:55,030 --> 01:09:57,000
The other way around
would be that you thought

1174
01:09:57,000 --> 01:10:00,522
you had the conditions
exactly duplicated, right.

1175
01:10:00,522 --> 01:10:02,230
But when you actually
did something else,

1176
01:10:02,230 --> 01:10:06,240
they weren't exactly duplicated
so you see bigger variance

1177
01:10:06,240 --> 01:10:07,190
in another experiment.

1178
01:10:07,190 --> 01:10:11,024
And it appears to be
heritable in some sense.

1179
01:10:11,024 --> 01:10:13,190
But, in fact, it would just
be that you misestimated

1180
01:10:13,190 --> 01:10:14,970
the environmental component.

1181
01:10:14,970 --> 01:10:16,870
So, there are a
variety of things

1182
01:10:16,870 --> 01:10:19,000
that we can think about, right.

1183
01:10:19,000 --> 01:10:20,870
Incorrect heritability
estimates.

1184
01:10:20,870 --> 01:10:23,605
We can think about
rare variants.

1185
01:10:23,605 --> 01:10:25,230
Now in this
particular study we're

1186
01:10:25,230 --> 01:10:27,240
looking at everything, right.

1187
01:10:27,240 --> 01:10:28,330
Nothing is hiding.

1188
01:10:28,330 --> 01:10:30,040
We've got 50x sequencing.

1189
01:10:30,040 --> 01:10:32,400
There are no variants
hiding behind the bushes.

1190
01:10:32,400 --> 01:10:35,080
They are all there
for us to look at.

1191
01:10:35,080 --> 01:10:37,290
Structural variants-- well
in this particular case,

1192
01:10:37,290 --> 01:10:39,770
we know structural
variants aren't present,

1193
01:10:39,770 --> 01:10:42,290
but as you know, many
kinds of mammalian cells

1194
01:10:42,290 --> 01:10:45,470
exhibit structural
variants and other kinds

1195
01:10:45,470 --> 01:10:50,480
of bizarre behaviors
with their chromosomes.

1196
01:10:50,480 --> 01:10:52,280
Many common variants
of low effect.

1197
01:10:52,280 --> 01:10:54,520
We just talked about that.

1198
01:10:54,520 --> 01:10:56,420
And epistasis was brought up.

1199
01:10:56,420 --> 01:10:58,140
And this does not
include epigenetics,

1200
01:10:58,140 --> 01:10:59,720
I'll have to add
that to the list.

1201
01:10:59,720 --> 01:11:01,490
It's a good point.

1202
01:11:01,490 --> 01:11:03,050
OK.

1203
01:11:03,050 --> 01:11:04,650
And then we talked
about this idea

1204
01:11:04,650 --> 01:11:10,460
that epistasis is the case
where we have nonlinear effects.

1205
01:11:10,460 --> 01:11:14,970
So a very simple
example of this is

1206
01:11:14,970 --> 01:11:17,570
when you have little a and
big B, and big A and big B

1207
01:11:17,570 --> 01:11:19,210
together, they
both had an effect.

1208
01:11:19,210 --> 01:11:22,090
But little a, little
b, have no effect.

1209
01:11:22,090 --> 01:11:25,110
And big A and big B have
no effect by themselves.

1210
01:11:25,110 --> 01:11:26,780
So you have a
pairwise interaction

1211
01:11:26,780 --> 01:11:28,750
between these terms.

1212
01:11:28,750 --> 01:11:29,940
Right.

1213
01:11:29,940 --> 01:11:33,670
So this is sort of the
exclusive OR of two terms

1214
01:11:33,670 --> 01:11:36,420
and that non-linear
effect can never

1215
01:11:36,420 --> 01:11:40,990
be captured when you're
looking at terms one at a time.

1216
01:11:40,990 --> 01:11:42,650
OK.

1217
01:11:42,650 --> 01:11:46,200
Because looking
one at a time looks

1218
01:11:46,200 --> 01:11:49,310
like it has no
effect whatsoever.
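
A small illustration of the exclusive-OR case, using made-up data: two loci with alleles coded 0 and 1, equal numbers of individuals in each of the four genotype classes, and a phenotype that is 1 only when exactly one of the two loci carries allele 1. All names and values are hypothetical.

```python
from itertools import product

# 25 individuals in each of the four two-locus genotype classes.
individuals = [(a, b) for a, b in product([0, 1], repeat=2) for _ in range(25)]
phenotype = [float(a ^ b) for a, b in individuals]  # exclusive OR of the two loci


def mean(values):
    return sum(values) / len(values)


def marginal_effect(locus):
    """Difference in mean phenotype between the two alleles at one locus."""
    with_allele = [y for geno, y in zip(individuals, phenotype) if geno[locus] == 1]
    without_allele = [y for geno, y in zip(individuals, phenotype) if geno[locus] == 0]
    return mean(with_allele) - mean(without_allele)


# Mean phenotype for each joint genotype at the two loci.
pair_means = {
    pair: mean([y for geno, y in zip(individuals, phenotype) if geno == pair])
    for pair in product([0, 1], repeat=2)
}

print("marginal effect, locus A:", marginal_effect(0))
print("marginal effect, locus B:", marginal_effect(1))
print("joint genotype means:", pair_means)
```

Each locus taken alone shows no difference in mean phenotype, so a one-marker-at-a-time scan has nothing to detect, while the joint genotype means separate the classes completely.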

1219
01:11:49,310 --> 01:11:52,060
And these effects, of course,
could be more than pairwise,

1220
01:11:52,060 --> 01:11:54,075
if you have a complicated
network or pathway.

1221
01:11:59,120 --> 01:12:03,780
Now, what the authors
did to examine this,

1222
01:12:03,780 --> 01:12:07,610
is they looked at
pairwise effects.

1223
01:12:07,610 --> 01:12:10,710
So they considered
all pairs of markers

1224
01:12:10,710 --> 01:12:14,170
and asked whether or not,
taken two at a time now,

1225
01:12:14,170 --> 01:12:20,930
they could predict a
difference in trait mean.

1226
01:12:20,930 --> 01:12:22,547
But what's the
problem with this?

1227
01:12:22,547 --> 01:12:24,130
How many markers did
I say there were?

1228
01:12:28,140 --> 01:12:31,630
13,000, something like that.

1229
01:12:31,630 --> 01:12:35,160
All pairs of markers is a
lot of pairs of markers.

1230
01:12:35,160 --> 01:12:36,600
Right.

1231
01:12:36,600 --> 01:12:39,100
And what happens to
your statistical power

1232
01:12:39,100 --> 01:12:42,320
when you get to
that many markers?

1233
01:12:42,320 --> 01:12:43,590
You have a serious problem.

1234
01:12:43,590 --> 01:12:45,620
It goes right through the floor.

1235
01:12:45,620 --> 01:12:48,450
So you really are very
under-powered to detect

1236
01:12:48,450 --> 01:12:50,820
these interactions.

1237
01:12:50,820 --> 01:12:53,250
The other thing
they did was to try

1238
01:12:53,250 --> 01:12:55,450
to get things a little
bit better as they said,

1239
01:12:55,450 --> 01:12:57,780
how about this.

1240
01:12:57,780 --> 01:13:02,610
If we know that a given QTL is
always important for a trait

1241
01:13:02,610 --> 01:13:05,980
because we discovered it
in our additive model.

1242
01:13:05,980 --> 01:13:07,690
We'll consider its
pairwise interaction

1243
01:13:07,690 --> 01:13:10,980
with all the other
possible variants.

1244
01:13:10,980 --> 01:13:13,870
So instead of now
13,000 squared,

1245
01:13:13,870 --> 01:13:17,430
it's only going to be like
22 different QTLs for a given

1246
01:13:17,430 --> 01:13:23,540
trait times 13,000 to
reduce the space of search.

1247
01:13:23,540 --> 01:13:26,410
Obviously I got this explanation
not completely clear.

1248
01:13:26,410 --> 01:13:27,880
So let me try one more time.

1249
01:13:27,880 --> 01:13:30,460
OK.

1250
01:13:30,460 --> 01:13:33,040
The naive way to go at looking
at pairwise interactions

1251
01:13:33,040 --> 01:13:35,570
is consider all pairs
and ask whether or not

1252
01:13:35,570 --> 01:13:38,860
all pairs have an influence
on a particular trait value.

1253
01:13:38,860 --> 01:13:39,390
Right.

1254
01:13:39,390 --> 01:13:40,430
We've got that much?

1255
01:13:40,430 --> 01:13:41,330
OK.

1256
01:13:41,330 --> 01:13:44,100
Now let's suppose we don't
want to look at all pairs.

1257
01:13:44,100 --> 01:13:46,570
How could we pick one
element of the pair

1258
01:13:46,570 --> 01:13:50,082
to be interesting,
but smaller in number?

1259
01:13:50,082 --> 01:13:50,940
Right.

1260
01:13:50,940 --> 01:13:53,370
So what we'll do is,
for a given trait,

1261
01:13:53,370 --> 01:13:57,610
we already know which
QTLs are important for it

1262
01:13:57,610 --> 01:14:00,406
because we've built
our model already.

1263
01:14:00,406 --> 01:14:02,280
So let's just say, for
purpose of discussion,

1264
01:14:02,280 --> 01:14:06,150
there are 20 QTLs that are
important for this trait.

1265
01:14:06,150 --> 01:14:09,010
We'll take each one
of those 20 QTLs

1266
01:14:09,010 --> 01:14:11,870
and we'll examine whether or not
it has a pairwise interaction

1267
01:14:11,870 --> 01:14:14,960
with all of the other variants.

1268
01:14:14,960 --> 01:14:18,220
And that will reduce
our search space.
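
A quick arithmetic sketch of how much the search shrinks. It uses the 11,623 unique markers quoted earlier in the lecture and the illustrative figure of 20 QTLs already in the additive model; the variable names are hypothetical.

```python
import math

n_markers = 11_623   # unique markers quoted earlier in the lecture
n_model_qtls = 20    # illustrative count of QTLs already in the additive model

# Naive search: every unordered pair of markers.
all_pairs = math.comb(n_markers, 2)

# Targeted search: each model QTL against every other marker.
# (Pairs made of two model QTLs are counted twice here, a negligible overcount.)
targeted_pairs = n_model_qtls * (n_markers - 1)

print(f"all pairs:      {all_pairs:,}")
print(f"targeted pairs: {targeted_pairs:,}")
print(f"reduction:      {all_pairs / targeted_pairs:.0f}x")
```

That is roughly 67.5 million tests against about 232 thousand, which is why the targeted search keeps far more statistical power.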

1269
01:14:18,220 --> 01:14:19,500
Is that better?

1270
01:14:19,500 --> 01:14:21,000
OK, good.

1271
01:14:21,000 --> 01:14:27,800
So, when they did
that, they did find

1272
01:14:27,800 --> 01:14:30,830
some pairwise interactions.

1273
01:14:30,830 --> 01:14:35,940
24 of their 46 traits
had pairwise interactions

1274
01:14:35,940 --> 01:14:36,935
and here is an example.

1275
01:14:40,200 --> 01:14:48,640
And you can see the dot plot,
or the upper right-hand part

1276
01:14:48,640 --> 01:14:52,710
of this slide,
how when you have BY BY,

1277
01:14:52,710 --> 01:14:55,920
you have a lower
phenotypic value than

1278
01:14:55,920 --> 01:15:02,220
when you have just
any RM component

1279
01:15:02,220 --> 01:15:04,590
on the right-hand side.

1280
01:15:04,590 --> 01:15:08,680
So those were two
different SNPs

1281
01:15:08,680 --> 01:15:10,690
on chromosome 7
and chromosome 11

1282
01:15:10,690 --> 01:15:12,915
and showing how they
interact with one another

1283
01:15:12,915 --> 01:15:16,760
in a non-linear way.

1284
01:15:16,760 --> 01:15:21,060
If they were linear, then as you
added either a chromosome 7

1285
01:15:21,060 --> 01:15:24,560
or a chromosome 11 contribution
it would go up a little bit.

1286
01:15:24,560 --> 01:15:33,410
Here, as soon as you add
either contribution from RM,

1287
01:15:33,410 --> 01:15:38,710
it goes all the way up to have
a mean of zero or higher.

1288
01:15:38,710 --> 01:15:44,300
In this particular case, 71%
of the gap between broad-sense

1289
01:15:44,300 --> 01:15:50,360
and narrow-sense heritability was explained
by this one pairwise interaction.

1290
01:15:50,360 --> 01:15:54,010
So it is the case that
pairwise interactions

1291
01:15:54,010 --> 01:15:56,235
can explain some of the
missing heritability.

1292
01:15:59,540 --> 01:16:01,310
Can anybody think
of anything else

1293
01:16:01,310 --> 01:16:02,893
that can explain
missing heritability?

1294
01:16:08,760 --> 01:16:09,260
OK.

1295
01:16:11,840 --> 01:16:13,967
What's inherited?

1296
01:16:13,967 --> 01:16:15,550
Let's make a list
of everything that's

1297
01:16:15,550 --> 01:16:21,660
inherited from the
parental line to the F1s.

1298
01:16:21,660 --> 01:16:23,450
OK.

1299
01:16:23,450 --> 01:16:24,266
Yes.

1300
01:16:24,266 --> 01:16:27,172
AUDIENCE: I mean,
because there's

1301
01:16:27,172 --> 01:16:29,164
a lot more things inherited.

1302
01:16:29,164 --> 01:16:31,071
The protein levels
are inherited.

1303
01:16:31,071 --> 01:16:31,654
PROFESSOR: OK.

1304
01:16:31,654 --> 01:16:33,734
AUDIENCE: [INAUDIBLE]
are inherited as well.

1305
01:16:33,734 --> 01:16:34,400
PROFESSOR: Good.

1306
01:16:34,400 --> 01:16:35,909
I like this line of thinking.

1307
01:16:35,909 --> 01:16:36,825
AUDIENCE: [INAUDIBLE].

1308
01:16:36,825 --> 01:16:38,325
PROFESSOR: There
are a lot of things

1309
01:16:38,325 --> 01:16:40,130
that are inherited, right?

1310
01:16:40,130 --> 01:16:43,690
So what's inherited?

1311
01:16:43,690 --> 01:16:47,790
Some proteins are
probably inherited, right?

1312
01:16:47,790 --> 01:16:50,760
What is replicable
through generation

1313
01:16:50,760 --> 01:16:54,200
to generation as a genetic
material that's inherited?

1314
01:16:54,200 --> 01:16:56,835
Let's just talk about
that for a moment.

1315
01:16:56,835 --> 01:16:58,710
Proteins are interesting,
don't get me wrong.

1316
01:16:58,710 --> 01:17:01,360
I mean, prions and other
things are very interesting.

1317
01:17:01,360 --> 01:17:03,410
But what else is inherited?

1318
01:17:08,370 --> 01:17:09,137
OK, yes?

1319
01:17:09,137 --> 01:17:10,053
AUDIENCE: [INAUDIBLE].

1320
01:17:18,210 --> 01:17:21,230
PROFESSOR: So there are
other genetic molecules.

1321
01:17:21,230 --> 01:17:24,730
Let's just take a really
simple one-- mitochondria.

1322
01:17:24,730 --> 01:17:26,760
OK.

1323
01:17:26,760 --> 01:17:28,830
Mitochondria are inherited.

1324
01:17:28,830 --> 01:17:30,610
And it turns out that
these two strains

1325
01:17:30,610 --> 01:17:35,830
can have
different mitochondria.

1326
01:17:35,830 --> 01:17:37,120
What else can be inherited?

1327
01:17:40,790 --> 01:17:44,140
Well, we were doing these
experiments with our colleagues

1328
01:17:44,140 --> 01:17:46,822
over at the Whitehead
and for a long time

1329
01:17:46,822 --> 01:17:48,530
we couldn't figure
out what was going on.

1330
01:17:48,530 --> 01:17:50,689
Because we would do the
experiments on day one

1331
01:17:50,689 --> 01:17:52,730
and they come out a
particular way and on day two

1332
01:17:52,730 --> 01:17:54,030
they come out a different way.

1333
01:17:54,030 --> 01:17:55,100
Right.

1334
01:17:55,100 --> 01:17:59,160
And we were working under some
very controlled conditions.

1335
01:17:59,160 --> 01:18:02,460
Until we figured
out that everybody

1336
01:18:02,460 --> 01:18:06,230
uses S288C which is the
genetic nomenclature

1337
01:18:06,230 --> 01:18:10,010
for the lab strain
of yeast, right.

1338
01:18:10,010 --> 01:18:12,112
It's the lab strain because
it's very well behaved.

1339
01:18:12,112 --> 01:18:13,070
It's a very nice yeast.

1340
01:18:13,070 --> 01:18:14,210
It grows very well.

1341
01:18:14,210 --> 01:18:16,220
It's been selected
for that, right.

1342
01:18:16,220 --> 01:18:19,750
And people always do genetic
studies by taking S288C,

1343
01:18:19,750 --> 01:18:22,717
which is the lab yeast, which
has been completely sequenced

1344
01:18:22,717 --> 01:18:24,800
and so you want to use it
because you can download

1345
01:18:24,800 --> 01:18:29,490
the genome, and crossing it with a wild strain.

1346
01:18:29,490 --> 01:18:33,260
And wild strains come
from the wild, right.

1347
01:18:33,260 --> 01:18:35,920
And they come
either off of people

1348
01:18:35,920 --> 01:18:37,230
who have yeast infections.

1349
01:18:37,230 --> 01:18:40,190
I mean, human beings, or
they come off of grape vines

1350
01:18:40,190 --> 01:18:42,120
or God knows where, right.

1351
01:18:42,120 --> 01:18:44,080
But they are not well behaved.

1352
01:18:44,080 --> 01:18:45,760
And why are they
not well behaved?

1353
01:18:45,760 --> 01:18:48,830
What makes these yeast
particularly rude?

1354
01:18:48,830 --> 01:18:51,380
Well, the thing that makes
them particularly rude

1355
01:18:51,380 --> 01:18:54,120
is that they have things
like viruses in them.

1356
01:18:54,120 --> 01:18:55,440
Oh, no.

1357
01:18:55,440 --> 01:18:56,170
OK.

1358
01:18:56,170 --> 01:18:58,470
Because what
happens is that when

1359
01:18:58,470 --> 01:19:01,080
you take a yeast that
has a virus in it,

1360
01:19:01,080 --> 01:19:04,460
and you cross it with
a lab yeast, right,

1361
01:19:04,460 --> 01:19:06,120
all of the kids get the virus.

1362
01:19:08,940 --> 01:19:10,220
Yuck.

1363
01:19:10,220 --> 01:19:11,930
OK.

1364
01:19:11,930 --> 01:19:19,150
And it turns out that the
so-called killer virus in yeast

1365
01:19:19,150 --> 01:19:23,950
interacts with various
chromosomal changes.

1366
01:19:23,950 --> 01:19:25,590
And so now you
have interactions--

1367
01:19:25,590 --> 01:19:29,340
genetic interactions
between a viral element

1368
01:19:29,340 --> 01:19:32,090
and the chromosome.

1369
01:19:32,090 --> 01:19:36,760
And so the phenotype you get
out of particular deletions

1370
01:19:36,760 --> 01:19:41,860
in the yeast genome has
to do with whether or not

1371
01:19:41,860 --> 01:19:44,220
it's infected with
a particular virus.

1372
01:19:44,220 --> 01:19:50,500
It also has to do with which
mitochondrial content it has.

1373
01:19:50,500 --> 01:19:51,970
And people didn't
appreciate this

1374
01:19:51,970 --> 01:19:57,520
until recently because most of
the past yeast studies for QTLs

1375
01:19:57,520 --> 01:20:03,010
were busy crossing lab
strains with wild strains

1376
01:20:03,010 --> 01:20:06,960
and whether it was ethanol
tolerance or growth in heat,

1377
01:20:06,960 --> 01:20:09,620
a lot of the studies
came up with a gene

1378
01:20:09,620 --> 01:20:12,800
as a significant
QTL, which was MKT1.

1379
01:20:12,800 --> 01:20:17,580
And people couldn't understand
why MKT1 was so popular, right.

1380
01:20:17,580 --> 01:20:22,460
MKT1, maintenance
of killer toxin one.

1381
01:20:22,460 --> 01:20:23,050
Yeah.

1382
01:20:23,050 --> 01:20:26,210
That's the viral thing that
enables-- the chromosomal thing

1383
01:20:26,210 --> 01:20:28,130
that enables a viral competence.

1384
01:20:28,130 --> 01:20:34,170
So, it turns out
that if you look

1385
01:20:34,170 --> 01:20:37,440
at this-- in this
particular case,

1386
01:20:37,440 --> 01:20:39,730
we're looking at
yeast that don't

1387
01:20:39,730 --> 01:20:43,220
have the virus in the bottom
little photograph there.

1388
01:20:43,220 --> 01:20:46,595
You can see they're
all sort of, you know,

1389
01:20:46,595 --> 01:20:48,930
they're growing similarly.

1390
01:20:48,930 --> 01:20:53,680
And the yeast with the
same genotype above-- those

1391
01:20:53,680 --> 01:20:56,164
are all in tetrads.

1392
01:20:56,164 --> 01:20:58,080
Two out of the four are
growing, the other two

1393
01:20:58,080 --> 01:21:02,470
are not, because the other two
have a particular deletion.

1394
01:21:02,470 --> 01:21:06,595
And if you look at the
model-- a deletion-only model,

1395
01:21:06,595 --> 01:21:09,940
which only looks
at the chromosomal complement--

1396
01:21:09,940 --> 01:21:14,660
it doesn't predict the
variance very well.

1397
01:21:14,660 --> 01:21:18,150
And if you look at the
deletion and whether or not

1398
01:21:18,150 --> 01:21:21,250
you have the virus,
you do better.

1399
01:21:21,250 --> 01:21:24,190
But you do even
better, if you allow

1400
01:21:24,190 --> 01:21:26,930
for there to be a
nonlinear interaction

1401
01:21:26,930 --> 01:21:29,040
between the chromosomal
modification

1402
01:21:29,040 --> 01:21:31,520
and whether or not
you have a virus.

1403
01:21:31,520 --> 01:21:36,120
And then you recover almost
all of the missing heritability.

1404
01:21:36,120 --> 01:21:37,930
So I'll leave you with
this thought, which

1405
01:21:37,930 --> 01:21:45,560
is that genetics is complicated
and QTLs are great, but don't

1406
01:21:45,560 --> 01:21:49,400
forget that there are all
sorts of genetic elements.

1407
01:21:49,400 --> 01:21:50,980
And on that note,
next time we'll

1408
01:21:50,980 --> 01:21:52,490
talk about human genetics.

1409
01:21:52,490 --> 01:21:54,050
Have a great weekend until then.

1410
01:21:54,050 --> 01:21:54,730
We'll see you.

1411
01:21:54,730 --> 01:21:56,440
Take care.