PROFESSOR: OK. Well hello, everyone, and welcome back to Computational Systems Biology. I am David Gifford, and I am delighted to be back with you today. We're going to talk today about understanding transcription. Specifically, the way we're going to understand transcription is with a technique called RNA-seq. RNA-seq is a methodology for characterizing RNA molecules through next generation sequencing. We'll talk first about RNA-seq principles. We'll then talk about how to take the data we get from RNA-seq and analyze it using tools for characterizing differential gene expression and principal component analysis. And finally, we'll talk about single cell RNA-seq, which is a very important and growing area of scientific inquiry.

But first, let's talk about RNA-seq. How many people have heard of RNA-seq before? Fantastic. How many people have done it before? Some? Great.

So RNA-seq is fairly simple in concept. What we're going to do is isolate RNA species from a cell, or a collection of cells, in the desired condition. And note that we can choose which kinds of RNA molecules to isolate. We can isolate molecules before any selection, which would include precursor RNAs that have not been spliced yet, as well as non-coding RNAs. As you probably know, the study of non-coding RNAs is extraordinarily important. There are over 3,300 so-called long non-coding RNAs that have been characterized so far -- those are non-coding RNAs over 200 bases long. We'll be talking about those later on, when we talk about chromatin function in the genome.
And of course, there are the precursor messenger RNAs that are spliced into mature messenger RNAs, which are then translated into protein. The specifics of the RNA-seq protocol will give you various of these species, depending upon what kinds of purification methodologies you use.

But as you're aware, many isoforms are possible for most mammalian genes. This is a short summary, produced by the Burge Laboratory, of the different kinds of splicing events that can occur. The splicing events are often regulated by cis-regulatory sequences that live in the introns. These introns contain recognition sequences for splicing factors, and the splicing factors can be conditionally expressed. And so you get different combinations of these exons being glued together to produce variant forms of proteins. So we have to be mindful that the RNA molecules we're going to be observing typically are going to be reverse transcribed. So we'll see entire transcripts that came, perhaps, from distinct exonic locations in the genome.

The essential idea of RNA-seq is that we take the RNA molecules we care about -- in this case, we're going to purify the ones that have poly-A tails. We'll take those molecules, reverse transcribe them, sequence fragments of them, and then map the fragments to the genome. Now if you don't purify for poly-A tails, you get a lot of things mapping to the genome that are intronic, and so you get a lot of data that is very difficult to analyze. So typically, people who are looking for gene expression data will do poly-A purification. And when you do this -- when you sequence the result of reverse transcribing these RNA molecules and map the results to the genome -- what you find is data that looks like this. This is the SOX2 gene. This is typical expression data.
You can see all of the individual reads mapping to the genome, the blue on the plus strand, the pink on the minus strand. And our job is to take data like this and analyze it.

Now we can take these data and summarize them in various formats. One way is to simply count the number of times that we see a read at a particular base, like here, for the SMUG1 gene. And here you see something else going on: because the RNA in this sequencing experiment was poly-A selected, we're only seeing the reads that occur over the exonic sequences, more or less. There are a few intronic reads there, scattered about.

The other thing that we see, which is very important, is reads that are split across exons. Because the splicing event has occurred, the RNA molecule is a contiguous sequence of a collection of exons. And sometimes you'll get a read that spans an exon-exon boundary. And when we get this, you can see, in the bottom part of the slide that I'm showing you, these reads can map across the exons. Typically, in order to get good performance out of something like this, we want to use reads that are about 100 bases long. And we'll use a mapper that is capable of mapping both parts of a read, to account for split reads at these exon crossing events.

So that gives you an idea of the kinds of data that we have. And part of the challenge now is determining which particular isoforms of a gene are being expressed by evaluating these data. There are two principal ways of going about this. One way is to simply take our read data, use the ideas that we talked about in genome assembly, and assemble the reads into transcripts. That's typically done de novo, although there are reference guided assemblers. It has the benefit that you don't need a reference genome.
So sometimes, when you're working with an organism that does not have a well characterized reference genome, you will de novo assemble the transcripts from an RNA-seq experiment. But this approach also has problems with correctness, as we saw when we talked about assembly. The other approach, which is more typically used, is to map, or align, the reads to a reference genome and identify the different isoforms that are being expressed using constraints. And ultimately, the goal is to identify the isoforms and quantitate them so we can do further downstream analysis.

And you'll sometimes hear about two different metrics for expression in the literature. One is the number of reads per kilobase of transcript per million mapped reads, or RPKM. So you might have, for example, an RPKM of 1,000, which means that roughly one out of every thousand reads is mapping to each kilobase of a particular gene. It gives you a metric that's adjusted for the fact that longer genes will produce more reads. An alternative metric is fragments per kilobase per million, or FPKM. That's sometimes used when we're talking about paired end data, where you're considering that you're sequencing both ends of a fragment. So here we're talking about how many fragments we see for a particular gene, per 1,000 bases of the gene, per million fragments.

OK. So the essential idea, then, is to take the reads that we have -- the basket of reads -- align them to the genome, both to exons and to exon crossings, and determine, for a given gene, what isoforms we have and how they're being expressed. From here on in, I'm going to assume that we're talking about a single gene and its isoforms. And that's OK, because typically we can map reads uniquely to a gene. There are details, of course, when you have genes with paralogs that have identical sequences elsewhere in the genome, where this becomes more difficult.
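To make the RPKM and FPKM definitions concrete, here is a minimal sketch; the function names and example numbers are illustrative, not from the lecture.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count / (transcript_length_bp / 1e3) / (total_mapped_reads / 1e6)

def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Same computation with fragments (read pairs) in place of reads."""
    return fragment_count / (transcript_length_bp / 1e3) / (total_mapped_fragments / 1e6)

# 500 reads on a 2 kb transcript in a library of 10 million mapped reads:
# 500 / 2 / 10 = 25 RPKM.
print(rpkm(500, 2000, 10_000_000))  # 25.0
```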
So if we consider this once again: we're going to take our reads, map them to the genome, and look for all possible pairings of exons. What we would like to do is enumerate all of the isoforms that are possible given the data that we have. And we can use junction crossing reads and other methodologies to enumerate all the possible isoforms; a toy sketch of that enumeration appears at the end of this passage. But what we're going to assume is that we've enumerated all the isoforms, and we're going to number them 1 through n. So we have isoform 1, isoform 2, isoform 3, and so on, for a given gene. And what we want to compute for each isoform is its relative contribution to the read population that we see mapping to that gene.

In order to do that, we can use some constraints. So if I show you this picture, which suggests that we have possible splice events here for the event C in the middle -- if I told you that A, C, and E were very highly covered by reads, you might think that A, C, and E represented one isoform that was highly expressed. And so if we think about how to use our read coverage as a guide to determining isoform prevalence, we'll be in good shape. And really, that's the best evidence we have. We have two sources of evidence, right? We have our junction crossing reads, which tell us which exons are being spliced together; those help us both compute the set of possible isoforms and estimate their prevalence. The other thing we have is the reads that simply cover exons, and their relative prevalence can also help us compute the relative amounts of different isoforms.

So in order to do this, we can think about what reads tell us. And some reads will exclude certain isoforms.
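Here is the toy enumeration sketch referred to above. Everything in it is an illustrative assumption: real enumerators also use coverage, annotation, and exclusion evidence.

```python
from itertools import combinations

def candidate_isoforms(n_exons, junctions):
    """Enumerate exon subsets supported by observed splice junctions.

    `junctions` is a set of observed (exon_a, exon_b) junction crossing
    events, with exons numbered 0..n_exons-1.  A subset is kept only if
    every adjacent pair in it corresponds to an observed junction
    (single-exon subsets pass vacuously).
    """
    kept = []
    for size in range(1, n_exons + 1):
        for subset in combinations(range(n_exons), size):
            pairs = zip(subset, subset[1:])
            if all(p in junctions for p in pairs):
                kept.append(subset)
    return kept

# Junctions 0-1, 1-2, and 0-2 observed: both inclusion and skipping of exon 1
# are supported, so (0, 1, 2) and (0, 2) both survive.
print(candidate_isoforms(3, {(0, 1), (1, 2), (0, 2)}))
```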
So if we consider the reads that we have, we can think about reads that cross particular junctions as inclusion reads. For example, in this case, the top reads are indicating that the middle white exon is being included in a transcript, whereas the bottom reads are exclusion reads, indicating that the white exon in the middle is being spliced out.

So what we would like to do, then, is build a probabilistic model that takes into account what we know about what a read tells us. Because each read is a piece of evidence. And we're going to use those reads like detectives. We're going to go in and try to analyze all of the different reads we see for a gene, and use them to weight what we think is happening with the expression of the different isoforms in the pool of reads that we're observing. And in order to do so, we will have to build a function that describes the probability of seeing a read given the expression of a particular isoform. So the essential idea is this: for the next three slides, I want to build a model of the probability of seeing a read conditioned upon a particular isoform being expressed. All right?

So there are three cases to consider. One is that I can see a read that I know is incompatible with a given isoform. And therefore, the probability of seeing that read given the isoform is 0. And that's perfectly fine. This can happen either with a single ended read or with a paired end read. And it's a conditional probability: the probability of seeing read i given an isoform j can be 0.

Another possibility is that I have a transcript -- now recall that the transcript has been spliced, so what we're looking at is the entire sequence of the spliced isoform. And we see a read, and the read can land anywhere within that transcript. Let's assume, for the time being, we're looking at a single ended read.
Then the probability of seeing that read land in that transcript, at a particular base, is 1 over the length of the transcript. So this read is compatible with that transcript, and we can describe the probability in this fashion.

It's also possible for us to consider paired end reads. And if we have paired end reads, we can describe a probability function that has two essential components. The denominator is still the same: the likelihood of the read aligning at a particular position is going to be 1 over the length of the entire transcript. The numerator is different, though. We're going to compute the implied length of the paired end read and ask: what's the likelihood of seeing that? Recall that we don't know the exact length of the insert. When we're looking at paired end reads, we can only estimate how long the fragment is that we're sequencing. And so we're going to have a probabilistic interpretation of how long the piece of RNA is that we actually wound up sequencing the ends of. And that is placed in the numerator, which scales the 1 over l sub j. So this gives us a probability for seeing a read in a particular transcript that accounts for the fact that we have to map both ends to that transcript. OK?

So we have three possibilities that we've described: one, where a particular read is incompatible with an isoform, so the probability is 0; two, where we have a single end read mapping to an isoform, which gives simply 1 over the length of the isoform; and three, where we're mapping paired end reads to an isoform, which includes uncertainty about where the read maps, which is the 1 over l sub j, and also uncertainty about the length of the fragment itself, which is encoded in the F function, a distribution over fragment lengths.

OK. So once we have this structure, we can then estimate isoform expression.
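Here is a minimal sketch of the three cases just summarized. The Gaussian form for the fragment length distribution F, the 200-base mean, and all of the names are illustrative assumptions; the lecture only requires that F be some empirical distribution over insert sizes.

```python
import math

def fragment_length_pdf(length, mean=200.0, sd=20.0):
    """F: assumed Gaussian distribution over fragment (insert) lengths."""
    return math.exp(-0.5 * ((length - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def read_given_isoform(projection, isoform_length, paired=True):
    """P(read | isoform) for the three cases.

    `projection` gives the read's coordinates on the spliced transcript as
    (leftmost position, rightmost position), or None if the read cannot be
    placed on this isoform at all.
    """
    if projection is None:                  # case 1: incompatible read
        return 0.0
    if not paired:                          # case 2: single end, uniform over bases
        return 1.0 / isoform_length
    left, right = projection                # case 3: paired end
    implied_fragment = right - left         # implied insert length on this isoform
    return fragment_length_pdf(implied_fragment) / isoform_length

# Ends implying a 200-base insert are far more probable than ends implying
# a 400-base insert, which F heavily discounts.
print(read_given_isoform((100, 300), 1500))   # near the mode of F
print(read_given_isoform((100, 500), 1500))   # vanishingly small
```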
Now when we talked about ChIP-seq last time, we discussed the idea of estimating proportions. And the essential idea here is that if we want to compute the probability of a read given, in this case, a mixture of isoforms, that's simply going to be -- let's see, what variable did I use? Yeah -- the estimated concentration of each isoform times the probability of the read as seen in that isoform, summed over the isoforms:

P(R_i | psi) = sum over isoforms j of psi_j * P(R_i | isoform j)

So for an individual read, we can estimate its likelihood given this mixture. And then the product around the outside describes how to estimate the probability of the entire basket of reads that we see for that gene. And what we would like to do is pick psi, in this case, to maximize the likelihood of the observed reads. So what psi is going to give us is the fraction of each isoform that we see.

Are there any questions about the idea of isoform quantitation? Yes.

AUDIENCE: I'm a little lost. On the last full slide, you were describing these three cases for excluded, single, and paired reads. So are we computing the different probabilities for both ends landing in a transcript, or for just one, or --

PROFESSOR: It depends. The second and third cases depend upon whether we're analyzing single ended reads or paired end reads. And so we wouldn't use both of them at the same time. In other words, if you only have single ended data, you would use the second case that we showed. And if you had paired end data, you would use the third case that we showed.

AUDIENCE: OK.

PROFESSOR: Question, yes.

AUDIENCE: Sorry. I noticed, in the single end reads case -- could you explain the intuition behind that probability [INAUDIBLE]?

PROFESSOR: Sure. The intuition behind that probability is that we're asking -- so here is a transcript. And we're assuming -- what is the probability of a read, given that it came from this transcript? OK? What's the probability of observing a particular read?
And the probability of observing it aligning at a particular position is 1 over the length of this transcript. And so the probability actually includes the idea of alignment at a particular position in the transcript. OK? So obviously, the probability would be 1, if we assume it comes from here and don't consider position. But if we want to ask where it lines up in the transcript, it's going to be 1 over l sub j. OK?

AUDIENCE: So we assume that it's uniformly possible?

PROFESSOR: That it's uniformly possible, which gets us to a good point. I'm glad you asked that question. Sometimes we will have more reads at the three prime end of a transcript than the five prime end. Can anybody imagine why? Yes.

AUDIENCE: Because you're pulling on the poly-A tails.

PROFESSOR: Yeah. So this RNA actually was purified by the poly-A tail. We're pulling on this; we're purifying by it. And if there's breakage of these transcripts as they go through the biochemical processing, we can get shorter and shorter molecules. They all contain this bit, and the probability that they contain the whole transcript actually decreases. So oftentimes there is three prime bias in RNA-seq experiments one needs to be mindful of. But we're assuming, today, that there isn't such bias, and that it's equally probable that the read maps anyplace in this transcript. OK? Does that answer your question? Yes.

AUDIENCE: Sorry, I do have one more. Can you show us how that extends, then, to the paired end read, and where the probability distribution --

PROFESSOR: Right. So if we go to paired end reads, like this, this component is going to be where it aligns, right? And then the probability of the length of this entire molecule is what I had up there before, which is -- exactly, how did I do that?
So this is going to be -- this is this bit, which is the implied length of this. OK? So if I map the left and the right -- this component is where the left end maps. I take the left and the right ends of the read that I have from that transcript, and it has a particular length on this transcript. I'll call that length l sub j of R sub i. Now remember, that is not the same as the length in the genome. That's the length in this transcript as it is spliced. OK? F is going to be the probability distribution of the length of the fragments. So let's just say they're centered around 200 bases. If the implied length is exactly 200 bases, it's going to be quite likely. But imagine that when I map this fragment, the ends wound up being 400 bases apart. Then this distribution would tell us it's very unlikely that I would see a fragment that mapped here and mapped 400 bases up here, because my fragment length distribution, defined by F, is centered at 200 bases. So it's going to discount the probability of that. So this term -- the probability of the read given the transcript -- is the component of where it's aligning, times the likelihood that the implied fragment length agrees with what we think we have empirically. OK? Does that make sense? OK. Those are good questions.

OK. So given this framework, we can use EM or other machine learning frameworks to maximize psi, and to learn the fraction of expression of each different isoform from the observed data, given the functions that we have; a sketch of that procedure appears below.

And just to give you an idea, when this was done for myogenesis, a program called Cufflinks, which does this kind of isoform prevalence estimation, was able to identify a large number of transcripts. 70% of the reads were in previously annotated transcripts.
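Here is the EM sketch referred to above. It treats the per-read probabilities P(read | isoform) as precomputed inputs; the uniform initialization and the fixed iteration count are simplifying assumptions, and real tools such as Cufflinks add transcript length normalization and model selection on top of this core.

```python
def estimate_psi(read_probs, n_iters=100):
    """EM for isoform fractions psi.

    read_probs[i][j] = P(read i | isoform j), as in the earlier sketch.
    Returns psi, the estimated fraction of each isoform.
    """
    n_iso = len(read_probs[0])
    psi = [1.0 / n_iso] * n_iso                        # start uniform
    for _ in range(n_iters):
        counts = [0.0] * n_iso
        for row in read_probs:
            denom = sum(p * f for p, f in zip(row, psi))
            if denom == 0.0:
                continue                               # read fits no isoform
            for j in range(n_iso):
                counts[j] += psi[j] * row[j] / denom   # E-step: responsibilities
        total = sum(counts)
        psi = [c / total for c in counts]              # M-step: renormalize
    return psi

# Two isoforms: two reads unique to isoform 0, one unique to isoform 1,
# one compatible with both.  The shared read is split in proportion to psi,
# and the estimate converges to roughly [0.67, 0.33].
print(estimate_psi([[1e-3, 0.0], [1e-3, 0.0], [0.0, 1e-3], [1e-3, 1e-3]]))
```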
Cufflinks also found 643 new isoforms of genes in this single time series. I posted one of the papers that describes this technology on the Stellar site. But note that certain of the genes have light coverage. What we're seeing here is that for genes that are expressed in low copy numbers, it's obviously more difficult to get reads out of them. And I'm presuming, in this particular experiment -- although I can't recollect -- that the reason they don't see that many intronic reads is that they did poly-A purification.

OK. So we've talked about how to take our reads, build a probabilistic model, and estimate isoform prevalences. And we know how many reads are mapping to a given gene. The question now is whether or not we see differential expression of a gene in different conditions. So I'd like to turn to the analysis of differential expression, unless there are any final questions about the details of RNA-seq and isoform estimation. This is your chance to ask those hard questions. Yes.

AUDIENCE: OK. I have a really silly question. But can you explain, really quickly, what are isoforms?

PROFESSOR: What an isoform is?

AUDIENCE: Yeah.

PROFESSOR: Sure. An isoform is a particular splice variant of a gene. So a transcript of a gene that has a particular splicing pattern is called an isoform. Imagine we have three exons: one, two, and three. A transcript that has all three would be one isoform. Another variant that omits exon two would be a second isoform; just exons one and three would be an isoform. And each gene has a set of isoforms it exhibits. That depends upon how it's regulated, and whether any given splicing event is constitutive -- it always happens -- or whether it's regulated. And so in theory, a gene with n exons has how many potential isoforms? It's 2 to the n, because you can consider each exon as being included or omitted. All right. But that isn't typically the case -- in practice there are many fewer isoforms than that.
But in general, an isoform refers to a particular splice variant of a gene. Yes.

AUDIENCE: I just want to make sure I have everything correctly. When you're using single ended or paired end reads, you can get exclusion reads, right? So you can get that in both cases, whether or not you're --

PROFESSOR: Well, it depends. Once again, it's somewhat probabilistic where the reads actually hit. Because if all the reads only hit exons, and you didn't get any junctions, and none of your paired end reads crossed junctions, then you wouldn't actually have exclusion. All right?

AUDIENCE: But it's possible with both types of sequencing?

PROFESSOR: Yes, it's possible with both types of sequencing. In fact, oftentimes what people do is count junction crossing reads. If the reads in your sequencing are long enough, say, 100 base pairs at least, then a significant fraction of your reads will be exon crossing. And you'll be able to count the number of exon-exon junctions you have of different types. And that will give you an estimate of how much splicing is going on, and will help validate the kinds of conclusions that come out of programs like Cufflinks or MISO, which is another program from the Burge Laboratory that is used to estimate isoform prevalence. Yes.

AUDIENCE: So even given this information, you can't necessarily say which exons go with which exons, except for pairs, right? Because the reads, in general, aren't long enough to span an exon. And therefore we wouldn't know, for example, that a given transcript is exons one, five, and six. You could only know that exons five and six went together.

PROFESSOR: That is not strictly true if you have paired end reads and your fragments are long enough to span exons. But in general, you're correct.
And that's why modern sequencing technologies that are coming down the pike, which can do 25 kilobase reads, are so important for doing things just like that. Yes.

AUDIENCE: In your diagram of the read up there, are the boxes the actual genome or RNA sequence, and the line in between the artificial linker that you added when you prepped the library?

PROFESSOR: Ah, that's a good question. The question is: is this the linker, and are these the actual sequences? No. What I'm drawing here is that these are the bits that we actually get to see the sequence of. We sequence from both ends of a molecule. This is the part of the fragment that we haven't sequenced, because our reads aren't long enough. And so the entire fragment might be 300 bases long; if this is 100 and this is 100, then the unobserved part is 100 in the middle. OK? And the entire length of the molecule is called the insert length. And we get to choose how long these fragments are, up to a maximum size. Contemporary sequencers don't really like fragments over 1,000 bases, and the performance starts falling off when you get close to that number. So people typically operate in a more optimal range, with fragments that are a few hundred bases long.

Any other questions? OK.

So I wanted to briefly talk about hypothesis testing, because we're going to need it for determining when things are really differentially expressed. I'm just going to show you some data and ask you a few questions about it. Here are two different scatter plots of data. Well, actually, it's exactly the same data, but with two different fits to it. One fit uses two independent Gaussians fit to the data, from gene one and gene two. And the other fit uses two Gaussians that have a correlation structure between them.
And the question is whether the null hypothesis or the alternative hypothesis is more reasonable. Typically, when we say reasonable, we want to judge whether or not the alternative is significant. Significance typically asks: what's the chance that the data we saw occurred at random, given the null hypothesis? So what's the chance it was generated by the null hypothesis, versus the chance that it was generated by the alternative hypothesis? Now the problem is that alternative hypotheses typically are more complex. And a more complex model will always explain data better. So we need a principled way of asking: given that the alternative hypothesis is always going to do a better job, does it do such a better job that it can exclude the null hypothesis at a particular probability level? OK?

So here are two different models for these data. The null model, H0, is that they came from two independent Gaussians. The alternative model, H1, is that they came from two correlated Gaussians. And then we can ask whether or not H1 is sufficiently more likely to warrant our rejecting H0 and accepting H1. Now as I said, H1 is always going to fit the data better. So the probability of that collection of points, evaluated with the H1 model fit to the data, is always going to be superior. So we need a way to compare the probability of the data given H1 versus the probability of the data given H0 that allows us to judge the probability that the data arose at random under H0. In this particular case, the data support H1. And let's see why.

So this is a key idea here. How many people have heard of likelihood ratio statistics before? OK. About half the class. OK. So here's the idea. We're going to compute a test statistic. And the test statistic is going to be a function of the observed data.
It's 2 times the log of the probability of the observed data given H1, over the probability of the observed data given H0:

T = 2 log [ P(data | H1) / P(data | H0) ]

Now we know that the numerator is always going to have a higher value than the probability in the denominator. So the ratio is always going to be greater than 1, and the test statistic will always be greater than 0, since we're operating in the log domain. OK? The question is -- we know that H1 is always going to fit better, even when the data was generated from H0. But when is it sufficiently better for us to believe that H0 is not true and we should accept H1?

What we need is a distribution for this test statistic as it would occur if H0 were true. That distribution allows us to compute the probability that an observed value for the test statistic occurred, assuming that H0 is true. OK. And this depends upon the difference in the number of degrees of freedom between H1 and H0. How many degrees of freedom are there in H1, in this model up here? How many parameters do we get to pick?

AUDIENCE: Six.

PROFESSOR: Hmm?

AUDIENCE: Six.

PROFESSOR: Six?

AUDIENCE: Two means and four --

PROFESSOR: Two means and four covariance terms. And for H0?

AUDIENCE: Just four.

PROFESSOR: Four. So what's the difference in the number of degrees of freedom between H1 and H0? It's two. So the null distribution of the test statistic -- a chi-squared distribution -- is parametrized by the difference in the number of degrees of freedom.

And so what we see, then, is something that looks like this: a curve with the probability of the test statistic on the y-axis and the test statistic on the x-axis. As the test statistic gets larger and larger, the probability that it occurred with H0 being true gets smaller and smaller. So let us suppose that we took the data from our model that we observed, and we computed the test statistic at a particular value, called T observed.
So this is the actual value that we computed out of our likelihood ratio test. What we would like to ask is: what's the probability that our test statistic is greater than or equal to T observed, given that H0 is true? This means that we're going to consider the whole tail of this distribution, because we want to include the cases where the statistic is even greater than what we saw. And this gives us a way of computing the probability of seeing a test statistic at least this extreme if H0 were true. And that gives us our p-value. OK?

So this is a way of, in general, comparing two probabilistic models and analyzing the significance of adding extra degrees of freedom to the model. Typically, what we'll be doing in today's lecture is asking whether or not -- if we let the means change, for example, between two conditions -- we get a sufficient improvement in our ability to predict the data that our test statistic will allow us to reject the null hypothesis that the means are the same.

OK. I'm going to stop here and see if there are any questions at all about this.

AUDIENCE: Yes. Where did the degrees of freedom enter into the equation?

PROFESSOR: Where did the degrees of freedom enter into this? Great question. The chi-square tables that you look up are indexed by the number of degrees of freedom of difference. OK? And so whenever you compute a chi-squared p-value, you will compute it with the number of degrees of freedom of difference; there's a sketch of this computation just below.

Any other questions? OK.

So let's now turn to evaluating RNA-seq data once again. I'm going to describe a method called DESeq for determining differential expression. In our analysis, we're going to let i range over genes (or isoforms) and j range over experiments. There may be multiple experiments in the same condition that are replicates. And K sub ij is the number of counts observed for i and j. So that's the expression of gene or isoform i in experiment j.
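Before the normalization details, here is a minimal sketch of the likelihood ratio test just described. It assumes the two models' log-likelihoods have already been computed; scipy's chi-squared survival function plays the role of the chi-square table lookup.

```python
from scipy.stats import chi2

def lrt_pvalue(loglik_h1, loglik_h0, df_difference):
    """Likelihood ratio test: T = 2 * (logL(H1) - logL(H0)).

    Under H0, T is approximately chi-squared distributed with df equal
    to the difference in the number of free parameters (Wilks' theorem).
    """
    T = 2.0 * (loglik_h1 - loglik_h0)
    return chi2.sf(T, df_difference)   # tail probability P(T' >= T | H0)

# Example: H1 improves the log-likelihood by 5 while adding 2 parameters.
# T = 10, and chi2.sf(10, 2) = exp(-5), about 0.0067: reject H0 at the 1% level.
print(lrt_pvalue(-100.0, -105.0, 2))
```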
Now what we need to do, however, is normalize experiments against one another. The normalization factor s sub j is computed for a particular experiment, and it's used to normalize all of the values in that experiment. If all the experiments were completely identical -- the read depth was identical and everything was the same -- then all of the s sub j's would be 1. If you had an experiment that had exactly twice as many reads as the other experiments, its s sub j would be 2. So this scale factor is used to normalize things in our next slide, as we'll see.

And the essential idea is that we're going to take the median value of this ratio:

s_j = median over genes i of [ K_ij / (K_i1 * K_i2 * ... * K_im)^(1/m) ]

The reason that the denominator is a geometric mean is so that no one experiment dominates the average; they all have equal weight. The geometric mean is simply the product of the counts for a particular gene across all m experiments, taken to the 1/m power to bring it back to the scale of a single experiment. And that is the normalizing factor for the numerator, which is the number of counts for a particular gene. OK? So we're just doing median style normalization, where s sub j is a scale factor. Once again, if all the experiments were the same, s sub j would be 1. If one particular experiment had twice as many counts as another experiment, uniformly, its s sub j would be 2, just for that experiment.

Any questions about the scale factor? Yes.

AUDIENCE: Sorry, what is the term on the bottom -- in the denominator?

PROFESSOR: That's a normalizing term. That's the geometric mean of all the experiments put together. All right? Because it's the product over all m experiments, taking the m-th root makes it comparable to the counts of a single experiment.

Any other questions? Yes.
793 00:42:18,880 --> 00:42:21,350 AUDIENCE: Are we normalizing different experiments 794 00:42:21,350 --> 00:42:23,326 to each other or different replicates 795 00:42:23,326 --> 00:42:24,808 of a single experiment? 796 00:42:24,808 --> 00:42:27,050 PROFESSOR: In this particular case, each one of these 797 00:42:27,050 --> 00:42:28,840 is a different replicate. 798 00:42:28,840 --> 00:42:29,340 OK? 799 00:42:29,340 --> 00:42:32,500 So j is ranging over different replicates, 800 00:42:32,500 --> 00:42:34,550 not over conditions, right now. 801 00:42:34,550 --> 00:42:37,610 So each replicate, each experiment, 802 00:42:37,610 --> 00:42:39,710 gets its own normalizing factor. 803 00:42:39,710 --> 00:42:43,070 We'll see, in a moment, how to put those replicates together 804 00:42:43,070 --> 00:42:44,840 to build statistical strength. 805 00:42:44,840 --> 00:42:48,135 But since each replicate has its own read 806 00:42:48,135 --> 00:42:50,260 depth, we have to normalize each one independently. 807 00:42:52,780 --> 00:42:55,020 OK? 808 00:42:55,020 --> 00:43:03,200 So what we then do is we compute an expression for a condition. 809 00:43:03,200 --> 00:43:08,130 Now a condition, we're going to call p. 810 00:43:08,130 --> 00:43:23,220 And q sub ip is the normalized expression 811 00:43:23,220 --> 00:43:33,572 for gene slash isoform i in condition p. 812 00:43:36,350 --> 00:43:39,390 So a condition may have multiple replicates in it. 813 00:43:39,390 --> 00:43:41,520 So we're going to average, over all 814 00:43:41,520 --> 00:43:44,860 of the replicates, the normalized expression, 815 00:43:44,860 --> 00:43:45,890 as you can see here. 816 00:43:45,890 --> 00:43:47,890 So we're summing over all the replicates 817 00:43:47,890 --> 00:43:49,832 for a given condition. 818 00:43:49,832 --> 00:43:51,290 We're going to take each replicate, 819 00:43:51,290 --> 00:43:54,130 normalize it by the scale factor we just computed, 820 00:43:54,130 --> 00:44:02,100 and then compute the normalized expression for a gene 821 00:44:02,100 --> 00:44:05,510 or an isoform in that particular condition. 822 00:44:05,510 --> 00:44:09,750 Is that clear to everybody, what's going on here? 823 00:44:09,750 --> 00:44:14,800 Now I'm describing this to you because the key fact is 824 00:44:14,800 --> 00:44:18,290 the next line, which is that we compute 825 00:44:18,290 --> 00:44:22,150 the mean for a particular replicate 826 00:44:22,150 --> 00:44:25,910 by taking the normalized expression for a gene 827 00:44:25,910 --> 00:44:28,430 and then scaling it 828 00:44:28,430 --> 00:44:32,180 back up again for that particular replicate 829 00:44:32,180 --> 00:44:34,847 by multiplying by s sub j. 830 00:44:34,847 --> 00:44:36,430 But the most important thing is what's 831 00:44:36,430 --> 00:44:39,470 on the right hand side, which is that the variance is 832 00:44:39,470 --> 00:44:49,312 equal to the mean plus this function of the expression. 833 00:44:49,312 --> 00:44:50,770 And the reason this is important is 834 00:44:50,770 --> 00:44:53,980 that most other models for modeling expression data 835 00:44:53,980 --> 00:44:55,654 use Poisson models. 836 00:44:55,654 --> 00:44:57,820 And we've already seen, when we talked about library 837 00:44:57,820 --> 00:45:00,030 complexity, that Poisson models don't work that well 838 00:45:00,030 --> 00:45:01,440 all the time. 839 00:45:01,440 --> 00:45:04,330 So this is using a negative binomial function, once again.
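A small sketch of the per-condition quantities just defined, with made-up counts and size factors:

```python
# q_ip averages the size-factor-normalized counts over the replicates
# of condition p; mu_ij scales q_ip back up for replicate j.
import numpy as np

K = np.array([[100, 210],     # counts for two genes in two replicates of p
              [ 30,  64]])
s = np.array([1.0, 2.0])      # size factors s_j for those replicates

q_p = (K / s).mean(axis=1)    # q_ip: mean of K_ij / s_j over replicates in p
mu = np.outer(q_p, s)         # mu_ij = s_j * q_ip, per gene and replicate
print(q_p)                    # [102.5, 31.0]
print(mu)
```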
840 00:45:04,330 --> 00:45:06,330 We saw negative binomials before. 841 00:45:06,330 --> 00:45:09,260 We're modeling both the mean and the variance. 842 00:45:09,260 --> 00:45:13,630 And the variance is a function of the mean, 843 00:45:13,630 --> 00:45:16,290 with both a linear term and a non-linear term. 844 00:45:19,740 --> 00:45:22,550 So what's going to happen, then, is 845 00:45:22,550 --> 00:45:26,520 that we're going to use the negative binomial to compute 846 00:45:26,520 --> 00:45:31,060 the probability of observing the data in a given condition. 847 00:45:31,060 --> 00:45:34,267 And we can either combine conditions and ask, 848 00:45:34,267 --> 00:45:36,600 what's the probability of seeing the conditions combined 849 00:45:36,600 --> 00:45:41,850 with a single mean and variance, or we can separate them 850 00:45:41,850 --> 00:45:44,125 into a more complex H1 hypothesis and ask, 851 00:45:44,125 --> 00:45:46,500 what's the probability of seeing them with separate means 852 00:45:46,500 --> 00:45:49,190 and variances, and then do a test 853 00:45:49,190 --> 00:45:52,240 to see how significant the difference is to determine 854 00:45:52,240 --> 00:45:57,200 whether or not we can reject the hypothesis that the genes are 855 00:45:57,200 --> 00:45:59,280 expressed at the same level. 856 00:45:59,280 --> 00:46:02,290 And to give you an intuitive idea of what's going on, 857 00:46:02,290 --> 00:46:07,050 this is a plot from the paper showing the relationship 858 00:46:07,050 --> 00:46:10,930 between mean expression and variance. 859 00:46:10,930 --> 00:46:15,302 Recall, for Poisson, that variance is equal to mean. 860 00:46:15,302 --> 00:46:17,260 We only have one parameter to tune for Poisson, 861 00:46:17,260 --> 00:46:18,250 which is lambda. 862 00:46:18,250 --> 00:46:20,490 The purple line is Poisson. 863 00:46:20,490 --> 00:46:22,500 And you can check and see that the mean's 864 00:46:22,500 --> 00:46:24,660 equal to the variance in that case. 865 00:46:24,660 --> 00:46:29,930 What DESeq does is it fits the orange line to the observed 866 00:46:29,930 --> 00:46:31,690 data. 867 00:46:31,690 --> 00:46:33,366 Yes. 868 00:46:33,366 --> 00:46:37,310 AUDIENCE: I don't know where the v sub p comes from. 869 00:46:37,310 --> 00:46:38,296 Where is that, again? 870 00:46:38,296 --> 00:46:40,680 PROFESSOR: That's the function. 871 00:46:40,680 --> 00:46:44,470 V sub p, in that equation up there, 872 00:46:44,470 --> 00:46:46,350 is the function that we're fitting-- that's 873 00:46:46,350 --> 00:46:49,480 the solid orange line-- to the observed 874 00:46:49,480 --> 00:46:53,890 relationship between mean and variance in the data. 875 00:46:53,890 --> 00:46:54,810 OK? 876 00:46:54,810 --> 00:46:57,110 So DESeq fits that function. 877 00:47:00,720 --> 00:47:07,420 EdgeR is another technique that does not do this fit and instead uses 878 00:47:07,420 --> 00:47:11,460 the estimate shown by the dotted line, which isn't as good. 879 00:47:11,460 --> 00:47:13,710 So a lot of people use DESeq these days 880 00:47:13,710 --> 00:47:15,900 for doing differential expression analysis 881 00:47:15,900 --> 00:47:20,190 because it allows the variance to change 882 00:47:20,190 --> 00:47:22,060 as the mean increases. 883 00:47:22,060 --> 00:47:28,440 And this is the case for this type of count data. 884 00:47:28,440 --> 00:47:30,318 Before I go on though, I'll pause and see 885 00:47:30,318 --> 00:47:32,859 if there are any questions at all about what's going on here. 886 00:47:35,675 --> 00:47:36,175 Yes.
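Here is a hedged sketch of scoring counts under a negative binomial with a chosen mean and variance. The conversion to scipy's (n, p) parameterization assumes the variance exceeds the mean, and the quadratic variance term below is an arbitrary stand-in for the fitted function v.

```python
# Negative binomial log-probabilities for counts, given a mean/variance
# pair. scipy's nbinom uses (n, p); for mean m and variance v > m,
# p = m / v and n = m^2 / (v - m) recover exactly that mean and variance.
import numpy as np
from scipy.stats import nbinom

def nb_logpmf(k, mean, var):
    """Negative binomial log-probability with the given mean and variance."""
    p = mean / var
    n = mean**2 / (var - mean)
    return nbinom.logpmf(k, n, p)

mean = 100.0
var = mean + 0.02 * mean**2        # variance grows faster than the mean
print(nb_logpmf(np.array([80, 100, 140]), mean, var))
```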
887 00:47:36,175 --> 00:47:39,370 AUDIENCE: You said that mu sub p, then, is the [INAUDIBLE] 888 00:47:39,370 --> 00:47:40,795 to the data. 889 00:47:40,795 --> 00:47:41,704 I'm confused. 890 00:47:41,704 --> 00:47:44,120 How do we fit that function, or where does that come from? 891 00:47:44,120 --> 00:47:46,060 PROFESSOR: You mean the mu sub p? 892 00:47:46,060 --> 00:47:47,180 AUDIENCE: Yes. 893 00:47:47,180 --> 00:47:50,080 PROFESSOR: That function is fit. 894 00:47:50,080 --> 00:47:51,630 And the paper describes exactly how 895 00:47:51,630 --> 00:47:54,790 it's fit, which I posted on the Stellar site. 896 00:47:54,790 --> 00:47:58,770 But it's a nonlinear function of q, in this case. 897 00:48:04,630 --> 00:48:05,270 Good question. 898 00:48:05,270 --> 00:48:06,103 Any other questions? 899 00:48:10,160 --> 00:48:10,910 OK. 900 00:48:10,910 --> 00:48:13,850 So once again, we have two hypotheses, 901 00:48:13,850 --> 00:48:19,700 the null hypothesis, H0, that A and B are expressed identically, 902 00:48:19,700 --> 00:48:22,500 and H1, that A and B are differentially expressed. 903 00:48:22,500 --> 00:48:25,240 We can compute the number of degrees of freedom. 904 00:48:25,240 --> 00:48:27,690 And we can do a likelihood ratio test, if we'd like, 905 00:48:27,690 --> 00:48:31,000 to compute a p-value against H0. 906 00:48:31,000 --> 00:48:34,450 And our model in this case is the negative binomial model 907 00:48:34,450 --> 00:48:36,920 of the data, which fits the data better. 908 00:48:36,920 --> 00:48:39,349 And that's why DESeq does a better job 909 00:48:39,349 --> 00:48:40,390 than other methodologies. 910 00:48:40,390 --> 00:48:46,200 Because it provides a better approximation 911 00:48:46,200 --> 00:48:49,470 to the underlying noise. 912 00:48:49,470 --> 00:48:51,680 And the next slide shows what you get out 913 00:48:51,680 --> 00:48:57,310 of this kind of analysis, where the little red dots are 914 00:48:57,310 --> 00:49:00,160 the genes that have been called significant 915 00:49:00,160 --> 00:49:03,250 using Benjamini-Hochberg correction, which 916 00:49:03,250 --> 00:49:05,500 we talked about previously. 917 00:49:05,500 --> 00:49:16,750 And you can see how, as the mean increases, the log 2 918 00:49:16,750 --> 00:49:20,355 fold change required to be significant comes down. 919 00:49:23,882 --> 00:49:25,590 So oftentimes, you'll see plots like this 920 00:49:25,590 --> 00:49:30,575 in papers that describe how they actually computed what genes 921 00:49:30,575 --> 00:49:31,825 were differentially expressed. 922 00:49:34,520 --> 00:49:35,430 Any questions at all? 923 00:49:35,430 --> 00:49:36,668 Yes. 924 00:49:36,668 --> 00:49:40,660 AUDIENCE: Why is it that the significance is 925 00:49:40,660 --> 00:49:42,660 lower as the mean increases? 926 00:49:46,160 --> 00:49:50,685 PROFESSOR: Why the significance is lower? 927 00:49:50,685 --> 00:49:52,910 AUDIENCE: Or the threshold. 928 00:49:52,910 --> 00:49:55,390 PROFESSOR: Oh, because as you increase 929 00:49:55,390 --> 00:49:59,800 the number of observations, the law of large numbers 930 00:49:59,800 --> 00:50:01,840 is going to cause the sampling noise to get 931 00:50:01,840 --> 00:50:04,890 closer and closer to 0. 932 00:50:04,890 --> 00:50:09,047 And so you need less of a fold change difference 933 00:50:09,047 --> 00:50:11,380 to be significant as you get more and more observations. 934 00:50:16,590 --> 00:50:19,110 Any other questions? 935 00:50:19,110 --> 00:50:20,390 OK.
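A minimal sketch of the Benjamini-Hochberg procedure used to call the red points significant; this is the textbook step-up procedure on a plain array of p-values, not any particular package's implementation.

```python
# Benjamini-Hochberg: reject the k smallest p-values, where k is the
# largest rank whose sorted p-value stays under (rank / m) * FDR.
import numpy as np

def bh_reject(pvals, fdr=0.05):
    p = np.asarray(pvals)
    order = np.argsort(p)
    m = len(p)
    below = p[order] <= (np.arange(1, m + 1) / m) * fdr
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True     # reject all ranks up to k
    return reject

print(bh_reject([0.001, 0.009, 0.04, 0.20, 0.7]))
# [ True  True False False False]
```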
936 00:50:20,390 --> 00:50:29,120 So now we're going to delve into one other area. 937 00:50:29,120 --> 00:50:34,131 How many people have done hypergeometric tests before? 938 00:50:34,131 --> 00:50:34,630 OK. 939 00:50:34,630 --> 00:50:38,380 So we're going to talk about hypergeometric tests. 940 00:50:38,380 --> 00:50:50,365 So imagine that we have a universe, for simplicity, 941 00:50:50,365 --> 00:50:52,130 of 1,000 genes. 942 00:50:52,130 --> 00:50:52,630 OK? 943 00:50:52,630 --> 00:50:54,640 So we have this universe. 944 00:50:54,640 --> 00:51:01,230 And we have B, a set of genes, of which there are 30. 945 00:51:01,230 --> 00:51:06,010 And there's another set, A, of which there are 20. 946 00:51:06,010 --> 00:51:10,440 And the overlap between these two is a total of three genes. 947 00:51:10,440 --> 00:51:14,130 So it might be that A is the set of genes 948 00:51:14,130 --> 00:51:15,800 that are differentially expressed 949 00:51:15,800 --> 00:51:17,550 between two conditions. 950 00:51:17,550 --> 00:51:19,610 B is the set of genes that, you happen to know, 951 00:51:19,610 --> 00:51:21,290 have a particular annotation. 952 00:51:21,290 --> 00:51:25,380 For example, they're involved in stress response in the cell. 953 00:51:25,380 --> 00:51:27,260 And you'd like to know whether or not 954 00:51:27,260 --> 00:51:30,400 the genes that are differentially expressed 955 00:51:30,400 --> 00:51:34,170 have a significant component of stress response related genes 956 00:51:34,170 --> 00:51:37,170 or whether or not this occurred at random. 957 00:51:37,170 --> 00:51:39,760 OK? 958 00:51:39,760 --> 00:51:42,250 So we need to compute that. 959 00:51:44,900 --> 00:51:48,780 So how many ways could we choose B? 960 00:51:48,780 --> 00:51:57,020 Well, to set up notation-- this is n1, this is n2, this is big N, 961 00:51:57,020 --> 00:51:58,220 and this is k. 962 00:51:58,220 --> 00:52:00,090 All right? 963 00:52:00,090 --> 00:52:06,620 So the number of ways I can choose B is big N choose n2. 964 00:52:06,620 --> 00:52:09,501 That's the number of ways I can choose B. Is everybody with me 965 00:52:09,501 --> 00:52:10,000 on that? 966 00:52:12,510 --> 00:52:14,900 Yeah? 967 00:52:14,900 --> 00:52:15,400 OK. 968 00:52:17,980 --> 00:52:22,320 How many ways can I choose three elements out 969 00:52:22,320 --> 00:52:26,820 of A, these three that are going to overlap? 970 00:52:26,820 --> 00:52:31,650 Well, that's going to be n1 choose k. 971 00:52:31,650 --> 00:52:37,590 So that's how many ways I can choose these three elements. 972 00:52:37,590 --> 00:52:40,970 And then, how many ways could I choose the other elements of B? 973 00:52:40,970 --> 00:52:44,830 So once again, I'm figuring out how I can choose B. Well, 974 00:52:44,830 --> 00:52:46,230 how could I choose the rest of B? 975 00:52:46,230 --> 00:52:48,860 Well, how many elements do I have to choose, of B, here? 976 00:52:48,860 --> 00:52:52,286 Well, B has n2 elements. 977 00:52:52,286 --> 00:52:54,161 But I've already chosen k of them. 978 00:52:54,161 --> 00:52:54,660 Right? 979 00:52:57,170 --> 00:52:58,820 All right. 980 00:52:58,820 --> 00:53:01,400 Sorry, it's the other way around. 981 00:53:01,400 --> 00:53:07,550 The universe I can pick from is 1,000, 982 00:53:07,550 --> 00:53:15,140 which is all the elements, minus the elements of A that I've 983 00:53:15,140 --> 00:53:17,380 already chosen from to get those three.
984 00:53:19,960 --> 00:53:27,970 And then I need to pick the 27 things that don't overlap 985 00:53:27,970 --> 00:53:31,714 with A. So the 27 things that don't overlap with A 986 00:53:31,714 --> 00:53:34,420 would be n2 minus k. 987 00:53:38,120 --> 00:53:39,730 So this is the number of ways to choose 988 00:53:39,730 --> 00:53:43,620 B given this set of constraints. 989 00:53:43,620 --> 00:53:47,580 This is the number of ways to choose B given no constraints. 990 00:53:47,580 --> 00:53:50,910 So the probability that I have overlap 991 00:53:50,910 --> 00:53:55,530 of exactly k is equal to this ratio, which is 992 00:53:55,530 --> 00:53:57,620 the number of ways given that I have an overlap of k, 993 00:53:57,620 --> 00:53:59,410 divided by the number of ways 994 00:53:59,410 --> 00:54:00,600 with no constraints. 995 00:54:03,360 --> 00:54:04,410 All right? 996 00:54:04,410 --> 00:54:06,870 And typically, what I want to ask is, 997 00:54:06,870 --> 00:54:13,400 what is the probability that my observed overlap 998 00:54:13,400 --> 00:54:15,450 is greater than or equal to k. 999 00:54:15,450 --> 00:54:19,300 So in this case, the overlap would be three. 1000 00:54:19,300 --> 00:54:21,184 But I also would need to consider the fact 1001 00:54:21,184 --> 00:54:23,100 that I might have four, or five, or six, which 1002 00:54:23,100 --> 00:54:26,240 would be even more unlikely, but still significant. 1003 00:54:26,240 --> 00:54:28,870 So if you look at the exact computation, 1004 00:54:28,870 --> 00:54:31,760 the probability of three here is 0.017. 1005 00:54:31,760 --> 00:54:36,222 And the probability that I have three or more is 0.02. 1006 00:54:36,222 --> 00:54:37,680 So that's still pretty significant. 1007 00:54:37,680 --> 00:54:41,810 Unlikely that would occur by chance. 1008 00:54:41,810 --> 00:54:43,720 Right? 1009 00:54:43,720 --> 00:54:48,130 That I have three or more genes overlapping in this situation 1010 00:54:48,130 --> 00:54:53,682 could only happen two out of 100 times. 1011 00:54:53,682 --> 00:54:55,640 Does everybody understand what's going on here? 1012 00:55:00,250 --> 00:55:01,330 Any questions at all? 1013 00:55:01,330 --> 00:55:04,980 So you're all now hypergeometric whizzes, right? 1014 00:55:04,980 --> 00:55:06,760 All right? 1015 00:55:06,760 --> 00:55:09,010 Fantastic. 1016 00:55:09,010 --> 00:55:11,894 OK. 1017 00:55:11,894 --> 00:55:14,060 Now we're going to turn to a final kind of analysis. 1018 00:55:14,060 --> 00:55:15,610 How many people have heard of principal component analysis 1019 00:55:15,610 --> 00:55:16,110 before? 1020 00:55:16,110 --> 00:55:20,226 How many people know how to do principal component analysis? 1021 00:55:20,226 --> 00:55:21,280 A few. 1022 00:55:21,280 --> 00:55:22,000 OK. 1023 00:55:22,000 --> 00:55:23,030 Great. 1024 00:55:23,030 --> 00:55:23,530 Yes. 1025 00:55:23,530 --> 00:55:26,506 AUDIENCE: Sorry, could you just briefly 1026 00:55:26,506 --> 00:55:31,962 mention, again, where exactly do we use the hypergeometric test? 1027 00:55:31,962 --> 00:55:34,442 What kinds of questions are we asking when we do that? 1028 00:55:34,442 --> 00:55:36,441 PROFESSOR: Typically, they're overlap questions. 1029 00:55:36,441 --> 00:55:41,600 So you're asking-- you have a universe of objects, right-- 1030 00:55:41,600 --> 00:55:43,690 like, in this case, genes. 1031 00:55:43,690 --> 00:55:46,749 And you have a subset of 20 and a subset of 30. 1032 00:55:46,749 --> 00:55:49,040 Let's say these are the differentially expressed genes.
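For concreteness, the same overlap computation can be checked with scipy's hypergeometric distribution; the numbers follow the example above.

```python
# Overlap test for the worked example: N = 1,000 genes in the universe,
# |A| = 20, |B| = 30, observed overlap k = 3.
from scipy.stats import hypergeom

N, n1, n2, k = 1000, 20, 30, 3
rv = hypergeom(N, n1, n2)    # population N, n1 "marked" genes, n2 drawn
print(rv.pmf(k))             # P(overlap == 3), about 0.017
print(rv.sf(k - 1))          # P(overlap >= 3), about 0.02
```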
1033 00:55:49,040 --> 00:55:51,710 These are genes in the stress response pathway. 1034 00:55:51,710 --> 00:55:53,695 They overlap by three genes. 1035 00:55:53,695 --> 00:55:57,776 Does that actually occur at random or not? 1036 00:55:57,776 --> 00:55:59,270 All right? 1037 00:55:59,270 --> 00:56:05,330 If I told you that there are a much smaller number of genes 1038 00:56:05,330 --> 00:56:08,050 and the stress response genes were very much larger, 1039 00:56:08,050 --> 00:56:11,460 it could be much easier for that overlap to occur at random. 1040 00:56:15,610 --> 00:56:16,260 Good question. 1041 00:56:16,260 --> 00:56:17,093 Any other questions? 1042 00:56:20,450 --> 00:56:20,950 OK. 1043 00:56:20,950 --> 00:56:23,980 So the next part of lecture is entitled 1044 00:56:23,980 --> 00:56:26,260 "multivariate Gaussians are your friends." 1045 00:56:26,260 --> 00:56:26,760 OK? 1046 00:56:26,760 --> 00:56:27,750 They are friendly. 1047 00:56:27,750 --> 00:56:28,970 They're like a puppy dog. 1048 00:56:28,970 --> 00:56:32,670 They are just wonderful to play with and very friendly. 1049 00:56:32,670 --> 00:56:35,620 And the reason most people get a little turned off by them 1050 00:56:35,620 --> 00:56:38,430 is because they get this-- the first thing they're shown 1051 00:56:38,430 --> 00:56:40,950 is this very hairy looking exponential 1052 00:56:40,950 --> 00:56:43,020 which describes what they are. 1053 00:56:43,020 --> 00:56:45,980 And so I'm going to shy away from complicated looking 1054 00:56:45,980 --> 00:56:48,889 exponentials and give you the puppy dog, my favorite way 1055 00:56:48,889 --> 00:56:50,430 of looking at multivariate Gaussians. 1056 00:56:50,430 --> 00:56:50,929 OK? 1057 00:56:50,929 --> 00:56:54,967 Which, I think, is a great way to look at them. 1058 00:56:54,967 --> 00:56:57,300 And the reason we need to look at multivariate Gaussians 1059 00:56:57,300 --> 00:57:00,389 is that they help us understand what 1060 00:57:00,389 --> 00:57:02,680 is going on with principal component analysis in a very 1061 00:57:02,680 --> 00:57:03,921 straightforward way. 1062 00:57:03,921 --> 00:57:06,170 And the reason that we want to use principal component 1063 00:57:06,170 --> 00:57:09,310 analysis is that we're going to be 1064 00:57:09,310 --> 00:57:12,969 able to reveal hidden factors and structures in our data. 1065 00:57:12,969 --> 00:57:14,385 And they're also going to allow us 1066 00:57:14,385 --> 00:57:17,710 to reduce the dimensionality of the data. 1067 00:57:17,710 --> 00:57:20,910 And we'll see why in a moment. 1068 00:57:20,910 --> 00:57:27,420 But here is the friendly version of multivariate Gaussians. 1069 00:57:27,420 --> 00:57:34,490 And let me describe to you why I think this is so friendly. 1070 00:57:34,490 --> 00:57:38,830 So we're all familiar with unidimensional Gaussians 1071 00:57:38,830 --> 00:57:40,320 like this. 1072 00:57:40,320 --> 00:57:41,170 Centered at zero. 1073 00:57:43,800 --> 00:57:46,840 They have variance 1. 1074 00:57:46,840 --> 00:57:50,260 Just very friendly univariate Gaussians, right? 1075 00:57:50,260 --> 00:57:52,085 Everybody's familiar with those? 1076 00:57:52,085 --> 00:57:53,785 Normal distributions? 1077 00:57:53,785 --> 00:57:55,450 So let's suppose that we just take 1078 00:57:55,450 --> 00:57:58,710 a whole collection of those. 1079 00:57:58,710 --> 00:58:02,970 And we say that we have a vector z that 1080 00:58:02,970 --> 00:58:07,240 is sampled from a collection of univariate Gaussians. 
1081 00:58:07,240 --> 00:58:09,060 And it can be as long as we like. 1082 00:58:09,060 --> 00:58:09,560 OK? 1083 00:58:09,560 --> 00:58:12,520 But they're all sampled from the same distribution. 1084 00:58:12,520 --> 00:58:15,240 And what we're going to say is that our multivariate Gaussian, 1085 00:58:15,240 --> 00:58:22,910 x, is going to be a matrix times z plus a mean. 1086 00:58:22,910 --> 00:58:24,710 And so what this matrix is going to do 1087 00:58:24,710 --> 00:58:28,960 is it's going to take all of our univariate Gaussians 1088 00:58:28,960 --> 00:58:33,750 and combine them to produce our multivariate Gaussian. 1089 00:58:33,750 --> 00:58:35,150 All right? 1090 00:58:35,150 --> 00:58:38,430 So the structure of this matrix will 1091 00:58:38,430 --> 00:58:41,840 describe how these univariate Gaussians are 1092 00:58:41,840 --> 00:58:46,527 being mixed together to produce this multivariate distribution. 1093 00:58:46,527 --> 00:58:48,110 And you can imagine various structures 1094 00:58:48,110 --> 00:58:51,970 for this matrix A. Right? 1095 00:58:51,970 --> 00:58:58,990 And the covariance matrix, sigma, 1096 00:58:58,990 --> 00:59:05,040 which describes the structure of this multivariate Gaussian, 1097 00:59:05,040 --> 00:59:13,890 is shown on this slide to be equal to A A transpose. 1098 00:59:13,890 --> 00:59:20,100 And thus, if we knew this matrix A, which we may not know, 1099 00:59:20,100 --> 00:59:25,030 we'd be able to compute the covariance matrix directly. 1100 00:59:25,030 --> 00:59:27,240 OK? 1101 00:59:27,240 --> 00:59:30,570 Let me take that one more time. 1102 00:59:30,570 --> 00:59:32,750 We take a bunch of univariate Gaussians, 1103 00:59:32,750 --> 00:59:34,700 make a vector z out of them. 1104 00:59:34,700 --> 00:59:38,110 And just for clarity, right, we're 1105 00:59:38,110 --> 00:59:43,450 going to talk about matrices and vectors as rows across columns. 1106 00:59:43,450 --> 00:59:46,750 So this is n by 1. 1107 00:59:46,750 --> 00:59:48,610 This is n by n. 1108 00:59:48,610 --> 00:59:50,460 This is n by 1. 1109 00:59:50,460 --> 00:59:52,000 And this is n by 1. 1110 00:59:52,000 --> 00:59:52,940 OK? 1111 00:59:52,940 --> 00:59:55,790 That's the dimensionality of these various objects 1112 00:59:55,790 --> 00:59:58,620 we're dealing with here. 1113 00:59:58,620 --> 01:00:03,610 So we get this vector of univariate Gaussians. 1114 01:00:03,610 --> 01:00:06,670 We apply this matrix A to combine them together. 1115 01:00:06,670 --> 01:00:09,980 We get out our multivariate Gaussian offset by some mean. 1116 01:00:12,500 --> 01:00:14,090 Is everybody happy with that so far? 1117 01:00:16,441 --> 01:00:16,940 Yes? 1118 01:00:16,940 --> 01:00:19,760 No? 1119 01:00:19,760 --> 01:00:22,600 You're suspending disbelief for the next slide. 1120 01:00:22,600 --> 01:00:27,670 Is that-- OK. 1121 01:00:27,670 --> 01:00:36,020 Well, here's the next thing I'd like to say, 1122 01:00:36,020 --> 01:00:47,670 is that the variance of a vector-- we'll call it 1123 01:00:47,670 --> 01:00:58,320 v-- transposed, times x, which is a random variable, 1124 01:00:58,320 --> 01:01:01,070 is going to be equal to-- x is derived 1125 01:01:01,070 --> 01:01:08,130 from this distribution-- v transpose sigma 1126 01:01:08,130 --> 01:01:15,914 v. And the demonstration of that is on the top of this page.
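A short sketch of this construction with an arbitrary choice of A and mu: mixing independent standard normals through A yields samples whose empirical covariance approaches A A transpose.

```python
# Build a multivariate Gaussian as x = A z + mu from i.i.d. standard
# normals z, and check the covariance against A A^T.
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[2.0, 0.0],
              [1.0, 0.5]])
mu = np.array([1.0, -1.0])

z = rng.standard_normal((2, 100_000))   # each column is one draw of z
x = A @ z + mu[:, None]                 # x = A z + mu, column by column

print(A @ A.T)                          # theoretical covariance
print(np.cov(x))                        # empirical covariance, close to it
```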
1127 01:01:18,810 --> 01:01:34,650 So the variance of this vector-- sorry, the projection 1128 01:01:34,650 --> 01:01:37,230 of this random variable onto this vector-- 1129 01:01:37,230 --> 01:01:43,350 is going to give you the variance in that direction 1130 01:01:43,350 --> 01:01:47,990 as that product. 1131 01:01:47,990 --> 01:01:55,367 So what we would like to do is this. 1132 01:01:55,367 --> 01:02:08,065 We would like to find v sub i, which are vectors, 1133 01:02:08,065 --> 01:02:25,360 to maximize the variance of v sub i transpose x 1134 01:02:25,360 --> 01:02:31,960 such that v sub i transpose v sub i is equal to 1. 1135 01:02:31,960 --> 01:02:35,590 In other words, they're unit length vectors. 1136 01:02:35,590 --> 01:02:42,170 So these are going to be called the eigenvectors. 1137 01:02:42,170 --> 01:02:53,052 And if we think about the structure that we desire, 1138 01:02:53,052 --> 01:02:57,030 what we'll find is that they satisfy the constraint 1139 01:02:57,030 --> 01:03:00,540 that the covariance matrix times an eigenvector 1140 01:03:00,540 --> 01:03:03,740 is equal to the eigenvalue associated 1141 01:03:03,740 --> 01:03:06,420 with that vector times the vector itself. 1142 01:03:09,680 --> 01:03:13,050 And with a little manipulation-- if we multiply both sides by v 1143 01:03:13,050 --> 01:03:19,535 sub i transpose, we get v sub i transpose sigma v sub i 1144 01:03:19,535 --> 01:03:24,420 equals lambda sub i squared v sub i transpose v sub i-- 1145 01:03:24,420 --> 01:03:27,120 and since v sub i transpose v sub i 1146 01:03:27,120 --> 01:03:30,150 is equal to 1, v sub i transpose sigma v sub i 1147 01:03:30,150 --> 01:03:33,990 is equal to lambda sub i squared. 1148 01:03:33,990 --> 01:03:40,680 This, we saw up above, is equal to the variance 1149 01:03:40,680 --> 01:03:44,240 when x is projected in the direction of v. 1150 01:03:44,240 --> 01:03:48,620 And so lambda sub i squared is simply the variance associated 1151 01:03:48,620 --> 01:03:49,586 with that direction. 1152 01:03:53,480 --> 01:03:59,590 So the question then becomes, how do we find these things? 1153 01:03:59,590 --> 01:04:03,250 And how do we discover these magic eigenvectors 1154 01:04:03,250 --> 01:04:11,660 that are directions in which this multivariate Gaussian has 1155 01:04:11,660 --> 01:04:12,910 its variance maximized? 1156 01:04:15,740 --> 01:04:25,120 And we can do this by singular value decomposition. 1157 01:04:25,120 --> 01:04:29,340 So we can compute this covariance matrix. 1158 01:04:29,340 --> 01:04:34,270 So we compute sigma from the data. 1159 01:04:37,100 --> 01:04:49,020 And then we do a singular value decomposition 1160 01:04:49,020 --> 01:04:54,665 such that sigma is equal to U S U transpose. 1161 01:04:57,530 --> 01:04:59,840 And that's what the singular value decomposition does 1162 01:04:59,840 --> 01:05:02,250 for us: it decomposes the sigma 1163 01:05:02,250 --> 01:05:04,960 matrix into these components, where 1164 01:05:04,960 --> 01:05:16,460 S is a diagonal matrix that contains the eigenvalues 1165 01:05:16,460 --> 01:05:22,180 and the columns of U 1166 01:05:22,180 --> 01:05:24,225 are the eigenvectors. 1167 01:05:31,610 --> 01:05:33,460 So in doing a singular value decomposition, 1168 01:05:33,460 --> 01:05:39,920 we get the eigenvalues and the eigenvectors. 1169 01:05:39,920 --> 01:05:43,400 The other thing, you recall, was that sigma 1170 01:05:43,400 --> 01:05:46,860 was equal to A A transpose when we started off.
1171 01:05:46,860 --> 01:05:49,830 A was the matrix that we used to make our multivariate Gaussian 1172 01:05:49,830 --> 01:05:52,440 out of our univariate Gaussians. 1173 01:05:52,440 --> 01:05:56,680 And thus, what we can observe is that our multivariate Gaussian 1174 01:05:56,680 --> 01:06:07,690 x is equal to U S to the 1/2 times z plus a mean. 1175 01:06:07,690 --> 01:06:12,050 So here is what's going on when we make a multivariate Gaussian: 1176 01:06:12,050 --> 01:06:15,020 we're taking a bunch of univariate Gaussians, 1177 01:06:15,020 --> 01:06:19,400 we're scaling them, and we're rotating them. 1178 01:06:19,400 --> 01:06:20,660 OK? 1179 01:06:20,660 --> 01:06:22,350 And that makes a multivariate Gaussian. 1180 01:06:22,350 --> 01:06:25,350 And then we offset this whole thing by a mean. 1181 01:06:25,350 --> 01:06:29,440 Because the rotations happen around the origin. 1182 01:06:29,440 --> 01:06:32,180 So the way I think about multivariate Gaussians 1183 01:06:32,180 --> 01:06:35,860 is that it is a scaling and a rotation 1184 01:06:35,860 --> 01:06:37,900 of univariate Gaussians. 1185 01:06:37,900 --> 01:06:40,060 And implicit in that scaling and rotation 1186 01:06:40,060 --> 01:06:43,550 is the discovery of the major directions of variance 1187 01:06:43,550 --> 01:06:46,100 in the underlying data as represented 1188 01:06:46,100 --> 01:06:48,070 by the eigenvectors. 1189 01:06:48,070 --> 01:06:51,200 And the eigenvalues tell you how much of the variance 1190 01:06:51,200 --> 01:06:53,763 is accounted for in each one of those dimensions. 1191 01:06:56,870 --> 01:06:59,200 Are there any questions about that? 1192 01:06:59,200 --> 01:07:00,610 I hope there are. 1193 01:07:00,610 --> 01:07:01,860 Yes. 1194 01:07:01,860 --> 01:07:04,260 AUDIENCE: How do you compute sigma from the data? 1195 01:07:04,260 --> 01:07:06,884 Is it some magical process? 1196 01:07:06,884 --> 01:07:09,080 PROFESSOR: The sigma of the singular value decomposition? 1197 01:07:09,080 --> 01:07:09,621 AUDIENCE: No. 1198 01:07:09,621 --> 01:07:11,609 So-- 1199 01:07:11,609 --> 01:07:14,094 PROFESSOR: How do you compute sigma from the data? 1200 01:07:14,094 --> 01:07:15,886 AUDIENCE: Yeah, the first step. 1201 01:07:15,886 --> 01:07:24,330 PROFESSOR: That is shown in equation seven. 1202 01:07:31,340 --> 01:07:34,950 So you can compute the means. 1203 01:07:34,950 --> 01:07:38,567 And you know you have x, which are observed values. 1204 01:07:38,567 --> 01:07:39,900 So you compute that expectation. 1205 01:07:39,900 --> 01:07:41,851 And that is sigma. 1206 01:07:41,851 --> 01:07:42,350 OK? 1207 01:07:51,017 --> 01:07:51,600 Good question. 1208 01:07:51,600 --> 01:07:52,433 Any other questions? 1209 01:07:57,260 --> 01:07:57,760 OK. 1210 01:07:57,760 --> 01:08:01,180 So we have these eigenvectors and eigenvalues, 1211 01:08:01,180 --> 01:08:06,390 which represent the directions of maximum variance 1212 01:08:06,390 --> 01:08:08,570 in the underlying data. 1213 01:08:08,570 --> 01:08:15,190 And we can use these to organize data by projecting observations 1214 01:08:15,190 --> 01:08:19,630 onto these eigenvectors-- they're sometimes 1215 01:08:19,630 --> 01:08:24,250 called principal components-- which define dimensions 1216 01:08:24,250 --> 01:08:29,319 of variability that help us organize our underlying data. 1217 01:08:29,319 --> 01:08:31,089 We'll come back to that in a moment. 1218 01:08:36,000 --> 01:08:37,399 OK. 1219 01:08:37,399 --> 01:08:41,664 Any other questions about principal component analysis?
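Here is a sketch, on synthetic data, of the two steps just discussed: estimating sigma as the average of the centered outer products (the expectation the question was about), and then decomposing it into eigenvectors and eigenvalues. For a symmetric sigma, numpy's eigh gives exactly the U S U transpose factorization described here.

```python
# Estimate the covariance from samples, then eigendecompose it.
import numpy as np

rng = np.random.default_rng(1)
A_true = np.array([[2.0, 0.0],
                   [1.0, 0.5]])
X = rng.standard_normal((50_000, 2)) @ A_true.T   # rows are samples of x

mu = X.mean(axis=0)
centered = X - mu
Sigma = centered.T @ centered / len(X)   # E[(x - mu)(x - mu)^T] from data

S, U = np.linalg.eigh(Sigma)             # eigenvalues ascending, columns of U
v = U[:, -1]                             # direction of maximum variance
print(S[-1])                             # the variance along that direction
print(np.allclose(Sigma @ v, S[-1] * v))          # Sigma v = lambda v holds
print(np.allclose(U @ np.diag(S) @ U.T, Sigma))   # Sigma = U S U^T
```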
1220 01:08:41,664 --> 01:08:42,164 Yes. 1221 01:08:42,164 --> 01:08:44,370 AUDIENCE: [INAUDIBLE] e was expectation 1222 01:08:44,370 --> 01:08:46,729 when you were calculating the second? 1223 01:08:46,729 --> 01:08:50,191 PROFESSOR: Yes, e is the expectation, that's correct. 1224 01:08:50,191 --> 01:08:51,649 AUDIENCE: And also what that means. 1225 01:08:51,649 --> 01:08:57,689 PROFESSOR: It's the average, the expected value. 1226 01:08:57,689 --> 01:09:02,359 So in the case of computing sigma, 1227 01:09:02,359 --> 01:09:08,760 you would compute the expected value of that inner equation 1228 01:09:08,760 --> 01:09:10,792 across all the data points that you see. 1229 01:09:14,912 --> 01:09:16,620 So you'd sum up all the values and divide 1230 01:09:16,620 --> 01:09:18,161 by the number of things that you had. 1231 01:09:20,387 --> 01:09:21,220 Any other questions? 1232 01:09:24,030 --> 01:09:24,530 OK. 1233 01:09:27,220 --> 01:09:32,630 So just for calibration for next year, how many people 1234 01:09:32,630 --> 01:09:36,350 think they've got a general idea of what principal component 1235 01:09:36,350 --> 01:09:39,131 analysis is-- a general idea? 1236 01:09:39,131 --> 01:09:39,630 Uh-oh. 1237 01:09:42,439 --> 01:09:44,740 How many people who thought it was really interesting 1238 01:09:44,740 --> 01:09:48,930 were sort of completely baffled about halfway through? 1239 01:09:48,930 --> 01:09:51,000 OK. 1240 01:09:51,000 --> 01:09:51,779 All right. 1241 01:09:51,779 --> 01:09:53,760 Well, I think that recitation can 1242 01:09:53,760 --> 01:09:55,550 help with some of those questions. 1243 01:09:55,550 --> 01:09:58,840 But if anybody has a question they'd like to ask now-- No? 1244 01:09:58,840 --> 01:10:00,316 It's that far gone? 1245 01:10:05,570 --> 01:10:07,780 I mean, the thing with this sort of analysis 1246 01:10:07,780 --> 01:10:11,384 is that if your matrix algebra is a little rusty, then, 1247 01:10:11,384 --> 01:10:13,300 when you start looking at equations like that, 1248 01:10:13,300 --> 01:10:17,000 you can get a little lost sometimes. 1249 01:10:19,940 --> 01:10:20,440 All right. 1250 01:10:20,440 --> 01:10:23,650 Well, let's turn, then-- if there aren't any brave souls 1251 01:10:23,650 --> 01:10:27,220 who wish to ask a question, we'll 1252 01:10:27,220 --> 01:10:30,935 turn to single cell RNA-seq analysis. 1253 01:10:33,700 --> 01:10:37,500 So I'm a firm believer that single cell analysis 1254 01:10:37,500 --> 01:10:40,670 of biological samples is the next big frontier. 1255 01:10:40,670 --> 01:10:46,915 And it's being made possible through devices like this. 1256 01:10:46,915 --> 01:10:52,830 This is a Fluidigm C1 chip, which has 96 different reaction 1257 01:10:52,830 --> 01:10:55,070 wells, which allows you, in each well, 1258 01:10:55,070 --> 01:10:57,830 to process a single cell independently. 1259 01:10:57,830 --> 01:11:02,060 And the little channels are ways to get reagents into those wells 1260 01:11:02,060 --> 01:11:12,400 to do things like produce RNA-seq-ready materials. 1261 01:11:12,400 --> 01:11:16,340 And when you do single cell analysis, 1262 01:11:16,340 --> 01:11:21,770 you can take apart what's happening in a population. 1263 01:11:21,770 --> 01:11:29,660 So an early paper asked some fairly fundamental but simple 1264 01:11:29,660 --> 01:11:31,200 questions.
1265 01:11:31,200 --> 01:11:36,820 For example, if you take two 10,000 cell aliquots 1266 01:11:36,820 --> 01:11:41,880 of the same culture, and you profile them independently, 1267 01:11:41,880 --> 01:11:45,720 and you ask how well the expression values for each gene 1268 01:11:45,720 --> 01:11:49,220 agree between sample A and sample B, 1269 01:11:49,220 --> 01:11:54,270 you expect there to be a very good agreement between sample A 1270 01:11:54,270 --> 01:11:59,070 and sample B in these 10,000 cell cultures. 1271 01:11:59,070 --> 01:12:05,170 A second question is, now, if you take, say, 14 cells 1272 01:12:05,170 --> 01:12:09,060 from those cultures and you profile them independently, 1273 01:12:09,060 --> 01:12:12,550 and you ask, how well do they correlate 1274 01:12:12,550 --> 01:12:16,080 with what you saw in the 10,000 cell experiment, that 1275 01:12:16,080 --> 01:12:18,765 will tell you something about the population heterogeneity 1276 01:12:18,765 --> 01:12:20,560 that you're observing. 1277 01:12:20,560 --> 01:12:23,790 Because if they correlate perfectly with the 10,000 cell 1278 01:12:23,790 --> 01:12:26,570 experiment, then you really know that there's 1279 01:12:26,570 --> 01:12:28,990 no point in looking at individual cells in some sense, 1280 01:12:28,990 --> 01:12:30,000 because they're all the same. 1281 01:12:30,000 --> 01:12:31,000 Seen one, seen them all. 1282 01:12:31,000 --> 01:12:32,620 Right? 1283 01:12:32,620 --> 01:12:35,660 But if you find that each cell has 1284 01:12:35,660 --> 01:12:39,240 its own particular expression fingerprint 1285 01:12:39,240 --> 01:12:44,890 and what you're seeing in the 10,000 cell average experiment 1286 01:12:44,890 --> 01:12:47,825 wipes out those fingerprints, then you 1287 01:12:47,825 --> 01:12:50,350 know it's very important to analyze each cell individually. 1288 01:12:54,050 --> 01:13:01,350 So the analysis that was done asked exactly that question. 1289 01:13:01,350 --> 01:13:04,020 So here's what I'll show you in these plots. 1290 01:13:04,020 --> 01:13:11,700 So here is, on the upper left, the 10,000 cell experiment 1291 01:13:11,700 --> 01:13:13,350 versus the 10,000 cell experiment. 1292 01:13:13,350 --> 01:13:16,320 And as you can see, the correlation coefficient 1293 01:13:16,320 --> 01:13:24,570 is quite high-- 0.98-- and looks very, very good 1294 01:13:24,570 --> 01:13:27,680 for experiment one versus experiment two, 1295 01:13:27,680 --> 01:13:30,090 or rep one, rep two. 1296 01:13:30,090 --> 01:13:34,450 Here is a separate experiment which 1297 01:13:34,450 --> 01:13:37,540 is looking at two individual cells 1298 01:13:37,540 --> 01:13:40,320 and plotting, for each gene, 1299 01:13:40,320 --> 01:13:42,750 the expression in one cell versus the 1300 01:13:42,750 --> 01:13:44,620 expression in the other cell. 1301 01:13:44,620 --> 01:13:47,460 And you can see that the correlation coefficient's 0.54. 1302 01:13:47,460 --> 01:13:49,280 And there's actually a fairly wide spread. 1303 01:13:49,280 --> 01:13:54,040 In fact, there are genes that are expressed in one cell that 1304 01:13:54,040 --> 01:13:57,069 are not expressed in the other cell, and vice versa. 1305 01:13:57,069 --> 01:13:58,860 So the expression of these individual cells 1306 01:13:58,860 --> 01:14:02,050 is quite divergent.
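A toy sketch of the kind of comparison being described, with invented expression values standing in for real profiles; correlations are computed on log-transformed values, a common convention.

```python
# Pearson correlation between two expression profiles, on a log scale.
import numpy as np

bulk_rep1 = np.array([120.0, 30.0, 500.0, 8.0, 60.0])
bulk_rep2 = np.array([115.0, 33.0, 480.0, 9.0, 58.0])
cell_a    = np.array([300.0,  0.0, 420.0, 0.0, 95.0])  # dropouts in one cell
cell_b    = np.array([ 10.0, 70.0, 510.0, 2.0,  1.0])

def log_corr(x, y):
    lx, ly = np.log1p(x), np.log1p(y)    # log(1 + value) tolerates zeros
    return np.corrcoef(lx, ly)[0, 1]

print(log_corr(bulk_rep1, bulk_rep2))   # very high, like the 0.98 panel
print(log_corr(cell_a, cell_b))         # much lower, in the spirit of 0.54
```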
1307 01:14:02,050 --> 01:14:09,530 And the final panel, down here, shows 1308 01:14:09,530 --> 01:14:15,760 how a single cell average on the y-axis 1309 01:14:15,760 --> 01:14:17,930 relates to the 10,000 cell experiment. 1310 01:14:20,550 --> 01:14:23,840 But given the middle panel-- panel B there-- 1311 01:14:23,840 --> 01:14:27,720 which shows that two single cells don't really 1312 01:14:27,720 --> 01:14:30,060 relate that well to one another, it begs other questions. 1313 01:14:30,060 --> 01:14:33,530 For example, are the isoforms of the genes that 1314 01:14:33,530 --> 01:14:38,980 are being expressed the same in those distinct cells? 1315 01:14:38,980 --> 01:14:42,570 And so panel D shows isoforms that 1316 01:14:42,570 --> 01:14:47,240 are the same across each one of the single cells being 1317 01:14:47,240 --> 01:14:51,770 profiled, which is the solid bar at the top. 1318 01:14:51,770 --> 01:14:57,410 And the bottom couple of rows in figure D 1319 01:14:57,410 --> 01:14:59,630 are the 10,000 cell experiment average. 1320 01:15:02,140 --> 01:15:03,730 But panel E is the most interesting, 1321 01:15:03,730 --> 01:15:09,370 perhaps, which is that the isoforms for those four genes 1322 01:15:09,370 --> 01:15:11,920 are being differentially expressed 1323 01:15:11,920 --> 01:15:13,305 in different individual cells. 1324 01:15:16,680 --> 01:15:19,690 And that's further supported by taking two of those genes 1325 01:15:19,690 --> 01:15:27,570 and doing fluorescence in situ hybridization and microscopy, 1326 01:15:27,570 --> 01:15:29,810 and looking at the number of RNA molecules 1327 01:15:29,810 --> 01:15:34,000 for each one of those, and noting that it corresponds 1328 01:15:34,000 --> 01:15:37,860 to what's seen in the upper right hand panel. 1329 01:15:37,860 --> 01:15:42,220 So we see that in individual cells, 1330 01:15:42,220 --> 01:15:48,120 different isoforms are being expressed. 1331 01:15:48,120 --> 01:15:51,030 Now these cells were derived from bone marrow. 1332 01:15:51,030 --> 01:15:55,860 And they're exposed to lipopolysaccharide 1333 01:15:55,860 --> 01:15:59,420 to activate an immune response. 1334 01:15:59,420 --> 01:16:02,150 So they are clearly not all behaving exactly the same. 1335 01:16:04,730 --> 01:16:11,370 And to further elucidate this, the authors of this paper 1336 01:16:11,370 --> 01:16:18,020 took the gene expression values that they 1337 01:16:18,020 --> 01:16:21,650 saw for a given cell as a large vector 1338 01:16:21,650 --> 01:16:23,900 and computed the principal components, 1339 01:16:23,900 --> 01:16:29,290 and then projected the cells into the first and second 1340 01:16:29,290 --> 01:16:32,810 principal component or eigenvector space. 1341 01:16:32,810 --> 01:16:37,000 And as you can see, there is a distinct separation of three 1342 01:16:37,000 --> 01:16:40,880 of the cells from the rest of the cells, where 1343 01:16:40,880 --> 01:16:44,240 three of the cells, which correlate well 1344 01:16:44,240 --> 01:16:46,670 with principal component one, are thought 1345 01:16:46,670 --> 01:16:53,160 to be mature cells that express certain cell surface 1346 01:16:53,160 --> 01:16:56,390 proteins, whereas the ones on the left-- the maturing 1347 01:16:56,390 --> 01:17:00,810 cells-- the triangle depicted cells-- 1348 01:17:00,810 --> 01:17:04,680 express certain cytokines under the maturing legend 1349 01:17:04,680 --> 01:17:08,420 there, on the clustergram on the right-hand side.
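And a sketch of that projection on toy data: each cell's expression vector is treated as a point and projected onto the first two principal components; the module spiked into three "mature" cells is an artificial stand-in for a real expression program.

```python
# Project cells onto the first two principal components of a
# cell-by-gene matrix. The sign of each component is arbitrary.
import numpy as np

rng = np.random.default_rng(2)
X = rng.poisson(5.0, size=(18, 200)).astype(float)   # 18 cells x 200 genes
X[:3, :20] += 40.0           # three cells overexpress a 20-gene module

Xc = X - X.mean(axis=0)                  # center each gene
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs = Xc @ Vt[:2].T                      # coordinates on PC1 and PC2
print(pcs[:3, 0])                        # the three cells separate on PC1
print(pcs[3:, 0])
```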
1350 01:17:08,420 --> 01:17:12,420 And thus, the first principal component 1351 01:17:12,420 --> 01:17:16,600 was able to separate those two different broad classes 1352 01:17:16,600 --> 01:17:18,269 of cells. 1353 01:17:18,269 --> 01:17:20,560 So it looks like there are at least two different kinds 1354 01:17:20,560 --> 01:17:23,800 of cells in this population. 1355 01:17:23,800 --> 01:17:25,680 And then, the authors asked another question, 1356 01:17:25,680 --> 01:17:32,350 which is, can they take individual cell data 1357 01:17:32,350 --> 01:17:35,510 and look at the relationship between pairs of genes 1358 01:17:35,510 --> 01:17:38,860 to see which genes are co-expressed. 1359 01:17:38,860 --> 01:17:40,660 And the hypothesis is that genes that 1360 01:17:40,660 --> 01:17:44,030 are co-expressed in individual cells 1361 01:17:44,030 --> 01:17:47,560 make up individual regulatory circuits. 1362 01:17:47,560 --> 01:17:52,280 And so they hypothesized that the gene pairs IRF7 and IFIT1, 1363 01:17:52,280 --> 01:17:58,470 and STAT2 and IRF7, are all in an anti-viral regulatory circuit. 1364 01:17:58,470 --> 01:18:00,610 They then asked the question, if they knocked out 1365 01:18:00,610 --> 01:18:03,990 IRF7, which is the second panel on the right-hand side, 1366 01:18:03,990 --> 01:18:06,310 would they ablate downstream gene expression. 1367 01:18:06,310 --> 01:18:08,440 And they partially did. 1368 01:18:08,440 --> 01:18:12,330 And since STAT2 and IRF7 are both thought 1369 01:18:12,330 --> 01:18:14,340 to be regulators of the circuit and they're 1370 01:18:14,340 --> 01:18:16,234 both downstream of the interferon receptor, 1371 01:18:16,234 --> 01:18:18,650 they thought if they knocked out the interferon receptor, 1372 01:18:18,650 --> 01:18:20,691 they would ablate most of the anti-viral cluster, 1373 01:18:20,691 --> 01:18:23,250 which, in fact, they did. 1374 01:18:23,250 --> 01:18:28,420 So what this is suggesting is that, first, single cell 1375 01:18:28,420 --> 01:18:30,140 analysis is extraordinarily important 1376 01:18:30,140 --> 01:18:32,410 to understand what's going on in individual cells. 1377 01:18:32,410 --> 01:18:37,140 Because in a cell culture, the cells can be quite different. 1378 01:18:37,140 --> 01:18:41,910 And secondarily, it's possible, within the context 1379 01:18:41,910 --> 01:18:43,900 of individual single cell analysis, 1380 01:18:43,900 --> 01:18:47,170 to be able to pick out regulatory circuits that 1381 01:18:47,170 --> 01:18:49,100 wouldn't be as evident when you're 1382 01:18:49,100 --> 01:18:52,850 looking at cells en masse. 1383 01:18:52,850 --> 01:18:59,300 And finally-- I'll thank Mike for the next two slides-- 1384 01:18:59,300 --> 01:19:00,980 I wanted to point out that quality 1385 01:19:00,980 --> 01:19:04,640 metrics for single cell RNA-seq data are very important. 1386 01:19:04,640 --> 01:19:08,820 And we talked about library complexity earlier in the term. 1387 01:19:08,820 --> 01:19:13,786 And here, you can see that the coefficient of variation 1388 01:19:13,786 --> 01:19:15,660 of expression, which 1389 01:19:15,660 --> 01:19:19,570 is the standard deviation over the mean, 1390 01:19:19,570 --> 01:19:22,870 comes down as you get sufficient library complexity. 1391 01:19:22,870 --> 01:19:28,240 And furthermore, as library complexity increases, 1392 01:19:28,240 --> 01:19:30,610 mean expression increases.
1393 01:19:30,610 --> 01:19:34,580 And the cells that are in red were classified as bad 1394 01:19:34,580 --> 01:19:43,910 by microscopy, from the Fluidigm instrument processing step. 1395 01:19:43,910 --> 01:19:47,150 So I think you can see that single cell analysis is going 1396 01:19:47,150 --> 01:19:50,220 to be extraordinarily important and can reveal 1397 01:19:50,220 --> 01:19:55,250 a lot of information that is not present in these large batch 1398 01:19:55,250 --> 01:19:56,900 experiments. 1399 01:19:56,900 --> 01:19:59,770 And it's coming to a lab near you. 1400 01:19:59,770 --> 01:20:04,140 So on that note, I'll thank you very much for today. 1401 01:20:04,140 --> 01:20:06,280 And we'll see you later in the term. 1402 01:20:06,280 --> 01:20:09,010 And Professor Burge will return at the next lecture. 1403 01:20:09,010 --> 01:20:10,860 Thanks very much.