The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

TOM MITCHELL: I want to talk about some work that we're doing to try to study language in the brain. Actually, to be honest, this is part of a grander plan. So here is what I'm really doing with my research life. I'm interested in language, and so I'm involved in two different research projects.

One of them is to build a computer to learn to read. We have a project which we call our Never Ending Language Learner, which is an attempt to build a computer program to learn to read the web. NELL, we call it, has been running nonstop, 24 hours a day, since 2010. So it's now five years old. If you have very good eyesight, you can tell that everybody there in the group is wearing a NELL fifth-birthday-party t-shirt.

But it's an effort to try to understand what it would be like to build a computer program that runs forever and gets better every day. In this case, its job is to learn to read the web. It is getting better. It currently has about 100 million beliefs that it has read from the web. It's learning to infer new beliefs from old beliefs. It's a better reader today than it was last year. It was better last year than it was the year before. It's still not anything like as competent as you and I, but it's one line of research that you can follow if you're interested in understanding language understanding.

The other thread, which is what I'm going to talk about tonight, the bottom half here, is to study how the brain processes language by putting people in brain imaging scanners of different types, showing them language stimuli, and getting them to read. So I'm going to focus really on the bottom part.
But I can't really talk about this honestly unless I fess up to the fact that my goal is for these two projects to collide in a monstrous collision. They haven't yet, although you'll see some signs tonight, I hope, of some of the cross-fertilization between the two areas.

When it comes to the brain imaging work, we have a great team of people. One of them, Nicole Rafidi, is sitting right here. Some of you have already met her this week. And so what I'm going to present is really the group work of quite a few people.

The idea is simple, but here's the brainteaser. Suppose you're interested in how the brain processes language, and you have access to some scanning machines. What would you do?

So we started out by showing people in a scanner stimuli like these. Maybe single words, initially nouns like camera, and drill, and house, and saw. Sometimes pictures, sometimes pictures with words under them. But just showing people stimuli to get them to think about some concept. And then we collect a brain image, like this one, which we collected when a person was looking at this particular stimulus, a bottle. This is posterior, the back of the head, on top. This is the front of the head at the bottom here. And these four slices are four out of about 22 slices of the brain that make up the three-dimensional image. So you can see here what the brain activity looks like-- kind of blotchy-- when one particular person thinks about bottle.

So you might ask, what does it look like if they think about something else? Well, I can show you what it looks like on average. If we average over 60 different words, then here's the brain activity. And you can see that it looks a lot like bottle, but maybe there are some differences. And in fact, if I subtract out this mean activity from the brain image we get for bottle, then you can see the residue here.
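To make that subtraction concrete, here is a minimal sketch in Python; the array shapes (60 words, 20,000 voxels) and the synthetic data are assumptions standing in for the real images.

```python
import numpy as np

# Synthetic stand-in for the real data: one row of voxel activity per word.
rng = np.random.default_rng(0)
images = rng.normal(size=(60, 20_000))   # 60 words x 20,000 voxels

mean_image = images.mean(axis=0)         # average activity over all 60 words
residue = images[0] - mean_image         # word-specific deviation for one word
print(residue.shape)                     # (20000,)
```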
There are in fact some differences in the activity we see for bottle compared to the mean activity over many words. Whether that's signal or noise, I guess you can't tell by looking at this picture. But that's the kind of data that we have if we use fMRI to capture brain activity while people read words.

So the first thing you might think of doing, if you had this kind of data, would be to train a machine learning program to decode from these brain images which word somebody is thinking about. And we did in fact begin that way, by training classifiers where we'd give them a brain image, and during training we would tell them which word that brain image corresponds to. Then, after training, we could test the classifier, to see whether it had indeed learned the right pattern of activity, by showing it new brain images and having it tell us, for example, is this person reading the word hammer or bottle?

And, in fact, that works quite well. If you try it over several different participants in our study, you can see the classification accuracies we get for a Boolean classification problem: are they reading a tool word, like hammer, saw, chisel, or a building word, like house, palace, hotel? Then, depending on the individual person, we can get accuracy in the high 90s percent, or a little worse. In fact, if you ask why it's not the same for all people, it turns out the accuracy we get correlates very well with a measure of head motion in the machine. So a lot of this is noise.

But the bottom line here is good. fMRI actually has enough resolution to resolve the differences in neural activity between, say, thinking about house versus hammer. And machine learning methods can discover those distinctions. So that's a good basis. And given that, you can start asking a number of interesting questions. Like we could ask, well, what about you and me? Do we have the same pattern of brain activity to encode hammer, and house, and all the other concepts?
Or does each of us do something different? We can convert that into a machine learning question, right? We could say, well, what if we train on people on that side of the room. We'll collect their brain data and train our program. Then we'll collect data from these people and try to decode which word they're reading based on the patterns that we learned from those people. If that works, then that's overwhelming evidence that we have very similar neural encodings of different word meanings.

So we tried that and, in fact, it works. Here you see in black the accuracies, just like on the first slide, of how well we can decode which word a person is reading, if we train on data from the same person we're testing on. In white you see the accuracies we get if we train on no data at all from this person, but instead train on the data from all the other participants. And you see that on average we do about as well with the white bars as we do with the black bars. In fact, in some cases we do better training on other people. That might be, for example, because we get to use more training examples-- all the other participants' data instead of just one participant's data.

But again, the important thing here is that this is very strong evidence that, even though we're all very different people, we have remarkably similar neural encodings when we think about common nouns. Which is something that, say in the year 2000, I don't think anybody understood.

So I want to kind of wrap up this idea. I want to go through basically four ideas in this talk. Idea number one is, gee, we could train classifiers to try to decode from the neural activity which word a person is reading. And if we do that, then we can actually ask some interesting scientific questions, like: are the patterns similar across our brains? Does it depend whether it's a picture or a word?
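As a concrete illustration of what such a decoder can look like, here is a hedged sketch using a regularized linear classifier from scikit-learn on synthetic stand-in data. The talk doesn't specify the actual classifier, so logistic regression is just one reasonable choice; cross-subject decoding is the same code with the rows pooled from other participants' (anatomically aligned) data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical training set: one row of voxel activity per stimulus
# presentation, labeled 0 = tool word (hammer, saw, ...) or
# 1 = building word (house, palace, ...).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 20_000))   # 120 presentations x 20,000 voxels
y = rng.integers(0, 2, size=120)     # tool vs. building labels

# Regularization matters: there are far more voxels than examples.
clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
scores = cross_val_score(clf, X, y, cv=5)   # held-out decoding accuracy
print(scores.mean())   # ~0.5 on pure noise; high 90s reported on real data
```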
And, in fact, we can think of this technique of training a classifier as-- the way I think of it is, it's a way of building a virtual sensor of information content in the neural signal. I think that fMRI was truly a revolution in the study of the brain, because for the first time we could look inside and see the activity. But these classifiers give us a different thing. Now we can look inside and see not just the neural activity, but the information encoded in that neural activity. And so it's a different kind of sensor. You can design your own, train it, and then use it to study information represented in the neural signal in the brain. So it opens up a very large set of methods, and techniques, and experiments that we can now run with brain imaging, where instead of looking just at the activity, we can look at the information content.

OK, so that's idea number one. We were quite pleased with ourselves as we were doing this work. But in the back of our minds was a gnawing question: well, this is good-- now we've trained on a couple of hundred words, so we have a couple hundred different neural patterns of activity. We have a list of the neural codes for a couple of hundred words, but that's not really a theory of neural encodings of meaning. It's a list.

What would it mean to have a theory? Well, scientific theories are logical systems that can make predictions. And if they're interesting theories, they make experimentally testable predictions. So in our case, if we want to study representations of meaning, it would be nice to have a theory where we could input an arbitrary noun and get it to predict for us what the neural representation for that noun would be. At least that would be better than a list. That would be a generative theory, or model.

And so we were interested in this. We worked on it for a while, and our first version looked like this.
It's a computational model that was trained. And once it's trained, it makes a prediction for any input word, like telephone, in two steps. Step one: given a word like telephone, it looks up the word telephone in a trillion words of text collected from the web and represents that word by a set of statistics about how telephone is used. In our case, statistics about which verbs co-occur with that noun. Then, in the second step, it uses that vector, which approximates the meaning of the input noun, as the basis for predicting, at each of 20,000 locations in the brain, how much activity there will be.

So let me push on that a little bit. I said that in step one we look up, for a word like celery, which verbs it occurs with. Well, here are the statistics that we get. This is normalized to be a vector of length 1. You can see that for celery the most common verb is eat, and taste is second most common, but celery doesn't occur very often with ride. On the other hand, airplane occurs a lot with ride, and not very much with manipulate or rub. So these are the verb statistics extracted from the web for two typical nouns. And step one of the model is just to collect those statistics for whatever noun we give it to make the prediction.

Step two is then to predict, at each location in the brain, what the neural activity will be there-- the fMRI activity-- as a function of those statistics we just collected. So for the word celery, we now know it occurs 0.84 with eat and 0.35 with the verb taste. We're now going to make a prediction for this voxel. In particular, the prediction for voxel v is the sum, over those 25 verbs that we're using, of how frequently verb i occurs with the input noun-- celery in this case-- times some coefficient that we have to learn from training. And this coefficient tells us how voxel v is influenced by co-occurrence with verb i. We have 25 verbs and 20,000 voxels, so we have 500,000 of these coefficients to learn.
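In symbols, the predicted activity at voxel v for noun w is: activity(v) = sum over i = 1..25 of f_i(w) * c_{v,i}, where f_i(w) is the co-occurrence of noun w with verb i and c_{v,i} is a learned coefficient. A minimal sketch of just the prediction step, with made-up numbers:

```python
import numpy as np

# f: normalized co-occurrence of the input noun with 25 verbs
#    (e.g., for "celery": eat = 0.84, taste = 0.35, ride ~ 0, ...).
# C: learned coefficients, one row per verb, one column per voxel;
#    C[i, v] says how strongly verb i drives activity at voxel v.
rng = np.random.default_rng(0)
f = rng.random(25)
f /= np.linalg.norm(f)              # length-1 vector, as in the talk
C = rng.normal(size=(25, 20_000))   # 25 x 20,000 = 500,000 coefficients

predicted_image = f @ C             # predicted activity at every voxel
print(predicted_image.shape)        # (20000,)
```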
We learn them by taking nouns and collecting the brain images-- the same data we used to train those classifiers. So we have a collection of nouns and the corresponding brain images. For each of those nouns we can look up the verb statistics, and then we can train on that data to estimate all these half million coefficients.

When you put the coefficients together, say for eat, you can plot the coefficient values. Here's one of those coefficients for the verb eat, in a particular voxel right there. So you can think of the coefficients associated with each verb as forming a kind of activity map for that verb. And a weighted linear sum of those verb-associated activity maps gives us a prediction for celery.

You could ask, how well do these predictions work? One way I can answer that is to show you what happened when we trained on 58 other nouns, not including celery, not including airplane, and then had the system predict these words that were novel to it. For celery, it predicted this image. For airplane, it predicted this image. Unbeknownst to it, here are the actual observed images for celery and airplane. So you can see it correctly predicts some of this structure-- this is, by the way, fusiform gyrus-- but not all of the structure. So it captures some of what's going on.

I can tell you in a more quantitative way how well it's working. We can test the program this way. We can say: here are two words you have not seen; here are two images you have not seen. One of them is celery, one is airplane. You, the program, tell me which is which. If it were just working at chance, it would get an accuracy of 50%-- if you guess randomly, you'll get half of those right by chance. In its case, averaged over nine different subjects in the experiment, we get 79% accuracy.
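Here is a hedged sketch of that whole train-and-test loop: fit the coefficients on 58 nouns with a regularized linear regression, predict images for the two held-out nouns, and match predictions to observations. Ridge regression and cosine similarity are assumptions for illustration; the actual study's regression and matching score may differ.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Hypothetical data: 60 nouns, each with a 25-verb feature vector and an
# observed fMRI image (synthetic stand-ins for the real data).
F = rng.random((60, 25))            # verb co-occurrence features
Y = rng.normal(size=(60, 20_000))   # observed brain images

# Hold out two nouns ("celery" and "airplane", say); train on the other 58.
held_out, train = np.arange(2), np.arange(2, 60)
model = Ridge(alpha=1.0).fit(F[train], Y[train])   # fits all 500k coefficients
pred = model.predict(F[held_out])                  # two predicted images

# Two-alternative test: which predicted image matches which observed image?
def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

same = cos(pred[0], Y[0]) + cos(pred[1], Y[1])
swapped = cos(pred[0], Y[1]) + cos(pred[1], Y[0])
print("correct" if same > swapped else "wrong")    # chance = 50%
```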
So what does this mean? It means that, three times out of four-- 79%-- we could give this trained model two new nouns that it had never seen and two fMRI images for those nouns, and it could tell us which was which. So this model is extrapolating beyond the words on which it was trained. And it's extrapolating, not perfectly, but somewhat successfully, to other nouns.

Now, why? What's the basis on which it's doing that extrapolation? What are the assumptions built into this model? Well, for one thing, it's assuming that you can predict the neural representation of any word based on corpus statistics summarizing how that word is used on the web. Furthermore, it's assuming that any noun you can think of has a neural representation which lives in a 25-dimensional vector space, where each dimension corresponds to one of those 25 verbs. Every image is some point in this 25-dimensional vector space. That's what that linear equation is doing when it combines a weighted combination of these 25 axes to predict the image.

Now, I don't actually believe that everything you think lives in a 25-dimensional space where the dimensions are those verbs. But the interesting thing is that the model works. And so it does mean that there is some more primitive set of meaning components out of which these neural patterns are being constructed. It's not just a big hash code where every word gets its own pattern. If that were the case, we wouldn't be able to extrapolate and predict new ones by adding together these 25 different components. So patterns are being built up out of more primitive semantic components. And this model is crudely-- only 79%-- capturing some of that substructure that gets combined when you think about an entire word. And the substructure is the different meaning components.

The point here, I think, is that this is a model that's different from training a classifier. This is actually a generative model.
It can make predictions that extrapolate beyond the words on which it was trained. It assumes that there is a space of semantic primitives out of which the patterns of neural activity are built. And it assumes that that space is at least spanned by the corpus statistics of the noun.

Since then, we've extended this work, and we no longer use just that list of 25 verbs. We now use a very high-dimensional vector-- on the order of 100 million dimensions, and generally very sparse-- where every feature comes from a much more precise parse of text on the web. When I say parse, I mean that if we have a simple sentence like "He booked a ticket," this would be a dependency parse. It shows, for example, that booked is a verb whose subject is he and whose direct object is ticket. And now each of these edges in the parse becomes a feature in our new representation of the word (a small sketch of extracting such features appears below). So instead of using verbs, we use dependency parse features. This actually increases the accuracy of our former model slightly, from 79% up a little bit. But importantly, it also lets us work with all parts of speech. So now we're not restricted to just using nouns; we can use these dependency parse vectors for adjectives and all parts of speech. In terms of broadening the model to handle different types of words, this is helpful.
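Here's that small parsing sketch, using spaCy as one example dependency parser (an assumption for illustration; the actual work doesn't necessarily use spaCy). Each printed edge would become one feature, and counting such edges over a web-scale corpus yields the sparse, very high-dimensional vectors described above.

```python
import spacy  # assumes: pip install spacy && python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")
doc = nlp("He booked a ticket.")

# Each (head, relation, dependent) edge in the parse is one candidate feature.
for tok in doc:
    if tok.dep_ != "ROOT":
        print(f"{tok.head.lemma_} --{tok.dep_}--> {tok.lemma_}")
# e.g.  book --nsubj--> he   and   book --dobj--> ticket
```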
So at this point you could say, well, this is kind of interesting, because what have we seen? I think the main points so far are: gee, different people have very similar patterns of neural activity that their brains use to encode meaning. Furthermore, those patterns of neural activity decompose into more primitive semantic components. And we can train models that extrapolate to new words on which they weren't trained, by learning those more primitive semantic components and how to combine them for novel words based on corpus statistics. So that's kind of interesting.

But everything that I've said so far is really about the static spatial distribution of neural activity that encodes these things. Now, in truth, your neural activity is not just one little snapshot. When you understand a word-- do you know how long it takes you to understand a word? About 400 milliseconds. It takes about 400 milliseconds to understand a word. Well, it turns out there are interesting brain activity dynamics during those 400 milliseconds. Let me show you.

Up till now, we were looking at fMRI data. But here's some magnetoencephalography data, and this data has a time resolution of one millisecond. So I'll show you this movie, which begins 20 milliseconds before a word appears on the screen. In this case, the word is hand, and this brain is about to read the word hand. You'll see 550 milliseconds of brain activity. I'll read out the numbers so you can just watch the activity over here. So here we go: 20 milliseconds before the word appears on the screen. 0, 100, 200 milliseconds, 300, 400 milliseconds, 500.

OK, so it wasn't a static snapshot of activity. Your brain is doing a lot of things. There's a lot of dynamism during that 400 milliseconds that you're reading the word. fMRI captures an image about once a second, but because of the blood-oxygen-level-dependent mechanism that it uses to capture it, the signal is kind of smeared out over time. So we can't see these dynamics with fMRI, but with MEG we can.

And so now we can ask all kinds of interesting questions, like: what was the information encoded in that movie we just saw? I just showed you a movie of neural activity, but I want a movie of data flow in the brain. I want the movie showing me what information is encoded over time. Given this data, what could we do? Well, here's one thing we can do. In fact, Gus Sudre did this for his PhD thesis.
He said: I want to know what information is flowing around the brain there, so I'm going to train roughly a million different classifiers. I'll train classifiers that look at just 100 milliseconds' worth of that movie, and at just one of 70 or so anatomically defined brain regions. And I'll use a set of features-- he wasn't using our verbs anymore. He was using a set of 218 features that we had made up manually and that were inspired by the game 20 questions. These were features of the word, not like "how often does it co-occur with the verb eat?" but instead features like: would you eat it, yes or no? Is it bigger than a bread box, yes or no? And so forth. He had a set of 218 questions like that, and every word could be described by a set of 218 answers to those questions, analogous to the verbs.

So what Gus did is, for every one of those 218 features, like "is it bigger than a breadbox," he trained a classifier to try to decode the value of that feature for the word you're reading, from just 100 milliseconds' worth of this movie, looking at just one of 70 anatomically defined regions. When he did that, he ended up being able to make us a movie of what information is coded, in which part of the brain, when. He ran this every 50 milliseconds: he'd move forward and use a 100 millisecond window starting there.
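A hedged sketch of that sweep, for a single semantic feature: train a cross-validated classifier on every (region, 100 ms window) pair and record its accuracy. Repeating this over all 218 features gives the full sweep; the data shapes and the choice of logistic regression are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in: 300 word presentations x 70 regions x 550 ms of MEG.
n_words, n_regions, n_ms = 300, 70, 550
meg = rng.normal(size=(n_words, n_regions, n_ms))
is_big = rng.integers(0, 2, size=n_words)   # "is it bigger than a breadbox?"

decodability = {}
for region in range(n_regions):
    for start in range(0, n_ms - 100 + 1, 50):   # slide the window by 50 ms
        X = meg[:, region, start:start + 100]    # one region, one 100 ms window
        clf = LogisticRegression(max_iter=1000)
        acc = cross_val_score(clf, X, is_big, cv=5).mean()
        decodability[(region, start)] = acc      # one pixel of the data-flow movie
```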
493 00:28:21,000 --> 00:28:24,330 At 150 milliseconds, at 200 milliseconds, 494 00:28:24,330 --> 00:28:26,880 you got the first semantic feature. 495 00:28:26,880 --> 00:28:27,510 Is it hairy? 496 00:28:30,240 --> 00:28:36,210 I think this is actually a stand-in for, is it alive? 497 00:28:36,210 --> 00:28:40,230 But the feature he happened to uncover was, is it hairy? 498 00:28:40,230 --> 00:28:43,140 At 200 milliseconds. 499 00:28:43,140 --> 00:28:46,800 At 250, now we start to see more semantic features. 500 00:28:46,800 --> 00:28:52,410 300, 350, 400, 450. 501 00:28:52,410 --> 00:29:04,310 So literally, these are the semantic features trickling 502 00:29:04,310 --> 00:29:07,610 in over time during this 500 milliseconds-- 503 00:29:07,610 --> 00:29:09,740 that's the movie-- 504 00:29:09,740 --> 00:29:12,860 that corresponds to the neural activity 505 00:29:12,860 --> 00:29:16,800 that I showed you in that first movie. 506 00:29:16,800 --> 00:29:23,090 So this is a kind of data flow picture of what information 507 00:29:23,090 --> 00:29:27,560 is flowing around in the brain in that neural activity 508 00:29:27,560 --> 00:29:32,700 during that 450 milliseconds so far. 509 00:29:32,700 --> 00:29:33,840 Here's the set. 510 00:29:33,840 --> 00:29:38,810 Out of those 218 questions, here are the 20 most decodable 511 00:29:38,810 --> 00:29:41,610 features. 512 00:29:41,610 --> 00:29:44,670 So the number one feature that's most decodable, 513 00:29:44,670 --> 00:29:47,670 is that bigger than a loaf of bread? 514 00:29:47,670 --> 00:29:49,740 But actually, if you look at those questions, 515 00:29:49,740 --> 00:29:52,440 you see many of the most incredible ones 516 00:29:52,440 --> 00:29:55,150 are really size. 517 00:29:55,150 --> 00:29:58,960 And many of the next are manipulability. 518 00:29:58,960 --> 00:30:00,820 And many others are animacy. 519 00:30:00,820 --> 00:30:04,110 And some are shelter. 520 00:30:04,110 --> 00:30:10,380 In fact, we've across a diverse set of experiments 521 00:30:10,380 --> 00:30:12,390 keep seeing these kind of features. 522 00:30:12,390 --> 00:30:16,620 Size, manipulability, animacy, shelter, 523 00:30:16,620 --> 00:30:26,620 edibility are recurring as features that have their own-- 524 00:30:26,620 --> 00:30:30,600 they seem to be kind of naturally some 525 00:30:30,600 --> 00:30:33,540 of the primitive components. 526 00:30:33,540 --> 00:30:35,460 And they have their corresponding neural 527 00:30:35,460 --> 00:30:40,740 signatures, out of which the encoding of the full word 528 00:30:40,740 --> 00:30:42,480 is built. 529 00:30:42,480 --> 00:30:44,340 So if you ask me right now, what's 530 00:30:44,340 --> 00:30:47,910 my best guess of what are the semantic primitives out 531 00:30:47,910 --> 00:30:50,280 of which the neural codes are built, I'd say, 532 00:30:50,280 --> 00:30:51,280 I don't really know. 533 00:30:51,280 --> 00:30:56,340 But these features plus edibility, for example, 534 00:30:56,340 --> 00:30:58,890 keep recurring in what we're seeing. 535 00:30:58,890 --> 00:31:01,050 And they have their own spatial regions 536 00:31:01,050 --> 00:31:05,000 where the codes seem to live. 537 00:31:05,000 --> 00:31:10,410 OK, so I want to get to the final part, which 538 00:31:10,410 --> 00:31:14,790 is, so far we've talked about just single words. 539 00:31:14,790 --> 00:31:17,370 And there's plenty of interesting questions 540 00:31:17,370 --> 00:31:18,850 we can ask about single words. 
But really, language is about multiple words. So I want to show you a couple of examples of some more recent work where we've been looking at semantic composition with adjective-noun phrases. This is the work of Alona Fyshe. What she did is present people with simple adjective-noun sequences. She put an adjective on the screen, like tasty, left it there for half a second, then a noun, like tomato. And she was interested in the question of where and when the neural encoding of these two words shows up, and what that encoding looks like.

So I'll show you a couple of things. One is, here is a picture of the classifier weights that were learned to decode the adjective. You have to think of it this way. Here's time. This is the first 500 milliseconds, when the adjective is on the screen. Then there's 300 milliseconds of dead air. Then 500 milliseconds when the noun is on the screen, and then more dead air. The vertical axis is different locations in the sensor helmet of the MEG scanner-- there are about 306 of those. The intensity is showing the weight of a trained classifier that was trained to decode the adjective. And, in fact, this is the pattern of activity associated with the adjective gentle-- like gentle bear.

And so what you see here is that there is neural activity out here, when the noun is on the screen, long after the adjective has disappeared, that's quite relevant to decoding what the adjective was. This is just a quick look, but you can see that if I say tasty tomato, even while you're reading the word tomato there's neural activity that encodes what the adjective had been. And we can see that it's in fact a different pattern of neural activity than was here when the adjective was on.
And in fact, one thing that Alona got interested in is, given that you can decode the adjective across time, is your brain using the same neural encoding across time? Or is it a different neural encoding, maybe for different purposes, across time?

Let me explain what she did. She trained a classifier at one time in this time series of adjective-noun, and then she would test it at some other time point. If you can train at this time-- let's say, right when the adjective comes on the screen-- and use that classifier successfully to decode the adjective way down here, when the noun is on the screen, then we know that it's the same neural encoding, because that's what the classifier is exploiting.

And then she made a two-dimensional plot, where you plot the time at which you trained the classifier on the vertical axis and the time at which you test it on the horizontal axis. Then we can show, for each pair of training and test times, whether you can train at this time and then decode at that time. And that will tell us whether there's a stable neural encoding of the adjective meaning across time.

When she did that, here's what it looks like. On the vertical axis we have the time at which she trained: the adjective is on the screen for the first 500 milliseconds, then the noun. The horizontal axis is the time at which she tried to use any of those trained classifiers to decode the adjective; again, here's when the adjective's on the screen, then the noun. All this intense stuff means high decoding accuracy. What you see is that if you train when the adjective is on the screen, you can use that to decode at other times when the adjective's on the screen. That's good-- so we can decode adjectives. But if you try to use it to decode the adjective when the noun's on the screen, it fails.
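Here is a hedged sketch of that train-at-one-time, test-at-another analysis (often called temporal generalization) on synthetic stand-in data; the simple half/half split and logistic regression are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 trials x 306 MEG sensors x 26 time bins
# (50 ms bins spanning adjective, dead air, then noun).
n_trials, n_sensors, n_times = 200, 306, 26
meg = rng.normal(size=(n_trials, n_sensors, n_times))
adjective = rng.integers(0, 2, size=n_trials)   # e.g. tasty vs. gentle

half = n_trials // 2                 # simple train/test split over trials
G = np.zeros((n_times, n_times))     # the train-time x test-time accuracy map
for t_train in range(n_times):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(meg[:half, :, t_train], adjective[:half])
    for t_test in range(n_times):
        # Accuracy of the t_train decoder applied at t_test: a stable
        # encoding shows up as high accuracy off the diagonal.
        G[t_train, t_test] = clf.score(meg[half:, :, t_test], adjective[half:])
```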
Blue means failure-- no statistically significant decoding accuracy. On the other hand, when the noun is on the screen, if you train using the neural patterns from when the noun is on the screen, then you can in fact decode what the adjective had been. So it's like there are two different encodings of the adjective being used here. One, present when the adjective's on the screen, lets you successfully decode the adjective while it's on the screen, but doesn't work when the noun's on the screen. And then there's a second neural encoding that you can use to decode what the adjective had been while the noun is on the screen.

And then, interestingly, there's also this other region here, which says that if you train when the adjective was on the screen, you can't use that to successfully decode when the noun's on the screen-- but later on, when nothing is on the screen and the phrase is gone, your brain is still thinking about the adjective in a way that uses that very first neural encoding. This is evidence that the neural encoding of the adjective that was present when you saw the adjective is re-emerging a couple of seconds later, after the phrase is off the screen. But the neural encoding of the adjective from when the noun was on the screen doesn't seem to get used again.

Most recently, we've also been looking at stories and passages. Much of this, not all of it, is the work of Leila Wehbe, another PhD student. And here's what she did. She put people in fMRI and in MEG scanners, and she showed them the following kind of stimulus. This goes on for about 40 minutes: one chapter of a Harry Potter story, presented word by word, every 500 milliseconds, so we know exactly when you've seen every word. She collected this data in fMRI and in MEG to try to study the jumble of activity that goes on in your brain when you're reading not an isolated word, but a whole story. For her, with the fMRI we get an image every two seconds. So four words go by and we get an fMRI image.
So here's the kind of data that she had. She trained a model that's very analogous to the first generative model I talked about, where we would input a word, code it with verbs, and then use that to predict neural activity. In her case, she took an approach where, for every word, she would encode that word with a big feature vector. That vector could summarize both the meaning of the individual word and other features that capture the context, the various properties of the story at that point in time. The general framework was to convert the time series of words into a time series of feature vectors that capture individual word meanings plus story content at that time, and then to use that to predict the fMRI and MEG activity.

Here are some of the kinds of features she ended up using. Some of them were motions of the characters: was somebody flying (this was the Harry Potter story), manipulating something, moving, physically colliding? What emotions were being experienced by the characters in the story that you're focused on at this point in time? What were the parts of speech of the different words, and other syntactic features? What was the semantic content? We also used the dependency parse statistics that I mentioned, which capture the semantics of individual words. So altogether she had a feature vector with about 200 features, some manually annotated, some captured by corpus statistics. For every word in the story we then had this feature vector.

Then she trained a model that literally would take as input a sequence of words, convert it into the feature sequence, and then, using the trained regression, predict the time series of brain activity from those feature vectors.
This allowed her to then test, analogous to what we did with our single-word generative model, whether the model had learned well enough that we could give it two different passages, plus one real time series of observed data, and ask it to tell us which passage this person was reading. And these would be novel passages that were not part of the training data. She found that it was, in fact, possible-- imperfectly, but three times out of four-- to take two passages which had never been seen in training, and a time series of neural activity never seen during training, and tell us which of the two passages the activity corresponds to. So it's capturing some of the structure here.

Interestingly, as a side effect of that, you end up with a map of different cortical regions and which of these 200 features are encoded in each of them. So from one analysis of people reading this very complicated, complex story, we end up-- you can go [AUDIO OUT] features and color code them. Some of them have to do with syntax, like part of speech and sentence length. Some have to do with dialogue; some have to do with visual properties or characters in the stories. And you can see here a map of where those different types of information were decodable from the neural activity.

Interestingly, here is a slightly earlier piece of work from Ev Fedorenko, showing where there is neural activity that's selectively associated with language processing. The difference is that in Leila's work, she was also able to indicate not just where the activity was, but what information is encoded there.

And then again, you can drill down on some of these. If you want to know more about syntax, we can look at the different syntax features and see: where's the part of speech encoded? What about the length of the sentence?
What about the specific dependency role, in the parse, of the word that we're reading right now? And so forth.

So this gives us a way of starting to look simultaneously at very complex cognitive function, right? You're reading a story; you're perceiving the words; you're figuring out the parts of speech; you're parsing the sentence. You're thinking about the plot and fitting this into the plot. You're feeling sorry for the hero who just had their broom stolen. All kinds of stuff is going on in your head. Here's an analysis that attempts to simultaneously analyze a diverse range of these features, and I think with some success.

There still remain problems of correlations between different features. It might be hard to know whether we're decoding the fact that somebody is being shouted at versus the fact that their ears are hurting, so to speak. There can be two different properties we're thinking of that are highly correlated, and it can still be hard to tease those apart.

But to me, the interesting thing about Leila's analysis is that it flips away from a style that I would call reductionist. One way that people often study language in the brain is to pick one phenomenon and then run a carefully controlled experiment to vary just that one dimension. Like: we'll use words, and we'll use pronounceable letter strings that are not words, and we'll just look at what's different in those two almost identical situations. Here, instead, we have people doing natural reading-- a complex cognitive function-- and we try to use a multivariate analysis to simultaneously model all of those different functions. So I think this is an interesting methodological position to take. And it also gives us a chance to start looking at some of these phenomena in story reading.