The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

BROWN WESTRICK: Good morning. My name is Brown Westrick, and I'm going to be talking to you about the speech synthesis project.

Our main goal for the speech synthesis project was to create simulated speech using a model of the vocal tract in which we model the flow of air over time.

There's existing software called Gnuspeech that already does this. We wanted to port it to Cell and then improve the speech quality by using the additional computational cycles that it would afford us.

So again, Gnuspeech was originally developed for linguistics research, but it is now available for free under the GNU General Public License. It already models airflow in the vocal tract in real time, which means there are no pre-recorded sounds. Many speech synthesizers nowadays have very large dictionaries of sounds that they piece together and then try to smooth the transitions between. This model, however, attempts to do what the vocal tract is actually doing instead of just imitating the end result.

The quality of the speech from this synthesizer, as it exists today, is not as high as that of current synthesizers that use recorded libraries. But it has the potential to be much better, because you have so much finer control over all the different parameters.

Our goal was to take this software, which already produces acceptable speech in real time, and have it take advantage of the additional computational power of Cell to get an increase in speech quality. And now I will hand it over to Drew, who will tell you about the Gnuspeech system.

DREW ALTSCHUL: So Gnuspeech is made up of three major parts. The first is just called the Gnuspeech engine.
The second, which is probably the largest part of it, is called Monet. And then there is the tube resonance model, which is the final part that actually outputs the sound. And as [INAUDIBLE], the basic process is that you take a text input, a standard string, and the Gnuspeech engine transforms it into basic phonetic information. Monet then takes that phonetic string and eventually converts it into what we call vocal tract parameters, which are parameters that can be sent to the tube resonance model. Those parameters define exactly how this tube, which represents the throat and the nasal tract, changes over time to represent speech sounds. With those parameters, you can send a signal through the tube and create a voice.

So the first part of the example will take a perfectly normal string, like "all your base are belong to us," and transform it into what we call its phonetic format. You can see it highlighted: the actual sounds are highlighted, and various markers are also included in the output string, like /w, from which you can determine where the words are, and [INAUDIBLE] which determine where sentences and various phrases end. Basically, Gnuspeech makes use of dictionary files as well as some basic linguistic models in order to create this phonetic output from the basic input string.

Having created that phonetic representation, you can then send it to Monet, which is by far the largest part of the program. Monet in turn takes the phonetic information and, as I said, uses a diphone file, which covers a very large range of sounds and characters, to transform these phonetics into direct parameters that represent how the throat and the entire nasal tract change as you voice your own speech. So Monet has to go through a long process of calculating these phrases given the rhythm and the intonation of the phrase that it's being given.
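The three-stage dataflow just described can be pictured as follows. This is a hypothetical outline in C with made-up type and function names and stubbed-out bodies, not Gnuspeech's actual API; the real system is a large Objective-C code base.

```c
/* Hypothetical outline of the Gnuspeech dataflow: text -> phonetics ->
 * vocal tract parameters -> audio samples.  Names and bodies are stubs. */
#include <stdio.h>

typedef struct { char phones[256]; } Phonetics;        /* e.g. "/w ... # ..." markers */
typedef struct { float radius[8]; float f0; } Posture;  /* one vocal tract shape       */

/* Stage 1: Gnuspeech engine - dictionary lookups plus letter-to-sound rules. */
static Phonetics text_to_phonetics(const char *text) {
    Phonetics p;
    snprintf(p.phones, sizeof p.phones, "[phonetics of: %s]", text);  /* stub */
    return p;
}

/* Stage 2: Monet - diphone rules turn the phonetics into timed postures. */
static int phonetics_to_postures(const Phonetics *ph, Posture *out, int max) {
    (void)ph;
    int n = (4 < max) ? 4 : max;                      /* stub: placeholder postures */
    for (int i = 0; i < n; i++)
        out[i] = (Posture){{1, 1, 1, 1, 1, 1, 1, 1}, 110.0f};
    return n;
}

/* Stage 3: tube resonance model - postures drive the 8-section tube and
 * produce audio samples.  Stubbed out here. */
static int postures_to_samples(const Posture *p, int n, float *samples, int max) {
    (void)p; (void)samples; (void)max;
    return n * 100;                                   /* stub sample count */
}

int main(void) {
    Posture postures[16];
    float samples[4096];
    Phonetics ph = text_to_phonetics("all your base are belong to us");
    int np = phonetics_to_postures(&ph, postures, 16);
    int ns = postures_to_samples(postures, np, samples, 4096);
    printf("%s -> %d postures -> %d samples\n", ph.phones, np, ns);
    return 0;
}
```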
Another very important part of the Monet process involves the postures, which is what we call its output. Monet looks at the phrase, examines the sounds piece by piece, and recognizes that as the postures of the throat change, important transitions happen between them. The result is a gradual change between postures rather than a sudden change in the actual shape of the tract.

Having output the postures, you finally send them to the Tube Resonance Model, or TRM. In this model the vocal tract is divided into eight sections, and a signal derived from a sine wave is sent through the tube. All the changes that occur over time, as the postures change the width of the tube at various points, then cause different speech sounds to come out and produce an actual speech pattern that is usually recognizable.

So basically you have these three parts, from a basic string to phonetics to throat postures, until finally you get the actual speech out. Now I'm handing it over to Joyce to talk a little bit about the resources and algorithms.

JOYCE CHEN: Well, before I talk about the resources and algorithms, I'll talk a little bit about the TRM, the tube resonance model. We already talked about how Monet outputs tube parameters based on transitions between different words and postures and so on. The tube resonance model actually simulates the physics of the vocal tract. First you have a glottal source. If you have done any linguistics, you might have heard the little clicking sound the glottis makes. There are different ways to simulate the glottal source. Ideally, the way to get a good, natural glottal source is to simulate the physics of two oscillating masses as air passes between them. But back in the days when people were doing the original speech research behind Gnuspeech, actually simulating the physics of the glottis was not possible.
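As an illustration of the gradual posture-to-posture transitions just described, here is a minimal sketch in C that linearly interpolates the radii of the eight tube sections between two postures. The radius values are made up, and the straight linear ramp is a simplification for illustration rather than Monet's actual transition rules.

```c
#include <stdio.h>

#define TUBE_SECTIONS 8

/* One "posture": a target shape for the vocal tract, given as the radius
 * of each of the eight tube sections.  Values below are made up. */
typedef struct { float radius[TUBE_SECTIONS]; } Posture;

/* Blend two postures.  t runs from 0.0 (entirely 'from') to 1.0 (entirely
 * 'to').  Monet's real rules are not simply linear, but the goal is the
 * same: a smooth change in tube shape, never a jump. */
static Posture blend_postures(const Posture *from, const Posture *to, float t) {
    Posture p;
    for (int i = 0; i < TUBE_SECTIONS; i++)
        p.radius[i] = (1.0f - t) * from->radius[i] + t * to->radius[i];
    return p;
}

int main(void) {
    Posture ah = {{1.0f, 1.2f, 1.5f, 1.8f, 1.6f, 1.3f, 1.1f, 1.0f}};
    Posture ee = {{1.0f, 0.9f, 0.7f, 0.5f, 0.6f, 0.9f, 1.2f, 1.3f}};
    for (int step = 0; step <= 4; step++) {
        Posture p = blend_postures(&ah, &ee, step / 4.0f);
        printf("step %d: first section radius %.2f\n", step, p.radius[0]);
    }
    return 0;
}
```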
Since that wasn't feasible, what they did instead was, you know, try a half sine wave, or research the most natural-sounding glottal pulse shape, initialize a wave table with it, and do table lookups on it, updating it with the amplitude and so on to change it little by little. So one of our goals was to possibly harness the additional computational power to make more natural-sounding speech. And now I'll talk about allocating the resources.

In the Gnuspeech engine and in Monet, there is not as much computation. Monet has a lot of rules. For example, between postures, like different shapes of the vocal tract, you can't just do a linear interpolation to change smoothly. There are different rules for each transition between postures, and they greatly affect the speech. This was much harder to improve on, that is, to parallelize.

Then there is the tube resonance model, which had a lot more computation. In fact, the thing that took probably the most computation was that after we got our signal data from the mouth end of the simulation, we had to upsample or downsample it, and that was something with a lot of potential to be parallelized. However, when you are simulating the tube resonance model, you can only update the signal inside the vocal tract incrementally. If you were to break it up, there was a possibility of a lot of pops in between when you try to splice the pieces back together. We thought about trying to resolve that with interpolation between the segments.

There were nested loops. The main synthesis routine had nested loops: you have a posture, and then you simulate on the posture and between the postures. That took the most computation, along with updating the glottal wave table. All right. Now I will hand it off to Omari to explain the challenges.
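The wavetable-based glottal source described above corresponds roughly to the following sketch: fill a table once with a pulse shape (a half sine here), then read through it at a rate set by the fundamental frequency, scaling by the amplitude. The table size, sample rate, and pulse shape are illustrative rather than the values Gnuspeech actually uses.

```c
#include <math.h>
#include <stdio.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define TABLE_SIZE 512
#define SAMPLE_RATE 22050.0

static float glottal_table[TABLE_SIZE];

/* Fill the table once with a half-sine pulse: the first half of the period
 * is the open phase of the glottis, the second half is silence.  Gnuspeech's
 * researched pulse shape is more refined than this. */
static void init_glottal_table(void) {
    for (int i = 0; i < TABLE_SIZE; i++)
        glottal_table[i] = (i < TABLE_SIZE / 2)
            ? (float)sin(M_PI * i / (TABLE_SIZE / 2.0))
            : 0.0f;
}

/* One sample of the glottal source by table lookup.  The phase advances by
 * f0 * TABLE_SIZE / SAMPLE_RATE positions per output sample, so a higher f0
 * sweeps the table faster and raises the pitch; amplitude is applied on read. */
static float glottal_sample(double *phase, double f0, float amplitude) {
    float s = amplitude * glottal_table[(int)*phase % TABLE_SIZE];
    *phase += f0 * TABLE_SIZE / SAMPLE_RATE;
    if (*phase >= TABLE_SIZE) *phase -= TABLE_SIZE;
    return s;
}

int main(void) {
    double phase = 0.0;
    init_glottal_table();
    for (int i = 0; i < 5; i++)              /* a few samples of a 110 Hz source */
        printf("%f\n", glottal_sample(&phase, 110.0, 1.0f));
    return 0;
}
```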
SPEAKER 5: So we mostly tried to focus on parallelizing the TRM algorithm, because both the Gnuspeech engine and Monet are almost entirely dictionary lookups involving large amounts of memory with not that much computation, so there wasn't really much potential for parallelizing those. We looked at the tasks being done in the TRM and profiled them. You can see that what took most of the time was the noise generator part, that is, the glottal source signal being fed into the tubes, and the actual updates of where the tube sections are supposed to be as they shift. Unfortunately, each iteration of the main update loop was very, very fast, about 15 microseconds. So it would be pretty difficult to update, for example, several SPUs as fast as we needed to, considering how communication costs affect them.

So parallelism was not very simple to exploit. Our main, original idea for exploiting parallelism was to make a pipeline out of the various parts of the TRM model, maybe using three or four SPUs, one for each part of the throat, as a sound would go from one to the next. That way all of them would be engaged simultaneously, going from one posture to the next in a linear fashion. But unfortunately, the timing for this was very, very fast, on the order of about 70 kilohertz, which is far too many times a second for SPUs to be transferring data back and forth to each other with mailboxes and memory (a rough budget calculation appears in the sketch below). So that was somewhat difficult.

OMARI: Unfortunately, with this project we faced a number of challenges, the first and foremost being that Gnuspeech is written in a programming language most of us weren't familiar with. And it's huge: Monet, for example, is 30,000 lines, and it's hardly documented. It took a fair amount of time just reading through and figuring out what was going on.
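The timing constraint mentioned above is easy to put in numbers. The sketch below is a back-of-the-envelope calculation in C; the per-hand-off cost and stage count are assumed figures for illustration, not measured Cell numbers.

```c
#include <stdio.h>

int main(void) {
    /* Internal tube update rate quoted in the talk (~70 kHz), and an assumed
     * round-trip cost for a mailbox/DMA hand-off between SPUs.  The 5 us
     * figure and the four stages are purely illustrative. */
    double update_rate_hz  = 70000.0;
    double handoff_cost_us = 5.0;
    int    pipeline_stages = 4;

    double budget_us = 1e6 / update_rate_hz;   /* time available per update, ~14.3 us */
    double comm_us   = pipeline_stages * handoff_cost_us;   /* ~20 us */

    printf("per-update budget: %.2f us\n", budget_us);
    printf("communication alone: %.2f us\n", comm_us);
    printf("communication uses %.0f%% of the budget\n", 100.0 * comm_us / budget_us);
    return 0;
}
```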
Another problem is that because Gnuspeech uses Gnustep, which is a GUI library, the calls are asynchronous, and that makes it tremendously difficult to debug as well. I had tried to convert part of it to C++ to try to get the tube running on one of the SPEs, and that took three days in and of itself, and I ended up having to toss that attempt out.

Another problem we had was dynamic pointer alignment. In Objective-C, most of the objects are stored behind dynamically allocated pointers, and there's no aligned malloc or anything of that nature. So we couldn't really transfer any of the objects from the Objective-C memory area to the SPUs, work on the data there, and then send them back (a common aligned-buffer workaround is sketched below).

So what is working now? We are able to take line-buffered text in the Gnuspeech engine and translate it to utterances, that is, phonetic pronunciations, and get to the point where we would execute the tube model. Unfortunately, a bug, potentially in Gnuspeech, is preventing us from properly executing the tube model right now. So that's one thing we're having problems with. Additionally, the tube currently runs on the PPE. We've been trying to get the tube to run on the SPE, but it's not going well, partly because of the dynamic pointer alignment issue and partly because of some other things we've run into.

What is currently not working? As Drew mentioned, there are a lot of dictionary lookups in the preprocessing stage of the pipeline. And there's a bug in Gnustep where it won't parse the dictionary if it's above a certain size. The dictionary has, I believe, 70,000 entries and takes up almost 3 megabytes. But if there are more than something like 3,000 entries in the dictionary, it just doesn't parse, and we have no idea why.

So to conclude, this was a tremendously difficult problem. There are a bunch of data dependencies, and the synchronization requirements are very, very tight. However, we feel that with more time and more experience with the code base, we would have been able to parallelize it.
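For the dynamic pointer alignment problem mentioned above, a common workaround on Cell is to copy the plain-old-data fields out of the objects into a separate buffer that is 16-byte aligned and padded to a multiple of 16 bytes, and transfer that buffer instead of the objects themselves. The sketch below shows the general idea in C; it is an illustration, not code from the project, and the field layout is made up.

```c
#include <stdio.h>
#include <string.h>
#include <stdalign.h>

/* Plain-old-data snapshot of the tube parameters we would want on the SPE:
 * no Objective-C object headers, no interior pointers.  Layout is illustrative. */
typedef struct {
    float radius[8];
    float glottal_amplitude;
    float f0;
    char  pad[8];          /* pad the size out to a multiple of 16 bytes */
} TubeParams;

int main(void) {
    /* DMA transfers on Cell in practice want 16-byte-aligned buffers whose
     * sizes are multiples of 16, which ordinary Objective-C allocations do
     * not guarantee, so the POD fields are copied into a buffer we control. */
    alignas(16) static TubeParams dma_buf;

    TubeParams snapshot = {{1, 1, 1, 1, 1, 1, 1, 1}, 0.5f, 110.0f, {0}};
    memcpy(&dma_buf, &snapshot, sizeof snapshot);

    printf("aligned copy at %p, %zu bytes\n", (void *)&dma_buf, sizeof dma_buf);
    return 0;
}
```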
Parallelization could almost certainly help the vocal quality in terms of naturalness: getting, for example, as Joyce mentioned, a higher quality glottal source, and better speaker identification and vowel identification. For example, when you pronounce different vowels, sometimes the quality of the glottal source changes. And lastly, it would be worth the time to rewrite the whole thing from scratch, skipping Gnustep, skipping Objective-C, and going with C++ or, most likely, C for the whole thing. Thank you. Any questions?

AUDIENCE: It sounds like you guys are taking a fine-grained approach in which you're splitting the application across different units. Since you're synthesizing completely independent words, let's say, could you just run the whole application on an SPU? There's engineering work there, but from a parallelization standpoint, can you just take the whole application and run it, for example, on different words?

OMARI: So I believe you're suggesting we run the tube on different SPEs and then feed data to the separate instances of the tube from the PPE?

AUDIENCE: Well, including the whole processing pipeline. I mean, on whole sentences, from one sentence to the next?

OMARI: The big stumbling block for us was that there isn't currently an Objective-C compiler for the SPEs, so we can't run the Objective-C code on the SPEs at all.

AUDIENCE: But if you did it from scratch, if you were to throw away all your Objective-C and start from scratch, would that be a better parallelization strategy than a fine-grained one?

OMARI: Possibly. One of the disadvantages of splitting up words is that there is [INAUDIBLE] continuous state that connects the different postures of the vocal tract all the way through the utterance.
And so we would have to do possibly some prediction, possibly some interpolation, to figure out how to connect the different, separate utterances that would have been produced consecutively.

AUDIENCE: Permitting one sentence at a time or something?

OMARI: Yeah. That might be an option, yes.
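The coarse-grained approach discussed in this exchange could look roughly like the following sketch: split the text into sentences, synthesize each one independently (in principle each on its own SPE), and then smooth the vocal tract parameters across the utterance boundary, as suggested above. The functions and constants here are hypothetical; this is an illustration of the idea, not something the project implemented.

```c
#include <stdio.h>

#define TUBE_SECTIONS 8
#define XFADE 4                    /* number of parameter frames to blend */

typedef struct { float radius[TUBE_SECTIONS]; } Frame;

/* Stand-in for running the whole text-to-parameters pipeline on one
 * sentence.  In the coarse-grained design, each call would run
 * independently, in principle on its own SPE. */
static int synthesize_sentence(const char *sentence, Frame *out, int max) {
    (void)sentence;
    int n = (16 < max) ? 16 : max;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < TUBE_SECTIONS; j++)
            out[i].radius[j] = 1.0f;            /* placeholder parameters */
    return n;
}

/* Ramp the first XFADE frames of 'next' away from the final frame of the
 * previous sentence, so the vocal tract shape changes gradually instead of
 * jumping at the utterance boundary. */
static void smooth_boundary(const Frame *prev_last, Frame *next) {
    for (int i = 0; i < XFADE; i++) {
        float t = (float)(i + 1) / (XFADE + 1);
        for (int j = 0; j < TUBE_SECTIONS; j++)
            next[i].radius[j] =
                (1.0f - t) * prev_last->radius[j] + t * next[i].radius[j];
    }
}

int main(void) {
    Frame s1[64], s2[64];
    int n1 = synthesize_sentence("All your base are belong to us.", s1, 64);
    int n2 = synthesize_sentence("This is the second sentence.",    s2, 64);
    smooth_boundary(&s1[n1 - 1], s2);
    printf("sentence 1: %d frames, sentence 2: %d frames\n", n1, n2);
    return 0;
}
```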