The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

BROWN WESTRICK: Good morning. My name is Brown Westrick, and I'm going to be talking to you about the speech synthesis project.

Our main goal for the speech synthesis project was to create simulated speech using a model of the vocal tract in which we model the flow of air over time.

There's existing software called Gnuspeech that already does this. We wanted to port it to Cell and then improve the speech quality by using the additional computational cycles that it would afford us.

So again, Gnuspeech was originally developed for linguistics research, but it is now available for free under the GNU General Public License. It already models airflow in the vocal tract in real time, which means there are no pre-recorded sounds. Many speech synthesizers nowadays have very large dictionaries of sounds that they piece together and then try to smooth the transitions between. This model, however, attempts to do what the vocal tract is actually doing instead of just imitating the end result.

The quality of the speech from this synthesizer, as it exists today, is not as high as that of current synthesizers that use recorded libraries. But it has the potential to be much better, because you have so much finer control over all the different parameters.

Our goal was to take this software, which already produces acceptable speech in real time, and have it take advantage of the additional computational power of Cell to get an increase in speech quality. And now I will hand it over to Drew, who will tell you about the Gnuspeech system.

DREW ALTSCHUL: So Gnuspeech is made up of three major parts. The first is just called the Gnuspeech engine.
The second, which is probably the largest part of it, is called Monet. And then there is the tube resonance model, which is the final part that actually outputs the sound. And as [INAUDIBLE], the basic process is that you take a text input, a standard string, and the Gnuspeech engine transforms it into basic phonetic information. Monet then takes that phonetic string and eventually converts it into what we call vocal tract parameters, which are parameters that can be sent to the tube resonance model. Those parameters define exactly how this tube, which represents the throat and the nasal tract, changes over time to represent speech sounds. With those parameters, you can send a signal through the tube and create a voice.

So the first part of the example will take a perfectly normal string, like "all your base are belong to us," and transform it into what we call its phonetic format. You can see it highlighted: the actual sounds are highlighted, and various markers are also included in the output string, like /w, from which you can determine where the words are, and [INAUDIBLE] which determine where sentences and various phrases end. Basically, Gnuspeech makes use of dictionary files as well as some basic linguistic models in order to create this phonetic output from the basic input string.

Having created that phonetic representation, you can then send it to Monet, which is by far the largest part of the program. Monet in turn takes the phonetic information and, as I said, uses a diphone file, which covers a very large range of sounds and characters, to transform these phonetics into direct parameters that represent how the throat and the entire nasal tract change as you voice your own speech. So Monet has to go through a long process of calculating these phrases given the rhythm and the intonation of the phrase that it's being given.
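The three-stage dataflow just described can be pictured as follows. This is a hypothetical outline in C with made-up type and function names and stubbed-out bodies, not Gnuspeech's actual API; the real system is a large Objective-C code base.

```c
/* Hypothetical outline of the Gnuspeech dataflow: text -> phonetics ->
 * vocal tract parameters -> audio samples.  Names and bodies are stubs. */
#include <stdio.h>

typedef struct { char phones[256]; } Phonetics;        /* e.g. "/w ... # ..." markers */
typedef struct { float radius[8]; float f0; } Posture;  /* one vocal tract shape       */

/* Stage 1: Gnuspeech engine - dictionary lookups plus letter-to-sound rules. */
static Phonetics text_to_phonetics(const char *text) {
    Phonetics p;
    snprintf(p.phones, sizeof p.phones, "[phonetics of: %s]", text);  /* stub */
    return p;
}

/* Stage 2: Monet - diphone rules turn the phonetics into timed postures. */
static int phonetics_to_postures(const Phonetics *ph, Posture *out, int max) {
    (void)ph;
    int n = (4 < max) ? 4 : max;                      /* stub: placeholder postures */
    for (int i = 0; i < n; i++)
        out[i] = (Posture){{1, 1, 1, 1, 1, 1, 1, 1}, 110.0f};
    return n;
}

/* Stage 3: tube resonance model - postures drive the 8-section tube and
 * produce audio samples.  Stubbed out here. */
static int postures_to_samples(const Posture *p, int n, float *samples, int max) {
    (void)p; (void)samples; (void)max;
    return n * 100;                                   /* stub sample count */
}

int main(void) {
    Posture postures[16];
    float samples[4096];
    Phonetics ph = text_to_phonetics("all your base are belong to us");
    int np = phonetics_to_postures(&ph, postures, 16);
    int ns = postures_to_samples(postures, np, samples, 4096);
    printf("%s -> %d postures -> %d samples\n", ph.phones, np, ns);
    return 0;
}
```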
Another very important part of the Monet process involves the postures, which is what we call its output. Monet looks at the phrase, examines the sounds piece by piece, and recognizes that as the postures of the throat change, important transitions happen between them. The result is a gradual change between postures rather than a sudden change in the actual shape of the tract.

Having output the postures, you finally send them to the Tube Resonance Model, or TRM. In this model the vocal tract is divided into eight sections, and a signal derived from a sine wave is sent through the tube. All the changes that occur over time, as the postures change the width of the tube at various points, then cause different speech sounds to come out and produce an actual speech pattern that is usually recognizable.

So basically you have these three parts, from a basic string to phonetics to throat postures, until finally you get the actual speech out. Now I'm handing it over to Joyce to talk a little bit about the resources and algorithms.

JOYCE CHEN: Well, before I talk about the resources and algorithms, I'll talk a little bit about the TRM, the tube resonance model. We already talked about how Monet outputs tube parameters based on transitions between different words and postures and so on. The tube resonance model actually simulates the physics of the vocal tract. First you have a glottal source. If you have done any linguistics, you might have heard the little clicking sound the glottis makes. There are different ways to simulate the glottal source. Ideally, the way to get a good, natural glottal source is to simulate the physics of two oscillating masses as air passes between them. But back in the days when people were doing the original speech research behind Gnuspeech, actually simulating the physics of the glottis was not possible.
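As an illustration of the gradual posture-to-posture transitions just described, here is a minimal sketch in C that linearly interpolates the radii of the eight tube sections between two postures. The radius values are made up, and the straight linear ramp is a simplification for illustration rather than Monet's actual transition rules.

```c
#include <stdio.h>

#define TUBE_SECTIONS 8

/* One "posture": a target shape for the vocal tract, given as the radius
 * of each of the eight tube sections.  Values below are made up. */
typedef struct { float radius[TUBE_SECTIONS]; } Posture;

/* Blend two postures.  t runs from 0.0 (entirely 'from') to 1.0 (entirely
 * 'to').  Monet's real rules are not simply linear, but the goal is the
 * same: a smooth change in tube shape, never a jump. */
static Posture blend_postures(const Posture *from, const Posture *to, float t) {
    Posture p;
    for (int i = 0; i < TUBE_SECTIONS; i++)
        p.radius[i] = (1.0f - t) * from->radius[i] + t * to->radius[i];
    return p;
}

int main(void) {
    Posture ah = {{1.0f, 1.2f, 1.5f, 1.8f, 1.6f, 1.3f, 1.1f, 1.0f}};
    Posture ee = {{1.0f, 0.9f, 0.7f, 0.5f, 0.6f, 0.9f, 1.2f, 1.3f}};
    for (int step = 0; step <= 4; step++) {
        Posture p = blend_postures(&ah, &ee, step / 4.0f);
        printf("step %d: first section radius %.2f\n", step, p.radius[0]);
    }
    return 0;
}
```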
Since that wasn't feasible, what they did instead was, you know, try a half sine wave, or research the most natural-sounding glottal pulse shape, initialize a wave table with it, and do table lookups on it, updating it with the amplitude and so on to change it little by little. So one of our goals was to possibly harness the additional computational power to make more natural-sounding speech. And now I'll talk about allocating the resources.

In the Gnuspeech engine and in Monet, there is not as much computation. Monet has a lot of rules. For example, between postures, like different shapes of the vocal tract, you can't just do a linear interpolation to change smoothly. There are different rules for each transition between postures, and they greatly affect the speech. This was much harder to improve on, that is, to parallelize.

Then there is the tube resonance model, which had a lot more computation. In fact, the thing that took probably the most computation was that after we got our signal data from the mouth end of the simulation, we had to upsample or downsample it, and that was something with a lot of potential to be parallelized. However, when you are simulating the tube resonance model, you can only update the signal inside the vocal tract incrementally. If you were to break it up, there was a possibility of a lot of pops in between when you try to splice the pieces back together. We thought about trying to resolve that with interpolation between the segments.

There were nested loops. The main synthesis routine had nested loops: you have a posture, and then you simulate on the posture and between the postures. That took the most computation, along with updating the glottal wave table. All right. Now I will hand it off to Omari to explain the challenges.
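The wavetable-based glottal source described above corresponds roughly to the following sketch: fill a table once with a pulse shape (a half sine here), then read through it at a rate set by the fundamental frequency, scaling by the amplitude. The table size, sample rate, and pulse shape are illustrative rather than the values Gnuspeech actually uses.

```c
#include <math.h>
#include <stdio.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define TABLE_SIZE 512
#define SAMPLE_RATE 22050.0

static float glottal_table[TABLE_SIZE];

/* Fill the table once with a half-sine pulse: the first half of the period
 * is the open phase of the glottis, the second half is silence.  Gnuspeech's
 * researched pulse shape is more refined than this. */
static void init_glottal_table(void) {
    for (int i = 0; i < TABLE_SIZE; i++)
        glottal_table[i] = (i < TABLE_SIZE / 2)
            ? (float)sin(M_PI * i / (TABLE_SIZE / 2.0))
            : 0.0f;
}

/* One sample of the glottal source by table lookup.  The phase advances by
 * f0 * TABLE_SIZE / SAMPLE_RATE positions per output sample, so a higher f0
 * sweeps the table faster and raises the pitch; amplitude is applied on read. */
static float glottal_sample(double *phase, double f0, float amplitude) {
    float s = amplitude * glottal_table[(int)*phase % TABLE_SIZE];
    *phase += f0 * TABLE_SIZE / SAMPLE_RATE;
    if (*phase >= TABLE_SIZE) *phase -= TABLE_SIZE;
    return s;
}

int main(void) {
    double phase = 0.0;
    init_glottal_table();
    for (int i = 0; i < 5; i++)              /* a few samples of a 110 Hz source */
        printf("%f\n", glottal_sample(&phase, 110.0, 1.0f));
    return 0;
}
```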
SPEAKER 5: So we mostly tried to focus on parallelizing the TRM algorithm, because both the Gnuspeech engine and Monet are almost entirely dictionary lookups involving large amounts of memory with not that much computation, so there wasn't really much potential for parallelizing those. We looked at the tasks being done in the TRM and profiled them. You can see that what took most of the time was the noise generator part, that is, the glottal source signal being fed into the tubes, and the actual updates of where the tube sections are supposed to be as they shift. Unfortunately, each iteration of the main update loop was very, very fast, about 15 microseconds. So it would be pretty difficult to update, for example, several SPUs as fast as we needed to, considering how communication costs affect them.

So parallelism was not very simple to exploit. Our main, original idea for exploiting parallelism was to make a pipeline out of the various parts of the TRM model, maybe using three or four SPUs, one for each part of the throat, as a sound would go from one to the next. That way all of them would be engaged simultaneously, going from one posture to the next in a linear fashion. But unfortunately, the timing for this was very, very fast, on the order of about 70 kilohertz, which is far too many times a second for SPUs to be transferring data back and forth to each other with mailboxes and memory (a rough budget calculation appears in the sketch below). So that was somewhat difficult.

OMARI: Unfortunately, with this project we faced a number of challenges, the first and foremost being that Gnuspeech is written in a programming language most of us weren't familiar with. And it's huge: Monet, for example, is 30,000 lines, and it's hardly documented. It took a fair amount of time just reading through and figuring out what was going on.
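The timing constraint mentioned above is easy to put in numbers. The sketch below is a back-of-the-envelope calculation in C; the per-hand-off cost and stage count are assumed figures for illustration, not measured Cell numbers.

```c
#include <stdio.h>

int main(void) {
    /* Internal tube update rate quoted in the talk (~70 kHz), and an assumed
     * round-trip cost for a mailbox/DMA hand-off between SPUs.  The 5 us
     * figure and the four stages are purely illustrative. */
    double update_rate_hz  = 70000.0;
    double handoff_cost_us = 5.0;
    int    pipeline_stages = 4;

    double budget_us = 1e6 / update_rate_hz;   /* time available per update, ~14.3 us */
    double comm_us   = pipeline_stages * handoff_cost_us;   /* ~20 us */

    printf("per-update budget: %.2f us\n", budget_us);
    printf("communication alone: %.2f us\n", comm_us);
    printf("communication uses %.0f%% of the budget\n", 100.0 * comm_us / budget_us);
    return 0;
}
```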
Another problem is that because Gnuspeech uses Gnustep, which is a GUI library, the calls are asynchronous, and that makes it tremendously difficult to debug as well. I had tried to convert part of it to C++ to try to get the tube running on one of the SPEs, and that took three days in and of itself, and I ended up having to toss that attempt out.

Another problem we had was dynamic pointer alignment. In Objective-C, most of the objects are stored behind dynamically allocated pointers, and there's no aligned malloc or anything of that nature. So we couldn't really transfer any of the objects from the Objective-C memory area to the SPUs, work on the data there, and then send them back (a common aligned-buffer workaround is sketched below).

So what is working now? We are able to take line-buffered text in the Gnuspeech engine and translate it to utterances, that is, phonetic pronunciations, and get to the point where we would execute the tube model. Unfortunately, a bug, potentially in Gnuspeech, is preventing us from properly executing the tube model right now. So that's one thing we're having problems with. Additionally, the tube currently runs on the PPE. We've been trying to get the tube to run on the SPE, but it's not going well, partly because of the dynamic pointer alignment issue and partly because of some other things we've run into.

What is currently not working? As Drew mentioned, there are a lot of dictionary lookups in the preprocessing stage of the pipeline. And there's a bug in Gnustep where it won't parse the dictionary if it's above a certain size. The dictionary has, I believe, 70,000 entries and takes up almost 3 megabytes. But if there are more than something like 3,000 entries in the dictionary, it just doesn't parse, and we have no idea why.

So to conclude, this was a tremendously difficult problem. There are a bunch of data dependencies, and the synchronization requirements are very, very tight. However, we feel that with more time and more experience with the code base, we would have been able to parallelize it.
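For the dynamic pointer alignment problem mentioned above, a common workaround on Cell is to copy the plain-old-data fields out of the objects into a separate buffer that is 16-byte aligned and padded to a multiple of 16 bytes, and transfer that buffer instead of the objects themselves. The sketch below shows the general idea in C; it is an illustration, not code from the project, and the field layout is made up.

```c
#include <stdio.h>
#include <string.h>
#include <stdalign.h>

/* Plain-old-data snapshot of the tube parameters we would want on the SPE:
 * no Objective-C object headers, no interior pointers.  Layout is illustrative. */
typedef struct {
    float radius[8];
    float glottal_amplitude;
    float f0;
    char  pad[8];          /* pad the size out to a multiple of 16 bytes */
} TubeParams;

int main(void) {
    /* DMA transfers on Cell in practice want 16-byte-aligned buffers whose
     * sizes are multiples of 16, which ordinary Objective-C allocations do
     * not guarantee, so the POD fields are copied into a buffer we control. */
    alignas(16) static TubeParams dma_buf;

    TubeParams snapshot = {{1, 1, 1, 1, 1, 1, 1, 1}, 0.5f, 110.0f, {0}};
    memcpy(&dma_buf, &snapshot, sizeof snapshot);

    printf("aligned copy at %p, %zu bytes\n", (void *)&dma_buf, sizeof dma_buf);
    return 0;
}
```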
Parallelization could almost certainly help the vocal quality in terms of naturalness: getting, for example, as Joyce mentioned, a higher quality glottal source, and better speaker identification and vowel identification. For example, when you pronounce different vowels, sometimes the quality of the glottal source changes. And lastly, it would be worth the time to rewrite the whole thing from scratch, skipping Gnustep, skipping Objective-C, and going with C++ or, most likely, C for the whole thing. Thank you. Any questions?

AUDIENCE: It sounds like you guys are taking a fine-grained approach in which you're splitting the application across different units. Since you're synthesizing completely independent words, let's say, could you just run the whole application on an SPU? There's engineering work there, but from a parallelization standpoint, can you just take the whole application and run it, for example, on different words?

OMARI: So I believe you're suggesting we run the tube on different SPEs and then feed data to the separate instances of the tube from the PPE?

AUDIENCE: Well, including the whole processing pipeline. I mean, on whole sentences, from one sentence to the next?

OMARI: The big stumbling block for us was that there isn't currently an Objective-C compiler for the SPEs, so we can't run the Objective-C code on the SPEs at all.

AUDIENCE: But if you did it from scratch, if you were to throw away all your Objective-C and start from scratch, would that be a better parallelization strategy than a fine-grained one?

OMARI: Possibly. One of the disadvantages of splitting up words is that there is [INAUDIBLE] continuous state that connects the different postures of the vocal tract all the way through the utterance.
And so we would have to do possibly some prediction, possibly some interpolation, to figure out how to connect the different, separate utterances that would have been produced consecutively.

AUDIENCE: Permitting one sentence at a time or something?

OMARI: Yeah. That might be an option, yes.
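The coarse-grained approach discussed in this exchange could look roughly like the following sketch: split the text into sentences, synthesize each one independently (in principle each on its own SPE), and then smooth the vocal tract parameters across the utterance boundary, as suggested above. The functions and constants here are hypothetical; this is an illustration of the idea, not something the project implemented.

```c
#include <stdio.h>

#define TUBE_SECTIONS 8
#define XFADE 4                    /* number of parameter frames to blend */

typedef struct { float radius[TUBE_SECTIONS]; } Frame;

/* Stand-in for running the whole text-to-parameters pipeline on one
 * sentence.  In the coarse-grained design, each call would run
 * independently, in principle on its own SPE. */
static int synthesize_sentence(const char *sentence, Frame *out, int max) {
    (void)sentence;
    int n = (16 < max) ? 16 : max;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < TUBE_SECTIONS; j++)
            out[i].radius[j] = 1.0f;            /* placeholder parameters */
    return n;
}

/* Ramp the first XFADE frames of 'next' away from the final frame of the
 * previous sentence, so the vocal tract shape changes gradually instead of
 * jumping at the utterance boundary. */
static void smooth_boundary(const Frame *prev_last, Frame *next) {
    for (int i = 0; i < XFADE; i++) {
        float t = (float)(i + 1) / (XFADE + 1);
        for (int j = 0; j < TUBE_SECTIONS; j++)
            next[i].radius[j] =
                (1.0f - t) * prev_last->radius[j] + t * next[i].radius[j];
    }
}

int main(void) {
    Frame s1[64], s2[64];
    int n1 = synthesize_sentence("All your base are belong to us.", s1, 64);
    int n2 = synthesize_sentence("This is the second sentence.",    s2, 64);
    smooth_boundary(&s1[n1 - 1], s2);
    printf("sentence 1: %d frames, sentence 2: %d frames\n", n1, n2);
    return 0;
}
```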