The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

ARVIND THIAGARAJAN: So I'm Arvind, and this is Micah. We worked together on a framework for flexible stream processing on the Cell. We were also oriented towards an application, which in this case is a software radio, and we used this application as a case study. Where is the mic?

I'll try and speak as loud as I can. So our project was a flexible stream processing framework for the Cell, and we used a software radio as a case study, essentially, of an application you could build using this framework.

The motivation for our project is essentially what we've been discussing in class; it's been reiterated. The Cell isn't easy to program. There's no shared memory. There's just message passing, which is quite messy. You have to explicitly parallelize your programs if you want to write them as, say, custom C programs, and some of the groups have described the challenges they faced when doing that. Extracting parallelism can be tricky. For example, if you want to do pipelining, then on the SPEs you have to predict what addresses you're going to require and set up a DMA so that those addresses are fetched in advance. And as Bill mentioned in his talk in class, stream programming can help alleviate some of these issues for applications like software radio, which fit naturally into the streaming model.

So what we tried to do for our project was build a streaming framework targeted specifically at signal processing applications: as lightweight as possible, because the code has to fit on the SPEs, but also as expressive as possible, so as to simplify life for developers.
The data model, at least, is based on a research project that I have worked on in the past, the WaveScope streaming database management system. It's a research project with the database group. The data model is essentially an extension of the streaming model to handle larger blocks of data, so as to process them more efficiently. For example, several signal processing operators, like the fast Fourier transform, need to do multiple passes over the data, and therefore trying to treat streams on a sample-by-sample basis leads to high cost, scheduling overhead, and inefficiency. So that's what the data model does. We tried to port this data model over to the Cell processor and see how well we could exploit the features of the Cell, in particular the high on-chip bandwidth between the SPEs, to do the streaming.

So our case study was, as we mentioned, a simple software radio application. It's really simple: it uses incoherent demodulation and simple amplitude-shift keying modulation. As I said, the main goals of our framework were to simplify life for the developer and to extract as much parallelism as possible: pipeline parallelism, data parallelism, and so on. The kind of parallelism we were able to implement so far is only pipeline parallelism, but as future work we'd be interested in the other kinds of parallelism as well. More about that as I go on.

So in the framework we implemented, the programming model is quite simple. The basic execution unit is what we call an operator. It's analogous to what in StreamIt would be called a work function, or what in GNU Radio, which is a framework for building software radios, would be called a block. These operators can be arbitrary C++ classes with state, and they implement an iterate method, which the developer overloads in order to process a block of data.
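To make the operator abstraction concrete, here is a minimal sketch of what an operator might look like in this style of framework. The names and signatures (Operator, SignalBlock, iterate, emit) are illustrative assumptions, not the actual WaveScope or framework API.

    // Hypothetical sketch of an operator; names and signatures are
    // illustrative, not the actual framework API.
    #include <cstddef>
    #include <deque>
    #include <vector>

    struct SignalBlock {                    // one block of samples passed between operators
        std::vector<float> samples;
    };

    class Operator {
    public:
        virtual ~Operator() {}
        // The scheduler calls iterate() with an input block; the operator
        // processes it and pushes any results downstream via emit().
        virtual void iterate(SignalBlock &in) = 0;
        std::deque<SignalBlock> outbox;     // stands in for the downstream queue
    protected:
        void emit(const SignalBlock &out) { outbox.push_back(out); }
    };

    // A stateful operator: an FIR filter that keeps its taps and a delay line
    // across calls to iterate().
    class FIRFilter : public Operator {
    public:
        explicit FIRFilter(const std::vector<float> &taps)
            : taps_(taps), delay_(taps.size(), 0.0f) {}

        void iterate(SignalBlock &in) {
            SignalBlock out;
            for (std::size_t n = 0; n < in.samples.size(); ++n) {
                delay_.pop_back();
                delay_.insert(delay_.begin(), in.samples[n]);
                float y = 0.0f;
                for (std::size_t k = 0; k < taps_.size(); ++k)
                    y += taps_[k] * delay_[k];      // multiply-add over the delay line
                out.samples.push_back(y);
            }
            emit(out);
        }
    private:
        std::vector<float> taps_;
        std::vector<float> delay_;          // most recent sample first
    };

In the actual framework, emit() would hand the block off to the queueing library described below rather than to an in-memory deque.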
The WaveScope data model also provides a library for managing memory and passing blocks of signal data between these operators. Applications in this model are built by chaining operators together. So this is, roughly, a snippet of some of the code we wrote for the software radio. You create a box, let's say a FIRFilter, which processes elements of type float, and it takes in some arguments to initialize the filter's parameters, and so on. You create a white noise generator and hook up the filter to the white noise generator. We used this to simulate a simple channel, [INAUDIBLE] channel.

So I'll just describe the components of our framework. We have a lightweight scheduler on both the PPE and the SPEs. Right now, it uses a static operator mapping, in the sense that you have to specify a static configuration file where you say this operator name will run on this SPE number. We have not implemented dynamically reconfiguring the mapping at runtime, and we haven't yet seen the need for doing that, so it wouldn't be too hard to add if needed. But right now, you can easily shuffle around the operator mapping by tweaking the configuration file.

Signal blocks, as I said, were adapted from WaveScope. They use reference counting and avoid in-memory copies, which can be quite expensive, especially on the Cell. They also provide a convenient API to manipulate signals, so you don't have to do much of the memory management yourself or debug any of the hard problems to do with memory management. We also ported this library to ensure that data is aligned for you automatically and transported via queues.

So one of the major things we had to implement was a queueing library and remote heap management, what amounted essentially to a remote heap management library. In some sense, we faced a choice here. We could either have the PPE control and allocate memory statically and make all the decisions about what memory is allocated where, or we could have the SPEs manage it themselves.
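Going back to the chaining snippet and the static mapping file just described, a rough sketch of what they might look like follows. The createBox and connect calls, the operator names, and the configuration syntax are assumptions for illustration; the actual framework may spell these differently.

    // Hypothetical sketch of chaining operators, in the spirit of the
    // snippet described above; not the exact framework API.
    Box *noise  = createBox< WhiteNoiseSource<float> >(/* noise power */ 0.1f);
    Box *filter = createBox< FIRFilter<float> >(/* taps, cutoff, ... */ 64, 0.25f);
    connect(noise, filter);   // the filter consumes the noise generator's output
                              // (used here to simulate a simple channel)

    // Hypothetical static configuration file read by the scheduler:
    // one operator name per line, mapped by hand to an SPE.
    //
    //   noise_src     spe 0
    //   channel_fir   spe 1
    //   demodulator   spe 2
    //   gain_control  spe 3

Shuffling operators between SPEs then only means editing lines in a file like this, not restructuring the pipeline code.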
We decided to go for the latter, partly because it was more dynamic and also because we weren't sure what the implications of all the control flow passing through the PPE would be. So we chose the second approach, which is autonomous memory management. When an SPE sends a streaming data element to another SPE, it doesn't have to explicitly ask the other SPE to allocate memory. It has a remote heap interface, so it can directly allocate data and write it to the other SPE. This remote heap is currently of fixed size, but it could be improved by having a bunch of queues share one heap. Right now, we have one remote heap for each queue between operators on different SPEs. So our system automatically handles pipelining streaming data from SPE to SPE using the DMA API. Micah will take over from here and describe the software radio implementation briefly.

MICAH BRODSKY: So our software radio implementation is relatively simple. We weren't pushing on that too hard, especially because it took the vast majority of our time just to get the framework to work. Saving the programmer trouble meant that we inherited a lot of trouble ourselves. It breaks down to about 25 boxes, which we map manually to the SPEs in our config file. It's about 3,000 lines of code, most of which is framework.

So I guess, if we were more put together, we'd have a nice diagram to show you. There's enough time; I'll try to draw a quick diagram on the chalkboard. We're simulating the sender, the channel, and the receiver. So the computation in question looks something like this. You have a bitstream, and you need to take it and convert it into some-- can you read anything? So you take a stream of bits in. Basically, you use a lookup mechanism to convert the bits to an analog waveform, and filter that to produce something that has a narrower spectrum. Running out of space here.
Then you multiply that against a sine wave.

ARVIND THIAGARAJAN: That's for modulation.

MICAH BRODSKY: And so you get-- you've probably seen pictures of this. It's very much like AM; it's basically binary AM. [INAUDIBLE]. It's one of the simplest things you can do.

Then, to simulate a channel, there's a random FIR filter, finite impulse response. What that means is it's basically taking copies of the input at different time offsets with random coefficients and summing them up. It's a huge multiply-add computation.

AUDIENCE: [INAUDIBLE]

MICAH BRODSKY: This is 80 taps. Then we add some Gaussian noise. And then, from here, we take over and try to figure out what we put in in the first place. So, a bunch of filtering. Again, more finite impulse response filtering. There's a little closed loop that tries to estimate the signal amplitude and correct for it; that's called automatic gain control, to keep the amplitude roughly constant. And I'm probably getting these things a little out of order. This is the incoherent demodulation part: we square the signal to get rid of the carrier. Automatic gain control. Filtering. Then there's another loop; this is called a phase-locked loop. The idea is to try to match a sine wave to some input signal. I don't know how to explain it very well. It's basically a locking type of detector: the idea is to lock onto the phase of some periodic thing. This is for recovering when to sample. Because you've got this messy waveform, you've got to know when to look and say, OK, is it high or is it low, to get a bit out.

ARVIND THIAGARAJAN: [INAUDIBLE].

MICAH BRODSKY: Yeah. I think that gives the picture. I'm probably boring everybody. Here's a picture generated from the system. The green line is the data in: high, low. [INAUDIBLE]. The red line is the analog signal out, right before it's supposed to decide what the heck the input was.
This is after squaring, and filtering, and automatic gain control, and all that. The little blips are actually because we used a modulation called alternate mark inversion: it basically flips every one. That's why it's blipping instead of being constant, which is to be [INAUDIBLE]. And then the little blue daggers are the results of the phase-locked loop trying to recover when to sample. They're kind of off, but they're kind of right. And so, if you take the little blue blip: if the red line is above 0 there, it's a 1, and if it's below 0, it's a 0. And that's how you get your bits out.

This was hard to get right, mostly because of the framework issues. Implementing distributed objects on a system without real shared memory is hard, because you have to serialize everything into a stream of bits and deserialize it on the other side. So it really makes pie out of any existing object-oriented code. We did quite a bit of work to get decent lock-free, almost zero-copy transfer-- another day or so and we probably would have gotten it fully zero-copy-- streaming of the data from place to place. And we had to keep the code footprint low. C++ is bloated, and we don't have an overlay system yet, so if an SPE is not running a particular box, it still has to have the code for it. So we don't have any infrastructure for--

ARVIND THIAGARAJAN: There are some macros to get around that, right?

MICAH BRODSKY: Yeah. It's pretty messy. So all the code is on all the SPEs, so code bloat is particularly an issue. And XLC has this particular penchant for runtime type information and exception handling. An incredible amount of voodoo was necessary to get those, and that 70K of useless bloat, out of there.

ARVIND THIAGARAJAN: We pretty much have the--

MICAH BRODSKY: It works, but not always.

ARVIND THIAGARAJAN: We did manage to get it running long enough to get some measurements.

MICAH BRODSKY: Yeah, we did get some decent data out.
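Putting the pieces together, here is a compact sketch of the signal path just described: bits are mapped to a waveform, modulated onto a carrier, pushed through a random 80-tap FIR channel with noise, then squared (incoherent demodulation) and thresholded at the recovered sampling instants. This is a simplified single-file version for illustration only; in the actual system each stage is a separate operator, the receiver also has pulse-shaping filters, automatic gain control, and the phase-locked loop, and the transmitter uses alternate mark inversion, all of which are omitted here.

    // Simplified end-to-end sketch of the chain described above (not the
    // project's code): binary ASK modulation, a random FIR channel with
    // noise, square-law demodulation, and a threshold decision.
    #include <cmath>
    #include <cstdlib>
    #include <vector>

    // bits -> waveform via lookup (sps samples per bit), multiplied by a
    // carrier of angular frequency w radians per sample: binary AM / ASK
    std::vector<float> modulate(const std::vector<int> &bits, int sps, float w) {
        std::vector<float> x;
        for (std::size_t i = 0; i < bits.size(); ++i)
            for (int k = 0; k < sps; ++k)
                x.push_back((bits[i] ? 1.0f : 0.0f) * std::sin(w * float(x.size())));
        return x;
    }

    // random FIR channel: sum of delayed copies with random taps, plus a
    // crude noise term standing in for the Gaussian noise box
    std::vector<float> channel(const std::vector<float> &x, int ntaps, float noise) {
        std::vector<float> h(ntaps), y(x.size(), 0.0f);
        for (int k = 0; k < ntaps; ++k)
            h[k] = std::rand() / float(RAND_MAX) - 0.5f;
        for (std::size_t n = 0; n < x.size(); ++n) {
            for (int k = 0; k < ntaps && k <= int(n); ++k)
                y[n] += h[k] * x[n - k];               // the huge multiply-add
            y[n] += noise * (std::rand() / float(RAND_MAX) - 0.5f);
        }
        return y;
    }

    // incoherent demodulation: squaring removes the carrier; the real
    // receiver then low-pass filters and applies automatic gain control
    void square_demodulate(std::vector<float> &y) {
        for (std::size_t n = 0; n < y.size(); ++n)
            y[n] = y[n] * y[n];
    }

    // decision: at each sampling instant (recovered by the PLL in the real
    // system), threshold the demodulated signal to get a bit back out
    std::vector<int> decide(const std::vector<float> &d,
                            const std::vector<std::size_t> &sample_times,
                            float threshold) {
        std::vector<int> bits;
        for (std::size_t i = 0; i < sample_times.size(); ++i)
            if (sample_times[i] < d.size())
                bits.push_back(d[sample_times[i]] > threshold ? 1 : 0);
        return bits;
    }

Each of these functions corresponds to one or more boxes in the real pipeline; keeping them as separate operators is what lets the framework pipeline the stages across SPEs.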
Running on the PPE only, we can push about 170,000 samples per second through. And the scheduling file is kind of a rule of thumb: we just roughly said, OK, that looks about this big, we'll throw it on this SPE, this SPE, this SPE. We got roughly four times that using five SPEs.

ARVIND THIAGARAJAN: The code footprints are really large, but--

[INTERPOSING VOICES]

MICAH BRODSKY: We really had to push down things like our queue sizes. We just didn't have enough memory. It sucked. We basically just said that already.

ARVIND THIAGARAJAN: Some performance bottlenecks, I guess.

[INTERPOSING VOICES]

MICAH BRODSKY: Interesting performance behavior. We found that the SPEs are ridiculously underutilized. Most of the algorithms are quite a bit zippier on the SPEs, and so they may be running about a third of the time and spending the rest of the time just waiting for input. And then the PPE, which is doing only a tiny amount of the computation-- basically just feeding data in and sucking it out-- is spending half of its time busy and the other half of the time stuck in flow control waiting for queue space, which is-- our flow control algorithm sucks. It'll get better with time. So it seems like you should be able to do quite a bit better than we did with a bit more work.

We need to cut down the footprint. And once we have a little bit of breathing room and get rid of the nasty race bugs and such, we can finally investigate what was our original, pie-in-the-sky goal: automatically deciding what goes where, taking measurements of the performance and feeding that back to produce a better placement of operators, and applying data parallelism by instantiating operators on multiple different SPEs and splitting the data stream. And just doing more.

AUDIENCE: Since your PPE is at 40% to 50% utilization, did you put actual work on there?

MICAH BRODSKY: We did put some work on there.
Actually, we put work on there because we couldn't fit all the boxes-- the code for all the boxes-- on the SPEs. So we strategically put a few things at the beginning and a few things at the end, which weren't supposed to be very computationally intensive, and yet they managed to take up half the CPU. The issue with the other half of the CPU is that it's actually blocking deep inside an inner loop, because, basically, our back pressure isn't online yet. So if it tries to emit something to a queue, and that queue is full, it just stops. And it really could be running a whole bunch of other stuff, but it's not smart enough to do that.

AUDIENCE: [INAUDIBLE]

MICAH BRODSKY: Unlike a model like StreamIt, everything here is asynchronous and code driven. The SPUs can decide on the fly how much to emit. The programmer doesn't have to declare anything, but it means everything's asynchronous, and so you basically get race issues galore.

AUDIENCE: But in this application, do you find any dynamic rates?

MICAH BRODSKY: In this application, there's not much. It's a very simple application. If we actually went to packetization, error correction, compression, things like that, we'd probably see a lot more of that. This application definitely underutilizes the asynchronous capabilities of the system.

AUDIENCE: In the case of radio, wouldn't it be OK to drop [INAUDIBLE] frames of audio data? Since you're spending so much time waiting, it'd be better just to relieve pressure on the queues by dropping some frames that are unnecessary.

MICAH BRODSKY: It might well be.

AUDIENCE: And interpolating at the end to try to fix it up a little bit.

MICAH BRODSKY: It might well be. We decided not to do that as part of the framework because [INAUDIBLE] policy decision, and we didn't want to make that for all possible applications. But if we figure out a good way to expose that, it definitely would be an option.
Just drop a few packets. Drop a few samples.

AUDIENCE: So what about buffer sizes? Do you declare the buffer sizes as well?

MICAH BRODSKY: Yeah. The way we have it now, there's actually just a #define in the code that says all buffers are this size.

AUDIENCE: Do you need any double buffering in there?

MICAH BRODSKY: You don't need it. Well, actually, the way it works is that there's this remote heap input queue, and an upstream SPE just DMAs things in as it feels like. And then the downstream SPE looks: is there something here? Grab it, use it. So it just works, however much buffering there is.

ARVIND THIAGARAJAN: The ring buffer and the [INAUDIBLE] should get that block free.

MICAH BRODSKY: That's a benefit of the asynchronous approach. It just works, if you have memory, which we don't. The queues are tiny.

SPEAKER: [INAUDIBLE] is angrily telling me that his connection dropped, but he's not picking up the phone. Yeah, I can see it in there.

AUDIENCE: Any other questions for them? Did you want to do the demo?

MICAH BRODSKY: Nah.

SPEAKER ON PHONE: What did you do to get race conditions?

MICAH BRODSKY: How did we manage to get ourselves race conditions? Well, it's a mix of the fact that all the SPUs can essentially operate as independent threads and are sort of asynchronously DMAing things to each other, and somewhat poor, legacy-driven architectural decisions in the way the PPU code works, because we ported a lot of code from the WaveScope platform.

ARVIND THIAGARAJAN: Which wasn't very well documented.

MICAH BRODSKY: And that was intended to be multithreaded with Pthreads. In our original, naive, before-we-actually-started-the-class impression, we thought we were just going to port the whole thing and use Pthreads on the SPU, which of course is not possible. So there were still some threaded components in there, and we introduced a lot of bugs trying to port the thing.
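To make the queue being described concrete, here is a minimal sketch of the kind of fixed-size, single-producer, single-consumer remote-heap queue this implies: the upstream SPE DMAs blocks into a ring in the downstream SPE's local store and advances a head index, while the downstream SPE polls for a filled slot. The struct layout, names, and sizes are assumptions, the DMA call is a stand-in for the real MFC interface, and the head/tail synchronization that is elided here is exactly where the subtle races discussed next tend to live.

    // Hypothetical sketch of the per-link remote heap queue; not the real
    // library. QUEUE_SLOTS and BLOCK_BYTES play the role of the #define'd
    // buffer size mentioned above.
    #include <cstddef>
    #include <cstdint>

    const unsigned QUEUE_SLOTS = 4;
    const unsigned BLOCK_BYTES = 4096;

    // stand-in for the DMA API (an MFC put); implementation elided
    void dma_put(const void *local_src, uint64_t remote_dst_ea, unsigned bytes);

    struct RemoteQueue {
        uint64_t ring_ea;   // effective address of the ring in the downstream
                            // SPE's local store (its "remote heap")
        unsigned head;      // producer's next slot; how each side learns of the
        unsigned tail;      // other's index updates is elided in this sketch
    };

    // Producer (upstream SPE): allocate a slot in the remote heap and DMA the
    // block straight in, without asking the other SPE to allocate anything.
    bool try_send(RemoteQueue &q, const void *block) {
        if ((q.head + 1) % QUEUE_SLOTS == q.tail)
            return false;                             // full: back pressure, caller waits
        dma_put(block, q.ring_ea + uint64_t(q.head) * BLOCK_BYTES, BLOCK_BYTES);
        q.head = (q.head + 1) % QUEUE_SLOTS;
        return true;
    }

    // Consumer (downstream SPE): poll the local ring; if a slot is filled,
    // grab it and use it in place. A real implementation would advance tail
    // only after the block has been consumed.
    const void *try_receive(RemoteQueue &q, const void *ring_local) {
        if (q.tail == q.head)
            return 0;                                 // nothing here yet
        const void *block =
            static_cast<const char *>(ring_local) + q.tail * BLOCK_BYTES;
        q.tail = (q.tail + 1) % QUEUE_SLOTS;          // frees the slot for the producer
        return block;
    }

Getting the index handling a little bit wrong, so that one pointer laps the other, is exactly the kind of overrun described below.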
See if I can--

ARVIND THIAGARAJAN: The surprising thing was the fact that we had to replace array constructors in order to eliminate RTTI. And it was just a big deal--

[INTERPOSING VOICES]

MICAH BRODSKY: We had to get rid of new, we had to get rid of delete, we had to get rid of array constructors, we had to get rid of--

ARVIND THIAGARAJAN: Virtual destructors.

MICAH BRODSKY: --pure virtual functions and virtual destructors. It was a mess.

I guess one more pithy thing I could say about race conditions is that it's incredible how many subtle race conditions and bugs we found in the remote heap and queueing library. Because it's a lock-free, asynchronous data structure, there are two threads-- two SPUs-- reading and writing from it at the same time. And there are all sorts of little subtleties where, if you get it a little bit wrong, you end up with one of the queue pointers overrunning the other one, and then you have basically dangling pointers and things like that. At least three or four times, we spent a few hours until we discovered that was the cause of mysterious behavior.

AUDIENCE: Anything else? Thank you.

[APPLAUSE]