The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

MICHAEL PERRONE: So my name's Michael Perrone. I'm at the T.J. Watson Research Center, IBM Research. I do all kinds of things in research, but most recently-- that's not what I want. There we go. Most recently I've been working with the Cell processor for the past three years or so. I don't want that. How's that? And because I do have to run out for a flight, I have my e-mail here; if you want to ask me questions, feel free to do that. What I'm going to do in this presentation is, as Saman suggested, talk in depth about the Cell processor, but really it's still going to be just the very surface, because you're going to have a month to go into a lot more detail.
But I want to give you a sense for why it was created, the way it was created, what it's capable of doing, and what programming considerations have to be kept in mind when you program.

Here's the agenda just for this section, Mike, of this class. I'll give you some motivation. This is going to be a bit of a repeat, so I'll go through it fairly quickly. I'll talk about the design concepts, hardware overview, performance characteristics, application affinity-- what good is this device? I'll talk about the software, and this I imagine is one of the areas where you're going to go into a lot of detail in the next month because, as you suggested, the software really is the issue. And I would actually go a little further and say: why do people drive such large cars in the U.S.? Why do they waste so much energy? The answer is very simple. It's because it's cheap. Even at $3 a gallon, it's cheap compared to, say, Europe and other places. The truth is, it's the same thing with programmers. Why did programmers program the way they did in the past 10, 20 years? Because cycles were cheap.
They knew Moore's law was going to keep going, so you could implement some algorithm and not worry about the details, as long as you got the right power law-- if you got your n squared or n cubed or n log n, whatever behavior. The details-- whether the multiplying factor was 10 or 100-- didn't matter. Eventually Moore's law would solve that problem for you, so you didn't have to be efficient. And I think I've spent the better part of three years trying to fight against that, and you're going to learn in this class that, particularly for multicore, you have to think very hard about how you're going to get performance.

This is actually the take-home message that I want to give. I think it's just one or two slides, but we really need to get to these, because that's where I want to get you thinking along the right lines. And then there's a hardware consideration; we can skip that.

All right, so where have all the gigahertz gone, right? We saw Moore's law, things getting faster and faster, and the answer is-- I have a different chart that's basically the same thing.
You have relative device performance on this axis and you've got the year here. And different technologies were growing, growing, growing, but now you see they're thresholding. And you go to conferences now, architecture conferences, and people are saying Moore's law is dead. Now, I don't know if I would go that far, and I know there are true believers out there who say, well, maybe the silicon-on-insulator technology is dead, but there'll be something else. And maybe that's true, and maybe that is multicore, but unless we get the right programming models in place, it's not going to be multicore.

Here's this power density graph. Here we have the nuclear reactor power up here, and you see Pentiums going up now. Of course, it's a log plot, so we're far away, but on this axis we're not far away. This is how much we shrink the technology, the size of those transistors. So if we're going down by 2 every 18 months or so-- maybe it's 2 years now-- we're not so far away from that nuclear reactor output. And that's a problem. And what's really causing that problem?
Here's a picture of one of these gates magnified a lot, and here's the interface magnified even further, and you see here's this dielectric that's insulating between the two sides of the gate-- we're reaching a fundamental limit: a few atomic layers. You see here it's like 11 angstroms. What's that, 10, 11 atoms across? If you go back to basic physics, you know that quantum mechanical objects like electrons tunnel, right? And they tunnel through barriers with kind of an exponential decay. So whenever you shrink this further you get more and more leakage, so the current is leaking through. In this graph, what you see is that as this size gets smaller, the leakage current is getting equivalent to the active power. So even when it's not doing anything, this 65-nanometer technology is leaking as much power as it actually uses. And eventually, as we get smaller and smaller, we're going to be using more power just leaking stuff away, and that's really bad, because as Saman suggested, we have people like Google putting this stuff near the Coulee Dam so that they can get power.
I deal with a lot of customers who have tens of thousands of nodes-- 50,000 processors, 100,000 processors. They're using 20 gigabytes-- sorry, megahertz-- no, megawatts, that's what I want to say. It's too early in the morning. Tens of megawatts to power their installations, and they're choosing sites specifically to get that power, and they're limited. So they come to me, they come to people at IBM, and they say, what can we do about power? Power is a problem. And that's why we're not seeing increasing gigahertz.

Has this ever happened before? Well, I'm going to go through this quickly: yes. Here we see the power output of a steam iron, right there, per unit area. And something's messed up here. You see, as the technology changed from bipolar to CMOS we were able to improve the performance, but the heat flux got higher again, and that raises the question: what's going to happen next? And of course IBM, Intel, AMD, they're all betting on multicore. And so there's an opportunity from a business point of view. So now, that's the intro.
Multicore: how do you deal with it? Here's a picture of the chip, the Cell processor. You can see these 8 little black dots. They're local memory for each one of 8 special-purpose processors, as well as a big chunk over here, which is a ninth processor. So this chip has 9 processors on board, and the trick is to design it so that it addresses the issues that we just discussed.

So let me put this in context. Cell was created for the Sony PlayStation 3. It started in about 2000, and there's a long development here until it was finally announced over here. Where was it first announced? It was announced several years later, and IBM announced a Cell blade about a year back, and we're pushing these blades, and we're very much struggling with the programming model. How do you get performance while making something programmable? If you go to customers and they have 4 million lines of code, you can't tell them to just port it-- it'll be 80 person-years to get it ported, 100 person-years more. And then you have to optimize it. So there are problems, and we'll talk about that.
But it was created in this context, and because of that, this chip in particular is a commodity processor, meaning that it's going to be selling millions and millions. The Sony PlayStation 2 sold an average of 20 million units each year for 5 years, and we expect the same for the PlayStation 3. So the Cell has a big advantage over other multicore processors like the Intel Woodcrest, which has a street price of about $2,000, while the Cell is around $100. So not only do we have big performance improvements, we have price advantages too, because of that commodity market.

All right, let's talk about the design concept. Here's a little bit of a rehash of what we discussed, with some interesting words here. We're talking about a power wall, a memory wall, and a frequency wall. So we've talked about this frequency wall. We're hitting this wall because of the power, really. And the power wall-- people just don't have enough power coming into their buildings to keep these things going. But the memory wall-- Saman didn't actually use that term, but that's the fact that as clock frequencies got higher and higher, memory appeared further and further away:
the more cycles I have to wait as a processor before the data comes in. And so that changes the whole paradigm of how you have to think about it. We have processors with lots of cache, but is cache really what you want? Well, it depends. If you have a very localized process, where you're going to bring something into cache and the data is going to be reused, then that's really a good thing to do. But what if you have random gather and scatter of data? You know, you're doing some transactional processing, or whatever mathematical function you're calculating is very distributed, like an FFT. So you have to do all sorts of accesses through memory, and it doesn't fit in that cache. Well, then you can start thrashing the cache. You bring in one integer, and then you ask the cache for the next thing; it's not there, so it has to go out again, and so you spend all this time wasting time getting stuff into cache. So what we're pushing for multicore, and especially for Cell, is the notion of a shopping list. And this is where programmability comes in and programming models come in.
You really need to think ahead of time about what your shopping list is going to be, and the analogy that people have been using is: you're fixing something in your house, your pipe breaks. So you go and say, oh, I need a new pipe. So you go to the store, you get a pipe. You bring it back and say, oh, I need some putty. So you go to the store, you get some putty. And oh, I need a wrench. Go to the store-- that's what cache is. You figure out what you need when you need it. In the Cell processor you have to think ahead and make a shopping list: if I'm going to do this calculation, I need all these things. I'm going to bring them all in, and I'm going to start calculating. While I'm calculating on that, I'm going to go get my other shopping list, so that I can have some concurrency of the data load with the compute.

I'm going to skip this here. You can read that later; it's not that important. Cell synergy-- now this is kind of, you know, an apple pie, motherhood kind of thing. The Cell processor was specifically designed so that those 9 cores are synergistic, so that they interoperate very efficiently.
Now, I told you we have 8 identical processors; we call those SPEs. And the ninth processor is the PPE. It's been designed so that the PPE is running the OS and doing all the transactions, file systems, and whatnot, so that the SPEs can focus on what they're good at, which is compute. The whole thing is pulled together with an element interconnect bus, and we'll talk about that. It's a very, very efficient, very high bandwidth bus.

Now we're going to talk about the detailed hardware components. And Rodric-- somewhere, there you are-- asked me to actually dig down into more of the hardware. I would love to do that. Honestly, I'm not a hardware person. I'll do the best I can; perhaps at the end of the talk we'll dig down, and show me which slides you want. But I've been dealing with this for so long that I can do a decent job. Here's another picture of the chip. It has lots of transistors. This is the size. We talked about the 9 cores; it has 10 threads, because this power processor, the PPE, has 2 threads, and each of the SPEs is single-threaded. And this is the wow factor.
We have 200 gigaflops-- over 200 gigaflops-- of single precision performance on these chips, and over 20 gigaflops of double precision, and that will be going up to 100 gigaflops by the end of this year. The bandwidth to main memory is 25 gigabytes per second, and there's up to 75 gigabytes per second of I/O bandwidth. Now, this chip really has tremendous bandwidth, but what we've seen so far-- particularly with the Sony PlayStation, and I think you may have lots of them here-- is that the board is not designed to really take advantage of that bandwidth. And even the blades that IBM sells really can't get that type of bandwidth off the blade. And so if you're keeping everything local on the blade or on the PlayStation 3, then you have lots of bandwidth internally. But off blade or off board, you really have to survive with something like a gigabyte-- 2 gigabytes in the future. And this element interconnect bus I mentioned before has a tremendous bandwidth, over 300 gigabytes per second. The top frequency in the lab was over 4 gigabytes-- gigahertz, sorry. And the chips are currently running, when you buy them, at 3.2 gigahertz.
And actually, the PlayStation 3's that you're buying today, I think, only use 7 out of the 8 SPEs. And that was a design consideration from the hardware point of view, because these chips get bigger and bigger-- if you can't ratchet up the gigahertz, you have to spread out. And as they get bigger, flaws in the manufacturing process lead to faulty units. So instead of just throwing chips away, if one of the SPEs is bad we don't use it, and we just use 7. As the manufacturing process gets better, by the end of this year they'll be using 8. The blades that IBM sells are all set up for 8.

OK, so here's a schematic view of what you just saw on the previous slide. You have these 8 SPEs. You have the PPE here with its L1 and L2 cache. You have the element interconnect bus connecting all of these pieces together to a memory interface controller and a bus interface controller. And so this MIC is what has the 25.6 gigabytes per second, and this BIC has potentially 75 going out here. Each of these SPEs has its own local store. Those are those little black dots that you saw, those 8 black dots.
It's not very large-- it's a quarter of a megabyte-- but it's very fast to this SXU, the processing unit. It's only 6 cycles away from that unit. And it's fully pipelined, so that if you feed that pipeline you can get data every cycle. And here, the thing that you can't read because it's probably too dark, is the DMA engine. So one of the interesting things about this is that each one of these is a full-fledged processor. It can access main memory independent of the PPE. So you can have 9 processes-- or 10 if you're running 2 threads here-- all going simultaneously, all independent of one another. And that allows for a tremendous amount of flexibility in the types of algorithms you can implement. And because of this bus here, you can see it's 96 bytes per cycle, and we're at 3.2 gigahertz. I think that's 288 gigabytes per second. These guys can communicate with one another across this bus without ever going out to main memory, and so they can get much faster access to their local memories. So if you're doing lots of computes internally here, you can scream on this processing; really, really go fast.
And you can do the same if you're going out through the memory interface controller to main memory, if you sufficiently hide that memory access. So we'll talk about that.

All right, this is the PPE that I mentioned before. It's based on the IBM POWER family of processors; it's a watered-down version to reduce the power consumption. So it doesn't have the horsepower that you see in, say, a Pentium 4, or even-- actually, I don't have an exact comparison point for this processor, but if you take the code that runs today on your Intel or AMD, whatever your processor, and you recompile it on Cell, it'll run today-- maybe you have to change a library or two, but it'll run today here, no problem. But it'll be about 60% slower, 50% slower, and so people say, oh my god, this Cell processor's terrible. But that's because you're only using that one piece. So let's look at the other-- OK, so now we go into details of the PPE. Half a megabyte of L2 cache here, coherent load/stores. It does have a VMX unit, so you can do some SIMD operations-- single instruction, multiple data instructions. It's two-way hardware multithreaded.
Then there's the EIB that goes around here. It's composed of four 16-byte data rings. And you can have multiple simultaneous transfers per ring, for a total of over 100 outstanding requests simultaneously. But this slide kind of hides it under the rug: there's a certain topology here. These rings are connected to those 8 SPEs, and depending on which way you send things, you'll have better or worse performance. Some of these buses are going around clockwise and some are going counterclockwise. And because of that, you have to know who you're communicating with if you want really high efficiency. I haven't personally seen cases where it made a really big difference, but I do know there are some people who found, if I'm going from here to here, I want to make sure I'm sending things the right way because of that connectivity. Or else I could be sending things all the way around and waiting.

AUDIENCE: Just a quick question.

MICHAEL PERRONE: Yes.
AUDIENCE: Like you said, you could compile anything on the power processor-- it would be slower, but you can. Now, you also said the Cell processor is in itself a [INAUDIBLE] processor. Can I compile C code just for that as well?

MICHAEL PERRONE: C code would compile. There are issues with libraries, because the libraries wouldn't necessarily be ported to the SPE. If they had been, then yes. This is actually a very good question. It opens up lots of things. I don't know if I should take that later.

PROFESSOR: Take it later.

MICHAEL PERRONE: Bottom line is, this chip has two different processors, and therefore you need two different compilers, and it generates two different object codes. In principle, SPEs can run a full OS, but they're not designed to do that and no one's ever actually tried. So you could imagine having 8 or 9 OSes running on this processor if you wanted. A terrible waste from my perspective, but OK. So let's talk about these a little bit. Each of these SPEs has, like I mentioned, this memory flow controller here, an atomic update unit, the local store, and the SPU, which is actually the processing unit.
423 00:20:54,900 --> 00:21:01,370 Each SPU has a register file with 128 registers. 424 00:21:01,370 --> 00:21:04,140 Each register is 128 bits. 425 00:21:04,140 --> 00:21:09,340 So they're native SIMD, there are no scalar registers here 426 00:21:09,340 --> 00:21:12,060 for the user to play with. 427 00:21:12,060 --> 00:21:15,220 If you want to do scalar ops they'll be running in those 428 00:21:15,220 --> 00:21:18,420 full vector registers, but you'll just be wasting some 429 00:21:18,420 --> 00:21:19,670 portion of that register. 430 00:21:22,340 --> 00:21:25,760 It has IEEE double precision floating point, but it doesn't 431 00:21:25,760 --> 00:21:29,100 have IEEE single precision floating point. 432 00:21:29,100 --> 00:21:32,950 It's a curiosity, but that, again, came from the history. 433 00:21:32,950 --> 00:21:36,420 The processor was designed for the gaming industry and the 434 00:21:36,420 --> 00:21:38,850 gamers, they didn't care if it had IEEE. 435 00:21:38,850 --> 00:21:39,910 Who cares about IEEE? 436 00:21:39,910 --> 00:21:42,020 What I want is to have good monsters right on the screen. 437 00:21:45,590 --> 00:21:51,500 And so those SIMD registers can operate bitwise on bytes, 438 00:21:51,500 --> 00:21:57,020 on shorts, on four words at a time or two doubles at a time. 439 00:21:59,950 --> 00:22:06,210 The DMA engines here, each DMA engine can have up to 16 440 00:22:06,210 --> 00:22:09,430 outstanding requests in its queue before it stalls. 441 00:22:09,430 --> 00:22:12,680 So you can imagine you're writing something, some code 442 00:22:12,680 --> 00:22:15,210 and you're sending things out to the DMA and then all of a 443 00:22:15,210 --> 00:22:18,060 sudden you see really bad performance, it could be that 444 00:22:18,060 --> 00:22:20,210 your DMA engine has stalled the entire processor.
445 00:22:20,210 --> 00:22:23,300 If you try to write to that thing and then that queue is 446 00:22:23,300 --> 00:22:27,230 full, it just waits until the next open slot is available. 447 00:22:27,230 --> 00:22:31,040 So those are the kinds of considerations. 448 00:22:31,040 --> 00:22:34,460 AUDIENCE: You mean [UNINTELLIGIBLE PHRASE] 449 00:22:34,460 --> 00:22:35,352 MICHAEL PERRONE: Yes. 450 00:22:35,352 --> 00:22:37,360 AUDIENCE: It's not the global one? 451 00:22:37,360 --> 00:22:37,900 MICHAEL PERRONE: Right. 452 00:22:37,900 --> 00:22:39,590 That's correct. 453 00:22:39,590 --> 00:22:42,000 But there is a global address space. 454 00:22:42,000 --> 00:22:45,070 AUDIENCE: 16 slots each in each SPU. 455 00:22:45,070 --> 00:22:45,910 MICHAEL PERRONE: Right. 456 00:22:45,910 --> 00:22:46,450 Exactly. 457 00:22:46,450 --> 00:22:51,570 Each MFC has its own 16 slots. 458 00:22:51,570 --> 00:22:54,450 And they all address the same memory. 459 00:22:54,450 --> 00:22:57,540 They can have a transparent memory space or they can have 460 00:22:57,540 --> 00:22:59,280 a partitioned memory space depending on 461 00:22:59,280 --> 00:22:59,920 how you set it up. 462 00:22:59,920 --> 00:23:03,809 AUDIENCE: Each SPU doesn't have its own-- the DMA goes 463 00:23:03,809 --> 00:23:05,267 onto the bus, [UNINTELLIGIBLE] 464 00:23:07,985 --> 00:23:10,850 that goes to a connection to the [UNINTELLIGIBLE]. 465 00:23:14,235 --> 00:23:16,570 PROFESSOR: You can add this data in the SPUs too. 466 00:23:16,570 --> 00:23:18,530 You don't have to always go to outside memory. 467 00:23:18,530 --> 00:23:20,690 You can do SPU to SPU communication basically. 468 00:23:20,690 --> 00:23:21,250 MICHAEL PERRONE: Right. 469 00:23:21,250 --> 00:23:23,700 So I can do a DMA that transfers memory from this 470 00:23:23,700 --> 00:23:27,760 local store to this one if I wanted to and vice versa.
471 00:23:27,760 --> 00:23:29,590 And I can pull stuff in through the-- 472 00:23:32,920 --> 00:23:34,350 yeah, I mentioned this stuff. 473 00:23:37,800 --> 00:23:43,710 Now this broadband interface controller, the BIC, this is 474 00:23:43,710 --> 00:23:47,660 how you get off the blade or off the board. 475 00:23:47,660 --> 00:23:51,570 It has 20 gigabytes per second here on I/O IF. 476 00:23:54,790 --> 00:23:56,410 In 10 over here--I'm sorry, 5 over here. 477 00:23:56,410 --> 00:24:00,700 I'm trying to remember how we get up to 70. 478 00:24:00,700 --> 00:24:04,260 This is actually two-way and one is 25 and 479 00:24:04,260 --> 00:24:04,990 the other one's 30. 480 00:24:04,990 --> 00:24:08,100 That gets you to 55. 481 00:24:08,100 --> 00:24:09,920 This should be 10 and now, what's going on here? 482 00:24:14,310 --> 00:24:16,790 It adds up to 75, I'm sure. 483 00:24:16,790 --> 00:24:18,040 I'm sure about that. 484 00:24:20,790 --> 00:24:22,850 I don't know why that says that. 485 00:24:22,850 --> 00:24:25,730 But the interesting thing about this over here, this I/O 486 00:24:25,730 --> 00:24:30,670 IF zero is that you can use it to connect two 487 00:24:30,670 --> 00:24:32,130 cell processors together. 488 00:24:32,130 --> 00:24:35,180 So this is why I know it's really 25.6 because it's 489 00:24:35,180 --> 00:24:38,110 matched to this one. 490 00:24:38,110 --> 00:24:42,690 So you have 25.6 going out to main memory, but this one can 491 00:24:42,690 --> 00:24:45,240 go to another processor, so now you have these two 492 00:24:45,240 --> 00:24:49,140 processors side-by-side connected at 25.6 gigabytes 493 00:24:49,140 --> 00:24:49,880 per second. 494 00:24:49,880 --> 00:24:52,360 And now I can do a memory access through here to the 495 00:24:52,360 --> 00:24:56,270 memory that's on this processor and vice versa. 
496 00:24:56,270 --> 00:24:59,090 However, if I'm going straight out to my memory it's going to 497 00:24:59,090 --> 00:25:01,300 be faster than if I go out to this memory. 498 00:25:01,300 --> 00:25:04,220 So you have a slight NUMA architecture, nonuniform 499 00:25:04,220 --> 00:25:05,320 memory access. 500 00:25:05,320 --> 00:25:09,220 And you can hide that with sufficient multibuffering. 501 00:25:12,090 --> 00:25:14,910 So I know that this is 25 and I know the other one's 30. 502 00:25:14,910 --> 00:25:17,070 I don't know why it's written as 20 there. 503 00:25:17,070 --> 00:25:18,970 AUDIENCE: Can the SPUs write to the 504 00:25:18,970 --> 00:25:21,600 [UNINTELLIGIBLE PHRASE]? 505 00:25:21,600 --> 00:25:24,370 MICHAEL PERRONE: Yes, they can read from it. 506 00:25:24,370 --> 00:25:27,220 I don't know if they can write to it. 507 00:25:27,220 --> 00:25:29,790 In fact, that leads to a bottleneck occurring. 508 00:25:29,790 --> 00:25:34,850 So I happily start a process on my PPE and then I tell all 509 00:25:34,850 --> 00:25:37,340 my SPEs, start doing some number crunching. 510 00:25:37,340 --> 00:25:38,420 So they do that. 511 00:25:38,420 --> 00:25:41,690 They get access to memory, but they find the memory is in L2. 512 00:25:41,690 --> 00:25:44,440 So they start pulling from L2, but now all 8 are pulling from 513 00:25:44,440 --> 00:25:47,820 L2 and it's only 7 gigabytes per second instead of 25 and 514 00:25:47,820 --> 00:25:49,180 so you get a bottleneck. 515 00:25:49,180 --> 00:25:51,660 And so what I tell everybody is if you're going to 516 00:25:51,660 --> 00:25:54,520 initialize data with that PPE make sure you flush your cache 517 00:25:54,520 --> 00:25:59,210 before you start the SPEs.
518 00:25:59,210 --> 00:26:02,010 And then you don't want to be touching that memory because 519 00:26:02,010 --> 00:26:04,380 you really want to keep things-- stuff that the SPEs 520 00:26:04,380 --> 00:26:06,330 are dealing with-- you want to keep it out of L2 cache. 521 00:26:12,380 --> 00:26:14,020 Here there's an interrupt controller. 522 00:26:17,050 --> 00:26:19,540 An I/O bus master translation unit. 523 00:26:19,540 --> 00:26:22,850 And you know, these allow for messaging and message passing 524 00:26:22,850 --> 00:26:24,340 and interrupts and things of that nature. 525 00:26:27,450 --> 00:26:29,130 So that's the hardware overview. 526 00:26:29,130 --> 00:26:30,820 Any questions before I move on? 527 00:26:37,950 --> 00:26:39,900 So why's the cell processor so fast? 528 00:26:39,900 --> 00:26:43,250 Well, 3.2 gigahertz, that's one. 529 00:26:43,250 --> 00:26:45,630 But there's also the fact that we have 8 SPEs. 530 00:26:45,630 --> 00:26:51,140 Each of the 8 SPEs has SIMD units and registers, so 531 00:26:51,140 --> 00:26:56,090 they can do this parallel processing on a chip. 532 00:26:56,090 --> 00:27:01,440 We have 8 SPEs and each one is doing up to 8 ops per 533 00:27:01,440 --> 00:27:03,760 cycle if you're doing a mul-add. 534 00:27:03,760 --> 00:27:07,730 So you have four mul-adds for single precision. 535 00:27:07,730 --> 00:27:15,340 So you've got 8, that's 64 ops per cycle times 3.2. 536 00:27:15,340 --> 00:27:20,040 You get up to 200 gigaflops, 204.8. 537 00:27:20,040 --> 00:27:23,970 So that's really the main reason. 538 00:27:23,970 --> 00:27:25,650 We've talked about this stuff here. 539 00:27:25,650 --> 00:27:29,810 This is an image of why it's faster.
540 00:27:29,810 --> 00:27:32,160 Instead of staging and bringing the data through the 541 00:27:32,160 --> 00:27:34,740 L2, which is kind of what we were just discussing and 542 00:27:34,740 --> 00:27:39,220 having this PU, this processing unit, the PPE 543 00:27:39,220 --> 00:27:42,640 manage the data coming in, each one can do it themselves 544 00:27:42,640 --> 00:27:45,030 and bypass this bottleneck. 545 00:27:45,030 --> 00:27:47,410 So that's something you have to keep in the back of your 546 00:27:47,410 --> 00:27:48,380 mind when you're programming. 547 00:27:48,380 --> 00:27:52,140 You really want to make sure that you get this processor 548 00:27:52,140 --> 00:27:52,720 out of there. 549 00:27:52,720 --> 00:27:54,230 You don't want it in your way. 550 00:27:54,230 --> 00:27:56,540 Let these guys do as much of their own work as they can. 551 00:27:59,780 --> 00:28:03,030 Here's a comparison of theoretical peak performance 552 00:28:03,030 --> 00:28:08,200 of cell versus Freescale, AMD, Intel over here. 553 00:28:08,200 --> 00:28:08,720 Very nice. 554 00:28:08,720 --> 00:28:11,170 That's the wow chart. 555 00:28:11,170 --> 00:28:15,860 That's the theoretical peak; in practice, what did we see? 556 00:28:15,860 --> 00:28:18,410 I don't know if you can read these numbers but what you 557 00:28:18,410 --> 00:28:20,750 really want to focus on is the first and last columns. 558 00:28:20,750 --> 00:28:23,460 This is the type of calculation, high performance 559 00:28:23,460 --> 00:28:26,470 computing like matrix multiplication, 560 00:28:26,470 --> 00:28:28,910 bioinformatics, graphics, security, it was really 561 00:28:28,910 --> 00:28:31,150 designed for graphics. 562 00:28:31,150 --> 00:28:33,850 Security, communication, video processing and over here you 563 00:28:33,850 --> 00:28:40,470 see the advantage against an IA-32, a G5 processor. 564 00:28:40,470 --> 00:28:46,510 And you see 8x, 12x, 15, 10, 18x.
565 00:28:46,510 --> 00:28:48,270 Very considerable improvement in performance. 566 00:28:48,270 --> 00:28:49,557 In the back-- question? 567 00:28:49,557 --> 00:28:51,841 AUDIENCE: [UNINTELLIGIBLE] previous slide, how did it 568 00:28:51,841 --> 00:28:55,140 compare to high [UNINTELLIGIBLE PHRASE]? 569 00:28:55,140 --> 00:28:57,020 MICHAEL PERRONE: All right, so you're thinking like a 570 00:28:57,020 --> 00:28:58,833 PeakStream or something like that? 571 00:28:58,833 --> 00:29:01,400 AUDIENCE: Any particular [UNINTELLIGIBLE PHRASE]. 572 00:29:01,400 --> 00:29:05,506 The design of the SPUs is very reminiscent of 573 00:29:05,506 --> 00:29:06,860 [UNINTELLIGIBLE PHRASE]. 574 00:29:06,860 --> 00:29:11,480 MICHAEL PERRONE: So I believe, and I'm not well versed in all 575 00:29:11,480 --> 00:29:12,560 of the processors that are out there. 576 00:29:12,560 --> 00:29:14,090 I think that we still have a performance 577 00:29:14,090 --> 00:29:17,850 advantage in that space. 578 00:29:17,850 --> 00:29:19,260 You know, I don't know about Xilinx and 579 00:29:19,260 --> 00:29:20,490 those kind of things-- 580 00:29:20,490 --> 00:29:25,850 FPGAs I don't know, but what I tell people is 581 00:29:25,850 --> 00:29:26,890 this: there's a spectrum. 582 00:29:26,890 --> 00:29:29,150 And at one end you have your general purpose processors. 583 00:29:29,150 --> 00:29:32,390 You've got your Intel, you've got your Opteron whatever, 584 00:29:32,390 --> 00:29:33,540 your power processor. 585 00:29:33,540 --> 00:29:37,410 And then at the other end you've got your FPGAs and DSPs 586 00:29:37,410 --> 00:29:39,960 and then maybe over here, somewhere in the middle you've 587 00:29:39,960 --> 00:29:42,230 got graphical processing units. 588 00:29:42,230 --> 00:29:43,970 Like Nvidia kind of things.
589 00:29:43,970 --> 00:29:47,210 And then somewhere between those graphics 590 00:29:47,210 --> 00:29:49,060 processors and the general purpose 591 00:29:49,060 --> 00:29:52,360 processors you've got cell. 592 00:29:52,360 --> 00:29:57,040 You get a significant improvement in performance, 593 00:29:57,040 --> 00:29:59,340 but you have to pay some pain in programming. 594 00:29:59,340 --> 00:30:01,350 But not nearly as much as you have to do with the graphics 595 00:30:01,350 --> 00:30:06,150 processors and nowhere near the FPGAs, where just 596 00:30:06,150 --> 00:30:08,220 every time you write something you have to rewrite 597 00:30:08,220 --> 00:30:10,980 everything. 598 00:30:10,980 --> 00:30:11,520 Question? 599 00:30:11,520 --> 00:30:13,848 AUDIENCE: Somewhat related to the previous question, but 600 00:30:13,848 --> 00:30:16,253 with a different angle. 601 00:30:16,253 --> 00:30:19,540 I always figured anyone could do a [INAUDIBLE], so that's 602 00:30:19,540 --> 00:30:21,010 why I ask about FFTs. 603 00:30:21,010 --> 00:30:25,590 Are they captured on the front or otherwise [UNINTELLIGIBLE] 604 00:30:25,590 --> 00:30:27,640 MICHAEL PERRONE: Yeah, so this is actually one of the things 605 00:30:27,640 --> 00:30:29,660 I spent a lot of time on for FFTs. 606 00:30:29,660 --> 00:30:32,750 I spent a lot of time with the petroleum industry. 607 00:30:32,750 --> 00:30:36,590 They take these enormous boats, they have these arrays 608 00:30:36,590 --> 00:30:39,460 that go 5 kilometers back and 1 kilometer wide, they drag 609 00:30:39,460 --> 00:30:41,800 them over the ocean, and they make these noises and they 610 00:30:41,800 --> 00:30:43,240 record the echo. 611 00:30:43,240 --> 00:30:45,010 And they have to do this enormous FFT and it 612 00:30:45,010 --> 00:30:47,580 takes them 6 months. 613 00:30:47,580 --> 00:30:49,690 Depending on the size of the FFT it can be anywhere from a 614 00:30:49,690 --> 00:30:51,665 week to 6 months, literally.
615 00:30:51,665 --> 00:30:52,270 AUDIENCE: [UNINTELLIGIBLE]. 616 00:30:52,270 --> 00:30:52,860 MICHAEL PERRONE: Sorry? 617 00:30:52,860 --> 00:30:55,740 AUDIENCE: Is this a PD FFT? 618 00:30:55,740 --> 00:31:00,690 MICHAEL PERRONE: Sometimes I do too, but they do both. 619 00:31:00,690 --> 00:31:03,250 I've become somewhat of an expert on these FFTs. 620 00:31:03,250 --> 00:31:06,610 For cell the best performance number I know of is about 90 621 00:31:06,610 --> 00:31:08,390 gigaflops of FFT performance. 622 00:31:11,960 --> 00:31:14,630 You know, that's very good. 623 00:31:14,630 --> 00:31:17,590 Yeah, it's like 50% of peak performance. 624 00:31:17,590 --> 00:31:21,320 You know, it's easy to get 98% with [? lynpacker ?] 625 00:31:21,320 --> 00:31:22,890 or [? djem ?] 626 00:31:22,890 --> 00:31:28,320 on a processor like this, and we have. We get 97% of peak 627 00:31:28,320 --> 00:31:31,845 performance, but it's a lot harder to get FFTs up to that. 628 00:31:31,845 --> 00:31:34,005 AUDIENCE: Well, then I'll [INAUDIBLE] the next question 629 00:31:34,005 --> 00:31:36,529 then, which is somehow or another you get the FFT 630 00:31:36,529 --> 00:31:39,435 performance, you've got to get the data at the right 631 00:31:39,435 --> 00:31:39,535 place at the right time. 632 00:31:39,535 --> 00:31:39,560 [UNINTELLIGIBLE] 633 00:31:39,560 --> 00:31:42,940 So you've personally done that or been involved with that? 634 00:31:42,940 --> 00:31:44,560 MICHAEL PERRONE: Right, so we do a lot of tricks. 635 00:31:44,560 --> 00:31:47,080 I can show you another slide or another presentation that 636 00:31:47,080 --> 00:31:51,880 we talk about this, but typically the FFTs that we 637 00:31:51,880 --> 00:31:58,920 work with are somewhere from a 1024 to 2048, that's square.
638 00:31:58,920 --> 00:32:04,700 And so it's possible to take say, the top 4 rows-- 639 00:32:04,700 --> 00:32:08,540 in the case of 1024, four rows complex, single precision I 640 00:32:08,540 --> 00:32:11,300 think is 16 kilobytes. 641 00:32:11,300 --> 00:32:13,340 That fits into the local store very nicely. 642 00:32:13,340 --> 00:32:14,690 So you can start multibuffering. 643 00:32:14,690 --> 00:32:16,620 You bring in one, you start computing on it. 644 00:32:16,620 --> 00:32:19,530 While you're computing on those 4 in a SIMD fashion 645 00:32:19,530 --> 00:32:21,530 across the SIMD registers you're 646 00:32:21,530 --> 00:32:22,900 bringing in the next one. 647 00:32:22,900 --> 00:32:24,670 And then when that one's finished you're writing that 648 00:32:24,670 --> 00:32:26,840 one out while you're computing on the one that arrived and 649 00:32:26,840 --> 00:32:28,140 while you're getting the next one. 650 00:32:28,140 --> 00:32:33,760 And since you can get the entire 1024 or 2000 into local 651 00:32:33,760 --> 00:32:38,600 store, you're only 6 cycles away from any element in it. 652 00:32:38,600 --> 00:32:41,470 So it's much, much faster. 653 00:32:41,470 --> 00:32:45,610 We also did the 16 million element FFT. 654 00:32:48,120 --> 00:32:52,550 1D, yeah and we did some tricks there to make it 655 00:32:52,550 --> 00:32:53,980 efficient, but it was a lot slower. 656 00:32:56,810 --> 00:32:59,180 AUDIENCE: [UNINTELLIGIBLE PHRASE] 657 00:32:59,180 --> 00:33:01,156 would have to be a lot slower by the need for the problem. 658 00:33:01,156 --> 00:33:03,970 [UNINTELLIGIBLE PHRASE] 659 00:33:03,970 --> 00:33:05,870 MICHAEL PERRONE: What I remember is it was fifteen times 660 00:33:05,870 --> 00:33:08,660 faster than a POWER5. 661 00:33:12,970 --> 00:33:16,160 It might have been a POWER4, I don't remember, sorry. 662 00:33:22,010 --> 00:33:22,716 I might want to skip this one. 663 00:33:22,716 --> 00:33:25,436 I think I'm going to skip this one.
664 00:33:25,436 --> 00:33:27,340 AUDIENCE: [UNINTELLIGIBLE PHRASE] 665 00:33:27,340 --> 00:33:28,590 MICHAEL PERRONE: Right. 666 00:33:32,330 --> 00:33:34,360 Let's talk about what is the cell good for. 667 00:33:34,360 --> 00:33:36,935 You kind of have a sense of the architecture and how it 668 00:33:36,935 --> 00:33:38,510 all fits together. 669 00:33:38,510 --> 00:33:41,690 You may have some sense of the gotchas and the problems that 670 00:33:41,690 --> 00:33:44,300 might be there, but what have we actually applied it to? 671 00:33:44,300 --> 00:33:48,120 I mean you saw some of that here. 672 00:33:48,120 --> 00:33:52,405 Here's a list of things that either we've already proven to 673 00:33:52,405 --> 00:33:56,460 ourselves that it works well or we're very confident that it 674 00:33:56,460 --> 00:33:58,320 works well or we're working to demonstrate 675 00:33:58,320 --> 00:33:59,570 that it works well. 676 00:34:01,700 --> 00:34:04,280 Signal processing, image processing, audio resampling, 677 00:34:04,280 --> 00:34:04,990 noise generation. 678 00:34:04,990 --> 00:34:06,920 I mean, you can read through this list, there's a long 679 00:34:06,920 --> 00:34:11,010 list. And I guess there are a few characteristics that 680 00:34:11,010 --> 00:34:14,030 really make it suitable for cell. 681 00:34:14,030 --> 00:34:16,460 Things that are in single precision because you've got 682 00:34:16,460 --> 00:34:20,210 200 gigaflops single and only 20 of double, but that will 683 00:34:20,210 --> 00:34:23,360 change as I mentioned. 684 00:34:23,360 --> 00:34:26,580 Things that are streaming, streaming through and so 685 00:34:26,580 --> 00:34:29,770 signal processing is ideal, where the data comes through 686 00:34:29,770 --> 00:34:32,350 and you do your compute and then you throw it away or you 687 00:34:32,350 --> 00:34:33,830 give out your results and you throw it away. 688 00:34:33,830 --> 00:34:35,080 Those are good.
689 00:34:39,770 --> 00:34:42,410 And things that are compute intensive, where you bring the 690 00:34:42,410 --> 00:34:44,590 data in and you're going to crunch on it for a long time, 691 00:34:44,590 --> 00:34:48,170 so things like cryptography where you're either generating 692 00:34:48,170 --> 00:34:53,320 something from a key and there's virtually no input. 693 00:34:53,320 --> 00:34:57,120 You're just generating streams of random numbers that's very 694 00:34:57,120 --> 00:34:58,160 well suited for this thing. 695 00:34:58,160 --> 00:34:59,770 You see FFTs listed here. 696 00:35:02,630 --> 00:35:04,210 TCP/IP offload. 697 00:35:04,210 --> 00:35:06,790 I didn't put that there. 698 00:35:06,790 --> 00:35:11,590 There's actually a problem with cell today that we're 699 00:35:11,590 --> 00:35:15,330 working to fix that the TCP/IP performance is not very good. 700 00:35:15,330 --> 00:35:19,970 And so what I tell people to use is Open MPI. 701 00:35:19,970 --> 00:35:23,450 You know, so run that over InfiniBand. 702 00:35:23,450 --> 00:35:26,930 The PPE processor really doesn't have the horsepower 703 00:35:26,930 --> 00:35:29,750 to drive a full TCP/IP stack. 704 00:35:29,750 --> 00:35:33,300 I'm not sure it has the horsepower to do a full MPI stack 705 00:35:33,300 --> 00:35:36,680 either, but at least you have more control in that case. 706 00:35:42,130 --> 00:35:45,170 The game physics, physical simulations-- 707 00:35:45,170 --> 00:35:47,130 I can show you a demo, but I don't know that we'll have 708 00:35:47,130 --> 00:35:50,500 time where a company called RapidMind, which is 709 00:35:50,500 --> 00:35:55,380 developing software to ease programmability for cell. 710 00:35:55,380 --> 00:35:57,980 Basically you take your existing scalar code and you 711 00:35:57,980 --> 00:36:03,010 instrument it with C++ classes that are kind of SPE aware.
712 00:36:03,010 --> 00:36:07,650 And by doing that, just write your scalar code and you get 713 00:36:07,650 --> 00:36:11,010 the SPE performance advantage. 714 00:36:11,010 --> 00:36:12,470 They have this wonderful demo of these chickens. 715 00:36:12,470 --> 00:36:15,770 They've got 16,000 chickens in a chicken yard. 716 00:36:15,770 --> 00:36:18,870 You know, the chicken yard has varying topologies and the 717 00:36:18,870 --> 00:36:22,310 chickens move around and all 16,000 are being processed in 718 00:36:22,310 --> 00:36:24,470 real time with a single cell processor. 719 00:36:24,470 --> 00:36:30,080 In fact, the Nvidia card that was used to render that 720 00:36:30,080 --> 00:36:33,480 couldn't keep up with what was coming out of the SPEs. 721 00:36:33,480 --> 00:36:34,470 We were impressed with that. 722 00:36:34,470 --> 00:36:35,000 We're happy with that. 723 00:36:35,000 --> 00:36:37,710 We showed it around at the game conferences and the 724 00:36:37,710 --> 00:36:40,300 gamers saw all these chickens and were like, 725 00:36:40,300 --> 00:36:40,630 this is really cool. 726 00:36:40,630 --> 00:36:41,880 How do I shoot them? 727 00:36:44,260 --> 00:36:45,740 So we said, you can't. 728 00:36:45,740 --> 00:36:48,250 But maybe in the next version. 729 00:36:48,250 --> 00:36:51,780 But the idea is that we've designed this so that it can 730 00:36:51,780 --> 00:36:55,050 do physical simulations, and this is maybe an entree for 731 00:36:55,050 --> 00:36:56,740 some of you people when you're doing your stuff. 732 00:36:56,740 --> 00:36:58,680 I don't know what kinds of things you want to try to do 733 00:36:58,680 --> 00:37:02,630 on cell, but I've seen people do lots of things that really 734 00:37:02,630 --> 00:37:04,430 have no business doing well on cell and they 735 00:37:04,430 --> 00:37:05,240 did very, very well. 736 00:37:05,240 --> 00:37:08,010 Like pointer chasing. 737 00:37:13,260 --> 00:37:14,100 I'm trying to remember.
738 00:37:14,100 --> 00:37:15,230 There are two pieces of work. 739 00:37:15,230 --> 00:37:22,860 One was done by Fabrizio Petrini at PNNL and he did a graph 740 00:37:22,860 --> 00:37:24,690 traversal algorithm. 741 00:37:24,690 --> 00:37:29,340 It was very much random access and he was able to parallelize 742 00:37:29,340 --> 00:37:31,120 that very nicely on Cell. 743 00:37:31,120 --> 00:37:34,900 And then there was another guy at Georgia Tech who did 744 00:37:34,900 --> 00:37:37,010 something similar for linked lists. 745 00:37:37,010 --> 00:37:41,170 And you know, I expect things to work well on cell if 746 00:37:41,170 --> 00:37:44,310 they're streaming and they have very compute intensive 747 00:37:44,310 --> 00:37:46,870 kernels that are working on things, but those are two 748 00:37:46,870 --> 00:37:50,600 examples where they're not very compute intensive and 749 00:37:50,600 --> 00:37:51,350 not very streaming. 750 00:37:51,350 --> 00:37:54,710 They're kind of random access and they work very well. 751 00:37:54,710 --> 00:37:56,410 Over here, target applications. 752 00:37:56,410 --> 00:37:58,980 There are lots of areas where we're trying 753 00:37:58,980 --> 00:38:02,260 to push cell forward. 754 00:38:02,260 --> 00:38:04,110 Clearly it works in the gaming industry, but 755 00:38:04,110 --> 00:38:04,960 where else can it work? 756 00:38:04,960 --> 00:38:08,360 So medical imaging, there's a lot of success there. 757 00:38:08,360 --> 00:38:11,580 The seismic imaging for petroleum, aerospace and 758 00:38:11,580 --> 00:38:13,080 defense for radar and sonar-- 759 00:38:13,080 --> 00:38:16,190 these are all signal processing apps. 760 00:38:16,190 --> 00:38:18,510 We're also looking at digital content creation 761 00:38:18,510 --> 00:38:20,220 for computer animation. 762 00:38:20,220 --> 00:38:21,470 Very well suited for cell. 763 00:38:24,470 --> 00:38:28,040 This is kind of just what I just said. 764 00:38:28,040 --> 00:38:29,380 Did I leave out anything?
765 00:38:29,380 --> 00:38:33,870 Finance-- once we have double precision we'll be doing 766 00:38:33,870 --> 00:38:35,700 things with finance. 767 00:38:35,700 --> 00:38:37,940 We actually demonstrated that things work very well. 768 00:38:37,940 --> 00:38:41,690 You know, Metropolis algorithms, Monte Carlo, Black- 769 00:38:41,690 --> 00:38:43,620 Scholes algorithms if you're familiar with these kinds of 770 00:38:43,620 --> 00:38:47,140 things from finance. 771 00:38:47,140 --> 00:38:48,960 They tell us they need double precision and we're like, you 772 00:38:48,960 --> 00:38:51,240 don't really need double precision, come on. 773 00:38:51,240 --> 00:38:56,460 I mean, what you have is some mathematical calculation that 774 00:38:56,460 --> 00:38:57,900 you're doing and you're doing it over and over and over. 775 00:38:57,900 --> 00:39:00,190 And in Monte Carlo there's so much noise, we say to these 776 00:39:00,190 --> 00:39:01,150 people, why do you need double precision? 777 00:39:01,150 --> 00:39:06,040 It turns out with decimal notation you can only go up to 778 00:39:06,040 --> 00:39:08,730 like a billion or something in single precision. 779 00:39:08,730 --> 00:39:11,180 So they have more dollars than that, so they need double, for 780 00:39:11,180 --> 00:39:13,060 that reason alone. 781 00:39:13,060 --> 00:39:15,620 But this gets back to the sloppiness of programmers. 782 00:39:15,620 --> 00:39:18,030 And I'm guilty of this myself. 783 00:39:18,030 --> 00:39:18,910 They said, oh we have double. 784 00:39:18,910 --> 00:39:19,930 Let's use double. 785 00:39:19,930 --> 00:39:21,720 They didn't need to, but they did it anyway. 786 00:39:21,720 --> 00:39:24,990 And now their legacy code is stuck with double. 787 00:39:24,990 --> 00:39:28,840 They could convert it all to single, but it's too painful. 788 00:39:28,840 --> 00:39:32,410 Down on Wall Street to build a new data center is like $100 789 00:39:32,410 --> 00:39:34,090 million proposition.
790 00:39:34,090 --> 00:39:37,330 And they do it regularly, all of the banks. 791 00:39:37,330 --> 00:39:40,050 They'll be generating a new data center every year, 792 00:39:40,050 --> 00:39:43,700 sometimes multiple times a year and they just don't have 793 00:39:43,700 --> 00:39:47,010 time or the resources to go through and redo all their 794 00:39:47,010 --> 00:39:49,270 code to make it run on something like cell. 795 00:39:49,270 --> 00:39:54,690 So we're making double precision cell. 796 00:39:54,690 --> 00:39:56,170 That's the short of it. 797 00:39:56,170 --> 00:40:00,210 All right, now software environment. 798 00:40:00,210 --> 00:40:03,640 This is stuff that you can find on the web and actually, 799 00:40:03,640 --> 00:40:06,260 it's changing a lot lately because we just 800 00:40:06,260 --> 00:40:09,430 released the 2.0 SDK. 801 00:40:09,430 --> 00:40:12,950 And so the stuff that's in the slide might not actually be 802 00:40:12,950 --> 00:40:16,480 the latest and greatest, but it's going to be epsilon away, 803 00:40:16,480 --> 00:40:17,970 so don't worry about it too much. 804 00:40:17,970 --> 00:40:20,020 But you really shouldn't trust these slides, you should go to 805 00:40:20,020 --> 00:40:23,300 the website and the website you want to go to is 806 00:40:23,300 --> 00:40:26,981 www.ibm.com/alphaworks. 807 00:40:26,981 --> 00:40:30,100 PROFESSOR: Tomorrow we are going to have a recitation 808 00:40:30,100 --> 00:40:32,310 session talking about the environment 809 00:40:32,310 --> 00:40:33,960 that we have created. 810 00:40:33,960 --> 00:40:36,500 I think we've probably just set up the latest 811 00:40:36,500 --> 00:40:39,815 environment and then we'll increase it through the three 812 00:40:39,815 --> 00:40:41,000 weeks we've got. 813 00:40:41,000 --> 00:40:44,180 This is changing faster than a three week cycle.
814 00:40:44,180 --> 00:40:45,430 So [UNINTELLIGIBLE PHRASE] 815 00:40:47,590 --> 00:40:51,620 So this will give you a preview of what's going to be. 816 00:40:51,620 --> 00:40:52,460 MICHAEL PERRONE: Then you go to alphaworks, you go to 817 00:40:52,460 --> 00:40:55,510 search on alphaworks for cell and you get more information 818 00:40:55,510 --> 00:40:57,810 than you could ever possibly read. 819 00:40:57,810 --> 00:41:01,370 We have a programmer's manual that's 900 pages long, it's 820 00:41:01,370 --> 00:41:04,260 really good reading. 821 00:41:04,260 --> 00:41:07,730 Actually there's one thing in those 800, 900 pages 822 00:41:07,730 --> 00:41:08,460 that you really should read. 823 00:41:08,460 --> 00:41:10,600 It's called the cell programming tips chapter. 824 00:41:10,600 --> 00:41:14,450 It's a really nice chapter. 825 00:41:14,450 --> 00:41:17,140 But there are many, many publications and things like 826 00:41:17,140 --> 00:41:23,110 that, more than just the SDK in the OS and whatnot, so I 827 00:41:23,110 --> 00:41:25,410 encourage you to look at that. 828 00:41:25,410 --> 00:41:28,430 All right, so this is kind of the pyramid, the 829 00:41:28,430 --> 00:41:29,520 cell software pyramid. 830 00:41:29,520 --> 00:41:32,990 We've got the standards under here, the application binary 831 00:41:32,990 --> 00:41:36,710 interface, language extensions. 832 00:41:36,710 --> 00:41:39,380 And over here we have development tools and we'll 833 00:41:39,380 --> 00:41:42,130 talk about each of these pieces briefly. 834 00:41:45,080 --> 00:41:49,350 These specifications define what's actually the reference 835 00:41:49,350 --> 00:41:52,030 implementation for the cell. 836 00:41:52,030 --> 00:41:56,480 C++ and C, they have language extensions in a similar way 837 00:41:56,480 --> 00:42:01,090 to the extensions for VMX, or SSE on Intel.
838 00:42:01,090 --> 00:42:05,000 You have C extensions for cell that allow you to use 839 00:42:05,000 --> 00:42:12,200 intrinsics that actually run as SIMD instructions on cell. 840 00:42:12,200 --> 00:42:15,540 For example, you can say SPU underscore mul-add, and it's 841 00:42:15,540 --> 00:42:17,670 going to do a vector mul-add. 842 00:42:17,670 --> 00:42:24,060 So you can get assembly language level control over 843 00:42:24,060 --> 00:42:28,390 your code without having to use any assembly language. 844 00:42:28,390 --> 00:42:30,890 And then there's that. 845 00:42:30,890 --> 00:42:34,180 There is a full system simulator. 846 00:42:34,180 --> 00:42:40,050 The simulator is very, very accurate for things that do 847 00:42:40,050 --> 00:42:43,040 not run out to main memory. 848 00:42:43,040 --> 00:42:44,910 They've been working to improve this so I don't know 849 00:42:44,910 --> 00:42:47,810 if recently they have made it more accurate, but if you're 850 00:42:47,810 --> 00:42:52,090 doing compute intensive stuff, if you're compute bound the 851 00:42:52,090 --> 00:42:55,000 simulator can give you accuracies within 99%. 852 00:42:55,000 --> 00:42:58,120 You know, within 1% of the real value. 853 00:42:58,120 --> 00:43:02,050 I've only seen one thing on the simulator more than 1% off 854 00:43:02,050 --> 00:43:04,930 and that was 4%, so the simulator is very-- excuse 855 00:43:04,930 --> 00:43:06,220 me-- very reliable. 856 00:43:06,220 --> 00:43:08,260 And I encourage you to use it if you can't 857 00:43:08,260 --> 00:43:09,510 get access to hardware. 858 00:43:12,600 --> 00:43:14,240 What else? 859 00:43:14,240 --> 00:43:16,710 The simulator has all kinds of tools in there. 860 00:43:16,710 --> 00:43:21,820 And I'm not going to go through the software stack in 861 00:43:21,820 --> 00:43:23,070 simulation. 862 00:43:31,280 --> 00:43:33,090 This gives you a sense for-- 863 00:43:33,090 --> 00:43:35,330 you've got your hardware running here. 
864 00:43:35,330 --> 00:43:38,280 You can run this on any one of these platforms. Power PC, 865 00:43:38,280 --> 00:43:42,910 Intel with these OS's. 866 00:43:42,910 --> 00:43:46,560 The whole thing is written in TCL, the simulator. 867 00:43:46,560 --> 00:43:48,930 And it has all these kinds of simulators. 868 00:43:48,930 --> 00:43:54,300 It's simulating the DMAs, it's simulating the caches and then 869 00:43:54,300 --> 00:43:56,300 you get a graphical user interface and a command line 870 00:43:56,300 --> 00:43:58,590 interface to that simulator. 871 00:43:58,590 --> 00:44:01,940 The graphical user interface is convenient, but the command 872 00:44:01,940 --> 00:44:03,160 line gives you much more control. 873 00:44:03,160 --> 00:44:04,860 You can tweak parameters. 874 00:44:09,790 --> 00:44:14,850 This gives you a view of what the graphical 875 00:44:14,850 --> 00:44:17,600 user interface looks like. 876 00:44:17,600 --> 00:44:19,660 It says mambo zebra because that was a different project, 877 00:44:19,660 --> 00:44:21,360 but now it'd probably say system sim or 878 00:44:21,360 --> 00:44:23,780 something like that. 879 00:44:23,780 --> 00:44:26,040 And you'll see the PPC-- 880 00:44:26,040 --> 00:44:28,190 this is the PPE-- I don't know why they changed it. 881 00:44:28,190 --> 00:44:32,090 And then you have SPE 0, SPE 1 going down and it 882 00:44:32,090 --> 00:44:35,240 gives you some access to these parameters. 883 00:44:35,240 --> 00:44:41,310 The model here, it says pipeline and then there's, I 884 00:44:41,310 --> 00:44:43,090 think, functional mode or pipeline mode. 885 00:44:43,090 --> 00:44:45,570 Pipeline mode is where it's really simulating everything 886 00:44:45,570 --> 00:44:47,280 and it's much slower. 887 00:44:47,280 --> 00:44:48,760 But it's accurate. 888 00:44:48,760 --> 00:44:50,590 And then the other is functional mode, just to test 889 00:44:50,590 --> 00:44:51,960 that the code actually works as it's supposed to.
890 00:44:51,960 --> 00:44:55,136 PROFESSOR: I guess at one point in the class, what we'll try 891 00:44:55,136 --> 00:44:58,340 and do is, since each group has access to the hardware, 892 00:44:58,340 --> 00:45:01,930 you can do most of the things in the real hardware and use 893 00:45:01,930 --> 00:45:03,430 the debugger in the hardware that's 894 00:45:03,430 --> 00:45:04,300 probably been talked about. 895 00:45:04,300 --> 00:45:07,950 But if things get really bad and you can't understand what's 896 00:45:07,950 --> 00:45:11,030 going on, use the simulator as a very accurate debugger-- only 897 00:45:11,030 --> 00:45:13,250 when it's needed-- because there you can look at every 898 00:45:13,250 --> 00:45:14,870 little detail inside. 899 00:45:14,870 --> 00:45:17,980 This is kind of a last resort type thing. 900 00:45:17,980 --> 00:45:19,930 MICHAEL PERRONE: Yeah, I agree. 901 00:45:19,930 --> 00:45:21,390 Like I said, I've been doing this for three years. 902 00:45:21,390 --> 00:45:23,590 Three years ago we didn't even have hardware. 903 00:45:23,590 --> 00:45:27,120 So the simulator was all we had, so we relied on it a lot. 904 00:45:27,120 --> 00:45:29,880 But I think that usage of it makes a lot of sense. 905 00:45:33,550 --> 00:45:34,900 This is the graphical interface. 906 00:45:34,900 --> 00:45:36,720 You know, it's just a Tcl interface. 907 00:45:41,240 --> 00:45:42,440 I'm going to skip through these things. 908 00:45:42,440 --> 00:45:47,350 It just shows you how you can look at memory with the 909 00:45:47,350 --> 00:45:48,970 memory access view. 910 00:45:48,970 --> 00:45:49,830 You get some graphical 911 00:45:49,830 --> 00:45:51,630 representation of various pieces. 912 00:45:51,630 --> 00:45:52,660 You know, how many stalls? 913 00:45:52,660 --> 00:45:53,740 How many loads? 914 00:45:53,740 --> 00:45:55,590 How many DMA transactions? 915 00:45:55,590 --> 00:45:57,320 So you can see what's going on at that level.
916 00:46:00,270 --> 00:46:02,090 And all of this can be pulled together into 917 00:46:02,090 --> 00:46:05,240 this UART window here. 918 00:46:05,240 --> 00:46:09,680 OK, so the Linux, it's pretty standard Linux, but it has 919 00:46:09,680 --> 00:46:12,410 some extensions. 920 00:46:12,410 --> 00:46:14,820 Let's see. 921 00:46:14,820 --> 00:46:16,930 Provided as a patch, yeah. 922 00:46:16,930 --> 00:46:17,730 That might be wrong. 923 00:46:17,730 --> 00:46:21,490 I don't know where we are currently. 924 00:46:21,490 --> 00:46:24,980 You have this SPE thread API for creating 925 00:46:24,980 --> 00:46:28,020 threads from the PPEs. 926 00:46:28,020 --> 00:46:30,850 Let's see. 927 00:46:30,850 --> 00:46:32,330 What do I want to tell you here? 928 00:46:32,330 --> 00:46:35,680 There's a better slide for this kind of information. 929 00:46:35,680 --> 00:46:39,220 They share the memory space, we talked about that. 930 00:46:39,220 --> 00:46:41,830 There's error event and signal handling. 931 00:46:41,830 --> 00:46:45,630 So there are multiple ways you communicate. 932 00:46:45,630 --> 00:46:50,030 You can communicate with the interrupts and the event and 933 00:46:50,030 --> 00:46:53,770 signaling that way or you can use these mailboxes. 934 00:46:53,770 --> 00:46:56,640 So each SPE has its own mailbox and inbox and an 935 00:46:56,640 --> 00:46:59,750 outbox so you can post something to your outbox and 936 00:46:59,750 --> 00:47:01,770 then the PPE will read it when it's ready. 937 00:47:01,770 --> 00:47:05,030 Or you can read from your inbox waiting on the PPE to 938 00:47:05,030 --> 00:47:05,790 write something. 939 00:47:05,790 --> 00:47:07,960 You have to be careful because you can stall there. 940 00:47:07,960 --> 00:47:11,970 If the PPE hasn't written you will stall waiting for 941 00:47:11,970 --> 00:47:12,770 something to fill up. 942 00:47:12,770 --> 00:47:14,460 So you can do a check. 
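That check-before-read gotcha can be sketched in plain C. This is a toy model, not the real mailbox API: the struct and function names are invented, and on real hardware you would poll a mailbox status channel before issuing a blocking read.

```c
/* Toy model of an SPE inbox: a value slot plus a count of unread
 * entries that you can poll.  Invented for illustration only. */
typedef struct {
    unsigned int value;
    int          count;   /* number of unread entries */
} toy_mailbox;

/* Non-blocking read: returns 1 and stores the value if an entry is
 * available; returns 0 instead of stalling when the box is empty,
 * which is what a blocking read would do on the real hardware. */
int mbox_try_read(toy_mailbox *mb, unsigned int *out) {
    if (mb->count == 0)
        return 0;          /* empty: a blocking read would stall here */
    *out = mb->value;
    mb->count--;
    return 1;
}
```

The caller can then do useful work and retry later, rather than sitting stalled waiting for the PPE to write.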
943 00:47:14,460 --> 00:47:16,150 There are ways to get around that, but these are kind of 944 00:47:16,150 --> 00:47:18,040 common gotchas that you have to watch out for. 945 00:47:22,410 --> 00:47:25,360 Then you have the mailboxes, you have the interrupts, you 946 00:47:25,360 --> 00:47:26,100 also have DMAs. 947 00:47:26,100 --> 00:47:28,300 You can do communication with DMAs so you have at least 948 00:47:28,300 --> 00:47:29,900 three different ways that you communicate 949 00:47:29,900 --> 00:47:33,580 between the SPEs on cell. 950 00:47:33,580 --> 00:47:37,250 And which one is going to be best really depends on the 951 00:47:37,250 --> 00:47:40,050 algorithm you're running. 952 00:47:40,050 --> 00:47:42,330 So these are the extensions to Linux. 953 00:47:42,330 --> 00:47:43,800 This is going to show you a bunch of things that you 954 00:47:43,800 --> 00:47:46,800 probably won't be able to read, but there's something 955 00:47:46,800 --> 00:47:51,580 called SPUFS, the file system that has a bunch of open, 956 00:47:51,580 --> 00:47:53,900 read, write, and close functionality. 957 00:47:57,450 --> 00:48:01,630 And then we also have this signaling and the mailboxes 958 00:48:01,630 --> 00:48:03,650 that I mentioned to you previously. 959 00:48:03,650 --> 00:48:04,870 And this you can't even read. 960 00:48:04,870 --> 00:48:05,850 I can't even read this one. 961 00:48:05,850 --> 00:48:08,300 What is it? 962 00:48:08,300 --> 00:48:10,060 Ah, this is perhaps the most important one. 963 00:48:10,060 --> 00:48:13,790 It says SPU create thread. 964 00:48:13,790 --> 00:48:19,370 So the SPEs from the Linux point of view are just threads 965 00:48:19,370 --> 00:48:20,440 that are running. 
966 00:48:20,440 --> 00:48:23,290 The Linux doesn't really know that they're special purpose 967 00:48:23,290 --> 00:48:25,890 hardware, it just knows it's a thread and you can do things 968 00:48:25,890 --> 00:48:29,775 like spawn a thread, kill a thread, wait on a thread-- all 969 00:48:29,775 --> 00:48:33,490 the usual things that you can do with threads. 970 00:48:33,490 --> 00:48:34,970 So it's a lot like P threads, but it's 971 00:48:34,970 --> 00:48:36,980 not actually P threads. 972 00:48:36,980 --> 00:48:40,590 So here you could see these things are more useful. 973 00:48:40,590 --> 00:48:42,710 This is SPE create groups. 974 00:48:42,710 --> 00:48:46,370 So you can create a thread and thread group so that threads 975 00:48:46,370 --> 00:48:49,200 that are part of the same group know about one another. 976 00:48:49,200 --> 00:48:51,620 So you can partition your system and have three SPEs 977 00:48:51,620 --> 00:48:53,740 doing one thing and five doing another. 978 00:48:53,740 --> 00:48:56,060 So that you can split it up however you like. 979 00:48:56,060 --> 00:48:58,940 You have get and set affinity so that you can choose which 980 00:48:58,940 --> 00:49:01,750 SPEs are running which tasks, so that you can get more 981 00:49:01,750 --> 00:49:05,800 efficient use of that element interconnect bus. 982 00:49:05,800 --> 00:49:10,260 Kill and waits, open, close, writing signals, the usual. 983 00:49:15,110 --> 00:49:17,490 Let me check my time here. 984 00:49:17,490 --> 00:49:22,410 I really don't have a lot more time, so I'm going to say that 985 00:49:22,410 --> 00:49:24,030 we have this thread management library. 986 00:49:24,030 --> 00:49:26,660 It has the functionality that I just mentioned. 987 00:49:26,660 --> 00:49:28,470 In the next month or so you're going to go through that in a 988 00:49:28,470 --> 00:49:29,990 lot more detail. 989 00:49:35,860 --> 00:49:38,340 The SPE comes with a lot of sample libraries. 
990 00:49:38,340 --> 00:49:41,410 These are not necessarily the very best implementation of 991 00:49:41,410 --> 00:49:43,440 these libraries and they're not even fully functional 992 00:49:43,440 --> 00:49:46,500 libraries, but they're suggestive of first of all, 993 00:49:46,500 --> 00:49:50,900 how things can be written for cell, how to use cell, and in 994 00:49:50,900 --> 00:49:53,000 some cases how to optimize for cell. 995 00:49:53,000 --> 00:49:55,790 Like the basic matrix operations, there's some 996 00:49:55,790 --> 00:49:56,670 optimization. 997 00:49:56,670 --> 00:49:58,970 The FFTs are very tightly optimized, so you can 998 00:49:58,970 --> 00:50:01,470 take a look at that and understand how to do that type 999 00:50:01,470 --> 00:50:04,010 of memory manipulation. 1000 00:50:04,010 --> 00:50:08,940 So there are sample codes out there that can be very useful. 1001 00:50:08,940 --> 00:50:10,240 We'll skip that. 1002 00:50:10,240 --> 00:50:12,400 Oh, this is that FFT 16 million. 1003 00:50:12,400 --> 00:50:15,940 There's an example, it's on the SDK. 1004 00:50:15,940 --> 00:50:18,340 Actually, I don't know if you've got PS3's if all these 1005 00:50:18,340 --> 00:50:20,070 things can run. 1006 00:50:20,070 --> 00:50:20,900 They should run. 1007 00:50:20,900 --> 00:50:23,820 Yeah, they should run. 1008 00:50:23,820 --> 00:50:25,850 There may be some memory issues out to main memory that 1009 00:50:25,850 --> 00:50:29,090 I'm not aware of. 1010 00:50:29,090 --> 00:50:32,040 There are all kinds of demos there that you can play with, 1011 00:50:32,040 --> 00:50:35,620 which are good for learning how to spawn threads and 1012 00:50:35,620 --> 00:50:38,030 things like that. 1013 00:50:38,030 --> 00:50:41,360 You have your basic GNU binutils tools. 1014 00:50:41,360 --> 00:50:43,670 There's GCC out there. 1015 00:50:43,670 --> 00:50:45,150 There's also XLC. 1016 00:50:45,150 --> 00:50:48,530 You can download XLC.
1017 00:50:48,530 --> 00:50:51,420 In some cases, one will be better than the other, but I 1018 00:50:51,420 --> 00:50:53,780 think in most cases XLC's a little better. 1019 00:50:53,780 --> 00:50:57,210 Or in some cases, actually a lot better. 1020 00:50:57,210 --> 00:50:59,240 So you can get that. 1021 00:50:59,240 --> 00:51:00,820 I'd recommend that. 1022 00:51:00,820 --> 00:51:04,110 There's a debugger which provides application source 1023 00:51:04,110 --> 00:51:06,160 level debugging. 1024 00:51:06,160 --> 00:51:08,790 PPE multithreading, SPE multithreading, the 1025 00:51:08,790 --> 00:51:11,310 interaction between these guys. 1026 00:51:11,310 --> 00:51:15,430 There are three modes for the debugger: standalone and then 1027 00:51:15,430 --> 00:51:17,750 attached to SPE threads. 1028 00:51:17,750 --> 00:51:19,000 Sounds like two. 1029 00:51:22,270 --> 00:51:26,120 That's problematic. 1030 00:51:26,120 --> 00:51:28,130 There's this nice static analysis tool. 1031 00:51:28,130 --> 00:51:30,140 This is good when you're really tightly 1032 00:51:30,140 --> 00:51:31,330 optimizing your code. 1033 00:51:31,330 --> 00:51:33,070 You have to be able to read assembly, but it shows you 1034 00:51:33,070 --> 00:51:34,810 graphically-- 1035 00:51:34,810 --> 00:51:36,430 kind of-- 1036 00:51:36,430 --> 00:51:38,800 where the stalls are happening and you can try and 1037 00:51:38,800 --> 00:51:40,890 reorganize your code. 1038 00:51:40,890 --> 00:51:44,720 And then like Saman suggested, the dynamic analysis using the 1039 00:51:44,720 --> 00:51:48,880 simulator is a good way to really get cycle by cycle 1040 00:51:48,880 --> 00:51:51,190 stepping through the code. 1041 00:51:51,190 --> 00:51:54,220 And someone was very excited when they made this chart 1042 00:51:54,220 --> 00:51:55,720 because they put these big explosions here.
1043 00:51:58,500 --> 00:52:02,790 You've got some compiler here that's going to be generating 1044 00:52:02,790 --> 00:52:07,270 two pieces of code, the PPE binary and the SPE binary. 1045 00:52:07,270 --> 00:52:11,210 When you go through the cell tutorials for training on how 1046 00:52:11,210 --> 00:52:14,900 to program cell you'll see that this code is actually 1047 00:52:14,900 --> 00:52:17,900 plugged into-- linked into the PPE code. 1048 00:52:17,900 --> 00:52:21,170 And when the PPE code spawns a thread it's going to take a 1049 00:52:21,170 --> 00:52:25,030 pointer to this code and basically DMA that code into 1050 00:52:25,030 --> 00:52:27,540 the SPE and tell the SPE to start running. 1051 00:52:27,540 --> 00:52:31,180 Once it's done that, that thread is independent. 1052 00:52:31,180 --> 00:52:34,220 The PPE could kill it, but it could just let it run to its 1053 00:52:34,220 --> 00:52:37,060 natural termination or this thing could terminate itself 1054 00:52:37,060 --> 00:52:41,370 or it could be interrupted by some other communication. 1055 00:52:41,370 --> 00:52:42,890 But that's the basic process, you have these 1056 00:52:42,890 --> 00:52:45,900 two pieces of code. 1057 00:52:45,900 --> 00:52:51,070 OK, so now this is really what I wanted to get to. 1058 00:52:51,070 --> 00:52:54,620 So I want lots of questions here. 1059 00:52:54,620 --> 00:52:59,800 There are 4 levels of parallelism in cell. 1060 00:52:59,800 --> 00:53:02,680 On the cell blade, the IBM blade you have two cell 1061 00:53:02,680 --> 00:53:04,270 processors per blade. 1062 00:53:04,270 --> 00:53:06,570 So that's one level of parallelism. 1063 00:53:06,570 --> 00:53:08,160 At chip level we know there are 9 cores and they're all 1064 00:53:08,160 --> 00:53:08,900 running independently. 1065 00:53:08,900 --> 00:53:11,050 That's another level of parallelism. 
1066 00:53:11,050 --> 00:53:14,170 On the instruction level each of the SPEs has two 1067 00:53:14,170 --> 00:53:18,010 instruction pipelines, so it's an odd and an even pipeline. 1068 00:53:18,010 --> 00:53:19,860 One pipeline is doing things-- 1069 00:53:19,860 --> 00:53:23,370 the odd pipeline is doing loads and stores, DMA 1070 00:53:23,370 --> 00:53:30,840 transactions, interrupts, branches and it's doing 1071 00:53:30,840 --> 00:53:33,610 something called shuffle byte or the shuffle operation. 1072 00:53:33,610 --> 00:53:36,270 So shuffle operation's a very, very useful operation that 1073 00:53:36,270 --> 00:53:41,140 allows you to take two registers as data, a third 1074 00:53:41,140 --> 00:53:44,730 register as a pattern register, and the fourth 1075 00:53:44,730 --> 00:53:46,530 register as output. 1076 00:53:46,530 --> 00:53:50,040 It then, from this pattern, will choose arbitrarily the 1077 00:53:50,040 --> 00:53:53,210 bytes that are in these two and reconstitute them into 1078 00:53:53,210 --> 00:53:54,990 this fourth register. 1079 00:53:54,990 --> 00:53:58,350 It's wonderful for doing manipulations and shuffling 1080 00:53:58,350 --> 00:53:59,360 things around. 1081 00:53:59,360 --> 00:54:02,870 Like shuffling a deck of cards, you could take all of 1082 00:54:02,870 --> 00:54:04,820 these and ignore this or you could take the first one here, 1083 00:54:04,820 --> 00:54:07,410 replicate it 16 times or you could take a random sampling 1084 00:54:07,410 --> 00:54:09,120 from these, put into that register. 1085 00:54:09,120 --> 00:54:12,172 AUDIENCE: Do you use that specifically for the 1086 00:54:12,172 --> 00:54:13,630 [UNINTELLIGIBLE]? 1087 00:54:13,630 --> 00:54:14,670 MICHAEL PERRONE: We do use it, yeah. 1088 00:54:14,670 --> 00:54:18,010 Yeah, you take a look, you'll see we use shuffle a lot. 1089 00:54:18,010 --> 00:54:20,540 It's surprising how valuable shuffle can be. 
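The shuffle semantics described above can be modeled in scalar C. This is only a sketch of the instruction's behavior as described in the lecture: two 16-byte data registers, one pattern register, one result. The real SPU shuffle-bytes instruction also has special pattern codes for generating constants, which this model ignores, and the function name is invented.

```c
#include <stdint.h>

/* Scalar model of the SPE shuffle-bytes operation.  For each output
 * byte, the low 5 bits of the pattern byte select a source byte:
 * values 0..15 pick a byte of a, 16..31 pick a byte of b. */
void shuffle_bytes(const uint8_t a[16], const uint8_t b[16],
                   const uint8_t pattern[16], uint8_t out[16]) {
    for (int i = 0; i < 16; i++) {
        uint8_t sel = pattern[i] & 0x1F;   /* 5 bits choose the source byte */
        out[i] = (sel < 16) ? a[sel] : b[sel - 16];
    }
}
```

A pattern of all zeros replicates a's first byte 16 times; a pattern of 16, 17, ..., 31 copies b through unchanged; any permutation of 0..31 gives an arbitrary reshuffle of the 32 input bytes, which is exactly why it is the workhorse of data rearrangement like transposes.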
1090 00:54:20,540 --> 00:54:23,280 However, then you have to worry now, you've got the 1091 00:54:23,280 --> 00:54:28,300 shuffle here, if you're doing like matrix transpose, it's 1092 00:54:28,300 --> 00:54:30,350 all shuffles. 1093 00:54:30,350 --> 00:54:32,090 But what's a matrix transpose? 1094 00:54:32,090 --> 00:54:34,490 It's really bandwidth bound, right? 1095 00:54:34,490 --> 00:54:36,940 Because you're pulling data in, shuffling it around and 1096 00:54:36,940 --> 00:54:37,350 sending it out. 1097 00:54:37,350 --> 00:54:39,640 Well, where are the reads and writes? 1098 00:54:39,640 --> 00:54:40,590 They're on the odd pipeline. 1099 00:54:40,590 --> 00:54:41,360 Where are the shuffles? 1100 00:54:41,360 --> 00:54:42,970 They're on the odd pipeline. 1101 00:54:42,970 --> 00:54:45,390 So now you can have a situation where it's all 1102 00:54:45,390 --> 00:54:50,360 shuffle, shuffle, shuffle, shuffle and then the 1103 00:54:50,360 --> 00:54:53,950 instruction pre-fetch buffer gets starved and so it stalls 1104 00:54:53,950 --> 00:54:56,840 for 15, 17 cycles while it has to load. 1105 00:54:56,840 --> 00:54:59,900 Basically, it's a tiny little loop. 1106 00:54:59,900 --> 00:55:01,710 But you get stalls and you get really bad performance. 1107 00:55:01,710 --> 00:55:04,480 So then you have to tell the compiler-- 1108 00:55:04,480 --> 00:55:05,880 actually, the compiler is getting 1109 00:55:05,880 --> 00:55:07,170 better at these things. 1110 00:55:07,170 --> 00:55:10,550 Much better than it used to be or by hand you can force it to 1111 00:55:10,550 --> 00:55:12,910 leave a slot for the pre-fetch. 1112 00:55:12,910 --> 00:55:14,690 These are gotchas that programmers 1113 00:55:14,690 --> 00:55:17,470 have to be aware of. 1114 00:55:17,470 --> 00:55:20,800 On the other pipeline you have all your normal operations.
1115 00:55:20,800 --> 00:55:25,620 So you have your mul-adds, your bit operations, all the 1116 00:55:25,620 --> 00:55:28,060 shifts and things like that, they're all over there. 1117 00:55:28,060 --> 00:55:30,500 There is one other operation on the odd pipeline and I 1118 00:55:30,500 --> 00:55:32,730 think it's a quad word rotate or 1119 00:55:32,730 --> 00:55:36,560 something, but I don't remember. 1120 00:55:36,560 --> 00:55:40,710 So that's instruction level dual issue parallelism. 1121 00:55:40,710 --> 00:55:43,280 AUDIENCE: [UNINTELLIGIBLE PHRASE] 1122 00:55:43,280 --> 00:55:44,280 MICHAEL PERRONE: Everything is in order on 1123 00:55:44,280 --> 00:55:45,340 this processor, yeah. 1124 00:55:45,340 --> 00:55:47,080 And that was done for power reasons, right? 1125 00:55:47,080 --> 00:55:49,760 Get rid of all the space and all the transistors that are 1126 00:55:49,760 --> 00:55:51,730 doing all this fancy, out of order 1127 00:55:51,730 --> 00:55:53,600 processing to save power. 1128 00:55:53,600 --> 00:55:54,850 AUDIENCE: [UNINTELLIGIBLE PHRASE] 1129 00:56:18,050 --> 00:56:19,270 MICHAEL PERRONE: That's a really good point. 1130 00:56:19,270 --> 00:56:22,810 When you're doing scalar processing you think well, 1131 00:56:22,810 --> 00:56:25,465 you're thinking I'm going to-- kind of conceptually, you want 1132 00:56:25,465 --> 00:56:27,050 to have all the things that are doing the same thing 1133 00:56:27,050 --> 00:56:27,960 together, right. 1134 00:56:27,960 --> 00:56:30,160 That's how I used to program. 1135 00:56:30,160 --> 00:56:32,590 You put all this stuff here then you do maybe all your 1136 00:56:32,590 --> 00:56:35,320 reads or whatever and then you do all your computes and you 1137 00:56:35,320 --> 00:56:36,290 can't do it that way. 1138 00:56:36,290 --> 00:56:38,370 You have to really think about how are you going to interleave 1139 00:56:38,370 --> 00:56:39,600 these things.
1140 00:56:39,600 --> 00:56:43,990 Now the compiler will help you, but to get really high 1141 00:56:43,990 --> 00:56:46,680 performance you have to have better tools and we don't have 1142 00:56:46,680 --> 00:56:47,550 those tools yet. 1143 00:56:47,550 --> 00:56:50,140 And so I'm hoping that you guys are the ones that are 1144 00:56:50,140 --> 00:56:52,380 going to come up with the new tools, the new ideas that are 1145 00:56:52,380 --> 00:56:54,420 going to really help people improve 1146 00:56:54,420 --> 00:56:57,970 programmability in cell. 1147 00:56:57,970 --> 00:57:00,930 Then at the lowest level you have the register level 1148 00:57:00,930 --> 00:57:05,320 parallelism where you can have four single precision float 1149 00:57:05,320 --> 00:57:08,720 ops going simultaneously. 1150 00:57:08,720 --> 00:57:11,250 So when you're programming cell you have to keep all of 1151 00:57:11,250 --> 00:57:13,140 these levels of hierarchy in your head. 1152 00:57:13,140 --> 00:57:15,860 It's not straight scalar programming anymore. 1153 00:57:15,860 --> 00:57:18,070 And if you think of it that way you're just not going to 1154 00:57:18,070 --> 00:57:20,910 get the performance that you're looking for, period. 1155 00:57:24,600 --> 00:57:26,960 Another consideration is this local store. 1156 00:57:26,960 --> 00:57:30,880 Each local store is 256 kilobytes. 1157 00:57:30,880 --> 00:57:32,130 That's not a lot of space. 1158 00:57:35,110 --> 00:57:37,760 You have to think about how are you going to bring the 1159 00:57:37,760 --> 00:57:41,680 data in so that the chunks are big enough, but not too big 1160 00:57:41,680 --> 00:57:43,050 because if they're too big then you won't be able 1161 00:57:43,050 --> 00:57:44,300 to do multibuffering. 1162 00:57:48,120 --> 00:57:49,930 Let's back up a little bit more. 1163 00:57:49,930 --> 00:57:54,640 The local store holds the data, but it also holds the 1164 00:57:54,640 --> 00:57:56,730 code that you're running.
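That chunk-size budgeting is just arithmetic, and it's worth sketching. A minimal helper, assuming the 256 KB local store figure from the lecture; the function name is invented, and it ignores the stack, which (as discussed below) also lives in the local store:

```c
/* Back-of-the-envelope local store budgeting.  The 256 KB figure is
 * the SPE local store size; everything else here is illustrative. */
#define LOCAL_STORE_BYTES (256u * 1024u)

/* Bytes available per DMA buffer once the code is resident and the
 * remaining space is divided among n_buffers (e.g. 2 for double
 * buffering).  Returns 0 if the code alone doesn't fit.  Note this
 * deliberately ignores the stack, which eats into the same space. */
unsigned int bytes_per_buffer(unsigned int code_bytes,
                              unsigned int n_buffers) {
    if (code_bytes >= LOCAL_STORE_BYTES || n_buffers == 0)
        return 0;
    return (LOCAL_STORE_BYTES - code_bytes) / n_buffers;
}
```

So 200 KB of code with double buffering leaves each buffer 28 KB at most, and less once the stack is accounted for, which is roughly the arithmetic walked through next.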
1165 00:57:56,730 --> 00:58:02,350 So if you have 200 kilobytes of code then you only have 56 1166 00:58:02,350 --> 00:58:03,950 kilobytes of data space. 1167 00:58:03,950 --> 00:58:06,080 And if you want to have double buffering that means you only 1168 00:58:06,080 --> 00:58:15,400 have 25 kilobytes and then as Saman correctly points out 1169 00:58:15,400 --> 00:58:17,950 there's a problem with the stack. 1170 00:58:17,950 --> 00:58:20,390 So if you're going to have recursion in your code or 1171 00:58:20,390 --> 00:58:23,550 something nasty like that, you're going to start pushing 1172 00:58:23,550 --> 00:58:25,630 stack variables off the register file. 1173 00:58:25,630 --> 00:58:27,020 So where do they go? 1174 00:58:27,020 --> 00:58:29,130 They go in the local store. 1175 00:58:29,130 --> 00:58:34,200 What prevents the stack from overwriting your data? 1176 00:58:34,200 --> 00:58:35,520 Nothing. 1177 00:58:35,520 --> 00:58:38,160 Nothing at all and that's a big gotcha. 1178 00:58:38,160 --> 00:58:42,620 I've seen over the past three years maybe 30 separate 1179 00:58:42,620 --> 00:58:46,470 algorithms implemented on cell and I know of only one that 1180 00:58:46,470 --> 00:58:48,030 was definitely doing that. 1181 00:58:48,030 --> 00:58:51,080 But you know, if there are 30 in this class maybe you're 1182 00:58:51,080 --> 00:58:52,420 going to be the one that that happens to. 1183 00:58:52,420 --> 00:58:57,970 So you have to be aware of that and you have 1184 00:58:57,970 --> 00:58:58,400 to deal with it. 1185 00:58:58,400 --> 00:59:02,240 So what you can do, is in the local store put some dead beef 1186 00:59:02,240 --> 00:59:07,400 thing in there so that you can look for an overwrite and that 1187 00:59:07,400 --> 00:59:10,240 will let you know that either you have to make your code 1188 00:59:10,240 --> 00:59:14,890 smaller or your data smaller or get rid of recursion. 1189 00:59:14,890 --> 00:59:18,350 On SPEs, recursion is kind of anathema.
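That "dead beef" trick can be sketched in a few lines. This is an illustration of the idea, not SDK code: the layout, names, and the clobber helper are invented. You plant a known sentinel word at the boundary you care about and check it periodically; if the stack has grown down over your data, the sentinel gets clobbered first and tells you about it.

```c
#include <stdint.h>

#define STACK_GUARD 0xDEADBEEFu

/* Sentinel word planted at the top of the data region (in a real SPE
 * program you would place it between your data and the stack). */
static uint32_t guard_word = STACK_GUARD;

/* Returns 1 while the guard is intact, 0 once something overwrote it. */
int stack_guard_ok(void) {
    return guard_word == STACK_GUARD;
}

/* Stand-in for the stack growing down over the guard, so the check
 * can be demonstrated without actually smashing a stack. */
void simulate_clobber(void) {
    guard_word = 0;
}
```

Checking the guard at the top of your main loop is cheap, and a failed check means: shrink the code, shrink the data, or get rid of the recursion.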
1190 00:59:18,350 --> 00:59:19,900 Inlining is good. 1191 00:59:19,900 --> 00:59:25,220 Inlining really can accelerate your code's performance. 1192 00:59:25,220 --> 00:59:28,310 Oh yeah, it says stack right there. 1193 00:59:28,310 --> 00:59:30,330 You're reading ahead on me here. 1194 00:59:30,330 --> 00:59:32,340 Yes, so all three are in there and you have 1195 00:59:32,340 --> 00:59:33,780 to be aware of that. 1196 00:59:33,780 --> 00:59:37,000 Now there is a memory management library, very 1197 00:59:37,000 --> 00:59:39,960 lightweight library on the SPE and it's going to prevent your 1198 00:59:39,960 --> 00:59:42,930 data from overwriting your code because once the code's 1199 00:59:42,930 --> 00:59:45,820 loaded that memory management library knows where it is and 1200 00:59:45,820 --> 00:59:47,320 it will stop 1201 00:59:47,320 --> 00:59:50,830 your data from allocating-- doing a [? malloc ?] 1202 00:59:50,830 --> 00:59:52,150 over this code. 1203 00:59:52,150 --> 00:59:53,850 But the stack's up for grabs. 1204 00:59:53,850 --> 00:59:56,270 And that was again done because of power 1205 00:59:56,270 --> 00:59:58,220 considerations and real estate on the chip. 1206 00:59:58,220 --> 01:00:02,640 If you want to have a chip that's this big you can have 1207 01:00:02,640 --> 01:00:05,950 anything you want, but manufacturing it is impossible. 1208 01:00:05,950 --> 01:00:08,170 So things were removed and that was one of the things 1209 01:00:08,170 --> 01:00:09,440 that was removed and that's one of the things you have to 1210 01:00:09,440 --> 01:00:11,040 watch out for. 1211 01:00:11,040 --> 01:00:14,010 And communication, we've talked about this quite a bit. 1212 01:00:17,380 --> 01:00:20,460 I didn't mention this: the DMA transactions-- oh, 1213 01:00:20,460 --> 01:00:21,685 question in the back? 1214 01:00:21,685 --> 01:00:25,151 AUDIENCE: Is there any reasonable possibility of 1215 01:00:25,151 --> 01:00:26,665 doing things dynamically?
1216 01:00:32,670 --> 01:00:39,000 Is it at all conceivable to have [? bunks ?] that fetch in 1217 01:00:39,000 --> 01:00:42,100 new code or an allocator that shuffles somehow? 1218 01:00:42,100 --> 01:00:45,572 Or is it basically as soon as you get to that point your 1219 01:00:45,572 --> 01:00:46,510 performance is going to go to hell. 1220 01:00:46,510 --> 01:00:48,330 MICHAEL PERRONE: Yes, well if you don't do anything about 1221 01:00:48,330 --> 01:00:50,510 it, yes your performance will go to hell. 1222 01:00:50,510 --> 01:00:52,070 So there are two ways. 1223 01:00:52,070 --> 01:00:57,240 In research we came up with an overlay mechanism. 1224 01:00:57,240 --> 01:00:59,810 So this is what people used to do 20 years ago when 1225 01:00:59,810 --> 01:01:00,820 processors were simple. 1226 01:01:00,820 --> 01:01:03,630 Well, these processors are simple, so going back to the 1227 01:01:03,630 --> 01:01:07,570 old technologies is actually a good thing to do. 1228 01:01:07,570 --> 01:01:13,580 So we had a video processing algorithm where we took video 1229 01:01:13,580 --> 01:01:17,070 images, we had to decode them with one SPE, we had to do 1230 01:01:17,070 --> 01:01:19,630 some background subtraction to the next SPE. 1231 01:01:19,630 --> 01:01:21,300 We had to do some edge detection. 1232 01:01:21,300 --> 01:01:24,300 And so each SPE was doing a different thing, but even then 1233 01:01:24,300 --> 01:01:27,850 the code was very big, the chunks of code were large. 1234 01:01:27,850 --> 01:01:32,080 And we were spending 27% of the time swapping code out and 1235 01:01:32,080 --> 01:01:33,370 bringing in new code. 1236 01:01:33,370 --> 01:01:34,740 Bad, very bad. 1237 01:01:34,740 --> 01:01:36,580 Oh, and I should tell you, spawning SPE 1238 01:01:36,580 --> 01:01:37,830 threads is very painful. 1239 01:01:40,660 --> 01:01:43,790 500,000 cycles, a million cycles-- 1240 01:01:43,790 --> 01:01:44,490 I don't know. 
1241 01:01:44,490 --> 01:01:48,040 It varies depending on how the SPE feels that particular day. 1242 01:01:48,040 --> 01:01:51,080 And it's something to avoid. 1243 01:01:51,080 --> 01:01:53,030 You really want to spawn a thread and keep it running for 1244 01:01:53,030 --> 01:01:54,240 a long time. 1245 01:01:54,240 --> 01:01:58,290 So context switching is painful on cell. 1246 01:01:58,290 --> 01:02:03,420 Using an overlay we got that 27% overhead down to 1%. 1247 01:02:03,420 --> 01:02:04,970 So yes, you can do that. 1248 01:02:04,970 --> 01:02:07,410 That tool is not in the SDK. 1249 01:02:07,410 --> 01:02:09,640 It's on my to-do list to put it in the SDK, but the 1250 01:02:09,640 --> 01:02:11,750 compiler team at IBM tells me that the XLC 1251 01:02:11,750 --> 01:02:14,040 compiler now does overlays. 1252 01:02:14,040 --> 01:02:18,310 But it only does overlays at the function level, so if the 1253 01:02:18,310 --> 01:02:20,800 function still doesn't fit in the SPE 1254 01:02:20,800 --> 01:02:22,070 you're dead in the water. 1255 01:02:22,070 --> 01:02:24,800 And I think when it compiles, the compiler will 1256 01:02:24,800 --> 01:02:28,010 quietly say this doesn't fit, and you'll never see that 1257 01:02:28,010 --> 01:02:29,450 until you run it and it doesn't load and you don't know 1258 01:02:29,450 --> 01:02:30,360 what's going on. 1259 01:02:30,360 --> 01:02:33,570 So read your compiler outputs. 1260 01:02:33,570 --> 01:02:35,530 The DMA granularity is 128 bytes. 1261 01:02:35,530 --> 01:02:38,770 This is the same as the data transactions for Intel and 1262 01:02:38,770 --> 01:02:41,950 AMD-- they're all 128-byte data envelopes. 1263 01:02:41,950 --> 01:02:45,690 So if you're doing a memory access that's 4 bytes you're 1264 01:02:45,690 --> 01:02:48,180 still using 128 bytes of bandwidth. 1265 01:02:48,180 --> 01:02:50,790 So this comes back to this notion of getting a shopping 1266 01:02:50,790 --> 01:02:53,740 list.
You really want to think ahead what you want to get, 1267 01:02:53,740 --> 01:02:56,130 bring it in, then use it so that you don't waste 1268 01:02:56,130 --> 01:02:58,750 bandwidth, if you're bandwidth bound. 1269 01:02:58,750 --> 01:03:01,380 If you're not, then you can be a little more wasteful. 1270 01:03:01,380 --> 01:03:04,100 But there's a guy, Mike Acton-- 1271 01:03:04,100 --> 01:03:07,050 you can find his website, I think he has a website called 1272 01:03:07,050 --> 01:03:11,060 www.cellperformance.org? 1273 01:03:11,060 --> 01:03:11,480 Net? 1274 01:03:11,480 --> 01:03:11,820 Com? 1275 01:03:11,820 --> 01:03:12,100 I don't know. 1276 01:03:12,100 --> 01:03:15,010 AUDIENCE: Just a quick comment [UNINTELLIGIBLE PHRASE]. 1277 01:03:15,010 --> 01:03:16,410 MICHAEL PERRONE: Oh, he's good. 1278 01:03:16,410 --> 01:03:17,410 He's much better than me. 1279 01:03:17,410 --> 01:03:20,470 You're really going to like him. 1280 01:03:20,470 --> 01:03:24,460 His belief, and I believe him wholeheartedly, is it's all 1281 01:03:24,460 --> 01:03:26,030 about the data. 1282 01:03:26,030 --> 01:03:32,930 We're coming to a point in computer science where the 1283 01:03:32,930 --> 01:03:35,150 code doesn't matter as much as getting the data 1284 01:03:35,150 --> 01:03:36,310 where you need it. 1285 01:03:36,310 --> 01:03:40,300 This is because of the latency out to main memory. 1286 01:03:40,300 --> 01:03:43,790 Memory's getting so far away that having all these cycles 1287 01:03:43,790 --> 01:03:46,210 is not that useful anymore if you can't get the data. 1288 01:03:46,210 --> 01:03:47,940 So he always pushes this point, you 1289 01:03:47,940 --> 01:03:48,830 have to get the data. 1290 01:03:48,830 --> 01:03:51,510 You have to think about the data, good code starts with 1291 01:03:51,510 --> 01:03:54,180 the data, good code ends with the data, good data structures 1292 01:03:54,180 --> 01:03:55,000 start with the data. 
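The 128-byte granularity described above means even a 4-byte load costs a full 128-byte transaction. A minimal sketch of that accounting in plain C (the helper names are hypothetical, not part of the Cell SDK):

```c
#include <stddef.h>

/* Round a transfer size up to the 128-byte coherence granularity.
   Any access, however small, consumes a full 128-byte "envelope". */
static size_t round_up_128(size_t bytes)
{
    return (bytes + 127) & ~(size_t)127;
}

/* Bandwidth actually consumed by n scattered accesses of `bytes` each. */
static size_t bandwidth_consumed(size_t n, size_t bytes)
{
    return n * round_up_128(bytes);
}
```

By this accounting, 32 scattered 4-byte reads consume 4096 bytes of bandwidth, while one contiguous 128-byte "shopping list" transfer of the same data consumes 128 -- a 32x difference when you are bandwidth bound.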
1293 01:03:55,000 --> 01:03:58,520 You have to think data, data, data. 1294 01:03:58,520 --> 01:04:00,590 And I can't emphasize that enough because it's really 1295 01:04:00,590 --> 01:04:03,625 very, very true for this processor and I believe, for 1296 01:04:03,625 --> 01:04:05,310 all the multicore processors you're going to be seeing. 1297 01:04:08,730 --> 01:04:15,090 The DMAs that you issue can be 128 bytes or multiples of 128 1298 01:04:15,090 --> 01:04:17,890 bytes, up to 16 kilobytes per single DMA. 1299 01:04:17,890 --> 01:04:20,570 There's also something called a DMA list, which is a list of 1300 01:04:20,570 --> 01:04:26,140 DMAs in local store and you tell the DMA queue OK, here 1301 01:04:26,140 --> 01:04:29,490 are these 100 DMAs, spawn them off. 1302 01:04:29,490 --> 01:04:32,760 That only takes one slot in the DMA queue so it's an 1303 01:04:32,760 --> 01:04:36,210 efficient way of loading the queue without 1304 01:04:36,210 --> 01:04:39,200 overloading the queue. 1305 01:04:39,200 --> 01:04:46,080 Traffic controls, this is perhaps one of the trickier 1306 01:04:46,080 --> 01:04:48,020 things with cell because the simulator doesn't help very 1307 01:04:48,020 --> 01:04:51,560 much and the tools don't help very much. 1308 01:04:51,560 --> 01:04:53,530 Thinking about synchronization, DMA latency 1309 01:04:53,530 --> 01:04:54,860 handling-- all those things are important. 1310 01:04:59,390 --> 01:05:01,690 OK, so this is the last slide that I'm going to do and then 1311 01:05:01,690 --> 01:05:02,940 I have to run off. 
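The size rules above -- each DMA a multiple of 128 bytes, at most 16 kilobytes, with a list of transfers occupying a single queue slot -- can be sketched with a hypothetical helper that splits a large, 128-byte-aligned transfer into list entries. This is only an illustration of the chunking arithmetic, not the actual DMA-list interface in the SDK:

```c
#include <stddef.h>

#define DMA_MAX   16384  /* 16 KB maximum per single DMA */
#define DMA_ALIGN 128    /* transfer size must be a multiple of 128 bytes */

/* One entry of a hypothetical DMA list: offset and size of a chunk. */
struct dma_entry {
    size_t offset;
    size_t size;
};

/* Split `total` bytes (which must be a multiple of 128) into chunks of
   at most 16 KB each. Returns the number of entries written, or 0 if
   the total is misaligned or the output array is too small. */
static size_t build_dma_list(size_t total, struct dma_entry *out,
                             size_t max_entries)
{
    if (total % DMA_ALIGN != 0)
        return 0;
    size_t n = 0, off = 0;
    while (off < total) {
        if (n == max_entries)
            return 0;
        size_t sz = total - off;
        if (sz > DMA_MAX)
            sz = DMA_MAX;
        out[n].offset = off;
        out[n].size = sz;
        off += sz;
        n++;
    }
    return n;
}
```

A 40 KB transfer, for example, becomes three list entries (16 KB + 16 KB + 8 KB) but still takes only one slot in the DMA queue.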
1312 01:05:05,820 --> 01:05:09,780 I want to give you a sense for the process by which people-- 1313 01:05:09,780 --> 01:05:12,320 my group in particular went through, especially when we 1314 01:05:12,320 --> 01:05:15,490 didn't even have hardware and we didn't have compilers that 1315 01:05:15,490 --> 01:05:17,880 worked nearly as well as they do now and it's really very 1316 01:05:17,880 --> 01:05:21,140 ugly knives and stones and sticks. 1317 01:05:21,140 --> 01:05:23,750 You know, just kind of stone knives. 1318 01:05:23,750 --> 01:05:26,580 That's what I'm thinking, very primitive. 1319 01:05:26,580 --> 01:05:30,970 But this way of thinking is still very much true. 1320 01:05:30,970 --> 01:05:32,570 You have to think about your code this way. 1321 01:05:32,570 --> 01:05:34,940 You want to start, you have your application, whatever it 1322 01:05:34,940 --> 01:05:35,900 happens to be; you want to do an 1323 01:05:35,900 --> 01:05:38,080 algorithmic complexity study. 1324 01:05:38,080 --> 01:05:41,140 Is this order n squared, is this log n? 1325 01:05:41,140 --> 01:05:42,260 Where are the bottlenecks? 1326 01:05:42,260 --> 01:05:45,160 What do I expect to be bottlenecks? 1327 01:05:45,160 --> 01:05:48,390 Then I want to do data layout/locality. 1328 01:05:48,390 --> 01:05:50,360 Now this is the data, data, data approach of Mike Acton. 1329 01:05:52,950 --> 01:05:54,430 You want to think about the data. 1330 01:05:54,430 --> 01:05:55,540 Where is it? 1331 01:05:55,540 --> 01:05:57,810 How can you structure your data so that it's going to be 1332 01:05:57,810 --> 01:06:01,550 efficiently positioned for when you need it? 1333 01:06:01,550 --> 01:06:04,400 And then you start with an experimental partitioning of 1334 01:06:04,400 --> 01:06:05,340 the algorithm. 
1335 01:06:05,340 --> 01:06:08,050 You want to break it up between the pieces that you 1336 01:06:08,050 --> 01:06:12,320 believe are scalar and remain scalar, leave those on the PPE, 1337 01:06:12,320 --> 01:06:14,460 and the ones that can be parallelized. 1338 01:06:14,460 --> 01:06:17,810 Those are the ones that are going to go on the SPE. 1339 01:06:17,810 --> 01:06:19,430 You have to think conceptually about 1340 01:06:19,430 --> 01:06:21,730 partitioning that out. 1341 01:06:21,730 --> 01:06:24,980 And then run it on the PPE anyway. 1342 01:06:24,980 --> 01:06:27,390 You want to have a baseline there. 1343 01:06:27,390 --> 01:06:31,370 Then you have this PPE scalar code and PPE control code. 1344 01:06:31,370 --> 01:06:35,230 This PPE scalar code you want to then push down to the SPEs. 1345 01:06:35,230 --> 01:06:39,060 So now you're going to add stuff for communication, 1346 01:06:39,060 --> 01:06:40,440 synchronization, and latency handling. 1347 01:06:40,440 --> 01:06:42,420 So you have to spawn threads. 1348 01:06:42,420 --> 01:06:43,640 The [? RAIDs ?] 1349 01:06:43,640 --> 01:06:47,110 have to be told where the data is, they have to get their 1350 01:06:47,110 --> 01:06:49,320 code, they have to run their code, they have to then start 1351 01:06:49,320 --> 01:06:51,490 pulling in the data, synchronize with the other 1352 01:06:51,490 --> 01:06:55,620 SPEs, and then latency handling with multibuffering of the 1353 01:06:55,620 --> 01:06:59,090 data so that you can be doing computing and reading data 1354 01:06:59,090 --> 01:07:01,020 simultaneously. 1355 01:07:01,020 --> 01:07:06,970 Then you have your first parallel code that's running. 1356 01:07:06,970 --> 01:07:12,400 Now the compiler, the XLC compiler, GCC compiler-- 1357 01:07:12,400 --> 01:07:14,900 well, the XLC compiler I know for certain will do some 1358 01:07:14,900 --> 01:07:16,370 automatic SIMDization 1359 01:07:16,370 --> 01:07:18,080 if you put the auto SIMD flag on. 
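The multibuffering for latency handling described above -- compute on one buffer while the next chunk is being transferred -- can be sketched in plain C. Here memcpy is a synchronous stand-in for the asynchronous DMA get; on a real SPE the transfer into the other buffer would be in flight while the current chunk is processed:

```c
#include <string.h>

#define CHUNK 4  /* elements per buffer; small for illustration */

/* Sum `n_chunks` chunks of input using two local buffers: while chunk
   i is being computed on, chunk i+1 is being transferred into the
   other buffer. memcpy stands in for an async DMA transfer. */
static long process_double_buffered(const int *src, size_t n_chunks)
{
    int buf[2][CHUNK];
    long sum = 0;
    size_t cur = 0;

    if (n_chunks == 0)
        return 0;
    memcpy(buf[cur], src, sizeof buf[cur]);      /* prefetch chunk 0 */
    for (size_t i = 0; i < n_chunks; i++) {
        size_t nxt = cur ^ 1;
        if (i + 1 < n_chunks)                    /* start the next transfer */
            memcpy(buf[nxt], src + (i + 1) * CHUNK, sizeof buf[nxt]);
        for (int j = 0; j < CHUNK; j++)          /* compute on current chunk */
            sum += buf[cur][j];
        cur = nxt;                               /* swap buffers */
    }
    return sum;
}
```

The buffer-swap structure is the point: with real asynchronous transfers, the compute loop on `buf[cur]` hides the latency of the transfer filling `buf[nxt]`.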
1360 01:07:18,080 --> 01:07:19,550 Does GCC compiler do that? 1361 01:07:19,550 --> 01:07:20,800 PROFESSOR: [UNINTELLIGIBLE PHRASE] 1362 01:07:23,300 --> 01:07:24,860 MICHAEL PERRONE: OK, so I don't know if the GCC 1363 01:07:24,860 --> 01:07:27,190 compiler does that. 1364 01:07:27,190 --> 01:07:33,690 So that can be done by hand, but sometimes that works, 1365 01:07:33,690 --> 01:07:34,670 sometimes it doesn't. 1366 01:07:34,670 --> 01:07:36,690 And it really depends on how complex the algorithm is. 1367 01:07:36,690 --> 01:07:39,530 If it's a very regular code, like a matrix-matrix multiply, 1368 01:07:39,530 --> 01:07:43,980 you'll see that the compiler can do fairly well if the 1369 01:07:43,980 --> 01:07:45,590 block sizes are right and all. 1370 01:07:45,590 --> 01:07:50,090 But if you have something that's more irregular then you 1371 01:07:50,090 --> 01:07:53,360 may find that doing it by hand is really required. 1372 01:07:53,360 --> 01:07:56,270 And so this step here could be done with the compiler 1373 01:07:56,270 --> 01:07:58,700 initially to see if you're getting the performance that 1374 01:07:58,700 --> 01:08:00,780 you think you should be getting from that algorithmic 1375 01:08:00,780 --> 01:08:02,380 complexity study. 1376 01:08:02,380 --> 01:08:04,420 You should see that type of scaling. 1377 01:08:04,420 --> 01:08:06,880 You can look at the CPI and see how many cycles per 1378 01:08:06,880 --> 01:08:08,480 instruction you're getting. 1379 01:08:08,480 --> 01:08:11,200 Each SPE should be getting 0.5. 1380 01:08:11,200 --> 01:08:13,590 You should be able to get two instructions per cycle. 1381 01:08:16,310 --> 01:08:19,480 Very few codes actually get exactly-- 1382 01:08:19,480 --> 01:08:27,180 you can get down to 0.58 or something like that, but I 1383 01:08:27,180 --> 01:08:29,830 think if you can get to 1 you're doing well. 
1384 01:08:29,830 --> 01:08:32,390 If you get to 2 there's probably more you can be doing 1385 01:08:32,390 --> 01:08:33,870 and if you're above 2 there's something 1386 01:08:33,870 --> 01:08:36,200 wrong with your code. 1387 01:08:36,200 --> 01:08:37,020 It may be the algorithm. 1388 01:08:37,020 --> 01:08:39,400 It may be just a poorly chosen algorithm. 1389 01:08:42,120 --> 01:08:44,020 But that's where you can talk to me. 1390 01:08:44,020 --> 01:08:46,010 I want to make myself available to everyone in the 1391 01:08:46,010 --> 01:08:48,460 class or in my department as well. 1392 01:08:48,460 --> 01:08:53,170 We're very enthusiastic about working with research groups 1393 01:08:53,170 --> 01:08:59,230 in universities to develop new tools, new methods and if you 1394 01:08:59,230 --> 01:09:00,180 can help me, I can help you. 1395 01:09:00,180 --> 01:09:01,850 I think it works very well. 1396 01:09:04,710 --> 01:09:07,440 Then once you've done this, you may find that what you 1397 01:09:07,440 --> 01:09:11,000 originally thought for the complexity or the layout 1398 01:09:11,000 --> 01:09:13,840 wasn't quite accurate, so you need to then go do some 1399 01:09:13,840 --> 01:09:14,970 additional rebalancing. 1400 01:09:14,970 --> 01:09:17,060 Maybe change your block sizes. 1401 01:09:17,060 --> 01:09:20,960 You know, maybe you had 64 by 64 blocks, now you need 32 by 1402 01:09:20,960 --> 01:09:25,800 64 or 48 by whatever-- some readjustment to match what you 1403 01:09:25,800 --> 01:09:30,610 have. And then you may want to reevaluate the data movement. 1404 01:09:30,610 --> 01:09:33,100 And then you know, in many cases you'll be done, but 1405 01:09:33,100 --> 01:09:35,620 you're looking at your cycles per instruction or your speed 1406 01:09:35,620 --> 01:09:39,960 up and you're not seeing exactly what you expected, so 1407 01:09:39,960 --> 01:09:42,830 you can start looking at other optimization considerations. 
1408 01:09:42,830 --> 01:09:46,210 Like using the vector unit, the VMX unit on the cell 1409 01:09:46,210 --> 01:09:49,840 processor, on the PPE. 1410 01:09:49,840 --> 01:09:53,760 Looking for system bottlenecks-- and this, I have 1411 01:09:53,760 --> 01:09:56,400 found, is actually the biggest problem. 1412 01:09:56,400 --> 01:09:59,730 Trying to identify where the DMA bottlenecks are happening 1413 01:09:59,730 --> 01:10:02,980 is kind of devilishly hard. 1414 01:10:02,980 --> 01:10:05,100 We don't have good tools for that, so you really have to 1415 01:10:05,100 --> 01:10:08,100 think hard and come up with interesting kinds of 1416 01:10:08,100 --> 01:10:11,260 experiments for your code to track down these bottlenecks. 1417 01:10:13,990 --> 01:10:15,160 And then load balancing. 1418 01:10:15,160 --> 01:10:17,850 If you look at these SPEs, I told you they're completely 1419 01:10:17,850 --> 01:10:18,520 independent. 1420 01:10:18,520 --> 01:10:20,850 You can have them all running the same code or they could be 1421 01:10:20,850 --> 01:10:22,170 running all different code. 1422 01:10:22,170 --> 01:10:24,310 They could be daisy chained so that this one feeds that one, 1423 01:10:24,310 --> 01:10:25,940 that one feeds the next, and so on. 1424 01:10:25,940 --> 01:10:28,020 If you do that daisy chaining you may find out there's a 1425 01:10:28,020 --> 01:10:28,400 bottleneck. 1426 01:10:28,400 --> 01:10:31,540 That this SPE takes three times as long 1427 01:10:31,540 --> 01:10:33,110 as any of the others. 1428 01:10:33,110 --> 01:10:38,540 So make that use 3 SPEs and have this SPE feed these 3. 1429 01:10:38,540 --> 01:10:41,430 So you have to do some load balancing and thinking about 1430 01:10:41,430 --> 01:10:43,460 how many SPEs really need to be dedicated 1431 01:10:43,460 --> 01:10:46,510 to each of the tasks. 1432 01:10:46,510 --> 01:10:50,920 Now that's the end of my talk. 
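The load-balancing rule of thumb above -- a pipeline stage that takes three times as long should get three SPEs -- amounts to allocating SPEs in proportion to each stage's measured cost. A minimal sketch of that arithmetic (the helper is hypothetical, not part of any Cell SDK):

```c
/* Allocate `total_spes` among pipeline stages in proportion to their
   measured per-item cost, giving each stage at least one SPE, and
   handing any leftover SPEs to the most expensive stage. */
static void balance_spes(const int *cost, int n_stages,
                         int total_spes, int *out)
{
    int total_cost = 0;
    for (int i = 0; i < n_stages; i++)
        total_cost += cost[i];

    int used = 0;
    for (int i = 0; i < n_stages; i++) {
        out[i] = cost[i] * total_spes / total_cost;  /* proportional share */
        if (out[i] < 1)
            out[i] = 1;                              /* every stage needs one */
        used += out[i];
    }

    int max_i = 0;                                   /* most expensive stage */
    for (int i = 1; i < n_stages; i++)
        if (cost[i] > cost[max_i])
            max_i = i;
    out[max_i] += total_spes - used;                 /* absorb the remainder */
}
```

For a four-stage pipeline where one stage costs 3x the others, six SPEs split as 1-3-1-1, matching the daisy-chain rebalancing described in the talk.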
1433 01:10:50,920 --> 01:10:54,900 I think that gives you a good sense of where we have been, 1434 01:10:54,900 --> 01:10:57,190 where we are now, and where we're going. 1435 01:10:57,190 --> 01:11:01,420 And I hope that it was good, educational, and I'll make 1436 01:11:01,420 --> 01:11:03,260 myself available to you guys in the future. 1437 01:11:03,260 --> 01:11:04,680 And if you have questions-- 1438 01:11:04,680 --> 01:11:05,170 PROFESSOR: Thank you. 1439 01:11:05,170 --> 01:11:10,140 I know you have to catch a flight. 1440 01:11:10,140 --> 01:11:11,810 How much time do you have for questions? 1441 01:11:11,810 --> 01:11:13,210 MICHAEL PERRONE: Not much. 1442 01:11:13,210 --> 01:11:14,890 I leave at 1:10. 1443 01:11:14,890 --> 01:11:16,750 So I should be there by 12:00. 1444 01:11:16,750 --> 01:11:16,970 PROFESSOR: OK. 1445 01:11:16,970 --> 01:11:18,080 So [UNINTELLIGIBLE] 1446 01:11:18,080 --> 01:11:18,810 at some time. 1447 01:11:18,810 --> 01:11:19,450 MICHAEL PERRONE: My car is out-- 1448 01:11:19,450 --> 01:11:22,770 PROFESSOR: OK, so we'll have about 5 minutes of questions. 1449 01:11:22,770 --> 01:11:25,630 OK, so I know this talk is early. 1450 01:11:25,630 --> 01:11:27,750 We haven't gotten to a lot of basics so there might be a lot 1451 01:11:27,750 --> 01:11:30,940 of things kind of going above your head, but we'll slowly 1452 01:11:30,940 --> 01:11:32,030 get back to it. 1453 01:11:32,030 --> 01:11:34,990 So questions? 1454 01:11:34,990 --> 01:11:38,190 AUDIENCE: You mentioned that SPEs would 1455 01:11:38,190 --> 01:11:40,910 be able to run a kernel. 1456 01:11:40,910 --> 01:11:43,517 Is there a microkernel that you could install on them so 1457 01:11:43,517 --> 01:11:45,660 that you could begin experimenting with MPI type 1458 01:11:45,660 --> 01:11:47,240 structures? 1459 01:11:47,240 --> 01:11:49,450 MICHAEL PERRONE: Not that I'm aware of. 
1460 01:11:49,450 --> 01:11:52,240 We did look at something called MicroMPI, where we were 1461 01:11:52,240 --> 01:11:57,290 using kind of a very watered down MPI implementation for 1462 01:11:57,290 --> 01:12:00,030 the SPEs in the transactions. 1463 01:12:00,030 --> 01:12:01,000 I don't recommend it. 1464 01:12:01,000 --> 01:12:07,060 What I recommend is you have a cluster say, a thousand node 1465 01:12:07,060 --> 01:12:10,570 cluster and the code today, the legacy code that's out 1466 01:12:10,570 --> 01:12:14,400 there runs some process on this node. 1467 01:12:14,400 --> 01:12:19,360 Take that process, don't try to push MPI further down, but 1468 01:12:19,360 --> 01:12:24,940 just try to subpartition that process and let the PPE handle 1469 01:12:24,940 --> 01:12:31,190 all the communication off board, off node. 1470 01:12:31,190 --> 01:12:32,130 That's my recommendation. 1471 01:12:32,130 --> 01:12:35,960 AUDIENCE: So MPI is running on [UNINTELLIGIBLE]? 1472 01:12:35,960 --> 01:12:37,890 MICHAEL PERRONE: Yeah, Open MPI. 1473 01:12:37,890 --> 01:12:39,840 It's an open source MPI. 1474 01:12:39,840 --> 01:12:42,310 It's just a recompile and it hasn't 1475 01:12:42,310 --> 01:12:44,960 been tuned or optimized. 1476 01:12:44,960 --> 01:12:48,480 And it doesn't know anything about the SPEs. 1477 01:12:48,480 --> 01:12:50,990 You know, you let the PPE do all the communication or 1478 01:12:50,990 --> 01:12:52,080 handle the communications. 1479 01:12:52,080 --> 01:12:55,180 When it finishes the task at hand then it can 1480 01:12:55,180 --> 01:12:56,967 issue its MPI process. 1481 01:12:56,967 --> 01:12:58,217 AUDIENCE: [UNINTELLIGIBLE PHRASE] 1482 01:13:00,010 --> 01:13:03,850 MICHAEL PERRONE: OpenMP is the methodology where you take 1483 01:13:03,850 --> 01:13:08,260 existing scalar code and you insert compiler pragmas to say 1484 01:13:08,260 --> 01:13:10,490 this for loop can be parallelized. 
1485 01:13:10,490 --> 01:13:13,330 And you know, these data structures are disjoint, so we 1486 01:13:13,330 --> 01:13:17,410 don't have to worry about any kind of interference, side 1487 01:13:17,410 --> 01:13:19,950 effects of the data manipulation. 1488 01:13:19,950 --> 01:13:24,090 The compiler, the XLC compiler implements OpenMP. 1489 01:13:24,090 --> 01:13:27,360 There are several components that are required. 1490 01:13:27,360 --> 01:13:30,980 One was a software cache where they implemented a little 1491 01:13:30,980 --> 01:13:32,460 cache on the local store. 1492 01:13:32,460 --> 01:13:36,250 And if it misses in that local cache it goes and gets it. 1493 01:13:36,250 --> 01:13:41,910 I don't know how well that performs yet, but it exists. 1494 01:13:41,910 --> 01:13:43,150 There's the SIMDization. 1495 01:13:43,150 --> 01:13:45,830 For a while, OpenMP wasn't working with auto SIMDization 1496 01:13:45,830 --> 01:13:48,830 but now it does. 1497 01:13:48,830 --> 01:13:53,310 So it's getting there; for C it's there. 1498 01:13:53,310 --> 01:13:55,205 I don't know what type of performance hit 1499 01:13:55,205 --> 01:13:55,950 you take for that. 1500 01:13:55,950 --> 01:13:59,820 AUDIENCE: Probably runs [UNINTELLIGIBLE PHRASE] 1501 01:13:59,820 --> 01:14:00,450 MICHAEL PERRONE: It's 1502 01:14:00,450 --> 01:14:03,110 the XLC version that does that. 1503 01:14:03,110 --> 01:14:06,700 I don't know if GCC does it. 1504 01:14:06,700 --> 01:14:10,500 But my recommendation is if you want to use OpenMP, go 1505 01:14:10,500 --> 01:14:14,010 ahead, take your scalar code, implement it with those 1506 01:14:14,010 --> 01:14:17,340 pragmas, see what type of improvement you get. 1507 01:14:17,340 --> 01:14:18,330 Play around with it a little. 1508 01:14:18,330 --> 01:14:21,710 If you find something that you expect should be 10x better 1509 01:14:21,710 --> 01:14:25,100 and it's only 3x, take that bottleneck and 1510 01:14:25,100 --> 01:14:26,350 implement it by hand. 
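The OpenMP approach described above -- annotate an existing scalar loop with a pragma when the iterations are independent -- looks like this in C. This is a generic example, not Cell-specific; without an OpenMP-aware compiler the pragma is simply ignored and the loop runs serially with the same result:

```c
/* Scale an array element-wise. The iterations are independent and
   the input and output arrays are assumed disjoint, so the loop is
   safe to parallelize with OpenMP. */
static void scale(const float *in, float *out, int n, float a)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        out[i] = a * in[i];
}
```

The pragma is exactly the kind of annotation Perrone describes: it asserts to the compiler that the loop body has no cross-iteration side effects, so the work can be divided among threads.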
1511 01:14:31,726 --> 01:14:32,976 AUDIENCE: [UNINTELLIGIBLE PHRASE] 1512 01:14:34,945 --> 01:14:39,340 with the memory models and such that the SPEs certainly 1513 01:14:39,340 --> 01:14:41,293 went back a couple of generations to a simpler 1514 01:14:41,293 --> 01:14:41,781 [INAUDIBLE]. 1515 01:14:41,781 --> 01:14:44,512 How come you went so far back rather than to just, say, 1516 01:14:44,512 --> 01:14:45,580 segmentation? 1517 01:14:45,580 --> 01:14:46,760 MICHAEL PERRONE: I don't know the answer. 1518 01:14:46,760 --> 01:14:48,010 I'm sorry. 1519 01:14:50,650 --> 01:14:53,860 I suspect, and most of these answers come down to the same 1520 01:14:53,860 --> 01:14:57,210 thing, it comes back to Sony. 1521 01:14:57,210 --> 01:14:59,990 Sony contracted with IBM, gave us a lot of money 1522 01:14:59,990 --> 01:15:00,700 to make this thing. 1523 01:15:00,700 --> 01:15:02,330 And they said we need a Playstation 3. 1524 01:15:02,330 --> 01:15:03,740 We need this, this, this, this. 1525 01:15:03,740 --> 01:15:06,870 And so IBM was very focused on providing those things. 1526 01:15:06,870 --> 01:15:10,650 Now that that is delivered and the Playstation 3 is being sold, 1527 01:15:10,650 --> 01:15:11,740 we're looking at other options. 1528 01:15:11,740 --> 01:15:17,560 And if that's something that you're interested in pursuing 1529 01:15:17,560 --> 01:15:18,020 you should talk to me. 1530 01:15:18,020 --> 01:15:20,057 AUDIENCE: Among other things it seems to me that the 1531 01:15:20,057 --> 01:15:23,622 lightweight mechanism for keeping the stack from 1532 01:15:23,622 --> 01:15:27,940 stomping on other things -- 1533 01:15:27,940 --> 01:15:33,950 PROFESSOR: I think that this is a very new area. 1534 01:15:33,950 --> 01:15:36,190 Before you put things in hardware, you need to have 1535 01:15:36,190 --> 01:15:39,190 some kind of consensus, what's the right way to do it? 
1536 01:15:39,190 --> 01:15:42,690 This is bare metal that gives you a huge amount of 1537 01:15:42,690 --> 01:15:44,320 opportunity, but it gives you enough rope to hang yourself. 1538 01:15:46,850 --> 01:15:49,250 And the key thing is you can get all this performance, and 1539 01:15:49,250 --> 01:15:52,380 what will happen perhaps, in the next few years, is people 1540 01:15:52,380 --> 01:15:53,730 come to a consensus saying, look, 1541 01:15:53,730 --> 01:15:54,790 everybody has to do this. 1542 01:15:54,790 --> 01:15:57,060 Everybody needs MPI, everybody needs this cache. 1543 01:15:57,060 --> 01:16:00,180 And slowly, some of those features will do a little bit 1544 01:16:00,180 --> 01:16:02,130 of a feature creep, so you're going to have a little bit 1545 01:16:02,130 --> 01:16:04,390 of overhead, be a little bit less power efficient. 1546 01:16:04,390 --> 01:16:05,630 But it will be much easier to program. 1547 01:16:05,630 --> 01:16:08,590 But this is kind of the bare metal thing that you get, and in 1548 01:16:08,590 --> 01:16:12,410 some sense, it's a nice time because I think in 5 years if 1549 01:16:12,410 --> 01:16:17,400 you look at cell you won't have this level of access. 1550 01:16:17,400 --> 01:16:20,940 You'll have all this nice stuff built on top, 1551 01:16:20,940 --> 01:16:22,840 so this is a unique position to be in. 1552 01:16:22,840 --> 01:16:25,840 It's very hard to deal with, but also on the other hand you 1553 01:16:25,840 --> 01:16:27,640 get to see underneath. 1554 01:16:27,640 --> 01:16:30,650 You get to see without any kind of these sort 1555 01:16:30,650 --> 01:16:31,310 of things in there. 1556 01:16:31,310 --> 01:16:33,700 So my feeling is in a few years you'll get all those 1557 01:16:33,700 --> 01:16:34,800 things put back. 
1558 01:16:34,800 --> 01:16:37,210 When and if we figure out how to deal with things like 1559 01:16:37,210 --> 01:16:40,640 segmentation on the multicore with very fine grain 1560 01:16:40,640 --> 01:16:42,640 communication-- there are a lot of issues here that you 1561 01:16:42,640 --> 01:16:43,370 need to figure out. 1562 01:16:43,370 --> 01:16:44,950 But right now all those issues are [INAUDIBLE]. 1563 01:16:44,950 --> 01:16:46,450 It's like OK, we don't know how to do it. 1564 01:16:46,450 --> 01:16:52,460 Well, you go figure it out, OK? 1565 01:16:52,460 --> 01:16:53,070 MICHAEL PERRONE: Thank you very much. 1566 01:16:53,070 --> 01:16:54,320 PROFESSOR: Thank you. 1567 01:16:56,390 --> 01:16:58,070 I don't have that much more material. 1568 01:16:58,070 --> 01:17:00,690 So I have about 10, 15 minutes. 1569 01:17:00,690 --> 01:17:03,160 Do you guys need a break or should we just go 1570 01:17:03,160 --> 01:17:03,740 directly to the end? 1571 01:17:03,740 --> 01:17:06,430 How many people say we want a break?