The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR RABBAH: OK, so today, in the last lecture, we're going to talk about the Raw architecture. This is a processor that was built here at MIT and essentially trailblazed a lot of the research in parallel architectures for multicores, compilation for multicores, programming languages, and so on. So you've heard some things about Raw and the parallelizing technology in terms of StreamIt. We're going to cover some of that again here today, just briefly, and give you a little bit more insight into what went into the design of the Raw architecture.

So these are Raw chips; they were delivered in October of 2002. Each one of these has 16 processors on it. I'm going to show you a diagram on the next slide. It's really a tiled microprocessor.
We'll get into what that means, and what a tiled microprocessor gives you that makes it an attractive design point in the architecture space. Each of the Raw tiles-- you can sort of see the outline here of what replicates-- is four millimeters square. It's a single-issue 8-stage pipeline. It has local memory, so there's a 32K cache. And the unique aspect of the Raw processor is that it has a lot of on-chip networks that you can use to orchestrate communication between processors. So there are two operand networks-- I'm going to get into what that means and what they're used for, but these essentially allow you to do point-to-point communication between tiles with very low latency. And then there's a network that essentially allows you to handle cache misses and input and output, and one for message passing-- a more dynamic style of messaging, something similar to what you're accustomed to on Cell, for example, with DMA transfers. This was built in 180 nanometer ASIC technology by IBM. It's got 100 million transistors.
It was designed here by MIT grad students. It's got something like a million gates on it. Three to four years of development time. And what was really interesting here is that, because of the tiled nature of the architecture, you could just design one tile, and once you have one tile, you essentially just plop down more and more and more of them. So you have one, and you scale it out to 16 tiles. And the design came back without any bugs when the first chip was delivered. The core frequency was expected to be-- I think lower than 425 megahertz.

AUDIENCE: Designed for 250?

PROFESSOR RABBAH: 250 megahertz, and it came back and ran at 425 megahertz. And it's been clocked as high as 500 megahertz at 2.2 volts. The chip isn't really designed for low power, but the tile abstraction is really nice for power consumption, because if you're not using tiles you can essentially just shut them down. So it allows you to have a power-efficient design just by the nature of the architecture.
But when you're using all the tiles, all the memories, all the networks, in a non-optimized design, you consume about 18 watts of power.

So how do you use this tiled processor? Here's one particular example. The nice thing about a tiled architecture is that you can let applications consume as many tiles as they need. If you have an application with a lot of parallelism, then you give it a lot of tiles. If you have an application that doesn't need a lot of parallelism, then you don't give it a lot of tiles. So it allows you to really exploit the mapping of your application down to the architecture, and gives you ASIC-like behavior-- application-specific processing technology. So one example is you have some video that you're recording and you want to encode it and stream it across the web, or display it on your monitor, or whatever else. So you can have some logic that you map down. If your chips are here, you do some computation. You have memories sprinkled across the tiles that you're going to use for local store. So you can parallelize, for example, the motion estimation for encoding the temporal redundancy in a video stream.
You can have another application, completely independent, running on another part of the chip. So here's an application that's using four different tiles, and it's really isolated. It doesn't affect what's going on in these tiles. You can have another application that's running something like MPI, where you're doing dynamic messaging, an httpd server, and this tile is maybe not used, so it's just sleeping or it's idle. You can have memories connected off the chip, I/O devices. So it's really interesting in the sense that-- probably the most interesting aspect of it is that you just allow the tiles to be used as your fundamental resource. And you can scale them up as your application parallelism scales.

This is a picture of the Raw board-- the Raw motherboard. You can actually see it in the Stata Center in the Raw Lab. This is the Raw chip. A lot of the peripheral devices, firmware, and interconnect for dealing with a lot of devices off the chip are implemented in these FPGAs, so these are Xilinx chips. There's DRAM. You have a connection to a PCI card, USB stick.
A network interface, so you can actually log into this machine and use it. And there's a real compiler. It can run real applications. There's actually a bigger system that we built where we take four of these Raw chips and sort of scale them up. So rather than having 16 tiles on your motherboard, you can have four Raw chips. That gives you 64 tiles. You can scale this up to a thousand tiles or so. Just because of the tiled nature, everything is symmetric, homogeneous, so you can really scale it up really big.

So what is the performance of Raw? Looking at overall application performance, we've done a lot of benchmarking. These are numbers from a paper that was published in 2004, where we took a lot of applications-- some are well-known and used in standard benchmark suites-- and compiled them for Raw using the various Raw compilers that we built in-house. And we've compared them against the Pentium 3. So the Pentium 3 is sort of a unique comparison point because it roughly matches Raw in terms of the technology that was used to fabricate the two.
And what you're seeing here-- this is a log scale-- is the speedup of the application running on Raw compared to the application running on a P3. So the higher you get, the better the performance is. These applications are grouped into a few classes. The first class is what we call ILP applications. These are applications that have essentially instruction-level parallelism. I'm going to talk a little bit more about that and sort of explain it. But you've seen this early on in the lectures-- in some of Saman's lectures. So here you're trying to exploit inherent instruction-level parallelism in the applications. And if you have lots of ILP, then you map it to a lot of tiles, and you can get parallelism that way and you get better performance. These applications here are what we call the streaming applications. You saw some of these in the StreamIt lecture and the StreamIt parallelizing compiler lecture. Some of those numbers were generated on a Raw-like architecture. And then you have the server applications, or the sort of more traditional applications that you expect to run in a server style, or throughput-oriented.
And then finally you have bit-level applications: doing things at the very lowest level of computation, where you're doing a lot of bit manipulation. What's interesting to note here is that as you get into more applications that have a lot of inherent parallelism in them-- where you can extract a lot of parallelism because of the explicit nature of the applications-- you can map those really well to the architecture. And because of the communication capabilities of the architecture-- being able to stream data from one tile to another really fast-- you can get really high on-chip bandwidth, and that gives you really high performance, especially for these kinds of applications.

There are other applications that we've done, that some of the students have worked on in the Raw group. So an MPEG-2 encoder, where you're essentially trying to do real-time encoding of a video stream at different resolutions-- 352 by 240 or 720 by 480-- where you're compiling down to a number of tiles: 1, 4, 8, 16-- 1 and 16 are somehow missing, I'm not sure why.
And what are you looking for here? Sort of the scalability of the algorithm. As you add more tiles, are you getting more and more performance, are you getting better and better throughput? So you can encode more frames per second, for example. If you're doing HDTV-- it's 1080p-- then there's a lot of compute power that you need. And so as you add more tiles, maybe you can get to the throughput that you need for HDTV. So this is something that might be interesting for some of your projects as well. And we've talked about this before: on Cell, as you're using more and more SPEs, can you accelerate the performance of your application? Can you sort of show that, if you're doing some visual aspect, and demonstrate it? So there's a demo that is set up in the lab where you can crank up the number of tiles that you're using, and you get better performance from the MPEG encoder. And just looking at the number of frames per second that you can get: with 64 tiles-- so the Raw chip is 16 tiles, but you can scale it up by having more chips-- you can get about 51 frames per second.
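To make the scalability question concrete, here is a minimal Python sketch of how you might compute speedup and parallel efficiency from frames-per-second measurements at different tile counts. The measurements below are hypothetical placeholders for illustration, not the actual Raw numbers.

```python
# Sketch: given throughput (frames/sec) at several tile counts,
# compute speedup and parallel efficiency relative to the smallest
# configuration. The measurements are hypothetical.

def scalability(fps_by_tiles):
    """Return {tiles: (speedup, efficiency)} relative to the smallest config."""
    base_tiles = min(fps_by_tiles)
    base_fps = fps_by_tiles[base_tiles]
    result = {}
    for tiles, fps in sorted(fps_by_tiles.items()):
        speedup = fps / base_fps
        # Efficiency: how much of the ideal linear speedup was achieved.
        efficiency = speedup / (tiles / base_tiles)
        result[tiles] = (speedup, efficiency)
    return result

# Hypothetical encoder throughput; scaling is sublinear, as is typical.
measurements = {1: 1.0, 4: 3.2, 8: 5.5, 16: 9.0, 64: 51.0}
for tiles, (s, e) in scalability(measurements).items():
    print(f"{tiles:3d} tiles: speedup {s:5.1f}x, efficiency {e:4.0%}")
```

A curve like this is exactly what the demo lets you see live as you crank up the tile count.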
These numbers have been improved since, and there are different ways of optimizing this performance. At 352 by 240, the estimated data rate-- the estimated throughput-- is almost 160 frames per second. So this is really high bandwidth.

Another interesting thing that we've done with the Raw chip is taking a look at graphics pipelines, and at whether there's anything we can do to exploit the inherent tiled architecture of the Raw chip. So here's a screenshot from Counter-Strike, and a simplified graphics pipeline where you have some input for the screen you want to render. You do some vertex shading-- so these are triangles where you want to figure out what colors to make them, what colors to paint them. Then the triangle setup and the pixel stage. And in this screen you'll notice that there are two different things that you're rendering. There's essentially this part of the screen, which has a lot of triangles that span a relatively not-so-complex image. And then you have these guys here, that have fewer triangles spanning a smaller region of the frame.
And what you might want to do is allocate more compute power to the pixel stage and less compute power to the vertex stage. So that's analogous to saying: I want more tiles for one stage of the pipeline and fewer tiles for another. Or maybe I want to be able to dynamically change how many tiles I'm allocating to different stages of the pipeline, so that as the screens you're rendering change in terms of their complexity, you can maintain the good visual illusion transparently, without compromising the utilization of the chip.

So, some demos that were done with the graphics group at MIT-- Fredo Durand's group. Phong shading: you have 132 vertices with 1 light source. So this is what you're trying to shade. You have a lot of regions that are black.
So if you're looking at a fixed pipeline where the vertex shader is taking 6 tiles-- this is on a 64-tile chip-- the rasterizer is taking 15 tiles, the pixel processor has 15 tiles, and the alpha buffer operations have 15 tiles, then you might not get the best utilization, because for that entire region that you're rendering where it's black, there's nothing really interesting happening. You want to shift those tiles to another stage of the pipeline. Or, if you can't really utilize them, then you're just wasting power, wasting energy, and so you might just want to shut them down and not use them at all. So with a fixed pipeline versus a reconfigurable pipeline-- where I can change the number of tiles allocated to different stages of the pipeline-- I can get better utilization. And, in some cases, better performance. So here, fuller bars, and you're finishing faster in time. So this is indicative also of what's going on in the graphics industry.
The graphics card used to have fixed resources allocated to different stages, which is essentially what we're trying to model in this part of the experiment. Whereas more and more now you have unified shaders that you can use for both the pixel shading and the vertex shading. So you're getting into more of that programmable aspect, precisely because you want to be able to do this kind of load balancing and exploit the dynamism that you see in the different things that you're trying to render.

Another example: shadow volumes. You have 4 triangles, one light source. And this was rendered in three passes. So pass 1, pass 2, pass 3 would essentially take the same amount of time, because you're doing the same computation mapped to a fixed number of resources. But if I can change the number of resources that I use for different passes-- the rasterizer, for example, and the alpha buffer operations are really where you need a lot of power-- so if you go from 15 tiles for each to 20 tiles for each, you get better execution time, because you were able to exploit parallelism, or match parallelism better to the application.
And so you get 40% faster in this particular case.

And another interesting application: this is the largest microphone array in the world. It's actually in the Guinness Book of Records. It was built in the lab. Each of these little boards has two microphones on it. And what you can use this for is eavesdropping, for example. Or you can carry this around if you want-- pack it in the car and do some spying. But somewhat more interesting demos that were done with this, at smaller scales, were that in a noisy room, for example, you might want to sort of hone in. Let's say everybody here was speaking, but for the camera they want to record only my voice. They can have a microphone array in the back that focuses on just my voice. And the way it's done is you can measure the distance, from the time it takes for the sound wave to reach each of these different microphones, and you can focus in on a particular source of sound and be able to just highlight that.
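The focusing trick just described is delay-and-sum beamforming, and the core of it fits in a few lines of Python. This is a simplified sketch, not the array's actual code: it assumes the per-microphone arrival delays of the target source (in whole samples) are already known from the geometry.

```python
# Minimal delay-and-sum beamformer sketch (illustrative, stdlib only).
# The target source reaches each microphone with a known delay; shifting
# each channel back by its delay aligns the target coherently, while
# sound from other directions averages toward zero.

def delay_and_sum(channels, delays):
    """channels: list of equal-length sample lists.
    delays: per-channel arrival delay of the target, in samples."""
    n = len(channels[0])
    out = []
    for t in range(n):
        acc = 0.0
        for ch, d in zip(channels, delays):
            j = t + d  # where the target's wavefront was on this channel
            if 0 <= j < n:
                acc += ch[j]
        out.append(acc / len(channels))
    return out

# Example: the same pulse reaches mic 0 at sample 2 and mic 1 at sample 4.
mic0 = [0, 0, 1.0, 0, 0, 0]
mic1 = [0, 0, 0, 0, 1.0, 0]
aligned = delay_and_sum([mic0, mic1], delays=[2, 4])
print(aligned)  # the pulse from both mics now adds up at sample 0
```

In the real array the delays are recomputed as the target moves, which is what lets it track a speaker around the room.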
So there's this demo where there's a noisy room-- I probably should have had these in here, in retrospect-- there's a noisy room, lots of people are talking, then you turn on the microphone array and you can hear that one particular source, and it's a lot clearer. You can also have applications where you're tracking a person in a room with video as well, so you can sort of follow him around. So it's a very interesting application. And now I regret not having the video demo in here. Actually, should I do it? It's on the Web. OK.

So, a case study using the beamformer. What's being done in the microphone array is beamforming. You're trying to figure out what are the different beams that are reaching the microphones, and you want to be able to amplify one of them. So looking at the application written natively in C, running on a 1 gigahertz Pentium, what is the operation throughput? You're getting about 240 MegaFLOPS. And if you go down to the same code, but running on a single-tile Raw chip, you get about 19 MegaFLOPS.
So not very good performance. But here, what you really want to do-- you have a lot of parallelism, because each of those beams that's reaching the individual microphones can be processed in parallel. So you have a lot of parallelism in that application. So taking the C program, reimplementing it in StreamIt, which you've seen in previous lectures, and not really optimizing it-- not doing a lot of the optimizations you saw in the parallelizing compiler talk-- you get about 640 MegaFLOPS. So already you're beating the C program running on a pretty fast superscalar machine. And if you really optimize the StreamIt code-- doing the fission and fusion, increasing the parallelism, doing better load balancing automatically-- you can get up to 1.4 GigaFLOPS. So really good performance, and really matching the inherent parallelism to the architecture.

So that was just a big overview of the Raw chip and what we've done with it in the lab. There's more in here than I've talked about.
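As a quick arithmetic check, the throughput figures just quoted work out to the following ratios. Only the quoted FLOPS numbers go in; the labels are shorthand.

```python
# Ratios implied by the beamformer throughput figures quoted above.
P3_C = 240e6          # C on the ~1 GHz Pentium, FLOP/s
RAW_1TILE_C = 19e6    # same C code on a single Raw tile
STREAMIT = 640e6      # unoptimized StreamIt on Raw
STREAMIT_OPT = 1.4e9  # optimized StreamIt (fission/fusion, load balancing)

print(f"StreamIt vs C on the Pentium:       {STREAMIT / P3_C:.1f}x")
print(f"Optimized StreamIt vs the Pentium:  {STREAMIT_OPT / P3_C:.1f}x")
print(f"Optimized StreamIt vs one tile:     {STREAMIT_OPT / RAW_1TILE_C:.0f}x")
```

So even the unoptimized StreamIt version is roughly 2.7x the superscalar machine, and the optimized version close to 6x.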
But what I'm going to do next is give you some insight into the design philosophy that went into the Raw architecture-- why it was designed the way it was. And then I'm going to talk a little bit about the Raw parallelizing compiler. And while the StreamIt language and compiler also have a back end for the Raw architecture, we've seen that in previous lectures, so I'm not going to talk about that here. I'm just going to focus on the first two bullets.

A few years ago, when the project got started, the insight was that wide-issue processors-- the design philosophy that was being followed in industry, building wider superscalars, faster superscalars-- were really going to come to a halt, largely because you have scalability issues. So if you look at a simplified illustration of a wide-issue microprocessor, you have your program counter that fetches instructions. They go into some control logic. The control logic is then going to run; you're going to read some values from the register file. You'll have a big crossbar in the middle that routes operands to ALUs.
And then you operate on those, and you have to send the results back to the register file. Plus you have this really big problem with the network. So if you're doing some computation-- sorry, I rearranged these slides. So if you have n ALUs, then the complexity of your crossbar increases as n squared, because you essentially have to have everybody talking to each other. And in terms of the number of wires that you need out of the register file to support everybody being able to talk to anybody else very efficiently, the number of ports, the number of wires, increases as n cubed. So that's a problem, because you can't clock all those wires fast enough. The frequency becomes limited-- it grows even less than linearly. And this is a problem because operand routing is global. So if I'm doing some operations-- say there's an add, and the result of this add is fed to another operation, a shift, and these are going to execute on two different ALUs-- what's going to happen? I do the add operation. It's going to produce a result.
436 00:19:26,100 --> 00:19:30,100 But there's no direct path for this ALU to send this result 437 00:19:30,100 --> 00:19:30,560 to this ALU. 438 00:19:30,560 --> 00:19:33,530 So instead what happens is the operand has to travel 439 00:19:33,530 --> 00:19:36,195 all the way back around through the crossbar and then 440 00:19:36,195 --> 00:19:37,445 back to this ALU. 441 00:19:39,700 --> 00:19:43,210 So that's really just going to take a long time and it's not 442 00:19:43,210 --> 00:19:44,300 necessarily very efficient. 443 00:19:44,300 --> 00:19:48,140 And if you're doing this for a lot of ALU operations-- you 444 00:19:48,140 --> 00:19:49,780 have a lot of parallelism in your application, 445 00:19:49,780 --> 00:19:51,750 instruction-level parallelism-- that's just 446 00:19:51,750 --> 00:19:53,300 creating a lot of communication. 447 00:19:53,300 --> 00:19:55,170 But you're not really exploiting the locality of the 448 00:19:55,170 --> 00:19:56,830 computation. 449 00:19:56,830 --> 00:19:59,440 If two instructions are really close together, you want to be 450 00:19:59,440 --> 00:20:01,890 able to just have a point-to-point path, for 451 00:20:01,890 --> 00:20:05,050 example, or a shorter path that allows you to exploit 452 00:20:05,050 --> 00:20:07,920 where instructions are in space. 453 00:20:07,920 --> 00:20:11,220 And so this was the driving insight for the architecture: 454 00:20:11,220 --> 00:20:14,850 you want to make operand routing local. 455 00:20:14,850 --> 00:20:18,110 So the idea is to essentially exploit this locality by 456 00:20:18,110 --> 00:20:19,730 distributing the ALUs. 457 00:20:19,730 --> 00:20:22,570 And rather than having that massive crossbar, what you 458 00:20:22,570 --> 00:20:25,660 want to do is have an on-chip mesh network. 459 00:20:25,660 --> 00:20:28,393 So rather than one big crossbar, you have lots of 460 00:20:28,393 --> 00:20:29,060 smaller ones.
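[The scaling argument above can be made concrete with a toy back-of-the-envelope calculation. This is my own illustration, not from the lecture; the function names and the exact constants are assumptions -- only the growth rates (n squared for the crossbar, n cubed for the register file wiring) come from the lecture.]

```python
def crossbar_paths(n):
    """Point-to-point paths a full crossbar must provide so that
    every one of n ALUs can talk to every other: grows as n^2."""
    return n * n

def regfile_wires(n):
    """Rough register-file wiring cost to feed n ALUs at full
    rate (ports x width x readers): grows as n^3."""
    return n ** 3

# Doubling the issue width from 8 to 16 ALUs quadruples the
# crossbar and multiplies the register-file wiring by eight.
for n in (4, 8, 16):
    print(n, crossbar_paths(n), regfile_wires(n))
```

[The point of the sketch is just that the centralized structures grow superlinearly, so a tiled design with per-tile resources wins as n gets large.]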
461 00:20:29,060 --> 00:20:31,040 So these become switch processors. 462 00:20:31,040 --> 00:20:34,750 So I can output a value from this ALU here and then have that 463 00:20:34,750 --> 00:20:37,350 value routed to any other ALU. 464 00:20:37,350 --> 00:20:39,990 Maybe that just costs me more in terms of instructions that 465 00:20:39,990 --> 00:20:42,770 say where this operand is going. 466 00:20:42,770 --> 00:20:44,320 We'll get into that. 467 00:20:44,320 --> 00:20:46,580 But here, what this allows me to do is exploit 468 00:20:46,580 --> 00:20:47,790 that locality better. 469 00:20:47,790 --> 00:20:51,650 Same instruction chain: I can put the first operation on one 470 00:20:51,650 --> 00:20:55,950 ALU, I can put the other operation on the second ALU. 471 00:20:55,950 --> 00:20:58,240 And here, rather than putting it for example here, which 472 00:20:58,240 --> 00:21:01,230 would send the operand really far across the chip, what I want 473 00:21:01,230 --> 00:21:03,660 to do is recognize that there's a producer/consumer 474 00:21:03,660 --> 00:21:04,800 relationship here. 475 00:21:04,800 --> 00:21:07,245 I want to exploit that locality and have them close 476 00:21:07,245 --> 00:21:11,260 in space so that the routes remain fairly short. 477 00:21:11,260 --> 00:21:13,890 You know, what I can also do is pipeline this network 478 00:21:13,890 --> 00:21:16,770 so that I can have the hardware essentially match the 479 00:21:16,770 --> 00:21:18,530 computation flow. 480 00:21:18,530 --> 00:21:22,900 If one ALU is producing results at a lot faster 481 00:21:22,900 --> 00:21:25,680 rate than, for example, this instruction can consume them, 482 00:21:25,680 --> 00:21:29,470 then the hardware can take care of, for example, blocking 483 00:21:29,470 --> 00:21:32,680 or stalling the producing processor so it doesn't get 484 00:21:32,680 --> 00:21:33,650 too far ahead.
485 00:21:33,650 --> 00:21:36,490 It gives you a natural mechanism for regulating the 486 00:21:36,490 --> 00:21:39,940 flow of data on the chip. 487 00:21:39,940 --> 00:21:44,680 Well, this is better than what we saw before because with the 488 00:21:44,680 --> 00:21:47,260 crossbar you're not really getting any scalability in 489 00:21:47,260 --> 00:21:50,670 terms of your latency transporting operands from one 490 00:21:50,670 --> 00:21:53,380 ALU to another. 491 00:21:53,380 --> 00:21:56,790 Whereas with an on-chip network, if you've taken routing 492 00:21:56,790 --> 00:21:59,340 classes, you know that there exist algorithms that 493 00:21:59,340 --> 00:22:03,030 allow you to route things in time proportional to the square root of n, 494 00:22:03,030 --> 00:22:05,170 where n is the number of things that are communicating 495 00:22:05,170 --> 00:22:06,380 in your network. 496 00:22:06,380 --> 00:22:08,560 But if you're doing locality-driven placement then it's 497 00:22:08,560 --> 00:22:10,040 essentially constant time. 498 00:22:10,040 --> 00:22:12,190 And in a raw chip, it's in fact three cycles. 499 00:22:12,190 --> 00:22:15,220 So you can send one operand from one tile to another in 500 00:22:15,220 --> 00:22:15,780 three cycles. 501 00:22:15,780 --> 00:22:18,730 And we'll get into how that number comes about. 502 00:22:18,730 --> 00:22:19,830 So this is much better. 503 00:22:19,830 --> 00:22:21,450 But what it does is increase the 504 00:22:21,450 --> 00:22:22,750 complexity of the compiler. 505 00:22:22,750 --> 00:22:25,780 It says, this is my computation, how do you map it 506 00:22:25,780 --> 00:22:28,960 efficiently so that things are clustered well in space, so 507 00:22:28,960 --> 00:22:33,880 that I don't have these really long routes for communication? 508 00:22:33,880 --> 00:22:36,190 But then we can look at what else we can distribute. 509 00:22:36,190 --> 00:22:38,640 Well, we have the register file.
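[The stall-on-full flow control described above -- the network blocking a producer that gets too far ahead of its consumer -- can be sketched as a bounded FIFO. This is a minimal model of my own; the class name, capacity, and try_send/try_recv interface are illustrative assumptions, not raw's actual hardware interface.]

```python
from collections import deque

class OperandLink:
    """Tiny model of a pipelined on-chip link with backpressure:
    a producer that outruns its consumer is forced to stall."""
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.fifo = deque()

    def try_send(self, value):
        """Producer side: returns False (stall) when the link is full."""
        if len(self.fifo) >= self.capacity:
            return False
        self.fifo.append(value)
        return True

    def try_recv(self):
        """Consumer side: returns None (stall) when the link is empty."""
        return self.fifo.popleft() if self.fifo else None

link = OperandLink(capacity=2)
assert link.try_send(1) and link.try_send(2)
assert not link.try_send(3)   # producer stalls: consumer too far behind
assert link.try_recv() == 1   # consumer drains one value
assert link.try_send(3)       # producer can proceed again
```

[The hardware analogue is that the producing tile's pipeline simply does not advance while the link is full, which is the "natural mechanism for regulating the flow of data" the lecture refers to.]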
510 00:22:38,640 --> 00:22:41,240 We can distribute that across all the ALUs. 511 00:22:41,240 --> 00:22:44,500 And that essentially reduces that n cubed relationship 512 00:22:44,500 --> 00:22:47,980 between ALUs and register file ports to something that's a 513 00:22:47,980 --> 00:22:49,130 lot more tractable, 514 00:22:49,130 --> 00:22:54,170 where it's one small register file per ALU. 515 00:22:54,170 --> 00:22:57,370 And this is better in terms of scalability, but we haven't 516 00:22:57,370 --> 00:22:59,870 solved the entire problem in that we still have one global 517 00:22:59,870 --> 00:23:03,390 program counter, one global instruction fetch unit, 518 00:23:03,390 --> 00:23:07,240 one global control unit, a unified load/store queue for 519 00:23:07,240 --> 00:23:08,600 communicating with memory. 520 00:23:08,600 --> 00:23:13,850 And those all have scalability problems. So whereas we fixed 521 00:23:13,850 --> 00:23:15,360 the problem with the crossbar-- 522 00:23:15,360 --> 00:23:17,840 that becomes more scalable-- 523 00:23:17,840 --> 00:23:19,940 we haven't really fixed the problems with the others. 524 00:23:19,940 --> 00:23:22,530 So what's the natural solution here? 525 00:23:22,530 --> 00:23:26,250 Well, we'll just distribute everything else. 526 00:23:26,250 --> 00:23:30,090 And so you end up with each ALU here now having its own 527 00:23:30,090 --> 00:23:32,610 program counter, its own instruction cache, its own 528 00:23:32,610 --> 00:23:33,540 data cache. 529 00:23:33,540 --> 00:23:37,610 And it has its register file and ALU, and that same 530 00:23:37,610 --> 00:23:40,560 design pattern is repeated for each 531 00:23:40,560 --> 00:23:41,840 one of those ALUs. 532 00:23:41,840 --> 00:23:44,340 So now it looks a lot more scalable. 533 00:23:44,340 --> 00:23:46,320 I don't have any global wires. 534 00:23:46,320 --> 00:23:49,100 There's no global centralized data structure.
535 00:23:49,100 --> 00:23:52,220 And all of that means I can do things 536 00:23:52,220 --> 00:23:55,600 faster and more efficiently. 537 00:23:55,600 --> 00:23:58,990 And what you start seeing here is this sort of tiled processor 538 00:23:58,990 --> 00:24:00,110 coming about. 539 00:24:00,110 --> 00:24:03,920 So each one of those things was exactly the same. 540 00:24:03,920 --> 00:24:06,470 And what was done in the raw processor is that none of 541 00:24:06,470 --> 00:24:09,880 those tiles was larger than the distance you can communicate in one 542 00:24:09,880 --> 00:24:10,710 clock cycle. 543 00:24:10,710 --> 00:24:14,850 So this solved essentially the wire delay problem as well. 544 00:24:14,850 --> 00:24:17,600 So if this is the distance that a wire-- 545 00:24:17,600 --> 00:24:19,340 that a signal can travel in one clock 546 00:24:19,340 --> 00:24:21,970 cycle, the tile is smaller. 547 00:24:21,970 --> 00:24:23,810 It can fit within this circle. 548 00:24:23,810 --> 00:24:26,820 So that means that you're guaranteed-- 549 00:24:26,820 --> 00:24:29,200 you have better scalability properties. You're solving the 550 00:24:29,200 --> 00:24:32,860 issues that people are facing with wire delay. 551 00:24:32,860 --> 00:24:36,940 And in terms of the tiled processor abstraction, Michael 552 00:24:36,940 --> 00:24:41,680 Taylor, who was a PhD student in the raw group, in his thesis 553 00:24:41,680 --> 00:24:45,780 identified the tiled processor approach and the 554 00:24:45,780 --> 00:24:48,000 aspect of the tiled processor approach that makes it 555 00:24:48,000 --> 00:24:50,880 attractive, the SON, 556 00:24:50,880 --> 00:24:52,990 which is the scalar operand network. 557 00:24:52,990 --> 00:24:57,080 And the next two slides, the next part of the lecture, is 558 00:24:57,080 --> 00:25:00,120 going to really focus on what that means.
559 00:25:00,120 --> 00:25:02,170 He argues why the tiled 560 00:25:02,170 --> 00:25:05,340 processor approach is scalable. 561 00:25:05,340 --> 00:25:07,160 And it's scalable for the same reasons as multicores. 562 00:25:07,160 --> 00:25:09,350 You just add more and more cores on a chip. 563 00:25:09,350 --> 00:25:13,910 But the intrinsic difference between the multicores that you 564 00:25:13,910 --> 00:25:16,580 see today and the raw architecture is the scalar 565 00:25:16,580 --> 00:25:18,150 operand network. 566 00:25:18,150 --> 00:25:20,960 So I'm going to ask you questions about 567 00:25:20,960 --> 00:25:22,690 this in a few slides. 568 00:25:22,690 --> 00:25:25,620 But really what you're getting here is the ability to 569 00:25:25,620 --> 00:25:28,980 communicate from one processor to another very efficiently. 570 00:25:28,980 --> 00:25:31,990 And the way you do this on raw is you have your instruction 571 00:25:31,990 --> 00:25:35,340 fetch, decode, register file read stage, ALU-- 572 00:25:35,340 --> 00:25:38,090 your computation pipeline. 573 00:25:38,090 --> 00:25:41,430 But part of the register file-- so registers 24 574 00:25:41,430 --> 00:25:43,960 through 27-- are network mapped. 575 00:25:43,960 --> 00:25:46,890 So what that means is, if one of the 576 00:25:46,890 --> 00:25:51,800 operations that I have in my computation has a destination 577 00:25:51,800 --> 00:25:56,480 register that's 24, 25, 26 or 27, that value automatically 578 00:25:56,480 --> 00:25:59,360 gets sent to the output network. 579 00:25:59,360 --> 00:26:01,150 And if I have a value-- 580 00:26:01,150 --> 00:26:04,960 if one of my source operands is register 24, 25, 26 or 581 00:26:04,960 --> 00:26:08,340 27, implicitly that means get that value off the network.
582 00:26:12,010 --> 00:26:15,780 And so I can have add $25-- 583 00:26:15,780 --> 00:26:18,560 an add to register 25-- so this is one of the network-mapped 584 00:26:18,560 --> 00:26:20,760 ports, summing two operands. 585 00:26:20,760 --> 00:26:23,150 So this is a picture of the raw chip. 586 00:26:23,150 --> 00:26:25,100 This is one tile. 587 00:26:25,100 --> 00:26:26,760 This is the other tile. 588 00:26:26,760 --> 00:26:30,250 So you can sort of see the compute processor and the network 589 00:26:30,250 --> 00:26:32,110 switch processor here. 590 00:26:32,110 --> 00:26:36,340 So the operand flows into the network and then gets 591 00:26:36,340 --> 00:26:39,360 transported across from one tile to the other. 592 00:26:39,360 --> 00:26:40,800 And then it gets injected into the other 593 00:26:40,800 --> 00:26:43,270 tile's compute pipeline. 594 00:26:43,270 --> 00:26:46,700 And here this instruction has a source operand that's 595 00:26:46,700 --> 00:26:48,250 a register-mapped operand. 596 00:26:48,250 --> 00:26:49,730 So it knows where to get its value from. 597 00:26:49,730 --> 00:26:51,830 And then you can do the computation. 598 00:26:51,830 --> 00:26:55,200 An interesting aspect here is that while you've seen 599 00:26:55,200 --> 00:26:58,080 instructions like this, just normal instructions, here you 600 00:26:58,080 --> 00:27:02,220 also have explicit routing instructions that are executed 601 00:27:02,220 --> 00:27:04,330 on the switch processor. 602 00:27:04,330 --> 00:27:06,960 So the switch processor here says take the value that's 603 00:27:06,960 --> 00:27:11,990 coming from my processor and send it east. So each 604 00:27:11,990 --> 00:27:15,360 processor can send values east, west, north or south. 605 00:27:15,360 --> 00:27:17,950 So it can go to the tile above it, the tile below it, the 606 00:27:17,950 --> 00:27:20,650 tile to the left of it or the tile to the right of it. 607 00:27:20,650 --> 00:27:24,290 And so sending it east sends it along this wire here.
608 00:27:24,290 --> 00:27:27,120 And then this particular switch processor says get a 609 00:27:27,120 --> 00:27:30,910 value from the west port and send it to my processor. 610 00:27:30,910 --> 00:27:33,970 Now, this switch could also say, this 611 00:27:33,970 --> 00:27:37,060 value is not for me, so I want to just pass it through to some 612 00:27:37,060 --> 00:27:37,980 other processor. 613 00:27:37,980 --> 00:27:40,770 So you can pass it from the west port to the south port or 614 00:27:40,770 --> 00:27:44,170 to the north port, or just pass it through laterally to the 615 00:27:44,170 --> 00:27:46,530 east port. 616 00:27:46,530 --> 00:27:48,000 So this essentially gives you an 617 00:27:48,000 --> 00:27:50,480 on-chip network. Now, you can imagine 618 00:27:50,480 --> 00:27:55,040 having an operand that's a data packet with a header that 619 00:27:55,040 --> 00:27:58,540 says, I'm going to tile 10, and the switches know 620 00:27:58,540 --> 00:27:59,510 which way to send it. 621 00:27:59,510 --> 00:28:01,700 But the interesting aspect here is that the compiler 622 00:28:01,700 --> 00:28:04,060 actually orchestrates the communication, so you don't 623 00:28:04,060 --> 00:28:06,612 need that extra header that says, I'm going to tile 10. 624 00:28:06,612 --> 00:28:09,380 You just have to generate a schedule of how to route that 625 00:28:09,380 --> 00:28:11,250 data through. 626 00:28:11,250 --> 00:28:13,170 So we'll get into what that means for the compiler in 627 00:28:13,170 --> 00:28:16,140 terms of that added complexity. 628 00:28:16,140 --> 00:28:19,630 So communication on multicores is expensive for 629 00:28:19,630 --> 00:28:20,640 the following reasons. 630 00:28:20,640 --> 00:28:24,400 And this is really going to contrast, or going to put, 631 00:28:24,400 --> 00:28:26,360 the scalar operand network into slightly more 632 00:28:26,360 --> 00:28:27,450 perspective.
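[The register-mapped send/receive and the compiler-scheduled switch route described above can be sketched in a few lines. This is my own toy model, not raw's ISA or hardware: the Tile class, the route_east helper, and the use of Python deques for the network ports are all illustrative assumptions. Only the convention that registers 24 through 27 are network mapped comes from the lecture.]

```python
from collections import deque

# Registers 24-27 are network mapped: writing one injects an operand
# into the switch; reading one pulls an operand off the switch.
NETWORK_REGS = {24, 25, 26, 27}

class Tile:
    def __init__(self):
        self.regs = [0] * 32
        self.out = deque()   # operands headed out to this tile's switch
        self.inq = deque()   # operands arriving from the switch

    def write(self, reg, value):
        if reg in NETWORK_REGS:
            self.out.append(value)     # implicit send to the network
        else:
            self.regs[reg] = value

    def read(self, reg):
        if reg in NETWORK_REGS:
            return self.inq.popleft()  # implicit receive from the network
        return self.regs[reg]

def route_east(src, dst):
    """One compiler-scheduled switch instruction: move the operand
    from src's output port east into dst's input port."""
    dst.inq.append(src.out.popleft())

a, b = Tile(), Tile()
a.write(25, 40 + 2)      # like 'add $25, ...' on tile A: result goes out
route_east(a, b)         # the switch schedule, not a packet header
assert b.read(25) == 42  # like 'add ..., $25' on tile B: operand comes in
```

[Note that route_east carries no destination header: the route exists only because the compiler emitted that switch instruction, which is exactly the point the lecture makes about compile-time orchestration.]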
633 00:28:27,450 --> 00:28:31,480 But first, how do you communicate between cores 634 00:28:31,480 --> 00:28:32,650 on cell? 635 00:28:32,650 --> 00:28:36,510 You have the DMA transfers from one SPE to another. 636 00:28:36,510 --> 00:28:39,570 You can't really ship a single operand value. 637 00:28:39,570 --> 00:28:43,030 So if I write the value x, and I want to send x from one SPE 638 00:28:43,030 --> 00:28:46,790 to another, I can't really do that very efficiently, right? 639 00:28:46,790 --> 00:28:52,140 So this is essentially the contrast between 640 00:28:52,140 --> 00:28:55,320 the multicore processors that largely exist today and the 641 00:28:55,320 --> 00:28:56,350 raw processor. 642 00:28:56,350 --> 00:29:00,210 So I've shown you an empirical-- a quantitative-- 643 00:29:00,210 --> 00:29:04,170 an analytical model for communication costs in 644 00:29:04,170 --> 00:29:06,380 earlier slides. 645 00:29:06,380 --> 00:29:08,740 This is an illustration of that concept. 646 00:29:08,740 --> 00:29:12,370 So if I have a processor that's talking to another, 647 00:29:12,370 --> 00:29:16,230 that value has to travel across some network and 648 00:29:16,230 --> 00:29:18,940 there's some transport cost associated with that. 649 00:29:18,940 --> 00:29:20,590 But there are also some added complexities. 650 00:29:20,590 --> 00:29:22,760 So there were lots of terms, if you remember, in that 651 00:29:22,760 --> 00:29:25,730 really big equation I showed before. 652 00:29:25,730 --> 00:29:29,100 You have some overhead in terms of packaging the data. 653 00:29:29,100 --> 00:29:32,040 And you have some overhead in terms of unpacking the data. 654 00:29:32,040 --> 00:29:33,420 So what does that look like? 655 00:29:33,420 --> 00:29:36,580 Well, there are two components we're going to break this down 656 00:29:36,580 --> 00:29:39,020 into: the send occupancy and the send latency. 657 00:29:39,020 --> 00:29:40,530 And I'm going to talk about each of those.
658 00:29:40,530 --> 00:29:43,050 And similarly on the receive side, you have the receive 659 00:29:43,050 --> 00:29:45,640 latency and the receive occupancy. 660 00:29:45,640 --> 00:29:50,400 So bear in mind, the lifetime of a message essentially has 661 00:29:50,400 --> 00:29:52,820 to flow through these five components. 662 00:29:52,820 --> 00:29:55,810 It has to go through the send occupancy stage, then there's 663 00:29:55,810 --> 00:29:59,810 the send latency, transport, receive latency and receive 664 00:29:59,810 --> 00:30:04,830 occupancy before you can actually use it to compute on. 665 00:30:04,830 --> 00:30:06,670 So what are some things that you do here? 666 00:30:06,670 --> 00:30:09,900 Well, it's things that you've done on cell for getting DMA 667 00:30:09,900 --> 00:30:10,890 transfers to work. 668 00:30:10,890 --> 00:30:14,040 You have to figure out who the destination is, what is the 669 00:30:14,040 --> 00:30:17,800 value, maybe you have an ID associated with it, a tag, 670 00:30:17,800 --> 00:30:18,630 things of that sort. 671 00:30:18,630 --> 00:30:20,120 And you have to essentially inject that 672 00:30:20,120 --> 00:30:22,530 message into the network. 673 00:30:22,530 --> 00:30:24,210 So there's some latency associated with that. 674 00:30:24,210 --> 00:30:26,370 Maybe your-- 675 00:30:26,370 --> 00:30:31,480 on cell you have a DMA engine which essentially hides this 676 00:30:31,480 --> 00:30:32,510 latency for you. 677 00:30:32,510 --> 00:30:34,520 Because you can essentially just send the message to the 678 00:30:34,520 --> 00:30:36,110 DMA, right into its queue. 679 00:30:36,110 --> 00:30:39,530 And you can essentially forget about it unless it stalls 680 00:30:39,530 --> 00:30:43,340 because the DMA list is full. 681 00:30:43,340 --> 00:30:45,890 On the receive side, you sort of have a similar thing.
682 00:30:45,890 --> 00:30:49,810 You have to get the network to inject that value into the 683 00:30:49,810 --> 00:30:53,005 processor and then you have to depackage it, demultiplex it 684 00:30:53,005 --> 00:30:55,960 and put it into some form that you can actually use to 685 00:30:55,960 --> 00:30:57,670 operate on it. 686 00:30:57,670 --> 00:31:01,700 So this 5-tuple gives us a way of characterizing 687 00:31:01,700 --> 00:31:05,570 communication patterns on different architectures. 688 00:31:05,570 --> 00:31:09,530 So I can contrast, for example, raw versus a 689 00:31:09,530 --> 00:31:12,520 traditional microprocessor. 690 00:31:12,520 --> 00:31:15,460 So this is a traditional superscalar. 691 00:31:15,460 --> 00:31:18,800 A traditional superscalar essentially has all the 692 00:31:18,800 --> 00:31:22,200 sophisticated circuitry that gives you a 693 00:31:22,200 --> 00:31:23,660 bypass network. 694 00:31:23,660 --> 00:31:26,020 You can have an operand directly flowing to another 695 00:31:26,020 --> 00:31:29,950 ALU through all the n squared wires in the crossbar. 696 00:31:29,950 --> 00:31:33,320 And a lot of dynamic scheduling is going on. 697 00:31:33,320 --> 00:31:37,110 So it really has no occupancy, no latency; you're not really 698 00:31:37,110 --> 00:31:39,470 doing any packaging of the operands. 699 00:31:39,470 --> 00:31:43,460 Your transport cost is essentially completely hidden. 700 00:31:43,460 --> 00:31:46,000 You have no complexity on the receive side. 701 00:31:46,000 --> 00:31:47,540 So it's really efficient. 702 00:31:47,540 --> 00:31:50,140 So this is essentially what you want to get to: this 703 00:31:50,140 --> 00:31:51,250 kind of 5-tuple.
704 00:31:51,250 --> 00:31:54,170 But as we saw before, it's really not scalable because of 705 00:31:54,170 --> 00:31:57,460 the wire complexity woes-- whether it's n squared or n 706 00:31:57,460 --> 00:31:59,480 cubed, that's not good from an energy 707 00:31:59,480 --> 00:32:01,150 efficiency point of view. 708 00:32:01,150 --> 00:32:02,340 Scalable multiprocessors-- 709 00:32:02,340 --> 00:32:05,580 these are on-chip multiprocessors more 710 00:32:05,580 --> 00:32:08,770 indicative of things that you have today-- have this kind of 711 00:32:08,770 --> 00:32:12,210 5-tuple, where you have about 16 cycles just to get a 712 00:32:12,210 --> 00:32:15,355 message out, and roughly 3 cycles or so 713 00:32:15,355 --> 00:32:16,890 to transport the message. 714 00:32:16,890 --> 00:32:19,370 So maybe this is being done through a shared cache, 715 00:32:19,370 --> 00:32:22,120 which is how a lot of architectures communicate 716 00:32:22,120 --> 00:32:23,300 between processors today. 717 00:32:23,300 --> 00:32:26,970 And you have to sort of demultiplex the message on the 718 00:32:26,970 --> 00:32:28,130 receive side. 719 00:32:28,130 --> 00:32:30,280 So that adds some latency. 720 00:32:30,280 --> 00:32:34,580 In raw, because you have these network-mapped registers on 721 00:32:34,580 --> 00:32:37,210 the input side and the output side, you really can knock 722 00:32:37,210 --> 00:32:44,790 down the complexity on the send side in terms of the 723 00:32:44,790 --> 00:32:46,770 occupancy and latency to zero. 724 00:32:46,770 --> 00:32:48,610 You just write the values to the register. 725 00:32:48,610 --> 00:32:50,490 And it looks like a normal register, right? 726 00:32:50,490 --> 00:32:53,500 But it just magically appears on the network.
727 00:32:53,500 --> 00:32:56,380 And then from one tile to another, it's one cycle to 728 00:32:56,380 --> 00:32:59,380 ship the value across that one link from one switch processor 729 00:32:59,380 --> 00:33:02,020 to the other, as long as it's a near neighbor. 730 00:33:02,020 --> 00:33:04,080 And then two cycles to inject the value from the network 731 00:33:04,080 --> 00:33:05,820 into the tile processor. 732 00:33:05,820 --> 00:33:08,270 And then you're ready to use it. 733 00:33:08,270 --> 00:33:12,790 So in this space, where would you put cell is the question? 734 00:33:12,790 --> 00:33:14,310 Anybody have any ideas? 735 00:33:19,670 --> 00:33:21,790 What would the communication tuple look like on cell? 736 00:33:27,960 --> 00:33:30,930 So you have to do explicit sends and receives. 737 00:33:30,930 --> 00:33:35,450 So let's look at this. 738 00:33:35,450 --> 00:33:38,000 So can we get rid of this stage on cell, which is 739 00:33:38,000 --> 00:33:40,160 essentially packaging up my 740 00:33:40,160 --> 00:33:42,190 message? The answer is no, right? 741 00:33:42,190 --> 00:33:44,500 Because you have to essentially say where that DMA 742 00:33:44,500 --> 00:33:46,680 transfer is going to go-- which region of memory? 743 00:33:46,680 --> 00:33:49,670 So you're building these control blocks. 744 00:33:49,670 --> 00:33:54,230 And then the send latency here is roughly zero, because you 745 00:33:54,230 --> 00:33:56,090 have the DMA processor which allows that kind of 746 00:33:56,090 --> 00:33:58,830 concurrency between communication and computation, 747 00:33:58,830 --> 00:34:03,560 so you can essentially hide that part of the transport-- 748 00:34:03,560 --> 00:34:05,760 that part of the communication cost. 749 00:34:05,760 --> 00:34:09,210 For your transport cost here, you have this really massive 750 00:34:09,210 --> 00:34:10,860 bandwidth, this really high bandwidth 751 00:34:10,860 --> 00:34:11,750 interconnect on the chip.
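[The 5-tuple model above is just a sum of five per-message costs, which can be written down directly. The superscalar tuple of all zeros and raw's near-neighbor numbers (0 send occupancy, 0 send latency, 1 cycle per link, 2 cycles to inject, 0 receive occupancy) come straight from the lecture; the function name and the tuple layout are my own sketch.]

```python
def end_to_end_cycles(tup):
    """Sum the five components a message pays before you can
    compute on it: send occupancy, send latency, transport,
    receive latency, receive occupancy."""
    send_occ, send_lat, transport, recv_lat, recv_occ = tup
    return send_occ + send_lat + transport + recv_lat + recv_occ

superscalar = (0, 0, 0, 0, 0)  # bypass network hides everything
raw_tile    = (0, 0, 1, 2, 0)  # 1 cycle on the link + 2 to inject

print(end_to_end_cycles(raw_tile))  # 3 cycles near neighbor, as stated
```

[For tiles that are not adjacent, the transport component would grow by roughly one cycle per hop, while the other four components stay fixed -- which is what makes the scalar operand network scale.]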
752 00:34:11,750 --> 00:34:14,520 So this makes it reasonably fast, but 753 00:34:14,520 --> 00:34:16,430 it's still a few cycles. 754 00:34:16,430 --> 00:34:18,970 There's no near neighbor? 755 00:34:18,970 --> 00:34:22,420 Yeah, a hundred cycles to do near-neighbor communication. 756 00:34:22,420 --> 00:34:24,160 Because you're still-- 757 00:34:24,160 --> 00:34:26,210 you don't have that fast mechanism of being able to 758 00:34:26,210 --> 00:34:27,910 send things point to point. 759 00:34:27,910 --> 00:34:32,020 You're putting things on the bus and there's some 760 00:34:32,020 --> 00:34:33,690 complexity there. 761 00:34:33,690 --> 00:34:36,345 On the receive side, you have the same kind of complexity that 762 00:34:36,345 --> 00:34:37,820 you had on the send side. 763 00:34:37,820 --> 00:34:39,770 You have to know that the message is coming; that can be 764 00:34:39,770 --> 00:34:41,150 done in different ways. 765 00:34:41,150 --> 00:34:43,790 And then you have to take that message and write it into your 766 00:34:43,790 --> 00:34:45,380 local store. 767 00:34:45,380 --> 00:34:50,610 Which also adds some overhead in terms of the communication 768 00:34:50,610 --> 00:34:57,970 cost. So the cell would probably be somewhere up here, 769 00:34:57,970 --> 00:34:58,530 I would imagine. 770 00:34:58,530 --> 00:35:00,300 I didn't have a chance to get the numbers. 771 00:35:00,300 --> 00:35:04,490 If I do, I'll update the slide later on. 772 00:35:04,490 --> 00:35:08,770 OK, so that's essentially a brief insight into the raw-- 773 00:35:08,770 --> 00:35:09,100 yeah? 774 00:35:09,100 --> 00:35:13,550 AUDIENCE: Where did you get the scalable processor numbers? 775 00:35:13,550 --> 00:35:17,500 PROFESSOR RABBAH: So these are from Michael Taylor's thesis.
776 00:35:17,500 --> 00:35:21,790 So I believe what he's done here is just looked at some 777 00:35:21,790 --> 00:35:24,520 existing microprocessor and essentially benchmarked 778 00:35:24,520 --> 00:35:27,050 communication latency from one processor to another. 779 00:35:27,050 --> 00:35:30,696 AUDIENCE: So this is like going through the cache on the 780 00:35:30,696 --> 00:35:30,830 [OBSCURED]? 781 00:35:30,830 --> 00:35:32,010 PROFESSOR RABBAH: That's in fact how you-- 782 00:35:32,010 --> 00:35:34,310 a lot of these multiprocessors today have shared caches, 783 00:35:34,310 --> 00:35:37,770 either L1, and more so now it's L2. 784 00:35:37,770 --> 00:35:38,300 So if you have-- 785 00:35:38,300 --> 00:35:40,640 L1s are dedicated to different processors. 786 00:35:40,640 --> 00:35:41,890 But you still have to go through memory to communicate. 787 00:35:45,750 --> 00:35:48,360 So the raw parallelizing compiler-- yeah? 788 00:35:48,360 --> 00:35:50,310 Another question? 789 00:35:50,310 --> 00:35:52,540 AUDIENCE: You might want to postpone this question. 790 00:35:52,540 --> 00:35:57,950 Two related questions: so raw has-- 791 00:35:57,950 --> 00:36:00,500 I guess raw has pretty well optimized nearest neighbor 792 00:36:00,500 --> 00:36:02,170 communication. 793 00:36:02,170 --> 00:36:08,074 But we know from, for example, Rent's Rule, a heuristic in 794 00:36:08,074 --> 00:36:11,910 electrical engineering, about the number of wires needed for 795 00:36:11,910 --> 00:36:12,652 a given area. 796 00:36:12,652 --> 00:36:14,190 Is that in between-- 797 00:36:14,190 --> 00:36:21,050 as I recall, the minimum for a good sized circuit is 798 00:36:21,050 --> 00:36:24,592 proportional to the perimeter, or roughly the 799 00:36:24,592 --> 00:36:27,480 square root of the area. 800 00:36:27,480 --> 00:36:33,070 And it ranges from there to-- not proportional to the area. 801 00:36:33,070 --> 00:36:34,955 There's something in between.
802 00:36:34,955 --> 00:36:36,790 Something with 3 in it. 803 00:36:36,790 --> 00:36:40,180 Like to the 3/2 power I think, perhaps. 804 00:36:40,180 --> 00:36:41,710 No, something like 2/3rds, something like-- 805 00:36:41,710 --> 00:36:42,910 yeah, 2/3rds power. 806 00:36:42,910 --> 00:36:47,070 So the area to the 1/2 power or the area to the 2/3rds power. 807 00:36:47,070 --> 00:36:51,380 So Rent's Rule says the number of wires you need is roughly 808 00:36:51,380 --> 00:36:52,770 in that range. 809 00:36:52,770 --> 00:36:55,860 And so that sort of pushes that-- 810 00:36:55,860 --> 00:36:58,650 so the minimum you need is nearest-neighbor communication. 811 00:36:58,650 --> 00:37:01,990 And often you need more than that. 812 00:37:01,990 --> 00:37:06,470 We know from the FPGA experience that nearest 813 00:37:06,470 --> 00:37:09,470 neighbor communication is not-- 814 00:37:09,470 --> 00:37:11,010 or, at least, it's good to have more than nearest 815 00:37:11,010 --> 00:37:13,930 neighbor, and that often long wires that run across the 816 00:37:13,930 --> 00:37:15,610 chip are extremely-- 817 00:37:15,610 --> 00:37:16,600 PROFESSOR RABBAH: So I'm going to actually show you an 818 00:37:16,600 --> 00:37:20,280 example where nearest neighbor is good but you might also 819 00:37:20,280 --> 00:37:23,130 want some global mechanism for control 820 00:37:23,130 --> 00:37:25,490 orchestration, for example. 821 00:37:25,490 --> 00:37:28,470 AUDIENCE: Not just for con-- not just for control 822 00:37:28,470 --> 00:37:31,810 but for broadcast, for arbitrary data for the computation 823 00:37:31,810 --> 00:37:35,030 to use, not just for the chip to use.
824 00:37:35,030 --> 00:37:38,970 Like why are you scaling out two hops, four hops, fewer and 825 00:37:38,970 --> 00:37:39,450 fewer wires-- 826 00:37:39,450 --> 00:37:42,110 PROFESSOR RABBAH: Yes, in fact what I think is going to 827 00:37:42,110 --> 00:37:44,280 happen is a lot of these chip designs are going to be 828 00:37:44,280 --> 00:37:45,090 hierarchical. 829 00:37:45,090 --> 00:37:49,570 You have some really global type communication at the 830 00:37:49,570 --> 00:37:50,300 highest level. 831 00:37:50,300 --> 00:37:53,140 And then as you get within each one of the processors, 832 00:37:53,140 --> 00:37:55,610 then you see things at the lowest level, something that 833 00:37:55,610 --> 00:37:56,070 looks like raw. 834 00:37:56,070 --> 00:37:58,690 So you can build sort of a hierarchy of communication 835 00:37:58,690 --> 00:38:02,590 stages that allows you to sort of solve that problem. 836 00:38:02,590 --> 00:38:04,110 But all of that adds complexity, right? 837 00:38:04,110 --> 00:38:05,540 First you have to solve the problem of how do you 838 00:38:05,540 --> 00:38:09,120 parallelize for just a fixed number of cores and then 839 00:38:09,120 --> 00:38:10,470 figure out the communications. 840 00:38:10,470 --> 00:38:13,050 Once we understand how to do that well with a nice 841 00:38:13,050 --> 00:38:15,745 programming model then you can build hierarchically on that. 842 00:38:15,745 --> 00:38:17,975 AUDIENCE: On the other hand, it might make the compiler's 843 00:38:17,975 --> 00:38:20,250 job easier because it's not as constrained. 844 00:38:20,250 --> 00:38:21,090 PROFESSOR RABBAH: It might give you a 845 00:38:21,090 --> 00:38:21,570 nice fallback, right? 846 00:38:21,570 --> 00:38:24,915 It might save you in cases where there are things that 847 00:38:24,915 --> 00:38:26,360 are hard to do. 848 00:38:26,360 --> 00:38:29,720 There are some issues in the last two-- 849 00:38:29,720 --> 00:38:33,120 the second to the last three slides.
850 00:38:33,120 --> 00:38:36,862 We'll talk about an example of where that might be the case. 851 00:38:36,862 --> 00:38:40,970 AUDIENCE: Another question which [OBSCURED] 852 00:38:40,970 --> 00:38:45,770 so raw, I guess, being simple and tiled, I guess one of the 853 00:38:45,770 --> 00:38:47,436 selling points I think was that it really cuts down on 854 00:38:47,436 --> 00:38:48,850 the engineering effort. 855 00:38:48,850 --> 00:38:49,580 PROFESSOR RABBAH: Oh, absolutely. 856 00:38:49,580 --> 00:38:54,660 This was done with a million gates in-house for [OBSCURED] 857 00:38:54,660 --> 00:38:58,040 AUDIENCE: So a company like Intel has a ridiculous number 858 00:38:58,040 --> 00:38:58,860 of engineers. 859 00:38:58,860 --> 00:39:01,485 And to get a competitive edge, there's something they want to 860 00:39:01,485 --> 00:39:02,431 apply more engineering to. 861 00:39:02,431 --> 00:39:05,902 And so the question is, where might you apply more 862 00:39:05,902 --> 00:39:07,760 engineering to try to squeeze more-- 863 00:39:07,760 --> 00:39:09,416 PROFESSOR AMARASINGHE: That's the million dollar question 864 00:39:09,416 --> 00:39:11,220 that everybody's looking at. 865 00:39:11,220 --> 00:39:14,188 Because if somehow Intel thought they could add more 866 00:39:14,188 --> 00:39:15,570 and more engineering 867 00:39:15,570 --> 00:39:19,520 and then build this very complex full-scale [OBSCURED] 868 00:39:19,520 --> 00:39:22,200 But separate vessels. 869 00:39:22,200 --> 00:39:26,500 And so I think there's still a lot of things that are wrong. 870 00:39:26,500 --> 00:39:33,090 Meaning it's [OBSCURED] 871 00:39:33,090 --> 00:39:35,100 so at Intel basically they will let you do 872 00:39:35,100 --> 00:39:36,450 something like that.
873 00:39:36,450 --> 00:39:39,650 They will put a lot of engineers doing each of these 874 00:39:39,650 --> 00:39:43,640 components, fine-tuning everything, and they can get a lot more 875 00:39:43,640 --> 00:39:47,140 performance, a lot less power and stuff like that. 876 00:39:47,140 --> 00:39:53,150 So depending on what you want, science is not everything. 877 00:39:53,150 --> 00:39:58,420 There are a lot of other things [OBSCURED] 878 00:39:58,420 --> 00:40:00,820 So while it makes it easier? 879 00:40:00,820 --> 00:40:08,260 [OBSCURED] 880 00:40:08,260 --> 00:40:11,435 And the key thing is, you start with something simple and as 881 00:40:11,435 --> 00:40:14,220 you go on, you can add more and more complexity, 882 00:40:14,220 --> 00:40:18,510 just as there are more things to do. 883 00:40:18,510 --> 00:40:20,680 PROFESSOR RABBAH: Part of the complexity might be going to-- 884 00:40:20,680 --> 00:40:25,860 not making all those [OBSCURED]. 885 00:40:25,860 --> 00:40:30,030 OK, so raw pushes a lot of the complexity into the compiler 886 00:40:30,030 --> 00:40:33,240 in that the compiler now has to do at least two things. 887 00:40:33,240 --> 00:40:35,250 It has to distribute the instructions. 888 00:40:35,250 --> 00:40:37,450 You have a single program and you have to figure out how to 889 00:40:37,450 --> 00:40:39,140 parallelize it across multiple cores. 890 00:40:39,140 --> 00:40:41,900 But not only that, because you have the scalar operand 891 00:40:41,900 --> 00:40:44,480 network, you have to figure out how the different cores 892 00:40:44,480 --> 00:40:45,410 have to talk to each other. 893 00:40:45,410 --> 00:40:47,790 So you have to essentially generate a schedule for the 894 00:40:47,790 --> 00:40:50,400 switch processors as well. 895 00:40:50,400 --> 00:40:52,055 So I'm going to talk a little bit about the 896 00:40:52,055 --> 00:40:53,480 raw parallelizing compiler.
897 00:40:53,480 --> 00:40:55,470 And this is different from the StreamIt parallelizing 898 00:40:55,470 --> 00:40:58,890 compiler, which really takes a different program as 899 00:40:58,890 --> 00:41:01,450 an input, using a different language. 900 00:41:01,450 --> 00:41:04,830 This is work again done here at MIT by Walter Lee, who 901 00:41:04,830 --> 00:41:07,570 graduated two years ago. 902 00:41:07,570 --> 00:41:09,050 We have a sequential program. 903 00:41:09,050 --> 00:41:14,030 You inject it into RawCC, the Raw C compiler, and you get 904 00:41:14,030 --> 00:41:17,070 fine-grained orchestrated parallel execution. 905 00:41:17,070 --> 00:41:20,700 And what the compiler does is worry about data distribution, 906 00:41:20,700 --> 00:41:23,290 just like you have to do on cell in terms of which memory 907 00:41:23,290 --> 00:41:25,270 goes into which local store, 908 00:41:25,270 --> 00:41:27,560 which computation operates on-- 909 00:41:27,560 --> 00:41:29,540 the raw compiler has to worry about which computation 910 00:41:29,540 --> 00:41:32,460 operates on which data element and how you put that data in 911 00:41:32,460 --> 00:41:36,370 the right caches for each of the different tiles. 912 00:41:36,370 --> 00:41:39,400 Instruction distribution: so the way this compiler 913 00:41:39,400 --> 00:41:41,060 essentially gets parallelism is it's going to look at 914 00:41:41,060 --> 00:41:43,270 instruction level parallelism in your application. 915 00:41:43,270 --> 00:41:45,780 And it's going to divide that up among the different cores. 916 00:41:45,780 --> 00:41:48,810 And then the last step is the coordination of communication 917 00:41:48,810 --> 00:41:50,000 and control flow. 918 00:41:50,000 --> 00:41:51,330 So I'm just going to briefly step 919 00:41:51,330 --> 00:41:53,570 through each one of those. 920 00:41:53,570 --> 00:41:56,890 So the data distribution is really essentially trying 921 00:41:56,890 --> 00:41:58,410 to solve the problem of locality.
922 00:41:58,410 --> 00:42:01,350 You have two instructions. 923 00:42:01,350 --> 00:42:04,030 A load into r1 from some address and then 924 00:42:04,030 --> 00:42:05,410 you're adding to r1. 925 00:42:05,410 --> 00:42:06,930 You're incrementing that value. 926 00:42:06,930 --> 00:42:08,970 And you might write it back for later on. 927 00:42:08,970 --> 00:42:11,110 So where would you put these two instructions? 928 00:42:11,110 --> 00:42:15,060 So to exploit the locality, then you want the data-- if 929 00:42:15,060 --> 00:42:18,020 the data is here, then you want these two instructions to 930 00:42:18,020 --> 00:42:19,310 be on this tile. 931 00:42:19,310 --> 00:42:21,755 If the data is here, then you want these two instructions to 932 00:42:21,755 --> 00:42:23,420 be on this tile. 933 00:42:23,420 --> 00:42:25,700 Because it doesn't help you to have the data here and the 934 00:42:25,700 --> 00:42:27,130 instructions here. 935 00:42:27,130 --> 00:42:29,120 Because what do you have to do in that case? 936 00:42:29,120 --> 00:42:31,390 You have to send a message that says, send me this data. 937 00:42:31,390 --> 00:42:34,050 And then you have to wait for it to come in and then you 938 00:42:34,050 --> 00:42:35,020 have to operate on it. 939 00:42:35,020 --> 00:42:37,300 And then maybe you have to write it back. 940 00:42:37,300 --> 00:42:39,220 So the compiler sort of worries about the data 941 00:42:39,220 --> 00:42:40,030 distribution. 942 00:42:40,030 --> 00:42:42,190 It applies some data analysis. 943 00:42:42,190 --> 00:42:45,530 A lot of the things that you saw in Saman's lecture on classic 944 00:42:45,530 --> 00:42:47,020 parallelization technology. 945 00:42:47,020 --> 00:42:49,280 Sort of figure out the interdependencies and then 946 00:42:49,280 --> 00:42:51,770 it can figure out how to split up the data across the 947 00:42:51,770 --> 00:42:52,840 different cores.
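The locality argument above can be made concrete with a toy cost model: run the load/add pair on the tile that holds the data, versus on a remote tile that has to request it over the network. All the latencies here are invented for illustration; they are not Raw's actual cycle counts.

```python
# Toy cost model for instruction/data co-location.
# All cycle counts below are assumed values for illustration only.

ALU_COST = 1     # cycles for the add
CACHE_HIT = 2    # cycles for a local cache access
HOP_COST = 3     # cycles per network hop, one way (assumed)

def local_cost():
    # load (local cache hit) + add, both on the data's home tile
    return CACHE_HIT + ALU_COST

def remote_cost(hops):
    # "send me this data" request plus the reply carrying the value
    # (a round trip over `hops` hops), then the local add;
    # a write-back would add another round trip, omitted here
    return 2 * hops * HOP_COST + CACHE_HIT + ALU_COST

print(local_cost())         # cost with good placement
print(remote_cost(hops=2))  # cost when data lives two hops away
```

Even with these small made-up numbers, the remote case is several times the local one, which is why the compiler tries to place instructions on the tile that caches their operands.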
948 00:42:52,840 --> 00:42:55,683 And there's some other work done by other students in the 949 00:42:55,683 --> 00:42:58,470 group that tried to address this problem. 950 00:42:58,470 --> 00:43:05,020 The instruction distribution is perhaps just as complicated and 951 00:43:05,020 --> 00:43:06,040 interesting. 952 00:43:06,040 --> 00:43:07,980 Here, what's going on is-- let's say 953 00:43:07,980 --> 00:43:09,250 you have a basic block. 954 00:43:09,250 --> 00:43:10,950 So you take your sequential program. 955 00:43:10,950 --> 00:43:14,010 You figure out what are the different basic blocks of 956 00:43:14,010 --> 00:43:17,200 computation that you have, and within the basic block you 957 00:43:17,200 --> 00:43:18,510 have lots of instructions. 958 00:43:18,510 --> 00:43:21,650 So each one of these green boxes is a particular 959 00:43:21,650 --> 00:43:22,560 instruction. 960 00:43:22,560 --> 00:43:25,230 And what you're seeing-- these arrows here that connect the 961 00:43:25,230 --> 00:43:28,090 edges-- are operands that you have to exchange. 962 00:43:28,090 --> 00:43:30,320 So you might have-- 963 00:43:33,190 --> 00:43:33,835 this is an add instruction. 964 00:43:33,835 --> 00:43:35,880 It requires a value coming from here. 965 00:43:35,880 --> 00:43:36,970 Multiply-- 966 00:43:36,970 --> 00:43:39,640 a subtract instruction requires values coming in from 967 00:43:39,640 --> 00:43:40,690 different areas. 968 00:43:40,690 --> 00:43:42,490 So how would you distribute this 969 00:43:42,490 --> 00:43:44,630 across a number of cores-- 970 00:43:44,630 --> 00:43:46,720 or across a number of tiles? 971 00:43:46,720 --> 00:43:50,150 Any ideas here? 972 00:43:50,150 --> 00:43:53,350 So you can look for, for example, some chains that are 973 00:43:53,350 --> 00:43:55,330 not interconnected. 974 00:43:55,330 --> 00:43:57,540 So you can look for clusters that you can use.
975 00:43:57,540 --> 00:44:00,940 And say, OK, well I see no edges here so maybe I can put 976 00:44:00,940 --> 00:44:02,870 this on one tile. 977 00:44:02,870 --> 00:44:05,270 And then maybe I can put some of these instructions on 978 00:44:05,270 --> 00:44:06,440 another tile. 979 00:44:06,440 --> 00:44:09,010 Because sort of the communication flow is local. 980 00:44:09,010 --> 00:44:12,630 So maybe one strategy might be, look for the longest 981 00:44:12,630 --> 00:44:15,000 single chains so you can keep the communication flow. 982 00:44:15,000 --> 00:44:18,630 And then you apply a clustering algorithm and come up with a 983 00:44:18,630 --> 00:44:20,550 number of clusters. 984 00:44:20,550 --> 00:44:22,530 Something like that does happen. 985 00:44:22,530 --> 00:44:26,070 And keep in mind from the lectures where we talked about the 986 00:44:26,070 --> 00:44:27,800 parallelizing compiler, you have to worry about 987 00:44:27,800 --> 00:44:29,550 parallelism versus communication. 988 00:44:29,550 --> 00:44:31,800 So the more you distribute things, the more communication 989 00:44:31,800 --> 00:44:33,240 you have to get right. 990 00:44:33,240 --> 00:44:34,640 So here we're showing-- 991 00:44:34,640 --> 00:44:38,400 what I'm showing is a color mapping from the original 992 00:44:38,400 --> 00:44:41,520 instructions in the basic block to the same instructions, but 993 00:44:41,520 --> 00:44:44,290 now each color essentially represents a different cluster 994 00:44:44,290 --> 00:44:48,900 or essentially code that would map to a different thread. 995 00:44:48,900 --> 00:44:52,270 So blue is one thread, yellow is another, green is another, 996 00:44:52,270 --> 00:44:54,260 red, purple, and so on. 997 00:44:54,260 --> 00:44:56,680 But I have to worry about communication between the 998 00:44:56,680 --> 00:44:58,770 different colors because they're essentially two 999 00:44:58,770 --> 00:44:59,960 different threads.
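One simple way to picture the clustering-then-fusing step is a greedy merge over the dataflow graph: start with one cluster per instruction, then fuse the pairs joined by the heaviest communication edges until only as many clusters remain as there are target tiles. This is a hedged sketch of the general idea, not RawCC's actual algorithm, and the costs are invented:

```python
# Greedy clustering sketch: instructions are nodes, operand transfers are
# weighted edges, and we fuse across the most expensive edges first so
# that heavy communication stays inside one tile.

def cluster(num_instrs, edges, target):
    """edges: list of (a, b, comm_cost). Returns a cluster id per instruction."""
    parent = list(range(num_instrs))  # union-find forest, one set per instruction

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    clusters = num_instrs
    # Visit the heaviest communication edges first.
    for a, b, _cost in sorted(edges, key=lambda e: -e[2]):
        ra, rb = find(a), find(b)
        if ra != rb and clusters > target:
            parent[rb] = ra        # fuse the two clusters
            clusters -= 1
    return [find(i) for i in range(num_instrs)]

# Five instructions; the 1-2 and 3-4 pairs exchange operands heavily,
# so they end up together when we ask for two clusters (tiles).
assignment = cluster(5, [(0, 1, 1), (1, 2, 9), (3, 4, 8), (2, 3, 2)], target=2)
print(assignment)
```

A real compiler would also weigh computation cost and load balance, as the lecture notes next; this sketch only captures the "keep heavy edges local" half of the trade-off.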
1000 00:44:59,960 --> 00:45:02,320 They're going to run on two different processors or two 1001 00:45:02,320 --> 00:45:03,400 different tiles. 1002 00:45:03,400 --> 00:45:08,800 So those arrows that are highlighted in dark black are 1003 00:45:08,800 --> 00:45:09,320 communication edges. 1004 00:45:09,320 --> 00:45:11,860 They have to explicitly send the operands around. 1005 00:45:11,860 --> 00:45:14,310 Right? 1006 00:45:14,310 --> 00:45:16,470 So then I might look at the granularity. 1007 00:45:16,470 --> 00:45:18,260 What is my communication cost? 1008 00:45:18,260 --> 00:45:19,770 What is my computation cost? 1009 00:45:19,770 --> 00:45:21,350 And I want to worry about load balancing. 1010 00:45:21,350 --> 00:45:26,870 As we saw, load balancing can help you better 1011 00:45:26,870 --> 00:45:28,490 make use of your architecture and give you better 1012 00:45:28,490 --> 00:45:30,770 utilization, better throughput. 1013 00:45:30,770 --> 00:45:33,250 So you might essentially say, it doesn't-- it's not 1014 00:45:33,250 --> 00:45:36,650 worthwhile to have these running on a different tile 1015 00:45:36,650 --> 00:45:38,660 because there's a lot of communication going on. 1016 00:45:38,660 --> 00:45:40,290 So maybe I'd want to fuse those together. 1017 00:45:40,290 --> 00:45:43,870 Keep the communication local. 1018 00:45:43,870 --> 00:45:46,940 And essentially eliminate costly communication. 1019 00:45:46,940 --> 00:45:48,680 So there are different heuristics that you can apply. 1020 00:45:48,680 --> 00:45:51,630 You can use that 5-tuple. 1021 00:45:51,630 --> 00:45:54,310 You can use heuristics based on the 5-tuple to determine when 1022 00:45:54,310 --> 00:45:58,510 it's profitable to break things up and when it's not. 1023 00:45:58,510 --> 00:46:01,050 And then you have to worry about placement.
1024 00:46:01,050 --> 00:46:04,010 So you don't quite have this on cell, in that you create 1025 00:46:04,010 --> 00:46:06,230 these SPE threads and they can run on any 1026 00:46:06,230 --> 00:46:08,020 SPE. In the raw compiler, 1027 00:46:08,020 --> 00:46:10,410 you can actually exploit the spatial characteristics of the 1028 00:46:10,410 --> 00:46:14,010 chip and the point-to-point communication network to say, 1029 00:46:14,010 --> 00:46:16,950 I want to put these two threads on tile 1 and tile 2, 1030 00:46:16,950 --> 00:46:19,300 where tile 1 and tile 2 are adjacent to each other. 1031 00:46:19,300 --> 00:46:21,770 Because I have a well-defined communication pattern that I'm 1032 00:46:21,770 --> 00:46:22,640 going to use. 1033 00:46:22,640 --> 00:46:26,350 And map that to the communication network on the chip to get 1034 00:46:26,350 --> 00:46:29,710 really fast, really low latency. 1035 00:46:29,710 --> 00:46:32,210 So you can take each one of these colors, place it on a 1036 00:46:32,210 --> 00:46:33,360 different tile. 1037 00:46:33,360 --> 00:46:36,490 And now you have these wires that are going across these 1038 00:46:36,490 --> 00:46:39,040 tiles which essentially represent communication. 1039 00:46:39,040 --> 00:46:41,570 But now the tile has to worry about, oh, I have to 1040 00:46:41,570 --> 00:46:43,960 essentially send these on fixed routes. 1041 00:46:43,960 --> 00:46:46,450 There's no arbitrary communication mechanism. 1042 00:46:46,450 --> 00:46:50,750 So if there's data going from this tile to this tile, it 1043 00:46:50,750 --> 00:46:52,950 actually has to be routed through a network. 1044 00:46:52,950 --> 00:46:54,830 And that might mean getting routed through somebody 1045 00:46:54,830 --> 00:46:57,630 else's tile. 1046 00:46:57,630 --> 00:47:00,950 So the next stage would be communication coordination.
1047 00:47:00,950 --> 00:47:05,510 You have to figure out which switch you need to go to and 1048 00:47:05,510 --> 00:47:08,210 what you do to get that operand to the right switch, 1049 00:47:08,210 --> 00:47:10,100 which then gets it to the right processor. 1050 00:47:10,100 --> 00:47:12,960 So here, I believe the heuristic is to do dimension 1051 00:47:12,960 --> 00:47:17,700 order routing, so you send along the x-dimension and then 1052 00:47:17,700 --> 00:47:18,860 the y-dimension. 1053 00:47:18,860 --> 00:47:19,650 I might have those reversed. 1054 00:47:19,650 --> 00:47:23,210 I don't know. 1055 00:47:23,210 --> 00:47:25,610 And then finally, now that you've figured out your communication 1056 00:47:25,610 --> 00:47:28,190 patterns and you've figured out your instructions, you do some 1057 00:47:28,190 --> 00:47:29,440 instruction scheduling. 1058 00:47:29,440 --> 00:47:31,360 And what you can do here, because the communication 1059 00:47:31,360 --> 00:47:33,965 patterns are static and you've split up the instructions, 1060 00:47:33,965 --> 00:47:38,110 you know when you need to ship data around and how. 1061 00:47:38,110 --> 00:47:41,010 You can guarantee deadlock freedom by carefully ordering 1062 00:47:41,010 --> 00:47:42,690 your send and receive pairs. 1063 00:47:42,690 --> 00:47:46,370 So what you see here, every time you see an instruction 1064 00:47:46,370 --> 00:47:48,800 that needs to ship an operand around, there's the equivalent 1065 00:47:48,800 --> 00:47:51,590 of a route instruction that has route east, 1066 00:47:51,590 --> 00:47:53,330 west, north, south. 1067 00:47:53,330 --> 00:47:56,940 There's an equivalent route instruction on the other 1068 00:47:56,940 --> 00:47:57,800 processors.
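Dimension-order (XY) routing, as described above, is easy to sketch: move the operand all the way along one dimension first, then along the other. The lecture notes the dimension order might be the reverse; the structure is the same either way. A minimal sketch:

```python
# Dimension-ordered (XY) routing on a tile grid: travel along x first,
# then along y. Deterministic routes like this are what let the compiler
# statically schedule the switch processors.

def xy_route(src, dst):
    """Return the list of (x, y) tiles an operand visits from src to dst."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:                 # first resolve the x dimension
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                 # then resolve the y dimension
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

# Routing from tile (0, 0) to tile (2, 1) passes through (1, 0) and (2, 0):
print(xy_route((0, 0), (2, 1)))
```

Because every route turns at most once and all routes are known at compile time, the compiler can order sends and receives to avoid the deadlocks discussed next.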
1069 00:47:57,800 --> 00:48:00,590 And that allows you to essentially analyze code and 1070 00:48:00,590 --> 00:48:04,020 say, OK, I've laid these things out carefully, I've 1071 00:48:04,020 --> 00:48:06,330 orchestrated my send and receive pairs so I can 1072 00:48:06,330 --> 00:48:08,800 guarantee, for example, there are no overlapping routes. 1073 00:48:08,800 --> 00:48:12,540 Or that there are no deadlocks because one is trying to ship to 1074 00:48:12,540 --> 00:48:14,540 the other while the other is also trying to ship, and they 1075 00:48:14,540 --> 00:48:19,000 both block on the shared network link. 1076 00:48:19,000 --> 00:48:20,740 And finally, you have the code representation. 1077 00:48:20,740 --> 00:48:24,050 So this is where you package things up into object files, 1078 00:48:24,050 --> 00:48:26,420 into essentially things like threads. 1079 00:48:26,420 --> 00:48:28,940 And then you can compile them and run them. 1080 00:48:28,940 --> 00:48:32,580 Now the question that was posed earlier is, well, there's 1081 00:48:32,580 --> 00:48:35,290 one thing we haven't talked about and that's branching. 1082 00:48:35,290 --> 00:48:38,700 This is a sequential program, it executes branches. 1083 00:48:38,700 --> 00:48:41,605 And now I have this loop that I've split up across a number 1084 00:48:41,605 --> 00:48:44,990 of tiles, how do I know who's going to do the branch? 1085 00:48:44,990 --> 00:48:47,360 And if one tile is doing the branch, how does it 1086 00:48:47,360 --> 00:48:49,190 communicate with everybody else? 1087 00:48:49,190 --> 00:48:51,735 Or if I'm going to repeat the branch on every tile, does 1088 00:48:51,735 --> 00:48:53,730 that mean I'm redoing too much computation 1089 00:48:53,730 --> 00:48:55,090 on every other tile?
1090 00:48:55,090 --> 00:48:57,960 So control coordination is actually quite an interesting 1091 00:48:57,960 --> 00:49:00,030 aspect of-- 1092 00:49:00,030 --> 00:49:01,800 adds another interesting aspect to the 1093 00:49:01,800 --> 00:49:04,600 parallelization for raw. 1094 00:49:04,600 --> 00:49:07,830 So what you have to do is figure out-- 1095 00:49:07,830 --> 00:49:09,650 there are two different ways you can do it. 1096 00:49:09,650 --> 00:49:14,750 Because you have no mechanism for a global message on raw, 1097 00:49:14,750 --> 00:49:16,940 you can't say, I've taken a branch, everybody go to this 1098 00:49:16,940 --> 00:49:17,970 program counter. 1099 00:49:17,970 --> 00:49:21,690 You essentially have to send either the branch result, so 1100 00:49:21,690 --> 00:49:24,200 one tile can do the comparison, it calculates the 1101 00:49:24,200 --> 00:49:29,490 condition, and then it has to communicate it to each of the 1102 00:49:29,490 --> 00:49:32,200 different branches-- to each of the different tiles. 1103 00:49:32,200 --> 00:49:34,900 Or every tile has to essentially just replicate the 1104 00:49:34,900 --> 00:49:37,040 control and redo the computations. 1105 00:49:37,040 --> 00:49:40,450 So every tile figures out what is the condition, what are the 1106 00:49:40,450 --> 00:49:42,700 conditions for the branch. 1107 00:49:42,700 --> 00:49:45,130 They redundantly do that computation and then they can 1108 00:49:45,130 --> 00:49:47,770 all merge at the same time-- 1109 00:49:47,770 --> 00:49:49,530 at different times. 1110 00:49:49,530 --> 00:49:52,180 So that gives you two ways of doing the branching. 1111 00:49:52,180 --> 00:49:56,720 If each tile's doing its own control flow calculation, then 1112 00:49:56,720 --> 00:49:58,560 they can essentially branch at different times.
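The two control-coordination strategies just described can be compared with a toy cost model: one tile computes the branch condition and ships the result to everyone else, or every tile redundantly recomputes it and no messages are needed. The cycle counts below are invented for illustration; which strategy wins in practice depends on how expensive the condition is and how far its operands would have to travel.

```python
# Toy comparison of the two branch-coordination strategies.
# All cycle counts are assumed values for illustration only.

CONDITION_COST = 4   # cycles to compute the branch condition (assumed)
SEND_COST = 3        # cycles to ship the result to one tile (assumed)

def broadcast_cost(tiles):
    # Strategy 1: one tile computes the condition, then sends the
    # result point-to-point to each of the other N-1 tiles.
    return CONDITION_COST + (tiles - 1) * SEND_COST

def replicate_cost(tiles):
    # Strategy 2: every tile redoes the comparison itself.
    # Total redundant work across tiles, but no messages and no
    # waiting on the network.
    return tiles * CONDITION_COST

for n in (2, 4, 16):
    print(n, broadcast_cost(n), replicate_cost(n))
```

Note the model counts total work; per-tile, replication lets each tile branch as soon as its own copy of the condition is ready, while broadcasting makes everyone synchronize on the sender, which is exactly the trade-off the lecture turns to next.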
1113 00:49:58,560 --> 00:50:00,790 But if they're all going to wait for the result to 1114 00:50:00,790 --> 00:50:02,730 compare, then it essentially gives you points where you 1115 00:50:02,730 --> 00:50:04,320 have to synchronize. 1116 00:50:04,320 --> 00:50:06,510 Everybody's going to wait for the result of the branch. 1117 00:50:06,510 --> 00:50:08,320 But the latency could be different. 1118 00:50:08,320 --> 00:50:10,670 Because if I'm sending the branch condition to one tile 1119 00:50:10,670 --> 00:50:13,390 versus another tile, and if one's closer than the other, 1120 00:50:13,390 --> 00:50:16,390 then the branch that's closer to me-- the tile that's closer 1121 00:50:16,390 --> 00:50:18,360 to me will take that branch earlier in time. 1122 00:50:18,360 --> 00:50:20,850 So you get sort of the effect of global 1123 00:50:20,850 --> 00:50:23,500 asynchronous branching in either case. 1124 00:50:23,500 --> 00:50:27,680 Does that make sense? 1125 00:50:27,680 --> 00:50:31,400 So, in summary, the raw architecture is really a tiled 1126 00:50:31,400 --> 00:50:31,510 microprocessor. 1127 00:50:31,510 --> 00:50:36,340 It incorporates the best elements from superscalars in 1128 00:50:36,340 --> 00:50:39,460 terms of a really low latency communication network between 1129 00:50:39,460 --> 00:50:42,320 tiles, which really cuts down on the communication costs. 1130 00:50:42,320 --> 00:50:45,250 And as we saw, and as probably you've been learning, 1131 00:50:45,250 --> 00:50:47,830 communication is really an expensive part of 1132 00:50:47,830 --> 00:50:52,530 parallelization on existing multicore chips. 1133 00:50:52,530 --> 00:50:55,670 And it also gets the scalability of multicores in 1134 00:50:55,670 --> 00:50:58,920 terms of explicit parallelism, but also gives you implicit 1135 00:50:58,920 --> 00:51:02,060 parallelism because the networks are pipelined and 1136 00:51:02,060 --> 00:51:04,040 they can give you flow control.
1137 00:51:04,040 --> 00:51:06,560 So you're trying to get to the point where you have a tiled 1138 00:51:06,560 --> 00:51:09,650 processor with a scalar operand network that allows you to do 1139 00:51:09,650 --> 00:51:13,420 communication with a very low cost. And it might be the case 1140 00:51:13,420 --> 00:51:16,640 in the future that these chips will especially be-- 1141 00:51:16,640 --> 00:51:18,925 more complex architectures will sit on top of these, so 1142 00:51:18,925 --> 00:51:22,220 you'll use these as fundamental building blocks. 1143 00:51:22,220 --> 00:51:29,425 And there was the 80-core multicore chip from Intel: there 1144 00:51:29,425 --> 00:51:31,430 have been rumors that that might actually be something 1145 00:51:31,430 --> 00:51:34,770 like a graphics processor that has something like a scalar 1146 00:51:34,770 --> 00:51:35,770 operand network, because you could 1147 00:51:35,770 --> 00:51:39,030 communicate with a very fast-- 1148 00:51:39,030 --> 00:51:41,010 with very low latency between tiles. 1149 00:51:41,010 --> 00:51:44,020 And that article, which came out a few months ago, was the 1150 00:51:44,020 --> 00:51:47,610 first time I think that I had seen tiled architectures used 1151 00:51:47,610 --> 00:51:49,950 in literature or in publications. 1152 00:51:49,950 --> 00:51:53,360 So I think you'll see more of these kinds of design patterns 1153 00:51:53,360 --> 00:51:57,420 appear as people scale out to more than 2 cores, 4 cores, 8 1154 00:51:57,420 --> 00:51:59,560 cores and so on, where you could still communicate 1155 00:51:59,560 --> 00:52:01,370 reasonably well with caches. 1156 00:52:01,370 --> 00:52:03,840 And that's all I prepared for today. 1157 00:52:03,840 --> 00:52:05,090 Any other questions? 1158 00:52:08,190 --> 00:52:09,650 And this is a list of people who 1159 00:52:09,650 --> 00:52:11,520 contributed to the raw project. 1160 00:52:11,520 --> 00:52:13,750 A lot of students who were led by Anant and Saman.
1161 00:52:13,750 --> 00:52:17,590 PROFESSOR AMARASINGHE: [OBSCURED] 1162 00:52:17,590 --> 00:52:23,050 view of what happened in our groups and then how it relates 1163 00:52:23,050 --> 00:52:25,070 to what you need. 1164 00:52:25,070 --> 00:52:30,270 But this is trying to take it to a much finer grain. 1165 00:52:30,270 --> 00:52:33,760 Whereas in Cell, of course, the messages have to be large, so 1166 00:52:33,760 --> 00:52:36,098 you can do a lot of coarse-grain stuff. 1167 00:52:36,098 --> 00:52:38,800 But in raw, you try to do much more fine-grain stuff. 1168 00:52:38,800 --> 00:52:40,065 But we're going to talk about it in the next 1169 00:52:40,065 --> 00:52:42,135 lecture on the future. 1170 00:52:42,135 --> 00:52:43,170 [OBSCURED] 1171 00:52:43,170 --> 00:52:46,961 AUDIENCE: [OBSCURED] 1172 00:52:46,961 --> 00:52:49,640 Don't you need long wires for the clock? 1173 00:52:49,640 --> 00:52:51,230 PROFESSOR RABBAH: There's no global clock. 1174 00:52:51,230 --> 00:52:56,617 AUDIENCE: So you have this network that seems to-- 1175 00:52:56,617 --> 00:53:00,326 So the network actually requires handshaking? 1176 00:53:00,326 --> 00:53:00,500 Or-- 1177 00:53:00,500 --> 00:53:04,130 PROFESSOR AMARASINGHE: The way you can do it is, in 1178 00:53:04,130 --> 00:53:09,650 modern processors, [OBSCURED] 1179 00:53:09,650 --> 00:53:12,300 so since there's no long wire, you can actually carry the 1180 00:53:12,300 --> 00:53:14,580 clock with the data. 1181 00:53:14,580 --> 00:53:16,370 So in a globally clocked world, the switching here would happen 1182 00:53:16,370 --> 00:53:20,160 when the switching there happens. 1183 00:53:20,160 --> 00:53:23,490 But since there's no big wire connecting them, that's OK. 1184 00:53:23,490 --> 00:53:27,340 So you can deal with the clock ticking.
1185 00:53:27,340 --> 00:53:29,187 AUDIENCE: So this is not going to 1186 00:53:29,187 --> 00:53:31,128 have clock drift because-- 1187 00:53:31,128 --> 00:53:32,120 PROFESSOR AMARASINGHE: Yeah, that's clock drift. 1188 00:53:32,120 --> 00:53:37,320 One end of the processor's clock isn't happening at the same global 1189 00:53:37,320 --> 00:53:38,570 instant in time as the other end of the processor. 1190 00:53:48,130 --> 00:53:52,950 And since the wires also kind of go in a tree, you can 1191 00:53:52,950 --> 00:53:53,360 deal with that. 1192 00:53:53,360 --> 00:53:55,276 AUDIENCE: Drift meaning ticking at 1193 00:53:55,276 --> 00:53:56,180 different rates, not just-- 1194 00:53:56,180 --> 00:53:58,363 PROFESSOR AMARASINGHE: Yeah, I know. 1195 00:53:58,363 --> 00:53:59,410 Basically I don't think I can go back to it. 1196 00:53:59,410 --> 00:54:00,740 It has a skew. 1197 00:54:00,740 --> 00:54:05,090 There's a clock skew going in between those. 1198 00:54:05,090 --> 00:54:07,486 AUDIENCE: So you don't need synchronizers between the 1199 00:54:07,486 --> 00:54:07,640 different tiles? 1200 00:54:07,640 --> 00:54:08,870 PROFESSOR AMARASINGHE: No, we don't need synchronizers 1201 00:54:08,870 --> 00:54:09,640 because tiles are local. 1202 00:54:09,640 --> 00:54:11,400 The clock tree would bring those tiles-- 1203 00:54:11,400 --> 00:54:14,210 the clock would bring two things that communicate close 1204 00:54:14,210 --> 00:54:17,100 enough together that it fits in the cycle. 1205 00:54:17,100 --> 00:54:20,335 But for example, if you get to two very far away branches of 1206 00:54:20,335 --> 00:54:23,169 the tree and then you try to communicate between them, then you 1207 00:54:23,169 --> 00:54:23,450 have a problem. 1208 00:54:23,450 --> 00:54:27,683 Another thing is, when the tree goes here, if you want to use two 1209 00:54:27,683 --> 00:54:30,170 different branches, it's similar to going down. 1210 00:54:30,170 --> 00:54:31,630 So you can compress the process.
1211 00:54:31,630 --> 00:54:32,810 So there are all these things. 1212 00:54:32,810 --> 00:54:34,060 I mean, modern processors really have to deal with all of this. 1213 00:54:37,310 --> 00:54:40,538 The problem occurs when you try to connect directly from 1214 00:54:40,538 --> 00:54:44,270 the far end of the branch, something that gets clocked 1215 00:54:44,270 --> 00:54:48,265 there, to something that clocks at a very 1216 00:54:48,265 --> 00:54:48,613 early end of the branch. 1217 00:54:48,613 --> 00:54:50,070 If you're trying to connect those two, then the skew might 1218 00:54:50,070 --> 00:54:51,150 be too large. 1219 00:54:51,150 --> 00:54:53,042 Then you can get into clock trouble. 1220 00:54:53,042 --> 00:54:53,770 AUDIENCE: [OBSCURED] 1221 00:54:53,770 --> 00:54:57,283 I was just worried about this local network. 1222 00:54:57,283 --> 00:55:04,224 [OBSCURED] 1223 00:55:04,224 --> 00:55:11,386 AUDIENCE: Another question I had was, in the mesh, obviously 1224 00:55:11,386 --> 00:55:15,897 the processors in the middle have further to go to get to the I/O 1225 00:55:15,897 --> 00:55:18,860 devices or to the main memory. 1226 00:55:18,860 --> 00:55:22,641 What do you see happening as you get to larger and larger 1227 00:55:22,641 --> 00:55:23,144 processors? 1228 00:55:23,144 --> 00:55:25,282 Are they going to just put more and more local memory on 1229 00:55:25,282 --> 00:55:26,858 the tile and [OBSCURED] 1230 00:55:26,858 --> 00:55:30,500 it, or are they going to add extra memory buses on it? 1231 00:55:30,500 --> 00:55:32,225 PROFESSOR RABBAH: It could be a combination of both. 1232 00:55:32,225 --> 00:55:35,950 And it's not just memory, it's I/O devices. 1233 00:55:35,950 --> 00:55:38,370 If you're doing I/O then you might want to be placed at a part 1234 00:55:38,370 --> 00:55:42,165 of the chip that has direct access to an I/O device, or 1235 00:55:42,165 --> 00:55:43,550 very close.
1236 00:55:43,550 --> 00:55:46,600 It also comes up in the case of the communication 1237 00:55:46,600 --> 00:55:47,470 orchestration. 1238 00:55:47,470 --> 00:55:51,680 So if this guy is doing the branch, then you want him 1239 00:55:51,680 --> 00:55:53,270 essentially centrally located. 1240 00:55:53,270 --> 00:55:56,150 So the best pattern for allocating things is 1241 00:55:56,150 --> 00:55:57,020 essentially a cross. 1242 00:55:57,020 --> 00:55:59,670 It's like a plus sign where the branch is in the middle. 1243 00:55:59,670 --> 00:56:02,470 PROFESSOR AMARASINGHE: But that's not [OBSCURED]. 1244 00:56:02,470 --> 00:56:07,420 You can make them uniform by putting everybody equally there. 1245 00:56:07,420 --> 00:56:11,624 And a lot of times people have done that simple model with 1246 00:56:11,624 --> 00:56:16,353 everybody equally there. Or you try to take advantage of 1247 00:56:16,353 --> 00:56:16,670 closeness and stuff like that. 1248 00:56:16,670 --> 00:56:16,950 So you can't have it both ways. 1249 00:56:16,950 --> 00:56:19,730 So anytime you try to make memory [OBSCURED] 1250 00:56:19,730 --> 00:56:24,580 very, very close with fast access, you're doing it by 1251 00:56:24,580 --> 00:56:30,652 basically making the other parts have fewer resources 1252 00:56:30,652 --> 00:56:32,240 and less access. 1253 00:56:32,240 --> 00:56:34,760 On the other hand, there are a lot of people working on 1254 00:56:34,760 --> 00:56:38,690 [INAUDIBLE] 1255 00:56:38,690 --> 00:56:41,655 things that, for example, there's a thing called a free-space 1256 00:56:41,655 --> 00:56:42,990 laser. 1257 00:56:42,990 --> 00:56:45,920 So what that does is you put a mirror on top of the tile, on 1258 00:56:45,920 --> 00:56:48,500 top of the processor. 1259 00:56:48,500 --> 00:56:58,920 And each of these-- you can embed a small LED transmitter 1260 00:56:58,920 --> 00:56:59,490 into the chip.
1261 00:56:59,490 --> 00:57:01,435 So basically if you want to communicate with someone, you 1262 00:57:01,435 --> 00:57:03,740 just bounce that laser on top of that and get it 1263 00:57:03,740 --> 00:57:04,660 to the right guy. 1264 00:57:04,660 --> 00:57:07,100 So there are a lot of exotic things that might be able to 1265 00:57:07,100 --> 00:57:09,150 solve this technological problem. 1266 00:57:09,150 --> 00:57:11,160 But in some cases, the speed of light-- 1267 00:57:11,160 --> 00:57:14,860 I don't think an engineer has figured out how to 1268 00:57:14,860 --> 00:57:15,430 break the speed of light. 1269 00:57:15,430 --> 00:57:17,925 Unless, of course, people go with quantum computing and 1270 00:57:17,925 --> 00:57:18,870 stuff like that. 1271 00:57:18,870 --> 00:57:21,930 So, I mean, the key thing is, you have resources, you have 1272 00:57:21,930 --> 00:57:22,660 certain data and you just have to deal with it. 1273 00:57:22,660 --> 00:57:26,190 Getting nice uniformity has a cost. 1274 00:57:26,190 --> 00:57:27,340 PROFESSOR RABBAH: Yeah, I mean, on the [OBSCURED] 1275 00:57:27,340 --> 00:57:30,650 there are groups here at MIT who are working on optical 1276 00:57:30,650 --> 00:57:32,210 networks in the third dimension. 1277 00:57:32,210 --> 00:57:33,956 So you have a tiled chip plus an optical network in the 1278 00:57:33,956 --> 00:57:35,990 third dimension, which allows you to do things like 1279 00:57:35,990 --> 00:57:38,214 broadcast much more efficiently. 1280 00:57:38,214 --> 00:57:38,752 OK? 1281 00:57:38,752 --> 00:57:40,300 PROFESSOR AMARASINGHE: I guess we'll take a break here and 1282 00:57:40,300 --> 00:57:42,286 take a small, three-minute break and then we can go on to 1283 00:57:42,286 --> 00:57:43,536 the next topic.