1 00:00:00,030 --> 00:00:02,420 The following content is provided under a Creative 2 00:00:02,420 --> 00:00:03,860 Commons license. 3 00:00:03,860 --> 00:00:06,860 Your support will help MIT OpenCourseWare continue to 4 00:00:06,860 --> 00:00:10,540 offer high quality educational resources for free. 5 00:00:10,540 --> 00:00:13,410 To make a donation or view additional materials from 6 00:00:13,410 --> 00:00:17,460 hundreds of MIT courses, visit MIT OpenCourseWare at 7 00:00:17,460 --> 00:00:18,710 ocw.mit.edu. 8 00:00:21,430 --> 00:00:23,530 PROFESSOR: OK. 9 00:00:23,530 --> 00:00:24,920 Let's get started. 10 00:00:24,920 --> 00:00:28,750 So what I'm going to do next is switch gears to one 11 00:00:28,750 --> 00:00:30,540 interesting compiler, which is the StreamIt 12 00:00:30,540 --> 00:00:31,790 parallelizing compiler. 13 00:00:33,850 --> 00:00:37,400 The main idea about StreamIt is the need for a common 14 00:00:37,400 --> 00:00:39,080 machine language. 15 00:00:39,080 --> 00:00:43,670 What we want to do normally is, in a language, you want to 16 00:00:43,670 --> 00:00:46,465 represent common architecture properties so you get good 17 00:00:46,465 --> 00:00:46,790 performance. 18 00:00:46,790 --> 00:00:50,470 You don't do it at a very high level of abstraction, or you 19 00:00:50,470 --> 00:00:53,300 waste a lot of cycles dealing with the abstraction. 20 00:00:53,300 --> 00:00:55,960 But you want to abstract out the differences between 21 00:00:55,960 --> 00:00:58,310 machines to get the portability, otherwise you are 22 00:00:58,310 --> 00:00:58,950 going to just do 23 00:00:58,950 --> 00:01:01,690 assembly-hacking for one machine. 24 00:01:01,690 --> 00:01:05,010 Also you can't have things too complex, because a typical 25 00:01:05,010 --> 00:01:07,840 programmer cannot deal with very complex things that we 26 00:01:07,840 --> 00:01:10,960 ask them to do.
27 00:01:10,960 --> 00:01:15,510 C and Fortran were a really nice common assembly language 28 00:01:15,510 --> 00:01:17,080 for imperative languages running 29 00:01:17,080 --> 00:01:20,650 on the unicore machines. 30 00:01:20,650 --> 00:01:24,650 The problem is this type of language is not a good common 31 00:01:24,650 --> 00:01:26,610 language for multicores, because it doesn't deal with, 32 00:01:26,610 --> 00:01:29,350 first of all, multiple cores. 33 00:01:29,350 --> 00:01:32,620 And as you keep changing the number of cores -- 34 00:01:32,620 --> 00:01:34,970 for example, automatic parallelizing compilers are 35 00:01:34,970 --> 00:01:38,480 not good enough to get really good parallelism out of 36 00:01:38,480 --> 00:01:40,360 that, even though we talk about that. 37 00:01:40,360 --> 00:01:43,300 Still a lot of work has to be done in there. 38 00:01:43,300 --> 00:01:45,860 So what's the correct abstraction if you have 39 00:01:45,860 --> 00:01:48,860 multicore machines? 40 00:01:48,860 --> 00:01:51,200 The current offering, what you guys are doing, is things like 41 00:01:51,200 --> 00:01:54,470 OpenMP, MPI type stuff. 42 00:01:54,470 --> 00:01:56,850 You are hand-hacking the parallelism. 43 00:01:56,850 --> 00:01:59,390 Well, there's an issue with that. 44 00:01:59,390 --> 00:02:01,450 It's basically these explicit parallel constructs. 45 00:02:01,450 --> 00:02:04,710 They're kind of added to languages like C -- that's 46 00:02:04,710 --> 00:02:06,510 what you're working on. 47 00:02:06,510 --> 00:02:09,790 And what this does is all these nice properties about 48 00:02:09,790 --> 00:02:13,260 composability, malleability, debuggability, portability -- 49 00:02:13,260 --> 00:02:15,660 all those things kind of go out the window. 50 00:02:15,660 --> 00:02:18,320 And this is why parallelizing is hard, because 51 00:02:18,320 --> 00:02:21,040 all these things make life very difficult for the 52 00:02:21,040 --> 00:02:22,700 programmer.
53 00:02:22,700 --> 00:02:25,720 And it's a huge additional programmer burden. 54 00:02:25,720 --> 00:02:27,850 The programmer has to introduce parallelism, 55 00:02:27,850 --> 00:02:29,580 correctness, optimization -- 56 00:02:29,580 --> 00:02:32,920 it's all left to the programmer. 57 00:02:32,920 --> 00:02:35,400 So what the programmer has to do in this kind of world -- what 58 00:02:35,400 --> 00:02:37,236 you are doing right now -- 59 00:02:37,236 --> 00:02:40,300 you have to make all the granularity decisions. 60 00:02:40,300 --> 00:02:42,880 If things are too small you might get too much 61 00:02:42,880 --> 00:02:43,250 communication. 62 00:02:43,250 --> 00:02:46,710 If things are too large you might not get good load 63 00:02:46,710 --> 00:02:48,340 balancing, and stuff like that. 64 00:02:48,340 --> 00:02:50,680 And then you deal with all the load balancing decisions. 65 00:02:50,680 --> 00:02:54,570 All those decisions are left for you guys. 66 00:02:54,570 --> 00:02:56,610 You need to figure out what's local, what's not. 67 00:02:56,610 --> 00:03:00,530 And if you make a wrong decision it can cost you. 68 00:03:00,530 --> 00:03:04,460 All the synchronization, and all the pain and suffering 69 00:03:04,460 --> 00:03:06,230 that comes from making a wrong decision. 70 00:03:06,230 --> 00:03:10,270 Things like race conditions, deadlocks, 71 00:03:10,270 --> 00:03:12,620 and stuff like that. 72 00:03:12,620 --> 00:03:16,430 And this, while MIT students can hack it, you can't go and 73 00:03:16,430 --> 00:03:21,030 convince a dull programmer to convert from writing nice 74 00:03:21,030 --> 00:03:24,230 simple Java application code to dealing with all these 75 00:03:24,230 --> 00:03:25,810 complexities. 76 00:03:25,810 --> 00:03:29,210 So this is what kind of led to our research for the last five 77 00:03:29,210 --> 00:03:32,070 years to do StreamIt.
78 00:03:32,070 --> 00:03:34,900 What you want to do is move a bunch of these decisions to 79 00:03:34,900 --> 00:03:35,980 the compiler. 80 00:03:35,980 --> 00:03:38,950 Granularity, load balancing, locality, and 81 00:03:38,950 --> 00:03:39,516 synchronization -- 82 00:03:39,516 --> 00:03:40,650 [OBSCURED] 83 00:03:40,650 --> 00:03:43,250 And in today's talk I am going to talk to you about, after 84 00:03:43,250 --> 00:03:46,060 you write a StreamIt program -- as Bill pointed out, the 85 00:03:46,060 --> 00:03:48,910 nice parallel properties -- how do you actually go about 86 00:03:48,910 --> 00:03:50,670 getting this kind of parallelism. 87 00:03:54,130 --> 00:03:59,210 So in StreamIt , in summary, it basically has regular and 88 00:03:59,210 --> 00:04:01,630 repeating computation in these filters. 89 00:04:01,630 --> 00:04:04,290 This is called a synchronous data flow model, because we 90 00:04:04,290 --> 00:04:07,990 know at compile time exactly how the data moves, how much 91 00:04:07,990 --> 00:04:10,890 each produces and consumes. 92 00:04:10,890 --> 00:04:15,110 And this has natural parallelism, and it exposes 93 00:04:15,110 --> 00:04:19,010 exactly what's going to happen to the compiler. 94 00:04:19,010 --> 00:04:21,970 And the compiler can do a lot of powerful transformations, 95 00:04:21,970 --> 00:04:24,720 as yesterday I pointed out. 96 00:04:24,720 --> 00:04:28,320 The first thing is, because of synchronous data flow, we know 97 00:04:28,320 --> 00:04:32,530 at compile time exactly who needs to do what when. 98 00:04:32,530 --> 00:04:34,580 And that really helps transform. 99 00:04:34,580 --> 00:04:37,970 It's not like everything happens run-time dynamically. 100 00:04:37,970 --> 00:04:39,970 So what does that mean? 101 00:04:39,970 --> 00:04:43,780 So what that means is each filter knows exactly how much 102 00:04:43,780 --> 00:04:45,960 to push and pop -- 103 00:04:45,960 --> 00:04:48,040 that's in a repeatable execution. 
104 00:04:48,040 --> 00:04:50,330 And so what we can do is, we can come up with a static 105 00:04:50,330 --> 00:04:53,710 schedule that can be repeated multiple times. 106 00:04:53,710 --> 00:04:55,900 So let me tell you a little bit about what a static 107 00:04:55,900 --> 00:04:56,540 schedule means. 108 00:04:56,540 --> 00:04:59,350 So assume this filter pushes two, this filter pops three 109 00:04:59,350 --> 00:05:01,960 but pushes one, that filter pops two. 110 00:05:01,960 --> 00:05:04,180 So these are kind of rate pushes, it's not everybody 111 00:05:04,180 --> 00:05:05,670 producing-consuming at once. 112 00:05:05,670 --> 00:05:07,000 So what's the schedule? 113 00:05:07,000 --> 00:05:07,580 So you can say -- 114 00:05:07,580 --> 00:05:12,130 OK, at the beginning it produces two items, but I 115 00:05:12,130 --> 00:05:15,650 can't consume that because I need three items. And then I 116 00:05:15,650 --> 00:05:18,910 do two of them, and I can consume the first three and 117 00:05:18,910 --> 00:05:20,240 then produce one there. 118 00:05:20,240 --> 00:05:21,350 And I have one left behind. 119 00:05:21,350 --> 00:05:24,810 I do one more A, and now I've got three. 120 00:05:24,810 --> 00:05:27,610 And it consumes that and produces that. 121 00:05:27,610 --> 00:05:31,450 And then I can fire C. So the neat thing about this is, when 122 00:05:31,450 --> 00:05:35,290 I started there was nothing inside any of these buffers. 123 00:05:35,290 --> 00:05:39,710 And if I ran A, A, B, A, B, C, there's nothing inside the 124 00:05:39,710 --> 00:05:41,440 buffers again. 125 00:05:41,440 --> 00:05:45,690 So what I have is, I'm back to the starting position. 126 00:05:45,690 --> 00:05:48,890 And if I repeat this millions of times -- 127 00:05:48,890 --> 00:05:51,390 I keep the computation running nicely without any buffers 128 00:05:51,390 --> 00:05:53,340 accumulating or anything like that.
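The A, A, B, A, B, C schedule described above can be checked mechanically. The following is a minimal Python sketch, not StreamIt code: the push and pop rates are the ones stated in the lecture, while the buffer names `ab` and `bc` and the function `run_schedule` are made up for illustration.

```python
# Rates from the lecture's example: A pushes 2; B pops 3 and pushes 1; C pops 2.
POP = {"A": 0, "B": 3, "C": 2}
PUSH = {"A": 2, "B": 1, "C": 0}
UPSTREAM = {"B": "ab", "C": "bc"}    # hypothetical buffer each filter reads from
DOWNSTREAM = {"A": "ab", "B": "bc"}  # hypothetical buffer each filter writes to


def run_schedule(schedule):
    """Fire filters in the given order, tracking how many items sit in each buffer."""
    buffers = {"ab": 0, "bc": 0}
    trace = []
    for f in schedule:
        src = UPSTREAM.get(f)
        if src is not None:
            if buffers[src] < POP[f]:
                raise ValueError(f"{f} cannot fire: needs {POP[f]}, has {buffers[src]}")
            buffers[src] -= POP[f]
        dst = DOWNSTREAM.get(f)
        if dst is not None:
            buffers[dst] += PUSH[f]
        trace.append((f, buffers["ab"], buffers["bc"]))
    return buffers, trace


final, trace = run_schedule("AABABC")
for step in trace:
    print(step)   # (filter fired, items in ab, items in bc)
print(final)      # both buffers are empty again, so the schedule can repeat
```

Because the buffers end where they started, the same six firings can be looped indefinitely; firing B before two A's have run raises an error, which is the "I can't consume that because I need three items" case.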
129 00:05:53,340 --> 00:05:56,650 So I can come up with this very nice schedule that says 130 00:05:56,650 --> 00:05:58,390 -- here's what I have to do. 131 00:05:58,390 --> 00:06:02,060 I have to run A actually three times, B twice, and C once, in 132 00:06:02,060 --> 00:06:03,970 this order, and if I do that I have a 133 00:06:03,970 --> 00:06:05,730 computation that keeps running. 134 00:06:05,730 --> 00:06:08,940 And that gives me a good global view on what can I 135 00:06:08,940 --> 00:06:11,230 parallelize, what can I load balance, all those things. 136 00:06:11,230 --> 00:06:13,660 Because things don't change. 137 00:06:13,660 --> 00:06:17,670 One more additional thing about StreamIt is we can look 138 00:06:17,670 --> 00:06:20,290 at more elements than I am consuming. 139 00:06:20,290 --> 00:06:21,100 Question? 140 00:06:21,100 --> 00:06:25,500 AUDIENCE: How common is it in your typical code that you can 141 00:06:25,500 --> 00:06:28,400 actually produce a static schedule like that? 142 00:06:28,400 --> 00:06:33,880 PROFESSOR: In a lot of DSP code this is very common. 143 00:06:33,880 --> 00:06:38,670 A lot of DSP code right now that goes into hardware and 144 00:06:38,670 --> 00:06:41,230 software, they have very common properties. 145 00:06:41,230 --> 00:06:44,390 But even in code where it's not common, a very large 146 00:06:44,390 --> 00:06:48,760 chunk of the program has this static property, and there are 147 00:06:48,760 --> 00:06:51,310 a few places that have dynamic properties. 148 00:06:51,310 --> 00:06:54,850 So it's like, when you write a normal program you don't write 149 00:06:54,850 --> 00:06:57,600 a branch instruction after every instruction. 150 00:06:57,600 --> 00:06:59,690 You have a few hundred instructions and a branch, a 151 00:06:59,690 --> 00:07:02,200 few tens of instructions and a branch type thing.
152 00:07:02,200 --> 00:07:05,840 So what you can think about it is that those instructions 153 00:07:05,840 --> 00:07:08,460 without a branch can get optimized the hell out of 154 00:07:08,460 --> 00:07:10,980 them, and then you do a branch dynamically. 155 00:07:10,980 --> 00:07:12,660 So you can think about it like this. 156 00:07:12,660 --> 00:07:15,560 What's the largest chunks you can find that you don't have 157 00:07:15,560 --> 00:07:17,480 this uncertainty until run-time? 158 00:07:17,480 --> 00:07:19,930 Then you can optimize the hell out of it, and then you can 159 00:07:19,930 --> 00:07:23,330 deal with these run-time issues -- 160 00:07:23,330 --> 00:07:27,110 basically branches, or control-flow changes, or dynamic 161 00:07:27,110 --> 00:07:31,410 rate changes at run-time. 162 00:07:31,410 --> 00:07:34,240 If we have a 10-90 rule, if 90% of the things are in a 163 00:07:34,240 --> 00:07:36,640 nice thing, and if you get good performance on that -- 164 00:07:36,640 --> 00:07:38,880 hey, it has a big impact. 165 00:07:38,880 --> 00:07:41,850 So in our language you can deal with dynamism, but our 166 00:07:41,850 --> 00:07:43,590 analysis is basically trying to find the largest static 167 00:07:43,590 --> 00:07:45,350 chunk and analyze. 168 00:07:45,350 --> 00:07:48,710 So that schedule basically said we start with 169 00:07:48,710 --> 00:07:50,150 empty and end with empty. 170 00:07:50,150 --> 00:07:52,630 But the trouble is, a lot of times we actually can look 171 00:07:52,630 --> 00:07:54,630 beyond the number of what we consume. 172 00:07:54,630 --> 00:07:58,010 So what you have to do is kind of do an initial schedule so that 173 00:07:58,010 --> 00:08:03,260 you don't start with empty, you basically consume 174 00:08:03,260 --> 00:08:05,230 something -- you start with something like this.
175 00:08:05,230 --> 00:08:07,860 So the next time something comes in -- 176 00:08:07,860 --> 00:08:11,190 three things come into this one -- 177 00:08:11,190 --> 00:08:13,360 I can actually peek four and pop three. 178 00:08:13,360 --> 00:08:16,340 So you go through the first thing kind of priming 179 00:08:16,340 --> 00:08:19,270 everything with the amount of data needed, and then you go 180 00:08:19,270 --> 00:08:22,190 to the static schedule. 181 00:08:22,190 --> 00:08:25,860 This kind of gives you a feel for what you'll get. 182 00:08:25,860 --> 00:08:27,430 This is a neat thing, I know exactly 183 00:08:27,430 --> 00:08:28,930 what's going on in here. 184 00:08:28,930 --> 00:08:30,730 So now how do I run this in parallel? 185 00:08:30,730 --> 00:08:35,560 This is something actually Rodric pointed out before, 186 00:08:35,560 --> 00:08:38,220 there are three types of parallelism we can deal with. 187 00:08:38,220 --> 00:08:42,140 So here's my stream program in here, and I do some filters, 188 00:08:42,140 --> 00:08:44,040 scatter-gather in here. 189 00:08:44,040 --> 00:08:46,710 The first type of parallelism is task parallelism. 190 00:08:46,710 --> 00:08:49,280 What that means is the programmer said, there are 191 00:08:49,280 --> 00:08:51,100 three things that can run in parallel 192 00:08:51,100 --> 00:08:53,250 before I join them together. 193 00:08:53,250 --> 00:08:54,890 So this is a programmer-specified 194 00:08:54,890 --> 00:08:56,720 parallelism. 195 00:08:56,720 --> 00:09:01,010 And you have a nice data parallel messenger 196 00:09:01,010 --> 00:09:03,190 presentation. 197 00:09:03,190 --> 00:09:06,140 The second part is data parallelism. 198 00:09:06,140 --> 00:09:09,680 What that means is, you have some of these things that 199 00:09:09,680 --> 00:09:13,180 don't depend on the previous run of that one. 200 00:09:13,180 --> 00:09:17,570 So there's no invocation, dependency across multiple 201 00:09:17,570 --> 00:09:18,340 invocations.
202 00:09:18,340 --> 00:09:20,950 These are called stateless filters, there's no state that 203 00:09:20,950 --> 00:09:21,920 keeps changing. 204 00:09:21,920 --> 00:09:23,320 If the state kept changing, you had to wait till the 205 00:09:23,320 --> 00:09:25,860 previous one finishes to run the next one. 206 00:09:25,860 --> 00:09:27,310 So if you have a stateless filter -- 207 00:09:27,310 --> 00:09:29,840 assume that it's data parallel -- 208 00:09:29,840 --> 00:09:35,270 what you can do is you can basically take that, replicate 209 00:09:35,270 --> 00:09:38,240 it many, many times, and when the data comes -- 210 00:09:38,240 --> 00:09:39,040 parallel is in it every data -- 211 00:09:39,040 --> 00:09:43,210 and it will compute and parallelly get out here. 212 00:09:43,210 --> 00:09:46,020 The final thing is pipeline parallelism. 213 00:09:46,020 --> 00:09:48,440 So you can feed this one into this one, this one into this 214 00:09:48,440 --> 00:09:51,230 one, and then douse across in a pipeline fashion. 215 00:09:51,230 --> 00:09:53,380 And you can get multiple things execution. 216 00:09:53,380 --> 00:09:55,660 So we have these three types of parallelism in here, and 217 00:09:55,660 --> 00:09:59,390 the interesting thing is if you have stateful filters, you 218 00:09:59,390 --> 00:10:01,170 can't run this data parallel. 219 00:10:01,170 --> 00:10:02,900 Actually the only parallelism you can get is pipeline 220 00:10:02,900 --> 00:10:05,540 parallelism. 221 00:10:05,540 --> 00:10:08,170 So traditionally task parallelism is fork/join 222 00:10:08,170 --> 00:10:10,800 parallelism, that you guys are doing right now. 223 00:10:10,800 --> 00:10:13,800 Data parallelism is loop parallelism. 224 00:10:13,800 --> 00:10:16,760 And pipeline parallelism mainly was done in hardware. 225 00:10:16,760 --> 00:10:19,520 If you have done something like Verilog or VHDL you'll do 226 00:10:19,520 --> 00:10:21,070 a lot of pipeline parallelism. 
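The data-parallel case described above (replicate a stateless filter, scatter the input, gather the results) can be sketched in a few lines of Python. This is an illustration only, not how the StreamIt compiler is implemented: the filter `scale`, the function `data_parallel`, and the use of a thread pool are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def scale(x):
    """Hypothetical stateless filter: each output depends only on its own input."""
    return 3 * x + 1


def data_parallel(filter_fn, items, replicas):
    """Run `replicas` copies of a stateless filter and return outputs in input order."""
    # Scatter: deal the input round-robin across the replicas.
    lanes = [items[i::replicas] for i in range(replicas)]
    with ThreadPoolExecutor(max_workers=replicas) as pool:
        results = list(pool.map(lambda lane: [filter_fn(x) for x in lane], lanes))
    # Gather: a round-robin join restores the original order.
    out = [None] * len(items)
    for i, lane in enumerate(results):
        out[i::replicas] = lane
    return out


data = list(range(10))
print(data_parallel(scale, data, 4) == [scale(x) for x in data])
```

The replicated run matches the sequential one only because `scale` keeps no state between invocations. A filter that carried, say, a running total from one firing to the next could not be split this way, which is why stateful filters are limited to pipeline parallelism.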
227 00:10:21,070 --> 00:10:23,600 So kind of combining these three ideas from different 228 00:10:23,600 --> 00:10:26,620 communities all into one, because I think programs can 229 00:10:26,620 --> 00:10:30,040 have each part in there. 230 00:10:30,040 --> 00:10:32,200 So now, how do you go and exploit this? 231 00:10:32,200 --> 00:10:35,480 How do you go take advantage of that? 232 00:10:35,480 --> 00:10:38,660 So I'll talk a little bit of baseline techniques, and then 233 00:10:38,660 --> 00:10:40,890 talk about what StreamIt compiler does today. 234 00:10:40,890 --> 00:10:43,660 So assume I have a program like this. 235 00:10:43,660 --> 00:10:46,110 The hardest thing is there are two tasks in here. 236 00:10:46,110 --> 00:10:47,940 The programs are given, you don't have to worry 237 00:10:47,940 --> 00:10:48,500 about that. 238 00:10:48,500 --> 00:10:52,190 And what you can do is assign them into different 239 00:10:52,190 --> 00:10:53,320 cores and run it. 240 00:10:53,320 --> 00:10:54,570 Neat. 241 00:10:57,070 --> 00:10:59,030 You can think of it as fork/join parallelism: you come 242 00:10:59,030 --> 00:11:02,600 here you fork, you do this thing, and you join in here. 243 00:11:02,600 --> 00:11:05,160 So the interesting thing is if you have two cores. 244 00:11:05,160 --> 00:11:07,090 You probably got a 2x speedup in this one. 245 00:11:07,090 --> 00:11:08,740 This is really neat because there are two things in here. 246 00:11:08,740 --> 00:11:11,210 The problem is, what if you have a different 247 00:11:11,210 --> 00:11:13,600 number of cores? 248 00:11:13,600 --> 00:11:16,280 Or if the next generation has double the number of cores, 249 00:11:16,280 --> 00:11:19,075 and I'm stuck with the program you've written for the current 250 00:11:19,075 --> 00:11:19,340 generation? 251 00:11:19,340 --> 00:11:23,130 So this is not that great, or interesting.
252 00:11:23,130 --> 00:11:28,290 So we ran it on the Raw processor we have -- it has 16 253 00:11:28,290 --> 00:11:30,050 cores in there -- 254 00:11:30,050 --> 00:11:33,080 that we have been building, and this is actually running a 255 00:11:33,080 --> 00:11:34,140 simulator of that. 256 00:11:34,140 --> 00:11:38,530 What you find is, is a bunch of StreamIt programs we have 257 00:11:38,530 --> 00:11:42,590 we kind of get performance like basically close to two, 258 00:11:42,590 --> 00:11:44,100 because that's the kind of 259 00:11:44,100 --> 00:11:45,450 parallelism people have written. 260 00:11:45,450 --> 00:11:49,760 In fact, some programs even slowed down in there, because 261 00:11:49,760 --> 00:11:54,600 what happens in here is the parallelism and 262 00:11:54,600 --> 00:11:57,340 synchronization is not matched with the target -- 263 00:11:57,340 --> 00:11:58,550 because it's matched with the program. 264 00:11:58,550 --> 00:12:00,780 Because you wrote a program because your parallelism in 265 00:12:00,780 --> 00:12:04,430 there was what you thought was right for the algorithm. 266 00:12:04,430 --> 00:12:07,370 We didn't want you to give any consideration to the machine 267 00:12:07,370 --> 00:12:09,490 you are running, and it didn't match the machine, basically, 268 00:12:09,490 --> 00:12:10,620 if you just got the parallelism. 269 00:12:10,620 --> 00:12:13,280 And you just don't do that right. 270 00:12:13,280 --> 00:12:17,280 So one thing we have noticed for a lot of streaming 271 00:12:17,280 --> 00:12:20,210 programs, to answer your question, is there are a lot 272 00:12:20,210 --> 00:12:22,010 of data parallelism. 273 00:12:22,010 --> 00:12:24,000 In fact, in this filter -- 274 00:12:24,000 --> 00:12:26,440 in this program -- 275 00:12:26,440 --> 00:12:29,030 what you can do is you can find data parallel filters, 276 00:12:29,030 --> 00:12:30,660 and parallelize them. 
277 00:12:30,660 --> 00:12:32,750 So you can take each filter, run it on 278 00:12:32,750 --> 00:12:34,370 every core for a while. 279 00:12:34,370 --> 00:12:35,850 Get the data back. 280 00:12:35,850 --> 00:12:37,070 Go to the next filter, run it on every 281 00:12:37,070 --> 00:12:38,590 core for a while, get that back. 282 00:12:38,590 --> 00:12:42,285 So what you can do is, if you have four cores in here, you 283 00:12:42,285 --> 00:12:44,950 can replicate each of these four times. 284 00:12:44,950 --> 00:12:47,410 Run these four for a while, and then these four, these 285 00:12:47,410 --> 00:12:49,300 four, these four, these four. 286 00:12:49,300 --> 00:12:50,360 OK? 287 00:12:50,360 --> 00:12:51,360 So that's the nice way to do that. 288 00:12:51,360 --> 00:12:53,390 So the nice thing about doing that is you get really nice 289 00:12:53,390 --> 00:12:55,480 load balancing, because each is doing the same amount 290 00:12:55,480 --> 00:12:56,740 of work for a while. 291 00:12:56,740 --> 00:12:59,410 And after it accumulates enough data you go to the next 292 00:12:59,410 --> 00:13:04,870 one, do for a while, and then like that. 293 00:13:04,870 --> 00:13:07,220 And each group basically will occupy the entire machine -- 294 00:13:07,220 --> 00:13:09,640 you just go down this group like that. 295 00:13:09,640 --> 00:13:13,660 And so we ran it, and it was even slower. 296 00:13:13,660 --> 00:13:14,910 Why? 297 00:13:17,720 --> 00:13:19,470 It should have a lot more parallelism, because all those 298 00:13:19,470 --> 00:13:20,490 filters were data-parallel. 299 00:13:20,490 --> 00:13:23,280 So instead of getting stuck with two, now we can easily 300 00:13:23,280 --> 00:13:26,135 run a parallelism of 16, because with data parallelism you 301 00:13:26,135 --> 00:13:28,150 can just put any amount in there. 302 00:13:28,150 --> 00:13:30,700 But we are running slow. 303 00:13:30,700 --> 00:13:32,240 AUDIENCE: Communication overhead?
304 00:13:32,240 --> 00:13:34,850 PROFESSOR: Yeah, it could mainly be communication 305 00:13:34,850 --> 00:13:37,780 overhead, because what happens is you run this for a small 306 00:13:37,780 --> 00:13:38,320 amount of time. 307 00:13:38,320 --> 00:13:41,120 You have to send it all over the place, collect it back 308 00:13:41,120 --> 00:13:44,740 again, send it all over the place, collect it back again. 309 00:13:44,740 --> 00:13:46,420 The problem is there's too much synchronization and 310 00:13:46,420 --> 00:13:47,810 communication. 311 00:13:47,810 --> 00:13:50,400 Because every person at the end is like this global 312 00:13:50,400 --> 00:13:54,200 barrier, and the data has to go shuffling around. 313 00:13:54,200 --> 00:13:57,220 And that doesn't help. 314 00:13:57,220 --> 00:14:00,940 So the other part, what you can do in the baseline is what 315 00:14:00,940 --> 00:14:03,280 we call hardware pipelining. 316 00:14:03,280 --> 00:14:05,125 What that means is you can actually do pipeline 317 00:14:05,125 --> 00:14:05,560 parallelism. 318 00:14:05,560 --> 00:14:12,490 The way you can do that is you can look at the amount of work 319 00:14:12,490 --> 00:14:16,490 each filter contains, and you can combine them together in a 320 00:14:16,490 --> 00:14:20,890 way that the number of filters is going to be just about the 321 00:14:20,890 --> 00:14:22,140 number of tiles available. 322 00:14:24,380 --> 00:14:25,760 Most programs have more filters than 323 00:14:25,760 --> 00:14:26,960 the number of cores. 324 00:14:26,960 --> 00:14:29,250 So you combine filters, so that the number 325 00:14:29,250 --> 00:14:32,630 of filters is either the same as, or one or two less than, 326 00:14:32,630 --> 00:14:35,090 the number of cores available. 327 00:14:35,090 --> 00:14:38,840 And you combine them in a way that each of them will probably 328 00:14:38,840 --> 00:14:41,360 have close to the same amount of work.
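To make the combining step concrete, here is a small Python sketch that splits a linear pipeline into at most one contiguous group of filters per core while keeping the heaviest group as light as possible. It is an assumption-laden stand-in, not the StreamIt compiler's partitioner: the function `partition_pipeline`, the binary-search approach, and the integer work estimates are all invented for illustration.

```python
def partition_pipeline(work, cores):
    """Split per-filter work (integers, in pipeline order) into at most `cores`
    contiguous groups, minimizing the load of the most loaded group."""

    def groups_for(limit):
        # Greedily pack filters left to right without exceeding `limit` per group.
        groups, current = [], []
        for w in work:
            if current and sum(current) + w > limit:
                groups.append(current)
                current = []
            current.append(w)
        groups.append(current)
        return groups

    # Binary search for the smallest per-group limit that fits in `cores` groups.
    lo, hi = max(work), sum(work)
    while lo < hi:
        mid = (lo + hi) // 2
        if len(groups_for(mid)) <= cores:
            hi = mid
        else:
            lo = mid + 1
    return groups_for(lo)


# Hypothetical work estimates for eight filters, mapped onto four cores.
groups = partition_pipeline([2, 10, 3, 3, 8, 1, 1, 4], cores=4)
print(groups)
print([sum(g) for g in groups])   # per-core load
```

Even with the best contiguous split, the per-core loads in this made-up example come out uneven, and the pipeline runs at the speed of its most loaded core.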
329 00:14:41,360 --> 00:14:44,180 The problem is, when you combine, it's very hard to get 330 00:14:44,180 --> 00:14:46,860 the same amount of work. 331 00:14:46,860 --> 00:14:49,970 And if you assume eight cores, you can do this combination 332 00:14:49,970 --> 00:14:50,390 and we can say -- 333 00:14:50,390 --> 00:14:53,420 aha, if I do this combination, I have one, two, three, four, 334 00:14:53,420 --> 00:14:55,460 five, six, seven. 335 00:14:55,460 --> 00:14:57,680 Eight cores, I can get seven of them. 336 00:14:57,680 --> 00:15:00,430 Hopefully each of them has the same amount of work, and I 337 00:15:00,430 --> 00:15:03,070 can run that. 338 00:15:03,070 --> 00:15:06,610 And then we assign each one to a core and say -- "You own 339 00:15:06,610 --> 00:15:07,970 this one, you run it forever. 340 00:15:07,970 --> 00:15:11,040 You get the data from the guy who owns this one, and you 341 00:15:11,040 --> 00:15:16,080 produce at this one." And if you have more cores you can 342 00:15:16,080 --> 00:15:17,500 actually keep doing some of that. 343 00:15:17,500 --> 00:15:18,790 If you have enough filters you can 344 00:15:18,790 --> 00:15:20,800 combine them and do that. 345 00:15:20,800 --> 00:15:24,980 So we ran it, and we got this. 346 00:15:24,980 --> 00:15:28,020 Not that bad. 347 00:15:28,020 --> 00:15:29,650 So what might be the problems here? 348 00:15:37,308 --> 00:15:40,100 AUDIENCE: Hardware locality. 349 00:15:40,100 --> 00:15:42,616 You want to make sure that the communicating filters are 350 00:15:42,616 --> 00:15:43,460 close to each other. 351 00:15:43,460 --> 00:15:44,805 PROFESSOR: Yeah, that we can deal with. 352 00:15:44,805 --> 00:15:47,400 It's not a big locality [OBSCURED] 353 00:15:47,400 --> 00:15:49,740 What's the other problem? 354 00:15:49,740 --> 00:15:50,990 The bigger problem. 355 00:15:56,020 --> 00:15:57,000 AUDIENCE: [NOISE] 356 00:15:57,000 --> 00:15:57,490 load balance.
357 00:15:57,490 --> 00:15:59,310 PROFESSOR: Load balance is the biggest problem, because the 358 00:15:59,310 --> 00:16:01,540 problem is you are combining different types of things 359 00:16:01,540 --> 00:16:04,080 together, and you are hoping that each chunk you get 360 00:16:04,080 --> 00:16:05,660 combined together will have an almost 361 00:16:05,660 --> 00:16:07,430 identical amount of work. 362 00:16:07,430 --> 00:16:09,070 And that's very hard to achieve most of the time, 363 00:16:09,070 --> 00:16:11,640 because dynamically things keep changing. 364 00:16:11,640 --> 00:16:13,360 The nice thing about loops is, most of the time if you have a 365 00:16:13,360 --> 00:16:17,260 loop or state if you replicate it many times, it's the same 366 00:16:17,260 --> 00:16:19,050 amount of code, same amount of work. 367 00:16:19,050 --> 00:16:20,340 It nicely balances out. 368 00:16:20,340 --> 00:16:21,310 Hardware -- 369 00:16:21,310 --> 00:16:23,570 combining different things becomes actually much harder. 370 00:16:26,280 --> 00:16:28,390 So again, parallelism and synchronization are not really 371 00:16:28,390 --> 00:16:29,950 matched to the target. 372 00:16:29,950 --> 00:16:35,270 So the StreamIt compiler right now does three things. 373 00:16:35,270 --> 00:16:36,180 I'll go through the details. 374 00:16:36,180 --> 00:16:37,550 Coarsen the granularity of things. 375 00:16:37,550 --> 00:16:41,020 So what happens is if you have small filters it combines them 376 00:16:41,020 --> 00:16:45,320 together to get the large stateless areas. 377 00:16:45,320 --> 00:16:47,600 It data parallelizes when possible. 378 00:16:47,600 --> 00:16:49,960 And it does software pipelining, that's pipeline 379 00:16:49,960 --> 00:16:50,640 parallelism. 380 00:16:50,640 --> 00:16:53,710 I'll go through all these things in detail. 381 00:16:53,710 --> 00:16:58,460 And you can get about 11x speedup by 382 00:16:58,460 --> 00:16:59,460 doing all those things.
383 00:16:59,460 --> 00:17:01,150 So coarsen the stream graph. 384 00:17:01,150 --> 00:17:03,290 So you look at this stream graph and say -- wait a 385 00:17:03,290 --> 00:17:06,950 minute, I have a bunch of data-parallel parts. 386 00:17:06,950 --> 00:17:09,420 And before what I did was I took each data-parallel part, 387 00:17:09,420 --> 00:17:12,450 went 16-way, then came together, went 16-way, came 388 00:17:12,450 --> 00:17:13,310 together, went 16-way. 389 00:17:13,310 --> 00:17:14,400 Why? 390 00:17:14,400 --> 00:17:15,900 I have too much communication. 391 00:17:15,900 --> 00:17:20,590 Can I combine data-parallel things into one gigantic unit 392 00:17:20,590 --> 00:17:22,620 when possible? 393 00:17:22,620 --> 00:17:24,810 Of course, you don't want to combine a data-parallel part 394 00:17:24,810 --> 00:17:26,290 with a non-data-parallel part. 395 00:17:26,290 --> 00:17:27,830 Then the entire thing becomes sequential, 396 00:17:27,830 --> 00:17:29,170 and that's not helpful. 397 00:17:29,170 --> 00:17:32,400 So in here what we found is these four cannot be combined, 398 00:17:32,400 --> 00:17:36,350 because if you combine them the entire thing becomes 399 00:17:36,350 --> 00:17:37,520 sequential. 400 00:17:37,520 --> 00:17:41,240 So what we have to do is, you can combine this way. 401 00:17:41,240 --> 00:17:43,680 So all those things are data-parallel, all those 402 00:17:43,680 --> 00:17:44,790 things are data-parallel. 403 00:17:44,790 --> 00:17:47,640 And even though they are data-parallel if you combine 404 00:17:47,640 --> 00:17:49,160 them they become non-data-parallel, because 405 00:17:49,160 --> 00:17:51,040 this is actually doing peeking, it's looking at more 406 00:17:51,040 --> 00:17:53,030 than one, and so it's looking at somebody 407 00:17:53,030 --> 00:17:53,950 else's iteration work. 408 00:17:53,950 --> 00:17:56,920 So you can't combine them.
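The coarsening rule just described can be sketched for a simple linear pipeline. This is a simplified Python illustration under stated assumptions, not the compiler's actual algorithm: it assumes a straight pipeline with no split-joins, the `Filter` class and `coarsen` function are hypothetical, and it encodes only the two constraints mentioned in the lecture (do not merge with a stateful filter, and do not merge a peeking filter behind another filter).

```python
from dataclasses import dataclass


@dataclass
class Filter:
    name: str
    stateful: bool = False
    peeks: bool = False   # looks at more items than it pops


def coarsen(pipeline):
    """Group adjacent filters so that every fused group stays data-parallel."""
    groups = []
    for f in pipeline:
        can_join = (
            bool(groups)
            and not f.stateful                  # a stateful filter is never fused
            and not f.peeks                     # peeking behind another filter would add state
            and not groups[-1][-1].stateful     # never fuse onto a stateful filter
        )
        if can_join:
            groups[-1].append(f)
        else:
            groups.append([f])
    return [[f.name for f in g] for g in groups]


# Hypothetical pipeline: two plain filters, a peeking filter, a plain filter,
# a stateful accumulator, and one more plain filter.
pipeline = [
    Filter("a"), Filter("b"), Filter("lowpass", peeks=True),
    Filter("c"), Filter("acc", stateful=True), Filter("d"),
]
print(coarsen(pipeline))
```

Fewer, larger groups mean fewer scatter and gather points, which is the reduction in global communication that coarsening is after.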
409 00:17:56,920 --> 00:18:00,560 So the benefit of doing this is you reduce global 410 00:18:00,560 --> 00:18:01,810 communication basically. 411 00:18:04,460 --> 00:18:07,920 And the next thing is you want to 412 00:18:07,920 --> 00:18:10,650 data-parallelize to four cores. 413 00:18:10,650 --> 00:18:18,060 And this one fisses four ways in there. 414 00:18:18,060 --> 00:18:20,425 But the interesting thing is, when you go in this one you 415 00:18:20,425 --> 00:18:21,830 realize there's some task parallelism. 416 00:18:24,680 --> 00:18:28,040 We know there are two tasks that have the same amount of 417 00:18:28,040 --> 00:18:30,520 work in here. 418 00:18:30,520 --> 00:18:32,880 So fissing this four ways, and fissing this four ways, and 419 00:18:32,880 --> 00:18:34,525 giving the entire machine to this one, and giving the 420 00:18:34,525 --> 00:18:36,860 entire machine to this one, might not be the best idea. 421 00:18:36,860 --> 00:18:40,000 What you want to do is you want to fiss it two ways. 422 00:18:40,000 --> 00:18:43,930 And then basically give the entire machine to all of these 423 00:18:43,930 --> 00:18:45,550 running at the same time, because they're 424 00:18:45,550 --> 00:18:46,130 load balanced -- 425 00:18:46,130 --> 00:18:48,340 because they are the same thing repeated. 426 00:18:48,340 --> 00:18:49,790 And you can do the same thing in here. 427 00:18:53,660 --> 00:18:54,610 OK. 428 00:18:54,610 --> 00:18:59,390 So that's what the compiler does automatically, and it 429 00:18:59,390 --> 00:19:00,400 preserves task parallelism. 430 00:19:00,400 --> 00:19:02,320 So if you have task parallelism you don't need -- 431 00:19:02,320 --> 00:19:05,160 the thing about that is the parallelism you need, you 432 00:19:05,160 --> 00:19:06,270 don't need too much parallelism. 433 00:19:06,270 --> 00:19:08,590 You need enough parallelism to make the machine happy.
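The arithmetic behind this choice is small enough to write down. The Python sketch below is a hedged illustration, assuming the task-parallel branches do equal work as in the lecture's example; the function name `replicas_per_filter` is made up and is not part of StreamIt.

```python
def replicas_per_filter(cores, parallel_branches):
    """How many ways to split each data-parallel filter when `parallel_branches`
    equally loaded task-parallel branches already run side by side."""
    return max(1, cores // parallel_branches)


# The lecture's case: four cores and two task-parallel branches of equal work,
# so each branch is split two ways instead of four.
print(replicas_per_filter(cores=4, parallel_branches=2))

# A lone data-parallel filter with no sibling branches gets the whole machine.
print(replicas_per_filter(cores=4, parallel_branches=1))
```

Splitting only as far as needed to fill the machine keeps the scatter and gather width down, in line with the point that you need enough parallelism but not too much.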
434 00:19:08,590 --> 00:19:10,280 If you have too much parallelism you end up in 435 00:19:10,280 --> 00:19:11,600 other problems, like synchronization. 436 00:19:11,600 --> 00:19:13,380 So this gives enough parallelism to keep the entire 437 00:19:13,380 --> 00:19:16,640 machine happy, but not too much. 438 00:19:16,640 --> 00:19:20,420 And by doing that actually we get pretty good performance. 439 00:19:20,420 --> 00:19:24,770 There are a few cases where this hardware parallelism wins 440 00:19:24,770 --> 00:19:26,460 out, these two, but most of them -- 441 00:19:26,460 --> 00:19:29,650 actually this last one we can recover -- 442 00:19:29,650 --> 00:19:31,458 do it pretty well. 443 00:19:31,458 --> 00:19:32,820 OK. 444 00:19:32,820 --> 00:19:37,870 So what's left here is -- so this is good parallelism and 445 00:19:37,870 --> 00:19:39,770 low synchronization. 446 00:19:39,770 --> 00:19:43,320 But there's one thing, when you are doing data parallelism 447 00:19:43,320 --> 00:19:47,540 there are places where there are filters that cannot be 448 00:19:47,540 --> 00:19:50,460 parallelized -- they are stateful filters. 449 00:19:50,460 --> 00:19:53,590 Because you can't run the data parallelism, and according to 450 00:19:53,590 --> 00:19:55,580 Amdahl's Law that's actually going to basically kill you, 451 00:19:55,580 --> 00:19:57,210 because that's just waiting there and you 452 00:19:57,210 --> 00:20:00,090 can't do too much. 453 00:20:00,090 --> 00:20:03,040 I'm going to show that using this separate program -- so 454 00:20:03,040 --> 00:20:05,580 this number is the amount of work that each 455 00:20:05,580 --> 00:20:06,300 of them has to do. 456 00:20:06,300 --> 00:20:08,670 So this is actually a lot of work, a lot of work -- this 457 00:20:08,670 --> 00:20:12,410 does a little work in each of these filters. 458 00:20:12,410 --> 00:20:14,660 So if you look at that, these are data parallel but it 459 00:20:14,660 --> 00:20:15,780 doesn't do any much work. 
460 00:20:15,780 --> 00:20:19,200 Just parallelizing this doesn't help you. 461 00:20:19,200 --> 00:20:20,710 And these are data parallel. 462 00:20:20,710 --> 00:20:21,830 And these actually do enough work. 463 00:20:21,830 --> 00:20:23,770 Actually we can go and say I am replicating this four 464 00:20:23,770 --> 00:20:25,100 times, and I'm OK. 465 00:20:25,100 --> 00:20:27,640 I'm getting actually good performance in here. 466 00:20:27,640 --> 00:20:30,090 Now what we have is a program like this. 467 00:20:30,090 --> 00:20:33,320 And so if you are not doing anything else that we have 468 00:20:33,320 --> 00:20:34,080 data parallelism in. 469 00:20:34,080 --> 00:20:37,000 So what happens in the first cycle you run these two. 470 00:20:37,000 --> 00:20:39,460 And then you run data parallel this one, and then you run 471 00:20:39,460 --> 00:20:45,260 these, and then you run data parallel this one. 472 00:20:45,260 --> 00:20:47,760 And if you look at that, what happens is we have a bunch of 473 00:20:47,760 --> 00:20:50,140 holes in here. 474 00:20:50,140 --> 00:20:52,510 Because at that point when you are running that part of the 475 00:20:52,510 --> 00:20:54,370 program there's not enough parallelism, and you only have 476 00:20:54,370 --> 00:20:55,330 two things in there. 477 00:20:55,330 --> 00:20:57,430 And when you're running this you can run this task 478 00:20:57,430 --> 00:20:59,210 parallelism in here, but there's nothing else you can 479 00:20:59,210 --> 00:21:00,980 do in here. 480 00:21:00,980 --> 00:21:05,860 And so you get basically 21 time steps each -- 481 00:21:05,860 --> 00:21:08,310 time minutes basically will run into that program. 482 00:21:08,310 --> 00:21:10,910 But here we can do better. 483 00:21:10,910 --> 00:21:14,960 What we can do is we can take and try to move that there, 484 00:21:14,960 --> 00:21:18,620 and kind of compress them. 
485 00:21:18,620 --> 00:21:20,880 But the interesting thing is these things 486 00:21:20,880 --> 00:21:24,820 are not data parallel. 487 00:21:24,820 --> 00:21:26,350 So how do I do that? 488 00:21:26,350 --> 00:21:28,640 So the way to do that is taking advantage of pipeline 489 00:21:28,640 --> 00:21:30,370 parallelism. 490 00:21:30,370 --> 00:21:34,280 So what you can do is you can take this filter in here. 491 00:21:34,280 --> 00:21:40,120 Since each of the entire graph can run only sequentially -- 492 00:21:40,120 --> 00:21:42,590 this has to run after this -- you can look at the filters 493 00:21:42,590 --> 00:21:47,470 running separately like that, and kind of say, instead of 494 00:21:47,470 --> 00:21:50,440 running this and this and this, why don't I run this 495 00:21:50,440 --> 00:21:52,020 iterations of this one. 496 00:21:52,020 --> 00:21:53,940 This iterations of this invocation. 497 00:21:53,940 --> 00:21:55,680 And this iterations of this one. 498 00:21:55,680 --> 00:21:58,340 And this iterations on the next one. 499 00:21:58,340 --> 00:22:00,420 And I'm still maintaining -- because when I'm running this 500 00:22:00,420 --> 00:22:01,030 even though the -- 501 00:22:01,030 --> 00:22:02,920 I'm not running anything data parallel here because these 502 00:22:02,920 --> 00:22:05,100 ones were already done previously, so I can actually 503 00:22:05,100 --> 00:22:06,760 use that value. 504 00:22:06,760 --> 00:22:10,720 And so I can maintain that dependency, but I'm running 505 00:22:10,720 --> 00:22:12,430 things from the different iterations. 506 00:22:12,430 --> 00:22:15,200 And so what I need to do is, I need to kind of do a prologue 507 00:22:15,200 --> 00:22:17,430 to kind of set everything up in there. 508 00:22:17,430 --> 00:22:19,810 And then I can do that and I don't have any kind of 509 00:22:19,810 --> 00:22:21,330 dependence among these things.
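The prologue-then-steady-state idea can be made concrete with a small Python sketch. It is a simplified model and not the StreamIt scheduler: it assumes a straight pipeline in which stage `s` of iteration `i` depends only on stage `s - 1` of the same iteration, and the function names are hypothetical. In each time step every stage works on a different iteration, so nothing inside one step depends on anything else in that step.

```python
def software_pipeline_schedule(num_stages: int, num_iterations: int):
    """Return a list of steps; each step is a list of (stage, iteration) pairs.

    At step t, stage s processes iteration t - s. The first num_stages - 1
    steps form the prologue that fills the pipeline; after that every stage
    is busy until the iterations run out.
    """
    steps = []
    for t in range(num_stages + num_iterations - 1):
        step = [
            (s, t - s)
            for s in range(num_stages)
            if 0 <= t - s < num_iterations
        ]
        steps.append(step)
    return steps


def dependencies_respected(steps) -> bool:
    """Check that (s - 1, i) finished in an earlier step before (s, i) runs."""
    done = set()
    for step in steps:
        for stage, iteration in step:
            if stage > 0 and (stage - 1, iteration) not in done:
                return False
        done.update(step)  # work in the same step is not visible to itself
    return True


schedule = software_pipeline_schedule(num_stages=3, num_iterations=4)
for t, step in enumerate(schedule):
    print(t, step)
```

Because the pairs inside one step are mutually independent, they can be placed on any free core, which is what lets the schedule be compressed.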
510 00:22:21,330 --> 00:22:28,020 So now what I can do is I can basically take these two and 511 00:22:28,020 --> 00:22:31,520 basically lay out anything anywhere in those groups, 512 00:22:31,520 --> 00:22:34,610 because they are in different iterations and since I am 513 00:22:34,610 --> 00:22:36,870 pipelining these I don't have any dependence in there. 514 00:22:36,870 --> 00:22:39,960 So I end up in this kind of a thing, and basically much 515 00:22:39,960 --> 00:22:41,210 compress in here. 516 00:22:43,420 --> 00:22:46,380 And by doing that what you actually get is a really nice 517 00:22:46,380 --> 00:22:47,830 performance. 518 00:22:47,830 --> 00:22:53,500 The only place that this actually wins -- hardware 519 00:22:53,500 --> 00:22:55,630 pipelining, and this little bit in there. 520 00:22:55,630 --> 00:22:59,970 But the rest you get a really good win in here. 521 00:22:59,970 --> 00:23:00,700 OK. 522 00:23:00,700 --> 00:23:06,460 So what this does is basically now we got a program that when 523 00:23:06,460 --> 00:23:09,140 the programmer never thought anything about what the 524 00:23:09,140 --> 00:23:10,010 hardware is -- 525 00:23:10,010 --> 00:23:12,780 just wrote abstract graph and data streaming. 526 00:23:12,780 --> 00:23:15,810 And given Raw, we automatically actually mapped 527 00:23:15,810 --> 00:23:17,880 into it, and figured out what is the right balance, right 528 00:23:17,880 --> 00:23:20,200 communication, right synchronization, and got 529 00:23:20,200 --> 00:23:21,370 really good performance. 530 00:23:21,370 --> 00:23:24,050 And you're getting something like 11x performance. 531 00:23:24,050 --> 00:23:26,540 If you do hard hand, if you work hard probably you can do 532 00:23:26,540 --> 00:23:27,390 a little bit better. 533 00:23:27,390 --> 00:23:29,650 But this is good, because you don't hand-do anything.
534 00:23:29,650 --> 00:23:31,990 The killer thing is now I can probably take this set of 535 00:23:31,990 --> 00:23:34,800 programs -- which we are actually working on -- is you 536 00:23:34,800 --> 00:23:39,600 can take them to Cell which has, depending on the day, six 537 00:23:39,600 --> 00:23:44,100 cores, seven cores, eight cores, and we can basically 538 00:23:44,100 --> 00:23:47,150 get to matching the number of cores in there. 539 00:23:47,150 --> 00:23:49,390 So this is it because right now what happens is you have 540 00:23:49,390 --> 00:23:51,810 to basically hand code all those things, and this can 541 00:23:51,810 --> 00:23:53,220 automate all that process. 542 00:23:53,220 --> 00:23:55,470 So that's the idea, is can you do this -- which we haven't 543 00:23:55,470 --> 00:23:57,530 really proved and this is our research -- 544 00:23:57,530 --> 00:23:59,970 write once, use anywhere. 545 00:23:59,970 --> 00:24:04,760 So write this program once in this abstract way. 546 00:24:04,760 --> 00:24:06,760 You really don't have to think about full parallelism. 547 00:24:06,760 --> 00:24:09,815 You have to think about some amount of parallelism, how 548 00:24:09,815 --> 00:24:13,140 this can be put into a stream graph, but you are not dealing 549 00:24:13,140 --> 00:24:15,530 with synchronization, load balancing, performance. 550 00:24:15,530 --> 00:24:16,990 You don't have to deal with that. 551 00:24:16,990 --> 00:24:20,140 And then the compiler will automatically do all these 552 00:24:20,140 --> 00:24:22,800 things behind you, and get really good performance. 553 00:24:22,800 --> 00:24:26,380 And the reason I showed this was -- 554 00:24:26,380 --> 00:24:27,713 I'll just play one more slide I think -- 555 00:24:30,710 --> 00:24:33,050 showed this was it's not a simple thing. 556 00:24:33,050 --> 00:24:35,580 The compiler actually has to do a bunch of work, the work 557 00:24:35,580 --> 00:24:36,930 that you used to do before.
558 00:24:36,930 --> 00:24:40,190 Things like figuring out what's the right granularity, 559 00:24:40,190 --> 00:24:43,780 what's the right mix of operations, what type of 560 00:24:43,780 --> 00:24:46,840 transformations you need to do to get there. 561 00:24:46,840 --> 00:24:51,970 But at some point we did three things -- coarse-grained, data 562 00:24:51,970 --> 00:24:53,840 parallel, and software pipelining. 563 00:24:53,840 --> 00:24:56,290 And by doing these three we can actually get a really good 564 00:24:56,290 --> 00:25:00,620 performance in most of the programs we have. So what we 565 00:25:00,620 --> 00:25:03,840 are hoping is basically this kind of techniques can in fact 566 00:25:03,840 --> 00:25:08,380 help programmers to get multicore performance without 567 00:25:08,380 --> 00:25:11,490 really going and dealing in the grunge level of details 568 00:25:11,490 --> 00:25:12,540 you guys do. 569 00:25:12,540 --> 00:25:15,430 You guys will appreciate that, and hopefully will 570 00:25:15,430 --> 00:25:17,310 think of making -- 571 00:25:17,310 --> 00:25:19,870 because now at the end of this class, you will know all the 572 00:25:19,870 --> 00:25:21,520 pain and suffering the programmers go 573 00:25:21,520 --> 00:25:24,170 through to get there. 574 00:25:24,170 --> 00:25:26,510 And the interesting thing would be to in fact look at 575 00:25:26,510 --> 00:25:29,100 the ways to basically reduce that pain and suffering. 576 00:25:29,100 --> 00:25:32,570 So that's what I have today. 577 00:25:32,570 --> 00:25:36,340 So this was, as I promised, a short lecture -- 578 00:25:36,340 --> 00:25:37,420 the second one. 579 00:25:37,420 --> 00:25:38,670 Any questions? 580 00:25:41,611 --> 00:25:44,330 AUDIENCE: So if we've got enough data parallelism we'll 581 00:25:44,330 --> 00:25:49,322 have the same software pipeline jumping on each tile? 582 00:25:49,322 --> 00:25:50,950 Is that right? 583 00:25:50,950 --> 00:25:51,760 PROFESSOR: Yes. 
584 00:25:51,760 --> 00:25:52,410 AUDIENCE: OK. 585 00:25:52,410 --> 00:25:56,002 So if you do that how does it scale up to something that has 586 00:25:56,002 --> 00:25:59,236 higher communication costs than Raw? 587 00:25:59,236 --> 00:26:01,939 By doing this software pipelining you have to do all 588 00:26:01,939 --> 00:26:03,650 of your communication off tile. 589 00:26:03,650 --> 00:26:08,140 PROFESSOR: So the interest in there right now is we haven't 590 00:26:08,140 --> 00:26:10,850 done any kind of hardware pipelining. 591 00:26:10,850 --> 00:26:14,040 We are kind of doing -- everybody's getting a lot of 592 00:26:14,040 --> 00:26:15,950 data moving in there. 593 00:26:15,950 --> 00:26:19,950 The neat thing about right now is, even with the SPEs in Cell 594 00:26:19,950 --> 00:26:24,090 and even Raw, the number of tiles are still small enough 595 00:26:24,090 --> 00:26:27,240 that a lot of communication -- unless way too much 596 00:26:27,240 --> 00:26:28,720 communication -- it doesn't really overwhelm you. 597 00:26:28,720 --> 00:26:32,530 Because everybody's nearby, you can send things. 598 00:26:32,530 --> 00:26:35,910 They talk a little bit about in Cell that near-neighborness 599 00:26:35,910 --> 00:26:38,310 helps, but not that much. 600 00:26:38,310 --> 00:26:42,150 But as we go into larger and larger cores, it's going to 601 00:26:42,150 --> 00:26:44,360 become an issue. 602 00:26:44,360 --> 00:26:47,080 Near neighbors become much easier to communicate, and you 603 00:26:47,080 --> 00:26:48,700 can't do global things in there. 604 00:26:48,700 --> 00:26:50,230 And at that point you will actually have to do some 605 00:26:50,230 --> 00:26:51,090 hardware pipelining. 606 00:26:51,090 --> 00:26:55,120 You can't just assume that at some point everybody's going 607 00:26:55,120 --> 00:26:56,940 to get some data and go to something.
608 00:26:56,940 --> 00:26:59,480 So what you need to do is have different chunks that the only 609 00:26:59,480 --> 00:27:01,520 communication that would be between these chunks would be 610 00:27:01,520 --> 00:27:02,980 kind of a pipeline communication. 611 00:27:02,980 --> 00:27:06,410 So you don't mix data around. 612 00:27:06,410 --> 00:27:10,260 So as we go into larger and larger cores you need to start 613 00:27:10,260 --> 00:27:12,860 doing techniques like that. 614 00:27:12,860 --> 00:27:16,710 The interesting thing here is even though what you had to 615 00:27:16,710 --> 00:27:18,950 change was the compiler -- hopefully the program stays 616 00:27:18,950 --> 00:27:20,040 the same -- 617 00:27:20,040 --> 00:27:22,180 right now it's not an easy issue, because our compiler 618 00:27:22,180 --> 00:27:24,733 has 10 times more code than the program, so it's easier in 619 00:27:24,733 --> 00:27:25,300 the program. 620 00:27:25,300 --> 00:27:30,300 But if you look at something like C, the code base is millions 621 00:27:30,300 --> 00:27:32,410 of times larger than the size of the compiler. 622 00:27:32,410 --> 00:27:34,020 So at some point they'll be switched. 623 00:27:34,020 --> 00:27:35,500 It's easier to change the compiler to kind 624 00:27:35,500 --> 00:27:36,900 of keep up to date. 625 00:27:36,900 --> 00:27:39,070 That's what happened in C. Every generation you change 626 00:27:39,070 --> 00:27:41,150 the compiler, you don't ask programmers to recode the 627 00:27:41,150 --> 00:27:42,350 application. 628 00:27:42,350 --> 00:27:46,500 So can you make these kind of things as the multicores 629 00:27:46,500 --> 00:27:47,240 become different -- 630 00:27:47,240 --> 00:27:49,320 bigger, have different features. 631 00:27:49,320 --> 00:27:52,270 You change the compiler to get the performance, but have the 632 00:27:52,270 --> 00:27:54,780 same code base. 633 00:27:54,780 --> 00:27:56,030 That's the goal for portability.
634 00:27:58,711 --> 00:28:01,350 AUDIENCE: Have you tried applying StreamIt or the 635 00:28:01,350 --> 00:28:06,550 streaming model in general, to codes that are not very 636 00:28:06,550 --> 00:28:12,950 clearly stream-based but using the streaming model to make 637 00:28:12,950 --> 00:28:16,450 communication explicit, such as scientific codes. 638 00:28:16,450 --> 00:28:19,100 Or, for example, the kinds of parallelizable loops that you 639 00:28:19,100 --> 00:28:20,870 covered in the first half of the lecture. 640 00:28:20,870 --> 00:28:23,450 PROFESSOR: Some of those things, when you have free 641 00:28:23,450 --> 00:28:25,810 form simple communication can map into streaming. 642 00:28:25,810 --> 00:28:28,420 So for example, one thing we are doing is things 643 00:28:28,420 --> 00:28:30,490 like right now MPEG. 644 00:28:30,490 --> 00:28:33,430 Some part of the MPEG is nicely StreamIt, but when you 645 00:28:33,430 --> 00:28:36,090 actually go inside the MPEG and dealing with the frame, 646 00:28:36,090 --> 00:28:39,810 it's basically a big array, and you're doing that. 647 00:28:39,810 --> 00:28:42,100 So how do you chunkify the arrays, and basically deal 648 00:28:42,100 --> 00:28:43,350 with it in a streaming order? 649 00:28:43,350 --> 00:28:44,980 There's some interesting things you can do. 650 00:28:44,980 --> 00:28:48,400 There will be some stuff that doesn't fit that. 651 00:28:48,400 --> 00:28:55,100 Things like pattern recognition type stuff, where 652 00:28:55,100 --> 00:28:57,270 what you want to do is you want to -- 653 00:28:57,270 --> 00:28:59,330 assume you're trying to -- 654 00:28:59,330 --> 00:29:01,540 good example. 655 00:29:01,540 --> 00:29:04,210 You're trying to do feature recognition in a video. 656 00:29:04,210 --> 00:29:07,560 And what happens is the number of features, can you match or 657 00:29:07,560 --> 00:29:08,860 connect two features, or match and 658 00:29:08,860 --> 00:29:10,090 connect a thousand features.
659 00:29:10,090 --> 00:29:12,480 And then each feature you need to do some processing. 660 00:29:12,480 --> 00:29:14,880 And that is a very dynamic thing. 661 00:29:14,880 --> 00:29:16,140 And that doesn't really fit into 662 00:29:16,140 --> 00:29:17,800 streaming order right now. 663 00:29:17,800 --> 00:29:21,530 And so the interesting thing is, the problem we have been 664 00:29:21,530 --> 00:29:24,340 doing is we are trying to fit everything into one. 665 00:29:24,340 --> 00:29:26,780 So right now the object-oriented model is it 666 00:29:26,780 --> 00:29:28,570 basically -- everything has to fit in there. 667 00:29:28,570 --> 00:29:30,580 But what you're finding is there are many things that 668 00:29:30,580 --> 00:29:32,170 don't really fit nicely. 669 00:29:32,170 --> 00:29:35,310 And you'll do these very crazy looking things just to get 670 00:29:35,310 --> 00:29:38,320 every program to fit into the object-oriented model. 671 00:29:38,320 --> 00:29:39,150 That doesn't really work. 672 00:29:39,150 --> 00:29:41,090 I think the right way to work is, is there 673 00:29:41,090 --> 00:29:42,350 might be multiple models. 674 00:29:42,350 --> 00:29:44,050 There's a streaming model, there's some kind of a 675 00:29:44,050 --> 00:29:45,610 threaded model, there might be different ones -- 676 00:29:45,610 --> 00:29:47,130 I don't know what other models are. 677 00:29:47,130 --> 00:29:49,450 So the key thing is your program might have a large 678 00:29:49,450 --> 00:29:51,520 chunky model, another chunky model. 679 00:29:51,520 --> 00:29:53,540 Don't try to come up with -- 680 00:29:53,540 --> 00:29:55,220 right now what we have is we have a kitchen 681 00:29:55,220 --> 00:29:56,370 sink type of language. 682 00:29:56,370 --> 00:29:59,120 It tries to support everything at the same time. 
683 00:29:59,120 --> 00:30:01,030 And that doesn't really work because then you have to think 684 00:30:01,030 --> 00:30:01,400 about and say -- 685 00:30:01,400 --> 00:30:04,110 OK done, can I have a pointer here? 686 00:30:04,110 --> 00:30:08,890 And I need to think about all the possible models kind of 687 00:30:08,890 --> 00:30:10,960 colliding in the same space. 688 00:30:10,960 --> 00:30:14,880 AUDIENCE: On the other hand, the object-oriented model is 689 00:30:14,880 --> 00:30:16,010 much more generalized to me. 690 00:30:16,010 --> 00:30:18,545 It's not the best model for many things, but it's much 691 00:30:18,545 --> 00:30:20,830 more generalizable than some models. 692 00:30:20,830 --> 00:30:23,910 And having a single model cuts down on the number of semantic 693 00:30:23,910 --> 00:30:25,140 barriers you have to cross -- 694 00:30:25,140 --> 00:30:26,030 PROFESSOR: I don't know but -- 695 00:30:26,030 --> 00:30:29,790 AUDIENCE: Semantic barriers incur both programmer overhead 696 00:30:29,790 --> 00:30:31,380 and run-time overhead. 697 00:30:31,380 --> 00:30:33,130 PROFESSOR: See the problem with right now with all the 698 00:30:33,130 --> 00:30:36,770 semantic barriers, is object-oriented model plus a 699 00:30:36,770 --> 00:30:38,860 huge number of libraries. 700 00:30:38,860 --> 00:30:41,630 If you want to do OpenGL, it's object-oriented but you have 701 00:30:41,630 --> 00:30:42,250 to know the library. 702 00:30:42,250 --> 00:30:43,395 If you want to do something else, you have 703 00:30:43,395 --> 00:30:44,230 to learn the library. 704 00:30:44,230 --> 00:30:46,650 What the right thing would be, instead of trying to learn the 705 00:30:46,650 --> 00:30:49,880 libraries is learn kind of a subset language. 706 00:30:49,880 --> 00:30:53,580 So you have nice semantics, you have nice syntax in there, 707 00:30:53,580 --> 00:30:56,380 you have nice error checking, nice 708 00:30:56,380 --> 00:30:58,270 optimization within that syntax.
709 00:30:58,270 --> 00:31:01,340 Because the trouble is right now everything is in this just 710 00:31:01,340 --> 00:31:03,620 gigantic language, and you can't do anything. 711 00:31:03,620 --> 00:31:05,750 And in the program you don't even know, because you can mix 712 00:31:05,750 --> 00:31:08,290 and match in really bad ways. 713 00:31:08,290 --> 00:31:13,450 The mix and match gives you a lot of power, but it can 714 00:31:13,450 --> 00:31:14,480 actually really hurt. 715 00:31:14,480 --> 00:31:15,990 And a lot of people don't need it. 716 00:31:15,990 --> 00:31:19,060 Like for example in C, people thought it was really crucial 717 00:31:19,060 --> 00:31:22,230 for you to access any part of memory anywhere you want. 718 00:31:22,230 --> 00:31:25,270 You just can go and just access any program, anywhere, 719 00:31:25,270 --> 00:31:26,620 anytime in there. 720 00:31:26,620 --> 00:31:28,300 If you look at it, nobody takes advantage of that. 721 00:31:28,300 --> 00:31:29,700 How many times do you write the program and say -- "Hey, I 722 00:31:29,700 --> 00:31:32,980 want to go access the other guy's stack from this part." 723 00:31:32,980 --> 00:31:34,150 That doesn't work. 724 00:31:34,150 --> 00:31:36,160 You have a variable and you use a variable. 725 00:31:36,160 --> 00:31:37,610 AUDIENCE: It still [OBSCURED] 726 00:31:37,610 --> 00:31:42,020 PROFESSOR: Yeah, but the thing is because of that power, it 727 00:31:42,020 --> 00:31:43,950 creates a lot of problems for a compiler -- 728 00:31:43,950 --> 00:31:45,970 because it needs to prove that you're not doing 729 00:31:45,970 --> 00:31:47,530 that, which is hard. 730 00:31:47,530 --> 00:31:50,160 And also, if you make a mistake the program is like -- 731 00:31:50,160 --> 00:31:51,190 "Yeah, this looks like right. 732 00:31:51,190 --> 00:31:54,650 It still matches my semantics and syntax, I'll let you do 733 00:31:54,650 --> 00:31:57,155 that."
But what you realize is that's not something people do 734 00:31:57,155 --> 00:31:59,370 -- just stick with your variable. 735 00:31:59,370 --> 00:32:01,140 And if you don't go to variables -- 736 00:32:01,140 --> 00:32:03,190 that's what type-safe languages do -- it's probably 737 00:32:03,190 --> 00:32:05,690 more for bugs than a feature. 738 00:32:05,690 --> 00:32:08,790 And the same kind of thing having efficiency in language, 739 00:32:08,790 --> 00:32:11,710 is you can do everything at the same time. 740 00:32:11,710 --> 00:32:15,130 Why can't you have a language that you can go with this kind 741 00:32:15,130 --> 00:32:15,600 of context. 742 00:32:15,600 --> 00:32:18,750 I'm in the streaming context now. 743 00:32:18,750 --> 00:32:20,220 I say this is my streaming context. 744 00:32:20,220 --> 00:32:21,780 I am in a threaded context. 745 00:32:21,780 --> 00:32:25,720 Then what that does is, I have to learn the full set of 746 00:32:25,720 --> 00:32:28,340 features, but I restrict what I'm using here. 747 00:32:28,340 --> 00:32:30,760 That can probably realistically improve your 748 00:32:30,760 --> 00:32:33,520 program building, because you don't have to worry about -- 749 00:32:33,520 --> 00:32:35,568 AUDIENCE: It gives the programmer time to get to know 750 00:32:35,568 --> 00:32:36,080 each language. 751 00:32:36,080 --> 00:32:37,530 PROFESSOR: But right now you have to do that. 752 00:32:37,530 --> 00:32:40,100 If you look at C# it has all these features. 753 00:32:40,100 --> 00:32:42,770 It has streaming features, it has threaded features, it has 754 00:32:42,770 --> 00:32:43,990 every possible object-oriented feature. 755 00:32:43,990 --> 00:32:49,465 AUDIENCE: Right, but there's a compact central model which 756 00:32:49,465 --> 00:32:50,900 covers most things. 757 00:32:50,900 --> 00:32:52,880 You can pull in additional features and fit them 758 00:32:52,880 --> 00:32:54,300 [OBSCURED]. 
759 00:32:54,300 --> 00:32:58,745 You can do pointer manipulation in C#, but you 760 00:32:58,745 --> 00:33:00,770 bracket things into an unsafe block. 761 00:33:00,770 --> 00:33:02,845 And then the compiler knows in there you're 762 00:33:02,845 --> 00:33:03,800 doing really bad things. 763 00:33:03,800 --> 00:33:06,950 PROFESSOR: That's a nice thing, because 764 00:33:06,950 --> 00:33:07,850 you can have unsafe. 765 00:33:07,850 --> 00:33:08,970 But can you have something like -- this is 766 00:33:08,970 --> 00:33:11,060 my streaming part. 767 00:33:11,060 --> 00:33:11,510 OK. 768 00:33:11,510 --> 00:33:13,155 Can I do something like that, so I don't have 769 00:33:13,155 --> 00:33:13,610 to worry about other? 770 00:33:13,610 --> 00:33:17,210 The key thing is, is there a way where -- because right 771 00:33:17,210 --> 00:33:20,650 now, my feeling is if you look at the object-oriented part. 772 00:33:20,650 --> 00:33:23,190 So if you are doing, for example, Windows programming, 773 00:33:23,190 --> 00:33:25,280 you can spend about a week and learn all the 774 00:33:25,280 --> 00:33:26,810 object-oriented concepts. 775 00:33:26,810 --> 00:33:28,310 And you have to spend probably a year to learn all the 776 00:33:28,310 --> 00:33:30,150 libraries on top of that. 777 00:33:30,150 --> 00:33:32,590 That's the old action these days. 778 00:33:32,590 --> 00:33:35,540 It's basically the building blocks have become too low, 779 00:33:35,540 --> 00:33:39,950 and then everything else is kind of an unorganized mess on 780 00:33:39,950 --> 00:33:41,050 top of that. 781 00:33:41,050 --> 00:33:43,390 Can you put more abstraction things that easy? 782 00:33:43,390 --> 00:33:45,100 Hey, I'm talking about research, this is 783 00:33:45,100 --> 00:33:45,930 one possible angle. 784 00:33:45,930 --> 00:33:48,420 I mean there might -- you can think, I know there are messes 785 00:33:48,420 --> 00:33:50,460 that I think in there. 
786 00:33:50,460 --> 00:33:55,310 My feeling is what we do well is when things get too 787 00:33:55,310 --> 00:33:58,180 complicated we build abstraction layers. 788 00:33:58,180 --> 00:34:02,835 And the interesting thing there is, we build this high 789 00:34:02,835 --> 00:34:05,790 level programming language abstraction layer. 790 00:34:05,790 --> 00:34:09,030 And then now we have built so much crud on top of that 791 00:34:09,030 --> 00:34:11,360 without any nice abstraction layers, I think it's probably 792 00:34:11,360 --> 00:34:13,290 time to think through what there could be at the 793 00:34:13,290 --> 00:34:13,790 abstraction level. 794 00:34:13,790 --> 00:34:15,590 Things like, it's hitting -- 795 00:34:15,590 --> 00:34:17,480 that is where parallelism is really hitting. 796 00:34:17,480 --> 00:34:19,980 Because that layer, the object-oriented layer, doesn't 797 00:34:19,980 --> 00:34:22,530 really support that well. 798 00:34:22,530 --> 00:34:26,170 And it's all kind of ad hoc on top of that. 799 00:34:26,170 --> 00:34:27,160 And so that says something. 800 00:34:27,160 --> 00:34:29,070 Yes, it's usable. 801 00:34:29,070 --> 00:34:30,140 We had this argument -- 802 00:34:30,140 --> 00:34:32,760 assembly languages programmers -- for two decades. 803 00:34:32,760 --> 00:34:33,390 There are people who were 804 00:34:33,390 --> 00:34:34,970 swearing by assembly languages. 805 00:34:34,970 --> 00:34:39,010 They could write it two times smaller, two times faster than 806 00:34:39,010 --> 00:34:40,680 anything you can write in high level language. 807 00:34:40,680 --> 00:34:42,120 It's probably still true today. 808 00:34:42,120 --> 00:34:46,550 But at the end there were things that high level 809 00:34:46,550 --> 00:34:47,560 languages won out. 810 00:34:47,560 --> 00:34:49,350 I think we are probably in another layer like that. 811 00:34:49,350 --> 00:34:51,050 I don't know, probably will go with that argument. 
812 00:34:51,050 --> 00:34:52,680 You can always point to something saying this is 813 00:34:52,680 --> 00:34:53,910 something you cannot do. 814 00:34:53,910 --> 00:34:56,580 If there are still things -- like structured programs and 815 00:34:56,580 --> 00:34:58,260 unstructured programs, we talked about that. 816 00:34:58,260 --> 00:35:00,197 That argument went for a decade. 817 00:35:00,197 --> 00:35:04,810 AUDIENCE: The question I would pose is can you formulate a 818 00:35:04,810 --> 00:35:08,100 kitchen sink language at a parallelizable level of 819 00:35:08,100 --> 00:35:08,590 abstraction? 820 00:35:08,590 --> 00:35:09,080 PROFESSOR: Ah. 821 00:35:09,080 --> 00:35:12,572 That's interesting, because parallelization is -- one of 822 00:35:12,572 --> 00:35:14,390 the biggest things people have to figure out is 823 00:35:14,390 --> 00:35:16,120 composability. 824 00:35:16,120 --> 00:35:19,530 You can't have two parallel regions as a 825 00:35:19,530 --> 00:35:21,220 black box put together. 826 00:35:21,220 --> 00:35:23,640 You start running into deadlocks and all those other 827 00:35:23,640 --> 00:35:24,720 issues in there. 828 00:35:24,720 --> 00:35:27,760 Most of the things that you work is the abstraction works, 829 00:35:27,760 --> 00:35:30,850 because then you can compose at a higher level abstraction. 830 00:35:30,850 --> 00:35:32,770 You can have interface and say -- here are something 831 00:35:32,770 --> 00:35:34,880 interface, I don't know what's underneath, I compose at the 832 00:35:34,880 --> 00:35:35,630 interface level. 833 00:35:35,630 --> 00:35:38,960 And then the next guy composes at the higher level, and 834 00:35:38,960 --> 00:35:40,360 everything is hidden. 835 00:35:40,360 --> 00:35:42,750 We don't know how to do that in parallelism right now. 836 00:35:42,750 --> 00:35:47,170 We need to combine two things, it runs into problems. 
And the 837 00:35:47,170 --> 00:35:49,370 minute you figure that one out -- if somebody can figure out 838 00:35:49,370 --> 00:35:52,450 what's the right abstraction that is composable, parallel 839 00:35:52,450 --> 00:35:53,030 abstraction -- 840 00:35:53,030 --> 00:35:55,990 I think that will solve a huge amount of problems. 841 00:35:55,990 --> 00:35:58,831 AUDIENCE: Isn't it Fortress that attempted to do something 842 00:35:58,831 --> 00:36:01,549 that's parallelizable and the kitchen sink, but that then 843 00:36:01,549 --> 00:36:03,280 leaves all the parallelizable -- 844 00:36:03,280 --> 00:36:03,640 AUDIENCE: I'm saying how terrible [OBSCURED] 845 00:36:03,640 --> 00:36:06,020 programmers. 846 00:36:06,020 --> 00:36:08,040 PROFESSOR: But I would say right now is a very exciting 847 00:36:08,040 --> 00:36:11,240 time, because there's a big problem and 848 00:36:11,240 --> 00:36:12,920 nobody knows the solution. 849 00:36:12,920 --> 00:36:17,020 And I think for industry they lose a lot of sleep over that, 850 00:36:17,020 --> 00:36:19,200 but for academia it's the best time. 851 00:36:19,200 --> 00:36:20,930 Because we don't care, we don't have to make money out 852 00:36:20,930 --> 00:36:24,170 of these things, we don't have to get production out of it. 853 00:36:24,170 --> 00:36:26,470 But these actually have a very open problem that a lot of 854 00:36:26,470 --> 00:36:28,150 people care about. 855 00:36:28,150 --> 00:36:31,120 And I think this is fun partly because of that. 856 00:36:31,120 --> 00:36:35,040 I think this huge open problem that if you talk to people 857 00:36:35,040 --> 00:36:38,070 like Intels and Microsoft, a lot of people worry a lot 858 00:36:38,070 --> 00:36:40,475 about they don't know how to deal with the future in 5-10 859 00:36:40,475 --> 00:36:42,870 years time. 860 00:36:42,870 --> 00:36:44,990 They don't see this is scaling what they're doing. 
861 00:36:44,990 --> 00:36:48,650 And from Intel's point of view, they made money by 862 00:36:48,650 --> 00:36:53,280 making Moore's Law available for people to use. 863 00:36:53,280 --> 00:36:55,585 They know how to make it available, but they don't know 864 00:36:55,585 --> 00:36:57,330 how to make people use it. 865 00:36:57,330 --> 00:37:01,150 From Microsoft's point of view, their current 866 00:37:01,150 --> 00:37:05,340 development methodology is almost at its breaking point. 867 00:37:05,340 --> 00:37:09,290 And if you look at the last time this happened -- so 868 00:37:09,290 --> 00:37:12,760 things like Windows 3.0, where their then-current development 869 00:37:12,760 --> 00:37:14,670 methodology didn't really scale, and they really 870 00:37:14,670 --> 00:37:15,630 revamped it. 871 00:37:15,630 --> 00:37:18,010 They came up with all this process, and that has 872 00:37:18,010 --> 00:37:18,750 lasted until now. 873 00:37:18,750 --> 00:37:22,620 With the last two, Office and Vista, they just realized they 874 00:37:22,620 --> 00:37:24,040 can't really scale that up. 875 00:37:24,040 --> 00:37:26,300 So they are already in trouble, because they can't 876 00:37:26,300 --> 00:37:29,040 just write the next big thing, two times bigger than 877 00:37:29,040 --> 00:37:31,070 Vista, and hopefully get it working. 878 00:37:31,070 --> 00:37:34,120 But on top of that, they have this multicore thing 879 00:37:34,120 --> 00:37:37,180 thrown at them, and that really puts a huge amount of burden on them. 880 00:37:37,180 --> 00:37:38,480 So they are worried about that. 881 00:37:38,480 --> 00:37:41,840 So from both their points of view, everybody's clamoring 882 00:37:41,840 --> 00:37:43,090 for a solution.
883 00:37:45,240 --> 00:37:46,950 And things like last time around -- 884 00:37:46,950 --> 00:37:49,200 I'll talk about this in the future -- last time around 885 00:37:49,200 --> 00:37:51,240 when that happened, it created a huge number of 886 00:37:51,240 --> 00:37:54,720 opportunities, and a bunch of people who solved it kind of 887 00:37:54,720 --> 00:37:55,800 became famous. 888 00:37:55,800 --> 00:37:58,230 Because they said -- we came up with a solution, and 889 00:37:58,230 --> 00:37:59,890 people started using it and stuff like that. 890 00:37:59,890 --> 00:38:02,190 Right now, everybody's kind of waiting for somebody to come 891 00:38:02,190 --> 00:38:05,500 up and say here's the solution, here's a solution. 892 00:38:05,500 --> 00:38:07,590 And there are a lot of -- 893 00:38:07,590 --> 00:38:10,910 Fortress-type efforts are one, and what we are doing is one. 894 00:38:10,910 --> 00:38:13,380 And hopefully some of you will end up doing something 895 00:38:13,380 --> 00:38:15,690 interesting that might solve it. 896 00:38:15,690 --> 00:38:17,090 This is why it's fun. 897 00:38:17,090 --> 00:38:21,610 I think we haven't had this much of an interesting time in 898 00:38:21,610 --> 00:38:24,640 programming languages, parallelism, architecture in 899 00:38:24,640 --> 00:38:27,500 the last two decades. 900 00:38:27,500 --> 00:38:30,680 With that, I'll stop my talk.