1 00:00:00,120 --> 00:00:02,500 The following content is provided under a Creative 2 00:00:02,500 --> 00:00:03,910 Commons license. 3 00:00:03,910 --> 00:00:06,950 Your support will help MIT OpenCourseWare continue to 4 00:00:06,950 --> 00:00:10,600 offer high quality educational resources for free. 5 00:00:10,600 --> 00:00:13,500 To make a donation or view additional materials from 6 00:00:13,500 --> 00:00:17,430 hundreds of MIT courses, visit MIT OpenCourseWare at 7 00:00:17,430 --> 00:00:18,680 ocw.mit.edu. 8 00:00:23,660 --> 00:00:25,450 PROFESSOR: Let's get started. 9 00:00:25,450 --> 00:00:29,750 So today, I'm going to talk a little bit more about 10 00:00:29,750 --> 00:00:33,200 performance issues in parallelization. 11 00:00:33,200 --> 00:00:36,860 A little bit more out of the [INAUDIBLE] 12 00:00:36,860 --> 00:00:39,170 to what people are doing otherwise. 13 00:00:39,170 --> 00:00:44,030 So normally what we have done so far, is we looked at Cilk. 14 00:00:44,030 --> 00:00:46,100 It provides a very robust environment for 15 00:00:46,100 --> 00:00:46,930 parallelization. 16 00:00:46,930 --> 00:00:50,550 It hides many issues and eliminates many of the 17 00:00:50,550 --> 00:00:56,120 problems out there that you find in other areas of parallelization 18 00:00:56,120 --> 00:00:57,710 that you deal with. 19 00:00:57,710 --> 00:01:00,290 And in the last lectures, we looked at things like cache 20 00:01:00,290 --> 00:01:01,140 [UNINTELLIGIBLE] 21 00:01:01,140 --> 00:01:04,269 algorithms, algorithmic issues, like 22 00:01:04,269 --> 00:01:06,710 looking at work and span. 23 00:01:06,710 --> 00:01:08,790 And in fact, in your projects, you're going to use all these 24 00:01:08,790 --> 00:01:13,660 nice concepts to get you nice parallel running code. 
25 00:01:13,660 --> 00:01:20,640 But if you look at a lot of this code [UNINTELLIGIBLE] 26 00:01:20,640 --> 00:01:24,150 the way people normally parallelize code outside 27 00:01:24,150 --> 00:01:25,800 probably Cilk. 28 00:01:25,800 --> 00:01:30,270 And there are a lot of other issues that arise, things like 29 00:01:30,270 --> 00:01:33,160 synchronization issues and memory issues. 30 00:01:33,160 --> 00:01:36,380 Today, I think we are going to focus mostly on memory issues. 31 00:01:36,380 --> 00:01:39,450 And we are going to use OpenMP instead of [INAUDIBLE]. 32 00:01:39,450 --> 00:01:42,400 And most of these issues will affect 33 00:01:42,400 --> 00:01:43,710 Cilk sometimes, too. 34 00:01:43,710 --> 00:01:45,990 But Cilk tries to hide them from you. 35 00:01:45,990 --> 00:01:46,920 There's a layer of abstraction. 36 00:01:46,920 --> 00:01:49,940 And it's hard to kind of get to those issues in there. 37 00:01:49,940 --> 00:01:53,200 So we are going to look at this thing called OpenMP. 38 00:01:53,200 --> 00:01:55,430 So today, we are going to address things like 39 00:01:55,430 --> 00:01:58,820 granularity of parallelism. 40 00:01:58,820 --> 00:02:00,140 There are so many things that just went out 41 00:02:00,140 --> 00:02:01,820 on the page, I guess. 42 00:02:01,820 --> 00:02:05,060 True sharing, false sharing, load balancing issues, the 43 00:02:05,060 --> 00:02:06,170 [UNINTELLIGIBLE]. 44 00:02:06,170 --> 00:02:10,570 So from the license and keep talking about that we want to 45 00:02:10,570 --> 00:02:13,550 be out of all this, not dealing with Voodoo parameters. 46 00:02:13,550 --> 00:02:16,350 Today, we actually are dealing mainly with Voodoo. 47 00:02:16,350 --> 00:02:20,460 So I guess this should be basically 48 00:02:20,460 --> 00:02:21,670 the Halloween lecture. 49 00:02:21,670 --> 00:02:24,703 So we are all about Voodoo today and see how we can deal 50 00:02:24,703 --> 00:02:27,760 with Voodoo issues. 
51 00:02:27,760 --> 00:02:32,690 So if you look at a Cilk program, here is a nice simple 52 00:02:32,690 --> 00:02:34,740 matrix multiply, seem to be [INAUDIBLE] 53 00:02:34,740 --> 00:02:36,320 example these days. 54 00:02:36,320 --> 00:02:39,060 What you can do is you can put a cilk_for on these two 55 00:02:39,060 --> 00:02:42,070 loops and get a nice parallel performance. 56 00:02:42,070 --> 00:02:44,870 However, [UNINTELLIGIBLE] 57 00:02:44,870 --> 00:02:47,270 from where how the memory is arranged is 58 00:02:47,270 --> 00:02:48,820 up to the Cilk scheduler. 59 00:02:48,820 --> 00:02:50,980 Cilk scheduler is doing some work stealing. 60 00:02:50,980 --> 00:02:54,420 Depending on how the work gets distributed, the processors will 61 00:02:54,420 --> 00:02:56,090 get work, it will happen. 62 00:02:56,090 --> 00:02:58,370 Hopefully, everything will go nicely. 63 00:02:58,370 --> 00:03:02,270 And so what that means is it hides the distribution and load 64 00:03:02,270 --> 00:03:04,060 balancing issues. 65 00:03:04,060 --> 00:03:07,640 So it's nice if you have access to Cilk, but many other 66 00:03:07,640 --> 00:03:09,050 [UNINTELLIGIBLE] you might not be. 67 00:03:09,050 --> 00:03:12,320 And even within Cilk, some of these issues might show up. 68 00:03:12,320 --> 00:03:15,390 So what we are going to do is step one 69 00:03:15,390 --> 00:03:17,820 below the Cilk scheduler. 70 00:03:17,820 --> 00:03:22,470 So there's this system called OpenMP. 71 00:03:22,470 --> 00:03:25,640 It's a more simplified model of parallelism. 72 00:03:25,640 --> 00:03:29,665 So what it tries to do is instead of giving this very 73 00:03:29,665 --> 00:03:30,460 [UNINTELLIGIBLE] 74 00:03:30,460 --> 00:03:33,930 system, it basically gives you direct access to the 75 00:03:33,930 --> 00:03:35,110 processors. 76 00:03:35,110 --> 00:03:38,280 So what that means is there's normally what we call a 77 00:03:38,280 --> 00:03:39,210 fork-join model. 
78 00:03:39,210 --> 00:03:39,520 [UNINTELLIGIBLE] 79 00:03:39,520 --> 00:03:43,580 we have with Cilk, basically. 80 00:03:43,580 --> 00:03:46,490 We can do fork into different workers and join. 81 00:03:46,490 --> 00:03:49,750 And more or less, you can actually bind these workers to 82 00:03:49,750 --> 00:03:50,420 [UNINTELLIGIBLE] 83 00:03:50,420 --> 00:03:53,510 sometimes or make sure that the number-- 84 00:03:53,510 --> 00:03:56,800 I'll give you some techniques how to do that as I go on. 85 00:03:56,800 --> 00:04:01,540 So for parallel loops, you can do data parallelism, different 86 00:04:01,540 --> 00:04:02,340 [UNINTELLIGIBLE] 87 00:04:02,340 --> 00:04:04,580 parallelism you can do something like fork-join. 88 00:04:04,580 --> 00:04:07,620 And you can specify a bunch of static or 89 00:04:07,620 --> 00:04:11,220 dynamic scheduling policies. 90 00:04:11,220 --> 00:04:16,790 So for example in OpenMP, you can say for this loop, add 91 00:04:16,790 --> 00:04:17,959 a pragma to [UNINTELLIGIBLE] 92 00:04:17,959 --> 00:04:21,029 in front of this loop and say this is OpenMP. 93 00:04:21,029 --> 00:04:22,450 Parallel loop in here. 94 00:04:22,450 --> 00:04:24,310 Parallel for loop in this one. 95 00:04:24,310 --> 00:04:26,990 And schedule it using static chunk. 96 00:04:26,990 --> 00:04:29,610 I will tell you what exactly that means. 97 00:04:29,610 --> 00:04:34,680 And that gives you direct access to how each of these 98 00:04:34,680 --> 00:04:36,730 parts will be run. 99 00:04:36,730 --> 00:04:39,320 So let me get a little bit in detail. 100 00:04:39,320 --> 00:04:41,730 So assume you have [UNINTELLIGIBLE] cores in 101 00:04:41,730 --> 00:04:42,950 there, [UNINTELLIGIBLE] processors. 102 00:04:42,950 --> 00:04:46,270 So now in OpenMP, you are basically opening the entire 103 00:04:46,270 --> 00:04:46,990 world underneath. 104 00:04:46,990 --> 00:04:49,840 And you have to kind of see what's going on. 
105 00:04:49,840 --> 00:04:55,330 And if you say, schedule a static chunk of four, assume 106 00:04:55,330 --> 00:04:58,050 you have 16 iterations. 107 00:04:58,050 --> 00:05:00,200 Here are my 16 iterations. 108 00:05:00,200 --> 00:05:02,990 So each of these dots represents a value for i. 109 00:05:02,990 --> 00:05:07,890 So what it says is you take chunks of four and basically 110 00:05:07,890 --> 00:05:08,950 send them across. 111 00:05:08,950 --> 00:05:11,190 So what happens is the first four iterations will go to 112 00:05:11,190 --> 00:05:13,240 [UNINTELLIGIBLE] or core zero. 113 00:05:13,240 --> 00:05:15,650 Next four will go to core one, core two and core three. 114 00:05:15,650 --> 00:05:18,870 So you know exactly which iterations run where. 115 00:05:18,870 --> 00:05:20,660 It's a very static thing. 116 00:05:20,660 --> 00:05:23,100 You have full control of what's going on. 117 00:05:23,100 --> 00:05:26,980 Whereas in Cilk, it's up to the scheduler. 118 00:05:26,980 --> 00:05:29,090 So the nice thing here is you can have full control. 119 00:05:29,090 --> 00:05:30,960 But you get enough room to harm yourself if 120 00:05:30,960 --> 00:05:32,260 you do things wrong. 121 00:05:32,260 --> 00:05:36,510 So this is a double-edged sword in that sense. 122 00:05:36,510 --> 00:05:39,990 So instead of doing static four you do static two. 123 00:05:39,990 --> 00:05:43,010 You're assigning chunks of size two. 124 00:05:43,010 --> 00:05:46,125 What it will do is it will assign chunks of size two to 125 00:05:46,125 --> 00:05:48,450 the four cores. 126 00:05:48,450 --> 00:05:49,770 And then you're not done yet. 127 00:05:49,770 --> 00:05:52,620 And then you start with again core zero and assign 128 00:05:52,620 --> 00:05:53,980 chunks of size two. 129 00:05:53,980 --> 00:05:57,880 This is called a block-cyclic schedule. 130 00:05:57,880 --> 00:05:59,880 And if you do a chunk of size one, it's 131 00:05:59,880 --> 00:06:00,890 called a cyclic schedule. 
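The static, block-cyclic, and cyclic schedules described here all follow one rule: chunks are dealt out to the cores round-robin. A minimal sketch of that rule in C follows. The function names `static_owner` and `print_static_schedule` are made up for illustration and are not part of OpenMP; the sketch only models which core an iteration lands on under `schedule(static, chunk)`.

```c
#include <assert.h>
#include <stdio.h>

/* Which core runs iteration i under schedule(static, chunk)?
 * Chunk number k = i / chunk is handed to core k % ncores, round-robin.
 * Hypothetical helper for illustration; not an OpenMP API call. */
int static_owner(int i, int chunk, int ncores)
{
    assert(i >= 0 && chunk > 0 && ncores > 0);
    return (i / chunk) % ncores;
}

/* Print the owner of each of n iterations, e.g. the 16-iteration example. */
void print_static_schedule(int n, int chunk, int ncores)
{
    for (int i = 0; i < n; i++)
        printf("iteration %2d -> core %d\n", i, static_owner(i, chunk, ncores));
}
```

Calling `print_static_schedule(16, 4, 4)` lists the four-iterations-per-core layout, a chunk of 2 gives the block-cyclic layout, and a chunk of 1 gives the cyclic one.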
132 00:06:00,890 --> 00:06:08,190 [UNINTELLIGIBLE] cycles just assigning iterations to cores. 133 00:06:08,190 --> 00:06:08,830 OK. 134 00:06:08,830 --> 00:06:11,700 So far so good? 135 00:06:11,700 --> 00:06:13,780 OK. 136 00:06:13,780 --> 00:06:15,780 So I want to do something. 137 00:06:15,780 --> 00:06:19,740 So I have this program. 138 00:06:19,740 --> 00:06:21,790 This is again your 139 00:06:21,790 --> 00:06:24,350 run-of-the-mill matrix multiply. 140 00:06:24,350 --> 00:06:30,070 And I ran it sequentially on a single machine, and I got this 141 00:06:30,070 --> 00:06:31,860 performance. 142 00:06:31,860 --> 00:06:34,750 Then, I said look, I want to parallelize the outer loop. 143 00:06:34,750 --> 00:06:37,640 So I parallelize this loop. 144 00:06:37,640 --> 00:06:40,710 What should I get? 145 00:06:40,710 --> 00:06:41,790 [UNINTELLIGIBLE] fast or slow. 146 00:06:41,790 --> 00:06:46,970 I want to just check whether you are awake or asleep. 147 00:06:46,970 --> 00:06:47,890 How do [UNINTELLIGIBLE PHRASE] 148 00:06:47,890 --> 00:06:49,140 to run slower. 149 00:06:53,030 --> 00:06:54,370 It's not a trick question. 150 00:06:54,370 --> 00:06:57,460 This is just to make sure that you actually participate. 151 00:06:57,460 --> 00:06:59,230 How many people think it runs faster? 152 00:06:59,230 --> 00:07:00,950 AUDIENCE: [INAUDIBLE] 153 00:07:00,950 --> 00:07:01,100 PROFESSOR: OK. 154 00:07:01,100 --> 00:07:02,390 Good. 155 00:07:02,390 --> 00:07:05,234 What do others think? 156 00:07:05,234 --> 00:07:06,070 OK. 157 00:07:06,070 --> 00:07:08,980 They're probably checking their email or something. 158 00:07:08,980 --> 00:07:10,550 OK. 159 00:07:10,550 --> 00:07:14,340 So actually it ran faster. 160 00:07:14,340 --> 00:07:18,910 This was not run on the current cloud machines. 161 00:07:18,910 --> 00:07:20,990 This was a previous generation that I ran. 162 00:07:20,990 --> 00:07:23,390 So [UNINTELLIGIBLE] was seven times faster. 163 00:07:23,390 --> 00:07:24,290 So this was great. 
164 00:07:24,290 --> 00:07:25,670 I parallelized the outer loop. 165 00:07:25,670 --> 00:07:28,390 What happens if I parallelize the inner loop? 166 00:07:28,390 --> 00:07:31,570 So this test, this i loop, runs parallel. 167 00:07:31,570 --> 00:07:34,060 Here, I launch the [UNINTELLIGIBLE] parallel. 168 00:07:34,060 --> 00:07:38,138 How many people think this runs faster? 169 00:07:38,138 --> 00:07:39,970 AUDIENCE: [INAUDIBLE] 170 00:07:39,970 --> 00:07:42,800 PROFESSOR: Compared to this one. 171 00:07:42,800 --> 00:07:45,290 How many people think this runs slower? 172 00:07:45,290 --> 00:07:46,530 OK. 173 00:07:46,530 --> 00:07:50,530 There's some consistent answers here. 174 00:07:50,530 --> 00:07:52,956 Why do you think it would run slower? 175 00:07:56,130 --> 00:07:57,720 So OK. 176 00:07:57,720 --> 00:08:00,180 It ran slower, so it can improve that. 177 00:08:00,180 --> 00:08:02,270 And that's a little bit slow. 178 00:08:02,270 --> 00:08:05,774 Why is it slow? 179 00:08:05,774 --> 00:08:08,250 AUDIENCE: [INAUDIBLE] 180 00:08:08,250 --> 00:08:09,050 PROFESSOR: Exactly. 181 00:08:09,050 --> 00:08:12,520 So what it's doing here, it's basically spawning many, many 182 00:08:12,520 --> 00:08:13,200 times in here. 183 00:08:13,200 --> 00:08:17,530 Because every time you have parallelism, you chunkify into 184 00:08:17,530 --> 00:08:18,500 the processors. 185 00:08:18,500 --> 00:08:21,450 Here you are getting a lot of smaller chunks inside. 186 00:08:21,450 --> 00:08:24,700 So let's look at how this is basically run. 187 00:08:24,700 --> 00:08:31,075 So normally, you can think about an OpenMP program as you 188 00:08:31,075 --> 00:08:32,549 have one sequential thread. 189 00:08:32,549 --> 00:08:34,000 You run the main program. 190 00:08:34,000 --> 00:08:37,760 And then assume you have n cores, you might have n minus 191 00:08:37,760 --> 00:08:42,100 1 other threads just waiting for work. 
192 00:08:42,100 --> 00:08:45,270 And then, when you finally come to the parallel 193 00:08:45,270 --> 00:08:48,180 loop, it says, OK. 194 00:08:48,180 --> 00:08:48,730 Set up. 195 00:08:48,730 --> 00:08:52,080 What do you want to run on, basically, the other cores. 196 00:08:52,080 --> 00:08:53,510 And release it. 197 00:08:53,510 --> 00:08:55,360 Release these waiting people. 198 00:08:55,360 --> 00:08:57,020 And let them start working on the parallel work. 199 00:08:57,020 --> 00:08:59,650 And also, I will start working on my own chunk. 200 00:08:59,650 --> 00:09:03,720 So suddenly, when you say parallel for, it releases all 201 00:09:03,720 --> 00:09:07,310 other cores to go run that part of the code. 202 00:09:07,310 --> 00:09:11,740 And once it's done, it will say, OK, I'm done. 203 00:09:11,740 --> 00:09:13,140 I have to wait until everybody is done. 204 00:09:13,140 --> 00:09:16,570 So even if the main guy is done, it has to wait until 205 00:09:16,570 --> 00:09:18,200 everybody is finished. 206 00:09:18,200 --> 00:09:22,610 And then, start executing the sequence [UNINTELLIGIBLE]. 207 00:09:22,610 --> 00:09:27,670 So this is the gist of how an OpenMP program is run. 208 00:09:27,670 --> 00:09:30,840 And you realize that there are overheads here because you have to 209 00:09:30,840 --> 00:09:33,150 basically make sure all these cases are broken up. 210 00:09:33,150 --> 00:09:35,740 So there's some things that have to be issued. 211 00:09:35,740 --> 00:09:38,590 And there's a delay before these guys can start, even if 212 00:09:38,590 --> 00:09:39,990 everybody has equal work. 213 00:09:39,990 --> 00:09:42,200 They might not finish at the same time because it may take some time 214 00:09:42,200 --> 00:09:43,430 for this to start. 215 00:09:43,430 --> 00:09:46,490 And then, it has to also tell this back OK, I am done. 216 00:09:46,490 --> 00:09:48,320 So there's a lot of synchronization going on. 217 00:09:48,320 --> 00:09:49,390 Locks and unlocks. 
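The fork, work, and wait-for-everybody pattern described here can be put into a back-of-the-envelope cost model. This is only a toy sketch under stated assumptions: each parallel region is assumed to pay one fixed fork/join cost and to split its work evenly across the cores. The function `modeled_time` and all of its parameters are hypothetical, and the numbers it produces are not measurements.

```c
#include <assert.h>

/* Toy model, not a measurement: every parallel region pays a fixed
 * fork/join (wake-up plus barrier) cost and then its work is split
 * evenly across p cores. All names and units here are assumptions. */
double modeled_time(long regions, double work_per_region, int p,
                    double fork_join_cost)
{
    assert(regions >= 0 && p > 0);
    return (double)regions * (fork_join_cost + work_per_region / p);
}
```

With the same total work of 1000 units on 4 cores and a fork/join cost of 10, one big region models to 260 while 100 small regions model to 1250, which is the sense in which synchronization can come to dominate when the work per region is small.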
218 00:09:49,390 --> 00:09:51,020 Here it's called barrier synchronization. 219 00:09:54,085 --> 00:09:58,400 And so if this work is small, this synchronization starts 220 00:09:58,400 --> 00:10:00,070 dominating. 221 00:10:00,070 --> 00:10:03,530 So what happens is [UNINTELLIGIBLE] 222 00:10:03,530 --> 00:10:05,490 fine grain parallelism. 223 00:10:05,490 --> 00:10:08,090 Do a little work in the parallel region, and 224 00:10:08,090 --> 00:10:10,880 synchronization will basically start dominating your time. 225 00:10:10,880 --> 00:10:13,310 So how do you detect this? 226 00:10:13,310 --> 00:10:15,880 And sometimes when you run something parallel, it might 227 00:10:15,880 --> 00:10:19,150 even run slower because the amount of stuff in the 228 00:10:19,150 --> 00:10:22,420 parallel region is so small, [UNINTELLIGIBLE] will start 229 00:10:22,420 --> 00:10:22,760 dominating. 230 00:10:22,760 --> 00:10:24,010 And that's not a good way. 231 00:10:24,010 --> 00:10:26,700 And also, sometimes you assume. 232 00:10:26,700 --> 00:10:29,240 And you keep increasing the number of cores. 233 00:10:29,240 --> 00:10:32,750 Hopefully, you want to see a nice parallelism increase, but 234 00:10:32,750 --> 00:10:35,090 it doesn't, even though you have enough information. 235 00:10:35,090 --> 00:10:37,880 But that means you're running a lot of small chunks, even 236 00:10:37,880 --> 00:10:39,390 though you seem to have a lot of parallelism available. 237 00:10:42,730 --> 00:10:46,160 And also, you can measure the synchronization time and the 238 00:10:46,160 --> 00:10:47,210 time in the parallel region. 239 00:10:47,210 --> 00:10:48,720 If the parallel regions run for a very 240 00:10:48,720 --> 00:10:50,670 short time, this happens. 241 00:10:50,670 --> 00:10:54,780 We saw this effect when we were doing Cilk. 242 00:10:54,780 --> 00:10:56,030 Remember? 243 00:10:56,030 --> 00:10:59,450 When did we see this granularity affecting Cilk? 244 00:10:59,450 --> 00:11:00,750 And what did we do? 
245 00:11:03,900 --> 00:11:04,910 When you write Cilk programs. 246 00:11:04,910 --> 00:11:05,660 You write [UNINTELLIGIBLE] 247 00:11:05,660 --> 00:11:06,950 programs. 248 00:11:06,950 --> 00:11:12,500 Where did the granularity start showing up on us? 249 00:11:12,500 --> 00:11:15,160 It may not be exactly this because the scheduling is 250 00:11:15,160 --> 00:11:15,430 complicated. 251 00:11:15,430 --> 00:11:15,890 OK. 252 00:11:15,890 --> 00:11:16,380 Yes? 253 00:11:16,380 --> 00:11:19,280 AUDIENCE: The two by two matrix [INAUDIBLE] 254 00:11:19,280 --> 00:11:19,720 PROFESSOR: Yeah. 255 00:11:19,720 --> 00:11:20,420 Something like two by-- 256 00:11:20,420 --> 00:11:22,800 for example, that's the reason we wanted to have 257 00:11:22,800 --> 00:11:24,580 a large base case. 258 00:11:24,580 --> 00:11:26,900 Because if you didn't put a large base case, it keeps 259 00:11:26,900 --> 00:11:30,110 dividing into smaller and smaller problems. 260 00:11:30,110 --> 00:11:32,580 And if the scheduler is smart, it won't be 261 00:11:32,580 --> 00:11:33,700 doing exactly this. 262 00:11:33,700 --> 00:11:37,350 But it's always good to have these large-granularity chunks 263 00:11:37,350 --> 00:11:38,620 at the bottom. 264 00:11:41,870 --> 00:11:43,890 So how to get [UNINTELLIGIBLE] 265 00:11:43,890 --> 00:11:45,560 granulated parallelism. 266 00:11:45,560 --> 00:11:48,080 What we need to do is reduce the number of [UNINTELLIGIBLE] 267 00:11:48,080 --> 00:11:48,770 equations. 268 00:11:48,770 --> 00:11:51,760 So you want to always try to look for the outermost loop 269 00:11:51,760 --> 00:11:55,080 you can get, the really large independent regions. 270 00:11:55,080 --> 00:11:57,040 So you go look, and not [UNINTELLIGIBLE] 271 00:11:57,040 --> 00:11:57,930 thing you want to parallelize. 272 00:11:57,930 --> 00:12:01,165 You go up, up, up, up until the point you can parallelize. 273 00:12:01,165 --> 00:12:05,490 And that's the best way to get good performance. 
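The advice to push the parallel loop as far out as possible can be sketched on the matrix multiply used as the running example. The code below is an illustration under assumptions, not the code from the slides: `matmul_outer` and `parallel_regions_entered` are hypothetical names, the matrices are assumed square and stored row-major, and the pragma only takes effect when the file is compiled with OpenMP enabled (for example `-fopenmp` with GCC); otherwise the loop simply runs sequentially.

```c
#include <assert.h>

/* C = A * B for n x n row-major matrices.
 * The pragma sits on the outermost loop, so the team of threads is
 * forked and joined once for the whole multiply. */
void matmul_outer(int n, const double *A, const double *B, double *C)
{
    assert(n > 0 && A && B && C);
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;          /* private: declared inside the loop */
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
    }
}

/* How many times a parallel region is entered if the pragma is placed
 * on the loop at the given depth (0 = outermost, 2 = innermost). */
long parallel_regions_entered(int n, int depth)
{
    assert(n > 0 && depth >= 0 && depth <= 2);
    long regions = 1;
    for (int d = 0; d < depth; d++)
        regions *= n;                  /* one region per enclosing iteration */
    return regions;
}
```

For n = 1000, moving the pragma from the outermost to the innermost loop changes the number of fork/join episodes from 1 to 1,000,000 while the total arithmetic stays the same.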
274 00:12:05,490 --> 00:12:06,740 OK? 275 00:12:09,580 --> 00:12:13,850 So if you really compare these three programs here, again, 276 00:12:13,850 --> 00:12:15,210 what you see-- 277 00:12:15,210 --> 00:12:16,720 of course, this has no synchronization. 278 00:12:16,720 --> 00:12:19,400 This has n amount of synchronizations. 279 00:12:19,400 --> 00:12:20,870 Here in [UNINTELLIGIBLE] synchronization, that's 280 00:12:20,870 --> 00:12:23,210 obviously a lot more synchronization going on. 281 00:12:23,210 --> 00:12:27,190 And that is where this [UNINTELLIGIBLE] comes from. 282 00:12:27,190 --> 00:12:27,620 OK. 283 00:12:27,620 --> 00:12:30,260 So now, I am switching a little bit in here. 284 00:12:30,260 --> 00:12:33,450 I want you guys to look at this program a little bit. 285 00:12:33,450 --> 00:12:35,000 So what am I doing here? 286 00:12:35,000 --> 00:12:37,580 I have two [UNINTELLIGIBLE]. 287 00:12:37,580 --> 00:12:45,110 And I am just basically adding matrix B to matrix A. OK? 288 00:12:45,110 --> 00:12:48,720 And then I have another loop nest here, adding matrix C to 289 00:12:48,720 --> 00:12:52,120 matrix A. And what am I doing in here? 290 00:12:52,120 --> 00:12:55,740 I am basically going through matrix A in another 291 00:12:55,740 --> 00:12:59,020 direction in here. 292 00:12:59,020 --> 00:13:00,270 AUDIENCE: [INAUDIBLE] 293 00:13:02,310 --> 00:13:03,700 PROFESSOR: It's not really a transpose. 294 00:13:03,700 --> 00:13:04,590 I'm not transposing. 295 00:13:04,590 --> 00:13:08,830 What I'm doing is I'm actually doing a mirror because the C 296 00:13:08,830 --> 00:13:10,210 gets mirrored on ix. 297 00:13:10,210 --> 00:13:11,030 It's because [UNINTELLIGIBLE] 298 00:13:11,030 --> 00:13:11,590 ix [UNINTELLIGIBLE] 299 00:13:11,590 --> 00:13:12,425 the other direction. 300 00:13:12,425 --> 00:13:12,740 OK? 301 00:13:12,740 --> 00:13:14,400 So it's not really a transpose. 302 00:13:14,400 --> 00:13:16,960 So I do a mirror addition. 
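The slide itself is not part of the transcript, so the following is only a reconstruction of the two loop nests as they are described: add B to A, then add C to A while walking A's rows in the opposite direction. The exact indexing (`n - 1 - i` on the rows of A), the row-major layout, and the function name `add_then_mirror_add` are assumptions. Both outer loops carry an OpenMP pragma, matching the way the lecture goes on to parallelize the two outermost loops; without OpenMP enabled the pragmas are ignored and the result is the same.

```c
#include <assert.h>

/* First nest:  A[i][j]       += B[i][j]
 * Second nest: A[n-1-i][j]   += C[i][j]   (rows of A visited bottom-up)
 * Reconstruction of the described program; the slide's exact indexing
 * is an assumption. Matrices are n x n, row-major. */
void add_then_mirror_add(int n, double *A, const double *B, const double *C)
{
    assert(n > 0 && A && B && C);

    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            A[i * n + j] += B[i * n + j];

    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            A[(n - 1 - i) * n + j] += C[i * n + j];
}
```

Each row of A is written once per nest, so the result does not depend on the thread count; what changes between the two nests is which iteration, and therefore which core, touches a given row.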
303 00:13:16,960 --> 00:13:19,010 And then I'm asking for the two 304 00:13:19,010 --> 00:13:22,130 outermost loops to be parallel. 305 00:13:22,130 --> 00:13:25,440 So if you run this sequentially-- 306 00:13:25,440 --> 00:13:29,870 OK, you get about 30 milliseconds, I 307 00:13:29,870 --> 00:13:32,120 guess, to run in here. 308 00:13:32,120 --> 00:13:33,700 So that is in [UNINTELLIGIBLE]. 309 00:13:33,700 --> 00:13:35,580 But if you're running in parallel, what do you get? 310 00:13:35,580 --> 00:13:36,840 Should you get faster or slower? 311 00:13:40,960 --> 00:13:42,310 OK. 312 00:13:42,310 --> 00:13:45,200 Anyone want to take a guess [UNINTELLIGIBLE]? 313 00:13:45,200 --> 00:13:46,995 Sometimes some of these questions, you might not have 314 00:13:46,995 --> 00:13:48,040 enough information to answer. 315 00:13:48,040 --> 00:13:50,750 But it's still good to just take a stand on one direction 316 00:13:50,750 --> 00:13:51,450 or another. 317 00:13:51,450 --> 00:13:54,120 How many people think it runs faster? 318 00:13:54,120 --> 00:13:56,350 How many people think it runs slower? 319 00:13:56,350 --> 00:13:57,750 OK. 320 00:13:57,750 --> 00:13:59,550 Some. 321 00:13:59,550 --> 00:14:01,940 Oops. 322 00:14:01,940 --> 00:14:03,190 What happened? 323 00:14:07,310 --> 00:14:09,140 What happened in here? 324 00:14:19,750 --> 00:14:25,150 Can anybody point out why it might be running slower 325 00:14:25,150 --> 00:14:27,880 in parallel than running sequentially? 326 00:14:31,275 --> 00:14:33,220 AUDIENCE: [INAUDIBLE] 327 00:14:33,220 --> 00:14:33,550 PROFESSOR: Yeah. 328 00:14:33,550 --> 00:14:34,760 There's a cache issue. 329 00:14:34,760 --> 00:14:38,611 What's the possible cache issue in here? 330 00:14:38,611 --> 00:14:41,600 AUDIENCE: [INAUDIBLE] 331 00:14:41,600 --> 00:14:42,850 PROFESSOR: Yeah. 
332 00:14:44,610 --> 00:14:50,610 If you think about it, the first iterations of, I guess, the 333 00:14:50,610 --> 00:14:51,080 first core-- 334 00:14:51,080 --> 00:14:52,230 I have some diagram. 335 00:14:52,230 --> 00:14:53,740 I'll show it to you in there. 336 00:14:53,740 --> 00:14:56,850 And here, only the last data elements we'll get for the 337 00:14:56,850 --> 00:14:58,760 first iterations because we are going 338 00:14:58,760 --> 00:14:59,970 in the other direction. 339 00:14:59,970 --> 00:15:03,500 So if you look at it a little more deeply into 340 00:15:03,500 --> 00:15:04,980 what's going on. 341 00:15:04,980 --> 00:15:08,710 Number of instructions seems to be a little higher. 342 00:15:08,710 --> 00:15:11,310 This one I couldn't actually explain why this might be the 343 00:15:11,310 --> 00:15:13,270 case in here. 344 00:15:13,270 --> 00:15:15,530 If anybody has an idea, you can say that. 345 00:15:15,530 --> 00:15:18,620 But this was kind of [UNINTELLIGIBLE]. 346 00:15:18,620 --> 00:15:21,500 This might be [UNINTELLIGIBLE] the cycles. 347 00:15:21,500 --> 00:15:22,750 Huh. 348 00:15:27,830 --> 00:15:28,600 OK. 349 00:15:28,600 --> 00:15:30,610 I can explain this. 350 00:15:30,610 --> 00:15:34,730 Because this is a sequential run, this is a sum total of a 351 00:15:34,730 --> 00:15:36,380 parallel run. 352 00:15:36,380 --> 00:15:40,560 So because of all the overhead that happens because this was 353 00:15:40,560 --> 00:15:43,200 running on, I think, an eight core machine. 354 00:15:43,200 --> 00:15:45,630 So you're running eight times of small computations. 355 00:15:45,630 --> 00:15:48,420 There's a lot of overhead that goes around, synchronization, 356 00:15:48,420 --> 00:15:49,320 and stuff like that. 357 00:15:49,320 --> 00:15:52,050 So the number of instructions just blows up. 358 00:15:52,050 --> 00:15:55,998 But for each core, you don't have this blow up. 359 00:15:55,998 --> 00:15:57,248 AUDIENCE: [INAUDIBLE] 360 00:15:59,484 --> 00:15:59,982 Cilk? 
361 00:15:59,982 --> 00:16:03,468 Because does Cilk have different processor affinity, 362 00:16:03,468 --> 00:16:06,456 things that open [UNINTELLIGIBLE]? 363 00:16:06,456 --> 00:16:11,960 Because it seems like if the program, the language-- 364 00:16:11,960 --> 00:16:13,786 PROFESSOR: [INAUDIBLE]. 365 00:16:13,786 --> 00:16:16,610 Let's see if we can process the affinity 366 00:16:16,610 --> 00:16:18,670 information or if not. 367 00:16:18,670 --> 00:16:21,330 It's just pure [UNINTELLIGIBLE]. 368 00:16:21,330 --> 00:16:23,780 AUDIENCE: [INAUDIBLE] 369 00:16:23,780 --> 00:16:24,600 PROFESSOR: Yeah. 370 00:16:24,600 --> 00:16:28,430 I mean if you like executed locally if you have good cache 371 00:16:28,430 --> 00:16:28,900 [UNINTELLIGIBLE] 372 00:16:28,900 --> 00:16:29,210 with them. 373 00:16:29,210 --> 00:16:31,060 But if there's no cache [UNINTELLIGIBLE] 374 00:16:31,060 --> 00:16:32,670 you might steal something where data might 375 00:16:32,670 --> 00:16:33,887 be somewhere else. 376 00:16:33,887 --> 00:16:37,296 AUDIENCE: But you'll still mimic the cache behavior, 377 00:16:37,296 --> 00:16:42,166 considerably, except for when you steal. 378 00:16:42,166 --> 00:16:44,601 [INAUDIBLE] 379 00:16:44,601 --> 00:16:45,088 PROFESSOR: Yeah. 380 00:16:45,088 --> 00:16:46,070 So OK. 381 00:16:46,070 --> 00:16:49,450 We don't have a mic in here. 382 00:16:49,450 --> 00:16:49,640 OK. 383 00:16:49,640 --> 00:16:50,570 There's a mic. 384 00:16:50,570 --> 00:16:51,820 There we go. 385 00:16:54,380 --> 00:16:57,110 But if you have two different of these regions, the way the 386 00:16:57,110 --> 00:17:01,985 parallelization happens can be different. 387 00:17:01,985 --> 00:17:06,440 AUDIENCE: In Cilk, what happens is the code is 388 00:17:06,440 --> 00:17:10,569 mimicking, for the most part, exactly what the C or C++ code 389 00:17:10,569 --> 00:17:11,930 would be doing. 390 00:17:11,930 --> 00:17:15,339 And so you get exactly the same cache hits and misses. 
391 00:17:15,339 --> 00:17:19,690 Except when you steal, it's like starting over with an 392 00:17:19,690 --> 00:17:21,550 empty cache. 393 00:17:21,550 --> 00:17:21,770 OK? 394 00:17:21,770 --> 00:17:24,589 But as long as you have sufficient parallelism, the 395 00:17:24,589 --> 00:17:29,080 steals don't occur very often. 396 00:17:29,080 --> 00:17:31,520 And so therefore, you end up getting the same kind of 397 00:17:31,520 --> 00:17:33,740 behavior that you would get out of the serial code. 398 00:17:33,740 --> 00:17:34,120 PROFESSOR: Yeah. 399 00:17:34,120 --> 00:17:37,250 But Charles, in this one, because you had to steal 400 00:17:37,250 --> 00:17:41,240 everything here, between here and here, the parallelism 401 00:17:41,240 --> 00:17:41,830 would be different. 402 00:17:41,830 --> 00:17:43,220 AUDIENCE: There would be no affinity between those two. 403 00:17:43,220 --> 00:17:44,380 PROFESSOR: No [? affinity ?] will be there. 404 00:17:44,380 --> 00:17:45,300 Yeah, exactly. 405 00:17:45,300 --> 00:17:48,180 So in the sequential one, everything that fits in the 406 00:17:48,180 --> 00:17:50,720 cache, so that would be affinity because we are not 407 00:17:50,720 --> 00:17:51,790 doing parallelism. 408 00:17:51,790 --> 00:17:53,510 And that's what I think happened here. 409 00:17:53,510 --> 00:17:54,990 Because you had no [? affinity-- ?] 410 00:17:54,990 --> 00:17:57,960 AUDIENCE: No in a serial code, there's no [? affinity. ?] 411 00:17:57,960 --> 00:17:58,070 PROFESSOR: No. 412 00:17:58,070 --> 00:18:00,920 Serial code-- if this fits in the cache, it's then running 413 00:18:00,920 --> 00:18:03,090 on one core. 414 00:18:03,090 --> 00:18:07,900 So if it fits in the one core's cache, you're happy. 
415 00:18:07,900 --> 00:18:10,770 AUDIENCE: So the issue is-- right-- is if you only access 416 00:18:10,770 --> 00:18:14,860 it once, by the time you fill up the cache-- 417 00:18:14,860 --> 00:18:16,600 It takes some time to fill up the cache to get them 418 00:18:16,600 --> 00:18:17,280 synchronized. 419 00:18:17,280 --> 00:18:19,050 PROFESSOR: So it fits in the one core's cache, it's OK. 420 00:18:19,050 --> 00:18:21,210 Otherwise, it has no affinity in here. 421 00:18:21,210 --> 00:18:26,290 So the key difference in here is, of course, CPI is slow. 422 00:18:26,290 --> 00:18:27,530 We don't know exactly why. 423 00:18:27,530 --> 00:18:28,910 But in [UNINTELLIGIBLE]. 424 00:18:28,910 --> 00:18:31,480 So what you find is that there's a huge amount of cache 425 00:18:31,480 --> 00:18:32,150 in [UNINTELLIGIBLE] 426 00:18:32,150 --> 00:18:33,660 going on. 427 00:18:33,660 --> 00:18:35,660 So that should give you a feeling of what's going on. 428 00:18:35,660 --> 00:18:37,370 So let's look at what might happen. 429 00:18:37,370 --> 00:18:42,070 So I'm showing this matches [UNINTELLIGIBLE] 430 00:18:42,070 --> 00:18:46,950 last year on-- what we had were Cagnode machines that 431 00:18:46,950 --> 00:18:49,130 were basically Core 2 Quad processors. 432 00:18:49,130 --> 00:18:50,620 So we had eight cores in here. 433 00:18:50,620 --> 00:18:52,956 And I put them-- you don't have to now 434 00:18:52,956 --> 00:18:53,680 look at this table. 435 00:18:53,680 --> 00:18:57,210 I put them in the slides so you can look at it later. 436 00:18:57,210 --> 00:18:58,920 And so this is the last year's machine. 437 00:18:58,920 --> 00:19:01,360 And of course, this year's machine is different. 438 00:19:01,360 --> 00:19:04,610 We have two six-core processors in here. 439 00:19:04,610 --> 00:19:06,330 So this is what we [UNINTELLIGIBLE] 440 00:19:06,330 --> 00:19:07,650 this year. 441 00:19:07,650 --> 00:19:09,380 [UNINTELLIGIBLE] 442 00:19:09,380 --> 00:19:10,390 OK. 
443 00:19:10,390 --> 00:19:13,670 And so right now, I'm showing numbers for this one. 444 00:19:13,670 --> 00:19:15,830 And later, I will show what happened in the 445 00:19:15,830 --> 00:19:16,770 [UNINTELLIGIBLE]. 446 00:19:16,770 --> 00:19:19,360 So if you look at a cache-- 447 00:19:19,360 --> 00:19:25,390 so what happened is each of the data items in the cache 448 00:19:25,390 --> 00:19:27,340 can be in multiple states. 449 00:19:27,340 --> 00:19:29,600 This is called the MSI protocol here. 450 00:19:29,600 --> 00:19:33,120 What that means is the item might be modified. 451 00:19:33,120 --> 00:19:36,490 If it is modified, it can be only in one cache. 452 00:19:36,490 --> 00:19:40,220 If anybody else wants to touch it, it has to get it out of 453 00:19:40,220 --> 00:19:42,340 the modified state. 454 00:19:42,340 --> 00:19:43,670 Or it can be sharing. 455 00:19:43,670 --> 00:19:46,190 Sharing means it's reading. 456 00:19:46,190 --> 00:19:49,780 So that means that item is read by multiple people. 457 00:19:49,780 --> 00:19:51,160 And that can have multiple copies. 458 00:19:51,160 --> 00:19:53,375 So sharing items can be in multiple places. 459 00:19:55,890 --> 00:19:59,220 However if you're modifying, [UNINTELLIGIBLE] 460 00:19:59,220 --> 00:20:00,890 items in everybody else. 461 00:20:00,890 --> 00:20:03,130 So that means if I modify something, it 462 00:20:03,130 --> 00:20:04,230 can only be in mine. 463 00:20:04,230 --> 00:20:06,540 If other people had that data, I had to go in 464 00:20:06,540 --> 00:20:07,070 and invalidate this. 465 00:20:07,070 --> 00:20:09,470 So if you modify this, I had to go in and invalidate it. 466 00:20:09,470 --> 00:20:10,740 So that's a sharing state. 467 00:20:10,740 --> 00:20:11,910 That means I'm [UNINTELLIGIBLE] everybody 468 00:20:11,910 --> 00:20:13,130 [UNINTELLIGIBLE] 469 00:20:13,130 --> 00:20:13,880 read this. 
470 00:20:13,880 --> 00:20:16,290 But if I ever want to change that, I have to go in and 471 00:20:16,290 --> 00:20:17,780 invalidate this one. 472 00:20:17,780 --> 00:20:20,530 So what that means is when I start writing, I am invalidating 473 00:20:20,530 --> 00:20:21,750 it for everybody. 474 00:20:21,750 --> 00:20:25,980 So even if everybody kept a copy and they start modifying, 475 00:20:25,980 --> 00:20:28,540 I have to get my own copy. 476 00:20:28,540 --> 00:20:30,000 And everybody else will invalidate. 477 00:20:30,000 --> 00:20:32,260 And then, if somebody else wanted to read-- 478 00:20:32,260 --> 00:20:34,410 for example, if this guy wants to read-- 479 00:20:34,410 --> 00:20:35,960 basically, this has to make this 480 00:20:35,960 --> 00:20:37,950 shared, back to shared. 481 00:20:37,950 --> 00:20:40,080 That means I have to get the value 13, propagate it, and 482 00:20:40,080 --> 00:20:43,100 this becomes shared again. 483 00:20:43,100 --> 00:20:43,560 OK. 484 00:20:43,560 --> 00:20:44,600 Did you get that? 485 00:20:44,600 --> 00:20:45,770 What's going on in here? 486 00:20:45,770 --> 00:20:47,520 In the cache? 487 00:20:47,520 --> 00:20:51,050 So reads, everybody can keep a copy if they want. 488 00:20:51,050 --> 00:20:53,530 Write-- only one guy can keep a copy. 489 00:20:53,530 --> 00:20:55,730 So what happens then is true sharing. 490 00:20:55,730 --> 00:20:58,280 So you have these two different cores. 491 00:20:58,280 --> 00:20:59,550 So I want to read something. 492 00:20:59,550 --> 00:21:02,500 So I get it from outside, probably from main memory, and I 493 00:21:02,500 --> 00:21:04,820 put it in my cache in here. 494 00:21:04,820 --> 00:21:09,790 And then, the next guy wants to write the same thing. 495 00:21:09,790 --> 00:21:11,200 Assume I'm writing that. 496 00:21:11,200 --> 00:21:14,520 And once I want to write that, I can't keep this copy. I have to 497 00:21:14,520 --> 00:21:16,920 invalidate it from here and get a copy here.
498 00:21:16,920 --> 00:21:19,130 And then, if this one wants to write it again, it has to 499 00:21:19,130 --> 00:21:21,140 basically invalidate it from here and get a copy. 500 00:21:21,140 --> 00:21:23,400 If we are only reading, both of us can keep a copy. With writes, it just kind of 501 00:21:23,400 --> 00:21:25,570 keeps bouncing back and forth, back and forth. 502 00:21:25,570 --> 00:21:30,350 And so if you bounce too many times, you get all of these 503 00:21:30,350 --> 00:21:31,520 invalidations. 504 00:21:31,520 --> 00:21:34,770 So the fact that I see invalidations basically 505 00:21:34,770 --> 00:21:37,770 tells me something like this is going on. 506 00:21:37,770 --> 00:21:39,505 So what's happening in this program? 507 00:21:42,070 --> 00:21:46,420 When I parallelize this for loop, my four cores-- 508 00:21:46,420 --> 00:21:49,080 basically since I am doing here [UNINTELLIGIBLE]-- 509 00:21:49,080 --> 00:21:51,710 are going to get this nice distribution of 510 00:21:51,710 --> 00:21:53,690 data into the caches. 511 00:21:53,690 --> 00:21:55,900 Assume it fits in cache. 512 00:21:55,900 --> 00:21:56,140 OK. 513 00:21:56,140 --> 00:21:59,050 So all this data nicely fits into cache, and now I'm pretty 514 00:21:59,050 --> 00:22:00,760 happy when I run this one because I got this 515 00:22:00,760 --> 00:22:01,490 data into the cache. 516 00:22:01,490 --> 00:22:02,910 And I write it. 517 00:22:02,910 --> 00:22:07,860 But the minute I go in here, basically this data has to 518 00:22:07,860 --> 00:22:09,110 [UNINTELLIGIBLE]. 519 00:22:10,710 --> 00:22:14,690 OK, because the access is n minus i in here, this data 520 00:22:14,690 --> 00:22:15,520 has to flip around. 521 00:22:15,520 --> 00:22:19,380 And by doing this, basically, I incur this huge amount of 522 00:22:19,380 --> 00:22:22,090 [UNINTELLIGIBLE], and it slows down. 523 00:22:22,090 --> 00:22:22,670 OK? 524 00:22:22,670 --> 00:22:25,250 So that's why it didn't work well.
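The bouncing described above can be reproduced with a small toy model. This is only a sketch under stated assumptions, not the lecture's benchmark: the `ToyMSI` class and the `run` function are hypothetical names, each array element is treated as its own cache line, and the two loops are assumed to be "each core writes A[i] for its block" followed by "each core reads A[n-1-i]" (mirrored) or "A[i]" (aligned).

```python
class ToyMSI:
    """Toy MSI model: per cache line, which cores hold it and in which state."""

    def __init__(self):
        self.holders = {}        # line -> {core: "M" or "S"}; absent means Invalid
        self.invalidations = 0   # copies removed from other caches by a write
        self.downgrades = 0      # Modified copies demoted to Shared by a remote read

    def read(self, core, line):
        copies = self.holders.setdefault(line, {})
        if core in copies:
            return               # hit: this core already holds the line (M or S)
        for other in list(copies):
            if copies[other] == "M":
                copies[other] = "S"      # the writer must share its value
                self.downgrades += 1
        copies[core] = "S"

    def write(self, core, line):
        copies = self.holders.setdefault(line, {})
        for other in list(copies):
            if other != core:
                del copies[other]        # every other copy is invalidated
                self.invalidations += 1
        copies[core] = "M"


def run(n, cores, rounds, mirrored):
    """Loop 1: each core writes its block of A. Loop 2: it reads A[n-1-i] or A[i]."""
    cache = ToyMSI()
    block = n // cores               # assumes n is a multiple of cores
    for _ in range(rounds):
        for core in range(cores):
            for i in range(core * block, (core + 1) * block):
                cache.write(core, i)
        for core in range(cores):
            for i in range(core * block, (core + 1) * block):
                cache.read(core, n - 1 - i if mirrored else i)
    return cache.invalidations, cache.downgrades


if __name__ == "__main__":
    print("mirrored (invalidations, downgrades):", run(16, 4, 2, True))
    print("aligned  (invalidations, downgrades):", run(16, 4, 2, False))
```

In this model the mirrored access pattern forces every element to change caches between the two loops, while the aligned pattern never leaves the core that wrote it. The model counts protocol events only; it says nothing about actual timing on real hardware.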
525 00:22:25,250 --> 00:22:28,990 So what can you do? 526 00:22:33,580 --> 00:22:38,230 When you have these read-write and write-write 527 00:22:38,230 --> 00:22:40,050 conflicts in here. 528 00:22:40,050 --> 00:22:44,420 And you have to actually move the data across in here. 529 00:22:44,420 --> 00:22:48,450 And what you can do is look for this true sharing. 530 00:22:48,450 --> 00:22:49,550 You can look at the [UNINTELLIGIBLE] 531 00:22:49,550 --> 00:22:51,030 and see if we have excessive 532 00:22:51,030 --> 00:22:52,910 [UNINTELLIGIBLE], we have a problem. 533 00:22:52,910 --> 00:22:54,990 And how do we eliminate that? 534 00:22:54,990 --> 00:22:56,240 You want to make the sharing minimal. 535 00:22:59,580 --> 00:23:01,984 If you get some data into a cache, you want to try 536 00:23:01,984 --> 00:23:04,320 to keep it there as much as possible. 537 00:23:04,320 --> 00:23:06,050 And if you're using it, you'd want to try to align 538 00:23:06,050 --> 00:23:08,080 everything across. 539 00:23:08,080 --> 00:23:11,140 So even across different regions, it'll use the same 540 00:23:11,140 --> 00:23:12,950 kind of things. 541 00:23:12,950 --> 00:23:15,120 And/or enforce some kind of [UNINTELLIGIBLE] technique to 542 00:23:15,120 --> 00:23:15,900 keep the data alive. 543 00:23:15,900 --> 00:23:18,045 So there are a lot of techniques in there, but true 544 00:23:18,045 --> 00:23:19,900 sharing can be an interesting problem. 545 00:23:19,900 --> 00:23:23,090 So in here, simple change, yes. 546 00:23:23,090 --> 00:23:28,380 Basically, instead of changing A, you change C. So 547 00:23:28,380 --> 00:23:30,130 you write A the same way. 548 00:23:30,130 --> 00:23:33,580 But now what I have done is I am doing the mirror by 549 00:23:33,580 --> 00:23:36,760 changing the access to C, so [UNINTELLIGIBLE] is the same 550 00:23:36,760 --> 00:23:38,590 as this access. 551 00:23:38,590 --> 00:23:40,070 So these two are the same thing.
552 00:23:40,070 --> 00:23:42,620 And the minute I do that, voila! 553 00:23:42,620 --> 00:23:44,210 I get good speedup. 554 00:23:44,210 --> 00:23:47,290 Because if you look at that, my invalidations have gone down. 555 00:23:47,290 --> 00:23:48,640 My L1 cache [UNINTELLIGIBLE] has now 556 00:23:48,640 --> 00:23:49,610 really, really gone down. 557 00:23:49,610 --> 00:23:51,030 I'm not doing anything. 558 00:23:51,030 --> 00:23:54,820 And of course, I am doing more instructions here than this 559 00:23:54,820 --> 00:23:57,220 one because-- 560 00:23:57,220 --> 00:24:00,070 I think, the difference between the instruction counts here and 561 00:24:00,070 --> 00:24:03,220 here is because a lot of times synchronization operations are 562 00:24:03,220 --> 00:24:05,530 dynamic because in the [UNINTELLIGIBLE] 563 00:24:05,530 --> 00:24:07,760 miscounted as the instructions, you are busy 564 00:24:07,760 --> 00:24:09,770 waiting in there. 565 00:24:09,770 --> 00:24:13,350 So this number is not really a constant number. 566 00:24:13,350 --> 00:24:14,850 OK, question. 567 00:24:14,850 --> 00:24:15,440 AUDIENCE: Not a question. 568 00:24:15,440 --> 00:24:19,020 So another thing one could do here is do loop fusion. 569 00:24:19,020 --> 00:24:19,530 PROFESSOR: Yes. 570 00:24:19,530 --> 00:24:20,410 Yes. 571 00:24:20,410 --> 00:24:24,100 Here is a nice way of putting both of the loops into one and 572 00:24:24,100 --> 00:24:24,990 doing loop fusion. 573 00:24:24,990 --> 00:24:26,300 And that works. 574 00:24:26,300 --> 00:24:28,230 In this case, you can do that. 575 00:24:28,230 --> 00:24:31,360 AUDIENCE: So loop fusion is where you take two loops, and 576 00:24:31,360 --> 00:24:33,620 you convert them into one loop. 577 00:24:33,620 --> 00:24:37,050 So in this case, you could have just written one nest, 578 00:24:37,050 --> 00:24:38,840 which has two things going on inside.
579 00:24:41,490 --> 00:24:44,580 And then you would save all the loop overhead and the 580 00:24:44,580 --> 00:24:46,090 scheduling overhead. 581 00:24:46,090 --> 00:24:49,090 So rather than doing it twice, you actually have reduced the 582 00:24:49,090 --> 00:24:52,890 overhead to just the parallelism of the one loop. 583 00:24:52,890 --> 00:24:55,790 So if you look at that, you'll realize that you could somehow 584 00:24:55,790 --> 00:24:59,670 make it just be a single nest with two statements in there, 585 00:24:59,670 --> 00:25:01,130 rather than one. 586 00:25:01,130 --> 00:25:04,100 PROFESSOR: So basically, instead of [UNINTELLIGIBLE] 587 00:25:04,100 --> 00:25:08,190 entire thing and move plus C into here, basically. 588 00:25:08,190 --> 00:25:10,040 And I could have just done it in one loop nest. 589 00:25:10,040 --> 00:25:11,950 That's what loop fusion would do here. 590 00:25:11,950 --> 00:25:13,750 So we can actually [UNINTELLIGIBLE] 591 00:25:13,750 --> 00:25:15,280 much nicer in here. 592 00:25:15,280 --> 00:25:19,190 But just for example purposes, so now I really reduced this 593 00:25:19,190 --> 00:25:20,540 one and got that. 594 00:25:20,540 --> 00:25:21,740 So this is great. 595 00:25:21,740 --> 00:25:23,830 Cagnodes really showed this classic 596 00:25:23,830 --> 00:25:25,820 problem in the computer. 597 00:25:25,820 --> 00:25:27,070 And so I'm like, OK. 598 00:25:27,070 --> 00:25:28,720 Now we have new machines. 599 00:25:28,720 --> 00:25:30,075 Let's try it and see what happens. 600 00:25:34,090 --> 00:25:35,670 What does this show? 601 00:25:35,670 --> 00:25:41,240 This is your nice cloud machines we've got now. 602 00:25:41,240 --> 00:25:42,480 I have no slow down. 603 00:25:42,480 --> 00:25:47,570 I was really disappointed because beforehand, I had this 604 00:25:47,570 --> 00:25:50,070 for sharing going on in here and had a 605 00:25:50,070 --> 00:25:51,100 really big slow down. 
606 00:25:51,100 --> 00:25:54,070 But this one, in fact, the difference is very small. 607 00:25:54,070 --> 00:25:56,540 And when you look at any kind of performance counters, they 608 00:25:56,540 --> 00:25:58,580 are pretty comparable. 609 00:25:58,580 --> 00:26:00,220 There's nothing much going on here. 610 00:26:00,220 --> 00:26:03,580 So what do you think is going on in this new 611 00:26:03,580 --> 00:26:06,130 architecture now? 612 00:26:06,130 --> 00:26:07,400 Why this might be? 613 00:26:15,730 --> 00:26:16,980 AUDIENCE: [INAUDIBLE] 614 00:26:26,030 --> 00:26:28,800 PROFESSOR: That's an interesting observation, but 615 00:26:28,800 --> 00:26:32,050 also we have-- 616 00:26:32,050 --> 00:26:36,250 yes, core seven basically is two by two in the die. 617 00:26:36,250 --> 00:26:38,650 But we also have two different processors. 618 00:26:38,650 --> 00:26:40,110 So that's there, too. 619 00:26:40,110 --> 00:26:43,640 So in some sense, when you get our two-way process 620 00:26:43,640 --> 00:26:44,670 [UNINTELLIGIBLE] 621 00:26:44,670 --> 00:26:45,620 So that's there. 622 00:26:45,620 --> 00:26:46,370 That might help. 623 00:26:46,370 --> 00:26:48,650 That's an interesting observation. 624 00:26:48,650 --> 00:26:50,340 What else might be going on here? 625 00:26:50,340 --> 00:26:52,880 Why do you think they manage to get this one? 626 00:26:52,880 --> 00:26:54,580 What might be another answer? 627 00:26:58,790 --> 00:27:03,350 What can hide these kind of delays that can happen? 628 00:27:03,350 --> 00:27:06,990 Load delays, and cache misses, and stuff like that. 629 00:27:06,990 --> 00:27:09,110 What techniques and hardware can hide those? 630 00:27:15,300 --> 00:27:17,050 Just [UNINTELLIGIBLE] a speculation. 631 00:27:17,050 --> 00:27:18,610 Prefetching. 632 00:27:18,610 --> 00:27:22,290 So most hardware has an internal mechanism. 633 00:27:22,290 --> 00:27:24,890 When you start fetching data, you say, aha! 634 00:27:24,890 --> 00:27:26,420 I see a pattern. 
635 00:27:26,420 --> 00:27:27,650 I know you want to get this thing. 636 00:27:27,650 --> 00:27:31,960 Let me go forward and bring more data, thinking you are 637 00:27:31,960 --> 00:27:34,240 going to follow that pattern. 638 00:27:34,240 --> 00:27:34,590 OK. 639 00:27:34,590 --> 00:27:36,240 All or most of the [UNINTELLIGIBLE] 640 00:27:36,240 --> 00:27:39,010 for, I think, have [UNINTELLIGIBLE] even a 641 00:27:39,010 --> 00:27:42,280 Pentium had something for prefetching going on. 642 00:27:42,280 --> 00:27:44,470 But most of the time, what happens is the prefetching 643 00:27:44,470 --> 00:27:45,970 engine can't keep up. 644 00:27:45,970 --> 00:27:47,930 If you are getting there, it's [UNINTELLIGIBLE] 645 00:27:47,930 --> 00:27:48,880 a little bit further. 646 00:27:48,880 --> 00:27:50,650 You are going to catch up, and you're going to start because 647 00:27:50,650 --> 00:27:53,026 you have more [UNINTELLIGIBLE] here. 648 00:27:53,026 --> 00:27:54,980 I think [UNINTELLIGIBLE] 649 00:27:54,980 --> 00:27:56,880 has a really, really good prefetcher. 650 00:27:56,880 --> 00:28:02,170 And then, we saw it in our architecture slides, too. 651 00:28:02,170 --> 00:28:04,940 That a lot of things that used to happen before is gone. 652 00:28:04,940 --> 00:28:05,990 So this is really good. 653 00:28:05,990 --> 00:28:11,330 What that means is a lot of weird stuff that's going on 654 00:28:11,330 --> 00:28:12,360 [UNINTELLIGIBLE] 655 00:28:12,360 --> 00:28:14,010 making them disappear. 656 00:28:14,010 --> 00:28:15,990 So these kind of problems don't show up. 657 00:28:15,990 --> 00:28:17,260 So that's the nice story. 658 00:28:17,260 --> 00:28:18,920 The other part is, OK. 659 00:28:18,920 --> 00:28:22,630 Now if you start really tweaking your programs to one 660 00:28:22,630 --> 00:28:25,120 architecture, you wait a generation. 
661 00:28:25,120 --> 00:28:28,300 And then now, we have done either the tweaking-- 662 00:28:28,300 --> 00:28:33,910 the best case, tweaking has no impact, and it's 663 00:28:33,910 --> 00:28:35,170 not affecting anything. 664 00:28:35,170 --> 00:28:37,235 In most of the time, worst case, tweaking actually slows 665 00:28:37,235 --> 00:28:39,160 down the program because you are trying to do something 666 00:28:39,160 --> 00:28:40,030 complicated. 667 00:28:40,030 --> 00:28:42,790 That's just not needed anymore. 668 00:28:42,790 --> 00:28:46,990 So even though these kind of things showed up in 669 00:28:46,990 --> 00:28:47,500 [UNINTELLIGIBLE] 670 00:28:47,500 --> 00:28:48,750 architecture, it's not an issue. 671 00:28:48,750 --> 00:28:52,250 But if you go to many of the smaller architectures that 672 00:28:52,250 --> 00:28:55,830 have that don't have that much of the very popular 673 00:28:55,830 --> 00:28:57,620 prefetchers, this kind of issue you would see. 674 00:28:57,620 --> 00:29:00,610 So for example, if you go to a cell phone [UNINTELLIGIBLE], 675 00:29:00,610 --> 00:29:03,770 you would probably see these kind of issues happening. 676 00:29:03,770 --> 00:29:05,560 Any questions here so far? 677 00:29:05,560 --> 00:29:06,300 So that's the good news. 678 00:29:06,300 --> 00:29:08,280 You guys don't have to worry about it too much. 679 00:29:08,280 --> 00:29:11,440 But at least it's good to know the technique because you'll 680 00:29:11,440 --> 00:29:13,820 see it in other architectures. 681 00:29:13,820 --> 00:29:18,110 So now, I want to switch a little bit into looking at 682 00:29:18,110 --> 00:29:21,630 programs that don't have what we call data parallelism. 683 00:29:21,630 --> 00:29:24,320 That means you can start and say, [UNINTELLIGIBLE] 684 00:29:24,320 --> 00:29:24,740 parallels. 685 00:29:24,740 --> 00:29:26,840 Everybody get the different chunk and run. 
686 00:29:26,840 --> 00:29:30,110 And we are going a little bit more deeply into looking at 687 00:29:30,110 --> 00:29:32,230 programs that are a little bit different. 688 00:29:32,230 --> 00:29:36,540 So I wanted to come up with this little representation to 689 00:29:36,540 --> 00:29:37,580 represent the program. 690 00:29:37,580 --> 00:29:42,090 And so if you think about iteration space-- 691 00:29:42,090 --> 00:29:44,090 actually before you, I'll go down to dependence. 692 00:29:44,090 --> 00:29:46,200 I'll also do a little bit of load balance. 693 00:29:46,200 --> 00:29:49,890 So here's a loop that in my iterations-- 694 00:29:49,890 --> 00:29:53,690 the first one I transformed zero to eight. 695 00:29:53,690 --> 00:29:55,670 But J only runs from one to eight. 696 00:29:55,670 --> 00:29:58,630 So each I, I have less and less amount of 697 00:29:58,630 --> 00:30:02,160 J iterations, basically. 698 00:30:02,160 --> 00:30:02,600 OK? 699 00:30:02,600 --> 00:30:05,100 It's a triangular loop. 700 00:30:05,100 --> 00:30:06,350 OK? 701 00:30:10,050 --> 00:30:10,250 OK. 702 00:30:10,250 --> 00:30:12,420 So this is the way to represent iteration space, so 703 00:30:12,420 --> 00:30:15,600 I will represent data and get back to this again. 704 00:30:15,600 --> 00:30:18,830 So if you look at a data space, you can assume data 705 00:30:18,830 --> 00:30:23,740 iteration space could be this funky, triangular, hyperplane 706 00:30:23,740 --> 00:30:24,760 type of thing. 707 00:30:24,760 --> 00:30:28,990 Whereas data is mostly [? rectangulineum ?], 708 00:30:28,990 --> 00:30:30,740 multi-dimensional rectangle. 709 00:30:30,740 --> 00:30:32,810 So for example, if I have [UNINTELLIGIBLE] 710 00:30:32,810 --> 00:30:35,140 and it's a one-dimensional one, this is basically a 711 00:30:35,140 --> 00:30:36,050 two-dimensional data. 712 00:30:36,050 --> 00:30:37,480 And you can have three-dimensional cubes and 713 00:30:37,480 --> 00:30:38,020 stuff like that. 
714 00:30:38,020 --> 00:30:39,500 You can represent data like that. 715 00:30:39,500 --> 00:30:41,170 So this is a way to nicely represent it. 716 00:30:41,170 --> 00:30:45,000 And when you start thinking about it, we can look at 717 00:30:45,000 --> 00:30:45,850 what's going on. 718 00:30:45,850 --> 00:30:46,780 OK? 719 00:30:46,780 --> 00:30:49,870 So now you have this loop again. 720 00:30:49,870 --> 00:30:52,820 So here's the basic [UNINTELLIGIBLE] iterations. 721 00:30:52,820 --> 00:30:54,140 And here's the data. 722 00:30:54,140 --> 00:30:56,850 Assume this is both A and B. There will be another one for 723 00:30:56,850 --> 00:30:57,480 matrix [UNINTELLIGIBLE] 724 00:30:57,480 --> 00:30:59,870 B. This is the data each iteration is going to touch. 725 00:30:59,870 --> 00:31:01,700 So these are the data that need to get touched, and 726 00:31:01,700 --> 00:31:04,270 here are the iterations you are going to run. 727 00:31:04,270 --> 00:31:09,720 So we can say OpenMP parallel for. 728 00:31:09,720 --> 00:31:12,200 So what happens when you do parallel for? 729 00:31:12,200 --> 00:31:14,630 So I am going to get parallel. 730 00:31:14,630 --> 00:31:17,930 And so core one gets this one, another core, another core, 731 00:31:17,930 --> 00:31:20,610 another core get these iterations running. 732 00:31:20,610 --> 00:31:22,398 So what happens if you do this one? 733 00:31:26,520 --> 00:31:29,590 Do you get really good performance? 734 00:31:29,590 --> 00:31:31,852 Why? 735 00:31:31,852 --> 00:31:34,570 AUDIENCE: [INAUDIBLE] 736 00:31:34,570 --> 00:31:35,080 PROFESSOR: It's not balanced. 737 00:31:35,080 --> 00:31:36,900 The load is not balanced in here. 738 00:31:36,900 --> 00:31:40,150 So basically if you run sequential and if you run 739 00:31:40,150 --> 00:31:45,990 block distribution, I get about 3x performance in here. 740 00:31:45,990 --> 00:31:49,470 So if I look closely, here is the number of iterations 741 00:31:49,470 --> 00:31:51,340 given to each core.
742 00:31:51,340 --> 00:31:53,790 The first core gets almost nothing, and the last guy gets 743 00:31:53,790 --> 00:31:55,380 a lot of work. 744 00:31:55,380 --> 00:31:58,510 Here's where something like the Cilk runtime can come into 745 00:31:58,510 --> 00:32:01,955 play because with the Cilk runtime, basically, this guy 746 00:32:01,955 --> 00:32:03,040 will finish the [UNINTELLIGIBLE] 747 00:32:03,040 --> 00:32:04,620 start stealing from somebody else. 748 00:32:04,620 --> 00:32:07,620 And so it would be done nicely. 749 00:32:07,620 --> 00:32:09,860 But if you do a static schedule, you 750 00:32:09,860 --> 00:32:11,190 are in this big bind. 751 00:32:11,190 --> 00:32:13,270 You don't have too many things going on. 752 00:32:17,990 --> 00:32:20,650 And basically, this is what we call load imbalance. 753 00:32:20,650 --> 00:32:24,760 So what you can do is figure out a complicated partitioning 754 00:32:24,760 --> 00:32:27,645 so you can statically partition this out. 755 00:32:27,645 --> 00:32:31,810 Or you can do something like the dynamic scheduler, like the 756 00:32:31,810 --> 00:32:32,530 [UNINTELLIGIBLE] 757 00:32:32,530 --> 00:32:35,720 scheduler for a solution. 758 00:32:35,720 --> 00:32:36,970 So how to detect load imbalance? 759 00:32:39,290 --> 00:32:44,120 Basically, what you might want to do is for each of the 760 00:32:44,120 --> 00:32:45,690 different sections you are running, you want to look at 761 00:32:45,690 --> 00:32:47,520 the time it takes. 762 00:32:47,520 --> 00:32:50,530 And if the [UNINTELLIGIBLE] is varying, hugely varying, 763 00:32:50,530 --> 00:32:52,260 that means there's a load imbalance going on. 764 00:32:52,260 --> 00:32:55,345 So you might want to check how much time each of the parallel 765 00:32:55,345 --> 00:32:58,330 regions is taking. 766 00:32:58,330 --> 00:33:01,370 And that gives you this view.
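The imbalance described above can be checked by simply counting inner-loop iterations per core. The sketch below assumes a triangular loop in which row i costs i inner iterations and the rows are split into contiguous blocks; `block_work` and `imbalance` are hypothetical helper names, and iteration counts stand in for the measured time per parallel region.

```python
def block_work(n, cores):
    """Inner-loop iterations per core when rows 0..n-1 are split into contiguous blocks.

    Assumes row i costs i inner iterations and n is a multiple of cores.
    """
    rows = n // cores
    return [sum(range(c * rows, (c + 1) * rows)) for c in range(cores)]


def imbalance(work):
    """Slowest core's work divided by the average; 1.0 means perfectly balanced."""
    return max(work) / (sum(work) / len(work))


if __name__ == "__main__":
    work = block_work(8, 4)
    print("iterations per core:", work)
    print("max / average:", imbalance(work))
```

With 8 rows on 4 cores, the first core gets one inner iteration and the last gets thirteen, so the parallel region finishes only when the most loaded core does.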
767 00:33:01,370 --> 00:33:04,620 How to eliminate load imbalance or the use of 768 00:33:04,620 --> 00:33:08,500 dynamic scheduler that will deal with that. 769 00:33:08,500 --> 00:33:12,220 Or you can do a different distribution statically. 770 00:33:12,220 --> 00:33:14,690 That will not partition in this large block. 771 00:33:14,690 --> 00:33:16,900 So let me show you a static part because we have already 772 00:33:16,900 --> 00:33:18,440 learned the dynamic part before. 773 00:33:18,440 --> 00:33:21,165 So now instead of doing that, we do a cyclic distribution. 774 00:33:21,165 --> 00:33:23,370 We use a static one. 775 00:33:23,370 --> 00:33:27,690 That means if you have a lot more than and a little bit 776 00:33:27,690 --> 00:33:32,220 better distribution so what happens to the processor? 777 00:33:32,220 --> 00:33:33,810 Zero gets this one and this one. 778 00:33:33,810 --> 00:33:35,740 One gets this one and this one. 779 00:33:35,740 --> 00:33:36,860 So on and so forth. 780 00:33:36,860 --> 00:33:39,380 So that would be a little bit between balancing there. 781 00:33:39,380 --> 00:33:42,830 But if you have enough of cyclic, the imbalance would be 782 00:33:42,830 --> 00:33:45,550 much lower. 783 00:33:45,550 --> 00:33:47,610 So should we run faster? 784 00:33:51,070 --> 00:33:56,470 So here's the iterations each guy gets in here. 785 00:33:56,470 --> 00:33:59,360 This looks very balanced because I had a lot more 786 00:33:59,360 --> 00:34:01,810 iterations than this eight one. 787 00:34:01,810 --> 00:34:03,420 This is not that balanced here because this guy gets a lot 788 00:34:03,420 --> 00:34:04,800 more than the first one. 789 00:34:04,800 --> 00:34:06,820 The first one gets six. 790 00:34:06,820 --> 00:34:09,070 And the second and last one gets a lot more. 791 00:34:13,775 --> 00:34:16,560 Uh oh. 792 00:34:16,560 --> 00:34:18,060 What do you think is happening here now? 793 00:34:22,040 --> 00:34:23,290 I ran again slower. 
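The effect of the cyclic distribution on balance can be sketched by counting again. The assumptions are the same triangular loop where row i costs i inner iterations, with row i handed to core i % cores; `cyclic_work` and `imbalance` are hypothetical helper names. In OpenMP this assignment corresponds to a `schedule(static, 1)` clause on the parallel for.

```python
def cyclic_work(n, cores):
    """Inner-loop iterations per core when row i goes to core i % cores.

    Assumes row i costs i inner iterations.
    """
    return [sum(range(c, n, cores)) for c in range(cores)]


def imbalance(work):
    """Slowest core's work divided by the average; 1.0 means perfectly balanced."""
    return max(work) / (sum(work) / len(work))


if __name__ == "__main__":
    for n in (8, 1024):
        work = cyclic_work(n, 4)
        print(n, "rows:", work, "max / average =", imbalance(work))
```

For the small 8-row picture the last core still does more than the first, but once there are many more rows than cores the counts are nearly equal. This only measures balance; it does not capture the cache effects that make the cyclic version slow in the lecture's measurements.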
794 00:34:25,870 --> 00:34:28,429 See I guess the people in class last year had things 795 00:34:28,429 --> 00:34:31,199 worse because they had this old processor that did all 796 00:34:31,199 --> 00:34:33,010 these crazy things on them. 797 00:34:33,010 --> 00:34:35,830 And you guys get the fast one that doesn't do that. 798 00:34:35,830 --> 00:34:41,310 So why do you think cyclic distribution is 799 00:34:41,310 --> 00:34:43,639 running a lot slower? 800 00:34:43,639 --> 00:34:44,454 What might be going on? 801 00:34:44,454 --> 00:34:45,704 AUDIENCE: [INAUDIBLE] 802 00:34:47,420 --> 00:34:49,219 PROFESSOR: Spawning [UNINTELLIGIBLE] it's not that 803 00:34:49,219 --> 00:34:51,830 much because if you don't run this and synchronize, what you 804 00:34:51,830 --> 00:34:54,929 do is you run the same number of threads and say, now, instead 805 00:34:54,929 --> 00:34:59,260 of running continuously, you run jumping over iterations. 806 00:34:59,260 --> 00:35:01,810 You should run zero and nine or whatever, jumping over 807 00:35:01,810 --> 00:35:03,060 iterations. 808 00:35:12,500 --> 00:35:13,130 Why do you think? 809 00:35:13,130 --> 00:35:14,630 AUDIENCE: [INAUDIBLE] 810 00:35:14,630 --> 00:35:16,240 PROFESSOR: Yeah, there's a cache issue. 811 00:35:16,240 --> 00:35:19,980 Any time you're not sure about the answer, it's probably a 812 00:35:19,980 --> 00:35:20,470 cache issue. 813 00:35:20,470 --> 00:35:22,550 What type of cache issue do you think is going on? 814 00:35:22,550 --> 00:35:23,800 AUDIENCE: [INAUDIBLE] 815 00:35:28,920 --> 00:35:29,410 PROFESSOR: Yeah. 816 00:35:29,410 --> 00:35:31,010 [UNINTELLIGIBLE]. 817 00:35:31,010 --> 00:35:34,280 But let me show you what happens. 818 00:35:34,280 --> 00:35:35,580 So you get off then. 819 00:35:35,580 --> 00:35:37,780 OK, so if you look at-- 820 00:35:37,780 --> 00:35:40,150 the data is here so let's look at what happens. 821 00:35:40,150 --> 00:35:42,950 This is running a [UNINTELLIGIBLE] lower.
822 00:35:42,950 --> 00:35:45,460 It's showing a lot more instructions, but instruction 823 00:35:45,460 --> 00:35:47,710 doesn't tell you too much because a lot of them might be 824 00:35:47,710 --> 00:35:49,360 missing synchronization costs in here. 825 00:35:49,360 --> 00:35:52,690 So instruction is not that illuminating here. 826 00:35:52,690 --> 00:35:54,850 The big illumination here is this one again. 827 00:35:54,850 --> 00:35:56,950 Invalidations. 828 00:35:56,950 --> 00:35:59,300 I have a huge amount of invalidations going on. 829 00:35:59,300 --> 00:36:04,330 So here is a case of false sharing. 830 00:36:04,330 --> 00:36:08,080 So what happens is now things next to each other, you want 831 00:36:08,080 --> 00:36:09,550 to multiply different processors. 832 00:36:09,550 --> 00:36:10,890 We're not touching the same data. 833 00:36:10,890 --> 00:36:13,250 Everybody's looking at somebody else's data. 834 00:36:13,250 --> 00:36:16,290 So what happens is assume I want to write this data item. 835 00:36:16,290 --> 00:36:17,550 I like that data item. 836 00:36:17,550 --> 00:36:22,430 But I get the entire cache line because when I ask for 837 00:36:22,430 --> 00:36:25,330 that, I get my synchronization by the cache line. 838 00:36:25,330 --> 00:36:28,830 I get this entire cache line coming in here into this one. 839 00:36:28,830 --> 00:36:30,550 And the next guys [UNINTELLIGIBLE] at me. 840 00:36:30,550 --> 00:36:33,020 This core won't write this data because instead of 841 00:36:33,020 --> 00:36:36,000 blocks, I basically give each strips. 842 00:36:36,000 --> 00:36:37,840 There's a lot of overlap between strips. 843 00:36:37,840 --> 00:36:41,000 So this guy says not to write this one, I had to get the 844 00:36:41,000 --> 00:36:43,580 entire cache line going back here. 
845 00:36:43,580 --> 00:36:45,630 And so if you want to write that again, I had to get the 846 00:36:45,630 --> 00:36:47,680 entire cache line going back even though we are writing 847 00:36:47,680 --> 00:36:48,790 different data. 848 00:36:48,790 --> 00:36:52,010 Because we are sharing cache lines in here. 849 00:36:52,010 --> 00:36:53,880 This thinking was in back and forth, back and 850 00:36:53,880 --> 00:36:54,960 forth, back and forth. 851 00:36:54,960 --> 00:36:56,520 I have a lot of cache [UNINTELLIGIBLE]. 852 00:36:56,520 --> 00:36:59,200 Things are really shot. 853 00:36:59,200 --> 00:37:00,160 OK? 854 00:37:00,160 --> 00:37:03,740 And so what happens here is if you look at the cache lines-- 855 00:37:03,740 --> 00:37:05,140 there's my animation. 856 00:37:05,140 --> 00:37:07,280 So cache lines basically mess this all up. 857 00:37:07,280 --> 00:37:08,690 You can see that really carefully. 858 00:37:08,690 --> 00:37:12,500 What happens is between these lines, there would be some 859 00:37:12,500 --> 00:37:13,750 overlap of cache lines. 860 00:37:15,710 --> 00:37:17,890 And this overlap in cache lines keeps bouncing back and 861 00:37:17,890 --> 00:37:19,990 forth, back and forth in here. 862 00:37:19,990 --> 00:37:23,630 And so what happens is basically cache lines are 863 00:37:23,630 --> 00:37:28,850 bigger than the data size, or there's overlap in here, and 864 00:37:28,850 --> 00:37:31,480 the cache line is shared when the data is not shared. 865 00:37:31,480 --> 00:37:37,050 And so how to detect false sharing in too many conflicts. 866 00:37:37,050 --> 00:37:41,340 You assume this is a nice parallelism, but suddenly, you 867 00:37:41,340 --> 00:37:43,840 don't have a speed up, and you have a lot of conflicts here, 868 00:37:43,840 --> 00:37:47,090 even though there isn't something to be sharing. 869 00:37:47,090 --> 00:37:49,770 And how to eliminate false sharing. 870 00:37:49,770 --> 00:37:54,250 Make data used by each contiguous in memory. 
871 00:37:54,250 --> 00:37:55,550 That's a good way of doing that. 872 00:37:55,550 --> 00:37:57,360 Or pad at the end. 873 00:37:57,360 --> 00:38:00,330 So these kind of at the corners, there's not going to 874 00:38:00,330 --> 00:38:02,830 be any overlapping. 875 00:38:02,830 --> 00:38:07,770 So in here, one thing you can do is, you can measure each 876 00:38:07,770 --> 00:38:10,370 thing that each of the cores get. 877 00:38:10,370 --> 00:38:11,350 We can make [UNINTELLIGIBLE]. 878 00:38:11,350 --> 00:38:14,560 But before what happens was a core used to get this 879 00:38:14,560 --> 00:38:15,970 line and this line. 880 00:38:15,970 --> 00:38:17,730 There are different places in memory. 881 00:38:17,730 --> 00:38:20,220 But you can make these two contiguous in memory by 882 00:38:20,220 --> 00:38:22,190 basically now, instead of having a two-dimensional 883 00:38:22,190 --> 00:38:23,940 array, you made that a 884 00:38:23,940 --> 00:38:26,210 three-dimensional or disarrays. 885 00:38:26,210 --> 00:38:27,890 AUDIENCE: Can you say that again? 886 00:38:27,890 --> 00:38:31,030 PROFESSOR: So before you what just happened was each of them 887 00:38:31,030 --> 00:38:34,210 were going to get this line and this line, each core. 888 00:38:34,210 --> 00:38:38,070 All these lines that were in different parts of the memory. 889 00:38:38,070 --> 00:38:40,510 In here, each would get only two lines. 890 00:38:40,510 --> 00:38:41,570 But they're in a different place. 891 00:38:41,570 --> 00:38:43,845 So if you have more cyclic, you'll get a lot more lines or 892 00:38:43,845 --> 00:38:44,820 lower memory. 893 00:38:44,820 --> 00:38:47,830 So what we can do is we can arrange the cache. 894 00:38:47,830 --> 00:38:50,560 So if you think about this, you can think the cache, now 895 00:38:50,560 --> 00:38:53,220 the data, is instead of two-dimensions is 896 00:38:53,220 --> 00:38:55,390 three-dimensional data. 
897 00:38:55,390 --> 00:38:58,360 One dimension is this cyclic part in here. 898 00:38:58,360 --> 00:38:59,450 So we can do that. 899 00:38:59,450 --> 00:39:04,030 And then, you can change it in a way that the cyclic part, the 900 00:39:04,030 --> 00:39:06,290 one where I got this line and this line, now becomes 901 00:39:06,290 --> 00:39:07,660 contiguous. 902 00:39:07,660 --> 00:39:10,390 So instead of thinking about the data as two-dimensional, 903 00:39:10,390 --> 00:39:11,790 you think about it as a cube. 904 00:39:11,790 --> 00:39:14,690 And you kind of change the cube for the inner dimension 905 00:39:14,690 --> 00:39:16,670 to be the one that's contiguous. 906 00:39:16,670 --> 00:39:18,980 So you can do data [UNINTELLIGIBLE] 907 00:39:18,980 --> 00:39:20,230 transformation and get there. 908 00:39:23,290 --> 00:39:29,130 So now what happens is the rows of core zero just get 909 00:39:29,130 --> 00:39:30,590 contiguous in memory. 910 00:39:30,590 --> 00:39:32,370 And core one gets contiguous in memory. 911 00:39:32,370 --> 00:39:34,600 So if you're trying to make it contiguous, that's great. 912 00:39:34,600 --> 00:39:37,930 So between padding and making things contiguous, you can get 913 00:39:37,930 --> 00:39:38,840 good performance. 914 00:39:38,840 --> 00:39:41,520 And if you do data transformation, voila! 915 00:39:41,520 --> 00:39:44,940 My invalidations just went down drastically. 916 00:39:44,940 --> 00:39:47,900 I again have a nice load balancing here. 917 00:39:47,900 --> 00:39:50,100 Invalidations went down drastically. 918 00:39:50,100 --> 00:39:53,660 That means my [UNINTELLIGIBLE] increased a little bit and I 919 00:39:53,660 --> 00:39:56,600 get really nice speedup. 920 00:39:56,600 --> 00:40:01,320 So here are the kind of crazy things you have to do if you 921 00:40:01,320 --> 00:40:05,200 are doing things like algorithms that 922 00:40:05,200 --> 00:40:07,100 are not cache oblivious.
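The false sharing and the layout fix described above can be sketched with a toy ownership model. Everything here is an assumption made for illustration: a cache line holds 8 elements, rows are 12 elements long (so row boundaries fall inside cache lines), the cores write their elements in lockstep, and `count_invalidations`, `cyclic_layout`, and `regrouped_layout` are hypothetical names.

```python
LINE_ELEMS = 8  # assumption: a 64-byte cache line holding eight 8-byte values


def count_invalidations(per_core_addrs):
    """Cores write their addresses in lockstep; count lines pulled from another core."""
    owner = {}  # cache line -> core that last wrote it (holds it Modified)
    invalidations = 0
    steps = max(len(addrs) for addrs in per_core_addrs)
    for t in range(steps):
        for core, addrs in enumerate(per_core_addrs):
            if t >= len(addrs):
                continue
            line = addrs[t] // LINE_ELEMS
            if line in owner and owner[line] != core:
                invalidations += 1   # different data, same line: false sharing
            owner[line] = core
    return invalidations


def cyclic_layout(n_rows, row_len, cores):
    """Row-major 2-D array; row r is written by core r % cores."""
    return [[r * row_len + j
             for r in range(c, n_rows, cores)
             for j in range(row_len)]
            for c in range(cores)]


def regrouped_layout(n_rows, row_len, cores):
    """Same rows per core, stored back to back and padded to whole cache lines.

    Assumes n_rows is a multiple of cores.
    """
    per_core = (n_rows // cores) * row_len
    region = -(-per_core // LINE_ELEMS) * LINE_ELEMS  # round up to a line multiple
    return [[c * region + k for k in range(per_core)] for c in range(cores)]


if __name__ == "__main__":
    print("cyclic rows:   ", count_invalidations(cyclic_layout(8, 12, 4)))
    print("regrouped rows:", count_invalidations(regrouped_layout(8, 12, 4)))
```

No two cores ever write the same element in either layout. In the cyclic layout, neighboring rows owned by different cores share a cache line at the row boundary, so that line keeps changing owner. After regrouping and padding, every line belongs to exactly one core and the count drops to zero.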
923 00:40:07,100 --> 00:40:08,820 And if you are directly parallelizing yourself without 924 00:40:08,820 --> 00:40:11,890 letting a nice [UNINTELLIGIBLE] 925 00:40:11,890 --> 00:40:12,660 time to help you. 926 00:40:12,660 --> 00:40:16,580 Something like a [UNINTELLIGIBLE] assistant. 927 00:40:16,580 --> 00:40:19,710 So I'm just going to summarize this 928 00:40:19,710 --> 00:40:20,990 because this is important. 929 00:40:20,990 --> 00:40:22,630 We looked at a bunch of cache issues. 930 00:40:22,630 --> 00:40:25,420 We looked at cold misses, capacity misses, and 931 00:40:25,420 --> 00:40:26,970 conflict misses before. 932 00:40:26,970 --> 00:40:29,390 And today, here are some examples of 933 00:40:29,390 --> 00:40:31,340 true sharing misses. 934 00:40:31,340 --> 00:40:36,160 What happened was I am actually really using data, 935 00:40:36,160 --> 00:40:41,880 but I set up my parallelism in such a way that between 936 00:40:41,880 --> 00:40:46,000 different executions, my data has to move across. 937 00:40:46,000 --> 00:40:47,030 [UNINTELLIGIBLE] 938 00:40:47,030 --> 00:40:51,230 So I am truly sharing data, but the data has to go to 939 00:40:51,230 --> 00:40:52,250 somebody else's cache. 940 00:40:52,250 --> 00:40:53,110 So I've got a lot of [UNINTELLIGIBLE] 941 00:40:53,110 --> 00:40:54,830 violations here. 942 00:40:54,830 --> 00:40:58,420 The other one is more like false sharing, where you 943 00:40:58,420 --> 00:41:01,240 assume there's no sharing, nice parallelism, everything, 944 00:41:01,240 --> 00:41:03,120 except the program runs very slow. 945 00:41:03,120 --> 00:41:05,610 And that can be because of false sharing. 946 00:41:05,610 --> 00:41:09,550 So we just kind of touched on these two topics. 947 00:41:09,550 --> 00:41:10,660 OK? 948 00:41:10,660 --> 00:41:14,340 So let me switch gears a little bit to dependences. 949 00:41:14,340 --> 00:41:16,380 We touched on the dependences a little bit.
950 00:41:16,380 --> 00:41:19,070 And these arise in programs that are not 951 00:41:19,070 --> 00:41:20,410 completely parallel. 952 00:41:20,410 --> 00:41:24,710 So normally, what happens is a true dependence means that I'm 953 00:41:24,710 --> 00:41:26,570 writing and then reading. [UNINTELLIGIBLE] 954 00:41:26,570 --> 00:41:27,660 the other way around. 955 00:41:27,660 --> 00:41:30,400 And if two guys are both writing, then the order has 956 00:41:30,400 --> 00:41:32,780 to be maintained; that would be an output dependence. 957 00:41:32,780 --> 00:41:38,650 And that was dependence without even a loop, because 958 00:41:38,650 --> 00:41:40,880 these are single items. 959 00:41:40,880 --> 00:41:43,760 So if you have an array here, this is becoming a lot more 960 00:41:43,760 --> 00:41:44,600 complicated. 961 00:41:44,600 --> 00:41:46,200 Because there's no simple thing in here. 962 00:41:46,200 --> 00:41:48,450 Because it's not just using the same iteration. 963 00:41:48,450 --> 00:41:51,910 You might be using data from different iterations. 964 00:41:51,910 --> 00:41:55,920 So what happens is there's a dynamic instance of 965 00:41:55,920 --> 00:41:56,840 iterations. 966 00:41:56,840 --> 00:41:59,230 So one iteration writes the data, and somebody else might 967 00:41:59,230 --> 00:42:01,310 be reading the data. 968 00:42:01,310 --> 00:42:03,490 And that is basically the order we have to 969 00:42:03,490 --> 00:42:03,990 [UNINTELLIGIBLE]. 970 00:42:03,990 --> 00:42:05,060 Let me give you an example. 971 00:42:05,060 --> 00:42:07,580 This kind of demonstrates what's going on. 972 00:42:07,580 --> 00:42:08,170 OK? 973 00:42:08,170 --> 00:42:10,500 And when you edit, you say look, this is where you 974 00:42:10,500 --> 00:42:10,910 [UNINTELLIGIBLE] 975 00:42:10,910 --> 00:42:12,590 complicated. 976 00:42:12,590 --> 00:42:14,380 So in order to give you an example, let me 977 00:42:14,380 --> 00:42:15,876 look at this program. 978 00:42:15,876 --> 00:42:17,150 I have a simple program here.
979 00:42:17,150 --> 00:42:19,210 A[i] equals A[i] plus one. 980 00:42:19,210 --> 00:42:19,990 My iterations-- 981 00:42:19,990 --> 00:42:21,570 I'm running five iterations in here. 982 00:42:21,570 --> 00:42:23,420 So this is my iteration space. 983 00:42:23,420 --> 00:42:25,590 I have a large array, so this is my data space. 984 00:42:28,090 --> 00:42:29,990 And now, I keep running this program. 985 00:42:29,990 --> 00:42:32,800 So what happens is this is time going down in here. 986 00:42:32,800 --> 00:42:35,930 So in the first iteration, basically, I first read and 987 00:42:35,930 --> 00:42:36,560 then write. 988 00:42:36,560 --> 00:42:39,350 Same in the second iteration, I read and write. 989 00:42:39,350 --> 00:42:40,700 Third iteration read and write. 990 00:42:40,700 --> 00:42:42,770 Fourth iteration, read and write. 991 00:42:42,770 --> 00:42:44,610 Do you see how this is going on in these four iterations? 992 00:42:44,610 --> 00:42:46,050 Second iteration, third iteration, fourth iteration, 993 00:42:46,050 --> 00:42:48,115 [UNINTELLIGIBLE]. 994 00:42:48,115 --> 00:42:48,580 OK. 995 00:42:48,580 --> 00:42:50,710 So what happens is the first iteration reads 996 00:42:50,710 --> 00:42:53,810 this value, A0. 997 00:42:53,810 --> 00:42:56,170 And it writes the value A0 when it is writing. 998 00:42:56,170 --> 00:43:00,930 Second iteration A1, A1, A2, A2, A3, A3. 999 00:43:00,930 --> 00:43:07,290 So now, when this is writing, there's a dependence 1000 00:43:07,290 --> 00:43:08,160 between these two. 1001 00:43:08,160 --> 00:43:10,150 Is it a true, anti, or output dependence 1002 00:43:10,150 --> 00:43:11,400 between these two? 1003 00:43:15,545 --> 00:43:18,270 What type of dependence do we have? 1004 00:43:18,270 --> 00:43:19,520 [UNINTELLIGIBLE] dependence. 1005 00:43:24,240 --> 00:43:26,340 True dependence is what? 1006 00:43:26,340 --> 00:43:28,780 What to what? 1007 00:43:28,780 --> 00:43:29,940 What's the first thing that occurs?
1008 00:43:29,940 --> 00:43:33,620 What's the next thing that occurs? 1009 00:43:33,620 --> 00:43:34,875 Anybody want to answer? 1010 00:43:40,820 --> 00:43:43,140 AUDIENCE: [INAUDIBLE] 1011 00:43:43,140 --> 00:43:43,900 PROFESSOR: Write to read. 1012 00:43:43,900 --> 00:43:45,720 So for a true dependence, the first thing has to be a write, then a read. 1013 00:43:45,720 --> 00:43:46,570 Watch this. 1014 00:43:46,570 --> 00:43:48,800 This is a read to write. 1015 00:43:48,800 --> 00:43:51,230 So what type of dependence is this? 1016 00:43:51,230 --> 00:43:52,240 This is an anti-dependence. 1017 00:43:52,240 --> 00:43:54,680 So here is an anti-dependence in here. 1018 00:43:54,680 --> 00:43:57,710 But the nice thing about that is this dependence didn't cross 1019 00:43:57,710 --> 00:43:58,940 the iteration boundary. 1020 00:43:58,940 --> 00:44:01,050 So these black lines are my iteration boundaries. 1021 00:44:01,050 --> 00:44:03,190 So these are four iterations that [UNINTELLIGIBLE]. 1022 00:44:03,190 --> 00:44:06,780 So there's no iteration crossing in here. 1023 00:44:06,780 --> 00:44:09,360 You can kind of [UNINTELLIGIBLE] it using each 1024 00:44:09,360 --> 00:44:13,590 of these iterations and my dependence is within the very 1025 00:44:13,590 --> 00:44:14,360 same iteration. 1026 00:44:14,360 --> 00:44:18,116 So in the same iteration I have a dependence [UNINTELLIGIBLE]. 1027 00:44:18,116 --> 00:44:18,590 OK? 1028 00:44:18,590 --> 00:44:19,910 This is a simpler case. 1029 00:44:19,910 --> 00:44:24,060 So let's look at something a little bit more complicated. 1030 00:44:24,060 --> 00:44:28,560 So I have A[i+1] equals A[i] plus 1. 1031 00:44:28,560 --> 00:44:32,430 So what happens is first I am reading A[i]. 1032 00:44:32,430 --> 00:44:39,270 And then, I am writing A[i+1] in the same iteration. 1033 00:44:39,270 --> 00:44:41,100 The next iteration, I am reading now A[i] 1034 00:44:41,100 --> 00:44:42,050 [UNINTELLIGIBLE] 1035 00:44:42,050 --> 00:44:44,250 this is A0 and 1.
1036 00:44:44,250 --> 00:44:45,890 This is A1. 1037 00:44:45,890 --> 00:44:46,410 [UNINTELLIGIBLE] 1038 00:44:46,410 --> 00:44:47,880 I am writing A[i+1]. 1039 00:44:47,880 --> 00:44:49,130 I am writing 2. 1040 00:44:51,610 --> 00:44:54,070 So I have a dependence like this now. 1041 00:44:54,070 --> 00:44:55,320 What type of dependence is this? 1042 00:44:58,190 --> 00:45:00,560 This is a true dependence because I am writing. 1043 00:45:00,560 --> 00:45:04,610 And this is actually reading what it is writing. 1044 00:45:04,610 --> 00:45:08,720 So does this look parallel? 1045 00:45:08,720 --> 00:45:09,590 No. 1046 00:45:09,590 --> 00:45:13,790 Because what happens is each iteration depends 1047 00:45:13,790 --> 00:45:15,680 on the previous iteration. 1048 00:45:15,680 --> 00:45:18,040 So you have to actually have this dependence going back and 1049 00:45:18,040 --> 00:45:20,890 forth, back and forth in here. 1050 00:45:20,890 --> 00:45:23,550 So let's look at a couple of other things. 1051 00:45:23,550 --> 00:45:26,830 So here is A[i] equals A[i+2]. 1052 00:45:26,830 --> 00:45:30,660 So I am basically reading A[i+2]. 1053 00:45:30,660 --> 00:45:31,670 So I am reading this one. 1054 00:45:31,670 --> 00:45:33,220 I am writing this one. 1055 00:45:33,220 --> 00:45:33,810 Reading this one. 1056 00:45:33,810 --> 00:45:36,060 Writing this one. 1057 00:45:36,060 --> 00:45:38,040 Here is my dependence that's in here. 1058 00:45:38,040 --> 00:45:39,620 Is it true or anti in here? 1059 00:45:42,210 --> 00:45:44,150 This is an anti-dependence because I am going from a 1060 00:45:44,150 --> 00:45:47,160 read to a write in here. 1061 00:45:47,160 --> 00:45:48,510 Can this loop be parallel? 1062 00:45:55,240 --> 00:45:57,744 Can this loop run parallel? 1063 00:45:57,744 --> 00:46:00,440 AUDIENCE: [INAUDIBLE] 1064 00:46:00,440 --> 00:46:04,090 PROFESSOR: So can every iteration run parallel?
1065 00:46:04,090 --> 00:46:05,030 There could be basically. 1066 00:46:05,030 --> 00:46:07,590 No because what happens is if you look at that, there's a 1067 00:46:07,590 --> 00:46:09,690 dependence that goes like this. 1068 00:46:09,690 --> 00:46:11,060 And of course, there are two chains. 1069 00:46:11,060 --> 00:46:14,170 So if you are interested, you can run at least two-way 1070 00:46:14,170 --> 00:46:15,210 parallelism. 1071 00:46:15,210 --> 00:46:18,200 You can run one chain parallel to another chain, so you do 1072 00:46:18,200 --> 00:46:21,130 get that much parallelism. 1073 00:46:21,130 --> 00:46:22,380 How about this one? 1074 00:46:24,740 --> 00:46:26,970 2i and 2i plus 1. 1075 00:46:26,970 --> 00:46:28,220 [UNINTELLIGIBLE] 1076 00:46:29,890 --> 00:46:31,140 Is there a dependence in here? 1077 00:46:34,740 --> 00:46:36,120 Nope because one is-- 1078 00:46:36,120 --> 00:46:38,300 you are reading odd elements and writing even 1079 00:46:38,300 --> 00:46:39,210 elements. [UNINTELLIGIBLE] 1080 00:46:39,210 --> 00:46:39,730 dependence. 1081 00:46:39,730 --> 00:46:42,100 So you can run this in parallel. 1082 00:46:42,100 --> 00:46:43,230 OK? 1083 00:46:43,230 --> 00:46:46,210 So this is the kind of interesting thing 1084 00:46:46,210 --> 00:46:46,890 that is going on. 1085 00:46:46,890 --> 00:46:51,150 So next, I want to look at something a little bit more 1086 00:46:51,150 --> 00:46:52,130 complicated. 1087 00:46:52,130 --> 00:46:54,400 So let's look at this. 1088 00:46:54,400 --> 00:46:59,850 So here's a classic algorithm called successive over- 1089 00:46:59,850 --> 00:47:01,320 relaxation. 1090 00:47:01,320 --> 00:47:05,050 So it kind of simulates a lot of times things like heat flow 1091 00:47:05,050 --> 00:47:06,460 through a plane. 1092 00:47:06,460 --> 00:47:08,830 So the idea there is-- 1093 00:47:08,830 --> 00:47:10,500 let me illustrate what it does. 1094 00:47:10,500 --> 00:47:16,750 So assume you have a big metal sheet.
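Going back for a moment to the loop A[i] = A[i+2]: the two-chain observation can be sketched in C as follows. This is an illustration under assumptions, not code from the slides; the names shift_seq, shift_chain and shift_two_chains are hypothetical. Each chain touches only indices of one parity, so the two calls are independent of each other and could be handed to two threads (for example with OpenMP sections); here they are simply called one after the other, and the result matches the sequential loop.

```c
#include <assert.h>
#include <stddef.h>

/* Sequential reference: a[i] = a[i+2] for every i with i + 2 < n. */
void shift_seq(int *a, size_t n)
{
    assert(a != NULL);
    for (size_t i = 0; i + 2 < n; i++)
        a[i] = a[i + 2];
}

/* One dependence chain: start, start + 2, start + 4, ...
 * It reads and writes only elements whose index has the parity of start. */
static void shift_chain(int *a, size_t n, size_t start)
{
    for (size_t i = start; i + 2 < n; i += 2)
        a[i] = a[i + 2];
}

/* The even chain and the odd chain never touch the same element,
 * so these two calls could run on two different threads. */
void shift_two_chains(int *a, size_t n)
{
    assert(a != NULL);
    shift_chain(a, n, 0);   /* even indices */
    shift_chain(a, n, 1);   /* odd indices  */
}
```

Within one chain the iterations still have to run in order, because of the anti-dependence from the read of a[i+2] to the later write of a[i+2].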
1095 00:47:16,750 --> 00:47:20,790 And you put some kind of heat source in one place. 1096 00:47:20,790 --> 00:47:23,640 And after some time, it all reaches a steady state. 1097 00:47:23,640 --> 00:47:24,920 The other side might be cold. 1098 00:47:24,920 --> 00:47:28,190 And you want to know the temperature of each part of the sheet. 1099 00:47:31,070 --> 00:47:34,700 Because temperature can leak out. 1100 00:47:34,700 --> 00:47:40,515 And there are more things like you have a heat source and 1101 00:47:40,515 --> 00:47:44,480 others that [UNINTELLIGIBLE] to work a glass of water or 1102 00:47:44,480 --> 00:47:45,300 some kind of a sink. 1103 00:47:45,300 --> 00:47:47,650 So what is the heat distribution? 1104 00:47:47,650 --> 00:47:51,120 So one way to compute that is basically the same 1105 00:47:51,120 --> 00:47:51,550 [UNINTELLIGIBLE]. 1106 00:47:51,550 --> 00:47:55,230 The heat value here is basically the average around 1107 00:47:55,230 --> 00:47:59,690 all these other values right now. 1108 00:47:59,690 --> 00:48:01,700 Because if something is too hot, the heat is going to 1109 00:48:01,700 --> 00:48:03,040 propagate. If something is too cold, 1110 00:48:03,040 --> 00:48:04,800 the heat is going to propagate, because it kind of has to be the 1111 00:48:04,800 --> 00:48:06,820 average around that. 1112 00:48:06,820 --> 00:48:07,840 Then, you take the average in here. 1113 00:48:07,840 --> 00:48:09,260 So what it's doing is calculating 1114 00:48:09,260 --> 00:48:11,200 the average in here. 1115 00:48:11,200 --> 00:48:12,810 And then, you have to do it many, many times. 1116 00:48:12,810 --> 00:48:14,480 So if you have a heat source, at that point, it 1117 00:48:14,480 --> 00:48:15,300 would be very hot. 1118 00:48:15,300 --> 00:48:17,790 And then, it will start propagating slowly and kind of 1119 00:48:17,790 --> 00:48:18,810 propagate down.
1120 00:48:18,810 --> 00:48:21,890 And the cold side in this way or after running many times, 1121 00:48:21,890 --> 00:48:23,180 it basically stabilizes. 1122 00:48:23,180 --> 00:48:25,350 And at that point, you have the kind of heat distribution 1123 00:48:25,350 --> 00:48:26,180 that we [UNINTELLIGIBLE] have. 1124 00:48:26,180 --> 00:48:28,220 This is the kind of calculation you do. 1125 00:48:28,220 --> 00:48:30,070 So this is the calculation. 1126 00:48:30,070 --> 00:48:31,410 So what you're doing is calculating this. 1127 00:48:31,410 --> 00:48:35,220 You are creating this, this, this, and this 1128 00:48:35,220 --> 00:48:36,370 and updating that. 1129 00:48:36,370 --> 00:48:38,480 And then, you do it for t time stamps. 1130 00:48:38,480 --> 00:48:41,380 So you just go around doing each of these things first and 1131 00:48:41,380 --> 00:48:44,284 doing it for t time stamps. 1132 00:48:44,284 --> 00:48:44,740 OK? 1133 00:48:44,740 --> 00:48:46,608 So we would like to run this parallel. 1134 00:48:49,620 --> 00:48:53,650 So here's my basically data space. 1135 00:48:53,650 --> 00:48:55,270 There's my data items. 1136 00:48:55,270 --> 00:48:56,980 So here's my array, two-dimensional array. 1137 00:48:56,980 --> 00:48:58,540 So this is how I'm trying to update. 1138 00:48:58,540 --> 00:49:00,170 I'm reading all this file. 1139 00:49:00,170 --> 00:49:01,760 So here's my iteration space. 1140 00:49:01,760 --> 00:49:03,050 So what I have looked at this. 1141 00:49:03,050 --> 00:49:04,590 I don't want to-- 1142 00:49:04,590 --> 00:49:06,400 it's hard to give you a 3D diagram. 1143 00:49:06,400 --> 00:49:08,190 I don't have a 3D projector. 1144 00:49:08,190 --> 00:49:11,160 So what I'm showing here is three-dimension here. 1145 00:49:11,160 --> 00:49:13,850 So this is the previous iteration, first iteration. 1146 00:49:13,850 --> 00:49:15,990 So if I still go tij. 
1147 00:49:15,990 --> 00:49:19,800 So you go through t, and then you go through i in here, and 1148 00:49:19,800 --> 00:49:21,560 then, when you're done, you go to the [UNINTELLIGIBLE] 1149 00:49:21,560 --> 00:49:23,440 iteration and you go this way. 1150 00:49:23,440 --> 00:49:24,820 So here's how you would iterate. 1151 00:49:24,820 --> 00:49:27,300 So you run this one, this one, this one, this one, this one, 1152 00:49:27,300 --> 00:49:28,530 this one, this one, this one, this one. 1153 00:49:28,530 --> 00:49:31,500 And then increase t by 1, and go like this. 1154 00:49:31,500 --> 00:49:35,370 And right now, we are here. 1155 00:49:35,370 --> 00:49:37,140 We are trying to update this one. 1156 00:49:37,140 --> 00:49:39,080 That's what we are trying to do. 1157 00:49:39,080 --> 00:49:42,570 And that means we are already finished up to this point. 1158 00:49:42,570 --> 00:49:45,760 All these points are finished up. 1159 00:49:45,760 --> 00:49:50,160 Now, what we have to do is figure out when I'm reading, 1160 00:49:50,160 --> 00:49:53,480 who actually wrote this value. 1161 00:49:53,480 --> 00:49:53,920 OK? 1162 00:49:53,920 --> 00:49:56,830 First of all, let's figure out which iterations might be able 1163 00:49:56,830 --> 00:49:58,430 to write this value. 1164 00:49:58,430 --> 00:50:04,460 So if you look at this value, this 1165 00:50:04,460 --> 00:50:06,110 relationship in between here. 1166 00:50:06,110 --> 00:50:09,060 This one, basically, is ij. 1167 00:50:09,060 --> 00:50:11,030 And this is ij, ij, ij. 1168 00:50:11,030 --> 00:50:13,770 These three iterations can write this one. 1169 00:50:13,770 --> 00:50:17,650 So and these iterations can write this one. 1170 00:50:17,650 --> 00:50:19,770 Let me go to this one. 1171 00:50:19,770 --> 00:50:22,480 This is a pretty darn complicated [UNINTELLIGIBLE]. 1172 00:50:22,480 --> 00:50:26,650 So what that means is in this one, this one 1173 00:50:26,650 --> 00:50:28,870 already wrote something. 
1174 00:50:28,870 --> 00:50:30,990 This is what I'm reading in here. 1175 00:50:30,990 --> 00:50:32,500 This one already wrote something. 1176 00:50:32,500 --> 00:50:33,355 This is what I'm reading here. 1177 00:50:33,355 --> 00:50:34,390 This iteration wrote something. 1178 00:50:34,390 --> 00:50:35,930 I read it here. 1179 00:50:35,930 --> 00:50:36,200 OK. 1180 00:50:36,200 --> 00:50:38,850 Everybody following so far? 1181 00:50:38,850 --> 00:50:40,170 How about this, guys? 1182 00:50:40,170 --> 00:50:42,665 Who wrote the value I am reading in these iterations? 1183 00:50:47,530 --> 00:50:50,300 In this one, I haven't reached there yet. 1184 00:50:50,300 --> 00:50:53,130 So who has written that? 1185 00:50:53,130 --> 00:50:56,812 So I assume this is t equals 1. 1186 00:50:56,812 --> 00:50:57,630 [UNINTELLIGIBLE] 1187 00:50:57,630 --> 00:50:59,260 somebody has to write those things. 1188 00:50:59,260 --> 00:51:03,150 So what that means is this also wrote all of those values 1189 00:51:03,150 --> 00:51:04,520 because I have done those iterations. 1190 00:51:04,520 --> 00:51:07,190 But the interesting thing is some of these values got 1191 00:51:07,190 --> 00:51:07,630 overwritten. 1192 00:51:07,630 --> 00:51:09,650 This value got overwritten , this value got overwritten. 1193 00:51:09,650 --> 00:51:13,310 So these two values disappear. 1194 00:51:13,310 --> 00:51:15,720 This value got overwritten by this guy. 1195 00:51:15,720 --> 00:51:20,080 This value got overwritten by this guy. 1196 00:51:20,080 --> 00:51:20,280 OK. 1197 00:51:20,280 --> 00:51:22,400 But we haven't overwritten this value, this value, and 1198 00:51:22,400 --> 00:51:23,310 this value yet. 1199 00:51:23,310 --> 00:51:25,850 This one, basically, I've just updated. 1200 00:51:25,850 --> 00:51:26,620 But I [UNINTELLIGIBLE] 1201 00:51:26,620 --> 00:51:28,894 this one. 1202 00:51:28,894 --> 00:51:30,676 Do you see this? 1203 00:51:30,676 --> 00:51:33,954 Is everybody following me? 
1204 00:51:33,954 --> 00:51:34,906 AUDIENCE: Once again, sir. 1205 00:51:34,906 --> 00:51:36,810 I got lost. 1206 00:51:36,810 --> 00:51:40,630 So what are [INAUDIBLE] 1207 00:51:40,630 --> 00:51:42,640 PROFESSOR: So what happens is I am trying to update in this 1208 00:51:42,640 --> 00:51:46,780 iteration because this array get rid of multiple times. 1209 00:51:46,780 --> 00:51:50,200 But in each iteration, you are only doing one update. 1210 00:51:50,200 --> 00:51:52,220 So I am trying to read and write in here. 1211 00:51:52,220 --> 00:51:54,380 So I need to read all of these five 1212 00:51:54,380 --> 00:51:56,010 elements in this iteration. 1213 00:51:56,010 --> 00:51:58,450 So I want to figure out who wrote that. 1214 00:51:58,450 --> 00:51:59,550 OK? 1215 00:51:59,550 --> 00:52:03,180 This one can be written by this guy and this iteration. 1216 00:52:03,180 --> 00:52:05,440 Could this iteration write its value in here? 1217 00:52:08,110 --> 00:52:09,000 OK? 1218 00:52:09,000 --> 00:52:09,420 [UNINTELLIGIBLE] 1219 00:52:09,420 --> 00:52:11,990 This iteration write because we see it's writing ij. 1220 00:52:11,990 --> 00:52:16,890 I mean my diagram is not that great because I have three in 1221 00:52:16,890 --> 00:52:18,660 here and five in here. 1222 00:52:18,660 --> 00:52:19,890 So just bear with me on that. 1223 00:52:19,890 --> 00:52:21,395 So assume I am writing ij in here. 1224 00:52:24,760 --> 00:52:28,150 So my iterations go from 1 to n, but my data goes from 0 to 1225 00:52:28,150 --> 00:52:29,260 n plus 1, basically. 1226 00:52:29,260 --> 00:52:30,990 1 to n minus 1 iterations. 1227 00:52:30,990 --> 00:52:32,270 0 to n plus 1 data. 1228 00:52:32,270 --> 00:52:35,160 So data is bigger than iteration space because of 1229 00:52:35,160 --> 00:52:36,290 [UNINTELLIGIBLE]. 1230 00:52:36,290 --> 00:52:40,740 So what happens is when I'm in this iteration, we'll say this 1231 00:52:40,740 --> 00:52:44,770 is 1 2 iteration. 
1232 00:52:44,770 --> 00:52:47,590 I will write this value. 1233 00:52:47,590 --> 00:52:52,370 This iteration will also write this value. 1234 00:52:52,370 --> 00:52:53,075 OK? 1235 00:52:53,075 --> 00:52:54,730 You see that? 1236 00:52:54,730 --> 00:52:56,400 All of these iterations are the same. 1237 00:52:56,400 --> 00:52:58,170 This iteration will also write this value. 1238 00:52:58,170 --> 00:53:00,880 But right now, who is the last guy who wrote it? 1239 00:53:00,880 --> 00:53:02,400 The last guy that wrote it is this guy. 1240 00:53:05,125 --> 00:53:07,910 Because this iteration wrote it, and after that, it got 1241 00:53:07,910 --> 00:53:09,060 overwritten in this one. 1242 00:53:09,060 --> 00:53:11,330 But this one hadn't occurred yet, so it hadn't been overwritten 1243 00:53:11,330 --> 00:53:11,900 by this guy. 1244 00:53:11,900 --> 00:53:15,670 So the last guy who wrote it was this one. 1245 00:53:15,670 --> 00:53:16,820 So that's why I had to eliminate this. 1246 00:53:16,820 --> 00:53:19,630 But this data value-- 1247 00:53:19,630 --> 00:53:21,400 I haven't executed this iteration yet. 1248 00:53:21,400 --> 00:53:24,260 So nobody had written this one in this time stamp. 1249 00:53:24,260 --> 00:53:26,310 So it has to be from the previous time stamp. 1250 00:53:26,310 --> 00:53:32,300 So I read two values from the current time stamp, three 1251 00:53:32,300 --> 00:53:33,870 values from the previous time stamp. 1252 00:53:33,870 --> 00:53:35,120 These three values have to come from the 1253 00:53:35,120 --> 00:53:35,980 previous time stamp. 1254 00:53:35,980 --> 00:53:38,670 These two values come from the current time stamp. 1255 00:53:38,670 --> 00:53:39,830 You see that? 1256 00:53:39,830 --> 00:53:40,105 OK. 1257 00:53:40,105 --> 00:53:41,350 Good. 1258 00:53:41,350 --> 00:53:45,250 So what that means is because dependence means-- 1259 00:53:45,250 --> 00:53:47,500 OK. 1260 00:53:47,500 --> 00:53:50,455 This line, this dark, red line.
1261 00:53:50,455 --> 00:53:51,160 See. 1262 00:53:51,160 --> 00:53:54,975 I am reading a value in the current iteration that was 1263 00:53:54,975 --> 00:53:56,070 written by this iteration. 1264 00:53:56,070 --> 00:53:58,210 So that means I have a dependence between these two 1265 00:53:58,210 --> 00:54:00,670 iterations. 1266 00:54:00,670 --> 00:54:00,960 OK. 1267 00:54:00,960 --> 00:54:03,930 This line, this dark, red line. 1268 00:54:03,930 --> 00:54:06,020 I am reading a value written by this iteration. 1269 00:54:06,020 --> 00:54:07,870 So I have a dependence in here. 1270 00:54:07,870 --> 00:54:10,070 This line means I have a dependence between this 1271 00:54:10,070 --> 00:54:11,750 iteration and the current one. 1272 00:54:11,750 --> 00:54:13,310 This line means I have a dependence between this 1273 00:54:13,310 --> 00:54:14,540 iteration and the current one. 1274 00:54:14,540 --> 00:54:15,980 This line means I have a dependence between this 1275 00:54:15,980 --> 00:54:18,980 iteration and the current one. 1276 00:54:18,980 --> 00:54:20,230 You see that? 1277 00:54:22,480 --> 00:54:23,320 OK. 1278 00:54:23,320 --> 00:54:28,560 So now, I want to see how we can parallelize this loop. 1279 00:54:28,560 --> 00:54:30,380 So what can I do? 1280 00:54:30,380 --> 00:54:32,200 So I look at all these dependences. 1281 00:54:32,200 --> 00:54:34,720 At this point, I don't have to think about all this, about who 1282 00:54:34,720 --> 00:54:34,910 wrote what. 1283 00:54:34,910 --> 00:54:37,840 I can say this is the dependence. 1284 00:54:37,840 --> 00:54:43,310 In order to do this iteration, all these iterations have to 1285 00:54:43,310 --> 00:54:47,440 be done because I am using the values produced by them. 1286 00:54:47,440 --> 00:54:49,740 So these have to be finished before I can compute that. 1287 00:54:49,740 --> 00:54:51,760 So parallelism means I try to 1288 00:54:51,760 --> 00:54:54,430 do things in parallel.
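A minimal sketch of the in-place update whose dependences were just traced. Assumptions: a plain five-point average (the exact formula on the slide, including any relaxation factor, is not recoverable from the transcript), a square grid with a fixed one-cell boundary stored row-major, and a hypothetical function name sor_inplace. The comments mark which neighbours were already written in the current time step and which still hold values from the previous one.

```c
#include <assert.h>
#include <stddef.h>

/* In-place sweep over the n x n interior of an (n+2) x (n+2) row-major grid.
 * Assumption: each point becomes the average of itself and its four
 * neighbours; boundary cells are never written. */
void sor_inplace(double *a, size_t n, int steps)
{
    assert(a != NULL && n >= 1);
    size_t w = n + 2;                               /* row width including boundary */

    for (int t = 0; t < steps; t++) {
        for (size_t i = 1; i <= n; i++) {
            for (size_t j = 1; j <= n; j++) {
                double up    = a[(i - 1) * w + j];  /* current time step (or fixed boundary) */
                double left  = a[i * w + (j - 1)];  /* current time step (or fixed boundary) */
                double self  = a[i * w + j];        /* previous time step */
                double right = a[i * w + (j + 1)];  /* previous time step (or fixed boundary) */
                double down  = a[(i + 1) * w + j];  /* previous time step (or fixed boundary) */
                a[i * w + j] = (up + left + self + right + down) / 5.0;
            }
        }
    }
}
```

Because the update is in place, point (i, j) sees the new values of (i-1, j) and (i, j-1) and the old values of (i, j), (i, j+1) and (i+1, j), which is the "two from the current time stamp, three from the previous" pattern described above.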
1289 00:54:54,430 --> 00:54:56,900 So can we parallelize this loop? 1290 00:55:00,120 --> 00:55:01,630 Can we run each time stamp separately? 1291 00:55:05,110 --> 00:55:08,940 No because I am using these three values from the previous 1292 00:55:08,940 --> 00:55:10,100 time stamp. 1293 00:55:10,100 --> 00:55:14,030 So I can't run this time stamp, t equals 1, until t 1294 00:55:14,030 --> 00:55:15,670 equals 0 is done. 1295 00:55:15,670 --> 00:55:16,460 Or t plus 2 [UNINTELLIGIBLE] 1296 00:55:16,460 --> 00:55:18,700 t plus 1 is done. 1297 00:55:18,700 --> 00:55:19,030 OK? 1298 00:55:19,030 --> 00:55:22,690 So I can't parallelize this loop. 1299 00:55:22,690 --> 00:55:22,930 OK. 1300 00:55:22,930 --> 00:55:26,320 Can I parallelize this loop? 1301 00:55:26,320 --> 00:55:27,850 Why? 1302 00:55:27,850 --> 00:55:31,000 Which dependence will stop me from parallelizing this one? 1303 00:55:36,070 --> 00:55:37,710 So I'm looking at i. 1304 00:55:37,710 --> 00:55:40,180 This is my i dimension. 1305 00:55:40,180 --> 00:55:41,720 How many lines? At least tell me that. 1306 00:55:41,720 --> 00:55:43,370 How many dependences are going to stop 1307 00:55:43,370 --> 00:55:46,850 me from doing that? 1308 00:55:46,850 --> 00:55:47,160 OK. 1309 00:55:47,160 --> 00:55:47,510 Good. 1310 00:55:47,510 --> 00:55:48,180 I have [UNINTELLIGIBLE]. 1311 00:55:48,180 --> 00:55:49,290 Somebody says three. 1312 00:55:49,290 --> 00:55:50,750 Somebody says one. 1313 00:55:50,750 --> 00:55:51,030 OK. 1314 00:55:51,030 --> 00:55:51,930 Let's get a vote. 1315 00:55:51,930 --> 00:55:53,545 How many people think it's three? 1316 00:55:56,390 --> 00:55:56,630 OK. 1317 00:55:56,630 --> 00:55:58,760 There's one vote for three. 1318 00:55:58,760 --> 00:56:00,060 How many people think it's three? 1319 00:56:00,060 --> 00:56:01,310 How many people think it's one? 1320 00:56:03,770 --> 00:56:04,530 Wait a minute. 1321 00:56:04,530 --> 00:56:08,100 One vote for three and two votes for one?
1322 00:56:08,100 --> 00:56:08,350 OK. 1323 00:56:08,350 --> 00:56:10,100 Where's the rest? 1324 00:56:10,100 --> 00:56:10,930 For two? 1325 00:56:10,930 --> 00:56:12,110 For 0? 1326 00:56:12,110 --> 00:56:13,480 Can't be 0 if the 0 is parallel. 1327 00:56:13,480 --> 00:56:14,300 OK. 1328 00:56:14,300 --> 00:56:16,150 So we'll start parallelizing. 1329 00:56:16,150 --> 00:56:16,750 OK. 1330 00:56:16,750 --> 00:56:18,960 So what happens in here? 1331 00:56:18,960 --> 00:56:21,210 Right now, this is actually one. 1332 00:56:21,210 --> 00:56:22,660 This one. 1333 00:56:22,660 --> 00:56:26,900 Because these things don't participate because this has 1334 00:56:26,900 --> 00:56:28,350 already happened. 1335 00:56:28,350 --> 00:56:33,030 When you go to ij iterations, these are already done. 1336 00:56:33,030 --> 00:56:34,170 So you're going from t. 1337 00:56:34,170 --> 00:56:36,160 So you're looking at the current iterations because 1338 00:56:36,160 --> 00:56:38,680 you're ending in two loops. 1339 00:56:38,680 --> 00:56:39,740 So the t is done. 1340 00:56:39,740 --> 00:56:41,720 So these all are already done when you go try 1341 00:56:41,720 --> 00:56:42,490 to parallelize sides. 1342 00:56:42,490 --> 00:56:44,740 So I don't have to worry about these three. 1343 00:56:44,740 --> 00:56:48,150 In here because actually I'm losing t of something here, I 1344 00:56:48,150 --> 00:56:51,480 am in trouble. 1345 00:56:51,480 --> 00:56:56,410 So when you go look at this one, I have this one. 1346 00:56:56,410 --> 00:56:59,020 So every dimension has a dependence in here. 1347 00:56:59,020 --> 00:57:00,750 So I can't run it in parallel. 1348 00:57:00,750 --> 00:57:02,740 So does this mean that there's no parallelism? 1349 00:57:08,160 --> 00:57:09,410 Who think there's no parallelism? 1350 00:57:12,410 --> 00:57:13,340 Who thinks there is? 1351 00:57:13,340 --> 00:57:14,370 Oh, somebody thinks there's no parallelism. 
1352 00:57:14,370 --> 00:57:16,680 Who thinks there's parallelism? 1353 00:57:16,680 --> 00:57:17,170 OK. 1354 00:57:17,170 --> 00:57:18,080 More people think there's parallelism. 1355 00:57:18,080 --> 00:57:20,090 Let's see what we can do. 1356 00:57:20,090 --> 00:57:21,889 Question? 1357 00:57:21,889 --> 00:57:24,354 AUDIENCE: Do you really think [INAUDIBLE] 1358 00:57:34,214 --> 00:57:36,679 I'm trying to figure out how to word this. 1359 00:57:36,679 --> 00:57:38,158 Do you really want to have 1360 00:57:38,158 --> 00:57:39,144 dependence on the same concept? 1361 00:57:39,144 --> 00:57:40,394 [INAUDIBLE]? 1362 00:57:43,620 --> 00:57:43,880 PROFESSOR: Yeah. 1363 00:57:43,880 --> 00:57:45,730 I mean you can do-- 1364 00:57:45,730 --> 00:57:48,810 this is the way this SOR is written, so there's a 1365 00:57:48,810 --> 00:57:50,420 dependence between time stamps. 1366 00:57:50,420 --> 00:57:51,750 There's another SOR. 1367 00:57:51,750 --> 00:57:53,380 What they do is kind of a red-black. 1368 00:57:53,380 --> 00:57:56,400 So when you calculate the next time stamp, you 1369 00:57:56,400 --> 00:57:57,760 write it into a completely new array. 1370 00:57:57,760 --> 00:57:58,560 So there's no dependence. 1371 00:57:58,560 --> 00:58:00,900 So that's a different algorithm. 1372 00:58:00,900 --> 00:58:03,800 This algorithm, basically, uses some values from 1373 00:58:03,800 --> 00:58:05,730 [UNINTELLIGIBLE] because the value-- the algorithm you're 1374 00:58:05,730 --> 00:58:06,985 talking about-- you already created the other copies. 1375 00:58:06,985 --> 00:58:07,890 You had two copies. 1376 00:58:07,890 --> 00:58:09,170 You're bouncing back and forth. 1377 00:58:09,170 --> 00:58:09,530 Nice. 1378 00:58:09,530 --> 00:58:11,100 No real problem in here. 1379 00:58:11,100 --> 00:58:12,760 But then you had to have twice the amount of storage. 1380 00:58:12,760 --> 00:58:14,320 Here, you are updating in place.
1381 00:58:14,320 --> 00:58:17,740 And since this is kind of running enough iterations 1382 00:58:17,740 --> 00:58:23,190 until it converges, it doesn't seem to matter that the 1383 00:58:23,190 --> 00:58:24,440 [UNINTELLIGIBLE PHRASE]. 1384 00:58:26,520 --> 00:58:27,280 OK. 1385 00:58:27,280 --> 00:58:32,380 So we cannot find a loop, what we call doall loop. 1386 00:58:32,380 --> 00:58:35,380 The doall loop means there's no loop carried dependences. 1387 00:58:35,380 --> 00:58:36,930 It's fully parallel. 1388 00:58:36,930 --> 00:58:38,900 This is the best case. 1389 00:58:38,900 --> 00:58:41,470 So what happens is when you get there, 1390 00:58:41,470 --> 00:58:42,620 everybody can run parallel. 1391 00:58:42,620 --> 00:58:44,770 And when you're done, you can stop and then do that. 1392 00:58:44,770 --> 00:58:46,540 So this is the doall loop. 1393 00:58:46,540 --> 00:58:47,560 Of course, there's no doall loop. 1394 00:58:47,560 --> 00:58:48,830 We can look at every dimension. 1395 00:58:48,830 --> 00:58:50,290 We had some kind of dependence. 1396 00:58:50,290 --> 00:58:53,430 So there's another choice, what we call doacross loop. 1397 00:58:53,430 --> 00:58:57,200 What that means is we have some loop carried dependence. 1398 00:58:57,200 --> 00:58:58,760 There's something I have to use for 1399 00:58:58,760 --> 00:59:00,320 the previous iteration. 1400 00:59:00,320 --> 00:59:01,910 But it's only one thing. 1401 00:59:01,910 --> 00:59:05,150 I have a lot of other things I can run around that only I 1402 00:59:05,150 --> 00:59:06,190 just have to wait one thing. 1403 00:59:06,190 --> 00:59:06,820 One is done. 1404 00:59:06,820 --> 00:59:08,190 I can just keep running. 1405 00:59:08,190 --> 00:59:11,020 And if I calculate and send this one early, then I can do 1406 00:59:11,020 --> 00:59:13,250 my other calculations later. 1407 00:59:13,250 --> 00:59:14,160 This is not that great. 1408 00:59:14,160 --> 00:59:15,820 If you look at the difference here. 
1409 00:59:15,820 --> 00:59:19,350 This definitely has very little overhead in here. 1410 00:59:19,350 --> 00:59:21,330 This can run slow. 1411 00:59:21,330 --> 00:59:23,050 And of course, this thing gets produced very late. 1412 00:59:23,050 --> 00:59:24,790 It's [? almost ?] sequential. 1413 00:59:24,790 --> 00:59:27,890 So I hope you can just-- if the other guy wants something, 1414 00:59:27,890 --> 00:59:29,810 I can immediately send it very early. 1415 00:59:29,810 --> 00:59:31,560 And then I can run there. 1416 00:59:31,560 --> 00:59:35,600 So you can get some kind of doacross patterns in here. 1417 00:59:35,600 --> 00:59:37,690 So if you want to do this one-- 1418 00:59:37,690 --> 00:59:39,710 this is a little bit crazy in here. 1419 00:59:39,710 --> 00:59:41,580 But they'll do it in here. 1420 00:59:41,580 --> 00:59:44,260 And so what you have to do first is say, OK. 1421 00:59:44,260 --> 00:59:44,640 Look. 1422 00:59:44,640 --> 00:59:48,570 I'm running this loop, the i loop, in parallel. 1423 00:59:48,570 --> 00:59:52,410 But I have to exchange some data. 1424 00:59:52,410 --> 00:59:55,730 Before I want to run this one, I have to basically get the 1425 00:59:55,730 --> 00:59:58,060 previous i value produced. 1426 01:00:00,620 --> 01:00:00,620 And when it's done, I can say the next guy can use it. 1427 01:00:00,620 --> 01:00:02,490 So this is a very complicated one. 1428 01:00:02,490 --> 01:00:05,430 I don't want you to understand it too well. 1429 01:00:05,430 --> 01:00:09,210 So the reason I put it is to show that OK, if you want to 1430 01:00:09,210 --> 01:00:12,080 spend a week trying to really code this up and understand 1431 01:00:12,080 --> 01:00:14,230 and make sure that it works OK. 1432 01:00:14,230 --> 01:00:16,930 So you can do things like that. 1433 01:00:16,930 --> 01:00:17,910 OK? 1434 01:00:17,910 --> 01:00:18,570 Aha. 1435 01:00:18,570 --> 01:00:21,410 So this is the true voodooness. 1436 01:00:21,410 --> 01:00:23,170 OK.
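The wait-then-post pattern described above can be sketched in a few lines. This is a minimal illustration, not the code on the slide. It assumes OpenMP 4.5 or later, where `ordered(1)` with `depend(sink: ...)` and `depend(source)` expresses "wait for the previous iteration, then tell the next one it can go". The names `independent_work` and `doacross_prefix` are hypothetical, and the loop body is a stand-in for whatever each iteration really computes.

```c
#include <assert.h>

/* Hypothetical stand-in for the part of an iteration that needs nothing
   from its neighbors; any pure function of i would do. */
static double independent_work(int i) {
    return 0.25 * (double)i;
}

/* Doacross sketch: a[i] needs a[i - 1]; everything else is independent.
   The caller is assumed to have set a[0]. */
void doacross_prefix(double *a, int n) {
    assert(n >= 1);
    #pragma omp parallel for ordered(1) schedule(static, 1)
    for (int i = 1; i < n; i++) {
        double w = independent_work(i);   /* can overlap with other iterations */

        /* Wait until iteration i - 1 has posted its result. */
        #pragma omp ordered depend(sink: i - 1)
        a[i] = a[i - 1] + w;              /* the single carried dependence */

        /* Post: iteration i + 1 may now read a[i]. */
        #pragma omp ordered depend(source)
    }
}
```

Compiled without OpenMP support, the pragmas are ignored and the loop simply runs sequentially with the same result. The payoff of the doacross form depends on how much independent work sits outside the wait-and-post window; when almost everything is inside it, the loop is close to sequential, which matches the caution in the lecture.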
1437 01:00:23,170 --> 01:00:28,150 AUDIENCE: So in Cilk, if you do this with divide and 1438 01:00:28,150 --> 01:00:34,400 conquer, you can make it be what I called in the Tableau 1439 01:00:34,400 --> 01:00:35,760 construction. 1440 01:00:35,760 --> 01:00:38,820 Each layer here is basically constructing a Tableau. 1441 01:00:38,820 --> 01:00:41,078 And so if you do it with divide and conquer, you can do 1442 01:00:41,078 --> 01:00:44,670 it with a very simple recursive code. 1443 01:00:44,670 --> 01:00:49,160 But you can also do it with a loop that goes diagonally. 1444 01:00:49,160 --> 01:00:49,260 AUDIENCE: [INTERPOSING VOICES] 1445 01:00:49,260 --> 01:00:49,390 PROFESSOR: Yes. 1446 01:00:49,390 --> 01:00:50,460 I'm going to get to that. 1447 01:00:50,460 --> 01:00:52,690 That's next. 1448 01:00:52,690 --> 01:00:53,770 AUDIENCE: Sorry. 1449 01:00:53,770 --> 01:00:54,540 PROFESSOR: That's OK. 1450 01:00:54,540 --> 01:01:01,080 So the reason that I'm showing that is because this class is 1451 01:01:01,080 --> 01:01:05,210 not just about how to make the cores exactly run faster. 1452 01:01:05,210 --> 01:01:07,630 Think about algorithmic issues and stuff like that. 1453 01:01:07,630 --> 01:01:10,670 So sometimes, when you look at a problem, it looks crazy. 1454 01:01:10,670 --> 01:01:13,240 And there might be some changes you can make so that you 1455 01:01:13,240 --> 01:01:16,590 can get things to run in parallel. 1456 01:01:16,590 --> 01:01:18,890 So I'm actually not doing the diagonal. 1457 01:01:18,890 --> 01:01:21,030 I'm actually doing something very simple. 1458 01:01:21,030 --> 01:01:26,120 So what I have done here is I have all these 1459 01:01:26,120 --> 01:01:27,120 dependences in here. 1460 01:01:27,120 --> 01:01:27,660 OK? 1461 01:01:27,660 --> 01:01:34,380 So the problem here is I can't find a single [UNINTELLIGIBLE] 1462 01:01:34,380 --> 01:01:37,390 that basically has no crossing.
1463 01:01:37,390 --> 01:01:39,500 But if you look at this [UNINTELLIGIBLE] 1464 01:01:39,500 --> 01:01:41,140 diagonal here. 1465 01:01:41,140 --> 01:01:44,780 What you see is, in fact, there's nothing that crosses 1466 01:01:44,780 --> 01:01:46,030 the diagonal. 1467 01:01:48,370 --> 01:01:48,880 OK? 1468 01:01:48,880 --> 01:01:51,600 So this one basically doesn't depend on 1469 01:01:51,600 --> 01:01:52,790 this one or this one. 1470 01:01:52,790 --> 01:01:54,480 It only depends on the previous one. 1471 01:01:54,480 --> 01:01:57,230 So I can run everything on the diagonal in parallel in 1472 01:01:57,230 --> 01:01:58,560 here in this one. 1473 01:01:58,560 --> 01:02:01,247 So of course, I can't write anything [UNINTELLIGIBLE] in 1474 01:02:01,247 --> 01:02:03,260 here, but there's a cute trick you can do. 1475 01:02:03,260 --> 01:02:06,400 What you can do is you can take the iteration 1476 01:02:06,400 --> 01:02:07,910 space and skew it. 1477 01:02:10,620 --> 01:02:15,300 So what I have done is now instead of the same thing, 1478 01:02:15,300 --> 01:02:17,851 instead of this being a square, now I skewed it a 1479 01:02:17,851 --> 01:02:20,250 little bit. 1480 01:02:20,250 --> 01:02:20,790 OK? 1481 01:02:20,790 --> 01:02:28,730 So what that means is when I'm running the first i, I basically 1482 01:02:28,730 --> 01:02:29,530 don't run any here. 1483 01:02:29,530 --> 01:02:32,380 Then, I run this one and this iteration here. 1484 01:02:32,380 --> 01:02:34,220 So what I have done is I have kind of moved my iteration 1485 01:02:34,220 --> 01:02:35,090 space around. 1486 01:02:35,090 --> 01:02:38,810 Do you see how this might be? 1487 01:02:38,810 --> 01:02:42,900 So now, the interesting thing is when I skew, if I look at 1488 01:02:42,900 --> 01:02:56,635 this line, I can parallelize in this one because all the 1489 01:02:56,635 --> 01:02:58,490 dependences come from the previous iteration. 1490 01:02:58,490 --> 01:02:59,150 Am I right?
1491 01:02:59,150 --> 01:03:00,400 [UNINTELLIGIBLE] 1492 01:03:03,532 --> 01:03:04,000 Yeah. 1493 01:03:04,000 --> 01:03:05,250 I skewed it. 1494 01:03:10,864 --> 01:03:16,440 Yes, everything in here, these ones are parallel. 1495 01:03:16,440 --> 01:03:16,990 OK? 1496 01:03:16,990 --> 01:03:21,070 And any dependence comes from the previous iteration. 1497 01:03:21,070 --> 01:03:22,670 There's no current iteration in here. 1498 01:03:22,670 --> 01:03:24,580 Everything in this one is parallel. 1499 01:03:24,580 --> 01:03:26,960 So I can parallelize this. 1500 01:03:26,960 --> 01:03:29,020 So this one doesn't depend on this one or this one. 1501 01:03:29,020 --> 01:03:32,000 So this is all parallel. 1502 01:03:32,000 --> 01:03:33,570 This is a little bit more complicated. 1503 01:03:33,570 --> 01:03:36,990 So if you're interested in going deep, just go 1504 01:03:36,990 --> 01:03:38,005 stare at the slides. 1505 01:03:38,005 --> 01:03:40,460 I have the slides out there to understand how that happens. 1506 01:03:40,460 --> 01:03:43,170 So if you think about it, what I'm running here in parallel is 1507 01:03:43,170 --> 01:03:47,660 basically this diagonal in here. 1508 01:03:47,660 --> 01:03:50,870 So what happens is if you run this, this, and this in parallel, 1509 01:03:50,870 --> 01:03:51,690 there's no dependence. 1510 01:03:51,690 --> 01:03:54,600 I don't need this one or this one to run this one. 1511 01:03:54,600 --> 01:03:56,550 So I can run this, this, this, this, all 1512 01:03:56,550 --> 01:03:58,010 this diagonal in parallel. 1513 01:03:58,010 --> 01:03:59,680 But the trouble with just the diagonal is I don't have a 1514 01:03:59,680 --> 01:04:01,850 place in here to say [UNINTELLIGIBLE] 1515 01:04:01,850 --> 01:04:02,560 for a diagonal. 1516 01:04:02,560 --> 01:04:04,940 So I basically skewed it and then made a 1517 01:04:04,940 --> 01:04:06,760 diagonal into one loop.
1518 01:04:06,760 --> 01:04:12,780 So then, now what happens is basically the j 1519 01:04:12,780 --> 01:04:15,850 loop I can run in parallel. 1520 01:04:15,850 --> 01:04:17,800 This one. 1521 01:04:17,800 --> 01:04:20,450 So I can do it four [UNINTELLIGIBLE] four. 1522 01:04:20,450 --> 01:04:21,070 OK? 1523 01:04:21,070 --> 01:04:27,910 So here you found a problem that has no nice 1524 01:04:27,910 --> 01:04:28,200 parallelism. 1525 01:04:28,200 --> 01:04:31,370 But you realize there's kind of what you call a wavefront 1526 01:04:31,370 --> 01:04:32,550 going on here. 1527 01:04:32,550 --> 01:04:33,530 Wave going on here. 1528 01:04:33,530 --> 01:04:36,870 So not the given dimension, but there's another dimension 1529 01:04:36,870 --> 01:04:37,690 that you can parallelize. 1530 01:04:37,690 --> 01:04:41,090 So you kind of skewed your space to get that nice 1531 01:04:41,090 --> 01:04:41,690 [UNINTELLIGIBLE] line. 1532 01:04:41,690 --> 01:04:44,210 And you run it in parallel. 1533 01:04:44,210 --> 01:04:47,100 So that's all I have for today.
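The skewing trick can be sketched on a small grid. This is a minimal illustration under stated assumptions, not the SOR code from the slides: it assumes each cell depends only on its upper and left neighbors, which is a hypothetical stand-in for the real stencil. The names `fill_sequential` and `fill_wavefront` and the sizes `ROWS` and `COLS` are made up for the example. After skewing, the outer loop walks the wavefronts d = i + j one after another, and the inner loop covers one diagonal, which has no dependences inside it.

```c
#include <assert.h>

#define ROWS 6
#define COLS 8

/* Reference version: plain nested loops, no parallelism in either one,
   because g[i][j] needs g[i - 1][j] and g[i][j - 1]. */
void fill_sequential(double g[ROWS][COLS]) {
    for (int i = 1; i < ROWS; i++)
        for (int j = 1; j < COLS; j++)
            g[i][j] = 0.5 * (g[i - 1][j] + g[i][j - 1]);
}

/* Skewed version: the outer loop runs over wavefronts d = i + j, and the
   inner loop covers one diagonal. Both inputs of a cell sit on wavefront
   d - 1, so all cells on wavefront d are independent of each other. */
void fill_wavefront(double g[ROWS][COLS]) {
    for (int d = 2; d <= (ROWS - 1) + (COLS - 1); d++) {
        /* Clamp i so that both i and j = d - i stay inside the grid. */
        int lo = (d - (COLS - 1) > 1) ? d - (COLS - 1) : 1;
        int hi = (d - 1 < ROWS - 1) ? d - 1 : ROWS - 1;
        #pragma omp parallel for
        for (int i = lo; i <= hi; i++) {
            int j = d - i;
            assert(j >= 1 && j < COLS);
            g[i][j] = 0.5 * (g[i - 1][j] + g[i][j - 1]);
        }
    }
}
```

Row 0 and column 0 are treated as boundary values set by the caller. Both functions perform the same arithmetic on the same inputs for every cell, so they should produce identical grids. The diagonals near the corners are short, so the available parallelism ramps up and back down as the wavefront sweeps across.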