1 00:00:00,120 --> 00:00:02,500 The following content is provided under a Creative 2 00:00:02,500 --> 00:00:03,910 Commons license. 3 00:00:03,910 --> 00:00:06,950 Your support will help MIT OpenCourseWare continue to 4 00:00:06,950 --> 00:00:10,600 offer high quality educational resources for free. 5 00:00:10,600 --> 00:00:13,500 To make a donation or view additional materials from 6 00:00:13,500 --> 00:00:17,430 hundreds of MIT courses, visit MIT OpenCourseWare at 7 00:00:17,430 --> 00:00:18,680 ocw.mit.edu. 8 00:00:27,980 --> 00:00:30,590 PROFESSOR: The last time we talked about nondeterministic 9 00:00:30,590 --> 00:00:32,400 programming. 10 00:00:32,400 --> 00:00:37,290 And I think actually the mic is up pretty high. 11 00:00:37,290 --> 00:00:40,740 If we can tone that down just a little bit. 12 00:00:40,740 --> 00:00:43,410 We talked about nondeterministic programming. 13 00:00:43,410 --> 00:00:47,080 And as you recall, the rule with nondeterministic 14 00:00:47,080 --> 00:00:49,630 programming is you should never do it 15 00:00:49,630 --> 00:00:50,880 unless you have to. 16 00:00:53,800 --> 00:00:55,000 Today we're going to talk about 17 00:00:55,000 --> 00:00:56,730 synchronizing without locks. 18 00:00:56,730 --> 00:00:59,730 And it goes doubly that you should never synchronize 19 00:00:59,730 --> 00:01:04,569 without locks unless you have to. 20 00:01:04,569 --> 00:01:07,240 There are some good reasons for synchronizing without 21 00:01:07,240 --> 00:01:10,620 locks, as we'll see. 22 00:01:10,620 --> 00:01:14,910 But it, once again, becomes even more difficult to test 23 00:01:14,910 --> 00:01:19,550 correctness and to ensure that the program that you think 24 00:01:19,550 --> 00:01:21,310 you've written is, in fact, the program 25 00:01:21,310 --> 00:01:22,860 you meant to write. 26 00:01:22,860 --> 00:01:26,260 So we're going to talk about a bunch of 27 00:01:26,260 --> 00:01:27,870 really important topics. 
28 00:01:27,870 --> 00:01:30,630 The first is memory consistency. 29 00:01:30,630 --> 00:01:33,480 And then we'll talk a little bit about lock-free protocols 30 00:01:33,480 --> 00:01:36,800 and one of the problems that arises, called the ABA problem. 31 00:01:36,800 --> 00:01:40,700 And then we're going to talk about a technology that we're 32 00:01:40,700 --> 00:01:47,180 using in the Cilk++ system, which tries to make an end run 33 00:01:47,180 --> 00:01:52,180 around some of these problems and allows you to do 34 00:01:52,180 --> 00:01:55,410 synchronization without locks, with low overhead. 35 00:01:55,410 --> 00:01:59,810 But it only works in certain contexts. 36 00:01:59,810 --> 00:02:01,060 So we're going to start with memory consistency. 37 00:02:05,240 --> 00:02:09,729 So here is a very simple parallel program. 38 00:02:09,729 --> 00:02:12,870 So initially a and b are both 0. 39 00:02:12,870 --> 00:02:17,250 And processor zero moves a 1 into a. 40 00:02:17,250 --> 00:02:22,360 And then it moves whatever is in location b 41 00:02:22,360 --> 00:02:26,240 into the EBX register. 42 00:02:26,240 --> 00:02:29,070 Processor one does something complementary. 43 00:02:29,070 --> 00:02:31,050 It moves a 1 into b. 44 00:02:31,050 --> 00:02:39,040 And then it moves whatever is in a into the EAX register. 45 00:02:39,040 --> 00:02:42,090 Into the EAX register as opposed to the EBX register. 46 00:02:42,090 --> 00:02:48,870 And the question is what are the final possible values of 47 00:02:48,870 --> 00:02:53,860 EAX and EBX after both processors have executed. 48 00:02:53,860 --> 00:02:55,495 Seems like a straightforward enough question. 49 00:02:58,470 --> 00:03:03,320 What values can EAX and EBX have, depending upon-- there 50 00:03:03,320 --> 00:03:07,400 may be scheduling of when things happen and so forth. 51 00:03:07,400 --> 00:03:09,320 So it's not always going to give the same answer. 
52 00:03:09,320 --> 00:03:11,550 But the question is what's the set of answers 53 00:03:11,550 --> 00:03:12,720 that you can get? 54 00:03:12,720 --> 00:03:15,230 Well, it turns out you can't just answer this question for 55 00:03:15,230 --> 00:03:18,940 any particular machine without knowing the 56 00:03:18,940 --> 00:03:22,370 machine's memory model. 57 00:03:22,370 --> 00:03:25,940 So it depends upon how memory operations behave in the 58 00:03:25,940 --> 00:03:28,290 parallel computer system. 59 00:03:28,290 --> 00:03:31,070 And different machines have different memory models. 60 00:03:31,070 --> 00:03:35,450 And they'll give you different answers for this code. 61 00:03:35,450 --> 00:03:37,660 There'll be some answers that you get on some machines, 62 00:03:37,660 --> 00:03:39,080 different answers on different machines. 63 00:03:42,620 --> 00:03:48,930 So probably the bedrock of memory models is a model 64 00:03:48,930 --> 00:03:51,750 called sequential consistency. 65 00:03:51,750 --> 00:03:55,430 And this is intuitively what you might think you want. 66 00:03:58,470 --> 00:04:02,390 So Lamport in 1979 said, "The result of any execution is the 67 00:04:02,390 --> 00:04:05,650 same as if the operations of all the processors were 68 00:04:05,650 --> 00:04:08,940 executed in some sequential order, and the operations of 69 00:04:08,940 --> 00:04:12,060 each individual processor appear in this sequence in the 70 00:04:12,060 --> 00:04:15,380 order specified by its program." 71 00:04:15,380 --> 00:04:17,579 So what does that mean? 72 00:04:17,579 --> 00:04:23,560 So what it says is that if I look at the processor's 73 00:04:23,560 --> 00:04:26,510 program and the sequence of operations that are issued by 74 00:04:26,510 --> 00:04:31,780 that processor's program, they're interleaved with the 75 00:04:31,780 --> 00:04:34,230 corresponding sequences defined by the other 76 00:04:34,230 --> 00:04:38,900 processors to produce a global linear order. 
77 00:04:41,600 --> 00:04:44,470 So the first thing is that there's a global linear order 78 00:04:44,470 --> 00:04:47,900 that consists of all of these processors' instructions being 79 00:04:47,900 --> 00:04:49,150 interleaved. 80 00:04:51,850 --> 00:04:56,190 In this linear order, whenever you perform a load from memory 81 00:04:56,190 --> 00:05:00,770 into a register, it receives the value that was stored by the 82 00:05:00,770 --> 00:05:05,920 most recent store operation in that linear order to that 83 00:05:05,920 --> 00:05:08,660 location in memory. 84 00:05:11,680 --> 00:05:17,060 So suppose you have in this linear order 85 00:05:17,060 --> 00:05:19,480 that two processors wrote. 86 00:05:19,480 --> 00:05:21,780 Well, one of them came last. 87 00:05:21,780 --> 00:05:24,340 The most recent one before you read, that's the 88 00:05:24,340 --> 00:05:26,490 one that you get. 89 00:05:26,490 --> 00:05:27,580 Now, there may be many different 90 00:05:27,580 --> 00:05:29,060 interleavings and so forth. 91 00:05:29,060 --> 00:05:32,520 And you could get any of the values that correspond to any 92 00:05:32,520 --> 00:05:33,430 of those interleavings. 93 00:05:33,430 --> 00:05:37,350 But the point is that you must get a value that is 94 00:05:37,350 --> 00:05:39,170 represented by some interleaving. 95 00:05:42,230 --> 00:05:45,240 The hardware can then do anything it wants, but for the 96 00:05:45,240 --> 00:05:49,290 execution to satisfy the sequential consistency model, 97 00:05:49,290 --> 00:05:52,800 for it to be sequentially consistent, it must appear as if 98 00:05:52,800 --> 00:05:58,950 the loads and stores obey some global linear order. 99 00:05:58,950 --> 00:06:03,760 So let's be concrete about that with the problem that I 100 00:06:03,760 --> 00:06:04,580 gave before. 101 00:06:04,580 --> 00:06:07,840 So initially, we have a and b are 0. 102 00:06:07,840 --> 00:06:10,620 And now, we have these instructions executed. 
103 00:06:10,620 --> 00:06:15,360 So what I have to do is say, I get any possible outcome based 104 00:06:15,360 --> 00:06:18,980 on interleaving these instructions in this order. 105 00:06:18,980 --> 00:06:21,210 So if I look at it, I've got two instructions 106 00:06:21,210 --> 00:06:23,160 here, two over here. 107 00:06:23,160 --> 00:06:28,040 So that there are six possible interleavings because 4 choose 108 00:06:28,040 --> 00:06:33,270 2 is 6, for those people who've taken 6.042. 109 00:06:33,270 --> 00:06:35,660 So there are six possible interleavings. 110 00:06:35,660 --> 00:06:41,330 So for example, if I execute first move a 1 into a, and 111 00:06:41,330 --> 00:06:50,330 then I execute the load into a register of the value of b, and 112 00:06:50,330 --> 00:06:54,380 then I move 1 into b, and then I load the value of a, I get a 113 00:06:54,380 --> 00:06:59,940 value of 1 for EAX and a value of 0 for EBX. 114 00:06:59,940 --> 00:07:01,960 For this particular interleaving of those 115 00:07:01,960 --> 00:07:03,210 instructions. 116 00:07:05,310 --> 00:07:07,070 That's what happens if I execute these two 117 00:07:07,070 --> 00:07:07,850 before these two. 118 00:07:07,850 --> 00:07:12,600 If I execute these two instructions here before these 119 00:07:12,600 --> 00:07:15,840 two here, I get the order 3412. 120 00:07:15,840 --> 00:07:17,780 And essentially, the opposite thing happens. 121 00:07:17,780 --> 00:07:21,870 EAX gets 0 and EBX gets 1. 122 00:07:21,870 --> 00:07:26,850 And then, if I interleave them in some way, where 1 and 3 123 00:07:26,850 --> 00:07:31,510 somehow come first before I do the 2 and 4, then I'll get a 124 00:07:31,510 --> 00:07:34,030 value of 1 for each of them. 125 00:07:34,030 --> 00:07:37,060 Those are the middle cases. 126 00:07:37,060 --> 00:07:39,386 So what don't I get? 
127 00:07:39,386 --> 00:07:40,360 AUDIENCE: 0, 0. 128 00:07:40,360 --> 00:07:42,670 PROFESSOR: You never get 0, 0 in a 129 00:07:42,670 --> 00:07:46,120 sequentially consistent execution. 130 00:07:46,120 --> 00:07:49,680 Sequential consistency implies that no execution-- 131 00:07:49,680 --> 00:07:52,730 whoops, that should be EAX. 132 00:07:52,730 --> 00:07:56,220 That EAX equals EBX equals 0. 133 00:07:56,220 --> 00:07:59,450 I don't ever get that outcome. 134 00:07:59,450 --> 00:08:03,610 If I did, then I would say my machine wasn't sequentially 135 00:08:03,610 --> 00:08:04,860 consistent. 136 00:08:07,020 --> 00:08:11,360 So now let me take a detour a little bit to look at mutual 137 00:08:11,360 --> 00:08:14,640 exclusion again. 138 00:08:14,640 --> 00:08:18,890 And understand what happens to mutual exclusion algorithms in 139 00:08:18,890 --> 00:08:21,280 the context of memory consistency. 140 00:08:21,280 --> 00:08:24,100 So everybody understood what sequential consistency is. 141 00:08:24,100 --> 00:08:26,690 I simply look at my program as if I'm interleaving 142 00:08:26,690 --> 00:08:27,940 instructions. 143 00:08:32,510 --> 00:08:37,440 So most implementations of mutual exclusion, as I showed 144 00:08:37,440 --> 00:08:42,530 previously, employ some kind of atomic read-modify-write. 145 00:08:42,530 --> 00:08:48,130 So the example I gave you last time was using the exchange 146 00:08:48,130 --> 00:08:52,370 operation to atomically exchange a value in a register 147 00:08:52,370 --> 00:08:53,920 with a value in memory. 148 00:08:53,920 --> 00:08:56,690 People remember that? 149 00:08:56,690 --> 00:08:58,180 To implement a lock? 150 00:08:58,180 --> 00:09:00,500 So in order to implement a lock, I 151 00:09:00,500 --> 00:09:02,700 atomically switch two values. 152 00:09:06,500 --> 00:09:08,330 So we, in particular, used the exchange one. 153 00:09:08,330 --> 00:09:10,750 And there are a bunch of other commands that people can use. 
154 00:09:10,750 --> 00:09:12,270 Test-and-set, compare-and-swap, 155 00:09:12,270 --> 00:09:17,830 load-linked-store-conditional, which essentially do some kind 156 00:09:17,830 --> 00:09:19,630 of read-modify-write on memory. 157 00:09:19,630 --> 00:09:22,340 These tend to be expensive instructions, as I mentioned. 158 00:09:22,340 --> 00:09:23,810 They usually tend to cost something 159 00:09:23,810 --> 00:09:27,820 like an L2 cache hit. 160 00:09:27,820 --> 00:09:30,480 Now, the question is can mutual exclusion be 161 00:09:30,480 --> 00:09:33,620 implemented with only atomic loads and stores? 162 00:09:33,620 --> 00:09:38,870 Do you really need one of these heavyweight operations 163 00:09:38,870 --> 00:09:42,740 to implement mutual exclusion? 164 00:09:42,740 --> 00:09:44,710 What if I don't use a read-modify-write? 165 00:09:44,710 --> 00:09:46,860 Is it possible to do it? 166 00:09:46,860 --> 00:09:50,690 And in fact, the answer is yes. 167 00:09:50,690 --> 00:09:53,920 So Dekker and Dijkstra showed that it can, as long as the 168 00:09:53,920 --> 00:09:58,420 computer system is sequentially consistent. 169 00:09:58,420 --> 00:10:00,770 So as long as you have sequential consistency, you, in 170 00:10:00,770 --> 00:10:06,850 fact, can implement mutual exclusion without 171 00:10:06,850 --> 00:10:08,940 read-modify-write. 172 00:10:08,940 --> 00:10:11,330 We're actually not going to use either the Dekker or 173 00:10:11,330 --> 00:10:14,090 Dijkstra algorithms, although you can read about those in 174 00:10:14,090 --> 00:10:15,010 the literature. 175 00:10:15,010 --> 00:10:17,230 We're going to look at what is probably the simplest such 176 00:10:17,230 --> 00:10:19,580 algorithm that's been devised to date, which 177 00:10:19,580 --> 00:10:23,690 is due to Peterson. 178 00:10:23,690 --> 00:10:29,370 And I'm going to illustrate it with these two smileys. 179 00:10:29,370 --> 00:10:30,320 That's a she. 180 00:10:30,320 --> 00:10:31,590 And that's a he. 
181 00:10:31,590 --> 00:10:34,280 And they want to operate on widget x. 182 00:10:34,280 --> 00:10:36,040 And she wants to frob it. 183 00:10:36,040 --> 00:10:38,560 And he wants to borf it. 184 00:10:38,560 --> 00:10:41,240 And we want to preserve the property that we are not 185 00:10:41,240 --> 00:10:43,020 frobbing and borfing at the same time. 186 00:10:46,380 --> 00:10:47,430 So how do we do that? 187 00:10:47,430 --> 00:10:50,010 Well, here's the code. 188 00:10:50,010 --> 00:10:53,120 So we're going to set up some things before we start he and 189 00:10:53,120 --> 00:10:54,830 she operating. 190 00:10:54,830 --> 00:10:56,300 So we're going to have our widget x. 191 00:10:56,300 --> 00:10:57,820 That's our protected variable. 192 00:10:57,820 --> 00:11:01,700 And we're going to have a Boolean set initially to false 193 00:11:01,700 --> 00:11:03,980 that says whether she wants to frob it. 194 00:11:03,980 --> 00:11:06,290 So we don't want to have her frob it unless she 195 00:11:06,290 --> 00:11:08,020 wants to frob it. 196 00:11:08,020 --> 00:11:13,510 And we don't want him to borf it unless he wants to borf it. 197 00:11:13,510 --> 00:11:16,320 And we're going to have an extra auxiliary variable, 198 00:11:16,320 --> 00:11:18,620 which is whose turn it is. 199 00:11:18,620 --> 00:11:21,030 So they're going to sort of take turns. 200 00:11:21,030 --> 00:11:25,190 But that's only going to come into account if the other one 201 00:11:25,190 --> 00:11:26,120 doesn't have a conflict. 202 00:11:26,120 --> 00:11:28,370 If they don't have a conflict, then one of them is going to 203 00:11:28,370 --> 00:11:29,340 be able to go. 204 00:11:29,340 --> 00:11:31,180 So here's what she basically does. 205 00:11:31,180 --> 00:11:34,200 She initially sets that she wants to 206 00:11:34,200 --> 00:11:39,440 operate on the widget. 207 00:11:39,440 --> 00:11:44,365 And then, what she does is she sets the turn to be his. 
208 00:11:47,900 --> 00:11:54,350 And then, while he wants it, and the turn is his, she's 209 00:11:54,350 --> 00:11:55,210 going to just spin. 210 00:11:55,210 --> 00:11:58,050 Notice that you're not frobbing it in the while loop. 211 00:11:58,050 --> 00:12:00,840 The body of the while loop is empty. 212 00:12:00,840 --> 00:12:03,510 So this is a spinning solution. 213 00:12:03,510 --> 00:12:07,170 So while he wants it, and it's his turn, you're just going to 214 00:12:07,170 --> 00:12:11,850 sit there, continually testing the variables he wants and 215 00:12:11,850 --> 00:12:15,810 turn equals his until one of them ends up being false. 216 00:12:18,920 --> 00:12:23,760 So if he doesn't want it, or it's not his turn, then she 217 00:12:23,760 --> 00:12:24,710 gets to frob it. 218 00:12:24,710 --> 00:12:28,950 And when she's done, she sets she wants to false. 219 00:12:28,950 --> 00:12:30,450 And he does a similar thing. 220 00:12:30,450 --> 00:12:33,220 He sets it to true, says it's her turn. 221 00:12:33,220 --> 00:12:36,700 And then, while she wants it, and the turn is hers, he just 222 00:12:36,700 --> 00:12:42,190 sits there waiting, continually re-executing this 223 00:12:42,190 --> 00:12:45,510 until finally, one of these turns out to be false. 224 00:12:45,510 --> 00:12:46,436 And then he borfs it. 225 00:12:46,436 --> 00:12:47,900 And he sets it to false. 226 00:12:47,900 --> 00:12:50,520 And then, they're doing both of these things sort of in a 227 00:12:50,520 --> 00:12:55,200 loop, periodically coming back and executing it. 228 00:12:55,200 --> 00:12:57,010 And what you want to do is you don't want to make it so that 229 00:12:57,010 --> 00:12:57,770 it's forced. 230 00:12:57,770 --> 00:13:01,460 That it's one turn, then the other, because maybe he never 231 00:13:01,460 --> 00:13:03,830 wants to borf it. 232 00:13:03,830 --> 00:13:06,160 And then, she would be stuck not being able to frob it, 233 00:13:06,160 --> 00:13:08,350 even though he doesn't want to. 
234 00:13:08,350 --> 00:13:12,560 So if you think about this-- let's think about why this 235 00:13:12,560 --> 00:13:15,200 is always going to give you mutual exclusion. 236 00:13:17,710 --> 00:13:21,840 So basically, what's happening here is if he wants it-- 237 00:13:21,840 --> 00:13:24,120 by the way, these things are not easy to reason about. 238 00:13:24,120 --> 00:13:28,160 And usually, as much as I can talk and talk in class, what 239 00:13:28,160 --> 00:13:31,220 you really need to do is go home, and sit down with this 240 00:13:31,220 --> 00:13:32,090 kind of thing. 241 00:13:32,090 --> 00:13:35,940 And study it for 10 minutes. 242 00:13:35,940 --> 00:13:39,500 And then, you'll understand what the subtleties are and 243 00:13:39,500 --> 00:13:40,100 what's going on. 244 00:13:40,100 --> 00:13:44,620 But basically, what we're doing is we're making it so 245 00:13:44,620 --> 00:13:49,680 that it's not going to be the case both that she's set 246 00:13:49,680 --> 00:13:50,820 that she wants it 247 00:13:50,820 --> 00:13:53,060 and that the turn is his. 248 00:13:53,060 --> 00:13:59,010 And then, if there's a race where he wants it also, then 249 00:13:59,010 --> 00:14:02,270 that's going to preclude both of them from going into it at 250 00:14:02,270 --> 00:14:03,790 the same time. 251 00:14:03,790 --> 00:14:10,040 And then whichever one sets the turn, one of those is 252 00:14:10,040 --> 00:14:11,830 going to occur first. 253 00:14:11,830 --> 00:14:13,810 And one is going to occur second. 254 00:14:13,810 --> 00:14:18,380 And whoever ends up coming second, the other 255 00:14:18,380 --> 00:14:19,630 one gets to go ahead. 256 00:14:22,700 --> 00:14:26,990 So it's very subtle how that is actually working to make 257 00:14:26,990 --> 00:14:29,925 sure that each one is gating the other to allow them to go. 
258 00:14:32,470 --> 00:14:35,590 But the way to reason about this is to ask: 259 00:14:35,590 --> 00:14:37,860 what are the possible interleavings? 260 00:14:37,860 --> 00:14:39,670 And the important interleavings here, as you can 261 00:14:39,670 --> 00:14:43,710 see, are what happens when setting these things. 262 00:14:43,710 --> 00:14:45,740 And once they're set, what happens in 263 00:14:45,740 --> 00:14:46,850 testing these things? 264 00:14:46,850 --> 00:14:49,810 And especially because when you go around the loop and so 265 00:14:49,810 --> 00:14:52,730 forth, you have to imagine that an arbitrarily long 266 00:14:52,730 --> 00:14:54,600 amount of time has gone by. 267 00:14:54,600 --> 00:14:57,730 So for example, between the time that you check that the 268 00:14:57,730 --> 00:15:01,760 turn is his, he may have already gone around this loop. 269 00:15:04,350 --> 00:15:06,140 And so you have to worry about-- 270 00:15:06,140 --> 00:15:09,110 even though it may look like one instruction from this 271 00:15:09,110 --> 00:15:11,740 processor's point of view, for correctness purposes you have 272 00:15:11,740 --> 00:15:14,450 to imagine that an arbitrary amount of computation could 273 00:15:14,450 --> 00:15:18,340 occur between any two instructions. 274 00:15:18,340 --> 00:15:21,260 So any questions about this code? 275 00:15:21,260 --> 00:15:25,780 People see how it preserves mutual exclusion and how you 276 00:15:25,780 --> 00:15:29,280 use sequential consistency to reason about it by asking what 277 00:15:29,280 --> 00:15:32,000 are the possible interleavings? 278 00:15:32,000 --> 00:15:32,930 Questions? 279 00:15:32,930 --> 00:15:34,057 Yeah. 280 00:15:34,057 --> 00:15:36,756 AUDIENCE: So, I don't know if I got it right. 281 00:15:36,756 --> 00:15:41,148 So basically, she sets the [UNINTELLIGIBLE] 282 00:15:41,148 --> 00:15:46,500 to give him a chance before she goes to the loop. 
283 00:15:46,500 --> 00:15:50,130 So basically, she waits there until he has been able to go? 284 00:15:50,130 --> 00:15:51,770 That's why on the [UNINTELLIGIBLE]. 285 00:15:51,770 --> 00:15:55,070 PROFESSOR: So on this third line-- 286 00:15:55,070 --> 00:15:55,995 AUDIENCE: Both of them. 287 00:15:55,995 --> 00:15:59,043 Either before actually frobbing or borfing. 288 00:15:59,043 --> 00:16:01,826 And before that while you always give the turn to the 289 00:16:01,826 --> 00:16:03,030 other to give them a chance to go. 290 00:16:03,030 --> 00:16:03,296 PROFESSOR: Yeah. 291 00:16:03,296 --> 00:16:04,550 So there are two things you want to show. 292 00:16:04,550 --> 00:16:10,470 One is that they can't both be stalled on 293 00:16:10,470 --> 00:16:11,516 the while loop there. 294 00:16:11,516 --> 00:16:13,510 And that can't happen because the turn can't be 295 00:16:13,510 --> 00:16:16,780 simultaneously his and hers. 296 00:16:16,780 --> 00:16:20,110 So you know that they're not both going to deadlock in 297 00:16:20,110 --> 00:16:23,400 trying to do this by sitting there waiting for the other 298 00:16:23,400 --> 00:16:24,400 because of this. 299 00:16:24,400 --> 00:16:28,960 And now, the question is well, how do you know that one can't 300 00:16:28,960 --> 00:16:34,070 get through while the other is also going through? 301 00:16:36,580 --> 00:16:45,780 And for that, you have to look and say, oh well, if you go 302 00:16:45,780 --> 00:16:49,850 through, then you know that either he doesn't want it, 303 00:16:49,850 --> 00:16:53,280 or it's not his turn. 304 00:16:53,280 --> 00:16:54,695 And in either case, you're free 305 00:16:54,695 --> 00:16:55,205 to go ahead. 306 00:16:55,205 --> 00:16:58,490 If he does change it so that he wants it, then in fact, 307 00:16:58,490 --> 00:17:00,970 it's going to be your turn. 308 00:17:00,970 --> 00:17:01,785 Question? 
309 00:17:01,785 --> 00:17:03,960 AUDIENCE: This only works for exactly two threads, right? 310 00:17:03,960 --> 00:17:06,349 PROFESSOR: This only works for exactly two threads. 311 00:17:06,349 --> 00:17:09,099 This does not work for three, but there are extensions of 312 00:17:09,099 --> 00:17:12,650 this sort of thing to n threads, for an 313 00:17:12,650 --> 00:17:14,859 arbitrarily large number n. 314 00:17:14,859 --> 00:17:17,460 However, the data structures to implement this kind of 315 00:17:17,460 --> 00:17:21,960 mutual exclusion for n threads end up taking space 316 00:17:21,960 --> 00:17:23,730 proportional to n. 317 00:17:23,730 --> 00:17:28,329 And so one of the advantages of the built-in atomics-- 318 00:17:28,329 --> 00:17:34,590 the compare-and-swap, or the atomic exchange, or whatever-- 319 00:17:34,590 --> 00:17:37,540 is they work for an arbitrary number of threads with only a 320 00:17:37,540 --> 00:17:40,560 bounded amount of resources. 321 00:17:40,560 --> 00:17:46,010 You don't require extra data structures and so forth. 322 00:17:46,010 --> 00:17:48,710 So that's why they put those things in the architecture: 323 00:17:48,710 --> 00:17:54,690 because in the architecture you can build things that will 324 00:17:54,690 --> 00:17:57,090 solve this problem much more simply than 325 00:17:57,090 --> 00:17:58,340 this sort of thing. 326 00:18:04,510 --> 00:18:06,640 However, there are going to be lessons here that you may want 327 00:18:06,640 --> 00:18:08,280 to use in your programming, depending 328 00:18:08,280 --> 00:18:09,330 on what you're doing. 329 00:18:09,330 --> 00:18:19,310 So now, it turns out that no modern-day processor 330 00:18:19,310 --> 00:18:20,800 implements sequential consistency. 331 00:18:23,450 --> 00:18:25,650 There have been machines that were built-- actually quite 332 00:18:25,650 --> 00:18:26,740 good machines-- 333 00:18:26,740 --> 00:18:28,500 that implemented sequential consistency. 
334 00:18:28,500 --> 00:18:32,920 But today, nobody implements it. 335 00:18:32,920 --> 00:18:35,930 They all implement some form of what's called relaxed 336 00:18:35,930 --> 00:18:40,930 consistency, where the hardware may reorder 337 00:18:40,930 --> 00:18:42,840 instructions. 338 00:18:42,840 --> 00:18:44,490 And so you have things executing 339 00:18:44,490 --> 00:18:45,910 not in program order. 340 00:18:45,910 --> 00:18:49,790 And the compilers may reorder instructions as well. 341 00:18:49,790 --> 00:18:53,120 So both the hardware and the software are doing reordering. 342 00:18:53,120 --> 00:18:56,720 So let's take a look at that. 343 00:18:56,720 --> 00:19:01,280 So here's the program order for one of the things. 344 00:19:01,280 --> 00:19:09,840 We move 1 into a, and then move the value of b into EBX 345 00:19:09,840 --> 00:19:11,210 to do a load. 346 00:19:11,210 --> 00:19:13,150 Here's the program order. 347 00:19:13,150 --> 00:19:19,760 Most modern hardware will switch these and execute them in 348 00:19:19,760 --> 00:19:21,740 this order. 349 00:19:21,740 --> 00:19:24,360 Why do you suppose? 350 00:19:24,360 --> 00:19:26,740 Even if you write it this way, the instruction-level 351 00:19:26,740 --> 00:19:31,270 parallelism within the processor will, in fact, 352 00:19:31,270 --> 00:19:35,260 execute it in the opposite order most of the time. 353 00:19:35,260 --> 00:19:35,510 Yeah? 354 00:19:35,510 --> 00:19:37,340 AUDIENCE: Because loading takes longer. 355 00:19:37,340 --> 00:19:37,520 PROFESSOR: Yeah. 356 00:19:37,520 --> 00:19:39,900 Because loading takes longer. 357 00:19:39,900 --> 00:19:43,290 Loading is going to incur latency. 358 00:19:43,290 --> 00:19:46,110 I can't complete the load from the processor's point of view 359 00:19:46,110 --> 00:19:47,850 until I get an answer. 
360 00:19:47,850 --> 00:19:50,730 So if I load, and I wait for it to go out to the memory 361 00:19:50,730 --> 00:19:53,240 system and back into the processor, 362 00:19:53,240 --> 00:19:56,720 and then I do a store-- 363 00:19:56,720 --> 00:19:58,730 well, as soon as I've done the store, I can move on. 364 00:19:58,730 --> 00:20:00,560 Even if the store takes a while to get out 365 00:20:00,560 --> 00:20:01,870 to the memory system. 366 00:20:01,870 --> 00:20:03,260 But if I do it in the opposite order-- 367 00:20:03,260 --> 00:20:11,650 I do the store first, and then I do the load-- I've ended up 368 00:20:11,650 --> 00:20:16,340 wasting essentially one cycle, the cycle to do the store, 369 00:20:16,340 --> 00:20:19,050 when I could have been overlapping that with the time 370 00:20:19,050 --> 00:20:22,500 it took to do the load. 371 00:20:22,500 --> 00:20:23,520 So people follow that? 372 00:20:23,520 --> 00:20:26,870 So if I execute the load first, I can go right on to 373 00:20:26,870 --> 00:20:29,030 execute the store. 374 00:20:29,030 --> 00:20:33,950 I can issue the load, go right on to execute the store 375 00:20:33,950 --> 00:20:39,160 without having to wait for the load to complete if I have a 376 00:20:39,160 --> 00:20:44,950 multi-issue CPU in the processor core. 377 00:20:44,950 --> 00:20:47,520 So you get higher instruction-level parallelism. 378 00:20:47,520 --> 00:20:53,490 Now, when is it safe for the hardware or compiler to perform 379 00:20:53,490 --> 00:20:55,930 this reordering? 380 00:20:55,930 --> 00:20:59,070 Can it always switch instructions like this to put 381 00:20:59,070 --> 00:21:00,320 loads before stores? 382 00:21:05,360 --> 00:21:10,060 When would this be a bad idea, to put a load before a store? 383 00:21:13,870 --> 00:21:14,620 Yeah? 384 00:21:14,620 --> 00:21:16,250 AUDIENCE: You're loading the variable you just stored. 
385 00:21:16,250 --> 00:21:17,100 PROFESSOR: Yeah, if you're loading the 386 00:21:17,100 --> 00:21:19,720 variable you just stored. 387 00:21:19,720 --> 00:21:27,820 Suppose you say store into x and then load from x. 388 00:21:27,820 --> 00:21:30,290 That's different from if I load from x, and then 389 00:21:30,290 --> 00:21:33,450 I store into x. 390 00:21:33,450 --> 00:21:40,710 So if you're going to the same location, then that's not a 391 00:21:40,710 --> 00:21:43,540 safe thing to do. 392 00:21:43,540 --> 00:21:52,620 So basically, in this case, if a is not equal to b, then this 393 00:21:52,620 --> 00:21:53,815 is safe to do. 394 00:21:53,815 --> 00:21:57,020 But if a equals b, this is not safe to do. 395 00:21:59,780 --> 00:22:04,680 Because it's going to give you a different answer. 396 00:22:04,680 --> 00:22:08,190 However, it turns out that there's another time when this 397 00:22:08,190 --> 00:22:09,790 is not safe to do. 398 00:22:09,790 --> 00:22:12,600 So this would have been the end of the story if we were 399 00:22:12,600 --> 00:22:14,320 running on one processor. 400 00:22:14,320 --> 00:22:18,350 The other time that it's not safe to do it is-- 401 00:22:18,350 --> 00:22:23,320 well, being safe also assumes that there's no 402 00:22:23,320 --> 00:22:24,260 concurrency. 403 00:22:24,260 --> 00:22:28,380 If there is concurrency, you can run into trouble as well. 404 00:22:30,920 --> 00:22:34,420 And the reason is because another processor may be 405 00:22:34,420 --> 00:22:39,710 changing the value that you're planning to read. 406 00:22:39,710 --> 00:22:42,780 And so if you read things out of order, you may violate 407 00:22:42,780 --> 00:22:45,180 sequential consistency. 408 00:22:45,180 --> 00:22:47,560 Let me show you what's going on in the hardware so you have 409 00:22:47,560 --> 00:22:51,030 an appreciation of what the issue is here. 410 00:22:51,030 --> 00:22:56,940 So here's a 30,000-foot view of hardware reordering. 
411 00:22:56,940 --> 00:23:02,880 So the processor is going to issue memory operations to the 412 00:23:02,880 --> 00:23:04,050 memory system. 413 00:23:04,050 --> 00:23:06,560 And results of memory operations are 414 00:23:06,560 --> 00:23:08,480 going to come back. 415 00:23:08,480 --> 00:23:12,280 But they really only have to come back when? 416 00:23:12,280 --> 00:23:14,670 If they're loads. 417 00:23:14,670 --> 00:23:19,040 If they're stores, they don't have to come back. 418 00:23:19,040 --> 00:23:24,570 So the processor, in fact, can issue stores faster than the 419 00:23:24,570 --> 00:23:26,320 network can handle them 420 00:23:26,320 --> 00:23:28,190 and the memory system can handle them. 421 00:23:28,190 --> 00:23:29,780 So the processors are generally very fast. 422 00:23:29,780 --> 00:23:33,520 The memory systems are relatively slow. 423 00:23:33,520 --> 00:23:36,220 But the processor is not generally issuing a store on 424 00:23:36,220 --> 00:23:37,850 every cycle. 425 00:23:37,850 --> 00:23:40,150 It may do a store, it may do some additions, it may do 426 00:23:40,150 --> 00:23:42,720 another store, et cetera. 427 00:23:42,720 --> 00:23:46,580 So rather than waiting for the memory system to do every 428 00:23:46,580 --> 00:23:49,570 store, they create a store buffer. 429 00:23:49,570 --> 00:23:52,820 And the memory system pulls things out of the store buffer 430 00:23:52,820 --> 00:23:54,880 as fast as it can. 431 00:23:54,880 --> 00:23:57,940 And the processor shoves stuff into the store buffer up to 432 00:23:57,940 --> 00:24:00,360 the point that the store buffer gets full, in which 433 00:24:00,360 --> 00:24:01,540 case it would have to stall. 434 00:24:01,540 --> 00:24:06,630 But for many codes, it never has to stall because 435 00:24:06,630 --> 00:24:10,400 there is a sufficient frequency of other operations 436 00:24:10,400 --> 00:24:13,730 going on that you don't have to wait. 
437 00:24:13,730 --> 00:24:16,280 So when a store occurs, it doesn't occur immediately; it goes into 438 00:24:16,280 --> 00:24:18,480 the store buffer. 439 00:24:18,480 --> 00:24:23,700 Now along comes a load operation. 440 00:24:23,700 --> 00:24:28,120 And the load operation, if it's to a different address, 441 00:24:28,120 --> 00:24:30,540 you want to have that take priority because the processor 442 00:24:30,540 --> 00:24:31,920 can be waiting. 443 00:24:31,920 --> 00:24:36,090 Its next instructions may be waiting on the result. 444 00:24:36,090 --> 00:24:39,910 So you want that to go as fast as possible. 445 00:24:39,910 --> 00:24:45,190 They have a passing lane here where the fast cars or the 446 00:24:45,190 --> 00:24:47,550 important cars, the ambulances, et cetera, in this 447 00:24:47,550 --> 00:24:51,640 case loads, can scoot by all the other things in traffic 448 00:24:51,640 --> 00:24:53,980 and get to the memory system first. 449 00:24:53,980 --> 00:24:57,510 But as we said, we don't want to do that if the last thing 450 00:24:57,510 --> 00:25:01,850 that I stored was to the same address. 451 00:25:01,850 --> 00:25:05,890 So in fact, there is content addressable memory here, which 452 00:25:05,890 --> 00:25:09,660 matches the address that is being loaded with everything 453 00:25:09,660 --> 00:25:11,780 in the store buffer. 454 00:25:11,780 --> 00:25:15,150 And if it does match, it gets satisfied immediately by the 455 00:25:15,150 --> 00:25:17,340 store buffer. 456 00:25:17,340 --> 00:25:21,880 And it only makes it out to the network if it's not in 457 00:25:21,880 --> 00:25:23,940 the store buffer.
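The mechanism just described can be sketched as a toy model in C: stores queue in a buffer, a load first matches its address against the buffer (the content-addressable memory), and only goes out to memory on a miss. The names and sizes here are invented for illustration; real hardware does the match in parallel rather than with a scan, but the behavior is the same.

```c
#include <assert.h>

/* A toy model of a store buffer with load forwarding (illustrative only;
   identifiers and sizes are invented, not from the lecture). */
#define MEM_SIZE 16
#define BUF_SIZE 4

static int memory[MEM_SIZE];                  /* the (slow) memory system */
static struct { int addr, val; } buf[BUF_SIZE];
static int buf_len;

/* A store does not go to memory right away; it queues in the buffer. */
static void store(int addr, int val) {
    assert(buf_len < BUF_SIZE);               /* a full buffer would stall */
    buf[buf_len].addr = addr;
    buf[buf_len].val  = val;
    buf_len++;
}

/* A load first matches its address against everything in the buffer
   (most recent store wins); only on a miss does it go out to memory,
   bypassing -- i.e., reordering with -- the still-buffered stores. */
static int load(int addr) {
    for (int i = buf_len - 1; i >= 0; i--)
        if (buf[i].addr == addr)
            return buf[i].val;                /* satisfied by the buffer */
    return memory[addr];                      /* the "passing lane" */
}

/* The memory system drains buffered stores as fast as it can. */
static void drain(void) {
    for (int i = 0; i < buf_len; i++)
        memory[buf[i].addr] = buf[i].val;
    buf_len = 0;
}
```

On one processor this is invisible: a load of the address you just stored is forwarded from the buffer. But a load of a different address reads memory before the buffered store has landed, which is exactly what another processor can observe.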
458 00:25:23,940 --> 00:25:27,350 But what you can see here is that this mechanism, which 459 00:25:27,350 --> 00:25:32,510 works great on one processor, violates sequential 460 00:25:32,510 --> 00:25:36,230 consistency because I may have operations going to two 461 00:25:36,230 --> 00:25:40,580 different memory locations, where the order, in fact, 462 00:25:40,580 --> 00:25:42,600 matters to me. 463 00:25:42,600 --> 00:25:45,150 So let's see how that works out. 464 00:25:45,150 --> 00:25:47,990 So first of all, let me tell you what the memory can-- so a 465 00:25:47,990 --> 00:25:50,580 load can bypass a store to a different address. 466 00:25:50,580 --> 00:25:54,350 First of all, any questions about this mechanism? 467 00:25:54,350 --> 00:26:02,650 So this accounts for a whole bunch of understanding of what 468 00:26:02,650 --> 00:26:05,020 happens in concurrency in systems. 469 00:26:05,020 --> 00:26:08,810 This one understanding, of store buffers, 470 00:26:08,810 --> 00:26:11,620 is absolutely crucial. 471 00:26:11,620 --> 00:26:14,980 And I have talked, by the way, with lots of experts who don't 472 00:26:14,980 --> 00:26:16,255 understand this. 473 00:26:16,255 --> 00:26:19,340 That this is what's going on for why we don't have 474 00:26:19,340 --> 00:26:22,420 sequential consistency in our computers. 475 00:26:22,420 --> 00:26:25,760 It's because they made the decision to allow this 476 00:26:25,760 --> 00:26:29,040 optimization, even though it doesn't preserve sequential 477 00:26:29,040 --> 00:26:31,250 consistency. 478 00:26:31,250 --> 00:26:33,750 There were machines in the past that did support 479 00:26:33,750 --> 00:26:34,840 sequential consistency. 480 00:26:34,840 --> 00:26:40,140 And what they did was they used speculation to allow the 481 00:26:40,140 --> 00:26:43,010 processor to assume that it was sequentially consistent.
482 00:26:43,010 --> 00:26:45,820 And if that turned out to be wrong, they were able to roll 483 00:26:45,820 --> 00:26:49,430 back the processor's state to the point before 484 00:26:49,430 --> 00:26:52,440 the access was done. 485 00:26:52,440 --> 00:26:55,340 In fact, the processor is already doing that for 486 00:26:55,340 --> 00:26:58,340 branches, where it makes branch predictions and 487 00:26:58,340 --> 00:26:59,840 executes down a line. 488 00:26:59,840 --> 00:27:01,800 But if it's wrong, it has to flush the 489 00:27:01,800 --> 00:27:04,180 pipeline and so forth. 490 00:27:04,180 --> 00:27:09,540 Why they don't do the same thing for hardware is an 491 00:27:09,540 --> 00:27:10,530 interesting-- 492 00:27:10,530 --> 00:27:13,900 for loads and stores-- is an interesting question. 493 00:27:13,900 --> 00:27:16,170 Because at some level there's no reason 494 00:27:16,170 --> 00:27:17,970 they couldn't do this. 495 00:27:17,970 --> 00:27:20,890 Instead, it's sort of been a thing where the software 496 00:27:20,890 --> 00:27:23,020 people say, yeah we can handle it. 497 00:27:23,020 --> 00:27:24,480 And the hardware people say, OK. 498 00:27:24,480 --> 00:27:26,240 You're willing to handle it. 499 00:27:26,240 --> 00:27:28,170 We won't worry about it then. 500 00:27:28,170 --> 00:27:32,360 When in fact, it just makes life complicated for everybody 501 00:27:32,360 --> 00:27:34,100 that you don't have sequential consistency. 502 00:27:34,100 --> 00:27:36,465 AUDIENCE: [INAUDIBLE] 503 00:27:36,465 --> 00:27:41,670 you have to do speculation across both [INAUDIBLE]. 504 00:27:41,670 --> 00:27:43,530 PROFESSOR: Well here, you only have to do speculation over 505 00:27:43,530 --> 00:27:45,730 what actually is coming out of your memory system. 506 00:27:45,730 --> 00:27:49,310 And if it doesn't match, you could roll back. 507 00:27:49,310 --> 00:27:53,050 The issue, in part, is how many machine states are you 508 00:27:53,050 --> 00:27:54,860 ready to roll back to.
509 00:27:54,860 --> 00:27:56,810 Loads come more frequently than branches. 510 00:27:56,810 --> 00:27:57,940 That's one thing. 511 00:27:57,940 --> 00:28:01,400 So no doubt, there are good reasons for why 512 00:28:01,400 --> 00:28:02,090 they're doing it. 513 00:28:02,090 --> 00:28:06,080 Nevertheless, definitely loss of sequential consistency 514 00:28:06,080 --> 00:28:08,830 becomes a headache for a lot of people in doing a 515 00:28:08,830 --> 00:28:09,700 concurrent program. 516 00:28:09,700 --> 00:28:10,440 We had a question here? 517 00:28:10,440 --> 00:28:11,570 Yes, Sara? 518 00:28:11,570 --> 00:28:12,861 AUDIENCE: So this does not preserve sequential 519 00:28:12,861 --> 00:28:13,450 consistency? 520 00:28:13,450 --> 00:28:15,956 But as long as there's only one processor, it should have 521 00:28:15,956 --> 00:28:18,050 the same effect, right? 522 00:28:18,050 --> 00:28:20,320 PROFESSOR: But sequential consistency for one processor 523 00:28:20,320 --> 00:28:22,190 is easy because all you do is execute them-- 524 00:28:22,190 --> 00:28:23,170 AUDIENCE: Yeah, I'm just saying-- 525 00:28:23,170 --> 00:28:25,450 PROFESSOR: It should have the same effect, exactly. 526 00:28:25,450 --> 00:28:30,080 So on one processor, this works perfectly well. 527 00:28:30,080 --> 00:28:32,820 If there's no concurrency, this is going to give you the 528 00:28:32,820 --> 00:28:34,520 same behavior. 529 00:28:34,520 --> 00:28:38,470 And yet, you've now got this optimization that loads can 530 00:28:38,470 --> 00:28:39,580 bypass stores. 531 00:28:39,580 --> 00:28:44,400 And therefore, you can do a store and a load and be able 532 00:28:44,400 --> 00:28:46,500 to overlap their execution. 533 00:28:46,500 --> 00:28:52,290 So this definitely wins for serial execution. 534 00:28:52,290 --> 00:28:53,230 Yep, good. 535 00:28:53,230 --> 00:28:54,715 Any other questions about this mechanism? 536 00:28:57,620 --> 00:29:00,030 You could reason about it on the quiz. 
537 00:29:00,030 --> 00:29:03,250 That kind of thing, right? 538 00:29:03,250 --> 00:29:05,990 Yeah, OK? 539 00:29:05,990 --> 00:29:10,960 So here's the x86 memory consistency model. 540 00:29:10,960 --> 00:29:13,590 For many years, Intel was unwilling to say what their 541 00:29:13,590 --> 00:29:16,930 memory consistency model was for fear that people would 542 00:29:16,930 --> 00:29:18,740 then rely on it. 543 00:29:18,740 --> 00:29:20,310 And then, they would be forced into it. 544 00:29:20,310 --> 00:29:23,190 But recently, they've started being more explicit about it. 545 00:29:23,190 --> 00:29:25,290 And this is the large part of it. 546 00:29:25,290 --> 00:29:27,400 I haven't put up all the things because there are a 547 00:29:27,400 --> 00:29:32,110 whole bunch of instructions, such as locking instructions 548 00:29:32,110 --> 00:29:34,720 and so forth, for which for some of them, it's more 549 00:29:34,720 --> 00:29:35,280 complicated. 550 00:29:35,280 --> 00:29:36,890 But this is the basics. 551 00:29:36,890 --> 00:29:40,190 So loads are not reordered with loads. 552 00:29:40,190 --> 00:29:42,390 So if you have a load to one location, and a load to another 553 00:29:42,390 --> 00:29:45,470 location, they always execute in the same order. 554 00:29:45,470 --> 00:29:48,080 Stores are not reordered with stores. 555 00:29:48,080 --> 00:29:51,310 If you have a store and then a subsequent store, those two 556 00:29:51,310 --> 00:29:53,980 stores always go in that order. 557 00:29:53,980 --> 00:29:58,240 Stores are not reordered with prior loads. 558 00:29:58,240 --> 00:30:02,780 So if you do a store after a load-- 559 00:30:02,780 --> 00:30:07,530 if you do a load and then a store, they're going to go in 560 00:30:07,530 --> 00:30:09,460 that order.
561 00:30:09,460 --> 00:30:11,180 However, a load-- 562 00:30:11,180 --> 00:30:12,980 and this is what we just talked about-- 563 00:30:12,980 --> 00:30:16,570 may be reordered with a prior store to a different location 564 00:30:16,570 --> 00:30:19,095 but not with a prior store to the same location. 565 00:30:21,770 --> 00:30:23,650 So that's exactly what we just talked about on 566 00:30:23,650 --> 00:30:25,230 the previous slide. 567 00:30:25,230 --> 00:30:27,610 Then, loads and stores are not reordered with lock 568 00:30:27,610 --> 00:30:28,720 instructions. 569 00:30:28,720 --> 00:30:30,790 So a certain set of instructions are called lock 570 00:30:30,790 --> 00:30:31,640 instructions. 571 00:30:31,640 --> 00:30:35,140 And they include all the atomic updates, the exchanges, 572 00:30:35,140 --> 00:30:39,160 compare-and-swaps, and a variety of other atomic 573 00:30:39,160 --> 00:30:42,680 operations that the hardware supports. 574 00:30:42,680 --> 00:30:45,470 The stores to the same location always respect a 575 00:30:45,470 --> 00:30:47,070 global order. 576 00:30:47,070 --> 00:30:51,060 Everybody sees the stores to a location in 577 00:30:51,060 --> 00:30:53,970 exactly the same order. 578 00:30:53,970 --> 00:30:57,410 And the lock instructions respect a global total order. 579 00:30:57,410 --> 00:31:02,460 So that everybody sees that this thread, or processor, got 580 00:31:02,460 --> 00:31:04,260 a lock before that one. 581 00:31:04,260 --> 00:31:08,330 You don't have two different processors disagreeing on what 582 00:31:08,330 --> 00:31:12,200 the order was that somebody acquired a lock or whatever. 583 00:31:12,200 --> 00:31:16,190 And then, memory ordering preserves transitive 584 00:31:16,190 --> 00:31:17,850 visibility, which is sort of like 585 00:31:17,850 --> 00:31:19,950 saying it obeys causality.
586 00:31:19,950 --> 00:31:27,530 In other words, if after doing a, you had some effect, and 587 00:31:27,530 --> 00:31:31,223 then you did b, it should look to other people like a 588 00:31:31,223 --> 00:31:32,680 and then b happened. 589 00:31:32,680 --> 00:31:36,590 Like there's a causality going on. 590 00:31:36,590 --> 00:31:39,980 But that's not sequential consistency, mainly 591 00:31:39,980 --> 00:31:41,230 because of four here. 592 00:31:43,770 --> 00:31:46,280 So what's the impact of reordering? 593 00:31:46,280 --> 00:31:50,240 So here, we have our example from the beginning for the 594 00:31:50,240 --> 00:31:55,260 memory model, where I'm storing a 1 into a and then 595 00:31:55,260 --> 00:31:58,960 loading whatever is in b. 596 00:31:58,960 --> 00:32:01,890 And similarly, over here the opposite. 597 00:32:01,890 --> 00:32:07,040 So what happens if I'm allowed to do reordering? 598 00:32:07,040 --> 00:32:10,960 What can happen to these two instructions? 599 00:32:10,960 --> 00:32:11,180 Yeah. 600 00:32:11,180 --> 00:32:14,060 They can execute in the opposite order. 601 00:32:14,060 --> 00:32:17,960 Similarly, these two guys can execute in the opposite order. 602 00:32:17,960 --> 00:32:28,050 So they can actually execute in this order where we do the 603 00:32:28,050 --> 00:32:30,990 loads and then the stores. 604 00:32:30,990 --> 00:32:33,450 So it executes as if this were the order. 605 00:32:33,450 --> 00:32:34,540 Did I do this right? 606 00:32:34,540 --> 00:32:36,940 Executes as if this were the order. 607 00:32:36,940 --> 00:32:38,990 So I could do 1, 2, 3, 4. 608 00:32:38,990 --> 00:32:43,455 So then, if I do the ordering 2, 4, 1, 3. 609 00:32:47,250 --> 00:32:49,800 AUDIENCE: [INAUDIBLE] 610 00:32:49,800 --> 00:32:50,940 PROFESSOR: I got that screwed up, I think. 611 00:32:50,940 --> 00:32:51,320 Didn't I?
612 00:32:51,320 --> 00:32:52,820 AUDIENCE: [INAUDIBLE] 613 00:32:52,820 --> 00:32:55,060 PROFESSOR: Because I should be swapping these guys, right? 614 00:32:55,060 --> 00:32:55,820 AUDIENCE: Swapped the wrong [INAUDIBLE]. 615 00:32:55,820 --> 00:32:57,480 PROFESSOR: Ugh. 616 00:32:57,480 --> 00:32:58,730 OK. 617 00:33:00,730 --> 00:33:04,850 So if I did this one 2, 1, 4, 3. 618 00:33:08,380 --> 00:33:09,850 So ignore this thing. 619 00:33:09,850 --> 00:33:14,050 Suppose I do the order 2. 620 00:33:14,050 --> 00:33:16,330 So basically, I load b. 621 00:33:16,330 --> 00:33:17,900 Then, I load a. 622 00:33:17,900 --> 00:33:23,230 Then, I store a. 623 00:33:23,230 --> 00:33:25,320 And then, I store b. 624 00:33:25,320 --> 00:33:31,740 What's the result value that are in EAX and EBX? 625 00:33:31,740 --> 00:33:33,570 You get 00. 626 00:33:33,570 --> 00:33:40,130 Remember, 00 wasn't a legal value under sequential 627 00:33:40,130 --> 00:33:40,870 consistency. 628 00:33:40,870 --> 00:33:45,040 But in this case, the Intel architecture and many other 629 00:33:45,040 --> 00:33:50,820 architectures out there will give you the wrong value for 630 00:33:50,820 --> 00:33:53,920 the execution of these instructions. 631 00:33:53,920 --> 00:33:56,270 Any question about that? 632 00:33:56,270 --> 00:34:01,075 So it doesn't preserve sequential consistency. 633 00:34:04,370 --> 00:34:08,510 So that's kind of scary in some way because you got to 634 00:34:08,510 --> 00:34:10,280 reason about this. 635 00:34:10,280 --> 00:34:12,679 Let's see what happens in Peterson's algorithm if you 636 00:34:12,679 --> 00:34:15,889 don't have sequential consistency. 637 00:34:15,889 --> 00:34:19,139 So here we go. 638 00:34:19,139 --> 00:34:21,179 We have the code where she_wants is true, 639 00:34:21,179 --> 00:34:23,130 turn is his, et cetera. 640 00:34:23,130 --> 00:34:26,150 How is this going to fail? 641 00:34:26,150 --> 00:34:27,400 What could happen here?
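Before digging into Peterson's algorithm, the claim just made -- that the reordered schedule produces EAX = EBX = 0, an outcome no sequentially consistent interleaving allows -- can be checked by brute force. This is a simulation sketch, not the lecture's code; the operation numbering follows the slide.

```c
#include <assert.h>

/* The four operations of the two-processor example:
     1: P0 stores 1 into a        2: P0 loads b into EBX
     3: P1 stores 1 into b        4: P1 loads a into EAX  */
static int a, b, eax, ebx;

static void run(const int *order) {
    a = b = eax = ebx = 0;
    for (int i = 0; i < 4; i++) {
        switch (order[i]) {
        case 1: a = 1;   break;
        case 2: ebx = b; break;
        case 3: b = 1;   break;
        case 4: eax = a; break;
        }
    }
}

/* Can some sequentially consistent interleaving -- one that keeps 1
   before 2 and 3 before 4 -- produce the outcome (want_eax, want_ebx)? */
static int sc_possible(int want_eax, int want_ebx) {
    static const int orders[6][4] = {
        {1, 2, 3, 4}, {1, 3, 2, 4}, {1, 3, 4, 2},
        {3, 1, 2, 4}, {3, 1, 4, 2}, {3, 4, 1, 2},
    };
    for (int i = 0; i < 6; i++) {
        run(orders[i]);
        if (eax == want_eax && ebx == want_ebx)
            return 1;
    }
    return 0;
}
```

Sequential consistency allows (1, 0), (0, 1), and (1, 1), but never (0, 0); the reordered schedule 2, 4, 1, 3 -- both loads bypassing the prior stores -- produces exactly the forbidden outcome.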
642 00:34:34,100 --> 00:34:35,550 Where will the bug arise? 643 00:34:35,550 --> 00:34:36,460 What's going to happen? 644 00:34:36,460 --> 00:34:39,520 What's the reordering that might happen? 645 00:34:39,520 --> 00:34:42,164 AUDIENCE: On the while you do loads, right? 646 00:34:42,164 --> 00:34:43,690 [INAUDIBLE] the he_wants and turn is. 647 00:34:43,690 --> 00:34:44,994 PROFESSOR: Sorry? 648 00:34:44,994 --> 00:34:46,416 AUDIENCE: On the while statement, 649 00:34:46,416 --> 00:34:48,312 you do a load, right? 650 00:34:48,312 --> 00:34:49,260 Because [INAUDIBLE]. 651 00:34:49,260 --> 00:34:49,380 PROFESSOR: Right. 652 00:34:49,380 --> 00:34:51,810 He_wants is a load. 653 00:34:51,810 --> 00:34:53,420 AUDIENCE: And so that will get reordered. 654 00:34:53,420 --> 00:34:54,670 PROFESSOR: Where could that be reordered to? 655 00:34:59,250 --> 00:35:03,480 That could be reordered all the way to the top. 656 00:35:03,480 --> 00:35:07,130 Similarly, this one can be reordered all 657 00:35:07,130 --> 00:35:09,830 the way to the top. 658 00:35:09,830 --> 00:35:13,530 So the loads could be ordered all the way to the top. 659 00:35:13,530 --> 00:35:16,550 And now, what's going to happen is you're going to set 660 00:35:16,550 --> 00:35:20,630 that she_wants is true but get a value of he_wants 661 00:35:20,630 --> 00:35:23,420 that might be old. 662 00:35:23,420 --> 00:35:26,110 And so they won't see each other's values. 663 00:35:26,110 --> 00:35:29,970 And so then, both threads can now enter the critical section 664 00:35:29,970 --> 00:35:31,220 simultaneously. 665 00:35:33,420 --> 00:35:35,666 Yeah, Reid? 666 00:35:35,666 --> 00:35:40,112 AUDIENCE: If you swap the order of the loads, does the 667 00:35:40,112 --> 00:35:43,090 [INAUDIBLE]? 
668 00:35:43,090 --> 00:35:45,512 PROFESSOR: If you swap the order of the loads-- 669 00:35:45,512 --> 00:35:48,002 AUDIENCE: If you swap-- put the turn equals his on the 670 00:35:48,002 --> 00:35:50,492 left, [INAUDIBLE] on the right. 671 00:35:50,492 --> 00:35:52,484 Because according to-- 672 00:35:52,484 --> 00:35:55,485 PROFESSOR: Put the turn equals his over here? 673 00:35:55,485 --> 00:35:58,460 AUDIENCE: Because the he_wants can't cross the load. 674 00:35:58,460 --> 00:36:01,130 PROFESSOR: Yeah, but that's not what you want to do. 675 00:36:01,130 --> 00:36:02,615 AUDIENCE: Then you can't [INAUDIBLE]. 676 00:36:05,590 --> 00:36:08,410 PROFESSOR: The whole idea here is that when you're saying you 677 00:36:08,410 --> 00:36:12,320 want to do something, you give the other one a turn so that 678 00:36:12,320 --> 00:36:18,560 whoever ends up winning the race allows just one of them 679 00:36:18,560 --> 00:36:19,130 to go through. 680 00:36:19,130 --> 00:36:20,089 Yeah? 681 00:36:20,089 --> 00:36:22,434 AUDIENCE: I think the point is that if you put turn equals 682 00:36:22,434 --> 00:36:25,720 his and he_wants-- 683 00:36:25,720 --> 00:36:27,266 PROFESSOR: You're saying this stuff here. 684 00:36:27,266 --> 00:36:29,696 AUDIENCE: Swap those two [UNINTELLIGIBLE] turn equals 685 00:36:29,696 --> 00:36:33,770 his will not be reordered before the store that-- 686 00:36:33,770 --> 00:36:34,530 PROFESSOR: You might be right. 687 00:36:34,530 --> 00:36:35,801 Let me think about that. 688 00:36:35,801 --> 00:36:37,725 AUDIENCE: You both reorder the same [? word. ?] 689 00:36:37,725 --> 00:36:40,611 AUDIENCE: But you just stored turn, right? 690 00:36:40,611 --> 00:36:41,092 PROFESSOR: Yeah. 691 00:36:41,092 --> 00:36:42,535 So if do turn equals his-- 692 00:36:42,535 --> 00:36:43,657 I see what you're saying. 693 00:36:43,657 --> 00:36:44,459 Do this turn equals his. 694 00:36:44,459 --> 00:36:45,920 I was looking at this turn equals his. 
695 00:36:45,920 --> 00:36:47,650 AUDIENCE: You mean turn equals equals his. 696 00:36:47,650 --> 00:36:49,060 AUDIENCE: So the Boolean expression [INAUDIBLE]. 697 00:36:49,060 --> 00:36:49,530 PROFESSOR: Yeah. 698 00:36:49,530 --> 00:36:50,940 OK, I hadn't thought about that. 699 00:36:50,940 --> 00:36:52,840 Let me just think about that a second. 700 00:36:52,840 --> 00:36:57,175 So if we do the turn equals his-- 701 00:36:57,175 --> 00:36:58,540 AUDIENCE: [INAUDIBLE] 702 00:36:58,540 --> 00:37:00,360 and you won't reorder those two [INAUDIBLE]? 703 00:37:00,360 --> 00:37:01,610 PROFESSOR: Then the-- 704 00:37:06,600 --> 00:37:07,000 Yeah. 705 00:37:07,000 --> 00:37:08,250 You got to be-- 706 00:37:11,120 --> 00:37:13,760 I have to think about that. 707 00:37:13,760 --> 00:37:15,610 I don't know about you folks, but I find this stuff really 708 00:37:15,610 --> 00:37:18,220 hard to think about. 709 00:37:18,220 --> 00:37:19,760 And so do most people, I think. 710 00:37:22,290 --> 00:37:24,450 This is one of these things where I don't think I can do 711 00:37:24,450 --> 00:37:27,800 it without sitting down for 10 minutes and 712 00:37:27,800 --> 00:37:30,820 thinking about it deeply. 713 00:37:30,820 --> 00:37:32,980 But it's an interesting thought that if you did it the 714 00:37:32,980 --> 00:37:35,220 other direction that maybe there would be 715 00:37:35,220 --> 00:37:39,350 a requirement there. 716 00:37:39,350 --> 00:37:43,910 I'm skeptical that that is true because to my knowledge, 717 00:37:43,910 --> 00:37:47,020 to do the mutual exclusion, you pretty much have to do 718 00:37:47,020 --> 00:37:48,440 what I'm going to talk about next. 719 00:37:51,890 --> 00:37:53,730 But it would be interesting if it is true. 720 00:37:57,890 --> 00:38:00,680 Because you also have to worry about this guy getting 721 00:38:00,680 --> 00:38:03,052 reordered with respect to this one.
722 00:38:03,052 --> 00:38:04,390 AUDIENCE: The loads can't be reordered with 723 00:38:04,390 --> 00:38:05,730 respect to each other. 724 00:38:05,730 --> 00:38:08,910 PROFESSOR: So he_wants and turn equals his. 725 00:38:08,910 --> 00:38:10,596 Yeah. 726 00:38:10,596 --> 00:38:12,220 So the loads won't be reordered. 727 00:38:12,220 --> 00:38:14,010 Yeah. 728 00:38:14,010 --> 00:38:15,630 So that looks OK. 729 00:38:15,630 --> 00:38:17,460 And then, you're saying and then therefore, it can't go 730 00:38:17,460 --> 00:38:20,450 forward because this one won't get reordered with that one. 731 00:38:20,450 --> 00:38:22,450 You might be right. 732 00:38:22,450 --> 00:38:23,700 That'd be cute. 733 00:38:26,370 --> 00:38:29,370 So I have to update the slides for next year if that's true. 734 00:38:32,670 --> 00:38:36,530 So one way out of this quandary is to use what's 735 00:38:36,530 --> 00:38:40,070 called a memory fence or memory barrier. 736 00:38:40,070 --> 00:38:42,950 And it's a hardware action that enforces an ordering 737 00:38:42,950 --> 00:38:45,620 constraint between the instructions before 738 00:38:45,620 --> 00:38:48,770 and after the fence. 739 00:38:48,770 --> 00:38:52,600 So a memory fence says don't allow the processor to reorder 740 00:38:52,600 --> 00:38:54,210 these things. 741 00:38:54,210 --> 00:38:57,290 So why would you not want to do a memory fence? 742 00:39:01,680 --> 00:39:02,810 Then we'll talk about why you do it. 743 00:39:02,810 --> 00:39:05,213 Yeah? 744 00:39:05,213 --> 00:39:07,430 AUDIENCE: To force a hardware slowdown? 745 00:39:07,430 --> 00:39:07,710 PROFESSOR: Yeah. 746 00:39:07,710 --> 00:39:08,480 You're forcing the hardware slowdown. 747 00:39:08,480 --> 00:39:11,100 You're also forcing the compiler because the compiler has to 748 00:39:11,100 --> 00:39:12,190 respect that, too. 749 00:39:12,190 --> 00:39:14,480 You're not letting the compiler do optimizations 750 00:39:14,480 --> 00:39:16,810 across the fence.
751 00:39:16,810 --> 00:39:21,800 So generally, fences slow things down. 752 00:39:21,800 --> 00:39:23,440 In addition, it turns out that they have 753 00:39:23,440 --> 00:39:24,690 some significant overhead. 754 00:39:26,970 --> 00:39:32,510 So you can issue a memory fence explicitly as an 755 00:39:32,510 --> 00:39:33,230 instruction. 756 00:39:33,230 --> 00:39:39,180 So the mfence instruction sets a memory fence. 757 00:39:39,180 --> 00:39:43,990 There's also, it turns out, on x86 an lfence and an sfence, 758 00:39:43,990 --> 00:39:50,940 which order loads but not stores, and 759 00:39:50,940 --> 00:39:52,300 stores but not loads, respectively. 760 00:39:52,300 --> 00:39:54,860 And this one is basically both. 761 00:39:54,860 --> 00:39:57,090 From the point of view of what we're using it for, we're only 762 00:39:57,090 --> 00:40:00,070 going to worry about the fences. 763 00:40:00,070 --> 00:40:01,490 They're done by the explicit one. 764 00:40:01,490 --> 00:40:03,840 But it also turns out all the locking instructions 765 00:40:03,840 --> 00:40:07,530 automatically put a fence in. 766 00:40:07,530 --> 00:40:14,170 One of the humorous things in recent memory is that there were major 767 00:40:14,170 --> 00:40:18,930 manufacturers for whom the lock instruction was actually 768 00:40:18,930 --> 00:40:23,000 faster than doing a memory fence, which is kind of weird 769 00:40:23,000 --> 00:40:27,510 because a lock instruction does a memory fence. 770 00:40:27,510 --> 00:40:29,530 So how do you think that sort of thing comes about? 771 00:40:29,530 --> 00:40:33,160 So when you looked at performance it would be like-- 772 00:40:33,160 --> 00:40:35,020 for this particular machine I'm thinking about-- 773 00:40:35,020 --> 00:40:41,750 it was 30 cycles to do a lock instruction. 774 00:40:41,750 --> 00:40:46,435 And it was on the order of 50 cycles to do a memory fence.
775 00:40:49,610 --> 00:40:51,230 And so if you want to do a memory fence, 776 00:40:51,230 --> 00:40:53,725 what should you do? 777 00:40:53,725 --> 00:40:54,260 AUDIENCE: Do a lock. 778 00:40:54,260 --> 00:40:55,490 PROFESSOR: Do a lock instruction 779 00:40:55,490 --> 00:40:57,550 instead to get the effect. 780 00:40:57,550 --> 00:40:59,460 But why do you suppose that came up in the hardware? 781 00:40:59,460 --> 00:41:02,800 Why is it that one instruction would be-- 782 00:41:08,400 --> 00:41:14,190 It's a social reason why this sort of thing happens. 783 00:41:14,190 --> 00:41:16,510 So I don't know for sure. 784 00:41:16,510 --> 00:41:19,220 But I know enough about engineering to understand how 785 00:41:19,220 --> 00:41:20,760 these things come about. 786 00:41:20,760 --> 00:41:22,240 So here's what goes on. 787 00:41:22,240 --> 00:41:25,680 They do studies of traces of programs. 788 00:41:25,680 --> 00:41:28,690 And how often do you think lock instructions occur? 789 00:41:28,690 --> 00:41:32,200 And how often do you think fence instructions occur? 790 00:41:32,200 --> 00:41:35,640 Turns out lock instructions occur all the time, whereas 791 00:41:35,640 --> 00:41:39,040 fences, they don't occur so often because usually it's 792 00:41:39,040 --> 00:41:41,280 somebody who really knows what they're doing who's using a 793 00:41:41,280 --> 00:41:43,120 memory fence. 794 00:41:43,120 --> 00:41:46,590 So then, they say to the engineering team, we're going 795 00:41:46,590 --> 00:41:48,170 to make our code go faster. 796 00:41:48,170 --> 00:41:50,630 And lock instructions are going really fast. 797 00:41:50,630 --> 00:41:53,740 So they put a top engineer on making lock 798 00:41:53,740 --> 00:41:56,350 instructions go fast. 
799 00:41:56,350 --> 00:42:04,530 They put a second-rate engineer on making memory 800 00:42:04,530 --> 00:42:06,860 fence operations go fast because they're not used as 801 00:42:06,860 --> 00:42:11,780 often, without sort of recognizing that, gee, what 802 00:42:11,780 --> 00:42:14,800 you do for one is the same problem. 803 00:42:14,800 --> 00:42:17,510 You can do the same thing for the other. 804 00:42:17,510 --> 00:42:19,770 So it ends up you'll see things in architecture that 805 00:42:19,770 --> 00:42:22,070 are really quite humorous like that, where things are sort of 806 00:42:22,070 --> 00:42:25,750 like, wait a minute, how come this is slower when well, it 807 00:42:25,750 --> 00:42:28,770 probably has to do with the engineering team that built 808 00:42:28,770 --> 00:42:31,500 the system. 809 00:42:31,500 --> 00:42:34,390 And actually now I'm aware of two architectures where they 810 00:42:34,390 --> 00:42:38,810 did the same kind of thing by different manufacturers. 811 00:42:38,810 --> 00:42:41,990 Where they got these memory fences. 812 00:42:41,990 --> 00:42:46,390 It should be at least as fast because the one is doing-- 813 00:42:46,390 --> 00:42:48,910 anyway. 814 00:42:48,910 --> 00:42:51,220 Interesting story there. 815 00:42:51,220 --> 00:42:55,580 Now, you can actually access a memory fence using a built-in 816 00:42:55,580 --> 00:42:59,280 function called __sync_synchronize. 817 00:42:59,280 --> 00:43:00,890 And in fact, there's a whole set of atomics-- 818 00:43:00,890 --> 00:43:03,770 I've put the information here for where you can go and look 819 00:43:03,770 --> 00:43:07,480 at the atomic operations that include memory fences and so 820 00:43:07,480 --> 00:43:09,690 forth to use in the compiler. 821 00:43:09,690 --> 00:43:13,610 It turns out when I was trying to get this going last night, 822 00:43:13,610 --> 00:43:15,100 I couldn't get it to work.
823 00:43:15,100 --> 00:43:18,590 And it turns out that's because our compiler had a bug 824 00:43:18,590 --> 00:43:25,320 where this instruction was compiling to nothing. 825 00:43:25,320 --> 00:43:27,250 There's a compiler bug. 826 00:43:27,250 --> 00:43:33,460 And so I messed around for far too much time and then finally 827 00:43:33,460 --> 00:43:35,740 sent out a help message to the T.A.s. 828 00:43:35,740 --> 00:43:37,750 And then, John figured out that there was a bug. 829 00:43:37,750 --> 00:43:39,520 And he's patched all the compilers so that you 830 00:43:39,520 --> 00:43:42,800 guys all have it. 831 00:43:42,800 --> 00:43:46,590 But anyway, it was like, how come this isn't working? 832 00:43:46,590 --> 00:43:47,950 AUDIENCE: What compiler are we using? 833 00:43:47,950 --> 00:43:49,550 PROFESSOR: This was GCC. 834 00:43:49,550 --> 00:43:54,120 I was trying 4.1, and I tried 4.3. 835 00:43:54,120 --> 00:43:57,290 And so the one that we're using in class for the most 836 00:43:57,290 --> 00:43:59,010 part, is 4.3. 837 00:43:59,010 --> 00:44:00,830 So anyway, John put the patch in. 838 00:44:00,830 --> 00:44:03,370 So now, when you use these 839 00:44:03,370 --> 00:44:05,210 instructions, they're all there. 840 00:44:09,100 --> 00:44:11,580 And then, the last thing is that the typical cost of a 841 00:44:11,580 --> 00:44:15,610 memory fence operation is comparable to that of an L2 842 00:44:15,610 --> 00:44:16,860 cache access. 843 00:44:18,870 --> 00:44:24,400 So memory fences tend to be on our machine-- 844 00:44:24,400 --> 00:44:26,030 and I haven't actually measured in our machine. 845 00:44:26,030 --> 00:44:28,430 I meant to do that, and I didn't get around to it. 846 00:44:28,430 --> 00:44:31,850 It's probably on the order of 10, or 15 cycles, or 847 00:44:31,850 --> 00:44:35,680 something, which is not bad. 848 00:44:35,680 --> 00:44:37,860 If it's less than 20, it's pretty good.
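As a minimal sketch of where such a fence goes, here is the two-processor example from earlier with a full fence between each store and the following load, using the GCC built-in `__sync_synchronize` (assuming GCC or a compatible compiler). The check below runs the two routines sequentially, so it only verifies that the code behaves; the fence's payoff is visible only when the routines run concurrently on different processors.

```c
#include <assert.h>

/* The two-processor example with a full memory fence between each store
   and the following load, using the GCC built-in discussed above. */
static volatile int a = 0, b = 0;
static int eax, ebx;

static void processor0(void) {
    a = 1;
    __sync_synchronize();   /* full fence: the store to a must complete
                               before the load of b may issue */
    ebx = b;
}

static void processor1(void) {
    b = 1;
    __sync_synchronize();
    eax = a;
}
```

With the fences in place, the store buffer can no longer let the loads bypass the stores, so the (0, 0) outcome becomes impossible even under concurrency.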
849 00:44:42,130 --> 00:44:44,550 So here's Peterson's algorithm with memory fences. 850 00:44:44,550 --> 00:44:48,820 You just simply stick in the memory fence there to prevent 851 00:44:48,820 --> 00:44:49,460 the reordering. 852 00:44:49,460 --> 00:44:51,730 And it's interesting if there's a way that we can play 853 00:44:51,730 --> 00:44:54,230 the game with the instruction stream to do the same thing 854 00:44:54,230 --> 00:44:57,350 because that would make this code go, generally, a lot 855 00:44:57,350 --> 00:45:02,230 faster in terms of overhead. 856 00:45:02,230 --> 00:45:06,380 And so using memory fences, you can restore consistency. 857 00:45:06,380 --> 00:45:08,530 Now, memory fences are like data races. 858 00:45:08,530 --> 00:45:10,480 If you don't have them, how do you know that 859 00:45:10,480 --> 00:45:11,400 you don't have them? 860 00:45:11,400 --> 00:45:13,950 It's very difficult to regression test for them, 861 00:45:13,950 --> 00:45:15,250 which is one reason I think there was a 862 00:45:15,250 --> 00:45:17,400 bug in the GCC compiler. 863 00:45:17,400 --> 00:45:22,200 How do you know that some piece of code is failing? 864 00:45:22,200 --> 00:45:25,170 Because most of the time it will work correctly. 865 00:45:25,170 --> 00:45:28,640 It's just occasionally, there'll be some reordering, 866 00:45:28,640 --> 00:45:30,680 and timing, and race condition that causes 867 00:45:30,680 --> 00:45:32,190 it not to work out. 868 00:45:32,190 --> 00:45:36,270 In this case, you have to have both the race and the 869 00:45:36,270 --> 00:45:39,520 reordering happening at the same time for Peterson's 870 00:45:39,520 --> 00:45:41,230 algorithm, for example. 871 00:45:41,230 --> 00:45:44,090 So things like this can be very difficult 872 00:45:44,090 --> 00:45:45,770 for compilers. 873 00:45:45,770 --> 00:45:48,920 Really, the way to do it, which is what I was doing, was 874 00:45:48,920 --> 00:45:54,590 do an objdump and search for the fence in there.
875 00:45:54,590 --> 00:45:59,376 And in this case, it wasn't in there. 876 00:45:59,376 --> 00:46:02,610 AUDIENCE: And also the compiler analyzes it by itself. 877 00:46:02,610 --> 00:46:06,340 And it sees this instruction that basically doesn't change the code. 878 00:46:06,340 --> 00:46:07,130 PROFESSOR: Right. 879 00:46:07,130 --> 00:46:07,960 It's not doing anything. 880 00:46:07,960 --> 00:46:08,780 Right. 881 00:46:08,780 --> 00:46:10,570 So it says, oop, get out of it. 882 00:46:10,570 --> 00:46:11,400 Yep. 883 00:46:11,400 --> 00:46:12,650 Good. 884 00:46:15,220 --> 00:46:17,550 So any questions about consistency? 885 00:46:17,550 --> 00:46:20,490 So it turns out that most of the time, when you're 886 00:46:20,490 --> 00:46:23,320 designing things where you want to synchronize through 887 00:46:23,320 --> 00:46:30,610 memory directly, rather than using locks or what have you, 888 00:46:30,610 --> 00:46:32,870 the methodology that I found works pretty well is this. 889 00:46:32,870 --> 00:46:36,010 Work it out for sequential consistency, and then figure 890 00:46:36,010 --> 00:46:39,150 out where you have to put the fences in. 891 00:46:39,150 --> 00:46:42,530 And that's a pretty good methodology 892 00:46:42,530 --> 00:46:44,190 for working out where-- 893 00:46:44,190 --> 00:46:45,550 here's sequential consistency. 894 00:46:45,550 --> 00:46:48,750 Now, what reorderings do I need to prevent in order to 895 00:46:48,750 --> 00:46:51,250 make sure that it works properly? 896 00:46:51,250 --> 00:46:53,450 And that can be error prone. 897 00:46:53,450 --> 00:46:56,910 So once again, big skull and crossbones on whether you 898 00:46:56,910 --> 00:46:58,830 actually try this in practice. 899 00:46:58,830 --> 00:47:00,135 It really better make a difference.
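The fenced Peterson's algorithm being discussed looks roughly like this; a sketch under our own naming (`want`, `turn`), not the lecture's exact code, with the fence placed between the publishing stores and the spin-read that the reordering would otherwise break:

```c
#include <stdbool.h>

/* Peterson's 2-thread mutual exclusion; `me` is 0 or 1. */
static volatile bool want[2];
static volatile int turn;

void peterson_lock(int me) {
    int other = 1 - me;
    want[me] = true;
    turn = other;
    __sync_synchronize();   /* fence: the stores above must be visible
                               before we read the other thread's flag */
    while (want[other] && turn == other)
        ;                   /* spin until it's our turn */
}

void peterson_unlock(int me) {
    want[me] = false;       /* release: let the other thread in */
}
```

Without the fence, an x86 machine may hoist the load of `want[other]` above the store to `want[me]`, which is exactly the store-buffer reordering that lets both threads enter.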
900 00:47:04,100 --> 00:47:07,730 Now, the fact that you can synchronize directly through 901 00:47:07,730 --> 00:47:12,600 memory has led to a lot of protocols that are called 902 00:47:12,600 --> 00:47:19,610 lock-free protocols, which have some advantages, 903 00:47:19,610 --> 00:47:22,290 in particular because they don't use locks. 904 00:47:22,290 --> 00:47:25,870 And so I want to illustrate some of those because you'll 905 00:47:25,870 --> 00:47:26,790 see these in certain places. 906 00:47:26,790 --> 00:47:29,620 So recall the summing problem from last time. 907 00:47:29,620 --> 00:47:33,710 So here we have an array. 908 00:47:33,710 --> 00:47:36,950 And what we're going to do is run through all the elements 909 00:47:36,950 --> 00:47:39,310 in the array, computing something on every element, 910 00:47:39,310 --> 00:47:41,050 and adding into result. 911 00:47:41,050 --> 00:47:43,100 And we wanted to parallelize that. 912 00:47:43,100 --> 00:47:45,750 So we parallelize that with a cilk_for. 913 00:47:45,750 --> 00:47:49,450 And what was the problem when we parallelized this? 914 00:47:49,450 --> 00:47:51,770 We get a race. 915 00:47:51,770 --> 00:47:52,610 So there's the race. 916 00:47:52,610 --> 00:47:58,110 We get a race on result because we've got two parallel 917 00:47:58,110 --> 00:47:59,840 instructions both trying to update 918 00:47:59,840 --> 00:48:02,810 result at the same time. 919 00:48:02,810 --> 00:48:06,230 So we can solve that with a lock. 920 00:48:06,230 --> 00:48:07,830 And I showed you last time that we could 921 00:48:07,830 --> 00:48:10,870 solve this with a lock, 922 00:48:10,870 --> 00:48:16,690 by declaring a mutex, and then locking before we update the 923 00:48:16,690 --> 00:48:18,250 result, and then unlocking. 924 00:48:18,250 --> 00:48:20,650 And of course, as we argued yesterday, that could cause 925 00:48:20,650 --> 00:48:22,170 severe contention. 926 00:48:22,170 --> 00:48:24,430 Now, contention can be an issue.
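The locked version of the summing loop can be sketched as follows; this is a sketch using a pthreads mutex as a stand-in for the Cilk mutex in the lecture, and `compute` is our placeholder for the per-element work:

```c
#include <pthread.h>

static pthread_mutex_t result_mutex = PTHREAD_MUTEX_INITIALIZER;
static int result = 0;

static int compute(int x) { return x * x; }   /* placeholder work */

/* One iteration of the summing loop: the long compute happens
   outside the lock; the lock protects only the short update. */
void add_element(int x) {
    int temp = compute(x);
    pthread_mutex_lock(&result_mutex);
    result += temp;                /* the racy update, now serialized */
    pthread_mutex_unlock(&result_mutex);
}
```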
927 00:48:24,430 --> 00:48:27,740 But notice the compute here, which I've moved 928 00:48:27,740 --> 00:48:29,990 outside the lock. 929 00:48:29,990 --> 00:48:33,400 I've put it into temp and then added temp in so I can lock 930 00:48:33,400 --> 00:48:34,860 for the shortest possible time. 931 00:48:34,860 --> 00:48:38,250 If this compute is sufficiently large, there may 932 00:48:38,250 --> 00:48:38,890 be contention. 933 00:48:38,890 --> 00:48:41,340 But it may not be significant contention in your 934 00:48:41,340 --> 00:48:44,860 execution because the update here could be very, very short 935 00:48:44,860 --> 00:48:47,330 compared with the time it takes to compute. 936 00:48:47,330 --> 00:48:51,290 So for example, if computing on array i costs you more than, 937 00:48:51,290 --> 00:48:59,000 say, order n time, then the fact that you have contention 938 00:48:59,000 --> 00:49:02,530 there isn't going to matter, generally, because the total 939 00:49:02,530 --> 00:49:04,710 amount of time that you're going to be locking is just 940 00:49:04,710 --> 00:49:06,800 small compared to the total execution time. 941 00:49:09,370 --> 00:49:13,370 Still, in a multiprogrammed setting, there may be other 942 00:49:13,370 --> 00:49:16,470 problems that you can get into, even when you have this 943 00:49:16,470 --> 00:49:19,510 and even if you think that contention 944 00:49:19,510 --> 00:49:20,760 is going to be minimal. 945 00:49:23,810 --> 00:49:26,160 So can anybody think of what the issues might be? 946 00:49:26,160 --> 00:49:29,180 Why could this be problematic even if contention 947 00:49:29,180 --> 00:49:30,430 is not a big issue? 948 00:49:35,170 --> 00:49:37,780 And the hint here is it's in a multiprogrammed setting. 949 00:49:50,720 --> 00:49:52,470 So what happens in a multiprogrammed setting? 950 00:49:58,770 --> 00:49:59,445 Yeah.
951 00:49:59,445 --> 00:50:01,225 AUDIENCE: [INAUDIBLE] 952 00:50:01,225 --> 00:50:02,810 PROFESSOR: Because the result is-- 953 00:50:02,810 --> 00:50:04,060 AUDIENCE: [INAUDIBLE PHRASE] 954 00:50:07,380 --> 00:50:09,960 PROFESSOR: It actually doesn't have to do with result here. 955 00:50:09,960 --> 00:50:13,960 It has to do with locking explicitly. 956 00:50:13,960 --> 00:50:16,800 It's a problem with locking in a multiprogrammed environment. 957 00:50:16,800 --> 00:50:18,890 What happens in a multiprogrammed environment? 958 00:50:18,890 --> 00:50:21,143 What do I mean by multiprogrammed environment? 959 00:50:21,143 --> 00:50:22,090 AUDIENCE: [INAUDIBLE] 960 00:50:22,090 --> 00:50:23,640 PROFESSOR: You have multiple jobs running, right? 961 00:50:23,640 --> 00:50:25,870 And what happens to the processor when there are 962 00:50:25,870 --> 00:50:27,762 multiple jobs running? 963 00:50:27,762 --> 00:50:29,470 AUDIENCE: [INAUDIBLE] 964 00:50:29,470 --> 00:50:30,720 PROFESSOR: Context switches. 965 00:50:34,530 --> 00:50:36,430 So now, what can go wrong here? 966 00:50:36,430 --> 00:50:38,280 What can be really bad here? 967 00:50:38,280 --> 00:50:38,530 Yeah. 968 00:50:38,530 --> 00:50:40,021 AUDIENCE: You acquire the lock and then the 969 00:50:40,021 --> 00:50:40,520 context switch happens. 970 00:50:40,520 --> 00:50:41,560 PROFESSOR: Yeah. 971 00:50:41,560 --> 00:50:43,140 You acquire the lock. 972 00:50:43,140 --> 00:50:46,190 And then, the operating system context switches you out. 973 00:50:46,190 --> 00:50:46,890 And so what happens? 974 00:50:46,890 --> 00:50:50,570 You hold the lock while some other job is running. 975 00:50:50,570 --> 00:50:52,090 And what are those guys doing? 976 00:50:52,090 --> 00:50:55,460 They go and spin and wait on the lock. 977 00:50:55,460 --> 00:50:58,380 Now, this is a good time where you'd rather not have a 978 00:50:58,380 --> 00:50:59,130 spinning lock. 979 00:50:59,130 --> 00:51:03,010 You'd rather have a yielding lock.
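A spin-then-yield lock of the kind being contrasted here might look like this; a sketch only, where `SPIN_TRIES` is an arbitrary tuning guess and the names are ours:

```c
#include <sched.h>

#define SPIN_TRIES 100

static volatile int lock_word = 0;   /* 0 = free, 1 = held */

/* Spin briefly in case the holder is running on another core,
   then yield to the OS so a swapped-out holder can be rescheduled. */
void spin_yield_lock(void) {
    for (;;) {
        for (int i = 0; i < SPIN_TRIES; i++)
            if (__sync_lock_test_and_set(&lock_word, 1) == 0)
                return;          /* grabbed the lock while spinning */
        sched_yield();           /* holder may be descheduled; give up
                                    our time slice instead of burning it */
    }
}

void spin_yield_unlock(void) {
    __sync_lock_release(&lock_word);   /* store 0, releasing the lock */
}
```

This is the "competitive" idea mentioned a bit later: spinning bounds the cost when the holder is running, yielding bounds it when the holder has been context switched out.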
980 00:51:03,010 --> 00:51:06,590 But even so, suddenly you're talking about something that's 981 00:51:06,590 --> 00:51:10,650 operating at the level of 100 times a second, 10 982 00:51:10,650 --> 00:51:13,970 milliseconds, versus something that is operating on a 983 00:51:13,970 --> 00:51:15,940 nanosecond level. 984 00:51:15,940 --> 00:51:19,460 So you're talking six orders of magnitude of performance 985 00:51:19,460 --> 00:51:25,020 difference if you end up getting switched out while you 986 00:51:25,020 --> 00:51:26,270 hold a lock. 987 00:51:30,310 --> 00:51:30,900 That's the issue. 988 00:51:30,900 --> 00:51:31,880 What happens. 989 00:51:31,880 --> 00:51:34,550 And then, if that happens, all the other loop 990 00:51:34,550 --> 00:51:37,225 iterations must wait. 991 00:51:37,225 --> 00:51:38,710 AUDIENCE: [INAUDIBLE] 992 00:51:38,710 --> 00:51:41,185 in the large program here [UNINTELLIGIBLE PHRASE]. 993 00:51:43,887 --> 00:51:45,110 I don't have a mic. 994 00:51:45,110 --> 00:51:46,862 If one [UNINTELLIGIBLE] crashes or one 995 00:51:46,862 --> 00:51:48,112 [UNINTELLIGIBLE PHRASE] 996 00:51:52,318 --> 00:51:53,568 the lock [UNINTELLIGIBLE PHRASE]. 997 00:52:01,742 --> 00:52:04,926 AUDIENCE: Can you specify whether those are yielding 998 00:52:04,926 --> 00:52:05,890 locks or spinning locks? 999 00:52:05,890 --> 00:52:08,370 PROFESSOR: Usually, the mutex type will tell you. 1000 00:52:08,370 --> 00:52:10,850 I'm just using a simple name of mutex. 1001 00:52:10,850 --> 00:52:15,140 I probably should have been using the ones that-- 1002 00:52:15,140 --> 00:52:17,090 we were using one called Cilk mutex. 1003 00:52:17,090 --> 00:52:20,000 And I probably should've used that here rather than just 1004 00:52:20,000 --> 00:52:22,770 simple mutex. 1005 00:52:22,770 --> 00:52:25,560 AUDIENCE: Are they yielding? 1006 00:52:25,560 --> 00:52:26,620 PROFESSOR: There's a good question. 1007 00:52:26,620 --> 00:52:30,330 I used to know the answer to this. 
1008 00:52:30,330 --> 00:52:34,390 I believe that those spin for a while-- they're competitive. 1009 00:52:34,390 --> 00:52:35,890 They spin for a while and then yield. 1010 00:52:35,890 --> 00:52:36,610 But I'm not sure. 1011 00:52:36,610 --> 00:52:39,260 They may just spin. 1012 00:52:39,260 --> 00:52:42,720 They don't just automatically yield. 1013 00:52:42,720 --> 00:52:45,520 They're either competitive, or they'll spin and yield. 1014 00:52:45,520 --> 00:52:47,170 I believe they spin and yield. 1015 00:52:47,170 --> 00:52:48,690 And I believe there's actually a switch where 1016 00:52:48,690 --> 00:52:49,720 you can tell it-- 1017 00:52:49,720 --> 00:52:50,970 if you're doing timing measurements-- 1018 00:52:53,400 --> 00:52:55,140 make it so that it purely spins so that you can get 1019 00:52:55,140 --> 00:52:57,757 better benchmark results. 1020 00:52:57,757 --> 00:53:06,370 AUDIENCE: So my question is does the kernel have the power to 1021 00:53:06,370 --> 00:53:08,660 switch out a spinning lock or not? 1022 00:53:08,660 --> 00:53:09,270 PROFESSOR: Yeah. 1023 00:53:09,270 --> 00:53:11,910 Well, the kernel, the scheduler, can come in at any 1024 00:53:11,910 --> 00:53:13,430 moment and say, whoop, you're out. 1025 00:53:16,160 --> 00:53:16,790 You're out. 1026 00:53:16,790 --> 00:53:18,850 That's it. 1027 00:53:18,850 --> 00:53:21,640 And wherever it is, it interrupts it at 1028 00:53:21,640 --> 00:53:23,280 that moment in time. 1029 00:53:26,670 --> 00:53:31,450 So one solution to this problem is to 1030 00:53:31,450 --> 00:53:33,110 use a lock-free method. 1031 00:53:33,110 --> 00:53:35,430 And one of the common ways of doing that is with what's 1032 00:53:35,430 --> 00:53:38,730 called a compare-and-swap instruction. 1033 00:53:38,730 --> 00:53:41,130 So this is what's called a locking instruction, meaning 1034 00:53:41,130 --> 00:53:45,150 it's one of these ones that goes out to L2, in terms of 1035 00:53:45,150 --> 00:53:46,590 timing and so forth.
1036 00:53:46,590 --> 00:53:50,950 And what it does is the following thing. 1037 00:53:50,950 --> 00:53:54,530 It has an address of a location. 1038 00:53:54,530 --> 00:53:58,660 And it's got the old value that was stored in the 1039 00:53:58,660 --> 00:54:00,520 location and a new value. 1040 00:54:00,520 --> 00:54:04,810 And it says if the value that is there is the old value, 1041 00:54:04,810 --> 00:54:07,420 well, then stick the new value in there. 1042 00:54:07,420 --> 00:54:10,700 And then return essentially true. 1043 00:54:10,700 --> 00:54:14,250 Otherwise return false. 1044 00:54:14,250 --> 00:54:23,020 So it's basically saying: what you tend to do is you first 1045 00:54:23,020 --> 00:54:25,530 look to see what's the value. 1046 00:54:25,530 --> 00:54:27,260 You then update the value. 1047 00:54:27,260 --> 00:54:32,310 And then you say, if it hasn't changed, stick it back in and 1048 00:54:32,310 --> 00:54:33,160 return true. 1049 00:54:33,160 --> 00:54:37,470 If it has changed, return false. 1050 00:54:37,470 --> 00:54:42,110 So it only swaps the value if the comparison is true. 1051 00:54:42,110 --> 00:54:43,900 There's actually two versions. 1052 00:54:43,900 --> 00:54:47,060 One which says bool and one which says val. 1053 00:54:47,060 --> 00:54:49,765 And if you do the bool version, it returns a flag. 1054 00:54:49,765 --> 00:54:52,110 If you do the val version, it actually returns the value 1055 00:54:52,110 --> 00:54:53,210 that was in there. 1056 00:54:53,210 --> 00:54:55,620 So it's more like a compare-and-exchange. 1057 00:54:55,620 --> 00:54:58,650 The main thing about this is this code essentially executes 1058 00:54:58,650 --> 00:55:02,885 atomically with a single instruction, which is called-- 1059 00:55:07,990 --> 00:55:10,530 The instruction is cmpxchg. 1060 00:55:14,760 --> 00:55:15,420 Is it up there? 1061 00:55:15,420 --> 00:55:17,720 Oh, there it is. 1062 00:55:17,720 --> 00:55:17,970 Yeah.
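The bool and val flavors described here map directly onto the GCC builtins that compile down to cmpxchg; a minimal sketch, with the wrapper names ours:

```c
#include <stdbool.h>

/* Bool flavor: returns a flag saying whether the swap happened. */
bool cas_bool(long *addr, long old, long new_val) {
    return __sync_bool_compare_and_swap(addr, old, new_val);
}

/* Val flavor: returns whatever was actually in the location,
   whether or not the swap happened. */
long cas_val(long *addr, long old, long new_val) {
    return __sync_val_compare_and_swap(addr, old, new_val);
}
```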
1063 00:55:17,970 --> 00:55:21,190 So the cmpxchg instruction on x86. 1064 00:55:21,190 --> 00:55:26,030 So when you compile this, you should find on your assembly 1065 00:55:26,030 --> 00:55:30,890 output that instruction somewhere unless the compiler 1066 00:55:30,890 --> 00:55:33,310 figures out a better way to optimize that. 1067 00:55:33,310 --> 00:55:34,560 But generally, you should find that. 1068 00:55:37,330 --> 00:55:40,400 Also, one of the things about this is it works on values 1069 00:55:40,400 --> 00:55:42,970 that are integer-type values. 1070 00:55:42,970 --> 00:55:46,550 But it doesn't work on floating point numbers, in 1071 00:55:46,550 --> 00:55:48,080 particular. 1072 00:55:48,080 --> 00:55:51,410 So you can't compare-and-swap a value which is a floating 1073 00:55:51,410 --> 00:55:51,940 point value. 1074 00:55:51,940 --> 00:55:54,060 You can only do it with integer-type values. 1075 00:55:54,060 --> 00:55:57,370 So let's take a look at how we can use the compare-and-swap 1076 00:55:57,370 --> 00:56:00,120 for the summing problem. 1077 00:56:00,120 --> 00:56:04,090 So what we do is we have the same sort of code. 1078 00:56:04,090 --> 00:56:08,470 And now, what I'm going to do is compute my temporary value. 1079 00:56:08,470 --> 00:56:10,740 And then, what I'll do is I'll read the value 1080 00:56:10,740 --> 00:56:13,020 of result into old. 1081 00:56:13,020 --> 00:56:17,430 I'll then compute my new value for what I think I want the 1082 00:56:17,430 --> 00:56:22,240 result to be: the result plus the thing that I computed. 1083 00:56:22,240 --> 00:56:30,660 And now, what I do is I attempt the compare-and-swap: as 1084 00:56:30,660 --> 00:56:35,280 long as the old value is what I read it to be, 1085 00:56:35,280 --> 00:56:38,560 swap in the new value.
1086 00:56:38,560 --> 00:56:42,590 If the old value turns out to be different from what is 1087 00:56:42,590 --> 00:56:44,940 currently in the result location, 1088 00:56:44,940 --> 00:56:46,470 then it returns false. 1089 00:56:46,470 --> 00:56:47,720 And I redo this again. 1090 00:56:52,110 --> 00:56:55,020 Then, I have to redo the whole loop again. 1091 00:56:55,020 --> 00:56:56,560 So this is a do-while loop. 1092 00:56:56,560 --> 00:56:59,690 Do-while is like a while loop, except you do the body first. 1093 00:56:59,690 --> 00:57:01,860 And then you test the condition. 1094 00:57:01,860 --> 00:57:03,970 So if this fails, I go back. 1095 00:57:03,970 --> 00:57:06,850 I then get a new value for the result and so forth. 1096 00:57:06,850 --> 00:57:08,140 So let me show you how that works. 1097 00:57:13,410 --> 00:57:15,190 Let's see. 1098 00:57:15,190 --> 00:57:17,160 So first, I'll show you how this works. 1099 00:57:17,160 --> 00:57:19,850 Actually, I'll show how this works on a more 1100 00:57:19,850 --> 00:57:21,930 interesting example. 1101 00:57:21,930 --> 00:57:24,850 So what happens if I get swapped out in the middle of a 1102 00:57:24,850 --> 00:57:26,360 loop iteration? 1103 00:57:26,360 --> 00:57:31,100 All that happens is when I do the compare-and-swap, it fails. 1104 00:57:31,100 --> 00:57:32,850 So no other iterations have to wait. 1105 00:57:32,850 --> 00:57:37,120 They can all march ahead and do the thing they need to do. 1106 00:57:37,120 --> 00:57:39,520 And then, the one that got swapped out, eh. 1107 00:57:39,520 --> 00:57:41,870 It gets some old value. 1108 00:57:41,870 --> 00:57:47,230 It discovers that and has to re-execute the loop. 1109 00:57:47,230 --> 00:57:48,270 So is that fine? 1110 00:57:48,270 --> 00:57:51,200 So what this means is that the amount of work that's going on, 1111 00:57:51,200 --> 00:57:56,420 however, could in fact, be greater, depending upon how 1112 00:57:56,420 --> 00:57:57,450 much contention there is.
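The do-while CAS loop just described can be sketched like this; the names are ours and `compute_lf` is a placeholder for the per-element work:

```c
static long result_lf = 0;

static long compute_lf(long x) { return x + 1; }   /* placeholder work */

/* Lock-free summing update: snapshot result, compute the sum we'd
   like to commit, and CAS it in; if another worker changed result
   in between, the CAS fails and we redo the read-modify-write. */
void add_element_lockfree(long x) {
    long temp = compute_lf(x);
    long old, new_sum;
    do {
        old = result_lf;          /* read the current total        */
        new_sum = old + temp;     /* the update we hope to commit  */
    } while (!__sync_bool_compare_and_swap(&result_lf, old, new_sum));
}
```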
1113 00:57:57,450 --> 00:57:59,120 If there's a lot of contention, you could end up 1114 00:57:59,120 --> 00:58:06,400 having these guys fighting and 1115 00:58:06,400 --> 00:58:08,680 re-executing a lot of code. 1116 00:58:08,680 --> 00:58:11,330 But that's really not much worse than them spinning, is 1117 00:58:11,330 --> 00:58:14,590 what it comes down to. 1118 00:58:14,590 --> 00:58:16,470 Any questions? 1119 00:58:16,470 --> 00:58:20,170 Let's do a more interesting example. 1120 00:58:20,170 --> 00:58:23,160 Here's a lock-free stack. 1121 00:58:23,160 --> 00:58:25,550 So what we're going to do is we're going to have a node, 1122 00:58:25,550 --> 00:58:27,760 which has a next pointer and some data. 1123 00:58:27,760 --> 00:58:30,170 All we really care about is the next pointer. 1124 00:58:30,170 --> 00:58:35,340 And we have a stack, which has basically a head pointer. 1125 00:58:38,150 --> 00:58:39,790 So we have a linked list here. 1126 00:58:39,790 --> 00:58:42,010 We want to basically be able to insert things at the front 1127 00:58:42,010 --> 00:58:45,440 and take things out of the front. 1128 00:58:45,440 --> 00:58:47,910 So here's a lock-free push. 1129 00:58:47,910 --> 00:58:49,710 So remember, this could be concurrent. 1130 00:58:49,710 --> 00:58:52,060 So these guys both want to operate on it at the same time. 1131 00:58:52,060 --> 00:58:56,030 We saw last time how in doing very simple updates on linked 1132 00:58:56,030 --> 00:59:00,260 structures, you could get yourself into a mess if you 1133 00:59:00,260 --> 00:59:03,890 didn't properly synchronize when we did the insertion in 1134 00:59:03,890 --> 00:59:05,710 the hash table. 1135 00:59:05,710 --> 00:59:09,370 So here's my push code. 1136 00:59:09,370 --> 00:59:10,430 Well, let's walk through it. 1137 00:59:10,430 --> 00:59:15,580 It says, basically, here's my node that I want to insert. 1138 00:59:15,580 --> 00:59:20,350 It says, first of all, make node.next point to the head.
1139 00:59:20,350 --> 00:59:22,140 So we basically have it pointing to 77. 1140 00:59:25,220 --> 00:59:27,840 So then what we say is OK. 1141 00:59:27,840 --> 00:59:34,610 Let's compare-and-swap to make the head point to the node, but 1142 00:59:34,610 --> 00:59:39,670 only if the value of the head has not changed. 1143 00:59:39,670 --> 00:59:41,730 It's still the value of node.next. 1144 00:59:45,100 --> 00:59:47,900 And if so, it does the swap. 1145 00:59:47,900 --> 00:59:49,585 Question? 1146 00:59:49,585 --> 00:59:51,232 AUDIENCE: You say compare-and-swap. 1147 00:59:51,232 --> 00:59:54,850 But you compare it to what? 1148 00:59:54,850 --> 00:59:57,270 PROFESSOR: In this case it's comparing to the-- 1149 00:59:57,270 --> 01:00:00,940 so this is basically the location that you're doing the 1150 01:00:00,940 --> 01:00:05,690 compare-and-swap on, the old value that you expect to see 1151 01:00:05,690 --> 01:00:08,400 in that location, and the new value. 1152 01:00:08,400 --> 01:00:10,730 So here, what it says-- 1153 01:00:10,730 --> 01:00:12,250 when we're at this point here-- 1154 01:00:12,250 --> 01:00:14,225 before we do the compare-and-swap, we're 1155 01:00:14,225 --> 01:00:23,770 saying, I only want you to set that pointer to go to here if 1156 01:00:23,770 --> 01:00:27,410 this value is still pointing to there. 1157 01:00:27,410 --> 01:00:30,800 So only move this here if this value is still 77. 1158 01:00:30,800 --> 01:00:32,420 In other words, if somebody else came in-- 1159 01:00:32,420 --> 01:00:35,670 well, I'll do an example in a second that shows what happens 1160 01:00:35,670 --> 01:00:39,940 when we have concurrency, and one of them might fail. 1161 01:00:39,940 --> 01:00:42,030 But if it is true, then it basically sets it. 1162 01:00:42,030 --> 01:00:43,920 And now I'm home free. 1163 01:00:46,790 --> 01:00:48,810 So let's take a look at what happens when we have 1164 01:00:48,810 --> 01:00:49,195 contention.
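The push being walked through can be sketched as follows; a sketch under our own type and variable names, using the GCC CAS builtin:

```c
#include <stddef.h>

/* One node of the lock-free stack: push only cares about `next`. */
typedef struct node {
    struct node *next;
    int data;
} node_t;

static node_t *head = NULL;

/* Point the new node at the current head, then CAS the head over to
   it.  If some other push or pop moved the head in between, the CAS
   fails and we retry against the new head. */
void push(node_t *n) {
    do {
        n->next = head;
    } while (!__sync_bool_compare_and_swap(&head, n->next, n));
}
```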
1165 01:00:49,195 --> 01:00:51,220 So I have two guys. 1166 01:00:51,220 --> 01:00:54,170 So 33 says, OK, I'll come in. 1167 01:00:54,170 --> 01:00:56,600 Let me set my next pointer to the head. 1168 01:00:56,600 --> 01:00:58,030 But then comes 81. 1169 01:00:58,030 --> 01:00:59,450 And it says, OK. 1170 01:00:59,450 --> 01:01:04,700 Let me try to set my pointer to also be 77 because I look 1171 01:01:04,700 --> 01:01:05,700 at what the head is, and that's where 1172 01:01:05,700 --> 01:01:08,600 it's supposed to go. 1173 01:01:08,600 --> 01:01:10,160 So now, what happens is we do the 1174 01:01:10,160 --> 01:01:12,950 compare-and-swap operation. 1175 01:01:12,950 --> 01:01:14,560 And they both are going to try to do it. 1176 01:01:14,560 --> 01:01:17,330 And one of them is going to, essentially, do it first 1177 01:01:17,330 --> 01:01:20,530 because the hardware guarantees that the compare-and-swaps-- 1178 01:01:20,530 --> 01:01:22,800 they're locking operations-- will happen in 1179 01:01:22,800 --> 01:01:25,220 some definite order. 1180 01:01:25,220 --> 01:01:28,920 So in this case, 81 got in there and did its 1181 01:01:28,920 --> 01:01:29,990 compare-and-swap first. 1182 01:01:29,990 --> 01:01:33,040 When it looked, 77 was still the value that it expected. 1183 01:01:33,040 --> 01:01:34,950 So it allowed that pointer to be changed. 1184 01:01:34,950 --> 01:01:38,310 But now what happens when 33 tries? 1185 01:01:38,310 --> 01:01:40,850 33 tries to do the compare-and-swap. 1186 01:01:40,850 --> 01:01:44,320 And the compare-and-swap fails because it's saying, I want to 1187 01:01:44,320 --> 01:01:50,400 swap 33 in as long as the value of head is the pointer 1188 01:01:50,400 --> 01:01:53,880 to the node with 77. 1189 01:01:53,880 --> 01:01:56,500 The value is no longer the pointer to the node with 77. 1190 01:01:56,500 --> 01:02:01,430 It's now the pointer to the node with 81. 1191 01:02:01,430 --> 01:02:04,650 So the compare-and-swap fails.
1192 01:02:04,650 --> 01:02:06,200 People follow that? 1193 01:02:06,200 --> 01:02:09,420 And so what does 33 have to do? 1194 01:02:09,420 --> 01:02:10,910 It's got to start again. 1195 01:02:10,910 --> 01:02:13,400 So it goes back around the loop, and now it sets it to 1196 01:02:13,400 --> 01:02:15,240 81, which is now the head. 1197 01:02:15,240 --> 01:02:18,410 And now, it can compare-and-swap in the value. 1198 01:02:18,410 --> 01:02:22,940 And they both get in there perfectly well. 1199 01:02:22,940 --> 01:02:23,470 Question? 1200 01:02:23,470 --> 01:02:24,916 AUDIENCE: What if there's [INAUDIBLE]? 1201 01:02:24,916 --> 01:02:26,850 What if two nodes have-- 1202 01:02:26,850 --> 01:02:31,470 PROFESSOR: Well, notice here, it's not looking at the value 1203 01:02:31,470 --> 01:02:32,230 of the data. 1204 01:02:32,230 --> 01:02:33,500 Nowhere does data appear here. 1205 01:02:33,500 --> 01:02:35,050 It's actually looking at the address of 1206 01:02:35,050 --> 01:02:37,410 this chunk of memory. 1207 01:02:37,410 --> 01:02:39,450 There is a similar problem, which I will 1208 01:02:39,450 --> 01:02:41,000 raise in just a moment. 1209 01:02:41,000 --> 01:02:42,620 There is still a problem. 1210 01:02:42,620 --> 01:02:43,720 Yeah, question? 1211 01:02:43,720 --> 01:02:45,220 AUDIENCE: So I'm confused about the interface. 1212 01:02:45,220 --> 01:02:49,300 So you give it the address of where you want to 1213 01:02:49,300 --> 01:02:50,840 compare the value of. 1214 01:02:50,840 --> 01:02:54,760 And you're giving it what you're pointing at and-- 1215 01:02:54,760 --> 01:02:57,060 PROFESSOR: And here's the value that I expect to be 1216 01:02:57,060 --> 01:02:59,160 stored in this location. 1217 01:02:59,160 --> 01:03:03,870 The value I expect to be in there is node dot next. 1218 01:03:03,870 --> 01:03:06,500 So if I go back a couple things. 1219 01:03:06,500 --> 01:03:07,750 Here. 
1220 01:03:10,960 --> 01:03:13,490 Here, the guy says, the value I expect to be 1221 01:03:13,490 --> 01:03:17,470 there is, in this case, 1222 01:03:17,470 --> 01:03:20,960 the address of this chunk of memory here. 1223 01:03:20,960 --> 01:03:26,170 He expects the address of the node containing 77 is going to 1224 01:03:26,170 --> 01:03:28,000 be in this location. 1225 01:03:28,000 --> 01:03:28,580 It's not. 1226 01:03:28,580 --> 01:03:30,790 What's in this location is the address of 1227 01:03:30,790 --> 01:03:33,490 this chunk of memory. 1228 01:03:33,490 --> 01:03:36,880 But you're saying, if it's equal to this, then you can go 1229 01:03:36,880 --> 01:03:37,720 ahead and do the swap. 1230 01:03:37,720 --> 01:03:39,530 Otherwise you're going to fail. 1231 01:03:39,530 --> 01:03:43,970 And the swap consists of now sticking this value into-- 1232 01:03:43,970 --> 01:03:45,560 conditionally sticking it in there. 1233 01:03:45,560 --> 01:03:47,342 So you either do it or you don't do it. 1234 01:03:51,810 --> 01:03:54,790 So let's now do a pop. 1235 01:03:54,790 --> 01:03:59,030 So pop you can also do with compare-and-swap. 1236 01:03:59,030 --> 01:04:03,510 So here, I'm going to want to extract an element. 1237 01:04:03,510 --> 01:04:06,640 And what I'm going to do is create a current value that I 1238 01:04:06,640 --> 01:04:11,800 want to make point to the element that gets eliminated. 1239 01:04:11,800 --> 01:04:14,850 So what I do is I say, well, the element that I want is 1240 01:04:14,850 --> 01:04:17,640 that guy there. 1241 01:04:17,640 --> 01:04:21,560 And now, what I want to do is make the head jump around and 1242 01:04:21,560 --> 01:04:24,500 point to 94. 1243 01:04:24,500 --> 01:04:28,300 So what I do is I say, well, as long as the-- 1244 01:04:33,110 --> 01:04:35,990 and I want to do that unless I get down to the fact that I 1245 01:04:35,990 --> 01:04:41,110 have an empty list.
1246 01:04:41,110 --> 01:04:55,750 So basically, I say, if the head still has 1247 01:04:55,750 --> 01:04:57,180 the value of current-- 1248 01:04:57,180 --> 01:04:59,580 so they're pointing to the same place-- 1249 01:04:59,580 --> 01:05:04,680 then, I want to move in current->next. 1250 01:05:04,680 --> 01:05:07,470 And then I'm done. 1251 01:05:07,470 --> 01:05:12,940 Otherwise, I want to set current to head, reset it, and 1252 01:05:12,940 --> 01:05:15,550 go back to the beginning and try to pop again. 1253 01:05:15,550 --> 01:05:18,770 And I'm going to keep doing that until I get my pop to 1254 01:05:18,770 --> 01:05:23,240 succeed or until current points to nil. 1255 01:05:23,240 --> 01:05:25,390 If it ended up at the end, then I don't want to keep 1256 01:05:25,390 --> 01:05:31,660 popping if the list ended up being empty. 1257 01:05:31,660 --> 01:05:36,980 So basically, it sets that one to jump over. 1258 01:05:36,980 --> 01:05:40,450 And now, once it's done that, I can go, and I can clean up, 1259 01:05:40,450 --> 01:05:41,900 I can get rid of this pointer, et cetera. 1260 01:05:41,900 --> 01:05:44,660 But nobody else who's coming in to use this linked list can 1261 01:05:44,660 --> 01:05:48,960 see 15 now because I'm the only one with a pointer to it. 1262 01:05:48,960 --> 01:05:50,340 So people understand that? 1263 01:05:53,930 --> 01:05:56,965 So where's the bug? 1264 01:06:00,200 --> 01:06:04,990 Turns out this has a bug after all that work. 1265 01:06:04,990 --> 01:06:09,330 Each of these individually does what it's supposed to do. 1266 01:06:09,330 --> 01:06:10,280 But here's the bug. 1267 01:06:10,280 --> 01:06:12,420 And it's a famous problem because you see it all the 1268 01:06:12,420 --> 01:06:16,050 time when people are synchronizing through memory 1269 01:06:16,050 --> 01:06:18,210 with lock-free algorithms. 1270 01:06:18,210 --> 01:06:22,150 It's called the ABA problem. 1271 01:06:22,150 --> 01:06:23,040 So here's the problem.
1272 01:06:23,040 --> 01:06:24,570 And it's similar to what some people were 1273 01:06:24,570 --> 01:06:25,780 concerned about earlier. 1274 01:06:25,780 --> 01:06:28,020 So here's the ABA problem. 1275 01:06:28,020 --> 01:06:31,830 Thread 1 begins to pop 15. 1276 01:06:31,830 --> 01:06:38,000 So imagine that what it does is it sets its current there, 1277 01:06:38,000 --> 01:06:43,333 and then it reads the value here, and starts to set the 1278 01:06:43,333 --> 01:06:46,640 head here using the compare-and-swap. 1279 01:06:46,640 --> 01:06:49,040 But it doesn't complete the compare-and-swap yet. 1280 01:06:49,040 --> 01:06:50,810 The compare-and-swap hasn't executed. 1281 01:06:50,810 --> 01:06:53,110 It's simply gotten this value, and it's 1282 01:06:53,110 --> 01:06:55,560 about to swap it here. 1283 01:06:55,560 --> 01:06:57,110 So then, thread 2 comes along. 1284 01:06:57,110 --> 01:07:00,900 And it says, oh, I want to pop something as well. 1285 01:07:00,900 --> 01:07:02,210 So it comes in. 1286 01:07:02,210 --> 01:07:05,790 And it turns out it's faster, and manages to pop 15 off, and 1287 01:07:05,790 --> 01:07:08,680 set up its pointers. 1288 01:07:08,680 --> 01:07:10,690 Now, what would normally happen here is if this 1289 01:07:10,690 --> 01:07:13,520 completed, what would happen? 1290 01:07:13,520 --> 01:07:16,090 The compare-and-swap instruction would discover 1291 01:07:16,090 --> 01:07:20,630 that this pointer is no longer the pointer to the head. 1292 01:07:20,630 --> 01:07:21,630 And so it would fail. 1293 01:07:21,630 --> 01:07:23,000 We'd be all hunky dory. 1294 01:07:23,000 --> 01:07:24,160 No problem. 1295 01:07:24,160 --> 01:07:26,760 But what could actually happen here? 1296 01:07:26,760 --> 01:07:28,290 Thread 2 keeps going on. 1297 01:07:28,290 --> 01:07:31,900 It says, oh, let me pop 94. 1298 01:07:31,900 --> 01:07:34,630 So it does the same thing. 
1299 01:07:34,630 --> 01:07:38,240 So thread 1 is still stalled here, not having completed its 1300 01:07:38,240 --> 01:07:40,770 compare-and-swap. 1301 01:07:40,770 --> 01:07:41,820 It swaps 94. 1302 01:07:41,820 --> 01:07:43,630 Then, thread 2 goes on and says, oh, 1303 01:07:43,630 --> 01:07:46,410 let's put 15 back on. 1304 01:07:46,410 --> 01:07:50,800 So it puts 15 back on because after all, it had 15. 1305 01:07:50,800 --> 01:07:54,570 So now, what happens here? 1306 01:07:54,570 --> 01:07:59,520 Thread 1 now looks, and it now completes, and does its 1307 01:07:59,520 --> 01:08:09,510 compare-and-swap, it resumes, splicing out 15, which it 1308 01:08:09,510 --> 01:08:10,180 thinks it has. 1309 01:08:10,180 --> 01:08:14,300 But it doesn't realize that other stuff has gone on. 1310 01:08:14,300 --> 01:08:17,300 And now, we've got a mess. 1311 01:08:22,380 --> 01:08:25,750 So this is the ABA problem because what happened was we 1312 01:08:25,750 --> 01:08:28,840 were checking to see whether the value was still the same 1313 01:08:28,840 --> 01:08:31,260 value, the same chunk of memory. 1314 01:08:31,260 --> 01:08:32,340 It got popped off. 1315 01:08:32,340 --> 01:08:35,160 But it got popped back on. 1316 01:08:35,160 --> 01:08:37,800 But now, it could be in any configuration. 1317 01:08:37,800 --> 01:08:38,840 We don't know what it is. 1318 01:08:38,840 --> 01:08:43,960 And now, the code is thinking, that oh, nothing happened. 1319 01:08:43,960 --> 01:08:47,109 But in fact, something happened. 1320 01:08:47,109 --> 01:08:51,050 So it's ABA because basically, you've got 15 there. 1321 01:08:51,050 --> 01:08:51,825 It goes away. 1322 01:08:51,825 --> 01:08:54,029 Then, 15 comes back. 1323 01:08:54,029 --> 01:08:56,274 Question? 1324 01:08:56,274 --> 01:08:58,158 AUDIENCE: Can you compare two things and then swap because 1325 01:08:58,158 --> 01:08:59,100 that would solve this, right? 
1326 01:08:59,100 --> 01:09:00,524 PROFESSOR: That's called a double compare-and-swap. 1327 01:09:03,233 --> 01:09:07,689 And we'll talk about it in a second. 1328 01:09:07,689 --> 01:09:12,899 So the classic way to solve this problem is to use 1329 01:09:12,899 --> 01:09:15,050 versioning. 1330 01:09:15,050 --> 01:09:17,540 So what you do is you pack a version number with each 1331 01:09:17,540 --> 01:09:22,180 pointer in the same atomically updatable word. 1332 01:09:22,180 --> 01:09:28,840 So that when 15 comes back, you've got the pointer. 1333 01:09:28,840 --> 01:09:31,770 But you also have a version on that pointer so that the version 1334 01:09:31,770 --> 01:09:33,420 has to be the same as the version you 1335 01:09:33,420 --> 01:09:35,399 had, and not just the value. 1336 01:09:35,399 --> 01:09:37,729 What you do is you increment the version number every time 1337 01:09:37,729 --> 01:09:40,560 the pointer is changed. 1338 01:09:40,560 --> 01:09:42,779 So you just do an increment. 1339 01:09:42,779 --> 01:09:44,950 But you do the compare-and-swap on the 1340 01:09:44,950 --> 01:09:49,779 version number and the pointer at the same time. 1341 01:09:49,779 --> 01:09:52,630 Now, it turns out that some architectures actually have 1342 01:09:52,630 --> 01:09:55,710 what's called a double compare-and-swap, which will 1343 01:09:55,710 --> 01:09:59,320 do compare-and-swap on two distinct locations. 1344 01:09:59,320 --> 01:10:02,140 And that simplifies things even more because it means you 1345 01:10:02,140 --> 01:10:04,050 don't have to pack and make sure that 1346 01:10:04,050 --> 01:10:05,420 things fit in one word. 1347 01:10:05,420 --> 01:10:06,990 You can keep versioning elsewhere. 1348 01:10:06,990 --> 01:10:09,730 And there are a whole bunch of other places where you can, in 1349 01:10:09,730 --> 01:10:12,700 fact, optimize and get even tighter code than you could if 1350 01:10:12,700 --> 01:10:14,990 you have to pack. 
1351 01:10:14,990 --> 01:10:16,910 So that's generally the way you solve this. 1352 01:10:16,910 --> 01:10:19,730 And, of course, you can see this gets-- 1353 01:10:19,730 --> 01:10:24,310 as I say, this week has been the skull-and-crossbones lecture. 1354 01:10:24,310 --> 01:10:28,920 It's appropriate it comes right after Halloween because 1355 01:10:28,920 --> 01:10:31,260 really, you do not want to play these games 1356 01:10:31,260 --> 01:10:32,740 unless you have to. 1357 01:10:32,740 --> 01:10:35,320 But you should know about them because you will find times 1358 01:10:35,320 --> 01:10:38,680 where you need this, or you need to understand somebody's 1359 01:10:38,680 --> 01:10:40,720 code that they've written in a lock-free way. 1360 01:10:40,720 --> 01:10:43,810 Because remember lock-free has the nice property that hey, 1361 01:10:43,810 --> 01:10:47,080 the operating system swaps something out, it just keeps 1362 01:10:47,080 --> 01:10:50,490 running nice and jolly if it's correct. 1363 01:10:54,300 --> 01:10:57,750 So the other issue is that version numbers may need to be 1364 01:10:57,750 --> 01:11:00,170 very large. 1365 01:11:00,170 --> 01:11:02,170 So if you have a version number, how many bits do you 1366 01:11:02,170 --> 01:11:03,970 assign to that version number? 1367 01:11:03,970 --> 01:11:05,810 Well, 64 bits, that's no problem. 1368 01:11:05,810 --> 01:11:07,930 You never run out of 64 bits. 1369 01:11:07,930 --> 01:11:11,390 2 to the 64th is a very, very, very big number. 1370 01:11:11,390 --> 01:11:13,400 And you'll never run out of 2 to the 64th. 1371 01:11:13,400 --> 01:11:15,370 We did that calculation at the beginning of the term. 1372 01:11:15,370 --> 01:11:16,620 How big did we say it was? 1373 01:11:22,240 --> 01:11:23,940 It's pretty big, right? 1374 01:11:23,940 --> 01:11:26,758 It's like this big. 1375 01:11:26,758 --> 01:11:29,740 Or is it this big? 1376 01:11:29,740 --> 01:11:32,108 My two-year-old is this big. 
1377 01:11:34,810 --> 01:11:37,200 So anyway, it's pretty big. 1378 01:11:37,200 --> 01:11:39,340 So is it bigger than-- 1379 01:11:39,340 --> 01:11:41,570 no, it's not bigger than the number of particles in the 1380 01:11:41,570 --> 01:11:42,230 universe, right? 1381 01:11:42,230 --> 01:11:45,540 That's 10 to the 80th, which is much bigger 1382 01:11:45,540 --> 01:11:46,690 than 2 to the 64th. 1383 01:11:46,690 --> 01:11:48,280 But it's still a big number. 1384 01:11:48,280 --> 01:11:50,180 I think it's like more than there are atoms in the earth 1385 01:11:50,180 --> 01:11:50,610 or something. 1386 01:11:50,610 --> 01:11:52,430 It's still pretty big. 1387 01:11:52,430 --> 01:11:54,060 You never get through it if you calculate it. 1388 01:11:54,060 --> 01:11:55,780 I think we calculated it and it was 1389 01:11:55,780 --> 01:11:57,320 hundreds of years or whatever. 1390 01:11:57,320 --> 01:11:59,670 Anyway, it's a long time. 1391 01:11:59,670 --> 01:12:03,230 Many, many, many years at the very fastest, updating with 1392 01:12:03,230 --> 01:12:04,880 biggest supercomputers, and the most 1393 01:12:04,880 --> 01:12:05,860 processors, et cetera. 1394 01:12:05,860 --> 01:12:07,806 Never run out of 64 bits. 1395 01:12:07,806 --> 01:12:09,250 32 bits. 1396 01:12:09,250 --> 01:12:10,810 Four billion. 1397 01:12:10,810 --> 01:12:11,650 Maybe you run out. 1398 01:12:11,650 --> 01:12:13,570 Maybe you don't. 1399 01:12:13,570 --> 01:12:15,070 So that's one of the issues. 1400 01:12:15,070 --> 01:12:18,370 You have to say, well, how often do I have to do that. 1401 01:12:18,370 --> 01:12:20,020 Really, you only have to worry about this. 1402 01:12:20,020 --> 01:12:22,160 You can wraparound. 
1403 01:12:22,160 --> 01:12:24,880 But you've got to make sure that then you never have a 1404 01:12:24,880 --> 01:12:29,110 situation where something could be swapped out for long 1405 01:12:29,110 --> 01:12:32,920 enough that it would come back and bite you because you're 1406 01:12:32,920 --> 01:12:34,640 coming around and then eating your tail. 1407 01:12:34,640 --> 01:12:36,540 And you've got to make sure you wouldn't have things 1408 01:12:36,540 --> 01:12:38,070 overlap and get a [? thing. ?] 1409 01:12:38,070 --> 01:12:40,580 So that might be a risk you're willing to take. 1410 01:12:40,580 --> 01:12:43,240 You can do an analysis and say, what are the odds my 1411 01:12:43,240 --> 01:12:46,090 system crashes from this reason or 1412 01:12:46,090 --> 01:12:48,010 from a different reason? 1413 01:12:48,010 --> 01:12:51,560 That can be a reasonable engineering trade-off. 1414 01:12:51,560 --> 01:12:54,460 So there's an alternative to compare-and-swap. 1415 01:12:54,460 --> 01:12:56,430 One is the double compare-and-swap. 1416 01:12:56,430 --> 01:12:59,170 Another one is some machines have what's called a 1417 01:12:59,170 --> 01:13:02,300 load-linked/store-conditional instruction. 1418 01:13:02,300 --> 01:13:04,620 What those are actually is a pair of instructions. 1419 01:13:04,620 --> 01:13:05,870 One is load-linked. 1420 01:13:05,870 --> 01:13:09,800 When you load-link, it basically says, let's set a 1421 01:13:09,800 --> 01:13:11,950 bit, essentially, in that word. 1422 01:13:11,950 --> 01:13:16,870 And if that word ever changes when you do the store-conditional, 1423 01:13:16,870 --> 01:13:18,770 it will fail. 1424 01:13:18,770 --> 01:13:21,870 So even if some other processor changes it to the 1425 01:13:21,870 --> 01:13:26,030 exact same value, it's keeping track of whether anybody else 1426 01:13:26,030 --> 01:13:29,490 wrote it using the memory consistency mechanism. 1427 01:13:29,490 --> 01:13:31,870 The MSI-type protocol that we talked about. 
1428 01:13:31,870 --> 01:13:36,060 It's using that kind of mechanism to make sure it can tell 1429 01:13:36,060 --> 01:13:36,530 if it changes. 1430 01:13:36,530 --> 01:13:39,215 And so this is actually much more reliable as a mechanism. 1431 01:13:41,800 --> 01:13:46,060 x86 does not have load-linked/store-conditional. 1432 01:13:46,060 --> 01:13:46,930 I'm not sure why. 1433 01:13:46,930 --> 01:13:48,730 I don't know if there's a patent on it or 1434 01:13:48,730 --> 01:13:49,300 what's going on. 1435 01:13:49,300 --> 01:13:50,550 But they don't have it. 1436 01:13:55,630 --> 01:13:57,620 Final topic is reducers. 1437 01:14:00,650 --> 01:14:02,610 So once again, recall the summing problem. 1438 01:14:06,140 --> 01:14:09,830 In Cilk++, they have a mechanism called reducer 1439 01:14:09,830 --> 01:14:15,150 hyperobjects, which lets you do an end run around some of 1440 01:14:15,150 --> 01:14:17,610 these synchronization problems. 1441 01:14:17,610 --> 01:14:22,420 And the basic idea behind it is we actually could code this 1442 01:14:22,420 --> 01:14:25,110 fairly easily as we talked about last time by just doing 1443 01:14:25,110 --> 01:14:28,110 divide and conquer on the array. 1444 01:14:28,110 --> 01:14:30,830 We add up the first half of the elements, add up the 1445 01:14:30,830 --> 01:14:32,120 second half of the elements, when they 1446 01:14:32,120 --> 01:14:33,810 return, add them together. 1447 01:14:33,810 --> 01:14:38,120 But the problem is that coding that is a pain to do. 1448 01:14:38,120 --> 01:14:40,770 So the hyperobject mechanism sort of does that 1449 01:14:40,770 --> 01:14:42,660 automatically for you. 1450 01:14:42,660 --> 01:14:50,350 What you can do is declare result to be an integer, which 1451 01:14:50,350 --> 01:14:56,530 is going to have the operation add performed on it. 1452 01:14:56,530 --> 01:14:59,780 And what happens then is you can just go ahead and add the 1453 01:14:59,780 --> 01:15:03,730 values up like this. 
1454 01:15:03,730 --> 01:15:08,060 And basically, what it does is essentially adds things 1455 01:15:08,060 --> 01:15:12,090 locally and will combine them on an as-needed basis. 1456 01:15:12,090 --> 01:15:15,210 So you don't actually have to do any synchronization at all. 1457 01:15:15,210 --> 01:15:17,700 In the end, you have to get the result 1458 01:15:17,700 --> 01:15:18,810 by doing a get_value(). 1459 01:15:18,810 --> 01:15:23,150 So let me show you a little bit more what's going on in 1460 01:15:23,150 --> 01:15:24,250 this situation. 1461 01:15:24,250 --> 01:15:27,690 So the first thing here is we're saying result is a 1462 01:15:27,690 --> 01:15:29,415 summing reducer over int. 1463 01:15:32,430 --> 01:15:35,480 The updates are resolved automatically without races or 1464 01:15:35,480 --> 01:15:39,380 contention because they're basically doing it by keeping 1465 01:15:39,380 --> 01:15:42,200 local values and copying them. 1466 01:15:42,200 --> 01:15:44,730 And then, at the end, you can get the underlying value. 1467 01:15:47,610 --> 01:15:52,240 So the way this works is that when you declare the variable, 1468 01:15:52,240 --> 01:15:57,210 you're declaring it as a reducer over some associative 1469 01:15:57,210 --> 01:16:00,370 operation, such as addition. 1470 01:16:00,370 --> 01:16:04,720 So it only works cleanly if your operation is associative. 1471 01:16:04,720 --> 01:16:07,860 And there are a lot of associative operations. 1472 01:16:07,860 --> 01:16:10,480 Addition, multiplication, logical AND, list 1473 01:16:10,480 --> 01:16:11,430 concatenation. 1474 01:16:11,430 --> 01:16:14,400 I can concatenate two lists. 1475 01:16:14,400 --> 01:16:15,710 So what does associative mean? 1476 01:16:18,210 --> 01:16:20,730 I think I have a slide on this in a minute. 1477 01:16:20,730 --> 01:16:23,900 It means a times b times c. 1478 01:16:23,900 --> 01:16:26,220 I can parenthesize it any way I want and 1479 01:16:26,220 --> 01:16:27,020 get the same answer. 
1480 01:16:27,020 --> 01:16:28,970 Associative, right? 1481 01:16:28,970 --> 01:16:30,140 It's not associative like 1482 01:16:30,140 --> 01:16:35,840 associative memory or whatever. 1483 01:16:35,840 --> 01:16:38,950 So now, the individual strands in the computation can update 1484 01:16:38,950 --> 01:16:43,880 x as if it were an ordinary non-local variable. 1485 01:16:43,880 --> 01:16:47,670 But in fact, it's maintained as a set of different copies 1486 01:16:47,670 --> 01:16:50,540 called views. 1487 01:16:50,540 --> 01:16:53,400 The Cilk++ runtime system coordinates the views and 1488 01:16:53,400 --> 01:16:55,370 combines them when appropriate. 1489 01:16:55,370 --> 01:16:58,620 And when only one view remains, now you can get the 1490 01:16:58,620 --> 01:16:59,710 actual value. 1491 01:16:59,710 --> 01:17:02,890 So for example, you may have a summing reducer where the 1492 01:17:02,890 --> 01:17:06,620 actual value at this point in time is 89. 1493 01:17:06,620 --> 01:17:16,120 But locally, each processor may only see a different value 1494 01:17:16,120 --> 01:17:18,790 whose sum is 89. 1495 01:17:18,790 --> 01:17:23,170 But locally, I could do something like increment this. 1496 01:17:23,170 --> 01:17:27,010 And this guy can independently increment his view and has the 1497 01:17:27,010 --> 01:17:30,100 effect that it increments whatever the total sum is. 1498 01:17:30,100 --> 01:17:33,920 And then, the runtime system manages to combine everything 1499 01:17:33,920 --> 01:17:38,630 at the end to make it be the value when there's no more 1500 01:17:38,630 --> 01:17:42,410 parallelism associated with that reducer. 1501 01:17:42,410 --> 01:17:44,390 So here's the conceptual behavior. 1502 01:17:44,390 --> 01:17:45,980 Imagine I have this code. 1503 01:17:45,980 --> 01:17:47,780 I set x equal to 0. 1504 01:17:47,780 --> 01:17:49,180 I then add 3. 1505 01:17:49,180 --> 01:17:50,210 I then increment. 1506 01:17:50,210 --> 01:17:51,960 I have 4, increment it to 5. 
1507 01:17:51,960 --> 01:17:53,240 Fa da da da da. 1508 01:17:53,240 --> 01:17:57,130 At the end, I get some value, which I 1509 01:17:57,130 --> 01:17:58,380 don't think I put down. 1510 01:18:01,360 --> 01:18:03,550 Another way I could do this is the following. 1511 01:18:03,550 --> 01:18:06,830 Let me do exactly the same here but with a local view 1512 01:18:06,830 --> 01:18:10,520 that I'll call x1. 1513 01:18:10,520 --> 01:18:14,060 For this set of operations, let me start a new view that I 1514 01:18:14,060 --> 01:18:18,210 start out with the identity for addition, which is 0 and 1515 01:18:18,210 --> 01:18:19,970 add those guys up. 1516 01:18:19,970 --> 01:18:24,060 And then, at the end, let me add x1 and x2. 1517 01:18:24,060 --> 01:18:26,540 It should give me the same answer if addition is 1518 01:18:26,540 --> 01:18:27,790 associative. 1519 01:18:30,600 --> 01:18:32,195 In particular, these now can operate in 1520 01:18:32,195 --> 01:18:33,595 parallel with no races. 1521 01:18:36,520 --> 01:18:39,830 So if you don't actually look at the intermediate values-- 1522 01:18:39,830 --> 01:18:42,640 if all I'm doing is updating them, but I'm not actually 1523 01:18:42,640 --> 01:18:46,200 looking to see what the actual value of the thing 1524 01:18:46,200 --> 01:18:49,900 is, I should get the same answer at the end. 1525 01:18:49,900 --> 01:18:51,420 The result is then determinant. 1526 01:18:51,420 --> 01:18:54,230 It's not deterministic because it's going to get done in a 1527 01:18:54,230 --> 01:18:56,200 different way with different memory state. 1528 01:18:56,200 --> 01:18:59,260 But it's determinant, meaning the output answer is going to 1529 01:18:59,260 --> 01:19:03,350 be the same no matter how it executes, even if the 1530 01:19:03,350 --> 01:19:06,250 resulting computation is nondeterministic. 1531 01:19:06,250 --> 01:19:08,470 So this is a way of encapsulating, if you will, 1532 01:19:08,470 --> 01:19:09,650 nondeterminism. 
1533 01:19:09,650 --> 01:19:12,430 And it worked because addition is associative. 1534 01:19:12,430 --> 01:19:15,340 It didn't matter which order I did it. 1535 01:19:15,340 --> 01:19:17,660 And once again, I could have broken it here instead of 1536 01:19:17,660 --> 01:19:20,360 there, and I still get the same answer. 1537 01:19:20,360 --> 01:19:21,460 It doesn't matter. 1538 01:19:21,460 --> 01:19:24,540 So the idea is as these things are work stealing around. 1539 01:19:24,540 --> 01:19:26,970 they're accumulating things locally but combining them in 1540 01:19:26,970 --> 01:19:32,080 a way that maintains the invariant that the final value 1541 01:19:32,080 --> 01:19:33,570 is going to be the sum. 1542 01:19:36,240 --> 01:19:38,760 So there's a lot of other related work where people do 1543 01:19:38,760 --> 01:19:42,680 reduction types of things, but they're all tied to specific 1544 01:19:42,680 --> 01:19:44,520 control or data structures. 1545 01:19:44,520 --> 01:19:50,300 And the neat thing about the Cilk++ version is that it is 1546 01:19:50,300 --> 01:19:51,290 not tied to anything. 1547 01:19:51,290 --> 01:19:52,430 You can name it anywhere. 1548 01:19:52,430 --> 01:19:54,280 You can write recursive programs. 1549 01:19:54,280 --> 01:19:58,300 You can update locally your reducer wherever you want, and 1550 01:19:58,300 --> 01:20:04,020 it figures out exactly how to combine them in order to get 1551 01:20:04,020 --> 01:20:06,760 your final answer. 1552 01:20:06,760 --> 01:20:11,450 So the algebraic framework for this is that we have a monoid, 1553 01:20:11,450 --> 01:20:17,720 which is a set, an operator, and an identity, where the 1554 01:20:17,720 --> 01:20:20,740 operator is an associative binary operator. 1555 01:20:20,740 --> 01:20:24,450 And the identity is, in fact, the identity. 1556 01:20:24,450 --> 01:20:26,730 So here are some examples. 
1557 01:20:26,730 --> 01:20:31,510 Integers with plus and 0, the real numbers with times and 1, 1558 01:20:31,510 --> 01:20:35,110 true and false, Booleans with AND, where true is the 1559 01:20:35,110 --> 01:20:40,810 identity, strings over some alphabet with concatenation, 1560 01:20:40,810 --> 01:20:43,530 where the empty string is the identity. 1561 01:20:43,530 --> 01:20:46,720 You can do MAX with minus infinity as the 1562 01:20:46,720 --> 01:20:48,110 identity, and so forth. 1563 01:20:48,110 --> 01:20:49,540 And you can come up with your own. 1564 01:20:49,540 --> 01:20:52,990 It's easy to come up with examples of monoids. 1565 01:20:52,990 --> 01:20:57,530 So what we do in Cilk++ is we represent a monoid over a set 1566 01:20:57,530 --> 01:21:02,530 t by a C++ class that inherits from this base class that's 1567 01:21:02,530 --> 01:21:07,550 predefined for you, which is parameterized using templates 1568 01:21:07,550 --> 01:21:08,880 with the types. 1569 01:21:08,880 --> 01:21:10,770 So the set that we're going to use is, in fact, 1570 01:21:10,770 --> 01:21:13,120 going to be a type. 1571 01:21:13,120 --> 01:21:15,090 And the member function reduce-- 1572 01:21:15,090 --> 01:21:18,440 this monoid has to have a member function reduce that 1573 01:21:18,440 --> 01:21:21,050 implements the binary operator times. 1574 01:21:21,050 --> 01:21:24,640 And it also has an identity member function. 1575 01:21:24,640 --> 01:21:28,570 So we set up the algebraic framework. 1576 01:21:28,570 --> 01:21:32,620 So here's, for example, how I could define a sum monoid. 1577 01:21:32,620 --> 01:21:36,600 I inherit from the base with int, for example, here. 1578 01:21:36,600 --> 01:21:39,650 And I define my reduce function. 1579 01:21:39,650 --> 01:21:42,530 And it actually turns out to be important, you always do 1580 01:21:42,530 --> 01:21:44,970 the right one into the left. 1581 01:21:44,970 --> 01:21:47,420 Otherwise, you won't have it be associative. 
1582 01:21:47,420 --> 01:21:49,710 And then, you have an identity, which gives you in 1583 01:21:49,710 --> 01:21:52,940 this case a new element, which is 0. 1584 01:21:55,610 --> 01:22:02,080 And so you can now define the reducer as so. 1585 01:22:02,080 --> 01:22:04,270 You just say Cilk reducer, the sum monoid 1586 01:22:04,270 --> 01:22:06,760 you've defined and x. 1587 01:22:06,760 --> 01:22:10,360 And now, the local view of x can be accessed as x open 1588 01:22:10,360 --> 01:22:11,990 close parenthesis. 1589 01:22:11,990 --> 01:22:13,940 Now, in the example I showed you, you didn't need to do the 1590 01:22:13,940 --> 01:22:15,980 open-close parentheses. 1591 01:22:15,980 --> 01:22:18,045 And the way you get rid of those open-close parentheses 1592 01:22:18,045 --> 01:22:20,930 is you define a wrapper class. 1593 01:22:20,930 --> 01:22:24,420 So it's generally inconvenient to replace every access with 1594 01:22:24,420 --> 01:22:25,970 x(). 1595 01:22:25,970 --> 01:22:26,880 That's one issue. 1596 01:22:26,880 --> 01:22:28,640 The other thing is accesses aren't safe. 1597 01:22:28,640 --> 01:22:33,070 Nothing prevents a programmer from writing x times equals 2, 1598 01:22:33,070 --> 01:22:35,960 even though the reducer was defined over plus. 1599 01:22:35,960 --> 01:22:38,340 And that will screw up the logic of this code if 1600 01:22:38,340 --> 01:22:40,920 somewhere he's multiplying when, in fact, it's only 1601 01:22:40,920 --> 01:22:44,300 supposed to be combined with addition. 1602 01:22:44,300 --> 01:22:46,740 So the way you solve that is with a wrapper class. 1603 01:22:46,740 --> 01:22:49,540 You can do a wrapper class that will protect all of the 1604 01:22:49,540 --> 01:22:54,020 operations inside and export things that you can just refer 1605 01:22:54,020 --> 01:22:54,760 to the variable. 1606 01:22:54,760 --> 01:22:56,770 And it will actually call that. 
1607 01:22:56,770 --> 01:22:58,930 For most of what you're doing, you probably don't need to 1608 01:22:58,930 --> 01:23:00,380 write a wrapper class. 1609 01:23:00,380 --> 01:23:07,950 You'll do fine just operating with the extra parentheses. 1610 01:23:07,950 --> 01:23:09,330 In addition, there's a whole bunch of 1611 01:23:09,330 --> 01:23:11,680 commonly used reducers. 1612 01:23:11,680 --> 01:23:17,980 List append, max, min, add, an output stream, and 1613 01:23:17,980 --> 01:23:23,640 strings, and also you can roll your own. 1614 01:23:23,640 --> 01:23:27,380 One issue with addition is that, in fact, 1615 01:23:27,380 --> 01:23:29,810 this doesn't preserve-- 1616 01:23:29,810 --> 01:23:32,030 for floating point addition-- does not 1617 01:23:32,030 --> 01:23:34,970 preserve the same answer. 1618 01:23:34,970 --> 01:23:37,710 And the reason is because floating point numbers are not 1619 01:23:37,710 --> 01:23:38,970 associative. 1620 01:23:38,970 --> 01:23:42,270 If I add a to b and add that to c, I can get something 1621 01:23:42,270 --> 01:23:45,370 different because of round-off error from adding a to the 1622 01:23:45,370 --> 01:23:47,320 result of b and c. 1623 01:23:47,320 --> 01:23:50,662 So generally, floating point operations don't give you-- 1624 01:23:50,662 --> 01:23:53,400 they'll give you something that is close enough for most 1625 01:23:53,400 --> 01:23:55,110 things, but it's not actually associative. 1626 01:23:55,110 --> 01:23:58,580 So you will get different answers. 1627 01:23:58,580 --> 01:23:59,390 A quick example. 1628 01:23:59,390 --> 01:24:01,200 I'm sorry to run over a little bit here. 1629 01:24:01,200 --> 01:24:03,850 I hope people have a couple minutes. 1630 01:24:03,850 --> 01:24:05,800 Here's a real world example. 1631 01:24:05,800 --> 01:24:08,970 A company had a mechanical assembly represented as a tree of 1632 01:24:08,970 --> 01:24:11,170 assemblies down to individual parts. 
1633 01:24:11,170 --> 01:24:14,820 A pickup truck has all these parts and all of these extra 1634 01:24:14,820 --> 01:24:18,110 subparts all the way down to some geometric description of 1635 01:24:18,110 --> 01:24:19,670 what the part is. 1636 01:24:19,670 --> 01:24:21,610 And what they want to do is the so-called collision 1637 01:24:21,610 --> 01:24:22,980 detection problem, which has nothing to do 1638 01:24:22,980 --> 01:24:24,390 with colliding autos. 1639 01:24:24,390 --> 01:24:27,730 What they're doing is saying, find collisions between the 1640 01:24:27,730 --> 01:24:29,230 assembly and a target object. 1641 01:24:29,230 --> 01:24:31,870 And that object might be something like a half space 1642 01:24:31,870 --> 01:24:33,250 because they're computing a cutaway. 1643 01:24:33,250 --> 01:24:35,720 Tell me all the things that fall within this. 1644 01:24:35,720 --> 01:24:39,710 Or maybe, here's an engine compartment, and does the 1645 01:24:39,710 --> 01:24:42,270 engine fit in with it? 1646 01:24:42,270 --> 01:24:44,690 So here's a code that does that. 1647 01:24:44,690 --> 01:24:50,020 Basically, it does a recursive walk, where it looks to see 1648 01:24:50,020 --> 01:24:52,080 whether it's an internal node or a leaf. 1649 01:24:52,080 --> 01:24:58,000 If it's a leaf, it says, oh, let me check to see whether 1650 01:24:58,000 --> 01:25:00,440 the target collides with a particular 1651 01:25:00,440 --> 01:25:01,790 element of the tree. 1652 01:25:01,790 --> 01:25:06,120 And if so, add that object to the end of a list. 1653 01:25:06,120 --> 01:25:13,730 So this is the standard C++ library call for putting something 1654 01:25:13,730 --> 01:25:15,950 on the end of the list. 1655 01:25:15,950 --> 01:25:19,670 If it's an internal node, then go through all of the children 1656 01:25:19,670 --> 01:25:21,030 recursively. 1657 01:25:21,030 --> 01:25:25,680 And walk the children recursively. 
1658 01:25:25,680 --> 01:25:28,850 So basically, you're going to look through the whole tree. 1659 01:25:28,850 --> 01:25:34,290 Does it intersect this particular object, x? 1660 01:25:34,290 --> 01:25:36,290 So how do we parallelize this? 1661 01:25:36,290 --> 01:25:38,060 We can parallelize the recursion. 1662 01:25:38,060 --> 01:25:41,280 We turn the for loop here into a cilk_for. 1663 01:25:41,280 --> 01:25:43,640 So it goes through all the children at the same time. 1664 01:25:43,640 --> 01:25:45,270 They all can do their comparisons 1665 01:25:45,270 --> 01:25:47,830 completely in parallel. 1666 01:25:47,830 --> 01:25:49,580 Oops, but we have a bug. 1667 01:25:49,580 --> 01:25:52,170 What's the bug? 1668 01:25:52,170 --> 01:25:54,900 AUDIENCE: Is it push back? 1669 01:25:54,900 --> 01:25:55,230 PROFESSOR: Yeah. 1670 01:25:55,230 --> 01:25:56,760 The push back here. 1671 01:25:56,760 --> 01:26:02,300 We have a race here because they're all trying to push on 1672 01:26:02,300 --> 01:26:05,380 to this output list at the same time. 1673 01:26:05,380 --> 01:26:08,870 So we could resolve it with a lock or whatever. 1674 01:26:08,870 --> 01:26:13,340 But it turns out it's much better to resolve it with a-- 1675 01:26:13,340 --> 01:26:15,900 so we could do this, right? 1676 01:26:15,900 --> 01:26:18,140 But now, you've got lock contention. 1677 01:26:18,140 --> 01:26:20,270 And also, the list ends up getting produced 1678 01:26:20,270 --> 01:26:23,100 in a jumbled order. 1679 01:26:23,100 --> 01:26:28,040 So it turns out if you use a reducer, you declare this to 1680 01:26:28,040 --> 01:26:32,045 be a reducer with list append. 1681 01:26:32,045 --> 01:26:37,180 And what happens then is it turns out list concatenation is 1682 01:26:37,180 --> 01:26:38,710 associative. 1683 01:26:38,710 --> 01:26:42,510 If I concatenate a to b, and then concatenate c, that's the 1684 01:26:42,510 --> 01:26:46,040 same as concatenating a to the concatenation of b and c. 
1685 01:26:46,040 --> 01:26:49,780 And I can concatenate lists in constant time by keeping a 1686 01:26:49,780 --> 01:26:53,290 pointer to the head and tail of each list. 1687 01:26:53,290 --> 01:26:55,340 So if you do that, and that turns out to be one of the 1688 01:26:55,340 --> 01:26:59,220 built in functions, then, in fact, this code operates 1689 01:26:59,220 --> 01:27:02,200 perfectly well with no contention and so forth. 1690 01:27:02,200 --> 01:27:06,000 And in fact, produces the output in the same order as 1691 01:27:06,000 --> 01:27:08,000 the original C++. 1692 01:27:08,000 --> 01:27:09,250 It runs fast. 1693 01:27:11,540 --> 01:27:16,050 And there's a little description of how it works. 1694 01:27:16,050 --> 01:27:17,940 The actual protocol is kind of tricky. 1695 01:27:17,940 --> 01:27:19,730 And we'll put the paper-- 1696 01:27:19,730 --> 01:27:22,970 let's make sure we get this paper up on the web. 1697 01:27:22,970 --> 01:27:24,480 I think it was there from last year. 1698 01:27:24,480 --> 01:27:26,990 So we should be able to find it. 1699 01:27:26,990 --> 01:27:28,740 If you're interested in how the details work. 1700 01:27:28,740 --> 01:27:30,860 Here's the important thing to know from a 1701 01:27:30,860 --> 01:27:33,450 programmer point of view. 1702 01:27:33,450 --> 01:27:36,630 So typically, the cost-- 1703 01:27:36,630 --> 01:27:40,010 it turns out the reduce operations you're only calling 1704 01:27:40,010 --> 01:27:41,310 when there's actually a steal. 1705 01:27:41,310 --> 01:27:42,900 It's actually a return from a steal. 1706 01:27:42,900 --> 01:27:46,560 But since stealing occurs relatively infrequently the 1707 01:27:46,560 --> 01:27:49,450 load balance, the number of times you actually do one of 1708 01:27:49,450 --> 01:27:53,200 these reduce operations is small. 1709 01:27:53,200 --> 01:27:56,250 The most of the cost is actually accessing the reducer 1710 01:27:56,250 --> 01:27:58,410 to do the updates. 
1711 01:27:58,410 --> 01:28:01,060 And it's never worse than a hash table lookup the way it's 1712 01:28:01,060 --> 01:28:02,480 implemented. 1713 01:28:02,480 --> 01:28:05,380 If the reducer is accessed several times within a region 1714 01:28:05,380 --> 01:28:08,990 of code, the compiler can optimize the lookups using 1715 01:28:08,990 --> 01:28:11,170 common subexpression elimination. 1716 01:28:11,170 --> 01:28:15,380 And in the common case, then, what happens is it basically 1717 01:28:15,380 --> 01:28:18,480 has an access cost equal to one additional level of 1718 01:28:18,480 --> 01:28:22,480 indirection, which is typically an L1 cache hit. 1719 01:28:22,480 --> 01:28:26,080 So the overhead of actually updating one of these things 1720 01:28:26,080 --> 01:28:30,320 is really just like an extra L1 cache hit for most of these 1721 01:28:30,320 --> 01:28:32,160 things, for most of the time. 1722 01:28:32,160 --> 01:28:37,210 If you have the case that you're accessing a reducer 1723 01:28:37,210 --> 01:28:39,640 several times within the same block of code. 1724 01:28:39,640 --> 01:28:42,150 Otherwise, at the very worst, you have to actually do a hash 1725 01:28:42,150 --> 01:28:43,140 table lookup. 1726 01:28:43,140 --> 01:28:46,270 And that tends to be a little bit more like a function call 1727 01:28:46,270 --> 01:28:51,180 overhead just in terms of order of magnitude. 1728 01:28:51,180 --> 01:28:52,430 Sorry for running over.