1 00:00:00,120 --> 00:00:02,500 The following content is provided under a Creative 2 00:00:02,500 --> 00:00:03,910 Commons license. 3 00:00:03,910 --> 00:00:06,950 Your support will help MIT OpenCourseWare continue to 4 00:00:06,950 --> 00:00:10,600 offer high quality educational resources for free. 5 00:00:10,600 --> 00:00:13,500 To make a donation or view additional materials from 6 00:00:13,500 --> 00:00:17,430 hundreds of MIT courses, visit MIT OpenCourseWare at 7 00:00:17,430 --> 00:00:18,680 ocw.mit.edu. 8 00:00:27,980 --> 00:00:30,590 PROFESSOR: The last time we talked about nondeterministic 9 00:00:30,590 --> 00:00:32,400 programming. 10 00:00:32,400 --> 00:00:37,290 And I think actually the mic is up pretty high. 11 00:00:37,290 --> 00:00:40,740 If we can tone that down just a little bit. 12 00:00:40,740 --> 00:00:43,410 We talked about nondeterministic programming. 13 00:00:43,410 --> 00:00:47,080 And as you recall, the rule with nondeterministic 14 00:00:47,080 --> 00:00:49,630 programming is you should never do it 15 00:00:49,630 --> 00:00:50,880 unless you have to. 16 00:00:53,800 --> 00:00:55,000 Today we're going to talk about 17 00:00:55,000 --> 00:00:56,730 synchronizing without locks. 18 00:00:56,730 --> 00:00:59,730 And it goes doubly that you should never synchronize 19 00:00:59,730 --> 00:01:04,569 without locks unless you have to. 20 00:01:04,569 --> 00:01:07,240 There are some good reasons for synchronizing without 21 00:01:07,240 --> 00:01:10,620 locks, as we'll see. 22 00:01:10,620 --> 00:01:14,910 But it, once again, becomes even more difficult to test 23 00:01:14,910 --> 00:01:19,550 correctness and to ensure that the program that you think 24 00:01:19,550 --> 00:01:21,310 you've written is, in fact, the program 25 00:01:21,310 --> 00:01:22,860 you meant to write. 26 00:01:22,860 --> 00:01:26,260 So we're going to talk about a bunch of 27 00:01:26,260 --> 00:01:27,870 really important topics. 
28 00:01:27,870 --> 00:01:30,630 The first is memory consistency. 29 00:01:30,630 --> 00:01:33,480 And then we'll talk a little bit about lock-free protocols 30 00:01:33,480 --> 00:01:36,800 and one of the problems that arises, called the ABA problem. 31 00:01:36,800 --> 00:01:40,700 And then we're going to talk about a technology that we're 32 00:01:40,700 --> 00:01:47,180 using in the Cilk++ system, which tries to make an end run 33 00:01:47,180 --> 00:01:52,180 around some of these problems and allows you to do 34 00:01:52,180 --> 00:01:55,410 synchronization without locks, with low overhead. 35 00:01:55,410 --> 00:01:59,810 But it only works in certain contexts. 36 00:01:59,810 --> 00:02:01,060 So we're going to start with memory consistency. 37 00:02:05,240 --> 00:02:09,729 So here is a very simple parallel program. 38 00:02:09,729 --> 00:02:12,870 So initially a and b are both 0. 39 00:02:12,870 --> 00:02:17,250 And processor zero moves a 1 into a. 40 00:02:17,250 --> 00:02:22,360 And then it moves whatever is in location b 41 00:02:22,360 --> 00:02:26,240 into the EBX register. 42 00:02:26,240 --> 00:02:29,070 Processor one does something complementary. 43 00:02:29,070 --> 00:02:31,050 It moves a 1 into b. 44 00:02:31,050 --> 00:02:39,040 And then it moves whatever is in a into the EAX register. 45 00:02:39,040 --> 00:02:42,090 Into the EAX register as opposed to the EBX register. 46 00:02:42,090 --> 00:02:48,870 And the question is what are the final possible values of 47 00:02:48,870 --> 00:02:53,860 EAX and EBX after both processors have executed. 48 00:02:53,860 --> 00:02:55,495 Seems like a straightforward enough question. 49 00:02:58,470 --> 00:03:03,320 What values can EAX and EBX have, depending upon-- there 50 00:03:03,320 --> 00:03:07,400 may be scheduling of when things happen and so forth. 51 00:03:07,400 --> 00:03:09,320 So it's not always going to give the same answer. 
52 00:03:09,320 --> 00:03:11,550 But the question is what's the set of answers 53 00:03:11,550 --> 00:03:12,720 that you can get? 54 00:03:12,720 --> 00:03:15,230 Well, it turns out you can't just answer this question for 55 00:03:15,230 --> 00:03:18,940 any particular machine without knowing the 56 00:03:18,940 --> 00:03:22,370 machine's memory model. 57 00:03:22,370 --> 00:03:25,940 So it depends upon how memory operations behave in the 58 00:03:25,940 --> 00:03:28,290 parallel computer system. 59 00:03:28,290 --> 00:03:31,070 And different machines have different memory models. 60 00:03:31,070 --> 00:03:35,450 And they'll give you different answers for this code. 61 00:03:35,450 --> 00:03:37,660 There'll be some answers that you get on some machines, 62 00:03:37,660 --> 00:03:39,080 different answers on different machines. 63 00:03:42,620 --> 00:03:48,930 So probably the bedrock of memory models is a model 64 00:03:48,930 --> 00:03:51,750 called sequential consistency. 65 00:03:51,750 --> 00:03:55,430 And this is intuitively what you might think you want. 66 00:03:58,470 --> 00:04:02,390 So Lamport in 1979 said, "The result of any execution is the 67 00:04:02,390 --> 00:04:05,650 same as if the operations of all the processors were 68 00:04:05,650 --> 00:04:08,940 executed in some sequential order, and the operations of 69 00:04:08,940 --> 00:04:12,060 each individual processor appear in this sequence in the 70 00:04:12,060 --> 00:04:15,380 order specified by its program." 71 00:04:15,380 --> 00:04:17,579 So what does that mean? 72 00:04:17,579 --> 00:04:23,560 So what it says is that if I look at the processor's 73 00:04:23,560 --> 00:04:26,510 program and the sequence of operations that are issued by 74 00:04:26,510 --> 00:04:31,780 that processor's program, they're interleaved with the 75 00:04:31,780 --> 00:04:34,230 corresponding sequences defined by the other 76 00:04:34,230 --> 00:04:38,900 processors to produce a global linear order. 
77 00:04:41,600 --> 00:04:44,470 So the first thing is that there's a global linear order 78 00:04:44,470 --> 00:04:47,900 that consists of all of these processors' instructions being 79 00:04:47,900 --> 00:04:49,150 interleaved. 80 00:04:51,850 --> 00:04:56,190 In this linear order, whenever you perform a load from memory 81 00:04:56,190 --> 00:05:00,770 into a register, it receives the value that was stored by the 82 00:05:00,770 --> 00:05:05,920 most recent store operation in that linear order to that 83 00:05:05,920 --> 00:05:08,660 location in memory. 84 00:05:11,680 --> 00:05:17,060 So suppose you have in this linear order 85 00:05:17,060 --> 00:05:19,480 that two processors wrote. 86 00:05:19,480 --> 00:05:21,780 Well, one of them came last. 87 00:05:21,780 --> 00:05:24,340 The most recent one before you read, that's the 88 00:05:24,340 --> 00:05:26,490 one that you get. 89 00:05:26,490 --> 00:05:27,580 Now, there may be many different 90 00:05:27,580 --> 00:05:29,060 interleavings and so forth. 91 00:05:29,060 --> 00:05:32,520 And you could get any of the values that correspond to any 92 00:05:32,520 --> 00:05:33,430 of those interleavings. 93 00:05:33,430 --> 00:05:37,350 But the point is that you must get a value that is 94 00:05:37,350 --> 00:05:39,170 represented by some interleaving. 95 00:05:42,230 --> 00:05:45,240 The hardware can then do anything it wants, but for the 96 00:05:45,240 --> 00:05:49,290 execution to satisfy the sequential consistency model, 97 00:05:49,290 --> 00:05:52,800 for it to be sequentially consistent, it must appear as if 98 00:05:52,800 --> 00:05:58,950 the loads and stores obey some global linear order. 99 00:05:58,950 --> 00:06:03,760 So let's be concrete about that with the problem that I 100 00:06:03,760 --> 00:06:04,580 gave before. 101 00:06:04,580 --> 00:06:07,840 So initially, we have a and b are 0. 102 00:06:07,840 --> 00:06:10,620 And now, we have these instructions executed. 
103 00:06:10,620 --> 00:06:15,360 So what I have to do is say, I get any possible outcome based 104 00:06:15,360 --> 00:06:18,980 on interleaving these instructions in this order. 105 00:06:18,980 --> 00:06:21,210 So if I look at it, I've got two instructions 106 00:06:21,210 --> 00:06:23,160 here, two over here. 107 00:06:23,160 --> 00:06:28,040 So that there are six possible interleavings because 4 choose 108 00:06:28,040 --> 00:06:33,270 2 is 6, for those people who've taken 6.042. 109 00:06:33,270 --> 00:06:35,660 So there are six possible interleavings. 110 00:06:35,660 --> 00:06:41,330 So for example, if I execute first move a 1 into a, and 111 00:06:41,330 --> 00:06:50,330 then I execute the load into a register of the value of b, and 112 00:06:50,330 --> 00:06:54,380 then I move 1 into b, and then I load the value of a, I get a 113 00:06:54,380 --> 00:06:59,940 value of 1 for EAX and a value of 0 for EBX. 114 00:06:59,940 --> 00:07:01,960 For this particular interleaving of those 115 00:07:01,960 --> 00:07:03,210 instructions. 116 00:07:05,310 --> 00:07:07,070 That's what happens if I execute these two 117 00:07:07,070 --> 00:07:07,850 before these two. 118 00:07:07,850 --> 00:07:12,600 If I execute these two instructions here before these 119 00:07:12,600 --> 00:07:15,840 two here, I get the order 3412. 120 00:07:15,840 --> 00:07:17,780 And essentially, the opposite thing happens. 121 00:07:17,780 --> 00:07:21,870 EAX gets 0 and EBX gets 1. 122 00:07:21,870 --> 00:07:26,850 And then, if I interleave them in some way, where 1 and 3 123 00:07:26,850 --> 00:07:31,510 somehow come first before I do the 2 and 4, then I'll get a 124 00:07:31,510 --> 00:07:34,030 value of 1 for each of them. 125 00:07:34,030 --> 00:07:37,060 Those are the middle cases. 126 00:07:37,060 --> 00:07:39,386 So what don't I get? 
127 00:07:39,386 --> 00:07:40,360 AUDIENCE: 0, 0. 128 00:07:40,360 --> 00:07:42,670 PROFESSOR: You never get 0, 0 in a 129 00:07:42,670 --> 00:07:46,120 sequentially consistent execution. 130 00:07:46,120 --> 00:07:49,680 Sequential consistency implies that no execution-- 131 00:07:49,680 --> 00:07:52,730 whoops, that should be EAX. 132 00:07:52,730 --> 00:07:56,220 That EAX equals EBX equals 0. 133 00:07:56,220 --> 00:07:59,450 I don't ever get that outcome. 134 00:07:59,450 --> 00:08:03,610 If I did, then I would say my machine wasn't sequentially 135 00:08:03,610 --> 00:08:04,860 consistent. 136 00:08:07,020 --> 00:08:11,360 So now let me take a detour a little bit to look at mutual 137 00:08:11,360 --> 00:08:14,640 exclusion again. 138 00:08:14,640 --> 00:08:18,890 And understand what happens to mutual exclusion algorithms in 139 00:08:18,890 --> 00:08:21,280 the context of memory consistency. 140 00:08:21,280 --> 00:08:24,100 So everybody understood what sequential consistency is. 141 00:08:24,100 --> 00:08:26,690 I simply look at my program as if I'm interleaving 142 00:08:26,690 --> 00:08:27,940 instructions. 143 00:08:32,510 --> 00:08:37,440 So most implementations of mutual exclusion, as I showed 144 00:08:37,440 --> 00:08:42,530 previously, employ some kind of atomic read-modify-write. 145 00:08:42,530 --> 00:08:48,130 So the example I gave you last time was using the exchange 146 00:08:48,130 --> 00:08:52,370 operation to atomically exchange a value in a register 147 00:08:52,370 --> 00:08:53,920 with a value in memory. 148 00:08:53,920 --> 00:08:56,690 People remember that? 149 00:08:56,690 --> 00:08:58,180 To implement a lock? 150 00:08:58,180 --> 00:09:00,500 So in order to implement a lock, I 151 00:09:00,500 --> 00:09:02,700 atomically switch two values. 152 00:09:06,500 --> 00:09:08,330 So we, in particular, used the exchange one. 153 00:09:08,330 --> 00:09:10,750 And there are a bunch of other commands that people can use. 
154 00:09:10,750 --> 00:09:12,270 Test-and-set, compare-and-swap, 155 00:09:12,270 --> 00:09:17,830 load-linked-store-conditional, which essentially do some kind 156 00:09:17,830 --> 00:09:19,630 of read-modify-write on memory. 157 00:09:19,630 --> 00:09:22,340 These tend to be expensive instructions, as I mentioned. 158 00:09:22,340 --> 00:09:23,810 They usually tend to cost something 159 00:09:23,810 --> 00:09:27,820 like an L2 cache hit. 160 00:09:27,820 --> 00:09:30,480 Now, the question is can mutual exclusion be 161 00:09:30,480 --> 00:09:33,620 implemented with only atomic loads and stores? 162 00:09:33,620 --> 00:09:38,870 Do you really need one of these heavyweight operations 163 00:09:38,870 --> 00:09:42,740 to implement mutual exclusion? 164 00:09:42,740 --> 00:09:44,710 What if I don't use a read-modify-write? 165 00:09:44,710 --> 00:09:46,860 Is it possible to do it? 166 00:09:46,860 --> 00:09:50,690 And in fact, the answer is yes. 167 00:09:50,690 --> 00:09:53,920 So Dekker and Dijkstra showed that it can, as long as the 168 00:09:53,920 --> 00:09:58,420 computer system is sequentially consistent. 169 00:09:58,420 --> 00:10:00,770 So as long as you have sequential consistency, you, in 170 00:10:00,770 --> 00:10:06,850 fact, can implement mutual exclusion without 171 00:10:06,850 --> 00:10:08,940 read-modify-write. 172 00:10:08,940 --> 00:10:11,330 We're actually not going to use either the Dekker or 173 00:10:11,330 --> 00:10:14,090 Dijkstra algorithms, although you can read about those in 174 00:10:14,090 --> 00:10:15,010 the literature. 175 00:10:15,010 --> 00:10:17,230 We're going to look at what is probably the simplest such 176 00:10:17,230 --> 00:10:19,580 algorithm that's been devised to date, which 177 00:10:19,580 --> 00:10:23,690 is due to Peterson. 178 00:10:23,690 --> 00:10:29,370 And I'm going to illustrate it with these two smileys. 179 00:10:29,370 --> 00:10:30,320 That's a she. 180 00:10:30,320 --> 00:10:31,590 And that's a he. 
181 00:10:31,590 --> 00:10:34,280 And they want to operate on widget x. 182 00:10:34,280 --> 00:10:36,040 And she wants to frob it. 183 00:10:36,040 --> 00:10:38,560 And he wants to borf it. 184 00:10:38,560 --> 00:10:41,240 And we want to preserve the property that we are not 185 00:10:41,240 --> 00:10:43,020 frobbing and borfing at the same time. 186 00:10:46,380 --> 00:10:47,430 So how do we do that? 187 00:10:47,430 --> 00:10:50,010 Well, here's the code. 188 00:10:50,010 --> 00:10:53,120 So we're going to set up some things before we start he and 189 00:10:53,120 --> 00:10:54,830 she operating. 190 00:10:54,830 --> 00:10:56,300 So we're going to have our widget x. 191 00:10:56,300 --> 00:10:57,820 That's our protected variable. 192 00:10:57,820 --> 00:11:01,700 And we're going to have a Boolean set initially to false 193 00:11:01,700 --> 00:11:03,980 that says whether she wants to frob it. 194 00:11:03,980 --> 00:11:06,290 So we don't want to have her frob it unless she 195 00:11:06,290 --> 00:11:08,020 wants to frob it. 196 00:11:08,020 --> 00:11:13,510 And we don't want him to borf it unless he wants to borf it. 197 00:11:13,510 --> 00:11:16,320 And we're going to have an extra auxiliary variable, 198 00:11:16,320 --> 00:11:18,620 which is whose turn it is. 199 00:11:18,620 --> 00:11:21,030 So they're going to sort of take turns. 200 00:11:21,030 --> 00:11:25,190 But that's only going to come into account if the other one 201 00:11:25,190 --> 00:11:26,120 doesn't have a conflict. 202 00:11:26,120 --> 00:11:28,370 If they don't have a conflict, then one of them is going to 203 00:11:28,370 --> 00:11:29,340 be able to go. 204 00:11:29,340 --> 00:11:31,180 So here's what she basically does. 205 00:11:31,180 --> 00:11:34,200 She initially sets that she wants to 206 00:11:34,200 --> 00:11:39,440 operate on the widget. 207 00:11:39,440 --> 00:11:44,365 And then, what she does is she sets the turn to be his. 
208 00:11:47,900 --> 00:11:54,350 And then, while he wants it, and the turn is his, she's 209 00:11:54,350 --> 00:11:55,210 going to just spin. 210 00:11:55,210 --> 00:11:58,050 Notice that you're not frobbing it in the while loop. 211 00:11:58,050 --> 00:12:00,840 The body of the while loop is empty. 212 00:12:00,840 --> 00:12:03,510 So this is a spinning solution. 213 00:12:03,510 --> 00:12:07,170 So while he wants it, and it's his turn, you're just going to 214 00:12:07,170 --> 00:12:11,850 sit there, continually testing the variables he wants and 215 00:12:11,850 --> 00:12:15,810 turn equals his until one of them ends up being false. 216 00:12:18,920 --> 00:12:23,760 So if he doesn't want it, or it's not his turn, then she 217 00:12:23,760 --> 00:12:24,710 gets to frob it. 218 00:12:24,710 --> 00:12:28,950 And when she's done, she sets she wants to false. 219 00:12:28,950 --> 00:12:30,450 And he does a similar thing. 220 00:12:30,450 --> 00:12:33,220 He sets it to true, says it's her turn. 221 00:12:33,220 --> 00:12:36,700 And then, while she wants it, and the turn is hers, he just 222 00:12:36,700 --> 00:12:42,190 sits there waiting, continually re-executing this 223 00:12:42,190 --> 00:12:45,510 until finally, one of these turns out to be false. 224 00:12:45,510 --> 00:12:46,436 And then he borfs it. 225 00:12:46,436 --> 00:12:47,900 And he sets it to false. 226 00:12:47,900 --> 00:12:50,520 And then, they're doing both of these things sort of in a 227 00:12:50,520 --> 00:12:55,200 loop, periodically coming back and executing it. 228 00:12:55,200 --> 00:12:57,010 And what you want to do is you don't want to make it so that 229 00:12:57,010 --> 00:12:57,770 it's forced. 230 00:12:57,770 --> 00:13:01,460 That it's one turn, then the other, because maybe he never 231 00:13:01,460 --> 00:13:03,830 wants to borf it. 232 00:13:03,830 --> 00:13:06,160 And then, she would be stuck not being able to frob it, 233 00:13:06,160 --> 00:13:08,350 even though he doesn't want to. 
234 00:13:08,350 --> 00:13:12,560 So if you think about this-- let's think about why this 235 00:13:12,560 --> 00:13:15,200 is always going to give you mutual exclusion. 236 00:13:17,710 --> 00:13:21,840 So basically, what's happening here is if he wants it-- 237 00:13:21,840 --> 00:13:24,120 by the way, these things are not easy to reason about. 238 00:13:24,120 --> 00:13:28,160 And usually, as much as I can talk and talk in class, what 239 00:13:28,160 --> 00:13:31,220 you really need to do is go home, and sit down with this 240 00:13:31,220 --> 00:13:32,090 kind of thing. 241 00:13:32,090 --> 00:13:35,940 And study it for 10 minutes. 242 00:13:35,940 --> 00:13:39,500 And then, you'll understand what the subtleties are and 243 00:13:39,500 --> 00:13:40,100 what's going on. 244 00:13:40,100 --> 00:13:44,620 But basically, what we're doing is we're making it so 245 00:13:44,620 --> 00:13:49,680 that it's not going to be the case both that she's set 246 00:13:49,680 --> 00:13:50,820 that she wants it 247 00:13:50,820 --> 00:13:53,060 and that the turn is his. 248 00:13:53,060 --> 00:13:59,010 And then, if there's a race where he wants it also, then 249 00:13:59,010 --> 00:14:02,270 that's going to preclude both of them from going into it at 250 00:14:02,270 --> 00:14:03,790 the same time. 251 00:14:03,790 --> 00:14:10,040 And then whichever one sets the turn, one of those is 252 00:14:10,040 --> 00:14:11,830 going to occur first. 253 00:14:11,830 --> 00:14:13,810 And one is going to occur second. 254 00:14:13,810 --> 00:14:18,380 And whoever ends up coming second, the other 255 00:14:18,380 --> 00:14:19,630 one gets to go ahead. 256 00:14:22,700 --> 00:14:26,990 So it's very subtle how that is actually working to make 257 00:14:26,990 --> 00:14:29,925 sure that each one is gating the other to allow them to go. 
258 00:14:32,470 --> 00:14:35,590 But the way to reason about this is to ask: 259 00:14:35,590 --> 00:14:37,860 what are the possible interleavings? 260 00:14:37,860 --> 00:14:39,670 And the important interleavings here, as you can 261 00:14:39,670 --> 00:14:43,710 see, are what happens when setting these things. 262 00:14:43,710 --> 00:14:45,740 And once they're set, what happens in 263 00:14:45,740 --> 00:14:46,850 testing these things? 264 00:14:46,850 --> 00:14:49,810 And especially because when you go around the loop and so 265 00:14:49,810 --> 00:14:52,730 forth, you have to imagine that an arbitrarily long 266 00:14:52,730 --> 00:14:54,600 amount of time has gone by. 267 00:14:54,600 --> 00:14:57,730 So for example, between the time that you check that the 268 00:14:57,730 --> 00:15:01,760 turn is his, he may have already gone around this loop. 269 00:15:04,350 --> 00:15:06,140 And so you have to worry about-- 270 00:15:06,140 --> 00:15:09,110 even though it may look like one instruction from this 271 00:15:09,110 --> 00:15:11,740 processor's point of view, for correctness purposes you have 272 00:15:11,740 --> 00:15:14,450 to imagine that an arbitrary amount of computation could 273 00:15:14,450 --> 00:15:18,340 occur between any two instructions. 274 00:15:18,340 --> 00:15:21,260 So any questions about this code? 275 00:15:21,260 --> 00:15:25,780 People see how it preserves mutual exclusion and how you 276 00:15:25,780 --> 00:15:29,280 use sequential consistency to reason about it by asking what 277 00:15:29,280 --> 00:15:32,000 are the possible interleavings? 278 00:15:32,000 --> 00:15:32,930 Questions? 279 00:15:32,930 --> 00:15:34,057 Yeah. 280 00:15:34,057 --> 00:15:36,756 AUDIENCE: So, I don't know if I got it right. 281 00:15:36,756 --> 00:15:41,148 So basically, she sets the [UNINTELLIGIBLE] 282 00:15:41,148 --> 00:15:46,500 to give him a chance before she goes to the loop. 
283 00:15:46,500 --> 00:15:50,130 So basically, she waits there until he has been able to go? 284 00:15:50,130 --> 00:15:51,770 That's why on the [UNINTELLIGIBLE]. 285 00:15:51,770 --> 00:15:55,070 PROFESSOR: So on this third line-- 286 00:15:55,070 --> 00:15:55,995 AUDIENCE: Both of them. 287 00:15:55,995 --> 00:15:59,043 Either before actually frobbing or borfing. 288 00:15:59,043 --> 00:16:01,826 And before that while you always give the turn to the 289 00:16:01,826 --> 00:16:03,030 other to give them a chance to go. 290 00:16:03,030 --> 00:16:03,296 PROFESSOR: Yeah. 291 00:16:03,296 --> 00:16:04,550 So there are two things you want to show. 292 00:16:04,550 --> 00:16:10,470 One is that they can't both be stalled on 293 00:16:10,470 --> 00:16:11,516 the while loop there. 294 00:16:11,516 --> 00:16:13,510 And that can't happen because the turn can't be 295 00:16:13,510 --> 00:16:16,780 simultaneously his and hers. 296 00:16:16,780 --> 00:16:20,110 So you know that they're not both going to deadlock in 297 00:16:20,110 --> 00:16:23,400 trying to do this by sitting there waiting for the other 298 00:16:23,400 --> 00:16:24,400 because of this. 299 00:16:24,400 --> 00:16:28,960 And now, the question is well, how do you know that one can't 300 00:16:28,960 --> 00:16:34,070 get through while the other is also going through? 301 00:16:36,580 --> 00:16:45,780 And for that, you have to look and say, oh well, if you go 302 00:16:45,780 --> 00:16:49,850 through, then you know that either he doesn't want it, 303 00:16:49,850 --> 00:16:53,280 or it's not his turn. 304 00:16:53,280 --> 00:16:54,695 And in either case, you're free 305 00:16:54,695 --> 00:16:55,205 to go ahead. 306 00:16:55,205 --> 00:16:58,490 If he does change it so that he wants it, then in fact, 307 00:16:58,490 --> 00:17:00,970 it's going to be your turn. 308 00:17:00,970 --> 00:17:01,785 Question? 
309 00:17:01,785 --> 00:17:03,960 AUDIENCE: This only works for exactly two threads, right? 310 00:17:03,960 --> 00:17:06,349 PROFESSOR: This only works for exactly two threads. 311 00:17:06,349 --> 00:17:09,099 This does not work for three, but there are extensions of 312 00:17:09,099 --> 00:17:12,650 this sort of thing to n threads, for an 313 00:17:12,650 --> 00:17:14,859 arbitrarily large number n. 314 00:17:14,859 --> 00:17:17,460 However, the data structures to implement this kind of 315 00:17:17,460 --> 00:17:21,960 mutual exclusion for n threads end up taking space 316 00:17:21,960 --> 00:17:23,730 proportional to n. 317 00:17:23,730 --> 00:17:28,329 And so one of the advantages of the built-in atomics-- 318 00:17:28,329 --> 00:17:34,590 the compare-and-swap, or the atomic exchange, or whatever-- 319 00:17:34,590 --> 00:17:37,540 is they work for an arbitrary number of threads with only a 320 00:17:37,540 --> 00:17:40,560 bounded amount of resources. 321 00:17:40,560 --> 00:17:46,010 You don't require extra data structures and so forth. 322 00:17:46,010 --> 00:17:48,710 So that's why they put those things in the architecture: 323 00:17:48,710 --> 00:17:54,690 because in the architecture you can build things that will 324 00:17:54,690 --> 00:17:57,090 solve this problem much more simply than 325 00:17:57,090 --> 00:17:58,340 this sort of thing. 326 00:18:04,510 --> 00:18:06,640 However, there are going to be lessons here that you may want 327 00:18:06,640 --> 00:18:08,280 to use in your programming, depending 328 00:18:08,280 --> 00:18:09,330 on what you're doing. 329 00:18:09,330 --> 00:18:19,310 So now, it turns out that no modern-day processor 330 00:18:19,310 --> 00:18:20,800 implements sequential consistency. 331 00:18:23,450 --> 00:18:25,650 There have been machines that were built-- actually quite 332 00:18:25,650 --> 00:18:26,740 good machines-- 333 00:18:26,740 --> 00:18:28,500 that implemented sequential consistency. 
334 00:18:28,500 --> 00:18:32,920 But today, nobody implements it. 335 00:18:32,920 --> 00:18:35,930 They all implement some form of what's called relaxed 336 00:18:35,930 --> 00:18:40,930 consistency, where the hardware may reorder 337 00:18:40,930 --> 00:18:42,840 instructions. 338 00:18:42,840 --> 00:18:44,490 And so you have things executing 339 00:18:44,490 --> 00:18:45,910 not in program order. 340 00:18:45,910 --> 00:18:49,790 And the compilers may reorder instructions as well. 341 00:18:49,790 --> 00:18:53,120 So both the hardware and the software are doing reordering. 342 00:18:53,120 --> 00:18:56,720 So let's take a look at that. 343 00:18:56,720 --> 00:19:01,280 So here's the program order for one of the things. 344 00:19:01,280 --> 00:19:09,840 We move 1 into a, and then move the value of b into EBX 345 00:19:09,840 --> 00:19:11,210 to do a load. 346 00:19:11,210 --> 00:19:13,150 Here's the program order. 347 00:19:13,150 --> 00:19:19,760 Most modern hardware will switch these and execute them in 348 00:19:19,760 --> 00:19:21,740 this order. 349 00:19:21,740 --> 00:19:24,360 Why do you suppose? 350 00:19:24,360 --> 00:19:26,740 Even if you write it this way, the instruction-level 351 00:19:26,740 --> 00:19:31,270 parallelism within the processor will, in fact, 352 00:19:31,270 --> 00:19:35,260 execute it in the opposite order most of the time. 353 00:19:35,260 --> 00:19:35,510 Yeah? 354 00:19:35,510 --> 00:19:37,340 AUDIENCE: Because loading takes longer. 355 00:19:37,340 --> 00:19:37,520 PROFESSOR: Yeah. 356 00:19:37,520 --> 00:19:39,900 Because loading takes longer. 357 00:19:39,900 --> 00:19:43,290 Loading is going to incur latency. 358 00:19:43,290 --> 00:19:46,110 I can't complete the load from the processor's point of view 359 00:19:46,110 --> 00:19:47,850 until I get an answer. 
360 00:19:47,850 --> 00:19:50,730 So if I load, and I wait for it to go out to the memory 361 00:19:50,730 --> 00:19:53,240 system and back into the processor, 362 00:19:53,240 --> 00:19:56,720 and then I do a store-- 363 00:19:56,720 --> 00:19:58,730 well, as soon as I've done the store, I can move on. 364 00:19:58,730 --> 00:20:00,560 Even if the store takes a while to get out 365 00:20:00,560 --> 00:20:01,870 to the memory system. 366 00:20:01,870 --> 00:20:03,260 But if I do it in the opposite order-- 367 00:20:03,260 --> 00:20:11,650 I do the store first, and then I do the load-- I've ended up 368 00:20:11,650 --> 00:20:16,340 wasting essentially one cycle, the cycle to do the store, 369 00:20:16,340 --> 00:20:19,050 when I could have been overlapping that with the time 370 00:20:19,050 --> 00:20:22,500 it took to do the load. 371 00:20:22,500 --> 00:20:23,520 So people follow that? 372 00:20:23,520 --> 00:20:26,870 So if I execute the load first, I can go right on to 373 00:20:26,870 --> 00:20:29,030 execute the store. 374 00:20:29,030 --> 00:20:33,950 I can issue the load, go right on to execute the store 375 00:20:33,950 --> 00:20:39,160 without having to wait for the load to complete if I have a 376 00:20:39,160 --> 00:20:44,950 multi-issue CPU in the processor core. 377 00:20:44,950 --> 00:20:47,520 So you get higher instruction-level parallelism. 378 00:20:47,520 --> 00:20:53,490 Now, when is it safe for the hardware or compiler to perform 379 00:20:53,490 --> 00:20:55,930 this reordering? 380 00:20:55,930 --> 00:20:59,070 Can it always switch instructions like this to put 381 00:20:59,070 --> 00:21:00,320 loads before stores? 382 00:21:05,360 --> 00:21:10,060 When would this be a bad idea, to put a load before a store? 383 00:21:13,870 --> 00:21:14,620 Yeah? 384 00:21:14,620 --> 00:21:16,250 AUDIENCE: You're loading the variable you just stored. 
385 00:21:16,250 --> 00:21:17,100 PROFESSOR: Yeah, if you're loading the 386 00:21:17,100 --> 00:21:19,720 variable you just stored. 387 00:21:19,720 --> 00:21:27,820 Suppose you say store into x and then load from x. 388 00:21:27,820 --> 00:21:30,290 That's different from if I load from x, and then 389 00:21:30,290 --> 00:21:33,450 I store into x. 390 00:21:33,450 --> 00:21:40,710 So if you're going to the same location, then that's not a 391 00:21:40,710 --> 00:21:43,540 safe thing to do. 392 00:21:43,540 --> 00:21:52,620 So basically, in this case, if a is not equal to b, then this 393 00:21:52,620 --> 00:21:53,815 is safe to do. 394 00:21:53,815 --> 00:21:57,020 But if a equals b, this is not safe to do. 395 00:21:59,780 --> 00:22:04,680 Because it's going to give you a different answer. 396 00:22:04,680 --> 00:22:08,190 However, it turns out that there's another time when this 397 00:22:08,190 --> 00:22:09,790 is not safe to do. 398 00:22:09,790 --> 00:22:12,600 So this would have been the end of the story if we were 399 00:22:12,600 --> 00:22:14,320 running on one processor. 400 00:22:14,320 --> 00:22:18,350 The other time that it's not safe to do it is-- 401 00:22:18,350 --> 00:22:23,320 well, being safe also assumes that there's no 402 00:22:23,320 --> 00:22:24,260 concurrency. 403 00:22:24,260 --> 00:22:28,380 If there is concurrency, you can run into trouble as well. 404 00:22:30,920 --> 00:22:34,420 And the reason is because another processor may be 405 00:22:34,420 --> 00:22:39,710 changing the value that you're planning to read. 406 00:22:39,710 --> 00:22:42,780 And so if you read things out of order, you may violate 407 00:22:42,780 --> 00:22:45,180 sequential consistency. 408 00:22:45,180 --> 00:22:47,560 Let me show you what's going on in the hardware so you have 409 00:22:47,560 --> 00:22:51,030 an appreciation of what the issue is here. 410 00:22:51,030 --> 00:22:56,940 So here's a 30,000-foot view of hardware reordering. 
411 00:22:56,940 --> 00:23:02,880 So the processor is going to issue memory operations to the 412 00:23:02,880 --> 00:23:04,050 memory system. 413 00:23:04,050 --> 00:23:06,560 And results of memory operations are 414 00:23:06,560 --> 00:23:08,480 going to come back. 415 00:23:08,480 --> 00:23:12,280 But they really only have to come back when? 416 00:23:12,280 --> 00:23:14,670 If they're loads. 417 00:23:14,670 --> 00:23:19,040 If they're stores, they don't have to come back. 418 00:23:19,040 --> 00:23:24,570 So the processor, in fact, can issue stores faster than the 419 00:23:24,570 --> 00:23:26,320 network can handle them 420 00:23:26,320 --> 00:23:28,190 and the memory system can handle them. 421 00:23:28,190 --> 00:23:29,780 So the processors are generally very fast. 422 00:23:29,780 --> 00:23:33,520 The memory systems are relatively slow. 423 00:23:33,520 --> 00:23:36,220 But the processor is not generally issuing a store on 424 00:23:36,220 --> 00:23:37,850 every cycle. 425 00:23:37,850 --> 00:23:40,150 It may do a store, it may do some additions, it may do 426 00:23:40,150 --> 00:23:42,720 another store, et cetera. 427 00:23:42,720 --> 00:23:46,580 So rather than waiting for the memory system to do every 428 00:23:46,580 --> 00:23:49,570 store, they create a store buffer. 429 00:23:49,570 --> 00:23:52,820 And the memory system pulls things out of the store buffer 430 00:23:52,820 --> 00:23:54,880 as fast as it can. 431 00:23:54,880 --> 00:23:57,940 And the processor shoves stuff into the store buffer up to 432 00:23:57,940 --> 00:24:00,360 the point that the store buffer gets full, in which 433 00:24:00,360 --> 00:24:01,540 case it would have to stall. 434 00:24:01,540 --> 00:24:06,630 But for many codes, it never has to stall because 435 00:24:06,630 --> 00:24:10,400 there is a sufficient frequency of other operations 436 00:24:10,400 --> 00:24:13,730 going on that you don't have to wait. 
437 00:24:13,730 --> 00:24:16,280 So when a store occurs, it doesn't occur immediately; it goes into 438 00:24:16,280 --> 00:24:18,480 the store buffer. 439 00:24:18,480 --> 00:24:23,700 Now along comes a load operation. 440 00:24:23,700 --> 00:24:28,120 And the load operation, if it's to a different address, 441 00:24:28,120 --> 00:24:30,540 you want to have that take priority because the processor 442 00:24:30,540 --> 00:24:31,920 can be waiting. 443 00:24:31,920 --> 00:24:36,090 Its next instructions may be waiting on the result. 444 00:24:36,090 --> 00:24:39,910 So you want that to go as fast as possible. 445 00:24:39,910 --> 00:24:45,190 They have a passing lane here where the fast cars or the 446 00:24:45,190 --> 00:24:47,550 important cars, the ambulances, et cetera, in this 447 00:24:47,550 --> 00:24:51,640 case loads, can scoot by all the other things in traffic 448 00:24:51,640 --> 00:24:53,980 and get to the memory system first. 449 00:24:53,980 --> 00:24:57,510 But as we said, we don't want to do that if the last thing 450 00:24:57,510 --> 00:25:01,850 that I stored was to the same address. 451 00:25:01,850 --> 00:25:05,890 So in fact, there is content addressable memory here, which 452 00:25:05,890 --> 00:25:09,660 matches the address that is being loaded with everything 453 00:25:09,660 --> 00:25:11,780 in the store buffer. 454 00:25:11,780 --> 00:25:15,150 And if it does match, it gets satisfied immediately by the 455 00:25:15,150 --> 00:25:17,340 store buffer. 456 00:25:17,340 --> 00:25:21,880 And it only makes it out to the network if it's not in 457 00:25:21,880 --> 00:25:23,940 the store buffer.
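The mechanism just described can be sketched as a toy model in C: stores queue in a buffer, a load first matches its address against the buffer (the content-addressable memory), and only goes out to memory on a miss. The names and sizes here are invented for illustration; real hardware does the match in parallel rather than with a scan, but the behavior is the same.

```c
#include <assert.h>

/* A toy model of a store buffer with load forwarding (illustrative only;
   identifiers and sizes are invented, not from the lecture). */
#define MEM_SIZE 16
#define BUF_SIZE 4

static int memory[MEM_SIZE];                  /* the (slow) memory system */
static struct { int addr, val; } buf[BUF_SIZE];
static int buf_len;

/* A store does not go to memory right away; it queues in the buffer. */
static void store(int addr, int val) {
    assert(buf_len < BUF_SIZE);               /* a full buffer would stall */
    buf[buf_len].addr = addr;
    buf[buf_len].val  = val;
    buf_len++;
}

/* A load first matches its address against everything in the buffer
   (most recent store wins); only on a miss does it go out to memory,
   bypassing -- i.e., reordering with -- the still-buffered stores. */
static int load(int addr) {
    for (int i = buf_len - 1; i >= 0; i--)
        if (buf[i].addr == addr)
            return buf[i].val;                /* satisfied by the buffer */
    return memory[addr];                      /* the "passing lane" */
}

/* The memory system drains buffered stores as fast as it can. */
static void drain(void) {
    for (int i = 0; i < buf_len; i++)
        memory[buf[i].addr] = buf[i].val;
    buf_len = 0;
}
```

On one processor this is invisible: a load of the address you just stored is forwarded from the buffer. But a load of a different address reads memory before the buffered store has landed, which is exactly what another processor can observe.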
458 00:25:23,940 --> 00:25:27,350 But what you can see here is that this mechanism, which 459 00:25:27,350 --> 00:25:32,510 works great on one processor, violates sequential 460 00:25:32,510 --> 00:25:36,230 consistency because I may have operations going to two 461 00:25:36,230 --> 00:25:40,580 different memory locations, where the order, in fact, 462 00:25:40,580 --> 00:25:42,600 matters to me. 463 00:25:42,600 --> 00:25:45,150 So let's see how that works out. 464 00:25:45,150 --> 00:25:47,990 So first of all, let me tell you what the memory can-- so a 465 00:25:47,990 --> 00:25:50,580 load can bypass a store to a different address. 466 00:25:50,580 --> 00:25:54,350 First of all, any questions about this mechanism? 467 00:25:54,350 --> 00:26:02,650 So this accounts for a whole bunch of understanding of what 468 00:26:02,650 --> 00:26:05,020 happens in concurrency in systems. 469 00:26:05,020 --> 00:26:08,810 This one understanding, of store buffers, 470 00:26:08,810 --> 00:26:11,620 is absolutely crucial. 471 00:26:11,620 --> 00:26:14,980 And I have talked, by the way, with lots of experts who don't 472 00:26:14,980 --> 00:26:16,255 understand this. 473 00:26:16,255 --> 00:26:19,340 That this is what's going on for why we don't have 474 00:26:19,340 --> 00:26:22,420 sequential consistency in our computers. 475 00:26:22,420 --> 00:26:25,760 It's because they made the decision to allow this 476 00:26:25,760 --> 00:26:29,040 optimization, even though it doesn't preserve sequential 477 00:26:29,040 --> 00:26:31,250 consistency. 478 00:26:31,250 --> 00:26:33,750 There were machines in the past that did support 479 00:26:33,750 --> 00:26:34,840 sequential consistency. 480 00:26:34,840 --> 00:26:40,140 And what they did was they used speculation to allow the 481 00:26:40,140 --> 00:26:43,010 processor to assume that it was sequentially consistent.
482 00:26:43,010 --> 00:26:45,820 And if that turned out to be wrong, they were able to roll 483 00:26:45,820 --> 00:26:49,430 back the processor's state to the point before 484 00:26:49,430 --> 00:26:52,440 the access was done. 485 00:26:52,440 --> 00:26:55,340 In fact, the processor is already doing that for 486 00:26:55,340 --> 00:26:58,340 branches, where it makes branch predictions and 487 00:26:58,340 --> 00:26:59,840 executes down a line. 488 00:26:59,840 --> 00:27:01,800 But if it's wrong, it has to flush the 489 00:27:01,800 --> 00:27:04,180 pipeline and so forth. 490 00:27:04,180 --> 00:27:09,540 Why they don't do the same thing for hardware is an 491 00:27:09,540 --> 00:27:10,530 interesting-- 492 00:27:10,530 --> 00:27:13,900 for loads and stores-- is an interesting question. 493 00:27:13,900 --> 00:27:16,170 Because at some level there's no reason 494 00:27:16,170 --> 00:27:17,970 they couldn't do this. 495 00:27:17,970 --> 00:27:20,890 Instead, it's sort of been a thing where the software 496 00:27:20,890 --> 00:27:23,020 people say, yeah we can handle it. 497 00:27:23,020 --> 00:27:24,480 And the hardware people say, OK. 498 00:27:24,480 --> 00:27:26,240 You're willing to handle it. 499 00:27:26,240 --> 00:27:28,170 We won't worry about it then. 500 00:27:28,170 --> 00:27:32,360 When in fact, it just makes life complicated for everybody 501 00:27:32,360 --> 00:27:34,100 that you don't have sequential consistency. 502 00:27:34,100 --> 00:27:36,465 AUDIENCE: [INAUDIBLE] 503 00:27:36,465 --> 00:27:41,670 you have to do speculation across both [INAUDIBLE]. 504 00:27:41,670 --> 00:27:43,530 PROFESSOR: Well here, you only have to do speculation over 505 00:27:43,530 --> 00:27:45,730 what actually is coming out of your memory system. 506 00:27:45,730 --> 00:27:49,310 And if it doesn't match, you could roll back. 507 00:27:49,310 --> 00:27:53,050 The issue, in part, is how many machine states are you 508 00:27:53,050 --> 00:27:54,860 ready to roll back to.
509 00:27:54,860 --> 00:27:56,810 Loads come more frequently than branches. 510 00:27:56,810 --> 00:27:57,940 That's one thing. 511 00:27:57,940 --> 00:28:01,400 So no doubt, there are good reasons for why 512 00:28:01,400 --> 00:28:02,090 they're doing it. 513 00:28:02,090 --> 00:28:06,080 Nevertheless, definitely loss of sequential consistency 514 00:28:06,080 --> 00:28:08,830 becomes a headache for a lot of people in doing a 515 00:28:08,830 --> 00:28:09,700 concurrent program. 516 00:28:09,700 --> 00:28:10,440 We had a question here? 517 00:28:10,440 --> 00:28:11,570 Yes, Sara? 518 00:28:11,570 --> 00:28:12,861 AUDIENCE: So this does not preserve sequential 519 00:28:12,861 --> 00:28:13,450 consistency? 520 00:28:13,450 --> 00:28:15,956 But as long as there's only one processor, it should have 521 00:28:15,956 --> 00:28:18,050 the same effect, right? 522 00:28:18,050 --> 00:28:20,320 PROFESSOR: But sequential consistency for one processor 523 00:28:20,320 --> 00:28:22,190 is easy because all you do is execute them-- 524 00:28:22,190 --> 00:28:23,170 AUDIENCE: Yeah, I'm just saying-- 525 00:28:23,170 --> 00:28:25,450 PROFESSOR: It should have the same effect, exactly. 526 00:28:25,450 --> 00:28:30,080 So on one processor, this works perfectly well. 527 00:28:30,080 --> 00:28:32,820 If there's no concurrency, this is going to give you the 528 00:28:32,820 --> 00:28:34,520 same behavior. 529 00:28:34,520 --> 00:28:38,470 And yet, you've now got this optimization that loads can 530 00:28:38,470 --> 00:28:39,580 bypass stores. 531 00:28:39,580 --> 00:28:44,400 And therefore, you can do a store and a load and be able 532 00:28:44,400 --> 00:28:46,500 to overlap their execution. 533 00:28:46,500 --> 00:28:52,290 So this definitely wins for serial execution. 534 00:28:52,290 --> 00:28:53,230 Yep, good. 535 00:28:53,230 --> 00:28:54,715 Any other questions about this mechanism? 536 00:28:57,620 --> 00:29:00,030 You could reason about it on the quiz. 
537 00:29:00,030 --> 00:29:03,250 That kind of thing, right? 538 00:29:03,250 --> 00:29:05,990 Yeah, OK? 539 00:29:05,990 --> 00:29:10,960 So here's the x86 memory consistency model. 540 00:29:10,960 --> 00:29:13,590 For many years, Intel was unwilling to say what their 541 00:29:13,590 --> 00:29:16,930 memory consistency model was for fear that people would 542 00:29:16,930 --> 00:29:18,740 then rely on it. 543 00:29:18,740 --> 00:29:20,310 And then, they would be forced into it. 544 00:29:20,310 --> 00:29:23,190 But recently, they've started being more explicit about it. 545 00:29:23,190 --> 00:29:25,290 And this is the large part of it. 546 00:29:25,290 --> 00:29:27,400 I haven't put up all the things because there are a 547 00:29:27,400 --> 00:29:32,110 whole bunch of instructions, such as locking instructions 548 00:29:32,110 --> 00:29:34,720 and so forth, for which for some of them, it's more 549 00:29:34,720 --> 00:29:35,280 complicated. 550 00:29:35,280 --> 00:29:36,890 But this is the basics. 551 00:29:36,890 --> 00:29:40,190 So loads are not reordered with loads. 552 00:29:40,190 --> 00:29:42,390 So if you have a load to one location, and a load to another 553 00:29:42,390 --> 00:29:45,470 location, they always execute in the same order. 554 00:29:45,470 --> 00:29:48,080 Stores are not reordered with stores. 555 00:29:48,080 --> 00:29:51,310 If you have a store and then a subsequent store, those two 556 00:29:51,310 --> 00:29:53,980 stores always go in that order. 557 00:29:53,980 --> 00:29:58,240 Stores are not reordered with prior loads. 558 00:29:58,240 --> 00:30:02,780 So if you do a store after a load-- 559 00:30:02,780 --> 00:30:07,530 if you do a load and then a store, they're going to go in 560 00:30:07,530 --> 00:30:09,460 that order.
561 00:30:09,460 --> 00:30:11,180 However, a load-- 562 00:30:11,180 --> 00:30:12,980 and this is what we just talked about-- 563 00:30:12,980 --> 00:30:16,570 may be reordered with a prior store to a different location 564 00:30:16,570 --> 00:30:19,095 but not with a prior store to the same location. 565 00:30:21,770 --> 00:30:23,650 So that's exactly what we just talked about on 566 00:30:23,650 --> 00:30:25,230 the previous slide. 567 00:30:25,230 --> 00:30:27,610 Then, loads and stores are not reordered with lock 568 00:30:27,610 --> 00:30:28,720 instructions. 569 00:30:28,720 --> 00:30:30,790 So a certain set of instructions are called lock 570 00:30:30,790 --> 00:30:31,640 instructions. 571 00:30:31,640 --> 00:30:35,140 And they include all the atomic updates, the exchanges, 572 00:30:35,140 --> 00:30:39,160 compare-and-swaps, and a variety of other atomic 573 00:30:39,160 --> 00:30:42,680 operations that the hardware supports. 574 00:30:42,680 --> 00:30:45,470 The stores to the same location always respect a 575 00:30:45,470 --> 00:30:47,070 global order. 576 00:30:47,070 --> 00:30:51,060 Everybody sees the stores to a location in 577 00:30:51,060 --> 00:30:53,970 exactly the same order. 578 00:30:53,970 --> 00:30:57,410 And the lock instructions respect a global total order. 579 00:30:57,410 --> 00:31:02,460 So that everybody sees that this thread, or processor, got 580 00:31:02,460 --> 00:31:04,260 a lock before that one. 581 00:31:04,260 --> 00:31:08,330 You don't have two different processors disagreeing on what 582 00:31:08,330 --> 00:31:12,200 the order was that somebody acquired a lock or whatever. 583 00:31:12,200 --> 00:31:16,190 And then, memory ordering preserves transitive 584 00:31:16,190 --> 00:31:17,850 visibility, which is sort of like 585 00:31:17,850 --> 00:31:19,950 saying it obeys causality.
586 00:31:19,950 --> 00:31:27,530 In other words, if after doing a, you had some effect, and 587 00:31:27,530 --> 00:31:31,223 then you did b, it should look to other people like a 588 00:31:31,223 --> 00:31:32,680 and then b happened. 589 00:31:32,680 --> 00:31:36,590 Like there's a causality going on. 590 00:31:36,590 --> 00:31:39,980 But that's not sequential consistency, mainly 591 00:31:39,980 --> 00:31:41,230 because of four here. 592 00:31:43,770 --> 00:31:46,280 So what's the impact of reordering? 593 00:31:46,280 --> 00:31:50,240 So here, we have our example from the beginning for the 594 00:31:50,240 --> 00:31:55,260 memory model, where I'm storing a 1 into a and then 595 00:31:55,260 --> 00:31:58,960 loading whatever is in b. 596 00:31:58,960 --> 00:32:01,890 And similarly, over here the opposite. 597 00:32:01,890 --> 00:32:07,040 So what happens if I'm allowed to do reordering? 598 00:32:07,040 --> 00:32:10,960 What can happen to these two instructions? 599 00:32:10,960 --> 00:32:11,180 Yeah. 600 00:32:11,180 --> 00:32:14,060 They can execute in the opposite order. 601 00:32:14,060 --> 00:32:17,960 Similarly, these two guys can execute in the opposite order. 602 00:32:17,960 --> 00:32:28,050 So they can actually execute in this order where we do the 603 00:32:28,050 --> 00:32:30,990 loads and then the stores. 604 00:32:30,990 --> 00:32:33,450 So it executes as if this were the order. 605 00:32:33,450 --> 00:32:34,540 Did I do this right? 606 00:32:34,540 --> 00:32:36,940 Executes as if this were the order. 607 00:32:36,940 --> 00:32:38,990 So I could do 1, 2, 3, 4. 608 00:32:38,990 --> 00:32:43,455 So then, if I do the ordering 2, 4, 1, 3. 609 00:32:47,250 --> 00:32:49,800 AUDIENCE: [INAUDIBLE] 610 00:32:49,800 --> 00:32:50,940 PROFESSOR: I got that screwed up, I think. 611 00:32:50,940 --> 00:32:51,320 Didn't I?
612 00:32:51,320 --> 00:32:52,820 AUDIENCE: [INAUDIBLE] 613 00:32:52,820 --> 00:32:55,060 PROFESSOR: Because I should be swapping these guys, right? 614 00:32:55,060 --> 00:32:55,820 AUDIENCE: Swapped the wrong [INAUDIBLE]. 615 00:32:55,820 --> 00:32:57,480 PROFESSOR: Ugh. 616 00:32:57,480 --> 00:32:58,730 OK. 617 00:33:00,730 --> 00:33:04,850 So if I did this one 2, 1, 4, 3. 618 00:33:08,380 --> 00:33:09,850 So ignore this thing. 619 00:33:09,850 --> 00:33:14,050 Suppose I do the order 2. 620 00:33:14,050 --> 00:33:16,330 So basically, I load b. 621 00:33:16,330 --> 00:33:17,900 Then, I load a. 622 00:33:17,900 --> 00:33:23,230 Then, I store a. 623 00:33:23,230 --> 00:33:25,320 And then, I store b. 624 00:33:25,320 --> 00:33:31,740 What's the result value that are in EAX and EBX? 625 00:33:31,740 --> 00:33:33,570 You get 00. 626 00:33:33,570 --> 00:33:40,130 Remember, 00 wasn't a legal value under sequential 627 00:33:40,130 --> 00:33:40,870 consistency. 628 00:33:40,870 --> 00:33:45,040 But in this case, the Intel architecture and many other 629 00:33:45,040 --> 00:33:50,820 architectures out there will give you the wrong value for 630 00:33:50,820 --> 00:33:53,920 the execution of these instructions. 631 00:33:53,920 --> 00:33:56,270 Any question about that? 632 00:33:56,270 --> 00:34:01,075 So it doesn't preserve sequential consistency. 633 00:34:04,370 --> 00:34:08,510 So that's kind of scary in some way because you got to 634 00:34:08,510 --> 00:34:10,280 reason about this. 635 00:34:10,280 --> 00:34:12,679 Let's see what happens in Peterson's algorithm if you 636 00:34:12,679 --> 00:34:15,889 don't have sequential consistency. 637 00:34:15,889 --> 00:34:19,139 So here we go. 638 00:34:19,139 --> 00:34:21,179 We have the code where she_wants is true, 639 00:34:21,179 --> 00:34:23,130 turn is his, et cetera. 640 00:34:23,130 --> 00:34:26,150 How is this going to fail? 641 00:34:26,150 --> 00:34:27,400 What could happen here?
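Before digging into Peterson's algorithm, the claim just made -- that the reordered schedule produces EAX = EBX = 0, an outcome no sequentially consistent interleaving allows -- can be checked by brute force. This is a simulation sketch, not the lecture's code; the operation numbering follows the slide.

```c
#include <assert.h>

/* The four operations of the two-processor example:
     1: P0 stores 1 into a        2: P0 loads b into EBX
     3: P1 stores 1 into b        4: P1 loads a into EAX  */
static int a, b, eax, ebx;

static void run(const int *order) {
    a = b = eax = ebx = 0;
    for (int i = 0; i < 4; i++) {
        switch (order[i]) {
        case 1: a = 1;   break;
        case 2: ebx = b; break;
        case 3: b = 1;   break;
        case 4: eax = a; break;
        }
    }
}

/* Can some sequentially consistent interleaving -- one that keeps 1
   before 2 and 3 before 4 -- produce the outcome (want_eax, want_ebx)? */
static int sc_possible(int want_eax, int want_ebx) {
    static const int orders[6][4] = {
        {1, 2, 3, 4}, {1, 3, 2, 4}, {1, 3, 4, 2},
        {3, 1, 2, 4}, {3, 1, 4, 2}, {3, 4, 1, 2},
    };
    for (int i = 0; i < 6; i++) {
        run(orders[i]);
        if (eax == want_eax && ebx == want_ebx)
            return 1;
    }
    return 0;
}
```

Sequential consistency allows (1, 0), (0, 1), and (1, 1), but never (0, 0); the reordered schedule 2, 4, 1, 3 -- both loads bypassing the prior stores -- produces exactly the forbidden outcome.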
642 00:34:34,100 --> 00:34:35,550 Where will the bug arise? 643 00:34:35,550 --> 00:34:36,460 What's going to happen? 644 00:34:36,460 --> 00:34:39,520 What's the reordering that might happen? 645 00:34:39,520 --> 00:34:42,164 AUDIENCE: On the while you do loads, right? 646 00:34:42,164 --> 00:34:43,690 [INAUDIBLE] the he_wants and turn is. 647 00:34:43,690 --> 00:34:44,994 PROFESSOR: Sorry? 648 00:34:44,994 --> 00:34:46,416 AUDIENCE: On the while statement, 649 00:34:46,416 --> 00:34:48,312 you do a load, right? 650 00:34:48,312 --> 00:34:49,260 Because [INAUDIBLE]. 651 00:34:49,260 --> 00:34:49,380 PROFESSOR: Right. 652 00:34:49,380 --> 00:34:51,810 He_wants is a load. 653 00:34:51,810 --> 00:34:53,420 AUDIENCE: And so that will get reordered. 654 00:34:53,420 --> 00:34:54,670 PROFESSOR: Where could that be reordered to? 655 00:34:59,250 --> 00:35:03,480 That could be reordered all the way to the top. 656 00:35:03,480 --> 00:35:07,130 Similarly, this one can be reordered all 657 00:35:07,130 --> 00:35:09,830 the way to the top. 658 00:35:09,830 --> 00:35:13,530 So the loads could be ordered all the way to the top. 659 00:35:13,530 --> 00:35:16,550 And now, what's going to happen is you're going to set 660 00:35:16,550 --> 00:35:20,630 that she_wants is true but get a value of he_wants 661 00:35:20,630 --> 00:35:23,420 that might be old. 662 00:35:23,420 --> 00:35:26,110 And so they won't see each other's values. 663 00:35:26,110 --> 00:35:29,970 And so then, both threads can now enter the critical section 664 00:35:29,970 --> 00:35:31,220 simultaneously. 665 00:35:33,420 --> 00:35:35,666 Yeah, Reid? 666 00:35:35,666 --> 00:35:40,112 AUDIENCE: If you swap the order of the loads, does the 667 00:35:40,112 --> 00:35:43,090 [INAUDIBLE]? 
668 00:35:43,090 --> 00:35:45,512 PROFESSOR: If you swap the order of the loads-- 669 00:35:45,512 --> 00:35:48,002 AUDIENCE: If you swap-- put the turn equals his on the 670 00:35:48,002 --> 00:35:50,492 left, [INAUDIBLE] on the right. 671 00:35:50,492 --> 00:35:52,484 Because according to-- 672 00:35:52,484 --> 00:35:55,485 PROFESSOR: Put the turn equals his over here? 673 00:35:55,485 --> 00:35:58,460 AUDIENCE: Because the he_wants can't cross the load. 674 00:35:58,460 --> 00:36:01,130 PROFESSOR: Yeah, but that's not what you want to do. 675 00:36:01,130 --> 00:36:02,615 AUDIENCE: Then you can't [INAUDIBLE]. 676 00:36:05,590 --> 00:36:08,410 PROFESSOR: The whole idea here is that when you're saying you 677 00:36:08,410 --> 00:36:12,320 want to do something, you give the other one a turn so that 678 00:36:12,320 --> 00:36:18,560 whoever ends up winning the race allows just one of them 679 00:36:18,560 --> 00:36:19,130 to go through. 680 00:36:19,130 --> 00:36:20,089 Yeah? 681 00:36:20,089 --> 00:36:22,434 AUDIENCE: I think the point is that if you put turn equals 682 00:36:22,434 --> 00:36:25,720 his and he_wants-- 683 00:36:25,720 --> 00:36:27,266 PROFESSOR: You're saying this stuff here. 684 00:36:27,266 --> 00:36:29,696 AUDIENCE: Swap those two [UNINTELLIGIBLE] turn equals 685 00:36:29,696 --> 00:36:33,770 his will not be reordered before the store that-- 686 00:36:33,770 --> 00:36:34,530 PROFESSOR: You might be right. 687 00:36:34,530 --> 00:36:35,801 Let me think about that. 688 00:36:35,801 --> 00:36:37,725 AUDIENCE: You both reorder the same [? word. ?] 689 00:36:37,725 --> 00:36:40,611 AUDIENCE: But you just stored turn, right? 690 00:36:40,611 --> 00:36:41,092 PROFESSOR: Yeah. 691 00:36:41,092 --> 00:36:42,535 So if do turn equals his-- 692 00:36:42,535 --> 00:36:43,657 I see what you're saying. 693 00:36:43,657 --> 00:36:44,459 Do this turn equals his. 694 00:36:44,459 --> 00:36:45,920 I was looking at this turn equals his. 
695 00:36:45,920 --> 00:36:47,650 AUDIENCE: You mean turn equals equals his. 696 00:36:47,650 --> 00:36:49,060 AUDIENCE: So the Boolean expression [INAUDIBLE]. 697 00:36:49,060 --> 00:36:49,530 PROFESSOR: Yeah. 698 00:36:49,530 --> 00:36:50,940 OK, I hadn't thought about that. 699 00:36:50,940 --> 00:36:52,840 Let me just think about that a second. 700 00:36:52,840 --> 00:36:57,175 So if we do the turn equals his-- 701 00:36:57,175 --> 00:36:58,540 AUDIENCE: [INAUDIBLE] 702 00:36:58,540 --> 00:37:00,360 and you won't reorder those two [INAUDIBLE]? 703 00:37:00,360 --> 00:37:01,610 PROFESSOR: Then the-- 704 00:37:06,600 --> 00:37:07,000 Yeah. 705 00:37:07,000 --> 00:37:08,250 You got to be-- 706 00:37:11,120 --> 00:37:13,760 I have to think about that. 707 00:37:13,760 --> 00:37:15,610 I don't know about you folks, but I find this stuff really 708 00:37:15,610 --> 00:37:18,220 hard to think about. 709 00:37:18,220 --> 00:37:19,760 And so do most people, I think. 710 00:37:22,290 --> 00:37:24,450 This is one of these things where I don't think I can do 711 00:37:24,450 --> 00:37:27,800 it without sitting down for 10 minutes and 712 00:37:27,800 --> 00:37:30,820 thinking about it deeply. 713 00:37:30,820 --> 00:37:32,980 But it's an interesting thought that if you did it the 714 00:37:32,980 --> 00:37:35,220 other direction that maybe there would be 715 00:37:35,220 --> 00:37:39,350 a requirement there. 716 00:37:39,350 --> 00:37:43,910 I'm skeptical that that is true because to my knowledge, 717 00:37:43,910 --> 00:37:47,020 to do the mutual exclusion, you pretty much have to do 718 00:37:47,020 --> 00:37:48,440 what I'm going to talk about next. 719 00:37:51,890 --> 00:37:53,730 But it would be interesting if it is true. 720 00:37:57,890 --> 00:38:00,680 Because you also have to worry about this guy getting 721 00:38:00,680 --> 00:38:03,052 reordered with respect to this one.
722 00:38:03,052 --> 00:38:04,390 AUDIENCE: The loads can't be reordered with 723 00:38:04,390 --> 00:38:05,730 respect to each other. 724 00:38:05,730 --> 00:38:08,910 PROFESSOR: So he_wants and turn equals his. 725 00:38:08,910 --> 00:38:10,596 Yeah. 726 00:38:10,596 --> 00:38:12,220 So the loads won't be reordered. 727 00:38:12,220 --> 00:38:14,010 Yeah. 728 00:38:14,010 --> 00:38:15,630 So that looks OK. 729 00:38:15,630 --> 00:38:17,460 And then, you're saying and then therefore, it can't go 730 00:38:17,460 --> 00:38:20,450 forward because this one won't get reordered with that one. 731 00:38:20,450 --> 00:38:22,450 You might be right. 732 00:38:22,450 --> 00:38:23,700 That'd be cute. 733 00:38:26,370 --> 00:38:29,370 So I have to update the slides for next year if that's true. 734 00:38:32,670 --> 00:38:36,530 So one way out of this quandary is to use what's 735 00:38:36,530 --> 00:38:40,070 called a memory fence or memory barrier. 736 00:38:40,070 --> 00:38:42,950 And it's a hardware action that enforces an ordering 737 00:38:42,950 --> 00:38:45,620 constraint between the instructions before 738 00:38:45,620 --> 00:38:48,770 and after the fence. 739 00:38:48,770 --> 00:38:52,600 So a memory fence says don't allow the processor to reorder 740 00:38:52,600 --> 00:38:54,210 these things. 741 00:38:54,210 --> 00:38:57,290 So why would you not want to do a memory fence? 742 00:39:01,680 --> 00:39:02,810 Then we'll talk about why you do it. 743 00:39:02,810 --> 00:39:05,213 Yeah? 744 00:39:05,213 --> 00:39:07,430 AUDIENCE: To force a hardware slowdown? 745 00:39:07,430 --> 00:39:07,710 PROFESSOR: Yeah. 746 00:39:07,710 --> 00:39:08,480 You're forcing the hardware slowdown. 747 00:39:08,480 --> 00:39:11,100 You're also forcing the compiler because the compiler has to 748 00:39:11,100 --> 00:39:12,190 respect that, too. 749 00:39:12,190 --> 00:39:14,480 You're not letting the compiler do optimizations 750 00:39:14,480 --> 00:39:16,810 across the fence.
751 00:39:16,810 --> 00:39:21,800 So generally, fences slow things down. 752 00:39:21,800 --> 00:39:23,440 In addition, it turns out that they have 753 00:39:23,440 --> 00:39:24,690 some significant overhead. 754 00:39:26,970 --> 00:39:32,510 So you can issue a memory fence explicitly as an 755 00:39:32,510 --> 00:39:33,230 instruction. 756 00:39:33,230 --> 00:39:39,180 So the mfence instruction sets a memory fence. 757 00:39:39,180 --> 00:39:43,990 There's also, it turns out, on x86 an lfence and an sfence, 758 00:39:43,990 --> 00:39:50,940 which order loads but not stores, and 759 00:39:50,940 --> 00:39:52,300 stores but not loads, respectively. 760 00:39:52,300 --> 00:39:54,860 And this one is basically both. 761 00:39:54,860 --> 00:39:57,090 From the point of view of what we're using it for, we're only 762 00:39:57,090 --> 00:40:00,070 going to worry about the fences. 763 00:40:00,070 --> 00:40:01,490 They're done by the explicit one. 764 00:40:01,490 --> 00:40:03,840 But it also turns out all the locking instructions 765 00:40:03,840 --> 00:40:07,530 automatically put a fence in. 766 00:40:07,530 --> 00:40:14,170 One of the humorous things in recent memory is that there were major 767 00:40:14,170 --> 00:40:18,930 manufacturers for whom the lock instruction was actually 768 00:40:18,930 --> 00:40:23,000 faster than doing a memory fence, which is kind of weird 769 00:40:23,000 --> 00:40:27,510 because a lock instruction does a memory fence. 770 00:40:27,510 --> 00:40:29,530 So how do you think that sort of thing comes about? 771 00:40:29,530 --> 00:40:33,160 So when you looked at performance it would be like-- 772 00:40:33,160 --> 00:40:35,020 for this particular machine I'm thinking about-- 773 00:40:35,020 --> 00:40:41,750 it was 30 cycles to do a lock instruction. 774 00:40:41,750 --> 00:40:46,435 And it was on the order of 50 cycles to do a memory fence.
775 00:40:49,610 --> 00:40:51,230 And so if you want to do a memory fence, 776 00:40:51,230 --> 00:40:53,725 what should you do? 777 00:40:53,725 --> 00:40:54,260 AUDIENCE: Do a lock. 778 00:40:54,260 --> 00:40:55,490 PROFESSOR: Do a lock instruction 779 00:40:55,490 --> 00:40:57,550 instead to get the effect. 780 00:40:57,550 --> 00:40:59,460 But why do you suppose that came up in the hardware? 781 00:40:59,460 --> 00:41:02,800 Why is it that one instruction would be-- 782 00:41:08,400 --> 00:41:14,190 It's a social reason why this sort of thing happens. 783 00:41:14,190 --> 00:41:16,510 So I don't know for sure. 784 00:41:16,510 --> 00:41:19,220 But I know enough about engineering to understand how 785 00:41:19,220 --> 00:41:20,760 these things come about. 786 00:41:20,760 --> 00:41:22,240 So here's what goes on. 787 00:41:22,240 --> 00:41:25,680 They do studies of traces of programs. 788 00:41:25,680 --> 00:41:28,690 And how often do you think lock instructions occur? 789 00:41:28,690 --> 00:41:32,200 And how often do you think fence instructions occur? 790 00:41:32,200 --> 00:41:35,640 Turns out lock instructions occur all the time, whereas 791 00:41:35,640 --> 00:41:39,040 fences, they don't occur so often because usually it's 792 00:41:39,040 --> 00:41:41,280 somebody who really knows what they're doing who's using a 793 00:41:41,280 --> 00:41:43,120 memory fence. 794 00:41:43,120 --> 00:41:46,590 So then, they say to the engineering team, we're going 795 00:41:46,590 --> 00:41:48,170 to make our code go faster. 796 00:41:48,170 --> 00:41:50,630 And lock instructions are going really fast. 797 00:41:50,630 --> 00:41:53,740 So they put a top engineer on making lock 798 00:41:53,740 --> 00:41:56,350 instructions go fast. 
799 00:41:56,350 --> 00:42:04,530 They put a second-rate engineer on making memory 800 00:42:04,530 --> 00:42:06,860 fence operations go fast because they're not used as 801 00:42:06,860 --> 00:42:11,780 often, without sort of recognizing that, gee, what 802 00:42:11,780 --> 00:42:14,800 you do for one is the same problem. 803 00:42:14,800 --> 00:42:17,510 You can do the same thing for the other. 804 00:42:17,510 --> 00:42:19,770 So it ends up you'll see things in architecture that 805 00:42:19,770 --> 00:42:22,070 are really quite humorous like that, where things are sort of 806 00:42:22,070 --> 00:42:25,750 like, wait a minute, how come this is slower when well, it 807 00:42:25,750 --> 00:42:28,770 probably has to do with the engineering team that built 808 00:42:28,770 --> 00:42:31,500 the system. 809 00:42:31,500 --> 00:42:34,390 And actually now I'm aware of two architectures where they 810 00:42:34,390 --> 00:42:38,810 did the same kind of thing by different manufacturers. 811 00:42:38,810 --> 00:42:41,990 Where they got these memory fences. 812 00:42:41,990 --> 00:42:46,390 It should be at least as fast because the one is doing-- 813 00:42:46,390 --> 00:42:48,910 anyway. 814 00:42:48,910 --> 00:42:51,220 Interesting story there. 815 00:42:51,220 --> 00:42:55,580 Now, you can actually access a memory fence using a built-in 816 00:42:55,580 --> 00:42:59,280 function called __sync_synchronize. 817 00:42:59,280 --> 00:43:00,890 And in fact, there's a whole set of atomics-- 818 00:43:00,890 --> 00:43:03,770 I've put the information here for where you can go and look 819 00:43:03,770 --> 00:43:07,480 at the atomic operations that include memory fences and so 820 00:43:07,480 --> 00:43:09,690 forth to use in the compiler. 821 00:43:09,690 --> 00:43:13,610 It turns out when I was trying to get this going last night, 822 00:43:13,610 --> 00:43:15,100 I couldn't get it to work.
823 00:43:15,100 --> 00:43:18,590 And it turns out that's because our compiler had a bug 824 00:43:18,590 --> 00:43:25,320 where this instruction was compiling to nothing. 825 00:43:25,320 --> 00:43:27,250 There's a compiler bug. 826 00:43:27,250 --> 00:43:33,460 And so I messed around for far too much time and then finally 827 00:43:33,460 --> 00:43:35,740 sent out a help message to the T.A.s. 828 00:43:35,740 --> 00:43:37,750 And then, John figured out that there was a bug. 829 00:43:37,750 --> 00:43:39,520 And he's patched all the compilers so that you 830 00:43:39,520 --> 00:43:42,800 guys all have it. 831 00:43:42,800 --> 00:43:46,590 But anyway, it was like, how come this isn't working? 832 00:43:46,590 --> 00:43:47,950 AUDIENCE: What compiler are we using? 833 00:43:47,950 --> 00:43:49,550 PROFESSOR: This was GCC. 834 00:43:49,550 --> 00:43:54,120 I was trying 4.1, and I tried 4.3. 835 00:43:54,120 --> 00:43:57,290 And so the one that we're using in class for the most 836 00:43:57,290 --> 00:43:59,010 part, is 4.3. 837 00:43:59,010 --> 00:44:00,830 So anyway, John put the patch in. 838 00:44:00,830 --> 00:44:03,370 So now, when you use these 839 00:44:03,370 --> 00:44:05,210 instructions, they're all there. 840 00:44:09,100 --> 00:44:11,580 And then, the last thing is that the typical cost of a 841 00:44:11,580 --> 00:44:15,610 memory fence operation is comparable to that of an L2 842 00:44:15,610 --> 00:44:16,860 cache access. 843 00:44:18,870 --> 00:44:24,400 So memory fences tend to be on our machine-- 844 00:44:24,400 --> 00:44:26,030 and I haven't actually measured in our machine. 845 00:44:26,030 --> 00:44:28,430 I meant to do that, and I didn't get around to it. 846 00:44:28,430 --> 00:44:31,850 It's probably on the order of 10, or 15 cycles, or 847 00:44:31,850 --> 00:44:35,680 something, which is not bad. 848 00:44:35,680 --> 00:44:37,860 If it's less than 20, it's pretty good.
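As a minimal sketch of where such a fence goes, here is the two-processor example from earlier with a full fence between each store and the following load, using the GCC built-in `__sync_synchronize` (assuming GCC or a compatible compiler). The check below runs the two routines sequentially, so it only verifies that the code behaves; the fence's payoff is visible only when the routines run concurrently on different processors.

```c
#include <assert.h>

/* The two-processor example with a full memory fence between each store
   and the following load, using the GCC built-in discussed above. */
static volatile int a = 0, b = 0;
static int eax, ebx;

static void processor0(void) {
    a = 1;
    __sync_synchronize();   /* full fence: the store to a must complete
                               before the load of b may issue */
    ebx = b;
}

static void processor1(void) {
    b = 1;
    __sync_synchronize();
    eax = a;
}
```

With the fences in place, the store buffer can no longer let the loads bypass the stores, so the (0, 0) outcome becomes impossible even under concurrency.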
849 00:44:42,130 --> 00:44:44,550 So here's Peterson's algorithm with memory fences. 850 00:44:44,550 --> 00:44:48,820 You just simply stick in the memory fence there to prevent 851 00:44:48,820 --> 00:44:49,460 the reordering. 852 00:44:49,460 --> 00:44:51,730 And it's interesting if there's a way that we can play 853 00:44:51,730 --> 00:44:54,230 the game with the instruction stream to do the same thing 854 00:44:54,230 --> 00:44:57,350 because that would make this code go, generally, a lot 855 00:44:57,350 --> 00:45:02,230 faster in terms of overhead. 856 00:45:02,230 --> 00:45:06,380 And so using memory fences, you can restore consistency. 857 00:45:06,380 --> 00:45:08,530 Now, memory fences are like data races. 858 00:45:08,530 --> 00:45:10,480 If you don't have them, how do you know that 859 00:45:10,480 --> 00:45:11,400 you don't have them? 860 00:45:11,400 --> 00:45:13,950 It's very difficult to regression test for them, 861 00:45:13,950 --> 00:45:15,250 which is one reason I think there was a 862 00:45:15,250 --> 00:45:17,400 bug in the GCC compiler. 863 00:45:17,400 --> 00:45:22,200 How do you know that some piece of code is failing? 864 00:45:22,200 --> 00:45:25,170 Because most of the time it will work correctly. 865 00:45:25,170 --> 00:45:28,640 It's just occasionally, there'll be some reordering, 866 00:45:28,640 --> 00:45:30,680 and timing, and race condition that causes 867 00:45:30,680 --> 00:45:32,190 it not to work out. 868 00:45:32,190 --> 00:45:36,270 In this case, you have to have both the race and the 869 00:45:36,270 --> 00:45:39,520 reordering happening at the same time for Peterson's 870 00:45:39,520 --> 00:45:41,230 algorithm, for example. 871 00:45:41,230 --> 00:45:44,090 So things like this can be very difficult 872 00:45:44,090 --> 00:45:45,770 for compilers. 873 00:45:45,770 --> 00:45:48,920 Really, the way to do it, which is what I was doing, was 874 00:45:48,920 --> 00:45:54,590 do an objdump and search for the fence in there.
875 00:45:54,590 --> 00:45:59,376 And in this case, it wasn't in there. 876 00:45:59,376 --> 00:46:02,610 AUDIENCE: And also the compiler analyzes it by itself. 877 00:46:02,610 --> 00:46:06,340 And it sees this instruction that basically doesn't change the code. 878 00:46:06,340 --> 00:46:07,130 PROFESSOR: Right. 879 00:46:07,130 --> 00:46:07,960 It's not doing anything. 880 00:46:07,960 --> 00:46:08,780 Right. 881 00:46:08,780 --> 00:46:10,570 So it says, oop, get out of it. 882 00:46:10,570 --> 00:46:11,400 Yep. 883 00:46:11,400 --> 00:46:12,650 Good. 884 00:46:15,220 --> 00:46:17,550 So any questions about consistency? 885 00:46:17,550 --> 00:46:20,490 So it turns out that most of the time, when you're 886 00:46:20,490 --> 00:46:23,320 designing things where you want to synchronize through 887 00:46:23,320 --> 00:46:30,610 memory directly, rather than using locks or what have you, 888 00:46:30,610 --> 00:46:32,870 the methodology that I found works pretty well is this. 889 00:46:32,870 --> 00:46:36,010 Work it out for sequential consistency, and then figure 890 00:46:36,010 --> 00:46:39,150 out where you have to put the fences in. 891 00:46:39,150 --> 00:46:42,530 And that's a pretty good methodology 892 00:46:42,530 --> 00:46:44,190 for working out where-- 893 00:46:44,190 --> 00:46:45,550 here's sequential consistency. 894 00:46:45,550 --> 00:46:48,750 Now, what reorderings do I need to prevent in order to 895 00:46:48,750 --> 00:46:51,250 make sure that it works properly? 896 00:46:51,250 --> 00:46:53,450 And that can be error prone. 897 00:46:53,450 --> 00:46:56,910 So once again, big skull and crossbones on whether you 898 00:46:56,910 --> 00:46:58,830 actually try this in practice. 899 00:46:58,830 --> 00:47:00,135 It really better make a difference.
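The fenced Peterson's algorithm being discussed looks roughly like this; a sketch under our own naming (`want`, `turn`), not the lecture's exact code, with the fence placed between the publishing stores and the spin-read that the reordering would otherwise break:

```c
#include <stdbool.h>

/* Peterson's 2-thread mutual exclusion; `me` is 0 or 1. */
static volatile bool want[2];
static volatile int turn;

void peterson_lock(int me) {
    int other = 1 - me;
    want[me] = true;
    turn = other;
    __sync_synchronize();   /* fence: the stores above must be visible
                               before we read the other thread's flag */
    while (want[other] && turn == other)
        ;                   /* spin until it's our turn */
}

void peterson_unlock(int me) {
    want[me] = false;       /* release: let the other thread in */
}
```

Without the fence, an x86 machine may hoist the load of `want[other]` above the store to `want[me]`, which is exactly the store-buffer reordering that lets both threads enter.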
900 00:47:04,100 --> 00:47:07,730 Now, the fact that you can synchronize directly through 901 00:47:07,730 --> 00:47:12,600 memory has led to a lot of protocols that are called 902 00:47:12,600 --> 00:47:19,610 lock-free protocols, which have some advantages, 903 00:47:19,610 --> 00:47:22,290 in particular because they don't use locks. 904 00:47:22,290 --> 00:47:25,870 And so I want to illustrate some of those because you'll 905 00:47:25,870 --> 00:47:26,790 see these in certain places. 906 00:47:26,790 --> 00:47:29,620 So recall the summing problem from last time. 907 00:47:29,620 --> 00:47:33,710 So here we have an array. 908 00:47:33,710 --> 00:47:36,950 And what we're going to do is run through all the elements 909 00:47:36,950 --> 00:47:39,310 in the array, computing something on every element, 910 00:47:39,310 --> 00:47:41,050 and adding into result. 911 00:47:41,050 --> 00:47:43,100 And we wanted to parallelize that. 912 00:47:43,100 --> 00:47:45,750 So we parallelize that with a cilk_for. 913 00:47:45,750 --> 00:47:49,450 And what was the problem when we parallelized this? 914 00:47:49,450 --> 00:47:51,770 We get a race. 915 00:47:51,770 --> 00:47:52,610 So there's the race. 916 00:47:52,610 --> 00:47:58,110 We get a race on result because we've got two parallel 917 00:47:58,110 --> 00:47:59,840 instructions both trying to update 918 00:47:59,840 --> 00:48:02,810 result at the same time. 919 00:48:02,810 --> 00:48:06,230 So we can solve that with a lock. 920 00:48:06,230 --> 00:48:07,830 And I showed you last time that we could 921 00:48:07,830 --> 00:48:10,870 solve this with a lock, 922 00:48:10,870 --> 00:48:16,690 by declaring a mutex, and then locking before we update the 923 00:48:16,690 --> 00:48:18,250 result, and then unlocking. 924 00:48:18,250 --> 00:48:20,650 And of course, as we argued yesterday, that could cause 925 00:48:20,650 --> 00:48:22,170 severe contention. 926 00:48:22,170 --> 00:48:24,430 Now, contention can be an issue.
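The locked version of the summing loop can be sketched as follows; this is a sketch using a pthreads mutex as a stand-in for the Cilk mutex in the lecture, and `compute` is our placeholder for the per-element work:

```c
#include <pthread.h>

static pthread_mutex_t result_mutex = PTHREAD_MUTEX_INITIALIZER;
static int result = 0;

static int compute(int x) { return x * x; }   /* placeholder work */

/* One iteration of the summing loop: the long compute happens
   outside the lock; the lock protects only the short update. */
void add_element(int x) {
    int temp = compute(x);
    pthread_mutex_lock(&result_mutex);
    result += temp;                /* the racy update, now serialized */
    pthread_mutex_unlock(&result_mutex);
}
```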
927 00:48:24,430 --> 00:48:27,740 But notice the compute here, which I've moved 928 00:48:27,740 --> 00:48:29,990 outside the lock. 929 00:48:29,990 --> 00:48:33,400 I've put it into temp and then added temp in so I can lock 930 00:48:33,400 --> 00:48:34,860 for the shortest possible time. 931 00:48:34,860 --> 00:48:38,250 If this compute is sufficiently large, there may 932 00:48:38,250 --> 00:48:38,890 be contention. 933 00:48:38,890 --> 00:48:41,340 But it may not be significant contention in your 934 00:48:41,340 --> 00:48:44,860 execution because the update here could be very, very short 935 00:48:44,860 --> 00:48:47,330 compared with the time it takes to compute. 936 00:48:47,330 --> 00:48:51,290 So for example, if computing on array i costs you more than, 937 00:48:51,290 --> 00:48:59,000 say, order n time, then the fact that you have contention 938 00:48:59,000 --> 00:49:02,530 there isn't going to matter, generally, because the total 939 00:49:02,530 --> 00:49:04,710 amount of time that you're going to be locking is just 940 00:49:04,710 --> 00:49:06,800 small compared to the total execution time. 941 00:49:09,370 --> 00:49:13,370 Still, in a multiprogrammed setting, there may be other 942 00:49:13,370 --> 00:49:16,470 problems that you can get into, even when you have this 943 00:49:16,470 --> 00:49:19,510 and even if you think that contention 944 00:49:19,510 --> 00:49:20,760 is going to be minimal. 945 00:49:23,810 --> 00:49:26,160 So can anybody think of what the issues might be? 946 00:49:26,160 --> 00:49:29,180 Why could this be problematic even if contention 947 00:49:29,180 --> 00:49:30,430 is not a big issue? 948 00:49:35,170 --> 00:49:37,780 And the hint here is it's in a multiprogrammed setting. 949 00:49:50,720 --> 00:49:52,470 So what happens in a multiprogrammed setting? 950 00:49:58,770 --> 00:49:59,445 Yeah.
951 00:49:59,445 --> 00:50:01,225 AUDIENCE: [INAUDIBLE] 952 00:50:01,225 --> 00:50:02,810 PROFESSOR: Because the result is-- 953 00:50:02,810 --> 00:50:04,060 AUDIENCE: [INAUDIBLE PHRASE] 954 00:50:07,380 --> 00:50:09,960 PROFESSOR: It actually doesn't have to do with result here. 955 00:50:09,960 --> 00:50:13,960 It has to do with locking explicitly. 956 00:50:13,960 --> 00:50:16,800 It's a problem with locking in a multiprogrammed environment. 957 00:50:16,800 --> 00:50:18,890 What happens in a multiprogrammed environment? 958 00:50:18,890 --> 00:50:21,143 What do I mean by multiprogrammed environment? 959 00:50:21,143 --> 00:50:22,090 AUDIENCE: [INAUDIBLE] 960 00:50:22,090 --> 00:50:23,640 PROFESSOR: You have multiple jobs running, right? 961 00:50:23,640 --> 00:50:25,870 And what happens to the processor when there are 962 00:50:25,870 --> 00:50:27,762 multiple jobs running? 963 00:50:27,762 --> 00:50:29,470 AUDIENCE: [INAUDIBLE] 964 00:50:29,470 --> 00:50:30,720 PROFESSOR: Context switches. 965 00:50:34,530 --> 00:50:36,430 So now, what can go wrong here? 966 00:50:36,430 --> 00:50:38,280 What can be really bad here? 967 00:50:38,280 --> 00:50:38,530 Yeah. 968 00:50:38,530 --> 00:50:40,021 AUDIENCE: You acquire the lock and then the 969 00:50:40,021 --> 00:50:40,520 context switch happens. 970 00:50:40,520 --> 00:50:41,560 PROFESSOR: Yeah. 971 00:50:41,560 --> 00:50:43,140 You acquire the lock. 972 00:50:43,140 --> 00:50:46,190 And then, the operating system context switches you out. 973 00:50:46,190 --> 00:50:46,890 And so what happens? 974 00:50:46,890 --> 00:50:50,570 You hold the lock while some other job is running. 975 00:50:50,570 --> 00:50:52,090 And what are those guys doing? 976 00:50:52,090 --> 00:50:55,460 They go and spin and wait on the lock. 977 00:50:55,460 --> 00:50:58,380 Now, this is a good time where you'd rather not have a 978 00:50:58,380 --> 00:50:59,130 spinning lock. 979 00:50:59,130 --> 00:51:03,010 You'd rather have a yielding lock.
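A spin-then-yield lock of the kind being contrasted here might look like this; a sketch only, where `SPIN_TRIES` is an arbitrary tuning guess and the names are ours:

```c
#include <sched.h>

#define SPIN_TRIES 100

static volatile int lock_word = 0;   /* 0 = free, 1 = held */

/* Spin briefly in case the holder is running on another core,
   then yield to the OS so a swapped-out holder can be rescheduled. */
void spin_yield_lock(void) {
    for (;;) {
        for (int i = 0; i < SPIN_TRIES; i++)
            if (__sync_lock_test_and_set(&lock_word, 1) == 0)
                return;          /* grabbed the lock while spinning */
        sched_yield();           /* holder may be descheduled; give up
                                    our time slice instead of burning it */
    }
}

void spin_yield_unlock(void) {
    __sync_lock_release(&lock_word);   /* store 0, releasing the lock */
}
```

This is the "competitive" idea mentioned a bit later: spinning bounds the cost when the holder is running, yielding bounds it when the holder has been context switched out.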
980 00:51:03,010 --> 00:51:06,590 But even so, suddenly you're talking about something that's 981 00:51:06,590 --> 00:51:10,650 operating at the level of 100 times a second, 10 982 00:51:10,650 --> 00:51:13,970 milliseconds, versus something that is operating on a 983 00:51:13,970 --> 00:51:15,940 nanosecond level. 984 00:51:15,940 --> 00:51:19,460 So you're talking six orders of magnitude of performance 985 00:51:19,460 --> 00:51:25,020 difference if you end up getting switched out while you 986 00:51:25,020 --> 00:51:26,270 hold a lock. 987 00:51:30,310 --> 00:51:30,900 That's the issue. 988 00:51:30,900 --> 00:51:31,880 What happens. 989 00:51:31,880 --> 00:51:34,550 And then, if that happens, all the other loop 990 00:51:34,550 --> 00:51:37,225 iterations must wait. 991 00:51:37,225 --> 00:51:38,710 AUDIENCE: [INAUDIBLE] 992 00:51:38,710 --> 00:51:41,185 in the large program here [UNINTELLIGIBLE PHRASE]. 993 00:51:43,887 --> 00:51:45,110 I don't have a mic. 994 00:51:45,110 --> 00:51:46,862 If one [UNINTELLIGIBLE] crashes or one 995 00:51:46,862 --> 00:51:48,112 [UNINTELLIGIBLE PHRASE] 996 00:51:52,318 --> 00:51:53,568 the lock [UNINTELLIGIBLE PHRASE]. 997 00:52:01,742 --> 00:52:04,926 AUDIENCE: Can you specify whether those are yielding 998 00:52:04,926 --> 00:52:05,890 locks or spinning locks? 999 00:52:05,890 --> 00:52:08,370 PROFESSOR: Usually, the mutex type will tell you. 1000 00:52:08,370 --> 00:52:10,850 I'm just using a simple name of mutex. 1001 00:52:10,850 --> 00:52:15,140 I probably should have been using the ones that-- 1002 00:52:15,140 --> 00:52:17,090 we were using one called Cilk mutex. 1003 00:52:17,090 --> 00:52:20,000 And I probably should've used that here rather than just 1004 00:52:20,000 --> 00:52:22,770 simple mutex. 1005 00:52:22,770 --> 00:52:25,560 AUDIENCE: Are they yielding? 1006 00:52:25,560 --> 00:52:26,620 PROFESSOR: There's a good question. 1007 00:52:26,620 --> 00:52:30,330 I used to know the answer to this. 
1008 00:52:30,330 --> 00:52:34,390 I believe that those spin for a while-- they're competitive. 1009 00:52:34,390 --> 00:52:35,890 They spin for a while and then yield. 1010 00:52:35,890 --> 00:52:36,610 But I'm not sure. 1011 00:52:36,610 --> 00:52:39,260 They may just spin. 1012 00:52:39,260 --> 00:52:42,720 They don't just automatically yield. 1013 00:52:42,720 --> 00:52:45,520 They're either competitive, or they'll spin and yield. 1014 00:52:45,520 --> 00:52:47,170 I believe they spin and yield. 1015 00:52:47,170 --> 00:52:48,690 And I believe there's actually a switch where 1016 00:52:48,690 --> 00:52:49,720 you can tell it-- 1017 00:52:49,720 --> 00:52:50,970 if you're doing timing measurements-- 1018 00:52:53,400 --> 00:52:55,140 make it so that it purely spins so that you can get 1019 00:52:55,140 --> 00:52:57,757 better benchmark results. 1020 00:52:57,757 --> 00:53:06,370 AUDIENCE: So my question is does the kernel have the power to 1021 00:53:06,370 --> 00:53:08,660 switch out a spinning lock or not? 1022 00:53:08,660 --> 00:53:09,270 PROFESSOR: Yeah. 1023 00:53:09,270 --> 00:53:11,910 Well, the kernel, the scheduler, can come in at any 1024 00:53:11,910 --> 00:53:13,430 moment and say, whoop, you're out. 1025 00:53:16,160 --> 00:53:16,790 You're out. 1026 00:53:16,790 --> 00:53:18,850 That's it. 1027 00:53:18,850 --> 00:53:21,640 And wherever it is, it interrupts it at 1028 00:53:21,640 --> 00:53:23,280 that moment in time. 1029 00:53:26,670 --> 00:53:31,450 So one solution to this problem is to 1030 00:53:31,450 --> 00:53:33,110 use a lock-free method. 1031 00:53:33,110 --> 00:53:35,430 And one of the common ways of doing that is with what's 1032 00:53:35,430 --> 00:53:38,730 called a compare-and-swap instruction. 1033 00:53:38,730 --> 00:53:41,130 So this is what's called a locking instruction, meaning 1034 00:53:41,130 --> 00:53:45,150 it's one of these ones that goes out to L2, in terms of 1035 00:53:45,150 --> 00:53:46,590 timing and so forth.
1036 00:53:46,590 --> 00:53:50,950 And what it does is the following thing. 1037 00:53:50,950 --> 00:53:54,530 It has an address of a location. 1038 00:53:54,530 --> 00:53:58,660 And it's got the old value that was stored in the 1039 00:53:58,660 --> 00:54:00,520 location and a new value. 1040 00:54:00,520 --> 00:54:04,810 And it says if the value that is there is the old value, 1041 00:54:04,810 --> 00:54:07,420 well, then stick the new value in there. 1042 00:54:07,420 --> 00:54:10,700 And then return essentially true. 1043 00:54:10,700 --> 00:54:14,250 Otherwise return false. 1044 00:54:14,250 --> 00:54:23,020 So it's basically saying: what you tend to do is you first 1045 00:54:23,020 --> 00:54:25,530 look to see what's the value. 1046 00:54:25,530 --> 00:54:27,260 You then update the value. 1047 00:54:27,260 --> 00:54:32,310 And then you say, if it hasn't changed, stick it back in and 1048 00:54:32,310 --> 00:54:33,160 return true. 1049 00:54:33,160 --> 00:54:37,470 If it has changed, return false. 1050 00:54:37,470 --> 00:54:42,110 So it only swaps the value if the comparison is true. 1051 00:54:42,110 --> 00:54:43,900 There's actually two versions. 1052 00:54:43,900 --> 00:54:47,060 One which says bool and one which says val. 1053 00:54:47,060 --> 00:54:49,765 And if you do the bool version, it returns a flag. 1054 00:54:49,765 --> 00:54:52,110 If you do the val version, it actually returns the value 1055 00:54:52,110 --> 00:54:53,210 that was in there. 1056 00:54:53,210 --> 00:54:55,620 So it's more like a compare-and-exchange. 1057 00:54:55,620 --> 00:54:58,650 The main thing about this is this code essentially executes 1058 00:54:58,650 --> 00:55:02,885 atomically with a single instruction, which is called-- 1059 00:55:07,990 --> 00:55:10,530 The instruction is cmpxchg. 1060 00:55:14,760 --> 00:55:15,420 Is it up there? 1061 00:55:15,420 --> 00:55:17,720 Oh, there it is. 1062 00:55:17,720 --> 00:55:17,970 Yeah.
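The bool and val flavors described here map directly onto the GCC builtins that compile down to cmpxchg; a minimal sketch, with the wrapper names ours:

```c
#include <stdbool.h>

/* Bool flavor: returns a flag saying whether the swap happened. */
bool cas_bool(long *addr, long old, long new_val) {
    return __sync_bool_compare_and_swap(addr, old, new_val);
}

/* Val flavor: returns whatever was actually in the location,
   whether or not the swap happened. */
long cas_val(long *addr, long old, long new_val) {
    return __sync_val_compare_and_swap(addr, old, new_val);
}
```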
1063 00:55:17,970 --> 00:55:21,190 So the cmpxchg instruction on x86. 1064 00:55:21,190 --> 00:55:26,030 So when you compile this, you should find on your assembly 1065 00:55:26,030 --> 00:55:30,890 output that instruction somewhere unless the compiler 1066 00:55:30,890 --> 00:55:33,310 figures out a better way to optimize that. 1067 00:55:33,310 --> 00:55:34,560 But generally, you should find that. 1068 00:55:37,330 --> 00:55:40,400 Also, one of the things about this is it works on values 1069 00:55:40,400 --> 00:55:42,970 that are integer-type values. 1070 00:55:42,970 --> 00:55:46,550 But it doesn't work on floating point numbers, in 1071 00:55:46,550 --> 00:55:48,080 particular. 1072 00:55:48,080 --> 00:55:51,410 So you can't compare-and-swap a value which is a floating 1073 00:55:51,410 --> 00:55:51,940 point value. 1074 00:55:51,940 --> 00:55:54,060 You can only do it with integer-type values. 1075 00:55:54,060 --> 00:55:57,370 So let's take a look at how we can use the compare-and-swap 1076 00:55:57,370 --> 00:56:00,120 for the summing problem. 1077 00:56:00,120 --> 00:56:04,090 So what we do is we have the same sort of code. 1078 00:56:04,090 --> 00:56:08,470 And now, what I'm going to do is compute my temporary value. 1079 00:56:08,470 --> 00:56:10,740 And then, what I'll do is I'll read the value 1080 00:56:10,740 --> 00:56:13,020 of result into old. 1081 00:56:13,020 --> 00:56:17,430 I'll then compute my new value for what I think I want the 1082 00:56:17,430 --> 00:56:22,240 result to be: the result plus the thing that I computed. 1083 00:56:22,240 --> 00:56:30,660 And now, what I do is I attempt the compare-and-swap: as 1084 00:56:30,660 --> 00:56:35,280 long as the old value is what I read it to be, 1085 00:56:35,280 --> 00:56:38,560 swap in the new value.
1086 00:56:38,560 --> 00:56:42,590 If the old value turns out to be different from what is 1087 00:56:42,590 --> 00:56:44,940 currently in the result location, 1088 00:56:44,940 --> 00:56:46,470 then it returns false. 1089 00:56:46,470 --> 00:56:47,720 And I redo this again. 1090 00:56:52,110 --> 00:56:55,020 Then, I have to redo the whole loop again. 1091 00:56:55,020 --> 00:56:56,560 So this is a do-while loop. 1092 00:56:56,560 --> 00:56:59,690 Do-while is like a while loop, except you do the body first. 1093 00:56:59,690 --> 00:57:01,860 And then you test the condition. 1094 00:57:01,860 --> 00:57:03,970 So if this fails, I go back. 1095 00:57:03,970 --> 00:57:06,850 I then get a new value for the result and so forth. 1096 00:57:06,850 --> 00:57:08,140 So let me show you how that works. 1097 00:57:13,410 --> 00:57:15,190 Let's see. 1098 00:57:15,190 --> 00:57:17,160 So first, I'll show you how this works. 1099 00:57:17,160 --> 00:57:19,850 Actually, I'll show how this works on a more 1100 00:57:19,850 --> 00:57:21,930 interesting example. 1101 00:57:21,930 --> 00:57:24,850 So what happens if I get swapped out in the middle of a 1102 00:57:24,850 --> 00:57:26,360 loop iteration? 1103 00:57:26,360 --> 00:57:31,100 All that happens is when I do the compare-and-swap, it fails. 1104 00:57:31,100 --> 00:57:32,850 So no other iterations have to wait. 1105 00:57:32,850 --> 00:57:37,120 They can all march ahead and do the thing they need to do. 1106 00:57:37,120 --> 00:57:39,520 And then, the one that got swapped out, eh. 1107 00:57:39,520 --> 00:57:41,870 It gets some old value. 1108 00:57:41,870 --> 00:57:47,230 It discovers that and has to re-execute the loop. 1109 00:57:47,230 --> 00:57:48,270 So is that fine? 1110 00:57:48,270 --> 00:57:51,200 So what this means is that the amount of work that's going on, 1111 00:57:51,200 --> 00:57:56,420 however, could in fact, be greater, depending upon how 1112 00:57:56,420 --> 00:57:57,450 much contention there is.
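The do-while CAS loop just described can be sketched like this; the names are ours and `compute_lf` is a placeholder for the per-element work:

```c
static long result_lf = 0;

static long compute_lf(long x) { return x + 1; }   /* placeholder work */

/* Lock-free summing update: snapshot result, compute the sum we'd
   like to commit, and CAS it in; if another worker changed result
   in between, the CAS fails and we redo the read-modify-write. */
void add_element_lockfree(long x) {
    long temp = compute_lf(x);
    long old, new_sum;
    do {
        old = result_lf;          /* read the current total        */
        new_sum = old + temp;     /* the update we hope to commit  */
    } while (!__sync_bool_compare_and_swap(&result_lf, old, new_sum));
}
```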
1113 00:57:57,450 --> 00:57:59,120 If there's a lot of contention, you could end up 1114 00:57:59,120 --> 00:58:06,400 having these guys fighting and 1115 00:58:06,400 --> 00:58:08,680 re-executing a lot of code. 1116 00:58:08,680 --> 00:58:11,330 But that's really not much worse than them spinning, is 1117 00:58:11,330 --> 00:58:14,590 what it comes down to. 1118 00:58:14,590 --> 00:58:16,470 Any questions? 1119 00:58:16,470 --> 00:58:20,170 Let's do a more interesting example. 1120 00:58:20,170 --> 00:58:23,160 Here's a lock-free stack. 1121 00:58:23,160 --> 00:58:25,550 So what we're going to do is we're going to have a node, 1122 00:58:25,550 --> 00:58:27,760 which has a next pointer and some data. 1123 00:58:27,760 --> 00:58:30,170 All we really care about is the next pointer. 1124 00:58:30,170 --> 00:58:35,340 And we have a stack, which has basically a head pointer. 1125 00:58:38,150 --> 00:58:39,790 So we have a linked list here. 1126 00:58:39,790 --> 00:58:42,010 We want to basically be able to insert things at the front 1127 00:58:42,010 --> 00:58:45,440 and take things out of the front. 1128 00:58:45,440 --> 00:58:47,910 So here's a lock-free push. 1129 00:58:47,910 --> 00:58:49,710 So remember, this could be concurrent. 1130 00:58:49,710 --> 00:58:52,060 So these guys both want to operate on it at the same time. 1131 00:58:52,060 --> 00:58:56,030 We saw last time how in doing very simple updates on linked 1132 00:58:56,030 --> 00:59:00,260 structures, you could get yourself into a mess if you 1133 00:59:00,260 --> 00:59:03,890 didn't properly synchronize when we did the insertion in 1134 00:59:03,890 --> 00:59:05,710 the hash table. 1135 00:59:05,710 --> 00:59:09,370 So here's my push code. 1136 00:59:09,370 --> 00:59:10,430 Well, let's walk through it. 1137 00:59:10,430 --> 00:59:15,580 It says, basically, here's my node that I want to insert. 1138 00:59:15,580 --> 00:59:20,350 It says, first of all, make node.next point to the head.
1139 00:59:20,350 --> 00:59:22,140 So we basically have it pointing to 77. 1140 00:59:25,220 --> 00:59:27,840 So then what we say is OK. 1141 00:59:27,840 --> 00:59:34,610 Let's compare-and-swap to make the head point to the node, but 1142 00:59:34,610 --> 00:59:39,670 only if the value of the head has not changed. 1143 00:59:39,670 --> 00:59:41,730 It's still the value of node.next. 1144 00:59:45,100 --> 00:59:47,900 And if so, it does the swap. 1145 00:59:47,900 --> 00:59:49,585 Question? 1146 00:59:49,585 --> 00:59:51,232 AUDIENCE: You say compare-and-swap. 1147 00:59:51,232 --> 00:59:54,850 But you compare it to what? 1148 00:59:54,850 --> 00:59:57,270 PROFESSOR: In this case it's comparing to the-- 1149 00:59:57,270 --> 01:00:00,940 so this is basically the location that you're doing the 1150 01:00:00,940 --> 01:00:05,690 compare-and-swap on, the old value that you expect to see 1151 01:00:05,690 --> 01:00:08,400 in that location, and the new value. 1152 01:00:08,400 --> 01:00:10,730 So here, what it says-- 1153 01:00:10,730 --> 01:00:12,250 when we're at this point here-- 1154 01:00:12,250 --> 01:00:14,225 before we do the compare-and-swap, we're 1155 01:00:14,225 --> 01:00:23,770 saying, I only want you to set that pointer to go to here if 1156 01:00:23,770 --> 01:00:27,410 this value is still pointing to there. 1157 01:00:27,410 --> 01:00:30,800 So only move this here if this value is still 77. 1158 01:00:30,800 --> 01:00:32,420 In other words, if somebody else came in-- 1159 01:00:32,420 --> 01:00:35,670 well, I'll do an example in a second that shows what happens 1160 01:00:35,670 --> 01:00:39,940 when we have concurrency, and one of them might fail. 1161 01:00:39,940 --> 01:00:42,030 But if it is true, then it basically sets it. 1162 01:00:42,030 --> 01:00:43,920 And now I'm home free. 1163 01:00:46,790 --> 01:00:48,810 So let's take a look at what happens when we have 1164 01:00:48,810 --> 01:00:49,195 contention.
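The push being walked through can be sketched as follows; a sketch under our own type and variable names, using the GCC CAS builtin:

```c
#include <stddef.h>

/* One node of the lock-free stack: push only cares about `next`. */
typedef struct node {
    struct node *next;
    int data;
} node_t;

static node_t *head = NULL;

/* Point the new node at the current head, then CAS the head over to
   it.  If some other push or pop moved the head in between, the CAS
   fails and we retry against the new head. */
void push(node_t *n) {
    do {
        n->next = head;
    } while (!__sync_bool_compare_and_swap(&head, n->next, n));
}
```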
1165 01:00:49,195 --> 01:00:51,220 So I have two guys. 1166 01:00:51,220 --> 01:00:54,170 So 33 says, OK, I'll come in. 1167 01:00:54,170 --> 01:00:56,600 Let me set my next pointer to the head. 1168 01:00:56,600 --> 01:00:58,030 But then comes 81. 1169 01:00:58,030 --> 01:00:59,450 And it says, OK. 1170 01:00:59,450 --> 01:01:04,700 Let me try to set my pointer to also be 77 because I look 1171 01:01:04,700 --> 01:01:05,700 at what the head is, and that's where 1172 01:01:05,700 --> 01:01:08,600 it's supposed to go. 1173 01:01:08,600 --> 01:01:10,160 So now, what happens is we do the 1174 01:01:10,160 --> 01:01:12,950 compare-and-swap operation. 1175 01:01:12,950 --> 01:01:14,560 And they both are going to try to do it. 1176 01:01:14,560 --> 01:01:17,330 And one of them is going to, essentially, do it first 1177 01:01:17,330 --> 01:01:20,530 because the hardware guarantees that the compare-and-swaps-- 1178 01:01:20,530 --> 01:01:22,800 they're locking operations-- will happen in 1179 01:01:22,800 --> 01:01:25,220 some definite order. 1180 01:01:25,220 --> 01:01:28,920 So in this case, 81 got in there and did its 1181 01:01:28,920 --> 01:01:29,990 compare-and-swap first. 1182 01:01:29,990 --> 01:01:33,040 When it looked, 77 was still the value that it expected. 1183 01:01:33,040 --> 01:01:34,950 So it allowed that pointer to be changed. 1184 01:01:34,950 --> 01:01:38,310 But now what happens when 33 tries? 1185 01:01:38,310 --> 01:01:40,850 33 tries to do the compare-and-swap. 1186 01:01:40,850 --> 01:01:44,320 And the compare-and-swap fails because it's saying, I want to 1187 01:01:44,320 --> 01:01:50,400 swap 33 in as long as the value of head is the pointer 1188 01:01:50,400 --> 01:01:53,880 to the node with 77. 1189 01:01:53,880 --> 01:01:56,500 The value is no longer the pointer to the node with 77. 1190 01:01:56,500 --> 01:02:01,430 It's now the pointer to the node with 81. 1191 01:02:01,430 --> 01:02:04,650 So the compare-and-swap fails.
1192 01:02:04,650 --> 01:02:06,200 People follow that? 1193 01:02:06,200 --> 01:02:09,420 And so what does 33 have to do? 1194 01:02:09,420 --> 01:02:10,910 It's got to start again. 1195 01:02:10,910 --> 01:02:13,400 So it goes back around the loop, and now it sets it to 1196 01:02:13,400 --> 01:02:15,240 81, which is now the head. 1197 01:02:15,240 --> 01:02:18,410 And now, it can compare-and-swap in the value. 1198 01:02:18,410 --> 01:02:22,940 And they both get in there perfectly well. 1199 01:02:22,940 --> 01:02:23,470 Question? 1200 01:02:23,470 --> 01:02:24,916 AUDIENCE: What if there's [INAUDIBLE]? 1201 01:02:24,916 --> 01:02:26,850 What if two nodes have-- 1202 01:02:26,850 --> 01:02:31,470 PROFESSOR: Well, notice here, it's not looking at the value 1203 01:02:31,470 --> 01:02:32,230 of the data. 1204 01:02:32,230 --> 01:02:33,500 Nowhere does data appear here. 1205 01:02:33,500 --> 01:02:35,050 It's actually looking at the address of 1206 01:02:35,050 --> 01:02:37,410 this chunk of memory. 1207 01:02:37,410 --> 01:02:39,450 There is a similar problem, which I will 1208 01:02:39,450 --> 01:02:41,000 raise in just a moment. 1209 01:02:41,000 --> 01:02:42,620 There is still a problem. 1210 01:02:42,620 --> 01:02:43,720 Yeah, question? 1211 01:02:43,720 --> 01:02:45,220 AUDIENCE: So I'm confused about the interface. 1212 01:02:45,220 --> 01:02:49,300 So you give it the address of where you want to 1213 01:02:49,300 --> 01:02:50,840 compare the value of. 1214 01:02:50,840 --> 01:02:54,760 And you're giving it what you're pointing at and-- 1215 01:02:54,760 --> 01:02:57,060 PROFESSOR: And here's the value that I expect to be 1216 01:02:57,060 --> 01:02:59,160 stored in this location. 1217 01:02:59,160 --> 01:03:03,870 The value I expect to be in there is node dot next. 1218 01:03:03,870 --> 01:03:06,500 So if I go back a couple things. 1219 01:03:06,500 --> 01:03:07,750 Here. 
1220 01:03:10,960 --> 01:03:13,490 Here, the guy says, the value I expect to be 1221 01:03:13,490 --> 01:03:17,470 there is, in this case, 1222 01:03:17,470 --> 01:03:20,960 the address of this chunk of memory here. 1223 01:03:20,960 --> 01:03:26,170 He expects the address of the node containing 77 is going to 1224 01:03:26,170 --> 01:03:28,000 be in this location. 1225 01:03:28,000 --> 01:03:28,580 It's not. 1226 01:03:28,580 --> 01:03:30,790 What's in this location is the address of 1227 01:03:30,790 --> 01:03:33,490 this chunk of memory. 1228 01:03:33,490 --> 01:03:36,880 But you're saying, if it's equal to this, then you can go 1229 01:03:36,880 --> 01:03:37,720 ahead and do the swap. 1230 01:03:37,720 --> 01:03:39,530 Otherwise you're going to fail. 1231 01:03:39,530 --> 01:03:43,970 And the swap consists of now sticking this value into-- 1232 01:03:43,970 --> 01:03:45,560 conditionally sticking it in there. 1233 01:03:45,560 --> 01:03:47,342 So you either do it or you don't do it. 1234 01:03:51,810 --> 01:03:54,790 So let's now do a pop. 1235 01:03:54,790 --> 01:03:59,030 So pop you can also do with compare-and-swap. 1236 01:03:59,030 --> 01:04:03,510 So here, I'm going to want to extract an element. 1237 01:04:03,510 --> 01:04:06,640 And what I'm going to do is create a current value that I 1238 01:04:06,640 --> 01:04:11,800 want to make point to the element that gets eliminated. 1239 01:04:11,800 --> 01:04:14,850 So what I do is I say, well, the element that I want is 1240 01:04:14,850 --> 01:04:17,640 that guy there. 1241 01:04:17,640 --> 01:04:21,560 And now, what I want to do is make the head jump around and 1242 01:04:21,560 --> 01:04:24,500 point to 94. 1243 01:04:24,500 --> 01:04:28,300 So what I do is I say, well, as long as the-- 1244 01:04:33,110 --> 01:04:35,990 and I want to do that unless I get down to the fact that I 1245 01:04:35,990 --> 01:04:41,110 have an empty list.
1246 01:04:41,110 --> 01:04:55,750 So basically, I say, if the head still has 1247 01:04:55,750 --> 01:04:57,180 the value of current-- 1248 01:04:57,180 --> 01:04:59,580 so they're pointing to the same place-- 1249 01:04:59,580 --> 01:05:04,680 then, I want to move in current->next. 1250 01:05:04,680 --> 01:05:07,470 And then I'm done. 1251 01:05:07,470 --> 01:05:12,940 Otherwise, I want to set current to head, reset it, and 1252 01:05:12,940 --> 01:05:15,550 go back to the beginning and try to pop again. 1253 01:05:15,550 --> 01:05:18,770 And I'm going to keep doing that until I get my pop to 1254 01:05:18,770 --> 01:05:23,240 succeed or until current points to nil. 1255 01:05:23,240 --> 01:05:25,390 If it ended up at the end, then I don't want to keep 1256 01:05:25,390 --> 01:05:31,660 popping if the list ended up being empty. 1257 01:05:31,660 --> 01:05:36,980 So basically, it sets that one to jump over. 1258 01:05:36,980 --> 01:05:40,450 And now, once it's done that, I can go, and I can clean up, 1259 01:05:40,450 --> 01:05:41,900 I can get rid of this pointer, et cetera. 1260 01:05:41,900 --> 01:05:44,660 But nobody else who's coming in to use this linked list can 1261 01:05:44,660 --> 01:05:48,960 see 15 now because I'm the only one with a pointer to it. 1262 01:05:48,960 --> 01:05:50,340 So people understand that? 1263 01:05:53,930 --> 01:05:56,965 So where's the bug? 1264 01:06:00,200 --> 01:06:04,990 Turns out this has a bug after all that work. 1265 01:06:04,990 --> 01:06:09,330 Each of these individually does what it's supposed to do. 1266 01:06:09,330 --> 01:06:10,280 But here's the bug. 1267 01:06:10,280 --> 01:06:12,420 And it's a famous problem because you see it all the 1268 01:06:12,420 --> 01:06:16,050 time when people are synchronizing through memory 1269 01:06:16,050 --> 01:06:18,210 with lock-free algorithms. 1270 01:06:18,210 --> 01:06:22,150 It's called the ABA problem. 1271 01:06:22,150 --> 01:06:23,040 So here's the problem.
1272 01:06:23,040 --> 01:06:24,570 And it's similar to what some people were 1273 01:06:24,570 --> 01:06:25,780 concerned about earlier. 1274 01:06:25,780 --> 01:06:28,020 So here's the ABA problem. 1275 01:06:28,020 --> 01:06:31,830 Thread 1 begins to pop 15. 1276 01:06:31,830 --> 01:06:38,000 So imagine that what it does is it sets its current there, 1277 01:06:38,000 --> 01:06:43,333 and then it reads the value here, and starts to set the 1278 01:06:43,333 --> 01:06:46,640 head here using the compare-and-swap. 1279 01:06:46,640 --> 01:06:49,040 But it doesn't complete the compare-and-swap yet. 1280 01:06:49,040 --> 01:06:50,810 The compare-and-swap hasn't executed. 1281 01:06:50,810 --> 01:06:53,110 It's simply gotten this value, and it's 1282 01:06:53,110 --> 01:06:55,560 about to swap it here. 1283 01:06:55,560 --> 01:06:57,110 So then, thread 2 comes along. 1284 01:06:57,110 --> 01:07:00,900 And it says, oh, I want to pop something as well. 1285 01:07:00,900 --> 01:07:02,210 So it comes in. 1286 01:07:02,210 --> 01:07:05,790 And it turns out it's faster, and manages to pop 15 off, and 1287 01:07:05,790 --> 01:07:08,680 set up its pointers. 1288 01:07:08,680 --> 01:07:10,690 Now, what would normally happen here is if this 1289 01:07:10,690 --> 01:07:13,520 completed, what would happen? 1290 01:07:13,520 --> 01:07:16,090 The compare-and-swap instruction would discover 1291 01:07:16,090 --> 01:07:20,630 that this pointer is no longer the pointer to the head. 1292 01:07:20,630 --> 01:07:21,630 And so it would fail. 1293 01:07:21,630 --> 01:07:23,000 We'd be all hunky dory. 1294 01:07:23,000 --> 01:07:24,160 No problem. 1295 01:07:24,160 --> 01:07:26,760 But what could actually happen here? 1296 01:07:26,760 --> 01:07:28,290 Thread 2 keeps going on. 1297 01:07:28,290 --> 01:07:31,900 It says, oh, let me pop 94. 1298 01:07:31,900 --> 01:07:34,630 So it does the same thing. 
1299 01:07:34,630 --> 01:07:38,240 So thread 1 is still stalled here, not having completed its 1300 01:07:38,240 --> 01:07:40,770 compare-and-swap. 1301 01:07:40,770 --> 01:07:41,820 It swaps 94. 1302 01:07:41,820 --> 01:07:43,630 Then, thread 2 goes on and says, oh, 1303 01:07:43,630 --> 01:07:46,410 let's put 15 back on. 1304 01:07:46,410 --> 01:07:50,800 So it puts 15 back on because after all, it had 15. 1305 01:07:50,800 --> 01:07:54,570 So now, what happens here? 1306 01:07:54,570 --> 01:07:59,520 Thread 1 now looks, and it now completes, and does its 1307 01:07:59,520 --> 01:08:09,510 compare-and-swap, it resumes, splicing out 15, which it 1308 01:08:09,510 --> 01:08:10,180 thinks it has. 1309 01:08:10,180 --> 01:08:14,300 But it doesn't realize that other stuff has gone on. 1310 01:08:14,300 --> 01:08:17,300 And now, we've got a mess. 1311 01:08:22,380 --> 01:08:25,750 So this is the ABA problem because what happened was we 1312 01:08:25,750 --> 01:08:28,840 were checking to see whether the value was still the same 1313 01:08:28,840 --> 01:08:31,260 value, the same chunk of memory. 1314 01:08:31,260 --> 01:08:32,340 It got popped off. 1315 01:08:32,340 --> 01:08:35,160 But it got popped back on. 1316 01:08:35,160 --> 01:08:37,800 But now, it could be in any configuration. 1317 01:08:37,800 --> 01:08:38,840 We don't know what it is. 1318 01:08:38,840 --> 01:08:43,960 And now, the code is thinking, that oh, nothing happened. 1319 01:08:43,960 --> 01:08:47,109 But in fact, something happened. 1320 01:08:47,109 --> 01:08:51,050 So it's ABA because basically, you've got 15 there. 1321 01:08:51,050 --> 01:08:51,825 It goes away. 1322 01:08:51,825 --> 01:08:54,029 Then, 15 comes back. 1323 01:08:54,029 --> 01:08:56,274 Question? 1324 01:08:56,274 --> 01:08:58,158 AUDIENCE: Can you compare two things and then swap because 1325 01:08:58,158 --> 01:08:59,100 that would solve this, right? 
1326 01:08:59,100 --> 01:09:00,524 PROFESSOR: That's called a double compare-and-swap. 1327 01:09:03,233 --> 01:09:07,689 And we'll talk about it in a second. 1328 01:09:07,689 --> 01:09:12,899 So the classic way to solve this problem is to use 1329 01:09:12,899 --> 01:09:15,050 versioning. 1330 01:09:15,050 --> 01:09:17,540 So what you do is you pack a version number with each 1331 01:09:17,540 --> 01:09:22,180 pointer in the same atomically updatable word. 1332 01:09:22,180 --> 01:09:28,840 So that when 15 comes back, you've got the pointer. 1333 01:09:28,840 --> 01:09:31,770 But you also have a version on that pointer so that the version 1334 01:09:31,770 --> 01:09:33,420 has to be the same as the version you 1335 01:09:33,420 --> 01:09:35,399 had, and not just the value. 1336 01:09:35,399 --> 01:09:37,729 What you do is you increment the version number every time 1337 01:09:37,729 --> 01:09:40,560 the pointer is changed. 1338 01:09:40,560 --> 01:09:42,779 So you just do an increment. 1339 01:09:42,779 --> 01:09:44,950 But you do the compare-and-swap on the 1340 01:09:44,950 --> 01:09:49,779 version number and the pointer at the same time. 1341 01:09:49,779 --> 01:09:52,630 Now, it turns out that some architectures actually have 1342 01:09:52,630 --> 01:09:55,710 what's called a double compare-and-swap, which will 1343 01:09:55,710 --> 01:09:59,320 do compare-and-swap on two distinct locations. 1344 01:09:59,320 --> 01:10:02,140 And that simplifies things even more because it means you 1345 01:10:02,140 --> 01:10:04,050 don't have to pack and make sure that 1346 01:10:04,050 --> 01:10:05,420 things fit in one word. 1347 01:10:05,420 --> 01:10:06,990 You can keep versioning elsewhere. 1348 01:10:06,990 --> 01:10:09,730 And there are a whole bunch of other places where you can, in 1349 01:10:09,730 --> 01:10:12,700 fact, optimize and get even tighter code than you could if 1350 01:10:12,700 --> 01:10:14,990 you have to pack. 
1351 01:10:14,990 --> 01:10:16,910 So that's generally the way you solve this. 1352 01:10:16,910 --> 01:10:19,730 And, of course, you can see this gets-- 1353 01:10:19,730 --> 01:10:24,310 as I say, this week has been the skull-and-crossbones lecture. 1354 01:10:24,310 --> 01:10:28,920 It's appropriate it comes right after Halloween because 1355 01:10:28,920 --> 01:10:31,260 really, you do not want to play these games 1356 01:10:31,260 --> 01:10:32,740 unless you have to. 1357 01:10:32,740 --> 01:10:35,320 But you should know about them because you will find times 1358 01:10:35,320 --> 01:10:38,680 where you need this, or you need to understand somebody's 1359 01:10:38,680 --> 01:10:40,720 code that they've written in a lock-free way. 1360 01:10:40,720 --> 01:10:43,810 Because remember lock-free has the nice property that hey, 1361 01:10:43,810 --> 01:10:47,080 the operating system swaps something out, it just keeps 1362 01:10:47,080 --> 01:10:50,490 running nice and jolly if it's correct. 1363 01:10:54,300 --> 01:10:57,750 So the other issue is that version numbers may need to be 1364 01:10:57,750 --> 01:11:00,170 very large. 1365 01:11:00,170 --> 01:11:02,170 So if you have a version number, how many bits do you 1366 01:11:02,170 --> 01:11:03,970 assign to that version number? 1367 01:11:03,970 --> 01:11:05,810 Well, 64 bits, that's no problem. 1368 01:11:05,810 --> 01:11:07,930 You never run out of 64 bits. 1369 01:11:07,930 --> 01:11:11,390 2 to the 64th is a very, very, very big number. 1370 01:11:11,390 --> 01:11:13,400 And you'll never run out of 2 to the 64th. 1371 01:11:13,400 --> 01:11:15,370 We did that calculation at the beginning of the term. 1372 01:11:15,370 --> 01:11:16,620 How big did we say it was? 1373 01:11:22,240 --> 01:11:23,940 It's pretty big, right? 1374 01:11:23,940 --> 01:11:26,758 It's like this big. 1375 01:11:26,758 --> 01:11:29,740 Or is it this big? 1376 01:11:29,740 --> 01:11:32,108 My two-year-old is this big. 
1377 01:11:34,810 --> 01:11:37,200 So anyway, it's pretty big. 1378 01:11:37,200 --> 01:11:39,340 So is it bigger than-- 1379 01:11:39,340 --> 01:11:41,570 no, it's not bigger than the number of particles in the 1380 01:11:41,570 --> 01:11:42,230 universe, right? 1381 01:11:42,230 --> 01:11:45,540 That's 10 to the 80th, which is much bigger 1382 01:11:45,540 --> 01:11:46,690 than 2 to the 64th. 1383 01:11:46,690 --> 01:11:48,280 But it's still a big number. 1384 01:11:48,280 --> 01:11:50,180 I think it's like more than there are atoms in the earth 1385 01:11:50,180 --> 01:11:50,610 or something. 1386 01:11:50,610 --> 01:11:52,430 It's still pretty big. 1387 01:11:52,430 --> 01:11:54,060 You never get through it if you calculate it. 1388 01:11:54,060 --> 01:11:55,780 I think we calculated it and it was 1389 01:11:55,780 --> 01:11:57,320 hundreds of years or whatever. 1390 01:11:57,320 --> 01:11:59,670 Anyway, it's a long time. 1391 01:11:59,670 --> 01:12:03,230 Many, many, many years at the very fastest, updating with 1392 01:12:03,230 --> 01:12:04,880 biggest supercomputers, and the most 1393 01:12:04,880 --> 01:12:05,860 processors, et cetera. 1394 01:12:05,860 --> 01:12:07,806 Never run out of 64 bits. 1395 01:12:07,806 --> 01:12:09,250 32 bits. 1396 01:12:09,250 --> 01:12:10,810 Four billion. 1397 01:12:10,810 --> 01:12:11,650 Maybe you run out. 1398 01:12:11,650 --> 01:12:13,570 Maybe you don't. 1399 01:12:13,570 --> 01:12:15,070 So that's one of the issues. 1400 01:12:15,070 --> 01:12:18,370 You have to say, well, how often do I have to do that. 1401 01:12:18,370 --> 01:12:20,020 Really, you only have to worry about this. 1402 01:12:20,020 --> 01:12:22,160 You can wraparound. 
1403 01:12:22,160 --> 01:12:24,880 But you've got to make sure that then you never have a 1404 01:12:24,880 --> 01:12:29,110 situation where something could be swapped out for long 1405 01:12:29,110 --> 01:12:32,920 enough that it would come back and bite you because you're 1406 01:12:32,920 --> 01:12:34,640 coming around and then eating your tail. 1407 01:12:34,640 --> 01:12:36,540 And you've got to make sure you wouldn't have things 1408 01:12:36,540 --> 01:12:38,070 overlap and get a [? thing. ?] 1409 01:12:38,070 --> 01:12:40,580 So that might be a risk you're willing to take. 1410 01:12:40,580 --> 01:12:43,240 You can do an analysis and say, what are the odds my 1411 01:12:43,240 --> 01:12:46,090 system crashes from this reason or 1412 01:12:46,090 --> 01:12:48,010 from a different reason? 1413 01:12:48,010 --> 01:12:51,560 That can be a reasonable engineering trade-off. 1414 01:12:51,560 --> 01:12:54,460 So there's an alternative to compare-and-swap. 1415 01:12:54,460 --> 01:12:56,430 One is the double compare-and-swap. 1416 01:12:56,430 --> 01:12:59,170 Another one is some machines have what's called a 1417 01:12:59,170 --> 01:13:02,300 load-linked/store-conditional instruction. 1418 01:13:02,300 --> 01:13:04,620 What those are actually is a pair of instructions. 1419 01:13:04,620 --> 01:13:05,870 One is load-linked. 1420 01:13:05,870 --> 01:13:09,800 When you load-link, it basically says, let's set a 1421 01:13:09,800 --> 01:13:11,950 bit, essentially, in that word. 1422 01:13:11,950 --> 01:13:16,870 And if that word ever changes when you do the store-conditional, 1423 01:13:16,870 --> 01:13:18,770 it will fail. 1424 01:13:18,770 --> 01:13:21,870 So even if some other processor changes it to the 1425 01:13:21,870 --> 01:13:26,030 exact same value, it's keeping track of whether anybody else 1426 01:13:26,030 --> 01:13:29,490 wrote it using the memory consistency mechanism. 1427 01:13:29,490 --> 01:13:31,870 The MSI-type protocol that we talked about. 
1428 01:13:31,870 --> 01:13:36,060 It's using that kind of mechanism to make sure it can tell 1429 01:13:36,060 --> 01:13:36,530 if it changes. 1430 01:13:36,530 --> 01:13:39,215 And so this is actually much more reliable as a mechanism. 1431 01:13:41,800 --> 01:13:46,060 x86 does not have load-linked/store-conditional. 1432 01:13:46,060 --> 01:13:46,930 I'm not sure why. 1433 01:13:46,930 --> 01:13:48,730 I don't know if there's a patent on it or 1434 01:13:48,730 --> 01:13:49,300 what's going on. 1435 01:13:49,300 --> 01:13:50,550 But they don't have it. 1436 01:13:55,630 --> 01:13:57,620 Final topic is reducers. 1437 01:14:00,650 --> 01:14:02,610 So once again, recall the summing problem. 1438 01:14:06,140 --> 01:14:09,830 In Cilk++, they have a mechanism called reducer 1439 01:14:09,830 --> 01:14:15,150 hyperobjects, which lets you do an end run around some of 1440 01:14:15,150 --> 01:14:17,610 these synchronization problems. 1441 01:14:17,610 --> 01:14:22,420 And the basic idea behind it is we actually could code this 1442 01:14:22,420 --> 01:14:25,110 fairly easily as we talked about last time by just doing 1443 01:14:25,110 --> 01:14:28,110 divide and conquer on the array. 1444 01:14:28,110 --> 01:14:30,830 We add up the first half of the elements, add up the 1445 01:14:30,830 --> 01:14:32,120 second half of the elements, when they 1446 01:14:32,120 --> 01:14:33,810 return, add them together. 1447 01:14:33,810 --> 01:14:38,120 But the problem is that coding that is a pain to do. 1448 01:14:38,120 --> 01:14:40,770 So the hyperobject mechanism sort of does that 1449 01:14:40,770 --> 01:14:42,660 automatically for you. 1450 01:14:42,660 --> 01:14:50,350 What you can do is declare result to be an integer, which 1451 01:14:50,350 --> 01:14:56,530 is going to have the operation add performed on it. 1452 01:14:56,530 --> 01:14:59,780 And what happens then is you can just go ahead and add the 1453 01:14:59,780 --> 01:15:03,730 values up like this. 
1454 01:15:03,730 --> 01:15:08,060 And basically, what it does is essentially adds things 1455 01:15:08,060 --> 01:15:12,090 locally and will combine them on an as-needed basis. 1456 01:15:12,090 --> 01:15:15,210 So you don't actually have to do any synchronization at all. 1457 01:15:15,210 --> 01:15:17,700 In the end, you have to get the result 1458 01:15:17,700 --> 01:15:18,810 by doing a get_value(). 1459 01:15:18,810 --> 01:15:23,150 So let me show you a little bit more what's going on in 1460 01:15:23,150 --> 01:15:24,250 this situation. 1461 01:15:24,250 --> 01:15:27,690 So the first thing here is we're saying result is a 1462 01:15:27,690 --> 01:15:29,415 summing reducer over int. 1463 01:15:32,430 --> 01:15:35,480 The updates are resolved automatically without races or 1464 01:15:35,480 --> 01:15:39,380 contention because they're basically doing it by keeping 1465 01:15:39,380 --> 01:15:42,200 local values and copying them. 1466 01:15:42,200 --> 01:15:44,730 And then, at the end, you can get the underlying value. 1467 01:15:47,610 --> 01:15:52,240 So the way this works is that when you declare the variable, 1468 01:15:52,240 --> 01:15:57,210 you're declaring it as a reducer over some associative 1469 01:15:57,210 --> 01:16:00,370 operation, such as addition. 1470 01:16:00,370 --> 01:16:04,720 So it only works cleanly if your operation is associative. 1471 01:16:04,720 --> 01:16:07,860 And there are a lot of associative operations. 1472 01:16:07,860 --> 01:16:10,480 Addition, multiplication, logical AND, list 1473 01:16:10,480 --> 01:16:11,430 concatenation. 1474 01:16:11,430 --> 01:16:14,400 I can concatenate two lists. 1475 01:16:14,400 --> 01:16:15,710 So what does associative mean? 1476 01:16:18,210 --> 01:16:20,730 I think I have a slide on this in a minute. 1477 01:16:20,730 --> 01:16:23,900 It means a times b times c. 1478 01:16:23,900 --> 01:16:26,220 I can parenthesize it any way I want and 1479 01:16:26,220 --> 01:16:27,020 get the same answer. 
1480 01:16:27,020 --> 01:16:28,970 Associative, right? 1481 01:16:28,970 --> 01:16:30,140 It's not associative like 1482 01:16:30,140 --> 01:16:35,840 associative memory or whatever. 1483 01:16:35,840 --> 01:16:38,950 So now, the individual strands in the computation can update 1484 01:16:38,950 --> 01:16:43,880 x as if it were an ordinary non-local variable. 1485 01:16:43,880 --> 01:16:47,670 But in fact, it's maintained as a set of different copies 1486 01:16:47,670 --> 01:16:50,540 called views. 1487 01:16:50,540 --> 01:16:53,400 The Cilk++ runtime system coordinates the views and 1488 01:16:53,400 --> 01:16:55,370 combines them when appropriate. 1489 01:16:55,370 --> 01:16:58,620 And when only one view remains, now you can get the 1490 01:16:58,620 --> 01:16:59,710 actual value. 1491 01:16:59,710 --> 01:17:02,890 So for example, you may have a summing reducer where the 1492 01:17:02,890 --> 01:17:06,620 actual value at this point in time is 89. 1493 01:17:06,620 --> 01:17:16,120 But locally, each processor may only see a different value 1494 01:17:16,120 --> 01:17:18,790 whose sum is 89. 1495 01:17:18,790 --> 01:17:23,170 But locally, I could do something like increment this. 1496 01:17:23,170 --> 01:17:27,010 And this guy can independently increment his view and has the 1497 01:17:27,010 --> 01:17:30,100 effect that it increments whatever the total sum is. 1498 01:17:30,100 --> 01:17:33,920 And then, the runtime system manages to combine everything 1499 01:17:33,920 --> 01:17:38,630 at the end to make it be the value when there's no more 1500 01:17:38,630 --> 01:17:42,410 parallelism associated with that reducer. 1501 01:17:42,410 --> 01:17:44,390 So here's the conceptual behavior. 1502 01:17:44,390 --> 01:17:45,980 Imagine I have this code. 1503 01:17:45,980 --> 01:17:47,780 I set x equal to 0. 1504 01:17:47,780 --> 01:17:49,180 I then add 3. 1505 01:17:49,180 --> 01:17:50,210 I then increment. 1506 01:17:50,210 --> 01:17:51,960 I have 4, increment it to 5. 
1507 01:17:51,960 --> 01:17:53,240 Fa da da da da. 1508 01:17:53,240 --> 01:17:57,130 At the end, I get some value, which I 1509 01:17:57,130 --> 01:17:58,380 don't think I put down. 1510 01:18:01,360 --> 01:18:03,550 Another way I could do this is the following. 1511 01:18:03,550 --> 01:18:06,830 Let me do exactly the same here but with a local view 1512 01:18:06,830 --> 01:18:10,520 that I'll call x1. 1513 01:18:10,520 --> 01:18:14,060 For this set of operations, let me start a new view that I 1514 01:18:14,060 --> 01:18:18,210 start out with the identity for addition, which is 0 and 1515 01:18:18,210 --> 01:18:19,970 add those guys up. 1516 01:18:19,970 --> 01:18:24,060 And then, at the end, let me add x1 and x2. 1517 01:18:24,060 --> 01:18:26,540 It should give me the same answer if addition is 1518 01:18:26,540 --> 01:18:27,790 associative. 1519 01:18:30,600 --> 01:18:32,195 In particular, these now can operate in 1520 01:18:32,195 --> 01:18:33,595 parallel with no races. 1521 01:18:36,520 --> 01:18:39,830 So if you don't actually look at the intermediate values-- 1522 01:18:39,830 --> 01:18:42,640 if all I'm doing is updating them, but I'm not actually 1523 01:18:42,640 --> 01:18:46,200 looking to see what the actual value of the thing 1524 01:18:46,200 --> 01:18:49,900 is, I should get the same answer at the end. 1525 01:18:49,900 --> 01:18:51,420 The result is then determinant. 1526 01:18:51,420 --> 01:18:54,230 It's not deterministic because it's going to get done in a 1527 01:18:54,230 --> 01:18:56,200 different way with different memory state. 1528 01:18:56,200 --> 01:18:59,260 But it's determinant, meaning the output answer is going to 1529 01:18:59,260 --> 01:19:03,350 be the same no matter how it executes, even if the 1530 01:19:03,350 --> 01:19:06,250 resulting computation is nondeterministic. 1531 01:19:06,250 --> 01:19:08,470 So this is a way of encapsulating, if you will, 1532 01:19:08,470 --> 01:19:09,650 nondeterminism. 
1533 01:19:09,650 --> 01:19:12,430 And it worked because addition is associative. 1534 01:19:12,430 --> 01:19:15,340 It didn't matter which order I did it. 1535 01:19:15,340 --> 01:19:17,660 And once again, I could have broken it here instead of 1536 01:19:17,660 --> 01:19:20,360 there, and I still get the same answer. 1537 01:19:20,360 --> 01:19:21,460 It doesn't matter. 1538 01:19:21,460 --> 01:19:24,540 So the idea is as these things are work stealing around. 1539 01:19:24,540 --> 01:19:26,970 they're accumulating things locally but combining them in 1540 01:19:26,970 --> 01:19:32,080 a way that maintains the invariant that the final value 1541 01:19:32,080 --> 01:19:33,570 is going to be the sum. 1542 01:19:36,240 --> 01:19:38,760 So there's a lot of other related work where people do 1543 01:19:38,760 --> 01:19:42,680 reduction types of things, but they're all tied to specific 1544 01:19:42,680 --> 01:19:44,520 control or data structures. 1545 01:19:44,520 --> 01:19:50,300 And the neat thing about the Cilk++ version is that it is 1546 01:19:50,300 --> 01:19:51,290 not tied to anything. 1547 01:19:51,290 --> 01:19:52,430 You can name it anywhere. 1548 01:19:52,430 --> 01:19:54,280 You can write recursive programs. 1549 01:19:54,280 --> 01:19:58,300 You can update locally your reducer wherever you want, and 1550 01:19:58,300 --> 01:20:04,020 it figures out exactly how to combine them in order to get 1551 01:20:04,020 --> 01:20:06,760 your final answer. 1552 01:20:06,760 --> 01:20:11,450 So the algebraic framework for this is that we have a monoid, 1553 01:20:11,450 --> 01:20:17,720 which is a set, an operator, and an identity, where the 1554 01:20:17,720 --> 01:20:20,740 operator is an associative binary operator. 1555 01:20:20,740 --> 01:20:24,450 And the identity is, in fact, the identity. 1556 01:20:24,450 --> 01:20:26,730 So here are some examples. 
1557 01:20:26,730 --> 01:20:31,510 Integers with plus and 0, the real numbers with times and 1, 1558 01:20:31,510 --> 01:20:35,110 true and false, Booleans with AND, where true is the 1559 01:20:35,110 --> 01:20:40,810 identity, strings over some alphabet with concatenation, 1560 01:20:40,810 --> 01:20:43,530 where the empty string is the identity. 1561 01:20:43,530 --> 01:20:46,720 You can do MAX with minus infinity as the 1562 01:20:46,720 --> 01:20:48,110 identity, and so forth. 1563 01:20:48,110 --> 01:20:49,540 And you can come up with your own. 1564 01:20:49,540 --> 01:20:52,990 It's easy to come up with examples of monoids. 1565 01:20:52,990 --> 01:20:57,530 So what we do in Cilk++ is we represent a monoid over a set 1566 01:20:57,530 --> 01:21:02,530 t by a C++ class that inherits from this base class that's 1567 01:21:02,530 --> 01:21:07,550 predefined for you, which is parameterized using templates 1568 01:21:07,550 --> 01:21:08,880 with the types. 1569 01:21:08,880 --> 01:21:10,770 So the set that we're going to use is, in fact, 1570 01:21:10,770 --> 01:21:13,120 going to be a type. 1571 01:21:13,120 --> 01:21:15,090 And the member function reduce-- 1572 01:21:15,090 --> 01:21:18,440 this monoid has to have a member function reduce that 1573 01:21:18,440 --> 01:21:21,050 implements the binary operator times. 1574 01:21:21,050 --> 01:21:24,640 And it also has an identity member function. 1575 01:21:24,640 --> 01:21:28,570 So we set up the algebraic framework. 1576 01:21:28,570 --> 01:21:32,620 So here's, for example, how I could define a sum monoid. 1577 01:21:32,620 --> 01:21:36,600 I inherit from the base with int, for example, here. 1578 01:21:36,600 --> 01:21:39,650 And I define my reduce function. 1579 01:21:39,650 --> 01:21:42,530 And it actually turns out to be important, you always do 1580 01:21:42,530 --> 01:21:44,970 the right one into the left. 1581 01:21:44,970 --> 01:21:47,420 Otherwise, you won't have it be associative. 
1582 01:21:47,420 --> 01:21:49,710 And then, you have an identity, which gives you in 1583 01:21:49,710 --> 01:21:52,940 this case a new element, which is 0. 1584 01:21:55,610 --> 01:22:02,080 And so you can now define the reducer as so. 1585 01:22:02,080 --> 01:22:04,270 You just say Cilk reducer, the sum monoid 1586 01:22:04,270 --> 01:22:06,760 you've defined and x. 1587 01:22:06,760 --> 01:22:10,360 And now, the local view of x can be accessed as x open 1588 01:22:10,360 --> 01:22:11,990 close parenthesis. 1589 01:22:11,990 --> 01:22:13,940 Now, in the example I showed you, you didn't need to do the 1590 01:22:13,940 --> 01:22:15,980 open-close parentheses. 1591 01:22:15,980 --> 01:22:18,045 And the way you get rid of those open-close parentheses 1592 01:22:18,045 --> 01:22:20,930 is you define a wrapper class. 1593 01:22:20,930 --> 01:22:24,420 So it's generally inconvenient to replace every access with 1594 01:22:24,420 --> 01:22:25,970 x(). 1595 01:22:25,970 --> 01:22:26,880 That's one issue. 1596 01:22:26,880 --> 01:22:28,640 The other thing is accesses aren't safe. 1597 01:22:28,640 --> 01:22:33,070 Nothing prevents a programmer from writing x times equals 2, 1598 01:22:33,070 --> 01:22:35,960 even though the reducer was defined over plus. 1599 01:22:35,960 --> 01:22:38,340 And that will screw up the logic of this code if 1600 01:22:38,340 --> 01:22:40,920 somewhere he's multiplying when, in fact, it's only 1601 01:22:40,920 --> 01:22:44,300 supposed to be combined with addition. 1602 01:22:44,300 --> 01:22:46,740 So the way you solve that is with a wrapper class. 1603 01:22:46,740 --> 01:22:49,540 You can do a wrapper class that will protect all of the 1604 01:22:49,540 --> 01:22:54,020 operations inside and export things that you can just refer 1605 01:22:54,020 --> 01:22:54,760 to the variable. 1606 01:22:54,760 --> 01:22:56,770 And it will actually call that. 
1607 01:22:56,770 --> 01:22:58,930 For most of what you're doing, you probably don't need to 1608 01:22:58,930 --> 01:23:00,380 write a wrapper class. 1609 01:23:00,380 --> 01:23:07,950 You'll do fine just operating with the extra parentheses. 1610 01:23:07,950 --> 01:23:09,330 In addition, there's a whole bunch of 1611 01:23:09,330 --> 01:23:11,680 commonly used reducers. 1612 01:23:11,680 --> 01:23:17,980 List append, max, min, add, an output stream, and 1613 01:23:17,980 --> 01:23:23,640 strings, and also you can roll your own. 1614 01:23:23,640 --> 01:23:27,380 One issue with addition is that, in fact, 1615 01:23:27,380 --> 01:23:29,810 this doesn't preserve-- 1616 01:23:29,810 --> 01:23:32,030 for floating point addition-- does not 1617 01:23:32,030 --> 01:23:34,970 preserve the same answer. 1618 01:23:34,970 --> 01:23:37,710 And the reason is because floating point numbers are not 1619 01:23:37,710 --> 01:23:38,970 associative. 1620 01:23:38,970 --> 01:23:42,270 If I add a to b and add that to c, I can get something 1621 01:23:42,270 --> 01:23:45,370 different because of round-off error from adding a to the 1622 01:23:45,370 --> 01:23:47,320 result of b and c. 1623 01:23:47,320 --> 01:23:50,662 So generally, floating point operations don't give you-- 1624 01:23:50,662 --> 01:23:53,400 they'll give you something that is close enough for most 1625 01:23:53,400 --> 01:23:55,110 things, but it's not actually associative. 1626 01:23:55,110 --> 01:23:58,580 So you will get different answers. 1627 01:23:58,580 --> 01:23:59,390 A quick example. 1628 01:23:59,390 --> 01:24:01,200 I'm sorry to run over a little bit here. 1629 01:24:01,200 --> 01:24:03,850 I hope people have a couple minutes. 1630 01:24:03,850 --> 01:24:05,800 Here's a real world example. 1631 01:24:05,800 --> 01:24:08,970 A company had a mechanical assembly represented as a tree of 1632 01:24:08,970 --> 01:24:11,170 assemblies down to individual parts. 
1633 01:24:11,170 --> 01:24:14,820 A pickup truck has all these parts and all of these extra 1634 01:24:14,820 --> 01:24:18,110 subparts all the way down to some geometric description of 1635 01:24:18,110 --> 01:24:19,670 what the part is. 1636 01:24:19,670 --> 01:24:21,610 And what they want to do is the so-called collision 1637 01:24:21,610 --> 01:24:22,980 detection problem, which has nothing to do 1638 01:24:22,980 --> 01:24:24,390 with colliding autos. 1639 01:24:24,390 --> 01:24:27,730 What they're doing is saying, find collisions between the 1640 01:24:27,730 --> 01:24:29,230 assembly and a target object. 1641 01:24:29,230 --> 01:24:31,870 And that object might be something like a half space 1642 01:24:31,870 --> 01:24:33,250 because they're computing a cutaway. 1643 01:24:33,250 --> 01:24:35,720 Tell me all the things that fall within this. 1644 01:24:35,720 --> 01:24:39,710 Or maybe, here's an engine compartment, and does the 1645 01:24:39,710 --> 01:24:42,270 engine fit in with it? 1646 01:24:42,270 --> 01:24:44,690 So here's a code that does that. 1647 01:24:44,690 --> 01:24:50,020 Basically, it does a recursive walk, where it looks to see 1648 01:24:50,020 --> 01:24:52,080 whether it's an internal node or a leaf. 1649 01:24:52,080 --> 01:24:58,000 If it's a leaf, it says, oh, let me check to see whether 1650 01:24:58,000 --> 01:25:00,440 the target collides with a particular 1651 01:25:00,440 --> 01:25:01,790 element of the tree. 1652 01:25:01,790 --> 01:25:06,120 And if so, add that object to the end of a list. 1653 01:25:06,120 --> 01:25:13,730 So this is the standard C++ library call for putting something 1654 01:25:13,730 --> 01:25:15,950 on the end of the list. 1655 01:25:15,950 --> 01:25:19,670 If it's an internal node, then go through all of the children 1656 01:25:19,670 --> 01:25:21,030 recursively. 1657 01:25:21,030 --> 01:25:25,680 And walk the children recursively. 
1658 01:25:25,680 --> 01:25:28,850 So basically, you're going to look through the whole tree. 1659 01:25:28,850 --> 01:25:34,290 Does it intersect this particular object, x? 1660 01:25:34,290 --> 01:25:36,290 So how do we parallelize this? 1661 01:25:36,290 --> 01:25:38,060 We can parallelize the recursion. 1662 01:25:38,060 --> 01:25:41,280 We turn the for loop here into a cilk_for. 1663 01:25:41,280 --> 01:25:43,640 So it goes through all the children at the same time. 1664 01:25:43,640 --> 01:25:45,270 They all can do their comparisons 1665 01:25:45,270 --> 01:25:47,830 completely in parallel. 1666 01:25:47,830 --> 01:25:49,580 Oops, but we have a bug. 1667 01:25:49,580 --> 01:25:52,170 What's the bug? 1668 01:25:52,170 --> 01:25:54,900 AUDIENCE: Is it push back? 1669 01:25:54,900 --> 01:25:55,230 PROFESSOR: Yeah. 1670 01:25:55,230 --> 01:25:56,760 The push back here. 1671 01:25:56,760 --> 01:26:02,300 We have a race here because they're all trying to push on 1672 01:26:02,300 --> 01:26:05,380 to this output list at the same time. 1673 01:26:05,380 --> 01:26:08,870 So we could resolve it with a lock or whatever. 1674 01:26:08,870 --> 01:26:13,340 But it turns out it's much better to resolve it with a-- 1675 01:26:13,340 --> 01:26:15,900 so we could do this, right? 1676 01:26:15,900 --> 01:26:18,140 But now, you've got lock contention. 1677 01:26:18,140 --> 01:26:20,270 And also, the list ends up getting produced 1678 01:26:20,270 --> 01:26:23,100 in a jumbled order. 1679 01:26:23,100 --> 01:26:28,040 So it turns out if you use a reducer, you declare this to 1680 01:26:28,040 --> 01:26:32,045 be a reducer with list append. 1681 01:26:32,045 --> 01:26:37,180 And what happens then is it turns out list concatenation is 1682 01:26:37,180 --> 01:26:38,710 associative. 1683 01:26:38,710 --> 01:26:42,510 If I concatenate a to b, and then concatenate c, that's the 1684 01:26:42,510 --> 01:26:46,040 same as concatenating a to the concatenation of b and c. 
1685 01:26:46,040 --> 01:26:49,780 And I can concatenate lists in constant time by keeping a 1686 01:26:49,780 --> 01:26:53,290 pointer to the head and tail of each list. 1687 01:26:53,290 --> 01:26:55,340 So if you do that, and that turns out to be one of the 1688 01:26:55,340 --> 01:26:59,220 built in functions, then, in fact, this code operates 1689 01:26:59,220 --> 01:27:02,200 perfectly well with no contention and so forth. 1690 01:27:02,200 --> 01:27:06,000 And in fact, produces the output in the same order as 1691 01:27:06,000 --> 01:27:08,000 the original C++. 1692 01:27:08,000 --> 01:27:09,250 It runs fast. 1693 01:27:11,540 --> 01:27:16,050 And there's a little description of how it works. 1694 01:27:16,050 --> 01:27:17,940 The actual protocol is kind of tricky. 1695 01:27:17,940 --> 01:27:19,730 And we'll put the paper-- 1696 01:27:19,730 --> 01:27:22,970 let's make sure we get this paper up on the web. 1697 01:27:22,970 --> 01:27:24,480 I think it was there from last year. 1698 01:27:24,480 --> 01:27:26,990 So we should be able to find it. 1699 01:27:26,990 --> 01:27:28,740 If you're interested in how the details work. 1700 01:27:28,740 --> 01:27:30,860 Here's the important thing to know from a 1701 01:27:30,860 --> 01:27:33,450 programmer point of view. 1702 01:27:33,450 --> 01:27:36,630 So typically, the cost-- 1703 01:27:36,630 --> 01:27:40,010 it turns out the reduce operations you're only calling 1704 01:27:40,010 --> 01:27:41,310 when there's actually a steal. 1705 01:27:41,310 --> 01:27:42,900 It's actually a return from a steal. 1706 01:27:42,900 --> 01:27:46,560 But since stealing occurs relatively infrequently the 1707 01:27:46,560 --> 01:27:49,450 load balance, the number of times you actually do one of 1708 01:27:49,450 --> 01:27:53,200 these reduce operations is small. 1709 01:27:53,200 --> 01:27:56,250 The most of the cost is actually accessing the reducer 1710 01:27:56,250 --> 01:27:58,410 to do the updates. 
1711 01:27:58,410 --> 01:28:01,060 And it's never worse than a hash table lookup the way it's 1712 01:28:01,060 --> 01:28:02,480 implemented. 1713 01:28:02,480 --> 01:28:05,380 If the reducer is accessed several times within a region 1714 01:28:05,380 --> 01:28:08,990 of code, the compiler can optimize the lookups using 1715 01:28:08,990 --> 01:28:11,170 common subexpression elimination. 1716 01:28:11,170 --> 01:28:15,380 And in the common case, then, what happens is it basically 1717 01:28:15,380 --> 01:28:18,480 has an access cost equal to one additional level of 1718 01:28:18,480 --> 01:28:22,480 indirection, which is typically an L1 cache hit. 1719 01:28:22,480 --> 01:28:26,080 So the overhead of actually updating one of these things 1720 01:28:26,080 --> 01:28:30,320 is really just like an extra L1 cache hit for most of these 1721 01:28:30,320 --> 01:28:32,160 things, for most of the time. 1722 01:28:32,160 --> 01:28:37,210 If you have the case that you're accessing a reducer 1723 01:28:37,210 --> 01:28:39,640 several times within the same block of code. 1724 01:28:39,640 --> 01:28:42,150 Otherwise, at the very worst, you have to actually do a hash 1725 01:28:42,150 --> 01:28:43,140 table lookup. 1726 01:28:43,140 --> 01:28:46,270 And that tends to be a little bit more like a function call 1727 01:28:46,270 --> 01:28:51,180 overhead just in terms of order of magnitude. 1728 01:28:51,180 --> 01:28:52,430 Sorry for running over.