1
00:00:00,120 --> 00:00:02,500
The following content is
provided under a Creative

2
00:00:02,500 --> 00:00:03,910
Commons license.

3
00:00:03,910 --> 00:00:06,950
Your support will help MIT
OpenCourseWare continue to

4
00:00:06,950 --> 00:00:10,600
offer high-quality educational
resources for free.

5
00:00:10,600 --> 00:00:13,500
To make a donation or view
additional materials from

6
00:00:13,500 --> 00:00:17,430
hundreds of MIT courses, visit
MIT OpenCourseWare at

7
00:00:17,430 --> 00:00:18,680
ocw.mit.edu.

8
00:00:27,870 --> 00:00:30,680
Let's get going here.

9
00:00:35,990 --> 00:00:40,940
So this is a lecture that's
actually appropriate for

10
00:00:40,940 --> 00:00:46,670
Halloween, because it's
a scary topic.

11
00:00:46,670 --> 00:00:48,000
Non-deterministic programming.

12
00:00:52,410 --> 00:00:55,280
So we've been looking mostly
at deterministic programs.

13
00:00:55,280 --> 00:01:00,290
So a program is deterministic
on a given input if every

14
00:01:00,290 --> 00:01:03,550
memory location is updated
with the same sequence of

15
00:01:03,550 --> 00:01:05,470
values in every execution.

16
00:01:08,000 --> 00:01:12,580
So if you look at the memory of
the machine, you can view

17
00:01:12,580 --> 00:01:17,250
that as, essentially, the
state of the machine.

18
00:01:17,250 --> 00:01:19,570
And if you're always updating
every memory location with

19
00:01:19,570 --> 00:01:23,760
exactly the same sequence of
values, then the program is

20
00:01:23,760 --> 00:01:24,530
deterministic.

21
00:01:24,530 --> 00:01:29,700
Now, two memory locations may be
updated in a

22
00:01:29,700 --> 00:01:31,310
different order.

23
00:01:31,310 --> 00:01:36,340
So you may have one location
which is updated first in one

24
00:01:36,340 --> 00:01:39,600
execution, and another that's
second, and then in a

25
00:01:39,600 --> 00:01:41,490
different execution, they may
be in a different order.

26
00:01:41,490 --> 00:01:43,270
That's OK, generally.

27
00:01:43,270 --> 00:01:46,490
The issue is whether or not
every memory location sees the

28
00:01:46,490 --> 00:01:47,870
same order.

29
00:01:47,870 --> 00:01:55,800
And if every location sees the same
order in every execution, then it's a

30
00:01:55,800 --> 00:01:57,050
deterministic program.

31
00:02:01,850 --> 00:02:07,850
So what's the advantage
of having a

32
00:02:07,850 --> 00:02:10,316
deterministic program?

33
00:02:10,316 --> 00:02:11,302
Yeah?

34
00:02:11,302 --> 00:02:15,246
AUDIENCE: It always runs the
same way [INAUDIBLE].

35
00:02:15,246 --> 00:02:16,470
PROFESSOR: It always
runs the same way.

36
00:02:16,470 --> 00:02:17,860
So what?

37
00:02:17,860 --> 00:02:18,400
What's that good for?

38
00:02:18,400 --> 00:02:20,220
AUDIENCE: So you can
find bugs easier.

39
00:02:20,220 --> 00:02:22,790
PROFESSOR: Yeah, debugging.

40
00:02:22,790 --> 00:02:25,810
It's really easy to find bugs
if every time you run it it

41
00:02:25,810 --> 00:02:27,030
does the same thing.

42
00:02:27,030 --> 00:02:31,360
It's much harder to find bugs
if, when you run it, it might

43
00:02:31,360 --> 00:02:34,060
do something different.

44
00:02:34,060 --> 00:02:38,400
So that leads to our first
major rule of thumb about

45
00:02:38,400 --> 00:02:41,910
determinism, which is you
should always write

46
00:02:41,910 --> 00:02:43,160
deterministic programs.

47
00:02:46,180 --> 00:02:47,190
Don't write

48
00:02:47,190 --> 00:02:50,690
non-deterministic programs.

49
00:02:50,690 --> 00:02:54,480
And the only problem is-- boy,
is that poor quality there.

50
00:02:54,480 --> 00:02:56,750
So basically, it says, always
write deterministic

51
00:02:56,750 --> 00:03:00,570
programs unless you can't.

52
00:03:00,570 --> 00:03:04,790
So sometimes, the only way to
get performance is to do

53
00:03:04,790 --> 00:03:06,040
something non-deterministic.

54
00:03:09,740 --> 00:03:14,900
So this lecture is basically
about some of the ways of

55
00:03:14,900 --> 00:03:18,380
doing non-deterministic
programming.

56
00:03:18,380 --> 00:03:26,450
So it's appropriate that we say
this is not for those who

57
00:03:26,450 --> 00:03:27,940
are faint of heart.

58
00:03:27,940 --> 00:03:33,340
We are treading into dangerous
territory here.

59
00:03:36,020 --> 00:03:39,410
So the basic rule is, as I say,
any time you can, make

60
00:03:39,410 --> 00:03:42,830
your program deterministic.

61
00:03:42,830 --> 00:03:46,780
So we're going to talk about the
number one way that people

62
00:03:46,780 --> 00:03:51,240
introduce non-determinism into
programs, which is via mutual

63
00:03:51,240 --> 00:03:57,510
exclusion and mutexes, which are
a type of lock, and then

64
00:03:57,510 --> 00:04:01,350
look at some of the anomalies
that you get.

65
00:04:01,350 --> 00:04:05,990
Besides just things being
non-deterministic, you can

66
00:04:05,990 --> 00:04:08,420
also get some very, very
weird behavior

67
00:04:08,420 --> 00:04:12,100
sometimes for the execution.

68
00:04:12,100 --> 00:04:15,840
So we'll start out with
mutual exclusion.

69
00:04:15,840 --> 00:04:18,120
So let's take a look at an
example. Suppose I'm

70
00:04:18,120 --> 00:04:20,899
implementing a hash table
as a set of bins.

71
00:04:20,899 --> 00:04:24,640
And I'm resolving collisions
with chaining.

72
00:04:24,640 --> 00:04:29,720
So here, each slot of my hash
table has a chain of all the

73
00:04:29,720 --> 00:04:34,180
values that resolve
to that slot.

74
00:04:34,180 --> 00:04:40,660
And if I have a value x, let's
say it has key 81, and I want

75
00:04:40,660 --> 00:04:47,920
to insert x into the table, I
first compute a hash of x.

76
00:04:47,920 --> 00:04:53,690
And let's say it hashes to this
particular list here.

77
00:04:53,690 --> 00:04:55,960
And then what I do is
I say, OK, let me

78
00:04:55,960 --> 00:04:57,990
insert x into the table.

79
00:04:57,990 --> 00:05:02,900
So I make the next pointer of
x point to whatever is the

80
00:05:02,900 --> 00:05:07,310
head of the list in that slot.

81
00:05:07,310 --> 00:05:12,020
And then I make the
table slot point to x.

82
00:05:12,020 --> 00:05:19,540
And that effectively inserts
x into the hash table.
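
The sequential insertion just described can be sketched as follows. This is a minimal Python sketch, not the course's Cilk/C code; the names `Node`, `ChainedHashTable`, `insert`, and `chain`, and the use of Python's `hash` modulo the number of slots, are illustrative assumptions.

```python
class Node:
    """A chain element: a key plus a next pointer."""

    def __init__(self, key):
        self.key = key
        self.next = None


class ChainedHashTable:
    """Hash table that resolves collisions by chaining (illustrative sketch)."""

    def __init__(self, num_slots=8):
        self.table = [None] * num_slots  # each entry is the head of a chain

    def slot_of(self, key):
        return hash(key) % len(self.table)

    def insert(self, x):
        h = self.slot_of(x.key)
        x.next = self.table[h]  # x's next pointer -> current head of the chain
        self.table[h] = x       # table slot -> x, so x is the new head

    def chain(self, key):
        """Keys in the chain that `key` hashes to, from head to tail."""
        keys, node = [], self.table[self.slot_of(key)]
        while node is not None:
            keys.append(node.key)
            node = node.next
        return keys


t = ChainedHashTable(num_slots=8)
t.insert(Node(81))
t.insert(Node(89))  # 89 % 8 == 81 % 8 == 1, so both land in slot 1
print(t.chain(81))  # [89, 81]
```

Each insert touches only two pointers, which is why the code is so short, and also why an interleaving of two such inserts can go wrong.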

83
00:05:19,540 --> 00:05:21,910
Fairly straightforward
piece of code.

84
00:05:21,910 --> 00:05:24,490
I would expect that most of you
could write that even on

85
00:05:24,490 --> 00:05:27,760
an exam and get it right.

86
00:05:27,760 --> 00:05:33,130
But what happens when we say,
oh, let's have some

87
00:05:33,130 --> 00:05:33,380
concurrency.

88
00:05:33,380 --> 00:05:37,430
Let's have the ability to look
up things in a hash table in

89
00:05:37,430 --> 00:05:44,080
different parallel branches
of a parallel program.

90
00:05:44,080 --> 00:05:48,010
So here, we have a concurrent
hash table now where I've got

91
00:05:48,010 --> 00:05:51,190
two values, and I'm going to
have two different threads

92
00:05:51,190 --> 00:05:54,770
inserting x and y.

93
00:05:54,770 --> 00:05:57,480
So one of them is going to do
this one, and one of them is

94
00:05:57,480 --> 00:06:00,450
going to do this one.

95
00:06:00,450 --> 00:06:05,580
So let's just see how
this can screw up.

96
00:06:05,580 --> 00:06:10,120
So first, we hash x,
and it hashes to

97
00:06:10,120 --> 00:06:13,440
this particular slot.

98
00:06:13,440 --> 00:06:17,170
So then we do just what we
did before, making its next

99
00:06:17,170 --> 00:06:18,570
pointer point to the head
of the list.

100
00:06:22,450 --> 00:06:25,140
Then y gets in the picture,
and it decides oh,

101
00:06:25,140 --> 00:06:26,600
I'm going to hash.

102
00:06:26,600 --> 00:06:29,275
And oh, it hashes to exactly
the same slot.

103
00:06:31,910 --> 00:06:34,830
And then y makes its next
pointer point to

104
00:06:34,830 --> 00:06:37,240
the head of the list.

105
00:06:37,240 --> 00:06:41,100
And then it sets the head of
the list to point to y.

106
00:06:41,100 --> 00:06:43,700
So now y is in the list.

107
00:06:43,700 --> 00:06:49,020
Whoops, now x puts itself in the
list, effectively taking y

108
00:06:49,020 --> 00:06:49,780
out of the list.

109
00:06:49,780 --> 00:06:53,300
So rather than x and y both
being in the list, we have a

110
00:06:53,300 --> 00:06:54,550
concurrency bug.
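
The bad interleaving can be replayed step by step without real threads, which makes the lost update reproducible. In this sketch (an assumption for illustration, not the lecture's code), nodes are Python dicts and `slot["head"]` stands in for the shared head pointer of the one slot that both x and y hash to; the four statements follow the order of events just described.

```python
def replay_racy_inserts():
    """Replay one bad interleaving of two unsynchronized head insertions.

    Nodes are dicts; `slot` is a dict holding the shared head pointer
    of the single slot that both x and y hash to.
    """
    slot = {"head": None}
    x = {"key": "x", "next": None}
    y = {"key": "y", "next": None}

    x["next"] = slot["head"]  # thread 1: x.next = head (None)
    y["next"] = slot["head"]  # thread 2: y.next = head (still None)
    slot["head"] = y          # thread 2: head = y, so y is in the list
    slot["head"] = x          # thread 1: head = x, overwriting y

    keys, node = [], slot["head"]
    while node is not None:   # walk the chain from the head
        keys.append(node["key"])
        node = node["next"]
    return keys


print(replay_racy_inserts())  # ['x']: y has been lost
```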

111
00:06:58,150 --> 00:07:03,770
So this is clearly a race.

112
00:07:03,770 --> 00:07:12,530
So it's a determinacy race,
because we have two parallel

113
00:07:12,530 --> 00:07:16,090
instructions accessing
essentially the same location,

114
00:07:16,090 --> 00:07:18,450
at least one of which-- in
this case both of them--

115
00:07:18,450 --> 00:07:22,030
performs a store
to that location.

116
00:07:22,030 --> 00:07:23,890
So that's a determinacy race.

117
00:07:23,890 --> 00:07:26,950
And how things are going to work
out depends upon which

118
00:07:26,950 --> 00:07:29,540
one of these guys goes first.

119
00:07:29,540 --> 00:07:32,920
Notice, as with most race bugs,
that if this code all

120
00:07:32,920 --> 00:07:35,780
executed before this
code, we're OK.

121
00:07:35,780 --> 00:07:40,780
Or if this code all executed
before this code, we're OK.

122
00:07:40,780 --> 00:07:44,810
So the bug occurs when they
happen to execute at

123
00:07:44,810 --> 00:07:49,610
essentially the same time and
their instructions interleave.

124
00:07:52,380 --> 00:07:54,110
So this is a race bug.

125
00:07:54,110 --> 00:08:03,610
So one of the classic ways of
fixing this kind of race bug

126
00:08:03,610 --> 00:08:07,296
is to insist on some kind
of mutual exclusion.

127
00:08:09,820 --> 00:08:16,830
So a critical section is a piece
of code that is going to

128
00:08:16,830 --> 00:08:26,350
access shared data that must not
be executed by two threads

129
00:08:26,350 --> 00:08:29,400
at the same time.

130
00:08:29,400 --> 00:08:34,809
So it shouldn't be accessed by
two threads at the same time.

131
00:08:34,809 --> 00:08:36,100
So it's mutual exclusion.

132
00:08:36,100 --> 00:08:39,159
So that's what a critical
section is.

133
00:08:39,159 --> 00:08:43,640
And we have a mechanism
that operating

134
00:08:43,640 --> 00:08:45,510
systems typically provide--

135
00:08:45,510 --> 00:08:48,910
as well as runtime systems, but
you can build your own--

136
00:08:48,910 --> 00:08:53,270
called "mutexes," or "mutex
locks," or sometimes just

137
00:08:53,270 --> 00:08:59,220
"locks." So a mutex is an object
that has a lock and

138
00:08:59,220 --> 00:09:00,580
unlock member functions.

139
00:09:03,210 --> 00:09:08,140
And any attempt by a thread to
lock an already locked mutex

140
00:09:08,140 --> 00:09:11,330
causes that thread to block.

141
00:09:11,330 --> 00:09:15,000
And "block" is, by the way,
a hugely overused word in

142
00:09:15,000 --> 00:09:16,190
computer science.

143
00:09:16,190 --> 00:09:21,540
In this case, by "block," they
mean "wait." It waits until

144
00:09:21,540 --> 00:09:25,120
the mutex is unlocked.

145
00:09:25,120 --> 00:09:28,680
So suppose something is
locked, and somebody else

146
00:09:28,680 --> 00:09:31,450
comes along and tries to
grab the lock.

147
00:09:31,450 --> 00:09:33,980
The mutex mechanism
only allows one

148
00:09:33,980 --> 00:09:35,380
thread to access it.

149
00:09:35,380 --> 00:09:37,850
The other one waits until
the lock is freed.

150
00:09:37,850 --> 00:09:46,220
Then this other one
can go access it.
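
As a rough analogue of the mutex behavior described here, Python's `threading.Lock` has `acquire()` and `release()` methods, and `acquire()` waits while another holder has the lock. The sketch below is an illustration in Python, not the Cilk runtime's mutex; `try_lock_demo` and `blocking_demo` are hypothetical helper names. The first uses `acquire(blocking=False)`, which returns `False` instead of waiting, so the "already locked" case can be observed without hanging.

```python
import threading


def try_lock_demo():
    """A second attempt to lock an already-locked mutex fails (non-blocking)."""
    mutex = threading.Lock()
    first = mutex.acquire()                 # free, so this succeeds
    second = mutex.acquire(blocking=False)  # already held: returns False at once
    mutex.release()                         # unlock
    third = mutex.acquire(blocking=False)   # free again, so this succeeds
    mutex.release()
    return first, second, third


def blocking_demo():
    """A thread that tries to lock a held mutex waits until it is unlocked."""
    mutex = threading.Lock()
    events = []
    mutex.acquire()                         # main thread holds the lock

    def worker():
        with mutex:                         # blocks here until main unlocks
            events.append("worker acquired")

    t = threading.Thread(target=worker)
    t.start()
    events.append("main releasing")         # worker cannot have appended yet
    mutex.release()
    t.join()
    return events


print(try_lock_demo())  # (True, False, True)
print(blocking_demo())  # ['main releasing', 'worker acquired']
```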

151
00:09:46,220 --> 00:09:52,370
So what we can do is build a
concurrent hash table by

152
00:09:52,370 --> 00:09:59,770
modifying each slot in the table
to have both a mutex, L,

153
00:09:59,770 --> 00:10:05,290
and a pointer called "head"
to the slot contents.

154
00:10:05,290 --> 00:10:09,350
And then the idea is that
what we'll do is hash

155
00:10:09,350 --> 00:10:11,370
the value to a slot.

156
00:10:11,370 --> 00:10:14,070
But before we access the
elements of the slot, we're

157
00:10:14,070 --> 00:10:17,460
going to grab the lock
on the slot.

158
00:10:17,460 --> 00:10:19,980
So every slot in the table
has a lock here.

159
00:10:19,980 --> 00:10:22,040
Now, I could have a lock
on the whole table.

160
00:10:22,040 --> 00:10:24,820
What's the problem with that?

161
00:10:24,820 --> 00:10:25,763
Sure.

162
00:10:25,763 --> 00:10:27,013
AUDIENCE: [INAUDIBLE]

163
00:10:29,627 --> 00:10:31,076
basically can't do anything.

164
00:10:31,076 --> 00:10:32,042
You can't read.

165
00:10:32,042 --> 00:10:33,020
You couldn't be reading
from the table.

166
00:10:33,020 --> 00:10:34,470
PROFESSOR: Yeah, so
if you have a lock

167
00:10:34,470 --> 00:10:35,460
on the whole table--

168
00:10:35,460 --> 00:10:37,550
AUDIENCE: You would defeat
the purpose [INAUDIBLE]

169
00:10:37,550 --> 00:10:38,510
PROFESSOR: You defeat the
purpose of trying to have a

170
00:10:38,510 --> 00:10:41,380
concurrent hash table, right?

171
00:10:41,380 --> 00:10:44,200
Because only one thread can
actually access the

172
00:10:44,200 --> 00:10:45,260
hash table at a time.

173
00:10:45,260 --> 00:10:49,270
So in this case, what we'll do
is we'll lock each slot of the

174
00:10:49,270 --> 00:10:51,310
hash table.

175
00:10:51,310 --> 00:10:53,450
And there are actually
mechanisms where you can lock

176
00:10:53,450 --> 00:10:55,780
each element of the
hash table or a

177
00:10:55,780 --> 00:10:58,040
constant number of elements.

178
00:10:58,040 --> 00:11:01,420
But basically, what we're trying
to do is make it so

179
00:11:01,420 --> 00:11:04,480
that, if you
have a big enough table and

180
00:11:04,480 --> 00:11:08,560
relatively few processors you're
running on, the odds

181
00:11:08,560 --> 00:11:10,500
that two threads conflict are
going to be very low.

182
00:11:13,100 --> 00:11:16,780
So what we do is we grab a lock
on the slot, and then we

183
00:11:16,780 --> 00:11:19,810
play the same game of inserting

184
00:11:19,810 --> 00:11:21,670
ourselves at the head.

185
00:11:21,670 --> 00:11:23,270
And then we unlock the slot.
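
The per-slot locking scheme can be sketched as follows, keeping the lecture's names `L` and `head` for each slot. This is a Python sketch under assumptions: `ConcurrentHashTable` and the helper `parallel_insert` are hypothetical, and threads stand in for Cilk's parallel branches.

```python
import threading


class Node:
    def __init__(self, key):
        self.key = key
        self.next = None


class Slot:
    """One bin of the table: a mutex L and the head of its chain."""

    def __init__(self):
        self.L = threading.Lock()
        self.head = None


class ConcurrentHashTable:
    def __init__(self, num_slots=64):
        self.slots = [Slot() for _ in range(num_slots)]

    def insert(self, x):
        slot = self.slots[hash(x.key) % len(self.slots)]
        with slot.L:              # lock the slot (critical section begins)
            x.next = slot.head    # both pointer updates happen under L,
            slot.head = x         # so another insert cannot interleave here
        # leaving the `with` block unlocks the slot

    def keys(self):
        """All keys, slot by slot; order within a chain may vary run to run."""
        out = []
        for slot in self.slots:
            node = slot.head
            while node is not None:
                out.append(node.key)
                node = node.next
        return out


def parallel_insert(table, keys, num_threads=4):
    """Insert `keys` from several threads at once (hypothetical helper)."""
    def worker(chunk):
        for k in chunk:
            table.insert(Node(k))

    threads = [threading.Thread(target=worker, args=(keys[i::num_threads],))
               for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


table = ConcurrentHashTable(num_slots=8)
parallel_insert(table, list(range(1000)))
print(sorted(table.keys()) == list(range(1000)))  # True: no insert is lost
```

Which keys end up in the table is always the same. Only the order within a chain depends on which thread wins each lock.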

186
00:11:26,770 --> 00:11:30,860
So what that does is it means
that only one of the two

187
00:11:30,860 --> 00:11:35,690
threads in the previous example
can actually execute

188
00:11:35,690 --> 00:11:38,260
this code at a time.

189
00:11:38,260 --> 00:11:42,820
And so it guarantees that the
two regions of code will

190
00:11:42,820 --> 00:11:45,790
either execute in this order or
in this order, and you'll

191
00:11:45,790 --> 00:11:49,460
never get the instructions
interleaved.

192
00:11:49,460 --> 00:11:55,150
Now, this is introducing
non-determinism.

193
00:11:55,150 --> 00:11:58,740
Why is this going to be
non-deterministic?

194
00:11:58,740 --> 00:12:00,130
Yes?

195
00:12:00,130 --> 00:12:02,630
AUDIENCE: [INAUDIBLE] lock
first, it'll be [INAUDIBLE].

196
00:12:02,630 --> 00:12:04,510
PROFESSOR: Yeah, depending upon
which one gets the lock

197
00:12:04,510 --> 00:12:10,210
first, the linked list in there
will have the elements

198
00:12:10,210 --> 00:12:12,660
in a different order.

199
00:12:12,660 --> 00:12:16,700
So a program that depends on
the order of that list is

200
00:12:16,700 --> 00:12:19,230
going to behave differently
from run to run.

201
00:12:22,520 --> 00:12:26,610
So let's recall the definition
of a determinacy race.

202
00:12:26,610 --> 00:12:29,980
It occurs when two logically
parallel instructions access

203
00:12:29,980 --> 00:12:32,500
the same memory location,
and at least one of the

204
00:12:32,500 --> 00:12:36,080
instructions performs a write.

205
00:12:36,080 --> 00:12:41,190
So we do have a
determinacy race when we

206
00:12:41,190 --> 00:12:43,480
introduce locks.

207
00:12:43,480 --> 00:12:45,480
With locks, we're essentially
going to have an intentional

208
00:12:45,480 --> 00:12:48,680
determinacy race.

209
00:12:48,680 --> 00:12:53,480
So a program execution with no
determinacy races means the

210
00:12:53,480 --> 00:12:57,350
program is deterministic
on that input.

211
00:12:57,350 --> 00:13:02,420
So if there are no determinacy
races, then although

212
00:13:02,420 --> 00:13:05,300
different locations may be
updated in a different order

213
00:13:05,300 --> 00:13:11,680
in a parallel execution, every
memory location will be

214
00:13:11,680 --> 00:13:15,500
updated with exactly the same
sequence of values.

215
00:13:15,500 --> 00:13:18,860
The order of update
operations on any given

216
00:13:18,860 --> 00:13:20,460
location will always be the
same.

217
00:13:22,960 --> 00:13:25,350
So that's actually a theorem,
which we're

218
00:13:25,350 --> 00:13:27,950
not going to prove.

219
00:13:27,950 --> 00:13:30,700
But I think if you think
about it, it's fairly

220
00:13:30,700 --> 00:13:31,240
straightforward.

221
00:13:31,240 --> 00:13:35,430
If you never have two guys in
parallel that could possibly

222
00:13:35,430 --> 00:13:39,490
affect the same location, then
the behavior always is going

223
00:13:39,490 --> 00:13:40,760
to be the same thing.

224
00:13:40,760 --> 00:13:44,020
Things are going to get written
in the same order.

225
00:13:44,020 --> 00:13:47,200
So the program in that case
always behaves the same on

226
00:13:47,200 --> 00:13:50,370
that given input, no
matter how it's

227
00:13:50,370 --> 00:13:52,960
scheduled and executed.

228
00:13:52,960 --> 00:13:56,220
We'll always have essentially
the same behavior, even though

229
00:13:56,220 --> 00:13:58,220
it may get scheduled
one way or another.

230
00:14:04,210 --> 00:14:07,630
And one of the nice things
that we have in our race

231
00:14:07,630 --> 00:14:11,640
detection tool Cilkscreen is
that if we do have determinacy

232
00:14:11,640 --> 00:14:14,250
races that exist in
an ostensibly

233
00:14:14,250 --> 00:14:15,590
deterministic program--

234
00:14:15,590 --> 00:14:18,460
that is, a program
with no mutexes, one that

235
00:14:18,460 --> 00:14:25,880
basically just reads and
writes locations and so

236
00:14:25,880 --> 00:14:27,720
forth-- then Cilkscreen
is guaranteed

237
00:14:27,720 --> 00:14:30,250
to find such a race.

238
00:14:30,250 --> 00:14:32,980
So it's nice that we get a
guarantee out of Cilkscreen.

239
00:14:38,640 --> 00:14:43,430
So this is all beautiful,
elegant, everything works out

240
00:14:43,430 --> 00:14:48,270
great if there are no
determinacy races.

241
00:14:48,270 --> 00:14:51,510
But when we do something like a
concurrent hash table, we're

242
00:14:51,510 --> 00:14:56,240
intentionally putting in
a determinacy race.

243
00:14:56,240 --> 00:14:59,000
So that raises sort of
a natural question.

244
00:14:59,000 --> 00:15:04,050
Why would I want to have a
concurrent hash table?

245
00:15:04,050 --> 00:15:09,450
Why not make it so that my
program is deterministic?

246
00:15:09,450 --> 00:15:13,190
Why might a concurrent hash
table be an advantageous thing

247
00:15:13,190 --> 00:15:15,070
to have in a program that
you wanted to go fast?

248
00:15:19,290 --> 00:15:20,780
Some ideas?

249
00:15:20,780 --> 00:15:22,187
Where might you want
to use it?

250
00:15:22,187 --> 00:15:22,644
Yeah?

251
00:15:22,644 --> 00:15:23,560
AUDIENCE: Speed?

252
00:15:23,560 --> 00:15:24,300
PROFESSOR: Yeah, speed.

253
00:15:24,300 --> 00:15:25,300
But I mean, what's
an application?

254
00:15:25,300 --> 00:15:31,640
What's a use case, as the
entrepreneurs would ask you?

255
00:15:31,640 --> 00:15:33,350
Where is it that you would
really want to use a

256
00:15:33,350 --> 00:15:35,480
concurrent hash table
to give you speed?

257
00:15:38,420 --> 00:15:38,910
Yeah?

258
00:15:38,910 --> 00:15:41,850
AUDIENCE: If you started
using it [INAUDIBLE]

259
00:15:41,850 --> 00:15:45,430
along with your system
[INAUDIBLE]

260
00:15:45,430 --> 00:15:45,820
values.

261
00:15:45,820 --> 00:15:49,150
PROFESSOR: Yeah, it could be
that there's some sort of

262
00:15:49,150 --> 00:15:53,020
global table that you want a
lot of people to be able to

263
00:15:53,020 --> 00:15:54,270
access at one time.

264
00:15:56,990 --> 00:16:00,960
So if you locked it down and only had
one thread accessing it at a

265
00:16:00,960 --> 00:16:03,670
time, you'd reduce how
much concurrency

266
00:16:03,670 --> 00:16:04,670
you could have.

267
00:16:04,670 --> 00:16:05,796
That's a good one.

268
00:16:05,796 --> 00:16:06,768
Yeah?

269
00:16:06,768 --> 00:16:10,170
AUDIENCE: Perhaps most of the
time, people are just reading.

270
00:16:10,170 --> 00:16:12,600
So if you had something
concurrent, your reading

271
00:16:12,600 --> 00:16:15,516
should be fine.

272
00:16:15,516 --> 00:16:18,432
So in that case, a lot more
reading high performance

273
00:16:18,432 --> 00:16:21,348
[INAUDIBLE]

274
00:16:21,348 --> 00:16:25,090
PROFESSOR: Yeah, so in fact,
there's a type of lock called

275
00:16:25,090 --> 00:16:31,010
a reader-writer lock, which
allows either one writer to operate

276
00:16:31,010 --> 00:16:33,650
or many readers.

277
00:16:33,650 --> 00:16:37,120
So that's another type of
concurrency control.
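
Python's standard library has no reader-writer lock, so the following is a hand-rolled sketch built on `threading.Condition`. `ReaderWriterLock` and its method names are hypothetical, and this version gives readers preference, so a steady stream of readers could starve a writer. The helper `writer_waits_for_readers` shows two readers holding the lock together while a writer waits.

```python
import threading


class ReaderWriterLock:
    """Readers-preference reader-writer lock (a sketch, not a stdlib class).

    Any number of readers may hold it at once; a writer holds it alone.
    """

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0        # number of active readers
        self._writer = False     # True while a writer holds the lock

    def acquire_read(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()  # a waiting writer may proceed

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers > 0:
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()      # wake waiting readers and writers


def writer_waits_for_readers():
    """Two readers share the lock; a writer must wait until both release."""
    rw = ReaderWriterLock()
    rw.acquire_read()
    rw.acquire_read()                    # a second reader gets in concurrently
    acquired = threading.Event()

    def writer():
        rw.acquire_write()
        acquired.set()
        rw.release_write()

    t = threading.Thread(target=writer)
    t.start()
    blocked_while_reading = not acquired.wait(timeout=0.2)
    rw.release_read()
    rw.release_read()
    acquired_after = acquired.wait(timeout=5)
    t.join()
    return blocked_while_reading, acquired_after


print(writer_waits_for_readers())  # (True, True)
```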

278
00:16:37,120 --> 00:16:39,770
So another common
place that you'd use it

279
00:16:39,770 --> 00:16:42,440
is when you're memoizing.

280
00:16:42,440 --> 00:16:45,350
Meaning I do a computation, I
want to remember the results

281
00:16:45,350 --> 00:16:49,790
so that if I see it again, I
can look it up rather than

282
00:16:49,790 --> 00:16:52,210
having to compute it
again from scratch.

283
00:16:52,210 --> 00:16:56,060
So you might keep all those
values in a hash table.

284
00:16:56,060 --> 00:16:59,340
Well, if I go in the hash table,
now I'm going to have

285
00:16:59,340 --> 00:17:02,550
concurrent accesses to that
hash table if I've got a

286
00:17:02,550 --> 00:17:05,890
parallel program that wants
to do memoizing.
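
A shared memo table of this kind might look like the sketch below. It is Python, with a hypothetical `Memo` class. For brevity a single lock guards the whole table, although per-slot locks, as described earlier in the lecture, would reduce contention. Two threads can both miss on the same key and compute it twice. That is itself a determinacy race, and it is harmless only because the memoized function is assumed to be pure.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Memo:
    """Memoize a pure function in a dict shared by many threads (sketch)."""

    def __init__(self, fn):
        self.fn = fn
        self.table = {}
        self.lock = threading.Lock()

    def __call__(self, arg):
        with self.lock:
            if arg in self.table:
                return self.table[arg]   # hit: reuse the stored result
        result = self.fn(arg)            # miss: compute outside the lock
        with self.lock:
            # If another thread stored this key meanwhile, keep its value;
            # either value is fine because fn is assumed to be pure.
            return self.table.setdefault(arg, result)


square = Memo(lambda n: n * n)
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(square, [3, 4, 3, 5, 4]))
print(results)  # [9, 16, 9, 25, 16]
```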

287
00:17:05,890 --> 00:17:09,050
And there are a bunch
of other cases.

288
00:17:09,050 --> 00:17:11,300
So we have determinacy races.

289
00:17:11,300 --> 00:17:15,810
And we have a great guarantee
that if there is a race, we're

290
00:17:15,810 --> 00:17:17,208
guaranteed to find it.

291
00:17:20,869 --> 00:17:23,500
Now, there's another type of
race, and in fact, you'll hear

292
00:17:23,500 --> 00:17:26,859
more about this type of race
if you read the literature

293
00:17:26,859 --> 00:17:30,170
than you hear about
determinacy races.

294
00:17:30,170 --> 00:17:34,440
So a data race occurs when you
have two logically parallel

295
00:17:34,440 --> 00:17:39,060
instructions holding
no locks in common.

296
00:17:39,060 --> 00:17:41,400
And they access the same
location, and at least one of

297
00:17:41,400 --> 00:17:44,730
the instructions performs
a write.

298
00:17:44,730 --> 00:17:49,310
So this is saying that
I've got accesses.

299
00:17:49,310 --> 00:17:51,340
And if they have no
locks in common--

300
00:17:51,340 --> 00:17:54,490
so it could be that you have
a problem where one of them

301
00:17:54,490 --> 00:17:59,720
holds a lock L, and another
one holds L prime.

302
00:17:59,720 --> 00:18:04,270
And then they access the
location, that's going to be a

303
00:18:04,270 --> 00:18:08,030
data race, because they don't
hold locks in common.

304
00:18:08,030 --> 00:18:13,360
But if I have L and L being the
locks that the two threads

305
00:18:13,360 --> 00:18:15,250
hold, and they access the
same location, that's

306
00:18:15,250 --> 00:18:16,770
not a data race now.

307
00:18:16,770 --> 00:18:20,410
It is a determinacy race,
because it's going to matter

308
00:18:20,410 --> 00:18:23,780
which order it is, but it's not
a data race, because the

309
00:18:23,780 --> 00:18:27,160
locks, in some sense, are
protecting access.

310
00:18:27,160 --> 00:18:30,460
So Cilkscreen, in fact,
understands locks and will not

311
00:18:30,460 --> 00:18:34,515
report a determinacy race unless
it is also a data race.
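
The two definitions can be written as predicates over a pair of accesses. This is a toy model for illustration only, not how Cilkscreen works internally. `Access`, `is_determinacy_race`, and `is_data_race` are hypothetical names, and the caller is assumed to pass only logically parallel accesses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    """One memory access by a strand (hypothetical model, not Cilkscreen's)."""
    location: str
    is_write: bool
    locks_held: frozenset = frozenset()


def is_determinacy_race(a: Access, b: Access) -> bool:
    """Assumes a and b are logically parallel: same location, at least one write."""
    return a.location == b.location and (a.is_write or b.is_write)


def is_data_race(a: Access, b: Access) -> bool:
    """A determinacy race in which the two accesses hold no locks in common."""
    return is_determinacy_race(a, b) and not (a.locks_held & b.locks_held)


# Both inserts write the slot's head while holding the same lock L:
a = Access("head", True, frozenset({"L"}))
b = Access("head", True, frozenset({"L"}))
print(is_determinacy_race(a, b), is_data_race(a, b))  # True False

# One holds L, the other holds L' (a different lock):
c = Access("head", True, frozenset({"L'"}))
print(is_determinacy_race(a, c), is_data_race(a, c))  # True True
```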

312
00:18:38,880 --> 00:18:44,530
However, since codes that use
locks are non-deterministic by

313
00:18:44,530 --> 00:18:50,440
intention, they actually weaken
Cilkscreen's guarantee.

314
00:18:50,440 --> 00:18:54,820
And in particular, in the
execution that it analyzes, if it

315
00:18:54,820 --> 00:18:58,250
finds a determinacy race that isn't
a data race, it's going
to say, I'm going to ignore

316
00:18:58,250 --> 00:18:59,240
that race.

317
00:18:59,240 --> 00:19:04,530
But now it is only going to
follow one of the two paths

318
00:19:04,530 --> 00:19:06,640
that might arise from
that race.

319
00:19:11,460 --> 00:19:13,990
In other words, it doesn't
follow both paths.

320
00:19:13,990 --> 00:19:18,480
If you think about it,
when one of them wins--

321
00:19:18,480 --> 00:19:20,970
so you have a race between
two critical sections.

322
00:19:20,970 --> 00:19:25,080
When one of them wins, you can
imagine that's one possible

323
00:19:25,080 --> 00:19:26,300
outcome of the computation.

324
00:19:26,300 --> 00:19:31,090
When the other wins,
it's another path.

325
00:19:31,090 --> 00:19:33,420
And what Cilkscreen does
is it picks one path.

326
00:19:33,420 --> 00:19:36,350
In fact, it picks the path which
is the one that would

327
00:19:36,350 --> 00:19:40,120
occur in the serial execution.

328
00:19:40,120 --> 00:19:42,210
So there's a whole path there
that you're not exploring.

329
00:19:45,240 --> 00:19:49,420
So Cilkscreen's guarantee is not
going to be strong there.

330
00:19:49,420 --> 00:19:53,050
However, if the critical
sections, in fact, commute--

331
00:19:53,050 --> 00:19:55,230
that is, they produce exactly
the same result, no

332
00:19:55,230 --> 00:19:57,450
matter what the order,

333
00:19:57,450 --> 00:20:00,780
as when, for example, they're both
incrementing a value, so

334
00:20:00,780 --> 00:20:04,500
that the result of doing one first
versus the other first is

335
00:20:04,500 --> 00:20:07,970
the same value-- then you get a
guarantee out of Cilkscreen.
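
Commuting critical sections can be illustrated with a lock-protected counter. In this Python sketch (`locked_increments` is a hypothetical name), which thread wins each race for the lock varies from run to run, but because increments commute, the final total is the same for every schedule.

```python
import threading


def locked_increments(num_threads=8, per_thread=1000):
    """Critical sections that commute: every schedule yields the same total."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(per_thread):
            with lock:        # critical section: increments commute
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


print(locked_increments())  # 8000, whichever thread wins each race
```

By contrast, if each critical section appended its thread's id to a shared list, the sections would not commute, and the final list could differ from run to run.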

336
00:20:10,860 --> 00:20:13,460
So Cilkscreen could still be
very helpful for finding bugs,

337
00:20:13,460 --> 00:20:17,930
because typically, when you
organize your computation, if

338
00:20:17,930 --> 00:20:22,180
things occur in one order,
there's typically some

339
00:20:22,180 --> 00:20:24,870
execution or input where you can
make them occur in the

340
00:20:24,870 --> 00:20:26,920
other order.

341
00:20:26,920 --> 00:20:32,040
So you can actually cover more
races than you might imagine

342
00:20:32,040 --> 00:20:33,920
at first blush.

343
00:20:33,920 --> 00:20:36,286
But it is a danger.

344
00:20:36,286 --> 00:20:38,140
And what we're talking about
today is dangerous

345
00:20:38,140 --> 00:20:42,180
programming, non-deterministic
programming.

346
00:20:42,180 --> 00:20:45,230
So when you start using
mutexes, some of the

347
00:20:45,230 --> 00:20:49,600
guarantees and so forth
get much dicier.

348
00:20:49,600 --> 00:20:50,850
Any questions about that?

349
00:20:55,120 --> 00:20:59,840
Now, if you have no data races
in your code, that doesn't

350
00:20:59,840 --> 00:21:04,430
mean that you have no bugs.

351
00:21:04,430 --> 00:21:09,970
So for example, here's a way
somebody might fix that

352
00:21:09,970 --> 00:21:12,070
insertion code.

353
00:21:12,070 --> 00:21:18,750
So we hash the key, we grab a
lock, we set x's next pointer to be

354
00:21:18,750 --> 00:21:21,290
whatever is the head
of the list, and

355
00:21:21,290 --> 00:21:23,820
then we do an unlock.

356
00:21:23,820 --> 00:21:25,890
And now we lock it again.

357
00:21:25,890 --> 00:21:29,500
Now we follow the
head to set x--

358
00:21:29,500 --> 00:21:32,190
sorry, we make x
the head of the list

359
00:21:32,190 --> 00:21:33,790
and then unlock again.

360
00:21:33,790 --> 00:21:37,160
And now notice that in this
case, technically, there is no

361
00:21:37,160 --> 00:21:41,160
data race if I have two
concurrent threads trying to

362
00:21:41,160 --> 00:21:43,800
access these at a time, because
all the accesses I'm

363
00:21:43,800 --> 00:21:47,260
doing, I'm holding lock L.
Nevertheless, I can get that

364
00:21:47,260 --> 00:21:54,610
same interleaving of code
that causes the bug.
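
A replay of this scenario shows that holding a lock for every access does not prevent the bug. In this Python sketch (hypothetical names, dict-based nodes as an assumption for brevity), each pointer update is its own critical section under lock `L`, so there is no data race. The schedule below is still legal, and it loses y exactly as in the unlocked version.

```python
import threading


def replay_split_critical_sections():
    """Each pointer update holds lock L, yet one schedule still loses y."""
    L = threading.Lock()
    slot = {"head": None}
    x = {"key": "x", "next": None}
    y = {"key": "y", "next": None}

    def link_to_head(node):   # first critical section
        with L:
            node["next"] = slot["head"]

    def become_head(node):    # second critical section
        with L:
            slot["head"] = node

    # One legal schedule of the two threads; no access happens without L.
    link_to_head(x)  # thread 1
    link_to_head(y)  # thread 2
    become_head(y)   # thread 2
    become_head(x)   # thread 1: overwrites y

    keys, node = [], slot["head"]
    while node is not None:
        keys.append(node["key"])
        node = node["next"]
    return keys


print(replay_split_critical_sections())  # ['x']
```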

365
00:21:54,610 --> 00:21:58,690
So just because you don't have
a data race doesn't mean that

366
00:21:58,690 --> 00:22:01,280
you don't have a bug
in your code.

367
00:22:01,280 --> 00:22:02,790
As I say, this is dangerous
programming.

368
00:22:05,580 --> 00:22:10,490
However, typically, if you
have mutexes and no data

369
00:22:10,490 --> 00:22:14,290
races, usually it means that you
went through and thought

370
00:22:14,290 --> 00:22:15,040
about this code.

371
00:22:15,040 --> 00:22:19,530
And if you were thinking about
this code, you would say, gee,

372
00:22:19,530 --> 00:22:22,520
really I'm trying to make these
two instructions be the

373
00:22:22,520 --> 00:22:23,400
critical section.

374
00:22:23,400 --> 00:22:26,590
Why would I unlock
and lock again?

375
00:22:26,590 --> 00:22:30,610
So most of the time, as a
practical matter, if you don't

376
00:22:30,610 --> 00:22:33,430
have data races, it probably
means you did the right thing

377
00:22:33,430 --> 00:22:36,570
in terms of identifying the
critical sections that needed

378
00:22:36,570 --> 00:22:41,330
to be locked and not unlocking
things in the middle of them.

379
00:22:41,330 --> 00:22:45,710
So as a practical matter, no
data races usually means it's

380
00:22:45,710 --> 00:22:46,960
unlikely you have bugs.

381
00:22:49,910 --> 00:22:50,860
But no guarantees.

382
00:22:50,860 --> 00:22:53,330
As I say, dangerous
programming.

383
00:22:53,330 --> 00:22:55,310
Non-deterministic programming
is dangerous programming.

384
00:22:57,980 --> 00:22:59,230
Any questions about that?

385
00:23:02,910 --> 00:23:06,550
Anybody scared off yet?

386
00:23:06,550 --> 00:23:07,530
Yeah?

387
00:23:07,530 --> 00:23:10,778
AUDIENCE: So what you can
do is the opposite.

388
00:23:10,778 --> 00:23:13,634
You don't have any bugs, but
you made the critical

389
00:23:13,634 --> 00:23:15,070
section to [INAUDIBLE]

390
00:23:15,070 --> 00:23:18,310
PROFESSOR: Yes, so certainly
from a performance point of

391
00:23:18,310 --> 00:23:21,080
view, one of the problems
with locking is that--

392
00:23:21,080 --> 00:23:22,780
and we'll talk about this
a little bit later--

393
00:23:22,780 --> 00:23:26,300
if you have
a large section that you

394
00:23:26,300 --> 00:23:29,640
decide to lock, it means
other threads can't do

395
00:23:29,640 --> 00:23:31,870
work on that section.

396
00:23:31,870 --> 00:23:34,600
So they're spinning,
wasting cycles.

397
00:23:34,600 --> 00:23:36,020
So generally, you want
to try to lock

398
00:23:36,020 --> 00:23:38,970
regions that are as small as possible.

399
00:23:38,970 --> 00:23:41,390
The other problem is, it turns
out that there's overhead

400
00:23:41,390 --> 00:23:44,550
associated with these locks.

401
00:23:44,550 --> 00:23:47,150
So if there's overhead
associated with the locks,

402
00:23:47,150 --> 00:23:50,610
that's problematic as well,
because now you

403
00:23:50,610 --> 00:23:52,180
may be slowing things down.

404
00:23:52,180 --> 00:23:56,460
If this is in an inner loop,
notice that, even if

405
00:23:56,460 --> 00:24:00,430
I just have the lock and unlock
without these two

406
00:24:00,430 --> 00:24:03,570
spurious ones here, we
may be more than

407
00:24:03,570 --> 00:24:04,610
doubling the overhead.

408
00:24:04,610 --> 00:24:07,760
In fact, locking instructions
tend to be much more expensive

409
00:24:07,760 --> 00:24:10,140
than register operations.

410
00:24:10,140 --> 00:24:15,770
They usually cost something on
the order of going to L2 cache

411
00:24:15,770 --> 00:24:18,050
as a minimum.

412
00:24:18,050 --> 00:24:20,210
So it's not even L1 cache.

413
00:24:20,210 --> 00:24:21,730
It's like going out
to L2 cache.

414
00:24:24,780 --> 00:24:27,740
Now, it turns out there
are some times where

415
00:24:27,740 --> 00:24:28,870
you have data races.

416
00:24:28,870 --> 00:24:32,720
So we say if there are no data
races, then you have no

417
00:24:32,720 --> 00:24:34,540
guarantee there are no bugs.

418
00:24:34,540 --> 00:24:41,440
If there are data races, your
program still may be correct.

419
00:24:41,440 --> 00:24:48,350
Here's an example of code
where you might want to allow

420
00:24:48,350 --> 00:24:50,460
a benign data race.

421
00:24:50,460 --> 00:24:53,470
So here we have, let's say,
an array A that has these

422
00:24:53,470 --> 00:24:54,930
elements in it.

423
00:24:54,930 --> 00:24:56,960
And we want to find,
what is the set of

424
00:24:56,960 --> 00:24:59,680
digits in the array?

425
00:24:59,680 --> 00:25:02,880
So these are all going to be
values between 0 and 9.

426
00:25:02,880 --> 00:25:05,200
And I want to know which
ones are present of

427
00:25:05,200 --> 00:25:06,230
the digits 0 to 9.

428
00:25:06,230 --> 00:25:09,110
Which ones are not present?

429
00:25:09,110 --> 00:25:10,710
So I can write a little
code for that.

430
00:25:10,710 --> 00:25:14,430
Let me initialize an array
called "digits" to have

431
00:25:14,430 --> 00:25:16,900
all-zero entries.

432
00:25:16,900 --> 00:25:24,880
And now let me go through all
the elements of A and set

433
00:25:24,880 --> 00:25:28,790
digits of whatever the
digit is to be 1.

434
00:25:28,790 --> 00:25:32,920
So set it to 1 if that
digit is present.

435
00:25:32,920 --> 00:25:35,000
And I can do that in
parallel, even.
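
Here is a sketch of the digit-set example. It assumes POSIX threads in place of Cilk's parallel loop, with two threads each scanning half of `A`. The names are hypothetical. The lecture's version stores into a plain array. This sketch uses C11 relaxed atomic stores instead, which keeps the program well-defined under the C11 memory model. On x86-64, such a store typically compiles to the same single-byte store, with no lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* One flag per decimal digit (0 = absent, 1 = present). */
static atomic_uchar digits[10];

typedef struct { const int *a; size_t lo, hi; } range_t;

static void *mark_digits(void *arg) {
    const range_t *r = arg;
    for (size_t i = r->lo; i < r->hi; i++) {
        assert(r->a[i] >= 0 && r->a[i] <= 9);
        /* Both threads may store 1 into the same flag at the same time. That is a
           race, but every racing write stores the same value, so the final
           contents do not depend on the interleaving. */
        atomic_store_explicit(&digits[r->a[i]], 1, memory_order_relaxed);
    }
    return NULL;
}

/* Records which digits occur in a[0..n), splitting the scan across two threads. */
void find_digits(const int *a, size_t n) {
    for (int d = 0; d < 10; d++)
        atomic_store_explicit(&digits[d], 0, memory_order_relaxed);
    range_t left = { a, 0, n / 2 }, right = { a, n / 2, n };
    pthread_t t;
    if (pthread_create(&t, NULL, mark_digits, &right) != 0) {
        mark_digits(&right);   /* fall back to doing both halves on this thread */
        mark_digits(&left);
        return;
    }
    mark_digits(&left);
    pthread_join(t, NULL);     /* joining makes the other thread's stores visible */
}
```

Calling `find_digits` on `{3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5}` leaves the flags for 0, 7, and 8 cleared and sets the rest.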

436
00:25:37,880 --> 00:25:42,500
So what can happen here is I can
have, if I've done this in

437
00:25:42,500 --> 00:25:48,580
parallel, this particular update
of digits of 6 will be

438
00:25:48,580 --> 00:25:52,170
set to 1 when this one
is being set to 1.

439
00:25:52,170 --> 00:25:54,060
Is that a problem?

440
00:25:54,060 --> 00:25:54,850
In some sense, no.

441
00:25:54,850 --> 00:25:56,090
They're both being set to 1.

442
00:25:56,090 --> 00:25:58,190
Who cares?

443
00:25:58,190 --> 00:26:01,370
But there is a race there.

444
00:26:01,370 --> 00:26:04,130
There is a race, but
it's a benign race.

445
00:26:04,130 --> 00:26:07,370
Well, it may or may
not be benign.

446
00:26:07,370 --> 00:26:12,030
So there's a gotcha
on this one.

447
00:26:12,030 --> 00:26:15,320
So this code only works
correctly if the hardware

448
00:26:15,320 --> 00:26:16,850
writes the array elements
atomically.

449
00:26:19,410 --> 00:26:22,610
So for example, not
on the x86-64

450
00:26:22,610 --> 00:26:23,740
architecture we're using.

451
00:26:23,740 --> 00:26:29,380
But on some architectures, you
cannot write a byte value.

452
00:26:29,380 --> 00:26:32,902
You cannot write a byte value
as an atomic operation.

453
00:26:32,902 --> 00:26:37,040
It implements a write to a
byte by reading a word,

454
00:26:37,040 --> 00:26:40,840
masking out things, changing the
field, masking again, and

455
00:26:40,840 --> 00:26:42,720
then writing it back out.

456
00:26:42,720 --> 00:26:44,650
So you can have a race
on a byte value.
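
The following single-threaded C sketch makes the read-mask-write sequence concrete. It shows the steps such a machine performs to store one byte into a 32-bit word. It illustrates the idea rather than any particular architecture's behavior, and the function name is hypothetical. The comments explain the race; the code does not demonstrate it.

```c
#include <assert.h>
#include <stdint.h>

/* Emulates a byte store on a machine that can only write whole words: write byte
   `idx` (0..3, counted from the least significant byte) of `*word`. The read and
   the write-back are separate steps. If another thread changes a different byte
   of the same word in between, this write-back silently undoes that change. */
void store_byte_via_word(uint32_t *word, unsigned idx, uint8_t value) {
    assert(idx < 4);
    unsigned shift = idx * 8;
    uint32_t w = *word;                    /* 1. read the whole word */
    w &= ~((uint32_t)0xFF << shift);       /* 2. mask out the old byte */
    w |= (uint32_t)value << shift;         /* 3. insert the new byte */
    *word = w;                             /* 4. write the whole word back */
}
```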

457
00:26:44,650 --> 00:26:47,620
In particular, even if I were
going to do this with bits, I

458
00:26:47,620 --> 00:26:52,360
could have a race on bits,
although C doesn't let me

459
00:26:52,360 --> 00:26:54,870
access bits directly.

460
00:26:54,870 --> 00:26:59,670
The smallest unit I can
access is a byte.

461
00:26:59,670 --> 00:27:02,970
So you have to worry about
what's the level of atomicity

462
00:27:02,970 --> 00:27:04,520
provided by your architecture?

463
00:27:04,520 --> 00:27:09,470
So on the x86 architecture, the
grain size of atomic update is a byte:

464
00:27:09,470 --> 00:27:12,040
you can do a single-byte write,
and it will do the

465
00:27:12,040 --> 00:27:14,370
right, proper thing--

466
00:27:14,370 --> 00:27:17,020
do the right thing
on the write.

467
00:27:17,020 --> 00:27:19,780
So we have both things.

468
00:27:19,780 --> 00:27:20,740
No bugs.

469
00:27:20,740 --> 00:27:23,370
No data races doesn't
mean no bugs.

470
00:27:23,370 --> 00:27:27,090
Presence of data races doesn't
mean you have bugs.

471
00:27:27,090 --> 00:27:31,000
But generally, they're fairly
well overlapped.

472
00:27:31,000 --> 00:27:38,740
Now, why would I not want to put
in a lock and unlock here

473
00:27:38,740 --> 00:27:41,030
just to get rid of the race?

474
00:27:41,030 --> 00:27:43,230
If I run Cilkscreen on this,
it's going to complain.

475
00:27:43,230 --> 00:27:46,820
It's going to say, you've
got a race here.

476
00:27:46,820 --> 00:27:49,870
Why would I not want to put a
lock on here, for example?

477
00:27:53,758 --> 00:27:55,216
AUDIENCE: Because then we don't

478
00:27:55,216 --> 00:27:57,160
have parallelism anymore?

479
00:27:57,160 --> 00:28:00,870
PROFESSOR: No, well, I'd have
parallelism maybe up to 10,

480
00:28:00,870 --> 00:28:03,100
for example, right?

481
00:28:03,100 --> 00:28:04,880
Because I have 10 different
things that could be

482
00:28:04,880 --> 00:28:06,510
going on at a time.

483
00:28:06,510 --> 00:28:09,290
But that's one reason.

484
00:28:09,290 --> 00:28:10,890
That is one reason.

485
00:28:10,890 --> 00:28:12,310
What's another reason why
I might not want to

486
00:28:12,310 --> 00:28:14,563
put locks in here?

487
00:28:14,563 --> 00:28:15,813
AUDIENCE: [INAUDIBLE]

488
00:28:18,802 --> 00:28:21,400
PROFESSOR: It could be that
all the numbers--

489
00:28:21,400 --> 00:28:24,760
that's a case where it doesn't
get me much speedup.

490
00:28:24,760 --> 00:28:26,680
But what's another reason
I might want to do this?

491
00:28:26,680 --> 00:28:27,930
AUDIENCE: [INAUDIBLE]

492
00:28:31,927 --> 00:28:34,990
PROFESSOR: I think you're
on the right track.

493
00:28:34,990 --> 00:28:35,620
Overhead.

494
00:28:35,620 --> 00:28:36,370
Yeah.

495
00:28:36,370 --> 00:28:37,040
Overhead.

496
00:28:37,040 --> 00:28:39,730
This is my inner loop.

497
00:28:39,730 --> 00:28:41,970
So if I'm locking and unlocking,
all this is doing

498
00:28:41,970 --> 00:28:44,310
is just doing a memory
dereference

499
00:28:44,310 --> 00:28:46,670
and an assignment.

500
00:28:46,670 --> 00:28:49,890
And that may be fairly cheap,
whereas if I grab a lock and

501
00:28:49,890 --> 00:28:53,590
then release the lock, those
operations may be much, much

502
00:28:53,590 --> 00:28:55,810
more expensive.

503
00:28:55,810 --> 00:29:00,110
So I may be slowing down the
execution of this loop by more

504
00:29:00,110 --> 00:29:05,860
than I'm going to gain out of
the parallelism of this.

505
00:29:05,860 --> 00:29:09,810
So I may say, I may reason, hey,
there is a good reason

506
00:29:09,810 --> 00:29:14,530
to have a data
race there.

507
00:29:19,280 --> 00:29:21,780
So I may want to have a data
race, and I may want

508
00:29:21,780 --> 00:29:23,540
to say that's OK.

509
00:29:23,540 --> 00:29:25,690
And if that happens, however,
you're now going to get

510
00:29:25,690 --> 00:29:27,060
warnings out of Cilkscreen.

511
00:29:27,060 --> 00:29:30,370
And I generally recommend that
you have no warnings on

512
00:29:30,370 --> 00:29:33,610
Cilkscreen when you
run your code.

513
00:29:33,610 --> 00:29:38,280
So the Cilk environment provides
a mechanism called

514
00:29:38,280 --> 00:29:45,310
"fake locks." So a fake lock
allows you to communicate to

515
00:29:45,310 --> 00:29:47,290
Cilkscreen that a race
is intentional.

516
00:29:47,290 --> 00:29:51,460
So what you then do is you put
a fake lock in around this

517
00:29:51,460 --> 00:29:53,260
access here.

518
00:29:53,260 --> 00:29:56,760
And what happens is when
Cilkscreen runs, it says, oh,

519
00:29:56,760 --> 00:30:01,760
you grabbed this lock, so I
shouldn't report a race.

520
00:30:01,760 --> 00:30:08,610
But during execution, no lock
is actually grabbed, because

521
00:30:08,610 --> 00:30:10,000
it's a fake one.

522
00:30:10,000 --> 00:30:15,470
So it doesn't slow you down at
all at runtime, but Cilkscreen

523
00:30:15,470 --> 00:30:18,970
still thinks that a lock
is being acquired.

524
00:30:18,970 --> 00:30:21,160
Questions about that?

525
00:30:21,160 --> 00:30:25,050
So this is if you want to have
an intentional race, this is a

526
00:30:25,050 --> 00:30:26,200
way you can quiet Cilkscreen.

527
00:30:26,200 --> 00:30:28,360
Of course, it's dangerous,
right?

528
00:30:28,360 --> 00:30:31,610
It's yet another example of
what's dangerous here.

529
00:30:31,610 --> 00:30:33,150
Because what happens if
you did it wrong?

530
00:30:33,150 --> 00:30:35,350
What happens if there really
is a bug there?

531
00:30:35,350 --> 00:30:38,080
You're now telling it
to ignore that bug.

532
00:30:38,080 --> 00:30:41,960
So one way that you can
make your code--

533
00:30:41,960 --> 00:30:44,720
if you put in fake locks
everywhere, you could make it

534
00:30:44,720 --> 00:30:48,010
so, oh, Cilkscreen runs just
great, and have your code full

535
00:30:48,010 --> 00:30:49,260
of race bugs.

536
00:30:51,400 --> 00:30:56,020
So if you use fake locks, you
should document very carefully

537
00:30:56,020 --> 00:30:59,340
that you're doing so and why
that's going to be a safe

538
00:30:59,340 --> 00:31:02,160
thing to do.
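
Here is a sketch of the fake-lock pattern together with the kind of documentation the lecture recommends. The names `fake_lock_t`, `fake_lock_acquire`, and `fake_lock_release` are hypothetical placeholders, not the actual Cilk API. A race detector honors only the annotations it recognizes, so take the real names from your tool's documentation. At run time these calls compile to nothing.

```c
#include <assert.h>

/* Hypothetical fake-lock annotation: no lock is taken and nothing happens at run
   time. A race detector that recognized these calls (as Cilkscreen does for
   Cilk's fake locks) would treat the bracketed accesses as lock-protected. */
typedef struct { int unused; } fake_lock_t;
static inline void fake_lock_acquire(fake_lock_t *l) { (void)l; }
static inline void fake_lock_release(fake_lock_t *l) { (void)l; }

static fake_lock_t digits_fake_lock;

/* INTENTIONAL RACE: parallel callers may store 1 into the same flag at once.
   This is considered safe only because every racing write stores the same value
   and the target hardware writes single bytes atomically (true on x86-64). The
   C11 memory model still formally calls this a data race; the justification
   rests on the hardware argument. Do not reuse this for writes of differing values. */
void mark_digit(unsigned char *digits, int d) {
    assert(d >= 0 && d <= 9);
    fake_lock_acquire(&digits_fake_lock);
    digits[d] = 1;
    fake_lock_release(&digits_fake_lock);
}
```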

539
00:31:02,160 --> 00:31:03,410
Any questions about that?

540
00:31:07,676 --> 00:31:10,860
By the way, one of the nice
things about some of the

541
00:31:10,860 --> 00:31:15,240
concurrency platforms like Cilk
is that they provide a

542
00:31:15,240 --> 00:31:17,830
layer of abstraction where
generally, you don't have to

543
00:31:17,830 --> 00:31:19,180
do very much locking.

544
00:31:19,180 --> 00:31:22,670
If you program with Pthreads,
for example, you're locking

545
00:31:22,670 --> 00:31:24,660
all the time.

546
00:31:24,660 --> 00:31:26,990
So you're writing
non-deterministic programs all

547
00:31:26,990 --> 00:31:29,500
the time, and you're debugging
non-deterministic

548
00:31:29,500 --> 00:31:31,340
programs all the time.

549
00:31:31,340 --> 00:31:34,230
Whereas Cilk provides a layer
of programming where you can

550
00:31:34,230 --> 00:31:38,040
do most of your programming in
a deterministic fashion.

551
00:31:38,040 --> 00:31:42,420
And occasionally, you may want
to have some non-determinism

552
00:31:42,420 --> 00:31:43,020
here or there.

553
00:31:43,020 --> 00:31:49,580
But hopefully you can manage
that if you do it judiciously.

554
00:31:49,580 --> 00:31:53,670
Any questions about
mutexes and uses

555
00:31:53,670 --> 00:31:56,140
for them and so forth?

556
00:31:56,140 --> 00:31:57,880
Good.

557
00:31:57,880 --> 00:31:59,920
So let's talk about how
they get implemented.

558
00:31:59,920 --> 00:32:02,900
Because as with all these
things, we want to understand

559
00:32:02,900 --> 00:32:05,990
not just what the abstraction
is but how it is that you

560
00:32:05,990 --> 00:32:10,280
actually implement these things
so that you can reason

561
00:32:10,280 --> 00:32:14,510
about them more cogently.

562
00:32:14,510 --> 00:32:19,900
So there are typically three major
properties of mutexes

563
00:32:19,900 --> 00:32:20,810
when you look at them.

564
00:32:20,810 --> 00:32:23,580
And when you see documentation
for mutexes, you should

565
00:32:23,580 --> 00:32:26,740
understand what the differences
are among these things.

566
00:32:26,740 --> 00:32:29,080
The first is whether it's
a yielding mutex

567
00:32:29,080 --> 00:32:31,850
or a spinning mutex.

568
00:32:31,850 --> 00:32:36,190
So a yielding mutex, instead of
spinning, returns control to

569
00:32:36,190 --> 00:32:37,700
the operating system.

570
00:32:37,700 --> 00:32:40,170
And why might you
want to do that?

571
00:32:40,170 --> 00:32:43,700
Whereas a spinning one just
consumes processor cycles.

572
00:32:43,700 --> 00:32:44,950
Why would you want to do that?

573
00:32:47,926 --> 00:32:48,412
Yeah.

574
00:32:48,412 --> 00:32:50,360
AUDIENCE: [INAUDIBLE]
allow other threads.

575
00:32:50,360 --> 00:32:53,090
PROFESSOR: Yeah, it can allow
other threads or other jobs

576
00:32:53,090 --> 00:32:56,180
that could be running
to use the processor

577
00:32:56,180 --> 00:32:57,010
while you're waiting.

578
00:32:57,010 --> 00:32:58,330
What's the downside of that?

579
00:33:00,830 --> 00:33:02,290
To speak to the-- either one.

580
00:33:02,290 --> 00:33:03,787
Go ahead.

581
00:33:03,787 --> 00:33:07,945
AUDIENCE: It might be possible
that whatever you're trying to

582
00:33:07,945 --> 00:33:11,771
do is essential, and you really
want to get that done

583
00:33:11,771 --> 00:33:13,268
[UNINTELLIGIBLE] everything
else executes.

584
00:33:13,268 --> 00:33:14,250
So you really want [INAUDIBLE]

585
00:33:14,250 --> 00:33:19,760
PROFESSOR: Yeah, context
switching a thread out is a

586
00:33:19,760 --> 00:33:22,100
heavyweight operation.

587
00:33:22,100 --> 00:33:25,010
And it may be, if you end up
context switching out, it may

588
00:33:25,010 --> 00:33:29,290
be you only had to wait for a
half a dozen cycles and you'd

589
00:33:29,290 --> 00:33:30,480
have the lock.

590
00:33:30,480 --> 00:33:33,700
But instead, now you're going
and you're doing a context

591
00:33:33,700 --> 00:33:36,250
switch and may not get access
to the machine for another

592
00:33:36,250 --> 00:33:38,750
hundredth of a second
or something.

593
00:33:38,750 --> 00:33:43,640
So it may be on the order of
10 to the 6th-- a million

594
00:33:43,640 --> 00:33:45,660
instructions before you
get access again,

595
00:33:45,660 --> 00:33:49,500
rather than just a few.

596
00:33:49,500 --> 00:33:54,520
The second property of mutexes
is whether they're

597
00:33:54,520 --> 00:33:56,850
reentrant or not.

598
00:33:56,850 --> 00:34:02,550
So a reentrant mutex allows a
thread that's holding a lock

599
00:34:02,550 --> 00:34:05,340
to acquire it again.

600
00:34:05,340 --> 00:34:08,219
So I may hold the lock, and then
I may try to acquire the

601
00:34:08,219 --> 00:34:10,570
lock again.

602
00:34:10,570 --> 00:34:19,139
Java is full of reentrant locks,
reentrant mutexes.

603
00:34:19,139 --> 00:34:22,530
So why is this a positive
or negative?

604
00:34:22,530 --> 00:34:26,130
What are the pros and
cons of this one?

605
00:34:26,130 --> 00:34:29,699
Why might reentrancy be
a good thing to want?

606
00:34:34,690 --> 00:34:38,190
Why would I bother doing--

607
00:34:38,190 --> 00:34:42,043
why would I grab a lock
that I already have?

608
00:34:42,043 --> 00:34:45,375
AUDIENCE: It'd be too easy to
do a check [INAUDIBLE].

609
00:34:45,375 --> 00:34:46,739
PROFESSOR: It lets
you do what?

610
00:34:46,739 --> 00:34:49,491
AUDIENCE: It lets you not have
to worry about locking when

611
00:34:49,491 --> 00:34:49,804
you already have a lock.

612
00:34:49,804 --> 00:34:51,610
PROFESSOR: It lets you
not worry about it.

613
00:34:51,610 --> 00:34:52,000
That's right.

614
00:34:52,000 --> 00:34:52,944
But why is that valuable?

615
00:34:52,944 --> 00:34:55,159
AUDIENCE: It saves you one
line in an If statement to

616
00:34:55,159 --> 00:34:58,069
check if you have
a lock or not.

617
00:34:58,069 --> 00:34:59,640
PROFESSOR: That could be.

618
00:34:59,640 --> 00:35:02,320
Basically, the If statement
is embedded in there.

619
00:35:02,320 --> 00:35:03,130
But why would I care?

620
00:35:03,130 --> 00:35:05,320
Why would I want to be acquiring
something that I

621
00:35:05,320 --> 00:35:06,570
already have?

622
00:35:09,410 --> 00:35:12,600
In what programming situation
might that arise?

623
00:35:12,600 --> 00:35:15,510
This seems kind of
weird, right?

624
00:35:15,510 --> 00:35:16,970
Could be recursion.

625
00:35:16,970 --> 00:35:17,890
Yeah.

626
00:35:17,890 --> 00:35:20,880
So usually, what it comes from
is when you have objects, and

627
00:35:20,880 --> 00:35:23,480
you have several methods
on the object.

628
00:35:23,480 --> 00:35:27,070
And what you'd like to do is,
if somebody's calling the

629
00:35:27,070 --> 00:35:33,440
method from the outside, you
would like to be able to

630
00:35:33,440 --> 00:35:35,680
execute that particular--

631
00:35:35,680 --> 00:35:41,650
I guess in C++ they don't call
them "methods." They call them

632
00:35:41,650 --> 00:35:43,940
"member functions." "Member
functions," they call them.

633
00:35:43,940 --> 00:35:46,430
In Java, they call them
"methods," and in C++, they

634
00:35:46,430 --> 00:35:49,020
call them "member functions."
Doesn't matter.

635
00:35:49,020 --> 00:35:50,700
It's the same thing.

636
00:35:50,700 --> 00:35:52,840
So when you access one of these,
normally, from the

637
00:35:52,840 --> 00:35:55,940
outside, you want to make sure
you grab the lock associated

638
00:35:55,940 --> 00:35:57,380
with the object.

639
00:35:57,380 --> 00:36:00,760
However, it may be that what
you're doing inside the object

640
00:36:00,760 --> 00:36:01,880
is you want to be able--

641
00:36:01,880 --> 00:36:05,720
one of the operations may be a
more complex operation that

642
00:36:05,720 --> 00:36:09,840
wants to call one of the object's
own methods.

643
00:36:09,840 --> 00:36:11,650
So rather than implementing
it twice--

644
00:36:11,650 --> 00:36:16,560
once in the locked form, once
without getting the lock--

645
00:36:16,560 --> 00:36:19,970
you just implement it once, and
you use reentrant locks.

646
00:36:19,970 --> 00:36:23,690
And that way, you don't have
to worry about, in coding

647
00:36:23,690 --> 00:36:26,840
those things, whether or not
you've already got it.

648
00:36:26,840 --> 00:36:29,570
So that's probably the most
common place that I know that

649
00:36:29,570 --> 00:36:31,040
people want reentrant locks.
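
Here is a sketch of the object pattern just described. It is written in C, with a struct standing in for the object and POSIX threads supplying the reentrant mutex (`PTHREAD_MUTEX_RECURSIVE`). The account type and its functions are hypothetical. `account_deposit_with_bonus` reuses `account_deposit` while it already holds the lock. That works only because the mutex is reentrant; the library performs the "do I already own it?" check internally.

```c
#define _XOPEN_SOURCE 700   /* exposes PTHREAD_MUTEX_RECURSIVE on glibc */
#include <assert.h>
#include <pthread.h>

/* Hypothetical "object": every member function takes the object's lock. */
typedef struct {
    pthread_mutex_t lock;
    long balance;
} account_t;

void account_init(account_t *a, long initial) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    /* Reentrant: the owning thread may lock again; each lock needs an unlock. */
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&a->lock, &attr);
    pthread_mutexattr_destroy(&attr);
    a->balance = initial;
}

void account_deposit(account_t *a, long amount) {
    pthread_mutex_lock(&a->lock);
    a->balance += amount;
    pthread_mutex_unlock(&a->lock);
}

/* A more complex operation implemented once, on top of deposit(). With a
   non-recursive mutex, the nested lock in deposit() would deadlock (or be
   undefined behavior for the default mutex type). */
void account_deposit_with_bonus(account_t *a, long amount, long bonus) {
    pthread_mutex_lock(&a->lock);
    account_deposit(a, amount);
    account_deposit(a, bonus);
    pthread_mutex_unlock(&a->lock);
}

long account_balance(account_t *a) {
    pthread_mutex_lock(&a->lock);
    long b = a->balance;
    pthread_mutex_unlock(&a->lock);
    return b;
}

void account_destroy(account_t *a) { pthread_mutex_destroy(&a->lock); }
```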

650
00:36:31,040 --> 00:36:35,420
Naturally, to acquire a
reentrant lock, you have to do

651
00:36:35,420 --> 00:36:37,880
some kind of If statement,
which is a conditional.

652
00:36:37,880 --> 00:36:41,015
And as you know, if it's an
unpredictable branch, that's

653
00:36:41,015 --> 00:36:43,010
going to be very expensive.

654
00:36:43,010 --> 00:36:50,090
So generally, there is a cost
to making it reentrant.

655
00:36:50,090 --> 00:36:55,640
The third property is whether
the lock is fair or unfair.

656
00:36:55,640 --> 00:36:59,770
So a fair mutex puts blocked
threads essentially into a

657
00:36:59,770 --> 00:37:00,970
FIFO queue.

658
00:37:00,970 --> 00:37:04,250
And the unlock operation
unblocks the thread that has

659
00:37:04,250 --> 00:37:06,940
been waiting the longest.

660
00:37:06,940 --> 00:37:13,450
So it makes it so that if you
try to acquire a lock, you

661
00:37:13,450 --> 00:37:15,930
don't have some other thread
coming in and trying to access

662
00:37:15,930 --> 00:37:19,000
that lock and getting
ahead of you.

663
00:37:19,000 --> 00:37:21,740
It puts you in a queue.

664
00:37:21,740 --> 00:37:24,900
So an unfair mutex lets any
blocked thread go next.
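
The lecture does not give an implementation of a fair mutex. One well-known way to get FIFO order among waiting threads is a ticket lock, sketched here with C11 atomics. This is a spinning variant, so waiters are not put to sleep. The names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* Ticket lock: each arriving thread takes the next ticket, and unlock serves
   tickets in the order they were handed out, so waiters enter in FIFO order. */
typedef struct {
    atomic_uint next_ticket;   /* next ticket to hand out */
    atomic_uint now_serving;   /* ticket whose holder may enter */
} ticket_lock_t;

void ticket_lock(ticket_lock_t *l) {
    unsigned my = atomic_fetch_add_explicit(&l->next_ticket, 1, memory_order_relaxed);
    /* Wait for our turn; later arrivals hold larger tickets and cannot cut in. */
    while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != my)
        ;
}

void ticket_unlock(ticket_lock_t *l) {
    /* Only the current holder writes now_serving, so a load plus a store suffices. */
    unsigned next = atomic_load_explicit(&l->now_serving, memory_order_relaxed) + 1;
    atomic_store_explicit(&l->now_serving, next, memory_order_release);
}
```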

665
00:37:27,530 --> 00:37:31,030
So the cheapest thing to
implement is a spinning,

666
00:37:31,030 --> 00:37:35,180
non-reentrant, unfair lock--

667
00:37:35,180 --> 00:37:36,110
mutex.

668
00:37:36,110 --> 00:37:37,890
Those are the cheapest
ones to implement.

669
00:37:37,890 --> 00:37:40,490
Very lightweight, very
easy to use.

670
00:37:40,490 --> 00:37:42,270
The heavyweight ones
are a yielding,

671
00:37:42,270 --> 00:37:44,900
reentrant, fair lock.

672
00:37:44,900 --> 00:37:48,130
And of course, you can have
combinations, because all of

673
00:37:48,130 --> 00:37:52,100
these have, as you can see,
different properties in terms

674
00:37:52,100 --> 00:37:55,840
of convenience of use and
so forth, as well

675
00:37:55,840 --> 00:37:57,650
as different overheads.

676
00:37:57,650 --> 00:38:00,950
So there's some cases where the
overhead isn't a big deal

677
00:38:00,950 --> 00:38:05,660
because it's not in the inner
loop of a program or a heavily

678
00:38:05,660 --> 00:38:06,910
executed statement.

679
00:38:09,220 --> 00:38:12,400
So let's take a look at one of
the simplest locks, which is a

680
00:38:12,400 --> 00:38:14,710
simple spinning mutex.

681
00:38:14,710 --> 00:38:20,420
This is the x86 code for
how to acquire a lock.

682
00:38:23,260 --> 00:38:24,060
So let's run through this.

683
00:38:24,060 --> 00:38:26,000
So we start out at the top.

684
00:38:26,000 --> 00:38:28,490
And I check to see if the
mutex is 0, which is

685
00:38:28,490 --> 00:38:32,280
basically, it's going to be 0
if it's free and 1 if it has

686
00:38:32,280 --> 00:38:34,570
been acquired.

687
00:38:34,570 --> 00:38:36,230
So we compare it.

688
00:38:36,230 --> 00:38:41,010
If it's free, then I jump
to try to get the mutex.

689
00:38:41,010 --> 00:38:44,270
Otherwise, I execute this PAUSE
instruction, and this

690
00:38:44,270 --> 00:38:46,140
turns out to be a--

691
00:38:46,140 --> 00:38:46,880
it's humorous.

692
00:38:46,880 --> 00:38:50,460
It's an x86 hack to un-confuse
the pipeline.

693
00:38:50,460 --> 00:38:53,890
So it turns out that in this
case, if you don't have a

694
00:38:53,890 --> 00:38:55,100
pause here--

695
00:38:55,100 --> 00:38:58,320
which is a no-op and
does nothing--

696
00:38:58,320 --> 00:39:04,970
x86 mispredicts something or
whatever, and it's more time

697
00:39:04,970 --> 00:39:08,040
consuming than if it does
have that there.

698
00:39:08,040 --> 00:39:12,290
The manual explains very little
about this hardware bug

699
00:39:12,290 --> 00:39:15,440
except to say, put
in the pause.

700
00:39:15,440 --> 00:39:17,940
So if you didn't get it, then
you jump to spin mutex, and

701
00:39:17,940 --> 00:39:20,710
try again, check to
see if it's free.

702
00:39:20,710 --> 00:39:23,890
Now, notice that we're going to
spin until it's free, and

703
00:39:23,890 --> 00:39:25,860
then we're going to
try to get it.

704
00:39:25,860 --> 00:39:27,170
Why not just try to
get it first?

705
00:39:34,900 --> 00:39:37,110
Well, think about that while we
go through how to get it,

706
00:39:37,110 --> 00:39:39,310
and then I'll ask it again.

707
00:39:39,310 --> 00:39:42,310
Think about why it is that you
might not want to try to get it first.

708
00:39:42,310 --> 00:39:47,180
So if I want to get the
mutex, I first get a

709
00:39:47,180 --> 00:39:50,490
value of 1 in my register.

710
00:39:50,490 --> 00:39:53,950
And then I execute this exchange
operation, which

711
00:39:53,950 --> 00:40:01,210
exchanges the value of the mutex
with the value of the--

712
00:40:01,210 --> 00:40:03,110
with the one that I have.

713
00:40:03,110 --> 00:40:06,240
So it exchanges the memory
location with the register.

714
00:40:06,240 --> 00:40:08,070
Now, this is an expensive
operation--

715
00:40:08,070 --> 00:40:09,060
exchange--

716
00:40:09,060 --> 00:40:11,970
because it's an atomic exchange,
and it typically has

717
00:40:11,970 --> 00:40:15,930
to go at least out
to L2 to do this.

718
00:40:15,930 --> 00:40:17,810
So it's an expensive operation,
because it's a

719
00:40:17,810 --> 00:40:20,170
read-modify-write operation.

720
00:40:20,170 --> 00:40:25,280
I'm swapping my register
value with a value

721
00:40:25,280 --> 00:40:26,530
that's in the mutex.

722
00:40:30,490 --> 00:40:35,100
So it turns out that if it's
0, then it means I got it.

723
00:40:39,730 --> 00:40:43,610
So I compare it with 0, and if
it's equal to 0, I go on to the

724
00:40:43,610 --> 00:40:44,720
critical section.

725
00:40:44,720 --> 00:40:47,140
When I'm done with the critical
section, I release

726
00:40:47,140 --> 00:40:51,050
the mutex by basically storing
0 in there, because I'm the

727
00:40:51,050 --> 00:40:54,620
only one who accesses the
mutex at this point.

728
00:40:54,620 --> 00:40:57,410
If I didn't get it, if the
value is 1, notice that

729
00:40:57,410 --> 00:41:02,510
because I'm swapping a 1 in,
even though the 1 got swapped

730
00:41:02,510 --> 00:41:04,350
in, well, there was
a 1 there before.

731
00:41:04,350 --> 00:41:08,650
So it basically did not affect
the value of the mutex.

732
00:41:08,650 --> 00:41:10,810
But I discover, oh,
I don't have it.

733
00:41:10,810 --> 00:41:14,050
Then we go all the way back
up there to spin mutex.

734
00:41:14,050 --> 00:41:15,090
So here's the question.

735
00:41:15,090 --> 00:41:16,790
Why do I need all this
preamble code?

736
00:41:16,790 --> 00:41:21,820
Why not just go straight to
Get_Mutex, make the spin mutex

737
00:41:21,820 --> 00:41:25,110
here be a jump to Get_Mutex?

738
00:41:25,110 --> 00:41:25,450
Yeah?

739
00:41:25,450 --> 00:41:27,945
AUDIENCE: Maybe it's because
the exchange is expensive.

740
00:41:27,945 --> 00:41:29,050
PROFESSOR: Excuse me?

741
00:41:29,050 --> 00:41:29,810
AUDIENCE: The exchange is--

742
00:41:29,810 --> 00:41:31,510
PROFESSOR: Yeah, because the
exchange is expensive.

743
00:41:31,510 --> 00:41:32,520
Exactly.

744
00:41:32,520 --> 00:41:36,190
So this code here,
I can compare.

745
00:41:36,190 --> 00:41:38,900
And as long as nobody's touching
anything, this

746
00:41:38,900 --> 00:41:46,950
becomes just L1 memory
accesses.

747
00:41:46,950 --> 00:41:51,320
Whereas here, it's going
to be at least L2 to do

748
00:41:51,320 --> 00:41:52,790
the exchange operation.

749
00:41:52,790 --> 00:41:55,690
So rather than doing that--

750
00:41:55,690 --> 00:42:00,290
moreover, this one actually
changes the value.

751
00:42:00,290 --> 00:42:02,500
So what happens when I change
the value of the mutex?

752
00:42:02,500 --> 00:42:05,970
Even though I change it to the
same value, what happens in

753
00:42:05,970 --> 00:42:08,590
order to do that exchange?

754
00:42:08,590 --> 00:42:13,320
Remember from several
lectures ago.

755
00:42:13,320 --> 00:42:15,890
What's going to happen when
I make an exchange there?

756
00:42:15,890 --> 00:42:17,350
What does the hardware
have to do?

757
00:42:23,811 --> 00:42:27,730
What's the hardware going to
do on any store to a shared

758
00:42:27,730 --> 00:42:32,380
memory location, to a memory
location in shared memory that

759
00:42:32,380 --> 00:42:34,480
is actually shared?

760
00:42:34,480 --> 00:42:35,260
Yeah?

761
00:42:35,260 --> 00:42:36,126
AUDIENCE: [INAUDIBLE]

762
00:42:36,126 --> 00:42:37,890
PROFESSOR: Yeah, it's
got to invalidate

763
00:42:37,890 --> 00:42:41,180
all the other copies.

764
00:42:41,180 --> 00:42:43,900
So if everybody's spinning here--
imagine that you have

765
00:42:43,900 --> 00:42:48,310
five guys spinning,
doing exchanges--

766
00:42:48,310 --> 00:42:53,310
they're all creating all this
traffic of invalidations,

767
00:42:53,310 --> 00:42:56,870
what's called an "invalidation
storm." So they create an

768
00:42:56,870 --> 00:42:59,950
invalidation storm as they all
are invalidating each other so

769
00:42:59,950 --> 00:43:02,240
that they can get access to it
so that they can change the

770
00:43:02,240 --> 00:43:05,240
value themselves.
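
For contrast, here is a sketch of the alternative the question proposes: jump straight to the exchange on every iteration. It is correct, but every waiting thread keeps writing the lock word. C11 atomics stand in for the x86 instructions, and the names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct { atomic_int word; } tas_mutex_t;   /* 0 = free, 1 = held */

/* Naive version: every iteration is an atomic exchange, which is a write. Each
   write needs exclusive ownership of the cache line, so the waiters keep
   invalidating one another's copies even though storing 1 over 1 changes nothing. */
void tas_lock(tas_mutex_t *m) {
    while (atomic_exchange_explicit(&m->word, 1, memory_order_acquire) != 0)
        ;
}

void tas_unlock(tas_mutex_t *m) {
    atomic_store_explicit(&m->word, 0, memory_order_release);
}
```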

771
00:43:05,240 --> 00:43:07,605
But up here, all I'm doing
is looking at the value.

772
00:43:10,672 --> 00:43:15,345
All I'm doing is looking at the
value to see if it's free.

773
00:43:15,345 --> 00:43:26,675
And it's not until the guy
actually frees the value that

774
00:43:26,675 --> 00:43:27,430
it actually--

775
00:43:27,430 --> 00:43:30,100
actually, this is interesting.

776
00:43:30,100 --> 00:43:34,470
I think I wrote this with Intel
syntax, rather than

777
00:43:34,470 --> 00:43:37,670
AT&T, didn't I?

778
00:43:37,670 --> 00:43:46,020
The MOV mutex, 0 moves
0 into the mutex,

779
00:43:46,020 --> 00:43:47,630
which is Intel syntax.

780
00:43:47,630 --> 00:43:51,620
I probably should have converted
this to AT&T,

781
00:43:51,620 --> 00:43:54,350
because that's what we're
generally using in the class.

782
00:43:54,350 --> 00:43:58,700
I'll fix that before I
put the slides up.

783
00:43:58,700 --> 00:44:00,640
Basically, I pulled this out
of the Intel manual.

784
00:44:03,820 --> 00:44:06,770
So any questions about
this code?

785
00:44:06,770 --> 00:44:08,490
Everybody see how it works?

786
00:44:08,490 --> 00:44:11,610
It relies on this atomic
exchange operation.

787
00:44:11,610 --> 00:44:15,310
And I'm going to end up sitting
here spinning until

788
00:44:15,310 --> 00:44:17,330
maybe I can get access to it.

789
00:44:17,330 --> 00:44:19,750
When I have a chance to get
access to it, I try to get it.

790
00:44:19,750 --> 00:44:21,570
If I don't get it, I go
back to spinning.
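
Here is a C11 sketch of the spin mutex walked through above, assuming C11 atomics as stand-ins for the x86 instructions. `atomic_exchange_explicit` plays the role of the exchange (GCC and Clang typically emit `xchg` for it on x86-64), and `_mm_pause` emits PAUSE. The slide's assembly is not reproduced in this transcript, and the names here are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#if defined(__x86_64__)
#include <immintrin.h>
#define CPU_RELAX() _mm_pause()      /* the PAUSE instruction */
#else
#define CPU_RELAX() ((void)0)
#endif

typedef struct { atomic_int word; } spin_mutex_t;   /* 0 = free, 1 = held */

void spin_lock(spin_mutex_t *m) {
    for (;;) {
        /* Spin_Mutex: read-only check. While the lock is held, this hits in the
           local cache and generates no invalidations. */
        while (atomic_load_explicit(&m->word, memory_order_relaxed) != 0)
            CPU_RELAX();
        /* Get_Mutex: swap in a 1. An old value of 0 means we acquired it;
           otherwise someone beat us to it, so go back to spinning. */
        if (atomic_exchange_explicit(&m->word, 1, memory_order_acquire) == 0)
            return;
    }
}

void spin_unlock(spin_mutex_t *m) {
    assert(atomic_load_explicit(&m->word, memory_order_relaxed) == 1);
    atomic_store_explicit(&m->word, 0, memory_order_release);   /* plain store of 0 */
}
```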

791
00:44:26,690 --> 00:44:29,600
How do I convert this
to a yielding mutex?

792
00:44:35,449 --> 00:44:37,894
AUDIENCE: Instead
of having that

793
00:44:37,894 --> 00:44:42,295
spinning mutex, you should--

794
00:44:42,295 --> 00:44:43,762
you shouldn't have that.

795
00:44:43,762 --> 00:44:46,207
You should just have something
that allows you to just

796
00:44:46,207 --> 00:44:46,696
[INAUDIBLE].

797
00:44:46,696 --> 00:44:48,910
PROFESSOR: Yeah, so actually,
the way you do it is you

798
00:44:48,910 --> 00:44:50,460
replace the PAUSE instruction.

799
00:44:50,460 --> 00:44:51,610
Exactly what you're saying.

800
00:44:51,610 --> 00:44:53,650
You've got the right
place in the code.

801
00:44:53,650 --> 00:44:54,865
We basically call a yield.

802
00:44:54,865 --> 00:44:56,250
And you can use, for example,
pthread_yield.

803
00:44:59,330 --> 00:45:00,770
What it tells the operating
system is,

804
00:45:00,770 --> 00:45:02,295
give up on this quantum.

805
00:45:02,295 --> 00:45:04,590
You can schedule me out.

806
00:45:04,590 --> 00:45:05,710
Somebody else can
be scheduled.

807
00:45:05,710 --> 00:45:08,830
Now, if nobody else is there to
be scheduled, often you'll

808
00:45:08,830 --> 00:45:12,860
just get control back, and
you'll jump again and give the

809
00:45:12,860 --> 00:45:14,160
operating system another time.
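
The yielding variant changes only the wait step. This sketch uses `sched_yield`, the POSIX call, in place of the `pthread_yield` mentioned in the lecture (`pthread_yield` is a nonstandard extension). The names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>

typedef struct { atomic_int word; } yield_mutex_t;   /* 0 = free, 1 = held */

void yield_lock(yield_mutex_t *m) {
    for (;;) {
        while (atomic_load_explicit(&m->word, memory_order_relaxed) != 0)
            sched_yield();   /* in place of PAUSE: give up the rest of this quantum */
        if (atomic_exchange_explicit(&m->word, 1, memory_order_acquire) == 0)
            return;
    }
}

void yield_unlock(yield_mutex_t *m) {
    assert(atomic_load_explicit(&m->word, memory_order_relaxed) == 1);
    atomic_store_explicit(&m->word, 0, memory_order_release);
}
```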

810
00:45:17,070 --> 00:45:23,300
Now, one of the things I've seen
in computer benchmarks

811
00:45:23,300 --> 00:45:30,470
that use locking is that they
all use spin locks.

812
00:45:30,470 --> 00:45:37,880
They never use the yielding,
because if you yield, then

813
00:45:37,880 --> 00:45:40,380
when the lock comes free, you're
not going to be ready

814
00:45:40,380 --> 00:45:41,020
to come back in.

815
00:45:41,020 --> 00:45:44,470
You may be switched out.

816
00:45:44,470 --> 00:45:48,900
So a common thing that all
these companies do when

817
00:45:48,900 --> 00:45:52,320
they're vying for who's got the
fastest on this benchmark

818
00:45:52,320 --> 00:45:55,790
or fastest on that benchmark
is they go through and they

819
00:45:55,790 --> 00:46:01,250
convert all their yielding
mutexes into spinning mutexes,

820
00:46:01,250 --> 00:46:05,240
then take their measurements,
when in fact, as a practical

821
00:46:05,240 --> 00:46:08,100
matter, they can't actually
ship code that way.

822
00:46:08,100 --> 00:46:11,970
So you'll see this kind of game
played where people try

823
00:46:11,970 --> 00:46:15,330
to get the best performance
they can in some kind of

824
00:46:15,330 --> 00:46:16,490
laboratory setting.

825
00:46:16,490 --> 00:46:18,680
It's not the same as when
you're actually

826
00:46:18,680 --> 00:46:22,650
doing a real thing.

827
00:46:22,650 --> 00:46:24,490
So you have a choice here.

828
00:46:29,430 --> 00:46:30,850
There's kind of a
tension here.

829
00:46:35,880 --> 00:46:39,240
You'd like to claim the mutex
soon after it's released.

830
00:46:39,240 --> 00:46:40,885
And you're not going to
get that if you yield.

831
00:46:43,600 --> 00:46:47,080
At the same time, you want
to behave nicely

832
00:46:47,080 --> 00:46:50,530
and waste few cycles.

833
00:46:50,530 --> 00:46:55,040
So what's the strategy for being
able to accomplish both

834
00:46:55,040 --> 00:46:57,260
of these goals?

835
00:46:57,260 --> 00:47:01,090
So one of these goals is the
spinning mutex does a great

836
00:47:01,090 --> 00:47:03,870
job of claiming the mutex soon
after it's released.

837
00:47:03,870 --> 00:47:09,630
The yielding mutex behaves
nicely and wastes few cycles.

838
00:47:09,630 --> 00:47:10,830
Is there the best
of both worlds?

839
00:47:10,830 --> 00:47:12,460
There's certainly the worst
of both worlds, right?

840
00:47:15,500 --> 00:47:16,800
What's the best of
both worlds?

841
00:47:19,710 --> 00:47:24,910
How might we accomplish both
of these goals with a small

842
00:47:24,910 --> 00:47:27,650
modification to the
locking code?

843
00:47:36,990 --> 00:47:38,100
So it turns out you
can get within a

844
00:47:38,100 --> 00:47:39,375
factor of two of optimal.

845
00:47:47,130 --> 00:47:52,445
How might you do that while
wasting few cycles?

846
00:47:55,790 --> 00:47:57,920
So here's the idea.

847
00:47:57,920 --> 00:48:07,030
Spin for a little while, and
then, if after a little while

848
00:48:07,030 --> 00:48:14,480
you didn't manage to access
the mutex, then yield.

849
00:48:14,480 --> 00:48:16,910
So that if the mutex was
right there available to be

850
00:48:16,910 --> 00:48:20,100
accessed, you could access it,
but you don't spin for an

851
00:48:20,100 --> 00:48:23,390
indefinite amount of time.

852
00:48:23,390 --> 00:48:27,490
So the question is, how
long do you spin?

853
00:48:27,490 --> 00:48:30,020
So we're going to spin for a
little while and then yield.

854
00:48:30,020 --> 00:48:30,360
Yeah?

855
00:48:30,360 --> 00:48:32,760
AUDIENCE: [INAUDIBLE].

856
00:48:32,760 --> 00:48:34,950
PROFESSOR: Yeah, exactly.

857
00:48:34,950 --> 00:48:38,700
So what you do is you spin for
basically as long as a context

858
00:48:38,700 --> 00:48:39,950
switch takes.

859
00:48:41,980 --> 00:48:44,760
So if you spin for as long as
it takes to do a context

860
00:48:44,760 --> 00:48:49,090
switch and then do a context
switch, if the mutex became

861
00:48:49,090 --> 00:48:51,530
available right after you yield, well,
you're only going to wait

862
00:48:51,530 --> 00:48:52,780
double what you would
have waited.

863
00:48:55,260 --> 00:48:59,010
And if in the meantime during
that first part where you're

864
00:48:59,010 --> 00:49:01,970
spinning, it becomes available,
you're not waiting any

865
00:49:01,970 --> 00:49:03,570
longer than you actually
have to.

866
00:49:03,570 --> 00:49:05,710
So in both cases, you're
waiting at

867
00:49:05,710 --> 00:49:06,950
most a factor of two.

868
00:49:06,950 --> 00:49:09,580
In one case, you're waiting
exactly the right amount.

869
00:49:09,580 --> 00:49:12,510
In the other, you can end up
waiting a factor of two.

870
00:49:12,510 --> 00:49:18,220
So this is a classic amortized
kind of argument, that you can

871
00:49:18,220 --> 00:49:20,980
amortize the cost
of the spinning

872
00:49:20,980 --> 00:49:22,680
to the context switch.

873
00:49:22,680 --> 00:49:25,860
So spin until you spend as much
time as it would cost for

874
00:49:25,860 --> 00:49:26,940
a context switch.

875
00:49:26,940 --> 00:49:30,130
Then do the context switch.

876
00:49:30,130 --> 00:49:31,380
Yet another voodoo parameter.

877
00:49:37,440 --> 00:49:39,700
Yeah, so if the mutex is
released while spinning,

878
00:49:39,700 --> 00:49:41,720
that's optimal.

879
00:49:41,720 --> 00:49:44,280
If the mutex is released
after the yield, you're

880
00:49:44,280 --> 00:49:45,690
within twice optimal.
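Under a simplified model (an assumption, not from the lecture: a yield costs a fixed time C, the cost of one context switch, and the thread gets the mutex once it is rescheduled), the factor-of-two argument can be written out, where t is the time until the mutex is released:

```latex
\mathrm{OPT}(t) = \min(t, C), \qquad
\mathrm{cost}(t) =
\begin{cases}
t & \text{if } t \le C \quad \text{(released while spinning: optimal)}\\
C + C = 2C & \text{if } t > C \quad \text{(spin for } C\text{, then yield)}
\end{cases}
\;\Longrightarrow\; \mathrm{cost}(t) \le 2\,\mathrm{OPT}(t).
```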

881
00:49:48,270 --> 00:49:52,820
Turns out that 2 is not
the optimal value.

882
00:49:52,820 --> 00:49:57,130
There's a randomized algorithm
that makes it e over e minus 1

883
00:49:57,130 --> 00:50:02,030
competitive where e is the base
of the natural logarithm.

884
00:50:05,620 --> 00:50:12,990
So 2.7 divided by 1.7,
which is what?

885
00:50:12,990 --> 00:50:14,850
Who's got a calculator?

886
00:50:14,850 --> 00:50:17,150
2.7 divided by 1.7 is--

887
00:50:17,150 --> 00:50:18,760
I should have calculated
this out.

888
00:50:18,760 --> 00:50:19,930
AUDIENCE: [INAUDIBLE]

889
00:50:19,930 --> 00:50:21,470
PROFESSOR: It's about 1.6.

890
00:50:21,470 --> 00:50:21,910
Good.

891
00:50:21,910 --> 00:50:23,310
So it's better than 2.
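The arithmetic for the randomized bound, with a few more digits:

```latex
\frac{e}{e-1} \approx \frac{2.71828}{1.71828} \approx 1.582 < 2
```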

892
00:50:26,980 --> 00:50:30,021
People analyze these
things, right?

893
00:50:30,021 --> 00:50:34,250
So any questions about
implementation of locks?

894
00:50:34,250 --> 00:50:36,030
There are many other ways
of implementing locks.

895
00:50:36,030 --> 00:50:39,210
There are other instructions
that people use.

896
00:50:39,210 --> 00:50:42,780
They do things like
compare-and-swap is another

897
00:50:42,780 --> 00:50:43,830
operation that's used.

898
00:50:43,830 --> 00:50:47,100
Some machines have
an operation called

899
00:50:47,100 --> 00:50:51,290
load-linked/store-conditional,
which is not on the x86

900
00:50:51,290 --> 00:50:53,590
architecture, but it is on
other architectures.

901
00:50:53,590 --> 00:50:55,720
You'll see a lot of other ways
of doing some kind of

902
00:50:55,720 --> 00:50:59,130
atomic operation to
implement a lock.

903
00:50:59,130 --> 00:51:03,110
Uniformly, they're expensive
compared to register

904
00:51:03,110 --> 00:51:06,160
operations or even
L1 accesses,

905
00:51:06,160 --> 00:51:08,760
typically.

906
00:51:08,760 --> 00:51:10,010
Any questions?

907
00:51:12,860 --> 00:51:15,810
So now that we've decided that
we're going to use mutexes,

908
00:51:15,810 --> 00:51:19,330
and we understand we're writing
non-deterministic code

909
00:51:19,330 --> 00:51:22,590
and so forth, well, it turns out
there are a host of other

910
00:51:22,590 --> 00:51:24,520
system anomalies that occur.

911
00:51:24,520 --> 00:51:29,120
So locks are like, they're this
really evil mechanism

912
00:51:29,120 --> 00:51:31,360
that works really well.

913
00:51:31,360 --> 00:51:33,420
It feels so good that
nobody wants to stop

914
00:51:33,420 --> 00:51:36,510
using it, even though--

915
00:51:36,510 --> 00:51:37,880
but nobody has better ideas.

916
00:51:37,880 --> 00:51:42,440
One of the most interesting
ideas in recent memory is the

917
00:51:42,440 --> 00:51:45,940
idea of using what's called
"transactional memory," which

918
00:51:45,940 --> 00:51:48,800
is basically where memory
operates like a database

919
00:51:48,800 --> 00:51:50,100
transaction.

920
00:51:50,100 --> 00:51:52,730
And it's allowed to abort, in
which case you roll it back

921
00:51:52,730 --> 00:51:55,520
and retry it.

922
00:51:55,520 --> 00:51:59,280
Yet, transactional memory has a
host of issues with it, and

923
00:51:59,280 --> 00:52:00,530
still people use locks.

924
00:52:06,170 --> 00:52:08,610
So let's talk about some of
the bad things that happen

925
00:52:08,610 --> 00:52:11,070
when you start doing locks.

926
00:52:11,070 --> 00:52:15,430
I'm going to talk about three of
them, deadlock, convoying,

927
00:52:15,430 --> 00:52:18,230
and contention.

928
00:52:18,230 --> 00:52:22,760
So deadlock is probably the most
important one, because it

929
00:52:22,760 --> 00:52:24,665
has to do with correctness.

930
00:52:24,665 --> 00:52:27,030
So you can have code-- in fact,
I've seen people with

931
00:52:27,030 --> 00:52:31,360
very fast code that has deadlock
potential in it.

932
00:52:31,360 --> 00:52:34,830
It's like, if you deadlock, then
your average running time

933
00:52:34,830 --> 00:52:37,860
is infinite if there's
a possibility

934
00:52:37,860 --> 00:52:38,810
of a deadlock, right?

935
00:52:38,810 --> 00:52:41,440
Because you're averaging
infinity with everything else

936
00:52:41,440 --> 00:52:44,410
that you might run.

937
00:52:44,410 --> 00:52:48,260
So it's not good to have
deadlock in your code,

938
00:52:48,260 --> 00:52:50,400
regardless.

939
00:52:50,400 --> 00:52:52,970
It's kind of like your
code seg faulting.

940
00:52:52,970 --> 00:52:55,110
No decent code should
seg fault.

941
00:52:55,110 --> 00:52:58,290
It should always catch its
own errors and terminate

942
00:52:58,290 --> 00:52:59,350
gracefully.

943
00:52:59,350 --> 00:53:02,600
It shouldn't just seg fault
in some circumstance.

944
00:53:02,600 --> 00:53:04,295
Similarly, your code should
not deadlock.

945
00:53:07,190 --> 00:53:12,540
So here's sort of a classical
instance of deadlock.

946
00:53:12,540 --> 00:53:16,020
And deadlock typically occurs
when you hold more than one

947
00:53:16,020 --> 00:53:18,670
lock at a time.

948
00:53:18,670 --> 00:53:27,250
So here, this guy is going to
grab a lock A, going to grab a

949
00:53:27,250 --> 00:53:31,150
lock B, then unlock B, unlock
A, and in there

950
00:53:31,150 --> 00:53:32,050
do a critical section.

951
00:53:32,050 --> 00:53:34,650
Why might I grab two locks?

952
00:53:34,650 --> 00:53:37,340
What's the circumstance where I
might have code that looked

953
00:53:37,340 --> 00:53:40,480
very similar to this?

954
00:53:40,480 --> 00:53:41,200
Use case.

955
00:53:41,200 --> 00:53:42,172
AUDIENCE: Two objects?

956
00:53:42,172 --> 00:53:43,156
PROFESSOR: Sorry?

957
00:53:43,156 --> 00:53:44,250
AUDIENCE: You need
two objects.

958
00:53:44,250 --> 00:53:45,430
PROFESSOR: You need
two objects.

959
00:53:45,430 --> 00:53:46,680
When might that occur?

960
00:53:49,442 --> 00:53:51,827
AUDIENCE: Account
transactions.

961
00:53:51,827 --> 00:53:53,780
PROFESSOR: Yeah, account
transactions.

962
00:53:53,780 --> 00:53:56,790
That's the classic one.

963
00:53:56,790 --> 00:53:59,060
You want to move something from
this bank account to that

964
00:53:59,060 --> 00:54:00,055
bank account.

965
00:54:00,055 --> 00:54:02,400
And you want to make sure that
as you're updating it, nothing

966
00:54:02,400 --> 00:54:04,110
else is occurring.

967
00:54:04,110 --> 00:54:06,110
Another place this comes up all
the time is when you do

968
00:54:06,110 --> 00:54:08,420
graph algorithms.

969
00:54:08,420 --> 00:54:12,140
You always want to grab the edge
and have the two vertices

970
00:54:12,140 --> 00:54:15,040
on each end of the edge not move
while you do something

971
00:54:15,040 --> 00:54:16,970
across the edge.

972
00:54:16,970 --> 00:54:19,870
So lots of cases there.

973
00:54:19,870 --> 00:54:22,170
It turns out the order in
which you unlock things

974
00:54:22,170 --> 00:54:26,240
doesn't matter, because you can
always unlock something.

975
00:54:26,240 --> 00:54:28,350
You never hold up
for unlocking.

976
00:54:28,350 --> 00:54:30,180
The problem with deadlock
is generally

977
00:54:30,180 --> 00:54:31,660
how you acquire locks.

978
00:54:31,660 --> 00:54:35,010
So in this example, Thread 2
grabs Lock B, then grabs Lock

979
00:54:35,010 --> 00:54:39,290
A. So it might be, for example,
that you have some

980
00:54:39,290 --> 00:54:42,320
random process that's at
the node of a graph.

981
00:54:42,320 --> 00:54:44,440
And now it's going to
grab a lock on the

982
00:54:44,440 --> 00:54:45,480
other end of an edge.

983
00:54:45,480 --> 00:54:48,790
But you might have the guy at
the other end grabbing that

984
00:54:48,790 --> 00:54:52,850
vertex and then grabbing
the one on your end.

985
00:54:52,850 --> 00:54:54,730
And that's basically
the situation.

986
00:54:54,730 --> 00:54:59,990
So what happens is Thread
1 acquires a lock here.

987
00:54:59,990 --> 00:55:01,890
Thread 2 acquires that lock.

988
00:55:01,890 --> 00:55:04,910
And now which one can go?

989
00:55:04,910 --> 00:55:05,550
Neither of them.

990
00:55:05,550 --> 00:55:08,110
You've got a deadlock.

991
00:55:08,110 --> 00:55:11,060
Ultimate loss of performance.

992
00:55:11,060 --> 00:55:13,280
So it's really a correctness
issue.

993
00:55:13,280 --> 00:55:16,210
But you can view it, if you
really say, oh, correctness,

994
00:55:16,210 --> 00:55:17,350
that's for sissies.

995
00:55:17,350 --> 00:55:19,340
We do performance.

996
00:55:19,340 --> 00:55:23,630
Well, it's still a performance
issue, because it's the

997
00:55:23,630 --> 00:55:24,970
ultimate loss of performance.

998
00:55:24,970 --> 00:55:27,740
In fact, that's probably true
of any correctness issue.

999
00:55:27,740 --> 00:55:28,470
No, that's not true.

1000
00:55:28,470 --> 00:55:30,700
Sometimes you just get
the wrong number.

1001
00:55:30,700 --> 00:55:32,700
Here is a correctness
issue where

1002
00:55:32,700 --> 00:55:33,950
your code stops operating.

1003
00:55:37,710 --> 00:55:41,410
So there are three conditions
that are usually pointed to

1004
00:55:41,410 --> 00:55:43,320
that you need for deadlock.

1005
00:55:43,320 --> 00:55:45,220
The first is mutual exclusion.

1006
00:55:45,220 --> 00:55:49,340
Each thread claims exclusive
control over the resources

1007
00:55:49,340 --> 00:55:55,540
that it holds, in this case, the
resources being the locks.

1008
00:55:55,540 --> 00:55:58,140
So there's got to be some
resource that you're grabbing,

1009
00:55:58,140 --> 00:56:00,840
and that you're the only one
who gets to have it.

1010
00:56:00,840 --> 00:56:03,280
So in this case, it would
be the locks.

1011
00:56:03,280 --> 00:56:06,310
The second is non-preemption.

1012
00:56:06,310 --> 00:56:09,390
You don't let go of your
resources until you complete

1013
00:56:09,390 --> 00:56:12,370
your use of them.

1014
00:56:12,370 --> 00:56:18,840
So that means nobody can force you
to let go of a lock.

1015
00:56:18,840 --> 00:56:22,060
If you're actually
able to preempt--

1016
00:56:22,060 --> 00:56:26,020
so this piece of code over there
has grabbed locks, and

1017
00:56:26,020 --> 00:56:29,210
now I can come in and take them
away, then you may not

1018
00:56:29,210 --> 00:56:29,975
have a deadlock potential.

1019
00:56:29,975 --> 00:56:31,560
You may have other issues,
but you won't

1020
00:56:31,560 --> 00:56:34,090
have a deadlock potential.

1021
00:56:34,090 --> 00:56:36,520
And the third one is
circular waiting.

1022
00:56:36,520 --> 00:56:39,430
You have a cycle of threads in
which each thread is blocked

1023
00:56:39,430 --> 00:56:44,790
waiting for resources held by
the next thread in the cycle.

1024
00:56:44,790 --> 00:56:48,650
So let me illustrate this with
a very famous story that some

1025
00:56:48,650 --> 00:56:52,900
of you may have seen, because
it is so famous.

1026
00:56:52,900 --> 00:56:55,980
It's the dining philosophers
problem.

1027
00:56:55,980 --> 00:56:58,890
It's an illustrative story of
deadlock that was originally

1028
00:56:58,890 --> 00:57:03,120
told by Tony Hoare, based on
an examination question by

1029
00:57:03,120 --> 00:57:05,160
Edsger Dijkstra.

1030
00:57:05,160 --> 00:57:07,260
And the story has been
embellished over the years by

1031
00:57:07,260 --> 00:57:09,060
many retellers.

1032
00:57:09,060 --> 00:57:10,610
It's one of these things that
if you're a computer

1033
00:57:10,610 --> 00:57:13,120
scientist, you should know
this story just because

1034
00:57:13,120 --> 00:57:16,130
everybody knows this story.

1035
00:57:16,130 --> 00:57:19,080
So here's how the story goes,
at least my version of it.

1036
00:57:19,080 --> 00:57:21,810
I get to retell it now.

1037
00:57:21,810 --> 00:57:25,270
So each of n philosophers needs
the two chopsticks on

1038
00:57:25,270 --> 00:57:29,500
either side of his or
her plate to eat the

1039
00:57:29,500 --> 00:57:31,960
noodles on the plate.

1040
00:57:31,960 --> 00:57:35,940
So they're not worried about
germs here, by the way.

1041
00:57:35,940 --> 00:57:37,690
So you have five philosophers
in this case

1042
00:57:37,690 --> 00:57:40,250
sitting around the table.

1043
00:57:40,250 --> 00:57:42,470
There are five chopsticks
between them.

1044
00:57:42,470 --> 00:57:46,200
In order to eat, they need to
grab the two chopsticks on

1045
00:57:46,200 --> 00:57:47,350
either side.

1046
00:57:47,350 --> 00:57:48,200
Then they can eat.

1047
00:57:48,200 --> 00:57:50,470
Then they put them down.

1048
00:57:50,470 --> 00:57:56,310
So here's what philosopher
i does.

1049
00:57:56,310 --> 00:58:02,220
So in an infinite loop, the
philosopher does thinking,

1050
00:58:02,220 --> 00:58:04,650
because that's what
philosophers do.

1051
00:58:04,650 --> 00:58:14,660
Then it grabs the lock of
chopstick i and grabs the lock

1052
00:58:14,660 --> 00:58:16,480
of chopstick i plus 1.

1053
00:58:16,480 --> 00:58:17,020
That's the 1.

1054
00:58:17,020 --> 00:58:19,850
So if we index them, say, to the
left of the plate, this is

1055
00:58:19,850 --> 00:58:22,180
grabbing the chopstick to
the left of your plate.

1056
00:58:22,180 --> 00:58:24,930
This is grabbing the chopstick
to the right of your plate.

1057
00:58:24,930 --> 00:58:26,240
Then you can eat.

1058
00:58:26,240 --> 00:58:27,550
Then you release your
two chopsticks.

1059
00:58:30,810 --> 00:58:32,650
So here, that's the code.

1060
00:58:32,650 --> 00:58:33,900
And then you go back
to thinking.

1061
00:58:36,900 --> 00:58:38,435
I guess they have no other
bodily functions.

1062
00:58:41,990 --> 00:58:45,660
So the problem is, one day they
all pick up their left

1063
00:58:45,660 --> 00:58:46,970
chopsticks simultaneously.

1064
00:58:50,200 --> 00:58:52,080
Now they go to look for
their right chopstick.

1065
00:58:52,080 --> 00:58:53,800
It's not there.

1066
00:58:53,800 --> 00:58:55,050
So what happens?

1067
00:58:57,150 --> 00:59:04,520
They starve because their code
doesn't let them release--

1068
00:59:04,520 --> 00:59:07,780
there's no preemption, so they
can't release the chopstick

1069
00:59:07,780 --> 00:59:09,030
they've already got.

1070
00:59:11,510 --> 00:59:13,090
And we have a circular
waiting.

1071
00:59:13,090 --> 00:59:14,130
They have mutual exclusion.

1072
00:59:14,130 --> 00:59:16,570
Only one of them can have
a chopstick at a time.

1073
00:59:16,570 --> 00:59:19,350
And we have a circular waiting
thing, because everyone is

1074
00:59:19,350 --> 00:59:24,320
waiting for the philosopher
on the right.

1075
00:59:24,320 --> 00:59:25,510
Is that clear to everybody?

1076
00:59:25,510 --> 00:59:27,470
That's the dining philosophers
problem.

1077
00:59:27,470 --> 00:59:28,745
How do you fix this problem?

1078
00:59:31,420 --> 00:59:34,225
What are solutions to
fixing this problem?

1079
00:59:39,390 --> 00:59:41,655
The problem being that you'd
like them to eat indefinitely.

1080
00:59:41,655 --> 00:59:45,401
AUDIENCE: You can index the
chopstick and say that

1081
00:59:45,401 --> 00:59:46,850
[INAUDIBLE].

1082
00:59:46,850 --> 00:59:50,000
PROFESSOR: Yeah, you can pick
the smaller index first.

1083
00:59:50,000 --> 00:59:52,330
So in general, that means
everybody would grab the one

1084
00:59:52,330 --> 00:59:54,630
on their left, then the one on
their right, except for the

1085
00:59:54,630 --> 01:00:00,660
guy who's going between
0 and n minus 1.

1086
01:00:00,660 --> 01:00:04,850
They would do n minus
1 and then 0.

1087
01:00:04,850 --> 01:00:08,260
They would do n minus
1 first, and then 0.

1088
01:00:08,260 --> 01:00:08,780
Sorry.

1089
01:00:08,780 --> 01:00:11,160
They would do 0 first,
and then n minus 1.

1090
01:00:11,160 --> 01:00:12,160
[INAUDIBLE]

1091
01:00:12,160 --> 01:00:14,360
Let me say that more
precisely.

1092
01:00:14,360 --> 01:00:18,270
So this is a classic way
to prevent deadlock.

1093
01:00:18,270 --> 01:00:22,510
Suppose that we can linearly
order the mutexes in some

1094
01:00:22,510 --> 01:00:26,430
order so that whenever a thread
holds a mutex Li

1095
01:00:26,430 --> 01:00:32,360
and attempts to lock another
mutex Lj, we have it that Li

1096
01:00:32,360 --> 01:00:35,090
goes before Lj in
the ordering.

1097
01:00:35,090 --> 01:00:36,500
Then no deadlock can occur.

1098
01:00:39,890 --> 01:00:44,920
So if you can order
all the

1099
01:00:44,920 --> 01:00:45,530
resources--

1100
01:00:45,530 --> 01:00:50,320
so they're always grabbing them
in some subsequence of

1101
01:00:50,320 --> 01:00:52,980
this order, so they're always
grabbing one that's larger and

1102
01:00:52,980 --> 01:00:56,860
larger and larger, and you're
never going back and grabbing

1103
01:00:56,860 --> 01:00:58,960
one smaller, then you
have no deadlock.

1104
01:00:58,960 --> 01:01:00,730
Here's why.

1105
01:01:00,730 --> 01:01:03,010
Suppose you have a
cycle of waiting.

1106
01:01:03,010 --> 01:01:05,160
That is, a deadlock
has occurred.

1107
01:01:05,160 --> 01:01:07,530
Let's look at the thread in
the cycle that holds the

1108
01:01:07,530 --> 01:01:10,710
largest mutex in the ordering;
call it Lmax.

1109
01:01:10,710 --> 01:01:13,090
So whatever is in
the ordering.

1110
01:01:13,090 --> 01:01:16,380
And suppose that it's waiting on
a mutex L held by the next

1111
01:01:16,380 --> 01:01:17,350
thread in the cycle.

1112
01:01:17,350 --> 01:01:18,640
That's the condition.

1113
01:01:18,640 --> 01:01:23,270
Well, then it must be that Lmax
falls before L, because

1114
01:01:23,270 --> 01:01:26,930
we're gathering them always
in an increasing order.

1115
01:01:26,930 --> 01:01:32,210
But that contradicts the fact
that Lmax is the largest.

1116
01:01:32,210 --> 01:01:33,810
So a deadlock cannot occur.

1117
01:01:36,930 --> 01:01:38,180
Questions?

1118
01:01:45,430 --> 01:01:46,690
Is this clear?

1119
01:01:46,690 --> 01:01:48,910
Who's seen this before?

1120
01:01:48,910 --> 01:01:49,780
A few people.

1121
01:01:49,780 --> 01:01:51,030
OK.

1122
01:01:53,030 --> 01:01:53,820
Is this clear?

1123
01:01:53,820 --> 01:01:56,900
So if you grab them in
increasing order, then there's

1124
01:01:56,900 --> 01:02:00,110
always some guy that has the
largest one, and nobody is

1125
01:02:00,110 --> 01:02:01,490
holding one larger.

1126
01:02:01,490 --> 01:02:05,770
So he can always grab
the next one.

1127
01:02:05,770 --> 01:02:14,110
So in this case of the dining
philosophers, what we can do

1128
01:02:14,110 --> 01:02:22,840
is grab the minimum of i and
i plus 1 mod n and then the

1129
01:02:22,840 --> 01:02:25,120
maximum of i and
i plus 1 mod n.

1130
01:02:25,120 --> 01:02:28,890
That gives us the same
two chopsticks.

1131
01:02:28,890 --> 01:02:32,090
And in fact, for most of the
philosophers, it's exactly the

1132
01:02:32,090 --> 01:02:32,770
same order.

1133
01:02:32,770 --> 01:02:36,830
But for one guy, it's
a different order.

1134
01:02:36,830 --> 01:02:42,030
It ends up being the guy who
would normally have done n

1135
01:02:42,030 --> 01:02:43,060
minus 1 and 0.

1136
01:02:43,060 --> 01:02:44,840
Instead, he does 0, n minus 1.

1137
01:02:44,840 --> 01:02:47,350
So in some sense, it's like
having a left-handed

1138
01:02:47,350 --> 01:02:49,300
person at the table.

1139
01:02:49,300 --> 01:02:52,320
You grab your left, then your
right, except for one guy, who does

1140
01:02:52,320 --> 01:02:54,030
right and then left.

1141
01:02:54,030 --> 01:02:56,290
And that fixes it, OK?

1142
01:02:56,290 --> 01:02:57,540
That fixes it.
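A C++ sketch of the min/max fix, with one `std::mutex` per chopstick and one `std::thread` per philosopher. The `meals` bound is a hypothetical addition so the simulation terminates instead of looping forever, and the sketch assumes n is at least 2 (with n = 1 both chopsticks would be the same mutex).

```cpp
#include <algorithm>
#include <mutex>
#include <thread>
#include <vector>

// Philosopher i needs chopsticks i and (i+1) mod n and always locks the
// lower-numbered one first. For everyone except philosopher n-1 that is
// "left, then right"; philosopher n-1 takes chopstick 0 before n-1.
std::vector<int> dine(int n, int meals) {
  std::vector<std::mutex> chopsticks(n);
  std::vector<int> eaten(n, 0);  // eaten[i] is written only by thread i
  std::vector<std::thread> philosophers;
  for (int i = 0; i < n; ++i) {
    philosophers.emplace_back([&, i] {
      int lo = std::min(i, (i + 1) % n);
      int hi = std::max(i, (i + 1) % n);
      for (int k = 0; k < meals; ++k) {
        // think() would go here
        std::lock_guard<std::mutex> first(chopsticks[lo]);
        std::lock_guard<std::mutex> second(chopsticks[hi]);
        ++eaten[i];  // eat
      }  // both chopsticks are released when the guards go out of scope
    });
  }
  for (auto& p : philosophers) p.join();
  return eaten;
}
```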

1143
01:03:01,060 --> 01:03:01,310
Good.

1144
01:03:01,310 --> 01:03:03,865
So that's basically the dining
philosophers problem.

1145
01:03:03,865 --> 01:03:04,880
That's one way of fixing it.

1146
01:03:04,880 --> 01:03:07,150
There are actually other
ways of doing it.

1147
01:03:07,150 --> 01:03:09,410
One of the problems with this
particular solution is you

1148
01:03:09,410 --> 01:03:11,030
still can have a long
chain of waiting.

1149
01:03:13,710 --> 01:03:16,100
So there are other schemes that
you can use where, for

1150
01:03:16,100 --> 01:03:20,700
example, if every other one
grabs left and then right and

1151
01:03:20,700 --> 01:03:23,100
then right and then left and
then left and then right and

1152
01:03:23,100 --> 01:03:28,950
then right and left and so
forth, you can end up making

1153
01:03:28,950 --> 01:03:31,510
it so that nobody has to
wait to go all the

1154
01:03:31,510 --> 01:03:32,510
way around the circle.

1155
01:03:32,510 --> 01:03:33,278
Yeah?

1156
01:03:33,278 --> 01:03:34,528
AUDIENCE: [INAUDIBLE]

1157
01:03:37,400 --> 01:03:40,580
PROFESSOR: Well, that would be
a preemption type of thing,

1158
01:03:40,580 --> 01:03:42,730
where I grab one, and if I
didn't get it in time, I

1159
01:03:42,730 --> 01:03:45,320
release it and then try again.

1160
01:03:45,320 --> 01:03:48,120
When you have something like
that, there's an issue.

1161
01:03:48,120 --> 01:03:51,570
It's, how do you set
the timeout amount?

1162
01:03:51,570 --> 01:03:53,980
And the second issue that you
get into when you do timeouts

1163
01:03:53,980 --> 01:03:57,990
is, how do you know you don't
then repeat exactly the same

1164
01:03:57,990 --> 01:04:00,810
thing and convert a
deadlock situation

1165
01:04:00,810 --> 01:04:03,050
into a livelock situation?

1166
01:04:03,050 --> 01:04:05,030
So a livelock situation is
where they're not making

1167
01:04:05,030 --> 01:04:07,480
progress, but they're all
busily working, thinking

1168
01:04:07,480 --> 01:04:09,120
they're making progress.

1169
01:04:09,120 --> 01:04:10,370
So you time out.

1170
01:04:10,370 --> 01:04:12,320
Let's try again.

1171
01:04:12,320 --> 01:04:14,530
What makes you think that the
guys that are deadlocking

1172
01:04:14,530 --> 01:04:16,490
aren't going to do exactly
the same thing?

1173
01:04:16,490 --> 01:04:18,062
AUDIENCE: [INAUDIBLE]

1174
01:04:18,062 --> 01:04:19,680
PROFESSOR: Exactly.

1175
01:04:19,680 --> 01:04:22,660
And in fact, that's actually
a workable scheme.

1176
01:04:22,660 --> 01:04:23,850
And there are schemes
that do it.

1177
01:04:23,850 --> 01:04:27,240
Now, that's much more
complicated.

1178
01:04:27,240 --> 01:04:30,470
It sometimes has more overhead,
especially because things

1179
01:04:30,470 --> 01:04:31,440
become available

1180
01:04:31,440 --> 01:04:35,360
while you're busy
waiting some random amount of

1181
01:04:35,360 --> 01:04:38,020
time before you try again.

1182
01:04:38,020 --> 01:04:40,590
So this is, by the way, the
protocol that is used on the

1183
01:04:40,590 --> 01:04:45,630
Ethernet for doing contention
resolution.

1184
01:04:45,630 --> 01:04:49,100
It's what's called "exponential
backoff." And

1185
01:04:49,100 --> 01:04:55,190
various backoff schemes are
used in order to allow

1186
01:04:55,190 --> 01:04:58,810
multiple things to acquire
mutually-exclusive access to

1187
01:04:58,810 --> 01:05:02,960
something without having to
have a definite ordering.
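One way to sketch the timeout-and-retry idea with randomized exponential backoff in C++: lock the first mutex, only *try* the second, and on failure release everything and sleep for a random delay whose upper bound doubles after each failure. The randomization is what makes it unlikely that competing threads retry in lockstep (livelock). The 1-microsecond starting bound and 1024-microsecond cap are hypothetical values, and this is an illustration of the general technique, not Ethernet's exact protocol.

```cpp
#include <chrono>
#include <mutex>
#include <random>
#include <thread>

// Acquire both mutexes without a global order, backing off on failure.
// A thread never blocks while holding a lock, so no waiting cycle forms.
void lock_both_with_backoff(std::mutex& a, std::mutex& b) {
  thread_local std::mt19937 rng{std::random_device{}()};
  int max_delay_us = 1;  // hypothetical starting bound
  for (;;) {
    a.lock();
    if (b.try_lock()) return;  // got both; caller unlocks later
    a.unlock();                // give up what we hold ("preempt" ourselves)
    std::uniform_int_distribution<int> dist(0, max_delay_us);
    std::this_thread::sleep_for(std::chrono::microseconds(dist(rng)));
    if (max_delay_us < 1024) max_delay_us *= 2;  // exponential growth, capped
  }
}
```

The cost the lecture mentions shows up directly: a thread may be asleep in `sleep_for` while the mutex it wants is already free.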

1188
01:05:02,960 --> 01:05:05,820
So there are solutions, but
they definitely get more

1189
01:05:05,820 --> 01:05:07,230
heavyweight.

1190
01:05:07,230 --> 01:05:09,080
It's not lightweight.

1191
01:05:09,080 --> 01:05:11,890
Whereas if you can prevent
deadlock, that's really good,

1192
01:05:11,890 --> 01:05:16,900
because you just simply
do the natural thing.

1193
01:05:16,900 --> 01:05:19,500
And that tends to
be pretty quick.

1194
01:05:19,500 --> 01:05:24,950
But yeah, all I'm doing is
sort of covering the

1195
01:05:24,950 --> 01:05:26,240
introduction to all
these things.

1196
01:05:26,240 --> 01:05:30,150
There are books written on
this type of subject.

1197
01:05:30,150 --> 01:05:33,620
Any other questions about
dining philosophers and

1198
01:05:33,620 --> 01:05:36,630
deadlock and so forth?

1199
01:05:36,630 --> 01:05:38,005
Now let me tell you how
to deadlock Cilk++.

1200
01:05:42,680 --> 01:05:45,880
So here's some code that
will deadlock

1201
01:05:45,880 --> 01:05:48,240
Cilk++, or at least has the potential to.

1202
01:05:48,240 --> 01:05:50,060
You might run it a bunch of
times, it looks fine.

1203
01:05:53,070 --> 01:05:56,330
Here, the
main routine spawns foo.

1204
01:05:56,330 --> 01:05:57,210
Here's foo down here.

1205
01:05:57,210 --> 01:06:01,020
All foo does is grab a lock
and then unlocks it.

1206
01:06:01,020 --> 01:06:02,500
Empty critical section.

1207
01:06:02,500 --> 01:06:03,520
It could do something
in there.

1208
01:06:03,520 --> 01:06:05,950
It doesn't matter.

1209
01:06:05,950 --> 01:06:09,460
Then the main grabs a lock,
does a cilk_sync and then

1210
01:06:09,460 --> 01:06:12,270
unlocks it.

1211
01:06:12,270 --> 01:06:15,420
So what can go wrong here?

1212
01:06:15,420 --> 01:06:19,455
Notice, by the way, this
is only one lock, L.

1213
01:06:19,455 --> 01:06:22,730
There aren't two locks.

1214
01:06:22,730 --> 01:06:26,960
So you can deadlock Cilk by
just introducing one lock.

1215
01:06:26,960 --> 01:06:28,890
So here's sort of
what's going on.

1216
01:06:28,890 --> 01:06:33,140
Let's let this be the main
thread and this be foo.

1217
01:06:33,140 --> 01:06:35,780
And this will represent
a lock acquire, and

1218
01:06:35,780 --> 01:06:37,390
this is a lock release.

1219
01:06:37,390 --> 01:06:39,910
So what happens is we
perform the lock

1220
01:06:39,910 --> 01:06:43,970
acquire here in the parent.

1221
01:06:43,970 --> 01:06:48,090
First, we spawned here, then
we acquire the lock here.

1222
01:06:48,090 --> 01:06:52,300
And now foo tries to get access
to the lock, and it

1223
01:06:52,300 --> 01:06:55,480
can't because why?

1224
01:06:55,480 --> 01:06:58,570
The main routine has the lock.

1225
01:06:58,570 --> 01:07:01,000
Now what happens?

1226
01:07:01,000 --> 01:07:03,750
The main routine proceeds to the
sync, and what does it do

1227
01:07:03,750 --> 01:07:05,000
at the sync?

1228
01:07:06,870 --> 01:07:09,960
It waits for all children
to be done.

1229
01:07:12,520 --> 01:07:15,420
And notice now we've created a
cycle of waiting, even though

1230
01:07:15,420 --> 01:07:17,630
we used only one lock.

1231
01:07:17,630 --> 01:07:20,580
Main waits, but foo is never
going to complete, because

1232
01:07:20,580 --> 01:07:24,170
it's waiting for the main thread
to release it, the main

1233
01:07:24,170 --> 01:07:26,722
strand here to release it,
the main function here.

1234
01:07:26,722 --> 01:07:28,850
Is that clear?

1235
01:07:28,850 --> 01:07:33,830
So you can deadlock Cilk too
by doing non-deterministic

1236
01:07:33,830 --> 01:07:34,690
programming.

1237
01:07:34,690 --> 01:07:40,660
So here's the methodology that
will help you not do that.

1238
01:07:40,660 --> 01:07:42,060
So what's bad here?

1239
01:07:42,060 --> 01:07:46,430
What's bad is holding the
lock across the sync.

1240
01:07:46,430 --> 01:07:47,540
That's bad.

1241
01:07:47,540 --> 01:07:49,940
So don't do that.

1242
01:07:49,940 --> 01:07:52,880
Doctor, my head hurts.

1243
01:07:52,880 --> 01:07:54,130
Well, stop hitting it.

1244
01:07:59,120 --> 01:08:03,160
So don't hold mutexes
across Cilk syncs.

1245
01:08:03,160 --> 01:08:06,290
Hold mutexes only within
strands, only within

1246
01:08:06,290 --> 01:08:08,802
serially-executing
pieces of code.
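The rule can be illustrated outside Cilk++ with a rough model: `std::thread` stands in for `cilk_spawn` and `join()` for `cilk_sync`. This modeling is an assumption of the sketch, not how the Cilk++ runtime works. The deadlock-prone shape from the lecture appears only in a comment; the function that runs releases `L` before the sync.

```cpp
#include <mutex>
#include <thread>

std::mutex L;
int shared_count = 0;

// The spawned child: its critical section stays inside one strand.
void foo() {
  std::lock_guard<std::mutex> guard(L);
  ++shared_count;
}

// Deadlock-prone shape (not executed here):
//   spawn foo; L.lock(); sync (wait for foo); L.unlock();
// foo blocks on L, while the parent holds L and waits for foo.
//
// Safe shape: hold L only within the parent's strand, release before sync.
void parent_safe() {
  std::thread child(foo);  // models "cilk_spawn foo();"
  {
    std::lock_guard<std::mutex> guard(L);
    ++shared_count;
  }                        // L released here, before the sync
  child.join();            // models "cilk_sync;"
}
```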

1247
01:08:08,802 --> 01:08:13,780
Now, it turns out that you can
hold it across syncs and so

1248
01:08:13,780 --> 01:08:15,880
forth, but you have
to be careful.

1249
01:08:15,880 --> 01:08:19,390
And I'm not going to get
into the details of

1250
01:08:19,390 --> 01:08:20,170
how you can do that.

1251
01:08:20,170 --> 01:08:23,770
If you want to figure that out
on your own, that's fine.

1252
01:08:23,770 --> 01:08:25,720
And then you're welcome
to try to do that

1253
01:08:25,720 --> 01:08:28,319
without deadlocking something.

1254
01:08:28,319 --> 01:08:32,380
Turns out, basically, if you
grab the lock before you do

1255
01:08:32,380 --> 01:08:35,259
any spawns, and then release
it after the Cilk

1256
01:08:35,259 --> 01:08:36,509
sync, you're OK.

1257
01:08:40,020 --> 01:08:41,382
You're generally, in
that case, OK.

1258
01:08:47,710 --> 01:08:50,920
So as always, try to avoid using
mutexes, but that's not

1259
01:08:50,920 --> 01:08:52,350
always possible.

1260
01:08:52,350 --> 01:08:54,260
In other words, try to do
deterministic programming.

1261
01:08:54,260 --> 01:08:56,790
That helps too.

1262
01:08:56,790 --> 01:09:01,720
And on your homework, you had an
example of where

1263
01:09:01,720 --> 01:09:08,189
deterministic programming can
actually do a pretty good job.

1264
01:09:08,189 --> 01:09:11,350
The next anomaly I want to
talk about is convoying.

1265
01:09:11,350 --> 01:09:13,620
Once again, another thing
that can happen.

1266
01:09:13,620 --> 01:09:17,710
This one is actually quite an
embarrassment, because the

1267
01:09:17,710 --> 01:09:22,760
original MIT Cilk system that
we built had this bug in it.

1268
01:09:22,760 --> 01:09:24,970
So we had this bug.

1269
01:09:24,970 --> 01:09:26,590
So let me show you what it is.

1270
01:09:26,590 --> 01:09:28,600
So here's the idea.

1271
01:09:28,600 --> 01:09:31,520
We're using random work-stealing
where each thief

1272
01:09:31,520 --> 01:09:33,529
grabs a mutex on its
victim's deque.

1273
01:09:33,529 --> 01:09:38,330
So in order to steal from a
victim, it grabs a mutex on

1274
01:09:38,330 --> 01:09:39,420
the victim.

1275
01:09:39,420 --> 01:09:42,600
And now, once it's got the
mutex, it now is in a position

1276
01:09:42,600 --> 01:09:46,960
to migrate the work that's
on that victim to

1277
01:09:46,960 --> 01:09:48,200
actually steal the work.

1278
01:09:48,200 --> 01:09:49,640
And you want to do
that atomically.

1279
01:09:49,640 --> 01:09:51,800
You don't want two guys getting
in there trying to

1280
01:09:51,800 --> 01:09:54,220
steal from each other.

1281
01:09:54,220 --> 01:09:58,150
So if the victim's deque is
empty, the thief releases the

1282
01:09:58,150 --> 01:09:59,770
mutex and tries again
at random.

1283
01:09:59,770 --> 01:10:01,180
That makes sense.

1284
01:10:01,180 --> 01:10:04,070
If there's nothing there to be
stolen, then just release the

1285
01:10:04,070 --> 01:10:06,200
mutex and move on.

1286
01:10:06,200 --> 01:10:08,960
If the victim's deque contains
work, the thief then steals

1287
01:10:08,960 --> 01:10:10,750
the topmost frame and then
releases the mutex.

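The stealing protocol described above might look like the following sketch. It assumes a hypothetical `Worker` that keeps its frames in a `std::deque<int>` guarded by a `std::mutex`. This illustrates the blocking protocol only; it is not the MIT Cilk runtime's actual code. The key detail is that `lock()` waits whenever another thief already holds the victim's mutex.

```cpp
#include <cassert>
#include <deque>
#include <mutex>
#include <optional>

// Hypothetical, simplified worker: a deque of "frames" (ints here) guarded by
// a mutex. The owner works at the back (bottom); thieves take from the front
// (top).
struct Worker {
    std::mutex lock;
    std::deque<int> frames;
};

// Blocking steal: lock() waits if another thief holds the victim's mutex.
std::optional<int> steal_blocking(Worker& victim) {
    std::lock_guard<std::mutex> guard(victim.lock);  // may block here
    if (victim.frames.empty()) return std::nullopt;  // empty: guard unlocks on return
    int top = victim.frames.front();                 // steal the topmost frame
    victim.frames.pop_front();
    return top;                                      // guard unlocks on return
}
```

Every thief that picks a busy victim sits inside `steal_blocking` until the current holder finishes its transfer.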
1288
01:10:13,400 --> 01:10:14,890
Where's the performance
bug here?

1289
01:10:19,760 --> 01:10:21,708
AUDIENCE: [INAUDIBLE]

1290
01:10:21,708 --> 01:10:24,143
trying to steal from
each other.

1291
01:10:24,143 --> 01:10:27,065
Like A steals from B, B steals
from C, C steals from D, and

1292
01:10:27,065 --> 01:10:29,510
they all have locks on each
other, and then--

1293
01:10:29,510 --> 01:10:31,810
PROFESSOR: No, because in that
case, they'll each grab the

1294
01:10:31,810 --> 01:10:34,440
deque from each other, discover
it's empty, and release it.

1295
01:10:38,680 --> 01:10:39,720
OK, let me show the bug.

1296
01:10:39,720 --> 01:10:41,520
It is very subtle.

1297
01:10:41,520 --> 01:10:45,810
As I say, we didn't realize we
had this bug until we noticed

1298
01:10:45,810 --> 01:10:47,650
some codes on which we
weren't getting the

1299
01:10:47,650 --> 01:10:48,860
speedups we were expecting.

1300
01:10:48,860 --> 01:10:51,830
Let me show you where
this bug comes from.

1301
01:10:51,830 --> 01:10:53,130
Here's the problem.

1302
01:10:53,130 --> 01:10:56,850
At the startup, most thieves
will quickly converge on the

1303
01:10:56,850 --> 01:10:58,820
worker P0 containing
the initial

1304
01:10:58,820 --> 01:11:02,720
strand, creating a convoy.

1305
01:11:02,720 --> 01:11:05,390
So let me show you
how that happens.

1306
01:11:05,390 --> 01:11:09,450
So here we have the startup of
our Cilk system where one guy

1307
01:11:09,450 --> 01:11:12,700
has work, and all these
are workers that

1308
01:11:12,700 --> 01:11:15,410
have no work to do.

1309
01:11:15,410 --> 01:11:16,490
So what happens?

1310
01:11:16,490 --> 01:11:19,200
They all try to steal
at random.

1311
01:11:19,200 --> 01:11:23,250
In this case, we have this guy
trying to steal from this

1312
01:11:23,250 --> 01:11:25,330
fellow, this guy tries
to steal from

1313
01:11:25,330 --> 01:11:26,930
this fellow, et cetera.

1314
01:11:26,930 --> 01:11:33,260
So of these, this guy, this
guy, and that guy all are

1315
01:11:33,260 --> 01:11:35,850
going to discover there's
nothing there to be stolen,

1316
01:11:35,850 --> 01:11:38,270
and they're going to
repeat the process.

1317
01:11:38,270 --> 01:11:41,010
This guy and this guy, there's
going to be some arbitration.

1318
01:11:41,010 --> 01:11:43,530
And one of them is going
to get the lock.

1319
01:11:43,530 --> 01:11:46,540
Let's assume it's
this one here.

1320
01:11:46,540 --> 01:11:48,900
So what happens is, this
guy gets the lock.

1321
01:11:48,900 --> 01:11:50,150
What does this guy do?

1322
01:11:52,440 --> 01:11:53,690
He's going to wait.

1323
01:11:55,970 --> 01:11:57,500
Because he's trying to
acquire the lock.

1324
01:11:57,500 --> 01:12:00,610
He can't acquire the
lock, so he waits.

1325
01:12:00,610 --> 01:12:02,350
So then what happens?

1326
01:12:02,350 --> 01:12:07,470
This guy now wants to steal
the work from this fellow.

1327
01:12:07,470 --> 01:12:10,980
So he steals a little
bit of work.

1328
01:12:10,980 --> 01:12:13,510
Then these guys now,
what do they do?

1329
01:12:13,510 --> 01:12:16,550
They try again.

1330
01:12:16,550 --> 01:12:19,450
So this guy tries to steal from
there, this guy tries to

1331
01:12:19,450 --> 01:12:22,100
steal from there, this one
happens to try to steal there.

1332
01:12:22,100 --> 01:12:25,510
This one sees there's work
there to be done, so

1333
01:12:25,510 --> 01:12:27,320
what does it do?

1334
01:12:27,320 --> 01:12:29,200
It waits.

1335
01:12:29,200 --> 01:12:32,590
But these guys then try again.

1336
01:12:32,590 --> 01:12:36,675
Maybe a little bit more
stuff is moved.

1337
01:12:36,675 --> 01:12:38,490
They try again.

1338
01:12:38,490 --> 01:12:40,550
A little bit more stuff.

1339
01:12:40,550 --> 01:12:42,020
They try again.

1340
01:12:42,020 --> 01:12:45,870
But every time one tries and
gets stuck on P0 while we're

1341
01:12:45,870 --> 01:12:51,540
doing that whole transfer, they
all are ending up getting

1342
01:12:51,540 --> 01:12:54,980
stuck waiting for this
guy to finish.

1343
01:12:54,980 --> 01:12:58,300
And now, we've got work over
here, but how many guys are

1344
01:12:58,300 --> 01:13:00,580
going to be trying to
steal from this guy?

1345
01:13:00,580 --> 01:13:02,800
None.

1346
01:13:02,800 --> 01:13:05,140
They're all going to be trying
to steal from this one,

1347
01:13:05,140 --> 01:13:07,870
because they all have done a
lock acquisition, and they're

1348
01:13:07,870 --> 01:13:10,520
sitting there waiting.

1349
01:13:10,520 --> 01:13:14,300
So this is called convoying,
where they all pile up on one

1350
01:13:14,300 --> 01:13:16,090
thing, and then they resolve
that convoy one at a time.

1351
01:13:16,090 --> 01:13:18,530
So this was a bug in startup.

1352
01:13:18,530 --> 01:13:22,540
Why wasn't Cilk starting
up fast?

1353
01:13:22,540 --> 01:13:26,680
Initially, we just thought, oh,
there are system kinds of

1354
01:13:26,680 --> 01:13:28,870
things going on there.

1355
01:13:28,870 --> 01:13:30,770
So the work now gets distributed
very slowly,

1356
01:13:30,770 --> 01:13:33,340
because each one is going to
serially try to get this work,

1357
01:13:33,340 --> 01:13:34,650
and they're not going
to try to get the

1358
01:13:34,650 --> 01:13:36,460
work from each other.

1359
01:13:36,460 --> 01:13:41,040
What you want is that on the
second phase, half the guys

1360
01:13:41,040 --> 01:13:43,100
might start hitting this one.

1361
01:13:43,100 --> 01:13:47,170
So you get some kind of
exponential distribution of

1362
01:13:47,170 --> 01:13:49,590
the work in kind of
a tree fashion.

1363
01:13:49,590 --> 01:13:52,080
And that's what theory
says would happen.

1364
01:13:52,080 --> 01:13:55,420
But the theory is usually done
without worrying about what

1365
01:13:55,420 --> 01:13:58,810
happens in the implementation
of the lock.

1366
01:13:58,810 --> 01:14:01,547
What's the fix for this?

1367
01:14:01,547 --> 01:14:03,044
Yeah?

1368
01:14:03,044 --> 01:14:06,038
AUDIENCE: Can you just basically
shove-- when you're

1369
01:14:06,038 --> 01:14:09,032
transferring, you should also
say, I have work, so that

1370
01:14:09,032 --> 01:14:13,024
people [INAUDIBLE] waiting for
that guy to [INAUDIBLE].

1371
01:14:13,024 --> 01:14:15,790
PROFESSOR: You could do that,
but in the meantime, it could

1372
01:14:15,790 --> 01:14:19,540
be that the attempt to steal
goes so much faster than the

1373
01:14:19,540 --> 01:14:22,000
actual getting of the work,
you're still going to get half

1374
01:14:22,000 --> 01:14:24,980
the guys locked up
on this one.

1375
01:14:24,980 --> 01:14:27,160
And the other half might be
locked up on this one.

1376
01:14:30,890 --> 01:14:32,640
Good idea.

1377
01:14:32,640 --> 01:14:35,584
What other things can we do?

1378
01:14:35,584 --> 01:14:37,428
AUDIENCE: Can you check
how many people

1379
01:14:37,428 --> 01:14:38,350
are waiting on the--

1380
01:14:38,350 --> 01:14:41,110
PROFESSOR: Yeah, so the
idea is we don't want

1381
01:14:41,110 --> 01:14:43,215
to use a lock operation.

1382
01:14:46,350 --> 01:14:48,280
So here's the idea.

1383
01:14:48,280 --> 01:14:51,370
We use a non-blocking function
that's usually called

1384
01:14:51,370 --> 01:14:55,660
"try_lock," rather than "lock."
try_lock attempts to

1385
01:14:55,660 --> 01:14:57,360
acquire the mutex.

1386
01:14:57,360 --> 01:14:59,830
If it succeeds, great.

1387
01:14:59,830 --> 01:15:00,770
It's got it.

1388
01:15:00,770 --> 01:15:03,800
If it fails, it doesn't
spin.

1389
01:15:03,800 --> 01:15:05,435
It simply returns and
says, I failed.

1390
01:15:08,290 --> 01:15:10,350
It doesn't spin or
yield or anything.

1391
01:15:10,350 --> 01:15:12,550
It just says, oh, I
failed, and tells

1392
01:15:12,550 --> 01:15:15,750
that back to the user.

1393
01:15:15,750 --> 01:15:18,710
But it doesn't attempt
to block.

1394
01:15:18,710 --> 01:15:23,610
So with try_lock now, what can
these other processors do?

1395
01:15:23,610 --> 01:15:24,940
They do a try_lock--

1396
01:15:24,940 --> 01:15:25,440
yeah?

1397
01:15:25,440 --> 01:15:27,680
AUDIENCE: [INAUDIBLE]

1398
01:15:27,680 --> 01:15:30,190
PROFESSOR: Exactly.

1399
01:15:30,190 --> 01:15:34,700
Instead of waiting there on
the guy that they fail on,

1400
01:15:34,700 --> 01:15:37,390
they pick another random
one to steal from.

1401
01:15:39,950 --> 01:15:42,030
So they'll just continually
try to get it.

1402
01:15:42,030 --> 01:15:44,150
If they get it, then they
can do their operation.

1403
01:15:44,150 --> 01:15:50,110
If they don't get it, they just
look elsewhere for work.

1404
01:15:50,110 --> 01:15:51,020
So that's what it does.

1405
01:15:51,020 --> 01:15:52,330
It just tries to steal again at

1406
01:15:52,330 --> 01:15:53,580
random, rather than blocking.

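This is a sketch of the `try_lock` fix, under the same assumptions as the blocking protocol: a hypothetical `Worker` with a mutex-guarded `std::deque<int>`. The names `try_steal` and `steal_loop` are illustrative. The code uses the standard `std::unique_lock` constructor with `std::try_to_lock`, which calls the mutex's `try_lock()` and never blocks. On failure, the thief simply draws another random victim.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>
#include <optional>
#include <random>
#include <vector>

// Hypothetical worker: frames guarded by a mutex, thieves take from the front.
struct Worker {
    std::mutex lock;
    std::deque<int> frames;
};

// Non-blocking steal attempt: if the victim's mutex is busy, give up at once
// instead of queueing behind other thieves.
std::optional<int> try_steal(Worker& victim) {
    std::unique_lock<std::mutex> guard(victim.lock, std::try_to_lock);
    if (!guard.owns_lock()) return std::nullopt;     // busy: fail immediately
    if (victim.frames.empty()) return std::nullopt;  // empty: nothing to steal
    int top = victim.frames.front();
    victim.frames.pop_front();
    return top;
}

// Thief loop (sketch): pick random victims until a steal succeeds.
// Assumes some other worker eventually has work; `self` is never a victim.
int steal_loop(std::vector<Worker>& workers, std::size_t self, std::mt19937& rng) {
    std::uniform_int_distribution<std::size_t> pick(0, workers.size() - 1);
    for (;;) {
        std::size_t v = pick(rng);
        if (v == self) continue;
        if (auto frame = try_steal(workers[v])) return *frame;
        // failed (busy or empty): just try another random victim
    }
}
```

Suppose only worker 2 holds frames `{7, 8}`. Then `steal_loop(ws, 0, rng)` keeps redrawing until it hits worker 2, and returns 7. A real scheduler would also need a termination check rather than looping forever on an idle system.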
1407
01:15:57,820 --> 01:16:00,845
And that gets rid of this
convoying problem.

1408
01:16:04,090 --> 01:16:09,250
As I say, dangerous programming,
because we didn't

1409
01:16:09,250 --> 01:16:10,800
even know we had a problem.

1410
01:16:10,800 --> 01:16:12,930
Our code was just slower than
it could have been.

1411
01:16:16,740 --> 01:16:18,130
Questions about convoying?

1412
01:16:24,210 --> 01:16:27,230
So try_lock is actually a very
convenient thing to use.

1413
01:16:27,230 --> 01:16:31,150
So in many cases, you may find
that, hey, rather than waiting

1414
01:16:31,150 --> 01:16:33,860
on something with nothing to do,
let me go see if there's

1415
01:16:33,860 --> 01:16:37,450
something else I can
do in the meantime.

1416
01:16:40,320 --> 01:16:41,570
Contention.

1417
01:16:43,760 --> 01:16:51,390
So here's an example of a code
where I want to add up some

1418
01:16:51,390 --> 01:16:57,060
function of the elements
of some array.

1419
01:16:57,060 --> 01:17:04,470
So here I've got a value of
n, which is a million.

1420
01:17:04,470 --> 01:17:17,410
And I have a type X. So we have
a compute function, which

1421
01:17:17,410 --> 01:17:20,680
takes a pointer to a--

1422
01:17:20,680 --> 01:17:22,640
did I do this right?

1423
01:17:22,640 --> 01:17:31,150
To value V. So anyway, my C++
is not as good as my C, and

1424
01:17:31,150 --> 01:17:35,220
for those who don't know,
my C isn't very good.

1425
01:17:35,220 --> 01:17:41,025
So anyway, we have an array
of type X of n elements.

1426
01:17:43,730 --> 01:17:47,390
And what I do is I set result
to be 0, and then I have a

1427
01:17:47,390 --> 01:17:52,090
loop here which basically goes
and adds into result the

1428
01:17:52,090 --> 01:17:55,220
result of computing on each
element of the array.

1429
01:17:55,220 --> 01:17:58,810
And then it outputs
the result.

1430
01:17:58,810 --> 01:18:00,940
Does everybody understand what's
going on in the code?

1431
01:18:00,940 --> 01:18:03,230
It's basically compute on every
element in the array,

1432
01:18:03,230 --> 01:18:06,270
take the result, add all
those results together.

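The serial loop being described can be sketched in C++ as follows. The lecture does not show the definitions of `X` or `compute`, so the versions here are hypothetical placeholders. Only the loop structure follows the slide: set `result` to 0, then add `compute` of each element.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the lecture's type X and compute().
struct X { long value; };
long compute(const X& v) { return v.value % 7; }

// Serial version: apply compute to every element and add up the results.
long serial_sum(const std::vector<X>& a) {
    long result = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        result += compute(a[i]);
    return result;
}
```

With `n` set to a million, as on the slide, a caller would fill a `std::vector<X>` of that size and print `serial_sum(a)`.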
1433
01:18:06,270 --> 01:18:08,680
We want to parallelize this.

1434
01:18:08,680 --> 01:18:12,090
So let's parallelize that.

1435
01:18:12,090 --> 01:18:14,335
What looks like the best
opportunity for parallelizing?

1436
01:18:18,970 --> 01:18:21,490
Yeah, we go after the for and
make it be a cilk_for.

1437
01:18:21,490 --> 01:18:24,310
Let's add all those guys up.

1438
01:18:24,310 --> 01:18:28,490
And what's the problem
with that?

1439
01:18:28,490 --> 01:18:29,580
We get a race.

1440
01:18:29,580 --> 01:18:30,590
What's the race on?

1441
01:18:30,590 --> 01:18:31,510
AUDIENCE: Result.

1442
01:18:31,510 --> 01:18:33,360
PROFESSOR: Result.

1443
01:18:33,360 --> 01:18:36,200
They're all updating
result in parallel.

1444
01:18:36,200 --> 01:18:39,910
Oh, I know how to
resolve a race.

1445
01:18:39,910 --> 01:18:43,800
Let's just put a
lock around it.

1446
01:18:43,800 --> 01:18:45,050
So here we have the race.

1447
01:18:48,460 --> 01:18:50,630
First, let's analyze this.

1448
01:18:50,630 --> 01:18:53,940
So the work here is order n.

1449
01:18:53,940 --> 01:18:55,320
What is the span?

1450
01:18:55,320 --> 01:18:56,240
AUDIENCE: Log n.

1451
01:18:56,240 --> 01:19:00,000
PROFESSOR: Yeah, the span is log
n for the control of the

1452
01:19:00,000 --> 01:19:04,080
stuff here, because this
is all constant time.

1453
01:19:04,080 --> 01:19:11,760
So the running time here is
order n over P plus log n.

1454
01:19:11,760 --> 01:19:13,560
If you remember the greedy
scheduling, it's going to be

1455
01:19:13,560 --> 01:19:16,100
something like this, because
this is the work

1456
01:19:16,100 --> 01:19:20,080
over P plus the span.

1457
01:19:20,080 --> 01:19:25,610
So we expect that if n over P
is big compared to log n,

1458
01:19:25,610 --> 01:19:28,060
we're going to do pretty well,
because we have parallelism

1459
01:19:28,060 --> 01:19:29,310
of n over log n.

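Here is the analysis above, written out. It uses T_1 for the work, T_infinity for the span, and T_P for the running time on P processors (standard notation, assumed here because the transcript only speaks the terms). This is the analysis of the racy `cilk_for` version, which is fast but incorrect:

```latex
\begin{aligned}
T_1 &= \Theta(n), \qquad T_\infty = \Theta(\log n),\\
T_P &\le \frac{T_1}{P} + T_\infty = O\!\left(\frac{n}{P} + \log n\right),\\
\text{parallelism} &= \frac{T_1}{T_\infty} = \Theta\!\left(\frac{n}{\log n}\right).
\end{aligned}
```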
1460
01:19:31,370 --> 01:19:33,330
So let's fix this bug.

1461
01:19:33,330 --> 01:19:38,030
So this is fast code, but
it's incorrect code.

1462
01:19:38,030 --> 01:19:43,040
So let's fix it by getting
rid of this race.

1463
01:19:43,040 --> 01:19:47,410
So what we'll do is we'll
put a lock before.

1464
01:19:47,410 --> 01:19:52,045
We'll introduce a mutex L, and
we'll lock L before we add to

1465
01:19:52,045 --> 01:19:55,400
the result, and then
we'll unlock it.

1466
01:19:55,400 --> 01:20:00,120
So first of all, this is a bad
way to do it, because what I

1467
01:20:00,120 --> 01:20:05,380
really should do is first
compute on my array element

1468
01:20:05,380 --> 01:20:10,210
and then lock, add it to the
result, and then unlock so

1469
01:20:10,210 --> 01:20:12,160
that we lessen the time
that I'm holding the

1470
01:20:12,160 --> 01:20:14,200
lock in each iteration.

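This sketch shows the mutex "fix", already arranged the better way just suggested: compute first, then lock only around the add. `std::thread` stands in for `cilk_for`, with the range split evenly across `num_threads` threads. `X`, `compute`, and `locked_sum` are hypothetical. The answer is now correct, but every iteration takes the same mutex `L`.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-ins for the lecture's type X and compute().
struct X { long value; };
long compute(const X& v) { return v.value % 7; }

// Mutex-protected parallel sum (sketch). compute() runs outside the lock, so
// each critical section is just one add, but all n adds share mutex L and
// therefore execute one at a time.
long locked_sum(const std::vector<X>& a, unsigned num_threads) {
    long result = 0;
    std::mutex L;
    std::vector<std::thread> threads;
    std::size_t n = a.size();
    for (unsigned t = 0; t < num_threads; ++t) {
        std::size_t lo = n * t / num_threads, hi = n * (t + 1) / num_threads;
        threads.emplace_back([&, lo, hi] {
            for (std::size_t i = lo; i < hi; ++i) {
                long temp = compute(a[i]);             // work outside the lock
                std::lock_guard<std::mutex> guard(L);  // lock L
                result += temp;                        // the only shared update
            }                                          // unlock L
        });
    }
    for (auto& th : threads) th.join();
    return result;
}
```

All n additions funnel through `L`, so adding threads shortens only the `compute` part, while the critical sections still run serially.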
1471
01:20:14,200 --> 01:20:16,270
Nevertheless, this is still
a lousy piece of code.

1472
01:20:16,270 --> 01:20:17,311
Why's that?

1473
01:20:17,311 --> 01:20:19,235
AUDIENCE: It's still
serialized.

1474
01:20:19,235 --> 01:20:20,820
PROFESSOR: Yeah, it's
serialized.

1475
01:20:20,820 --> 01:20:27,110
Every update to result here
has to go on serially.

1476
01:20:27,110 --> 01:20:28,130
There are n accesses.

1477
01:20:28,130 --> 01:20:30,760
They're all going to
go one at a time.

1478
01:20:30,760 --> 01:20:34,450
So my running time, instead of
being n over P plus log n, is going

1479
01:20:34,450 --> 01:20:37,800
to be something like order n.

1480
01:20:43,490 --> 01:20:49,390
Believe me, I have seen many
people write code where they

1481
01:20:49,390 --> 01:20:53,220
essentially do exactly this.

1482
01:20:53,220 --> 01:20:56,340
They take something, they make
it parallel, they have a race

1483
01:20:56,340 --> 01:20:57,760
bug, they fix it with a mutex.

1484
01:21:01,200 --> 01:21:05,240
Bad idea, because then
we end up with

1485
01:21:05,240 --> 01:21:07,170
contention on this mutex.

1486
01:21:07,170 --> 01:21:08,990
What's the right way to
parallelize this?

1487
01:21:17,900 --> 01:21:18,395
Yeah?

1488
01:21:18,395 --> 01:21:24,830
AUDIENCE: Maybe you could
have each [INAUDIBLE]

1489
01:21:24,830 --> 01:21:29,780
have result as an array and
have each [INAUDIBLE]

1490
01:21:29,780 --> 01:21:30,770
one place [INAUDIBLE].

1491
01:21:30,770 --> 01:21:32,750
And then at the end,
sum of all the--

1492
01:21:32,750 --> 01:21:34,970
PROFESSOR: But won't that
be n elements to sum up?

1493
01:21:34,970 --> 01:21:37,455
AUDIENCE: [INAUDIBLE]

1494
01:21:37,455 --> 01:21:39,940
AUDIENCE: So basically,
have, say,

1495
01:21:39,940 --> 01:21:41,440
eight results, instead
of having--

1496
01:21:41,440 --> 01:21:42,690
PROFESSOR: For each thread.

1497
01:21:45,180 --> 01:21:46,400
Good.

1498
01:21:46,400 --> 01:21:50,310
So that each one could keep it
local to its own thread.

1499
01:21:50,310 --> 01:21:53,180
Now, of course, that involves me
knowing how many processors

1500
01:21:53,180 --> 01:21:55,770
I'm running on.

1501
01:21:55,770 --> 01:22:01,490
So now, if that number
changes or whatever--

1502
01:22:01,490 --> 01:22:03,530
there's a way of doing it
completely processor

1503
01:22:03,530 --> 01:22:05,570
obliviously.

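The per-thread idea from the audience can be sketched as below. It again uses `std::thread` in place of Cilk, with hypothetical names `X`, `compute`, and `per_thread_sum`. Each thread accumulates into a private local variable and writes its own slot of `partial` exactly once, so no mutex is needed. As the professor points out, the function has to be told the thread count.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-ins for the lecture's type X and compute().
struct X { long value; };
long compute(const X& v) { return v.value % 7; }

// One partial result per thread, combined serially at the end (sketch).
long per_thread_sum(const std::vector<X>& a, unsigned num_threads) {
    std::vector<long> partial(num_threads, 0);
    std::vector<std::thread> threads;
    std::size_t n = a.size();
    for (unsigned t = 0; t < num_threads; ++t) {
        std::size_t lo = n * t / num_threads, hi = n * (t + 1) / num_threads;
        threads.emplace_back([&a, &partial, t, lo, hi] {
            long local = 0;                          // private accumulator
            for (std::size_t i = lo; i < hi; ++i) local += compute(a[i]);
            partial[t] = local;                      // one write, own slot
        });
    }
    for (auto& th : threads) th.join();
    long result = 0;
    for (long p : partial) result += p;              // combine the P partials
    return result;
}
```

The design ties the code to `num_threads`. If the number of processors changes, the caller has to pick a new value.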
1504
01:22:05,570 --> 01:22:06,430
AUDIENCE: Divide and conquer.

1505
01:22:06,430 --> 01:22:08,970
PROFESSOR: Yeah, do divide
and conquer.

1506
01:22:08,970 --> 01:22:12,680
Add up recursively the first
half of the elements, add up

1507
01:22:12,680 --> 01:22:14,080
the second half of
the elements,

1508
01:22:14,080 --> 01:22:15,820
and add them together.

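Below is a sketch of the divide-and-conquer version: add the first half recursively, add the second half, and combine. `std::async` with `std::launch::async` stands in for `cilk_spawn`, and `future::get()` stands in for `cilk_sync`. Unlike a work-stealing runtime, `std::async` may create an OS thread per spawn, so the sketch stops recursing below a grain size. The grain size is a hypothetical parameter with a default of 4096, and `X` and `compute` are placeholders as before. The code never mentions the number of processors.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <vector>

// Hypothetical stand-ins for the lecture's type X and compute().
struct X { long value; };
long compute(const X& v) { return v.value % 7; }

// Sum compute(a[i]) over [lo, hi) by recursive halving.
long dc_sum(const std::vector<X>& a, std::size_t lo, std::size_t hi,
            std::size_t grain = 4096) {
    if (hi - lo <= grain || hi - lo < 2) {           // base case: serial loop
        long s = 0;
        for (std::size_t i = lo; i < hi; ++i) s += compute(a[i]);
        return s;
    }
    std::size_t mid = lo + (hi - lo) / 2;
    auto left = std::async(std::launch::async, dc_sum,
                           std::cref(a), lo, mid, grain);  // "spawn" first half
    long right = dc_sum(a, mid, hi, grain);                // parent: second half
    return left.get() + right;                             // "sync", then combine
}
```

There is no shared accumulator to contend on, and the span is roughly log(n / grain) plus the grain.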
1509
01:22:15,820 --> 01:22:18,240
Next time, we're going to see
yet another mechanism for

1510
01:22:18,240 --> 01:22:22,060
doing that, which gets the
kind of performance that

1511
01:22:22,060 --> 01:22:26,340
you're mentioning but without
having to rewrite the for loop

1512
01:22:26,340 --> 01:22:27,780
as divide and conquer.

1513
01:22:27,780 --> 01:22:29,030
We'll see that next time.

1514
01:22:31,390 --> 01:22:33,640
So in this case, we have lock
contention that takes away our

1515
01:22:33,640 --> 01:22:34,570
parallelism.

1516
01:22:34,570 --> 01:22:40,960
Unfortunately, very little is
known about lock contention.

1517
01:22:40,960 --> 01:22:45,200
The greedy scheduler, you can
show that it achieves T1 over

1518
01:22:45,200 --> 01:22:51,740
P plus T infinity plus B, where
B is the bondage, that is,

1519
01:22:51,740 --> 01:22:56,180
the total time of
all critical sections.

1520
01:22:56,180 --> 01:22:58,440
That's a lousy bound, because
it says, even if they're

1521
01:22:58,440 --> 01:23:02,310
locked by different locks, you
still add up the total time of

1522
01:23:02,310 --> 01:23:04,100
all the critical sections.

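Here is the bound just quoted, written out following the lecture's statement. B, the bondage, sums the lengths of all critical sections regardless of which lock guards them. Consider the mutex-protected loop from earlier in the lecture. Each of its n iterations holds the lock for constant time, so B is order n, and the bound collapses to order n:

```latex
\begin{aligned}
T_P &\le \frac{T_1}{P} + T_\infty + B, \qquad B = \sum_{\text{critical sections } c} \operatorname{len}(c),\\
\text{locked sum: } B &= \Theta(n) \;\Longrightarrow\; T_P = O\!\left(\frac{n}{P} + \log n + n\right) = O(n).
\end{aligned}
```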
1523
01:23:04,100 --> 01:23:07,550
And generally, although you can
improve this in special

1524
01:23:07,550 --> 01:23:12,120
cases, the general theory for
understanding contention is

1525
01:23:12,120 --> 01:23:14,220
not very well developed.

1526
01:23:14,220 --> 01:23:18,740
And this upper bound is weak,
but little is known about lock

1527
01:23:18,740 --> 01:23:20,100
contention.

1528
01:23:20,100 --> 01:23:22,890
Very little is known about
lock contention.

1529
01:23:22,890 --> 01:23:28,050
So to conclude, always
write deterministic

1530
01:23:28,050 --> 01:23:31,170
programs, unless you can't.

1531
01:23:33,790 --> 01:23:37,410
Always write deterministic
programs, unless you can't.

1532
01:23:37,410 --> 01:23:38,660
Great.