The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: So John is going to present project three, beta.

JOHN: All right. So here's the performance grades. In general, the submission went a lot better than last time, in that things were on time and nobody failed to build, or forgot to add files to their project, or so on. We did change the scoring mechanism a little bit. In the [? mdriver ?] that we gave you, if your validator failed you on any of your traces, your score is a zero. In this one, we decided to be nicer. We replaced your validator with our correct validator. And for traces that you failed, you get a zero for the points that those traces contribute. But you did get an overall partial score, even if you failed a couple traces. So on that note, the reference implementation does get a 56 on this score.
And there were people who had slower-than-reference implementations that landed below 56. So that might be something to think about for your final submission. The high score was a 96. And there were actually quite a few groups in the 90s. So overall, people did really well on this.

With that said, your validators didn't really-- I guess they were OK. But there were some people whose validators failed projects that were correct, and other people whose validators failed to detect certain situations. So that's also something to work on for the final. We won't be releasing the stock validators. So it'll be up to you guys to find out what's wrong with your validators and fix them. And along the same lines of correctness, once again, for the final submission, we'll be running-- actually even for the beta, I believe, we're going to Valgrind your projects and look for memory errors. So do that to your own projects and investigate any messages you get.

AUDIENCE: [INAUDIBLE]

JOHN: OK. So the highlighted column, number 31, refers to the reference implementation of the validator.
So that's the authority. If that's green, then your implementation is correct. And so hopefully, a correct validator would agree with column 31.

AUDIENCE: [INAUDIBLE]

JOHN: Yes.

AUDIENCE: [INAUDIBLE] question. How can it be that most of-- so an implementation is vertical, so tests are [UNINTELLIGIBLE]?

JOHN: No. The implementations are horizontal, and the tests are vertical.

AUDIENCE: So we want our column to look like column 31? Or we want--

JOHN: You want-- your validator's correctness score will be determined by whether or not your column corresponds with column 31. And then, your implementation's correctness will purely be determined by whether 31 marks your row red or green. Does that make sense?

AUDIENCE: [INAUDIBLE PHRASE] columns that are all green, our validators are not [UNINTELLIGIBLE]? Is that what you're saying?

PROFESSOR: That's right.

JOHN: That's correct.
PROFESSOR: Whereas the rows that are green, that's what we like to see. We like green rows. And then, we like columns that match column 31.

AUDIENCE: [INAUDIBLE PHRASE]. The first row should be all red. And right now, [INAUDIBLE].

JOHN: Right.

PROFESSOR: That's correct.

JOHN: Whatever error this person had, very few validators seem to have caught it. Which is very surprising, because what we did for your validator.c is that we removed a line of code that it contained, and we added a comment that explained in English exactly what that line of code did. So it was kind of interesting to see that not everybody came up with a validator that's identical to the reference one.

PROFESSOR: OK--

JOHN: Yeah. So please run Valgrind on your code before the final submission. And we'll be posting your personalized results to your repos sometime probably by the end of the day, either today or tomorrow.

PROFESSOR: Great. All right, you can take this [UNINTELLIGIBLE]. Or you can [UNINTELLIGIBLE].
Here you go. You guys can have it here, in case you need to chip in.

OK. So today, we're going to talk about programming in parallel. Parallel programming and so forth. So this is, I'm sure, what you've all been waiting for. Oops. Oh, we have no power here. There we go. There we go. Now I've got power. OK. Let's see here. How's that? Good. OK. So we'll talk about multicore programming. And let me start with a little bit of history.

So since the mid-to-late 1960s-- so how many years is that? 50 years. Wow. Semiconductor density has been increasing at the rate of about-- it's been doubling about every 18 to 24 months. OK. So every year and a half to two years, we get a doubling of density on the chips. And that's a trend that still is continuing.
OK. So that's called Moore's law, the doubling of density of integrated circuits. And so, this is basically a curve showing how transistor count is rising. OK. So all these green things are Intel CPUs and what the transistor count is on them. Yeah, question?

AUDIENCE: [INAUDIBLE PHRASE] the lines in [INAUDIBLE]?

PROFESSOR: So there have been some technology changes along the way. So in particular, the [UNINTELLIGIBLE] transition is back down here, I think. I don't remember which one that is. Well, this is actually a different one. What we're looking at right now is the transistors, which have been very smooth. OK. So I'll explain this curve in a minute. So there are two things plotted on here. One is the Intel CPU density, and the other is what the clock speed of those processors is. And so these are the clock speed numbers. And so, the integrated circuit technology has been-- the density has been doubling.
And it's really an unbelievable sort of social and economic process, that this has basically been called a law. Because what happens is if a-- there are so many people that contribute to making integrated circuits be dense. There are so many pieces of technology that go into that. And what happens is, if you decide that you're going to try to jump and make something that goes faster than Moore's law, it's more expensive for you to do it. And none of the other participants in that economy can keep up. And you're just going to be more expensive. So people will opt for the cheapest thing that gets the factor of two every 18 to 24 months. Whereas if you're behind, then nobody uses your stuff. So everybody's got this sort of self-fulfilling prophecy, such that the rate at which the density is increasing has just been extremely stable for over 50 years. It's remarkable. Yeah, question?

AUDIENCE: [INAUDIBLE PHRASE] every six months. And somehow, [INAUDIBLE] you would have self-replicated?

PROFESSOR: No, I'm not saying that.
What I'm saying is that there is some amount of everybody expecting that this is the point that everybody's going to be at. And so if you try to go more aggressively than that, you can get burned because you'll be more expensive. If you don't go that fast, you're going to get burned because nobody's going to adopt your particular piece of the technology. And so, what happens is everybody sort of settles for this regular repeating. It's a remarkable social and economic phenomenon. It's got very little to do, at some level, with technology. It's just that we know that we can improve things.

But what's amazing is this growth has gone through many transitions. At one point, they said we weren't going to be able to build integrated circuits any more densely, because of all of the masks that were made-- basically, you make computers with a photographic process of exposing and using masks that you shine light through. That's the way they used to do it. And what happened was the wavelengths of light were such that you were just simply not going to be able to get the resolutions. So what did they do?
They switched to e-beams. OK. Electrons rather than photons to expose the silicon wafers and so forth. And so, they've gone through a whole bunch of transitions and different technologies. And yet, throughout all of that, it's been just very steady progress at about the rate of 18 to 24 months per doubling of density. And that is still going on, and is projected to go on maybe for 10 years more. It's going to run out, I hope in my lifetime. And certainly within your lifetimes.

So that has been going. Then, there's a second phenomenon that has been going on since about the mid-1980s. And that is that the clock speed has actually been growing on a similar curve, where basically, we've been getting 30% faster processor clock speeds since the mid-1980s. But something happened there, which was in around 2003, it flattened out. And the reason is, as a practical matter, clock speed for air-cooled systems is bounded at somewhere around 5 gigahertz.
If you want to liquid-cool it or nitrogen-cool it or something, you could make it go faster. But basically, the problem is that things get too hot. And they cannot convey the heat out. So for a while, if you have greater density, the transistors get smaller. They switch faster. And you can make the clock speed go faster. But at some point, they hit the wall.

And so there the vendors were. People like Intel, AMD, Motorola. A variety of the semiconductor manufacturers. And what's happened is they can still make integrated circuits more and more dense. But they can't clock them any faster. OK. So here's what's going on in the circuits. So here's essentially how much power was being dissipated by a variety of Intel processors along the way, and what they [INAUDIBLE] 2000. They started getting hotter and hotter, until if they just continued this trend, they were going to be trying to have junction temperatures that are as hot as the surface of the sun. Well, they clearly couldn't do that. OK.
So you might say, well, let's put it off a few years. Yeah, but how many years are you going to put this off? And so, what happened was they got stuck. They simply could not make chips get clocked any faster. So what did they decide to do? They've got all this silicon area, but they can't make the processors faster with it. So their solution was to scale performance by putting many processing cores on the microprocessor chip.

So this is an example of a Core i7. It's a four-core-- one, two, three, four-core processor. We actually have six-core machines now. But I didn't update the figure. And what's going to happen now is Moore's law is going to continue for a few more years. And so it looks like each new generation of Moore's law is going to potentially double the number of cores per chip. So you folks are using 12-core machines. Two six-core chips. Well, that's going to basically keep increasing. And so, we're going to get more and more cores per chip. OK. That's all well and good. But it turns out that there's a major issue. And that's software.
Everybody has written their software. And there are billions and billions and billions of dollars invested in existing legacy software that's written for how many cores? One. And moving it to multicore is a nightmare for these companies. OK. And it's potentially a nightmare for these vendors. Because if people say, gee, you can't make the processors go any faster, why should I buy a new processor? My old processor is as good as my new one. OK.

And so, anyway, that's sometimes been called the multicore challenge. The multicore menace. The multicore revolution. Whatever. But that's what it's all about. It's all about the issue of the frequency scaling of the clocks, versus Moore's law, which talks about what the density is. OK. So their solution is to do-- and so what we're going to talk about for a bunch of the rest of the term is going to be, how do you actually program multicore processors? We're going to look at some fairly new software technology for doing that.
So here's an abstract multicore architecture. It's not precise. This is only showing one level of cache. So we have processors connected to a cache. In fact, of course, you know that there are multiple levels of cache. Yeah, this is the international symbol for cache, if you live in the US. So the processors have their cache. Of course, you know that what actually happens is you have multiple levels of cache. And it's shared cache at some levels. OK. So it's more complex than this. But this is sort of an abstract way of understanding a bunch of the issues. And then, of course, they only get more complicated as we look at reality, as with all these hardware-related things.

And so, this is a chip multiprocessor. Now there are other ways of using the silicon. So another way of using the silicon is building things like graphics processors, and using silicon for a very special-purpose thing. So that instead of saying, let's build multiple processors, you can say, let's dedicate some fraction of the silicon real estate.
Instead of to general-purpose computing, let's dedicate it to some specific purpose, like graphics, or some kind of stream processing, or what have you. Sensor processing. A variety of other things you can do. But one main trend is doing chip multiprocessors.

So we're going to talk a little bit about shared-memory hardware. Just enough to get you folks off the ground to understand what's going on underneath the system. And then, we're going to talk about four concurrency platforms, which are not the only platforms one can program in. But they're ones that you should be familiar with. The last one, Cilk++, is the one we're going to do our programming assignments in. And then, we're going to talk about race conditions, because that's the biggest thing that comes up when you do parallel programming compared to ordinary serial programming. They're the most pernicious type of bugs. And you need to understand race conditions and need a way of handling them. So here's basically-- so we'll start with shared-memory hardware.
So the main thing that shared-memory hardware provides is a thing called cache coherence. OK. And the basic idea is that you want every processor to be able to fetch stuff out of local caches, because that's fast. But at the same time, you want them to have a common view of what is stored in a given location. So let's run through this example and see what the problem is. And then, I'll show you how they solve it, in sketchy detail.

So here's a processor. Says he wants to load the value of x. And in main memory here, x has got the value of 3, up here in DRAM. OK. So x moves through to the processor, where it gets consumed. And it leaves behind the fact that x equals 3 in its local cache. Well, now along comes the second processor. It says, I want x too. And perhaps the same thing happens. Very good. So far, no problem. So two caches may have the same value of x. They may both want to use x, and it's both in their local caches.
Now comes along the third processor. Says load x as well. Well, it turns out that what I showed you in the second case is actually not the common case. If these two processors, these two processing cores, are on the same chip, it's generally cheaper for this guy to fetch it out of one of these guys' caches than it is to fetch it out of DRAM. DRAM is slow. Getting it locally is much cheaper. So basically, in this case, he gets it from this processor. The first processor. All is well and good. They're all sharing merrily around. OK. And then this fella decides he wants to load it-- no problem. He can just load it. He loads it locally. No problem. OK. This guy decides, oh, he's going to store some value to x. In this case, he's going to store the value 5. So he sets x equal to 5. OK, fine. OK, now what? Now this guy says, let me load x. He gets the value x equals 3. Uh-oh.
If your parallel program expected that this guy had gone first and had set x equal to 5, these guys are now incorrect. And so, the idea of cache coherence is not letting this happen-- making it so that whenever a value is changed by a processor, the other processors see that change, and yet, they're still able most of the time to execute effectively out of their own local caches. OK. So that's the problem. So do people understand basically what the cache coherence problem is? Yes, question?

AUDIENCE: If the last processor was to store x and set x equals 5, as soon as that happens, wouldn't that write DRAM x equals 5?

PROFESSOR: Good. So there are actually two types of strategies that are used in caches. One is called write-through. And one is called write-back. What you're describing is write-through. What write-through caches do is, if you write a value, they push it all the way out to DRAM. These days, nobody uses write-through. You're always going to DRAM.
484 00:22:12,640 --> 00:22:18,610 You're always exercising the slow DRAM versus being able to 485 00:22:18,610 --> 00:22:20,770 just write it locally. 486 00:22:20,770 --> 00:22:23,960 But you do have to do something about these guys 487 00:22:23,960 --> 00:22:27,000 that are going to have the shared values. 488 00:22:27,000 --> 00:22:29,290 So here's the mechanism that they use. 489 00:22:29,290 --> 00:22:32,580 So what most people do these days is write back caches. 490 00:22:32,580 --> 00:22:35,890 Which basically means you only write it back when you really 491 00:22:35,890 --> 00:22:40,330 need to evict or what have you. 492 00:22:40,330 --> 00:22:43,340 You don't always write it all the way through. 493 00:22:43,340 --> 00:22:44,990 And so here's how these schemes work. 494 00:22:44,990 --> 00:22:45,390 So, right. 495 00:22:45,390 --> 00:22:51,470 So that's a bogus value for that guy to be getting. 496 00:22:51,470 --> 00:22:52,410 So let's take a look. 497 00:22:52,410 --> 00:22:54,680 So what they use is what's called-- the simplest is 498 00:22:54,680 --> 00:22:59,060 called an MSI protocol. 499 00:22:59,060 --> 00:23:02,620 There are somewhat more complicated ones called MESI 500 00:23:02,620 --> 00:23:06,480 protocols, and ones that are MOESI. 501 00:23:06,480 --> 00:23:08,930 "Mo-esi" and "messy". 502 00:23:08,930 --> 00:23:11,610 Anyway, the MESI one is probably the one you'll hear 503 00:23:11,610 --> 00:23:12,730 most often. 504 00:23:12,730 --> 00:23:14,920 It's just a little bit more complicated than this one. 505 00:23:14,920 --> 00:23:22,510 But it saves you one extra access when we do a write. 506 00:23:22,510 --> 00:23:23,930 I'll explain it in just a minute. 507 00:23:23,930 --> 00:23:28,610 But let's first understand the simplest of these mechanisms. 508 00:23:28,610 --> 00:23:31,820 So what you do is in each cache, you're going to label 509 00:23:31,820 --> 00:23:34,840 each cache line with a state. 
510 00:23:34,840 --> 00:23:38,350 And basically, it's because of these states that you 511 00:23:38,350 --> 00:23:41,420 associate with a cache line that cache lines end up having 512 00:23:41,420 --> 00:23:42,860 to be long. 513 00:23:42,860 --> 00:23:43,140 OK? 514 00:23:43,140 --> 00:23:47,100 Because if you think about it, you'd like cache lines to be 515 00:23:47,100 --> 00:23:51,830 at some level very short, in that then you have more 516 00:23:51,830 --> 00:23:55,480 opportunity to have just the stuff in cache that you want, 517 00:23:55,480 --> 00:23:57,740 from a temporal locality point of view. 518 00:23:57,740 --> 00:24:01,160 It's one thing if you want to bring in extra lines, extra 519 00:24:01,160 --> 00:24:03,120 data, for spatial locality. 520 00:24:03,120 --> 00:24:05,710 But to insist that it all be there whether you access it or 521 00:24:05,710 --> 00:24:09,870 not, it's not clear how helpful that is. 522 00:24:09,870 --> 00:24:12,470 However, what we have instead is things like, on the Intel 523 00:24:12,470 --> 00:24:16,590 architecture, 64-byte cache lines. 524 00:24:16,590 --> 00:24:19,090 And the reason is because they're keeping extra data 525 00:24:19,090 --> 00:24:21,360 with each cache line. 526 00:24:21,360 --> 00:24:25,570 And they want the data to be the larger fraction of what 527 00:24:25,570 --> 00:24:27,160 they're keeping compared to the control 528 00:24:27,160 --> 00:24:28,900 information about the data. 529 00:24:28,900 --> 00:24:31,120 So in this case, they're keeping three values. 530 00:24:31,120 --> 00:24:33,520 Three bits. 531 00:24:33,520 --> 00:24:36,370 The M bit says this cache block has been modified. 532 00:24:36,370 --> 00:24:38,140 Somebody's written to it. 
533 00:24:38,140 --> 00:24:43,130 And what they do is they, in this protocol, they guarantee 534 00:24:43,130 --> 00:24:46,210 in the protocol that if somebody has it in the M 535 00:24:46,210 --> 00:24:50,490 state, no other caches contain this block in either the M 536 00:24:50,490 --> 00:24:53,980 state or S state. 537 00:24:53,980 --> 00:24:54,920 So what are those states? 538 00:24:54,920 --> 00:24:58,540 So the S state is when other caches may be 539 00:24:58,540 --> 00:25:00,500 sharing this block. 540 00:25:00,500 --> 00:25:04,280 And the I state is that this cache block is invalid. 541 00:25:04,280 --> 00:25:05,720 It's the same as if it's not there. 542 00:25:05,720 --> 00:25:08,250 It's an empty entry. 543 00:25:08,250 --> 00:25:10,460 So it just marks this entry. 544 00:25:10,460 --> 00:25:12,680 There's no data there. 545 00:25:12,680 --> 00:25:16,860 The cache line that's there is not really there, is basically 546 00:25:16,860 --> 00:25:17,820 what it says. 547 00:25:17,820 --> 00:25:24,100 So here, you see for example that this fella has x equals 548 00:25:24,100 --> 00:25:26,890 13 in the modified state. 549 00:25:26,890 --> 00:25:29,770 And so, if you look across here, oh, nobody else has that 550 00:25:29,770 --> 00:25:32,770 in either the M or the S state. 551 00:25:32,770 --> 00:25:37,160 They only have it in the I state or not at all. 552 00:25:37,160 --> 00:25:39,780 If you have it in the shared state, as these guys have, 553 00:25:39,780 --> 00:25:41,950 well, they all have it in the shared state and notice the 554 00:25:41,950 --> 00:25:45,340 values are all the same. 555 00:25:45,340 --> 00:25:48,130 And then, if it's in the invalid state, here this guy 556 00:25:48,130 --> 00:25:51,130 once again has it in the modified state, which means 557 00:25:51,130 --> 00:25:54,230 these guys don't have it in either the S or M state. 558 00:25:54,230 --> 00:25:55,610 So that's the invariant. 
559 00:25:55,610 --> 00:25:58,950 So what's the basic idea behind the cache? 560 00:25:58,950 --> 00:26:00,650 The MSI protocol? 561 00:26:00,650 --> 00:26:05,360 The idea is that before you can write on a location, you 562 00:26:05,360 --> 00:26:09,445 must first invalidate all the other copies. 563 00:26:09,445 --> 00:26:12,260 564 00:26:12,260 --> 00:26:14,760 So whenever you try to write on something that's shared 565 00:26:14,760 --> 00:26:17,160 across a bunch of things or that somebody else has 566 00:26:17,160 --> 00:26:20,940 modified, what happens is over the network goes out a 567 00:26:20,940 --> 00:26:25,000 protocol to invalidate all the other copies. 568 00:26:25,000 --> 00:26:27,540 So if they're just being shared, that's no problem. 569 00:26:27,540 --> 00:26:28,970 Because all you do is just have them 570 00:26:28,970 --> 00:26:31,300 drop it from the cache. 571 00:26:31,300 --> 00:26:35,000 If it's modified, then it may have to be written back or the 572 00:26:35,000 --> 00:26:38,820 value brought back to you, so that you're in a position of 573 00:26:38,820 --> 00:26:39,200 changing it. 574 00:26:39,200 --> 00:26:41,610 If somebody has it modified, then you don't have it. 575 00:26:41,610 --> 00:26:45,420 So therefore, you need to bring it in and make the 576 00:26:45,420 --> 00:26:46,250 change to it. 577 00:26:46,250 --> 00:26:47,117 Question? 578 00:26:47,117 --> 00:26:49,470 AUDIENCE: [INAUDIBLE] three states? 579 00:26:49,470 --> 00:26:50,160 PROFESSOR: Three states. 580 00:26:50,160 --> 00:26:51,210 Not three bits. 581 00:26:51,210 --> 00:26:51,600 Two bits. 582 00:26:51,600 --> 00:26:52,900 Right. 583 00:26:52,900 --> 00:26:55,150 OK. 584 00:26:55,150 --> 00:26:57,470 So the idea is you first invalidate the other copies. 585 00:26:57,470 --> 00:27:03,250 Therefore, when a processor core is changing the value of 586 00:27:03,250 --> 00:27:05,545 some variable, it has the only copy. 
587 00:27:05,545 --> 00:27:08,320 588 00:27:08,320 --> 00:27:10,940 And by making sure that it only has the only copy, you 589 00:27:10,940 --> 00:27:13,660 make sure that you never have copies out there that are 590 00:27:13,660 --> 00:27:20,020 anything except copies of what everybody else has. 591 00:27:20,020 --> 00:27:22,170 That they're all the same. 592 00:27:22,170 --> 00:27:23,340 OK. 593 00:27:23,340 --> 00:27:26,320 Does everybody follow that? 594 00:27:26,320 --> 00:27:28,440 So there's hardware under there doing that. 595 00:27:28,440 --> 00:27:30,250 It's actually pretty clever hardware. 596 00:27:30,250 --> 00:27:36,550 In fact, the verification of cache protocols is a huge 597 00:27:36,550 --> 00:27:41,780 problem for which there's a lot of technology built to try 598 00:27:41,780 --> 00:27:45,290 to verify to make sure these cache protocols work the way 599 00:27:45,290 --> 00:27:46,960 they're supposed to work. 600 00:27:46,960 --> 00:27:48,860 Because what happens in practice is there are all 601 00:27:48,860 --> 00:27:50,070 these intermediate states. 602 00:27:50,070 --> 00:27:52,980 What happens if this guy starts doing this while this 603 00:27:52,980 --> 00:27:57,870 guy is doing that, and these protocols start getting mixed, 604 00:27:57,870 --> 00:27:59,130 and so forth? 605 00:27:59,130 --> 00:28:00,770 And you've got to make sure that works out. 606 00:28:00,770 --> 00:28:03,410 And that's what's going on in the hardware. 607 00:28:03,410 --> 00:28:07,200 The MESI protocol does a simple optimization. 608 00:28:07,200 --> 00:28:11,610 It says, look, before I store something, I probably 609 00:28:11,610 --> 00:28:13,230 want to read it. 610 00:28:13,230 --> 00:28:15,120 It's likely I'm going to read it. 611 00:28:15,120 --> 00:28:16,590 So I can read it in two ways. 612 00:28:16,590 --> 00:28:21,100 I can read it in a way that says that it is-- 613 00:28:21,100 --> 00:28:23,330 where it's just going to be shared. 
614 00:28:23,330 --> 00:28:26,390 But if I expect that I'm going to write it, let me when I 615 00:28:26,390 --> 00:28:31,530 read it instead of getting a shared copy, let me get an 616 00:28:31,530 --> 00:28:32,770 exclusive copy. 617 00:28:32,770 --> 00:28:34,920 And that's where the E comes from. 618 00:28:34,920 --> 00:28:36,350 Let me get an exclusive copy. 619 00:28:36,350 --> 00:28:39,420 In other words, go through the invalidation protocols on the 620 00:28:39,420 --> 00:28:43,030 read, so that with the expectation that when you 621 00:28:43,030 --> 00:28:47,320 write, you don't have to then wait for the invalidation to 622 00:28:47,320 --> 00:28:48,240 occur at that point. 623 00:28:48,240 --> 00:28:53,980 So it's a way of reducing the latency of the protocol by 624 00:28:53,980 --> 00:28:56,270 getting it exclusively by the read that you do 625 00:28:56,270 --> 00:28:58,860 before you do the write. 626 00:28:58,860 --> 00:29:01,410 So rather than doing a read, which would go 627 00:29:01,410 --> 00:29:04,210 out and get the value-- 628 00:29:04,210 --> 00:29:05,250 but everybody [? has them ?] 629 00:29:05,250 --> 00:29:07,940 shared-- then doing the write, and then doing a whole 630 00:29:07,940 --> 00:29:12,020 invalidation protocol, if I basically get it in exclusive 631 00:29:12,020 --> 00:29:15,480 mode on the read, then I go out, I get the value, and I 632 00:29:15,480 --> 00:29:17,570 invalidate everybody else. 633 00:29:17,570 --> 00:29:20,340 Now I've just saved myself half the work 634 00:29:20,340 --> 00:29:22,630 and half the latency. 635 00:29:22,630 --> 00:29:24,770 Or basically saved myself some latency. 636 00:29:24,770 --> 00:29:26,100 Not half the latency. 637 00:29:26,100 --> 00:29:27,900 OK? 
638 00:29:27,900 --> 00:29:32,300 So basically, what you should know is there is invalidation 639 00:29:32,300 --> 00:29:35,030 stuff going on behind the scenes when you start using 640 00:29:35,030 --> 00:29:39,200 shared memory, which can slow down your 641 00:29:39,200 --> 00:29:42,800 processor from executing. 642 00:29:42,800 --> 00:29:45,520 Because it can't do the things that it needs to do until it 643 00:29:45,520 --> 00:29:49,810 goes through the protocol. 644 00:29:49,810 --> 00:29:52,060 Any questions about that? 645 00:29:52,060 --> 00:29:58,920 That's basically the level we're going to cover the 646 00:29:58,920 --> 00:30:01,390 hardware at. 647 00:30:01,390 --> 00:30:04,510 And so, you'll discover that in doing some of your problems, 648 00:30:04,510 --> 00:30:06,880 that if you're not careful, you're going to create what 649 00:30:06,880 --> 00:30:10,220 are called invalidation storms, where you have a whole 650 00:30:10,220 --> 00:30:12,880 bunch of things that are read, and they're distributed across 651 00:30:12,880 --> 00:30:13,560 the processors. 652 00:30:13,560 --> 00:30:16,220 And then you go in, and you set one value. 653 00:30:16,220 --> 00:30:18,460 And suddenly, vrrrrrruuuum. 654 00:30:18,460 --> 00:30:21,420 Gee, how come that wasn't a fast store? 655 00:30:21,420 --> 00:30:23,550 The answer is it's going through and invalidating all 656 00:30:23,550 --> 00:30:24,800 those other copies. 657 00:30:24,800 --> 00:30:27,630 658 00:30:27,630 --> 00:30:29,330 Good. 659 00:30:29,330 --> 00:30:31,700 So let's turn to the real hard problem. 660 00:30:31,700 --> 00:30:35,290 So it turns out that building these things is not 661 00:30:35,290 --> 00:30:36,930 particularly well understood. 662 00:30:36,930 --> 00:30:38,640 But it's understood a lot better than 663 00:30:38,640 --> 00:30:41,310 programming these beasts. 664 00:30:41,310 --> 00:30:42,400 OK. 
665 00:30:42,400 --> 00:30:47,370 And so, we're going to focus on some of the strategies for 666 00:30:47,370 --> 00:30:48,620 programming. 667 00:30:48,620 --> 00:30:51,120 668 00:30:51,120 --> 00:30:55,760 So it turns out that trying to program the processor cores 669 00:30:55,760 --> 00:30:58,530 directly is painful. 670 00:30:58,530 --> 00:31:04,110 And you're liable to make a lot of errors, as we'll see. 671 00:31:04,110 --> 00:31:08,170 Because we're going to talk about races soon. 672 00:31:08,170 --> 00:31:11,910 And so the idea of a concurrency platform is to do 673 00:31:11,910 --> 00:31:16,880 some level of abstraction of the processor cores to handle 674 00:31:16,880 --> 00:31:20,540 synchronization, communication protocols, and often to do 675 00:31:20,540 --> 00:31:24,870 things like load balancing, so that the work that you're 676 00:31:24,870 --> 00:31:28,750 doing can be moved across from processor to processor. 677 00:31:28,750 --> 00:31:31,390 And so, here are some examples of concurrency platforms. 678 00:31:31,390 --> 00:31:34,500 Pthreads and WinAPI threads, we're going to talk more in 679 00:31:34,500 --> 00:31:35,500 detail about. 680 00:31:35,500 --> 00:31:38,880 Pthreads is basically for Unix type systems, 681 00:31:38,880 --> 00:31:40,470 like Linux and such. 682 00:31:40,470 --> 00:31:43,930 WinAPI threads is for Windows. 683 00:31:43,930 --> 00:31:48,380 There's Threading Building Blocks, TBB, OpenMP, which is 684 00:31:48,380 --> 00:31:50,120 a standard, and Cilk++. 685 00:31:50,120 --> 00:31:54,060 Those are all examples of concurrency platforms that 686 00:31:54,060 --> 00:31:59,280 make it easier to program these parallel machines. 687 00:31:59,280 --> 00:32:01,520 So I'm going to do, as an example, I'm going to use the 688 00:32:01,520 --> 00:32:06,320 Fibonacci numbers, which you have seen before I'm sure, 689 00:32:06,320 --> 00:32:11,260 because we've actually even used it in this class. 
690 00:32:11,260 --> 00:32:16,040 This is Leonardo da Pisa, who was also known as Fibonacci. 691 00:32:16,040 --> 00:32:17,820 And he introduced-- 692 00:32:17,820 --> 00:32:20,680 he was the most brilliant mathematician of his day. 693 00:32:20,680 --> 00:32:24,230 He came basically out of the blue, doing all kinds of 694 00:32:24,230 --> 00:32:28,150 beautiful mathematics very early in the Renaissance. 695 00:32:28,150 --> 00:32:31,085 You'll recognize 1202 is very early Renaissance. 696 00:32:31,085 --> 00:32:35,610 697 00:32:35,610 --> 00:32:38,440 But it turns out, for those of you of Indian descent, the 698 00:32:38,440 --> 00:32:39,900 Indian mathematicians had already 699 00:32:39,900 --> 00:32:41,150 discovered all this stuff. 700 00:32:41,150 --> 00:32:43,700 701 00:32:43,700 --> 00:32:46,480 But it didn't make it into Western culture except for 702 00:32:46,480 --> 00:32:50,040 Leonardo da Pisa. 703 00:32:50,040 --> 00:32:57,740 So here's a program as you might write it in C. So Fib 704 00:32:57,740 --> 00:33:00,980 int n says, well, if n is less than 2, return n. 705 00:33:00,980 --> 00:33:04,100 So if it's 0 or 1, we return, Fib of 0 is 0. 706 00:33:04,100 --> 00:33:05,580 Fib of 1 is 1. 707 00:33:05,580 --> 00:33:09,815 And otherwise, we compute Fib of n minus 1, compute Fib of n 708 00:33:09,815 --> 00:33:12,220 minus 2, and return the sum. 709 00:33:12,220 --> 00:33:13,950 Simple recursive program. 710 00:33:13,950 --> 00:33:15,080 Here's the main routine. 711 00:33:15,080 --> 00:33:18,940 We get the argument from the command line, compute the 712 00:33:18,940 --> 00:33:22,170 result, and then print out Fibonacci 713 00:33:22,170 --> 00:33:24,210 of whatever is whatever. 714 00:33:24,210 --> 00:33:26,290 Pretty simple piece of code. 
715 00:33:26,290 --> 00:33:28,240 So what we're going to do is take a look at what happens in 716 00:33:28,240 --> 00:33:32,900 each of these four concurrency platforms to see how it is 717 00:33:32,900 --> 00:33:37,510 that they make this easy to run this in parallel. 718 00:33:37,510 --> 00:33:40,770 Now just a disclaimer here. 719 00:33:40,770 --> 00:33:42,720 This is a really bad way-- 720 00:33:42,720 --> 00:33:43,850 I hope you all recognize-- 721 00:33:43,850 --> 00:33:46,365 of computing Fibonacci numbers. 722 00:33:46,365 --> 00:33:49,960 So this is an exponential-time algorithm. 723 00:33:49,960 --> 00:33:52,990 And you all know the linear time algorithm, which is 724 00:33:52,990 --> 00:33:55,440 basically computed up from the bottom. 725 00:33:55,440 --> 00:33:58,210 And some of you probably know there's a logarithmic time 726 00:33:58,210 --> 00:34:00,910 algorithm based on squaring matrices. 727 00:34:00,910 --> 00:34:02,160 Two by two matrices. 728 00:34:02,160 --> 00:34:05,030 729 00:34:05,030 --> 00:34:12,670 So in any case, we're all about performance here. 730 00:34:12,670 --> 00:34:15,820 But obviously, this is a really poor choice to do 731 00:34:15,820 --> 00:34:16,330 performance on. 732 00:34:16,330 --> 00:34:19,570 But it is a good didactic example, because it shows the 733 00:34:19,570 --> 00:34:24,409 structure and the issues that you get into in doing this 734 00:34:24,409 --> 00:34:28,639 with a very simple program that I can fit on a slide. 735 00:34:28,639 --> 00:34:28,909 OK. 736 00:34:28,909 --> 00:34:33,219 So when you execute Fibonacci, when you call Fib of 4, it 737 00:34:33,219 --> 00:34:36,469 calls Fib of 3 and Fib of 2. 738 00:34:36,469 --> 00:34:39,570 And Fib of 3 calls Fib of 2 and Fib of 1. 739 00:34:39,570 --> 00:34:42,489 And Fib of 1 just returns Fib of 2, calls [UNINTELLIGIBLE] 740 00:34:42,489 --> 00:34:44,060 1, 0, et cetera. 
741 00:34:44,060 --> 00:34:49,659 And so basically, you get an execution trace that basically 742 00:34:49,659 --> 00:34:53,270 corresponds to a walk of this tree. 743 00:34:53,270 --> 00:34:57,720 So if you were doing this in C, you'd basically call this, 744 00:34:57,720 --> 00:34:58,770 call this, call this. 745 00:34:58,770 --> 00:34:59,670 Get a value return. 746 00:34:59,670 --> 00:35:00,550 Call this. 747 00:35:00,550 --> 00:35:02,930 Add the two values together. 748 00:35:02,930 --> 00:35:04,540 Return here. 749 00:35:04,540 --> 00:35:05,170 Call this. 750 00:35:05,170 --> 00:35:06,330 Add the two values together. 751 00:35:06,330 --> 00:35:07,780 Call the return there. 752 00:35:07,780 --> 00:35:08,260 And so forth. 753 00:35:08,260 --> 00:35:12,190 You walk that using a stack, a call stack, in the execution. 754 00:35:12,190 --> 00:35:15,240 755 00:35:15,240 --> 00:35:19,550 The key idea for parallelization is, well, gee. 756 00:35:19,550 --> 00:35:23,390 Fib of n minus 1 and fib of n minus 2 are really, in this 757 00:35:23,390 --> 00:35:26,420 calculation, completely independently calculated. 758 00:35:26,420 --> 00:35:27,940 So let's just do them at the same time. 759 00:35:27,940 --> 00:35:31,040 760 00:35:31,040 --> 00:35:35,590 And they can be executed at the same time without 761 00:35:35,590 --> 00:35:37,350 interference, because all they're doing is 762 00:35:37,350 --> 00:35:38,290 basing it on n. 763 00:35:38,290 --> 00:35:41,220 They're not using any shared memory or anything even for 764 00:35:41,220 --> 00:35:43,800 this particular program. 765 00:35:43,800 --> 00:35:45,450 So let's take a look, to begin with, how 766 00:35:45,450 --> 00:35:48,320 Pthreads might do this. 767 00:35:48,320 --> 00:35:56,090 So Pthreads is a standard that ANSI and the IEEE have 768 00:35:56,090 --> 00:35:58,550 established for-- 769 00:35:58,550 --> 00:36:00,960 and I actually believe this is a little bit out of date. 
770 00:36:00,960 --> 00:36:03,700 I believe there's now a 2010 version. 771 00:36:03,700 --> 00:36:05,870 I'm not sure. 772 00:36:05,870 --> 00:36:07,990 But I recall that they were working on a new version. 773 00:36:07,990 --> 00:36:10,330 But anyway, this is a recent enough standard. 774 00:36:10,330 --> 00:36:13,520 It's a standard that has been revised over the years, the 775 00:36:13,520 --> 00:36:15,980 so-called POSIX standard. 776 00:36:15,980 --> 00:36:21,020 So you'll hear, Pthreads is basically POSIX threads. 777 00:36:21,020 --> 00:36:23,530 It's basically what you might characterize as a do it 778 00:36:23,530 --> 00:36:25,920 yourself concurrency platform. 779 00:36:25,920 --> 00:36:30,370 It's kind of like assembly language for parallelism. 780 00:36:30,370 --> 00:36:34,000 It allows you to do the things you need to do, but you're 781 00:36:34,000 --> 00:36:38,230 sort of doing it all by hand, one step at a time. 782 00:36:38,230 --> 00:36:42,190 It's built as a library of functions with special non-C 783 00:36:42,190 --> 00:36:43,440 or C++ semantics. 784 00:36:43,440 --> 00:36:50,760 785 00:36:50,760 --> 00:36:53,650 And we'll look at what some of those semantics are. 786 00:36:53,650 --> 00:36:57,670 Each thread implements an abstraction of a processor, 787 00:36:57,670 --> 00:37:01,640 which are multiplexed onto the machine resources by the 788 00:37:01,640 --> 00:37:05,700 Pthread runtime implementation. 789 00:37:05,700 --> 00:37:08,800 Threads communicate through shared memory. 790 00:37:08,800 --> 00:37:13,090 And library functions mask the protocols involved in 791 00:37:13,090 --> 00:37:15,680 interthread coordination. 792 00:37:15,680 --> 00:37:20,290 So you can start up threads, et cetera, and there are library 793 00:37:20,290 --> 00:37:21,230 functions for doing that. 794 00:37:21,230 --> 00:37:23,310 So let's just see how that works. 
795 00:37:23,310 --> 00:37:25,860 So here are, basically, the two 796 00:37:25,860 --> 00:37:29,800 important Pthread functions. 797 00:37:29,800 --> 00:37:31,560 There are actually a whole bunch of them, because they 798 00:37:31,560 --> 00:37:34,730 also provide a bunch of other facilities. 799 00:37:34,730 --> 00:37:38,200 One is pthread_create, which creates a Pthread. 800 00:37:38,200 --> 00:37:39,450 And one is pthread_join. 801 00:37:39,450 --> 00:37:41,620 802 00:37:41,620 --> 00:37:49,990 So pthread_create basically returns an identifier. 803 00:37:49,990 --> 00:37:53,160 So when you say create a Pthread, the Pthread system 804 00:37:53,160 --> 00:37:55,860 says, here's a handle by which you can name this thread in 805 00:37:55,860 --> 00:37:57,420 the future. 806 00:37:57,420 --> 00:37:57,660 OK. 807 00:37:57,660 --> 00:38:00,490 So it's a very common thing that the implementer says, 808 00:38:00,490 --> 00:38:01,730 here's the name that you get. 809 00:38:01,730 --> 00:38:02,850 It's called a handle. 810 00:38:02,850 --> 00:38:05,640 So it returns a handle. 811 00:38:05,640 --> 00:38:12,020 It then has an object to set various thread attributes. 812 00:38:12,020 --> 00:38:14,250 And for most of what we're going to need, we're just 813 00:38:14,250 --> 00:38:15,860 going to need NULL for default. 814 00:38:15,860 --> 00:38:18,390 We don't need any special things like changing the 815 00:38:18,390 --> 00:38:21,720 priority or what have you. 816 00:38:21,720 --> 00:38:28,390 Then what you pass is a void* pointer to a function, which 817 00:38:28,390 --> 00:38:32,360 is going to be the routine executed after creation. 818 00:38:32,360 --> 00:38:35,310 So you can name the function that you want to have it 819 00:38:35,310 --> 00:38:36,560 operate on. 820 00:38:36,560 --> 00:38:39,220 821 00:38:39,220 --> 00:38:42,290 And then you have a single pointer to an argument that 822 00:38:42,290 --> 00:38:43,700 you're going to pass to the function. 
823 00:38:43,700 --> 00:38:46,710 824 00:38:46,710 --> 00:38:49,860 So when you call something with Pthreads to create them, 825 00:38:49,860 --> 00:38:53,070 you can't say, and here's my list of arguments. 826 00:38:53,070 --> 00:38:55,610 If you have more than one argument, you have to pack it 827 00:38:55,610 --> 00:38:58,670 together into a struct and pass the 828 00:38:58,670 --> 00:39:00,380 pointer to the struct. 829 00:39:00,380 --> 00:39:02,830 And this function has to be smart enough to understand how 830 00:39:02,830 --> 00:39:04,470 to unpack it. 831 00:39:04,470 --> 00:39:07,150 We'll see an example in a minute. 832 00:39:07,150 --> 00:39:09,620 And then, it returns an error status. 833 00:39:09,620 --> 00:39:11,810 So the most common thing people do is they don't bother 834 00:39:11,810 --> 00:39:14,140 to check the error status. 835 00:39:14,140 --> 00:39:14,610 OK. 836 00:39:14,610 --> 00:39:16,950 And yet sometimes, you try to create a Pthread, there's a 837 00:39:16,950 --> 00:39:18,810 reason it can't create one. 838 00:39:18,810 --> 00:39:21,300 And now you keep going thinking you have one, and 839 00:39:21,300 --> 00:39:24,640 then your program crashes and you wonder why. 840 00:39:24,640 --> 00:39:26,990 So when you create things, you should check. 841 00:39:26,990 --> 00:39:31,540 I'm not sure in my code here whether I checked everywhere. 842 00:39:31,540 --> 00:39:33,700 But you should check. 843 00:39:33,700 --> 00:39:36,720 Do as I say, not as I do. 844 00:39:36,720 --> 00:39:37,670 OK. 845 00:39:37,670 --> 00:39:40,280 So the other key function is join. 846 00:39:40,280 --> 00:39:43,310 And basically, what you do is you say, you name the thread 847 00:39:43,310 --> 00:39:44,900 that you want to wait for. 848 00:39:44,900 --> 00:39:46,680 This is the name that would be returned 849 00:39:46,680 --> 00:39:49,700 by the create function. 
850 00:39:49,700 --> 00:39:57,860 And you also give a place where it can store the status 851 00:39:57,860 --> 00:40:01,000 of the thread when it terminated. 852 00:40:01,000 --> 00:40:03,290 It's allowed to say, I terminated normally. 853 00:40:03,290 --> 00:40:05,950 I terminated with a given error condition or whatever. 854 00:40:05,950 --> 00:40:07,430 But if you don't care what it is, you 855 00:40:07,430 --> 00:40:08,990 just put in NULL there. 856 00:40:08,990 --> 00:40:10,610 And then it returns the error 857 00:40:10,610 --> 00:40:13,230 status of the join function. 858 00:40:13,230 --> 00:40:15,560 So those are the two functions that you program with. 859 00:40:15,560 --> 00:40:16,408 Question? 860 00:40:16,408 --> 00:40:17,658 AUDIENCE: [INAUDIBLE PHRASE]? 861 00:40:17,658 --> 00:40:21,090 862 00:40:21,090 --> 00:40:22,170 PROFESSOR: It's different. 863 00:40:22,170 --> 00:40:22,690 It's different. 864 00:40:22,690 --> 00:40:26,350 So it's basically, if the error status, if it returns 865 00:40:26,350 --> 00:40:29,170 NULL, it just means everything went OK. 866 00:40:29,170 --> 00:40:33,710 867 00:40:33,710 --> 00:40:37,800 The handle is you pass a name, and basically this is *thread. 868 00:40:37,800 --> 00:40:41,790 It stuffs the name into whatever you give it. 869 00:40:41,790 --> 00:40:44,070 OK so you're not saying, here's the name. 870 00:40:44,070 --> 00:40:47,430 This is returned as an output parameter. 871 00:40:47,430 --> 00:40:52,560 So you're giving it an address of some place to put the name. 872 00:40:52,560 --> 00:40:52,690 OK. 873 00:40:52,690 --> 00:40:54,270 Let's see an example. 874 00:40:54,270 --> 00:40:59,840 So here's Fibonacci with Pthreads. 875 00:40:59,840 --> 00:41:02,280 So let's just go through that. 876 00:41:02,280 --> 00:41:06,330 So the first part is pretty good. 877 00:41:06,330 --> 00:41:11,750 This is your original code that does Fibonacci. 
878 00:41:11,750 --> 00:41:15,930 And now what we do is we have a structure 879 00:41:15,930 --> 00:41:17,750 for the thread arguments. 880 00:41:17,750 --> 00:41:20,110 And so we're going to have an input argument and an output 881 00:41:20,110 --> 00:41:21,500 argument in this example. 882 00:41:21,500 --> 00:41:23,980 Because Fib takes an input argument in and 883 00:41:23,980 --> 00:41:27,280 returns Fib of n. 884 00:41:27,280 --> 00:41:29,180 So we're going to call those input and output. 885 00:41:29,180 --> 00:41:31,570 And we'll call them thread_args. 886 00:41:31,570 --> 00:41:37,660 And now, here is my void* function, thread_func, which 887 00:41:37,660 --> 00:41:39,790 takes a pointer. 888 00:41:39,790 --> 00:41:43,980 And what it does is when it executes-- 889 00:41:43,980 --> 00:41:46,070 so what you're going to be able to do is, as we'll see in 890 00:41:46,070 --> 00:41:47,300 a minute--. 891 00:41:47,300 --> 00:41:48,910 Let me just go through this. 892 00:41:48,910 --> 00:41:50,610 This is going to be the function called when the 893 00:41:50,610 --> 00:41:52,140 thread is created. 894 00:41:52,140 --> 00:41:53,211 So when the thread is created, you're just going 895 00:41:53,211 --> 00:41:54,990 to call this function. 896 00:41:54,990 --> 00:42:00,150 And what it's going to get is the argument that was passed, 897 00:42:00,150 --> 00:42:03,660 which is this *star thing. 898 00:42:03,660 --> 00:42:06,050 And what it does in this case is it's basically going to 899 00:42:06,050 --> 00:42:12,140 cast the pointer to a thread_arg struct and 900 00:42:12,140 --> 00:42:16,570 dereference the input, and stick that into I. Then it's going 901 00:42:16,570 --> 00:42:19,710 to compute Fib of I. And then it's going to take, once 902 00:42:19,710 --> 00:42:24,100 again, dereference the pointer as if it's a thread_arg, and 903 00:42:24,100 --> 00:42:29,190 store into the output field the result of the Fib. 904 00:42:29,190 --> 00:42:30,590 And then it returns NULL. 
905 00:42:30,590 --> 00:42:34,060 906 00:42:34,060 --> 00:42:36,170 So that's basically the function that's going to be 907 00:42:36,170 --> 00:42:37,910 called when the thread is created. 908 00:42:37,910 --> 00:42:43,560 So in your main routine now, what happens is we initialize 909 00:42:43,560 --> 00:42:44,400 a bunch of things. 910 00:42:44,400 --> 00:42:48,350 And now, if argc is less than 2, we'll return 1. 911 00:42:48,350 --> 00:42:50,860 That's fine. 912 00:42:50,860 --> 00:42:54,850 Then we're going to check whether the reading fails. 913 00:42:54,850 --> 00:42:56,280 That's actually the reading of the input. 914 00:42:56,280 --> 00:43:00,220 So then, what we do here is we get n from the command line. 915 00:43:00,220 --> 00:43:03,430 And then if n is less than 30, we're just going to 916 00:43:03,430 --> 00:43:05,710 compute Fib of n. 917 00:43:05,710 --> 00:43:10,680 This is what I evaluated on my laptop to be a good number. 918 00:43:10,680 --> 00:43:13,870 So the idea is there's no point in creating the extra 919 00:43:13,870 --> 00:43:17,740 thread to do the work if it's going to be more expensive 920 00:43:17,740 --> 00:43:19,710 than me just doing the work myself. 921 00:43:19,710 --> 00:43:23,150 So I looked at the overhead of thread creation and discovered 922 00:43:23,150 --> 00:43:27,420 that if it was smaller than 30, it's going to be slower to 923 00:43:27,420 --> 00:43:30,780 create another thread to help me out. 924 00:43:30,780 --> 00:43:33,780 It's sort of like you folks when you're doing pair 925 00:43:33,780 --> 00:43:35,850 programming, which you're supposed to be doing, versus 926 00:43:35,850 --> 00:43:36,990 handing it off. 927 00:43:36,990 --> 00:43:38,790 Sometimes, there are some things that are too small to 928 00:43:38,790 --> 00:43:40,920 ask somebody else to do. 929 00:43:40,920 --> 00:43:43,897 You might as well just do it, by the time you explain what it 930 00:43:43,897 --> 00:43:45,310 is, and so forth. 
931 00:43:45,310 --> 00:43:47,160 Same thing here. 932 00:43:47,160 --> 00:43:49,710 What's the point in starting up a thread to do something 933 00:43:49,710 --> 00:43:53,630 else, because the startup cost is rather substantial. 934 00:43:53,630 --> 00:43:56,180 So if it's less than 30, well, we'll just be done. 935 00:43:56,180 --> 00:44:01,120 Otherwise, what we do is we marshall the 936 00:44:01,120 --> 00:44:02,300 argument to the thread. 937 00:44:02,300 --> 00:44:06,370 We basically set args.input to n minus 1. 938 00:44:06,370 --> 00:44:08,860 Because args is going to be what I'm going to pass in. 939 00:44:08,860 --> 00:44:11,700 So I say the input number is n minus 1. 940 00:44:11,700 --> 00:44:17,520 And now what I do is I create the thread by saying, give me 941 00:44:17,520 --> 00:44:22,840 the name of the thread that I'm creating. 942 00:44:22,840 --> 00:44:28,520 This was the field that I said you could put to be NULL, 943 00:44:28,520 --> 00:44:30,710 which basically lets you set some policy 944 00:44:30,710 --> 00:44:32,370 parameters and so forth. 945 00:44:32,370 --> 00:44:34,470 I say, execute the thread_func. 946 00:44:34,470 --> 00:44:36,000 This guy here. 947 00:44:36,000 --> 00:44:38,650 And here's the argument list that I want to provide it, 948 00:44:38,650 --> 00:44:42,000 which is this args thing. 949 00:44:42,000 --> 00:44:44,950 Once you do the thread_create, and this is where you depart 950 00:44:44,950 --> 00:44:48,420 from normal C or C++ semantics. 951 00:44:48,420 --> 00:44:51,000 And in fact, we're going to be doing more moving in the 952 00:44:51,000 --> 00:44:52,260 direction of C++. 953 00:44:52,260 --> 00:44:57,240 We'll have some tutorials on that. 954 00:44:57,240 --> 00:45:00,200 What happens is we check the status. 955 00:45:00,200 --> 00:45:03,200 OK, I actually did check the status to see whether or not 956 00:45:03,200 --> 00:45:05,550 it created it properly. 
957 00:45:05,550 --> 00:45:09,240 But basically now, what's happening is after I execute 958 00:45:09,240 --> 00:45:13,520 this, it goes off and all the magic in Pthreads starts 959 00:45:13,520 --> 00:45:16,370 another thread doing that computation. 960 00:45:16,370 --> 00:45:19,910 And control returns to the statement after the 961 00:45:19,910 --> 00:45:21,850 pthread_create. 962 00:45:21,850 --> 00:45:25,230 So when the pthread_create returns, that doesn't mean 963 00:45:25,230 --> 00:45:28,150 it's done computing the thing you told it to do. 964 00:45:28,150 --> 00:45:30,420 Then, what would be the point? 965 00:45:30,420 --> 00:45:35,230 It returns after it's set up the other thread to 966 00:45:35,230 --> 00:45:36,650 operate in parallel. 967 00:45:36,650 --> 00:45:38,110 People follow that? 968 00:45:38,110 --> 00:45:41,480 So now at this point, there are two threads operating. 969 00:45:41,480 --> 00:45:43,060 There's the thread we've called thread. 970 00:45:43,060 --> 00:45:45,260 And there's whatever the name of the thread is that we 971 00:45:45,260 --> 00:45:46,510 started on. 972 00:45:46,510 --> 00:45:48,740 973 00:45:48,740 --> 00:45:52,510 So then we, in our own processor here, we compute Fib 974 00:45:52,510 --> 00:45:54,610 of n minus 2. 975 00:45:54,610 --> 00:45:58,960 And now, what we do is we go on to join this thread with 976 00:45:58,960 --> 00:46:04,220 the thread that we had created. 977 00:46:04,220 --> 00:46:08,130 978 00:46:08,130 --> 00:46:10,740 So let's see here. 979 00:46:10,740 --> 00:46:13,810 And the thing that the join does is if the other thread 980 00:46:13,810 --> 00:46:17,620 isn't done, it sits there and waits until it is done. 981 00:46:17,620 --> 00:46:19,050 And it does that synchronization 982 00:46:19,050 --> 00:46:20,420 automatically for you. 983 00:46:20,420 --> 00:46:21,530 And this is the kind of thing a 984 00:46:21,530 --> 00:46:23,130 concurrency platform provides.
985 00:46:23,130 --> 00:46:28,250 It provides the coordination under the covers for you to be 986 00:46:28,250 --> 00:46:31,960 able to synchronize with it without you having to 987 00:46:31,960 --> 00:46:34,930 synchronize on your own. 988 00:46:34,930 --> 00:46:41,400 And then, once it does return, it adds the results together 989 00:46:41,400 --> 00:46:46,710 by taking the result which came from the Fib of n minus 2 990 00:46:46,710 --> 00:46:50,750 and adds to it the value that this thread has returned in 991 00:46:50,750 --> 00:46:52,000 the args.output. 992 00:46:52,000 --> 00:46:54,660 993 00:46:54,660 --> 00:46:57,420 And then it prints the result. 994 00:46:57,420 --> 00:46:59,910 So any question about that? 995 00:46:59,910 --> 00:47:02,860 Wouldn't this be fun to write a really big system in? 996 00:47:02,860 --> 00:47:04,230 People do. 997 00:47:04,230 --> 00:47:05,480 People do. 998 00:47:05,480 --> 00:47:07,928 Yeah, question? 999 00:47:07,928 --> 00:47:09,178 AUDIENCE: [INAUDIBLE PHRASE] 1000 00:47:09,178 --> 00:47:13,540 1001 00:47:13,540 --> 00:47:14,795 PROFESSOR: That's a tuning parameter. 1002 00:47:14,795 --> 00:47:15,595 That's a voodoo parameter. 1003 00:47:15,595 --> 00:47:15,930 AUDIENCE: Right. 1004 00:47:15,930 --> 00:47:19,374 But in this particular case, it makes no difference at all. 1005 00:47:19,374 --> 00:47:23,310 It would've made a difference if it was an actual person 1006 00:47:23,310 --> 00:47:23,802 [INAUDIBLE]? 1007 00:47:23,802 --> 00:47:25,290 PROFESSOR: No, it does make a difference. 1008 00:47:25,290 --> 00:47:27,080 For how fast it computes this? 1009 00:47:27,080 --> 00:47:28,170 Absolutely does. 1010 00:47:28,170 --> 00:47:29,530 AUDIENCE: That's not recursive? 1011 00:47:29,530 --> 00:47:30,050 PROFESSOR: No, that's right. 1012 00:47:30,050 --> 00:47:30,920 This is not recursive. 1013 00:47:30,920 --> 00:47:33,238 I'm just doing two things and then quitting. 
1014 00:47:33,238 --> 00:47:36,202 AUDIENCE: [INAUDIBLE] if it's less than 30, then it's going 1015 00:47:36,202 --> 00:47:39,390 to be [INAUDIBLE], right? 1016 00:47:39,390 --> 00:47:41,660 PROFESSOR: If it's less than 30, it's fast enough that I 1017 00:47:41,660 --> 00:47:43,175 might as well just return. 1018 00:47:43,175 --> 00:47:46,025 AUDIENCE: Then why [INAUDIBLE PHRASE] 1019 00:47:46,025 --> 00:47:46,975 to do it. 1020 00:47:46,975 --> 00:47:49,350 It would return [INAUDIBLE] too. 1021 00:47:49,350 --> 00:47:49,560 PROFESSOR: No. 1022 00:47:49,560 --> 00:47:51,470 But it would be slower. 1023 00:47:51,470 --> 00:47:52,645 It would be wasteful of resources. 1024 00:47:52,645 --> 00:47:53,434 Maybe somebody-- 1025 00:47:53,434 --> 00:47:56,338 AUDIENCE: Well, because you're using such a bad 1026 00:47:56,338 --> 00:47:56,822 algorithm, I guess? 1027 00:47:56,822 --> 00:47:57,306 PROFESSOR: Yeah. 1028 00:47:57,306 --> 00:47:57,790 AUDIENCE: Oh, I see. 1029 00:47:57,790 --> 00:47:58,280 Oh, OK. 1030 00:47:58,280 --> 00:47:59,430 PROFESSOR: OK. 1031 00:47:59,430 --> 00:48:02,410 So in any case, that's Pthreads programming. 1032 00:48:02,410 --> 00:48:03,450 There are a bunch of issues. 1033 00:48:03,450 --> 00:48:08,090 One is that the overhead of creating a thread is more than 1034 00:48:08,090 --> 00:48:10,220 10,000 cycles. 1035 00:48:10,220 --> 00:48:13,130 So it leaves you to only be able to do very coarse-grained 1036 00:48:13,130 --> 00:48:13,760 concurrency. 1037 00:48:13,760 --> 00:48:15,590 There are some tricks around that. 1038 00:48:15,590 --> 00:48:17,870 One is to use what's called thread pools. 1039 00:48:17,870 --> 00:48:21,600 What I do is I start up, and I create a bunch of threads. 1040 00:48:21,600 --> 00:48:22,560 And I have their names. 1041 00:48:22,560 --> 00:48:23,810 I put them in a linked list.
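[The free-list idea can be sketched like this. The names pool_get and pool_put are hypothetical, and the worker record stands in for a real pooled pthread; the point is only the allocation pattern, which works just like a memory allocator's free list.]

```c
#include <stdlib.h>

/* One reusable worker record; in a real thread pool this would
 * hold a live pthread parked waiting for work. */
typedef struct worker {
    struct worker *next;
} worker;

static worker *free_list = NULL;

/* "Create" a worker: pop one off the free list if possible,
 * otherwise pay the expensive creation cost (the slow path). */
worker *pool_get(void) {
    if (free_list != NULL) {
        worker *w = free_list;
        free_list = w->next;
        return w;                       /* reused: no creation overhead */
    }
    return malloc(sizeof(worker));      /* slow path: actually create one */
}

/* "Destroy" a worker: push it back onto the free list for reuse. */
void pool_put(worker *w) {
    w->next = free_list;
    free_list = w;
}
```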
1042 00:48:23,810 --> 00:48:26,270 And whenever I need to create one, rather than actually 1043 00:48:26,270 --> 00:48:29,090 creating one, I take one out of the list, much as I would 1044 00:48:29,090 --> 00:48:30,820 do memory allocation. 1045 00:48:30,820 --> 00:48:32,310 Which you folks are familiar with. 1046 00:48:32,310 --> 00:48:35,580 1047 00:48:35,580 --> 00:48:36,050 OK. 1048 00:48:36,050 --> 00:48:38,490 Ha, ha, ha, ha, ha. 1049 00:48:38,490 --> 00:48:45,340 [MANIACAL LAUGHTER] 1050 00:48:45,340 --> 00:48:48,960 So basically, you can have a free list of threads. 1051 00:48:48,960 --> 00:48:53,550 And when you need a thread, you grab the thread. 1052 00:48:53,550 --> 00:48:57,750 The second thing is scalability. 1053 00:48:57,750 --> 00:49:01,580 So this code gets about a 1.5 speedup for two cores. 1054 00:49:01,580 --> 00:49:05,450 If I want to use three cores or four cores, what 1055 00:49:05,450 --> 00:49:07,470 do I have to do? 1056 00:49:07,470 --> 00:49:09,210 Rewrite the whole program. 1057 00:49:09,210 --> 00:49:12,490 This program only works for two cores. 1058 00:49:12,490 --> 00:49:14,780 It will also work for one core. 1059 00:49:14,780 --> 00:49:17,600 But basically, it doesn't really exploit 1060 00:49:17,600 --> 00:49:20,550 three or four cores. 1061 00:49:20,550 --> 00:49:22,300 It's really bad for modularity. 1062 00:49:22,300 --> 00:49:25,670 The Fibonacci logic is no longer neatly encapsulated in 1063 00:49:25,670 --> 00:49:28,510 the Fib function. 1064 00:49:28,510 --> 00:49:31,320 So where do we see that if we go back to this code? 1065 00:49:31,320 --> 00:49:32,910 Here's the Fib function. 1066 00:49:32,910 --> 00:49:35,410 Oh, but now, I've kind of got-- 1067 00:49:35,410 --> 00:49:37,510 well, this is sort of just marshaling and calling. 1068 00:49:37,510 --> 00:49:41,570 But over here, oh my goodness, I've got some arguments here. 1069 00:49:41,570 --> 00:49:43,640 If n is less than 30, I give a result.
1070 00:49:43,640 --> 00:49:45,960 Otherwise, I'm adding together-- 1071 00:49:45,960 --> 00:49:46,310 but wait a minute. 1072 00:49:46,310 --> 00:49:48,730 I already specified Fib up here. 1073 00:49:48,730 --> 00:49:51,510 So I'm specifying my serial implementation, and I'm 1074 00:49:51,510 --> 00:49:55,000 specifying a parallel way of doing it. 1075 00:49:55,000 --> 00:49:56,410 And so that's not modular. 1076 00:49:56,410 --> 00:49:59,640 If I decided I wanted to change the Fib, I've got to 1077 00:49:59,640 --> 00:50:01,860 change things in two places. 1078 00:50:01,860 --> 00:50:07,840 If Fib were something I did. 1079 00:50:07,840 --> 00:50:08,870 Code simplicity. 1080 00:50:08,870 --> 00:50:10,280 The programmers for this are 1081 00:50:10,280 --> 00:50:12,630 actually marshalling arguments. 1082 00:50:12,630 --> 00:50:15,070 This is what I call shades of 1958. 1083 00:50:15,070 --> 00:50:18,495 What happened in 1958 that's relevant to computer science? 1084 00:50:18,495 --> 00:50:21,060 1085 00:50:21,060 --> 00:50:25,320 What was the big innovation in 1958? 1086 00:50:25,320 --> 00:50:27,310 Programming language. 1087 00:50:27,310 --> 00:50:29,380 Fortran. 1088 00:50:29,380 --> 00:50:30,820 So, Fortran. 1089 00:50:30,820 --> 00:50:35,320 Before Fortran, people wrote in assembly language. 1090 00:50:35,320 --> 00:50:39,710 If you wanted to put three arguments to a function, you 1091 00:50:39,710 --> 00:50:43,790 did a push, push, push, or passed them in parameters. 1092 00:50:43,790 --> 00:50:46,720 Actually, their machines were so much more primitive than 1093 00:50:46,720 --> 00:50:48,680 that; it was even more complicated than you could 1094 00:50:48,680 --> 00:50:54,100 imagine, given how complicated it is today what the 1095 00:50:54,100 --> 00:50:56,170 compilers are doing. 1096 00:50:56,170 --> 00:50:57,870 But you had to marshal the arguments yourself.
1097 00:50:57,870 --> 00:51:00,150 What Fortran did was say, no, you can actually 1098 00:51:00,150 --> 00:51:03,780 write f of a, b, c. 1099 00:51:03,780 --> 00:51:06,320 Close paren. 1100 00:51:06,320 --> 00:51:10,820 And that it will cause a, b, and c all to be marshalled 1101 00:51:10,820 --> 00:51:13,160 automatically for you. 1102 00:51:13,160 --> 00:51:15,450 Well, Pthreads doesn't have that automatic marshalling. 1103 00:51:15,450 --> 00:51:18,300 You got to marshall by hand if you're going to use pthreads. 1104 00:51:18,300 --> 00:51:21,790 1105 00:51:21,790 --> 00:51:24,770 And of course, as you can imagine, that was error prone. 1106 00:51:24,770 --> 00:51:27,800 Because there is no type safety. 1107 00:51:27,800 --> 00:51:31,730 Are you calling things with the right types and so forth? 1108 00:51:31,730 --> 00:51:33,940 And so forth. 1109 00:51:33,940 --> 00:51:39,460 And also, one of the things here is that we've created two 1110 00:51:39,460 --> 00:51:41,010 jobs that aren't the same size. 1111 00:51:41,010 --> 00:51:46,230 So there's no way that they have of load balancing. 1112 00:51:46,230 --> 00:51:50,070 So this is why pthreads is sort of the assembly language 1113 00:51:50,070 --> 00:51:53,320 level, so that you can do anything you want in pthreads. 1114 00:51:53,320 --> 00:51:55,790 But you have to program at this kind of very 1115 00:51:55,790 --> 00:51:59,500 protocol-laden level. 1116 00:51:59,500 --> 00:52:00,600 Next thing I want to talk about is 1117 00:52:00,600 --> 00:52:01,850 threading building blocks. 1118 00:52:01,850 --> 00:52:04,700 1119 00:52:04,700 --> 00:52:09,250 This is a technology developed by Intel. 1120 00:52:09,250 --> 00:52:12,930 It's implemented as a C++ library that runs on top of 1121 00:52:12,930 --> 00:52:16,480 the native Pthreads, typically, or WinAPI threads. 1122 00:52:16,480 --> 00:52:21,590 So it's basically a layer on top of the Pthread layer. 
1123 00:52:21,590 --> 00:52:23,580 In this case, the program specifies 1124 00:52:23,580 --> 00:52:26,400 tasks rather than threads. 1125 00:52:26,400 --> 00:52:30,690 And tasks are automatically load balanced across the 1126 00:52:30,690 --> 00:52:33,540 threads using a strategy called work-stealing, which 1127 00:52:33,540 --> 00:52:36,640 we'll talk about a little bit more later. 1128 00:52:36,640 --> 00:52:38,610 And the focus for this is on performance. 1129 00:52:38,610 --> 00:52:43,130 They want to write programs that actually perform well. 1130 00:52:43,130 --> 00:52:45,700 So here's Fibonacci in TBB. 1131 00:52:45,700 --> 00:52:48,190 So as you'll see, it's better. 1132 00:52:48,190 --> 00:52:51,805 But maybe not ideal for what you might like to express. 1133 00:52:51,805 --> 00:52:56,220 1134 00:52:56,220 --> 00:53:02,070 So what we do is we declare the computation; 1135 00:53:02,070 --> 00:53:05,290 it's going to be organized as a bunch of explicit tasks. 1136 00:53:05,290 --> 00:53:10,030 So you say that it's going to be a task. 1137 00:53:10,030 --> 00:53:19,280 And FibTask is going to have an input parameter, n, and an 1138 00:53:19,280 --> 00:53:22,720 output parameter, sum. 1139 00:53:22,720 --> 00:53:28,990 And what we're going to do is when the task is started, it 1140 00:53:28,990 --> 00:53:36,890 automatically executes the execute method of this tasking 1141 00:53:36,890 --> 00:53:38,500 object here. 1142 00:53:38,500 --> 00:53:40,350 And the execute method now starts to do something that 1143 00:53:40,350 --> 00:53:41,850 looks very much like Fibonacci. 1144 00:53:41,850 --> 00:53:46,880 It says if n is less than 2, sum is equal to n. 1145 00:53:46,880 --> 00:53:48,200 That's what we had before. 1146 00:53:48,200 --> 00:53:49,550 And otherwise.
1147 00:53:49,550 --> 00:53:53,570 And now what we're going to do is recursively create two 1148 00:53:53,570 --> 00:53:57,630 child tasks, which we basically do with this 1149 00:53:57,630 --> 00:54:07,490 function, allocate_task, giving it the fib task a name, 1150 00:54:07,490 --> 00:54:13,040 where this is basically a method for allocating out of a 1151 00:54:13,040 --> 00:54:16,150 particular type of the pool, which is an 1152 00:54:16,150 --> 00:54:18,760 allocate child pool. 1153 00:54:18,760 --> 00:54:23,080 And then similarly for b, we recursively do for n minus 2. 1154 00:54:23,080 --> 00:54:25,240 And then what it does is it sets the number of 1155 00:54:25,240 --> 00:54:27,630 tasks to wait for. 1156 00:54:27,630 --> 00:54:30,250 In this case, it's basically two children plus 1 for 1157 00:54:30,250 --> 00:54:32,140 bookkeeping. 1158 00:54:32,140 --> 00:54:35,580 So this ends up always being one more than the things that 1159 00:54:35,580 --> 00:54:39,240 you created as subtasks. 1160 00:54:39,240 --> 00:54:44,160 And then what we do is we say, OK, let's spawn. 1161 00:54:44,160 --> 00:54:46,050 So this will only set up the task. 1162 00:54:46,050 --> 00:54:48,070 It doesn't actually say, do it. 1163 00:54:48,070 --> 00:54:52,390 So the spawn command says actually do this computation 1164 00:54:52,390 --> 00:54:53,870 here that I set up. 1165 00:54:53,870 --> 00:54:57,000 So it actually does b. 1166 00:54:57,000 --> 00:54:58,440 Start task b. 1167 00:54:58,440 --> 00:55:02,760 And then itself, it executes a and waits for all of the other 1168 00:55:02,760 --> 00:55:05,640 tasks, namely both a and b, to finish. 1169 00:55:05,640 --> 00:55:08,760 And once it's finished, it adds the results together to 1170 00:55:08,760 --> 00:55:10,160 produce the final output. 
1171 00:55:10,160 --> 00:55:13,300 1172 00:55:13,300 --> 00:55:17,600 So this, notice, has the big advantage over the previous 1173 00:55:17,600 --> 00:55:22,260 implementation that this is actually recursive. 1174 00:55:22,260 --> 00:55:26,010 So in doing Fib, you're not just getting two tasks. 1175 00:55:26,010 --> 00:55:29,010 You're recursively getting each of those two more, and 1176 00:55:29,010 --> 00:55:30,830 two more, and two more, down to the leaves of the 1177 00:55:30,830 --> 00:55:32,630 computation. 1178 00:55:32,630 --> 00:55:36,660 And then what TBB does is it load balances those across the 1179 00:55:36,660 --> 00:55:42,450 number of available processors by creating these tasks. 1180 00:55:42,450 --> 00:55:45,270 And then, it automatically does all the load balancing of 1181 00:55:45,270 --> 00:55:47,610 the tasks and so forth. 1182 00:55:47,610 --> 00:55:50,180 Questions about that? 1183 00:55:50,180 --> 00:55:51,130 Any questions? 1184 00:55:51,130 --> 00:55:55,720 I don't expect you to be able to program in TBB, unless I 1185 00:55:55,720 --> 00:55:57,480 gave you a book and said, program in TBB. 1186 00:55:57,480 --> 00:55:58,730 But I'm not going to do that. 1187 00:55:58,730 --> 00:56:00,900 1188 00:56:00,900 --> 00:56:03,320 This is mainly to give you a flavor of what's in there. 1189 00:56:03,320 --> 00:56:05,500 What the alternatives are. 1190 00:56:05,500 --> 00:56:08,670 So TBB provides many C++ templates that 1191 00:56:08,670 --> 00:56:10,150 simplify common patterns. 1192 00:56:10,150 --> 00:56:13,020 So rather than having to write that kind of thing for 1193 00:56:13,020 --> 00:56:16,010 everything, for example, if you have loop parallelism. 1194 00:56:16,010 --> 00:56:19,380 If you have n things that you want to have operate in 1195 00:56:19,380 --> 00:56:23,520 parallel, you can do a parallel_for and not actually 1196 00:56:23,520 --> 00:56:24,500 see the tasks.
1197 00:56:24,500 --> 00:56:27,960 It covers them over and creates the tasks 1198 00:56:27,960 --> 00:56:32,940 automatically, so that you can just say, for i gets 1 to n, 1199 00:56:32,940 --> 00:56:36,220 do this to all i, and do them at the same time essentially. 1200 00:56:36,220 --> 00:56:39,920 And it then balances those and so forth. 1201 00:56:39,920 --> 00:56:42,930 It also has things like parallel_reduce. 1202 00:56:42,930 --> 00:56:46,880 Sometimes what you want to do across an array is not just do 1203 00:56:46,880 --> 00:56:48,530 something for every element of the array. 1204 00:56:48,530 --> 00:56:51,770 You may want to add up all the elements into a single value. 1205 00:56:51,770 --> 00:56:54,870 And so it basically has what's called a reduction function. 1206 00:56:54,870 --> 00:56:56,980 It does a parallel reduce to aggregate. 1207 00:56:56,980 --> 00:56:59,120 And it's got various other things, like pipelining and 1208 00:56:59,120 --> 00:57:02,250 filtering for doing what's called software pipelining, 1209 00:57:02,250 --> 00:57:08,810 where you have one subsystem that basically is going to 1210 00:57:08,810 --> 00:57:11,230 process the data and pass it to the next. 1211 00:57:11,230 --> 00:57:13,270 So you're going to process it and pass it to the next. 1212 00:57:13,270 --> 00:57:18,810 And it allows you to set up a software pipeline of things. 1213 00:57:18,810 --> 00:57:22,150 It also comes with some container classes, such as 1214 00:57:22,150 --> 00:57:25,180 hash tables, concurrent hash tables, that allow you to have 1215 00:57:25,180 --> 00:57:33,670 multiple tasks beating on a hash table. 1216 00:57:33,670 --> 00:57:35,680 Inserting and deleting from the hash table at the same 1217 00:57:35,680 --> 00:57:39,790 time and a variety of mutual exclusion library functions, 1218 00:57:39,790 --> 00:57:42,630 including locks and atomic updates.
1219 00:57:42,630 --> 00:57:48,230 So it has a bunch of other facilities that make it much 1220 00:57:48,230 --> 00:57:50,950 easier to use than just using the raw task interface. 1221 00:57:50,950 --> 00:57:54,360 1222 00:57:54,360 --> 00:57:55,610 OpenMP. 1223 00:57:55,610 --> 00:57:57,220 1224 00:57:57,220 --> 00:58:00,100 So OpenMP is a specification produced by an industry 1225 00:58:00,100 --> 00:58:04,290 consortium of which the principal players-- 1226 00:58:04,290 --> 00:58:09,780 the original principal player was Silicon Graphics, which 1227 00:58:09,780 --> 00:58:13,160 essentially has become less important in the 1228 00:58:13,160 --> 00:58:14,080 industry, let's say. 1229 00:58:14,080 --> 00:58:15,820 Put it that way. 1230 00:58:15,820 --> 00:58:19,270 And for the most part, recently, it's been players 1231 00:58:19,270 --> 00:58:24,290 from Intel and Sun, which is now no longer Sun, except that 1232 00:58:24,290 --> 00:58:33,160 it is now part of Oracle, and IBM, and a variety of other 1233 00:58:33,160 --> 00:58:37,200 industry players. 1234 00:58:37,200 --> 00:58:39,430 There are several compilers available, 1235 00:58:39,430 --> 00:58:43,860 both open source and proprietary; gcc, for instance, 1236 00:58:43,860 --> 00:58:46,190 has OpenMP built-in. 1237 00:58:46,190 --> 00:58:51,370 And also, Visual Studio has OpenMP built-in. 1238 00:58:51,370 --> 00:58:55,460 These are a set of linguistic extensions to C and C++ or 1239 00:58:55,460 --> 00:58:59,710 Fortran in the form of compiler pragmas. 1240 00:58:59,710 --> 00:59:03,460 So who knows what a pragma is? 1241 00:59:03,460 --> 00:59:05,150 OK. 1242 00:59:05,150 --> 00:59:05,350 Good. 1243 00:59:05,350 --> 00:59:06,490 Can you tell us what a pragma is? 1244 00:59:06,490 --> 00:59:07,740 AUDIENCE: [INAUDIBLE PHRASE] 1245 00:59:07,740 --> 00:59:12,140 1246 00:59:12,140 --> 00:59:15,390 PROFESSOR: Yeah, it's kind of like a compiler hint.
1247 00:59:15,390 --> 00:59:18,420 It's a way of saying to the compiler, here's something I 1248 00:59:18,420 --> 00:59:21,970 want to tell you about the code that I'm writing. 1249 00:59:21,970 --> 00:59:25,150 And it basically is a hint. 1250 00:59:25,150 --> 00:59:27,920 So technically, it's not supposed to have any semantic 1251 00:59:27,920 --> 00:59:31,490 impact, but rather suggest how something might be implemented 1252 00:59:31,490 --> 00:59:33,810 by the compiler. 1253 00:59:33,810 --> 00:59:36,160 However, in OpenMP's case, they 1254 00:59:36,160 --> 00:59:39,110 actually have a compiler-- 1255 00:59:39,110 --> 00:59:42,570 it does change the semantics in certain cases. 1256 00:59:42,570 --> 00:59:44,990 It runs on top of native threads and it supports, 1257 00:59:44,990 --> 00:59:46,700 especially, loop parallelism. 1258 00:59:46,700 --> 00:59:49,050 And then, in the latest version, it supports a kind of 1259 00:59:49,050 --> 00:59:54,560 task parallelism like we saw with TBB. 1260 00:59:54,560 --> 00:59:56,270 So, in fact, their task parallelism 1261 00:59:56,270 --> 00:59:58,750 is fairly easy to specify. 1262 00:59:58,750 --> 01:00:00,420 So here's the Fib code. 1263 01:00:00,420 --> 01:00:03,710 So now, this is not looking too bad. 1264 01:00:03,710 --> 01:00:06,960 We basically inserted a few lines here. 1265 01:00:06,960 --> 01:00:08,440 And otherwise, we actually have the 1266 01:00:08,440 --> 01:00:13,530 original Fibonacci code. 1267 01:00:13,530 --> 01:00:18,520 So the sharp pragma says, here's a compiler directive. 1268 01:00:18,520 --> 01:00:21,450 And it says, the OMP says it is an 1269 01:00:21,450 --> 01:00:24,210 OpenMP compiler directive. 1270 01:00:24,210 --> 01:00:26,850 The task says, oh, the following things should be 1271 01:00:26,850 --> 01:00:30,000 interpreted as an independent task.
1272 01:00:30,000 --> 01:00:33,760 And now, the sharing of memory in OpenMP is managed 1273 01:00:33,760 --> 01:00:35,760 explicitly, because they're trying to allow for 1274 01:00:35,760 --> 01:00:39,360 programming both of distributed memory clusters, 1275 01:00:39,360 --> 01:00:43,000 as well as shared memory machines. 1276 01:00:43,000 --> 01:00:48,020 And so, you have to explicitly name the shared variables that 1277 01:00:48,020 --> 01:00:50,000 you're using. 1278 01:00:50,000 --> 01:00:52,740 And here, we're basically saying, wait for the two 1279 01:00:52,740 --> 01:00:56,180 things that we spawned off here to complete. 1280 01:00:56,180 --> 01:01:00,430 So pretty simple code. 1281 01:01:00,430 --> 01:01:05,250 It provides many pragma directives to express common 1282 01:01:05,250 --> 01:01:08,990 patterns, such as a parallel for parallelization. 1283 01:01:08,990 --> 01:01:10,230 It also has reduction. 1284 01:01:10,230 --> 01:01:14,490 It also has directives for scheduling and data sharing. 1285 01:01:14,490 --> 01:01:16,360 And it has a whole bunch of synchronization 1286 01:01:16,360 --> 01:01:18,010 constructs and so forth. 1287 01:01:18,010 --> 01:01:21,650 So it's another interesting one to do. 1288 01:01:21,650 --> 01:01:24,370 The main downside, I would say, of OpenMP is that the 1289 01:01:24,370 --> 01:01:27,990 performance is not really very composable. 1290 01:01:27,990 --> 01:01:30,660 So if you have a program you've written with OpenMP 1291 01:01:30,660 --> 01:01:33,090 over here, another one here, and you want to put them 1292 01:01:33,090 --> 01:01:37,380 together, they fight with each other. 1293 01:01:37,380 --> 01:01:40,310 You have to have your concept of what are 1294 01:01:40,310 --> 01:01:42,350 going to be the programs. 1295 01:01:42,350 --> 01:01:45,250 The task parallelism helps a bit with that. 
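[The OpenMP Fib just described looks roughly like this. It is a sketch: the pragma spellings follow the OpenMP task syntax the lecture describes, and a compiler without OpenMP support simply ignores the pragmas, leaving the correct serial recursion.]

```c
#include <stdint.h>

/* Task-parallel Fibonacci in OpenMP, roughly as on the slide.
 * "#pragma omp task" marks the following statement as an
 * independent task; x is named as a shared variable so the
 * task's write is visible to the parent. "#pragma omp taskwait"
 * waits for the spawned children to complete. */
int64_t fib(int64_t n) {
    if (n < 2) return n;
    int64_t x, y;
    #pragma omp task shared(x)
    x = fib(n - 1);
    y = fib(n - 2);
    #pragma omp taskwait
    return x + y;
}
```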
1296 01:01:45,250 --> 01:01:49,410 But the basic OpenMP is very much of the model, I know how 1297 01:01:49,410 --> 01:01:50,800 many cores I'm running on. 1298 01:01:50,800 --> 01:01:52,350 I can set that. 1299 01:01:52,350 --> 01:01:55,430 And then I can have it automatically parcel up the 1300 01:01:55,430 --> 01:01:56,820 work for those many. 1301 01:01:56,820 --> 01:02:00,310 But once you've done that, some other job, some other 1302 01:02:00,310 --> 01:02:03,170 part of the system that wants to do the same thing, then you 1303 01:02:03,170 --> 01:02:07,490 get oversubscription and perhaps some [UNINTELLIGIBLE]. 1304 01:02:07,490 --> 01:02:10,970 Nevertheless, a very interesting system. 1305 01:02:10,970 --> 01:02:14,960 And very accessible, because it's in most of the standard 1306 01:02:14,960 --> 01:02:16,210 compilers these days. 1307 01:02:16,210 --> 01:02:19,090 1308 01:02:19,090 --> 01:02:23,130 What we're going to look at is Cilk++. 1309 01:02:23,130 --> 01:02:28,740 So this is actually a small set of linguistic extensions 1310 01:02:28,740 --> 01:02:31,320 to C++ to support fork-join parallelism. 1311 01:02:31,320 --> 01:02:33,890 And it was developed by Cilk Arts, which is an MIT 1312 01:02:33,890 --> 01:02:38,050 spin-off, which was acquired by Intel last year. 1313 01:02:38,050 --> 01:02:40,790 So this is now an Intel technology. 1314 01:02:40,790 --> 01:02:43,770 And the reason I know about it is because I was the founder 1315 01:02:43,770 --> 01:02:44,680 of Cilk Arts. 1316 01:02:44,680 --> 01:02:48,300 It was based on 15 years of research at MIT out of my 1317 01:02:48,300 --> 01:02:50,940 research group. 1318 01:02:50,940 --> 01:02:55,850 And we won a bunch of awards, actually, for this work. 1319 01:02:55,850 --> 01:02:59,490 In fact, the work-stealing scheduler that's in it is 1320 01:02:59,490 --> 01:03:00,500 provably efficient. 1321 01:03:00,500 --> 01:03:02,440 In other words, it's not just a heuristic scheduler.
1322 01:03:02,440 --> 01:03:05,080 It's actually got a mathematical proof that it's 1323 01:03:05,080 --> 01:03:06,480 an effective scheduler. 1324 01:03:06,480 --> 01:03:10,200 And in fact, it was the inspiration for things like 1325 01:03:10,200 --> 01:03:14,090 the work-stealing in TBB and the new task mechanisms and so 1326 01:03:14,090 --> 01:03:19,640 forth in OpenMP, as well as a bunch of other people who've 1327 01:03:19,640 --> 01:03:21,360 done work-stealing. 1328 01:03:21,360 --> 01:03:24,520 In addition, it provides a hyperobject library for 1329 01:03:24,520 --> 01:03:27,140 parallelizing code with global variables, which we'll talk 1330 01:03:27,140 --> 01:03:28,120 about later. 1331 01:03:28,120 --> 01:03:32,720 And it includes two tools that you'll come to know and love. 1332 01:03:32,720 --> 01:03:35,570 One is the Cilkscreen race detector, and the other is the 1333 01:03:35,570 --> 01:03:39,460 Cilkview scalability analyzer. 1334 01:03:39,460 --> 01:03:41,890 Now, what we're going to be using in this class is going 1335 01:03:41,890 --> 01:03:49,580 to be the Cilk++ technology that was developed at Cilk 1336 01:03:49,580 --> 01:03:51,400 Arts and then massaged a little bit 1337 01:03:51,400 --> 01:03:52,630 when it got to Intel. 1338 01:03:52,630 --> 01:03:55,990 There is a brand new Intel technology with Cilk built 1339 01:03:55,990 --> 01:03:58,030 into their compiler. 1340 01:03:58,030 --> 01:04:02,000 And it is due to come out in like, two weeks. 1341 01:04:02,000 --> 01:04:05,480 1342 01:04:05,480 --> 01:04:08,830 So our timing for this was it would've been nice to have you 1343 01:04:08,830 --> 01:04:13,510 folks on the new Intel Cilk+ technology. 1344 01:04:13,510 --> 01:04:16,950 But we're going to go with this one for now. 1345 01:04:16,950 --> 01:04:19,690 It's not going to make too big a difference to you folks.
1346 01:04:19,690 --> 01:04:22,190 But you should just be aware that coming down the pike, 1347 01:04:22,190 --> 01:04:27,430 there's actually some much more cleanly integrated 1348 01:04:27,430 --> 01:04:33,120 technology that you can use that's in the Intel compiler. 1349 01:04:33,120 --> 01:04:36,670 So here's how we do nested parallelism in Cilk++. 1350 01:04:36,670 --> 01:04:38,420 So basically, this is Fibonacci. 1351 01:04:38,420 --> 01:04:42,580 And now, what I have here is, if you notice, I've got two 1352 01:04:42,580 --> 01:04:46,430 keywords, cilk_spawn and cilk_sync. 1353 01:04:46,430 --> 01:04:50,160 And this is how you write parallel Fibonacci in Cilk. 1354 01:04:50,160 --> 01:04:51,825 This is it. 1355 01:04:51,825 --> 01:04:56,200 I've inserted two keywords, and my program is parallel. 1356 01:04:56,200 --> 01:05:00,350 The cilk_spawn keyword says that the named child function 1357 01:05:00,350 --> 01:05:03,260 can execute in parallel with the parent caller. 1358 01:05:03,260 --> 01:05:06,070 So when you say x equals cilk_spawn Fib of n minus 1359 01:05:06,070 --> 01:05:08,660 1, it does the same thing that you normally think. 1360 01:05:08,660 --> 01:05:09,910 It calls the child. 1361 01:05:09,910 --> 01:05:12,810 1362 01:05:12,810 --> 01:05:16,340 But after it calls the child, rather than waiting for it to 1363 01:05:16,340 --> 01:05:21,360 return, it goes on to the next statement. 1364 01:05:21,360 --> 01:05:24,650 So then, the statement y equals Fib of n minus 2 is 1365 01:05:24,650 --> 01:05:26,960 going on at the same time as the calculation of 1366 01:05:26,960 --> 01:05:28,210 Fib of n minus 1. 1367 01:05:28,210 --> 01:05:30,730 1368 01:05:30,730 --> 01:05:34,560 And then, the cilk_sync says, don't go past this point until 1369 01:05:34,560 --> 01:05:36,390 all the children you've spawned off have returned.
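[In Cilk syntax the code being described is just the serial Fibonacci with the two keywords inserted. Because the keywords only grant permission, eliding them yields the ordinary serial program; the #defines below perform that serial elision so the sketch compiles without a Cilk compiler, and would be deleted when building with one.]

```c
#include <stdint.h>

/* Serial elision: define the Cilk keywords away so this compiles
 * with a plain C compiler. Under Cilk++ these defines go away and
 * the spawned call may really run in parallel with the caller. */
#define cilk_spawn
#define cilk_sync

int64_t fib(int64_t n) {
    if (n < 2) return n;
    /* The spawned child can execute in parallel with the caller... */
    int64_t x = cilk_spawn fib(n - 1);
    /* ...so this call can go on at the same time as fib(n-1). */
    int64_t y = fib(n - 2);
    /* Don't go past this point until spawned children return. */
    cilk_sync;
    return x + y;
}
```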
And since this is a recursive program, it generates gobs of parallelism, if it's a big thing.

So one of the key things about Cilk++ is that, unlike Pthreads, where pthread_create actually goes and creates a piece of work, in Cilk++ these keywords only grant permission. They say you may execute these things in parallel. It doesn't insist that they be executed in parallel. The runtime may decide: no, in fact, I'm going to just call this, and then return, and then execute this. So it only grants permission, and the Cilk++ runtime system figures out how to load balance it and schedule it.

Cilk++ also supports loop parallelism. Here's an example of an in-place matrix transpose. I want to take this matrix and flip it across its main diagonal. And we can do it with for loops. As you know, for loops are not the best way to do matrix transpose, right? It's better to do divide and conquer. But here's how you could do it. And here, I made the indices run from 0, not 1, because that's the way you do it in programming. But if I did that up here in the figure, then these bounds get to be n minus 1, n minus 1, and it gets too crowded on the slide. So I said, OK, I'll just put a comment there rather than try to sort it out.

So what I'm saying here is that this outer loop is parallel. It's going from 1 to n minus 1, and it says: do all those iterations in parallel. And each one is going through a different number of iterations of j. So you can see you actually need some load balancing here, because some of these are going through just one step, and some are going through n minus 1 steps. Basically, the amount of work in every iteration of the outer loop is different.

I'm sorry?

AUDIENCE: [INAUDIBLE PHRASE].

PROFESSOR: No. i equals 1 is where you want to start, because you don't have to move the diagonal. You only have to go across the top here, and for each of those, flip it into the appropriate column. Flip the two things.

Actually, transpose is one of those functions. I remember writing my first transpose function, and when I was done, I somehow had the identity, because I basically made the loops go from 1 to n and 1 to n and swapped the elements. So I swapped them all twice. And I said, oh, that was a lot of work to compute the identity. No: you've got to make sure you only go through a triangular iteration space, so that each pair of elements gets swapped exactly once. This is an in-place swap.

So that's cilk_for. That's basically it. There are some more facilities we'll talk about, but that's basically it for parallel programming in Cilk++. The other part is, how do you do it so that you get fast code? Which we'll talk about.

Now, Cilk has serial semantics. What that means, and this is unlike some of the other platforms, though it's kind of what OpenMP was aspiring to do, is that if I delete these two keywords here, for example, I get a C++ code. And that serial code is always a legal way to execute this parallel code. The parallel code may have more behaviors, because it's nondeterministic, but it's always legal to treat it as if it's just straight C++. And the reason for that is that, really, we're only granting permission for parallel execution. So even though I put in these keywords, I can still execute it serially if I wish. They don't command parallel execution.

To obtain this serialization, you can do it by hand, by just defining cilk_for to be for, and cilk_spawn and cilk_sync to be empty. Or there's a switch to the Cilk++ compiler that does that for you automatically, and that's probably the preferred way of doing it. But the idea is that, conceptually, you can sprinkle in these keywords, and if you don't want them anymore, fine. If you want to compile it as straight C++, it's better to use the Cilk++ compiler's switch to do it. But if you wanted to ship the code off to somebody else, you could just do these sharp defines, and they could compile it with their own compilers, and it would be the same as a serial C++ code.

So the Cilk++ concurrency platform allows the programmer to express potential parallelism in an application. It says where the parallelism is.
It doesn't say how to schedule it. It says where it is. And then, at runtime, it gets dynamically mapped onto the processor cores. And the way that it does the mapping is mathematically, provably a good way of doing it. If you take one of my graduate courses, I can teach you how that works. We'll do a little bit of study of simple scheduling, but the actual scheduler it uses is more involved. We'll cover it a little bit.

Here are the components of the Cilk++ platform on a single slide. Let me just say what they are. The first one is the keywords. So you get to put things in there. And if you elide them, that is, create the serialization, then you get C++ or C code, for which you can then run your regression tests and demonstrate that you have a good single-threaded program. Alternatively, you can send it through the Cilk++ compiler, which is based on a conventional compiler, in our case GCC. You can link that with the hyperobject library, which we'll talk about when we start talking about synchronization. It produces a binary. If you run that binary on the runtime system, you can also run your regression tests against it. And in particular, if you run it on the runtime system on one core, it should behave identically to having run it through the other path as just serial code. And of course, you get exceptional performance. These, I think, were originally marketing slides.

However, there's also the fact that you may get what are called races in your code, which are bugs that will come up in your parallel code but won't occur in your serial code. Cilk has a race detector to detect those, with which you can run parallel regression tests to produce your reliable multithreaded code. And then the final piece of it is this thing called Cilkview, which allows you to analyze the scalability of your software. You can run on a single core, or on a small number of cores, and then predict how it's going to behave on a large number of cores.

So let's conclude by talking about races, because they're the nasty, nasty, nasty thing we get into with parallel programming. And then next time, we'll get deeper into the Cilk technology itself.

The most basic kind of race is what's called a determinacy race, because if you have one of these things, your program becomes nondeterministic: it doesn't do the same thing every time. A determinacy race occurs when two logically parallel instructions access the same memory location, and at least one of the instructions performs a write, a store, to that location.

So here's an example. I have a cilk_for here, both iterations of which are incrementing x. The index i takes the values 0 and 1. And then it's asserting that x equals 2. If I run this serially, the assertion passes. But when I run it in parallel, it may not produce a 2. It can produce a 1. And let's see why that is. The way to understand this code is to think about its execution in terms of a dependency dag.
So here I have my initialization of x. Then once that's done, the cilk_for loop allows me to do two things at a time, b and c, which are both incrementing x. And then I assert that x equals 2 when they're both done, because that's the semantics of the cilk_for.

So let's see where the race occurs. Remember, it occurs when I have two logically parallel instructions that access the same memory location, here the location x, and at least one of them performs a write.

If we look closer, I want to expand this into a larger picture. Because as you know, x++ is not done on a memory location as a single instruction. It's done as a load of x into a register, an increment of the register, and then a store of the value back in. And meanwhile, there's another register, on another processor presumably, that's doing the same thing. So this is the one I want to look at. This is just zooming in, if you will, on this dependency graph, to look a little bit finer grained at what's actually happening one step at a time.

So the determinacy race, recall, occurs when, and this is something, I'll say it again, that you should memorize. You should be able to say what a determinacy race is: it's when you have two logically parallel instructions that both access the same location, and at least one of them performs a write. And here, I have that. This guy is in parallel with the store being performed here. This is also a race: this guy is reading the location while that guy is writing it.

So let's see what can happen and what can go wrong here. Here's my value x in memory, and here are my two registers on, presumably, two different processors. Now, one thing you can typically do, and this is not quite the case with real hardware, but it's an abstraction of the hardware, is treat the parallel execution from a logical point of view as if you're interleaving instructions from the different processors. OK. We're going to talk in three or four lectures about where that isn't the right abstraction. But it is close to the right abstraction.

So here, basically, we execute statement one, which causes x to become 0. Now let's execute statement two. That causes r1 to become 0. Then I can increment that, and it becomes a 1. All well and good. But now, if the next logical thing that happens is that r2 is set to the value of x, then it becomes 0. Then we increment it. And now this guy stores 1 back into x, and then that guy stores 1 back into x. And notice that now, when we get to the assertion, we assert that it's 2, and it's not 2. It's a 1, because we lost one of the updates.

Now, the reason race bugs are really pernicious is this: notice that if I had executed this whole branch and then this whole branch, I'd get the right answer. Or if I had executed them in the opposite order, I'd get the right answer. The only time I don't get the right answer is when those two things happen to interleave just so.
And that's what happens with race conditions generally: you can run your code a million times and not see the bug, and then run it once, and it crashes out in the field. There have been race bugs responsible for the failure of a space shuttle to launch. You had the North American blackout of, 2001? 2003? It wasn't that long ago; it was like 10 years ago. We had a big blackout caused by a race condition in the code run by the power companies. There have been medical instruments that have fried people, killed them and maimed them, because of race conditions. These are really serious bugs.

Question?

AUDIENCE: [INAUDIBLE] when you said, the only time that that code actually executes serially?

PROFESSOR: It could execute in parallel if it happened that these guys executed before these guys. If you think of a larger context, a whole bunch of these things, where I have two routines that are both incrementing x in the middle of great big parallel programs, it could be that they're executing perfectly well in parallel. But if those two small sections of code happen to execute like this, or like this, then you're going to end up with it executing correctly. It's only if they execute at more or less the same time that it won't necessarily behave correctly.

So there are two types of races that people talk about: a read race and a write race. Suppose you have two instructions, a and b, that access a location x, and suppose that a is parallel to b. If a and b are both reads, you get no race. That's good, because there's no way for anything to go wrong. But if one is a read and one is a write, then the reader is going to see a different value depending upon whether it occurred before or after the write. And if they are both writes, one of them can lose its value. So those cases are read races, and this one is a write race.

We say that two sections of code are independent if they have no determinacy races between them. So, for example, this piece of code is incrementing y, and this one is incrementing x, and y is not equal to x. Those are independent pieces of code.
So, to avoid races, you want to make sure that the iterations of your cilk_for are independent, so that what's going on in one iteration is disjoint from what's going on in another. That you're not writing something in one iteration that you're using in the next, for example. And between a cilk_spawn and the corresponding cilk_sync, the code of the spawned child should be independent of the code of the parent, including any code executed by additionally spawned or called children. OK? It's basically saying: when you spawn something off, don't then go and do something that's going to modify the same locations. You really want to modify different locations. It's fine if they both read the same locations, but it's not fine for one of them to read and one of them to write.

One thing to understand here is that when you spawn a function, the arguments are actually evaluated serially, before the actual spawn occurs. So you evaluate the arguments, you set it all up, and then you spawn the function. The actual spawn occurs after the evaluation of the arguments, so they're evaluated in the parent.

Machine word size matters. This is generally the case for races. And by the way, races are not just a Cilk thing. These races occur in all of these concurrency platforms; I'm illustrating with Cilk because that's what we're going to be using in our labs and so forth. So it turns out machine word size matters, and you can have races in packed data structures. For example, on some machines, if you declare a char a and a char b in a struct, then updating a and b in parallel may cause a race, because the updates are actually operating on a word basis. Now, on the Intel architectures, that doesn't happen, because Intel supports atomic updates of single bytes, so you don't have to worry about it. But if you were accessing bits within a word, you could end up with the same thing. You access bit five and bit three, and you think you're acting independently, but in fact you're reading the whole word, or the whole byte, in order to access either one.
Fortunately, the technology that you're going to be using comes with a race detector, which you will find invaluable for debugging your stuff. It's kind of like a Valgrind for races.

What's good about this race detector is that it provides a rock-hard guarantee. If you have a deterministic program that, on a given input, could possibly behave any differently from the corresponding serial program, the one you'd get if you got rid of the parallel keywords, then this tool, Cilkscreen, guarantees to report and localize the offending race. It'll tell you: you've got a race between this location and that location. It's up to you to find it and fix it, but it can tell you that. It employs a regression-test methodology, where the programmer provides the test inputs. So if you don't provide a test input that elicits the race, you can still have a bug. But if you have a test input that in any way could behave differently than the serial execution, bingo, it'll tell you.

It identifies a bunch of things involving the race, including a stack trace. It runs on the binary executable using what's called dynamic instrumentation. So it's kind of like Valgrind, except it actually does the instrumentation as the program is running. It uses a technology called Pin, P-I-N, which you can read about; it's a nice platform for doing code rewriting and analysis on the fly. It runs about 20 times slower than real time, so you basically use it for debugging.

So the first part of project four is basically coming up to speed with this technology. And that's going to be available tomorrow. Is that what we said? Yeah, that will be available tomorrow. This is actually tons of fun. Most people in most places don't get to play with parallel technology like this.