The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: So in this second lecture we're going to talk about some design patterns for parallel programming, and to tell you a little bit about what a design pattern is and why it is useful. Some of you, if you've taken object oriented programming, have probably already seen design patterns before.

I ended the last lecture with: OK, so I understand some of the performance implications, how do I go about parallelizing my program? This is a figure I found quite often in books and talks on parallel programming. It essentially lays out four common steps for parallelizing your program.

So often, you start out with a sequential program. This shouldn't be surprising, since for a long time, as you've heard in earlier lectures, people just wrote sequential code and that was good enough. Now the problem is you want to take that sequential code, or you want to keep writing sequential code because it's conceptually easier, and you want to be able to parallelize it so you can map it down to your parallel architecture, which in this example has four processors.

So the first step is you take your sequential program and you divide it up into tasks. During the project reviews yesterday, for example, when I talked to each team individually, we talked about this, and you stumbled onto these four steps whether you realized it or not. So you come up with these tasks, and each one essentially encapsulates some computation. Then you group them together -- this is a granularity adjustment -- and you map them down to processes; these are things you can compose into threads, for example.
And then you have to essentially map these down onto actual processors, and they have to talk with each other, so you have to orchestrate the communication, and then finally do the execution.

So let's step through each one of these at a time. Decomposition is really affected by Amdahl's Law: if there's not a whole lot of parallelism in the application, your decomposition is a waste of time; there's not really a whole lot to get. What you're trying to do is identify concurrency in your application and figure out at what level to exploit it. So you're trying to divide up your computation into tasks -- eventually these are going to be distributed among processors -- and you want to find enough of them so that you can keep all the processors busy. And remember that the number of these that you have gives you sort of an upper bound on your potential speedup.

And as in the ray tracing example that I showed, the number of tasks that you have may vary at run time. Sometimes you might have a lot of rays bouncing off a lot of things, and sometimes you might not have a whole lot of reflection going on, so the number of rays will change over time. In other applications, the interactions, for example between molecules, might change in a molecular dynamics simulator.

The assignment really affects granularity. This is where, having partitioned your tasks, you're trying to group them together while taking into account: what is the communication cost going to be? What kind of locality am I going to deal with? What kind of synchronization mechanisms do I need, and how often do I need to synchronize? You adjust your granularity so that you end up with things that are load balanced, and you try to reduce communication as much as possible. And structured approaches might work well here. You might look at the code, do some inspection, you might understand the application, but there are some well-known design patterns -- which is essentially the thing we're going to get to -- that try to help you with this.
As programmers, really, I think we worry about partitioning first. This is really independent of the architecture or programming model: just taking my application and figuring out, well, what are the different parts that I need to compose together to build my application? I'm going to show you an example of that. And one thing to keep in the back of your mind is that the complexity of how much partitioning work you actually have to do really affects your decision. If you start out with some piece of code, or you wrote your code one way, and you realize that to actually parallelize it requires a lot more work -- in some user studies we've done on trying to get performance from code, that really affects how much work you actually do. If something requires a lot of work, you might not do it even though it might have a really high payoff. So you want to keep complexity down, and it pays off to think well about your algorithm and how you structure it ahead of time.

And finally, the last two stages I've lumped together: orchestration and mapping. I have my tasks, they need to communicate, so what kind of computation primitives do I need? What kind of communication primitives do I need? Am I packaging things up into threads, and are they talking together over DMAs or shared memory? What you want to do is try to preserve locality and then figure out how to come up with a scheduling order that preserves the overall dependences of the computation.

Parallel programming by patterns is meant to essentially give you a cookbook, or set of recipes, you can follow to help you with the different steps: decompose, assign, orchestrate, and map. This can lead to really high quality solutions in some domains. In scientific computation there are a lot of problems that are well understood and well studied, and some of the frequently occurring things have been abstracted out and recorded in patterns.
And there's another purpose to patterns too, in that they provide you with a vocabulary: two programmers can talk to each other and use the right terminology, and that conveys a whole lot of information without having to actually go through and understand all the details. You instantaneously know what I mean if I use a particular pattern name. It can also help with software reusability, malleability, and modularity -- all of those things that are important from a software engineering perspective.

So, a brief history, which I found in some of the talks that I was researching. There's a book by Christopher Alexander from Berkeley in 1977 that actually looked at classifying patterns, or really listing patterns, from an architectural perspective. He tried to look at what patterns occur in designs of living spaces and record those. As an example, there's a six foot balcony pattern: if you're going to build a balcony, you should build it six feet deep and you should have it slightly recessed, and so on, because this is what's commonly used and these are the kinds of balconies that have good properties architecturally. Now, I don't know whether this book actually had a whole lot of impact on how people designed buildings -- certainly not, probably, for the Stata Center -- but some patterns from object oriented programming, which I think many of you have already seen, by the Gang of Four in 1995, really organized and classified and captured different ways of programming that people had been using. Things like the visitor pattern, for example, which some of you might know.

Then in 2005, not too long ago, there was a new book, which I'm using to create some of these slides, that really recorded patterns for parallel programming. And they identified four design spaces. These are structured to express or capture different elements.
Some elements are for algorithm expression -- I've listed those here -- and some are for the actual software construction, the actual implementation. Under algorithm expression it's really the business of decomposition: finding concurrency. Where are my tasks? In the algorithm structure space, you might need some way of packaging those tasks together so that they can talk to each other and make use of the parallel architecture. On the software construction side you're dealing with slightly lower-level details: what are some things you might need at a slightly lower level of implementation to actually get all the computation that's expressed at the algorithm level to work and run well? I'm going to talk about the latter part in the next lecture, and I'll cover much of the algorithm expression here -- at least the finding-concurrency part in this talk. If there's time I'll do algorithm structure; otherwise we'll just talk about it next time.

So let's say you're working with MPEG decoding. This is a pipeline picture of an MPEG-2 decoder, or rather a block-level diagram of an MPEG-2 decoder. You have this algorithm and you say, OK, I want to parallelize this. Where's my parallelism? Where's my concurrency? In MPEG-2 you have some bit stream, you do some decoding on it, and you end up with two things. You end up with motion vectors that tell you, here's somebody's head, and in the next scene it's moved to this particular location. So that's captured by the motion vectors; this recovers temporal information. Over here you recover spatial information. So in somebody's head you might have discovered some redundancies, and that redundancy was eliminated, so you essentially need to uncompress, or undo, that compression. So you go through some stages, and then you combine the two together -- combine the motion estimation and the recovered pictures -- to reconstruct the image, and then you might do some additional stages.
This particular stage here is indicated to be data parallel, in that I can do different scenes, for example, in parallel, or I might be able to do different slices of the picture in parallel. So I can essentially take advantage of data parallelism in the sense of taking a loop and breaking it up, as I showed in lecture 5.

In task decomposition, what we're looking for is really independent coarse-grain computation, and these often are inherent to the algorithm. Here I've outlined them in yellow: this is one particular task. I can have one thread of execution doing all the spatial decoding, and I can have another thread decoding all my motion vectors. In general, you're looking for sequences of statements that operate together as a group. These could be loops or they could be functions. Usually you want these to just fall out of your algorithm as it's expressed, and in a lot of cases they do, so depending on how you think about the program you might be able to find these more quickly or easily.

Data decomposition, which I've highlighted here, essentially says you have the same computation applied to lots of small data elements. You can take your large data set, partition it into smaller chunks, and do the computation over and over in parallel, so that allows you to get that kind of data parallelism across the data space.

And finally, I'm going to make a case for pipeline parallelism, which essentially says, well, I can recognize that I have a lot of stages in my computation, and it does help to have this kind of decomposition, just because you're familiar with pipelining concepts from other domains. This type of producer-consumer chain is actually beneficial, so it does help to expose these kinds of relationships.
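Here is a minimal sketch of that kind of coarse-grained task decomposition, one thread per task. The functions decode_motion_vectors() and decode_spatial() are hypothetical placeholders standing in for the two halves of the decoder front end, not a real MPEG-2 API; the point is only the structure: two independent tasks run concurrently, and their results are combined only after both finish.

```c
#include <pthread.h>
#include <stdio.h>

/* Placeholder tasks; a real decoder would do the actual work here. */
static void decode_motion_vectors(void) { printf("decoding motion vectors\n"); }
static void decode_spatial(void)        { printf("decoding spatial data\n"); }

static void *motion_task(void *arg)  { (void)arg; decode_motion_vectors(); return NULL; }
static void *spatial_task(void *arg) { (void)arg; decode_spatial();        return NULL; }

int main(void)
{
    pthread_t t1, t2;

    /* Task 1: recover temporal information (motion vectors). */
    pthread_create(&t1, NULL, motion_task, NULL);
    /* Task 2: recover spatial information (undo the compression). */
    pthread_create(&t2, NULL, spatial_task, NULL);

    /* Only after both coarse-grained tasks are done can the frame be
     * reconstructed by combining their results. */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```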
So what are some guidelines for actually coming up with your task decomposition? Where do you start? You have your algorithm, you understand the problem really well, you're writing some code, and the hope is that, as I've pointed out, you can look for natural code regions that encapsulate your computation. Function calls and distinct loop iterations are pretty good places to start looking.

And as a general rule, it's easier to start with as many tasks as possible and then fuse them to make them more coarse-grained than to go the other way around. It impacts your software engineering decisions, it impacts your software implementation, it impacts how you encapsulate things in the low-level details of the implementation. So it's always easier to fuse than to fission.

And you want to keep three things in mind: flexibility, efficiency, and simplicity. Flexibility says, if you've made some decisions, is that going to scale well, or is it going to allow you to make changes later? You might want parameterized tasks rather than fixed tasks, for example. The loops that I showed in the previous talk, each loop that I parallelized had a hard-coded number that said you're going to do four iterations. That may or may not work well: I can't reuse that code if I want to use that kind of data decomposition and work sharing on a longer loop, with a longer array, where I want each thread to do more work. So you might want to parameterize more things in your tasks.

Efficiency means keeping in mind that each of these tasks will eventually have to talk with other tasks. There are communication and synchronization costs that have to be taken into account, so you want these tasks to amortize the communication costs, or other overheads, over the computation. And you want to keep in mind that there are going to be dependencies between these tasks, and you don't want those dependencies to get out of hand. So you want to keep things under control.
And lastly, which is probably as important as the other two: simplicity. If you start decomposing your code into different chunks and you can't understand your code in the end, it doesn't help you from a debugging perspective, and it doesn't help you from a software engineering perspective, in being able to reuse your code or having other people understand it.

Guidelines for data decomposition are sort of similar. You essentially have to do both task and data decomposition to complete the process, and often your task decomposition dictates your data partitioning. So if I've split a loop across two different processes, I've essentially implied how data should be distributed between those two threads. Data decomposition, as opposed to task decomposition, is a good starting point if you're doing the same computation over and over again over really, really large data sets; you can use that as your yardstick to decide whether you do task decomposition first or data decomposition first.

I've just listed two common data decompositions here; I'll talk about more of these later on when we talk about actual performance optimizations. You can decompose arrays, for example, along rows or columns, or you can decompose them into blocks. And you have recursive data structures: a binary tree, for example, you might partition into left and right sub-trees. The thing you're trying to get to is to start with a problem, recursively subdivide it until you get to a manageable part, do the computation, and then figure out a way to do the integration. Merge sort is the classic example that captures this really well.

So again, the three key concepts to keep in mind when you're doing data decomposition: flexibility, efficiency, and simplicity.
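As a sketch of what a flexible, parameterized data decomposition might look like, here is a row-wise partitioning of an array where the number of threads is a run-time parameter rather than a hard-coded 4. The array sizes and the per-row work (scale_rows) are made up for illustration.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define ROWS 1000
#define COLS 1000

static double a[ROWS][COLS];

struct chunk { int first_row, last_row; };    /* rows [first_row, last_row) */

static void *scale_rows(void *arg)
{
    struct chunk *c = arg;
    for (int i = c->first_row; i < c->last_row; i++)
        for (int j = 0; j < COLS; j++)
            a[i][j] = a[i][j] * 2.0 + 1.0;    /* same work on every row     */
    return NULL;
}

int main(int argc, char **argv)
{
    /* Number of threads is a parameter, not a hard-coded 4. */
    int nthreads = (argc > 1) ? atoi(argv[1]) : 4;
    if (nthreads < 1) nthreads = 1;

    pthread_t    *tid = malloc(nthreads * sizeof *tid);
    struct chunk *ck  = malloc(nthreads * sizeof *ck);

    for (int t = 0; t < nthreads; t++) {
        ck[t].first_row = t * ROWS / nthreads;        /* row-wise blocks          */
        ck[t].last_row  = (t + 1) * ROWS / nthreads;  /* last block ends at ROWS  */
        pthread_create(&tid[t], NULL, scale_rows, &ck[t]);
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);

    printf("a[0][0] = %f using %d threads\n", a[0][0], nthreads);
    free(tid);
    free(ck);
    return 0;
}
```

Because each thread owns a disjoint block of rows, no synchronization is needed inside the loop, and the same code works whether you have 2, 4, or 16 processors.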
The first two, flexibility and efficiency, are really just meant to suggest that the size of the data chunks you've allocated actually leads to enough work, because you want to amortize the cost of communication or synchronization, but you also want each data chunk to generate about the same amount of work, for load balancing. And simplicity: for the same reason that task decomposition can get out of hand, data decomposition can get out of hand. You don't want data moving around all over the place so that it becomes, again, hard to debug, or manage, or make changes, or track dependencies.

Pipeline parallelism is actually classified somewhere else in the book. I've lifted it up and tried to make a case for it here, because I think it's just good practice to expose producer-consumer relationships in your code. If I have a function that's producing data that's going to be used by another function, as with the spatial decoding, or the different stages of classic ray tracing algorithms, you want to maintain that producer-consumer relationship, that assembly line analogy. What are some prime examples of pipelines in computer architecture? The instruction pipeline in your superscalar processor. But there are other examples of pipelines, things that you might have used in, say, the UNIX shell: you cat a file, pipe it to grep for some word, and then pipe that into word count. So I think it's a natural concept, we use it in many different ways, and it's good to practice it at the software level as well. And there are some computations in specific domains, like signal processing and graphics, where the pipeline model is a really important part of how computation gets carried out -- you have your graphics pipeline, for example, and pipelines in signal processing.

How much time do I have? How am I doing on time? OK, should I stop here?

AUDIENCE: About how much more?

PROFESSOR: 10 slides.
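To make the producer-consumer relationship behind pipeline parallelism concrete, here is a minimal sketch of a two-stage pipeline built on a bounded buffer. The "work" in each stage is a placeholder, and the buffer size and item count are arbitrary; the point is that the two stages overlap in time, just like stages of an instruction pipeline or a shell pipeline.

```c
#include <pthread.h>
#include <stdio.h>

#define BUF_SIZE 8
#define N_ITEMS  32

static int buffer[BUF_SIZE];
static int head, tail, count;                 /* circular-buffer state         */
static pthread_mutex_t lock      = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;

static void *producer(void *arg)
{
    (void)arg;
    for (int i = 0; i < N_ITEMS; i++) {
        int item = i * i;                     /* stage 1: "produce" something  */
        pthread_mutex_lock(&lock);
        while (count == BUF_SIZE)             /* wait if the buffer is full    */
            pthread_cond_wait(&not_full, &lock);
        buffer[tail] = item;
        tail = (tail + 1) % BUF_SIZE;
        count++;
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    for (int i = 0; i < N_ITEMS; i++) {
        pthread_mutex_lock(&lock);
        while (count == 0)                    /* wait if the buffer is empty   */
            pthread_cond_wait(&not_empty, &lock);
        int item = buffer[head];
        head = (head + 1) % BUF_SIZE;
        count--;
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&lock);
        printf("consumed %d\n", item);        /* stage 2: "consume" the item   */
    }
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```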
PROFESSOR: OK, so this is sort of a brief summary, which will lead into a much larger talk at the next lecture on how you actually go about re-engineering your code for parallelism. This comes into play if you start with sequential code and you're parallelizing it -- some of you are doing that for your projects -- or if you're writing code from scratch and you want to engineer it for parallelism as well.

I think it's important to understand the problem that you're working with. You want to survey your landscape, understand what other people might have done, and look for well-known solutions and common pitfalls. The patterns that I'm going to talk about in more detail really provide you with a list of questions to help you assess the existing code that you're working with, or the problem that you're trying to solve.

There are things you need to keep in mind that affect your overall correctness. For example, is your computation numerically stable? If you have a floating point computation, you might not be able to reorder all the operations, because that might affect your actual precision. Your overall output might be different, and that may or may not be acceptable. A lot of scientific codes, for example, that have to deal with a lot of precision might have to be cognizant of that fact.

You also want to define the scope of what you're trying to do, and will it be good enough? You want to do back-of-the-envelope calculations to make sure that the things you're suggesting are actually feasible, that they're actually practical, and that they will give you the sort of performance expectations that you've set out. You also want to understand your input range; you might be able to specialize if there are some cases, for example, that you're allowed to ignore. These are good things to keep in mind.
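Going back to the numerical stability point, here is a tiny example of why reordering matters: floating point addition is not associative, so a parallel decomposition that regroups a sum can change the answer. The specific constants are chosen only to make the effect visible in single precision.

```c
#include <stdio.h>

int main(void)
{
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;

    float one_order     = (a + b) + c;   /* = 0 + 1, so 1.0                  */
    float another_order = a + (b + c);   /* c is absorbed into b, so 0.0     */

    /* Same three numbers, two different results: typically prints
     * 1.000000 vs 0.000000 with IEEE single precision. */
    printf("%f vs %f\n", one_order, another_order);
    return 0;
}
```

Whether a difference like this matters is exactly the "may or may not be acceptable" judgment mentioned above.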
You also want to define a testing protocol. I think it's important to understand: you started out with some piece of code, you're going to make some changes to it, so how are you going to go about testing it? How might you go about debugging it? That could be where you spend a lot of your time. And then, having these things in mind, the parts that are worth looking at are the parts that make the most sense: where is your computation spending most of its time? Are there hot spots in your code? You can use profiling tools for that, and in fact you'll see some of that for Cell in some of the recitations later in the course.

So, a simple example: a molecular dynamics simulator. You have some space of molecules, which I'm just going to represent in 2D. You have water molecules, and I have some protein, and I'm trying to understand how the different atoms in that molecule move around so that I can determine the shape of the protein. So there are forces: there are bonded forces between the atoms -- I've just shown, for example, the bonded forces within my protein -- and then there are non-bonded forces, that is, how different atoms interact with each other because of electrostatic forces, for example.

What you try to do is figure out, for each atom, what are all the forces affecting it and what is its current position, and then you estimate where it's going to move based on, in the simplest case, a Newtonian f = ma type projection. In a naive algorithm you have n squared interactions: you have to calculate the forces on each molecule from all the others. But by understanding your problem you know that you can exploit the fact that the forces fall off very quickly with distance, so you can use a cutoff distance: if a molecule is way too far away, you can ignore it. And for people who do galaxy calculations, you know you can ignore the gravitational forces between constellations or clusters that are too far apart.
So here's the sequential code, some pseudo code for a molecular dynamics simulator. You have your atoms array, your force array, and your set of neighbors in a two-dimensional space, and you're going to go through and simulate different time steps. For each time step, for each atom, you compute the bonded forces, compute who my neighbors are -- these are the things that essentially encapsulate the cutoff distance -- and for those neighbors compute the forces between them, then update the position, and end. Since this is a loop, that might suggest essentially where to start looking for concurrency.

So you can start with the decomposition patterns -- there will be more in-depth details about those next. I'm going to give you some intuition, and then you would try to figure out whether your decomposition has to abide by certain dependencies, and what those dependencies are. How do you expose them? And then, how can you design, and how can you evaluate your design?

Screwed up again. I just fixed this. OK, so this is the pseudo code again from the previous slide. Since all you have is a simple loop, that essentially says this is where to look for the computation. And since you're doing the same computation for each atom, that again gives you the type of parallelism that we've talked about before. So you can look at splitting up the iterations and parallelizing those, so that each processor, for example, does one atom, or each processor does a collection of atoms.

But there are additional tasks -- so, data-level parallelism versus sort of control parallelism. For each atom you also want to calculate the forces: you want to calculate long-range interactions, find neighbors, update the position, and so on. Some of these have shared data and some of them do not, so you have to factor that in. Understanding the control dependencies essentially tells you how you need to lay out your orchestration.
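For reference, here is roughly what that sequential loop looks like written out in C. Every type and helper name (Atom, compute_bonded, find_neighbors, compute_nonbonded, update_position) is an assumed placeholder for whatever was on the slide, with the physics stubbed out; what matters is the loop structure, since that's where the concurrency is going to come from.

```c
#include <stdio.h>

#define N_ATOMS       64
#define MAX_NEIGHBORS 64
#define N_TIMESTEPS   10

typedef struct { double x, y, z; } Vec;
typedef struct { Vec pos, vel;  } Atom;

static Atom atoms[N_ATOMS];                    /* positions of all atoms       */
static Vec  force[N_ATOMS];                    /* force accumulated per atom   */
static int  neighbors[N_ATOMS][MAX_NEIGHBORS]; /* atoms within the cutoff      */

/* Stubs standing in for the real physics. */
static void compute_bonded(int i)           { force[i].x += 0.0; }
static int  find_neighbors(int i)           { (void)i; return 0; }
static void compute_nonbonded(int i, int j) { (void)i; (void)j; }
static void update_position(int i)          { atoms[i].pos.x += force[i].x; }

int main(void)
{
    for (int t = 0; t < N_TIMESTEPS; t++) {
        for (int i = 0; i < N_ATOMS; i++) {              /* the loop to parallelize */
            compute_bonded(i);                           /* bonded forces           */
            int n = find_neighbors(i);                   /* neighbors within cutoff */
            for (int k = 0; k < n; k++)
                compute_nonbonded(i, neighbors[i][k]);   /* non-bonded forces       */
        }
        for (int i = 0; i < N_ATOMS; i++)
            update_position(i);                          /* needs all forces done   */
    }
    printf("simulated %d steps for %d atoms\n", N_TIMESTEPS, N_ATOMS);
    return 0;
}
```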
So you have your bonded forces and you have your neighbor list, and that feeds your long-range calculations. But to do the position update I need both of those tasks to have completed. And each one of these tasks needs different data structures. Everybody essentially reads the locations of the atoms, and that's good, because within a time step it means I can distribute this work really well. But then there's a write synchronization problem, because eventually I have to update that array, so I have to be careful about who goes first. There's an accumulation, which means I can potentially do a reduction on those forces. There's some writing on the other end, but that seems to be a localized data structure, so for partitioning, for example, the neighbor lists might just be kept local to each processor. Coming up with this structure, this sort of block-level diagram, helps you figure out where your tasks are, helps you figure out what kind of synchronization mechanisms you need, and it can also suggest the data distribution you might need to reduce synchronization costs and problems.

And lastly, you want to evaluate your design. You want to keep in mind what your target architecture is. Are you trying to run on shared memory, or on distributed memory with message passing, or are you just doing this for one architecture? For your project you're doing this for Cell, so you can be very Cell-specific, but if you're doing this in other contexts, the architecture might influence some of your decisions. Does data sharing have enough special properties, like being read only, that you can exploit? Are there enough accumulations that you can exploit with reductions? Are there temporal constraints on data sharing that you can exploit, and can you deal with those efficiently? If you can't, then you have a problem, so you need to resolve that.
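As one concrete way to exploit the accumulation just mentioned, here is a small sketch in which each thread accumulates force contributions into its own private copy of the array, and the copies are reduced (summed) into the shared array only at the end, instead of locking the shared array on every update. The sizes and the per-pair contribution are made up for illustration.

```c
#include <pthread.h>
#include <stdio.h>

#define N_ATOMS   256
#define N_THREADS   4

static double force[N_ATOMS];                    /* shared result             */
static double partial[N_THREADS][N_ATOMS];       /* one private copy per thread */

static void *accumulate(void *arg)
{
    long t = (long)arg;
    /* Each thread handles a block of source atoms and writes only into its
     * own private array, so no locking is needed in this loop. */
    for (int i = t * (N_ATOMS / N_THREADS); i < (t + 1) * (N_ATOMS / N_THREADS); i++)
        for (int j = 0; j < N_ATOMS; j++)
            partial[t][j] += 0.001 * (i - j);    /* stand-in contribution     */
    return NULL;
}

int main(void)
{
    pthread_t tid[N_THREADS];
    for (long t = 0; t < N_THREADS; t++)
        pthread_create(&tid[t], NULL, accumulate, (void *)t);
    for (int t = 0; t < N_THREADS; t++)
        pthread_join(tid[t], NULL);

    for (int t = 0; t < N_THREADS; t++)          /* the reduction step        */
        for (int j = 0; j < N_ATOMS; j++)
            force[j] += partial[t][j];

    printf("force[0] = %f\n", force[0]);
    return 0;
}
```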
If the design is OK, then you move on to the next design space. At the next lecture I'll go through these in a lot more detail. That's it.