The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: So last week, in the last few lectures, you heard about parallel architectures, and lecture four started the discussion of concurrency: how do you take applications, or independent actors that want to operate on the same data, and make them run safely together?

Recapping the last two lectures, you saw two primary classes of architectures, although Saman talked about a few more. There was the class of shared memory processors -- the multicores that Intel, AMD, and PowerPC have today, for example -- where you have one copy of the data that is shared among all the different processors, because they essentially share the same memory. And you need things like atomicity and synchronization to make sure the sharing is done properly, so that you don't get into data race situations where multiple processors try to update the same data element and you end up with erroneous results.

You also heard about distributed memory processors. An example of that might be the Cell, loosely speaking, where you have cores that primarily access their own local memory. And while you can have a single global memory address space, to get data from memory you essentially have to communicate with the different processors to explicitly fetch data in and out. So things like data distribution -- where the data is, and what your communication pattern looks like -- affect your performance.

What I'm going to talk about in today's lecture is programming these two different kinds of architectures, shared memory processors and distributed memory processors, and present you with some concepts commonly used for programming these machines.

So in shared memory processors, you have, say, n processors, 1 to n, and they're connected to a single memory.
And if one processor asks for the value stored at address X, everybody knows where to go look, because there's only one address X. So different processors can communicate through shared variables, and you need things like locking, as I mentioned, to avoid race conditions or erroneous computation.

As an example of a straightforward parallelization on a shared memory machine, suppose you have a simple loop that's just running through an array, adding elements of array A to elements of array B and writing them to some new array, C. If I gave you this loop, you can probably recognize that there are really no data dependencies here. I can split up this loop into three chunks -- let's say I have three processors -- where one processor does all the computations for iterations zero through three, so the first four iterations, the second processor does the next four iterations, and the third processor does the last four iterations. That's what's shown here -- I should have brought a laser pointer. What you might need is some mechanism to tell the different processors, here's the code you need to run and maybe where to start, and then some way of synchronizing these different processors that says, I'm done, I can move on to the next computation step.

So this is an example of a data parallel computation. The loop has no real dependencies, and each processor can operate on a different piece of the data. What you can do is have a process -- a single application -- that forks off, or creates, what are commonly called threads, and each thread goes on and executes, in this case, the same computation. So a single process can create multiple concurrent threads, and each thread is really just a mechanism for encapsulating some trace of execution, some execution path. In this case you're essentially encapsulating this particular loop here.
And maybe you parameterize your start index and your ending index, or your loop bounds. And in a shared memory processor, since there's only a single memory, you really don't need to do anything special about the data in this particular example, because everybody knows where to go look for it. Everybody can access it, everything's independent, and there are no real issues with races or deadlocks.

So I just wrote down some actual code for that loop that parallelizes it using Pthreads, a commonly used threading mechanism, just to give you a bit of a flavor for the complexity: the simple loop that we had expands to a lot more code in this case. You have your arrays, A, B, and C, each with 12 elements, and you have the basic function -- this is the actual computation that we want to carry out. What I've done here is parameterize where you start in the array: you get this parameter, and then you calculate four iterations' worth of the computation. Now, in my main function, I have this concept of threads that I'm going to create -- in this case I'm going to create three of them. There are some parameters I have to pass in, some attributes which I'm not going to get into here. But then I pass in the function pointer; this is essentially a mechanism that says, once I've created this thread, go to this function and execute this particular code. And then there are some arguments to pass to that function -- here I'm just passing in the index at which each loop chunk starts. And after I've created each thread, implicitly in the thread creation, the code can immediately start running. Once all the threads have started running, I can essentially just exit the program because I've completed.
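The Pthreads version from the slide is along these lines. This is a minimal sketch, not the exact course code; the array size of 12 and the chunk of four iterations per thread follow the description above, and a join loop is added at the end as the usual way to wait for the threads rather than simply exiting.

```c
#include <pthread.h>

#define N        12   /* total number of elements          */
#define CHUNK     4   /* iterations handled by one thread  */
#define NTHREADS  3

int A[N], B[N], C[N];

/* Each thread computes four iterations, starting at the index
 * passed in through the void* argument. */
void *add_chunk(void *arg)
{
    int start = *(int *)arg;
    for (int i = start; i < start + CHUNK; i++)
        C[i] = A[i] + B[i];
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];
    int starts[NTHREADS] = {0, 4, 8};

    /* Fork: each thread runs the same function on a different chunk. */
    for (int t = 0; t < NTHREADS; t++)
        pthread_create(&threads[t], NULL, add_chunk, &starts[t]);

    /* Join: wait for every thread before using C. */
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(threads[t], NULL);

    return 0;
}
```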
So what I showed you with that first example was the concept of data parallelism. You're performing the same computation, but instead of operating on one big chunk of data, I've partitioned the data into smaller chunks and replicated the computation so that I can get that kind of parallelism. But there's another form of parallelism called control parallelism, which uses essentially the same threading model but doesn't necessarily run the same function or the same computation in each thread. I've illustrated that here, where these are your data parallel computations and these are some other computations in your code.

There is a programming model that allows you to express this kind of parallelism and tries to help the programmer by taking sequential code and adding annotations that say, this loop is data parallel, or this set of code has this kind of control parallelism in it. So you start with your parallel code. This is the same program, multiple data kind of parallelization. You might have seen in the previous lecture SIMD -- single instruction, or same instruction, multiple data -- which allowed you to execute the same operation, say an add, over multiple data elements. Here it's similar terminology: there's same program, multiple data, and multiple program, multiple data. This talk is largely focused on the SPMD model, where you essentially have one central decision maker, or you're trying to solve one central computation, and you're trying to parallelize it over your architecture to get the best performance. So you start off with your program, and then you annotate the code with what's parallel and what's not parallel. And you might add in some synchronization directives so that if you do in fact have sharing, you use the right locking mechanism to guarantee safety. Now, in OpenMP, there are some limitations as to what it can do.
It in fact assumes that the programmer knows what he's doing, and the programmer is largely responsible for getting the synchronization right -- if there is sharing, the dependencies have to be protected correctly. So you take your program, insert these annotations, and then you go on to test and debug.

A simple OpenMP example, again using the simple loop -- I've thrown away some of the extra code -- adds two extra pragmas. The first one, the parallel pragma, which I call the data parallel pragma, really says that you can execute as many copies of the following code block as there are processors, or as many as you have thread contexts. In this case I've implicitly made the assumption that I have three processors, so I can automatically partition my code into three sets, and this transformation can essentially be done automatically by the compiler. And then there's a for pragma that says this loop is parallel and you can divide up the work using the mechanism called work sharing: multiple threads collaborate to solve the same computation, but each one does a smaller amount of work.
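A minimal sketch of what that looks like, assuming the same 12-element arrays as before (the exact pragmas on the slide may differ slightly):

```c
#include <omp.h>

#define N 12
int A[N], B[N], C[N];

void add_arrays(void)
{
    /* Create a team of threads; each thread executes the block. */
    #pragma omp parallel
    {
        /* Work sharing: the loop iterations are divided among the
         * threads in the team, e.g. four each with three threads. */
        #pragma omp for
        for (int i = 0; i < N; i++)
            C[i] = A[i] + B[i];
    }
}
```

The two pragmas are often written as the single combined form `#pragma omp parallel for` placed directly on the loop.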
This is in contrast to what I'm going to focus on a lot more in the rest of the talk, which is distributed memory processors and programming for distributed memories. And this will feel a lot more like programming for the Cell as you get more involved in it and your projects get more intense.

So in distributed memory processors, to recap the previous lectures, you have n processors, each processor has its own memory, and they essentially share the interconnection network. Each processor has its own address X. So when a processor, P1, asks for X, it knows where to go look: it looks in its own local memory. If all processors ask for the value at address X, each one goes and looks in a different place -- there are n places to look, really -- and what's stored at those addresses will vary, because each is somebody's local memory.

So if one processor, say P1, wants the value stored at an address in processor two's memory, it actually has to request it explicitly. Processor two has to send it the data, and processor one has to figure out what to do with that copy -- it has to store it somewhere.

So message passing really exposes explicit communication to exchange data. You'll see that there are different kinds of data communication, but the concept of what you exchange has four things you need to address. One: how is the data described, and what does it describe? Two: how are the processes identified -- how do I identify that processor one is sending me this data, and if I'm receiving data, how do I know who I'm receiving it from? Three: are all messages the same -- if I send a message to somebody, do I have any guarantee that it's received or not? And four: what does it mean for a send operation or a receive operation to be completed -- is there some sort of acknowledgment process?

So here's an example of a message passing program -- and if you've started to look at the lab, you'll see that this is essentially where the lab came from; it's the same idea. I have some two-dimensional space, and I have points in this space: points B, which are these blue circles, and points A, which I've represented as these yellow or golden squares. What I want to do is, for every point in A, calculate the distance to all of the points in B. So there's a pairwise interaction between the two arrays. A simple loop essentially does this -- there are n squared interactions -- you have a loop over all the A elements and a loop over all the B elements, and you calculate, in this case, the Euclidean distance, which I'm not showing, and store it into some new array.
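In code, the sequential version is roughly the following. This is a sketch: the point type, array sizes, and names are placeholders, not the lab's actual declarations.

```c
#include <math.h>

#define NA 16   /* number of A points (placeholder) */
#define NB 16   /* number of B points (placeholder) */

typedef struct { double x, y; } point;

point  A[NA], B[NB];
double C[NA][NB];   /* C[i][j] = distance from A[i] to B[j] */

void all_pairs_distances(void)
{
    for (int i = 0; i < NA; i++)        /* loop over all the A elements */
        for (int j = 0; j < NB; j++)    /* loop over all the B elements */
            C[i][j] = sqrt((A[i].x - B[j].x) * (A[i].x - B[j].x) +
                           (A[i].y - B[j].y) * (A[i].y - B[j].y));
}
```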
So if I give you two processors to do this work, processor one and processor two, and I give you some mechanism to share data between the two -- here's my CPU, and each processor has its local memory -- what would be an approach for actually parallelizing this? Anybody look at the lab yet? OK, so what would you do with two processors?

AUDIENCE: One has half the memory [INAUDIBLE]

PROFESSOR: Right. So what was said was that you split one of the arrays in two, and you can actually get that kind of concurrency. So let's say processor one already has the data, and it has some place already allocated where it's going to write C, the results of the computation. Then I can break up the work just like it was suggested. What P1 has to do is send data to P2. It says, here's the data, here's the computation, go ahead and help me out. So I send the first array, and then I send half of the other elements that I want the calculations done for. And then P1 and P2 can start computing in parallel. But notice that P2 has its own array that it's going to store results in, and so as these compute, they actually fill in different logical parts of the overall result matrix. So at the end, for P1 to have all the results, P2 has to send it the rest of the matrix to complete it. And now P1 has all the results, the computation is done, and you can move on. Does that make sense? OK. So you'll get to actually do this as part of your labs.

So in this example messaging program, you start with the sequential code, and we had two processors. Processor one actually does the sending -- this is essentially a template for the code you'll end up writing. And it does halve the work in the outer loop: the A array over which it's iterating, it's only doing half as many iterations.
And processor two has to actually receive the data, and it specifies where to receive the data into. I've omitted some things here -- for example, extra information hidden in these parameters. Here you're sending all of A and all of B, whereas you could have specified extra parameters that say, I'm sending you A, here are n elements to read from A; here's B, here are n/2 elements to read from B; and so on. But the computation is essentially the same, except for the index at which you start, which in this case has changed for processor two. And now, when the computation is done, this one essentially waits until the data is received. Processor two eventually sends it that data, and now you can move on.

AUDIENCE: I have a question.

PROFESSOR: Yeah?

AUDIENCE: So would processor two have to wait for the data from processor one?

PROFESSOR: Yeah, I'll get into that later. So what does it mean to receive? To do this computation, I actually need this instruction to complete. So what does it need for that instruction to complete? I do have to get the data, because otherwise I don't know what to compute on. So there is some implicit synchronization that you have to do, and in some cases it's explicit. I'll get into that a little bit later. Does that sort of hint at the answer? Are you still confused?

AUDIENCE: So processor one doesn't do the computation but it still sends the data --

PROFESSOR: So in terms of tracing, processor one sends the data and then can immediately start executing its own code, right? Processor two, in this particular example, has to wait until it receives the data. Once this receive completes, then it can actually go and start executing the rest of the code. So imagine that it essentially says, wait until I have data, wait until I have something to do. Does that help?
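To make the two sides concrete, here is a minimal sketch of the exchange in MPI-style C. The course code uses its own send/receive primitives rather than MPI, so treat the calls below purely as a stand-in for the generic SEND and RECEIVE on the slides; NA, NB, the tags, and compute_rows are illustrative assumptions.

```c
#include <mpi.h>

#define NA 16
#define NB 16

double A[NA], B[NB], C[NA][NB];

/* The nested distance loop from the earlier sketch, restricted to
 * rows [row_lo, row_hi) of C -- assumed to be defined elsewhere. */
void compute_rows(int row_lo, int row_hi);

void processor_one(void)                   /* MPI rank 0 */
{
    /* Send all of A and all of B to the helper, as on the slide. */
    MPI_Send(A, NA, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    MPI_Send(B, NB, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD);

    compute_rows(0, NA/2);                 /* do the first half locally */

    /* Block until the helper's half of the result matrix arrives. */
    MPI_Recv(C[NA/2], (NA/2) * NB, MPI_DOUBLE, 1, 2, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
}

void processor_two(void)                   /* MPI rank 1 */
{
    /* Blocking receives: nothing to compute until the data is here. */
    MPI_Recv(A, NA, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Recv(B, NB, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    compute_rows(NA/2, NA);                /* same loop, different start index */

    /* Send the computed half of C back to processor one. */
    MPI_Send(C[NA/2], (NA/2) * NB, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD);
}
```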
AUDIENCE: Can the main processor [UNINTELLIGIBLE PHRASE]

PROFESSOR: Can the main processor --

AUDIENCE: I mean, in Cell, everybody is not a peer; there is a master there. And what the master can do, instead of doing computation, is basically be the quarterback -- sending data, receiving data -- and the SPEs can basically be waiting for data, doing the computation, and sending it back. So in some sense, in Cell you probably don't want to do the computation on the master, because that means the master slows down. The master will do only the data management. So that might be one symmetrical [UNINTELLIGIBLE]

PROFESSOR: And you'll see that in the example, because the PPE in that case has to send the data to two different SPEs. Yup?

AUDIENCE: In some sense [UNINTELLIGIBLE PHRASE] at points seems to be [UNINTELLIGIBLE] sense that if -- so you have a huge array and you want to [UNINTELLIGIBLE PHRASE] the data to receive the whole array, then you have to [UNINTELLIGIBLE]

PROFESSOR: Yeah, we'll get into that later. Yeah, that's a good point -- communication is not cheap, and if you don't take that into consideration, you end up paying a lot of overhead for parallelizing things.

AUDIENCE: [INAUDIBLE]

PROFESSOR: Well, you can do things in software as well. We'll get into that.

OK, so some crude performance analysis. I have to calculate this distance, and given two processors, I can effectively get a 2x speedup: by dividing up the work, I can get done in half the time. Well, if you gave me four processors, I can maybe get done four times as fast. And in my communication model here, I have one copy of one array that I'm essentially sending to every processor, and then I'm partitioning my other array, A, into smaller subsets and sending those to each of the different processors. We'll get into terminology for how to actually name these communications later.
But really the thing to take away here is that this granularity -- how I partition A -- affects my performance and communication almost directly. And the comment that was just made is, what do you do about communication? It's not free. So all of that will be addressed.

To understand performance, there are three main concepts that you essentially need to understand. One is coverage -- in other words, how much parallelism do I actually have in my application? That can affect how much work it is worth spending on this particular application. The second is granularity: how do you partition your data among your different processors so that you can keep communication down, keep synchronization down, and so on? The third is locality: while it's not shown in this particular example, if two processors are communicating, they may be close together or far apart, or the communication between one pair of processors may be far cheaper than between another pair -- can I exploit that in some way? We'll talk about that as well.

As an example of parallelism in an application -- there are two projects this term doing ray tracing, so I thought I'd include this slide -- how much parallelism do you have in a ray tracing program? In ray tracing, you have some camera source, some observer, and you're trying to figure out how to color or shade the different pixels on your screen. So what you do is shoot rays from a particular source through your image plane, and then you see how the rays bounce off other objects, and that allows you to render scenes in various ways. So you have different kinds of parallelism. You have your primary rays that are shot in, and if you're shooting into something like water, or some very reflective surface, or some surface that can reflect or transmit, you can end up with a lot more rays that are created at run time.
So there's dynamic parallelism in this particular example: you can end up shooting a lot of rays from any given hit point. So there are different kinds of parallelism you can exploit.

Not all programs have this kind of abundant, embarrassingly parallel computation. You saw some basic code sequences in earlier lectures. So here there's a sequential part, and the reason it's sequential is that there are data flow dependencies between the different computations: I calculate a here, but I need the result of a to do this next instruction; I calculate d here and I need that result to calculate e. But then this loop here is just initializing some big array, and I really can do that in parallel. So I have sequential parts and parallel parts. How does that affect my overall speedup?

There's a law which is really a demonstration of diminishing returns: Amdahl's Law. It says that if you have a really fast car, it's only as good to you as fast as you can drive it. If there's a lot of congestion on your road, or there are posted speed limits or some other mechanism, you really can't exploit all the speed of your car. In other words, you're only as fast as the slowest parts of your computation allow.

To look at this in more detail: your potential speedup is really proportional to the fraction of the code that can be parallelized. Say I have some computation with three parts: a sequential part that takes 25 seconds, a parallel part that takes 50 seconds, and another sequential part that runs in 25 seconds, so the total execution time is 100 seconds. If I have one processor, that's really all I can do. Now suppose you gave me more than one processor -- let's say five processors. Well, I can't do anything about the first sequential part; that's still going to take 25 seconds. And I can't do anything about the other sequential part either; that still takes 25 seconds.
But this parallel part I can essentially break up among the different processors -- five in this case -- and that gets me five-way parallelism, so the 50 seconds is now reduced to 10 seconds. Is that clear so far? So the overall running time in that case is 60 seconds. What would be my speedup? Well, you calculate speedup as the old running time divided by the new running time: 100 seconds divided by 60 seconds, so my parallel version is 1.67 times faster.

So this is great: if I increase the number of processors, then I should be able to get more and more parallelism. But it also means that there's an upper bound on how much speedup you can get. So let p be the fraction of work in your application that's parallel, and let n be the number of processors. Say the old running time is just one unit of work. The sequential work then takes 1 minus p, since p is the fraction that's parallel, and the parallel work takes p. Since I can spread that parallel fraction over n processors, its time becomes p over n, which I can make really small in the limit. Does that make sense so far? So the speedup tends to 1 over 1 minus p in the limit: as I increase the number of processors and n gets really large, that's essentially my upper bound on how fast the program can go -- how much I can exploit out of my program.

So what this law says -- the implication here -- is that if your program has a lot of inherent parallelism, you can do really well. But if your program doesn't have any parallelism, there's really nothing you can do, and parallel architectures won't really help you. And there are some interesting trade-offs that you might consider, for example, if you're designing a chip, or if you're looking at an application or a domain of applications and figuring out what the best architecture is to run them on.
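Written out, with p the parallel fraction and n the number of processors, the argument above is:

```latex
\text{speedup}(n) \;=\; \frac{T_{\text{old}}}{T_{\text{new}}}
                  \;=\; \frac{1}{(1-p) + p/n},
\qquad
\lim_{n \to \infty} \text{speedup}(n) \;=\; \frac{1}{1-p}.
```

Plugging in the numbers from the example, p = 0.5 and n = 5 gives 1 / (0.5 + 0.5/5) = 1/0.6, about 1.67, matching the 100/60 above.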
So in terms of performance scalability, as I increase the number of processors, I get speedup. You can define efficiency to be linear at 100%, but typically you end up in the sublinear domain, because communication is often not free. You can, however, get superlinear speedups on real architectures because of secondary and tertiary effects that come from register allocation or caching. Those can hide a lot of latency, or you can take advantage of pipelining mechanisms in the architecture to get superlinear speedups. So you can end up in either of two domains.

That was a small overview of the extent of parallelism in your program and how it affects your overall execution. The other concept is granularity. Given that I have this much parallelism, how do I exploit it? There are different ways of exploiting it, and that comes down to: how do I subdivide my problem? What is the granularity of the sub-problems I'm going to compute on? And granularity, from my perspective, is just a qualitative measure of the ratio of your computation to your communication. If you're doing a lot of computation with very little communication, you could be doing really well; or, vice versa, you could be communication limited, and then you need a lot of bandwidth, for example, in your architecture.

AUDIENCE: Like before, you really didn't have to give every single processor an entire copy of B.

PROFESSOR: Right. Yeah, good point. And as you saw in the previous slides, computation stages are separated by communication stages, and your communication in a lot of cases essentially serves as synchronization: I need everybody to get to the same point before I can move on logically in my computation. So there are two classes of granularity: there's fine grain and, as you'll see, coarse grain.
585 00:27:10,840 --> 00:27:14,010 So in fine-grain parallelism, you have low computation to 586 00:27:14,010 --> 00:27:15,460 communication ratio. 587 00:27:15,460 --> 00:27:18,150 And that has good properties in that you have a small 588 00:27:18,150 --> 00:27:21,010 amount of work done between communication stages. 589 00:27:24,410 --> 00:27:27,910 And it has bad properties in that it gives you less 590 00:27:27,910 --> 00:27:29,530 performance opportunity. 591 00:27:32,190 --> 00:27:33,850 It should be more, right? 592 00:27:33,850 --> 00:27:35,100 More opportunity for -- 593 00:27:35,100 --> 00:27:35,600 AUDIENCE: No. 594 00:27:35,600 --> 00:27:36,100 Less. 595 00:27:36,100 --> 00:27:36,480 PROFESSOR: Sorry. 596 00:27:36,480 --> 00:27:37,120 Yeah, yeah, sorry. 597 00:27:37,120 --> 00:27:38,370 I didn't get enough sleep. 598 00:27:41,670 --> 00:27:44,470 So less opportunities for performance enhancement, but 599 00:27:44,470 --> 00:27:48,870 you have high communication ratio because essentially 600 00:27:48,870 --> 00:27:50,800 you're communicating very often. 601 00:27:50,800 --> 00:27:55,120 So these are the computations here and these yellow bars are 602 00:27:55,120 --> 00:27:56,850 the synchronization points. 603 00:27:56,850 --> 00:27:59,640 So I have to distribute data or communicate. 604 00:27:59,640 --> 00:28:02,000 I do computations but, you know, computation 605 00:28:02,000 --> 00:28:02,990 doesn't last very long. 606 00:28:02,990 --> 00:28:06,410 And I do more communication or more synchronization, and I 607 00:28:06,410 --> 00:28:07,750 repeat the process. 608 00:28:07,750 --> 00:28:11,370 So naturally you can adjust this granularity to sort of 609 00:28:11,370 --> 00:28:12,530 reduce the communication overhead. 610 00:28:12,530 --> 00:28:14,510 AUDIENCE: [UNINTELLIGIBLE PHRASE] 611 00:28:14,510 --> 00:28:17,010 two things in that overhead part. 612 00:28:17,010 --> 00:28:18,120 One is the volume. 613 00:28:18,120 --> 00:28:21,380 So one, communication. 614 00:28:21,380 --> 00:28:23,484 Also there's a large part of synchronization cost. 615 00:28:23,484 --> 00:28:26,210 Basically you get a communication goal and you 616 00:28:26,210 --> 00:28:29,022 have to go start the messages and wait until 617 00:28:29,022 --> 00:28:29,210 everybody is done. 618 00:28:29,210 --> 00:28:32,250 So that overhead also can go. 619 00:28:32,250 --> 00:28:35,642 Even if you don't send that much data, just the fact that 620 00:28:35,642 --> 00:28:38,260 you are communicating, that means you have to do a lot of 621 00:28:38,260 --> 00:28:43,076 this additional bookkeeping stuff, that especially in the 622 00:28:43,076 --> 00:28:43,679 distributed [? memory ?] 623 00:28:43,679 --> 00:28:44,954 [? machine is ?] pretty expensive. 624 00:28:44,954 --> 00:28:45,910 PROFESSOR: Yeah. 625 00:28:45,910 --> 00:28:48,780 Thanks. 626 00:28:48,780 --> 00:28:52,490 So in coarse-grain parallelism, you sort of make 627 00:28:52,490 --> 00:28:55,070 the work chunks more and more so that you do the 628 00:28:55,070 --> 00:28:57,440 communication synchronization less and less. 629 00:28:57,440 --> 00:28:58,440 And so that's shown here. 630 00:28:58,440 --> 00:29:00,750 You do longer pieces of work and have fewer 631 00:29:00,750 --> 00:29:04,370 synchronization stages. 
So in that regime you can have more opportunities for performance improvements, but the tricky thing you get into is what's called load balancing. If each of these different computations takes a differing amount of time to complete, then a lot of processors might end up idle as they wait until everybody has essentially reached the finish line. Yep?

AUDIENCE: If you don't have to acknowledge that something's done, can't you just say, [? OK, I'm done with your result, ?] hand it to the initial processor, and keep doing whatever?

PROFESSOR: So, you can do that in cases where there is essentially a mechanism for it, or the application allows for it. But as I'll show -- well, you won't see it until the next lecture -- there are dependencies, for example, that might preclude you from doing that. If everybody needs to reach the same point because you're updating a large data structure before you can go on, then you might not be able to do that. Think of doing molecular dynamics simulations: you need everybody to calculate a new position before you can go on and calculate the new force interactions.

AUDIENCE: [UNINTELLIGIBLE] nothing else to calculate yet.

PROFESSOR: Right.

AUDIENCE: But also there is pipelining. So what do you talk about [UNINTELLIGIBLE], because you might want to get the next data while you're computing now, so that when I'm done I can start sending. [UNINTELLIGIBLE PHRASE] you can all have some of that.

PROFESSOR: Yep. Yeah, because communication is such an intensive part, there are different ways of dealing with it, and that will come right after load balancing.

So this is just an illustration of the load balancing problem. Things that appear in this lightish pink will serve as visual cues -- this is the same color coding scheme that David's using in the recitations -- so this is PPU code, and things that appear in yellow will be SPU code.
And these are just meant to show you how you might do things like this on Cell, to help you along in picking up more of the syntax and functionality you need for your programs.

So in the load balancing problem, you have, let's say, three different threads of computation, shown here as red, blue, and orange, and you've reached some communication stage. The PPU program in this case says, send a message to each of my SPEs -- each of my different processors -- telling them to start. Once every processor gets that message, they can start computing; let's assume they have their data and so on ready to go. What's going to happen is that each processor runs through the computation at a different rate. That could be because one processor is faster than another, or because one processor is more loaded than another, or just because each processor is assigned differing amounts of work -- one has a short loop, one has a longer loop. And so, as the animation shows, execution proceeds and everybody ends up waiting until the orange one has completed. Nobody can move on until everybody has reached the synchronization point, because there's a strict dependence being enforced here that says: I'm going to wait until everybody has told me they're done before I go on to the next step of the computation. And in Cell you do that using mailboxes, in this case. Is that clear so far?

So how do you get around this load balancing problem? Well, there are two different ways. There's static load balancing: I know my application really, really well, and I understand the different computations, so I can divide up the work and have a static mapping of the work to my processors. Static mapping just means, in this particular example, that I'm going to assign the work to the different processors and that's what the processors will do.
Work can't shift around between processors. So in this case I have a work queue, where each of those bars is some computation: I can assign some chunk to P1, processor one, and some chunk to processor two, and then the computation goes on. Those allocations don't change. This works well if I understand the application well and I know the computation, and if my cores are relatively homogeneous and there's not a lot of contention for them. If all the cores are the same and each core gets an equal share of the total work, this works really well because nobody sits idle for too long. It doesn't work so well for heterogeneous architectures or multicores where one core might be faster than another, because that increases the complexity of the allocation I need to do. And if there's a lot of contention for some resources, that can also throw off the static load balancing, so the work distribution might end up being uneven.

So the alternative is dynamic load balancing. And you could certainly do a hybrid, a static plus dynamic mechanism, although I don't have that in the slides. In the dynamic load balancing scheme there are two different mechanisms I'm going to illustrate. In the first scheme, you start with something like the static mechanism: I have some work going to processor one and some work going to processor two. But then, as processor two executes and completes faster than processor one, it takes on some of the additional work from processor one -- the work that was here is now shifted -- and so you can keep helping out your other processors to compute things faster. In the other scheme, you have a work queue where you're essentially distributing work on the fly: as things complete, you just send the processors more work to do. So in this animation, I start off and send work to two different processors. P2 is really fast, so it's just zipping through things. Then P1 eventually finishes and new work is allocated to each of the two processors. So dynamic load balancing is intended to even out the amount of work across processors as the program runs -- it really increases utilization, and processors spend less and less time sitting idle.
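A minimal sketch of the second mechanism -- a shared work queue -- using Pthreads on shared memory. This is a simplification and an assumption of mine, not the course's Cell code; on Cell, the queue would live on the PPU and chunks would be handed to SPEs through DMA transfers and mailbox messages.

```c
#include <pthread.h>

#define NUM_CHUNKS 64

static int next_chunk = 0;                      /* index of the next unclaimed chunk */
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;

void do_chunk(int c);                           /* the actual computation, assumed defined elsewhere */

/* Each worker repeatedly grabs the next chunk off the shared queue.
 * Faster (or less loaded) workers automatically end up doing more
 * chunks -- that is the dynamic load balancing. */
void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&qlock);
        int c = (next_chunk < NUM_CHUNKS) ? next_chunk++ : -1;
        pthread_mutex_unlock(&qlock);

        if (c < 0)          /* queue empty: nothing left to do */
            break;
        do_chunk(c);
    }
    return NULL;
}
```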
764 00:34:44,780 --> 00:34:48,360 And then P1 eventually finishes and new work is 765 00:34:48,360 --> 00:34:50,570 allocated to the two different processors. 766 00:34:50,570 --> 00:34:54,120 So dynamic load balancing is intended to sort of give equal 767 00:34:54,120 --> 00:34:57,250 amounts of work to the different processors. 768 00:34:57,250 --> 00:34:59,880 So it really increases utilization, and processors spend less and 769 00:34:59,880 --> 00:35:03,310 less time being idle. 770 00:35:03,310 --> 00:35:03,500 OK. 771 00:35:03,500 --> 00:35:08,480 So load balancing was one part of sort of how granularity can 772 00:35:08,480 --> 00:35:10,270 have a performance trade-off. 773 00:35:10,270 --> 00:35:11,660 The other is synchronization. 774 00:35:11,660 --> 00:35:14,520 So there were already some good questions as to, well, 775 00:35:14,520 --> 00:35:16,350 you know, how does this play into overall execution? 776 00:35:16,350 --> 00:35:17,000 When can I wait? 777 00:35:17,000 --> 00:35:18,810 When can't I wait? 778 00:35:18,810 --> 00:35:21,930 So I'm going to illustrate it with just a simple data 779 00:35:21,930 --> 00:35:22,960 dependence graph. 780 00:35:22,960 --> 00:35:25,600 Although you can imagine that in each one of these circles 781 00:35:25,600 --> 00:35:27,550 there's some really heavy load computation. 782 00:35:27,550 --> 00:35:29,340 And you'll see that in the next lecture, in fact. 783 00:35:29,340 --> 00:35:32,540 So if I have some simple computation here -- 784 00:35:32,540 --> 00:35:33,930 I have some operands. 785 00:35:33,930 --> 00:35:35,800 I'm doing an addition. 786 00:35:35,800 --> 00:35:36,910 Here I do another addition. 787 00:35:36,910 --> 00:35:38,690 I need both of these results before I can do this 788 00:35:38,690 --> 00:35:40,140 multiplication. 789 00:35:40,140 --> 00:35:43,070 Here I have, you know, some loop that's adding through 790 00:35:43,070 --> 00:35:44,200 some array elements. 791 00:35:44,200 --> 00:35:46,870 I need all those results before I do the final subtraction 792 00:35:46,870 --> 00:35:49,580 and produce my final result. 793 00:35:49,580 --> 00:35:52,330 So what are some synchronization points here? 794 00:35:52,330 --> 00:35:55,630 Well, it really depends on how I allocate the different 795 00:35:55,630 --> 00:35:59,090 instructions to processors. 796 00:35:59,090 --> 00:36:02,580 So if I have an allocation that just says, well, let's 797 00:36:02,580 --> 00:36:06,240 put all these chains on one processor, put these two 798 00:36:06,240 --> 00:36:08,470 chains on two different processors, well, where are my 799 00:36:08,470 --> 00:36:09,890 synchronization points? 800 00:36:09,890 --> 00:36:13,120 Well, it depends on where this guy is and where this guy is. 801 00:36:13,120 --> 00:36:15,880 Because for this instruction to execute, it needs to 802 00:36:15,880 --> 00:36:17,910 receive data from P1 and P2. 803 00:36:17,910 --> 00:36:23,260 So if P1 and P2 are different from what's in that box, 804 00:36:23,260 --> 00:36:24,100 somebody has to wait. 805 00:36:24,100 --> 00:36:24,460 And so there's a 806 00:36:24,460 --> 00:36:26,550 synchronization that has to happen. 807 00:36:32,200 --> 00:36:34,520 So essentially at all join points there's potential for 808 00:36:34,520 --> 00:36:35,510 synchronization. 809 00:36:35,510 --> 00:36:37,810 But I can adjust the granularity so that I can 810 00:36:37,810 --> 00:36:40,330 remove more and more synchronization points.
811 00:36:46,920 --> 00:36:50,950 So if I had assigned all this entire sub-graph to the same 812 00:36:50,950 --> 00:36:53,580 processor, I really get rid of the synchronization because it 813 00:36:53,580 --> 00:36:56,910 is essentially local to that particular processor. 814 00:36:56,910 --> 00:36:58,820 And there's no extra messaging that would have to happen 815 00:36:58,820 --> 00:37:01,415 across processors that says, I'm ready, or I'm ready to 816 00:37:01,415 --> 00:37:04,330 send you data, or you can move on to the next step. 817 00:37:04,330 --> 00:37:06,650 And so in this case the last synchronization point would be 818 00:37:06,650 --> 00:37:07,890 at this join point. 819 00:37:07,890 --> 00:37:12,470 Let's say if it's allocated on P1 or on some other processor. 820 00:37:12,470 --> 00:37:14,420 So how would I get rid of this synchronization point? 821 00:37:14,420 --> 00:37:19,080 AUDIENCE: Do the whole thing. 822 00:37:19,080 --> 00:37:19,390 PROFESSOR: Right. 823 00:37:19,390 --> 00:37:22,160 You put the entire thing on a single processor. 824 00:37:22,160 --> 00:37:23,940 But you get no parallelism in this case. 825 00:37:23,940 --> 00:37:26,360 So the coarse-grain versus fine-grain parallelism 826 00:37:26,360 --> 00:37:30,410 granularity issue comes into play. 827 00:37:30,410 --> 00:37:33,410 So the last sort of thing I'm going to talk about in terms 828 00:37:33,410 --> 00:37:37,440 of how granularity impacts performance -- and this was 829 00:37:37,440 --> 00:37:39,570 already touched on -- is that communication is really not 830 00:37:39,570 --> 00:37:42,760 cheap and can be quite overwhelming on a lot of 831 00:37:42,760 --> 00:37:43,820 architectures. 832 00:37:43,820 --> 00:37:46,600 And what's interesting about multicores is that they're 833 00:37:46,600 --> 00:37:48,840 essentially putting a lot more resources closer 834 00:37:48,840 --> 00:37:50,470 together on a chip. 835 00:37:50,470 --> 00:37:55,010 So it essentially is changing the factors for communication. 836 00:37:55,010 --> 00:37:57,910 So rather than having, you know, your parallel cluster 837 00:37:57,910 --> 00:38:00,700 now which is connected, say, by ethernet or some other 838 00:38:00,700 --> 00:38:03,690 high-speed link, now you essentially have large 839 00:38:03,690 --> 00:38:05,920 clusters or will have large clusters on a chip. 840 00:38:05,920 --> 00:38:09,250 So communication factors really change. 841 00:38:09,250 --> 00:38:15,310 But the cost model is reasonably well captured by these 842 00:38:15,310 --> 00:38:16,560 different parameters. 843 00:38:19,730 --> 00:38:22,450 So what is the cost of my communication? 844 00:38:22,450 --> 00:38:26,915 Well, it's equal to, well, how many messages am I sending and 845 00:38:26,915 --> 00:38:30,130 what is the frequency with which I'm sending them? 846 00:38:30,130 --> 00:38:32,130 There's some overhead per message. 847 00:38:32,130 --> 00:38:34,140 So I have to actually package data together. 848 00:38:34,140 --> 00:38:37,920 I have to stick in a control header and then send it out. 849 00:38:37,920 --> 00:38:40,360 So that takes me some work. On the receiver side, 850 00:38:40,360 --> 00:38:41,450 I have to take the message. 851 00:38:41,450 --> 00:38:46,390 I maybe have to decode the header, figure out where to 852 00:38:46,390 --> 00:38:48,470 store the data that's coming in on the message. 853 00:38:48,470 --> 00:38:51,700 So there's some overhead associated with that as well.
854 00:38:51,700 --> 00:38:55,420 There's a network delay for sending a message, so putting 855 00:38:55,420 --> 00:38:58,375 a message on the network so that it can be transmitted, or 856 00:38:58,375 --> 00:38:59,880 picking things up off the network. 857 00:38:59,880 --> 00:39:04,050 So there's a latency also associated with how long does 858 00:39:04,050 --> 00:39:07,765 it take for a message to get from point A to point B. What 859 00:39:07,765 --> 00:39:11,020 is the bandwidth that I have across a link? 860 00:39:11,020 --> 00:39:13,350 So if I have a lot of bandwidth then that can really 861 00:39:13,350 --> 00:39:16,670 lower my communication cost. But if I have little bandwidth 862 00:39:16,670 --> 00:39:19,450 then that can really create contention. 863 00:39:19,450 --> 00:39:20,940 How much data am I sending? 864 00:39:20,940 --> 00:39:22,780 And, you know, number of messages. 865 00:39:22,780 --> 00:39:25,730 So this numerator here is really an average of the data 866 00:39:25,730 --> 00:39:29,590 that you're sending per communication. 867 00:39:29,590 --> 00:39:32,100 There's a cost induced by contention. 868 00:39:32,100 --> 00:39:33,880 And then finally there's -- so all of 869 00:39:33,880 --> 00:39:35,580 these are added factors. 870 00:39:35,580 --> 00:39:37,830 The higher they are, except for bandwidth, because it's in 871 00:39:37,830 --> 00:39:39,020 the denominator here, the worse your 872 00:39:39,020 --> 00:39:40,800 communication cost becomes. 873 00:39:40,800 --> 00:39:45,010 So you can try to reduce the communication cost by 874 00:39:45,010 --> 00:39:46,220 communicating less. 875 00:39:46,220 --> 00:39:47,100 So you adjust your granularity. 876 00:39:47,100 --> 00:39:50,460 And that can impact your synchronization or what kind 877 00:39:50,460 --> 00:39:52,960 of data you're shipping around. 878 00:39:52,960 --> 00:39:55,450 You can do some architectural tweaks or maybe some software 879 00:39:55,450 --> 00:39:58,800 tweaks to really get the network latency down and the 880 00:39:58,800 --> 00:40:00,190 overhead per message down. 881 00:40:00,190 --> 00:40:03,880 So on something like the raw architecture, which we saw in 882 00:40:03,880 --> 00:40:06,230 Saman's lecture, there's a really fast mechanism to 883 00:40:06,230 --> 00:40:08,900 communicate with your nearest neighbor in three cycles. 884 00:40:08,900 --> 00:40:12,580 So one processor can send a single operand to another 885 00:40:12,580 --> 00:40:16,310 reasonably fast. You know, you can improve the bandwidth 886 00:40:16,310 --> 00:40:19,010 again with architectural mechanisms. 887 00:40:19,010 --> 00:40:21,720 You can do some tricks as to how you package your data in 888 00:40:21,720 --> 00:40:24,490 each message. 889 00:40:24,490 --> 00:40:27,380 And lastly, what I'm going to talk about in a couple of 890 00:40:27,380 --> 00:40:30,320 slides is, well, I can also improve it using some 891 00:40:30,320 --> 00:40:31,965 mechanisms that try to increase the 892 00:40:31,965 --> 00:40:33,390 overlap between messages. 893 00:40:33,390 --> 00:40:35,160 And what does this really mean? 894 00:40:35,160 --> 00:40:37,070 What am I overlapping it with? 895 00:40:37,070 --> 00:40:40,100 And it's really the communication and computation 896 00:40:40,100 --> 00:40:43,590 stages are going to somehow get aligned. 897 00:40:43,590 --> 00:40:46,300 So before I actually show you that, I just want to point out 898 00:40:46,300 --> 00:40:48,220 that there are two kinds of messages.
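Before getting to those two kinds of messages, it may help to see the cost model that was just described written out in one place. This is a reconstruction from the spoken description; the symbol names are chosen here, not copied from the slide:

C = f \cdot \left( o + l + \frac{n/m}{B} + t_c - \text{overlap} \right)

where f is the number (frequency) of messages, o is the overhead per message on the sender and the receiver, l is the network delay per message, n/m is the average data sent per message (total data n over m messages), B is the bandwidth of the link, t_c is the cost induced by contention, and overlap is the amount of communication you manage to hide behind computation. Everything except bandwidth and overlap makes the cost worse as it grows, which matches the discussion above.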
899 00:40:48,220 --> 00:40:51,020 There's data messages, and these are, for example, the 900 00:40:51,020 --> 00:40:54,240 arrays that I'm sending around to different processors for 901 00:40:54,240 --> 00:40:57,760 the distance calculations between points in space. 902 00:40:57,760 --> 00:40:59,650 But there are also control messages. 903 00:40:59,650 --> 00:41:02,460 So control messages essentially say, I'm done, or 904 00:41:02,460 --> 00:41:06,900 I'm ready to go, or is there any work for me to do? 905 00:41:06,900 --> 00:41:09,700 So on Cell, control messages, you know, you can think of 906 00:41:09,700 --> 00:41:12,560 using Mailboxes for those and the DMAs for doing the data 907 00:41:12,560 --> 00:41:13,590 communication. 908 00:41:13,590 --> 00:41:16,170 So data messages are relatively much larger -- 909 00:41:16,170 --> 00:41:19,150 you're sending a lot of data -- versus control messages 910 00:41:19,150 --> 00:41:22,190 that are really much shorter, just essentially just sending 911 00:41:22,190 --> 00:41:23,930 you very brief information. 912 00:41:27,640 --> 00:41:30,980 So in order to get that overlap, what you can do is 913 00:41:30,980 --> 00:41:33,610 essentially use this concept of pipelining. 914 00:41:33,610 --> 00:41:35,250 So you've seen pipelining in superscalar. 915 00:41:35,250 --> 00:41:37,620 Someone talked about that. 916 00:41:37,620 --> 00:41:40,130 And what you are essentially trying to do is break up the 917 00:41:40,130 --> 00:41:43,950 communication and computation into different stages and then 918 00:41:43,950 --> 00:41:45,860 figure out a way to overlap them so that you can 919 00:41:45,860 --> 00:41:47,970 essentially hide the latency for the 920 00:41:47,970 --> 00:41:51,090 sends and the receives. 921 00:41:51,090 --> 00:41:54,830 So let's say you have some work that you're doing, and it 922 00:41:54,830 --> 00:41:57,570 really requires you to send the data -- 923 00:41:57,570 --> 00:42:00,200 somebody has to send you the data or you essentially have 924 00:42:00,200 --> 00:42:02,440 to wait until you get it. 925 00:42:02,440 --> 00:42:04,850 And then after you've waited and the data is there, you can 926 00:42:04,850 --> 00:42:06,470 actually go on and do your work. 927 00:42:06,470 --> 00:42:07,670 So these are color coded. 928 00:42:07,670 --> 00:42:11,540 So this is essentially one iteration of the work. 929 00:42:11,540 --> 00:42:15,220 And so you could overlap them by breaking up the work into 930 00:42:15,220 --> 00:42:21,050 send, wait, work stages, where each iteration trying to send 931 00:42:21,050 --> 00:42:24,340 or request the data for the next iteration, I wait on the 932 00:42:24,340 --> 00:42:27,420 data from a previous iteration and then I do my work. 933 00:42:27,420 --> 00:42:29,910 So depending on how I partition, I can really get 934 00:42:29,910 --> 00:42:32,280 really good overlap. 935 00:42:32,280 --> 00:42:35,320 And so what you want to get to is the concept of the steady 936 00:42:35,320 --> 00:42:40,130 state, where in your main loop body, all you're doing is 937 00:42:40,130 --> 00:42:43,590 essentially pre-fetching or requesting data that's going 938 00:42:43,590 --> 00:42:46,030 to be used in future iterations for future work. 939 00:42:46,030 --> 00:42:49,890 And then you're waiting on -- 940 00:42:49,890 --> 00:42:51,490 yeah. 941 00:42:51,490 --> 00:42:54,100 I think my color coding is a little bogus. 942 00:42:54,100 --> 00:42:55,860 That's good. 
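As a rough sketch of that send, wait, work structure in C, here is what the steady-state loop could look like. The dma_get and dma_wait helpers, CHUNK, chunk_addr, and process_data are placeholders standing in for the real Cell DMA calls and the application code, not the actual SDK API:

/* Double buffering sketch: prefetch the next chunk into one buffer while
   computing on the chunk that has already arrived in the other buffer. */
float buf[2][CHUNK];                 /* two local buffers                 */
int id = 0;                          /* which buffer holds current data   */

dma_get(buf[0], chunk_addr(0), CHUNK);                   /* prime buffer 0 */
for (int i = 0; i < num_chunks; i++) {
    if (i + 1 < num_chunks)
        dma_get(buf[id ^ 1], chunk_addr(i + 1), CHUNK);  /* request next chunk */
    dma_wait(buf[id]);               /* block until chunk i has arrived   */
    process_data(buf[id], CHUNK);    /* do the work for this iteration    */
    id ^= 1;                         /* flip to the other buffer          */
}

The next slides walk through essentially this pattern using the Cell DMA interface.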
943 00:42:55,860 --> 00:42:58,360 So here's an example of how you might do this kind of 944 00:42:58,360 --> 00:43:01,700 buffer pipelining in Cell. 945 00:43:01,700 --> 00:43:05,710 So I have some main loop that's going to do some work, 946 00:43:05,710 --> 00:43:07,670 that's encapsulating this process data. 947 00:43:07,670 --> 00:43:09,780 And what I'm going to use is two buffers. 948 00:43:09,780 --> 00:43:12,750 So the scheme is also called double buffering. 949 00:43:12,750 --> 00:43:15,200 I'm going to use this ID to represent which buffer I'm 950 00:43:15,200 --> 00:43:15,700 going to use. 951 00:43:15,700 --> 00:43:18,910 So it's either buffer zero or buffer one. 952 00:43:18,910 --> 00:43:21,230 And this instruction here essentially flips the bit. 953 00:43:21,230 --> 00:43:23,700 So it's either zero or one. 954 00:43:23,700 --> 00:43:27,620 So I fetch data into buffer zero and then I enter my loop. 955 00:43:27,620 --> 00:43:30,680 So this is essentially the first send, which is trying to 956 00:43:30,680 --> 00:43:33,330 get me one iteration ahead. 957 00:43:33,330 --> 00:43:37,760 So I enter this main loop and I do some calculation to 958 00:43:37,760 --> 00:43:40,380 figure out where to write the next data. 959 00:43:40,380 --> 00:43:43,735 And then I do another request for the next data item that 960 00:43:43,735 --> 00:43:47,800 I'm going to -- sorry, there's an m missing here -- 961 00:43:47,800 --> 00:43:50,900 I'm going to fetch data into a different buffer, right. 962 00:43:50,900 --> 00:43:54,360 This is ID where I've already flipped the bit once. 963 00:43:54,360 --> 00:43:58,160 So this get is going to write data into buffer zero. 964 00:43:58,160 --> 00:44:01,730 And this get is going to write data into buffer one. 965 00:44:01,730 --> 00:44:02,770 I flip the bit again. 966 00:44:02,770 --> 00:44:08,720 So now I'm going to issue a wait instruction that says is 967 00:44:08,720 --> 00:44:10,380 the data from buffer zero ready? 968 00:44:10,380 --> 00:44:13,260 And if it is then I can go on and actually do my work. 969 00:44:13,260 --> 00:44:15,590 Does that make sense? 970 00:44:15,590 --> 00:44:16,150 People are confused? 971 00:44:16,150 --> 00:44:17,400 Should I go over it again? 972 00:44:19,772 --> 00:44:21,720 AUDIENCE: [INAUDIBLE] 973 00:44:21,720 --> 00:44:24,260 PROFESSOR: So this is an [? x or. ?] 974 00:44:24,260 --> 00:44:27,710 So I could have just said buffer equals zero or buffer 975 00:44:27,710 --> 00:44:28,960 equals one. 976 00:44:33,220 --> 00:44:34,410 Oh, sorry. 977 00:44:34,410 --> 00:44:34,620 This is one. 978 00:44:34,620 --> 00:44:34,690 Yeah. 979 00:44:34,690 --> 00:44:36,980 Yeah. 980 00:44:36,980 --> 00:44:40,950 So this is a one here. 981 00:44:40,950 --> 00:44:41,440 Last-minute editing. 982 00:44:41,440 --> 00:44:43,430 It's right there. 983 00:44:43,430 --> 00:44:44,672 Did that confuse you? 984 00:44:44,672 --> 00:44:45,195 AUDIENCE: No. 985 00:44:45,195 --> 00:44:47,810 But, like, I don't see [INAUDIBLE] 986 00:44:47,810 --> 00:44:48,140 PROFESSOR: Oh. 987 00:44:48,140 --> 00:44:48,410 OK. 988 00:44:48,410 --> 00:44:50,190 So I'll go over it again. 989 00:44:50,190 --> 00:44:53,600 So this get here is going to write into ID zero. 990 00:44:53,600 --> 00:44:56,500 So that's buffer zero. 991 00:44:56,500 --> 00:44:57,930 And then I'm going to change the ID. 992 00:44:57,930 --> 00:44:59,400 So imagine there's a one here.
993 00:44:59,400 --> 00:45:04,190 So now the next time I use ID, which is here, I'm trying to 994 00:45:04,190 --> 00:45:04,950 get the data. 995 00:45:04,950 --> 00:45:07,920 And I'm going to write it to buffer one. 996 00:45:07,920 --> 00:45:11,270 The DMA on the Cell processor essentially says I can send 997 00:45:11,270 --> 00:45:14,450 this request off and I can check later to see when that 998 00:45:14,450 --> 00:45:15,710 data is available. 999 00:45:15,710 --> 00:45:17,940 But that data is going to go into a different buffer, 1000 00:45:17,940 --> 00:45:19,420 essentially B1. 1001 00:45:19,420 --> 00:45:22,450 Whereas I'm going to work on buffer zero. 1002 00:45:22,450 --> 00:45:25,920 Because I changed the ID back here. 1003 00:45:25,920 --> 00:45:27,610 Now you get it? 1004 00:45:27,610 --> 00:45:30,790 So I fetch data into buffer zero initially 1005 00:45:30,790 --> 00:45:31,540 before I start to loop. 1006 00:45:31,540 --> 00:45:34,110 And then I start working. 1007 00:45:34,110 --> 00:45:36,380 I probably should have had an animation in here. 1008 00:45:36,380 --> 00:45:39,430 So then you go into your main loop. 1009 00:45:39,430 --> 00:45:42,880 You try to start fetching into buffer one and then you try to 1010 00:45:42,880 --> 00:45:44,280 compute out of buffer zero. 1011 00:45:44,280 --> 00:45:46,130 But before you can start computing out of buffer zero, 1012 00:45:46,130 --> 00:45:48,250 you just have to make sure that your data is there. 1013 00:45:48,250 --> 00:45:52,790 And so that's what the synchronization is doing here. 1014 00:45:52,790 --> 00:45:55,180 Hope that was clear. 1015 00:45:55,180 --> 00:45:58,710 OK, so this kind of computation and communication 1016 00:45:58,710 --> 00:46:01,680 overlap really helps in hiding the latency. 1017 00:46:01,680 --> 00:46:04,990 And it can be really useful in terms of improving 1018 00:46:04,990 --> 00:46:06,240 performance. 1019 00:46:09,720 --> 00:46:13,450 And there are different kinds of communication patterns. 1020 00:46:13,450 --> 00:46:14,720 So there's point to point. 1021 00:46:14,720 --> 00:46:18,080 And you can use these both for data communication or control 1022 00:46:18,080 --> 00:46:19,060 communication. 1023 00:46:19,060 --> 00:46:20,880 And it just means that, you know, one processor can 1024 00:46:20,880 --> 00:46:23,580 explicitly send a message to another processor. 1025 00:46:23,580 --> 00:46:26,345 There's also broadcast that says, hey, I have some data 1026 00:46:26,345 --> 00:46:28,603 that everybody's interested in, so I can just broadcast it 1027 00:46:28,603 --> 00:46:31,000 to everybody on the network. 1028 00:46:31,000 --> 00:46:32,930 Or a reduce, which is the opposite. 1029 00:46:32,930 --> 00:46:36,350 It says everybody on the network has data that I need 1030 00:46:36,350 --> 00:46:39,840 to compute, so everybody send me their data. 1031 00:46:39,840 --> 00:46:42,790 There's an all to all, which says all processors should 1032 00:46:42,790 --> 00:46:46,410 just do a global exchange of data that they have. And then 1033 00:46:46,410 --> 00:46:48,100 there's a scatter and a gather. 1034 00:46:48,100 --> 00:46:50,810 So a scatter and a gather are really different types of 1035 00:46:50,810 --> 00:46:54,770 broadcast. So it's one to several or one to many. 1036 00:46:54,770 --> 00:46:57,340 And gather, which is many to one.
1037 00:46:57,340 --> 00:47:00,370 So this is useful when you're doing a computation that 1038 00:47:00,370 --> 00:47:04,670 really is trying to pull data in together but only from a 1039 00:47:04,670 --> 00:47:06,260 subset of all processors. 1040 00:47:06,260 --> 00:47:12,250 So it depends on how you've partitioned your problems. 1041 00:47:12,250 --> 00:47:16,430 So there's a well-known sort of message passing library 1042 00:47:16,430 --> 00:47:24,950 specification called MPI that tries to specify all of these 1043 00:47:24,950 --> 00:47:29,120 different communications in order to sort of facilitate 1044 00:47:29,120 --> 00:47:30,310 parallel programming. 1045 00:47:30,310 --> 00:47:34,590 It's full featured; it actually has more types of communications 1046 00:47:34,590 --> 00:47:36,660 and more kinds of functionality than I showed on 1047 00:47:36,660 --> 00:47:38,870 the previous slides. 1048 00:47:38,870 --> 00:47:41,360 But it's not a language or a compiler specification. 1049 00:47:41,360 --> 00:47:43,470 It's really just a library that you can implement in 1050 00:47:43,470 --> 00:47:45,750 various ways on different architectures. 1051 00:47:45,750 --> 00:47:50,720 Again, it's same program, multiple data -- it supports the 1052 00:47:50,720 --> 00:47:53,080 SPMD model. 1053 00:47:53,080 --> 00:47:55,990 And it works reasonably well for parallel architectures: for 1054 00:47:55,990 --> 00:47:59,760 clusters, heterogeneous multicores, homogeneous 1055 00:47:59,760 --> 00:48:01,830 multicores. 1056 00:48:01,830 --> 00:48:05,240 Because really all it's doing is just abstracting out -- 1057 00:48:05,240 --> 00:48:08,130 it's giving you a mechanism to abstract out all the 1058 00:48:08,130 --> 00:48:11,540 communication that you would need in your computation. 1059 00:48:11,540 --> 00:48:15,280 So you can have additional things like precise buffer 1060 00:48:15,280 --> 00:48:16,370 management. 1061 00:48:16,370 --> 00:48:19,250 You can have some collective operations. 1062 00:48:19,250 --> 00:48:22,985 I'll show an example of doing things in a 1063 00:48:22,985 --> 00:48:27,370 scalable manner when a lot of things need to communicate 1064 00:48:27,370 --> 00:48:29,140 with each other. 1065 00:48:29,140 --> 00:48:32,840 So just a brief history of where MPI came from. 1066 00:48:32,840 --> 00:48:35,270 And, you know, very early when, you know, parallel 1067 00:48:35,270 --> 00:48:38,260 computers started becoming more and more widespread and 1068 00:48:38,260 --> 00:48:40,720 there were these networks and people had problems porting 1069 00:48:40,720 --> 00:48:43,840 their applications or writing applications for these 1070 00:48:43,840 --> 00:48:45,860 [? came, ?] just because it was difficult, as you might be 1071 00:48:45,860 --> 00:48:48,200 finding in terms of programming things with the 1072 00:48:48,200 --> 00:48:50,540 Cell processor. 1073 00:48:50,540 --> 00:48:52,800 You know, there needed to be ways to sort of address the 1074 00:48:52,800 --> 00:48:54,860 spectrum of communication. 1075 00:48:54,860 --> 00:48:58,330 And it often helps to have a standard because if everybody 1076 00:48:58,330 --> 00:49:01,680 implements the same standard specification, that allows 1077 00:49:01,680 --> 00:49:03,370 your code to be ported around from one 1078 00:49:03,370 --> 00:49:04,710 architecture to the other. 1079 00:49:04,710 --> 00:49:08,090 And so MPI came around. 1080 00:49:08,090 --> 00:49:11,130 The forum was organized in 1992.
1081 00:49:11,130 --> 00:49:13,980 And that had a lot of people participating in it from 1082 00:49:13,980 --> 00:49:17,000 vendors, you know, people like IBM, a company like IBM, 1083 00:49:17,000 --> 00:49:23,030 Intel, and people who had expertise in writing 1084 00:49:23,030 --> 00:49:27,550 libraries, users who were interested in using these 1085 00:49:27,550 --> 00:49:31,910 kinds of specifications to do their computations, so 1086 00:49:31,910 --> 00:49:36,010 scientific people who were in the scientific domain. 1087 00:49:36,010 --> 00:49:38,050 And it was finished in about 18 months. 1088 00:49:38,050 --> 00:49:40,582 I don't know if that's a reasonably long time or a 1089 00:49:40,582 --> 00:49:40,950 short time. 1090 00:49:40,950 --> 00:49:44,340 But considering, you know, I think the MPEG-4 standard took 1091 00:49:44,340 --> 00:49:48,170 a bit longer to do, as a comparison point. 1092 00:49:48,170 --> 00:49:49,880 I don't have the actual data. 1093 00:49:49,880 --> 00:49:53,270 So point-to-point communication -- 1094 00:49:53,270 --> 00:49:56,590 and again, a reminder, this is how you would do it on Cell. 1095 00:49:56,590 --> 00:50:00,300 These are SPE sends and receives. 1096 00:50:00,300 --> 00:50:02,490 You have one processor that's sending 1097 00:50:02,490 --> 00:50:04,660 it to another processor. 1098 00:50:04,660 --> 00:50:06,260 Or you have some network in between. 1099 00:50:06,260 --> 00:50:08,430 And processor A can essentially send the data 1100 00:50:08,430 --> 00:50:11,910 explicitly to processor two. 1101 00:50:11,910 --> 00:50:14,880 And the message in this case would include how the data is 1102 00:50:14,880 --> 00:50:17,240 packaged, some other information such as the length 1103 00:50:17,240 --> 00:50:20,570 of the data, destination, possibly some tag so you can 1104 00:50:20,570 --> 00:50:22,650 identify the actual communication. 1105 00:50:22,650 --> 00:50:26,560 And, you know, there's an actual mapping for the actual 1106 00:50:26,560 --> 00:50:29,760 functions on Cell. 1107 00:50:29,760 --> 00:50:32,950 And there's a get for the send and a put for the receive. 1108 00:50:39,230 --> 00:50:41,760 So there's a question of, well, how do I know if my data 1109 00:50:41,760 --> 00:50:42,880 actually got sent? 1110 00:50:42,880 --> 00:50:45,410 How do I know if it was received? 1111 00:50:45,410 --> 00:50:49,190 And there's, you know, you can think of a synchronous send 1112 00:50:49,190 --> 00:50:52,200 and a synchronous receive, or asynchronous communication. 1113 00:50:52,200 --> 00:50:53,940 So in the synchronous communication, you actually 1114 00:50:53,940 --> 00:50:55,390 wait for notification. 1115 00:50:55,390 --> 00:50:57,250 So this is kind of like your fax machine. 1116 00:50:57,250 --> 00:50:58,900 You put something into your fax. 1117 00:50:58,900 --> 00:51:01,190 It goes out and you eventually get a beep that says your 1118 00:51:01,190 --> 00:51:02,260 transmission was OK. 1119 00:51:02,260 --> 00:51:04,900 Or if it wasn't OK then, you know, you get a message that 1120 00:51:04,900 --> 00:51:06,630 says, you know, something went wrong. 1121 00:51:06,630 --> 00:51:09,520 And you can redo your communication. 1122 00:51:09,520 --> 00:51:11,730 An asynchronous send is kind of like your -- 1123 00:51:11,730 --> 00:51:13,055 AUDIENCE: Most [UNINTELLIGIBLE] you could get 1124 00:51:13,055 --> 00:51:13,320 a reply too. 1125 00:51:13,320 --> 00:51:15,340 PROFESSOR: Yeah, you can get a reply. 1126 00:51:15,340 --> 00:51:16,420 Thanks. 
1127 00:51:16,420 --> 00:51:18,360 An asynchronous send, it's like you write a letter, you 1128 00:51:18,360 --> 00:51:20,680 go put it in the mailbox, and you don't know whether it 1129 00:51:20,680 --> 00:51:26,425 actually made it into the actual postman's bag and it 1130 00:51:26,425 --> 00:51:29,660 was delivered to your destination or if it was 1131 00:51:29,660 --> 00:51:30,940 actually delivered. 1132 00:51:30,940 --> 00:51:34,200 So you only know that the message was sent. 1133 00:51:34,200 --> 00:51:36,250 You know, you put it in the mailbox. 1134 00:51:36,250 --> 00:51:37,940 But you don't know anything else about what happened to 1135 00:51:37,940 --> 00:51:41,100 the message along the way. 1136 00:51:41,100 --> 00:51:43,530 There's also the concept of a blocking versus a 1137 00:51:43,530 --> 00:51:44,940 non-blocking message. 1138 00:51:44,940 --> 00:51:48,110 So this is orthogonal really to synchronous versus 1139 00:51:48,110 --> 00:51:49,930 asynchronous. 1140 00:51:49,930 --> 00:51:54,940 So in blocking messages, a sender waits until there's 1141 00:51:54,940 --> 00:51:58,440 some signal that says the message has been transmitted. 1142 00:51:58,440 --> 00:52:03,180 So this is, for example if I'm writing data into a buffer, 1143 00:52:03,180 --> 00:52:05,350 and the buffer essentially gets transmitted to somebody 1144 00:52:05,350 --> 00:52:10,590 else, we wait until the buffer is empty. 1145 00:52:10,590 --> 00:52:13,070 And what that means is that somebody has read it on the 1146 00:52:13,070 --> 00:52:15,350 other end or somebody has drained that buffer from 1147 00:52:15,350 --> 00:52:16,670 somewhere else. 1148 00:52:16,670 --> 00:52:19,820 The receiver, if he's waiting on data, well, he just waits. 1149 00:52:19,820 --> 00:52:21,850 He essentially blocks until somebody has put 1150 00:52:21,850 --> 00:52:24,110 data into the buffer. 1151 00:52:24,110 --> 00:52:26,470 And you can get into potential deadlock situations. 1152 00:52:26,470 --> 00:52:30,100 So you saw deadlock with locks in the concurrency talk. 1153 00:52:30,100 --> 00:52:30,920 I'm going to show you a different 1154 00:52:30,920 --> 00:52:32,590 kind of deadlock example. 1155 00:52:35,200 --> 00:52:40,630 An example of a blocking send on Cell -- 1156 00:52:40,630 --> 00:52:43,250 allows you to use mailboxes. 1157 00:52:43,250 --> 00:52:47,260 Or you can sort of use mailboxes for that. 1158 00:52:47,260 --> 00:52:50,050 Mailboxes again are just for communicating short messages, 1159 00:52:50,050 --> 00:52:53,910 really, not necessarily for communicating data messages. 1160 00:52:53,910 --> 00:52:58,340 So an SPE does some work, and then it writes out a message, 1161 00:52:58,340 --> 00:53:02,750 in this case to notify the PPU that, let's say, it's done. 1162 00:53:02,750 --> 00:53:04,660 And then it goes on and does more work. 1163 00:53:04,660 --> 00:53:08,230 And then it wants to notify the PPU of something else. 1164 00:53:08,230 --> 00:53:12,190 So in this case this particular send will block 1165 00:53:12,190 --> 00:53:14,930 because, let's say, the PPU hasn't drained its mailbox. 1166 00:53:14,930 --> 00:53:16,220 It hasn't read the mailbox. 1167 00:53:16,220 --> 00:53:20,960 So you essentially stop and wait until the PPU has, you 1168 00:53:20,960 --> 00:53:21,810 know, caught up. 1169 00:53:21,810 --> 00:53:26,860 AUDIENCE: So all mailbox sends are blocking? 1170 00:53:26,860 --> 00:53:27,990 PROFESSOR: Yes. 1171 00:53:27,990 --> 00:53:29,240 David says yes. 
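As a rough illustration of that blocking behavior, the SPE side might look like the sketch below. The mbox_write helper and the message values are placeholders, not the actual Cell SDK calls:

/* Blocking notification sketch: the write returns only when there is room
   in the outbound mailbox, i.e. the PPU has read the earlier message. */
void spe_worker(void)
{
    do_work_phase_one();
    mbox_write(DONE_PHASE_ONE);   /* returns once the mailbox has room       */
    do_work_phase_two();
    mbox_write(DONE_PHASE_TWO);   /* may block until the PPU drains the box  */
}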
1172 00:53:36,680 --> 00:53:39,730 A non-blocking send is something that essentially 1173 00:53:39,730 --> 00:53:44,400 allows you to send a message out and just continue on. 1174 00:53:44,400 --> 00:53:48,650 You don't care exactly about what's happened to the message 1175 00:53:48,650 --> 00:53:51,910 or what's going on with the receiver. 1176 00:53:51,910 --> 00:53:54,530 So you write the data into the buffer and you 1177 00:53:54,530 --> 00:53:56,040 just continue executing. 1178 00:53:56,040 --> 00:53:58,740 And this really helps you in terms of avoiding idle times 1179 00:53:58,740 --> 00:54:00,560 and deadlocks, but it might not always be the 1180 00:54:00,560 --> 00:54:01,840 thing that you want. 1181 00:54:01,840 --> 00:54:05,970 So an example of sort of a non-blocking send and wait on 1182 00:54:05,970 --> 00:54:09,170 Cell is using the DMAs to ship data out. 1183 00:54:09,170 --> 00:54:11,580 You know, you can put something, put in a request to 1184 00:54:11,580 --> 00:54:13,910 send data out on the DMA. 1185 00:54:13,910 --> 00:54:19,190 And you could wait for it if you want in terms of reading 1186 00:54:19,190 --> 00:54:22,680 the status bits to make sure it's completed. 1187 00:54:22,680 --> 00:54:27,530 OK, so what is a source of deadlock in the blocking case? 1188 00:54:27,530 --> 00:54:30,130 And it really comes about if you don't really have enough 1189 00:54:30,130 --> 00:54:33,140 buffering in your communication network. 1190 00:54:33,140 --> 00:54:36,300 And often you can resolve that by having additional storage. 1191 00:54:36,300 --> 00:54:38,700 So let's say I have processor one and processor two and 1192 00:54:38,700 --> 00:54:41,000 they're trying to send messages to each other. 1193 00:54:41,000 --> 00:54:43,710 So processor one sends a message at the same time 1194 00:54:43,710 --> 00:54:45,170 processor two sends a message. 1195 00:54:45,170 --> 00:54:46,990 And these are going to go, let's say, 1196 00:54:46,990 --> 00:54:49,200 into the same buffer. 1197 00:54:49,200 --> 00:54:53,815 Well, neither can make progress because somebody has 1198 00:54:53,815 --> 00:54:55,930 to essentially drain that buffer before these receives 1199 00:54:55,930 --> 00:54:57,180 can execute. 1200 00:55:00,350 --> 00:55:02,990 So what happens with that code is it really depends on how 1201 00:55:02,990 --> 00:55:04,830 much buffering you have between the two. 1202 00:55:04,830 --> 00:55:06,370 If you have a lot of buffering, then you may never 1203 00:55:06,370 --> 00:55:08,000 see the deadlock. 1204 00:55:08,000 --> 00:55:13,930 But if you have a really tiny buffer, then you do a send. 1205 00:55:13,930 --> 00:55:18,180 The other person can't do the send because the buffer hasn't 1206 00:55:18,180 --> 00:55:18,970 been drained. 1207 00:55:18,970 --> 00:55:21,220 And so you end up with a deadlock. 1208 00:55:21,220 --> 00:55:23,600 And so a potential solution is, well, you actually 1209 00:55:23,600 --> 00:55:24,620 increase your buffer length. 1210 00:55:24,620 --> 00:55:26,170 But that doesn't always work because you can 1211 00:55:26,170 --> 00:55:27,480 still get into trouble. 1212 00:55:27,480 --> 00:55:30,040 So what you might need to do is essentially be more 1213 00:55:30,040 --> 00:55:33,990 diligent about how you order your sends and receives. 1214 00:55:33,990 --> 00:55:36,740 So if you have processor one doing a send, make sure it's 1215 00:55:36,740 --> 00:55:39,400 matched up with a receive on the other end. 
1216 00:55:39,400 --> 00:55:42,050 And similarly, if you're doing a receive here, make sure 1217 00:55:42,050 --> 00:55:44,100 there's sort of a matching send on the other end. 1218 00:55:44,100 --> 00:55:46,870 And that helps you in sort of making sure that things are 1219 00:55:46,870 --> 00:55:51,160 operating reasonably in lock step at, you know, partially 1220 00:55:51,160 --> 00:55:52,410 ordered times. 1221 00:55:59,750 --> 00:56:03,600 That was really examples of point-to-point communication. 1222 00:56:03,600 --> 00:56:05,990 A broadcast mechanism is slightly different. 1223 00:56:05,990 --> 00:56:09,190 It says, I have data that I want to send to everybody. 1224 00:56:09,190 --> 00:56:11,750 It could be really efficient for sending short control 1225 00:56:11,750 --> 00:56:15,570 messages, maybe even efficient for sending data messages. 1226 00:56:15,570 --> 00:56:18,950 So as an example, if you remember our calculation of 1227 00:56:18,950 --> 00:56:22,380 distances between all points, the parallelization strategy 1228 00:56:22,380 --> 00:56:24,900 said, well, I'm going to send one copy of 1229 00:56:24,900 --> 00:56:28,370 the array A to everybody. 1230 00:56:28,370 --> 00:56:29,820 In the two processor case that was easy. 1231 00:56:29,820 --> 00:56:32,730 But if I have n processors, then rather than sending 1232 00:56:32,730 --> 00:56:35,570 point-to-point communication from A to everybody else, what 1233 00:56:35,570 --> 00:56:38,300 I could do is just, say, broadcast A to everybody and 1234 00:56:38,300 --> 00:56:41,420 they can grab it off the network. 1235 00:56:41,420 --> 00:56:45,210 So in MPI there's this function, MPI 1236 00:56:45,210 --> 00:56:46,560 broadcast, that does that. 1237 00:56:46,560 --> 00:56:51,700 I'm using sort of generic abstract sends, receives and 1238 00:56:51,700 --> 00:56:53,350 broadcasts in my examples. 1239 00:56:53,350 --> 00:56:55,600 So you can broadcast A to everybody. 1240 00:56:55,600 --> 00:56:59,550 And then if I have n processors, then what I might 1241 00:56:59,550 --> 00:57:03,310 do is distribute the m's in a round robin manner to each of 1242 00:57:03,310 --> 00:57:04,230 the different processes. 1243 00:57:04,230 --> 00:57:05,710 So you pointed this out. 1244 00:57:05,710 --> 00:57:07,180 I don't have to send B to everybody. 1245 00:57:07,180 --> 00:57:09,390 I can just send, you know, in this case, 1246 00:57:09,390 --> 00:57:10,420 one particular element. 1247 00:57:10,420 --> 00:57:12,840 Is that clear? 1248 00:57:16,838 --> 00:57:18,670 AUDIENCE: There's no broadcast on Cell? 1249 00:57:18,670 --> 00:57:21,210 PROFESSOR: There is no broadcast on Cell. 1250 00:57:21,210 --> 00:57:25,680 There is no mechanism for reduction either. 1251 00:57:25,680 --> 00:57:30,340 And you can't quite do scatters and gathers. 1252 00:57:30,340 --> 00:57:32,430 I don't think. 1253 00:57:32,430 --> 00:57:35,380 OK, so an example of a reduction, you know, I said 1254 00:57:35,380 --> 00:57:37,650 it's the opposite of a broadcast. Everybody has data 1255 00:57:37,650 --> 00:57:39,860 that needs to essentially get to the same point. 1256 00:57:39,860 --> 00:57:45,610 So as an example, if everybody in this room had a value, 1257 00:57:45,610 --> 00:57:47,770 including myself, and I wanted to know what is the collective 1258 00:57:47,770 --> 00:57:49,670 value of everybody in the room, you all have to 1259 00:57:49,670 --> 00:57:50,840 send me your data. 
1260 00:57:50,840 --> 00:57:53,800 Now, this is important because if -- you know, in this case 1261 00:57:53,800 --> 00:57:54,700 we're doing an addition. 1262 00:57:54,700 --> 00:57:56,290 It's an associative operation. 1263 00:57:56,290 --> 00:57:58,050 So what we can do is we can be smart about 1264 00:57:58,050 --> 00:57:59,350 how the data is sent. 1265 00:57:59,350 --> 00:58:02,160 So, you know, guys that are close together can essentially 1266 00:58:02,160 --> 00:58:03,420 add up their numbers and forward me. 1267 00:58:03,420 --> 00:58:05,120 So instead of getting n messages I 1268 00:58:05,120 --> 00:58:06,520 can get log n messages. 1269 00:58:06,520 --> 00:58:08,450 And so if every pair of you added your numbers and 1270 00:58:08,450 --> 00:58:11,040 forwarded me that, that cuts down communication by half. 1271 00:58:11,040 --> 00:58:13,640 And so you can, you know -- starting from the back of 1272 00:58:13,640 --> 00:58:16,420 room, by the time you get to me, I only get two messages 1273 00:58:16,420 --> 00:58:18,800 instead of n messages. 1274 00:58:18,800 --> 00:58:21,680 So a reduction combines data from all processors. 1275 00:58:21,680 --> 00:58:24,030 In MPI, you know, there's this function MPI 1276 00:58:24,030 --> 00:58:26,300 reduce for doing that. 1277 00:58:26,300 --> 00:58:29,180 And the collective operations are things that are 1278 00:58:29,180 --> 00:58:29,920 associative. 1279 00:58:29,920 --> 00:58:32,730 And subtract -- 1280 00:58:32,730 --> 00:58:33,660 sorry. 1281 00:58:33,660 --> 00:58:39,500 And or and -- you can read them on the slide. 1282 00:58:39,500 --> 00:58:42,540 There is a semantic caveat here that no processor can 1283 00:58:42,540 --> 00:58:45,760 finish the reduction before all processors have at least 1284 00:58:45,760 --> 00:58:49,730 sent it one data or have contributed, rather, a 1285 00:58:49,730 --> 00:58:51,790 particular value. 1286 00:58:51,790 --> 00:58:54,740 So in many numerical algorithms, you can actually 1287 00:58:54,740 --> 00:59:00,200 use the broadcast and send to broadcast and reduce in place 1288 00:59:00,200 --> 00:59:04,430 of sends and receives because it really improves the 1289 00:59:04,430 --> 00:59:06,970 simplicity of your computation. 1290 00:59:06,970 --> 00:59:09,260 You don't have to do n sends to communicate there. 1291 00:59:09,260 --> 00:59:11,350 You can just broadcast. It gives you a mechanism for 1292 00:59:11,350 --> 00:59:13,970 essentially having a shared memory abstraction on 1293 00:59:13,970 --> 00:59:16,150 distributed memory architecture. 1294 00:59:16,150 --> 00:59:18,392 There are things like all to all communication which would 1295 00:59:18,392 --> 00:59:19,730 also help you in that sense. 1296 00:59:19,730 --> 00:59:24,960 Although I don't talk about all to all communication here. 1297 00:59:24,960 --> 00:59:27,360 So I'm going to show you an example of sort of a more 1298 00:59:27,360 --> 00:59:29,810 detailed MPI. 1299 00:59:29,810 --> 00:59:32,690 But I also want to contrast this to the OpenMP programming 1300 00:59:32,690 --> 00:59:36,000 on shared memory processors because one might look simpler 1301 00:59:36,000 --> 00:59:38,470 than the other. 1302 00:59:38,470 --> 00:59:40,540 So suppose that you have a numerical integration method 1303 00:59:40,540 --> 00:59:44,990 that essentially you're going to use to calculate pi. 
1304 00:59:44,990 --> 00:59:48,200 So as you get finer and finer, you can get more accurate -- 1305 00:59:48,200 --> 00:59:50,030 as you shrink these intervals you can get 1306 00:59:50,030 --> 00:59:53,900 better values for pi. 1307 00:59:53,900 --> 00:59:58,690 And the code for doing that is some C code. 1308 00:59:58,690 --> 01:00:00,270 You have some variables. 1309 01:00:00,270 --> 01:00:02,800 And then you have a step that essentially tells you how many 1310 01:00:02,800 --> 01:00:04,730 times you're going to do this computation. 1311 01:00:04,730 --> 01:00:07,440 And for each time step you calculate this particular 1312 01:00:07,440 --> 01:00:08,540 function here. 1313 01:00:08,540 --> 01:00:11,040 And you add it all up and in the end you can sort of print 1314 01:00:11,040 --> 01:00:13,640 out what is the value of pi that you calculated. 1315 01:00:13,640 --> 01:00:16,600 So clearly as, you know, as you shrink your intervals, you 1316 01:00:16,600 --> 01:00:20,340 can get more and more accurate measures of pi. 1317 01:00:20,340 --> 01:00:22,740 So that translates to increasing the number of steps 1318 01:00:22,740 --> 01:00:29,160 in that particular C code. 1319 01:00:29,160 --> 01:00:33,700 So you can use that numerical integration to calculate pi 1320 01:00:33,700 --> 01:00:34,900 with OpenMP. 1321 01:00:34,900 --> 01:00:37,356 And what that translates to is -- sorry, there should have 1322 01:00:37,356 --> 01:00:41,370 been an animation here to ask you what I should add in. 1323 01:00:41,370 --> 01:00:43,220 You have this particular loop. 1324 01:00:43,220 --> 01:00:46,330 And this is computation that you want to parallelize. 1325 01:00:46,330 --> 01:00:48,890 And there is really four questions that you essentially 1326 01:00:48,890 --> 01:00:50,540 have to go through. 1327 01:00:50,540 --> 01:00:52,220 Are there variables that are shared? 1328 01:00:52,220 --> 01:00:54,420 Because you have to get the process right. 1329 01:00:54,420 --> 01:00:56,200 If there are variables that are shared, you have to 1330 01:00:56,200 --> 01:01:01,830 explicitly synchronize them and use locks to protect them. 1331 01:01:01,830 --> 01:01:02,880 What values are private? 1332 01:01:02,880 --> 01:01:08,690 So in OpenMP, things that are private are data on the stack, 1333 01:01:08,690 --> 01:01:12,000 things that are defined lexically within the scope of 1334 01:01:12,000 --> 01:01:16,030 the computation that you encapsulate by an OpenMP 1335 01:01:16,030 --> 01:01:18,900 pragma, and what variables you might want 1336 01:01:18,900 --> 01:01:20,340 to use for a reduction. 1337 01:01:20,340 --> 01:01:23,490 So in this case I'm doing a summation, and this is the 1338 01:01:23,490 --> 01:01:25,400 computation that I can parallelize. 1339 01:01:25,400 --> 01:01:28,770 Then I essentially want to do a reduction for the plus 1340 01:01:28,770 --> 01:01:32,980 operator since I'm doing an addition on this variable. 1341 01:01:32,980 --> 01:01:34,760 This loop here is parallel. 1342 01:01:34,760 --> 01:01:36,040 It's data parallel. 1343 01:01:36,040 --> 01:01:38,900 I can split it up. 1344 01:01:38,900 --> 01:01:41,190 The for loop is also -- 1345 01:01:41,190 --> 01:01:43,810 I can do this work sharing on it. 1346 01:01:43,810 --> 01:01:45,960 So I use the parallel for pragma. 1347 01:01:45,960 --> 01:01:49,360 And the variable x here is private. 1348 01:01:49,360 --> 01:01:52,010 It's defined here but I can essentially give a directive 1349 01:01:52,010 --> 01:01:53,530 that says, this is private. 
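Putting those answers together, a minimal OpenMP version of the pi loop consistent with that description might look like the following sketch (compile with the compiler's OpenMP flag; this is not the exact slide code):

#include <stdio.h>

#define NUM_STEPS 100000

int main(void)
{
    double step = 1.0 / (double) NUM_STEPS;
    double x, sum = 0.0;
    int i;

    /* The loop is data parallel: iterations are shared among threads,
       x is private to each thread, and sum is combined with a + reduction. */
    #pragma omp parallel for private(x) reduction(+:sum)
    for (i = 0; i < NUM_STEPS; i++) {
        x = (i + 0.5) * step;           /* midpoint of the i-th interval   */
        sum += 4.0 / (1.0 + x * x);     /* integrand evaluated at midpoint */
    }

    printf("pi is approximately %.8f\n", step * sum);
    return 0;
}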
1350 01:01:53,530 --> 01:01:56,290 You can essentially rename it on each processor. 1351 01:01:56,290 --> 01:01:58,890 Its value won't have any effect on the overall 1352 01:01:58,890 --> 01:02:01,270 computation because each computation will have its own 1353 01:02:01,270 --> 01:02:03,910 local copy. 1354 01:02:03,910 --> 01:02:06,360 That clear so far? 1355 01:02:06,360 --> 01:02:09,950 So computing pi with integration using MPI takes up 1356 01:02:09,950 --> 01:02:12,110 two slides. 1357 01:02:12,110 --> 01:02:13,980 You know, I could fit it on one slide but you couldn't see 1358 01:02:13,980 --> 01:02:15,170 it in the back. 1359 01:02:15,170 --> 01:02:16,990 So there's some initialization. 1360 01:02:16,990 --> 01:02:20,130 In fact, I think there's only six basic MPI commands that 1361 01:02:20,130 --> 01:02:22,120 you need for computing. 1362 01:02:22,120 --> 01:02:26,030 Three of them are here and you'll see the others are MPI 1363 01:02:26,030 --> 01:02:27,680 send and MPI receive. 1364 01:02:27,680 --> 01:02:31,550 And there's one more that you'll see on the next slide. 1365 01:02:31,550 --> 01:02:33,170 So there's some loop that says while I'm 1366 01:02:33,170 --> 01:02:36,790 not done keep computing. 1367 01:02:36,790 --> 01:02:39,120 And what you do is you broadcast n to all the 1368 01:02:39,120 --> 01:02:39,990 different processors. 1369 01:02:39,990 --> 01:02:42,900 N is really your time step. 1370 01:02:42,900 --> 01:02:47,660 How many small intervals of execution are you going to do? 1371 01:02:47,660 --> 01:02:49,930 And you can go through, do your computation. 1372 01:02:49,930 --> 01:02:52,510 So now this -- the MPI essentially encapsulates the 1373 01:02:52,510 --> 01:02:55,290 computation over n processors. 1374 01:02:55,290 --> 01:02:58,250 And then you get to an MPI reduce command at some point 1375 01:02:58,250 --> 01:03:01,810 that says, OK, what values did everybody compute? 1376 01:03:01,810 --> 01:03:03,430 Do the reduction on that. 1377 01:03:03,430 --> 01:03:06,410 Write that value into my MPI. 1378 01:03:06,410 --> 01:03:10,700 Now what happens here is there's processor ID zero 1379 01:03:10,700 --> 01:03:12,140 which I'm going to consider the master. 1380 01:03:12,140 --> 01:03:15,030 So he's the one who's going to actually print out the value. 1381 01:03:15,030 --> 01:03:18,780 So the reduction essentially synchronizes until everybody's 1382 01:03:18,780 --> 01:03:22,370 communicated a value to processor zero. 1383 01:03:22,370 --> 01:03:23,990 And then it can print out the pi. 1384 01:03:23,990 --> 01:03:27,660 And then you can finalize, which actually makes sure the 1385 01:03:27,660 --> 01:03:28,890 computation can exit. 1386 01:03:28,890 --> 01:03:30,360 And you can go on and terminate. 1387 01:03:35,710 --> 01:03:39,750 So the last concept in terms of understanding performance 1388 01:03:39,750 --> 01:03:43,010 for parallelism is this notion of locality. 1389 01:03:43,010 --> 01:03:46,240 And there's locality in your communication and locality in 1390 01:03:46,240 --> 01:03:47,930 your computation. 1391 01:03:47,930 --> 01:03:50,700 So what do I mean by that? 1392 01:03:50,700 --> 01:03:55,690 So in terms of communication, you know, if I have two 1393 01:03:55,690 --> 01:03:58,830 operations and let's say -- this is a picture or schematic 1394 01:03:58,830 --> 01:04:02,130 of what the MIT raw chip looks like. 1395 01:04:02,130 --> 01:04:03,570 Each one of these is a core. 
1396 01:04:03,570 --> 01:04:06,620 There's some network, some basic computation elements. 1397 01:04:06,620 --> 01:04:09,270 And if I have, you know, an addition that feeds into a 1398 01:04:09,270 --> 01:04:12,910 shift, well, I can put the addition here and the shift 1399 01:04:12,910 --> 01:04:15,720 there, but that means I have a really long path that I need 1400 01:04:15,720 --> 01:04:17,700 to go to in terms of communicating 1401 01:04:17,700 --> 01:04:19,190 that data value around. 1402 01:04:19,190 --> 01:04:22,590 So the computation naturally should just be closer together 1403 01:04:22,590 --> 01:04:25,950 because that decreases the latency that I need to 1404 01:04:25,950 --> 01:04:27,940 communicate. 1405 01:04:27,940 --> 01:04:30,300 So rather than doing net mapping, what I might want to 1406 01:04:30,300 --> 01:04:32,900 do is just go to somebody who is close to me and available. 1407 01:04:32,900 --> 01:04:35,130 AUDIENCE: Also there are volume issues. 1408 01:04:35,130 --> 01:04:37,140 So assume more than that. 1409 01:04:37,140 --> 01:04:40,309 A lot of other people also want to communicate. 1410 01:04:40,309 --> 01:04:43,296 So if [UNINTELLIGIBLE] randomly distributed, you can 1411 01:04:43,296 --> 01:04:44,292 assume there's a lot more communication 1412 01:04:44,292 --> 01:04:47,880 going into the channel. 1413 01:04:47,880 --> 01:04:52,710 Whereas if you put locality in there then you can scale 1414 01:04:52,710 --> 01:04:56,210 communication much better than scaling the network. 1415 01:05:00,380 --> 01:05:02,400 PROFESSOR: There's also a notion of locality in terms of 1416 01:05:02,400 --> 01:05:03,040 memory accesses. 1417 01:05:03,040 --> 01:05:07,880 And these are potentially also very important or more 1418 01:05:07,880 --> 01:05:10,010 important, rather, because of the latencies 1419 01:05:10,010 --> 01:05:12,310 for accessing memory. 1420 01:05:12,310 --> 01:05:15,860 So if I have, you know, this loop that's doing some 1421 01:05:15,860 --> 01:05:19,270 addition or some computation on an array and I distribute 1422 01:05:19,270 --> 01:05:21,900 it, say, over four processors -- 1423 01:05:21,900 --> 01:05:24,880 this is, again, let's assume a data parallel loop. 1424 01:05:24,880 --> 01:05:27,460 So what I can do is have a work sharing mechanism that 1425 01:05:27,460 --> 01:05:29,970 says, this thread here will operate on 1426 01:05:29,970 --> 01:05:31,570 the first four indices. 1427 01:05:31,570 --> 01:05:34,135 This thread here will operate on the next four indices and 1428 01:05:34,135 --> 01:05:36,120 the next four and the next four. 1429 01:05:36,120 --> 01:05:39,530 And then you essentially get to join barrier and then you 1430 01:05:39,530 --> 01:05:40,950 can continue on. 1431 01:05:40,950 --> 01:05:44,840 And if we consider how the access patterns are going to 1432 01:05:44,840 --> 01:05:50,730 be generated for this particular loop, well, in the 1433 01:05:50,730 --> 01:05:52,560 sequential case I'm essentially 1434 01:05:52,560 --> 01:05:54,200 generating them in sequence. 1435 01:05:54,200 --> 01:05:56,620 So that allows me to exploit, for example, on traditional 1436 01:05:56,620 --> 01:05:59,780 [? CAT ?] architecture, a notion of spatial locality. 1437 01:05:59,780 --> 01:06:03,620 If I look at how things are organized in memory, in the 1438 01:06:03,620 --> 01:06:06,710 sequential case I can perhaps fetch an 1439 01:06:06,710 --> 01:06:07,870 entire block at a time. 
1440 01:06:07,870 --> 01:06:11,340 So I can fetch all the elements of A[0] 1441 01:06:11,340 --> 01:06:12,680 to A[3] in one shot. 1442 01:06:12,680 --> 01:06:16,820 I can fetch all the elements of A[4] to A[7] in one shot. 1443 01:06:16,820 --> 01:06:19,520 And that allows me to essentially improve 1444 01:06:19,520 --> 01:06:22,080 performance because I overlap communication. 1445 01:06:22,080 --> 01:06:25,890 I'm predicting that once I see a reference, I'm going to use 1446 01:06:25,890 --> 01:06:29,140 data that's adjacent to it in space. 1447 01:06:29,140 --> 01:06:31,070 There's also a notion of temporal locality that says 1448 01:06:31,070 --> 01:06:33,990 that if I use some particular data element, I'm going to 1449 01:06:33,990 --> 01:06:35,430 reuse it later on. 1450 01:06:35,430 --> 01:06:37,970 I'm not showing that here. 1451 01:06:37,970 --> 01:06:41,190 But in the parallel case what could happen is if each one of 1452 01:06:41,190 --> 01:06:43,920 these threads is requesting a different data element -- and 1453 01:06:43,920 --> 01:06:49,470 let's say execution essentially proceeds -- you 1454 01:06:49,470 --> 01:06:51,230 know, all the threads are requesting their 1455 01:06:51,230 --> 01:06:53,970 data at the same time. 1456 01:06:53,970 --> 01:06:56,240 Then all these requests are going to end up going to the 1457 01:06:56,240 --> 01:06:59,130 same memory bank. 1458 01:06:59,130 --> 01:07:02,420 The first thread is requesting A[0]. 1459 01:07:02,420 --> 01:07:05,770 The next thread is requesting A[4], the next thread 1460 01:07:05,770 --> 01:07:08,750 A[8], the next thread A[12]. 1461 01:07:08,750 --> 01:07:11,010 And all of these happen to be in the same memory bank. 1462 01:07:11,010 --> 01:07:13,040 So what that means is, you know, there's a lot of 1463 01:07:13,040 --> 01:07:14,940 contention for that one memory bank. 1464 01:07:14,940 --> 01:07:17,650 And in effect I've serialized the computation. 1465 01:07:17,650 --> 01:07:17,850 Right? 1466 01:07:17,850 --> 01:07:20,620 Everybody see that? 1467 01:07:20,620 --> 01:07:23,090 And, you know, this can be a problem in that you can 1468 01:07:23,090 --> 01:07:26,920 essentially fully serialize the computation in that, you 1469 01:07:26,920 --> 01:07:29,720 know, there's contention on the first bank, contention on 1470 01:07:29,720 --> 01:07:33,620 the second bank, and then contention on the third bank, 1471 01:07:33,620 --> 01:07:35,040 and then contention on the fourth bank. 1472 01:07:35,040 --> 01:07:38,000 And so I've done absolutely nothing other than pay 1473 01:07:38,000 --> 01:07:39,720 overhead for parallelization. 1474 01:07:39,720 --> 01:07:42,590 I've made extra work for myself [? concreting ?] 1475 01:07:42,590 --> 01:07:44,250 the threads. 1476 01:07:44,250 --> 01:07:46,810 Maybe I've done some extra work in terms of 1477 01:07:46,810 --> 01:07:48,840 synchronization. 1478 01:07:48,840 --> 01:07:50,460 So I'm fully serial. 1479 01:07:52,980 --> 01:07:55,810 So what you want to do is actually reorganize the way 1480 01:07:55,810 --> 01:07:59,840 data is laid out in memory so that you can effectively get 1481 01:07:59,840 --> 01:08:01,620 the benefit of parallelization. 1482 01:08:01,620 --> 01:08:07,420 So if you have the data organized as it is there, you can 1483 01:08:07,420 --> 01:08:09,670 shuffle things around.
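Concretely, assuming four banks that are word interleaved the way the picture suggests (bank = index mod 4), one way that shuffle could look is a simple transposed copy, so that each thread's four elements end up in a bank of their own. This is an illustration of the idea, not the slide's exact layout:

/* Remap a 16-element array so that thread t's elements, originally
   A[4t] .. A[4t+3], all land at indices congruent to t modulo 4. */
int A[16], B[16];
/* ... A gets filled with the input data ... */
for (int i = 0; i < 16; i++) {
    int t = i / 4;            /* thread that owns element i           */
    int k = i % 4;            /* position within that thread's chunk  */
    B[4 * k + t] = A[i];      /* new index mod 4 equals t             */
}
/* Thread t now walks B[t], B[4 + t], B[8 + t], B[12 + t], which all sit
   in bank t, so at every step the four threads hit four different banks. */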
1484 01:08:09,670 --> 01:08:13,000 And then you end up with fully parallel or a layout that's 1485 01:08:13,000 --> 01:08:16,320 more amenable to full parallelism because now each 1486 01:08:16,320 --> 01:08:17,930 thread is going to a different bank. 1487 01:08:17,930 --> 01:08:20,650 And that essentially gives you a four-way parallelism. 1488 01:08:20,650 --> 01:08:22,750 And so you get the performance benefits. 1489 01:08:26,480 --> 01:08:30,623 So there are different kinds of sort of considerations you 1490 01:08:30,623 --> 01:08:34,400 need to take into account for shared memory architectures in 1491 01:08:34,400 --> 01:08:37,400 terms of how the design affects the memory latency. 1492 01:08:37,400 --> 01:08:42,190 So in a uniform memory access architecture, every processor 1493 01:08:42,190 --> 01:08:44,240 is either, you can think of it as being 1494 01:08:44,240 --> 01:08:45,200 equidistant from memory. 1495 01:08:45,200 --> 01:08:48,295 Or another way, it has the same access latency for 1496 01:08:48,295 --> 01:08:50,100 getting data from memory. 1497 01:08:50,100 --> 01:08:54,480 Most shared memory architectures are non-uniform, 1498 01:08:54,480 --> 01:08:56,710 also known as NUMA architecture. 1499 01:08:56,710 --> 01:08:59,460 So you have physically partitioned memories. 1500 01:08:59,460 --> 01:09:03,020 And the processors can have the same address space, but 1501 01:09:03,020 --> 01:09:05,930 the placement of data affects the performance because going 1502 01:09:05,930 --> 01:09:10,860 to one bank versus another can be faster or slower. 1503 01:09:10,860 --> 01:09:12,560 So what kind of architecture is Cell? 1504 01:09:12,560 --> 01:09:19,100 Yeah. 1505 01:09:19,100 --> 01:09:19,710 No guesses? 1506 01:09:19,710 --> 01:09:22,910 AUDIENCE: It's not a shared memory. 1507 01:09:22,910 --> 01:09:23,150 PROFESSOR: Right. 1508 01:09:23,150 --> 01:09:24,720 It's not a shared memory architecture. 1509 01:09:27,770 --> 01:09:30,500 So a summary of parallel performance factors. 1510 01:09:30,500 --> 01:09:32,390 So there's three things I tried to cover. 1511 01:09:34,970 --> 01:09:36,510 Coverage or the extent of parallelism in the 1512 01:09:36,510 --> 01:09:37,420 application. 1513 01:09:37,420 --> 01:09:40,480 So you saw Amdahl's Law and it actually gave you a sort of a 1514 01:09:40,480 --> 01:09:43,990 model that said when is parallelizing your application 1515 01:09:43,990 --> 01:09:45,000 going to be worthwhile? 1516 01:09:45,000 --> 01:09:46,750 And it really boils down to how much parallelism you 1517 01:09:46,750 --> 01:09:48,990 actually have in your particular algorithm. 1518 01:09:48,990 --> 01:09:50,870 If your algorithm is sequential, then there's 1519 01:09:50,870 --> 01:09:56,110 really nothing you can do for programming for performance 1520 01:09:56,110 --> 01:09:57,820 using parallel architectures. 1521 01:09:57,820 --> 01:10:02,500 I talked about granularity of the data partitioning and the 1522 01:10:02,500 --> 01:10:04,360 granularity of the work distribution. 1523 01:10:04,360 --> 01:10:06,080 You know, if you had really fine-grain things versus 1524 01:10:06,080 --> 01:10:08,200 really coarse-grain things, how does that translate to 1525 01:10:08,200 --> 01:10:10,980 different communication costs? 1526 01:10:10,980 --> 01:10:13,770 And then last thing I shared was locality. 
1527 01:10:13,770 --> 01:10:16,620 So if you have near neighbors talking, that may be different 1528 01:10:16,620 --> 01:10:19,230 than two things that are further apart in space 1529 01:10:19,230 --> 01:10:20,730 communicating. 1530 01:10:20,730 --> 01:10:23,690 And there are some issues in terms of the memory latency 1531 01:10:23,690 --> 01:10:28,310 and how you actually can take advantage of that. 1532 01:10:28,310 --> 01:10:33,400 So this really is an overview of sort of the parallel 1533 01:10:33,400 --> 01:10:36,670 programming concepts and the performance implications. 1534 01:10:36,670 --> 01:10:39,530 So the next lecture will be, you know, how do I actually 1535 01:10:39,530 --> 01:10:40,560 parallelize my program? 1536 01:10:40,560 --> 01:10:42,480 And we'll talk about that.