1
00:00:00,030 --> 00:00:02,420
The following content is
provided under a Creative

2
00:00:02,420 --> 00:00:03,850
Commons license.

3
00:00:03,850 --> 00:00:06,860
Your support will help MIT
OpenCourseWare continue to

4
00:00:06,860 --> 00:00:10,540
offer high quality educational
resources for free.

5
00:00:10,540 --> 00:00:13,410
To make a donation or view
additional materials from

6
00:00:13,410 --> 00:00:17,610
hundreds of MIT courses, visit
MIT OpenCourseWare at

7
00:00:17,610 --> 00:00:18,860
ocw.mit.edu.

8
00:00:21,390 --> 00:00:23,210
PROFESSOR: Let's get started.

9
00:00:23,210 --> 00:00:27,660
So what we are going to do today
is go about discovering

10
00:00:27,660 --> 00:00:29,650
other, automated methods.

11
00:00:29,650 --> 00:00:32,590
We know you guys are amazing
hackers and you can actually

12
00:00:32,590 --> 00:00:34,730
do all those things by hand.

13
00:00:34,730 --> 00:00:40,580
But to make multi-core generally
acceptable, can we

14
00:00:40,580 --> 00:00:41,510
do things automatically?

15
00:00:41,510 --> 00:00:44,880
Can we really reduce the burden
on the programmers?

16
00:00:44,880 --> 00:00:48,460
So at the beginning I'm going
to talk about general

17
00:00:48,460 --> 00:00:49,600
parallelizing compilers.

18
00:00:49,600 --> 00:00:50,540
What people have done.

19
00:00:50,540 --> 00:00:51,800
What's the state of the art.

20
00:00:51,800 --> 00:00:55,590
Kind of give you a feel for
what is doable.

21
00:00:55,590 --> 00:00:58,120
Hopefully, that will be a little
over an hour, and then

22
00:00:58,120 --> 00:01:02,560
we'll go talk about the StreamIt
compiler, what we have done

23
00:01:02,560 --> 00:01:09,140
recently, and how this
automation part can do.

24
00:01:09,140 --> 00:01:11,730
So, I'll talk a little bit
about parallel execution.

25
00:01:11,730 --> 00:01:16,150
This is kind of what
you know already.

26
00:01:16,150 --> 00:01:19,600
Then go into parallelizing
compilers, and talk about how

27
00:01:19,600 --> 00:01:21,670
to determine if something is
parallel by doing data

28
00:01:21,670 --> 00:01:25,020
dependence analysis, and how
to increase the amount of

29
00:01:25,020 --> 00:01:27,110
parallelism available in
code loop, what kind of

30
00:01:27,110 --> 00:01:28,610
transformation.

31
00:01:28,610 --> 00:01:32,570
Then we go look at how to
generate code, because once

32
00:01:32,570 --> 00:01:34,280
you see that something is
parallel, how you actually get

33
00:01:34,280 --> 00:01:35,270
to run parallel.

34
00:01:35,270 --> 00:01:38,480
And finish up with actually how
to do communication code

35
00:01:38,480 --> 00:01:44,330
in a machine such as a server.

36
00:01:44,330 --> 00:01:48,660
So in parallel execution, this
is something -- it's a review.

37
00:01:48,660 --> 00:01:50,460
So there are many ways of
parallelism, things like

38
00:01:50,460 --> 00:01:51,240
instruction level parallelism.

39
00:01:51,240 --> 00:01:55,680
It's basically effected by
hardware or compiler

40
00:01:55,680 --> 00:01:57,060
scheduling.

41
00:01:57,060 --> 00:01:59,730
As of today this is
in abundance.

42
00:01:59,730 --> 00:02:02,850
In all superscalars we do
that, in [OBSCURED]

43
00:02:02,850 --> 00:02:04,560
we do that.

44
00:02:04,560 --> 00:02:07,350
Then task level parallelism,
it's what most of you

45
00:02:07,350 --> 00:02:08,860
guys are doing now.

46
00:02:08,860 --> 00:02:11,220
You probably find a program, you
divide it into tasks, you

47
00:02:11,220 --> 00:02:14,200
get task level parallelism,
mainly by hand.

48
00:02:14,200 --> 00:02:16,120
Some of you might be doing data
level parallelism and

49
00:02:16,120 --> 00:02:19,300
also loop level parallelism.

50
00:02:19,300 --> 00:02:22,010
That can be by hand or
compiler generated.

51
00:02:22,010 --> 00:02:24,300
Then, of course, pipeline
parallelism is more mainly

52
00:02:24,300 --> 00:02:26,860
done in hardware, and languages like
StreamIt do pipeline

53
00:02:26,860 --> 00:02:28,560
parallelism.

54
00:02:28,560 --> 00:02:31,435
Divide and conquer parallelism
we went a little bit more than

55
00:02:31,435 --> 00:02:35,170
in hardware, mainly by hand
for recursive functions.

56
00:02:35,170 --> 00:02:39,660
Today we are going to focus
on loop level parallelism,

57
00:02:39,660 --> 00:02:43,360
particularly how to do loop level
parallelism by the compiler.

58
00:02:43,360 --> 00:02:45,090
So why loops?

59
00:02:45,090 --> 00:02:48,000
So loops are interesting because
people observed in

60
00:02:48,000 --> 00:02:51,910
most code, 90% of execution
time is in 10% of the code.

61
00:02:51,910 --> 00:02:55,690
Almost 99% of the execution time
is in 10% of the code.

62
00:02:55,690 --> 00:03:00,080
This is called a loop, and it makes
sense because running at

63
00:03:00,080 --> 00:03:05,990
3 gigahertz, if you only run each
instruction once, then you run

64
00:03:05,990 --> 00:03:09,920
through the hard drive in only a
few minutes because you need

65
00:03:09,920 --> 00:03:11,370
to have repeatability.

66
00:03:11,370 --> 00:03:12,830
A lot of the time, repeatability
means loops.

67
00:03:16,070 --> 00:03:17,970
Loops, if you can parallelize,
you can get really good

68
00:03:17,970 --> 00:03:21,190
performance because loops most
of the time, each loop

69
00:03:21,190 --> 00:03:24,420
iteration has the same amount
of work and you get nice good

70
00:03:24,420 --> 00:03:28,620
load balance, it's somewhat
easier to analyze, so that's

71
00:03:28,620 --> 00:03:29,750
why compilers start there.

72
00:03:29,750 --> 00:03:33,070
Whereas if you try to get task
level parallelism, things have

73
00:03:33,070 --> 00:03:38,350
a lot more complexities that
automatic compiler cannot do.

74
00:03:38,350 --> 00:03:41,220
So there are two types
of parallel loops.

75
00:03:41,220 --> 00:03:43,120
One is a forall loop.

76
00:03:43,120 --> 00:03:45,720
That means there are no loop
carried dependences.

77
00:03:45,720 --> 00:03:49,030
That means you can get the
sequential code executing, run

78
00:03:49,030 --> 00:03:52,390
everything in parallel, and at
the end you have a barrier and

79
00:03:52,390 --> 00:03:53,710
when everybody finishes
you continue on

80
00:03:53,710 --> 00:03:55,810
the sequential code.

81
00:03:55,810 --> 00:03:58,300
That is how you do
a forall loop.

82
00:03:58,300 --> 00:04:01,510
Some languages, in fact, have
explicit parallel constructs,

83
00:04:01,510 --> 00:04:06,580
say OK, here's a forall
loop and go do that.

84
00:04:06,580 --> 00:04:08,990
The other type of
loop is called a

85
00:04:08,990 --> 00:04:10,990
foracross or doacross loop.

86
00:04:10,990 --> 00:04:13,860
That says OK, while the loop
is parallel, there are some

87
00:04:13,860 --> 00:04:14,760
dependences.

88
00:04:14,760 --> 00:04:17,670
That means some value generated
here is used

89
00:04:17,670 --> 00:04:18,720
somewhere here.

90
00:04:18,720 --> 00:04:20,670
So you can run it parallel,
but you have some

91
00:04:20,670 --> 00:04:22,280
communication going too.

92
00:04:22,280 --> 00:04:23,670
So you had to move data.

93
00:04:23,670 --> 00:04:26,590
So it's not completely running
parallel, there's some

94
00:04:26,590 --> 00:04:27,720
synchronization going on.

95
00:04:27,720 --> 00:04:29,300
But you can get large chunks
running in parallel.

96
00:04:32,200 --> 00:04:36,840
So we kind of focus on doall
loops today, and let's look at

97
00:04:36,840 --> 00:04:38,720
this example.

98
00:04:38,720 --> 00:04:40,940
We see it's a forall,
so it's a parallel

99
00:04:40,940 --> 00:04:46,110
loop or forall loop.

100
00:04:46,110 --> 00:04:48,430
When you know it's parallel,
in here, of course,

101
00:04:48,430 --> 00:04:51,030
the user said that.

102
00:04:51,030 --> 00:04:53,930
What we can do is we can
distribute the iteration by

103
00:04:53,930 --> 00:04:57,520
chunking up the iteration space
into number of processors

104
00:04:57,520 --> 00:05:02,170
chunks, and basically
run that.

105
00:05:02,170 --> 00:05:05,480
In SPMD mode, you can at the
beginning the first processor

106
00:05:05,480 --> 00:05:10,120
can calculate the number of
iterations you can run on each

107
00:05:10,120 --> 00:05:14,250
processor in here, and then you
synchronize, you put a barrier

108
00:05:14,250 --> 00:05:17,170
there, so everybody kind of
sync up at that point.

109
00:05:17,170 --> 00:05:20,590
All other processors are waiting, and
at that point, everybody

110
00:05:20,590 --> 00:05:23,116
starts, when you reach this
point it's running, it's part

111
00:05:23,116 --> 00:05:25,420
of iterations, and then you're
going to put a barrier

112
00:05:25,420 --> 00:05:26,670
synchronization in place.
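
As a rough illustration of the SPMD scheme just described, here is a minimal Python sketch. It assumes threads stand in for processors, and the names chunk_bounds and spmd_forall are made up for this example, not taken from the lecture. Each worker computes its own chunk of the iteration space, runs it, and then waits at a barrier.

```python
import threading

def chunk_bounds(n, nprocs, pid):
    """Half-open range [lo, hi) of iterations owned by worker pid."""
    chunk = (n + nprocs - 1) // nprocs  # ceiling division
    lo = min(pid * chunk, n)
    hi = min(lo + chunk, n)
    return lo, hi

def spmd_forall(n, nprocs, body):
    """Run body(i) for i in 0..n-1, split into nprocs contiguous chunks."""
    barrier = threading.Barrier(nprocs)

    def worker(pid):
        lo, hi = chunk_bounds(n, nprocs, pid)
        for i in range(lo, hi):
            body(i)
        barrier.wait()  # everybody syncs up before sequential code resumes

    threads = [threading.Thread(target=worker, args=(p,)) for p in range(nprocs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# Hypothetical forall body: no iteration touches another iteration's element.
a = [0] * 10

def body(i):
    a[i] = i * i

spmd_forall(10, 4, body)
print(a)
```

With 10 iterations and 4 workers, the chunks are 3, 3, 3 and 1 iterations, so the load balance is only approximate when the count does not divide evenly.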

113
00:05:28,230 --> 00:05:32,150
Kind of obvious, parallel code
basically in here, running on

114
00:05:32,150 --> 00:05:34,780
shared memory machine
at this point.

115
00:05:34,780 --> 00:05:36,310
So this is what we can do.

116
00:05:36,310 --> 00:05:39,000
I mean this is what
we saw before.

117
00:05:39,000 --> 00:05:41,650
Of course, instead of doing
that, you can also do fork

118
00:05:41,650 --> 00:05:44,890
join types, where once you want to
run something parallel, you

119
00:05:44,890 --> 00:05:49,220
can fork a thread and each
thread gets some amount of

120
00:05:49,220 --> 00:05:51,480
iterations you run, and after
that you merge together.

121
00:05:51,480 --> 00:05:54,180
So you can do both.

122
00:05:54,180 --> 00:05:55,290
So that's by hand.

123
00:05:55,290 --> 00:05:59,330
How do you do something like
that by the compiler?

124
00:05:59,330 --> 00:06:01,540
That sounds simple enough,
trivial enough.

125
00:06:01,540 --> 00:06:03,010
But how do you automate
the entire process.

126
00:06:03,010 --> 00:06:06,480
How to go about doing that.

127
00:06:06,480 --> 00:06:09,240
So, here are some normal
loops, for loops.

128
00:06:09,240 --> 00:06:13,110
So the forall does this thing
that was so simple, which is

129
00:06:13,110 --> 00:06:15,270
the forall construct that means
somebody could look at

130
00:06:15,270 --> 00:06:18,310
that and said this
loop is parallel.

131
00:06:18,310 --> 00:06:21,470
But you look at these FOR loops,
how many of these loops

132
00:06:21,470 --> 00:06:23,780
are parallel?

133
00:06:23,780 --> 00:06:26,940
Is the first loop parallel?

134
00:06:26,940 --> 00:06:27,160
Why?

135
00:06:27,160 --> 00:06:27,810
Why not?

136
00:06:27,810 --> 00:06:31,220
AUDIENCE: [OBSCURED.]

137
00:06:31,220 --> 00:06:36,480
PROFESSOR: It's a loop where
iteration one is

138
00:06:36,480 --> 00:06:38,860
using what you wrote
in iteration zero.

139
00:06:38,860 --> 00:06:41,910
So iteration one has to wait
until iteration zero is

140
00:06:41,910 --> 00:06:43,310
done, and so on.

141
00:06:43,310 --> 00:06:44,560
How about this one?

142
00:06:50,110 --> 00:06:50,460
Why?

143
00:06:50,460 --> 00:06:57,350
AUDIENCE: [NOISE.]

144
00:06:57,350 --> 00:07:01,380
PROFESSOR: Not really.

145
00:07:01,380 --> 00:07:04,500
So it's writing element
0 to 5, it's reading

146
00:07:04,500 --> 00:07:08,040
elements 6 to 11.

147
00:07:08,040 --> 00:07:10,440
So they don't overlap.

148
00:07:10,440 --> 00:07:12,740
So what you read and what you
write never overlap, so you

149
00:07:12,740 --> 00:07:16,240
can keep doing it in any order,
because the dependence

150
00:07:16,240 --> 00:07:18,990
means something you wrote,
later you will read.

151
00:07:18,990 --> 00:07:19,920
This doesn't happen in here.

152
00:07:19,920 --> 00:07:33,600
How about this one?

153
00:07:33,600 --> 00:07:35,010
AUDIENCE: There's no dependence
in there.

154
00:07:35,010 --> 00:07:35,250
PROFESSOR: Why?

155
00:07:35,250 --> 00:07:38,420
AUDIENCE: [OBSCURED.]

156
00:07:38,420 --> 00:07:41,420
PROFESSOR: So you're writing
even, you're reading odd.

157
00:07:41,420 --> 00:07:43,900
So there's no overlapping
or anything like that.

158
00:07:43,900 --> 00:07:44,350
Question?

159
00:07:44,350 --> 00:07:47,020
OK.
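
The three loops are on the slide and not in the transcript, so the loop bodies below are assumptions reconstructed from the discussion: one reads what the previous iteration wrote, one writes elements 0 to 5 and reads 6 to 11, one writes even and reads odd elements. This Python sketch runs each loop forward and in reverse and compares the results. Matching results are only a quick sanity check, not a proof of parallelism, but a mismatch does expose a loop carried dependence.

```python
def run(loop_body, iterations, data):
    """Execute loop_body(a, i) over a private copy of data, in the given order."""
    a = list(data)
    for i in iterations:
        loop_body(a, i)
    return a

def order_independent(loop_body, iterations, data):
    """True if forward and reversed execution give the same array."""
    its = list(iterations)
    return run(loop_body, its, data) == run(loop_body, reversed(its), data)

def loop1(a, i):
    a[i] = a[i - 1] + 1        # reads what the previous iteration wrote

def loop2(a, i):
    a[i] = a[i + 6]            # writes 0..5, reads 6..11

def loop3(a, i):
    a[2 * i] = a[2 * i + 1]    # writes even elements, reads odd elements

data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]
print(order_independent(loop1, range(1, 6), data))
print(order_independent(loop2, range(0, 6), data))
print(order_independent(loop3, range(0, 6), data))
```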

160
00:07:47,020 --> 00:07:48,620
So, the way to look at that --

161
00:07:48,620 --> 00:07:50,260
I'm going to go a little
bit of formalism.

162
00:07:50,260 --> 00:07:53,100
You can think about this
as an iteration space.

163
00:07:53,100 --> 00:07:57,080
So iteration is if you look at
each iteration separately,

164
00:07:57,080 --> 00:07:59,820
there could be thousands and
millions of iterations and

165
00:07:59,820 --> 00:08:01,160
your compiler never [COUGHING]

166
00:08:01,160 --> 00:08:04,740
doing any work, and also some
iteration space is defined by

167
00:08:04,740 --> 00:08:08,070
a range like 1 to n, so you
don't even know exactly how

168
00:08:08,070 --> 00:08:09,905
many iterations are
going to be there.

169
00:08:09,905 --> 00:08:13,310
So you can represent this
as abstract space.

170
00:08:13,310 --> 00:08:16,470
Normally, most of this
loops you look at you

171
00:08:16,470 --> 00:08:17,470
normalize to step one.

172
00:08:17,470 --> 00:08:22,320
So what that means is all the
integer points in that space.

173
00:08:22,320 --> 00:08:25,880
So if you have a loop like
this, i equals 0 to 6, j

174
00:08:25,880 --> 00:08:27,600
equals i to 7.

175
00:08:27,600 --> 00:08:29,040
That's the iteration space,
there are two

176
00:08:29,040 --> 00:08:31,080
dimensions in there.

177
00:08:31,080 --> 00:08:34,150
The points that start iteration
off because it's not

178
00:08:34,150 --> 00:08:38,980
a rectangular space, it can have
this structure because

179
00:08:38,980 --> 00:08:42,090
j's go in triangular in here.

180
00:08:42,090 --> 00:08:44,340
So the way you can represent
that is so you can represent

181
00:08:44,340 --> 00:08:48,990
iteration space by a vector
i, and you can have each

182
00:08:48,990 --> 00:08:50,340
dimension or use
two dimension.

183
00:08:50,340 --> 00:08:52,370
This was some i1, i2
space in here.

184
00:08:52,370 --> 00:08:54,900
So you can represent
it like that.

185
00:08:54,900 --> 00:08:57,840
It's the notion of lexicographic
ordering.

186
00:08:57,840 --> 00:09:00,540
That means if you execute the
loop, what's the order you're

187
00:09:00,540 --> 00:09:01,990
going to receive this thing.

188
00:09:01,990 --> 00:09:03,950
If you execute this loop,
what you are going to do

189
00:09:03,950 --> 00:09:06,010
is you go from --

190
00:09:06,010 --> 00:09:07,590
you go like this.

191
00:09:07,590 --> 00:09:09,810
This is lexicographical
ordering of

192
00:09:09,810 --> 00:09:12,070
everything in the loops.

193
00:09:12,070 --> 00:09:13,440
That's the normal
execution order.

194
00:09:13,440 --> 00:09:15,440
That's a sequential order.

195
00:09:15,440 --> 00:09:17,940
At some point you want to make
sure that anything we do kind

196
00:09:17,940 --> 00:09:20,210
of has a look and feel
of the sequential

197
00:09:20,210 --> 00:09:23,430
lexicographical order.

198
00:09:23,430 --> 00:09:27,440
So, one thing you can say
is if you have multiple

199
00:09:27,440 --> 00:09:33,610
dimensions, if there are two
iterations, one iteration

200
00:09:33,610 --> 00:09:37,180
is lexicographically less than another
iterations says if all outer

201
00:09:37,180 --> 00:09:40,490
dimensions are the same, you
go to the first dimension

202
00:09:40,490 --> 00:09:44,650
where the numbers differ in
the two iterations.

203
00:09:44,650 --> 00:09:46,960
Then that dictates if it's

204
00:09:46,960 --> 00:09:48,790
lexicographically less than the other.

205
00:09:48,790 --> 00:09:51,840
So if the outer dimensions are
the same, that means the next

206
00:09:51,840 --> 00:09:53,470
one decides, the next one
decides, next one decides

207
00:09:53,470 --> 00:09:54,610
going down.

208
00:09:54,610 --> 00:09:57,000
First one that's actually
different decides who's before

209
00:09:57,000 --> 00:09:58,250
the other one.
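
To make the ordering rule concrete, here is a small Python sketch. It assumes iterations are written as tuples (i, j) with the outer index first, and lex_less is a name invented for this example. The first dimension where two iterations differ decides which one comes first, and the sequential loop nest visits iterations in exactly that order.

```python
def lex_less(a, b):
    """True if iteration a executes before iteration b in sequential order."""
    for x, y in zip(a, b):
        if x != y:
            return x < y   # first differing dimension decides
    return False           # identical iterations: neither is before the other

# The triangular space from the example: i from 0 to 6, j from i to 7.
space = [(i, j) for i in range(0, 7) for j in range(i, 8)]

# Sequential execution order agrees with lexicographic order.
in_order = all(lex_less(space[k], space[k + 1]) for k in range(len(space) - 1))
print(len(space), in_order)
```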

210
00:10:00,630 --> 00:10:04,515
So another concept is called
affine loop nest. Affine loop

211
00:10:04,515 --> 00:10:08,770
nest says loop bounds are
integer linear functions of

212
00:10:08,770 --> 00:10:11,840
constants, loop-constant
variables,

213
00:10:11,840 --> 00:10:14,200
and outer loop indices.

214
00:10:14,200 --> 00:10:17,525
So that means if you want to get
affine function within a

215
00:10:17,525 --> 00:10:21,370
loop, that has to be a linear
function or integer function

216
00:10:21,370 --> 00:10:26,500
where all the things either
has to be constant or loop

217
00:10:26,500 --> 00:10:26,950
constants --

218
00:10:26,950 --> 00:10:29,760
that means that that variable
doesn't change in the loop or

219
00:10:29,760 --> 00:10:30,940
outer loop indices.

220
00:10:30,940 --> 00:10:32,810
That makes it much easier
to analyze.

221
00:10:35,550 --> 00:10:39,890
Also, array accesses: in each
dimension, the access function has

222
00:10:39,890 --> 00:10:41,430
the same property.

223
00:10:41,430 --> 00:10:44,670
So of course, there are many
programs that don't satisfy

224
00:10:44,670 --> 00:10:46,730
this, for example,
if we do FFT.

225
00:10:46,730 --> 00:10:48,450
That doesn't satisfy that
because you have

226
00:10:48,450 --> 00:10:50,390
exponentials in there.

227
00:10:50,390 --> 00:10:53,900
But what that means is for FFT,
there's probably no way that

228
00:10:53,900 --> 00:10:55,620
the compiler's going
to analyze that.

229
00:10:55,620 --> 00:11:01,060
But most kind of loops fit this
kind of model and then

230
00:11:01,060 --> 00:11:03,120
you can put into nice
mathematical framework and

231
00:11:03,120 --> 00:11:05,840
analyze that. What I'm going to
do is kind of follow

232
00:11:05,840 --> 00:11:06,930
through some of the
mathematical

233
00:11:06,930 --> 00:11:10,280
framework with you guys.

234
00:11:10,280 --> 00:11:14,100
So, what you can do here is if
you look at this one, instead

235
00:11:14,100 --> 00:11:20,280
of representing this iteration
space by each iteration, which

236
00:11:20,280 --> 00:11:23,650
can be huge or which is not even
known at compile time,

237
00:11:23,650 --> 00:11:27,890
what you can do is you can
represent this by kind of a

238
00:11:27,890 --> 00:11:31,800
bounding space of iterations,
basically.

239
00:11:31,800 --> 00:11:35,270
So what this is, we don't mark
every box there, but we say

240
00:11:35,270 --> 00:11:37,160
OK, look, if you put
these planes --

241
00:11:37,160 --> 00:11:40,650
I put four planes in here, and
everything inside these planes

242
00:11:40,650 --> 00:11:43,120
represent this iteration
space.

243
00:11:43,120 --> 00:11:46,850
That's nice because instead of
going 0 to 6, if you go 0 to

244
00:11:46,850 --> 00:11:51,010
60,000, still I have the same
equation, I don't suddenly

245
00:11:51,010 --> 00:11:55,370
have 6 million data points in
here I need to represent.

246
00:11:55,370 --> 00:12:00,230
So, my representation doesn't
grow with the size of my

247
00:12:00,230 --> 00:12:01,155
iteration space.

248
00:12:01,155 --> 00:12:03,500
It grows with the shape of
this iteration space.

249
00:12:03,500 --> 00:12:06,590
If you have complicated one,
it can be difficult.

250
00:12:06,590 --> 00:12:08,890
So what you can do is you can
represent the iteration space: it's all

251
00:12:08,890 --> 00:12:13,140
iterations where i is zero to
six, j is i to 7.

252
00:12:13,140 --> 00:12:16,240
These are all linear functions.

253
00:12:16,240 --> 00:12:18,530
That makes our analysis
easier.
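
The four bounding planes can be written down as linear inequalities. The Python sketch below is one possible encoding, assuming each plane is stored as coefficients (c_i, c_j, bound) meaning c_i*i + c_j*j <= bound. The names planes_for and in_space are illustrative only. The point is that the description stays at four rows whether the upper bounds are 6 and 7 or 60,000.

```python
def planes_for(max_i, max_j):
    """Four half-planes describing: 0 <= i <= max_i, i <= j <= max_j."""
    return [
        (-1, 0, 0),       # -i <= 0      (i >= 0)
        (1, 0, max_i),    #  i <= max_i
        (1, -1, 0),       #  i - j <= 0  (j >= i)
        (0, 1, max_j),    #  j <= max_j
    ]

def in_space(i, j, planes):
    """True if the integer point (i, j) satisfies every inequality."""
    return all(ci * i + cj * j <= bound for ci, cj, bound in planes)

small = planes_for(6, 7)
points = [(i, j) for i in range(-2, 10) for j in range(-2, 10)
          if in_space(i, j, small)]
print(len(points))

# A much larger loop nest needs exactly the same number of rows.
big = planes_for(60000, 60001)
print(len(small), len(big))
```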

254
00:12:18,530 --> 00:12:21,570
So the flip side of that
is the data space.

255
00:12:21,570 --> 00:12:24,200
So an m dimensional array
has an m dimensional

256
00:12:24,200 --> 00:12:27,290
discrete Cartesian space.

257
00:12:27,290 --> 00:12:30,520
Basically, in the data space you
don't have arrays that are

258
00:12:30,520 --> 00:12:31,240
odd shaped.

259
00:12:31,240 --> 00:12:34,420
It's almost a hypercube
always.

260
00:12:34,420 --> 00:12:38,790
So something like that is a
one dimensional space and

261
00:12:38,790 --> 00:12:40,130
something can be represented
as a two

262
00:12:40,130 --> 00:12:42,470
dimensional space in here.

263
00:12:42,470 --> 00:12:45,990
So data space has this nice
property, in that sense it's a

264
00:12:45,990 --> 00:12:48,140
multi-dimensional hypercube.

265
00:12:48,140 --> 00:12:51,290
And what that gives you
is kind of a bunch of

266
00:12:51,290 --> 00:12:54,525
mathematical techniques to kind
of do and at least see

267
00:12:54,525 --> 00:12:56,450
some transformations we need
to do in compiling.

268
00:12:59,570 --> 00:13:01,470
As humans, I think we can look
at a lot more complicated

269
00:13:01,470 --> 00:13:06,320
loops by hand, and get a better
idea what's going on.

270
00:13:06,320 --> 00:13:08,850
But in a compiler you need to
have a very simple way of

271
00:13:08,850 --> 00:13:11,910
describing what to analyze, what
to formulate, and having

272
00:13:11,910 --> 00:13:14,900
this model helps you put it into
a nice mathematical frame

273
00:13:14,900 --> 00:13:17,250
you can do.

274
00:13:17,250 --> 00:13:18,160
So the next thing
is dependence.

275
00:13:18,160 --> 00:13:20,770
We have done that so I will go
through this fast. So the

276
00:13:20,770 --> 00:13:22,330
first is a true dependence.

277
00:13:22,330 --> 00:13:25,890
What that means is I wrote
something, I read it here.

278
00:13:25,890 --> 00:13:27,750
So I really meant
that I actually

279
00:13:27,750 --> 00:13:29,860
really use that value.

280
00:13:29,860 --> 00:13:34,020
There are two more dependences, mainly
because we are finding

281
00:13:34,020 --> 00:13:37,050
dependence on some location. One
is an anti-dependence.

282
00:13:37,050 --> 00:13:39,870
That means I can't write it
until this read is done

283
00:13:39,870 --> 00:13:41,460
because I can't destroy
the value.

284
00:13:41,460 --> 00:13:44,480
Output dependence is there, so
ordering of writing that you

285
00:13:44,480 --> 00:13:45,730
need to maintain.
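
The three kinds of dependence can be summarized in a tiny Python sketch. It assumes both accesses touch the same location and that the first one executes earlier; classify is a name chosen for this example.

```python
def classify(first, second):
    """Classify a dependence between two accesses to the same location.

    first and second are 'R' (read) or 'W' (write); first executes earlier.
    Returns None for read-after-read, which imposes no ordering.
    """
    kinds = {
        ('W', 'R'): 'true',    # the value written is actually used
        ('R', 'W'): 'anti',    # cannot overwrite until the read is done
        ('W', 'W'): 'output',  # order of the writes must be kept
    }
    return kinds.get((first, second))

for pair in [('W', 'R'), ('R', 'W'), ('W', 'W'), ('R', 'R')]:
    print(pair, classify(*pair))
```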

286
00:13:48,010 --> 00:13:53,920
So in a dynamic instance, data
dependence exist between i and

287
00:13:53,920 --> 00:13:59,130
j if either i or j is a write
operation, and i and j refer

288
00:13:59,130 --> 00:14:01,780
to the same variable, and
i executes before j.

289
00:14:01,780 --> 00:14:05,590
So it's the same thing, one
execute before the other.

290
00:14:05,590 --> 00:14:07,590
So it's not that you don't have
a dependence when they

291
00:14:07,590 --> 00:14:10,680
get there in time, then it
become either true or anti.

292
00:14:10,680 --> 00:14:17,150
So it's always going to
be positive over time.

293
00:14:17,150 --> 00:14:19,050
So how about array accesses?

294
00:14:19,050 --> 00:14:21,930
So one element, you can figure
out what happened.

295
00:14:21,930 --> 00:14:23,890
So how do you do dependence
on array accesses?

296
00:14:23,890 --> 00:14:26,040
Now things get a little bit
complicated, because an array is

297
00:14:26,040 --> 00:14:28,210
not one element.

298
00:14:28,210 --> 00:14:29,750
So that's when you go to
dependence analysis.

299
00:14:32,660 --> 00:14:36,620
So I will describe this using
bunch of examples.

300
00:14:36,620 --> 00:14:39,710
So in order to look at arrays,
there are two spaces I need to

301
00:14:39,710 --> 00:14:40,960
worry about.

302
00:14:40,960 --> 00:14:44,930
One is the iteration space,
one is the data space.

303
00:14:44,930 --> 00:14:49,410
What we want to do is figure
out what happens at every

304
00:14:49,410 --> 00:14:52,700
iteration for data and what
the dependences are, and kind of

305
00:14:52,700 --> 00:14:55,020
summarize this down.

306
00:14:55,020 --> 00:14:58,320
We don't want to look at, say
OK, one iteration depend on

307
00:14:58,320 --> 00:15:00,470
second, two depend on third
-- you don't want to list

308
00:15:00,470 --> 00:15:01,090
everything.

309
00:15:01,090 --> 00:15:02,390
We need to come up
with a summary --

310
00:15:02,390 --> 00:15:05,300
that's what basically dependence
analysis will do.

311
00:15:05,300 --> 00:15:08,110
So if you have this access,
this is this loop.

312
00:15:08,110 --> 00:15:12,000
What happens is as we run down,
so iterations we are

313
00:15:12,000 --> 00:15:12,970
running down here.

314
00:15:12,970 --> 00:15:15,760
So we have iteration zero,
1, 2, 3, 4, 5.

315
00:15:15,760 --> 00:15:18,040
First do the read, write,
read, write.

316
00:15:18,040 --> 00:15:20,440
So this is kind of time
going down there.

317
00:15:20,440 --> 00:15:23,860
What you do is this one you are

318
00:15:23,860 --> 00:15:25,730
reading and you are writing.

319
00:15:25,730 --> 00:15:27,760
You're reading and writing,
so you have a

320
00:15:27,760 --> 00:15:29,690
dependence like that.

321
00:15:29,690 --> 00:15:30,940
You see the two anti-dependence.

322
00:15:34,410 --> 00:15:36,270
Read -- anti-dependence, I have

323
00:15:36,270 --> 00:15:38,090
anti-dependence going on here.

324
00:15:38,090 --> 00:15:39,690
If you look at it, here's
a dependence vector.

325
00:15:39,690 --> 00:15:42,270
What that means is there's a
dependence at each of those

326
00:15:42,270 --> 00:15:45,870
things in there -- that's
anti-dependence going on.

327
00:15:45,870 --> 00:15:49,020
One way to summarize this: what is my iteration.
this, what is my iteration.

328
00:15:49,020 --> 00:15:52,210
My iteration goes like --
what's my dependence.

329
00:15:52,210 --> 00:15:56,530
I have anti-dependence with the
same iteration, because my

330
00:15:56,530 --> 00:15:57,970
read and write has to be

331
00:15:57,970 --> 00:15:59,990
dependence in the same iteration.

332
00:15:59,990 --> 00:16:01,990
So this is a way to kind
of describe that.

333
00:16:01,990 --> 00:16:03,240
So a different one.

334
00:16:05,890 --> 00:16:07,060
This one.

335
00:16:07,060 --> 00:16:13,350
I did A[i+1] equals A[i]. So
what you realize is, in iteration

336
00:16:13,350 --> 00:16:18,880
zero, you read A[0]

337
00:16:18,880 --> 00:16:24,270
and you wrote A[1], and in
iteration 1, you read A[1]

338
00:16:24,270 --> 00:16:28,820
and wrote A[2], basically.

339
00:16:28,820 --> 00:16:31,210
Now what you have is your
dependence is like

340
00:16:31,210 --> 00:16:34,600
that, going like that.

341
00:16:34,600 --> 00:16:37,090
So if you look at what's
happening in here, if you

342
00:16:37,090 --> 00:16:39,890
summarize in here, what you
have is a dependence going

343
00:16:39,890 --> 00:16:42,960
like that in iteration space.

344
00:16:42,960 --> 00:16:46,300
So in iteration that means
iteration 1 is actually these

345
00:16:46,300 --> 00:16:49,130
two dependence, that uses
something that wrote iteration

346
00:16:49,130 --> 00:16:52,660
zero, iteration 2 you have
something iteration 1, and you

347
00:16:52,660 --> 00:16:55,530
have iteration going
like that.

348
00:16:55,530 --> 00:16:59,240
Sometimes this can be summarized
as the dependence

349
00:16:59,240 --> 00:17:00,490
vector of 1.

350
00:17:05,750 --> 00:17:10,700
Because the previous one was
zero because there's no loop

351
00:17:10,700 --> 00:17:11,480
carried dependence.

352
00:17:11,480 --> 00:17:13,650
In the outer loop there's
a dependence on 1.

353
00:17:13,650 --> 00:17:21,850
So if you have this one, i plus
2, of course, it gets

354
00:17:21,850 --> 00:17:26,510
carried 1 across in here and
then you have a 1 skipped

355
00:17:26,510 --> 00:17:30,080
representation in here.

356
00:17:30,080 --> 00:17:32,320
If you have 2i and 2i plus
1, what you realize

357
00:17:32,320 --> 00:17:34,370
is there's no overlap.

358
00:17:34,370 --> 00:17:35,660
So there's no basically
dependency.

359
00:17:38,840 --> 00:17:42,040
You kind of get how that
analytic goes.

360
00:17:42,040 --> 00:17:46,810
So, to find data dependence in
a loop, so there's a little

361
00:17:46,810 --> 00:17:47,530
bit of legalese.

362
00:17:47,530 --> 00:17:48,520
So let me try to do that.

363
00:17:48,520 --> 00:17:54,220
So for every pair of array
accesses, what you want to

364
00:17:54,220 --> 00:17:59,940
find is is there a dynamic
instance that happened?

365
00:17:59,940 --> 00:18:04,640
An iteration that wrote a value,
and another dynamic

366
00:18:04,640 --> 00:18:08,240
instance happened that later
that actually used that value.

367
00:18:08,240 --> 00:18:11,690
So the first access, so there's
a dynamic instance

368
00:18:11,690 --> 00:18:17,820
that's wrote, or that access,
and another iteration instance

369
00:18:17,820 --> 00:18:20,250
that also accessed the
same location later.

370
00:18:20,250 --> 00:18:21,980
And one of them has to
be a write, otherwise

371
00:18:21,980 --> 00:18:23,930
there's no true or anti.

372
00:18:23,930 --> 00:18:25,860
That's the notion about
the second one came

373
00:18:25,860 --> 00:18:28,350
after the first one.

374
00:18:28,350 --> 00:18:30,000
You can also look at
the same arrays.

375
00:18:30,000 --> 00:18:32,270
It doesn't have to be
different accesses, it can be the

376
00:18:32,270 --> 00:18:33,510
same array access if
you are writing.

377
00:18:33,510 --> 00:18:36,150
If you look at same array access
writing you can have

378
00:18:36,150 --> 00:18:37,370
output dependences also.

379
00:18:37,370 --> 00:18:41,580
So it's basically between a read
and a write, and a write

380
00:18:41,580 --> 00:18:42,590
and a write.

381
00:18:42,590 --> 00:18:45,600
Two different writes, it can
be the same write too.

382
00:18:45,600 --> 00:18:47,590
Key thing is we are looking
at location.

383
00:18:47,590 --> 00:18:49,405
We're not looking at value path
and say who's actually in

384
00:18:49,405 --> 00:18:52,560
the same location.

385
00:18:52,560 --> 00:18:55,360
Loop carried dependence
means the dependence

386
00:18:55,360 --> 00:18:57,790
crosses a loop boundary.

387
00:18:57,790 --> 00:19:03,100
That means the person who read
and person who wrote are in

388
00:19:03,100 --> 00:19:06,040
different loop iterations.

389
00:19:06,040 --> 00:19:08,290
If it's in the same iteration,
then it's all local, because

390
00:19:08,290 --> 00:19:10,570
in my iteration I deal with
that, I moved data around.

391
00:19:10,570 --> 00:19:13,300
But what I'm writing is used by
somebody else in different

392
00:19:13,300 --> 00:19:18,220
iteration, I have loop carried
dependence going on.

393
00:19:18,220 --> 00:19:20,880
Basic thing is if there's a loop
carried dependence, that loop is

394
00:19:20,880 --> 00:19:23,650
not parallelizable.

395
00:19:23,650 --> 00:19:26,800
What that means is I am writing
in one iteration of

396
00:19:26,800 --> 00:19:28,830
the loop and somebody is reading
in different iteration

397
00:19:28,830 --> 00:19:29,630
of the loop.

398
00:19:29,630 --> 00:19:31,960
That means I actually have to
move the data across, they can't

399
00:19:31,960 --> 00:19:33,340
happen in parallel.

400
00:19:33,340 --> 00:19:34,590
That's a very simple way
of looking at that.

401
00:19:37,930 --> 00:19:41,510
So, what we have done is --

402
00:19:41,510 --> 00:19:44,550
OK, the basic idea is how
to actually go and

403
00:19:44,550 --> 00:19:46,740
automate this process.

404
00:19:46,740 --> 00:19:49,050
The simple notion is called a
data dependence analysis, and

405
00:19:49,050 --> 00:19:51,850
I will give you a formulation
of that.

406
00:19:51,850 --> 00:19:57,700
So what you can formally do is
using a set of equations.

407
00:19:57,700 --> 00:20:01,140
So what you want to say is
there are two distinct

408
00:20:01,140 --> 00:20:03,110
iterations, one is the
write iteration,

409
00:20:03,110 --> 00:20:06,150
one is the read iteration.

410
00:20:06,150 --> 00:20:07,700
One iteration writes
the value, one

411
00:20:07,700 --> 00:20:09,200
iteration reads the value.

412
00:20:09,200 --> 00:20:14,200
So write iteration basically,
writes A[iw + 1], the

413
00:20:14,200 --> 00:20:16,620
read iteration reads A[ir].

414
00:20:16,620 --> 00:20:21,306
So we know both read and write
have to be within loop bound

415
00:20:21,306 --> 00:20:23,360
iteration, because we know
that because we can't be

416
00:20:23,360 --> 00:20:24,920
outside loop bounds.

417
00:20:24,920 --> 00:20:28,410
Then we also want to make sure
that the loop carried

418
00:20:28,410 --> 00:20:30,330
dependence, that means
read and write can't

419
00:20:30,330 --> 00:20:31,560
be in the same iteration.

420
00:20:31,560 --> 00:20:33,330
If it's in the same iteration,
I don't have loop carry

421
00:20:33,330 --> 00:20:34,120
dependence.

422
00:20:34,120 --> 00:20:37,070
I am looking for loop carry
dependence at this point.

423
00:20:37,070 --> 00:20:41,250
Then we want both of
the read and the write

424
00:20:41,250 --> 00:20:42,790
to access the same location.

425
00:20:42,790 --> 00:20:44,580
That means the access functions
have to be the same.

426
00:20:44,580 --> 00:20:48,550
So the write access function is
iw plus 1, and the read access

427
00:20:48,550 --> 00:20:51,330
function is ir.

428
00:20:51,330 --> 00:20:55,380
So the key thing is now we
have a set of equations.

429
00:20:55,380 --> 00:20:59,460
Are there any values for i
and j, integer values -- I'm

430
00:20:59,460 --> 00:21:03,470
sorry, iw and ir -- for which
these equations are true?

431
00:21:03,470 --> 00:21:06,140
If that is the case, we can say
ah-ha, that is the case,

432
00:21:06,140 --> 00:21:10,630
there are two different iterations
in which the write and read happen

433
00:21:10,630 --> 00:21:13,210
-- one write iteration, one

434
00:21:13,210 --> 00:21:16,970
read iteration -- accessing
the same location.

435
00:21:16,970 --> 00:21:18,460
Therefore that's a different
[OBSCURED].

436
00:21:18,460 --> 00:21:18,840
Is this true?

437
00:21:18,840 --> 00:21:20,280
Is there a set of values
that makes this true?

438
00:21:28,710 --> 00:21:33,480
Yeah, I mean you can do,
say, iw equals 1

439
00:21:33,480 --> 00:21:36,120
and ir equals 2.

440
00:21:36,120 --> 00:21:38,670
So there's a value in there so
these equations will come up

441
00:21:38,670 --> 00:21:43,560
with a solution, and at that
point you have a dependency.
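
The test just described can be sketched in a few lines of Python. This is only an illustration: the slide is not part of the transcript, so the loop body `A[i + 1] = A[i]` and the bounds `0 .. n-1` are assumptions, and the function name is hypothetical. It enumerates every pair of a write iteration `iw` and a read iteration `ir` instead of solving the equations symbolically.

```python
def loop_carried_pairs(n, write_index, read_index):
    """Return every (iw, ir) with iw != ir that touches the same array element."""
    pairs = []
    for iw in range(n):              # write iteration, inside the loop bounds
        for ir in range(n):          # read iteration, inside the loop bounds
            # distinct iterations whose access functions hit the same location
            if iw != ir and write_index(iw) == read_index(ir):
                pairs.append((iw, ir))
    return pairs


# Assumed loop body: A[i + 1] = A[i]
pairs = loop_carried_pairs(4, lambda i: i + 1, lambda i: i)
print(pairs)  # a non-empty list means there is a loop-carried dependence
```

Any non-empty result, such as the pair iw = 1, ir = 2 mentioned in the lecture, is a witness that the system of equations has an integer solution.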

442
00:21:43,560 --> 00:21:51,670
AUDIENCE: [NOISE]

443
00:21:51,670 --> 00:21:56,260
PROFESSOR: So that's very easy
to make this formulation.

444
00:21:56,260 --> 00:21:59,620
So if the indices are calculated
with something other than the

445
00:21:59,620 --> 00:22:02,670
loop variables, I can't write
the formulation.

446
00:22:02,670 --> 00:22:07,250
So the only way I can do this
analysis is if the indices are

447
00:22:07,250 --> 00:22:09,110
constant or affine.

448
00:22:09,110 --> 00:22:16,540
Take A of B of i. So if my
array access is A of B of i, I don't

449
00:22:16,540 --> 00:22:21,790
know how the index works
if you have A of B of i.

450
00:22:21,790 --> 00:22:24,810
I have no idea which element of A
it is without knowing the

451
00:22:24,810 --> 00:22:25,770
values of B of i.

452
00:22:25,770 --> 00:22:27,780
And B of i, I can't
summarize it.

453
00:22:27,780 --> 00:22:31,330
Each B of i might be different
and I can't come up with this

454
00:22:31,330 --> 00:22:34,750
nice single formulation that
can check out every B of i.

455
00:22:34,750 --> 00:22:36,330
And I'm in big trouble.
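
A small sketch of why the indirect access is a problem, assuming a loop of the form `A[B[i]] = ...` (the exact loop on the slide is not in the transcript, and the function name is hypothetical). Whether two iterations collide depends entirely on the run-time contents of `B`, which a compile-time formulation cannot see.

```python
def has_loop_carried_conflict(B):
    """True if two different iterations of `A[B[i]] = ...` write the same element."""
    seen = set()
    for target in B:
        if target in seen:       # an earlier iteration already wrote A[target]
            return True
        seen.add(target)
    return False


print(has_loop_carried_conflict([0, 1, 2, 3]))  # every iteration writes its own element
print(has_loop_carried_conflict([0, 2, 2, 1]))  # iterations 1 and 2 both write A[2]
```

The same loop is parallel for the first `B` and not for the second, so the answer cannot be read off the loop text alone.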

456
00:22:36,330 --> 00:22:50,070
This is doable, but this is
not easy to do like this.

457
00:22:50,070 --> 00:22:50,780
Question?

458
00:22:50,780 --> 00:22:53,150
AUDIENCE: [NOISE]

459
00:22:53,150 --> 00:22:54,000
PROFESSOR: Yeah, that's right.

460
00:22:54,000 --> 00:22:57,800
So that the interesting thing
that you're not looking at.

461
00:22:57,800 --> 00:23:00,400
Because when we summarized it,
because what you are going to

462
00:23:00,400 --> 00:23:02,400
do is we are trying to summarize
for everything,

463
00:23:02,400 --> 00:23:05,730
every iteration, and we are not
trying to divide it into

464
00:23:05,730 --> 00:23:07,860
saying OK, can I find
the parallel groups.

465
00:23:07,860 --> 00:23:08,480
Yes.

466
00:23:08,480 --> 00:23:10,340
You can do some more complicated
analysis and do

467
00:23:10,340 --> 00:23:11,410
something like that.

468
00:23:11,410 --> 00:23:13,060
Yes.

469
00:23:13,060 --> 00:23:15,850
So the other interesting thing is,
OK, the next thing you want to

470
00:23:15,850 --> 00:23:20,020
see is whether you can find
an output dependence.

471
00:23:20,020 --> 00:23:22,365
OK, are there two different
iterations that are

472
00:23:22,365 --> 00:23:25,360
writing the same thing?

473
00:23:25,360 --> 00:23:29,350
What that means is the
iterations are I1, I2, and I1

474
00:23:29,350 --> 00:23:33,190
not equals I2, and I1 plus
1 equals I2 plus one.

475
00:23:33,190 --> 00:23:37,120
There's no solution to this one
because I1 has to be

476
00:23:37,120 --> 00:23:39,820
equal to I2 according to this,
and I1 cannot be equal to I2

477
00:23:39,820 --> 00:23:40,400
according to this one.

478
00:23:40,400 --> 00:23:44,020
That says OK, look, I don't have
an output dependence because

479
00:23:44,020 --> 00:23:45,880
it can't be satisfied.

480
00:23:45,880 --> 00:23:49,880
OK, so here I know I have
a loop-carried --

481
00:23:49,880 --> 00:23:52,220
I haven't said true or
anti; it depends on which

482
00:23:52,220 --> 00:23:54,370
direction this is.

483
00:23:54,370 --> 00:23:57,386
True or anti-dependence, but I don't
have a loop-carried output

484
00:23:57,386 --> 00:24:01,070
[OBSCURED].

485
00:24:01,070 --> 00:24:02,870
So how do we generalize this?

486
00:24:02,870 --> 00:24:06,410
So what you can do is use an integer
vector I, so in order

487
00:24:06,410 --> 00:24:08,940
to generalize this, you can
use integer programming.

488
00:24:08,940 --> 00:24:11,445
How many of you know integer
programming or linear

489
00:24:11,445 --> 00:24:12,260
programming?

490
00:24:12,260 --> 00:24:14,390
OK.

491
00:24:14,390 --> 00:24:18,350
We are not going to go into
detail, but I'll tell you what

492
00:24:18,350 --> 00:24:19,280
actually happens.

493
00:24:19,280 --> 00:24:24,050
So integer programming says
there's a vector of variables

494
00:24:24,050 --> 00:24:28,570
I, and if you have a formulation
like that, where A is an

495
00:24:28,570 --> 00:24:32,360
array, A I is less than or equal
to B, and A and B are all

496
00:24:32,360 --> 00:24:38,230
constant integers, then you can
use integer programming

497
00:24:38,230 --> 00:24:42,290
to see whether there's a
solution for I or not.

498
00:24:42,290 --> 00:24:45,120
This is if you do things like
operations research, there's a

499
00:24:45,120 --> 00:24:46,890
lot of work around it.

500
00:24:46,890 --> 00:24:49,350
People actually want to know
what the value of I is. We don't care

501
00:24:49,350 --> 00:24:51,445
that much what the values are,
we just want to know whether

502
00:24:51,445 --> 00:24:53,520
there's a solution or not.

503
00:24:53,520 --> 00:24:55,600
If there's a solution, we know
that there's a dependence.

504
00:24:55,600 --> 00:24:57,520
If there's no solution we know
there's no dependence.

505
00:24:57,520 --> 00:24:59,810
So what we need to do is
get this set of equations and

506
00:24:59,810 --> 00:25:02,420
put it in that form.

507
00:25:02,420 --> 00:25:03,140
That's simple.

508
00:25:03,140 --> 00:25:08,350
For example, what you want
is A I less than B --

509
00:25:08,350 --> 00:25:14,680
that means you have constants
A1 I1, plus A2 I2, which is

510
00:25:14,680 --> 00:25:19,870
less than or equal to
B. So you want to have

511
00:25:19,870 --> 00:25:22,500
this kind of a system.

512
00:25:22,500 --> 00:25:27,050
Not equals doesn't really
belong there.

513
00:25:27,050 --> 00:25:29,390
So the way you deal with not
equals is you do it as two

514
00:25:29,390 --> 00:25:34,710
different problems. You can say
iw less than ir is one

515
00:25:34,710 --> 00:25:39,590
problem, and iw greater than
ir is the other problem, and if

516
00:25:39,590 --> 00:25:42,070
either problem has a solution,
you have a dependence.

517
00:25:42,070 --> 00:25:44,710
So that means one is true
and one is anti.

518
00:25:44,710 --> 00:25:46,970
You can see the true dependence
or anti-dependence,

519
00:25:46,970 --> 00:25:50,580
you can look at that.

520
00:25:50,580 --> 00:25:52,610
This one is a little
bit easier.

521
00:25:52,610 --> 00:25:56,890
This is less than, not
actually less than --

522
00:25:56,890 --> 00:25:58,140
less than equal.

523
00:26:01,900 --> 00:26:04,520
How do you deal with equal?

524
00:26:04,520 --> 00:26:06,400
So the way you deal with equal
is you write in both

525
00:26:06,400 --> 00:26:07,330
directions.

526
00:26:07,330 --> 00:26:11,450
So A less than or equal
to B and B less

527
00:26:11,450 --> 00:26:14,464
than or equal to A means A
actually is equal to B. So you

528
00:26:14,464 --> 00:26:17,413
can actually write two different
inequalities and get the equality

529
00:26:17,413 --> 00:26:17,840
down there.

530
00:26:17,840 --> 00:26:20,850
So you have to kind of massage
things a little bit in here.

531
00:26:20,850 --> 00:26:27,620
So here are our original
iteration bounds, and here's

532
00:26:27,620 --> 00:26:32,800
our one problem because we are
saying the write happens before

533
00:26:32,800 --> 00:26:33,950
the read, so this is the true
dependence that

534
00:26:33,950 --> 00:26:37,050
we are looking at.

535
00:26:37,050 --> 00:26:43,550
This is saying that write
location is the same as the

536
00:26:43,550 --> 00:26:45,560
read location and this is equal,
so I have two different

537
00:26:45,560 --> 00:26:46,930
equations in here.

538
00:26:46,930 --> 00:26:49,520
So kind of massage this a little
bit to put it in i

539
00:26:49,520 --> 00:26:52,840
form, and we can come
up with A's and B's.

540
00:26:52,840 --> 00:26:56,690
These are just manual steps,
A's and B's, and now we are

541
00:26:56,690 --> 00:27:02,050
going to throw it into some
super duper integer linear

542
00:27:02,050 --> 00:27:05,440
program package and it will say
yes or no and you're set.

543
00:27:08,540 --> 00:27:09,820
And of course, you had
to do another problem

544
00:27:09,820 --> 00:27:12,370
for the other side.
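
To make the "massage it into A I less than or equal to B" step concrete, here is a hedged sketch. It assumes the running example `A[i + 1] = A[i]` with bounds `0 .. n-1` (the slide is not in the transcript) and, instead of a real integer linear programming package, it simply enumerates a small box of integer points. The names `feasible`, `common_rows` and so on are illustrative only.

```python
from itertools import product


def feasible(rows, bounds, lo, hi):
    """Return an integer point x with rows @ x <= bounds, or None if there is none.

    Brute force over the box [lo, hi]^k; a real compiler would call an ILP solver.
    """
    for point in product(range(lo, hi + 1), repeat=len(rows[0])):
        if all(sum(a * x for a, x in zip(row, point)) <= b
               for row, b in zip(rows, bounds)):
            return point
    return None


n = 5
# Variables are (iw, ir). Each row is one inequality "a1*iw + a2*ir <= b".
common_rows = [(-1, 0), (1, 0),      # 0 <= iw <= n-1
               (0, -1), (0, 1),      # 0 <= ir <= n-1
               (1, -1), (-1, 1)]     # iw + 1 == ir, written as two inequalities
common_b = [0, n - 1, 0, n - 1, -1, 1]

# "Not equal" is split into two problems; strict "<" on integers becomes "<= -1".
write_first = feasible(common_rows + [(1, -1)], common_b + [-1], 0, n - 1)  # iw < ir
read_first = feasible(common_rows + [(-1, 1)], common_b + [-1], 0, n - 1)   # iw > ir
print(write_first, read_first)
```

Under these assumptions only the "write before read" problem has a solution, which corresponds to a true dependence and no anti-dependence for this loop.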

545
00:27:12,370 --> 00:27:16,780
You can generalize it for much
more complex loop nests. So if

546
00:27:16,780 --> 00:27:19,310
you have this complicated loop
nest in here, if

547
00:27:19,310 --> 00:27:21,950
you've got n-deep nesting, you have
to solve 2n problems

548
00:27:21,950 --> 00:27:23,720
with all these different
constraints.

549
00:27:23,720 --> 00:27:24,590
I'm not going to go over this.

550
00:27:24,590 --> 00:27:28,090
I have the slides in here.

551
00:27:28,090 --> 00:27:31,820
So that's the single
dimension.

552
00:27:31,820 --> 00:27:35,770
So how about multi-dimension
dependences?

553
00:27:35,770 --> 00:27:39,580
So I have a two-dimensional
iteration space here, and I

554
00:27:39,580 --> 00:27:43,350
have A of i, j equals A of i, j minus 1.

555
00:27:43,350 --> 00:27:45,140
That's my iteration space.

556
00:27:45,140 --> 00:27:47,240
What does my dependence
look like?

557
00:27:47,240 --> 00:27:48,490
We have arrows too.

558
00:27:58,480 --> 00:27:59,730
Which direction are
the arrows going?

559
00:27:59,730 --> 00:28:02,970
AUDIENCE: [OBSCURED]

560
00:28:02,970 --> 00:28:04,840
PROFESSOR: We have something
like this.

561
00:28:04,840 --> 00:28:06,680
Yup.

562
00:28:06,680 --> 00:28:10,990
We have something like this
because that's J minus 1, the

563
00:28:10,990 --> 00:28:12,470
I's are the same.

564
00:28:12,470 --> 00:28:16,750
Of course, if you have it the other
way around, they go the other

565
00:28:16,750 --> 00:28:20,030
direction; one is an anti and one
is a true dependence, so you

566
00:28:20,030 --> 00:28:22,410
can figure that one out.

567
00:28:22,410 --> 00:28:23,730
And do something complicated.

568
00:28:23,730 --> 00:28:25,670
First one.

569
00:28:25,670 --> 00:28:30,750
So A of i, j; A of i minus 1, j plus 1.

570
00:28:30,750 --> 00:28:32,580
Which has to be diagonal.

571
00:28:32,580 --> 00:28:37,910
Which diagonal does it go?

572
00:28:37,910 --> 00:28:39,280
This way or this way?

573
00:28:42,900 --> 00:28:44,150
Who says this way?

574
00:28:46,910 --> 00:28:48,160
Who says this way?

575
00:28:51,820 --> 00:28:57,330
So, this is actually going
in this direction.

576
00:29:00,630 --> 00:29:02,680
This is where you have to
actually think which iteration

577
00:29:02,680 --> 00:29:04,750
is actually write and
read in here.

578
00:29:04,750 --> 00:29:06,200
So things get complicated.
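
The direction of the arrows in a two-dimensional iteration space can be checked mechanically. The sketch below is an illustration only: it assumes the two loop bodies are `A[i][j] = A[i][j-1]` and `A[i][j] = A[i-1][j+1]` (reconstructed from the spoken description, since the slides are not in the transcript) and reports the distance from the writing iteration to the reading iteration.

```python
from itertools import product


def distance_vectors(n, write_index, read_index):
    """Set of (di, dj) from a writing iteration to a reading iteration of the same element."""
    space = list(product(range(n), repeat=2))    # all (i, j) iterations
    found = set()
    for w in space:
        for r in space:
            if w != r and write_index(*w) == read_index(*r):
                found.add((r[0] - w[0], r[1] - w[1]))
    return found


# Assumed body 1: A[i][j] = A[i][j - 1]
same_row = distance_vectors(4, lambda i, j: (i, j), lambda i, j: (i, j - 1))
# Assumed body 2: A[i][j] = A[i - 1][j + 1]
diagonal = distance_vectors(4, lambda i, j: (i, j), lambda i, j: (i - 1, j + 1))
print(same_row, diagonal)
```

In the first case the reader is one step ahead in j with the same i; in the second the reader is one step ahead in i and one step back in j, which is the diagonal arrow.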

579
00:29:06,200 --> 00:29:08,060
This one is even more
interesting.

580
00:29:08,060 --> 00:29:08,770
This one.

581
00:29:08,770 --> 00:29:11,715
There's only a one-dimensional
array but a two-dimensional loop

582
00:29:11,715 --> 00:29:17,250
nest. So what that
means is who's

583
00:29:17,250 --> 00:29:18,530
writing and who's reading?

584
00:29:23,550 --> 00:29:26,580
If you look at it basically --

585
00:29:26,580 --> 00:29:28,790
actually this actually is a
little bit wrong, because the

586
00:29:28,790 --> 00:29:37,680
dependence analysis says --
actually, all these things,

587
00:29:37,680 --> 00:29:41,620
all this read has to go into
all the write, because they

588
00:29:41,620 --> 00:29:44,980
are writing any J, just writing
the same thing.

589
00:29:44,980 --> 00:29:46,460
So this is a little bit wrong.

590
00:29:46,460 --> 00:29:48,620
This is actually more
data flow analysis.

591
00:29:48,620 --> 00:29:52,070
This is a different -- their
dependence means I don't care

592
00:29:52,070 --> 00:29:54,900
who the guy wrote, because he's
the last guy who wrote,

593
00:29:54,900 --> 00:29:57,000
but everybody's reading,
everybody else is writing the

594
00:29:57,000 --> 00:30:01,880
same location.

595
00:30:01,880 --> 00:30:02,010
AUDIENCE: [OBSCURED].

596
00:30:02,010 --> 00:30:03,370
PROFESSOR: Keep rewriting
the same thing again

597
00:30:03,370 --> 00:30:05,060
and again and again.

598
00:30:05,060 --> 00:30:06,570
You start depending on --

599
00:30:06,570 --> 00:30:12,140
It's not dependent on J, it's
dependent on I. But the location

600
00:30:12,140 --> 00:30:14,840
says you have iterations
writing to the same

601
00:30:14,840 --> 00:30:22,030
location for different J. So no
matter what J, it's writing to

602
00:30:22,030 --> 00:30:23,280
the same location.

603
00:30:25,800 --> 00:30:27,010
You know what I'm saying?

604
00:30:27,010 --> 00:30:30,180
Because J thinks J.

605
00:30:30,180 --> 00:30:34,640
AUDIENCE: [NOISE].

606
00:30:34,640 --> 00:30:36,770
PROFESSOR: This is
iteration space.

607
00:30:36,770 --> 00:30:37,830
I am looking at iteration.

608
00:30:37,830 --> 00:30:38,030
I am looking at I and J.

609
00:30:38,030 --> 00:30:39,790
AUDIENCE: [OBSCURED].

610
00:30:39,790 --> 00:30:42,640
PROFESSOR: B is a one
dimensional array.

611
00:30:42,640 --> 00:30:44,370
So B is a one dimensional
array.

612
00:30:44,370 --> 00:30:45,430
So what that means is --

613
00:30:45,430 --> 00:30:47,840
The reason I'm saying that is the
iteration space and array

614
00:30:47,840 --> 00:30:53,300
space don't match.

615
00:30:53,300 --> 00:30:54,760
I'll correct this and put it
in there because this is a

616
00:30:54,760 --> 00:30:55,740
data flow diagram.

617
00:30:55,740 --> 00:30:56,990
It's row independent.

618
00:30:58,800 --> 00:31:01,230
This one writing to what?

619
00:31:01,230 --> 00:31:04,590
AUDIENCE: [OBSCURED].

620
00:31:04,590 --> 00:31:08,390
PROFESSOR: Iteration space
is I and J. So, this

621
00:31:08,390 --> 00:31:09,470
is writing to what?

622
00:31:09,470 --> 00:31:12,240
I zero is --

623
00:31:12,240 --> 00:31:15,120
This is writing to what?

624
00:31:15,120 --> 00:31:16,370
B1.

625
00:31:18,720 --> 00:31:20,450
All those things are
writing to B1.

626
00:31:23,070 --> 00:31:24,360
This is really --

627
00:31:29,920 --> 00:31:33,860
So this is writing to B1,
this is reading B zero.

628
00:31:33,860 --> 00:31:36,210
So this iteration is
reading B1 again.

629
00:31:36,210 --> 00:31:37,990
So this was B1, this
is iteration B1.

630
00:31:37,990 --> 00:31:41,570
So each of these is writing to
B1, each of these are reading

631
00:31:41,570 --> 00:31:47,000
from B1, so each has to be
dependent on each other.

632
00:31:47,000 --> 00:31:48,550
AUDIENCE: So I guess one thing
that's confusing here is why

633
00:31:48,550 --> 00:31:51,578
isn't it just -- why don't we
just have arrows going down

634
00:31:51,578 --> 00:31:52,070
the column?

635
00:31:52,070 --> 00:31:53,550
Why do we have all these--?

636
00:31:53,550 --> 00:31:56,420
PROFESSOR: Arrows going down
the column means each is

637
00:31:56,420 --> 00:31:58,860
trying to do different
location.

638
00:31:58,860 --> 00:32:02,030
So what happens is that
this one, arrows

639
00:32:02,030 --> 00:32:03,280
going down this way.

640
00:32:03,280 --> 00:32:07,350
Is this one -- what's wrote here
is only that location,

641
00:32:07,350 --> 00:32:09,830
only this side I accidentally
located.

642
00:32:09,830 --> 00:32:12,390
These are all writing to the
same location and reading from

643
00:32:12,390 --> 00:32:13,210
the same location.

644
00:32:13,210 --> 00:32:16,180
AUDIENCE: Why isn't
B iterated?

645
00:32:16,180 --> 00:32:17,390
PROFESSOR: This is
iteration space.

646
00:32:17,390 --> 00:32:18,620
I have two different
loops here.

647
00:32:18,620 --> 00:32:22,120
AUDIENCE: But I don't understand
why B [NOISE.]

648
00:32:22,120 --> 00:32:24,110
PROFESSOR: This is my program.

649
00:32:24,110 --> 00:32:25,250
I can write this program.

650
00:32:25,250 --> 00:32:27,573
This is a little bit of a stupid
program because I am

651
00:32:27,573 --> 00:32:30,090
kind of trying to do the same
thing again and again.

652
00:32:30,090 --> 00:32:35,800
But hey, my program doesn't say
array dimensions have to

653
00:32:35,800 --> 00:32:36,790
match your loop dimensions.

654
00:32:36,790 --> 00:32:39,050
It doesn't say that so you can
have programs like that.

655
00:32:39,050 --> 00:32:40,300
You can have other way too.

656
00:32:42,440 --> 00:32:47,800
So the key thing is to make --
don't confuse iteration space

657
00:32:47,800 --> 00:32:48,750
versus array space.

658
00:32:48,750 --> 00:32:50,280
They are two different spaces,
two different number of

659
00:32:50,280 --> 00:32:50,980
dimensions.

660
00:32:50,980 --> 00:32:52,440
That's all the point that
I'm going to make here.

661
00:32:55,360 --> 00:32:58,645
So by doing dependence analysis,
you can figure out

662
00:32:58,645 --> 00:33:00,410
-- now you can formulate
this nicely --

663
00:33:00,410 --> 00:33:03,550
figure out where the
loops are parallel.

664
00:33:03,550 --> 00:33:06,480
So that's really neat.

665
00:33:06,480 --> 00:33:09,620
The next thing I'm going to go
into is trying to figure out how

666
00:33:09,620 --> 00:33:11,970
you can increase the parallelism
opportunities.

667
00:33:11,970 --> 00:33:14,550
Because there might be cases
where, in the original code you

668
00:33:14,550 --> 00:33:17,350
wrote, there might be some
loops that are not

669
00:33:17,350 --> 00:33:20,580
parallelizable as is, and can
you go and increase that.

670
00:33:20,580 --> 00:33:22,750
So I'm going to talk about a few
different possibilities of

671
00:33:22,750 --> 00:33:24,000
doing that.

672
00:33:25,880 --> 00:33:28,270
Scalar privatization -- I will
just go into each of these

673
00:33:28,270 --> 00:33:30,550
separately.

674
00:33:30,550 --> 00:33:33,040
So here is an interesting
program.

675
00:33:33,040 --> 00:33:37,490
You write a value to the
temporary and use the

676
00:33:37,490 --> 00:33:39,080
temporary in here.

677
00:33:39,080 --> 00:33:41,080
You might not know you had
written that, but the compiler

678
00:33:41,080 --> 00:33:42,950
normally generates something
like that because you always

679
00:33:42,950 --> 00:33:44,790
have temporaries in here,
so this might be

680
00:33:44,790 --> 00:33:46,460
what the compiler generates.

681
00:33:46,460 --> 00:33:47,240
Is this loop parallel?

682
00:33:47,240 --> 00:33:56,020
AUDIENCE: Yup.

683
00:33:56,020 --> 00:33:56,290
PROFESSOR: Why?

684
00:33:56,290 --> 00:34:00,000
AUDIENCE: [OBSCURED].

685
00:34:00,000 --> 00:34:02,150
PROFESSOR: Is the loop-carried
dependence true or anti --

686
00:34:02,150 --> 00:34:05,820
What's the true dependence,
which to which?

687
00:34:05,820 --> 00:34:08,260
Within the loop, a true
dependence.

688
00:34:08,260 --> 00:34:09,510
What is the loop-carried
dependence?

689
00:34:12,810 --> 00:34:14,070
Anti-dependence.

690
00:34:14,070 --> 00:34:20,710
Because I cannot -- you see I
equals 1 basically wrote here

691
00:34:20,710 --> 00:34:21,820
and is reading it.

692
00:34:21,820 --> 00:34:26,170
I can't write x in I equals 2 until
I equals 1 is done

693
00:34:26,170 --> 00:34:26,870
reading that.

694
00:34:26,870 --> 00:34:29,210
I have one location and
everybody's trying to read or

695
00:34:29,210 --> 00:34:31,450
write that, even though I
don't really use data.

696
00:34:31,450 --> 00:34:32,860
This is the sad thing
about this.

697
00:34:32,860 --> 00:34:34,900
I'm not really using this
guy's data, but I'm just

698
00:34:34,900 --> 00:34:36,730
waiting for the same
space to occupy.

699
00:34:39,510 --> 00:34:43,410
So, there's a loop-carried
dependence in here, and it's

700
00:34:43,410 --> 00:34:45,330
an anti-dependence.

701
00:34:45,330 --> 00:34:49,040
So what you can do is if you find
any anti or output loop-carried

702
00:34:49,040 --> 00:34:50,880
dependence, you
can get rid of it.

703
00:34:50,880 --> 00:34:53,220
I'm not really using that value,
I'm just keeping a

704
00:34:53,220 --> 00:34:54,430
location in here.

705
00:34:54,430 --> 00:34:55,820
So how can we get rid of that?

706
00:34:55,820 --> 00:35:01,670
AUDIENCE: [OBSCURED].

707
00:35:01,670 --> 00:35:02,040
PROFESSOR: Yeah.

708
00:35:02,040 --> 00:35:03,100
That's one thing.

709
00:35:03,100 --> 00:35:03,970
There's two ways of doing it.

710
00:35:03,970 --> 00:35:07,210
One is I assign something
local.

711
00:35:07,210 --> 00:35:11,060
So each processor will
have its own copy,

712
00:35:11,060 --> 00:35:12,760
so I don't do that.

713
00:35:12,760 --> 00:35:17,670
So it's something like this,
so that's [OBSCURED].

714
00:35:17,670 --> 00:35:21,300
Or I can make it an array.

715
00:35:21,300 --> 00:35:23,480
In the array you can have either
the number of processors or

716
00:35:23,480 --> 00:35:24,860
the number of iterations, so each iteration

717
00:35:24,860 --> 00:35:27,590
uses a different location.

718
00:35:27,590 --> 00:35:30,510
This is more efficient than
this one because we are

719
00:35:30,510 --> 00:35:34,330
touching a lot more locations
in here.

720
00:35:34,330 --> 00:35:36,330
I haven't done one thing here.

721
00:35:36,330 --> 00:35:37,210
I'm not complete.

722
00:35:37,210 --> 00:35:39,640
What have I forgotten to
do in both of these?

723
00:35:39,640 --> 00:35:43,070
AUDIENCE: [OBSCURED].

724
00:35:43,070 --> 00:35:45,980
PROFESSOR: Yeah, because
afterward somebody might

725
00:35:45,980 --> 00:35:47,880
use the final assignment from the loop
nest, so what you have to

726
00:35:47,880 --> 00:35:50,690
do is you have to kind
of finalize x.

727
00:35:50,690 --> 00:35:53,730
Because I had a temporary
variable, so at the end, the last

728
00:35:53,730 --> 00:35:56,940
value has to go into x.

729
00:35:56,940 --> 00:35:58,570
You can't keep just not

730
00:35:58,570 --> 00:36:00,740
calculating value in something.

731
00:36:00,740 --> 00:36:03,270
So in here, also, you just
say last value is x.

732
00:36:03,270 --> 00:36:06,390
But after you do that, basically
now each of this

733
00:36:06,390 --> 00:36:07,640
loop is faster.
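
Scalar privatization, including the "finalize x" step, can be sketched as follows. The loop body (`x = A[i] + 1; B[i] = x * 2`) is a made-up stand-in for the slide, and the privatized version runs the iterations in reverse order only to show that the order no longer matters; it does not actually use threads.

```python
def sequential(A):
    """Original loop: one shared temporary x reused by every iteration."""
    B = [0] * len(A)
    x = None
    for i in range(len(A)):
        x = A[i] + 1          # write the shared temporary
        B[i] = x * 2          # read it back in the same iteration
    return B, x


def privatized(A):
    """Each iteration gets its own temporary, so iterations are independent."""
    B = [0] * len(A)

    def body(i):
        x_private = A[i] + 1  # private copy: no anti or output dependence on x
        B[i] = x_private * 2
        return x_private

    last = None
    for i in reversed(range(len(A))):   # any order gives the same result
        value = body(i)
        if i == len(A) - 1:
            last = value                # remember the value of the final iteration
    x = last                            # finalize: x must hold the last value
    return B, x


print(sequential([3, 1, 4]), privatized([3, 1, 4]))
```

The final assignment matters only if code after the loop reads `x`; it is the step the lecture says is easy to forget.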

734
00:36:10,100 --> 00:36:11,350
Everybody got that?

735
00:36:13,420 --> 00:36:16,090
OK, here's another example.

736
00:36:16,090 --> 00:36:19,110
x equals x plus A of i.

737
00:36:19,110 --> 00:36:20,360
Do I have a loop-carried
dependence?

738
00:36:30,780 --> 00:36:32,780
What is the loop-carried
dependence?

739
00:36:32,780 --> 00:36:34,030
True or anti?

740
00:36:39,120 --> 00:36:39,400
True dependence.

741
00:36:39,400 --> 00:36:43,600
So this guy is actually reading
the previous value and

742
00:36:43,600 --> 00:36:45,800
adding something to it.

743
00:36:45,800 --> 00:36:48,020
So of course with a true dependence
I cannot seem to

744
00:36:48,020 --> 00:36:48,940
parallelize.

745
00:36:48,940 --> 00:36:51,760
But there are some interesting
things we can do.

746
00:36:51,760 --> 00:36:55,740
That was an associative
operation.

747
00:36:55,740 --> 00:36:58,300
I didn't care in which order these
additions happened, so I'm just

748
00:36:58,300 --> 00:37:00,330
accumulating a bunch
of values in here.

749
00:37:00,330 --> 00:37:03,710
And the results were never
used inside the loop.

750
00:37:03,710 --> 00:37:05,700
So we just keep adding things
and at the end of the loop you

751
00:37:05,700 --> 00:37:08,600
get the sum total in here.

752
00:37:08,600 --> 00:37:10,580
I never used any kind of partial
values anywhere.

753
00:37:10,580 --> 00:37:12,130
So that gives the idea.

754
00:37:12,130 --> 00:37:17,870
So what you can do is we can
translate this into each of

755
00:37:17,870 --> 00:37:21,580
the guys doing a temporary
addition

756
00:37:21,580 --> 00:37:22,460
into its own variable.

757
00:37:22,460 --> 00:37:27,650
So each processor, just
do a partial sum.

758
00:37:27,650 --> 00:37:31,390
At the end, once they're done,
you basically do the full sum.
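
The partial-sum transformation can be sketched like this. It is a sequential simulation, not real parallel code: `workers` and the strided split of the iterations are assumptions made for the example, and each inner loop stands for the work one processor would do on its own.

```python
def parallel_style_sum(A, workers):
    """Sum A via per-worker partial sums followed by a final combine step."""
    partial = [0] * workers                    # one private accumulator per worker
    for p in range(workers):                   # each p could run on its own processor
        for i in range(p, len(A), workers):    # worker p takes every workers-th element
            partial[p] += A[i]
    total = 0
    for p in range(workers):                   # combine once all workers are done
        total += partial[p]
    return total


print(parallel_style_sum(list(range(10)), 3))  # same result as sum(range(10))
```

With integers the result is identical to the sequential loop. With floating-point values the changed order of additions can change the rounding, which is the numerical stability concern raised a little later in the lecture.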

759
00:37:31,390 --> 00:37:33,290
Of course, you can do a tree
or whatever much more

760
00:37:33,290 --> 00:37:35,700
complicated thing than that --
you can also parallelize this

761
00:37:35,700 --> 00:37:38,050
part as a tree addition.

762
00:37:38,050 --> 00:37:39,130
But you can do that.

763
00:37:39,130 --> 00:37:43,170
I mean Roderick talked about
this in hand parallelization.

764
00:37:43,170 --> 00:37:46,040
But we are doing something
very simple in here.

765
00:37:46,040 --> 00:37:50,150
So these compilers can figure
out associative

766
00:37:50,150 --> 00:37:51,950
operations and do that.

767
00:37:51,950 --> 00:37:55,020
So this is where all the
people who are in

768
00:37:55,020 --> 00:37:57,720
parallelizing, and all the
people who are writing this

769
00:37:57,720 --> 00:38:00,100
scientific code kind of start
having arguments.

770
00:38:00,100 --> 00:38:02,770
Because they say oh my God,
you're doing operations and

771
00:38:02,770 --> 00:38:05,700
it's going to have numerical
stability issues.

772
00:38:05,700 --> 00:38:06,720
Yes all true.

773
00:38:06,720 --> 00:38:09,260
In compilers you have these
flags that say OK, just forget

774
00:38:09,260 --> 00:38:12,800
about all these very issues, and
most probably it will be

775
00:38:12,800 --> 00:38:15,320
right, and in most code
it will work.

776
00:38:15,320 --> 00:38:18,610
You might find that problem,
too -- you change the operation

777
00:38:18,610 --> 00:38:21,370
order to get some parallelism
and suddenly you are running

778
00:38:21,370 --> 00:38:22,960
into instability.

779
00:38:22,960 --> 00:38:25,190
There are some algorithms that
you can't do that, but most

780
00:38:25,190 --> 00:38:26,440
algorithms you can.

781
00:38:28,710 --> 00:38:30,090
So here's another interesting
thing.

782
00:38:30,090 --> 00:38:35,430
So, I have a program like that,
2 to the power I, and of

783
00:38:35,430 --> 00:38:37,310
course, most of the time

784
00:38:37,310 --> 00:38:40,080
exponentiation is very expensive.

785
00:38:40,080 --> 00:38:41,450
If you have a smart
compiler --

786
00:38:41,450 --> 00:38:42,840
I don't have to exponentiate.

787
00:38:42,840 --> 00:38:44,390
There's this thing called strength
reduction.

788
00:38:44,390 --> 00:38:44,970
They say wait a minute --

789
00:38:44,970 --> 00:38:46,160
I will keep a variable t.

790
00:38:46,160 --> 00:38:49,270
This 2 to the power i means
basically every time I

791
00:38:49,270 --> 00:38:52,150
multiply it by 2, and I can
keep repeating that.

792
00:38:52,150 --> 00:38:57,210
Do you see why these two
are equal there?

793
00:38:57,210 --> 00:38:57,940
This is good.

794
00:38:57,940 --> 00:38:59,550
A lot of good compilers
do that.

795
00:38:59,550 --> 00:39:01,040
But now what did
I suddenly do?

796
00:39:01,040 --> 00:39:03,740
AUDIENCE: [OBSCURED.]

797
00:39:03,740 --> 00:39:05,680
PROFESSOR: Yeah, I reduced the
amount of computation,

798
00:39:05,680 --> 00:39:09,100
obviously, but I just introduced
a loop-carried true

799
00:39:09,100 --> 00:39:10,350
dependence here.

800
00:39:12,760 --> 00:39:15,560
Because now I have t dependent
on the previous t to calculate

801
00:39:15,560 --> 00:39:20,630
the next value, and while
order-wise or sequential-wise

802
00:39:20,630 --> 00:39:24,350
this is a win, now suddenly
I can't parallelize.
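
The trade-off can be seen by putting the two versions side by side. This is a sketch with made-up function names; the slide's exact code is not in the transcript, so `A[i] = 2 ** i` is an assumption based on the spoken description.

```python
def with_exponentiation(n):
    """Every iteration computes 2**i on its own: no loop-carried dependence."""
    return [2 ** i for i in range(n)]


def strength_reduced(n):
    """Cheaper per iteration, but t in iteration i depends on t from iteration i-1."""
    out = []
    t = 1
    for _ in range(n):
        out.append(t)
        t = t * 2            # loop-carried true dependence on t
    return out


print(with_exponentiation(6) == strength_reduced(6))  # both produce the same values
```

The first form does more arithmetic per iteration but each iteration is independent; the second is the better sequential code and cannot be split across processors as written.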

803
00:39:24,350 --> 00:39:26,840
Of course, a lot of times what
happens is you have a

804
00:39:26,840 --> 00:39:27,750
very smart programmer.

805
00:39:27,750 --> 00:39:30,610
They say aha, I know this
operation is expensive so I am

806
00:39:30,610 --> 00:39:33,580
going to do this myself and
create you a much simpler

807
00:39:33,580 --> 00:39:35,670
sequential program.

808
00:39:35,670 --> 00:39:37,380
Then you try to parallelize
this and you can't.

809
00:39:37,380 --> 00:39:41,100
So what you might try to do is
kind of do the transformation

810
00:39:41,100 --> 00:39:43,660
in this direction many times, to
make the program run a little

811
00:39:43,660 --> 00:39:47,260
bit slower sequentially just
so you can actually go and

812
00:39:47,260 --> 00:39:49,340
parallelize it.

813
00:39:49,340 --> 00:39:50,770
So this gets a little
bit counterintuitive.

814
00:39:50,770 --> 00:39:53,900
You just look at a program and
say yeah there is a loop

815
00:39:53,900 --> 00:39:55,850
carried dependence, I can do it
a little bit more expensive

816
00:39:55,850 --> 00:39:58,540
without the loop carried
dependence, and then suddenly

817
00:39:58,540 --> 00:39:59,320
my loop is parallelizable.

818
00:39:59,320 --> 00:40:01,460
So there might be cases where
you might have to do it by

819
00:40:01,460 --> 00:40:04,020
hand, and a lot of compilers,
automatic parallelizing

820
00:40:04,020 --> 00:40:05,990
compilers, try to
do this also.

821
00:40:05,990 --> 00:40:08,230
Kind of look at these kind
of things and try to

822
00:40:08,230 --> 00:40:09,450
move in that direction.

823
00:40:09,450 --> 00:40:11,290
Whereas most of the sequential
compilers are trying

824
00:40:11,290 --> 00:40:12,670
to find this and move
in this direction.

825
00:40:16,320 --> 00:40:19,840
So, OK I said that.

826
00:40:19,840 --> 00:40:21,790
So, another thing is called
array privatization.

827
00:40:21,790 --> 00:40:26,130
For scalars, I showed you that
when you have an anti or output

828
00:40:26,130 --> 00:40:28,260
dependence on a variable,
you need to privatize.

829
00:40:28,260 --> 00:40:31,360
And in arrays, you have
a lot more complexity.

830
00:40:31,360 --> 00:40:33,250
I'm not going to go into that,
you can actually do private

831
00:40:33,250 --> 00:40:35,440
copies also in there.

832
00:40:35,440 --> 00:40:37,830
You can do a bunch of
transformations.

833
00:40:37,830 --> 00:40:39,840
Another thing people do is
called interprocedural

834
00:40:39,840 --> 00:40:41,740
parallelization.

835
00:40:41,740 --> 00:40:44,470
So the thing is you have a
nice loop and you start

836
00:40:44,470 --> 00:40:46,070
analyzing loop and in the middle
of a loop you have a

837
00:40:46,070 --> 00:40:48,250
function call.

838
00:40:48,250 --> 00:40:50,120
Suddenly what are you
going to do with it?

839
00:40:50,120 --> 00:40:52,400
You have no idea what the
function does, and most of the

840
00:40:52,400 --> 00:40:54,530
simple analysis says OK, I can't
parallelize anything

841
00:40:54,530 --> 00:40:55,750
that has a function call.

842
00:40:55,750 --> 00:40:57,430
That's not a good parallelizing
compiler because

843
00:40:57,430 --> 00:40:59,780
a lot of loops have function
calls, and you might call

844
00:40:59,780 --> 00:41:04,090
something as simple as a sine
function or some simple

845
00:41:04,090 --> 00:41:06,030
exponentiation function and
then suddenly it's not

846
00:41:06,030 --> 00:41:08,750
parallelizable.

847
00:41:08,750 --> 00:41:10,470
This is a big problem.

848
00:41:10,470 --> 00:41:11,460
There are two things
you can do.

849
00:41:11,460 --> 00:41:15,080
One is interprocedural analysis
and another inlining.

850
00:41:15,080 --> 00:41:19,600
So the interprocedural analysis
says I'm going to

851
00:41:19,600 --> 00:41:24,370
analyze the entire program and
I have function, I'm going to

852
00:41:24,370 --> 00:41:28,830
go and try to analyze the
function itself also.

853
00:41:28,830 --> 00:41:33,220
What happens is -- so assume
if the functions are used

854
00:41:33,220 --> 00:41:36,060
many, many times, so the sine
function might be used

855
00:41:36,060 --> 00:41:37,076
hundreds of times.

856
00:41:37,076 --> 00:41:39,380
So every time you have a call
of a sine function, if you

857
00:41:39,380 --> 00:41:41,650
keep analyzing, reanalyzing
what's happening inside of the

858
00:41:41,650 --> 00:41:44,450
sine function, you kind of
have exponential blow up.

859
00:41:44,450 --> 00:41:48,800
So if your code size is n, you might
have an exponential time

860
00:41:48,800 --> 00:41:51,640
of a number of lines that need
to be analyzed because every

861
00:41:51,640 --> 00:41:54,080
call need to go there, call some
other functions, you can

862
00:41:54,080 --> 00:41:55,490
see the blow up.

863
00:41:55,490 --> 00:41:57,530
And so analysis might
be expensive.

864
00:41:57,530 --> 00:42:00,620
Other option is you analyze
each function once.

865
00:42:00,620 --> 00:42:01,910
Yeah, OK.

866
00:42:01,910 --> 00:42:04,160
I analyze this function once
and every time I use that

867
00:42:04,160 --> 00:42:07,990
function I just use that
analysis information.

868
00:42:07,990 --> 00:42:11,450
What that means is you have a
kind of summary of what that

869
00:42:11,450 --> 00:42:13,390
function does for every call.

870
00:42:13,390 --> 00:42:15,580
This is not that easy and this
runs into a thing called

871
00:42:15,580 --> 00:42:18,660
unrealizable path problem,
because you go into function

872
00:42:18,660 --> 00:42:22,210
in one path --

873
00:42:22,210 --> 00:42:26,460
assume you call foo from
here and return here.

874
00:42:26,460 --> 00:42:28,470
You call it here and
return here.

875
00:42:28,470 --> 00:42:31,270
So when you analyze, normally
you can go from here to here,

876
00:42:31,270 --> 00:42:34,530
here to here, but if you treat
foo as only one thing you

877
00:42:34,530 --> 00:42:36,466
might be able to even think that
you can go here to here

878
00:42:36,466 --> 00:42:38,220
and here to here.

879
00:42:38,220 --> 00:42:40,790
So this looks like one
thing in here.

880
00:42:40,790 --> 00:42:44,462
You see that control here goes
here, comes here do a function

881
00:42:44,462 --> 00:42:46,550
call goes here, because
we are not treating

882
00:42:46,550 --> 00:42:48,610
this as separate instance.

883
00:42:48,610 --> 00:42:50,480
So why are we analyzing
it once?

884
00:42:50,480 --> 00:42:52,650
This created all this additional
mess and then can

885
00:42:52,650 --> 00:42:53,770
have problems in here.

886
00:42:53,770 --> 00:42:56,480
So these are the kind of
researchy things people are

887
00:42:56,480 --> 00:42:57,210
working on.

888
00:42:57,210 --> 00:42:59,480
There's no perfect answer,
these are complicated

889
00:42:59,480 --> 00:43:00,650
problems, so you had to do some

890
00:43:00,650 --> 00:43:05,770
interesting balance in here.

891
00:43:05,770 --> 00:43:08,360
Because the other thing is every
analysis has to deal with that,

892
00:43:08,360 --> 00:43:10,030
so you have to kind of change
the entire compiler,

893
00:43:10,030 --> 00:43:12,940
which is not simple.

894
00:43:12,940 --> 00:43:14,550
Inlining is much easier.

895
00:43:14,550 --> 00:43:16,700
It's a poor man's solution, so
every time you have function

896
00:43:16,700 --> 00:43:18,570
call, you just bring
the function and

897
00:43:18,570 --> 00:43:19,855
just copy it in there.

898
00:43:19,855 --> 00:43:20,810
And every time you have function
call you bring the

899
00:43:20,810 --> 00:43:23,410
function and you can run it
through the same compiler, but

900
00:43:23,410 --> 00:43:25,510
of course, you can have
huge code blow up.

901
00:43:25,510 --> 00:43:28,060
It's not only analysis expense,
you might have a

902
00:43:28,060 --> 00:43:30,730
function that before had only
100 lines, now we have

903
00:43:30,730 --> 00:43:32,760
millions of lines in there
and then you run into cache

904
00:43:32,760 --> 00:43:34,660
problems, all those
other issues.

905
00:43:34,660 --> 00:43:36,310
So can be very expensive too.

906
00:43:36,310 --> 00:43:39,265
So what people do is things like
selective inlining and a

907
00:43:39,265 --> 00:43:45,970
lot of kind of interesting
combinations of these.

908
00:43:45,970 --> 00:43:48,010
Finally, loop transformations.

909
00:43:48,010 --> 00:43:53,560
So I have this loop, so I have
A[i][j] equals A[i][j-1], A[i-1][j].

910
00:43:53,560 --> 00:43:57,000
So look at my -- my
arrowheads look too big there,

911
00:43:57,000 --> 00:44:00,280
but look at my dependences.

912
00:44:00,280 --> 00:44:02,020
Is any of this parallel?

913
00:44:02,020 --> 00:44:10,460
AUDIENCE: [OBSCURED.]

914
00:44:10,460 --> 00:44:11,710
PROFESSOR: Yeah.

915
00:44:13,840 --> 00:44:16,280
So, as she says, neither --

916
00:44:16,280 --> 00:44:18,650
you can't parallelize I because
there's a loop-carried

917
00:44:18,650 --> 00:44:21,260
dependence in I dimension.

918
00:44:21,260 --> 00:44:23,800
You can't parallelize J because
there's a loop-carried

919
00:44:23,800 --> 00:44:24,900
dependence in J dimension.

920
00:44:24,900 --> 00:44:27,900
She has the right idea, because you
can actually pipeline.

921
00:44:27,900 --> 00:44:30,480
So pipelining, we haven't
figured out how

922
00:44:30,480 --> 00:44:32,070
to parallelize pipeline.

923
00:44:32,070 --> 00:44:34,250
So the way you can do
this simply is a

924
00:44:34,250 --> 00:44:37,410
thing called loop skewing.

925
00:44:37,410 --> 00:44:39,040
You can kind of --

926
00:44:39,040 --> 00:44:42,080
because iteration space has
changed from a data space.

927
00:44:42,080 --> 00:44:45,090
You can come up with a new
iteration space that kind of

928
00:44:45,090 --> 00:44:47,790
skew the loop in there.

929
00:44:47,790 --> 00:44:51,120
So what it does is normally
iteration space, what this J

930
00:44:51,120 --> 00:44:54,480
outside, so you go execute
like this.

931
00:44:54,480 --> 00:44:57,640
The skewed loop basically
says I am executing

932
00:44:57,640 --> 00:44:59,650
this way, so I'm executing
the pipeline,

933
00:44:59,650 --> 00:45:00,890
basically pipeline here.

934
00:45:00,890 --> 00:45:04,060
So I'm kind of going like this
way, executing that way.

935
00:45:04,060 --> 00:45:09,470
If I could run that loop in that
fashion, what I can do is

936
00:45:09,470 --> 00:45:12,600
I can run this -- after this
iteration, when you go run the

937
00:45:12,600 --> 00:45:16,340
next iteration, there's no
dependence across here.

938
00:45:16,340 --> 00:45:18,340
If I run here, I don't have
dependence, so I can run each

939
00:45:18,340 --> 00:45:22,510
of these and I have a parallel
set of iterations to run.

940
00:45:22,510 --> 00:45:25,670
So in here, what happens is
this inner loop it can be

941
00:45:25,670 --> 00:45:30,200
parallel, basically like your
pipeline, but it's written in

942
00:45:30,200 --> 00:45:34,010
a way that I still have my two
loops in here, but I have done

943
00:45:34,010 --> 00:45:36,700
this weird transformation.
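
A minimal Python sketch of the skewing idea just described, under some assumptions: the two reads are combined with `+` (the operator is not stated in the spoken text), row 0 and column 0 are treated as boundary values, and the function names are made up for illustration. The skewed version walks anti-diagonals `d = i + j`; every point on one diagonal depends only on the previous diagonal, so the inner loop is the one that could be run in parallel.

```python
def sequential(n):
    # Row 0 and column 0 hold boundary values (assumed to be 1).
    a = [[1] * n for _ in range(n)]
    for i in range(1, n):
        for j in range(1, n):
            a[i][j] = a[i][j - 1] + a[i - 1][j]
    return a


def skewed(n):
    a = [[1] * n for _ in range(n)]
    # Outer loop over anti-diagonals d = i + j; this loop stays sequential.
    for d in range(2, 2 * n - 1):
        # Points with i + j == d do not depend on each other, so these
        # iterations are independent (run here one after another).
        for i in range(max(1, d - (n - 1)), min(n - 1, d - 1) + 1):
            j = d - i
            a[i][j] = a[i][j - 1] + a[i - 1][j]
    return a
```

Both functions fill the array with the same values; only the order in which the iteration space is visited differs.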

944
00:45:36,700 --> 00:45:38,430
Another interesting thing is
granularity of parallelism.

945
00:45:38,430 --> 00:45:40,950
Assume I have a loop
like that, i and j.

946
00:45:40,950 --> 00:45:44,150
Which loop is parallel in here?

947
00:45:44,150 --> 00:45:46,050
i or j?

948
00:45:46,050 --> 00:45:47,740
j is parallel.

949
00:45:47,740 --> 00:45:48,700
OK, I do something like that.

950
00:45:48,700 --> 00:45:52,580
I say I run i, every iteration
I do a barrier, I run j

951
00:45:52,580 --> 00:45:56,770
parallel and I end up doing
a barrier again.

952
00:45:56,770 --> 00:46:05,510
What might be a problem in
something like this?

953
00:46:05,510 --> 00:46:09,440
I mean inner parallelism can
be expensive, because every

954
00:46:09,440 --> 00:46:13,120
time I have to do this probably
expensive barrier, run a few

955
00:46:13,120 --> 00:46:14,870
iterations, a few in
this one, probably

956
00:46:14,870 --> 00:46:16,850
only like a few cycles.

957
00:46:16,850 --> 00:46:18,980
And write this very expensive
barrier again, and everybody

958
00:46:18,980 --> 00:46:20,440
communicates --

959
00:46:20,440 --> 00:46:23,050
all of those things.

960
00:46:23,050 --> 00:46:25,170
Most of the time when you do
inner loop parallelism it

961
00:46:25,170 --> 00:46:27,510
actually slows down
the program.

962
00:46:27,510 --> 00:46:29,640
You will probably find it too
sometimes, if you define the

963
00:46:29,640 --> 00:46:32,060
parallelism inner array to be
too small, it actually has a

964
00:46:32,060 --> 00:46:34,650
negative impact, because all the
communication you need to

965
00:46:34,650 --> 00:46:35,932
do, synchronization you
need to do all of

966
00:46:35,932 --> 00:46:39,740
them out of the program.

967
00:46:39,740 --> 00:46:40,920
So inner loop is expensive.

968
00:46:40,920 --> 00:46:42,650
What are your choices?

969
00:46:42,650 --> 00:46:44,140
Don't parallelize.

970
00:46:44,140 --> 00:46:45,980
Pretty good choice for
a lot of cases.

971
00:46:45,980 --> 00:46:47,510
You look at this and this is
actually going to win you

972
00:46:47,510 --> 00:46:49,530
basically by doing that.

973
00:46:49,530 --> 00:46:51,960
Or can you transform it to
outer loop parallelism.

974
00:46:51,960 --> 00:46:54,540
Take inner loop parallelism and
you change it to get outer

975
00:46:54,540 --> 00:46:55,120
loop parallelism.

976
00:46:55,120 --> 00:46:57,070
This program is actually nice,
there are some complex

977
00:46:57,070 --> 00:46:59,710
analysis you need to do to
make sure that's legal.

978
00:46:59,710 --> 00:47:03,390
So you can basically
take this one and

979
00:47:03,390 --> 00:47:06,170
transform in other direction.

980
00:47:06,170 --> 00:47:10,780
What that means is kind of
do a loop interchange.

981
00:47:10,780 --> 00:47:13,500
So now instead of i, you have
a j outer dimension, i inner

982
00:47:13,500 --> 00:47:16,050
dimension, inner loop.

983
00:47:16,050 --> 00:47:19,715
When you do that what you have
is your barrier, and then you

984
00:47:19,715 --> 00:47:23,750
can run this in parallel
and this like this.

985
00:47:23,750 --> 00:47:29,985
Suddenly, instead of having n
barriers for that loop, you

986
00:47:29,985 --> 00:47:31,740
have only one barrier.

987
00:47:31,740 --> 00:47:34,940
Suddenly you have a much larger
chunk you're running,

988
00:47:34,940 --> 00:47:41,070
and this can be run.
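
A small Python sketch of the barrier-count argument, under assumptions: the loop body `a[i][j] = a[i-1][j] + 1` is a hypothetical stand-in for a loop where j is parallel and i carries the dependence, the parallel regions are simulated sequentially, and barriers are only counted, not executed.

```python
def inner_parallel(n, m):
    a = [[0] * m for _ in range(n + 1)]   # row 0 is the input row
    barriers = 0
    for i in range(1, n + 1):             # sequential: carries the dependence
        for j in range(m):                # parallel region (simulated)
            a[i][j] = a[i - 1][j] + 1
        barriers += 1                     # one barrier per outer iteration
    return a, barriers


def outer_parallel(n, m):
    a = [[0] * m for _ in range(n + 1)]
    for j in range(m):                    # parallel region: one column per worker
        for i in range(1, n + 1):         # each worker runs its i loop locally
            a[i][j] = a[i - 1][j] + 1
    barriers = 1                          # a single barrier after the region
    return a, barriers
```

After the interchange each worker gets a whole column of work between synchronizations instead of a single element.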

989
00:47:41,070 --> 00:47:42,670
OK, so this is great.

990
00:47:42,670 --> 00:47:44,960
So I talked to you all about all
this nice transformation,

991
00:47:44,960 --> 00:47:45,690
stuff like that.

992
00:47:45,690 --> 00:47:47,790
So at some point when you know
something is parallel you

993
00:47:47,790 --> 00:47:51,330
might want to go and generate
parallel form.

994
00:47:51,330 --> 00:47:56,150
So the problem is, depending on
how you partition, the loop

995
00:47:56,150 --> 00:47:58,460
bound has to be changed, and I'm
going to talk to you about

996
00:47:58,460 --> 00:48:00,030
how to get loop bound.

997
00:48:00,030 --> 00:48:02,440
So let's look at this program.

998
00:48:02,440 --> 00:48:08,790
So I have something in here and
there's an inner loop that

999
00:48:08,790 --> 00:48:09,980
actually reads, outer
loop writes.

1000
00:48:09,980 --> 00:48:10,660
Inner loop reads.

1001
00:48:10,660 --> 00:48:13,050
And it's a triangular thing.

1002
00:48:13,050 --> 00:48:14,300
It's a big mess.

1003
00:48:14,300 --> 00:48:19,110
Now I assume I want to run
the i loop parallel.

1004
00:48:19,110 --> 00:48:22,860
So what that means is I want
to run the first process --

1005
00:48:22,860 --> 00:48:24,900
there is no for this one, this
one on one iteration, two

1006
00:48:24,900 --> 00:48:28,450
iteration, three, four,
whatever, each one's in here.

1007
00:48:28,450 --> 00:48:32,170
How do I actually go about
generating code that

1008
00:48:32,170 --> 00:48:33,750
actually does that?

1009
00:48:33,750 --> 00:48:36,740
Each processor runs its right
number of iterations.

1010
00:48:36,740 --> 00:48:39,410
This is a non-trivial thing
because it's triangular, you get

1011
00:48:39,410 --> 00:48:42,430
something different and you
can assume all this

1012
00:48:42,430 --> 00:48:44,090
complexity.

1013
00:48:44,090 --> 00:48:48,250
One thing I did is my iteration
space between i and

1014
00:48:48,250 --> 00:48:54,050
j, this is my iteration space.

1015
00:48:54,050 --> 00:48:56,150
So I assume, assume I am
running a processor.

1016
00:48:56,150 --> 00:48:59,320
Each i iteration is run by a
processor, you can say you

1017
00:48:59,320 --> 00:49:04,580
have then another dimension P,
and say i equals P. So I can

1018
00:49:04,580 --> 00:49:06,770
look at now instead of a two
dimensional space in a three

1019
00:49:06,770 --> 00:49:08,300
dimensional space.

1020
00:49:08,300 --> 00:49:10,340
So in this analysis, if you can
think multi-dimensionally

1021
00:49:10,340 --> 00:49:12,800
it's actually very helpful
because we can kind of keep

1022
00:49:12,800 --> 00:49:15,970
adding dimensions in here.

1023
00:49:15,970 --> 00:49:19,380
So what are the loop
bounds in here?

1024
00:49:19,380 --> 00:49:22,600
What we can do is use another
technique called

1025
00:49:22,600 --> 00:49:26,530
Fourier-Motzkin Elimination to
calculate loop bounds by using

1026
00:49:26,530 --> 00:49:28,040
projections of the
iteration space.

1027
00:49:28,040 --> 00:49:29,330
I will go through later
a bit to give you a

1028
00:49:29,330 --> 00:49:30,585
flavor for what it is.

1029
00:49:30,585 --> 00:49:33,910
It's also, if you are in to
linear programming, this is

1030
00:49:33,910 --> 00:49:37,850
kind of extension techniques
on that.

1031
00:49:37,850 --> 00:49:39,820
So the way we look
at that is --

1032
00:50:06,390 --> 00:50:10,960
A little bit too far.

1033
00:50:10,960 --> 00:50:18,600
I didn't realize Mac
can be this slow.

1034
00:50:18,600 --> 00:50:26,480
[ASIDE CONVERSATION]

1035
00:50:26,480 --> 00:50:28,993
See this is why we need
parallelism if you think this is

1036
00:50:28,993 --> 00:50:33,000
running fast. So what you can do
is you can think about this

1037
00:50:33,000 --> 00:50:34,960
as this three dimensional
space.

1038
00:50:34,960 --> 00:50:36,400
i, j and p.

1039
00:50:36,400 --> 00:50:40,070
And because i is equal to p, if
you get i and p, get a line

1040
00:50:40,070 --> 00:50:41,960
in that dimension and
then j goes there.

1041
00:50:41,960 --> 00:50:44,050
So this is the kind of iteration
space in here, and

1042
00:50:44,050 --> 00:50:47,930
that represents inequalities
here.

1043
00:50:47,930 --> 00:50:52,630
So what I want is a loop where
outer dimension is p, then the

1044
00:50:52,630 --> 00:50:54,800
next dimension is i and j.

1045
00:50:54,800 --> 00:50:56,420
We can think about
it like that.

1046
00:50:56,420 --> 00:50:59,525
So what that means is I need to
get my iteration ordering

1047
00:50:59,525 --> 00:51:04,140
-- when it happens, you
just go like that.

1048
00:51:04,140 --> 00:51:05,190
All right, about doing that.

1049
00:51:05,190 --> 00:51:07,570
So this is the kind of loop I
want to generate -- let me go

1050
00:51:07,570 --> 00:51:09,090
and show you how we
generate that.

1051
00:51:25,530 --> 00:51:29,750
So here's my space in here, so
first one I want to do is my

1052
00:51:29,750 --> 00:51:32,100
inner most dimension is j.

1053
00:51:32,100 --> 00:51:34,360
And what I can do is I can look
at this thing and say

1054
00:51:34,360 --> 00:51:36,540
what are the bounds of j.

1055
00:51:36,540 --> 00:51:40,370
So, for each of the bounds of
j can be described by --

1056
00:51:40,370 --> 00:51:41,520
with p and i.

1057
00:51:41,520 --> 00:51:44,470
I'll actually show you how to
do that in little while.

1058
00:51:44,470 --> 00:51:51,180
Then I will get j goes
from 1 to i minus 1.

1059
00:51:51,180 --> 00:51:53,740
Then after that I can basically
project it to

1060
00:51:53,740 --> 00:51:54,580
eliminate j dimension.

1061
00:51:54,580 --> 00:51:56,860
So what I'm doing is I'm going
to have a three dimension and

1062
00:51:56,860 --> 00:51:59,780
I project into two dimensions
without j anymore, because now

1063
00:51:59,780 --> 00:52:04,110
all I have left is i p and I get
a line in that dimension.

1064
00:52:04,110 --> 00:52:06,250
Then what I have to do is
now I have to find i.

1065
00:52:06,250 --> 00:52:10,110
What are my bounds of i?

1066
00:52:10,110 --> 00:52:13,340
And bounds of i is actually
i is equal to p.

1067
00:52:13,340 --> 00:52:14,690
You can figure that
one out because

1068
00:52:14,690 --> 00:52:16,190
there's a line in there.

1069
00:52:16,190 --> 00:52:18,530
Then you eliminate i and
now you get this one.

1070
00:52:21,330 --> 00:52:27,070
Then what are bounds of p? p
goes from basically 2 to n.

1071
00:52:27,070 --> 00:52:28,110
You just basically get that.

1072
00:52:28,110 --> 00:52:31,160
So you can do this projection in
here -- let me go in there,

1073
00:52:31,160 --> 00:52:35,710
and now what you end up doing
is you can get this, and of

1074
00:52:35,710 --> 00:52:39,240
course, outer loop p is not
a true -- like a loop.

1075
00:52:39,240 --> 00:52:41,050
You can say you get p, my_pid.

1076
00:52:41,050 --> 00:52:43,330
p is with this range.
i equals p.

1077
00:52:43,330 --> 00:52:44,190
Do this one.

1078
00:52:44,190 --> 00:52:46,570
So this one, -- generated
that piece of code.
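
The generated code described here can be sketched in Python, under assumptions: `worker` and `sequential_space` are hypothetical names, the processor id is passed in as `p` rather than read from something like `my_pid`, and only the inner-loop iterations are tracked (the outer-loop write is left out).

```python
def worker(p, n):
    """Inner-loop iterations executed by the processor with id p."""
    done = []
    if 2 <= p <= n:               # p is not a real loop: each processor tests its own id
        i = p                     # i equals p
        for j in range(1, i):     # j goes from 1 to i - 1
            done.append((i, j))
    return done


def sequential_space(n):
    """The original triangular iteration space, in sequential order."""
    return [(i, j) for i in range(1, n + 1) for j in range(1, i)]
```

Taking the union of `worker(p, n)` over all processor ids gives back exactly the triangular space, with no iteration run twice.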

1079
00:52:46,570 --> 00:52:49,280
So I will go a little bit
detail and show how this

1080
00:52:49,280 --> 00:52:51,080
happens, pretty much
can happen.

1081
00:52:51,080 --> 00:52:54,640
So I have my little bit
of different space.

1082
00:52:54,640 --> 00:52:55,400
I'm doing a different
projection.

1083
00:52:55,400 --> 00:52:57,050
I'm doing i, j, p.

1084
00:52:57,050 --> 00:53:00,340
I want to project first i away,
j away, and p away instead

1085
00:53:00,340 --> 00:53:01,830
of j, i, p before
I do anything.

1086
00:53:01,830 --> 00:53:04,250
So here's my iteration
space, what do I do?

1087
00:53:04,250 --> 00:53:07,860
The first thing I do is I find
the bounds of i, So I have

1088
00:53:07,860 --> 00:53:08,410
this thing.

1089
00:53:08,410 --> 00:53:14,230
I just basically expanded this,
and eliminated the j --

1090
00:53:14,230 --> 00:53:16,040
this one doesn't contribute
to the bounds of i,

1091
00:53:16,040 --> 00:53:17,220
but everybody else does.

1092
00:53:17,220 --> 00:53:19,630
So there are a bunch of things
that i has to be less than

1093
00:53:19,630 --> 00:53:22,530
that and i has to be greater
than these two.

1094
00:53:22,530 --> 00:53:26,840
Then what I have is bound of i
is, it has to be maximum of

1095
00:53:26,840 --> 00:53:28,540
this because it has to be
greater than all three.

1096
00:53:28,540 --> 00:53:30,380
So it has to be max of
this, this, and this.

1097
00:53:30,380 --> 00:53:32,190
It has to be less than these
two, it has to be

1098
00:53:32,190 --> 00:53:33,480
min of this one.

1099
00:53:33,480 --> 00:53:33,800
Question?

1100
00:53:33,800 --> 00:53:35,800
AUDIENCE: Well why did you have
to go through all this.

1101
00:53:35,800 --> 00:53:38,520
At least in this case, the outer
loop was very simple,

1102
00:53:38,520 --> 00:53:39,970
you could have just directly
mapped that.

1103
00:53:39,970 --> 00:53:42,590
PROFESSOR: I agree with you,
it's very simple thing, but

1104
00:53:42,590 --> 00:53:45,260
the problem is that's because
you are smart and you can

1105
00:53:45,260 --> 00:53:48,270
think a little bit ahead in
there, and if I'm programming

1106
00:53:48,270 --> 00:53:52,290
a computer, I can't say find
these special cases.

1107
00:53:52,290 --> 00:53:54,850
So I want to come up with a
mathematical way that is a

1108
00:53:54,850 --> 00:53:57,340
bullet proof way that will work
from the simplest one to

1109
00:53:57,340 --> 00:53:59,970
very complicated, like for
example, finding the loop

1110
00:53:59,970 --> 00:54:04,850
bounds for that loop transpose
that I showed you before --

1111
00:54:04,850 --> 00:54:09,160
no, the skew that what
we called before.

1112
00:54:09,160 --> 00:54:13,266
AUDIENCE: So it's not so much
just defining an index to

1113
00:54:13,266 --> 00:54:14,226
iterate on, it's to
find the best

1114
00:54:14,226 --> 00:54:16,910
index to map, to parallelize.

1115
00:54:16,910 --> 00:54:20,010
PROFESSOR: Any could be issue,
because you have --

1116
00:54:20,010 --> 00:54:24,640
for example, if the inner
dimension depends on i, and i

1117
00:54:24,640 --> 00:54:27,270
goes outside, then I can't
make it depend on i.

1118
00:54:27,270 --> 00:54:33,850
So if I have something like for
i equals something, for j

1119
00:54:33,850 --> 00:54:37,910
equals i to something.

1120
00:54:37,910 --> 00:54:41,350
Now if I switch these
two I have 4j.

1121
00:54:41,350 --> 00:54:42,770
I can't say it's
i to something.

1122
00:54:42,770 --> 00:54:45,830
I have to get rid of i and I
have to figure out in the for

1123
00:54:45,830 --> 00:54:49,230
i, this has to be something
with j, with some function

1124
00:54:49,230 --> 00:54:52,160
with j in here.

1125
00:54:52,160 --> 00:54:56,690
So what is this function,
how do you get that?

1126
00:54:56,690 --> 00:54:59,070
You need this kind of
transformations to do that.

1127
00:54:59,070 --> 00:55:01,260
Next time I'll talk to you about
can you do it a little

1128
00:55:01,260 --> 00:55:02,480
bit even better.

1129
00:55:02,480 --> 00:55:03,730
So I get this bound in here.

1130
00:55:06,600 --> 00:55:11,390
Then actually you found this
is going from p to p.

1131
00:55:11,390 --> 00:55:16,400
So I can actually set i to p because
of the min and max in here.

1132
00:55:16,400 --> 00:55:19,370
Then after you do that, what you
have to do is eliminate I.

1133
00:55:19,370 --> 00:55:25,980
The way you eliminate I is you
take this has to be always

1134
00:55:25,980 --> 00:55:29,160
less than n and less than p.

1135
00:55:29,160 --> 00:55:34,630
So you take these n constraints
here and you get n times m

1136
00:55:34,630 --> 00:55:35,660
constraints in here.

1137
00:55:35,660 --> 00:55:39,100
So the first three has to be
less than n, again, we repeat

1138
00:55:39,100 --> 00:55:42,260
it again, has to
be less than p.

1139
00:55:42,260 --> 00:55:45,710
Then, of course, the missing
constraint that 1

1140
00:55:45,710 --> 00:55:47,660
is less than j.

1141
00:55:47,660 --> 00:55:49,430
You put all those constraints
together.

1142
00:55:49,430 --> 00:55:51,730
Now, the nice thing is in that
one, it's still legal, it

1143
00:55:51,730 --> 00:55:54,050
still represents that
space, but you don't

1144
00:55:54,050 --> 00:55:55,580
have i there anymore.

1145
00:55:55,580 --> 00:55:58,470
You can completely
get rid of i.

1146
00:55:58,470 --> 00:56:01,160
So, by doing that -- and then
of course, there's a lot of

1147
00:56:01,160 --> 00:56:03,940
redundancy in here, and then you
can do some analysis and

1148
00:56:03,940 --> 00:56:05,915
eliminate redundancy and you
end up in this set of

1149
00:56:05,915 --> 00:56:06,760
constraints.

1150
00:56:06,760 --> 00:56:09,800
That's where when you say what's
the best, you can be

1151
00:56:09,800 --> 00:56:13,590
best -- it has to be correct or
that means you can't have

1152
00:56:13,590 --> 00:56:15,220
additional iterations
or less iterations.

1153
00:56:15,220 --> 00:56:19,400
But best depends on how
complicated is the loop bound

1154
00:56:19,400 --> 00:56:21,610
calculation.

1155
00:56:21,610 --> 00:56:24,080
You can come up with a correct
solution, and the best is

1156
00:56:24,080 --> 00:56:26,060
depending on which order
you do that.

1157
00:56:26,060 --> 00:56:27,530
When you have two redundant
thing, which one you

1158
00:56:27,530 --> 00:56:29,600
eliminate, so you can have a lot
of heuristics saying OK,

1159
00:56:29,600 --> 00:56:32,520
look if this one looks harder
to calculate, eliminate that

1160
00:56:32,520 --> 00:56:34,080
one with the other one.

1161
00:56:34,080 --> 00:56:36,800
So you get this set
of constraints.

1162
00:56:36,800 --> 00:56:39,950
Then what you have to do is now
find the bounds of j.

1163
00:56:39,950 --> 00:56:42,195
So you have this set again.

1164
00:56:42,195 --> 00:56:46,450
To find a bound of j only two
constraints are there, and you

1165
00:56:46,450 --> 00:56:51,210
know j goes from 1 to p minus 1,
and you find the bound of j.

1166
00:56:51,210 --> 00:56:55,860
Getting rid of j means
there's only two.

1167
00:56:55,860 --> 00:56:57,550
One get rid of p minus 1.

1168
00:56:57,550 --> 00:56:58,980
There are two left for p.

1169
00:56:58,980 --> 00:57:02,200
You put it there, and then you
can eliminate the redundancy

1170
00:57:02,200 --> 00:57:07,450
in here, and now you can find
the bounds of p which goes

1171
00:57:07,450 --> 00:57:09,530
from 2 to n.

1172
00:57:09,530 --> 00:57:14,720
And suddenly you have the loop
nest. So now I actually did

1173
00:57:14,720 --> 00:57:17,200
parallelization and a loop
transpose in here.

1174
00:57:20,200 --> 00:57:23,480
I could combine those two, use
this simple mathematical way

1175
00:57:23,480 --> 00:57:27,050
and find loop bounds in here.
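
A minimal Python sketch of the Fourier-Motzkin steps walked through above, under assumptions: the problem size is fixed at `N = 5`, each constraint is stored as a row meaning `sum(coeffs * vars) <= bound`, the variable order is `(i, j, p)`, and the names `eliminate` and `satisfies` are invented for illustration. Redundant constraints are kept rather than removed.

```python
from fractions import Fraction
from itertools import product


def eliminate(constraints, k):
    """Project variable k out of rows (coeffs, bound): sum(c * x) <= bound."""
    lower, upper, rest = [], [], []
    for coeffs, bound in constraints:
        if coeffs[k] > 0:
            upper.append((coeffs, bound))   # gives an upper bound on x[k]
        elif coeffs[k] < 0:
            lower.append((coeffs, bound))   # gives a lower bound on x[k]
        else:
            rest.append((coeffs, bound))    # does not mention x[k]
    # Every lower bound must stay below every upper bound.
    for (lc, lb), (uc, ub) in product(lower, upper):
        ls = 1 / Fraction(-lc[k])           # scale so x[k] has coefficient -1
        us = 1 / Fraction(uc[k])            # scale so x[k] has coefficient +1
        coeffs = tuple(ls * a + us * c for a, c in zip(lc, uc))
        rest.append((coeffs, ls * lb + us * ub))
    return rest


def satisfies(constraints, point):
    return all(sum(c * x for c, x in zip(coeffs, point)) <= bound
               for coeffs, bound in constraints)


N = 5  # assumed concrete problem size
space = [
    ((-1, 0, 0), -1),   # 1 <= i
    ((1, 0, 0), N),     # i <= N
    ((0, -1, 0), -1),   # 1 <= j
    ((-1, 1, 0), -1),   # j <= i - 1
    ((1, 0, -1), 0),    # i <= p
    ((-1, 0, 1), 0),    # p <= i
]
without_i = eliminate(space, 0)         # 3 lower x 2 upper bounds, plus 1 <= j
without_ij = eliminate(without_i, 1)    # only constraints on p remain
p_range = [p for p in range(-2, N + 3) if satisfies(without_ij, (0, 0, p))]
j_range_for_p4 = [j for j in range(-2, N + 3) if satisfies(without_i, (0, j, 4))]
```

With these inputs `p_range` comes out as 2 through N and, for p = 4, j runs from 1 to 3, which matches the bounds read off in the lecture.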

1176
00:57:27,050 --> 00:57:30,650
So, I'm going to give you
something even a little bit

1177
00:57:30,650 --> 00:57:32,830
interesting beyond that, which
is communication code

1178
00:57:32,830 --> 00:57:34,080
generation.

1179
00:57:37,470 --> 00:57:39,240
So if you are dealing with a
cache coherent shared memory

1180
00:57:39,240 --> 00:57:40,940
machine, you are done.

1181
00:57:40,940 --> 00:57:43,890
You generate code for parallel
loop nest, you can go home

1182
00:57:43,890 --> 00:57:45,900
because everything else will
be done automatically.

1183
00:57:45,900 --> 00:57:48,920
But as we all know in something
like Cell, if you

1184
00:57:48,920 --> 00:57:51,100
have a non-cache-coherent shared
memory or distributed

1185
00:57:51,100 --> 00:57:54,050
memory, you have to do this
one first. Then you have to

1186
00:57:54,050 --> 00:57:56,640
identify communication
and then you generate

1187
00:57:56,640 --> 00:57:59,590
communication code.

1188
00:57:59,590 --> 00:58:04,670
This has an additional
burden in here.

1189
00:58:04,670 --> 00:58:07,630
So until now in data dependence
analysis, what we

1190
00:58:07,630 --> 00:58:11,040
looked at was location-centric
dependences.

1191
00:58:11,040 --> 00:58:13,950
Which location is written by
processor one is used by

1192
00:58:13,950 --> 00:58:15,650
processor two.

1193
00:58:15,650 --> 00:58:19,600
That's kind of a
location-centric kind of view.

1194
00:58:19,600 --> 00:58:23,470
How about if there are multiple
writes to the same location?

1195
00:58:23,470 --> 00:58:25,220
We show that in example, if
multiple people write the same

1196
00:58:25,220 --> 00:58:29,450
location, which one
should I use?

1197
00:58:29,450 --> 00:58:30,850
That's not clear.

1198
00:58:30,850 --> 00:58:32,730
What you are using is the
last guy who wrote that

1199
00:58:32,730 --> 00:58:35,290
location before I read that
thing, and that's not in these

1200
00:58:35,290 --> 00:58:36,690
data flow analysis.

1201
00:58:36,690 --> 00:58:40,110
Now, data dependence analysis
doesn't get it.

1202
00:58:40,110 --> 00:58:43,490
What you want is something
of a value-centric.

1203
00:58:43,490 --> 00:58:47,330
Who was the last write
before my iteration,

1204
00:58:47,330 --> 00:58:49,540
who wrote that location?

1205
00:58:49,540 --> 00:58:52,490
If I know the last write, he's
the one I should be getting

1206
00:58:52,490 --> 00:58:53,570
the value from.

1207
00:58:53,570 --> 00:58:57,720
If the last write happened in
the same processor, I am set

1208
00:58:57,720 --> 00:59:00,270
because I wrote the local
copy and I don't need

1209
00:59:00,270 --> 00:59:02,010
to deal with anything.

1210
00:59:02,010 --> 00:59:05,040
If the last write happened in
a different processor, you

1211
00:59:05,040 --> 00:59:07,135
need to get that value from the
guys who wrote it and say,

1212
00:59:07,135 --> 00:59:09,470
OK, you wrote that value,
give it to me.

1213
00:59:09,470 --> 00:59:12,340
If nobody wrote it and I'm
reading it, that means the

1214
00:59:12,340 --> 00:59:16,340
value came from the original
array because nobody had

1215
00:59:16,340 --> 00:59:17,600
written it in my iteration.

1216
00:59:17,600 --> 00:59:19,410
Then I'm reading something
that has come from the

1217
00:59:19,410 --> 00:59:21,370
previous iteration.

1218
00:59:21,370 --> 00:59:23,630
So I have to get it from
the original array.

1219
00:59:23,630 --> 00:59:26,800
But I have these three
different conditions.
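
The three cases can be sketched by brute-force simulation in Python. This is only an illustration, not the Last Write Tree representation itself, and it rests on assumptions: the loop shape (outer loop writes `X[i]`, inner loop reads `X[j]` for j below i) is taken from the earlier triangular example, and `last_writes` and `owner` are hypothetical names.

```python
def last_writes(n, owner):
    """For every read, find the iteration that last wrote the location
    and classify where the value has to come from."""
    last_writer = {}                      # location -> iteration that wrote it last
    reads = []
    for i in range(1, n + 1):
        last_writer[i] = i                # X[i] = ...   (write in the outer loop)
        for j in range(1, i):
            w = last_writer.get(j)        # ... = X[j]   (read in the inner loop)
            if w is None:
                kind = "original"         # nobody wrote it: value from the original array
            elif owner(w) == owner(i):
                kind = "local"            # last write on the same processor
            else:
                kind = "remote"           # last write elsewhere: needs communication
            reads.append(((i, j), w, kind))
    return reads
```

With one iteration per processor (`owner=lambda i: i`) every read in this loop is remote; with a hypothetical block distribution such as `owner=lambda i: (i - 1) // 2` some reads become local. The "original" case never fires for this particular loop, because every location read has already been written.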

1220
00:59:26,800 --> 00:59:29,260
So you need to represent that.

1221
00:59:29,260 --> 00:59:30,522
I'm not going to go into detail
on this

1222
00:59:30,522 --> 00:59:34,160
representation called
Last Write Trees.

1223
00:59:34,160 --> 00:59:41,160
So what it says is in this kind
of a loop nest in here,

1224
00:59:41,160 --> 00:59:44,160
you have some read access and
write accesses in here, and if

1225
00:59:44,160 --> 00:59:46,480
you look at it
location-centrically you get

1226
00:59:46,480 --> 00:59:49,860
this entire complex graph,
because this is the graph that

1227
00:59:49,860 --> 00:59:53,166
should have been in that example
we gave. So these

1228
00:59:53,166 --> 00:59:55,760
arrays going in here.

1229
00:59:55,760 --> 00:59:57,500
I'm switching notation.

1230
00:59:57,500 --> 00:59:59,820
Before, i was going the other
way around. j was in here.

1231
01:00:02,960 --> 01:00:06,590
But if you go look
at value-centric,

1232
01:00:06,590 --> 01:00:07,440
this is what happens.

1233
01:00:07,440 --> 01:00:11,150
So you say all these
guys basically got

1234
01:00:11,150 --> 01:00:12,910
the value from outside.

1235
01:00:12,910 --> 01:00:14,070
Nobody wrote it.

1236
01:00:14,070 --> 01:00:15,920
This got from -- this is the
write, this is the last write,

1237
01:00:15,920 --> 01:00:16,730
this is the last write --

1238
01:00:16,730 --> 01:00:19,380
I actually have my last
write information.

1239
01:00:19,380 --> 01:00:21,970
So the way to look at that is
there are some part of

1240
01:00:21,970 --> 01:00:23,640
iteration got value from
somewhere, other part got

1241
01:00:23,640 --> 01:00:24,740
somewhere else.

1242
01:00:24,740 --> 01:00:27,380
You can't kind of do a big
summary, as you point out that

1243
01:00:27,380 --> 01:00:30,080
kind of dependence depend on
where the iterations are.

1244
01:00:30,080 --> 01:00:34,320
So you can represent it using
a tree when it shows up.

1245
01:00:34,320 --> 01:00:39,170
So you can say if j greater
than 1, here's the

1246
01:00:39,170 --> 01:00:41,430
relationship between
reads and writes.

1247
01:00:41,430 --> 01:00:44,230
Otherwise relationship means
it came from outside.

1248
01:00:44,230 --> 01:00:46,550
So I can say for each
different places.

1249
01:00:46,550 --> 01:00:49,410
So you can think about this
tree can be a lot more

1250
01:00:49,410 --> 01:00:50,200
complicated tree.

1251
01:00:50,200 --> 01:00:52,740
So each part of the iteration
space, I got data from

1252
01:00:52,740 --> 01:00:53,990
somewhere else.

1253
01:00:57,060 --> 01:00:59,730
So, you get this
function here.

1254
01:00:59,730 --> 01:01:01,940
I think I'll go to
the next slide.

1255
01:01:22,490 --> 01:01:27,240
So now I have the
processor that reads,

1256
01:01:27,240 --> 01:01:30,880
the processor that writes, and the
iterations that are reading

1257
01:01:30,880 --> 01:01:33,190
and writing.

1258
01:01:33,190 --> 01:01:38,090
One thing I can do is I can
represent it using a huge

1259
01:01:38,090 --> 01:01:40,660
multi-dimensional space.

1260
01:01:40,660 --> 01:01:46,300
So what happens in here is the
receive iterations, those are

1261
01:01:46,300 --> 01:01:48,260
the iterations where data
actually has to be received in

1262
01:01:48,260 --> 01:01:49,440
communication.

1263
01:01:49,440 --> 01:01:51,790
Assume that the part I'm
actually communicating is also

1264
01:01:51,790 --> 01:01:55,570
within the loop bound,
so I can write that.

1265
01:01:55,570 --> 01:02:00,280
And the last write relation
is that i

1266
01:02:00,280 --> 01:02:02,890
send has to be i receive.

1267
01:02:02,890 --> 01:02:04,140
We know that.

1268
01:02:06,040 --> 01:02:11,250
What you have is the parallelism
across the processors -- the

1269
01:02:11,250 --> 01:02:14,660
i iterations are parallel, so
the receive processor

1270
01:02:14,660 --> 01:02:18,500
is running iteration
i, processor i.

1271
01:02:18,500 --> 01:02:20,410
Send iterations are the same
because you want to

1272
01:02:20,410 --> 01:02:22,990
parallelize that
loop basically.

1273
01:02:22,990 --> 01:02:26,900
And each iteration gets assigned
to a processor.

1274
01:02:26,900 --> 01:02:29,370
Of course, you want to make sure
the process communication

1275
01:02:29,370 --> 01:02:29,960
is non-local.

1276
01:02:29,960 --> 01:02:31,900
If it's local I don't have
to do communication.

1277
01:02:31,900 --> 01:02:35,310
I can represent this as this
gigantic system of inequalities.

1278
01:02:35,310 --> 01:02:38,590
It has one, two, three, four,
five, and there's a j receive

1279
01:02:38,590 --> 01:02:42,130
also in here, because you've
got to remember I think the

1280
01:02:42,130 --> 01:02:45,480
program I wrote, the original
program basically, the write

1281
01:02:45,480 --> 01:02:48,230
happens in the outer loop and the
read happens in the inner loop.

1282
01:02:48,230 --> 01:02:51,690
So there's only j receive,
the i send in here.

1283
01:02:51,690 --> 01:02:54,220
I'll show that later.

1284
01:02:54,220 --> 01:02:56,570
So I have five dimensions.

1285
01:02:56,570 --> 01:03:00,990
So I can't really draw five
dimensions, but can I wait

1286
01:03:00,990 --> 01:03:02,240
until it comes back?

1287
01:03:24,640 --> 01:03:29,050
So what I have here is
this complete system of

1288
01:03:29,050 --> 01:03:32,620
inequalities for the receive
and send communication.

1289
01:03:32,620 --> 01:03:36,650
Of course, since I can't draw
five dimensions, and these

1290
01:03:36,650 --> 01:03:39,100
dimensions are the same, I just
drew them as the same one.

1291
01:03:39,100 --> 01:03:40,430
So you can actually assume
that there's another two

1292
01:03:40,430 --> 01:03:42,410
dimensions for this
one, and that's a

1293
01:03:42,410 --> 01:03:44,110
line in that dimension.

1294
01:03:47,320 --> 01:03:50,100
Actually, this is wrong.

1295
01:03:50,100 --> 01:03:50,200
Sorry.

1296
01:03:50,200 --> 01:03:53,200
This should be xi
written here.

1297
01:03:56,330 --> 01:03:57,580
My program is wrong, sorry.

1298
01:04:04,290 --> 01:04:05,690
Now what do I do?

1299
01:04:05,690 --> 01:04:10,460
One more time it has to go.

1300
01:04:10,460 --> 01:04:17,300
It makes me slow down my
lectures which is probably a

1301
01:04:17,300 --> 01:04:18,350
good thing.

1302
01:04:18,350 --> 01:04:19,490
There we go.

1303
01:04:19,490 --> 01:04:23,500
So what you can do is you can
just scan this by projecting in

1304
01:04:23,500 --> 01:04:26,720
different ways to calculate the
send loop nest and receive

1305
01:04:26,720 --> 01:04:30,060
loop nest. So if you scan in
that direction, what you end

1306
01:04:30,060 --> 01:04:34,280
up with is something saying for this
processor you need to

1307
01:04:34,280 --> 01:04:39,090
send, for this iteration,
this processor.

1308
01:04:39,090 --> 01:04:42,080
For what you need to send will
be received by these

1309
01:04:42,080 --> 01:04:46,650
processors and this iteration
and this, and this you can

1310
01:04:46,650 --> 01:04:51,760
send xi to this iteration
at this processor.

1311
01:04:51,760 --> 01:04:55,600
Because you had that
relationship, you can get the

1312
01:04:55,600 --> 01:05:00,720
loop nest that actually
will do the send.

1313
01:05:00,720 --> 01:05:03,590
The same way, you can actually
get a loop nest to do

1314
01:05:03,590 --> 01:05:06,640
the receive, and it shows up.

1315
01:05:06,640 --> 01:05:11,340
So what that means is all
these guys have to send, and all

1316
01:05:11,340 --> 01:05:13,280
these iterations have
to do the receive.

1317
01:05:30,900 --> 01:05:35,330
So, if you project in a different
ordering, what you

1318
01:05:35,330 --> 01:05:39,550
end up with is you can say what
this processor has to receive.

1319
01:05:39,550 --> 01:05:42,140
All these processors have to
receive something sent by

1320
01:05:42,140 --> 01:05:44,500
these guys.

1321
01:05:44,500 --> 01:05:48,550
So now you can get that entire
loop nest for receiving and

1322
01:05:48,550 --> 01:05:49,810
entire loop nest for sending,
and you have

1323
01:05:49,810 --> 01:05:51,570
computation loop nest also.

1324
01:05:51,570 --> 01:05:53,790
The problem is you can't run
them sequentially because

1325
01:05:53,790 --> 01:05:55,440
you have to run them in some
interleaved order.

1326
01:05:55,440 --> 01:06:00,350
So what you have is something
that the next slide will show.

1327
01:06:00,350 --> 01:06:02,580
So you have these iterations,
there's some computation

1328
01:06:02,580 --> 01:06:05,760
happening for all of them, and I
get a loop nest to do some

1329
01:06:05,760 --> 01:06:08,700
sends, and a loop nest to do some
receives. In one dimension,

1330
01:06:08,700 --> 01:06:11,440
you get these three
separate things.

1331
01:06:11,440 --> 01:06:13,540
But of course, what you
have to do is you

1332
01:06:13,540 --> 01:06:14,940
have to generate code.

1333
01:06:14,940 --> 01:06:16,610
So the way to do that is --.

1334
01:06:21,850 --> 01:06:24,240
So what you have to do is kind
of break this apart into

1335
01:06:24,240 --> 01:06:27,290
pieces where things happen, so
this one you do computation,

1336
01:06:27,290 --> 01:06:31,330
this one you do computation and
receive, and this one computation,

1337
01:06:31,330 --> 01:06:32,580
send, receive, and whatever.

1338
01:06:34,840 --> 01:06:37,720
Should be probably send here
and receive but --

1339
01:06:40,350 --> 01:06:43,360
For that one, if you combine
this you get a complicated

1340
01:06:43,360 --> 01:06:43,960
mess like this.

1341
01:06:43,960 --> 01:06:48,200
But this can all be done
in an automated fashion by

1342
01:06:48,200 --> 01:06:51,850
using this Fourier-Motzkin
Elimination and this linear

1343
01:06:51,850 --> 01:06:55,830
representation.

1344
01:06:55,830 --> 01:06:57,240
Of course, you can do
a lot of interesting

1345
01:06:57,240 --> 01:06:57,880
things on top of that.

1346
01:06:57,880 --> 01:06:59,600
You can eliminate redundant
communication: if you

1347
01:06:59,600 --> 01:07:01,950
keep sending the same thing
again that has already been sent,

1348
01:07:01,950 --> 01:07:04,780
eliminate that. You can
aggregate communication.

1349
01:07:04,780 --> 01:07:06,650
You don't want to send a word at a
time; you can pack a bunch of

1350
01:07:06,650 --> 01:07:08,940
things into one packet.

1351
01:07:08,940 --> 01:07:09,810
You can do multicast.

1352
01:07:09,810 --> 01:07:12,160
So same thing, send to
multiple people.

1353
01:07:12,160 --> 01:07:15,165
Cell doesn't have that much of it,
but assuming some machines

1354
01:07:15,165 --> 01:07:18,050
have multicast support, you can
do that, and also you can

1355
01:07:18,050 --> 01:07:20,670
do some local memory management
because if you have

1356
01:07:20,670 --> 01:07:23,270
distributed memory, you don't
have to allocate everybody's

1357
01:07:23,270 --> 01:07:24,520
memory and only use a part.

1358
01:07:24,520 --> 01:07:26,370
You can say OK, look, everybody
only has to

1359
01:07:26,370 --> 01:07:29,270
allocate that part.

1360
01:07:29,270 --> 01:07:30,680
OK.

1361
01:07:30,680 --> 01:07:36,180
In summary, I think automatic
parallelization of loops and

1362
01:07:36,180 --> 01:07:39,350
arrays -- we talked about data
dependence analysis, and we

1363
01:07:39,350 --> 01:07:42,210
talked about iteration and data
spaces, and how to do that,

1364
01:07:42,210 --> 01:07:46,890
and how to formulate it as an
integer programming problem.

1365
01:07:46,890 --> 01:07:49,380
We can look at a lot of
optimizations that can increase

1366
01:07:49,380 --> 01:07:51,740
parallelism and then do that.

1367
01:07:51,740 --> 01:07:55,760
Also, we can deal with things
like communication code

1368
01:07:55,760 --> 01:07:58,150
generation and generating
loop nests by doing this

1369
01:07:58,150 --> 01:07:59,570
Fourier-Motzkin Elimination.

1370
01:07:59,570 --> 01:08:03,260
So what I want to show out of
this talk is that, in fact,

1371
01:08:03,260 --> 01:08:06,180
this parallelization --
automatic parallelization of

1372
01:08:06,180 --> 01:08:10,060
normal loops can be done by
mapping into some nice

1373
01:08:10,060 --> 01:08:12,440
mathematical framework,
and basically

1374
01:08:12,440 --> 01:08:15,520
manipulating in that map.

1375
01:08:15,520 --> 01:08:18,960
So there are many other things
that really complicate the

1376
01:08:18,960 --> 01:08:23,040
life of parallelizing
programs. So in C, there are

1377
01:08:23,040 --> 01:08:24,860
pointers, you have to
deal with that.

1378
01:08:24,860 --> 01:08:29,020
So this problem is not this
simple, but what compiler

1379
01:08:29,020 --> 01:08:31,570
writers try to do most of the
time is try to find this

1380
01:08:31,570 --> 01:08:32,380
kind of thing.

1381
01:08:32,380 --> 01:08:34,960
Find interesting mathematical
models, do a mapping into

1382
01:08:34,960 --> 01:08:38,150
them, and then operate on that
model, and hopefully you can

1383
01:08:38,150 --> 01:08:41,940
get the analysis needed and even
the transformation needed

1384
01:08:41,940 --> 01:08:43,760
using that kind of
a nice model.

1385
01:08:43,760 --> 01:08:48,770
So I just kind of gave you
a good feel for general

1386
01:08:48,770 --> 01:08:49,690
parallelizing compilers.

1387
01:08:49,690 --> 01:08:51,690
We will take a ten-minute break

1388
01:08:51,690 --> 01:08:54,760
and talk about streaming.

1389
01:08:54,760 --> 01:08:55,780
We'll see if I can make
this computer run

1390
01:08:55,780 --> 01:08:57,030
faster in the meantime.